Storage WEKA Capability Assessment v1¶
Date: 2026-04-27
Purpose¶
Validate whether WEKA can support the GPUaaS storage and IAM model documented in:
- doc/architecture/Storage_Sharing_and_IAM_Model_v1.md
- doc/architecture/Storage_IAM_User_Flows_v1.md
- doc/architecture/Storage_Provider_Capability_Model_v1.md
This assessment is based on current WEKA documentation reviewed on 2026-04-27. It must be validated against our actual WEKA version and deployment mode before implementation.
Source Documents Reviewed¶
- WEKA S3 users and authentication: https://docs.weka.io/additional-protocols/s3/s3-users-and-authentication
- WEKA S3 users/authentication CLI: https://docs.weka.io/additional-protocols/s3/s3-users-and-authentication/s3-users-and-authentication
- WEKA S3 service accounts CLI: https://docs.weka.io/4.4/additional-protocols/s3/s3-users-and-authentication/s3-users-and-authentication-1
- WEKA S3 buckets management: https://docs.weka.io/5.0/additional-protocols/s3/s3-buckets-management
- WEKA S3 supported APIs and limitations: https://docs.weka.io/additional-protocols/s3/s3-limitations
- WEKA S3 audit APIs: https://docs.weka.io/additional-protocols/s3/audit-s3-apis
- WEKA REST API and equivalent CLI commands: https://docs.weka.io/getting-started-with-weka/weka-rest-api-and-equivalent-cli-commands
Current Direction¶
WEKA should be treated as a dual-protocol storage backend:
- WEKAFS/POSIX for high-performance training, notebooks, Kubernetes PV/PVC, and apps that need filesystem semantics.
- S3 for bucket/object workflows, direct external clients, SDK access, and apps that expect object semantics.
WEKAFS is conceptually similar to an NFS-style shared filesystem from the user model perspective, but it uses WEKA's own client protocol and DPDK-backed data path rather than NFS.
This changes the implementation priority:
WEKAFS/POSIX: primary training/app mount path
S3: bucket/object path; enable when bucket semantics are needed
S3 is not required for POSIX-only training workloads, but it is likely required for the broader GPUaaS product because users will expect buckets, SDK access, and object-style data movement. GPUaaS remains the IAM source of truth; WEKA-specific users, policies, mounts, or service credentials are derived enforcement mechanisms behind the storage adapter.
S3 Capability Summary¶
Supported by documentation:
- S3 local users and LDAP-backed S3 identities.
- IAM policies attached to S3 users.
- STS temporary credentials through AssumeRole.
- More restrictive session policy on STS requests.
- Permanent S3 service accounts as child identities of S3 users.
- Optional service-account policy that restricts parent privileges.
- Bucket management through S3 API, WEKA API, GUI, and CLI.
- Bucket policy APIs and custom JSON bucket policies.
- Bucket quota APIs.
- S3 audit webhook events.
- REST API coverage for S3 users, policies, service accounts, STS, bucket policies, quota, lifecycle, and audit webhook configuration.
Live Cluster Validation - 2026-04-28¶
Access path used for this validation:
ssh -N -L 14000:10.177.24.3:14000 hpcadmin@100.90.157.34
WEKA_SERVER=http://localhost:14000 scripts/ops/weka_capability_probe.sh
The validation script is read-only by default. It redacts token, password, key,
credential, certificate, and license-like fields and does not print raw provider
secrets. The restrictive service-account ownership POST probe is opt-in via
WEKA_PROBE_SERVICE_ACCOUNT_POST=1.
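The redaction behaviour described above can be sketched roughly as follows. This is a minimal illustration in Go, not the probe script itself: the marker list mirrors the field categories named above, while the substring-matching rule, the `Redact` function, and the `[REDACTED]` placeholder are assumptions made for the example.

```go
package main

import "strings"

// sensitiveMarkers mirrors the field categories listed in the text.
// The real probe's matching rules are not shown in this document, so
// case-insensitive substring matching is an assumption.
var sensitiveMarkers = []string{"token", "password", "key", "credential", "certificate", "license"}

// isSensitive reports whether a field name contains any sensitive marker.
func isSensitive(name string) bool {
	lower := strings.ToLower(name)
	for _, m := range sensitiveMarkers {
		if strings.Contains(lower, m) {
			return true
		}
	}
	return false
}

// Redact returns a copy of a decoded JSON value with sensitive fields masked.
// The input value is not modified.
func Redact(v any) any {
	switch t := v.(type) {
	case map[string]any:
		out := make(map[string]any, len(t))
		for k, val := range t {
			if isSensitive(k) {
				out[k] = "[REDACTED]"
			} else {
				out[k] = Redact(val)
			}
		}
		return out
	case []any:
		out := make([]any, len(t))
		for i, val := range t {
			out[i] = Redact(val)
		}
		return out
	default:
		return v
	}
}
```

Masking by field name rather than by value keeps the sketch simple, at the cost of over-redacting harmless fields whose names happen to contain a marker such as "key".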
Observed cluster state:
| Check | Observed result | Implication |
|---|---|---|
| API surface | REST API is /api/v2; Swagger UI reports API version 4.4; cluster release is 4.4.22. | Build the first adapter against WEKA REST API v2, not the older JSON-RPC endpoint. |
| Login | POST /api/v2/login returns short-lived bearer token with expires_in=300 plus refresh token. | The provider adapter can use a cluster-admin bearer session for administrative reads and cluster configuration APIs. |
| Authenticated role | Logged-in local user is ClusterAdmin in Root. | This is enough for cluster/file-system/S3 configuration reads, but not every S3 identity operation. |
| Cluster health | Cluster wekacb2, io_status=STARTED, status=OK, link layer ETH. | Base cluster is usable for control-plane validation. |
| Filesystem | Initial probe saw only default, READY, not encrypted, auth_required=false, audit disabled. Infra then created dedicated GPUaaS filesystems: gpuaas-fs for production-like WEKAFS/POSIX, gpuaas-kind-fs for disposable kind/dev validation, s3-data for future S3 bucket/object data, and s3-config for future S3 protocol configuration. | Keep default infra-owned and validate GPUaaS against dedicated filesystems. |
| S3 cluster | GET /api/v2/s3 succeeds, but active=false, s3_hosts=0, filesystem_name=N/A, config_fs_name=N/A, port 9000, TLS enabled. Attempting to save S3 with s3-data and s3-config failed because WEKA has no S3 protocol host/container. | S3 bucket, IAM policy, STS, quota, and S3-client validation is deferred until WEKA S3 protocol hosts/containers are added. This does not block WEKAFS/POSIX mount validation. |
| S3 buckets/policies/userPolicies | GET /s3/buckets, GET /s3/policies, and GET /s3/userPolicies return NoValidProtocolContainerToRunIn. | These APIs require an active S3 protocol container. This is a provider readiness check the GPUaaS adapter should surface clearly. |
| S3 service accounts | GET /s3/serviceAccounts and POST /s3/serviceAccounts as ClusterAdmin return S3RoleOnlyOperation. | Service-account lifecycle is not cluster-admin-bearer-owned in this deployment; the adapter likely needs an S3 user credential context held in Vault. |
| S3 audit webhook | GET /s3/auditWebhook succeeds; webhook is disabled. | Audit webhook can be managed through REST once we are ready to wire a receiver, but it is not currently producing evidence. |
Current validation status:
WEKA control-plane API: validated
WEKA filesystem read model inputs: validated; dedicated GPUaaS filesystems created
WEKA WEKAFS/POSIX workload mount path: not yet validated
WEKA S3 protocol readiness: deferred, no S3 protocol hosts/containers
WEKA bucket/IAM policy/quota behavior: deferred on S3 protocol container
WEKA STS scoped credentials: deferred on S3 protocol container and S3 user credentials
WEKA service-account ownership: partially validated, requires S3 user role
WEKA audit webhook API: validated as disabled/readable
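The two provider error codes observed above map naturally onto adapter readiness states. The following sketch assumes the adapter has already extracted the error code string from the REST response (how that is done is not covered here); the `Readiness` type and its values are hypothetical names.

```go
package main

// Readiness is a hypothetical capability state surfaced by the adapter.
type Readiness string

const (
	Ready              Readiness = "ready"
	S3ProtocolMissing  Readiness = "s3_protocol_not_enabled"
	NeedsS3UserContext Readiness = "requires_s3_user_credentials"
	UnknownError       Readiness = "unknown_provider_error"
)

// ClassifyS3Probe maps a provider error code (empty on success) to a
// readiness state. The two named codes are the ones observed during
// the live validation; anything else is reported as unknown.
func ClassifyS3Probe(errorCode string) Readiness {
	switch errorCode {
	case "":
		return Ready
	case "NoValidProtocolContainerToRunIn":
		return S3ProtocolMissing
	case "S3RoleOnlyOperation":
		return NeedsS3UserContext
	default:
		return UnknownError
	}
}
```

Keeping unknown codes distinct from "not enabled" avoids silently treating a new provider failure as a known, expected gap.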
Required infra follow-up before completing WEKAFS/POSIX validation:
- Use gpuaas-kind-fs as the first disposable kind/dev validation filesystem and gpuaas-fs for production-like WEKAFS/POSIX validation.
- Keep the local SSH tunnel to WEKA API active for control-plane validation from kind, while the actual mount path is tested from a WEKA-reachable node or pod.
- Confirm the WEKA client/CSI deployment path for GPUaaS worker nodes.
- Confirm whether mounts are created directly by node bootstrap, by Kubernetes CSI for app clusters, or by a host-prepared mount exposed into workloads.
- Confirm filesystem and directory layout for tenant/project/bucket/prefix mapping.
- Validate read/write/read-only/multi-writer semantics from at least two workloads.
- Validate quota reporting/enforcement at the filesystem, directory, or namespace level used by GPUaaS.
- Validate cleanup and revocation behavior when a workload is released or a grant is revoked.
- Keep per-user home directory semantics out of the first validation pass unless explicitly testing that feature. User homes need a separate decision on project-scoped versus shared-home layout, membership changes after allocation, OS-user reconciliation, ACL propagation, and quota attribution.
Read-only preflight command:
ssh -N -L 14000:10.177.24.3:14000 hpcadmin@100.90.157.34
WEKA_SERVER=http://localhost:14000 make ops-weka-wekafs-probe
The preflight verifies API login, cluster status, the expected GPUaaS filesystems, and S3 readiness state. It does not create filesystems, buckets, users, policies, mounts, or credentials.
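The filesystem part of that preflight amounts to a set difference between expected and observed filesystem names. A minimal sketch, assuming the observed names have already been read from the API (the function and variable names are illustrative, not taken from the probe):

```go
package main

import "sort"

// ExpectedFilesystems lists the dedicated GPUaaS filesystems named in
// this assessment.
var ExpectedFilesystems = []string{"gpuaas-fs", "gpuaas-kind-fs", "s3-data", "s3-config"}

// MissingFilesystems returns the expected names that are absent from
// observed, sorted for stable reporting. It never mutates its inputs.
func MissingFilesystems(expected, observed []string) []string {
	seen := make(map[string]bool, len(observed))
	for _, name := range observed {
		seen[name] = true
	}
	var missing []string
	for _, name := range expected {
		if !seen[name] {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}
```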
Required infra follow-up before completing S3/STS validation:
- Enable at least one WEKA S3 protocol host/container on the target cluster when ready to validate bucket/object workflows.
- Bind S3 to s3-data as the bucket/object data filesystem and s3-config as the protocol configuration filesystem.
- Provide or create a GPUaaS-owned S3 parent user with baseline policy and access/secret material stored in Vault.
- Re-run scripts/ops/weka_capability_probe.sh.
- Run mutating validation in a disposable namespace/bucket prefix: create bucket, set quota, create IAM policy, issue STS credentials, exercise S3 client access, create/delete service account, revoke access, and observe audit behavior.
Key constraints:
- S3 user IAM policy length is documented as 2048 bytes.
- Maximum S3 regular users, service accounts, and STS credentials are finite and must be capacity-planned.
- Service accounts are permanent and have no expiration.
- The documented way to immediately invalidate compromised STS credentials is deleting the parent S3 user.
- Service-account management is documented as S3-user-owned and CLI-only in the service-account page, but REST API equivalent endpoints exist. We must verify exact auth/ownership behavior in our WEKA version.
- Bucket policies are available, but some wildcard behavior differs between IAM policies and bucket policies.
Capability Matrix¶
| GPUaaS need | WEKA capability | Assessment |
|---|---|---|
| Project-owned bucket/namespace | WEKA can manage buckets via S3 API, GUI/API/CLI. Buckets are backed by a configured filesystem, with optional placement in another filesystem. | Feasible. Map GPUaaS bucket to WEKA bucket/filesystem/prefix based on deployment layout. |
| Human direct S3 client access | WEKA supports STS temporary credentials returning access key, secret key, and session token. | Feasible. GPUaaS should broker STS credentials after checking project/grant access. |
| Restrict direct credentials to bucket/prefix/action | STS AssumeRole accepts a policy-file that cannot expand beyond the parent S3 user's IAM policy. | Feasible, but validate prefix-policy syntax and policy size. |
| Workload/app machine access | WEKA supports S3 service accounts as child identities of an S3 user, with optional policy restriction. | Feasible. Use service accounts for workload/runtime credentials where longer-lived access is needed. |
| Avoid every GPUaaS user becoming a WEKA user | WEKA requires an S3 user for S3 access, but the S3 user can be a provider-side enforcement parent rather than one per human. | Feasible. Prefer GPUaaS-managed parent S3 users per project/tenant/provider scope, plus STS/session policies for humans. |
| Cross-project sharing | WEKA IAM/session policies and bucket policies can express allowed S3 resources/actions. | Feasible. GPUaaS grants compile into WEKA policies. Validate scale and wildcard behavior. |
| Revoke future access | GPUaaS can stop issuing future credentials; WEKA policies/service accounts can be removed. | Feasible. |
| Revoke active STS credentials immediately | WEKA docs say compromised STS can be invalidated by deleting the parent S3 user. | Risk. Prefer short TTLs; verify whether there is a narrower session revocation API in our WEKA version. |
| Auditing S3 access | WEKA supports S3 audit webhook events with operation, bucket, object, status, client IP, user agent, authorization credential, and cluster metadata. | Feasible. Need webhook receiver and retention pipeline. |
| Bucket quota | WEKA REST/CLI supports setting and unsetting S3 bucket quota. | Feasible. |
| Lifecycle | WEKA supports lifecycle APIs, with documented limitation around expiration action. | Feasible for basic retention/expiration; validate versioning/snapshot expectations separately. |
| Provider automation through API | WEKA REST API includes S3 policy, service account, STS, bucket policy, quota, lifecycle, and audit webhook endpoints. | Feasible. Build adapter against REST API, not shell CLI, except for bootstrap diagnostics. |
Recommended WEKA Mapping¶
WEKAFS/POSIX Mounts¶
Use GPUaaS-managed project/workload mount intent as the primary integration boundary.
Recommended hierarchy:
tenant/project/provider scope
-> WEKA filesystem or filesystem directory namespace
-> gpuaas-kind-fs for kind/dev validation
-> gpuaas-fs for production-like validation and production
-> project bucket/namespace directory
-> prefixes for shared, users, workloads, datasets, checkpoints, artifacts
-> workload mount plan
-> optional provider quota/snapshot/audit metadata
The GPUaaS storage adapter should own:
- mapping GPUaaS bucket IDs to WEKA filesystem/path references
- generating mount plans for node-agent, app controllers, or Kubernetes CSI
- enforcing read-only versus read-write mount mode from GPUaaS grants
- reconciling mount state and provider drift into v3 read models
- coordinating cleanup on workload release/decommission
WEKA DPDK/client configuration, mount helper setup, and node-level connectivity belong to infra/bootstrap and should be surfaced to GPUaaS as provider health and capability state, not user-facing configuration.
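As an illustration of the adapter responsibilities listed above, the sketch below derives a mount plan from a grant. It is deliberately simplified: the `projects/<id>/<prefix>` layout, the type names, and the "read"/"write" grant modes are all assumptions, since the real directory layout is still an open validation item.

```go
package main

import (
	"errors"
	"path"
	"strings"
)

// MountRequest is a hypothetical input derived from a GPUaaS grant.
type MountRequest struct {
	Filesystem string // for example gpuaas-kind-fs
	ProjectID  string
	Prefix     string // for example datasets or checkpoints
	GrantMode  string // "read" or "write"
}

// MountPlan is a hypothetical provider-neutral mount description.
type MountPlan struct {
	Filesystem string
	SubPath    string
	ReadOnly   bool
}

// BuildMountPlan maps a grant to a mount plan and fails closed on
// unknown grant modes or prefixes that escape the project directory.
func BuildMountPlan(req MountRequest) (MountPlan, error) {
	if req.Filesystem == "" || req.ProjectID == "" {
		return MountPlan{}, errors.New("filesystem and project are required")
	}
	if strings.Contains(req.ProjectID, "/") || req.ProjectID == "." || req.ProjectID == ".." {
		return MountPlan{}, errors.New("invalid project id")
	}
	base := path.Join("projects", req.ProjectID)
	sub := path.Join(base, req.Prefix) // Join also cleans ".." segments
	if sub != base && !strings.HasPrefix(sub, base+"/") {
		return MountPlan{}, errors.New("prefix escapes the project namespace")
	}
	plan := MountPlan{Filesystem: req.Filesystem, SubPath: sub}
	switch req.GrantMode {
	case "read":
		plan.ReadOnly = true
	case "write":
		plan.ReadOnly = false
	default:
		return MountPlan{}, errors.New("unknown grant mode")
	}
	return plan, nil
}
```

Rejecting unknown modes instead of defaulting to read-write keeps enforcement aligned with the grant model if new grant types are added later.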
Optional Direct Access: Parent S3 Users¶
Use GPUaaS-managed WEKA S3 parent users as policy envelopes.
For the first cluster validation, bind WEKA S3 as:
S3 data filesystem: s3-data
S3 config filesystem: s3-config
Port: 9000
Anonymous POSIX UID/GID: 65534
All servers: enabled
Virtual-hosted-style domains: unset until DNS/TLS design is explicit
Do not use default for GPUaaS S3 or WEKAFS validation. default stays
infra-owned; gpuaas-kind-fs, gpuaas-fs, s3-data, and s3-config are the
GPUaaS validation planes.
Current implementation posture: S3 is capability-gated and deferred until WEKA
has S3 protocol hosts. Build the first storage integration against
gpuaas-kind-fs/WEKAFS/POSIX mounts in kind/dev, then promote the same contract
to gpuaas-fs for production-like validation. Keep s3-data and s3-config
reserved for a later S3 enablement pass.
This is an early provider placement default, not a permanent global backend decision. If a region contains WEKA plus VAST, DDN, NVMe pools, or multiple WEKA clusters, GPUaaS must select the provider per storage object using storage class, protocol, quota, capacity, fabric, and provider health. The environment default is only a kind/platform-control bootstrap shortcut.
Recommended hierarchy:
tenant/project/provider scope
-> WEKA S3 parent user
-> baseline IAM policy: maximum allowed scope for that project or tenant
-> service accounts for workloads/automation
-> STS credentials for direct user/S3-client sessions
Do not create one long-lived WEKA S3 user for every GPUaaS human by default.
Human Direct Credentials¶
Flow:
- User authenticates to GPUaaS.
- User requests direct storage credentials for project, bucket, prefix, actions, and TTL.
- GPUaaS checks project role and storage_grants.
- GPUaaS compiles a restrictive STS session policy.
- GPUaaS calls WEKA S3 STS with the parent S3 user's access key and policy.
- GPUaaS returns endpoint, access key, secret key, session token, expiration, and safe scope summary.
- GPUaaS records a storage.credential.issue audit log.
Default TTL should be short, for example 1 hour or less for UI-generated direct credentials. Longer TTLs should require explicit project/admin policy.
Workload/App Credentials¶
For training, notebooks, Kubernetes PV/PVC, and POSIX-heavy apps, prefer WEKAFS/POSIX mounts bound to the workload's storage grants. Provider-derived S3 service accounts or STS credentials should be used when an app explicitly needs S3/object access or when direct external client access is enabled.
Use STS when:
- access is short-lived
- a workload lifetime is bounded
- revocation by TTL is acceptable
Use S3 service accounts when:
- access must outlive the longest TTL we are willing to issue for STS credentials
- automation needs stable credentials
- the credential is still tightly scoped by policy
- GPUaaS owns rotation and cleanup
Cross-Project Sharing¶
Keep cross-project sharing in GPUaaS storage_grants.
Compile grants into one of:
- STS session policy for direct user access
- service-account policy for workload/automation access
- bucket policy only where provider behavior requires or benefits from it
Use bucket policies carefully because WEKA documents different wildcard support between IAM and bucket policies.
Risks And Validation Items¶
Policy Size¶
WEKA documents S3 IAM user policies as limited to 2048 bytes. This is tight for large grant lists.
Mitigations:
- group grants by prefix and action
- prefer project-level parent policies plus narrower session policies
- avoid one policy statement per user where possible
- track generated policy byte size before submitting to WEKA
- add tests for policy compaction
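Two of these mitigations, grouping grants by prefix and tracking byte size before submission, can be sketched as below. The `Grant` type and function names are hypothetical; the 2048-byte constant is the documented limit quoted above and should be rechecked against the deployed WEKA version.

```go
package main

import (
	"fmt"
	"sort"
)

// MaxPolicyBytes is the documented S3 IAM user policy limit.
const MaxPolicyBytes = 2048

// Grant is a hypothetical flattened grant: one action on one prefix.
type Grant struct {
	Prefix string
	Action string
}

// GroupByPrefix merges grants into prefix -> sorted unique actions so
// that one policy statement can cover each prefix.
func GroupByPrefix(grants []Grant) map[string][]string {
	sets := map[string]map[string]bool{}
	for _, g := range grants {
		if sets[g.Prefix] == nil {
			sets[g.Prefix] = map[string]bool{}
		}
		sets[g.Prefix][g.Action] = true
	}
	out := make(map[string][]string, len(sets))
	for prefix, actions := range sets {
		for a := range actions {
			out[prefix] = append(out[prefix], a)
		}
		sort.Strings(out[prefix])
	}
	return out
}

// CheckPolicySize rejects a serialized policy that exceeds the limit.
func CheckPolicySize(policyJSON []byte) error {
	if n := len(policyJSON); n > MaxPolicyBytes {
		return fmt.Errorf("policy is %d bytes, exceeds the documented %d-byte limit", n, MaxPolicyBytes)
	}
	return nil
}
```

Checking size in the adapter gives a clear GPUaaS-side error instead of relying on whatever the provider returns for an oversized policy.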
Scale Limits¶
WEKA documents finite limits for S3 regular users, service accounts, and STS credentials. Exact values must be checked against our deployed version and license/configuration.
Implementation requirement:
- expose provider capacity counters in ops read models
- fail launch/credential issuance with a clear capacity reason
- alert before provider credential/session capacity is exhausted
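A minimal sketch of these requirements follows, assuming the adapter can obtain used and limit counts per resource type. All names are hypothetical, and the actual limits must come from the deployed WEKA version.

```go
package main

import "fmt"

// Capacity is a hypothetical counter for one provider resource type,
// such as S3 users, service accounts, or STS credentials.
type Capacity struct {
	Used  int
	Limit int
}

// Utilization returns the used fraction; an unknown or zero limit is
// treated as fully used so callers fail closed.
func (c Capacity) Utilization() float64 {
	if c.Limit <= 0 {
		return 1
	}
	return float64(c.Used) / float64(c.Limit)
}

// CapacityError carries a clear reason for a refused issuance.
type CapacityError struct {
	Resource string
	Used     int
	Limit    int
}

func (e *CapacityError) Error() string {
	return fmt.Sprintf("provider capacity exhausted for %s: %d of %d in use", e.Resource, e.Used, e.Limit)
}

// CheckCapacity returns a CapacityError when no headroom remains.
func CheckCapacity(resource string, c Capacity) error {
	if c.Used >= c.Limit {
		return &CapacityError{Resource: resource, Used: c.Used, Limit: c.Limit}
	}
	return nil
}

// ShouldAlert reports whether utilization has reached the threshold.
func ShouldAlert(c Capacity, threshold float64) bool {
	return c.Utilization() >= threshold
}
```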
Active Credential Revocation¶
The docs explicitly mention deleting the parent S3 user to invalidate active STS credentials. That is too broad if multiple sessions depend on the same parent.
Required validation:
- verify whether our WEKA version supports revoking an individual STS session
- verify whether detaching/updating parent policy affects active sessions
- verify how fast service-account deletion takes effect
- choose TTL defaults based on the result
Until validated, design assumption:
GPUaaS can always stop future issuance.
Active STS credentials expire naturally unless broad parent-user deletion is acceptable.
Service Account Ownership¶
WEKA documentation says only an S3 user can manage its service accounts, while the REST/CLI equivalence page lists service-account REST endpoints.
Required validation:
- identify which credential can call POST /s3/serviceAccounts
- confirm whether cluster admin can automate service-account creation
- confirm whether the API requires the parent S3 user's credential context
- decide whether the GPUaaS adapter stores parent S3 user credentials in Vault
Audit Reliability¶
WEKA S3 audit events are sent through webhook, but docs warn that events can be discarded if the webhook target is unavailable or internal buffers fill.
Implications:
- WEKA audit is evidence, not the only source of truth
- GPUaaS must audit all control-plane grant/credential/mount mutations itself
- WEKA audit webhook receiver needs monitoring, buffering, and alerting
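On the receiver side, the buffering and alerting requirement can be approximated with a bounded, non-blocking queue that counts drops. This is a sketch of one possible receiver-internal structure, not WEKA behaviour; the `AuditBuffer` type is hypothetical.

```go
package main

import "sync/atomic"

// AuditBuffer is a hypothetical bounded queue placed between the
// webhook HTTP handler and the retention pipeline.
type AuditBuffer struct {
	events  chan []byte
	dropped atomic.Int64
}

// NewAuditBuffer creates a buffer holding at most capacity events.
func NewAuditBuffer(capacity int) *AuditBuffer {
	return &AuditBuffer{events: make(chan []byte, capacity)}
}

// Offer enqueues an event without blocking. When the buffer is full it
// returns false and increments the drop counter, which monitoring can
// alert on.
func (b *AuditBuffer) Offer(event []byte) bool {
	select {
	case b.events <- event:
		return true
	default:
		b.dropped.Add(1)
		return false
	}
}

// Dropped returns the number of events rejected so far.
func (b *AuditBuffer) Dropped() int64 {
	return b.dropped.Load()
}

// Events exposes the queue to the consumer side of the pipeline.
func (b *AuditBuffer) Events() <-chan []byte {
	return b.events
}
```

A non-blocking enqueue keeps the webhook handler fast, so a slow downstream pipeline shows up as a visible drop count rather than as timeouts on the WEKA side.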
Contract Implications¶
Add or confirm OpenAPI contracts for:
- list owned/shared buckets
- create bucket
- create/revoke storage grant
- list mount plans and provider mount health
- reconcile provider path/quota/drift state
- issue direct S3 credentials, capability-gated until S3 is enabled
- list credential sessions, capability-gated until S3 is enabled
- revoke credential session where provider supports it
- provider capability/status read model
- provider drift read model
Add internal adapter interfaces for:
type StorageProvider interface {
	CreateNamespace(ctx, input) (ProviderNamespaceRef, error)
	DeleteNamespace(ctx, ref) error
	BuildMountPlan(ctx, input) (MountPlan, error)
	ReconcileMount(ctx, input) (MountStatus, error)
	SetNamespaceQuota(ctx, input) error
	GetNamespaceUsage(ctx, ref) (Usage, error)
	PutBucketPolicy(ctx, input) error // S3 capability-gated
	CreateServiceAccount(ctx, input) (ProviderCredentialRef, error) // S3 capability-gated
	DeleteServiceAccount(ctx, ref) error // S3 capability-gated
	IssueTemporaryCredentials(ctx, input) (TemporaryCredentials, error) // S3 capability-gated
	ConfigureAuditWebhook(ctx, input) error // S3 capability-gated
}
Do not expose WEKA-specific request/response types outside the storage provider adapter.
Implementation Recommendation¶
Proceed with the GPUaaS storage IAM model, but implement WEKA as a provider adapter behind the storage domain.
Near-term order:
- Build local-dev fake provider and WEKAFS mount-plan compiler tests.
- Add provider binding, namespace/path reference, quota, and mount health schema.
- Add OpenAPI contracts for storage namespaces, grants, mount plans, provider capabilities, and provider drift.
- Implement WEKA adapter for filesystem/path namespace mapping, usage/quota reads, and mount-plan generation against environment-scoped filesystems (gpuaas-kind-fs in kind/dev, gpuaas-fs in production-like validation).
- Run integration validation against actual WEKA:
  - create project namespace under gpuaas-kind-fs
  - mount read-write from one workload
  - mount read-only from a second workload
  - validate multi-writer semantics where product allows it
  - validate quota/usage reporting
  - validate cleanup/revocation behavior
  - reconcile provider drift into the v3 read model
- Defer S3 adapter operations until S3 protocol hosts/containers are enabled: parent S3 user policy, STS with restrictive policy, direct AWS CLI access, service accounts, and S3 audit webhook delivery.
Decision¶
WEKA can likely support the expected v3 storage model as a dual-protocol backend. WEKAFS/POSIX should be the default path for training and app mounts; S3/STS should be enabled and validated for bucket/object workflows, with validation still needed around active STS revocation, service-account API ownership, policy size, and scale limits.
The GPUaaS product model should remain provider-neutral and project-first. WEKA-specific details belong in the storage provider adapter and ops read models, not in the user-facing IA.