
Storage WEKA Capability Assessment v1

Date: 2026-04-27

Purpose

Validate whether WEKA can support the GPUaaS storage and IAM model documented in:

  • doc/architecture/Storage_Sharing_and_IAM_Model_v1.md
  • doc/architecture/Storage_IAM_User_Flows_v1.md
  • doc/architecture/Storage_Provider_Capability_Model_v1.md

This assessment is based on current WEKA documentation reviewed on 2026-04-27. It must be validated against our actual WEKA version and deployment mode before implementation.

Source Documents Reviewed

  • WEKA S3 users and authentication: https://docs.weka.io/additional-protocols/s3/s3-users-and-authentication
  • WEKA S3 users/authentication CLI: https://docs.weka.io/additional-protocols/s3/s3-users-and-authentication/s3-users-and-authentication
  • WEKA S3 service accounts CLI: https://docs.weka.io/4.4/additional-protocols/s3/s3-users-and-authentication/s3-users-and-authentication-1
  • WEKA S3 buckets management: https://docs.weka.io/5.0/additional-protocols/s3/s3-buckets-management
  • WEKA S3 supported APIs and limitations: https://docs.weka.io/additional-protocols/s3/s3-limitations
  • WEKA S3 audit APIs: https://docs.weka.io/additional-protocols/s3/audit-s3-apis
  • WEKA REST API and equivalent CLI commands: https://docs.weka.io/getting-started-with-weka/weka-rest-api-and-equivalent-cli-commands

Current Direction

WEKA should be treated as a dual-protocol storage backend:

  • WEKAFS/POSIX for high-performance training, notebooks, Kubernetes PV/PVC, and apps that need filesystem semantics.
  • S3 for bucket/object workflows, direct external clients, SDK access, and apps that expect object semantics.

WEKAFS is conceptually similar to an NFS-style shared filesystem from the user model perspective, but it uses WEKA's own client protocol and DPDK-backed data path rather than NFS.

This changes the implementation priority:

  • WEKAFS/POSIX: primary training/app mount path.
  • S3: bucket/object path; enable when bucket semantics are needed.

S3 is not required for POSIX-only training workloads, but it is likely required for the broader GPUaaS product because users will expect buckets, SDK access, and object-style data movement. GPUaaS remains the IAM source of truth; WEKA-specific users, policies, mounts, or service credentials are derived enforcement mechanisms behind the storage adapter.

S3 Capability Summary

Supported by documentation:

  • S3 local users and LDAP-backed S3 identities.
  • IAM policies attached to S3 users.
  • STS temporary credentials through AssumeRole.
  • More restrictive session policy on STS requests.
  • Permanent S3 service accounts as child identities of S3 users.
  • Optional service-account policy that restricts parent privileges.
  • Bucket management through S3 API, WEKA API, GUI, and CLI.
  • Bucket policy APIs and custom JSON bucket policies.
  • Bucket quota APIs.
  • S3 audit webhook events.
  • REST API coverage for S3 users, policies, service accounts, STS, bucket policies, quota, lifecycle, and audit webhook configuration.

Live Cluster Validation - 2026-04-28

Access path used for this validation:

ssh -N -L 14000:10.177.24.3:14000 hpcadmin@100.90.157.34
WEKA_SERVER=http://localhost:14000 scripts/ops/weka_capability_probe.sh

The validation script is read-only by default. It redacts token, password, key, credential, certificate, and license-like fields and does not print raw provider secrets. The restrictive service-account ownership POST probe is opt-in via WEKA_PROBE_SERVICE_ACCOUNT_POST=1.

Observed cluster state:

  • API surface
    Observed: REST API is /api/v2; Swagger UI reports API version 4.4; cluster release is 4.4.22.
    Implication: Build the first adapter against WEKA REST API v2, not the older JSON-RPC endpoint.

  • Login
    Observed: POST /api/v2/login returns a short-lived bearer token with expires_in=300 plus a refresh token.
    Implication: The provider adapter can use a cluster-admin bearer session for administrative reads and cluster configuration APIs.

  • Authenticated role
    Observed: The logged-in local user is ClusterAdmin in Root.
    Implication: Sufficient for cluster/filesystem/S3 configuration reads, but not for every S3 identity operation.

  • Cluster health
    Observed: Cluster wekacb2, io_status=STARTED, status=OK, link layer ETH.
    Implication: The base cluster is usable for control-plane validation.

  • Filesystem
    Observed: The initial probe saw only default (READY, not encrypted, auth_required=false, audit disabled). Infra then created dedicated GPUaaS filesystems: gpuaas-fs for production-like WEKAFS/POSIX, gpuaas-kind-fs for disposable kind/dev validation, s3-data for future S3 bucket/object data, and s3-config for future S3 protocol configuration.
    Implication: Keep default infra-owned and validate GPUaaS against the dedicated filesystems.

  • S3 cluster
    Observed: GET /api/v2/s3 succeeds, but active=false, s3_hosts=0, filesystem_name=N/A, config_fs_name=N/A, port 9000, TLS enabled. Attempting to save the S3 configuration with s3-data and s3-config failed because the cluster has no S3 protocol host/container.
    Implication: S3 bucket, IAM policy, STS, quota, and S3-client validation is deferred until S3 protocol hosts/containers are added. This does not block WEKAFS/POSIX mount validation.

  • S3 buckets/policies/userPolicies
    Observed: GET /s3/buckets, GET /s3/policies, and GET /s3/userPolicies return NoValidProtocolContainerToRunIn.
    Implication: These APIs require an active S3 protocol container; the GPUaaS adapter should surface this readiness state clearly.

  • S3 service accounts
    Observed: GET /s3/serviceAccounts and POST /s3/serviceAccounts as ClusterAdmin return S3RoleOnlyOperation.
    Implication: Service-account lifecycle is not owned by the cluster-admin bearer in this deployment; the adapter likely needs an S3 user credential context held in Vault.

  • S3 audit webhook
    Observed: GET /s3/auditWebhook succeeds; the webhook is disabled.
    Implication: The audit webhook can be managed through REST once a receiver is ready, but it is not currently producing evidence.
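The S3 readiness checks above reduce to a capability flag the adapter can expose. A minimal Go sketch, assuming the response field names (active, s3_hosts, filesystem_name) match the probe output; verify against our 4.4.22 response shape:

```go
// Sketch: evaluate WEKA S3 protocol readiness from a GET /api/v2/s3 response.
// Field names mirror the observed probe output and are assumptions to confirm.
package main

import (
	"encoding/json"
	"fmt"
)

type s3Status struct {
	Active         bool   `json:"active"`
	S3Hosts        int    `json:"s3_hosts"`
	FilesystemName string `json:"filesystem_name"`
}

// s3Ready turns the raw status into a capability flag plus a reason the
// adapter can surface in the provider readiness read model.
func s3Ready(s s3Status) (bool, string) {
	switch {
	case !s.Active:
		return false, "s3 protocol cluster inactive"
	case s.S3Hosts == 0:
		return false, "no s3 protocol hosts/containers"
	case s.FilesystemName == "" || s.FilesystemName == "N/A":
		return false, "no s3 data filesystem bound"
	}
	return true, ""
}

func main() {
	// Observed cluster state from the 2026-04-28 probe.
	raw := []byte(`{"active": false, "s3_hosts": 0, "filesystem_name": "N/A"}`)
	var st s3Status
	if err := json.Unmarshal(raw, &st); err != nil {
		panic(err)
	}
	ready, reason := s3Ready(st)
	fmt.Printf("s3 ready=%v reason=%q\n", ready, reason)
}
```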

Current validation status:

WEKA control-plane API: validated
WEKA filesystem read model inputs: validated; dedicated GPUaaS filesystems created
WEKA WEKAFS/POSIX workload mount path: not yet validated
WEKA S3 protocol readiness: deferred, no S3 protocol hosts/containers
WEKA bucket/IAM policy/quota behavior: deferred on S3 protocol container
WEKA STS scoped credentials: deferred on S3 protocol container and S3 user credentials
WEKA service-account ownership: partially validated, requires S3 user role
WEKA audit webhook API: validated as disabled/readable

Required infra follow-up before completing WEKAFS/POSIX validation:

  1. Use gpuaas-kind-fs as the first disposable kind/dev validation filesystem and gpuaas-fs for production-like WEKAFS/POSIX validation.
  2. Keep the local SSH tunnel to WEKA API active for control-plane validation from kind, while the actual mount path is tested from a WEKA-reachable node or pod.
  3. Confirm the WEKA client/CSI deployment path for GPUaaS worker nodes.
  4. Confirm whether mounts are created directly by node bootstrap, by Kubernetes CSI for app clusters, or by a host-prepared mount exposed into workloads.
  5. Confirm filesystem and directory layout for tenant/project/bucket/prefix mapping.
  6. Validate read/write/read-only/multi-writer semantics from at least two workloads.
  7. Validate quota reporting/enforcement at the filesystem, directory, or namespace level used by GPUaaS.
  8. Validate cleanup and revocation behavior when a workload is released or a grant is revoked.
  9. Keep per-user home directory semantics out of the first validation pass unless explicitly testing that feature. User homes need a separate decision on project-scoped versus shared-home layout, membership changes after allocation, OS-user reconciliation, ACL propagation, and quota attribution.

Read-only preflight command:

ssh -N -L 14000:10.177.24.3:14000 hpcadmin@100.90.157.34
WEKA_SERVER=http://localhost:14000 make ops-weka-wekafs-probe

The preflight verifies API login, cluster status, the expected GPUaaS filesystems, and S3 readiness state. It does not create filesystems, buckets, users, policies, mounts, or credentials.

Required infra follow-up before completing S3/STS validation:

  1. Enable at least one WEKA S3 protocol host/container on the target cluster when ready to validate bucket/object workflows.
  2. Bind S3 to s3-data as the bucket/object data filesystem and s3-config as the protocol configuration filesystem.
  3. Provide or create a GPUaaS-owned S3 parent user with baseline policy and access/secret material stored in Vault.
  4. Re-run scripts/ops/weka_capability_probe.sh.
  5. Run mutating validation in a disposable namespace/bucket prefix: create bucket, set quota, create IAM policy, issue STS credentials, exercise S3 client access, create/delete service account, revoke access, and observe audit behavior.

Key constraints:

  • S3 user IAM policy length is documented as 2048 bytes.
  • The maximum counts of S3 regular users, service accounts, and STS credentials are finite and must be capacity-planned.
  • Service accounts are permanent and have no expiration.
  • The documented way to immediately invalidate compromised STS credentials is deleting the parent S3 user.
  • Service-account management is documented as S3-user-owned and CLI-only in the service-account page, but REST API equivalent endpoints exist. We must verify exact auth/ownership behavior in our WEKA version.
  • Bucket policies are available, but some wildcard behavior differs between IAM policies and bucket policies.

Capability Matrix

  • Project-owned bucket/namespace
    WEKA: Buckets can be managed via the S3 API, GUI, API, and CLI; each bucket is backed by a configured filesystem, with optional placement in another filesystem.
    Assessment: Feasible. Map GPUaaS buckets to WEKA bucket/filesystem/prefix based on deployment layout.

  • Human direct S3 client access
    WEKA: STS temporary credentials are supported, returning an access key, secret key, and session token.
    Assessment: Feasible. GPUaaS should broker STS credentials after checking project/grant access.

  • Restrict direct credentials to bucket/prefix/action
    WEKA: STS AssumeRole accepts a policy file that cannot expand beyond the parent S3 user's IAM policy.
    Assessment: Feasible, but validate prefix-policy syntax and policy size.

  • Workload/app machine access
    WEKA: S3 service accounts exist as child identities of an S3 user, with optional policy restriction.
    Assessment: Feasible. Use service accounts for workload/runtime credentials where longer-lived access is needed.

  • Avoid every GPUaaS user becoming a WEKA user
    WEKA: An S3 user is required for S3 access, but it can be a provider-side enforcement parent rather than one per human.
    Assessment: Feasible. Prefer GPUaaS-managed parent S3 users per project/tenant/provider scope, plus STS/session policies for humans.

  • Cross-project sharing
    WEKA: IAM/session policies and bucket policies can express allowed S3 resources/actions.
    Assessment: Feasible. GPUaaS grants compile into WEKA policies. Validate scale and wildcard behavior.

  • Revoke future access
    WEKA: GPUaaS can stop issuing future credentials; WEKA policies/service accounts can be removed.
    Assessment: Feasible.

  • Revoke active STS credentials immediately
    WEKA: Docs say compromised STS credentials can be invalidated by deleting the parent S3 user.
    Assessment: Risk. Prefer short TTLs; verify whether a narrower session revocation API exists in our WEKA version.

  • Auditing S3 access
    WEKA: S3 audit webhook events carry operation, bucket, object, status, client IP, user agent, authorization credential, and cluster metadata.
    Assessment: Feasible. Needs a webhook receiver and retention pipeline.

  • Bucket quota
    WEKA: REST/CLI supports setting and unsetting S3 bucket quotas.
    Assessment: Feasible.

  • Lifecycle
    WEKA: Lifecycle APIs exist, with a documented limitation around the expiration action.
    Assessment: Feasible for basic retention/expiration; validate versioning/snapshot expectations separately.

  • Provider automation through API
    WEKA: The REST API includes S3 policy, service account, STS, bucket policy, quota, lifecycle, and audit webhook endpoints.
    Assessment: Feasible. Build the adapter against the REST API, not the shell CLI, except for bootstrap diagnostics.

WEKAFS/POSIX Mounts

Use GPUaaS-managed project/workload mount intent as the primary integration boundary.

Recommended hierarchy:

tenant/project/provider scope
  -> WEKA filesystem or filesystem directory namespace
     -> gpuaas-kind-fs for kind/dev validation
     -> gpuaas-fs for production-like validation and production
     -> project bucket/namespace directory
        -> prefixes for shared, users, workloads, datasets, checkpoints, artifacts
     -> workload mount plan
     -> optional provider quota/snapshot/audit metadata
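A minimal sketch of the first mapping step in the hierarchy above; the BucketRef type and path convention are illustrative, not a fixed contract:

```go
// Sketch: compile a GPUaaS bucket reference into a WEKAFS directory path
// under an environment-scoped filesystem (gpuaas-kind-fs / gpuaas-fs).
package main

import (
	"fmt"
	"path"
)

// BucketRef is a hypothetical GPUaaS-side identifier following the
// tenant/project/bucket hierarchy recommended above.
type BucketRef struct {
	Tenant, Project, Bucket string
}

// wekaPath maps the reference onto a directory namespace inside the
// chosen filesystem; the adapter owns this mapping, not the user.
func wekaPath(fs string, b BucketRef) string {
	return path.Join("/", fs, b.Tenant, b.Project, b.Bucket)
}

func main() {
	b := BucketRef{Tenant: "acme", Project: "llm-train", Bucket: "checkpoints"}
	fmt.Println(wekaPath("gpuaas-kind-fs", b)) // /gpuaas-kind-fs/acme/llm-train/checkpoints
}
```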

The GPUaaS storage adapter should own:

  • mapping GPUaaS bucket IDs to WEKA filesystem/path references
  • generating mount plans for node-agent, app controllers, or Kubernetes CSI
  • enforcing read-only versus read-write mount mode from GPUaaS grants
  • reconciling mount state and provider drift into v3 read models
  • coordinating cleanup on workload release/decommission
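The read-only versus read-write enforcement the adapter owns can be sketched as a fail-closed mount-plan compiler; the MountPlan fields are assumptions, not the final schema:

```go
// Sketch: derive a provider-neutral mount plan from a GPUaaS grant mode.
package main

import "fmt"

// MountPlan is a hypothetical shape of what the adapter hands to a
// node-agent, app controller, or Kubernetes CSI layer.
type MountPlan struct {
	Filesystem string // e.g. gpuaas-kind-fs in kind/dev
	Path       string // project/bucket directory inside the filesystem
	ReadOnly   bool   // derived from the GPUaaS grant, not from WEKA state
}

// planFromGrant maps a grant mode onto a mount plan; unknown modes fail
// closed rather than defaulting to read-write.
func planFromGrant(fs, dir, mode string) (MountPlan, error) {
	switch mode {
	case "read-only":
		return MountPlan{Filesystem: fs, Path: dir, ReadOnly: true}, nil
	case "read-write":
		return MountPlan{Filesystem: fs, Path: dir, ReadOnly: false}, nil
	}
	return MountPlan{}, fmt.Errorf("unknown grant mode %q", mode)
}

func main() {
	p, err := planFromGrant("gpuaas-kind-fs", "acme/llm-train/checkpoints", "read-only")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", p)
}
```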

WEKA DPDK/client configuration, mount helper setup, and node-level connectivity belong to infra/bootstrap and should be surfaced to GPUaaS as provider health and capability state, not user-facing configuration.

Optional Direct Access: Parent S3 Users

Use GPUaaS-managed WEKA S3 parent users as policy envelopes.

For the first cluster validation, bind WEKA S3 as:

S3 data filesystem: s3-data
S3 config filesystem: s3-config
Port: 9000
Anonymous POSIX UID/GID: 65534
All servers: enabled
Virtual-hosted-style domains: unset until DNS/TLS design is explicit

Do not use default for GPUaaS S3 or WEKAFS validation. default stays infra-owned; gpuaas-kind-fs, gpuaas-fs, s3-data, and s3-config are the GPUaaS validation planes.

Current implementation posture: S3 is capability-gated and deferred until WEKA has S3 protocol hosts. Build the first storage integration against gpuaas-kind-fs/WEKAFS/POSIX mounts in kind/dev, then promote the same contract to gpuaas-fs for production-like validation. Keep s3-data and s3-config reserved for a later S3 enablement pass.

This is an early provider placement default, not a permanent global backend decision. If a region contains WEKA plus VAST, DDN, NVMe pools, or multiple WEKA clusters, GPUaaS must select the provider per storage object using storage class, protocol, quota, capacity, fabric, and provider health. The environment default is only a kind/platform-control bootstrap shortcut.

Recommended hierarchy:

tenant/project/provider scope
  -> WEKA S3 parent user
     -> baseline IAM policy: maximum allowed scope for that project or tenant
     -> service accounts for workloads/automation
     -> STS credentials for direct user/S3-client sessions

Do not create one long-lived WEKA S3 user for every GPUaaS human by default.

Human Direct Credentials

Flow:

  1. User authenticates to GPUaaS.
  2. User requests direct storage credentials for project, bucket, prefix, actions, and TTL.
  3. GPUaaS checks project role and storage_grants.
  4. GPUaaS compiles a restrictive STS session policy.
  5. GPUaaS calls WEKA S3 STS with the parent S3 user's access key and policy.
  6. GPUaaS returns endpoint, access key, secret key, session token, expiration, and safe scope summary.
  7. GPUaaS records a storage.credential.issue audit log.

Default TTL should be short, for example 1 hour or less for UI-generated direct credentials. Longer TTLs should require explicit project/admin policy.
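Steps 4 and 5 can be sketched as a session-policy compiler. The statement shape follows standard S3 policy JSON; the exact prefix-condition syntax WEKA accepts still needs validation:

```go
// Sketch: compile a restrictive STS session policy for one bucket/prefix
// grant before calling WEKA S3 STS with the parent user's access key.
package main

import (
	"encoding/json"
	"fmt"
)

// sessionPolicy builds a policy document scoped to a single bucket and
// prefix; it can only narrow, never widen, the parent S3 user's IAM policy.
func sessionPolicy(bucket, prefix string, actions []string) ([]byte, error) {
	doc := map[string]any{
		"Version": "2012-10-17",
		"Statement": []map[string]any{{
			"Effect": "Allow",
			"Action": actions,
			"Resource": []string{
				"arn:aws:s3:::" + bucket + "/" + prefix + "*",
			},
		}},
	}
	return json.Marshal(doc)
}

func main() {
	p, err := sessionPolicy("proj-data", "datasets/", []string{"s3:GetObject", "s3:ListBucket"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("policy bytes=%d\n%s\n", len(p), p)
}
```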

Workload/App Credentials

For training, notebooks, Kubernetes PV/PVC, and POSIX-heavy apps, prefer WEKAFS/POSIX mounts bound to the workload's storage grants. Provider-derived S3 service accounts or STS credentials should be used when an app explicitly needs S3/object access or when direct external client access is enabled.

Use STS when:

  • access is short-lived
  • a workload lifetime is bounded
  • revocation by TTL is acceptable

Use S3 service accounts when:

  • access must survive longer than STS
  • automation needs stable credentials
  • the credential is still tightly scoped by policy
  • GPUaaS owns rotation and cleanup

Cross-Project Sharing

Keep cross-project sharing in GPUaaS storage_grants.

Compile grants into one of:

  • STS session policy for direct user access
  • service-account policy for workload/automation access
  • bucket policy only where provider behavior requires or benefits from it

Use bucket policies carefully because WEKA documents different wildcard support between IAM and bucket policies.

Risks And Validation Items

Policy Size

WEKA documents S3 IAM user policies as limited to 2048 bytes. This is tight for large grant lists.

Mitigations:

  • group grants by prefix and action
  • prefer project-level parent policies plus narrower session policies
  • avoid one policy statement per user where possible
  • track generated policy byte size before submitting to WEKA
  • add tests for policy compaction
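The last two mitigations can be sketched together: group actions by resource so each resource yields one statement, then enforce the 2048-byte budget before submitting to WEKA:

```go
// Sketch: compact grants into per-resource statements and fail early when
// the documented 2048-byte WEKA S3 IAM policy limit would be exceeded.
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// documented WEKA limit for S3 IAM user policies; verify per deployed version.
const maxPolicyBytes = 2048

type grant struct{ Resource, Action string }

// compactPolicy merges grants sharing a resource into one statement and
// checks the serialized byte size against the budget.
func compactPolicy(grants []grant) ([]byte, error) {
	byRes := map[string][]string{}
	for _, g := range grants {
		byRes[g.Resource] = append(byRes[g.Resource], g.Action)
	}
	resources := make([]string, 0, len(byRes))
	for r := range byRes {
		resources = append(resources, r)
	}
	sort.Strings(resources) // deterministic output for tests/diffs
	var stmts []map[string]any
	for _, r := range resources {
		stmts = append(stmts, map[string]any{
			"Effect": "Allow", "Action": byRes[r], "Resource": []string{r},
		})
	}
	out, err := json.Marshal(map[string]any{"Version": "2012-10-17", "Statement": stmts})
	if err != nil {
		return nil, err
	}
	if len(out) > maxPolicyBytes {
		return nil, fmt.Errorf("policy is %d bytes, exceeds %d", len(out), maxPolicyBytes)
	}
	return out, nil
}

func main() {
	out, err := compactPolicy([]grant{
		{"arn:aws:s3:::proj-data/shared/*", "s3:GetObject"},
		{"arn:aws:s3:::proj-data/shared/*", "s3:PutObject"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes within budget\n", len(out))
}
```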

Scale Limits

WEKA documents finite limits for S3 regular users, service accounts, and STS credentials. Exact values must be checked against our deployed version and license/configuration.

Implementation requirement:

  • expose provider capacity counters in ops read models
  • fail launch/credential issuance with a clear capacity reason
  • alert before provider credential/session capacity is exhausted

Active Credential Revocation

The docs explicitly mention deleting the parent S3 user to invalidate active STS credentials. That is too broad if multiple sessions depend on the same parent.

Required validation:

  • verify whether our WEKA version supports revoking an individual STS session
  • verify whether detaching/updating parent policy affects active sessions
  • verify how fast service-account deletion takes effect
  • choose TTL defaults based on the result

Until validated, design assumption:

GPUaaS can always stop future issuance.
Active STS credentials expire naturally unless broad parent-user deletion is acceptable.

Service Account Ownership

WEKA documentation says only an S3 user can manage its service accounts, while the REST/CLI equivalence page lists service-account REST endpoints.

Required validation:

  • identify which credential can call POST /s3/serviceAccounts
  • confirm whether cluster admin can automate service-account creation
  • confirm whether the API requires the parent S3 user's credential context
  • decide whether the GPUaaS adapter stores parent S3 user credentials in Vault

Audit Reliability

WEKA S3 audit events are sent through webhook, but docs warn that events can be discarded if the webhook target is unavailable or internal buffers fill.

Implications:

  • WEKA audit is evidence, not the only source of truth
  • GPUaaS must audit all control-plane grant/credential/mount mutations itself
  • WEKA audit webhook receiver needs monitoring, buffering, and alerting

Contract Implications

Add or confirm OpenAPI contracts for:

  • list owned/shared buckets
  • create bucket
  • create/revoke storage grant
  • list mount plans and provider mount health
  • reconcile provider path/quota/drift state
  • issue direct S3 credentials, capability-gated until S3 is enabled
  • list credential sessions, capability-gated until S3 is enabled
  • revoke credential session where provider supports it
  • provider capability/status read model
  • provider drift read model

Add internal adapter interfaces for:

type StorageProvider interface {
    CreateNamespace(ctx context.Context, input CreateNamespaceInput) (ProviderNamespaceRef, error)
    DeleteNamespace(ctx context.Context, ref ProviderNamespaceRef) error
    BuildMountPlan(ctx context.Context, input MountPlanInput) (MountPlan, error)
    ReconcileMount(ctx context.Context, input ReconcileMountInput) (MountStatus, error)
    SetNamespaceQuota(ctx context.Context, input QuotaInput) error
    GetNamespaceUsage(ctx context.Context, ref ProviderNamespaceRef) (Usage, error)
    PutBucketPolicy(ctx context.Context, input BucketPolicyInput) error // S3 capability-gated
    CreateServiceAccount(ctx context.Context, input ServiceAccountInput) (ProviderCredentialRef, error) // S3 capability-gated
    DeleteServiceAccount(ctx context.Context, ref ProviderCredentialRef) error // S3 capability-gated
    IssueTemporaryCredentials(ctx context.Context, input STSInput) (TemporaryCredentials, error) // S3 capability-gated
    ConfigureAuditWebhook(ctx context.Context, input AuditWebhookInput) error // S3 capability-gated
}

Do not expose WEKA-specific request/response types outside the storage provider adapter.

Implementation Recommendation

Proceed with the GPUaaS storage IAM model, but implement WEKA as a provider adapter behind the storage domain.

Near-term order:

  1. Build local-dev fake provider and WEKAFS mount-plan compiler tests.
  2. Add provider binding, namespace/path reference, quota, and mount health schema.
  3. Add OpenAPI contracts for storage namespaces, grants, mount plans, provider capabilities, and provider drift.
  4. Implement WEKA adapter for filesystem/path namespace mapping, usage/quota reads, and mount-plan generation against environment-scoped filesystems (gpuaas-kind-fs in kind/dev, gpuaas-fs in production-like validation).
  5. Run integration validation against actual WEKA:
     • create a project namespace under gpuaas-kind-fs
     • mount read-write from one workload
     • mount read-only from a second workload
     • validate multi-writer semantics where the product allows it
     • validate quota/usage reporting
     • validate cleanup/revocation behavior
     • reconcile provider drift into the v3 read model
  6. Defer S3 adapter operations until S3 protocol hosts/containers are enabled: parent S3 user policy, STS with restrictive policy, direct AWS CLI access, service accounts, and S3 audit webhook delivery.

Decision

WEKA can likely support the expected v3 storage model as a dual-protocol backend. WEKAFS/POSIX should be the default path for training and app mounts; S3/STS should be enabled and validated for bucket/object workflows once S3 protocol hosts/containers are in place, with validation still needed around active STS revocation, service-account API ownership, policy size, and scale limits.

The GPUaaS product model should remain provider-neutral and project-first. WEKA-specific details belong in the storage provider adapter and ops read models, not in the user-facing IA.