
Storage WEKA Capability Assessment v1

Date: 2026-04-27

Purpose

Validate whether WEKA can support the GPUaaS storage and IAM model documented in:

  • doc/architecture/Storage_Sharing_and_IAM_Model_v1.md
  • doc/architecture/Storage_IAM_User_Flows_v1.md
  • doc/architecture/Storage_Provider_Capability_Model_v1.md

This assessment is based on current WEKA documentation reviewed on 2026-04-27. It must be validated against our actual WEKA version and deployment mode before implementation.

Source Documents Reviewed

  • WEKA S3 users and authentication: https://docs.weka.io/additional-protocols/s3/s3-users-and-authentication
  • WEKA S3 users/authentication CLI: https://docs.weka.io/additional-protocols/s3/s3-users-and-authentication/s3-users-and-authentication
  • WEKA S3 service accounts CLI: https://docs.weka.io/4.4/additional-protocols/s3/s3-users-and-authentication/s3-users-and-authentication-1
  • WEKA S3 buckets management: https://docs.weka.io/5.0/additional-protocols/s3/s3-buckets-management
  • WEKA S3 supported APIs and limitations: https://docs.weka.io/additional-protocols/s3/s3-limitations
  • WEKA S3 audit APIs: https://docs.weka.io/additional-protocols/s3/audit-s3-apis
  • WEKA REST API and equivalent CLI commands: https://docs.weka.io/getting-started-with-weka/weka-rest-api-and-equivalent-cli-commands

Current Direction

WEKA should be treated as a dual-protocol storage backend:

  • WEKAFS/POSIX for high-performance training, notebooks, Kubernetes PV/PVC, and apps that need filesystem semantics.
  • S3 for bucket/object workflows, direct external clients, SDK access, and apps that expect object semantics.

WEKAFS is conceptually similar to an NFS-style shared filesystem from the user model perspective, but it uses WEKA's own client protocol and DPDK-backed data path rather than NFS.

This changes the implementation priority:

  • WEKAFS/POSIX: primary training/app mount path.
  • S3: bucket/object path; enable when bucket semantics are needed.

S3 is not required for POSIX-only training workloads, but it is likely required for the broader GPUaaS product because users will expect buckets, SDK access, and object-style data movement. GPUaaS remains the IAM source of truth; WEKA-specific users, policies, mounts, or service credentials are derived enforcement mechanisms behind the storage adapter.

S3 Capability Summary

Supported by documentation:

  • S3 local users and LDAP-backed S3 identities.
  • IAM policies attached to S3 users.
  • STS temporary credentials through AssumeRole.
  • More restrictive session policy on STS requests.
  • Permanent S3 service accounts as child identities of S3 users.
  • Optional service-account policy that restricts parent privileges.
  • Bucket management through S3 API, WEKA API, GUI, and CLI.
  • Bucket policy APIs and custom JSON bucket policies.
  • Bucket quota APIs.
  • S3 audit webhook events.
  • REST API coverage for S3 users, policies, service accounts, STS, bucket policies, quota, lifecycle, and audit webhook configuration.

Live Cluster Validation - 2026-04-28

Access path used for this validation:

ssh -N -L 14000:10.177.24.3:14000 hpcadmin@100.90.157.34
WEKA_SERVER=http://localhost:14000 scripts/ops/weka_capability_probe.sh

The validation script is read-only by default. It redacts token, password, key, credential, certificate, and license-like fields and does not print raw provider secrets. The restrictive service-account ownership POST probe is opt-in via WEKA_PROBE_SERVICE_ACCOUNT_POST=1.

Observed cluster state:

  • API surface
    Observed: REST API is /api/v2; Swagger UI reports API version 4.4; cluster release is 4.4.22.
    Implication: Build the first adapter against WEKA REST API v2, not the older JSON-RPC endpoint.

  • Login
    Observed: POST /api/v2/login returns a short-lived bearer token with expires_in=300 plus a refresh token.
    Implication: The provider adapter can use a cluster-admin bearer session for administrative reads and cluster configuration APIs.

  • Authenticated role
    Observed: The logged-in local user is ClusterAdmin in Root.
    Implication: Sufficient for cluster/filesystem/S3 configuration reads, but not for every S3 identity operation.

  • Cluster health
    Observed: Cluster wekacb2, io_status=STARTED, status=OK, link layer ETH.
    Implication: The base cluster is usable for control-plane validation.

  • Filesystem
    Observed: The initial probe saw only default (READY, not encrypted, auth_required=false, audit disabled). Infra then created dedicated GPUaaS filesystems: gpuaas-fs for production-like WEKAFS/POSIX, gpuaas-kind-fs for disposable kind/dev validation, s3-data for future S3 bucket/object data, and s3-config for future S3 protocol configuration.
    Implication: Keep default infra-owned and validate GPUaaS against the dedicated filesystems.

  • S3 cluster
    Observed: GET /api/v2/s3 succeeds, but active=false, s3_hosts=0, filesystem_name=N/A, config_fs_name=N/A, port 9000, TLS enabled. Attempting to save the S3 configuration with s3-data and s3-config failed because the cluster has no S3 protocol host/container.
    Implication: S3 bucket, IAM policy, STS, quota, and S3-client validation is deferred until S3 protocol hosts/containers are added. This does not block WEKAFS/POSIX mount validation.

  • S3 buckets/policies/userPolicies
    Observed: GET /s3/buckets, GET /s3/policies, and GET /s3/userPolicies return NoValidProtocolContainerToRunIn.
    Implication: These APIs require an active S3 protocol container; the GPUaaS adapter should surface this readiness state clearly.

  • S3 service accounts
    Observed: GET /s3/serviceAccounts and POST /s3/serviceAccounts as ClusterAdmin return S3RoleOnlyOperation.
    Implication: Service-account lifecycle is not owned by the cluster-admin bearer in this deployment; the adapter likely needs an S3 user credential context held in Vault.

  • S3 audit webhook
    Observed: GET /s3/auditWebhook succeeds; the webhook is disabled.
    Implication: The audit webhook can be managed through REST once a receiver is ready, but it is not currently producing evidence.
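The S3 readiness checks above reduce to a capability flag the adapter can expose. A minimal Go sketch, assuming the response field names (active, s3_hosts, filesystem_name) match the probe output; verify against our 4.4.22 response shape:

```go
// Sketch: evaluate WEKA S3 protocol readiness from a GET /api/v2/s3 response.
// Field names mirror the observed probe output and are assumptions to confirm.
package main

import (
	"encoding/json"
	"fmt"
)

type s3Status struct {
	Active         bool   `json:"active"`
	S3Hosts        int    `json:"s3_hosts"`
	FilesystemName string `json:"filesystem_name"`
}

// s3Ready turns the raw status into a capability flag plus a reason the
// adapter can surface in the provider readiness read model.
func s3Ready(s s3Status) (bool, string) {
	switch {
	case !s.Active:
		return false, "s3 protocol cluster inactive"
	case s.S3Hosts == 0:
		return false, "no s3 protocol hosts/containers"
	case s.FilesystemName == "" || s.FilesystemName == "N/A":
		return false, "no s3 data filesystem bound"
	}
	return true, ""
}

func main() {
	// Observed cluster state from the 2026-04-28 probe.
	raw := []byte(`{"active": false, "s3_hosts": 0, "filesystem_name": "N/A"}`)
	var st s3Status
	if err := json.Unmarshal(raw, &st); err != nil {
		panic(err)
	}
	ready, reason := s3Ready(st)
	fmt.Printf("s3 ready=%v reason=%q\n", ready, reason)
}
```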

Current validation status:

WEKA control-plane API: validated
WEKA filesystem read model inputs: validated; dedicated GPUaaS filesystems created
WEKA WEKAFS/POSIX workload mount path: not yet validated
WEKA S3 protocol readiness: deferred, no S3 protocol hosts/containers
WEKA bucket/IAM policy/quota behavior: deferred on S3 protocol container
WEKA STS scoped credentials: deferred on S3 protocol container and S3 user credentials
WEKA service-account ownership: partially validated, requires S3 user role
WEKA audit webhook API: validated as disabled/readable

Required infra follow-up before completing WEKAFS/POSIX validation:

  1. Use gpuaas-kind-fs as the first disposable kind/dev validation filesystem and gpuaas-fs for production-like WEKAFS/POSIX validation.
  2. Keep the local SSH tunnel to WEKA API active for control-plane validation from kind, while the actual mount path is tested from a WEKA-reachable node or pod.
  3. Confirm the WEKA client/CSI deployment path for GPUaaS worker nodes.
  4. Confirm whether mounts are created directly by node bootstrap, by Kubernetes CSI for app clusters, or by a host-prepared mount exposed into workloads.
  5. Confirm filesystem and directory layout for tenant/project/bucket/prefix mapping.
  6. Validate read/write/read-only/multi-writer semantics from at least two workloads.
  7. Validate quota reporting/enforcement at the filesystem, directory, or namespace level used by GPUaaS.
  8. Validate cleanup and revocation behavior when a workload is released or a grant is revoked.
  9. Keep per-user home directory semantics out of the first validation pass unless explicitly testing that feature. User homes need a separate decision on project-scoped versus shared-home layout, membership changes after allocation, OS-user reconciliation, ACL propagation, and quota attribution.

Read-only preflight command:

ssh -N -L 14000:10.177.24.3:14000 hpcadmin@100.90.157.34
WEKA_SERVER=http://localhost:14000 make ops-weka-wekafs-probe

The preflight verifies API login, cluster status, the expected GPUaaS filesystems, and S3 readiness state. It does not create filesystems, buckets, users, policies, mounts, or credentials.

Required infra follow-up before completing S3/STS validation:

  1. Enable at least one WEKA S3 protocol host/container on the target cluster when ready to validate bucket/object workflows.
  2. Bind S3 to s3-data as the bucket/object data filesystem and s3-config as the protocol configuration filesystem.
  3. Provide or create a GPUaaS-owned S3 parent user with baseline policy and access/secret material stored in Vault.
  4. Re-run scripts/ops/weka_capability_probe.sh.
  5. Run mutating validation in a disposable namespace/bucket prefix: create bucket, set quota, create IAM policy, issue STS credentials, exercise S3 client access, create/delete service account, revoke access, and observe audit behavior.

Key constraints:

  • S3 user IAM policy length is documented as 2048 bytes.
  • The maximum counts of S3 regular users, service accounts, and STS credentials are finite and must be capacity-planned.
  • Service accounts are permanent and have no expiration.
  • The documented way to immediately invalidate compromised STS credentials is deleting the parent S3 user.
  • Service-account management is documented as S3-user-owned and CLI-only in the service-account page, but REST API equivalent endpoints exist. We must verify exact auth/ownership behavior in our WEKA version.
  • Bucket policies are available, but some wildcard behavior differs between IAM policies and bucket policies.

Capability Matrix

  • Project-owned bucket/namespace
    WEKA: Buckets can be managed via the S3 API, GUI, API, and CLI; each bucket is backed by a configured filesystem, with optional placement in another filesystem.
    Assessment: Feasible. Map GPUaaS buckets to WEKA bucket/filesystem/prefix based on deployment layout.

  • Human direct S3 client access
    WEKA: STS temporary credentials are supported, returning an access key, secret key, and session token.
    Assessment: Feasible. GPUaaS should broker STS credentials after checking project/grant access.

  • Restrict direct credentials to bucket/prefix/action
    WEKA: STS AssumeRole accepts a policy file that cannot expand beyond the parent S3 user's IAM policy.
    Assessment: Feasible, but validate prefix-policy syntax and policy size.

  • Workload/app machine access
    WEKA: S3 service accounts exist as child identities of an S3 user, with optional policy restriction.
    Assessment: Feasible. Use service accounts for workload/runtime credentials where longer-lived access is needed.

  • Avoid every GPUaaS user becoming a WEKA user
    WEKA: An S3 user is required for S3 access, but it can be a provider-side enforcement parent rather than one per human.
    Assessment: Feasible. Prefer GPUaaS-managed parent S3 users per project/tenant/provider scope, plus STS/session policies for humans.

  • Cross-project sharing
    WEKA: IAM/session policies and bucket policies can express allowed S3 resources/actions.
    Assessment: Feasible. GPUaaS grants compile into WEKA policies. Validate scale and wildcard behavior.

  • Revoke future access
    WEKA: GPUaaS can stop issuing future credentials; WEKA policies/service accounts can be removed.
    Assessment: Feasible.

  • Revoke active STS credentials immediately
    WEKA: Docs say compromised STS credentials can be invalidated by deleting the parent S3 user.
    Assessment: Risk. Prefer short TTLs; verify whether a narrower session revocation API exists in our WEKA version.

  • Auditing S3 access
    WEKA: S3 audit webhook events carry operation, bucket, object, status, client IP, user agent, authorization credential, and cluster metadata.
    Assessment: Feasible. Needs a webhook receiver and retention pipeline.

  • Bucket quota
    WEKA: REST/CLI supports setting and unsetting S3 bucket quotas.
    Assessment: Feasible.

  • Lifecycle
    WEKA: Lifecycle APIs exist, with a documented limitation around the expiration action.
    Assessment: Feasible for basic retention/expiration; validate versioning/snapshot expectations separately.

  • Provider automation through API
    WEKA: The REST API includes S3 policy, service account, STS, bucket policy, quota, lifecycle, and audit webhook endpoints.
    Assessment: Feasible. Build the adapter against the REST API, not the shell CLI, except for bootstrap diagnostics.

WEKAFS/POSIX Mounts

Use GPUaaS-managed project/workload mount intent as the primary integration boundary.

Recommended hierarchy:

tenant/project/provider scope
  -> WEKA filesystem or filesystem directory namespace
     -> gpuaas-kind-fs for kind/dev validation
     -> gpuaas-fs for production-like validation and production
     -> project bucket/namespace directory
        -> prefixes for shared, users, workloads, datasets, checkpoints, artifacts
     -> workload mount plan
     -> optional provider quota/snapshot/audit metadata
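A minimal sketch of the first mapping step in the hierarchy above; the BucketRef type and path convention are illustrative, not a fixed contract:

```go
// Sketch: compile a GPUaaS bucket reference into a WEKAFS directory path
// under an environment-scoped filesystem (gpuaas-kind-fs / gpuaas-fs).
package main

import (
	"fmt"
	"path"
)

// BucketRef is a hypothetical GPUaaS-side identifier following the
// tenant/project/bucket hierarchy recommended above.
type BucketRef struct {
	Tenant, Project, Bucket string
}

// wekaPath maps the reference onto a directory namespace inside the
// chosen filesystem; the adapter owns this mapping, not the user.
func wekaPath(fs string, b BucketRef) string {
	return path.Join("/", fs, b.Tenant, b.Project, b.Bucket)
}

func main() {
	b := BucketRef{Tenant: "acme", Project: "llm-train", Bucket: "checkpoints"}
	fmt.Println(wekaPath("gpuaas-kind-fs", b)) // /gpuaas-kind-fs/acme/llm-train/checkpoints
}
```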

The GPUaaS storage adapter should own:

  • mapping GPUaaS bucket IDs to WEKA filesystem/path references
  • generating mount plans for node-agent, app controllers, or Kubernetes CSI
  • enforcing read-only versus read-write mount mode from GPUaaS grants
  • reconciling mount state and provider drift into v3 read models
  • coordinating cleanup on workload release/decommission
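The read-only versus read-write enforcement the adapter owns can be sketched as a fail-closed mount-plan compiler; the MountPlan fields are assumptions, not the final schema:

```go
// Sketch: derive a provider-neutral mount plan from a GPUaaS grant mode.
package main

import "fmt"

// MountPlan is a hypothetical shape of what the adapter hands to a
// node-agent, app controller, or Kubernetes CSI layer.
type MountPlan struct {
	Filesystem string // e.g. gpuaas-kind-fs in kind/dev
	Path       string // project/bucket directory inside the filesystem
	ReadOnly   bool   // derived from the GPUaaS grant, not from WEKA state
}

// planFromGrant maps a grant mode onto a mount plan; unknown modes fail
// closed rather than defaulting to read-write.
func planFromGrant(fs, dir, mode string) (MountPlan, error) {
	switch mode {
	case "read-only":
		return MountPlan{Filesystem: fs, Path: dir, ReadOnly: true}, nil
	case "read-write":
		return MountPlan{Filesystem: fs, Path: dir, ReadOnly: false}, nil
	}
	return MountPlan{}, fmt.Errorf("unknown grant mode %q", mode)
}

func main() {
	p, err := planFromGrant("gpuaas-kind-fs", "acme/llm-train/checkpoints", "read-only")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", p)
}
```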

WEKA DPDK/client configuration, mount helper setup, and node-level connectivity belong to infra/bootstrap and should be surfaced to GPUaaS as provider health and capability state, not user-facing configuration.

Optional Direct Access: Parent S3 Users

Use GPUaaS-managed WEKA S3 parent users as policy envelopes.

For the first cluster validation, bind WEKA S3 as:

S3 data filesystem: s3-data
S3 config filesystem: s3-config
Port: 9000
Anonymous POSIX UID/GID: 65534
All servers: enabled
Virtual-hosted-style domains: unset until DNS/TLS design is explicit

Do not use default for GPUaaS S3 or WEKAFS validation. default stays infra-owned; gpuaas-kind-fs, gpuaas-fs, s3-data, and s3-config are the GPUaaS validation planes.

Current implementation posture: S3 is capability-gated and deferred until WEKA has S3 protocol hosts. Build the first storage integration against gpuaas-kind-fs/WEKAFS/POSIX mounts in kind/dev, then promote the same contract to gpuaas-fs for production-like validation. Keep s3-data and s3-config reserved for a later S3 enablement pass.

This is an early provider placement default, not a permanent global backend decision. If a region contains WEKA plus VAST, DDN, NVMe pools, or multiple WEKA clusters, GPUaaS must select the provider per storage object using storage class, protocol, quota, capacity, fabric, and provider health. The environment default is only a kind/platform-control bootstrap shortcut.

Recommended hierarchy:

tenant/project/provider scope
  -> WEKA S3 parent user
     -> baseline IAM policy: maximum allowed scope for that project or tenant
     -> service accounts for workloads/automation
     -> STS credentials for direct user/S3-client sessions

Do not create one long-lived WEKA S3 user for every GPUaaS human by default.

Human Direct Credentials

Flow:

  1. User authenticates to GPUaaS.
  2. User requests direct storage credentials for project, bucket, prefix, actions, and TTL.
  3. GPUaaS checks project role and storage_grants.
  4. GPUaaS compiles a restrictive STS session policy.
  5. GPUaaS calls WEKA S3 STS with the parent S3 user's access key and policy.
  6. GPUaaS returns endpoint, access key, secret key, session token, expiration, and safe scope summary.
  7. GPUaaS records a storage.credential.issue audit log.

Default TTL should be short, for example 1 hour or less for UI-generated direct credentials. Longer TTLs should require explicit project/admin policy.
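Steps 4 and 5 can be sketched as a session-policy compiler. The statement shape follows standard S3 policy JSON; the exact prefix-condition syntax WEKA accepts still needs validation:

```go
// Sketch: compile a restrictive STS session policy for one bucket/prefix
// grant before calling WEKA S3 STS with the parent user's access key.
package main

import (
	"encoding/json"
	"fmt"
)

// sessionPolicy builds a policy document scoped to a single bucket and
// prefix; it can only narrow, never widen, the parent S3 user's IAM policy.
func sessionPolicy(bucket, prefix string, actions []string) ([]byte, error) {
	doc := map[string]any{
		"Version": "2012-10-17",
		"Statement": []map[string]any{{
			"Effect": "Allow",
			"Action": actions,
			"Resource": []string{
				"arn:aws:s3:::" + bucket + "/" + prefix + "*",
			},
		}},
	}
	return json.Marshal(doc)
}

func main() {
	p, err := sessionPolicy("proj-data", "datasets/", []string{"s3:GetObject", "s3:ListBucket"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("policy bytes=%d\n%s\n", len(p), p)
}
```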

Workload/App Credentials

For training, notebooks, Kubernetes PV/PVC, and POSIX-heavy apps, prefer WEKAFS/POSIX mounts bound to the workload's storage grants. Provider-derived S3 service accounts or STS credentials should be used when an app explicitly needs S3/object access or when direct external client access is enabled.

Use STS when:

  • access is short-lived
  • a workload lifetime is bounded
  • revocation by TTL is acceptable

Use S3 service accounts when:

  • access must survive longer than STS
  • automation needs stable credentials
  • the credential is still tightly scoped by policy
  • GPUaaS owns rotation and cleanup

Cross-Project Sharing

Keep cross-project sharing in GPUaaS storage_grants.

Compile grants into one of:

  • STS session policy for direct user access
  • service-account policy for workload/automation access
  • bucket policy only where provider behavior requires or benefits from it

Use bucket policies carefully because WEKA documents different wildcard support between IAM and bucket policies.

Risks And Validation Items

Policy Size

WEKA documents S3 IAM user policies as limited to 2048 bytes. This is tight for large grant lists.

Mitigations:

  • group grants by prefix and action
  • prefer project-level parent policies plus narrower session policies
  • avoid one policy statement per user where possible
  • track generated policy byte size before submitting to WEKA
  • add tests for policy compaction
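The last two mitigations can be sketched together: group actions by resource so each resource yields one statement, then enforce the 2048-byte budget before submitting to WEKA:

```go
// Sketch: compact grants into per-resource statements and fail early when
// the documented 2048-byte WEKA S3 IAM policy limit would be exceeded.
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// documented WEKA limit for S3 IAM user policies; verify per deployed version.
const maxPolicyBytes = 2048

type grant struct{ Resource, Action string }

// compactPolicy merges grants sharing a resource into one statement and
// checks the serialized byte size against the budget.
func compactPolicy(grants []grant) ([]byte, error) {
	byRes := map[string][]string{}
	for _, g := range grants {
		byRes[g.Resource] = append(byRes[g.Resource], g.Action)
	}
	resources := make([]string, 0, len(byRes))
	for r := range byRes {
		resources = append(resources, r)
	}
	sort.Strings(resources) // deterministic output for tests/diffs
	var stmts []map[string]any
	for _, r := range resources {
		stmts = append(stmts, map[string]any{
			"Effect": "Allow", "Action": byRes[r], "Resource": []string{r},
		})
	}
	out, err := json.Marshal(map[string]any{"Version": "2012-10-17", "Statement": stmts})
	if err != nil {
		return nil, err
	}
	if len(out) > maxPolicyBytes {
		return nil, fmt.Errorf("policy is %d bytes, exceeds %d", len(out), maxPolicyBytes)
	}
	return out, nil
}

func main() {
	out, err := compactPolicy([]grant{
		{"arn:aws:s3:::proj-data/shared/*", "s3:GetObject"},
		{"arn:aws:s3:::proj-data/shared/*", "s3:PutObject"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes within budget\n", len(out))
}
```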

Scale Limits

WEKA documents finite limits for S3 regular users, service accounts, and STS credentials. Exact values must be checked against our deployed version and license/configuration.

Implementation requirement:

  • expose provider capacity counters in ops read models
  • fail launch/credential issuance with a clear capacity reason
  • alert before provider credential/session capacity is exhausted

Active Credential Revocation

The docs explicitly mention deleting the parent S3 user to invalidate active STS credentials. That is too broad if multiple sessions depend on the same parent.

Required validation:

  • verify whether our WEKA version supports revoking an individual STS session
  • verify whether detaching/updating parent policy affects active sessions
  • verify how fast service-account deletion takes effect
  • choose TTL defaults based on the result

Until validated, design assumption:

GPUaaS can always stop future issuance.
Active STS credentials expire naturally unless broad parent-user deletion is acceptable.

Service Account Ownership

WEKA documentation says only an S3 user can manage its service accounts, while the REST/CLI equivalence page lists service-account REST endpoints.

Required validation:

  • identify which credential can call POST /s3/serviceAccounts
  • confirm whether cluster admin can automate service-account creation
  • confirm whether the API requires the parent S3 user's credential context
  • decide whether the GPUaaS adapter stores parent S3 user credentials in Vault

Audit Reliability

WEKA S3 audit events are sent through webhook, but docs warn that events can be discarded if the webhook target is unavailable or internal buffers fill.

Implications:

  • WEKA audit is evidence, not the only source of truth
  • GPUaaS must audit all control-plane grant/credential/mount mutations itself
  • WEKA audit webhook receiver needs monitoring, buffering, and alerting

Contract Implications

Add or confirm OpenAPI contracts for:

  • list owned/shared buckets
  • create bucket
  • create/revoke storage grant
  • list mount plans and provider mount health
  • reconcile provider path/quota/drift state
  • issue direct S3 credentials, capability-gated until S3 is enabled
  • list credential sessions, capability-gated until S3 is enabled
  • revoke credential session where provider supports it
  • provider capability/status read model
  • provider drift read model

Add internal adapter interfaces for:

type StorageProvider interface {
    CreateNamespace(ctx context.Context, input CreateNamespaceInput) (ProviderNamespaceRef, error)
    DeleteNamespace(ctx context.Context, ref ProviderNamespaceRef) error
    BuildMountPlan(ctx context.Context, input MountPlanInput) (MountPlan, error)
    ReconcileMount(ctx context.Context, input ReconcileMountInput) (MountStatus, error)
    SetNamespaceQuota(ctx context.Context, input QuotaInput) error
    GetNamespaceUsage(ctx context.Context, ref ProviderNamespaceRef) (Usage, error)
    PutBucketPolicy(ctx context.Context, input BucketPolicyInput) error // S3 capability-gated
    CreateServiceAccount(ctx context.Context, input ServiceAccountInput) (ProviderCredentialRef, error) // S3 capability-gated
    DeleteServiceAccount(ctx context.Context, ref ProviderCredentialRef) error // S3 capability-gated
    IssueTemporaryCredentials(ctx context.Context, input STSInput) (TemporaryCredentials, error) // S3 capability-gated
    ConfigureAuditWebhook(ctx context.Context, input AuditWebhookInput) error // S3 capability-gated
}

Do not expose WEKA-specific request/response types outside the storage provider adapter.

Implementation Recommendation

Proceed with the GPUaaS storage IAM model, but implement WEKA as a provider adapter behind the storage domain.

Near-term order:

  1. Build local-dev fake provider and WEKAFS mount-plan compiler tests.
  2. Add provider binding, namespace/path reference, quota, and mount health schema.
  3. Add OpenAPI contracts for storage namespaces, grants, mount plans, provider capabilities, and provider drift.
  4. Implement WEKA adapter for filesystem/path namespace mapping, usage/quota reads, and mount-plan generation against environment-scoped filesystems (gpuaas-kind-fs in kind/dev, gpuaas-fs in production-like validation).
  5. Run integration validation against actual WEKA:
     • create a project namespace under gpuaas-kind-fs
     • mount read-write from one workload
     • mount read-only from a second workload
     • validate multi-writer semantics where the product allows it
     • validate quota/usage reporting
     • validate cleanup/revocation behavior
     • reconcile provider drift into the v3 read model
  6. Defer S3 adapter operations until S3 protocol hosts/containers are enabled: parent S3 user policy, STS with restrictive policy, direct AWS CLI access, service accounts, and S3 audit webhook delivery.

Decision

WEKA can likely support the expected v3 storage model as a dual-protocol backend. WEKAFS/POSIX should be the default path for training and app mounts; S3/STS should be enabled and validated for bucket/object workflows once S3 protocol hosts/containers are in place, with validation still needed around active STS revocation, service-account API ownership, policy size, and scale limits.

The GPUaaS product model should remain provider-neutral and project-first. WEKA-specific details belong in the storage provider adapter and ops read models, not in the user-facing IA.