
Storage Sharing and IAM Model v1

Date: 2026-04-27

Purpose

Define the storage ownership, sharing, and IAM model before integrating WEKA as the first production storage provider. The model must support project-scoped storage by default while allowing controlled sharing of datasets, checkpoints, model artifacts, and workspaces across projects.

This document is provider-neutral. WEKA is the first backend, but the product model should also work for VAST or S3-compatible providers later.

Concrete user, workload, CLI, S3-client, sharing, and revocation scenarios are documented in doc/architecture/Storage_IAM_User_Flows_v1.md.

Lessons from the earlier Scality-oriented IAM project are captured in doc/architecture/Storage_IAM_External_Reference_Lessons_v1.md. The key carry-forwards are to keep GPUaaS IAM as the source of truth, compile grants into provider policy, and avoid creating long-lived provider users for every human.

WEKA-specific feasibility, limits, and validation items are tracked in doc/architecture/Storage_WEKA_Capability_Assessment_v1.md.

Important WEKA constraint: the integration is dual-protocol. WEKAFS/POSIX is the primary high-performance workload data path for training, notebooks, Kubernetes PV/PVC, and apps that need filesystem semantics. S3/STS should be enabled and validated for bucket/object workflows, SDK access, external clients, and apps that expect S3 semantics.

Core Decision

Storage ownership is project-first:

tenant
  project
    bucket / namespace
      shared/
      users/<user-id>/
      workloads/<workload-id>/
      datasets/<dataset-id>/
      checkpoints/<workload-id>/
      artifacts/

Users are actors inside one or more projects. A user's personal files are scoped to a project, for example Research/users/subash/ and Sandbox/users/subash/, not a global users/subash/projects/* hierarchy.
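The project-first scoping above reduces to a simple prefix rule. A minimal sketch, with a hypothetical helper name, showing that the same user resolves to independent prefixes in each project:

```python
def user_prefix(project: str, user_id: str) -> str:
    """Build the project-scoped personal prefix for a user.

    Personal files live under the owning project's namespace, never under
    a global per-user hierarchy (hypothetical helper, names assumed).
    """
    return f"{project}/users/{user_id}/"

# The same user in two projects gets two unrelated prefixes:
assert user_prefix("Research", "subash") == "Research/users/subash/"
assert user_prefix("Sandbox", "subash") == "Sandbox/users/subash/"
```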

Deferred Decision: User Home Storage

Project-owned storage and user home storage are related, but they are not the same product primitive.

The first WEKAFS implementation should focus on project storage objects that are explicitly attached to workloads and apps. Before adding persistent home directories, revisit the user-home model as a separate design decision.

Open questions:

  • whether every project member receives a separate project-scoped home directory, for example projects/{project_id}/users/{user_id}/home
  • whether a shared project home is ever appropriate, and if so whether it is a separate shared storage object rather than /home
  • whether the default workload mount should include only the launching user's home, all project member homes, or no home mount unless requested
  • what happens when a user is added to or removed from a project after a workload has already been allocated
  • whether a running multi-user workload can safely reconcile OS users, home directories, filesystem ACLs, and mount visibility without restart
  • how shared app runtimes such as notebooks, training dashboards, or inference workbenches distinguish the launching user from later collaborators
  • whether user-home storage is quota-attributed to the user, project, tenant, or a combination of project and user

Default posture until this is designed:

  • do not mount one writable project-wide home directory for every user
  • do not automatically grant newly added project members access to already running workloads
  • treat shared datasets, artifacts, checkpoints, and app data as project storage objects with explicit grants
  • treat per-user persistent home directories as a future storage/IAM feature requiring explicit UX, backend state, node-agent behavior, and audit events

This prevents the first WEKAFS integration from accidentally baking in a single-home or all-users-visible model that would be hard to unwind later.

Why Project-First

  • GPUaaS IAM, service accounts, app launches, workloads, quotas, and accounting are project-scoped.
  • A user can have different roles in different projects.
  • Datasets and model artifacts are usually project assets, not personal assets.
  • Releasing a workload or deleting a project has a clear cleanup boundary.
  • Provider policies can be compiled from project + principal + prefix.

Ownership vs Access

Every bucket or storage namespace has one owning project.

The owning project controls:

  • quota attribution
  • billing attribution
  • lifecycle policy
  • deletion authority
  • default access policy
  • provider placement and capability selection

Access can be granted outside the owning project through explicit grants.

Sharing Model

A bucket or prefix can grant access to:

  • another project
  • a user within a project
  • a GPUaaS service account
  • a workload or app instance
  • a tenant-managed shared dataset group, once that exists

Example:

owner project: Training
bucket: training:imagenet

grants:
  Training project: read/write
  Inference project: read
  Sandbox project: read until 2026-05-31
  sa_training_pipeline: read/write checkpoints/
  sa_vllm_inference: read artifacts/model/

Default sharing posture:

  • no cross-project access unless explicitly granted
  • read-only is the default cross-project grant
  • cross-project write requires project admin or tenant admin authority
  • grants may target a full bucket or a prefix
  • grants may have expiration
  • grants must be audited

Principal Mapping

Human users do not become long-lived WEKA users by default.

Human user
  -> GPUaaS auth/session
  -> GPUaaS IAM check
  -> optional short-lived provider credentials only when direct S3/client access is enabled

Workload/app runtime
  -> GPUaaS service account
  -> GPUaaS-generated WEKAFS/POSIX mount plan for filesystem workloads
  -> optional provider-derived S3 service account or STS credential for object access

Principal classes and their provider credential posture:

  • Human user: short-lived scoped STS/session credentials only when requested for direct S3 client use.
  • Workload/app instance: project-scoped service account with a GPUaaS-controlled mount plan for WEKAFS/POSIX; provider-derived credentials only when object/S3 access is enabled.
  • Project automation: explicit project service account with scoped provider credentials.
  • Platform operations: platform storage-admin credential in platform custody, never exposed to users/workloads.

Direct S3 Client Flow

Users may need to use aws s3, Python SDKs, data loaders, or other S3 clients. This requires temporary credentials, not long-lived provider users.

This flow is separate from WEKAFS/POSIX mounts. It is required when the selected storage object is exposed through S3 semantics or when a user/app needs direct object-client access.

Flow:

  1. User authenticates to GPUaaS.
  2. User requests storage credentials for a project, bucket, and optional prefix.
  3. GPUaaS checks project role and storage grants.
  4. GPUaaS asks the provider adapter to issue temporary credentials or a short-lived scoped equivalent.
  5. GPUaaS returns endpoint, access key, secret key, session token when applicable, expiration, and the allowed scope summary.
  6. GPUaaS writes an audit log entry.

Rules:

  • default TTL should be short, for example 1 hour
  • no long-lived human provider keys
  • no provider credentials stored in browser local storage
  • revocation of project membership or grant stops future issuance
  • existing temporary credentials expire naturally unless provider-side session invalidation is available
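The response in step 5 can be consumed by any S3 client. A sketch of mapping it onto boto3-style client arguments; the response field names (`endpoint`, `access_key_id`, and so on) are assumptions based on the flow description, not a fixed contract:

```python
from datetime import datetime, timedelta, timezone

def build_client_config(cred_response: dict) -> dict:
    """Map a GPUaaS credential response onto S3-client keyword arguments
    (boto3-style names; nothing is persisted, per the rules above)."""
    expires = datetime.fromisoformat(cred_response["expires_at"])
    if expires <= datetime.now(timezone.utc):
        raise ValueError("credentials expired; request a new session")
    return {
        "endpoint_url": cred_response["endpoint"],
        "aws_access_key_id": cred_response["access_key_id"],
        "aws_secret_access_key": cred_response["secret_access_key"],
        "aws_session_token": cred_response["session_token"],
    }

# Example response shape from step 5 (all values are placeholders):
resp = {
    "endpoint": "https://s3.storage.internal",
    "access_key_id": "ASIAPLACEHOLDER",
    "secret_access_key": "placeholder-secret",
    "session_token": "placeholder-token",
    "expires_at": (datetime.now(timezone.utc) + timedelta(hours=1)).isoformat(),
}
cfg = build_client_config(resp)
assert cfg["aws_session_token"] == "placeholder-token"
```

With real boto3 these kwargs would be passed as `boto3.client("s3", **cfg)`; the session token is what distinguishes a temporary credential from a long-lived key.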

Workload/App Credential Flow

Workloads should not use human credentials.

Flow:

  1. User launches a workload or app.
  2. GPUaaS authorizes the user and resolves requested storage mounts.
  3. GPUaaS creates or selects a project-scoped service account for the runtime.
  4. GPUaaS compiles bucket/prefix policy from workload intent and grants.
  5. Provider adapter creates provider-derived credentials.
  6. Credential material is stored in Vault or delivered wrapped, never as plaintext in the database.
  7. Node-agent/app controller receives sanitized mount instructions or wrapped credential delivery.
  8. On release/decommission, GPUaaS revokes or disables provider-derived access.
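The sanitization requirement in step 7 can be made checkable. A sketch, assuming a mount-plan shape and Vault reference scheme that are illustrative only:

```python
FORBIDDEN_FIELDS = {"access_key_id", "secret_access_key", "session_token"}

def is_sanitized(plan) -> bool:
    """True if no raw credential field appears anywhere in the plan;
    credentials travel only as opaque references (e.g. a Vault path)."""
    if isinstance(plan, dict):
        return all(k not in FORBIDDEN_FIELDS and is_sanitized(v)
                   for k, v in plan.items())
    if isinstance(plan, list):
        return all(is_sanitized(v) for v in plan)
    return True

# Hypothetical instructions a node-agent might receive:
mount_plan = {
    "workload_id": "wl-0042",
    "service_account": "sa_training_pipeline",
    "mounts": [{
        "backend": "weka",
        "source": "training/imagenet/checkpoints/",
        "target": "/mnt/checkpoints",
        "mode": "rw",
        "credential_ref": "vault://storage/sessions/cs-0042",  # reference, never material
    }],
}
assert is_sanitized(mount_plan)
assert not is_sanitized({"secret_access_key": "oops"})
```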

WEKA First Backend Mapping

WEKA integration should follow this mapping:

  • Owning project bucket/namespace: WEKA filesystem directory namespace or project prefix for WEKAFS/POSIX; WEKA S3 bucket when S3 is enabled.
  • GPUaaS service account: workload/app identity that receives a GPUaaS-generated mount plan; optional WEKA S3 service account for object access.
  • GPUaaS storage grant: mount authorization and read-only/read-write mode for WEKAFS/POSIX; optional WEKA IAM/bucket policy for S3.
  • Direct user S3 access: optional short-lived WEKA STS/session credentials if S3 is enabled for the selected mode.
  • Provider admin operations: platform-owned WEKA admin/API credential in Vault custody.

Provider-specific details must stay behind packages/services/storage. Read models expose only user-safe capability hints and access summaries.

Policy Compiler Inputs

The storage policy compiler needs:

  • tenant ID
  • owning project ID
  • requesting project ID
  • principal type: user, service account, workload, platform
  • principal ID
  • bucket ID or provider namespace
  • prefix list
  • operations: list, read, write, delete, mount, admin
  • expiration
  • reason / source workflow

Compiler output should be provider-specific policy material plus a user-safe summary for audit and read models.
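A sketch of such a compiler for an S3-style backend. The operation-to-action mapping and ARN form are illustrative; real output is provider-specific and lives behind the adapter.

```python
# Illustrative operation-to-action mapping; the real table is provider-specific.
OP_TO_S3 = {
    "list": ["s3:ListBucket"],
    "read": ["s3:GetObject"],
    "write": ["s3:PutObject"],
    "delete": ["s3:DeleteObject"],
}

def compile_policy(bucket: str, prefixes: list, operations: list):
    """Compile one grant into S3-style policy material plus the
    user-safe summary required for audit and read models."""
    actions = sorted({a for op in operations for a in OP_TO_S3.get(op, [])})
    resources = ([f"arn:aws:s3:::{bucket}/{p}*" for p in prefixes]
                 or [f"arn:aws:s3:::{bucket}/*"])  # no prefixes means whole bucket
    policy = {
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": actions, "Resource": resources}],
    }
    summary = f"{','.join(operations)} on {bucket} prefixes {prefixes or ['*']}"
    return policy, summary

policy, summary = compile_policy("training-imagenet", ["checkpoints/"], ["read", "write"])
assert policy["Statement"][0]["Resource"] == ["arn:aws:s3:::training-imagenet/checkpoints/*"]
assert "s3:GetObject" in policy["Statement"][0]["Action"]
```

Operations without an object-store equivalent here (mount, admin) compile to nothing in this sketch; in the real compiler they map to filesystem-side mount authorization or platform-only paths.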

Contract Baseline

The first API contract slice is provider-neutral and uses GPUaaS grants as the source of truth:

  • GET /api/v1/v3/storage/{bucket_id}/grants lists user-safe grant posture.
  • POST /api/v1/v3/storage/{bucket_id}/grants creates a bucket or prefix grant.
  • DELETE /api/v1/v3/storage/{bucket_id}/grants/{grant_id} revokes a grant.
  • POST /api/v1/v3/storage/{bucket_id}/credentials issues short-lived direct S3 credentials for CLI/SDK/workload mount use after IAM checks.
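A hypothetical request body for the grant-creation endpoint, with field names taken from the storage_grants record; the exact contract fields are not fixed by this document.

```python
import json

# Hypothetical POST /api/v1/v3/storage/{bucket_id}/grants body, mirroring
# the cross-project read-until-expiry grant in the sharing example above.
grant_request = {
    "subject_kind": "project",
    "subject_id": "sandbox",
    "prefixes": ["datasets/"],
    "permissions": ["list", "read"],
    "expires_at": "2026-05-31T00:00:00Z",
}
# The body survives a JSON round-trip unchanged:
assert json.loads(json.dumps(grant_request)) == grant_request
```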

Storage grants compile into provider policy:

storage_grants
  id
  owner_project_id
  bucket_id
  prefixes[]
  subject_kind: project | user | service_account | workload | tenant_group
  subject_id
  subject_project_id
  permissions[]: list | read | write | delete | mount | admin
  provider_backend: weka | vast | s3_compatible | ...
  provider_policy_ref
  expires_at
  revoked_at

Credential sessions are operational evidence, not durable secrets:

storage_credential_sessions
  id
  org_id
  project_id
  grant_id
  subject_kind
  subject_id
  subject_project_id
  bucket_id
  client_kind: s3_cli | sdk | workload_mount
  credential_type: s3_session | provider_session | vault_wrapped_secret
  prefixes[]
  permissions[]
  provider_backend
  provider_session_ref
  provider_policy_ref
  status: active | expired | revocation_pending | revoked | failed
  issued_by_user_id or issued_by_service_account_id
  issued_at
  expires_at
  revoked_at
  revoked_by_user_id
  revoke_reason
  source_workflow_id
  idempotency_key
  audit_log_id
  metadata

Raw access keys, secret keys, session tokens, provider admin credentials, and provider policy JSON must not be stored in read models. If any credential material is persisted for workload delivery, it must be Vault-wrapped or stored through the platform secret path, never plaintext in Postgres.

The storage_credential_sessions row is the durable evidence boundary for a one-time credential response. It may store provider session references, policy references, scope summaries, issuer identity, expiry, revocation posture, and diagnostic metadata. It must not store access_key_id, secret_access_key, session_token, wrapped-token bytes, provider admin credentials, raw provider policy JSON, or any recoverable credential material.

POST /api/v1/v3/storage/{bucket_id}/credentials creates or reuses the idempotent evidence row before returning the one-time credential material. The response includes credential_session_id and a user-safe session evidence summary so operators and clients can correlate later audit/history without needing the secret-bearing response again.
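The create-or-reuse behavior can be sketched with an in-memory stand-in for the evidence table; the names and row shape are illustrative only.

```python
_sessions = {}  # idempotency_key -> evidence row (stand-in for the Postgres table)

def ensure_credential_session(idempotency_key: str, row: dict) -> dict:
    """Create or reuse the evidence row keyed by idempotency_key, before any
    secret material is returned, so a retried request correlates to the same
    credential_session_id without needing the secret-bearing response again."""
    return _sessions.setdefault(
        idempotency_key,
        {**row, "credential_session_id": f"cs-{len(_sessions) + 1}"},
    )

first = ensure_credential_session("req-abc", {"bucket_id": "b-1", "status": "active"})
retry = ensure_credential_session("req-abc", {"bucket_id": "b-1", "status": "active"})
assert first["credential_session_id"] == retry["credential_session_id"]
```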

The service adapter boundary lives under packages/services/storage:

  • ProviderAdapter.CompileGrantPolicy compiles GPUaaS grants into provider policy material.
  • ProviderAdapter.ApplyGrant and RevokeGrant reconcile provider policy.
  • ProviderAdapter.IssueCredential returns one-time short-lived credential material for direct S3/SDK/workload use.
  • ProviderAdapter.RevokeCredential is best-effort because active STS session revocation depends on provider capability.
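The adapter boundary can be expressed as a structural interface. A Python sketch with assumed signatures (the document only names the methods; parameter and return types here are guesses):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class ProviderAdapter(Protocol):
    """Boundary under packages/services/storage; signatures are assumptions."""

    def CompileGrantPolicy(self, grant: dict) -> dict:
        """Compile a GPUaaS grant into provider policy material."""
        ...

    def ApplyGrant(self, grant: dict, policy: dict) -> str:
        """Reconcile provider policy; returns a provider_policy_ref."""
        ...

    def RevokeGrant(self, provider_policy_ref: str) -> None: ...

    def IssueCredential(self, grant: dict, ttl_seconds: int) -> dict:
        """Return one-time short-lived credential material."""
        ...

    def RevokeCredential(self, provider_session_ref: str) -> bool:
        """Best-effort; False when the provider cannot revoke live sessions."""
        ...

class NullAdapter:
    """Minimal fake demonstrating structural conformance (not a real backend)."""
    def CompileGrantPolicy(self, grant): return {}
    def ApplyGrant(self, grant, policy): return "ref-0"
    def RevokeGrant(self, provider_policy_ref): return None
    def IssueCredential(self, grant, ttl_seconds): return {}
    def RevokeCredential(self, provider_session_ref): return False  # best-effort

assert isinstance(NullAdapter(), ProviderAdapter)
```

A Protocol keeps the WEKA, VAST, and S3-compatible adapters behind one seam without forcing an inheritance hierarchy onto provider code.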

Audit Requirements

Audit every privileged storage/IAM mutation:

  • storage.bucket.create
  • storage.bucket.delete
  • storage.grant.create
  • storage.grant.revoke
  • storage.credential.issue
  • storage.credential.rotate
  • storage.credential.revoke
  • storage.mount.attach
  • storage.mount.detach

Audit rows must preserve:

  • actor user or service account
  • owning project
  • granted principal
  • bucket/prefix
  • permissions
  • expiration
  • provider backend type
  • credential session ID when provider credential material is issued
  • correlation ID

Do not log raw provider credentials.

Provider credential issuance also uses IAM Token Issuer audit actions:

  • successful direct provider credential issuance writes auth.provider_credential.issue and storage.credential.issue
  • denied issuance writes auth.provider_credential.deny when the caller or reusable credential identity can be safely attributed
  • provider-side or evidence-only revocation writes auth.provider_credential.revoke and storage.credential.revoke

V3 UI Implications

Storage workbench needs to show:

  • owned buckets
  • shared-with-this-project buckets
  • bucket owner project
  • grants and audiences
  • attached workloads
  • credential posture, without revealing secrets
  • provider capability hints

Launch flows need inline bucket selection across:

  • owned project buckets
  • buckets shared with the active project
  • datasets/artifacts that are read-only
  • output/checkpoint buckets that are writable

Direct S3 credential issuance should be a deliberate action with visible TTL and scope summary.

Non-Goals For First Slice

  • global user home buckets outside projects
  • public buckets
  • anonymous S3 access
  • broad tenant-wide write grants
  • exposing WEKA admin concepts directly in the UI
  • making WEKA the source of truth for GPUaaS IAM

First Implementation Slice

  1. Project-owned buckets/namespaces.
  2. Prefix-capable grants between projects.
  3. Workload/app service-account credentials for mounts and S3 access.
  4. Short-lived user credentials for direct S3 client access.
  5. WEKA provider adapter hidden behind storage interfaces.
  6. v3 read models for owner, shared audiences, mounts, flags, and provider capability hints.