Storage Provider Capability Model v1

Purpose

Storage starts with WEKA, but the product/API model must stay provider-neutral so VAST, DDN, NVMe pools, S3-compatible object stores, or future backends can be added without redesigning the v3 Storage UI.

The control plane exposes capability metadata, not provider credentials or raw backend internals.

Storage ownership, sharing, and IAM are defined in doc/architecture/Storage_Sharing_and_IAM_Model_v1.md. In short: storage is project-owned by default, cross-project sharing is explicit grant state, and provider credentials are derived from GPUaaS IAM rather than making WEKA the primary user directory.

Provider integration lessons from the prior Scality IAM project are documented in doc/architecture/Storage_IAM_External_Reference_Lessons_v1.md.

The current WEKA capability assessment is documented in doc/architecture/Storage_WEKA_Capability_Assessment_v1.md.

RKE2-specific CSI, StorageClass, PVC, data-path, and exposure boundaries are defined in doc/architecture/RKE2_External_Storage_Model_v1.md.

Capability Shape

Every storage-backed bucket/read model may include a provider object:

| Field | Meaning |
| --- | --- |
| `backend_type` | Stable provider class: `weka`, `vast`, `ddn`, `nvme_pool`, `s3_compatible`, `local_dev`, or `unknown`. |
| `display_name` | User-safe provider label. Example: WEKA, VAST, Local dev. |
| `performance_tier` | Product tier: `standard`, `performance`, `capacity`, `archive`, or `unknown`. |
| `access_protocols` | Supported access surfaces such as `posix`, `wekafs`, `nfs`, `s3`, `smb`, `csi`. |
| `mount_modes` | Supported mount modes: `read_only`, `read_write`, `multi_writer`. |
| `multi_attach` | Whether multiple workloads may attach concurrently. |
| `encryption` | Whether the backend supports encryption at rest for this class. |
| `kms_managed` | Whether customer/project KMS integration is supported. |
| `snapshots` | Whether snapshots are supported. |
| `versioning` | Whether object/file versioning is supported. |
| `retention` | Whether retention policy enforcement is supported. |
| `quotas` | Whether quota accounting/enforcement is supported. |
| `region_constraints` | User-safe region hints. No IPs, cluster IDs, or backend hostnames. |
| `fabric_constraints` | User-safe fabric hints such as `ethernet`, `roce`, `infiniband`, or `same_rack_preferred`. |
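
The capability object can be sketched as a TypeScript type. This is an illustrative shape only (type and variable names are hypothetical); the canonical definition would live behind `packages/services/storage`:

```typescript
// Hypothetical sketch of the provider capability object; field names
// mirror the table above.
type BackendType =
  | "weka" | "vast" | "ddn" | "nvme_pool" | "s3_compatible" | "local_dev" | "unknown";
type PerformanceTier = "standard" | "performance" | "capacity" | "archive" | "unknown";
type AccessProtocol = "posix" | "wekafs" | "nfs" | "s3" | "smb" | "csi";
type MountMode = "read_only" | "read_write" | "multi_writer";

interface ProviderCapability {
  backend_type: BackendType;
  display_name: string;             // user-safe label, e.g. "WEKA"
  performance_tier: PerformanceTier;
  access_protocols: AccessProtocol[];
  mount_modes: MountMode[];
  multi_attach: boolean;
  encryption: boolean;
  kms_managed: boolean;
  snapshots: boolean;
  versioning: boolean;
  retention: boolean;
  quotas: boolean;
  region_constraints: string[];     // user-safe hints only, never IPs or hostnames
  fabric_constraints: string[];     // e.g. "ethernet", "roce", "same_rack_preferred"
}

// Example instance for a WEKA performance profile (illustrative values;
// kms_managed in particular depends on deployment integration).
const wekaProfile: ProviderCapability = {
  backend_type: "weka",
  display_name: "WEKA",
  performance_tier: "performance",
  access_protocols: ["wekafs", "posix"],
  mount_modes: ["read_only", "read_write", "multi_writer"],
  multi_attach: true,
  encryption: true,
  kms_managed: false,
  snapshots: true,
  versioning: false,
  retention: false,
  quotas: true,
  region_constraints: [],
  fabric_constraints: ["ethernet"],
};
```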

Initial Provider Profiles

| Backend | Initial posture |
| --- | --- |
| WEKA | Performance dual-protocol backend: WEKAFS/POSIX over WEKA's client data path for training and app mounts, plus S3 when bucket/object workflows are enabled. Multi-attach capable, quota capable, snapshot capable; KMS depends on deployment integration. |
| VAST | Capacity/performance backend, NFS/S3/SMB-capable, multi-attach capable, quota/snapshot/retention capable; KMS depends on deployment integration. |
| Local dev | Filesystem-backed development adapter. Not representative of production performance, KMS, retention, or multi-attach semantics. |

API Rules

  • Do not expose provider endpoints, cluster names, IPs, credentials, access keys, mount secrets, or internal volume IDs.
  • Treat capability fields as hints for UI and policy decisions, not as proof that runtime attachment has completed.
  • Runtime mount state belongs to workload/storage mount read models.
  • Provider-specific implementation detail belongs behind service interfaces in packages/services/storage, not in UI copy.
  • Provider IAM policy material should be generated from GPUaaS storage grants and service-account/user intent. UI/read models receive only the safe summary.
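
One way to make the "safe summary" rule structural rather than reviewed-by-hand is to build read-model output from an explicit allow-list, so operator-only fields cannot leak by accident. A minimal sketch, with hypothetical record and function names:

```typescript
// Hypothetical internal record: carries operator-only detail that must
// never reach UI/read models.
interface InternalProviderRecord {
  backendType: string;
  displayName: string;
  performanceTier: string;
  // operator-only detail below this line
  endpointUrl: string;
  clusterName: string;
  providerInstanceId: string;
  accessKeyId: string;
}

// User-safe summary built by picking fields explicitly (allow-list),
// never by deleting unsafe keys from the internal record (deny-list).
function toUserSafeSummary(rec: InternalProviderRecord) {
  return {
    backend_type: rec.backendType,
    display_name: rec.displayName,
    performance_tier: rec.performanceTier,
  };
}

const summary = toUserSafeSummary({
  backendType: "weka",
  displayName: "WEKA",
  performanceTier: "performance",
  endpointUrl: "https://weka.internal.example:14000", // illustrative only
  clusterName: "weka-prod-1",
  providerInstanceId: "weka-kind",
  accessKeyId: "example-key",
});
// summary contains no endpoint, cluster, instance, or credential fields
```

The design point is the allow-list direction: new internal columns stay private by default instead of leaking until someone remembers to redact them.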

Launch UX Rules

  • Inline bucket creation should ask for intent: purpose, capacity, encryption, lifecycle, and access.
  • Provider selection can remain implicit until multiple provider classes are production-ready.
  • When provider selection becomes visible, show product-level tiers and capabilities, not raw vendor internals.

Provider Placement Model

Provider backend is a placement decision, not a global platform constant.

A single region may eventually contain more than one storage provider or more than one provider instance, for example WEKA for performance POSIX workloads, VAST for capacity/object workflows, and NVMe-local pools for node-affine scratch. The product model must support this without changing the v3 Storage surface.

Long-term placement inputs:

  • region and availability/fabric zone
  • storage class or product tier
  • requested protocol: wekafs, posix, csi, s3, nfs, smb
  • purpose: workspace, dataset, checkpoint, artifact, generic
  • requested quota and current provider capacity
  • access mode and write policy
  • node/app placement constraints
  • tenant/project policy and entitlements
  • provider health and drift status

The backend selected for a storage object must be persisted on the storage record. Attachments inherit that provider unless an explicit migration workflow moves the storage object.
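
A future placement service could consume the inputs listed above and return both the provider assignment to persist and a user-safe capability summary. A rough sketch of that contract, with hypothetical names and a toy first-fit resolver (real placement would also weigh health, fabric, and tenant policy):

```typescript
// Hypothetical placement contract; the real shape would live behind
// packages/services/storage.
interface PlacementRequest {
  region: string;
  tier: string;                  // product tier, e.g. "performance"
  protocol: string;              // "wekafs" | "posix" | "csi" | "s3" | "nfs" | "smb"
  purpose: string;               // "workspace" | "dataset" | "checkpoint" | ...
  requestedQuotaBytes: number;
}

interface ProviderInstance {
  backend: string;
  instanceId: string;
  region: string;
  tier: string;
  freeBytes: number;
}

interface PlacementDecision {
  providerBackend: string;       // persisted on the storage record
  providerInstanceId: string;    // operator-safe handle, not user-visible
  capabilitySummary: { performance_tier: string };
}

// Toy resolver: first instance matching region/tier with quota headroom wins.
function place(
  req: PlacementRequest,
  inventory: ProviderInstance[],
): PlacementDecision | undefined {
  const hit = inventory.find(
    (p) =>
      p.region === req.region &&
      p.tier === req.tier &&
      p.freeBytes >= req.requestedQuotaBytes,
  );
  return hit && {
    providerBackend: hit.backend,
    providerInstanceId: hit.instanceId,
    capabilitySummary: { performance_tier: hit.tier },
  };
}

const decision = place(
  { region: "eu-1", tier: "performance", protocol: "wekafs", purpose: "dataset", requestedQuotaBytes: 1024 },
  [{ backend: "weka", instanceId: "weka-eu-1", region: "eu-1", tier: "performance", freeBytes: 1 << 30 }],
);
```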

Environment variables such as GPUAAS_STORAGE_PROVIDER_BACKEND are only development or single-provider fallback defaults. They are acceptable for kind and early platform-control validation, but they are not the production selection model.

Future implementation should introduce an explicit storage placement service or table, similar in spirit to compute placement, that returns a provider assignment and user-safe capability summary for each create request.

Current Implementation Boundary

The current v3 bucket create path already treats provider placement as a per-bucket assignment:

  • storage_buckets.provider_backend stores the selected backend class.
  • storage_buckets.provider_filesystem stores the selected filesystem or namespace when the backend needs one, for example WEKA gpuaas-kind-fs or gpuaas-fs.
  • storage_buckets.provider_instance_id stores an operator-safe provider instance handle used by control-plane reconciliation. It is not returned to users.
  • Storage attachments inherit provider assignment from the bucket record, not from a global runtime default.
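
The per-bucket columns and the inheritance rule can be sketched as follows (illustrative TypeScript shapes mirroring the columns above, not the actual schema definition):

```typescript
// Mirrors the storage_buckets provider columns described above.
interface StorageBucketRecord {
  id: string;
  provider_backend: string;        // e.g. "weka"
  provider_filesystem?: string;    // e.g. "gpuaas-kind-fs", when the backend needs one
  provider_instance_id: string;    // operator-safe handle, never user-facing
}

// Attachments inherit the bucket's persisted provider assignment; there
// is no global runtime default to fall back to.
function attachmentProvider(bucket: StorageBucketRecord) {
  return {
    backend: bucket.provider_backend,
    filesystem: bucket.provider_filesystem,
    instanceId: bucket.provider_instance_id,
  };
}

const attach = attachmentProvider({
  id: "bkt-1",
  provider_backend: "weka",
  provider_filesystem: "gpuaas-kind-fs",
  provider_instance_id: "weka-kind",
});
```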

In local/kind and early single-provider environments, the assignment resolver may still use environment defaults:

```
GPUAAS_STORAGE_PROVIDER_BACKEND=weka
GPUAAS_STORAGE_WEKA_FILESYSTEM=gpuaas-kind-fs
GPUAAS_STORAGE_PROVIDER_INSTANCE_ID=weka-kind
```

For multi-provider development, use ordered assignment rules instead of a single backend default:

```
GPUAAS_STORAGE_PROVIDER_ASSIGNMENTS='purpose=dataset,protocol=wekafs,backend=weka,filesystem=gpuaas-kind-fs,instance=weka-kind;purpose=generic,backend=vast,instance=vast-capacity'
```

These env rules are still a bootstrap mechanism. Production should move the same assignment shape into database-backed provider inventory and capacity placement when multiple regions/providers are operational.
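
Resolving the ordered rule string is first-match-wins: any key that is not an output key (`backend`, `filesystem`, `instance`) is a match constraint. A minimal parser sketch with hypothetical helper names (the real resolver belongs in the assignment code path, and should validate malformed input rather than silently accept it):

```typescript
// Hypothetical sketch: parse ordered assignment rules of the form
// "purpose=dataset,protocol=wekafs,backend=weka,...;purpose=generic,backend=vast,...".
interface AssignmentRule {
  match: Record<string, string>; // constraint keys: purpose, protocol, ...
  backend: string;
  filesystem?: string;
  instance?: string;
}

function parseAssignments(raw: string): AssignmentRule[] {
  return raw
    .split(";")
    .filter(Boolean)
    .map((ruleText) => {
      const rule: AssignmentRule = { match: {}, backend: "unknown" };
      for (const pair of ruleText.split(",")) {
        const [key, value] = pair.split("=");
        if (key === "backend") rule.backend = value;
        else if (key === "filesystem") rule.filesystem = value;
        else if (key === "instance") rule.instance = value;
        else rule.match[key] = value; // anything else is a match constraint
      }
      return rule;
    });
}

// Rules are ordered: the first rule whose constraints all match wins.
function resolveAssignment(
  rules: AssignmentRule[],
  request: Record<string, string>,
): AssignmentRule | undefined {
  return rules.find((r) =>
    Object.entries(r.match).every(([k, v]) => request[k] === v),
  );
}

const rules = parseAssignments(
  "purpose=dataset,protocol=wekafs,backend=weka,filesystem=gpuaas-kind-fs,instance=weka-kind;" +
  "purpose=generic,backend=vast,instance=vast-capacity",
);
const dataset = resolveAssignment(rules, { purpose: "dataset", protocol: "wekafs" });
```

Because the same rule shape works whether it is sourced from an env string or a database table, moving to database-backed inventory later does not change the resolver's contract.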

WEKA Dual-Protocol Posture

The first WEKA production integration should support two product intents:

| Intent | Primary protocol | Product surface |
| --- | --- | --- |
| Training, notebooks, app runtimes, POSIX-heavy workloads | WEKAFS/POSIX | Storage mounts, workload/app launch, Kubernetes PV/PVC where applicable |
| Bucket/object workflows, direct external clients, SDK access | S3 | Buckets, direct credentials, object-style app integrations |

WEKAFS/POSIX is the primary high-performance workload data path and the first implementation target. S3 is capability-gated for WEKA until S3 protocol hosts/containers are available; enable and validate it later when we need bucket/object semantics, external S3 clients, or app integrations that expect S3.

Operationally this means:

  • GPUaaS storage objects compile into project/workload mount intent.
  • Runtime delivery is a WEKAFS mount through the WEKA client/CSI path or a host-prepared mount exposed to the workload, depending on the final infra design.
  • The first WEKA production provider profile should advertise S3 as unavailable until provider health reports active=true and at least one S3 host.
  • WEKA's DPDK-backed data path is an infrastructure concern. UI/read models should show provider-neutral capability hints such as performance, multi-writer, and same-rack preferred, not raw DPDK configuration.
  • S3/IAM/STS is part of the WEKA integration for bucket/object workflows, but it is independent from POSIX-only workload mounts.
  • Access enforcement for WEKAFS mounts should be owned by GPUaaS placement, mount generation, project grants, service-account/workload identity, and filesystem/prefix layout. Provider S3 policies are not the primary enforcement boundary for POSIX mounts.
  • Access enforcement for S3 buckets should use GPUaaS grants compiled into WEKA S3 IAM/session/bucket policy, with short-lived credentials for humans and scoped service accounts for workloads/automation.
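
The S3 capability gate described above (advertise S3 only when provider health reports active=true and at least one S3 host) reduces to a small check. A sketch with hypothetical names; the real health shape comes from control-plane reconciliation:

```typescript
// Hypothetical provider-health shape for the first WEKA profile.
interface WekaHealth {
  active: boolean;
  s3HostCount: number;
}

// WEKAFS/POSIX is the always-on surface for a healthy WEKA profile;
// S3 is capability-gated until at least one S3 protocol host is up.
function advertisedProtocols(health: WekaHealth): string[] {
  const protocols = ["wekafs", "posix"];
  if (health.active && health.s3HostCount > 0) protocols.push("s3");
  return protocols;
}

// Before S3 hosts exist, only the POSIX surface is advertised.
const early = advertisedProtocols({ active: true, s3HostCount: 0 }); // ["wekafs", "posix"]
const later = advertisedProtocols({ active: true, s3HostCount: 2 }); // ["wekafs", "posix", "s3"]
```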

Ownership

Storage/Network owns this model because correctness depends on fabric topology, mount semantics, KMS, quota enforcement, and provider operations. Frontend can render these fields generically; backend/storage infra owns their interpretation.