Point-in-time gap analyses
Designed
Allocation_Experience_Gaps_v1.md · App_Platform_Gap_Tracker_v1.md · App_Platform_Clustered_App_Gap_Table_v1.md · Slurm_Product_Workflow_And_Gap_Assessment_v1.md · Login_UX_and_Identity_Provider_Gap_v1.md
Each gap analysis captures a focused comparison: what we have today vs what a fully usable product needs. They are time-stamped (most as of April 2026) and feed the backlog. Some gaps are closed; many are tracked as open requirements.
Overview
flowchart LR
classDef domain fill:#e3f2fd,stroke:#1565c0
classDef gap fill:#fff3cd,stroke:#332701
A[Allocation experience]:::domain --> AG[Raw user-facing gaps]:::gap
B[App platform]:::domain --> BG[Gap tracker]:::gap
C[Clustered apps]:::domain --> CG[Gap table]:::gap
D[Slurm workflow]:::domain --> DG[Gap assessment]:::gap
E[Login / IdP]:::domain --> EG[Gap analysis]:::gap
1. Allocation experience gaps
Captures the major user-facing gaps in the raw allocation experience for GPU users, and separates immediate compute-allocation needs from higher-level app platform work.
flowchart TB
classDef now fill:#d1e7dd,stroke:#0a3622
classDef next fill:#fff3cd,stroke:#332701
Now[What the user gets today]:::now --> N1[Create allocation]:::now
Now --> N2[Browser terminal]:::now
Now --> N3[SSH key install]:::now
Now --> N4[Release]:::now
Now --> N5[Usage metering]:::now
Next[Where the gaps live]:::next --> X1[Restart-in-place workflow]:::next
Next --> X2[Resize / reshape]:::next
Next --> X3[Allocation timeline detail]:::next
Next --> X4[Storage attach UX inline]:::next
Next --> X5[SSH grant model UX]:::next
Next --> X6[Failure-reason surfacing]:::next
What came out as requirements
- Allocation timeline as a first-class read model (drove Allocation_Provisioning_Task_Timeline_v1.md).
- Restart-in-place modeled in Allocation_Restart_Model_v1.md.
- Storage attachments surfaced inline in the launch wizard (Allocation_Storage_Model_v1.md).
- SSH grants distinct from raw allocation membership (Allocation_Project_SSH_Access_Grants_v1.md, Allocation_Project_SSH_Access_v1.md).
- release_failed UX: surfaces the reason, confirms billing stopped, exposes a retry-release action.
- Stable sequencing: these gaps are sequenced explicitly before higher-level app platform work.
Source: Allocation_Experience_Gaps_v1.md.
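The release_failed requirement above names three obligations: surface the reason, confirm billing stopped, and offer a retry. A minimal sketch of how a view model could carry them follows. Only the `release_failed` state and the three obligations come from the gap analysis; `ReleaseFailedView`, `build_release_failed_view`, the `retry_release` action name, and every parameter are hypothetical and not the platform's actual API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ReleaseFailedView:
    """Hypothetical shape of what the UI shows for a release_failed allocation."""
    reason: str                # surfaced failure reason
    billing_stopped: bool      # explicit confirmation that metering has ended
    actions: Tuple[str, ...]   # actions offered to the user


def build_release_failed_view(
    status: str,
    failure_reason: Optional[str],
    billing_stopped: bool,
) -> Optional[ReleaseFailedView]:
    """Return a view for release_failed allocations, or None for any other status."""
    if status != "release_failed":
        return None
    # Never show an empty reason: fall back to an explicit placeholder.
    reason = failure_reason or "No failure reason was reported"
    return ReleaseFailedView(
        reason=reason,
        billing_stopped=billing_stopped,
        actions=("retry_release",),  # assumed action identifier
    )
```

Keeping the billing confirmation as its own field, rather than folding it into the reason text, lets the UI state it explicitly even when the failure reason is missing.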
2. App platform gap tracker
Turns the app-platform builder and quickstart docs into explicit platform gaps that can be implemented in sequence.
flowchart TB
classDef done fill:#d1e7dd,stroke:#0a3622
classDef partial fill:#fff3cd,stroke:#332701
classDef todo fill:#f8d7da,stroke:#42101e
A[Areas tracked]
A --> A1[App manifest registration]:::done
A --> A2[OCI registry baseline]:::done
A --> A3[Artifact trust + promotion]:::done
A --> A4[Launchable OCI workload profile]:::done
A --> A5[Non-OCI artifact lifecycle]:::partial
A --> A6[App-runtime billing model]:::partial
A --> A7[App-runtime metering producer]:::todo
A --> A8[Tenant-shared runtime API]:::partial
A --> A9[Embedded UI gateway implementation]:::todo
A --> A10[Compose-as-platformization framework]:::todo
Status (per the tracker as of April 14, 2026)
The tracker notes:
Compose slices now work end to end. The remaining gap is converting those reference slices into a repeatable platformization framework for more app teams.
What came out as requirements
- App manifest with requires_capacity_shape declarations.
- OCI registry baseline + artifact trust/promotion model.
- Launchable OCI workload profile contract (JSON schema).
- App-runtime billing alignment with the allocation ledger — not double-charging.
- App-runtime metering producer — the missing piece between app-runtime and the ledger.
- Tenant-shared runtime API direction — independent of slice/baremetal.
- External app team integration guide — the bar for "ready for independent app teams."
Source: App_Platform_Gap_Tracker_v1.md.
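As an illustration of the manifest requirement above, the sketch below checks that a manifest declares at least one capacity shape. The field name `requires_capacity_shape` is taken from the tracker; its exact type is not stated there, so treating it as a non-empty list of strings is an assumption, and `check_manifest` and the example shape name are hypothetical.

```python
from typing import Any, Dict, List


def check_manifest(manifest: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the manifest passes this sketch."""
    problems: List[str] = []
    shapes = manifest.get("requires_capacity_shape")
    if shapes is None:
        problems.append("missing requires_capacity_shape")
    elif not isinstance(shapes, list) or not shapes:
        # Assumption: declarations are a non-empty list.
        problems.append("requires_capacity_shape must be a non-empty list")
    else:
        for i, shape in enumerate(shapes):
            # Assumption: each declaration is a non-blank shape name.
            if not isinstance(shape, str) or not shape.strip():
                problems.append(
                    f"requires_capacity_shape[{i}] must be a non-empty string"
                )
    return problems


if __name__ == "__main__":
    # "gpu-1x" is a made-up shape name for illustration only.
    print(check_manifest({"name": "demo-app", "requires_capacity_shape": ["gpu-1x"]}))
```

Returning a list of problems instead of raising on the first one lets a registration flow report every issue in a single pass.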
3. Clustered app gap table
For apps that span multiple allocations (e.g. multi-node Slurm, multi-node Kubernetes), the gap is larger: networking, identity, lifecycle, and billing all have to span allocation boundaries.
flowchart LR
classDef single fill:#d1e7dd,stroke:#0a3622
classDef multi fill:#fff3cd,stroke:#332701
SA[Single-allocation apps<br/>Jupyter / vLLM / single-node Slurm /<br/>self-managed RKE2]:::single
MA[Multi-allocation apps<br/>multi-node Slurm /<br/>multi-node Kubernetes /<br/>distributed training]:::multi
SA -->|GA path| OK[Compatible today]:::single
MA -->|requires| G1[Cross-allocation networking]:::multi
MA -->|requires| G2[Cross-allocation identity]:::multi
MA -->|requires| G3[Cross-allocation lifecycle<br/>add/remove members]:::multi
MA -->|requires| G4[Cross-allocation billing<br/>per-app vs per-allocation]:::multi
What came out as requirements
- Member-level operations on app instances (add worker, remove worker, recover worker) — already shipped for Slurm.
- Two-allocation Slurm cluster as the proof case (controller in one allocation, worker in another).
- Cross-allocation networking remains a hard gap until slice networking supports clusters.
- Multi-node Kubernetes apps deferred until slice networking supports clusters.
- Project-local networks reserved in the data model (network_id, network_policy_id, project_network_id).
Source: App_Platform_Clustered_App_Gap_Table_v1.md, Clustered_App_Model_v1.md.
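The single-allocation versus multi-allocation distinction, and the reserved network fields, can be sketched as a small data model. The three field names `network_id`, `network_policy_id`, and `project_network_id` come from the gap table; the `Member` and `AppInstance` classes, their other attributes, and the choice to leave the reserved fields unset by default are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Member:
    role: str            # e.g. "controller" or "worker"
    allocation_id: str   # allocation that hosts this member


@dataclass
class AppInstance:
    name: str
    members: List[Member] = field(default_factory=list)
    # Reserved fields named in the source; assumed to stay unset for now.
    network_id: Optional[str] = None
    network_policy_id: Optional[str] = None
    project_network_id: Optional[str] = None

    def allocation_ids(self) -> Set[str]:
        """Distinct allocations that host at least one member."""
        return {m.allocation_id for m in self.members}

    def is_multi_allocation(self) -> bool:
        """True when members live on more than one allocation."""
        return len(self.allocation_ids()) > 1


# The proof case from the list above: controller in one allocation, worker in another.
cluster = AppInstance(
    name="slurm-demo",
    members=[Member("controller", "alloc-a"), Member("worker", "alloc-b")],
)
```

Here `cluster.is_multi_allocation()` is true, which is the case that depends on the cross-allocation requirements in the diagram; an instance whose members all share one allocation stays on the single-allocation path.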
4. Slurm product workflow & gap assessment
The Slurm path proves feasibility but still contains shortcut behavior. This gap assessment is the review gate before declaring the app platform ready for independent app teams.
flowchart TB
classDef ok fill:#d1e7dd,stroke:#0a3622
classDef gap fill:#fff3cd,stroke:#332701
Work[Working today]:::ok --> W1[Deploy through app catalog]:::ok
Work --> W2[Controller + worker on same allocation]:::ok
Work --> W3[Add worker on second allocation]:::ok
Work --> W4[Remove + re-add worker via member ops]:::ok
Work --> W5[Controller/worker members + events visible]:::ok
Work --> W6["Native srun / sinfo / sbatch after bootstrap"]:::ok
Work --> W7[Recover bootstrap-failed worker]:::ok
Work --> W8[Stable running/healthy after bootstrap]:::ok
Open[Still open]:::gap --> O1[Honest decommission runtime teardown<br/>not just metadata cleanup]:::gap
Open --> O2["Harden sbatch accounting<br/>no confusing InvalidAccount states"]:::gap
Open --> O3[Clean up PMIx startup warnings]:::gap
Open --> O4[Honest package/catalog version labels<br/>app version vs distro Slurm version]:::gap
Open --> O5[Automated platform-control smoke tests<br/>deploy + add + remove]:::gap
What came out as requirements
- App platform readiness gate — Slurm is the canary. Until decommission, accounting, version-label, and smoke-test gaps close, the app platform is not declared GA-ready for external teams.
- Member operation history as a first-class read model (event timeline per app instance).
- Bootstrap completion signal: controller no longer reports slurm_bootstrap_completed as progressing.
- Tombstone behavior on failed workers: auditable record kept.
- Slurm tenant scope semantics: Slurm_Tenant_Scope_Semantics_v1.md codifies what's tenant-isolated vs cluster-wide.
Source: Slurm_Product_Workflow_And_Gap_Assessment_v1.md, Slurm_Tenant_Scope_Semantics_v1.md.
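The readiness gate above is an all-or-nothing check over the four named gap areas (decommission, accounting, version labels, smoke tests). A minimal sketch, assuming each gap is tracked by a simple string identifier; the identifiers and function names are illustrative and do not come from the assessment.

```python
from typing import Iterable, List

# The four gap areas the source names as gating GA readiness.
# The string identifiers themselves are hypothetical.
GATE_GAPS = ("decommission", "accounting", "version_label", "smoke_test")


def blocking_gaps(closed: Iterable[str]) -> List[str]:
    """Return gate gaps that are not yet closed, in gate order."""
    closed_set = set(closed)
    return [gap for gap in GATE_GAPS if gap not in closed_set]


def ga_ready(closed: Iterable[str]) -> bool:
    """The platform is GA-ready only when no gate gap is still open."""
    return not blocking_gaps(closed)
```

Closing a gap that is not part of the gate has no effect on the result, so the gate stays tied to the explicitly named items.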
5. Login UX & identity provider gap
Captures the gap between current login UX and the enterprise SSO expectation.
flowchart LR
classDef now fill:#d1e7dd,stroke:#0a3622
classDef gap fill:#fff3cd,stroke:#332701
Now[Today]:::now --> N1[Keycloak realm with<br/>password / dev users]:::now
Now --> N2[Personal account path]:::now
Now --> N3[Work account / SSO path]:::now
Gap[Gap]:::gap --> G1[Enterprise IdP federation]:::gap
Gap --> G2[Brokered identity linking + dedup]:::gap
Gap --> G3[Tenant slug hint on login]:::gap
Gap --> G4[Account-type discovery]:::gap
What came out as requirements
- Two login paths explicit in UX: work / personal, tab-style selection.
- Brokered identity linking and dedup documented in Brokered_Identity_Linking_and_Dedup_v1.md.
- Tenant federation SSO model designed in Tenant_Federation_SSO_Model.md.
- Tenant slug / work-email hint is advisory, not required.
- Inline auth error on failed login.
- Session persistence + clear logout required.
- Enterprise federation runbook for incidents.
Source: Login_UX_and_Identity_Provider_Gap_v1.md, Brokered_Identity_Linking_and_Dedup_v1.md, Tenant_Federation_SSO_Model.md.
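One way to read "advisory, not required" for the tenant slug / work-email hint: the hint may produce a suggestion, but a missing or unrecognized hint never blocks login. The sketch below is a hypothetical illustration of that reading; `suggest_tenant`, the domain-to-slug mapping, and the example values are assumptions, not the documented login flow.

```python
from typing import Dict, Optional


def suggest_tenant(
    hint: Optional[str],
    tenants_by_domain: Dict[str, str],
) -> Optional[str]:
    """Return a tenant slug suggested by the hint, or None.

    The hint is advisory: bad or missing input yields None, never an error,
    so the caller can continue with the normal login path.
    """
    if not hint:
        return None
    value = hint.strip().lower()
    if not value:
        return None
    if "@" in value:
        # Treat the hint as a work email and look up its domain.
        domain = value.rsplit("@", 1)[1]
        return tenants_by_domain.get(domain)
    # Otherwise treat the hint as a slug and accept it only if it is known.
    return value if value in tenants_by_domain.values() else None


# Hypothetical mapping of email domains to tenant slugs.
TENANTS = {"example.com": "example"}
```

With this mapping, `suggest_tenant("ada@example.com", TENANTS)` and `suggest_tenant("example", TENANTS)` both suggest the same tenant, while an unknown or empty hint simply returns nothing.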
Cross-cutting takeaways
mindmap
  root((Gap-analysis themes))
    Sequencing
      Lower-level gaps before higher-level
      Allocation UX before app platform
      Slurm gate before external teams
    Don't sell what isn't honest
      Decommission must truly tear down
      Accounting must not look broken
      Version labels must match daemon
    Reserve model space
      project_network_id reserved
      Clustered app model documented
      Multi-tenant scope semantics codified
    Capability boundaries
      Single-allocation apps GA
      Multi-allocation apps deferred
      Network gap blocks clusters
Where to look next
- App platform trail
- Slice trail
- Personas & journeys
- External review — the independent reviewer's read on these gaps
- Source docs: Allocation_Experience_Gaps_v1.md · App_Platform_Gap_Tracker_v1.md · App_Platform_Clustered_App_Gap_Table_v1.md · Slurm_Product_Workflow_And_Gap_Assessment_v1.md · Login_UX_and_Identity_Provider_Gap_v1.md