Point-in-time gap analyses¶

Designed

Source docs: Allocation_Experience_Gaps_v1.md · App_Platform_Gap_Tracker_v1.md · App_Platform_Clustered_App_Gap_Table_v1.md · Slurm_Product_Workflow_And_Gap_Assessment_v1.md · Login_UX_and_Identity_Provider_Gap_v1.md

Each gap analysis captures a focused comparison: what we have today versus what a fully usable product needs. Each is time-stamped (most as of April 2026) and feeds the backlog. Some gaps are closed; many are tracked as open requirements.

Overview¶

flowchart LR
    classDef domain fill:#e3f2fd,stroke:#1565c0
    classDef gap    fill:#fff3cd,stroke:#332701

    A[Allocation experience]:::domain --> AG[Raw user-facing gaps]:::gap
    B[App platform]:::domain --> BG[Gap tracker]:::gap
    C[Clustered apps]:::domain --> CG[Gap table]:::gap
    D[Slurm workflow]:::domain --> DG[Gap assessment]:::gap
    E[Login / IdP]:::domain --> EG[Gap analysis]:::gap

1. Allocation experience gaps¶

Captures the major user-facing gaps in the raw allocation experience for GPU users, and separates immediate compute-allocation needs from higher-level app platform work.

flowchart TB
    classDef now  fill:#d1e7dd,stroke:#0a3622
    classDef next fill:#fff3cd,stroke:#332701

    Now[What the user gets today]:::now --> N1[Create allocation]:::now
    Now --> N2[Browser terminal]:::now
    Now --> N3[SSH key install]:::now
    Now --> N4[Release]:::now
    Now --> N5[Usage metering]:::now

    Next[Where the gaps live]:::next --> X1[Restart-in-place workflow]:::next
    Next --> X2[Resize / reshape]:::next
    Next --> X3[Allocation timeline detail]:::next
    Next --> X4[Storage attach UX inline]:::next
    Next --> X5[SSH grant model UX]:::next
    Next --> X6[Failure-reason surfacing]:::next

What came out as requirements¶

Allocation timeline as a first-class read model (drove Allocation_Provisioning_Task_Timeline_v1.md).
Restart-in-place modeled in Allocation_Restart_Model_v1.md.
Storage attachments surfaced inline in launch wizard (Allocation_Storage_Model_v1.md).
SSH grants distinct from raw allocation membership (Allocation_Project_SSH_Access_Grants_v1.md, Allocation_Project_SSH_Access_v1.md).
release_failed UX: surfaces reason, confirms billing stopped, exposes retry release action.
Stable sequencing — these gaps are sequenced explicitly before higher-level app platform work.
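
The release_failed requirement above can be sketched as a small view model. This is a minimal illustration, not the product's implementation: the `ReleaseFailedView` class, the `build_release_failed_view` function, and every field name are hypothetical, and only the `release_failed` state name comes from the source docs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReleaseFailedView:
    """What the UI could show for an allocation in release_failed (hypothetical shape)."""
    reason: str              # failure reason surfaced to the user
    billing_stopped: bool    # explicit confirmation that billing has stopped
    can_retry_release: bool  # whether the retry-release action is exposed


def build_release_failed_view(state: str,
                              failure_reason: Optional[str],
                              billing_stopped: bool) -> Optional[ReleaseFailedView]:
    # Only the release_failed state gets this view; every other state returns None.
    if state != "release_failed":
        return None
    return ReleaseFailedView(
        reason=failure_reason or "unknown (no reason recorded)",
        billing_stopped=billing_stopped,
        can_retry_release=True,  # assumption: retry is always offered in this state
    )
```

Keeping the three requirements as explicit fields makes it hard for a UI to render the failure without also stating the billing status and offering the retry action.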

Source: Allocation_Experience_Gaps_v1.md.

2. App platform gap tracker¶

Turns the app-platform builder and quickstart docs into explicit platform gaps that can be implemented in sequence.

flowchart TB
    classDef done  fill:#d1e7dd,stroke:#0a3622
    classDef partial fill:#fff3cd,stroke:#332701
    classDef todo  fill:#f8d7da,stroke:#42101e

    A[Areas tracked]
    A --> A1[App manifest registration]:::done
    A --> A2[OCI registry baseline]:::done
    A --> A3[Artifact trust + promotion]:::done
    A --> A4[Launchable OCI workload profile]:::done
    A --> A5[Non-OCI artifact lifecycle]:::partial
    A --> A6[App-runtime billing model]:::partial
    A --> A7[App-runtime metering producer]:::todo
    A --> A8[Tenant-shared runtime API]:::partial
    A --> A9[Embedded UI gateway implementation]:::todo
    A --> A10[Compose-as-platformization framework]:::todo

Status (per the tracker as of April 14, 2026)¶

The tracker notes:

Compose slices now work end to end. The remaining gap is converting those reference slices into a repeatable platformization framework for more app teams.

What came out as requirements¶

App manifest with requires_capacity_shape declarations.
OCI registry baseline + artifact trust/promotion model.
Launchable OCI workload profile contract (JSON schema).
App-runtime billing alignment with the allocation ledger — not double-charging.
App-runtime metering producer — the missing piece between app-runtime and the ledger.
Tenant-shared runtime API direction — independent of slice/baremetal.
External app team integration guide — the bar for "ready for independent app teams."
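
As a rough sketch of how a `requires_capacity_shape` declaration might be checked before launch: only the `requires_capacity_shape` name comes from the tracker. The `CapacityShape` fields (`gpu_count`, `gpu_memory_gb`), the `AppManifest` class, and `can_launch` are assumptions made for illustration, and the real shape vocabulary may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacityShape:
    # Hypothetical fields; the real shape vocabulary is defined by the manifest spec.
    gpu_count: int
    gpu_memory_gb: int


@dataclass(frozen=True)
class AppManifest:
    name: str
    requires_capacity_shape: CapacityShape


def can_launch(manifest: AppManifest, allocation_shape: CapacityShape) -> bool:
    """Return True when the allocation meets or exceeds every declared requirement."""
    need = manifest.requires_capacity_shape
    return (allocation_shape.gpu_count >= need.gpu_count
            and allocation_shape.gpu_memory_gb >= need.gpu_memory_gb)
```

For example, a manifest that requires two GPUs with 40 GB each would be launchable on a four-GPU, 80 GB allocation and rejected on a single-GPU one.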

Source: App_Platform_Gap_Tracker_v1.md.

3. Clustered app gap table¶

For apps that span multiple allocations (e.g. multi-node Slurm, multi-node Kubernetes), the gap is bigger: networking, identity, and lifecycle all have to span allocation boundaries.

flowchart LR
    classDef single fill:#d1e7dd,stroke:#0a3622
    classDef multi  fill:#fff3cd,stroke:#332701

    SA[Single-allocation apps<br/>Jupyter / vLLM / single-node Slurm /<br/>self-managed RKE2]:::single
    MA[Multi-allocation apps<br/>multi-node Slurm /<br/>multi-node Kubernetes /<br/>distributed training]:::multi

    SA -->|GA path| OK[Compatible today]:::single
    MA -->|requires| G1[Cross-allocation networking]:::multi
    MA -->|requires| G2[Cross-allocation identity]:::multi
    MA -->|requires| G3[Cross-allocation lifecycle<br/>add/remove members]:::multi
    MA -->|requires| G4[Cross-allocation billing<br/>per-app vs per-allocation]:::multi

What came out as requirements¶

Member-level operations on app instances (add worker, remove worker, recover worker) — already shipped for Slurm.
Two-allocation Slurm cluster as the proof case (controller in one allocation, worker in another).
Cross-allocation networking remains a hard gap until slice networking supports clusters.
Multi-node Kubernetes apps deferred until slice networking supports clusters.
Project-local networks reserved in the data model (network_id, network_policy_id, project_network_id).
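
The member-level operations and reserved network fields above could be modeled roughly as follows. This is a sketch under assumptions: the three reserved field names come from the gap table, while the `Member` and `AppInstance` classes, the state strings, and the method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Member:
    role: str               # "controller" or "worker"
    allocation_id: str
    state: str = "running"  # illustrative states: running, bootstrap_failed, removed


@dataclass
class AppInstance:
    name: str
    members: List[Member] = field(default_factory=list)
    # Reserved in the data model; left unset until slice networking supports clusters.
    network_id: Optional[str] = None
    network_policy_id: Optional[str] = None
    project_network_id: Optional[str] = None

    def _worker(self, allocation_id: str) -> Member:
        # Find the live (not removed) worker on the given allocation.
        for m in self.members:
            if (m.role == "worker" and m.allocation_id == allocation_id
                    and m.state != "removed"):
                return m
        raise KeyError(f"no live worker on allocation {allocation_id}")

    def add_worker(self, allocation_id: str) -> Member:
        member = Member(role="worker", allocation_id=allocation_id)
        self.members.append(member)
        return member

    def remove_worker(self, allocation_id: str) -> None:
        # Mark as removed instead of deleting, so the record stays auditable.
        self._worker(allocation_id).state = "removed"

    def recover_worker(self, allocation_id: str) -> None:
        worker = self._worker(allocation_id)
        if worker.state != "bootstrap_failed":
            raise ValueError("only bootstrap-failed workers can be recovered")
        worker.state = "running"

    def spans_allocations(self) -> bool:
        live = {m.allocation_id for m in self.members if m.state != "removed"}
        return len(live) > 1
```

The two-allocation proof case corresponds to a controller member on one allocation plus `add_worker` on a second, at which point `spans_allocations()` becomes true.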

Source: App_Platform_Clustered_App_Gap_Table_v1.md, Clustered_App_Model_v1.md.

4. Slurm product workflow & gap assessment¶

The Slurm path proves feasibility but still contains shortcut behavior. This gap assessment is the review gate before declaring the app platform ready for independent app teams.

flowchart TB
    classDef ok    fill:#d1e7dd,stroke:#0a3622
    classDef gap   fill:#fff3cd,stroke:#332701

    Work[Working today]:::ok --> W1[Deploy through app catalog]:::ok
    Work --> W2[Controller + worker on same allocation]:::ok
    Work --> W3[Add worker on second allocation]:::ok
    Work --> W4[Remove + re-add worker via member ops]:::ok
    Work --> W5[Controller/worker members + events visible]:::ok
    Work --> W6["Native srun / sinfo / sbatch after bootstrap"]:::ok
    Work --> W7[Recover bootstrap-failed worker]:::ok
    Work --> W8[Stable running/healthy after bootstrap]:::ok

    Open[Still open]:::gap --> O1[Honest decommission runtime teardown<br/>not just metadata cleanup]:::gap
    Open --> O2["Harden sbatch accounting<br/>no confusing InvalidAccount states"]:::gap
    Open --> O3[Clean up PMIx startup warnings]:::gap
    Open --> O4[Honest package/catalog version labels<br/>app version vs distro Slurm version]:::gap
    Open --> O5[Automated platform-control smoke tests<br/>deploy + add + remove]:::gap

What came out as requirements¶

App platform readiness gate — Slurm is the canary. Until decommission, accounting, version-label, and smoke-test gaps close, the app platform is not declared GA-ready for external teams.
Member operation history as a first-class read model (event timeline per app instance).
Bootstrap completion signal — controller no longer reports slurm_bootstrap_completed as progressing.
Tombstone behavior on failed workers — auditable record kept.
Slurm tenant scope semantics — Slurm_Tenant_Scope_Semantics_v1.md codifies what's tenant-isolated vs cluster-wide.
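
The readiness gate above reduces to a simple check over the four gaps it names (decommission, accounting, version labels, smoke tests). The sketch below is illustrative only: the gap keys, the `"closed"` status string, and both function names are assumptions. It deliberately leaves out the PMIx warning cleanup, which the gate statement does not list as blocking.

```python
from typing import Dict, List

# The four gaps named in the readiness-gate requirement (key names are hypothetical).
GATE_GAPS = ("decommission", "accounting", "version_labels", "smoke_tests")


def blocking_gaps(status: Dict[str, str]) -> List[str]:
    """Return the gate gaps that are not yet closed; a missing entry counts as open."""
    return [gap for gap in GATE_GAPS if status.get(gap) != "closed"]


def ga_ready(status: Dict[str, str]) -> bool:
    """The platform is GA-ready for external teams only when no gate gap is open."""
    return not blocking_gaps(status)
```

Treating a missing status as open keeps the gate conservative: a gap has to be explicitly closed before it stops blocking.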

Source: Slurm_Product_Workflow_And_Gap_Assessment_v1.md, Slurm_Tenant_Scope_Semantics_v1.md.

5. Login UX & identity provider gap¶

Captures the gap between the current login UX and the enterprise SSO expectation.

flowchart LR
    classDef now  fill:#d1e7dd,stroke:#0a3622
    classDef gap  fill:#fff3cd,stroke:#332701

    Now[Today]:::now --> N1[Keycloak realm with<br/>password / dev users]:::now
    Now --> N2[Personal account path]:::now
    Now --> N3[Work account / SSO path]:::now

    Gap[Gap]:::gap --> G1[Enterprise IdP federation]:::gap
    Gap --> G2[Brokered identity linking + dedup]:::gap
    Gap --> G3[Tenant slug hint on login]:::gap
    Gap --> G4[Account-type discovery]:::gap

What came out as requirements¶

Two login paths explicit in UX: work / personal, tab-style selection.
Brokered identity linking and dedup documented in Brokered_Identity_Linking_and_Dedup_v1.md.
Tenant federation SSO model designed in Tenant_Federation_SSO_Model.md.
Tenant slug / work-email hint is advisory, not required.
Inline auth error on failed login.
Session persistence + clear logout required.
Enterprise federation runbook for incidents.
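
A small sketch of the "two explicit paths, advisory hint" requirements above. The function name `start_login`, the return shape, and the choice to drop the hint on the personal path are all assumptions for illustration; the source only states that the paths are work and personal and that the tenant slug or work-email hint is advisory, not required.

```python
from typing import Dict, Optional

LOGIN_PATHS = ("work", "personal")


def start_login(selected_tab: str, hint: Optional[str] = None) -> Dict[str, Optional[str]]:
    """Pick a login path; the tenant slug / work-email hint is optional."""
    if selected_tab not in LOGIN_PATHS:
        raise ValueError(f"unknown login path: {selected_tab!r}")
    # Normalize the hint; an empty or missing hint never blocks login.
    cleaned = (hint or "").strip().lower() or None
    if selected_tab == "personal":
        cleaned = None  # assumption: hints only apply to the work / SSO path
    return {"path": selected_tab, "tenant_hint": cleaned}
```

Because the hint is advisory, `start_login("work")` with no hint succeeds just as `start_login("work", "acme")` does.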

Source: Login_UX_and_Identity_Provider_Gap_v1.md, Brokered_Identity_Linking_and_Dedup_v1.md, Tenant_Federation_SSO_Model.md.

Cross-cutting takeaways¶

mindmap
  root((Gap-analysis themes))
    Sequencing
      Lower-level gaps before higher-level
      Allocation UX before app platform
      Slurm gate before external teams
    Don't sell what isn't honest
      Decommission must truly teardown
      Accounting must not look broken
      Version labels must match daemon
    Reserve model space
      project_network_id reserved
      Clustered app model documented
      Multi-tenant scope semantics codified
    Capability boundaries
      Single-allocation apps GA
      Multi-allocation apps deferred
      Network gap blocks clusters

Where to look next¶

App platform trail
Slice trail
Personas & journeys
External review — the independent reviewer's read on these gaps
Source docs:
Allocation_Experience_Gaps_v1.md
App_Platform_Gap_Tracker_v1.md
App_Platform_Clustered_App_Gap_Table_v1.md
Slurm_Product_Workflow_And_Gap_Assessment_v1.md
Login_UX_and_Identity_Provider_Gap_v1.md
