
Gap trackers

Designed Decided

Source: doc/architecture/App_Platform_Gap_Tracker_v1.md · GPU_Slice_Implementation_Checklist_v1.md · App_Platform_Clustered_App_Gap_Table_v1.md · V3_Migration_Execution_Tracker_v1.md · V3_V1_Workflow_Parity_Audit_v1.md · Allocation_Experience_Gaps_v1.md

Per-area gap inventories. Each tracker is owned by an area lead, lists what's missing in that area, and orders the gaps for execution.

Tracker map

flowchart TB
    classDef tr fill:#fff3cd,stroke:#332701

    APP[App Platform<br/>Gap Tracker]:::tr
    SLICE[GPU Slice<br/>Implementation Checklist]:::tr
    CLUST[Clustered App<br/>Gap Table]:::tr
    V3[V3 Migration<br/>Execution Tracker]:::tr
    AE[Allocation Experience<br/>Gaps]:::tr
    LOGIN[Login UX & IdP<br/>Gap]:::tr
    SLURM[Slurm Product Workflow<br/>Gap Assessment]:::tr

    APP -.tied to.-> CLUST
    SLICE -.tied to.-> AE
    V3 -.tied to.-> APP

1. App Platform Gap Tracker

Designed

flowchart TB
    classDef done fill:#d1e7dd,stroke:#0a3622
    classDef partial fill:#fff3cd,stroke:#332701
    classDef todo fill:#f8d7da,stroke:#42101e

    G1[App manifest registration]:::done
    G2[OCI registry baseline]:::done
    G3[Artifact trust + promotion]:::done
    G4[Launchable OCI workload profile contract]:::done
    G5[App-runtime lifecycle]:::done
    G6[Member operations<br/>add/remove/recover]:::done
    G7[Non-OCI artifact lifecycle]:::partial
    G8[App-runtime billing model]:::partial
    G9[Tenant-shared runtime API direction]:::partial
    G10[App-runtime metering producer]:::todo
    G11[Embedded UI gateway implementation]:::todo
    G12[Compose-as-platformization framework]:::todo

    G1 --> G2 --> G3 --> G4 --> G5 --> G6
    G6 --> G7
    G6 --> G8
    G6 --> G9
    G6 --> G10
    G6 --> G11
    G6 --> G12

Tracker's own statement (as of April 2026):

Compose slices now work end to end. The remaining gap is converting those reference slices into a repeatable platformization framework for more app teams.

| Status | Coverage |
|---|---|
| Implemented | Manifest, OCI registry, trust+promotion, launchable profile, lifecycle, member ops |
| Designed (partial) | Non-OCI artifact lifecycle, app-runtime billing model, tenant-shared runtime API direction |
| Designed (todo) | App-runtime metering producer, embedded UI gateway implementation, platformization framework |

→ Source: App_Platform_Gap_Tracker_v1.md
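
The ordering in the diagram can be read as a small dependency graph. The sketch below is illustrative only: the `Gap` record and the `ready_gaps` helper are hypothetical names, not part of the tracker, and the statuses and edges are transcribed from the diagram above. It lists the open gaps whose prerequisites are already done.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Gap:
    gap_id: str
    title: str
    status: str  # "done", "partial", or "todo", as in the diagram legend
    depends_on: Tuple[str, ...] = ()


# Statuses and edges transcribed from the diagram; the structure is hypothetical.
GAPS: List[Gap] = [
    Gap("G1", "App manifest registration", "done"),
    Gap("G2", "OCI registry baseline", "done", ("G1",)),
    Gap("G3", "Artifact trust + promotion", "done", ("G2",)),
    Gap("G4", "Launchable OCI workload profile contract", "done", ("G3",)),
    Gap("G5", "App-runtime lifecycle", "done", ("G4",)),
    Gap("G6", "Member operations", "done", ("G5",)),
    Gap("G7", "Non-OCI artifact lifecycle", "partial", ("G6",)),
    Gap("G8", "App-runtime billing model", "partial", ("G6",)),
    Gap("G9", "Tenant-shared runtime API direction", "partial", ("G6",)),
    Gap("G10", "App-runtime metering producer", "todo", ("G6",)),
    Gap("G11", "Embedded UI gateway implementation", "todo", ("G6",)),
    Gap("G12", "Compose-as-platformization framework", "todo", ("G6",)),
]


def ready_gaps(gaps: List[Gap]) -> List[Gap]:
    """Return open gaps whose dependencies are all marked done."""
    done = {g.gap_id for g in gaps if g.status == "done"}
    return [
        g for g in gaps
        if g.status != "done" and all(dep in done for dep in g.depends_on)
    ]


if __name__ == "__main__":
    for gap in ready_gaps(GAPS):
        print(f"{gap.gap_id} [{gap.status}] {gap.title}")
```

With the statuses shown, every open gap depends only on G6, so all six are unblocked at once. The tracker's own ordering among them is not captured by this sketch.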

2. GPU Slice Implementation Checklist

Decided

The slice implementation is broken into 7 explicit phases, with a deferred Phase 8 (fractional/shared GPU readiness) tracked alongside them. Phases 1-5 are complete; Phase 6 (networking hardening) and Phase 7 (app compatibility) are in progress.

flowchart LR
    classDef done fill:#d1e7dd,stroke:#0a3622
    classDef inprog fill:#fff3cd,stroke:#332701

    SP1[Phase 1<br/>Contract & Schema]:::done
    SP2[Phase 2<br/>Admin Inventory &<br/>Topology Approval]:::done
    SP3[Phase 3<br/>Claim-Aware<br/>Baremetal Scheduler]:::done
    SP4[Phase 4<br/>Slice Scheduler]:::done
    SP5[Phase 5<br/>Node-Agent<br/>Slice Runtime]:::done
    SP6[Phase 6<br/>Slice Networking]:::inprog
    SP7[Phase 7<br/>App Compatibility]:::inprog
    SP8[Phase 8<br/>Fractional/Shared GPU<br/>readiness — deferred]:::inprog

    SP1 --> SP2 --> SP3 --> SP4 --> SP5 --> SP6 --> SP7
    SP7 -.future.-> SP8

| Phase | Goal | Key gates |
|---|---|---|
| 1 | Contract & schema | capacity_shape, node_resource_slots, allocation_resource_claims, ERD/spec updated |
| 2 | Admin inventory & topology approval | Discovery task, per-slot VF metadata, admin approval API |
| 3 | Claim-aware baremetal scheduler | Move correctness from allocations.node_id to claims |
| 4 | Slice scheduler | Same-node placement, FOR UPDATE SKIP LOCKED, deterministic topology-aware best-fit |
| 5 | Node-agent slice runtime | 17-phase provision, release, wipe verification, vfio-pci binding |
| 6 | Slice networking | BF3 VF → OVS → vNIC, private NAT, public ingress (optional), cross-slice denied by default |
| 7 | App compatibility | Single-node OCI/Jupyter/vLLM on slices; multi-node clusters defer |
| 8 (deferred) | Fractional/shared GPU readiness | MIG/vGPU/MPS reserved; not exposed in v1 |

→ Source: GPU_Slice_Implementation_Checklist_v1.md
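
The Phase 4 gate names same-node placement with a deterministic, topology-aware best-fit. The checklist's actual algorithm is not reproduced here; the following is a minimal sketch under stated assumptions. The `Slot` record, its `numa_domain` field (a stand-in for whatever topology metadata the scheduler uses), and `place_slice` are all hypothetical, and the row locking implied by `FOR UPDATE SKIP LOCKED` is not modelled.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Slot:
    node_id: str
    slot_index: int
    numa_domain: int  # hypothetical stand-in for per-slot topology metadata


def place_slice(free_slots: List[Slot], gpus_needed: int) -> Optional[List[Slot]]:
    """Pick gpus_needed free slots on a single node, or None if no node fits."""
    if gpus_needed <= 0:
        raise ValueError("gpus_needed must be positive")

    by_node: Dict[str, List[Slot]] = {}
    for slot in free_slots:
        by_node.setdefault(slot.node_id, []).append(slot)

    # Best fit: the node that leaves the fewest free slots behind.
    # Ties break on node_id so the result does not depend on input order.
    candidates = [
        (len(slots) - gpus_needed, node_id)
        for node_id, slots in by_node.items()
        if len(slots) >= gpus_needed
    ]
    if not candidates:
        return None
    _, chosen = min(candidates)

    slots = sorted(by_node[chosen], key=lambda s: (s.numa_domain, s.slot_index))
    by_domain: Dict[int, List[Slot]] = {}
    for slot in slots:
        by_domain.setdefault(slot.numa_domain, []).append(slot)

    # Topology preference: keep the slice inside one domain when possible,
    # again choosing the tightest fit.
    fitting = [
        (len(members) - gpus_needed, domain)
        for domain, members in by_domain.items()
        if len(members) >= gpus_needed
    ]
    if fitting:
        _, domain = min(fitting)
        return by_domain[domain][:gpus_needed]
    return slots[:gpus_needed]
```

In a database-backed scheduler the candidate rows would typically be locked before this choice is made, so that two concurrent placements cannot claim the same slot. That step is left out of the sketch.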

3. Clustered App Gap Table

Designed

Multi-allocation apps (multi-node Slurm, multi-node Kubernetes, distributed training) need cross-allocation networking, identity, lifecycle, and billing. These gaps are tracked separately because the network gap is the load-bearing blocker.

flowchart LR
    classDef single fill:#d1e7dd,stroke:#0a3622
    classDef multi fill:#fff3cd,stroke:#332701

    SA[Single-allocation apps<br/>OK today]:::single --> S1[Jupyter / vLLM<br/>single-node Slurm /<br/>self-managed RKE2]:::single
    MA[Multi-allocation apps<br/>blocked]:::multi --> G1[Cross-allocation networking]:::multi
    MA --> G2[Cross-allocation identity]:::multi
    MA --> G3[Cross-allocation lifecycle<br/>member add/remove]:::multi
    MA --> G4[Cross-allocation billing<br/>per-app vs per-allocation]:::multi

Two-allocation Slurm (controller + remote worker) already works today, which proved the member-ops model. Full multi-node clusters wait on slice networking (Phase 6 and later).

→ Source: App_Platform_Clustered_App_Gap_Table_v1.md, Clustered_App_Model_v1.md

4. V3 Migration Execution Tracker

Designed

V3 is the long-term UX/route model. The v1 routes are frozen and kept only as a demo and internal-continuity surface. The tracker manages the cutover from v1 to V3.

flowchart LR
    V1[/api/v1/* frozen<br/>demo + internal continuity/]
    V3M[/v3 design mocks<br/>HTML in ux-mocks/]
    V3P[/v3-prod production routes<br/>shipped from mocks/]

    V1 -.feature-by-feature retire.-> V3P
    V3M -.design source.-> V3P
    V3P -.parity audit.-> V3M
    V3P -.workflow parity audit.-> V1

    classDef v1 fill:#ffebee,stroke:#c62828
    classDef mock fill:#fff3cd,stroke:#332701
    classDef prod fill:#e8f5e9,stroke:#2e7d32
    class V1 v1
    class V3M mock
    class V3P prod

Tracker artifacts:

| Doc | Role |
|---|---|
| V3_Migration_Execution_Tracker_v1.md | Live status per route |
| V3_V1_Workflow_Parity_Audit_v1.md | Capability parity gate |
| V3_V1_Retirement_Guardrails_v1.md | Which v1 routes retire vs stay |
| V3_Cutover_Route_Map_v1.md | Ordered cutover sequence |
| V3_Mock_To_Production_Data_Parity_v1.md | Mock-to-prod fidelity rules |

→ Comparisons surface: V3 redesign status, Internal parity audits
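
One way to picture how the artifacts above combine into a feature-by-feature retirement decision is a per-route gate. This is a sketch only: the `RouteStatus` fields, the `can_retire` rule, and the example paths are assumptions for illustration and are not taken from the tracker documents.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RouteStatus:
    v1_route: str
    v3_route: Optional[str]         # None until a production route has shipped
    workflow_parity: bool           # capability parity audit passed
    data_parity: bool               # mock-to-prod fidelity rules satisfied
    guardrail_allows_retire: bool   # guardrails mark this v1 route as retirable


def can_retire(route: RouteStatus) -> bool:
    """A v1 route retires only when every gate is satisfied (assumed rule)."""
    return (
        route.v3_route is not None
        and route.workflow_parity
        and route.data_parity
        and route.guardrail_allows_retire
    )


def retirement_queue(routes: List[RouteStatus]) -> List[str]:
    """Return the v1 routes that currently pass all gates, in input order."""
    return [r.v1_route for r in routes if can_retire(r)]


# Hypothetical paths, for illustration only.
EXAMPLE_ROUTES = [
    RouteStatus("/api/v1/example-a", "/v3-prod/example-a", True, True, True),
    RouteStatus("/api/v1/example-b", "/v3-prod/example-b", True, False, True),
    RouteStatus("/api/v1/example-c", None, False, False, True),
    RouteStatus("/api/v1/example-d", "/v3-prod/example-d", True, True, False),
]
```

Keeping the guardrail flag separate from the parity flags reflects the split between the parity audit and the retirement guardrails: a route can reach parity and still be kept deliberately.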

5. Allocation Experience Gaps

Designed

Raw user-facing gaps in the allocation experience, sequenced explicitly before higher-level app platform work.

flowchart TB
    classDef now fill:#d1e7dd,stroke:#0a3622
    classDef gap fill:#fff3cd,stroke:#332701

    NOW[Today]:::now --> N1[Create allocation]:::now
    NOW --> N2[Browser terminal]:::now
    NOW --> N3[SSH key install]:::now
    NOW --> N4[Release]:::now
    NOW --> N5[Usage metering]:::now

    GAP[Gaps tracked]:::gap --> X1[Restart-in-place workflow]:::gap
    GAP --> X2[Resize / reshape]:::gap
    GAP --> X3[Allocation timeline detail]:::gap
    GAP --> X4[Storage attach UX inline]:::gap
    GAP --> X5[SSH grant model UX]:::gap
    GAP --> X6[Failure-reason surfacing]:::gap

Several of these have shipped or are shipping (timeline, restart model, storage model). The tracker remains the canonical reference for what user-facing UX maturity looks like.

→ Source: Allocation_Experience_Gaps_v1.md, Allocation_Provisioning_Task_Timeline_v1.md, Allocation_Restart_Model_v1.md

6. Login UX & IdP Gap

Designed

The gap between current login UX and the enterprise SSO expectation.

flowchart LR
    classDef now fill:#d1e7dd,stroke:#0a3622
    classDef gap fill:#fff3cd,stroke:#332701

    NOW[Today]:::now --> N1[Keycloak password +<br/>dev users]:::now
    NOW --> N2[Personal account path]:::now
    NOW --> N3[Work account path]:::now

    GAP[Gap]:::gap --> G1[Enterprise IdP federation]:::gap
    GAP --> G2[Brokered identity linking +<br/>dedup]:::gap
    GAP --> G3[Tenant slug hint on login]:::gap

→ Source: Login_UX_and_Identity_Provider_Gap_v1.md, Brokered_Identity_Linking_and_Dedup_v1.md

7. Slurm Product Workflow Gap Assessment

Designed

The Slurm app proves feasibility but still takes shortcuts, such as a decommission that updates metadata without tearing down the runtime. This assessment is the review gate before declaring the app platform ready for independent app teams.

flowchart TB
    classDef ok fill:#d1e7dd,stroke:#0a3622
    classDef gap fill:#fff3cd,stroke:#332701

    OK[Working today]:::ok --> O1[Deploy through catalog]:::ok
    OK --> O2[Controller + worker on same alloc]:::ok
    OK --> O3[Add worker on second alloc]:::ok
    OK --> O4[srun/sinfo/sbatch native]:::ok
    OK --> O5[Recover bootstrap-failed worker]:::ok

    GAPS[Still open]:::gap --> G1[Honest decommission teardown<br/>not just metadata]:::gap
    GAPS --> G2[sbatch accounting cleanup]:::gap
    GAPS --> G3[PMIx startup warnings]:::gap
    GAPS --> G4[Honest version labelling]:::gap
    GAPS --> G5[Automated smoke tests for<br/>deploy/add/remove]:::gap

→ Source: Slurm_Product_Workflow_And_Gap_Assessment_v1.md
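
The "honest decommission teardown" gap can be expressed as a check that trusts a runtime probe over recorded metadata. The sketch below is hypothetical throughout: `WorkerRecord`, the state strings, and the `runtime_is_running` probe are invented for illustration and do not describe the Slurm app's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class WorkerRecord:
    name: str
    metadata_state: str  # what the control plane recorded, e.g. "decommissioned"


def verify_decommission(
    worker: WorkerRecord,
    runtime_is_running: Callable[[str], bool],
) -> str:
    """Classify a decommission by comparing metadata with the live runtime."""
    if worker.metadata_state != "decommissioned":
        return "not-requested"
    if runtime_is_running(worker.name):
        # Metadata says gone, but the runtime is still up: not an honest teardown.
        return "metadata-only"
    return "torn-down"


def fake_probe(running: set) -> Callable[[str], bool]:
    """Build a stand-in probe for tests; a real probe would query the node."""
    return lambda name: name in running
```

An automated smoke test for deploy/add/remove could assert that every removed worker ends in `"torn-down"` rather than `"metadata-only"`.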

Cross-cutting theme

Across every tracker, the same pattern shows up:

mindmap
  root((Gap-tracker themes))
    Single-allocation things work
      Jupyter, vLLM, single-node Slurm OK
      Single-allocation RKE2 OK
      2-allocation member ops OK
    Multi-allocation things blocked on
      cross-allocation networking
      cross-allocation identity
      cross-allocation lifecycle
    Honest-teardown is the bar
      Slurm decommission must really teardown
      Slice wipe verification before reuse
      App stop must end runtime not just metadata
    Sequencing
      Lower-level gaps before higher
      Allocation UX before App platform polish
      Slice phase 6 gate for clustered apps

Where to look next