Gap trackers
Designed Decided
doc/architecture/App_Platform_Gap_Tracker_v1.md · GPU_Slice_Implementation_Checklist_v1.md · App_Platform_Clustered_App_Gap_Table_v1.md · V3_Migration_Execution_Tracker_v1.md · V3_V1_Workflow_Parity_Audit_v1.md · Allocation_Experience_Gaps_v1.md
Per-area gap inventories. Each tracker is owned by an area lead, lists what's missing in that area, and orders the gaps for execution.
Tracker map
flowchart TB
classDef tr fill:#fff3cd,stroke:#332701
APP[App Platform<br/>Gap Tracker]:::tr
SLICE[GPU Slice<br/>Implementation Checklist]:::tr
CLUST[Clustered App<br/>Gap Table]:::tr
V3[V3 Migration<br/>Execution Tracker]:::tr
AE[Allocation Experience<br/>Gaps]:::tr
LOGIN[Login UX & IdP<br/>Gap]:::tr
SLURM[Slurm Product Workflow<br/>Gap Assessment]:::tr
APP -.tied to.-> CLUST
SLICE -.tied to.-> AE
V3 -.tied to.-> APP
1. App Platform Gap Tracker
Designed
flowchart TB
classDef done fill:#d1e7dd,stroke:#0a3622
classDef partial fill:#fff3cd,stroke:#332701
classDef todo fill:#f8d7da,stroke:#42101e
G1[App manifest registration]:::done
G2[OCI registry baseline]:::done
G3[Artifact trust + promotion]:::done
G4[Launchable OCI workload profile contract]:::done
G5[App-runtime lifecycle]:::done
G6[Member operations<br/>add/remove/recover]:::done
G7[Non-OCI artifact lifecycle]:::partial
G8[App-runtime billing model]:::partial
G9[Tenant-shared runtime API direction]:::partial
G10[App-runtime metering producer]:::todo
G11[Embedded UI gateway implementation]:::todo
G12[Compose-as-platformization framework]:::todo
G1 --> G2 --> G3 --> G4 --> G5 --> G6
G6 --> G7
G6 --> G8
G6 --> G9
G6 --> G10
G6 --> G11
G6 --> G12
Tracker's own statement (as of April 2026):
Compose slices now work end to end. The remaining gap is converting those reference slices into a repeatable platformization framework for more app teams.
| Status | Coverage |
|---|---|
| Implemented | Manifest, OCI registry, trust+promotion, launchable profile, lifecycle, member ops |
| Designed (partial) | Non-OCI artifact lifecycle, app-runtime billing model, tenant-shared runtime API direction |
| Designed (todo) | App-runtime metering producer, embedded UI gateway implementation, platformization framework |
→ Source: App_Platform_Gap_Tracker_v1.md
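The status table above can also be derived mechanically from the diagram. The following is a minimal sketch, assuming the gap IDs and statuses are transcribed by hand from the flowchart; the `GAPS` dictionary layout and both helper names are hypothetical and not part of the tracker itself.

```python
from collections import defaultdict

# Gap IDs, titles, and statuses transcribed from the flowchart above.
# The dictionary layout itself is an assumption made for illustration.
GAPS = {
    "G1": ("App manifest registration", "done"),
    "G2": ("OCI registry baseline", "done"),
    "G3": ("Artifact trust + promotion", "done"),
    "G4": ("Launchable OCI workload profile contract", "done"),
    "G5": ("App-runtime lifecycle", "done"),
    "G6": ("Member operations", "done"),
    "G7": ("Non-OCI artifact lifecycle", "partial"),
    "G8": ("App-runtime billing model", "partial"),
    "G9": ("Tenant-shared runtime API direction", "partial"),
    "G10": ("App-runtime metering producer", "todo"),
    "G11": ("Embedded UI gateway implementation", "todo"),
    "G12": ("Compose-as-platformization framework", "todo"),
}


def by_status(gaps):
    """Group gap IDs by status, preserving declaration order."""
    grouped = defaultdict(list)
    for gap_id, (_title, status) in gaps.items():
        grouped[status].append(gap_id)
    return dict(grouped)


def open_gaps(gaps):
    """Return the IDs of every gap that is not fully implemented."""
    return [gap_id for gap_id, (_title, status) in gaps.items() if status != "done"]
```

Calling `by_status(GAPS)` yields one list of IDs per status, which maps directly onto the three rows of the coverage table.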
2. GPU Slice Implementation Checklist
Decided
The slice implementation is broken into seven explicit phases, plus a deferred Phase 8 for fractional/shared GPU readiness. Phases 1-5 are complete; Phase 6 (slice networking hardening) and Phase 7 (app compatibility) are in progress.
flowchart LR
classDef done fill:#d1e7dd,stroke:#0a3622
classDef inprog fill:#fff3cd,stroke:#332701
SP1[Phase 1<br/>Contract & Schema]:::done
SP2[Phase 2<br/>Admin Inventory &<br/>Topology Approval]:::done
SP3[Phase 3<br/>Claim-Aware<br/>Baremetal Scheduler]:::done
SP4[Phase 4<br/>Slice Scheduler]:::done
SP5[Phase 5<br/>Node-Agent<br/>Slice Runtime]:::done
SP6[Phase 6<br/>Slice Networking]:::inprog
SP7[Phase 7<br/>App Compatibility]:::inprog
SP8[Phase 8<br/>Fractional/Shared GPU<br/>readiness — deferred]:::inprog
SP1 --> SP2 --> SP3 --> SP4 --> SP5 --> SP6 --> SP7
SP7 -.future.-> SP8
| Phase | Goal | Key gates |
|---|---|---|
| 1 | Contract & schema | capacity_shape, node_resource_slots, allocation_resource_claims, ERD/spec updated |
| 2 | Admin inventory & topology approval | Discovery task, per-slot VF metadata, admin approval API |
| 3 | Claim-aware baremetal scheduler | Move correctness from allocations.node_id to claims |
| 4 | Slice scheduler | Same-node placement, FOR UPDATE SKIP LOCKED, deterministic topology-aware best-fit |
| 5 | Node-agent slice runtime | 17-phase provision, release, wipe verification, vfio-pci binding |
| 6 | Slice networking | BF3 VF → OVS → vNIC, private NAT, public ingress (optional), cross-slice denied by default |
| 7 | App compatibility | Single-node OCI/Jupyter/vLLM on slices; multi-node clusters defer |
| 8 (deferred) | Fractional/shared GPU readiness | MIG/vGPU/MPS reserved; not exposed in v1 |
→ Source: GPU_Slice_Implementation_Checklist_v1.md
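To make the Phase 4 gates concrete, here is a minimal sketch of same-node, deterministic, topology-aware best-fit selection. It is an illustration under stated assumptions, not the scheduler's actual code: the `Slot` fields, the `topo_group` notion (for example a PCIe switch or NUMA domain), and the ranking order are all hypothetical. The `FOR UPDATE SKIP LOCKED` row locking happens in the database and is not modelled here; the sketch only covers choosing among free slots that have already been read.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Slot:
    node_id: str     # hypothetical identifiers, not the real schema
    slot_index: int
    topo_group: int  # assumed grouping, e.g. PCIe switch or NUMA domain


def pick_slots(free_slots: list[Slot], wanted: int) -> Optional[list[Slot]]:
    """Return `wanted` free slots on a single node, or None if no node fits."""
    if wanted <= 0:
        return None

    by_node: dict[str, list[Slot]] = defaultdict(list)
    for slot in free_slots:
        by_node[slot.node_id].append(slot)

    candidates = []
    for node_id, slots in by_node.items():
        if len(slots) < wanted:
            continue  # same-node placement: a slice never spans nodes

        by_group: dict[int, list[Slot]] = defaultdict(list)
        for slot in slots:
            by_group[slot.topo_group].append(slot)

        # Prefer one topology group that can hold the whole request,
        # choosing the tightest such group first.
        fitting = sorted(
            (g for g in by_group.values() if len(g) >= wanted),
            key=lambda g: (len(g), g[0].topo_group),
        )
        pool = fitting[0] if fitting else slots
        chosen = sorted(pool, key=lambda s: (s.topo_group, s.slot_index))[:wanted]

        # Rank: single-group placements first, then least leftover capacity
        # on the node (best fit), then node id so ties resolve deterministically.
        rank = (0 if fitting else 1, len(slots) - wanted, node_id)
        candidates.append((rank, chosen))

    if not candidates:
        return None
    return min(candidates, key=lambda c: c[0])[1]
```

Given two free slots on a small node and four on a larger one, a two-slot request lands on the small node, which keeps the larger node intact for bigger requests. Because every tie is broken on a stable key, the same input always produces the same placement.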
3. Clustered App Gap Table
Designed
Multi-allocation apps (multi-node Slurm, multi-node Kubernetes, distributed training) need cross-allocation networking, identity, lifecycle, and billing. These gaps are tracked separately because the networking gap is the load-bearing blocker.
flowchart LR
classDef single fill:#d1e7dd,stroke:#0a3622
classDef multi fill:#fff3cd,stroke:#332701
SA[Single-allocation apps<br/>OK today]:::single --> S1[Jupyter / vLLM<br/>single-node Slurm /<br/>self-managed RKE2]:::single
MA[Multi-allocation apps<br/>blocked]:::multi --> G1[Cross-allocation networking]:::multi
MA --> G2[Cross-allocation identity]:::multi
MA --> G3[Cross-allocation lifecycle<br/>member add/remove]:::multi
MA --> G4[Cross-allocation billing<br/>per-app vs per-allocation]:::multi
Two-allocation Slurm (controller + remote worker) does work today, and it proved the member-ops model. Full multi-node clusters wait on slice networking (Phase 6 and later).
→ Source: App_Platform_Clustered_App_Gap_Table_v1.md, Clustered_App_Model_v1.md
4. V3 Migration Execution Tracker
Designed
V3 is the long-term UX/route model. v1 routes are a frozen demo + internal continuity surface. The tracker manages the cutover.
flowchart LR
V1[/api/v1/* frozen<br/>demo + internal continuity/]
V3M[/v3 design mocks<br/>HTML in ux-mocks/]
V3P[/v3-prod production routes<br/>shipped from mocks/]
V1 -.feature-by-feature retire.-> V3P
V3M -.design source.-> V3P
V3P -.parity audit.-> V3M
V3P -.workflow parity audit.-> V1
classDef v1 fill:#ffebee,stroke:#c62828
classDef mock fill:#fff3cd,stroke:#332701
classDef prod fill:#e8f5e9,stroke:#2e7d32
class V1 v1
class V3M mock
class V3P prod
Tracker artifacts:
| Doc | Role |
|---|---|
| V3_Migration_Execution_Tracker_v1.md | Live status per route |
| V3_V1_Workflow_Parity_Audit_v1.md | Capability parity gate |
| V3_V1_Retirement_Guardrails_v1.md | Which v1 routes retire vs stay |
| V3_Cutover_Route_Map_v1.md | Ordered cutover sequence |
| V3_Mock_To_Production_Data_Parity_v1.md | Mock-to-prod fidelity rules |
→ Comparisons surface: V3 redesign status, Internal parity audits
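The feature-by-feature retirement flow can be read as a simple gate over the tracker artifacts. The sketch below assumes three boolean inputs per v1 route, one each from the guardrails doc, the execution tracker, and the workflow parity audit. The `RouteStatus` type, its field names, and the example route path are hypothetical; the real trackers may record richer state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteStatus:
    v1_route: str             # hypothetical path, for illustration only
    retire_allowed: bool      # guardrails: does this route retire or stay?
    v3_prod_shipped: bool     # execution tracker: is the v3-prod route live?
    workflow_parity_ok: bool  # parity audit: does v3-prod cover the v1 workflow?


def retirement_blockers(route: RouteStatus) -> list[str]:
    """List the reasons a v1 route cannot be retired yet (empty means ready)."""
    blockers = []
    if not route.retire_allowed:
        blockers.append("guardrails keep this route")
    if not route.v3_prod_shipped:
        blockers.append("v3-prod replacement not shipped")
    if not route.workflow_parity_ok:
        blockers.append("workflow parity audit not passed")
    return blockers


def can_retire(route: RouteStatus) -> bool:
    """A route retires only when every gate is clear."""
    return not retirement_blockers(route)
```

Returning the list of blockers rather than a bare boolean keeps the per-route status explainable, which suits a tracker that reports live status per route.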
5. Allocation Experience Gaps
Designed
Raw user-facing gaps in the allocation experience, sequenced explicitly before higher-level app platform work.
flowchart TB
classDef now fill:#d1e7dd,stroke:#0a3622
classDef gap fill:#fff3cd,stroke:#332701
NOW[Today]:::now --> N1[Create allocation]:::now
NOW --> N2[Browser terminal]:::now
NOW --> N3[SSH key install]:::now
NOW --> N4[Release]:::now
NOW --> N5[Usage metering]:::now
GAP[Gaps tracked]:::gap --> X1[Restart-in-place workflow]:::gap
GAP --> X2[Resize / reshape]:::gap
GAP --> X3[Allocation timeline detail]:::gap
GAP --> X4[Storage attach UX inline]:::gap
GAP --> X5[SSH grant model UX]:::gap
GAP --> X6[Failure-reason surfacing]:::gap
Several of these have shipped or are shipping (timeline, restart model, storage model). The tracker remains the canonical reference for what user-facing UX maturity looks like.
→ Source: Allocation_Experience_Gaps_v1.md, Allocation_Provisioning_Task_Timeline_v1.md, Allocation_Restart_Model_v1.md
6. Login UX & IdP Gap
Designed
Tracks the gap between the current login UX and enterprise SSO expectations.
flowchart LR
classDef now fill:#d1e7dd,stroke:#0a3622
classDef gap fill:#fff3cd,stroke:#332701
NOW[Today]:::now --> N1[Keycloak password +<br/>dev users]:::now
NOW --> N2[Personal account path]:::now
NOW --> N3[Work account path]:::now
GAP[Gap]:::gap --> G1[Enterprise IdP federation]:::gap
GAP --> G2[Brokered identity linking +<br/>dedup]:::gap
GAP --> G3[Tenant slug hint on login]:::gap
→ Source: Login_UX_and_Identity_Provider_Gap_v1.md, Brokered_Identity_Linking_and_Dedup_v1.md
7. Slurm Product Workflow Gap Assessment
Designed
The Slurm app proves feasibility but still has shortcut behavior. This assessment is the review gate before declaring the app platform ready for independent app teams.
flowchart TB
classDef ok fill:#d1e7dd,stroke:#0a3622
classDef gap fill:#fff3cd,stroke:#332701
OK[Working today]:::ok --> O1[Deploy through catalog]:::ok
OK --> O2[Controller + worker on same alloc]:::ok
OK --> O3[Add worker on second alloc]:::ok
OK --> O4[srun/sinfo/sbatch native]:::ok
OK --> O5[Recover bootstrap-failed worker]:::ok
GAPS[Still open]:::gap --> G1[Honest decommission teardown<br/>not just metadata]:::gap
GAPS --> G2[sbatch accounting cleanup]:::gap
GAPS --> G3[PMIx startup warnings]:::gap
GAPS --> G4[Honest version labelling]:::gap
GAPS --> G5[Automated smoke tests for<br/>deploy/add/remove]:::gap
→ Source: Slurm_Product_Workflow_And_Gap_Assessment_v1.md
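The "honest decommission teardown" gap comes down to checking runtime state as well as recorded state. The following is a minimal sketch, assuming a smoke test can gather two runtime observations per removed worker; the `MemberState` type, its field names, and the `"decommissioned"` status string are hypothetical and stand in for whatever probes the real tests use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemberState:
    metadata_status: str   # what the control plane recorded (assumed value set)
    slurmd_running: bool   # hypothetical probe: is slurmd still alive on the worker?
    listed_in_sinfo: bool  # hypothetical probe: does the controller still list the node?


def teardown_is_honest(state: MemberState) -> bool:
    """True only if metadata and runtime both agree the worker is gone."""
    return (
        state.metadata_status == "decommissioned"
        and not state.slurmd_running
        and not state.listed_in_sinfo
    )


def metadata_only_teardown(state: MemberState) -> bool:
    """True for the shortcut case: metadata says gone, runtime says otherwise."""
    return state.metadata_status == "decommissioned" and not teardown_is_honest(state)
```

An automated remove-worker smoke test could assert `teardown_is_honest` after each removal, so a metadata-only shortcut fails the test instead of passing silently.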
Cross-cutting theme
Across every tracker, the same pattern shows up:
mindmap
  root((Gap-tracker themes))
    Single-allocation things work
      Jupyter, vLLM, single-node Slurm OK
      Single-allocation RKE2 OK
      2-allocation member ops OK
    Multi-allocation things blocked on
      cross-allocation networking
      cross-allocation identity
      cross-allocation lifecycle
    Honest-teardown is the bar
      Slurm decommission must really teardown
      Slice wipe verification before reuse
      App stop must end runtime not just metadata
    Sequencing
      Lower-level gaps before higher
      Allocation UX before App platform polish
      Slice phase 6 gate for clustered apps
Where to look next
- Implementation roadmap — which roadmap phase each gap maps to
- Active work queue — current task assignments per gap
- GPU slice trail — slice progress with diagrams
- App platform trail — app platform progress
- Tech debt register — tracked fallbacks
- Source: full set of source trackers cross-linked above