Trail: App Platform¶
The app platform sits on top of allocations. A user launches an app (Jupyter, vLLM, Slurm, RKE2) inside their allocation. Apps are platform primitives, not part of the core slice/baremetal contract.
Trail map¶
flowchart TB
classDef impl fill:#d1e7dd,stroke:#0a3622,color:#0a3622
classDef des fill:#fff3cd,stroke:#332701,color:#332701
classDef run fill:#e9d6ff,stroke:#1e1530,color:#1e1530
A1[1. App control plane]:::des --> A2[2. App manifest]:::impl
A2 --> A3[3. OCI registry baseline]:::impl
A3 --> A4[4. Artifact trust & promotion]:::impl
A4 --> A5[5. App-runtime lifecycle]:::impl
A5 --> A6[6. Single-alloc adapters]:::impl
A6 --> A7[7. App-runtime billing]:::des
A7 --> A8[8. Compatibility model]:::des
A8 --> A9[9. Runbooks]:::run
1. App control plane¶
Designed
The app control plane is the platform-side layer that brokers an app's lifecycle independently of the underlying allocation.
flowchart TB
classDef cp fill:#e3f2fd,stroke:#1565c0
classDef rt fill:#fff3e0,stroke:#e65100
classDef tn fill:#e8f5e9,stroke:#2e7d32
subgraph CP[App control plane]
CAT[App catalog<br/>list of manifests]:::cp
REG[OCI registry baseline]:::cp
TRUST[Artifact trust & promotion]:::cp
APR[app-runtime-worker]:::cp
end
subgraph RT[Runtime executors]
SLURM[slurm-reference-controller]:::rt
RKE2[rke2-self-managed-controller]:::rt
OCI[Launchable OCI workload<br/>via cloud-init]:::rt
end
subgraph TN[Tenant allocation]
ALLOC[allocations<br/>baremetal or gpu_slice]:::tn
end
CAT --> APR
REG --> TRUST --> APR
APR --> SLURM
APR --> RKE2
APR --> OCI
SLURM --> ALLOC
RKE2 --> ALLOC
OCI --> ALLOC
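To make the dispatch seam concrete, here is a minimal Go sketch of how app-runtime-worker might route a launch to a runtime executor. The `Executor` interface, `stubExecutor`, and routing by the manifest's artifact_kind are illustrative assumptions, not the shipped contract.

```go
package main

import (
	"context"
	"fmt"
)

// Executor is a hypothetical seam between app-runtime-worker and the
// runtime executors (Slurm, RKE2, launchable OCI). Names here are
// assumptions for illustration only.
type Executor interface {
	// Bootstrap starts the app inside the tenant allocation and
	// returns once readiness passes.
	Bootstrap(ctx context.Context, instanceID, allocationID string) error
	// Teardown performs honest runtime teardown, not just metadata cleanup.
	Teardown(ctx context.Context, instanceID string) error
}

// stubExecutor stands in for a real controller binary.
type stubExecutor struct{ name string }

func (s stubExecutor) Bootstrap(_ context.Context, id, alloc string) error {
	fmt.Printf("%s: bootstrap instance=%s allocation=%s\n", s.name, id, alloc)
	return nil
}

func (s stubExecutor) Teardown(_ context.Context, id string) error {
	fmt.Printf("%s: teardown instance=%s\n", s.name, id)
	return nil
}

func main() {
	// app-runtime-worker routes by the manifest's artifact_kind.
	executors := map[string]Executor{
		"controller": stubExecutor{"slurm-reference-controller"},
		"oci":        stubExecutor{"launchable-oci"},
	}
	ex, ok := executors["oci"]
	if !ok {
		panic("no executor registered for this artifact_kind")
	}
	_ = ex.Bootstrap(context.Background(), "inst-1", "alloc-A")
}
```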
→ Source: App_Control_Plane_v1.md, App_Platform_Primitive_Boundary_v1.md
2. App manifest¶
Implemented
erDiagram
app_manifests ||--o{ app_instances : "launched as"
app_instances ||--o{ app_instance_members : "expands into"
app_instances ||--o{ app_instance_events : "produces timeline"
allocations ||--o{ app_instance_members : "bound to"
app_manifests {
text slug PK
text display_name
text version
text artifact_kind "oci|launchable|controller"
text artifact_ref "digest@registry"
jsonb compatibility "capacity_shape array, min_gpu_count, requires_exclusive_node"
jsonb runtime_profile "image, ports, env, resources"
bool active
}
app_instances {
uuid id PK
text app_slug FK
uuid tenant_id
uuid project_id
text status "created|starting|running|stopping|stopped|failed"
text bound_capacity_shape
}
app_instance_members {
uuid id PK
uuid instance_id FK
uuid allocation_id FK
text role "controller|worker|head|node"
text status
}
app_instance_events {
uuid id PK
uuid instance_id FK
text event_type
timestamp occurred_at
jsonb payload
}
The manifest's compatibility block is the launch gate: an instance is only created when the target allocation satisfies it. Sample:
{
"requires_capacity_shape": ["baremetal", "gpu_slice"],
"min_gpu_count": 1,
"requires_exclusive_node": false
}
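For illustration, a hedged Go sketch of decoding that block; the struct shape mirrors the sample JSON above, but the type itself is an assumption, not generated from the platform schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Compatibility mirrors the manifest's compatibility block shown
// above. Field names follow the sample JSON; the Go type is an
// illustrative assumption.
type Compatibility struct {
	RequiresCapacityShape []string `json:"requires_capacity_shape"`
	MinGPUCount           int      `json:"min_gpu_count"`
	RequiresExclusiveNode bool     `json:"requires_exclusive_node"`
}

func main() {
	raw := []byte(`{
	  "requires_capacity_shape": ["baremetal", "gpu_slice"],
	  "min_gpu_count": 1,
	  "requires_exclusive_node": false
	}`)
	var c Compatibility
	if err := json.Unmarshal(raw, &c); err != nil {
		panic(err)
	}
	fmt.Printf("shapes=%v min_gpus=%d exclusive=%v\n",
		c.RequiresCapacityShape, c.MinGPUCount, c.RequiresExclusiveNode)
}
```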
→ Sources: App_Manifest_Registration_Guide_v1.md, Launchable_OCI_Workload_Profile_Contract_v1.md
3. OCI registry baseline¶
Implemented
The platform consumes OCI artifacts: container images and optional non-container payloads (model weights, datasets, scripts).
flowchart LR
DEV[App developer] --> BUILD[CI builds artifact]
BUILD --> SIGN[Sign + push to registry]
SIGN --> REG[(OCI registry<br/>platform-managed or external)]
REG --> VER[Trust verification:<br/>digest, optional cosign]
VER --> CAT[Platform catalog<br/>manifest references digest]
CAT --> LAUNCH[Tenant launches app]
LAUNCH --> EXEC[Runtime executor pulls<br/>by digest on node]
EXEC --> NODE[(Host with credentials)]
classDef secure fill:#e9d6ff,stroke:#1e1530
class VER,SIGN secure
Artifacts are digest-immutable — tags can move but the manifest always references a specific digest. This makes promotion / canary / rollback deterministic.
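A minimal sketch of what that rule can look like as a validation check, assuming the conventional name@sha256:&lt;hex&gt; reference form; the `digestPinned` helper is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// digestPinned is a hypothetical validator for a manifest's
// artifact_ref: the reference must carry an explicit sha256 digest,
// so tags may move in the registry but a registered manifest never
// follows them.
var digestRe = regexp.MustCompile(`@sha256:[a-f0-9]{64}$`)

func digestPinned(artifactRef string) error {
	if !digestRe.MatchString(artifactRef) {
		return fmt.Errorf("artifact_ref %q is not digest-pinned", artifactRef)
	}
	return nil
}

func main() {
	refs := []string{
		"registry.example.com/vllm@sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef",
		"registry.example.com/vllm:latest", // tag-only reference: rejected
	}
	for _, r := range refs {
		fmt.Println(r, "->", digestPinned(r))
	}
}
```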
→ Sources: App_Platform_OCI_Registry_Baseline_v1.md, App_Non_OCI_Artifact_Lifecycle_v1.md
4. Artifact trust & promotion¶
Implemented
stateDiagram-v2
[*] --> draft: developer publishes digest
draft --> staged: passes lint + smoke
staged --> canary: small % rollout enabled
canary --> active: confidence gained
canary --> staged: rollback
active --> deprecated: superseded version active
deprecated --> retired: grace period elapsed
retired --> [*]: no instance references it
Promotion rules (two of them sketched in code below):
- Every state transition is audit-logged (actor, before, after, reason).
- An active manifest cannot reference a non-active artifact.
- A deprecated manifest cannot be assigned to a new instance; existing instances continue running.
- Retiring is gated on "no active instances using this digest".
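A hedged Go sketch covering the legal-transition check and the retire gate; the `promote` function and its audit line are illustrative assumptions, not the platform's promotion code.

```go
package main

import "fmt"

// allowed encodes the promotion state machine above.
var allowed = map[string][]string{
	"draft":      {"staged"},
	"staged":     {"canary"},
	"canary":     {"active", "staged"}, // staged = rollback
	"active":     {"deprecated"},
	"deprecated": {"retired"},
}

type transition struct {
	Actor, From, To, Reason string
}

func promote(t transition, activeInstances int) error {
	legal := false
	for _, next := range allowed[t.From] {
		if next == t.To {
			legal = true
			break
		}
	}
	if !legal {
		return fmt.Errorf("illegal transition %s -> %s", t.From, t.To)
	}
	// Retiring is gated on "no active instances using this digest".
	if t.To == "retired" && activeInstances > 0 {
		return fmt.Errorf("cannot retire: %d instances still reference digest", activeInstances)
	}
	// Every transition is audit-logged (actor, before, after, reason).
	fmt.Printf("audit: actor=%s %s -> %s reason=%q\n", t.Actor, t.From, t.To, t.Reason)
	return nil
}

func main() {
	fmt.Println(promote(transition{"ci-bot", "draft", "staged", "lint+smoke passed"}, 0))
	fmt.Println(promote(transition{"oncall", "deprecated", "retired", "grace elapsed"}, 3))
}
```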
→ Source: App_Artifact_Trust_and_Promotion_v1.md
5. App-runtime lifecycle¶
Implemented
stateDiagram-v2
[*] --> created: API: create app instance
created --> starting: app-runtime-worker dispatches
starting --> running: readiness probes pass
starting --> failed: bootstrap timeout / image pull / permissions
running --> stopping: API: stop OR allocation releasing
stopping --> stopped: cleanup done
stopped --> [*]
running --> failed: runtime error
failed --> [*]: terminal
note right of running
app_instance_events records:
start_completed, healthy, member_added,
member_removed, member_failed
end note
note right of stopping
decommission must perform
honest runtime teardown,
not just metadata cleanup
end note
Each transition emits an app_instance_events row consumed by the UI for the live timeline view (/workloads/:id).
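A minimal sketch of the transition guard and event emission, assuming a map-based encoding of the state machine above; `transition` and `emit` are hypothetical names, and `emit` stands in for writing an app_instance_events row.

```go
package main

import (
	"fmt"
	"time"
)

// next encodes the instance state machine above.
var next = map[string][]string{
	"created":  {"starting"},
	"starting": {"running", "failed"},
	"running":  {"stopping", "failed"},
	"stopping": {"stopped"},
}

func transition(instanceID, from, to, eventType string) error {
	legal := false
	for _, s := range next[from] {
		if s == to {
			legal = true
			break
		}
	}
	if !legal {
		return fmt.Errorf("instance %s: illegal %s -> %s", instanceID, from, to)
	}
	emit(instanceID, eventType)
	return nil
}

// emit would INSERT into app_instance_events; the UI timeline at
// /workloads/:id consumes these rows.
func emit(instanceID, eventType string) {
	fmt.Printf("event instance=%s type=%s at=%s\n",
		instanceID, eventType, time.Now().UTC().Format(time.RFC3339))
}

func main() {
	_ = transition("inst-1", "starting", "running", "start_completed")
	fmt.Println(transition("inst-1", "stopped", "running", "restart")) // rejected
}
```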
→ Sources: App_Runtime_Instance_Lifecycle_v1.md, App_Runtime_Operating_Modes_v1.md, App_Runtime_Recovery_Model_v1.md
6. Single-allocation adapters¶
Implemented
Two reference adapters ship in the repo today:
flowchart TB
subgraph S[Slurm reference]
SC[slurm-reference-controller<br/>2,310 LOC]
SC --> SCT["Controller slurmctld<br/>in allocation A"]
SC --> SW1["Worker slurmd<br/>same allocation OR<br/>second allocation B"]
SC --> SW2[Worker N<br/>added via member ops]
end
subgraph R[RKE2 self-managed]
RC[rke2-self-managed-controller<br/>1,334 LOC]
RC --> RCP[Control plane node<br/>in tenant allocation]
RC --> RW1[Worker node]
RC --> RW2[Worker N]
end
classDef ctrl fill:#e3f2fd,stroke:#1565c0
classDef worker fill:#e8f5e9,stroke:#2e7d32
class SC,RC ctrl
class SW1,SW2,RW1,RW2,SCT,RCP worker
| Adapter | Binary | Status today |
|---|---|---|
| Slurm | cmd/slurm-reference-controller (2,310 LOC) | Deploy through catalog; controller + worker on same alloc; add worker on second alloc via member ops; native srun/sinfo/sbatch; recovery for bootstrap-failed workers |
| RKE2 | cmd/rke2-self-managed-controller (1,334 LOC) | Bootstrap RKE2 control plane inside tenant allocation |
What's still open per the Slurm workflow gap assessment:
- Honest decommission runtime teardown (not just metadata cleanup)
- Harden `sbatch` accounting to remove transient `InvalidAccount` states
- Automated platform-control smoke tests for deploy + add + remove
Multi-node clusters remain DESIGNED — slice networking has to support cross-allocation networks first.
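For intuition, a hypothetical sketch of the member op that binds a worker from a second allocation to a running instance; the `member` shape mirrors app_instance_members, and everything else is an assumption, not the controller's real API.

```go
package main

import "fmt"

// member mirrors an app_instance_members row.
type member struct {
	InstanceID, AllocationID, Role, Status string
}

// addWorker sketches the member op that joins a worker from a second
// allocation to a running Slurm instance.
func addWorker(members []member, instanceID, allocationID string) ([]member, error) {
	for _, m := range members {
		if m.AllocationID == allocationID {
			return members, fmt.Errorf("allocation %s already bound to instance %s", allocationID, m.InstanceID)
		}
	}
	// Bootstrap slurmd in the new allocation, then record the member;
	// a bootstrap failure would land the member in status=failed and
	// be picked up by the recovery path.
	return append(members, member{instanceID, allocationID, "worker", "starting"}), nil
}

func main() {
	members := []member{{"inst-1", "alloc-A", "controller", "running"}}
	members, err := addWorker(members, "inst-1", "alloc-B")
	fmt.Println(members, err)
}
```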
→ Sources: Slurm_First_Slice_Adapter_Contract_v1.md, Slurm_App_Runtime_Adapter_v1.md, Self_Managed_RKE2_First_Slice_v1.md, Kubernetes_Runtime_Reconcile_Repair_v1.md
7. App-runtime billing¶
Designed
Per-app metering is layered on top of allocation billing; the same usage is never charged twice.
sequenceDiagram
autonumber
participant ALLOC as Allocation<br/>billed at SKU rate
participant APP as App instance
participant AM as App metering<br/>producer (DESIGNED)
participant BW as billing-worker
participant LED as ledger
ALLOC->>BW: usage_records (GPU-hour)
BW->>LED: ledger debit (allocation cost)
APP->>AM: app-specific metric<br/>(tokens, requests, model-seconds)
AM->>BW: app_usage_records<br/>(non-double-charge dimension)
BW->>LED: ledger debit (app-level)
Note over LED: tenant sees combined cost,<br/>operators see breakdown
The metering producer is the missing piece the external review flagged. The contract is defined; the producer wiring is in progress.
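A sketch of the non-double-charge rule as the DESIGNED producer might enforce it: app usage records carry an app-level dimension (tokens, requests, model-seconds) and never the allocation's GPU-hour, which is already billed at the SKU rate. Field and function names are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// appUsageRecord sketches the metering producer's output.
type appUsageRecord struct {
	InstanceID string
	Dimension  string // e.g. "tokens"; never "gpu_hour"
	Quantity   float64
	WindowEnd  time.Time
}

func produce(instanceID, dimension string, qty float64) (appUsageRecord, error) {
	if dimension == "gpu_hour" {
		// GPU-hours are already billed at the allocation SKU rate.
		return appUsageRecord{}, fmt.Errorf("dimension gpu_hour would double-charge the allocation")
	}
	return appUsageRecord{instanceID, dimension, qty, time.Now().UTC()}, nil
}

func main() {
	rec, err := produce("inst-1", "tokens", 125000)
	fmt.Println(rec, err)
}
```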
→ Sources: App_Runtime_Billing_Model_v1.md, App_Runtime_Metering_v1.md, App_Runtime_External_Worker_Contract_v1.md
8. Compatibility model¶
Decided
flowchart TB
APP[App manifest declares<br/>compatibility]
APP --> RULE{Allocation shape?}
RULE --> S1[gpu_slice + N GPUs]
RULE --> B1[baremetal]
S1 --> J1{App requires<br/>multi-node?}
B1 --> J2{App requires<br/>multi-node?}
J1 -- yes --> BLOCK[Blocked v1<br/>slice networking<br/>doesn't span allocs]
J1 -- no --> OK1[Compatible:<br/>Jupyter / vLLM / single-node Slurm /<br/>single-node RKE2]
J2 -- yes --> WAIT[DESIGNED: multi-node clusters<br/>tracked as a separate workstream]
J2 -- no --> OK2[Compatible:<br/>same set + OCI workloads]
classDef ok fill:#d1e7dd,stroke:#0a3622
classDef warn fill:#fff3cd,stroke:#332701
classDef block fill:#f8d7da,stroke:#42101e
class OK1,OK2 ok
class WAIT warn
class BLOCK block
| App profile | gpu_slice | baremetal | Why |
|---|---|---|---|
| Launchable OCI / Jupyter / vLLM single-node | ✓ | ✓ | Workload stays inside one alloc |
| Slurm single-node | ✓ | ✓ | Controller + worker share alloc |
| Slurm multi-allocation (controller + remote workers) | ✓ | ✓ | Member ops shipped; works today |
| Slurm multi-node (cross-network cluster) | ✗ | ✓ | Slice networking doesn't cross allocs yet |
| RKE2 self-managed single-allocation | ✓ | ✓ | Control plane stays in one alloc |
| Multi-node Kubernetes cluster | ✗ | ✓ | Same network gap |
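The same decision table as a hedged Go sketch, assuming capacity shape and the multi-node requirement are the only inputs; `compatible` is an illustrative function, not the platform's checker.

```go
package main

import "fmt"

// compatible encodes the table above: multi-node apps are blocked on
// gpu_slice until slice networking spans allocations; everything
// single-allocation works on either shape.
func compatible(shape string, multiNode bool) (bool, string) {
	if multiNode && shape == "gpu_slice" {
		return false, "blocked: slice networking doesn't cross allocations yet"
	}
	if multiNode && shape == "baremetal" {
		return true, "DESIGNED workstream: multi-node clusters"
	}
	return true, "workload stays inside one allocation"
}

func main() {
	cases := []struct {
		shape     string
		multiNode bool
	}{
		{"gpu_slice", false},
		{"gpu_slice", true},
		{"baremetal", true},
	}
	for _, c := range cases {
		ok, why := compatible(c.shape, c.multiNode)
		fmt.Printf("%s multiNode=%v -> %v (%s)\n", c.shape, c.multiNode, ok, why)
	}
}
```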
9. Runbooks¶
Runbook
flowchart LR
INC[App incident] --> Q{Symptom}
Q -- artifact digest / trust / promote fail --> R1[Artifact_Lifecycle]
Q -- manifest registration / catalog page fail --> R2[Catalog]
Q -- platform-operator system app stuck --> R3[Platform_Operator]
Q -- usage records / billing alignment off --> R4[Runtime_Billing]
Q -- instance stuck create/start/stop/release --> R5[Runtime_Lifecycle]
classDef rb fill:#e9d6ff,stroke:#1e1530
class R1,R2,R3,R4,R5 rb
| Runbook | When |
|---|---|
| App Artifact Lifecycle Incident | OCI artifact promote/trust/digest issues |
| App Catalog Incident | App manifest registration / catalog page failures |
| App Platform Operator Incident | Platform-operator app (system app) incident |
| App Runtime Billing Incident | Per-app usage records / billing alignment |
| App Runtime Lifecycle Incident | Instance stuck in create/start/stop/release |
Recap¶
sequenceDiagram
autonumber
participant DEV as App developer
participant CAT as App catalog
participant U as Tenant
participant APR as app-runtime-worker
participant EX as runtime executor<br/>(SLURM / RKE2 / OCI)
participant ALLOC as allocation
participant BW as billing-worker
DEV->>CAT: register manifest + digest
CAT->>CAT: promote draft → active
U->>CAT: browse → launch into allocation
CAT->>APR: enqueue create
APR->>EX: bootstrap controller / workers
EX->>ALLOC: spawn processes/VMs/pods
EX-->>APR: ready
APR-->>U: app instance running (WS)
BW->>ALLOC: continue allocation billing
BW->>APR: app-runtime usage events (DESIGNED)
U->>APR: stop
APR->>EX: honest teardown
EX-->>APR: stopped
APR-->>U: stopped