What exists today
Implemented Contract Runbook
This section is a fact dashboard: it records what is in the repository as of the current snapshot. Every claim is backed by a file path you can click through to.
- 13 Go binaries
- 12 service packages
- ~65k lines of Go
- 33k OpenAPI lines
- 2.3k AsyncAPI lines
- 2.5k DB schema lines
- 135 architecture docs
- 56 product docs
- 50+ governance docs
- 38 runbooks
- 3 documented RCAs
- 2 capacity shapes

What's in this section
| Page | Coverage |
|---|---|
| Runtime binaries | All 13 services/workers under cmd/, line counts, role per binary, NATS subscription map |
| Service packages | 12 domain services + 9 shared packages with responsibility tables |
| Capacity shapes & SKUs | baremetal vs gpu_slice, seeded SKUs, allowed GPU counts, slice VM profiles |
| Node-agent task catalog | Every typed task the node-agent will execute, input shape, security model |
| Contract surface | REST API surface, AsyncAPI events, contract-first workflow gates |
| Runbook inventory | All 38 incident runbooks categorized, plus the catalog manifest |
| RCAs on record | Three documented incidents and the governance rules they produced |
| Database schema | Table inventory grouped by domain, key invariants, immutable tables |
At a glance: what binaries do
```mermaid
flowchart LR
  subgraph Entry[Entrypoints]
    API[cmd/api<br/>BFF - all REST routes<br/>43,294 lines]
    CLI[cmd/gpuaas-cli<br/>operator/user CLI<br/>3,687 lines]
    TG[cmd/terminal-gateway<br/>WS terminal relay<br/>1,563 lines]
  end
  subgraph Workers[Async workers]
    PW[provisioning-worker<br/>Temporal<br/>1,514 lines]
    BW[billing-worker<br/>accrual loop<br/>992 lines]
    WW[webhook-worker<br/>Stripe<br/>773 lines]
    ARW[app-runtime-worker<br/>app lifecycle<br/>497 lines]
    OR[outbox-relay<br/>305 lines]
    NR[notification-relay<br/>315 lines]
    NLG[node-log-gateway<br/>287 lines]
  end
  subgraph Fleet[On host]
    NA[node-agent<br/>9,072 lines]
  end
  subgraph AppCtl[App controllers]
    SLURM[slurm-reference-controller<br/>2,310 lines]
    RKE2[rke2-self-managed-controller<br/>1,334 lines]
  end
  API --> PW
  API --> BW
  API --> ARW
  API --> NA
  TG --> NA
  PW --> NA
  PW --> ARW
  ARW --> SLURM
  ARW --> RKE2
```
At a glance: domain services
```mermaid
flowchart TB
  subgraph Public[Customer-facing]
    auth[auth<br/>OIDC + JWT]
    inv[inventory<br/>SKU + nodes]
    prov[provisioning<br/>orchestrator + worker]
    bill[billing<br/>ledger + accrual]
    pay[payments<br/>Stripe + refunds]
    term[terminal<br/>WS sessions]
    appr[appruntime<br/>app lifecycle]
    stor[storage<br/>S3 ops]
  end
  subgraph Internal[Internal]
    admin[admin<br/>privileged ops]
    notif[notification<br/>WS + email]
    rel[releases<br/>SSH key release]
    maas[maas<br/>bare-metal]
  end
```
At a glance: contracts
| Surface | File | Lines | Status |
|---|---|---|---|
| Public REST API | doc/api/openapi.draft.yaml | 33,132 | Contract |
| Event bus contract | doc/api/asyncapi.draft.yaml | 2,296 | Contract |
| Physical database | doc/architecture/db_schema_v1.sql | 2,574 | Implemented |
Sanity checks for reviewers
Quick exercises to verify the portal isn't drifting from reality:
- Look up any binary on Runtime binaries → confirm its `cmd/<name>/main.go` exists.
- Pick any SKU on Capacity shapes → confirm the row in `scripts/seed.sql`.
- Pick any task type on Node-agent task catalog → confirm the `taskType*` constant in `cmd/node-agent/agent.go`.
- Pick any runbook on Runbook inventory → confirm the file exists under `doc/operations/runbooks/`.