
What exists today

Implemented Contract Runbook

This section is a fact dashboard: it lists what is in the repository as of the current snapshot. Every claim is backed by a file path you can click through to.

13 Go binaries
12 Service packages
~65k Lines of Go
33k OpenAPI lines
2.3k AsyncAPI lines
2.5k DB schema lines
135 Architecture docs
56 Product docs
50+ Governance docs
38 Runbooks
3 Documented RCAs
2 Capacity shapes

What's in this section

| Page | Coverage |
|---|---|
| Runtime binaries | All 13 services/workers under cmd/, line counts, role per binary, NATS subscription map |
| Service packages | 12 domain services + 9 shared packages with responsibility tables |
| Capacity shapes & SKUs | baremetal vs gpu_slice, seeded SKUs, allowed GPU counts, slice VM profiles |
| Node-agent task catalog | Every typed task the node-agent will execute, input shape, security model |
| Contract surface | REST API surface, AsyncAPI events, contract-first workflow gates |
| Runbook inventory | All 38 incident runbooks categorized, plus the catalog manifest |
| RCAs on record | Three documented incidents and the governance rules they produced |
| Database schema | Table inventory grouped by domain, key invariants, immutable tables |

At a glance — what binaries do

flowchart LR
    subgraph Entry[Entrypoints]
        API[cmd/api<br/>BFF — all REST routes<br/>43,294 lines]
        CLI[cmd/gpuaas-cli<br/>operator/user CLI<br/>3,687 lines]
        TG[cmd/terminal-gateway<br/>WS terminal relay<br/>1,563 lines]
    end

    subgraph Workers[Async workers]
        PW[provisioning-worker<br/>Temporal<br/>1,514 lines]
        BW[billing-worker<br/>accrual loop<br/>992 lines]
        WW[webhook-worker<br/>Stripe<br/>773 lines]
        ARW[app-runtime-worker<br/>app lifecycle<br/>497 lines]
        OR[outbox-relay<br/>305 lines]
        NR[notification-relay<br/>315 lines]
        NLG[node-log-gateway<br/>287 lines]
    end

    subgraph Fleet[On host]
        NA[node-agent<br/>9,072 lines]
    end

    subgraph AppCtl[App controllers]
        SLURM[slurm-reference-controller<br/>2,310 lines]
        RKE2[rke2-self-managed-controller<br/>1,334 lines]
    end

    API --> PW
    API --> BW
    API --> ARW
    API --> NA
    TG --> NA
    PW --> NA
    PW --> ARW
    ARW --> SLURM
    ARW --> RKE2

Full binary breakdown →

At a glance — domain services

flowchart TB
    subgraph Public[Customer-facing]
        auth[auth<br/>OIDC + JWT]
        inv[inventory<br/>SKU + nodes]
        prov[provisioning<br/>orchestrator + worker]
        bill[billing<br/>ledger + accrual]
        pay[payments<br/>Stripe + refunds]
        term[terminal<br/>WS sessions]
        appr[appruntime<br/>app lifecycle]
        stor[storage<br/>S3 ops]
    end
    subgraph Internal[Internal]
        admin[admin<br/>privileged ops]
        notif[notification<br/>WS + email]
        rel[releases<br/>SSH key release]
        maas[maas<br/>bare-metal]
    end

Full package map →

At a glance — contracts

| Surface | File | Lines | Status |
|---|---|---|---|
| Public REST API | doc/api/openapi.draft.yaml | 33,132 | Contract |
| Event bus contract | doc/api/asyncapi.draft.yaml | 2,296 | Contract |
| Physical database | doc/architecture/db_schema_v1.sql | 2,574 | Implemented |

Contract details →

Sanity checks for reviewers

Quick exercises to verify the portal isn't drifting from reality:

  • Look up any binary on Runtime binaries → confirm its cmd/<name>/main.go exists.
  • Pick any SKU on Capacity shapes → confirm the row in scripts/seed.sql.
  • Pick any task type on Node-agent task catalog → confirm the taskType* constant in cmd/node-agent/agent.go.
  • Pick any runbook on Runbook inventory → confirm the file exists under doc/operations/runbooks/.