System context (C4)
Implemented
Source: doc/architecture/Architecture_v1.md · cmd/*/main.go · packages/services/*
This page covers C4 levels 1, 2, and 3, plus tech-stack, trust-boundary, and runtime-network views. Every box maps to a binary you can grep -r for, a container in compose, or an external system we integrate with.
Level 1: System context
flowchart LR
user([End User]):::actor
admin([Admin / Operator]):::actor
billOps([Billing Operator]):::actor
stripe[(Stripe)]:::ext
kc[(Keycloak / OIDC IdP)]:::ext
maas[(MAAS<br/>bare-metal infra)]:::ext
gpu[(GPU host fleet)]:::ext
obs[(OTel / Prometheus / Loki)]:::ext
subgraph gp[GPUaaS Platform]
api[Public API + BFF]
end
user --> api
admin --> api
billOps --> api
api --> stripe
api --> kc
api --> maas
api --> gpu
api --> obs
classDef actor fill:#fff8e1,stroke:#f57f17
classDef ext fill:#e8eaf6,stroke:#3949ab
| Actor / system | Role |
|---|---|
| End User | Provisions allocations, opens terminal/SSH, runs apps, pays |
| Admin / Operator | Manages users, nodes, refunds, audit, force-release |
| Billing Operator | Reads usage, reconciles payments |
| Keycloak / OIDC IdP | Token issuance + JWKS publication |
| Stripe | Checkout sessions + webhooks |
| MAAS | Bare-metal commission / deploy / release (optional; gated by maas.enabled) |
| GPU host fleet | Hosts running cmd/node-agent |
| OTel / Prometheus / Loki | Trace + metrics + log destination |
Level 2: Container view (top-down, layered)
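The container view is split into four bands (edge + control plane, workers + app controllers, data plane, and fleet) so that each diagram stays readable. Each band corresponds to a runtime role, and a final diagram shows the wiring between bands.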
Edge + Control plane
flowchart TB
classDef edge fill:#fff3e0,stroke:#e65100
classDef cp fill:#e3f2fd,stroke:#1565c0
subgraph EDGE[Public edge]
direction LR
WAF[WAF + API Gateway]:::edge
LB[L7 LB / Ingress]:::edge
end
subgraph CP[Control plane processes]
direction LR
BFF["cmd/api<br/>43k lines<br/>HTTP REST + admin + internal"]:::cp
TG["cmd/terminal-gateway<br/>1.5k lines<br/>WebSocket terminal"]:::cp
NLG["cmd/node-log-gateway<br/>0.3k lines<br/>node-agent log relay"]:::cp
end
WAF --> LB
LB --> BFF
LB --> TG
LB --> NLG
Workers + app controllers
flowchart TB
classDef wk fill:#e8f5e9,stroke:#2e7d32
classDef ad fill:#f3e5f5,stroke:#6a1b9a
subgraph WK[Async workers — driven by NATS subjects]
direction LR
PW["cmd/provisioning-worker<br/>1.5k lines<br/>Temporal workflows"]:::wk
BW["cmd/billing-worker<br/>1.0k lines<br/>accrual + force-release"]:::wk
WW["cmd/webhook-worker<br/>0.8k lines<br/>Stripe webhook"]:::wk
ARW["cmd/app-runtime-worker<br/>0.5k lines<br/>app lifecycle"]:::wk
NR["cmd/notification-relay<br/>0.3k lines<br/>NATS → Redis"]:::wk
OR["cmd/outbox-relay<br/>0.3k lines<br/>Postgres → NATS"]:::wk
end
subgraph AC[App adapters]
direction LR
SLURM["cmd/slurm-reference-controller<br/>2.3k lines"]:::ad
RKE2["cmd/rke2-self-managed-controller<br/>1.3k lines"]:::ad
end
Data plane
flowchart TB
classDef data fill:#eceff1,stroke:#455a64
classDef obs fill:#ede7f6,stroke:#5e35b1
subgraph DP[Data plane]
direction LR
PG[(PostgreSQL 16<br/>partitioned audit_logs + usage_records<br/>+ ledger_entries)]:::data
RD[(Redis 7<br/>terminal tokens + rate limits)]:::data
NATS[(NATS JetStream 2.10<br/>PROVISIONING + BILLING + PAYMENTS + DLQ)]:::data
TMP[(Temporal<br/>workflows for provisioning + release)]:::data
S3[(Object storage<br/>tenant namespaces)]:::data
SEC[(Vault / KMS<br/>secrets, envelope encryption)]:::data
end
subgraph OBS[Observability backend]
direction LR
OTEL[(OTel Collector)]:::obs
PROM[(Prometheus)]:::obs
LOKI[(Loki)]:::obs
TEMPO[(Tempo)]:::obs
GRAF[(Grafana)]:::obs
end
OTEL --> PROM
OTEL --> TEMPO
LOKI --> GRAF
PROM --> GRAF
TEMPO --> GRAF
Fleet
flowchart TB
classDef fleet fill:#fce4ec,stroke:#c2185b
subgraph FL[GPU host fleet]
direction LR
NA1[cmd/node-agent<br/>host A]:::fleet
NA2[cmd/node-agent<br/>host B]:::fleet
NAN[cmd/node-agent<br/>host N]:::fleet
MAAS[(MAAS server)]:::fleet
end
MAAS -.deploy.-> NA1
MAAS -.deploy.-> NA2
MAAS -.deploy.-> NAN
Inter-band wiring
flowchart LR
classDef cp fill:#e3f2fd,stroke:#1565c0
classDef wk fill:#e8f5e9,stroke:#2e7d32
classDef data fill:#eceff1,stroke:#455a64
classDef fleet fill:#fce4ec,stroke:#c2185b
BFF[cmd/api]:::cp
TG[cmd/terminal-gateway]:::cp
PW[provisioning-worker]:::wk
BW[billing-worker]:::wk
OR[outbox-relay]:::wk
NR[notification-relay]:::wk
WW[webhook-worker]:::wk
PG[(Postgres)]:::data
NATS[(NATS)]:::data
RD[(Redis)]:::data
NA[node-agent fleet]:::fleet
MAAS[(MAAS)]:::fleet
STRIPE[(Stripe)]
KC[(Keycloak)]
BFF -->|domain change + outbox| PG
BFF -->|cache, rate-limit, tokens| RD
BFF -->|cached JWKS| KC
OR -->|poll outbox| PG
OR -->|publish| NATS
NATS -->|consume| PW
NATS -->|consume| BW
NATS -->|consume| WW
NATS -->|consume| NR
NR -->|fan out| RD
PW -->|HTTPS mTLS task pull| NA
TG -->|stream relay| NA
BFF -->|MAAS API| MAAS
BFF -.webhook in.- STRIPE
BW -->|emit on depletion| NATS
WW -->|ledger credit| PG
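The "poll outbox" and "publish" edges above can be illustrated with a minimal relay sketch. It is not the code in cmd/outbox-relay: `OutboxRow`, `OutboxStore`, `Publisher`, and `RelayOnce` are hypothetical names, and the Postgres and NATS sides are replaced by in-memory fakes so the example runs on its own. In a real store, `ClaimPending` would presumably be the `FOR UPDATE SKIP LOCKED` poll shown in the runtime sequence.

```go
package main

import (
	"errors"
	"fmt"
)

// OutboxRow is a hypothetical shape for one pending event.
type OutboxRow struct {
	ID      int64
	Subject string
	Payload []byte
}

// OutboxStore abstracts the Postgres side (assumed interface).
type OutboxStore interface {
	ClaimPending(limit int) ([]OutboxRow, error)
	MarkPublished(id int64) error
}

// Publisher abstracts the NATS side (assumed interface).
type Publisher interface {
	Publish(subject string, payload []byte) error
}

// RelayOnce publishes one batch. A row is marked published only after the
// bus accepted it, so a failed publish leaves the row pending for the next
// poll. That gives at-least-once delivery; consumers must be idempotent.
func RelayOnce(store OutboxStore, pub Publisher, batch int) (int, error) {
	rows, err := store.ClaimPending(batch)
	if err != nil {
		return 0, fmt.Errorf("claim outbox rows: %w", err)
	}
	relayed := 0
	for _, r := range rows {
		if err := pub.Publish(r.Subject, r.Payload); err != nil {
			return relayed, fmt.Errorf("publish row %d: %w", r.ID, err)
		}
		if err := store.MarkPublished(r.ID); err != nil {
			return relayed, fmt.Errorf("mark row %d: %w", r.ID, err)
		}
		relayed++
	}
	return relayed, nil
}

// memStore and memBus are in-memory stand-ins used only for this sketch.
type memStore struct {
	rows      []OutboxRow
	published map[int64]bool
}

func newMemStore(rows ...OutboxRow) *memStore {
	return &memStore{rows: rows, published: map[int64]bool{}}
}

func (s *memStore) ClaimPending(limit int) ([]OutboxRow, error) {
	var out []OutboxRow
	for _, r := range s.rows {
		if !s.published[r.ID] && len(out) < limit {
			out = append(out, r)
		}
	}
	return out, nil
}

func (s *memStore) MarkPublished(id int64) error {
	s.published[id] = true
	return nil
}

type memBus struct {
	subjects []string
	failOn   string // subject that simulates a bus outage
}

func (b *memBus) Publish(subject string, _ []byte) error {
	if b.failOn != "" && subject == b.failOn {
		return errors.New("bus unavailable")
	}
	b.subjects = append(b.subjects, subject)
	return nil
}

func main() {
	store := newMemStore(
		OutboxRow{ID: 1, Subject: "provisioning.requested"},
		OutboxRow{ID: 2, Subject: "provisioning.active"},
	)
	bus := &memBus{}
	n, err := RelayOnce(store, bus, 10)
	fmt.Println(n, err, bus.subjects)
}
```

Marking the row after the publish, rather than before, trades possible duplicates for no lost events, which is the usual choice for an outbox.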
Level 3: Inside cmd/api
flowchart TB
classDef route fill:#fff3e0,stroke:#e65100
classDef mid fill:#ede7f6,stroke:#5e35b1
classDef svc fill:#e8f5e9,stroke:#2e7d32
REQ([HTTP request])
subgraph MID[Middleware stack]
direction TB
M1[OTel tracing<br/>+ correlation-id]:::mid
M2[Bearer JWT verify<br/>cached JWKS]:::mid
M3[Sanitize-first<br/>PII / cred scrubber]:::mid
M4[Rate-limit<br/>Redis token bucket]:::mid
M5[Authz<br/>scope + role]:::mid
M1 --> M2 --> M3 --> M4 --> M5
end
subgraph ROUTES[Route files]
direction TB
R1[routes_v1_frozen.go<br/>demo + internal continuity]:::route
R2[routes_v3_lifecycle_mutations.go<br/>allocation create/release]:::route
R3[routes_v3_launch_*.go<br/>launch wizard]:::route
R4[routes_v3_readmodels*.go<br/>aggregated read APIs]:::route
R5[routes_provisioning_*.go]:::route
R6[routes_admin_*.go]:::route
R7[routes_internal_*.go<br/>node task pull/post]:::route
R8[routes_payments_webhook.go<br/>raw-body-first]:::route
R9[routes_terminal_*.go]:::route
end
subgraph SVCS[Domain services]
direction TB
S1[auth]:::svc
S2[inventory]:::svc
S3[provisioning/orchestrator]:::svc
S4[billing]:::svc
S5[payments]:::svc
S6[admin]:::svc
S7[appruntime]:::svc
S8[storage]:::svc
S9[terminal]:::svc
S10[releases]:::svc
end
REQ --> MID
MID --> ROUTES
ROUTES --> SVCS
SVCS --> PG[(Postgres)]
SVCS --> RD[(Redis)]
SVCS -.outbox row.-> PG
Runtime: what calls what (sequence)
sequenceDiagram
autonumber
participant U as User browser
participant LB as Edge / WAF
participant API as cmd/api
participant PG as Postgres
participant OR as outbox-relay
participant NATS as NATS
participant PW as provisioning-worker
participant NA as node-agent
participant BW as billing-worker
participant NR as notification-relay
participant RD as Redis
U->>LB: POST /api/v1/allocations
LB->>API: forward
API->>PG: BEGIN tx
API->>PG: validate SKU, reserve slot(s), insert allocation
API->>PG: insert outbox row
API->>PG: insert audit row
API->>PG: COMMIT
API-->>U: 201 status=requested
OR->>PG: poll outbox FOR UPDATE SKIP LOCKED
OR->>NATS: publish provisioning.requested
NATS-->>PW: deliver event
PW->>NA: dispatch slice.vm_provision (mTLS)
NA->>NA: 17 phases
NA-->>PW: result
PW->>PG: status=active + outbox: provisioning.active
OR->>NATS: publish provisioning.active
NATS-->>BW: deliver → start accrual
NATS-->>NR: deliver → fan out
NR->>RD: publish WS channel
RD-->>U: WS push: allocation active
Trust boundaries
flowchart LR
INET([Public internet]):::pub --> WAF[WAF]:::edge
WAF --> CP[Control plane<br/>internal mTLS]:::trusted
CP -->|mTLS pull only| NA[node-agent<br/>per-host enrollment cert]:::fleet
CP -->|outbound HTTPS; verified webhook in| STRIPE[(Stripe)]:::ext
CP -->|cached JWKS| KC[(Keycloak)]:::ext
CP -->|per-service credentials| DB[(Postgres)]:::data
CP -->|signed URLs| S3[(Object storage)]:::data
classDef pub fill:#ffebee,stroke:#c62828
classDef edge fill:#fff3e0,stroke:#e65100
classDef trusted fill:#e8f5e9,stroke:#2e7d32
classDef fleet fill:#fce4ec,stroke:#c2185b
classDef ext fill:#e8eaf6,stroke:#3949ab
classDef data fill:#eceff1,stroke:#455a64
| Boundary | Controls |
|---|---|
| Internet → Edge | WAF, rate-limit, TLS termination |
| Edge → Control plane | mTLS, network-policy default-deny |
| Control plane → Node fleet | node-agent pulls tasks over mTLS (node initiates); signed task params; identity-bound result |
| Control plane → Stripe | Outbound HTTPS; webhook signature on raw body |
| Control plane → Keycloak | JWKS cached and refreshed every 5 min; no per-request call |
| Control plane → Postgres | Per-service credentials; no shared pool |
| Control plane → Object storage | Signed URLs; tenant-scoped namespaces |
Tech stack
| Layer | Choice |
|---|---|
| Languages | Go ≥ 1.25 (services + CLI), TypeScript (web), Python (SDK) |
| HTTP | net/http + chi-style routing |
| DB driver | pgx/v5 + pgxpool |
| Workflow | Temporal (temporalio/auto-setup:1.24) |
| Bus | NATS JetStream 2.10 |
| Cache / rate-limit | Redis 7 (go-redis/v9) |
| Identity | Keycloak 26 (dev: H2 in-memory; realm imported from JSON) |
| Payments | stripe-go/v76 |
| Observability | OpenTelemetry + Prometheus + Loki + Tempo + Grafana |
| Web | Next.js + TypeScript (App Router) |
| PKI (planned) | Smallstep step-ca; Vault PKI migration path via packages/shared/pki.CAClient |
| Bare metal (planned) | Canonical MAAS, gated by maas.enabled |
Where to look next
- Domain ownership — which package owns what
- Allocation lifecycle — the central state machine
- Outbox & event flow — how writes become events
- GPU slice as-built — the slice path end-to-end
- Billing & ledger — immutable ledger + accrual