System context (C4)

Implemented

Source: doc/architecture/Architecture_v1.md · cmd/*/main.go · packages/services/*

This page covers C4 levels 1, 2, and 3, plus tech-stack, trust-boundary, and runtime-network views. Every box maps to one of three things: a binary you can grep -r for, a container in compose, or an external system we integrate with.


Level 1 — System context

flowchart LR
    user([End User]):::actor
    admin([Admin / Operator]):::actor
    billOps([Billing Operator]):::actor
    stripe[(Stripe)]:::ext
    kc[(Keycloak / OIDC IdP)]:::ext
    maas[(MAAS<br/>bare-metal infra)]:::ext
    gpu[(GPU host fleet)]:::ext
    obs[(OTel / Prometheus / Loki)]:::ext

    subgraph gp[GPUaaS Platform]
        api[Public API + BFF]
    end

    user --> api
    admin --> api
    billOps --> api
    api --> stripe
    api --> kc
    api --> maas
    api --> gpu
    api --> obs

    classDef actor fill:#fff8e1,stroke:#f57f17
    classDef ext   fill:#e8eaf6,stroke:#3949ab

| Actor / system | Role |
| --- | --- |
| End User | Provisions allocations, opens terminal/SSH, runs apps, pays |
| Admin | Manages users, nodes, refunds, audit, force-release |
| Billing Operator | Reads usage, reconciles payments |
| Keycloak / OIDC IdP | Token issuance + JWKS publication |
| Stripe | Checkout sessions + webhooks |
| MAAS | Bare-metal commission / deploy / release (optional, maas.enabled) |
| GPU host fleet | Hosts running cmd/node-agent |
| OTel / Prometheus / Loki | Trace + metrics + log destination |

Level 2 — Container view (top-down, layered)

The container view is split into four bands so each diagram stays readable; a fifth diagram, Inter-band wiring, shows the connections between them. Each band corresponds to a runtime role.

Edge + Control plane

flowchart TB
    classDef edge fill:#fff3e0,stroke:#e65100
    classDef cp fill:#e3f2fd,stroke:#1565c0

    subgraph EDGE[Public edge]
        direction LR
        WAF[WAF + API Gateway]:::edge
        LB[L7 LB / Ingress]:::edge
    end

    subgraph CP[Control plane processes]
        direction LR
        BFF["cmd/api<br/>43k lines<br/>HTTP REST + admin + internal"]:::cp
        TG["cmd/terminal-gateway<br/>1.5k lines<br/>WebSocket terminal"]:::cp
        NLG["cmd/node-log-gateway<br/>0.3k lines<br/>node-agent log relay"]:::cp
    end

    WAF --> LB
    LB --> BFF
    LB --> TG
    LB --> NLG

Workers + app controllers

flowchart TB
    classDef wk fill:#e8f5e9,stroke:#2e7d32
    classDef ad fill:#f3e5f5,stroke:#6a1b9a

    subgraph WK[Async workers — driven by NATS subjects]
        direction LR
        PW["cmd/provisioning-worker<br/>1.5k lines<br/>Temporal workflows"]:::wk
        BW["cmd/billing-worker<br/>1.0k lines<br/>accrual + force-release"]:::wk
        WW["cmd/webhook-worker<br/>0.8k lines<br/>Stripe webhook"]:::wk
        ARW["cmd/app-runtime-worker<br/>0.5k lines<br/>app lifecycle"]:::wk
        NR["cmd/notification-relay<br/>0.3k lines<br/>NATS → Redis"]:::wk
        OR["cmd/outbox-relay<br/>0.3k lines<br/>Postgres → NATS"]:::wk
    end

    subgraph AC[App adapters]
        direction LR
        SLURM["cmd/slurm-reference-controller<br/>2.3k lines"]:::ad
        RKE2["cmd/rke2-self-managed-controller<br/>1.3k lines"]:::ad
    end

Data plane

flowchart TB
    classDef data fill:#eceff1,stroke:#455a64
    classDef obs fill:#ede7f6,stroke:#5e35b1

    subgraph DP[Data plane]
        direction LR
        PG[(PostgreSQL 16<br/>partitioned audit_logs + usage_records<br/>+ ledger_entries)]:::data
        RD[(Redis 7<br/>terminal tokens + rate limits)]:::data
        NATS[(NATS JetStream 2.10<br/>PROVISIONING + BILLING + PAYMENTS + DLQ)]:::data
        TMP[(Temporal<br/>workflows for provisioning + release)]:::data
        S3[(Object storage<br/>tenant namespaces)]:::data
        SEC[(Vault / KMS<br/>secrets, envelope encryption)]:::data
    end

    subgraph OBS[Observability backend]
        direction LR
        OTEL[(OTel Collector)]:::obs
        PROM[(Prometheus)]:::obs
        LOKI[(Loki)]:::obs
        TEMPO[(Tempo)]:::obs
        GRAF[(Grafana)]:::obs
    end

    OTEL --> PROM
    OTEL --> TEMPO
    LOKI --> GRAF
    PROM --> GRAF
    TEMPO --> GRAF

Fleet

flowchart TB
    classDef fleet fill:#fce4ec,stroke:#c2185b

    subgraph FL[GPU host fleet]
        direction LR
        NA1[cmd/node-agent<br/>host A]:::fleet
        NA2[cmd/node-agent<br/>host B]:::fleet
        NAN[cmd/node-agent<br/>host N]:::fleet
        MAAS[(MAAS server)]:::fleet
    end
    MAAS -.deploy.-> NA1
    MAAS -.deploy.-> NA2
    MAAS -.deploy.-> NAN

Inter-band wiring

flowchart LR
    classDef cp fill:#e3f2fd,stroke:#1565c0
    classDef wk fill:#e8f5e9,stroke:#2e7d32
    classDef data fill:#eceff1,stroke:#455a64
    classDef fleet fill:#fce4ec,stroke:#c2185b

    BFF[cmd/api]:::cp
    TG[cmd/terminal-gateway]:::cp
    PW[provisioning-worker]:::wk
    BW[billing-worker]:::wk
    OR[outbox-relay]:::wk
    NR[notification-relay]:::wk
    WW[webhook-worker]:::wk
    PG[(Postgres)]:::data
    NATS[(NATS)]:::data
    RD[(Redis)]:::data
    NA[node-agent fleet]:::fleet
    MAAS[(MAAS)]:::fleet
    STRIPE[(Stripe)]
    KC[(Keycloak)]

    BFF -->|domain change + outbox| PG
    BFF -->|cache, rate-limit, tokens| RD
    BFF -->|cached JWKS| KC
    OR -->|poll outbox| PG
    OR -->|publish| NATS
    NATS -->|consume| PW
    NATS -->|consume| BW
    NATS -->|consume| WW
    NATS -->|consume| NR
    NR -->|fan out| RD
    PW -->|HTTPS mTLS task pull| NA
    TG -->|stream relay| NA
    BFF -->|MAAS API| MAAS
    BFF -.webhook in.- STRIPE
    BW -->|emit on depletion| NATS
    WW -->|ledger credit| PG
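
The outbox-relay edges in the wiring diagram (poll the outbox, then publish) reduce to a small loop. The sketch below is a minimal illustration, not the code in cmd/outbox-relay: `OutboxStore`, `Publisher`, `RelayOnce`, and the in-memory fakes are hypothetical names, standing in for a pgx query over the outbox table and a NATS JetStream publish.

```go
package main

import (
	"errors"
	"fmt"
)

// Event is a hypothetical outbox row: a subject plus an opaque payload.
type Event struct {
	ID      int64
	Subject string
	Payload []byte
}

// OutboxStore is an assumed abstraction over the Postgres outbox table.
type OutboxStore interface {
	ClaimBatch(limit int) ([]Event, error)
	MarkPublished(id int64) error
}

// Publisher is an assumed abstraction over the message bus client.
type Publisher interface {
	Publish(subject string, payload []byte) error
}

// RelayOnce claims one batch, publishes each event, and marks an event as
// published only after the publish succeeded. It stops at the first error
// and returns how many events were fully relayed.
func RelayOnce(store OutboxStore, pub Publisher, limit int) (int, error) {
	events, err := store.ClaimBatch(limit)
	if err != nil {
		return 0, fmt.Errorf("claim batch: %w", err)
	}
	sent := 0
	for _, ev := range events {
		if err := pub.Publish(ev.Subject, ev.Payload); err != nil {
			return sent, fmt.Errorf("publish event %d: %w", ev.ID, err)
		}
		if err := store.MarkPublished(ev.ID); err != nil {
			return sent, fmt.Errorf("mark event %d: %w", ev.ID, err)
		}
		sent++
	}
	return sent, nil
}

// memStore is an in-memory fake of the outbox table.
type memStore struct {
	pending []Event
}

func (s *memStore) ClaimBatch(limit int) ([]Event, error) {
	if limit > len(s.pending) {
		limit = len(s.pending)
	}
	// Copy so that MarkPublished can shrink s.pending while we iterate.
	return append([]Event(nil), s.pending[:limit]...), nil
}

func (s *memStore) MarkPublished(id int64) error {
	for i, ev := range s.pending {
		if ev.ID == id {
			s.pending = append(s.pending[:i], s.pending[i+1:]...)
			return nil
		}
	}
	return errors.New("unknown event id")
}

// memBus is an in-memory fake of the bus; failOn simulates an outage.
type memBus struct {
	subjects []string
	failOn   string
}

func (b *memBus) Publish(subject string, _ []byte) error {
	if b.failOn != "" && subject == b.failOn {
		return errors.New("bus unavailable")
	}
	b.subjects = append(b.subjects, subject)
	return nil
}

func main() {
	store := &memStore{pending: []Event{
		{ID: 1, Subject: "provisioning.requested"},
		{ID: 2, Subject: "provisioning.active"},
	}}
	bus := &memBus{}
	n, err := RelayOnce(store, bus, 10)
	fmt.Println(n, err, bus.subjects)
}
```

Publishing before marking means a crash between the two steps re-sends the event on the next poll, so delivery is at-least-once and consumers have to tolerate duplicates.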

Level 3 — Inside cmd/api

flowchart TB
    classDef route fill:#fff3e0,stroke:#e65100
    classDef mid fill:#ede7f6,stroke:#5e35b1
    classDef svc fill:#e8f5e9,stroke:#2e7d32

    REQ([HTTP request])

    subgraph MID[Middleware stack]
        direction TB
        M1[OTel tracing<br/>+ correlation-id]:::mid
        M2[Bearer JWT verify<br/>cached JWKS]:::mid
        M3[Sanitize-first<br/>PII / cred scrubber]:::mid
        M4[Rate-limit<br/>Redis token bucket]:::mid
        M5[Authz<br/>scope + role]:::mid
        M1 --> M2 --> M3 --> M4 --> M5
    end

    subgraph ROUTES[Route files]
        direction TB
        R1[routes_v1_frozen.go<br/>demo + internal continuity]:::route
        R2[routes_v3_lifecycle_mutations.go<br/>allocation create/release]:::route
        R3[routes_v3_launch_*.go<br/>launch wizard]:::route
        R4[routes_v3_readmodels*.go<br/>aggregated read APIs]:::route
        R5[routes_provisioning_*.go]:::route
        R6[routes_admin_*.go]:::route
        R7[routes_internal_*.go<br/>node task pull/post]:::route
        R8[routes_payments_webhook.go<br/>raw-body-first]:::route
        R9[routes_terminal_*.go]:::route
    end

    subgraph SVCS[Domain services]
        direction TB
        S1[auth]:::svc
        S2[inventory]:::svc
        S3[provisioning/orchestrator]:::svc
        S4[billing]:::svc
        S5[payments]:::svc
        S6[admin]:::svc
        S7[appruntime]:::svc
        S8[storage]:::svc
        S9[terminal]:::svc
        S10[releases]:::svc
    end

    REQ --> MID
    MID --> ROUTES
    ROUTES --> SVCS
    SVCS --> PG[(Postgres)]
    SVCS --> RD[(Redis)]
    SVCS -.outbox row.-> PG
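
The ordering of the middleware stack above can be expressed as a plain net/http handler chain. This is a sketch of the composition only: `Middleware`, `Chain`, `tag`, and `RunStack` are hypothetical names, and each stage merely records that it ran instead of doing real tracing, JWT verification, scrubbing, rate limiting, or authorization.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Middleware wraps a handler with one cross-cutting concern.
type Middleware func(http.Handler) http.Handler

// Chain wraps h so that the first middleware listed runs first (outermost).
func Chain(h http.Handler, mws ...Middleware) http.Handler {
	for i := len(mws) - 1; i >= 0; i-- {
		h = mws[i](h)
	}
	return h
}

// tag is a placeholder middleware: it appends its name to order and then
// calls the next handler.
func tag(name string, order *[]string) Middleware {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			*order = append(*order, name)
			next.ServeHTTP(w, r)
		})
	}
}

// RunStack sends one request through the five stages and a stub route
// handler, and returns the order in which they executed.
func RunStack() []string {
	var order []string
	route := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		order = append(order, "route")
		w.WriteHeader(http.StatusNoContent)
	})
	h := Chain(route,
		tag("tracing", &order),   // M1
		tag("jwt", &order),       // M2
		tag("sanitize", &order),  // M3
		tag("ratelimit", &order), // M4
		tag("authz", &order),     // M5
	)
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/", nil))
	return order
}

func main() {
	fmt.Println(RunStack()) // [tracing jwt sanitize ratelimit authz route]
}
```

Because a stage that rejects a request simply does not call the next handler, everything listed after it, including the route, never runs.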

Runtime — what calls what (sequence)

sequenceDiagram
    autonumber
    participant U as User browser
    participant LB as Edge / WAF
    participant API as cmd/api
    participant PG as Postgres
    participant OR as outbox-relay
    participant NATS as NATS
    participant PW as provisioning-worker
    participant NA as node-agent
    participant BW as billing-worker
    participant NR as notification-relay
    participant RD as Redis

    U->>LB: POST /api/v1/allocations
    LB->>API: forward
    API->>PG: BEGIN tx
    API->>PG: validate SKU, reserve slot(s), insert allocation
    API->>PG: insert outbox row
    API->>PG: insert audit row
    API->>PG: COMMIT
    API-->>U: 201 status=requested

    OR->>PG: poll outbox FOR UPDATE SKIP LOCKED
    OR->>NATS: publish provisioning.requested
    NATS-->>PW: deliver event
    PW->>NA: dispatch slice.vm_provision (mTLS)
    NA->>NA: 17 phases
    NA-->>PW: result
    PW->>PG: status=active + outbox: provisioning.active
    OR->>NATS: publish provisioning.active
    NATS-->>BW: deliver → start accrual
    NATS-->>NR: deliver → fan out
    NR->>RD: publish WS channel
    RD-->>U: WS push: allocation active

Trust boundaries

flowchart LR
    INET([Public internet]):::pub --> WAF[WAF]:::edge
    WAF --> CP[Control plane<br/>internal mTLS]:::trusted
    CP -->|mTLS pull only| NA[node-agent<br/>per-host enrollment cert]:::fleet
    CP -->|outbound HTTPS; verified webhook in| STRIPE[(Stripe)]:::ext
    CP -->|cached JWKS| KC[(Keycloak)]:::ext
    CP -->|per-service credentials| DB[(Postgres)]:::data
    CP -->|signed URLs| S3[(Object storage)]:::data

    classDef pub fill:#ffebee,stroke:#c62828
    classDef edge fill:#fff3e0,stroke:#e65100
    classDef trusted fill:#e8f5e9,stroke:#2e7d32
    classDef fleet fill:#fce4ec,stroke:#c2185b
    classDef ext fill:#e8eaf6,stroke:#3949ab
    classDef data fill:#eceff1,stroke:#455a64

| Boundary | Controls |
| --- | --- |
| Internet → Edge | WAF, rate-limit, TLS termination |
| Edge → Control plane | mTLS, network-policy default-deny |
| Control plane → Node fleet | mTLS pull from node-agent (node initiates); signed task params; identity-bound result |
| Control plane → Stripe | Outbound HTTPS; webhook signature on raw body |
| Control plane → Keycloak | Cached JWKS refresh every 5 min; no per-request call |
| Control plane → Postgres | Per-service credentials; no shared pool |
| Control plane → Object storage | Signed URLs; tenant-scoped namespaces |
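
The "webhook signature on raw body" control matters because the signature covers the exact bytes Stripe sent, so the body has to be read before any JSON decoding or re-encoding. The sketch below follows Stripe's documented scheme (HMAC-SHA256 over `<timestamp>.<body>`, carried in a `t=…,v1=…` header) using only the standard library. `VerifySignature`, `BuildHeader`, and `sign` are hypothetical names, and the secret and tolerance are made-up values; the service itself presumably relies on the stripe-go webhook helpers, which this page does not show.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// sign returns the hex HMAC-SHA256 of "<ts>.<rawBody>" under secret.
func sign(secret string, ts int64, rawBody []byte) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(strconv.FormatInt(ts, 10)))
	mac.Write([]byte("."))
	mac.Write(rawBody)
	return hex.EncodeToString(mac.Sum(nil))
}

// BuildHeader produces a header in the "t=<unix>,v1=<hex>" shape.
// It exists only so the example can generate its own test input.
func BuildHeader(secret string, ts int64, rawBody []byte) string {
	return "t=" + strconv.FormatInt(ts, 10) + ",v1=" + sign(secret, ts, rawBody)
}

// VerifySignature checks the header against the untouched request body.
// nowUnix and toleranceSec bound how old (or how far in the future) the
// signed timestamp may be, which limits replay of captured requests.
func VerifySignature(header, secret string, rawBody []byte, nowUnix, toleranceSec int64) error {
	var ts int64
	var haveTS bool
	var candidates []string
	for _, part := range strings.Split(header, ",") {
		k, v, ok := strings.Cut(strings.TrimSpace(part), "=")
		if !ok {
			continue
		}
		switch k {
		case "t":
			n, err := strconv.ParseInt(v, 10, 64)
			if err != nil {
				return errors.New("malformed timestamp")
			}
			ts, haveTS = n, true
		case "v1":
			candidates = append(candidates, v)
		}
	}
	if !haveTS || len(candidates) == 0 {
		return errors.New("missing t or v1 element")
	}
	if age := nowUnix - ts; age > toleranceSec || age < -toleranceSec {
		return errors.New("timestamp outside tolerance")
	}
	want := []byte(sign(secret, ts, rawBody))
	for _, c := range candidates {
		// hmac.Equal compares in constant time.
		if hmac.Equal([]byte(c), want) {
			return nil
		}
	}
	return errors.New("no matching v1 signature")
}

func main() {
	body := []byte(`{"type":"checkout.session.completed"}`)
	hdr := BuildHeader("whsec_example", 1700000000, body)

	// Exact bytes verify.
	fmt.Println(VerifySignature(hdr, "whsec_example", body, 1700000010, 300))

	// The same JSON re-encoded with a space no longer verifies.
	reencoded := []byte(`{"type": "checkout.session.completed"}`)
	fmt.Println(VerifySignature(hdr, "whsec_example", reencoded, 1700000010, 300))
}
```

The second call fails even though the JSON is semantically identical, which is the reason routes_payments_webhook.go is described as raw-body-first.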

Tech stack

| Layer | Choice |
| --- | --- |
| Languages | Go ≥ 1.25 (services + CLI), TypeScript (web), Python (SDK) |
| HTTP | net/http + chi-style routing |
| DB driver | pgx/v5 + pgxpool |
| Workflow | Temporal (temporalio/auto-setup:1.24) |
| Bus | NATS JetStream 2.10 |
| Cache / rate-limit | Redis 7 (go-redis/v9) |
| Identity | Keycloak 26 (dev: H2 in-memory; realm imported from JSON) |
| Payments | stripe-go/v76 |
| Observability | OpenTelemetry + Prometheus + Loki + Tempo + Grafana |
| Web | Next.js + TypeScript (App Router) |
| PKI (planned) | Smallstep step-ca; Vault PKI migration path via packages/shared/pki.CAClient |
| Bare metal (planned) | Canonical MAAS, gated by maas.enabled |

Where to look next