Runtime binaries
Implemented
cmd/*/main.go · 13 binaries · ~66,000 lines of Go (non-test)
All 13 binaries
| Binary | Lines | Role |
|---|---|---|
| cmd/api | 43,294 | BFF: every public + admin + internal REST route, including v3 read models |
| cmd/node-agent | 9,072 | On-host typed-task executor (slice + baremetal lifecycle, terminal stream, telemetry) |
| cmd/gpuaas-cli | 3,687 | Operator and user CLI for auth, catalog, projects, allocations, billing |
| cmd/slurm-reference-controller | 2,310 | Reference Slurm app adapter: single-allocation first-slice flow |
| cmd/terminal-gateway | 1,563 | WebSocket terminal endpoint (/ws/terminal/{allocation_id}) |
| cmd/provisioning-worker | 1,514 | Temporal worker: allocation/slice provisioning + release activities |
| cmd/rke2-self-managed-controller | 1,334 | Self-managed RKE2 (Kubernetes) app adapter |
| cmd/billing-worker | 992 | Usage accrual loop + low-balance warnings + force-release |
| cmd/webhook-worker | 773 | Stripe webhook consumer (raw-body-first signature verify) |
| cmd/app-runtime-worker | 497 | App-instance lifecycle async worker |
| cmd/notification-relay | 315 | NATS → Redis Pub/Sub bridge for browser WS notifications |
| cmd/outbox-relay | 305 | Postgres outbox → NATS publisher |
| cmd/node-log-gateway | 287 | Node log streaming endpoint |
Process topology
```mermaid
flowchart TB
    subgraph Edge[Edge]
        WAF[WAF / Gateway]
        TS[Tailscale / VPN]
    end
    subgraph CP[Control plane processes]
        API[cmd/api<br/>HTTP :8443]
        TG[cmd/terminal-gateway<br/>WS :8444]
        NLG[cmd/node-log-gateway<br/>HTTP :8445]
    end
    subgraph WK[Worker processes]
        PW[cmd/provisioning-worker]
        BW[cmd/billing-worker]
        WW[cmd/webhook-worker]
        ARW[cmd/app-runtime-worker]
        NR[cmd/notification-relay]
        OR[cmd/outbox-relay]
    end
    subgraph AppCtl[Reference app controllers]
        SLURM[cmd/slurm-reference-controller]
        RKE2[cmd/rke2-self-managed-controller]
    end
    subgraph Fleet[Per host]
        NA[cmd/node-agent<br/>HTTPS mTLS pull]
    end
    subgraph User[Local]
        CLI[cmd/gpuaas-cli]
    end
    WAF --> API
    WAF --> TG
    CLI --> API
    PW <-->|HTTPS mTLS| NA
    TG <-->|stream relay| NA
    NLG <-->|log pull| NA
    API <-->|NATS subjects| WK
    AppCtl <--> API
    AppCtl <--> NA
```
Binary detail
cmd/api
Implemented Contract
Single binary that hosts every public REST route, every internal mTLS route used by node-agent and workers, and every admin route.
- Routes: organized into frozen v1 routes and active v3 routes; see routes_v1_frozen.go, routes_v3_*.go, routes.go.
- Auth: bearer-token JWT validated against cached Keycloak JWKS (refreshed every 5 min).
- Middleware: tracing, correlation-id, sanitize-first logging, rate-limit (Redis-backed), authz.
- Database: pgx/v5 + pgxpool; all mutations write the outbox row in the same transaction.
- Why it's a single binary: simpler deploy, single contract, all domain packages imported directly. The eventual extraction model is documented in Inter_Service_Communication.md.
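The 5-minute JWKS refresh mentioned under Auth can be pictured as a small TTL cache around a key-set fetch. The following is only a sketch under stated assumptions: `KeySet`, `JWKSCache`, and the `fetch` callback are hypothetical names, the key material is a placeholder string, and the real cmd/api code may handle refresh failures and key rotation differently.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// KeySet maps a key ID (kid) to key material. Placeholder type for this sketch.
type KeySet map[string]string

// ErrUnknownKID is returned when the cached key set has no entry for a kid.
var ErrUnknownKID = errors.New("unknown kid")

// JWKSCache is a hypothetical TTL cache in front of a JWKS fetch.
type JWKSCache struct {
	mu        sync.Mutex
	ttl       time.Duration
	fetch     func() (KeySet, error) // assumed to call the identity provider
	keys      KeySet
	fetchedAt time.Time
}

func NewJWKSCache(ttl time.Duration, fetch func() (KeySet, error)) *JWKSCache {
	return &JWKSCache{ttl: ttl, fetch: fetch}
}

// Key returns the key for kid, refetching the set once it is older than ttl.
func (c *JWKSCache) Key(kid string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.keys == nil || time.Since(c.fetchedAt) >= c.ttl {
		keys, err := c.fetch()
		if err != nil {
			return "", fmt.Errorf("refresh JWKS: %w", err)
		}
		c.keys, c.fetchedAt = keys, time.Now()
	}
	key, ok := c.keys[kid]
	if !ok {
		return "", ErrUnknownKID
	}
	return key, nil
}

func main() {
	fetches := 0
	cache := NewJWKSCache(5*time.Minute, func() (KeySet, error) {
		fetches++
		return KeySet{"kid-1": "public-key-bytes"}, nil
	})
	for i := 0; i < 3; i++ {
		if _, err := cache.Key("kid-1"); err != nil {
			fmt.Println("error:", err)
		}
	}
	fmt.Println("fetches:", fetches)
}
```

Three lookups inside the TTL trigger a single fetch, which is the property that keeps token validation off the identity provider's hot path.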
cmd/node-agent
Implemented Runbook
The most security-sensitive on-host service. Pulls typed tasks from cmd/api over mTLS, executes them, posts results.
- Identity: enrolled mTLS client certificate (24 h TTL, X5C renewal). Cert lives under /var/lib/gpuaas/agent/ with 0o600 mode.
- Task types include slice.topology_discover, slice.vm_provision, slice.vm_release, plus the bare-metal task family; full catalog in Node-agent task catalog.
- Terminal stream: relays bytes between cmd/terminal-gateway and an in-host SSH/PTY (file terminal_stream.go).
- Telemetry: exposes :9110/internal/v1/guest-telemetry for slice-guest helper pushes.
- Runbook: Node_Agent_Control_Plane_Recovery_2026-03.md.
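A typed-task executor of the kind described above can be sketched as a registry keyed by task type, where anything not registered is rejected rather than run. This is an illustration only: `Task`, `Result`, `Registry`, `demoRegistry`, and the `vm_id` payload field are hypothetical, and the handlers are stubs, not the agent's real task logic.

```go
package main

import (
	"errors"
	"fmt"
)

// Task is a hypothetical shape for a typed task pulled from the control plane.
type Task struct {
	ID      string
	Type    string
	Payload map[string]string
}

// Result is what the agent would post back after handling a task.
type Result struct {
	TaskID string
	OK     bool
	Detail string
}

// Handler executes one task type.
type Handler func(Task) (string, error)

// Registry is the allowlist: only registered task types can ever execute.
type Registry map[string]Handler

// Execute dispatches a task by type and converts the outcome into a Result.
func (r Registry) Execute(t Task) Result {
	h, ok := r[t.Type]
	if !ok {
		return Result{TaskID: t.ID, Detail: "rejected: unknown task type " + t.Type}
	}
	detail, err := h(t)
	if err != nil {
		return Result{TaskID: t.ID, Detail: "failed: " + err.Error()}
	}
	return Result{TaskID: t.ID, OK: true, Detail: detail}
}

// demoRegistry wires two task types from this page to stub handlers.
func demoRegistry() Registry {
	return Registry{
		"slice.topology_discover": func(Task) (string, error) {
			return "topology reported", nil
		},
		"slice.vm_release": func(t Task) (string, error) {
			if t.Payload["vm_id"] == "" {
				return "", errors.New("vm_id is required")
			}
			return "released " + t.Payload["vm_id"], nil
		},
	}
}

func main() {
	reg := demoRegistry()
	tasks := []Task{
		{ID: "t1", Type: "slice.topology_discover"},
		{ID: "t2", Type: "shell.exec"}, // not registered, so it is rejected
		{ID: "t3", Type: "slice.vm_release", Payload: map[string]string{"vm_id": "vm-9"}},
	}
	for _, t := range tasks {
		fmt.Printf("%+v\n", reg.Execute(t))
	}
}
```

Keeping dispatch behind an explicit allowlist is one way to limit what a compromised control-plane request could make the host do, which matters for a security-sensitive agent.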
cmd/gpuaas-cli
Implemented
Operator + user CLI. Browser-OIDC PKCE login, catalog browsing, allocation create/release, project membership ops, balance and refund queries, audit export.
- Login flow defined in CLI_Browser_OIDC_PKCE_Login_v1.md.
- Command matrix: CLI_v2_Command_Matrix.md.
cmd/slurm-reference-controller
Implemented Designed
Reference app adapter for Slurm on a single allocation. Spawns the Slurm controller VM (or container) and worker nodes inside one allocation. Multi-node Slurm clusters remain DESIGNED — see Slurm_First_Slice_Adapter_Contract_v1.md.
cmd/terminal-gateway
Implemented RCA
Dedicated WebSocket endpoint at /ws/terminal/{allocation_id}. Validates a single-use terminal token (300 s TTL, stored in Redis), opens a session binding with cmd/api, then relays bytes to cmd/node-agent.
- Token shape: opaque 256-bit random, key terminal_token:{token}.
- Auth on browser WS via Sec-WebSocket-Protocol (header-only transport; no ?token= allowed).
- Drove RCA: 2026-03-terminal-stream-http2-buffering.
cmd/provisioning-worker
Implemented RCA
Temporal worker. Subscribes to PROVISIONING.> NATS subjects, runs Temporal workflows that dispatch typed tasks to cmd/node-agent and observe their completion.
- One workflow ID per provisioning.requested event (idempotent).
- Drove RCA: 2026-03-provisioning-workflow-recovery-gaps.
cmd/rke2-self-managed-controller
Implemented
Single-allocation RKE2 (Kubernetes) app adapter. Bootstraps an RKE2 control plane inside the tenant's allocation and draws its worker nodes from that same allocation.
cmd/billing-worker
Implemented Contract
- Accrues usage every billing.window_seconds (default 60).
- Emits billing.low_balance_warning, billing.auto_release_pending, billing.balance_depleted AsyncAPI events.
- Force-release on depletion via provisioning.force_release_requested.
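One accrual window can be sketched as integer arithmetic plus a threshold check. Everything numeric here is an assumption made for illustration: the hourly rate in micro-units, the low-balance threshold, and the rule that depletion means a balance at or below zero. The event names come from the list above; `billing.auto_release_pending` is left out because this page does not say when it fires.

```go
package main

import "fmt"

// Event names taken from this page.
const (
	EventLowBalance   = "billing.low_balance_warning"
	EventDepleted     = "billing.balance_depleted"
	EventForceRelease = "provisioning.force_release_requested"
)

// windowCharge prorates an hourly rate over one accrual window.
// Amounts are integer micro-units (an assumption) to avoid float rounding.
func windowCharge(ratePerHour, windowSeconds int64) int64 {
	return ratePerHour * windowSeconds / 3600
}

// accrue applies one window's charge and returns the new balance together
// with the events this sketch would emit for that balance.
func accrue(balance, ratePerHour, windowSeconds, lowThreshold int64) (int64, []string) {
	next := balance - windowCharge(ratePerHour, windowSeconds)
	switch {
	case next <= 0:
		return next, []string{EventDepleted, EventForceRelease}
	case next <= lowThreshold:
		return next, []string{EventLowBalance}
	default:
		return next, nil
	}
}

func main() {
	const ratePerHour = 3_600_000 // hypothetical: 3.60 currency units per hour
	const window = 60             // the documented default of billing.window_seconds
	const lowThreshold = 100_000  // hypothetical warning threshold

	for _, balance := range []int64{1_000_000, 150_000, 50_000} {
		next, events := accrue(balance, ratePerHour, window, lowThreshold)
		fmt.Println(balance, "->", next, events)
	}
}
```

With these example numbers each 60-second window charges 60,000 micro-units, so the three balances end up healthy, low, and depleted respectively.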
cmd/webhook-worker
Implemented Contract
Stripe webhook handler. Raw-body-first: buffers the raw request body before any JSON parse; signature verification requires the exact bytes. Idempotent — duplicate webhooks do not double-credit. Emits payments.balance_credited on success.
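Why the raw bytes matter can be shown with a standard-library sketch of Stripe's documented signature scheme: the `Stripe-Signature` header carries a timestamp `t` and one or more `v1` values, each an HMAC-SHA256 over `t + "." + body` keyed with the endpoint secret. The function names and the example secret below are hypothetical, and the sketch omits the timestamp tolerance check that a production verifier needs; the official Stripe Go library covers both.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// sign computes the hex HMAC-SHA256 of "<timestamp>.<rawBody>".
func sign(secret, timestamp string, rawBody []byte) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(timestamp))
	mac.Write([]byte("."))
	mac.Write(rawBody)
	return hex.EncodeToString(mac.Sum(nil))
}

// verify checks the raw body against every v1 signature in the header.
// It must receive the exact bytes from the wire, never a re-encoded body.
func verify(secret, header string, rawBody []byte) bool {
	var timestamp string
	var signatures []string
	for _, part := range strings.Split(header, ",") {
		key, value, ok := strings.Cut(strings.TrimSpace(part), "=")
		if !ok {
			continue
		}
		switch key {
		case "t":
			timestamp = value
		case "v1":
			signatures = append(signatures, value)
		}
	}
	if timestamp == "" {
		return false
	}
	want := []byte(sign(secret, timestamp, rawBody))
	for _, sig := range signatures {
		if hmac.Equal([]byte(sig), want) { // constant-time comparison
			return true
		}
	}
	return false
}

func main() {
	secret := "whsec_example" // hypothetical endpoint secret
	raw := []byte(`{"id":"evt_1","amount":100}`)
	header := "t=1700000000,v1=" + sign(secret, "1700000000", raw)

	// Same JSON content, different byte order after a parse and re-encode.
	reencoded := []byte(`{"amount":100,"id":"evt_1"}`)

	fmt.Println(verify(secret, header, raw), verify(secret, header, reencoded))
}
```

The second check fails even though both bodies decode to the same object, which is why the handler buffers the body before any JSON parsing.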
cmd/app-runtime-worker
Implemented
Async worker for app-instance lifecycle (create/start/stop/release) — fronts the typed task contract for app adapters.
cmd/notification-relay
Implemented
Bridges NATS subjects (billing/provisioning) to Redis Pub/Sub channels that the API's WS notifier reads. Keeps notification fan-out fast without requiring every API replica to subscribe to NATS directly.
cmd/outbox-relay
Implemented
Polls outbox_events rows with FOR UPDATE SKIP LOCKED, publishes envelopes to NATS, and marks the rows published. This relay is the reason every domain change writes its outbox row in the same DB transaction and never publishes to NATS directly from a handler.
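The claim, publish, mark cycle can be sketched with interfaces and in-memory fakes. This is an illustration under assumptions, not the relay's code: the column names in `claimSQL` (`id`, `subject`, `payload`, `published_at`) are guesses, `Store` and `Publisher` are hypothetical interfaces, and transaction handling is assumed to live inside the `Store` implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// claimSQL shows the shape of the polling query. SKIP LOCKED lets several
// relay replicas poll concurrently without claiming the same rows.
// Column names are assumptions.
const claimSQL = `
SELECT id, subject, payload
FROM outbox_events
WHERE published_at IS NULL
ORDER BY id
LIMIT $1
FOR UPDATE SKIP LOCKED`

type Event struct {
	ID      int64
	Subject string
	Payload string
}

// Store abstracts the database side; Claim is assumed to run claimSQL.
type Store interface {
	Claim(limit int) ([]Event, error)
	MarkPublished(id int64) error
}

// Publisher abstracts the NATS side.
type Publisher interface {
	Publish(subject, payload string) error
}

// RelayBatch publishes claimed events in order and marks each one published.
// On a publish error it stops; unmarked rows are retried on the next poll,
// so delivery is at-least-once and consumers must tolerate duplicates.
func RelayBatch(s Store, p Publisher, limit int) (int, error) {
	events, err := s.Claim(limit)
	if err != nil {
		return 0, err
	}
	published := 0
	for _, e := range events {
		if err := p.Publish(e.Subject, e.Payload); err != nil {
			return published, err
		}
		if err := s.MarkPublished(e.ID); err != nil {
			return published, err
		}
		published++
	}
	return published, nil
}

// memStore and memPublisher are in-memory fakes for the demo.
type memStore struct{ pending []Event }

func (m *memStore) Claim(limit int) ([]Event, error) {
	if limit > len(m.pending) {
		limit = len(m.pending)
	}
	return append([]Event(nil), m.pending[:limit]...), nil
}

func (m *memStore) MarkPublished(id int64) error {
	for i, e := range m.pending {
		if e.ID == id {
			m.pending = append(m.pending[:i], m.pending[i+1:]...)
			return nil
		}
	}
	return errors.New("unknown event id")
}

type memPublisher struct {
	sent   []string
	failOn string // subject that simulates a broker failure
}

func (p *memPublisher) Publish(subject, payload string) error {
	if subject == p.failOn {
		return errors.New("publish failed")
	}
	p.sent = append(p.sent, subject)
	return nil
}

func main() {
	store := &memStore{pending: []Event{
		{1, "provisioning.requested", "{}"},
		{2, "billing.low_balance_warning", "{}"},
	}}
	pub := &memPublisher{}
	n, err := RelayBatch(store, pub, 10)
	fmt.Println(n, err, len(store.pending), pub.sent)
}
```

Because a row is only marked after a successful publish, a crash between the two steps causes a re-publish rather than a lost event.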
cmd/node-log-gateway
Implemented
Endpoint that streams node-agent logs back to operators. Per Node_Agent_Log_Collection_Loki_v1.md, the long-term path is to ship logs into Loki; the gateway is the bridge until then.
Where to dig deeper
| Want to understand | Go here |
|---|---|
| What each worker subscribes to | Outbox & event flow |
| What each typed task does on the host | Node-agent task catalog |
| How a slice provisioning task chains together | GPU slice as-built |
| How to run all of this locally | Local dev setup |