Runtime binaries

Implemented

Source: cmd/*/main.go · 13 binaries · ~66,000 lines of Go (non-test)

All 13 binaries

| Binary | Lines | Role |
| --- | --- | --- |
| cmd/api | 43,294 | BFF: every public + admin + internal REST route, including v3 read models |
| cmd/node-agent | 9,072 | On-host typed-task executor (slice + bare-metal lifecycle, terminal stream, telemetry) |
| cmd/gpuaas-cli | 3,687 | Operator and user CLI for auth, catalog, projects, allocations, billing |
| cmd/slurm-reference-controller | 2,310 | Reference Slurm app adapter: single-allocation first-slice flow |
| cmd/terminal-gateway | 1,563 | WebSocket terminal endpoint (/ws/terminal/{allocation_id}) |
| cmd/provisioning-worker | 1,514 | Temporal worker: allocation/slice provisioning + release activities |
| cmd/rke2-self-managed-controller | 1,334 | Self-managed RKE2 (Kubernetes) app adapter |
| cmd/billing-worker | 992 | Usage accrual loop + low-balance warnings + force-release |
| cmd/webhook-worker | 773 | Stripe webhook consumer (raw-body-first signature verify) |
| cmd/app-runtime-worker | 497 | App-instance lifecycle async worker |
| cmd/notification-relay | 315 | NATS → Redis Pub/Sub bridge for browser WS notifications |
| cmd/outbox-relay | 305 | Postgres outbox → NATS publisher |
| cmd/node-log-gateway | 287 | Node log streaming endpoint |

Process topology

flowchart TB
    subgraph Edge[Edge]
        WAF[WAF / Gateway]
        TS[Tailscale / VPN]
    end

    subgraph CP[Control plane processes]
        API[cmd/api<br/>HTTP :8443]
        TG[cmd/terminal-gateway<br/>WS :8444]
        NLG[cmd/node-log-gateway<br/>HTTP :8445]
    end

    subgraph WK[Worker processes]
        PW[cmd/provisioning-worker]
        BW[cmd/billing-worker]
        WW[cmd/webhook-worker]
        ARW[cmd/app-runtime-worker]
        NR[cmd/notification-relay]
        OR[cmd/outbox-relay]
    end

    subgraph AppCtl[Reference app controllers]
        SLURM[cmd/slurm-reference-controller]
        RKE2[cmd/rke2-self-managed-controller]
    end

    subgraph Fleet[Per host]
        NA[cmd/node-agent<br/>HTTPS mTLS pull]
    end

    subgraph User[Local]
        CLI[cmd/gpuaas-cli]
    end

    WAF --> API
    WAF --> TG
    CLI --> API
    PW <-->|HTTPS mTLS| NA
    TG <-->|stream relay| NA
    NLG <-->|log pull| NA
    API <-->|NATS subjects| WK
    AppCtl <--> API
    AppCtl <--> NA

Binary detail

cmd/api

Implemented Contract

Single binary that hosts every public REST route, every internal mTLS route used by node-agent and workers, and every admin route.

  • Routes: organized into v1 frozen routes and v3 active routes — see routes_v1_frozen.go, routes_v3_*.go, routes.go.
  • Auth: bearer-token JWT validated against cached Keycloak JWKS (refresh 5 min).
  • Middleware: tracing, correlation-id, sanitize-first logging, rate-limit (Redis-backed), authz.
  • Database: pgx/v5 + pgxpool; all mutations write outbox in the same transaction.
  • Why it's a single binary: simpler deploy, single contract, all domain packages imported directly. The eventual extraction model is documented in Inter_Service_Communication.md.

cmd/node-agent

Implemented Runbook

The most security-sensitive on-host service. Pulls typed tasks from cmd/api over mTLS, executes them, posts results.

  • Identity: enrolled mTLS client certificate (24 h TTL, X5C renewal). Cert lives under /var/lib/gpuaas/agent/ with 0o600 mode.
  • Task types include slice.topology_discover, slice.vm_provision, slice.vm_release, plus the bare-metal task family — full catalog in Node-agent task catalog.
  • Terminal stream: relays bytes between cmd/terminal-gateway and an in-host SSH/PTY (file terminal_stream.go).
  • Telemetry: exposes :9110/internal/v1/guest-telemetry for slice-guest helper pushes.
  • Runbook: Node_Agent_Control_Plane_Recovery_2026-03.md.

cmd/gpuaas-cli

Implemented

Operator + user CLI. Browser-OIDC PKCE login, catalog browsing, allocation create/release, project membership ops, balance and refund queries, audit export.

  • Login flow defined in CLI_Browser_OIDC_PKCE_Login_v1.md.
  • Command matrix: CLI_v2_Command_Matrix.md.

cmd/slurm-reference-controller

Implemented Designed

Reference app adapter for Slurm on a single allocation. Spawns the Slurm controller VM (or container) and worker nodes inside one allocation. Multi-node Slurm clusters remain DESIGNED — see Slurm_First_Slice_Adapter_Contract_v1.md.

cmd/terminal-gateway

Implemented RCA

Dedicated WebSocket endpoint at /ws/terminal/{allocation_id}. Validates a single-use terminal token (300 s TTL, stored in Redis), opens a session binding with cmd/api, then relays bytes to cmd/node-agent.

  • Token shape: opaque 256-bit random, key terminal_token:{token}.
  • Auth on browser WS via Sec-WebSocket-Protocol (header-only transport; no ?token= allowed).
  • Drove RCA: 2026-03-terminal-stream-http2-buffering.

cmd/provisioning-worker

Implemented RCA

Temporal worker. Subscribes to PROVISIONING.> NATS subjects, runs Temporal workflows that dispatch typed tasks to cmd/node-agent and observe their completion.
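The `>` in PROVISIONING.> is the NATS full wildcard: it matches one or more trailing tokens, while `*` matches exactly one token. The matcher below is a small re-implementation of those rules for illustration (real subscriptions are matched by the NATS server), and the concrete subject names in `main` are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// subjectMatches reports whether a dot-separated subject matches a pattern
// using NATS wildcard rules: "*" matches one token, ">" matches one or more
// tokens and is only valid as the last token of the pattern.
func subjectMatches(pattern, subject string) bool {
	p := strings.Split(pattern, ".")
	s := strings.Split(subject, ".")
	for i, tok := range p {
		if tok == ">" {
			// Needs at least one subject token at this position.
			return i == len(p)-1 && len(s) > i
		}
		if i >= len(s) {
			return false
		}
		if tok != "*" && tok != s[i] {
			return false
		}
	}
	return len(p) == len(s)
}

func main() {
	// Hypothetical subjects, used only to exercise the matcher.
	for _, subj := range []string{
		"PROVISIONING.slice.requested",
		"PROVISIONING",
		"BILLING.usage.accrued",
	} {
		fmt.Println(subj, subjectMatches("PROVISIONING.>", subj))
	}
}
```

Note that the bare subject PROVISIONING does not match, because `>` requires at least one token after the prefix.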

cmd/rke2-self-managed-controller

Implemented

Single-allocation RKE2 app adapter (RKE2 is Rancher's security-hardened Kubernetes distribution). Bootstraps an RKE2 control plane inside the tenant's allocation and draws worker nodes from that same allocation.


cmd/billing-worker

Implemented Contract

  • Accrues usage every billing.window_seconds (default 60).
  • Emits billing.low_balance_warning, billing.auto_release_pending, billing.balance_depleted AsyncAPI events.
  • Force-release on depletion via provisioning.force_release_requested.
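One way to picture a single accrual window is as a debit followed by a check on which thresholds the balance crossed. The sketch below is an assumption-heavy illustration: the hourly-rate pricing model, micro-USD integer amounts, the `warnAt` threshold, and the crossing rules are all hypothetical, and billing.auto_release_pending is left out because this page does not say when it fires.

```go
package main

import "fmt"

// Event names as listed in the text.
const (
	EventLowBalance   = "billing.low_balance_warning"
	EventDepleted     = "billing.balance_depleted"
	EventForceRelease = "provisioning.force_release_requested"
)

// WindowCost returns the charge for one window in micro-USD, assuming an
// hourly rate. Integer division truncates toward zero.
func WindowCost(hourlyRateMicroUSD, windowSeconds int64) int64 {
	return hourlyRateMicroUSD * windowSeconds / 3600
}

// Accrue debits one window and reports which events the transition would
// trigger under the assumed rules: depletion when the balance crosses zero,
// a warning when it crosses warnAt.
func Accrue(balance, hourlyRate, windowSeconds, warnAt int64) (int64, []string) {
	after := balance - WindowCost(hourlyRate, windowSeconds)
	var events []string
	switch {
	case balance > 0 && after <= 0:
		events = append(events, EventDepleted, EventForceRelease)
	case balance > warnAt && after <= warnAt:
		events = append(events, EventLowBalance)
	}
	return after, events
}

func main() {
	// $2.50 per hour, 60 s window (the default given in the text), warn at $0.06.
	balance := int64(100_000)
	for i := 0; i < 3; i++ {
		var events []string
		balance, events = Accrue(balance, 2_500_000, 60, 60_000)
		fmt.Println(balance, events)
	}
}
```

Emitting events only on a threshold crossing, rather than on every window below the threshold, keeps a long-running low balance from producing a warning every 60 seconds.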

cmd/webhook-worker

Implemented Contract

Stripe webhook handler. Raw-body-first: buffers the raw request body before any JSON parse; signature verification requires the exact bytes. Idempotent — duplicate webhooks do not double-credit. Emits payments.balance_credited on success.

cmd/app-runtime-worker

Implemented

Async worker for app-instance lifecycle (create/start/stop/release) — fronts the typed task contract for app adapters.

cmd/notification-relay

Implemented

Bridges NATS subjects (billing/provisioning) to Redis Pub/Sub channels that the API's WS notifier reads. Keeps notification fan-out fast without requiring every API replica to subscribe to NATS directly.

cmd/outbox-relay

Implemented

Polls outbox_events rows with FOR UPDATE SKIP LOCKED, publishes envelopes to NATS, marks rows published. This relay is the reason every domain change writes its outbox row in the same DB transaction and never publishes to NATS directly from a handler.

cmd/node-log-gateway

Implemented

Endpoint that streams node-agent logs back to operators. Per Node_Agent_Log_Collection_Loki_v1.md the long-term path is shipping into Loki; the gateway is the bridge.

Where to dig deeper

| Want to understand | Go here |
| --- | --- |
| What each worker subscribes to | Outbox & event flow |
| What each typed task does on the host | Node-agent task catalog |
| How a slice provisioning task chains together | GPU slice as-built |
| How to run all of this locally | Local dev setup |