Inter-Service Communication Patterns¶
Decisions¶
| Concern | MVP decision | Phase-2 evolution |
|---|---|---|
| Domain-to-domain calls within API server | Direct Go package calls (no network hop) | gRPC when domain extracted to standalone binary |
| JWT validation per request | Local validation via cached JWKS (no per-request Keycloak call) | Same — JWKS rotation handled via background refresh |
| Async domain events | NATS JetStream (see NATS_Stream_Config.md) | Same |
| Notification bridge (events -> WS) | Worker relay publishes user-scoped notifications to Redis Pub/Sub; cmd/api WS hub subscribes to Redis | Optional persistent notifications store + fanout service |
| Workflow orchestration | Temporal Go SDK from cmd/api and workers | Same |
| Policy cache invalidation | Redis Pub/Sub policy.invalidate.* subscriber in cmd/api evicts local PolicyClient cache entries | Dedicated policy service pushes invalidation events cluster-wide |
| UI read-model cache | Redis short-TTL JSON cache via packages/shared/readcache; source of truth remains domain services/Postgres | Event-updated read-model tables with Redis hot cache and explicit invalidation |
| Terminal token validation | Redis atomic single-use consume (GETDEL) by terminal gateway | Signed JWT with local verification (Phase-2) |
| Terminal WS deployment topology | Dedicated cmd/terminal-gateway for terminal WS (/ws/terminal/*) with API-owned token/session control plane | Continue hardening gateway isolation, limits, and telemetry without public contract change |
| External: payments provider | Provider interface (disabled / mockstripe / stripe) behind packages/services/payments | Same, with production provider hardening |
| External: GPU nodes | Pull-based node agent over internal mTLS + signed task payloads | Same model, expanded task catalog and PKI automation |
| Data stores | pgx (Postgres), go-redis (Redis), NATS Go client | Same |
| Phase-2 internal protocol | — | gRPC (Protobuf, not REST — see rationale below) |
1. Internal Communication at MVP¶
cmd/api remains the primary REST control plane and uses direct in-process package calls
for most domain orchestration logic. MVP also includes explicit internal HTTP boundaries
for node-agent polling/enrollment and terminal stream relay paths.
Browser / SDK / CLI
│ HTTPS
▼
cmd/api (single binary)
┌────────────────────────────────────┐
│ shared middleware (auth, OTel, │
│ rate-limit, PII scrubber) │
├────────────┬───────────────────────┤
│ auth/ │ billing/ │ storage/ │ ← direct Go package calls
│ inventory/ │ payments/ │ admin/ │
│ provisioning/orchestrator/ │
│ notification/ terminal/ │
└────────────────────────────────────┘
│ │ │
pgx go-redis NATS client
│ │ │
Postgres Redis JetStream
Workers (cmd/billing-worker, cmd/provisioning-worker) are separate processes and
communicate via NATS/Postgres (and may call explicitly documented internal APIs where
required by contract). Internal gRPC remains deferred.
2. JWT Validation (Auth Middleware)¶
cmd/api validates bearer tokens locally on every authenticated request.
Flow:
1. On startup, packages/shared/middleware fetches the JWKS from Keycloak:
GET {KEYCLOAK_ISSUER_URL}/protocol/openid-connect/certs
2. JWKS is cached in memory with a background refresh every 5 minutes (configurable).
3. Per request: JWT signature verified against cached JWKS → claims extracted.
4. No per-request Keycloak call — auth-service is not in the hot path.
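The startup-fetch plus background-refresh shape of steps 1–4 can be sketched as below. This is an illustrative stand-in for packages/shared/middleware, not its actual API: the type and function names (jwksCache, newJWKSCache) and the pluggable fetch function are assumptions for the example.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// jwksCache holds the most recently fetched key set and refreshes it in the
// background, so per-request validation never calls Keycloak.
type jwksCache struct {
	mu    sync.RWMutex
	keys  map[string][]byte                  // kid -> public key material
	fetch func() (map[string][]byte, error) // pluggable JWKS fetcher (HTTP GET in production)
}

func newJWKSCache(fetch func() (map[string][]byte, error)) (*jwksCache, error) {
	c := &jwksCache{fetch: fetch}
	if err := c.refresh(); err != nil { // fail fast on startup
		return nil, err
	}
	return c, nil
}

func (c *jwksCache) refresh() error {
	keys, err := c.fetch()
	if err != nil {
		return err // keep serving the last good key set on refresh failure
	}
	c.mu.Lock()
	c.keys = keys
	c.mu.Unlock()
	return nil
}

// start refreshes every interval (5 minutes in the described configuration).
func (c *jwksCache) start(interval time.Duration, stop <-chan struct{}) {
	go func() {
		t := time.NewTicker(interval)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				_ = c.refresh()
			case <-stop:
				return
			}
		}
	}()
}

// key is the per-request lookup: verify the JWT header's kid against this.
func (c *jwksCache) key(kid string) ([]byte, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	k, ok := c.keys[kid]
	return k, ok
}

func main() {
	cache, _ := newJWKSCache(func() (map[string][]byte, error) {
		return map[string][]byte{"kid-1": []byte("pem-bytes")}, nil
	})
	_, ok := cache.key("kid-1")
	fmt.Println(ok)
}
```

The read lock on the hot path keeps per-request cost to a map lookup; only the five-minute refresh takes the write lock.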
Required JWT claims:
| Claim | Type | Usage |
|---|---|---|
| sub | string (UUID) | user_id for all authz checks |
| realm_access.roles | string[] | RBAC: user, admin |
| exp | unix timestamp | Token expiry |
| iss | string | Must match KEYCLOAK_ISSUER_URL |
Custom claims (configured in Keycloak realm via protocol mappers):
| Claim | Type | Usage |
|---|---|---|
| org_id | string (UUID), nullable | Tenant scoping |
Token revocation:
- Keycloak refresh token revocation is handled at logout
(POST /api/v1/auth/logout calls Keycloak's revocation endpoint).
- Pre-production requirement: admin-role access tokens must support emergency
server-side revocation via Redis deny-list keyed by token jti (or equivalent
token versioning strategy) before public launch.
- Non-admin short-lived access tokens may continue with expiry-only revocation at MVP.
3. Async Coordination¶
All async domain events flow through NATS JetStream. See doc/architecture/NATS_Stream_Config.md
for stream definitions and durable consumer catalog.
Pattern: publish-subscribe with durable pull consumers and explicit ack. Workers are the primary
subscribers. cmd/api publishes events (via the outbox pattern from Postgres) but does not
subscribe to any domain events directly.
Notification delivery bridge (explicit):
1. Domain workers publish notification-worthy domain events to NATS.
2. The notification-relay worker subscribes to NATS subjects (e.g. billing/provisioning alerts),
transforms each event into a UI notification payload, and publishes it to the Redis channel notify.user.<user_id>.
3. The cmd/api notification WS hub subscribes to Redis Pub/Sub patterns and fans out to connected
browser sessions on WS /ws/notifications.
This preserves the architecture invariant: no direct NATS subscription in cmd/api.
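A minimal sketch of the bridge's contract — the channel naming convention and a payload shape. The uiNotification fields are assumptions for illustration; the authoritative payload schema lives with the notification-relay worker.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// uiNotification is a hypothetical shape for the relay's UI payload.
type uiNotification struct {
	Type    string `json:"type"`
	Title   string `json:"title"`
	Message string `json:"message"`
}

// notifyChannel builds the user-scoped Redis Pub/Sub channel name shared by
// the relay (publisher) and the cmd/api WS hub (subscriber).
func notifyChannel(userID string) string {
	return "notify.user." + userID
}

func main() {
	payload, _ := json.Marshal(uiNotification{
		Type: "billing.alert", Title: "Low balance", Message: "Top up to keep allocations running",
	})
	// The relay would then do: rdb.Publish(ctx, notifyChannel(userID), payload)
	fmt.Println(notifyChannel("42"), string(payload))
}
```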
3.1 Policy cache invalidation channel¶
cmd/api runs a Redis Pub/Sub subscriber on:
- policy.invalidate.*
Message semantics:
- policy.invalidate.<key> -> evict one policy key from local cache.
- policy.invalidate.all / policy.invalidate.* / empty suffix -> evict all local policy cache entries.
Implementation:
- subscriber bootstrap: cmd/api/policy_invalidation.go
- key parsing and invalidation routing: handlePolicyInvalidationMessage(...)
This gives immediate cross-pod policy consistency while MVP still uses DB-backed PolicyClient.
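The routing described above — one key evicted for a concrete suffix, everything evicted for all/*/empty — can be sketched like this. The policyCache type is a toy stand-in for the PolicyClient local cache, and handleInvalidation is an illustrative analogue of handlePolicyInvalidationMessage, not the real function.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// policyCache is a minimal local cache standing in for the PolicyClient cache.
type policyCache struct {
	mu      sync.Mutex
	entries map[string]string
}

// handleInvalidation mirrors the message semantics: policy.invalidate.<key>
// evicts one key; "all", "*", or an empty suffix evicts everything.
func (c *policyCache) handleInvalidation(channel string) {
	const prefix = "policy.invalidate."
	suffix, ok := strings.CutPrefix(channel, prefix)
	if !ok && channel != "policy.invalidate" {
		return // not a policy invalidation message
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	switch suffix {
	case "", "all", "*":
		c.entries = make(map[string]string) // evict all local entries
	default:
		delete(c.entries, suffix) // evict one policy key
	}
}

func main() {
	c := &policyCache{entries: map[string]string{"billing.limit": "10", "terminal.ttl": "300"}}
	c.handleInvalidation("policy.invalidate.billing.limit")
	fmt.Println(len(c.entries))
}
```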
3.2 UI read-model cache¶
The v3 UI shell and workbenches use dashboard-style summaries that can otherwise fan out across multiple domains on every browser refresh. The approved cache pattern is:
- explicit OpenAPI read-model endpoint first;
- handler/service composes from domain owners or read-model tables;
- Redis stores the sanitized JSON response with a documented short TTL;
- cache miss recomputes from the source of truth;
- cache failure fails open for reads and records metrics/logs;
- mutations continue to update Postgres and audit/outbox first, then invalidate affected read models.
Implementation entry points:
- Architecture: doc/architecture/UI_Read_Model_Cache_Architecture_v1.md
- Shared helper: packages/shared/readcache
Do not put authorization decisions, ledger correctness, raw tokens, or private keys in read-model caches.
4. Workflow Coordination (Temporal)¶
packages/services/provisioning/orchestrator uses the Temporal Go SDK to start and signal
workflows. Workers (cmd/provisioning-worker) implement Temporal activities.
cmd/api
└── provisioning/orchestrator
└── temporal.Client.ExecuteWorkflow(...)
│
Temporal server (localhost:7233 in dev)
│
cmd/provisioning-worker
└── temporal.Worker → activities (SSH setup, node config)
Temporal is used for: provision workflow, release workflow, force-release workflow, and billing accrual scheduling. Use Temporal in local development and production to avoid execution-path divergence.
5. Terminal Token — MVP vs Phase-2¶
MVP (Day 1): Redis-backed opaque token¶
Client cmd/api Redis Terminal Gateway
│ │ │ │
│ POST /allocations/{id}/ │ │ │
│ terminal-token │ │ │
│ ─────────────────────────► │ │ │
│ │ SET token → │ │
│ │ {user_id,alloc_id, │ │
│ │ expiry} TTL 300s │ │
│ │ ───────────────────►│ │
│ {token, expires_in} │ │ │
│ ◄───────────────────────── │ │ │
│ │ │ │
│ WS connect + terminal token (Sec-WebSocket-Protocol or Authorization) │
│ │ │ │
│ ──────────────────────────────────────────────► │ │
│ │ │ GETDEL token │
│ │ │ ◄───────────────── │
│ │ │ {user_id,alloc_id} │
│ │ │ ────────────────── ►│
│ terminal stream │ │ │
│ ◄──────────────────────────────────────────────────────────────────── │
Token properties: random 256-bit opaque value, single-use consumed atomically via GETDEL,
TTL 300 seconds, Redis key: terminal_token:{token}.
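Minting the token with those properties is straightforward; the sketch below covers generation and key layout only (in cmd/api the value would be stored with a 300 s TTL, e.g. SET with EX, holding the session JSON — that Redis call is omitted here).

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// mintTerminalToken generates the random 256-bit opaque value and the Redis
// key it is stored under (terminal_token:{token}).
func mintTerminalToken() (token, redisKey string, err error) {
	buf := make([]byte, 32) // 256 bits of entropy from crypto/rand
	if _, err := rand.Read(buf); err != nil {
		return "", "", err
	}
	token = hex.EncodeToString(buf) // 64 hex characters
	return token, "terminal_token:" + token, nil
}

func main() {
	token, key, _ := mintTerminalToken()
	fmt.Println(len(token), key[:15])
}
```

The gateway's GETDEL on that key is what makes consumption atomic and single-use: the first connect wins, any replay finds the key gone.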
5.2 Terminal session semantics (MVP-secure baseline)¶
- Normative decision reference: doc/architecture/adrs/ADR-007-terminal-access-auth-model.md
- terminal.open remains a signed, discrete node task.
- Terminal stdin/stdout/stderr and close are stream frames on a persistent session channel, not discrete tasks.
- Session binding is strict: {session_id, user_id, allocation_id, node_id, expires_at}.
- Single active terminal session per allocation is enforced at open time.
- Session TTL is independent from token TTL:
  - token TTL (default 300s) gates only the initial open handshake.
  - active session TTL is enforced by policy key terminal.session_max_ttl_seconds (default 14400 / 4h).
- Reconnect after drop is a full reopen flow (new token mint -> new terminal.open -> new stream). No session resume in MVP.
5.3 Release and failure sequencing rules¶
- If allocation release is requested while terminal is active, release wins with deterministic order:
  - send terminal close frame (allocation_released) and wait for ack/timeout,
  - dispatch allocation.revoke_user,
  - continue release workflow completion.
- If node stream drops mid-session, close session with retryable reason and require full reopen flow.
- OIDC token expiry does not terminate an already-open session; auth is enforced at session open, and session TTL/close rules govern runtime.
5.3a Terminal execution privilege boundary (mandatory)¶
Terminal stream transport and node lifecycle/provisioning tasks have different trust models:
- Provisioning/lifecycle path (allocation.*, node.* tasks):
  - discrete signed tasks,
  - privileged operations allowed by per-task allowlist,
  - full task result audit.
- Terminal path (terminal.open + stream frames):
  - interactive runtime path only,
  - no lifecycle mutation operations,
  - no package install/user creation/deletion/host reconfiguration side effects.
Enforcement rules:
- Terminal session bootstrap may perform exactly one privilege transition to the target allocation user.
- After transition, the terminal process runs with allocation-user privileges only.
- No terminal stream frame may execute privileged host operations.
- If privilege transition cannot be completed safely, the session must fail closed with a canonical error envelope and correlation ID.
Post-MVP hardening direction:
- Split terminal executor from provisioning/lifecycle executor at runtime boundary (separate process/binary or strict module isolation) while preserving the same terminal.open and stream frame contracts.
5.4 Option C terminal gateway hardening baseline¶
Current baseline is Option C (dedicated cmd/terminal-gateway) with API
control-plane authority for token minting/session binding.
Hardening requirements:
1. Keep token minting in API; gateway validates/consumes terminal tokens and relays traffic.
2. Restrict network policy:
- public ingress -> gateway only for /ws/terminal/*
- gateway egress -> API internal stream + Redis + approved node endpoints only
- block direct gateway access to payments and unrelated admin surfaces
3. Enforce dedicated rate limits and connection caps on gateway process.
4. Maintain gateway-specific telemetry and runbooks (connect failures, replay rejects, session saturation).
5. Preserve public contracts (POST /api/v1/allocations/{id}/terminal-token, WS /ws/terminal/{allocation_id}) so frontend/SDK do not change.
6. Keep transport boundary interface (TerminalSessionBroker) between gateway relay logic and session transport to allow future transport swaps without gateway rewrite.
5.5 Option-A Internal Relay Transport (approved)¶
For TERMINAL_MODE=node_agent_stream, terminal I/O transport uses an internal relay stream:
- Node-agent endpoint: POST /internal/v1/nodes/{node_id}/terminal/stream
- AuthN/AuthZ: mTLS (OU=nodes); certificate node identity must match path node_id
- Wire protocol: JSON line-delimited TerminalStreamFrame objects (see doc/api/asyncapi.draft.yaml)
- Binding invariants:
  - stream is accepted only for an active broker session bound to {session_id, allocation_id, user_id, node_id}
  - single active terminal session per allocation
  - session TTL enforced server-side by broker policy
  - full correlation and audit coverage on session open/close/error
This endpoint is internal-only and not part of the public API surface.
Cutover approach:
- Deploy gateway behind feature flag/routing switch.
- Shadow/soak test under load.
- Flip terminal WS route from API to gateway at edge.
- Keep rollback by route switch without contract change.
Phase-2: Signed JWT¶
cmd/api mints a short-lived JWT signed with its own ECDSA private key (separate from
Keycloak's signing key). Terminal gateway validates locally using the API's cached public key.
Claims: { "sub": user_id, "alloc": allocation_id, "scope": "terminal", "exp": now+300, "jti": uuid }.
jti (JWT ID) is stored in Redis for single-use enforcement until expiry — eliminates the DB
lookup while preserving the single-use guarantee.
6. External Dependencies¶
Payments providers (abstraction)¶
- Package: packages/services/payments
- Provider selector: PAYMENTS_PROVIDER env
- Implementations:
  - disabled: returns ErrNotConfigured
  - mockstripe: deterministic local checkout/portal URLs for contract-faithful local UX
  - stripe: production provider placeholder via Stripe SDK path
- API handlers remain provider-agnostic through the payments.Provider interface.
Stripe-specific notes:
- Library: github.com/stripe/stripe-go/v76
- Calls: checkout.Session.New, billingportal.Session.New, webhook.ConstructEvent
- Raw body preservation: HTTP middleware must buffer the raw request body before any JSON
parsing for webhook signature verification (see Coding_Standards.md §Security).
GPU Nodes (Node Agent, MVP)¶
- Node-side runtime: cmd/node-agent (pull model).
- Control-plane dispatch path:
  - provisioning worker writes node_tasks rows and wakeup signal.
  - node agent long-polls /internal/v1/nodes/{id}/tasks/wait.
  - node agent posts execution result to /internal/v1/nodes/{id}/tasks/{task_id}/result.
- Security model:
  - mTLS node identity (cert profile in doc/architecture/PKI_Spec.md).
  - signed task envelope verified by agent before handler dispatch.
  - allowlisted typed task catalog (no arbitrary shell execution).
- Enrollment/renewal:
  - enrollment and cert renewal proxied by cmd/api internal node endpoints.
  - node agent does not talk directly to Postgres/Redis/NATS/Temporal.
SSH remains only for user terminal sessions via terminal gateway, not provisioning orchestration.
Credential handling posture:
- Target production model is no persistent server-side storage of user SSH private keys.
- Pre-launch cutover policy: remove persistent private-key dependency directly (no compatibility window required before first public launch).
PostgreSQL¶
- Library: github.com/jackc/pgx/v5 with connection pooling (pgxpool)
- Each domain package receives a *pgxpool.Pool via dependency injection at startup.
- Outbox writes use pgx transactions: domain write + outbox row in the same BEGIN/COMMIT.
Redis¶
- Library: github.com/redis/go-redis/v9
- Used for: rate limiting, terminal token store, JWKS cache (optional), hot-path policy cache.
- Security requirement: Redis ACLs must isolate notification channels:
  - notification-relay principal: publish to notify.user.*
  - API notification hub principal: subscribe to notify.user.*
  - all other service principals: neither publish nor subscribe on notification channels
NATS¶
- Library: github.com/nats-io/nats.go
- JetStream context used for all publish and subscribe operations.
- packages/shared/events wraps the client and exposes typed publish/subscribe functions.
7. Phase-2: gRPC for Extracted Services¶
When a domain is extracted from cmd/api into its own binary, internal calls switch from
direct Go function calls to gRPC.
Why gRPC, not REST:
- Typed contracts via Protobuf — no schema drift between caller and callee
- Efficient binary encoding on internal network
- Native streaming support (terminal gateway → provisioning in future)
- gRPC-Gateway can expose a REST surface if needed without maintaining two protocol stacks
- Better fit for service mesh (mTLS, load balancing, health checking)
What does NOT change when a service is extracted:
- Handler business logic (in packages/services/<domain>) — unchanged
- NATS event publishing/subscribing — unchanged
- Postgres/Redis access patterns — unchanged
Only the call site in cmd/api changes: a direct Go function call becomes a gRPC client call.
The packages/shared/policy PolicyClient interface follows the same pattern (see Architecture §19.7).
Proto files: live in packages/api-schemas/proto/ when Phase-2 extraction begins.
8. What NOT to Do¶
- No synchronous HTTP calls between internal services at MVP — adds latency and failure surface where direct package calls work.
- No per-request Keycloak token introspection — validate JWT locally; introspection is only for logout/revocation flows.
- No query-string tokens — auth material is never passed in URL query params. Browser WebSocket auth uses Sec-WebSocket-Protocol; non-browser clients may use the Authorization header.
- No direct DB access across domain boundaries — each domain package queries only its own tables. Cross-domain data needs go through events or explicit API calls.
- No gRPC before Phase-2 — do not introduce gRPC scaffolding prematurely; wait until a service actually needs to be extracted and independently scaled.