
Inter-Service Communication Patterns

Decisions

| Concern | MVP decision | Phase-2 evolution |
|---|---|---|
| Domain-to-domain calls within API server | Direct Go package calls (no network hop) | gRPC when domain extracted to standalone binary |
| JWT validation per request | Local validation via cached JWKS (no per-request Keycloak call) | Same — JWKS rotation handled via background refresh |
| Async domain events | NATS JetStream (see NATS_Stream_Config.md) | Same |
| Notification bridge (events -> WS) | Worker relay publishes user-scoped notifications to Redis Pub/Sub; cmd/api WS hub subscribes Redis | Optional persistent notifications store + fanout service |
| Workflow orchestration | Temporal Go SDK from cmd/api and workers | Same |
| Policy cache invalidation | Redis Pub/Sub policy.invalidate.* subscriber in cmd/api evicts local PolicyClient cache entries | Dedicated policy service pushes invalidation events cluster-wide |
| UI read-model cache | Redis short-TTL JSON cache via packages/shared/readcache; source of truth remains domain services/Postgres | Event-updated read-model tables with Redis hot cache and explicit invalidation |
| Terminal token validation | Redis atomic single-use consume (GETDEL) by terminal gateway | Signed JWT with local verification |
| Terminal WS deployment topology | Dedicated cmd/terminal-gateway for terminal WS (/ws/terminal/*) with API-owned token/session control plane | Continue hardening gateway isolation, limits, and telemetry without public contract change |
| External: payments provider | Provider interface (disabled / mockstripe / stripe) behind packages/services/payments | Same, with production provider hardening |
| External: GPU nodes | Pull-based node agent over internal mTLS + signed task payloads | Same model, expanded task catalog and PKI automation |
| Data stores | pgx (Postgres), go-redis (Redis), NATS Go client | Same |
| Phase-2 internal protocol | — | gRPC (Protobuf, not REST — see rationale below) |

1. Internal Communication at MVP

cmd/api remains the primary REST control plane and uses direct in-process package calls for most domain orchestration logic. MVP also includes explicit internal HTTP boundaries for node-agent polling/enrollment and terminal stream relay paths.

Browser / SDK / CLI
       │  HTTPS
  cmd/api (single binary)
  ┌────────────────────────────────────┐
  │  shared middleware (auth, OTel,    │
  │  rate-limit, PII scrubber)         │
  ├────────────┬───────────────────────┤
  │ auth/      │ billing/  │ storage/  │  ← direct Go package calls
  │ inventory/ │ payments/ │ admin/    │
  │ provisioning/orchestrator/         │
  │ notification/  terminal/           │
  └────────────────────────────────────┘
       │               │           │
      pgx            go-redis    NATS client
       │               │           │
   Postgres          Redis      JetStream

Workers (cmd/billing-worker, cmd/provisioning-worker) are separate processes and communicate via NATS/Postgres (and may call explicitly documented internal APIs where required by contract). Internal gRPC remains deferred.


2. JWT Validation (Auth Middleware)

cmd/api validates bearer tokens locally on every authenticated request.

Flow:

  1. On startup, packages/shared/middleware fetches the JWKS from Keycloak: GET {KEYCLOAK_ISSUER_URL}/protocol/openid-connect/certs
  2. JWKS is cached in memory with a background refresh every 5 minutes (configurable).
  3. Per request: JWT signature verified against cached JWKS → claims extracted.
  4. No per-request Keycloak call — auth-service is not in the hot path.

Required JWT claims:

| Claim | Type | Usage |
|---|---|---|
| sub | string (UUID) | user_id for all authz checks |
| realm_access.roles | string[] | RBAC: user, admin |
| exp | unix timestamp | Token expiry |
| iss | string | Must match KEYCLOAK_ISSUER_URL |

Custom claims (configured in Keycloak realm via protocol mappers):

| Claim | Type | Usage |
|---|---|---|
| org_id | string (UUID), nullable | Tenant scoping |

Token revocation:

  • Keycloak refresh token revocation is handled at logout (POST /api/v1/auth/logout calls Keycloak's revocation endpoint).
  • Pre-production requirement: admin-role access tokens must support emergency server-side revocation via Redis deny-list keyed by token jti (or equivalent token versioning strategy) before public launch.
  • Non-admin short-lived access tokens may continue with expiry-only revocation at MVP.


3. Async Coordination

All async domain events flow through NATS JetStream. See doc/architecture/NATS_Stream_Config.md for stream definitions and durable consumer catalog.

Pattern: publish-subscribe with durable pull consumers and explicit ack. Workers are the primary subscribers. cmd/api publishes events (via the outbox pattern from Postgres) but does not subscribe to any domain events directly.

Notification delivery bridge (explicit):

  1. Domain workers publish notification-worthy domain events to NATS.
  2. The notification-relay worker subscribes to NATS subjects (e.g. billing/provisioning alerts), transforms each event into a UI notification payload, and publishes it to the Redis channel notify.user.<user_id>.
  3. The cmd/api notification WS hub subscribes to Redis Pub/Sub patterns and fans out to connected browser sessions on WS /ws/notifications.

This preserves the architecture invariant: no direct NATS subscription in cmd/api.
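The relay's transform-and-publish step can be sketched as pure functions; the `DomainEvent` and `UINotification` shapes below are hypothetical stand-ins for the real contract types, and only the channel-naming rule is taken from this document:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DomainEvent is an illustrative shape for a notification-worthy event.
type DomainEvent struct {
	Subject string `json:"subject"` // e.g. "billing.invoice.overdue"
	UserID  string `json:"user_id"`
	Detail  string `json:"detail"`
}

// UINotification is an illustrative UI payload shape.
type UINotification struct {
	Kind   string `json:"kind"`
	Detail string `json:"detail"`
}

// userChannel builds the Redis Pub/Sub channel the relay publishes to
// and the cmd/api WS hub subscribes on.
func userChannel(userID string) string {
	return "notify.user." + userID
}

// toNotification converts a domain event into the JSON payload and the
// target Redis channel.
func toNotification(ev DomainEvent) ([]byte, string, error) {
	b, err := json.Marshal(UINotification{Kind: ev.Subject, Detail: ev.Detail})
	return b, userChannel(ev.UserID), err
}

func main() {
	b, ch, _ := toNotification(DomainEvent{
		Subject: "billing.invoice.overdue", UserID: "42", Detail: "invoice 7 overdue",
	})
	fmt.Println(ch, string(b))
}
```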

3.1 Policy cache invalidation channel

cmd/api runs a Redis Pub/Sub subscriber on:

  • policy.invalidate.*

Message semantics:

  • policy.invalidate.<key> -> evict one policy key from local cache.
  • policy.invalidate.all / policy.invalidate.* / empty suffix -> evict all local policy cache entries.

Implementation:

  • subscriber bootstrap: cmd/api/policy_invalidation.go
  • key parsing and invalidation routing: handlePolicyInvalidationMessage(...)

This gives immediate cross-pod policy consistency while MVP still uses DB-backed PolicyClient.
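The channel-suffix routing above can be sketched as a pure function. `parseInvalidation` is an illustrative name; the real routing lives in handlePolicyInvalidationMessage:

```go
package main

import (
	"fmt"
	"strings"
)

// parseInvalidation maps a Redis channel name to an eviction action per
// the documented semantics: a concrete suffix evicts one key; "all",
// "*", or an empty suffix evicts the whole local policy cache.
func parseInvalidation(channel string) (key string, evictAll bool) {
	suffix := strings.TrimPrefix(channel, "policy.invalidate.")
	switch suffix {
	case "", "all", "*", channel: // empty/all/* suffix, or prefix missing entirely
		return "", true
	default:
		return suffix, false
	}
}

func main() {
	k, all := parseInvalidation("policy.invalidate.billing.max_rate")
	fmt.Println(k, all) // billing.max_rate false
	_, all = parseInvalidation("policy.invalidate.all")
	fmt.Println(all) // true
}
```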

3.2 UI read-model cache

The v3 UI shell and workbenches use dashboard-style summaries that can otherwise fan out across multiple domains on every browser refresh. The approved cache pattern is:

  • explicit OpenAPI read-model endpoint first;
  • handler/service composes from domain owners or read-model tables;
  • Redis stores the sanitized JSON response with a documented short TTL;
  • cache miss recomputes from the source of truth;
  • cache failure fails open for reads and records metrics/logs;
  • mutations continue to update Postgres and audit/outbox first, then invalidate affected read models.

Implementation entry points:

  • Architecture: doc/architecture/UI_Read_Model_Cache_Architecture_v1.md
  • Shared helper: packages/shared/readcache

Do not put authorization decisions, ledger correctness, raw tokens, or private keys in read-model caches.


4. Workflow Coordination (Temporal)

packages/services/provisioning/orchestrator uses the Temporal Go SDK to start and signal workflows. Workers (cmd/provisioning-worker) implement Temporal activities.

cmd/api
  └── provisioning/orchestrator
        └── temporal.Client.ExecuteWorkflow(...)
            Temporal server (localhost:7233 in dev)
        cmd/provisioning-worker
          └── temporal.Worker → activities (SSH setup, node config)

Temporal is used for: provision workflow, release workflow, force-release workflow, and billing accrual scheduling. Use Temporal in local development and production to avoid execution-path divergence.


5. Terminal Token — MVP vs Phase-2

MVP (Day 1): Redis-backed opaque token

Client                     cmd/api                 Redis            Terminal Gateway
  │                            │                     │                     │
  │  POST /allocations/{id}/   │                     │                     │
  │  terminal-token            │                     │                     │
  │ ─────────────────────────► │                     │                     │
  │                            │  SET token →        │                     │
  │                            │  {user_id,alloc_id, │                     │
  │                            │   expiry} TTL 300s  │                     │
  │                            │ ───────────────────►│                     │
  │  {token, expires_in}       │                     │                     │
  │ ◄───────────────────────── │                     │                     │
  │                            │                     │                     │
  │  WS connect + terminal token (Sec-WebSocket-Protocol or Authorization) │
  │                            │                     │                     │
  │ ──────────────────────────────────────────────► │                     │
  │                            │                     │  GETDEL token       │
  │                            │                     │ ◄───────────────── │
  │                            │                     │  {user_id,alloc_id} │
  │                            │                     │ ────────────────── ►│
  │  terminal stream           │                     │                     │
  │ ◄──────────────────────────────────────────────────────────────────── │

Token properties: random 256-bit opaque value, single-use consumed atomically via GETDEL, TTL 300 seconds, Redis key: terminal_token:{token}.
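A sketch of token minting and the single-use consume. An in-memory store stands in for Redis here; with go-redis the consume would be a single client.GetDel call, and the interface and helper names are illustrative:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// Store abstracts the two Redis operations involved; real code sets a
// 300s TTL on Set and uses the atomic GETDEL command for consume.
type Store interface {
	Set(key, val string)
	GetDel(key string) (string, bool)
}

// mintToken generates the random 256-bit opaque terminal token.
func mintToken() (string, error) {
	b := make([]byte, 32) // 256 bits
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

func key(token string) string { return "terminal_token:" + token }

// consume atomically claims the token; a second call for the same token
// fails, which is the single-use guarantee.
func consume(s Store, token string) (string, bool) {
	return s.GetDel(key(token))
}

// memStore is a test double standing in for Redis.
type memStore struct{ m map[string]string }

func (s *memStore) Set(k, v string) { s.m[k] = v }
func (s *memStore) GetDel(k string) (string, bool) {
	v, ok := s.m[k]
	delete(s.m, k)
	return v, ok
}

func main() {
	s := &memStore{m: map[string]string{}}
	tok, _ := mintToken()
	s.Set(key(tok), `{"user_id":"u1","alloc_id":"a1"}`)
	if v, ok := consume(s, tok); ok {
		fmt.Println("first consume:", v)
	}
	if _, ok := consume(s, tok); !ok {
		fmt.Println("second consume rejected (single-use)")
	}
}
```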

5.2 Terminal session semantics (MVP-secure baseline)

  • Normative decision reference: doc/architecture/adrs/ADR-007-terminal-access-auth-model.md

  • terminal.open remains a signed, discrete node task.

  • Terminal stdin/stdout/stderr and close are stream frames on a persistent session channel, not discrete tasks.
  • Session binding is strict: {session_id, user_id, allocation_id, node_id, expires_at}.
  • Single active terminal session per allocation is enforced at open time.
  • Session TTL is independent from token TTL:
      • token TTL (default 300s) gates only the initial open handshake.
      • active session TTL is enforced by policy key terminal.session_max_ttl_seconds (default 14400 / 4h).
  • Reconnect after drop is a full reopen flow (new token mint -> new terminal.open -> new stream). No session resume in MVP.

5.3 Release and failure sequencing rules

  • If allocation release is requested while terminal is active, release wins with deterministic order:
      1. send terminal close frame (allocation_released) and wait for ack/timeout,
      2. dispatch allocation.revoke_user,
      3. continue release workflow completion.
  • If node stream drops mid-session, close session with retryable reason and require full reopen flow.
  • OIDC token expiry does not terminate an already-open session; auth is enforced at session open, and session TTL/close rules govern runtime.

5.3a Terminal execution privilege boundary (mandatory)

Terminal stream transport and node lifecycle/provisioning tasks have different trust models:

  • Provisioning/lifecycle path (allocation.*, node.* tasks):
      • discrete signed tasks,
      • privileged operations allowed by per-task allowlist,
      • full task result audit.
  • Terminal path (terminal.open + stream frames):
      • interactive runtime path only,
      • no lifecycle mutation operations,
      • no package install/user creation/deletion/host reconfiguration side effects.

Enforcement rules:

  • Terminal session bootstrap may perform exactly one privilege transition to the target allocation user.
  • After transition, terminal process runs with allocation-user privileges only.
  • No terminal stream frame may execute privileged host operations.
  • If privilege transition cannot be completed safely, session must fail closed with canonical error envelope and correlation ID.

Post-MVP hardening direction:

  • Split terminal executor from provisioning/lifecycle executor at runtime boundary (separate process/binary or strict module isolation) while preserving the same terminal.open and stream frame contracts.

5.4 Option C terminal gateway hardening baseline

Current baseline is Option C (dedicated cmd/terminal-gateway) with API control-plane authority for token minting/session binding.

Hardening requirements:

  1. Keep token minting in API; gateway validates/consumes terminal tokens and relays traffic.
  2. Restrict network policy:
      • public ingress -> gateway only for /ws/terminal/*
      • gateway egress -> API internal stream + Redis + approved node endpoints only
      • block direct gateway access to payments and unrelated admin surfaces
  3. Enforce dedicated rate limits and connection caps on gateway process.
  4. Maintain gateway-specific telemetry and runbooks (connect failures, replay rejects, session saturation).
  5. Preserve public contracts (POST /api/v1/allocations/{id}/terminal-token, WS /ws/terminal/{allocation_id}) so frontend/SDK do not change.
  6. Keep transport boundary interface (TerminalSessionBroker) between gateway relay logic and session transport to allow future transport swaps without gateway rewrite.

5.5 Option-A Internal Relay Transport (approved)

For TERMINAL_MODE=node_agent_stream, terminal I/O transport uses an internal relay stream:

  • Node-agent endpoint: POST /internal/v1/nodes/{node_id}/terminal/stream
  • AuthN/AuthZ: mTLS (OU=nodes), certificate node identity must match path node_id
  • Wire protocol: JSON line-delimited TerminalStreamFrame objects (see doc/api/asyncapi.draft.yaml)
  • Binding invariants:
      • stream is accepted only for an active broker session bound to {session_id, allocation_id, user_id, node_id}
      • single active terminal session per allocation
      • session TTL enforced server-side by broker policy
      • full correlation and audit coverage on session open/close/error

This endpoint is internal-only and not part of the public API surface.

Cutover approach:

  • Deploy gateway behind feature flag/routing switch.
  • Shadow/soak test under load.
  • Flip terminal WS route from API to gateway at edge.
  • Keep rollback by route switch without contract change.

Phase-2: Signed JWT

cmd/api mints a short-lived JWT signed with its own ECDSA private key (separate from Keycloak's signing key). Terminal gateway validates locally using the API's cached public key.

Claims: { "sub": user_id, "alloc": allocation_id, "scope": "terminal", "exp": now+300, "jti": uuid }. The jti (JWT ID) is stored in Redis for single-use enforcement until expiry — signature and claims validation become local to the gateway, while the Redis jti check preserves the single-use guarantee.


6. External Dependencies

Payments providers (abstraction)

  • Package: packages/services/payments
  • Provider selector: PAYMENTS_PROVIDER env
  • Implementations:
      • disabled: returns ErrNotConfigured
      • mockstripe: deterministic local checkout/portal URLs for contract-faithful local UX
      • stripe: production provider placeholder via Stripe SDK path
  • API handlers remain provider-agnostic through payments.Provider interface.

Stripe-specific notes:

  • Library: github.com/stripe/stripe-go/v76
  • Calls: checkout.Session.New, billingportal.Session.New, webhook.ConstructEvent
  • Raw body preservation: HTTP middleware must buffer the raw request body before any JSON parsing for webhook signature verification (see Coding_Standards.md §Security).

GPU Nodes (Node Agent, MVP)

  • Node-side runtime: cmd/node-agent (pull model).
  • Control-plane dispatch path:
      • provisioning worker writes node_tasks rows and wakeup signal.
      • node agent long-polls /internal/v1/nodes/{id}/tasks/wait.
      • node agent posts execution result to /internal/v1/nodes/{id}/tasks/{task_id}/result.
  • Security model:
      • mTLS node identity (cert profile in doc/architecture/PKI_Spec.md).
      • signed task envelope verified by agent before handler dispatch.
      • allowlisted typed task catalog (no arbitrary shell execution).
  • Enrollment/renewal:
      • enrollment and cert renewal proxied by cmd/api internal node endpoints.
      • node agent does not talk directly to Postgres/Redis/NATS/Temporal.

SSH remains only for user terminal sessions via terminal gateway, not provisioning orchestration.

Credential handling posture:

  • Target production model is no persistent server-side storage of user SSH private keys.
  • Pre-launch cutover policy: remove persistent private-key dependency directly (no compatibility window required before first public launch).

PostgreSQL

  • Library: github.com/jackc/pgx/v5 with connection pooling (pgxpool)
  • Each domain package receives a *pgxpool.Pool via dependency injection at startup.
  • Outbox writes use pgx transactions: domain write + outbox row in the same BEGIN/COMMIT.

Redis

  • Library: github.com/redis/go-redis/v9
  • Used for: rate limiting, terminal token store, JWKS cache (optional), hot-path policy cache.
  • Security requirement: Redis ACLs must isolate notification channels:
      • notification-relay principal: publish to notify.user.*
      • API notification hub principal: subscribe to notify.user.*
      • all other service principals: neither publish nor subscribe on notification channels

NATS

  • Library: github.com/nats-io/nats.go
  • JetStream context used for all publish and subscribe operations.
  • packages/shared/events wraps the client and exposes typed publish/subscribe functions.

7. Phase-2: gRPC for Extracted Services

When a domain is extracted from cmd/api into its own binary, internal calls switch from direct Go function calls to gRPC.

Why gRPC, not REST:

  • Typed contracts via Protobuf — no schema drift between caller and callee
  • Efficient binary encoding on internal network
  • Native streaming support (terminal gateway → provisioning in future)
  • gRPC-Gateway can expose a REST surface if needed without maintaining two protocol stacks
  • Better fit for service mesh (mTLS, load balancing, health checking)

What does NOT change when a service is extracted:

  • Handler business logic (in packages/services/<domain>) — unchanged
  • NATS event publishing/subscribing — unchanged
  • Postgres/Redis access patterns — unchanged

Only the call site in cmd/api changes: a direct Go function call becomes a gRPC client call. The packages/shared/policy PolicyClient interface follows the same pattern (see Architecture §19.7).

Proto files will live in packages/api-schemas/proto/ once Phase-2 extraction begins.


8. What NOT to Do

  • No synchronous HTTP calls between internal services at MVP — adds latency and failure surface where direct package calls work.
  • No per-request Keycloak token introspection — validate JWT locally; introspection is only for logout/revocation flows.
  • No query-string tokens — auth material is never passed in URL query params. Browser WebSocket auth uses Sec-WebSocket-Protocol; non-browser clients may use Authorization header.
  • No direct DB access across domain boundaries — each domain package queries only its own tables. Cross-domain data needs go through events or explicit API calls.
  • No gRPC before Phase-2 — do not introduce gRPC scaffolding prematurely; wait until a service actually needs to be extracted and independently scaled.