Scalability and Security Watchlist
Purpose:
- Capture non-blocking but important hardening items that should be scheduled before/after launch phases.
- Prevent known risks from getting lost during implementation velocity.
Status semantics:
- open: not started.
- planned: design agreed, implementation pending.
- in_progress: active implementation.
- done: implemented and validated.
Current Watchlist
- Notification delivery durability beyond Redis Pub/Sub
  - Status: open
  - Why: Redis Pub/Sub is ephemeral and does not provide per-user persistence/read-state guarantees.
  - Current baseline: NATS -> notification-relay -> Redis Pub/Sub -> WS fanout.
  - Future target:
    - Add persistent notification store (Postgres/object log) with retention policy.
    - Add read/dismiss tracking and replay for reconnecting clients.
    - Keep Pub/Sub as low-latency fanout path.
  - Owners: Platform + Notification service owner.
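
The future target for notification durability (persist first, track read state, replay on reconnect) could look roughly like the sketch below. It uses an in-memory map as a stand-in for the Postgres/object-log store; every type and method name here (`notificationStore`, `Append`, `MarkRead`, `ReplaySince`) is hypothetical and not taken from the codebase.

```go
package main

import "sync"

// notification is a hypothetical persisted record; field names are assumptions.
type notification struct {
	Seq    int64 // monotonically increasing, used as the replay cursor
	UserID string
	Body   string
	Read   bool
}

// notificationStore is an in-memory stand-in for the persistent store
// (Postgres/object log) named in the future target.
type notificationStore struct {
	mu      sync.Mutex
	nextSeq int64
	byUser  map[string][]*notification
}

func newNotificationStore() *notificationStore {
	return &notificationStore{byUser: make(map[string][]*notification)}
}

// Append persists a notification and returns its sequence number. A caller
// would publish to Pub/Sub only after this succeeds.
func (s *notificationStore) Append(userID, body string) int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.nextSeq++
	n := &notification{Seq: s.nextSeq, UserID: userID, Body: body}
	s.byUser[userID] = append(s.byUser[userID], n)
	return n.Seq
}

// MarkRead records read state and reports whether the notification was found.
func (s *notificationStore) MarkRead(userID string, seq int64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, n := range s.byUser[userID] {
		if n.Seq == seq {
			n.Read = true
			return true
		}
	}
	return false
}

// ReplaySince returns copies of every notification for the user with a
// sequence number greater than lastSeq, oldest first.
func (s *notificationStore) ReplaySince(userID string, lastSeq int64) []notification {
	s.mu.Lock()
	defer s.mu.Unlock()
	var out []notification
	for _, n := range s.byUser[userID] {
		if n.Seq > lastSeq {
			out = append(out, *n)
		}
	}
	return out
}
```

A reconnecting client would send the last sequence number it saw and receive the output of `ReplaySince`, while live delivery continues over the existing Pub/Sub fanout path.
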
- Data growth guardrails (usage/ledger/audit)
  - Status: in_progress
  - Why: High-growth tables can degrade query performance and operational recovery.
  - Current baseline: partition strategy documented at architecture level.
  - Future target:
    - Define row-count/size triggers to activate partitioning.
    - Add runbook automation for partition creation/archival/retention.
    - Add dashboard alerts on growth and vacuum/maintenance lag.
  - Progress:
    - baseline guard script added: scripts/ops/data_growth_check.sh with row/size thresholds for usage_records, ledger_entries, and audit_logs.
    - make target added: make ops-data-growth-check.
  - Owners: Platform + Infra/SRE.
- Security key management and rotation runbooks
  - Status: in_progress
  - Why: JWT/terminal/KMS key rotation needs deterministic operational procedures.
  - Current baseline: architecture direction documented.
  - Future target:
    - Rotation cadence and break-glass procedure.
    - Key compromise response path with timeline targets.
    - Validation checklist for JWKS and terminal token signer rotation.
  - Progress:
    - added unified runbook: doc/operations/runbooks/Key_Rotation_and_Compromise_Response_Runbook.md.
    - break-glass JWKS path already wired via POST /internal/auth/jwks/refresh.
  - Owners: Security + Platform.
- WS token replay/concurrency hardening tests
  - Status: in_progress
  - Why: single-use token semantics must hold under race/concurrency conditions.
  - Current baseline: contract and architecture specify short-lived single-use tokens.
  - Future target:
    - Add concurrency tests for duplicate WS connect attempts.
    - Add metrics/alerts for token replay rejection rates.
  - Progress:
    - terminal service now has concurrent consume race test proving only one GETDEL consume succeeds and all competing consumes fail with ErrTokenInvalid.
    - terminal service now exposes snapshot counters for consumed_ok and replay_rejected to support replay anomaly alerting.
  - Owners: Backend + QA.
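
The single-use property in the item above can be illustrated with a small race harness. This is a sketch only: `tokenStore` is a hypothetical in-memory stand-in whose `Consume` mimics the read-and-delete atomicity of Redis GETDEL, and the `ErrTokenInvalid` definition here is an assumption that merely reuses the error name from the progress notes.

```go
package main

import (
	"errors"
	"sync"
)

// ErrTokenInvalid reuses the error name from the watchlist; this definition
// is an assumption for illustration.
var ErrTokenInvalid = errors.New("token invalid or already consumed")

// tokenStore is a hypothetical in-memory stand-in for Redis.
type tokenStore struct {
	mu     sync.Mutex
	tokens map[string]string // token -> user ID
}

func newTokenStore() *tokenStore {
	return &tokenStore{tokens: make(map[string]string)}
}

// Issue registers a single-use token for a user.
func (s *tokenStore) Issue(token, userID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.tokens[token] = userID
}

// Consume reads and deletes the token in one atomic step, like GETDEL.
func (s *tokenStore) Consume(token string) (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	userID, found := s.tokens[token]
	if !found {
		return "", ErrTokenInvalid
	}
	delete(s.tokens, token)
	return userID, nil
}

// raceConsume fires n concurrent consumes of the same token and reports how
// many succeeded and how many were rejected as replays.
func raceConsume(s *tokenStore, token string, n int) (succeeded, rejected int) {
	var wg sync.WaitGroup
	var mu sync.Mutex
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, err := s.Consume(token)
			mu.Lock()
			defer mu.Unlock()
			if err == nil {
				succeeded++
			} else if errors.Is(err, ErrTokenInvalid) {
				rejected++
			}
		}()
	}
	wg.Wait()
	return succeeded, rejected
}
```

The two counts returned by `raceConsume` correspond loosely to the `consumed_ok` and `replay_rejected` counters mentioned above; with 50 competing goroutines the expected split is 1 and 49.
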
- Abuse controls beyond RPM
  - Status: planned
  - Why: single RPM limit is insufficient for mixed endpoint classes and abuse patterns.
  - Current baseline: policy-driven per-route-group limits implemented.
  - Future target:
    - Add burst controls and per-IP heuristics.
    - Add anomaly thresholds for auth/payment/terminal endpoints.
    - Add SOC-facing signals for automated blocking decisions.
  - Owners: Security + Backend.
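
One common way to add burst control on top of a per-minute limit is a token bucket keyed by route group and client IP. The sketch below is an assumption about how that could be shaped, not the planned design: `tokenBucket`, `keyedLimiter`, and the `"group|ip"` key format are all hypothetical, and timestamps are passed in as milliseconds to keep the logic deterministic.

```go
package main

// tokenBucket allows up to burst requests at once and refills at ratePerSec
// tokens per second. It is not safe for concurrent use; a real limiter would
// add locking or keep this state in Redis.
type tokenBucket struct {
	ratePerSec float64
	burst      float64
	tokens     float64
	lastMs     int64
}

func newTokenBucket(ratePerSec float64, burst int, nowMs int64) *tokenBucket {
	return &tokenBucket{
		ratePerSec: ratePerSec,
		burst:      float64(burst),
		tokens:     float64(burst),
		lastMs:     nowMs,
	}
}

// Allow reports whether one request may proceed at time nowMs.
func (b *tokenBucket) Allow(nowMs int64) bool {
	if nowMs > b.lastMs {
		// Refill in proportion to elapsed time, capped at the burst size.
		b.tokens += float64(nowMs-b.lastMs) / 1000 * b.ratePerSec
		if b.tokens > b.burst {
			b.tokens = b.burst
		}
		b.lastMs = nowMs
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}

// keyedLimiter keeps one bucket per key, for example "auth|203.0.113.7".
type keyedLimiter struct {
	ratePerSec float64
	burst      int
	buckets    map[string]*tokenBucket
}

func newKeyedLimiter(ratePerSec float64, burst int) *keyedLimiter {
	return &keyedLimiter{ratePerSec: ratePerSec, burst: burst, buckets: make(map[string]*tokenBucket)}
}

// Allow creates the bucket for key on first use and then consumes a token.
func (k *keyedLimiter) Allow(key string, nowMs int64) bool {
	b, found := k.buckets[key]
	if !found {
		b = newTokenBucket(k.ratePerSec, k.burst, nowMs)
		k.buckets[key] = b
	}
	return b.Allow(nowMs)
}
```

Rejections from `Allow` are also a natural place to increment the anomaly counters that the SOC-facing signals would read.
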
Scheduling Guidance
- Before public beta:
  - Item 3 (key management/rotation)
  - Item 4 (WS replay/concurrency tests)
- First scale milestone (>= 10M usage rows or equivalent load):
  - Item 2 (data growth guardrails)
- Post-beta reliability enhancement:
  - Item 1 (persistent notifications)
  - Item 5 (advanced abuse controls)
Accepted MVP Tradeoffs (Revisit Triggers)
These are intentional MVP decisions. They are acceptable now, but must be re-evaluated at the listed trigger to avoid future scaling/extensibility/security constraints.
- Single control-plane API binary (cmd/api)
  - Trigger: first domain extraction candidate or sustained saturation of one domain path.
  - Revisit: split high-load domains into independent deployables behind stable contracts.
- Service mesh deferred (Envoy/Istio)
  - Trigger: multiple independently deployed internal services with complex east/west policy needs.
  - Revisit: adopt mesh when platform-native controls are no longer sufficient.
- Notification delivery is best-effort (Redis Pub/Sub fanout)
  - Trigger: product requires reliable inbox/replay/read-state or support tickets show missed alerts.
  - Revisit: add persistent notification store and replay semantics.
- Policy evaluation is DB-direct at MVP
  - Trigger: duplicated policy-resolution logic/caching drift across services.
  - Revisit: extract dedicated Policy Service behind existing PolicyClient, and adopt OPA/OPAL in the same step for distributed policy propagation.
- API key auth deferred
  - Trigger: CLI/automation demand exceeds browser-only workflows.
  - Revisit: add API key issuance/rotation/revocation with resolver-chain integration.
- MVP scope constraints (single-region runtime, scheduler backends deferred)
  - Trigger: enterprise onboarding requiring multi-region/scheduler integration.
  - Revisit: activate additive Phase-2 components without public contract breaks.
- Dedicated terminal WS runtime in cmd/terminal-gateway (Option C)
  - Trigger: pre-production hardening gate before public launch.
  - Revisit: continue gateway hardening (strict ingress/egress policy, saturation controls, and incident drill evidence) and remove any legacy assumptions about API-hosted WS paths.
References:
- doc/governance/Assumptions_Register.md
- doc/operations/Parallel_Ops_Track.md
Pre-Beta Hardening Additions (Captured)
- Policy cache invalidation across pods
  - Status: in_progress
  - Why: 60s local cache can produce inconsistent enforcement after policy updates.
  - Target: publish/subscribe invalidation (policy.invalidate.<key>) and immediate local eviction.
  - Progress:
    - policy cache invalidation methods added: PostgresClient.Invalidate(key) and InvalidateAll().
    - API process now runs Redis pub/sub subscriber on policy.invalidate.* and invalidates local cache immediately.
    - invalidation message parsing tests added in cmd/api/policy_invalidation_test.go.
  - Remaining:
    - wire publisher on admin policy update path when policy management APIs are implemented.
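
The subscriber side of the invalidation flow above can be sketched as follows, assuming the policy key is carried as the suffix of the `policy.invalidate.<key>` channel name. The `policyCache` type and `HandleMessage` method are hypothetical and only stand in for the real cache behind `PostgresClient.Invalidate(key)`.

```go
package main

import (
	"strings"
	"sync"
)

// invalidatePrefix matches the channel pattern named in the watchlist.
const invalidatePrefix = "policy.invalidate."

// policyCache is a hypothetical local cache of resolved policy values.
type policyCache struct {
	mu      sync.RWMutex
	entries map[string]string
}

func newPolicyCache() *policyCache {
	return &policyCache{entries: make(map[string]string)}
}

func (c *policyCache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = value
}

func (c *policyCache) Get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, found := c.entries[key]
	return v, found
}

// Invalidate evicts one key so the next read falls through to the database.
func (c *policyCache) Invalidate(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.entries, key)
}

// keyFromChannel extracts <key> from "policy.invalidate.<key>".
func keyFromChannel(channel string) (string, bool) {
	if !strings.HasPrefix(channel, invalidatePrefix) {
		return "", false
	}
	key := strings.TrimPrefix(channel, invalidatePrefix)
	return key, key != ""
}

// HandleMessage evicts the key named by the channel and reports whether the
// message was a well-formed invalidation.
func (c *policyCache) HandleMessage(channel string) bool {
	key, valid := keyFromChannel(channel)
	if !valid {
		return false
	}
	c.Invalidate(key)
	return true
}
```

Malformed channel names are ignored rather than treated as a full flush, which keeps a noisy publisher from emptying every pod's cache.
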
- Encryption envelope specification
  - Status: in_progress
  - Why: _enc fields need one canonical format and rotation strategy to avoid ad-hoc implementations.
  - Progress:
    - doc/architecture/Encryption_Envelope_Spec.md added with canonical envelope shape and rotation rules.
    - packages/shared/crypto/envelope.go + tests added (AES-256-GCM envelope helper).
    - provisioning worker now uses shared envelope helper when writing allocations.ssh_private_key_enc.
  - Remaining:
    - lock provider-specific KMS authn/authz constraints and secure command execution policy for any remaining app-layer key fetch surfaces.
    - wire helper into any future storage/scheduler credential material paths that use _enc fields.
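
For readers unfamiliar with the pattern, an AES-256-GCM envelope with a key identifier for rotation might look like the sketch below. The canonical shape lives in Encryption_Envelope_Spec.md and is not reproduced here: the JSON field names, the `seal`/`unseal` function names, and the choice to bind the key ID as additional authenticated data are all assumptions for illustration.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/json"
	"errors"
	"io"
)

// envelope is a hypothetical shape, not the canonical one from the spec.
type envelope struct {
	KeyID      string `json:"key_id"` // selects the key, which enables rotation
	Nonce      []byte `json:"nonce"`
	Ciphertext []byte `json:"ciphertext"`
}

// newGCM builds an AES-256-GCM AEAD and rejects keys of any other length.
func newGCM(key []byte) (cipher.AEAD, error) {
	if len(key) != 32 {
		return nil, errors.New("AES-256 requires a 32-byte key")
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// seal encrypts plaintext under key and returns the JSON-encoded envelope.
func seal(keyID string, key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
		return nil, err
	}
	// The key ID is authenticated so it cannot be swapped undetected.
	ct := gcm.Seal(nil, nonce, plaintext, []byte(keyID))
	return json.Marshal(envelope{KeyID: keyID, Nonce: nonce, Ciphertext: ct})
}

// unseal looks up the key by ID and decrypts the envelope.
func unseal(keys map[string][]byte, sealed []byte) ([]byte, error) {
	var env envelope
	if err := json.Unmarshal(sealed, &env); err != nil {
		return nil, err
	}
	key, found := keys[env.KeyID]
	if !found {
		return nil, errors.New("unknown key ID")
	}
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(env.Nonce) != gcm.NonceSize() {
		return nil, errors.New("invalid nonce length")
	}
	return gcm.Open(nil, env.Nonce, env.Ciphertext, []byte(env.KeyID))
}
```

Rotation under this shape means adding a new entry to the key map and sealing new values under the new ID, while old envelopes remain readable until they are re-encrypted.
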
- Rate-limit fail-open observability
  - Status: done
  - Why: Redis outages silently disable app-layer limits.
  - Target: metrics + alerts for fail-open events; document WAF compensating control.
  - Progress:
    - rate limiter now tracks fail-open occurrences via RateLimiter.Snapshot().FailOpenCount.
    - unit test added for Redis-unavailable fail-open path with counter increment.
    - API now exports fail-open metrics via GET /metrics (api_ratelimit_fail_open_total) and secured JSON stats via GET /api/v1/internal/stats.
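
The fail-open behavior and its counter can be sketched like this. The `counterStore` interface, `memStore`, and `downStore` are hypothetical test doubles; only the idea of a `Snapshot().FailOpenCount` accessor comes from the progress notes, and the real `RateLimiter` may differ.

```go
package main

import (
	"errors"
	"sync/atomic"
)

// counterStore abstracts the Redis counter; the interface is hypothetical.
type counterStore interface {
	Incr(key string) (int64, error)
}

type rateLimiter struct {
	failOpenCount int64 // accessed atomically
	store         counterStore
	limit         int64
}

// snapshot mirrors the accessor shape mentioned in the progress notes.
type snapshot struct {
	FailOpenCount int64
}

func newRateLimiter(store counterStore, limit int64) *rateLimiter {
	return &rateLimiter{store: store, limit: limit}
}

// Allow admits the request when the store is unreachable (fail-open) and
// counts the event so the outage is visible in metrics.
func (l *rateLimiter) Allow(key string) bool {
	n, err := l.store.Incr(key)
	if err != nil {
		atomic.AddInt64(&l.failOpenCount, 1)
		return true
	}
	return n <= l.limit
}

func (l *rateLimiter) Snapshot() snapshot {
	return snapshot{FailOpenCount: atomic.LoadInt64(&l.failOpenCount)}
}

// memStore is a healthy single-goroutine store for the example.
type memStore struct {
	counts map[string]int64
}

func newMemStore() *memStore {
	return &memStore{counts: make(map[string]int64)}
}

func (m *memStore) Incr(key string) (int64, error) {
	m.counts[key]++
	return m.counts[key], nil
}

// downStore simulates a Redis outage.
type downStore struct{}

func (downStore) Incr(string) (int64, error) {
	return 0, errors.New("redis unavailable")
}
```

An alert on a non-zero rate of this counter is what turns a silent Redis outage into a visible one, with the WAF acting as the compensating control in the meantime.
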
- JWKS compromise break-glass
  - Status: done
  - Why: key-compromise response path is time-sensitive.
  - Target: runbook with forced JWKS refresh and emergency key-rotation procedure.
  - Progress:
    - emergency runbook added: doc/operations/runbooks/JWKS_Compromise_Breakglass_Runbook.md.
    - auth resolver now exposes JWKSAuth.ForceRefresh(ctx) hook for incident tooling paths.
    - API now exposes authenticated internal trigger POST /internal/auth/jwks/refresh (enabled by INTERNAL_JWKS_REFRESH_TOKEN) to invoke force refresh on demand.
- Node probe SSRF guardrails
  - Status: in_progress
  - Why: admin probe can otherwise target sensitive internal addresses.
  - Target: allowlist GPU node CIDRs and block metadata/internal reserved ranges.
  - Progress:
    - inventory service now validates probe targets before dial (packages/services/inventory/service.go).
    - blocked by default: loopback, unspecified, multicast, link-local, and metadata endpoint 169.254.169.254.
    - optional CIDR allowlist enforced via NODE_PROBE_ALLOWED_CIDRS.
    - API handlers map denied targets to 400 for admin node create/probe flows.
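
The deny rules listed above map fairly directly onto the standard library's `net/netip` package. The sketch below is an illustration under assumptions, not the inventory service's code: `validateProbeTarget`, `parseAllowlist`, and `ErrProbeTargetDenied` are hypothetical names, and only IP literals are handled (a hostname would need to be resolved and the resolved address validated before dialing).

```go
package main

import (
	"errors"
	"net/netip"
)

// ErrProbeTargetDenied is a hypothetical sentinel; handlers could map it to 400.
var ErrProbeTargetDenied = errors.New("probe target denied")

// metadataAddr is the cloud metadata endpoint named in the watchlist.
var metadataAddr = netip.MustParseAddr("169.254.169.254")

// parseAllowlist parses CIDR strings such as those that might be supplied
// through NODE_PROBE_ALLOWED_CIDRS.
func parseAllowlist(cidrs []string) ([]netip.Prefix, error) {
	out := make([]netip.Prefix, 0, len(cidrs))
	for _, c := range cidrs {
		p, err := netip.ParsePrefix(c)
		if err != nil {
			return nil, err
		}
		out = append(out, p.Masked())
	}
	return out, nil
}

// validateProbeTarget rejects blocked address classes first, then enforces
// the allowlist when one is configured.
func validateProbeTarget(raw string, allow []netip.Prefix) error {
	addr, err := netip.ParseAddr(raw)
	if err != nil {
		return ErrProbeTargetDenied
	}
	// Unmap so that IPv4-mapped IPv6 forms cannot bypass the IPv4 checks.
	addr = addr.Unmap()
	if addr.IsLoopback() || addr.IsUnspecified() || addr.IsMulticast() ||
		addr.IsLinkLocalUnicast() || addr.IsLinkLocalMulticast() ||
		addr == metadataAddr {
		return ErrProbeTargetDenied
	}
	if len(allow) == 0 {
		return nil // the allowlist is optional
	}
	for _, p := range allow {
		if p.Contains(addr) {
			return nil
		}
	}
	return ErrProbeTargetDenied
}
```

Running the default-deny checks before the allowlist means a misconfigured CIDR that happens to cover a loopback or link-local range still cannot open those targets.
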
- Idempotency response-body sanitization
  - Status: in_progress
  - Why: cached replay bodies may carry PII.
  - Target: sanitize before persistence or store bounded allowlisted subset.
  - Progress:
    - idempotency middleware now sanitizes JSON response bodies before persisting idempotency_keys.response_body.
    - invalid/non-JSON response bodies are skipped (fail-safe, no raw payload persistence).
    - tests added for sensitive-field redaction and invalid payload behavior.
    - counters added via IdempotencySnapshot() for persisted JSON bodies, skipped-empty, skipped-non-JSON, and replay-served totals.
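
A redaction pass of the kind described above can be sketched with `encoding/json` alone. The list of sensitive field names and the `[REDACTED]` placeholder are assumptions, as are the function names; the fail-safe rule (skip anything that does not parse as JSON) follows the progress notes.

```go
package main

import (
	"encoding/json"
	"strings"
)

// sensitiveKeys is a hypothetical denylist; the real field list may differ.
var sensitiveKeys = map[string]bool{
	"password":        true,
	"token":           true,
	"access_token":    true,
	"email":           true,
	"ssh_private_key": true,
}

// redact walks decoded JSON and replaces the value of any sensitive key.
func redact(v any) any {
	switch t := v.(type) {
	case map[string]any:
		for k, val := range t {
			if sensitiveKeys[strings.ToLower(k)] {
				t[k] = "[REDACTED]"
			} else {
				t[k] = redact(val)
			}
		}
		return t
	case []any:
		for i, val := range t {
			t[i] = redact(val)
		}
		return t
	default:
		return v
	}
}

// sanitizeBody returns the redacted JSON and true, or false when the body is
// empty or not valid JSON, in which case nothing should be persisted.
func sanitizeBody(body []byte) ([]byte, bool) {
	var parsed any
	if err := json.Unmarshal(body, &parsed); err != nil {
		return nil, false
	}
	out, err := json.Marshal(redact(parsed))
	if err != nil {
		return nil, false
	}
	return out, true
}
```

A denylist like this is the simpler of the two options in the target; storing a bounded allowlisted subset is stricter, since fields added later are excluded by default.
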
- Notification channel namespace extensibility
  - Status: done
  - Why: user-only channels limit future org/system broadcast patterns.
  - Target: define channel constructors for user/org/broadcast namespaces.
  - Progress:
    - channel constructors implemented in packages/services/notification/channels.go:
      - UserChannel(userID)
      - OrgChannel(orgID)
      - BroadcastChannel()
    - constructor behavior covered in packages/services/notification/transform_test.go.
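
To show why namespaced constructors help, here is a sketch of how the three constructors could be combined when a client subscribes. The constructor names come from the progress notes, but the channel string format (`notifications:user:<id>` and so on) and the `subscriptionsFor` helper are assumptions; the real formats live in channels.go.

```go
package main

// channelPrefix and the ":"-separated layout are assumed for illustration.
const channelPrefix = "notifications"

// UserChannel names the channel for one user's notifications.
func UserChannel(userID string) string {
	return channelPrefix + ":user:" + userID
}

// OrgChannel names the channel shared by all members of an organization.
func OrgChannel(orgID string) string {
	return channelPrefix + ":org:" + orgID
}

// BroadcastChannel names the single system-wide channel.
func BroadcastChannel() string {
	return channelPrefix + ":broadcast"
}

// subscriptionsFor lists every channel a connection would subscribe to:
// the user's own channel, one per organization, and the broadcast channel.
func subscriptionsFor(userID string, orgIDs []string) []string {
	channels := make([]string, 0, len(orgIDs)+2)
	channels = append(channels, UserChannel(userID))
	for _, orgID := range orgIDs {
		channels = append(channels, OrgChannel(orgID))
	}
	return append(channels, BroadcastChannel())
}
```

Keeping the string layout behind constructors means a later namespace can be added without touching callers that build channel names.
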
- Scheduler metadata encryption rule
  - Status: in_progress
  - Why: future scheduler credentials could leak if stored plaintext.
  - Target: mandate envelope encryption for credential material in scheduler_metadata.
  - Progress:
    - allocation create path now envelope-encrypts scheduler_request into allocations.scheduler_metadata.scheduler_request_enc.
    - ERD and schema notes now explicitly require envelope-encryption for credential-bearing scheduler metadata.
  - Remaining:
    - enforce equivalent envelope handling on all future scheduler-adapter write paths (slurm/k8s/ray workers).
- Temporal execution-path parity
  - Status: open
  - Why: differing local/prod scheduler paths increase drift risk.
  - Target: run billing schedule through Temporal locally and in production.
- Outbox payload data minimization
  - Status: in_progress
  - Why: outbox may contain sensitive payload fields.
  - Target: prefer IDs over rich payloads and enforce encryption-at-rest controls.
  - Progress:
    - added CI guard scripts/ci/outbox_payload_guard.sh (wired through contracts_validate.sh) to block secret/token-like fields in event payload contracts.
  - Remaining:
    - continue tightening payload schemas toward ID-first patterns where full host/user context is not required.
- Storage path-safety algorithm lock
  - Status: done
  - Why: traversal prevention must be deterministic and testable before coding.
  - Progress:
    - packages/shared/storagepath/path.go codifies namespace-rooted filepath.Clean + prefix-check enforcement.
    - packages/shared/storagepath/path_test.go covers success, normalization, absolute-path reject, and traversal reject.
  - Remaining:
    - none for MVP baseline; keep enforcing this helper in future storage refactors.
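
The `filepath.Clean` plus prefix-check approach named above can be sketched in a few lines. This is an illustration of the technique rather than the contents of path.go: `resolvePath` and `ErrUnsafePath` are hypothetical names, and the example assumes Unix-style paths.

```go
package main

import (
	"errors"
	"path/filepath"
	"strings"
)

// ErrUnsafePath is a hypothetical sentinel for rejected paths.
var ErrUnsafePath = errors.New("unsafe storage path")

// resolvePath joins a user-supplied relative path onto the namespace root and
// rejects anything that does not stay strictly inside that root.
func resolvePath(namespaceRoot, userPath string) (string, error) {
	if userPath == "" || filepath.IsAbs(userPath) {
		return "", ErrUnsafePath
	}
	root := filepath.Clean(namespaceRoot)
	full := filepath.Clean(filepath.Join(root, userPath))
	// Compare against root plus a separator so that a sibling such as
	// "/data/ns1-evil" does not pass a bare "/data/ns1" prefix check.
	if !strings.HasPrefix(full, root+string(filepath.Separator)) {
		return "", ErrUnsafePath
	}
	return full, nil
}
```

For example, `resolvePath("/data/ns1", "a/../b/file.txt")` normalizes to `/data/ns1/b/file.txt`, while `../ns2/x` and `/etc/passwd` are rejected. Symlink resolution is outside the scope of this sketch.
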
- Browser token storage hardening (sessionStorage -> httpOnly cookie session)
  - Status: open
  - Why: browser-accessible token storage raises XSS blast radius and weakens central session controls.
  - Target: migrate web auth to server-managed httpOnly/sameSite secure cookie session (or equivalent BFF token handling) before production launch.
  - Progress:
    - current implementation keeps access token in browser session storage for MVP velocity.
  - Remaining:
    - define migration plan and acceptance tests for cookie-based auth flow and logout revocation behavior.
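
As a reference point for the migration plan, the cookie attributes in the target translate to Go's `net/http` as shown below. The cookie name, the helper names, and the choice of `SameSiteStrictMode` over Lax are assumptions; the session ID is assumed to be an opaque server-side identifier rather than the access token itself.

```go
package main

import "net/http"

// sessionCookieName is a hypothetical name for the session cookie.
const sessionCookieName = "session"

// newSessionCookie builds a cookie that page scripts cannot read (HttpOnly),
// that is sent only over HTTPS (Secure), and that is withheld from
// cross-site requests (SameSite).
func newSessionCookie(sessionID string, ttlSeconds int) *http.Cookie {
	return &http.Cookie{
		Name:     sessionCookieName,
		Value:    sessionID,
		Path:     "/",
		MaxAge:   ttlSeconds,
		HttpOnly: true,
		Secure:   true,
		SameSite: http.SameSiteStrictMode,
	}
}

// clearSessionCookie builds the cookie a logout handler would set. A negative
// MaxAge tells the browser to delete the cookie immediately; the server-side
// session must still be revoked separately.
func clearSessionCookie() *http.Cookie {
	return &http.Cookie{
		Name:     sessionCookieName,
		Value:    "",
		Path:     "/",
		MaxAge:   -1,
		HttpOnly: true,
		Secure:   true,
		SameSite: http.SameSiteStrictMode,
	}
}
```

A login handler would call `http.SetCookie(w, newSessionCookie(id, 3600))`, and the acceptance tests mentioned above could assert on these attributes in the `Set-Cookie` response header.
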
- Persistent user SSH private-key storage removal
  - Status: open
  - Why: storing user-access private keys server-side increases blast radius and key compromise impact.
  - Current baseline: public /api/v1/allocations/{id}/ssh-key endpoint removed from contract; runtime/path cleanup remains in progress under A-P7-005.
  - Target:
    - one-time key delivery model and/or user-managed public-key model.
    - control plane stores public keys, fingerprints, and metadata only for steady-state.
    - terminal/provisioning runtime paths do not depend on persistent user private-key retrieval.
  - Execution mode: pre-launch cutover (no backward-compatibility migration window required).
  - Owners: Security + Provisioning + Terminal.
- Queue acceptance-check execution evidence enforcement
  - Status: open
  - Why: queue currently validates acceptance-check syntax and done-commit lineage, but does not execute each task's acceptance checks as part of done-state enforcement.
  - Target:
    - add CI gate that executes task acceptance_checks for tasks moved to done and records pass/fail evidence.
    - require evidence link or artifact reference in queue metadata before final done-state acceptance.
  - Trigger: activate before introducing reviewer agent or before enabling multi-lane (V2) parallel execution.
  - Owners: Governance + CI maintainers.