Scalability and Security Watchlist

Purpose:

  • Capture non-blocking but important hardening items that should be scheduled before or after launch phases.
  • Prevent known risks from getting lost while implementation moves quickly.

Status semantics:

  • open: not started.
  • planned: design agreed, implementation pending.
  • in_progress: active implementation.
  • done: implemented and validated.

Current Watchlist

  1. Notification delivery durability beyond Redis Pub/Sub
    • Status: open
    • Why: Redis Pub/Sub is ephemeral and does not provide per-user persistence/read-state guarantees.
    • Current baseline: NATS -> notification-relay -> Redis Pub/Sub -> WS fanout.
    • Future target:
      • Add persistent notification store (Postgres/object log) with retention policy.
      • Add read/dismiss tracking and replay for reconnecting clients.
      • Keep Pub/Sub as low-latency fanout path.
    • Owners: Platform + Notification service owner.

  2. Data growth guardrails (usage/ledger/audit)
    • Status: in_progress
    • Why: High-growth tables can degrade query performance and operational recovery.
    • Current baseline: partition strategy documented at architecture level.
    • Future target:
      • Define row-count/size triggers to activate partitioning.
      • Add runbook automation for partition creation/archival/retention.
      • Add dashboard alerts on growth and vacuum/maintenance lag.
    • Progress:
      • baseline guard script added: scripts/ops/data_growth_check.sh with row/size thresholds for usage_records, ledger_entries, and audit_logs.
      • make target added: make ops-data-growth-check.
    • Owners: Platform + Infra/SRE.

  3. Security key management and rotation runbooks
    • Status: in_progress
    • Why: JWT/terminal/KMS key rotation needs deterministic operational procedures.
    • Current baseline: architecture direction documented.
    • Future target:
      • Rotation cadence and break-glass procedure.
      • Key compromise response path with timeline targets.
      • Validation checklist for JWKS and terminal token signer rotation.
    • Progress:
      • added unified runbook: doc/operations/runbooks/Key_Rotation_and_Compromise_Response_Runbook.md.
      • break-glass JWKS path already wired via POST /internal/auth/jwks/refresh.
    • Owners: Security + Platform.

  4. WS token replay/concurrency hardening tests
    • Status: in_progress
    • Why: single-use token semantics must hold under race/concurrency conditions.
    • Current baseline: contract and architecture specify short-lived single-use tokens.
    • Future target:
      • Add concurrency tests for duplicate WS connect attempts.
      • Add metrics/alerts for token replay rejection rates.
    • Progress:
      • terminal service now has concurrent consume race test proving only one GETDEL consume succeeds and all competing consumes fail with ErrTokenInvalid.
      • terminal service now exposes snapshot counters for consumed_ok and replay_rejected to support replay anomaly alerting.
    • Owners: Backend + QA.
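The single-use property behind the WS token item can be illustrated with a small Go sketch. It is not the terminal service's code: `tokenStore` is a hypothetical in-memory stand-in whose `getDel` method mimics the atomic read-and-delete behavior of Redis GETDEL, and the definition of `ErrTokenInvalid` is assumed (only its name comes from the progress note).

```go
package main

import (
	"errors"
	"sync"
)

// ErrTokenInvalid reuses the error name from the progress note; this definition is assumed.
var ErrTokenInvalid = errors.New("token invalid or already consumed")

// tokenStore is a hypothetical in-memory stand-in for the Redis-backed store.
type tokenStore struct {
	mu     sync.Mutex
	tokens map[string]string // token -> session ID
}

// getDel reads and deletes a key in one critical section, like Redis GETDEL.
func (s *tokenStore) getDel(token string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.tokens[token]
	if ok {
		delete(s.tokens, token)
	}
	return v, ok
}

// Consume returns the session bound to a token at most once.
func (s *tokenStore) Consume(token string) (string, error) {
	v, ok := s.getDel(token)
	if !ok {
		return "", ErrTokenInvalid
	}
	return v, nil
}

// raceConsume fires n concurrent consumes of the same token and
// returns how many succeeded and how many were rejected as replays.
func raceConsume(s *tokenStore, token string, n int) (succeeded, rejected int) {
	var wg sync.WaitGroup
	var mu sync.Mutex
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, err := s.Consume(token)
			mu.Lock()
			defer mu.Unlock()
			if err == nil {
				succeeded++
			} else if errors.Is(err, ErrTokenInvalid) {
				rejected++
			}
		}()
	}
	wg.Wait()
	return succeeded, rejected
}
```

Because the read and the delete happen in one atomic step, exactly one of the competing callers can observe the token. The two counts returned by `raceConsume` correspond loosely to the consumed_ok and replay_rejected counters.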

  5. Abuse controls beyond RPM
    • Status: planned
    • Why: single RPM limit is insufficient for mixed endpoint classes and abuse patterns.
    • Current baseline: policy-driven per-route-group limits implemented.
    • Future target:
      • Add burst controls and per-IP heuristics.
      • Add anomaly thresholds for auth/payment/terminal endpoints.
      • Add SOC-facing signals for automated blocking decisions.
    • Owners: Security + Backend.

Scheduling Guidance

  • Before public beta:
    • Item 3 (key management/rotation)
    • Item 4 (WS replay/concurrency tests)
  • First scale milestone (>= 10M usage rows or equivalent load):
    • Item 2 (data growth guardrails)
  • Post-beta reliability enhancement:
    • Item 1 (persistent notifications)
    • Item 5 (advanced abuse controls)

Accepted MVP Tradeoffs (Revisit Triggers)

These are intentional MVP decisions. They are acceptable now, but must be re-evaluated at the listed trigger to avoid future scaling/extensibility/security constraints.

  1. Single control-plane API binary (cmd/api)
    • Trigger: first domain extraction candidate or sustained saturation of one domain path.
    • Revisit: split high-load domains into independent deployables behind stable contracts.

  2. Service mesh deferred (Envoy/Istio)
    • Trigger: multiple independently deployed internal services with complex east/west policy needs.
    • Revisit: adopt mesh when platform-native controls are no longer sufficient.

  3. Notification delivery is best-effort (Redis Pub/Sub fanout)
    • Trigger: product requires reliable inbox/replay/read-state or support tickets show missed alerts.
    • Revisit: add persistent notification store and replay semantics.

  4. Policy evaluation is DB-direct at MVP
    • Trigger: duplicated policy-resolution logic/caching drift across services.
    • Revisit: extract dedicated Policy Service behind existing PolicyClient, and adopt OPA/OPAL in the same step for distributed policy propagation.

  5. API key auth deferred
    • Trigger: CLI/automation demand exceeds browser-only workflows.
    • Revisit: add API key issuance/rotation/revocation with resolver-chain integration.

  6. MVP scope constraints (single-region runtime, scheduler backends deferred)
    • Trigger: enterprise onboarding requiring multi-region/scheduler integration.
    • Revisit: activate additive Phase-2 components without public contract breaks.

  7. Dedicated terminal WS runtime in cmd/terminal-gateway (Option C)
    • Trigger: pre-production hardening gate before public launch.
    • Revisit: continue gateway hardening (strict ingress/egress policy, saturation controls, and incident drill evidence) and remove any legacy assumptions about API-hosted WS paths.

References:

  • doc/governance/Assumptions_Register.md
  • doc/operations/Parallel_Ops_Track.md

Pre-Beta Hardening Additions (Captured)

  1. Policy cache invalidation across pods
    • Status: in_progress
    • Why: 60s local cache can produce inconsistent enforcement after policy updates.
    • Target: publish/subscribe invalidation (policy.invalidate.<key>) and immediate local eviction.
    • Progress:
      • policy cache invalidation methods added: PostgresClient.Invalidate(key) and InvalidateAll().
      • API process now runs Redis pub/sub subscriber on policy.invalidate.* and invalidates local cache immediately.
      • invalidation message parsing tests added in cmd/api/policy_invalidation_test.go.
    • Remaining:
      • wire publisher on admin policy update path when policy management APIs are implemented.
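A minimal sketch of the subscriber-side handling for policy cache invalidation, assuming the policy key is carried as the channel suffix after `policy.invalidate.`. The `localCache` type, the `handleInvalidation` helper, and the use of `*` as an "evict everything" key are hypothetical illustrations, not the actual PostgresClient API.

```go
package main

import (
	"strings"
	"sync"
)

// invalidatePrefix matches the channel namespace named in the target above.
const invalidatePrefix = "policy.invalidate."

// keyFromChannel extracts <key> from "policy.invalidate.<key>".
// It returns false for unrelated channels or an empty key.
func keyFromChannel(channel string) (string, bool) {
	if !strings.HasPrefix(channel, invalidatePrefix) {
		return "", false
	}
	key := strings.TrimPrefix(channel, invalidatePrefix)
	return key, key != ""
}

// localCache is a hypothetical stand-in for the per-pod policy cache.
type localCache struct {
	mu      sync.Mutex
	entries map[string]string
}

// Invalidate evicts a single cached policy.
func (c *localCache) Invalidate(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.entries, key)
}

// InvalidateAll drops every cached policy.
func (c *localCache) InvalidateAll() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries = map[string]string{}
}

// handleInvalidation applies one pub/sub notification to the local cache.
// Treating the key "*" as "evict everything" is an assumed convention.
func handleInvalidation(c *localCache, channel string) bool {
	key, ok := keyFromChannel(channel)
	if !ok {
		return false
	}
	if key == "*" {
		c.InvalidateAll()
	} else {
		c.Invalidate(key)
	}
	return true
}
```

Evicting on receipt rather than waiting for the 60s TTL bounds the inconsistency window to pub/sub delivery latency. The TTL still acts as a backstop if a message is dropped.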

  2. Encryption envelope specification
    • Status: in_progress
    • Why: _enc fields need one canonical format and rotation strategy to avoid ad-hoc implementations.
    • Progress:
      • doc/architecture/Encryption_Envelope_Spec.md added with canonical envelope shape and rotation rules.
      • packages/shared/crypto/envelope.go + tests added (AES-256-GCM envelope helper).
      • provisioning worker now uses shared envelope helper when writing allocations.ssh_private_key_enc.
    • Remaining:
      • lock provider-specific KMS authn/authz constraints and secure command execution policy for any remaining app-layer key fetch surfaces.
      • wire helper into any future storage/scheduler credential material paths that use _enc fields.
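For orientation, an AES-256-GCM envelope helper could look roughly like the sketch below. The `Envelope` fields and the `Seal`/`Open` names are hypothetical; the canonical shape and rotation rules are whatever doc/architecture/Encryption_Envelope_Spec.md defines. Binding the key ID as additional authenticated data is an assumption of this sketch, not a statement about the shared helper.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
)

// Envelope is a hypothetical shape; the canonical format lives in the spec.
type Envelope struct {
	KeyID      string // identifies the key version, which supports rotation
	Nonce      []byte
	Ciphertext []byte // includes the GCM authentication tag
}

// newGCM builds an AES-256-GCM AEAD and rejects keys of any other size.
func newGCM(key []byte) (cipher.AEAD, error) {
	if len(key) != 32 {
		return nil, errors.New("envelope: key must be 32 bytes for AES-256")
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// Seal encrypts plaintext under key and records keyID in the envelope.
func Seal(keyID string, key, plaintext []byte) (Envelope, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return Envelope{}, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return Envelope{}, err
	}
	// The key ID is passed as additional data (assumed design), so an
	// envelope relabeled with a different key ID fails authentication.
	ct := gcm.Seal(nil, nonce, plaintext, []byte(keyID))
	return Envelope{KeyID: keyID, Nonce: nonce, Ciphertext: ct}, nil
}

// Open authenticates and decrypts an envelope.
func Open(env Envelope, key []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	// AEAD.Open panics on a wrong nonce length, so check it first.
	if len(env.Nonce) != gcm.NonceSize() {
		return nil, errors.New("envelope: invalid nonce length")
	}
	return gcm.Open(nil, env.Nonce, env.Ciphertext, []byte(env.KeyID))
}
```

Carrying a key identifier in the envelope is what makes rotation tractable: readers select the decryption key per record, so old and new keys can coexist while data is re-encrypted.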

  3. Rate-limit fail-open observability
    • Status: done
    • Why: Redis outages silently disable app-layer limits.
    • Target: metrics + alerts for fail-open events; document WAF compensating control.
    • Progress:
      • rate limiter now tracks fail-open occurrences via RateLimiter.Snapshot().FailOpenCount.
      • unit test added for Redis-unavailable fail-open path with counter increment.
      • API now exports fail-open metrics via GET /metrics (api_ratelimit_fail_open_total) and secured JSON stats via GET /api/v1/internal/stats.
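The fail-open counting pattern can be sketched as follows. Only the names `RateLimiter`, `Snapshot()`, and `FailOpenCount` come from the progress notes; the `counterStore` interface, the `Allow` method, and the two store stubs are hypothetical and stand in for the Redis-backed implementation.

```go
package main

import (
	"errors"
	"sync/atomic"
)

// counterStore is a hypothetical abstraction over the Redis counter.
type counterStore interface {
	Incr(key string) (int64, error)
}

// RateLimiter allows up to limit requests per key (window handling omitted).
type RateLimiter struct {
	store    counterStore
	limit    int64
	failOpen atomic.Int64
}

// Snapshot is a point-in-time view of limiter counters.
type Snapshot struct {
	FailOpenCount int64
}

// Allow reports whether the request may proceed. If the store is
// unavailable the limiter fails open, but records the event so the
// outage shows up in metrics instead of passing silently.
func (r *RateLimiter) Allow(key string) bool {
	n, err := r.store.Incr(key)
	if err != nil {
		r.failOpen.Add(1)
		return true
	}
	return n <= r.limit
}

// Snapshot returns the current counter values.
func (r *RateLimiter) Snapshot() Snapshot {
	return Snapshot{FailOpenCount: r.failOpen.Load()}
}

// downStore simulates a Redis outage.
type downStore struct{}

func (downStore) Incr(string) (int64, error) {
	return 0, errors.New("store unavailable")
}

// memStore is a healthy in-memory store (not safe for concurrent use).
type memStore struct {
	counts map[string]int64
}

func (m *memStore) Incr(key string) (int64, error) {
	m.counts[key]++
	return m.counts[key], nil
}
```

A monotonically increasing counter like this maps directly onto a Prometheus-style total such as api_ratelimit_fail_open_total, where alerting on a non-zero rate is usually enough.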

  4. JWKS compromise break-glass
    • Status: done
    • Why: key-compromise response path is time-sensitive.
    • Target: runbook with forced JWKS refresh and emergency key-rotation procedure.
    • Progress:
      • emergency runbook added: doc/operations/runbooks/JWKS_Compromise_Breakglass_Runbook.md.
      • auth resolver now exposes JWKSAuth.ForceRefresh(ctx) hook for incident tooling paths.
      • API now exposes authenticated internal trigger POST /internal/auth/jwks/refresh (enabled by INTERNAL_JWKS_REFRESH_TOKEN) to invoke force refresh on demand.

  5. Node probe SSRF guardrails
    • Status: in_progress
    • Why: admin probe can otherwise target sensitive internal addresses.
    • Target: allowlist GPU node CIDRs and block metadata/internal reserved ranges.
    • Progress:
      • inventory service now validates probe targets before dial (packages/services/inventory/service.go).
      • blocked by default: loopback, unspecified, multicast, link-local, and metadata endpoint 169.254.169.254.
      • optional CIDR allowlist enforced via NODE_PROBE_ALLOWED_CIDRS.
      • API handlers map denied targets to 400 for admin node create/probe flows.
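The probe-target checks listed for the SSRF guardrails can be sketched with the standard `net/netip` package. The function names, the `ErrProbeTargetDenied` error, and the comma-separated format assumed for NODE_PROBE_ALLOWED_CIDRS are illustrative guesses; this sketch also accepts only literal IP addresses, whereas a real implementation would need to validate hostnames after resolution as well.

```go
package main

import (
	"errors"
	"net/netip"
	"strings"
)

// ErrProbeTargetDenied is a hypothetical error a handler could map to HTTP 400.
var ErrProbeTargetDenied = errors.New("probe target denied")

// parseAllowlist parses a comma-separated CIDR list (assumed format).
func parseAllowlist(csv string) ([]netip.Prefix, error) {
	var out []netip.Prefix
	for _, part := range strings.Split(csv, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		p, err := netip.ParsePrefix(part)
		if err != nil {
			return nil, err
		}
		out = append(out, p.Masked())
	}
	return out, nil
}

// validateProbeTarget rejects reserved ranges first, then applies the
// optional allowlist. An empty allowlist means "no additional restriction".
func validateProbeTarget(host string, allowed []netip.Prefix) error {
	addr, err := netip.ParseAddr(host)
	if err != nil {
		return ErrProbeTargetDenied // only literal IPs in this sketch
	}
	// Unmap so IPv4-mapped IPv6 forms (::ffff:a.b.c.d) cannot bypass IPv4 checks.
	addr = addr.Unmap()
	// Link-local covers 169.254.0.0/16, including metadata endpoint 169.254.169.254.
	if addr.IsLoopback() || addr.IsUnspecified() || addr.IsMulticast() ||
		addr.IsLinkLocalUnicast() || addr.IsLinkLocalMulticast() {
		return ErrProbeTargetDenied
	}
	if len(allowed) == 0 {
		return nil
	}
	for _, p := range allowed {
		if p.Contains(addr) {
			return nil
		}
	}
	return ErrProbeTargetDenied
}
```

Validating immediately before the dial, on the address actually being connected to, matters more than validating the user-supplied string, because DNS can resolve differently between the check and the connection.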

  6. Idempotency response-body sanitization
    • Status: in_progress
    • Why: cached replay bodies may carry PII.
    • Target: sanitize before persistence or store bounded allowlisted subset.
    • Progress:
      • idempotency middleware now sanitizes JSON response bodies before persisting idempotency_keys.response_body.
      • invalid/non-JSON response bodies are skipped (fail-safe, no raw payload persistence).
      • tests added for sensitive-field redaction and invalid payload behavior.
      • counters added via IdempotencySnapshot() for persisted JSON bodies, skipped-empty, skipped-non-JSON, and replay-served totals.
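A rough sketch of the sanitize-or-skip behavior for idempotency response bodies. The `sanitizeBody` and `redact` names, the set of sensitive keys, and the `[REDACTED]` placeholder are all hypothetical; the middleware's actual field list is not given here.

```go
package main

import (
	"encoding/json"
	"strings"
)

// sensitiveKeys is an illustrative denylist, not the middleware's real one.
var sensitiveKeys = map[string]bool{
	"password":        true,
	"token":           true,
	"access_token":    true,
	"ssh_private_key": true,
	"email":           true,
}

// sanitizeBody returns a redacted copy of a JSON body. The boolean is
// false when the body is empty or not valid JSON, in which case the
// caller should persist nothing rather than the raw payload.
func sanitizeBody(raw []byte) ([]byte, bool) {
	if len(raw) == 0 {
		return nil, false
	}
	var v any
	if err := json.Unmarshal(raw, &v); err != nil {
		return nil, false
	}
	out, err := json.Marshal(redact(v))
	if err != nil {
		return nil, false
	}
	return out, true
}

// redact walks objects and arrays, replacing values of sensitive keys.
func redact(v any) any {
	switch t := v.(type) {
	case map[string]any:
		for k, val := range t {
			if sensitiveKeys[strings.ToLower(k)] {
				t[k] = "[REDACTED]"
			} else {
				t[k] = redact(val)
			}
		}
		return t
	case []any:
		for i := range t {
			t[i] = redact(t[i])
		}
		return t
	default:
		return v
	}
}
```

One caveat of this approach: decoding into `any` turns JSON numbers into float64, so very large integers may lose precision on the round trip. A decoder configured with `UseNumber` avoids that if replayed bodies must be byte-faithful.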

  7. Notification channel namespace extensibility
    • Status: done
    • Why: user-only channels limit future org/system broadcast patterns.
    • Target: define channel constructors for user/org/broadcast namespaces.
    • Progress:
      • channel constructors implemented in packages/services/notification/channels.go:
        • UserChannel(userID)
        • OrgChannel(orgID)
        • BroadcastChannel()
      • constructor behavior covered in packages/services/notification/transform_test.go.

  8. Scheduler metadata encryption rule
    • Status: in_progress
    • Why: future scheduler credentials could leak if stored plaintext.
    • Target: mandate envelope encryption for credential material in scheduler_metadata.
    • Progress:
      • allocation create path now envelope-encrypts scheduler_request into allocations.scheduler_metadata.scheduler_request_enc.
      • ERD and schema notes now explicitly require envelope-encryption for credential-bearing scheduler metadata.
    • Remaining:
      • enforce equivalent envelope handling on all future scheduler-adapter write paths (slurm/k8s/ray workers).

  9. Temporal execution-path parity
    • Status: open
    • Why: differing local/prod scheduler paths increase drift risk.
    • Target: run billing schedule through Temporal locally and in production.

  10. Outbox payload data minimization
    • Status: in_progress
    • Why: outbox may contain sensitive payload fields.
    • Target: prefer IDs over rich payloads and enforce encryption-at-rest controls.
    • Progress:
      • added CI guard scripts/ci/outbox_payload_guard.sh (wired through contracts_validate.sh) to block secret/token-like fields in event payload contracts.
    • Remaining:
      • continue tightening payload schemas toward ID-first patterns where full host/user context is not required.

  11. Storage path-safety algorithm lock
    • Status: done
    • Why: traversal prevention must be deterministic and testable before coding.
    • Progress:
      • packages/shared/storagepath/path.go codifies namespace-rooted filepath.Clean + prefix-check enforcement.
      • packages/shared/storagepath/path_test.go covers success, normalization, absolute-path reject, and traversal reject.
    • Remaining:
      • none for MVP baseline; keep enforcing this helper in future storage refactors.
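The clean-then-prefix-check approach named for storage path safety can be sketched as below. The `resolve` function and `ErrUnsafePath` are hypothetical names, not the contents of path.go, and the sketch assumes the namespace root is an absolute directory other than the filesystem root.

```go
package main

import (
	"errors"
	"path/filepath"
	"strings"
)

// ErrUnsafePath is a hypothetical error for rejected inputs.
var ErrUnsafePath = errors.New("unsafe storage path")

// resolve joins a caller-supplied relative path onto a namespace root
// and rejects anything that would land outside that root.
func resolve(root, rel string) (string, error) {
	if rel == "" || filepath.IsAbs(rel) {
		return "", ErrUnsafePath
	}
	cleanRoot := filepath.Clean(root)
	// Join applies Clean, which collapses "." and ".." segments lexically.
	full := filepath.Join(cleanRoot, rel)
	// Compare against root plus separator so that a sibling such as
	// "org-10" does not pass a naive prefix check for "org-1".
	if !strings.HasPrefix(full, cleanRoot+string(filepath.Separator)) {
		return "", ErrUnsafePath
	}
	return full, nil
}
```

This check is purely lexical, so it does not follow symlinks. If the storage backend can contain symlinks, a resolved-path comparison would be needed in addition.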

  12. Browser token storage hardening (sessionStorage -> httpOnly cookie session)
    • Status: open
    • Why: browser-accessible token storage raises XSS blast radius and weakens central session controls.
    • Target: migrate web auth to server-managed httpOnly/sameSite secure cookie session (or equivalent BFF token handling) before production launch.
    • Progress:
      • current implementation keeps access token in browser session storage for MVP velocity.
    • Remaining:
      • define migration plan and acceptance tests for cookie-based auth flow and logout revocation behavior.

  13. Persistent user SSH private-key storage removal
    • Status: open
    • Why: storing user-access private keys server-side increases blast radius and key compromise impact.
    • Current baseline: public /api/v1/allocations/{id}/ssh-key endpoint removed from contract; runtime/path cleanup remains in progress under A-P7-005.
    • Target:
      • one-time key delivery model and/or user-managed public-key model.
      • control plane stores public keys, fingerprints, and metadata only for steady-state.
      • terminal/provisioning runtime paths do not depend on persistent user private-key retrieval.
    • Execution mode: pre-launch cutover (no backward-compatibility migration window required).
    • Owners: Security + Provisioning + Terminal.

  14. Queue acceptance-check execution evidence enforcement
    • Status: open
    • Why: queue currently validates acceptance-check syntax and done-commit lineage, but does not execute each task's acceptance checks as part of done-state enforcement.
    • Target:
      • add CI gate that executes task acceptance_checks for tasks moved to done and records pass/fail evidence.
      • require evidence link or artifact reference in queue metadata before final done-state acceptance.
    • Trigger: activate before introducing reviewer agent or before enabling multi-lane (V2) parallel execution.
    • Owners: Governance + CI maintainers.