
Production Platform Baseline (DevOps Parallel Track)

Purpose:

- Define which platform components are explicitly required for public launch.
- Separate MVP-required controls from deferred complexity (for example, service mesh).
- Let DevOps execute in parallel with application coding.

1. Required for Public MVP

| Capability | MVP decision | Why |
| --- | --- | --- |
| Public ingress/API gateway | Required (managed edge gateway) | Central TLS termination, routing, auth/rate-limit policy attachment, auditability |
| WAF at edge | Required | Baseline protection against common web/API abuse patterns |
| TLS certificates + rotation | Required | Public-facing transport security |
| Runtime secret manager (KMS/Vault/cloud secret manager) | Required | No static secrets in repo/runtime config |
| Container/image scanning in CI | Required | Supply-chain and dependency risk reduction |
| Structured logs + centralized log sink | Required | Incident response and forensic baseline |
| Metrics + alerting | Required | SLO operations and on-call response |
| Distributed tracing (OTel) | Required | Correlation across API/worker/event flows |
| Backup + restore drills (Postgres, config) | Required | Recovery guarantees before public launch |
| Network policy segmentation | Required | Trust-boundary enforcement between edge/app/data |
| East/west traffic control (default-deny + allow-list) | Required | Multi-tenant isolation and lateral-movement reduction |
| Internal traffic encryption (mTLS or equivalent) | Required | Protect internal runtime traffic in transit (at MVP primarily API/worker -> Postgres/Redis/NATS/Temporal; later includes service-to-service paths after extraction) |
| Service identity and certificate lifecycle management | Required | Deterministic issuance, rotation, revocation, and expiry monitoring |
| WebSocket terminal routing policy | Required | Terminal sessions are stateful per pod; enforce session affinity or a dedicated terminal gateway |
| Stream shutdown drain policy | Required | Rolling restarts must drain active terminal streams before SIGKILL to avoid abrupt session drops |
| Redis ACL segregation for notification channels | Required | Prevent unauthorized subscribe/publish on notify.user.* streams carrying user-financial signals |
| Admin token emergency revocation control | Required | Minimize blast radius for compromised admin tokens before expiry |

2. Explicitly Deferred at MVP (With Trigger Gates)

| Component | MVP stance | Re-evaluate when |
| --- | --- | --- |
| Envoy sidecars everywhere | Defer | Per-service L7 policy, retries, or traffic shaping is needed and cannot be handled cleanly at the edge/app layer |
| Istio/service mesh | Defer | Existing controls (network policy + mTLS + identity) can no longer be met without a mesh, or advanced traffic policy becomes required |
| Dedicated in-repo API gateway service | Defer | Managed edge gateway cannot satisfy policy/compliance/performance requirements |

Decision rationale:

- The current topology (cmd/api + workers) does not yet justify service-mesh operational cost.
- Use a managed API gateway plus app-layer middleware for MVP; revisit mesh once domain services are independently deployed.

3. DevOps Parallel Workstreams

  1. Edge + security workstream
     - Provision gateway routes for /api/v1/*, WebSocket upgrades, and the Stripe webhook path.
     - Enforce TLS, WAF rules, baseline rate limiting, and request size/time limits.
     - Enable log redaction at the edge for auth material and sensitive query/header values.
     - Configure WebSocket affinity policy for /ws/terminal/* (or route to a dedicated terminal gateway tier).
     - Define terminal drain-on-shutdown behavior (SIGTERM -> stop accepting new sessions -> wait drain window -> force close remaining sessions).

  2. Platform observability workstream
     - Deploy log, metric, and trace backends.
     - Standardize correlation-id propagation from edge -> API -> workers -> events.
     - Create initial SLO alerts (availability, latency, queue lag, billing worker failures).

  3. Data resilience workstream
     - Configure Postgres backups, retention, and restore runbook validation.
     - Validate RPO/RTO controls in staging with rehearsal evidence.

  4. Runtime security workstream
     - Wire KMS/secret manager for DB/Redis/NATS/Stripe/Keycloak credentials.
     - Enforce image signing/scanning policy gates in CI/CD.
     - Establish break-glass access and secret rotation SOP.
     - Enforce Redis ACL profiles so only notification-relay can publish and only the API notification hub can subscribe on notify.user.*.
     - Implement an admin-token deny-list check path with an operational runbook for emergency revocation.
     - Enforce admin node probe SSRF controls (NODE_PROBE_ALLOWED_CIDRS set in production with GPU-node subnet allowlist; metadata/link-local ranges blocked).
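
The node-probe SSRF control can be illustrated with a small allowlist check built on the standard `net/netip` package. This is a sketch under assumptions: the comma-separated format for NODE_PROBE_ALLOWED_CIDRS, the function names, and the example subnet are hypothetical, and the demo passes the value as a literal rather than reading the environment.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// parseAllowedCIDRs parses a comma-separated CIDR list (assumed format).
// An empty list is an error so the check fails closed.
func parseAllowedCIDRs(raw string) ([]netip.Prefix, error) {
	var out []netip.Prefix
	for _, part := range strings.Split(raw, ",") {
		part = strings.TrimSpace(part)
		if part == "" {
			continue
		}
		p, err := netip.ParsePrefix(part)
		if err != nil {
			return nil, fmt.Errorf("invalid CIDR %q: %w", part, err)
		}
		out = append(out, p.Masked())
	}
	if len(out) == 0 {
		return nil, fmt.Errorf("no allowed CIDRs configured")
	}
	return out, nil
}

// probeAllowed reports whether target (an IP literal) may be probed.
// Loopback, link-local (which covers 169.254.169.254), and unspecified
// addresses are rejected even if an allowlist entry would cover them.
func probeAllowed(target string, allowed []netip.Prefix) bool {
	addr, err := netip.ParseAddr(target)
	if err != nil {
		return false // hostnames must be resolved and re-checked by the caller
	}
	addr = addr.Unmap() // treat ::ffff:a.b.c.d as plain IPv4
	if addr.IsLoopback() || addr.IsUnspecified() ||
		addr.IsLinkLocalUnicast() || addr.IsLinkLocalMulticast() {
		return false
	}
	for _, p := range allowed {
		if p.Contains(addr) {
			return true
		}
	}
	return false
}

func main() {
	// Placeholder subnet; the real GPU-node range is deployment-specific.
	allowed, err := parseAllowedCIDRs("10.20.0.0/16")
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	for _, target := range []string{"10.20.3.7", "10.99.0.1", "169.254.169.254", "127.0.0.1"} {
		fmt.Printf("%-16s allowed=%v\n", target, probeAllowed(target, allowed))
	}
}
```

The check works on IP literals only. If probes accept hostnames, the resolved address would need the same check, and the connection should be made to that checked address so a second DNS lookup cannot swap the target.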

  5. East/west security workstream
     - Enforce cluster/application default-deny east/west policy and explicit allow-list flows.
     - Implement service/workload identity with short-lived certs for mTLS (mesh or platform-native).
     - Define cert lifecycle SOP: issuance authority, TTL, rotation cadence, expiry alerting, and emergency revocation.
     - Add a periodic verification job that fails on expired/near-expiry internal certs.

4. Non-Negotiable Launch Gate

Public launch is blocked unless all “Required for Public MVP” controls are in place and validated in staging.