Production Platform Baseline (DevOps Parallel Track)
Purpose:

- Define what platform components are explicitly required for public launch.
- Separate MVP-required controls from deferred complexity (for example, service mesh).
- Let DevOps execute in parallel with application coding.
1. Required for Public MVP
| Capability | MVP decision | Why |
|---|---|---|
| Public ingress/API gateway | Required (managed edge gateway) | Central TLS termination, routing, auth/rate-limit policy attachment, auditability |
| WAF at edge | Required | Baseline protection against common web/API abuse patterns |
| TLS certificates + rotation | Required | Public-facing transport security |
| Runtime secret manager (KMS/Vault/cloud secret manager) | Required | No static secrets in repo/runtime config |
| Container/image scanning in CI | Required | Supply-chain and dependency risk reduction |
| Structured logs + centralized log sink | Required | Incident response and forensic baseline |
| Metrics + alerting | Required | SLO operations and on-call response |
| Distributed tracing (OTel) | Required | Correlation across API/worker/event flows |
| Backup + restore drills (Postgres, config) | Required | Recovery guarantees before public launch |
| Network policy segmentation | Required | Trust-boundary enforcement between edge/app/data |
| East/west traffic control (default-deny + allow-list) | Required | Multi-tenant isolation and lateral-movement reduction |
| Internal traffic encryption (mTLS or equivalent) | Required | Protect internal runtime traffic in transit (at MVP primarily API/worker -> Postgres/Redis/NATS/Temporal; later includes service-to-service paths after extraction) |
| Service identity and certificate lifecycle management | Required | Deterministic issuance, rotation, revocation, and expiry monitoring |
| WebSocket terminal routing policy | Required | Terminal sessions are stateful per pod; enforce session affinity or dedicated terminal gateway |
| Stream shutdown drain policy | Required | Rolling restarts must drain active terminal streams before SIGKILL to avoid abrupt session drops |
| Redis ACL segregation for notification channels | Required | Prevent unauthorized subscribe/publish on notify.user.* streams carrying user-financial signals |
| Admin token emergency revocation control | Required | Minimize blast radius for compromised admin tokens before expiry |
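
The stream shutdown drain policy in the table above can be sketched as follows. This is a minimal illustration, not the project's implementation: `TerminalSessionRegistry`, its methods, and the window values are hypothetical, and Python's asyncio is used only for illustration (the real service may be written in another language). The sequence is: stop accepting new sessions, wait a bounded drain window for active streams to finish, then force-close whatever remains.

```python
import asyncio


class TerminalSessionRegistry:
    """Tracks live terminal streams on one pod (hypothetical helper)."""

    def __init__(self) -> None:
        self.accepting = True
        self._sessions: dict[str, asyncio.Task] = {}

    def open(self, session_id: str, stream_coro) -> asyncio.Task:
        """Register a stream; must be called from inside the running event loop."""
        if not self.accepting:
            raise RuntimeError("pod is draining; new terminal sessions are refused")
        task = asyncio.create_task(stream_coro)
        self._sessions[session_id] = task
        # Forget the session once its stream ends, however it ends.
        task.add_done_callback(lambda _t: self._sessions.pop(session_id, None))
        return task

    async def drain(self, window_seconds: float) -> int:
        """Run the drain sequence; return how many sessions were force-closed."""
        self.accepting = False  # step 1: stop accepting new sessions
        pending = set(self._sessions.values())
        if not pending:
            return 0
        # Step 2: give active streams a bounded window to finish on their own.
        _, still_open = await asyncio.wait(pending, timeout=window_seconds)
        # Step 3: force-close whatever is left once the window has elapsed.
        for task in still_open:
            task.cancel()
        await asyncio.gather(*still_open, return_exceptions=True)
        return len(still_open)


async def demo() -> int:
    registry = TerminalSessionRegistry()
    registry.open("short", asyncio.sleep(0.01))  # finishes inside the window
    registry.open("long", asyncio.sleep(60))     # outlives the window
    return await registry.drain(window_seconds=0.2)
```

In a real service, `drain` would be triggered from the `SIGTERM` handler (on Unix, for example, via `loop.add_signal_handler`), and the drain window would need to be shorter than the orchestrator's termination grace period so the sequence completes before `SIGKILL`.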
2. Explicitly Deferred at MVP (With Trigger Gates)
| Component | MVP stance | Re-evaluate when |
|---|---|---|
| Envoy sidecars everywhere | Defer | Need per-service L7 policy, retries, traffic shaping not handled cleanly at edge/app layer |
| Istio/service mesh | Defer | Existing controls (network policy + mTLS + identity) can no longer be met without a mesh, or advanced traffic policy becomes required |
| Dedicated in-repo API gateway service | Defer | Managed edge gateway cannot satisfy policy/compliance/performance requirements |
Decision rationale:
- Current topology (cmd/api + workers) does not yet justify service-mesh operational cost.
- Use managed API gateway + app-layer middleware for MVP; revisit mesh once domain services are independently deployed.
3. DevOps Parallel Workstreams
- Edge + security workstream
    - Provision gateway routes for `/api/v1/*`, websocket upgrades, Stripe webhook path.
    - Enforce TLS, WAF rules, baseline rate limiting, and request size/time limits.
    - Enable log redaction at edge for auth material and sensitive query/header values.
    - Configure websocket affinity policy for `/ws/terminal/*` (or route to dedicated terminal gateway tier).
    - Define terminal drain-on-shutdown behavior (`SIGTERM` -> stop accepting new sessions -> wait drain window -> force close remaining sessions).
- Platform observability workstream
    - Deploy log, metric, and trace backends.
    - Standardize correlation-id propagation from edge -> API -> workers -> events.
    - Create initial SLO alerts (availability, latency, queue lag, billing worker failures).
- Data resilience workstream
    - Configure Postgres backups, retention, and restore runbook validation.
    - Validate RPO/RTO controls in staging with rehearsal evidence.
- Runtime security workstream
    - Wire KMS/secret manager for DB/Redis/NATS/Stripe/Keycloak credentials.
    - Enforce image signing/scanning policy gates in CI/CD.
    - Establish break-glass access and secret rotation SOP.
    - Enforce Redis ACL profiles so only notification-relay can publish and only API notification hub can subscribe on `notify.user.*`.
    - Implement admin-token deny-list check path with operational runbook for emergency revocation.
    - Enforce admin node probe SSRF controls (`NODE_PROBE_ALLOWED_CIDRS` set in production with GPU-node subnet allowlist; metadata/link-local ranges blocked).
- East/west security workstream
    - Enforce cluster/application default-deny east/west policy and explicit allow-list flows.
    - Implement service/workload identity with short-lived certs for mTLS (mesh or platform-native).
    - Define cert lifecycle SOP: issuance authority, TTL, rotation cadence, expiry alerting, and emergency revocation.
    - Add periodic verification job that fails on expired/near-expiry internal certs.
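
The admin node probe SSRF control from the runtime security workstream could be checked roughly as below. This is a sketch under stated assumptions: only the variable name `NODE_PROBE_ALLOWED_CIDRS` comes from this document; its comma-separated format, the function names, and the inclusion of loopback ranges in the block list are assumptions made for illustration. The check fails closed: a target must be a valid IP, must not fall in a blocked range, and must match the allowlist.

```python
import ipaddress

# Ranges that are never valid probe targets, even if the allowlist includes them.
ALWAYS_BLOCKED = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "169.254.0.0/16",  # IPv4 link-local, where cloud metadata endpoints commonly live
        "fe80::/10",       # IPv6 link-local
        "127.0.0.0/8",     # IPv4 loopback (assumption: blocked as well)
        "::1/128",         # IPv6 loopback (assumption: blocked as well)
    )
]


def parse_allowed_cidrs(raw: str):
    """Parse a comma-separated CIDR list (assumed format of NODE_PROBE_ALLOWED_CIDRS)."""
    return [ipaddress.ip_network(part.strip()) for part in raw.split(",") if part.strip()]


def is_probe_target_allowed(target_ip: str, allowed) -> bool:
    """Return True only for valid IPs inside the allowlist and outside blocked ranges."""
    try:
        addr = ipaddress.ip_address(target_ip)
    except ValueError:
        return False  # not an IP literal: reject rather than guess
    if any(addr in net for net in ALWAYS_BLOCKED):
        return False  # the block list wins over the allowlist
    return any(addr in net for net in allowed)  # empty allowlist denies everything
```

A caller would typically build the allowlist once at startup, for example `parse_allowed_cidrs(os.environ.get("NODE_PROBE_ALLOWED_CIDRS", ""))`, and run the check on the resolved IP address rather than on a hostname, so that a DNS answer cannot redirect the probe after validation.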
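
The periodic certificate verification job in the east/west workstream might look like the following sketch. The inventory contents, function names, and the 24-hour warning window are hypothetical; how expiry timestamps are collected (for example, by parsing each certificate's not-after field) is left out and would depend on the platform. The point shown is only the failure rule: any expired or near-expiry certificate makes the job fail.

```python
from datetime import datetime, timedelta, timezone


def classify_cert(not_after: datetime, now: datetime, warn_window: timedelta) -> str:
    """Classify one certificate by its expiry timestamp."""
    if not_after <= now:
        return "expired"
    if not_after - now <= warn_window:
        return "near-expiry"
    return "ok"


def failing_certs(inventory: dict, now: datetime, warn_window: timedelta) -> dict:
    """Return {cert_name: status} for every certificate that is not 'ok'."""
    statuses = {
        name: classify_cert(not_after, now, warn_window)
        for name, not_after in inventory.items()
    }
    return {name: status for name, status in statuses.items() if status != "ok"}


def main() -> int:
    """Return a process exit code: non-zero when any certificate fails the check."""
    now = datetime.now(timezone.utc)
    # Hypothetical inventory; a real job would read expiry times from issued certs.
    inventory = {
        "api": now + timedelta(days=10),
        "worker": now + timedelta(hours=6),
        "notification-relay": now - timedelta(minutes=5),
    }
    failures = failing_certs(inventory, now, warn_window=timedelta(hours=24))
    for name, status in sorted(failures.items()):
        print(f"{name}: {status}")
    return 1 if failures else 0
```

A scheduler wrapper would pass the result to `sys.exit(main())` so that the job is marked failed. With short-lived certificates, the warning window probably needs to be a fraction of the TTL defined in the cert lifecycle SOP rather than a fixed duration, otherwise healthy certificates would be flagged constantly.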
4. Non-Negotiable Launch Gate
Public launch is blocked unless all “Required for Public MVP” controls are in place and validated in staging.