Parallel Ops Track (Execute Alongside Feature Coding)¶
Purpose: - Track operational work that must run in parallel with backend/frontend implementation. - Convert operations readiness from "later" into gated deliverables with evidence.
Status values:
- open
- in_progress
- done
1) SLO and Alert Pack¶
- Status:
in_progress - Owner:
SRE Lead - Target:
Before public beta - Scope:
- API availability/latency/error budget alerts.
- Queue lag and worker failure alerts.
- Billing drift/processing anomaly alerts.
- Webhook ingestion + reconciliation failure alerts.
- Evidence:
- Evidence baseline check command:
make ops-parallel-evidence-check - Execution artifact:
doc/operations/evidence/slo_alert_pack.md - Baseline definitions:
doc/operations/Observability_Baseline.md - Incident model linkage:
doc/operations/Incident_Severity_Model.md - Alert rules baseline:
doc/operations/evidence/alert_rule_manifest_baseline.yaml - Local observability smoke script:
scripts/ops/observability_smoke.sh - Local observability smoke report:
doc/operations/evidence/observability_local_smoke_report.md - Simulation playbook/report:
doc/operations/evidence/alert_simulation_playbook.md,doc/operations/evidence/alert_simulation_report.md - Pending: staging load/trigger execution and captured outputs.
2) Runbooks and On-Call Readiness¶
- Status:
in_progress - Owner:
SRE Lead - Target:
Before public beta - Scope:
- Complete runbooks for API, queue, billing worker, provisioning stuck/failures, webhook outage.
- On-call ownership matrix and escalation path.
- Incident drill calendar (tabletop + restore drill).
- Evidence:
- Evidence baseline check command:
make ops-parallel-evidence-check - Execution artifact:
doc/operations/evidence/runbooks_oncall_readiness.md - Runbook index:
doc/operations/SRE_Runbook_Index.md - Existing runbooks:
doc/operations/runbooks/* - On-call roster + escalation baseline:
doc/operations/evidence/oncall_roster_and_escalation.md - Drill calendar/report baseline:
doc/operations/evidence/incident_drill_reports.md - Pending: executed drills with timestamps and outcomes.
3) Backup, Restore, and DR Validation¶
- Status:
in_progress - Owner:
Platform + DBA - Target:
Before staging go-live - Scope:
- Automated backup schedule and retention policy.
- Staging restore rehearsal with time-to-recovery measurement.
- RPO/RTO verification and risk note for gaps.
- Evidence:
- Evidence baseline check command:
make ops-parallel-evidence-check - Execution artifact:
doc/operations/evidence/backup_restore_dr.md - Local smoke drill script:
scripts/ops/backup_restore_smoke.sh - Local rehearsal report:
doc/operations/evidence/backup_restore_rehearsal_report.md - Pending: backup job config + restore rehearsal report.
4) Secrets and Key Operations¶
- Status:
in_progress - Owner:
Security + Platform - Target:
Before public beta - Scope:
- KMS/secret-manager integration for runtime secrets.
- Rotation cadence for app secrets, JWT signing/JWKS dependencies, terminal token signer.
- Break-glass access process and audit trail.
- Evidence:
- Evidence baseline check command:
make ops-parallel-evidence-check - Execution artifact:
doc/operations/evidence/secrets_key_ops.md - Platform baseline requirements:
doc/operations/Production_Platform_Baseline.md - KMS operational guardrails:
doc/operations/KMS_Control_Key_Source_Guardrails.md - Node probe SSRF guardrails implementation:
packages/services/inventory/service.go - Pending: formal rotation SOP + execution logs.
5) East/West Security and Certificate Lifecycle¶
- Status:
in_progress - Owner:
Security + Platform - Target:
Before staging->prod promotion - Scope:
- Default-deny network policies and explicit allow-list flows.
- Internal mTLS (or equivalent) across service/workload paths.
- Cert lifecycle controls: issuance, rotation, revocation, expiry alerting.
- Evidence:
- Evidence baseline check command:
make ops-parallel-evidence-check - Execution artifact:
doc/operations/evidence/east_west_security_certs.md - PKI/step-ca staging readiness artifact:
doc/operations/evidence/pki_stepca_staging_readiness.md - Baseline requirements + workstream:
doc/operations/Production_Platform_Baseline.md - Baseline network policy manifest:
doc/operations/evidence/network_policy_baseline.yaml - Baseline cert check script:
scripts/ops/cert_expiry_check.sh - Pending: cluster-applied validation output and cert health dashboard evidence.
6) Capacity and Load Baseline¶
- Status:
in_progress - Owner:
Performance/Infra - Target:
Before public beta - Scope:
- Staging load profile for API endpoints, websocket concurrency, queue throughput.
- Baseline limits for DB connections/pools and worker concurrency.
- Saturation indicators and scale playbook.
- Evidence:
- Data growth guard baseline:
doc/operations/evidence/data_growth_guardrails.md - Pending: load-test scripts/results + baseline report.
7) Security Verification Pipeline¶
- Status:
in_progress - Owner:
AppSec + DevEx - Target:
Before feature-complete freeze - Scope:
- SAST/DAST/dependency/container scanning wired into CI gates.
- Severity-based policy (fail on critical/high as defined by governance).
- Exception workflow with expiry.
- Evidence:
- Baseline policy references:
doc/governance/CI_Enforcement_Checklist.md - Pending: SAST/DAST job outputs in hosted CI.
8) Environment Parity and Config Governance¶
- Status:
in_progress - Owner:
Platform - Target:
Before staging go-live - Scope:
- Controlled environment diffs (dev/staging/prod) for non-secret config.
- Feature flag policy and promotion controls.
- Explicit change approval for risk-sensitive config changes.
- Evidence:
- Promotion policy baseline:
doc/operations/Environment_Promotion_Policy.md - Pending: config diff reports + approval records.
9) Audit and Compliance Operations¶
- Status:
in_progress - Owner:
Security + Billing Ops - Target:
Before public beta - Scope:
- Audit log access controls and export path validation.
- Retention and retrieval procedure for security/finance investigations.
- Correlation-id based incident trace workflow.
- Evidence:
- Audit controls baseline:
doc/governance/Security_Control_Verification.md - Pending: sample investigation report + export validation evidence.
10) Cost Observability and Budget Guardrails¶
- Status:
in_progress - Owner:
FinOps + Platform - Target:
Before public beta - Scope:
- Cost dashboards for compute/storage/egress and alert thresholds.
- Budget policy and anomaly escalation path.
- Cost-per-feature visibility where practical.
- Evidence:
- Pending: dashboard links + alert policy references.
11) Terminal gateway ingress/network policy rollout (Option C)¶
- Status:
in_progress - Owner:
Platform + Security - Target:
Before terminal Option C production cutover - Scope:
- Route websocket terminal traffic to dedicated
cmd/terminal-gatewayruntime. - Apply default-deny + explicit allow-list network policy for terminal gateway pods.
- Validate rollback path to last known-good terminal-gateway revision without contract changes.
- Evidence:
- Evidence baseline check command:
make ops-parallel-evidence-check - Execution artifact:
doc/operations/evidence/terminal_gateway_rollout_plan.md - Network policy baseline:
doc/operations/evidence/network_policy_baseline.yaml - Pending: staging cutover evidence and rollback drill output.
Launch Gating¶
- Public launch requires items 1-5 to be
done. - Items 6-10 must be at least
in_progresswith owners, timeline, and evidence links.