Skip to content

Parallel Ops Track (Execute Alongside Feature Coding)

Purpose: - Track operational work that must run in parallel with backend/frontend implementation. - Convert operations readiness from "later" into gated deliverables with evidence.

Status values: - open - in_progress - done

1) SLO and Alert Pack

  • Status: in_progress
  • Owner: SRE Lead
  • Target: Before public beta
  • Scope:
  • API availability/latency/error budget alerts.
  • Queue lag and worker failure alerts.
  • Billing drift/processing anomaly alerts.
  • Webhook ingestion + reconciliation failure alerts.
  • Evidence:
  • Evidence baseline check command: make ops-parallel-evidence-check
  • Execution artifact: doc/operations/evidence/slo_alert_pack.md
  • Baseline definitions: doc/operations/Observability_Baseline.md
  • Incident model linkage: doc/operations/Incident_Severity_Model.md
  • Alert rules baseline: doc/operations/evidence/alert_rule_manifest_baseline.yaml
  • Local observability smoke script: scripts/ops/observability_smoke.sh
  • Local observability smoke report: doc/operations/evidence/observability_local_smoke_report.md
  • Simulation playbook/report: doc/operations/evidence/alert_simulation_playbook.md, doc/operations/evidence/alert_simulation_report.md
  • Pending: staging load/trigger execution and captured outputs.

2) Runbooks and On-Call Readiness

  • Status: in_progress
  • Owner: SRE Lead
  • Target: Before public beta
  • Scope:
  • Complete runbooks for API, queue, billing worker, provisioning stuck/failures, webhook outage.
  • On-call ownership matrix and escalation path.
  • Incident drill calendar (tabletop + restore drill).
  • Evidence:
  • Evidence baseline check command: make ops-parallel-evidence-check
  • Execution artifact: doc/operations/evidence/runbooks_oncall_readiness.md
  • Runbook index: doc/operations/SRE_Runbook_Index.md
  • Existing runbooks: doc/operations/runbooks/*
  • On-call roster + escalation baseline: doc/operations/evidence/oncall_roster_and_escalation.md
  • Drill calendar/report baseline: doc/operations/evidence/incident_drill_reports.md
  • Pending: executed drills with timestamps and outcomes.

3) Backup, Restore, and DR Validation

  • Status: in_progress
  • Owner: Platform + DBA
  • Target: Before staging go-live
  • Scope:
  • Automated backup schedule and retention policy.
  • Staging restore rehearsal with time-to-recovery measurement.
  • RPO/RTO verification and risk note for gaps.
  • Evidence:
  • Evidence baseline check command: make ops-parallel-evidence-check
  • Execution artifact: doc/operations/evidence/backup_restore_dr.md
  • Local smoke drill script: scripts/ops/backup_restore_smoke.sh
  • Local rehearsal report: doc/operations/evidence/backup_restore_rehearsal_report.md
  • Pending: backup job config + restore rehearsal report.

4) Secrets and Key Operations

  • Status: in_progress
  • Owner: Security + Platform
  • Target: Before public beta
  • Scope:
  • KMS/secret-manager integration for runtime secrets.
  • Rotation cadence for app secrets, JWT signing/JWKS dependencies, terminal token signer.
  • Break-glass access process and audit trail.
  • Evidence:
  • Evidence baseline check command: make ops-parallel-evidence-check
  • Execution artifact: doc/operations/evidence/secrets_key_ops.md
  • Platform baseline requirements: doc/operations/Production_Platform_Baseline.md
  • KMS operational guardrails: doc/operations/KMS_Control_Key_Source_Guardrails.md
  • Node probe SSRF guardrails implementation: packages/services/inventory/service.go
  • Pending: formal rotation SOP + execution logs.

5) East/West Security and Certificate Lifecycle

  • Status: in_progress
  • Owner: Security + Platform
  • Target: Before staging->prod promotion
  • Scope:
  • Default-deny network policies and explicit allow-list flows.
  • Internal mTLS (or equivalent) across service/workload paths.
  • Cert lifecycle controls: issuance, rotation, revocation, expiry alerting.
  • Evidence:
  • Evidence baseline check command: make ops-parallel-evidence-check
  • Execution artifact: doc/operations/evidence/east_west_security_certs.md
  • PKI/step-ca staging readiness artifact: doc/operations/evidence/pki_stepca_staging_readiness.md
  • Baseline requirements + workstream: doc/operations/Production_Platform_Baseline.md
  • Baseline network policy manifest: doc/operations/evidence/network_policy_baseline.yaml
  • Baseline cert check script: scripts/ops/cert_expiry_check.sh
  • Pending: cluster-applied validation output and cert health dashboard evidence.

6) Capacity and Load Baseline

  • Status: in_progress
  • Owner: Performance/Infra
  • Target: Before public beta
  • Scope:
  • Staging load profile for API endpoints, websocket concurrency, queue throughput.
  • Baseline limits for DB connections/pools and worker concurrency.
  • Saturation indicators and scale playbook.
  • Evidence:
  • Data growth guard baseline: doc/operations/evidence/data_growth_guardrails.md
  • Pending: load-test scripts/results + baseline report.

7) Security Verification Pipeline

  • Status: in_progress
  • Owner: AppSec + DevEx
  • Target: Before feature-complete freeze
  • Scope:
  • SAST/DAST/dependency/container scanning wired into CI gates.
  • Severity-based policy (fail on critical/high as defined by governance).
  • Exception workflow with expiry.
  • Evidence:
  • Baseline policy references: doc/governance/CI_Enforcement_Checklist.md
  • Pending: SAST/DAST job outputs in hosted CI.

8) Environment Parity and Config Governance

  • Status: in_progress
  • Owner: Platform
  • Target: Before staging go-live
  • Scope:
  • Controlled environment diffs (dev/staging/prod) for non-secret config.
  • Feature flag policy and promotion controls.
  • Explicit change approval for risk-sensitive config changes.
  • Evidence:
  • Promotion policy baseline: doc/operations/Environment_Promotion_Policy.md
  • Pending: config diff reports + approval records.

9) Audit and Compliance Operations

  • Status: in_progress
  • Owner: Security + Billing Ops
  • Target: Before public beta
  • Scope:
  • Audit log access controls and export path validation.
  • Retention and retrieval procedure for security/finance investigations.
  • Correlation-id based incident trace workflow.
  • Evidence:
  • Audit controls baseline: doc/governance/Security_Control_Verification.md
  • Pending: sample investigation report + export validation evidence.

10) Cost Observability and Budget Guardrails

  • Status: in_progress
  • Owner: FinOps + Platform
  • Target: Before public beta
  • Scope:
  • Cost dashboards for compute/storage/egress and alert thresholds.
  • Budget policy and anomaly escalation path.
  • Cost-per-feature visibility where practical.
  • Evidence:
  • Pending: dashboard links + alert policy references.

11) Terminal gateway ingress/network policy rollout (Option C)

  • Status: in_progress
  • Owner: Platform + Security
  • Target: Before terminal Option C production cutover
  • Scope:
  • Route websocket terminal traffic to dedicated cmd/terminal-gateway runtime.
  • Apply default-deny + explicit allow-list network policy for terminal gateway pods.
  • Validate rollback path to last known-good terminal-gateway revision without contract changes.
  • Evidence:
  • Evidence baseline check command: make ops-parallel-evidence-check
  • Execution artifact: doc/operations/evidence/terminal_gateway_rollout_plan.md
  • Network policy baseline: doc/operations/evidence/network_policy_baseline.yaml
  • Pending: staging cutover evidence and rollback drill output.

Launch Gating

  • Public launch requires items 1-5 to be done.
  • Items 6-10 must be at least in_progress with owners, timeline, and evidence links.