Runbook: Billing Worker Failure and Reconciliation Drift
Trigger
- Alert: billing worker failures / billing queue lag.
- Alert: payment reconcile failures (
payments.reconcile_failed).
- User/support report: unexpected charge, missing credit, or lifecycle boundary dispute.
Required Context
correlation_id from API/UI error envelope (or support ticket metadata).
trace_id if present in error details/logs.
- Scope identifiers when available:
org_id
project_id
user_id
allocation_id
app_instance_id
usage_source
control_plane_component
- Determine blast radius:
- single user/project dispute vs global worker degradation.
- If widespread:
- freeze risky manual adjustments until consistency is verified.
- Confirm dependency health:
- Postgres connectivity
- NATS consumer health
- outbox relay health
Correlation-First Diagnosis
- Start in Loki with
correlation_id:
service=gpuaas-billing-worker and related services (gpuaas-api, gpuaas-webhook-worker).
- Extract
trace_id from log/error details and open in Tempo.
- Validate event sequence:
provisioning.active
provisioning.releasing.completed or provisioning.release_failed
billing.* notifications
payments.balance_credited / payments.reconcile_failed when payment path involved.
- Validate data consistency:
usage_records lifecycle (start_time, end_time, last_billed_at, accrued_cost_minor)
ledger_entries projection for affected requested_by_user_id
- allocation status boundary (
active, releasing, released, release_failed)
- app-runtime attribution boundary (
app_instance_id, usage_source, control_plane_component, operating_mode, control_plane_scope, runtime_backend)
Reconciliation Checklist
- Missing open usage:
- allocation is
active|releasing|release_failed but no open usage_records row.
- Orphan open usage:
- open
usage_records row while allocation is not active/releasing.
- Unbilled closed usage:
- closed
usage_records with positive accrued cost and no usage ledger debit.
- Payment drift:
payment_sessions.status=failed_reconcile or missing linked ledger_entry_id after checkout completion.
Mitigation
- Worker/process remediation:
- restart or roll back billing worker if runtime regression suspected.
- Data remediation:
- use approved reconciliation procedure; avoid ad-hoc direct state mutation.
- Payment remediation:
- investigate provider/webhook mismatch and apply recovery path per payments runbook.
- Keep all corrective actions auditable with actor + correlation linkage.
Recovery Criteria
- Billing worker resumes stable processing.
- Reconciliation checks return no unresolved drift for impacted scope.
- No duplicate debits/credits introduced by recovery.
- Alerts return below threshold.
Evidence to Capture
- Incident timeline with
correlation_id and trace_id.
- Query evidence for
usage_records and ledger_entries before/after remediation.
- Impacted tenant/project/user scope and customer-facing summary.
- Follow-up tasks for root-cause prevention.