Skip to content

Runbook: Billing Worker Failure and Reconciliation Drift

Trigger

  1. Alert: billing worker failures / billing queue lag.
  2. Alert: payment reconcile failures (payments.reconcile_failed).
  3. User/support report: unexpected charge, missing credit, or lifecycle boundary dispute.

Required Context

  1. correlation_id from API/UI error envelope (or support ticket metadata).
  2. trace_id if present in error details/logs.
  3. Scope identifiers when available:
  4. org_id
  5. project_id
  6. user_id
  7. allocation_id
  8. app_instance_id
  9. usage_source
  10. control_plane_component

Immediate Actions

  1. Determine blast radius:
  2. single user/project dispute vs global worker degradation.
  3. If widespread:
  4. freeze risky manual adjustments until consistency is verified.
  5. Confirm dependency health:
  6. Postgres connectivity
  7. NATS consumer health
  8. outbox relay health

Correlation-First Diagnosis

  1. Start in Loki with correlation_id:
  2. service=gpuaas-billing-worker and related services (gpuaas-api, gpuaas-webhook-worker).
  3. Extract trace_id from log/error details and open in Tempo.
  4. Validate event sequence:
  5. provisioning.active
  6. provisioning.releasing.completed or provisioning.release_failed
  7. billing.* notifications
  8. payments.balance_credited / payments.reconcile_failed when payment path involved.
  9. Validate data consistency:
  10. usage_records lifecycle (start_time, end_time, last_billed_at, accrued_cost_minor)
  11. ledger_entries projection for affected requested_by_user_id
  12. allocation status boundary (active, releasing, released, release_failed)
  13. app-runtime attribution boundary (app_instance_id, usage_source, control_plane_component, operating_mode, control_plane_scope, runtime_backend)

Reconciliation Checklist

  1. Missing open usage:
  2. allocation is active|releasing|release_failed but no open usage_records row.
  3. Orphan open usage:
  4. open usage_records row while allocation is not active/releasing.
  5. Unbilled closed usage:
  6. closed usage_records with positive accrued cost and no usage ledger debit.
  7. Payment drift:
  8. payment_sessions.status=failed_reconcile or missing linked ledger_entry_id after checkout completion.

Mitigation

  1. Worker/process remediation:
  2. restart or roll back billing worker if runtime regression suspected.
  3. Data remediation:
  4. use approved reconciliation procedure; avoid ad-hoc direct state mutation.
  5. Payment remediation:
  6. investigate provider/webhook mismatch and apply recovery path per payments runbook.
  7. Keep all corrective actions auditable with actor + correlation linkage.

Recovery Criteria

  1. Billing worker resumes stable processing.
  2. Reconciliation checks return no unresolved drift for impacted scope.
  3. No duplicate debits/credits introduced by recovery.
  4. Alerts return below threshold.

Evidence to Capture

  1. Incident timeline with correlation_id and trace_id.
  2. Query evidence for usage_records and ledger_entries before/after remediation.
  3. Impacted tenant/project/user scope and customer-facing summary.
  4. Follow-up tasks for root-cause prevention.