Runbook: App Runtime Billing Reconciliation Incident

Trigger

  1. User/operator report: app runtime charge is unexpected, missing, or duplicated.
  2. Billing UI/export shows mixed allocation and app-runtime rows that do not reconcile.
  3. Alert or log evidence indicates app-runtime metering drift, especially for control-plane components.

Required Context

  1. correlation_id from API/UI error envelope or support ticket.
  2. trace_id when present.
  3. Attribution identifiers when available:
     - org_id
     - project_id
     - app_instance_id
     - usage_source
     - control_plane_component
     - operating_mode
     - control_plane_scope
     - runtime_backend

Immediate Actions

  1. Confirm the scope of the dispute:
     - one app instance,
     - one project,
     - one tenant,
     - or broad billing-worker degradation.
  2. Do not treat app-runtime usage as a separate billing system; all reconciliation must still flow through usage_records and ledger_entries.
  3. If multiple app instances are impacted, freeze manual customer-facing adjustments until source attribution is verified.
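
The scoping step can be sketched in Python as a rough heuristic. This is an illustration only: the function name classify_scope is hypothetical, it assumes affected rows are available as dicts, and it assumes a tenant corresponds to org_id, which this runbook does not state. Rows spanning more than one org are treated as a hint of broad degradation, not as proof.

```python
from typing import Iterable, Mapping


def classify_scope(rows: Iterable[Mapping[str, str]]) -> str:
    """Guess the dispute scope from the identifiers on affected rows.

    Assumption: each row carries org_id, project_id and app_instance_id,
    and org_id stands in for the tenant.
    """
    rows = list(rows)
    if not rows:
        return "unknown"

    orgs = {r["org_id"] for r in rows}
    projects = {(r["org_id"], r["project_id"]) for r in rows}
    instances = {r["app_instance_id"] for r in rows}

    if len(orgs) > 1:
        return "broad"          # several tenants: suspect billing-worker degradation
    if len(projects) > 1:
        return "tenant"         # one tenant, several projects
    if len(instances) > 1:
        return "project"        # one project, several app instances
    return "app_instance"
```

Any result other than "app_instance" means multiple app instances are involved, so the freeze on manual customer-facing adjustments applies.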

Correlation-First Diagnosis

  1. Start in Loki with correlation_id:
       {service=~"gpuaas-(api|billing-worker|app-runtime-worker|webhook-worker)"} | json | correlation_id="<CORRELATION_ID>"
  2. If app_instance_id is known, pivot on it:
       {service=~"gpuaas-(api|billing-worker|app-runtime-worker)"} | json | app_instance_id="<APP_INSTANCE_ID>"
  3. Extract trace_id and inspect Tempo for:
     - app lifecycle call,
     - outbox relay publish,
     - billing worker handling,
     - any webhook/payment follow-on if customer funding is involved.
  4. Confirm source attribution on affected usage rows:
     - usage_source = app_runtime
     - app_instance_id present
     - control_plane_component correct for control-plane cost
     - operating_mode, control_plane_scope, runtime_backend align with the instance
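
The source-attribution check on a usage row can be sketched as follows. This is a minimal illustration, assuming the usage row and the app instance metadata have already been fetched as dicts keyed by the field names used in this runbook. The function name attribution_problems and the expect_control_plane argument are hypothetical, and control_plane_component is assumed to be a boolean.

```python
from typing import Any, List, Mapping

# Fields that should agree between the usage row and the app instance.
INSTANCE_FIELDS = ("operating_mode", "control_plane_scope", "runtime_backend")


def attribution_problems(
    row: Mapping[str, Any],
    instance: Mapping[str, Any],
    expect_control_plane: bool,
) -> List[str]:
    """Return a list of attribution problems; an empty list means the row looks consistent."""
    problems: List[str] = []

    if row.get("usage_source") != "app_runtime":
        problems.append("usage_source is not app_runtime")
    if not row.get("app_instance_id"):
        problems.append("app_instance_id is missing")
    # Assumption: control_plane_component is stored as a boolean flag.
    if bool(row.get("control_plane_component")) != expect_control_plane:
        problems.append("control_plane_component does not match expectation")
    for field in INSTANCE_FIELDS:
        if row.get(field) != instance.get(field):
            problems.append(f"{field} does not match the instance")

    return problems
```

Recording the returned list alongside the correlation_id in the incident notes keeps the diagnosis traceable.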

Reconciliation Checklist

  1. Missing usage row: app runtime activity happened, but no usage_records row exists for the app_instance_id.
  2. Wrong source attribution: usage is recorded against allocation when it should be app_runtime, or vice versa.
  3. Wrong attribution anchor: project_id, app_instance_id, operating_mode, or control_plane_scope does not match the instance.
  4. Ledger mismatch: app-runtime usage_records exist, but no corresponding debit/credit interpretation appears in customer-visible billing state.
  5. Control-plane classification drift: control_plane_component is false for scheduler/head/control services that should meter separately.
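
Two of the checklist items, missing usage rows and ledger mismatches, can be sketched as set comparisons. This is a sketch under stated assumptions: rows are already loaded as dicts, usage rows have an id field, and ledger entries point back at usage rows through a usage_record_id field. Both field names are hypothetical, as is the function name reconcile; the real schema may link the tables differently.

```python
from typing import Any, Dict, Iterable, List, Mapping


def reconcile(
    active_instance_ids: Iterable[str],
    usage_records: Iterable[Mapping[str, Any]],
    ledger_entries: Iterable[Mapping[str, Any]],
) -> Dict[str, List[str]]:
    """Report instances with no app_runtime usage row and usage rows with no ledger entry."""
    runtime_rows = [u for u in usage_records if u.get("usage_source") == "app_runtime"]

    # Missing usage row: activity happened, but nothing was metered for the instance.
    metered = {u["app_instance_id"] for u in runtime_rows}
    missing_usage = sorted(set(active_instance_ids) - metered)

    # Ledger mismatch: a usage row exists, but no ledger entry references it.
    # Assumption: ledger entries carry a hypothetical usage_record_id back-reference.
    ledgered = {e["usage_record_id"] for e in ledger_entries}
    ledger_mismatch = sorted(u["id"] for u in runtime_rows if u["id"] not in ledgered)

    return {"missing_usage": missing_usage, "ledger_mismatch": ledger_mismatch}
```

An empty result for both keys only covers these two items; attribution and control-plane classification still need their own checks.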

Mitigation

  1. Fix the owning layer:
     - metering emitter,
     - attribution mapping,
     - billing worker interpretation,
     - or UI/filter/export path.
  2. Do not patch around drift by inventing app-runtime-only ledgers or manual hidden adjustments.
  3. If remediation needs data correction:
     - use the approved auditable reconciliation procedure,
     - preserve correlation_id linkage in incident notes and corrective records.

Recovery Criteria

  1. Mixed usage listing/export clearly distinguishes allocation vs app_runtime.
  2. app_instance_id-scoped usage rows reconcile with ledger-visible customer impact.
  3. Control-plane cost rows are explainable by control_plane_component, operating_mode, control_plane_scope, and runtime_backend.
  4. No duplicate or missing customer-visible charges remain for impacted scope.
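
The duplicate-charge criterion can be spot-checked with a small counting sketch. It assumes ledger entries are loaded as dicts with a usage_record_id back-reference and an entry_type field whose value is "debit" for charges. Both field names and the function name duplicate_debits are hypothetical and would need to be mapped onto the real ledger_entries schema.

```python
from collections import Counter
from typing import Any, Iterable, List, Mapping


def duplicate_debits(ledger_entries: Iterable[Mapping[str, Any]]) -> List[str]:
    """Return usage record ids that were debited more than once."""
    counts = Counter(
        e["usage_record_id"]
        for e in ledger_entries
        if e.get("entry_type") == "debit"  # credits and reversals are not counted
    )
    return sorted(record_id for record_id, n in counts.items() if n > 1)
```

An empty list for the impacted scope supports the no-duplicates criterion; it does not replace the reconciliation against customer-visible billing state.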

Evidence to Capture

  1. Incident timeline with correlation_id and trace_id.
  2. Before/after query evidence for:
     - usage_records
     - ledger_entries
     - app instance metadata (app_instance_id, mode/scope/runtime backend)
  3. Customer-visible impact summary by tenant/project/app instance.
  4. Follow-up task for the owning layer if drift originated outside billing.