Runbook: App Runtime Billing Reconciliation Incident
Trigger
- User/operator report: app runtime charge is unexpected, missing, or duplicated.
- Billing UI/export shows mixed allocation and app-runtime rows that do not reconcile.
- Alert or log evidence indicates app-runtime metering drift, especially for control-plane components.
Required Context
- correlation_id from the API/UI error envelope or support ticket.
- trace_id when present.
- Attribution identifiers when available:
  - org_id
  - project_id
  - app_instance_id
  - usage_source
  - control_plane_component
  - operating_mode
  - control_plane_scope
  - runtime_backend
- Confirm whether the dispute affects:
  - one app instance,
  - one project,
  - one tenant,
  - or the billing worker broadly (degradation).
- Do not treat app-runtime usage as a separate billing system.
  - All reconciliation must still flow through usage_records and ledger_entries.
- If multiple app instances are impacted, freeze manual customer-facing adjustments until source attribution is verified.
Correlation-First Diagnosis
- Start in Loki with correlation_id:
  {service=~"gpuaas-(api|billing-worker|app-runtime-worker|webhook-worker)"} | json | correlation_id="<CORRELATION_ID>"
- If app_instance_id is known, pivot on it:
  {service=~"gpuaas-(api|billing-worker|app-runtime-worker)"} | json | app_instance_id="<APP_INSTANCE_ID>"
- Extract trace_id and inspect Tempo for:
  - app lifecycle call,
  - outbox relay publish,
  - billing worker handling,
  - any webhook/payment follow-on if customer funding is involved.
- Confirm source attribution on affected usage rows:
  - usage_source = app_runtime
  - app_instance_id present
  - control_plane_component correct for control-plane cost
  - operating_mode, control_plane_scope, runtime_backend align with the instance
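The two Loki queries above differ only in the service list and the label being filtered, so they can be generated from one helper. The following is a minimal sketch, not an existing tool: the function name `build_logql` and the input validation pattern are assumptions, while the service names and query shape are copied from the queries in this section.

```python
import re

# Service suffixes taken from the queries in this runbook.
SERVICES = ("api", "billing-worker", "app-runtime-worker", "webhook-worker")


def build_logql(field: str, value: str, services=SERVICES) -> str:
    """Build a LogQL query that filters JSON logs on one extracted field.

    Hypothetical helper: rejects values that could break out of the quoted
    string, on the assumption that IDs only use these characters.
    """
    if not re.fullmatch(r"[A-Za-z0-9_.:-]+", value):
        raise ValueError(f"unexpected characters in {field} value: {value!r}")
    selector = "|".join(services)
    return f'{{service=~"gpuaas-({selector})"}} | json | {field}="{value}"'
```

For the app_instance_id pivot, pass `services=SERVICES[:3]` to drop the webhook worker, matching the second query above.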
Reconciliation Checklist
- Missing usage row:
  - App runtime activity happened, but no usage_records row exists for the app_instance_id.
- Wrong source attribution:
  - Usage is recorded against allocation when it should be app_runtime, or vice versa.
- Wrong attribution anchor:
  - project_id, app_instance_id, operating_mode, or control_plane_scope does not match the instance.
- Ledger mismatch:
  - app-runtime usage_records exist, but no corresponding debit/credit interpretation appears in customer-visible billing state.
- Control-plane classification drift:
  - control_plane_component is false for scheduler/head/control services that should meter separately.
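The checklist above can be applied mechanically once the affected rows are exported. The sketch below is illustrative only: the `UsageRow` shape, the boolean type of control_plane_component, and the idea that ledger_entries reference usage row ids are all assumptions, and only the field names come from this runbook.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class UsageRow:
    # Hypothetical row shape; only the field names come from this runbook.
    id: str
    usage_source: str                 # "allocation" or "app_runtime"
    app_instance_id: Optional[str]
    project_id: str
    operating_mode: str
    control_plane_scope: str
    control_plane_component: bool     # assumed boolean


def classify(expected: UsageRow, rows: List[UsageRow],
             ledger_usage_ids: Set[str]) -> List[str]:
    """Return checklist findings for one app instance.

    `expected` carries the attribution the instance should have.
    `ledger_usage_ids` is assumed to hold the usage row ids that
    ledger_entries reference.
    """
    mine = [r for r in rows if r.app_instance_id == expected.app_instance_id]
    if not mine:
        return ["missing_usage_row"]
    findings = []
    anchors = ("project_id", "operating_mode", "control_plane_scope")
    for r in mine:
        if r.usage_source != expected.usage_source:
            findings.append(f"{r.id}: wrong_source_attribution")
        if any(getattr(r, a) != getattr(expected, a) for a in anchors):
            findings.append(f"{r.id}: wrong_attribution_anchor")
        if r.id not in ledger_usage_ids:
            findings.append(f"{r.id}: ledger_mismatch")
        if r.control_plane_component != expected.control_plane_component:
            findings.append(f"{r.id}: control_plane_classification_drift")
    return findings
```

An empty result means none of the five checklist conditions were detected for the rows supplied. It says nothing about rows outside that export.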
Mitigation
- Fix the owning layer:
  - metering emitter,
  - attribution mapping,
  - billing worker interpretation,
  - or UI/filter/export path.
- Do not patch around drift by inventing app-runtime-only ledgers or manual hidden adjustments.
- If remediation needs data correction:
  - use the approved, auditable reconciliation procedure,
  - preserve correlation_id linkage in incident notes and corrective records.
Recovery Criteria
- Mixed usage listing/export clearly distinguishes allocation vs app_runtime.
- app_instance_id-scoped usage rows reconcile with ledger-visible customer impact.
- Control-plane cost rows are explainable by control_plane_component, operating_mode, control_plane_scope, and runtime_backend.
- No duplicate or missing customer-visible charges remain for impacted scope.
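As one way to check the last criterion, duplicate charges can be flagged by grouping exported usage rows on a key that should be unique. This is a sketch under assumptions: rows are plain dicts, and `period_start` is a hypothetical field standing in for whatever identifies a metering interval in usage_records.

```python
from collections import Counter
from typing import Dict, List, Tuple


def find_duplicate_charges(rows: List[Dict[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (app_instance_id, usage_source, period_start) keys seen more than once.

    `period_start` is a hypothetical interval identifier; substitute the
    real uniqueness key of usage_records.
    """
    counts = Counter(
        (r["app_instance_id"], r["usage_source"], r["period_start"]) for r in rows
    )
    return sorted(key for key, n in counts.items() if n > 1)
```

An empty list for the impacted scope is consistent with the "no duplicate charges" criterion. Missing charges still need to be checked against the expected metering intervals.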
Evidence to Capture
- Incident timeline with correlation_id and trace_id.
- Before/after query evidence for:
  - usage_records
  - ledger_entries
  - app instance metadata (app_instance_id, mode/scope/runtime backend)
- Customer-visible impact summary by tenant/project/app instance.
- Follow-up task for the owning layer if drift originated outside billing.