Trail: Billing & Payments
End-to-end reading path through the money side of GPUaaS — immutable ledger, accrual loop, low-balance enforcement, Stripe webhook idempotency, refund hybrid policy.
Trail map
flowchart TB
classDef impl fill:#d1e7dd,stroke:#0a3622,color:#0a3622
classDef run fill:#e9d6ff,stroke:#1e1530,color:#1e1530
B1[1. PRD on billing]:::impl --> B2[2. Immutable ledger]:::impl
B2 --> B3[3. Accrual loop]:::impl
B3 --> B4[4. State machine + enforcement]:::impl
B4 --> B5[5. Payments + Stripe webhook]:::impl
B5 --> B6[6. Refund hybrid policy]:::impl
B6 --> B7[7. Policy keys]:::impl
B7 --> B8[8. Runbooks]:::run
1. PRD on billing
Contract
mindmap
  root((Billing requirements<br/>PRD §FR-6, FR-7))
    Metering
      SKU × quantity × duration
      Minor units integer
      Explicit currency
    Enforcement
      Low-balance warning
      Auto-release pending advisory
      Depleted balance forced release
      No auto-restart after top-up
    Payments
      Stripe Checkout
      Idempotent webhook
      Signature on raw body
      Domain event on credit
    Refunds
      Hybrid: provider within window
      Internal credit outside window
      Refundable amount bounded by policy
Key acceptance tests (Testing_Standards.md):
| AT | Check |
|---|---|
| AT-040 | Billing loop accrues cost over time |
| AT-041 | Low balance warning fires only once per low-state transition |
| AT-042 | Depleted balance triggers forced release |
| AT-051 | Webhook credits balance on valid event |
| AT-052 | Duplicate webhook does not double-credit |
| AT-053 | Webhook signature bypass rejected with 400 |
2. Immutable ledger
Implemented
erDiagram
users ||--o{ ledger_entries : "owns balance via"
allocations ||--o{ usage_records : "produces"
usage_records ||--o{ ledger_entries : "drives debit"
payment_sessions ||--o{ ledger_entries : "credits"
refund_records ||--o{ ledger_entries : "corrects"
ledger_entries ||--o{ ledger_entries : "corrects_entry_id"
ledger_entries {
uuid id PK
uuid user_id FK
bigint amount_minor "positive=credit, negative=debit"
text currency "ISO 4217"
text reason "usage|topup|refund|adjustment|credit_grant"
text reference_type "allocation|payment_session|refund_record"
uuid reference_id
text correlation_id
jsonb metadata
timestamp created_at "no updated_at, no deleted_at"
}
usage_records {
uuid id PK
uuid allocation_id FK
timestamp interval_start
timestamp interval_end
bigint cost_minor
}
payment_sessions {
uuid id PK
uuid user_id FK
text stripe_session_id
bigint amount_minor
text status
}
refund_records {
uuid id PK
uuid user_id FK
bigint amount_minor
text outcome "provider_refund|internal_credit"
}
Hard rules (Coding_Standards.md §5):
| Rule | Why |
|---|---|
| Never UPDATE ledger_entries | Auditability of money is non-negotiable |
| Never DELETE ledger_entries | Same |
| Corrections add a new row | With metadata.corrects_entry_id back-pointer |
| Balance is always computed from the sum | No mutable balance column |
| All money in minor units (integer) | No float drift |
| Currency always explicit on the row | Multi-currency-ready |
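The hard rules above can be illustrated with a minimal sketch. This is not the project's implementation: it uses Python with an in-memory SQLite database instead of Postgres, integer ids instead of UUIDs, and a plain corrects_entry_id column instead of the metadata back-pointer. The triggers and the helper names (open_ledger, post, balance) are assumptions made for the example.

```python
import sqlite3


def open_ledger() -> sqlite3.Connection:
    """Create an in-memory, append-only ledger (illustrative schema only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ledger_entries (
            id INTEGER PRIMARY KEY,
            user_id TEXT NOT NULL,
            amount_minor INTEGER NOT NULL,  -- positive=credit, negative=debit
            currency TEXT NOT NULL,         -- always explicit on the row
            reason TEXT NOT NULL,
            corrects_entry_id INTEGER       -- back-pointer used by corrections
        );
        -- Hypothetical guard: reject UPDATE and DELETE at the database level.
        CREATE TRIGGER ledger_no_update BEFORE UPDATE ON ledger_entries
        BEGIN SELECT RAISE(ABORT, 'ledger_entries is append-only'); END;
        CREATE TRIGGER ledger_no_delete BEFORE DELETE ON ledger_entries
        BEGIN SELECT RAISE(ABORT, 'ledger_entries is append-only'); END;
        """
    )
    return conn


def post(conn, user_id, amount_minor, currency, reason, corrects_entry_id=None):
    """Append one row and return its id. Corrections are new rows too."""
    cur = conn.execute(
        "INSERT INTO ledger_entries "
        "(user_id, amount_minor, currency, reason, corrects_entry_id) "
        "VALUES (?, ?, ?, ?, ?)",
        (user_id, amount_minor, currency, reason, corrects_entry_id),
    )
    return cur.lastrowid


def balance(conn, user_id, currency):
    """Balance is always the sum of rows; there is no mutable balance column."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount_minor), 0) FROM ledger_entries "
        "WHERE user_id = ? AND currency = ?",
        (user_id, currency),
    ).fetchone()
    return row[0]
```

A mistaken debit is reversed by posting an offsetting row that points back at it, so the history of both the error and the fix stays visible.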
3. Accrual loop
Implemented
sequenceDiagram
autonumber
participant TICK as billing-worker<br/>ticker
participant DB as Postgres
participant NATS as NATS
participant PW as provisioning-worker
participant NR as notification-relay
participant U as User
loop every billing.window_seconds (default 60)
TICK->>DB: SELECT active allocations
loop per active allocation
TICK->>DB: compute interval cost<br/>(rate × duration × gpus)
TICK->>DB: INSERT usage_records + ledger_entries (debit)
TICK->>DB: SELECT SUM(amount_minor) → balance
alt balance ≤ threshold && not yet warned
TICK->>DB: INSERT low_balance_events (idempotency lock)
TICK->>DB: INSERT outbox: billing.low_balance_warning
end
alt balance ≤ 0
TICK->>DB: INSERT outbox: billing.balance_depleted
TICK->>DB: INSERT outbox: provisioning.force_release_requested
end
end
end
NATS-->>NR: billing.low_balance_warning
NR->>U: WS push + email
NATS-->>PW: provisioning.force_release_requested
PW->>PW: graceful release flow
Why an "idempotency lock" on warnings? AT-041 requires the warning to fire only once per low-state transition. If the user lingers in low_balance, no new warning fires until the balance recovers and dips to the threshold again. The low_balance_events table records each trigger and prevents re-firing.
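Two pieces of the loop above lend themselves to a small sketch: the interval cost in integer minor units and the warn-once gate. The following Python is illustrative only. The hourly rate unit, the round-half-up rule, and the names interval_cost_minor and WarningGate are assumptions, and an in-memory set stands in for the low_balance_events table.

```python
from dataclasses import dataclass, field


def interval_cost_minor(rate_minor_per_gpu_hour: int, seconds: int, gpus: int) -> int:
    """rate x duration x gpus, kept in integers to avoid float drift.

    Assumes the rate is quoted per GPU-hour and rounds half up to the
    nearest minor unit; the real rounding rule is not stated in this doc.
    """
    numerator = rate_minor_per_gpu_hour * seconds * gpus
    return (numerator + 1800) // 3600


@dataclass
class WarningGate:
    """Fires once per low-state transition (the AT-041 behaviour)."""

    threshold_minor: int
    _in_low_state: set = field(default_factory=set)

    def should_warn(self, user_id: str, balance_minor: int) -> bool:
        if balance_minor > self.threshold_minor:
            # Recovered: clear the lock so the next dip warns again.
            self._in_low_state.discard(user_id)
            return False
        if user_id in self._in_low_state:
            # Still lingering in the low state: stay silent.
            return False
        self._in_low_state.add(user_id)
        return True
```

With a 60-second window, 4 GPUs at 250 minor units per GPU-hour accrue 17 minor units per tick under this rounding assumption.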
4. State machine + enforcement
Implemented
stateDiagram-v2
[*] --> healthy: account created
healthy --> low_balance: balance ≤ low_balance_threshold_minor
low_balance --> auto_release_pending: projected depletion in window
low_balance --> depleted: balance ≤ 0
auto_release_pending --> depleted: balance ≤ 0
auto_release_pending --> healthy: top-up
depleted --> healthy: top-up posted
low_balance --> healthy: top-up posted
note right of depleted
Active allocations force-released
via provisioning.force_release_requested
Billing stops; user must MANUALLY
reprovision after top-up (no auto-restart)
end note
note right of low_balance
Email + WS notification
via notification-relay
Idempotent on transition
end note
PRD §9 is explicit:
After top-up, user must manually reprovision (default). Auto-restart may be introduced as explicit future policy.
The "no auto-restart" rule guards against runaway charging: a top-up, even an accidental one, only restores the balance and never resumes billing on released allocations.
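The state diagram can be read as a transition table. The sketch below encodes it in Python as an illustration, not as the service's code. The state names and outbox subjects come from this page; the Event names and the step and outbox_events helpers are hypothetical.

```python
from enum import Enum
from typing import List


class State(Enum):
    HEALTHY = "healthy"
    LOW_BALANCE = "low_balance"
    AUTO_RELEASE_PENDING = "auto_release_pending"
    DEPLETED = "depleted"


class Event(Enum):  # hypothetical event names for the sketch
    BALANCE_LOW = "balance_low"
    PROJECTED_DEPLETION = "projected_depletion"
    BALANCE_DEPLETED = "balance_depleted"
    TOP_UP = "top_up"


# One entry per arrow in the state diagram.
TRANSITIONS = {
    (State.HEALTHY, Event.BALANCE_LOW): State.LOW_BALANCE,
    (State.LOW_BALANCE, Event.PROJECTED_DEPLETION): State.AUTO_RELEASE_PENDING,
    (State.LOW_BALANCE, Event.BALANCE_DEPLETED): State.DEPLETED,
    (State.AUTO_RELEASE_PENDING, Event.BALANCE_DEPLETED): State.DEPLETED,
    (State.AUTO_RELEASE_PENDING, Event.TOP_UP): State.HEALTHY,
    (State.DEPLETED, Event.TOP_UP): State.HEALTHY,
    (State.LOW_BALANCE, Event.TOP_UP): State.HEALTHY,
}


def step(state: State, event: Event) -> State:
    """Pairs that have no arrow in the diagram leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def outbox_events(old: State, new: State) -> List[str]:
    """Outbox subjects emitted on entering a state."""
    if new is State.DEPLETED and old is not State.DEPLETED:
        return ["billing.balance_depleted", "provisioning.force_release_requested"]
    if new is State.LOW_BALANCE and old is State.HEALTHY:
        return ["billing.low_balance_warning"]
    # Returning to healthy emits nothing: no auto-restart after a top-up.
    return []
```

Note that the transition from depleted back to healthy produces no provisioning event, which is the "no auto-restart" rule expressed in code.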
5. Payments + Stripe webhook
Implemented
sequenceDiagram
autonumber
participant U as User browser
participant API as cmd/api
participant PG as Postgres
participant ST as Stripe
participant WW as webhook-worker
participant NR as notification-relay
U->>API: POST /payments/checkout {amount, currency}
API->>PG: INSERT payment_sessions
API->>ST: create Checkout Session
ST-->>API: session URL
API-->>U: 200 {url}
U->>ST: complete payment
ST->>API: POST /payments/webhook (raw body + Stripe-Signature)
Note over API: raw body captured BEFORE any JSON parse<br/>(Coding_Standards.md §7)
API->>API: verify signature on EXACT bytes
alt signature invalid
API-->>ST: 400 stripe_signature_invalid
else valid
API->>PG: INSERT payment_webhook_events (event_id PK)
Note over PG: PK conflict = duplicate webhook<br/>silently dropped
API-->>ST: 200
end
WW->>PG: SELECT pending events FOR UPDATE SKIP LOCKED
WW->>PG: INSERT ledger_entries (credit)<br/>+ outbox: payments.balance_credited
WW->>PG: mark event processed
NR->>U: WS push "Topped up $X"
Three rules that make this hard to get wrong:
- Raw body first. Stripe signs the exact bytes it sent. Any middleware that pretty-prints, re-encodes, or parses-then-reserializes breaks the signature.
- Dedupe at INSERT. payment_webhook_events.event_id is the primary key. Duplicate webhooks (Stripe retries are common) just conflict and bounce.
- Worker uses FOR UPDATE SKIP LOCKED. Multiple webhook-worker replicas can run; they won't double-credit.
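The first two rules can be sketched together. The Python below is a hedged illustration, not the handler in cmd/api: it checks the Stripe-Signature header by hand (HMAC-SHA256 over the timestamp, a dot, and the raw body, compared against the v1 value), where production code would normally rely on the official Stripe library's verification helper. SQLite stands in for Postgres, and create_schema, record_event, handle_webhook, and the 300-second tolerance are assumptions.

```python
import hashlib
import hmac
import json
import sqlite3
import time
from typing import Optional


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE payment_webhook_events ("
        "event_id TEXT PRIMARY KEY, payload BLOB NOT NULL)"
    )


def verify_stripe_signature(raw_body: bytes, header: str, secret: str,
                            tolerance_s: int = 300,
                            now: Optional[float] = None) -> bool:
    """Verify the signature over the exact bytes received."""
    pairs = [item.split("=", 1) for item in header.split(",") if "=" in item]
    timestamp = next((v for k, v in pairs if k.strip() == "t"), None)
    signatures = [v for k, v in pairs if k.strip() == "v1"]
    if timestamp is None or not timestamp.isdigit() or not signatures:
        return False
    signed_payload = timestamp.encode() + b"." + raw_body
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()
    if not any(hmac.compare_digest(expected.encode(), s.encode()) for s in signatures):
        return False
    current = time.time() if now is None else now
    return abs(current - int(timestamp)) <= tolerance_s


def record_event(conn: sqlite3.Connection, event_id: str, raw_body: bytes) -> bool:
    """Return True for a new event, False when the primary key already exists."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO payment_webhook_events (event_id, payload) VALUES (?, ?)",
                (event_id, raw_body),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: drop silently


def handle_webhook(conn, raw_body: bytes, header: str, secret: str,
                   now: Optional[float] = None) -> int:
    """Return the HTTP status to send back to Stripe."""
    if not verify_stripe_signature(raw_body, header, secret, now=now):
        return 400
    event = json.loads(raw_body)  # parse only after the bytes are verified
    record_event(conn, event["id"], raw_body)
    return 200
```

A duplicate delivery still returns 200, so Stripe stops retrying, but only one row exists for the worker to turn into a ledger credit. The SKIP LOCKED worker side is omitted here because SQLite has no equivalent.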
6. Refund hybrid policy
Implemented
flowchart TB
REQ([User or admin: refund request]) --> CHECK{"Within<br/>refund_window_days?<br/>policy default 30"}
CHECK -- yes --> PROV[Stripe provider refund]
CHECK -- no --> CRED[Internal balance credit]
PROV --> POL{Refundable amount<br/>≤ unused balance?}
CRED --> POL
POL -- yes --> LED[INSERT ledger_entries<br/>amount_minor positive<br/>reason: refund or credit_grant]
POL -- no --> ERR[refund_window_exceeded<br/>or refund_amount_exceeded]
LED --> AUD[INSERT audit_logs<br/>actor, reason, old_balance, new_balance]
AUD --> OUT[outbox: payments.balance_credited]
OUT --> NOTIFY[WS + email to user]
classDef warn fill:#fff3cd,stroke:#332701
class ERR warn
The refund outcome is always recorded explicitly (provider_refund or internal_credit), so it is auditable and never ambiguous. The window comes from the allocation.refund_window_days policy key, not a hardcoded 30.
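The decision in the flowchart reduces to two checks, sketched below in Python as an illustration only. The function decide_refund, the RefundDecision and RefundError types, and the inclusive window boundary are assumptions; the outcome values and the refund_amount_exceeded code come from this page.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


class RefundError(Exception):
    def __init__(self, code: str):
        super().__init__(code)
        self.code = code


@dataclass(frozen=True)
class RefundDecision:
    outcome: str        # "provider_refund" or "internal_credit"
    amount_minor: int


def decide_refund(paid_at: datetime, requested_at: datetime,
                  amount_minor: int, unused_balance_minor: int,
                  refund_window_days: int) -> RefundDecision:
    """Pick the refund path; refund_window_days is read from policy by the caller."""
    if amount_minor <= 0:
        raise ValueError("amount_minor must be positive")
    if amount_minor > unused_balance_minor:
        raise RefundError("refund_amount_exceeded")
    # Assumption: a request made exactly at the window edge still counts as inside.
    within_window = requested_at - paid_at <= timedelta(days=refund_window_days)
    outcome = "provider_refund" if within_window else "internal_credit"
    return RefundDecision(outcome=outcome, amount_minor=amount_minor)
```

Returning the outcome as data keeps the caller responsible for the ledger row, the audit log entry, and the outbox event shown in the flowchart.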
7. Policy keys
Implemented
All billing values come from the policy_values table via PolicyClient, never from hardcoded constants. Resolution checks the most specific scope first and falls back outward: user → project → tenant → plan → global default.
flowchart LR
Q[Service code asks<br/>PolicyClient.GetInt key] --> S1{user-scope value?}
S1 -- yes --> R1[user value]
S1 -- no --> S2{project-scope?}
S2 -- yes --> R2[project value]
S2 -- no --> S3{tenant-scope?}
S3 -- yes --> R3[tenant value]
S3 -- no --> S4{plan-scope?}
S4 -- yes --> R4[plan value]
S4 -- no --> R5[global default]
classDef ret fill:#d1e7dd,stroke:#0a3622
class R1,R2,R3,R4,R5 ret
| Key | Type | Default | Effect |
|---|---|---|---|
| billing.window_seconds | int | 60 | Accrual tick interval |
| billing.low_balance_threshold_minor | int | 500 | Balance at or below this fires the warning |
| billing.minimum_deposit_minor | int | 1000 | Stripe checkout min |
| billing.maximum_deposit_minor | int | 100000 | Stripe checkout max |
| allocation.refund_window_days | int | 30 | Provider vs internal-credit split |
| notification.low_balance_enabled | bool | true | Toggle warnings |
| notification.balance_depleted_enabled | bool | true | Toggle depletion alerts |
→ Full reference: Policy keys
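The lookup order in the flowchart can be sketched as a loop over scopes from most to least specific. This Python is illustrative and does not reflect the real PolicyClient API; resolve_policy, the override key shape, and the context dictionary are assumptions.

```python
from typing import Any, Mapping, Tuple

# Most specific scope first, matching the flowchart.
SCOPE_ORDER = ("user", "project", "tenant", "plan")


def resolve_policy(key: str,
                   overrides: Mapping[Tuple[str, str, str], Any],
                   context: Mapping[str, str],
                   global_defaults: Mapping[str, Any]) -> Any:
    """Return the first scoped value found, else the global default.

    overrides maps (scope, scope_id, key) to a value, standing in for
    rows of the policy_values table.
    """
    for scope in SCOPE_ORDER:
        scope_id = context.get(scope)
        if scope_id is None:
            continue
        lookup = (scope, scope_id, key)
        if lookup in overrides:
            return overrides[lookup]
    return global_defaults[key]
```

For example, a tenant-level threshold of 2000 applies to every user in that tenant except one who has a user-level value of their own.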
8. Runbooks
Runbook
flowchart LR
A[Billing alert / page] --> Q1{Symptom}
Q1 -- accrual wedged / no new usage_records --> R1[Billing_Worker_Failure_Runbook]
Q1 -- webhook backlog / signature errors --> R2[Webhook_Processing_Outage_Runbook]
Q1 -- NATS DLQ growth / worker pile-up --> R3[Queue_Backlog_Runbook]
Q1 -- API p99 spike on /payments/* --> R4[API_Degradation_Runbook]
classDef rb fill:#e9d6ff,stroke:#1e1530
class R1,R2,R3,R4 rb
| Runbook | When |
|---|---|
| Billing Worker Failure | Accrual loop wedged, depleted-balance enforcement not triggering |
| Webhook Processing Outage | Stripe webhook backlog, signature failures |
| Queue Backlog | NATS DLQ or worker backlog growth |
| API Degradation | API p50/p99 latency or error-budget burn on payment paths |
End-to-end recap
sequenceDiagram
autonumber
participant U as User
participant API as cmd/api
participant PW as provisioning-worker
participant BW as billing-worker
participant WW as webhook-worker
participant DB as Postgres
participant ST as Stripe
U->>API: POST /allocations
API->>PW: provisioning workflow
PW->>API: allocation active
loop accrual
BW->>DB: insert usage + ledger debit
DB-->>BW: balance dropping
end
BW-->>U: low_balance_warning (WS + email)
U->>API: top-up via Stripe Checkout
ST->>API: webhook (signed raw body)
API->>DB: enqueue webhook event (dedupe by event_id)
WW->>DB: ledger credit + payments.balance_credited
WW-->>U: WS "topped up"
Note over BW: balance recovered → no force-release
U->>API: release allocation
API->>PW: release flow
BW->>BW: stop accrual on releasing.completed