Trail: Billing & Payments

End-to-end reading path through the money side of GPUaaS: the immutable ledger, the accrual loop, low-balance enforcement, Stripe webhook idempotency, and the hybrid refund policy.

Trail map

flowchart TB
    classDef impl fill:#d1e7dd,stroke:#0a3622,color:#0a3622
    classDef run  fill:#e9d6ff,stroke:#1e1530,color:#1e1530

    B1[1. PRD on billing]:::impl --> B2[2. Immutable ledger]:::impl
    B2 --> B3[3. Accrual loop]:::impl
    B3 --> B4[4. State machine + enforcement]:::impl
    B4 --> B5[5. Payments + Stripe webhook]:::impl
    B5 --> B6[6. Refund hybrid policy]:::impl
    B6 --> B7[7. Policy keys]:::impl
    B7 --> B8[8. Runbooks]:::run

1. PRD on billing

Contract

mindmap
  root((Billing requirements<br/>PRD §FR-6, FR-7))
    Metering
      SKU × quantity × duration
      Minor units integer
      Explicit currency
    Enforcement
      Low-balance warning
      Auto-release pending advisory
      Depleted balance forced release
      No auto-restart after top-up
    Payments
      Stripe Checkout
      Idempotent webhook
      Signature on raw body
      Domain event on credit
    Refunds
      Hybrid: provider within window
      Internal credit outside window
      Refundable amount bounded by policy

Key acceptance tests (Testing_Standards.md):

| AT | Check |
| --- | --- |
| AT-040 | Billing loop accrues cost over time |
| AT-041 | Low balance warning fires only once per low-state transition |
| AT-042 | Depleted balance triggers forced release |
| AT-051 | Webhook credits balance on valid event |
| AT-052 | Duplicate webhook does not double-credit |
| AT-053 | Webhook signature bypass rejected with 400 |

2. Immutable ledger

Implemented

erDiagram
    users ||--o{ ledger_entries : "owns balance via"
    allocations ||--o{ usage_records : "produces"
    usage_records ||--o{ ledger_entries : "drives debit"
    payment_sessions ||--o{ ledger_entries : "credits"
    refund_records ||--o{ ledger_entries : "corrects"
    ledger_entries ||--o{ ledger_entries : "corrects_entry_id"

    ledger_entries {
        uuid id PK
        uuid user_id FK
        bigint amount_minor "positive=credit, negative=debit"
        text currency "ISO 4217"
        text reason "usage|topup|refund|adjustment|credit_grant"
        text reference_type "allocation|payment_session|refund_record"
        uuid reference_id
        text correlation_id
        jsonb metadata
        timestamp created_at "no updated_at, no deleted_at"
    }
    usage_records {
        uuid id PK
        uuid allocation_id FK
        timestamp interval_start
        timestamp interval_end
        bigint cost_minor
    }
    payment_sessions {
        uuid id PK
        uuid user_id FK
        text stripe_session_id
        bigint amount_minor
        text status
    }
    refund_records {
        uuid id PK
        uuid user_id FK
        bigint amount_minor
        text outcome "provider_refund|internal_credit"
    }

Hard rules (Coding_Standards.md §5):

| Rule | Why |
| --- | --- |
| Never UPDATE ledger_entries | Auditability of money is non-negotiable |
| Never DELETE ledger_entries | Same |
| Corrections add a new row | With metadata.corrects_entry_id back-pointer |
| Balance is always computed from the sum | No mutable balance column |
| All money in minor units (integer) | No float drift |
| Currency always explicit on the row | Multi-currency-ready |
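The hard rules can be illustrated with a minimal in-memory sketch. It is not the service's implementation: the `Ledger` class, its `post` and `balance` methods, and the sample user IDs are hypothetical stand-ins for the ledger_entries table, and only the column names come from the schema above.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)  # frozen: a posted entry cannot be mutated in place
class LedgerEntry:
    id: str
    user_id: str
    amount_minor: int  # positive = credit, negative = debit
    currency: str      # ISO 4217, always explicit on the row
    reason: str        # usage | topup | refund | adjustment | credit_grant
    metadata: dict = field(default_factory=dict)


class Ledger:
    """Hypothetical in-memory stand-in for the ledger_entries table."""

    def __init__(self) -> None:
        self._entries: List[LedgerEntry] = []

    def post(self, user_id: str, amount_minor: int, currency: str, reason: str,
             corrects_entry_id: Optional[str] = None) -> LedgerEntry:
        # Money is an integer count of minor units; floats are rejected outright.
        if not isinstance(amount_minor, int) or isinstance(amount_minor, bool):
            raise TypeError("amount_minor must be an integer number of minor units")
        metadata = {"corrects_entry_id": corrects_entry_id} if corrects_entry_id else {}
        entry = LedgerEntry(str(uuid.uuid4()), user_id, amount_minor, currency, reason, metadata)
        self._entries.append(entry)  # append only: no update or delete path exists
        return entry

    def balance(self, user_id: str, currency: str) -> int:
        # The balance is always derived from the rows, never stored.
        return sum(e.amount_minor for e in self._entries
                   if e.user_id == user_id and e.currency == currency)


ledger = Ledger()
ledger.post("u1", 1000, "USD", "topup")
overcharge = ledger.post("u1", -300, "USD", "usage")
# The debit should have been 200, so a new row gives 100 back and points at the original.
ledger.post("u1", 100, "USD", "adjustment", corrects_entry_id=overcharge.id)
print(ledger.balance("u1", "USD"))  # 800
```

The correction leaves the wrong row visible and adds a compensating one, so the history shows both the mistake and the fix.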

3. Accrual loop

Implemented

sequenceDiagram
    autonumber
    participant TICK as billing-worker<br/>ticker
    participant DB as Postgres
    participant NATS as NATS
    participant PW as provisioning-worker
    participant NR as notification-relay
    participant U as User

    loop every billing.window_seconds (default 60)
        TICK->>DB: SELECT active allocations
        loop per active allocation
            TICK->>DB: compute interval cost<br/>(rate × duration × gpus)
            TICK->>DB: INSERT usage_records + ledger_entries (debit)
            TICK->>DB: SELECT SUM(amount_minor) → balance
            alt balance ≤ threshold && not yet warned
                TICK->>DB: INSERT low_balance_events (idempotency lock)
                TICK->>DB: INSERT outbox: billing.low_balance_warning
            end
            alt balance ≤ 0
                TICK->>DB: INSERT outbox: billing.balance_depleted
                TICK->>DB: INSERT outbox: provisioning.force_release_requested
            end
        end
    end

    NATS-->>NR: billing.low_balance_warning
    NR->>U: WS push + email

    NATS-->>PW: provisioning.force_release_requested
    PW->>PW: graceful release flow

Why an "idempotency lock" on warnings? AT-041 requires the warning to fire once per low-state transition. While the user stays in low_balance, no further warnings are sent; the warning re-arms only after the balance recovers and then drops to the threshold again. The low_balance_events table records the trigger and prevents re-firing.
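As a rough sketch of two pieces of the loop above, the snippet below computes an interval cost in integer minor units and gates the warning so it fires once per transition. The function and class names are hypothetical, the hourly rate unit and the round-up rule are assumptions (the source only gives rate × duration × gpus), and the in-memory set merely stands in for the low_balance_events table.

```python
from typing import Set


def interval_cost_minor(rate_minor_per_gpu_hour: int, gpus: int, seconds: int) -> int:
    """rate x duration x gpus in integer minor units.

    Assumption: rates are quoted per GPU-hour and partial units round up.
    """
    numerator = rate_minor_per_gpu_hour * gpus * seconds
    return -(-numerator // 3600)  # ceiling division, no floats involved


class LowBalanceGate:
    """Hypothetical stand-in for low_balance_events: at most one open warning per user."""

    def __init__(self, threshold_minor: int) -> None:
        self.threshold_minor = threshold_minor
        self._warned: Set[str] = set()

    def observe(self, user_id: str, balance_minor: int) -> bool:
        """Return True only when the user has just crossed into the low state."""
        if balance_minor > self.threshold_minor:
            self._warned.discard(user_id)  # recovered: re-arm the warning
            return False
        if user_id in self._warned:
            return False  # still low, already warned for this transition
        self._warned.add(user_id)
        return True


print(interval_cost_minor(250, 2, 60))  # 9
gate = LowBalanceGate(threshold_minor=500)
print([gate.observe("u1", b) for b in (600, 480, 450, 1500, 490)])
# [False, True, False, False, True]
```

In the real loop the gate would be a row insert inside the same transaction as the outbox write, so a crash cannot record the lock without also queuing the event.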


4. State machine + enforcement

Implemented

stateDiagram-v2
    [*] --> healthy: account created
    healthy --> low_balance: balance ≤ low_balance_threshold_minor
    low_balance --> auto_release_pending: projected depletion in window
    low_balance --> depleted: balance ≤ 0
    auto_release_pending --> depleted: balance ≤ 0
    auto_release_pending --> healthy: top-up
    depleted --> healthy: top-up posted
    low_balance --> healthy: top-up posted

    note right of depleted
      Active allocations force-released
      via provisioning.force_release_requested
      Billing stops; user must MANUALLY
      reprovision after top-up (no auto-restart)
    end note

    note right of low_balance
      Email + WS notification
      via notification-relay
      Idempotent on transition
    end note

PRD §9 is explicit:

After top-up, user must manually reprovision (default). Auto-restart may be introduced as explicit future policy.

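The "no auto-restart" rule guards against runaway charging: a top-up restores the balance but never silently restarts released allocations, so billing resumes only when the user explicitly reprovisions.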
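The state diagram can be read as a lookup table. The sketch below encodes exactly the seven transitions drawn above; the event labels (`top_up_posted` and so on) and the helper names are hypothetical, and treating unlisted (state, event) pairs as no-ops is an assumption. The outbox subjects are the ones named in the accrual loop.

```python
from typing import Dict, List, Tuple

# (current state, event) -> next state, copied from the diagram.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("healthy", "balance_at_or_below_threshold"): "low_balance",
    ("low_balance", "projected_depletion_in_window"): "auto_release_pending",
    ("low_balance", "balance_at_or_below_zero"): "depleted",
    ("auto_release_pending", "balance_at_or_below_zero"): "depleted",
    ("auto_release_pending", "top_up_posted"): "healthy",
    ("depleted", "top_up_posted"): "healthy",
    ("low_balance", "top_up_posted"): "healthy",
}


def next_state(state: str, event: str) -> str:
    """Return the next state; unlisted pairs leave the state unchanged (assumption)."""
    return TRANSITIONS.get((state, event), state)


def outbox_events(old: str, new: str) -> List[str]:
    """Outbox subjects to emit when a transition actually happens."""
    if old == new:
        return []
    if new == "low_balance":
        return ["billing.low_balance_warning"]
    if new == "depleted":
        return ["billing.balance_depleted", "provisioning.force_release_requested"]
    return []  # returning to healthy emits nothing that would restart workloads


state = "healthy"
for event in ("balance_at_or_below_threshold", "balance_at_or_below_zero", "top_up_posted"):
    new = next_state(state, event)
    print(f"{state} -> {new}: {outbox_events(state, new)}")
    state = new
```

The last step is the one the PRD calls out: `depleted -> healthy` produces no provisioning event, so nothing restarts until the user asks for it.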


5. Payments + Stripe webhook

Implemented

sequenceDiagram
    autonumber
    participant U as User browser
    participant API as cmd/api
    participant PG as Postgres
    participant ST as Stripe
    participant WW as webhook-worker
    participant NR as notification-relay

    U->>API: POST /payments/checkout {amount, currency}
    API->>PG: INSERT payment_sessions
    API->>ST: create Checkout Session
    ST-->>API: session URL
    API-->>U: 200 {url}
    U->>ST: complete payment

    ST->>API: POST /payments/webhook (raw body + Stripe-Signature)
    Note over API: raw body captured BEFORE any JSON parse<br/>(Coding_Standards.md §7)
    API->>API: verify signature on EXACT bytes
    alt signature invalid
        API-->>ST: 400 stripe_signature_invalid
    else valid
        API->>PG: INSERT payment_webhook_events (event_id PK)
        Note over PG: PK conflict = duplicate webhook<br/>silently dropped
        API-->>ST: 200
    end

    WW->>PG: SELECT pending events FOR UPDATE SKIP LOCKED
    WW->>PG: INSERT ledger_entries (credit)<br/>+ outbox: payments.balance_credited
    WW->>PG: mark event processed

    NR->>U: WS push "Topped up $X"

Three rules that make this hard to get wrong:

  1. Raw body first. Stripe signs the bytes Stripe sent. Any middleware that pretty-prints, re-encodes, or parses-then-reserializes breaks the signature.
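  2. Dedupe at INSERT. payment_webhook_events.event_id is the primary key. A duplicate webhook (Stripe retries are common) hits a key conflict and is dropped, and the endpoint still answers 200.
  3. The worker claims rows with FOR UPDATE SKIP LOCKED. Multiple webhook-worker replicas can run in parallel without two of them picking up the same event, so they won't double-credit.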
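The first two rules can be sketched with the standard library alone. This assumes Stripe's documented header format (`t=<unix timestamp>,v1=<hex HMAC-SHA256>` over `"<t>." + raw body`). The function names, the 300-second tolerance, the response strings, and the use of SQLite in place of Postgres are all illustrative assumptions, not the service's code; in Postgres the dedupe would be an `INSERT ... ON CONFLICT (event_id) DO NOTHING`.

```python
import hashlib
import hmac
import json
import sqlite3
import time
from typing import Optional, Tuple


def verify_stripe_signature(raw_body: bytes, header: str, secret: str,
                            tolerance_s: int = 300, now: Optional[float] = None) -> bool:
    """Verify the signature against the exact bytes received, before any parsing."""
    pairs = [item.strip().split("=", 1) for item in header.split(",") if "=" in item]
    timestamp = next((v for k, v in pairs if k == "t"), None)
    candidates = [v for k, v in pairs if k == "v1"]
    if timestamp is None or not timestamp.isdecimal() or not candidates:
        return False
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance_s:  # reject stale or replayed deliveries
        return False
    signed_payload = timestamp.encode() + b"." + raw_body
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()
    return any(hmac.compare_digest(expected.encode(), c.encode()) for c in candidates)


def handle_webhook(conn: sqlite3.Connection, raw_body: bytes, header: str,
                   secret: str, now: Optional[float] = None) -> Tuple[int, str]:
    if not verify_stripe_signature(raw_body, header, secret, now=now):
        return 400, "stripe_signature_invalid"
    event_id = json.loads(raw_body)["id"]  # parse only after the signature checks out
    cur = conn.execute(
        "INSERT OR IGNORE INTO payment_webhook_events (event_id, payload) VALUES (?, ?)",
        (event_id, raw_body),
    )
    conn.commit()
    # rowcount is 0 when the primary key already existed, i.e. a duplicate delivery.
    return 200, ("queued" if cur.rowcount == 1 else "duplicate")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment_webhook_events (event_id TEXT PRIMARY KEY, payload BLOB)")
secret = "whsec_example_only"  # made-up secret for the demo
body = b'{"id": "evt_123", "type": "checkout.session.completed"}'
ts = 1700000000
sig = hmac.new(secret.encode(), f"{ts}.".encode() + body, hashlib.sha256).hexdigest()
header = f"t={ts},v1={sig}"

print(handle_webhook(conn, body, header, secret, now=ts))         # (200, 'queued')
print(handle_webhook(conn, body, header, secret, now=ts))         # (200, 'duplicate')
print(handle_webhook(conn, body + b" ", header, secret, now=ts))  # (400, 'stripe_signature_invalid')
```

The third call shows why rule 1 matters: a single appended space, the kind of change a re-serializing middleware makes, is enough to invalidate the signature.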

6. Refund hybrid policy

Implemented

flowchart TB
    REQ([User or admin: refund request]) --> CHECK{"Within<br/>refund_window_days?<br/>policy default 30"}
    CHECK -- yes --> PROV[Stripe provider refund]
    CHECK -- no --> CRED[Internal balance credit]
    PROV --> POL{Refundable amount<br/>≤ unused balance?}
    CRED --> POL
    POL -- yes --> LED[INSERT ledger_entries<br/>amount_minor positive<br/>reason: refund or credit_grant]
    POL -- no --> ERR[refund_window_exceeded<br/>or refund_amount_exceeded]
    LED --> AUD[INSERT audit_logs<br/>actor, reason, old_balance, new_balance]
    AUD --> OUT[outbox: payments.balance_credited]
    OUT --> NOTIFY[WS + email to user]

    classDef warn fill:#fff3cd,stroke:#332701
    class ERR warn

Refund outcome is explicit (provider_refund or internal_credit), so every refund is auditable and never ambiguous. The window comes from the allocation.refund_window_days policy key, not a hardcoded 30.
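The decision in the flowchart reduces to two checks, sketched below. `decide_refund`, `RefundError`, and the parameter names are hypothetical. The source does not say which timestamp the window is measured from, so `window_start` is left to the caller, and the inclusive boundary is an assumption.

```python
from datetime import datetime, timedelta, timezone


class RefundError(Exception):
    """Carries a machine-readable code such as refund_amount_exceeded."""

    def __init__(self, code: str) -> None:
        super().__init__(code)
        self.code = code


def decide_refund(window_start: datetime, requested_at: datetime, amount_minor: int,
                  unused_balance_minor: int, refund_window_days: int) -> str:
    """Return 'provider_refund' or 'internal_credit', or raise RefundError."""
    if amount_minor <= 0:
        raise ValueError("amount_minor must be positive")
    # Step 1: the window decides the channel (assumed inclusive at the boundary).
    within_window = requested_at - window_start <= timedelta(days=refund_window_days)
    outcome = "provider_refund" if within_window else "internal_credit"
    # Step 2: either channel is bounded by the unused balance.
    if amount_minor > unused_balance_minor:
        raise RefundError("refund_amount_exceeded")
    return outcome


paid = datetime(2024, 1, 1, tzinfo=timezone.utc)
window_days = 30  # would come from the allocation.refund_window_days policy key
print(decide_refund(paid, paid + timedelta(days=10), 500, 800, window_days))  # provider_refund
print(decide_refund(paid, paid + timedelta(days=45), 500, 800, window_days))  # internal_credit
try:
    decide_refund(paid, paid + timedelta(days=10), 900, 800, window_days)
except RefundError as err:
    print(err.code)  # refund_amount_exceeded
```

Returning the outcome as a plain string mirrors the refund_records.outcome column, so the caller can store it unchanged.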


7. Policy keys

Implemented

All billing values come from the policy_values table via PolicyClient and are never hardcoded constants. Resolution checks the most specific scope first and falls back outward: user → project → tenant → plan → global default.

flowchart LR
    Q[Service code asks<br/>PolicyClient.GetInt key] --> S1{user-scope value?}
    S1 -- yes --> R1[user value]
    S1 -- no --> S2{project-scope?}
    S2 -- yes --> R2[project value]
    S2 -- no --> S3{tenant-scope?}
    S3 -- yes --> R3[tenant value]
    S3 -- no --> S4{plan-scope?}
    S4 -- yes --> R4[plan value]
    S4 -- no --> R5[global default]

    classDef ret fill:#d1e7dd,stroke:#0a3622
    class R1,R2,R3,R4,R5 ret

| Key | Type | Default | Effect |
| --- | --- | --- | --- |
| billing.window_seconds | int | 60 | Accrual tick interval |
| billing.low_balance_threshold_minor | int | 500 | Warning fires when the balance is at or below this value |
| billing.minimum_deposit_minor | int | 1000 | Stripe checkout minimum |
| billing.maximum_deposit_minor | int | 100000 | Stripe checkout maximum |
| allocation.refund_window_days | int | 30 | Provider vs internal-credit split |
| notification.low_balance_enabled | bool | true | Toggle warnings |
| notification.balance_depleted_enabled | bool | true | Toggle depletion alerts |
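The lookup order in the flowchart (user, then project, tenant, plan, and finally the global default) can be sketched as follows. This is a hypothetical Python analogue, not the real PolicyClient: the `get_int` signature, the dictionary layout, and the sample IDs are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

SCOPE_ORDER = ("user", "project", "tenant", "plan", "global")  # most specific first


class PolicyClient:
    """Hypothetical stand-in backed by a dict instead of the policy_values table."""

    def __init__(self, values: Dict[Tuple[str, str, str], int]) -> None:
        self._values = values  # (scope, scope_id, key) -> value

    def get_int(self, key: str, *, user_id: Optional[str] = None,
                project_id: Optional[str] = None, tenant_id: Optional[str] = None,
                plan_id: Optional[str] = None) -> int:
        ids = {"user": user_id, "project": project_id, "tenant": tenant_id,
               "plan": plan_id, "global": "*"}
        for scope in SCOPE_ORDER:
            scope_id = ids[scope]
            if scope_id is None:
                continue  # caller did not supply this scope
            if (scope, scope_id, key) in self._values:
                return self._values[(scope, scope_id, key)]
        raise KeyError(f"no value for policy key {key!r}")


KEY = "billing.low_balance_threshold_minor"
client = PolicyClient({
    ("global", "*", KEY): 500,
    ("tenant", "t1", KEY): 2000,
    ("user", "u9", KEY): 100,
})
print(client.get_int(KEY, user_id="u9", tenant_id="t1"))  # 100 (user override wins)
print(client.get_int(KEY, user_id="u2", tenant_id="t1"))  # 2000 (falls back to tenant)
print(client.get_int(KEY))                                # 500 (global default)
```

Raising on a missing key rather than returning zero keeps a mistyped key from silently disabling a threshold.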

→ Full reference: Policy keys


8. Runbooks

Runbook

flowchart LR
    A[Billing alert / page] --> Q1{Symptom}
    Q1 -- accrual wedged / no new usage_records --> R1[Billing_Worker_Failure_Runbook]
    Q1 -- webhook backlog / signature errors --> R2[Webhook_Processing_Outage_Runbook]
    Q1 -- NATS DLQ growth / worker pile-up --> R3[Queue_Backlog_Runbook]
    Q1 -- API p99 spike on /payments/* --> R4[API_Degradation_Runbook]
    classDef rb fill:#e9d6ff,stroke:#1e1530
    class R1,R2,R3,R4 rb

| Runbook | When |
| --- | --- |
| Billing Worker Failure | Accrual loop wedged, depleted-balance enforcement not triggering |
| Webhook Processing Outage | Stripe webhook backlog, signature failures |
| Queue Backlog | NATS DLQ or worker backlog growth |
| API Degradation | API p50/p99 latency or error-budget burn on payment paths |

End-to-end recap

sequenceDiagram
    autonumber
    participant U as User
    participant API as cmd/api
    participant PW as provisioning-worker
    participant BW as billing-worker
    participant WW as webhook-worker
    participant DB as Postgres
    participant ST as Stripe

    U->>API: POST /allocations
    API->>PW: provisioning workflow
    PW->>API: allocation active
    loop accrual
        BW->>DB: insert usage + ledger debit
        DB-->>BW: balance dropping
    end
    BW-->>U: low_balance_warning (WS + email)
    U->>API: top-up via Stripe Checkout
    ST->>API: webhook (signed raw body)
    API->>DB: enqueue webhook event (dedupe by event_id)
    WW->>DB: ledger credit + payments.balance_credited
    WW-->>U: WS "topped up"
    Note over BW: balance recovered → no force-release
    U->>API: release allocation
    API->>PW: release flow
    BW->>BW: stop accrual on releasing.completed