State Machines (Current + Target Clarification)
1. Allocation Lifecycle
Canonical lifecycle (contract + implementation target)
- `requested -> provisioning -> active -> releasing -> released`
- Failure side transitions:
  - `provisioning -> failed`
  - `releasing -> release_failed` (after `max_deliver` retries exhausted on `provisioning.releasing.requested`)
`release_failed` state behaviour
- Billing: usage window is closed when `provisioning.release_failed` is received by billing-worker; the user is not charged for a failed release.
- Node: remains assigned to the allocation until an admin manually retries or removes the assignment. Surface via `GET /api/v1/admin/allocations?status=release_failed`.
- Retry path: admin uses `POST /api/v1/admin/allocations/{id}/force-release` to trigger a new release attempt, transitioning the allocation back to `releasing`.
- User retry: `POST /api/v1/allocations/{id}/release` on a `release_failed` allocation is also accepted and transitions back to `releasing`.
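
The lifecycle and the `release_failed` retry paths above can be written down as a transition table. The following is a minimal sketch, not the service's implementation: the `AllocationStatus` enum, the `ALLOWED` map, and `can_transition` are hypothetical names, and treating `failed` and `released` as terminal is an assumption, since this section lists no outgoing transitions for them.

```python
from enum import Enum


class AllocationStatus(str, Enum):
    REQUESTED = "requested"
    PROVISIONING = "provisioning"
    ACTIVE = "active"
    RELEASING = "releasing"
    RELEASED = "released"
    FAILED = "failed"
    RELEASE_FAILED = "release_failed"


S = AllocationStatus

# Allowed transitions, copied from the canonical lifecycle and the
# release_failed retry paths described in this section.
ALLOWED = {
    S.REQUESTED: {S.PROVISIONING},
    S.PROVISIONING: {S.ACTIVE, S.FAILED},
    S.ACTIVE: {S.RELEASING},
    S.RELEASING: {S.RELEASED, S.RELEASE_FAILED},
    # Admin force-release or user release retry re-enters releasing.
    S.RELEASE_FAILED: {S.RELEASING},
    # Assumption: no outgoing transitions are documented for these two.
    S.RELEASED: set(),
    S.FAILED: set(),
}


def can_transition(current: AllocationStatus, target: AllocationStatus) -> bool:
    """Return True if the documented lifecycle allows current -> target."""
    return target in ALLOWED[current]
```

A guard like this would reject, for example, a direct `active -> released` jump while accepting `release_failed -> releasing`.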
1a. Allocation Group Lifecycle
Allocation groups are aggregate parent resources over normal single-node allocations. They do not replace allocation state and they do not own billing or placement correctness.
Canonical aggregate lifecycle:
- requested -> provisioning -> active -> releasing -> released
- Failure/degraded side transitions:
  - requested|provisioning -> failed when no required member can become usable
  - active -> degraded when one required member fails while another member remains usable
  - releasing -> release_failed when one or more member releases exhaust retries
Rules:
- Group status is derived from member allocation status plus group-level release
intent.
- Member allocations keep their own allocation lifecycle, connection target, and
usage/billing windows.
- Group release fans out to member release requests and remains idempotent under
the group release idempotency key.
- App runtimes may bind an app instance to an allocation_group_id, but
app-specific topology and member semantics stay in the app-instance member
contract.
Full model: doc/architecture/Allocation_Group_Model_v1.md.
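
To illustrate the "derived from member allocation status plus group-level release intent" rule, here is a minimal sketch of one possible derivation. It is an assumption-heavy illustration rather than the documented algorithm: `derive_group_status` is a hypothetical function, every member is treated as required, a never-provisioned (`failed`) member counts as settled during release, and member states this section does not cover without release intent raise an error instead of being guessed.

```python
def derive_group_status(member_statuses, release_requested=False):
    """Derive an aggregate group status from member allocation statuses."""
    members = list(member_statuses)
    if not members:
        raise ValueError("an allocation group needs at least one member")

    if release_requested:
        # One or more member releases exhausted their retries.
        if "release_failed" in members:
            return "release_failed"
        # Assumption: members that never provisioned count as settled.
        if all(s in ("released", "failed") for s in members):
            return "released"
        return "releasing"

    unexpected = set(members) - {"requested", "provisioning", "active", "failed"}
    if unexpected:
        # Not covered by the rules above; refuse rather than guess.
        raise ValueError(f"unexpected member statuses: {sorted(unexpected)}")

    if all(s == "failed" for s in members):
        return "failed"  # no member can become usable
    if "failed" in members and "active" in members:
        return "degraded"  # a member failed while another stays usable
    if all(s == "active" for s in members):
        return "active"
    if all(s == "requested" for s in members):
        return "requested"
    return "provisioning"
```

Because the status is recomputed from members on every call, the group never stores state that can drift from its allocations.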
2. Node Assignment Lifecycle
Current behavior
- Free when `assignedAllocationId = null`
- In use when `assignedAllocationId = allocation.id`
Target
- Derive occupancy from active allocation relation in DB.
- Keep assignment pointer as cache only if needed.
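
The target above could be sketched as a pure function over allocation rows. This is a hedged illustration only: the `Allocation` dataclass, `OCCUPYING_STATUSES`, and `occupied_nodes` are hypothetical, and the choice of which statuses still hold a node is an assumption (this document only states explicitly that `release_failed` keeps the node assigned).

```python
from dataclasses import dataclass

# Assumption: these statuses still hold a node. Only release_failed is
# stated explicitly in this document; the rest are illustrative.
OCCUPYING_STATUSES = frozenset(
    {"provisioning", "active", "releasing", "release_failed"}
)


@dataclass(frozen=True)
class Allocation:
    id: str
    node_id: str
    status: str


def occupied_nodes(allocations):
    """Map node id -> allocation id for allocations that still hold a node."""
    occupancy = {}
    for alloc in allocations:
        if alloc.status not in OCCUPYING_STATUSES:
            continue
        if alloc.node_id in occupancy:
            # Two live allocations on one node would be a placement bug.
            raise ValueError(f"node {alloc.node_id} is held by two allocations")
        occupancy[alloc.node_id] = alloc.id
    return occupancy
```

With occupancy derived this way, `assignedAllocationId` would only need to be kept as a cache and could be rebuilt from the allocation table.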
3. Usage/Billing Lifecycle
Current behavior
- Usage starts with allocation creation: `startTime`, `endTime=null`
- Billing loop every minute updates `lastBilledAt` and `cost`
- Usage closes on allocation release (`endTime` set)
Target
- Event-sourced debit windows with idempotency key per `(usage_record, interval_window)`.
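
A minimal sketch of how a per-window idempotency key could make debits safe to replay. The names `interval_window`, `debit_idempotency_key`, and `DebitLog` are hypothetical, the one-minute window size is an assumption borrowed from the current billing loop cadence, and the in-memory dict stands in for whatever durable store the real design uses.

```python
import hashlib
from datetime import datetime, timezone


def interval_window(ts: datetime) -> datetime:
    """Floor a timezone-aware timestamp to the start of its one-minute window."""
    return ts.astimezone(timezone.utc).replace(second=0, microsecond=0)


def debit_idempotency_key(usage_record_id: str, window_start: datetime) -> str:
    """Build a stable key for the (usage_record, interval_window) pair."""
    raw = f"{usage_record_id}:{window_start.isoformat()}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class DebitLog:
    """In-memory stand-in for an event store with a unique key constraint."""

    def __init__(self):
        self._debits = {}

    def apply(self, usage_record_id: str, ts: datetime, amount_minor: int) -> bool:
        """Record a debit once per window; return False for a replay."""
        key = debit_idempotency_key(usage_record_id, interval_window(ts))
        if key in self._debits:
            return False
        self._debits[key] = amount_minor
        return True

    def total(self) -> int:
        return sum(self._debits.values())
```

Replaying the same event, or running the billing loop twice within one window, leaves the total unchanged.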
4. User Balance State
Current behavior
- thresholds:
  - low: `balance <= 10`
  - depleted: `balance <= 0`
- `lowBalanceNotified` prevents repeated warning spam.
- depleted triggers force release of all active allocations.
Target
- explicit state field or derived state view: `healthy`, `low_balance`, `depleted`
- notification service/channel decoupled from terminal WS.
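
The derived state view could look like the sketch below. It assumes the current thresholds carry over unchanged; `balance_state` and `should_notify_low_balance` are hypothetical helper names. Because `balance <= 0` also satisfies `balance <= 10`, the depleted check has to run first.

```python
from decimal import Decimal

LOW_BALANCE_THRESHOLD = Decimal("10")  # from the current "low" threshold


def balance_state(balance: Decimal) -> str:
    """Classify a balance as healthy, low_balance, or depleted."""
    if balance <= 0:
        return "depleted"  # checked first: it also satisfies the low rule
    if balance <= LOW_BALANCE_THRESHOLD:
        return "low_balance"
    return "healthy"


def should_notify_low_balance(balance: Decimal, already_notified: bool) -> bool:
    """Warn once per low-balance episode, mirroring lowBalanceNotified."""
    return balance_state(balance) == "low_balance" and not already_notified
```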
5. Stripe Payment State
Target lifecycle (implemented via `payment_sessions` table)
- `initiated → checkout_completed → credited`
- `checkout_completed → failed_reconcile`
- `initiated → expired` (session TTL elapsed with no webhook)
| Status | Trigger |
|---|---|
| `initiated` | `POST /api/v1/payments/checkout-session`: Stripe session created, URL returned to user |
| `checkout_completed` | Stripe webhook `checkout.session.completed` received and verified |
| `credited` | Ledger credit posted transactionally with the webhook; `ledger_entry_id` set |
| `failed_reconcile` | Checkout completed but credit application failed after all retries |
| `expired` | Session TTL elapsed (default 24h) with no `checkout.session.completed` webhook |
Key properties
- `stripe_checkout_session_id` is the join key between the webhook payload and the session row.
- `idempotency_key` (from `X-Idempotency-Key` header) is stored; duplicate session creation requests for the same user + key return the existing session URL without a second Stripe call.
- `credited_amount_minor` is set from the webhook payload and must equal `requested_amount_minor`; a mismatch is flagged as `failed_reconcile` for ops investigation.
- Admin endpoint: `GET /api/v1/admin/payments/sessions` surfaces stuck sessions (`initiated` with no completion after >1h, or `failed_reconcile`) for support resolution.
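
The amount check and session join described above can be sketched as a pure function. This is an illustration under stated assumptions, not the webhook handler itself: `PaymentSession` and `apply_checkout_completed` are hypothetical names, the intermediate `checkout_completed` status is collapsed because the credit is posted transactionally with the webhook, and returning any non-`initiated` session unchanged (duplicate or late webhook delivery) is an assumption.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class PaymentSession:
    stripe_checkout_session_id: str
    requested_amount_minor: int
    status: str = "initiated"
    credited_amount_minor: Optional[int] = None


def apply_checkout_completed(
    session: PaymentSession, webhook_session_id: str, webhook_amount_minor: int
) -> PaymentSession:
    """Return the session row after a verified checkout.session.completed."""
    if webhook_session_id != session.stripe_checkout_session_id:
        raise ValueError("webhook does not belong to this session")
    if session.status != "initiated":
        # Assumption: a repeated or late webhook keeps the first outcome.
        return session
    if webhook_amount_minor != session.requested_amount_minor:
        # Amount mismatch is flagged for ops investigation.
        return replace(
            session,
            status="failed_reconcile",
            credited_amount_minor=webhook_amount_minor,
        )
    return replace(
        session, status="credited", credited_amount_minor=webhook_amount_minor
    )
```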
6. Terminal Session State
Canonical lifecycle
- `opening -> active -> closing -> closed`
- Failure side transitions:
  - `opening -> error`
  - `active -> error` (node stream/tunnel drop, upstream termination, policy timeout)
Rules
- Open handshake uses short-lived single-use token.
- Active session max lifetime is enforced by policy key `terminal.session_max_ttl_seconds` (default 4h).
- Single active terminal session per allocation.
- Reconnect is full reopen (new token + new open task + new session id); no resume.
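
A minimal in-memory sketch of three of these rules: one active session per allocation, the max-TTL policy, and a fresh session id on every reopen. `TerminalSessions` is a hypothetical class, timestamps are plain seconds supplied by the caller, and token issuance is left out.

```python
import uuid

# Default for the terminal.session_max_ttl_seconds policy key (4h).
SESSION_MAX_TTL_SECONDS = 4 * 60 * 60


class TerminalSessions:
    """Tracks at most one active terminal session per allocation."""

    def __init__(self, max_ttl_seconds: float = SESSION_MAX_TTL_SECONDS):
        self._max_ttl = max_ttl_seconds
        self._active = {}  # allocation_id -> (session_id, opened_at)

    def open(self, allocation_id: str, now: float) -> str:
        """Open a session; every open gets a new session id (no resume)."""
        self.expire(now)
        if allocation_id in self._active:
            raise RuntimeError("allocation already has an active terminal session")
        session_id = str(uuid.uuid4())
        self._active[allocation_id] = (session_id, now)
        return session_id

    def close(self, allocation_id: str) -> None:
        self._active.pop(allocation_id, None)

    def expire(self, now: float) -> list:
        """Drop sessions past the max TTL; return their allocation ids."""
        expired = [
            alloc_id
            for alloc_id, (_, opened_at) in self._active.items()
            if now - opened_at >= self._max_ttl
        ]
        for alloc_id in expired:
            del self._active[alloc_id]
        return expired
```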
Edge-case sequencing
- Allocation release during active terminal:
  - send deterministic close reason (`allocation_released`),
  - wait ack or close-timeout window,
  - continue release path (`allocation.revoke_user` then workflow completion).
- Node stream drop during active terminal:
  - close with retryable reason (`node_stream_dropped`), UI may reopen with full flow.
- User OIDC expiry during active terminal:
  - session remains valid until close/session TTL; auth rechecked on next reopen.
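
The release-during-active-terminal sequence (close reason, bounded wait for ack, then continue) could be sketched with `asyncio` as follows. The three callables and the `close_timeout` default are hypothetical stand-ins for the real transport and workflow steps; only the ordering is taken from the list above.

```python
import asyncio


async def close_terminal_then_release(
    send_close, wait_ack, continue_release, close_timeout: float = 5.0
) -> bool:
    """Close the terminal deterministically, then continue the release path.

    Returns True if the client acked the close, False if the
    close-timeout window elapsed first. Release continues either way.
    """
    await send_close("allocation_released")
    try:
        await asyncio.wait_for(wait_ack(), timeout=close_timeout)
        acked = True
    except asyncio.TimeoutError:
        acked = False
    # Stand-in for allocation.revoke_user and workflow completion.
    await continue_release()
    return acked
```

The point of the bounded wait is that an unresponsive terminal client can delay release by at most the close-timeout window.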
7. Storage Attachment State
Storage attachment is the runtime binding between a project-owned storage
namespace and one allocation/workload mount. The full workflow is defined in
doc/architecture/Storage_Attachment_Workflow_v1.md.
Canonical lifecycle:
- `requested -> prechecking -> grant_applying -> grant_applied -> mounting -> mounted`
- `mounted -> detaching -> detached`
Failure side transitions:
- `requested|prechecking -> failed`
- `grant_applying|grant_applied|mounting -> failed`
- `mounted|detaching -> detach_failed`
- `failed|detach_failed -> detaching -> detached`
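
One property worth checking on this graph is that every state has a path to `detached`, so an attachment can always be cleaned up. The sketch below encodes the transitions listed above and verifies that with a breadth-first search; `TRANSITIONS` and `can_reach` are hypothetical names, and the table is a transcription of this section rather than the workflow's actual source.

```python
from collections import deque

# Transitions transcribed from the canonical lifecycle and failure lists.
TRANSITIONS = {
    "requested": {"prechecking", "failed"},
    "prechecking": {"grant_applying", "failed"},
    "grant_applying": {"grant_applied", "failed"},
    "grant_applied": {"mounting", "failed"},
    "mounting": {"mounted", "failed"},
    "mounted": {"detaching", "detach_failed"},
    "detaching": {"detached", "detach_failed"},
    "failed": {"detaching"},
    "detach_failed": {"detaching"},
    "detached": set(),
}


def can_reach(start: str, goal: str) -> bool:
    """Breadth-first search over TRANSITIONS from start to goal."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            return True
        for nxt in TRANSITIONS[state] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return False
```

Running `all(can_reach(s, "detached") for s in TRANSITIONS)` over this table is a cheap regression check whenever a state is added.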
Rules:
- Attach/detach is owned by Temporal, not direct HTTP handler side effects.
- Persistent storage is never deleted by an attachment detach or allocation release path.
- `quota_bytes` lives on the storage namespace; attachment precheck validates whether the requested write mode is allowed under current quota posture.
- `multi_writer` is allowed only when both provider capability and product/app storage policy allow it.
- Node-agent performs local mount/unmount through typed tasks and reports result; API/Temporal never SSHes into nodes directly.