State Machines (Current + Target Clarification)

1. Allocation Lifecycle

Canonical lifecycle (contract + implementation target)

  • requested -> provisioning -> active -> releasing -> released
  • Failure side transitions:
      • provisioning -> failed
      • releasing -> release_failed (after max_deliver retries exhausted on provisioning.releasing.requested)

release_failed state behaviour

  • Billing: billing-worker closes the usage window when it receives provisioning.release_failed, so the user is not charged for a failed release.
  • Node: remains assigned to the allocation until an admin manually retries the release or removes the assignment. Affected allocations are listed by GET /api/v1/admin/allocations?status=release_failed.
  • Retry path: admin uses POST /api/v1/admin/allocations/{id}/force-release to trigger a new release attempt, transitioning the allocation back to releasing.
  • User retry: POST /api/v1/allocations/{id}/release on a release_failed allocation is also accepted and transitions back to releasing.
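
The allocation lifecycle above can be written down as a transition table. The following is a minimal sketch, not the service's implementation: the names AllocationState, ALLOWED, and transition are hypothetical, and treating released and failed as terminal is an assumption that this section does not state explicitly.

```python
from enum import Enum


class AllocationState(str, Enum):
    REQUESTED = "requested"
    PROVISIONING = "provisioning"
    ACTIVE = "active"
    RELEASING = "releasing"
    RELEASED = "released"
    FAILED = "failed"
    RELEASE_FAILED = "release_failed"


S = AllocationState

# Allowed transitions, copied from the lifecycle and retry paths above.
ALLOWED = {
    S.REQUESTED: {S.PROVISIONING},
    S.PROVISIONING: {S.ACTIVE, S.FAILED},
    S.ACTIVE: {S.RELEASING},
    S.RELEASING: {S.RELEASED, S.RELEASE_FAILED},
    # Admin force-release or user release both move back to releasing.
    S.RELEASE_FAILED: {S.RELEASING},
    # Assumption: no outgoing transitions from these two states.
    S.RELEASED: set(),
    S.FAILED: set(),
}


def transition(current: AllocationState, target: AllocationState) -> AllocationState:
    """Return the target state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(
            f"illegal allocation transition: {current.value} -> {target.value}"
        )
    return target
```

A handler for either retry endpoint would call transition(S.RELEASE_FAILED, S.RELEASING) before re-enqueueing the release request, so that a retry on an allocation in any other state is rejected rather than silently accepted.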

1a. Allocation Group Lifecycle

Allocation groups are aggregate parent resources over normal single-node allocations. They do not replace allocation state and they do not own billing or placement correctness.

Canonical aggregate lifecycle:

  • requested -> provisioning -> active -> releasing -> released
  • Failure/degraded side transitions:
      • requested|provisioning -> failed when no required member can become usable
      • active -> degraded when one required member fails while another member remains usable
      • releasing -> release_failed when one or more member releases exhaust retries

Rules:

  • Group status is derived from member allocation status plus group-level release intent.
  • Member allocations keep their own allocation lifecycle, connection target, and usage/billing windows.
  • Group release fans out to member release requests and remains idempotent under the group release idempotency key.
  • App runtimes may bind an app instance to an allocation_group_id, but app-specific topology and member semantics stay in the app-instance member contract.

Full model: doc/architecture/Allocation_Group_Model_v1.md.
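
Because group status is derived rather than stored independently, the derivation can be expressed as a pure function of member statuses and release intent. The sketch below is one possible reading of the rules above, not the contract in Allocation_Group_Model_v1.md. The function name derive_group_status is hypothetical, and several choices are assumptions: every member is treated as required, only active counts as usable, and the precedence between cases is a guess.

```python
def derive_group_status(member_statuses, release_requested):
    """Derive an aggregate status from member allocation statuses.

    Assumptions (not from the source): all members are required, and a
    member is "usable" only when its own status is "active".
    """
    members = list(member_statuses)
    if not members:
        raise ValueError("a group needs at least one member")

    if release_requested:
        # Any member release that exhausted its retries marks the group.
        if any(m == "release_failed" for m in members):
            return "release_failed"
        if all(m == "released" for m in members):
            return "released"
        return "releasing"

    usable = [m for m in members if m == "active"]
    failed = [m for m in members if m == "failed"]
    pending = [m for m in members if m in ("requested", "provisioning")]

    if usable and failed:
        return "degraded"  # one member failed, another is still usable
    if usable and not pending:
        return "active"
    if not usable and not pending:
        return "failed"  # no member can become usable any more
    if usable or "provisioning" in pending:
        return "provisioning"
    return "requested"
```

Keeping the derivation free of side effects means it can be recomputed whenever a member allocation changes state, without the group owning any billing or placement decisions.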

2. Node Assignment Lifecycle

Current behavior

  • Free when assignedAllocationId = null
  • In use when assignedAllocationId = allocation.id

Target

  • Derive occupancy from active allocation relation in DB.
  • Keep assignment pointer as cache only if needed.

3. Usage/Billing Lifecycle

Current behavior

  • Usage starts with allocation creation: startTime, endTime=null
  • Billing loop every minute updates lastBilledAt and cost
  • Usage closes on allocation release (endTime set)

Target

  • Event-sourced debit windows with idempotency key per (usage_record, interval_window).
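
To illustrate what an idempotency key per (usage_record, interval_window) buys, here is a minimal in-memory sketch. Everything in it is hypothetical: the helper names interval_windows, debit_key, and DebitLog, the one-minute window size (borrowed from the current billing loop), and the SHA-256 key format. A real implementation would persist the keys, for example behind a unique constraint.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def interval_windows(start, end, step=timedelta(minutes=1)):
    """Yield (window_start, window_end) pairs covering [start, end)."""
    cursor = start
    while cursor < end:
        nxt = min(cursor + step, end)
        yield cursor, nxt
        cursor = nxt


def debit_key(usage_record_id, window_start, window_end):
    """Deterministic key: the same record and window always hash the same."""
    raw = f"{usage_record_id}|{window_start.isoformat()}|{window_end.isoformat()}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class DebitLog:
    """In-memory stand-in for a debit event store keyed by idempotency key."""

    def __init__(self):
        self._debits = {}

    def apply(self, usage_record_id, window_start, window_end, amount_minor):
        """Record a debit once; return False if this window was already debited."""
        key = debit_key(usage_record_id, window_start, window_end)
        if key in self._debits:
            return False
        self._debits[key] = amount_minor
        return True

    def total(self):
        return sum(self._debits.values())


if __name__ == "__main__":
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    log = DebitLog()
    for _ in range(2):  # the second pass simulates a redelivered batch
        for w_start, w_end in interval_windows(start, start + timedelta(minutes=3)):
            log.apply("usage-1", w_start, w_end, amount_minor=5)
    print(log.total())
```

Replaying the same windows leaves the total unchanged, which is the property that makes retries and redeliveries safe for billing.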

4. User Balance State

Current behavior

  • thresholds:
      • low: balance <= 10
      • depleted: balance <= 0
  • lowBalanceNotified prevents repeated warning spam.
  • depleted triggers force release of all active allocations.

Target

  • explicit state field or derived state view:
      • healthy, low_balance, depleted
  • notification service/channel decoupled from terminal WS.
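
A derived state view could be as small as the function below. This is a sketch under assumptions: the thresholds are the current ones listed above, the balance unit is whatever the ledger already uses, and the names balance_state and should_notify_low are hypothetical. The second function mirrors the role of lowBalanceNotified by firing only when the state first becomes low_balance.

```python
from decimal import Decimal

LOW_THRESHOLD = Decimal("10")      # low: balance <= 10
DEPLETED_THRESHOLD = Decimal("0")  # depleted: balance <= 0


def balance_state(balance: Decimal) -> str:
    """Map a balance to healthy, low_balance, or depleted."""
    # Check depleted first, since a depleted balance also satisfies "low".
    if balance <= DEPLETED_THRESHOLD:
        return "depleted"
    if balance <= LOW_THRESHOLD:
        return "low_balance"
    return "healthy"


def should_notify_low(previous_state: str, new_state: str) -> bool:
    """Warn only on entry into low_balance, not on every billing tick."""
    return new_state == "low_balance" and previous_state != "low_balance"
```

Deriving the state from the balance on read avoids a second field that can drift out of sync with the ledger; the trade-off is that the notifier must remember the previous state itself.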

5. Stripe Payment State

Target lifecycle (implemented via payment_sessions table)

initiated → checkout_completed → credited
                                ↘ failed_reconcile
         ↘ expired  (session TTL elapsed with no webhook)

| Status | Trigger |
|---|---|
| initiated | POST /api/v1/payments/checkout-session: Stripe session created, URL returned to user |
| checkout_completed | Stripe webhook checkout.session.completed received and verified |
| credited | Ledger credit posted transactionally with the webhook; ledger_entry_id set |
| failed_reconcile | Checkout completed but credit application failed after all retries |
| expired | Session TTL elapsed (default 24h) with no checkout.session.completed webhook |

Key properties

  • stripe_checkout_session_id is the join key between the webhook payload and the session row.
  • idempotency_key (from X-Idempotency-Key header) is stored; duplicate session creation requests for the same user + key return the existing session URL without a second Stripe call.
  • credited_amount_minor is set from the webhook payload and must equal requested_amount_minor; a mismatch is flagged as failed_reconcile for ops investigation.
  • Admin endpoint: GET /api/v1/admin/payments/sessions surfaces stuck sessions (initiated with no completion after >1h, or failed_reconcile) for support resolution.
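
The status diagram and the key properties above can be combined into a small sketch. It is illustrative only: PAYMENT_TRANSITIONS, status_after_checkout, and is_stuck are hypothetical names, the created_at argument is an assumed column, and modelling failed_reconcile as having no automatic outgoing transition is an assumption (ops resolution is outside this sketch).

```python
from datetime import datetime, timedelta, timezone

# Transitions read off the lifecycle diagram above.
PAYMENT_TRANSITIONS = {
    "initiated": {"checkout_completed", "expired"},
    "checkout_completed": {"credited", "failed_reconcile"},
    "credited": set(),
    "failed_reconcile": set(),  # assumption: resolved manually by ops
    "expired": set(),
}


def status_after_checkout(requested_amount_minor: int, credited_amount_minor: int) -> str:
    """Amounts must match exactly; a mismatch is flagged for investigation."""
    if credited_amount_minor == requested_amount_minor:
        return "credited"
    return "failed_reconcile"


def is_stuck(status: str, created_at: datetime, now: datetime,
             threshold: timedelta = timedelta(hours=1)) -> bool:
    """Sessions the admin endpoint would surface for support resolution."""
    if status == "failed_reconcile":
        return True
    return status == "initiated" and now - created_at > threshold
```

Comparing integer minor units avoids floating-point rounding when checking that the credited amount equals the requested amount.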

6. Terminal Session State

Canonical lifecycle

  • opening -> active -> closing -> closed
  • Failure side transitions:
      • opening -> error
      • active -> error (node stream/tunnel drop, upstream termination, policy timeout)

Rules

  • Open handshake uses short-lived single-use token.
  • Active session max lifetime is enforced by policy key terminal.session_max_ttl_seconds (default 4h).
  • Single active terminal session per allocation.
  • Reconnect is full reopen (new token + new open task + new session id); no resume.

Edge-case sequencing

  • Allocation release during active terminal:
      • send deterministic close reason (allocation_released),
      • wait for ack or the close-timeout window,
      • continue release path (allocation.revoke_user, then workflow completion).
  • Node stream drop during active terminal:
      • close with retryable reason (node_stream_dropped); the UI may reopen with the full flow.
  • User OIDC expiry during active terminal:
      • session remains valid until close/session TTL; auth is rechecked on next reopen.
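
The terminal session rules can be sketched as a small registry that enforces the transition table, the single-session-per-allocation rule, and the "reconnect is a full reopen" rule. The class TerminalSessions and its methods are hypothetical, tokens and timeouts are omitted, and counting opening and closing sessions against the single-session limit is an assumption.

```python
import uuid

TERMINAL_TRANSITIONS = {
    "opening": {"active", "error"},
    "active": {"closing", "error"},
    "closing": {"closed"},
    "closed": set(),
    "error": set(),
}


class TerminalSessions:
    """Tracks at most one live terminal session per allocation."""

    def __init__(self):
        self._live = {}  # allocation_id -> session dict

    def open(self, allocation_id):
        if allocation_id in self._live:
            raise RuntimeError("allocation already has a live terminal session")
        # Every open gets a fresh session id; there is no resume.
        session = {"id": str(uuid.uuid4()), "state": "opening", "close_reason": None}
        self._live[allocation_id] = session
        return session

    def move(self, allocation_id, target, reason=None):
        session = self._live[allocation_id]
        if target not in TERMINAL_TRANSITIONS[session["state"]]:
            raise ValueError(f"illegal transition: {session['state']} -> {target}")
        session["state"] = target
        if reason is not None:
            session["close_reason"] = reason
        if target in ("closed", "error"):
            # Terminal states free the slot so the UI can reopen.
            del self._live[allocation_id]
        return session
```

For the allocation-release case, a caller would move the session to closing with reason allocation_released, wait for the ack or timeout, move it to closed, and only then continue the release path.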

7. Storage Attachment State

Storage attachment is the runtime binding between a project-owned storage namespace and one allocation/workload mount. The full workflow is defined in doc/architecture/Storage_Attachment_Workflow_v1.md.

Canonical lifecycle:

requested -> prechecking -> grant_applying -> grant_applied -> mounting -> mounted
mounted -> detaching -> detached

Failure side transitions:

  • requested|prechecking -> failed
  • grant_applying|grant_applied|mounting -> failed
  • mounted|detaching -> detach_failed
  • failed|detach_failed -> detaching -> detached
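
As a cross-check of the lifecycle and failure transitions above, the table below encodes them directly and verifies that every state has a path to detached. It is a sketch for reasoning about the state machine, not the Temporal workflow; ATTACHMENT_TRANSITIONS and reachable are hypothetical names.

```python
from collections import deque

ATTACHMENT_TRANSITIONS = {
    "requested": {"prechecking", "failed"},
    "prechecking": {"grant_applying", "failed"},
    "grant_applying": {"grant_applied", "failed"},
    "grant_applied": {"mounting", "failed"},
    "mounting": {"mounted", "failed"},
    "mounted": {"detaching", "detach_failed"},
    "detaching": {"detached", "detach_failed"},
    "failed": {"detaching"},
    "detach_failed": {"detaching"},
    "detached": set(),
}


def reachable(start):
    """Breadth-first search over the transition table from a start state."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in ATTACHMENT_TRANSITIONS[state]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


if __name__ == "__main__":
    # Every state, including both failure states, can still reach detached.
    assert all("detached" in reachable(state) for state in ATTACHMENT_TRANSITIONS)
```

The check reflects the cleanup path in the last failure transition: neither failed nor detach_failed is a dead end, since both can re-enter detaching.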

Rules:

  • Attach/detach is owned by Temporal, not direct HTTP handler side effects.
  • Persistent storage is never deleted by an attachment detach or allocation release path.
  • quota_bytes lives on the storage namespace; attachment precheck validates whether the requested write mode is allowed under current quota posture.
  • multi_writer is allowed only when both provider capability and product/app storage policy allow it.
  • Node-agent performs local mount/unmount through typed tasks and reports result; API/Temporal never SSHes into nodes directly.