Allocation Restart Model v1

Purpose:

- Define self-service restart for active allocations.
- Keep restart distinct from release.
- Establish the lifecycle, UX, and control-plane contract before implementation.

Inputs:

- doc/product/Allocation_Experience_Gaps_v1.md
- doc/architecture/App_Runtime_Instance_Lifecycle_v1.md
- doc/architecture/Node_Agent_Spec.md
- doc/api/openapi.draft.yaml

Related:

- doc/product/Allocation_Storage_Model_v1.md
- doc/product/Managed_Runtime_Bundles_v1.md


1. Executive Summary

Users should be able to restart an active allocation without releasing it.

This is a normal compute operation, not an admin-only escape hatch.

Restart should:

- preserve allocation identity
- interrupt running processes and sessions intentionally
- reboot the underlying machine
- return the allocation to active when the node comes back healthy
- preserve the allocation’s SSH access and attached managed state

Restart is not:

- release
- rebuild
- reprovision
- reimage

Those remain separate operations.


2. Problem Statement

GPU users regularly need reboot semantics for cases such as:

- hung driver state
- unstable networking
- runtime cleanup after failed experiments
- collaborator handoff
- post-install reboot requirements

Without restart, users are forced into one of two bad choices:

- release and reacquire the allocation
- ask support/admin to intervene

That is not acceptable for a self-service compute product.


3. Product Boundary

3.1 Restart vs release

Restart:

- keeps the same allocation object
- keeps the same allocation owner and project association
- keeps the same node binding if restart succeeds
- is meant for operational recovery

Release:

- ends the allocation
- frees the node
- ends user access

3.2 Restart vs rebuild/reimage

Restart:

- reboots the existing machine
- preserves local machine state unless the OS itself changes it

Reimage:

- wipes and reprovisions the node
- should remain a different control-plane capability


4. First-Version Decision

The first restart slice should be:

- supported only for active allocations
- user-invokable by the allocation owner
- also invokable by project admins and platform admins if role policy allows
- implemented as an asynchronous lifecycle action

The first version should not support restart for allocations in these states:

- requested
- provisioning
- releasing
- released
- failed
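The eligibility rule above can be sketched as a single predicate. This is only an illustration: the role names (`owner`, `project_admin`, `platform_admin`) and the `can_restart` helper are hypothetical, and the real role policy is not defined in this document.

```python
from enum import Enum


class AllocationStatus(Enum):
    REQUESTED = "requested"
    PROVISIONING = "provisioning"
    ACTIVE = "active"
    RELEASING = "releasing"
    RELEASED = "released"
    FAILED = "failed"


# Hypothetical role names; whether admins are included depends on role policy.
ROLES_ALLOWED_TO_RESTART = {"owner", "project_admin", "platform_admin"}


def can_restart(status: AllocationStatus, actor_role: str) -> bool:
    """Return True only for active allocations and permitted roles."""
    return status is AllocationStatus.ACTIVE and actor_role in ROLES_ALLOWED_TO_RESTART
```

Keeping the check in one place means the API handler and the UI can agree on when the Restart action is offered.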


5. Lifecycle Model

Recommended allocation lifecycle extension:

active -> restarting -> active
                    -> restart_failed

Notes:

- restarting is a real transient state, not just a UI hint
- restart_failed should be explicit so the user knows the allocation still exists but requires attention
- a retry path should exist from restart_failed

This mirrors the current release-failure model:

- failures are visible
- users are not left guessing whether the action completed
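One way to make the lifecycle extension concrete is a small transition table. The sketch below covers only the restart-related states. The retry edge (restart_failed -> restarting) is one possible reading of "a retry path should exist", and the `transition` helper is hypothetical. Release transitions are left out.

```python
# Allowed transitions for the restart extension only.
# The restart_failed -> restarting edge is an assumed shape for the retry path.
RESTART_TRANSITIONS = {
    "active": {"restarting"},
    "restarting": {"active", "restart_failed"},
    "restart_failed": {"restarting"},
}


def transition(current: str, target: str) -> str:
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in RESTART_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```

For example, `transition("active", "restarting")` returns `"restarting"`, while `transition("active", "restart_failed")` raises, because a failure can only be reached through the transient state.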


6. User Experience

6.1 Where restart appears

Restart should appear on allocation detail near:

- Metrics
- Release

It is a primary compute action, not a hidden submenu item.

6.2 Confirmation copy

The confirmation should say clearly:

- SSH and browser terminal sessions will disconnect
- running processes/jobs on the machine will be interrupted
- attached persistent storage remains attached after the machine returns
- the allocation itself will remain owned by the user

6.3 During restart

The UI should show:

- allocation status restarting
- recent activity
- last progress timestamp

The terminal surface should switch from interactive to a waiting/reconnecting state.

6.4 After restart

When successful:

- allocation returns to active
- SSH/terminal access becomes available again
- existing attached storage and managed runtime state should still be visible

When failed:

- allocation moves to restart_failed
- user sees failure details and retry guidance


7. Control-Plane Semantics

7.1 Contract shape

Recommended API:

- POST /api/v1/allocations/{allocation_id}/restart

Behavior:

- idempotent mutation: repeating the request while a restart is already in progress should not schedule a second reboot
- writes audit log
- schedules async reboot task through the normal control-plane path
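A minimal sketch of that behavior, assuming an in-memory stand-in for the allocation store, audit log, and task queue. The `ControlPlane` class, its fields, and the task shape are hypothetical. Retry from restart_failed is omitted for brevity, and the choice to write the audit entry only once is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ControlPlane:
    """In-memory stand-in for the allocation store, audit log, and task queue."""
    status: dict = field(default_factory=dict)     # allocation_id -> status
    audit_log: list = field(default_factory=list)  # (actor, action, allocation_id)
    tasks: list = field(default_factory=list)      # typed runtime work items

    def request_restart(self, allocation_id: str, actor: str) -> str:
        current = self.status.get(allocation_id)
        if current == "restarting":
            # Idempotent: a repeat request does not schedule a second reboot.
            return "restarting"
        if current != "active":
            raise ValueError(f"cannot restart allocation in state {current!r}")
        self.status[allocation_id] = "restarting"  # record intent
        self.audit_log.append((actor, "restart", allocation_id))
        self.tasks.append({"type": "reboot", "allocation_id": allocation_id})
        return "restarting"
```

The handler only records intent and enqueues work. It never talks to the node directly, so the reboot itself stays on the asynchronous path.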

7.2 Backend path

Restart should not be implemented as a direct handler-to-node-agent call.

It should follow the same principles as the rest of the platform:

- API mutation records intent
- control plane schedules typed runtime work
- node agent performs bounded reboot action
- node reconnect/health path returns the allocation to active

7.3 Node-agent contract

The node agent should expose a bounded restart/reboot primitive, not arbitrary shell semantics.

That primitive should:

- acknowledge accepted action
- initiate reboot
- allow the control plane to infer success from post-reboot re-registration / heartbeat recovery
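As an illustration of inferring success from heartbeat recovery, the sketch below treats a reboot as confirmed once a heartbeat arrives after the acknowledgement and carries a different boot identifier. The `Heartbeat` shape and the `boot_id` field are assumptions made for this example. They are not taken from the node-agent spec.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Heartbeat:
    node_id: str
    boot_id: str        # hypothetical: an identifier that changes on every boot
    received_at: float  # seconds, control-plane clock


def reboot_confirmed(pre_reboot_boot_id: str, accepted_at: float,
                     heartbeats: Iterable[Heartbeat]) -> bool:
    """True once a heartbeat arrives after the ack with a different boot_id."""
    return any(
        hb.received_at > accepted_at and hb.boot_id != pre_reboot_boot_id
        for hb in heartbeats
    )
```

Checking for a changed boot identifier, rather than just any heartbeat, avoids mistaking a node that never rebooted for a successful restart.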


8. Persistence and Recovery

8.1 SSH access

Restart must preserve:

- attached SSH key state
- any project/allocation access grants

The user should not have to reconfigure SSH access after reboot.

8.2 Storage

Restart should preserve:

- attached persistent storage objects
- mount intent

The platform may need a post-reboot mount reconcile step, but that should not be exposed as a separate user concern.
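A post-reboot mount reconcile could be as simple as comparing recorded mount intent with what the node reports as mounted. The sketch below assumes intent is stored as a mapping from volume ID to mount path. The function name and the action tuple format are hypothetical.

```python
def plan_mount_reconcile(intended: dict[str, str],
                         actual: dict[str, str]) -> list[tuple[str, str, str]]:
    """Return a mount action for each intended volume that is not mounted at its path.

    Both arguments map volume_id -> mount path. Volumes that are mounted
    but not intended are ignored in this sketch.
    """
    actions = []
    for volume_id, path in sorted(intended.items()):
        if actual.get(volume_id) != path:
            actions.append(("mount", volume_id, path))
    return actions
```

An empty result means the node already matches intent, so the reconcile step is safe to run after every restart.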

8.3 Managed runtimes

Managed runtime bundle state should remain associated with the allocation after restart.

If the runtime requires a post-boot activation or service start, the managed runtime reconcile path should handle it, not the user manually.


9. Failure Model

Potential restart failure reasons:

- reboot task could not be dispatched
- node did not return within timeout
- node returned but allocation connectivity/health did not recover

The first user-visible failure state should be restart_failed.

Required follow-up UX:

- show failure reason if known
- allow retry
- keep Release available
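The three failure reasons can be mapped to a state and a reason code with a small classifier. This is a sketch under assumptions: the reason strings and the `classify_restart` helper are hypothetical, and the timeout is a parameter because its value is still an open question.

```python
from typing import Optional


def classify_restart(dispatched: bool, started_at: float, now: float,
                     timeout_s: float, node_returned: bool,
                     healthy: bool) -> tuple[str, Optional[str]]:
    """Return (allocation_state, failure_reason) for an in-flight restart."""
    if not dispatched:
        return ("restart_failed", "dispatch_failed")
    if node_returned and healthy:
        return ("active", None)
    if now - started_at < timeout_s:
        return ("restarting", None)  # still inside the allowed window
    if not node_returned:
        return ("restart_failed", "node_timeout")
    return ("restart_failed", "health_not_recovered")
```

Returning a reason code alongside the state gives the UI something concrete to show when the reason is known.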


10. Out of Scope for First Slice

Not part of the first restart slice:

- schedule reboot for later
- soft reboot vs hard power cycle user choice
- admin-only force power actions
- restart many allocations in one batch
- node replacement on restart failure

Those can be added later if needed.


11. Open Questions

  • Should restart be owner-only, or also available to any project admin?
  • Should restart_failed allow only retry and release, or also some future repair action?
  • Should restart preserve the same advertised hostname/IP in every deployment mode, or is that only best-effort?
  • What timeout is acceptable before the platform marks restart as failed?

12. Decision Summary

Restart should be a normal self-service allocation capability.

The first version should:

- add the restarting and restart_failed states
- expose POST /api/v1/allocations/{allocation_id}/restart
- use the existing async control-plane/node-agent model
- preserve allocation identity, storage attachment intent, SSH access state, and managed runtime association

That is the minimum correct product model for reboot without release.