Allocation Restart Model v1
Purpose:
- Define self-service restart for active allocations.
- Keep restart distinct from release.
- Establish the lifecycle, UX, and control-plane contract before implementation.
Inputs:
- doc/product/Allocation_Experience_Gaps_v1.md
- doc/architecture/App_Runtime_Instance_Lifecycle_v1.md
- doc/architecture/Node_Agent_Spec.md
- doc/api/openapi.draft.yaml
Related:
- doc/product/Allocation_Storage_Model_v1.md
- doc/product/Managed_Runtime_Bundles_v1.md
1. Executive Summary
Users should be able to restart an active allocation without releasing it.
This is a normal compute operation, not an admin-only escape hatch.
Restart should:
- preserve allocation identity
- interrupt running processes and sessions intentionally
- reboot the underlying machine
- return the allocation to active when the node comes back healthy
- preserve the allocation’s SSH access and attached managed state
Restart is not:
- release
- rebuild
- reprovision
- reimage
Those remain separate operations.
2. Problem Statement
GPU users regularly need reboot semantics for cases such as:
- hung driver state
- unstable networking
- runtime cleanup after failed experiments
- collaborator handoff
- post-install reboot requirements
Without restart, users are forced into one of two bad choices:
- release and reacquire the allocation
- ask support/admin to intervene
That is not acceptable for a self-service compute product.
3. Product Boundary
3.1 Restart vs release
Restart:
- keeps the same allocation object
- keeps the same allocation owner and project association
- keeps the same node binding if restart succeeds
- is meant for operational recovery
Release:
- ends the allocation
- frees the node
- ends user access
3.2 Restart vs rebuild/reimage
Restart:
- reboots the existing machine
- preserves local machine state unless the OS itself changes it
Reimage:
- wipes and reprovisions the node
- should remain a different control-plane capability
4. First-Version Decision
The first restart slice should be:
- supported only for active allocations
- user-invokable by the allocation owner
- also invokable by project admins and platform admins if role policy allows
- implemented as an asynchronous lifecycle action
The first version should not support restart for:
- requested
- provisioning
- releasing
- released
- failed
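The eligibility rules above can be sketched as a single check. This is a minimal illustration, not the platform's authorization code: the `Allocation` shape, the `can_restart` function, and the role names `project_admin` and `platform_admin` are hypothetical, and the real role policy is not defined in this document.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

# Only "active" is restartable in the first slice; every other status is rejected.
RESTARTABLE_STATUSES = frozenset({"active"})

# Hypothetical role names; stands in for "if role policy allows".
ROLES_ALLOWED_BY_POLICY = frozenset({"project_admin", "platform_admin"})


@dataclass(frozen=True)
class Allocation:
    allocation_id: str
    owner_id: str
    status: str


def can_restart(
    allocation: Allocation, caller_id: str, caller_roles: FrozenSet[str]
) -> Tuple[bool, str]:
    """Return (allowed, reason) for a restart request."""
    if allocation.status not in RESTARTABLE_STATUSES:
        return False, f"restart is not supported in status '{allocation.status}'"
    if caller_id == allocation.owner_id:
        return True, "caller owns the allocation"
    if caller_roles & ROLES_ALLOWED_BY_POLICY:
        return True, "caller role is allowed by policy"
    return False, "caller is neither the owner nor an allowed admin"
```

Checking status before identity means a non-restartable allocation gets the same answer for every caller, which keeps the rejection reason easy to surface in the UI.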
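5. Lifecycle Model
Recommended allocation lifecycle extension: add two states, restarting and restart_failed. An active allocation moves to restarting when a restart is requested, then returns to active on success or moves to restart_failed on failure.
Notes: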
- restarting is a real transient state, not just a UI hint
- restart_failed should be explicit so the user knows the allocation still exists but requires attention
- a retry path should exist from restart_failed
This mirrors the current release-failure model:
- failures are visible
- users are not left guessing whether the action completed
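One way to make restarting a real state rather than a UI hint is an explicit transition table. The sketch below covers only the restart-related transitions and is an assumption drawn from this document: the `Status` enum and `transition` helper are hypothetical names, and the restart_failed to releasing edge is inferred from the requirement to keep Release available after a failed restart.

```python
from enum import Enum


class Status(str, Enum):
    ACTIVE = "active"
    RESTARTING = "restarting"
    RESTART_FAILED = "restart_failed"
    RELEASING = "releasing"


# Restart-related transitions only; the rest of the allocation lifecycle is omitted.
ALLOWED_TRANSITIONS = {
    Status.ACTIVE: {Status.RESTARTING},
    Status.RESTARTING: {Status.ACTIVE, Status.RESTART_FAILED},
    # Retry goes back to restarting; release stays available (assumed edge).
    Status.RESTART_FAILED: {Status.RESTARTING, Status.RELEASING},
}


def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Rejecting unknown moves at this layer keeps an allocation from jumping straight from active to restart_failed without ever being visibly restarting.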
6. User Experience
6.1 Where restart appears
Restart should appear on allocation detail near:
- Metrics
- Release
It is a primary compute action, not a hidden submenu item.
6.2 Confirmation copy
The confirmation should say clearly:
- SSH and browser terminal sessions will disconnect
- running processes/jobs on the machine will be interrupted
- attached persistent storage remains attached after the machine returns
- the allocation itself will remain owned by the user
6.3 During restart
The UI should show:
- allocation status restarting
- recent activity
- last progress timestamp
The terminal surface should switch from interactive to a waiting/reconnecting state.
6.4 After restart
When successful:
- allocation returns to active
- SSH/terminal access becomes available again
- existing attached storage and managed runtime state should still be visible
When failed:
- allocation moves to restart_failed
- user sees failure details and retry guidance
7. Control-Plane Semantics
7.1 Contract shape
Recommended API:
- POST /api/v1/allocations/{allocation_id}/restart
Behavior:
- idempotent mutation
- writes audit log
- schedules async reboot task through the normal control-plane path
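The three behaviors above can be illustrated with an in-memory sketch. Everything here is hypothetical: `ControlPlane`, `request_restart`, the `reboot_node` task type, the audit record shape, and the 202/404/409 status codes are assumptions for illustration, not the contract in openapi.draft.yaml. Retry from restart_failed is left out because this document does not say whether it reuses the same endpoint.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ControlPlane:
    """In-memory stand-in for the allocation store, audit log, and task queue."""
    statuses: Dict[str, str]
    audit_log: List[dict] = field(default_factory=list)
    task_queue: List[dict] = field(default_factory=list)


def request_restart(cp: ControlPlane, allocation_id: str, actor_id: str) -> dict:
    """Record restart intent and schedule the reboot; never call the node directly."""
    status = cp.statuses.get(allocation_id)
    if status is None:
        return {"http_status": 404, "error": "allocation not found"}
    if status == "restarting":
        # Idempotent replay: a restart is already in flight, so schedule nothing new.
        return {"http_status": 202, "status": "restarting"}
    if status != "active":
        return {"http_status": 409, "error": f"cannot restart from '{status}'"}

    cp.statuses[allocation_id] = "restarting"
    cp.audit_log.append(
        {"action": "allocation.restart", "allocation_id": allocation_id, "actor_id": actor_id}
    )
    cp.task_queue.append({"type": "reboot_node", "allocation_id": allocation_id})
    return {"http_status": 202, "status": "restarting"}
```

Calling `request_restart` twice for the same active allocation returns the same response both times and enqueues exactly one reboot task, which is the property the idempotency requirement is after.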
7.2 Backend path
Restart should not be implemented as:
- direct handler-to-node-agent call
It should follow the same principles as the rest of the platform:
- API mutation records intent
- control plane schedules typed runtime work
- node agent performs bounded reboot action
- node reconnect/health path returns the allocation to active
7.3 Node-agent contract
The node agent should expose a bounded restart/reboot primitive, not arbitrary shell semantics.
That primitive should:
- acknowledge accepted action
- initiate reboot
- allow the control plane to infer success from post-reboot re-registration / heartbeat recovery
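A bounded primitive can be sketched as a handler that accepts only a closed set of typed actions. The names below (`ActionRequest`, `ActionAck`, `handle_action`, and the `reboot` action string) are hypothetical and not taken from Node_Agent_Spec.md. The reboot itself is injected as a callable so the sketch stays testable; it is assumed to schedule the reboot and return immediately so the acknowledgement can still be delivered.

```python
from dataclasses import dataclass
from typing import Callable

# Closed set of actions; there is no way to pass a shell command through this path.
SUPPORTED_ACTIONS = frozenset({"reboot"})


@dataclass(frozen=True)
class ActionRequest:
    action_id: str
    action: str


@dataclass(frozen=True)
class ActionAck:
    action_id: str
    accepted: bool
    detail: str


def handle_action(request: ActionRequest, initiate_reboot: Callable[[], None]) -> ActionAck:
    """Validate the typed action, start the reboot, and return an acknowledgement."""
    if request.action not in SUPPORTED_ACTIONS:
        return ActionAck(request.action_id, False, f"unsupported action '{request.action}'")
    # Assumed to be non-blocking: it schedules the reboot rather than waiting for it.
    initiate_reboot()
    return ActionAck(request.action_id, True, "reboot initiated")
```

The acknowledgement only says the action was accepted. Success is still inferred by the control plane from re-registration and heartbeat recovery, not from this return value.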
8. Persistence and Recovery
8.1 SSH access
Restart must preserve:
- attached SSH key state
- any project/allocation access grants
The user should not have to reconfigure SSH access after reboot.
8.2 Storage
Restart should preserve:
- attached persistent storage objects
- mount intent
The platform may need a post-reboot mount reconcile step, but that should not be exposed as a separate user concern.
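If a post-reboot reconcile step is needed, it could compare recorded mount intent against what the node reports after boot. The sketch below is an assumption about how that comparison might look: `plan_mount_reconcile` and the two dictionaries mapping a storage ID to a mount path are hypothetical, and the storage model itself is defined elsewhere.

```python
from typing import Dict, List


def plan_mount_reconcile(intended: Dict[str, str], actual: Dict[str, str]) -> List[dict]:
    """Return the mounts that must be (re)applied so the node matches recorded intent.

    Both arguments map a storage ID to a mount path. A mount is planned when the
    storage is missing on the node or is mounted at a different path.
    """
    return [
        {"storage_id": storage_id, "mount_path": path}
        for storage_id, path in sorted(intended.items())
        if actual.get(storage_id) != path
    ]
```

An empty plan means the node already matches intent, so the reconcile step can be a no-op that the user never sees.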
8.3 Managed runtimes
Managed runtime bundle state should remain associated with the allocation after restart.
If the runtime requires a post-boot activation or service start, the managed runtime reconcile path should handle it, not the user.
9. Failure Model
Potential restart failure reasons:
- reboot task could not be dispatched
- node did not return within timeout
- node returned but allocation connectivity/health did not recover
The first user-visible failure state should be:
- restart_failed
Required follow-up UX:
- show failure reason if known
- allow retry
- keep Release available
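The three failure reasons can be mapped to a status and a user-visible reason with a small evaluation function. This is a sketch under assumptions: `RestartObservation` and `evaluate_restart` are hypothetical names, and the timeout is passed in as a parameter because its value is still an open question.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RestartObservation:
    dispatched: bool               # reboot task reached the node agent
    seconds_since_dispatch: float  # time elapsed since the reboot was dispatched
    node_reconnected: bool         # node re-registered / heartbeat resumed
    allocation_healthy: bool       # allocation connectivity and health checks pass


def evaluate_restart(
    obs: RestartObservation, timeout_seconds: float
) -> Tuple[str, Optional[str]]:
    """Return (allocation_status, failure_reason); the reason is None unless failed."""
    if not obs.dispatched:
        return "restart_failed", "reboot task could not be dispatched"
    if obs.node_reconnected and obs.allocation_healthy:
        return "active", None
    if obs.seconds_since_dispatch < timeout_seconds:
        # Still inside the window: keep waiting in the transient state.
        return "restarting", None
    if not obs.node_reconnected:
        return "restart_failed", "node did not return within timeout"
    return "restart_failed", "node returned but allocation connectivity/health did not recover"
```

Returning the reason alongside the status gives the UI what it needs to show the failure reason if known, while retry and Release stay available from restart_failed.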
10. Out of Scope for First Slice
Not part of the first restart slice:
- schedule reboot for later
- soft reboot vs hard power cycle user choice
- admin-only force power actions
- restart many allocations in one batch
- node replacement on restart failure
Those can be added later if needed.
11. Open Questions
- Should restart be owner-only, or also available to any project admin?
- Should restart_failed allow only retry and release, or also some future repair action?
- Should restart preserve the same advertised hostname/IP in every deployment mode, or is that only best-effort?
- What timeout is acceptable before the platform marks restart as failed?
12. Decision Summary
Restart should be a normal self-service allocation capability.
The first version should:
- add restarting and restart_failed
- expose POST /api/v1/allocations/{allocation_id}/restart
- use the existing async control-plane/node-agent model
- preserve allocation identity, storage attachment intent, SSH access state, and managed runtime association
That is the minimum correct product model for reboot without release.