Allocation Restart Model v1

Purpose:

- Define self-service restart for active allocations.
- Keep restart distinct from release.
- Establish the lifecycle, UX, and control-plane contract before implementation.

Inputs:

- doc/product/Allocation_Experience_Gaps_v1.md
- doc/architecture/App_Runtime_Instance_Lifecycle_v1.md
- doc/architecture/Node_Agent_Spec.md
- doc/api/openapi.draft.yaml

Related:

- doc/product/Allocation_Storage_Model_v1.md
- doc/product/Managed_Runtime_Bundles_v1.md

1. Executive Summary

Users should be able to restart an active allocation without releasing it.

This is a normal compute operation, not an admin-only escape hatch.

Restart should:

- preserve allocation identity
- interrupt running processes and sessions intentionally
- reboot the underlying machine
- return the allocation to active when the node comes back healthy
- preserve the allocation’s SSH access and attached managed state

Restart is not:

- release
- rebuild
- reprovision
- reimage

Those remain separate operations.

2. Problem Statement

GPU users regularly need reboot semantics for cases such as:

- hung driver state
- unstable networking
- runtime cleanup after failed experiments
- collaborator handoff
- post-install reboot requirements

Without restart, users are forced into one of two bad choices:

- release and reacquire the allocation
- ask support/admin to intervene

That is not acceptable for a self-service compute product.

3. Product Boundary

3.1 Restart vs release

Restart:

- keeps the same allocation object
- keeps the same allocation owner and project association
- keeps the same node binding if restart succeeds
- is meant for operational recovery

Release:

- ends the allocation
- frees the node
- ends user access

3.2 Restart vs rebuild/reimage

Restart:

- reboots the existing machine
- preserves local machine state unless the OS itself changes it

Reimage:

- wipes and reprovisions the node
- should remain a different control-plane capability

4. First-Version Decision

The first restart slice should be:

- supported only for active allocations
- user-invokable by the allocation owner
- also invokable by project admins and platform admins if role policy allows
- implemented as an asynchronous lifecycle action

The first version should not support restart for allocations in these states:

- requested
- provisioning
- releasing
- released
- failed
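The eligibility rule above can be sketched as a single check. This is a minimal illustration, not the platform's policy engine: the `AllocationStatus` and `Role` enums, the `member` role, and the `DEFAULT_RESTART_ROLES` set are hypothetical names, and the role set is a parameter because role policy is still an open question.

```python
from enum import Enum


class AllocationStatus(str, Enum):
    REQUESTED = "requested"
    PROVISIONING = "provisioning"
    ACTIVE = "active"
    RELEASING = "releasing"
    RELEASED = "released"
    FAILED = "failed"


class Role(str, Enum):
    # Hypothetical role names; "member" stands in for any non-privileged caller.
    OWNER = "owner"
    PROJECT_ADMIN = "project_admin"
    PLATFORM_ADMIN = "platform_admin"
    MEMBER = "member"


# Assumption: admins are allowed by default; real deployments would read this from role policy.
DEFAULT_RESTART_ROLES = frozenset({Role.OWNER, Role.PROJECT_ADMIN, Role.PLATFORM_ADMIN})


def can_restart(status, role, allowed_roles=DEFAULT_RESTART_ROLES):
    """Return True only for an active allocation and a caller whose role is permitted."""
    return status is AllocationStatus.ACTIVE and role in allowed_roles
```

Passing `allowed_roles=frozenset({Role.OWNER})` models an owner-only policy without changing the status rule.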

5. Lifecycle Model

Recommended allocation lifecycle extension:

active -> restarting -> active
                    -> restart_failed

Notes:

- restarting is a real transient state, not just a UI hint
- restart_failed should be explicit so the user knows the allocation still exists but requires attention
- a retry path should exist from restart_failed
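The lifecycle extension can be expressed as a small transition table. This sketch covers only the restart-related states; the `RESTART_TRANSITIONS` table and `transition` helper are hypothetical, and modeling retry as `restart_failed -> restarting` is an assumption, since the exact retry semantics are not yet decided. Release transitions are deliberately left out.

```python
# Allowed transitions for the restart extension only (release is not modeled here).
RESTART_TRANSITIONS = {
    "active": {"restarting"},
    "restarting": {"active", "restart_failed"},
    # Assumption: retry re-enters the restarting state.
    "restart_failed": {"restarting"},
}


class InvalidTransition(Exception):
    """Raised when a requested status change is not in the transition table."""


def transition(current: str, target: str) -> str:
    """Return the new status, or raise if the move is not allowed."""
    if target not in RESTART_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target
```

Keeping the table explicit makes it hard for a handler to move an allocation straight from active to restart_failed without passing through restarting.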

This mirrors the current release-failure model:

- failures are visible
- users are not left guessing whether the action completed

6. User Experience

6.1 Where restart appears

Restart should appear on allocation detail near:

- Metrics
- Release

It is a primary compute action, not a hidden submenu item.

6.2 Confirmation copy

The confirmation should say clearly:

- SSH and browser terminal sessions will disconnect
- running processes/jobs on the machine will be interrupted
- attached persistent storage remains attached after the machine returns
- the allocation itself will remain owned by the user

6.3 During restart

The UI should show:

- allocation status restarting
- recent activity
- last progress timestamp

The terminal surface should switch from interactive to a waiting/reconnecting state.

6.4 After restart

When successful:

- allocation returns to active
- SSH/terminal access becomes available again
- existing attached storage and managed runtime state should still be visible

When failed:

- allocation moves to restart_failed
- user sees failure details and retry guidance

7. Control-Plane Semantics

7.1 Contract shape

Recommended API:

- POST /api/v1/allocations/{allocation_id}/restart

Behavior:

- idempotent mutation
- writes audit log
- schedules async reboot task through the normal control-plane path
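One way to read the behavior above is sketched below with in-memory stand-ins. The `ControlPlane` class, its fields, the `allocation.restart` audit action, and the `reboot_node` task type are all hypothetical. The sketch assumes idempotency means that a repeated request while a restart is in flight returns the current state without scheduling a second reboot, and that only the first accepted request is audited; the retry path from restart_failed is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class ControlPlane:
    statuses: dict                                   # allocation_id -> status string
    audit_log: list = field(default_factory=list)    # stand-in for the audit store
    task_queue: list = field(default_factory=list)   # stand-in for the async task scheduler

    def request_restart(self, allocation_id: str, actor: str) -> dict:
        status = self.statuses[allocation_id]
        if status == "restarting":
            # Idempotent replay: a restart is already in flight, so schedule nothing new.
            return {"allocation_id": allocation_id, "status": "restarting", "scheduled": False}
        if status != "active":
            raise ValueError(f"cannot restart allocation in status {status!r}")
        # Record intent first, then audit, then hand work to the async path.
        self.statuses[allocation_id] = "restarting"
        self.audit_log.append(
            {"action": "allocation.restart", "allocation_id": allocation_id, "actor": actor}
        )
        self.task_queue.append({"type": "reboot_node", "allocation_id": allocation_id})
        return {"allocation_id": allocation_id, "status": "restarting", "scheduled": True}
```

The handler never contacts the node; it only records state and enqueues typed work for the control plane to dispatch.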

7.2 Backend path

Restart should not be implemented as a direct handler-to-node-agent call.

It should follow the same principles as the rest of the platform:

- API mutation records intent
- control plane schedules typed runtime work
- node agent performs bounded reboot action
- node reconnect/health path returns the allocation to active

7.3 Node-agent contract

The node agent should expose a bounded restart/reboot primitive, not arbitrary shell semantics.

That primitive should:

- acknowledge accepted action
- initiate reboot
- allow the control plane to infer success from post-reboot re-registration / heartbeat recovery
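A minimal sketch of such a primitive follows. The `NodeAgent` class, `RebootAck` type, and the idea of comparing a per-boot identifier reported in heartbeats are assumptions for illustration and are not taken from Node_Agent_Spec.md. The reboot itself is an injected callable so the example never touches a real host.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class RebootAck:
    task_id: str
    accepted: bool


class NodeAgent:
    """Exposes one bounded action; it accepts no command string or shell arguments."""

    def __init__(self, reboot_fn: Callable[[], None]):
        # Injected so this sketch can run without rebooting anything.
        self._reboot_fn = reboot_fn

    def reboot(self, task_id: str) -> RebootAck:
        # A real agent would deliver the ack before the host goes down.
        ack = RebootAck(task_id=task_id, accepted=True)
        self._reboot_fn()
        return ack


def restart_succeeded(boot_id_before: str, heartbeat_boot_id: Optional[str]) -> bool:
    """Infer success: a heartbeat arrived and it reports a different boot identifier."""
    return heartbeat_boot_id is not None and heartbeat_boot_id != boot_id_before
```

Comparing boot identifiers guards against treating a heartbeat from the pre-reboot agent as proof that the reboot happened.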

8. Persistence and Recovery

8.1 SSH access

Restart must preserve:

- attached SSH key state
- any project/allocation access grants

The user should not have to reconfigure SSH access after reboot.

8.2 Storage

Restart should preserve:

- attached persistent storage objects
- mount intent

The platform may need a post-reboot mount reconcile step, but that should not be exposed as a separate user concern.
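A post-reboot mount reconcile could be as simple as diffing recorded mount intent against what the node reports. The sketch below assumes both are available as mappings from a storage id to a mount path; the `plan_mount_reconcile` function and the action shape are hypothetical.

```python
def plan_mount_reconcile(intended: dict, mounted: dict) -> list:
    """Return the mount actions needed so the node matches recorded mount intent.

    Both arguments map storage_id -> mount path. Only missing or mismatched
    mounts produce an action; extra mounts on the node are ignored here.
    """
    actions = []
    for storage_id, path in sorted(intended.items()):
        if mounted.get(storage_id) != path:
            actions.append({"op": "mount", "storage_id": storage_id, "path": path})
    return actions
```

An empty plan means the node already matches intent, so the reconcile step is safe to run on every return to active.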

8.3 Managed runtimes

Managed runtime bundle state should remain associated with the allocation after restart.

If the runtime requires post-boot activation or a service start, the managed runtime reconcile path should handle it; the user should not have to do it manually.

9. Failure Model

Potential restart failure reasons:

- reboot task could not be dispatched
- node did not return within timeout
- node returned but allocation connectivity/health did not recover
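The three failure reasons can be mapped to a status decision as sketched below. The reason codes and the `evaluate_restart` function are hypothetical, and the timeout is a parameter because the acceptable value is still an open question.

```python
from typing import Optional, Tuple

DISPATCH_FAILED = "dispatch_failed"
NODE_TIMEOUT = "node_timeout"
HEALTH_NOT_RECOVERED = "health_not_recovered"


def evaluate_restart(
    dispatched: bool,
    node_returned: bool,
    healthy: bool,
    elapsed_s: float,
    timeout_s: float,
) -> Tuple[str, Optional[str]]:
    """Return (allocation status, failure reason or None) for an in-flight restart."""
    if not dispatched:
        return ("restart_failed", DISPATCH_FAILED)
    if node_returned and healthy:
        return ("active", None)
    if elapsed_s < timeout_s:
        # Still inside the window: keep waiting for re-registration or health recovery.
        return ("restarting", None)
    reason = HEALTH_NOT_RECOVERED if node_returned else NODE_TIMEOUT
    return ("restart_failed", reason)
```

Returning the reason alongside the status gives the UI something concrete to show when the allocation lands in restart_failed.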

The first user-visible failure state should be restart_failed.

Required follow-up UX:

- show failure reason if known
- allow retry
- keep Release available

10. Out of Scope for First Slice

Not part of the first restart slice:

- schedule reboot for later
- soft reboot vs hard power cycle user choice
- admin-only force power actions
- restart many allocations in one batch
- node replacement on restart failure

Those can be added later if needed.

11. Open Questions

- Should restart be owner-only, or also available to any project admin?
- Should restart_failed allow only retry and release, or also some future repair action?
- Should restart preserve the same advertised hostname/IP in every deployment mode, or is that only best-effort?
- What timeout is acceptable before the platform marks restart as failed?

12. Decision Summary

Restart should be a normal self-service allocation capability.

The first version should:

- add the restarting and restart_failed states
- expose POST /api/v1/allocations/{allocation_id}/restart
- use the existing async control-plane/node-agent model
- preserve allocation identity, storage attachment intent, SSH access state, and managed runtime association

That is the minimum correct product model for reboot without release.