Allocation Experience Gaps v1¶
Purpose:
- Capture the major user-facing gaps in the raw allocation experience for GPU users.
- Separate immediate compute-allocation needs from higher-level app platform work.
- Create a stable product reference for sequencing the next slices after the first Kubernetes and workload UX work.
Inputs:
- doc/product/Navigation_Redesign_App_Platform_v1.md
- doc/architecture/Allocation_Node_Placement_v1.md
- doc/architecture/App_Runtime_Instance_Lifecycle_v1.md
- doc/api/openapi.draft.yaml
- packages/web/app/allocations/
Related:
- doc/product/Slurm_UI_Options_v1.md
- doc/architecture/Kubernetes_Platform_Options_v1.md
1. Executive Summary¶
The current allocation experience is strong on:
- fast provisioning
- stable SSH access
- browser terminal access
- basic allocation lifecycle
But it still misses several things GPU users expect to have without opening a support thread or rebuilding the node manually.
The highest-value gaps are:
- storage attachment and persistence
- allocation restart
- team/shared SSH access
- better progress and activity reporting for long-running operations
- managed runtime selection layered on top of compute
These are not all the same kind of problem. Some belong to raw compute allocation. Some belong to managed runtime bundles. Some belong to project collaboration and access control.
This document keeps those boundaries explicit.
2. Current Strengths¶
The current compute allocation model already provides:
- fast allocation activation, because heavy node setup is done during node provisioning rather than per allocation
- stable direct SSH access
- browser terminal access
- release lifecycle
- allocation metrics / live telemetry entry points
These are good platform traits and should be preserved.
The next work should not weaken:
- allocation startup speed
- the simple single-node mental model
- the boundary between platform-managed state and user-managed state
3. Primary Gaps¶
3.1 Storage¶
This is the largest missing user-facing capability.
Questions still to answer:
- per-allocation storage vs shared/project storage
- attach at allocation creation only vs later attach/detach
- single-writer vs multi-attach semantics
- lifecycle after allocation release
- whether the first storage surface is block, filesystem, object-backed mount, or a shared runtime-backed path
- whether a storage backend such as Weka can safely support many live allocations with the needed semantics
Required outcome:
- users can understand what persists after release
- users can deliberately attach storage rather than treating local disk as durable by accident
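A minimal TypeScript sketch of what an attachment record answering these questions could look like. Every type and field name here (`StorageAttachment`, `persistsAfterRelease`, and so on) is an assumption for illustration, not part of the current API:

```typescript
// Hypothetical shapes only; none of these names exist in the platform yet.
type StorageKind = "block" | "filesystem" | "object-mount" | "shared-runtime-path";

interface StorageAttachment {
  volumeId: string;
  kind: StorageKind;
  scope: "allocation" | "project";        // per-allocation vs shared/project storage
  accessMode: "single-writer" | "multi-attach";
  persistsAfterRelease: boolean;          // explicit lifecycle after allocation release
}

// "What persists after release?" becomes derivable from attachment records
// instead of being inferred from local disk layout.
function survivingVolumes(attachments: StorageAttachment[]): string[] {
  return attachments
    .filter((a) => a.persistsAfterRelease)
    .map((a) => a.volumeId);
}
```

The point of the sketch is that persistence is a declared property of the attachment, so the UI can state it up front rather than letting users discover it at release time.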
3.2 Restart¶
Users need self-service restart for active allocations.
Typical reasons:
- hung driver/runtime state
- networking recovery
- process cleanup
- reboot after collaborator handoff
Required product semantics:
- restart preserves allocation identity
- restart is asynchronous
- sessions and running workloads are interrupted intentionally
- SSH access metadata and managed-runtime state survive the reboot
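These semantics can be sketched in TypeScript. The state names, fields, and `requestRestart` are illustrative assumptions, not the platform's actual model:

```typescript
// Illustrative sketch of restart semantics; names are assumptions.
type AllocationState = "active" | "restarting" | "releasing" | "released";

interface Allocation {
  id: string;               // allocation identity is preserved across restart
  state: AllocationState;
  sshPublicKeys: string[];  // SSH access metadata survives the reboot
}

// Restart is asynchronous: the request only transitions state and returns.
// A background reconciler would later move the allocation back to "active".
function requestRestart(a: Allocation): Allocation {
  if (a.state !== "active") {
    throw new Error(`cannot restart allocation in state ${a.state}`);
  }
  return { ...a, state: "restarting" }; // same id, same keys
}
```

The key design point the sketch encodes: restart is a state transition on an existing record, never a release-plus-recreate, so identity and access metadata cannot be lost.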
3.3 Team / Shared SSH Access¶
The current allocation SSH-key improvement allows a user to swap the registered public keys attached to their allocation.
That is useful, but it does not solve the real collaboration gap:
- another project member should be able to gain SSH access to the allocation without requiring the original creator to share a private key
This needs a real project-scoped access model, not just more personal-key management.
Required product semantics:
- project membership and role checks determine who may be granted access
- audit records show who changed allocation access
- the granted user supplies their own public key material
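A hedged sketch of that authz shape in TypeScript. The role names, `grantSshAccess`, and the audit record fields are all assumptions for illustration only:

```typescript
// Illustrative authz sketch; roles and record shape are assumptions.
type ProjectRole = "owner" | "admin" | "member" | "viewer";

interface AccessGrantAudit {
  allocationId: string;
  grantedTo: string;     // project member receiving access
  publicKey: string;     // the granted user supplies their own public key material
  grantedBy: string;     // audit: who changed allocation access
}

function grantSshAccess(
  allocationId: string,
  actor: { userId: string; role: ProjectRole },
  grantedTo: string,
  publicKey: string,
): AccessGrantAudit {
  // Project membership and role checks gate the grant; no private key changes hands.
  if (actor.role === "viewer") {
    throw new Error("insufficient project role to grant allocation access");
  }
  return { allocationId, grantedTo, publicKey, grantedBy: actor.userId };
}
```

Note that the grantee's own public key is the only key material in the flow, which is exactly what separates this from personal-key management.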
3.4 Progress and Activity Visibility¶
Long operations currently show coarse states such as:
- deploying
- bootstrap_in_progress
- releasing
Those coarse states are not enough to sustain user trust during long-running operations.
Required UX improvements:
- phase timeline
- recent activity panel
- timestamps for the last progress update
- eventually, log excerpts or live log routing where possible
This matters for:
- app-backed workloads like Kubernetes and Slurm
- allocation lifecycle operations like restart and release
- future managed runtime bundle installation
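One way to picture the data behind a phase timeline. `ActivityEvent` and `isStalled` are hypothetical names, not part of the current API:

```typescript
// Hypothetical event shape for a phase timeline / recent-activity panel.
interface ActivityEvent {
  phase: string;      // e.g. "bootstrap_in_progress"
  message: string;    // human-readable progress detail
  atMs: number;       // timestamp of the progress update (unix ms)
}

// With per-event timestamps, "has this operation stalled?" becomes
// answerable from data rather than from user patience.
function isStalled(events: ActivityEvent[], nowMs: number, staleAfterMs: number): boolean {
  if (events.length === 0) return true;
  const last = events[events.length - 1];
  return nowMs - last.atMs > staleAfterMs;
}
```

A timeline like this is what lets the UI distinguish "still making progress inside bootstrap" from "stuck", which a single coarse state string cannot express.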
3.5 Managed Runtime Selection¶
Users need framework/environment choice, but that choice should not bloat node provisioning or make allocations slow.
The platform already has a strong starting point:
- drivers and base OS are prepared before allocation time
The gap is:
- users cannot yet choose a supported runtime environment such as PyTorch during allocation, or apply one later through a platform-owned path
This should be solved by managed runtime bundles, not by moving heavy environment setup into raw allocation provisioning.
See:
- doc/product/Managed_Runtime_Bundles_v1.md
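The layering argument can be sketched as data: a bundle is state recorded on the allocation, not a new node image. All names below are assumptions for illustration:

```typescript
// Sketch of runtime choice as a platform-managed layer applied on top of
// an already-prepared node; names are assumptions, not a real API.
interface RuntimeBundle {
  name: string;     // e.g. "pytorch" (illustrative)
  version: string;
}

interface ComputeAllocation {
  id: string;
  bundles: RuntimeBundle[]; // empty for raw compute; allocation speed is unaffected
}

// Applying a bundle records platform-managed runtime state on the allocation
// record; it never forks the node image, so image sprawl is avoided.
function applyBundle(a: ComputeAllocation, b: RuntimeBundle): ComputeAllocation {
  return { ...a, bundles: [...a.bundles, b] };
}
```

In this shape, raw allocations and bundle-equipped allocations are the same product object, which keeps the single-node mental model intact.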
4. Things We Should Not Do¶
4.1 Do not overload raw allocation provisioning with too many environment choices¶
If every framework choice becomes a provisioning-time image fork:
- image sprawl grows quickly
- allocation UX becomes confusing
- platform rollout velocity drops
4.2 Do not try to fully govern what users install over SSH¶
Users will always be able to install their own tools manually.
The platform should not pretend it owns:
- arbitrary user packages
- arbitrary conda/venv state
- arbitrary filesystem mutations made over SSH
The platform should only guarantee what it installed through platform-managed flows.
4.3 Do not merge collaboration access with personal SSH-key storage¶
Personal SSH-key management is necessary, but it is not the same thing as:
- project-scoped allocation access
- team handoff
- project-member collaboration
That needs a distinct product and authz model.
5. Recommended Sequencing¶
5.1 Next highest-value slices¶
- Storage model and attachment UX
- Allocation restart
- Project-member SSH access to live allocations
- Better progress/activity reporting
- Managed runtime bundle selection on top of compute
5.2 Why this order¶
- storage is the biggest real workflow blocker for serious users
- restart is an expected self-service compute action
- shared SSH access is a collaboration blocker
- richer progress improves trust in app/runtime workflows already being introduced
- managed runtime bundles become more useful once the compute/access/storage layer is stable
6. UX Direction for Allocations¶
The allocation detail page should evolve toward:
- Overview
- SSH Access
- Storage Attachments
- Runtime / Managed Bundles
- Metrics / Health
- Activity / Events
The raw terminal remains important, but it should not be the only prominent interaction surface.
7. Open Questions¶
- Should restart be available to all allocation owners, or gated by project role?
- Should a restarted allocation keep the same public endpoint/hostname in every case?
- Should project-member SSH access be explicit invitation/approval based, or simple owner-managed membership?
- When storage lands, is attach allowed after allocation creation or only during create?
- Should managed runtime selection be optional during allocation creation and later editable, or only post-allocation?
8. Decision Summary¶
The compute allocation product should remain:
- fast
- simple
- driver-ready
But it must grow the missing operational and collaboration layers around that core:
- restart
- storage
- shared access
- progress visibility
- managed runtimes
Those layers should be added deliberately without turning raw allocation provisioning into a large mutable configuration surface.