Product Surface IA and Role Model v1

Status:

- Draft design input for the next full product/admin redesign pass.
- Complements, not replaces:
    - doc/product/Product_UI_System_Redesign_v1.md
    - doc/product/Unified_Product_UX_Model_v1.md
    - doc/product/Admin_Ops_and_User_IA_v1.md
    - doc/product/IAM_UX_Information_Architecture_v1.md
    - doc/product/UX_Intent_Flow_Audit.md

Purpose:

- Define the full product information architecture before page-by-page admin and ops redesign work continues.
- Make role, intent, and scope first-class so the redesign does not collapse back into route dumps or entity dumps.
- Provide one system-level artifact for review and feedback.

0. Canonical Decisions

This is the canonical IA document for the current redesign pass.

When this document disagrees with:

- Product_UI_System_Redesign_v1.md
- Unified_Product_UX_Model_v1.md
- Admin_Ops_and_User_IA_v1.md
- IAM_UX_Information_Architecture_v1.md

this document wins for:

- shell/navigation structure;
- role and mode boundaries;
- admin vs ops vs telemetry boundaries;
- primary workload nouns;
- landing-page ownership.

0.1 Global shell groups

For the authenticated product shell, the canonical top-level groups are:

- Workloads
- Compute
- Apps
- Storage
- Access
- Account
- Platform

Rules:

- Platform is the top-level home for platform-admin, infra, SRE, telemetry, audit, and other privileged operator flows.
- Access is the durable group for project/tenant/platform IAM surfaces.
- Storage is the durable workbench for buckets, mounts, datasets, checkpoints, artifacts, retention, and workload data attachments.
- public/developer portal is a separate surface family, not an eighth internal left-nav group.

0.2 Admin vs ops vs telemetry

Decision:

- Platform is the authenticated top-level shell group.
- Inside Platform, Ops, Fleet, Governance, Finance, and Evidence are peer local-navigation families.

Implication:

- ops is not a child of admin as a top-level product concept;
- admin and ops are peer operating families inside the broader Platform surface;
- telemetry is a peer investigation/evidence capability, not the same thing as admin or ops.

0.3 Workload noun hierarchy

Decision:

- workload is the primary product noun for active/runnable things users and operators care about.
- allocation remains the compute substrate and billing/runtime record.
- app instance is treated as a workload subtype rather than a permanently separate top-level mental model.

Implication:

- /workloads is the primary user/operator runtime workbench.
- allocation surfaces remain valid, but more infrastructure-oriented.
- app-instance routes may remain temporarily, but should converge toward the workload model instead of competing with it.
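
The noun hierarchy above can be sketched as a discriminated union in which an app instance is one workload subtype and every workload points at its allocation. This is a minimal sketch, not the product's data model: the interface names, field names, and the `/workloads/<id>` detail path are assumptions made for illustration.

```typescript
// Hypothetical shapes for illustration; field names are assumptions, not the product API.

// Allocation stays the compute substrate and billing/runtime record.
interface AllocationRecord {
  allocationId: string;
}

interface BaseWorkload {
  workloadId: string;
  name: string;
  allocation: AllocationRecord; // substrate the workload runs on
}

// A plain compute workload.
interface ComputeWorkload extends BaseWorkload {
  kind: "compute";
}

// An app instance is modeled as a workload subtype, not a separate top-level noun.
interface AppInstanceWorkload extends BaseWorkload {
  kind: "app_instance";
  appSlug: string;
}

type Workload = ComputeWorkload | AppInstanceWorkload;

// Both subtypes resolve to the same workbench; the detail path is an assumed pattern.
function workloadDetailRoute(w: Workload): string {
  return `/workloads/${encodeURIComponent(w.workloadId)}`;
}

// Subtype-specific framing is limited to labels, not to a separate route family.
function workloadKindLabel(w: Workload): string {
  switch (w.kind) {
    case "compute":
      return "Compute";
    case "app_instance":
      return `App: ${w.appSlug}`;
  }
}
```

Keeping one route builder for both subtypes is the point of the sketch: temporary app-instance routes can redirect to it instead of growing a parallel detail family.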

0.4 Mode entry

Decision:

- mode is explicit in the shell.
- users with more than one role enter mode through a top-bar mode switcher.

Canonical modes:

- User
- Tenant Admin
- Project Admin
- Platform Admin
- Ops

Rules:

- mode selection changes default landing page and primary navigation emphasis;
- it does not bypass backend authorization;
- route access still remains permission-gated.

0.5 Wizard and task decisions

Decision:

- major launch flows use a full-page wizard;
- simple dependency-create flows may use modal/drawer shells, but must reuse the same wizard primitives and language;
- row overflow uses visible Open plus labeled More on desktop;
- long-running operations get a routed task detail at /tasks/:task_id, with embedded summaries allowed on detail pages.

These are no longer open questions for the mock pass.

0.6 Top-bar context model

Decision:

- the top bar always contains:
    - mode switcher
    - workspace/project selector
    - region selector
    - notifications entry
    - session menu
- balance is always visible in user, tenant-admin, and project-admin modes where spend is directly relevant;
- in platform-admin and ops modes, balance may collapse into a finance/status entry instead of remaining a primary chip.

Rules:

- project/workspace context remains globally visible, but platform-wide pages do not silently filter themselves to one project unless the page is explicitly scoped that way;
- platform surfaces should show when they are in fleet-wide vs project/tenant context.
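
The balance rule above can be expressed as a small lookup from mode to placement. This is a sketch under one assumption: the document says balance "may" collapse in platform-admin and ops modes, and the function below treats that as the chosen behavior. The type and function names are hypothetical.

```typescript
type ShellMode = "user" | "tenant_admin" | "project_admin" | "platform_admin" | "ops";

// The five entries the top bar always contains, in the order listed above.
const TOP_BAR_ENTRIES = [
  "mode switcher",
  "workspace/project selector",
  "region selector",
  "notifications entry",
  "session menu",
] as const;

type BalancePlacement = "primary_chip" | "finance_status_entry";

// Assumption: the optional collapse in platform-admin and ops modes is always applied.
function balancePlacement(mode: ShellMode): BalancePlacement {
  switch (mode) {
    case "user":
    case "tenant_admin":
    case "project_admin":
      return "primary_chip"; // spend is directly relevant in these modes
    case "platform_admin":
    case "ops":
      return "finance_status_entry";
  }
}
```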

0.7 Global search

Decision:

- the shell should include a global search / command entry point as part of the redesign target.

Initial scope may be limited, but the mock should include it because the product already crosses allocations, workloads, users, nodes, and docs.

0.8 Notifications

Decision:

- notifications remain a top-bar shell capability with an inbox/panel model.
- the redesign does not require a full dedicated notifications workspace immediately, but the shell mock should assume a stable bell/inbox entry.

0.9 Mode-switcher behavior

Decision:

- persistent shell with emphasis: the same seven top-level groups remain visible across all modes;
- mode changes:
    - the default landing route;
    - which left-rail group is highlighted as the home group;
    - which actions are visible vs gated.
- mode does not swap left-rail content. Groups the user cannot use remain visible but disabled, with reason text on hover.

Rationale:

- a stable shell is easier to learn than a shape-shifting one;
- users with multiple roles can switch modes without losing visual orientation;
- access still flows from backend authorization, not shell visibility.
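
A minimal sketch of the persistent left rail described above: all seven groups are always returned, mode only decides which one is marked as home, and unusable groups are disabled with a reason instead of being removed. The `RailItem` shape, the function name, and the reason wording are assumptions; `usableGroups` stands in for whatever the backend authorization check reports.

```typescript
const SHELL_GROUPS = [
  "Workloads",
  "Compute",
  "Apps",
  "Storage",
  "Access",
  "Account",
  "Platform",
] as const;

type ShellGroup = (typeof SHELL_GROUPS)[number];

interface RailItem {
  group: ShellGroup;
  isHome: boolean;               // highlighted as the home group for the current mode
  disabled: boolean;             // visible but not usable
  disabledReason: string | null; // hover text when disabled
}

// usableGroups is assumed to come from backend authorization, never from the mode itself.
function buildLeftRail(homeGroup: ShellGroup, usableGroups: readonly ShellGroup[]): RailItem[] {
  return SHELL_GROUPS.map((group) => {
    const disabled = usableGroups.indexOf(group) === -1;
    return {
      group,
      isHome: group === homeGroup,
      disabled,
      disabledReason: disabled ? `Your current role does not include access to ${group}.` : null,
    };
  });
}
```

Because the list is built from the fixed group constant, switching mode can change `homeGroup` but cannot change the rail's length or order.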

0.10 Mode-switcher visual and persistence

Visual:

- label + chevron dropdown in the top bar;
- compact label format: Mode: <Name>;
- single-role users see the dropdown disabled or hidden, with their mode shown as a static label.

Persistence:

- mode is sticky across sessions per user;
- first login defaults to the highest-relevance non-platform mode in the order User → Tenant Admin → Project Admin, unless the user has previously made an explicit mode selection.
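
The persistence rule can be sketched as a pure function that resolves the mode at session start. Two behaviors below are assumptions not stated in this document: a stored mode is ignored when the user no longer holds it, and a user who holds only platform-level modes falls back to the first mode they do hold.

```typescript
type ShellMode = "user" | "tenant_admin" | "project_admin" | "platform_admin" | "ops";

// Default order for a first login: User, then Tenant Admin, then Project Admin.
const FIRST_LOGIN_ORDER: readonly ShellMode[] = ["user", "tenant_admin", "project_admin"];

function resolveInitialMode(
  available: readonly ShellMode[],
  storedMode: ShellMode | null,
): ShellMode | null {
  // Sticky choice wins, provided the user still holds that mode (assumption).
  if (storedMode !== null && available.indexOf(storedMode) !== -1) {
    return storedMode;
  }
  // Otherwise pick the first non-platform mode in the documented order.
  for (const candidate of FIRST_LOGIN_ORDER) {
    if (available.indexOf(candidate) !== -1) {
      return candidate;
    }
  }
  // Only platform-level modes are held: fall back to the first one (assumption).
  return available[0] ?? null;
}
```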

0.11 Platform landing and local navigation

Decision:

- /platform lands on a card-grid family overview;
- each family page (/platform/ops, /platform/lifecycle, /platform/config, /platform/evidence, /platform/finance, /platform/iam) uses local subnav or tabs as needed;
- a permanent secondary left rail inside the shell is not added in this redesign pass.

0.12 Account vs Access split

Access (project / tenant / platform IAM):

- projects;
- team / memberships;
- service accounts;
- policies / entitlements / platform roles.

Account (personal):

- profile;
- billing;
- SSH keys;
- personal sessions / settings.

Rationale:

- separates "who can do what across scopes" (Access) from "things about me" (Account).

0.13 Tenant-admin and project-admin landing primary questions

Both landings should answer:

1. Who has access?
2. What needs governance attention?
3. What is the current spend / usage posture?
4. What recent changes happened?
5. What needs action now?

These five questions drive the section structure of /tenant/overview and /project/overview. Project admin landing scopes everything to the active project; tenant admin landing aggregates across projects in the tenant.

0.14 App-platform visual elevation

Decision:

- Apps remains a peer of Compute in the shell, with no top-level IA split in this pass.
- visual elevation in the shell and landings is allowed and encouraged: distinct icon weight, hero treatment in user-mode landings, prominent default-action surface in the app catalog.

Rationale:

- the app platform is a strategic differentiator (per §17.1) but is not yet a separate product surface;
- elevating it visually preserves IA simplicity while making the differentiator legible.

0.15 Storage as a peer shell group

Decision:

- Storage is a top-level authenticated shell group.
- Storage owns bucket and data-substrate management after creation.
- Launch flows may still create buckets inline when storage is a dependency.

Rationale:

- buckets can outlive workloads and remain valuable after the producing workload is released;
- workload-to-bucket relationships are many-to-many over time, so storage should not be modeled as a child of one workload;
- storage has its own operational vocabulary: quota, unattached buckets, failed mounts, retention, encryption, lifecycle, access drift, and scheduled deletes;
- storage appears in multiple launch paths, including notebooks, training datasets, checkpoints, artifacts, and persistence for app services.

Rules:

- workload details should link attached buckets to their storage detail pages;
- bucket details should link active and historical workload attachments back to workload details;
- inline bucket creation in launch flows must produce the same bucket object that later appears in the Storage workbench;
- do not bury persistent storage management under Apps, Workloads, Access, or Account.
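
The many-to-many relationship and the two cross-linking rules above can be sketched with a separate attachment record, so that neither the bucket nor the workload owns the other. The record shape and both detail paths (`/workloads/<id>` and `/storage/buckets/<id>`) are assumptions made for illustration.

```typescript
// One row per attachment over time; a bucket or workload may appear in many rows.
interface BucketAttachment {
  bucketId: string;
  workloadId: string;
  attachedAt: string;        // ISO timestamp
  detachedAt: string | null; // null while the attachment is active
}

// Bucket detail: active and historical attachments, each linking back to a workload.
function workloadLinksForBucket(all: readonly BucketAttachment[], bucketId: string) {
  const mine = all.filter((a) => a.bucketId === bucketId);
  const toLink = (a: BucketAttachment) => `/workloads/${encodeURIComponent(a.workloadId)}`;
  return {
    active: mine.filter((a) => a.detachedAt === null).map(toLink),
    historical: mine.filter((a) => a.detachedAt !== null).map(toLink),
  };
}

// Workload detail: currently attached buckets link to their storage detail pages.
function bucketLinksForWorkload(all: readonly BucketAttachment[], workloadId: string): string[] {
  return all
    .filter((a) => a.workloadId === workloadId && a.detachedAt === null)
    .map((a) => `/storage/buckets/${encodeURIComponent(a.bucketId)}`);
}
```

Because history lives on the attachment rows, a bucket keeps its past workload links after the producing workload is released.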

0.16 Backend read-model readiness

Decision:

- v3 implementation should not wire major pages directly to ad hoc domain queries.
- each major shell group needs an explicit read-model/API map before migration from mock data to live data.
- Redis-backed read-model caching is part of the v3 backend foundation, but cache entries are never sources of truth.

Required implementation input:

- doc/architecture/UI_Read_Model_Cache_Architecture_v1.md
- endpoint-by-endpoint read-model map for Workloads, Apps, Storage, Access, Account, and Platform before broad page migration.

Rules:

- pages may launch behind feature flags with partial data only if missing read-model fields are visibly non-blocking;
- admin/operator surfaces must not require direct SQL table edits or direct DB inspection for normal operation;
- cache keys must include tenant/project/user scope where authorization depends on that scope.
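
The cache-key rule above can be illustrated with a key builder that always emits tenant, project, and user segments, so two callers with different scopes can never share an entry by accident. This is only a sketch of key construction: the `rm:` prefix, the segment layout, and the read-model name in the usage note are hypothetical and are not taken from doc/architecture/UI_Read_Model_Cache_Architecture_v1.md.

```typescript
interface ReadModelScope {
  tenantId?: string;
  projectId?: string;
  userId?: string;
}

// Builds a deterministic key; every scope segment is always present, even when empty.
function readModelCacheKey(
  readModel: string,
  scope: ReadModelScope,
  params: Record<string, string> = {},
): string {
  const scopeParts = [
    `t=${encodeURIComponent(scope.tenantId ?? "")}`,
    `p=${encodeURIComponent(scope.projectId ?? "")}`,
    `u=${encodeURIComponent(scope.userId ?? "")}`,
  ];
  // Sort parameter names so the same query always maps to the same key.
  const query = Object.keys(params)
    .sort()
    .map((k) => `${k}=${encodeURIComponent(params[k] ?? "")}`)
    .join("&");
  const key = ["rm", readModel].concat(scopeParts).join(":");
  return query === "" ? key : `${key}:${query}`;
}
```

For example, `readModelCacheKey("workloads.list", { tenantId: "acme", projectId: "p1" }, { status: "running", page: "2" })` yields `rm:workloads.list:t=acme:p=p1:u=:page=2&status=running`. Encoding the identifiers keeps a `:` inside an ID from being read as a segment boundary.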

1. Why This Document Exists

Current redesign documents already establish:

- shared page families;
- workflow-oriented admin direction;
- IAM separation principles;
- user and operator intent audits.

What is still missing is one explicit map for the whole product:

- public vs authenticated surfaces;
- user vs tenant/project admin vs platform admin vs ops;
- admin vs ops vs telemetry boundaries;
- infra-ops vs SRE-ops intent boundaries;
- how dense admin pages should decompose into summary, operations, diagnostics, history, and advanced detail.

This document fills that gap.

2. Core Design Rule

Do not design the product around backend entities or current route prefixes.

Design around:

- role
- intent
- scope
- control plane

The same entity may appear in multiple surfaces, but with different framing, actions, and urgency.

Examples:

- allocations appear in user, project-admin, platform-admin, and ops flows;
- nodes appear in platform-admin, infra-ops, and SRE flows;
- billing appears in user, tenant-admin, and platform-admin flows;
- IAM spans user, project, tenant, platform, and future app/integration layers.

3. Product Surface Families

The product should be treated as several related surfaces, not one giant navigation tree.

3.1 Public / Developer Portal

Purpose:

- onboarding, docs, SDKs, API references, examples, integration guides

Audience:

- prospective users
- developers
- external integration owners

Examples:

- docs
- Swagger / Redoc
- downloads
- integration guides

Rules:

- do not mix this surface into internal admin/ops left navigation;
- may share brand/system components, but not the same workspace IA.

3.2 User Workspace

Purpose:

- allocate, connect, operate, and understand resources owned by the current user/project context

Audience:

- end users
- project members

Examples:

- marketplace
- allocations
- workloads
- storage
- billing
- SSH keys

Rules:

- speak in allocation/project terms, not host/guest/internal implementation terms;
- show only the operational truth the user needs.

3.3 Tenant / Project Administration

Purpose:

- administer access, policy, project structure, and usage at tenant/project scope

Audience:

- tenant admins
- project admins

Examples:

- team and memberships
- projects
- service accounts
- app entitlements
- quotas and usage posture at scoped levels

Rules:

- not the same as platform admin;
- scope and blast radius must remain obvious in both copy and action design.

3.4 Platform Operations / Administration

Purpose:

- run the platform safely across fleet, workflows, policy, incidents, and privileged actions

Audience:

- platform admins
- infra operators
- SRE/operators

Examples:

- nodes
- allocations
- onboarding/decommission flows
- ops overview
- audit
- payment operations
- platform IAM

Rules:

- this surface is workflow- and incident-oriented, not an entity inventory dump;
- deep diagnostics are allowed, but only after summary and safe actions.

4. Role Model

These roles may touch overlapping entities, but they do not have the same workflow.

4.1 User

Intent:

- get capacity
- connect to it
- understand runtime health and spend

Scope:

- self
- current project resources visible to the user

4.2 Tenant Admin

Intent:

- govern tenant-scoped users, projects, policy, usage posture

Scope:

- tenant

4.3 Project Admin

Intent:

- manage one project’s members, identities, entitlements, runtime posture

Scope:

- project

4.4 Platform Admin

Intent:

- privileged platform control, policy, enrollment, audit, recovery

Scope:

- platform / fleet

4.5 Infra Operator

Intent:

- node readiness
- capacity health
- onboarding / decommission
- networking / fabric / image correctness

Scope:

- nodes, sites, capacity pools, fleet readiness

4.6 SRE / Ops Operator

Intent:

- incident detection
- failure correlation
- degraded workflow recovery
- observability-guided remediation

Scope:

- services, workflows, fleet signals, critical runtime health

4.7 App Developer / Integration Owner

Intent:

- app lifecycle
- artifacts
- integration contracts
- external system behavior and troubleshooting

Scope:

- project, app, or integration boundary, depending on product maturity

Note:

- this may partially overlap with SRE today, but should not be forced into the same IA bucket without an explicit decision.

4.8 Persona-To-Landing-Page Table

This table is the minimum shell contract for the mock pass.

| Persona / mode | Default landing route | Top 3 surfaces | Mode entry |
| --- | --- | --- | --- |
| User | `/v3-prod/workloads` when workloads exist, otherwise `/v3-prod/compute` | Workloads, Compute, Storage | top-bar mode switcher or default session mode |
| Tenant Admin | `/v3-prod/tenant/overview` | Access, Billing/usage posture, Storage posture | top-bar mode switcher |
| Project Admin | `/v3-prod/project/overview` | Access, Workloads, Storage | top-bar mode switcher |
| Platform Admin | `/v3-prod/platform/overview` | Fleet, Governance, Finance | top-bar mode switcher |
| Ops | `/v3-prod/platform/ops` | Ops, Telemetry, Fleet | top-bar mode switcher |

Notes:

- tenant-admin and project-admin home surfaces now exist in /v3-prod as first-cutover production surfaces backed by existing read models;
- richer dedicated tenant/project read models are still expected before these landings become final governance dashboards.
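
The landing-route column of the table can be encoded directly as a resolver. The routes below are copied from the table; the `ShellMode` identifiers, the function name, and the `hasWorkloads` flag are hypothetical names chosen for this sketch.

```typescript
type ShellMode = "user" | "tenant_admin" | "project_admin" | "platform_admin" | "ops";

// Returns the default landing route for a mode, following the table above.
function defaultLandingRoute(mode: ShellMode, hasWorkloads: boolean): string {
  switch (mode) {
    case "user":
      // Users land on the workbench only when they have something running.
      return hasWorkloads ? "/v3-prod/workloads" : "/v3-prod/compute";
    case "tenant_admin":
      return "/v3-prod/tenant/overview";
    case "project_admin":
      return "/v3-prod/project/overview";
    case "platform_admin":
      return "/v3-prod/platform/overview";
    case "ops":
      return "/v3-prod/platform/ops";
  }
}
```

The resolver only picks a destination. Whether the user may actually open that route remains a backend permission check, as stated in 0.4.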

5. Intent Model

The same role may operate in different intents. Top-level IA should support these rather than hiding them inside entity tables.

Primary intents:

- provision
- manage
- monitor
- recover
- govern
- integrate
- investigate

Design implication:

- pages must be built around the question the actor is trying to answer, not just around the record being displayed.

6. Scope Model

Scope is the simplest way to keep similar verbs from collapsing into one messy UI.

Scopes:

- self
- project
- tenant
- node
- fleet
- platform
- public

Examples:

- users provision at project/self scope;
- project admins manage access at project scope;
- tenant admins govern membership and spend at tenant scope;
- infra operators manage node readiness at node/fleet scope;
- platform admins govern policy at platform scope.

7. Control Planes

The redesign should explicitly separate these control planes.

7.1 Workloads

Focus:

- allocations
- app runtimes
- workload lifecycle
- user-facing and operator-facing runtime views

7.1a Storage

Focus:

- buckets
- workload mounts and attachments
- datasets
- checkpoints
- artifacts
- object lifecycle and retention
- encryption and storage access drift

7.2 Fleet / Infrastructure

Focus:

- nodes
- slot readiness
- networking
- fabric
- images
- enrollment and decommission

7.3 Operations

Focus:

- incidents
- stuck workflows
- degraded services
- recovery actions
- runbooks

7.4 Governance

Focus:

- users
- roles
- projects
- tenant/platform policies
- audit

7.5 Financial

Focus:

- spend
- accounting
- budgets
- refunds
- payment operations
- usage attribution

7.6 Developer / Integration

Focus:

- docs
- SDKs
- APIs
- artifact publishing
- integration contracts

8. Admin vs Ops vs Telemetry

This boundary needs to be explicit.

8.1 Admin

Admin is for:

- governance
- privileged control
- durable platform configuration
- enrollment/lifecycle authority
- audit and finance controls

Admin should answer:

- what is the correct platform state?
- who can do what?
- what policy or lifecycle action is safe?

8.2 Ops

Ops is for:

- live intervention
- incident handling
- degraded workflows
- recovery actions

Ops should answer:

- what needs action now?
- what is broken or risky?
- what is the next safe recovery step?

8.3 Fleet Telemetry

Fleet telemetry is for:

- aggregate observability
- trends
- hotspots
- saturation
- cross-node evidence

Fleet telemetry should answer:

- where is pressure or degradation emerging across the fleet?
- what signals correlate with current incidents?

Rules:

- telemetry is not where primary lifecycle mutations should live;
- admin is not where cross-fleet observability should be dumped by default;
- ops may link into both admin and telemetry, but should remain action-oriented.

9. Infra-Ops vs SRE-Ops

“Ops” is not one homogeneous audience.

9.1 Infra / Capacity Intent

Questions:

- are nodes correctly provisioned?
- are slice slots schedulable?
- is networking/fabric/storage ready?
- is image/bootstrap state consistent?

Primary surfaces:

- fleet / nodes
- onboarding
- slot readiness
- image and provisioning views

9.2 SRE / Live Operations Intent

Questions:

- what is degraded right now?
- which service or workflow is failing?
- how do I correlate and recover?

Primary surfaces:

- ops overview
- attention views
- workflow health
- telemetry
- runbooks

Design implication:

- do not assume one single ops landing page can serve both intents equally well.

10. IAM Growth Model

IAM is already multi-layered and should be designed as such.

Layers:

- platform IAM
- tenant IAM
- project IAM
- user identity and credentials
- service accounts
- app entitlements
- future external integration identities

Design implications:

- IAM should not be treated as one simple admin table;
- platform and scoped IAM must remain clearly separated;
- future growth should not require a top-level IA rewrite.

11. Billing Growth Model

Billing is bigger than current allocation accounting.

Current product reality:

- allocation accounting
- balances
- payment sessions

Expected growth:

- tenant/project spend views
- budgets and quota posture
- credits/refunds
- invoices/financial evidence
- usage attribution across products and runtimes
- policy-driven financial controls

Design implications:

- billing should be treated as its own durable control plane;
- do not bury it as a small detail inside one or two admin pages.

12. Dense Page Decomposition Rules

Some current admin pages contain enough material for several pages.

That is a signal to split by function, not add more sections to the same page.

When a page mixes:

- summary
- operations
- diagnostics
- history
- raw metadata

it should be decomposed into one or more of:

- Overview
    - summary
    - key state
    - top risks
    - primary actions
- Operations
    - live controls
    - intervention actions
    - workbench/table
- Metrics / Diagnostics
    - charts
    - telemetry
    - linked debug tools such as Netdata
- History / Audit
    - lifecycle timeline
    - audit trail
    - financial or provisioning evidence
- Advanced / Raw
    - raw IDs
    - low-frequency technical detail
    - copy/debug material

Rules:

- list pages should not become detail pages;
- detail pages should not default to raw metadata dumps;
- advanced/debug information should be available, but not dominant.
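
As a rough illustration of the decomposition rule, the sketch below tags each section of a dense page with one of the five content kinds named above and groups the sections into the five target views. The type names, the function name, and the idea of tagging sections by hand are assumptions made for this example.

```typescript
type SectionKind = "summary" | "operations" | "diagnostics" | "history" | "raw_metadata";

type PageTab =
  | "Overview"
  | "Operations"
  | "Metrics / Diagnostics"
  | "History / Audit"
  | "Advanced / Raw";

// Each content kind has exactly one owning view.
const TAB_FOR_KIND: Record<SectionKind, PageTab> = {
  summary: "Overview",
  operations: "Operations",
  diagnostics: "Metrics / Diagnostics",
  history: "History / Audit",
  raw_metadata: "Advanced / Raw",
};

interface PageSection {
  title: string;
  kind: SectionKind;
}

// Groups sections by owning view, in a fixed order, and drops views with no content.
function decomposePage(sections: readonly PageSection[]): { tab: PageTab; sections: string[] }[] {
  const order: PageTab[] = [
    "Overview",
    "Operations",
    "Metrics / Diagnostics",
    "History / Audit",
    "Advanced / Raw",
  ];
  return order
    .map((tab) => ({
      tab,
      sections: sections.filter((s) => TAB_FOR_KIND[s.kind] === tab).map((s) => s.title),
    }))
    .filter((group) => group.sections.length > 0);
}
```

The fixed order keeps Overview first and Advanced / Raw last, which matches the rule that debug material should be available but not dominant.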

13. Full-Shell Validation

The redesign should be validated at the full-shell level, not page by page.

Before broad UI implementation, produce a mock showing:

- full left navigation;
- top-level groups;
- local navigation inside each major surface;
- handoff points between admin, ops, telemetry, and developer surfaces.

Questions the shell must answer:

- where does a tenant admin start?
- where does an infra operator start?
- where does an SRE start?
- where does a developer go for docs and integration material?
- which duplicated current pages collapse into one owner?

14. Proposed Review Checklist

Use this checklist when reviewing the next IA mock or redesign deck.

- Are public/developer, user workspace, scoped administration, and platform operations clearly separated?
- Are admin, ops, and telemetry distinct in purpose?
- Are infra and SRE intents distinguishable?
- Are tenant admin and project admin first-class in the model?
- Does IAM account for platform/tenant/project/service-account growth?
- Does billing account for future financial scope beyond allocation accounting?
- Are dense pages explicitly decomposed instead of “cleaned up in place”?
- Can the left nav be explained in terms of user goals, not route dumps?
- Do shared entities keep different framing where scope and risk differ?
- Does the design leave room for app developer / integration-owner surfaces?

15. Immediate Next Step

Use this document with the v3 mock work to produce:

- one full-shell IA mock;
- one boundary map for admin vs ops vs telemetry;
- one decomposition plan for the densest existing admin pages.

That should happen before broad admin page implementation resumes.

16. Additional Current Inputs To Carry Into Redesign

These are active product and operator pain points that should inform the next IA/mock pass.

16.1 Brokered identity continuity

Observed problem:

- repeated Hugging Face logins can create what appears to be a new user/login footprint each time when testing with fresh sessions/incognito.

Design implication:

- identity UX must distinguish:
    - first-time broker signup,
    - returning login,
    - account linking,
    - duplicate-identity conflict resolution.

This is not only an auth/backend problem. It also affects:

- how users understand identity continuity,
- whether account linking needs UI,
- how admin/IAM surfaces explain brokered identities.

16.2 Audit log surface parity

Observed problem:

- user audit logs and admin audit logs do not currently present with the same product/page grammar.

Design implication:

- audit/evidence surfaces should share the same v3 family model where possible, even when scope differs.
- scope can differ; page language and evidence framing should not drift without reason.

16.3 Inline dependency creation inside launch flows

Observed problem:

- launch flows can discover missing prerequisites such as SSH keys.
- upcoming launch inputs will likely include storage, network, firewall, and similar dependencies.

Design implication:

- wizards should not eject users into unrelated pages to satisfy missing prerequisites.
- dependency creation/selection should be inline or in-context wherever safe.

This applies to:

- compute launch wizard;
- app launch wizard;
- future storage/network/firewall steps.

16.4 App launch wizard parity

Observed problem:

- app launch flows need the same v3 system treatment as compute launch.

Design implication:

- compute and app launch should be treated as one shared wizard system with mode-specific branches, not separate design languages.

16.5 Tenant-admin quota model

Open question:

- can tenant admins set quotas for projects and/or users, and how should that authority be scoped?

Design implication:

- quota UX cannot remain only a platform-admin configuration surface if scoped delegation is intended.
- tenant/project administration boundaries must explicitly include quota governance decisions.

16.6 Billing as a separate redesign epic

Billing requires a separate effort and should not be treated as a small sidebar to current allocation accounting.

Known future billing scope includes:

- pricing modes such as on-demand, spot, and reserved duration;
- idle policy selection at launch;
- invoice generation and payment timing;
- support for usage models beyond simple allocation duration;
- tenant budgets and alerts;
- usage attribution by user, SKU, and app;
- data ingress/egress accounting;
- standard delinquency handling.

Design implication:

- billing remains a first-class control plane and should be reviewed as its own epic with its own UX and domain model pass.

17. External Review Inputs Relevant To UI

This section captures the UI/product-relevant points from doc/governance/External_Architectural_Review_2026-04.md so the redesign can use them directly without mixing in the broader backend action list.

17.1 Reinforced design directions

The external review reinforces these redesign choices:

- admin should move away from entity and action dumps toward workflow-first operator surfaces;
- the app platform is one of the strongest and most original parts of the product, so the product IA should expose it more clearly instead of treating it as a sidecar to infrastructure;
- developer/docs/download/API-reference surfaces deserve explicit treatment as a real product surface, not an afterthought inside internal navigation;
- audit, evidence, and investigation flows are important enough to be their own deliberate page family;
- billing is too large to remain a small subpage under current product assumptions and should remain a separate redesign/epic.

17.2 UI-specific tensions to account for

The review also highlights tensions that the redesign should consciously handle:

- the product is infrastructure-shaped in some places and workload/app-platform shaped in others;
- app/runtime workflows are more valuable than the current information hierarchy suggests;
- public/developer-facing extension and integration stories are under-expressed in the current product surface;
- operational evidence and debug flows matter, but should not dominate default navigation for normal user workflows.

17.3 Explicitly deferred to post-UI backend work

The external review also names important backend and architecture issues, but they should be handled after the UI/IA review rather than folded into the current redesign pass. Examples:

- cross-cutting middleware enforcement gaps;
- idempotency and optimistic locking;
- DLQ recovery;
- metering producers;
- terminal compliance/security follow-up;
- large-file/backend decomposition work such as cmd/api/routes.go.

These should inform later implementation planning, but they are not blockers for completing the IA/mock review package first.