
PRD distilled

Contract

Source: doc/product/PRD.md v0.3 (265 lines) · MVP-aligned · acceptance criteria per FR

This page is a fact-faithful distillation of the PRD. For the canonical source, read PRD.md.

Vision

Provide a secure self-service GPU platform where users can discover capacity, provision compute, access nodes, monitor usage, and pay based on consumption.

Product goals

  • Fast time-to-compute: provision in minutes.
  • Transparent usage and billing.
  • Safe multi-user operations with role-based controls.
  • Operator-friendly admin surface for inventory and accounts.

Personas

| Persona | Role |
| --- | --- |
| End User | Provisions and operates GPU nodes |
| Admin | Manages users, balances, and node inventory |
| Billing Operator | Monitors charges, top-ups, and reconciliation |

In scope (MVP rebuild)

| Capability | Status |
| --- | --- |
| OIDC-based auth and role/tenant-aware authorization | Implemented |
| SKU catalog and node inventory | Implemented |
| Allocation lifecycle: provision, active, release | Implemented |
| Browser terminal session to active allocation | Implemented |
| Secure SSH key retrieval for active allocation | Implemented |
| Usage metering and periodic rating | Implemented |
| Balance warnings and depleted enforcement | Implemented |
| Stripe checkout top-up and webhook processing | Implemented |
| Admin user/node management | Implemented |
| Admin allocation/audit/payment-session visibility | Implemented |
| Admin operations telemetry overview | Implemented |
| Object-storage-backed user storage | Implemented |
| Audit logging | Implemented |

Out of scope (MVP)

| Capability | Stance |
| --- | --- |
| Managed scheduler products (SLURM/k8s/Ray) as real backend features | Reference adapters exist in code for first-slice testing only |
| Enterprise invoicing/subscriptions/commit contracts | Deferred |
| Full multi-tenant org hierarchy UX | Schema/policy readiness still required |
| Multi-region active/active runtime | Out of MVP |
| User-managed API keys for programmatic auth | Deferred; middleware remains pluggable |

Functional requirements (verbatim)

The PRD defines 12 functional requirements. Summary:

| ID | Title | Code home |
| --- | --- | --- |
| FR-1 | Authentication and Session | packages/services/auth/ + cmd/api middleware |
| FR-2 | SKU Catalog and Inventory | packages/services/inventory/ |
| FR-3 | Provisioning Lifecycle | packages/services/provisioning/ + cmd/provisioning-worker/ |
| FR-4 | Runtime Access | cmd/terminal-gateway + packages/services/releases/ |
| FR-5 | Release Lifecycle | packages/services/provisioning/ |
| FR-6 | Usage and Billing | packages/services/billing/ + cmd/billing-worker/ |
| FR-7 | Payments (Stripe + refund hybrid) | packages/services/payments/ + cmd/webhook-worker/ |
| FR-8 | Admin Operations | packages/services/admin/ |
| FR-9 | Storage Access | packages/services/storage/ |
| FR-10 | Abuse Protection and Rate Limiting | packages/shared/middleware/ (Redis-backed) |
| FR-11 | Audit Logging | packages/services/admin/ + audit_logs table |
| FR-12 | Operations Observability Surface | cmd/api admin routes + Grafana |

Allocation state machine (required)

stateDiagram-v2
    [*] --> requested
    requested --> provisioning
    provisioning --> active
    provisioning --> failed
    active --> releasing
    releasing --> released
    releasing --> release_failed
    release_failed --> releasing: retry / admin force
    failed --> [*]
    released --> [*]

Rules:

  • A resource can have at most one active allocation.
  • Each user has a concurrent allocation limit, allocation.max_concurrent_per_user, which defaults to 2 and is configurable.
  • release_failed means cleanup retries are exhausted and billing has stopped; an admin or user retry path must remain available.
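
The diagram and rules above can be expressed as a small transition table. The following is a minimal sketch, not the platform's implementation: the names AllocationState, can_transition, and can_request_allocation are hypothetical, and the choice of which states count toward the per-user limit (every state except failed and released) is an assumption the PRD summary does not spell out.

```python
from enum import Enum


class AllocationState(str, Enum):
    REQUESTED = "requested"
    PROVISIONING = "provisioning"
    ACTIVE = "active"
    FAILED = "failed"
    RELEASING = "releasing"
    RELEASED = "released"
    RELEASE_FAILED = "release_failed"


# Edges copied from the state diagram; terminal states map to an empty set.
TRANSITIONS = {
    AllocationState.REQUESTED: {AllocationState.PROVISIONING},
    AllocationState.PROVISIONING: {AllocationState.ACTIVE, AllocationState.FAILED},
    AllocationState.ACTIVE: {AllocationState.RELEASING},
    AllocationState.RELEASING: {
        AllocationState.RELEASED,
        AllocationState.RELEASE_FAILED,
    },
    # Retry or admin force sends a failed release back to releasing.
    AllocationState.RELEASE_FAILED: {AllocationState.RELEASING},
    AllocationState.FAILED: set(),
    AllocationState.RELEASED: set(),
}

# Assumption: only these terminal states stop counting toward the user limit.
TERMINAL_STATES = {AllocationState.FAILED, AllocationState.RELEASED}


def can_transition(current, target):
    """Return True if the diagram contains an edge current -> target."""
    return target in TRANSITIONS[current]


def can_request_allocation(user_allocation_states, max_concurrent_per_user=2):
    """Check a user's existing allocations against the concurrency limit."""
    in_flight = [s for s in user_allocation_states if s not in TERMINAL_STATES]
    return len(in_flight) < max_concurrent_per_user
```

Keeping the edges in one table means a rejected transition (for example active directly to released) is caught by a lookup rather than by scattered conditionals.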

Billing state machine (required)

stateDiagram-v2
    [*] --> healthy
    healthy --> low_balance: balance ≤ threshold
    low_balance --> auto_release_pending: projected depletion in window
    low_balance --> depleted: balance ≤ 0
    auto_release_pending --> depleted
    depleted --> healthy: top-up
    low_balance --> healthy: top-up
    auto_release_pending --> healthy: top-up

Recovery: after top-up, user must manually reprovision (allocations do not auto-restart by default).

First-run policy: a user with zero balance and no allocations is routed to the billing page with an onboarding CTA.
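
One way to read the billing diagram is as a classification of the current balance. The sketch below is illustrative only: classify_billing_state and projected_window_charge_minor are hypothetical names, and interpreting "projected depletion in window" as "the charge expected over the next billing window would bring the balance to zero or below" is an assumption. Amounts are integers in minor units, matching the `_minor` suffix on the policy keys.

```python
def classify_billing_state(
    balance_minor,
    low_balance_threshold_minor,
    projected_window_charge_minor,
):
    """Map a balance to one of the four billing states in the diagram."""
    if balance_minor <= 0:
        return "depleted"
    if balance_minor > low_balance_threshold_minor:
        return "healthy"
    # At or below the threshold: check whether the next window would deplete it.
    if balance_minor - projected_window_charge_minor <= 0:
        return "auto_release_pending"
    return "low_balance"
```

Because this version is stateless, a top-up returns the account to healthy only when it lifts the balance above the threshold; the diagram's unconditional "top-up" edges would need an explicit event-driven transition instead.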

Policy configuration model (mandatory)

All operational policy values are configuration-driven (DB-backed), not hardcoded constants.

Required capabilities:

  • Scoped policy resolution: global → plan → org → user (or narrower).
  • Validated bounds on every policy key (min, max, allowed enum).
  • Effective-at support for controlled rollouts.
  • Full audit trail (who, what, before, after, when, reason).
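
The scoped resolution, bounds validation, and effective-at capabilities listed above could be combined roughly as follows. This is a sketch under stated assumptions: PolicyEntry, Bounds, validate, and resolve are hypothetical names, values are limited to integers, entries live in an in-memory list rather than the database, and ties within a scope are broken by the most recent effective_at.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Broadest to narrowest; a narrower scope overrides a broader one.
SCOPE_ORDER = ("global", "plan", "org", "user")


@dataclass(frozen=True)
class PolicyEntry:
    key: str
    scope: str                # one of SCOPE_ORDER
    scope_id: Optional[str]   # None for the global scope
    value: int
    effective_at: datetime


@dataclass(frozen=True)
class Bounds:
    minimum: int
    maximum: int


def validate(entry, bounds):
    """Reject a value outside the configured bounds for its key."""
    if not bounds.minimum <= entry.value <= bounds.maximum:
        raise ValueError(f"{entry.key}={entry.value} outside {bounds}")


def resolve(entries, key, subject, now):
    """Return the effective value for key, or None if nothing applies.

    subject maps scope names to ids, e.g. {"plan": "pro", "user": "u1"}.
    """
    best = None
    for entry in entries:
        if entry.key != key or entry.effective_at > now:
            continue  # wrong key, or not yet in effect
        if entry.scope != "global" and subject.get(entry.scope) != entry.scope_id:
            continue  # scoped to a different plan, org, or user
        rank = (SCOPE_ORDER.index(entry.scope), entry.effective_at)
        if best is None or rank > best[0]:
            best = (rank, entry)
    return best[1].value if best else None
```

For example, a global default of 2 for allocation.max_concurrent_per_user with a user-level override of 5 resolves to 5 for that user and 2 for everyone else, and an override dated in the future is ignored until its effective_at passes.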

Initial policy keys (defaults configurable):

  • rate_limit.api_requests_per_minute
  • rate_limit.terminal_token_requests_per_minute
  • rate_limit.financial_requests_per_minute
  • rate_limit.admin_overview_requests_per_minute
  • allocation.max_concurrent_per_user
  • billing.window_seconds
  • billing.low_balance_threshold_minor
  • allocation.refund_window_days
  • billing.minimum_deposit_minor
  • billing.maximum_deposit_minor
  • notification.low_balance_enabled
  • notification.balance_depleted_enabled

→ Full list with defaults: Policy keys reference.

Non-functional requirements

| Property | Mechanism |
| --- | --- |
| Security | Token handling, audited privileged actions, secret management, abuse protection |
| Reliability | Idempotent webhook/provisioning/billing flows; retries; DLQ |
| Durability | Transactional DB; immutable ledger |
| Performance | Responsive APIs; enforced pagination |
| Observability | Traces, structured logs, metrics, alerting |
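
To illustrate how the Reliability and Durability rows fit together, here is a minimal sketch of idempotent top-up handling over an append-only ledger. It is an assumption-laden illustration, not the PRD's design: the Ledger class, apply_top_up, and the event_id parameter are hypothetical, and an in-memory set stands in for what would normally be a unique constraint checked in the same database transaction as the ledger insert.

```python
class Ledger:
    """Append-only ledger that ignores redelivered webhook events."""

    def __init__(self):
        self._entries = []            # rows are only ever appended
        self._seen_event_ids = set()  # stand-in for a unique DB constraint

    def apply_top_up(self, event_id, user_id, amount_minor):
        """Record a top-up once; return False if the event was already applied."""
        if event_id in self._seen_event_ids:
            return False
        self._seen_event_ids.add(event_id)
        self._entries.append((event_id, user_id, amount_minor))
        return True

    def balance(self, user_id):
        """Derive the balance by summing entries instead of mutating a total."""
        return sum(amount for _, uid, amount in self._entries if uid == user_id)
```

With this shape, a webhook delivered twice with the same event id credits the user once, and the balance can always be recomputed from the ledger rows.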

Phase-2 readiness constraints

The PRD pins these as mandatory in MVP design even though they aren't MVP features:

  1. Managed Schedulers — allocation model supports pluggable execution backends; API remains scheduler-agnostic.
  2. Enterprise Billing — ledger model extensible for invoices, subscriptions, commitments.
  3. Multi-Tenant Hierarchy and Policy — core entities tenant-aware (org_id, optional project_id).
  4. Multi-Region — region first-class in placement and resource identity.

No-rework acceptance criteria:

  • Adding a scheduler backend does not require breaking the allocation API.
  • Enabling org tenancy does not require a ledger redesign.
  • Introducing a second region does not require an identity rewrite.
  • Enterprise pricing is additive over the billing core.

Where to look next