PRD distilled
Contract
doc/product/PRD.md v0.3 (265 lines) · MVP-aligned · acceptance criteria per FR
This page is a fact-faithful distillation of the PRD. For the canonical source, read PRD.md.
Vision
Provide a secure self-service GPU platform where users can discover capacity, provision compute, access nodes, monitor usage, and pay based on consumption.
Product goals
- Fast time-to-compute: provision in minutes.
- Transparent usage and billing.
- Safe multi-user operations with role-based controls.
- Operator-friendly admin surface for inventory and accounts.
Personas
| Persona | Role |
|---|---|
| End User | Provisions and operates GPU nodes |
| Admin | Manages users, balances, and node inventory |
| Billing Operator | Monitors charges, top-ups, and reconciliation |
In scope (MVP rebuild)
| Capability | Status |
|---|---|
| OIDC-based auth and role/tenant-aware authorization | Implemented |
| SKU catalog and node inventory | Implemented |
| Allocation lifecycle: provision, active, release | Implemented |
| Browser terminal session to active allocation | Implemented |
| Secure SSH key retrieval for active allocation | Implemented |
| Usage metering and periodic rating | Implemented |
| Balance warnings and depleted enforcement | Implemented |
| Stripe checkout top-up and webhook processing | Implemented |
| Admin user/node management | Implemented |
| Admin allocation/audit/payment-session visibility | Implemented |
| Admin operations telemetry overview | Implemented |
| Object-storage-backed user storage | Implemented |
| Audit logging | Implemented |
Out of scope (MVP)
| Capability | Stance |
|---|---|
| Managed scheduler products (SLURM/k8s/Ray) as real backend features | Reference adapters exist in code for first-slice testing only |
| Enterprise invoicing/subscriptions/commit contracts | Deferred |
| Full multi-tenant org hierarchy UX | Schema/policy readiness still required |
| Multi-region active/active runtime | Out of MVP |
| User-managed API keys for programmatic auth | Deferred; middleware remains pluggable |
Functional requirements (verbatim)
The PRD defines 12 functional requirements. Summary:
| ID | Title | Code home |
|---|---|---|
| FR-1 | Authentication and Session | packages/services/auth/ + cmd/api middleware |
| FR-2 | SKU Catalog and Inventory | packages/services/inventory/ |
| FR-3 | Provisioning Lifecycle | packages/services/provisioning/ + cmd/provisioning-worker/ |
| FR-4 | Runtime Access | cmd/terminal-gateway + packages/services/releases/ |
| FR-5 | Release Lifecycle | packages/services/provisioning/ |
| FR-6 | Usage and Billing | packages/services/billing/ + cmd/billing-worker/ |
| FR-7 | Payments (Stripe + refund hybrid) | packages/services/payments/ + cmd/webhook-worker/ |
| FR-8 | Admin Operations | packages/services/admin/ |
| FR-9 | Storage Access | packages/services/storage/ |
| FR-10 | Abuse Protection and Rate Limiting | packages/shared/middleware/ (Redis-backed) |
| FR-11 | Audit Logging | packages/services/admin/ + audit_logs table |
| FR-12 | Operations Observability Surface | cmd/api admin routes + Grafana |
Allocation state machine (required)
stateDiagram-v2
[*] --> requested
requested --> provisioning
provisioning --> active
provisioning --> failed
active --> releasing
releasing --> released
releasing --> release_failed
release_failed --> releasing: retry / admin force
failed --> [*]
released --> [*]
Rules:
- A resource can have at most one active allocation.
- User concurrent allocation limit: default allocation.max_concurrent_per_user = 2 (configurable).
- release_failed means cleanup retries are exhausted and billing has stopped; an admin/user retry path must remain available.
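The diagram and rules above can be expressed as a transition table. The following is a minimal sketch, not the PRD's implementation: the names `AllocationState`, `ALLOWED_TRANSITIONS`, `can_transition`, and `can_request_allocation` are hypothetical, and the concurrency check assumes that every non-terminal allocation counts toward the per-user limit, which the PRD summary does not spell out.

```python
from enum import Enum


class AllocationState(str, Enum):
    REQUESTED = "requested"
    PROVISIONING = "provisioning"
    ACTIVE = "active"
    FAILED = "failed"
    RELEASING = "releasing"
    RELEASED = "released"
    RELEASE_FAILED = "release_failed"


# Edges copied from the state diagram; failed and released are terminal.
ALLOWED_TRANSITIONS = {
    AllocationState.REQUESTED: {AllocationState.PROVISIONING},
    AllocationState.PROVISIONING: {AllocationState.ACTIVE, AllocationState.FAILED},
    AllocationState.ACTIVE: {AllocationState.RELEASING},
    AllocationState.RELEASING: {
        AllocationState.RELEASED,
        AllocationState.RELEASE_FAILED,
    },
    # Retry or admin force moves a failed release back into releasing.
    AllocationState.RELEASE_FAILED: {AllocationState.RELEASING},
    AllocationState.FAILED: set(),
    AllocationState.RELEASED: set(),
}

TERMINAL_STATES = {AllocationState.FAILED, AllocationState.RELEASED}


def can_transition(current, target):
    """Return True if the diagram has an edge from current to target."""
    return target in ALLOWED_TRANSITIONS[current]


def can_request_allocation(user_states, max_concurrent_per_user=2):
    """Assumption: any allocation not in a terminal state counts as concurrent."""
    open_count = sum(1 for state in user_states if state not in TERMINAL_STATES)
    return open_count < max_concurrent_per_user
```

Keeping the edges in one table means a new state or edge is a data change, and any transition not listed is rejected by default.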
Billing state machine (required)
stateDiagram-v2
[*] --> healthy
healthy --> low_balance: balance ≤ threshold
low_balance --> auto_release_pending: projected depletion in window
low_balance --> depleted: balance ≤ 0
auto_release_pending --> depleted
depleted --> healthy: top-up
low_balance --> healthy: top-up
auto_release_pending --> healthy: top-up
Recovery: after top-up, user must manually reprovision (allocations do not auto-restart by default).
First-run policy: zero-balance + no-allocations user is routed to billing with onboarding CTA.
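As an illustration of the billing states above, the sketch below derives a state from the current balance. It is a simplification under stated assumptions: the function name `billing_state` and its parameters are hypothetical, the depletion projection is passed in as a precomputed boolean, and a top-up is assumed to return the account to healthy only if it lifts the balance above the threshold.

```python
def billing_state(balance_minor, low_balance_threshold_minor,
                  depletion_projected_in_window):
    """Map a balance (in minor currency units) to a billing state name.

    Assumption: state is derived from the balance alone, so a top-up that
    leaves the balance at or below the threshold yields low_balance.
    """
    if balance_minor <= 0:
        return "depleted"
    if balance_minor <= low_balance_threshold_minor:
        if depletion_projected_in_window:
            return "auto_release_pending"
        return "low_balance"
    return "healthy"
```

For example, with a hypothetical threshold of 500 minor units, `billing_state(300, 500, True)` yields `auto_release_pending`, and a top-up to 5000 yields `healthy`. Reprovisioning after recovery remains a manual user action, as stated above.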
Policy configuration model (mandatory)
All operational policy values are configuration-driven (DB-backed), not hardcoded constants.
Required capabilities:
- Scoped policy resolution: global → plan → org → user (or narrower).
- Validated bounds on every policy key (min, max, allowed enum).
- Effective-at support for controlled rollouts.
- Full audit trail (who, what, before, after, when, reason).
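Scoped resolution with validated bounds might look like the sketch below. It assumes that the narrowest scope with a value wins, which is one plausible reading of the global → plan → org → user chain. The names `SCOPE_ORDER`, `PolicyBounds`, and `resolve_policy` are hypothetical, the bounds in the usage example are invented for illustration, and effective-at handling and audit logging are left out.

```python
from dataclasses import dataclass

# Broadest to narrowest, following the resolution chain in the list above.
SCOPE_ORDER = ("global", "plan", "org", "user")


@dataclass(frozen=True)
class PolicyBounds:
    minimum: int
    maximum: int


def resolve_policy(key, overrides, bounds):
    """Return the narrowest-scope value for key after checking its bounds.

    overrides maps a scope name to a dict of {policy key: value}.
    Assumption: a value set at a narrower scope replaces a broader one.
    """
    value = None
    for scope in SCOPE_ORDER:
        scoped_values = overrides.get(scope, {})
        if key in scoped_values:
            value = scoped_values[key]
    if value is None:
        raise KeyError(f"no value configured for policy key {key!r}")
    limit = bounds[key]
    if not limit.minimum <= value <= limit.maximum:
        raise ValueError(f"{key}={value} is outside [{limit.minimum}, {limit.maximum}]")
    return value
```

A usage example with illustrative values:

```python
key = "allocation.max_concurrent_per_user"
overrides = {"global": {key: 2}, "org": {key: 4}}
bounds = {key: PolicyBounds(minimum=1, maximum=10)}
effective = resolve_policy(key, overrides, bounds)  # the org override applies
```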
Initial policy keys (defaults configurable):
- rate_limit.api_requests_per_minute
- rate_limit.terminal_token_requests_per_minute
- rate_limit.financial_requests_per_minute
- rate_limit.admin_overview_requests_per_minute
- allocation.max_concurrent_per_user
- allocation.refund_window_days
- billing.window_seconds
- billing.low_balance_threshold_minor
- billing.minimum_deposit_minor
- billing.maximum_deposit_minor
- notification.low_balance_enabled
- notification.balance_depleted_enabled
→ Full list with defaults: Policy keys reference.
Non-functional requirements
| Property | Mechanism |
|---|---|
| Security | Token handling, audited privileged actions, secret management, abuse protection |
| Reliability | Idempotent webhook/provisioning/billing flows; retries; DLQ |
| Durability | Transactional DB; immutable ledger |
| Performance | Responsive APIs; enforced pagination |
| Observability | Traces, structured logs, metrics, alerting |
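To make the idempotency and immutable-ledger rows concrete, here is a minimal in-memory sketch of a top-up handler that applies each webhook event at most once. The `Ledger` class and its methods are hypothetical stand-ins. A real system would presumably enforce the same rule with a unique constraint on the event ID inside the database transaction that writes the ledger entry.

```python
class Ledger:
    """In-memory stand-in for an append-only ledger (hypothetical)."""

    def __init__(self):
        self._entries = []            # appended to, never edited or removed
        self._seen_event_ids = set()  # dedupe keys for processed events

    def apply_top_up(self, event_id, user_id, amount_minor):
        """Append a credit once per event_id; return False on a replay."""
        if event_id in self._seen_event_ids:
            return False
        self._seen_event_ids.add(event_id)
        self._entries.append((event_id, user_id, amount_minor))
        return True

    def balance(self, user_id):
        """Sum all credits recorded for user_id, in minor currency units."""
        return sum(amount for _, uid, amount in self._entries if uid == user_id)
```

Because a replayed event is a no-op, retries and redelivered webhooks cannot double-credit a balance.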
Phase-2 readiness constraints
The PRD pins these as mandatory in MVP design even though they aren't MVP features:
- Managed Schedulers: the allocation model supports pluggable execution backends; the API remains scheduler-agnostic.
- Enterprise Billing: the ledger model is extensible for invoices, subscriptions, and commitments.
- Multi-Tenant Hierarchy and Policy: core entities are tenant-aware (org_id, optional project_id).
- Multi-Region: region is first-class in placement and resource identity.
No-rework acceptance criteria:
- Scheduler backend addition does not require breaking allocation API.
- Enabling org tenancy does not require ledger redesign.
- Second-region introduction does not require identity rewrite.
- Enterprise pricing is additive over billing core.