PRD v0.3 - Core42 AI Cloud Compute Platform
1. Document Intent
This PRD converts prototype learning into a production-oriented, API-first product baseline.
Assumption control:
- Cross-cutting product/architecture assumptions are tracked in doc/governance/Assumptions_Register.md and must be updated with PRD-affecting changes.
2. Product Vision
Provide a secure self-service GPU platform where users can discover capacity, provision compute, access nodes, monitor usage, and pay based on consumption.
3. Product Goals
- Fast time-to-compute: provision in minutes.
- Transparent usage and billing.
- Safe multi-user operations with role-based controls.
- Operator-friendly admin surface for inventory and accounts.
4. Personas
- End User: provisions and operates GPU nodes.
- Admin: manages users, balances, and node inventory.
- Billing Operator: monitors charges, top-ups, and reconciliation.
5. In Scope (MVP Rebuild)
- OIDC-based auth and role/tenant-aware authorization.
- SKU catalog and node inventory.
- Allocation lifecycle: provision, active, release.
- Browser terminal session to active allocation.
- Secure SSH key retrieval for active allocation.
- Usage metering and periodic rating.
- Balance warnings and depleted enforcement.
- Stripe checkout top-up and webhook processing.
- Admin user/node management.
- Admin allocation/audit/payment-session operational visibility.
- Admin operations telemetry overview (health, queue depth, throughput, error rates).
- Object-storage-backed user storage operations.
- Audit logging for privileged and financial actions.
6. Out of Scope (MVP)
- Managed scheduler products (SLURM/k8s/Ray) as real backend features.
- Enterprise invoicing/subscriptions/commit contracts.
- Full multi-tenant org hierarchy UX (schema/policy readiness still required).
- Multi-region active/active runtime.
- User-managed API keys for programmatic auth (deferred; middleware remains pluggable for future key resolver).
7. Functional Requirements
FR-1 Authentication and Session
- Users authenticate via OIDC-compatible provider.
- API accepts short-lived access tokens and enforces server-side authz.
- Protected APIs reject missing/invalid tokens.
Acceptance criteria:
- Invalid tokens return unauthorized.
- Admin-only routes enforce role.
- Tenant-scoped routes enforce tenant policy.
FR-2 SKU Catalog and Inventory
- System exposes SKU catalog and availability.
- User-facing inventory view excludes infrastructure connection secrets.
Acceptance criteria:
- Free capacity reflects only online and unassigned resources.
- User node list does not expose admin-only connection coordinates.
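As an illustration of the two FR-2 acceptance criteria, the following Python sketch counts free capacity from online, unassigned nodes and projects a node into a user-facing shape. The `Node` fields, including `ssh_host` and `ssh_port` as stand-ins for admin-only connection coordinates, are assumptions made for the example and are not part of this PRD's data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: str
    sku: str
    online: bool
    assigned: bool
    ssh_host: str  # hypothetical admin-only connection coordinate
    ssh_port: int  # hypothetical admin-only connection coordinate


def free_capacity(nodes):
    """Count free nodes per SKU; only online and unassigned nodes qualify."""
    counts = {}
    for node in nodes:
        if node.online and not node.assigned:
            counts[node.sku] = counts.get(node.sku, 0) + 1
    return counts


def user_view(node):
    """Build the user-facing record from an explicit allow-list of fields."""
    return {"node_id": node.node_id, "sku": node.sku, "online": node.online}
```

Building the user view from an allow-list, rather than deleting secret fields from a full record, means a newly added admin-only field is not exposed by default.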
FR-3 Provisioning Lifecycle
- User requests allocation creation.
- Prechecks enforce availability, policy, and funding constraints.
- System provisions runtime access and records allocation + usage start.
Acceptance criteria:
- Provision failures return explicit machine-readable reason.
- Allocation status transitions are visible: requested, provisioning, active, releasing, released, failed, release_failed.
FR-4 Runtime Access
- User can open terminal to active allocation.
- User can view metrics for active allocation.
- User can retrieve access credentials without persistent server-side private-key storage.
Acceptance criteria:
- Access denied for non-owner/non-admin.
- Key retrieval does not rely on query-token auth.
- Production path does not require long-lived storage of user SSH private key material in control-plane DB.
FR-5 Release Lifecycle
- User/admin can request release.
- System transitions allocation to releasing then released.
- System performs runtime cleanup and ends usage accounting.
Acceptance criteria:
- Released allocation is not billed further.
- Node/resource becomes available for next provisioning.
- release_failed is surfaced with retry path; billing is stopped while in release_failed.
FR-6 Usage and Billing
- Usage is rated as SKU rate x quantity x duration.
- Monetary values use minor units (integer) with explicit currency.
- Low-balance and depleted-balance policies enforced.
Acceptance criteria:
- Depleted users have active allocations force-released.
- Billing APIs include currency and minor units.
- Low-balance and projected depletion warnings are emitted before forced release when projection data is available.
FR-7 Payments
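The FR-6 rating rule (SKU rate x quantity x duration, in integer minor units with explicit currency) can be sketched as follows. The hourly rate unit, the `Money` type, and the round-half-up rule are assumptions for illustration; the PRD does not fix a rate unit or a rounding policy.

```python
from dataclasses import dataclass

SECONDS_PER_HOUR = 3600


@dataclass(frozen=True)
class Money:
    amount_minor: int  # integer minor units, e.g. cents
    currency: str      # explicit currency code, e.g. "USD"


def rate_usage(rate_minor_per_hour: int, quantity: int,
               duration_seconds: int, currency: str) -> Money:
    """Charge = hourly rate x quantity x duration, using integer math only."""
    if rate_minor_per_hour < 0 or quantity < 0 or duration_seconds < 0:
        raise ValueError("rating inputs must be non-negative")
    numerator = rate_minor_per_hour * quantity * duration_seconds
    # Assumed rounding policy: round half up to the nearest minor unit.
    amount = (numerator + SECONDS_PER_HOUR // 2) // SECONDS_PER_HOUR
    return Money(amount, currency)
```

For example, `rate_usage(250, 2, 90, "USD")` rates two units at 250 minor units per hour for 90 seconds: 250 x 2 x 90 / 3600 = 12.5, which rounds to `Money(amount_minor=13, currency="USD")`. Keeping every intermediate value an integer avoids floating-point drift in the ledger.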
- User can initiate Stripe checkout top-up.
- Webhook processing is signature-verified and idempotent.
- Successful credits emit domain event for downstream consumers.
- Refund policy uses hybrid model:
- Provider refund allowed within configurable window refund_window_days.
- Outside window, refund request falls back to internal balance credit.
- Refundable amount must be constrained by configurable policy for unused/prepaid balance.
Acceptance criteria:
- Duplicate webhook does not double-credit.
- Payment-credit event available for notifications/billing read model.
- Refund outcome is explicit (provider_refund or internal_credit) and auditable.
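The duplicate-webhook criterion in FR-7 can be illustrated with a minimal in-memory sketch. It assumes signature verification has already succeeded upstream and covers only the idempotency step; the class name, the `payment.credited` event type, and the event payload shape are hypothetical.

```python
class BalanceLedger:
    """In-memory stand-in for the balance store and event outbox."""

    def __init__(self):
        self.balances = {}                # user_id -> balance in minor units
        self.processed_event_ids = set()  # provider event IDs already applied
        self.emitted_events = []          # domain events for downstream consumers

    def apply_payment_credit(self, event_id, user_id, amount_minor, currency):
        """Credit a top-up exactly once per provider event ID.

        Returns True if the credit was applied, False for a duplicate delivery.
        """
        if event_id in self.processed_event_ids:
            return False  # duplicate webhook: acknowledge without crediting
        self.processed_event_ids.add(event_id)
        self.balances[user_id] = self.balances.get(user_id, 0) + amount_minor
        self.emitted_events.append({
            "type": "payment.credited",  # hypothetical event name
            "user_id": user_id,
            "amount_minor": amount_minor,
            "currency": currency,
            "source_event_id": event_id,
        })
        return True
```

In a real deployment the deduplication record, the balance change, and the outgoing event would need to commit in a single database transaction (for example, behind a unique constraint on the event ID) so that a crash between steps cannot double-credit or drop the event.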
FR-8 Admin Operations
- Admin can create users.
- Admin can adjust user balance with explicit credit/debit semantics.
- Admin can request refunds through dedicated refund API (not generic balance adjustment).
- Admin can add/probe/delete nodes.
- Admin can view cross-user allocations and force-release with explicit reason.
- Admin can query and export audit logs.
- Admin can view payment sessions for reconciliation.
Acceptance criteria:
- All admin mutations write audit logs.
- Node probe status reflected in admin inventory.
- Refund API behavior matches policy window + fallback rules.
FR-9 Storage Access
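The refund window and fallback rules that the refund API must follow can be sketched as a pure decision function. This is one possible reading: it assumes the refundable amount is capped at the user's unused prepaid balance, which is only one way to satisfy the "configurable policy" constraint in FR-7. The function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def decide_refund(paid_at: datetime, requested_at: datetime,
                  requested_minor: int, unused_balance_minor: int,
                  refund_window_days: int) -> dict:
    """Return an explicit, auditable refund outcome."""
    # Assumed cap: never refund more than the unused prepaid balance.
    amount = max(0, min(requested_minor, unused_balance_minor))
    within_window = requested_at - paid_at <= timedelta(days=refund_window_days)
    outcome = "provider_refund" if within_window else "internal_credit"
    return {"outcome": outcome, "amount_minor": amount}


paid = datetime(2024, 1, 1, tzinfo=timezone.utc)
early = decide_refund(paid, paid + timedelta(days=9), 5000, 3000, 14)
late = decide_refund(paid, paid + timedelta(days=31), 5000, 3000, 14)
```

Here `early` resolves to a `provider_refund` and `late` to an `internal_credit`, both for 3000 minor units, since the request exceeds the unused balance.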
- User storage is object-storage-backed with metadata index.
- User can list/upload/download/create/rename/delete within scoped namespace.
- Traversal and namespace breakout attempts are rejected.
Acceptance criteria:
- Namespace enforcement verified by negative tests.
FR-10 Abuse Protection and Rate Limiting
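A minimal sketch of the namespace enforcement described in FR-9, assuming each user's objects live under a `users/<user_id>` key prefix (a hypothetical layout, not specified in this PRD). The path is normalized first and then checked against the user's root, so traversal sequences are rejected rather than silently rewritten.

```python
import posixpath


def resolve_object_key(user_id: str, user_path: str) -> str:
    """Map a user-supplied path to an object key inside the user's namespace."""
    root = f"users/{user_id}"  # hypothetical per-user key prefix
    if "\x00" in user_path or "\\" in user_path:
        raise ValueError("invalid character in path")
    # Treat absolute-looking input as relative to the user's root.
    candidate = posixpath.normpath(posixpath.join(root, user_path.lstrip("/")))
    # Compare against root + "/" so that "users/u1" never matches "users/u10".
    if candidate != root and not candidate.startswith(root + "/"):
        raise ValueError("path escapes user namespace")
    return candidate
```

The negative tests called for in the acceptance criteria would cover inputs such as `../u2/secret` and `a/../../..`, both of which raise `ValueError` here.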
- Public APIs enforce rate limits and abuse controls.
- Limits are policy-configurable per endpoint/user class.
Acceptance criteria:
- Rate-limit responses are deterministic and observable.
- Abuse policy ownership is defined in operations/security docs.
FR-11 Audit Logging
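One way to make rate-limit responses deterministic is a fixed-window counter with an injected clock, sketched below. The PRD does not prescribe an algorithm, so the fixed-window choice, the class name, and the response fields are assumptions for illustration.

```python
import math


class FixedWindowLimiter:
    """Per-key request counter over fixed 60-second windows."""

    def __init__(self, limit_per_minute: int):
        self.limit = limit_per_minute
        self.counts = {}  # (key, window_index) -> requests seen in that window

    def check(self, key: str, now_seconds: float) -> dict:
        """Record one request and report whether it is allowed."""
        window = int(now_seconds // 60)
        used = self.counts.get((key, window), 0)
        if used >= self.limit:
            retry_after = math.ceil((window + 1) * 60 - now_seconds)
            return {"allowed": False, "remaining": 0,
                    "retry_after_seconds": retry_after}
        self.counts[(key, window)] = used + 1
        return {"allowed": True, "remaining": self.limit - used - 1,
                "retry_after_seconds": 0}
```

Passing the time in as an argument keeps the limiter testable without sleeping. This sketch never evicts old windows; a production limiter would expire them and would likely keep counters in a shared store.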
- System records immutable audit entries for privileged actions.
- Billing and payment mutations are auditable with correlation IDs.
- Admins can query and export audit logs for compliance and incident response.
Acceptance criteria:
- Admin balance adjustments, refunds, user creation, and node deletion are auditable.
- Audit logs are available via paginated admin API and CSV export endpoint.
FR-12 Operations Observability Surface
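The paginated query and CSV export in FR-11 might look like the sketch below. The `AuditEntry` fields and the keyset-style cursor are assumptions; the PRD only requires immutable entries, correlation IDs, pagination, and CSV export.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)  # frozen: entries cannot be mutated after creation
class AuditEntry:
    entry_id: int
    actor: str
    action: str
    correlation_id: str
    created_at: str  # ISO 8601 timestamp, UTC


def page(entries, after_id=0, limit=2):
    """Keyset pagination by entry_id; returns (items, next_after_id or None)."""
    ordered = sorted((e for e in entries if e.entry_id > after_id),
                     key=lambda e: e.entry_id)
    items = ordered[:limit]
    next_after = items[-1].entry_id if len(ordered) > limit else None
    return items, next_after


def to_csv(entries) -> str:
    """Serialize entries as CSV with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(AuditEntry)])
    writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))
    return buf.getvalue()
```

A frozen dataclass only prevents in-process mutation; storage-level immutability (append-only tables or similar) is a separate design decision.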
- Admin can view a read-only operational telemetry summary from within the product UI.
- Ops summary is aggregated and sanitized for browser exposure (no raw infra secrets or tokens).
- Ops panel is role-gated to admin users.
Acceptance criteria:
- /api/v1/admin/ops/overview returns aggregated health/queue/error/throughput metrics.
- Endpoint enforces admin authorization and standard error model.
8. Allocation State Machine (Required)
States:
- requested -> provisioning -> active -> releasing -> released
- Failure side paths:
- provisioning -> failed
- releasing -> release_failed
- release_failed -> releasing (user retry or admin force-release)
Rules:
- Resource can have max one active allocation.
- User concurrent allocation limit policy: default allocation.max_concurrent_per_user = 2 (configurable).
- release_failed means cleanup retries were exhausted; billing is stopped and admin/user retry path must remain available.
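The allocation states and transitions listed in this section can be encoded as a transition table, sketched here in Python. The table mirrors the PRD's paths exactly; `released` and `failed` are treated as terminal only because the PRD lists no outgoing transitions for them. The function and exception names are hypothetical.

```python
ALLOWED_TRANSITIONS = {
    "requested": frozenset({"provisioning"}),
    "provisioning": frozenset({"active", "failed"}),
    "active": frozenset({"releasing"}),
    "releasing": frozenset({"released", "release_failed"}),
    "release_failed": frozenset({"releasing"}),  # user retry or admin force-release
    "released": frozenset(),                     # terminal (assumed)
    "failed": frozenset(),                       # terminal (assumed)
}


class InvalidTransition(Exception):
    """Raised when a requested state change is not in the table."""


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, frozenset()):
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target
```

Centralizing the table makes it straightforward to reject shortcuts such as `active -> released`, which would skip runtime cleanup.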
9. Billing State Machine (Required)
User billing states:
- healthy (balance > low threshold)
- low_balance (0 < balance <= low threshold)
- auto_release_pending (advisory warning state when projected depletion time is within warning window)
- depleted (balance <= 0)
Transitions:
- healthy -> low_balance: trigger warning
- low_balance -> auto_release_pending (advisory): trigger projected depletion warning when estimate available
- low_balance -> depleted: trigger forced release
- depleted -> healthy: after successful top-up; allocations do not auto-restart by default
Recovery policy:
- After top-up, user must manually reprovision (default).
- Auto-restart may be introduced as explicit future policy.
First-run onboarding policy:
- If first login has zero balance and no allocations, UX routes user to billing with an onboarding CTA.
10. Non-Functional Requirements
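The billing states in this section can be derived from the balance and an optional depletion projection, as in the sketch below. The `warning_window_seconds` parameter and the burn-rate projection helper are assumptions; the PRD names a warning window but does not define how the projection is computed.

```python
def projected_seconds_to_depletion(balance_minor: int, burn_minor_per_hour: int):
    """Estimate time until balance reaches zero; None when nothing is burning."""
    if burn_minor_per_hour <= 0 or balance_minor <= 0:
        return None
    return balance_minor * 3600 // burn_minor_per_hour


def billing_state(balance_minor: int, low_threshold_minor: int,
                  seconds_to_depletion=None, warning_window_seconds=None) -> str:
    """Classify a user into one of the four billing states."""
    if balance_minor <= 0:
        return "depleted"
    if balance_minor > low_threshold_minor:
        return "healthy"
    # 0 < balance <= threshold: low_balance, upgraded to the advisory state
    # only when a projection exists and falls inside the warning window.
    if (seconds_to_depletion is not None
            and warning_window_seconds is not None
            and seconds_to_depletion <= warning_window_seconds):
        return "auto_release_pending"
    return "low_balance"
```

Because `auto_release_pending` is advisory, this sketch falls back to `low_balance` whenever projection data is missing, matching the "when estimate available" wording of the transition list.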
- Security: secure token handling, audited privileged actions, secret management, abuse protections.
- Reliability: idempotent webhook/provisioning/billing flows with retries and DLQ.
- Durability: transactional DB and immutable ledger.
- Performance: responsive APIs under expected load with enforced pagination.
- Observability: tracing, structured logs, metrics, and alerting.
11. Policy Configuration Model (Mandatory)
All operational policy values are configuration-driven (DB/config), not hardcoded constants.
Required capabilities:
- Scoped policy resolution: global -> plan -> org -> user (or narrower where applicable).
- Validated bounds on every policy key (min, max, allowed enum values).
- Effective-at support for controlled rollouts of policy changes.
- Full audit trail for policy updates (who, what, before, after, when, reason).
Initial policy keys (launch defaults configurable):
- rate_limit.api_requests_per_minute
- rate_limit.terminal_token_requests_per_minute
- rate_limit.financial_requests_per_minute
- rate_limit.admin_overview_requests_per_minute
- allocation.max_concurrent_per_user
- billing.window_seconds
- billing.low_balance_threshold_minor
- allocation.refund_window_days
- billing.minimum_deposit_minor
- billing.maximum_deposit_minor
- notification.low_balance_enabled
- notification.balance_depleted_enabled
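Scoped resolution with validated bounds, as required in this section, can be sketched as a lookup from the narrowest scope outward. The override storage shape, the `"*"` identifier for the global scope, and the bounds `(1, 64)` are hypothetical; only the scope order and the default of 2 for `allocation.max_concurrent_per_user` come from this PRD. Effective-at support and audit trails are omitted.

```python
# Narrowest scope first: a user override beats org, which beats plan, then global.
SCOPE_ORDER = ("user", "org", "plan", "global")

# Hypothetical bounds; real min/max values are an open question in this PRD.
BOUNDS = {"allocation.max_concurrent_per_user": (1, 64)}


def resolve_policy(key: str, overrides: dict, context: dict) -> int:
    """Return the effective value for key given the caller's scope context.

    overrides maps (scope, scope_id) -> {policy_key: value}.
    context maps scope name -> identifier, e.g. {"user": "u1", "org": "acme"}.
    """
    low, high = BOUNDS[key]
    for scope in SCOPE_ORDER:
        scope_id = "*" if scope == "global" else context.get(scope)
        value = overrides.get((scope, scope_id), {}).get(key)
        if value is None:
            continue
        if not low <= value <= high:
            raise ValueError(f"{key}={value} outside bounds [{low}, {high}]")
        return value
    raise KeyError(f"no value configured for {key}")
```

Validating at read time, as here, is a safety net; bounds would normally also be enforced when a policy update is written so that invalid values never reach storage.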
12. Open Questions
- Launch default values for policy keys above.
- Enterprise override ranges and approval workflow for policy changes.
13. Delivery Milestones and Success Criteria
- Architecture/Contract Baseline Ready. Success criteria: ADRs frozen, OpenAPI+AsyncAPI validated, Phase tracker >= Ready for Signoff for Phases 1-4.
- Core Platform Slice. Success criteria: Auth + catalog + allocation read APIs passing contract/integration tests.
- Provision/Billing/Payments Core. Success criteria: end-to-end allocate->bill->release flow stable with idempotency tests.
- Admin/Storage/Terminal Completion. Success criteria: admin and storage APIs + terminal gateway pass security and integration suites.
- Hardening and Launch Readiness. Success criteria: Go/No-Go checklist mandatory items all pass.
14. Phase-2 Readiness Constraints (Mandatory in MVP Design)
The following items remain out of MVP feature scope, but MVP architecture/design MUST avoid blocking them.
14.1 Managed Schedulers (SLURM/k8s/Ray)
- Allocation model supports pluggable execution backends.
- Allocation API remains scheduler-agnostic.
14.2 Enterprise Billing
- Ledger model extensible for invoices, subscriptions, commitments.
14.3 Multi-Tenant Hierarchy and Policy
- Core entities tenant-aware (org_id, optional project_id).
14.4 Multi-Region
- Region first-class in placement and resource identity.
14.5 No-Rework Acceptance Criteria
- Scheduler backend addition does not require breaking allocation API.
- Enabling org tenancy does not require ledger redesign.
- Second-region introduction does not require identity rewrite.
- Enterprise pricing is additive over billing core.