PRD v0.3 - Core42 AI Cloud Compute Platform

1. Document Intent

This PRD converts prototype learning into a production-oriented, API-first product baseline.

Assumption control:
- Cross-cutting product/architecture assumptions are tracked in doc/governance/Assumptions_Register.md; the register must be updated whenever a change affects this PRD.

2. Product Vision

Provide a secure self-service GPU platform where users can discover capacity, provision compute, access nodes, monitor usage, and pay based on consumption.

3. Product Goals

Fast time-to-compute: provision in minutes.
Transparent usage and billing.
Safe multi-user operations with role-based controls.
Operator-friendly admin surface for inventory and accounts.

4. Personas

End User: provisions and operates GPU nodes.
Admin: manages users, balances, and node inventory.
Billing Operator: monitors charges, top-ups, and reconciliation.

5. In Scope (MVP Rebuild)

OIDC-based auth and role/tenant-aware authorization.
SKU catalog and node inventory.
Allocation lifecycle: provision, active, release.
Browser terminal session to active allocation.
Secure SSH key retrieval for active allocation.
Usage metering and periodic rating.
Balance warnings and depleted enforcement.
Stripe checkout top-up and webhook processing.
Admin user/node management.
Admin allocation/audit/payment-session operational visibility.
Admin operations telemetry overview (health, queue depth, throughput, error rates).
Object-storage-backed user storage operations.
Audit logging for privileged and financial actions.

6. Out of Scope (MVP)

Managed scheduler products (SLURM/k8s/Ray) as real backend features.
Enterprise invoicing/subscriptions/commit contracts.
Full multi-tenant org hierarchy UX (schema/policy readiness still required).
Multi-region active/active runtime.
User-managed API keys for programmatic auth (deferred; middleware remains pluggable for future key resolver).

7. Functional Requirements

FR-1 Authentication and Session

Users authenticate via OIDC-compatible provider.
API accepts short-lived access tokens and enforces server-side authz.
Protected APIs reject missing/invalid tokens.

Acceptance criteria:
- Invalid tokens return unauthorized.
- Admin-only routes enforce role.
- Tenant-scoped routes enforce tenant policy.
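
The three checks above can be sketched as a single decision function. This is a minimal Python illustration, not the platform's middleware: the `Principal` shape, the `authorize` name, the `admin` role string, and the use of 403 for role and tenant failures are all assumptions. Token verification itself is assumed to happen upstream.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Principal:
    """Claims taken from an already-verified access token (hypothetical shape)."""
    user_id: str
    tenant_id: str
    roles: frozenset


def authorize(
    principal: Optional[Principal],
    *,
    admin_only: bool = False,
    resource_tenant_id: Optional[str] = None,
) -> int:
    """Return the HTTP status to answer with before the handler runs."""
    if principal is None:
        return 401  # missing or invalid token
    if admin_only and "admin" not in principal.roles:
        return 403  # assumption: role failures map to 403
    if resource_tenant_id is not None and resource_tenant_id != principal.tenant_id:
        return 403  # assumption: strict same-tenant policy
    return 200
```

Whether admins may cross tenant boundaries is not stated in this PRD; the sketch takes the strict reading.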

FR-2 SKU Catalog and Inventory

System exposes SKU catalog and availability.
User-facing inventory view excludes infrastructure connection secrets.

Acceptance criteria:
- Free capacity reflects only online and unassigned resources.
- User node list does not expose admin-only connection coordinates.
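
As a rough illustration of both criteria, the Python sketch below counts free capacity per SKU and builds a user-facing node view. The `Node` fields, the `ssh_host` attribute, and the function names are hypothetical stand-ins for whatever the inventory schema ends up being.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Node:
    node_id: str
    sku: str
    online: bool
    allocation_id: Optional[str]  # None means unassigned
    ssh_host: str                 # admin-only connection coordinate


def free_capacity(nodes: List[Node]) -> Dict[str, int]:
    """Count nodes per SKU that are both online and unassigned."""
    counts: Dict[str, int] = {}
    for node in nodes:
        if node.online and node.allocation_id is None:
            counts[node.sku] = counts.get(node.sku, 0) + 1
    return counts


def user_view(node: Node) -> dict:
    """Project a node for end users, using an allow-list of fields."""
    return {
        "node_id": node.node_id,
        "sku": node.sku,
        "available": node.online and node.allocation_id is None,
    }
```

The allow-list projection means a field added to `Node` later stays hidden from users until someone exposes it deliberately.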

FR-3 Provisioning Lifecycle

User requests allocation creation.
Prechecks enforce availability, policy, and funding constraints.
System provisions runtime access and records allocation + usage start.

Acceptance criteria:
- Provision failures return explicit machine-readable reason.
- Allocation status transitions are visible: requested, provisioning, active, releasing, released, failed, release_failed.
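
One way to read "explicit machine-readable reason" is a precheck that returns a stable reason code for each of the three constraint families (availability, policy, funding). The Python sketch below is illustrative only; the function name, parameters, and reason codes are hypothetical and not part of any agreed contract.

```python
from typing import Optional


def precheck_provision(
    requested_units: int,
    free_units: int,
    active_allocations: int,
    max_concurrent_per_user: int,
    balance_minor: int,
) -> Optional[str]:
    """Return None when provisioning may proceed, else a reason code.

    The reason codes are placeholders chosen for this example.
    """
    if requested_units > free_units:
        return "insufficient_capacity"      # availability
    if active_allocations >= max_concurrent_per_user:
        return "concurrent_limit_reached"   # policy
    if balance_minor <= 0:
        return "insufficient_balance"       # funding
    return None
```

A caller would put the returned code into the error body so that clients can branch on it without parsing message text.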

FR-4 Runtime Access

User can open terminal to active allocation.
User can view metrics for active allocation.
User can retrieve access credentials without persistent server-side private-key storage.

Acceptance criteria:
- Access denied for non-owner/non-admin.
- Key retrieval does not rely on query-token auth.
- Production path does not require long-lived storage of user SSH private key material in control-plane DB.

FR-5 Release Lifecycle

User/admin can request release.
System transitions allocation to releasing then released.
System performs runtime cleanup and ends usage accounting.

Acceptance criteria:
- Released allocation is not billed further.
- Node/resource becomes available for next provisioning.
- release_failed is surfaced with retry path; billing is stopped while in release_failed.

FR-6 Usage and Billing

Usage rated by SKU x quantity x duration.
Monetary values use minor units (integer) with explicit currency.
Low-balance and depleted-balance policies enforced.

Acceptance criteria:
- Depleted users have active allocations force-released.
- Billing APIs include currency and minor units.
- Low-balance and projected depletion warnings are emitted before forced release when projection data is available.
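
The rating rule (SKU x quantity x duration, in integer minor units with explicit currency) can be sketched in Python as follows. The hourly rate basis, the round-up rule, and the names `rate_usage` and `rate_minor_per_unit_hour` are assumptions made for the example; this PRD does not fix a rounding policy.

```python
def rate_usage(
    rate_minor_per_unit_hour: int,
    quantity: int,
    duration_seconds: int,
    currency: str,
) -> dict:
    """Rate one usage window using integer arithmetic only.

    Assumption: SKU price is expressed per unit-hour in minor units,
    and fractional minor units are rounded up.
    """
    if min(rate_minor_per_unit_hour, quantity, duration_seconds) < 0:
        raise ValueError("rate, quantity and duration must be non-negative")
    numerator = rate_minor_per_unit_hour * quantity * duration_seconds
    amount_minor = -(-numerator // 3600)  # ceiling division without floats
    return {"amount_minor": amount_minor, "currency": currency}
```

For example, a hypothetical SKU priced at 250 minor units per unit-hour, with quantity 2 held for 1800 seconds, rates to 250 minor units. Avoiding floats keeps repeated rating runs reproducible.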

FR-7 Payments

User can initiate Stripe checkout top-up.
Webhook processing is signature-verified and idempotent.
Successful credits emit domain event for downstream consumers.
Refund policy uses hybrid model:
Provider refund allowed within configurable window refund_window_days.
Outside window, refund request falls back to internal balance credit.
Refundable amount must be constrained by configurable policy for unused/prepaid balance.
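
The hybrid refund rule can be illustrated with a small Python decision function. This is a sketch under stated assumptions: the window boundary is treated as inclusive, the refundable amount is capped at the unused prepaid balance, and the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def decide_refund(
    paid_at: datetime,
    requested_at: datetime,
    requested_minor: int,
    unused_prepaid_minor: int,
    refund_window_days: int,
) -> dict:
    """Pick provider_refund inside the window, internal_credit outside it."""
    if requested_minor <= 0:
        raise ValueError("requested_minor must be positive")
    # Assumption: the policy cap is the unused prepaid balance.
    amount_minor = min(requested_minor, max(unused_prepaid_minor, 0))
    within_window = requested_at - paid_at <= timedelta(days=refund_window_days)
    outcome = "provider_refund" if within_window else "internal_credit"
    return {"outcome": outcome, "amount_minor": amount_minor}
```

Returning the outcome as an explicit field is what makes the decision straightforward to write into the audit log.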

Acceptance criteria:
- Duplicate webhook does not double-credit.
- Payment-credit event available for notifications/billing read model.
- Refund outcome is explicit (provider_refund or internal_credit) and auditable.
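
The duplicate-webhook criterion comes down to recording each provider event ID once and crediting only on first sight. The Python sketch below shows the idea with an in-memory set standing in for a uniquely keyed table. The signature check is a generic HMAC-SHA256 over the raw body, used only for illustration; it is not Stripe's actual signature scheme, and the payload fields and the `payment.credited` event name are hypothetical.

```python
import hashlib
import hmac
import json


class WebhookProcessor:
    def __init__(self, secret: bytes):
        self._secret = secret
        self._seen_event_ids = set()  # stand-in for a unique-keyed DB table
        self.balances_minor = {}      # user_id -> balance in minor units
        self.emitted_events = []      # stand-in for a domain event bus

    def _signature_ok(self, payload: bytes, signature_hex: str) -> bool:
        expected = hmac.new(self._secret, payload, hashlib.sha256).hexdigest()
        # Constant-time comparison avoids leaking how many characters matched.
        return hmac.compare_digest(expected.encode(), signature_hex.encode())

    def handle(self, payload: bytes, signature_hex: str) -> str:
        if not self._signature_ok(payload, signature_hex):
            return "rejected"
        event = json.loads(payload)
        if event["id"] in self._seen_event_ids:
            return "duplicate"  # acknowledged, but no second credit
        self._seen_event_ids.add(event["id"])
        user_id = event["user_id"]
        self.balances_minor[user_id] = (
            self.balances_minor.get(user_id, 0) + event["amount_minor"]
        )
        self.emitted_events.append(
            {"type": "payment.credited", "user_id": user_id,
             "amount_minor": event["amount_minor"], "currency": event["currency"]}
        )
        return "credited"
```

In a real service the ID insert, the balance change, and the event write would need to share one transaction so that a crash between them cannot cause a lost or double credit.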

FR-8 Admin Operations

Admin can create users.
Admin can adjust user balance with explicit credit/debit semantics.
Admin can request refunds through dedicated refund API (not generic balance adjustment).
Admin can add/probe/delete nodes.
Admin can view cross-user allocations and force-release with explicit reason.
Admin can query and export audit logs.
Admin can view payment sessions for reconciliation.

Acceptance criteria:
- All admin mutations write audit logs.
- Node probe status reflected in admin inventory.
- Refund API behavior matches policy window + fallback rules.

FR-9 Storage Access

User storage is object-storage-backed with metadata index.
User can list/upload/download/create/rename/delete within scoped namespace.
Traversal and namespace breakout attempts are rejected.

Acceptance criteria:
- Namespace enforcement verified by negative tests.
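
A minimal Python sketch of namespace enforcement for object keys follows, with the kind of negative case the criterion asks for. The `resolve_key` name, the `users/<id>` prefix layout, and the choice to reject every `..` segment outright (rather than normalize it) are assumptions.

```python
def resolve_key(namespace: str, requested_path: str) -> str:
    """Map a user-supplied path to an object key inside the user's namespace.

    Raises PermissionError on anything that looks like a breakout attempt.
    """
    if "\x00" in requested_path or "\\" in requested_path:
        raise PermissionError("illegal character in path")
    # Drop empty and "." segments so "a//./b" and "a/b" resolve identically.
    segments = [s for s in requested_path.split("/") if s not in ("", ".")]
    if ".." in segments:
        # Rejected rather than normalized, so attempts stay visible in logs.
        raise PermissionError("path traversal rejected")
    return namespace.rstrip("/") + "/" + "/".join(segments)
```

Rejecting `..` even where it would resolve inside the namespace is stricter than necessary, but it keeps the rule simple to test.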

FR-10 Abuse Protection and Rate Limiting

Public APIs enforce rate limits and abuse controls.
Limits are policy-configurable per endpoint/user class.

Acceptance criteria:
- Rate-limit responses are deterministic and observable.
- Abuse policy ownership is defined in operations/security docs.
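
To make "deterministic" concrete, the Python sketch below implements a fixed-window per-minute limiter with an injected clock, so that the same inputs always yield the same decision. The algorithm choice, the class name, the 429 status, and the `retry_after_seconds` field are assumptions; this PRD does not prescribe a limiting algorithm or response shape.

```python
import math


class FixedWindowLimiter:
    """Per-key request counter over fixed 60-second windows."""

    def __init__(self, limit_per_minute: int):
        self.limit = limit_per_minute
        # (key, window_index) -> count; old windows are never evicted here.
        self._counts = {}

    def check(self, key: str, now_seconds: float) -> dict:
        window = int(now_seconds // 60)
        bucket = (key, window)
        used = self._counts.get(bucket, 0)
        if used >= self.limit:
            retry_after = math.ceil((window + 1) * 60 - now_seconds)
            return {"allowed": False, "status": 429,
                    "retry_after_seconds": retry_after}
        self._counts[bucket] = used + 1
        return {"allowed": True, "status": 200,
                "remaining": self.limit - used - 1}
```

The limit would come from a policy key such as rate_limit.api_requests_per_minute, and the key argument could combine endpoint and user class.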

FR-11 Audit Logging

System records immutable audit entries for privileged actions.
Billing and payment mutations are auditable with correlation IDs.
Admins can query and export audit logs for compliance and incident response.

Acceptance criteria:
- Admin balance adjustments, refunds, user creation, and node deletion are auditable.
- Audit logs are available via paginated admin API and CSV export endpoint.
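
As an illustration of an immutable audit entry and its CSV export, consider the Python sketch below. The field set of `AuditEntry` and the `export_csv` name are hypothetical; only the correlation ID and the CSV export requirement come from this PRD.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Iterable


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class AuditEntry:
    occurred_at: str      # ISO 8601 timestamp
    actor_id: str
    action: str
    target_id: str
    correlation_id: str
    reason: str


def export_csv(entries: Iterable[AuditEntry]) -> str:
    """Serialize entries to CSV with a header row, in dataclass field order."""
    buffer = io.StringIO()
    names = [f.name for f in fields(AuditEntry)]
    writer = csv.DictWriter(buffer, fieldnames=names)
    writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))
    return buffer.getvalue()
```

In-process immutability is only part of the requirement; the storage layer would also need to be append-only for the log to be trustworthy.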

FR-12 Operations Observability Surface

Admin can view a read-only operational telemetry summary from within the product UI.
Ops summary is aggregated and sanitized for browser exposure (no raw infra secrets or tokens).
Ops panel is role-gated to admin users.

Acceptance criteria:
- /api/v1/admin/ops/overview returns aggregated health/queue/error/throughput metrics.
- Endpoint enforces admin authorization and standard error model.

8. Allocation State Machine (Required)

States:
- requested -> provisioning -> active -> releasing -> released
- Failure side paths:
  - provisioning -> failed
  - releasing -> release_failed
  - release_failed -> releasing (user retry or admin force-release)

Rules:
- Resource can have max one active allocation.
- User concurrent allocation limit policy: default allocation.max_concurrent_per_user = 2 (configurable).
- release_failed means cleanup retries were exhausted; billing is stopped and admin/user retry path must remain available.
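
The transitions listed in this section can be encoded as a lookup table, which makes illegal moves fail loudly. The Python sketch below is one possible encoding; the names `ALLOWED_TRANSITIONS`, `InvalidTransition`, and `transition` are hypothetical.

```python
ALLOWED_TRANSITIONS = {
    "requested": {"provisioning"},
    "provisioning": {"active", "failed"},
    "active": {"releasing"},
    "releasing": {"released", "release_failed"},
    "release_failed": {"releasing"},  # user retry or admin force-release
    "released": set(),                # terminal
    "failed": set(),                  # terminal
}


class InvalidTransition(Exception):
    pass


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target
```

Keeping the table as data lets a contract test assert that the states exposed by the API match the states in this section.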

9. Billing State Machine (Required)

User billing states:
- healthy (balance > low threshold)
- low_balance (0 < balance <= low threshold)
- auto_release_pending (advisory warning state when projected depletion time is within warning window)
- depleted (balance <= 0)

Transitions:
- healthy -> low_balance: trigger warning
- low_balance -> auto_release_pending (advisory): trigger projected depletion warning when estimate available
- low_balance -> depleted: trigger forced release
- depleted -> healthy: after successful top-up; allocations do not auto-restart by default
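
The state definitions in this section reduce to a small classification function over the balance, the low-balance threshold, and an optional depletion projection. The Python sketch below assumes the warning window is supplied in seconds and that auto_release_pending applies only while the balance is already at or below the threshold; both parameter names for the projection are hypothetical.

```python
from typing import Optional


def billing_state(
    balance_minor: int,
    low_threshold_minor: int,
    projected_depletion_seconds: Optional[int] = None,
    warning_window_seconds: Optional[int] = None,
) -> str:
    """Classify a user's billing state from balance and optional projection."""
    if balance_minor <= 0:
        return "depleted"
    if balance_minor > low_threshold_minor:
        return "healthy"
    # Here 0 < balance <= low threshold.
    if (
        projected_depletion_seconds is not None
        and warning_window_seconds is not None
        and projected_depletion_seconds <= warning_window_seconds
    ):
        return "auto_release_pending"  # advisory only; no enforcement
    return "low_balance"
```

Comparing the state before and after each rating run gives the transition, which is where the warning or forced release would be triggered.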

Recovery policy:
- After top-up, user must manually reprovision (default).
- Auto-restart may be introduced as explicit future policy.

First-run onboarding policy:
- If first login has zero balance and no allocations, UX routes user to billing with an onboarding CTA.

10. Non-Functional Requirements

Security: secure token handling, audited privileged actions, secret management, abuse protections.
Reliability: idempotent webhook/provisioning/billing flows with retries and DLQ.
Durability: transactional DB and immutable ledger.
Performance: responsive APIs under expected load with enforced pagination.
Observability: tracing, structured logs, metrics, and alerting.

11. Policy Configuration Model (Mandatory)

All operational policy values are configuration-driven (DB/config), not hardcoded constants.

Required capabilities:
- Scoped policy resolution: global -> plan -> org -> user (or narrower where applicable).
- Validated bounds on every policy key (min, max, allowed enum values).
- Effective-at support for controlled rollouts of policy changes.
- Full audit trail for policy updates (who, what, before, after, when, reason).

Initial policy keys (launch defaults configurable):
- rate_limit.api_requests_per_minute
- rate_limit.terminal_token_requests_per_minute
- rate_limit.financial_requests_per_minute
- rate_limit.admin_overview_requests_per_minute
- allocation.max_concurrent_per_user
- billing.window_seconds
- billing.low_balance_threshold_minor
- allocation.refund_window_days
- billing.minimum_deposit_minor
- billing.maximum_deposit_minor
- notification.low_balance_enabled
- notification.balance_depleted_enabled
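
Scoped resolution with validated bounds could look like the Python sketch below: the most specific scope that defines a key wins, and every resolved value is checked against that key's bounds. The storage shape (a dict keyed by scope, scope ID, and policy key), the `*` global scope ID, and the bounds shown are hypothetical; effective-at handling and the audit trail are left out for brevity.

```python
# Most specific scope first; resolution falls back toward global.
SCOPE_ORDER = ("user", "org", "plan", "global")

# Hypothetical (min, max) bounds; real bounds are an open question.
BOUNDS = {"allocation.max_concurrent_per_user": (1, 16)}


def validate(key: str, value: int) -> int:
    low, high = BOUNDS[key]
    if not low <= value <= high:
        raise ValueError(f"{key}={value} outside [{low}, {high}]")
    return value


def resolve_policy(key: str, overrides: dict, scope_ids: dict) -> int:
    """Return the effective value for key given the caller's scope IDs.

    overrides maps (scope, scope_id, key) -> value.
    scope_ids maps scope name -> ID, e.g. {"user": "u1", "org": "o1"}.
    """
    ids = dict(scope_ids, **{"global": "*"})
    for scope in SCOPE_ORDER:
        value = overrides.get((scope, ids.get(scope), key))
        if value is not None:
            return validate(key, value)
    raise KeyError(f"no value configured for {key}")
```

With a global default of 2 for allocation.max_concurrent_per_user and an org-level override of 4, users in that org resolve to 4 while everyone else resolves to 2.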

12. Open Questions

Launch default values for policy keys above.
Enterprise override ranges and approval workflow for policy changes.

13. Delivery Milestones and Success Criteria

Architecture/Contract Baseline Ready. Success criteria: ADRs frozen, OpenAPI+AsyncAPI validated, Phase tracker >= Ready for Signoff for Phases 1-4.
Core Platform Slice. Success criteria: Auth + catalog + allocation read APIs passing contract/integration tests.
Provision/Billing/Payments Core. Success criteria: end-to-end allocate->bill->release flow stable with idempotency tests.
Admin/Storage/Terminal Completion. Success criteria: admin and storage APIs + terminal gateway pass security and integration suites.
Hardening and Launch Readiness. Success criteria: Go/No-Go checklist mandatory items all pass.

14. Phase-2 Readiness Constraints (Mandatory in MVP Design)

The following items remain out of MVP feature scope, but MVP architecture/design MUST avoid blocking them.

14.1 Managed Schedulers (SLURM/k8s/Ray)

Allocation model supports pluggable execution backends.
Allocation API remains scheduler-agnostic.

14.2 Enterprise Billing

Ledger model extensible for invoices, subscriptions, commitments.

14.3 Multi-Tenant Hierarchy and Policy

Core entities tenant-aware (org_id, optional project_id).

14.4 Multi-Region

Region first-class in placement and resource identity.

14.5 No-Rework Acceptance Criteria

Scheduler backend addition does not require breaking allocation API.
Enabling org tenancy does not require ledger redesign.
Second-region introduction does not require identity rewrite.
Enterprise pricing is additive over billing core.