PRD v0.3 - Core42 AI Cloud Compute Platform

1. Document Intent

This PRD converts prototype learning into a production-oriented, API-first product baseline.

Assumption control: cross-cutting product/architecture assumptions are tracked in doc/governance/Assumptions_Register.md, which must be updated with every PRD-affecting change.

2. Product Vision

Provide a secure self-service GPU platform where users can discover capacity, provision compute, access nodes, monitor usage, and pay based on consumption.

3. Product Goals

  • Fast time-to-compute: provision in minutes.
  • Transparent usage and billing.
  • Safe multi-user operations with role-based controls.
  • Operator-friendly admin surface for inventory and accounts.

4. Personas

  • End User: provisions and operates GPU nodes.
  • Admin: manages users, balances, and node inventory.
  • Billing Operator: monitors charges, top-ups, and reconciliation.

5. In Scope (MVP Rebuild)

  • OIDC-based auth and role/tenant-aware authorization.
  • SKU catalog and node inventory.
  • Allocation lifecycle: provision, active, release.
  • Browser terminal session to active allocation.
  • Secure SSH key retrieval for active allocation.
  • Usage metering and periodic rating.
  • Balance warnings and depleted enforcement.
  • Stripe checkout top-up and webhook processing.
  • Admin user/node management.
  • Admin allocation/audit/payment-session operational visibility.
  • Admin operations telemetry overview (health, queue depth, throughput, error rates).
  • Object-storage-backed user storage operations.
  • Audit logging for privileged and financial actions.

6. Out of Scope (MVP)

  • Managed scheduler products (SLURM/k8s/Ray) as real backend features.
  • Enterprise invoicing/subscriptions/commit contracts.
  • Full multi-tenant org hierarchy UX (schema/policy readiness still required).
  • Multi-region active/active runtime.
  • User-managed API keys for programmatic auth (deferred; middleware remains pluggable for future key resolver).

7. Functional Requirements

FR-1 Authentication and Session

  • Users authenticate via OIDC-compatible provider.
  • API accepts short-lived access tokens and enforces server-side authz.
  • Protected APIs reject missing/invalid tokens.

Acceptance criteria:

  • Invalid tokens return unauthorized.
  • Admin-only routes enforce role.
  • Tenant-scoped routes enforce tenant policy.

FR-2 SKU Catalog and Inventory

  • System exposes SKU catalog and availability.
  • User-facing inventory view excludes infrastructure connection secrets.

Acceptance criteria:

  • Free capacity reflects only online and unassigned resources.
  • User node list does not expose admin-only connection coordinates.
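The two FR-2 acceptance criteria can be illustrated with a minimal sketch. The `Node` shape, the field names `ssh_host` and `ssh_port`, and the `ADMIN_ONLY_FIELDS` set are assumptions made for illustration; the PRD does not define the inventory schema.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Node:
    node_id: str
    sku: str
    online: bool
    assigned: bool
    ssh_host: str  # hypothetical admin-only connection coordinate
    ssh_port: int  # hypothetical admin-only connection coordinate


# Assumed set of fields that must never reach the user-facing view.
ADMIN_ONLY_FIELDS = {"ssh_host", "ssh_port"}


def free_capacity(nodes):
    """Count free nodes per SKU; only online and unassigned nodes count."""
    counts = {}
    for node in nodes:
        if node.online and not node.assigned:
            counts[node.sku] = counts.get(node.sku, 0) + 1
    return counts


def user_view(node):
    """Return the node as a dict with admin-only fields stripped."""
    return {k: v for k, v in asdict(node).items() if k not in ADMIN_ONLY_FIELDS}
```

Filtering through an explicit deny set keeps the redaction rule in one place; an allow list of user-visible fields would be the stricter alternative.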

FR-3 Provisioning Lifecycle

  • User requests allocation creation.
  • Prechecks enforce availability, policy, and funding constraints.
  • System provisions runtime access and records allocation + usage start.

Acceptance criteria:

  • Provision failures return explicit machine-readable reason.
  • Allocation status transitions are visible: requested, provisioning, active, releasing, released, failed, release_failed.

FR-4 Runtime Access

  • User can open terminal to active allocation.
  • User can view metrics for active allocation.
  • User can retrieve access credentials without persistent server-side private-key storage.

Acceptance criteria:

  • Access denied for non-owner/non-admin.
  • Key retrieval does not rely on query-token auth.
  • Production path does not require long-lived storage of user SSH private key material in control-plane DB.

FR-5 Release Lifecycle

  • User/admin can request release.
  • System transitions allocation to releasing then released.
  • System performs runtime cleanup and ends usage accounting.

Acceptance criteria:

  • Released allocation is not billed further.
  • Node/resource becomes available for next provisioning.
  • release_failed is surfaced with retry path; billing is stopped while in release_failed.

FR-6 Usage and Billing

  • Usage is rated as SKU price x quantity x duration.
  • Monetary values use minor units (integer) with explicit currency.
  • Low-balance and depleted-balance policies enforced.

Acceptance criteria:

  • Depleted users have active allocations force-released.
  • Billing APIs include currency and minor units.
  • Low-balance and projected depletion warnings are emitted before forced release when projection data is available.
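As a worked illustration of the FR-6 rating rule, the sketch below computes a charge in integer minor units with an explicit currency. The per-hour price basis, the round-up rule, and the names `Money` and `rate_usage` are assumptions; the PRD fixes neither the pricing unit nor the rounding policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Money:
    amount_minor: int  # integer minor units, e.g. cents
    currency: str      # explicit currency code, e.g. "USD"


def rate_usage(price_per_hour: Money, quantity: int, duration_seconds: int) -> Money:
    """Rate usage as SKU price x quantity x duration.

    Assumption: prices are per unit per hour, and partial minor units
    are rounded up. Integer arithmetic avoids float rounding drift.
    """
    if price_per_hour.amount_minor < 0 or quantity < 0 or duration_seconds < 0:
        raise ValueError("price, quantity, and duration must be non-negative")
    total = price_per_hour.amount_minor * quantity * duration_seconds
    charge = -(-total // 3600)  # ceiling division by seconds per hour
    return Money(charge, price_per_hour.currency)
```

For example, two units at 250 minor units per hour for 1800 seconds rate to `Money(250, "USD")`.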

FR-7 Payments

  • User can initiate Stripe checkout top-up.
  • Webhook processing is signature-verified and idempotent.
  • Successful credits emit domain event for downstream consumers.
  • Refund policy uses hybrid model:
      • Provider refund allowed within configurable window refund_window_days.
      • Outside window, refund request falls back to internal balance credit.
      • Refundable amount must be constrained by configurable policy for unused/prepaid balance.

Acceptance criteria:

  • Duplicate webhook does not double-credit.
  • Payment-credit event available for notifications/billing read model.
  • Refund outcome is explicit (provider_refund or internal_credit) and auditable.
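The duplicate-webhook criterion could be met by deduplicating on the provider event ID before crediting. The sketch below is in-memory and illustrative only: `BalanceLedger`, `apply_payment_event`, and the `payment.credited` event name are hypothetical, signature verification is assumed to have happened earlier, and a production path would persist the processed-ID check and the credit in one database transaction.

```python
class BalanceLedger:
    """In-memory stand-in for the balance store (illustrative only)."""

    def __init__(self):
        self.balance_minor = {}           # user_id -> balance in minor units
        self.processed_event_ids = set()  # provider event IDs already applied
        self.emitted_events = []          # domain events for downstream consumers

    def apply_payment_event(self, event_id, user_id, amount_minor, currency):
        """Credit the user once per event ID.

        Returns True when the credit was applied and False for a duplicate.
        """
        if event_id in self.processed_event_ids:
            return False  # duplicate delivery: no second credit, no second event
        self.processed_event_ids.add(event_id)
        self.balance_minor[user_id] = self.balance_minor.get(user_id, 0) + amount_minor
        self.emitted_events.append({
            "type": "payment.credited",  # hypothetical event name
            "event_id": event_id,
            "user_id": user_id,
            "amount_minor": amount_minor,
            "currency": currency,
        })
        return True
```

Returning a boolean rather than raising on duplicates lets the webhook handler acknowledge a retried delivery with a success response.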

FR-8 Admin Operations

  • Admin can create users.
  • Admin can adjust user balance with explicit credit/debit semantics.
  • Admin can request refunds through dedicated refund API (not generic balance adjustment).
  • Admin can add/probe/delete nodes.
  • Admin can view cross-user allocations and force-release with explicit reason.
  • Admin can query and export audit logs.
  • Admin can view payment sessions for reconciliation.

Acceptance criteria:

  • All admin mutations write audit logs.
  • Node probe status reflected in admin inventory.
  • Refund API behavior matches policy window + fallback rules.
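The refund window and fallback rules that the refund API must follow can be sketched as a pure decision function. The function name, its parameters, and the choice to cap the refund at the unused balance are assumptions; the PRD leaves the exact cap to configurable policy.

```python
from datetime import datetime, timedelta, timezone


def decide_refund(paid_at, requested_at, requested_minor,
                  unused_balance_minor, refund_window_days):
    """Pick the refund outcome and the refundable amount.

    Assumption: the refundable amount is capped at the unused balance.
    """
    refundable = max(0, min(requested_minor, unused_balance_minor))
    within_window = requested_at - paid_at <= timedelta(days=refund_window_days)
    outcome = "provider_refund" if within_window else "internal_credit"
    return {"outcome": outcome, "amount_minor": refundable}


# Usage: a request 9 days after payment with a 14-day window.
paid = datetime(2025, 1, 1, tzinfo=timezone.utc)
decision = decide_refund(paid, paid + timedelta(days=9), 5000, 3000, 14)
```

Keeping the decision free of side effects makes it straightforward to write the returned outcome and amount to the audit log next to the admin's request.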

FR-9 Storage Access

  • User storage is object-storage-backed with metadata index.
  • User can list/upload/download/create/rename/delete within scoped namespace.
  • Traversal and namespace breakout attempts are rejected.

Acceptance criteria:

  • Namespace enforcement verified by negative tests.
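One way to reject traversal and namespace breakout is to validate the user-supplied path before building the object key. In this sketch the `users/<user_id>/` key layout and the function name `resolve_object_key` are assumptions, and the check is deliberately strict: any `..` segment is rejected, even one that would stay inside the namespace.

```python
def resolve_object_key(user_id: str, user_path: str) -> str:
    """Map a user-supplied path to an object key inside the user's namespace.

    Raises ValueError rather than silently normalizing suspicious input.
    """
    if "\x00" in user_path or "\\" in user_path:
        raise ValueError("illegal character in path")
    if user_path.startswith("/"):
        raise ValueError("absolute paths are not allowed")
    # Drop empty and "." segments; reject any ".." segment outright.
    parts = [p for p in user_path.split("/") if p not in ("", ".")]
    if not parts or ".." in parts:
        raise ValueError("path is empty or escapes the user namespace")
    return f"users/{user_id}/" + "/".join(parts)
```

The negative tests named in the acceptance criteria would then assert that inputs such as `../u2/secret`, `/etc/passwd`, and `a/../../b` raise.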

FR-10 Abuse Protection and Rate Limiting

  • Public APIs enforce rate limits and abuse controls.
  • Limits are policy-configurable per endpoint/user class.

Acceptance criteria:

  • Rate-limit responses are deterministic and observable.
  • Abuse policy ownership is defined in operations/security docs.
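A deterministic rate-limit decision is easiest to test when the clock is injected. The sketch below uses a fixed one-minute window, which is an assumption: the PRD does not pick an algorithm, and a token bucket or sliding window would satisfy the same requirement. The class name and response fields are hypothetical.

```python
class FixedWindowLimiter:
    """Per-key fixed-window counter with an injected clock (sketch only).

    Old windows are never evicted here; a real store would expire them.
    """

    def __init__(self, limit_per_minute: int):
        self.limit = limit_per_minute
        self.counts = {}  # (key, window index) -> requests seen

    def check(self, key: str, now_seconds: float) -> dict:
        window = int(now_seconds // 60)
        used = self.counts.get((key, window), 0)
        if used >= self.limit:
            # Seconds until the next window opens.
            retry_after = 60 - int(now_seconds % 60)
            return {"allowed": False, "retry_after_seconds": retry_after}
        self.counts[(key, window)] = used + 1
        return {"allowed": True, "remaining": self.limit - used - 1}
```

The `limit_per_minute` value would come from a policy key such as rate_limit.api_requests_per_minute, and the returned fields can be mapped to response headers and metrics so that throttling stays observable.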

FR-11 Audit Logging

  • System records immutable audit entries for privileged actions.
  • Billing and payment mutations are auditable with correlation IDs.
  • Admins can query and export audit logs for compliance and incident response.

Acceptance criteria:

  • Admin balance adjustments, refunds, user creation, and node deletion are auditable.
  • Audit logs are available via paginated admin API and CSV export endpoint.

FR-12 Operations Observability Surface

  • Admin can view a read-only operational telemetry summary from within the product UI.
  • Ops summary is aggregated and sanitized for browser exposure (no raw infra secrets or tokens).
  • Ops panel is role-gated to admin users.

Acceptance criteria:

  • /api/v1/admin/ops/overview returns aggregated health/queue/error/throughput metrics.
  • Endpoint enforces admin authorization and standard error model.

8. Allocation State Machine (Required)

States:

  • requested -> provisioning -> active -> releasing -> released
  • Failure side paths:
      • provisioning -> failed
      • releasing -> release_failed
      • release_failed -> releasing (user retry or admin force-release)

Rules:

  • Resource can have max one active allocation.
  • User concurrent allocation limit policy: default allocation.max_concurrent_per_user = 2 (configurable).
  • release_failed means cleanup retries were exhausted; billing is stopped and admin/user retry path must remain available.
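The allocation states and side paths above can be encoded as a transition table that the service consults before every status change. This is a minimal sketch; the names `ALLOWED_TRANSITIONS`, `InvalidTransition`, and `transition` are illustrative and not taken from the PRD.

```python
# Allowed next states for each allocation state, as listed in this section.
ALLOWED_TRANSITIONS = {
    "requested": {"provisioning"},
    "provisioning": {"active", "failed"},
    "active": {"releasing"},
    "releasing": {"released", "release_failed"},
    "release_failed": {"releasing"},  # user retry or admin force-release
    "released": set(),                # terminal
    "failed": set(),                  # terminal
}


class InvalidTransition(Exception):
    """Raised when a status change is not in the transition table."""


def transition(current: str, target: str) -> str:
    """Return the target state if the move is allowed, else raise."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target
```

Centralizing the table means a new state needs one edit, and an illegal jump such as active -> released fails loudly instead of corrupting usage accounting.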

9. Billing State Machine (Required)

User billing states:

  • healthy (balance > low threshold)
  • low_balance (0 < balance <= low threshold)
  • auto_release_pending (advisory warning state when projected depletion time is within warning window)
  • depleted (balance <= 0)

Transitions:

  • healthy -> low_balance: trigger warning
  • low_balance -> auto_release_pending (advisory): trigger projected depletion warning when estimate available
  • low_balance -> depleted: trigger forced release
  • depleted -> healthy: after successful top-up; allocations do not auto-restart by default
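The billing states can be derived from the balance, the low threshold, and an optional depletion projection. The sketch below assumes the advisory auto_release_pending state is reachable only from low_balance, as the transitions above list it; the parameter names `seconds_to_depletion` and `warning_window_seconds` are hypothetical.

```python
from typing import Optional


def billing_state(balance_minor: int,
                  low_threshold_minor: int,
                  seconds_to_depletion: Optional[int] = None,
                  warning_window_seconds: int = 0) -> str:
    """Classify a user's billing state from the thresholds in this section."""
    if balance_minor <= 0:
        return "depleted"
    if balance_minor > low_threshold_minor:
        return "healthy"
    # Here 0 < balance <= low threshold. The advisory state applies only
    # when a projection exists and falls within the warning window.
    if seconds_to_depletion is not None and seconds_to_depletion <= warning_window_seconds:
        return "auto_release_pending"
    return "low_balance"
```

Deriving the state from current values, rather than storing it, keeps a top-up from leaving a stale depleted flag behind.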

Recovery policy:

  • After top-up, user must manually reprovision (default).
  • Auto-restart may be introduced as explicit future policy.

First-run onboarding policy:

  • If first login has zero balance and no allocations, UX routes user to billing with an onboarding CTA.

10. Non-Functional Requirements

  • Security: secure token handling, audited privileged actions, secret management, abuse protections.
  • Reliability: idempotent webhook/provisioning/billing flows with retries and DLQ.
  • Durability: transactional DB and immutable ledger.
  • Performance: responsive APIs under expected load with enforced pagination.
  • Observability: tracing, structured logs, metrics, and alerting.

11. Policy Configuration Model (Mandatory)

All operational policy values are configuration-driven (DB/config), not hardcoded constants.

Required capabilities:

  • Scoped policy resolution: global -> plan -> org -> user (or narrower where applicable).
  • Validated bounds on every policy key (min, max, allowed enum values).
  • Effective-at support for controlled rollouts of policy changes.
  • Full audit trail for policy updates (who, what, before, after, when, reason).

Initial policy keys (launch defaults configurable):

  • rate_limit.api_requests_per_minute
  • rate_limit.terminal_token_requests_per_minute
  • rate_limit.financial_requests_per_minute
  • rate_limit.admin_overview_requests_per_minute
  • allocation.max_concurrent_per_user
  • billing.window_seconds
  • billing.low_balance_threshold_minor
  • allocation.refund_window_days
  • billing.minimum_deposit_minor
  • billing.maximum_deposit_minor
  • notification.low_balance_enabled
  • notification.balance_depleted_enabled
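Scoped resolution with validated bounds might look like the sketch below, where the most specific scope that defines a key wins. The data shapes, the function name `resolve_policy`, and the example bounds are assumptions; effective-at handling and enum-valued keys are omitted for brevity.

```python
# Scopes ordered from most specific to least specific.
SCOPE_ORDER = ["user", "org", "plan", "global"]


def resolve_policy(key, overrides, bounds):
    """Return (value, scope) for the most specific scope that sets the key.

    overrides: {scope: {policy_key: value}} for the caller's context.
    bounds:    {policy_key: (min, max)} validated on every read.
    """
    for scope in SCOPE_ORDER:
        scoped = overrides.get(scope, {})
        if key in scoped:
            value = scoped[key]
            low, high = bounds[key]
            if not low <= value <= high:
                raise ValueError(f"{key}={value} at scope {scope} is outside [{low}, {high}]")
            return value, scope
    raise KeyError(key)


# Usage: an org-level override beats the global default.
overrides = {
    "global": {"allocation.max_concurrent_per_user": 2},
    "org": {"allocation.max_concurrent_per_user": 5},
}
bounds = {"allocation.max_concurrent_per_user": (1, 10)}  # hypothetical bounds
resolved = resolve_policy("allocation.max_concurrent_per_user", overrides, bounds)
```

Returning the winning scope alongside the value makes the effective policy explainable in admin tooling and audit entries.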

12. Open Questions

  • Launch default values for policy keys above.
  • Enterprise override ranges and approval workflow for policy changes.

13. Delivery Milestones and Success Criteria

  1. Architecture/Contract Baseline Ready
     Success criteria: ADRs frozen, OpenAPI+AsyncAPI validated, Phase tracker >= Ready for Signoff for Phases 1-4.
  2. Core Platform Slice
     Success criteria: Auth + catalog + allocation read APIs passing contract/integration tests.
  3. Provision/Billing/Payments Core
     Success criteria: end-to-end allocate->bill->release flow stable with idempotency tests.
  4. Admin/Storage/Terminal Completion
     Success criteria: admin and storage APIs + terminal gateway pass security and integration suites.
  5. Hardening and Launch Readiness
     Success criteria: Go/No-Go checklist mandatory items all pass.

14. Phase-2 Readiness Constraints (Mandatory in MVP Design)

The following items remain out of MVP feature scope, but MVP architecture/design MUST avoid blocking them.

14.1 Managed Schedulers (SLURM/k8s/Ray)

  • Allocation model supports pluggable execution backends.
  • Allocation API remains scheduler-agnostic.

14.2 Enterprise Billing

  • Ledger model extensible for invoices, subscriptions, commitments.

14.3 Multi-Tenant Hierarchy and Policy

  • Core entities tenant-aware (org_id, optional project_id).

14.4 Multi-Region

  • Region first-class in placement and resource identity.

14.5 No-Rework Acceptance Criteria

  • Scheduler backend addition does not require breaking allocation API.
  • Enabling org tenancy does not require ledger redesign.
  • Second-region introduction does not require identity rewrite.
  • Enterprise pricing is additive over billing core.