CLI + Python SDK v1 Plan

Status: CLI v1 baseline implemented; Python SDK v1 baseline implemented and expanded for app-platform shared runtime workflows

1. Scope

This plan covers:

  • gpuaas CLI v1
  • Go SDK baseline used by the CLI
  • Python SDK v1

Out of scope:

  • MaaS integration (deferred)
  • CLI v2 agent-operable control-plane direction (tracked separately in doc/architecture/CLI_Agent_Operable_Control_Plane_v2.md)

2. Why this order

  1. CLI first:
       • fastest path to demonstrate full platform workflows
       • validates contracts, auth, and error handling end-to-end
  2. Python SDK second:
       • highest utility for AI/ML users
       • can reuse validated CLI/API flow decisions

3. CLI v1 command surface (implemented)

  • gpuaas auth login [--provider huggingface|github|google] [--tenant-hint <tenant>] [--identity-hint <email>] [--no-browser] [--base-url <url>]
  • gpuaas auth dev-login --username <u> --password <p> [--base-url <url>]
  • gpuaas auth keycloak-login --username <u> --password <p> [--base-url <api>] [--kc-url <kc>] [--realm <realm>] [--client-id <id>] [--client-secret <secret>]
  • gpuaas auth logout
  • gpuaas auth whoami
  • gpuaas catalog list [--output table|csv|json] [--no-heading]
  • gpuaas nodes list [--status <status>] [--output table|csv|json] [--no-heading]
  • gpuaas projects list [--output table|csv|json] [--no-heading]
  • gpuaas projects create --name <name> [--slug <slug>]
  • gpuaas projects use --id <project_id>
  • gpuaas allocations list [--status <status>] [--project-id <id>] [--output table|csv|json] [--no-heading]
  • gpuaas allocations create [--scheduler-type bare_metal|slurm|k8s|ray] [--node-id <id>] [--project-id <id>]
  • gpuaas allocations release --id <allocation_id> [--project-id <id>]
  • gpuaas billing balance
  • gpuaas apps shared-runtimes list|get|create|delete --org-id <org_id> ...
  • gpuaas apps shared-runtimes attachments list|get|create|delete --org-id <org_id> --runtime-id <id> ...
  • gpuaas apps shared-runtimes workers list|get --org-id <org_id> --runtime-id <id> ...
  • gpuaas apps shared-runtimes worker-operations list|get|create --org-id <org_id> --runtime-id <id> ...
  • gpuaas schema <resource>
  • gpuaas explain <command>
  • gpuaas mcp list-tools
  • gpuaas mcp serve

Required behavior:

  • every API failure prints code, message, correlation_id
  • deterministic non-zero exit codes by error class
  • explicit project context flag (--project) plus active-default behavior
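The first two required behaviors can be sketched as follows. This is a minimal illustration, not the CLI's actual implementation (the CLI is built on the Go SDK): the error classes, the HTTP-status classification, and the specific exit code values are all assumptions, since the plan only requires that the mapping be deterministic and non-zero.

```python
# Hypothetical error classes and exit codes; the plan fixes neither.
EXIT_CODES = {
    "auth": 2,
    "validation": 3,
    "not_found": 4,
    "conflict": 5,
    "server": 6,
}
FALLBACK_EXIT_CODE = 1  # unclassified failures still exit non-zero


def classify(http_status: int) -> str:
    """Map an HTTP status to an assumed error class."""
    if http_status in (401, 403):
        return "auth"
    if http_status in (400, 422):
        return "validation"
    if http_status == 404:
        return "not_found"
    if http_status == 409:
        return "conflict"
    if http_status >= 500:
        return "server"
    return "unknown"


def exit_code_for(http_status: int) -> int:
    """Same status always yields the same non-zero exit code."""
    return EXIT_CODES.get(classify(http_status), FALLBACK_EXIT_CODE)


def format_failure(envelope: dict) -> str:
    """Render the three required fields; server values pass through unmodified."""
    return (
        f"error: code={envelope.get('code', '<missing>')} "
        f"message={envelope.get('message', '<missing>')} "
        f"correlation_id={envelope.get('correlation_id', '<missing>')}"
    )
```

A command wrapper would print `format_failure(...)` to stderr and exit with `exit_code_for(...)`, which lets scripts branch on the error class without parsing text.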

4. Python SDK v1 surface

Core client modules:

  • auth
  • catalog
  • allocations
  • terminal
  • billing
  • shared_runtimes

Required operations:

  • list catalog SKUs
  • create/list/release allocations
  • request terminal token
  • read balance
  • list/get/create/delete shared runtimes
  • list/get/create/delete shared runtime attachments
  • list/get shared runtime workers
  • list/get/create shared runtime worker operations

Status note:

  1. Python SDK v1 baseline is now implemented.
  2. Shared runtime and shared worker control-plane coverage is now implemented.
  3. Remaining work is iterative expansion, not initial delivery.

Contract readiness guarantees for SDK generation:

  • paginated list endpoints expose deterministic cursor + page_size parameters and stable envelope shapes
  • mutation endpoints that are SDK-critical document Idempotency-Key behavior
  • project-scoped endpoints require explicit X-Project-ID (no automatic default on the server side)
  • canonical error envelope is preserved (code, message, correlation_id, optional details)
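A client-side view of these guarantees might look like the sketch below. The header names `X-Project-ID` and `Idempotency-Key` and the `cursor` / `page_size` parameters come from the plan; the envelope field names `items` and `next_cursor`, the helper names, and the use of a UUID as the idempotency key are assumptions for illustration.

```python
import uuid
from typing import Callable, Iterator, Optional


def request_headers(project_id: str, mutation: bool = False,
                    idempotency_key: Optional[str] = None) -> dict:
    """Build headers for a project-scoped call (hypothetical helper)."""
    if not project_id:
        # The server applies no default, so fail early on the client.
        raise ValueError("project_id is required")
    headers = {"X-Project-ID": project_id}
    if mutation:
        # Assumed key format: any unique string; a UUID is one option.
        headers["Idempotency-Key"] = idempotency_key or str(uuid.uuid4())
    return headers


def iter_items(fetch_page: Callable[[Optional[str], int], dict],
               page_size: int = 50) -> Iterator:
    """Walk a cursor-paginated list endpoint until no cursor is returned.

    fetch_page(cursor, page_size) stands in for the HTTP call and returns an
    envelope assumed to look like {"items": [...], "next_cursor": "..."}.
    """
    cursor = None
    while True:
        page = fetch_page(cursor, page_size)
        yield from page.get("items", [])
        cursor = page.get("next_cursor")
        if not cursor:
            return
```

When retrying a mutation, the caller should pass the same `idempotency_key` again rather than letting the helper generate a new one.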

Design requirements:

  • generated typed models from OpenAPI
  • thin ergonomic wrappers for polling/wait helpers
  • exceptions must expose error_code and correlation_id
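The last two design requirements could take roughly the following shape. The class name `GpuaasAPIError`, the helper `wait_for_status`, its parameters, and the status strings in the usage note are hypothetical; only the `error_code` and `correlation_id` attributes and the envelope fields are taken from this plan.

```python
import time
from typing import Callable, Collection, Optional


class GpuaasAPIError(Exception):
    """Hypothetical SDK exception built from the canonical error envelope."""

    def __init__(self, error_code: str, message: str, correlation_id: str,
                 details: Optional[dict] = None):
        super().__init__(f"{error_code}: {message} (correlation_id={correlation_id})")
        self.error_code = error_code          # server value, not rewritten
        self.message = message
        self.correlation_id = correlation_id  # needed for support triage
        self.details = details

    @classmethod
    def from_envelope(cls, envelope: dict) -> "GpuaasAPIError":
        return cls(envelope["code"], envelope["message"],
                   envelope["correlation_id"], envelope.get("details"))


def wait_for_status(fetch_status: Callable[[], str], targets: Collection[str],
                    timeout_s: float = 300.0, interval_s: float = 2.0,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> str:
    """Poll fetch_status() until it returns a target status or time runs out."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in targets:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"last status {status!r} not in {sorted(targets)}")
        sleep(interval_s)
```

A caller might write `wait_for_status(lambda: get_allocation_status(alloc_id), {"active"})`, where `get_allocation_status` and the `"active"` status are placeholders. Injecting `sleep` and `clock` keeps the wrapper testable without real delays.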

Current app-platform note:

  • example apps are still API-first today
  • app developers should treat the public API as authoritative and the SDK as convenience
  • app-specific SDK/UI helper layers should only be standardized after the example-app workflow is stable

Companion reference: doc/architecture/Example_App_Developer_Reference_Workflow_v1.md

5. Auth model

MVP for CLI v1:

  • personal flow: POST /api/v1/auth/personal/login for local/dev bootstrap
  • OIDC flow: auth keycloak-login obtains Keycloak access/refresh token directly for operator/dev workflows
  • no URL/query token transport
  • refresh/session renewal uses POST /api/v1/auth/token/refresh semantics (CLI command wiring is a follow-up)

Decision: device code flow is deferred; current CLI baseline is password-based (personal login + Keycloak token flow) to keep MVP deterministic in local and shared environments.

Future:

  • API keys / service-account credentials
  • OIDC device code flow for non-browser production CLI login
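As an illustration of the "no URL/query token transport" rule applied to the refresh endpoint, a client might build the request as below. The path is taken from the plan; the JSON body shape (`refresh_token` field) and the helper name are assumptions, since the request schema is not specified here.

```python
import json
import urllib.request
from urllib.parse import urlsplit

REFRESH_PATH = "/api/v1/auth/token/refresh"  # path from the plan


def build_refresh_request(base_url: str, refresh_token: str) -> urllib.request.Request:
    """Build (but do not send) a refresh request with the token in the body."""
    url = base_url.rstrip("/") + REFRESH_PATH
    if urlsplit(url).query:
        # Guard against a base URL that already carries query parameters.
        raise ValueError("tokens must not travel in the URL or query string")
    # Assumed body shape; the real schema comes from the OpenAPI contract.
    body = json.dumps({"refresh_token": refresh_token}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Keeping the token out of the URL means it does not end up in access logs, proxies, or shell history.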

6. Observability and support requirements

Both CLI and SDK must:

  • surface correlation_id for support triage
  • preserve server error code values (no string rewriting)
  • document a standard troubleshooting flow that leads to Loki/Tempo queries
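One step of such a troubleshooting flow is turning a surfaced correlation_id into a log search. The sketch below builds a LogQL line-filter query for Loki; the `app` label and its `gpuaas-api` value are assumptions about how the platform labels its log streams, and the helper name is hypothetical.

```python
def loki_query(correlation_id: str, app_label: str = "gpuaas-api") -> str:
    """Return a LogQL query matching log lines that contain the correlation id."""
    # Escape backslashes and double quotes so the id is a valid LogQL string.
    escaped = correlation_id.replace("\\", "\\\\").replace('"', '\\"')
    return f'{{app="{app_label}"}} |= "{escaped}"'
```

For example, `loki_query("c-123")` produces `{app="gpuaas-api"} |= "c-123"`, which support staff can paste into a Loki query box.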

7. Delivery order

  1. A-CLI-001 (backend readiness)
  2. B-CLI-001 (CLI implementation)
  3. C-CLI-OPS-001 (runbook/support)
  4. A-PYSDK-001 (backend + contract readiness)
  5. B-PYSDK-001 (Python SDK)
  6. C-PYSDK-OPS-001 (runbook/support)

8. Definition of done (for each delivery slice)

  • contract-valid behavior (OpenAPI-consistent)
  • canonical error envelope retained
  • correlation-first troubleshooting path documented
  • CI/tests pass for touched package(s)

9. Follow-on Direction

After v1, CLI evolution should follow doc/architecture/CLI_Agent_Operable_Control_Plane_v2.md.

That direction keeps:

  1. curated workflow commands as the primary UX,
  2. introspection and machine-readable behavior as first-class for agents,
  3. role-gated ops/debug capabilities on the same control-plane client model.