
Scheduler as Platform App v1

Goal

Define a sustainable baseline for delivering scheduler capabilities (starting with Slurm) as platform apps, without coupling core allocation APIs to one scheduler implementation.

This document is the implementation baseline for:

  1. Internal reference apps (for example Slurm).
  2. Future first-party apps.
  3. Third-party/team-owned apps built on the same control plane contracts.

Companion documents for the first adapter:

  1. doc/architecture/Slurm_App_Runtime_Adapter_v1.md
  2. doc/architecture/Clustered_App_Model_v1.md
  3. doc/architecture/App_Platform_Primitive_Boundary_v1.md

Decision Summary

  1. Schedulers are modeled as apps in App Catalog and instantiated per project/tenant policy.
  2. Core platform keeps scheduler-agnostic primitives only (identity, policy, audit, events, tenancy boundaries).
  3. Scheduler-specific logic (Slurm/K8s/Ray internals) lives in app operator runtime, not in core allocation handlers.
  4. Authorization remains permission-key based; role labels may evolve without handler rewrites.
  5. Initial production operating mode is tenant_dedicated; shared managed scheduler offerings are explicitly a later mode.

For product language, the scheduler family should expose:

  1. project-scoped mode
  2. tenant-owned shared mode
  3. platform-managed shared mode (later)

Current mapping:

  1. project-scoped mode -> tenant_dedicated + project
  2. tenant-owned shared mode -> tenant_dedicated + tenant (target)
  3. platform-managed shared mode -> platform_managed + platform
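The mapping above can be sketched as a simple lookup; the mode names and function below are illustrative assumptions, not the platform's actual identifiers.

```python
# Illustrative mapping of product-facing scheduler modes to internal
# (operating_mode, control_plane_scope) pairs. All names here are
# assumptions for illustration, not real platform identifiers.
PRODUCT_MODE_MAP: dict[str, tuple[str, str]] = {
    "project_scoped": ("tenant_dedicated", "project"),
    # Product target only: tenant ownership still needs an explicit
    # attachment/ownership model (see App_Tenant_Shared_Attachment_Model_v1).
    "tenant_owned_shared": ("tenant_dedicated", "tenant"),
    "platform_managed_shared": ("platform_managed", "platform"),
}

def resolve_mode(product_mode: str) -> tuple[str, str]:
    try:
        return PRODUCT_MODE_MAP[product_mode]
    except KeyError:
        raise ValueError(f"unknown product mode: {product_mode}")
```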

Important limitation:

  1. The current app-instance contract is still project-owned.
  2. Tenant-owned shared mode is therefore a product target that still needs an explicit attachment/ownership model.
  3. See: doc/architecture/App_Tenant_Shared_Attachment_Model_v1.md

Scope

In scope:

  1. Responsibility boundary.
  2. IAM/policy requirements.
  3. Artifact/source model (platform-shared + tenant-scoped).
  4. Events and observability requirements.
  5. Slurm pilot acceptance criteria and gap capture.
  6. Operating-mode expectations for scheduler backends.

Out of scope:

  1. Slurm internals (controller tuning, partition strategy).
  2. MaaS/hardware provisioning implementation details.
  3. UI implementation details.

Core vs App Responsibilities

| Area | Core platform (must provide) | Scheduler app/operator (must provide) |
| --- | --- | --- |
| Identity | User auth, service account auth, project context enforcement | None (consume core identities only) |
| IAM/Authz | Permission evaluation, role bindings, policy overlays, audit logs | Declare required actions and call only allowed endpoints |
| API contracts | Stable control-plane endpoints, canonical error envelope | Adapter/operator APIs behind app runtime boundary |
| Lifecycle | App instance lifecycle (requested -> running/failed) | Scheduler deployment/upgrade/rollback mechanics |
| Events | Typed domain events + correlation propagation | Consume/emit app lifecycle events and runtime status |
| Tenancy | Project/tenant ownership and boundary checks | Never bypass project boundary; include context in all operations |
| Billing hooks | Usage attribution primitives by tenant/project | Scheduler usage metrics mapping (jobs/queues -> billable units) |

Rule: if functionality requires scheduler-specific branching inside core handlers, treat it as a platform defect and move it behind the app/operator boundary.
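The rule above can be sketched as a dispatch through an adapter registry keyed by scheduler type, so the core handler never branches on a specific scheduler. Everything below (names, the registry shape, the `create_instance` signature) is a hypothetical illustration, not the platform's actual code.

```python
from typing import Protocol

class SchedulerAdapter(Protocol):
    """Boundary behind which all scheduler-specific logic lives."""
    def create_instance(self, spec: dict) -> str: ...

# Registry keyed by scheduler type; adapters register themselves at startup.
_ADAPTERS: dict[str, SchedulerAdapter] = {}

def register_adapter(scheduler_type: str, adapter: SchedulerAdapter) -> None:
    _ADAPTERS[scheduler_type] = adapter

def handle_allocation(scheduler_type: str, spec: dict) -> str:
    # Core handler: no `if scheduler_type == "slurm"` branches here.
    # Scheduler-specific behavior is entirely the adapter's concern.
    adapter = _ADAPTERS.get(scheduler_type)
    if adapter is None:
        raise ValueError(f"no adapter registered for scheduler_type={scheduler_type}")
    return adapter.create_instance(spec)
```

A Slurm-specific branch inside `handle_allocation` would be exactly the platform defect the rule names; the fix is always to move that logic into the registered adapter.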

Required IAM Model

Use action keys, not role-name checks, in handlers.

Baseline action families

  1. scheduler.catalog.read
  2. scheduler.instance.read
  3. scheduler.instance.create
  4. scheduler.instance.update
  5. scheduler.instance.delete
  6. scheduler.queue.submit
  7. scheduler.queue.read
  8. scheduler.queue.cancel
  9. scheduler.node.read
  10. scheduler.node.operate (drain/cordon/uncordon/label)

Scope rules

  1. Tenant/project resources must enforce project ownership on every mutation.
  2. Service accounts are same-project only.
  3. Platform break-glass is allowed only on explicit admin endpoints and always audited.
  4. Role display labels (project_member, project_admin, etc.) are UI concerns; permission keys are the enforcement contract.
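A minimal sketch of how the two rules combine (permission keys as the enforcement contract, service accounts confined to their own project). The `Principal` shape and `authorize` helper are assumptions for illustration, not the platform's real IAM types.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Principal:
    project_id: str
    is_service_account: bool
    granted_actions: frozenset[str]  # permission keys, never role labels

def authorize(principal: Principal, action: str, resource_project_id: str) -> bool:
    """Handlers check action keys only; role display labels never appear here."""
    if action not in principal.granted_actions:
        return False
    # Scope rule: service accounts are same-project only.
    if principal.is_service_account and principal.project_id != resource_project_id:
        return False
    return True
```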

Artifact and Registry Model

Both source tiers are first-class:

  1. Platform-shared registries/artifact sources (blessed global sources).
  2. Tenant-scoped allowlisted sources (private enterprise registries/buckets).

Policy behavior:

  1. Global hard-deny is non-overridable.
  2. Tenant/project overlays may narrow, but never broaden, the globally allowed set.
  3. Scheduler app deployment must resolve artifact sources through policy evaluation, not hardcoded host lists.

Direction:

  1. Keep the API neutral across source types (OCI and non-OCI blob/object sources).
  2. Credential delivery remains short-lived and task-scoped.

API Contract Direction

Scheduler app integration should use the existing app-control-plane contracts:

  1. Catalog/version publication for scheduler app entries.
  2. Project entitlement enable/disable with policy overlays.
  3. App instance create/read/delete for scheduler control-plane instances.

Required effective instance metadata:

  1. operating_mode
  2. control_plane_scope
  3. runtime_backend

The allocation API remains scheduler-agnostic:

  1. allocations.scheduler_type selects the adapter path.
  2. Scheduler references/metadata are stored as integration metadata.
  3. Core allocation handlers do not embed Slurm-specific request semantics.

Clustered scheduler/operator apps must also follow the generic clustered-app model:

  1. Topology is app-level and tenant/project-admin controlled.
  2. Physical node selection remains platform-owned.
  3. Logical roles and mutable member lifecycle must not leak internal host-role assumptions into the public API.

Event and Observability Contract

Minimum required event flow:

  1. apps.instance.requested
  2. apps.instance.running
  3. apps.instance.failed
  4. apps.instance.deleting
  5. apps.instance.deleted
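The event list implies a small lifecycle state machine; the transitions below are an assumption drawn from the requested -> running/failed wording earlier in this document, not a published contract.

```python
# Sketch of the app-instance lifecycle implied by the event list.
# The transition edges are assumptions, not a normative contract.
LIFECYCLE: dict[str, set[str]] = {
    "apps.instance.requested": {"apps.instance.running", "apps.instance.failed"},
    "apps.instance.running":   {"apps.instance.failed", "apps.instance.deleting"},
    "apps.instance.failed":    {"apps.instance.deleting"},
    "apps.instance.deleting":  {"apps.instance.deleted"},
    "apps.instance.deleted":   set(),  # terminal
}

def is_valid_transition(current: str, nxt: str) -> bool:
    return nxt in LIFECYCLE.get(current, set())
```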

Every scheduler app operation must include:

  1. correlation_id
  2. org_id
  3. project_id
  4. app_slug
  5. app_instance_id (where applicable)

Triage path:

  1. API/UI error envelope -> correlation_id
  2. Loki lookup by correlation_id
  3. Tempo trace lookup by trace_id
  4. Event timeline reconstruction from apps.instance.* events
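Enforcing the required context at emission time is what keeps the triage path usable; a sketch under assumed names (the field list comes from this document, the `emit_event` helper is hypothetical):

```python
# Required correlation context for every scheduler app operation
# (field names taken from this document).
REQUIRED_CONTEXT = ("correlation_id", "org_id", "project_id", "app_slug")

def emit_event(event_type: str, context: dict, payload: dict) -> dict:
    """Refuse to emit an event missing correlation context, so every
    event can be joined against Loki logs and Tempo traces later."""
    missing = [k for k in REQUIRED_CONTEXT if not context.get(k)]
    if missing:
        raise ValueError(f"missing correlation context: {missing}")
    return {"type": event_type, **context, "payload": payload}
```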

Slurm Pilot (Reference App)

Use Slurm as the first reference app to validate baseline completeness.

Initial operating-mode target:

  1. tenant_dedicated
  2. control_plane_scope = project | tenant, depending on org policy and environment boundaries
  3. Project-owned app instances may attach to a project-scoped control plane for dev/test/stage/prod isolation, or to a tenant-scoped control plane for shared tenant schedulers.

Pilot phases

  1. Register Slurm in app catalog + publish version.
  2. Enable entitlement for test project(s).
  3. Create Slurm app instance via app instance API.
  4. Validate scheduler queue operations through permissioned endpoints.
  5. Run upgrade/rollback and delete flows.

Lab baseline:

  1. Reference control-stack assets are deployed on dev-lab-1.
  2. Worker-side join materials are deployed on dev-gpu-1.
  3. See doc/operations/Slurm_Reference_Lab_Stack.md.

Required acceptance criteria

  1. No Slurm-specific branches in core allocation handlers.
  2. All privileged actions produce audit logs with correlation_id.
  3. Service account operator can manage only same-project scheduler instances.
  4. Policy overlays correctly restrict regions/SKUs/artifact sources.
  5. Full incident path is traceable across logs, traces, and events.
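Criterion 3 lends itself to an explicit negative test. The client below is a stand-in written for illustration, not a real platform SDK:

```python
class FakeSchedulerAPI:
    """Stand-in control plane that enforces the project boundary."""

    def __init__(self, instances: dict[str, str]):
        self._instances = instances  # instance_id -> owning project_id

    def delete_instance(self, caller_project: str, instance_id: str) -> None:
        owner = self._instances.get(instance_id)
        if owner is None:
            raise KeyError(instance_id)
        if owner != caller_project:
            raise PermissionError("cross-project access denied")
        del self._instances[instance_id]

# Negative test: a service account in project-a must NOT be able to
# delete a scheduler instance owned by project-b.
api = FakeSchedulerAPI({"slurm-1": "project-b"})
try:
    api.delete_instance("project-a", "slurm-1")
    raise AssertionError("boundary check failed: cross-project delete succeeded")
except PermissionError:
    pass  # expected: the boundary held
```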

Gap log template (capture during pilot)

  1. Missing primitive in core.
  2. Leaky scheduler-specific coupling in core.
  3. Missing policy key or permission action.
  4. Missing event for operational triage.
  5. Missing billing attribution hook.

Baseline for Any Future App Team

An app is ready for onboarding only if all of the following are true:

  1. Uses app catalog + entitlement + app instance contracts (no hidden DB coupling).
  2. Uses service accounts for operator automation.
  3. Passes tenant/project boundary checks under negative tests.
  4. Emits required lifecycle events with correlation context.
  5. Supports policy-governed artifact sources.
  6. Defines upgrade, rollback, and delete behavior.
  7. Provides a support runbook with correlation-first triage steps.

Non-Negotiable Invariants

  1. Internal and external apps use the same contracts.
  2. No authz bypass for internal reference apps.
  3. No runtime hard dependency on one scheduler vendor in control-plane API semantics.
  4. No direct DB writes by app operators outside public contracts/events.

Related Documents

  1. doc/architecture/App_Control_Plane_v1.md
  2. doc/architecture/Clustered_App_Model_v1.md
  3. doc/architecture/Service_Account_Model.md
  4. doc/architecture/Role_and_Policy_Lifecycle_Model.md
  5. doc/architecture/Allocation_Node_Placement_v1.md
  6. doc/product/GPUaaS_vs_Armada_Bridge_Gap_Matrix.md
  7. doc/architecture/App_Runtime_Operating_Modes_v1.md
  8. doc/architecture/App_Tenant_Shared_Attachment_Model_v1.md