Slurm Tenant Scope Semantics v1

Purpose

Define what a tenant-scoped Slurm product actually means on GPUaaS, how it differs from the current project-scoped reference path, and which platform/app responsibilities must exist before implementation starts.

This document exists because the code and contracts now prove the project-scoped Slurm example path, but tenant-scoped and multi-project Slurm behavior is still only implied across several documents.

Current Status

Tenant-scoped Slurm now has a real backend/control-plane path, but it is not yet a fully productized operator flow.

What exists today:

- project-scoped app instances,
- project-scoped service accounts,
- project-scoped access-credential custody and delivery,
- project-scoped allocation selection and worker placement,
- project-scoped Slurm controller discovery by app_slug,
- tenant-owned shared runtime resources and attachments,
- delegated shared-runtime operator identity,
- shared worker and shared worker-operation resources,
- tenant-shared Slurm controller reconcile for shared worker add/drain/remove,
- attached-project worker contribution request path.

Canonical attachment-model follow-on:

- doc/architecture/App_Tenant_Shared_Attachment_Model_v1.md

What does not exist today:

- multi-project Slurm queue and account semantics,
- tenant-scoped operator workflow in the platform shell,
- submitted-job attribution and queue/account policy surfaced in product UI,
- end-to-end tenant-shared parity/UI proof.

Definitions

Project-scoped Slurm

A Slurm instance belongs to one project and serves only that project.

Properties:

- jobs are submitted within one project boundary,
- controller and worker allocations are selected from that project,
- the operator service account is project-scoped,
- the SSH/bootstrap credential is project-scoped,
- billing attribution is straightforward: all costs are project-local.

Tenant-scoped Slurm

A Slurm control plane is shared across multiple projects inside one tenant.

Properties:

- one Slurm control plane may serve more than one project,
- cross-project submission is denied by default and must be enabled by explicit policy,
- queue/partition/account visibility is app-owned policy on top of platform-owned tenant/project identity,
- operator actions may be tenant-scoped even if individual consuming app instances remain project-owned.
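The default-deny cross-project submission rule above can be sketched as a small policy check. This is a minimal sketch under assumptions: `SubmissionPolicy`, the project names, and the grant structure are illustrative, not platform API; the real policy store is app-owned.

```python
# Illustrative sketch of the default-deny cross-project submission rule.
# All names here are hypothetical assumptions, not a committed contract.

class SubmissionPolicy:
    def __init__(self):
        # (submitting_project, target_project) pairs enabled by explicit policy.
        self._grants: set[tuple[str, str]] = set()

    def allow_cross_project(self, submitter: str, target: str) -> None:
        self._grants.add((submitter, target))

    def may_submit(self, submitter: str, target: str) -> bool:
        # Same-project submission stays inside one project boundary; anything
        # cross-project is denied unless an explicit grant exists.
        if submitter == target:
            return True
        return (submitter, target) in self._grants


policy = SubmissionPolicy()
assert policy.may_submit("proj-a", "proj-a")      # in-project: always allowed
assert not policy.may_submit("proj-a", "proj-b")  # cross-project: denied by default
policy.allow_cross_project("proj-a", "proj-b")
assert policy.may_submit("proj-a", "proj-b")      # allowed only after explicit grant
```

The point of the sketch is the shape of the rule, not the storage: grants are explicit, directional, and absent by default.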

Required Product Semantics

Before tenant-scoped Slurm can be called supported, the product must define:

  1. Control-plane ownership
     - Is the Slurm control plane represented by one tenant-scoped app instance,
     - or by multiple project-owned app instances attached to one tenant-scoped runtime?

  2. Allocation ownership
     - which projects may contribute controller/worker allocations,
     - whether allocations stay project-owned while attached to a tenant-scoped scheduler,
     - how allocation visibility is presented to the operator.

  3. Job submission policy
     - whether project A may submit into queues owned by project B,
     - how project membership maps into Slurm accounts, partitions, QoS, or associations,
     - what the default deny rules are.

  4. Identity model
     - whether the Slurm controller runs under a tenant-scoped service account,
     - or a project-owned service account with explicit tenant-wide grants,
     - how submitted jobs are attributed back to platform identities and projects.

  5. Credential/data custody
     - whether bootstrap and runtime credentials remain project-scoped,
     - whether tenant-scoped schedulers require tenant-scoped custody,
     - how secrets and runtime config are separated between platform and app-owned storage.

  6. Billing attribution
     - whether controller costs are charged to one owner project,
     - split across attached projects,
     - or charged to a tenant-level shared cost center,
     - and how worker/runtime costs are apportioned.

First implemented baseline:

- tenant-owned controller and tenant-reserved capacity are charged to the tenant-shared runtime owner record,
- project-contributed workers remain charged to the contributing source project.
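The baseline billing rule above can be sketched as a single routing function. The `kind` values and record shapes are illustrative assumptions; only the two rules themselves come from the baseline.

```python
# Sketch of the first implemented billing baseline:
#   - tenant-owned controller and tenant-reserved capacity -> tenant-shared
#     runtime owner record,
#   - project-contributed workers -> contributing source project.
# Resource kinds and field names are hypothetical.

def charge_target(resource: dict, tenant_owner_record: str) -> str:
    if resource["kind"] in ("controller", "tenant_reserved_worker"):
        return tenant_owner_record
    if resource["kind"] == "contributed_worker":
        return resource["source_project"]
    raise ValueError(f"unknown resource kind: {resource['kind']}")

assert charge_target({"kind": "controller"}, "tenant-1-shared") == "tenant-1-shared"
assert charge_target(
    {"kind": "contributed_worker", "source_project": "proj-b"}, "tenant-1-shared"
) == "proj-b"
```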

If tenant-scoped Slurm is implemented, the first productized version should be narrow:

  1. One tenant-scoped Slurm control plane per tenant environment.
  2. Controller node(s) chosen explicitly by a tenant admin.
  3. Worker allocations chosen explicitly from an allowlisted set of projects in that tenant.
  4. Cross-project submission disabled by default.
  5. Project-to-Slurm-account mapping stored as app-owned state.
  6. Billing policy explicit and visible before deploy.

This keeps the first tenant-scoped slice understandable and avoids pretending that every cross-project scheduler question is solved automatically by the project-scoped app model.

Platform Responsibilities

Platform must provide:

- tenant/project identity and membership truth,
- explicit authorization for tenant-scoped app/operator actions,
- allocation read surfaces that can operate across eligible projects when policy allows,
- service-account and access-credential custody semantics that match the chosen scope,
- auditable read models for attached projects and placement choices.

Platform must not own:

- Slurm account/partition/QoS mapping logic,
- Slurm queue policy semantics,
- runtime-specific scheduler recovery behavior.

App Responsibilities

The Slurm app must own:

- mapping platform tenant/project identity into Slurm-native concepts,
- runtime config generation for tenant-shared schedulers,
- attached-project bookkeeping,
- queue/account policy enforcement inside the Slurm runtime,
- scheduler-native operational state.

This likely implies app-owned persistent state even for a tenant-dedicated scheduler product.
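A minimal sketch of what that app-owned persistent state could look like for attached-project bookkeeping, here as an SQLite table. The schema, table name, and columns are assumptions for illustration only.

```python
# Minimal sketch of app-owned persistent state for a tenant-shared scheduler:
# attached-project bookkeeping in an app-owned SQLite table. Schema and names
# are hypothetical.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE attached_project (
           project       TEXT PRIMARY KEY,
           slurm_account TEXT NOT NULL,
           may_submit    INTEGER NOT NULL DEFAULT 0
       )"""
)
conn.execute(
    "INSERT INTO attached_project VALUES (?, ?, ?)",
    ("proj-a", "acct_proj-a", 1),
)
row = conn.execute(
    "SELECT slurm_account FROM attached_project WHERE project = ?",
    ("proj-a",),
).fetchone()
assert row == ("acct_proj-a",)
```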

UI Expectations

Tenant-scoped Slurm should not reuse the current project-scoped UI unchanged.

The operator workflow will need explicit UI for:

- choosing tenant-shared mode,
- selecting attached projects,
- selecting controller and worker allocations across those projects,
- showing which projects may submit jobs,
- showing billing/ownership consequences.

This should be built through the platform shell extension model, but it is a distinct product flow, not a hidden variant of the current single-project Slurm deploy form.

Readiness Judgment

Tenant-scoped Slurm is no longer only a design task.

The platform/backend path is now real enough to support:

- tenant-owned shared runtime lifecycle,
- attached-project contribution,
- shared worker topology,
- delegated operator reconcile.

What is still not finished is the product surface:

- operator UI for tenant-shared deploy/attach/contribute,
- clearer queue/account/job-submission policy presentation,
- live parity validation through that UI path.

Immediate Next Decisions

  1. Decide whether tenant-scoped Slurm is modeled as:
     - one tenant-owned scheduler control plane,
     - or project-owned app instances attached to a tenant-shared runtime.

  2. Decide the first billing rule for a tenant-shared scheduler.

  3. Decide the identity model for the app controller in tenant scope.

  4. Decide the first attached-project and queue-visibility rules.

  5. Only after those decisions, define the tenant-scoped deploy contract and UI.

Related Documents

  1. doc/architecture/App_Tenant_Shared_Attachment_Model_v1.md
  2. doc/architecture/App_Runtime_Operating_Modes_v1.md
  3. doc/architecture/App_Runtime_Billing_Model_v1.md
  4. doc/architecture/Scheduler_as_Platform_App_v1.md
  5. doc/architecture/Shared_Runtime_Worker_Topology_v1.md