# App Runtime External Worker Contract v1

## Purpose
Define the long-term public integration boundary for platform apps that are developed outside the GPUaaS core repository.
This note exists to answer:
1. what contract an external app team should build against,
2. which primitives the platform must expose,
3. what stays platform-internal versus app-owned,
4. whether NATS or Temporal is the public app-developer contract,
5. how the current in-repo Slurm reference should evolve into an external-style model.
Use this with:
- doc/architecture/Build_an_App_for_GPUaaS_v1.md
- doc/architecture/External_App_Team_Integration_Guide_v1.md
- doc/architecture/App_Platform_Primitive_Boundary_v1.md
- doc/architecture/App_UI_Extension_Model_v1.md
- doc/architecture/App_Control_Plane_v1.md
- doc/architecture/CLI_Agent_Operable_Control_Plane_v2.md
## Decision Summary
- The public app-developer contract should be a narrow platform runtime API plus app-owned worker identity.
- The platform may use NATS, Temporal, or other machinery internally, but those are not the first-class external contract by default.
- App lifecycle intelligence should move out of the platform's in-process switch statements and into app-owned workers over time.
- Transport is a delivery mechanism, not the primary contract.
- Slurm remains the proving example app for whether the platform exposes enough primitives for an external app team.
- The first public delivery mechanism is polling over the runtime API. Slug-scoped NATS may be added later as an optimization, but polling is the compatibility contract.
- Runtime contracts should become manifest-described. Built-in Slurm/RKE2 descriptors are compatibility bridges, not the long-term onboarding model.
## Why This Exists
Today the platform still contains some runtime-specific logic in core code paths.
That is acceptable for the first proving slice, but it is not the final boundary for a platform that expects external app developers to build without submitting platform-core pull requests.
The long-term question is not:
- should the platform use NATS or Temporal internally

The real question is:
- what public contract does an app team implement against while keeping runtime-specific logic out of the platform core
## Core Principle

The external boundary should be:
- contract-first,
- language-neutral,
- authz-scoped,
- app-owned for runtime logic,
- narrow enough that the platform can evolve internally without breaking app teams.

This means the platform should expose:
- stable APIs,
- stable event payloads where appropriate,
- scoped machine identity,
- credential and placement primitives,
- status and operation reporting surfaces.

This does not mean the platform must expose:
- Temporal internals,
- internal switch statements,
- direct database access,
- repo-coupled adapter hooks.
## Public Contract Shape
The public contract for an external app worker should have four parts.
The binding contract is the API surface, identity model, and report/query envelope below. App teams must not depend on platform database tables, Temporal workflow names, or internal NATS consumer names.
### 1. Runtime read/query surface

The app worker needs to read:
- the runtime object it is responsible for,
- effective placement and allocation binding,
- attached projects where relevant,
- members/workers and operation queues where relevant,
- wrapped credentials or runtime bundles it is authorized to consume.

Examples of the kinds of information this surface must provide:
- runtime status
- effective operating mode and scope
- placement intent
- bound allocations and connection details
- attachment metadata for tenant-shared runtimes
- worker/member operation envelopes
Current project-scoped read/query endpoints:
- GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}
- GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/members
- GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/members/{member_id}
- GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/member-operations
- GET /api/v1/projects/{project_id}/app-instances/{app_instance_id}/member-operations/{operation_id}
Current tenant-shared read/query endpoints:
- GET /api/v1/orgs/{org_id}/shared-app-runtimes/{shared_runtime_id}
- GET /api/v1/orgs/{org_id}/shared-app-runtimes/{shared_runtime_id}/attachments
- GET /api/v1/orgs/{org_id}/shared-app-runtimes/{shared_runtime_id}/attachments/{attachment_id}
- GET /api/v1/orgs/{org_id}/shared-app-runtimes/{shared_runtime_id}/workers
- GET /api/v1/orgs/{org_id}/shared-app-runtimes/{shared_runtime_id}/workers/{worker_id}
- GET /api/v1/orgs/{org_id}/shared-app-runtimes/{shared_runtime_id}/worker-operations
- GET /api/v1/orgs/{org_id}/shared-app-runtimes/{shared_runtime_id}/worker-operations/{operation_id}
### 2. Runtime reporting surface

The app worker needs to report:
- lifecycle progress,
- runtime-specific status,
- operation success/failure,
- deterministic failure reasons,
- runtime state detail that the platform can surface to operators.
This is the control surface that keeps app logic out of platform-owned service internals.
Current project-scoped report endpoints:
- POST /api/v1/projects/{project_id}/app-instances/{app_instance_id}/report
- POST /api/v1/projects/{project_id}/app-instances/{app_instance_id}/members/{member_id}/report
- POST /api/v1/projects/{project_id}/app-instances/{app_instance_id}/member-operations/{operation_id}/report
- POST /api/v1/projects/{project_id}/app-instances/{app_instance_id}/repair-operations/{repair_operation_id}/report
Current tenant-shared report endpoints:
- POST /api/v1/orgs/{org_id}/shared-app-runtimes/{shared_runtime_id}/report
- POST /api/v1/orgs/{org_id}/shared-app-runtimes/{shared_runtime_id}/attachments/{attachment_id}/report
- POST /api/v1/orgs/{org_id}/shared-app-runtimes/{shared_runtime_id}/workers/{worker_id}/report
- POST /api/v1/orgs/{org_id}/shared-app-runtimes/{shared_runtime_id}/worker-operations/{operation_id}/report
The generic envelope follows these rules:
- platform status fields are constrained enums,
- adapter-owned data lives under runtime_state, health_detail, endpoint, or adapter_detail,
- deterministic failures use failure_reason, last_error_code, and last_error_message,
- all calls preserve the request correlation id.
### 3. Scoped machine identity

The app worker needs a platform-issued machine identity model that is:
- narrow,
- revocable,
- runtime-scoped or project-scoped as appropriate,
- explicit about which routes are allowed.

Examples:
- project-scoped service account for project-owned app instances
- shared-runtime operator identity for tenant-owned shared runtimes
Binding model:
- project-scoped app-instance workers use project service-account tokens carrying actor_type=service_account and project_id;
- tenant-shared runtime workers use delegated shared-runtime operator tokens carrying actor_type=shared_runtime_operator, org_id, and shared_runtime_id;
- human user tokens may create/control app resources, but long-running app workers should use machine identity;
- machine actors are denied by default outside their allowlist.
Initial project service-account allowlist:
- project app-instance read/report routes,
- project member and member-operation read/report routes,
- app runtime secret delivery for the bound project,
- project allocation reads needed by the app controller.

Initial shared-runtime operator allowlist:
- bound shared-runtime read/report routes,
- bound attachment read/report routes,
- bound worker and worker-operation read/report routes,
- tenant allocation reads filtered by shared-runtime attachment/resource binding.
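To illustrate the deny-by-default model, here is a sketch of a route-allowlist check. The claims shape, its field names, and prefix-based matching are simplifying assumptions; real enforcement would use the platform's route patterns.

```go
package main

import (
	"fmt"
	"strings"
)

// MachineClaims is a hypothetical token-claims shape; the actor_type
// values come from this document, the Go field names do not.
type MachineClaims struct {
	ActorType       string // "service_account" or "shared_runtime_operator"
	ProjectID       string // set for project-scoped service accounts
	OrgID           string // set for shared-runtime operators
	SharedRuntimeID string // set for shared-runtime operators
}

// allowlist maps an actor to its permitted route prefixes. Prefix
// matching keeps the sketch short; real matching would be pattern-based.
func allowlist(c MachineClaims) []string {
	switch c.ActorType {
	case "service_account":
		return []string{
			"/api/v1/projects/" + c.ProjectID + "/app-instances/",
			"/api/v1/projects/" + c.ProjectID + "/allocations",
		}
	case "shared_runtime_operator":
		return []string{
			"/api/v1/orgs/" + c.OrgID + "/shared-app-runtimes/" + c.SharedRuntimeID,
		}
	}
	return nil // machine actors are denied by default
}

// Allowed reports whether the claims permit the route.
func Allowed(c MachineClaims, route string) bool {
	for _, prefix := range allowlist(c) {
		if strings.HasPrefix(route, prefix) {
			return true
		}
	}
	return false
}

func main() {
	c := MachineClaims{ActorType: "service_account", ProjectID: "p-123"}
	fmt.Println(Allowed(c, "/api/v1/projects/p-123/app-instances/ai-1"))
	fmt.Println(Allowed(c, "/api/v1/projects/p-999/app-instances/ai-1"))
}
```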
### 4. Delivery trigger
The app worker needs some way to know work is available.
This may be:
- event delivery,
- polling plus read-model queries,
- a future callback/webhook shape.
This is important, but it is secondary to the runtime API and identity contract.
Decision for the first public delivery mechanism:
- polling is the required compatibility contract;
- slug-scoped NATS is an optional future delivery acceleration, not required for external app teams;
- Temporal remains platform-internal unless a specific app team chooses to run its own workflow engine behind the public API.
Polling baseline:
- worker starts with its scoped machine token,
- worker reads the runtime object and pending operation list,
- worker reports progress/results through report endpoints,
- worker repeats using backoff and correlation ids.
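The polling baseline can be sketched as a loop with injected read and report steps. The function shapes, backoff values, and correlation-id scheme are assumptions; only the poll/report/backoff/correlation structure comes from the contract above.

```go
package main

import (
	"fmt"
	"time"
)

// pollFn represents one read of the runtime object plus its pending
// operation list; reportFn represents one call to a report endpoint.
// Both are injected so the loop stays transport-agnostic.
type pollFn func() (pendingOps []string, err error)
type reportFn func(opID, correlationID string) error

// runWorker polls with exponential backoff that resets after a successful
// read, and threads one correlation id per poll cycle. maxCycles bounds
// the sketch; a real worker would loop until shutdown.
func runWorker(poll pollFn, report reportFn, maxCycles int) error {
	backoff := 100 * time.Millisecond
	const maxBackoff = 5 * time.Second
	for cycle := 0; cycle < maxCycles; cycle++ {
		corrID := fmt.Sprintf("cycle-%d", cycle) // hypothetical correlation-id scheme
		ops, err := poll()
		if err != nil {
			time.Sleep(backoff)
			backoff *= 2
			if backoff > maxBackoff {
				backoff = maxBackoff
			}
			continue
		}
		backoff = 100 * time.Millisecond // reset after a successful read
		for _, op := range ops {
			if err := report(op, corrID); err != nil {
				return err
			}
		}
	}
	return nil
}

func main() {
	var reported []string
	poll := func() ([]string, error) { return []string{"op-1"}, nil }
	report := func(op, corr string) error {
		reported = append(reported, op+"/"+corr)
		return nil
	}
	if err := runWorker(poll, report, 2); err != nil {
		panic(err)
	}
	fmt.Println(reported)
}
```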
Repair/reconcile baseline:
- app workers consume repair operations through the same polling-compatible API model;
- reconcile is preferred for non-destructive drift correction;
- repair is reserved for scoped remediation steps that the runtime contract declares safe;
- Kubernetes/RKE2 workers should reject or avoid generic restart, because server, agent, and whole-runtime scopes have different safety semantics.
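To illustrate the scope-safety point, a sketch of scope-gated remediation. The scope names and the safety table are assumptions standing in for what a real runtime contract would declare.

```go
package main

import "fmt"

// safeActions maps a repair scope to the remediation steps a hypothetical
// runtime contract declares safe. "restart" is deliberately absent for the
// server and whole-runtime scopes because restart has different safety
// semantics at each scope.
var safeActions = map[string]map[string]bool{
	"agent":   {"restart": true, "drain": true},
	"server":  {"reconcile": true},
	"runtime": {"reconcile": true},
}

// permit rejects any action a scope does not explicitly declare safe,
// so a generic restart cannot leak across scopes.
func permit(scope, action string) error {
	if !safeActions[scope][action] {
		return fmt.Errorf("action %q not declared safe for scope %q", action, scope)
	}
	return nil
}

func main() {
	fmt.Println(permit("agent", "restart"))   // declared safe in this sketch
	fmt.Println(permit("runtime", "restart")) // rejected
}
```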
## NATS vs Temporal

### Decision
NATS and Temporal are platform capabilities, not the primary app-developer contract.
### What the platform should say publicly

The platform should say:
1. here is how your worker authenticates,
2. here is how your worker learns work is pending,
3. here is how your worker reads placement, credentials, and operations,
4. here is how your worker reports status and failures.

The platform should not force every app developer to adopt:
- Temporal SDK semantics,
- NATS consumer semantics,
unless that transport is intentionally declared as part of the supported external integration model.
### Practical guidance
- NATS is a good fit for advanced operator-style apps and internal reference apps.
- Temporal remains appropriate for platform-internal long-running workflows and any app-owned workflow engine that a team chooses to run for itself.
- External app teams may use Temporal, NATS, cron, or their own controllers internally as long as they consume the supported platform contract.
### Current recommendation

Near term:
- keep NATS/event-driven integration as an available and likely first reference path,
- but do not define "must run a NATS consumer" as the only external app model until the runtime API, auth model, and support story are fully locked down.
## Transport Principle
Transport should be treated as replaceable or additive.
The stable layer should be:
- runtime resource model,
- authz model,
- report/query APIs,
- payload schemas,
- correlation model.

Possible delivery mechanisms can then be layered on top:
- slug-scoped NATS subjects
- polling worker
- future callback/webhook registration
This keeps the platform from overcommitting to one transport before the resource and auth contracts are fully mature.
## What Stays Platform-Owned

The platform should continue to own:
- catalog, entitlement, and publication flow
- project/tenant identity and IAM
- runtime and attachment ownership model
- allocation lifecycle and infrastructure provisioning
- secure credential custody and delivery
- audit, correlation, and observability requirements
- operator-facing generic read models and shell UX

The platform should also define:
- canonical event payloads where they exist,
- canonical error envelopes,
- stable SDK surfaces for supported languages.
## What Moves To App-Owned Workers

The app team should own:
- runtime bootstrap logic after platform placement is ready
- runtime-specific health evaluation
- role/member semantics
- runtime-specific operation handling
- join, drain, remove, and recovery semantics where they are runtime-specific
- runtime-specific config rendering and control-plane behavior
In the long-term external model, these should not require platform-core code changes per app.
## Slurm's Role In The Model
Slurm is not just a feature. It is the first proving app for whether the platform exposes enough primitives for an app team to build on top.
That means Slurm should validate:
- runtime read/report APIs
- credential delivery
- allocation binding visibility
- service-account and shared-runtime operator identity
- tenant-shared attachment and worker contribution semantics

It should not permanently justify:
- platform-owned switch statements for every runtime backend
- app-specific lifecycle code living forever inside platform services
## Catalog Registration Direction
The long-term app registration model should move toward an app manifest flow.
The platform should eventually support:
- app registration by manifest
- app version registration by manifest
- schema-backed validation for app config and placement
- publish/approve lifecycle
- entitlement/grant lifecycle
- external worker contract descriptors for component topology and operation seeding

Seed SQL remains appropriate for:
- foundational reference data,
- first-party/reference apps,
- local bootstrap.

It should not, however, be the only story for app onboarding.
## Schema Direction
The long-term direction should allow an app/version to declare:
- placement input schema
- config schema
- credential requirements
- supported operating modes and scopes
- whether generic shell metadata is sufficient or a custom UI extension is required
- external worker contract:
  - delivery mode, initially polling
  - root component key
  - root allocation-id field
  - child operation component key
  - child allocation-ids field
  - adapter phase labels that are safe for operator display
This aligns with the existing:
- placement_intent storage model
- metadata-driven shell direction
- extension-registry model
The missing piece is not storage. It is a published validation and registration contract.
## Migration Stages

### Stage 1: Prove the primitives with in-repo reference apps

Current state:
- the platform still contains some runtime-specific logic,
- reference apps validate the primitive set,
- service-account and shared-runtime operator identity are real,
- runtime report/query surfaces exist for the current example apps.
### Stage 2: Define the external worker contract explicitly
Required outputs:
- public runtime report/query API surface: complete for the current project-scoped and tenant-shared runtime primitives listed above
- app-worker identity contract: project service account for project scope; delegated shared-runtime operator for tenant-shared scope
- delivery contract: polling first; NATS optional later
- schema/manifest registration direction: external_worker_contract descriptor added as the target manifest shape
### Stage 3: Extract one reference app to the external-style model

The first extraction should prove:
- no platform-core PR is needed to ship app lifecycle logic,
- the runtime worker can authenticate and operate through supported APIs only,
- operator UX still works through platform-owned read models and shell surfaces.
Current extraction slice:
- project-scoped Slurm and RKE2 launch seeding now flows through a generic external-worker runtime descriptor instead of a direct runtime_backend switch in the app-instance request handler;
- Slurm/RKE2 built-in descriptors are only compatibility bridges until their app manifests carry external_worker_contract;
- runtime-specific reconciliation remains in app-owned controllers (cmd/slurm-reference-controller, cmd/rke2-self-managed-controller) that poll/read/report through the public runtime APIs.
### Stage 4: Support third-party app teams as a product surface

Only after Stage 3 proves out should the platform claim that external app teams can register and operate their own platform apps without platform-core code changes.
## Near-Term Decisions
- Do not prematurely expose Temporal as the app-developer contract.
- Do not prematurely require every app team to consume NATS directly.
- Define the runtime API and identity boundary first.
- Keep Slurm as the proving example for that boundary.
- Move toward manifest and schema registration only after the runtime contract is explicit enough to support it cleanly.
## Remaining Open Questions
These are still open and should be treated as follow-on design work:
- When do we expose slug-scoped NATS as an optional external delivery mechanism?
- How do we productize self-service app manifest approval/publish without bypassing platform review?
- Which app manifest fields become strictly validated before third-party onboarding?
- When should the built-in compatibility descriptors for reference apps be removed in favor of manifest-only descriptors?
## Current Recommendation
The platform should proceed in this order:
1. keep the polling runtime API as the stable public contract,
2. migrate reference app manifests to declare external_worker_contract,
3. keep using Slurm to prove the contract,
4. add slug-scoped NATS only as an optional acceleration once the polling contract has proven stable,
5. productize manifest/schema registration once the reference descriptors are manifest-driven.
That sequencing keeps the platform focused on the real boundary: reusable primitives and contracts, rather than prematurely freezing one internal transport or orchestration style as the app platform itself.