MAAS Execution Readiness v1

1. Purpose

Capture the last execution-facing decisions before the first real MAAS onboarding workflow is allowed to touch a live machine such as c07u31.

This is narrower than the lifecycle baseline:

- MAAS_Bare_Metal_Lifecycle_v1.md
- MAAS_Node_State_Model_v1.md
- MAAS_Recovery_Matrix_v1.md
- Provisioning_BareMetal_MAAS_API_Boundary_v1.md

Use this as the implementation checklist for the first execution slice.

2. First Live Fixture

Initial real-machine execution should be constrained to:

- one MAAS site
- one active profile
- one host only: c07u31
- one operator-initiated single-node onboarding path

Out of scope for the first live execution slice:

- batch execution
- auto-claiming multiple candidates
- multiple hardware classes
- destructive decommission automation beyond the documented release/remove paths

3. Decisions Locked Before Execution

3.1 Operator input

The operator-facing onboarding request is:

- site_id
- profile_id
- sku_id
- hostname
- ipmi_ip

Not part of the normal request:

- pxe_mac
- inline RoCE mapping
- per-request retry/discovery overrides
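A minimal sketch of how the request boundary above could be enforced, assuming the request arrives as a plain dictionary. The `OnboardingRequest` class and `parse_onboarding_request` function are hypothetical names used for illustration only; the five field names come from the list above.

```python
import ipaddress
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class OnboardingRequest:
    """Hypothetical request type holding only the documented operator fields."""
    site_id: str
    profile_id: str
    sku_id: str
    hostname: str
    ipmi_ip: str


def parse_onboarding_request(payload: dict) -> OnboardingRequest:
    allowed = {f.name for f in fields(OnboardingRequest)}
    # Anything outside the documented fields (pxe_mac, inline RoCE mapping,
    # per-request overrides) is rejected rather than silently ignored.
    unexpected = sorted(set(payload) - allowed)
    if unexpected:
        raise ValueError(f"fields not accepted in an onboarding request: {unexpected}")
    missing = sorted(allowed - set(payload))
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    # ip_address raises ValueError on a malformed address.
    ipaddress.ip_address(payload["ipmi_ip"])
    return OnboardingRequest(**payload)
```

Rejecting unknown keys outright, instead of dropping them, keeps an accidental `pxe_mac` or override from looking as if it had been honored.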

3.2 Site/profile policy

The execution path must read these from site/profile policy, not request-time overrides:

- discovery/adoption behavior
- strict PXE preflight
- deploy retry policy
- hardware-sync requirement
- phase-2 RoCE enablement
- site bootstrap bundle reference

3.3 RoCE

RoCE phase-2 is a separate site-scoped managed input:

- pre-created assignment records keyed by site_id + hostname
- consumed only when the selected profile enables RoCE phase-2
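The lookup rule above can be sketched as follows. This assumes assignment records are held in a dictionary keyed by a `(site_id, hostname)` tuple; the function name, the record shape, and the choice to fail when an enabled profile has no record are all illustrative assumptions, not confirmed behavior.

```python
from typing import Dict, Optional, Tuple


def resolve_roce_assignment(
    assignments: Dict[Tuple[str, str], dict],
    site_id: str,
    hostname: str,
    roce_phase2_enabled: bool,
) -> Optional[dict]:
    # Profiles without RoCE phase-2 never consume an assignment record,
    # even if one happens to exist for the host.
    if not roce_phase2_enabled:
        return None
    key = (site_id, hostname)
    if key not in assignments:
        # Assumption: a missing pre-created record is an error, not a skip.
        raise LookupError(f"no RoCE assignment record for {site_id}/{hostname}")
    return assignments[key]
```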

3.4 Site bootstrap bundle

The ../maas prototype shows that the so-called extra cloud-init payload is effectively:

- a script/bundle launcher
- infra-owned
- executed before GPUaaS node-agent bootstrap

GPUasService must treat it as:

- site/profile-owned first-boot bootstrap
- not a free-form replacement for GPUaaS node bootstrap
- separately versioned from GPUaaS bootstrap content so error tracking can distinguish site bootstrap failure from GPUaaS bootstrap failure

3.5 SSH exception boundary

GPUasService should not own a persistent SSH management channel for MAAS nodes.

If an initial post-deploy seed still requires host access before node-agent enrollment:

- treat it as a bootstrap-time exception only
- do not store or manage long-lived SSH private keys in GPUasService
- move recurring post-enrollment repair/reseed actions to typed node-agent tasks

4. Prototype Behaviors That Must Be Preserved Intentionally

These are not a product contract by themselves, but they reflect real behavioral intent from ../maas and must be reviewed before the implementation diverges from them.

4.1 Discovery and adoption

onboard_node.sh uses this order:

1. match existing machine by hostname
2. match existing machine by power_address / IPMI IP
3. optionally match by PXE MAC if available
4. create machine with IPMI credentials
5. if needed, power-cycle and wait for discovery
6. optionally auto-claim a sole New machine if policy allows

GPUasService execution must preserve:

- deterministic preference order
- fail-closed behavior on ambiguity
- policy-driven auto-claim only
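The matching part of that order (steps 1 to 3) and the auto-claim rule (step 6) can be sketched as below. This is not the logic of onboard_node.sh itself: the machine dictionaries, the `pxe_mac` and `status_name` keys, and the function names are assumptions chosen for illustration.

```python
from typing import List, Optional


class AmbiguousMatch(Exception):
    """Raised when one matcher selects more than one machine (fail closed)."""


def find_existing_machine(
    machines: List[dict], hostname: str, ipmi_ip: str, pxe_mac: Optional[str] = None
) -> Optional[dict]:
    # Matchers are tried in a fixed preference order; the first one that
    # yields exactly one machine wins.
    matchers = [
        ("hostname", lambda m: m.get("hostname") == hostname),
        ("power_address", lambda m: m.get("power_address") == ipmi_ip),
    ]
    if pxe_mac:
        wanted = pxe_mac.lower()
        matchers.append(("pxe_mac", lambda m: (m.get("pxe_mac") or "").lower() == wanted))
    for name, predicate in matchers:
        hits = [m for m in machines if predicate(m)]
        if len(hits) > 1:
            raise AmbiguousMatch(f"{len(hits)} machines match by {name}")
        if hits:
            return hits[0]
    return None  # caller proceeds to create the machine


def pick_auto_claim_candidate(machines: List[dict], auto_claim_allowed: bool) -> Optional[dict]:
    # Auto-claim only when policy allows it and exactly one New machine exists.
    if not auto_claim_allowed:
        return None
    new_machines = [m for m in machines if m.get("status_name") == "New"]
    return new_machines[0] if len(new_machines) == 1 else None
```

Raising on multiple hits, instead of picking the first, is what keeps the lookup fail-closed on ambiguity.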

4.2 Commissioning

The prototype commissions with:

- enable_ssh=1
- skip_bmc_config=1

This is a real MAAS quirk and must be carried forward intentionally if still required by the environment.

4.3 PXE preflight

The prototype actively ensures the node PXE interface is deploy-safe before deploy:

- relink to AUTO
- validate subnet association
- fail early if the PXE interface cannot reach metadata

This is directly tied to the datasource-readme.md failure mode and should remain first-class.

4.4 Deploy retry classification

The prototype does not blindly retry deploy. Instead, it:

- classifies datasource/cloud-init-like failures
- aborts in-progress operations if needed
- releases back to Ready
- retries once under controlled policy

GPUasService execution should preserve that bounded, classified retry model.
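A minimal sketch of a bounded, classified retry, assuming failures surface as `RuntimeError` with a message that can be inspected. The marker strings, the callables, and the function names are hypothetical; the real classification in the prototype may use different signals, and this sketch always aborts before release where the prototype aborts only if needed.

```python
from typing import Callable

# Assumed markers for datasource/cloud-init-like failures.
RETRYABLE_MARKERS = ("datasource", "cloud-init")


def is_retryable_deploy_failure(message: str) -> bool:
    text = message.lower()
    return any(marker in text for marker in RETRYABLE_MARKERS)


def deploy_with_bounded_retry(
    deploy: Callable[[], None],
    abort: Callable[[], None],
    release_to_ready: Callable[[], None],
    max_retries: int = 1,
) -> int:
    """Run deploy, retrying only classified failures. Returns attempts used."""
    retries = 0
    while True:
        try:
            deploy()
            return retries + 1
        except RuntimeError as exc:
            # Unclassified failures and exhausted budgets propagate unchanged.
            if retries >= max_retries or not is_retryable_deploy_failure(str(exc)):
                raise
            abort()
            release_to_ready()
            retries += 1
```

Keeping `max_retries` as a parameter leaves room for the open question of whether the retry count should be profile-configurable, while defaulting to the single retry the prototype uses.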

4.5 Release fallback behavior

release_node.sh encodes an important recovery path:

- requested erase release may fail
- if policy allows, fall back to:
  - abort
  - mark-fixed
  - forced no-erase release

This should be preserved in decommission/reimage execution semantics, not rediscovered later.
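The fallback sequence can be sketched as below. The callables stand in for whatever MAAS operations the execution layer wraps; their names and the `erase`/`force` keyword arguments are assumptions for illustration, not the signature of release_node.sh or of any MAAS client.

```python
from typing import Callable


def release_with_fallback(
    release: Callable[..., None],
    abort: Callable[[], None],
    mark_fixed: Callable[[], None],
    allow_fallback: bool,
) -> str:
    """Try an erase release first; fall back only when policy allows it."""
    try:
        release(erase=True, force=False)
        return "erase-release"
    except RuntimeError:
        if not allow_fallback:
            raise
    # Fallback path, in the documented order.
    abort()
    mark_fixed()
    release(erase=False, force=True)
    return "forced-no-erase-release"
```

Returning the path taken lets the caller record in the event timeline whether the disk was actually erased.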

5. First Workflow Slice Scope

The first real execution slice should include:

- load site/profile and validate preflight
- create/find machine in MAAS
- commission
- wait for Ready
- configure storage / PXE preflight
- render composed first-boot payload
- deploy
- wait for Deployed
- stop before any broader multi-node orchestration concerns

It may keep these as follow-on stages in the same workflow, but the first live proof should remain narrow:

- hardware-sync seed/health
- node-agent enrollment
- final active transition

6. First Live Run Guardrails

Before first execution against c07u31, require:

- site probe healthy
- selected profile active
- selected SKU valid
- no open onboarding for the host
- explicit operator confirmation that the target host is the approved test fixture
- event timeline visible in GPUasService before the workflow starts
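One way to make these requirements checkable is to evaluate them together and report every unmet one, so the operator sees the full picture before a run. The `PreRunState` type and its field names are hypothetical; how each value is obtained is outside this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PreRunState:
    """Hypothetical snapshot of the pre-run facts listed above."""
    site_probe_healthy: bool
    profile_active: bool
    sku_valid: bool
    open_onboarding_for_host: bool
    operator_confirmed_fixture: bool
    event_timeline_visible: bool


def guardrail_violations(state: PreRunState) -> List[str]:
    checks = [
        (state.site_probe_healthy, "site probe is not healthy"),
        (state.profile_active, "selected profile is not active"),
        (state.sku_valid, "selected SKU is not valid"),
        (not state.open_onboarding_for_host, "an onboarding is already open for the host"),
        (state.operator_confirmed_fixture, "operator has not confirmed the test fixture"),
        (state.event_timeline_visible, "event timeline is not visible"),
    ]
    # An empty list means the run may start.
    return [reason for ok, reason in checks if not ok]
```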

Recommended first run sequence:

1. start single-node onboarding for c07u31
2. watch MAAS machine state changes directly
3. compare workflow events with MAAS events
4. stop after first controlled success or first controlled manual-intervention boundary

7. Open Implementation Checks

These should be resolved while building the first execution slice:

- whether skip_bmc_config=1 is still required in this site
- whether initial hardware-sync seed can be satisfied entirely by first-boot bootstrap, eliminating SSH even as a bootstrap exception
- whether the site bootstrap bundle should move from file path to OCI-backed or control-plane-managed bundle reference now or after first live proof
- whether datasource retry should remain exactly one retry or be profile-configurable from day one

8. Post-Live Concrete Steps

The first complete live run exposed a small number of repeatable failure classes. These should become explicit implementation steps, not operator memory.

8.1 Predeploy hardening

Before MAAS deploy starts, add guards for:

- bootstrap endpoint reachability from the MAAS node subnet, not just from platform-control
- fetchability of the exact rendered bootstrap package or script URL handed to the node
- bootstrap CA material being real PEM rather than a template placeholder
- bootstrap-token and enrollment-token persistence being present and internally consistent
- site/profile bootstrap inputs resolving cleanly before deploy
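As an illustration of the CA guard, the sketch below checks that the rendered CA text has the shape of a PEM certificate and contains no leftover template markers. The placeholder patterns are assumptions about what an unrendered template might look like, and this is a shape check only: it does not validate the certificate or its chain.

```python
import base64
import binascii
import re

PEM_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----\s+(?P<body>[A-Za-z0-9+/=\s]+?)\s*-----END CERTIFICATE-----"
)
# Assumed placeholder styles: {{ name }}, ${NAME}, <NAME>.
PLACEHOLDER = re.compile(r"\{\{.*?\}\}|\$\{.*?\}|<[A-Z_]+>")


def looks_like_real_pem(text: str) -> bool:
    if PLACEHOLDER.search(text):
        return False
    match = PEM_BLOCK.search(text)
    if not match:
        return False
    body = "".join(match.group("body").split())
    try:
        der = base64.b64decode(body, validate=True)
    except (binascii.Error, ValueError):
        return False
    # A DER-encoded certificate starts with an ASN.1 SEQUENCE tag (0x30).
    return len(der) > 0 and der[0] == 0x30
```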

8.2 Post-deploy completion gates

Do not mark onboarding completed until all of these are true:

- MAAS machine is Deployed
- node-agent install/start succeeded
- first enrollment succeeded
- the GPUaaS node row reached active
- onboarding attempt and completion state match the currently executing run
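The gates above can be expressed as a single predicate over a snapshot of the run. The `OnboardingSnapshot` type and its field names are hypothetical, and comparing attempt numbers is only one assumed way to check that completion state belongs to the current run.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OnboardingSnapshot:
    """Hypothetical view of the facts needed to decide completion."""
    maas_status: str
    agent_started: bool
    first_enrollment_ok: bool
    node_row_state: str
    recorded_attempt: int
    executing_attempt: int


def unmet_completion_gates(s: OnboardingSnapshot) -> List[str]:
    gates = [
        (s.maas_status == "Deployed", "MAAS machine is not Deployed"),
        (s.agent_started, "node-agent install/start has not succeeded"),
        (s.first_enrollment_ok, "first enrollment has not succeeded"),
        (s.node_row_state == "active", "GPUaaS node row is not active"),
        (s.recorded_attempt == s.executing_attempt,
         "completion state belongs to a different attempt"),
    ]
    # Onboarding may be marked completed only when this list is empty.
    return [reason for ok, reason in gates if not ok]
```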

8.3 Control-plane diagnostics first

Expose MAAS-side diagnostics through GPUaaS before relying on host SSH:

- MAAS install/deploy output
- relevant MAAS machine events
- MAAS script or commissioning result output where available

These should appear on onboarding detail APIs and UI first.

8.4 Loki export second

After diagnostics retrieval exists in the control plane, add a clean export path for centralized analysis with:

- site_id
- onboarding_id
- node_id
- maas_system_id
- hostname
- correlation_id

Loki should extend the control-plane diagnostics surface, not replace it.
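A sketch of an export record that always carries the six identifiers listed above. It only builds a JSON line; how that line is shipped to Loki is not shown, and the function name and the rule of rejecting records with missing identifiers are assumptions.

```python
import json

EXPORT_KEYS = (
    "site_id",
    "onboarding_id",
    "node_id",
    "maas_system_id",
    "hostname",
    "correlation_id",
)


def build_export_record(context: dict, message: str) -> str:
    # Assumption: a record without every identifier is rejected, so exported
    # lines can always be joined back to the control-plane diagnostics.
    missing = [key for key in EXPORT_KEYS if not context.get(key)]
    if missing:
        raise ValueError(f"export context is missing: {missing}")
    record = {key: context[key] for key in EXPORT_KEYS}
    record["message"] = message
    return json.dumps(record, sort_keys=True)
```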

8.5 Enrollment-token consistency

Fix the token consistency defect independently of the diagnostics work:

- reruns must keep bootstrap-token and enrollment-token storage aligned
- onboarding attempt numbering and completion timestamps must reflect the active run
- no direct Redis recovery should be required to finish node enrollment

9. Summary

The MAAS domain is ready to move from acceptance/read-model creation into execution, but the first live workflow must preserve the proven prototype intent where it matters:

- deterministic discovery
- PXE safety before deploy
- bounded datasource retry
- release fallback semantics
- infra-owned site bootstrap before GPUaaS node-agent bootstrap

Do not let the first Temporal implementation quietly simplify away these behaviors.