Platform Control Release Pipeline Model

As of: March 17, 2026

Purpose

Document the current platform-control CI and release operating model so future changes do not reintroduce hidden assumptions.

This document is the durable reference for:

  1. runner lanes,
  2. blocking versus report-only checks,
  3. release branch behavior,
  4. CI base image usage,
  5. runtime image publishing and deploy flow,
  6. current runtime image policy.

Canonical Dev-Control Public Config

As of 2026-04-24, the canonical public entrypoints for the live dev-control environment are the Funnel hosts, not the legacy sslip.io app/auth/api web entrypoints.

Canonical browser-facing hosts:

  - https://gpuaas-dev-app.tailfe39f5.ts.net
  - https://gpuaas-dev-api.tailfe39f5.ts.net
  - https://gpuaas-dev-auth.tailfe39f5.ts.net
  - wss://gpuaas-dev-term.tailfe39f5.ts.net

Required config alignment:

  - APP_BASE_URL=https://gpuaas-dev-app.tailfe39f5.ts.net
  - KEYCLOAK_PUBLIC_ISSUER_URL=https://gpuaas-dev-auth.tailfe39f5.ts.net/realms/gpuaas
  - KEYCLOAK_TOKEN_BASE_URL=https://gpuaas-dev-auth.tailfe39f5.ts.net
  - NEXT_PUBLIC_GRAFANA_BASE_URL=https://gpuaas-dev-app.tailfe39f5.ts.net/backend/p/grafana
  - NEXT_PUBLIC_NOTIFICATIONS_WS_BASE_URL=wss://gpuaas-dev-api.tailfe39f5.ts.net
  - NEXT_PUBLIC_WS_BASE_URL=wss://gpuaas-dev-term.tailfe39f5.ts.net
  - Grafana:
    - GF_SERVER_ROOT_URL=https://gpuaas-dev-app.tailfe39f5.ts.net/backend/p/grafana/
    - GF_SERVER_SERVE_FROM_SUB_PATH=true

Operational rule:

  - after recreating or materially reconfiguring dev-control, run scripts/ops/configure_platform_control_dev_auth_urls.sh
  - then rerun proxy smoke on Funnel hosts before considering the environment ready

Reason: mixed Funnel + sslip.io public config produced misleading failures in auth, Grafana, Jupyter, and launcher/browser-session behavior even when the code was correct.
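The alignment rule above can be spot-checked mechanically. The following is a minimal sketch, assuming the public config values are available in an env-style file of KEY=VALUE lines; the function name and the file layout are hypothetical and this is not the project's configure script.

```shell
# Hypothetical helper: list public-URL keys that still point at sslip.io.
# Input: an env-style file with one KEY=VALUE per line (assumed layout).
find_legacy_public_urls() (
  env_file="$1"
  # Only inspect the keys named in the required config alignment list.
  grep -E '^(APP_BASE_URL|KEYCLOAK_PUBLIC_ISSUER_URL|KEYCLOAK_TOKEN_BASE_URL|NEXT_PUBLIC_[A-Z_]+|GF_SERVER_ROOT_URL)=' "$env_file" \
    | grep -F 'sslip.io' \
    | cut -d= -f1
)

# Usage (path is a placeholder):
#   legacy="$(find_legacy_public_urls path/to/dev-control.env)"
#   [ -z "$legacy" ] || echo "mixed Funnel + sslip.io config in: $legacy"
```

An empty result means none of the checked keys reference the legacy hosts; it does not replace the proxy smoke run on the Funnel hosts.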

Operating Rules

  1. release/platform-control is the explicit platform-control deploy branch.
  2. Push pipelines on release/platform-control are disabled.
  3. release/platform-control runs are expected to be API-triggered.
  4. PLATFORM_CONTROL_RELEASE_MODE=deploy is the normal live-iteration mode and exercises:
     - release artifact publish
     - runtime image publish
     - digest assembly
     - deploy
     - remote validation
  5. PLATFORM_CONTROL_RELEASE_MODE=full is required when the intent is to exercise:
     - release artifact publish
     - runtime image publish
     - digest assembly
     - deploy
     - remote validation
     - post-deploy report generation
  6. release/platform-control remains single-writer during active CI work.
  7. Shared branch updates must keep platform-gitlab, origin, and github in sync.
  8. This document describes the current model; it does not authorize broad pipeline surgery by itself.
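The difference between the two release modes can be written down as a small guard. This is an illustrative sketch only: the function name is hypothetical, and the step names are copied from the rules above rather than from .gitlab-ci.yml.

```shell
# Print the steps a given PLATFORM_CONTROL_RELEASE_MODE exercises.
# Fails for any value other than deploy or full.
release_mode_steps() (
  mode="$1"
  case "$mode" in
    deploy|full)
      printf '%s\n' \
        "release artifact publish" \
        "runtime image publish" \
        "digest assembly" \
        "deploy" \
        "remote validation"
      # full additionally runs the post-deploy report chain.
      if [ "$mode" = "full" ]; then
        printf '%s\n' "post-deploy report generation"
      fi
      ;;
    *)
      echo "unknown PLATFORM_CONTROL_RELEASE_MODE: $mode" >&2
      exit 1
      ;;
  esac
)
```

A wrapper could call this before triggering so that a mistyped mode fails locally instead of producing a pipeline with no release jobs.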

Pipeline Stages

The current GitLab pipeline stages are:

  1. ci_image
  2. contracts
  3. build_test
  4. security
  5. sdk
  6. migration
  7. package
  8. deploy
  9. post_deploy

Pipeline Graph

flowchart TD
  A[contracts_validate] --> B[contracts_breaking_change]
  A --> C[backend_build_and_tests]
  A --> D[frontend_build_and_tests]
  A --> E[sdk_codegen_smoke]

  C --> F[integration_smoke]
  D --> G[frontend_e2e]
  C --> H[security_sast_gosec]
  C --> I[security_sast_semgrep]
  C --> J[security_sast_gitleaks]
  C --> K[security_govulncheck]
  C --> L[migration_validation]

  H --> M[security_sast_report]
  I --> M
  J --> M
  M --> N[security_scans]
  K --> N

  B --> O[package_and_attest]
  E --> O
  L --> O
  D --> O

  O --> P[prepare_release_candidate]
  P --> Q[publish_release_artifacts]

  Q --> R1[publish_runtime_api]
  Q --> R2[publish_runtime_workers]
  Q --> R3[publish_runtime_web]
  R1 --> S[assemble_runtime_digests]
  R2 --> S
  R3 --> S

  P --> T[deploy]
  S --> T
  T --> U[remote_validation]
  S --> V[image_scan_report]
  T --> W[schemathesis_report]
  T --> X[dast_report]
  N --> Y[hardening_report]
  V --> Y
  W --> Y
  X --> Y

Notes:

  1. The release branch package/deploy flow activates for PLATFORM_CONTROL_RELEASE_MODE=deploy|full.
  2. frontend_e2e remains part of broader validation on non-release-branch and master pipelines, but is intentionally excluded from release/platform-control pipelines by default.
  3. The heavy post-deploy report chain activates only for PLATFORM_CONTROL_RELEASE_MODE=full.
  4. Rollback exists as a manual deploy stage job and is not part of the default success path.

Runner Lanes

The pipeline uses five runner lanes:

| Lane | Tags | Main job classes |
| --- | --- | --- |
| ci-fast | ci-fast,platform-control | contract checks, codegen smoke, small validation jobs, selected runtime publish fan-out jobs |
| ci-build | ci-build,platform-control | backend build/tests, integration smoke, migration validation, package/attest |
| ci-release | ci-release,platform-control | release-candidate prep, release artifact publish, digest fan-in, deploy, remote validation, rollback |
| ci-report | ci-report,platform-control | report-only SAST summaries, govulncheck, image scan, Schemathesis, DAST, hardening aggregation |
| ci-frontend | ci-frontend,platform-control | frontend build/tests, frontend e2e, web runtime image publish |

Current runner-topology target on the control host:

  1. ci-fast-1 tagged ci-fast,platform-control
  2. ci-fast-2 tagged ci-fast,platform-control
  3. ci-build-1 tagged ci-build,platform-control
  4. ci-release-1 tagged ci-release,platform-control
  5. ci-frontend-1 tagged ci-frontend,platform-control
  6. ci-report-1 tagged ci-report,platform-control

The GitLab runner manager should allow at least concurrent = 10 on the platform-control host. The host has enough CPU and memory for this in the dev environment, and the release pipeline contains enough independent build, publish, and report jobs to benefit from more than five total slots.
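A quick way to confirm the concurrency target on a host is to read the global `concurrent` setting from the runner config. The sketch below assumes the common Linux location /etc/gitlab-runner/config.toml; the actual path on the platform-control host is not stated in this document, and the function name is hypothetical.

```shell
# Read the global `concurrent = N` value from a GitLab Runner config.toml.
# The default path is an assumption; pass the real path as the first argument.
runner_concurrent() (
  config="${1:-/etc/gitlab-runner/config.toml}"
  # Match only the top-level setting at the start of a line.
  sed -n 's/^concurrent[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$config" | head -n 1
)

# Usage:
#   [ "$(runner_concurrent)" -ge 10 ] || echo "runner concurrency below target"
```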

Operational interpretation:

  1. ci-fast is used for quick branch iteration and low-latency checks.
  2. ci-build absorbs heavier compile/test/package work so fast validation does not queue behind it.
  3. ci-release is reserved for release/deploy intent and should not be treated as ordinary validation capacity.
  4. ci-report carries report-only security and runtime conformance checks so findings stay reviewable without blocking fast iteration.
  5. ci-frontend isolates browser/web execution from backend-heavy lanes.

2026-04-16 runner tuning notes

Pipeline 708 on release/platform-control exposed two non-functional bottlenecks:

  1. Report scans were reading CI-local caches under .cache/, which inflated report-only findings and made security_sast_semgrep the longest job at about 507 seconds. SAST should scan a tracked-source snapshot and exclude CI caches/generated outputs.
  2. Runtime image fan-out is limited by available ci-fast publish capacity. The last runtime publish jobs queued for about 270-344 seconds before starting because most non-web runtime publish jobs share the ci-fast lane.

Recommended runner/resource changes before further GPU slice iteration:

  1. Keep ci-report isolated, but give it enough CPU to run Semgrep without starving release/deploy jobs. If Semgrep remains above 3 minutes after tracked-source scanning, raise CPU on ci-report-1 or add ci-report-2.
  2. Keep at least two ci-fast runners with Docker build/push capability. The package stage has many independent runtime image jobs and benefits directly from more concurrent publish capacity.
  3. Keep ci-release single-lane for deploy serialization. Do not use extra release runners to parallelize deploy unless the deployment scripts gain explicit locking and environment isolation.
  4. Keep ci-frontend separate; web build/publish is already isolated and should not queue behind Go/runtime image jobs.

2026-04-17 live tuning notes

Pipeline 725 on release/platform-control still took roughly 33 minutes wall-clock even after scan fixes. The longest individual jobs were frontend_build_and_tests (~248s), platform_control_publish_runtime_web (~236s), contracts_validate (~191s), platform_control_deploy (~190s), and Go runtime publish jobs (~158-181s each). The primary deploy-loop bottleneck was not storage exhaustion; it was low runner concurrency combined with only one effective ci-fast publish runner.

Live dev-control change applied:

  1. raised GitLab runner manager concurrent from 5 to 10;
  2. added ci-fast to the existing generic platform-control-docker runner to provide an immediate second publish-capable ci-fast lane;
  3. left ci-release single-lane so deploy and remote validation remain serialized.

Storage observation from the same host:

  1. /ai-cloud-data had about 1.3 TiB free;
  2. Docker reported roughly 131 GiB reclaimable images and 37 GiB reclaimable volumes;
  3. this is worth scheduled cleanup, but pruning aggressively before every deploy would remove useful image cache and may slow builds rather than speed them up.
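One way to reconcile cleanup with cache reuse is an age-filtered prune run on a schedule rather than before each deploy. The sketch below is an assumption-laden illustration: the 168h cutoff, the DRY_RUN switch, and the function name are not from the source. `docker image prune --all --force --filter until=...` is a standard Docker CLI invocation.

```shell
# Age-filtered prune: removes only unused images older than the cutoff,
# so recently used images stay available to subsequent builds.
prune_old_images() (
  max_age="${1:-168h}"   # assumed cutoff (7 days), not from the source
  cmd="docker image prune --all --force --filter until=${max_age}"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$cmd"          # default: only print what would run
  else
    $cmd
  fi
)

# Inspect reclaimable space first with: docker system df
```

Running it with DRY_RUN=0 from a scheduled job keeps the decision separate from the deploy path.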

Check Inventory

Blocking checks in normal validation flow

These are part of the ordinary validation path and are expected to gate normal CI success:

  1. contracts_validate
  2. contracts_breaking_change
  3. backend_build_and_tests
  4. frontend_build_and_tests
  5. integration_smoke
  6. sdk_codegen_smoke
  7. migration_validation
  8. package_and_attest

Conditional but still gating when selected:

  1. frontend_e2e: enabled by rules:changes on pipelines other than release/platform-control
  2. release-package/deploy chain: enabled when PLATFORM_CONTROL_RELEASE_MODE=deploy or full

Report-only checks

These produce artifacts for review and calibration rather than immediate pipeline blocking:

  1. SAST:
     - gosec
     - semgrep
     - gitleaks
  2. Go vulnerability reporting:
     - govulncheck
  3. Image scan:
     - trivy
  4. Runtime contract conformance:
     - Schemathesis
  5. API DAST:
     - ZAP API scan
  6. Aggregation:
     - security summaries
     - hardening summary

The pipeline model intentionally separates:

  1. developer-facing branch validation,
  2. release/deploy execution,
  3. report-only security and runtime conformance evidence.

Branch Behavior Model

master

master remains the broader validation branch:

  1. normal validation jobs run,
  2. frontend_e2e can run when relevant paths change,
  3. platform-control deploy/publish jobs do not activate by default,
  4. success on master is not itself a deploy signal.

release/platform-control

release/platform-control is now an explicit deploy branch, not an auto-push CI branch.

Push behavior:

  1. ordinary push pipelines on release/platform-control are disabled at workflow level,
  2. this avoids duplicate branch + API pipelines for the same commit,
  3. master remains the automatic validation branch.

Explicit deploy behavior:

  1. trigger the pipeline by API (or equivalent manual pipeline trigger),
  2. set PLATFORM_CONTROL_RELEASE_MODE=deploy,
  3. run release artifact publish,
  4. publish runtime images,
  5. assemble digest manifest,
  6. deploy,
  7. run remote validation,
  8. skip heavy post-deploy security/runtime reports.

Explicit full-release behavior:

  1. trigger the pipeline by API (or equivalent manual pipeline trigger),
  2. set PLATFORM_CONTROL_RELEASE_MODE=full (default),
  3. run release artifact publish,
  4. publish runtime images,
  5. assemble digest manifest,
  6. deploy,
  7. run remote validation,
  8. run post-deploy report generation and hardening aggregation.

This distinction is required because:

  1. release artifact generation and deployment are slower and more stateful than ordinary CI,
  2. remote validation requires environment readiness and secrets,
  3. MAAS and similar environment-bound work need a fast live-deploy loop rather than another non-deploying validation run,
  4. report-only post-deploy checks should run when there is a real deployed target to inspect.
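An API-triggered run with an explicit mode can be sketched with the GitLab pipeline trigger endpoint (`POST /projects/:id/trigger/pipeline`). This is not the implementation of scripts/ci/gitlab_pipeline_trigger.sh; the environment variable names GITLAB_URL, GITLAB_PROJECT_ID, and GITLAB_TRIGGER_TOKEN, the DRY_RUN switch, and the function name are all assumptions made for illustration.

```shell
# Trigger a release/platform-control pipeline with an explicit release mode.
# Prints the request in dry-run mode (the default) without exposing the token.
trigger_release_pipeline() (
  mode="$1"
  case "$mode" in
    deploy|full) ;;
    *) echo "mode must be deploy or full" >&2; exit 1 ;;
  esac
  url="${GITLAB_URL}/api/v4/projects/${GITLAB_PROJECT_ID}/trigger/pipeline"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "POST ${url} ref=release/platform-control PLATFORM_CONTROL_RELEASE_MODE=${mode}"
    exit 0
  fi
  curl --fail --silent --show-error --request POST \
    --form "token=${GITLAB_TRIGGER_TOKEN}" \
    --form "ref=release/platform-control" \
    --form "variables[PLATFORM_CONTROL_RELEASE_MODE]=${mode}" \
    "$url"
)
```

Setting DRY_RUN=0 sends the request; keeping the dry run as the default makes an accidental deploy less likely during local experimentation.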

CI Base Image Model

The pipeline supports a reusable CI base image model:

  1. GPUAAS_CI_JOB_IMAGE defaults to golang:1.25,
  2. ci_base_image_build can build and publish a reusable CI image to $CI_REGISTRY_IMAGE/ci-base,
  3. hashed tags are the durable rollout unit,
  4. latest is a convenience pointer, not the only source of truth,
  5. per-job tool bootstrap remains available as fallback when the reusable image is unavailable.

Operational rules:

  1. use the prebuilt CI base image when the registry path is stable,
  2. keep golang:1.25 fallback available for bootstrap or recovery,
  3. treat CI image changes as explicit rollout events, not incidental branch behavior,
  4. prefer hashed CI base image tags over latest for deploy-triggered pipelines, and run the preflight builder before triggering when the expected hash is missing.

Preflight usage:

PLATFORM_CONTROL_USE_CI_BASE_IMAGE=1 \
CI_BASE_IMAGE_REPO=<registry>/<project>/ci-base \
PLATFORM_CONTROL_RELEASE_MODE=deploy \
PLATFORM_CONTROL_RELEASE_PROFILE=standard \
scripts/ci/gitlab_pipeline_trigger.sh

The trigger wrapper computes the CI image hash with build/ci-image/compute_tag.sh, checks whether $CI_BASE_IMAGE_REPO:<hash> exists, builds and pushes it if it is missing and registry credentials are available, and passes the immutable image tag to GitLab as GPUAAS_CI_JOB_IMAGE. This avoids stale latest behavior and removes per-job tool bootstrap from normal deploy loops.
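The existence check and fallback choice can be sketched as follows. This is a simplified illustration, not the wrapper's actual logic: it omits the build-and-push step, the function names and the IMAGE_PROBE override are hypothetical, and it assumes the Docker CLI's `docker manifest inspect` is available and authenticated against the registry.

```shell
# Return 0 when <repo>:<tag> is present in the registry.
# IMAGE_PROBE is a hypothetical override so the probe can be stubbed out.
ci_image_exists() (
  repo="$1"; tag="$2"
  ${IMAGE_PROBE:-docker manifest inspect} "${repo}:${tag}" >/dev/null 2>&1
)

# Choose the value to pass as GPUAAS_CI_JOB_IMAGE:
# the immutable hashed tag when it exists, else the golang:1.25 fallback.
resolve_ci_job_image() (
  repo="$1"; tag="$2"
  if ci_image_exists "$repo" "$tag"; then
    echo "${repo}:${tag}"
  else
    echo "golang:1.25"
  fi
)
```

Resolving to a hashed tag rather than latest keeps the job image stable for the whole pipeline even if a newer CI image is published mid-run.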

Go and frontend CI caches are shared across master and release/platform-control using dependency-file cache keys. The dependency hash still invalidates caches when go.sum, packages/web/package.json, or packages/web/pnpm-lock.yaml changes, but release pipelines no longer start from cold branch-specific caches immediately after a green master pipeline.
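The idea of a branch-independent, dependency-derived cache key can be illustrated in a few lines. This is a sketch under assumptions: the function name and key format are hypothetical, `sha256sum` from GNU coreutils is assumed to be installed, and the real pipeline may rely on GitLab's `cache:key:files` keyword instead.

```shell
# Derive a cache key from dependency file contents only.
# The branch name is deliberately not part of the key, so master and
# release/platform-control resolve to the same cache for equal inputs.
dependency_cache_key() (
  prefix="$1"; shift
  digest="$(cat "$@" | sha256sum | cut -c1-16)"
  echo "${prefix}-${digest}"
)

# Usage:
#   dependency_cache_key go go.sum
#   dependency_cache_key web packages/web/package.json packages/web/pnpm-lock.yaml
```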

Runtime Image Model

Current policy

Runtime images should remain fixed unless their service/runtime inputs change.

This means:

  1. do not rebuild or republish all runtime images for unrelated repository changes,
  2. publish runtime images only for the selected release candidate in deploy or full mode,
  3. keep digest assembly explicit so deploys operate on immutable references.

Current job structure

Release-mode runtime publish is split into:

  1. one job per runtime target for fan-out parallelism,
  2. one digest-assembly job for fan-in,
  3. one deploy job that consumes the frozen release candidate and digest manifest.

Current runtime targets in CI:

  1. api
  2. provisioning-worker
  3. billing-worker
  4. webhook-worker
  5. outbox-relay
  6. notification-relay
  7. terminal-gateway
  8. app-runtime-worker
  9. web
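The fan-in step can be pictured as collecting one digest per target into a single env-style manifest. The sketch below is hypothetical in its details: the per-target input files (`<dir>/<target>.digest`), the `*_IMAGE_DIGEST` key naming, and the function name are assumptions, not the format the real assemble job uses.

```shell
# Assemble one env-style manifest from per-target digest files.
# Fails if any runtime target has no digest, so deploy never sees a partial set.
assemble_digests() (
  in_dir="$1"; out_file="$2"
  : > "$out_file"
  for target in api provisioning-worker billing-worker webhook-worker outbox-relay \
                notification-relay terminal-gateway app-runtime-worker web; do
    file="${in_dir}/${target}.digest"
    if [ ! -s "$file" ]; then
      echo "missing digest for ${target}" >&2
      exit 1
    fi
    # Upper-case the target and map '-' to '_' to form a valid env key.
    key="$(echo "$target" | tr 'a-z-' 'A-Z_')_IMAGE_DIGEST"
    echo "${key}=$(cat "$file")" >> "$out_file"
  done
)
```

Failing on the first missing target keeps the manifest all-or-nothing, which matches the goal of deploying only from immutable, complete references.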

Base-image direction

Go service runtime images are moving toward minimal images:

  1. target distroless or scratch for Go services,
  2. keep runtime images digest-pinned,
  3. preserve the service/runtime-input rule above so image churn stays controlled.

Current exception: the web runtime remains the Node/dependency-heavy exception and should not be forced into the same minimal-image rule prematurely.

Release Candidate, Deploy, and Validation Flow

When PLATFORM_CONTROL_RELEASE_MODE=full is used on release/platform-control (deploy mode runs steps 1 through 6 and skips the report steps):

  1. prepare a frozen release candidate SHA,
  2. publish release artifacts,
  3. publish per-service runtime images,
  4. assemble dist/platform-control-image-digests.env,
  5. deploy from the frozen release candidate,
  6. run remote validation,
  7. run report-only post-deploy checks,
  8. aggregate hardening output.

Important constraints:

  1. deploy consumes the frozen candidate SHA, not an unfrozen branch head,
  2. deploy should reject stale digest artifacts unless explicitly overridden,
  3. remote validation and post-deploy reports are meaningful only when the remote target was actually exercised.
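The stale-digest constraint can be illustrated with a small pre-deploy guard. Everything specific here is an assumption: the source does not say how staleness is detected, so the sketch invents a RELEASE_CANDIDATE_SHA line in the manifest and an ALLOW_STALE_DIGESTS override purely for illustration.

```shell
# Reject a digest manifest that was built for a different release candidate.
# RELEASE_CANDIDATE_SHA (manifest key) and ALLOW_STALE_DIGESTS (override)
# are hypothetical names, not taken from the real deploy scripts.
check_digest_freshness() (
  manifest="$1"; candidate_sha="$2"
  recorded="$(sed -n 's/^RELEASE_CANDIDATE_SHA=//p' "$manifest" | head -n 1)"
  if [ "$recorded" = "$candidate_sha" ]; then
    exit 0
  fi
  if [ "${ALLOW_STALE_DIGESTS:-0}" = "1" ]; then
    echo "warning: stale digest manifest accepted by explicit override" >&2
    exit 0
  fi
  echo "stale digest manifest: built for '${recorded}', deploying '${candidate_sha}'" >&2
  exit 1
)
```

Making the override an explicit variable keeps the default path strict while still leaving a documented escape hatch for recovery work.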

Blocking vs Report-Only Summary

| Check area | Current mode |
| --- | --- |
| Contracts | blocking |
| Build/tests | blocking |
| Integration smoke | blocking |
| SDK codegen smoke | blocking |
| Migration validation | blocking |
| Packaging/attestation | blocking |
| Frontend e2e | conditional; blocking when selected |
| SAST (gosec, semgrep, gitleaks) | report-only |
| govulncheck | report-only |
| Image scan (trivy) | report-only |
| Runtime contract conformance (Schemathesis) | report-only |
| API DAST | report-only |
| Hardening aggregation | report-only |
| Deploy / remote validation | explicit release path (deploy or full mode) |

Remote and Branch Discipline

To avoid CI confusion and branch drift:

  1. do not assume release/platform-control pushes auto-deploy,
  2. only set PLATFORM_CONTROL_RELEASE_MODE=deploy or full when the intent is to exercise publish/deploy/remote validation, and reserve full for runs that also need the post-deploy reports,
  3. keep release/platform-control single-writer during active CI work,
  4. keep platform-gitlab, origin, and github aligned when shared branch updates occur,
  5. prefer documentation and evidence review before reopening broad pipeline rewrites.
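Keeping the three remotes aligned is easy to verify before triggering a release. The sketch below uses `git ls-remote` to compare branch heads; the function names and the SHA_LOOKUP override (used to stub out network access) are hypothetical.

```shell
# Print the commit SHA of <branch> on <remote>.
branch_sha() (
  git ls-remote "$1" "refs/heads/$2" | cut -f1
)

# Succeed only when <branch> points at the same commit on every remote given.
# SHA_LOOKUP is a hypothetical hook so the lookup can be replaced in tests.
remotes_in_sync() (
  branch="$1"; shift
  expected=""
  for remote in "$@"; do
    sha="$("${SHA_LOOKUP:-branch_sha}" "$remote" "$branch")"
    if [ -z "$sha" ]; then
      echo "no ${branch} on ${remote}" >&2
      exit 1
    fi
    if [ -z "$expected" ]; then
      expected="$sha"
    elif [ "$sha" != "$expected" ]; then
      echo "${remote} is at ${sha}, expected ${expected}" >&2
      exit 1
    fi
  done
)

# Usage:
#   remotes_in_sync release/platform-control platform-gitlab origin github
```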

Out of Scope

This document does not:

  1. redesign the pipeline,
  2. change current gate severity,
  3. convert report-only checks into blocking gates,
  4. redefine all production promotion rules,
  5. authorize runtime image strategy changes beyond the documented direction.

Source Anchors

Primary sources for this model:

  1. doc/operations/Platform_Control_CI_Handoff_2026-03-17.md
  2. .gitlab-ci.yml
  3. build/ci-image/README.md
  4. scripts/ci/README.md
  5. doc/operations/Control_Plane_K8s_Migration_v1.md