Platform Control Release Pipeline Model
As of: March 17, 2026
Purpose
Document the current platform-control CI and release operating model so future changes do not reintroduce hidden assumptions.
This document is the durable reference for:
1. runner lanes,
2. blocking versus report-only checks,
3. release branch behavior,
4. CI base image usage,
5. runtime image publishing and deploy flow,
6. current runtime image policy.
Canonical Dev-Control Public Config
As of 2026-04-24, the canonical public entrypoints for the live dev-control
environment are the Funnel hosts, not the legacy sslip.io app/auth/api web
entrypoints.
Canonical browser-facing hosts:
- https://gpuaas-dev-app.tailfe39f5.ts.net
- https://gpuaas-dev-api.tailfe39f5.ts.net
- https://gpuaas-dev-auth.tailfe39f5.ts.net
- wss://gpuaas-dev-term.tailfe39f5.ts.net
Required config alignment:
- APP_BASE_URL=https://gpuaas-dev-app.tailfe39f5.ts.net
- KEYCLOAK_PUBLIC_ISSUER_URL=https://gpuaas-dev-auth.tailfe39f5.ts.net/realms/gpuaas
- KEYCLOAK_TOKEN_BASE_URL=https://gpuaas-dev-auth.tailfe39f5.ts.net
- NEXT_PUBLIC_GRAFANA_BASE_URL=https://gpuaas-dev-app.tailfe39f5.ts.net/backend/p/grafana
- NEXT_PUBLIC_NOTIFICATIONS_WS_BASE_URL=wss://gpuaas-dev-api.tailfe39f5.ts.net
- NEXT_PUBLIC_WS_BASE_URL=wss://gpuaas-dev-term.tailfe39f5.ts.net
- Grafana:
  - GF_SERVER_ROOT_URL=https://gpuaas-dev-app.tailfe39f5.ts.net/backend/p/grafana/
  - GF_SERVER_SERVE_FROM_SUB_PATH=true
Operational rule:
- after recreating or materially reconfiguring dev-control, run
scripts/ops/configure_platform_control_dev_auth_urls.sh
- then rerun proxy smoke on Funnel hosts before considering the environment
ready
Reason:
- mixed Funnel + sslip.io public config produced misleading failures in auth,
Grafana, Jupyter, and launcher/browser-session behavior even when the code was
correct.
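The alignment rule above can be checked mechanically before running proxy smoke. The following is a minimal sketch, not the contents of scripts/ops/configure_platform_control_dev_auth_urls.sh: the function name check_public_config is hypothetical, and it assumes the variables listed above are visible in the current shell.

```sh
#!/bin/sh
# Hypothetical helper: flag any public-URL variable that is unset or that
# does not point at the Funnel domain (for example a leftover sslip.io value).
FUNNEL_DOMAIN="tailfe39f5.ts.net"

check_public_config() {
    status=0
    for name in APP_BASE_URL KEYCLOAK_PUBLIC_ISSUER_URL KEYCLOAK_TOKEN_BASE_URL \
                NEXT_PUBLIC_GRAFANA_BASE_URL NEXT_PUBLIC_NOTIFICATIONS_WS_BASE_URL \
                NEXT_PUBLIC_WS_BASE_URL GF_SERVER_ROOT_URL; do
        # Indirect lookup of the variable named by $name.
        eval "value=\${$name:-}"
        case "$value" in
            *"$FUNNEL_DOMAIN"*) ;;
            "") echo "MISSING $name"; status=1 ;;
            *)  echo "MISALIGNED $name=$value"; status=1 ;;
        esac
    done
    return "$status"
}
```

A non-zero return means at least one entrypoint is missing or still points away from the Funnel hosts, which is the mixed-config condition described under Reason.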
Operating Rules
- release/platform-control is the explicit platform-control deploy branch.
- Push pipelines on release/platform-control are disabled.
- release/platform-control runs are expected to be API-triggered.
- PLATFORM_CONTROL_RELEASE_MODE=deploy is the normal live-iteration mode and exercises:
  - release artifact publish
  - runtime image publish
  - digest assembly
  - deploy
  - remote validation
- PLATFORM_CONTROL_RELEASE_MODE=full is required when the intent is to exercise:
  - release artifact publish
  - runtime image publish
  - digest assembly
  - deploy
  - remote validation
  - post-deploy report generation
- release/platform-control remains single-writer during active CI work.
- Shared branch updates must keep platform-gitlab, origin, and github in sync.
- This document describes the current model; it does not authorize broad pipeline surgery by itself.
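The difference between the two release modes listed above is a single extra step. As an illustration only, the mapping could be written as a small lookup; release_mode_steps is a hypothetical name and does not correspond to a job or script in the repository.

```sh
#!/bin/sh
# Hypothetical lookup: print the steps each release mode exercises,
# following the Operating Rules list.
release_mode_steps() {
    case "$1" in
        deploy|full) ;;
        *) echo "unknown PLATFORM_CONTROL_RELEASE_MODE: $1" >&2; return 2 ;;
    esac
    printf '%s\n' \
        "release artifact publish" \
        "runtime image publish" \
        "digest assembly" \
        "deploy" \
        "remote validation"
    # Only full mode adds the post-deploy report chain.
    if [ "$1" = "full" ]; then
        printf '%s\n' "post-deploy report generation"
    fi
}
```

Calling release_mode_steps deploy prints the five live-iteration steps; release_mode_steps full prints the same five followed by post-deploy report generation.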
Pipeline Stages
The current GitLab pipeline stages are:
1. ci_image
2. contracts
3. build_test
4. security
5. sdk
6. migration
7. package
8. deploy
9. post_deploy
Pipeline Graph
flowchart TD
A[contracts_validate] --> B[contracts_breaking_change]
A --> C[backend_build_and_tests]
A --> D[frontend_build_and_tests]
A --> E[sdk_codegen_smoke]
C --> F[integration_smoke]
D --> G[frontend_e2e]
C --> H[security_sast_gosec]
C --> I[security_sast_semgrep]
C --> J[security_sast_gitleaks]
C --> K[security_govulncheck]
C --> L[migration_validation]
H --> M[security_sast_report]
I --> M
J --> M
M --> N[security_scans]
K --> N
B --> O[package_and_attest]
E --> O
L --> O
D --> O
O --> P[prepare_release_candidate]
P --> Q[publish_release_artifacts]
Q --> R1[publish_runtime_api]
Q --> R2[publish_runtime_workers]
Q --> R3[publish_runtime_web]
R1 --> S[assemble_runtime_digests]
R2 --> S
R3 --> S
P --> T[deploy]
S --> T
T --> U[remote_validation]
S --> V[image_scan_report]
T --> W[schemathesis_report]
T --> X[dast_report]
N --> Y[hardening_report]
V --> Y
W --> Y
X --> Y
Notes:
1. The release branch package/deploy flow activates for PLATFORM_CONTROL_RELEASE_MODE=deploy|full.
2. frontend_e2e remains part of broader validation on non-release-branch and master pipelines, but is intentionally excluded from default release/platform-control pushes.
3. The heavy post-deploy report chain activates only for PLATFORM_CONTROL_RELEASE_MODE=full.
4. Rollback exists as a manual deploy stage job and is not part of the default success path.
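To see what the release tail of the graph implies for ordering, the edges from prepare_release_candidate onward can be fed to the standard tsort utility. This is a sketch for reasoning about the graph, not how GitLab schedules jobs; the function name release_chain_order is hypothetical, and the edge list is copied from the flowchart above.

```sh
#!/bin/sh
# Each line is "upstream downstream", taken from the release portion
# of the flowchart. tsort prints one valid execution order.
release_chain_order() {
    tsort <<'EOF'
prepare_release_candidate publish_release_artifacts
publish_release_artifacts publish_runtime_api
publish_release_artifacts publish_runtime_workers
publish_release_artifacts publish_runtime_web
publish_runtime_api assemble_runtime_digests
publish_runtime_workers assemble_runtime_digests
publish_runtime_web assemble_runtime_digests
prepare_release_candidate deploy
assemble_runtime_digests deploy
deploy remote_validation
EOF
}

release_chain_order
```

The three runtime publish jobs may appear in any relative order, which reflects the fan-out. In every valid order, deploy follows assemble_runtime_digests and remote_validation comes last.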
Runner Lanes
The pipeline uses five runner lanes:
| Lane | Tags | Main job classes |
|---|---|---|
| ci-fast | ci-fast,platform-control | contract checks, codegen smoke, small validation jobs, selected runtime publish fan-out jobs |
| ci-build | ci-build,platform-control | backend build/tests, integration smoke, migration validation, package/attest |
| ci-release | ci-release,platform-control | release-candidate prep, release artifact publish, digest fan-in, deploy, remote validation, rollback |
| ci-report | ci-report,platform-control | report-only SAST summaries, govulncheck, image scan, Schemathesis, DAST, hardening aggregation |
| ci-frontend | ci-frontend,platform-control | frontend build/tests, frontend e2e, web runtime image publish |
Current runner-topology target on the control host:
1. ci-fast-1 tagged ci-fast,platform-control
2. ci-fast-2 tagged ci-fast,platform-control
3. ci-build-1 tagged ci-build,platform-control
4. ci-release-1 tagged ci-release,platform-control
5. ci-frontend-1 tagged ci-frontend,platform-control
6. ci-report-1 tagged ci-report,platform-control
The GitLab runner manager should allow at least concurrent = 10 on the
platform-control host. The host has enough CPU and memory for this in the dev
environment, and the release pipeline contains enough independent build,
publish, and report jobs to benefit from more than five total slots.
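The concurrency floor can be verified against the runner manager's config.toml, where concurrent is a top-level setting. The sketch below is an assumption-laden helper: runner_concurrency_ok is a hypothetical name, and the config path (commonly /etc/gitlab-runner/config.toml) must be supplied by the caller because this document does not state where it lives on the control host.

```sh
#!/bin/sh
# Hypothetical check: succeed only if the runner manager config sets
# "concurrent" to at least the given minimum (default 10).
runner_concurrency_ok() {
    config_file=$1
    minimum=${2:-10}
    # Extract the integer from a line such as "concurrent = 10".
    value=$(sed -n 's/^concurrent[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$config_file" | head -n 1)
    if [ -z "$value" ]; then
        echo "no concurrent setting found in $config_file" >&2
        return 2
    fi
    [ "$value" -ge "$minimum" ]
}
```

For example, runner_concurrency_ok /etc/gitlab-runner/config.toml 10 returns non-zero on a host still configured with concurrent = 5.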
Operational interpretation:
1. ci-fast is used for quick branch iteration and low-latency checks.
2. ci-build absorbs heavier compile/test/package work so fast validation does not queue behind it.
3. ci-release is reserved for release/deploy intent and should not be treated as ordinary validation capacity.
4. ci-report carries report-only security and runtime conformance checks so findings stay reviewable without blocking fast iteration.
5. ci-frontend isolates browser/web execution from backend-heavy lanes.
2026-04-16 runner tuning notes
Pipeline 708 on release/platform-control exposed two non-functional bottlenecks:
- Report scans were reading CI-local caches under .cache/, which inflated report-only findings and made security_sast_semgrep the longest job at about 507 seconds. SAST should scan a tracked-source snapshot and exclude CI caches/generated outputs.
- Runtime image fan-out is limited by available ci-fast publish capacity. The last runtime publish jobs queued for about 270-344 seconds before starting because most non-web runtime publish jobs share the ci-fast lane.
Recommended runner/resource changes before further GPU slice iteration:
- Keep ci-report isolated, but give it enough CPU to run Semgrep without starving release/deploy jobs. If Semgrep remains above 3 minutes after tracked-source scanning, raise CPU on ci-report-1 or add ci-report-2.
- Keep at least two ci-fast runners with Docker build/push capability. The package stage has many independent runtime image jobs and benefits directly from more concurrent publish capacity.
- Keep ci-release single-lane for deploy serialization. Do not use extra release runners to parallelize deploy unless the deployment scripts gain explicit locking and environment isolation.
- Keep ci-frontend separate; web build/publish is already isolated and should not queue behind Go/runtime image jobs.
2026-04-17 live tuning notes
Pipeline 725 on release/platform-control still took roughly 33 minutes
wall-clock even after scan fixes. The longest individual jobs were
frontend_build_and_tests (~248s), platform_control_publish_runtime_web
(~236s), platform_control_deploy (~190s), contracts_validate (~191s), and
Go runtime publish jobs (~158-181s each). The primary deploy-loop bottleneck
was not storage exhaustion; it was low runner concurrency and only one effective
ci-fast publish runner.
Live dev-control change applied:
- raised GitLab runner manager concurrent from 5 to 10;
- added ci-fast to the existing generic platform-control-docker runner to provide an immediate second publish-capable ci-fast lane;
- left ci-release single-lane so deploy and remote validation remain serialized.
Storage observation from the same host:
- /ai-cloud-data had about 1.3 TiB free;
- Docker reported roughly 131 GiB reclaimable images and 37 GiB reclaimable volumes;
- this is worth scheduled cleanup, but pruning aggressively before every deploy would remove useful image cache and may slow builds rather than speed them up.
Check Inventory
Blocking checks in normal validation flow
These are part of the ordinary validation path and are expected to gate normal CI success:
1. contracts_validate
2. contracts_breaking_change
3. backend_build_and_tests
4. frontend_build_and_tests
5. integration_smoke
6. sdk_codegen_smoke
7. migration_validation
8. package_and_attest
Conditional but still gating when selected:
1. frontend_e2e
- enabled by rules:changes on non-release/platform-control pipelines
2. release-package/deploy chain when PLATFORM_CONTROL_RELEASE_MODE=deploy or full
Report-only checks
These produce artifacts for review and calibration rather than immediate pipeline blocking:
1. SAST:
- gosec
- semgrep
- gitleaks
2. Go vulnerability reporting:
- govulncheck
3. Image scan:
- trivy
4. Runtime contract conformance:
- Schemathesis
5. API DAST:
- ZAP API scan
6. Aggregation:
- security summaries
- hardening summary
The pipeline model intentionally separates:
1. developer-facing branch validation,
2. release/deploy execution,
3. report-only security and runtime conformance evidence.
Branch Behavior Model
master
master remains the broader validation branch:
1. normal validation jobs run,
2. frontend_e2e can run when relevant paths change,
3. platform-control deploy/publish jobs do not activate by default,
4. success on master is not itself a deploy signal.
release/platform-control
release/platform-control is now an explicit deploy branch, not an auto-push CI branch.
Push behavior:
1. ordinary push pipelines on release/platform-control are disabled at workflow level,
2. this avoids duplicate branch + API pipelines for the same commit,
3. master remains the automatic validation branch.
Explicit deploy behavior:
1. trigger the pipeline by API (or equivalent manual pipeline trigger),
2. set PLATFORM_CONTROL_RELEASE_MODE=deploy,
3. run release artifact publish,
4. publish runtime images,
5. assemble digest manifest,
6. deploy,
7. run remote validation,
8. skip heavy post-deploy security/runtime reports.
Explicit full-release behavior:
1. trigger the pipeline by API (or equivalent manual pipeline trigger),
2. set PLATFORM_CONTROL_RELEASE_MODE=full (default),
3. run release artifact publish,
4. publish runtime images,
5. assemble digest manifest,
6. deploy,
7. run remote validation,
8. run post-deploy report generation and hardening aggregation.
This distinction is required because:
1. release artifact generation and deployment are slower and more stateful than ordinary CI,
2. remote validation requires environment readiness and secrets,
3. MAAS and similar environment-bound work need a fast live-deploy loop rather than another non-deploying validation run,
4. report-only post-deploy checks should run when there is a real deployed target to inspect.
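For orientation, an API trigger with an explicit mode could look like the sketch below. It uses GitLab's pipeline trigger endpoint (POST /api/v4/projects/:id/trigger/pipeline with token, ref, and variables[...] form fields). It is not the repository's scripts/ci/gitlab_pipeline_trigger.sh wrapper; the function name and the GITLAB_URL, PROJECT_ID, TRIGGER_TOKEN, and DRY_RUN variables are all assumptions made for illustration.

```sh
#!/bin/sh
# Hypothetical trigger helper. Defaults to a dry run that only prints the
# request it would send; set DRY_RUN=0 to actually call the GitLab API.
trigger_release_pipeline() {
    mode=$1
    case "$mode" in
        deploy|full) ;;
        *) echo "mode must be deploy or full, got: $mode" >&2; return 2 ;;
    esac
    url="${GITLAB_URL}/api/v4/projects/${PROJECT_ID}/trigger/pipeline"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "POST $url ref=release/platform-control PLATFORM_CONTROL_RELEASE_MODE=$mode"
        return 0
    fi
    curl --fail --silent --show-error --request POST \
        --form "token=${TRIGGER_TOKEN}" \
        --form "ref=release/platform-control" \
        --form "variables[PLATFORM_CONTROL_RELEASE_MODE]=${mode}" \
        "$url"
}
```

Rejecting unknown modes before the request keeps a typo from silently selecting whatever the pipeline treats as its default mode.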
CI Base Image Model
The pipeline supports a reusable CI base image model:
1. GPUAAS_CI_JOB_IMAGE defaults to golang:1.25,
2. ci_base_image_build can build and publish a reusable CI image to $CI_REGISTRY_IMAGE/ci-base,
3. hashed tags are the durable rollout unit,
4. latest is a convenience pointer, not the only source of truth,
5. per-job tool bootstrap remains available as fallback when the reusable image is unavailable.
Operational rules:
1. use the prebuilt CI base image when the registry path is stable,
2. keep golang:1.25 fallback available for bootstrap or recovery,
3. treat CI image changes as explicit rollout events, not incidental branch behavior,
4. prefer hashed CI base image tags over latest for deploy-triggered pipelines,
and run the preflight builder before triggering when the expected hash is
missing.
Preflight usage:
PLATFORM_CONTROL_USE_CI_BASE_IMAGE=1 \
CI_BASE_IMAGE_REPO=<registry>/<project>/ci-base \
PLATFORM_CONTROL_RELEASE_MODE=deploy \
PLATFORM_CONTROL_RELEASE_PROFILE=standard \
scripts/ci/gitlab_pipeline_trigger.sh
The trigger wrapper computes the CI image hash from
build/ci-image/compute_tag.sh, verifies that
$CI_BASE_IMAGE_REPO:<hash> exists, builds and pushes it when registry
credentials are available, and passes the immutable image tag to GitLab as
GPUAAS_CI_JOB_IMAGE. This avoids stale latest behavior while removing
per-job tool bootstrap from normal deploy loops.
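The hashed-tag preference with a golang:1.25 fallback can be sketched as follows. This is a simplified illustration, not the trigger wrapper itself: the real wrapper is described above as building and pushing a missing image, whereas this sketch only falls back. The function name select_ci_job_image and the IMAGE_EXISTS_CMD override are hypothetical; docker manifest inspect is used as the existence probe and assumes registry access from the calling host.

```sh
#!/bin/sh
# Hypothetical selector: use the immutable hashed tag when the registry
# has it, otherwise fall back to the documented default image.
select_ci_job_image() {
    repo=$1
    tag=$2
    # IMAGE_EXISTS_CMD is intentionally unquoted so it can hold a command
    # plus arguments; it exists mainly to make the function testable.
    if ${IMAGE_EXISTS_CMD:-docker manifest inspect} "$repo:$tag" >/dev/null 2>&1; then
        echo "$repo:$tag"
    else
        echo "golang:1.25"
    fi
}
```

The printed value is what would be passed to the pipeline as GPUAAS_CI_JOB_IMAGE.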
Go and frontend CI caches are shared across master and
release/platform-control using dependency-file cache keys. The dependency
hash still invalidates caches when go.sum, packages/web/package.json, or
packages/web/pnpm-lock.yaml changes, but release pipelines no longer start
from cold branch-specific caches immediately after a green master pipeline.
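The property that matters here is that the cache key depends only on dependency-file content, not on the branch name. GitLab can express this directly in .gitlab-ci.yml with file-based cache keys; the shell sketch below only demonstrates the idea, with a hypothetical function name and the POSIX cksum utility standing in for whatever hash the pipeline actually uses.

```sh
#!/bin/sh
# Hypothetical illustration: derive a cache key from file contents only,
# so two branches with identical dependency files share one cache.
dependency_cache_key() {
    prefix=$1
    shift
    # cksum on stdin prints "<crc> <size>"; keep only the checksum field.
    checksum=$(cat "$@" | cksum | cut -d ' ' -f 1)
    echo "${prefix}-${checksum}"
}
```

For example, dependency_cache_key web packages/web/package.json packages/web/pnpm-lock.yaml yields the same key on master and release/platform-control until one of those files changes.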
Runtime Image Model
Current policy
Runtime images should remain fixed unless their service/runtime inputs change.
This means:
1. do not rebuild or republish all runtime images for unrelated repository changes,
2. publish runtime images only for the selected release candidate in deploy or full mode,
3. keep digest assembly explicit so deploys operate on immutable references.
Current job structure
Release-mode runtime publish is split into:
1. one job per runtime target for fan-out parallelism,
2. one digest-assembly job for fan-in,
3. one deploy job that consumes the frozen release candidate and digest manifest.
Current runtime targets in CI:
1. api
2. provisioning-worker
3. billing-worker
4. webhook-worker
5. outbox-relay
6. notification-relay
7. terminal-gateway
8. app-runtime-worker
9. web
Base-image direction
Go service runtime images are moving toward minimal images:
1. target distroless or scratch for Go services,
2. keep runtime images digest-pinned,
3. preserve the service/runtime-input rule above so image churn stays controlled.
Current exception:
1. the web runtime remains the Node/dependency-heavy exception and should not be forced into the same minimal-image rule prematurely.
Release Candidate, Deploy, and Validation Flow
When PLATFORM_CONTROL_RELEASE_MODE=full is used on release/platform-control:
- prepare a frozen release candidate SHA,
- publish release artifacts,
- publish per-service runtime images,
- assemble dist/platform-control-image-digests.env,
- deploy from the frozen release candidate,
- run remote validation,
- run report-only post-deploy checks,
- aggregate hardening output.
Important constraints:
1. deploy consumes the frozen candidate SHA, not an unfrozen branch head,
2. deploy should reject stale digest artifacts unless explicitly overridden,
3. remote validation and post-deploy reports are meaningful only when the remote target was actually exercised.
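The stale-digest constraint can be pictured as a guard that runs before deploy. The sketch below is hypothetical throughout: this document does not describe the contents of dist/platform-control-image-digests.env, so the RELEASE_CANDIDATE_SHA key, the ALLOW_STALE_DIGESTS override, and the function name are assumptions chosen only to show the shape of the check.

```sh
#!/bin/sh
# Hypothetical guard: refuse to deploy when the digest manifest was
# assembled for a different release candidate than the one being deployed.
digest_manifest_is_fresh() {
    manifest=$1
    expected_sha=$2
    # Assumed manifest line: RELEASE_CANDIDATE_SHA=<commit sha>
    recorded_sha=$(sed -n 's/^RELEASE_CANDIDATE_SHA=//p' "$manifest" | head -n 1)
    if [ "$recorded_sha" = "$expected_sha" ]; then
        return 0
    fi
    # Explicit override, mirroring "unless explicitly overridden".
    if [ "${ALLOW_STALE_DIGESTS:-0}" = "1" ]; then
        echo "warning: deploying with stale digests (recorded: ${recorded_sha:-none})" >&2
        return 0
    fi
    echo "stale digest manifest: recorded ${recorded_sha:-none}, expected $expected_sha" >&2
    return 1
}
```

Making the override an explicit variable rather than a default keeps the rejection path the normal behavior, which matches the intent of deploying only from immutable, matching references.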
Blocking vs Report-Only Summary
| Check area | Current mode |
|---|---|
| Contracts | blocking |
| Build/tests | blocking |
| Integration smoke | blocking |
| SDK codegen smoke | blocking |
| Migration validation | blocking |
| Packaging/attestation | blocking |
| Frontend e2e | conditional blocking when selected |
| SAST (gosec, semgrep, gitleaks) | report-only |
| govulncheck | report-only |
| Image scan (trivy) | report-only |
| Runtime contract conformance (Schemathesis) | report-only |
| API DAST | report-only |
| Hardening aggregation | report-only |
| Deploy / remote validation | explicit release path (deploy or full mode) |
Remote and Branch Discipline
To avoid CI confusion and branch drift:
1. do not assume release/platform-control pushes auto-deploy,
2. only use PLATFORM_CONTROL_RELEASE_MODE=deploy or full when the intent is to exercise publish/deploy/remote validation,
3. keep release/platform-control single-writer during active CI work,
4. keep platform-gitlab, origin, and github aligned when shared branch updates occur,
5. prefer documentation and evidence review before reopening broad pipeline rewrites.
Out of Scope
This document does not:
1. redesign the pipeline,
2. change current gate severity,
3. convert report-only checks into blocking gates,
4. redefine all production promotion rules,
5. authorize runtime image strategy changes beyond the documented direction.
Source Anchors
Primary sources for this model:
1. doc/operations/Platform_Control_CI_Handoff_2026-03-17.md
2. .gitlab-ci.yml
3. build/ci-image/README.md
4. scripts/ci/README.md
5. doc/operations/Control_Plane_K8s_Migration_v1.md