Runbook: App Artifact Lifecycle Incident

Trigger

  1. A project admin, operator service account, CLI flow, or SDK flow cannot issue an app-artifact publish intent.
  2. Artifact registration, promotion, deprecation, or retirement returns an error envelope with correlation_id.
  3. An artifact shows trust_state=failed_verification or revoked and the app team needs operator triage.
  4. Operators need a deterministic path for artifact lifecycle incidents after the project app-artifact API baseline landed.

Scope

  1. Endpoints:
    • GET /api/v1/projects/{project_id}/app-artifacts
    • POST /api/v1/projects/{project_id}/app-artifacts/publish-intents
    • POST /api/v1/projects/{project_id}/app-artifacts
    • POST /api/v1/projects/{project_id}/app-artifacts/{artifact_id}/promote
    • POST /api/v1/projects/{project_id}/app-artifacts/{artifact_id}/deprecate
    • POST /api/v1/projects/{project_id}/app-artifacts/{artifact_id}/retire
  2. Lifecycle states:
    • published
    • promoted
    • deprecated
    • retired
  3. Trust states:
    • unverified
    • verified
    • failed_verification
    • revoked
  4. This runbook covers control-plane lifecycle and trust metadata only.
  5. It does not replace runtime deployment triage once an artifact has already been selected by an app instance.

Required Context

  1. Error envelope fields:
    • code
    • message
    • correlation_id
    • details
  2. Identity and scope:
    • org_id
    • project_id
    • artifact_id if one exists
    • app_slug
    • app_version
    • actor_user_id or operator_service_account_id
  3. Artifact metadata:
    • repository
    • digest
    • digest_algorithm
    • artifact_kind
    • source_type
    • lifecycle_state
    • trust_state
  4. Request context:
    • X-Project-ID
    • X-Idempotency-Key for mutation paths
    • intended promotion channel when promotion failed
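
When collecting the error envelope fields above from a failed response, a small parser can make a missing field obvious before triage continues. The following is a minimal sketch, assuming the envelope is a flat JSON object carrying the four listed fields at the top level; the `ErrorEnvelope` class and `parse_error_envelope` helper are hypothetical names, not part of the platform CLI or SDK.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class ErrorEnvelope:
    """Hypothetical container for the four envelope fields listed in this runbook."""
    code: str
    message: str
    correlation_id: str
    details: Any = None  # shape of details is not specified here, so keep it opaque


def parse_error_envelope(body: str) -> ErrorEnvelope:
    """Parse a JSON error body; raise ValueError if a required field is absent or empty."""
    data = json.loads(body)
    required = ("code", "message", "correlation_id")
    missing = [name for name in required if not data.get(name)]
    if missing:
        raise ValueError("error envelope missing fields: " + ", ".join(missing))
    return ErrorEnvelope(
        code=data["code"],
        message=data["message"],
        correlation_id=data["correlation_id"],
        details=data.get("details"),
    )
```

If parsing raises because correlation_id is missing, record that fact as evidence; it may indicate the failure happened before the API produced a contract-shaped response.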

Immediate Triage

  1. Confirm the failing path and method:
    • publish intent
    • registration
    • list/read
    • promote
    • deprecate
    • retire
  2. Confirm project_id in the route matches X-Project-ID.
  3. Confirm the actor is allowed to mutate project-owned artifacts.
  4. Capture whether the issue is:
    • new artifact cannot enter the lifecycle
    • lifecycle transition rejected
    • trust verification failed
    • artifact inventory read path degraded
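
The route-versus-header check can be done mechanically on a captured request. This is a sketch only, assuming the route prefix shown in Scope and a case-insensitive header lookup; `project_context_matches` is a hypothetical helper name.

```python
import re

# Matches the app-artifact routes listed in Scope and captures the project_id segment.
_ROUTE_RE = re.compile(r"^/api/v1/projects/(?P<project_id>[^/]+)/app-artifacts(?:/|$)")


def project_context_matches(path: str, headers: dict) -> bool:
    """Return True only if the route project_id equals the X-Project-ID header value."""
    match = _ROUTE_RE.match(path)
    if match is None:
        return False  # not an app-artifact route
    # HTTP header names are case-insensitive, so compare lowercased names.
    header_value = next(
        (value for name, value in headers.items() if name.lower() == "x-project-id"),
        None,
    )
    return header_value is not None and header_value == match.group("project_id")
```

A False result on a failing request points toward request/client misuse or project-context handling rather than a lifecycle defect.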

Correlation-First Query Flow

  1. API logs by correlation_id:
    • {service="gpuaas-api"} | json | correlation_id="<CORRELATION_ID>"
  2. Narrow to artifact endpoints:
    • path=~".*/app-artifacts.*"
  3. Audit evidence for privileged mutations:
    • search audit_logs for actions:
      • app_artifact.register
      • app_artifact.promote
      • app_artifact.deprecate
      • app_artifact.retire
      • app_artifact.verify
      • app_artifact.revoke
  4. If trace_id exists, confirm the control-plane path in Tempo before assuming registry or policy defects.
  5. If the failure was raised from CLI or SDK, pivot to the corresponding client runbook only after the API evidence is captured.
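
The two log filters above can be combined into one LogQL query. The helper below is a sketch, assuming the JSON log lines expose `correlation_id` and `path` as extracted fields after the `json` stage; `build_artifact_log_query` is a hypothetical name.

```python
def build_artifact_log_query(correlation_id: str, artifact_paths_only: bool = True) -> str:
    """Build the LogQL query from this runbook for a single correlation_id."""
    # Reject characters that would break out of the quoted LogQL string.
    if '"' in correlation_id or "\\" in correlation_id:
        raise ValueError("correlation_id must not contain quotes or backslashes")
    query = f'{{service="gpuaas-api"}} | json | correlation_id="{correlation_id}"'
    if artifact_paths_only:
        # Same path filter as the runbook, appended as a label filter stage.
        query += ' | path=~".*/app-artifacts.*"'
    return query
```

Run the query without the path filter first if the narrowed query returns nothing; an empty result on both usually means the correlation_id was copied incorrectly or the request never reached the API.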

Expected Error Classes

  1. invalid_request
    • malformed repository, digest, media type, source metadata, or missing project context
  2. insufficient_permissions
    • actor or service account cannot mutate this project or requested promotion target
  3. app_artifact_not_found
    • wrong artifact_id, wrong project scope, or stale client state
  4. app_artifact_already_exists
    • digest already registered for the same project
  5. app_artifact_state_invalid
    • attempted promotion/deprecation/retirement is not valid for the current lifecycle or trust state
  6. service_unavailable or upstream_error
    • dependency failure in storage/registry/policy verification path
  7. internal_error
    • control-plane defect

Failure Class Triage

Publish Intent Failure

  1. Confirm the request used an idempotency key, and that any retry reused the same key and the same project context.
  2. Confirm the returned repository path matches the platform-owned naming model.
  3. If the path fails before upload begins, treat this as a control-plane issue, not a registry blob-transfer issue.
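
When reproducing a publish-intent retry, the headers need to stay identical between the first attempt and the retry. This is a minimal sketch, assuming a UUID is an acceptable idempotency key format (the runbook does not specify one); `mutation_headers` is a hypothetical helper.

```python
import uuid
from typing import Optional


def mutation_headers(project_id: str, idempotency_key: Optional[str] = None) -> dict:
    """Build mutation headers; pass the earlier key back in to retry the same request."""
    return {
        "X-Project-ID": project_id,
        # Assumption: a random UUID is a valid key. Generate one only for a new request.
        "X-Idempotency-Key": idempotency_key or str(uuid.uuid4()),
    }


first_attempt = mutation_headers("example-project")
retry = mutation_headers("example-project", first_attempt["X-Idempotency-Key"])
```

If the client under investigation generates a fresh key on every retry, that is worth capturing as evidence of client misuse before looking at the control plane.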

Registration Failure

  1. Confirm the digest is immutable and formatted canonically.
  2. Confirm the same digest is not already registered in the target project.
  3. Confirm artifact_kind and source_type are explicit and allowed by policy.
  4. If registration succeeds but trust remains unverified, that is not automatically an outage unless policy requires verified for the next action.
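
A digest format check can rule out malformed input quickly. The sketch below assumes the canonical form is the OCI-style `algorithm:hex` string with lowercase hex; the platform's actual canonical form is defined by the API contract and may differ (for example, if digest and digest_algorithm are sent as separate fields). `is_canonical_digest` is a hypothetical helper.

```python
import re

# Expected hex length per algorithm, following the OCI digest convention (assumption).
_HEX_LENGTHS = {"sha256": 64, "sha512": 128}


def is_canonical_digest(digest: str) -> bool:
    """Return True for strings shaped like 'sha256:<64 lowercase hex chars>'."""
    algorithm, separator, hex_part = digest.partition(":")
    expected_length = _HEX_LENGTHS.get(algorithm)
    if not separator or expected_length is None:
        return False  # missing 'algorithm:' prefix or unsupported algorithm
    return re.fullmatch(r"[0-9a-f]{%d}" % expected_length, hex_part) is not None
```

An uppercase or truncated digest failing this check is consistent with invalid_request; a well-formed digest that is still rejected points elsewhere.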

Promotion Failure

  1. Confirm the artifact is not deprecated or retired.
  2. Confirm the target channel is valid and the actor is allowed to promote into it.
  3. Confirm project policy does not require a stronger trust state than the artifact currently has.
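
The promotion checks above can be expressed as a pre-check over the artifact's recorded states. This is a sketch under stated assumptions: the only policy modeled is "promotion requires verified", channel and actor authorization are out of scope, and `promotion_blockers` is a hypothetical helper rather than the control plane's actual validation logic.

```python
def promotion_blockers(lifecycle_state: str, trust_state: str,
                       policy_requires_verified: bool = False) -> list:
    """Return human-readable reasons a promotion is expected to be rejected."""
    blockers = []
    if lifecycle_state in ("deprecated", "retired"):
        blockers.append(f"lifecycle_state={lifecycle_state} cannot be promoted")
    if trust_state in ("failed_verification", "revoked"):
        # Trust failures are escalated, never worked around.
        blockers.append(f"trust_state={trust_state} must be escalated, not promoted")
    elif policy_requires_verified and trust_state != "verified":
        blockers.append("project policy requires trust_state=verified")
    return blockers
```

An empty list alongside an app_artifact_state_invalid response suggests the rejection came from a rule not modeled here, such as the target channel, and is worth noting in the evidence.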

Trust Failure

  1. Treat trust_state=failed_verification as the primary pivot, not as a generic lifecycle failure.
  2. Confirm whether the failure came from digest mismatch, source allowlist rejection, or signature/provenance policy.
  3. Do not promote, deprecate, or otherwise work around a trust failure as a normal operator step.

Retirement or Deprecation Failure

  1. Confirm the artifact belongs to the project in the route.
  2. Confirm the current state allows the requested transition.
  3. If the artifact must be blocked immediately for safety, escalate toward revoke/trust-policy ownership instead of forcing lifecycle drift.

Boundary Validation

  1. Project ownership:
    • artifact belongs to the same project_id
  2. Contract alignment:
    • request shape matches doc/api/openapi.draft.yaml
  3. Policy alignment:
    • source type and promotion target comply with policy
  4. Audit path:
    • privileged mutation wrote an audit_logs row with the same correlation_id
  5. Root-cause ownership:
    • distinguish control-plane lifecycle defect from downstream registry/storage defect before mitigation

Recovery Guidance

  1. invalid_request
    • correct the request shape or project context and retry with a new idempotency key only if the prior request was malformed
  2. insufficient_permissions
    • fix project/admin or service-account scope, then retry the exact intended action
  3. app_artifact_already_exists
    • reuse the existing artifact record; do not register duplicate digests to work around the error
  4. app_artifact_state_invalid
    • move the artifact through a valid lifecycle path or stop if the request violates trust/lifecycle invariants
  5. failed_verification or revoked
    • quarantine the artifact from further promotion and escalate to artifact trust/policy ownership
  6. Dependency outage
    • restore the owning storage/registry/policy dependency before retrying artifact lifecycle mutations
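
The guidance above can be kept as a lookup for triage tooling. The sketch below paraphrases this section only; it assumes service_unavailable and upstream_error are the codes that signal a dependency outage, and `recovery_hint` is a hypothetical helper.

```python
from typing import Optional

_DEPENDENCY = "restore the owning storage/registry/policy dependency before retrying"

RECOVERY_HINTS = {
    "invalid_request": "correct the request shape or project context, then retry",
    "insufficient_permissions": "fix project/admin or service-account scope, then retry the exact intended action",
    "app_artifact_already_exists": "reuse the existing artifact record; do not register duplicate digests",
    "app_artifact_state_invalid": "move the artifact through a valid lifecycle path, or stop if invariants would be violated",
    "service_unavailable": _DEPENDENCY,
    "upstream_error": _DEPENDENCY,
}

TRUST_ESCALATION = "quarantine the artifact from further promotion and escalate to artifact trust/policy ownership"


def recovery_hint(code: str, trust_state: Optional[str] = None) -> str:
    """Return the recovery guidance for an error code; trust failures take precedence."""
    if trust_state in ("failed_verification", "revoked"):
        return TRUST_ESCALATION
    # Codes without documented guidance here fall through to escalation.
    return RECOVERY_HINTS.get(code, "no documented recovery; follow the Escalation Map")
```

Checking trust_state first reflects the rule that a trust failure is the primary pivot regardless of which error code the request returned.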

Escalation Map

  1. Project-context or membership issue:
    • doc/operations/runbooks/Tenant_Project_Authorization_Runbook.md
  2. Client-only reproduction issue with API healthy:
    • doc/operations/runbooks/CLI_Incident_and_Support_Triage_Runbook.md
    • doc/operations/runbooks/Python_SDK_Incident_and_Observability_Runbook.md
  3. Broad API degradation:
    • doc/operations/runbooks/API_Degradation_Runbook.md
  4. App instance deploy/runtime impact after artifact selection:
    • doc/operations/runbooks/App_Runtime_Lifecycle_Incident_Runbook.md

Evidence to Capture

  1. Exact failing endpoint and method
  2. correlation_id and trace_id
  3. project_id, artifact_id, app_slug, app_version
  4. repository, digest, artifact_kind, source_type
  5. Previous and current lifecycle_state and trust_state
  6. Audit evidence for any privileged mutation
  7. Whether the owning layer is:
    • request/client misuse
    • authz/policy
    • control-plane lifecycle implementation
    • storage/registry dependency
    • trust verification path

Escalation Rule

If the only apparent fix is to bypass digest-only registration, suppress trust failure handling, or mutate lifecycle state outside the contract, stop and file a control-plane defect. Do not normalize that workaround in operations.