
Runbook: Enterprise Federation Incident

Trigger

  1. Work-account sign-in fails through the OIDC or SAML enterprise onboarding flow.
  2. A tenant admin reports that tenant federation setup appears valid but users cannot complete login.
  3. The federation callback returns token_invalid, invalid_request, or a membership-related denial with a correlation_id.
  4. A support ticket mentions any of:
     - tenant hint mismatch
     - state expired/replayed
     - user not onboarded
     - membership required

Required Context

  1. correlation_id from the auth failure envelope.
  2. trace_id if present.
  3. Federation context:
     - tenant_hint
     - identity_hint if supplied
     - protocol: oidc or saml
     - callback path used
  4. Effective tenant evidence:
     - resolved org_id if available
     - whether failure is authn, federation resolution, or membership gate

Correlation-First Triage

  1. API auth logs by correlation_id:
     - {service="gpuaas-api"} | json | correlation_id="<CORRELATION_ID>"
  2. Filter for federation terms:
     - federation
     - oidc
     - saml
     - tenant_hint
     - state
  3. Trace pivot:
     - open Tempo trace by trace_id if present
  4. Audit and support evidence:
     - confirm the failing path preserved correlation_id
     - do not trust client-supplied tenant_hint as authoritative evidence
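The first two triage steps can be combined into a single query. The sketch below builds a LogQL string that scopes to one correlation_id and adds a case-insensitive line filter for the federation terms. The function name, the default service label, and the accepted correlation_id character set are assumptions for illustration; adjust them to the real log schema.

```python
import re

# Terms taken from the "Filter for federation terms" step.
FEDERATION_TERMS = ("federation", "oidc", "saml", "tenant_hint", "state")


def build_triage_query(correlation_id: str, service: str = "gpuaas-api") -> str:
    """Build a LogQL query for one correlation_id, filtered to federation terms."""
    # Assumed ID format: reject anything that could break out of the quoted matcher.
    if not re.fullmatch(r"[A-Za-z0-9._:-]+", correlation_id):
        raise ValueError(f"unexpected correlation_id format: {correlation_id!r}")
    terms = "|".join(FEDERATION_TERMS)
    return (
        f'{{service="{service}"}} '
        f'|~ "(?i){terms}" '
        f'| json | correlation_id="{correlation_id}"'
    )


if __name__ == "__main__":
    print(build_triage_query("abc-123"))
```

The line filter is placed before the json stage so that non-matching lines are dropped before parsing. The term "state" is broad, so expect some unrelated matches.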

Failure Separation Model

Keep these classes distinct during support:

  1. Federation resolution failure
     - invalid/missing tenant_hint
     - no verified domain binding
     - no matching provider
  2. Protocol/callback failure
     - invalid state
     - expired state
     - replayed state
     - OIDC/SAML provider mismatch
  3. Identity onboarding failure
     - enterprise identity not mapped/onboarded locally
  4. Membership gate failure
     - identity exists, but no active tenant membership for resolved org_id

These classes have different owners and must not be collapsed into one generic “SSO broken” diagnosis.
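One way to keep the classes apart in tooling is an explicit signal-to-class lookup. The following is a minimal sketch; the signal names are hypothetical labels for the symptoms listed above, not identifiers emitted by the platform.

```python
from enum import Enum


class FailureClass(Enum):
    FEDERATION_RESOLUTION = "federation resolution failure"
    PROTOCOL_CALLBACK = "protocol/callback failure"
    IDENTITY_ONBOARDING = "identity onboarding failure"
    MEMBERSHIP_GATE = "membership gate failure"


# Hypothetical signal names; each maps to exactly one class from the model.
SIGNAL_TO_CLASS = {
    "tenant_hint_invalid": FailureClass.FEDERATION_RESOLUTION,
    "tenant_hint_missing": FailureClass.FEDERATION_RESOLUTION,
    "no_verified_domain_binding": FailureClass.FEDERATION_RESOLUTION,
    "no_matching_provider": FailureClass.FEDERATION_RESOLUTION,
    "state_invalid": FailureClass.PROTOCOL_CALLBACK,
    "state_expired": FailureClass.PROTOCOL_CALLBACK,
    "state_replayed": FailureClass.PROTOCOL_CALLBACK,
    "provider_mismatch": FailureClass.PROTOCOL_CALLBACK,
    "identity_not_onboarded": FailureClass.IDENTITY_ONBOARDING,
    "no_active_membership": FailureClass.MEMBERSHIP_GATE,
}


def classify(signals):
    """Return the set of failure classes implied by the observed signals."""
    unknown = [s for s in signals if s not in SIGNAL_TO_CLASS]
    if unknown:
        # Refuse to guess: an unmapped signal needs a human decision.
        raise ValueError(f"unmapped signals: {unknown}")
    return {SIGNAL_TO_CLASS[s] for s in signals}


if __name__ == "__main__":
    print(classify(["state_replayed", "no_active_membership"]))
```

If the result contains more than one class, record each as a separate finding rather than merging them.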

OIDC Checks

  1. Confirm request path:
     - /api/v1/auth/oidc/authorize
     - /api/v1/auth/oidc/exchange
  2. Confirm state was issued and then consumed once.
  3. Confirm callback used the same redirect_uri.
  4. Confirm issuer resolution matched the selected provider.
  5. Confirm identity was either:
     - resolved to an onboarded enterprise user, or
     - rejected as not onboarded/membership missing.
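The "issued and then consumed once" check can be run mechanically over log events extracted for one sign-in attempt. The sketch below assumes a hypothetical event shape (state, action, timestamp) and a hypothetical 600-second state lifetime; neither is taken from the platform, so substitute the real log fields and TTL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateEvent:
    state: str
    action: str       # "issued" or "consumed" (hypothetical labels)
    timestamp: float  # seconds since epoch


def check_state_lifecycle(events, state, max_age_s=600.0):
    """Classify one state value: ok, never_issued, not_consumed, replayed, or expired."""
    relevant = sorted((e for e in events if e.state == state), key=lambda e: e.timestamp)
    issued = [e for e in relevant if e.action == "issued"]
    consumed = [e for e in relevant if e.action == "consumed"]
    if not issued:
        return "never_issued"
    if not consumed:
        return "not_consumed"
    if len(consumed) > 1:
        # More than one consumption of the same state is a replay.
        return "replayed"
    if consumed[0].timestamp - issued[0].timestamp > max_age_s:
        # max_age_s is an assumed lifetime, not a documented platform value.
        return "expired"
    return "ok"


if __name__ == "__main__":
    events = [
        StateEvent("s1", "issued", 0.0),
        StateEvent("s1", "consumed", 30.0),
        StateEvent("s1", "consumed", 31.0),
    ]
    print(check_state_lifecycle(events, "s1"))
```

A result of replayed or expired points to the protocol/callback class, and the user should restart the flow from authorize.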

SAML Checks

  1. Confirm request path:
     - /api/v1/auth/saml/authorize
     - /api/v1/auth/saml/callback
  2. Confirm RelayState is present and corresponds to issued state.
  3. Confirm the current platform behavior:
     - if SAML runtime is not configured, this is expected to return service-unavailable guidance rather than silently fail.
  4. Do not treat “not configured” as a protocol defect.

Tenant and Membership Checks

  1. Tenant hints are advisory only.
  2. Server-resolved org_id is authoritative.
  3. If identity is onboarded but not a tenant member:
     - route to IAM membership remediation, not protocol debugging.
  4. If identity is not onboarded at all:
     - route to enterprise onboarding/admin setup path.
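The routing rules above reduce to a small decision function. This is an illustrative sketch with hypothetical names; note that tenant_hint is deliberately not an input, since only the server-resolved org_id is authoritative.

```python
ONBOARDING = "enterprise onboarding/admin setup"
MEMBERSHIP = "IAM membership remediation"
PROTOCOL = "protocol debugging"


def route_ticket(onboarded: bool, resolved_org_id: str, active_member_org_ids: set) -> str:
    """Pick the remediation path from server-side facts only."""
    if not onboarded:
        # Identity is not onboarded at all.
        return ONBOARDING
    if resolved_org_id not in active_member_org_ids:
        # Identity exists but has no active membership in the resolved tenant.
        return MEMBERSHIP
    # Onboarded and a member: the remaining suspect is the provider/protocol path.
    return PROTOCOL


if __name__ == "__main__":
    print(route_ticket(True, "org-42", {"org-7"}))
```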

Common Failure Classes

  1. invalid_request
     - malformed tenant hint, bad callback input, missing redirect/state fields.
  2. token_invalid
     - invalid or expired federation state, failed token/assertion processing.
  3. insufficient_permissions
     - enterprise identity exists but membership gate failed.
  4. service_unavailable
     - federation runtime not configured or auth backend unavailable.
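The table above can be applied directly to a captured error envelope. The sketch below assumes the envelope is a flat JSON object with code and correlation_id at the top level; that layout is an assumption, so adapt the field access to the real envelope.

```python
import json

# Likely causes per error code, copied from the list above.
LIKELY_CAUSES = {
    "invalid_request": [
        "malformed tenant hint",
        "bad callback input",
        "missing redirect/state fields",
    ],
    "token_invalid": [
        "invalid or expired federation state",
        "failed token/assertion processing",
    ],
    "insufficient_permissions": [
        "enterprise identity exists but membership gate failed",
    ],
    "service_unavailable": [
        "federation runtime not configured",
        "auth backend unavailable",
    ],
}


def triage_envelope(raw: str) -> dict:
    """Parse an error envelope and attach the likely causes for its code."""
    envelope = json.loads(raw)
    code = envelope.get("code")
    return {
        "code": code,
        "correlation_id": envelope.get("correlation_id"),
        # Unknown codes yield an empty list instead of a guess.
        "likely_causes": LIKELY_CAUSES.get(code, []),
    }


if __name__ == "__main__":
    sample = '{"code": "token_invalid", "message": "state expired", "details": {}, "correlation_id": "abc-123"}'
    print(triage_envelope(sample))
```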

Recovery Guidance

  1. Bad tenant hint or identity hint:
     - correct the input and retry.
  2. State replay/expiry:
     - restart the sign-in flow from authorize.
  3. Onboarded-but-not-member:
     - fix tenant membership and retry.
  4. Not onboarded:
     - complete enterprise onboarding/admin binding before retry.
  5. SAML not configured:
     - do not treat as user error; escalate as expected unsupported path if tenant expects SAML.

Evidence to Capture

  1. correlation_id, trace_id, protocol, and callback path.
  2. Whether failure occurred at:
     - authorize
     - exchange/callback
     - membership gate
  3. Effective resolved org_id if known.
  4. Whether the user is:
     - not onboarded
     - onboarded without membership
     - member but provider path failed
  5. Exact error envelope (code, message, details) and timeline.
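To keep captured evidence consistent across tickets, the items above can be recorded in a small validated structure. This is a sketch only: the class name, field names, and JSON output are hypothetical conventions, not an existing platform schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

PROTOCOLS = {"oidc", "saml"}
STAGES = {"authorize", "exchange/callback", "membership gate"}
USER_STATES = {
    "not onboarded",
    "onboarded without membership",
    "member but provider path failed",
}


@dataclass
class FederationEvidence:
    correlation_id: str
    protocol: str
    callback_path: str
    failure_stage: str
    user_state: str
    error_envelope: dict  # exact code, message, details as returned
    timeline: list        # ordered, human-readable event notes
    trace_id: Optional[str] = None
    resolved_org_id: Optional[str] = None

    def __post_init__(self):
        # Reject free-form values so tickets stay comparable.
        if self.protocol not in PROTOCOLS:
            raise ValueError(f"protocol must be one of {sorted(PROTOCOLS)}")
        if self.failure_stage not in STAGES:
            raise ValueError(f"failure_stage must be one of {sorted(STAGES)}")
        if self.user_state not in USER_STATES:
            raise ValueError(f"user_state must be one of {sorted(USER_STATES)}")

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)


if __name__ == "__main__":
    record = FederationEvidence(
        correlation_id="abc-123",
        protocol="oidc",
        callback_path="/api/v1/auth/oidc/exchange",
        failure_stage="exchange/callback",
        user_state="member but provider path failed",
        error_envelope={"code": "token_invalid", "message": "", "details": {}},
        timeline=["authorize issued state", "exchange rejected state"],
    )
    print(record.to_json())
```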

Cross-Runbook Linkage

If protocol succeeds but authz fails, continue with:

  - doc/operations/runbooks/User_Onboarding_Auth_Context_Runbook.md
  - doc/operations/runbooks/IAM_Role_Assignment_and_Membership_Incident_Runbook.md

If the issue is a broader API auth degradation, also consult:

  - doc/operations/runbooks/API_Degradation_Runbook.md