Runbook: Enterprise Federation Incident
Trigger
- Work-account sign-in fails through the OIDC or SAML enterprise onboarding flow.
- A tenant admin reports that tenant federation setup appears valid but users cannot complete login.
- Federation callback returns `token_invalid`, `invalid_request`, or a membership-related denial with a `correlation_id`.
- Support ticket mentions:
  - tenant hint mismatch
  - state expired/replayed
  - user not onboarded
  - membership required
Required Context
- `correlation_id` from the auth failure envelope.
- `trace_id` if present.
- Federation context:
  - `tenant_hint`
  - `identity_hint` if supplied
  - protocol: `oidc` or `saml`
  - callback path used
- Effective tenant evidence:
  - resolved `org_id` if available
  - whether failure is authn, federation resolution, or membership gate
Correlation-First Triage
- API auth logs by `correlation_id`:
  - `{service="gpuaas-api"} | json | correlation_id="<CORRELATION_ID>"`
- Filter for federation terms:
  - `federation`
  - `oidc`
  - `saml`
  - `tenant_hint`
  - `state`
- Trace pivot:
  - open Tempo trace by `trace_id` if present
- Audit and support evidence:
  - confirm the failing path preserved `correlation_id`
  - do not trust client-supplied `tenant_hint` as authoritative evidence
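The log query above can be assembled programmatically when triaging many tickets. The following is a minimal sketch, not part of the platform tooling: `build_triage_query` is a hypothetical helper, and appending the federation terms as a single case-insensitive regex line filter (`|~`) is an assumption about how the term filter is applied.

```python
# Federation terms listed in this runbook.
FEDERATION_TERMS = ["federation", "oidc", "saml", "tenant_hint", "state"]


def build_triage_query(correlation_id, terms=None):
    """Return a LogQL query for API auth logs scoped to one correlation_id.

    Hypothetical helper for illustration; the stream selector and label
    names are copied from the runbook, the term filter is an assumption.
    """
    # Escape backslashes and double quotes so the ID stays inside the string literal.
    safe_id = correlation_id.replace("\\", "\\\\").replace('"', '\\"')
    query = '{service="gpuaas-api"} | json | correlation_id="' + safe_id + '"'
    if terms:
        # Terms are assumed to be plain words (no regex metacharacters).
        query += ' |~ "(?i)' + "|".join(terms) + '"'
    return query


if __name__ == "__main__":
    print(build_triage_query("<CORRELATION_ID>", FEDERATION_TERMS))
```

Start with the unfiltered query; add the term filter only when the correlation-scoped result is too noisy.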
Failure Separation Model
Keep these classes distinct during support:
- Federation resolution failure
  - invalid/missing `tenant_hint`
  - no verified domain binding
  - no matching provider
- Protocol/callback failure
  - invalid state
  - expired state
  - replayed state
  - OIDC/SAML provider mismatch
- Identity onboarding failure
  - enterprise identity not mapped/onboarded locally
- Membership gate failure
  - identity exists, but no active tenant membership for resolved `org_id`
Each class has a different owner; do not collapse them into one generic “SSO broken” diagnosis.
OIDC Checks
- Confirm request path:
  - `/api/v1/auth/oidc/authorize`
  - `/api/v1/auth/oidc/exchange`
- Confirm state was issued and then consumed once.
- Confirm callback used the same `redirect_uri`.
- Confirm issuer resolution matched the selected provider.
- Confirm identity was either:
  - resolved to an onboarded enterprise user, or
  - rejected as not onboarded/membership missing.
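The "issued and then consumed once" and "same `redirect_uri`" checks can be reasoned about with a small model. The sketch below is an in-memory illustration only; `StateStore`, its TTL, and its result strings are hypothetical and do not describe how the platform actually stores state.

```python
import secrets
import time


class StateStore:
    """In-memory one-time state store (hypothetical; illustration only)."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._pending = {}      # state -> (redirect_uri, expires_at)
        self._consumed = set()  # states that were already presented once

    def issue(self, redirect_uri):
        """Create a fresh state value bound to the redirect_uri used at authorize."""
        state = secrets.token_urlsafe(32)
        self._pending[state] = (redirect_uri, self._clock() + self._ttl)
        return state

    def consume(self, state, redirect_uri):
        """Return "ok", "replayed", "invalid", "expired", or "redirect_mismatch"."""
        if state in self._consumed:
            return "replayed"
        entry = self._pending.pop(state, None)
        if entry is None:
            return "invalid"
        # Mark as consumed even on failure so a second attempt reads as a replay.
        self._consumed.add(state)
        issued_uri, expires_at = entry
        if self._clock() > expires_at:
            return "expired"
        if redirect_uri != issued_uri:
            return "redirect_mismatch"
        return "ok"
```

When reading logs, the same distinctions apply: a state seen twice is a replay, a state never issued is invalid, and a state issued too long ago is expired. All three are resolved by restarting from authorize.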
SAML Checks
- Confirm request path:
  - `/api/v1/auth/saml/authorize`
  - `/api/v1/auth/saml/callback`
- Confirm `RelayState` is present and corresponds to issued state.
- Confirm the current platform behavior:
  - if SAML runtime is not configured, this is expected to return service-unavailable guidance rather than silently fail.
- Do not treat “not configured” as a protocol defect.
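The order of the SAML checks matters: "not configured" should be ruled out before any `RelayState` analysis. A minimal sketch of that ordering follows; `check_saml_callback` is a hypothetical function, and mapping each outcome to an error code is an assumption based on the failure classes listed in this runbook.

```python
import hmac
from typing import Optional


def check_saml_callback(saml_configured: bool,
                        relay_state: Optional[str],
                        issued_state: Optional[str]) -> str:
    """Return "ok" or the error code a SAML callback is assumed to produce."""
    if not saml_configured:
        # Expected behavior, not a protocol defect.
        return "service_unavailable"
    if not relay_state:
        # Missing state field in the callback input.
        return "invalid_request"
    if issued_state is None:
        # No state was issued for this flow.
        return "token_invalid"
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(relay_state.encode(), issued_state.encode()):
        return "token_invalid"
    return "ok"
```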
Tenant and Membership Checks
- Tenant hints are advisory only.
- Server-resolved `org_id` is authoritative.
- If identity is onboarded but not a tenant member:
  - route to IAM membership remediation, not protocol debugging.
- If identity is not onboarded at all:
  - route to enterprise onboarding/admin setup path.
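The routing rules above reduce to two questions asked in a fixed order. The sketch below illustrates that order; `route_identity_failure` and its return strings are hypothetical labels, not platform identifiers.

```python
def route_identity_failure(onboarded: bool, is_member: bool) -> str:
    """Pick the remediation path from server-side facts about the identity.

    Both inputs must come from server-resolved data (the resolved org_id),
    never from the client-supplied tenant hint.
    """
    if not onboarded:
        return "enterprise onboarding/admin setup"
    if not is_member:
        return "IAM membership remediation"
    # Identity and membership are fine, so the provider path is the remaining suspect.
    return "protocol/provider debugging"
```

For example, `route_identity_failure(onboarded=True, is_member=False)` returns `"IAM membership remediation"`, which matches the guidance to avoid protocol debugging in that case.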
Common Failure Classes
- `invalid_request`
  - malformed tenant hint, bad callback input, missing redirect/state fields.
- `token_invalid`
  - invalid or expired federation state, failed token/assertion processing.
- `insufficient_permissions`
  - enterprise identity exists but membership gate failed.
- `service_unavailable`
  - federation runtime not configured or auth backend unavailable.
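As a triage aid, the error codes above can be mapped to the classes in the Failure Separation Model they most likely indicate. The mapping below is an inference from the descriptions in this runbook, not a documented contract; in particular, the runbook does not state which code a not-onboarded identity returns, so that class is left unmapped.

```python
# Candidate failure classes per error code (inferred, not authoritative).
CANDIDATE_CLASSES = {
    "invalid_request": [
        "Federation resolution failure",
        "Protocol/callback failure",
    ],
    "token_invalid": ["Protocol/callback failure"],
    "insufficient_permissions": ["Membership gate failure"],
    # Runtime not configured or backend unavailable: outside the four classes.
    "service_unavailable": [],
}


def candidate_classes(error_code: str) -> list:
    """Return the failure classes to investigate first for an error code."""
    return list(CANDIDATE_CLASSES.get(error_code, []))
```

An empty result means the code alone does not narrow the class; fall back to the correlation-first triage steps.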
Recovery Guidance
- Bad tenant hint or identity hint:
  - correct the input and retry.
- State replay/expiry:
  - restart the sign-in flow from authorize.
- Onboarded-but-not-member:
  - fix tenant membership and retry.
- Not onboarded:
  - complete enterprise onboarding/admin binding before retry.
- SAML not configured:
  - do not treat as user error; escalate as expected unsupported path if tenant expects SAML.
Evidence to Capture
- `correlation_id`, `trace_id`, protocol, and callback path.
- Whether failure occurred at:
  - authorize
  - exchange/callback
  - membership gate
- Effective resolved `org_id` if known.
- Whether the user is:
  - not onboarded
  - onboarded without membership
  - member but provider path failed
- Exact error envelope (`code`, `message`, `details`) and timeline.
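The evidence list can be captured as a structured record so that handoffs between teams do not lose fields. This is a sketch under assumptions: `FederationEvidence` and its field names are hypothetical and simply mirror the bullets above.

```python
from dataclasses import dataclass, field
from typing import Optional

PROTOCOLS = {"oidc", "saml"}
STAGES = {"authorize", "exchange/callback", "membership gate"}
USER_STATES = {
    "not onboarded",
    "onboarded without membership",
    "member but provider path failed",
}


@dataclass
class FederationEvidence:
    """Evidence record for one federation incident (hypothetical shape)."""
    correlation_id: str
    protocol: str
    callback_path: str
    failure_stage: str
    user_state: str
    error_code: str
    error_message: str
    error_details: dict = field(default_factory=dict)
    trace_id: Optional[str] = None   # only if present
    org_id: Optional[str] = None     # server-resolved, only if known
    timeline: list = field(default_factory=list)

    def problems(self) -> list:
        """Return a list of human-readable gaps in the record."""
        found = []
        if not self.correlation_id:
            found.append("correlation_id is required")
        if self.protocol not in PROTOCOLS:
            found.append("protocol must be oidc or saml")
        if self.failure_stage not in STAGES:
            found.append("unknown failure_stage")
        if self.user_state not in USER_STATES:
            found.append("unknown user_state")
        return found
```

An empty list from `problems()` means the record has the minimum needed for escalation; it does not validate that the values are correct.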
Cross-Runbook Linkage
If protocol succeeds but authz fails, continue with:
- doc/operations/runbooks/User_Onboarding_Auth_Context_Runbook.md
- doc/operations/runbooks/IAM_Role_Assignment_and_Membership_Incident_Runbook.md
If the issue is a broader API auth degradation, also consult:
- doc/operations/runbooks/API_Degradation_Runbook.md