Trail: IAM & Tenancy

Identity, authorization, scope, and audit — end-to-end, with diagrams per step.

Trail map

flowchart TB
    classDef impl fill:#d1e7dd,stroke:#0a3622,color:#0a3622
    classDef des  fill:#fff3cd,stroke:#332701,color:#332701
    classDef cmp  fill:#f8d7da,stroke:#42101e,color:#42101e

    I1[1. Tenant root]:::impl --> I2[2. Project scope]:::impl
    I2 --> I3[3. Memberships & roles]:::impl
    I3 --> I4[4. Service accounts]:::impl
    I4 --> I5[5. OIDC / JWKS]:::impl
    I5 --> I6[6. Resource identifiers]:::impl
    I6 --> I7[7. Policy chain]:::impl
    I7 --> I8[8. Audit]:::impl
    I8 --> I9[9. Federation]:::des
    I9 --> I10[10. Cloud hierarchy comparison]:::cmp

1. Tenant root

Implemented

The tenant (organizations table) is the ownership root. It survives user churn — when a user leaves, resources stay tenant-owned.

erDiagram
    organizations ||--o{ projects : "owns"
    organizations ||--o{ tenant_memberships : "has members"
    organizations ||--o{ allocations : "owns (nullable for shared)"
    organizations ||--o{ nodes : "owns (nullable for shared)"
    users ||--o{ tenant_memberships : "is member of"

    organizations {
        uuid id PK
        text name
        text stripe_customer_id "billing anchor"
        timestamp created_at
        timestamp deleted_at "soft delete"
    }
    tenant_memberships {
        uuid user_id PK
        uuid org_id PK
        text role "owner|admin|member|viewer"
        timestamp deleted_at
    }
    users {
        uuid id PK
        text email
        text external_subject "OIDC sub"
    }

Why tenant-root, not user-root:

flowchart LR
    U1[User leaves company] --> Q{Owner model}
    Q -- "user-owned (anti-pattern)" --> X[Resources orphaned<br/>or transferred manually<br/>Audit chain broken]
    Q -- "tenant-owned (GPUaaS)" --> OK[Resources stay tenant-owned<br/>User membership revoked<br/>Audit chain preserved]
    classDef bad fill:#f8d7da,stroke:#42101e
    classDef good fill:#d1e7dd,stroke:#0a3622
    class X bad
    class OK good
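
The tenant-owned branch above can be sketched with an in-memory SQLite database. This is a minimal illustration, not the platform's schema: the `allocations` columns, the `revoke_membership` helper, and the sample IDs are assumptions made for the example.

```python
import sqlite3


def revoke_membership(conn, user_id, org_id):
    """Soft-delete one membership; tenant-owned resources are not touched."""
    conn.execute(
        "UPDATE tenant_memberships SET deleted_at = CURRENT_TIMESTAMP "
        "WHERE user_id = ? AND org_id = ? AND deleted_at IS NULL",
        (user_id, org_id),
    )


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE organizations (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE tenant_memberships (
        user_id TEXT, org_id TEXT, role TEXT, deleted_at TIMESTAMP,
        PRIMARY KEY (user_id, org_id)
    );
    -- Hypothetical column set: only org_id matters for ownership here.
    CREATE TABLE allocations (id TEXT PRIMARY KEY, org_id TEXT);
    INSERT INTO organizations VALUES ('org-1', 'acme-corp');
    INSERT INTO tenant_memberships VALUES ('user-1', 'org-1', 'admin', NULL);
    INSERT INTO allocations VALUES ('alloc-1', 'org-1');
""")

# The user leaves: only the membership row changes.
revoke_membership(conn, "user-1", "org-1")

active_members = conn.execute(
    "SELECT COUNT(*) FROM tenant_memberships WHERE deleted_at IS NULL"
).fetchone()[0]
tenant_allocations = conn.execute(
    "SELECT COUNT(*) FROM allocations WHERE org_id = 'org-1'"
).fetchone()[0]
```

Because the allocation references the organization rather than the user, revoking the membership needs no resource transfer.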

→ Sources: Tenant_Project_Ownership_Baseline.md, Brokered_Identity_Linking_and_Dedup_v1.md

2. Project scope

Implemented

A project is the operational scope inside a tenant. Allocations, app instances, storage namespaces — all belong to a project. A default project is auto-created at signup so single-user accounts don't see hierarchy ceremony.

flowchart TB
    subgraph T[Tenant: acme-corp]
        direction TB
        P1[Project: default<br/>auto-created at signup]
        P2[Project: ml-research]
        P3[Project: prod-inference]
    end
    P1 --> A1[Allocations]
    P1 --> S1[Storage]
    P1 --> APP1[App instances]
    P2 --> A2[Allocations]
    P2 --> S2[Storage]
    P3 --> A3[Allocations]
    P3 --> S3[Storage]
    P3 --> APP3[App instances]

    classDef tenant fill:#fff8e1,stroke:#f57f17
    classDef proj fill:#e3f2fd,stroke:#1565c0
    class T tenant
    class P1,P2,P3 proj

Resources never cross project boundaries without an explicit cross-project authorization. SSH keys, API keys, and policies are project-scoped too; see Allocation_Project_SSH_Access_v1.md.
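
One way to picture the boundary rule is a check that allows same-project access by default and everything else only through an explicit grant. The sketch below is illustrative only: `Resource`, `can_access`, and the grant representation (a set of caller-project and target-project pairs) are hypothetical names, not the platform's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    resource_id: str
    project_id: str


def can_access(caller_project_id, resource, cross_project_grants=frozenset()):
    """Allow same-project access; otherwise require an explicit grant.

    cross_project_grants is a set of (caller_project, target_project)
    pairs, assumed here purely for illustration.
    """
    if resource.project_id == caller_project_id:
        return True
    return (caller_project_id, resource.project_id) in cross_project_grants


bucket = Resource(resource_id="storage-1", project_id="ml-research")
same_project = can_access("ml-research", bucket)
other_project = can_access("prod-inference", bucket)
granted = can_access(
    "prod-inference", bucket, frozenset({("prod-inference", "ml-research")})
)
```

The default-deny branch is the point of the example: without a grant, a caller in another project gets no access at all.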

3. Memberships & roles

Implemented

flowchart LR
    U[User] -->|tenant_memberships| TM{Tenant role}
    U -->|project_memberships| PM{Project role}
    TM -->|owner| TPRIV[Full tenant<br/>delete tenant, billing]
    TM -->|admin| TADM[Manage projects, users]
    TM -->|member| TMBR[Default member]
    TM -->|viewer| TVIEW[Read only]
    PM -->|owner| PPRIV[Full project]
    PM -->|admin| PADM[Manage allocations<br/>members, policies]
    PM -->|member| PMBR[Use resources]
    PM -->|viewer| PVIEW[Read only]
    classDef hi fill:#fff3e0,stroke:#e65100
    class TPRIV,PPRIV hi

MVP constraint (enforced by DB unique index):

CREATE UNIQUE INDEX uq_tenant_memberships_active_user
    ON tenant_memberships(user_id) WHERE deleted_at IS NULL;

This pins the MVP to one active tenant per user. Multi-tenant users are designed (see step 9, Federation) but gated until customer demand justifies the IAM-redesign cost.
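
The effect of the partial unique index can be tried out locally. The sketch below uses SQLite as a stand-in for Postgres (both accept a `WHERE` clause on a unique index); the table definition is reduced to the columns shown in the ER diagram, and `add_membership` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tenant_memberships (
        user_id TEXT NOT NULL,
        org_id TEXT NOT NULL,
        role TEXT NOT NULL,
        deleted_at TIMESTAMP,
        PRIMARY KEY (user_id, org_id)
    );
    CREATE UNIQUE INDEX uq_tenant_memberships_active_user
        ON tenant_memberships(user_id) WHERE deleted_at IS NULL;
""")


def add_membership(user_id, org_id, role="member"):
    """Insert a membership; return False if the unique index rejects it."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(
                "INSERT INTO tenant_memberships (user_id, org_id, role) "
                "VALUES (?, ?, ?)",
                (user_id, org_id, role),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = add_membership("u1", "org-a")    # first active membership
second = add_membership("u1", "org-b")   # second active membership: rejected

# Soft-deleting the first membership removes it from the partial index.
with conn:
    conn.execute(
        "UPDATE tenant_memberships SET deleted_at = CURRENT_TIMESTAMP "
        "WHERE user_id = 'u1' AND org_id = 'org-a'"
    )
third = add_membership("u1", "org-b")    # now allowed
```

Soft-deleted rows fall outside the index, so a user can move to another tenant once the old membership is revoked.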

→ Sources: Platform_IAM_Model_v1.md, Role_and_Policy_Lifecycle_Model.md, User_Onboarding_Model.md

4. Service accounts

Implemented

Project-scoped non-human identities for CI, agents, app machinery.

sequenceDiagram
    autonumber
    participant U as Admin
    participant API as cmd/api
    participant DB as Postgres
    participant CI as CI / agent

    U->>API: POST /service-accounts {project_id, name, scopes}
    API->>DB: INSERT service_accounts + audit_log
    API-->>U: {sa_id, signing_key} (one-time display)

    Note over CI: stores signing key in CI secret
    loop every TTL (auth.service_account_token_ttl_seconds = 900)
        CI->>API: POST /auth/sa/token {sa_id, signed_assertion}
        API->>API: verify signature + scopes
        API-->>CI: {access_token, exp: now+900s}
    end
    CI->>API: API request with Bearer token
    API->>API: validate JWT (cached JWKS)
    API->>API: authz: sa_id has scope?
    API-->>CI: 200 / 403

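Token TTL is policy-driven via auth.service_account_token_ttl_seconds (default 900). A short TTL plus signed-assertion exchange limits the blast radius of a leaked access token to one TTL window (15 minutes by default). A leaked signing key is a different case: it stays usable until the key is revoked.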
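
The exchange loop in the diagram can be sketched as follows. This is a simplified illustration under stated assumptions: HMAC-SHA256 from the standard library stands in for whatever signature scheme the platform actually uses, the returned token is a plain dict rather than a JWT, and `sign_assertion`, `exchange`, and `max_skew` are hypothetical names.

```python
import hashlib
import hmac
import json

TOKEN_TTL_SECONDS = 900  # default of auth.service_account_token_ttl_seconds


def sign_assertion(sa_id, signing_key, now):
    """CI side: sign a small payload with the service account key."""
    payload = json.dumps({"sa_id": sa_id, "iat": now}, sort_keys=True).encode()
    sig = hmac.new(signing_key, payload, hashlib.sha256).hexdigest()
    return {"payload": payload.decode(), "sig": sig}


def exchange(assertion, known_keys, now, max_skew=60):
    """API side: verify the assertion, then issue a short-lived token.

    Returns None when the service account is unknown, the signature
    does not match, or the assertion is too old.
    """
    payload = assertion["payload"].encode()
    claims = json.loads(payload)
    key = known_keys.get(claims["sa_id"])
    if key is None:
        return None
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, assertion["sig"]):
        return None
    if abs(now - claims["iat"]) > max_skew:
        return None
    return {"sa_id": claims["sa_id"], "exp": now + TOKEN_TTL_SECONDS}


keys = {"sa-ci": b"demo-key"}  # hypothetical key store
now = 1_700_000_000
token = exchange(sign_assertion("sa-ci", b"demo-key", now), keys, now)
forged = exchange(sign_assertion("sa-ci", b"wrong-key", now), keys, now)
```

The signing key never appears in the token request itself; only the signature over the payload does.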

→ Sources: Service_Account_Model.md, Tenant_Scoped_App_Machine_Identity_v1.md, Shared_Runtime_Operator_Authz_Model_v1.md

5. OIDC / JWKS

Implemented

sequenceDiagram
    autonumber
    participant U as Browser
    participant API as cmd/api
    participant KC as Keycloak
    participant JCACHE as JWKS cache<br/>(in cmd/api memory)

    Note over KC: JWKS URI (jwks_uri) advertised at<br/>/.well-known/openid-configuration
    JCACHE->>KC: refresh JWKS every 5 min
    KC-->>JCACHE: {keys: [...]}

    U->>KC: OIDC authorize (PKCE)
    KC-->>U: redirect with code
    U->>API: POST /auth/oidc/exchange {code}
    API->>KC: token exchange
    KC-->>API: id_token, access_token
    API->>API: validate JWT against cached JWKS
    Note over API: NO per-request Keycloak call
    API->>API: extract claims:<br/>sub, realm_access.roles, exp, iss, org_id
    API-->>U: {access_token, refresh_token, exp}

    loop subsequent requests
        U->>API: Authorization: Bearer
        API->>API: verify signature locally
        API->>API: enforce role + tenant scope
    end
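
The "no per-request Keycloak call" property depends on the in-memory JWKS cache. A minimal sketch of such a cache is shown below. Note the swap: the diagram shows a periodic refresh, while this example refreshes lazily on the first read after the TTL expires. `JWKSCache`, `fetch`, and `clock` are hypothetical names, and the fetch function is injected so the example needs no network access.

```python
import time


class JWKSCache:
    """Keep the last fetched key set and refetch once it is older than the TTL."""

    def __init__(self, fetch, ttl_seconds=300, clock=time.monotonic):
        self._fetch = fetch          # callable returning the JWKS document
        self._ttl = ttl_seconds      # 300 s matches the 5 min refresh above
        self._clock = clock
        self._keys = None
        self._fetched_at = None

    def keys(self):
        now = self._clock()
        if self._keys is None or now - self._fetched_at >= self._ttl:
            self._keys = self._fetch()
            self._fetched_at = now
        return self._keys


# Usage with a fake clock and a counting fetch function.
fetch_calls = []
fake_time = [0.0]


def fake_fetch():
    fetch_calls.append(fake_time[0])
    return {"keys": []}


cache = JWKSCache(fake_fetch, ttl_seconds=300, clock=lambda: fake_time[0])
cache.keys()
cache.keys()              # served from memory, no second fetch
calls_before_expiry = len(fetch_calls)
fake_time[0] = 300.0
cache.keys()              # TTL elapsed, refetch
calls_after_expiry = len(fetch_calls)
```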

Required JWT claims:

Claim	Used for
`sub`	`user_id` in all authz checks
`realm_access.roles`	RBAC: `user` or `admin`
`exp`	Token expiry
`iss`	Must match `KEYCLOAK_ISSUER_URL`
`org_id`	Tenant scoping (custom claim, nullable)
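
The claim checks in the table can be sketched as a single function that runs after signature verification. It is an illustration only: `validate_claims` and its return shape are hypothetical, signature verification is assumed to have happened already, and the issuer URL in the usage is a placeholder.

```python
import time

REQUIRED_CLAIMS = ("sub", "realm_access", "exp", "iss")
KNOWN_ROLES = {"user", "admin"}


def validate_claims(claims, expected_issuer, now=None):
    """Return (user_id, roles, org_id) or raise ValueError.

    Assumes the JWT signature was already verified against the JWKS.
    """
    now = time.time() if now is None else now
    for name in REQUIRED_CLAIMS:
        if name not in claims:
            raise ValueError(f"missing claim: {name}")
    if claims["iss"] != expected_issuer:
        raise ValueError("issuer mismatch")
    if claims["exp"] <= now:
        raise ValueError("token expired")
    roles = claims["realm_access"].get("roles", [])
    if not KNOWN_ROLES & set(roles):
        raise ValueError("no recognised role")
    # org_id is a nullable custom claim, so absence is not an error.
    return claims["sub"], roles, claims.get("org_id")


issuer = "https://idp.example/realms/demo"  # placeholder issuer URL
claims = {
    "sub": "user-1",
    "realm_access": {"roles": ["user"]},
    "exp": 2000,
    "iss": issuer,
    "org_id": None,
}
result = validate_claims(claims, issuer, now=1000)
```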

→ Source: IAM_Token_Issuer_v1.md. Runbook: JWKS Compromise Breakglass

6. Resource identifiers

Implemented

Canonical name across API responses, events, logs, audit:

core42:aicloud:{region}:{tenant_id}:{project_id}:{resource_type}:{resource_id}

flowchart LR
    R[core42:aicloud:us-buffalo-1:acme:default:allocation:<br/>3a8b2f...] --> P1[provider]
    R --> P2[service]
    R --> P3[region]
    R --> P4[tenant]
    R --> P5[project]
    R --> P6[type]
    R --> P7[id]
    classDef k fill:#e3f2fd,stroke:#1565c0
    class P1,P2,P3,P4,P5,P6,P7 k

This name is used everywhere a resource is referenced: API responses, NATS event payloads, audit target_id, and structured logs. Because every system emits the same canonical form, a single identifier can be traced across all of them.
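
Parsing and formatting the seven-part name could look like the sketch below. `ResourceName` and `parse_resource_name` are hypothetical helpers, the example assumes no segment contains a colon, and the resource id `alloc-123` is made up for the example.

```python
from typing import NamedTuple


class ResourceName(NamedTuple):
    provider: str
    service: str
    region: str
    tenant_id: str
    project_id: str
    resource_type: str
    resource_id: str

    def __str__(self):
        # Join the seven segments back into the canonical form.
        return ":".join(self)


def parse_resource_name(name):
    """Split a canonical name into its segments, rejecting malformed input."""
    parts = name.split(":")
    if len(parts) != 7 or not all(parts):
        raise ValueError(f"expected 7 non-empty segments, got: {name!r}")
    return ResourceName(*parts)


name = "core42:aicloud:us-buffalo-1:acme:default:allocation:alloc-123"
parsed = parse_resource_name(name)
round_trip = str(parsed)
```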

→ Source: Resource_Identifier_Spec.md

7. Policy chain

Implemented

Scoped resolution runs from least to most specific: global → plan → tenant → project → user. The most specific scope that defines a key wins. Every policy update is auditable.

flowchart TB
    REQ[Service code:<br/>PolicyClient.GetInt key]
    REQ --> R1{user-scope<br/>for this user?}
    R1 -- yes --> V1[Return user value]
    R1 -- no --> R2{project-scope<br/>for this project?}
    R2 -- yes --> V2[Return project value]
    R2 -- no --> R3{tenant-scope?}
    R3 -- yes --> V3[Return tenant value]
    R3 -- no --> R4{plan-scope?}
    R4 -- yes --> V4[Return plan value]
    R4 -- no --> V5[Return global default]

    classDef ret fill:#d1e7dd,stroke:#0a3622
    class V1,V2,V3,V4,V5 ret

Every key carries min, max, or enum bounds; the admin API rejects out-of-bounds updates.
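
The lookup order in the flowchart and the bounds check can be sketched together. This is an illustration, not the real `PolicyClient`: the storage layout (a dict keyed by scope, scope id, and key), the function names, and the bounds of 60 to 3600 are assumptions made for the example.

```python
# Most specific scope first, matching the flowchart above.
SCOPE_ORDER = ("user", "project", "tenant", "plan", "global")


def resolve_policy(key, values, context):
    """Return the value from the most specific scope that defines the key."""
    for scope in SCOPE_ORDER:
        scope_id = "*" if scope == "global" else context.get(scope)
        if scope_id is None:
            continue
        value = values.get((scope, scope_id, key))
        if value is not None:
            return value
    raise KeyError(f"no value for policy key {key!r}")


def validate_update(key, value, bounds):
    """Reject an update that falls outside the key's min/max bounds."""
    low, high = bounds[key]
    if not low <= value <= high:
        raise ValueError(f"{key}={value} outside [{low}, {high}]")
    return value


KEY = "auth.service_account_token_ttl_seconds"
values = {("global", "*", KEY): 900, ("tenant", "acme", KEY): 600}
ctx = {"user": "u1", "project": "default", "tenant": "acme", "plan": "pro"}

tenant_value = resolve_policy(KEY, values, ctx)
global_value = resolve_policy(KEY, values, {**ctx, "tenant": "other"})
```

Adding a project-level entry for `default` would override the tenant value for that project only, without touching other projects in the same tenant.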

→ Full reference: Policy keys

8. Audit

Implemented

sequenceDiagram
    autonumber
    participant H as Handler
    participant S as Service
    participant DB as Postgres
    participant A as Auditor (admin UI)

    H->>S: privileged mutation
    S->>DB: BEGIN
    S->>DB: domain mutation
    S->>DB: INSERT audit_logs<br/>{actor_user_id, actor_role, action,<br/>target_type, target_id, result,<br/>correlation_id, metadata}
    S->>DB: INSERT outbox row
    S->>DB: COMMIT

    Note over DB: audit_logs is immutable.<br/>No UPDATE, no DELETE.<br/>Allowlisted metadata keys.

    A->>H: GET /admin/audit-logs<br/>?filter=actor=X
    H-->>A: paginated rows
    A->>H: GET /admin/audit-logs.csv
    H-->>A: CSV export

Required fields: actor_user_id, actor_role, action, target_type, target_id, result, correlation_id.
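
The write path in the diagram (domain mutation, audit row, and outbox row in one transaction) can be sketched with SQLite. The table definitions, the `stop_allocation` helper, and the action and topic names are hypothetical; the point is only that a failed audit insert rolls back the domain mutation as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE allocations (id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE audit_logs (
        actor_user_id TEXT NOT NULL, actor_role TEXT NOT NULL,
        action TEXT NOT NULL, target_type TEXT NOT NULL,
        target_id TEXT NOT NULL, result TEXT NOT NULL,
        correlation_id TEXT NOT NULL
    );
    CREATE TABLE outbox (topic TEXT NOT NULL, payload TEXT NOT NULL);
    INSERT INTO allocations VALUES ('alloc-1', 'running');
""")


def stop_allocation(conn, alloc_id, actor_user_id, actor_role, correlation_id):
    """Mutation, audit row, and outbox row commit together or not at all."""
    with conn:  # commits on success, rolls back on any exception
        conn.execute(
            "UPDATE allocations SET status = 'stopped' WHERE id = ?",
            (alloc_id,),
        )
        conn.execute(
            "INSERT INTO audit_logs VALUES "
            "(?, ?, 'allocation.stop', 'allocation', ?, 'success', ?)",
            (actor_user_id, actor_role, alloc_id, correlation_id),
        )
        conn.execute(
            "INSERT INTO outbox VALUES ('allocation.stopped', ?)", (alloc_id,)
        )


def current_status():
    return conn.execute("SELECT status FROM allocations").fetchone()[0]


# A missing required audit field (actor_role) aborts the whole transaction.
try:
    stop_allocation(conn, "alloc-1", "user-1", None, "corr-1")
    rolled_back = False
except sqlite3.IntegrityError:
    rolled_back = True
status_after_failure = current_status()

stop_allocation(conn, "alloc-1", "user-1", "admin", "corr-2")
status_after_success = current_status()
audit_count = conn.execute("SELECT COUNT(*) FROM audit_logs").fetchone()[0]
```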

The metadata jsonb column has an explicit allowlist; unknown keys are rejected at write time. Allowed keys:

reason, policy_key, old_value, new_value, status_from, status_to, error_code, request_scope, idempotency_key_hash, provider_ref, allocation_id, node_id.

Forbidden in audit_logs.metadata: raw tokens, raw credentials, SSH private/public key material, full request/response payload dumps, direct payment instrument data, end-user PII beyond stable IDs.
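
Write-time rejection of unknown metadata keys could be implemented along these lines. The function name `build_audit_row` and the error messages are hypothetical; the key and field lists are copied from the text above.

```python
ALLOWED_METADATA_KEYS = frozenset({
    "reason", "policy_key", "old_value", "new_value", "status_from",
    "status_to", "error_code", "request_scope", "idempotency_key_hash",
    "provider_ref", "allocation_id", "node_id",
})

REQUIRED_FIELDS = (
    "actor_user_id", "actor_role", "action", "target_type",
    "target_id", "result", "correlation_id",
)


def build_audit_row(fields, metadata):
    """Validate required fields and metadata keys before the row is written."""
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    unknown = sorted(set(metadata) - ALLOWED_METADATA_KEYS)
    if unknown:
        raise ValueError(f"metadata keys not on allowlist: {unknown}")
    return {**fields, "metadata": dict(metadata)}


fields = {
    "actor_user_id": "user-1", "actor_role": "admin",
    "action": "policy.update", "target_type": "policy",
    "target_id": "policy-1", "result": "success",
    "correlation_id": "corr-1",
}
row = build_audit_row(fields, {"policy_key": "k", "old_value": 900, "new_value": 600})
```

A key allowlist does not inspect values, so it cannot stop a secret placed under an allowed key such as `reason`; the forbidden-content rule still has to be enforced by the caller or by review.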

→ Source: Audit_Presentation_Model_v1.md. CI gate: scripts/ci/audit_mandatory_guard.sh

9. Federation

Designed

Enterprise SSO + multi-IdP federation.

flowchart TB
    U[Work user] --> IDP_HINT{Account type?}
    IDP_HINT -- personal --> KC[Keycloak personal realm]
    IDP_HINT -- work email/tenant slug --> BROKER[Tenant federation broker]
    BROKER --> IDP1[Customer Okta]
    BROKER --> IDP2[Customer AzureAD]
    BROKER --> IDP3[Customer Google Workspace]
    IDP1 --> LINK[Brokered identity linking + dedup<br/>match by email + sub]
    IDP2 --> LINK
    IDP3 --> LINK
    LINK --> ME[users + tenant_memberships]

    classDef ext fill:#e8eaf6,stroke:#3949ab
    classDef plat fill:#e3f2fd,stroke:#1565c0
    class IDP1,IDP2,IDP3 ext
    class KC,BROKER,LINK,ME plat

Brokered identity linking is the hard part: when the same human shows up via two IdPs, dedup logic must merge them cleanly without orphaning resources.
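
Since this step is still at the design stage, the following is only a rough sketch of the "match by email + sub" idea from the diagram, not the planned implementation. It assumes the IdP-asserted email is already verified, and the in-memory user list, `link_identity`, and the generated user ids are all hypothetical.

```python
def link_identity(users, idp, sub, email):
    """Resolve an (idp, sub) login to one user record, linking by email.

    users is a list of dicts with keys: id, email, identities, where
    identities is a set of (idp, sub) pairs. Assumes email is verified.
    """
    # 1. An identity that was seen before maps straight to its user.
    for user in users:
        if (idp, sub) in user["identities"]:
            return user["id"]
    # 2. Same human via a second IdP: attach the new identity to the
    #    existing user so resources and memberships are not orphaned.
    normalized = email.strip().lower()
    for user in users:
        if user["email"] == normalized:
            user["identities"].add((idp, sub))
            return user["id"]
    # 3. Otherwise create a new user record.
    new_user = {
        "id": f"user-{len(users) + 1}",
        "email": normalized,
        "identities": {(idp, sub)},
    }
    users.append(new_user)
    return new_user["id"]


users = []
via_okta = link_identity(users, "okta", "okta-sub-1", "Dana@Example.com")
via_google = link_identity(users, "google", "g-sub-9", "dana@example.com")
repeat_okta = link_identity(users, "okta", "okta-sub-1", "dana@example.com")
```

Matching on email alone is only safe when the upstream IdP guarantees verified addresses, which is why the sketch states that assumption explicitly.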

→ Sources: Tenant_Federation_SSO_Model.md, Brokered_Identity_Linking_and_Dedup_v1.md. Runbook: Enterprise Federation Incident

10. Cloud hierarchy comparison

Concept	AWS	GCP	Azure	Nebius	GPUaaS
Ownership root	Org / Mgmt Account	Organization	Tenant + Mgmt Group	Tenant	Tenant (`organizations`)
Operational scope	Account	Project	Subscription / RG	Project	Project
Membership	IAM principals	IAM principals	Entra principals + RBAC	Tenant/project groups	`tenant_memberships` + `project_memberships`
Policy model	IAM + SCP	IAM + org policy	Azure RBAC + policy initiatives	Group + role permits	`global → plan → tenant → project → user` chain
Billing anchor	Account / consolidated payer	Billing account	Subscription	Tenant + quotas	Tenant (Stripe customer)
User is owner-of-record?	No	No	No	No	No

flowchart LR
    classDef us fill:#d1e7dd,stroke:#0a3622
    GP[GPUaaS<br/>tenant → project → resource]:::us
    NB[Nebius] ---|closest semantic| GP
    GCP[GCP] ---|two-level → three-level| GP
    AWS[AWS] ---|account = our project| GP
    AZ[Azure] ---|subscription = our project| GP

→ Source: Cloud_Hierarchy_Comparison.md. Also in: Product comparisons → External clouds.

Recap

sequenceDiagram
    autonumber
    participant U as User
    participant KC as Keycloak / IdP
    participant API as cmd/api
    participant AZ as authz layer
    participant DB as Postgres
    participant AUD as audit_logs

    U->>KC: OIDC login (PKCE)
    KC-->>U: id_token + access_token
    U->>API: POST /allocations (Bearer)
    API->>API: verify JWT (cached JWKS)
    API->>AZ: resolve scope (tenant + project from claims)
    AZ->>DB: SELECT memberships + policy chain
    AZ-->>API: allow + effective role
    API->>DB: mutation in tx
    DB->>AUD: audit row in same tx
    API-->>U: 201 with canonical resource id