Trail: IAM & Tenancy
Identity, authorization, scope, and audit — end-to-end, with diagrams per step.
Trail map
flowchart TB
classDef impl fill:#d1e7dd,stroke:#0a3622,color:#0a3622
classDef des fill:#fff3cd,stroke:#332701,color:#332701
classDef cmp fill:#f8d7da,stroke:#42101e,color:#42101e
I1[1. Tenant root]:::impl --> I2[2. Project scope]:::impl
I2 --> I3[3. Memberships & roles]:::impl
I3 --> I4[4. Service accounts]:::impl
I4 --> I5[5. OIDC / JWKS]:::impl
I5 --> I6[6. Resource identifiers]:::impl
I6 --> I7[7. Policy chain]:::impl
I7 --> I8[8. Audit]:::impl
I8 --> I9[9. Federation]:::des
I9 --> I10[10. Cloud hierarchy comparison]:::cmp
1. Tenant root
Implemented
The tenant (organizations table) is the ownership root. It survives user churn — when a user leaves, resources stay tenant-owned.
erDiagram
organizations ||--o{ projects : "owns"
organizations ||--o{ tenant_memberships : "has members"
organizations ||--o{ allocations : "owns (nullable for shared)"
organizations ||--o{ nodes : "owns (nullable for shared)"
users ||--o{ tenant_memberships : "is member of"
organizations {
uuid id PK
text name
text stripe_customer_id "billing anchor"
timestamp created_at
timestamp deleted_at "soft delete"
}
tenant_memberships {
uuid user_id PK
uuid org_id PK
text role "owner|admin|member|viewer"
timestamp deleted_at
}
users {
uuid id PK
text email
text external_subject "OIDC sub"
}
Why tenant-root, not user-root:
flowchart LR
U1[User leaves company] --> Q{Owner model}
Q -- "user-owned (anti-pattern)" --> X[Resources orphaned<br/>or transferred manually<br/>Audit chain broken]
Q -- "tenant-owned (GPUaaS)" --> OK[Resources stay tenant-owned<br/>User membership revoked<br/>Audit chain preserved]
classDef bad fill:#f8d7da,stroke:#42101e
classDef good fill:#d1e7dd,stroke:#0a3622
class X bad
class OK good
→ Sources: Tenant_Project_Ownership_Baseline.md, Brokered_Identity_Linking_and_Dedup_v1.md
2. Project scope
Implemented
A project is the operational scope inside a tenant. Allocations, app instances, and storage namespaces all belong to a project. A default project is auto-created at signup, so single-user accounts never have to set up the hierarchy themselves.
flowchart TB
subgraph T[Tenant: acme-corp]
direction TB
P1[Project: default<br/>auto-created at signup]
P2[Project: ml-research]
P3[Project: prod-inference]
end
P1 --> A1[Allocations]
P1 --> S1[Storage]
P1 --> APP1[App instances]
P2 --> A2[Allocations]
P2 --> S2[Storage]
P3 --> A3[Allocations]
P3 --> S3[Storage]
P3 --> APP3[App instances]
classDef tenant fill:#fff8e1,stroke:#f57f17
classDef proj fill:#e3f2fd,stroke:#1565c0
class T tenant
class P1,P2,P3 proj
Resources never cross project boundaries without an explicit cross-project authorization. SSH keys, API keys, and policies are project-scoped too; see Allocation_Project_SSH_Access_v1.md.
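The boundary rule above can be sketched as a small check. This is a minimal illustration, not the platform's authorization code: the `Resource` type, the `can_access` function, and the shape of the grant set (pairs of caller project and target project) are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    resource_id: str
    tenant_id: str
    project_id: str


def can_access(
    caller_project_id: str,
    resource: Resource,
    cross_project_grants: set[tuple[str, str]],
) -> bool:
    """Allow same-project access; otherwise require an explicit grant.

    cross_project_grants holds (caller_project_id, target_project_id) pairs.
    This grant shape is an assumption made for the example.
    """
    if caller_project_id == resource.project_id:
        return True
    return (caller_project_id, resource.project_id) in cross_project_grants
```

With an empty grant set, a caller in `default` cannot touch a resource in `ml-research`; adding the pair `("default", "ml-research")` is what makes the access explicit.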
3. Memberships & roles
Implemented
flowchart LR
U[User] -->|tenant_memberships| TM{Tenant role}
U -->|project_memberships| PM{Project role}
TM -->|owner| TPRIV[Full tenant<br/>delete tenant, billing]
TM -->|admin| TADM[Manage projects, users]
TM -->|member| TMBR[Default member]
TM -->|viewer| TVIEW[Read only]
PM -->|owner| PPRIV[Full project]
PM -->|admin| PADM[Manage allocations<br/>members, policies]
PM -->|member| PMBR[Use resources]
PM -->|viewer| PVIEW[Read only]
classDef hi fill:#fff3e0,stroke:#e65100
class TPRIV,PPRIV hi
MVP constraint (enforced by DB unique index):
CREATE UNIQUE INDEX uq_tenant_memberships_active_user
ON tenant_memberships(user_id) WHERE deleted_at IS NULL;
This pins the MVP to one active tenant membership per user. Multi-tenant users are designed (see step 9, Federation) but remain gated until customer demand justifies the cost of the IAM redesign.
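To see how the partial unique index behaves, the sketch below recreates it in an in-memory SQLite database, which also supports partial indexes. The production database is Postgres; the table here is a trimmed stand-in based on the ER diagram, and the `add_membership` helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tenant_memberships (
    user_id    TEXT NOT NULL,
    org_id     TEXT NOT NULL,
    role       TEXT NOT NULL,
    deleted_at TEXT,
    PRIMARY KEY (user_id, org_id)
);
CREATE UNIQUE INDEX uq_tenant_memberships_active_user
    ON tenant_memberships(user_id) WHERE deleted_at IS NULL;
""")


def add_membership(user_id, org_id, role="member"):
    conn.execute(
        "INSERT INTO tenant_memberships (user_id, org_id, role) VALUES (?, ?, ?)",
        (user_id, org_id, role),
    )


add_membership("u1", "acme")

# A second *active* membership for the same user violates the index.
try:
    add_membership("u1", "globex")
    second_active_allowed = True
except sqlite3.IntegrityError:
    second_active_allowed = False

# Soft-deleting the first membership takes it out of the index,
# so the user can then join another tenant.
conn.execute(
    "UPDATE tenant_memberships SET deleted_at = '2024-01-01' "
    "WHERE user_id = 'u1' AND org_id = 'acme'"
)
add_membership("u1", "globex")
active = conn.execute(
    "SELECT org_id FROM tenant_memberships "
    "WHERE user_id = 'u1' AND deleted_at IS NULL"
).fetchall()
```

Because the index only covers rows where `deleted_at IS NULL`, soft-deleted memberships stay in the table as history without blocking a new active one.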
→ Sources: Platform_IAM_Model_v1.md, Role_and_Policy_Lifecycle_Model.md, User_Onboarding_Model.md
4. Service accounts
Implemented
Project-scoped non-human identities for CI, agents, app machinery.
sequenceDiagram
autonumber
participant U as Admin
participant API as cmd/api
participant DB as Postgres
participant CI as CI / agent
U->>API: POST /service-accounts {project_id, name, scopes}
API->>DB: INSERT service_accounts + audit_log
API-->>U: {sa_id, signing_key} (one-time display)
Note over CI: stores signing key in CI secret
loop every TTL (auth.service_account_token_ttl_seconds = 900)
CI->>API: POST /auth/sa/token {sa_id, signed_assertion}
API->>API: verify signature + scopes
API-->>CI: {access_token, exp: now+900s}
end
CI->>API: API request with Bearer token
API->>API: validate JWT (cached JWKS)
API->>API: authz: sa_id has scope?
API-->>CI: 200 / 403
Token TTL is policy-driven via auth.service_account_token_ttl_seconds (default 900). Short TTL + signed-assertion exchange limits blast radius if the signing key leaks.
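The exchange loop can be sketched with the standard library. The source does not say which signature scheme the signing key uses, so HMAC-SHA256 is a stand-in here; the assertion format, the 60-second freshness window, and the function names are likewise assumptions. A real endpoint would return a signed JWT rather than a plain dict.

```python
from __future__ import annotations

import hashlib
import hmac

TOKEN_TTL_SECONDS = 900          # default of auth.service_account_token_ttl_seconds
MAX_ASSERTION_AGE_SECONDS = 60   # hypothetical freshness window


def sign_assertion(sa_id: str, signing_key: bytes, issued_at: int) -> str:
    """Client side: sign '<sa_id>.<issued_at>' with the stored signing key."""
    payload = f"{sa_id}.{issued_at}"
    sig = hmac.new(signing_key, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{sig}"


def exchange(assertion: str, keys: dict[str, bytes], now: int) -> dict | None:
    """Server side: verify the assertion and return short-lived token claims."""
    try:
        sa_id, issued_at_raw, sig = assertion.rsplit(".", 2)
        issued_at = int(issued_at_raw)
    except ValueError:
        return None  # malformed assertion
    key = keys.get(sa_id)
    if key is None:
        return None  # unknown service account
    expected = hmac.new(
        key, f"{sa_id}.{issued_at}".encode(), hashlib.sha256
    ).hexdigest()
    if not hmac.compare_digest(expected, sig):
        return None  # bad signature
    if not 0 <= now - issued_at <= MAX_ASSERTION_AGE_SECONDS:
        return None  # stale or future-dated assertion
    return {"sa_id": sa_id, "exp": now + TOKEN_TTL_SECONDS}
```

The signing key never travels with API requests; only the 900-second token does, which is what bounds the damage from a leaked token.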
→ Sources: Service_Account_Model.md, Tenant_Scoped_App_Machine_Identity_v1.md, Shared_Runtime_Operator_Authz_Model_v1.md
5. OIDC / JWKS
Implemented
sequenceDiagram
autonumber
participant U as Browser
participant API as cmd/api
participant KC as Keycloak
participant JCACHE as JWKS cache<br/>(in cmd/api memory)
Note over KC: JWKS URI advertised via<br/>/.well-known/openid-configuration
JCACHE->>KC: refresh JWKS every 5 min
KC-->>JCACHE: {keys: [...]}
U->>KC: OIDC authorize (PKCE)
KC-->>U: redirect with code
U->>API: POST /auth/oidc/exchange {code}
API->>KC: token exchange
KC-->>API: id_token, access_token
API->>API: validate JWT against cached JWKS
Note over API: NO per-request Keycloak call
API->>API: extract claims:<br/>sub, realm_access.roles, exp, iss, org_id
API-->>U: {access_token, refresh_token, exp}
loop subsequent requests
U->>API: Authorization: Bearer
API->>API: verify signature locally
API->>API: enforce role + tenant scope
end
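The cached-JWKS behaviour in the diagram can be approximated as follows. This sketch refreshes lazily on read once the 5-minute window has passed, rather than on a background timer as the diagram suggests; the `JWKSCache` class, its `fetch` callable, and the injectable clock are hypothetical.

```python
import time


class JWKSCache:
    """In-memory JWKS cache that refetches after refresh_seconds."""

    def __init__(self, fetch, refresh_seconds=300, clock=time.monotonic):
        self._fetch = fetch                    # callable returning {"keys": [...]}
        self._refresh_seconds = refresh_seconds
        self._clock = clock
        self._keys = {}
        self._fetched_at = None

    def get_key(self, kid):
        """Return the JWK for kid, or None if the key set does not contain it."""
        now = self._clock()
        stale = (
            self._fetched_at is None
            or now - self._fetched_at >= self._refresh_seconds
        )
        if stale:
            self._refresh(now)
        return self._keys.get(kid)

    def _refresh(self, now):
        jwks = self._fetch()
        self._keys = {key["kid"]: key for key in jwks.get("keys", [])}
        self._fetched_at = now
```

Token validation then looks up the `kid` from the JWT header in the cache, so request handling stays free of calls to the identity provider between refreshes.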
Required JWT claims:
| Claim | Used for |
|---|---|
| sub | user_id in all authz checks |
| realm_access.roles | RBAC: user or admin |
| exp | Token expiry |
| iss | Must match KEYCLOAK_ISSUER_URL |
| org_id | Tenant scoping (custom claim, nullable) |
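The claim checks in the table could look roughly like this once the signature has already been verified. The function name, the returned dict shape, and the error messages are assumptions; `expected_issuer` stands for the value of KEYCLOAK_ISSUER_URL.

```python
def validate_claims(claims: dict, expected_issuer: str, now: int) -> dict:
    """Check the required claims of an already signature-verified JWT.

    Returns a small principal dict; raises ValueError on any failed check.
    """
    for name in ("sub", "exp", "iss"):
        if name not in claims:
            raise ValueError(f"missing claim: {name}")
    if claims["iss"] != expected_issuer:
        raise ValueError("issuer mismatch")
    if claims["exp"] <= now:
        raise ValueError("token expired")
    # realm_access.roles carries the RBAC role: user or admin.
    roles = (claims.get("realm_access") or {}).get("roles") or []
    if not {"user", "admin"} & set(roles):
        raise ValueError("no recognized role")
    return {
        "user_id": claims["sub"],
        "is_admin": "admin" in roles,
        "org_id": claims.get("org_id"),  # nullable custom claim
    }
```

A missing or null `org_id` is passed through as `None` rather than rejected, matching the nullable note in the table.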
→ Source: IAM_Token_Issuer_v1.md. Runbook: JWKS Compromise Breakglass
6. Resource identifiers
Implemented
Canonical name across API responses, events, logs, audit:
flowchart LR
R[core42:aicloud:us-buffalo-1:acme:default:allocation:<br/>3a8b2f...] --> P1[provider]
R --> P2[service]
R --> P3[region]
R --> P4[tenant]
R --> P5[project]
R --> P6[type]
R --> P7[id]
classDef k fill:#e3f2fd,stroke:#1565c0
class P1,P2,P3,P4,P5,P6,P7 k
The identifier is used everywhere a resource is named: API responses, NATS event payloads, audit target_id, and structured logs. Because there is a single canonical form, the same string can be followed across systems when tracing.
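A parser for the seven-field form shown in the diagram might look like this. It assumes that no field contains a colon and that every field is non-empty; Resource_Identifier_Spec.md may define stricter rules. `ResourceName` and `parse_resource_name` are hypothetical names.

```python
from typing import NamedTuple


class ResourceName(NamedTuple):
    provider: str
    service: str
    region: str
    tenant: str
    project: str
    type: str
    id: str

    def __str__(self) -> str:
        # Re-join the fields into the canonical colon-separated form.
        return ":".join(self)


def parse_resource_name(name: str) -> ResourceName:
    """Split a canonical resource name into its seven fields."""
    parts = name.split(":")
    if len(parts) != len(ResourceName._fields) or not all(parts):
        raise ValueError(
            f"expected 7 non-empty colon-separated fields: {name!r}"
        )
    return ResourceName(*parts)
```

Parsing and then calling `str()` on the result round-trips the original string, which is the property a canonical form needs.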
→ Source: Resource_Identifier_Spec.md
7. Policy chain
Implemented
Scoped resolution: global → plan → tenant → project → user. Most-specific wins. Every policy update is auditable.
flowchart TB
REQ[Service code:<br/>PolicyClient.GetInt key]
REQ --> R1{user-scope<br/>for this user?}
R1 -- yes --> V1[Return user value]
R1 -- no --> R2{project-scope<br/>for this project?}
R2 -- yes --> V2[Return project value]
R2 -- no --> R3{tenant-scope?}
R3 -- yes --> V3[Return tenant value]
R3 -- no --> R4{plan-scope?}
R4 -- yes --> V4[Return plan value]
R4 -- no --> V5[Return global default]
classDef ret fill:#d1e7dd,stroke:#0a3622
class V1,V2,V3,V4,V5 ret
Every key carries min, max, or enum bounds; the admin API rejects out-of-bounds updates.
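The most-specific-wins lookup and the bounds check can be sketched as two small functions. The storage shape (a dict keyed by scope, scope id, and policy key), the function names, and the example bounds are hypothetical; only the scope order comes from the flowchart.

```python
# Most specific first; the global default is the final fallback.
SCOPE_ORDER = ("user", "project", "tenant", "plan")


def resolve_policy(key, context, overrides, global_defaults):
    """Return the value from the most specific scope that sets the key.

    context maps scope name -> id for the current request, for example
    {"user": "u1", "project": "ml-research", "tenant": "acme", "plan": "pro"}.
    overrides maps (scope, scope_id, key) -> value.
    """
    for scope in SCOPE_ORDER:
        scope_id = context.get(scope)
        if scope_id is not None and (scope, scope_id, key) in overrides:
            return overrides[(scope, scope_id, key)]
    return global_defaults[key]


def validate_update(value, *, minimum=None, maximum=None, enum=None):
    """Reject a policy update that falls outside the key's declared bounds."""
    if enum is not None and value not in enum:
        raise ValueError(f"{value!r} is not one of {sorted(enum)}")
    if minimum is not None and value < minimum:
        raise ValueError(f"{value!r} is below the minimum {minimum!r}")
    if maximum is not None and value > maximum:
        raise ValueError(f"{value!r} is above the maximum {maximum!r}")
    return value
```

Running the bounds check at write time means readers such as `PolicyClient.GetInt` can trust whatever value the chain returns.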
→ Full reference: Policy keys
8. Audit
Implemented
sequenceDiagram
autonumber
participant H as Handler
participant S as Service
participant DB as Postgres
participant A as Auditor (admin UI)
H->>S: privileged mutation
S->>DB: BEGIN
S->>DB: domain mutation
S->>DB: INSERT audit_logs<br/>{actor_user_id, actor_role, action,<br/>target_type, target_id, result,<br/>correlation_id, metadata}
S->>DB: INSERT outbox row
S->>DB: COMMIT
Note over DB: audit_logs is immutable.<br/>No UPDATE, no DELETE.<br/>Allowlisted metadata keys.
A->>DB: GET /admin/audit-logs<br/>?filter=actor=X
DB-->>A: paginated rows
A->>DB: GET /admin/audit-logs.csv
DB-->>A: CSV export
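The same-transaction pattern from the diagram can be tried out against an in-memory SQLite database. The production store is Postgres; the trimmed schema, the append-only triggers, the `release_allocation` function, and the action and topic names are assumptions for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE allocations (id TEXT PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE audit_logs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    actor_user_id TEXT NOT NULL, actor_role TEXT NOT NULL,
    action TEXT NOT NULL, target_type TEXT NOT NULL, target_id TEXT NOT NULL,
    result TEXT NOT NULL, correlation_id TEXT NOT NULL,
    metadata TEXT NOT NULL DEFAULT '{}'
);
CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    topic TEXT NOT NULL, payload TEXT NOT NULL
);
-- One way to enforce "no UPDATE, no DELETE" on audit_logs.
CREATE TRIGGER audit_logs_no_update BEFORE UPDATE ON audit_logs
BEGIN SELECT RAISE(ABORT, 'audit_logs is append-only'); END;
CREATE TRIGGER audit_logs_no_delete BEFORE DELETE ON audit_logs
BEGIN SELECT RAISE(ABORT, 'audit_logs is append-only'); END;
""")


def release_allocation(allocation_id, actor_user_id, actor_role, correlation_id):
    # "with conn" commits on success and rolls back if anything raises,
    # so the mutation, audit row, and outbox row succeed or fail together.
    with conn:
        cur = conn.execute(
            "UPDATE allocations SET status = 'released' WHERE id = ?",
            (allocation_id,),
        )
        if cur.rowcount != 1:
            raise LookupError(f"unknown allocation: {allocation_id}")
        conn.execute(
            "INSERT INTO audit_logs (actor_user_id, actor_role, action,"
            " target_type, target_id, result, correlation_id, metadata)"
            " VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            (actor_user_id, actor_role, "allocation.release", "allocation",
             allocation_id, "success", correlation_id,
             json.dumps({"status_from": "active", "status_to": "released"})),
        )
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("allocation.released", json.dumps({"allocation_id": allocation_id})),
        )
```

Writing the outbox row in the same transaction means an event can only be published for a mutation that was also committed and audited.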
Required fields: actor_user_id, actor_role, action, target_type, target_id, result, correlation_id.
The metadata jsonb column has an explicit allowlist; any key outside it is rejected at write time. Allowed keys:
reason, policy_key, old_value, new_value, status_from, status_to, error_code, request_scope, idempotency_key_hash, provider_ref, allocation_id, node_id.
Forbidden in audit_logs.metadata: raw tokens, raw credentials, SSH private/public key material, full request/response payload dumps, direct payment instrument data, end-user PII beyond stable IDs.
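A write-time allowlist check can be as small as the sketch below. The key set is copied from the list above; the function name and error message are hypothetical. A key check alone does not catch forbidden content placed under an allowed key, so value-level screening would still be needed.

```python
ALLOWED_METADATA_KEYS = frozenset({
    "reason", "policy_key", "old_value", "new_value",
    "status_from", "status_to", "error_code", "request_scope",
    "idempotency_key_hash", "provider_ref", "allocation_id", "node_id",
})


def validate_metadata(metadata: dict) -> dict:
    """Return a copy of metadata, or raise if any key is not allowlisted."""
    unknown = sorted(set(metadata) - ALLOWED_METADATA_KEYS)
    if unknown:
        raise ValueError(f"metadata keys not allowed: {unknown}")
    return dict(metadata)
```

Rejecting unknown keys outright, rather than silently dropping them, makes an attempt to log something like a raw token fail loudly at the call site.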
→ Source: Audit_Presentation_Model_v1.md. CI gate: scripts/ci/audit_mandatory_guard.sh
9. Federation
Designed
Enterprise SSO + multi-IdP federation.
flowchart TB
U[Work user] --> IDP_HINT{Account type?}
IDP_HINT -- personal --> KC[Keycloak personal realm]
IDP_HINT -- work email/tenant slug --> BROKER[Tenant federation broker]
BROKER --> IDP1[Customer Okta]
BROKER --> IDP2[Customer AzureAD]
BROKER --> IDP3[Customer Google Workspace]
IDP1 --> LINK[Brokered identity linking + dedup<br/>match by email + sub]
IDP2 --> LINK
IDP3 --> LINK
LINK --> ME[users + tenant_memberships]
classDef ext fill:#e8eaf6,stroke:#3949ab
classDef plat fill:#e3f2fd,stroke:#1565c0
class IDP1,IDP2,IDP3 ext
class KC,BROKER,LINK,ME plat
Brokered identity linking is the hard part: when the same human shows up via two IdPs, dedup logic must merge them cleanly without orphaning resources.
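Since federation is still at the design stage, the following is only one possible reading of "match by email + sub", not the documented algorithm. It matches on the (issuer, sub) pair first and falls back to a verified email; the `User` type, the `linked_subjects` field, the `email_verified` flag, and the generated ids are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    id: str
    email: str
    linked_subjects: set = field(default_factory=set)  # (issuer, sub) pairs


def link_identity(users, issuer, sub, email, email_verified):
    """Return the user for a brokered login, linking or creating as needed."""
    identity = (issuer, sub)
    # 1. An identity that is already linked always maps to the same user.
    for user in users:
        if identity in user.linked_subjects:
            return user
    # 2. Otherwise attach to an existing user only on a verified email match.
    if email_verified:
        for user in users:
            if user.email.lower() == email.lower():
                user.linked_subjects.add(identity)
                return user
    # 3. No match: create a new user instead of guessing.
    user = User(
        id=f"user-{len(users) + 1}", email=email, linked_subjects={identity}
    )
    users.append(user)
    return user
```

Linking a new identity onto the existing user record, instead of creating a second record and merging later, is what keeps resources and the audit chain attached to a single user id in this sketch.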
→ Sources: Tenant_Federation_SSO_Model.md, Brokered_Identity_Linking_and_Dedup_v1.md. Runbook: Enterprise Federation Incident
10. Cloud hierarchy comparison
| Concept | AWS | GCP | Azure | Nebius | GPUaaS |
|---|---|---|---|---|---|
| Ownership root | Org / Mgmt Account | Organization | Tenant + Mgmt Group | Tenant | Tenant (organizations) |
| Operational scope | Account | Project | Subscription / RG | Project | Project |
| Membership | IAM principals | IAM principals | Entra principals + RBAC | Tenant/project groups | tenant_memberships + project_memberships |
| Policy model | IAM + SCP | IAM + org policy | Azure RBAC + policy initiatives | Group + role permits | global → plan → tenant → project → user chain |
| Billing anchor | Account / consolidated payer | Billing account | Subscription | Tenant + quotas | Tenant (Stripe customer) |
| User is owner-of-record? | No | No | No | No | No |
flowchart LR
classDef us fill:#d1e7dd,stroke:#0a3622
GP[GPUaaS<br/>tenant → project → resource]:::us
NB[Nebius] ---|closest semantic| GP
GCP[GCP] ---|two-level → three-level| GP
AWS[AWS] ---|account = our project| GP
AZ[Azure] ---|subscription = our project| GP
→ Source: Cloud_Hierarchy_Comparison.md. Also in: Product comparisons → External clouds.
Recap
sequenceDiagram
autonumber
participant U as User
participant KC as Keycloak / IdP
participant API as cmd/api
participant AZ as authz layer
participant DB as Postgres
participant AUD as audit_logs
U->>KC: OIDC login (PKCE)
KC-->>U: id_token + access_token
U->>API: POST /allocations (Bearer)
API->>API: verify JWT (cached JWKS)
API->>AZ: resolve scope (tenant + project from claims)
AZ->>DB: SELECT memberships + policy chain
AZ-->>API: allow + effective role
API->>DB: mutation in tx
DB->>AUD: audit row in same tx
API-->>U: 201 with canonical resource id