Threat model

Implemented

Source: doc/governance/Security_Threat_Model.md · doc/architecture/Threat_Model.md · doc/governance/Abuse_Case_Catalog.md

A STRIDE-aligned baseline, plus the platform-specific threat surfaces that follow from GPUaaS's architecture (tenant VMs, ledger, mTLS fleet, payment webhook).

STRIDE at a glance

mindmap
  root((STRIDE))
    Spoofing
      Token theft / session replay
      Bootstrap-token reuse
      Webhook forgery
      Bound mTLS identity
    Tampering
      Ledger record modification
      Audit log mutation
      Outbox row tampering
      Cross-tenant slot metadata edit
    Repudiation
      Admin denial of action
      Audit immutability
      Signed task params
    Information Disclosure
      Token / SSH key in logs
      Cross-tenant data leak via reused storage
      Stale DNS / mTLS cert reveal
      Metrics leakage across slices
    Denial of Service
      API abuse
      WS connection flood
      Slot exhaustion by single tenant
      DLQ poison-pill spam
    Elevation of Privilege
      Bypass tenant/role checks
      MIG-mode escape (deferred)
      Outbox-publish from handler
      ssh_authorized_keys overwrite

Threat × control × verification matrix

Threat	Layer	Primary control	Verified by
Token theft / replay	Edge	Short-lived JWT (TTL under 5 min); JWKS cache refreshed every 5 min; refresh-token rotation	AT-001, AT-002
Webhook signature forgery	API	Stripe signature on raw body before parse	AT-053
Webhook replay	Worker	Dedupe at `payment_webhook_events.event_id` PK	AT-052
Privilege escalation	API	`realm_access.roles` check on admin routes + scope-aware authz	AT-003
Multi-tenant data leak	DB	`org_id` scoping in queries; per-service DB credentials	Authz SLO tests, integration
Secret in logs	Code	`middleware.Sanitize` blocklist + CI gate	`observability_trace_gate.sh`
Audit tampering	DB	`audit_logs` immutable; allowlisted metadata jsonb	AT-083, integration
Storage namespace breakout	Code	Path-safety enforcement in `storagepath`	AT-061
Rate-limit bypass	Edge	Redis token bucket; standard `Retry-After` headers	AT-070, AT-071, AT-072
WS auth in URL	Code	`Sec-WebSocket-Protocol` only; no `?token=`	Code review + grep gate
Provisioning task spoof	Fleet	mTLS + signed task params; identity-bound claim (post-RCA)	RCA 2026-03 fix
Allocation FSM bypass	Code	Workflow-controlled transitions only (Temporal)	`Coding_Standards.md §Security`
MIG escape (when added)	Hardware	Reserved as future capacity_shape, gated	DESIGNED, not exposed
Slot metadata edit	DB	Admin-only routes; audit on every change	`audit_mandatory_guard.sh`

Spoofing in depth

sequenceDiagram
    autonumber
    participant ATT as Attacker
    participant LB as Edge / WAF
    participant API as cmd/api
    participant KC as Keycloak
    participant J as JWKS cache

    Note over ATT: stolen access_token, expires in <5 min
    ATT->>LB: Bearer <stolen>
    LB->>API: forward
    API->>J: validate sig (cached JWKS)
    J-->>API: ok (signed by Keycloak)
    API->>API: check exp claim
    alt expired
        API-->>ATT: 401 token_expired
    else still valid
        Note over ATT,API: attack window under 5 min<br/>refresh requires refresh_token,<br/>which is single-use and rotated
        API->>API: enforce rate limit per user
        API-->>ATT: 403 if limit exceeded, else action proceeds
        API-->>API: audit_logs row with actor=stolen user, correlation id
    end

Defenses stacked:

Short JWT TTL: the exposure window is minutes, not hours.
JWKS rotation: Keycloak rolls keys; the cached key set lives for 5 min only.
Refresh-token rotation: each refresh exchange invalidates the prior refresh token, so a replayed token is detected.
Per-user rate limit: slows even a valid-token attack.
Per-call audit trail: exfiltration is at least traceable.
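Refresh-token rotation with replay detection can be illustrated with a small in-memory model. This is a sketch of the general technique only; in the platform the identity provider (Keycloak) owns this logic, and the `rotator` type, its methods, and the "revoke the whole session on replay" policy are assumptions made for the example.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
	"sync"
)

var (
	errReplay  = errors.New("refresh token replayed: session revoked")
	errInvalid = errors.New("refresh token unknown or session revoked")
)

type tokenState struct {
	session string
	used    bool
}

// rotator is a hypothetical in-memory stand-in for the IdP's token store.
type rotator struct {
	mu      sync.Mutex
	tokens  map[string]*tokenState
	revoked map[string]bool
}

func newRotator() *rotator {
	return &rotator{tokens: map[string]*tokenState{}, revoked: map[string]bool{}}
}

// mint creates a fresh random token; callers must hold r.mu.
func (r *rotator) mint(session string) string {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		panic(err)
	}
	tok := hex.EncodeToString(buf)
	r.tokens[tok] = &tokenState{session: session}
	return tok
}

func (r *rotator) issue(session string) string {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.mint(session)
}

// exchange consumes a refresh token and returns its successor. Presenting an
// already-used token is treated as theft: the whole session is revoked.
func (r *rotator) exchange(tok string) (string, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	st, ok := r.tokens[tok]
	if !ok || r.revoked[st.session] {
		return "", errInvalid
	}
	if st.used {
		r.revoked[st.session] = true
		return "", errReplay
	}
	st.used = true
	return r.mint(st.session), nil
}

func main() {
	r := newRotator()
	first := r.issue("session-1")
	second, _ := r.exchange(first) // legitimate rotation
	_, err := r.exchange(first)    // attacker replays the old token
	fmt.Println(err)
	_, err = r.exchange(second) // successor is now dead too
	fmt.Println(err)
}
```

Revoking the session on replay means the attacker and the legitimate user are both forced back through full authentication, which is the point: the server cannot tell which of the two presented the token first.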

Tampering in depth: ledger and audit

flowchart LR
    classDef immut fill:#ffebee,stroke:#c62828
    classDef ok fill:#d1e7dd,stroke:#0a3622

    UPDATE[Attempt to UPDATE ledger_entries]:::immut --> DB1{DB role grants?}
    DB1 -- no UPDATE/DELETE grants --> DENIED1[DB-level deny]:::ok
    DELETE[Attempt to DELETE audit_logs]:::immut --> DB2{DB role grants?}
    DB2 -- no UPDATE/DELETE grants --> DENIED2[DB-level deny]:::ok

    APP[Application bug<br/>tries to UPDATE] --> REV{Code review<br/>+ CI gate}
    REV -- caught --> BLOCK[PR blocked]:::ok
    REV -- missed --> DB1

Defense in depth: even if an application bug attempts to UPDATE or DELETE ledger or audit rows, the per-service Postgres roles lack the grants, so the database itself rejects the statement.

Information disclosure across slice tenants

flowchart TB
    subgraph Host[Slice host]
        direction TB
        GPU[GPU vfio-pci]
        NVME[NVMe namespace]
        VF[Mellanox SR-IOV VF]
    end
    subgraph A[Tenant A slice]
        VMA[VM with passthrough]
        DA[("Tenant A data<br/>on dedicated NVMe")]
    end
    subgraph B[Tenant B slice on same host]
        VMB[VM with passthrough]
        DB2[("Tenant B data<br/>on different NVMe")]
    end

    GPU -.dedicated to A.-> VMA
    NVME -.dedicated NVMe per slot.-> DA
    VF -.per-slot SR-IOV VF.-> VMA

    note1[Cross-tenant leak vectors blocked:<br/>1. NVMe is dedicated per slot, not shared<br/>2. Each slot uses a separate IB VF<br/>3. Slot release wipes per destructive_wipe_policy<br/>4. cleanup_blocked slot cannot be reused]

Leak vector	Mitigation
Shared storage between slices	Slot requires `capacity_metadata.storage_ownership = 'slice'`; dedicated NVMe per slot
Shared fabric between slices	Slot requires `fabric_claim_mode = 'per_slot_vf'` with unique `fabric_vf_pci_address`
Data left on NVMe at release	Slot requires a non-empty `destructive_wipe_policy`; release runs the wipe, and the wipe is verified before the slot returns to `available`
Wipe failure	Slot pinned in `cleanup_blocked`; cannot be re-allocated
Metrics across tenants	Telemetry uses per-allocation token; host Netdata stays operator-only
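The metadata invariants in the table above amount to a gate a slot must pass before it can be scheduled. A minimal sketch, assuming a flat `slot` struct and a `vfOwners` map (PCI address to owning slot ID); both are hypothetical shapes, and only the field values `'slice'`, `'per_slot_vf'`, and `cleanup_blocked` come from this page. The PCI address and wipe-policy value in `main` are made-up examples.

```go
package main

import (
	"errors"
	"fmt"
)

// slot mirrors the metadata fields named in the table; the struct is illustrative.
type slot struct {
	ID                    string
	State                 string
	StorageOwnership      string // capacity_metadata.storage_ownership
	FabricClaimMode       string // fabric_claim_mode
	FabricVFPCIAddress    string // fabric_vf_pci_address
	DestructiveWipePolicy string // destructive_wipe_policy
}

// checkSchedulable returns nil only if every isolation invariant holds.
// vfOwners maps a VF PCI address to the slot ID that already claims it.
func checkSchedulable(s slot, vfOwners map[string]string) error {
	if s.State == "cleanup_blocked" {
		return errors.New("slot is cleanup_blocked: wipe failed or unverified")
	}
	if s.StorageOwnership != "slice" {
		return errors.New("storage is not dedicated to the slice")
	}
	if s.FabricClaimMode != "per_slot_vf" || s.FabricVFPCIAddress == "" {
		return errors.New("fabric VF is not claimed per slot")
	}
	if owner, taken := vfOwners[s.FabricVFPCIAddress]; taken && owner != s.ID {
		return fmt.Errorf("fabric VF %s already claimed by slot %s", s.FabricVFPCIAddress, owner)
	}
	if s.DestructiveWipePolicy == "" {
		return errors.New("destructive_wipe_policy is empty")
	}
	return nil
}

func main() {
	s := slot{
		ID:                    "slot-1",
		State:                 "available",
		StorageOwnership:      "slice",
		FabricClaimMode:       "per_slot_vf",
		FabricVFPCIAddress:    "0000:3b:00.2", // example address
		DestructiveWipePolicy: "example_policy",
	}
	fmt.Println(checkSchedulable(s, map[string]string{"0000:3b:00.2": "slot-2"}))
}
```

Failing closed on any single missing field is the design choice worth keeping: a slot with incomplete metadata is treated as unsafe rather than as "probably fine".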

Denial of service

flowchart TB
    subgraph Public[Public edge]
        WAF[WAF]
    end
    subgraph App[Application]
        RL[Redis rate limiter]
        QUOTA[Allocation concurrency limit]
        SLOT[Slot scheduler]
    end
    subgraph WK[Workers]
        DLQ[(DLQ NATS stream)]
    end

    ATT[Attacker / runaway] --> WAF
    WAF -->|connection rate-limit| RL
    RL -->|per-user RPM cap| API[cmd/api]
    API -->|max concurrent allocs per user| QUOTA
    API -->|same node FOR UPDATE SKIP LOCKED| SLOT
    NATS[NATS] -->|poison messages| DLQ
    DLQ -.alert.-> SRE[SRE on-call]

    classDef defence fill:#d1e7dd,stroke:#0a3622
    class WAF,RL,QUOTA,SLOT,DLQ defence

Per-endpoint policy limits (all read from the `policy_values` table, not hard-coded constants):

Key	Default	Effect
`rate_limit.api_requests_per_minute`	120	Default per-user RPM
`rate_limit.terminal_token_requests_per_minute`	10	Terminal token mint
`rate_limit.financial_requests_per_minute`	30	Payment / refund / balance reads
`rate_limit.admin_overview_requests_per_minute`	600	Admin overview polling
`allocation.max_concurrent_per_user`	50	Slot exhaustion guard

Elevation of privilege: handler discipline

flowchart LR
    REQ[Incoming request] --> M1[Middleware:<br/>JWT verify]
    M1 --> M2[Middleware:<br/>scope resolve - tenant/project]
    M2 --> M3[Middleware:<br/>authz check]
    M3 --> H[Handler]
    H --> S[Service function]
    S -.NEVER.-> NATS[(NATS direct publish<br/>BLOCKED by review)]
    S --> DB[Postgres TX]
    DB --> OB[outbox row in same TX]
    DB --> AUD[audit row in same TX]
    OB --> OR[outbox-relay]
    OR --> NATS

    classDef block fill:#f8d7da,stroke:#42101e
    class NATS block

Three layered checks before a service function runs:

JWT valid + not expired (middleware.Auth).
Scope resolved (packages/shared/authz — most-specific wins, with audit on resolution).
Role/permission check inside the handler before service call.

After the service call, only the outbox-relay publishes to NATS. A handler that tries to call nats.Publish directly fails review.
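The "outbox row and audit row in the same TX" rule can be sketched with a fake transaction, so the example runs without a database. Everything here is illustrative: the `tx` interface, `fakeDB`, `releaseAllocation`, and the SQL text (including the `allocations` and `outbox` table shapes) are hypothetical; only the `audit_logs` table name and the pattern itself come from this page.

```go
package main

import (
	"errors"
	"fmt"
)

// tx is the minimal surface the service function needs. A thin adapter over
// *sql.Tx could satisfy it; that adapter is not shown here.
type tx interface {
	Exec(query string, args ...any) error
}

// fakeTx records statements and can inject a failure at the n-th statement.
type fakeTx struct {
	pending []string
	failAt  int // 1-based index of the statement to fail; 0 means never
}

func (t *fakeTx) Exec(query string, args ...any) error {
	if t.failAt == len(t.pending)+1 {
		return errors.New("injected failure")
	}
	t.pending = append(t.pending, query)
	return nil
}

type fakeDB struct {
	committed []string
	failAt    int
}

// inTx commits all statements or none, like a real transaction.
func (db *fakeDB) inTx(fn func(tx) error) error {
	t := &fakeTx{failAt: db.failAt}
	if err := fn(t); err != nil {
		return err // rollback: pending statements are discarded
	}
	db.committed = append(db.committed, t.pending...)
	return nil
}

// releaseAllocation writes the state change, the outbox event, and the audit
// row through the same transaction. It never talks to the broker itself.
func releaseAllocation(t tx, allocationID, actor string) error {
	if err := t.Exec(`UPDATE allocations SET state = 'released' WHERE id = $1`, allocationID); err != nil {
		return err
	}
	if err := t.Exec(`INSERT INTO outbox (topic, payload) VALUES ($1, $2)`, "allocation.released", allocationID); err != nil {
		return err
	}
	return t.Exec(`INSERT INTO audit_logs (actor, action, target) VALUES ($1, $2, $3)`, actor, "allocation.release", allocationID)
}

func main() {
	db := &fakeDB{}
	err := db.inTx(func(t tx) error { return releaseAllocation(t, "alloc-1", "user-1") })
	fmt.Println(err, len(db.committed))
}
```

Because the event is just another row, a rollback removes it along with the state change, so the relay can never publish an event for a write that did not happen. A direct publish from the service function would lose that guarantee.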

Slice-specific concerns

→ Detail page: GPU slice as-built §Security model. In summary:

VFIO — host kernel cannot accidentally drive a tenant GPU
Per-slot NVMe — no cross-tenant disk reuse
Per-slot SR-IOV VF — no cross-tenant fabric snoop
Operator-approved slot map — guards against accidental exposure of host root NVMe or unsuitable fabric VFs
Destructive wipe policy required — slot can't be schedulable without it
Per-allocation telemetry token — host gateway only forwards from the right allocation
Terminal gateway pattern — no raw VNC, no 0.0.0.0 exposure, single-use tokens

Trust boundaries (in one map)

flowchart LR
    INET([Public internet]):::pub --> WAF[WAF]:::edge
    WAF --> CP[Control plane<br/>mTLS internal]:::trusted
    CP -->|mTLS pull only| NA[node-agent fleet]:::fleet
    CP -->|outbound HTTPS<br/>+ webhook signature in| STRIPE[(Stripe)]:::ext
    CP -->|cached JWKS| KC[(Keycloak / IdP)]:::ext
    CP -->|per-service creds| PG[(Postgres)]:::data
    CP -->|signed URLs| S3[(Object storage)]:::data
    CP -->|API token| OBS[(OTel / Loki / Prom)]:::data

    classDef pub fill:#ffebee,stroke:#c62828
    classDef edge fill:#fff3e0,stroke:#e65100
    classDef trusted fill:#e8f5e9,stroke:#2e7d32
    classDef fleet fill:#fce4ec,stroke:#c2185b
    classDef ext fill:#e8eaf6,stroke:#3949ab
    classDef data fill:#eceff1,stroke:#455a64

Boundary	Direction	Controls
Internet → Edge	inbound	WAF, rate-limit, TLS termination
Edge → Control plane	inbound	mTLS, network policy default-deny
Control plane → Node fleet	inbound from node	mTLS pull (node initiates); signed task params; identity-bound result
Control plane → Stripe	outbound + verified inbound	Outbound HTTPS; webhook signature on raw body
Control plane → Keycloak	outbound (rare)	Cached JWKS refresh every 5 min
Control plane → Postgres	outbound	Per-service credentials; no shared pool
Control plane → Object storage	outbound	Signed URLs; tenant-scoped namespaces

Pen test scope

→ Read source: Pen_Test_Scope.md. Covers public auth, payment paths, storage breakout, rate-limit bypass, terminal token replay, admin escalation, node-agent contract.

Abuse cases

→ Read source: Abuse_Case_Catalog.md. Includes:

Credit-card chargeback after GPU usage
Resource exhaustion via rapid alloc create+release
Storage namespace traversal
WS DoS via terminal token spam
IB fabric VF reuse attempt (blocked by metadata invariants)
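The storage namespace traversal case is the kind of input a path-safety check has to reject. The sketch below shows one common way to do it with the standard `path` package; it is not the `storagepath` package's actual code, and the `tenants/<org>` layout and the function name `resolveTenantPath` are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"path"
	"strings"
)

// resolveTenantPath joins a caller-supplied key under the tenant's namespace
// and rejects anything that would resolve outside it. Hypothetical helper.
func resolveTenantPath(orgID, key string) (string, error) {
	if orgID == "" || orgID == "." || orgID == ".." || strings.ContainsAny(orgID, "/\\") {
		return "", errors.New("invalid org id")
	}
	if key == "" || strings.ContainsAny(key, "\x00\\") {
		return "", errors.New("invalid key")
	}
	root := "tenants/" + orgID
	// path.Join cleans the result, so "../" segments are resolved here.
	full := path.Join(root, key)
	// The trailing slash stops "tenants/a" from matching "tenants/ab/...".
	if !strings.HasPrefix(full, root+"/") {
		return "", errors.New("key escapes tenant namespace")
	}
	return full, nil
}

func main() {
	fmt.Println(resolveTenantPath("org-a", "reports/2024.csv"))
	fmt.Println(resolveTenantPath("org-a", "../org-b/secrets.txt"))
}
```

Checking the cleaned result rather than scanning the raw key for `..` avoids both false positives (a file legitimately named `a..b`) and encodings that slip past a substring match.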