Threat model

Implemented

Source: doc/governance/Security_Threat_Model.md · doc/architecture/Threat_Model.md · doc/governance/Abuse_Case_Catalog.md

STRIDE-aligned baseline plus the platform-specific threat surfaces that fall out of GPUaaS's actual architecture (tenant VMs, ledger, mTLS fleet, payment webhook).

STRIDE at a glance

mindmap
  root((STRIDE))
    Spoofing
      Token theft / session replay
      Bootstrap-token reuse
      Webhook forgery
      Bound mTLS identity
    Tampering
      Ledger record modification
      Audit log mutation
      Outbox row tampering
      Cross-tenant slot metadata edit
    Repudiation
      Admin denial of action
      Audit immutability
      Signed task params
    Information Disclosure
      Token / SSH key in logs
      Cross-tenant data leak via reused storage
      Stale DNS / mTLS cert reveal
      Metrics leakage across slices
    Denial of Service
      API abuse
      WS connection flood
      Slot exhaustion by single tenant
      DLQ poison-pill spam
    Elevation of Privilege
      Bypass tenant/role checks
      MIG-mode escape (deferred)
      Outbox-publish from handler
      ssh_authorized_keys overwrite

Threat × control × verification matrix

| Threat | Layer | Primary control | Verified by |
| --- | --- | --- | --- |
| Token theft / replay | Edge | Short-lived JWT (5-min JWKS refresh), refresh token rotation | AT-001, AT-002 |
| Webhook signature forgery | API | Stripe signature on raw body before parse | AT-053 |
| Webhook replay | Worker | Dedupe at payment_webhook_events.event_id PK | AT-052 |
| Privilege escalation | API | realm_access.roles check on admin routes + scope-aware authz | AT-003 |
| Multi-tenant data leak | DB | org_id scoping in queries; per-service DB credentials | Authz SLO tests, integration |
| Secret in logs | Code | middleware.Sanitize blocklist + CI gate | observability_trace_gate.sh |
| Audit tampering | DB | audit_logs immutable; allowlisted metadata jsonb | AT-083, integration |
| Storage namespace breakout | Code | Path-safety enforcement in storagepath | AT-061 |
| Rate-limit bypass | Edge | Redis token bucket; standard Retry-After headers | AT-070, AT-071, AT-072 |
| WS auth in URL | Code | Sec-WebSocket-Protocol only; no ?token= | Code review + grep gate |
| Provisioning task spoof | Fleet | mTLS + signed task params; identity-bound claim (post-RCA) | RCA 2026-03 fix |
| Allocation FSM bypass | Code | Workflow-controlled transitions only (Temporal) | Coding_Standards.md §Security |
| MIG escape (when added) | Hardware | Reserved as future capacity_shape, gated | DESIGNED, not exposed |
| Slot metadata edit | DB | Admin-only routes; audit on every change | audit_mandatory_guard.sh |
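
The two webhook rows in the matrix (signature on the raw body before parse, then dedupe on the event_id primary key) can be sketched as follows. This is a minimal illustration, not the platform's code: it follows Stripe's documented `Stripe-Signature` scheme (HMAC-SHA256 over `timestamp.raw_body`), uses an in-memory SQLite table as a stand-in for `payment_webhook_events`, and collapses the API and worker layers into one function. The function names and the 300-second tolerance are assumptions.

```python
import hashlib
import hmac
import json
import sqlite3
import time

TOLERANCE_SECONDS = 300  # assumed tolerance for the signed timestamp


def verify_stripe_signature(raw_body, header, secret, now=None):
    """Return True if the header carries a valid v1 signature for raw_body."""
    timestamp, signatures = None, []
    for item in header.split(","):
        key, _, value = item.strip().partition("=")
        if key == "t":
            timestamp = value
        elif key == "v1":
            signatures.append(value)
    if timestamp is None or not timestamp.isdigit() or not signatures:
        return False
    now = time.time() if now is None else now
    if abs(now - int(timestamp)) > TOLERANCE_SECONDS:
        return False
    # The signature covers the exact bytes received, so verify before json.loads.
    signed = timestamp.encode() + b"." + raw_body
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    return any(
        hmac.compare_digest(expected.encode(), s.encode()) for s in signatures
    )


def handle_webhook(conn, raw_body, header, secret, now=None):
    """Hypothetical handler: verify, parse, then dedupe on the primary key."""
    if not verify_stripe_signature(raw_body, header, secret, now):
        return "rejected"
    event = json.loads(raw_body)  # parsed only after the signature checks out
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO payment_webhook_events (event_id) VALUES (?)",
                (event["id"],),
            )
    except sqlite3.IntegrityError:
        return "duplicate"  # same event_id already recorded: replay
    return "accepted"
```

Letting the primary key reject the second insert keeps the replay check atomic: two workers racing on the same event cannot both pass a "have I seen this?" read.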

Spoofing in depth

sequenceDiagram
    autonumber
    participant ATT as Attacker
    participant LB as Edge / WAF
    participant API as cmd/api
    participant KC as Keycloak
    participant J as JWKS cache

    Note over ATT: stolen access_token, expires in <5 min
    ATT->>LB: Bearer <stolen>
    LB->>API: forward
    API->>J: validate sig (cached JWKS)
    J-->>API: ok (signed by Keycloak)
    API->>API: check exp claim
    alt expired
        API-->>ATT: 401 token_expired
    else still valid
        Note over ATT,API: small attack window, under 5 min<br/>refresh requires refresh_token<br/>which is single-use and rotated
        API->>API: enforce rate limit per user
        API-->>ATT: 403 if exceeds; else action
        API-->>API: audit_logs row with actor=stolen user, correlation id
    end

Defenses stacked:

  1. Short JWT TTL — exposure window minutes, not hours.
  2. JWKS rotation — Keycloak rolls keys; cached for 5 min only.
  3. Refresh-token rotation — refresh exchange invalidates the prior refresh token; replay detection.
  4. Rate-limit per user — slows even a valid-token attack.
  5. Audit trail per call — exfiltration is at least traceable.
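
Defense 3 is the least obvious of the five, so here is a minimal sketch of single-use refresh tokens with replay detection. The class, its in-memory dictionaries, and the choice to revoke the whole session when a spent token reappears are assumptions for illustration; the actual behaviour lives in Keycloak and is not shown in this page.

```python
import secrets


class RefreshTokenStore:
    """Hypothetical in-memory model of rotating, single-use refresh tokens."""

    def __init__(self):
        self._active = {}  # token -> session id, still exchangeable
        self._spent = {}   # token -> session id, already exchanged once

    def issue(self, session_id):
        token = secrets.token_urlsafe(32)
        self._active[token] = session_id
        return token

    def exchange(self, token):
        """Swap a refresh token for a new one; the old one becomes spent."""
        if token in self._spent:
            # A spent token showing up again means it was copied somewhere.
            # Assumed policy: revoke every live token of that session.
            self._revoke_session(self._spent[token])
            raise PermissionError("refresh token replayed; session revoked")
        session_id = self._active.pop(token, None)
        if session_id is None:
            raise PermissionError("unknown or revoked refresh token")
        self._spent[token] = session_id
        return self.issue(session_id)

    def _revoke_session(self, session_id):
        self._active = {
            t: s for t, s in self._active.items() if s != session_id
        }
```

Under this policy an attacker who steals a refresh token either uses it after the legitimate client (and is rejected as a replay) or before it (in which case the legitimate client's next refresh trips the replay check and ends the session).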

Tampering in depth — ledger and audit

flowchart LR
    classDef immut fill:#ffebee,stroke:#c62828
    classDef ok fill:#d1e7dd,stroke:#0a3622

    UPDATE[Attempt to UPDATE ledger_entries]:::immut --> DB1{DB role grants?}
    DB1 -- no UPDATE/DELETE grants --> DENIED1[DB-level deny]:::ok
    DELETE[Attempt to DELETE audit_logs]:::immut --> DB2{DB role grants?}
    DB2 -- no UPDATE/DELETE grants --> DENIED2[DB-level deny]:::ok

    APP[Application bug<br/>tries to UPDATE] --> REV{Code review<br/>+ CI gate}
    REV -- caught --> BLOCK[PR blocked]:::ok
    REV -- missed --> DB1

Defense in depth: even if an application bug slips past review and attempts to UPDATE or DELETE ledger or audit rows, the per-service Postgres roles lack those grants, so the database itself rejects the statement.
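
To make the "database denies it" property concrete, the sketch below shows an append-only table that rejects UPDATE and DELETE regardless of what the application attempts. It is a stand-in only: SQLite has no role grants, so triggers play the part that missing UPDATE/DELETE grants play in Postgres. The column names and the helper are hypothetical.

```python
import sqlite3

# Triggers emulate the absence of UPDATE/DELETE grants on the real database.
APPEND_ONLY_DDL = """
CREATE TABLE ledger_entries (
    id INTEGER PRIMARY KEY,
    org_id TEXT NOT NULL,
    amount_cents INTEGER NOT NULL
);
CREATE TRIGGER ledger_no_update BEFORE UPDATE ON ledger_entries
BEGIN
    SELECT RAISE(ABORT, 'ledger_entries is append-only');
END;
CREATE TRIGGER ledger_no_delete BEFORE DELETE ON ledger_entries
BEGIN
    SELECT RAISE(ABORT, 'ledger_entries is append-only');
END;
"""


def try_statement(conn, sql, params=()):
    """Run one statement and report whether the database accepted it."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(sql, params)
        return "ok"
    except sqlite3.DatabaseError as exc:
        return "denied: %s" % exc
```

With this schema a buggy code path can still insert a new (possibly compensating) entry, but it cannot rewrite or remove history.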

Information disclosure — across slice tenants

flowchart TB
    subgraph Host[Slice host]
        direction TB
        GPU[GPU vfio-pci]
        NVME[NVMe namespace]
        VF[Mellanox SR-IOV VF]
    end
    subgraph A[Tenant A slice]
        VMA[VM with passthrough]
        DA[("Tenant A data<br/>on dedicated NVMe")]
    end
    subgraph B[Tenant B slice on same host]
        VMB[VM with passthrough]
        DB2[("Tenant B data<br/>on different NVMe")]
    end

    GPU -.dedicated to A.-> VMA
    NVME -.dedicated NVMe per slot.-> DA
    VF -.per-slot SR-IOV VF.-> VMA

    note1[Cross-tenant leak vectors blocked:<br/>1. NVMe is dedicated per slot, not shared<br/>2. Each slot uses a separate IB VF<br/>3. Slot release wipes per destructive_wipe_policy<br/>4. cleanup_blocked slot cannot be reused]

| Leak vector | Mitigation |
| --- | --- |
| Shared storage between slices | Slot requires capacity_metadata.storage_ownership = 'slice'; dedicated NVMe per slot |
| Shared fabric between slices | Slot requires fabric_claim_mode = 'per_slot_vf' with unique fabric_vf_pci_address |
| Data left on NVMe at release | Slot requires destructive_wipe_policy non-empty; release calls wipe; verification before slot returns to available |
| Wipe failure | Slot pinned in cleanup_blocked; cannot be re-allocated |
| Metrics across tenants | Telemetry uses per-allocation token; host Netdata stays operator-only |
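
The slot-level mitigations above amount to a set of invariants that must hold before a slot is offered to the scheduler. A minimal sketch of such a check follows. The `Slot` dataclass, the function name, and the assumption that VF addresses must be unique among the slots passed in are illustrative; only the field names and required values come from the table.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Slot:
    slot_id: str
    storage_ownership: str       # capacity_metadata.storage_ownership
    fabric_claim_mode: str
    fabric_vf_pci_address: str
    destructive_wipe_policy: str
    state: str = "available"


def schedulability_violations(slot: Slot, host_slots: Iterable[Slot]) -> List[str]:
    """Return the reasons a slot must not be scheduled (empty list = OK)."""
    problems = []
    if slot.storage_ownership != "slice":
        problems.append("storage_ownership must be 'slice'")
    if slot.fabric_claim_mode != "per_slot_vf":
        problems.append("fabric_claim_mode must be 'per_slot_vf'")
    if not slot.fabric_vf_pci_address:
        problems.append("fabric_vf_pci_address is required")
    elif any(
        other.slot_id != slot.slot_id
        and other.fabric_vf_pci_address == slot.fabric_vf_pci_address
        for other in host_slots
    ):
        problems.append("fabric_vf_pci_address must be unique")
    if not slot.destructive_wipe_policy.strip():
        problems.append("destructive_wipe_policy must be non-empty")
    if slot.state == "cleanup_blocked":
        problems.append("slot is pinned in cleanup_blocked")
    return problems
```

Returning every violation, rather than failing on the first, gives an operator the full list of what to fix in one pass.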

Denial of service

flowchart TB
    subgraph Public[Public edge]
        WAF[WAF]
    end
    subgraph App[Application]
        RL[Redis rate limiter]
        QUOTA[Allocation concurrency limit]
        SLOT[Slot scheduler]
    end
    subgraph WK[Workers]
        DLQ[(DLQ NATS stream)]
    end

    ATT[Attacker / runaway] --> WAF
    WAF -->|connection rate-limit| RL
    RL -->|per-user RPM cap| API[cmd/api]
    API -->|max concurrent allocs per user| QUOTA
    API -->|same node FOR UPDATE SKIP LOCKED| SLOT
    NATS[NATS] -->|poison messages| DLQ
    DLQ -.alert.-> SRE[SRE on-call]

    classDef defence fill:#d1e7dd,stroke:#0a3622
    class WAF,RL,QUOTA,SLOT,DLQ defence

Per-endpoint policy limits (all driven by policy_values table, not constants):

| Key | Default | Effect |
| --- | --- | --- |
| rate_limit.api_requests_per_minute | 120 | Default per-user RPM |
| rate_limit.terminal_token_requests_per_minute | 10 | Terminal token mint |
| rate_limit.financial_requests_per_minute | 30 | Payment / refund / balance reads |
| rate_limit.admin_overview_requests_per_minute | 600 | Admin overview polling |
| allocation.max_concurrent_per_user | 50 | Slot exhaustion guard |
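
As an illustration of how a per-minute policy value turns into an allow/deny decision plus a Retry-After value, here is a minimal in-memory token bucket. It is a sketch only: the real limiter is Redis-backed, and the class, the injected clock, and the `POLICY_DEFAULTS` dictionary (copied from the defaults above) are assumptions.

```python
import math

# Defaults copied from the policy table; the live values come from policy_values.
POLICY_DEFAULTS = {
    "rate_limit.api_requests_per_minute": 120,
    "rate_limit.terminal_token_requests_per_minute": 10,
    "rate_limit.financial_requests_per_minute": 30,
    "rate_limit.admin_overview_requests_per_minute": 600,
}


class TokenBucket:
    """Hypothetical per-user bucket; the caller supplies the current time."""

    def __init__(self, per_minute, now):
        self.per_minute = per_minute
        self.tokens = float(per_minute)  # start full: a burst up to the cap
        self.updated = now

    def take(self, now):
        """Return (allowed, retry_after_seconds) for one request at `now`."""
        elapsed = max(0.0, now - self.updated)
        refill = elapsed * self.per_minute / 60.0
        self.tokens = min(float(self.per_minute), self.tokens + refill)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True, 0
        # Whole seconds until one full token is available again.
        retry_after = math.ceil((1.0 - self.tokens) * 60.0 / self.per_minute)
        return False, retry_after
```

For example, with the 120 RPM default a user can burst 120 requests, after which the bucket refills at two tokens per second and a denied caller is told to retry after one second.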

Elevation of privilege — handler discipline

flowchart LR
    REQ[Incoming request] --> M1[Middleware:<br/>JWT verify]
    M1 --> M2[Middleware:<br/>scope resolve - tenant/project]
    M2 --> M3[Middleware:<br/>authz check]
    M3 --> H[Handler]
    H --> S[Service function]
    S -.NEVER.-> NATS[(NATS direct publish<br/>BLOCKED by review)]
    S --> DB[Postgres TX]
    DB --> OB[outbox row in same TX]
    DB --> AUD[audit row in same TX]
    OB --> OR[outbox-relay]
    OR --> NATS

    classDef block fill:#f8d7da,stroke:#42101e
    class NATS block

Three layered checks before a service function runs:

  1. JWT valid + not expired (middleware.Auth).
  2. Scope resolved (packages/shared/authz — most-specific wins, with audit on resolution).
  3. Role/permission check inside the handler before service call.

After the service call, only the outbox-relay publishes to NATS. A handler that tries to call nats.Publish directly fails review.
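
The outbox discipline described above can be sketched as follows: the service function writes the business row, the outbox row, and the audit row in one transaction, and a separate relay is the only code that publishes. This is a minimal illustration using in-memory SQLite in place of Postgres and a plain callback in place of NATS; the table layouts, subject name, and function names are hypothetical.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE allocations (id INTEGER PRIMARY KEY, org_id TEXT NOT NULL);
CREATE TABLE outbox (
    id INTEGER PRIMARY KEY,
    subject TEXT NOT NULL,
    payload TEXT NOT NULL,
    published INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE audit_logs (
    id INTEGER PRIMARY KEY,
    actor TEXT NOT NULL,
    action TEXT NOT NULL
);
"""


def create_allocation(conn, org_id, actor):
    """Service function: state change, outbox row, and audit row commit together."""
    with conn:  # one transaction; any failure rolls back all three inserts
        cur = conn.execute("INSERT INTO allocations (org_id) VALUES (?)", (org_id,))
        alloc_id = cur.lastrowid
        conn.execute(
            "INSERT INTO outbox (subject, payload) VALUES (?, ?)",
            ("allocation.created", json.dumps({"id": alloc_id, "org_id": org_id})),
        )
        conn.execute(
            "INSERT INTO audit_logs (actor, action) VALUES (?, ?)",
            (actor, "allocation.create"),
        )
    return alloc_id


def relay_once(conn, publish):
    """Outbox relay: the only place that calls the publish function."""
    rows = conn.execute(
        "SELECT id, subject, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, subject, payload in rows:
        publish(subject, payload)
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Because the event row shares a transaction with the state change and its audit record, an event can never be published for a change that was rolled back. If the relay crashes between publishing and marking the row, the message is sent again, so consumers should be idempotent.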

Slice-specific concerns

→ Detail page: GPU slice as-built §Security model. In summary:

  • VFIO: the host kernel cannot accidentally drive a tenant GPU
  • Per-slot NVMe: no cross-tenant disk reuse
  • Per-slot SR-IOV VF: no cross-tenant fabric snooping
  • Operator-approved slot map: guards against accidental exposure of the host root NVMe or unsuitable fabric VFs
  • Destructive wipe policy required: a slot cannot become schedulable without one
  • Per-allocation telemetry token: the host gateway forwards telemetry only from the allocation that holds the matching token
  • Terminal gateway pattern: no raw VNC, no 0.0.0.0 exposure, single-use tokens

Trust boundaries (in one map)

flowchart LR
    INET([Public internet]):::pub --> WAF[WAF]:::edge
    WAF --> CP[Control plane<br/>mTLS internal]:::trusted
    CP -->|mTLS pull only| NA[node-agent fleet]:::fleet
    CP -->|outbound HTTPS<br/>+ webhook signature in| STRIPE[(Stripe)]:::ext
    CP -->|cached JWKS| KC[(Keycloak / IdP)]:::ext
    CP -->|per-service creds| PG[(Postgres)]:::data
    CP -->|signed URLs| S3[(Object storage)]:::data
    CP -->|API token| OBS[(OTel / Loki / Prom)]:::data

    classDef pub fill:#ffebee,stroke:#c62828
    classDef edge fill:#fff3e0,stroke:#e65100
    classDef trusted fill:#e8f5e9,stroke:#2e7d32
    classDef fleet fill:#fce4ec,stroke:#c2185b
    classDef ext fill:#e8eaf6,stroke:#3949ab
    classDef data fill:#eceff1,stroke:#455a64

| Boundary | Direction | Controls |
| --- | --- | --- |
| Internet → Edge | inbound | WAF, rate-limit, TLS termination |
| Edge → Control plane | inbound | mTLS, network policy default-deny |
| Control plane → Node fleet | inbound from node | mTLS pull (node initiates); signed task params; identity-bound result |
| Control plane → Stripe | outbound + verified inbound | Outbound HTTPS; webhook signature on raw body |
| Control plane → Keycloak | outbound (rare) | Cached JWKS refresh every 5 min |
| Control plane → Postgres | outbound | Per-service credentials; no shared pool |
| Control plane → Object storage | outbound | Signed URLs; tenant-scoped namespaces |

Pen test scope

→ Read source: Pen_Test_Scope.md. Covers public auth, payment paths, storage breakout, rate-limit bypass, terminal token replay, admin escalation, node-agent contract.

Abuse cases

→ Read source: Abuse_Case_Catalog.md. Includes:

  • Credit-card chargeback after GPU usage
  • Resource exhaustion via rapid alloc create+release
  • Storage namespace traversal
  • WS DoS via terminal token spam
  • IB fabric VF reuse attempt (blocked by metadata invariants)
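
For the storage namespace traversal case, the general shape of a path-safety check is: normalise the caller-supplied path under the tenant's root, then refuse anything that no longer sits below that root. The sketch below illustrates the idea only; the `tenants/<org_id>` layout and the function name are assumptions, and the real rules live in the storagepath package.

```python
import posixpath


def safe_storage_key(org_id, user_path):
    """Resolve user_path inside the tenant namespace or raise ValueError."""
    if not org_id or "/" in org_id or org_id in (".", ".."):
        raise ValueError("invalid org_id")
    if "\x00" in user_path or "\\" in user_path:
        raise ValueError("illegal character in path")
    root = "tenants/" + org_id  # hypothetical namespace layout
    # Treat the input as relative to the tenant root, then collapse . and ..
    joined = posixpath.normpath(posixpath.join(root, user_path.lstrip("/")))
    if joined != root and not joined.startswith(root + "/"):
        raise ValueError("path escapes tenant namespace")
    return joined
```

Comparing against `root + "/"` rather than `root` alone matters: a bare prefix check would accept `tenants/org-a2/...` as belonging to `org-a`.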

Where to look next