PKI Spec — Node and Worker Certificate Management

Status:
  • MVP canonical (required for node-agent communication and enrollment).
  • Applies to internal control-plane/node/worker trust; public ingress TLS remains edge-managed.

1. Purpose and Scope

Defines the certificate authority hierarchy, certificate profiles, enrollment flow, renewal strategy, revocation model, and task-signing scheme for:

  • Node agent client certificates (one per GPU node)
  • Provisioning worker client certificates (one per worker instance)
  • Control plane server certificate (internal API endpoint)
  • Task signing keypair (Ed25519, separate from TLS)

Out of scope: public API TLS (managed by edge gateway/CDN); Keycloak TLS; user-facing HTTPS.


2. CA Technology

MVP: Smallstep step-ca

Rationale: battle-tested, supports JWK/X5C provisioners natively, short-lived cert model built-in, renewal via existing cert (X5C), single-use token issuance for enrollment.

step-ca is deployed as an internal Kubernetes service (pki-ca.internal:9000), not exposed to the public internet. Its private key is stored in the cloud KMS (never on disk).

2.1 Migration Path to Vault PKI

All Go code interacts with a CAClient interface (see §9). To migrate to HashiCorp Vault PKI:

  1. Implement VaultCAClient satisfying the same interface.
  2. Run both implementations in parallel during transition (dual-sign period).
  3. Re-enroll nodes with Vault-signed certs; retire step-ca.
  4. No changes required in node agent, worker, or API handler code.

The migration is a backend swap only. The protocol, cert profiles, and enrollment flow are identical.


3. CA Hierarchy

Root CA  (offline; stored on HSM or hardware security token)
  │      (signs Intermediate CA cert only; kept offline after CA ceremony)
  └── Intermediate CA  (step-ca, online, Kubernetes internal)
        ├── Control Plane Server Cert
        │     CN=api.internal.gpuaas.io
        │     Lifetime: 90 days
        │     Renewal: automated (step-ca ACME or manual rotation)
        ├── Worker Client Certs
        │     CN=worker-provisioning-{instance-id}, O=gpuaas, OU=workers
        │     Lifetime: 24 hours
        │     Renewal: autonomous, T-1h before expiry via X5C provisioner
        └── Node Agent Client Certs
              CN=node-{node_id}, O=gpuaas, OU=nodes
              Lifetime: 24 hours
              Renewal: autonomous, T-1h before expiry via X5C provisioner
              Initial issuance: one-time enrollment token (JWK provisioner);
              expired-cert recovery uses a node-bound recovery token stored by the API

The Root CA never issues leaf certificates. After signing the Intermediate CA cert it goes offline. If the Intermediate CA is compromised, the Root CA issues a new one; all leaf certs are re-enrolled.


4. step-ca Provisioners

Provisioner | Name       | Used for
JWK         | enrollment | Node initial enrollment (one-time tokens, TTL 30 min) and node-bound recovery re-enrollment after local client cert expiry
JWK         | workers    | Worker cert issuance at startup (service account JWT)
X5C         | renewal    | Autonomous cert renewal for both nodes and workers

The X5C provisioner accepts a valid leaf cert as proof of identity and issues a replacement. This is the renewal mechanism — no separate token needed after enrollment.


5. Certificate Profiles

Node Agent Client Cert
  Subject:   CN=node-{node_id}, O=gpuaas, OU=nodes
  Key type:  Ed25519
  Lifetime:  24 hours
  Key usage: digital signature, key agreement
  EKU:       client authentication
  SANs:      none

Worker Client Cert
  Subject:   CN=worker-provisioning-{instance-id}, O=gpuaas, OU=workers
  Key type:  Ed25519
  Lifetime:  24 hours
  Key usage: digital signature, key agreement
  EKU:       client authentication
  SANs:      none

Control Plane Server Cert
  Subject:   CN=api.internal.gpuaas.io, O=gpuaas, OU=control-plane
  Key type:  ECDSA P-256
  Lifetime:  90 days
  Key usage: digital signature, key agreement
  EKU:       server authentication
  SANs:      api.internal.gpuaas.io, *.internal.gpuaas.io
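
As an illustration of the node agent profile above, the following is a minimal Go sketch of building a matching CSR with the standard library. The helper names (newNodeKey, buildNodeCSR, commonNameOf) are hypothetical, and the sketch assumes node_id is the bare identifier with the "node-" prefix added at CSR time. Key usage and EKU are left to the CA template and are not requested here.

```go
package main

import (
	"crypto"
	"crypto/ed25519"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
)

// newNodeKey generates a fresh Ed25519 key and returns it as a crypto.Signer.
func newNodeKey() (crypto.Signer, error) {
	_, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return nil, fmt.Errorf("generate key: %w", err)
	}
	return priv, nil
}

// buildNodeCSR returns a PEM-encoded CSR for the node agent profile:
// CN=node-{node_id}, O=gpuaas, OU=nodes, no SANs.
// Assumption: nodeID is the bare identifier; the "node-" prefix is added here.
func buildNodeCSR(signer crypto.Signer, nodeID string) ([]byte, error) {
	tmpl := &x509.CertificateRequest{
		Subject: pkix.Name{
			CommonName:         "node-" + nodeID,
			Organization:       []string{"gpuaas"},
			OrganizationalUnit: []string{"nodes"},
		},
	}
	der, err := x509.CreateCertificateRequest(rand.Reader, tmpl, signer)
	if err != nil {
		return nil, fmt.Errorf("create CSR: %w", err)
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE REQUEST", Bytes: der}), nil
}

// commonNameOf parses a PEM CSR and returns its subject CN, for inspection.
func commonNameOf(csrPEM []byte) (string, error) {
	block, _ := pem.Decode(csrPEM)
	if block == nil {
		return "", fmt.Errorf("no PEM block found")
	}
	csr, err := x509.ParseCertificateRequest(block.Bytes)
	if err != nil {
		return "", err
	}
	return csr.Subject.CommonName, nil
}

func main() {
	key, err := newNodeKey()
	if err != nil {
		panic(err)
	}
	csrPEM, err := buildNodeCSR(key, "abc123")
	if err != nil {
		panic(err)
	}
	cn, err := commonNameOf(csrPEM)
	if err != nil {
		panic(err)
	}
	fmt.Println(cn)
}
```

Because buildNodeCSR takes a crypto.Signer rather than a concrete key type, the same function would work with a hardware-backed key.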

6. Node Enrollment Flow

Trigger: admin registers a new node via POST /api/v1/admin/nodes, then requests a bootstrap bundle for that node.

Fundamental constraint: the node agent has exactly one network destination — api.internal. It never connects to step-ca, Postgres, Redis, Temporal, or NATS directly. The API proxies all CA interactions internally.

Step 1 — Token generation (control plane)

  1. Provisioning service generates a one-time enrollment token; token encodes node_id, TTL = 30 min, single-use. Stored in Redis: node_enroll:{node_id}, TTL = 30 min.
  2. API root CA fingerprint retrieved from config (used by node to verify api.internal server cert on first connection before it has the CA bundle).
  3. Bootstrap bundle assembled and returned to the admin caller:
#cloud-config
write_files:
  - path: /etc/gpuaas/enrollment.env
    permissions: '0600'
    content: |
      GPUAAS_NODE_ID=node-{id}
      GPUAAS_ENROLLMENT_TOKEN={token}
      GPUAAS_CA_FINGERPRINT={sha256-root-ca-fingerprint}
      GPUAAS_API_URL=https://api.internal.gpuaas.io

GPUAAS_CA_URL is intentionally absent — nodes never talk to step-ca. The API is the only endpoint a node ever connects to.
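
The GPUAAS_CA_FINGERPRINT value is what lets the agent authenticate the API before it holds a CA bundle. This spec does not say how the agent obtains the root certificate bytes on first contact, so the sketch below covers only the comparison step: hashing a DER-encoded certificate and matching it against the pinned value. The function names and the accepted encodings (hex, optional colons, any case) are assumptions.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"strings"
)

// fingerprint returns the lowercase hex SHA-256 digest of a DER-encoded cert.
func fingerprint(certDER []byte) string {
	sum := sha256.Sum256(certDER)
	return hex.EncodeToString(sum[:])
}

// fingerprintMatches compares a certificate against the pinned fingerprint.
// Assumption: the pin is hex, possibly upper case or colon-separated.
func fingerprintMatches(certDER []byte, pinned string) bool {
	want := strings.ToLower(strings.ReplaceAll(pinned, ":", ""))
	got := fingerprint(certDER)
	return subtle.ConstantTimeCompare([]byte(got), []byte(want)) == 1
}

func main() {
	// Stand-in bytes; a real agent would hash the root CA certificate's DER.
	der := []byte("stand-in for root CA certificate DER")
	pin := fingerprint(der)

	fmt.Println(fingerprintMatches(der, pin))
	fmt.Println(fingerprintMatches([]byte("some other certificate"), pin))
}
```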

Delivery modes:
  • manual (MVP default): admin downloads bundle and installs on node.
  • maas: bundle injected by provider automation/cloud-init.

No long-lived shared internal token is used for enrollment.
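
The single-use, 30-minute semantics of the enrollment token from Step 1 can be sketched as below. This is an in-memory stand-in for the Redis keys node_enroll:{node_id}, not the actual implementation. The type and method names are hypothetical, and the token here is an opaque random value bound to the node by its key, which is one possible reading of "token encodes node_id".

```go
package main

import (
	"crypto/rand"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"sync"
	"time"
)

const enrollmentTTL = 30 * time.Minute

type pendingToken struct {
	value   string
	expires time.Time
}

// tokenStore is an in-memory stand-in for Redis keys node_enroll:{node_id}.
type tokenStore struct {
	mu      sync.Mutex
	pending map[string]pendingToken // keyed by node_id
}

func newTokenStore() *tokenStore {
	return &tokenStore{pending: make(map[string]pendingToken)}
}

// Issue creates a random token bound to nodeID, replacing any earlier one.
func (s *tokenStore) Issue(nodeID string) (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", fmt.Errorf("generate token: %w", err)
	}
	token := hex.EncodeToString(buf)
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pending[nodeID] = pendingToken{value: token, expires: time.Now().Add(enrollmentTTL)}
	return token, nil
}

// Consume reports whether token is valid for nodeID; a valid token is
// removed so that it cannot be used a second time.
func (s *tokenStore) Consume(nodeID, token string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	p, ok := s.pending[nodeID]
	if !ok {
		return false
	}
	if time.Now().After(p.expires) {
		delete(s.pending, nodeID)
		return false
	}
	if subtle.ConstantTimeCompare([]byte(p.value), []byte(token)) != 1 {
		return false
	}
	delete(s.pending, nodeID)
	return true
}

func main() {
	store := newTokenStore()
	token, err := store.Issue("node-1")
	if err != nil {
		panic(err)
	}
	fmt.Println(store.Consume("node-1", token)) // first use
	fmt.Println(store.Consume("node-1", token)) // replay of the same token
}
```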

Step 2 — Agent startup

Node boots. gpuaas-node-agent.service starts. Agent reads bootstrap env file (default: /etc/gpuaas/enrollment.env). If no cert at GPUAAS_CERT_PATH, enters enrollment mode.

Step 3 — CSR submission to API

  1. Agent generates a fresh Ed25519 keypair.
       • Private key written to /etc/gpuaas/agent.key (mode 0600, owner: gpuaas-agent).
       • MVP: filesystem. Production upgrade: TPM 2.0 (see §8).
  2. Constructs CSR: CN=node-{node_id}, O=gpuaas, OU=nodes.
  3. Calls the API enrollment endpoint (plain TLS; server cert verified against GPUAAS_CA_FINGERPRINT; no client cert yet):
POST https://api.internal.gpuaas.io/internal/v1/nodes/enroll
Authorization: Bearer {enrollment_token}
Body: { "node_id": "node-{id}", "csr": "PEM..." }
  4. Inside the API (invisible to the node):
       • Validates the enrollment token from Redis.
       • On success, promotes a valid one-time node_enroll_token:* to durable node_recovery_token:*, then consumes the one-time token.
       • Future expired-cert recovery may use the node-bound recovery token; successful node.self_update can rotate that token.
       • Validates CSR: key type Ed25519, CN matches node_id, no unexpected SANs.
       • Calls pki.CAClient.Enroll(ctx, req), where req carries the CSR and enrollment token (see §9); step-ca signs the cert (internal call, pki-ca.internal:9000, never reachable by nodes).
       • Updates nodes.status = 'enrolled' in DB.
       • Writes audit log: action=node_enrolled.
  5. API returns to the node:
{ "certificate": "PEM cert chain", "ca_bundle": "PEM CA bundle" }
  6. Agent writes the cert to GPUAAS_CERT_PATH and stores the enrollment-returned CA bundle at GPUAAS_NODE_CERT_CA_BUNDLE_PATH; control-plane HTTPS trust remains in GPUAAS_CA_BUNDLE_PATH.
  7. Deletes /etc/gpuaas/enrollment.env (secrets no longer needed on disk).
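
The API-side CSR checks named in this step (key type Ed25519, CN matches node_id, no unexpected SANs) might look roughly like the sketch below. The function names are hypothetical. The self-signature check is an addition that this spec does not list, and the expected CN is passed in as a whole string so the sketch takes no position on how node_id maps to the CN.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
)

// validateNodeCSR applies the enrollment checks to a PEM-encoded CSR.
func validateNodeCSR(csrPEM []byte, expectedCN string) error {
	block, _ := pem.Decode(csrPEM)
	if block == nil || block.Type != "CERTIFICATE REQUEST" {
		return errors.New("body is not a PEM certificate request")
	}
	csr, err := x509.ParseCertificateRequest(block.Bytes)
	if err != nil {
		return fmt.Errorf("parse CSR: %w", err)
	}
	// Not listed in the spec: confirm the requester holds the private key.
	if err := csr.CheckSignature(); err != nil {
		return fmt.Errorf("CSR signature: %w", err)
	}
	if _, ok := csr.PublicKey.(ed25519.PublicKey); !ok {
		return errors.New("key type must be Ed25519")
	}
	if csr.Subject.CommonName != expectedCN {
		return fmt.Errorf("CN %q does not match %q", csr.Subject.CommonName, expectedCN)
	}
	if len(csr.DNSNames)+len(csr.IPAddresses)+len(csr.EmailAddresses)+len(csr.URIs) > 0 {
		return errors.New("CSR must not carry SANs")
	}
	return nil
}

// makeCSR builds a throwaway Ed25519 CSR for the demo below.
func makeCSR(cn string, dnsNames []string) []byte {
	_, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.CertificateRequest{
		Subject:  pkix.Name{CommonName: cn, Organization: []string{"gpuaas"}, OrganizationalUnit: []string{"nodes"}},
		DNSNames: dnsNames,
	}
	der, err := x509.CreateCertificateRequest(rand.Reader, tmpl, priv)
	if err != nil {
		panic(err)
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE REQUEST", Bytes: der})
}

func main() {
	fmt.Println(validateNodeCSR(makeCSR("node-1", nil), "node-1"))
	fmt.Println(validateNodeCSR(makeCSR("node-2", nil), "node-1"))
	fmt.Println(validateNodeCSR(makeCSR("node-1", []string{"extra.example"}), "node-1"))
}
```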

The canonical lifecycle for control-plane HTTPS bootstrap trust delivery and update is defined in doc/architecture/Node_Bootstrap_Trust_Delivery_v1.md.

Idempotency: if the network fails after the API signs the cert but before the node receives the response, the node retries with the same token. The API caches the issued cert in Redis (TTL = 30 min, keyed by token) and returns it on retry without re-signing.
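
The retry behaviour described above can be sketched as a "sign once, then replay" cache keyed by token. This is an in-memory stand-in for the Redis cache. The names are hypothetical, and token validation is assumed to have happened before this function is reached.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const cacheTTL = 30 * time.Minute

type issued struct {
	certPEM []byte
	expires time.Time
}

// enrollCache is an in-memory stand-in for the Redis cache keyed by token.
type enrollCache struct {
	mu      sync.Mutex
	entries map[string]issued
}

func newEnrollCache() *enrollCache {
	return &enrollCache{entries: make(map[string]issued)}
}

// getOrSign returns the cert already issued for token, or calls sign and
// caches the result. Holding the lock while signing keeps concurrent
// retries from signing twice, at the cost of serializing enrollments.
func (c *enrollCache) getOrSign(token string, sign func() ([]byte, error)) ([]byte, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.entries[token]; ok && time.Now().Before(e.expires) {
		return e.certPEM, nil
	}
	certPEM, err := sign()
	if err != nil {
		return nil, err // nothing cached; a retry will attempt signing again
	}
	c.entries[token] = issued{certPEM: certPEM, expires: time.Now().Add(cacheTTL)}
	return certPEM, nil
}

func main() {
	cache := newEnrollCache()
	calls := 0
	sign := func() ([]byte, error) {
		calls++
		return []byte("PEM cert chain"), nil
	}

	first, _ := cache.getOrSign("token-1", sign)
	retry, _ := cache.getOrSign("token-1", sign)
	fmt.Println(string(first) == string(retry), calls)
}
```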

After step 7 the node enters the task polling loop. No separate confirmation step is needed — the API records enrollment as part of step 4 above.


7. Certificate Renewal

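All client leaf certs (node agent and worker) have a 24-hour lifetime, and their renewal is autonomous. The control plane server cert has a 90-day lifetime and follows the rotation listed in §3.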

Node agent renewal

  1. Agent checks cert expiry every 15 minutes.
  2. At T-1h: generates new Ed25519 keypair + CSR.
  3. Calls the API renewal endpoint (mTLS, presenting current valid cert as proof):
POST https://api.internal.gpuaas.io/internal/v1/nodes/{id}/cert/renew
mTLS client cert: current agent.crt
Body: { "csr": "PEM new CSR..." }
  4. Inside the API (invisible to the node):
       • Validates mTLS cert (OU=nodes, CN matches node_id, serial not in deny-list).
       • Calls pki.CAClient.Renew(ctx, currentCert, newCSR); the step-ca X5C provisioner signs the replacement cert (pki-ca.internal:9000, never reachable by nodes).
       • Records new cert serial in DB.
  5. API returns { "certificate": "PEM new cert chain" }.
  6. Agent atomically swaps: writes new cert + key to temp paths, then renames. Long-lived connections continue using the old cert; new connections use the new cert.
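
Two pieces of the agent-side loop above lend themselves to a short sketch: the T-1h check and the write-then-rename swap. The helper names are hypothetical. The sketch swaps a single file, so a crash between the key rename and the cert rename could still leave a mismatched pair, which it does not address.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

const renewBefore = time.Hour

// needsRenewal reports whether the cert is within one hour of expiry.
func needsRenewal(notAfter, now time.Time) bool {
	return !now.Before(notAfter.Add(-renewBefore))
}

// writeFileAtomic writes data to a temp file in the target directory and
// renames it over path, so readers see either the old or the new contents.
func writeFileAtomic(path string, data []byte, perm os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly once the rename succeeded
	if err := tmp.Chmod(perm); err != nil {
		tmp.Close()
		return err
	}
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// demo exercises both helpers in a throwaway directory.
func demo() (dueEarly, dueLate bool, contents string, err error) {
	now := time.Now()
	dueEarly = needsRenewal(now.Add(2*time.Hour), now)
	dueLate = needsRenewal(now.Add(30*time.Minute), now)

	dir, err := os.MkdirTemp("", "agent-cert-")
	if err != nil {
		return
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "agent.crt")
	if err = writeFileAtomic(path, []byte("old cert"), 0o644); err != nil {
		return
	}
	if err = writeFileAtomic(path, []byte("new cert"), 0o644); err != nil {
		return
	}
	b, err := os.ReadFile(path)
	return dueEarly, dueLate, string(b), err
}

func main() {
	fmt.Println(demo())
}
```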

Renewal failure handling

  • Retry with exponential backoff: 5 min, 10 min, 20 min, 40 min, 60 min.
  • If cert expires before renewal succeeds:
       • Node goes offline: nodes.status = 'cert_expired'.
       • On-call alert fired.
       • Recovery: re-enrollment (recorded in audit log as a security event).
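
The backoff schedule listed above (doubling from 5 minutes, capped at 60) can be expressed as a small function. This is a sketch: the name retryDelay is hypothetical, and staying at 60 minutes for any attempt beyond the fifth is an assumption, since the spec lists only five values.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay = 5 * time.Minute
	maxDelay  = 60 * time.Minute
)

// retryDelay returns the wait before retry number attempt (0-based):
// 5, 10, 20, 40, 60 minutes, then 60 minutes for every later attempt.
func retryDelay(attempt int) time.Duration {
	d := baseDelay
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Printf("retry %d: wait %v\n", attempt+1, retryDelay(attempt))
	}
}
```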

Worker renewal

Workers (Kubernetes pods, internal network) call step-ca's X5C provisioner directly — they are inside the cluster and can reach pki-ca.internal:9000. Workers do not go through the API for cert renewal. The distinction:

  • Nodes (external GPU machines) → api.internal only
  • Workers (internal Kubernetes pods) → step-ca directly (same cluster network)

8. Node Private Key Storage

MVP — Filesystem
  • Path: /etc/gpuaas/agent.key
  • Permissions: 0600, owner: gpuaas-agent (unprivileged system user)
  • Key never transmitted; only the public key (in the cert) is externally known.

Production upgrade — TPM 2.0

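The CAClient interface (§9) never handles raw key bytes: it takes an already-signed CSR and a tls.Certificate, whose PrivateKey can be any crypto.Signer. The TPM upgrade is therefore a backend swap only: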

  • Generate key inside TPM (never extractable).
  • Pass TPM-backed signer to the CA client; enrollment and renewal flows are identical.
  • Requires go-tpm or tpm2-tools on the node image.

When GPUAAS_USE_TPM=true is set, the agent detects TPM at startup and uses it preferentially. No changes to the PKI spec or protocol.


9. Go CA Client Interface

All code that issues, renews, or validates certs goes through this interface. Changing the CA backend requires only a new implementation — no call-site changes.

// packages/shared/pki/client.go
package pki

import (
    "context"
    "crypto/tls"
)

// CAClient issues and renews certificates.
// MVP: StepCAClient. Migration: VaultCAClient (same interface).
type CAClient interface {
    // Enroll submits a CSR with a one-time enrollment token.
    // Returns PEM-encoded cert chain and CA bundle.
    Enroll(ctx context.Context, req EnrollRequest) (*EnrollResponse, error)

    // Renew exchanges an existing valid cert for a new one (X5C / equivalent).
    Renew(ctx context.Context, currentCert tls.Certificate, newCSR []byte) (*EnrollResponse, error)

    // RootFingerprint returns the SHA-256 fingerprint of the root CA cert.
    RootFingerprint() string
}

// EnrollRequest is the input to CAClient.Enroll.
type EnrollRequest struct {
    CSR             []byte // PEM
    EnrollmentToken string
}

// EnrollResponse is returned by both Enroll and Renew.
type EnrollResponse struct {
    CertChainPEM []byte
    CABundlePEM  []byte
}

StepCAClient calls step-ca's /1.0/sign endpoint. VaultCAClient calls Vault's /v1/pki/sign/{role} endpoint. Both satisfy CAClient.


10. Task Signing Keypair

Separate from TLS. Guarantees that even if mTLS is bypassed, unsigned or badly-signed task payloads cannot trigger node-side actions.

  • Algorithm: Ed25519
  • Control plane holds: private key (KMS/Vault custody target; never in node config)
  • Node agent holds: public verifier material only

The canonical lifecycle for signer versioning, verifier rollout, grace, and rollback is defined in doc/architecture/Node_Task_Signing_Lifecycle_v1.md.

This PKI spec only establishes the cryptographic separation between:
  1. certificate identity and transport trust, and
  2. typed task authenticity.
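
To make the separation concrete, the sketch below signs and verifies a task payload with Ed25519 from Go's standard library. The payload format and function names are hypothetical. The real encoding, key versioning, and verifier rollout are governed by the lifecycle document referenced above, and in production the private key would live in KMS/Vault rather than in process memory.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

// signTask is the control-plane side: it signs the exact payload bytes.
func signTask(priv ed25519.PrivateKey, payload []byte) []byte {
	return ed25519.Sign(priv, payload)
}

// verifyTask is the node-agent side: it holds only the public key.
func verifyTask(pub ed25519.PublicKey, payload, sig []byte) bool {
	if len(pub) != ed25519.PublicKeySize {
		return false // ed25519.Verify panics on wrong-size keys
	}
	return ed25519.Verify(pub, payload, sig)
}

// demo signs a payload, then verifies both the original and a modified copy.
func demo() (validOK, tamperedOK bool, err error) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return false, false, err
	}
	payload := []byte(`{"task_id":"t-1","action":"noop"}`) // hypothetical format
	sig := signTask(priv, payload)

	tampered := []byte(`{"task_id":"t-1","action":"wipe"}`)
	return verifyTask(pub, payload, sig), verifyTask(pub, tampered, sig), nil
}

func main() {
	fmt.Println(demo())
}
```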


11. mTLS Trust Model

Node agent → Control Plane (API)
  • Node agent is mTLS client.
  • Server presents: control plane server cert (signed by our CA).
  • Client presents: node agent cert (OU=nodes).
  • API verifies: cert chain to CA bundle, OU=nodes, CN matches registered node_id, serial not in deny-list.

Provisioning worker → Control Plane (API)
  • Worker is mTLS client.
  • Client presents: worker cert (OU=workers).
  • API verifies: cert chain, OU=workers.

Node agents and workers use separate OU values, giving the API fine-grained control over which endpoints each can reach (e.g., workers can queue tasks, nodes can poll them).
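
A rough sketch of the OU-based checks follows. It assumes chain verification has already been done by the TLS stack (for example a tls.Config with ClientAuth set to tls.RequireAndVerifyClientCert and ClientCAs holding the CA bundle) and covers only the subject checks. All function names are hypothetical.

```go
package main

import (
	"crypto/x509"
	"crypto/x509/pkix"
	"errors"
	"fmt"
)

// hasOU reports whether the cert carries exactly one OU equal to ou.
func hasOU(cert *x509.Certificate, ou string) bool {
	ous := cert.Subject.OrganizationalUnit
	return len(ous) == 1 && ous[0] == ou
}

// authorizeNode checks OU=nodes and that the CN matches the registered node.
func authorizeNode(cert *x509.Certificate, registeredCN string) error {
	if !hasOU(cert, "nodes") {
		return errors.New("client cert is not OU=nodes")
	}
	if cert.Subject.CommonName != registeredCN {
		return fmt.Errorf("CN %q does not match registered node %q",
			cert.Subject.CommonName, registeredCN)
	}
	return nil
}

// authorizeWorker checks OU=workers only, as in the trust model above.
func authorizeWorker(cert *x509.Certificate) error {
	if !hasOU(cert, "workers") {
		return errors.New("client cert is not OU=workers")
	}
	return nil
}

// certWithSubject builds a bare certificate value for the demo; a real
// handler would read the verified peer certificate from the TLS connection.
func certWithSubject(cn, ou string) *x509.Certificate {
	return &x509.Certificate{
		Subject: pkix.Name{CommonName: cn, OrganizationalUnit: []string{ou}},
	}
}

func main() {
	node := certWithSubject("node-1", "nodes")
	worker := certWithSubject("worker-provisioning-a", "workers")

	fmt.Println(authorizeNode(node, "node-1"))   // node on a node endpoint
	fmt.Println(authorizeNode(worker, "node-1")) // worker on a node endpoint
	fmt.Println(authorizeWorker(worker))         // worker on a worker endpoint
}
```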


12. Revocation

Three-layer approach. No full CRL/OCSP needed for MVP.

Layer             | Mechanism                                      | Handles
Short-lived certs | 24h TTL                                        | Compromised certs expire quickly
Deny-list         | Redis set node_cert_revoked:{serial}, TTL=24h  | Immediate revocation on compromise
Node status       | nodes.status = 'quarantined'                   | Control plane stops task dispatch; renewal blocked

Quarantine flow (POST /api/v1/admin/nodes/{id}/quarantine):

  1. nodes.status → quarantined.
  2. Cert serial → Redis deny-list (TTL = 24h).
  3. All active allocations on node → force-released (outbox events written).
  4. Audit log written.
  5. On-call alert fired.

The deny-list is checked in the API's mTLS middleware on every connection. A cert with a revoked serial is rejected before any handler runs.
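
A minimal sketch of the deny-list check, using an in-memory map in place of the Redis keys node_cert_revoked:{serial}. The names are hypothetical, the decimal serial encoding is an assumption, and this version runs per HTTP request rather than once per connection.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

const denyTTL = 24 * time.Hour

// denyList is an in-memory stand-in for Redis keys node_cert_revoked:{serial}.
type denyList struct {
	mu      sync.Mutex
	expires map[string]time.Time // serial -> entry expiry
}

func newDenyList() *denyList {
	return &denyList{expires: make(map[string]time.Time)}
}

// Revoke adds a serial for 24h, after which the cert has expired anyway.
func (d *denyList) Revoke(serial string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.expires[serial] = time.Now().Add(denyTTL)
}

// IsRevoked reports whether serial is currently on the deny-list.
func (d *denyList) IsRevoked(serial string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	exp, ok := d.expires[serial]
	if !ok {
		return false
	}
	if time.Now().After(exp) {
		delete(d.expires, serial)
		return false
	}
	return true
}

// rejectRevoked wraps next and rejects requests whose client cert serial
// is on the deny-list, before any handler logic runs.
func rejectRevoked(d *denyList, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.TLS == nil || len(r.TLS.PeerCertificates) == 0 {
			http.Error(w, "client certificate required", http.StatusUnauthorized)
			return
		}
		// Assumption: serials are stored in decimal form.
		serial := r.TLS.PeerCertificates[0].SerialNumber.String()
		if d.IsRevoked(serial) {
			http.Error(w, "certificate revoked", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	d := newDenyList()
	d.Revoke("1234")
	fmt.Println(d.IsRevoked("1234"), d.IsRevoked("5678"))
	_ = rejectRevoked(d, http.NotFoundHandler()) // wiring example only
}
```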


13. Infrastructure Requirements

Component                   | Where                                          | Port | Notes
step-ca                     | Kubernetes (internal namespace)                | 9000 | Reachable by workers and API pods only; not reachable by nodes
Root CA cert                | Distributed via cloud-init + CI artifact store | N/A  | SHA-256 fingerprint in cloud-init (for node); hardcoded in worker binaries
Root CA private key         | Offline HSM / hardware token                   | N/A  | Used only for CA ceremony and Intermediate CA cert renewal
Intermediate CA private key | Cloud KMS                                      | N/A  | Referenced by step-ca via KMS key ID
Task signing private key    | Cloud KMS                                      | N/A  | Accessed only by cmd/api at task issuance time
Redis deny-list             | Existing Redis cluster                         |      | Keys: node_cert_revoked:{serial}

14. CA Ceremony (One-Time Procedure)

Documented here; execution tracked in doc/Phase_Readiness_Tracker.md.

  1. Generate Root CA keypair offline (HSM or air-gapped machine).
  2. Self-sign Root CA cert (10-year lifetime, CA:TRUE, pathLen=1).
  3. Generate Intermediate CA keypair in KMS.
  4. Root CA signs Intermediate CA cert (5-year lifetime, CA:TRUE, pathLen=0).
  5. Configure step-ca with: Intermediate CA cert + KMS key reference.
  6. Record Root CA cert SHA-256 fingerprint in:
       • Agent binary build config.
       • Worker binary build config.
       • Cloud-init template.
       • doc/Phase_Readiness_Tracker.md.
  7. Store Root CA cert in secure artifact store (read-only access for builds).
  8. Root CA private key goes back offline.
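
For readers who want to see how the lifetime and pathLen constraints from steps 2 and 4 map onto certificate fields, here is a software-only sketch using Go's crypto/x509. It is not the ceremony itself: the real keys live in an HSM and in KMS. The ECDSA P-256 key type, the subject names, and the helper names are assumptions that this spec does not fix.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

type ca struct {
	cert *x509.Certificate
	key  *ecdsa.PrivateKey
}

// newCA creates a CA certificate valid for the given number of years with
// the given path length. With parent == nil the cert is self-signed (root).
func newCA(cn string, years, pathLen int, parent *ca) (*ca, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	serial, err := rand.Int(rand.Reader, new(big.Int).Lsh(big.NewInt(1), 127))
	if err != nil {
		return nil, err
	}
	now := time.Now()
	tmpl := &x509.Certificate{
		SerialNumber:          serial,
		Subject:               pkix.Name{CommonName: cn, Organization: []string{"gpuaas"}},
		NotBefore:             now,
		NotAfter:              now.AddDate(years, 0, 0),
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageCRLSign,
		IsCA:                  true, // CA:TRUE
		BasicConstraintsValid: true,
		MaxPathLen:            pathLen,
		MaxPathLenZero:        pathLen == 0, // needed to encode pathLen=0
	}
	issuerCert, issuerKey := tmpl, key // self-signed by default
	if parent != nil {
		issuerCert, issuerKey = parent.cert, parent.key
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, issuerCert, &key.PublicKey, issuerKey)
	if err != nil {
		return nil, err
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		return nil, err
	}
	return &ca{cert: cert, key: key}, nil
}

// demo builds the two-level hierarchy and checks the intermediate's signature.
func demo() (rootPathLen, interPathLen int, chained bool, err error) {
	root, err := newCA("Example Root CA", 10, 1, nil)
	if err != nil {
		return
	}
	inter, err := newCA("Example Intermediate CA", 5, 0, root)
	if err != nil {
		return
	}
	chained = inter.cert.CheckSignatureFrom(root.cert) == nil
	return root.cert.MaxPathLen, inter.cert.MaxPathLen, chained, nil
}

func main() {
	fmt.Println(demo())
}
```

With pathLen=0 on the intermediate, any further CA certificate issued beneath it would fail chain validation, which is what confines step-ca to issuing leaf certs.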

15. Relationship to Other Specs

Spec                                           | Relationship
doc/architecture/Node_Agent_Spec.md            | Consumes this spec: enrollment flow, cert storage, renewal loop, task signing
doc/architecture/Encryption_Envelope_Spec.md   | Orthogonal: governs at-rest secrets (AES-256-GCM); this spec governs in-transit identity
doc/operations/Production_Platform_Baseline.md | Cites this spec for internal mTLS SOP requirement
doc/Implementation_Roadmap.md                  | Pre-Phase Node Agent section gates Phase 7