PKI Spec: Node and Worker Certificate Management
Status:
- MVP canonical (required for node-agent communication and enrollment).
- Applies to internal control-plane/node/worker trust; public ingress TLS remains edge-managed.
1. Purpose and Scope
Defines the certificate authority hierarchy, certificate profiles, enrollment flow, renewal strategy, revocation model, and task-signing scheme for:
- Node agent client certificates (one per GPU node)
- Provisioning worker client certificates (one per worker instance)
- Control plane server certificate (internal API endpoint)
- Task signing keypair (Ed25519, separate from TLS)
Out of scope: public API TLS (managed by edge gateway/CDN); Keycloak TLS; user-facing HTTPS.
2. CA Technology
MVP: Smallstep step-ca
Rationale: battle-tested, supports JWK/X5C provisioners natively, short-lived cert model built-in, renewal via existing cert (X5C), single-use token issuance for enrollment.
step-ca is deployed as an internal Kubernetes service (pki-ca.internal:9000), not
exposed to the public internet. Its private key is stored in the cloud KMS (never on disk).
2.1 Migration Path to Vault PKI
All Go code interacts with a CAClient interface (see §9). To migrate to HashiCorp
Vault PKI:
- Implement VaultCAClient satisfying the same interface.
- Run both implementations in parallel during transition (dual-sign period).
- Re-enroll nodes with Vault-signed certs; retire step-ca.
- No changes required in node agent, worker, or API handler code.
The migration is a backend swap only. The protocol, cert profiles, and enrollment flow are identical.
3. CA Hierarchy
Root CA (offline; stored on HSM or hardware security token)
│   (signs Intermediate CA cert only; kept offline after CA ceremony)
│
└── Intermediate CA (step-ca, online, Kubernetes internal)
    │
    ├── Control Plane Server Cert
    │     CN=api.internal.gpuaas.io
    │     Lifetime: 90 days
    │     Renewal: automated (step-ca ACME or manual rotation)
    │
    ├── Worker Client Certs
    │     CN=worker-provisioning-{instance-id}, O=gpuaas, OU=workers
    │     Lifetime: 24 hours
    │     Renewal: autonomous, T-1h before expiry via X5C provisioner
    │
    └── Node Agent Client Certs
          CN=node-{node_id}, O=gpuaas, OU=nodes
          Lifetime: 24 hours
          Renewal: autonomous, T-1h before expiry via X5C provisioner
          Initial issuance: one-time enrollment token (JWK provisioner);
          expired-cert recovery uses a node-bound recovery token stored by the API
The Root CA never issues leaf certificates. After signing the Intermediate CA cert it goes offline. If the Intermediate CA is compromised, the Root CA issues a new one; all leaf certs are re-enrolled.
4. step-ca Provisioners
| Provisioner | Name | Used for |
|---|---|---|
| JWK | enrollment | Node initial enrollment (one-time tokens, TTL 30 min) and node-bound recovery re-enrollment after local client cert expiry |
| JWK | workers | Worker cert issuance at startup (service account JWT) |
| X5C | renewal | Autonomous cert renewal for both nodes and workers |
The X5C provisioner accepts a valid leaf cert as proof of identity and issues a replacement. This is the renewal mechanism — no separate token needed after enrollment.
5. Certificate Profiles
Node Agent Client Cert
Subject: CN=node-{node_id}, O=gpuaas, OU=nodes
Key type: Ed25519
Lifetime: 24 hours
Key usage: digital signature, key agreement
EKU: client authentication
SANs: none
Worker Client Cert
Subject: CN=worker-provisioning-{instance-id}, O=gpuaas, OU=workers
Key type: Ed25519
Lifetime: 24 hours
Key usage: digital signature, key agreement
EKU: client authentication
SANs: none
Control Plane Server Cert
Subject: CN=api.internal.gpuaas.io, O=gpuaas, OU=control-plane
Key type: ECDSA P-256
Lifetime: 90 days
Key usage: digital signature, key agreement
EKU: server authentication
SANs: api.internal.gpuaas.io, *.internal.gpuaas.io
6. Node Enrollment Flow
Trigger: admin registers a new node via POST /api/v1/admin/nodes, then requests a
bootstrap bundle for that node.
Fundamental constraint: the node agent has exactly one network destination —
api.internal. It never connects to step-ca, Postgres, Redis, Temporal, or NATS
directly. The API proxies all CA interactions internally.
Step 1: Token generation (control plane)
- Provisioning service generates a one-time enrollment token; token encodes node_id, TTL = 30 min, single-use. Stored in Redis: node_enroll:{node_id}, TTL = 30 min.
- API root CA fingerprint retrieved from config (used by node to verify the api.internal server cert on first connection before it has the CA bundle).
- Bootstrap bundle assembled and returned to the admin caller:
#cloud-config
write_files:
  - path: /etc/gpuaas/enrollment.env
    permissions: '0600'
    content: |
      GPUAAS_NODE_ID=node-{id}
      GPUAAS_ENROLLMENT_TOKEN={token}
      GPUAAS_CA_FINGERPRINT={sha256-root-ca-fingerprint}
      GPUAAS_API_URL=https://api.internal.gpuaas.io
GPUAAS_CA_URL is intentionally absent — nodes never talk to step-ca. The API is the
only endpoint a node ever connects to.
Delivery modes:
- manual (MVP default): admin downloads bundle and installs on node.
- maas: bundle injected by provider automation/cloud-init.
No long-lived shared internal token is used for enrollment.
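The fingerprint pinning described in Step 1 can be sketched as follows. This is a minimal illustration, assuming GPUAAS_CA_FINGERPRINT carries the hex SHA-256 digest of the root CA certificate's DER encoding; the function names and the tolerance for upper case and colon separators are assumptions of this sketch, not part of the spec.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"strings"
)

// fingerprint returns the lowercase hex SHA-256 digest of a DER-encoded certificate.
func fingerprint(certDER []byte) string {
	sum := sha256.Sum256(certDER)
	return hex.EncodeToString(sum[:])
}

// matchesPinned reports whether certDER hashes to the pinned fingerprint.
// Accepting upper case and colon separators is an assumption of this sketch.
func matchesPinned(certDER []byte, pinned string) bool {
	want := strings.ToLower(strings.ReplaceAll(pinned, ":", ""))
	got := fingerprint(certDER)
	return subtle.ConstantTimeCompare([]byte(got), []byte(want)) == 1
}

func main() {
	// Placeholder bytes stand in for the root CA certificate's DER encoding.
	rootDER := []byte("placeholder for the root CA certificate")
	pinned := fingerprint(rootDER) // the value the bootstrap bundle would carry

	fmt.Println(matchesPinned(rootDER, pinned))                          // true
	fmt.Println(matchesPinned([]byte("some other certificate"), pinned)) // false
}
```

In a real agent the DER bytes would come from the root certificate in the chain presented during the first TLS handshake; how that certificate reaches the agent is outside this sketch.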
Step 2: Agent startup
Node boots. gpuaas-node-agent.service starts. Agent reads bootstrap env file (default:
/etc/gpuaas/enrollment.env).
If no cert exists at GPUAAS_CERT_PATH, the agent enters enrollment mode.
Step 3: CSR submission to API
1. Agent generates a fresh Ed25519 keypair.
   - Private key written to /etc/gpuaas/agent.key (mode 0600, owner: gpuaas-agent).
   - MVP: filesystem. Production upgrade: TPM 2.0 (see §8).
2. Constructs CSR: CN=node-{node_id}, O=gpuaas, OU=nodes.
3. Calls the API enrollment endpoint (plain TLS; server cert verified against GPUAAS_CA_FINGERPRINT; no client cert yet):
   POST https://api.internal.gpuaas.io/internal/v1/nodes/enroll
   Authorization: Bearer {enrollment_token}
   Body: { "node_id": "node-{id}", "csr": "PEM..." }
4. Inside the API (invisible to the node):
   - Validates the enrollment token from Redis.
   - On success, promotes a valid one-time node_enroll_token:* to durable node_recovery_token:*, then consumes the one-time token.
   - Future expired-cert recovery may use the node-bound recovery token; successful node.self_update can rotate that token.
   - Validates CSR: key type Ed25519, CN matches node_id, no unexpected SANs.
   - Calls pki.CAClient.Enroll(ctx, req) with the CSR; step-ca signs the cert (internal call, pki-ca.internal:9000, never reachable by nodes).
   - Updates nodes.status = 'enrolled' in DB.
   - Writes audit log: action=node_enrolled.
5. API returns the signed cert chain and CA bundle to the node.
6. Agent writes cert to GPUAAS_CERT_PATH; control-plane HTTPS trust remains in GPUAAS_CA_BUNDLE_PATH, and the enrollment-returned CA bundle is stored at GPUAAS_NODE_CERT_CA_BUNDLE_PATH.
7. Deletes /etc/gpuaas/enrollment.env (secrets no longer needed on disk).
The canonical lifecycle for control-plane HTTPS bootstrap trust delivery and update is:
- doc/architecture/Node_Bootstrap_Trust_Delivery_v1.md
Idempotency: if the network fails after the API signs the cert but before the node receives the response, the node retries with the same token. The API caches the issued cert in Redis (TTL = 30 min, keyed by token) and returns it on retry without re-signing.
After step 7 the node enters the task polling loop. No separate confirmation step is needed — the API records enrollment as part of step 4 above.
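The CSR checks the API applies during enrollment (key type Ed25519, CN matches the node, no SANs) might look like the sketch below. The function names are hypothetical, and the self-signature check is an extra step this sketch adds as proof of key possession; the spec does not list it.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
)

// validateNodeCSR rejects CSRs that do not fit the node agent profile.
func validateNodeCSR(csrPEM []byte, expectedCN string) error {
	block, _ := pem.Decode(csrPEM)
	if block == nil {
		return errors.New("csr: no PEM block found")
	}
	csr, err := x509.ParseCertificateRequest(block.Bytes)
	if err != nil {
		return fmt.Errorf("csr: parse: %w", err)
	}
	// Extra check (not listed in the spec): the CSR must be signed by its own key.
	if err := csr.CheckSignature(); err != nil {
		return fmt.Errorf("csr: bad self-signature: %w", err)
	}
	if _, ok := csr.PublicKey.(ed25519.PublicKey); !ok {
		return errors.New("csr: key type must be Ed25519")
	}
	if csr.Subject.CommonName != expectedCN {
		return fmt.Errorf("csr: CN %q does not match %q", csr.Subject.CommonName, expectedCN)
	}
	if len(csr.DNSNames)+len(csr.IPAddresses)+len(csr.EmailAddresses)+len(csr.URIs) > 0 {
		return errors.New("csr: SANs are not allowed")
	}
	return nil
}

// demoCSR builds a throwaway CSR for the examples; dnsNames adds SANs.
func demoCSR(cn string, dnsNames ...string) []byte {
	_, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.CertificateRequest{
		Subject: pkix.Name{
			CommonName:         cn,
			Organization:       []string{"gpuaas"},
			OrganizationalUnit: []string{"nodes"},
		},
		DNSNames: dnsNames,
	}
	der, err := x509.CreateCertificateRequest(rand.Reader, tmpl, priv)
	if err != nil {
		panic(err)
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE REQUEST", Bytes: der})
}

func main() {
	fmt.Println(validateNodeCSR(demoCSR("node-abc123"), "node-abc123"))
	fmt.Println(validateNodeCSR(demoCSR("node-abc123"), "node-other"))
	fmt.Println(validateNodeCSR(demoCSR("node-abc123", "evil.example"), "node-abc123"))
}
```

Running these checks before the CA call means a malformed or mismatched CSR never reaches step-ca.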
7. Certificate Renewal
Node and worker client certs have a 24-hour lifetime (the control plane server cert has a 90-day lifetime; see §5). Client cert renewal is autonomous.
Node agent renewal
- Agent checks cert expiry every 15 minutes.
- At T-1h: generates new Ed25519 keypair + CSR.
- Calls the API renewal endpoint (mTLS, presenting current valid cert as proof):
  POST https://api.internal.gpuaas.io/internal/v1/nodes/{id}/cert/renew
  mTLS client cert: current agent.crt
  Body: { "csr": "PEM new CSR..." }
- Inside the API (invisible to the node):
  - Validates mTLS cert (OU=nodes, CN matches node_id, serial not in deny-list).
  - Calls pki.CAClient.Renew(ctx, currentCert, newCSR); the step-ca X5C provisioner signs the replacement cert (pki-ca.internal:9000, never reachable by nodes).
  - Records new cert serial in DB.
- API returns { "certificate": "PEM new cert chain" }.
- Agent atomically swaps: writes new cert + key to temp paths, then renames. Long-lived connections continue using the old cert; new connections use the new cert.
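Two pieces of the agent-side renewal loop can be sketched in a few lines: the T-1h check and the write-then-rename swap. The helper names (renewalDue, atomicWrite, dueIn, demoSwap) are hypothetical, and the sketch assumes the temp file and the target live on the same filesystem so that the rename is atomic.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

const renewBefore = time.Hour // renew at T-1h

// renewalDue reports whether a cert expiring at notAfter should be renewed at now.
func renewalDue(notAfter, now time.Time) bool {
	return !now.Before(notAfter.Add(-renewBefore))
}

// dueIn is a demo helper: is renewal due for a cert with this much lifetime left?
func dueIn(remaining time.Duration) bool {
	now := time.Now()
	return renewalDue(now.Add(remaining), now)
}

// atomicWrite writes data to a temp file next to path, then renames it into place.
func atomicWrite(path string, data []byte, mode os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly once the rename has succeeded
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Chmod(mode); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// demoSwap replaces a cert file in a scratch directory and reads it back.
func demoSwap() (string, error) {
	dir, err := os.MkdirTemp("", "agent-cert-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "agent.crt")
	if err := atomicWrite(path, []byte("old cert"), 0o600); err != nil {
		return "", err
	}
	if err := atomicWrite(path, []byte("new cert"), 0o600); err != nil {
		return "", err
	}
	got, err := os.ReadFile(path)
	return string(got), err
}

func main() {
	fmt.Println(dueIn(90 * time.Minute)) // false: more than one hour left
	fmt.Println(dueIn(30 * time.Minute)) // true: inside the T-1h window
	fmt.Println(demoSwap())
}
```

Each rename is atomic for its own file only. Swapping the cert and the key are two separate renames, so a reader could briefly observe a new key alongside the old cert; how the agent orders the two is left open here.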
Renewal failure handling
- Retry with exponential backoff: 5 min, 10 min, 20 min, 40 min, 60 min.
- If cert expires before renewal succeeds:
  - Node goes offline: nodes.status = 'cert_expired'.
  - On-call alert fired.
  - Recovery: re-enrollment (recorded in audit log as a security event).
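The retry schedule above is a doubling delay capped at 60 minutes. A minimal sketch follows; the function and constant names are hypothetical, and holding at 60 minutes for any attempt after the fifth is an assumption, since the spec lists only five delays.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay = 5 * time.Minute
	maxDelay  = 60 * time.Minute
)

// retryDelay returns the wait before retry number attempt (0-based):
// 5, 10, 20, 40 minutes, then capped at 60.
func retryDelay(attempt int) time.Duration {
	d := baseDelay
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Printf("attempt %d: wait %v\n", attempt, retryDelay(attempt))
	}
}
```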
Worker renewal
Workers (Kubernetes pods, internal network) call step-ca's X5C provisioner directly —
they are inside the cluster and can reach pki-ca.internal:9000. Workers do not go
through the API for cert renewal. The distinction:
- Nodes (external GPU machines) → api.internal only
- Workers (internal Kubernetes pods) → step-ca directly (same cluster network)
8. Node Private Key Storage
MVP — Filesystem
- Path: /etc/gpuaas/agent.key
- Permissions: 0600, owner: gpuaas-agent (unprivileged system user)
- Key never transmitted; only the public key (in the cert) is externally known.
Production upgrade — TPM 2.0
The CAClient interface (§9) accepts a crypto.Signer. The TPM upgrade is a backend
swap only:
- Generate key inside TPM (never extractable).
- Pass TPM-backed signer to the CA client; enrollment and renewal flows are identical.
- Requires go-tpm or tpm2-tools on the node image.
When GPUAAS_USE_TPM=true is set, the agent detects TPM at startup and uses it
preferentially. No changes to the PKI spec or protocol.
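To illustrate why a crypto.Signer makes the key backend swappable, the sketch below builds a CSR from any signer. The names newClientCSR, newFileSigner, and csrSubject are hypothetical; the in-memory Ed25519 key stands in for the MVP filesystem key, and a TPM-backed signer is assumed to plug in the same way without being shown.

```go
package main

import (
	"crypto"
	"crypto/ed25519"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
)

// newClientCSR builds a PEM CSR with subject CN=commonName, O=gpuaas, OU=ou.
// It only needs a crypto.Signer, so it never touches private key bytes itself.
func newClientCSR(signer crypto.Signer, commonName, ou string) ([]byte, error) {
	tmpl := &x509.CertificateRequest{
		Subject: pkix.Name{
			CommonName:         commonName,
			Organization:       []string{"gpuaas"},
			OrganizationalUnit: []string{ou},
		},
	}
	der, err := x509.CreateCertificateRequest(rand.Reader, tmpl, signer)
	if err != nil {
		return nil, fmt.Errorf("create CSR: %w", err)
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE REQUEST", Bytes: der}), nil
}

// newFileSigner stands in for the MVP key; here it is generated in memory.
func newFileSigner() (crypto.Signer, error) {
	_, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return nil, err
	}
	return priv, nil // ed25519.PrivateKey implements crypto.Signer
}

// csrSubject parses a PEM CSR, verifies its signature, and returns CN and OU.
func csrSubject(csrPEM []byte) (cn, ou string, err error) {
	block, _ := pem.Decode(csrPEM)
	if block == nil {
		return "", "", errors.New("no PEM block found")
	}
	csr, err := x509.ParseCertificateRequest(block.Bytes)
	if err != nil {
		return "", "", err
	}
	if err := csr.CheckSignature(); err != nil {
		return "", "", err
	}
	if len(csr.Subject.OrganizationalUnit) != 1 {
		return "", "", errors.New("expected exactly one OU")
	}
	return csr.Subject.CommonName, csr.Subject.OrganizationalUnit[0], nil
}

func main() {
	signer, err := newFileSigner()
	if err != nil {
		panic(err)
	}
	csrPEM, err := newClientCSR(signer, "node-abc123", "nodes")
	if err != nil {
		panic(err)
	}
	fmt.Println(csrSubject(csrPEM))
}
```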
9. Go CA Client Interface
All code that issues, renews, or validates certs goes through this interface. Changing the CA backend requires only a new implementation — no call-site changes.
// packages/shared/pki/client.go
package pki

import (
	"context"
	"crypto/tls"
)

// CAClient issues and renews certificates.
// MVP: StepCAClient. Migration: VaultCAClient (same interface).
type CAClient interface {
	// Enroll submits a CSR with a one-time enrollment token.
	// Returns PEM-encoded cert chain and CA bundle.
	Enroll(ctx context.Context, req EnrollRequest) (*EnrollResponse, error)
	// Renew exchanges an existing valid cert for a new one (X5C / equivalent).
	Renew(ctx context.Context, currentCert tls.Certificate, newCSR []byte) (*EnrollResponse, error)
	// RootFingerprint returns the SHA-256 fingerprint of the root CA cert.
	RootFingerprint() string
}

// EnrollRequest carries the CSR and the token that authorizes it.
type EnrollRequest struct {
	CSR             []byte // PEM
	EnrollmentToken string
}

// EnrollResponse is returned by both Enroll and Renew.
type EnrollResponse struct {
	CertChainPEM []byte
	CABundlePEM  []byte
}
StepCAClient calls step-ca's /1.0/sign endpoint.
VaultCAClient calls Vault's /v1/pki/sign/{role} endpoint.
Both satisfy CAClient.
10. Task Signing Keypair
Separate from TLS. Guarantees that even if mTLS is bypassed, unsigned or badly-signed task payloads cannot trigger node-side actions.
- Algorithm: Ed25519
- Control plane holds: private key (KMS/Vault custody target; never in node config)
- Node agent holds: public verifier material only
The canonical lifecycle for signer versioning, verifier rollout, grace, and rollback is:
- doc/architecture/Node_Task_Signing_Lifecycle_v1.md
This PKI spec only establishes the cryptographic separation between:
1. certificate identity and transport trust, and
2. typed task authenticity.
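The separation between transport trust and task authenticity can be illustrated with a small Ed25519 sign-and-verify sketch. The signedTask type, the function names, and the payload are hypothetical; the real envelope and key versioning are governed by the lifecycle document referenced above, not by this example.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"errors"
	"fmt"
)

// signedTask is a hypothetical envelope: raw payload plus detached signature.
type signedTask struct {
	Payload   []byte
	Signature []byte
}

// signTask runs on the control plane, which holds the private key.
func signTask(priv ed25519.PrivateKey, payload []byte) signedTask {
	return signedTask{Payload: payload, Signature: ed25519.Sign(priv, payload)}
}

// verifyTask runs on the node, which holds only the public key.
func verifyTask(pub ed25519.PublicKey, t signedTask) error {
	if len(pub) != ed25519.PublicKeySize {
		return errors.New("verifier key has wrong size") // ed25519.Verify would panic
	}
	if !ed25519.Verify(pub, t.Payload, t.Signature) {
		return errors.New("task signature invalid")
	}
	return nil
}

// demoKeys generates a throwaway keypair for the example.
func demoKeys() (ed25519.PublicKey, ed25519.PrivateKey) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	return pub, priv
}

func main() {
	pub, priv := demoKeys()
	task := signTask(priv, []byte("demo-task"))
	fmt.Println(verifyTask(pub, task)) // <nil>

	tampered := signedTask{Payload: []byte("demo-task!"), Signature: task.Signature}
	fmt.Println(verifyTask(pub, tampered)) // task signature invalid
}
```

Because verification needs only the public key, a node that is fully compromised still cannot mint tasks that other nodes would accept.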
11. mTLS Trust Model
Node agent → Control Plane (API)
- Node agent is mTLS client.
- Server presents: control plane server cert (signed by our CA).
- Client presents: node agent cert (OU=nodes).
- API verifies: cert chain to CA bundle, OU=nodes, CN matches registered node_id,
serial not in deny-list.
Provisioning worker → Control Plane (API)
- Worker is mTLS client.
- Client presents: worker cert (OU=workers).
- API verifies: cert chain, OU=workers.
Node agents and workers use separate OU values, giving the API fine-grained control over which endpoints each can reach (e.g., workers can queue tasks, nodes can poll them).
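OU-based endpoint control could be sketched as below. The action names (tasks.poll, tasks.queue) and function names are hypothetical labels for illustration, and the sketch assumes the TLS stack has already verified the certificate chain before these checks run.

```go
package main

import (
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
)

// allowedActions maps a cert OU to the actions it may perform.
// The action names are hypothetical labels for this sketch.
var allowedActions = map[string]map[string]bool{
	"nodes":   {"tasks.poll": true},
	"workers": {"tasks.queue": true},
}

// authorize checks an already chain-verified client cert against an action.
func authorize(cert *x509.Certificate, action string) error {
	ous := cert.Subject.OrganizationalUnit
	if len(ous) != 1 {
		return fmt.Errorf("expected exactly one OU, got %d", len(ous))
	}
	if !allowedActions[ous[0]][action] {
		return fmt.Errorf("OU %q may not perform %q", ous[0], action)
	}
	return nil
}

// authorizeNode additionally pins the cert to one registered node CN.
func authorizeNode(cert *x509.Certificate, registeredCN, action string) error {
	ous := cert.Subject.OrganizationalUnit
	if len(ous) != 1 || ous[0] != "nodes" {
		return fmt.Errorf("certificate %q is not a node certificate", cert.Subject.CommonName)
	}
	if cert.Subject.CommonName != registeredCN {
		return fmt.Errorf("CN %q does not match registered node %q", cert.Subject.CommonName, registeredCN)
	}
	return authorize(cert, action)
}

// demoCert builds a certificate value that carries only a subject.
func demoCert(cn, ou string) *x509.Certificate {
	return &x509.Certificate{Subject: pkix.Name{
		CommonName:         cn,
		Organization:       []string{"gpuaas"},
		OrganizationalUnit: []string{ou},
	}}
}

func main() {
	node := demoCert("node-abc123", "nodes")
	worker := demoCert("worker-provisioning-1", "workers")

	fmt.Println(authorizeNode(node, "node-abc123", "tasks.poll")) // <nil>
	fmt.Println(authorize(worker, "tasks.queue"))                 // <nil>
	fmt.Println(authorize(node, "tasks.queue"))                   // rejected
}
```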
12. Revocation
Three-layer approach. No full CRL/OCSP needed for MVP.
| Layer | Mechanism | Handles |
|---|---|---|
| Short-lived certs | 24h TTL | Compromised certs expire quickly |
| Deny-list | Redis set node_cert_revoked:{serial}, TTL=24h | Immediate revocation on compromise |
| Node status | nodes.status = 'quarantined' | Control plane stops task dispatch; renewal blocked |
Quarantine flow (POST /api/v1/admin/nodes/{id}/quarantine):
- nodes.status → quarantined.
- Cert serial → Redis deny-list (TTL = 24h).
- All active allocations on node → force-released (outbox events written).
- Audit log written.
- On-call alert fired.
The deny-list is checked in the API's mTLS middleware on every connection. A cert with a revoked serial is rejected before any handler runs.
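The deny-list semantics can be sketched with an in-memory stand-in for the Redis keys. The type and method names are hypothetical and the skew field exists only so the example can simulate time passing. The 24h entry TTL matches the client cert lifetime: once an entry expires, the cert it blocked has expired too.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const denyTTL = 24 * time.Hour // matches the 24h client cert lifetime

// denyList is an in-memory stand-in for the node_cert_revoked:{serial} keys.
type denyList struct {
	mu      sync.Mutex
	expires map[string]time.Time // cert serial -> entry expiry
	skew    time.Duration        // demo hook: shifts the clock forward
}

func newDenyList() *denyList {
	return &denyList{expires: make(map[string]time.Time)}
}

// now must be called with mu held.
func (d *denyList) now() time.Time { return time.Now().Add(d.skew) }

// Revoke lists a serial until ttl elapses.
func (d *denyList) Revoke(serial string, ttl time.Duration) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.expires[serial] = d.now().Add(ttl)
}

// IsRevoked is the question the mTLS middleware asks before any handler runs.
func (d *denyList) IsRevoked(serial string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	exp, ok := d.expires[serial]
	if ok && !d.now().Before(exp) {
		delete(d.expires, serial) // lazily drop expired entries
		ok = false
	}
	return ok
}

// advance moves the demo clock forward.
func (d *denyList) advance(by time.Duration) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.skew += by
}

func main() {
	d := newDenyList()
	d.Revoke("1a2b3c", denyTTL)

	fmt.Println(d.IsRevoked("1a2b3c")) // true: rejected before any handler runs
	fmt.Println(d.IsRevoked("ffffff")) // false: never revoked

	d.advance(denyTTL)
	fmt.Println(d.IsRevoked("1a2b3c")) // false: entry expired with the cert
}
```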
13. Infrastructure Requirements
| Component | Where | Port | Notes |
|---|---|---|---|
| step-ca | Kubernetes (internal namespace) | 9000 | Reachable by workers and API pods only; not reachable by nodes |
| Root CA cert | Distributed via cloud-init + CI artifact store | N/A | SHA-256 fingerprint in cloud-init (for node); hardcoded in worker binaries |
| Root CA private key | Offline HSM / hardware token | N/A | Used only for CA ceremony and Intermediate CA cert renewal |
| Intermediate CA private key | Cloud KMS | N/A | Referenced by step-ca via KMS key ID |
| Task signing private key | Cloud KMS | N/A | Accessed only by cmd/api at task issuance time |
| Redis deny-list | Existing Redis cluster | — | Keys: node_cert_revoked:{serial} |
14. CA Ceremony (One-Time Procedure)
Documented here; execution tracked in doc/Phase_Readiness_Tracker.md.
- Generate Root CA keypair offline (HSM or air-gapped machine).
- Self-sign Root CA cert (10-year lifetime, CA:TRUE, pathLen=1).
- Generate Intermediate CA keypair in KMS.
- Root CA signs Intermediate CA cert (5-year lifetime, CA:TRUE, pathLen=0).
- Configure step-ca with: Intermediate CA cert + KMS key reference.
- Record Root CA cert SHA-256 fingerprint in:
  - Agent binary build config.
  - Worker binary build config.
  - Cloud-init template.
  - doc/Phase_Readiness_Tracker.md.
- Store Root CA cert in secure artifact store (read-only access for builds).
- Root CA private key goes back offline.
15. Relationship to Other Specs
| Spec | Relationship |
|---|---|
| doc/architecture/Node_Agent_Spec.md | Consumes this spec: enrollment flow, cert storage, renewal loop, task signing |
| doc/architecture/Encryption_Envelope_Spec.md | Orthogonal: governs at-rest secrets (AES-256-GCM); this spec governs in-transit identity |
| doc/operations/Production_Platform_Baseline.md | Cites this spec for internal mTLS SOP requirement |
| doc/Implementation_Roadmap.md | Pre-Phase Node Agent section gates Phase 7 |