Coding Standards (Production + Agent Compatible)¶

General¶

Strong typing required.
Lint/format/static analysis mandatory.
Small, cohesive modules with single responsibility.
Follow the project’s evidence-first execution model in Evidence_First_Change_Protocol.md.

Evidence-First Execution (Required)¶

Establish a relevant baseline before changing behavior.
Prefer the smallest verifiable unit of change.
Predict the expected outcome before running verification.
Re-run the same scoped checks after the change and compare results.
Do not mark work complete without direct proof of the intended behavior change.
Do not treat “compiles” or “looks right” as sufficient evidence.

API and Domain¶

Contract-first changes only.
Explicit request/response schemas.
Standard error envelope with correlation ID.
All mutations must be idempotent. The only exceptions are security session issuance/revocation operations that are explicitly documented as non-idempotent (for example, single-use terminal token minting).
Use generated OpenAPI types (packages/shared/gen/openapigen) only at HTTP boundaries (request decode/response encode); use hand-written domain models for internal service logic, with explicit mapping between boundary and domain types.
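The boundary/domain split above can be sketched as follows. This is a minimal illustration, not the project's actual types: CreateAllocationRequestDTO stands in for a generated OpenAPI type, and AllocationSpec, ErrInvalidSpec, and toDomain are hypothetical names chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// CreateAllocationRequestDTO stands in for a generated OpenAPI type.
// Hypothetical: the real types live in packages/shared/gen/openapigen.
type CreateAllocationRequestDTO struct {
	GPUCount *int    `json:"gpu_count"`
	Region   *string `json:"region"`
}

// AllocationSpec is a hand-written domain model (hypothetical fields).
// Service logic only ever sees this type, never the DTO.
type AllocationSpec struct {
	GPUCount int
	Region   string
}

// ErrInvalidSpec is a hypothetical domain sentinel for rejected input.
var ErrInvalidSpec = errors.New("invalid allocation spec")

// toDomain is the explicit mapping step at the HTTP boundary. Optional
// pointer fields from the wire format are validated and flattened here.
func toDomain(dto CreateAllocationRequestDTO) (AllocationSpec, error) {
	if dto.GPUCount == nil || *dto.GPUCount <= 0 {
		return AllocationSpec{}, fmt.Errorf("%w: gpu_count must be positive", ErrInvalidSpec)
	}
	if dto.Region == nil || *dto.Region == "" {
		return AllocationSpec{}, fmt.Errorf("%w: region is required", ErrInvalidSpec)
	}
	return AllocationSpec{GPUCount: *dto.GPUCount, Region: *dto.Region}, nil
}

func main() {
	n, region := 2, "us-east"
	spec, err := toDomain(CreateAllocationRequestDTO{GPUCount: &n, Region: &region})
	fmt.Println(spec, err)

	_, err = toDomain(CreateAllocationRequestDTO{})
	fmt.Println(err)
}
```

Keeping the mapping in one function means a regenerated OpenAPI type changes only the boundary code, not the service layer.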

Security¶

No secrets in code.
Validate all untrusted input.
Enforce authN/authZ server-side only.
Emit audit events for privileged actions.
Node provisioning/release orchestration must use node-agent typed tasks over internal mTLS.
Direct control-plane SSH provisioning is forbidden in MVP runtime paths.
Provisioning lifecycle transitions (requested → provisioning → active → releasing → released/release_failed) must be driven by Temporal workflows/events only. Direct bypass state writes are forbidden outside workflow-controlled paths.
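A transition table makes the lifecycle rule checkable inside workflow-controlled code. The sketch below encodes only the transitions listed above; the State type, the constant names, and CanTransition are assumptions for illustration, and the Temporal wiring is not shown.

```go
package main

import "fmt"

// State is a hypothetical type for the provisioning lifecycle.
type State string

const (
	StateRequested     State = "requested"
	StateProvisioning  State = "provisioning"
	StateActive        State = "active"
	StateReleasing     State = "releasing"
	StateReleased      State = "released"
	StateReleaseFailed State = "release_failed"
)

// allowedTransitions encodes only the path named in this standard.
// Any additional transitions would have to be added deliberately.
var allowedTransitions = map[State][]State{
	StateRequested:    {StateProvisioning},
	StateProvisioning: {StateActive},
	StateActive:       {StateReleasing},
	StateReleasing:    {StateReleased, StateReleaseFailed},
}

// CanTransition reports whether moving from one state to another is listed
// in the table. Terminal states have no entry, so they allow nothing.
func CanTransition(from, to State) bool {
	for _, next := range allowedTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(StateRequested, StateProvisioning)) // true
	fmt.Println(CanTransition(StateRequested, StateActive))       // false: skips provisioning
	fmt.Println(CanTransition(StateReleased, StateActive))        // false: released is terminal
}
```

A check like this belongs in the workflow activity that writes the state, so a direct write that skips a step fails fast instead of corrupting the lifecycle.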

Data Integrity¶

Immutable ledger for money operations.
No direct mutable-balance source of truth.
Transactions for cross-entity critical updates.
No hardcoded runtime business-policy constants in production code paths.
Policy/business values must come from config/DB tables with bounds validation and audited change history.
Test-only constants are allowed in test/fixture code, never in runtime services.
Operator/admin verification should use API and read-model surfaces by default, not direct SQL.
Temporary direct DB inspection is allowed only while the owning operator/debug surface is missing; if the same query is needed repeatedly, add the corresponding GET/read-model API and treat the missing surface as a product gap.
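The first two rules in this section (immutable ledger, no mutable-balance source of truth) can be illustrated with an append-only structure whose balance is always derived. This is an in-memory sketch under assumed names (Entry, Ledger, ErrZeroAmount); the real ledger would be a database table with the same append-only discipline.

```go
package main

import (
	"errors"
	"fmt"
)

// Entry is one immutable ledger row (hypothetical shape).
type Entry struct {
	AccountID   string
	AmountCents int64 // positive = credit, negative = debit
	Reason      string
}

// Ledger is append-only: there is no update or delete method, and no
// stored balance field that could drift from the entries.
type Ledger struct {
	entries []Entry
}

// ErrZeroAmount rejects entries that would not change any balance.
var ErrZeroAmount = errors.New("ledger entry amount must be non-zero")

// Append records a new entry. Corrections are made by appending a
// compensating entry, never by editing an existing one.
func (l *Ledger) Append(e Entry) error {
	if e.AmountCents == 0 {
		return ErrZeroAmount
	}
	l.entries = append(l.entries, e)
	return nil
}

// Balance derives the current balance by summing the account's entries.
func (l *Ledger) Balance(accountID string) int64 {
	var total int64
	for _, e := range l.entries {
		if e.AccountID == accountID {
			total += e.AmountCents
		}
	}
	return total
}

func main() {
	var l Ledger
	for _, e := range []Entry{
		{AccountID: "acct-1", AmountCents: 1000, Reason: "top-up"},
		{AccountID: "acct-1", AmountCents: -250, Reason: "usage charge"},
		{AccountID: "acct-1", AmountCents: 250, Reason: "refund of usage charge"},
	} {
		if err := l.Append(e); err != nil {
			panic(err)
		}
	}
	fmt.Println(l.Balance("acct-1")) // 1000
}
```

Because the refund is a new entry rather than an edit, the full history of the charge and its reversal stays auditable.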

Root-Cause-First Remediation (Required)¶

Do not ship symptom-only fixes to unblock tests.
Every bug fix must identify and patch the owning layer/root cause (contract, schema, service, worker, runtime, or UI boundary), not only downstream fallout.
Temporary fallbacks are allowed only when:
explicitly feature-flagged,
time-boxed with a queue/backlog task,
documented with risk and removal criteria.
If root cause is outside current task scope, mark the task blocked and create the upstream fix task; do not mark done with a local workaround only.

SQL Parameter Typing (Required)¶

In SQL using jsonb_build_object(...) or other polymorphic Postgres functions, explicitly cast bind parameters ($n::text, $n::int, $n::uuid) instead of relying on type inference.
Any new handler/worker SQL touching audit or metadata JSON must include a test path that executes the query against Postgres (a unit test with a pgx mock is not sufficient for this class of failure).
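As an illustration of the casting rule, the sketch below shows a query with every bind parameter cast explicitly, plus a small helper that flags uncast parameters. The query text and the UncastParams helper are hypothetical, not an existing project tool, and a string check like this does not replace the required test that runs the query against Postgres.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// auditMetadataInsert is a hypothetical query. Every bind parameter carries
// an explicit cast, so Postgres never has to infer a type for the
// polymorphic jsonb_build_object arguments.
const auditMetadataInsert = `
INSERT INTO audit_logs (action, target_id, metadata)
VALUES ($1::text, $2::uuid,
        jsonb_build_object('reason', $3::text, 'status_to', $4::text))`

// bindParam matches positional parameters such as $1 or $12.
var bindParam = regexp.MustCompile(`\$\d+`)

// UncastParams returns the bind parameters in query that are not
// immediately followed by a "::" cast.
func UncastParams(query string) []string {
	var out []string
	for _, loc := range bindParam.FindAllStringIndex(query, -1) {
		if !strings.HasPrefix(query[loc[1]:], "::") {
			out = append(out, query[loc[0]:loc[1]])
		}
	}
	return out
}

func main() {
	fmt.Println(len(UncastParams(auditMetadataInsert))) // 0

	bad := `SELECT jsonb_build_object('reason', $1, 'node_id', $2::uuid)`
	fmt.Println(UncastParams(bad)) // [$1]
}
```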

5xx Classification (Required)¶

Any new/changed 5xx response path must be classified in code review as one of:
upstream dependency failure (upstream_error / service_unavailable)
local contract/schema/query/runtime bug (internal_error)
Do not re-label local defects as upstream issues. Fix the owning layer and add a regression test.
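One way to keep the classification honest is to default every unclassified error to internal_error and require an explicit wrap at the dependency call site before anything is reported as upstream. The sketch below assumes hypothetical names (ErrUpstream, WrapUpstream, Classify5xx); mapping upstream failures to 502 is also an assumption, since the rule above names only the error codes.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// ErrUpstream marks failures that originate in a dependency.
// Hypothetical sentinel for this sketch.
var ErrUpstream = errors.New("upstream dependency failure")

// errLocalBug simulates a local defect, for example a query scan mismatch.
var errLocalBug = errors.New("scan: column type mismatch")

// WrapUpstream is called only where a dependency call actually failed.
func WrapUpstream(err error) error {
	return fmt.Errorf("%w: %v", ErrUpstream, err)
}

// Classify5xx returns the HTTP status and error code for a 5xx path.
// Anything not explicitly wrapped as upstream is treated as a local bug.
func Classify5xx(err error) (int, string) {
	if errors.Is(err, ErrUpstream) {
		return http.StatusBadGateway, "upstream_error"
	}
	return http.StatusInternalServerError, "internal_error"
}

func main() {
	status, code := Classify5xx(WrapUpstream(errors.New("billing provider timeout")))
	fmt.Println(status, code) // 502 upstream_error

	status, code = Classify5xx(errLocalBug)
	fmt.Println(status, code) // 500 internal_error
}
```

With this default, a local defect can only be mislabeled as upstream if someone wraps it on purpose, which is easy to catch in code review.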

Observability¶

Structured logging everywhere.
Trace context propagation mandatory.
Service-level metrics for critical flows.

Traceability-First Implementation Rules (Required)¶

Every runtime binary under cmd/ (except explicitly documented edge agents) must initialize OTel via middleware.SetupOTel(...).
Every HTTP server binary must wrap routers with tracing middleware:
middleware.Tracing("<service-name>")(middleware.CorrelationID(...))
Async consumers (NATS/workers/relays) must create a processing span per message and include:
correlation_id
event.type
event.id
messaging destination/subject
Mutation handlers must create child spans for high-value steps:
project/tenant scope resolution
domain service/orchestrator call
audit/outbox write boundary
Error paths must set span error status and error_code (catalog-aligned) whenever known.
Any new service added to local observability compose must have OTEL_EXPORTER_OTLP_ENDPOINT wired.

Enforcement:
CI gate script: scripts/ci/observability_trace_gate.sh
Make target: make ops-observability-trace-gate

Log and Trace Sanitization¶

Sensitive and PII fields must be redacted before they reach any log sink or trace backend. This applies equally to structured logs, OTel trace attributes, and span events.

Sanitize First rule: all internal services must pass requests through a sanitization layer before logging or creating trace spans. This is not optional for production services.

Fields that must never appear in logs or traces in plaintext:
password, password_hash: any credential value
access_token, refresh_token, id_token: any auth token material
ssh_private_key, ssh_private_key_enc: any key material
stripe_customer_id, payment_reference: payment identity fields
User PII: email, username where used as a personal identifier in high-volume paths
Any field from access_secret_enc or scheduler_metadata that may contain credentials

Implementation requirements:
Implement a sanitization middleware/interceptor at the service boundary that scrubs known sensitive field names before the log entry or span is emitted.
Redaction format: replace the value with [REDACTED]. Never omit the field entirely, to preserve log structure for debugging.
Apply the same scrubber to error messages that may echo request payloads.
Audit log metadata jsonb fields must follow an explicit allowlist. Unknown keys are rejected.
Allowed audit_logs.metadata keys (MVP): reason, policy_key, old_value, new_value, status_from, status_to, error_code, request_scope, idempotency_key_hash, provider_ref, allocation_id, node_id.
Forbidden in audit_logs.metadata: raw tokens, raw credentials, SSH private/public key material, full request/response payload dumps, direct payment instrument data, end-user PII fields beyond stable IDs.
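The redaction and allowlist rules above can be sketched with two small functions. This is a hypothetical stand-in, not the project's middleware.Sanitize: Redact and ValidateAuditMetadata are names invented for the example, key matching is exact and case-sensitive, and email, username, and scheduler_metadata are redacted unconditionally as a conservative simplification.

```go
package main

import "fmt"

const redacted = "[REDACTED]"

// sensitiveKeys mirrors the field list in this section.
var sensitiveKeys = map[string]bool{
	"password": true, "password_hash": true,
	"access_token": true, "refresh_token": true, "id_token": true,
	"ssh_private_key": true, "ssh_private_key_enc": true,
	"stripe_customer_id": true, "payment_reference": true,
	"email": true, "username": true,
	"access_secret_enc": true, "scheduler_metadata": true,
}

// Redact returns a copy of in with sensitive values replaced by
// [REDACTED]. Keys are kept so the log structure is preserved. Nested
// maps are handled recursively; slices are not inspected in this sketch.
func Redact(in map[string]any) map[string]any {
	out := make(map[string]any, len(in))
	for k, v := range in {
		if sensitiveKeys[k] {
			out[k] = redacted
			continue
		}
		if nested, ok := v.(map[string]any); ok {
			out[k] = Redact(nested)
			continue
		}
		out[k] = v
	}
	return out
}

// allowedAuditKeys is the MVP allowlist for audit_logs.metadata.
var allowedAuditKeys = map[string]bool{
	"reason": true, "policy_key": true, "old_value": true, "new_value": true,
	"status_from": true, "status_to": true, "error_code": true,
	"request_scope": true, "idempotency_key_hash": true,
	"provider_ref": true, "allocation_id": true, "node_id": true,
}

// ValidateAuditMetadata rejects any key that is not on the allowlist.
func ValidateAuditMetadata(meta map[string]any) error {
	for k := range meta {
		if !allowedAuditKeys[k] {
			return fmt.Errorf("audit metadata key %q is not allowlisted", k)
		}
	}
	return nil
}

func main() {
	body := map[string]any{
		"allocation_id": "a1",
		"password":      "hunter2",
		"auth":          map[string]any{"access_token": "tok"},
	}
	fmt.Println(Redact(body))
	fmt.Println(ValidateAuditMetadata(map[string]any{"reason": "manual release"}))
	fmt.Println(ValidateAuditMetadata(map[string]any{"raw_token": "tok"}))
}
```

Redaction uses a blocklist because log payloads are open-ended, while audit metadata uses an allowlist because its schema is small and known.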

Agent PR Rules¶

Spec updates included when behavior changes.
Tests included for changed behavior.
Migration and rollback notes required when schema changes.

Go Implementation Patterns¶

These patterns are mandatory for all Go code in this repo. Every agent and contributor must follow them so the codebase reads consistently regardless of who wrote a given file.

Import grouping¶

Three groups, blank-line-separated: stdlib → external → internal. goimports enforces this automatically.

import (
    // 1. Standard library
    "context"
    "net/http"

    // 2. External dependencies
    "github.com/google/uuid"
    "github.com/jackc/pgx/v5"

    // 3. Internal packages (always use full module path)
    apierrors "github.com/gpuaas/platform/packages/shared/errors"
    "github.com/gpuaas/platform/packages/shared/middleware"
    "github.com/gpuaas/platform/packages/shared/policy"
)

Service handler struct¶

Every service package exposes a Handler struct that holds injected dependencies. Never use package-level variables for dependencies.

type Handler struct {
    pool   *pgxpool.Pool
    policy policy.Client
    log    *slog.Logger
}

func NewHandler(pool *pgxpool.Pool, pc policy.Client, log *slog.Logger) *Handler {
    return &Handler{pool: pool, policy: pc, log: log}
}

HTTP handler signature¶

All route handlers are methods on the service Handler struct. Extract corrID and claims at the top of every handler.

func (h *Handler) CreateAllocation(w http.ResponseWriter, r *http.Request) {
    ctx := r.Context()
    corrID := middleware.CorrelationIDFromContext(ctx)
    claims := middleware.ClaimsFromContext(ctx) // always non-nil on auth-protected routes

    var req CreateAllocationRequest
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        writeJSON(w, http.StatusBadRequest,
            apierrors.New(apierrors.ErrInvalidRequest, "invalid JSON", corrID))
        return
    }
    // validate → call service layer → respond
}

Service function shape¶

Service functions never know about HTTP. They accept typed inputs and return (result, error). Domain sentinel errors are defined in the package and translated to HTTP status codes only in the handler.

// In the service layer:
var ErrAllocationNotFound = errors.New("allocation not found")

func (s *Service) GetAllocation(ctx context.Context, id uuid.UUID, userID string) (*Allocation, error) {
    // query DB, return (nil, ErrAllocationNotFound) for missing rows
}

// In the handler layer:
alloc, err := h.svc.GetAllocation(ctx, id, claims.UserID)
if errors.Is(err, svc.ErrAllocationNotFound) {
    writeJSON(w, http.StatusNotFound,
        apierrors.New(apierrors.ErrAllocationNotFound, "allocation not found", corrID))
    return
}
if err != nil {
    h.log.ErrorContext(ctx, "get allocation failed", "error", err, "correlation_id", corrID)
    writeJSON(w, http.StatusInternalServerError,
        apierrors.New(apierrors.ErrInternal, "internal error", corrID))
    return
}

Error response helper¶

Every service handler file that writes HTTP responses should include a local writeJSON helper (or import one from a shared internal package):

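func writeJSON(w http.ResponseWriter, status int, v any) {
    b, err := json.Marshal(v)
    if err != nil {
        // A marshal failure is a local bug: fail closed rather than
        // sending an empty body with the requested status.
        w.WriteHeader(http.StatusInternalServerError)
        return
    }
    w.Header().Set("Content-Type", "application/json")
    w.WriteHeader(status)
    _, _ = w.Write(b)
}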

Structured logging¶

Always use slog (Go stdlib). Always include correlation_id. Never log fields from the PII blocklist — pass data through middleware.Sanitize first.

// Correct:
slog.InfoContext(ctx, "allocation created",
    "allocation_id", alloc.ID,
    "user_id", claims.UserID,
    "correlation_id", corrID,
)

// Correct — log sanitised request body at DEBUG:
slog.DebugContext(ctx, "incoming request",
    slog.Any("body", middleware.Sanitize(bodyMap)))

// Wrong — raw request struct may contain tokens or keys:
slog.InfoContext(ctx, "request", "body", req)

Build metadata standard¶

All Go binaries under cmd/ must log build identity at startup and shutdown using shared packages/shared/buildinfo fields:

version
commit
built_at

Do not define ad-hoc per-binary version variables in main packages. Use centralized ldflags stamping via Makefile build targets (for example make build-go-binaries / make build-node-agent) so logs remain consistent across API, workers, gateways, and agent binaries.

DB transaction + outbox¶

Any mutation that changes domain state AND should emit an event must write both in the same transaction. Never call events.PublishTyped directly from a handler or service function — write to outbox_events instead.

tx, err := h.pool.BeginTx(ctx, pgx.TxOptions{})
if err != nil {
    return fmt.Errorf("begin tx: %w", err)
}
defer tx.Rollback(ctx) // safe after Commit: returns pgx.ErrTxClosed, which is ignored here

// 1. Domain mutation
_, err = tx.Exec(ctx, `UPDATE allocations SET status = 'releasing' WHERE id = $1`, id)
if err != nil {
    return err
}

// 2. Outbox event (same transaction)
payload, err := json.Marshal(events.ReleasingRequestedPayload{AllocationID: id.String(), ...})
if err != nil {
    return fmt.Errorf("marshal outbox payload: %w", err)
}
_, err = tx.Exec(ctx, `
    INSERT INTO outbox_events (aggregate_type, aggregate_id, event_type, payload, correlation_id)
    VALUES ($1, $2, $3, $4, $5)
`, "allocation", id, events.SubjectProvisioningReleasingRequested, payload, corrID)
if err != nil {
    return err
}

return tx.Commit(ctx)

Policy values¶

Never hardcode business constants. Always read from policy.Client. Provide a safe in-code fallback only when the policy key is optional or the fallback is explicitly documented.

// Correct:
limit, err := h.policy.GetInt(ctx, policy.KeyAllocationMaxConcurrentPerUser, policy.WithOrgScope(claims.OrgID))
if err != nil {
    return 0, fmt.Errorf("policy lookup: %w", err)
}

// Wrong — hardcoded constant in production path:
const maxAllocations = 5

Audit log¶

Every privileged mutation (provision, release, force-release, refund, admin node ops, admin user ops) must insert an audit_logs row. Write it inside the same DB transaction as the mutation.

_, err = tx.Exec(ctx, `
    INSERT INTO audit_logs
        (actor_user_id, actor_role, action, target_type, target_id, result, correlation_id)
    VALUES ($1, $2, $3, $4, $5, $6, $7)
`, claims.UserID, role, "allocation.release", "allocation", id, "success", corrID)

Context propagation¶

Always thread context.Context as the first argument. Never store a context in a struct field. Never use context.Background() inside a handler — always propagate from r.Context().
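The reason for propagating the request context is that cancellation and deadlines then reach every downstream call. The sketch below uses hypothetical functions (loadAllocation, simulateRequest, timedOut); context.Background() appears only because the example has no real request, whereas a handler would derive from r.Context().

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// loadAllocation stands in for a DB or RPC call (hypothetical). It takes
// ctx as its first argument and stops as soon as the caller's context ends.
func loadAllocation(ctx context.Context, id string, work time.Duration) error {
	select {
	case <-time.After(work):
		return nil
	case <-ctx.Done():
		return fmt.Errorf("load allocation %s: %w", id, ctx.Err())
	}
}

// simulateRequest mimics a handler whose request context expires after
// timeoutMs. In a real handler the parent would be r.Context().
func simulateRequest(timeoutMs, workMs int) error {
	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()
	return loadAllocation(ctx, "a1", time.Duration(workMs)*time.Millisecond)
}

// timedOut reports whether err was caused by the context deadline.
func timedOut(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println(timedOut(simulateRequest(20, 200))) // true: the deadline fires first
	fmt.Println(simulateRequest(200, 1))            // <nil>: the work finishes in time
}
```

Had loadAllocation created its own context.Background(), the first call would have run for the full 200 ms after the caller had already given up.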

Naming conventions¶

Thing	Convention	Example
Handler struct	`Handler` per package	`billing.Handler`
Service struct	`Service` per package	`billing.Service`
Constructor	`New<Type>`	`NewHandler`, `NewService`
Domain sentinel errors	`Err<Noun>`	`ErrAllocationNotFound`
Policy key consts	`Key<Domain><Name>`	`policy.KeyBillingWindowSeconds`
Event subject consts	`Subject<Domain><Event>`	`events.SubjectProvisioningActive`
Test file	`<file>_test.go` same package	`handler_test.go`
Integration test file	`//go:build integration` tag	see Testing Standards