
Slice Guest Telemetry and Benchmark v1

Purpose

Define the next telemetry and benchmark model for gpu_slice allocations.

The current product behavior is intentionally conservative:

  1. bare-metal allocations currently normalize metrics from host-side collectors, with Netdata still present only as the fallback/operator path during transition;
  2. slice allocations do not expose host Netdata to tenants because host GPU telemetry is misleading once devices are bound to vfio-pci;
  3. slice allocations therefore show an explicit "guest telemetry is not enabled yet" gap instead of pretending host values are tenant values.

This document turns that gap into an explicit design.

Non-Goals

This document does not propose:

  1. direct tenant access to guest Netdata;
  2. a generic observability platform redesign;
  3. fake or synthesized slice metrics;
  4. bypassing node-agent to reach guest VMs directly from the API.

Source Model

Allocation metrics need an explicit source contract.

Bare Metal

  • transition target: host_local_probe
  • current rollout state: prefer the platform-owned host probe and fall back to host Netdata only while rollout is incomplete
  • Open Netdata remains an admin/operator tool, not a tenant telemetry dependency

GPU Slice

  • slice_guest
  • guest telemetry must be collected through node-agent using the controlled management path to the VM private IP
  • host Netdata remains useful for node health, bridge state, and operator-only diagnostics, but not for tenant GPU utilization

Unavailable

  • unavailable
  • used when neither host nor guest telemetry can provide a truthful answer
  • responses should carry an explicit reason instead of falling back to host GPU values for slice allocations
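Pulling the three sources together, a minimal selection sketch could look like the following. This is illustrative only: the names (MetricsSource, select_source, the *_ok flags) are assumptions, not the committed API.

    # Hypothetical source-selection sketch; names are illustrative, not the real contract.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class MetricsSource:
        source: str                   # host_local_probe | host_netdata | slice_guest | unavailable
        reason: Optional[str] = None  # populated when source == "unavailable"

    def select_source(kind: str, host_probe_ok: bool, host_netdata_ok: bool,
                      guest_probe_ok: bool) -> MetricsSource:
        if kind == "gpu_slice":
            # Never fall back to host GPU values for slice allocations.
            if guest_probe_ok:
                return MetricsSource("slice_guest")
            return MetricsSource("unavailable", reason="guest telemetry not reachable via node-agent")
        # Bare metal: prefer the platform-owned host probe; Netdata only while rollout is incomplete.
        if host_probe_ok:
            return MetricsSource("host_local_probe")
        if host_netdata_ok:
            return MetricsSource("host_netdata")
        return MetricsSource("unavailable", reason="no truthful host collector available")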

Why Not Direct Guest Netdata

Guest Netdata may still be installed inside the slice VM as an implementation detail, but it should not become the public tenant surface.

Reasons:

  1. it expands tenant VM network exposure unnecessarily;
  2. it forces the platform proxy to solve another app-auth/session problem for every slice guest;
  3. it makes telemetry depend on guest networking policy instead of node-agent's controlled management path;
  4. it weakens the product boundary between "tenant workload surface" and "platform-managed telemetry surface".

The preferred product shape is:

  1. node-agent collects guest telemetry;
  2. API normalizes it into the allocation metrics contract;
  3. UI renders it as allocation telemetry without exposing the guest dashboard directly.

1. Reuse the Existing Guest Access Path

Node-agent already:

  1. waits for guest SSH on the slice VM private IP;
  2. checks guest readiness via the managed SSH key;
  3. captures a post-boot performance probe in node task output.

That same control path should be used for telemetry.
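As a rough sketch, the same managed-key SSH path could serve both the readiness check and a read-only telemetry pull. The key path, user, and commands below are illustrative assumptions; node-agent's real invocation may differ.

    # Illustrative: run a single read-only command in the slice VM over the managed SSH path.
    import subprocess

    def guest_ssh(private_ip: str, key_path: str, command: str, timeout: int = 15) -> str:
        result = subprocess.run(
            ["ssh", "-i", key_path,
             "-o", "BatchMode=yes",
             "-o", "StrictHostKeyChecking=accept-new",
             "-o", f"ConnectTimeout={timeout}",
             f"root@{private_ip}", command],
            capture_output=True, text=True, timeout=timeout + 5,
        )
        result.check_returncode()
        return result.stdout

    # The readiness probe and the telemetry pull share one path (values are examples):
    # guest_ssh("10.0.0.12", "/var/lib/node-agent/keys/slice", "true")          # readiness
    # guest_ssh("10.0.0.12", "/var/lib/node-agent/keys/slice", "nvidia-smi -L") # telemetry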

2. Add Guest Telemetry Collection in Node-Agent

For gpu_slice allocations, node-agent should collect:

  1. nvidia-smi GPU utilization and memory utilization;
  2. device-level health/power/temperature when available;
  3. optional guest CPU and memory values if we want tenant-visible OS metrics to reflect the VM rather than the host.

The first implementation can be pull-based and read-only. It does not need a long-running in-guest agent before the product contract is proven.
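As a sketch of that pull, nvidia-smi's CSV query mode already exposes the utilization, memory, power, and temperature fields listed above; which fields the product ultimately keeps is still open.

    # Read-only GPU sample from inside the guest using nvidia-smi CSV output.
    import subprocess

    QUERY_FIELDS = [
        "index", "name", "utilization.gpu", "utilization.memory",
        "memory.used", "memory.total", "power.draw", "temperature.gpu",
    ]

    def sample_guest_gpus() -> list[dict]:
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=" + ",".join(QUERY_FIELDS),
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        devices = []
        for line in out.strip().splitlines():
            values = [v.strip() for v in line.split(",")]
            devices.append(dict(zip(QUERY_FIELDS, values)))
        return devices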

3. Keep Host Netdata for Operator-Only Node Health

Host Netdata still owns:

  1. bridge and host interface telemetry;
  2. BF3/tmfifo visibility;
  3. host pressure, storage, and service health;
  4. bare-metal GPU visibility on nodes not in slice mode.

This operator health path should stay separate from tenant allocation metrics.

4. Extend the Allocation Metrics Contract

The allocation metrics APIs should add an explicit source indicator, for example:

  1. host_local_probe
  2. host_netdata (bare-metal fallback while the probe rollout is incomplete)
  3. slice_guest
  4. unavailable

The UI should use this to explain why Netdata is available for one allocation class but not another.
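One possible response fragment carrying that indicator is sketched below; the field names and values are illustrative, not the committed contract.

    # Illustrative allocation-metrics fragment; field names are assumptions.
    example_allocation_metrics = {
        "allocation_id": "alloc-123",   # hypothetical identifier
        "source": "slice_guest",        # host_local_probe | host_netdata | slice_guest | unavailable
        "reason": None,                 # set only when source == "unavailable"
        "gpu": {"device_count": 1, "utilization_pct": 37, "memory_used_mib": 20480},
    }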

Benchmark Model

Benchmark evidence should be captured from the same product-controlled paths used for lifecycle and telemetry, not from ad hoc manual shell sessions.

Baseline

Keep the node-agent performance probe as the cheap default capture (a minimal sketch follows the list):

  1. guest boot/readiness timings;
  2. nvidia-smi availability and latency;
  3. GPU count;
  4. RDMA device count;
  5. root disk identity;
  6. vCPU and memory sizing;
  7. Docker/NVIDIA runtime presence.
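Under those constraints, a minimal capture of the GPU-related fields might look like the sketch below; the key names and timeout are illustrative.

    # Cheap post-boot probe sketch: nvidia-smi availability, latency, and GPU count.
    import subprocess, time

    def quick_gpu_probe() -> dict:
        start = time.monotonic()
        try:
            out = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True,
                                 check=True, timeout=30).stdout
            return {
                "nvidia_smi_available": True,
                "nvidia_smi_latency_s": round(time.monotonic() - start, 3),
                "gpu_count": sum(1 for line in out.splitlines() if line.startswith("GPU ")),
            }
        except (subprocess.CalledProcessError, FileNotFoundError, subprocess.TimeoutExpired) as exc:
            # Report unavailability with the error instead of fabricating devices.
            return {"nvidia_smi_available": False, "last_error": str(exc)}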

BM vs VM Comparison

Add a repo-owned capture harness for repeatable evidence (sketched after the list):

  1. same host/guest commands;
  2. JSON output saved under dist/benchmarks/;
  3. optional read-only fio probe;
  4. optional ib_write_bw probe when infra supplies a peer.
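A skeleton of that harness is sketched below under the stated assumptions (JSON artifacts under dist/benchmarks/, optional probes behind flags); the exact commands, labels, and target device are illustrative.

    # Illustrative harness skeleton: run the same commands on host and guest,
    # save one JSON artifact per run under dist/benchmarks/.
    import json, pathlib, subprocess, time

    def run(cmd: list[str]) -> dict:
        p = subprocess.run(cmd, capture_output=True, text=True)
        return {"cmd": " ".join(cmd), "rc": p.returncode, "stdout": p.stdout, "stderr": p.stderr}

    def capture(label: str, include_fio: bool = False,
                fio_target: str = "/dev/nvme0n1") -> pathlib.Path:
        results = {
            "label": label,  # e.g. "bm-hostA" or "slice-hostA" (labels are examples)
            "captured_at": int(time.time()),
            "nvidia_smi": run(["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"]),
            "lscpu": run(["lscpu"]),
        }
        if include_fio:
            # Read-only probe; fio_target is an assumption and must already exist.
            results["fio"] = run(["fio", "--name=readprobe", "--readonly", "--rw=read",
                                  "--filename=" + fio_target, "--runtime=10", "--time_based",
                                  "--output-format=json"])
        out_dir = pathlib.Path("dist/benchmarks")
        out_dir.mkdir(parents=True, exist_ok=True)
        path = out_dir / f"{label}-{results['captured_at']}.json"
        path.write_text(json.dumps(results, indent=2))
        return path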

This creates a durable comparison artifact for:

  1. hugepages on/off;
  2. guest driver profile changes;
  3. host/guest network tuning changes;
  4. BM versus slice VM performance deltas.

Current Recommendation

  1. Do not expose guest Netdata directly.
  2. Build slice guest metrics through node-agent first.
  3. Treat host Netdata as operator-only on slice nodes.
  4. Use the benchmark harness plus node-agent performance probe to validate BM versus VM behavior before publishing performance expectations.

Product Boundary Update

Netdata should not remain a tenant-facing product surface.

Recommended boundary:

  1. allocation pages render first-party GPUaaS metrics only;
  2. Open Netdata is removed from user allocation pages;
  3. operator-facing Netdata remains allowed in admin surfaces such as /admin/ops today and can later move behind tighter /admin/nodes flows;
  4. the allocation metrics contract stays stable even if the underlying BM collector changes away from Netdata later.

This matches the intended cloud-style model:

  1. VM/allocation pages show selected product telemetry;
  2. deeper infra tooling is separate and operator-owned;
  3. slice and bare metal converge on the same allocation telemetry UX even when their collection backends differ.

Current Bare-Metal Collector Caveats

Today, bare-metal allocation metrics are normalized from host Netdata wherever the platform-owned host probe has not yet taken over. That gives a consistent API surface, but the richness of BM detail still depends on what each host exports.

Observed current behavior:

  1. one node may show aggregate GPU metrics without per-device GPU rows;
  2. one node may show IB/fabric interfaces with down/empty throughput while another shows partial speed/utilization data;
  3. this is expected with the current collector: per-device GPU tables require the Netdata nvidia_smi.gpu_gpu-* charts, and fabric tables require the corresponding net.*, net_speed.*, net_operstate.*, and carrier charts to exist for those interfaces (a chart-presence audit sketch follows this list).
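One way to audit that per-host gap is to check which chart families a host's Netdata actually exports, assuming the standard /api/v1/charts endpoint on the default port; the prefixes below follow the chart names listed above.

    # Check which Netdata charts a host exports, to explain missing BM detail.
    import json, urllib.request

    def netdata_chart_ids(host: str, port: int = 19999) -> set:
        with urllib.request.urlopen(f"http://{host}:{port}/api/v1/charts", timeout=10) as resp:
            return set(json.load(resp)["charts"].keys())

    def audit(host: str) -> dict:
        ids = netdata_chart_ids(host)
        return {
            "per_device_gpu": sorted(i for i in ids if i.startswith("nvidia_smi.gpu_")),
            "fabric_speed": sorted(i for i in ids if i.startswith("net_speed.")),
            "fabric_operstate": sorted(i for i in ids if i.startswith("net_operstate.")),
            "fabric_carrier": sorted(i for i in ids if i.startswith("net_carrier.")),
        }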

Implication:

  1. BM allocation metrics are already product-shaped,
  2. but BM detail is not yet fully uniform across hosts,
  3. and that inconsistency should be treated as a host collector/config quality problem, not as the desired long-term tenant telemetry architecture.

Initial Utility Validation Notes

The first platform-owned telemetry utility was validated over SSH before any API rollout.

Healthy NVIDIA Bare Metal

Reference host:

  1. j27u15
  2. Tailscale 100.99.13.72

Observed:

  1. direct nvidia-smi returned all 8 H200 devices with stable memory, power, temperature, and ECC values;
  2. the utility returned the same 8 devices and the same aggregate shape the allocation metrics API expects;
  3. the utility also returned IB/fabric inventory such as ibp26s0 with correct link speed metadata.

NVIDIA Slice Guest

Reference guest:

  1. slice VM on j22u15
  2. reached through the existing managed SSH path via the host private bridge

Observed:

  1. the utility returned a clean single-device NVIDIA H200 view;
  2. only guest-visible NICs were returned, not host bridge or unrelated node interfaces;
  3. this confirms the guest-scoped collector model gives the product surface we want for slices without exposing host Netdata.

CPU-Only Node

Reference host:

  1. local kind VM 192.168.1.171

Observed:

  1. the utility returned CPU, memory, pressure, and network inventory;
  2. GPU capabilities were correctly reported as unavailable;
  3. this confirms the schema degrades cleanly on non-GPU nodes.

Broken NVIDIA Host Behavior

Reference host:

  1. j22u15
  2. Tailscale 100.103.10.83

Observed:

  1. direct nvidia-smi failed on the host with: Failed to initialize NVML: No supported GPUs were found;
  2. the utility now reports GPU telemetry as unavailable and records the error in metrics.last_error;
  3. this is the desired behavior because it avoids fabricating fake GPU devices from stderr text.

Comparison Against Netdata

Observed on j27u15:

  1. Netdata exposes the expected nvidia_smi.gpu_gpu-* charts and IB charts;
  2. the utility matches the important product-facing values: GPU device count, VRAM, power, temperature, ECC, and fabric inventory;
  3. Netdata still has richer chart families for deep operator debugging, which is acceptable because the long-term product boundary is to keep Netdata admin-only.

Observed on j22u15:

  1. the host Netdata install is old and inconsistent;
  2. host nvidia-smi itself is broken;
  3. this confirms we should not make tenant telemetry depend on host Netdata quality.