Metrics and monitoring#
Navarch collects metrics from GPU nodes to enable autoscaling and health monitoring.
Metrics collection#
What is collected#
Every heartbeat (sent every 5-30 seconds) includes:
Node-level metrics:
- CPU usage percentage
- Memory usage percentage
- Timestamp
Per-GPU metrics:
- GPU index
- Utilization percentage (0-100)
- Temperature in Celsius
- Power usage in watts
- Memory used in bytes
Health status:
- Boot check results
- GPU communication status
- Health event detection (XID errors, thermal, ECC)
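Put together, a heartbeat's metrics payload looks roughly like the Go sketch below. CpuUsagePercent, MemoryUsagePercent, and GpuMetrics match the collector API shown later in this section; the per-GPU field names are illustrative.
// Sketch of the per-heartbeat metrics payload (per-GPU field names are illustrative)
type NodeMetrics struct {
    Timestamp          time.Time
    CpuUsagePercent    float64
    MemoryUsagePercent float64
    GpuMetrics         []GpuMetrics
    // Health status (boot checks, GPU communication, health events) omitted for brevity
}

type GpuMetrics struct {
    Index           int     // GPU index on the node
    UtilizationPct  float64 // 0-100
    TemperatureC    float64 // degrees Celsius
    PowerWatts      float64
    MemoryUsedBytes uint64
}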
Collection flow#
┌─────────────┐      Heartbeat       ┌───────────────┐
│  Node Agent │ ───────────────────> │ Control Plane │
│             │    (every 5-30s)     │               │
│ - Query GPU │                      │ - Store       │
│ - Collect   │                      │ - Aggregate   │
│ - Check     │                      │ - Autoscale   │
└─────────────┘                      └───────────────┘
Node-side collection#
The node daemon uses the metrics.Collector to gather system and GPU metrics.
System metrics are collected from the /proc filesystem (Linux):
- CPU usage: Calculated from /proc/stat using delta between consecutive reads
- Memory usage: Read from /proc/meminfo using MemTotal and MemAvailable
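To illustrate the delta technique, here is a standalone sketch (not Navarch's collector code) that computes CPU usage from two consecutive reads of /proc/stat:
// Standalone sketch: CPU usage from two consecutive /proc/stat reads
package main

import (
    "fmt"
    "os"
    "strconv"
    "strings"
    "time"
)

// readCPUTimes parses the aggregate "cpu" line of /proc/stat and
// returns idle and total jiffies.
func readCPUTimes() (idle, total uint64, err error) {
    data, err := os.ReadFile("/proc/stat")
    if err != nil {
        return 0, 0, err
    }
    firstLine := strings.SplitN(string(data), "\n", 2)[0]
    for i, f := range strings.Fields(firstLine)[1:] {
        v, err := strconv.ParseUint(f, 10, 64)
        if err != nil {
            return 0, 0, err
        }
        total += v
        if i == 3 { // the 4th column is idle time
            idle = v
        }
    }
    return idle, total, nil
}

func main() {
    idle1, total1, _ := readCPUTimes() // error handling elided for brevity
    time.Sleep(time.Second)
    idle2, total2, _ := readCPUTimes()
    busy := float64((total2 - total1) - (idle2 - idle1))
    fmt.Printf("CPU usage: %.1f%%\n", 100*busy/float64(total2-total1))
}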
GPU metrics are collected via the GPU manager interface, which:
- Queries GPU temperature, power, utilization, and memory
- Collects health events (XID errors, thermal warnings, ECC errors)
- Is injectable for testing and development
Code location: pkg/node/metrics/
// Create a metrics collector (nil uses the default system reader)
collector := metrics.NewCollector(gpuManager, nil)

// Collect all metrics
nodeMetrics, err := collector.Collect(ctx)
if err != nil {
    return fmt.Errorf("collect metrics: %w", err)
}
// nodeMetrics includes CpuUsagePercent, MemoryUsagePercent, and GpuMetrics[]
Custom system reader: For non-Linux systems or testing, implement SystemMetricsReader:
type SystemMetricsReader interface {
    ReadCPUUsage(ctx context.Context) (float64, error)
    ReadMemoryUsage(ctx context.Context) (float64, error)
}
// Use custom reader
customReader := &MyCustomReader{}
collector := metrics.NewCollector(gpuManager, customReader)
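A minimal reader implementation for testing might look like this (MyCustomReader and its fixed return values are illustrative):
// Illustrative SystemMetricsReader returning fixed values for deterministic tests
type MyCustomReader struct{}

func (r *MyCustomReader) ReadCPUUsage(ctx context.Context) (float64, error) {
    return 42.0, nil // fixed CPU percentage
}

func (r *MyCustomReader) ReadMemoryUsage(ctx context.Context) (float64, error) {
    return 60.0, nil // fixed memory percentage
}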
Storage and retention#
Metrics are stored in-memory per node:
- Up to 100 samples per node
- Oldest samples automatically pruned
- Query window: last 5 minutes (default)
For production deployments requiring longer retention, implement a custom database backend.
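A sketch of how such a bounded, prunable store can work (assuming the standard sync and time packages; this is not Navarch's actual storage code):
// Illustrative bounded in-memory store; not Navarch's actual implementation
type sample struct {
    Timestamp   time.Time
    Utilization float64
}

type metricsStore struct {
    mu      sync.Mutex
    samples map[string][]sample // keyed by node ID
}

const maxSamplesPerNode = 100

func (s *metricsStore) Add(nodeID string, m sample) {
    s.mu.Lock()
    defer s.mu.Unlock()
    buf := append(s.samples[nodeID], m)
    if len(buf) > maxSamplesPerNode {
        buf = buf[len(buf)-maxSamplesPerNode:] // prune oldest samples
    }
    s.samples[nodeID] = buf
}

// Recent returns samples inside the query window (default: last 5 minutes)
func (s *metricsStore) Recent(nodeID string, window time.Duration) []sample {
    s.mu.Lock()
    defer s.mu.Unlock()
    cutoff := time.Now().Add(-window)
    var out []sample
    for _, m := range s.samples[nodeID] {
        if m.Timestamp.After(cutoff) {
            out = append(out, m)
        }
    }
    return out
}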
Metrics aggregation#
Pool-level aggregation#
The control plane aggregates metrics by pool for autoscaling decisions.
Current utilization: Average GPU utilization across all GPUs in the pool.
Example: Pool has 2 nodes with 8 GPUs each (16 GPUs total):
- Node 1 GPUs: 80%, 90%, 75%, 85%, 70%, 80%, 85%, 75%
- Node 2 GPUs: 60%, 70%, 65%, 55%, 70%, 65%, 60%, 75%
Pool utilization = (80+90+75+85+70+80+85+75+60+70+65+55+70+65+60+75) / 16 = 72.5%
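The aggregation is a plain mean over every GPU in the pool; this small sketch reproduces the arithmetic above (returning 72.5 for the example inputs):
// Illustrative: average utilization across every GPU in the pool.
// nodes[i] holds the per-GPU utilization percentages reported by node i.
func poolUtilization(nodes [][]float64) float64 {
    var sum float64
    var count int
    for _, gpus := range nodes {
        for _, u := range gpus {
            sum += u
            count++
        }
    }
    if count == 0 {
        return 0
    }
    return sum / float64(count)
}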
Utilization history: Per-node average utilization for the last 5 minutes. Used for trend analysis by predictive autoscalers.
Pool filtering#
Nodes are assigned to pools via labels:
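For example (this YAML is illustrative; only the pool key is meaningful to Navarch):
labels:
  pool: training     # set automatically by Navarch
  team: ml-research  # optional, for organization
  env: production    # optional, for filtering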
The pool label is automatically set and used for metrics aggregation. Additional labels are for organization and filtering.
Autoscaling metrics#
Different autoscaler types use different metrics.
Reactive autoscaler#
Uses current GPU utilization:
autoscaling:
  type: reactive
  scale_up_at: 75    # Scale up when utilization > 75%
  scale_down_at: 25  # Scale down when utilization < 25%
Evaluation: Every 30 seconds (configurable via autoscale_interval)
Example:
- Current: 3 nodes, 85% GPU utilization
- Recommendation: Scale up to 4 nodes (utilization > 75%)
- After cooldown: Actually provision 4th node
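The decision rule boils down to a threshold comparison; this is a sketch of the logic, not Navarch's exact implementation (min/max bounds come from the pool configuration):
// Sketch of the reactive decision rule (illustrative, not Navarch's exact code)
func reactiveRecommend(current, minNodes, maxNodes int, utilization, scaleUpAt, scaleDownAt float64) int {
    switch {
    case utilization > scaleUpAt && current < maxNodes:
        return current + 1 // e.g. 3 nodes at 85% with scale_up_at 75 -> 4 nodes
    case utilization < scaleDownAt && current > minNodes:
        return current - 1
    default:
        return current
    }
}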
Queue-based autoscaler#
Uses job queue depth:
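A minimal configuration sketch (jobs_per_node mirrors the composite example later in this section):
autoscaling:
  type: queue
  jobs_per_node: 10  # target number of jobs per node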
Requires: External scheduler integration via the MetricsSource interface (see Integrating external schedulers below).
Example:
- Current: 5 nodes, 73 jobs in queue
- Calculation: ceil(73 / 10) = 8 nodes needed
- Recommendation: Scale up to 8 nodes
Scheduled autoscaler#
Does not use metrics. Scales based on time:
autoscaling:
  type: scheduled
  schedule:
    - days: [monday, tuesday, wednesday, thursday, friday]
      start: 9
      end: 18
      min_nodes: 10
      max_nodes: 50
Evaluation: Checks current time and applies appropriate limits.
Predictive autoscaler#
Uses utilization history for trend analysis:
autoscaling:
  type: predictive
  lookback_window: 10  # Samples to analyze
  growth_factor: 1.2   # Proactive scaling multiplier
Analyzes recent utilization trend and scales preemptively.
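As a rough illustration of the idea, the sketch below extrapolates the last two history samples linearly and inflates the result by growth_factor; Navarch's actual trend analysis may differ:
// Minimal sketch of trend-based scaling (assumed logic, not Navarch's algorithm)
func predictiveRecommend(history []float64, current, maxNodes int, growthFactor, scaleUpAt float64) int {
    if len(history) < 2 {
        return current // not enough history to detect a trend
    }
    last := history[len(history)-1]
    prev := history[len(history)-2]
    predicted := (last + (last - prev)) * growthFactor // one step ahead, inflated
    if predicted > scaleUpAt && current < maxNodes {
        return current + 1
    }
    return current
}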
Composite autoscaler#
Combines multiple metrics:
autoscaling:
  type: composite
  mode: max  # Take maximum recommendation
  autoscalers:
    - type: reactive
      scale_up_at: 80
      scale_down_at: 20
    - type: queue
      jobs_per_node: 10
Use case: Scale based on both GPU utilization and job queue depth, whichever demands more capacity.
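In max mode, each child autoscaler produces a node-count recommendation and the largest one wins; a sketch of that combination step:
// Sketch of composite "max" mode: the largest child recommendation wins
func compositeRecommend(recommendations []int) int {
    best := 0
    for _, r := range recommendations {
        if r > best {
            best = r
        }
    }
    return best
}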
Integrating external schedulers#
To provide queue depth and pending job metrics, implement the MetricsSource interface.
Interface#
type MetricsSource interface {
    GetPoolMetrics(ctx context.Context, poolName string) (*PoolMetrics, error)
}

type PoolMetrics struct {
    Utilization        float64   // GPU utilization (provided by Navarch)
    PendingJobs        int       // Jobs waiting to start
    QueueDepth         int       // Pending + running jobs
    UtilizationHistory []float64 // Historical utilization
}
Example: Kubernetes integration#
package main

import (
    "context"

    "github.com/NavarchProject/navarch/pkg/controlplane"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

type KubernetesMetrics struct {
    clientset *kubernetes.Clientset
    dbMetrics *controlplane.DBMetricsSource
}

func (k *KubernetesMetrics) GetPoolMetrics(ctx context.Context, poolName string) (*controlplane.PoolMetrics, error) {
    // Get GPU utilization from Navarch's built-in metrics
    baseMetrics, err := k.dbMetrics.GetPoolMetrics(ctx, poolName)
    if err != nil {
        return nil, err
    }

    // Query Kubernetes for pods with the pool label
    pods, err := k.clientset.CoreV1().Pods("").List(ctx, metav1.ListOptions{
        LabelSelector: "pool=" + poolName,
    })
    if err != nil {
        return nil, err
    }

    // Count pending and running pods
    var pending, running int
    for _, pod := range pods.Items {
        if pod.Status.Phase == "Pending" {
            pending++
        } else if pod.Status.Phase == "Running" {
            running++
        }
    }

    // Combine metrics
    baseMetrics.PendingJobs = pending
    baseMetrics.QueueDepth = pending + running
    return baseMetrics, nil
}
Then use it:
k8sMetrics := &KubernetesMetrics{
    clientset: clientset,
    dbMetrics: controlplane.NewDBMetricsSource(database, logger),
}
poolManager := controlplane.NewPoolManager(cfg, k8sMetrics, logger)
Example: Slurm integration#
// Assumes the same imports as the Kubernetes example, plus os/exec and strings
type SlurmMetrics struct {
    slurmHost string
    dbMetrics *controlplane.DBMetricsSource
}

func (s *SlurmMetrics) GetPoolMetrics(ctx context.Context, poolName string) (*controlplane.PoolMetrics, error) {
    // Get GPU utilization from Navarch's built-in metrics
    baseMetrics, err := s.dbMetrics.GetPoolMetrics(ctx, poolName)
    if err != nil {
        return nil, err
    }

    // Query Slurm via squeue: -h suppresses the header, -o %i prints one job ID per line
    output, err := exec.CommandContext(ctx, "squeue",
        "-h", "-p", poolName, "-t", "PENDING", "-o", "%i").Output()
    if err != nil {
        return nil, err
    }
    pending := len(strings.Split(string(output), "\n")) - 1 // trailing newline adds one empty element

    output, err = exec.CommandContext(ctx, "squeue",
        "-h", "-p", poolName, "-t", "RUNNING", "-o", "%i").Output()
    if err != nil {
        return nil, err
    }
    running := len(strings.Split(string(output), "\n")) - 1

    baseMetrics.PendingJobs = pending
    baseMetrics.QueueDepth = pending + running
    return baseMetrics, nil
}
Monitoring and observability#
Prometheus metrics#
The control plane exposes Prometheus metrics at /metrics.
Available metrics:
| Metric | Labels | Description |
|---|---|---|
| navarch_nodes_total | status | Total number of nodes by status (active, cordoned, draining, unhealthy) |
| navarch_node_health_status | node_id, status | Health status per node (1=healthy, 0.5=degraded, 0=unhealthy) |
| navarch_gpus_total | provider | Total number of GPUs by provider |
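A minimal Prometheus scrape job for these might look like the following sketch; the target host and port are deployment-specific assumptions:
scrape_configs:
  - job_name: navarch
    static_configs:
      - targets: ["navarch-control-plane:8080"]  # hypothetical host:port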
Structured logging#
All components emit structured JSON logs with:
- level: INFO, WARN, ERROR
- msg: Human-readable message
- time: RFC3339 timestamp
- Context fields (pool, node_id, error, etc.)
Example:
{
  "time": "2026-01-19T22:00:15Z",
  "level": "INFO",
  "msg": "scaling up",
  "pool": "training",
  "from": 5,
  "to": 8,
  "reason": "utilization 87.3% > 75.0% threshold"
}
Health endpoints#
Liveness probe: GET /healthz
- Returns 200 if control plane is running
- Use for container health checks
Readiness probe: GET /readyz
- Returns 200 if database is accessible
- Returns 503 if not ready
- Use for load balancer health checks
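In Kubernetes deployments, these endpoints map directly onto probes; the port below is a hypothetical value:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080  # hypothetical; match your control plane's listen port
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080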
Metrics API#
Query metrics via gRPC API (future work):
service MetricsService {
  rpc GetNodeMetrics(GetNodeMetricsRequest) returns (NodeMetricsResponse);
  rpc GetPoolMetrics(GetPoolMetricsRequest) returns (PoolMetricsResponse);
  rpc QueryMetrics(QueryMetricsRequest) returns (QueryMetricsResponse);
}
Current workaround: Query the in-memory database directly via the CLI or custom tooling.
Best practices#
Heartbeat intervals#
- Fast (5s): Development and testing
- Standard (30s): Production with reactive autoscaling
- Slow (60s): Large clusters (1000+ nodes) to reduce control plane load
Autoscale intervals#
- Fast (10s): Development and testing
- Standard (30s): Production with quick response times
- Slow (60-120s): Cost-sensitive workloads to avoid over-provisioning
Cooldown periods#
- Short (2-3m): Development and testing
- Standard (5m): Production to prevent oscillation
- Long (10-15m): Large nodes (expensive to provision/terminate)
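Taken together, a production-leaning configuration might look like the sketch below. autoscale_interval is confirmed earlier in this section, but heartbeat_interval and cooldown are hypothetical key names; check the configuration reference for the exact schema.
# heartbeat_interval and cooldown are hypothetical key names
heartbeat_interval: 30s  # standard production heartbeat
autoscale_interval: 30s  # evaluate autoscaling every 30s
cooldown: 5m             # prevent scale oscillation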
Metrics retention#
The default 100 samples per node retains approximately:
- 8 minutes at 5s heartbeat interval
- 50 minutes at 30s heartbeat interval
For longer retention, implement a custom database backend or export to a time-series database.
Troubleshooting#
No metrics for pool#
Symptom: Autoscaler always reports 0% utilization.
Causes:
- Nodes not registered with pool label
- Nodes not sending metrics in heartbeats
- Pool name mismatch
Debug:
# Check node labels
navarch get node-1 --output json | jq .metadata.labels
# Verify pool name in config
grep "pools:" config.yaml -A 5
# Check control plane logs
grep "failed to get metrics" logs.json
Autoscaler not scaling#
Symptom: Utilization high but no scale up.
Causes:
- Cooldown period active
- At max nodes limit
- Provider provisioning failures
Debug:
# Check pool status
navarch list
# Look for scaling events
grep "scaling" logs.json | tail -20
# Check cooldown
grep "cooldown active" logs.json
High control plane memory#
Symptom: Control plane memory usage growing.
Causes:
- Metrics retention with many nodes (100 samples × number of nodes)
- Memory leak (report bug)
Solutions:
- Reduce metrics retention (requires code change)
- Restart control plane periodically
- Implement external database backend