# Configuration
Navarch uses a single YAML configuration file to define providers, pools, and server settings.
## Quick start

```yaml
providers:
  lambda:
    type: lambda
    api_key_env: LAMBDA_API_KEY

pools:
  training:
    provider: lambda
    instance_type: gpu_8x_h100_sxm5
    region: us-west-2
    min_nodes: 2
    max_nodes: 20
    autoscaling:
      type: reactive
      scale_up_at: 80
      scale_down_at: 20
```
Run the control plane with:
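The exact invocation depends on how Navarch is installed; the binary name, subcommand, and `--config` flag below are assumptions rather than something this page documents:

```bash
# Hypothetical invocation; adjust the binary name, subcommand, and flag to your install.
navarch server --config navarch.yaml
```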
## Server

```yaml
server:
  address: ":50051"                     # Listen address
  heartbeat_interval: 30s               # Node heartbeat frequency
  health_check_interval: 60s            # Health check frequency
  autoscale_interval: 30s               # Autoscaler evaluation frequency
  health_policy: ./health-policy.yaml   # Custom health policy file
  notifier:                             # Workload system integration
    type: webhook
    webhook:
      cordon_url: https://scheduler.example.com/api/cordon
      drain_url: https://scheduler.example.com/api/drain
```
All fields are optional with sensible defaults.
| Field | Default | Description |
|---|---|---|
| `address` | `:50051` | gRPC/HTTP listen address |
| `heartbeat_interval` | `30s` | How often nodes send heartbeats |
| `health_check_interval` | `60s` | How often health checks run |
| `autoscale_interval` | `30s` | How often the autoscaler evaluates pools |
| `health_policy` | (none) | Path to health policy file |
| `notifier` | (none) | Notifier configuration for workload system integration |
## Authentication
The control plane supports bearer token authentication:
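The full setup lives on the Authentication page; a minimal sketch, assuming the token is supplied through the `NAVARCH_AUTH_TOKEN` environment variable listed under Environment variables below:

```bash
# Assumption: the control plane and its clients read the bearer token from this
# variable; see the Authentication page for the authoritative setup.
export NAVARCH_AUTH_TOKEN="$(openssl rand -hex 32)"
```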
See Authentication for client configuration and custom methods.
## Providers
Providers define cloud platforms where GPU nodes are provisioned.
```yaml
providers:
  lambda:
    type: lambda
    api_key_env: LAMBDA_API_KEY   # Environment variable containing API key

  gcp:
    type: gcp
    project: my-gcp-project

  fake:
    type: fake
    gpu_count: 8                  # GPUs per fake instance (for testing)
```
| Type | Description |
|---|---|
| `lambda` | Lambda Labs Cloud |
| `gcp` | Google Cloud Platform |
| `aws` | Amazon Web Services |
| `fake` | Fake provider for local development |
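The examples above omit the `aws` type; a minimal sketch, assuming the AWS provider needs no extra keys and resolves credentials from the standard AWS environment variables listed under Environment variables below:

```yaml
providers:
  aws:
    type: aws   # Assumption: credentials come from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
```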
## Pools
Pools define groups of GPU nodes with scaling policies.
### Single-provider pool

```yaml
pools:
  training:
    provider: lambda
    instance_type: gpu_8x_h100_sxm5
    region: us-west-2
    min_nodes: 2
    max_nodes: 20
    cooldown: 5m
    ssh_keys:
      - ops-team
    labels:
      workload: training
    autoscaling:
      type: reactive
      scale_up_at: 80
      scale_down_at: 20
    health:
      unhealthy_after: 2
      auto_replace: true
```
### Multi-provider pool

For fungible compute across multiple providers:

```yaml
pools:
  fungible:
    providers:
      - name: lambda
        priority: 1
        regions: [us-west-2, us-east-1]
      - name: gcp
        priority: 2
        regions: [us-central1]
        instance_type: a3-highgpu-8g   # Provider-specific override
    strategy: priority
    instance_type: h100-8x             # Abstract type
    min_nodes: 4
    max_nodes: 32
```
Provider selection strategies:
| Strategy | Description |
|---|---|
| `priority` | Try providers in priority order (lowest first) |
| `cost` | Select cheapest available provider |
| `availability` | Select first provider with capacity |
| `round-robin` | Distribute evenly across providers |
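For instance, switching the multi-provider pool above to cost-based selection only changes the `strategy` field. The sketch below uses only fields documented on this page; the sizes are illustrative, and it assumes the per-provider `priority` field can be omitted when unused:

```yaml
pools:
  fungible:
    providers:
      - name: lambda
      - name: gcp
    strategy: cost            # Select the cheapest available provider
    instance_type: h100-8x
    min_nodes: 4
    max_nodes: 32
```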
### Pool fields

| Field | Required | Description |
|---|---|---|
| `provider` | Yes* | Single provider name |
| `providers` | Yes* | List of provider entries (multi-provider) |
| `strategy` | No | Provider selection strategy (multi-provider) |
| `instance_type` | Yes | Instance type (provider-specific or abstract) |
| `region` | No | Default region |
| `zones` | No | Availability zones |
| `ssh_keys` | No | SSH key names to install |
| `min_nodes` | Yes | Minimum nodes to maintain |
| `max_nodes` | Yes | Maximum nodes allowed |
| `cooldown` | No | Time between scaling actions (default: 5m) |
| `labels` | No | Key-value labels for workload routing |
| `autoscaling` | No | Autoscaler configuration |
| `health` | No | Health check configuration |
| `setup_commands` | No | Bootstrap commands |
| `ssh_user` | No | SSH username for bootstrap (default: ubuntu) |
| `ssh_private_key_path` | No | Path to SSH private key for bootstrap |
\*Either `provider` or `providers` is required, but not both.
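Some of these fields do not appear in the earlier examples; the sketch below shows `zones`, `setup_commands`, and the SSH bootstrap fields together, with purely illustrative values (the zone names, key path, and commands are placeholders):

```yaml
pools:
  training:
    provider: gcp
    instance_type: a3-highgpu-8g
    region: us-central1
    zones: [us-central1-a, us-central1-b]   # Placeholder zones
    min_nodes: 1
    max_nodes: 8
    ssh_user: ubuntu
    ssh_private_key_path: ~/.ssh/navarch-key
    setup_commands:
      - nvidia-smi -pm 1                     # Placeholder bootstrap command
```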
## Autoscaling
Configure how pools scale based on demand. See Autoscaling Concepts for details on each strategy.
```yaml
autoscaling:
  type: reactive       # reactive, queue, scheduled, predictive, composite
  scale_up_at: 80      # Scale up when utilization > 80%
  scale_down_at: 20    # Scale down when utilization < 20%
```
| Type | Use case |
|---|---|
| `reactive` | Scale on current GPU utilization |
| `queue` | Scale on pending job count |
| `scheduled` | Time-based scaling limits |
| `predictive` | Forecast-based proactive scaling |
| `composite` | Combine multiple strategies |
## Health
Configure health checking and auto-replacement:
```yaml
health:
  unhealthy_after: 2    # Consecutive failures before unhealthy
  auto_replace: true    # Automatically replace unhealthy nodes
```
See Health Monitoring for details on health events and XID errors.
For custom health evaluation logic, see Health Policy.
## Notifier
The notifier integrates Navarch with external workload systems (job schedulers, Kubernetes, etc.). When nodes are cordoned or drained, it notifies your workload system so that system can stop scheduling new work onto the affected nodes and migrate existing workloads off them.
### Webhook notifier
Send HTTP notifications to your workload system:
```yaml
server:
  notifier:
    type: webhook
    webhook:
      cordon_url: https://scheduler.example.com/api/v1/nodes/cordon
      uncordon_url: https://scheduler.example.com/api/v1/nodes/uncordon
      drain_url: https://scheduler.example.com/api/v1/nodes/drain
      drain_status_url: https://scheduler.example.com/api/v1/nodes/drain-status
      timeout: 30s
      headers:
        Authorization: Bearer ${SCHEDULER_TOKEN}
```
| Field | Description |
|---|---|
| `cordon_url` | Called when a node is cordoned (POST) |
| `uncordon_url` | Called when a node is uncordoned (POST) |
| `drain_url` | Called when a node should be drained (POST) |
| `drain_status_url` | Polled to check if drain is complete (GET) |
| `timeout` | Request timeout (default: 30s) |
| `headers` | Custom headers for authentication |
### Webhook payloads

POST requests (cordon, uncordon, drain):

```json
{
  "event": "cordon",
  "node_id": "node-abc123",
  "reason": "GPU failure detected",
  "timestamp": "2024-01-15T10:30:00Z"
}
```
The drain status request is a GET and includes a `?node_id=node-abc123` query parameter.
Expected response:
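This page does not pin down the response schema, so the body below is only a sketch; both field names are assumptions chosen for illustration, and the key point is that the response tells Navarch whether the drain has finished:

```json
{
  "node_id": "node-abc123",
  "drained": true
}
```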
### No notifier (default)
Without a notifier configured, cordon/drain/uncordon operations only update Navarch's internal state. Use this when:
- Running standalone without external schedulers
- Your workload system doesn't need notifications
- You're testing or developing locally
## Defaults
Apply defaults to all pools:
```yaml
defaults:
  ssh_keys:
    - ops-team
    - ml-team
  ssh_user: ubuntu
  ssh_private_key_path: ~/.ssh/navarch-key
  health:
    unhealthy_after: 2
    auto_replace: true
```
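Because defaults apply to every pool, a pool that omits these fields picks them up automatically; in this sketch the pool name and sizes are illustrative:

```yaml
# Omits ssh_keys, ssh_user, ssh_private_key_path, and health,
# so the values from the defaults block above apply.
pools:
  inference:
    provider: lambda
    instance_type: gpu_1x_h100_pcie
    min_nodes: 1
    max_nodes: 4
```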
## Abstract instance types
Use abstract types to provision equivalent hardware across providers:
| Abstract | Lambda | GCP | AWS |
|---|---|---|---|
| `h100-8x` | `gpu_8x_h100_sxm5` | `a3-highgpu-8g` | `p5.48xlarge` |
| `h100-1x` | `gpu_1x_h100_pcie` | `a3-highgpu-1g` | - |
| `a100-8x` | `gpu_8x_a100` | `a2-highgpu-8g` | `p4d.24xlarge` |
| `a100-4x` | `gpu_4x_a100` | `a2-highgpu-4g` | `p4de.24xlarge` |
| `a100-1x` | `gpu_1x_a100` | `a2-highgpu-1g` | - |
## Environment variables
| Variable | Description |
|---|---|
| `NAVARCH_AUTH_TOKEN` | Authentication token for the control plane |
| `LAMBDA_API_KEY` | Lambda Labs API key |
| `GOOGLE_APPLICATION_CREDENTIALS` | GCP credentials file path |
| `AWS_ACCESS_KEY_ID` | AWS access key |
| `AWS_SECRET_ACCESS_KEY` | AWS secret key |
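These are typically exported in the environment that launches the control plane; the sketch below assumes a Lambda-only setup, and the values are placeholders:

```bash
# Placeholders; replace with real values before starting the control plane.
export NAVARCH_AUTH_TOKEN="<control-plane-token>"
export LAMBDA_API_KEY="<lambda-api-key>"
```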