# Pools & Providers
Pools organize your GPU nodes into groups with shared configuration. Providers connect Navarch to the cloud platforms that supply those nodes.
## Pools
A pool is a group of GPU nodes with shared configuration:
- Same cloud provider and region.
- Same instance type (GPU count and model).
- Common scaling limits and autoscaler configuration.
- Unified health and replacement policies.
Pools let you manage different workload types independently:
```yaml
pools:
  # Training pool: Large instances, conservative scaling
  training:
    provider: lambda
    instance_type: gpu_8x_h100_sxm5
    min_nodes: 2
    max_nodes: 20
    cooldown: 10m

  # Inference pool: Smaller instances, aggressive scaling
  inference:
    provider: lambda
    instance_type: gpu_1x_a100
    min_nodes: 5
    max_nodes: 100
    cooldown: 2m
```
### When to use multiple pools
- Different instance types: Training on 8xH100, inference on 1xA100.
- Different regions: US pool for US users, EU pool for EU users.
- Different scaling behavior: Batch jobs scale to zero, serving keeps minimum capacity (see the sketch after this list).
- Different teams: Separate pools for separate cost tracking.
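
For the scale-to-zero case, here is a minimal sketch using only the pool keys shown above: a batch pool with `min_nodes: 0` releases all capacity when idle, while a serving pool keeps a warm floor. The pool names and instance values are illustrative, not prescriptive:

```yaml
pools:
  # Batch pool: releases all nodes when no work is queued
  batch:
    provider: lambda
    instance_type: gpu_1x_a100   # illustrative instance type
    min_nodes: 0                 # scale to zero when idle
    max_nodes: 50
    cooldown: 5m

  # Serving pool: keeps warm capacity at all times
  serving:
    provider: lambda
    instance_type: gpu_1x_a100
    min_nodes: 3                 # floor for latency-sensitive traffic
    max_nodes: 30
    cooldown: 2m
```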
## Providers
A provider abstracts cloud-specific operations:
- Provisioning new instances.
- Terminating instances.
- Listing available instance types.
- Managing SSH keys and startup scripts.
### Supported providers

| Provider | Description |
|---|---|
| `lambda` | Lambda Labs Cloud GPU instances. |
| `gcp` | Google Cloud Platform. |
| `aws` | Amazon Web Services. |
| `fake` | Simulated instances for development and testing. |
### Provider configuration

Providers are configured separately from pools:

```yaml
providers:
  lambda:
    type: lambda
    api_key_env: LAMBDA_API_KEY

  gcp:
    type: gcp
    project: my-project
    credentials_file: /path/to/credentials.json

pools:
  training:
    provider: lambda   # References the provider above
    # ...
```
This lets you:
- Use the same provider with different credentials (sketched below).
- Switch providers without changing pool configuration.
- Test with the fake provider before using real clouds.
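
As an illustration of the first and last points, the sketch below registers two Lambda providers with separate API keys plus a fake provider for local testing. The provider names (`lambda-team-a`, `lambda-team-b`, `local`) and the environment variable names are assumptions for this example:

```yaml
providers:
  # Same provider type, different credentials per team
  lambda-team-a:
    type: lambda
    api_key_env: LAMBDA_API_KEY_TEAM_A
  lambda-team-b:
    type: lambda
    api_key_env: LAMBDA_API_KEY_TEAM_B

  # Simulated instances; no real cloud calls
  local:
    type: fake

pools:
  training:
    provider: lambda-team-a   # point at "local" to test without real nodes
    # ...
```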
See Configuration Reference for all provider options.
## Labels
Labels are key-value pairs attached to pools and nodes. Use them for:
- Filtering nodes by workload type.
- Routing jobs to appropriate pools.
- Organizing resources by team or project.
```yaml
pools:
  training:
    provider: lambda
    instance_type: gpu_8x_h100
    labels:
      workload: training
      team: ml-platform
      environment: production
```
Labels propagate to nodes when they're provisioned. Query nodes by label:
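
The exact command is documented in the CLI Reference; the shape is typically a list command with a label filter. The subcommand and flag below are assumptions for illustration:

```sh
# Hypothetical invocation: list only nodes carrying the training workload label
navarch nodes list --label workload=training
```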
See CLI Reference for all filtering options.
## Multi-cloud setup
Navarch can manage nodes across multiple providers simultaneously:
```yaml
providers:
  lambda:
    type: lambda
    api_key_env: LAMBDA_API_KEY

  gcp:
    type: gcp
    project: my-project

pools:
  # Primary pool on Lambda
  training-primary:
    provider: lambda
    instance_type: gpu_8x_h100_sxm5
    region: us-west-1
    min_nodes: 4
    max_nodes: 20

  # Overflow pool on GCP
  training-overflow:
    provider: gcp
    instance_type: a3-highgpu-8g
    region: us-central1
    min_nodes: 0
    max_nodes: 10
```
This enables:
- Failover: If Lambda is out of capacity, use GCP.
- Cost optimization: Use the cheapest available provider.
- Geographic distribution: Run nodes closer to users.