# Architecture
Navarch is an infrastructure layer that sits between cloud providers and workload schedulers.
## System layers
```
┌────────────────────────────────────────────┐
│            Workload Schedulers             │
│      (Kubernetes, Slurm, Ray, custom)      │
└────────────────────────────────────────────┘
               ↓ schedule jobs
┌────────────────────────────────────────────┐
│                  Navarch                   │
│  - Provisions GPU VMs                      │
│  - Monitors hardware health                │
│  - Autoscales node pools                   │
│  - Auto-replaces failures                  │
└────────────────────────────────────────────┘
            ↓ provision/terminate
┌────────────────────────────────────────────┐
│            Cloud Provider APIs             │
│          (Lambda Labs, GCP, AWS)           │
└────────────────────────────────────────────┘
```
Your scheduler places workloads. Navarch maintains healthy infrastructure.
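The split can be pictured as two independent control loops: the scheduler binds jobs to whatever healthy nodes exist, while Navarch keeps the set of healthy nodes at its desired size. The sketch below illustrates that infrastructure-side loop. Every type and method name in it is hypothetical (it is not Navarch's API), and a tiny in-memory provider stands in for a real cloud so the example runs as-is.

```go
package main

import "fmt"

// Node is how the infrastructure layer sees one GPU instance.
// All names in this sketch are illustrative, not Navarch's actual API.
type Node struct {
	ID      string
	Healthy bool
}

// Provider abstracts a cloud provider API (Lambda Labs, GCP, AWS, ...).
type Provider interface {
	Launch() (Node, error)
	Terminate(id string) error
	List() ([]Node, error)
}

// fakeProvider is an in-memory stand-in so the sketch runs without
// real cloud credentials.
type fakeProvider struct {
	nodes map[string]Node
	next  int
}

func (f *fakeProvider) Launch() (Node, error) {
	f.next++
	n := Node{ID: fmt.Sprintf("node-%d", f.next), Healthy: true}
	f.nodes[n.ID] = n
	return n, nil
}

func (f *fakeProvider) Terminate(id string) error {
	delete(f.nodes, id)
	return nil
}

func (f *fakeProvider) List() ([]Node, error) {
	out := make([]Node, 0, len(f.nodes))
	for _, n := range f.nodes {
		out = append(out, n)
	}
	return out, nil
}

// reconcile drives the node set toward the desired size: unhealthy
// nodes are terminated and replacements are launched. The scheduler
// never calls this loop; it only sees the resulting healthy nodes.
func reconcile(p Provider, desired int) error {
	nodes, err := p.List()
	if err != nil {
		return err
	}
	healthy := 0
	for _, n := range nodes {
		if n.Healthy {
			healthy++
			continue
		}
		if err := p.Terminate(n.ID); err != nil {
			return err
		}
	}
	for healthy < desired {
		if _, err := p.Launch(); err != nil {
			return err
		}
		healthy++
	}
	return nil
}

func main() {
	p := &fakeProvider{nodes: map[string]Node{}}
	if err := reconcile(p, 4); err != nil {
		fmt.Println("reconcile failed:", err)
		return
	}
	nodes, _ := p.List()
	fmt.Println("healthy nodes after reconcile:", len(nodes)) // 4
}
```

The key point is that the scheduler never sees `Launch` or `Terminate`; it only places workloads onto nodes that survive this loop.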
## Components
See Components for details.
- **Control plane**: gRPC server that manages pools, tracks node state, and issues commands.
- **Node agent**: lightweight process on each GPU instance that reports health and executes commands.
- **Pool manager**: orchestrates autoscaling and node replacement.
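To show how these pieces relate, here is a minimal sketch of the agent-to-control-plane exchange, with the pool manager expressed as a replacement policy. The message and interface names are illustrative only; the real control plane speaks gRPC, so its actual messages are not these Go structs.

```go
package main

import (
	"fmt"
	"time"
)

// HealthReport is the kind of message a node agent would send to the
// control plane on each heartbeat; the fields here are hypothetical.
type HealthReport struct {
	NodeID string
	GPUsOK bool
	SentAt time.Time
}

// Command is what the control plane would send back to an agent.
type Command struct {
	NodeID string
	Action string // e.g. "drain", "terminate"
}

// PoolManager decides on scaling and replacement from observed state.
type PoolManager interface {
	OnReport(r HealthReport) []Command
}

// replaceOnFailure is a trivial policy: any report with unhealthy GPUs
// produces a terminate command, and the provisioning path (not shown)
// would bring up a replacement node.
type replaceOnFailure struct{}

func (replaceOnFailure) OnReport(r HealthReport) []Command {
	if r.GPUsOK {
		return nil
	}
	return []Command{{NodeID: r.NodeID, Action: "terminate"}}
}

func main() {
	var pm PoolManager = replaceOnFailure{}
	cmds := pm.OnReport(HealthReport{NodeID: "gpu-7", GPUsOK: false, SentAt: time.Now()})
	fmt.Println(cmds) // [{gpu-7 terminate}]
}
```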
## Deployment models
### Single control plane
One control plane for all pools. Suitable for most deployments.
### High availability
Multiple control planes behind a load balancer, backed by an external state store. See Deployment for details.
### Multi-region
Separate control planes per region with independent configurations. Use for latency-sensitive workloads or regulatory requirements.
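As a rough illustration of how the models differ, the sketch below contrasts the three setups as configuration values. The field names (`Replicas`, `StateStoreURL`, and so on) are hypothetical, not Navarch's actual configuration schema; see Deployment for the real options.

```go
package main

import "fmt"

// ControlPlaneConfig sketches the knobs that distinguish the
// deployment models; these fields are illustrative only.
type ControlPlaneConfig struct {
	Region        string
	Replicas      int
	StateStoreURL string // empty means local state (single control plane)
}

func main() {
	single := ControlPlaneConfig{Region: "us-east-1", Replicas: 1}
	ha := ControlPlaneConfig{
		Region:        "us-east-1",
		Replicas:      3,
		StateStoreURL: "postgres://state.internal/navarch", // shared external store
	}
	// Multi-region is simply one independent config per region.
	multi := []ControlPlaneConfig{
		{Region: "us-east-1", Replicas: 1},
		{Region: "eu-west-1", Replicas: 1},
	}
	fmt.Println(single, ha, multi)
}
```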
## Learn more
- Components - Control plane and node agent details
- Pools & Providers - Multi-cloud provisioning
- Health Monitoring - GPU failure detection
- Autoscaling - Scaling strategies
- Extending - Custom providers, autoscalers, metrics sources