Node Bootstrap
Navarch can run setup commands on newly provisioned instances via SSH. This is useful for installing the node agent, configuring GPU drivers, or running custom initialization scripts.
Configuration

```yaml
pools:
  training:
    provider: lambda
    instance_type: gpu_8x_h100_sxm5
    min_nodes: 2
    max_nodes: 20
    ssh_user: ubuntu
    ssh_private_key_path: ~/.ssh/navarch-key
    setup_commands:
      - |
        curl -L https://github.com/NavarchProject/navarch/releases/latest/download/navarch-node-linux-amd64 \
          -o /usr/local/bin/navarch-node && chmod +x /usr/local/bin/navarch-node
      - |
        navarch-node --server {{.ControlPlane}} --node-id {{.NodeID}} &
```
Bootstrap fields

| Field | Required | Description |
|---|---|---|
| `setup_commands` | No | List of shell commands to run on the node after provisioning |
| `ssh_user` | No | SSH username (default: `ubuntu`) |
| `ssh_private_key_path` | Yes* | Path to SSH private key file |
| `ip_wait_timeout` | No | Max time to wait for instance IP (default: 15m) |
| `ssh_timeout` | No | Max time to wait for SSH to become available (default: 10m) |
| `ssh_connect_timeout` | No | Timeout for each SSH connection attempt (default: 30s) |
| `command_timeout` | No | Max time for each command to execute (default: 5m) |

*Required when `setup_commands` is specified.
These fields can also be set under `defaults` to apply to all pools.
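As a minimal sketch of shared settings, the fragment below moves the SSH and timeout fields into `defaults` so that two pools inherit them. The `inference` pool and both script names are hypothetical, and the sketch assumes `defaults` accepts the same bootstrap keys as a pool.

```yaml
defaults:
  # Applied to every pool below (assumes defaults accepts the
  # pool-level bootstrap keys listed in the table above).
  ssh_user: ubuntu
  ssh_private_key_path: ~/.ssh/navarch-key
  command_timeout: 10m

pools:
  training:
    provider: lambda
    instance_type: gpu_8x_h100_sxm5
    setup_commands:
      - ./setup-training.sh     # hypothetical script
  inference:                    # hypothetical second pool
    provider: lambda
    instance_type: gpu_8x_h100_sxm5
    setup_commands:
      - ./setup-inference.sh    # hypothetical script
```

Whether a value set on a pool takes precedence over the same key in `defaults` is not stated here, so this sketch avoids setting any key in both places.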
Template variables
Setup commands support Go template syntax. The following variables are available:

| Variable | Description | Example |
|---|---|---|
| `{{.ControlPlane}}` | Control plane URL | `http://control-plane.example.com:50051` |
| `{{.Pool}}` | Pool name | `training` |
| `{{.NodeID}}` | Unique node identifier | `node-abc123` |
| `{{.Provider}}` | Provider name | `lambda` |
| `{{.Region}}` | Region where node is provisioned | `us-west-2` |
| `{{.InstanceType}}` | Instance type | `gpu_8x_h100_sxm5` |
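To illustrate how the variables might be combined, the sketch below writes them to a metadata file on the node. The file path `/etc/navarch/node.env` and the `NAVARCH_*` names are hypothetical, and the sketch assumes the command runs with enough privileges to write under `/etc`.

```yaml
pools:
  training:
    provider: lambda
    instance_type: gpu_8x_h100_sxm5
    ssh_private_key_path: ~/.ssh/navarch-key
    setup_commands:
      # Record where this node came from. Each {{...}} placeholder is
      # filled in before the command is sent to the node.
      # File path and variable names are hypothetical.
      - |
        mkdir -p /etc/navarch
        cat > /etc/navarch/node.env << EOF
        NAVARCH_CONTROL_PLANE={{.ControlPlane}}
        NAVARCH_POOL={{.Pool}}
        NAVARCH_NODE_ID={{.NodeID}}
        NAVARCH_PROVIDER={{.Provider}}
        NAVARCH_REGION={{.Region}}
        NAVARCH_INSTANCE_TYPE={{.InstanceType}}
        EOF
```

With the example values from the table, the second line of the file would read `NAVARCH_POOL=training`.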
How it works
- When Navarch provisions a new instance, it waits for the instance to receive an IP address.
- Once the IP is available, it waits for SSH to become available.
- Once connected, it runs each setup command in order, enforcing the command timeout.
- If any command fails or times out, the bootstrap is aborted and the node is marked as failed.
- On success, the node is ready to receive workloads.
Commands that exceed `command_timeout` are terminated with SIGKILL on the remote host. Because SIGKILL cannot be caught, a timed-out command gets no chance to clean up.
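The ordering and fail-fast behavior described above can be sketched as the following fragment. It assumes each entry is run by a POSIX-compatible shell and that a non-zero exit status is what counts as failure; neither is confirmed here. `set -e` and curl's `-f` flag are additions for this sketch.

```yaml
pools:
  training:
    provider: lambda
    instance_type: gpu_8x_h100_sxm5
    ssh_private_key_path: ~/.ssh/navarch-key
    setup_commands:
      # Runs first. `set -e` makes the script exit non-zero at the first
      # failing line, and `-f` makes curl exit non-zero on an HTTP error
      # instead of saving the error page. If this entry fails, the
      # bootstrap is aborted and the next entry never runs.
      - |
        set -e
        curl -fL https://github.com/NavarchProject/navarch/releases/latest/download/navarch-node-linux-amd64 \
          -o /usr/local/bin/navarch-node
        chmod +x /usr/local/bin/navarch-node
      # Runs second, only if the entry above succeeded.
      # Fails (and marks the node as failed) if the binary is not executable.
      - test -x /usr/local/bin/navarch-node
```

Without `set -e`, a multi-line entry reports only the exit status of its last line, so an earlier failure could go unnoticed.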
Timeouts
Configure timeouts based on your infrastructure:

```yaml
pools:
  training:
    provider: lambda
    instance_type: gpu_8x_h100_sxm5
    setup_commands:
      - ./long-running-setup.sh
    # Fast-booting instances with long setup scripts
    ip_wait_timeout: 5m
    ssh_timeout: 3m
    command_timeout: 30m
```
| Timeout | Default | Use case |
|---|---|---|
| `ip_wait_timeout` | 15m | Increase for slow cloud providers or complex networking |
| `ssh_timeout` | 10m | Decrease for pre-configured images with fast boot |
| `ssh_connect_timeout` | 30s | Increase for high-latency networks |
| `command_timeout` | 5m | Increase for large downloads or compilations |
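For a different situation, a provider that is slow to assign addresses combined with a high-latency path to the nodes, a sketch might look like the following. The values are illustrative rather than recommendations, and the script names are hypothetical. The closing comment gives a rough upper bound on bootstrap time, assuming the phases run strictly one after another and that individual connection attempts are counted within `ssh_timeout`.

```yaml
pools:
  training:
    provider: lambda
    instance_type: gpu_8x_h100_sxm5
    ssh_private_key_path: ~/.ssh/navarch-key
    setup_commands:
      - ./install-agent.sh      # hypothetical script
      - ./start-agent.sh        # hypothetical script
    ip_wait_timeout: 30m        # slow IP assignment
    ssh_timeout: 10m            # default, stated explicitly
    ssh_connect_timeout: 1m     # high-latency network
    command_timeout: 10m        # applies to each command separately
    # Rough worst case under the assumptions above:
    #   30m (IP) + 10m (SSH) + 2 commands x 10m = 60m
```

Because `command_timeout` applies per command, adding entries to `setup_commands` raises this bound even when no timeout value changes.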
The control plane logs detailed information about each bootstrap phase:
- SSH connection attempts and timing
- Each command executed with duration
- stdout/stderr output on failure
- Total bootstrap duration
Example: Full node setup

```yaml
defaults:
  ssh_user: ubuntu
  ssh_private_key_path: ~/.ssh/navarch-key
pools:
  training:
    provider: lambda
    instance_type: gpu_8x_h100_sxm5
    min_nodes: 2
    max_nodes: 20
    setup_commands:
      # Install NVIDIA drivers if not present
      - |
        if ! command -v nvidia-smi &> /dev/null; then
          apt-get update && apt-get install -y nvidia-driver-535
        fi
      # Download and install the node agent
      - |
        curl -L https://github.com/NavarchProject/navarch/releases/latest/download/navarch-node-linux-amd64 \
          -o /usr/local/bin/navarch-node
        chmod +x /usr/local/bin/navarch-node
      # Create systemd service
      - |
        cat > /etc/systemd/system/navarch-node.service << EOF
        [Unit]
        Description=Navarch Node Agent
        After=network.target
        [Service]
        ExecStart=/usr/local/bin/navarch-node --server {{.ControlPlane}} --node-id {{.NodeID}} --pool {{.Pool}}
        Restart=always
        RestartSec=10
        [Install]
        WantedBy=multi-user.target
        EOF
      # Start the agent
      - systemctl daemon-reload && systemctl enable navarch-node && systemctl start navarch-node
```
Comparison with other deployment methods
| Method | Use case |
|---|---|
| SSH bootstrap | Control plane manages agent installation. Good for managed fleets. |
| Custom images | Pre-bake agent into AMI/image. Fastest startup. |
| Cloud-init | Provider runs script at boot. No SSH needed. |
| Container | Run agent as Docker/K8s workload. |
See Deployment for details on each approach.