Node Bootstrap

Navarch can run setup commands on newly provisioned instances via SSH. This is useful for installing the node agent, configuring GPU drivers, or running custom initialization scripts.

Configuration

pools:
  training:
    provider: lambda
    instance_type: gpu_8x_h100_sxm5
    min_nodes: 2
    max_nodes: 20
    ssh_user: ubuntu
    ssh_private_key_path: ~/.ssh/navarch-key
    setup_commands:
      - |
        curl -L https://github.com/NavarchProject/navarch/releases/latest/download/navarch-node-linux-amd64 \
          -o /usr/local/bin/navarch-node && chmod +x /usr/local/bin/navarch-node
      - |
        navarch-node --server {{.ControlPlane}} --node-id {{.NodeID}} &

Bootstrap fields

Field	Required	Description
`setup_commands`	No	List of shell commands to run on the node after provisioning
`ssh_user`	No	SSH username (default: `ubuntu`)
`ssh_private_key_path`	Yes*	Path to SSH private key file
`ip_wait_timeout`	No	Max time to wait for instance IP (default: `15m`)
`ssh_timeout`	No	Max time to wait for SSH to become available (default: `10m`)
`ssh_connect_timeout`	No	Timeout for each SSH connection attempt (default: `30s`)
`command_timeout`	No	Max time for each command to execute (default: `5m`)

*Required when `setup_commands` is specified.

These fields can also be set in `defaults` to apply to all pools:

defaults:
  ssh_user: ubuntu
  ssh_private_key_path: ~/.ssh/navarch-key
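
The timeout fields from the table above are bootstrap fields too, so they can be shared the same way. A minimal sketch, assuming a fleet where every pool should use one set of timeouts; the values and the `./setup.sh` script are illustrative, not recommendations:

```yaml
defaults:
  ssh_user: ubuntu
  ssh_private_key_path: ~/.ssh/navarch-key
  # Timeouts shared by every pool (illustrative values)
  ssh_timeout: 5m
  command_timeout: 15m

pools:
  training:
    provider: lambda
    instance_type: gpu_8x_h100_sxm5
    setup_commands:
      - ./setup.sh   # hypothetical script, assumed to exist on the image
```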

Template variables

Setup commands support Go template syntax. The following variables are available:

Variable	Description	Example
`{{.ControlPlane}}`	Control plane URL	`http://control-plane.example.com:50051`
`{{.Pool}}`	Pool name	`training`
`{{.NodeID}}`	Unique node identifier	`node-abc123`
`{{.Provider}}`	Provider name	`lambda`
`{{.Region}}`	Region where node is provisioned	`us-west-2`
`{{.InstanceType}}`	Instance type	`gpu_8x_h100_sxm5`
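
As an illustration, the sketch below uses several of these variables in one setup command that records node metadata in a file. The path `/tmp/navarch-node.env` and the `NAVARCH_*` names are hypothetical, chosen only for this example. The comment shows what the file would contain after substitution with the example values from the table.

```yaml
pools:
  training:
    provider: lambda
    instance_type: gpu_8x_h100_sxm5
    ssh_private_key_path: ~/.ssh/navarch-key
    setup_commands:
      # With the example values above, the file would contain:
      #   NAVARCH_POOL=training
      #   NAVARCH_PROVIDER=lambda
      #   NAVARCH_REGION=us-west-2
      #   NAVARCH_INSTANCE_TYPE=gpu_8x_h100_sxm5
      - |
        cat > /tmp/navarch-node.env << EOF
        NAVARCH_POOL={{.Pool}}
        NAVARCH_PROVIDER={{.Provider}}
        NAVARCH_REGION={{.Region}}
        NAVARCH_INSTANCE_TYPE={{.InstanceType}}
        EOF
```

Substitution happens before the command is sent to the node, so the remote shell only ever sees the rendered text.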

How it works

1. Navarch provisions the instance and waits up to `ip_wait_timeout` for it to receive an IP address.
2. Once the IP is available, it retries SSH until a connection succeeds or `ssh_timeout` elapses; each attempt is limited by `ssh_connect_timeout`.
3. Once connected, it runs each setup command in order, allowing each one up to `command_timeout`.
4. If any command fails or times out, the bootstrap is aborted and the node is marked as failed.
5. On success, the node is ready to receive workloads.

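Commands that exceed `command_timeout` receive a SIGKILL signal on the remote host. SIGKILL cannot be caught or ignored, so a timed-out command gets no chance to clean up after itself.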
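
Because a failing command aborts the bootstrap, it helps to make failures visible. The sketch below assumes each list item is handed to the remote shell as a single script, which this page does not confirm. Under that assumption, a multi-line item reports only the exit status of its last line, so an earlier failure could go unnoticed unless the script starts with `set -e`. The `-f` flag makes `curl` return a non-zero status on HTTP errors instead of saving the error page as the binary.

```yaml
pools:
  training:
    provider: lambda
    instance_type: gpu_8x_h100_sxm5
    ssh_private_key_path: ~/.ssh/navarch-key
    setup_commands:
      - |
        set -e   # abort this script at the first failing line
        curl -fL https://github.com/NavarchProject/navarch/releases/latest/download/navarch-node-linux-amd64 \
          -o /usr/local/bin/navarch-node
        chmod +x /usr/local/bin/navarch-node
```

Splitting the steps into separate list items has a similar effect, since each item is checked individually.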

Timeouts

Configure timeouts based on your infrastructure:

pools:
  training:
    provider: lambda
    instance_type: gpu_8x_h100_sxm5
    setup_commands:
      - ./long-running-setup.sh

    # Fast-booting instances with long setup scripts
    ip_wait_timeout: 5m
    ssh_timeout: 3m
    command_timeout: 30m

Timeout	Default	Use case
`ip_wait_timeout`	15m	Increase for slow cloud providers or complex networking
`ssh_timeout`	10m	Decrease for pre-configured images with fast boot
`ssh_connect_timeout`	30s	Increase for high-latency networks
`command_timeout`	5m	Increase for large downloads or compilations
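
For the opposite situation, here is a sketch for a provider that is slow to assign addresses and is reached over a high-latency link. The values and the `./setup.sh` script are illustrative, not recommendations. The worst-case estimate in the comments assumes the phases run back to back, as described under How it works.

```yaml
pools:
  training:
    provider: lambda
    instance_type: gpu_8x_h100_sxm5
    setup_commands:
      - ./setup.sh   # hypothetical script

    # Slow provisioning over a high-latency network
    ip_wait_timeout: 30m
    ssh_connect_timeout: 1m

    # Rough worst case before the node is marked as failed, with the
    # remaining defaults (ssh_timeout 10m, command_timeout 5m, one command):
    #   30m + 10m + 1 x 5m = 45m
```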

The control plane logs detailed information about each bootstrap phase:

- SSH connection attempts and timing
- Each command executed with duration
- stdout/stderr output on failure
- Total bootstrap duration

Example: Full node setup

defaults:
  ssh_user: ubuntu
  ssh_private_key_path: ~/.ssh/navarch-key

pools:
  training:
    provider: lambda
    instance_type: gpu_8x_h100_sxm5
    min_nodes: 2
    max_nodes: 20
    setup_commands:
      # Install NVIDIA drivers if not present
      - |
        if ! command -v nvidia-smi > /dev/null 2>&1; then
          apt-get update && apt-get install -y nvidia-driver-535
        fi
      # Download and install the node agent
      - |
        curl -L https://github.com/NavarchProject/navarch/releases/latest/download/navarch-node-linux-amd64 \
          -o /usr/local/bin/navarch-node
        chmod +x /usr/local/bin/navarch-node
      # Create systemd service
      - |
        cat > /etc/systemd/system/navarch-node.service << EOF
        [Unit]
        Description=Navarch Node Agent
        After=network.target

        [Service]
        ExecStart=/usr/local/bin/navarch-node --server {{.ControlPlane}} --node-id {{.NodeID}} --pool {{.Pool}}
        Restart=always
        RestartSec=10

        [Install]
        WantedBy=multi-user.target
        EOF
      # Start the agent
      - systemctl daemon-reload && systemctl enable navarch-node && systemctl start navarch-node

Comparison with other deployment methods

Method	Use case
SSH bootstrap	Control plane manages agent installation. Good for managed fleets.
Custom images	Pre-bake agent into AMI/image. Fastest startup.
Cloud-init	Provider runs script at boot. No SSH needed.
Container	Run agent as Docker/K8s workload.

See Deployment for details on each approach.