Node Lifecycle#

Navarch tracks instances and nodes as separate concepts with distinct lifecycles.

Instances vs Nodes#

Instance: A cloud resource (what you pay for). Tracked from Provision() until termination.
Node: A registered agent running on an instance. Created when the agent calls RegisterNode.

This separation matters because:

Provisioning can fail - Instance created but agent never boots.
Registration can fail - Instance running but agent crashes on startup.
Costs accrue immediately - You pay for instances, not nodes.

Instance lifecycle#

Provisioning → Pending Registration → Running → Terminating → Terminated
                        ↓
                      Failed

State	Description
Provisioning	Cloud provider is creating the instance.
Pending Registration	Instance exists, waiting for node agent to register.
Running	Node agent has registered successfully.
Failed	Provisioning failed, or registration timed out.
Terminating	Termination requested, in progress.
Terminated	Instance destroyed by cloud provider.

Registration timeout#

If an instance stays in "Pending Registration" too long (default: 10 minutes), Navarch marks it as failed. This catches:

Boot failures (kernel panic, driver issues)
Network issues (instance can't reach control plane)
Agent crashes (segfault before registration)

Configure the timeout:

instance_manager:
  registration_timeout: 10m
  stale_check_interval: 1m

Node lifecycle#

Active → Cordoned → Draining → Terminated
   ↑         ↓
   └─────────┘
    (uncordon)

Active#

The node is registered, healthy, and available for workloads. It sends heartbeats and health check results.

Cordoned#

The node is marked unschedulable. New workloads cannot be placed on it, but existing workloads continue.

Use cordon for:

Scheduled maintenance
Investigating suspected issues
Preparing for decommission

navarch cordon node-1

When a notifier is configured, Navarch notifies your workload system (e.g., Kubernetes, Slurm) to mark the node unschedulable. See Notifier Configuration.

See CLI Reference for details.

Draining#

The node is evicting workloads and will be terminated. No new workloads scheduled.

Use drain for:

Decommissioning nodes
Responding to hardware failures
Forced node replacement

navarch drain node-1

When a notifier is configured, Navarch notifies your workload system to evacuate workloads from the node. You can poll drain status to wait for completion before termination. See Notifier Configuration.

See CLI Reference for details.

Terminated#

The instance has been terminated by the provider. The node record remains for historical reference.

State transitions#

Manual transitions#

Command	From	To
`cordon`	Active	Cordoned
`uncordon`	Cordoned	Active
`drain`	Active, Cordoned	Draining

Automatic transitions#

Trigger	From	To
Health check failure	Active, Cordoned	Unhealthy
Health recovery	Unhealthy	Active
Auto-replacement	Unhealthy	Terminated
Scale-down	Active, Cordoned	Terminated

Heartbeats and liveness#

Nodes send heartbeats every 30 seconds (configurable). If heartbeats stop:

After heartbeat_timeout (default: 2 minutes), node is marked stale.
Stale nodes are considered unhealthy.
If auto-replace is enabled, stale nodes are terminated and replaced.

This handles cases where the node agent crashes or loses network connectivity.