Node Lifecycle
Navarch tracks instances and nodes as separate concepts with distinct lifecycles.
Instances vs Nodes
- Instance: A cloud resource (what you pay for). Tracked from `Provision()` until termination.
- Node: A registered agent running on an instance. Created when the agent calls `RegisterNode`.
This separation matters because:
- Provisioning can fail: the instance is created but the agent never boots.
- Registration can fail: the instance is running but the agent crashes on startup.
- Costs accrue immediately: you pay for instances, not nodes.
Instance lifecycle
| State | Description |
|---|---|
| Provisioning | Cloud provider is creating the instance. |
| Pending Registration | Instance exists, waiting for node agent to register. |
| Running | Node agent has registered successfully. |
| Failed | Provisioning failed, or registration timed out. |
| Terminating | Termination requested, in progress. |
| Terminated | Instance destroyed by cloud provider. |
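As a rough sketch, the instance states above can be modeled as an enum with a table of allowed transitions. The state names come from the table. The transition set is an assumption inferred from the state descriptions, not a confirmed part of Navarch, and `can_transition` is a hypothetical helper.

```python
from enum import Enum


class InstanceState(Enum):
    PROVISIONING = "Provisioning"
    PENDING_REGISTRATION = "Pending Registration"
    RUNNING = "Running"
    FAILED = "Failed"
    TERMINATING = "Terminating"
    TERMINATED = "Terminated"


S = InstanceState

# Assumed transitions, inferred from the state descriptions.
# Navarch may allow more or fewer than these.
ALLOWED = {
    S.PROVISIONING: {S.PENDING_REGISTRATION, S.FAILED},
    S.PENDING_REGISTRATION: {S.RUNNING, S.FAILED},
    S.RUNNING: {S.TERMINATING},
    S.FAILED: {S.TERMINATING},
    S.TERMINATING: {S.TERMINATED},
    S.TERMINATED: set(),
}


def can_transition(src: InstanceState, dst: InstanceState) -> bool:
    """Return True if the assumed table allows moving from src to dst."""
    return dst in ALLOWED[src]
```

Under these assumptions, "Failed" is reachable from both "Provisioning" and "Pending Registration", matching the two failure causes listed in the table.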
Registration timeout
If an instance stays in "Pending Registration" longer than the registration timeout (default: 10 minutes), Navarch marks it as failed. This catches:
- Boot failures (kernel panic, driver issues)
- Network issues (instance can't reach control plane)
- Agent crashes (segfault before registration)
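The timeout check itself amounts to comparing elapsed time against a limit. The following is a minimal sketch of that logic, not Navarch's implementation. The function name and the `pending_since` parameter are hypothetical; only the 10-minute default comes from the text.

```python
from datetime import datetime, timedelta, timezone

# Default taken from the text above.
REGISTRATION_TIMEOUT = timedelta(minutes=10)


def registration_timed_out(
    pending_since: datetime,
    now: datetime,
    registered: bool,
    timeout: timedelta = REGISTRATION_TIMEOUT,
) -> bool:
    """Return True if an unregistered instance has waited longer than timeout.

    pending_since is assumed to be when the instance entered
    "Pending Registration".
    """
    if registered:
        return False
    return now - pending_since > timeout


if __name__ == "__main__":
    start = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(registration_timed_out(start, start + timedelta(minutes=11), False))
```

An instance whose agent registers late but within the limit is never marked failed, because the check short-circuits once `registered` is true.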
The timeout is configurable.
Node lifecycle
Active
The node is registered, healthy, and available for workloads. It sends heartbeats and health check results.
Cordoned
The node is marked unschedulable. New workloads cannot be placed on it, but existing workloads continue.
Use cordon for:
- Scheduled maintenance
- Investigating suspected issues
- Preparing for decommission
When a notifier is configured, Navarch notifies your workload system (e.g., Kubernetes, Slurm) to mark the node unschedulable. See Notifier Configuration.
See CLI Reference for details.
Draining
The node is evicting workloads and will be terminated. No new workloads are scheduled on it.
Use drain for:
- Decommissioning nodes
- Responding to hardware failures
- Forced node replacement
When a notifier is configured, Navarch notifies your workload system to evacuate workloads from the node. You can poll drain status to wait for completion before termination. See Notifier Configuration.
See CLI Reference for details.
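Polling drain status before termination could look roughly like the loop below. This is a sketch under assumptions: `is_drained` stands in for whatever call reports drain completion in your setup, and the timeout and interval defaults are illustrative, not Navarch defaults.

```python
import time
from typing import Callable


def wait_for_drain(
    is_drained: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_drained until it returns True or timeout_s elapses.

    Returns True if the drain completed, False if the wait timed out.
    """
    deadline = clock() + timeout_s
    while True:
        if is_drained():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Returning a boolean leaves the decision to the caller: proceed with termination when the drain completed, or escalate when it timed out.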
Terminated
The instance has been terminated by the provider. The node record remains for historical reference.
State transitions
Manual transitions
| Command | From | To |
|---|---|---|
| `cordon` | Active | Cordoned |
| `uncordon` | Cordoned | Active |
| `drain` | Active, Cordoned | Draining |
Automatic transitions
| Trigger | From | To |
|---|---|---|
| Health check failure | Active, Cordoned | Unhealthy |
| Health recovery | Unhealthy | Active |
| Auto-replacement | Unhealthy | Terminated |
| Scale-down | Active, Cordoned | Terminated |
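The manual and automatic transitions in this section can be read as one lookup from an event and a current state to a next state. The sketch below encodes both tables that way. The state names and the `cordon`, `uncordon`, and `drain` commands come from the tables; the trigger keys (such as `health_check_failure`) and the `next_state` function are hypothetical names chosen for illustration.

```python
# Each entry maps an event to (allowed source states, target state).
MANUAL = {
    "cordon": ({"Active"}, "Cordoned"),
    "uncordon": ({"Cordoned"}, "Active"),
    "drain": ({"Active", "Cordoned"}, "Draining"),
}

# Trigger keys are illustrative identifiers, not Navarch event names.
AUTOMATIC = {
    "health_check_failure": ({"Active", "Cordoned"}, "Unhealthy"),
    "health_recovery": ({"Unhealthy"}, "Active"),
    "auto_replacement": ({"Unhealthy"}, "Terminated"),
    "scale_down": ({"Active", "Cordoned"}, "Terminated"),
}

TRANSITIONS = {**MANUAL, **AUTOMATIC}


def next_state(current: str, event: str) -> str:
    """Return the state reached from current on event, or raise if invalid."""
    if event not in TRANSITIONS:
        raise KeyError(f"unknown event: {event!r}")
    allowed_from, target = TRANSITIONS[event]
    if current not in allowed_from:
        raise ValueError(f"{event!r} is not valid from state {current!r}")
    return target
```

Rejecting invalid combinations explicitly, for example `cordon` on a node that is already draining, keeps the tables as the single source of truth for what is allowed.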
Heartbeats and liveness
Nodes send heartbeats every 30 seconds (configurable). If heartbeats stop:
- After `heartbeat_timeout` (default: 2 minutes), the node is marked stale.
- Stale nodes are considered unhealthy.
- If auto-replace is enabled, stale nodes are terminated and replaced.
This handles cases where the node agent crashes or loses network connectivity.
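The staleness rule can be sketched as a comparison between the time since the last heartbeat and `heartbeat_timeout`. This is an illustration of the described behavior, not Navarch's code. The function names and the returned action strings are hypothetical; the 2-minute default comes from the text.

```python
from datetime import datetime, timedelta, timezone

# Default taken from the text above.
HEARTBEAT_TIMEOUT = timedelta(minutes=2)


def is_stale(
    last_heartbeat: datetime,
    now: datetime,
    timeout: timedelta = HEARTBEAT_TIMEOUT,
) -> bool:
    """Return True if no heartbeat has arrived within timeout."""
    return now - last_heartbeat > timeout


def liveness_action(
    last_heartbeat: datetime,
    now: datetime,
    auto_replace: bool,
    timeout: timedelta = HEARTBEAT_TIMEOUT,
) -> str:
    """Pick an illustrative action for a node based on heartbeat age."""
    if not is_stale(last_heartbeat, now, timeout):
        return "none"
    # Stale nodes are treated as unhealthy; replacement is opt-in.
    return "terminate_and_replace" if auto_replace else "mark_unhealthy"
```

With the 30-second heartbeat interval and the 2-minute default, a node has to miss several consecutive heartbeats before it is considered stale, so a single dropped heartbeat does not trigger replacement.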