Stress Testing#

The simulator includes a comprehensive stress testing framework for validating system behavior at scale with realistic failure patterns.

Overview#

Stress tests allow you to:

  • Simulate thousands of nodes simultaneously
  • Inject failures with realistic distributions based on production data
  • Test cascading failure scenarios
  • Simulate scheduled outages (zone, region, provider)
  • Measure system resilience and recovery
  • Generate detailed reports

Running stress tests#

# Run a stress test
./bin/simulator run scenarios/stress/1000-node-chaos.yaml -v

# Run with specific seed for reproducibility
./bin/simulator run scenarios/stress/1000-node-chaos.yaml --seed 12345 -v

# Validate before running
./bin/simulator validate scenarios/stress/5000-node-extreme.yaml

Configuration#

Stress tests use an extended scenario format with a stress section:

name: my-stress-test
description: Large-scale chaos testing

fleet: []  # Empty when using fleet_gen

stress:
  duration: 10m
  metrics_interval: 5s
  seed: 12345
  report_file: stress-report.json
  html_report_file: stress-report.html
  log_file: stress-report.log

  fleet_gen:
    # Fleet generation config...

  chaos:
    # Chaos engineering config...

Fleet generation#

Instead of defining individual nodes, generate fleets from templates:

fleet_gen:
  total_nodes: 1000

  templates:
    - name: h100-8gpu
      weight: 60
      gpu_count: 8
      gpu_type: "NVIDIA H100 80GB HBM3"
      instance_type: a3-highgpu-8g
      labels:
        tier: premium
    - name: a100-8gpu
      weight: 40
      gpu_count: 8
      gpu_type: "NVIDIA A100 80GB"
      instance_type: a2-ultragpu-8g

  providers:
    gcp: 50
    aws: 35
    lambda: 15

  regions:
    us-central1: 40
    us-east1: 30
    europe-west1: 30

  startup:
    pattern: exponential
    duration: 2m
    jitter_percent: 15

| Field | Description |
| --- | --- |
| total_nodes | Total number of nodes to generate |
| templates | Node templates with relative weights |
| providers | Provider distribution (percentages) |
| regions | Region distribution (percentages) |
| startup | How nodes join the cluster |
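
Template weights are relative shares of total_nodes, so weights of 60 and 40 yield roughly a 60/40 split. A minimal sketch of that arithmetic (the allocate helper is hypothetical, and the simulator's exact handling of fractional counts is an assumption):

# Illustrative only: how relative template weights could map to node counts.
# The simulator's actual rounding/remainder handling may differ.
def allocate(total_nodes: int, weights: dict[str, int]) -> dict[str, int]:
    total_weight = sum(weights.values())
    counts = {name: total_nodes * w // total_weight for name, w in weights.items()}
    # Hand any remainder from integer division to the heaviest templates.
    remainder = total_nodes - sum(counts.values())
    for name in sorted(weights, key=weights.get, reverse=True)[:remainder]:
        counts[name] += 1
    return counts

print(allocate(1000, {"h100-8gpu": 60, "a100-8gpu": 40}))
# {'h100-8gpu': 600, 'a100-8gpu': 400}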

Startup patterns#

| Pattern | Description |
| --- | --- |
| instant | All nodes start immediately |
| linear | Nodes start at constant rate |
| exponential | Start slow, accelerate (1, 2, 4, 8, ...) |
| wave | Start in batches with pauses |

startup:
  pattern: wave
  duration: 5m
  batch_size: 100
  jitter_percent: 20
  cold_start_min: 30s
  cold_start_max: 2m

Cold start delays simulate provisioning time. Use cold_start_min/cold_start_max for uniform distribution, or cold_start_mean/cold_start_stddev for normal distribution.
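
For example, an illustrative startup block drawing cold starts from a normal distribution, using the cold_start_mean/cold_start_stddev fields described above (the values are placeholders):

startup:
  pattern: linear
  duration: 2m
  cold_start_mean: 1m
  cold_start_stddev: 20s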

Chaos engineering#

Control failure injection:

chaos:
  enabled: true
  failure_rate: 10.0  # Failures per minute per 1000 nodes

  xid_distribution:  # XID code: relative weight
    13: 15  # Graphics Engine Exception
    31: 20  # GPU memory page fault
    48: 12  # Double Bit ECC Error
    79: 6   # GPU fallen off bus

  failure_types:
    - type: xid_error
      weight: 70
    - type: temperature
      weight: 10
    - type: nvml_failure
      weight: 8
    - type: network
      weight: 10
    - type: boot_failure
      weight: 2
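
Failure type weights are relative; they happen to sum to 100 above, so they can be read directly as percentages. A sketch of a weighted draw under that interpretation (illustrative, not the simulator's actual implementation):

import random

# Relative weights from the chaos config above; they need not sum to 100.
failure_types = {"xid_error": 70, "temperature": 10, "nvml_failure": 8,
                 "network": 10, "boot_failure": 2}

# random.choices performs one weighted draw from the population.
kind = random.choices(list(failure_types), weights=list(failure_types.values()))[0]
print(kind)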

Failure rate#

Failures per minute per 1000 nodes:

  • 1000 nodes with rate 10.0 = ~10 failures/minute
  • 5000 nodes with rate 10.0 = ~50 failures/minute
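
Since the rate is linear in fleet size and test length, the expected number of injections for a run follows directly (a back-of-the-envelope estimate; actual injections are randomized):

# Expected failure injections: rate is per minute per 1000 nodes.
def expected_failures(rate: float, nodes: int, minutes: float) -> float:
    return rate * (nodes / 1000) * minutes

print(expected_failures(10.0, 1000, 10))  # ~100 for a 10m, 1000-node test
print(expected_failures(50.0, 5000, 30))  # ~7500 for a 30m extreme test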

Failure types#

| Type | Description |
| --- | --- |
| xid_error | GPU XID error with specified distribution |
| temperature | Thermal throttling/shutdown |
| backend_error | GPU backend failure (alias: nvml_failure) |
| boot_failure | GPU boot/detection failure |
| network | Network connectivity loss |
| memory_error | ECC memory error |
| nvlink_error | NVLink communication error |

Cascading failures#

Simulate realistic failure propagation:

cascading:
  enabled: true
  probability: 0.15      # 15% chance a failure cascades
  max_depth: 3           # Maximum cascade chain length
  min_delay: 1s          # Min delay before a cascaded failure fires
  max_delay: 10s         # Max delay
  scope: zone            # Cascade scope
  max_affected_percent: 0.1  # At most 10% of in-scope nodes affected

Cascade scopes:

| Scope | Description |
| --- | --- |
| rack | Same rack (first 3 node ID segments match) |
| zone | Same availability zone |
| region | Same region |
| provider | Same cloud provider |
| random | Any node in cluster |
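
If each cascaded failure can itself cascade with the same probability (an assumption; the docs don't state how probability interacts with depth), the expected chain length is a truncated geometric series:

# Expected failures per initial failure under a simple geometric model,
# capped at max_depth hops. This model is an assumption, not the
# simulator's documented behavior.
def expected_chain(probability: float, max_depth: int) -> float:
    return sum(probability ** k for k in range(max_depth + 1))

print(expected_chain(0.15, 3))  # ~1.18 failures per initial failure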

Automatic recovery#

Configure recovery for non-fatal failures:

recovery:
  enabled: true
  probability: 0.7      # 70% of non-fatal errors recover
  mean_time: 5m
  std_dev: 2m
  replace_fatal: true   # Replace nodes with fatal errors
  replace_cold_start: 45s

Recovery only applies to non-fatal XID codes and recoverable failure types.

Scheduled outages#

Simulate planned or unplanned outages:

scheduled_outages:
  - name: zone-network-partition
    start_time: 10m
    duration: 5m
    scope: zone
    target: us-central1-a
    failure_type: network

  - name: provider-degradation
    start_time: 20m
    duration: 8m
    scope: provider
    target: lambda
    failure_type: xid_error

  - name: random-thermal-event
    start_time: 15m
    duration: 3m
    scope: percentage
    target: "10"
    failure_type: temperature

Outage scopes: zone, region, provider, percentage. For the percentage scope, target is the percent of the fleet to affect (as in the "10" example above).

Correlated failures#

Define failures that trigger related failures:

correlated_failures:
  - name: nvlink-gpu-cascade
    trigger: "74"           # NVLink error triggers this
    response: xid_error
    probability: 0.6
    delay: 1s
    scope: same_node

  - name: thermal-propagation
    trigger: temperature
    response: temperature
    probability: 0.4
    delay: 3s
    scope: same_rack

Correlation scopes: same_node, same_rack, same_zone, random

Reports#

Configure report outputs:

stress:
  report_file: stress-report.json
  html_report_file: stress-report.html
  log_file: stress-report.log

HTML report#

Interactive web visualization with:

  • Results tab: Summary statistics, failure breakdowns, interactive charts:
      • Node health over time
      • Failures vs recoveries
      • XID error distribution (pie chart)
      • Failure types breakdown (bar chart)
  • Configuration tab: Full test configuration

JSON report#

Structured data for programmatic analysis:

{
  "name": "1000-node-chaos-test",
  "duration": "10m0s",
  "summary": {
    "nodes_started": 1000,
    "peak_healthy_nodes": 1000,
    "min_healthy_nodes": 847,
    "total_failures": 98,
    "total_recoveries": 45
  },
  "failures": {
    "by_type": {"xid_error": 68, "temperature": 12},
    "by_xid": {"31": 15, "79": 8, "48": 7},
    "cascading_failures": 12
  }
}
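
A minimal sketch for pulling headline numbers out of the JSON report (field names follow the sample above; real reports may contain more):

import json

with open("stress-report.json") as f:
    report = json.load(f)

summary = report["summary"]
availability = summary["min_healthy_nodes"] / summary["nodes_started"]
print(f"{report['name']}: worst-case healthy fraction {availability:.1%}")
print(f"failures={summary['total_failures']}, recoveries={summary['total_recoveries']}")

# XID codes sorted by frequency, most common first.
for xid, count in sorted(report["failures"]["by_xid"].items(),
                         key=lambda kv: kv[1], reverse=True):
    print(f"XID {xid}: {count}")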

Log file#

Verbose debug output from all components. Useful for:

  • Debugging specific failure sequences
  • Post-mortem investigation
  • Providing context to LLMs for analysis

Example scenarios#

1000-node-chaos.yaml#

Standard chaos test:

  • 10 minute duration
  • Mixed H100/A100 fleet across GCP, AWS, Lambda
  • Realistic XID distribution
  • Cascading failures enabled
  • Automatic recovery

./bin/simulator run scenarios/stress/1000-node-chaos.yaml -v

5000-node-extreme.yaml#

Extreme stress test:

  • 30 minute duration
  • 5000 nodes across 8 regions
  • Aggressive failure rate (50/min/1000 nodes)
  • Multiple scheduled outages
  • High cascade probability

xid-comprehensive.yaml#

XID error testing:

  • All known XID codes tested equally
  • High recovery rate
  • No cascading (isolates XID behavior)

cascading-failures.yaml#

Cascade testing:

  • High cascade probability (50%)
  • Deep cascade chains (depth 5)
  • Scheduled outages that trigger cascades
  • Tests blast radius containment

Performance considerations#

| Node Count | Recommended Startup | Memory Usage |
| --- | --- | --- |
| 100-500 | linear, 30s | ~200MB |
| 500-1000 | linear, 1m | ~500MB |
| 1000-2000 | exponential, 2m | ~1GB |
| 2000-5000 | wave, 5m | ~2-3GB |
| 5000+ | wave, 10m+ | ~5GB+ |

Tips:

  • Start with smaller node counts (100-500) during development
  • Use --seed for debugging specific failure sequences
  • Monitor memory usage for very large fleets
  • Allow adequate startup time for large fleets
  • Use validate command to check scenario syntax