Last indexed: 7 May 2026 (2e12c1)

Common Configuration Errors

Purpose: This page documents common configuration mistakes in AReaL and how to diagnose and fix them. It covers errors in YAML configs, CLI arguments, dataclass validation failures, and engine-specific incompatibilities.

Scope: This page focuses on configuration-time errors. For runtime errors like OOM or distributed training issues, see Memory and OOM Issues and Debugging Distributed Training. For performance optimization, see Performance Optimization Guide.

Configuration System Overview

AReaL uses Hydra to compose configurations from YAML files and CLI overrides into strongly-typed dataclasses defined in areal/api/cli_args.py. Understanding this hierarchy helps diagnose configuration errors.
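
To illustrate why a misspelled key or a badly typed value fails at configuration time rather than at runtime, here is a minimal standard-library sketch of applying dotted CLI-style overrides to nested dataclasses. It is not Hydra's implementation and does not use AReaL's classes: `EngineConfig`, `ExperimentConfig`, and `apply_override` are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field, fields, is_dataclass


@dataclass
class EngineConfig:  # hypothetical stand-in, not an AReaL class
    backend: str = "fsdp"


@dataclass
class ExperimentConfig:  # hypothetical stand-in, not an AReaL class
    seed: int = 1
    actor: EngineConfig = field(default_factory=EngineConfig)


def _get_field(node, name, full_key):
    """Return node.<name>, rejecting keys that the dataclass does not declare."""
    if not is_dataclass(node) or name not in {f.name for f in fields(node)}:
        raise KeyError(f"unknown config key: {full_key!r}")
    return getattr(node, name)


def apply_override(cfg, override):
    """Apply one `dotted.key=value` override in place."""
    key, sep, raw = override.partition("=")
    if not sep:
        raise ValueError(f"override must look like key=value, got {override!r}")
    *parents, leaf = key.split(".")
    node = cfg
    for name in parents:
        node = _get_field(node, name, key)
    current = _get_field(node, leaf, key)
    # Coerce the raw string to the type of the existing value (str and int here).
    setattr(node, leaf, type(current)(raw))
```

With this sketch, `apply_override(cfg, "actor.backend=megatron")` succeeds, `apply_override(cfg, "actor.backnd=fsdp")` raises `KeyError`, and `apply_override(cfg, "seed=abc")` raises `ValueError` from the `int` conversion.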

Configuration Data Flow

[Diagram: Configuration Hierarchy and Validation Flow]

Error Category 1: AllocationMode and Backend Errors

The allocation_mode string was historically used to specify how GPUs are allocated. While deprecated in favor of per-engine backend fields, it remains a common source of errors in legacy configurations.

Valid Syntax Pattern (Legacy)

<backend>:d<device_count>[+<backend>:d<device_count>]

Examples:

sglang:d4+fsdp:d4 - 4 GPUs for SGLang, 4 for FSDP
vllm:d2+megatron:d6 - 2 GPUs for vLLM, 6 for Megatron

Common Mistakes

Error Pattern	Problem	Fix
`sglang:4+fsdp:4`	Missing `d` prefix	Use `sglang:d4+fsdp:d4`
`sglang:d4,fsdp:d4`	Wrong separator (`,` instead of `+`)	Use `+` to separate components
`invalid:d4+fsdp:d4`	Unknown backend name	Valid: `sglang`, `vllm`, `fsdp`, `megatron`, `archon`
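
The three mistakes in the table can be caught before launch with a small checker. The sketch below covers only the simplified `<backend>:d<device_count>` pattern shown on this page; it is not AReaL's own parser, and `check_allocation_mode` is a hypothetical helper name.

```python
import re

# Backend names listed in the table above.
VALID_BACKENDS = {"sglang", "vllm", "fsdp", "megatron", "archon"}

# One component: a lowercase backend name, a colon, then `d` and a device count.
_COMPONENT = re.compile(r"^(?P<backend>[a-z]+):d(?P<count>\d+)$")


def check_allocation_mode(mode):
    """Return [(backend, device_count), ...] or raise ValueError."""
    components = []
    for part in mode.split("+"):
        match = _COMPONENT.match(part)
        if match is None:
            raise ValueError(
                f"malformed component {part!r}; expected <backend>:d<device_count>"
            )
        backend = match["backend"]
        if backend not in VALID_BACKENDS:
            raise ValueError(
                f"unknown backend {backend!r}; valid: {sorted(VALID_BACKENDS)}"
            )
        components.append((backend, int(match["count"])))
    return components
```

Here `check_allocation_mode("sglang:d4+fsdp:d4")` returns `[("sglang", 4), ("fsdp", 4)]`, while each of the three error patterns in the table raises `ValueError`.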

Modern Backend Configuration

Modern configurations should define backends directly within the engine specs (e.g., actor.backend, rollout.backend).

Error Category 2: Parallelism Strategy Validation Failures

Parallelism dimensions must satisfy arithmetic constraints. In FSDPEngine, MegatronEngine, and ArchonEngine, the product of the parallel dimensions must equal the world size.

Parallelism Constraint Formula

world_size = dp * tp * pp * cp * ep

Common Mistakes

Configuration	Problem	Error Message
`dp=2, tp=2, pp=2` on 4 GPUs	Product is 8, need 4	`ValueError: Product of parallel dimensions (8) != world_size (4)`
`pp=4` with FSDP	FSDP doesn't support PP	`ValueError: FSDPEngine does not support pipeline_parallel_size > 1`
`tp=3` for 32-head model	Head count (32) must be divisible by TP	`ValueError: Number of heads (32) must be divisible by tp_size (3)`
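
The product constraint is easy to check by hand before submitting a job. The following is a minimal sketch of that arithmetic, assuming all five dimensions multiply into the world size as in the formula above; `check_parallel_dims` is a hypothetical helper, not an AReaL function, and the error text simply mirrors the message quoted in the table.

```python
from math import prod


def check_parallel_dims(world_size, *, dp=1, tp=1, pp=1, cp=1, ep=1):
    """Raise ValueError unless dp * tp * pp * cp * ep == world_size."""
    dims = {"dp": dp, "tp": tp, "pp": pp, "cp": cp, "ep": ep}
    for name, size in dims.items():
        if size < 1:
            raise ValueError(f"{name} must be >= 1, got {size}")
    product = prod(dims.values())
    if product != world_size:
        raise ValueError(
            f"Product of parallel dimensions ({product}) != world_size ({world_size})"
        )
    return product
```

For example, `check_parallel_dims(8, dp=2, tp=2, pp=2)` passes, whereas the same dimensions with `world_size=4` raise the error from the first table row.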

Attention Head Constraints

Special validation exists for tp_size and cp_size regarding model attention heads:

n_heads must be divisible by tp_size.
For Ulysses Context Parallelism, local heads after TP must be divisible by cp_size.
ArchonEngine validates these constraints via ArchonParallelDims and ModelSpec to ensure proper tensor sharding across the device mesh.
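
The two head-count rules above can be sketched as plain integer checks. This is an illustration of the divisibility logic only, not the code in ArchonParallelDims or ModelSpec; `check_head_sharding` is a hypothetical name.

```python
def check_head_sharding(n_heads, tp_size, cp_size=1):
    """Return the heads handled per rank after TP and Ulysses CP, or raise ValueError."""
    if n_heads % tp_size != 0:
        raise ValueError(
            f"Number of heads ({n_heads}) must be divisible by tp_size ({tp_size})"
        )
    # Heads left on each rank once tensor parallelism has split them.
    local_heads = n_heads // tp_size
    if local_heads % cp_size != 0:
        raise ValueError(
            f"Local heads after TP ({local_heads}) must be divisible by cp_size ({cp_size})"
        )
    return local_heads // cp_size
```

With 32 heads, `tp_size=4, cp_size=2` is valid (8 local heads, 4 per rank under CP), `tp_size=3` fails the first check, and `tp_size=8, cp_size=8` fails the second because only 4 local heads remain.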

Error Category 3: MicroBatchSpec and Packing Errors

MicroBatchSpec controls how batches are split and packed during training. Incorrect settings cause OOM or load imbalance.

Configuration Fields

Field	Purpose	Common Error
`n_mbs`	Number of micro-batches	Too small → OOM
`max_tokens_per_mb`	Max tokens per micro-batch	Not set → large batches OOM
`packing_algorithm`	Strategy for bin packing	`ffd` can cause load imbalance

Sequence Packing Algorithms

AReaL supports two primary algorithms:

FFD (First Fit Decreasing): Default, fast but can lead to uneven token distribution across ranks.
KK (Karmarkar-Karp): Recommended for large-scale RL with variable-length sequences to ensure better load balance.

Failure to use kk in highly variable-length scenarios often results in "straggler" ranks that delay the entire distributed step.
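
To make the imbalance concrete, here is a textbook First Fit Decreasing packer under a per-micro-batch token cap. It is a generic sketch of the FFD technique, not AReaL's packing code, and it does not implement Karmarkar-Karp; `ffd_pack` and its arguments are names chosen for this example.

```python
def ffd_pack(seq_lens, max_tokens_per_mb):
    """Pack sequence lengths into bins of at most max_tokens_per_mb tokens."""
    bins = []   # sequence lengths assigned to each micro-batch
    loads = []  # running token total of each micro-batch
    for length in sorted(seq_lens, reverse=True):
        if length > max_tokens_per_mb:
            raise ValueError(
                f"sequence of {length} tokens exceeds max_tokens_per_mb={max_tokens_per_mb}"
            )
        # First fit: place the sequence in the earliest bin with enough room.
        for i, load in enumerate(loads):
            if load + length <= max_tokens_per_mb:
                bins[i].append(length)
                loads[i] += length
                break
        else:
            # No existing bin had room, so open a new one.
            bins.append([length])
            loads.append(length)
    return bins
```

Packing lengths `[900, 700, 400, 300, 200, 100]` with a cap of 1000 yields `[[900, 100], [700, 300], [400, 200]]`: two bins are full at 1000 tokens while the third holds only 600, which is the kind of uneven distribution described above.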

Error Category 4: Optimizer and Normalization Errors

Optimizer Selection Logic

[Diagram: Optimizer Configuration Flow]

Normalization Levels

NormConfig validates that mean_level and std_level are within {"batch", "group", None}.

Common Error: Setting mean_level="group" without defining a valid group_size.
Default Change: As of v0.3.4, std_unbiased defaults to True.
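
The validation described above can be pictured as a dataclass `__post_init__` check. The class below is a hypothetical sketch, not AReaL's NormConfig: the field defaults other than `std_unbiased=True` are assumptions, and the error strings are modeled on the message fragments listed later on this page.

```python
from dataclasses import dataclass
from typing import Optional

_LEVELS = ("batch", "group", None)


@dataclass
class NormConfigSketch:  # hypothetical stand-in, not the real NormConfig
    mean_level: Optional[str] = "batch"  # assumed default
    std_level: Optional[str] = "batch"   # assumed default
    group_size: Optional[int] = None     # assumed default
    std_unbiased: bool = True            # default stated above (as of v0.3.4)

    def __post_init__(self):
        if self.mean_level not in _LEVELS:
            raise ValueError("mean_level must be 'batch', 'group' or None")
        if self.std_level not in _LEVELS:
            raise ValueError("std_level must be 'batch', 'group' or None")
        # Group-level statistics need a usable group size.
        uses_group = "group" in (self.mean_level, self.std_level)
        if uses_group and (
            not isinstance(self.group_size, int) or self.group_size < 1
        ):
            raise ValueError("group_size must be a positive integer")
```

In this sketch, `NormConfigSketch(mean_level="group")` fails because no group size is given, while `NormConfigSketch(mean_level="group", group_size=8)` constructs cleanly.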

Error Category 5: Generation Hyperparameter Errors

Sampling Constraints

temperature: Defaults to 1.0; setting to 0.0 is not supported (use greedy=True instead).
top_p: Must be in the range (0.0, 1.0].
max_new_tokens: Defaults to 16384. Setting this too high can cause inference engine timeouts or OOM.
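
These constraints can be expressed as a short pre-launch check. The function below is a sketch under assumptions: `check_sampling` is a hypothetical helper, the `top_p=1.0` default and the `max_new_tokens >= 1` rule are not stated on this page, and the real validation may differ.

```python
def check_sampling(temperature=1.0, top_p=1.0, max_new_tokens=16384):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if temperature <= 0.0:
        problems.append("temperature must be > 0; use greedy=True for greedy decoding")
    # top_p lives in the half-open interval (0.0, 1.0].
    if not (0.0 < top_p <= 1.0):
        problems.append("top_p must be in the range (0.0, 1.0]")
    if max_new_tokens < 1:  # assumed lower bound, not stated in the source
        problems.append("max_new_tokens must be a positive integer")
    return problems
```

Calling `check_sampling()` with the defaults returns an empty list, and `check_sampling(temperature=0.0, top_p=1.5)` reports two problems.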

Stop Word Configuration

stop_token_ids: List of integers.
stop: List of strings.

Mixing these up, or providing a single string instead of a list in YAML, will cause parsing errors.
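
A type check along these lines catches both mistakes once the YAML has been loaded into Python values. It is a sketch only; `check_stop_fields` is a hypothetical helper and the messages are not AReaL's.

```python
def check_stop_fields(stop_token_ids=None, stop=None):
    """Return a list of problems with the stop-related fields (empty if none)."""
    problems = []
    if stop_token_ids is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(stop_token_ids, list) or not all(
            isinstance(t, int) and not isinstance(t, bool) for t in stop_token_ids
        ):
            problems.append("stop_token_ids must be a list of integers")
    if stop is not None:
        if isinstance(stop, str):
            problems.append("stop must be a list of strings, not a single string")
        elif not isinstance(stop, list) or not all(isinstance(s, str) for s in stop):
            problems.append("stop must be a list of strings")
    return problems
```

For instance, `check_stop_fields(stop="</answer>")` flags the bare string, and `check_stop_fields(stop_token_ids=["</s>"])` flags a string placed where token IDs belong.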

Validation and Debugging Strategies

Pre-Flight Checklist

[Diagram: Configuration Pre-flight Validation]

Common Error Message Patterns

Error Message Fragment	Likely Cause
`mean_level must be 'batch', 'group' or None`	Incorrect `NormConfig` value
`group_size must be a positive integer`	`group_size < 1` with group normalization
`packing_algorithm must be one of ['ffd', 'kk']`	Invalid algorithm name in `MicroBatchSpec`
`n_mbs_divisor` violation	World size not divisible by requested micro-batches
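
When triaging logs, the table above can be turned into a simple substring lookup. This is a convenience sketch, not part of AReaL; `KNOWN_PATTERNS` and `likely_cause` are hypothetical names, and the fragments are shortened versions of the entries in the table.

```python
# (message fragment, likely cause) pairs taken from the table above.
KNOWN_PATTERNS = [
    ("mean_level must be", "Incorrect NormConfig value"),
    ("group_size must be a positive integer", "group_size < 1 with group normalization"),
    ("packing_algorithm must be one of", "Invalid algorithm name in MicroBatchSpec"),
    ("n_mbs_divisor", "World size not divisible by requested micro-batches"),
]


def likely_cause(message):
    """Return the likely cause for an error message, or None if no fragment matches."""
    for fragment, cause in KNOWN_PATTERNS:
        if fragment in message:
            return cause
    return None
```

For example, `likely_cause("ValueError: packing_algorithm must be one of ['ffd', 'kk']")` returns `"Invalid algorithm name in MicroBatchSpec"`.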

URL: https://deepwiki.com/inclusionAI/AReaL/16.1-common-configuration-errors