URL: https://deepwiki.com/inclusionAI/AReaL/16.1-common-configuration-errors

Last indexed: 7 May 2026 (2e12c1)
Common Configuration Errors

Purpose: This page documents common configuration mistakes in AReaL and how to diagnose and fix them. It covers errors in YAML configs, CLI arguments, dataclass validation failures, and engine-specific incompatibilities.

Scope: This page focuses on configuration-time errors. For runtime errors such as OOM or distributed training issues, see Memory and OOM Issues and Debugging Distributed Training. For performance optimization, see Performance Optimization Guide.


Configuration System Overview

AReaL uses Hydra to compose YAML files and CLI overrides into strongly typed dataclasses defined in areal/api/cli_args.py. Understanding this hierarchy helps diagnose configuration errors.

Configuration Data Flow

[Diagram: Configuration Hierarchy and Validation Flow]




Error Category 1: AllocationMode and Backend Errors

The allocation_mode string was historically used to specify how GPUs are allocated. Although deprecated in favor of per-engine backend fields, it remains a common source of errors in legacy configurations.

Valid Syntax Pattern (Legacy)

<backend>:d<device_count>[+<backend>:d<device_count>]

Examples:

  • sglang:d4+fsdp:d4 - 4 GPUs for SGLang, 4 for FSDP
  • vllm:d2+megatron:d6 - 2 GPUs for vLLM, 6 for Megatron

Common Mistakes

| Error Pattern | Problem | Fix |
|---|---|---|
| sglang:4+fsdp:4 | Missing d prefix | Use sglang:d4+fsdp:d4 |
| sglang:d4,fsdp:d4 | Wrong separator (, instead of +) | Use + to separate components |
| invalid:d4+fsdp:d4 | Unknown backend name | Valid: sglang, vllm, fsdp, megatron, archon |
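
The mistakes in the table can be caught before launch with a small pre-check. The following is a minimal sketch that covers only the legacy pattern shown on this page; the function name check_allocation_mode and the regular expression are illustrative assumptions, not AReaL's actual parser.

```python
import re

# Backend names come from the table above. The regex is an illustrative
# assumption and does not reproduce AReaL's real allocation-mode grammar.
VALID_BACKENDS = {"sglang", "vllm", "fsdp", "megatron", "archon"}
_COMPONENT = re.compile(r"^([a-z]+):d(\d+)$")


def check_allocation_mode(mode: str) -> list[tuple[str, int]]:
    """Return (backend, device_count) pairs, or raise ValueError."""
    parsed = []
    for component in mode.split("+"):
        match = _COMPONENT.match(component)
        if match is None:
            # Catches a missing "d" prefix and a "," used instead of "+".
            raise ValueError(
                f"Malformed component {component!r}: "
                "expected <backend>:d<device_count>"
            )
        backend, count = match.group(1), int(match.group(2))
        if backend not in VALID_BACKENDS:
            raise ValueError(
                f"Unknown backend {backend!r}; valid: {sorted(VALID_BACKENDS)}"
            )
        if count < 1:
            raise ValueError(f"device_count must be >= 1, got {count}")
        parsed.append((backend, count))
    return parsed


if __name__ == "__main__":
    print(check_allocation_mode("sglang:d4+fsdp:d4"))
    for bad in ("sglang:4+fsdp:4", "sglang:d4,fsdp:d4", "invalid:d4+fsdp:d4"):
        try:
            check_allocation_mode(bad)
        except ValueError as err:
            print(f"{bad} -> {err}")
```

Splitting on + first means a comma-separated string arrives as a single malformed component, so the error message points at the wrong separator rather than at a backend name.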

Modern Backend Configuration

Modern configurations should define backends directly within the engine specs (e.g., actor.backend, rollout.backend).



Error Category 2: Parallelism Strategy Validation Failures

Parallelism dimensions must satisfy arithmetic constraints. In FSDPEngine, MegatronEngine, and ArchonEngine, the product of the parallel dimensions must equal the world size; a mismatch fails validation.

Parallelism Constraint Formula

world_size = dp * tp * pp * cp * ep

Common Mistakes

| Configuration | Problem | Error Message |
|---|---|---|
| dp=2, tp=2, pp=2 on 4 GPUs | Product is 8, need 4 | ValueError: Product of parallel dimensions (8) != world_size (4) |
| pp=4 with FSDP | FSDP doesn't support PP | ValueError: FSDPEngine does not support pipeline_parallel_size > 1 |
| tp=3 for 32-head model | Head count (32) is not divisible by TP (3) | ValueError: Number of heads (32) must be divisible by tp_size (3) |
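
The product constraint is easy to check by hand before submitting a job. Below is a minimal sketch under stated assumptions: check_parallel_dims and its engine argument are hypothetical names, the error text mirrors the messages quoted on this page, and the real checks live inside the AReaL engines.

```python
from math import prod


def check_parallel_dims(world_size, *, dp=1, tp=1, pp=1, cp=1, ep=1,
                        engine="megatron"):
    """Raise ValueError if the parallel layout cannot fit world_size.

    Hypothetical helper: mirrors the formula
    world_size = dp * tp * pp * cp * ep from this page.
    """
    dims = {"dp": dp, "tp": tp, "pp": pp, "cp": cp, "ep": ep}
    for name, value in dims.items():
        if value < 1:
            raise ValueError(f"{name} must be >= 1, got {value}")
    # Per this page, FSDP does not support pipeline parallelism.
    if engine == "fsdp" and pp > 1:
        raise ValueError("FSDP does not support pipeline parallelism (pp > 1)")
    product = prod(dims.values())
    if product != world_size:
        raise ValueError(
            f"Product of parallel dimensions ({product}) "
            f"!= world_size ({world_size})"
        )


if __name__ == "__main__":
    check_parallel_dims(8, dp=2, tp=2, pp=2)      # fits: 2 * 2 * 2 == 8
    try:
        check_parallel_dims(4, dp=2, tp=2, pp=2)  # 8 != 4
    except ValueError as err:
        print(err)
```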

Attention Head Constraints

Special validation exists for tp_size and cp_size regarding model attention heads:

  • n_heads must be divisible by tp_size.
  • For Ulysses Context Parallelism, the local heads remaining after TP must be divisible by cp_size.
  • ArchonEngine validates these constraints via ArchonParallelDims and ModelSpec to ensure proper tensor sharding across the device mesh.
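
The two divisibility rules above can be sketched as a standalone check. The function name check_attention_heads is hypothetical, and the sketch does not use ArchonParallelDims or ModelSpec; it only restates the arithmetic.

```python
def check_attention_heads(n_heads: int, tp_size: int = 1, cp_size: int = 1) -> int:
    """Return the number of heads per TP rank, or raise ValueError.

    Hypothetical helper restating the rules listed above.
    """
    if n_heads % tp_size != 0:
        raise ValueError(
            f"Number of heads ({n_heads}) must be divisible by tp_size ({tp_size})"
        )
    local_heads = n_heads // tp_size
    # Ulysses context parallelism splits the local heads again across cp ranks.
    if local_heads % cp_size != 0:
        raise ValueError(
            f"Local heads after TP ({local_heads}) must be divisible "
            f"by cp_size ({cp_size})"
        )
    return local_heads


if __name__ == "__main__":
    print(check_attention_heads(32, tp_size=2, cp_size=4))  # 16 local heads
    try:
        check_attention_heads(32, tp_size=8, cp_size=8)     # 4 local heads, cp=8
    except ValueError as err:
        print(err)
```

The second call shows a case that passes the TP rule but fails the CP rule: 32 heads over tp_size=8 leaves 4 local heads, which cannot be split 8 ways.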



Error Category 3: MicroBatchSpec and Packing Errors

MicroBatchSpec controls how batches are split and packed during training. Incorrect settings cause OOM or load imbalance.

Configuration Fields

| Field | Purpose | Common Error |
|---|---|---|
| n_mbs | Number of micro-batches | Too small → OOM |
| max_tokens_per_mb | Max tokens per micro-batch | Not set → large batches OOM |
| packing_algorithm | Strategy for bin packing | ffd can cause load imbalance |

Sequence Packing Algorithms

AReaL supports two primary algorithms:

  1. FFD (First Fit Decreasing): Default, fast but can lead to uneven token distribution across ranks.
  2. KK (Karmarkar-Karp): Recommended for large-scale RL with variable-length sequences to ensure better load balance.

When sequence lengths vary widely, staying on ffd instead of kk often produces "straggler" ranks that delay the entire distributed step.
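
To make the difference concrete, here is a generic sketch of both techniques on a toy list of sequence lengths. It is a textbook-style illustration, not AReaL's implementation: ffd_pack and kk_partition are hypothetical names, ffd_pack fills bins up to a token cap, and kk_partition splits items into a fixed number of groups using the largest differencing method.

```python
import heapq
from itertools import count


def ffd_pack(lengths, max_tokens):
    """First Fit Decreasing: place each sequence in the first bin with room."""
    bins = []  # each entry is [load, [sequence indices]]
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for idx in order:
        n = lengths[idx]
        if n > max_tokens:
            raise ValueError(f"sequence {idx} ({n} tokens) exceeds {max_tokens}")
        for b in bins:
            if b[0] + n <= max_tokens:
                b[0] += n
                b[1].append(idx)
                break
        else:
            bins.append([n, [idx]])
    return [b[1] for b in bins]


def kk_partition(lengths, k):
    """Karmarkar-Karp (largest differencing) split into k balanced groups."""
    if k < 1:
        raise ValueError("k must be >= 1")
    if not lengths:
        return [[] for _ in range(k)]
    tie = count()  # unique tiebreaker so the heap never compares bin lists
    heap = []
    for idx, n in enumerate(lengths):
        # One partial solution per item: k bins sorted by load, descending.
        bins = [(n, [idx])] + [(0, []) for _ in range(k - 1)]
        heapq.heappush(heap, (-n, next(tie), bins))
    while len(heap) > 1:
        _, _, a = heapq.heappop(heap)
        _, _, b = heapq.heappop(heap)
        # Pair the heaviest bins of one solution with the lightest of the other.
        merged = [(sa + sb, ia + ib) for (sa, ia), (sb, ib) in zip(a, reversed(b))]
        merged.sort(key=lambda item: item[0], reverse=True)
        spread = merged[0][0] - merged[-1][0]
        heapq.heappush(heap, (-spread, next(tie), merged))
    return [items for _, items in heap[0][2]]


if __name__ == "__main__":
    lengths = [8, 7, 6, 5, 4]
    for name, groups in (("ffd", ffd_pack(lengths, 12)),
                         ("kk", kk_partition(lengths, 2))):
        print(name, [sum(lengths[i] for i in g) for g in groups])
```

On this toy input FFD with a cap of 12 yields loads of 12, 12, and 6, while KK with two groups yields 16 and 14. The two functions answer different questions (fill to a cap versus balance across a fixed count), which is why the choice matters most when sequence lengths vary.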



Error Category 4: Optimizer and Normalization Errors

Optimizer Selection Logic

[Diagram: Optimizer Configuration Flow]


Normalization Levels

NormConfig validates that mean_level and std_level are each one of {"batch", "group", None}.

  • Common Error: Setting mean_level="group" without defining a valid group_size.
  • Default Change: As of v0.3.4, std_unbiased defaults to True.
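
The level and group_size rules can be sketched as a standalone check. This is an assumption-laden illustration rather than the NormConfig source: check_norm_config is a hypothetical function, the mean_level and group_size messages follow the fragments quoted on this page, and the std_level message is assumed to be analogous.

```python
VALID_LEVELS = ("batch", "group", None)


def check_norm_config(mean_level, std_level, group_size=None):
    """Raise ValueError for the normalization mistakes described above.

    Hypothetical stand-in for the validation NormConfig performs.
    """
    if mean_level not in VALID_LEVELS:
        raise ValueError("mean_level must be 'batch', 'group' or None")
    if std_level not in VALID_LEVELS:
        # Assumed to mirror the mean_level message.
        raise ValueError("std_level must be 'batch', 'group' or None")
    # Group-level statistics need a usable group size.
    if "group" in (mean_level, std_level):
        if not isinstance(group_size, int) or group_size < 1:
            raise ValueError("group_size must be a positive integer")


if __name__ == "__main__":
    check_norm_config("batch", "batch")               # fine
    check_norm_config("group", "batch", group_size=8) # fine
    try:
        check_norm_config("group", None)              # group_size missing
    except ValueError as err:
        print(err)
```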



Error Category 5: Generation Hyperparameter Errors

Sampling Constraints

  • temperature: Defaults to 1.0; setting it to 0.0 is not supported (use greedy=True instead).
  • top_p: Must be in the range (0.0, 1.0].
  • max_new_tokens: Defaults to 16384. Setting this too high can cause inference engine timeouts or OOM.

Stop Word Configuration

  • stop_token_ids: List of integers.
  • stop: List of strings.

Mixing these up, or providing a single string instead of a list in YAML, will cause parsing errors.
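
The sampling and stop-word rules in this section can be collected into one pre-check. The sketch below is illustrative only: check_generation_params is a hypothetical helper, its parameter names follow the fields listed above, and skipping the temperature check when greedy=True is an assumption about how the two fields interact.

```python
def check_generation_params(temperature=1.0, top_p=1.0, greedy=False,
                            stop=None, stop_token_ids=None):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    # Assumption: temperature is ignored when greedy decoding is requested.
    if not greedy and temperature <= 0.0:
        problems.append("temperature must be > 0; use greedy=True instead")
    if not (0.0 < top_p <= 1.0):
        problems.append("top_p must be in the range (0.0, 1.0]")
    # A bare string is the usual YAML mistake: "stop: </s>" instead of a list.
    if stop is not None and (
        isinstance(stop, str) or not all(isinstance(s, str) for s in stop)
    ):
        problems.append("stop must be a list of strings")
    if stop_token_ids is not None and (
        isinstance(stop_token_ids, (str, int))
        or not all(isinstance(t, int) for t in stop_token_ids)
    ):
        problems.append("stop_token_ids must be a list of integers")
    return problems


if __name__ == "__main__":
    print(check_generation_params(temperature=0.0, stop="</s>"))
    print(check_generation_params(greedy=True, temperature=0.0, stop=["</s>"]))
```

Returning a list instead of raising on the first failure lets a launcher report every problem in one pass.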



Validation and Debugging Strategies

Pre-Flight Checklist

[Diagram: Configuration Pre-flight Validation]


Common Error Message Patterns

| Error Message Fragment | Likely Cause |
|---|---|
| mean_level must be 'batch', 'group' or None | Incorrect NormConfig value |
| group_size must be a positive integer | group_size < 1 with group normalization |
| packing_algorithm must be one of ['ffd', 'kk'] | Invalid algorithm name in MicroBatchSpec |
| n_mbs_divisor violation | World size not divisible by requested micro-batches |
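
The table can also be applied mechanically to a captured log or traceback. The following sketch is a hypothetical helper, not part of AReaL: the fragments and causes are copied from the table, and matching on the bare token n_mbs_divisor is an assumption, since the exact message text is not given here.

```python
# (fragment, likely cause) pairs copied from the table above.
KNOWN_PATTERNS = [
    ("mean_level must be 'batch', 'group' or None",
     "Incorrect NormConfig value"),
    ("group_size must be a positive integer",
     "group_size < 1 with group normalization"),
    ("packing_algorithm must be one of",
     "Invalid algorithm name in MicroBatchSpec"),
    # Assumption: the real message contains this field name somewhere.
    ("n_mbs_divisor",
     "World size not divisible by requested micro-batches"),
]


def diagnose(log_text: str) -> list[str]:
    """Return the likely causes whose fragment appears in log_text."""
    return [cause for fragment, cause in KNOWN_PATTERNS if fragment in log_text]


if __name__ == "__main__":
    sample = "ValueError: group_size must be a positive integer, got 0"
    for cause in diagnose(sample):
        print(cause)
```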
