URL: https://deepwiki.com/inclusionAI/AReaL/2.5-inference-engine-configurations

Inference Engine Configurations | inclusionAI/AReaL | DeepWiki


Last indexed: 7 May 2026 (2e12c1)

Inference Engine Configurations

This page documents the configuration parameters for AReaL's inference engines (SGLang and vLLM). These configurations control how the inference servers are launched, how they allocate resources, and how they handle generation requests during training.

For training engine configurations, see 2.4 Training Engine Configurations. For the allocation_mode syntax that controls resource distribution, see 2.3 allocation_mode Syntax. For generation parameter details, see 2.6 Generation Hyperparameters.

Configuration Architecture

AReaL's inference system uses a three-tier configuration hierarchy:

  1. InferenceEngineConfig: Core configuration shared across all inference backends (areal/api/cli_args.py:1048-1186).
  2. Backend-specific configs: SGLangConfig (areal/api/cli_args.py:1189-1280) or vLLMConfig (areal/api/cli_args.py:1392-1483) for engine-specific parameters.
  3. GenerationHyperparameters: Default generation settings such as temperature and top-p (areal/api/cli_args.py:164-210).
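The nesting described above can be pictured with a minimal sketch. The classes below are simplified stand-ins that reuse the documented names; they are not the actual definitions in areal/api/cli_args.py. Only a few fields are shown, the `temperature` and `top_p` defaults are placeholders, and `active_backend_config` is a hypothetical helper added for illustration.

```python
from dataclasses import dataclass, field


# Simplified stand-ins for illustration only; the real classes carry many more fields.
@dataclass
class GenerationHyperparameters:
    temperature: float = 1.0  # placeholder default, not taken from the source
    top_p: float = 1.0        # placeholder default, not taken from the source


@dataclass
class SGLangConfig:
    mem_fraction_static: float = 0.88


@dataclass
class vLLMConfig:
    max_num_seqs: int = 256


@dataclass
class InferenceEngineConfig:
    type: str = "sglang"
    path: str = ""
    # Tier 3: default generation settings.
    gconfig: GenerationHyperparameters = field(default_factory=GenerationHyperparameters)
    # Tier 2: one sub-config per backend; only the one named by `type` is used.
    sglang: SGLangConfig = field(default_factory=SGLangConfig)
    vllm: vLLMConfig = field(default_factory=vLLMConfig)


def active_backend_config(cfg: InferenceEngineConfig):
    """Hypothetical helper: return the sub-config selected by cfg.type."""
    if cfg.type not in ("sglang", "vllm"):
        raise ValueError(f"unknown inference engine type: {cfg.type!r}")
    return getattr(cfg, cfg.type)


cfg = InferenceEngineConfig(type="vllm", path="/models/demo")
print(active_backend_config(cfg))
```

Keeping both backend sub-configs on the same object means a run can switch engines by changing `type` alone, without restructuring the rest of the configuration.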

Inference System Component Mapping

The diagram below shows how configuration objects associate with code entities and translate into running server processes.

[Diagram: Inference System Component Mapping]


Sources: areal/api/cli_args.py:1048-1186, areal/engine/sglang_remote.py:39-40, areal/engine/vllm_remote.py:39-40, areal/api/cli_args.py:164-210

Core Inference Engine Configuration

The InferenceEngineConfig class defines parameters shared across all inference backends. These settings control engine type selection, model loading, resource allocation, and memory management.

| Parameter | Type | Default | Description | Source |
|---|---|---|---|---|
| type | str | "sglang" | Inference engine type. Choices: sglang, vllm | areal/api/cli_args.py:1053-1058 |
| path | str | "" | Path to HuggingFace model checkpoint | areal/api/cli_args.py:1059-1064 |
| allocation_mode | str | "" | Resource allocation string. See allocation_mode Syntax | areal/api/cli_args.py:1065 |
| dtype | str | "auto" | Model parameter dtype. Choices: auto, float16, bfloat16, float32 | areal/api/cli_args.py:1066-1071 |
| quantization | str \| None | None | Quantization method. Choices: None, fp8 | areal/api/cli_args.py:1072-1077 |
| kv_cache_dtype | str | "auto" | KV cache dtype. Choices: auto, fp8 | areal/api/cli_args.py:1078-1083 |
| use_lora | bool | False | Enable LoRA adapter support | areal/api/cli_args.py:1091-1096 |
| max_loras | int | 8 | Maximum number of LoRA adapters to cache | areal/api/cli_args.py:1097-1101 |
| max_lora_rank | int | 64 | Maximum LoRA rank supported | areal/api/cli_args.py:1102-1106 |
| max_model_len | int \| None | None | Maximum sequence length (prompt + generation) | areal/api/cli_args.py:1108-1113 |
| gpu_memory_utilization | float | 0.9 | GPU memory utilization ratio (0.0 to 1.0) | areal/api/cli_args.py:1114-1118 |
| swap_space | int | 4 | CPU swap space in GiB per GPU | areal/api/cli_args.py:1119-1123 |
| gconfig | GenerationHyperparameters | GenerationHyperparameters() | Default generation parameters | areal/api/cli_args.py:1124-1129 |
| sglang | SGLangConfig | SGLangConfig() | SGLang-specific configuration | areal/api/cli_args.py:1130-1134 |
| vllm | vLLMConfig | vLLMConfig() | vLLM-specific configuration | areal/api/cli_args.py:1135-1139 |

Sources: areal/api/cli_args.py:1048-1186
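The choices and ranges listed for the core parameters can be checked before any server is launched. The following is a minimal sketch of such a check over a plain dictionary. `check_core_config` is a hypothetical helper, not an AReaL function, and the rules that swap_space is non-negative and max_model_len is positive are assumptions rather than documented constraints.

```python
# Allowed values as listed in the parameter table above.
ENGINE_TYPES = ("sglang", "vllm")
DTYPES = ("auto", "float16", "bfloat16", "float32")
QUANTIZATIONS = (None, "fp8")
KV_CACHE_DTYPES = ("auto", "fp8")


def check_core_config(cfg: dict) -> list:
    """Hypothetical validator: return a list of problems (empty means OK).

    Missing keys fall back to the documented defaults.
    """
    problems = []
    if cfg.get("type", "sglang") not in ENGINE_TYPES:
        problems.append(f"type must be one of {ENGINE_TYPES}")
    if cfg.get("dtype", "auto") not in DTYPES:
        problems.append(f"dtype must be one of {DTYPES}")
    if cfg.get("quantization", None) not in QUANTIZATIONS:
        problems.append(f"quantization must be one of {QUANTIZATIONS}")
    if cfg.get("kv_cache_dtype", "auto") not in KV_CACHE_DTYPES:
        problems.append(f"kv_cache_dtype must be one of {KV_CACHE_DTYPES}")
    if not 0.0 <= cfg.get("gpu_memory_utilization", 0.9) <= 1.0:
        problems.append("gpu_memory_utilization must lie in [0.0, 1.0]")
    # Assumption: a negative swap space is meaningless.
    if cfg.get("swap_space", 4) < 0:
        problems.append("swap_space must be non-negative")
    # Assumption: when set, the sequence limit must be positive.
    max_len = cfg.get("max_model_len", None)
    if max_len is not None and max_len <= 0:
        problems.append("max_model_len must be positive when set")
    return problems


print(check_core_config({"type": "vllm", "gpu_memory_utilization": 1.5}))
```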

SGLang Configuration

SGLangConfig contains SGLang-specific parameters that control memory allocation, scheduling policies, and attention backends.

| Parameter | Type | Default | Description | Source |
|---|---|---|---|---|
| host | str | "127.0.0.1" | SGLang server host address | areal/api/cli_args.py:1192-1196 |
| mem_fraction_static | float | 0.88 | Fraction of GPU memory for static KV cache allocation | areal/api/cli_args.py:1202-1207 |
| chunked_prefill_size | int | 8192 | Chunk size for chunked prefill in tokens | areal/api/cli_args.py:1208-1212 |
| schedule_policy | str | "lpm" | Request scheduling policy. Choices: lpm, random, fcfs, dfs-weight | areal/api/cli_args.py:1213-1218 |
| attention_backend | str | "flashinfer" | Attention kernel backend. Choices: flashinfer, triton, torch_native | areal/api/cli_args.py:1225-1230 |
| enable_ep_cache | bool | True | Enable expert parallel cache for MoE models | areal/api/cli_args.py:1231-1235 |
| additional_server_args | str | "" | Additional command-line arguments for SGLang server | areal/api/cli_args.py:1266-1271 |

Sources: areal/api/cli_args.py:1189-1280
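To make the link between these fields and a running server concrete, the sketch below assembles an SGLang launch command from a plain dictionary. It illustrates the general idea only and is not AReaL's launcher code. The `port` argument and the `sglang_launch_command` helper are hypothetical, and enable_ep_cache is omitted because its mapping to a server flag is not described here.

```python
import shlex

# Defaults copied from the table above (enable_ep_cache intentionally left out).
SGLANG_DEFAULTS = {
    "host": "127.0.0.1",
    "mem_fraction_static": 0.88,
    "chunked_prefill_size": 8192,
    "schedule_policy": "lpm",
    "attention_backend": "flashinfer",
    "additional_server_args": "",
}


def sglang_launch_command(model_path: str, port: int, cfg: dict) -> list:
    """Hypothetical helper: build an argv list for `python -m sglang.launch_server`."""
    cmd = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--host", cfg["host"],
        "--port", str(port),
        "--mem-fraction-static", str(cfg["mem_fraction_static"]),
        "--chunked-prefill-size", str(cfg["chunked_prefill_size"]),
        "--schedule-policy", cfg["schedule_policy"],
        "--attention-backend", cfg["attention_backend"],
    ]
    # Free-form extra arguments are split with shell-like quoting rules.
    cmd += shlex.split(cfg.get("additional_server_args", ""))
    return cmd


print(" ".join(sglang_launch_command("/models/demo", 30000, SGLANG_DEFAULTS)))
```

Building an argument list rather than a single string avoids quoting problems when the command is later passed to `subprocess`.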

SGLang Backend Request/Response Flow

The SGLangBackend class implements the protocol for communicating with the remote SGLang server.

[Diagram: SGLang Interaction Protocol]


Sources: areal/engine/sglang_remote.py:39-126, areal/api/io_struct.py:62-131
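As a rough illustration of what a generation request to an SGLang server can look like, the sketch below builds a JSON body for SGLang's native `/generate` endpoint from token IDs and sampling settings. Whether SGLangBackend sends exactly these fields is an assumption; `build_sglang_generate_request` is a hypothetical helper, and nothing is sent over the network.

```python
import json


def build_sglang_generate_request(input_ids, max_new_tokens, temperature=1.0, top_p=1.0):
    """Hypothetical helper: JSON body for a POST to an SGLang /generate endpoint."""
    return {
        "input_ids": list(input_ids),
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "top_p": top_p,
        },
        # Ask the server to return per-token log-probabilities as well.
        "return_logprob": True,
    }


payload = build_sglang_generate_request([101, 2023, 2003], max_new_tokens=16)
print(json.dumps(payload, indent=2))
```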

vLLM Configuration

vLLMConfig contains vLLM-specific parameters for memory management, batching, and pipeline parallelism.

| Parameter | Type | Default | Description | Source |
|---|---|---|---|---|
| host | str | "127.0.0.1" | vLLM server host address | areal/api/cli_args.py:1395-1399 |
| block_size | int | 16 | Token block size for PagedAttention | areal/api/cli_args.py:1405-1409 |
| max_num_batched_tokens | int \| None | None | Maximum tokens in a batch | areal/api/cli_args.py:1410-1415 |
| max_num_seqs | int | 256 | Maximum number of sequences in a batch | areal/api/cli_args.py:1416-1420 |
| enforce_eager | bool | False | Disable CUDA graph optimization | areal/api/cli_args.py:1421-1425 |
| pipeline_parallel_size | int | 1 | Pipeline parallel size for multi-stage models | areal/api/cli_args.py:1431-1435 |
| additional_server_args | str | "" | Additional command-line arguments for vLLM server | areal/api/cli_args.py:1469-1474 |

Sources: areal/api/cli_args.py:1392-1483
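Several vLLM fields are optional or boolean, which affects how they become command-line flags: a `None` value is typically omitted, and a boolean becomes a bare switch. The sketch below shows one way to express that rule. `to_cli_flags` is a hypothetical helper, not part of AReaL or vLLM, and it assumes each field name maps to a flag by replacing underscores with hyphens.

```python
def to_cli_flags(options: dict) -> list:
    """Hypothetical helper: turn config fields into command-line flags.

    - None values are skipped, so the server's own default applies.
    - True becomes a bare switch; False is skipped.
    - Everything else becomes `--name value`.
    """
    flags = []
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is None:
            continue
        if isinstance(value, bool):
            if value:
                flags.append(flag)
            continue
        flags += [flag, str(value)]
    return flags


# Defaults copied from the table above.
VLLM_DEFAULTS = {
    "host": "127.0.0.1",
    "block_size": 16,
    "max_num_batched_tokens": None,
    "max_num_seqs": 256,
    "enforce_eager": False,
    "pipeline_parallel_size": 1,
}

print(to_cli_flags(VLLM_DEFAULTS))
print(to_cli_flags({**VLLM_DEFAULTS, "enforce_eager": True, "max_num_batched_tokens": 8192}))
```

The boolean check comes before the generic branch because `bool` is a subclass of `int` in Python; otherwise `False` would be emitted as `--enforce-eager False`.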

vLLM Backend Request/Response Flow

The VLLMBackend class handles OpenAI-compatible requests for the remote vLLM server.

[Diagram: vLLM Interaction Protocol]


Sources: areal/engine/vllm_remote.py:39-124, areal/api/io_struct.py:62-131
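For orientation, an OpenAI-compatible completion exchange has roughly the shape sketched below: a request body posted to `/v1/completions` and a response with a `choices` list and a `usage` block. The two helpers are hypothetical and only build and parse dictionaries; the exact fields VLLMBackend sends and reads are an assumption.

```python
def build_completion_request(model, prompt, max_tokens, temperature=1.0, top_p=1.0):
    """Hypothetical helper: path and JSON body for an OpenAI-style completion call."""
    body = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    return "/v1/completions", body


def parse_completion_response(response: dict) -> dict:
    """Hypothetical helper: pull out the fields a caller usually needs."""
    choice = response["choices"][0]
    usage = response.get("usage", {})
    return {
        "text": choice["text"],
        "finish_reason": choice.get("finish_reason"),
        "completion_tokens": usage.get("completion_tokens"),
    }


path, body = build_completion_request("demo-model", "Hello", max_tokens=8)
print(path, body)

# A hand-written sample response in the OpenAI completion format.
sample = {
    "choices": [{"index": 0, "text": " world", "finish_reason": "length"}],
    "usage": {"prompt_tokens": 1, "completion_tokens": 8, "total_tokens": 9},
}
print(parse_completion_response(sample))
```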

Weight Synchronization Configuration

Both SGLang and vLLM support the weight update modes selected by TrainEngineConfig.weight_update_mode (areal/api/cli_args.py:644-650).

Disk-Based Weight Updates

The training engine saves a checkpoint to a shared path, and the inference backend then triggers a reload on the server via a REST API call.


Sources: areal/engine/sglang_remote.py:128-158, areal/engine/vllm_remote.py:126-145
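The disk-based flow can be sketched in two steps: choose a checkpoint directory under the shared path, then build the reload request that points the server at it. The directory naming scheme and both helpers below are hypothetical. The endpoint shown is SGLang's native `/update_weights_from_disk`; the routes AReaL actually calls for each backend are not confirmed here, and no request is sent.

```python
from pathlib import Path


def versioned_checkpoint_dir(shared_root, version: int) -> Path:
    """Hypothetical naming scheme: one directory per weight version."""
    return Path(shared_root) / f"weight_update_v{version}"


def build_reload_request(server_addr: str, checkpoint_path) -> tuple:
    """Hypothetical helper: URL and JSON body asking a server to reload weights."""
    url = f"http://{server_addr}/update_weights_from_disk"
    return url, {"model_path": str(checkpoint_path)}


ckpt = versioned_checkpoint_dir("/shared/checkpoints", 3)
print(build_reload_request("127.0.0.1:30000", ckpt))
```

The shared path must be visible under the same location on both the training and inference nodes, since only the path string travels in the request.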

NCCL-Based Weight Updates (XCCL)

AReaL supports high-performance weight updates via NCCL/XCCL, avoiding the disk bottleneck. This requires initializing a distributed group between the trainer and the inference server.


Sources: areal/engine/sglang_remote.py:160-186, areal/engine/vllm_remote.py:147-193, areal/api/io_struct.py:167-185
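Initializing a shared group means every participant must agree on a rendezvous address, a world size, and its own rank. The sketch below works out those numbers under a simplifying assumption: a single trainer process holds rank 0 and broadcasts to all inference GPUs. `ServerGroupInit` and `plan_weight_update_group` are hypothetical names, and the actual layout used by AReaL may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerGroupInit:
    """Hypothetical per-server parameters for joining the update group."""
    server_addr: str
    master_address: str
    master_port: int
    rank_offset: int   # rank of this server's first GPU within the group
    world_size: int
    group_name: str


def plan_weight_update_group(server_addrs, gpus_per_server, master_address,
                             master_port, group_name="weight_update"):
    """Assign rank ranges, assuming the trainer occupies rank 0."""
    world_size = 1 + len(server_addrs) * gpus_per_server
    plans = []
    for i, addr in enumerate(server_addrs):
        plans.append(ServerGroupInit(
            server_addr=addr,
            master_address=master_address,
            master_port=master_port,
            rank_offset=1 + i * gpus_per_server,
            world_size=world_size,
            group_name=group_name,
        ))
    return plans


for plan in plan_weight_update_group(["10.0.0.2:30000", "10.0.0.3:30000"], 4,
                                     master_address="10.0.0.1", master_port=29500):
    print(plan)
```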

Configuration Validation and Constraints

Memory Configuration Constraints

LoRA Configuration Constraints

Sources: areal/api/cli_args.py:1048-1186, areal/engine/sglang_remote.py:160-186, areal/engine/vllm_remote.py:147-193