Last indexed: 7 May 2026 (2e12c1)

Inference Engine Configurations

This page documents the configuration parameters for AReaL's inference engines (SGLang and vLLM). These configurations control how the inference servers are launched, how they allocate resources, and how they handle generation requests during training.

For training engine configurations, see 2.4 Training Engine Configurations. For the allocation_mode syntax that controls resource distribution, see 2.3 allocation_mode Syntax. For generation parameter details, see 2.6 Generation Hyperparameters.

Configuration Architecture

AReaL's inference system uses a three-tier configuration hierarchy:

InferenceEngineConfig: Core configuration shared across all inference backends areal/api/cli_args.py1048-1186
Backend-specific configs: SGLangConfig areal/api/cli_args.py1189-1280 or vLLMConfig areal/api/cli_args.py1392-1483 for engine-specific parameters.
GenerationHyperparameters: Default generation settings (temperature, top-p, etc.) areal/api/cli_args.py164-210
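
The three tiers can be pictured as nested dataclasses. The sketch below is a simplified mirror, not the actual definitions in areal/api/cli_args.py: only a few fields are shown, the `temperature` and `top_p` defaults are illustrative assumptions, and the `active_backend_config` helper is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GenerationHyperparameters:
    # Illustrative defaults only; the real values live in cli_args.py.
    temperature: float = 1.0
    top_p: float = 1.0


@dataclass
class SGLangConfig:
    mem_fraction_static: float = 0.88


@dataclass
class vLLMConfig:
    max_num_seqs: int = 256


@dataclass
class InferenceEngineConfig:
    # Tier 1: settings shared by every backend.
    type: str = "sglang"
    path: str = ""
    # Tier 3: default generation settings.
    gconfig: GenerationHyperparameters = field(default_factory=GenerationHyperparameters)
    # Tier 2: one nested config per backend; only the selected one is used.
    sglang: SGLangConfig = field(default_factory=SGLangConfig)
    vllm: vLLMConfig = field(default_factory=vLLMConfig)


def active_backend_config(cfg: InferenceEngineConfig):
    """Hypothetical helper: return the backend tier selected by cfg.type."""
    if cfg.type == "sglang":
        return cfg.sglang
    if cfg.type == "vllm":
        return cfg.vllm
    raise ValueError(f"unknown inference engine type: {cfg.type!r}")


cfg = InferenceEngineConfig(type="vllm")
backend_cfg = active_backend_config(cfg)  # the vLLMConfig instance
```

Keeping both backend configs on the core object means switching engines only requires changing `type`; the unused backend's settings are simply ignored.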

Inference System Component Mapping

The diagram below shows how configuration objects associate with code entities and translate into running server processes.

[Diagram: Inference System Component Mapping]

Sources: areal/api/cli_args.py1048-1186 areal/engine/sglang_remote.py39-40 areal/engine/vllm_remote.py39-40 areal/api/cli_args.py164-210

Core Inference Engine Configuration

The InferenceEngineConfig class defines parameters shared across all inference backends. These settings control engine type selection, model loading, resource allocation, and memory management.

Parameter	Type	Default	Description
`type`	`str`	`"sglang"`	Inference engine type. Choices: `sglang`, `vllm` areal/api/cli_args.py1053-1058
`path`	`str`	`""`	Path to HuggingFace model checkpoint areal/api/cli_args.py1059-1064
`allocation_mode`	`str`	`""`	Resource allocation string. See `allocation_mode` Syntax areal/api/cli_args.py1065
`dtype`	`str`	`"auto"`	Model parameter dtype. Choices: `auto`, `float16`, `bfloat16`, `float32` areal/api/cli_args.py1066-1071
`quantization`	`str | None`	`None`	Quantization method. Choices: `None`, `fp8` areal/api/cli_args.py1072-1077
`kv_cache_dtype`	`str`	`"auto"`	KV cache dtype. Choices: `auto`, `fp8` areal/api/cli_args.py1078-1083
`use_lora`	`bool`	`False`	Enable LoRA adapter support areal/api/cli_args.py1091-1096
`max_loras`	`int`	`8`	Maximum number of LoRA adapters to cache areal/api/cli_args.py1097-1101
`max_lora_rank`	`int`	`64`	Maximum LoRA rank supported areal/api/cli_args.py1102-1106
`max_model_len`	`int | None`	`None`	Maximum sequence length (prompt + generation) areal/api/cli_args.py1108-1113
`gpu_memory_utilization`	`float`	`0.9`	GPU memory utilization ratio (0.0 to 1.0) areal/api/cli_args.py1114-1118
`swap_space`	`int`	`4`	CPU swap space in GiB per GPU areal/api/cli_args.py1119-1123
`gconfig`	`GenerationHyperparameters`	`GenerationHyperparameters()`	Default generation parameters areal/api/cli_args.py1124-1129
`sglang`	`SGLangConfig`	`SGLangConfig()`	SGLang-specific configuration areal/api/cli_args.py1130-1134
`vllm`	`vLLMConfig`	`vLLMConfig()`	vLLM-specific configuration areal/api/cli_args.py1135-1139

Sources: areal/api/cli_args.py1048-1186
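
Several parameters in the table accept only a fixed set of choices. The following is a minimal sketch of how a set of overrides could be checked against those choices before launching a server. The `ALLOWED` table is copied from the parameter descriptions above; the `check_choices` function is a hypothetical helper, not part of AReaL.

```python
# Allowed values, taken from the "Choices" listed in the table above.
ALLOWED = {
    "type": {"sglang", "vllm"},
    "dtype": {"auto", "float16", "bfloat16", "float32"},
    "quantization": {None, "fp8"},
    "kv_cache_dtype": {"auto", "fp8"},
}


def check_choices(overrides: dict) -> list:
    """Return one error message per override that is outside its allowed set.

    Keys without a restricted choice set (for example `path`) are not checked.
    """
    errors = []
    for key, value in overrides.items():
        allowed = ALLOWED.get(key)
        if allowed is not None and value not in allowed:
            choices = sorted(allowed, key=str)  # key=str so None sorts with strings
            errors.append(f"{key}={value!r} is not one of {choices}")
    return errors


ok = check_choices({"type": "vllm", "dtype": "bfloat16", "quantization": None})
bad = check_choices({"dtype": "int8", "kv_cache_dtype": "fp8"})
```

Here `ok` is empty, while `bad` contains a single message about `dtype`.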

SGLang Configuration

SGLangConfig contains SGLang-specific parameters that control memory allocation, scheduling policies, and attention backends.

Parameter	Type	Default	Description
`host`	`str`	`"127.0.0.1"`	SGLang server host address areal/api/cli_args.py1192-1196
`mem_fraction_static`	`float`	`0.88`	Fraction of GPU memory for static KV cache allocation areal/api/cli_args.py1202-1207
`chunked_prefill_size`	`int`	`8192`	Chunk size for chunked prefill in tokens areal/api/cli_args.py1208-1212
`schedule_policy`	`str`	`"lpm"`	Request scheduling policy. Choices: `lpm`, `random`, `fcfs`, `dfs-weight` areal/api/cli_args.py1213-1218
`attention_backend`	`str`	`"flashinfer"`	Attention kernel backend. Choices: `flashinfer`, `triton`, `torch_native` areal/api/cli_args.py1225-1230
`enable_ep_cache`	`bool`	`True`	Enable expert parallel cache for MoE models areal/api/cli_args.py1231-1235
`additional_server_args`	`str`	`""`	Additional command-line arguments for SGLang server areal/api/cli_args.py1266-1271

Sources: areal/api/cli_args.py1189-1280
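
Because `additional_server_args` is a single string, it has to be split into separate tokens before it can be appended to a server command line. The sketch below shows one way this could be done with `shlex.split`. The `to_cli_args` helper and its rule of mapping field names to `--kebab-case` flags are assumptions for illustration; how AReaL actually assembles the SGLang command is not shown in this page.

```python
import shlex


def to_cli_args(params: dict, additional_server_args: str = "") -> list:
    """Hypothetical helper: turn config fields into a flat argument list.

    Assumes each field `some_name` maps to a flag `--some-name`, and that
    boolean fields become bare flags that are present only when True.
    """
    args = []
    for name, value in params.items():
        flag = "--" + name.replace("_", "-")
        if isinstance(value, bool):
            if value:
                args.append(flag)
            continue
        args.extend([flag, str(value)])
    # shlex.split honours shell-style quoting inside the extra-args string.
    args.extend(shlex.split(additional_server_args))
    return args


argv = to_cli_args(
    {"mem_fraction_static": 0.88, "schedule_policy": "lpm"},
    additional_server_args="--extra-flag 1",  # placeholder flag, not a real option
)
```

Using `shlex.split` rather than `str.split` keeps quoted values with spaces together as one argument.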

SGLang Backend Request/Response Flow

The SGLangBackend class implements the protocol for communicating with the remote SGLang server.

[Diagram: SGLang Interaction Protocol]

Sources: areal/engine/sglang_remote.py39-126 areal/api/io_struct.py62-131

vLLM Configuration

vLLMConfig contains vLLM-specific parameters for memory management, batching, and pipeline parallelism.

Parameter	Type	Default	Description
`host`	`str`	`"127.0.0.1"`	vLLM server host address areal/api/cli_args.py1395-1399
`block_size`	`int`	`16`	Token block size for PagedAttention areal/api/cli_args.py1405-1409
`max_num_batched_tokens`	`int | None`	`None`	Maximum tokens in a batch areal/api/cli_args.py1410-1415
`max_num_seqs`	`int`	`256`	Maximum number of sequences in a batch areal/api/cli_args.py1416-1420
`enforce_eager`	`bool`	`False`	Disable CUDA graph optimization areal/api/cli_args.py1421-1425
`pipeline_parallel_size`	`int`	`1`	Pipeline parallel size for multi-stage models areal/api/cli_args.py1431-1435
`additional_server_args`	`str`	`""`	Additional command-line arguments for vLLM server areal/api/cli_args.py1469-1474

Sources: areal/api/cli_args.py1392-1483
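
To make `block_size` concrete: PagedAttention stores the KV cache in fixed-size blocks, so a sequence occupies a whole number of blocks and the last one may be partly empty. The sketch below works through that arithmetic. The function names are hypothetical and the calculation ignores any sharing of blocks between sequences.

```python
def blocks_needed(num_tokens: int, block_size: int = 16) -> int:
    """Number of KV cache blocks a sequence of num_tokens occupies."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    # Ceiling division using integers only.
    return -(-num_tokens // block_size)


def unused_slots(num_tokens: int, block_size: int = 16) -> int:
    """Token slots left empty in the final, partially filled block."""
    return blocks_needed(num_tokens, block_size) * block_size - num_tokens


n_blocks = blocks_needed(1000)  # 1000 / 16 = 62.5, rounded up to 63 blocks
wasted = unused_slots(1000)     # 63 * 16 - 1000 = 8 slots
```

With the default block size of 16, at most 15 token slots per sequence are left unused.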

vLLM Backend Request/Response Flow

The VLLMBackend class handles OpenAI-compatible requests for the remote vLLM server.

[Diagram: vLLM Interaction Protocol]

Sources: areal/engine/vllm_remote.py39-124 areal/api/io_struct.py62-131

Weight Synchronization Configuration

Both SGLang and vLLM support disk-based and NCCL-based (XCCL) weight updates. The mode is selected by TrainEngineConfig.weight_update_mode areal/api/cli_args.py644-650

Disk-Based Weight Updates

The training engine saves a checkpoint to a shared path, and the inference backend then triggers a reload on the inference server via a REST API call.

Sources: areal/engine/sglang_remote.py128-158 areal/engine/vllm_remote.py126-145
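
The ordering matters in the disk-based flow: the checkpoint must be fully written before the server is asked to reload it. The sketch below captures only that ordering. The `disk_weight_update` function, the `model_path` payload key, and the example path are all hypothetical; the two steps are passed in as callables so the example runs without a trainer or a server.

```python
def disk_weight_update(save_checkpoint, request_reload, checkpoint_dir: str):
    """Run the two steps of a disk-based weight update in order.

    save_checkpoint(path) stands in for the trainer writing weights to the
    shared path. request_reload(payload) stands in for the REST call that
    asks the inference server to reload from that path.
    """
    save_checkpoint(checkpoint_dir)           # step 1: write weights to shared storage
    payload = {"model_path": checkpoint_dir}  # hypothetical payload shape
    return request_reload(payload)            # step 2: ask the server to reload


# Record the calls to show the order in which they happen.
events = []
result = disk_weight_update(
    save_checkpoint=lambda path: events.append(("save", path)),
    request_reload=lambda payload: events.append(("reload", payload["model_path"])) or True,
    checkpoint_dir="/shared/ckpt/step_100",
)
```

After the call, `events` holds the save entry followed by the reload entry, both pointing at the same directory.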

NCCL-Based Weight Updates (XCCL)

AReaL supports high-performance weight updates via NCCL/XCCL, avoiding the disk bottleneck. This requires initializing a distributed group between the trainer and the inference server.

Sources: areal/engine/sglang_remote.py160-186 areal/engine/vllm_remote.py147-193 areal/api/io_struct.py167-185

Note:

SGLang does not support LoRA with NCCL-based weight updates; use weight_update_mode='disk' for LoRA areal/engine/sglang_remote.py168-172
vLLM supports both full model and LoRA updates via NCCL using a two-phase protocol (/areal_set_update_weight_meta_lora followed by /areal_update_weights_lora_xccl) areal/engine/vllm_remote.py162-177
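
The two-phase vLLM LoRA protocol in the note above can be sketched as a pair of ordered requests, where the second is sent only if the first succeeds. The endpoint paths are the ones named in the note. Everything else is an assumption: the `run_two_phase_update` function is hypothetical, the payloads are left as opaque dictionaries, and the HTTP call is replaced by an injected `post` callable.

```python
# Endpoint paths as named in the note above, in the order they are called.
VLLM_LORA_XCCL_ENDPOINTS = (
    "/areal_set_update_weight_meta_lora",
    "/areal_update_weights_lora_xccl",
)


def run_two_phase_update(post, meta_payload: dict, update_payload: dict) -> bool:
    """Send the metadata request, then the weight update request.

    post(endpoint, payload) must return True on success. The second phase
    is skipped when the first one fails.
    """
    meta_endpoint, update_endpoint = VLLM_LORA_XCCL_ENDPOINTS
    if not post(meta_endpoint, meta_payload):
        return False
    return bool(post(update_endpoint, update_payload))


# A fake transport that records which endpoints were hit.
calls = []


def fake_post(endpoint, payload):
    calls.append(endpoint)
    return True


succeeded = run_two_phase_update(fake_post, meta_payload={}, update_payload={})
```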

Configuration Validation and Constraints

Memory Configuration Constraints

gpu_memory_utilization must be between 0.0 and 1.0 areal/api/cli_args.py1114-1118
For SGLang: mem_fraction_static should be configured such that the KV cache fits within the gpu_memory_utilization limit areal/api/cli_args.py1202-1207
swap_space is allocated per GPU, so total swap = swap_space × number of GPUs areal/api/cli_args.py1119-1123
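
As a worked example of the memory constraints above, the sketch below checks the utilization range and computes the total swap for a node. The `total_swap_gib` function is a hypothetical helper, and treating both 0.0 and 1.0 as valid endpoints is an assumption, since the constraint does not say whether the range is inclusive.

```python
def total_swap_gib(gpu_memory_utilization: float, swap_space: int, num_gpus: int) -> int:
    """Validate the utilization ratio and return total CPU swap in GiB."""
    # Assumption: the range is inclusive at both ends.
    if not 0.0 <= gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be between 0.0 and 1.0")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    # swap_space is per GPU, so the host needs swap_space * num_gpus in total.
    return swap_space * num_gpus


# Defaults from the core table (0.9 utilization, 4 GiB swap) on an 8-GPU node.
swap_total = total_swap_gib(gpu_memory_utilization=0.9, swap_space=4, num_gpus=8)
```

With the defaults, an 8-GPU node needs 4 × 8 = 32 GiB of CPU memory for swap.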

LoRA Configuration Constraints

max_lora_rank must be greater than or equal to the rank of all LoRA adapters used areal/api/cli_args.py1102-1106
max_loras limits how many adapters can be loaded simultaneously areal/api/cli_args.py1097-1101
LoRA updates with SGLang require weight_update_mode: disk areal/engine/sglang_remote.py168-172
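
The three LoRA constraints above can be checked together before a run starts. The sketch below is one way to do that. The `check_lora_settings` function is hypothetical, and the value `"xccl"` in the usage example is an assumed name for the NCCL-based mode; only `disk` is confirmed by this page.

```python
def check_lora_settings(
    backend: str,
    weight_update_mode: str,
    adapter_ranks: list,
    max_lora_rank: int = 64,
    max_loras: int = 8,
) -> list:
    """Return a list of constraint violations (empty when all checks pass)."""
    problems = []
    # Every adapter's rank must fit within max_lora_rank.
    too_large = [rank for rank in adapter_ranks if rank > max_lora_rank]
    if too_large:
        problems.append(f"adapter ranks {too_large} exceed max_lora_rank={max_lora_rank}")
    # max_loras bounds how many adapters can be loaded at once.
    if len(adapter_ranks) > max_loras:
        problems.append(f"{len(adapter_ranks)} adapters exceed max_loras={max_loras}")
    # SGLang supports LoRA updates only through the disk-based mode.
    if backend == "sglang" and weight_update_mode != "disk":
        problems.append("LoRA with SGLang requires weight_update_mode='disk'")
    return problems


# "xccl" is an assumed value name for the NCCL-based mode.
issues = check_lora_settings("sglang", "xccl", adapter_ranks=[16, 128])
clean = check_lora_settings("sglang", "disk", adapter_ranks=[16, 32])
```

Here `issues` reports both the oversized rank and the unsupported update mode, while `clean` is empty.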

Sources: areal/api/cli_args.py1048-1186 areal/engine/sglang_remote.py160-186 areal/engine/vllm_remote.py147-193

URL: https://deepwiki.com/inclusionAI/AReaL/2.5-inference-engine-configurations