URL: https://deepwiki.com/inclusionAI/AReaL/4-inference-system

Last indexed: 7 May 2026 (2e12c1)

Inference System

Purpose and Scope

The Inference System provides high-performance model inference for rollout generation in reinforcement learning training. It manages distributed inference workers, handles weight synchronization from training engines, and executes workflows asynchronously to generate trajectories. This document covers the overall architecture and integration patterns of the inference system.

For information about how workflows interact with the inference system, see Workflow and Rollout System.


Architecture Overview

The inference system uses a composition-based architecture: backend-specific engine wrappers (RemoteSGLangEngine, RemotevLLMEngine) delegate to a shared RemoteInfEngine core, while backend objects (SGLangBackend, VLLMBackend) build requests and parse responses. This design separates high-level inference orchestration from backend-specific protocol details.

Class Hierarchy


Sources: areal/api/engine_api.py:530-561, areal/engine/sglang_remote.py:40-41, areal/engine/vllm_remote.py:41-42, areal/infra/remote_inf_engine.py:125-230

Component Responsibilities

| Component | Responsibility |
| --- | --- |
| InferenceEngine | Abstract interface defining operations like agenerate, update_weights, and submit. |
| RemoteSGLangEngine | SGLang-specific wrapper implementing InferenceEngine via RemoteInfEngine. |
| RemotevLLMEngine | vLLM-specific wrapper implementing InferenceEngine via RemoteInfEngine. |
| RemoteInfEngine | Core implementation handling HTTP communication, workflow execution, and weight updates. |
| SGLangBackend | Builds SGLang-specific HTTP requests and parses JSON responses. |
| VLLMBackend | Builds vLLM/OpenAI-compatible HTTP requests and parses responses. |
| WorkflowExecutor | Manages the asynchronous execution of RolloutWorkflow episodes. |

Sources: areal/engine/sglang_remote.py:40-187, areal/engine/vllm_remote.py:41-183, areal/api/engine_api.py:530-561, areal/infra/remote_inf_engine.py:60
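
The division of labor in the table can be illustrated with a minimal sketch. Every class, endpoint, and field name below is hypothetical and chosen for illustration; none of it is taken from the AReaL source. The point is only the shape: a thin wrapper picks a backend, a shared core owns the transport, and the backend alone knows the wire format.

```python
from typing import Any, Callable, Protocol

# A transport takes (endpoint, JSON payload) and returns the decoded JSON body.
Transport = Callable[[str, dict[str, Any]], dict[str, Any]]


class BackendSketch(Protocol):
    """Hypothetical backend protocol: request building and response parsing only."""

    def build_request(self, input_ids: list[int]) -> tuple[str, dict[str, Any]]: ...

    def parse_response(self, body: dict[str, Any]) -> list[int]: ...


class ToyBackend:
    """Toy backend; the endpoint and field names are invented for this sketch."""

    def build_request(self, input_ids: list[int]) -> tuple[str, dict[str, Any]]:
        return "/toy_generate", {"tokens": input_ids}

    def parse_response(self, body: dict[str, Any]) -> list[int]:
        return body["tokens_out"]


class CoreEngineSketch:
    """Stands in for the shared core: owns the transport, knows no wire format."""

    def __init__(self, backend: BackendSketch, transport: Transport) -> None:
        self.backend = backend
        self.transport = transport

    def generate(self, input_ids: list[int]) -> list[int]:
        endpoint, payload = self.backend.build_request(input_ids)
        body = self.transport(endpoint, payload)
        return self.backend.parse_response(body)


class ToyRemoteEngine:
    """Stands in for a backend-specific wrapper: picks a backend, delegates the rest."""

    def __init__(self, transport: Transport) -> None:
        self._core = CoreEngineSketch(ToyBackend(), transport)

    def generate(self, input_ids: list[int]) -> list[int]:
        return self._core.generate(input_ids)


def fake_transport(endpoint: str, payload: dict[str, Any]) -> dict[str, Any]:
    # Pretend server that simply reverses the prompt tokens.
    return {"tokens_out": list(reversed(payload["tokens"]))}


engine = ToyRemoteEngine(fake_transport)
result = engine.generate([1, 2, 3])
```

Under this arrangement, supporting another server would mean writing a new backend class and a thin wrapper, leaving the core untouched.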


Backend Implementations

AReaL supports two high-performance inference backends. The choice is typically defined in the allocation_mode configuration or specified in the rollout.backend YAML field.

Backend Comparison

| Feature | SGLang (SGLangBackend) | vLLM (VLLMBackend) |
| --- | --- | --- |
| Optimization | Prefix sharing, RadixAttention | PagedAttention, high throughput |
| API Style | Custom /generate endpoint | OpenAI /v1/completions or /v1/chat/completions |
| LoRA Support | Disk-based adapter loading | XCCL and Disk-based loading |
| Expert Routing | Returns routed_experts MoE IDs | Not natively exposed in standard response |
| Vision Support | image_data in payload | OpenAI-style message structures |
| Weight Update | Supports /update_weights_from_distributed | Supports /areal_update_weights_xccl |

Sources: areal/engine/sglang_remote.py:43-90, areal/engine/vllm_remote.py:44-96, areal/engine/sglang_remote.py:161-187, areal/engine/vllm_remote.py:150-183
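
To make the "API Style" row concrete, the sketch below builds the same token-level request in the two wire formats. The function names are hypothetical, and the payloads are an assumption based on the public SGLang /generate and OpenAI-compatible /v1/completions request shapes; the exact fields AReaL's backends send may differ.

```python
from typing import Any


def build_sglang_payload(
    input_ids: list[int], max_new_tokens: int, temperature: float
) -> tuple[str, dict[str, Any]]:
    """Assumed shape of a native SGLang generation request."""
    payload = {
        "input_ids": input_ids,
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
        "return_logprob": True,  # ask for per-token log-probabilities
    }
    return "/generate", payload


def build_openai_completions_payload(
    model: str, input_ids: list[int], max_new_tokens: int, temperature: float
) -> tuple[str, dict[str, Any]]:
    """Assumed shape of an OpenAI-compatible completions request."""
    payload = {
        "model": model,
        "prompt": input_ids,  # the completions API accepts a token array as prompt
        "max_tokens": max_new_tokens,
        "temperature": temperature,
    }
    return "/v1/completions", payload


sglang_endpoint, sglang_payload = build_sglang_payload([1, 2, 3], 8, 1.0)
openai_endpoint, openai_payload = build_openai_completions_payload(
    "placeholder-model", [1, 2, 3], 8, 1.0
)
```

Note how the sampling parameters are nested under `sampling_params` in one format and flattened in the other. Differences of this kind are what a backend object is meant to hide from the shared core.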

Configuration Example

Backends are configured via YAML, specifying parameters like model_path, context_length, and LoRA settings.


Sources: examples/math/gsm8k_grpo_lora.yaml:116-139


Request and Response Flow

Generation Request Pipeline


Sources: areal/engine/sglang_remote.py:43-127, areal/engine/vllm_remote.py:44-127, areal/api/io_struct.py:28-130

Key Data Structures

ModelRequest

ModelRequest carries the prompt token IDs (input_ids), the generation hyperparameters (gconfig), and optional vision data (image_data or vision_msg_vllm). Source: areal/api/io_struct.py:28-59

ModelResponse

ModelResponse carries output_tokens, output_logprobs, and metadata such as latency or, for MoE models, routed_experts. It also provides utility properties such as output_tokens_without_stop, which strips EOS/PAD tokens. Source: areal/api/io_struct.py:63-130
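
A simplified stand-in for these two structures is sketched below. Only the field names input_ids, output_tokens, output_logprobs, and output_tokens_without_stop come from the description above. The class names, the `max_new_tokens` field (standing in for gconfig), the `stop_token_ids` field, and the stripping logic are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RequestSketch:
    """Illustrative request: prompt tokens plus one generation hyperparameter."""

    input_ids: list[int]
    max_new_tokens: int = 16  # stands in for the richer gconfig object


@dataclass
class ResponseSketch:
    """Illustrative response: generated tokens and their log-probabilities."""

    output_tokens: list[int] = field(default_factory=list)
    output_logprobs: list[float] = field(default_factory=list)
    stop_token_ids: tuple[int, ...] = ()  # hypothetical: EOS/PAD ids to strip

    @property
    def output_tokens_without_stop(self) -> list[int]:
        # Drop trailing stop tokens only; interior occurrences are kept.
        tokens = list(self.output_tokens)
        while tokens and tokens[-1] in self.stop_token_ids:
            tokens.pop()
        return tokens


request = RequestSketch(input_ids=[11, 12, 13])
response = ResponseSketch(
    output_tokens=[5, 6, 2, 0],
    output_logprobs=[-0.1, -0.2, -0.3, -0.4],
    stop_token_ids=(2, 0),
)
clean = response.output_tokens_without_stop
```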


Weight Synchronization Mechanisms

Inference engines must stay synchronized with the TrainEngine. AReaL supports multiple modes defined in WeightUpdateMeta.

1. Disk Mode (Shared Storage)

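The trainer saves weights to a shared filesystem, and the inference engine loads them in response to an HTTP command. This mode is mandatory for certain LoRA setups; the backend comparison, for instance, lists only disk-based adapter loading for SGLang.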

2. XCCL Mode (Network Broadcast)

Weights are broadcast over the network (NCCL/XCCL) directly from the trainer's GPU memory to the inference worker's GPU memory.

Sources: examples/math/gsm8k_grpo_lora.yaml:84, areal/api/io_struct.py:183-213
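
The two modes can be summarized as a dispatch on the update descriptor. The sketch below is a toy model: `WeightUpdateSketch`, its fields, and `plan_update` are hypothetical and do not mirror the real WeightUpdateMeta. It only records which steps each mode implies, as described above.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class WeightUpdateSketch:
    """Hypothetical descriptor for one weight update."""

    type: Literal["disk", "xccl"]
    path: Optional[str] = None  # shared-filesystem location, disk mode only
    version: int = 0


def plan_update(meta: WeightUpdateSketch) -> list[str]:
    """Return the ordered steps implied by the chosen synchronization mode."""
    if meta.type == "disk":
        if not meta.path:
            raise ValueError("disk mode needs a shared-storage path")
        return [
            f"trainer: save weights v{meta.version} to {meta.path}",
            f"inference: load weights from {meta.path} on HTTP command",
        ]
    if meta.type == "xccl":
        return [
            "trainer and inference workers: join a collective group",
            f"trainer: broadcast weights v{meta.version} GPU-to-GPU",
        ]
    raise ValueError(f"unknown weight update type: {meta.type!r}")


disk_plan = plan_update(WeightUpdateSketch(type="disk", path="/shared/ckpt", version=3))
xccl_plan = plan_update(WeightUpdateSketch(type="xccl", version=3))
```

The trade-off is the usual one: disk mode needs only shared storage and an HTTP trigger, while the broadcast path avoids the filesystem round trip at the cost of setting up a collective group between trainer and inference workers.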


Server Lifecycle Management

The InferenceEngine abstract base class defines several lifecycle methods for managing the remote server process.

| Method | Description |
| --- | --- |
| launch_server() | Spawns the server subprocess (SGLang/vLLM) with specified CLI args. |
| pause_generation() | Signals the server to stop processing new requests (used during weight updates). |
| continue_generation() | Resumes request processing. |
| offload() | Moves model weights to CPU memory to free up GPU. |
| onload() | Moves model weights back to GPU memory. |

Sources: areal/api/engine_api.py:567-601
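
Because pause_generation() and continue_generation() bracket a weight update, one way to use them safely is a context manager that always resumes generation, even if the update fails. The sketch below runs against a fake in-memory engine. `FakeEngine`, `load_weights`, and `generation_paused` are hypothetical names; only the two pause/continue method names come from the table above.

```python
from contextlib import contextmanager
from typing import Iterator


class FakeEngine:
    """In-memory stand-in for a remote inference engine."""

    def __init__(self) -> None:
        self.paused = False
        self.version = 0
        self.log: list[str] = []

    def pause_generation(self) -> None:
        self.paused = True
        self.log.append("pause")

    def continue_generation(self) -> None:
        self.paused = False
        self.log.append("continue")

    def load_weights(self, version: int) -> None:
        # Hypothetical update hook: refuses to swap weights mid-generation.
        if not self.paused:
            raise RuntimeError("weights swapped while generation is active")
        self.version = version
        self.log.append(f"load:{version}")


@contextmanager
def generation_paused(engine: FakeEngine) -> Iterator[FakeEngine]:
    """Pause on entry and resume on exit, including when the body raises."""
    engine.pause_generation()
    try:
        yield engine
    finally:
        engine.continue_generation()


engine = FakeEngine()
with generation_paused(engine):
    engine.load_weights(version=1)
```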


Asynchronous Rollout Execution

The InferenceEngine provides an asynchronous API to allow the Trainer to overlap rollout generation with training steps.

Async Execution Pattern


Sources: areal/api/engine_api.py:703-750, areal/api/engine_api.py:752-779, areal/infra/controller/rollout_controller.py:72-133
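
The overlap between rollout generation and training can be sketched with plain asyncio. This is not the AReaL API: `fake_rollout` and `fake_train_step` are hypothetical coroutines, and the sleeps stand in for remote generation and GPU work. The pattern shown is submit first, train while rollouts are in flight, then wait for the batch.

```python
import asyncio


async def fake_rollout(prompt_id: int) -> dict:
    """Pretend episode: yields control while the 'server' generates."""
    await asyncio.sleep(0.01)
    return {"prompt_id": prompt_id, "reward": float(prompt_id % 2)}


async def fake_train_step(step: int) -> str:
    """Pretend training step running while rollouts are in flight."""
    await asyncio.sleep(0.01)
    return f"step-{step}"


async def main() -> tuple[str, list[dict]]:
    # 1. Submit: schedule rollouts without waiting for them.
    pending = [asyncio.create_task(fake_rollout(i)) for i in range(4)]
    # 2. Overlap: train on earlier data while generation proceeds.
    trained = await fake_train_step(0)
    # 3. Wait: collect the finished trajectories, in submission order.
    batch = await asyncio.gather(*pending)
    return trained, list(batch)


trained, batch = asyncio.run(main())
```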

Key Methods


Version Tracking

To keep off-policyness under control, the system tracks the version of the weights used for each generation.

Sources: areal/api/engine_api.py:683-701, areal/api/io_struct.py:63-68
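
One way such version tags could be used is to discard trajectories that lag too far behind the trainer. The sketch below assumes each trajectory records the weight version it was generated with. `TrajectorySketch`, `filter_stale`, and the `max_staleness` threshold are hypothetical names, not AReaL's.

```python
from dataclasses import dataclass


@dataclass
class TrajectorySketch:
    tokens: list[int]
    version: int  # weight version the inference engine held at generation time


def filter_stale(
    trajectories: list[TrajectorySketch], current_version: int, max_staleness: int
) -> list[TrajectorySketch]:
    """Keep trajectories generated at most `max_staleness` versions ago."""
    return [
        t for t in trajectories if current_version - t.version <= max_staleness
    ]


trajectories = [
    TrajectorySketch(tokens=[1], version=5),
    TrajectorySketch(tokens=[2], version=7),
    TrajectorySketch(tokens=[3], version=8),
]
fresh = filter_stale(trajectories, current_version=8, max_staleness=1)
```

With `max_staleness=0`, only trajectories from the current weights pass, which corresponds to fully on-policy training in this toy model.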


Inference Service (Experimental)

AReaL includes an experimental microservice-based inference architecture designed for scalable agent serving. This system separates the control plane from the data plane.

Experimental Architecture Components

| Component | Responsibility | Source |
| --- | --- | --- |
| Gateway | Entry point for requests; handles auth and forwards to the Router. | areal/experimental/inference_service/gateway/app.py:3-7 |
| Router | Manages worker state, session pinning, and routing strategies. | areal/experimental/inference_service/gateway/app.py:3-7 |
| DataProxy | Intermediary for session data and tokenizer proxying. | areal/experimental/inference_service/data_proxy/app.py:19-44 |
| InfBridge | Backend-agnostic HTTP client implementing the async generation protocol. | areal/experimental/inference_service/inf_bridge.py:32-59 |

The InfBridge uses a pluggable backend protocol (InfBridgeBackend) to support different inference servers like SGLang and vLLM within this microservice context. Source: areal/experimental/inference_service/backend.py:26-31

Sources: areal/experimental/inference_service/gateway/app.py:3-7, areal/experimental/inference_service/inf_bridge.py:32-59, areal/experimental/inference_service/backend.py:26-31, areal/experimental/inference_service/data_proxy/app.py:19-44
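
Session pinning, one of the Router's listed responsibilities, can be illustrated with a small toy. The sketch assumes a round-robin strategy for new sessions and a sticky mapping afterwards. `PinningRouterSketch` and the worker names are hypothetical, and the real Router's strategies are not described here.

```python
import itertools


class PinningRouterSketch:
    """Toy router: round-robin for new sessions, sticky for known ones."""

    def __init__(self, workers: list[str]) -> None:
        if not workers:
            raise ValueError("at least one worker is required")
        self._cycle = itertools.cycle(workers)
        self._pinned: dict[str, str] = {}

    def route(self, session_id: str) -> str:
        # A session keeps its first worker, so cached prefix state on that
        # worker can be reused across the session's requests.
        if session_id not in self._pinned:
            self._pinned[session_id] = next(self._cycle)
        return self._pinned[session_id]


router = PinningRouterSketch(["worker-0", "worker-1"])
first = router.route("session-a")
second = router.route("session-b")
repeat = router.route("session-a")
third = router.route("session-c")
```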