Last indexed: 7 May 2026 (2e12c1)

Inference System

Purpose and Scope

The Inference System provides high-performance model inference for rollout generation in reinforcement learning training. It manages distributed inference workers, handles weight synchronization from training engines, and executes workflows asynchronously to generate trajectories. This document covers the overall architecture and integration patterns of the inference system.

For details on specific components, see:

InferenceEngine API — Abstract InferenceEngine interface, async generation, and workflow integration
SGLang Backend — SGLangBackend implementation, request/response formats, and SGLang-specific features
vLLM Backend — VLLMBackend implementation, OpenAI-compatible API, and vLLM-specific features
Backend Protocol and Extensibility — How to implement custom inference backends following the Backend protocol
Weight Update Protocols — How inference engines receive and apply weight updates from training
Server Lifecycle Management — Launching, monitoring, and managing local inference server processes
Async Rollout Execution — Submit/wait/prepare_batch patterns and async request handling
Inference Service (Experimental) — Experimental microservice-based inference architecture with Gateway, Router, DataProxy, and Guard components for scalable agent serving.

For information about how workflows interact with the inference system, see Workflow and Rollout System.

Architecture Overview

The inference system uses a composition-based architecture: the backend-specific engines (RemoteSGLangEngine, RemotevLLMEngine) delegate to a shared RemoteInfEngine core, which in turn relies on a backend object (SGLangBackend, VLLMBackend) to build requests and parse responses. This design separates high-level inference orchestration from backend-specific protocol details.

Class Hierarchy

Sources: areal/api/engine_api.py530-561 areal/engine/sglang_remote.py40-41 areal/engine/vllm_remote.py41-42 areal/infra/remote_inf_engine.py125-230

Component Responsibilities

Component	Responsibility
`InferenceEngine`	Abstract interface defining operations like `agenerate`, `update_weights`, and `submit`.
`RemoteSGLangEngine`	SGLang-specific wrapper implementing `InferenceEngine` via `RemoteInfEngine`.
`RemotevLLMEngine`	vLLM-specific wrapper implementing `InferenceEngine` via `RemoteInfEngine`.
`RemoteInfEngine`	Core implementation handling HTTP communication, workflow execution, and weight updates.
`SGLangBackend`	Builds SGLang-specific HTTP requests and parses JSON responses.
`VLLMBackend`	Builds vLLM/OpenAI-compatible HTTP requests and parses responses.
`WorkflowExecutor`	Manages the asynchronous execution of `RolloutWorkflow` episodes.

Sources: areal/engine/sglang_remote.py40-187 areal/engine/vllm_remote.py41-183 areal/api/engine_api.py530-561 areal/infra/remote_inf_engine.py60
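
The composition described above can be illustrated with a minimal sketch. All class names prefixed with Toy, the CoreEngine class, the build_request/parse_response method names, and the "tokens" response key are assumptions made for illustration; they are not the actual AReaL signatures or wire formats. Only the endpoint paths come from the comparison table in this document.

```python
from typing import Any, Callable, Protocol

# Hypothetical transport type: takes (endpoint, payload), returns a JSON-like dict.
Send = Callable[[str, dict[str, Any]], dict[str, Any]]


class Backend(Protocol):
    """Hypothetical protocol for backend-specific request/response handling."""

    def build_request(self, input_ids: list[int]) -> tuple[str, dict[str, Any]]: ...

    def parse_response(self, body: dict[str, Any]) -> list[int]: ...


class ToySGLangBackend:
    def build_request(self, input_ids: list[int]) -> tuple[str, dict[str, Any]]:
        # Placeholder payload, not the real SGLang wire format.
        return "/generate", {"input_ids": input_ids}

    def parse_response(self, body: dict[str, Any]) -> list[int]:
        return body["tokens"]


class ToyVLLMBackend:
    def build_request(self, input_ids: list[int]) -> tuple[str, dict[str, Any]]:
        # Placeholder payload, not the real vLLM wire format.
        return "/v1/completions", {"prompt": input_ids}

    def parse_response(self, body: dict[str, Any]) -> list[int]:
        return body["tokens"]


class CoreEngine:
    """Stand-in for the shared core: owns transport, delegates protocol details."""

    def __init__(self, backend: Backend, send: Send) -> None:
        self._backend = backend
        self._send = send

    def generate(self, input_ids: list[int]) -> list[int]:
        endpoint, payload = self._backend.build_request(input_ids)
        return self._backend.parse_response(self._send(endpoint, payload))


class ToyRemoteSGLangEngine:
    """Thin wrapper: picks a backend and forwards calls to the core."""

    def __init__(self, send: Send) -> None:
        self._core = CoreEngine(ToySGLangBackend(), send)

    def generate(self, input_ids: list[int]) -> list[int]:
        return self._core.generate(input_ids)


class ToyRemotevLLMEngine:
    def __init__(self, send: Send) -> None:
        self._core = CoreEngine(ToyVLLMBackend(), send)

    def generate(self, input_ids: list[int]) -> list[int]:
        return self._core.generate(input_ids)
```

With this shape, supporting another server only requires a new backend class; the core and its transport logic stay unchanged.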

Backend Implementations

AReaL supports two high-performance inference backends: SGLang and vLLM. The backend is typically selected through the allocation_mode configuration or the rollout.backend YAML field.

Backend Comparison

Feature	SGLang (`SGLangBackend`)	vLLM (`VLLMBackend`)
Optimization	Prefix sharing, RadixAttention	PagedAttention, high throughput
API Style	Custom `/generate` endpoint	OpenAI `/v1/completions` or `/v1/chat/completions`
LoRA Support	Disk-based adapter loading	XCCL and Disk-based loading
Expert Routing	Returns `routed_experts` MoE IDs	Not natively exposed in standard response
Vision Support	`image_data` in payload	OpenAI-style message structures
Weight Update	Supports `/update_weights_from_distributed`	Supports `/areal_update_weights_xccl`

Sources: areal/engine/sglang_remote.py43-90 areal/engine/vllm_remote.py44-96 areal/engine/sglang_remote.py161-187 areal/engine/vllm_remote.py150-183
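
To make the "API Style" row concrete, the sketch below builds one request for each style. The helper function names are hypothetical. The field names follow the publicly documented SGLang native generate API and the OpenAI completions API; the payloads that SGLangBackend and VLLMBackend actually build are not shown in this document and may contain additional or different fields.

```python
from typing import Any


def sglang_generate_request(
    input_ids: list[int], max_new_tokens: int, temperature: float
) -> tuple[str, dict[str, Any]]:
    """Hypothetical helper: SGLang-style request with nested sampling parameters."""
    payload = {
        "input_ids": input_ids,
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }
    return "/generate", payload


def vllm_completions_request(
    model: str, input_ids: list[int], max_new_tokens: int, temperature: float
) -> tuple[str, dict[str, Any]]:
    """Hypothetical helper: OpenAI-style completion request with flat parameters."""
    payload = {
        "model": model,
        "prompt": input_ids,  # the completions API accepts a list of token IDs
        "max_tokens": max_new_tokens,
        "temperature": temperature,
    }
    return "/v1/completions", payload
```

The main practical difference is where sampling options live: nested under sampling_params for the SGLang-style request, and at the top level for the OpenAI-style request.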

Configuration Example

Backends are configured via YAML, specifying parameters like model_path, context_length, and LoRA settings.

Sources: examples/math/gsm8k_grpo_lora.yaml116-139

Request and Response Flow

Generation Request Pipeline

Sources: areal/engine/sglang_remote.py43-127 areal/engine/vllm_remote.py44-127 areal/api/io_struct.py28-130

Key Data Structures

ModelRequest

A ModelRequest carries the prompt token IDs (input_ids), the generation hyperparameters (gconfig), and optional vision data (image_data or vision_msg_vllm). areal/api/io_struct.py28-59

ModelResponse

A ModelResponse carries output_tokens, output_logprobs, and metadata such as latency or, for MoE models, routed_experts. It also provides utility properties such as output_tokens_without_stop, which strips EOS/PAD tokens from the output. areal/api/io_struct.py63-130
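
A simplified sketch of these two structures is shown below. The Toy classes are illustrative stand-ins, not the definitions in areal/api/io_struct.py. The field types, the default values, the stop_token_ids field, and the choice to strip only trailing stop tokens are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToyModelRequest:
    input_ids: list[int] = field(default_factory=list)
    # Assumption: generation hyperparameters modeled as a plain dict here.
    gconfig: dict[str, Any] = field(default_factory=dict)
    # Assumption: optional vision payload, left untyped in this sketch.
    image_data: Optional[list[Any]] = None


@dataclass
class ToyModelResponse:
    output_tokens: list[int] = field(default_factory=list)
    output_logprobs: list[float] = field(default_factory=list)
    output_versions: list[int] = field(default_factory=list)
    latency: float = 0.0
    # Hypothetical field: IDs treated as EOS/PAD when stripping.
    stop_token_ids: tuple[int, ...] = ()

    @property
    def output_tokens_without_stop(self) -> list[int]:
        """Return output tokens with trailing stop tokens removed."""
        tokens = list(self.output_tokens)
        while tokens and tokens[-1] in self.stop_token_ids:
            tokens.pop()
        return tokens
```

Exposing the stripped view as a property keeps the raw output_tokens intact, so logprobs and version lists stay aligned with the original sequence.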

Weight Synchronization Mechanisms

The weights held by inference engines must stay synchronized with the TrainEngine. AReaL supports multiple update modes, selected through WeightUpdateMeta.

1. Disk Mode (Shared Storage)

The trainer saves weights to a shared filesystem, and the inference engine loads them when it receives an HTTP command. This mode is required for some LoRA setups; for example, SGLang's distributed weight update does not currently support LoRA.

SGLang: Uses /update_weights_from_disk or /load_lora_adapter. areal/engine/sglang_remote.py129-159
vLLM: Uses /areal_update_weights or /v1/load_lora_adapter. areal/engine/vllm_remote.py129-148

2. XCCL Mode (Network Broadcast)

Weights are broadcast over the network (NCCL/XCCL) directly from the trainer's GPU memory to the inference worker's GPU memory.

SGLang: Uses /update_weights_from_distributed. Note that SGLang distributed weight update currently does not support LoRA. areal/engine/sglang_remote.py161-187
vLLM: Uses a two-step process involving /areal_set_update_weight_meta and /areal_update_weights_xccl. areal/engine/vllm_remote.py150-183

Sources: examples/math/gsm8k_grpo_lora.yaml84 areal/api/io_struct.py183-213
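
The endpoint choices listed above can be summarized as a small lookup. The function name, the string values for backend and mode, and the use_lora flag are hypothetical; only the endpoint paths are taken from this document. The sketch also assumes that the LoRA endpoints are used exactly when LoRA is enabled in disk mode, and that LoRA does not change the vLLM XCCL endpoints, which the document does not state.

```python
def weight_update_endpoints(backend: str, mode: str, use_lora: bool = False) -> list[str]:
    """Return the HTTP endpoints to call, in order, for one weight update."""
    if mode == "disk":
        if backend == "sglang":
            return ["/load_lora_adapter"] if use_lora else ["/update_weights_from_disk"]
        if backend == "vllm":
            return ["/v1/load_lora_adapter"] if use_lora else ["/areal_update_weights"]
    elif mode == "xccl":
        if backend == "sglang":
            if use_lora:
                # Stated in the document: no LoRA support for SGLang distributed updates.
                raise ValueError("SGLang distributed weight update does not support LoRA")
            return ["/update_weights_from_distributed"]
        if backend == "vllm":
            # Two-step process: send metadata first, then trigger the transfer.
            # Assumption: the same endpoints apply with or without LoRA.
            return ["/areal_set_update_weight_meta", "/areal_update_weights_xccl"]
    raise ValueError(f"unsupported combination: backend={backend!r}, mode={mode!r}")
```

For example, weight_update_endpoints("vllm", "xccl") yields the two vLLM endpoints in the order they would be called.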

Server Lifecycle Management

The InferenceEngine abstract base class defines several lifecycle methods for managing the remote server process.

Method	Description
`launch_server()`	Spawns the server subprocess (SGLang/vLLM) with specified CLI args.
`pause_generation()`	Signals the server to stop processing new requests (used during weight updates).
`continue_generation()`	Resumes request processing.
`offload()`	Moves model weights to CPU memory to free GPU memory.
`onload()`	Moves model weights back to GPU memory.

Sources: areal/api/engine_api.py567-601
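
Because pause_generation() is used during weight updates, a caller may want to guarantee that continue_generation() always runs afterwards, even if the update fails. The sketch below shows one way to do that with a context manager. ToyEngine and generation_paused are hypothetical; the update_weights signature and the rule that updates require a paused engine are assumptions of this toy model, not documented AReaL behavior.

```python
from contextlib import contextmanager
from typing import Iterator


class ToyEngine:
    """Toy stand-in that records lifecycle calls instead of talking to a server."""

    def __init__(self) -> None:
        self.paused = False
        self.version = 0
        self.calls: list[str] = []

    def pause_generation(self) -> None:
        self.paused = True
        self.calls.append("pause_generation")

    def continue_generation(self) -> None:
        self.paused = False
        self.calls.append("continue_generation")

    def update_weights(self, new_version: int) -> None:
        # Assumption of this toy model: updates are only allowed while paused.
        if not self.paused:
            raise RuntimeError("pause generation before updating weights")
        self.version = new_version
        self.calls.append("update_weights")


@contextmanager
def generation_paused(engine: ToyEngine) -> Iterator[ToyEngine]:
    """Pause generation for the duration of the block, then always resume."""
    engine.pause_generation()
    try:
        yield engine
    finally:
        engine.continue_generation()
```

A caller would write `with generation_paused(engine): engine.update_weights(1)`, and generation resumes whether or not the update raises.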

Asynchronous Rollout Execution

The InferenceEngine provides an asynchronous API to allow the Trainer to overlap rollout generation with training steps.

Async Execution Pattern

Sources: areal/api/engine_api.py703-750 areal/api/engine_api.py752-779 areal/infra/controller/rollout_controller.py72-133

Key Methods

submit(): Non-blocking submission of a rollout task (data and workflow). areal/api/engine_api.py703-750
wait(): Blocks until a specified number of trajectories are collected. areal/api/engine_api.py752-779
prepare_batch(): High-level utility to consume a DataLoader and yield completed trajectory batches. areal/api/engine_api.py852-911
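
The submit/wait/prepare_batch pattern can be sketched with a thread pool and a queue. ToyRolloutExecutor and prepare_batches are hypothetical stand-ins with simplified signatures; the real methods in areal/api/engine_api.py take different arguments and run workflows asynchronously against inference servers.

```python
import queue
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Iterable, Iterator, Optional


class ToyRolloutExecutor:
    def __init__(self, max_workers: int = 4) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._done: "queue.Queue[Future]" = queue.Queue()

    def submit(self, data: Any, workflow: Callable[[Any], Any]) -> None:
        """Non-blocking: schedule one rollout and return immediately."""
        future = self._pool.submit(workflow, data)
        # When the rollout finishes, push its future onto the completion queue.
        future.add_done_callback(self._done.put)

    def wait(self, count: int, timeout: Optional[float] = None) -> list[Any]:
        """Block until `count` rollouts finish; `timeout` applies per result."""
        # Calling result() re-raises any exception raised inside the workflow.
        return [self._done.get(timeout=timeout).result() for _ in range(count)]

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)


def prepare_batches(
    executor: ToyRolloutExecutor,
    dataloader: Iterable[list[Any]],
    workflow: Callable[[Any], Any],
) -> Iterator[list[Any]]:
    """Submit every item of each batch, then yield that batch's trajectories."""
    for batch in dataloader:
        for item in batch:
            executor.submit(item, workflow)
        yield executor.wait(count=len(batch))
```

Results arrive in completion order rather than submission order, so a consumer that needs the original order has to carry an identifier inside each trajectory.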

Version Tracking

To keep off-policyness under control, the system tracks which version of the weights was used for generation.

set_version(version): Updates the engine's internal version counter. areal/api/engine_api.py683-690
get_version(): Retrieves the current version. areal/api/engine_api.py692-701
ModelResponse.output_versions: Records which version was active during generation. areal/api/io_struct.py68

Sources: areal/api/engine_api.py683-701 areal/api/io_struct.py63-68
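
One use of this version information is to measure how stale a trajectory is relative to the current weights. The helpers below are hypothetical and not part of AReaL; they assume output_versions is a list with one integer version per generated token.

```python
def max_staleness(output_versions: list[int], current_version: int) -> int:
    """Largest gap between the current version and any token's version."""
    if not output_versions:
        return 0
    return current_version - min(output_versions)


def is_fresh_enough(
    output_versions: list[int], current_version: int, max_allowed: int
) -> bool:
    """True if no token is more than `max_allowed` versions behind."""
    return max_staleness(output_versions, current_version) <= max_allowed
```

For instance, a trajectory whose tokens were generated under versions 3, 3, and 4 has a staleness of 2 when the current version is 5.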

Inference Service (Experimental)

AReaL includes an experimental microservice-based inference architecture designed for scalable agent serving. This system separates the control plane from the data plane.

Experimental Architecture Components

Component	Responsibility
`Gateway`	Entry point for requests; handles auth and forwards to the Router. areal/experimental/inference_service/gateway/app.py3-7
`Router`	Manages worker state, session pinning, and routing strategies. areal/experimental/inference_service/gateway/app.py3-7
`DataProxy`	Intermediary for session data and tokenizer proxying. areal/experimental/inference_service/data_proxy/app.py19-44
`InfBridge`	Backend-agnostic HTTP client implementing the async generation protocol. areal/experimental/inference_service/inf_bridge.py32-59

The InfBridge uses a pluggable backend protocol (InfBridgeBackend) to support different inference servers like SGLang and vLLM within this microservice context. areal/experimental/inference_service/backend.py26-31

Sources: areal/experimental/inference_service/gateway/app.py3-7 areal/experimental/inference_service/inf_bridge.py32-59 areal/experimental/inference_service/backend.py26-31 areal/experimental/inference_service/data_proxy/app.py19-44
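
Session pinning, one of the Router responsibilities listed above, can be illustrated with a toy example. ToyRouter, its route method, and the round-robin strategy for new sessions are assumptions for illustration only; the routing strategies the experimental Router actually implements are not described in this document.

```python
from itertools import cycle


class ToyRouter:
    """Pins each session to one worker; new sessions are assigned round-robin."""

    def __init__(self, workers: list[str]) -> None:
        if not workers:
            raise ValueError("at least one worker is required")
        self._next_worker = cycle(workers)
        self._pinned: dict[str, str] = {}

    def route(self, session_id: str) -> str:
        # A known session always returns to the worker it was first assigned.
        if session_id not in self._pinned:
            self._pinned[session_id] = next(self._next_worker)
        return self._pinned[session_id]
```

Pinning keeps all requests of a multi-turn session on the same worker, which is typically what allows a server to reuse cached prefixes for that session.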

URL: https://deepwiki.com/inclusionAI/AReaL/4-inference-system