Last indexed: 7 May 2026 (2e12c1)

InferenceEngine API

The InferenceEngine API is the abstract interface for all inference backends in AReaL. It defines the contract for asynchronous model generation, weight synchronization, and rollout coordination. Concrete implementations include RemoteSGLangEngine areal/engine/sglang_remote.py247-439 and RemotevLLMEngine areal/engine/vllm_remote.py272-465.

This page covers the abstract API specification. For backend-specific implementation details, see SGLang Backend and vLLM Backend. For workflow integration patterns, see RolloutWorkflow API. For weight update mechanisms from the training engine perspective, see Weight Synchronization.

Sources: areal/api/engine_api.py530-992

Abstract Interface Overview

The InferenceEngine abstract base class defines the complete interface for inference backends. All implementations must provide methods for initialization, generation, weight updates, and rollout coordination.

System Architecture and Code Entities

The following diagram bridges the conceptual "Inference Space" with specific code entities used in the AReaL implementation.

Title: Inference System Architecture

Sources: areal/api/engine_api.py530-992 areal/engine/sglang_remote.py247-439 areal/engine/vllm_remote.py272-465 areal/api/io_struct.py28-130 areal/api/workflow_api.py14-40

Core API Methods

The InferenceEngine API is organized into several functional groups:

Category	Methods	Purpose
Lifecycle	`initialize()`, `destroy()`, `initialized`	Engine setup and teardown
Generation	`agenerate()`, `workflow_executor`	Async model inference
Weight Updates	`init_weights_update_group()`, `update_weights_from_distributed()`, `update_weights_from_disk()`, `set_version()`, `get_version()`	Weight synchronization
Rollout	`submit()`, `wait()`, `wait_for_task()`, `rollout_batch()`, `prepare_batch()`	Async rollout coordination
Control	`pause_generation()`, `continue_generation()`, `pause()`, `resume()`	Flow control
Memory	`offload()`, `onload()`	GPU memory management
Server	`launch_server()`, `teardown_server()`	Local server management
Monitoring	`export_stats()`, `save_perf_tracer()`, `config_perf_tracer()`	Observability

Sources: areal/api/engine_api.py530-992

Initialization and Lifecycle

Initialization Flow

Title: InferenceEngine Initialization Lifecycle

The initialize() method areal/api/engine_api.py531-548 accepts the following parameters:

engine_id (str | None): Identifier for server discovery via filesystem. If None, uses the addr parameter.
addr (str | list[str] | None): Direct server address(es). Can be a single address or list of addresses for distributed inference.
train_data_parallel_size (int | None): Data parallel size of training engine for distributed coordination.
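
The choice between engine_id and addr described above can be sketched as a small normalization step. This is a minimal illustration, not the AReaL implementation: resolve_addresses and its discover callback are hypothetical names, and the filesystem lookup itself is left to the caller because this page does not describe it.

```python
from __future__ import annotations

from typing import Callable


def resolve_addresses(
    engine_id: str | None = None,
    addr: str | list[str] | None = None,
    discover: Callable[[str], list[str]] | None = None,
) -> list[str]:
    """Hypothetical helper: turn initialize()-style arguments into a list of addresses."""
    if engine_id is not None:
        # The page says engine_id drives discovery via the filesystem; the
        # lookup is injected here because its details are not documented.
        if discover is None:
            raise ValueError("engine_id given but no discovery function supplied")
        return discover(engine_id)
    if addr is None:
        raise ValueError("either engine_id or addr must be provided")
    # A single address and a list of addresses are both accepted.
    return [addr] if isinstance(addr, str) else list(addr)


single = resolve_addresses(addr="localhost:8000")
many = resolve_addresses(addr=["host-a:8000", "host-b:8000"])
```

Normalizing to a list up front means the rest of a client only has to handle the multi-server case.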

Sources: areal/api/engine_api.py531-560

Initialization States

The initialized property areal/api/engine_api.py550-560 tracks whether the engine has been successfully initialized.

Asynchronous Generation

Core Generation Method

The primary generation method is agenerate() areal/api/engine_api.py603-616 which accepts a ModelRequest and returns a ModelResponse.

The ModelRequest structure areal/api/io_struct.py28-59 contains:

input_ids: Token IDs for text input.
gconfig: GenerationHyperparameters for sampling parameters.
image_data: Optional vision inputs (base64 strings).
vision_msg_vllm: Optional vLLM-specific vision messages areal/api/io_struct.py43.
metadata: Additional metadata dictionary.

The ModelResponse structure areal/api/io_struct.py63-130 contains:

output_tokens: Generated token IDs.
output_logprobs: Log probabilities for each token.
output_versions: Weight versions used for generation areal/api/io_struct.py68.
latency: Generation latency metrics.
routed_experts: Optional MoE expert routing information, extracted as a NumPy array areal/api/io_struct.py83.
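
To make the request/response shape concrete, here is a self-contained sketch of an agenerate()-style call. SketchRequest, SketchResponse, and EchoEngine are hypothetical stand-ins that carry only a subset of the fields listed above; they are not the AReaL classes, and the assumption that output_versions holds one entry per generated token is mine.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SketchRequest:
    """Stand-in for ModelRequest (subset of fields)."""
    input_ids: list[int]
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class SketchResponse:
    """Stand-in for ModelResponse (subset of fields)."""
    output_tokens: list[int]
    output_logprobs: list[float]
    output_versions: list[int]  # assumed: one weight version per token


class EchoEngine:
    """Toy engine: 'generates' the reversed prompt instead of calling a server."""

    def __init__(self, version: int = 0) -> None:
        self.version = version

    async def agenerate(self, req: SketchRequest) -> SketchResponse:
        await asyncio.sleep(0)  # yield control, as a real network call would
        tokens = list(reversed(req.input_ids))
        return SketchResponse(
            output_tokens=tokens,
            output_logprobs=[0.0] * len(tokens),
            output_versions=[self.version] * len(tokens),
        )


async def main() -> list[SketchResponse]:
    engine = EchoEngine(version=3)
    requests = [SketchRequest([1, 2, 3]), SketchRequest([4, 5])]
    # Async generation lets many requests be in flight at once.
    return await asyncio.gather(*(engine.agenerate(r) for r in requests))


responses = asyncio.run(main())
```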

Sources: areal/api/engine_api.py603-616 areal/api/io_struct.py28-130

WorkflowExecutor Integration

The workflow_executor property areal/api/engine_api.py562-565 provides access to the WorkflowExecutor instance, which manages workflow execution. This executor is responsible for running RolloutWorkflow areal/api/workflow_api.py14-40 instances that orchestrate multi-step inference patterns (e.g., multi-turn reasoning or vision-language tasks).

Weight Update Protocols

The InferenceEngine supports two weight update mechanisms: distributed (NCCL/XCCL) and disk-based.

Weight Update Data Flow

Title: Weight Update Protocol Mapping

Sources: areal/api/engine_api.py618-702 areal/engine/sglang_remote.py128-186 areal/engine/vllm_remote.py126-190

Weight Update Configuration

The WeightUpdateMeta structure areal/api/io_struct.py183-214 contains:

type: Update mode ("disk", "xccl", or "awex").
path: Disk path for disk-based updates.
version: Weight version number.
nccl_group_name: NCCL group identifier.
use_lora: Whether using LoRA adapters.
lora_name, lora_int_id: LoRA adapter identifiers.
peft_config: LoRA configuration parameters (r, alpha, target_modules).

Note: The SGLang backend does not support LoRA with distributed (NCCL) weight updates areal/engine/sglang_remote.py168-172. For LoRA weight updates with SGLang, use the disk-based update mode instead.

Version Tracking

Weight versions are tracked to enable off-policyness checks in asynchronous RL areal/api/engine_api.py683-701. Versions increment monotonically with each weight update. Each generated response records the versions used for generation in ModelResponse.output_versions areal/api/io_struct.py68. The with_version(version) method areal/api/io_struct.py203-214 creates a versioned path for checkpoints (e.g., weight_update_v1).
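
A version-based off-policyness check can be sketched as follows. The helper names (staleness, is_acceptable, versioned_path) and the max_staleness threshold are hypothetical, and the path layout is only an assumption modeled on the weight_update_v1 example; the actual check in AReaL may differ.

```python
from __future__ import annotations


def staleness(output_versions: list[int], current_version: int) -> int:
    """Largest gap between the trainer's version and any version used to generate."""
    return max((current_version - v for v in output_versions), default=0)


def is_acceptable(
    output_versions: list[int], current_version: int, max_staleness: int
) -> bool:
    """Accept a trajectory only if none of its tokens is too stale."""
    return staleness(output_versions, current_version) <= max_staleness


def versioned_path(base: str, version: int) -> str:
    """Assumed layout: one checkpoint directory per weight version."""
    return f"{base}/weight_update_v{version}"


# A response whose tokens were produced under versions 2, 2, and 3,
# checked while the trainer is already at version 4.
gap = staleness([2, 2, 3], current_version=4)
ok = is_acceptable([2, 2, 3], current_version=4, max_staleness=1)
path = versioned_path("/tmp/ckpt", 1)
```

Because versions only increase, the gap is never negative, and a single integer threshold is enough to bound how old the generating weights may be.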

Rollout Coordination

The InferenceEngine provides three patterns for rollout execution:

Pattern 1: Submit/Wait (Asynchronous)

The recommended pattern for production training:

Title: Async Rollout Execution Pattern

Methods:

submit() areal/api/engine_api.py703-750: Submits a request and returns a task ID immediately.
wait() areal/api/engine_api.py752-779: Waits for a specified number of accepted trajectories.
wait_for_task() areal/api/engine_api.py781-807: Waits for a specific task ID.
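
The submit/wait pattern can be illustrated with a thread pool standing in for the inference server. ToyRolloutEngine is a hypothetical class that mirrors only the three method names above; it treats every finished task as accepted, whereas the real wait() counts accepted trajectories, and its signatures are not the AReaL ones.

```python
from __future__ import annotations

import itertools
from concurrent.futures import FIRST_COMPLETED, Future, ThreadPoolExecutor
from concurrent.futures import wait as wait_futures
from typing import Any, Callable


class ToyRolloutEngine:
    """Toy engine: runs a workflow callable on a thread pool."""

    def __init__(self, workflow: Callable[[Any], Any], max_workers: int = 4) -> None:
        self._workflow = workflow
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._ids = itertools.count()
        self._pending: dict[int, Future] = {}

    def submit(self, data: Any) -> int:
        """Queue one request and return its task ID without blocking."""
        task_id = next(self._ids)
        self._pending[task_id] = self._pool.submit(self._workflow, data)
        return task_id

    def wait_for_task(self, task_id: int, timeout: float | None = None) -> Any:
        """Block until one specific task has finished."""
        return self._pending.pop(task_id).result(timeout=timeout)

    def wait(self, count: int, timeout: float | None = None) -> list[Any]:
        """Block until `count` pending tasks have finished; extras stay pending."""
        results: list[Any] = []
        while len(results) < count:
            if not self._pending:
                raise RuntimeError("fewer tasks pending than requested")
            done, _ = wait_futures(
                list(self._pending.values()),
                timeout=timeout,
                return_when=FIRST_COMPLETED,
            )
            if not done:
                raise TimeoutError("no task finished within the timeout")
            for task_id in [t for t, f in self._pending.items() if f in done]:
                if len(results) < count:
                    results.append(self._pending.pop(task_id).result())
        return results

    def destroy(self) -> None:
        self._pool.shutdown(wait=True)


engine = ToyRolloutEngine(workflow=lambda prompt: prompt.upper())
ids = [engine.submit(p) for p in ["a", "b", "c"]]
first = engine.wait_for_task(ids[0])
rest = sorted(engine.wait(count=2))
engine.destroy()
```

Returning a task ID from submit() is what lets a training loop keep generation running in the background and collect results only when it needs them.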

Pattern 2: prepare_batch() (Training Orchestration)

The prepare_batch() method areal/api/engine_api.py852-911 encapsulates the submit/wait pattern for training loops. It pulls data from a StatefulDataLoader and manages a queue of async tasks.

Important: The method caches its configuration (workflow, group_size, etc.) on the first call areal/api/engine_api.py856-860. Different values passed for these parameters on subsequent calls are ignored.
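
The first-call caching behavior is easy to trip over, so here is a toy model of it. CachedBatchPreparer is hypothetical and runs the workflow synchronously; the reading of group_size as "samples per data item" is an assumption, not something this page states.

```python
from __future__ import annotations

from typing import Any, Callable


class CachedBatchPreparer:
    """Toy model of a prepare_batch() that freezes its configuration on first use."""

    def __init__(self) -> None:
        self._cached: tuple[Callable[[Any], Any], int] | None = None

    def prepare_batch(
        self, data: list[Any], workflow: Callable[[Any], Any], group_size: int = 1
    ) -> list[Any]:
        if self._cached is None:
            # Only the first call's workflow and group_size are remembered.
            self._cached = (workflow, group_size)
        cached_workflow, cached_group_size = self._cached
        return [
            cached_workflow(item) for item in data for _ in range(cached_group_size)
        ]


preparer = CachedBatchPreparer()
first_batch = preparer.prepare_batch([1, 2], workflow=lambda x: x + 1, group_size=2)
# The new workflow and group_size below are silently ignored.
second_batch = preparer.prepare_batch([3], workflow=lambda x: x * 10, group_size=1)
```

In this sketch the second call still applies the original workflow twice per item, which is why switching workflows mid-run calls for a fresh engine or a different entry point rather than new arguments.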

Pattern 3: rollout_batch() (Synchronous)

The rollout_batch() method areal/api/engine_api.py809-850 is intended for offline data collection or debugging. It submits a batch of requests and blocks until all are complete.

Memory and Flow Control

Flow Control

pause_generation() / continue_generation() areal/api/engine_api.py913-922: Controls the inference server's internal generation pipeline, typically used during weight updates to ensure consistency.
pause() / resume() areal/api/engine_api.py924-933: Controls the async rollout submission loop in the WorkflowExecutor.
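
One way to guarantee that generation is always resumed after a weight update is a context manager around pause_generation() and continue_generation(). The sketch below is an assumption about usage, not AReaL code: RecordingEngine is a hypothetical stub, its update_weights_from_disk takes a plain path for brevity, and the call order inside the block is illustrative.

```python
from __future__ import annotations

from contextlib import contextmanager
from typing import Iterator


class RecordingEngine:
    """Stub engine that records which control methods were called, in order."""

    def __init__(self) -> None:
        self.calls: list[str] = []
        self.version = 0

    def pause_generation(self) -> None:
        self.calls.append("pause_generation")

    def continue_generation(self) -> None:
        self.calls.append("continue_generation")

    def update_weights_from_disk(self, path: str) -> None:
        # Simplified signature for the sketch: a bare path.
        self.calls.append(f"update_weights_from_disk:{path}")

    def set_version(self, version: int) -> None:
        self.version = version
        self.calls.append(f"set_version:{version}")


@contextmanager
def generation_paused(engine: RecordingEngine) -> Iterator[RecordingEngine]:
    """Pause generation for the duration of the block, resuming even on error."""
    engine.pause_generation()
    try:
        yield engine
    finally:
        engine.continue_generation()


engine = RecordingEngine()
with generation_paused(engine):
    engine.update_weights_from_disk("/tmp/ckpt/weight_update_v1")
    engine.set_version(1)
```

The try/finally ensures the server is not left paused if the weight update raises.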

Memory Management

offload() areal/api/engine_api.py935-938: Offloads the model from GPU to CPU.
onload() areal/api/engine_api.py940-947: Loads the model back from CPU to GPU. Selective onloading is supported via tags (e.g., ["weights", "kv_cache"] for SGLang areal/engine/sglang_remote.py218-220).

Local Server Management

The InferenceEngine can launch and manage local inference server processes via launch_server() areal/api/engine_api.py567-597. This method returns a LocalInfServerInfo object areal/api/io_struct.py262-268 containing the address and process handle.

This is useful for:

Single-controller mode: Launching a local server to serve the engine instance.
Standalone inference: Running agentic workflows without separate server management.

Servers are terminated using teardown_server() areal/api/engine_api.py599-601.

Sources: areal/api/engine_api.py567-601 areal/api/io_struct.py262-268

Backend Implementation Pattern

Both RemoteSGLangEngine and RemotevLLMEngine follow a composition pattern: each delegates to a shared RemoteInfEngine and supplies a backend-specific adapter:

SGLangBackend areal/engine/sglang_remote.py40-245: Implements SGLang-specific HTTP payload structures for generation and weight updates. Supports MoE expert routing extraction areal/engine/sglang_remote.py101-110.
VLLMBackend areal/engine/vllm_remote.py41-270: Implements vLLM-specific HTTP payloads, including OpenAI-compatible chat completions for vision tasks areal/engine/vllm_remote.py88-90.
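
The composition pattern can be sketched with a small protocol and two adapters. Every name below (PayloadBackend, SketchSGLangBackend, SketchVLLMBackend, SketchRemoteEngine, build_generation_payload) is hypothetical, and the payload keys are placeholders chosen for the example; they are not the real SGLang or vLLM request schemas.

```python
from __future__ import annotations

from typing import Any, Protocol


class PayloadBackend(Protocol):
    """What the shared engine needs from any backend adapter."""

    def build_generation_payload(
        self, input_ids: list[int], max_new_tokens: int
    ) -> dict[str, Any]: ...


class SketchSGLangBackend:
    def build_generation_payload(
        self, input_ids: list[int], max_new_tokens: int
    ) -> dict[str, Any]:
        # Placeholder keys, not the actual SGLang schema.
        return {"backend": "sglang", "tokens": input_ids, "limit": max_new_tokens}


class SketchVLLMBackend:
    def build_generation_payload(
        self, input_ids: list[int], max_new_tokens: int
    ) -> dict[str, Any]:
        # Placeholder keys, not the actual vLLM schema.
        return {"backend": "vllm", "prompt": input_ids, "budget": max_new_tokens}


class SketchRemoteEngine:
    """Shared logic lives here; only payload construction is delegated."""

    def __init__(self, backend: PayloadBackend) -> None:
        self.backend = backend

    def generation_request(
        self, input_ids: list[int], max_new_tokens: int
    ) -> dict[str, Any]:
        return self.backend.build_generation_payload(input_ids, max_new_tokens)


sglang_engine = SketchRemoteEngine(SketchSGLangBackend())
vllm_engine = SketchRemoteEngine(SketchVLLMBackend())
sglang_payload = sglang_engine.generation_request([1, 2], max_new_tokens=8)
vllm_payload = vllm_engine.generation_request([1, 2], max_new_tokens=8)
```

Keeping rollout coordination in one shared class and isolating wire formats in adapters means a new backend only has to describe its payloads.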

Sources: areal/engine/sglang_remote.py40-439 areal/engine/vllm_remote.py41-465 areal/api/engine_api.py530-992


URL: https://deepwiki.com/inclusionAI/AReaL/4.1-inferenceengine-api