URL: https://deepwiki.com/inclusionAI/AReaL/4.1-inferenceengine-api

Last indexed: 7 May 2026 (2e12c1)

InferenceEngine API

The InferenceEngine API is the abstract interface for all inference backends in AReaL. It defines the contract for asynchronous model generation, weight synchronization, and rollout coordination. Concrete implementations include RemoteSGLangEngine (areal/engine/sglang_remote.py:247-439) and RemotevLLMEngine (areal/engine/vllm_remote.py:272-465).

This page covers the abstract API specification. For backend-specific implementation details, see SGLang Backend and vLLM Backend. For workflow integration patterns, see RolloutWorkflow API. For weight update mechanisms from the training engine perspective, see Weight Synchronization.

Sources: areal/api/engine_api.py:530-992


Abstract Interface Overview

The InferenceEngine abstract base class defines the complete interface for inference backends. All implementations must provide methods for initialization, generation, weight updates, and rollout coordination.

System Architecture and Code Entities

The following diagram bridges the conceptual "Inference Space" with specific code entities used in the AReaL implementation.

Title: Inference System Architecture


Sources: areal/api/engine_api.py:530-992, areal/engine/sglang_remote.py:247-439, areal/engine/vllm_remote.py:272-465, areal/api/io_struct.py:28-130, areal/api/workflow_api.py:14-40


Core API Methods

The InferenceEngine API is organized into several functional groups:

| Category | Methods | Purpose |
|---|---|---|
| Lifecycle | initialize(), destroy(), initialized | Engine setup and teardown |
| Generation | agenerate(), workflow_executor | Async model inference |
| Weight Updates | init_weights_update_group(), update_weights_from_distributed(), update_weights_from_disk(), set_version(), get_version() | Weight synchronization |
| Rollout | submit(), wait(), wait_for_task(), rollout_batch(), prepare_batch() | Async rollout coordination |
| Control | pause_generation(), continue_generation(), pause(), resume() | Flow control |
| Memory | offload(), onload() | GPU memory management |
| Server | launch_server(), teardown_server() | Local server management |
| Monitoring | export_stats(), save_perf_tracer(), config_perf_tracer() | Observability |

Sources: areal/api/engine_api.py:530-992
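
As a rough illustration of what such a contract looks like in Python, the sketch below declares a trimmed-down abstract class with a few of the methods from the table and a toy subclass that satisfies it. The class names InferenceEngineSketch and EchoEngine, the dictionary returned by agenerate(), and the exact signatures are assumptions made for this example; the real definitions live in areal/api/engine_api.py and may differ.

```python
import abc
import asyncio


class InferenceEngineSketch(abc.ABC):
    """Hypothetical, trimmed-down contract; not the class in engine_api.py."""

    @abc.abstractmethod
    def initialize(self, engine_id=None, addr=None, train_data_parallel_size=None):
        """Connect to the inference server(s)."""

    @abc.abstractmethod
    def destroy(self):
        """Release resources held by the engine."""

    @abc.abstractmethod
    async def agenerate(self, req):
        """Asynchronously generate a response for one request."""

    @abc.abstractmethod
    def set_version(self, version):
        """Record the weight version currently being served."""

    @abc.abstractmethod
    def get_version(self):
        """Return the weight version currently being served."""


class EchoEngine(InferenceEngineSketch):
    """Toy backend that echoes its input; it only exercises the contract."""

    def __init__(self):
        self.initialized = False
        self._version = 0

    def initialize(self, engine_id=None, addr=None, train_data_parallel_size=None):
        self.initialized = True

    def destroy(self):
        self.initialized = False

    async def agenerate(self, req):
        await asyncio.sleep(0)  # hand control back to the event loop once
        return {"echo": req, "version": self._version}

    def set_version(self, version):
        self._version = version

    def get_version(self):
        return self._version
```

Because every method is marked abstract, Python refuses to instantiate a backend that forgets one of them, which is the main benefit of expressing the interface this way.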


Initialization and Lifecycle

Initialization Flow

Title: InferenceEngine Initialization Lifecycle


The initialize() method (areal/api/engine_api.py:531-548) accepts the following parameters:

  • engine_id (str | None): Identifier used to discover servers via the filesystem. If None, the addr parameter is used instead.
  • addr (str | list[str] | None): Direct server address(es). Can be a single address or a list of addresses for distributed inference.
  • train_data_parallel_size (int | None): Data parallel size of the training engine, used for distributed coordination.

Sources: areal/api/engine_api.py:531-560
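
The fallback between engine_id and addr described above can be sketched as a small helper. This is only an illustration under assumptions: resolve_addresses and the discover callable are hypothetical names, and raising an error when both arguments are None is a choice made for this example rather than documented behavior.

```python
from typing import Callable, Optional, Union


def resolve_addresses(
    engine_id: Optional[str],
    addr: Union[str, list[str], None],
    discover: Callable[[str], list[str]],
) -> list[str]:
    """Return server addresses, using discovery when engine_id is given."""
    if engine_id is not None:
        # Hypothetical hook standing in for filesystem-based discovery.
        return discover(engine_id)
    if addr is None:
        # Assumption for this sketch: at least one of the two must be set.
        raise ValueError("either engine_id or addr must be provided")
    # Normalize a single address to a one-element list.
    return [addr] if isinstance(addr, str) else list(addr)


def fake_discover(engine_id: str) -> list[str]:
    """Stand-in lookup that pretends two servers were registered."""
    return [f"{engine_id}-host0:8000", f"{engine_id}-host1:8000"]


single = resolve_addresses(None, "10.0.0.1:8000", fake_discover)
several = resolve_addresses(None, ["10.0.0.1:8000", "10.0.0.2:8000"], fake_discover)
discovered = resolve_addresses("rollout", None, fake_discover)
```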

Initialization States

The initialized property (areal/api/engine_api.py:550-560) tracks whether the engine has been successfully initialized.


Asynchronous Generation

Core Generation Method

The primary generation method is agenerate() (areal/api/engine_api.py:603-616), which accepts a ModelRequest and returns a ModelResponse.

The ModelRequest structure (areal/api/io_struct.py:28-59) contains:

  • input_ids: Token IDs for text input.
  • gconfig: GenerationHyperparameters for sampling parameters.
  • image_data: Optional vision inputs (base64 strings).
  • vision_msg_vllm: Optional vLLM-specific vision messages (areal/api/io_struct.py:43).
  • metadata: Additional metadata dictionary.

The ModelResponse structure (areal/api/io_struct.py:63-130) contains:

  • output_tokens: Generated token IDs.
  • output_logprobs: Log probabilities for each token.
  • output_versions: Weight versions used for generation (areal/api/io_struct.py:68).
  • latency: Generation latency metrics.
  • routed_experts: Optional MoE expert routing information, extracted as a NumPy array (areal/api/io_struct.py:83).

Sources: areal/api/engine_api.py:603-616, areal/api/io_struct.py:28-130
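
To make the request/response shapes concrete, here is a minimal sketch that mirrors a subset of the fields listed above with plain dataclasses and awaits a toy agenerate(). RequestSketch, ResponseSketch, and ToyEngine are hypothetical stand-ins; the plain dict used for gconfig replaces GenerationHyperparameters, the float latency is a simplification, and the "generation" is just an increment so that the example runs without a model.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class RequestSketch:
    """Subset of the ModelRequest fields described above (hypothetical)."""
    input_ids: list[int]
    gconfig: dict[str, Any] = field(default_factory=dict)  # stand-in for GenerationHyperparameters
    image_data: Optional[list[str]] = None
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class ResponseSketch:
    """Subset of the ModelResponse fields described above (hypothetical)."""
    output_tokens: list[int]
    output_logprobs: list[float]
    output_versions: list[int]
    latency: float


class ToyEngine:
    """Fake engine: 'generates' by incrementing the first few prompt tokens."""

    def __init__(self, version: int = 0):
        self.version = version

    async def agenerate(self, req: RequestSketch) -> ResponseSketch:
        await asyncio.sleep(0)  # yield to the event loop like a real network call would
        max_new = req.gconfig.get("max_new_tokens", 2)
        tokens = [t + 1 for t in req.input_ids[:max_new]]
        return ResponseSketch(
            output_tokens=tokens,
            output_logprobs=[0.0] * len(tokens),
            output_versions=[self.version] * len(tokens),  # one version per token
            latency=0.0,
        )


response = asyncio.run(ToyEngine(version=3).agenerate(RequestSketch(input_ids=[1, 2, 3])))
```

Tagging every generated token with a weight version is what lets a trainer later decide how stale a sample is.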

WorkflowExecutor Integration

The workflow_executor property (areal/api/engine_api.py:562-565) provides access to the WorkflowExecutor instance. This executor runs RolloutWorkflow (areal/api/workflow_api.py:14-40) instances, which orchestrate multi-step inference patterns (e.g., multi-turn reasoning or vision-language tasks).


Weight Update Protocols

The InferenceEngine supports two weight update mechanisms: distributed (NCCL/XCCL) and disk-based.

Weight Update Data Flow

Title: Weight Update Protocol Mapping


Sources: areal/api/engine_api.py:618-702, areal/engine/sglang_remote.py:128-186, areal/engine/vllm_remote.py:126-190

Weight Update Configuration

The WeightUpdateMeta structure (areal/api/io_struct.py:183-214) contains:

  • type: Update mode ("disk", "xccl", or "awex").
  • path: Disk path for disk-based updates.
  • version: Weight version number.
  • nccl_group_name: NCCL group identifier.
  • use_lora: Whether using LoRA adapters.
  • lora_name, lora_int_id: LoRA adapter identifiers.
  • peft_config: LoRA configuration parameters (r, alpha, target_modules).

Note: The SGLang backend does not support LoRA with distributed (NCCL) weight updates (areal/engine/sglang_remote.py:168-172). For LoRA weight updates with SGLang, use disk-based update mode instead.
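
The constraint in the note can be expressed as a small validation step. The sketch below is an assumption-laden illustration: WeightUpdateMetaSketch carries only a few of the fields listed above, check_sglang_update is a hypothetical helper, and requiring a path for disk updates is a rule added for this example.

```python
from dataclasses import dataclass
from typing import Optional

VALID_TYPES = {"disk", "xccl", "awex"}


@dataclass
class WeightUpdateMetaSketch:
    """A few of the WeightUpdateMeta fields described above (hypothetical)."""
    type: str
    path: Optional[str] = None
    version: int = 0
    use_lora: bool = False


def check_sglang_update(meta: WeightUpdateMetaSketch) -> None:
    """Raise ValueError for combinations this sketch treats as unsupported."""
    if meta.type not in VALID_TYPES:
        raise ValueError(f"unknown update type: {meta.type!r}")
    if meta.use_lora and meta.type == "xccl":
        # Mirrors the note above: LoRA plus distributed updates is rejected.
        raise ValueError("LoRA with distributed updates is unsupported; use type='disk'")
    if meta.type == "disk" and not meta.path:
        # Assumption for this sketch: disk updates need a checkpoint path.
        raise ValueError("disk updates require a path")


# A LoRA update routed through disk passes validation.
check_sglang_update(WeightUpdateMetaSketch(type="disk", path="/tmp/ckpt", use_lora=True))
```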

Version Tracking

Weight versions are tracked to enable off-policyness checks in asynchronous RL (areal/api/engine_api.py:683-701). Versions increment monotonically with each weight update. Each generated response records the weight versions used for generation in ModelResponse.output_versions (areal/api/io_struct.py:68). The with_version(version) method (areal/api/io_struct.py:203-214) creates a versioned path for checkpoints (e.g., weight_update_v1).
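
A minimal sketch of how per-token versions can feed a staleness check, and of how a versioned checkpoint directory name might be built. The helper names are hypothetical, and measuring staleness as the gap between the current version and the oldest version in a response is an assumption for this example, not necessarily the rule AReaL applies.

```python
import os


def versioned_path(base_dir: str, version: int) -> str:
    """Build a per-version checkpoint directory such as .../weight_update_v1."""
    return os.path.join(base_dir, f"weight_update_v{version}")


def staleness(current_version: int, output_versions: list[int]) -> int:
    """Gap between the trainer's version and the oldest version in a response."""
    return current_version - min(output_versions)


def is_fresh_enough(current_version: int, output_versions: list[int], max_staleness: int) -> bool:
    """Accept a sample only if its oldest token is within the allowed lag."""
    return staleness(current_version, output_versions) <= max_staleness


# A response whose tokens were generated under versions 3, 4, and 4,
# checked against a trainer that is now at version 5.
lag = staleness(5, [3, 4, 4])
accepted = is_fresh_enough(5, [3, 4, 4], max_staleness=2)
```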


Rollout Coordination

The InferenceEngine provides three patterns for rollout execution:

Pattern 1: Submit/Wait (Asynchronous)

The recommended pattern for production training:

Title: Async Rollout Execution Pattern


Methods: submit(), wait(), and wait_for_task(), as listed in the Rollout row of the Core API Methods table.
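
The submit/wait idea can be sketched with a thread pool standing in for the engine. ToyRolloutQueue and square_workflow are hypothetical; the real submit() and wait() signatures are not reproduced here, and a real workflow would call the inference server instead of squaring a number.

```python
import concurrent.futures as cf


class ToyRolloutQueue:
    """Hypothetical stand-in for an engine's submit()/wait() pair."""

    def __init__(self, max_workers=4):
        self._pool = cf.ThreadPoolExecutor(max_workers=max_workers)
        self._pending = []

    def submit(self, data, workflow):
        """Schedule workflow(data) and return immediately."""
        self._pending.append(self._pool.submit(workflow, data))

    def wait(self, count, timeout=None):
        """Block until `count` tasks finish; unfinished tasks stay queued."""
        if count > len(self._pending):
            raise ValueError("fewer tasks pending than requested")
        done = []
        # as_completed raises concurrent.futures.TimeoutError if `timeout` elapses first.
        for fut in cf.as_completed(list(self._pending), timeout=timeout):
            done.append(fut)
            if len(done) == count:
                break
        for fut in done:
            self._pending.remove(fut)
        return [fut.result() for fut in done]

    def close(self):
        """Wait for outstanding work and release the worker threads."""
        self._pool.shutdown(wait=True)


def square_workflow(x):
    """Toy workflow; a real one would run one or more generation calls."""
    return x * x


queue = ToyRolloutQueue()
for item in range(4):
    queue.submit(item, square_workflow)  # non-blocking
first_two = queue.wait(count=2, timeout=5)  # blocks for two results
remaining = queue.wait(count=2, timeout=5)  # collects the rest
queue.close()
```

Results arrive in completion order rather than submission order, so callers that need alignment with the inputs should carry an identifier through the workflow.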

Pattern 2: prepare_batch() (Training Orchestration)

The prepare_batch() method (areal/api/engine_api.py:852-911) encapsulates the submit/wait pattern for training loops. It pulls data from a StatefulDataLoader and manages a queue of async tasks.

Important: The method caches its configuration (workflow, group_size, etc.) on the first call (areal/api/engine_api.py:856-860). Different parameter values passed in subsequent calls are ignored.
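
The first-call caching behavior is easy to trip over, so the following sketch reproduces just that aspect. CachedBatchPreparer is hypothetical: it runs the workflow inline on one item per call and omits the async task queue entirely, and the data loader is a plain list.

```python
class CachedBatchPreparer:
    """Hypothetical illustration of configuration being frozen on first use."""

    def __init__(self):
        self._config = None
        self._data_iter = None

    def prepare_batch(self, dataloader, workflow, group_size=1):
        if self._config is None:
            # First call: remember the configuration and start iterating.
            self._config = {"workflow": workflow, "group_size": group_size}
            self._data_iter = iter(dataloader)
        cfg = self._config  # later calls silently reuse the cached values
        item = next(self._data_iter)
        return [cfg["workflow"](item) for _ in range(cfg["group_size"])]


data = [1, 2, 3]
preparer = CachedBatchPreparer()
first = preparer.prepare_batch(data, workflow=lambda x: x + 1, group_size=2)
# The new workflow and group_size below are ignored; the cached ones still apply.
second = preparer.prepare_batch(data, workflow=lambda x: x * 100, group_size=5)
```

In this sketch the second call still adds one and still returns two samples, which is the kind of surprise the warning above is about.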

Pattern 3: rollout_batch() (Synchronous)

The rollout_batch() method (areal/api/engine_api.py:809-850) is intended for offline data collection or debugging. It submits a batch of requests and blocks until all are complete.


Memory and Flow Control

Flow Control

  • pause_generation() / continue_generation() (areal/api/engine_api.py:913-922): Controls the inference server's internal generation pipeline, typically used during weight updates to ensure consistency.
  • pause() / resume() (areal/api/engine_api.py:924-933): Controls the async rollout submission loop in the WorkflowExecutor.
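
One way to picture the pause/update/continue sequence around a weight update is a context manager that guarantees generation resumes even if the update fails. ToyServer, generation_paused, and sync_weights are hypothetical, and the single-argument update_weights_from_disk(path) used here is a simplification, not the real signature.

```python
from contextlib import contextmanager


class ToyServer:
    """Fake engine that records the order of control calls."""

    def __init__(self):
        self.generating = True
        self.version = 0
        self.log = []

    def pause_generation(self):
        self.generating = False
        self.log.append("pause_generation")

    def continue_generation(self):
        self.generating = True
        self.log.append("continue_generation")

    def update_weights_from_disk(self, path):
        # Simplified signature for this sketch.
        if self.generating:
            raise RuntimeError("weights updated while generation is running")
        self.log.append(f"update:{path}")

    def set_version(self, version):
        self.version = version
        self.log.append(f"set_version:{version}")


@contextmanager
def generation_paused(engine):
    """Pause generation for the duration of the block, then always resume."""
    engine.pause_generation()
    try:
        yield engine
    finally:
        engine.continue_generation()


def sync_weights(engine, path, new_version):
    """Load new weights and bump the version while generation is paused."""
    with generation_paused(engine):
        engine.update_weights_from_disk(path)
        engine.set_version(new_version)


server = ToyServer()
sync_weights(server, "/tmp/ckpt/weight_update_v1", new_version=1)
```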

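Memory Management

  • offload() / onload(): GPU memory management, as listed in the Memory row of the Core API Methods table.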


Local Server Management

The InferenceEngine can launch and manage local inference server processes via launch_server() (areal/api/engine_api.py:567-597). This method returns a LocalInfServerInfo object (areal/api/io_struct.py:262-268) containing the address and process handle.

This is useful for:

  1. Single-controller mode: Launching a local server to serve the engine instance.
  2. Standalone inference: Running agentic workflows without separate server management.

Servers are terminated using teardown_server() (areal/api/engine_api.py:599-601).

Sources: areal/api/engine_api.py:567-601, areal/api/io_struct.py:262-268


Backend Implementation Pattern

Both RemoteSGLangEngine and RemotevLLMEngine follow a composition pattern, delegating to a shared RemoteInfEngine with backend-specific adapters.
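
The general shape of composition with backend-specific adapters can be sketched as follows. Every name here (BackendAdapter, AdapterA, AdapterB, SharedRemoteEngine, EngineA, EngineB) is hypothetical, and the payload layouts are invented for illustration; they are not the request formats used by the actual SGLang or vLLM integrations.

```python
class BackendAdapter:
    """Hypothetical adapter interface: only request formatting differs per backend."""

    def build_payload(self, input_ids, max_new_tokens):
        raise NotImplementedError


class AdapterA(BackendAdapter):
    def build_payload(self, input_ids, max_new_tokens):
        # Invented payload layout for backend A.
        return {"input_ids": list(input_ids), "params": {"max_new_tokens": max_new_tokens}}


class AdapterB(BackendAdapter):
    def build_payload(self, input_ids, max_new_tokens):
        # Invented payload layout for backend B.
        return {"prompt_token_ids": list(input_ids), "max_tokens": max_new_tokens}


class SharedRemoteEngine:
    """Shared logic lives here; the adapter supplies the backend-specific parts."""

    def __init__(self, adapter, send):
        self.adapter = adapter
        self.send = send  # callable that would issue the HTTP request

    def generate(self, input_ids, max_new_tokens):
        return self.send(self.adapter.build_payload(input_ids, max_new_tokens))


class EngineA:
    """Thin public class that delegates to the shared engine."""

    def __init__(self, send):
        self._engine = SharedRemoteEngine(AdapterA(), send)

    def generate(self, input_ids, max_new_tokens):
        return self._engine.generate(input_ids, max_new_tokens)


class EngineB:
    """Same delegation, different adapter."""

    def __init__(self, send):
        self._engine = SharedRemoteEngine(AdapterB(), send)

    def generate(self, input_ids, max_new_tokens):
        return self._engine.generate(input_ids, max_new_tokens)


# With an identity `send`, each engine returns the payload it would have sent.
payload_a = EngineA(send=lambda p: p).generate([1, 2], max_new_tokens=8)
payload_b = EngineB(send=lambda p: p).generate([1, 2], max_new_tokens=8)
```

Keeping retry, scheduling, and version bookkeeping in the shared engine means a new backend only has to supply an adapter.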

Sources: areal/engine/sglang_remote.py:40-439, areal/engine/vllm_remote.py:41-465, areal/api/engine_api.py:530-992