VOOZH about

URL: https://deepwiki.com/inclusionAI/AReaL/4.8-inference-service-(experimental)

⇱ Inference Service (Experimental) | inclusionAI/AReaL | DeepWiki


Loading...
Last indexed: 7 May 2026 (2e12c1)
Menu

Inference Service (Experimental)

The Inference Service is an experimental, microservice-based architecture designed to decouple LLM inference from the training loop. Unlike the standard InferenceEngine which runs within the same process or via direct RPC, this architecture provides a scalable, OpenAI-compatible serving stack including request routing, session management, and automated weight synchronization. It is specifically designed for complex agentic RL scenarios where multiple agents or external runtimes interact with a shared model pool.

Architecture Overview

The system is composed of four primary microservices that interact to provide a robust inference endpoint for agents and rollouts.

Core Components

  1. Gateway: The entry point for all external traffic. It provides an OpenAI-compatible API, handles authentication, and forwards requests to the Router. areal/experimental/inference_service/gateway/app.py3-7
  2. Router: Manages a pool of Data Proxy workers. It performs health checks, handles session pinning, and implements routing strategies to distribute load across available inference backends. areal/experimental/inference_service/router/app.py3-12
  3. Data Proxy: A stateful proxy that sits in front of a raw inference backend. It manages RL-specific state, including conversation sessions via SessionStore, reward tracking, and weight versioning. areal/experimental/inference_service/data_proxy/app.py21-30
  4. Inference Backend: The underlying high-performance engine (e.g., SGLang or vLLM) that performs the actual token generation. areal/experimental/inference_service/inf_bridge.py3-8
  5. RTensor Storage: A specialized storage layer within the Data Proxy (via data_bp) for handling large tensor shards (e.g., logprobs or hidden states) over HTTP. areal/infra/rpc/guard/data_blueprint.py1-15

System Topology Diagram

The following diagram illustrates the flow from a client request through the microservice stack to the GPU-accelerated backend.

Figure 1: Inference Service Microservice Stack


Sources: areal/experimental/inference_service/gateway/app.py124-175 areal/experimental/inference_service/inf_bridge.py32-59 areal/experimental/inference_service/controller/controller.py64-77 areal/experimental/inference_service/gateway/streaming.py46-56


RolloutControllerV2

The RolloutControllerV2 is the orchestrator responsible for the lifecycle of the entire stack. It is designed to be duck-type compatible with the standard RolloutController, allowing it to be swapped into existing training workflows without changing the trainer logic. areal/experimental/inference_service/controller/controller.py64-77

Lifecycle Management

Code Entity Mapping

The following diagram maps the logical controller operations to the specific code entities and process management mechanisms.

Figure 2: Controller to Code Entity Mapping


Sources: areal/experimental/inference_service/controller/controller.py45-49 areal/experimental/inference_service/controller/controller.py79-80 areal/experimental/inference_service/controller/controller.py167-169


Data Proxy and InfBridge

The Data Proxy is the core component bridging the gap between standard HTTP inference and RL requirements. It uses an internal ArealOpenAI client to handle token-level logprob tracking and interaction caching. areal/experimental/inference_service/data_proxy/app.py189-202

InfBridge

The InfBridge class implements the communication logic with the raw backend. Its primary responsibility is managing the Pause/Resume/Resubmit loop required during weight updates. areal/experimental/inference_service/inf_bridge.py32-59

FeatureImplementation Detail
Weight SyncWhen a weight update occurs, InfBridge calls the backend's pause/resume endpoints. areal/experimental/inference_service/inf_bridge.py94-106
Resubmit Loopagenerate manages request retries and token accumulation if the backend aborts a request during a pause. areal/experimental/inference_service/inf_bridge.py162-200
Backend SupportSupports SGLangBridgeBackend and VLLMBridgeBackend via the InfBridgeBackend protocol. areal/experimental/inference_service/inf_bridge.py71-78

Session Management

The Data Proxy maintains a SessionStore that tracks:

RTensor Storage Endpoints

The Data Proxy exposes a data_bp blueprint for managing tensors over the network:

Figure 3: Data Proxy Internal Data Flow


Sources: areal/experimental/inference_service/data_proxy/app.py166-202 areal/experimental/inference_service/inf_bridge.py162-180 areal/experimental/inference_service/data_proxy/pause.py1-10


Weight Synchronization Protocol

The inference service handles asynchronous weight updates from the trainer without dropping active requests by utilizing a stateful pause mechanism.

  1. Trainer Update: The trainer pushes new weights and increments the model version.
  2. Controller Broadcast: RolloutControllerV2 manages the versioning state and broadcasts updates to the worker pool. areal/experimental/inference_service/controller/controller.py140-142
  3. Pause Signal: The controller sends a pause command to the InfBridge via the PauseState. areal/experimental/inference_service/inf_bridge.py94-100
  4. Backend Transition: The underlying inference backend (SGLang/vLLM) is signaled to pause generation, which may abort in-flight requests. areal/experimental/inference_service/inf_bridge.py97-99
  5. Resume & Resubmit: Once weights are updated, the controller signals a resume. InfBridge automatically resubmits aborted requests, prepending any tokens generated before the abort. areal/experimental/inference_service/inf_bridge.py101-106 areal/experimental/inference_service/inf_bridge.py162-200

Sources: areal/experimental/inference_service/controller/controller.py140-142 areal/experimental/inference_service/inf_bridge.py94-106 areal/experimental/inference_service/data_proxy/pause.py1-10

Configuration

The service is configured via InferenceEngineConfig (passed to the controller) and DataProxyConfig.

ParameterTypeDescription
modelstrName of the model to serve. areal/experimental/inference_service/controller/controller.py91-92
admin_api_keystrKey used for control-plane operations. areal/experimental/inference_service/controller/controller.py87-90
backend_typestrType of backend to bridge to (sglang or vllm). areal/experimental/inference_service/data_proxy/app.py172-177
request_timeoutfloatHTTP timeout per generation call. areal/experimental/inference_service/inf_bridge.py51-52
max_resubmit_retriesintMaximum number of abort-resubmit cycles. areal/experimental/inference_service/inf_bridge.py53-54

Sources: areal/experimental/inference_service/controller/controller.py82-94 areal/experimental/inference_service/inf_bridge.py32-59 areal/experimental/inference_service/data_proxy/app.py166-187