Last indexed: 7 May 2026 (2e12c1)

Inference Service (Experimental)

The Inference Service is an experimental, microservice-based architecture designed to decouple LLM inference from the training loop. Unlike the standard InferenceEngine which runs within the same process or via direct RPC, this architecture provides a scalable, OpenAI-compatible serving stack including request routing, session management, and automated weight synchronization. It is specifically designed for complex agentic RL scenarios where multiple agents or external runtimes interact with a shared model pool.

Architecture Overview

The system is composed of four primary microservices that interact to provide a robust inference endpoint for agents and rollouts.

Core Components

Gateway: The entry point for all external traffic. It provides an OpenAI-compatible API, handles authentication, and forwards requests to the Router. areal/experimental/inference_service/gateway/app.py3-7
Router: Manages a pool of Data Proxy workers. It performs health checks, handles session pinning, and implements routing strategies to distribute load across available inference backends. areal/experimental/inference_service/router/app.py3-12
Data Proxy: A stateful proxy that sits in front of a raw inference backend. It manages RL-specific state, including conversation sessions via SessionStore, reward tracking, and weight versioning. areal/experimental/inference_service/data_proxy/app.py21-30
Inference Backend: The underlying high-performance engine (e.g., SGLang or vLLM) that performs the actual token generation. areal/experimental/inference_service/inf_bridge.py3-8
RTensor Storage: A specialized storage layer within the Data Proxy (via data_bp) for handling large tensor shards (e.g., logprobs or hidden states) over HTTP. areal/infra/rpc/guard/data_blueprint.py1-15

System Topology Diagram

The following diagram illustrates the flow from a client request through the microservice stack to the GPU-accelerated backend.

Figure 1: Inference Service Microservice Stack

Sources: areal/experimental/inference_service/gateway/app.py124-175 areal/experimental/inference_service/inf_bridge.py32-59 areal/experimental/inference_service/controller/controller.py64-77 areal/experimental/inference_service/gateway/streaming.py46-56

RolloutControllerV2

The RolloutControllerV2 is the orchestrator responsible for the lifecycle of the entire stack. It is designed to be duck-type compatible with the standard RolloutController, allowing it to be swapped into existing training workflows without changing the trainer logic. areal/experimental/inference_service/controller/controller.py64-77

Lifecycle Management

Initialization: When initialize() is called, the controller starts an online callback server and performs an asynchronous initialization to launch service workers via the Scheduler. areal/experimental/inference_service/controller/controller.py189-201
Service Forking: Services are forked using a "raw_cmd" mode through the RPCGuard (a lightweight process manager) to ensure they run as independent OS processes. areal/experimental/inference_service/controller/controller.py167-169
Registration: The controller registers the inference model and data proxies in the router to enable traffic flow. areal/experimental/inference_service/controller/controller.py137-138

Code Entity Mapping

The following diagram maps the logical controller operations to the specific code entities and process management mechanisms.

Figure 2: Controller to Code Entity Mapping

Sources: areal/experimental/inference_service/controller/controller.py45-49 areal/experimental/inference_service/controller/controller.py79-80 areal/experimental/inference_service/controller/controller.py167-169

Data Proxy and InfBridge

The Data Proxy is the core component bridging the gap between standard HTTP inference and RL requirements. It uses an internal ArealOpenAI client to handle token-level logprob tracking and interaction caching. areal/experimental/inference_service/data_proxy/app.py189-202

InfBridge

The InfBridge class implements the communication logic with the raw backend. Its primary responsibility is managing the Pause/Resume/Resubmit loop required during weight updates. areal/experimental/inference_service/inf_bridge.py32-59

Feature	Implementation Detail
Weight Sync	When a weight update occurs, `InfBridge` calls the backend's pause/resume endpoints. areal/experimental/inference_service/inf_bridge.py94-106
Resubmit Loop	`agenerate` manages request retries and token accumulation if the backend aborts a request during a pause. areal/experimental/inference_service/inf_bridge.py162-200
Backend Support	Supports `SGLangBridgeBackend` and `VLLMBridgeBackend` via the `InfBridgeBackend` protocol. areal/experimental/inference_service/inf_bridge.py71-78

Session Management

The Data Proxy maintains a SessionStore that tracks:

Interaction Trees: Parent-child relationships between turns in a conversation. areal/experimental/inference_service/data_proxy/session.py21-30
Rewards: Rewards assigned to specific completion IDs via the /rl/set_reward endpoint. areal/experimental/inference_service/data_proxy/app.py78-85
Trajectory Export: Exporting collected interactions for training in "concat" or "individual" styles. areal/experimental/inference_service/data_proxy/app.py21-24

RTensor Storage Endpoints

The Data Proxy exposes a data_bp blueprint for managing tensors over the network:

PUT /data/{shard_id}: Store a tensor shard. areal/infra/rpc/guard/data_blueprint.py58-60
GET /data/{shard_id}: Retrieve a tensor shard. areal/infra/rpc/guard/data_blueprint.py78-80

Figure 3: Data Proxy Internal Data Flow

Sources: areal/experimental/inference_service/data_proxy/app.py166-202 areal/experimental/inference_service/inf_bridge.py162-180 areal/experimental/inference_service/data_proxy/pause.py1-10

Weight Synchronization Protocol

The inference service handles asynchronous weight updates from the trainer without dropping active requests by utilizing a stateful pause mechanism.

Trainer Update: The trainer pushes new weights and increments the model version.
Controller Broadcast: RolloutControllerV2 manages the versioning state and broadcasts updates to the worker pool. areal/experimental/inference_service/controller/controller.py140-142
Pause Signal: The controller sends a pause command to the InfBridge via the PauseState. areal/experimental/inference_service/inf_bridge.py94-100
Backend Transition: The underlying inference backend (SGLang/vLLM) is signaled to pause generation, which may abort in-flight requests. areal/experimental/inference_service/inf_bridge.py97-99
Resume & Resubmit: Once weights are updated, the controller signals a resume. InfBridge automatically resubmits aborted requests, prepending any tokens generated before the abort. areal/experimental/inference_service/inf_bridge.py101-106 areal/experimental/inference_service/inf_bridge.py162-200

Sources: areal/experimental/inference_service/controller/controller.py140-142 areal/experimental/inference_service/inf_bridge.py94-106 areal/experimental/inference_service/data_proxy/pause.py1-10

Configuration

The service is configured via InferenceEngineConfig (passed to the controller) and DataProxyConfig.

Parameter	Type	Description
`model`	`str`	Name of the model to serve. areal/experimental/inference_service/controller/controller.py91-92
`admin_api_key`	`str`	Key used for control-plane operations. areal/experimental/inference_service/controller/controller.py87-90
`backend_type`	`str`	Type of backend to bridge to (`sglang` or `vllm`). areal/experimental/inference_service/data_proxy/app.py172-177
`request_timeout`	`float`	HTTP timeout per generation call. areal/experimental/inference_service/inf_bridge.py51-52
`max_resubmit_retries`	`int`	Maximum number of abort-resubmit cycles. areal/experimental/inference_service/inf_bridge.py53-54

Sources: areal/experimental/inference_service/controller/controller.py82-94 areal/experimental/inference_service/inf_bridge.py32-59 areal/experimental/inference_service/data_proxy/app.py166-187

Refresh this wiki

URL: https://deepwiki.com/inclusionAI/AReaL/4.8-inference-service-(experimental)