Last indexed: 7 May 2026 (2e12c1)

vLLM Backend

This page documents the vLLM backend implementation for inference in AReaL. The vLLM backend provides PagedAttention-based inference with support for OpenAI-compatible endpoints, pipeline parallelism, and both disk-based and XCCL-based weight updates.

For information about the abstract inference engine interface, see InferenceEngine API. For the alternative SGLang backend implementation, see SGLang Backend.

Purpose and Architecture

The vLLM backend consists of two main components:

VLLMBackend: A backend protocol implementation that translates AReaL's request/response format into vLLM-specific HTTP endpoints. areal/engine/vllm_remote.py41-42
RemotevLLMEngine: A wrapper class implementing the InferenceEngine interface that delegates to RemoteInfEngine with a VLLMBackend instance. areal/engine/vllm_remote.py272

This composition-based design allows RemoteInfEngine to handle common infrastructure (async execution, server management, RPC) while VLLMBackend provides vLLM-specific protocol translation.

Architecture Diagram: vLLM Backend Components

Sources: areal/engine/vllm_remote.py41-270 areal/engine/vllm_ext/areal_vllm_server.py164-250 areal/engine/vllm_remote.py272-465

VLLMBackend Protocol Implementation

The VLLMBackend class implements the backend protocol contract by providing methods that build HTTP requests and parse responses for vLLM-specific endpoints. Unlike SGLang, vLLM uses OpenAI-compatible endpoints and supports pipeline parallelism in weight updates.

Class Structure

Method	Purpose	Returns	Source
`build_generation_request()`	Constructs vLLM generation request	`HttpRequest`	areal/engine/vllm_remote.py44-96
`parse_generation_response()`	Extracts tokens/logprobs from response	`HttpGenerationResult`	areal/engine/vllm_remote.py98-127
`build_disk_weight_update_requests()`	Builds disk-based weight update	`WeightUpdateRequests`	areal/engine/vllm_remote.py129-148
`build_distributed_weight_update_requests()`	Builds XCCL-based weight update	`WeightUpdateRequests`	areal/engine/vllm_remote.py150-191
`build_init_weights_group_request()`	Initializes NCCL weight update group	`HttpRequest`	areal/engine/vllm_remote.py193-209
`get_pause_request()`	Pauses generation	`HttpRequest`	areal/engine/vllm_remote.py211-213
`get_resume_request()`	Resumes generation	`HttpRequest`	areal/engine/vllm_remote.py215-217
`get_health_check_request()`	Health check endpoint	`HttpRequest`	areal/engine/vllm_remote.py219-221
`get_offload_request()`	Offloads model to CPU	`HttpRequest`	areal/engine/vllm_remote.py223-231
`get_onload_request()`	Onloads model from CPU	`HttpRequest`	areal/engine/vllm_remote.py233-248
`launch_server()`	Launches vLLM server subprocess	`subprocess.Popen`	areal/engine/vllm_remote.py250-270

Sources: areal/engine/vllm_remote.py41-270

Generation Request and Response Format

Generation Request Construction

The build_generation_request() method constructs vLLM-compatible requests. vLLM uses a flat payload structure (not nested sampling_params like SGLang) and supports two endpoint types:

Text Completion Flow:

Vision Multi-modal Flow:

Key differences from SGLang:

Uses /v1/completions and /v1/chat/completions (OpenAI-compatible) areal/engine/vllm_remote.py93-96
Flat payload structure instead of nested sampling_params areal/engine/vllm_remote.py52-64
Uses max_tokens instead of max_new_tokens areal/engine/vllm_remote.py55
Sets return_tokens_as_token_ids: True to get token IDs areal/engine/vllm_remote.py60
Sets logprobs: 0 or logprobs: True depending on endpoint areal/engine/vllm_remote.py61-92
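
The differences listed above can be illustrated with a minimal sketch. It is not the code in vllm_remote.py: the `HttpRequest` dataclass here is a hypothetical stand-in for AReaL's type, the function signature is invented for illustration, and the mapping of text prompts to `/v1/completions` (with `logprobs: 0`) and multi-modal prompts to `/v1/chat/completions` (with `logprobs: True`) is an assumption based on how the two OpenAI-style endpoints type that field.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class HttpRequest:
    """Hypothetical stand-in for AReaL's HttpRequest (endpoint plus JSON payload)."""
    endpoint: str
    payload: dict[str, Any] = field(default_factory=dict)


def build_generation_request(
    input_ids: list[int],
    max_new_tokens: int,
    temperature: float = 1.0,
    top_p: float = 1.0,
    messages: Optional[list[dict[str, Any]]] = None,
) -> HttpRequest:
    """Build a flat, OpenAI-style payload with no nested sampling_params."""
    payload: dict[str, Any] = {
        "max_tokens": max_new_tokens,  # vLLM's name for SGLang's max_new_tokens
        "temperature": temperature,
        "top_p": top_p,
        "return_tokens_as_token_ids": True,  # ask for token IDs in the response
        "logprobs": 0,  # completions endpoint: an integer
    }
    if messages is not None:
        # Assumption: multi-modal prompts are sent as chat messages.
        payload["messages"] = messages
        payload["logprobs"] = True  # chat endpoint: a boolean
        return HttpRequest("/v1/chat/completions", payload)
    # Assumption: text prompts are sent as a list of token IDs.
    payload["prompt"] = list(input_ids)
    return HttpRequest("/v1/completions", payload)
```

Every sampling field sits at the top level of the payload, which is the main structural contrast with SGLang's nested `sampling_params`.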

Sources: areal/engine/vllm_remote.py44-96

Response Parsing

The parse_generation_response() method extracts tokens and log probabilities from vLLM's response format. Because requests set return_tokens_as_token_ids, vLLM returns each token as a string of the form "token_id:123", and the integer ID must be parsed out of it.
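
A minimal sketch of this parsing step follows. It assumes the response choice carries `logprobs.tokens`, `logprobs.token_logprobs`, and `finish_reason` as in the OpenAI-style completions format; the helper names are hypothetical and the real parse_generation_response() may handle more cases. Splitting on the last colon keeps the parser independent of the exact prefix before the ID.

```python
from typing import Any


def parse_token_id(token: str) -> int:
    """Extract the integer ID from a '<prefix>:<id>' token string."""
    return int(token.rsplit(":", 1)[1])


def parse_completion_choice(
    choice: dict[str, Any],
) -> tuple[list[int], list[float], str]:
    """Return (token_ids, token_logprobs, finish_reason) for one choice."""
    logprobs = choice["logprobs"]
    token_ids = [parse_token_id(t) for t in logprobs["tokens"]]
    token_logprobs = [float(lp) for lp in logprobs["token_logprobs"]]
    return token_ids, token_logprobs, choice["finish_reason"]
```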

Sources: areal/engine/vllm_remote.py98-127

Weight Update Protocols

The vLLM backend supports two weight update modes: disk-based and distributed (XCCL/NCCL). Both modes support full model updates and LoRA adapter updates.

Disk-Based Weight Updates

Full Model Update:

LoRA Adapter Update:

The versioned LoRA name is computed by get_versioned_lora_name(lora_name, version) areal/api/io_struct.py161-163

Sources: areal/engine/vllm_remote.py129-148 areal/api/io_struct.py161-163

Distributed (XCCL) Weight Updates

vLLM uses a two-step process for XCCL weight updates: first set metadata, then perform the update. This differs from SGLang's single-step process.

Full Model XCCL Update:

LoRA XCCL Update:

The LoRA metadata includes PEFT configuration from meta.peft_config areal/engine/vllm_remote.py172-176:

target_modules: List of module names to apply LoRA
r: LoRA rank
lora_alpha: Scaling factor
bias: Bias handling strategy
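
The two-step sequence and the LoRA metadata fields above can be sketched as follows. This is an illustration only: the endpoint paths are placeholders (the real paths are defined in vllm_remote.py and are not reproduced here), `HttpRequest` is a hypothetical stand-in, and `peft_config` is assumed to behave like a plain mapping.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping


@dataclass
class HttpRequest:
    """Hypothetical stand-in for AReaL's HttpRequest."""
    endpoint: str
    payload: dict[str, Any] = field(default_factory=dict)


# Placeholder paths, NOT the real AReaL endpoints.
SET_META_ENDPOINT = "/example_set_weight_update_meta"
UPDATE_ENDPOINT = "/example_update_weights"

LORA_META_KEYS = ("target_modules", "r", "lora_alpha", "bias")


def lora_meta_from_peft(peft_config: Mapping[str, Any]) -> dict[str, Any]:
    """Copy only the PEFT fields that the LoRA metadata carries."""
    return {key: peft_config[key] for key in LORA_META_KEYS}


def build_two_step_update(meta: dict[str, Any]) -> list[HttpRequest]:
    """Step 1 sends the metadata; step 2 triggers the actual transfer."""
    return [
        HttpRequest(SET_META_ENDPOINT, dict(meta)),
        HttpRequest(UPDATE_ENDPOINT, {}),
    ]
```

The requests are returned in order because the second step only makes sense once the server has received the metadata from the first.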

Sources: areal/engine/vllm_remote.py150-191

NCCL Group Initialization

Before performing XCCL weight updates, the weight update group must be initialized. vLLM supports pipeline parallelism (PP), so the rank offset calculation differs from SGLang: for the server at index idx, the offset is 1 + idx * tp_size * pp_size.

This accounts for both tensor parallel (TP) and pipeline parallel (PP) ranks in each vLLM server instance areal/engine/vllm_remote.py205
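
A short worked sketch of the rank offset calculation. The function name is hypothetical, and the reading that the leading `1` reserves rank 0 for a process outside the inference servers (for example the trainer) is an assumption; only the formula itself comes from this page.

```python
def rank_offset(server_idx: int, tp_size: int, pp_size: int = 1) -> int:
    """First rank in the weight update group used by server `server_idx`.

    Each server occupies tp_size * pp_size consecutive ranks, and the
    leading 1 skips rank 0 (assumed to belong to the sender).
    """
    return 1 + server_idx * tp_size * pp_size


# With tp_size=2 and pp_size=2, each server spans 4 ranks.
offsets = [rank_offset(idx, tp_size=2, pp_size=2) for idx in range(3)]
```

With `pp_size=1` the formula reduces to `1 + idx * tp_size`, which matches the SGLang expression in the comparison table.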

Comparison with SGLang:

Aspect	vLLM	SGLang
PP Support	Yes	No (PP size must be 1)
Rank Offset	`1 + idx * tp_size * pp_size` areal/engine/vllm_remote.py205	`1 + idx * tp_size` areal/engine/sglang_remote.py200
Endpoint	`/areal_init_weights_update_group` areal/engine/vllm_remote.py201	`/init_weights_update_group` areal/engine/sglang_remote.py196

Sources: areal/engine/vllm_remote.py193-209 areal/engine/sglang_remote.py191-206

Server Lifecycle Management

Pause and Resume Generation

vLLM uses custom AReaL endpoints for generation control:

Operation	Endpoint	Purpose
Pause	`/areal_pause_generation` areal/engine/vllm_remote.py212	Pause request processing during weight updates
Resume	`/areal_continue_generation` areal/engine/vllm_remote.py216	Resume request processing after weight updates
Health	`/health` areal/engine/vllm_remote.py220	Check server health status
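
One way to use the pause and resume endpoints together is sketched below, as a context manager that guarantees generation is resumed even if the weight update fails. This helper is hypothetical and not part of AReaL; `post` is an injected callable standing in for whatever HTTP client sends the request.

```python
from contextlib import contextmanager
from typing import Callable, Iterator

PAUSE_ENDPOINT = "/areal_pause_generation"
RESUME_ENDPOINT = "/areal_continue_generation"


@contextmanager
def generation_paused(post: Callable[[str], object]) -> Iterator[None]:
    """Pause generation on entry and always resume it on exit."""
    post(PAUSE_ENDPOINT)
    try:
        yield
    finally:
        # Runs on both success and failure of the wrapped block.
        post(RESUME_ENDPOINT)
```

A caller would wrap the weight update in `with generation_paused(post): ...` so that no new requests are processed while weights are being replaced.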

Sources: areal/engine/vllm_remote.py211-221 areal/engine/vllm_ext/areal_vllm_server.py164-250

Memory Management (Offload/Onload)

vLLM provides native sleep/wake-up endpoints for CPU offloading:

Offload Flow:

Onload Flow (with optional tags):

The tags parameter allows selective onloading of components (e.g., ["weights"], ["kv_cache"]) areal/engine/vllm_remote.py237-248
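
The optional tags can be sketched as a small path builder. Encoding the tags as repeated query parameters on `/wake_up` is an assumption made for this illustration; the actual request built by get_onload_request() may carry them differently. The function names are hypothetical.

```python
from typing import Optional, Sequence
from urllib.parse import urlencode


def offload_path() -> str:
    """Path for offloading the model (vLLM's sleep endpoint)."""
    return "/sleep"


def onload_path(tags: Optional[Sequence[str]] = None) -> str:
    """Path for onloading; with tags, only those components are restored."""
    if not tags:
        return "/wake_up"
    # Assumption: one 'tags' query parameter per requested component.
    query = urlencode([("tags", tag) for tag in tags])
    return f"/wake_up?{query}"
```

Selective onloading makes it possible to restore weights first and the KV cache later, for example after a weight update has completed.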

Sources: areal/engine/vllm_remote.py223-248

Server Launch

The launch_server() method spawns a vLLM server subprocess with configured environment variables areal/engine/vllm_remote.py250-269:

Environment Variables:

Variable	Purpose
`TRITON_CACHE_PATH`	Unique Triton cache per server (UUID-based) areal/engine/vllm_remote.py256
`VLLM_CACHE_ROOT`	Unique vLLM cache per server (UUID-based) areal/engine/vllm_remote.py257
`VLLM_ALLOW_RUNTIME_LORA_UPDATING`	Enable runtime LoRA updates (set to `"True"`) areal/engine/vllm_remote.py258

The command is built using vLLMConfig.build_cmd_from_args() which translates server configuration into CLI arguments areal/engine/vllm_remote.py263-264
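
The environment setup can be sketched as follows. Only the three variable names and the `"True"` value come from this page; the directory layout, the use of the system temp directory, and the function name are assumptions for illustration.

```python
import os
import tempfile
import uuid
from typing import Mapping, Optional


def build_server_env(base_env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Copy the environment and add per-server cache paths and the LoRA flag."""
    env = dict(os.environ if base_env is None else base_env)
    cache_root = tempfile.gettempdir()  # assumed location
    # A fresh UUID per cache keeps concurrent servers from sharing cache files.
    env["TRITON_CACHE_PATH"] = os.path.join(cache_root, f"triton-{uuid.uuid4().hex}")
    env["VLLM_CACHE_ROOT"] = os.path.join(cache_root, f"vllm-{uuid.uuid4().hex}")
    env["VLLM_ALLOW_RUNTIME_LORA_UPDATING"] = "True"
    return env
```

The resulting mapping would then be passed as `env=` to `subprocess.Popen` together with the command produced by vLLMConfig.build_cmd_from_args().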

Sources: areal/engine/vllm_remote.py250-269

RemotevLLMEngine Public API

The RemotevLLMEngine class is a thin wrapper that implements the InferenceEngine interface by delegating all operations to an internal RemoteInfEngine instance with a VLLMBackend areal/engine/vllm_remote.py272-297:

Delegation Pattern:

All methods simply forward to self._engine which handles the actual logic. This composition-based design keeps backend-specific code isolated in VLLMBackend while reusing common infrastructure.
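
The delegation described above can be sketched with three small classes. The class names carry a `Sketch` suffix because they are simplified stand-ins, not AReaL's classes; the constructor arguments and the `send` callable are assumptions made to keep the example self-contained.

```python
from typing import Callable, Protocol


class BackendProtocol(Protocol):
    """The slice of the backend contract used in this sketch."""
    def get_health_check_request(self) -> str: ...


class VLLMBackendSketch:
    """Backend-specific knowledge: which endpoint to call."""
    def get_health_check_request(self) -> str:
        return "/health"


class RemoteInfEngineSketch:
    """Backend-agnostic infrastructure: how requests are sent."""
    def __init__(self, backend: BackendProtocol, send: Callable[[str], str]):
        self.backend = backend
        self.send = send

    def health_check(self) -> str:
        return self.send(self.backend.get_health_check_request())


class RemotevLLMEngineSketch:
    """Thin wrapper: every public method forwards to self._engine."""
    def __init__(self, send: Callable[[str], str]):
        self._engine = RemoteInfEngineSketch(VLLMBackendSketch(), send)

    def health_check(self) -> str:
        return self._engine.health_check()
```

Swapping `VLLMBackendSketch` for another backend object would change the endpoints without touching the shared engine, which is the point of the composition.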

Sources: areal/engine/vllm_remote.py272-465

Integration with RolloutController

The RemotevLLMEngine can be promoted to a RolloutController for distributed rollout management. A factory method creates a controller that manages multiple vLLM worker processes across a distributed cluster areal/engine/vllm_remote.py447-451

Sources: areal/engine/vllm_remote.py447-451

Key Differences from SGLang Backend

Feature	vLLM Backend	SGLang Backend
Endpoints	OpenAI-compatible `/v1/*`	Custom SGLang endpoints
Request Structure	Flat payload	Nested `sampling_params`
Max Tokens	`max_tokens`	`max_new_tokens`
Pipeline Parallel	Supported in weight updates	Not supported (PP must be 1)
Weight Update Steps	Two-step (set metadata, then update)	Single-step
LoRA XCCL	Supported via special endpoints	Not supported (disk only)
Offload Endpoints	`/sleep`, `/wake_up`	`/release_memory_occupation`, `/resume_memory_occupation`
Token Format	`"token_id:123"` string parsing	Direct token/logprob arrays

Sources: areal/engine/vllm_remote.py1-465 areal/engine/sglang_remote.py1-439


URL: https://deepwiki.com/inclusionAI/AReaL/4.3-vllm-backend
