
URL: https://deepwiki.com/inclusionAI/AReaL/4.3-vllm-backend

vLLM Backend | inclusionAI/AReaL | DeepWiki


Last indexed: 7 May 2026 (2e12c1)

vLLM Backend

This page documents the vLLM backend implementation for inference in AReaL. The vLLM backend provides PagedAttention-based inference with support for OpenAI-compatible endpoints, pipeline parallelism, and both disk-based and XCCL-based weight updates.

For information about the abstract inference engine interface, see InferenceEngine API. For the alternative SGLang backend implementation, see SGLang Backend.

Purpose and Architecture

The vLLM backend consists of two main components:

  • VLLMBackend: A backend protocol implementation that translates AReaL's request/response format into vLLM-specific HTTP endpoints. areal/engine/vllm_remote.py41-42
  • RemotevLLMEngine: A wrapper class implementing the InferenceEngine interface that delegates to RemoteInfEngine with a VLLMBackend instance. areal/engine/vllm_remote.py272

This composition-based design allows RemoteInfEngine to handle common infrastructure (async execution, server management, RPC) while VLLMBackend provides vLLM-specific protocol translation.

Architecture Diagram: vLLM Backend Components


Sources: areal/engine/vllm_remote.py41-270 areal/engine/vllm_ext/areal_vllm_server.py164-250 areal/engine/vllm_remote.py272-465

VLLMBackend Protocol Implementation

The VLLMBackend class implements the backend protocol contract by providing methods that build HTTP requests and parse responses for vLLM-specific endpoints. Unlike SGLang, vLLM uses OpenAI-compatible endpoints and supports pipeline parallelism in weight updates.

Class Structure

| Method | Purpose | Returns | Source |
| --- | --- | --- | --- |
| build_generation_request() | Constructs vLLM generation request | HttpRequest | areal/engine/vllm_remote.py44-96 |
| parse_generation_response() | Extracts tokens/logprobs from response | HttpGenerationResult | areal/engine/vllm_remote.py98-127 |
| build_disk_weight_update_requests() | Builds disk-based weight update | WeightUpdateRequests | areal/engine/vllm_remote.py129-148 |
| build_distributed_weight_update_requests() | Builds XCCL-based weight update | WeightUpdateRequests | areal/engine/vllm_remote.py150-191 |
| build_init_weights_group_request() | Initializes NCCL weight update group | HttpRequest | areal/engine/vllm_remote.py193-209 |
| get_pause_request() | Pauses generation | HttpRequest | areal/engine/vllm_remote.py211-213 |
| get_resume_request() | Resumes generation | HttpRequest | areal/engine/vllm_remote.py215-217 |
| get_health_check_request() | Health check endpoint | HttpRequest | areal/engine/vllm_remote.py219-221 |
| get_offload_request() | Offloads model to CPU | HttpRequest | areal/engine/vllm_remote.py223-231 |
| get_onload_request() | Onloads model from CPU | HttpRequest | areal/engine/vllm_remote.py233-248 |
| launch_server() | Launches vLLM server subprocess | subprocess.Popen | areal/engine/vllm_remote.py250-270 |

Sources: areal/engine/vllm_remote.py41-270

Generation Request and Response Format

Generation Request Construction

The build_generation_request() method constructs vLLM-compatible requests. vLLM uses a flat payload structure (not nested sampling_params like SGLang) and supports two endpoint types:

Text Completion Flow:


Vision Multi-modal Flow:


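Key differences from SGLang:

  • Endpoints: vLLM uses OpenAI-compatible /v1/* endpoints, while SGLang uses custom endpoints
  • Request structure: vLLM takes a flat payload, while SGLang nests options under sampling_params
  • Max tokens: vLLM uses max_tokens, while SGLang uses max_new_tokens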
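
The payload difference can be illustrated with a minimal sketch. The function names and the "prompt" and "input_ids" field names below are assumptions made for illustration; only the flat versus nested layout and the max_tokens / max_new_tokens naming come from this page.

```python
from typing import Any, Dict, List


def build_flat_payload(
    input_ids: List[int], max_new_tokens: int, temperature: float
) -> Dict[str, Any]:
    """vLLM-style layout: sampling options sit at the top level."""
    return {
        "prompt": input_ids,  # assumed field name, for illustration only
        "max_tokens": max_new_tokens,
        "temperature": temperature,
    }


def build_nested_payload(
    input_ids: List[int], max_new_tokens: int, temperature: float
) -> Dict[str, Any]:
    """SGLang-style layout: sampling options nested under sampling_params."""
    return {
        "input_ids": input_ids,  # assumed field name, for illustration only
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }
```

Both helpers take the same arguments, so the only variation is where the sampling options end up in the resulting dictionary.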

Sources: areal/engine/vllm_remote.py44-96

Response Parsing

The parse_generation_response() method extracts tokens and log probabilities from vLLM's response format. vLLM returns tokens in "token:123" format that must be parsed:
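
A minimal sketch of that parsing step, assuming each token arrives as a string of the form "token:<id>". The helper names are hypothetical, and the real method also extracts log probabilities, which are not shown here.

```python
from typing import List


def parse_token_id(token: str) -> int:
    """Convert a "token:123" style string into the integer token id."""
    prefix, sep, value = token.partition(":")
    if prefix != "token" or not sep:
        raise ValueError(f"unexpected token format: {token!r}")
    return int(value)


def parse_token_ids(tokens: List[str]) -> List[int]:
    """Parse a sequence of "token:<id>" strings, preserving order."""
    return [parse_token_id(t) for t in tokens]
```

Rejecting strings without the expected prefix makes a format change on the server side fail loudly rather than produce wrong token ids.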


Sources: areal/engine/vllm_remote.py98-127

Weight Update Protocols

The vLLM backend supports two weight update modes: disk-based and distributed (XCCL/NCCL). Both modes support full model updates and LoRA adapter updates.

Disk-Based Weight Updates

Full Model Update:


LoRA Adapter Update:


The versioned LoRA name is computed by get_versioned_lora_name(lora_name, version) areal/api/io_struct.py161-163

Sources: areal/engine/vllm_remote.py129-148 areal/api/io_struct.py161-163

Distributed (XCCL) Weight Updates

vLLM uses a two-step process for XCCL weight updates: first set metadata, then perform the update. This differs from SGLang's single-step process.

Full Model XCCL Update:


LoRA XCCL Update:


The LoRA metadata includes PEFT configuration from meta.peft_config areal/engine/vllm_remote.py172-176:

  • target_modules: List of module names to apply LoRA
  • r: LoRA rank
  • lora_alpha: Scaling factor
  • bias: Bias handling strategy
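
The two-step shape of a LoRA XCCL update can be sketched as below. This is an illustration only: SketchRequest, build_lora_xccl_requests, and both endpoint strings are placeholders invented here, not names from AReaL or vLLM. Only the four PEFT fields and the "set metadata, then update" ordering come from this page.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Fields that this page lists as part of the LoRA metadata.
PEFT_FIELDS = ("target_modules", "r", "lora_alpha", "bias")


@dataclass
class SketchRequest:
    """Hypothetical stand-in for an HTTP request description."""
    endpoint: str
    payload: Dict[str, Any] = field(default_factory=dict)


def build_lora_xccl_requests(
    peft_config: Dict[str, Any], lora_name: str
) -> List[SketchRequest]:
    """Return the two requests in order: set metadata, then update."""
    lora_meta = {key: peft_config[key] for key in PEFT_FIELDS}
    return [
        # Placeholder endpoint names, not the real AReaL routes.
        SketchRequest("/placeholder_set_lora_meta", {"lora_name": lora_name, **lora_meta}),
        SketchRequest("/placeholder_update_lora_xccl"),
    ]
```

Selecting the four fields explicitly keeps unrelated PEFT settings out of the metadata payload.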

Sources: areal/engine/vllm_remote.py150-191

NCCL Group Initialization

Before performing XCCL weight updates, the weight update group must be initialized. vLLM supports pipeline parallelism (PP), so the rank offset calculation differs from SGLang. For the server at index idx, the offset is:

1 + idx * tp_size * pp_size


This accounts for both tensor parallel (TP) and pipeline parallel (PP) ranks in each vLLM server instance areal/engine/vllm_remote.py205
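
As a worked example, the two offset formulas from the comparison below can be written as plain functions. The function names are hypothetical. The leading 1 suggests that rank 0 is reserved for another participant such as the training side, but that reading is an assumption rather than something stated here.

```python
def vllm_rank_offset(server_idx: int, tp_size: int, pp_size: int) -> int:
    """First rank used by a vLLM server: 1 + idx * tp_size * pp_size."""
    return 1 + server_idx * tp_size * pp_size


def sglang_rank_offset(server_idx: int, tp_size: int) -> int:
    """First rank used by an SGLang server: 1 + idx * tp_size."""
    return 1 + server_idx * tp_size


# With tp_size=2 and pp_size=2, each vLLM server occupies 4 ranks,
# so servers 0, 1, 2 start at ranks 1, 5, 9.
offsets = [vllm_rank_offset(i, tp_size=2, pp_size=2) for i in range(3)]
```

With pp_size equal to 1 the two formulas agree, which matches the SGLang requirement that PP size be 1.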

Comparison with SGLang:

| Aspect | vLLM | SGLang |
| --- | --- | --- |
| PP Support | Yes | No (PP size must be 1) |
| Rank Offset | 1 + idx * tp_size * pp_size (areal/engine/vllm_remote.py205) | 1 + idx * tp_size (areal/engine/sglang_remote.py200) |
| Endpoint | /areal_init_weights_update_group (areal/engine/vllm_remote.py201) | /init_weights_update_group (areal/engine/sglang_remote.py196) |

Sources: areal/engine/vllm_remote.py193-209 areal/engine/sglang_remote.py191-206

Server Lifecycle Management

Pause and Resume Generation

vLLM uses custom AReaL endpoints for generation control:

| Operation | Endpoint | Purpose | Source |
| --- | --- | --- | --- |
| Pause | /areal_pause_generation | Pause request processing during weight updates | areal/engine/vllm_remote.py212 |
| Resume | /areal_continue_generation | Resume request processing after weight updates | areal/engine/vllm_remote.py216 |
| Health | /health | Check server health status | areal/engine/vllm_remote.py220 |
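
One way to picture how the pause and resume endpoints bracket a weight update is a small context manager. This is a sketch under assumptions: generation_paused and the send callable are hypothetical, and the actual HTTP method and error handling in AReaL are not shown on this page.

```python
from contextlib import contextmanager
from typing import Callable, Iterator

PAUSE_ENDPOINT = "/areal_pause_generation"
RESUME_ENDPOINT = "/areal_continue_generation"


@contextmanager
def generation_paused(send: Callable[[str], None]) -> Iterator[None]:
    """Pause generation on entry and resume on exit, even if the body fails.

    `send` is a hypothetical callable that delivers a request to the given
    endpoint on the vLLM server.
    """
    send(PAUSE_ENDPOINT)
    try:
        yield
    finally:
        send(RESUME_ENDPOINT)
```

A caller would wrap the weight update in `with generation_paused(send): ...`, so the resume request is issued even when the update raises.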

Sources: areal/engine/vllm_remote.py211-221 areal/engine/vllm_ext/areal_vllm_server.py164-250

Memory Management (Offload/Onload)

vLLM provides native sleep and wake-up endpoints (/sleep and /wake_up) for CPU offloading:

Offload Flow:


Onload Flow (with optional tags):


The tags parameter allows selective onloading of components (e.g., ["weights"], ["kv_cache"]) areal/engine/vllm_remote.py237-248
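
A rough sketch of the offload and onload requests follows. The /sleep and /wake_up endpoints are the ones this page names for vLLM. The dictionary shape and the choice to carry tags in the payload are assumptions made for illustration, since the real transport (for example query parameters) is not specified here.

```python
from typing import Any, Dict, List, Optional


def build_offload_request() -> Dict[str, Any]:
    """Describe a request that offloads the model to CPU."""
    return {"endpoint": "/sleep", "payload": {}}


def build_onload_request(tags: Optional[List[str]] = None) -> Dict[str, Any]:
    """Describe a request that onloads the model, optionally only some parts.

    When `tags` is omitted, no tag filter is sent and everything is onloaded.
    """
    payload: Dict[str, Any] = {}
    if tags:
        payload["tags"] = list(tags)  # assumed placement of the tags
    return {"endpoint": "/wake_up", "payload": payload}
```

For example, `build_onload_request(["weights"])` would describe restoring the weights first, leaving the KV cache for a later call.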

Sources: areal/engine/vllm_remote.py223-248

Server Launch

The launch_server() method spawns a vLLM server subprocess with configured environment variables areal/engine/vllm_remote.py250-269:

Environment Variables:

| Variable | Purpose | Source |
| --- | --- | --- |
| TRITON_CACHE_PATH | Unique Triton cache per server (UUID-based) | areal/engine/vllm_remote.py256 |
| VLLM_CACHE_ROOT | Unique vLLM cache per server (UUID-based) | areal/engine/vllm_remote.py257 |
| VLLM_ALLOW_RUNTIME_LORA_UPDATING | Enable runtime LoRA updates (set to "True") | areal/engine/vllm_remote.py258 |

The command is built using vLLMConfig.build_cmd_from_args(), which translates the server configuration into CLI arguments areal/engine/vllm_remote.py263-264
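
The environment setup can be sketched as follows. The variable names and the "True" value come from the table above. The build_server_env helper, the cache_parent argument, and the directory layout are assumptions for illustration.

```python
import os
import uuid
from typing import Dict, Optional


def build_server_env(
    base_env: Optional[Dict[str, str]] = None, cache_parent: str = "/tmp"
) -> Dict[str, str]:
    """Return a copy of the environment with per-server cache settings."""
    env = dict(os.environ if base_env is None else base_env)
    suffix = uuid.uuid4().hex  # unique per call, so servers do not share caches
    env["TRITON_CACHE_PATH"] = os.path.join(cache_parent, "triton", suffix)
    env["VLLM_CACHE_ROOT"] = os.path.join(cache_parent, "vllm", suffix)
    env["VLLM_ALLOW_RUNTIME_LORA_UPDATING"] = "True"
    return env
```

The resulting dictionary could then be passed as the `env` argument of `subprocess.Popen` together with the generated command line.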

Sources: areal/engine/vllm_remote.py250-269

RemotevLLMEngine Public API

The RemotevLLMEngine class is a thin wrapper that implements the InferenceEngine interface by delegating all operations to an internal RemoteInfEngine instance with a VLLMBackend areal/engine/vllm_remote.py272-297:

Delegation Pattern:


All methods simply forward to self._engine which handles the actual logic. This composition-based design keeps backend-specific code isolated in VLLMBackend while reusing common infrastructure.
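
The delegation pattern can be reduced to a toy example. All three Sketch* classes are hypothetical stand-ins for VLLMBackend, RemoteInfEngine, and RemotevLLMEngine, and they model only a single health check rather than the real interface.

```python
class SketchBackend:
    """Stand-in for the backend: knows backend-specific protocol details."""

    def health_endpoint(self) -> str:
        return "/health"


class SketchRemoteEngine:
    """Stand-in for the shared engine: owns the common logic."""

    def __init__(self, backend: SketchBackend) -> None:
        self.backend = backend

    def health_check(self) -> str:
        # The shared engine asks the backend which endpoint to use.
        return self.backend.health_endpoint()


class SketchVLLMEngine:
    """Stand-in for the thin wrapper: forwards every call to self._engine."""

    def __init__(self) -> None:
        self._engine = SketchRemoteEngine(SketchBackend())

    def health_check(self) -> str:
        return self._engine.health_check()
```

Swapping in a different backend class would change the protocol details without touching the wrapper or the shared engine.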

Sources: areal/engine/vllm_remote.py272-465

Integration with RolloutController

The RemotevLLMEngine can be promoted to a RolloutController for distributed rollout management:


This factory method creates a controller that manages multiple vLLM worker processes across a distributed cluster areal/engine/vllm_remote.py447-451

Sources: areal/engine/vllm_remote.py447-451

Key Differences from SGLang Backend

| Feature | vLLM Backend | SGLang Backend |
| --- | --- | --- |
| Endpoints | OpenAI-compatible /v1/* | Custom SGLang endpoints |
| Request Structure | Flat payload | Nested sampling_params |
| Max Tokens | max_tokens | max_new_tokens |
| Pipeline Parallel | Supported in weight updates | Not supported (PP must be 1) |
| Weight Update Steps | Two-step (set metadata, then update) | Single-step |
| LoRA XCCL | Supported via special endpoints | Not supported (disk only) |
| Offload Endpoints | /sleep, /wake_up | /release_memory_occupation, /resume_memory_occupation |
| Token Format | "token:123" string parsing | Direct token/logprob arrays |

Sources: areal/engine/vllm_remote.py1-465 areal/engine/sglang_remote.py1-439