Last indexed: 7 May 2026 (2e12c1)

SGLang Backend

This page documents the SGLang backend implementation for remote inference in AReaL. The SGLang backend provides high-performance inference capabilities with support for structured generation, weight updates, and LoRA adapters.

For information about the abstract inference engine interface, see areal/api/engine_api.py32-233. For the alternative vLLM backend, see areal/engine/vllm_remote.py41-127.

Purpose and Scope

The SGLang backend implements the InferenceEngine API using SGLang's HTTP server as the underlying inference provider. This page covers:

The SGLangBackend protocol implementation for request/response translation areal/engine/sglang_remote.py40-244
The RemoteSGLangEngine class and its integration with RemoteInfEngine areal/engine/sglang_remote.py247-439
SGLang-specific HTTP endpoint mappings and payload formats.
Weight update mechanisms (disk-based and NCCL-based).
LoRA adapter support with versioned naming areal/api/io_struct.py161-163
Server lifecycle management (launch, pause, resume, offload/onload) and the SGLangServerWrapper utility areal/infra/launcher/sglang_server.py89-201
Expert routing information extraction for MoE models areal/engine/sglang_remote.py101-109
Experimental AWEX (Asynchronous Weight EXchange) integration for SGLang areal/experimental/inference_service/sglang/launch_server.py17-81

Sources: areal/engine/sglang_remote.py40-439 areal/api/io_struct.py161-163 areal/infra/launcher/sglang_server.py89-201 areal/experimental/inference_service/sglang/launch_server.py17-81

Architecture Overview

The SGLang backend follows a composition pattern where RemoteSGLangEngine delegates to RemoteInfEngine with an SGLangBackend implementation that satisfies the RemoteInfBackendProtocol.

Title: SGLang Backend Architecture

Key Classes and Composition

Class	Role	Location
`RemoteSGLangEngine`	Public API implementing `InferenceEngine`	areal/engine/sglang_remote.py247-439
`RemoteInfEngine`	Generic remote engine coordinator	areal/infra/remote_inf_engine.py15-18
`SGLangBackend`	SGLang-specific protocol adapter	areal/engine/sglang_remote.py40-42
`HttpRequest`	Generic HTTP request structure	areal/api/io_struct.py29
`SGLangServerWrapper`	Launcher for SGLang server processes	areal/infra/launcher/sglang_server.py89-90

Sources: areal/engine/sglang_remote.py247-262 areal/engine/sglang_remote.py40-42 areal/api/io_struct.py29 areal/infra/launcher/sglang_server.py89-90

Request Format and Generation

The SGLangBackend translates ModelRequest objects into SGLang-specific HTTP requests via the build_generation_request() method areal/engine/sglang_remote.py43-89

Title: Request Translation Flow

Sampling Parameters Mapping

The gconfig (GenerationHyperparameters) is mapped to SGLang's sampling_params areal/engine/sglang_remote.py56-65:

AReaL Parameter	SGLang Parameter	Notes
`top_p`	`top_p`	Direct mapping
`top_k`	`top_k`	Direct mapping
`max_new_tokens`	`max_new_tokens`	Direct mapping
`temperature`	`temperature`	Set to 0.0 if `greedy=True` areal/engine/sglang_remote.py60
`stop_token_ids`	`stop_token_ids`	Direct mapping
`ignore_eos`	`ignore_eos`	Direct mapping
`skip_special_tokens`	`skip_special_tokens`	Direct mapping
`frequency_penalty`	`frequency_penalty`	Direct mapping
`stop`	`stop`	String stop sequences areal/engine/sglang_remote.py67
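
The mapping in the table can be sketched as a small function. This is a minimal illustration, not the code in sglang_remote.py: the `GenerationHyperparameters` dataclass below is a hypothetical stand-in that carries only the fields listed above, with default values chosen for the example, and attaching `stop` only when it is set is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GenerationHyperparameters:
    """Hypothetical stand-in holding only the fields from the table above."""

    max_new_tokens: int = 16
    temperature: float = 1.0
    top_p: float = 1.0
    top_k: int = -1
    greedy: bool = False
    stop_token_ids: list = field(default_factory=list)
    stop: Optional[list] = None
    ignore_eos: bool = False
    skip_special_tokens: bool = True
    frequency_penalty: float = 0.0


def to_sampling_params(gconfig: GenerationHyperparameters) -> dict:
    """Translate the AReaL-side config into an SGLang-style sampling_params dict."""
    params = {
        "top_p": gconfig.top_p,
        "top_k": gconfig.top_k,
        "max_new_tokens": gconfig.max_new_tokens,
        # Greedy decoding is expressed as temperature 0.0.
        "temperature": 0.0 if gconfig.greedy else gconfig.temperature,
        "stop_token_ids": gconfig.stop_token_ids,
        "ignore_eos": gconfig.ignore_eos,
        "skip_special_tokens": gconfig.skip_special_tokens,
        "frequency_penalty": gconfig.frequency_penalty,
    }
    # Assumption: string stop sequences are attached only when provided.
    if gconfig.stop:
        params["stop"] = gconfig.stop
    return params
```

Note that `greedy=True` overrides any configured temperature, so a caller cannot combine greedy decoding with a non-zero temperature.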

Special Features

Vision Support: image_data is passed directly to SGLang payload areal/engine/sglang_remote.py71
LoRA: When with_lora=True, adds lora_path with versioned name using get_versioned_lora_name areal/engine/sglang_remote.py81-87
Expert Routing: When req.metadata.get("return_routed_experts", False) is True, adds return_routed_experts flag to the payload areal/engine/sglang_remote.py78-79

Limitations

Beam search is not supported in the SGLang backend and will raise NotImplementedError areal/engine/sglang_remote.py51-54
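
The optional payload fields and the beam search guard described above could be combined roughly as follows. The function name, its keyword arguments (including `use_beam_search`), and the inline `name-vN` formatting are hypothetical and chosen for illustration; only the payload keys `image_data`, `lora_path`, and `return_routed_experts` come from the text above.

```python
from typing import Any, Optional


def build_generation_payload(
    input_ids: list,
    sampling_params: dict,
    *,
    use_beam_search: bool = False,  # hypothetical flag name
    image_data: Optional[list] = None,
    metadata: Optional[dict] = None,
    with_lora: bool = False,
    lora_name: str = "",
    version: int = 0,
) -> dict:
    """Assemble a generation payload, adding optional fields only when requested."""
    if use_beam_search:
        # The backend rejects beam search up front.
        raise NotImplementedError("beam search is not supported by the SGLang backend")

    payload: dict[str, Any] = {
        "input_ids": list(input_ids),
        "sampling_params": sampling_params,
    }
    if image_data:
        # Vision inputs are forwarded unchanged.
        payload["image_data"] = image_data
    if (metadata or {}).get("return_routed_experts", False):
        payload["return_routed_experts"] = True
    if with_lora:
        # Stand-in for get_versioned_lora_name(lora_name, version).
        payload["lora_path"] = f"{lora_name}-v{version}"
    return payload
```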

Sources: areal/engine/sglang_remote.py43-89 areal/api/io_struct.py28-60

Response Parsing

The parse_generation_response() method extracts tokens and log probabilities from SGLang's response format areal/engine/sglang_remote.py91-127:

Title: Response Parsing Logic

SGLang Response Structure

SGLang returns token-level information in meta_info["output_token_logprobs"], where each entry is a list containing [logprob, token_id] areal/engine/sglang_remote.py119-120
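
As a minimal sketch of that layout, the helper below splits the entries into parallel token and log-probability lists. The function name is hypothetical, and the sample response is made up for the example.

```python
def split_token_logprobs(meta_info: dict) -> tuple:
    """Split [logprob, token_id] entries into (token_ids, logprobs)."""
    token_ids = []
    logprobs = []
    for entry in meta_info["output_token_logprobs"]:
        # Index rather than unpack so that extra trailing fields are tolerated.
        logprobs.append(float(entry[0]))
        token_ids.append(int(entry[1]))
    return token_ids, logprobs


# Made-up response fragment for illustration.
sample_meta_info = {"output_token_logprobs": [[-0.25, 101], [-1.5, 2057]]}
tokens, lps = split_token_logprobs(sample_meta_info)
```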

Expert Routing Extraction

For MoE models with return_routed_experts=True, the response includes base64-encoded expert routing data areal/engine/sglang_remote.py101-109:

Decode base64 string to bytes via pybase64.b64decode areal/engine/sglang_remote.py108
Convert to np.int32 array using np.frombuffer areal/engine/sglang_remote.py107
Reshape to (num_sgl_token, -1), where the second dimension represents num_layers * expert_top_k areal/engine/sglang_remote.py109
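
The three steps above can be sketched as follows. This sketch swaps in the standard library `base64` module for `pybase64` so that it runs without extra dependencies beyond NumPy; the function name and the round-trip data are hypothetical.

```python
import base64

import numpy as np


def decode_routed_experts(encoded: str, num_sgl_token: int) -> np.ndarray:
    """Decode base64 routing data into a (num_sgl_token, num_layers * expert_top_k) array."""
    raw = base64.b64decode(encoded)            # step 1: base64 string -> bytes
    flat = np.frombuffer(raw, dtype=np.int32)  # step 2: bytes -> flat int32 array
    return flat.reshape(num_sgl_token, -1)     # step 3: one row per token


# Round trip with made-up data: 3 tokens, 2 layers * top-4 experts = 8 columns.
original = np.arange(24, dtype=np.int32).reshape(3, 8)
encoded = base64.b64encode(original.tobytes()).decode("ascii")
decoded = decode_routed_experts(encoded, num_sgl_token=3)
```

`np.frombuffer` returns a read-only view over the decoded bytes, so callers that need to modify the routing data should copy it first.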

Abort Handling

If stop_reason == "abort" and the abort message starts with "Abort before prefill", the parser returns an empty token list and preserves the abort reason areal/engine/sglang_remote.py111-117
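
A minimal sketch of this check is shown below. It assumes the stop reason and message live in a `finish_reason` dict with `type` and `message` keys, and the returned dict shape is hypothetical; neither detail is confirmed by this page.

```python
from typing import Optional

ABORT_BEFORE_PREFILL = "Abort before prefill"


def parse_abort(meta_info: dict) -> Optional[dict]:
    """Return an empty result for pre-prefill aborts, or None for any other response."""
    # Assumption: the stop reason and message are nested under "finish_reason".
    finish_reason = meta_info.get("finish_reason") or {}
    stop_reason = finish_reason.get("type")
    message = finish_reason.get("message", "")
    if stop_reason == "abort" and message.startswith(ABORT_BEFORE_PREFILL):
        # Nothing was generated, so tokens and logprobs are empty,
        # but the abort reason is kept for the caller.
        return {"output_tokens": [], "output_logprobs": [], "stop_reason": stop_reason}
    return None
```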

Sources: areal/engine/sglang_remote.py91-127

Weight Update Protocols

The SGLang backend supports multiple weight update mechanisms: disk-based, NCCL-based distributed updates, and the experimental AWEX protocol.

Disk-Based Weight Updates

Disk-based updates are handled by build_disk_weight_update_requests() areal/engine/sglang_remote.py129-160

The method builds one of two requests, depending on whether the update targets a LoRA adapter:

LoRA Update: sent to /load_lora_adapter
Full Model Update: sent to /update_weights_from_disk
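
The two requests can be sketched as follows. The payload bodies are not shown on this page, so the field names used here (`lora_name`, `lora_path`, `model_path`) are assumptions based on the parameters SGLang's HTTP endpoints accept, and `HttpRequest` is a simplified hypothetical stand-in for the structure in io_struct.py.

```python
from dataclasses import dataclass, field


@dataclass
class HttpRequest:
    """Simplified stand-in for AReaL's HttpRequest (fields are assumptions)."""

    endpoint: str
    payload: dict = field(default_factory=dict)
    method: str = "POST"


def build_disk_update_request(
    path: str, *, use_lora: bool = False, lora_name: str = "", version: int = 0
) -> HttpRequest:
    """Choose the endpoint and payload for a disk-based weight update."""
    if use_lora:
        # Adapters are registered under a versioned name such as "actor-v1".
        return HttpRequest(
            endpoint="/load_lora_adapter",
            payload={"lora_name": f"{lora_name}-v{version}", "lora_path": path},
        )
    return HttpRequest(
        endpoint="/update_weights_from_disk",
        payload={"model_path": path},
    )
```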

Sources: areal/engine/sglang_remote.py129-160 areal/api/io_struct.py161-163

NCCL-Based Distributed Updates

For NCCL/XCCL weight updates, SGLang requires a two-step process: initialization of the weight update group via build_init_weights_group_request(), then the actual weight update via build_distributed_weight_update_requests() areal/engine/sglang_remote.py161-203

Init Weights Group Payload: The rank_offset field gives each server a distinct starting rank so that ranks do not collide across multiple servers and TP groups. It is computed from the world sizes and indices in meta.gen_allocation areal/engine/sglang_remote.py194-203

LoRA Limitation: SGLang's distributed weight update does not support LoRA. Attempting to use NCCL with LoRA raises a ValueError areal/engine/sglang_remote.py169-173

Pipeline Parallelism Limitation: SGLang distributed weight update currently requires pp_size == 1. If meta.gen_allocation.parallel.pp_size != 1, it raises NotImplementedError areal/engine/sglang_remote.py190-191
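
The guards and the rank assignment can be illustrated with the sketch below. The exact formula is not given on this page: the layout here assumes the trainer occupies rank 0 and each server owns a contiguous block of `tp_size` ranks, which is one plausible scheme rather than the confirmed implementation. The function name, its arguments, and the payload keys are also assumptions.

```python
def build_init_group_payload(
    *,
    use_lora: bool,
    pp_size: int,
    tp_size: int,
    n_servers: int,
    server_idx: int,
    master_address: str,
    master_port: int,
    group_name: str,
    backend: str = "nccl",
) -> dict:
    """Validate the configuration and build an init-group payload for one server."""
    if use_lora:
        raise ValueError("distributed weight update does not support LoRA")
    if pp_size != 1:
        raise NotImplementedError("distributed weight update requires pp_size == 1")

    return {
        "master_address": master_address,
        "master_port": str(master_port),
        # Assumption: rank 0 is the trainer; server i owns the next tp_size ranks.
        "rank_offset": 1 + server_idx * tp_size,
        # Assumption: all inference ranks plus the single trainer rank.
        "world_size": n_servers * tp_size + 1,
        "backend": backend,
        "group_name": group_name,
    }
```

Under these assumptions, three servers with `tp_size=4` receive offsets 1, 5, and 9, and the group has 13 ranks in total.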

Sources: areal/engine/sglang_remote.py161-203

AWEX (Experimental)

AWEX is an experimental weight update protocol that bridges AReaL's training engines with SGLang's internal scheduler via ZMQ RPC areal/experimental/inference_service/sglang/scheduler.py20-68

AwexSchedulerBridge: Attaches awex_* methods to the SGLang Scheduler instance areal/experimental/inference_service/sglang/scheduler.py51-68
RpcProxy: A ZMQ proxy bridging the HTTP process to scheduler subprocesses for collective RPC areal/experimental/inference_service/sglang/rpc_proxy.py13-35
awex_execute_weight_update: Triggers the actual weight synchronization within the SGLang process areal/experimental/inference_service/sglang/scheduler.py104-105
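
To illustrate what attaching awex_* methods to an existing scheduler instance can look like in Python, here is a minimal sketch using `types.MethodType`. It does not reproduce the AwexSchedulerBridge or the ZMQ transport: the `Scheduler` class, the `attach_awex_methods` helper, and the handler body are all hypothetical.

```python
import types


class Scheduler:
    """Hypothetical stand-in for the SGLang Scheduler."""

    def __init__(self) -> None:
        self.weights_version = 0


def awex_execute_weight_update(self, version: int) -> int:
    """Hypothetical handler: record the version that was synchronized."""
    self.weights_version = version
    return self.weights_version


def attach_awex_methods(scheduler: object, handlers: dict) -> None:
    """Bind each handler to the scheduler instance under its awex_* name."""
    for name, fn in handlers.items():
        if not name.startswith("awex_"):
            raise ValueError(f"unexpected handler name: {name}")
        # MethodType binds fn so that `self` refers to this scheduler instance.
        setattr(scheduler, name, types.MethodType(fn, scheduler))


scheduler = Scheduler()
attach_awex_methods(scheduler, {"awex_execute_weight_update": awex_execute_weight_update})
new_version = scheduler.awex_execute_weight_update(7)
```

Binding on the instance rather than patching the class leaves other scheduler instances untouched, which is one reason a bridge might take this approach.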

Sources: areal/experimental/inference_service/sglang/scheduler.py20-105 areal/experimental/inference_service/sglang/rpc_proxy.py13-35

LoRA Support

The SGLang backend implements versioned LoRA adapter management to support asynchronous weight updates.

Versioned Name Format: The function get_versioned_lora_name(lora_name, version) creates adapter names like actor-v0, actor-v1, etc., allowing the server to distinguish between different training iterations areal/api/io_struct.py161-163

LoRA in Generation Requests: When with_lora=True in build_generation_request(), the backend sets lora_path to the versioned name returned by get_versioned_lora_name, so each request is served by the intended adapter version areal/engine/sglang_remote.py81-87
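
Based on the example names above (actor-v0, actor-v1), the naming scheme can be sketched as follows. This is an illustrative reimplementation, not a copy of the function in io_struct.py.

```python
def get_versioned_lora_name(lora_name: str, version: int) -> str:
    """Build an adapter name that encodes the training version, e.g. "actor-v1"."""
    return f"{lora_name}-v{version}"


# Each training iteration yields a distinct adapter name on the server.
names = [get_versioned_lora_name("actor", v) for v in range(3)]
```

Because every version gets its own name, a request that was issued against an older adapter is not silently redirected to newer weights when an update arrives.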

Sources: areal/engine/sglang_remote.py81-87 areal/api/io_struct.py161-163

Server Lifecycle Management

Server Launch and Orchestration

The system uses SGLangServerWrapper to manage the lifecycle of SGLang server processes, especially in multi-GPU or multi-node environments areal/infra/launcher/sglang_server.py89-201

Title: SGLang Server Launch Sequence

Command Generation: SGLangConfig.build_cmd() constructs the CLI command with appropriate TP size and distributed initialization addresses areal/infra/launcher/sglang_server.py181-190
Resource Allocation: The wrapper calculates GPU offsets and ports based on allocation_mode and n_gpus_per_node areal/infra/launcher/sglang_server.py142-167
Multi-Node Support: For cross-node SGLang, it uses environment variables like AREAL_SGLANG_MULTI_NODE_RANK and AREAL_SGLANG_MULTI_NODE_MASTER_ADDR areal/infra/launcher/sglang_server.py134-136
Environment Isolation: launch_server_cmd ensures each instance has a unique TRITON_CACHE_PATH to avoid DirectoryNotEmpty errors areal/infra/launcher/sglang_server.py49-52
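
The environment isolation step can be sketched as below. Only the `TRITON_CACHE_PATH` variable name comes from the text above; the helper name, the use of a UUID suffix, and the directory naming are assumptions made for illustration.

```python
import os
import uuid


def build_server_env(base_env: dict, cache_root: str) -> dict:
    """Copy the environment and give this server instance its own Triton cache dir."""
    env = dict(base_env)  # never mutate the caller's environment
    # Assumption: a random suffix is enough to keep concurrent instances apart.
    env["TRITON_CACHE_PATH"] = os.path.join(cache_root, f"triton-{uuid.uuid4().hex}")
    return env


base = {"PATH": "/usr/bin"}
env_a = build_server_env(base, "/tmp/areal_cache")
env_b = build_server_env(base, "/tmp/areal_cache")
```

The resulting dict can be passed as the `env` argument of `subprocess.Popen` when starting each server process.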

Server Control Endpoints

The backend provides methods to build HTTP requests for various server control operations:

Method	Endpoint	Purpose
`get_pause_request()`	`/pause_generation`	Pause inference requests areal/engine/sglang_remote.py205-207
`get_resume_request()`	`/continue_generation`	Resume inference requests areal/engine/sglang_remote.py209-211
`get_health_check_request()`	`/health`	Check server health (GET) areal/engine/sglang_remote.py213-215
`get_offload_request()`	`/release_memory_occupation`	Release GPU memory areal/engine/sglang_remote.py217-219
`get_onload_request(tags)`	`/resume_memory_occupation`	Restore GPU memory areal/engine/sglang_remote.py221-230
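
The table can be read as a set of small request builders. In the sketch below, the method names and endpoints follow the table, while the simplified `HttpRequest` dataclass, the empty payloads, and the `tags` payload key are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HttpRequest:
    """Simplified stand-in for AReaL's HttpRequest (fields are assumptions)."""

    endpoint: str
    payload: dict = field(default_factory=dict)
    method: str = "POST"


class ControlRequests:
    """Builds the server control requests listed in the table above."""

    def get_pause_request(self) -> HttpRequest:
        return HttpRequest("/pause_generation")

    def get_resume_request(self) -> HttpRequest:
        return HttpRequest("/continue_generation")

    def get_health_check_request(self) -> HttpRequest:
        # Health checks are the only GET request in the table.
        return HttpRequest("/health", method="GET")

    def get_offload_request(self) -> HttpRequest:
        return HttpRequest("/release_memory_occupation")

    def get_onload_request(self, tags: Optional[list] = None) -> HttpRequest:
        # Assumption: tags are forwarded only when the caller supplies them.
        payload = {"tags": tags} if tags is not None else {}
        return HttpRequest("/resume_memory_occupation", payload=payload)
```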

Sources: areal/engine/sglang_remote.py205-244 areal/infra/launcher/sglang_server.py36-62

Integration with RemoteInfEngine

The RemoteSGLangEngine class is a thin wrapper that delegates all operations to RemoteInfEngine with the SGLangBackend protocol implementation areal/engine/sglang_remote.py247-439

Key Design Pattern

The composition pattern provides:

Separation of Concerns: Backend-specific logic in SGLangBackend, generic remote engine logic (like retry logic, batching, and async handling) in RemoteInfEngine.
Code Reuse: Same RemoteInfEngine used for SGLang and vLLM backends.
Clean API: RemoteSGLangEngine exposes the full InferenceEngine interface without duplication.
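
The composition can be illustrated with a stripped-down sketch. The class names mirror those on this page, but the method signatures, the `transport` callable, and the response shape are hypothetical simplifications; the real classes cover far more (retries, batching, async handling).

```python
from typing import Callable, Protocol


class RemoteInfBackendProtocol(Protocol):
    """Hypothetical two-method slice of the backend protocol."""

    def build_generation_request(self, input_ids: list) -> dict: ...

    def parse_generation_response(self, response: dict) -> list: ...


class SGLangBackend:
    """Backend-specific translation only; no transport logic."""

    def build_generation_request(self, input_ids: list) -> dict:
        return {"endpoint": "/generate", "payload": {"input_ids": input_ids}}

    def parse_generation_response(self, response: dict) -> list:
        entries = response["meta_info"]["output_token_logprobs"]
        return [int(entry[1]) for entry in entries]


class RemoteInfEngine:
    """Generic coordinator; works with any backend satisfying the protocol."""

    def __init__(self, backend: RemoteInfBackendProtocol, transport: Callable) -> None:
        self.backend = backend
        self.transport = transport

    def generate(self, input_ids: list) -> list:
        request = self.backend.build_generation_request(input_ids)
        return self.backend.parse_generation_response(self.transport(request))


class RemoteSGLangEngine:
    """Thin wrapper that wires the SGLang backend into the generic engine."""

    def __init__(self, transport: Callable) -> None:
        self._engine = RemoteInfEngine(SGLangBackend(), transport)

    def generate(self, input_ids: list) -> list:
        return self._engine.generate(input_ids)


def fake_transport(request: dict) -> dict:
    """Stand-in for the HTTP call; returns a canned response."""
    return {"meta_info": {"output_token_logprobs": [[-0.1, 11], [-0.2, 12]]}}


output_tokens = RemoteSGLangEngine(fake_transport).generate([1, 2, 3])
```

Swapping in a different backend would only require another class with the same two methods; `RemoteInfEngine` stays unchanged.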

Sources: areal/engine/sglang_remote.py247-439 areal/api/engine_api.py32-233


URL: https://deepwiki.com/inclusionAI/AReaL/4.2-sglang-backend
