URL: https://deepwiki.com/inclusionAI/AReaL/4.2-sglang-backend

Last indexed: 7 May 2026 (2e12c1)

SGLang Backend

This page documents the SGLang backend implementation for remote inference in AReaL. The SGLang backend provides high-performance inference capabilities with support for structured generation, weight updates, and LoRA adapters.

For information about the abstract inference engine interface, see areal/api/engine_api.py32-233. For the alternative vLLM backend, see areal/engine/vllm_remote.py41-127.

Purpose and Scope

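The SGLang backend implements the InferenceEngine API using SGLang's HTTP server as the underlying inference provider. This page covers the backend architecture, request format and generation, response parsing, weight update protocols, LoRA support, server lifecycle management, and integration with RemoteInfEngine.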

Sources: areal/engine/sglang_remote.py40-439 areal/api/io_struct.py161-163 areal/infra/launcher/sglang_server.py89-201 areal/experimental/inference_service/sglang/launch_server.py17-81

Architecture Overview

The SGLang backend follows a composition pattern where RemoteSGLangEngine delegates to RemoteInfEngine with an SGLangBackend implementation that satisfies the RemoteInfBackendProtocol.

Title: SGLang Backend Architecture


Key Classes and Composition

| Class | Role | Location |
|---|---|---|
| RemoteSGLangEngine | Public API implementing InferenceEngine | areal/engine/sglang_remote.py247-439 |
| RemoteInfEngine | Generic remote engine coordinator | areal/infra/remote_inf_engine.py15-18 |
| SGLangBackend | SGLang-specific protocol adapter | areal/engine/sglang_remote.py40-42 |
| HttpRequest | Generic HTTP request structure | areal/api/io_struct.py29 |
| SGLangServerWrapper | Launcher for SGLang server processes | areal/infra/launcher/sglang_server.py89-90 |

Sources: areal/engine/sglang_remote.py247-262 areal/engine/sglang_remote.py40-42 areal/api/io_struct.py29 areal/infra/launcher/sglang_server.py89-90
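
The composition described above can be illustrated with a minimal sketch. Only the class names come from this page; the method names on the engine, the fields of HttpRequest, and the "/generate" endpoint with an "input_ids" key are assumptions made for illustration and may differ from the real code.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class HttpRequest:
    """Hypothetical stand-in for the generic HTTP request structure."""
    endpoint: str
    payload: dict[str, Any] = field(default_factory=dict)
    method: str = "POST"


class RemoteInfBackendProtocol(Protocol):
    """Assumed subset of the hooks a backend adapter exposes."""

    def build_generation_request(self, prompt_ids: list[int]) -> HttpRequest: ...

    def get_health_check_request(self) -> HttpRequest: ...


class SGLangBackend:
    """Backend-specific adapter: knows endpoints and payload shapes."""

    def build_generation_request(self, prompt_ids: list[int]) -> HttpRequest:
        # Endpoint and key name are assumptions for this sketch.
        return HttpRequest("/generate", {"input_ids": prompt_ids})

    def get_health_check_request(self) -> HttpRequest:
        return HttpRequest("/health", method="GET")


class RemoteInfEngine:
    """Generic coordinator: holds a backend and never inspects its type."""

    def __init__(self, backend: RemoteInfBackendProtocol) -> None:
        self.backend = backend

    def plan_generation(self, prompt_ids: list[int]) -> HttpRequest:
        return self.backend.build_generation_request(prompt_ids)


class RemoteSGLangEngine:
    """Thin public wrapper that wires the SGLang adapter into the generic engine."""

    def __init__(self) -> None:
        self._engine = RemoteInfEngine(SGLangBackend())

    def plan_generation(self, prompt_ids: list[int]) -> HttpRequest:
        return self._engine.plan_generation(prompt_ids)
```

Because RemoteInfEngine depends only on the protocol, a different adapter could be passed in without changing the coordinator.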

Request Format and Generation

The SGLangBackend translates ModelRequest objects into SGLang-specific HTTP requests via the build_generation_request() method areal/engine/sglang_remote.py43-89

Title: Request Translation Flow


Sampling Parameters Mapping

The gconfig (GenerationHyperparameters) is mapped to SGLang's sampling_params areal/engine/sglang_remote.py56-65:

| AReaL Parameter | SGLang Parameter | Notes |
|---|---|---|
| top_p | top_p | Direct mapping |
| top_k | top_k | Direct mapping |
| max_new_tokens | max_new_tokens | Direct mapping |
| temperature | temperature | Set to 0.0 if greedy=True areal/engine/sglang_remote.py60 |
| stop_token_ids | stop_token_ids | Direct mapping |
| ignore_eos | ignore_eos | Direct mapping |
| skip_special_tokens | skip_special_tokens | Direct mapping |
| frequency_penalty | frequency_penalty | Direct mapping |
| stop | stop | String stop sequences areal/engine/sglang_remote.py67 |
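
A minimal sketch of the mapping in the table above follows. The dataclass is a hypothetical subset of GenerationHyperparameters: the field names follow the table, but the defaults and the rule that stop is only included when set are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class GenerationHyperparameters:
    """Hypothetical subset of gconfig; defaults are illustrative only."""
    max_new_tokens: int = 16
    greedy: bool = False
    temperature: float = 1.0
    top_p: float = 1.0
    top_k: int = -1
    stop_token_ids: list[int] = field(default_factory=list)
    ignore_eos: bool = False
    skip_special_tokens: bool = True
    frequency_penalty: float = 0.0
    stop: Optional[list[str]] = None


def to_sampling_params(g: GenerationHyperparameters) -> dict[str, Any]:
    """Translate gconfig into an SGLang-style sampling_params dict."""
    params: dict[str, Any] = {
        "top_p": g.top_p,
        "top_k": g.top_k,
        "max_new_tokens": g.max_new_tokens,
        # Greedy decoding is expressed as temperature 0.0.
        "temperature": 0.0 if g.greedy else g.temperature,
        "stop_token_ids": g.stop_token_ids,
        "ignore_eos": g.ignore_eos,
        "skip_special_tokens": g.skip_special_tokens,
        "frequency_penalty": g.frequency_penalty,
    }
    # Assumption: string stop sequences are only sent when configured.
    if g.stop:
        params["stop"] = g.stop
    return params
```

For example, a config with greedy=True and temperature=0.7 yields a temperature of 0.0 in the resulting dict, since the greedy flag overrides the configured value.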

Special Features

  1. Vision Support: image_data is passed directly to SGLang payload areal/engine/sglang_remote.py71
  2. LoRA: When with_lora=True, adds lora_path with versioned name using get_versioned_lora_name areal/engine/sglang_remote.py81-87
  3. Expert Routing: When req.metadata.get("return_routed_experts", False) is True, adds return_routed_experts flag to the payload areal/engine/sglang_remote.py78-79

Limitations

Beam search is not supported in the SGLang backend and will raise NotImplementedError areal/engine/sglang_remote.py51-54

Sources: areal/engine/sglang_remote.py43-89 areal/api/io_struct.py28-60
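
The special features and the beam search limitation can be sketched as a single payload builder. The function name, its parameters, the use_beam_search flag, and the "input_ids" key are hypothetical; only image_data, lora_path, return_routed_experts, and the NotImplementedError behaviour are taken from the description above. The versioned_lora_name argument is assumed to have been produced by get_versioned_lora_name beforehand.

```python
from typing import Any, Optional


def build_generation_payload(
    input_ids: list[int],
    sampling_params: dict[str, Any],
    *,
    image_data: Optional[list[str]] = None,
    metadata: Optional[dict[str, Any]] = None,
    versioned_lora_name: Optional[str] = None,
    use_beam_search: bool = False,  # hypothetical flag for this sketch
) -> dict[str, Any]:
    """Assemble an SGLang-style generation payload with optional features."""
    if use_beam_search:
        raise NotImplementedError("Beam search is not supported by the SGLang backend.")

    payload: dict[str, Any] = {
        "input_ids": list(input_ids),
        "sampling_params": dict(sampling_params),
    }
    # Vision support: image data is forwarded unchanged.
    if image_data is not None:
        payload["image_data"] = image_data
    # Expert routing: opt-in through request metadata.
    if (metadata or {}).get("return_routed_experts", False):
        payload["return_routed_experts"] = True
    # LoRA: route the request to a specific adapter version.
    if versioned_lora_name is not None:
        payload["lora_path"] = versioned_lora_name
    return payload
```

Keeping every optional key out of the payload unless it is requested means a plain text request carries only the token IDs and sampling parameters.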

Response Parsing

The parse_generation_response() method extracts tokens and log probabilities from SGLang's response format areal/engine/sglang_remote.py91-127:

Title: Response Parsing Logic


SGLang Response Structure

SGLang returns token-level information in meta_info["output_token_logprobs"], where each entry is a list containing [logprob, token_id] areal/engine/sglang_remote.py119-120
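
As an illustration of that layout, the sketch below splits the entries into parallel token and log-probability lists. The function name is hypothetical, and the sample response is invented purely to show the shape.

```python
from typing import Any


def split_token_logprobs(meta_info: dict[str, Any]) -> tuple[list[int], list[float]]:
    """Return (token_ids, logprobs) from output_token_logprobs entries."""
    entries = meta_info["output_token_logprobs"]
    # Index access rather than tuple unpacking, so entries that carry
    # extra trailing fields would still parse.
    logprobs = [float(entry[0]) for entry in entries]
    token_ids = [int(entry[1]) for entry in entries]
    return token_ids, logprobs


# Invented sample data, for shape only.
sample_meta = {"output_token_logprobs": [[-0.25, 1012], [-1.5, 42]]}
token_ids, logprobs = split_token_logprobs(sample_meta)
```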

Expert Routing Extraction

For MoE models with return_routed_experts=True, the response includes base64-encoded expert routing data areal/engine/sglang_remote.py101-109:

  1. Decode base64 string to bytes via pybase64.b64decode areal/engine/sglang_remote.py108
  2. Convert to np.int32 array using np.frombuffer areal/engine/sglang_remote.py107
  3. Reshape to (num_sgl_token, -1), where the second dimension represents num_layers * expert_top_k areal/engine/sglang_remote.py109
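
The three steps above can be sketched as follows. This version swaps in the standard-library base64 module for pybase64 so that it only depends on NumPy; the function name is hypothetical, and native byte order is assumed for the int32 buffer.

```python
import base64

import numpy as np


def decode_routed_experts(encoded: str, num_tokens: int) -> np.ndarray:
    """Decode base64 expert routing data into a (num_tokens, -1) int32 array."""
    raw = base64.b64decode(encoded)            # step 1: base64 string -> bytes
    flat = np.frombuffer(raw, dtype=np.int32)  # step 2: bytes -> flat int32 view
    # Step 3: one row per token; the remaining axis is num_layers * expert_top_k.
    return flat.reshape(num_tokens, -1)


# Round-trip example with invented data: 3 tokens, 2 layers, top-4 experts.
original = np.arange(24, dtype=np.int32).reshape(3, 8)
encoded = base64.b64encode(original.tobytes()).decode("ascii")
decoded = decode_routed_experts(encoded, num_tokens=3)
```

Note that np.frombuffer returns a read-only view over the decoded bytes, so a caller that needs to modify the array would have to copy it first.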

Abort Handling

If stop_reason == "abort" and the message starts with "Abort before prefill", returns empty tokens with the abort reason preserved areal/engine/sglang_remote.py111-117
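
A minimal sketch of that check is shown below. The function name and the flat argument list are hypothetical; how the stop reason and message are extracted from the raw response is not covered here.

```python
def apply_abort_rule(
    stop_reason: str,
    message: str,
    token_ids: list[int],
    logprobs: list[float],
) -> tuple[list[int], list[float], str]:
    """Drop generated tokens when the request was aborted before prefill."""
    if stop_reason == "abort" and message.startswith("Abort before prefill"):
        # Nothing was generated; keep the abort reason so the caller can react.
        return [], [], stop_reason
    return token_ids, logprobs, stop_reason
```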

Sources: areal/engine/sglang_remote.py91-127

Weight Update Protocols

The SGLang backend supports multiple weight update mechanisms: disk-based, NCCL-based distributed updates, and the experimental AWEX protocol.

Disk-Based Weight Updates

Disk-based updates are handled by build_disk_weight_update_requests() areal/engine/sglang_remote.py129-160

LoRA updates are sent to the /load_lora_adapter endpoint.

Full model updates are sent to the /update_weights_from_disk endpoint.


Sources: areal/engine/sglang_remote.py129-160 areal/api/io_struct.py161-163
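
As a hedged sketch of how the two disk-based cases might be distinguished, the function below returns an endpoint and payload pair. The endpoints come from this page; the function name and the payload keys ("lora_name", "lora_path", "model_path") are assumptions, and the real payloads may carry additional fields.

```python
from typing import Any, Optional


def build_disk_update(
    path: str,
    *,
    lora_name: Optional[str] = None,
    version: int = 0,
) -> tuple[str, dict[str, Any]]:
    """Pick the endpoint and payload for a disk-based weight update."""
    if lora_name is not None:
        # LoRA: register the adapter under a versioned name (e.g. actor-v1).
        payload = {"lora_name": f"{lora_name}-v{version}", "lora_path": path}
        return "/load_lora_adapter", payload
    # Full model: point the server at the new checkpoint directory.
    return "/update_weights_from_disk", {"model_path": path}
```

For instance, calling it with lora_name="actor" and version=1 targets /load_lora_adapter with the adapter name actor-v1, while omitting lora_name targets /update_weights_from_disk.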

NCCL-Based Distributed Updates

For NCCL/XCCL weight updates, SGLang requires a two-step process: initialization of the weight update group via build_init_weights_group_request(), then the actual weight update via build_distributed_weight_update_requests() areal/engine/sglang_remote.py161-203

Init Weights Group Payload: The rank_offset calculation ensures proper rank assignment across multiple servers and TP groups. It uses meta.gen_allocation to determine world sizes and indices areal/engine/sglang_remote.py194-203

LoRA Limitation: SGLang's distributed weight update does not support LoRA. Attempting to use NCCL with LoRA raises a ValueError areal/engine/sglang_remote.py169-173

Pipeline Parallelism Limitation: SGLang distributed weight update currently requires pp_size == 1. If meta.gen_allocation.parallel.pp_size != 1, it raises NotImplementedError areal/engine/sglang_remote.py190-191

Sources: areal/engine/sglang_remote.py161-203
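
The two limitations and the rank offset idea can be sketched as below. The dataclass and function names are hypothetical. The offset formula is an assumption made for illustration: it supposes that rank 0 belongs to the training side and that each server occupies tp_size consecutive ranks. The actual calculation in the source may differ.

```python
from dataclasses import dataclass


@dataclass
class GenParallel:
    """Hypothetical view of the generation-side parallel layout."""
    tp_size: int
    pp_size: int = 1


def check_distributed_update_supported(use_lora: bool, parallel: GenParallel) -> None:
    """Raise for configurations the distributed update path does not support."""
    if use_lora:
        raise ValueError("Distributed weight update does not support LoRA.")
    if parallel.pp_size != 1:
        raise NotImplementedError("Distributed weight update requires pp_size == 1.")


def rank_offset(server_idx: int, tp_size: int) -> int:
    """Assumed layout: rank 0 is the trainer, then tp_size ranks per server."""
    return 1 + server_idx * tp_size
```

Under this assumed layout, with tp_size=4 the first server starts at rank 1 and the third server (index 2) starts at rank 9.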

AWEX (Experimental)

AWEX is an experimental weight update protocol that bridges AReaL's training engines with SGLang's internal scheduler via ZMQ RPC areal/experimental/inference_service/sglang/scheduler.py20-68

Sources: areal/experimental/inference_service/sglang/scheduler.py20-105 areal/experimental/inference_service/sglang/rpc_proxy.py13-35

LoRA Support

The SGLang backend implements versioned LoRA adapter management to support asynchronous weight updates.

Versioned Name Format: The function get_versioned_lora_name(lora_name, version) creates adapter names like actor-v0, actor-v1, etc., allowing the server to distinguish between different training iterations areal/api/io_struct.py161-163

LoRA in Generation Requests: When with_lora=True in build_generation_request(), it fetches the specific versioned path using get_versioned_lora_name to ensure the correct model version is used for the request areal/engine/sglang_remote.py81-87

Sources: areal/engine/sglang_remote.py81-87 areal/api/io_struct.py161-163
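
Based on the actor-v0 and actor-v1 examples above, the naming helper could look like the sketch below. The body is inferred from those examples and is not copied from the source.

```python
def get_versioned_lora_name(lora_name: str, version: int) -> str:
    """Build an adapter name such as 'actor-v0' from a base name and version."""
    return f"{lora_name}-v{version}"


# Each training iteration gets its own adapter name on the server.
names = [get_versioned_lora_name("actor", v) for v in range(3)]
```

Distinct names per version let requests that started under an older adapter keep referring to it while a newer one is being loaded.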

Server Lifecycle Management

Server Launch and Orchestration

The system uses SGLangServerWrapper to manage the lifecycle of SGLang server processes, especially in multi-GPU or multi-node environments areal/infra/launcher/sglang_server.py89-201

Title: SGLang Server Launch Sequence


  1. Command Generation: SGLangConfig.build_cmd() constructs the CLI command with appropriate TP size and distributed initialization addresses areal/infra/launcher/sglang_server.py181-190
  2. Resource Allocation: The wrapper calculates GPU offsets and ports based on allocation_mode and n_gpus_per_node areal/infra/launcher/sglang_server.py142-167
  3. Multi-Node Support: For cross-node SGLang, it uses environment variables like AREAL_SGLANG_MULTI_NODE_RANK and AREAL_SGLANG_MULTI_NODE_MASTER_ADDR areal/infra/launcher/sglang_server.py134-136
  4. Environment Isolation: launch_server_cmd ensures each instance has a unique TRITON_CACHE_PATH to avoid DirectoryNotEmpty errors areal/infra/launcher/sglang_server.py49-52
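
Steps 2 and 4 above can be sketched as follows. Both function names are hypothetical, and the contiguous GPU slicing and the UUID-based cache directory are assumptions; only the TRITON_CACHE_PATH variable name comes from this page.

```python
import os
import tempfile
import uuid
from typing import Optional


def gpu_ids_for_server(local_server_idx: int, gpus_per_server: int) -> list[int]:
    """Assumed layout: each server on a node takes a contiguous GPU slice."""
    start = local_server_idx * gpus_per_server
    return list(range(start, start + gpus_per_server))


def build_server_env(
    base_env: dict[str, str],
    gpu_ids: list[int],
    cache_root: Optional[str] = None,
) -> dict[str, str]:
    """Return a per-instance environment with isolated GPUs and Triton cache."""
    env = dict(base_env)  # copy so the caller's environment is untouched
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    root = cache_root or os.path.join(tempfile.gettempdir(), "triton_cache")
    # A fresh directory name per instance avoids cache collisions.
    env["TRITON_CACHE_PATH"] = os.path.join(root, uuid.uuid4().hex)
    return env
```

The returned dict could then be passed as the env argument when spawning each server process, for example through subprocess.Popen.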

Server Control Endpoints

The backend provides methods to build HTTP requests for various server control operations:

| Method | Endpoint | Purpose |
|---|---|---|
| get_pause_request() | /pause_generation | Pause inference requests areal/engine/sglang_remote.py205-207 |
| get_resume_request() | /continue_generation | Resume inference requests areal/engine/sglang_remote.py209-211 |
| get_health_check_request() | /health | Check server health (GET) areal/engine/sglang_remote.py213-215 |
| get_offload_request() | /release_memory_occupation | Release GPU memory areal/engine/sglang_remote.py217-219 |
| get_onload_request(tags) | /resume_memory_occupation | Restore GPU memory areal/engine/sglang_remote.py221-230 |

Sources: areal/engine/sglang_remote.py205-244 areal/infra/launcher/sglang_server.py36-62
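
A minimal sketch of these request builders follows. The method names and endpoints come from the table above; the HttpRequest fields, the POST default, and the "tags" payload key are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class HttpRequest:
    """Hypothetical stand-in for the generic HTTP request structure."""
    endpoint: str
    payload: dict[str, Any] = field(default_factory=dict)
    method: str = "POST"


def get_pause_request() -> HttpRequest:
    return HttpRequest("/pause_generation")


def get_resume_request() -> HttpRequest:
    return HttpRequest("/continue_generation")


def get_health_check_request() -> HttpRequest:
    # The health check is the only GET request in the table.
    return HttpRequest("/health", method="GET")


def get_offload_request() -> HttpRequest:
    return HttpRequest("/release_memory_occupation")


def get_onload_request(tags: Optional[list[str]] = None) -> HttpRequest:
    # Assumption: tags, when given, select which memory regions to restore.
    payload = {"tags": tags} if tags is not None else {}
    return HttpRequest("/resume_memory_occupation", payload)
```

Because each builder only returns a description of the request, the generic engine remains responsible for actually sending it and handling failures.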

Integration with RemoteInfEngine

The RemoteSGLangEngine class is a thin wrapper that delegates all operations to RemoteInfEngine with the SGLangBackend protocol implementation areal/engine/sglang_remote.py247-439

Key Design Pattern: the composition pattern provides the following benefits.

  1. Separation of Concerns: Backend-specific logic in SGLangBackend, generic remote engine logic (like retry logic, batching, and async handling) in RemoteInfEngine.
  2. Code Reuse: Same RemoteInfEngine used for SGLang and vLLM backends.
  3. Clean API: RemoteSGLangEngine exposes the full InferenceEngine interface without duplication.

Sources: areal/engine/sglang_remote.py247-439 areal/api/engine_api.py32-233