Last indexed: 7 May 2026 (2e12c1)

Server Lifecycle Management

Purpose and Scope

This document describes how inference servers are launched, monitored, and shut down within the AReaL framework. It covers process lifecycle management for local, Ray, and Slurm environments, as well as the specialized wrappers for the SGLang and vLLM inference backends. These components ensure that inference resources are correctly allocated, initialized, and synchronized with training weight updates.

Sources: areal/infra/launcher/sglang_server.py1-33 areal/infra/launcher/vllm_server.py1-36

Lifecycle Overview

The server lifecycle is managed by specific Wrapper classes and Launcher implementations. The lifecycle transitions from resource allocation and process spawning to health monitoring and eventual graceful teardown.

Server Lifecycle State Machine

Sources: areal/infra/launcher/sglang_server.py107-123 areal/infra/launcher/vllm_server.py112-145

Inference Server Wrappers

AReaL provides dedicated wrapper classes, SGLangServerWrapper and vLLMServerWrapper, to manage the complexities of launching distributed inference servers across one or more nodes.

SGLang and vLLM Wrapper Logic

The wrappers handle:

Port Allocation: Using find_free_ports within a specific range (typically 10,000 to 32,767) to avoid conflicts areal/infra/launcher/sglang_server.py167-168 areal/infra/launcher/vllm_server.py180-181 areal/utils/network.py114-134
Environment Isolation: Generating unique TRITON_CACHE_PATH and VLLM_CACHE_ROOT for each process to prevent filesystem race conditions areal/infra/launcher/sglang_server.py48-52 areal/infra/launcher/vllm_server.py42-50
Multi-Node Coordination: Setting up master addresses and ports for distributed backends like SGLang areal/infra/launcher/sglang_server.py126-140
LoRA Support: The vLLM wrapper sets VLLM_ALLOW_RUNTIME_LORA_UPDATING so that LoRA weights can be updated dynamically at runtime areal/infra/launcher/vllm_server.py51
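The port allocation and environment isolation steps above can be sketched as follows. This is a minimal illustration, not the code in areal/utils/network.py: the function names find_free_ports_sketch and build_isolated_env, their parameters, and the cache directory naming are assumptions made for the example. Only the port range and the two environment variable names come from the text above.

```python
import os
import random
import socket
import tempfile
import uuid


def is_port_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def find_free_ports_sketch(count, low=10000, high=32767, max_tries=1000):
    """Pick `count` distinct ports in [low, high] that are free right now.

    Hypothetical helper: a port can still be taken by another process
    between this check and the moment the server binds to it.
    """
    found = set()
    for _ in range(max_tries):
        if len(found) == count:
            break
        port = random.randint(low, high)
        if port not in found and is_port_free(port):
            found.add(port)
    if len(found) < count:
        raise RuntimeError(f"could not find {count} free ports in [{low}, {high}]")
    return sorted(found)


def build_isolated_env(base_env=None, cache_parent=None):
    """Copy the environment and point the cache variables at unique paths.

    The directory naming scheme is an assumption made for this sketch.
    """
    env = dict(os.environ if base_env is None else base_env)
    parent = cache_parent or tempfile.gettempdir()
    token = uuid.uuid4().hex  # unique per call, so per server process
    env["TRITON_CACHE_PATH"] = os.path.join(parent, f"triton-cache-{token}")
    env["VLLM_CACHE_ROOT"] = os.path.join(parent, f"vllm-cache-{token}")
    return env
```

Because each call to build_isolated_env produces a fresh token, two server processes started on the same node never share a compilation cache directory, which is the race the wrappers are said to avoid.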

Launch Process Mapping

This diagram maps the natural language "Launch Server" action to specific code entities and data flows.

Sources: areal/infra/launcher/sglang_server.py125-198 areal/infra/launcher/vllm_server.py146-205 areal/infra/launcher/local.py121-156

Resource Management and Allocation

Servers are launched based on an _AllocationMode, which defines the number of GPUs per instance (gen_instance_size) and the tensor parallelism (tp_size) areal/api/alloc_mode.py15-19
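To make the relationship between these sizes concrete, here is a small sketch of how a pool of generation GPUs might be split into server instances. The class AllocationSketch, the gen_gpus field, and the divisibility checks are assumptions made for illustration; they are not the definition of _AllocationMode in areal/api/alloc_mode.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AllocationSketch:
    """Hypothetical stand-in for an allocation mode (not AReaL's class)."""

    gen_gpus: int           # assumed: total GPUs reserved for generation
    gen_instance_size: int  # GPUs used by one server instance
    tp_size: int            # tensor-parallel degree inside an instance

    def instance_gpu_groups(self):
        """Split GPU indices 0..gen_gpus-1 into one group per instance."""
        if self.gen_instance_size <= 0 or self.tp_size <= 0:
            raise ValueError("sizes must be positive")
        # Assumption: an instance holds a whole number of TP groups.
        if self.gen_instance_size % self.tp_size != 0:
            raise ValueError("gen_instance_size must be a multiple of tp_size")
        if self.gen_gpus % self.gen_instance_size != 0:
            raise ValueError("gen_gpus must be a multiple of gen_instance_size")
        return [
            list(range(start, start + self.gen_instance_size))
            for start in range(0, self.gen_gpus, self.gen_instance_size)
        ]
```

Under these assumptions, 8 generation GPUs with gen_instance_size=4 yield two server instances, each owning four consecutive GPU indices.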

Local Process Management

The LocalLauncher manages processes on a single node using subprocess.Popen and psutil. It assigns GPUs to jobs in round-robin order so that load is spread across the available devices.

Feature	Implementation	File Reference
GPU Allocation	Round-robin via `_gpu_counter`	areal/infra/launcher/local.py135-144
Process Tracking	`self._jobs` dict mapping name to `Popen`	areal/infra/launcher/local.py93-94
Status Mapping	`JOB_STATE_TO_PROCESS_STATUS`	areal/infra/launcher/local.py43-63
Termination	`terminate_process_and_children`	areal/infra/launcher/local.py72-85

Sources: areal/infra/launcher/local.py43-156
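The round-robin strategy can be sketched with a single counter that wraps around the device count. The class below is illustrative only: apart from the _gpu_counter name taken from the table above, the class name, method names, and the use of CUDA_VISIBLE_DEVICES to pin a job to its devices are assumptions.

```python
class RoundRobinGpuAllocator:
    """Hand out GPU indices in rotating order (illustrative sketch)."""

    def __init__(self, n_gpus):
        if n_gpus <= 0:
            raise ValueError("n_gpus must be positive")
        self.n_gpus = n_gpus
        self._gpu_counter = 0  # index of the next GPU to hand out

    def allocate(self, count):
        """Return `count` GPU indices, continuing where the last job stopped."""
        if not 0 < count <= self.n_gpus:
            raise ValueError("count must be between 1 and n_gpus")
        ids = [(self._gpu_counter + i) % self.n_gpus for i in range(count)]
        self._gpu_counter = (self._gpu_counter + count) % self.n_gpus
        return ids


def visible_devices_env(gpu_ids):
    """Build the environment entry that restricts a process to `gpu_ids`."""
    return {"CUDA_VISIBLE_DEVICES": ",".join(str(i) for i in gpu_ids)}
```

With three GPUs, two successive requests for two devices return [0, 1] and then [2, 0]: the counter wraps instead of always starting from device 0, so early devices are not oversubscribed.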

Distributed Launchers (Ray & Slurm)

For multi-node deployments, AReaL leverages cluster-native scheduling:

Ray Integration: Uses ray.util.placement_group to reserve the GPUs that inference workers require, so that the workers are scheduled onto those resources areal/infra/utils/ray.py10-27
SlurmLauncher: Generates sbatch scripts using SBATCH_SCRIPT_TEMPLATE and launches tasks via srun with specific --gres=gpu requirements areal/infra/launcher/slurm.py149-166
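As a rough illustration of the Slurm path, the sketch below renders a batch script from a template and requests GPUs through --gres. The template text, the function render_sbatch_script, and all of its parameters are hypothetical; the actual SBATCH_SCRIPT_TEMPLATE in areal/infra/launcher/slurm.py may look quite different.

```python
from string import Template

# Hypothetical template for illustration; not AReaL's SBATCH_SCRIPT_TEMPLATE.
SBATCH_TEMPLATE_SKETCH = Template(
    "#!/bin/bash\n"
    "#SBATCH --job-name=$job_name\n"
    "#SBATCH --nodes=$nodes\n"
    "#SBATCH --ntasks-per-node=1\n"
    "#SBATCH --gres=gpu:$gpus_per_node\n"
    "#SBATCH --output=$log_path\n"
    "\n"
    "srun --nodes=$nodes --ntasks-per-node=1 "
    "--gres=gpu:$gpus_per_node $command\n"
)


def render_sbatch_script(job_name, nodes, gpus_per_node, log_path, command):
    """Fill the template; substitute() raises KeyError if a field is missing."""
    return SBATCH_TEMPLATE_SKETCH.substitute(
        job_name=job_name,
        nodes=nodes,
        gpus_per_node=gpus_per_node,
        log_path=log_path,
        command=command,
    )
```

The rendered text would then be written to a file and submitted with sbatch; that step is omitted here because it depends on the cluster.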

Monitoring and Health Checks

Once a server is launched, the wrapper first waits for it to become ready and then monitors it for unexpected exits.

Readiness Probing

The wait_for_server function polls the /v1/models endpoint. It includes a mandatory sleep period after a successful response to ensure the backend internal state is fully settled before the training loop begins.

Sources: areal/infra/launcher/sglang_server.py65-88 areal/infra/launcher/vllm_server.py64-87
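The readiness probe described above can be sketched as a polling loop followed by a settle delay. This is not the actual wait_for_server implementation: the function names, the default timings, and the choice to return False on timeout (rather than raise) are assumptions. The probe is passed in as a callable so the loop can be exercised without a running server.

```python
import http.client
import time
import urllib.request


def models_endpoint_ok(base_url, request_timeout=5.0):
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(
            f"{base_url}/v1/models", timeout=request_timeout
        ) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and HTTP errors all mean "not ready".
        return False


def wait_until_ready(probe, timeout=60.0, poll_interval=1.0, settle_seconds=2.0):
    """Call `probe` until it succeeds or `timeout` elapses.

    After the first success, sleep `settle_seconds` so the backend can
    finish initializing before callers start sending requests.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            time.sleep(settle_seconds)
            return True
        time.sleep(poll_interval)
    return False
```

A caller might use it as wait_until_ready(lambda: models_endpoint_ok("http://127.0.0.1:30000")), where the address is a placeholder for whichever host and port the wrapper allocated.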

Active Monitoring

The _monitor_server_processes method (for SGLang) and the run loop (for vLLM) monitor server health. If any server process in the group exits unexpectedly (process.poll() is not None), the wrapper triggers an exit, signaling the scheduler to handle recovery.

Sources: areal/infra/launcher/sglang_server.py107-123 areal/infra/launcher/vllm_server.py112-124
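A minimal sketch of this monitoring pattern, assuming the servers are held as a list of subprocess.Popen objects: the function name monitor_until_exit and its return value are illustrative and do not mirror _monitor_server_processes or the vLLM run loop.

```python
import subprocess
import sys
import time


def monitor_until_exit(processes, poll_interval=0.1):
    """Block until any process in `processes` has exited.

    Returns (index, return_code) of the first process found dead, so the
    caller can log it, clean up the remaining servers, and exit.
    """
    while True:
        for index, proc in enumerate(processes):
            code = proc.poll()  # None while the process is still running
            if code is not None:
                return index, code
        time.sleep(poll_interval)


if __name__ == "__main__":
    # Demo: one long-running child and one that fails after a moment.
    healthy = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    failing = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"])
    index, code = monitor_until_exit([healthy, failing])
    print(f"server {index} exited with code {code}")
    healthy.terminate()
    healthy.wait()
```

Treating the exit of any single server as fatal for the whole group keeps the logic simple: the wrapper does not try to restart individual servers itself and leaves recovery to the scheduler, as described above.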

Graceful Teardown

To prevent "zombie" processes and GPU memory leaks, AReaL implements recursive process termination.

Process Tree Termination

The kill_process_tree utility (used by vLLMServerWrapper._cleanup_all_servers) and terminate_process_and_children (used by LocalLauncher) ensure that the primary server process and all of its child processes are terminated.

Sources: areal/infra/launcher/vllm_server.py125-145 areal/infra/launcher/local.py72-85 areal/infra/utils/proc.py1-29
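The idea of terminating a whole process tree can be illustrated with a standard-library stand-in. Note that this sketch swaps in a different technique from the utilities named above: it relies on POSIX process groups (start_new_session plus os.killpg) rather than walking the tree of children, and it only works on POSIX systems. The function names and the grace period are assumptions.

```python
import os
import signal
import subprocess


def spawn_in_own_group(cmd):
    """Start `cmd` as the leader of a new session and process group.

    Children it spawns inherit the group unless they change it themselves.
    """
    return subprocess.Popen(cmd, start_new_session=True)


def terminate_group(proc, grace_seconds=5.0):
    """Send SIGTERM to the whole group, then SIGKILL if it does not exit."""
    pgid = proc.pid  # equals the group id because of start_new_session=True
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return  # no process left in the group
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(pgid, signal.SIGKILL)
        except ProcessLookupError:
            pass
        proc.wait()
```

Sending SIGTERM first gives the server a chance to release GPU memory and close sockets on its own; SIGKILL is the fallback for processes that ignore the request.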

Signal Handling

vLLMServerWrapper registers handlers for SIGTERM and SIGINT to catch termination requests from the LocalLauncher or cluster managers areal/infra/launcher/vllm_server.py109-111. This allows the wrapper to invoke _cleanup_all_servers() for a clean shutdown areal/infra/launcher/vllm_server.py125-145

Sources: areal/infra/launcher/vllm_server.py109-124
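The signal-handling pattern can be sketched as follows. The function install_shutdown_handlers, the re-entrancy guard, and the exit code of 0 are assumptions made for this example; only the choice of SIGTERM and SIGINT and the idea of running a cleanup routine before exiting come from the description above.

```python
import signal
import sys


def install_shutdown_handlers(cleanup):
    """Run `cleanup` once, then exit, when SIGTERM or SIGINT arrives.

    Must be called from the main thread, as required by signal.signal.
    """
    state = {"cleaning": False}

    def handler(signum, frame):
        if state["cleaning"]:
            return  # a second signal arrived while cleanup was in progress
        state["cleaning"] = True
        cleanup()
        sys.exit(0)  # assumed exit code; raises SystemExit in the main thread

    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, handler)
```

In a wrapper, the cleanup argument would be the routine that stops all server processes, so a termination request from a launcher or cluster manager does not leave servers running.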

URL: https://deepwiki.com/inclusionAI/AReaL/4.6-server-lifecycle-management