Last indexed: 7 May 2026 (2e12c1)

Vision-Language Models

This page demonstrates how to train vision-language models (VLMs) with reinforcement learning in AReaL. VLM training enables models to learn from visual and textual inputs simultaneously, using the same RL algorithms (like GRPO or PPO) as text-only training.

Overview

AReaL supports training VLMs through the VisionRLVRWorkflow class, which extends text-only workflows to handle multi-modal inputs. Supported model families include:

Model Family	Training Backend	Example Dataset	Notes
Qwen2.5-VL	PyTorch FSDP / Archon	Geometry3K, ViRL39K	Dynamic resolution support examples/vlm_npu/README.md9-12
Qwen3-VL	PyTorch FSDP / Archon	Geometry3K	Latest vision-language architecture examples/vlm_npu/README.md13-14
Qwen2-VL	PyTorch FSDP	CLEVR Count	Standard vision-language support areal/dataset/clevr_count_70k.py69-76

Key differences from text-only training:

Uses AutoProcessor via load_hf_processor_and_tokenizer to handle image preprocessing and tokenization areal/utils/hf_utils.py72-93
Passes multi_modal_input tensors (containing pixel_values and optional image_grid_thw) to training engines areal/workflow/vision_rlvr.py148-163
Dataset must provide both images and text messages areal/workflow/vision_rlvr.py113-114
Asynchronous rollout supports high-throughput VLM generation using SGLang or vLLM backends areal/workflow/vision_rlvr.py123-133

Sources: areal/workflow/vision_rlvr.py26-43 areal/utils/hf_utils.py72-93 examples/vlm_npu/README.md9-15

Quick Start

Run Geometry3K training on a single node with 8 NPUs using GRPO:

For multi-node training on the ViRL39K dataset:

Sources: examples/vlm_npu/README.md18-34

Training Configuration

VLM training requires specific configuration for the inference backends and the actor model to handle multimodal data.

SGLang and vLLM Setup

For VLMs, the inference engines must be explicitly told to enable multimodal support and handle the specific memory requirements of vision encoders.

Sources: examples/vlm_npu/qwen3_vl_2b_geometry3k_grpo.yaml108-126 examples/vlm_npu/README.md81-84

Allocation Mode for VLMs

Because vision models are memory-intensive, the allocation_mode often splits the cluster between inference (rollout) and training.

vllm:d32+d16: 32 GPUs for vLLM inference and 16 for training in a multi-node setup examples/vlm_npu/README.md83

Sources: examples/vlm_npu/README.md81-84

VLM Dataset Integration

The dataset loading pipeline must return processed images alongside text messages. AReaL provides built-in support for several VLM datasets through get_custom_dataset.

Data Flow: From Image to Token Space

The following diagram shows how the VLM datasets bridge raw images to the model's expected input format within the AReaL ecosystem.

Sources: examples/vlm_npu/geometry3k_grpo.py37-49 areal/dataset/__init__.py165-206

VisionRLVRWorkflow Implementation

The VisionRLVRWorkflow orchestrates the multi-modal rollout. It uses the model's processor to transform images and text into tensors before sending them to the training engine.

Data Flow: Rollout to Training

This diagram maps the transition from high-level Python objects to the low-level tensor dictionaries required by the training engines, specifically showing the involvement of VisionRLVRWorkflow.

Sources: areal/workflow/vision_rlvr.py26-167

Training Orchestration

The VisionRLVRWorkflow is instantiated with a processor and tokenizer to handle the multi-modal input pipeline areal/workflow/vision_rlvr.py27-43 It overrides arun_episode to process images into base64 via image2base64 for inference engines and maintain the multi_modal_input for the training backward pass areal/workflow/vision_rlvr.py103-167

Sources: areal/workflow/vision_rlvr.py103-167

Performance on Ascend NPU

AReaL provides significant performance advantages for VLMs on Ascend NPU hardware compared to synchronous frameworks like verl.

Benchmark Results (Qwen2.5-VL-3B)

On the ViRL39K dataset with 16k max tokens, AReaL's asynchronous strategy reduces training time while maintaining or exceeding accuracy examples/vlm_npu/README.md99-112

Framework	Nodes	Epochs	Training Time	Avg OOD Score
verl	3	1	6.8 hours	26.5
AReaL	3	2	4.3 hours	26.3
AReaL	3	3	6.6 hours	27.0

Sources: examples/vlm_npu/README.md99-112

Hardware Configuration

The following configuration is tested for multi-node VLM training:

NPU: 16x per node examples/vlm_npu/README.md47
Memory: 1TB per node examples/vlm_npu/README.md49
Storage: Shared NAS for distributed experiments examples/vlm_npu/README.md53
Network: RoCE 3.2 Tbps examples/vlm_npu/README.md50

Sources: examples/vlm_npu/README.md43-54

VLM Reward Functions

VLM reward functions in VisionRLVRWorkflow are wrapped by AsyncRewardWrapper and receive both the decoded completions and the original task data (including ground truth answers) areal/workflow/vision_rlvr.py46-73

Sources: areal/workflow/vision_rlvr.py46-73 examples/vlm_npu/geometry3k_grpo.py23-30

Refresh this wiki

URL: https://deepwiki.com/inclusionAI/AReaL/14.3-vision-language-models