VOOZH about

URL: https://deepwiki.com/inclusionAI/AReaL/14.3-vision-language-models

⇱ Vision-Language Models | inclusionAI/AReaL | DeepWiki


Loading...
Last indexed: 7 May 2026 (2e12c1)
Menu

Vision-Language Models

This page demonstrates how to train vision-language models (VLMs) with reinforcement learning in AReaL. VLM training enables models to learn from visual and textual inputs simultaneously, using the same RL algorithms (like GRPO or PPO) as text-only training.

Overview

AReaL supports training VLMs through the VisionRLVRWorkflow class, which extends text-only workflows to handle multi-modal inputs. Supported model families include:

Model FamilyTraining BackendExample DatasetNotes
Qwen2.5-VLPyTorch FSDP / ArchonGeometry3K, ViRL39KDynamic resolution support examples/vlm_npu/README.md9-12
Qwen3-VLPyTorch FSDP / ArchonGeometry3KLatest vision-language architecture examples/vlm_npu/README.md13-14
Qwen2-VLPyTorch FSDPCLEVR CountStandard vision-language support areal/dataset/clevr_count_70k.py69-76

Key differences from text-only training:

Sources: areal/workflow/vision_rlvr.py26-43 areal/utils/hf_utils.py72-93 examples/vlm_npu/README.md9-15

Quick Start

Run Geometry3K training on a single node with 8 NPUs using GRPO:


For multi-node training on the ViRL39K dataset:


Sources: examples/vlm_npu/README.md18-34

Training Configuration

VLM training requires specific configuration for the inference backends and the actor model to handle multimodal data.

SGLang and vLLM Setup

For VLMs, the inference engines must be explicitly told to enable multimodal support and handle the specific memory requirements of vision encoders.


Sources: examples/vlm_npu/qwen3_vl_2b_geometry3k_grpo.yaml108-126 examples/vlm_npu/README.md81-84

Allocation Mode for VLMs

Because vision models are memory-intensive, the allocation_mode often splits the cluster between inference (rollout) and training.

Sources: examples/vlm_npu/README.md81-84

VLM Dataset Integration

The dataset loading pipeline must return processed images alongside text messages. AReaL provides built-in support for several VLM datasets through get_custom_dataset.

Data Flow: From Image to Token Space

The following diagram shows how the VLM datasets bridge raw images to the model's expected input format within the AReaL ecosystem.


Sources: examples/vlm_npu/geometry3k_grpo.py37-49 areal/dataset/__init__.py165-206

VisionRLVRWorkflow Implementation

The VisionRLVRWorkflow orchestrates the multi-modal rollout. It uses the model's processor to transform images and text into tensors before sending them to the training engine.

Data Flow: Rollout to Training

This diagram maps the transition from high-level Python objects to the low-level tensor dictionaries required by the training engines, specifically showing the involvement of VisionRLVRWorkflow.


Sources: areal/workflow/vision_rlvr.py26-167

Training Orchestration

The VisionRLVRWorkflow is instantiated with a processor and tokenizer to handle the multi-modal input pipeline areal/workflow/vision_rlvr.py27-43 It overrides arun_episode to process images into base64 via image2base64 for inference engines and maintain the multi_modal_input for the training backward pass areal/workflow/vision_rlvr.py103-167

Sources: areal/workflow/vision_rlvr.py103-167

Performance on Ascend NPU

AReaL provides significant performance advantages for VLMs on Ascend NPU hardware compared to synchronous frameworks like verl.

Benchmark Results (Qwen2.5-VL-3B)

On the ViRL39K dataset with 16k max tokens, AReaL's asynchronous strategy reduces training time while maintaining or exceeding accuracy examples/vlm_npu/README.md99-112

FrameworkNodesEpochsTraining TimeAvg OOD Score
verl316.8 hours26.5
AReaL324.3 hours26.3
AReaL336.6 hours27.0

Sources: examples/vlm_npu/README.md99-112

Hardware Configuration

The following configuration is tested for multi-node VLM training:

Sources: examples/vlm_npu/README.md43-54

VLM Reward Functions

VLM reward functions in VisionRLVRWorkflow are wrapped by AsyncRewardWrapper and receive both the decoded completions and the original task data (including ground truth answers) areal/workflow/vision_rlvr.py46-73


Sources: areal/workflow/vision_rlvr.py46-73 examples/vlm_npu/geometry3k_grpo.py23-30