
URL: https://deepwiki.com/inclusionAI/AReaL/14.15-evaluation-guide

Evaluation Guide | inclusionAI/AReaL | DeepWiki


Last indexed: 7 May 2026 (2e12c1)

Evaluation Guide

This guide describes how to use AReaL's distributed inference infrastructure and agentic workflow system to evaluate Large Language Models (LLMs). AReaL treats evaluation as a specialized rollout workflow, which lets it support standard zero-shot and few-shot evaluation, multi-turn agentic evaluation, and integration with third-party frameworks.

Distributed Evaluation Infrastructure

Evaluation in AReaL runs on the same high-performance inference backends used for training rollouts. Because generation goes through the same engines, evaluation metrics reflect the behavior seen during training, and evaluation can scale across multiple GPUs and nodes.

Key Components

Data Flow for Evaluation

The following diagram illustrates the data flow from a dataset through the evaluation workflow to the final metrics.

[Diagram: Evaluation Data Flow]


Sources: areal/api/workflow_api.py:14-39, areal/api/io_struct.py:28-63, areal/engine/sglang_remote.py:40-127, examples/math/gsm8k_eval.py:76-93

Evaluation Workflows

Workflows define the interaction pattern between the evaluator and the model.

1. Mathematical Reasoning Evaluation

For benchmarks like GSM8K, the RLVRWorkflow is commonly used to generate reasoning paths and verify them.

  • Workflow Integration: The evaluation script specifies the workflow path, such as areal.workflow.rlvr.RLVRWorkflow (examples/math/gsm8k_eval.py:76).
  • Reward/Metric Function: A specific reward function (e.g., gsm8k_reward_fn) is passed to the workflow to score the completions (examples/math/gsm8k_eval.py:78).
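
To make the role of a reward function concrete, the following is a minimal sketch of a GSM8K-style scorer. It is not the gsm8k_reward_fn from the AReaL repository, whose signature and parsing rules are not shown here. The names extract_final_number and toy_gsm8k_reward are hypothetical, and the sketch assumes that the final answer is the last number appearing in the text.

```python
import re
from typing import Optional


def extract_final_number(text: str) -> Optional[str]:
    """Return the last number found in text, with thousands separators removed."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def toy_gsm8k_reward(completion: str, reference_answer: str) -> float:
    """Hypothetical scorer: 1.0 if the final numbers match, otherwise 0.0."""
    predicted = extract_final_number(completion)
    expected = extract_final_number(reference_answer)
    if predicted is None or expected is None:
        return 0.0
    return 1.0 if float(predicted) == float(expected) else 0.0


# GSM8K reference answers end with a "#### <number>" line.
correct = toy_gsm8k_reward("So the total is 1,234.", "#### 1234")
wrong = toy_gsm8k_reward("I think the answer is 7", "#### 8")
```

A binary reward of this kind doubles as an accuracy metric: averaging it over the evaluation set gives the fraction of correctly answered problems.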

2. Agentic and Multi-Turn Evaluation

AReaL supports evaluating models in agentic scenarios where the model may need multiple turns or reflection.

  • Multi-Turn Reasoning: Notebooks demonstrate modifying single-turn workflows into multi-turn reflection agents (notebook/math_reflection_en.ipynb:14-26). This allows the model to "retry" or reflect on its answer if the first attempt is incorrect.
  • Search Agent Evaluation: Long-range search agents are evaluated by their ability to use search tools across multiple turns to answer complex queries (notebook/search_agent_zh.ipynb:8-17).
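
The retry-on-failure pattern behind multi-turn reflection can be sketched as a plain loop. This is an illustration only, not the code from the notebooks: reflect_until_correct, the generate and is_correct callables, and the feedback message are all hypothetical stand-ins for a model call and a verifier.

```python
from typing import Callable, List, Tuple


def reflect_until_correct(
    generate: Callable[[List[str]], str],
    is_correct: Callable[[str], bool],
    question: str,
    max_turns: int = 3,
) -> Tuple[str, int, bool]:
    """Ask the model up to max_turns times, feeding back a retry prompt after each miss.

    Returns (last_answer, turns_used, solved).
    """
    history = [question]
    answer = ""
    for turn in range(1, max_turns + 1):
        answer = generate(history)
        if is_correct(answer):
            return answer, turn, True
        # Record the failed attempt and ask the model to reflect.
        history.append(answer)
        history.append("Your answer is incorrect. Reflect and try again.")
    return answer, max_turns, False


# A scripted stand-in for a model: wrong on the first turn, right on the second.
scripted_answers = iter(["41", "42"])
result = reflect_until_correct(
    generate=lambda history: next(scripted_answers),
    is_correct=lambda answer: answer == "42",
    question="What is 6 * 7?",
)
```

Reporting the number of turns used alongside the solved flag makes it possible to separate first-attempt accuracy from accuracy after reflection.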

Sources: areal/api/workflow_api.py:63-104, notebook/math_reflection_en.ipynb:14-26, notebook/search_agent_zh.ipynb:8-17

Distributed Execution and Scheduling

Evaluation can be scaled across nodes using AReaL's scheduling system.

  • Local Execution: Uses LocalScheduler for single-node evaluation (examples/math/gsm8k_eval.py:28).
  • Cluster Execution: RayScheduler and SlurmScheduler enable evaluation on large-scale clusters by managing RemoteSGLangEngine or RemotevLLMEngine instances (examples/math/gsm8k_eval.py:30-32).
  • Model Allocation: The ModelAllocation class parses backend strings (e.g., sglang:d4) to determine the parallel strategy (TP/PP) for the evaluation servers (examples/math/gsm8k_eval.py:23).
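
As an illustration of what parsing a backend string involves, here is a toy parser. It is not the ModelAllocation class, and the grammar is an assumption: the sketch reads a string of the form backend:d<N>t<N>p<N>, where d, t, and p are taken to mean data, tensor, and pipeline parallel sizes and any omitted dimension defaults to 1. ToyAllocation and parse_toy_allocation are hypothetical names, and the real grammar may differ.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyAllocation:
    """Hypothetical container for a parsed backend string."""
    backend: str
    data_parallel: int = 1
    tensor_parallel: int = 1
    pipeline_parallel: int = 1

    @property
    def world_size(self) -> int:
        return self.data_parallel * self.tensor_parallel * self.pipeline_parallel


# Assumed meaning of the single-letter dimension keys.
_DIM_NAMES = {"d": "data_parallel", "t": "tensor_parallel", "p": "pipeline_parallel"}


def parse_toy_allocation(spec: str) -> ToyAllocation:
    """Parse strings such as 'sglang:d4' or 'vllm:d2t4' under the assumed grammar."""
    backend, sep, dims = spec.partition(":")
    if not sep or not backend or not dims:
        raise ValueError(f"expected '<backend>:<dims>', got {spec!r}")
    pairs = re.findall(r"([dtp])(\d+)", dims)
    # Reject any characters that the key/size pairs do not account for.
    if "".join(key + value for key, value in pairs) != dims:
        raise ValueError(f"unrecognized dimension string {dims!r}")
    sizes = {}
    for key, value in pairs:
        name = _DIM_NAMES[key]
        if name in sizes:
            raise ValueError(f"dimension {key!r} given more than once")
        sizes[name] = int(value)
    return ToyAllocation(backend=backend, **sizes)


single = parse_toy_allocation("sglang:d4")
mixed = parse_toy_allocation("vllm:d2t4")
```

Under this assumed grammar, sglang:d4 describes four data-parallel replicas with no tensor or pipeline splitting, so the product of the three sizes gives the number of GPUs the servers would need.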

Sources: examples/math/gsm8k_eval.py:23-34, areal/api/engine_api.py:261

Practical Example: GSM8K Evaluation

The script examples/math/gsm8k_eval.py demonstrates a complete evaluation pipeline using the RemoteSGLangEngine or RemotevLLMEngine.

Implementation Details

  1. Engine Initialization: The script uses engine_cls.as_controller to create a controller that manages remote workers (examples/math/gsm8k_eval.py:67).
  2. Task Submission: Evaluation tasks are submitted asynchronously using eval_rollout.submit (examples/math/gsm8k_eval.py:88-93).
  3. Synchronization: The script calls eval_rollout.wait(cnt) to ensure all samples are processed before exporting statistics (examples/math/gsm8k_eval.py:96).
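
The submit-then-wait pattern in steps 2 and 3 can be sketched with the standard library. This is a stand-in rather than AReaL's controller: ToyRollout, toy_workflow, and the sample fields are hypothetical, and the real submit and wait signatures are not shown on this page. The sketch only illustrates counting submissions and blocking until that many results are available.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List


class ToyRollout:
    """Hypothetical stand-in for an evaluation rollout controller."""

    def __init__(self, workflow: Callable[[Dict], float], max_workers: int = 4):
        self._workflow = workflow
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._pending: List[Future] = []

    def submit(self, sample: Dict) -> None:
        """Queue one sample without blocking the caller."""
        self._pending.append(self._pool.submit(self._workflow, sample))

    def wait(self, count: int) -> List[float]:
        """Block until the oldest `count` submissions finish; return their scores."""
        if count > len(self._pending):
            raise ValueError("waiting for more results than were submitted")
        done, self._pending = self._pending[:count], self._pending[count:]
        return [future.result() for future in done]

    def close(self) -> None:
        self._pool.shutdown()


def toy_workflow(sample: Dict) -> float:
    """Pretend to generate and score a completion; here the prediction is precomputed."""
    return 1.0 if sample["prediction"] == sample["answer"] else 0.0


dataset = [
    {"prediction": "4", "answer": "4"},
    {"prediction": "5", "answer": "6"},
    {"prediction": "7", "answer": "7"},
    {"prediction": "9", "answer": "9"},
]

rollout = ToyRollout(toy_workflow)
cnt = 0
for sample in dataset:
    rollout.submit(sample)  # returns immediately
    cnt += 1
rewards = rollout.wait(cnt)  # synchronize before computing statistics
rollout.close()
accuracy = sum(rewards) / len(rewards)
```

Counting submissions and passing that count to wait is what guarantees that statistics are computed over the full dataset rather than over whichever samples happened to finish first.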

Configuration Snippet

The configuration defines the backend and generation parameters used during evaluation.


Sources: examples/math/gsm8k_grpo_lora.yaml:21-43, examples/math/gsm8k_eval.py:47-97

Performance Monitoring

During evaluation, AReaL tracks key performance indicators (KPIs) via the ModelResponse structure and StatsTracker.

[Diagram: Agent-Engine Evaluation Architecture]


Sources: examples/math/gsm8k_eval.py:67-97, areal/engine/sglang_remote.py:211-260, areal/api/io_struct.py:63-83