Last indexed: 7 May 2026 (2e12c1)

Evaluation Guide

The Evaluation Guide details how to leverage AReaL's distributed inference infrastructure and agentic workflow system to perform robust evaluation of Large Language Models (LLMs). AReaL supports standard zero-shot/few-shot evaluation, multi-turn agentic evaluation, and integration with third-party frameworks by treating the evaluation process as a specialized rollout workflow.

Distributed Evaluation Infrastructure

Evaluation in AReaL is powered by the same high-performance inference backends used during training rollouts. This ensures that evaluation metrics are consistent with training behavior and can scale across multiple GPUs and nodes.

Key Components

Inference Engines: Backends like SGLangBackend areal/engine/sglang_remote.py40-41 and VLLMBackend areal/engine/vllm_remote.py41-42 provide the raw generation capabilities.
Rollout Workflows: Evaluation logic is encapsulated in classes implementing the RolloutWorkflow interface areal/api/workflow_api.py14
Remote Engines: RemoteSGLangEngine areal/engine/sglang_remote.py211 and RemotevLLMEngine areal/engine/vllm_remote.py228 allow evaluation scripts to communicate with distributed inference servers via HTTP.
Scheduler: Manages the lifecycle of remote inference workers using LocalScheduler, RayScheduler, or SlurmScheduler examples/math/gsm8k_eval.py27-32

Data Flow for Evaluation

The following diagram illustrates the data flow from a dataset through the evaluation workflow to the final metrics.

Evaluation Data Flow

Sources: areal/api/workflow_api.py14-39 areal/api/io_struct.py28-63 areal/engine/sglang_remote.py40-127 examples/math/gsm8k_eval.py76-93

Evaluation Workflows

Workflows define the interaction pattern between the evaluator and the model.

1. Mathematical Reasoning Evaluation

For benchmarks like GSM8K, the RLVRWorkflow is commonly used to generate reasoning paths and verify them.

Workflow Integration: The evaluation script specifies the workflow path, such as areal.workflow.rlvr.RLVRWorkflow examples/math/gsm8k_eval.py76
Reward/Metric Function: A specific reward function (e.g., gsm8k_reward_fn) is passed to the workflow to score the completions examples/math/gsm8k_eval.py78

2. Agentic and Multi-Turn Evaluation

AReaL supports evaluating models in agentic scenarios where the model may need multiple turns or reflection.

Multi-Turn Reasoning: Notebooks demonstrate modifying single-turn workflows into multi-turn reflection agents notebook/math_reflection_en.ipynb14-26 This allows the model to "retry" or reflect on its answer if the first attempt is incorrect.
Search Agent Evaluation: Long-range search agents are evaluated by their ability to use search tools across multiple turns to answer complex queries notebook/search_agent_zh.ipynb8-17

Sources: areal/api/workflow_api.py63-104 notebook/math_reflection_en.ipynb14-26 notebook/search_agent_zh.ipynb8-17

Distributed Execution and Scheduling

Evaluation can be scaled across nodes using AReaL's scheduling system.

Local Execution: Uses LocalScheduler for single-node evaluation examples/math/gsm8k_eval.py28
Cluster Execution: RayScheduler and SlurmScheduler enable evaluation on large-scale clusters by managing remote RemoteSGLangEngine or RemotevLLMEngine instances examples/math/gsm8k_eval.py30-32
Model Allocation: The ModelAllocation class parses backend strings (e.g., sglang:d4) to determine the parallel strategy (TP/PP) for the evaluation servers examples/math/gsm8k_eval.py23

Sources: examples/math/gsm8k_eval.py23-34 areal/api/engine_api.py261

Practical Example: GSM8K Evaluation

The script examples/math/gsm8k_eval.py demonstrates a complete evaluation pipeline using the RemoteSGLangEngine or RemotevLLMEngine.

Implementation Details

Engine Initialization: The script uses engine_cls.as_controller to create a controller that manages remote workers examples/math/gsm8k_eval.py67
Task Submission: Evaluation tasks are submitted asynchronously using eval_rollout.submit examples/math/gsm8k_eval.py88-93
Synchronization: The script calls eval_rollout.wait(cnt) to ensure all samples are processed before exporting statistics examples/math/gsm8k_eval.py96

Configuration Snippet

The configuration defines the backend and generation parameters used during evaluation.

Sources: examples/math/gsm8k_grpo_lora.yaml21-43 examples/math/gsm8k_eval.py47-97

Performance Monitoring

During evaluation, AReaL tracks key performance indicators (KPIs) via the ModelResponse structure and StatsTracker:

Latency: Total time per evaluation episode areal/api/io_struct.py78
TTFT: Time to first token areal/api/io_struct.py79
ITL: Inter-token latency areal/api/io_struct.py80
Routed Experts: For MoE models, the routed_experts field tracks expert utilization during evaluation areal/api/io_struct.py83

Agent-Engine Evaluation Architecture

Sources: examples/math/gsm8k_eval.py67-97 areal/engine/sglang_remote.py211-260 areal/api/io_struct.py63-83

Refresh this wiki

URL: https://deepwiki.com/inclusionAI/AReaL/14.15-evaluation-guide