Last indexed: 7 May 2026 (2e12c1)

Search Agent and Deep Research

Search agents and deep research workflows represent a complex class of agentic RL tasks where the model must perform long-range planning, tool usage (searching and visiting web pages), and information synthesis over multiple turns. AReaL supports training these agents by integrating OpenAI-compatible APIs with specialized reward mechanisms and asynchronous rollout controllers.

System Architecture for Search Agents

Training search agents in AReaL involves a decoupled architecture where the Rollout Controller orchestrates interactions between the Inference Engine (e.g., SGLang) and external tools. The agent's trajectories are captured as conversation trees, allowing for fine-grained reward assignment at each step of the research process.

Data Flow for Deep Research

The following diagram illustrates the data flow from the initial user query through the multi-turn search process to the final reward calculation and training update.

Deep Research Interaction Flow

Sources:

areal/workflow/openai/math_agent.py50-86 (Multi-turn orchestration)
notebook/search_agent_zh.ipynb8-17 (Search agent tutorial overview)
areal/experimental/openai/proxy/workflow.py72-154 (Proxy-based workflow execution)

Implementation Details

OpenAI-Compatible Client and Proxy

The search agent framework often utilizes the OpenAIProxyWorkflow to bridge external agent runtimes or SDKs with AReaL's training pipeline areal/experimental/openai/proxy/workflow.py72-112

Key features include:

Execution Modes: Supports inline (same process), subproc (isolated process via ProcessPoolExecutor), and online (external user session) execution of agent logic areal/experimental/openai/proxy/workflow.py84-112
Environment Isolation: When using subproc mode, environment variables like OPENAI_BASE_URL and OPENAI_API_KEY are injected into the worker subprocess to point to AReaL's internal proxy areal/experimental/openai/proxy/workflow.py131-145
Tool Integration: Agents can be decorated with @function_tool or @tool (LangChain) to provide search or calculation capabilities areal/workflow/openai/math_agent.py89-128 areal/workflow/langchain/math_agent.py28-68

Multi-Turn Search Workflow

In search-agent examples, the agent follows a loop of reasoning, tool selection, and observation. The MultiTurnMathAgent or MathToolAgent manages this loop:

Generation: Calls client.chat.completions.create using an AsyncOpenAI client configured with the AReaL proxy base_url areal/workflow/openai/math_agent.py37-42
Tool Execution: Uses frameworks like OpenAIRunner or LangChain's create_agent to handle the iterative tool-calling loop areal/workflow/openai/math_agent.py162-164 areal/workflow/langchain/math_agent.py159-172
State Management: If the answer is incorrect, the agent can be prompted to reflect and retry, appending messages to the session history areal/workflow/openai/math_agent.py78-85

Reward Integration

Deep research often requires specialized verification logic. AReaL uses an AsyncRewardWrapper to handle potentially slow or remote reward calculations, such as LLM-as-a-judge or ground-truth verification areal/workflow/openai/math_agent.py44-47

Reward Assignment Logic

Step	Entity	Action
1	`MultiTurnMathAgent`	Executes `run` loop areal/workflow/openai/math_agent.py65
2	`AsyncRewardWrapper`	Evaluates the response via `math_reward_fn` areal/workflow/openai/math_agent.py73-74
3	`AsyncOpenAI` (Proxy)	The proxy intercepts rewards to calculate advantages for RL training areal/experimental/openai/proxy/workflow.py122-130

Sources:

areal/workflow/openai/math_agent.py50-86 (Multi-turn math agent)
areal/workflow/langchain/math_agent.py115-184 (LangChain tool agent)
areal/experimental/openai/proxy/workflow.py121-155 (Proxy agent execution)

Configuration and Training

Training a search agent typically uses the GRPO (Group Relative Policy Optimization) algorithm to compare multiple research trajectories for the same prompt, which is effective for long-horizon tasks notebook/search_agent_zh.ipynb15-17

Training Configuration (YAML)

A typical configuration for a search agent includes a large max_turns and specific generation hyperparameters to allow for extensive exploration.

Rollout Coordination

The RemoteSGLangEngine or RemotevLLMEngine acts as the controller, submitting tasks to the inference cluster. It uses the arun_episode contract to execute the agentic logic examples/math/gsm8k_eval.py50-67

Rollout Controller to Engine Mapping

Sources:

notebook/search_agent_zh.ipynb40-55 (Loading experiment config)
examples/math/gsm8k_eval.py50-96 (Engine and workflow initialization)
areal/experimental/openai/proxy/workflow.py131-145 (Subprocess execution logic)

Key Functions and Classes

`OpenAIProxyWorkflow` areal/experimental/openai/proxy/workflow.py72

Role: Implements the RolloutWorkflow interface to run arbitrary OpenAI-compatible agents.
Key Method: arun_episode handles the setup of a proxy session and manages the lifecycle of the agent execution areal/experimental/openai/proxy/workflow.py164-185

`MathToolAgent` areal/workflow/openai/math_agent.py129

Role: A reference implementation of an agent that uses external tools (calculators) via the agents library.
Key Logic: Uses OpenAIRunner.run to handle the multi-step tool-calling loop and returns a reward calculated by AsyncRewardWrapper areal/workflow/openai/math_agent.py162-168

`AsyncRewardWrapper` areal/workflow/openai/math_agent.py44

Role: Wraps synchronous reward functions (like math verification) to be compatible with async agent workflows.
Usage: Ensures that the reward calculation does not block the event loop during rollout areal/workflow/openai/math_agent.py73-75

Sources:

Refresh this wiki

URL: https://deepwiki.com/inclusionAI/AReaL/14.11-search-agent-and-deep-research