VOOZH about

URL: https://deepwiki.com/inclusionAI/AReaL/14.11-search-agent-and-deep-research

⇱ Search Agent and Deep Research | inclusionAI/AReaL | DeepWiki


Loading...
Last indexed: 7 May 2026 (2e12c1)
Menu

Search Agent and Deep Research

Search agents and deep research workflows represent a complex class of agentic RL tasks where the model must perform long-range planning, tool usage (searching and visiting web pages), and information synthesis over multiple turns. AReaL supports training these agents by integrating OpenAI-compatible APIs with specialized reward mechanisms and asynchronous rollout controllers.

System Architecture for Search Agents

Training search agents in AReaL involves a decoupled architecture where the Rollout Controller orchestrates interactions between the Inference Engine (e.g., SGLang) and external tools. The agent's trajectories are captured as conversation trees, allowing for fine-grained reward assignment at each step of the research process.

Data Flow for Deep Research

The following diagram illustrates the data flow from the initial user query through the multi-turn search process to the final reward calculation and training update.

Deep Research Interaction Flow


Sources:


Implementation Details

OpenAI-Compatible Client and Proxy

The search agent framework often utilizes the OpenAIProxyWorkflow to bridge external agent runtimes or SDKs with AReaL's training pipeline areal/experimental/openai/proxy/workflow.py72-112

Key features include:

  1. Execution Modes: Supports inline (same process), subproc (isolated process via ProcessPoolExecutor), and online (external user session) execution of agent logic areal/experimental/openai/proxy/workflow.py84-112
  2. Environment Isolation: When using subproc mode, environment variables like OPENAI_BASE_URL and OPENAI_API_KEY are injected into the worker subprocess to point to AReaL's internal proxy areal/experimental/openai/proxy/workflow.py131-145
  3. Tool Integration: Agents can be decorated with @function_tool or @tool (LangChain) to provide search or calculation capabilities areal/workflow/openai/math_agent.py89-128 areal/workflow/langchain/math_agent.py28-68

Multi-Turn Search Workflow

In search-agent examples, the agent follows a loop of reasoning, tool selection, and observation. The MultiTurnMathAgent or MathToolAgent manages this loop:

  1. Generation: Calls client.chat.completions.create using an AsyncOpenAI client configured with the AReaL proxy base_url areal/workflow/openai/math_agent.py37-42
  2. Tool Execution: Uses frameworks like OpenAIRunner or LangChain's create_agent to handle the iterative tool-calling loop areal/workflow/openai/math_agent.py162-164 areal/workflow/langchain/math_agent.py159-172
  3. State Management: If the answer is incorrect, the agent can be prompted to reflect and retry, appending messages to the session history areal/workflow/openai/math_agent.py78-85

Reward Integration

Deep research often requires specialized verification logic. AReaL uses an AsyncRewardWrapper to handle potentially slow or remote reward calculations, such as LLM-as-a-judge or ground-truth verification areal/workflow/openai/math_agent.py44-47

Reward Assignment Logic

StepEntityAction
1MultiTurnMathAgentExecutes run loop areal/workflow/openai/math_agent.py65
2AsyncRewardWrapperEvaluates the response via math_reward_fn areal/workflow/openai/math_agent.py73-74
3AsyncOpenAI (Proxy)The proxy intercepts rewards to calculate advantages for RL training areal/experimental/openai/proxy/workflow.py122-130

Sources:


Configuration and Training

Training a search agent typically uses the GRPO (Group Relative Policy Optimization) algorithm to compare multiple research trajectories for the same prompt, which is effective for long-horizon tasks notebook/search_agent_zh.ipynb15-17

Training Configuration (YAML)

A typical configuration for a search agent includes a large max_turns and specific generation hyperparameters to allow for extensive exploration.


Rollout Coordination

The RemoteSGLangEngine or RemotevLLMEngine acts as the controller, submitting tasks to the inference cluster. It uses the arun_episode contract to execute the agentic logic examples/math/gsm8k_eval.py50-67

Rollout Controller to Engine Mapping


Sources:


Key Functions and Classes

OpenAIProxyWorkflow areal/experimental/openai/proxy/workflow.py72

  • Role: Implements the RolloutWorkflow interface to run arbitrary OpenAI-compatible agents.
  • Key Method: arun_episode handles the setup of a proxy session and manages the lifecycle of the agent execution areal/experimental/openai/proxy/workflow.py164-185

MathToolAgent areal/workflow/openai/math_agent.py129

  • Role: A reference implementation of an agent that uses external tools (calculators) via the agents library.
  • Key Logic: Uses OpenAIRunner.run to handle the multi-step tool-calling loop and returns a reward calculated by AsyncRewardWrapper areal/workflow/openai/math_agent.py162-168

AsyncRewardWrapper areal/workflow/openai/math_agent.py44

  • Role: Wraps synchronous reward functions (like math verification) to be compatible with async agent workflows.
  • Usage: Ensures that the reward calculation does not block the event loop during rollout areal/workflow/openai/math_agent.py73-75

Sources: