(a) Sequential test-time scaling.
(b) Parallel test-time scaling.
Our systematic test-time scaling analysis reveals two fundamental limits: a context ceiling in sequential scaling and a verification gap in parallel scaling.
LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests. While existing benchmarks focus on domain-aware environments for developing specialized agents, evaluating general-purpose agents requires more realistic settings that challenge them to operate across multiple skills and tools within a unified environment. We introduce General AgentBench, a benchmark that provides such a unified framework for evaluating general LLM agents across search, coding, reasoning, and tool-use domains.
Using General AgentBench, we systematically study test-time scaling behaviors under sequential scaling (iterative interaction) and parallel scaling (sampling multiple trajectories). Evaluation of ten leading LLM agents reveals a substantial performance degradation when moving from domain-specific evaluations to this general-agent setting. Moreover, we find that neither scaling methodology yields effective performance improvements in practice, due to two fundamental limitations: context ceiling in sequential scaling and verification gap in parallel scaling.
We introduce General AgentBench, a benchmark for evaluating whether agents can compose multiple skills and tools to solve open-ended requests from diverse domains under a unified framework, more closely reflecting real-world user interactions.
General AgentBench covers four task domains: search, coding, reasoning, and tool use. All tools are consolidated through the Model Context Protocol (MCP) into a unified interface, so agents see a single shared tool pool across all tasks rather than a per-domain toolset.
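To make the idea of a single shared tool pool concrete, here is a minimal sketch of merging per-domain tool lists into one flat registry. It does not use the MCP SDK and is not the benchmark's implementation: `ToolSpec`, `build_shared_pool`, the namespaced keys, and every tool name below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical tool description; not an MCP type."""
    name: str
    domain: str
    description: str


def build_shared_pool(domain_tools: dict[str, list[ToolSpec]]) -> dict[str, ToolSpec]:
    """Merge per-domain tool lists into one flat pool.

    Keys are namespaced as "<domain>.<tool name>" (an assumption made here)
    so that tools from different domains cannot silently overwrite each other.
    """
    pool: dict[str, ToolSpec] = {}
    for domain, tools in domain_tools.items():
        for tool in tools:
            key = f"{domain}.{tool.name}"
            if key in pool:
                raise ValueError(f"duplicate tool key: {key}")
            pool[key] = tool
    return pool


# Illustrative tool names only; the benchmark's actual tools are not listed in the text.
DOMAIN_TOOLS = {
    "search": [ToolSpec("web_search", "search", "Query a search engine")],
    "coding": [ToolSpec("run_shell", "coding", "Run a shell command")],
    "reasoning": [ToolSpec("calculator", "reasoning", "Evaluate an expression")],
    "tool_use": [ToolSpec("book_flight", "tool_use", "Call a booking API")],
}

pool = build_shared_pool(DOMAIN_TOOLS)
# The agent is handed every tool at once, regardless of the task's domain.
print(sorted(pool))
```

In this sketch the agent receives all four tools for every task, which mirrors the setting described above: the model has to decide which tools are relevant instead of being given a domain-specific subset.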
Most models experience substantial degradation (10–30% on average) when moving from domain-specific to the general-agent setting.
Relative performance change across domains from the baseline (B) specialized-agent setting to the general-agent (G) setting with unified context and tools. Negative values indicate performance degradation under General AgentBench.
Performance comparison between specialized-agent and general-agent settings across models.
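The relative performance change reported above can be computed as in the sketch below. The text does not give the exact formula, so the definition used here, (G - B) / B expressed in percent, is an assumption, and the scores in the usage example are made-up numbers, not results from the benchmark.

```python
def relative_change(baseline: float, general: float) -> float:
    """Percent change from the baseline (B) score to the general-agent (G) score.

    Assumed definition: 100 * (G - B) / B. Negative values mean degradation.
    """
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * (general - baseline) / baseline


# Made-up scores for illustration only.
example = relative_change(baseline=50.0, general=40.0)
print(f"{example:.1f}%")
```

With these illustrative inputs the score drops from 50 to 40, a relative change of -20%, which would fall inside the 10–30% degradation range quoted above.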
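We systematically study two primary test-time scaling paradigms for general LLM agents: sequential scaling, which extends a single trajectory through iterative interaction, and parallel scaling, which samples multiple independent trajectories.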
We further introduce a self-choice setting to measure the gap between the parallel upper bound (pass@K) and real-world effectiveness: agents must also be capable of evaluating and selecting the best outcome from their own generated trajectories.
Test-time scaling behaviors of general LLM agents. Top: Parallel scaling expands the solution space through increased sampling. Bottom: Sequential scaling allocates additional computation via longer interaction histories, yet exhibits unstable or diminishing returns.
Sequential scaling extends the interaction horizon by injecting additional rounds of feedback. While performance initially improves as agents approach their inherent context length, it plateaus or degrades once context exceeds a critical threshold.
This context ceiling varies by model and domain: for example, it is approximately 112K tokens for Qwen3-235B and 96K for Gemini 2.5-Flash in the search domain. Beyond it, accumulated history overwhelms the agent's reasoning capacity, leading to instability in long-horizon tasks.
Sequential scaling behavior of Gemini 2.5-Flash and Qwen3-235B across domains.
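One simple way to read a context ceiling off a sequential scaling curve is to find the context length at which the score last improved. The sketch below assumes the curve is available as (context tokens, score) pairs; the function name `estimate_context_ceiling`, the `tolerance` parameter, and the data points are all hypothetical, and this is not necessarily how the ceilings quoted above were determined.

```python
def estimate_context_ceiling(curve: list[tuple[int, float]], tolerance: float = 0.0) -> int:
    """Return the context length (in tokens) at which the score last improved.

    `curve` holds (context_tokens, score) pairs. An increase counts as an
    improvement only if it beats the best score so far by more than `tolerance`.
    """
    if not curve:
        raise ValueError("curve must contain at least one point")
    points = sorted(curve)  # order by context length
    best_tokens, best_score = points[0]
    for tokens, score in points[1:]:
        if score > best_score + tolerance:
            best_tokens, best_score = tokens, score
    return best_tokens


# Synthetic curve for illustration: improves, peaks, then degrades.
synthetic_curve = [
    (16_000, 0.30),
    (32_000, 0.38),
    (64_000, 0.42),
    (128_000, 0.41),
    (256_000, 0.35),
]
print(estimate_context_ceiling(synthetic_curve))
```

On this synthetic curve the function returns 64000, the last point before additional context stops helping. A non-zero `tolerance` keeps small, noisy gains from pushing the estimate further out.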
Parallel scaling samples multiple independent trajectories, expanding the solution space. While pass@K increases monotonically, the self-choice accuracy, where agents must identify the correct solution from their own generations, consistently lags behind.
This verification gap limits practical utility: agents can generate correct answers but fail to reliably select them. Even using GPT-5 as an external verifier does not close the gap.
Verification gap between generation and self-choice. The dashed and dotted curves represent two self-choice strategies, while the diamond denotes a stronger evaluator, GPT-5.
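The verification gap can be illustrated as the difference between pass@K and self-choice accuracy over the same set of sampled trajectories. The sketch below assumes each task has K trajectories with a boolean correctness label and that the agent selects one trajectory index per task; the function names and the data are hypothetical, and the selection step itself (how the agent chooses) is outside the scope of this example.

```python
def pass_at_k(outcomes: list[list[bool]]) -> float:
    """Fraction of tasks where at least one of the K sampled trajectories is correct."""
    return sum(any(task) for task in outcomes) / len(outcomes)


def self_choice_accuracy(outcomes: list[list[bool]], chosen: list[int]) -> float:
    """Fraction of tasks where the trajectory the agent selected is correct."""
    return sum(task[i] for task, i in zip(outcomes, chosen)) / len(outcomes)


# Synthetic labels for 4 tasks with K = 3 trajectories each (illustration only).
outcomes = [
    [False, True, False],
    [False, False, False],
    [True, True, False],
    [False, False, True],
]
# Index of the trajectory the agent picked for each task (also synthetic).
chosen = [1, 0, 2, 0]

upper_bound = pass_at_k(outcomes)
selected = self_choice_accuracy(outcomes, chosen)
print(upper_bound, selected, upper_bound - selected)
```

Here a correct trajectory exists for three of the four tasks (pass@3 = 0.75), but the agent selects a correct one only once (self-choice accuracy = 0.25). The difference of 0.5 is the verification gap in this toy example.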
Coming soon.
For any questions or feedback, please reach out to xiaochu4 [at] andrew.cmu.edu.