
URL: https://minstar.github.io/OpenBioRQ/

OpenBioRQ: Unsolved Biomedical Research Questions for Agents


[Figure: OpenBioRQ results overview]
OpenBioRQ at a glance. The benchmark is hard, non-saturating, and discriminating: held-out same-lineage models solve only ~17% of the hardest subset, while three independent frontier agents span a wide 29–60%, and even the best leaves ~33–40% unsolved.
Abstract

A working citation looks like proof, but a link that resolves does not mean the cited paper supports the claim. Current agentic models rarely fabricate citations (over 99% resolve), yet roughly 15.9% link to the wrong paper. Existing benchmarks miss this failure mode: when a question has a fixed answer key, a model can reproduce the expected source from that key rather than independently verifying that the source supports the claim.

I introduce OpenBioRQ, a retrieval-grounded agentic benchmark of 12,553 unsolved biomedical research questions across 12 domains that treats open questions as a faithfulness-and-abstention probe. To my knowledge, this is the first biomedical benchmark to combine an agentic setting, where the model must issue multiple tool calls, with unsolved questions that have no answer key. Openness is verified against real follow-up evidence rather than a model's parametric knowledge, and difficulty is empirical: it is anchored on questions that three open-weight reference models fail to answer. Beyond difficulty, I observe agentic collapse on the hardest questions, where agents stop using their tools; for the most collapse-prone model, blocking tools entirely barely changes its score. A frozen per-question checklist raises inter-judge agreement from Spearman 0.35 to 0.82. OpenBioRQ targets research assistance (evidence retrieval and faithful citation), not clinical decision support.

Why OpenBioRQ?

Four things this benchmark does that answer-key QA cannot.

no answer key
Open questions as a probe

First biomedical benchmark to pair a multi-tool agentic setting with genuinely unsolved questions, so a model cannot back-derive the source from a fixed key.

>99% → 15.9%
Existence ≠ correctness

Agent citations almost always resolve, but roughly 1 in 6 (15.9%) points to a paper that does not support the claim: a faithfulness failure invisible to existence checks.

29–60%
Hard, non-saturating, discriminating

A clean capability gradient across independent frontier lineages (Gemini < Opus < GPT-5.5); the best agent still leaves ~33–40% unsolved.

tools stop paying off
Agentic collapse

On the hardest questions agents stop calling tools; for the most collapse-prone model, removing tools entirely barely changes the score.

The Benchmark

12,553 questions, two openness-grounding tracks, and an empirically-defined hard core.

[Figure: Construction pipeline]
Construction pipeline. Questions are extracted from authoritative sources, refined to be self-contained, deduplicated, then openness-verified against real follow-up evidence and screened for contamination, all before rubric generation and agentic evaluation.

Provenance, not just "open"

"Open" is a provenance claim: every question is sourced from a genuinely unresolved research front, grounded in two ways:

  • Retrieval-verified: PubMed / trial / arXiv questions whose open_status is judged from real follow-up evidence (citing papers, trial results), not a model's memory of the source's framing.
  • Expert-consensus: JLA Priority Setting Partnerships and NICE research recommendations, i.e. questions declared open by an expert/consensus process.

Empirical difficulty & the frozen core

Difficulty is not self-rated. Each question is answered, with tools, by three open-weight reference models; the pass/fail pattern defines difficulty. The full core (657) is the all-fail set; the frozen core (423) is the subset all three reference models fail at temperature 0, and it serves as the primary discriminating hard split.
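
The selection rule above can be sketched as a simple filter over pass/fail patterns. This is a minimal illustration, not the released pipeline: it assumes two hypothetical runs per question (a general run that defines the all-fail set and a temperature-0 run), and the model names and results are placeholders rather than benchmark data.

```python
from typing import Dict, List

# Hypothetical input shape: question id -> {model name -> solved?}.
PassPattern = Dict[str, bool]


def all_fail_set(results: Dict[str, PassPattern]) -> List[str]:
    """Return ids of questions that no reference model solved."""
    return sorted(
        qid for qid, per_model in results.items()
        if not any(per_model.values())
    )


def frozen_core(full_run: Dict[str, PassPattern],
                temp0_run: Dict[str, PassPattern]) -> List[str]:
    """All-fail questions that every model also fails at temperature 0."""
    full = set(all_fail_set(full_run))
    return sorted(q for q in all_fail_set(temp0_run) if q in full)


# Toy placeholder data (not benchmark results).
full_run = {
    "q1": {"model_a": False, "model_b": False, "model_c": False},
    "q2": {"model_a": True, "model_b": False, "model_c": False},
    "q3": {"model_a": False, "model_b": False, "model_c": False},
}
temp0_run = {
    "q1": {"model_a": False, "model_b": False, "model_c": False},
    "q2": {"model_a": False, "model_b": False, "model_c": False},
    "q3": {"model_a": True, "model_b": False, "model_c": False},
}

full_core_ids = all_fail_set(full_run)
frozen_core_ids = frozen_core(full_run, temp0_run)
```

In this toy case q2 is excluded because one model solved it in the general run, and q3 drops out of the frozen core because one model solves it at temperature 0.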

[Figure: Taxonomy across 12 domains]
12 biomedical domains. The frozen core spans every domain (largest shares: Clinical Medicine, Neuroscience & Psychiatry, Oncology), so it is not a single-specialty benchmark.

Evaluation Protocol

Agentic multi-round tool use, graded by a frozen per-question checklist.

[Figure: Evaluation flow]
Agentic evaluation. A model answers each question with multi-round access to 10 real biomedical APIs, and the answer is graded criterion-by-criterion against a frozen rubric.

10 medical tools, no answer key

Models call real REST APIs (pubmed, clinicaltrialsgov, openfda, opentargets, chembl, uniprot, pubchem, kegg, ncbi_datasets, biomcp) and must synthesize evidence themselves.
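
To make the multi-round setting concrete, here is a minimal sketch of a tool-calling loop. Everything in it is an assumption for illustration: the `run_agent` function, the policy interface, the round budget, and the stub tool are hypothetical and are not the benchmark's harness. A real run would wrap the REST APIs listed above behind the tool registry.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool interface: a query string in, a text result out.
ToolFn = Callable[[str], str]

# Hypothetical policy interface: given the question and the evidence so far,
# return ("call", tool_name, query) or ("answer", answer_text, "").
Policy = Callable[[str, List[str]], Tuple[str, str, str]]


def run_agent(question: str, policy: Policy, tools: Dict[str, ToolFn],
              max_rounds: int = 8) -> Tuple[str, List[str]]:
    """Alternate tool calls with a final answer, within a round budget."""
    evidence: List[str] = []
    for _ in range(max_rounds):
        action, first, second = policy(question, evidence)
        if action == "answer":
            return first, evidence
        if first not in tools:
            evidence.append(f"[error] unknown tool: {first}")
            continue
        evidence.append(f"[{first}] {tools[first](second)}")
    # Budget exhausted without an answer action.
    return "no answer within round budget", evidence


def scripted_policy(question: str, evidence: List[str]) -> Tuple[str, str, str]:
    """Stand-in for a model: search once, then answer."""
    if not evidence:
        return ("call", "pubmed", question)
    return ("answer", f"answer based on {len(evidence)} tool result(s)", "")


# Stub tool that only echoes its input; no network access.
stub_tools = {"pubmed": lambda q: f"stub results for: {q}"}
answer, trace = run_agent("example question", scripted_policy, stub_tools)
```

The length of `trace` is one simple way to observe whether an agent keeps issuing tool calls or stops early.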

Checklist scoring

A free-form judge had high variance on open answers. A frozen per-question checklist (must_mention / must_acknowledge / must_ground / must_avoid) graded met / partial / not met raises inter-judge agreement from Spearman 0.35 to 0.82; a question is "solved" at score ≥ 0.5.
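
The checklist can be turned into a score in several ways, and the page does not specify the aggregation. The sketch below assumes the simplest one: met = 1, partial = 0.5, not met = 0, averaged without weights over all criteria. Only the criterion types, the three verdict levels, and the 0.5 threshold come from the text; the point values and the example checklist are assumptions.

```python
from typing import Dict, List

# Assumed point values; the source names the verdict levels but not their weights.
VERDICT_POINTS = {"met": 1.0, "partial": 0.5, "not met": 0.0}
SOLVE_THRESHOLD = 0.5  # from the text: solved at score >= 0.5


def checklist_score(verdicts: List[Dict[str, str]]) -> float:
    """Unweighted mean of per-criterion points (an assumed aggregation)."""
    if not verdicts:
        raise ValueError("checklist has no criteria")
    total = sum(VERDICT_POINTS[v["verdict"]] for v in verdicts)
    return total / len(verdicts)


def is_solved(verdicts: List[Dict[str, str]]) -> bool:
    """Apply the solve threshold to the aggregated checklist score."""
    return checklist_score(verdicts) >= SOLVE_THRESHOLD


# Toy judged checklist for one answer (placeholder verdicts).
example = [
    {"type": "must_mention", "verdict": "met"},
    {"type": "must_acknowledge", "verdict": "partial"},
    {"type": "must_ground", "verdict": "not met"},
    {"type": "must_avoid", "verdict": "met"},
]
score = checklist_score(example)  # (1 + 0.5 + 0 + 1) / 4
```

Because every criterion is frozen per question, two judges differ only in their per-criterion verdicts, which is one plausible reason agreement rises relative to free-form grading.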

Key Results

A capability gradient across independent lineages, and failure modes that answer-key QA hides.

Leaderboard: frozen core (423), T=0, checklist judge

Model                              Role / lineage        Frozen-core solve@0.5

Reference roster (difficulty anchors)
GLM-5.1 · Qwen3.6 · DeepSeek-V4    open-weight roster    0% *

Held-out (same lineage)
Qwen3-235B-A22B                    older generation      2.1%
GLM-5                              held-out              16.6%
Qwen3.5-397B-A17B                  held-out              16.8%

Independent frontier lineages
Gemini-3-Pro                       Google                28.8%
Opus-4.7                           Anthropic             37.8%
GPT-5.5                            OpenAI                59.6%

* The frozen core is the subset all three roster models fail by construction, so their solve rate is 0% by definition. Full-core (657) frontier solve@0.5: Gemini 37.4% · Opus 48.6% · GPT-5.5 66.7%.

Existence ≠ correctness

[Figure: Two-level citation factuality]
Two-level citation audit. Citations almost always exist (fabrication ≈ 0.7%), but ~15.9% are wrong-paper (a real paper that does not support the claim), a finding confirmed under an independent different-family judge (cross-family κ = 0.755).
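
The two audit levels and the judge-agreement check can be sketched as follows. This is an illustration under stated assumptions: rates are taken over all citations (the page does not state the denominator), κ is assumed to be Cohen's kappa, and the labels in the example are toy values rather than audit data.

```python
from collections import Counter
from typing import Dict, Sequence

LABELS = ("fabricated", "wrong_paper", "supported")


def audit_label(resolves: bool, supports_claim: bool) -> str:
    """Level 1: does the citation exist? Level 2: does it support the claim?"""
    if not resolves:
        return "fabricated"
    return "supported" if supports_claim else "wrong_paper"


def label_rates(labels: Sequence[str]) -> Dict[str, float]:
    """Share of each audit label over all citations (assumed denominator)."""
    if not labels:
        raise ValueError("no citations to audit")
    counts = Counter(labels)
    return {name: counts[name] / len(labels) for name in LABELS}


def cohen_kappa(judge_a: Sequence[str], judge_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two judges on the same items."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("judges must label the same non-empty item set")
    n = len(judge_a)
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    count_a, count_b = Counter(judge_a), Counter(judge_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used a single identical label
    return (observed - expected) / (1.0 - expected)


# Toy example: four citations as (resolves, supports_claim) pairs.
toy_labels = [audit_label(r, s) for r, s in
              [(True, True), (True, False), (False, False), (True, True)]]
toy_rates = label_rates(toy_labels)
```

An existence-only check would report three of these four toy citations as fine, while the second level flags one of them as wrong-paper.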

Tools stop paying off where they are needed most

[Figure: Agentic collapse behavior]
Agentic collapse. On the hardest questions, agents stop issuing tool calls. For the most collapse-prone model, blocking tool access entirely barely changes the score: tool access confers no measurable advantage (confidence intervals overlap), a pattern replicated across lineages.
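
One way to run the kind of interval comparison described above is a percentile bootstrap over per-question solved flags. The page does not say how its confidence intervals were computed, so the method, the resample count, and the toy inputs below are all assumptions.

```python
import random
from typing import List, Tuple


def bootstrap_ci(solved: List[bool], n_boot: int = 2000, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for a solve rate."""
    if not solved:
        raise ValueError("no questions")
    rng = random.Random(seed)
    n = len(solved)
    # Resample questions with replacement and record each resampled solve rate.
    means = sorted(sum(rng.choices(solved, k=n)) / n for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Toy per-question outcomes for one model (placeholders, not benchmark data).
with_tools = [True] * 12 + [False] * 28
without_tools = [True] * 11 + [False] * 29

ci_with = bootstrap_ci(with_tools)
ci_without = bootstrap_ci(without_tools, seed=1)
no_clear_gap = intervals_overlap(ci_with, ci_without)
```

Overlapping intervals are a coarse check; a paired comparison on the same questions would be a stricter follow-up.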

Measures what closed-form medical QA cannot

[Figure: OpenBioRQ vs MedQA orthogonality]
Resolution gap. On closed-form MedQA / PubMedQA / MedMCQA the same models compress into a ~6-point band, while OpenBioRQ spreads 0→60% (Spearman ≈ 0.14). Models within 0.2 pt on MedQA can be 4× apart on OpenBioRQ, heterogeneity that saturated multiple-choice benchmarks hide.
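
The rank correlation quoted above can be computed from two lists of per-model scores. The sketch below implements Spearman's coefficient from scratch (Pearson correlation of average ranks) so it has no dependencies; the input values are toy numbers chosen only to exercise the function, not scores from either benchmark.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values only: identical ordering, then fully reversed ordering.
same_order = spearman([1, 2, 3, 4], [10, 20, 30, 40])
reversed_order = spearman([1, 2, 3, 4], [4, 3, 2, 1])
```

A coefficient near zero, as reported above, means a model's rank on the closed-form benchmarks says little about its rank on OpenBioRQ.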

Data & Predictions

Evaluation sets, rubrics, and per-model agent trajectories, released for reproducibility.

The 🤗 Hugging Face release ships the full core (657) and frozen core (423) with gold_answer, the per-question rubrics, and per-model predictions + judge verdicts for all 11 leaderboard models (full agentic trajectories), so every leaderboard number can be re-derived end to end.

from datasets import load_dataset

# 423-question frozen core (the primary hard split)
frozen = load_dataset("Minbyul/OpenBioRQ", data_files="frozen_core_423.jsonl")["train"]

# per-question grading rubrics (join on task_id)
rubrics = load_dataset("Minbyul/OpenBioRQ", data_files="rubrics.jsonl")["train"]

# a model's agent trajectories + judge verdicts
preds = load_dataset(
    "Minbyul/OpenBioRQ",
    data_files="predictions/gpt-5.5/predictions.jsonl",
)["train"]
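
Once the files are loaded, a leaderboard-style number could be recomputed along these lines. This is a hedged sketch: `task_id` is the join key named in the snippet above, but the `score` field name is a hypothetical placeholder, since the prediction schema is not described on this page. Check the released files for the actual field names. The functions accept any iterable of dict-like rows, so they work on plain lists as well as loaded datasets.

```python
from typing import Dict, Iterable, Mapping


def index_by_task_id(rows: Iterable[Mapping]) -> Dict[str, Mapping]:
    """Build a task_id -> row lookup, e.g. for joining rubrics to predictions."""
    return {row["task_id"]: row for row in rows}


def solve_rate(rows: Iterable[Mapping], score_field: str = "score",
               threshold: float = 0.5) -> float:
    """Share of rows whose score reaches the threshold (solve@0.5 by default).

    `score_field` is a hypothetical field name, not confirmed by the source.
    """
    scores = [float(row[score_field]) for row in rows]
    if not scores:
        raise ValueError("no prediction rows")
    return sum(s >= threshold for s in scores) / len(scores)


# Toy rows standing in for loaded predictions (placeholder values).
toy_preds = [
    {"task_id": "a", "score": 0.75},
    {"task_id": "b", "score": 0.5},
    {"task_id": "c", "score": 0.25},
    {"task_id": "d", "score": 0.0},
]
toy_rate = solve_rate(toy_preds)
toy_lookup = index_by_task_id(toy_preds)
```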

Citation

If you use OpenBioRQ, please cite:

@misc{jeong2026openbiorq,
 title = {OpenBioRQ: Unsolved Biomedical Research Questions for Agents},
 author = {Minbyul Jeong},
 year = {2026},
 eprint = {2606.21959},
 archivePrefix = {arXiv},
 primaryClass = {cs.AI},
 howpublished = {\url{https://arxiv.org/abs/2606.21959}},
 note = {Dataset and benchmark}
}