Add ResearchClawBench evaluation result
#80
by black-yt - opened
Hi Qwen team,
This PR adds the ResearchClawBench overall evaluation result for Qwen3.5-397B-A17B.
ResearchClawBench is an end-to-end scientific research benchmark for evaluating AI agents and LLMs on tasks that require reading task data and related work, writing and executing code, producing figures, and generating publication-style reports. Final reports are scored against expert checklists derived from human-authored target papers.
The run was executed with ResearchHarness, with tool use, code execution, and a file-system workspace enabled. The submitted value is the overall mean score (out of 100) over the completed ResearchClawBench tasks:
- Model: Qwen3.5-397B-A17B
- Score: 14.23 / 100
- Completed tasks: 40/40
- Run date: 2026-04-16
- Benchmark task id: overall
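For reference, here is a minimal sketch of how an overall mean over completed tasks could be computed. The function name `overall_score`, the task names, and the per-task scores are hypothetical and for illustration only. They are not ResearchClawBench or ResearchHarness code, and not values from this run. It also assumes that an incomplete task is represented as `None` and excluded from the mean.

```python
from statistics import fmean


def overall_score(task_scores):
    """Return (mean score, completed count, total count).

    task_scores maps a task name to a score in [0, 100],
    or to None if the task did not complete (an assumption of this sketch).
    """
    completed = [s for s in task_scores.values() if s is not None]
    if not completed:
        raise ValueError("no completed tasks to average")
    # Average only the completed tasks and round to two decimals.
    return round(fmean(completed), 2), len(completed), len(task_scores)


# Hypothetical per-task scores, not taken from the actual run.
example = {"task_a": 10.0, "task_b": 20.0, "task_c": None}
score, done, total = overall_score(example)
print(f"{score:.2f} / 100 over {done}/{total} completed tasks")
```

In this submission all 40 tasks completed, so the mean over completed tasks is the same as the mean over all tasks.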
The detailed leaderboard is available here: https://internscience.github.io/ResearchClawBench-Home/
Thank you!
