Add ResearchClawBench evaluation result

#120

by black-yt - opened 29 days ago

base: refs/heads/main ← from: refs/pr/120


Add ResearchClawBench evaluation result (f71bb5e7)


black-yt • 29 days ago • edited 29 days ago

Hi Moonshot AI team,

This PR adds the ResearchClawBench overall evaluation result for Kimi-K2.5.

ResearchClawBench is an end-to-end scientific research benchmark for evaluating AI agents and LLMs on tasks that require reading task data and related work, writing and executing code, producing figures, and generating publication-style reports. Final reports are scored against expert checklists derived from human-authored target papers.

The run was executed with ResearchHarness, with tools, code execution, and a file-system workspace enabled. The submitted value is the overall mean score, out of 100, over the completed ResearchClawBench tasks:

Model: Kimi-K2.5
Score: 13.96 / 100
Completed tasks: 39/40
Run date: 2026-04-15
Benchmark task id: overall
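
For readers comparing this number with other leaderboards, here is a minimal sketch of what "mean over completed tasks" implies. It is not the ResearchClawBench or ResearchHarness scoring code, and the function names and per-task values are hypothetical. Only the figures 13.96 and 39/40 come from this PR.

```python
from typing import Optional, Sequence


def mean_over_completed(scores: Sequence[Optional[float]]) -> float:
    """Average only the tasks that produced a score (None = not completed)."""
    completed = [s for s in scores if s is not None]
    if not completed:
        raise ValueError("no completed tasks to average")
    return sum(completed) / len(completed)


def zero_filled_equivalent(mean: float, completed: int, total: int) -> float:
    """Rescale a completed-only mean as if each missing task scored 0.

    This is an alternative convention shown for comparison only; the PR
    reports the completed-only mean.
    """
    if not 0 < completed <= total:
        raise ValueError("need 0 < completed <= total")
    return mean * completed / total


# Hypothetical per-task scores, used only to illustrate the convention.
example = [20.0, 10.0, None, 30.0]
print(mean_over_completed(example))  # averages the three completed tasks

# Figures reported in this PR: 13.96 over 39 of 40 tasks.
print(round(zero_filled_equivalent(13.96, 39, 40), 2))
```

Under the zero-fill convention, the reported 13.96 over 39 of 40 tasks would correspond to roughly 13.61, so the choice of convention shifts the headline number only slightly here.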

The detailed leaderboard is available here: https://internscience.github.io/ResearchClawBench-Home/

Thank you!

Simplify ResearchClawBench eval notes (897c202e)

Ready to merge

This branch is ready to get merged automatically.


URL: https://huggingface.co/moonshotai/Kimi-K2.5/discussions/120
