URL: https://arxiv.org/abs/2605.08678

[Submitted on 9 May 2026 (v1), last revised 27 May 2026 (this version, v2)]

Title: MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

Abstract: Modern AI progress has been driven by ML methods that are generalizable across settings and scalable to larger regimes. As large language models demonstrate advanced capabilities in reasoning, coding, and engineering tasks, it is increasingly important to understand whether they can discover such methods rather than only apply existing ones. We introduce MLS-Bench, a benchmark for evaluating whether AI systems can invent generalizable and scalable ML methods. MLS-Bench contains 140 tasks across 12 domains, each requiring an agent to improve one targeted component of an ML system or algorithm and demonstrate that the improvement generalizes across controlled settings and scales. We find that current agents remain far from reliably surpassing human-designed methods, and that engineering-style tuning is easier for them than genuine method invention. We further study the effects of test-time scaling, adaptive compute allocation, and context provision on agents' discovery performance, together with case studies of their behavior. Our analyses suggest that the bottleneck is not only in proposing new methods, but also in the scientific insight needed to plan, validate, and scale claims about them. More search, compute, or context alone does not remove this bottleneck. We build and maintain a community platform for cumulative and comparable iteration, and release the data and code at this https URL.
Subjects: Machine Learning (cs.LG)
Cite as: arXiv:2605.08678 [cs.LG]
(or arXiv:2605.08678v2 [cs.LG] for this version)
https://doi.org/10.48550/arXiv.2605.08678
arXiv-issued DOI via DataCite

Submission history

From: Bohan Lyu
[v1] Sat, 9 May 2026 04:29:46 UTC (4,011 KB)
[v2] Wed, 27 May 2026 10:19:18 UTC (4,013 KB)