Large Language Model (LLM) agents promise to free us from mundane tasks, but how should they best interact with our world? Introducing CodeAct, an agent {framework, instruction-tuning dataset, model} that employs executable Python code to unify the actions of LLM agents.
🧵1/
Xingyao Wang
Co-founder @OpenHandsDev | PhD candidate @IllinoisCDS | BS @UMichCSE ('22) | Ex Intern @GoogleAI, @Microsoft | Opinions are my own
Joined April 2019
- I often get asked this question: Why is o1 not so good on OpenHands, but their official report shows a decent SWE-bench number? 🧵 Quoted post: Surprising find: OpenAI's O1 - reasoning-high only hit 30% on SWE-Bench Verified - far below their 48.9% claim. Even more interesting: Claude achieves 53% in the same framework. Something's off with O1's "enhanced reasoning"... 🧵1/8
- People have been asking how well Deepseek v3 performs when using native function calling. Answer: performance dropped to 8.33% on SWE-Bench Lite from 23%. Notably, the percentages of empty patches & stuck-in-loop increase a lot (often happens with OSS models!). Examples in 🧵 Quoted post: DeepSeek v3 seems exceptionally capable with its $0.14/$0.28 per 1M tokens pricing as an OpenHands agent
- Introducing OpenDevin CodeAct 1.0 - a new state-of-the-art open coding agent! It achieves a 21% unassisted resolve rate on SWE-Bench Lite, a 17% relative improvement above the previous SOTA by SWE-Agent. Check out our blog or the thread 🧵 for more details: xwang.dev/blog/2024/open…
- DeepSeek v3 seems exceptionally capable with its $0.14/$0.28 per 1M tokens pricing as an OpenHands agent
- o3-mini on SWE-Bench Verified using OpenHands: 43.7% and costs $314 (we ran four runs and took the average, following the official system card). TLDR: It is slightly cheaper than Sonnet and performs slightly worse. Why can't we get the official 61% number? (speculations in 🧵)
- Software is a powerful tool, enabling human developers to interact with the world in complex & profound ways. What if we could use software as a tool to create similar versatile AI agents? Meet OpenDevin: an open platform for AI software developers as generalist agents. 🧵 1/
- Can pretrained language models (LMs) go beyond learning from labels and scalar rewards? Introducing LeTI, a new LM finetuning paradigm that explores LMs' potential to learn from textual interactions & feedback, allowing LMs to understand not just if they were wrong, but why. 🧵1/
- Deepseek V3 0324 got 38.8% on SWE-Bench Verified w/ OpenHands. Best among open-source models so far
- Replying to @xingyaow_: I have a theory: the amount of information provided in the context differs significantly for these two types of agent scaffolds. And this causes reasoning models like o1 to perform differently. Reasoning models are trained to THINK hard, e.g., by solving extremely
- We often interact with Large Language Models (LLMs) like ChatGPT in multi-turn dialogues, yet we predominantly evaluate them with single-turn benchmarks. Bridging this gap, we introduce MINT, a new benchmark tailored for LLMs' multi-turn interactions. 🧵
- Excited to share that @allhands_ai has raised $5M -- and it's finally time to announce a new chapter in my life: I'm taking a leave from my PhD to focus full-time on All Hands AI. Let's push open-source agents forward together, in the open! Quoted post: We are proud to announce that All Hands has raised $5M to build the world's best software development agents, and do it in the open: all-hands.dev Thank you to @MenloVentures and our wonderful slate of investors for believing in the mission!
- I finally managed to integrate (most of) CodeAct into OpenDevin 🥳. Now, it can work end-to-end on model training (well - very simple linear regression). It is somewhat buggy, but I'm excited that we may have a fully open-sourced AI software engineer/data scientist in the near
- The real "wow" moment for me with Devstral: I asked it to build a todo list app, and instead of jumping straight in, it asked me how I wanted to build it, listing actual options. After so many one-sided decisions from Sonnet 3.7, being asked felt... emotional. Quoted post: Meet Devstral, our SOTA open model designed specifically for coding agents and developed with @allhands_ai mistral.ai/news/devstral
