Our motivating hypothesis is that achieving general-purpose Software Engineering (SWE) agents requires a shift to computer-use agents that interact with computers as humans do: by observing the screen, typing, and clicking.
Recent SWE agents have largely followed a tool-based paradigm, where agents interact with hand-engineered tool APIs to perform specific tasks. While effective for specialized tasks, these methods fundamentally lack generalization, as they:
For example, an agent designed to manage GitHub pull requests lacks debugging abilities unless they are explicitly built into the agent's API.
PwP is a VSCode-based IDE environment where agents perceive the screen and use primitive actions such as typing, pointing, and clicking. This provides two key advantages:
We introduce the first software engineering-focused environment for evaluating computer-use agents, using a modified VSCode IDE with:
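PwP provides a straightforward API for evaluating computer-use agents:

    from pwp import PwPBench

    # Load a benchmark task and take one instance from its dataset
    bench = PwPBench('design2code')
    dataset = bench.get_dataset()
    row = dataset[0]

    # Create the environment for that instance
    env = bench.get_env(row)

    # `agent` is supplied by the user: any object with a get_action(obs) method
    for step in range(20):
        obs = env.get_observation()
        action = agent.get_action(obs)
        env.step(action)

    # Score the final environment state
    score = bench.get_reward(env, row)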
This simplified interface works across all 15 tasks and 8 programming languages without any task-specific modifications!
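To run more than one instance, the same loop can be wrapped in a small helper. The sketch below is an illustration, not part of PwP: `Agent`, `run_episode`, `evaluate`, `max_rows`, and `max_steps` are hypothetical names. It calls only the methods shown in the example above (`get_dataset`, `get_env`, `get_observation`, `step`, `get_reward`), and it assumes the dataset supports `len()` and that rewards are numeric.

```python
from typing import Any, Protocol


class Agent(Protocol):
    """Hypothetical interface: any object exposing get_action(obs)."""

    def get_action(self, obs: Any) -> Any: ...


def run_episode(bench: Any, agent: Agent, row: Any, max_steps: int = 20) -> float:
    """Run one task instance for a fixed step budget and return its reward."""
    env = bench.get_env(row)
    for _ in range(max_steps):
        obs = env.get_observation()
        env.step(agent.get_action(obs))
    return bench.get_reward(env, row)


def evaluate(bench: Any, agent: Agent, max_rows: int = 5, max_steps: int = 20) -> float:
    """Average the reward over the first max_rows rows (assumes numeric rewards)."""
    dataset = bench.get_dataset()
    n = min(max_rows, len(dataset))  # assumes the dataset supports len()
    scores = [run_episode(bench, agent, dataset[i], max_steps) for i in range(n)]
    return sum(scores) / len(scores) if scores else 0.0
```

With a real benchmark object this would be called as `evaluate(PwPBench('design2code'), agent)`. Because the helper depends only on the five methods above, switching to a different task should only require changing the benchmark name.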
Our extensive evaluation of computer-use agents reveals:
The agent fixes a bug in a Python function by analyzing the code context, locating the error, and implementing the correction through natural IDE interactions.
The agent is tasked with creating a simple HTML page from a reference image. It opens the image side by side with a live preview of the code it writes, and uses the preview to iteratively improve the code.
The agent configures development environment settings, installing necessary extensions and adjusting workspace preferences through visual interaction with VSCode.
The agent is tasked with renaming a variable in a complex Python repository. It uses VSCode's rename functionality to rename the variable correctly across the project.
| Model | % Resolved | Date | Site |
|---|---|---|---|
| Computer-Use Agent Claude-3.5 Sonnet - 20241022 | 46.8 | 2025-02-24 | |
| Computer-Use Agent GPT-4o | 32.3 | 2025-02-24 | |
| Computer-Use Agent Gemini-1.5 Pro | 18.1 | 2025-02-24 | |
| Computer-Use Agent GPT-4o-mini | 17.8 | 2025-02-24 | |
| Computer-Use Agent Gemini-1.5 Flash | 8.9 | 2025-02-24 | |
@misc{aggarwal2025programmingpixelscomputerusemeets,
title={Programming with Pixels: Computer-Use Meets Software Engineering},
author={Pranjal Aggarwal and Sean Welleck},
year={2025},
eprint={2502.18525},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2502.18525},
}