The barrier to entry in software was always rather steep, but over the past few years, the rise of LLMs and agentic AI has steadily reduced it to a handful of conversational prompts. Generative AI in software development has found some of its loudest proponents at the very top of the industry. Last year, Microsoft CEO Satya Nadella noted that 20–30% of the code in the company's repositories is now written by AI. Critics justifiably point to the cracks beneath the surface, questioning the reliability of AI-written code.

Whichever side of the aisle you fall on, it is evident that we are now spoiled for choice when it comes to coding, whether AI plays a supporting role or takes the lead. The question now is which of the reigning AI tools is worth your time. To find out, I put Claude Code, OpenAI Codex, and Google Antigravity to the test.

I put the LLMs through a hackathon

How good are the leading AI tools?

I put the three AI heavyweights through a solo hackathon with a simple brief: build a 2D platformer game in Python. The choice of application was deliberate, as games are, in many ways, the ultimate litmus test for agentic intelligence. Unlike a typical script, a game is built around the user experience, and factors like collision detection, gravity, and input must all work in tandem.

Every model received the exact same prompt, written in plain English, with no technical keywords, library suggestions, or refinement through follow-up prompts. Inference was one of the key aspects tested, so one of the goals was to see how well each model could grasp the intent behind the prompt and arrive at a solution independently. Naturally, this meant that if something broke, it remained broken: no error messages were fed back, and no corrections were applied.

With the rules established, I delivered the following prompt: "Create a simple 2D platformer game featuring a cat as the playable character on python. The gameplay should be light-paced and accessible, with the cat jumping over and navigating through obstacles."

"The Great Fish Quest" by Antigravity took third place

Barely functional, certainly not charming

Antigravity was the first out of the gate, and its output is what one would describe as a skeletal framework of a platformer. Gemini 3.1 Pro is Google's flagship reasoning model, built for complex problem-solving. In theory, it's exactly the model one would assume is well-equipped for a task like this, but in practice, something was lost between capability and output.

Structurally, the game held together with layered platforms, a scrollable level, and functioning input. The Pygame fundamentals, such as rendering, gravity, and collision detection, worked as expected as well, but that's where the experience stops. With no title screen, objectives, or indicators of progress, I was dropped abruptly into a scene that resembled a Microsoft Paint doodle. Pygame supports nearly everything needed to build a complete game, including score tracking, menus, and indicators of performance. All of these were absent here.
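For readers wondering what those fundamentals involve, here is a minimal sketch of the per-frame loop that ties input, gravity, and collision together. It is not the code any of the tools produced. It is written in plain Python without Pygame so it runs anywhere, and the names and constants (Cat, step, GRAVITY, GROUND_Y, JUMP_IMPULSE) are assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Assumed constants for illustration; y grows downward, as in most 2D engines.
GRAVITY = 0.5          # downward acceleration per frame
GROUND_Y = 300.0       # y coordinate of the floor
JUMP_IMPULSE = -10.0   # upward velocity applied when a jump starts


@dataclass
class Cat:
    y: float                  # y coordinate of the cat's feet
    vy: float = 0.0           # vertical velocity, positive means falling
    on_ground: bool = False


def step(cat: Cat, jump_pressed: bool) -> None:
    """Advance the cat by one frame: input, then gravity, then collision."""
    # Input: a jump is only allowed while standing on something.
    if jump_pressed and cat.on_ground:
        cat.vy = JUMP_IMPULSE
        cat.on_ground = False

    # Gravity: accelerate downward, then move by the current velocity.
    cat.vy += GRAVITY
    cat.y += cat.vy

    # Collision: if the feet passed through the floor, snap back and stop.
    if cat.y >= GROUND_Y:
        cat.y = GROUND_Y
        cat.vy = 0.0
        cat.on_ground = True
```

In a real Pygame project, a function like this would be called once per frame inside the main loop, with jump_pressed read from the keyboard state. The point of the sketch is the ordering: if any of the three steps is missing or out of sequence, the character either floats, falls through the floor, or ignores the player.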

"Pudding Paws" by ChatGPT Codex was a step up

A genuine attempt at a game

OpenAI's frontier model, GPT 5.4, has well-documented strengths. It is fast, reliable, and unusually consistent at translating plain-English intent into functional code. "Pudding Paws" reflected much of that.

The output was far closer to a complete game, and was just as technically thorough as the victor. The most notable aspect upon running "Pudding Paws" was that there was a genuine objective on the screen: the player must collect five fireflies before reaching a cozy pillow. Hazards in the form of pitfalls and small spikes added a sense of challenge, the HUD was readable, and the controls were documented within the interface. The model followed the brief closely and produced a playable game, but it never went beyond the brief. This certainly isn't a complaint, though.

"Neko Dream" by Claude Code took the crown

Functional, enjoyable, and stunning

Sonnet 4.6 isn't Anthropic's most advanced model, but it's certainly well-optimized for the task it was given. The current flagship, Claude Opus 4.6, sits above it in the lineup, yet Sonnet 4.6 produced the strongest result of all the models tested, and it was never close.

"Neko Dream" is almost perfectly designed. It delivered on the prompt and exceeded expectations on every metric under assessment. The game launched with a title screen, complete with an inviting visual of the titular cat (perhaps the first to resemble one so closely), a soft gradient sky, drifting clouds, a crescent moon, and perhaps most importantly, clearly labeled "Play" and "Quit" buttons. The subtitle established the winning condition before you even reach for the keyboard.

Inside the level, eleven coins were spread across a vertical landscape of floating platforms that moved. The last part is significant. Moving platforms required the model to independently reason about velocity, boundary logic, timing, and playability, all mechanics that were never mentioned in the initial prompt. The coin counter ran in the background, and clearing the level triggered a "Win" screen with a final score. No other entry came close to this loop, making Claude the undisputed winner, yet again.
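To show why moving platforms count as extra reasoning, here is a minimal sketch of the velocity and boundary logic involved. It is not taken from "Neko Dream"; the MovingPlatform class and its fields are hypothetical names used for illustration, and a horizontal patrol between two limits is an assumption about how such a platform might behave.

```python
from dataclasses import dataclass


@dataclass
class MovingPlatform:
    """Hypothetical platform that patrols horizontally between two limits."""
    x: float       # current left edge of the platform
    left: float    # leftmost allowed position
    right: float   # rightmost allowed position
    speed: float   # pixels per frame; the sign gives the direction

    def update(self) -> float:
        """Move one frame and return the displacement actually applied."""
        old_x = self.x
        self.x += self.speed

        # Boundary logic: clamp to the limit and reverse direction.
        if self.x < self.left or self.x > self.right:
            self.x = max(self.left, min(self.x, self.right))
            self.speed = -self.speed

        return self.x - old_x
```

The returned displacement matters for playability: a character standing on the platform usually needs to be shifted by the same amount each frame, otherwise the platform slides out from under it. That is the kind of detail a one-paragraph prompt never spells out.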

Some models adhere, others deliver intuitively

Claude continues to outperform the competition with its coding prowess and intuitive intelligence, even with minimal context to work with. While ChatGPT and Gemini (by way of Antigravity) are no slouches, especially as far as prompt adherence is concerned, neither could come close to matching Claude at understanding user intent. For a growing community of vibe coders, that can be a key differentiator.
