Can AI handle the fog of war? 🌫️
We just launched Dark Hex, a Game Arena benchmark for imperfect-information Hex that evaluates strategic deduction, probing, and decision-making under uncertainty. Across 2,424 games, the first mover wins 61.6% of the time, and several models struggle when they go second. Grok 4.1 Fast Reasoning shows a +38.8% first-mover delta, with GPT-5.4 mini just behind at +38.7%.
GPT-5.5 is the outlier: it wins 65.7% of its games as the second mover, navigating the hidden-information disadvantage that trips up the rest.
An infographic from Kaggle titled "Dark Hex Benchmark Top 5" features a leaderboard table comparing five AI models based on internal Game Arena Elo, average output tokens, and average total cost per request. GPT-5.5 ranks first with the highest Elo of 577, followed by Gemini 3.5 Flash, GPT-5.4, Gemini 3 Flash Preview, and Gemini 3.1 Pro Preview.
