Can @AnthropicAI Claude 3.5 sonnet outperform @OpenAI o1 in reasoning? Combining Dynamic Chain of Thoughts, reflection, and verbal reinforcement, existing LLMs like Claude 3.5 Sonnet can be prompted to increase test-time compute and match reasoning strong models like OpenAI o1. namic Chain of thoughts + reflection + verbal reinforcement prompting
๐ Benchmarked against tough academic tests (JEE Advanced, UPSC, IMO, Putnam)
๐ Claude 3.5 Sonnet outperformes GPT-4 and matched O1 models
๐ LLMs can create internal simulations and take 50+ reasoning steps for complex problems
๐ Works for smaller, open models like Llama 3.1 8B +10% (Llama 3.1 8B 33/48 vs GPT-4o 36/48)
โ Didnโt benchmark like MMLU, MMLU pro, or GPQA due to computing and budget constraints
๐ High token usage - Claude Sonnet 3.5 used around 1 million tokens for just 7 questions
