Summary

  • Large language models (LLMs) are not reliable for solving math problems. Even the best LLMs have poor accuracy rates in math calculations.
  • As the numbers in a calculation get larger, the accuracy of LLMs plummets, likely because bigger problems sit further from the examples in their training set.
  • LLMs can be used as a powerful tool in math when paired with evaluators and iterative processes, as shown by Google's FunSearch method. However, LLMs still need human engineering guidance to steer them in the right direction.

When it comes to large language models (LLMs), you might think they're a silver bullet for most of your problems. You can have one plan your day or ask it almost anything, knowing it will do its best to give you a comprehensive answer. However, there's one thing you should never rely on an LLM for, and that's math.

To be clear, LLMs can be trained on large mathematical datasets to recognize patterns and, with smaller numbers, get close to real answers. Even then, though, you're better off just using a calculator.

LLMs suck at math

Even the best of the best has a pretty poor accuracy rate

It's already been proven how bad LLMs are at math, and ironically, it was in a paper titled "GPT Can Solve Mathematical Problems Without a Calculator." Researchers from Tsinghua University demonstrated how a model trained on mathematical calculations (called MathGLM) can be used to solve problems with reasonable accuracy.

In the paper's results, MathGLM outperforms both GPT-4 and ChatGPT significantly. There's one problem, though: even with 5-digit calculations, the best you'll get from a 2 billion parameter model is 85.16% accuracy. No matter what, 10,000*5 is still 50,000, and if an LLM gets close to that but doesn't land on it exactly, then it's still the wrong answer. A calculator will solve all of those problems with 100% accuracy 100% of the time.
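To make the "close is still wrong" point concrete, here's a minimal Python sketch of exact-match scoring. The function name and the model outputs below are made up for illustration; they are not taken from the MathGLM paper or its benchmark.

```python
def exact_match_accuracy(problems, predictions):
    """Fraction of predictions that equal the true product exactly."""
    correct = sum(
        1 for (a, b), guess in zip(problems, predictions) if a * b == guess
    )
    return correct / len(problems)


problems = [(10_000, 5), (12_345, 67_890), (99_999, 99_999)]

# Hypothetical model outputs: the second answer is off by just one.
predictions = [50_000, 838_102_051, 9_999_800_001]

score = exact_match_accuracy(problems, predictions)
print(f"{score:.2%}")  # the off-by-one answer counts as a full miss
```

Python's integers are arbitrary precision, so `a * b` here is exact no matter how many digits are involved, which is precisely the guarantee an LLM can't give you.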

As the numbers get bigger, the accuracy also plummets. This is likely because the training set is made up of smaller calculations, so the larger the numbers get, the further a problem sits from anything the model has already seen. It's not doing calculations; it's pattern matching. If you want to use MathGLM, you can take a look at the team's GitHub. Just know you'll need a powerful PC to run it locally.

Google's FunSearch shows how to use LLMs for math the right way

It's already outperforming humans

Google recently made headlines with its FunSearch method, which pairs a pre-trained LLM with an automatic evaluator to prevent hallucinations and incorrect ideas. It's essentially an iteration process pairing the creativity of an LLM with something that can kick it back a step when it goes too far in the wrong direction. LLMs aren't good at math, but they're good at being creative.

FunSearch works by taking a description of a mathematical problem in the form of code. This description provides a procedure for evaluating the output and initializes a pool of programs to begin with. At each iteration of FunSearch, the system selects some of the programs and feeds them to an LLM, like PaLM 2, which builds new programs on top of them. The best ones are selected for further iteration, which creates a self-improving loop.
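The shape of that loop can be sketched in a few lines of Python. To be clear, this is a toy under heavy assumptions and not Google's implementation: the "programs" are just pairs of numbers, `propose` is a random mutation standing in for the LLM, and `evaluate`, `propose`, and `search` are names invented here.

```python
import random


def evaluate(candidate):
    """Automatic evaluator: higher is better. Toy objective, best at (3, -2)."""
    x, y = candidate
    return -((x - 3) ** 2 + (y + 2) ** 2)


def propose(parents, rng):
    """Stand-in for the LLM: mix two existing candidates and nudge the result."""
    a, b = rng.sample(parents, 2)
    return (a[0] + rng.choice([-1, 0, 1]), b[1] + rng.choice([-1, 0, 1]))


def search(iterations=2000, pool_size=10, seed=0):
    rng = random.Random(seed)
    # Initialize the pool with random starting candidates.
    pool = [(rng.randint(-10, 10), rng.randint(-10, 10)) for _ in range(pool_size)]
    for _ in range(iterations):
        pool.append(propose(pool, rng))
        # Keep only the highest-scoring candidates; bad ideas get dropped.
        pool.sort(key=evaluate, reverse=True)
        pool = pool[:pool_size]
    return pool[0]


best = search()
print(best, evaluate(best))
```

The proposer is allowed to be wrong most of the time. Because the evaluator scores every candidate and only the top ones survive, the best score in the pool can never get worse from one iteration to the next.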

FunSearch, in this case, managed to find the largest cap sets that far exceeded the best-known ones by some of the world's smartest mathematicians. "To the best of our knowledge, this shows the first scientific discovery — a new piece of verifiable knowledge about a notorious scientific problem — using an LLM," the researchers wrote in a paper published in Nature.
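What makes a result like this "verifiable" is that checking it is mechanical. As a rough illustration, assuming the standard definition of a cap set (a set of points with coordinates in {0, 1, 2} where no three distinct points sum to zero modulo 3 in every coordinate), a brute-force checker might look like the following. The function name is hypothetical and this is not the evaluator from the paper.

```python
from itertools import combinations


def is_cap_set(points):
    """Return True if no three distinct points sum to zero mod 3 coordinate-wise."""
    unique = list(set(points))
    for a, b, c in combinations(unique, 3):
        if all((x + y + z) % 3 == 0 for x, y, z in zip(a, b, c)):
            return False  # these three points lie on a line
    return True


good = [(0, 0), (0, 1), (1, 0), (1, 1)]
bad = [(0, 0), (1, 1), (2, 2)]

print(is_cap_set(good), is_cap_set(bad))
```

Checking every triple is far too slow for the set sizes researchers care about, but it shows the idea: the LLM never has to be trusted, because a simple program can confirm or reject whatever it comes up with.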

LLMs suck at math, but they're still a powerful tool

Just use a calculator for math, though

As shown by Google, an LLM can be a powerful mathematics tool, but it won't be solving problems and generating new ideas all by itself without any outside help. The evaluator Google built into FunSearch allowed it to solve mathematical problems by iterating extensively on the creativity of an LLM that will frequently hallucinate. That's not the LLM being good at math. That's engineers being good at steering it down the right path.

If you ask an LLM to explain a mathematical concept to you, like how to multiply two matrices together, chances are it will tell you how to do it correctly. If you ask it to actually multiply two matrices together, though, chances are the answer will be wrong. I recently asked ChatGPT to multiply two matrices and received an answer with completely incorrect dimensions. However, when I ask it how to multiply two matrices, the answer I get is right.
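For reference, the procedure itself is short enough to write out. Here's a plain Python sketch (the `matmul` name and the example matrices are just for illustration) that also checks the dimensions: an m x n matrix times an n x p matrix always gives an m x p result.

```python
def matmul(a, b):
    """Multiply an m x n matrix by an n x p matrix, both given as lists of rows."""
    if not a or not b or len(a[0]) != len(b):
        raise ValueError("columns of a must equal rows of b")
    rows, inner, cols = len(a), len(b), len(b[0])
    # Entry (i, j) is the dot product of row i of a and column j of b.
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


a = [[1, 2, 3], [4, 5, 6]]        # 2 x 3
b = [[7, 8], [9, 10], [11, 12]]   # 3 x 2

product = matmul(a, b)            # 2 x 2
print(product)  # [[58, 64], [139, 154]]
```

If an answer you're given doesn't have as many rows as the first matrix and as many columns as the second, you know it's wrong before checking a single number.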

In other words, if you're trying to use an LLM like ChatGPT or Google Bard to help you with mathematics, then ask it to explain concepts to you and don't ask it for actual answers. If you're lucky, the answer might be in its training set, but you're better off learning how to do it yourself instead of relying on it in the first place.