Bit by Bit is a weekly column focusing on technical advances each and every week across multiple spaces. My name is Adam Conway, and I've been covering tech and following the cutting-edge for a decade. If there's something you're interested in and would like to see covered, you can reach out to me at adam@xda-developers.com.
One of my biggest pet peeves about the proliferation of generative AI tech has been the attribution of "thought" to the likes of ChatGPT. While one could be fooled at first into thinking that these models are capable of reasoning, thanks to their ability to present information in a clear and concise way based on user input, there are a few very easy ways to expose the flaw in that idea.
For example, take math. Math is an inherently logical task that requires an understanding of numbers, operations, and whatever else the question in front of you demands. Oftentimes, if you ask an LLM to solve a simple enough math problem, it'll be able to answer... but not because it's capable of doing anything logical. No matter what, 10,000*5 is still 50,000, and if an LLM gets close to that but doesn't land on it exactly, then it's still the wrong answer. A calculator will solve all of those problems with 100% accuracy 100% of the time.
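To make "close is still wrong" concrete, here is a minimal Python sketch. The function name `is_correct` and the list of guesses are invented for illustration; they are not output from any real model.

```python
def is_correct(claimed: int, a: int, b: int) -> bool:
    """Return True only if the claimed product matches a * b exactly."""
    return claimed == a * b

# Hypothetical answers, made up for this example (not real model output).
guesses = [50_000, 49_998, 50_001]

# Each guess is checked against the exact product of 10,000 and 5.
results = [is_correct(guess, 10_000, 5) for guess in guesses]
print(results)
```

The check is exact equality on purpose: in arithmetic there is no partial credit for an answer that is merely in the right neighborhood.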
LLMs give the illusion of knowledge
In reality, it's patterns and combinations
In the world of computing, LLMs are knowledgeable in a sense, and they excel in recognizing patterns. They’re not a traditional database but rather a model trained on vast amounts of text data from millions of sources, which allows them to generate contextually relevant responses. When a user provides a prompt, the LLM interprets it and generates a response based on probabilistic patterns learned during training. These patterns help the LLM predict what is most likely to come next in the context of the prompt, drawing on its understanding of the relationships and structures in the language it has been trained on.
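As a rough analogy for "predicting what is most likely to come next," here is a toy word-pair counter in Python. It is a drastic simplification and an assumption on my part: real LLMs are neural networks that operate on tokens, not lookup tables of word counts, and the corpus and function names below are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(text: str) -> dict:
    """Count how often each word is followed by each other word."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def next_word_probabilities(counts: dict, word: str) -> dict:
    """Turn the raw follow-counts for `word` into probabilities."""
    followers = counts.get(word.lower(), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # the word was never seen with a follower
    return {w: c / total for w, c in followers.items()}

# Tiny made-up training text.
corpus = "the cat sat on the mat and the cat slept on the sofa"
model = train_bigrams(corpus)

# Which word most often followed "the" in the training text?
probs = next_word_probabilities(model, "the")
print(max(probs, key=probs.get))
```

Nothing in this sketch knows what a cat or a mat is. It only knows which words tended to follow which, and that is the point of the analogy.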
With that said, "understanding" here means a statistical grasp of relationships, structures, and patterns; there is no logical thinking involved. At best, an LLM can interpret the data it receives and attempt to pass it to another piece of software capable of performing actual calculations. In a recent paper published by six engineers at Apple (currently in pre-print), the supposed ability to reason is called into question by the fact that LLMs struggle to answer questions depending on the way they're worded. In the paper, those engineers write:
We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data. When we add a single clause that appears relevant to the question, we observe significant performance drops (up to 65%) across all state-of-the-art models, even though the added clause does not contribute to the reasoning chain needed to reach the final answer.
Given that we already knew LLMs were not capable of reasoning, this is a largely unsurprising result, but it's a higher-profile study that breaks the problem down and offers an explanation for it. As the paper also states: "[a]lthough LLMs can match more abstract reasoning patterns, they fall short of true logical reasoning. Small changes in input tokens can drastically alter model outputs, indicating a strong token bias and suggesting that these models are highly sensitive and fragile."
Even OpenAI's o1 model isn't really "reasoning." OpenAI's blog post announcing o1 was titled "Learning to Reason with LLMs," but it's faux reasoning. o1 models use "reasoning tokens" to break down the prompt and consider multiple approaches to generating a response. Once those reasoning tokens have been generated, the visible answer is produced and the reasoning tokens are discarded.
However, that's not to say we aren't getting close. As later reports indicated, o1 was internally known as Q*, and then later as Strawberry. This is the model that reportedly caused a stir within OpenAI and contributed to the ousting of company CEO Sam Altman, though he later returned. o1 is another model that those engineers from Apple suggest may not be as capable of reasoning as it appears on the surface, but the turmoil inside OpenAI makes it clear that there is significant concern about the future of these models, including whether they could eventually develop a true ability to reason.
LLMs can help with reasoning, but should never be relied upon
Google made it work
Just because an LLM can't reason by itself doesn't mean that it can't contribute to a logical reasoning process. It just can't operate on its own. Google paired a pre-trained LLM with an automatic evaluator to guard against hallucinations and incorrect ideas, calling the result FunSearch. It's essentially an iterative process that pairs the creativity of an LLM with something that can knock it back a step when it goes too far in the wrong direction. LLMs aren't good at math, but they're good at being creative.
FunSearch works by taking a description of a mathematical problem in the form of code. This description provides a procedure for evaluating the output and initializing a pool of programs. At each iteration of FunSearch, the system will select some of the programs and feed them to an LLM, like PaLM 2, building new programs on top of that. The best ones are selected for iterating upon, which creates a self-improving loop.
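The loop described above can be sketched in a few lines of Python. This is not Google's implementation: the "programs" here are just tuples of integers, `propose` is a random tweak standing in for the LLM, and the scoring objective in `evaluate` is an invented toy. All names and parameters are assumptions for illustration.

```python
import random

def evaluate(candidate: tuple) -> int:
    """Automatic evaluator: a deterministic score where higher is better.
    Toy objective (an assumption): get close to a fixed target vector."""
    target = (3, 1, 4, 1, 5)
    return -sum((c - t) ** 2 for c, t in zip(candidate, target))

def propose(parent: tuple, rng: random.Random) -> tuple:
    """Stand-in for the LLM: return a randomly tweaked copy of the parent.
    A real system would prompt a model with existing programs instead."""
    child = list(parent)
    i = rng.randrange(len(child))
    child[i] += rng.choice((-1, 1))
    return tuple(child)

def search(iterations: int = 200, pool_size: int = 5, seed: int = 0):
    rng = random.Random(seed)
    pool = [(0, 0, 0, 0, 0)]          # initial pool of "programs"
    history = []                      # best score after each iteration
    for _ in range(iterations):
        parent = rng.choice(pool)     # select a program from the pool
        child = propose(parent, rng)  # ask the "LLM" for a variation
        pool.append(child)
        # The evaluator decides what survives; weak candidates are dropped.
        pool = sorted(set(pool), key=evaluate, reverse=True)[:pool_size]
        history.append(evaluate(pool[0]))
    return pool[0], history

best, history = search()
print(best, evaluate(best))
```

The design point is the division of labor: the proposer is free to be wrong, because nothing it suggests survives unless the evaluator scores it well. The best score in the pool can therefore never get worse from one iteration to the next.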
In this case, FunSearch managed to find cap sets far larger than the best previously known ones, which had been found by some of the world's smartest mathematicians. "To the best of our knowledge, this shows the first scientific discovery — a new piece of verifiable knowledge about a notorious scientific problem — using an LLM," the researchers wrote in a paper published in Nature.
In other words, the next time you use an LLM, temper your expectations. Don't expect logic, don't expect thinking, but do expect patterns, statistics, and a dash of creativity. LLMs are useful tools, as demonstrated by Google, but a tool is only as good as its user's understanding of its limitations. Ask ChatGPT how to solve a math problem and you'll get a great answer, but don't ask it to solve the problem for you.
