VOOZH about

URL: https://thenewstack.io/what-large-language-models-can-do-well-now-and-what-they-cant/

⇱ What Large Language Models Can Do Well Now, and What They Can't - The New Stack


TNS
SUBSCRIBE
Join our community of software engineering leaders and aspirational developers. Always stay in-the-know by getting the most important news and exclusive content delivered fresh to your inbox to learn more about at-scale software development.
REQUIRED
It seems that you've previously unsubscribed from our newsletter in the past. Click the button below to open the re-subscribe form in a new tab. When you're done, simply close that tab and continue with this form to complete your subscription.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.
Welcome and thank you for joining The New Stack community!
Please answer a few simple questions to help us deliver the news and resources you are interested in.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Great to meet you!
Tell us a bit about your job so we can cover the topics you find most relevant.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Welcome!

We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.

What’s next?

Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.

Follow TNS on your favorite social media networks.

Become a TNS follower on LinkedIn.

Check out the latest featured and trending stories while you wait for your first TNS newsletter.

PREV
1 of 2
NEXT
VOXPOP
As a JavaScript developer, what non-React tools do you use most often?
Angular
0%
Astro
0%
Svelte
0%
Vue.js
0%
Other
0%
I only use React
0%
I don't use JavaScript
0%
Thanks for your opinion! Subscribe below to get the final results, published exclusively in our TNS Update newsletter:
NEW! Try Stackie AI
From clobbered drafts to real-time sync
Apr 14th 2026 10:00am, by David Moore
TypeScript 6.0 RC arrives as a bridge to a faster future
Mar 14th 2026 9:00am, by Darryl K. Taft
Mastra empowers web devs to build AI agents in TypeScript
Jan 28th 2026 11:00am, by Loraine Lawson
2023-06-28 07:36:24
What Large Language Models Can Do Well Now, and What They Can't
AI / Large Language Models

What Large Language Models Can Do Well Now, and What They Can’t

At QCon New York earlier this month, two OpenAI engineers demonstrated ChatGPT's newest feature, Functions, in one session. Another talk, however, pointed to the inherent limitations of LLMs.
Jun 28th, 2023 7:36am by Joab Jackson
👁 Featued image for: What Large Language Models Can Do Well Now, and What They Can’t

Attendees of New York QCon earlier this month got a preview of where the exciting world of Large Language Model (LLM)-based artificial intelligence (AI) may be going, as well as some limits of how far the technology can practically reach.

Two OpenAI engineers from the company’s API team demonstrated ChatGPT’s newest feature, Functions, in one session.

Functions are a way of connecting ChatGPT to the rest of the world, explained  Atty Eleti, OpenAI Software Engineer.

A big limitation of the service to date is that it is built on a body of knowledge that extends only until 2022, when the process of gathering all the training data for ChatGPT was completed.

Functions are a way into the world of real-time data, giving ChatGPT the permission to execute select actions on the user’s behalf, by the user signaling an intent for such an action to take place, using a prompt such as “Show me a list of hotels in the area.”

In practice, this means ChatGPT can now call on third-party services, such as Yelp,  to provide information on the user’s behalf. ChatGPT can then format the results according to a set of instructions provided by the developer.

For this work, the team fine-tuned the ChatGPT 3.5 model to understand when to take action, or use a tool, on behalf of a user.

“The end result is a new set of models that can now intelligently use tools and call functions for you,” explained Sherwin Wu, OpenAI Technical Staff.

👁 Image

As an example, Wu showed how to ask, through the OpenAI API, ChatGPT for the current weather. With instructions passed along to the ChatGPT, it can call an external weather service to provide the answer, using the location of the user, and returning the results to the user, formatted for human readability.

Another is the ability to use Yelp to provide the user with a list of nearby restaurants. ChatGPT can query public external services, or even private sources of data when provided with log-in instructions through the API.

The presentation was clearly aimed at developers who want to build their own apps on the OpenAI platform. While OpenAI might have started as a large-scale experiment in using AI, it is clear the company has plans to market its services as a platform upon which to build applications.

Limits of Generative AI

Others are more circumspect about the possibilities of ChatGPT, such as Mathew Lodge, CEO of AI-based unit test automation provider DiffBlue, who spoke in an earlier session at QCon.

At its core, LLM-based generative AI relies on a single mechanism, called a transformer, which is basically a function to predict what the next word will be, given a prompt and a training set. It is simply a large statistical model that returns results based on information it has seen before.

It’s important to remember this because you can read all kinds of crazy stuff about transformer-based models, large language models, and how they’re intelligent, that they have a theory of mind. They don’t have any of those things. They are next-word predictors,” Lodge said.

Completed in 2019, GPT 2 was built on 1.5 billion parameters. The following year, GPT 3 arrived, built on a 175 billion parameter model — a model that would require 355 GPU years to train on the highest-end GPU today. It would cost $4.6 million to run such a job on Azure, deep learning service provider Lambda has estimated.

We don’t know how big the newest version, GPT-4, is, Lodge noted. OpenAI is keeping mum on the details, citing the newly competitive nature of the market.

But like the previous versions, GPT-4 is answer-driven, rather than goal driven. The breakthrough with this release is that the new version can do tasks that it wasn’t specifically trained to do.

“It generalizes nicely for text and language tasks,” Lodge said.

This means GPT-4 is great at completing boilerplate, such as for Java classes. This also translates well to working with an external API with little documentation (Lodge commented that traffic has declined on Stack Overflow since the emergence of ChatGPT).

But certain, seemingly built-in limitations remain with the new release. They seem to be limitations hard-wired to the LLM model itself.

Accuracy is still problematic. By now, everyone knows of the LLM’s tendency to make stuff up. While LLMs have been compared quite a bit to search engines,  but the fact remains search engines are a lot more accurate in providing information users need.

Lodge shared one surprising example that came from the Geo-location API service OpenCage. ChatGPT falsely stated on numerous occasions that the company offers a service that would, given a phone number, provide a location for that phone. The company got so many API requests for this non-existent service — all generated by ChatGPT — that it had to post a disclaimer on its website.

Nor are LLMs particularly good at mathematics. At heart they are a language analysis tool, not one built for symbolic manipulation, Lodge pointed out.

“Fundamentally, these are very, very large statistical models. They’re not predictable to humans. They’re not explainable by humans. We can’t predict what they’re going to do” — Mathew Lodge

Another problem with generative models is that a small change in the input can make a huge difference in the output. They’re not deterministic. A GPT-4-based code assist may do great building out a programming class for “dogs” within an application for pets, but might totally produce gibberish if the instructions were changed to build a class for “cats.”  It’s the semantics, Lodge explained, that trip up ChatGPT.

Prompt engineering is not engineering at all, Lodge argued, in that the “engineer” is just randomly trying new things at the prompt hoping to achieve success. This is more like programming, Lodge joked.

👁 Image

Mathew Lodge

“These are consistent issues across all the models,” Lodge pointed out. The OpenAI research team disclosed these issues in their earliest papers, dating back to 2014, and you see similar text-processing issues with other neural networks as models.

Lodge’s company DiffBlue, looked at using both ChatGPT 3.5 and 4 for writing unit tests, the company’s specialty, and found them to be less useful than the company’s current AI approach.

Because LLMs are language models, they look for language cues in a set of code. So, for a calculator class, with a function called “add,” the model would run a test checking that function’s addition capabilities — which would be problematic if that wasn’t what the code inside that class actually did.

ChatGPT kicks the bucket down the road, making subtle errors in code generation that are even harder for the regular programmer to find. It will call programming language type identifiers, unaware that they are reserved words, which leads to compilation errors. It will also reference non-existent functions.

👁 Image

LLMs are exciting because they are large enough to support many different languages without knowing those languages specifically. But the downside is that they lose out on accuracy in achieving this generality.

“That’s essentially the trade-off going on with this kind of model,” Lodge said.

Enter Reinforcement Learning

Large language models learn from language, but this is not he way we as humans learn how to do many tasks. We don’t learn to play basketball by reading a book on basketball. Instead, we learn through trial-and-error, the process of actually playing basketball.

This sort of learning can be done through a different branch of AI, called reinforcement learning, which is basically a systematic approach to trial-and-error, Lodge explained. This is also the exact opposite approach compared to LLMs. While LLMs generalize against a large knowledge base, reinforcement learning goes step-by-step, improving accuracy with each attempt of trying something. The knowledge set is smaller but more accurate.

Lodge boasts that, by using this approach, his company DiffBlue can write unit tests for a program within a few seconds, one without any errors and with the ability to catch all regressions. This approach also works very well for code optimization.

This was the approach that Google used for AlphaGo, the AI program that it built to play the Go board games, winning a match in 2015, he explained. It would be impossible, just based on the limits of hardware, to play out all the particular moves of Go, using a brute force approach alone. So AlphaGo took a more focused approach. It narrowed down the possible choices to consider to probabilistic ones, those that could conceivably advance the computer’s position in a favorable way. Reinforcement learning is the algorithm that guides this search.

👁 Image

TRENDING STORIES
Joab Jackson is a senior editor for The New Stack, covering cloud native computing and system operations. He has reported on IT infrastructure and development for over 30 years, including stints at IDG and Government Computer News. Before that, he...
Read more from Joab Jackson
SHARE THIS STORY
TRENDING STORIES
TNS owner Insight Partners is an investor in: OpenAI.
SHARE THIS STORY
TRENDING STORIES
TNS DAILY NEWSLETTER Receive a free roundup of the most recent TNS articles in your inbox each day.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.