VOOZH about

URL: https://thenewstack.io/top-5-large-language-models-and-how-to-use-them-effectively/

⇱ Top 5 Large Language Models and How To Use Them Effectively - The New Stack


TNS
SUBSCRIBE
Join our community of software engineering leaders and aspirational developers. Always stay in-the-know by getting the most important news and exclusive content delivered fresh to your inbox to learn more about at-scale software development.
REQUIRED
It seems that you've previously unsubscribed from our newsletter in the past. Click the button below to open the re-subscribe form in a new tab. When you're done, simply close that tab and continue with this form to complete your subscription.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.
Welcome and thank you for joining The New Stack community!
Please answer a few simple questions to help us deliver the news and resources you are interested in.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Great to meet you!
Tell us a bit about your job so we can cover the topics you find most relevant.
REQUIRED
REQUIRED
REQUIRED
REQUIRED
REQUIRED
Welcome!

We’re so glad you’re here. You can expect all the best TNS content to arrive Monday through Friday to keep you on top of the news and at the top of your game.

What’s next?

Check your inbox for a confirmation email where you can adjust your preferences and even join additional groups.

Follow TNS on your favorite social media networks.

Become a TNS follower on LinkedIn.

Check out the latest featured and trending stories while you wait for your first TNS newsletter.

PREV
1 of 2
NEXT
VOXPOP
As a JavaScript developer, what non-React tools do you use most often?
Angular
0%
Astro
0%
Svelte
0%
Vue.js
0%
Other
0%
I only use React
0%
I don't use JavaScript
0%
Thanks for your opinion! Subscribe below to get the final results, published exclusively in our TNS Update newsletter:
NEW! Try Stackie AI
From clobbered drafts to real-time sync
Apr 14th 2026 10:00am, by David Moore
TypeScript 6.0 RC arrives as a bridge to a faster future
Mar 14th 2026 9:00am, by Darryl K. Taft
Mastra empowers web devs to build AI agents in TypeScript
Jan 28th 2026 11:00am, by Loraine Lawson
2025-02-27 12:03:04
Top 5 Large Language Models and How To Use Them Effectively
sponsor-singlestore,sponsored-post,
AI / Open Source

Top 5 Large Language Models and How To Use Them Effectively

LLMs hold the key to generative AI, but some are more suited than others to specific tasks. Here's a guide to the five most powerful and how to use them.
Feb 27th, 2025 12:03pm by Charles Humble
👁 Featued image for: Top 5 Large Language Models and How To Use Them Effectively
Image by Susan Holt Simpson from Unsplash. 
SingleStore sponsored this post. Insight Partners is an investor in SingleStore and TNS.
This article has been updated from when it was originally published on August 8, 2023

Modern Large Language Models (LLMs) are pre-trained on a large corpus of self-supervised textual data, then tuned to human preferences via techniques such as reinforcement learning from human feedback (RLHF).

LLMs have seen rapid advances over the last decade, particularly since the development of generative pre-trained transformers (GPTs) in 2012. Google’s BERT, introduced in 2018, represented a significant advance in capability and architecture and was followed by OpenAI’s release of GPT-3 in 2022, and GPT-4 the following year.

While open sourcing AI models is controversial given the potential for widespread abuse — from generating spam and disinformation, to misuse in synthetic biology — we have seen a number of open source alternatives which can be cheaper and as good as their proprietary counterparts.

Use Cases for LLMs

Given how new this all is, we’re still getting to grips with what may or may not be possible with the technology. But the capabilities of LLMs are undoubtedly interesting, with a wide range of potential applications in business. These include being used as chatbots in customer support settings, code generation for developers and business users, and audio transcription summarizing, paraphrasing, translation and content generation.

You can imagine, for example, that customer meetings could be both transcribed and summarized by a suitably-trained LLM in near real time, with the results shared with the sales, marketing and product teams. Or an organization’s web pages might automatically be translated into different languages. In both cases, the results would be imperfect but could be quickly reviewed and fixed by a human reviewer.

In a coding context, many popular internal development environments now support some level of AI-powered code completion, with GitHub Copilot, Sourcegraph, and CodeWhisperer among the leading examples in enterprises. Other related applications, such as natural language database querying, also show promise. LLMs might also be able to generate developer documentation from source code.

LLMs could prove useful when working with other forms of unstructured data in some industries. “In wealth management,” Madhukar Kumar, CMO of SingleStore, a relational database company, told the New Stack, “we are working with customers who have a huge amount of unstructured and structured data, such as legal documents stored in PDFs and user details in database tables, and we want to be able to query them in plain English using a Large Language Model.”

SingleStore is seeing clients using LLMs to perform both deterministic and non-deterministic querying simultaneously.

“In wealth management, I might want to say, ‘Show me the income statements of everyone aged 45 to 55 who recently quit their job,’ because I think they are right for my 401(k) product,” Kumar said.

“This requires both database querying via SQL, and the ability to work with that corpus of unstructured PDF data. This is the sort of use case we are seeing a lot.”

An emerging application of AI is for agentic systems. “We’re seeing a number of new AI companies amongst our customers who are looking to make their data immediately available to build agentic systems,” Kumar told us. “In cybersecurity, for example, you might take several live video feeds and give that to an AI to make decisions very quickly.”

Large language models have been applied to areas such as sentiment analysis. This can be useful for organizations gathering data and feedback to improve customer satisfaction. Sentiment analysis is also helpful for identifying common themes and trends in a large body of text, which may assist with both decision-making and more targeted business strategies.

As we’ve noted elsewhere, one significant challenge with using LLMs is that they make stuff up. For example, the winning solution for a benchmarking competition — organized by Meta and based on Retrieval Augmented Generation (RAG) and complex situations — was wrong about half the time. These findings are similar to those from NewsGuard, a rating system for news and information websites, which showed that the 10 leading chatbots made false claims 40% of the time and gave no answers to 22% of questions. Using RAG and a variety of other techniques can help, but eliminating errors completely looks to be impossible. In view of this, LLMs should not be used in any situation where accuracy matters.

Training an LLM from scratch remains a major undertaking, so it makes more sense to build on top of an existing model where possible. We should also note that the environmental costs of both training and running an LLM are considerable; because of this we recommend only using an LLM where there isn’t a smaller, cheaper alternative. We would also encourage you to ask the vendor or OSS project to disclose their figures for training and running the model, though at the time of writing this information is increasingly hard to obtain.

With Kumar’s help, we’ve compiled a list of what we think are the five most important LLMs at the moment. If you are looking to explore potential uses for LLMs yourself, these are the ones we think you should consider.

The Top 5 LLMs

Best ‘Reasoning’ Models

Reasoning models produce responses incrementally, simulating to a certain extent how humans grapple with problems or ideas.

OpenAI o3-mini-high

OpenAI’s o3-mini-high has been fine-tuned for STEM problems, specifically programming, math and science. As such, and with all the usual caveats that apply to benchmarking, it currently scores highest on the GPQA benchmark commonly used for comparing reasoning performance.

Developers can choose between three reasoning effort options—low, medium and high—to optimize for their specific use cases. This flexibility allows o3‑mini to ‘think harder’ when tackling complex challenges, or prioritize speed when latency is a concern. It is also OpenAI’s first small reasoning model to support function calling⁠, structured outputs⁠, and developer messages.

OpenAI no longer discloses carbon emissions, though model size does make a difference, and claimed improvements to response times imply a lower overall carbon running cost.

DeepSeek-R1

DeepSeek reasoning models were, they claim, trained on a GPU cluster a fraction of the size of any of the major western AI labs. They’ve also released a paper explaining what they did, though some of the details are sparse. The model is free to download and use under an MIT license.

R1 scores highly on the GPQA benchmark, though it is now beaten by o3-mini. DeepSeek says it has been able to do this cheaply — the researchers behind it claim it cost $6m (£4.8m) to train, a fraction of the “over $100m” alluded to by OpenAI boss, Sam Altman, when discussing GPT-4. It also uses less memory than its rivals, ultimately reducing the carbon and other associated costs for users.

DeepSeek is trained to avoid politically sensitive questions — for example, it will not give any details about the Tiananmen Square massacre on June 4, 1989.

You don’t necessarily need to stick to the version DeepSeek provides, of course. “You could use it to distill a model like Qwen 2.5 or Llama 3.1, and it is much cheaper than OpenAI,” Kumar admitted.

Best for Coding Tasks

Anthropic Claude 3.7 Sonnet

While speed of typing or lines of code have long since been debunked as a good measure of developer performance — and many experienced developers have expressed reservations about using AI-generated code — coding is one of the areas where GenAI appears to have early product market fit. It works well because mistakes are typically easy to spot or test for, meaning that the aforementioned accuracy problems are less of an issue.

While most developers will likely favor the code-completion system built into their IDE, such as JetBrains AI or GitHub Copilot, the current best-in-class on the HumanEval benchmark is Claude 3.5 Sonnet from Anthropic. “When it comes to coding, Claude is still the best,” Kumar told us. “I’ve personally used it for hours and hours, and there is very little debate around it.”

This proprietary model also scores well on agentic coding and tool use tasks. On TAU-bench, an agentic tool use task, it scores 69.2% in the retail domain, and 46% in the airline domain. It also scores 49% on SWE-bench Verified.

At the time of writing, Anthropic have just released Claude 3.7 Sonnet which, the vendor claims, “shows particularly strong improvements in coding and frontend web development.” Claude 3.7 Sonnet with extended thinking — letting you see Claude’s thought process alongside its response — is offered as part of a Pro plan. Anthropic also offers a GitHub integration across all Claude plans, allowing developers to connect code repositories directly to Claude.

Best General Purpose

Meta Llama 3.1 405b

Both OpenAI’s o3 and DeepSeek’s R1 models score highly as general purpose models, but we’re fans of the Meta Llama family of open source models which come close. It uses a Mixture of Experts (MoE) approach, which is an ensemble learning technique that scales model capacity without significantly increasing training or inference costs. MoEs can dramatically increase the number of parameters without a proportional increase in computational cost.

Llama 3.1 405b scores 88.6% on the MMLU benchmark, putting it a hair’s-breadth behind the considerably more computationally expensive alternatives.

Google Gemini Flash 2.0

Google’s experimental Gemini Flash 2.0 scores lower than Llama on the MMLU benchmark, at 76.2%, but it has other capabilities that make it interesting. It supports multimodal output like natively-generated images mixed with text and steerable text-to-speech (TTS) multilingual audio. It can also natively call tools like Google Search and code execution, as well as third-party user-defined functions. It is also impressively fast and has one of the largest context sizes of 1 million tokens.

Google is also actively exploring agentic systems through Project Astra and Project Mariner, and Flash 2.0 is built with the intention of making it particularly suitable for agentic systems.

Picking an LLM

Once you’ve drawn up a shortlist of LLMs, and identified one or two low-risk use cases to experiment with, you have the option of running multiple tests using different models to see which one works best for you — as you might do if you were evaluating an observability tool or similar.

It’s also worth considering whether you can use multiple LLMs in concert. “I think that the future will involve not just picking one, but an ensemble of LLMs that are good at different things,” Kumar told us.

Of course, none of this is particularly useful to you unless you have timely access to data. During our conversation, Kumar suggested that this was where contextual databases like SingleStore come in.

“To truly use the power of LLMs,” he said, “you need the ability to do both lexical and semantic searches, manage structured and unstructured data, handle both metadata and the vectorized data, and handle all of that in milliseconds, as you are now sitting between the end user and the LLM’s response.”

Designed for intelligent applications, SingleStore is the world’s only real-time data platform that can read, write and reason on petabyte-scale data in a few milliseconds. Insight Partners is an investor in SingleStore and TNS.
Learn More
The latest from SingleStore
TRENDING STORIES
Charles Humble is a former software engineer, architect and CTO who has worked as a senior leader and executive of both technology and content groups. He was InfoQ’s editor-in-chief from 2014-2020, and was chief editor for Container Solutions from 2020-2023....
Read more from Charles Humble
SingleStore sponsored this post. Insight Partners is an investor in SingleStore and TNS.
SHARE THIS STORY
TRENDING STORIES
Amazon Web Services and SingleStore are sponsors of The New Stack.
TNS owner Insight Partners is an investor in: Sourcegraph, SingleStore, Anthropic, OpenAI.
SHARE THIS STORY
TRENDING STORIES
TNS DAILY NEWSLETTER Receive a free roundup of the most recent TNS articles in your inbox each day.
The New Stack does not sell your information or share it with unaffiliated third parties. By continuing, you agree to our Terms of Use and Privacy Policy.