
URL: https://thenewstack.io/where-ai-benchmarks-fall-short-and-how-to-evaluate-models-instead/

Where AI Benchmarks Fall Short, and How To Evaluate Models Instead - The New Stack


Where AI Benchmarks Fall Short, and How To Evaluate Models Instead 

2025 will be the year when organizations increasingly look to gain value from the models they’ve invested so heavily in.
Feb 8th, 2025 6:00am by Victor Botev

Enterprises face an overwhelming array of large language models (LLMs) from which to choose. With new releases like Meta’s Llama 3.3 alongside models like Google’s Gemma and Microsoft’s Phi, the choices have never been so varied. When you scratch below the surface, the choices also become complex.

For businesses looking to leverage LLMs, chatbots, and agentic systems, the challenge is to evaluate which model aligns with their unique requirements, cutting through the noise of traditional benchmarks and superficial metrics.

The Flaws of Standard Metrics

While most evaluation metrics are academically robust, they fail to account for businesses’ nuanced needs. Metrics like perplexity and BLEU (Bilingual Evaluation Understudy) are commonly used in research to measure predictive accuracy or alignment with reference texts. However, their practical utility for enterprises is limited.

Take perplexity, for instance. Though it is designed to assess a model’s ability to predict sample text, it says little about how well that model can process industry-specific jargon, interpret complex relationships, or provide actionable insights for expert domains. Similarly, BLEU, which was initially developed for machine translation, often rewards models for strict adherence to reference outputs. This can penalize creativity and flexibility in areas where dynamic responses are critical. A chatbot scoring highly on BLEU might rigidly follow predefined scripts but fail to handle nuanced customer queries effectively.
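To make the two metrics concrete, here is a minimal sketch in Python. It assumes perplexity is computed from the per-token probabilities a model assigns to a sample text, and it replaces full BLEU with a simplified stand-in: clipped unigram precision only, without the higher-order n-grams and brevity penalty that real BLEU uses. The function names, the probabilities, and the sentences are all hypothetical.

```python
import math
from collections import Counter

def perplexity(token_probs):
    """Perplexity: exp of the average negative log-probability per token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

def clipped_unigram_precision(candidate, reference):
    """Simplified BLEU-style score: share of candidate words that also appear
    in the reference, with each word's count clipped to its reference count."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(count, ref[word]) for word, count in cand.items())
    return overlap / sum(cand.values())

# Hypothetical probabilities a model assigned to four tokens of sample text.
print(perplexity([0.25, 0.25, 0.25, 0.25]))

# Hypothetical customer-service reference answer and two candidate replies.
reference = "your refund will be processed within five business days"
scripted = "your refund will be processed within five business days"
paraphrase = "expect the money back in your account by next week"

print(clipped_unigram_precision(scripted, reference))
print(clipped_unigram_precision(paraphrase, reference))
```

The scripted reply matches the reference word for word and scores far higher than the paraphrase, even though a customer might find both acceptable. Neither number says anything about whether the reply is correct for the business.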

Businesses often find themselves disappointed by models that, on paper, should perform well because they excel in these metrics. In reality, the models fall short when applied to real-world challenges.

The Synthetic Data Problem

Another significant hurdle stems from the reliance of many open source models on synthetic training data. Synthetic datasets, often generated by widely used LLMs such as GPT-4, enable faster development cycles but can introduce systemic biases. If GPT-4’s outputs fail to capture the nuances of legal texts, models trained on those outputs will likely miss these complexities too.

This reliance on synthetic data creates the risk of feedback loops, where models trained on such datasets mimic patterns and biases from the original generator rather than developing genuine understanding. The issue is exacerbated by LLM-as-a-judge evaluation, which can reinforce the same biases because many judge models are themselves trained on synthetic data.

Businesses may mistakenly trust these models based on seemingly strong evaluation scores, only to discover later that they lack the depth needed for specialized tasks. For most enterprises, the solution lies in fine-tuning models with domain-specific data. Models trained on bespoke datasets can demonstrate vastly improved performance in specialized tasks. However, fine-tuning is resource-intensive and requires access to high-quality data, making it a challenging but necessary step for many organizations.

Context Sensitivity

Different models exhibit varying strengths and weaknesses regarding context sensitivity, a crucial factor for business applications. For instance, Meta’s Llama models are adept at maintaining contextual understanding over prolonged interactions. They are well-suited for use cases requiring extended reasoning, such as legal or medical analysis.

By contrast, Google’s Gemma models excel in general-purpose tasks but struggle with applications requiring deep, domain-specific expertise. Similarly, while strong in creative and exploratory tasks, Microsoft’s Phi models can sometimes deviate from strict instructions. This can be an advantage in some contexts but also a liability in industries where regulatory compliance is critical. To accurately assess each model’s value, any evaluation framework must account for each model’s nuances and tendencies.

Developing an Effective Evaluation Framework

Models should also be evaluated based on scenarios that reflect the organization’s specific use cases and capabilities. For instance, a financial institution might prioritize testing a model’s ability to analyze regulatory filings, ensuring it can handle the dense, structured language common in compliance documents. Similarly, a healthcare provider may need to focus on the model’s capacity to interpret clinical notes, often requiring an understanding of medical terminology and patient-specific context. Tailoring evaluation scenarios to align with these practical applications ensures the chosen model delivers meaningful results to users with deep domain expertise.
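As an illustration of scenario-based evaluation, the sketch below scores a model against a handful of use-case prompts. Everything in it is an assumption made for the example: the `Scenario` structure, the `run_scenarios` helper, the stub model, and the use of required terms as a crude proxy for what a domain expert would expect in an answer. In practice the scoring would more likely come from expert review or a task-specific check.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    name: str
    prompt: str
    required_terms: list[str]  # terms a domain expert expects in the answer

def run_scenarios(model: Callable[[str], str],
                  scenarios: list[Scenario]) -> dict[str, float]:
    """Return, per scenario, the fraction of required terms found in the answer."""
    results = {}
    for scenario in scenarios:
        answer = model(scenario.prompt).lower()
        hits = sum(term.lower() in answer for term in scenario.required_terms)
        results[scenario.name] = hits / len(scenario.required_terms)
    return results

def stub_model(prompt: str) -> str:
    # Stand-in for a real model call; always returns the same canned answer.
    return "The filing discloses a going concern risk."

# Hypothetical scenarios for a financial institution and a healthcare provider.
scenarios = [
    Scenario("regulatory_filing",
             "Summarize the key risks disclosed in the attached filing.",
             ["going concern", "liquidity"]),
    Scenario("clinical_note",
             "List the medications mentioned in the attached note.",
             ["metformin"]),
]

scores = run_scenarios(stub_model, scenarios)
print(scores)
```

Swapping `stub_model` for a function that calls each candidate model lets the same scenario set be reused across models, so the comparison stays tied to the organization's own use cases.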

Organizations should avoid over-reliance on synthetic data during testing. Instead, they should adopt a balanced approach, using a mix of real-world and domain-specific datasets. This method helps uncover potential biases that might go unnoticed and ensures the model can manage the intricacies and variability of actual business environments. Real-world data offers a more accurate reflection of a model’s challenges in practice, leading to better long-term performance and reliability.
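One way to surface the kind of bias described above is to break test results down by where each test example came from. The sketch below assumes each graded result is tagged with a source label. The labels and the figures are invented for illustration and are not measurements of any model.

```python
from collections import defaultdict

def accuracy_by_source(records):
    """records: iterable of (source, is_correct) pairs.
    Returns the accuracy for each source label."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for source, is_correct in records:
        totals[source] += 1
        correct[source] += int(is_correct)
    return {source: correct[source] / totals[source] for source in totals}

# Invented results: 10 synthetic test items and 10 real-world test items.
records = (
    [("synthetic", True)] * 9 + [("synthetic", False)] * 1
    + [("real_world", True)] * 6 + [("real_world", False)] * 4
)

by_source = accuracy_by_source(records)
gap = by_source["synthetic"] - by_source["real_world"]
print(by_source)
print(f"synthetic minus real-world accuracy: {gap:.2f}")
```

A large gap between the two groups suggests that the headline score is being carried by the synthetic portion of the test set.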

Once deployed, model performance should be continuously monitored to identify and address any deviations from expected behavior. Real-world testing in production environments provides invaluable insights into how a model adapts to dynamic conditions. By regularly reviewing outputs and performance metrics, organizations can make iterative improvements and refine their AI systems, ensuring they remain aligned with evolving business needs.
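Continuous monitoring can start as simply as tracking the pass rate of reviewed outputs over a rolling window. The sketch below is one possible shape for that. The `RollingMonitor` class, the window size, and the threshold are hypothetical choices, and the code assumes each production output has already been marked as passing or failing by a reviewer or an automated check.

```python
from collections import deque

class RollingMonitor:
    """Tracks pass/fail outcomes over the most recent `window` outputs."""

    def __init__(self, window: int, min_pass_rate: float):
        self.results = deque(maxlen=window)
        self.min_pass_rate = min_pass_rate

    def record(self, passed: bool) -> bool:
        """Record one reviewed output; return True if an alert should fire."""
        self.results.append(passed)
        window_full = len(self.results) == self.results.maxlen
        pass_rate = sum(self.results) / len(self.results)
        # Only alert once the window is full, to avoid noise from tiny samples.
        return window_full and pass_rate < self.min_pass_rate

monitor = RollingMonitor(window=5, min_pass_rate=0.8)
outcomes = [True, True, True, True, True, False, False, True]
alerts = [monitor.record(outcome) for outcome in outcomes]
print(alerts)
```

With these invented outcomes, the alert fires only after the second failure pushes the five-output pass rate below the threshold.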

Finally, retrieval-augmented generation (RAG) techniques can be particularly beneficial in business contexts, improving the reliability of model outputs by integrating external knowledge. Evaluating a model’s ability to incorporate this external data into its responses is critical for understanding its practical utility. Strong performance in context evaluation provides reassurance that the model can adapt effectively to complex, information-rich scenarios and deliver outputs that align with the nuances of specific business requirements.
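A rough way to check whether a response actually draws on the retrieved material is to measure how much of its vocabulary appears in the retrieved passages. The sketch below assumes plain word overlap as the grounding signal, which is a crude proxy: it cannot tell a faithful answer from one that reuses the right words to say something wrong. The function names and example texts are hypothetical.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric words in the text, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def grounding_score(answer: str, retrieved_passages: list[str]) -> float:
    """Fraction of distinct answer words found in at least one retrieved passage."""
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    context_tokens = set().union(*(tokens(p) for p in retrieved_passages))
    return len(answer_tokens & context_tokens) / len(answer_tokens)

# Hypothetical retrieved passage and two candidate answers.
passages = ["The notice period for contract termination is 30 days."]
grounded = "The notice period is 30 days."
ungrounded = "Termination requires 90 days and a penalty fee."

print(grounding_score(grounded, passages))
print(grounding_score(ungrounded, passages))
```

A low score flags a response for closer review. A high score shows only that the wording overlaps with the sources, so it complements expert checks rather than replacing them.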

2025 will be the year when organizations increasingly look to gain value from the models they’ve invested so heavily in. Trusting that the outputs will be accurate, and having sufficient expertise to verify them, will be key here. Businesses must approach model evaluation with care and precision. Publicly available benchmarks may offer a starting point. Still, real-world success requires a more nuanced strategy prioritizing domain-specific needs, diverse data testing, and a deep understanding of context sensitivity.

Victor Botev is the co-founder and CTO of Iris.ai, a leading provider of AI engines for deep knowledge and textual understanding. With a background in AI research and software development, Victor drives the creation of Iris.ai's tools to enhance AI...