
URL: https://thenewstack.io/vision-foundation-models-when-does-size-matter/

Vision Foundation Models: When Does Size Matter? - The New Stack



Vision Foundation Models: When Does Size Matter?

Large vision models may seem attractive, but domain-specific models can get you farther.
Mar 27th, 2024 10:00am by Heather D. Couture
Featured image via Pixabay.

It may seem like AI is at the peak of its hype cycle, but some application areas are just getting started. Large language models (LLMs) stole our attention a little over a year ago, but the enabling technology has been incubating for years. Now, the lessons we’ve learned from LLMs are trickling into other areas, leaving them well-positioned for their own advancements.

Computer vision is one such area. Just as foundation models like GPT set the stage for chatbots and various other language applications, image-based foundation models are enabling a revolution in advanced image analysis, from personalized medicine to precision agriculture to industrial automation.

While early LLMs had fewer than 1 billion parameters, today’s largest models, such as GPT, Bard and Llama, have tens to hundreds of billions of parameters, and some are reported to exceed 1 trillion. The largest computer vision models, like DINOv2 and Segment Anything, top out around 1 billion parameters. They’re not yet as large as LLMs, but they are heading in that direction.
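To get a feel for where a figure like 1 billion comes from, a common back-of-the-envelope estimate for a Vision Transformer is 12 × depth × width², counting only the attention and MLP weights in each block. The sketch below applies that estimate to the standard ViT-B and ViT-L configurations and to a 40-layer, 1,536-wide configuration, which is assumed here to approximate the largest DINOv2 backbone. The function name and the simplifications (no biases, layer norms or embeddings, a 4x MLP expansion) are illustrative assumptions, so treat the results as rough orders of magnitude.

```python
def approx_vit_params(depth, width):
    """Rough transformer-encoder parameter count: 12 * depth * width^2.

    Per block: 4 * width^2 for attention (Q, K, V and output projections)
    plus 8 * width^2 for an MLP with 4x expansion. Biases, layer norms and
    patch/position embeddings are ignored, so this slightly undercounts.
    """
    return 12 * depth * width ** 2


# (depth, width) pairs. The last entry is an assumption made for this
# sketch, not a figure taken from the article.
configs = {
    "ViT-B": (12, 768),
    "ViT-L": (24, 1024),
    "ViT-g (assumed)": (40, 1536),
}

estimates = {name: approx_vit_params(d, w) for name, (d, w) in configs.items()}

for name, count in estimates.items():
    print(f"{name}: ~{count / 1e6:.0f}M parameters")
```

Under these assumptions, the largest configuration lands a little above 1.1 billion weights, while the base model stays under 100 million.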

Training such a large model requires an enormous amount of training data. For example, DINOv2 was trained with 142 million images. Thanks to advances in self-supervised learning, the training data does not even need to be labeled. Massive amounts of unlabeled data are all that is needed to learn patterns.
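One way to see how unlabeled images can still provide a training signal is a pretext task, where the label is manufactured from the image itself. The sketch below uses rotation prediction: each image is rotated by 0, 90, 180 and 270 degrees, and the rotation index becomes the target. This is a simpler, older technique swapped in for illustration. It is not how DINOv2 is trained (that model uses a self-distillation objective), and the function names are hypothetical.

```python
def rotate90(image):
    """Rotate a 2D list-of-lists image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def make_pretext_examples(images):
    """Turn unlabeled images into (rotated_image, rotation_index) pairs.

    No human annotation is involved: the target is the number of
    quarter turns that were applied to the image.
    """
    examples = []
    for image in images:
        view = image
        for quarter_turns in range(4):
            examples.append((view, quarter_turns))
            view = rotate90(view)
    return examples


# A toy 2x2 "image" standing in for real pixel data.
unlabeled = [[[1, 2],
              [3, 4]]]

examples = make_pretext_examples(unlabeled)
for view, target in examples:
    print(target, view)
```

A model trained to predict the rotation has to pick up on structure in the image, which is the kind of pattern learning that later transfers to labeled tasks.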

For general-purpose applications, large training sets and large models are paving the way for new utility. They can be easily adapted for classification, detection or segmentation tasks on many different types of imagery.

In many ways, bigger is better.

The Problem with Large Models

The problem comes when you take a massive general-purpose model and apply it to data that looks different: images that contain different patterns. Instead of faces, buildings and street signs, perhaps it’s roads and trees viewed from a drone or satellite above. Or it could be cells and glands seen through a microscope. Or parts on a manufacturing line.

To apply an existing foundation model to one of these examples, you need to fine-tune it, that is, adapt its weights to a particular task, perhaps distinguishing tumors from benign tissue. Given a few thousand examples of each class, the weights of the large foundation model can be adjusted, which takes significantly less data than learning the task from scratch.
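As a rough illustration of that adjustment step, the sketch below starts from a set of "pretrained" weights and nudges them with a few rounds of gradient descent on a small labeled set. A single logistic layer stands in for the vision model, and the weights, features and labels are made-up values chosen for the example, so this is a toy under stated assumptions and not a recipe for any particular library.

```python
import math


def predict(weights, features):
    """Probability of the positive class from a single logistic layer."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(weights, examples):
    """Mean binary cross-entropy over (features, label) pairs."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for features, label in examples:
        p = predict(weights, features)
        total -= label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
    return total / len(examples)


def fine_tune(pretrained_weights, examples, learning_rate=0.1, epochs=50):
    """Continue training from the pretrained weights on task-specific data."""
    weights = list(pretrained_weights)  # copy so the original stays intact
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for features, label in examples:
            error = predict(weights, features) - label
            for i, x in enumerate(features):
                grads[i] += error * x
        weights = [w - learning_rate * g / len(examples)
                   for w, g in zip(weights, grads)]
    return weights


# Hypothetical starting weights and a tiny labeled set (last feature = bias).
pretrained = [0.8, -0.3, 0.1]
examples = [
    ([1.0, 0.2, 1.0], 1),
    ([0.9, -0.1, 1.0], 1),
    ([-0.8, 0.3, 1.0], 0),
    ([-1.0, 0.1, 1.0], 0),
]

tuned = fine_tune(pretrained, examples)
loss_before = log_loss(pretrained, examples)
loss_after = log_loss(tuned, examples)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

Because training starts from weights that already point in a useful direction, a handful of examples and a small learning rate are enough to lower the loss on the new task.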

When you fine-tune a general-purpose vision model on related imagery, it converges quickly to a good model on your downstream task. But on different imagery, your model is much more likely to overfit. This means that it will perform well on your training set but make mistakes on unseen images.

This is because the large foundation model looks for many different patterns in the images, and some of them may happen to be related to the downstream task on the small training set. But the same patterns do not enable correct predictions on unseen data. They are just spurious correlations.

This is much more likely to happen with a large model trained on disparate imagery.
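The effect can be reproduced with pure noise. In the sketch below, every feature and every label is an independent coin flip, so no feature carries real signal. With 1,000 candidate features and only 10 training examples, at least one feature is very likely to line up with the training labels by chance, yet it does no better than guessing on fresh data. The feature count, sample sizes and function names are assumptions chosen for the demonstration.

```python
import random


def make_noise_data(rng, n_examples, n_features):
    """Binary features and labels drawn independently: no real signal."""
    xs = [[rng.getrandbits(1) for _ in range(n_features)]
          for _ in range(n_examples)]
    ys = [rng.getrandbits(1) for _ in range(n_examples)]
    return xs, ys


def accuracy(feature_index, xs, ys):
    """Fraction of examples where the chosen feature equals the label."""
    return sum(row[feature_index] == y for row, y in zip(xs, ys)) / len(ys)


def best_single_feature(train_x, train_y):
    """Pick the feature that agrees most often with the training labels."""
    n_features = len(train_x[0])
    return max(range(n_features),
               key=lambda j: accuracy(j, train_x, train_y))


rng = random.Random(0)  # fixed seed so the run is repeatable
n_features = 1000       # many candidate "patterns", like a large model
train_x, train_y = make_noise_data(rng, 10, n_features)   # tiny training set
test_x, test_y = make_noise_data(rng, 500, n_features)    # unseen data

chosen = best_single_feature(train_x, train_y)
train_acc = accuracy(chosen, train_x, train_y)
test_acc = accuracy(chosen, test_x, test_y)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The more candidate patterns there are and the fewer training examples you have, the easier it is to find one that fits the training set perfectly and still fails on unseen images.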

Small Vision Models to the Rescue

How do you solve this? You need to build a model that learns the patterns in your unique imagery. The patterns that are meaningful for downstream tasks on that same modality of images.

You likely don’t have a massive number of images available, so you can’t build a large vision model. But you can build a perfectly good small- or medium-sized vision model.

This domain-specific foundation model will be suitable for various downstream tasks on your imagery with just a little fine-tuning. It won’t be very helpful for other types of images — but you don’t need it to be.

Size does matter. But bigger isn’t necessarily better. For niche applications, adapt your model to the data you have available. A smarter, focused model will get you much farther than a large clunky one that looks at the wrong patterns.

Heather D. Couture is a consultant and founder of Pixel Scientia Labs, where she partners with mission-driven founders and R&D teams to support applications of computer vision for people and planetary health. She has a PhD in Computer Science and...