Vision Transformers (ViTs): Computer Vision with Transformer Models

Published on January 13, 2025

AI/ML

Python

Data Science

AI Technical Writer


Over the past few years, transformers have reshaped natural language processing (NLP). Models like GPT and BERT have set new benchmarks in understanding and generating human language. Now the same principle is being applied to computer vision. A recent development in the field is the vision transformer, or ViT. Introduced in the paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”, ViTs are transformer-based models proposed as an alternative to convolutional neural networks (CNNs), which have been the backbone of image-related tasks for decades. Instead of relying on convolutions, ViTs use the transformer architecture to process images. They treat image patches like words in a sentence, allowing the model to learn the relationships between these patches, just as it learns context in a paragraph of text.
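To make the "patches as words" idea concrete, here is a minimal NumPy sketch that cuts an image into non-overlapping 16x16 patches and flattens each one into a vector. The helper name `image_to_patches`, the 224x224 input size, and the random stand-in image are assumptions chosen for illustration, not code from the paper.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = height // patch_size, width // patch_size
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    grid = image.reshape(rows, patch_size, cols, patch_size, channels)
    grid = grid.transpose(0, 2, 1, 3, 4)
    # One row per patch, ordered left to right, then top to bottom.
    return grid.reshape(rows * cols, patch_size * patch_size * channels)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))   # random stand-in for a real RGB image (assumption)
patches = image_to_patches(image, patch_size=16)
print(patches.shape)                # (196, 768)
```

Under these assumptions the image becomes a "sentence" of 196 patch vectors (a 14x14 grid), each holding 16 * 16 * 3 = 768 raw pixel values.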

Unlike CNNs, ViTs divide an input image into fixed-size patches, flatten each patch into a vector, and map it to a lower-dimensional embedding with a learned linear projection (a matrix multiplication). A transformer encoder then processes these vectors as token embeddings. In this article, we’ll explore vision transformers and their main differences from convolutional neural networks. What makes ViTs particularly interesting is their ability to capture global patterns in an image: self-attention lets every patch interact with every other patch from the very first layer, whereas a CNN widens its receptive field only gradually as layers are stacked.
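The projection and attention steps can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not a full ViT: the weights are random rather than learned, the embedding size of 192 is an arbitrary choice, and the class token, position embeddings, multi-head splitting, layer normalization, and MLP blocks of a real encoder are all omitted.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

rng = np.random.default_rng(0)

# 196 flattened 16x16x3 patches; random values stand in for real pixels (assumption).
flat_patches = rng.random((196, 768))

# Linear projection: one matrix multiplication maps 768 values to embed_dim.
embed_dim = 192  # hypothetical embedding size, chosen only for illustration
w_embed = rng.normal(scale=0.02, size=(768, embed_dim))
tokens = flat_patches @ w_embed

# Random (untrained) query, key, and value weights.
w_q, w_k, w_v = (
    rng.normal(scale=embed_dim ** -0.5, size=(embed_dim, embed_dim))
    for _ in range(3)
)
attended, weights = self_attention(tokens, w_q, w_k, w_v)

print(tokens.shape)    # (196, 192)
print(weights.shape)   # (196, 196)
```

The 196x196 attention matrix is the point of interest here: each row holds one patch's weights over all patches in the image, which is what gives a single attention layer a global view.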


About the author


Shaoni Mukherjee

Author

AI Technical Writer


With a strong background in data science and over six years of experience, I am passionate about creating in-depth content on technologies. I am currently focused on AI, machine learning, and GPU computing, working on topics ranging from deep learning frameworks to optimizing GPU-based workloads.



This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


URL: https://www.digitalocean.com/community/tutorials/vision-transformer-for-computer-vision?comment=211318