
URL: https://www.analyticsvidhya.com/blog/2021/01/openais-future-of-vision-contrastive-language-image-pre-trainingclip/




OpenAI’s Future of Vision: Contrastive Language Image Pre-training (CLIP)

tanishq Last Updated : 13 Jan, 2021
5 min read

Introduction

2021 has begun with a bang! OpenAI has released two major innovations in the field of Computer Vision: CLIP and DALL-E.

The CLIP network takes an interesting and possibly game-changing approach to image classification: it uses contrastive pre-training to perform zero-shot learning, similar to the zero-shot capabilities of GPT-3.

CLIP lets us design our own classifiers without any task-specific training data, while still achieving competitive results across a wide range of computer vision tasks.

Before understanding how CLIP works, let’s see what OpenAI is aiming to solve.

We have seen major improvements in computer vision that solve a multitude of problems, but each comes with drawbacks, such as:

  • Many current vision models, such as ResNet and InceptionNet, achieve human-level performance on complex image classification datasets. However, they rely on large labeled datasets, which are difficult to create.
  • Even though current state-of-the-art models perform extremely well on datasets like ImageNet, their accuracy drops drastically on variants of the benchmark or on out-of-distribution data. They have been optimized for the benchmark and often fail to perform in real-life scenarios.

OpenAI aims to solve these problems of large datasets and poor real-life performance with CLIP. CLIP has shown strong results not only on image classification but also on other vision tasks such as object classification, action recognition in videos, and OCR. This shows that a single algorithm like CLIP can work across a variety of tasks and datasets without the need to build huge labeled datasets, although it is computationally expensive.

CLIP also plays a vital role in the working of DALL-E, so make sure to read all about DALL-E in the upcoming blog!!

Overview of The Algorithm

The team at OpenAI has incorporated several state-of-the-art methods, such as zero-shot transfer, natural language supervision, and multimodal learning. Let’s begin with a high-level overview of how CLIP works.

It starts off with a batch of text and image pairs, which can easily be found on the internet. The texts and images are passed into a text encoder and an image encoder respectively, and the similarity between every image and every text in the batch is computed. The model is trained to match each image to its corresponding text out of the entire batch, giving high similarity to the correct pairs and low similarity to all the others. This alignment of image and text is the contrastive pre-training approach. A similar approach has been implemented in the ConVIRT paper in the field of medical imaging.
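To make the contrastive objective more concrete, here is a minimal NumPy sketch of a symmetric cross-entropy loss over a batch similarity matrix. It is an illustration under assumptions, not OpenAI's implementation: the embeddings are random stand-ins for encoder outputs, the function names are our own, and the temperature is fixed at 0.07 for simplicity.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Scale each embedding to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an N x N image-text similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to be a true pair.
    """
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    logits = image_emb @ text_emb.T / temperature  # shape [N, N]
    diag = np.arange(logits.shape[0])
    # Each image should pick its own text (rows) and each text its own image (columns).
    loss_image_to_text = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_image_to_text + loss_text_to_image) / 2

# Random stand-ins for encoder outputs (assumption: 4 pairs, 8 dimensions).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))
aligned_images = text_emb + 0.01 * rng.normal(size=(4, 8))
shuffled_images = aligned_images[[1, 2, 3, 0]]  # every image paired with the wrong text

aligned_loss = contrastive_loss(aligned_images, text_emb)
shuffled_loss = contrastive_loss(shuffled_images, text_emb)
print(aligned_loss, shuffled_loss)
```

The loss is low when each image embedding sits next to its own text embedding and high when the pairs are shuffled, which is the signal that pulls matching images and texts together during pre-training.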

Once the model has learned to match images and texts, zero-shot prediction can be performed. All the classes in the dataset are written in a fixed format like “a photo of a {classname}” and fed into the text encoder. As in contrastive pre-training, the image is passed to the image encoder and compared against every prompt to determine which text matches it best. In other words, the text side holds prompts such as “a photo of a dog”, “a photo of a car”, etc., and CLIP estimates which of them pairs best with the given image. For example, in OpenAI’s results the correct class guacamole ranked 1 out of 101 classes and television ranked 1 out of 397.
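The zero-shot step can be sketched in a few lines of Python. Everything below is a toy under stated assumptions: `toy_text_encoder` is a hypothetical bag-of-words stand-in for CLIP's real text encoder, and the image embedding is faked by adding a little noise to one of the prompt embeddings. Only the flow (fill in the prompt template, embed, compare, rank) mirrors the description above.

```python
import numpy as np

def build_prompts(class_names, template="a photo of a {}"):
    # Turn bare class names into sentences, as described in the text.
    return [template.format(name) for name in class_names]

def toy_text_encoder(prompts, vocab):
    """Hypothetical stand-in for a text encoder: bag-of-words counts."""
    emb = np.zeros((len(prompts), len(vocab)))
    for row, prompt in enumerate(prompts):
        for word in prompt.split():
            emb[row, vocab[word]] += 1.0
    return emb

def zero_shot_classify(image_emb, text_emb, temperature=0.07):
    # Cosine similarity between one image and every prompt, then a softmax.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = text_emb @ image_emb / temperature
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

class_names = ["dog", "car", "guacamole"]
prompts = build_prompts(class_names)
words = sorted({word for prompt in prompts for word in prompt.split()})
vocab = {word: index for index, word in enumerate(words)}
text_emb = toy_text_encoder(prompts, vocab)

# Fake "image" embedding: the guacamole prompt embedding plus small noise.
rng = np.random.default_rng(0)
image_emb = text_emb[2] + 0.05 * rng.normal(size=text_emb.shape[1])

probs = zero_shot_classify(image_emb, text_emb)
ranking = [class_names[i] for i in np.argsort(-probs)]
print(ranking)
```

Swapping in a different list of class names is all it takes to define a new classifier here; no retraining step appears anywhere in the sketch.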

We have already seen that models capable of zero-shot transfer, such as GPT-3, are computationally expensive. CLIP uses a few different approaches to tackle this. The first, which we have already seen, is the contrastive pre-training objective, which requires significantly less computation. The second is the use of a Vision Transformer, which further increases efficiency over standard vision models like ResNet.

This zero-shot learning approach coupled with natural language supervision is what differentiates CLIP from other vision models. Because it is trained on a wide variety of data easily accessible on the internet, without directly optimizing for any benchmark, CLIP is much more general and representative.

In OpenAI’s efficiency comparison, CLIP reaches the same zero-shot accuracy as the language model baseline after processing roughly 33M images, compared to about 400M for the baseline. That makes CLIP about 12 times more efficient (400M / 33M ≈ 12)!
As a result of this methodology, CLIP can easily be applied to nearly any visual classification task and achieve strong performance.

Now you can see that the team at OpenAI has addressed many of the problems of current vision models. CLIP reduces the need for the labor-intensive large datasets that SOTA computer vision usually requires by learning from text–image pairs that are already publicly available. It also removes the restriction to a limited, fixed set of visual concepts.

Did you know the ImageNet dataset required 25,000 workers to annotate 14 million images for 22,000 object categories? That’s a lot of work!

Imagine using a pre-trained ImageNet model on a specific dataset of your choice. You would need to build a dataset from scratch and fine-tune your model. All CLIP requires is for you to pass the names of your task’s visual concepts into the text encoder, and it will output a linear classifier of the visual representations.

One thing to note is that zero-shot CLIP can match the accuracy of strong supervised models such as ResNet-50 on ImageNet. OpenAI also tested fitting a linear classifier on top of CLIP’s features: this boosts accuracy on ImageNet by almost 10%, but the resulting classifier fails to generalize well to other variations of ImageNet.
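For readers unfamiliar with the idea of a linear classifier on top of frozen features, the sketch below shows the general technique with plain NumPy. It is not OpenAI's evaluation code: the "features" are synthetic clusters standing in for CLIP image embeddings, and the function names, learning rate, and step count are our own assumptions.

```python
import numpy as np

def fit_linear_probe(features, labels, num_classes, lr=0.1, steps=200):
    """Softmax regression trained by gradient descent; the features stay fixed."""
    num_samples, dim = features.shape
    weights = np.zeros((dim, num_classes))
    bias = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - one_hot) / num_samples  # gradient of mean cross-entropy
        weights -= lr * features.T @ grad
        bias -= lr * grad.sum(axis=0)
    return weights, bias

def predict(features, weights, bias):
    return np.argmax(features @ weights + bias, axis=1)

def make_features(rng, centers, samples_per_class):
    # Synthetic stand-in for frozen image features: one noisy cluster per class.
    labels = np.repeat(np.arange(len(centers)), samples_per_class)
    noise = 0.5 * rng.normal(size=(labels.size, centers.shape[1]))
    return centers[labels] + noise, labels

rng = np.random.default_rng(0)
centers = 4.0 * np.eye(3, 16)  # three well-separated class centers in 16 dimensions
train_x, train_y = make_features(rng, centers, samples_per_class=50)
test_x, test_y = make_features(rng, centers, samples_per_class=20)

weights, bias = fit_linear_probe(train_x, train_y, num_classes=3)
test_accuracy = (predict(test_x, weights, bias) == test_y).mean()
print(test_accuracy)
```

Only `weights` and `bias` are learned; the encoder that produced the features is never updated. This is also why such a probe can fit one dataset closely without carrying over to shifted variants of it.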

Is CLIP All Good?

CLIP currently has many limitations. Significant work is still needed to improve its task learning and transfer capabilities. While scaling has so far steadily improved performance, a very large increase in compute would be required for zero-shot CLIP to reach overall state-of-the-art performance, and training at that scale is infeasible with current hardware.

When compared to task-specific models, CLIP has shown poor performance on several types of fine-grained classification problems, such as differentiating models of cars or species of flowers. Another limitation is that CLIP still generalizes poorly to data that is truly out-of-distribution for it. Taking the example of MNIST, a simple logistic regression baseline outperforms zero-shot CLIP. The hope behind CLIP is that training on a large and varied dataset makes all data effectively in-distribution, but as MNIST demonstrates, this assumption is easy to violate.

As we have seen, CLIP can flexibly generate zero-shot classifiers for a wide variety of datasets, but it is still limited to choosing among the concepts in a given zero-shot classifier. In contrast, an approach like image captioning can generate novel outputs.

Ending Notes

Kudos to the research team at OpenAI! CLIP has introduced a really interesting and flexible approach to tackling computer vision problems. It overcomes problems faced by many of today’s vision models and approaches, and it does so with flying colors. Its ability to tackle such a wide range of vision problems and still produce strong results is no small feat. OpenAI has released both the research paper and the code. Feel free to check them out!

We have seen what CLIP has been able to achieve, and it blew our minds. But that’s not the end of it: the release of DALL-E introduces us to a new era of Artificial Intelligence, and CLIP plays a vital role in it. Stay tuned for our blog post on DALL-E, one of the biggest breakthroughs in Computer Vision in recent years!!

Let us know your thoughts on CLIP in the comment section below.
