© 2026 Silvia Seceleanu

Models · OpenAI · Feb 2021

11. Learning Transferable Visual Models From Natural Language Supervision (CLIP)

Connecting vision and language at scale

Research Paper
Summary

Trained on 400M image-text pairs scraped from the internet, CLIP learns to connect images and text, achieving remarkable zero-shot image classification and becoming the backbone of most subsequent multimodal AI systems.

Key Concepts

Learns to match images with text descriptions via contrastive learning on pairs

CLIP (Contrastive Language-Image Pre-training) learns to match images with their text descriptions. Given a batch of image-text pairs, it learns to maximize the similarity between correct pairs and minimize similarity between incorrect ones.
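The objective can be sketched in a few lines: compute all pairwise cosine similarities in a batch, then apply a cross-entropy loss in both directions so each image picks out its own caption and vice versa. This is a minimal NumPy sketch, not the paper's implementation; the function name and the fixed temperature value are illustrative (CLIP learns its temperature during training).

```python
import numpy as np

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of N matched image-text pairs.

    image_embs, text_embs: (N, d) arrays; row i of each forms a correct pair.
    """
    # L2-normalize so dot products are cosine similarities
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)

    # (N, N) similarity matrix; entry [i, j] = sim(image_i, text_j)
    logits = img @ txt.T / temperature

    n = logits.shape[0]
    labels = np.arange(n)  # correct pairs sit on the diagonal

    def cross_entropy(lg, targets):
        # numerically stable log-softmax per row, then pick the target entry
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(lg.shape[0]), targets].mean()

    # Symmetric: classify images over texts, and texts over images
    return (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2
```

Minimizing this loss pushes diagonal (correct) similarities up and off-diagonal (incorrect) similarities down, which is exactly the matching behavior described above.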

Trained on 400M image-text pairs scraped from the internet (WIT dataset)

Trained on WIT (WebImageText), a dataset of 400 million image-text pairs from the internet.

Classifies images into categories it was never explicitly trained on

CLIP can classify images into categories it has never been explicitly trained on. To classify an image, you encode the image and a set of text descriptions ("a photo of a dog", "a photo of a cat"), and pick the best match.
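That zero-shot procedure reduces to a nearest-neighbor search in the shared embedding space. Here is a minimal sketch assuming the image and prompt embeddings have already been produced by CLIP's two encoders; the function name and argument layout are illustrative, not the library's API.

```python
import numpy as np

def zero_shot_classify(image_emb, class_prompts, text_embs):
    """Pick the class whose prompt embedding best matches the image.

    image_emb:     (d,) embedding of the image to classify.
    class_prompts: list of N prompt strings, e.g. "a photo of a dog".
    text_embs:     (N, d) embeddings of those prompts, row-aligned.
    """
    # Normalize so dot products are cosine similarities
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                       # one similarity score per class
    return class_prompts[int(np.argmax(sims))]
```

Because the "classifier" is just a set of text prompts, swapping in new categories requires no retraining, only new prompt embeddings.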

Zero-shot CLIP matched a fully supervised ResNet-50 on ImageNet

Zero-shot CLIP matched the performance of a fully supervised ResNet-50 on ImageNet — without seeing a single ImageNet training example.

Connections

Influenced by
8. Language Models are Few-Shot Learners (GPT-3)
May 2020
Influences
14. DALL-E 2: Hierarchical Text-Conditional Image Generation with CLIP Latents
Apr 2022
21. GPT-4V(ision) System Card
Sep 2023