Connecting vision and language at scale
Research Paper. Trained a model to connect images and text by learning from 400M image-text pairs scraped from the internet, achieving remarkable zero-shot visual classification and becoming the backbone of most subsequent multimodal AI systems.
CLIP (Contrastive Language-Image Pre-training) learns to match images with their text descriptions. Given a batch of image-text pairs, it learns to maximize the similarity between correct pairs and minimize similarity between incorrect ones.
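The batch-level objective can be sketched as a symmetric cross-entropy over an image-text similarity matrix. This is a minimal NumPy illustration, not the paper's implementation: `image_emb` and `text_emb` stand in for the outputs of the two encoders, and the temperature value is an assumption.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N image-text pairs.

    image_emb, text_emb: arrays of shape (N, D); row i of each is a
    matched pair. Returns the average of the image->text and
    text->image cross-entropy losses.
    """
    # L2-normalize so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # logits[i, j] = similarity of image i with text j, scaled by temperature
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]
    labels = np.arange(n)  # the correct text for image i sits on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), labels].mean()

    # Classify texts given images and images given texts, then average
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Matched pairs drive the diagonal of the similarity matrix up and all off-diagonal entries down, which is what pushes correct pairs together and incorrect pairs apart.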
Trained on WIT (WebImageText), a dataset of 400 million image-text pairs from the internet.
CLIP can classify images into categories it has never been explicitly trained on. To classify an image, you encode the image and a set of text descriptions ("a photo of a dog", "a photo of a cat"), and pick the best match.
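That classification step reduces to a nearest-neighbor lookup in the shared embedding space. A hedged sketch, assuming the image and the class prompts have already been encoded (the encoder calls themselves are omitted; all names here are illustrative):

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, class_names):
    """Pick the class whose text prompt is most similar to the image.

    image_emb: shape (D,) embedding of the query image.
    class_text_embs: shape (C, D), one row per prompt such as
    "a photo of a dog". class_names: list of C labels.
    """
    # Normalize so dot products are cosine similarities
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    sims = txt @ img  # similarity of the image to each class prompt
    return class_names[int(np.argmax(sims))]
```

Because the class set is just a list of strings, swapping in new categories requires no retraining, only new prompts.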
Zero-shot CLIP matched the performance of a fully supervised ResNet-50 on ImageNet — without seeing a single ImageNet training example.