Connecting vision and language at scale
Research Paper. Trained a model to connect images and text by learning from 400M image-text pairs scraped from the internet, achieving remarkable zero-shot visual classification and becoming the backbone of most subsequent multimodal AI systems.
CLIP (Contrastive Language-Image Pre-training) learns to match images with their text descriptions. Given a batch of image-text pairs, it learns to maximize the similarity between correct pairs and minimize similarity between incorrect ones.
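The batch-level objective can be sketched as a symmetric cross-entropy over an image-text similarity matrix. This is a minimal NumPy illustration, not the paper's implementation: `image_emb` and `text_emb` stand in for the outputs of the two encoders, and the temperature value is an assumption.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N image-text pairs.

    image_emb, text_emb: arrays of shape (N, D); row i of each is a
    matched pair. Returns the average of the image->text and
    text->image cross-entropy losses.
    """
    # L2-normalize so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # logits[i, j] = similarity of image i with text j, scaled by temperature
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]
    labels = np.arange(n)  # the correct text for image i sits on the diagonal

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), labels].mean()

    # Classify texts given images and images given texts, then average
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Matched pairs drive the diagonal of the similarity matrix up and all off-diagonal entries down, which is what pushes correct pairs together and incorrect pairs apart.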
Trained on WIT (WebImageText), a dataset of 400 million image-text pairs from the internet.
CLIP can classify images into categories it has never been explicitly trained on. To classify an image, you encode the image and a set of text descriptions ("a photo of a dog", "a photo of a cat"), and pick the best match.
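That classification step reduces to a nearest-neighbor lookup in the shared embedding space. A hedged sketch, assuming the image and the class prompts have already been encoded (the encoder calls themselves are omitted; all names here are illustrative):

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, class_names):
    """Pick the class whose text prompt is most similar to the image.

    image_emb: shape (D,) embedding of the query image.
    class_text_embs: shape (C, D), one row per prompt such as
    "a photo of a dog". class_names: list of C labels.
    """
    # Normalize so dot products are cosine similarities
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    sims = txt @ img  # similarity of the image to each class prompt
    return class_names[int(np.argmax(sims))]
```

Because the class set is just a list of strings, swapping in new categories requires no retraining, only new prompts.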
Zero-shot CLIP matched the performance of a fully supervised ResNet-50 on ImageNet — without seeing a single ImageNet training example.