In January 2021, OpenAI released CLIP (Contrastive Language-Image Pre-Training), a zero-shot classifier that uses its knowledge of the English language to classify images without being trained on a task-specific dataset. It brings recent advances in large-scale transformers such as GPT-3 to computer vision.
The results are impressive. We have put together a CLIP tutorial and a CLIP Colab notebook so you can experiment with the model on your own images. We made slight modifications to simplify prompt engineering by extracting the prompts into a configuration file, and we have automatically generated starter prompts for all of our public datasets. You can use Roboflow to generate this config file and try CLIP on your own classification or object detection datasets.
Below, learn the structure of the OpenAI CLIP Classification format.
An example picture from the Hard Hat dataset depicting a man wearing a hard-hat
An example picture from the Hard Hat dataset depicting several men wearing hard-hats
An example picture from the Hard Hat dataset depicting several people, some wearing hard hats
An example picture from the Hard Hat dataset depicting several people
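Assuming the lines above are the text prompts that CLIP compares an image against, the classification step can be sketched as follows. This is a minimal illustration, not the actual CLIP implementation: the three-dimensional embeddings are made-up stand-ins for the vectors a real CLIP model would produce, the function names are hypothetical, and the scale factor of 100 is an assumption chosen to resemble the temperature scaling used by CLIP-style models.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores, scale=100.0):
    """Turn similarity scores into probabilities (scale is an assumed value)."""
    scaled = [s * scale for s in scores]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, prompt_embeddings):
    """Pick the prompt whose embedding is most similar to the image embedding."""
    prompts = list(prompt_embeddings)
    sims = [cosine_similarity(image_embedding, prompt_embeddings[p]) for p in prompts]
    probs = softmax(sims)
    best = max(range(len(prompts)), key=lambda i: probs[i])
    return prompts[best], dict(zip(prompts, probs))


# Toy embeddings (hypothetical values); a real model would compute these.
prompt_embeddings = {
    "An example picture from the Hard Hat dataset depicting a man wearing a hard-hat": [0.9, 0.1, 0.0],
    "An example picture from the Hard Hat dataset depicting several people": [0.1, 0.2, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

label, probabilities = zero_shot_classify(image_embedding, prompt_embeddings)
print(label)
```

Because the class is chosen purely by comparing the image to text, changing the wording of a prompt changes the prediction, which is why the prompts are kept in an editable configuration file.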
With Roboflow supervision, an open-source Python package with utilities for completing computer vision tasks, you can merge and split detections in the OpenAI CLIP Classification format. Read our dedicated guides to learn how to merge and split OpenAI CLIP Classification detections.
Below, see model architectures that require data in the OpenAI CLIP Classification format when training a new model.
On each page below, you can find links to our guides that show how to plot predictions from the model, and complete other common tasks like detecting small objects with the model.