Top Multimodal Vision Models

Multimodal vision models let you work with images together with information in another modality, such as text. Some multimodal vision models support asking questions about images; others score the similarity between images and text, which is useful for classification.

If you would rather deploy a model without writing code, check out our Roboflow Deploy product.

Classification

CLIP (Contrastive Language-Image Pre-Training) is a multimodal zero-shot image classifier that achieves strong results across a wide range of domains with no fine-tuning. It applies advances in large-scale transformers, such as those behind GPT-3, to computer vision.
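To make the idea of zero-shot classification by image-text similarity concrete, here is a minimal sketch in plain Python. It assumes the image and each candidate text prompt have already been turned into embedding vectors; the three-dimensional vectors, the prompts, the `zero_shot_classify` helper, and the temperature value are all illustrative assumptions, not outputs or APIs of CLIP itself.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, text_embeddings, temperature=100.0):
    """Score one image embedding against every candidate text embedding.

    `temperature` scales the similarities before the softmax; the value
    used here is an assumption chosen for illustration.
    """
    labels = list(text_embeddings)
    logits = [
        temperature * cosine_similarity(image_embedding, text_embeddings[label])
        for label in labels
    ]
    return dict(zip(labels, softmax(logits)))


# Hand-made toy embeddings (hypothetical). A real model would produce
# vectors with hundreds of dimensions from its image and text encoders.
text_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

probabilities = zero_shot_classify(image_embedding, text_embeddings)
best_label = max(probabilities, key=probabilities.get)
print(best_label)
```

Because the class list is just a set of text prompts, you can change the classes by editing the prompts, with no retraining step. That is what "zero-shot" refers to here.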
Object Detection

Grounding DINO is a zero-shot object detection model that combines a Transformer-based DINO detector with grounded pre-training.
Object Detection

Kosmos-2 is a multimodal language model capable of object detection and grounding text in images.
Object Detection

LLaVA is an open-source multimodal language model that you can use for visual question answering; it also has limited support for object detection.
Classification

MetaCLIP is a zero-shot classification and embedding model developed by Meta AI.
