Top Foundation Vision Models
Foundation models are large, pre-trained models that you can use without any training of your own. A common workflow is to use a foundation model to auto-label data, then use that data to train a smaller, real-time vision model.
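The auto-labeling workflow above can be sketched in a few lines. The `predict` function below is a hypothetical stand-in for a foundation model's zero-shot output (a real pipeline would call a model such as Grounding DINO here); the labels are written in the YOLO text format that smaller real-time detectors train on.

```python
from pathlib import Path

# Hypothetical stand-in for a foundation model's zero-shot predictions.
# A real pipeline would call a model like Grounding DINO or SAM here.
def predict(image_path):
    # Returns (class_id, x_center, y_center, width, height), normalized to 0-1.
    return [(0, 0.5, 0.5, 0.25, 0.25)]

def auto_label(image_dir, label_dir):
    """Write one YOLO-format .txt label file per image."""
    label_dir = Path(label_dir)
    label_dir.mkdir(parents=True, exist_ok=True)
    for image in sorted(Path(image_dir).glob("*.jpg")):
        lines = [
            f"{cls} {x:.6f} {y:.6f} {w:.6f} {h:.6f}"
            for cls, x, y, w, h in predict(image)
        ]
        (label_dir / f"{image.stem}.txt").write_text("\n".join(lines))
```

The resulting image/label folder pair can be fed directly into a YOLO-style training run.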
Deploy select models (e.g. YOLOv8, CLIP) using the Roboflow Hosted API, or on your own hardware using Roboflow Inference.
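A hosted detection call boils down to POSTing an image to a model-specific URL. The endpoint shape below is an assumption based on Roboflow's hosted detection API, and `my-project` / `MY_KEY` are placeholders; check the Roboflow docs for your model's exact URL before relying on it.

```python
from urllib.parse import urlencode

def hosted_inference_url(model_id, version, api_key, confidence=0.5):
    # Endpoint shape is an assumption based on Roboflow's hosted detection
    # API; consult the Roboflow docs for your model's exact URL.
    query = urlencode({"api_key": api_key, "confidence": confidence})
    return f"https://detect.roboflow.com/{model_id}/{version}?{query}"

url = hosted_inference_url("my-project", 1, "MY_KEY")
# A real call would POST a base64-encoded image body to this URL.
```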
Segment Anything Model (SAM)
Segment Anything (SAM) is an image segmentation model developed by Meta Research that is capable of zero-shot segmentation.
Instance Segmentation
Deploy with Roboflow
GroundingDINO
Grounding DINO is a zero-shot object detection model made by combining a Transformer-based DINO detector and grounded pre-training.
Object Detection
Deploy with Roboflow
YOLO-World
YOLO-World is a zero-shot object detection model.
Object Detection
Deploy with Roboflow
PaliGemma
PaliGemma is a vision language model (VLM) by Google that has multimodal capabilities.
Vision-Language
Deploy with Roboflow
GPT-4o
GPT-4o is OpenAI’s third major iteration of GPT-4, expanding on the capabilities of GPT-4 with Vision.
Vision-Language
Deploy with Roboflow
OpenAI CLIP
CLIP (Contrastive Language-Image Pre-Training) is a multimodal, zero-shot image classifier that achieves strong results across a wide range of domains with no fine-tuning. It applies advances in large-scale Transformers, as seen in models like GPT-3, to the vision domain.
Classification
Deploy with Roboflow
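CLIP-style zero-shot classification works by embedding the image and each candidate caption into a shared space, then picking the caption with the highest cosine similarity. A minimal sketch with made-up embeddings (a real pipeline would get these vectors from CLIP's image and text encoders):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels):
    """Pick the label whose text embedding is most similar to the image embedding."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = text_embs @ image_emb  # cosine similarities
    return labels[int(np.argmax(scores))]

# Made-up 4-d embeddings; real CLIP embeddings are 512-d or larger.
image_emb = np.array([0.9, 0.1, 0.0, 0.1])
text_embs = np.array([
    [1.0, 0.0, 0.0, 0.0],   # "a photo of a cat"
    [0.0, 1.0, 0.0, 0.0],   # "a photo of a dog"
])
print(zero_shot_classify(image_emb, text_embs, ["cat", "dog"]))  # cat
```

Because the label set is just a list of strings, you can swap in new classes at inference time without retraining, which is what makes the model zero-shot.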
LLaVA-1.5
LLaVA is an open source multimodal language model that you can use for visual question answering; it has limited support for object detection.
Object Detection
Deploy with Roboflow
SigLIP
SigLIP is an image embedding model defined in the "Sigmoid Loss for Language Image Pre-Training" paper.
Classification
Deploy with Roboflow
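SigLIP's key idea is replacing the softmax contrastive loss with an independent sigmoid loss over every image-text pair: matching pairs get label +1, all others -1. A sketch of that loss on toy embeddings, with the temperature `t` and bias `b` values chosen for illustration:

```python
import numpy as np

def siglip_loss(img, txt, t=10.0, b=-10.0):
    """Pairwise sigmoid loss from 'Sigmoid Loss for Language Image Pre-Training'.

    Matching pairs (the diagonal) get label +1, all other pairs -1.
    t (temperature) and b (bias) are illustrative values.
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = t * (img @ txt.T) + b
    labels = 2.0 * np.eye(len(img)) - 1.0   # +1 on diagonal, -1 elsewhere
    # log(1 + exp(-z * logit)), averaged over all pairs
    return float(np.mean(np.log1p(np.exp(-labels * logits))))
```

Unlike the softmax loss, each pair contributes independently, so the loss does not require a global normalization over the batch, which is what lets SigLIP scale batch size cheaply.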
DETIC
Detic is an open source segmentation model developed by Meta Research and released in 2022.
Instance Segmentation
Deploy with Roboflow
MetaCLIP
MetaCLIP is a zero-shot classification and embedding model developed by Meta AI.
Classification
Deploy with Roboflow
4M
The 4M model is a versatile multimodal Transformer model developed by EPFL and Apple, capable of handling a range of vision and language tasks.
Object Detection
Deploy with Roboflow
CoDet
CoDet is an open vocabulary zero-shot object detection model.
Object Detection
Deploy with Roboflow
Kosmos-2
Kosmos-2 is a multimodal language model capable of object detection and grounding text in images.
Object Detection
Deploy with Roboflow
OWLv2
OWLv2 is a transformer-based object detection model developed by Google Research. OWLv2 is the successor to OWL-ViT.
Object Detection
Deploy with Roboflow
OWL-ViT
OWL-ViT is a transformer-based object detection model developed by Google Research.
Object Detection
Deploy with Roboflow
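Open-vocabulary detectors like OWL-ViT and OWLv2 return one scored box per text query, and overlapping duplicates are typically pruned with non-maximum suppression before use. A minimal greedy NMS sketch (the 0.5 IoU threshold is a common but illustrative choice):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
print(nms(boxes, [0.9, 0.8, 0.7]))  # [0, 2]: the second box overlaps the first
```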
GPT-4 with Vision
GPT-4 with Vision is a multimodal language model developed by OpenAI.
Object Detection
Deploy with Roboflow
MobileCLIP
MobileCLIP is an image embedding model developed by Apple and introduced in the "MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training" paper.
Classification
Deploy with Roboflow
Anthropic Claude 3
Claude 3 is a family of multimodal language models developed by Anthropic with vision capabilities.
Vision-Language
Deploy with Roboflow
Google Gemini
Gemini is a family of Large Multimodal Models (LMMs) developed by Google DeepMind with a specific focus on multimodality.
Vision-Language
Deploy with Roboflow