Top Zero-Shot Object Detection Models

Zero-shot object detection models let you detect objects using an open vocabulary without training a custom model.

Deploy select models (e.g. YOLOv8, CLIP) using the Roboflow Hosted API, or on your own hardware using Roboflow Inference.
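To make "open vocabulary" concrete, the sketch below shows one common approach used by CLIP-style detectors such as OWL-ViT: each candidate image region and each text prompt is mapped to an embedding, and a region receives the label of the most similar prompt. This is only a toy illustration. The 3-dimensional vectors, the `label_regions` helper, and the 0.8 threshold are made-up assumptions standing in for real encoder outputs, and none of the names below come from Roboflow or any model listed on this page.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def label_regions(region_embeddings, prompt_embeddings, threshold=0.8):
    """Give each region its best-matching prompt, or None if no prompt clears the threshold.

    Hypothetical helper: a real detector would also predict a box for each region.
    """
    labels = []
    for region in region_embeddings:
        scores = {name: cosine(region, vec) for name, vec in prompt_embeddings.items()}
        best = max(scores, key=scores.get)
        labels.append(best if scores[best] >= threshold else None)
    return labels


# Made-up embeddings; a real model would produce these with its text and image encoders.
prompts = {"dog": [1.0, 0.0, 0.0], "bicycle": [0.0, 1.0, 0.0]}
regions = [[0.9, 0.1, 0.0], [0.1, 0.95, 0.05], [0.0, 0.0, 1.0]]

labels = label_regions(regions, prompts)
print(labels)  # ['dog', 'bicycle', None]
```

Because the labels are just text prompts, adding a new class in this sketch means adding another entry to `prompts`, with no retraining step.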

YOLO-World

YOLO-World is a zero-shot object detection model.

Object Detection

Deploy with Roboflow

Segment Anything 3

Segment Anything 3 (SAM 3) is an image segmentation model released by Meta.

Instance Segmentation

Deploy with Roboflow

Grounded SAM

Grounded SAM combines Grounding DINO with the Segment Anything Model to identify and segment objects in an image given text captions.

Zero Shot Segmentation

Deploy with Roboflow

FastSAM

FastSAM is an image segmentation model trained using 2% of the data in the Segment Anything Model SA-1B dataset.

Instance Segmentation

Deploy with Roboflow

Grounding DINO

Grounding DINO is a state-of-the-art zero-shot object detection model, developed by IDEA Research.

Object Detection

Deploy with Roboflow

MetaCLIP

MetaCLIP is a zero-shot classification and embedding model developed by Meta AI.

Deploy with Roboflow

4M

4M is a multimodal Transformer model developed by EPFL and Apple that can handle a range of vision and language tasks.

Object Detection

Deploy with Roboflow

Florence-2

Florence-2 is a lightweight vision-language model open-sourced by Microsoft under the MIT license.

Open Vocabulary Object Detection

Deploy with Roboflow

BLIPv2

BLIPv2 is a multimodal model developed by Salesforce Research.

Deploy with Roboflow

OWL-ViT

OWL-ViT is a transformer-based object detection model developed by Google Research.

Object Detection

Deploy with Roboflow

OWLv2

OWLv2 is a transformer-based object detection model developed by Google Research and the successor to OWL-ViT.

Object Detection

Deploy with Roboflow

Visual Question Answering

Image Similarity

Image Captioning

Zero-shot Detection

Real-Time Vision

Image Embedding

LLMs with Vision Capabilities

Multimodal Vision

Foundation Vision