Computer Vision Models

Explore state-of-the-art computer vision model architectures, immediately usable for training with your custom dataset.
Deploy select models (e.g. YOLOv8, CLIP) using the Roboflow Hosted API, or on your own hardware using Roboflow Inference.
Roboflow

RF-DETR

RF-DETR is a SOTA, real-time object detection model architecture developed by Roboflow and released under the Apache 2.0 license.
Object Detection
Deploy with Roboflow
false

YOLOv12

YOLOv12 is a state-of-the-art computer vision model you can use for detection, segmentation, and more.
Object Detection
Deploy with Roboflow
false
Meta

Segment Anything 2

Segment Anything 2 (SAM 2) is a real-time image and video segmentation model.
Instance Segmentation
Deploy with Roboflow
false
OpenAI

GPT-4o

GPT-4o is OpenAI’s third major iteration of GPT-4, expanding on the capabilities of GPT-4 with Vision.
Vision Language
Deploy with Roboflow
false
Google

PaliGemma

PaliGemma is a vision language model (VLM) by Google that has multimodal capabilities.
Vision Language
Deploy with Roboflow
true

YOLOv9

YOLOv9 is an object detection model architecture released on February 21st, 2024.
Object Detection
Deploy with Roboflow
true

YOLO-World

YOLO-World is a zero-shot object detection model.
Object Detection
Deploy with Roboflow
true
Ultralytics

YOLOv8 Pose Estimation

The YOLOv8 pose estimation model allows you to detect keypoints in an image.
Pose Estimation
Deploy with Roboflow
true
Meta

Segment Anything Model (SAM)

Segment Anything (SAM) is an image segmentation model developed by Meta Research, capable of zero-shot segmentation.
Instance Segmentation
Deploy with Roboflow
false

GroundingDINO

Grounding DINO is a zero-shot object detection model built by combining the Transformer-based DINO detector with grounded pre-training.
Object Detection
Deploy with Roboflow
false
Ultralytics

YOLOv8 Instance Segmentation

The state-of-the-art YOLOv8 model comes with support for instance segmentation tasks.
Instance Segmentation
Deploy with Roboflow
true
Ultralytics

YOLOv8

YOLOv8 is a state-of-the-art object detection and image segmentation model created by Ultralytics, the developers of YOLOv5.
Object Detection
Deploy with Roboflow
true
Anthropic

Claude 3.7 Sonnet

Claude 3.7 Sonnet is a multimodal "hybrid reasoning" model developed by Anthropic.
Deploy with Roboflow
false
Microsoft

Phi-4 Multimodal

Phi-4 Multimodal is a multimodal language model developed by Microsoft.
Vision Language
Deploy with Roboflow
false

Co-DETR

Co-Deformable-DETR (Co-DETR) is an object detection model architecture introduced in the paper "DETRs with Collaborative Hybrid Assignments Training".
Object Detection
Deploy with Roboflow
false

D-FINE

D-FINE is a real-time object detection model introduced in the paper "D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement".
Deploy with Roboflow
false

DEIM

DEIM is a training framework for DETR models. The framework strives to enable "faster convergence and improved accuracy" in models.
Deploy with Roboflow
false

YOLOE

YOLOE is a new object detection and segmentation model developed by the creators of YOLOv10.
Object Detection
Deploy with Roboflow
false
Hugging Face

SmolVLM2

SmolVLM2 is a multimodal image and video understanding model developed by engineers on the Hugging Face TB (Textbook) Research team.
Deploy with Roboflow
false

Moondream 2

Moondream 2 is the latest model in the Moondream series of “tiny vision language models”.
Deploy with Roboflow
false
Google

Gemma 3

Gemma 3 is a multimodal language model developed by Google.
Deploy with Roboflow
false
OpenAI

OpenAI o3-mini

OpenAI o3-mini is a multimodal reasoning model developed by OpenAI.
Vision Language
Deploy with Roboflow
false
Qwen

Qwen2.5-VL

Qwen2.5-VL is a multimodal vision-language model developed by the Qwen team at Alibaba Cloud.
Vision Language
Deploy with Roboflow
false
Google

PaliGemma-2

PaliGemma-2 is a multimodal model developed by Google.
Vision Language
Deploy with Roboflow
true
Ultralytics

YOLO11

YOLO11 is a computer vision model that you can use for object detection, segmentation, and classification.
Object Detection
Deploy with Roboflow
false

YOLOv9 Image Segmentation

Deploy with Roboflow
false

Florence 2 Image Captioning

Deploy with Roboflow
false

Florence 2 OCR

Florence-2 OCR is a subset of Florence-2 that can read characters in images.
Deploy with Roboflow
false

Florence 2 Image Segmentation

Referring Expression Segmentation
Deploy with Roboflow
false

Florence 2 Object Detection

Deploy with Roboflow
false
Microsoft

Phi-3.5

Vision Language
Deploy with Roboflow
false
Google

MediaPipe

Object Detection
Deploy with Roboflow
false

RT-DETR

Object Detection
Deploy with Roboflow
false

Cambrian

Deploy with Roboflow
false
Apple

4M

The 4M model is a versatile multimodal Transformer model developed by EPFL and Apple, capable of performing several vision and language tasks.
Object Detection
Deploy with Roboflow
false
Microsoft

Florence 2

Florence-2 is a lightweight vision-language model open-sourced by Microsoft under the MIT license.
Open Vocabulary Object Detection
Deploy with Roboflow
false

PaliGemma Optical Character Recognition

You can use the set of PaliGemma weights trained on the OCRVQA dataset for performing OCR on images.
Deploy with Roboflow
false

PaliGemma Document VQA

You can use the set of PaliGemma weights trained on the DocVQA dataset for asking questions about documents.
Document Question Answering (DocQA)
Deploy with Roboflow
false

PaliGemma VQA

You can use the set of PaliGemma weights trained on the VQAv2 dataset for asking questions about the contents of images.
Deploy with Roboflow
false

PaliGemma Website Understanding

You can use the set of PaliGemma weights trained on the Screen2Words dataset for asking questions about website screenshots.
Deploy with Roboflow
false

PaliGemma Image Captioning

You can use the set of PaliGemma weights trained on the COCO Captions dataset for zero-shot image captioning.
Deploy with Roboflow
false

YOLOv10

YOLOv10 is a real-time object detection model introduced in the paper "YOLOv10: Real-Time End-to-End Object Detection".
Object Detection
Deploy with Roboflow
false

MMOCR

MMOCR is an Optical Character Recognition model zoo implemented with the MMDetection package.
OCR
Deploy with Roboflow
false

TrOCR

TrOCR is a Transformer-based OCR model developed by researchers from Microsoft Research.
OCR
Deploy with Roboflow
false

Tesseract

Tesseract is a widely used OCR engine, now developed primarily as an open-source project.
OCR
Deploy with Roboflow
false

Surya

Surya is a Python package designed for OCR and document layout analysis.
OCR
Deploy with Roboflow
false
Google

Google Gemini

Gemini is a family of Large Multimodal Models (LMMs) developed by Google DeepMind and designed specifically for multimodality.
Vision Language
Deploy with Roboflow
false

ResNet-50

ResNet-50 is a popular image classification model architecture.
Classification
Deploy with Roboflow
false
Anthropic

Anthropic Claude 3

Vision Language
Deploy with Roboflow
false

EasyOCR

OCR
Deploy with Roboflow
false
Ultralytics

YOLOv8 Oriented Bounding Boxes

You can retrieve bounding boxes whose edges match an angled object by training an oriented bounding box (OBB) object detection model, such as the YOLOv8 Oriented Bounding Boxes model.
Object Detection
Deploy with Roboflow
false

AltCLIP

AltCLIP is a zero-shot image classification model.
Classification
Deploy with Roboflow
false

RemoteCLIP

RemoteCLIP is a zero-shot classification model for remote sensing.
Classification
Deploy with Roboflow
false

BioCLIP

BioCLIP is a Vision Foundation Model for the Tree of Life.
Classification
Deploy with Roboflow
false
Apple

MobileCLIP

MobileCLIP is an image embedding model developed by Apple and introduced in the paper "MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training".
Classification
Deploy with Roboflow
false
Google

SigLIP

SigLIP is an image embedding model introduced in the paper "Sigmoid Loss for Language Image Pre-Training".
Classification
Deploy with Roboflow
false
Deci AI

YOLO-NAS Pose

YOLO-NAS Pose is a keypoint detection model developed by Deci AI.
Keypoint Detection
Deploy with Roboflow
false

Grounded EdgeSAM

Grounded EdgeSAM is a combination of Grounding DINO, a zero-shot object detection model, and EdgeSAM, a fast zero-shot image segmentation model.
Zero Shot Segmentation
Deploy with Roboflow
false

BakLLaVA

BakLLaVA is an LMM developed by LAION, Ontocord, and Skunkworks AI. BakLLaVA uses a Mistral 7B base augmented with the LLaVA 1.5 architecture.
Vision Language
Deploy with Roboflow
false

CogVLM

CogVLM shows strong performance in Visual Question Answering (VQA) and other vision tasks.
Vision Language
Deploy with Roboflow
true
Qwen

QwenVL

Qwen-VL is an LMM developed by Alibaba Cloud. Qwen-VL accepts images, text, and bounding boxes as inputs. The model can output text and bounding boxes. Qwen-VL naturally supports English, Chinese, and multilingual conversation.
Vision Language
Deploy with Roboflow
false
Meta

VLPart

VLPart, developed by Meta Research, is an object detection and segmentation model that works with an open vocabulary.
Object Detection
Deploy with Roboflow
false
Meta

CoDet

CoDet is an open vocabulary zero-shot object detection model.
Object Detection
Deploy with Roboflow
false
OpenAI

GPT-4 with Vision

GPT-4 with Vision is a multimodal language model developed by OpenAI.
Object Detection
Deploy with Roboflow
false

Grounding DINO

Grounding DINO is a state-of-the-art zero-shot object detection model, developed by IDEA Research.
Object Detection
Deploy with Roboflow
false
Salesforce

BLIP

Classification
Deploy with Roboflow
false

Grounded SAM

Grounded SAM combines Grounding DINO with the Segment Anything Model to identify and segment objects in an image given text captions.
Zero Shot Segmentation
Deploy with Roboflow
false

SAM-CLIP

Use Grounding DINO, Segment Anything, and CLIP to label objects in images.
Instance Segmentation
Deploy with Roboflow
false
Salesforce

BLIPv2

BLIPv2 is a multimodal model developed by Salesforce Research.
Classification
Deploy with Roboflow
false
Salesforce

ALBEF

Classification
Deploy with Roboflow
false
Google

OWL ViT

OWL-ViT is a transformer-based object detection model developed by Google Research.
Object Detection
Deploy with Roboflow
false
Apple

FastViT

FastViT is a fast image classification model developed by Apple.
Classification
Deploy with Roboflow
false
Meta

MetaCLIP

MetaCLIP is a zero-shot classification and embedding model developed by Meta AI.
Classification
Deploy with Roboflow
false

OWLv2

OWLv2 is a transformer-based object detection model developed by Google Research. OWLv2 is the successor to OWL ViT.
Object Detection
Deploy with Roboflow
false

LLaVA-1.5

LLaVA is an open source multimodal language model that you can use for visual question answering and that has limited support for object detection.
Object Detection
Deploy with Roboflow
true

Kosmos-2

Kosmos-2 is a multimodal language model capable of object detection and grounding text in images.
Object Detection
Deploy with Roboflow
false

L2CS-Net

L2CS-Net is a gaze estimation model that enables you to estimate where, and in what direction, a person is looking.
Object Detection
Deploy with Roboflow
false
Mindee

DocTR

DocTR is an Optical Character Recognition tool powered by deep learning.
Object Detection
Deploy with Roboflow
true

DINOv2

DINOv2 is a self-supervised method for training computer vision models developed by Meta Research and released in April 2023.
Object Detection
Deploy with Roboflow
false

RTMDet

RTMDet is an efficient real-time object detector, with self-reported metrics that outperform the YOLO series. It achieves 52.8% AP on COCO at 300+ FPS on an NVIDIA 3090 GPU, making it one of the fastest and most accurate object detectors available at the time of writing.
Object Detection
Deploy with Roboflow
false

YOLACT

YOLACT is a simple, fully convolutional model for real-time instance segmentation.
Instance Segmentation
Deploy with Roboflow
false

ByteTrack

ByteTrack is a multi-object tracking computer vision algorithm.
Object Detection
Deploy with Roboflow
false
Meta

FastSAM

FastSAM is an image segmentation model trained using 2% of the data in the Segment Anything Model SA-1B dataset.
Instance Segmentation
Deploy with Roboflow
false
Meta

DETIC

Detic is an open source segmentation model developed by Meta Research and released in 2022.
Instance Segmentation
Deploy with Roboflow
false
Deci AI

YOLO-NAS

YOLO-NAS is an object detection model developed by Deci that achieves SOTA performance compared to YOLOv5, YOLOv7, and YOLOv8.
Object Detection
Deploy with Roboflow
false
Meta

DETR

Detection Transformer (DETR) is an end-to-end object detection model implemented using the Transformer architecture.
Object Detection
Deploy with Roboflow
false
Ultralytics

YOLOv8 Classification

An image classification model built using YOLOv8.
Classification
Deploy with Roboflow
true

YOLOv7 Instance Segmentation

YOLOv7 Instance Segmentation lets you perform segmentation tasks with the YOLOv7 model.
Instance Segmentation
Deploy with Roboflow
true

OneFormer

OneFormer is a state-of-the-art multi-task image segmentation framework that is implemented using transformers.
Instance Segmentation
Deploy with Roboflow
false

ResNet 32

A fast, simple convolutional neural network that gets the job done for many tasks, including classification.
Classification
Deploy with Roboflow
false

YOLOX

YOLOX is a high-performance object detection model.
Object Detection
Deploy with Roboflow
false

YOLOR

YOLOR (You Only Learn One Representation) is an object detection model that uses both implicit and explicit knowledge to make predictions.
Object Detection
Deploy with Roboflow
false

YOLOS

YOLOS looks at patches of an image to form "patch tokens", which are used in place of the traditional wordpiece tokens in NLP.
Object Detection
Deploy with Roboflow
false

Scaled YOLOv4

Scaled YOLOv4 is an extension of the YOLOv4 research implemented in the YOLOv5 PyTorch framework.
Object Detection
Deploy with Roboflow
false
Google

Vision Transformer

The Vision Transformer applies the Transformer encoder architecture used in natural language processing models such as BERT to sequences of image patches.
Classification
Deploy with Roboflow
false

Mask RCNN

Mask RCNN is a convolutional neural network for instance segmentation.
Instance Segmentation
Deploy with Roboflow
false
Nvidia

SegFormer

SegFormer is a computer vision framework used in semantic segmentation tasks, implemented with transformers.
Semantic Segmentation
Deploy with Roboflow
false
Ultralytics

YOLOv5 Classification

YOLOv5 Classification is a version of the YOLOv5 model used in single-label and multi-label image classification.
Classification
Deploy with Roboflow
true
Ultralytics

YOLOv5 Instance Segmentation

YOLOv5 Instance Segmentation is a version of YOLOv5 that can be used for instance segmentation tasks.
Instance Segmentation
Deploy with Roboflow
true

YOLOv7

YOLOv7 is a state-of-the-art object detection model.
Object Detection
Deploy with Roboflow
true