Explore Top Computer Vision Models

Co-DETR

Co-Deformable-DETR (Co-DETR) is an object detection model architecture introduced in the paper "DETRs with Collaborative Hybrid Assignments Training".

Object Detection

D-FINE

D-FINE is a real-time object detection model introduced in the paper "D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement".

DEIM

DEIM is a training framework for DETR models. The framework strives to enable "faster convergence and improved accuracy" in models.

YOLOE

YOLOE is a new object detection and segmentation model developed by the creators of YOLOv10.

Object Detection

SmolVLM2

SmolVLM2 is a multimodal image and video understanding model developed by engineers on the Hugging Face TB (Textbook) Research team.

Moondream 2

Moondream 2 is the latest model in the Moondream series of “tiny vision language models”.

Gemma 3

Gemma 3 is a multimodal language model developed by Google.

OpenAI o3-mini

OpenAI o3-mini is a multimodal reasoning model developed by OpenAI.

Vision Language

Qwen2.5-VL

Qwen2.5-VL is a multimodal vision-language model developed by the Qwen team at Alibaba Cloud.

Vision Language

PaliGemma-2

PaliGemma-2 is a multimodal model developed by Google.

Vision Language

YOLO11

YOLO11 is a computer vision model that you can use for object detection, segmentation, and classification.

Object Detection

YOLOv9 Image Segmentation

Florence 2 Image Captioning

Florence-2 Image Captioning is a subset of Florence-2 that supports describing images with text.

Florence 2 OCR

Florence-2 OCR is a subset of Florence-2 that can read characters in images.

Florence 2 Image Segmentation

Referring Expression Segmentation

Florence 2 Object Detection

Florence-2 is a lightweight vision-language model open-sourced by Microsoft under the MIT license.

Phi-3.5

Vision Language

MediaPipe

Object Detection

RT-DETR

Object Detection

Cambrian

4M

The 4M model is a versatile multimodal Transformer model developed by EPFL and Apple, capable of handling a range of vision and language tasks.

Object Detection

Florence 2

Florence-2 is a lightweight vision-language model open-sourced by Microsoft under the MIT license.

Open Vocabulary Object Detection

PaliGemma Optical Character Recognition

You can use the set of PaliGemma weights trained on the OCRVQA dataset for performing OCR on images.

PaliGemma Document VQA

You can use the set of PaliGemma weights trained on the DocVQA dataset for asking questions about documents.

Document Question Answering (DocQA)

PaliGemma VQA

You can use the set of PaliGemma weights trained on the VQAv2 dataset for asking questions about the contents of images.

PaliGemma Website Understanding

You can use the set of PaliGemma weights trained on the Screen2Words dataset for asking questions about website screenshots.

PaliGemma Image Captioning

You can use the set of PaliGemma weights trained on the COCO Captions dataset for zero-shot image captioning.

YOLOv10

YOLOv10 is a real-time object detection model introduced in the paper "YOLOv10: Real-Time End-to-End Object Detection".

Object Detection

MMOCR

MMOCR is an Optical Character Recognition model zoo implemented with the MMDetection package.

OCR

TrOCR

TrOCR is a Transformer-based OCR model developed by researchers from Microsoft Research.

OCR

Tesseract

Tesseract is a highly popular OCR engine, now developed primarily as an open-source project.

OCR

Surya

Surya is a Python package designed for OCR and document layout analysis.

OCR

Google Gemini

Gemini is a family of Large Multimodal Models (LMMs) developed by Google DeepMind, designed with multimodality as a core focus.

Vision Language

ResNet-50

ResNet-50 is a popular image classification model architecture.

Classification

Anthropic Claude 3

Vision Language

EasyOCR

OCR

YOLOv8 Oriented Bounding Boxes

You can retrieve bounding boxes whose edges match an angled object by training an oriented bounding boxes object detection model, such as YOLOv8's Oriented Bounding Boxes model.

Object Detection
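
To make the idea of an oriented bounding box concrete, here is a minimal sketch that turns a rotated box into its four corner points and compares it with the axis-aligned box that would enclose the same object. The `(cx, cy, w, h, angle)` parameterization and the function names are assumptions chosen for illustration; they are not taken from the YOLOv8 codebase, and a real model's output format may differ.

```python
import math


def obb_corners(cx, cy, w, h, angle_rad):
    """Return the four corners of a box centred at (cx, cy) and rotated by angle_rad.

    Hypothetical helper for illustration; the parameter layout is an assumption.
    """
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    half_w, half_h = w / 2.0, h / 2.0
    # Corner offsets in the box's own, unrotated frame.
    offsets = [(-half_w, -half_h), (half_w, -half_h), (half_w, half_h), (-half_w, half_h)]
    # Rotate each offset about the origin, then translate by the centre.
    return [
        (cx + dx * cos_a - dy * sin_a, cy + dx * sin_a + dy * cos_a)
        for dx, dy in offsets
    ]


def axis_aligned_bounds(corners):
    """Return (x_min, y_min, x_max, y_max) of the smallest upright box around the corners."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return min(xs), min(ys), max(xs), max(ys)


def box_area(bounds):
    """Area of an (x_min, y_min, x_max, y_max) box."""
    x_min, y_min, x_max, y_max = bounds
    return (x_max - x_min) * (y_max - y_min)


# A long, thin object rotated by 45 degrees.
corners = obb_corners(cx=50.0, cy=50.0, w=40.0, h=10.0, angle_rad=math.pi / 4)
upright = axis_aligned_bounds(corners)
oriented_area = 40.0 * 10.0
upright_area = box_area(upright)
```

In this toy case the upright box that encloses the rotated object covers more area than the oriented box itself, which is why oriented boxes can be a tighter fit for angled objects.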

AltCLIP

AltCLIP is a zero-shot image classification model.

Classification
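
As a rough illustration of how zero-shot classification with CLIP-style models generally works, the sketch below compares an image embedding against one text embedding per candidate label and picks the closest label by cosine similarity. The embeddings are made-up toy vectors and the function names are hypothetical; a real model would produce the embeddings with its own image and text encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, label_embeddings):
    """Return the best-matching label and the per-label similarity scores.

    Hypothetical helper: label_embeddings maps each label to a text embedding.
    """
    scores = {
        label: cosine_similarity(image_embedding, embedding)
        for label, embedding in label_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy vectors standing in for encoder outputs (assumed values, not real model output).
label_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
}
image_embedding = [0.9, 0.1, 0.0]

best_label, scores = zero_shot_classify(image_embedding, label_embeddings)
```

Because the labels are supplied as text at inference time, the set of classes can be changed without retraining the model.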

RemoteCLIP

RemoteCLIP is a zero-shot classification model for remote sensing.

Classification

BioCLIP

BioCLIP is a vision foundation model for the Tree of Life.

Classification

MobileCLIP

MobileCLIP is an image embedding model developed by Apple and introduced in the paper "MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training".

Classification

SigLIP

SigLIP is an image embedding model defined in the "Sigmoid Loss for Language Image Pre-Training" paper.

Classification

YOLO-NAS Pose

YOLO-NAS Pose is a keypoint detection model developed by Deci AI.

Keypoint Detection

Grounded EdgeSAM

Grounded EdgeSAM is a combination of Grounding DINO, a zero-shot object detection model, and EdgeSAM, a fast zero-shot image segmentation model.

Zero Shot Segmentation

BakLLaVA

BakLLaVA is an LMM developed by LAION, Ontocord, and Skunkworks AI. BakLLaVA uses a Mistral 7B base augmented with the LLaVA 1.5 architecture.

Vision Language

CogVLM

CogVLM shows strong performance in Visual Question Answering (VQA) and other vision tasks.

Vision Language

QwenVL

Qwen-VL is an LMM developed by Alibaba Cloud. Qwen-VL accepts images, text, and bounding boxes as inputs. The model can output text and bounding boxes. Qwen-VL naturally supports English, Chinese, and multilingual conversation.

Vision Language

VLPart

VLPart, developed by Meta Research, is an object detection and segmentation model that works with an open vocabulary.

Object Detection

CoDet

CoDet is an open vocabulary zero-shot object detection model.

Object Detection

GPT-4 with Vision

GPT-4 with Vision is a multimodal language model developed by OpenAI.

Object Detection

Grounding DINO

Grounding DINO is a state-of-the-art zero-shot object detection model, developed by IDEA Research.

Object Detection

BLIP

Classification

Grounded SAM

Grounded SAM combines Grounding DINO with the Segment Anything Model to identify and segment objects in an image given text captions.

Zero Shot Segmentation

SAM-CLIP

Use Grounding DINO, Segment Anything, and CLIP to label objects in images.

Instance Segmentation

BLIPv2

BLIPv2 is a multimodal model developed by Salesforce Research.

Classification

ALBEF

Classification

OWL ViT

OWL-ViT is a transformer-based object detection model developed by Google Research.

Object Detection

FastViT

FastViT is a fast image classification model developed by Apple.

Classification

MetaCLIP

MetaCLIP is a zero-shot classification and embedding model developed by Meta AI.

Classification

OWLv2

OWLv2 is a transformer-based object detection model developed by Google Research. OWLv2 is the successor to OWL-ViT.

Object Detection

LLaVA-1.5

LLaVA is an open-source multimodal language model that you can use for visual question answering and that has limited support for object detection.

Object Detection

Kosmos-2

Kosmos-2 is a multimodal language model capable of object detection and grounding text in images.

Object Detection

L2CS-Net

L2CS-Net is a gaze estimation model that lets you estimate where someone is looking and the direction of their gaze.

Object Detection

DocTR

DocTR is an Optical Character Recognition tool powered by deep learning.

Object Detection

DINOv2

DINOv2 is a self-supervised method for training computer vision models developed by Meta Research and released in April 2023.

Object Detection

RTMDet

RTMDet is an efficient real-time object detector whose self-reported metrics outperform the YOLO series. It achieves 52.8% AP on COCO at 300+ FPS on an NVIDIA 3090 GPU, making it one of the fastest and most accurate object detectors available at the time of writing.

Object Detection
