Top Foundation Vision Models
Foundation models are large, pre-trained models that you can use without any training of your own. A common workflow is to use a foundation model to auto-label data, then use that data to train a smaller, real-time vision model.
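The auto-labeling workflow above can be sketched in a few lines. The `predict` function below is a hypothetical stand-in for a foundation model's zero-shot output (a real pipeline would call a model such as Grounding DINO here); the labels are written in the YOLO text format that smaller real-time detectors train on.

```python
from pathlib import Path

# Hypothetical stand-in for a foundation model's zero-shot predictions.
# A real pipeline would call a model like Grounding DINO or SAM here.
def predict(image_path):
    # Returns (class_id, x_center, y_center, width, height), normalized to 0-1.
    return [(0, 0.5, 0.5, 0.25, 0.25)]

def auto_label(image_dir, label_dir):
    """Write one YOLO-format .txt label file per image."""
    label_dir = Path(label_dir)
    label_dir.mkdir(parents=True, exist_ok=True)
    for image in sorted(Path(image_dir).glob("*.jpg")):
        lines = [
            f"{cls} {x:.6f} {y:.6f} {w:.6f} {h:.6f}"
            for cls, x, y, w, h in predict(image)
        ]
        (label_dir / f"{image.stem}.txt").write_text("\n".join(lines))
```

The resulting image/label folder pair can be fed directly into a YOLO-style training run.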
Deploy select models (e.g. YOLOv8, CLIP) using the Roboflow Hosted API, or on your own hardware using Roboflow Inference.
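A hosted detection call boils down to POSTing an image to a model-specific URL. The endpoint shape below is an assumption based on Roboflow's hosted detection API, and `my-project` / `MY_KEY` are placeholders; check the Roboflow docs for your model's exact URL before relying on it.

```python
from urllib.parse import urlencode

def hosted_inference_url(model_id, version, api_key, confidence=0.5):
    # Endpoint shape is an assumption based on Roboflow's hosted detection
    # API; consult the Roboflow docs for your model's exact URL.
    query = urlencode({"api_key": api_key, "confidence": confidence})
    return f"https://detect.roboflow.com/{model_id}/{version}?{query}"

url = hosted_inference_url("my-project", 1, "MY_KEY")
# A real call would POST a base64-encoded image body to this URL.
```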
Segment Anything Model (SAM)
Segment Anything (SAM) is an image segmentation model developed by Meta Research that is capable of zero-shot segmentation.
Instance Segmentation
Deploy with Roboflow
GroundingDINO
Grounding DINO is a zero-shot object detection model made by combining a Transformer-based DINO detector and grounded pre-training.
Object Detection
Deploy with Roboflow
YOLO-World
YOLO-World is a zero-shot object detection model.
Object Detection
Deploy with Roboflow
PaliGemma
PaliGemma is a vision language model (VLM) by Google that has multimodal capabilities.
Vision-Language
Deploy with Roboflow
GPT-4o
GPT-4o is OpenAI’s third major iteration of GPT-4, expanding on the capabilities of GPT-4 with Vision.
Vision-Language
Deploy with Roboflow
OpenAI CLIP
CLIP (Contrastive Language-Image Pre-Training) is a multimodal, zero-shot image classifier that achieves strong results across a wide range of domains with no fine-tuning. It applies advances in large-scale Transformers, as seen in models like GPT-3, to the vision domain.
Classification
Deploy with Roboflow
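CLIP-style zero-shot classification works by embedding the image and each candidate caption into a shared space, then picking the caption with the highest cosine similarity. A minimal sketch with made-up embeddings (a real pipeline would get these vectors from CLIP's image and text encoders):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels):
    """Pick the label whose text embedding is most similar to the image embedding."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = text_embs @ image_emb  # cosine similarities
    return labels[int(np.argmax(scores))]

# Made-up 4-d embeddings; real CLIP embeddings are 512-d or larger.
image_emb = np.array([0.9, 0.1, 0.0, 0.1])
text_embs = np.array([
    [1.0, 0.0, 0.0, 0.0],   # "a photo of a cat"
    [0.0, 1.0, 0.0, 0.0],   # "a photo of a dog"
])
print(zero_shot_classify(image_emb, text_embs, ["cat", "dog"]))  # cat
```

Because the label set is just a list of strings, you can swap in new classes at inference time without retraining, which is what makes the model zero-shot.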
LLaVA-1.5
LLaVA is an open source multimodal language model that you can use for visual question answering; it has limited support for object detection.
Object Detection
Deploy with Roboflow
SigLIP
SigLIP is an image embedding model defined in the "Sigmoid Loss for Language Image Pre-Training" paper.
Classification
Deploy with Roboflow
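SigLIP's key idea is replacing the softmax contrastive loss with an independent sigmoid loss over every image-text pair: matching pairs get label +1, all others -1. A sketch of that loss on toy embeddings, with the temperature `t` and bias `b` values chosen for illustration:

```python
import numpy as np

def siglip_loss(img, txt, t=10.0, b=-10.0):
    """Pairwise sigmoid loss from 'Sigmoid Loss for Language Image Pre-Training'.

    Matching pairs (the diagonal) get label +1, all other pairs -1.
    t (temperature) and b (bias) are illustrative values.
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = t * (img @ txt.T) + b
    labels = 2.0 * np.eye(len(img)) - 1.0   # +1 on diagonal, -1 elsewhere
    # log(1 + exp(-z * logit)), averaged over all pairs
    return float(np.mean(np.log1p(np.exp(-labels * logits))))
```

Unlike the softmax loss, each pair contributes independently, so the loss does not require a global normalization over the batch, which is what lets SigLIP scale batch size cheaply.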
DETIC
Detic is an open source segmentation model developed by Meta Research and released in 2022.
Instance Segmentation
Deploy with Roboflow
MetaCLIP
MetaCLIP is a zero-shot classification and embedding model developed by Meta AI.
Classification
Deploy with Roboflow
4M
The 4M model is a versatile multimodal Transformer model developed by EPFL and Apple, capable of handling a range of vision and language tasks.
Object Detection
Deploy with Roboflow
CoDet
CoDet is an open vocabulary zero-shot object detection model.
Object Detection
Deploy with Roboflow
Kosmos-2
Kosmos-2 is a multimodal language model capable of object detection and grounding text in images.
Object Detection
Deploy with Roboflow
OWLv2
OWLv2 is a transformer-based object detection model developed by Google Research. OWLv2 is the successor to OWL-ViT.
Object Detection
Deploy with Roboflow
OWL-ViT
OWL-ViT is a transformer-based object detection model developed by Google Research.
Object Detection
Deploy with Roboflow
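Open-vocabulary detectors like OWL-ViT and OWLv2 return one scored box per text query, and overlapping duplicates are typically pruned with non-maximum suppression before use. A minimal greedy NMS sketch (the 0.5 IoU threshold is a common but illustrative choice):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
print(nms(boxes, [0.9, 0.8, 0.7]))  # [0, 2]: the second box overlaps the first
```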
GPT-4 with Vision
GPT-4 with Vision is a multimodal language model developed by OpenAI.
Object Detection
Deploy with Roboflow
MobileCLIP
MobileCLIP is an image embedding model developed by Apple and introduced in the "MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training" paper.
Classification
Deploy with Roboflow
Anthropic Claude 3
Claude 3 is a family of multimodal language models developed by Anthropic with vision capabilities.
Vision-Language
Deploy with Roboflow
Google Gemini
Gemini is a family of Large Multimodal Models (LMMs) developed by Google DeepMind with a specific focus on multimodality.
Vision-Language
Deploy with Roboflow