Vision-Language Models (VLMs)

Vision-language models (VLMs) are pre-trained on image-text pairs, enabling zero-shot predictions for visual recognition tasks. They can be used for multimodal tasks such as visual question answering, image captioning, and image tagging.
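
To make the zero-shot idea concrete, here is a minimal sketch of zero-shot image classification with CLIP via the Hugging Face transformers library; the checkpoint name, image path, and candidate labels are illustrative placeholders rather than anything prescribed by this page.

```python
# Minimal zero-shot classification sketch with CLIP (Hugging Face transformers).
# The checkpoint, image path, and labels below are illustrative placeholders.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # any local image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # similarity -> label probabilities
print(dict(zip(labels, probs[0].tolist())))
```

Because the label set is supplied at inference time, the same weights can classify arbitrary categories without fine-tuning, which is what makes the predictions "zero-shot".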

Deploy select models (e.g., YOLOv8, CLIP) using the Roboflow Hosted API, or on your own hardware using Roboflow Inference.
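
As a rough sketch of what local deployment looks like with the open-source Roboflow Inference Python package (pip install inference): the model alias and image path below are placeholders, and hosted or private models additionally require a Roboflow API key.

```python
# Hedged sketch: running a public model locally with Roboflow Inference.
# "yolov8n-640" is a public model alias; "image.jpg" is a placeholder path.
from inference import get_model

model = get_model(model_id="yolov8n-640")
results = model.infer("image.jpg")  # accepts local paths, URLs, or arrays
print(results)
```
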
PaliGemma

PaliGemma is a vision-language model (VLM) developed by Google that pairs the SigLIP vision encoder with the Gemma language model.

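For illustration, here is a minimal sketch of prompting PaliGemma through Hugging Face transformers, assuming the publicly released google/paligemma-3b-mix-224 checkpoint; the mix checkpoints expect task-prefixed prompts such as "answer en <question>", and the image path is a placeholder.

```python
# Sketch: visual question answering with PaliGemma via transformers.
# The checkpoint and image path are assumptions for illustration.
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("photo.jpg")  # placeholder image
inputs = processor(text="answer en What is in this image?", images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(output[0], skip_special_tokens=True))
```
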
GPT-4o

GPT-4o is OpenAI's third major iteration of GPT-4, expanding on the capabilities of GPT-4 with Vision.

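A minimal sketch of sending an image to GPT-4o with the official OpenAI Python SDK follows; the image URL is a placeholder and an OPENAI_API_KEY environment variable is assumed.

```python
# Sketch: image + text prompt to GPT-4o via the OpenAI Python SDK.
# The image URL is a placeholder; OPENAI_API_KEY is read from the environment.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```
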
OpenAI o3-mini

o3-mini is a multimodal reasoning model developed by OpenAI.

Qwen2.5-VL

Qwen2.5-VL is a multimodal vision-language model developed by the Qwen team at Alibaba Cloud.

PaliGemma-2

PaliGemma-2 is the second iteration of PaliGemma, a vision-language model developed by Google, upgrading the language backbone to Gemma 2.

Phi-3.5

Phi-3.5 Vision is a lightweight multimodal model in Microsoft's Phi-3.5 family of small language models.

Google Gemini

Gemini is a family of Large Multimodal Models (LMMs) developed by Google DeepMind and designed from the ground up for multimodality.

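Below is a hedged sketch of multimodal prompting with the google-generativeai Python SDK; the model name, API key handling, and image path are illustrative assumptions.

```python
# Sketch: multimodal prompt (image + text) with the google-generativeai SDK.
# The API key, model name, and image path are placeholders.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content([Image.open("photo.jpg"), "What objects are in this image?"])
print(response.text)
```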

BakLLaVA

BakLLaVA is an LMM developed by LAION, Ontocord, and Skunkworks AI. BakLLaVA uses a Mistral 7B base augmented with the LLaVA 1.5 architecture.

CogVLM

CogVLM is an open-source vision-language model developed by Zhipu AI and Tsinghua University that shows strong performance in Visual Question Answering (VQA) and other vision-language tasks.

Phi-4 Multimodal

Phi-4 Multimodal is a multimodal language model developed by Microsoft that accepts text, image, and audio inputs.

Anthropic Claude 3

Claude 3 is a family of multimodal models developed by Anthropic, released in Haiku, Sonnet, and Opus variants, that can reason over image and text inputs.

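For illustration, here is a sketch of passing an image to Claude 3 through the official anthropic Python SDK; the model ID, image file, and API key handling are placeholders.

```python
# Sketch: image + text prompt to Claude 3 via the anthropic Python SDK.
# Model ID and image file are placeholders; ANTHROPIC_API_KEY comes from the environment.
import base64

import anthropic

client = anthropic.Anthropic()
with open("photo.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=256,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image_data}},
            {"type": "text", "text": "What is in this image?"},
        ],
    }],
)
print(message.content[0].text)
```
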
Qwen-VL

Qwen-VL is an LMM developed by Alibaba Cloud. Qwen-VL accepts images, text, and bounding boxes as inputs, and can output text and bounding boxes. Qwen-VL naturally supports English, Chinese, and multilingual conversation.

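Because Qwen-VL can return bounding boxes, here is a hedged sketch of its grounding interface following the usage pattern published on the Qwen-VL-Chat model card; the image URL and prompt are placeholders, and trust_remote_code executes the repository's custom modeling code.

```python
# Sketch: grounded captioning with Qwen-VL-Chat, per its model card pattern.
# The image URL is a placeholder; trust_remote_code runs the repo's custom code.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-VL-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).eval()

query = tokenizer.from_list_format([
    {"image": "https://example.com/photo.jpg"},  # placeholder image URL
    {"text": "Generate the caption in English with grounding:"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)  # caption text with <ref>...</ref><box>...</box> annotations
```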
