Vision-Language Models (VLMs)

Vision-language models (VLMs) are pre-trained on image-text pairs, enabling zero-shot predictions for visual recognition tasks. They can be used for multimodal tasks such as visual question answering, image captioning, and image tagging.
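
To make the zero-shot idea concrete, here is a minimal sketch of zero-shot image classification with CLIP via the Hugging Face transformers library; the checkpoint name, image path, and candidate labels are illustrative placeholders rather than anything prescribed by this page.

```python
# Minimal zero-shot classification sketch with CLIP (Hugging Face transformers).
# The checkpoint, image path, and labels below are illustrative placeholders.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # any local image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # similarity -> label probabilities
print(dict(zip(labels, probs[0].tolist())))
```

Because the label set is supplied at inference time, the same weights can classify arbitrary categories without fine-tuning, which is what makes the predictions "zero-shot".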

Deploy select models (e.g., YOLOv8, CLIP) using the Roboflow Hosted API, or on your own hardware using Roboflow Inference.
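
As a rough sketch of what local deployment looks like with the open-source Roboflow Inference Python package (pip install inference): the model alias and image path below are placeholders, and hosted or private models additionally require a Roboflow API key.

```python
# Hedged sketch: running a public model locally with Roboflow Inference.
# "yolov8n-640" is a public model alias; "image.jpg" is a placeholder path.
from inference import get_model

model = get_model(model_id="yolov8n-640")
results = model.infer("image.jpg")  # accepts local paths, URLs, or arrays
print(results)
```
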
PaliGemma

PaliGemma is a vision-language model (VLM) developed by Google that pairs the SigLIP vision encoder with the Gemma language model.

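For illustration, here is a minimal sketch of prompting PaliGemma through Hugging Face transformers, assuming the publicly released google/paligemma-3b-mix-224 checkpoint; the mix checkpoints expect task-prefixed prompts such as "answer en <question>", and the image path is a placeholder.

```python
# Sketch: visual question answering with PaliGemma via transformers.
# The checkpoint and image path are assumptions for illustration.
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("photo.jpg")  # placeholder image
inputs = processor(text="answer en What is in this image?", images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(output[0], skip_special_tokens=True))
```
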
GPT-4o

GPT-4o is OpenAI's third major iteration of GPT-4, expanding on the capabilities of GPT-4 with Vision.

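A minimal sketch of sending an image to GPT-4o with the official OpenAI Python SDK follows; the image URL is a placeholder and an OPENAI_API_KEY environment variable is assumed.

```python
# Sketch: image + text prompt to GPT-4o via the OpenAI Python SDK.
# The image URL is a placeholder; OPENAI_API_KEY is read from the environment.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```
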
OpenAI o3-mini

o3-mini is a multimodal reasoning model developed by OpenAI.

Qwen2.5-VL

Qwen2.5-VL is a multimodal vision-language model developed by the Qwen team at Alibaba Cloud.

PaliGemma-2

PaliGemma-2 is the second iteration of PaliGemma, a vision-language model developed by Google, upgrading the language backbone to Gemma 2.

Phi-3.5

Phi-3.5 Vision is a lightweight multimodal model in Microsoft's Phi-3.5 family of small language models.

Google Gemini

Gemini is a family of Large Multimodal Models (LMMs) developed by Google DeepMind and designed from the ground up for multimodality.

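Below is a hedged sketch of multimodal prompting with the google-generativeai Python SDK; the model name, API key handling, and image path are illustrative assumptions.

```python
# Sketch: multimodal prompt (image + text) with the google-generativeai SDK.
# The API key, model name, and image path are placeholders.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content([Image.open("photo.jpg"), "What objects are in this image?"])
print(response.text)
```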

BakLLaVA

BakLLaVA is an LMM developed by LAION, Ontocord, and Skunkworks AI. BakLLaVA uses a Mistral 7B base augmented with the LLaVA 1.5 architecture.

CogVLM

CogVLM is an open-source vision-language model developed by Zhipu AI and Tsinghua University that shows strong performance in Visual Question Answering (VQA) and other vision-language tasks.

Phi-4 Multimodal

Phi-4 Multimodal is a multimodal language model developed by Microsoft that accepts text, image, and audio inputs.

Anthropic Claude 3

Claude 3 is a family of multimodal models developed by Anthropic, released in Haiku, Sonnet, and Opus variants, that can reason over image and text inputs.

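For illustration, here is a sketch of passing an image to Claude 3 through the official anthropic Python SDK; the model ID, image file, and API key handling are placeholders.

```python
# Sketch: image + text prompt to Claude 3 via the anthropic Python SDK.
# Model ID and image file are placeholders; ANTHROPIC_API_KEY comes from the environment.
import base64

import anthropic

client = anthropic.Anthropic()
with open("photo.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=256,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/jpeg", "data": image_data}},
            {"type": "text", "text": "What is in this image?"},
        ],
    }],
)
print(message.content[0].text)
```
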
Qwen-VL

Qwen-VL is an LMM developed by Alibaba Cloud. Qwen-VL accepts images, text, and bounding boxes as inputs, and can output text and bounding boxes. Qwen-VL naturally supports English, Chinese, and multilingual conversation.

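Because Qwen-VL can return bounding boxes, here is a hedged sketch of its grounding interface following the usage pattern published on the Qwen-VL-Chat model card; the image URL and prompt are placeholders, and trust_remote_code executes the repository's custom modeling code.

```python
# Sketch: grounded captioning with Qwen-VL-Chat, per its model card pattern.
# The image URL is a placeholder; trust_remote_code runs the repo's custom code.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-VL-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).eval()

query = tokenizer.from_list_format([
    {"image": "https://example.com/photo.jpg"},  # placeholder image URL
    {"text": "Generate the caption in English with grounding:"},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)  # caption text with <ref>...</ref><box>...</box> annotations
```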
