Top Large Language Models with Vision Capabilities

Some large language models have vision capabilities that let you ask questions about the contents of images. Below, we list the most popular large multimodal models (LMMs) that can solve computer vision problems.

Deploy select models (e.g., YOLOv8, CLIP) using the Roboflow Hosted API, or on your own hardware using Roboflow Inference.

PaliGemma

PaliGemma is a vision language model (VLM) by Google that has multimodal capabilities.

Vision Language

Segment Anything 3

Segment Anything 3 (SAM 3) is an image segmentation model released by Meta.

Instance Segmentation

GPT-4o

GPT-4o is OpenAI’s third major iteration of GPT-4, expanding on the capabilities of GPT-4 with Vision.

Vision Language

LLaVA-1.5

LLaVA is an open source multimodal language model that you can use for visual question answering. It also has limited support for object detection.

Object Detection

CogVLM

CogVLM shows strong performance in Visual Question Answering (VQA) and other vision tasks.

Vision Language

Qwen-VL

Qwen-VL is an LMM developed by Alibaba Cloud. Qwen-VL accepts images, text, and bounding boxes as inputs. The model can output text and bounding boxes. Qwen-VL naturally supports English, Chinese, and multilingual conversation.

Vision Language

BakLLaVA

BakLLaVA is an LMM developed by LAION, Ontocord, and Skunkworks AI. BakLLaVA uses a Mistral 7B base augmented with the LLaVA 1.5 architecture.

Vision Language

GPT-4.1

GPT-4.1 is a multimodal model developed by OpenAI that comes in three sizes: GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano.

Phi-4 Multimodal

Phi-4 Multimodal is a multimodal language model developed by Microsoft.

Vision Language

Gemma 3

Gemma 3 is a multimodal language model developed by Google.

Google Gemini

Gemini is a family of Large Multimodal Models (LMMs) developed by Google DeepMind and designed specifically for multimodality.

Vision Language

Anthropic Claude 3

Vision Language

GPT-4 with Vision

GPT-4 with Vision is a multimodal language model developed by OpenAI.

Object Detection
