Product
Platform
Universe
Annotate
Auto Label
Train
Deploy
Inference
Integrations
Ecosystem
Notebooks
Autodistill
Supervision
Multimodal Maestro
Open Source
Roboflow Labeling
Solutions
BY INDUSTRY
Aerospace & Defense
Agriculture
Healthcare & Medicine
Automotive
Banking & Finance
Government
Oil and Gas
Retail & Ecommerce
Safety & Security
Telecommunications
Transportation
Manufacturing
Utilities
Resources
Contact Sales
User Forum
Inference Templates
Blog
Explore Models
Pricing
Docs
Talk to Sales
Sign in
Top Visual Question Answering (VQA) Models
Visual Question Answering (VQA) is a category of vision models to which you can ask a question about an image and retrieve a response. Discover popular VQA models below.
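The question-plus-image request pattern common to these models can be sketched as a chat-style payload. The field layout below follows the OpenAI GPT-4 with Vision chat format as an assumption for illustration; the model name is a placeholder and no request is actually sent:

```python
import base64
import json

def build_vqa_request(image_bytes: bytes, question: str) -> dict:
    """Build a chat-style VQA payload: one user turn pairing a text
    question with a base64-encoded image (OpenAI-style layout)."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "gpt-4-vision-preview",  # illustrative model name
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                    },
                ],
            }
        ],
        "max_tokens": 100,
    }

payload = build_vqa_request(b"\xff\xd8fake-jpeg-bytes", "How many cars are visible?")
print(json.dumps(payload)[:60])
```

The same structure applies regardless of the backing model: the question travels as text and the image travels as encoded data in the same user turn.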
Filter Models
Search Models
Filter By Task
All Models
Object Detection
Classification
Instance Segmentation
Semantic Segmentation
Keypoint Detection
Vision-Language
OCR
Filter By Feature
Foundation Vision
Multimodal Vision
LLMs with Vision Capabilities
Image Embedding
Real-Time Vision
Zero-shot Detection
Image Captioning
Image Similarity
Image Tagging
Visual Question Answering
Deploy select models (e.g., YOLOv8, CLIP) using the Roboflow Hosted API, or on your own hardware using Roboflow Inference.
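A minimal sketch of the hosted-API path, for illustration only: the endpoint base, URL shape, query parameter, and base64 body encoding shown here are assumptions about the hosted API's contract, and nothing is sent over the network:

```python
import base64

HOSTED_API_BASE = "https://detect.roboflow.com"  # assumed hosted-API base URL

def build_hosted_inference_call(model_id: str, version: int,
                                image_bytes: bytes, api_key: str):
    """Return (url, body) for a hosted-API inference POST.
    Body is the base64-encoded image (an assumption in this sketch)."""
    url = f"{HOSTED_API_BASE}/{model_id}/{version}?api_key={api_key}"
    body = base64.b64encode(image_bytes).decode("ascii")
    return url, body

url, body = build_hosted_inference_call("my-project", 1, b"fake-image", "API_KEY")
print(url)
```

Running on your own hardware with Roboflow Inference follows the same request/response shape, just pointed at a locally hosted server instead of the hosted endpoint.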
LLaVA-1.5
LLaVA is an open source multimodal language model that you can use for visual question answering; it has limited support for object detection.
Object Detection
Deploy with Roboflow
CogVLM
CogVLM shows strong performance in Visual Question Answering (VQA) and other vision tasks.
Vision-Language
Deploy with Roboflow
Qwen-VL
Qwen-VL is an LMM developed by Alibaba Cloud. Qwen-VL accepts images, text, and bounding boxes as inputs. The model can output text and bounding boxes. Qwen-VL naturally supports English, Chinese, and multilingual conversation.
Vision-Language
Deploy with Roboflow
BakLLaVA
BakLLaVA is an LMM developed by LAION, Ontocord, and Skunkworks AI. BakLLaVA uses a Mistral 7B base augmented with the LLaVA 1.5 architecture.
Vision-Language
Deploy with Roboflow
GPT-4 with Vision
GPT-4 with Vision is a multimodal language model developed by OpenAI.
Object Detection
Deploy with Roboflow
Anthropic Claude 3
Claude 3 is a family of multimodal language models developed by Anthropic that accept images as input and can answer questions about them.
Vision-Language
Deploy with Roboflow
Google Gemini
Gemini is a family of Large Multimodal Models (LMMs) developed by Google DeepMind, focused specifically on multimodality.
Vision-Language
Deploy with Roboflow