Top Image Captioning Models

You can use image captioning models to generate descriptions for the contents of an image.

Deploy select models (e.g., YOLOv8, CLIP) using the Roboflow Hosted API, or on your own hardware using Roboflow Inference.

PaliGemma

PaliGemma is a vision language model (VLM) developed by Google that takes images and text as input and generates text.

Vision Language

FastSAM

FastSAM is an image segmentation model trained using 2% of the data in the Segment Anything Model SA-1B dataset.

Instance Segmentation

CogVLM

CogVLM shows strong performance in Visual Question Answering (VQA) and other vision tasks.

Vision Language

QwenVL

Qwen-VL is a large multimodal model (LMM) developed by Alibaba Cloud. It accepts images, text, and bounding boxes as inputs, and can output text and bounding boxes. Qwen-VL supports conversation in English, Chinese, and other languages.

Vision Language

4M

The 4M model is a versatile multimodal Transformer model developed by EPFL and Apple that handles a range of vision and language tasks.

Object Detection

BakLLaVA

BakLLaVA is an LMM developed by LAION, Ontocord, and Skunkworks AI. BakLLaVA uses a Mistral 7B base augmented with the LLaVA 1.5 architecture.

Vision Language

Florence 2

Florence-2 is a lightweight vision-language model open-sourced by Microsoft under the MIT license.

Open Vocabulary Object Detection

Google Gemini

Gemini is a family of Large Multimodal Models (LMMs) developed by Google DeepMind with a specific focus on multimodality.

Vision Language

Anthropic Claude 3

Vision Language

GPT-4 with Vision

GPT-4 with Vision is a multimodal language model developed by OpenAI.

Object Detection

BLIP

BLIPv2

BLIPv2 is a multimodal model developed by Salesforce Research.

Visual Question Answering

Image Similarity

Image Captioning

Zero-shot Detection

Real-Time Vision

Image Embedding

LLMs with Vision Capabilities

Multimodal Vision

Foundation Vision