Overview

Florence-2 is a lightweight vision-language model open-sourced by Microsoft under the MIT license. The model demonstrates strong zero-shot and fine-tuning capabilities across tasks such as captioning, object detection, grounding, and segmentation.

Despite its small size, it achieves results on par with models many times larger, like Kosmos-2. The model's strength lies not in a complex architecture but in the large-scale FLD-5B dataset, consisting of 126 million images and 5.4 billion comprehensive visual annotations.
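As an illustration of the captioning task mentioned above, here is a minimal sketch that uses the Hugging Face `transformers` library rather than Roboflow Inference. It assumes the `microsoft/Florence-2-base` checkpoint and the three captioning task tokens listed on the Florence-2 model card. The helper names `caption_task` and `caption_image` are hypothetical, chosen only for this example.

```python
# Captioning task tokens as listed on the Florence-2 model card (assumed here).
CAPTION_TASKS = {
    "brief": "<CAPTION>",
    "detailed": "<DETAILED_CAPTION>",
    "more_detailed": "<MORE_DETAILED_CAPTION>",
}


def caption_task(detail: str = "brief") -> str:
    """Map a detail level to the matching Florence-2 task token (hypothetical helper)."""
    if detail not in CAPTION_TASKS:
        raise ValueError(f"detail must be one of {sorted(CAPTION_TASKS)}")
    return CAPTION_TASKS[detail]


def caption_image(image_path: str, detail: str = "brief",
                  model_id: str = "microsoft/Florence-2-base") -> str:
    """Caption one image; a sketch, not a tuned deployment (hypothetical helper)."""
    # Heavy dependencies are imported lazily so the helpers above work without them.
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    task = caption_task(detail)
    image = Image.open(image_path).convert("RGB")

    # The checkpoint ships custom modeling code, hence trust_remote_code=True.
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

    inputs = processor(text=task, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=256,
        num_beams=3,
    )
    raw_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

    # post_process_generation returns a dict keyed by the task token.
    parsed = processor.post_process_generation(
        raw_text, task=task, image_size=(image.width, image.height)
    )
    return parsed[task]
```

Calling `caption_image("photo.jpg", detail="detailed")` would download the checkpoint on first use and return a single caption string. Loading the model on every call keeps the sketch short; a real service would load it once and reuse it.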

Learn how to fine-tune Florence-2 here.

Florence 2 Image Captioning License

Florence 2 Image Captioning is licensed under the MIT license.

Deploy a Florence 2 Image Captioning API

You can use Roboflow Inference to deploy a Florence 2 Image Captioning API on your own hardware. The model can run on CPU devices (e.g. Raspberry Pi, AI PCs) and GPU devices (e.g. NVIDIA Jetson, NVIDIA T4).

Label Data Automatically with Florence 2 Image Captioning

You can automatically label a dataset using Florence 2 Image Captioning with help from Autodistill, an open source package for training computer vision models. Labeling a folder of images takes only a few lines of code.
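The folder-labeling idea can be sketched without any particular library. The example below is not the Autodistill API. It is a plain-Python illustration that walks a folder and writes one caption file per image. The function name `label_folder`, the `.txt` sidecar convention, and the `caption_fn` callable are all assumptions made for this sketch; `caption_fn` stands in for whatever captioning model you use.

```python
from pathlib import Path
from typing import Callable, Dict

# Image extensions considered by this sketch (an assumption, adjust as needed).
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def label_folder(folder: str, caption_fn: Callable[[Path], str]) -> Dict[str, str]:
    """Caption every image in `folder` and write each caption to a .txt sidecar file.

    `caption_fn` is a hypothetical callable that takes an image path and
    returns a caption string, for example a wrapper around Florence-2.
    """
    captions: Dict[str, str] = {}
    # sorted() materializes the listing first, so new .txt files are not revisited.
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() not in IMAGE_EXTENSIONS:
            continue  # skip non-image files
        caption = caption_fn(path)
        path.with_suffix(".txt").write_text(caption, encoding="utf-8")
        captions[path.name] = caption
    return captions
```

Passing the captioner in as a function keeps the file handling separate from the model, so the same loop works with a local model or a hosted API.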
