Florence-2 is a lightweight vision-language model open-sourced by Microsoft under the MIT license. The model demonstrates strong zero-shot and fine-tuning capabilities across tasks such as captioning, object detection, grounding, and segmentation.
Despite its small size, it achieves results on par with models many times larger, like Kosmos-2. The model's strength lies not in a complex architecture but in the large-scale FLD-5B dataset, consisting of 126 million images and 5.4 billion comprehensive visual annotations.
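As an illustration of the captioning task described above, here is a minimal sketch that runs Florence-2 through the Hugging Face Transformers library rather than through Roboflow Inference. It assumes the publicly released `microsoft/Florence-2-base` checkpoint and an environment with `torch`, `transformers`, and `Pillow` installed (the checkpoint's remote code may require further packages). The helper names `caption_task` and `caption_image` are hypothetical and introduced only for this example.

```python
from typing import Dict

# Task tokens Florence-2 accepts as prompts to choose a caption granularity.
CAPTION_TASKS: Dict[str, str] = {
    "brief": "<CAPTION>",
    "detailed": "<DETAILED_CAPTION>",
    "more_detailed": "<MORE_DETAILED_CAPTION>",
}


def caption_task(level: str = "brief") -> str:
    """Map a detail level to its Florence-2 task token (hypothetical helper)."""
    if level not in CAPTION_TASKS:
        raise ValueError(
            f"unknown level {level!r}; expected one of {sorted(CAPTION_TASKS)}"
        )
    return CAPTION_TASKS[level]


def caption_image(
    image_path: str,
    level: str = "brief",
    model_id: str = "microsoft/Florence-2-base",  # assumed checkpoint
) -> str:
    """Caption one image with Florence-2 (a sketch, not a tuned pipeline)."""
    # Heavy imports stay local so caption_task works without them installed.
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    task = caption_task(level)
    # Use a GPU when one is visible, otherwise fall back to CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Florence-2 ships custom modeling code, hence trust_remote_code=True.
    model = AutoModelForCausalLM.from_pretrained(
        model_id, trust_remote_code=True
    ).to(device)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=task, images=image, return_tensors="pt").to(device)

    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=256,
            num_beams=3,
        )

    # Keep special tokens so the processor can parse the task output.
    raw_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed = processor.post_process_generation(
        raw_text, task=task, image_size=(image.width, image.height)
    )
    return parsed[task]
```

Calling `caption_image("photo.jpg", level="detailed")` with a local image path of your own should return a caption string. The first call downloads the model weights, so expect it to take noticeably longer than later calls.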
Learn how to fine-tune Florence-2 here.
Florence 2 Image Captioning is licensed under an MIT license.
You can use Roboflow Inference to deploy a Florence 2 Image Captioning API on your hardware. You can deploy the model on CPU devices (e.g. Raspberry Pi, AI PCs) and GPU devices (e.g. NVIDIA Jetson, NVIDIA T4).
Below are instructions on how to deploy your own model API.
