Use the widget below to experiment with Florence-2. You can detect COCO classes such as people, vehicles, animals, and household items.
Florence-2 is a lightweight vision-language model open-sourced by Microsoft under the MIT license. The model demonstrates strong zero-shot and fine-tuning capabilities across tasks such as captioning, object detection, grounding, and segmentation.
Despite its small size, it achieves results on par with models many times larger, like Kosmos-2. The model's strength lies not in a complex architecture but in the large-scale FLD-5B dataset, consisting of 126 million images and 5.4 billion comprehensive visual annotations.
Learn how to fine-tune Florence-2 here.
We have made an interactive playground that you can use to test Florence-2. In the widget below, upload an image, then run the playground.
The playground will aim to identify bounding boxes for every object in the image using Florence-2's open-ended object detection task type.
It may take several seconds to see the result for your image.
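If you want to run the same object detection task yourself, below is a minimal sketch using the Florence-2 weights on Hugging Face via Transformers. The `microsoft/Florence-2-base` model ID, file names, and generation settings are assumptions for illustration; adjust them for your setup.

# a minimal sketch of Florence-2's "<OD>" (object detection) task with
# Hugging Face Transformers; assumes transformers, torch, timm, and einops
# are installed, and that "image.jpeg" exists locally
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("image.jpeg")
task = "<OD>"  # Florence-2's object detection task prompt

inputs = processor(text=task, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# parse the raw text into {"<OD>": {"bboxes": [[x1, y1, x2, y2], ...], "labels": [...]}}
parsed = processor.post_process_generation(
    generated_text, task=task, image_size=(image.width, image.height)
)
print(parsed[task])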
Florence-2 is licensed under an MIT license.
Compared to other generalist and specialist models, Florence-2 performs comparably to models many times larger than itself. On text visual question answering (TextVQA), Florence-2 outperforms all other existing specialist and generalist models.
Among zero-shot models, Florence-2 outperforms both Kosmos-2 and Flamingo, two much larger multimodal models.
Both images are taken from the Florence-2 paper.
You can use Roboflow Inference to deploy a Florence-2 API on your own hardware. You can deploy the model on CPU devices (e.g., Raspberry Pi, AI PCs) and GPU devices (e.g., NVIDIA Jetson, NVIDIA T4).
Below are instructions on how to deploy your own model API.
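As a rough sketch, you can start a local Inference server with the `inference-cli` package and query it over HTTP using the `inference-sdk` client. The `florence-2-base` model ID and the API key placeholder below are assumptions; check the Roboflow Inference documentation for the exact identifier.

# start a local Inference server first:
#   pip install inference-cli inference-sdk
#   inference server start
#
# then query the server; the "florence-2-base" model ID below is an
# assumption for illustration
from inference_sdk import InferenceHTTPClient

client = InferenceHTTPClient(
    api_url="http://localhost:9001",  # local Inference server
    api_key="YOUR_ROBOFLOW_API_KEY",
)

result = client.infer("image.jpeg", model_id="florence-2-base")
print(result)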
To use Florence-2 with Autodistill, you need to install the following dependency:
pip3 install autodistill-florence-2
Then, run:
from autodistill_florence_2 import Florence2
from autodistill.detection import CaptionOntology
from PIL import Image
import supervision as sv

# define an ontology to map class names to our Florence-2 prompt
# the ontology dictionary has the format {caption: class}
# where caption is the prompt sent to the base model, and class is the label that will
# be saved for that caption in the generated annotations
# then, load the model
base_model = Florence2(
    ontology=CaptionOntology(
        {
            "person": "person",
            "a forklift": "forklift"
        }
    )
)

# run inference on a single image
image = Image.open("image.jpeg")
detections = base_model.predict("image.jpeg")

# plot the predictions with supervision
bounding_box_annotator = sv.BoundingBoxAnnotator()
annotated_frame = bounding_box_annotator.annotate(
    scene=image.copy(),
    detections=detections
)

sv.plot_image(image=annotated_frame, size=(16, 16))

# label a folder of images
base_model.label("./context_images", extension=".jpeg")
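Once labeling finishes, you can train a smaller target model on the generated dataset. Below is a hedged sketch of the standard Autodistill flow using the `autodistill-yolov8` package; the `./context_images_labeled` output path assumes Autodistill's default of writing labels to a folder named after the input folder, so adjust it if you passed an explicit output folder.

# a minimal sketch of distilling the labels into a YOLOv8 target model;
# requires: pip3 install autodistill-yolov8
from autodistill_yolov8 import YOLOv8

target_model = YOLOv8("yolov8n.pt")
target_model.train("./context_images_labeled/data.yaml", epochs=200)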