Use the widget below to experiment with PaliGemma. You can detect COCO classes such as people, vehicles, animals, and household items.
PaliGemma, released at the 2024 Google I/O event, is a multimodal model that combines two other models from Google Research: SigLIP, a vision model, and Gemma, a large language model. In other words, the model pairs a Vision Transformer image encoder with a Transformer text decoder. It takes both image and text as input, generates text as output, and supports multiple languages.
Unlike other VLMs, such as OpenAI’s GPT-4o, Google Gemini, and Anthropic’s Claude 3, which have struggled with object detection and segmentation, PaliGemma handles a wide range of tasks and can be fine-tuned for better performance on specific ones.
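To make the image-plus-text interface concrete, here is a minimal sketch of running PaliGemma directly with the Hugging Face transformers library. The paligemma-3b-mix-224 checkpoint and the captioning prompt are assumptions for illustration, separate from the Roboflow deployment flow described later.

# A minimal sketch of PaliGemma's image + text -> text interface using
# Hugging Face transformers. The checkpoint name and prompt are assumptions.
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image

model_id = "google/paligemma-3b-mix-224"  # assumed mix checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("image.jpeg")
inputs = processor(text="caption en", images=image, return_tensors="pt")

# Generate text conditioned on both the image and the prompt
output_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output_ids[0], skip_special_tokens=True))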
PaliGemma is licensed under a custom Google license.
We’ve created a table comparing PaliGemma to other models based on their reported results on common benchmarks.
While benchmarks are helpful data points, they do not tell the entire story. PaliGemma is built to be fine-tuned, whereas the other models are closed-source. To show which options are available, we compare against other, often much larger, models that cannot be fine-tuned.
You can use Roboflow Inference to deploy a PaliGemma API on your own hardware. You can deploy the model on CPU devices (e.g., Raspberry Pi, AI PCs) and GPU devices (e.g., NVIDIA Jetson, NVIDIA T4).
Below are instructions on how to deploy your own model API.
from inference.models.paligemma.paligemma import PaliGemma
from PIL import Image

# Load PaliGemma via Roboflow Inference (requires a Roboflow API key)
model = PaliGemma(api_key="YOUR ROBOFLOW API KEY")

# Open an image and ask the model a question about it
image = Image.open("/content/image.jpeg")
prompt = "How many dogs are in this image?"

result = model.predict(image, prompt)
print(result)
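PaliGemma routes tasks through prompt prefixes: prefixing a prompt with detect asks the model to return bounding boxes encoded as location tokens. As a sketch reusing the model and image objects above (the class name dog is illustrative, not a required value):

# Detection prompts use the "detect" prefix; the model responds with
# <locXXXX> tokens that encode normalized bounding box coordinates.
detection_prompt = "detect dog"  # "dog" is an illustrative class name
detections = model.predict(image, detection_prompt)
print(detections)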