PaliGemma, released at the 2024 Google I/O event, is a multimodal model that combines two other models from Google research: SigLIP, a vision model, and Gemma, a large language model. Architecturally, it pairs a Vision Transformer image encoder with a Transformer text decoder. The model takes both an image and text as input and generates text as output, with support for multiple languages.
Unlike other VLMs such as OpenAI’s GPT-4o, Google Gemini, and Anthropic’s Claude 3, which have struggled with object detection and segmentation, PaliGemma handles a wide range of tasks and can be fine-tuned for better performance on specific ones.
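To illustrate that image-plus-text interface, below is a minimal sketch that runs PaliGemma through the Hugging Face transformers library. This is for demonstration only, not the deployment path covered later in this post; the checkpoint name google/paligemma-3b-mix-224, the image path, and the prompt are assumptions, and the checkpoint is gated on Hugging Face, so you must accept the license to download it.
# Minimal sketch: image + text in, text out (assumes transformers and torch are installed).
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
model_id = "google/paligemma-3b-mix-224"  # assumed checkpoint name
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)
image = Image.open("image.jpeg")  # any local image
prompt = "caption en"  # task-prefixed prompt: caption the image in English
inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50)
# The decoded output echoes the prompt prefix before the generated caption.
print(processor.decode(output[0], skip_special_tokens=True))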
PaliGemma is licensed under a custom Google license.
We’ve created a table showing PaliGemma’s results relative to other models, based on reported results on common benchmarks. While benchmarks are helpful data points, they do not tell the entire story: PaliGemma is built to be fine-tuned, while the other models are closed source. To show which options are available, we compare against other, often much larger, models that cannot be fine-tuned.
You can use Roboflow Inference to deploy a PaliGemma API on your own hardware. The model runs on CPU devices (e.g., Raspberry Pi, AI PCs) and GPU devices (e.g., NVIDIA Jetson, NVIDIA T4). Below are instructions on how to deploy your own model API.
from inference.models.paligemma.paligemma import PaliGemma
from PIL import Image
# Authenticate with your Roboflow API key to load the model
model = PaliGemma(api_key="YOUR ROBOFLOW API KEY")
# Load the image you want to ask about
image = Image.open("/content/image.jpeg")
# Ask PaliGemma a question about the image
prompt = "How many dogs are in this image?"
result = model.predict(image, prompt)
print(result)
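Beyond open-ended questions, PaliGemma also responds to task-prefixed prompts. As a hedged sketch reusing the same predict() call and image from above, a detection prompt might look like the following; the "detect" prefix follows PaliGemma's documented task prompts, but the exact output structure depends on your Inference version.
# "detect <object>" asks PaliGemma to localize the named object; the model
# replies with location tokens encoding bounding boxes (output format may vary).
detections = model.predict(image, "detect dog")
print(detections)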