Phi-4 Multimodal is a multimodal language model developed by Microsoft. The model accepts inputs in three modalities: text, images, and audio.
On performance, the official model announcement notes:
Despite its smaller size, the model maintains competitive performance on general multimodal capabilities, such as document and chart understanding, Optical Character Recognition (OCR), and visual science reasoning, matching or exceeding close models like Gemini-2-Flash-lite-preview/Claude-3.5-Sonnet.
Phi-4 Multimodal is licensed under an MIT license.
You can use Roboflow Inference to deploy a Phi-4 Multimodal API on your own hardware. You can deploy the model on CPU devices (e.g., Raspberry Pi, AI PCs) and GPU devices (e.g., NVIDIA Jetson, NVIDIA T4).
Below are instructions on how to deploy your own model API.
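As a rough illustration of the workflow, the sketch below queries a locally hosted model through the `inference_sdk` Python client, after installing the tooling with `pip install inference-cli inference-sdk` and starting a local server with `inference server start`. The model ID used here and the way a text prompt is passed are assumptions, not confirmed values; consult your Roboflow model page and the Inference documentation for the exact identifiers.

```python
# Minimal sketch: query a locally hosted Phi-4 Multimodal API served by
# Roboflow Inference. Assumes a server started with `inference server start`
# (default port 9001). The model ID below is a placeholder -- check your
# Roboflow model page for the exact identifier.
from inference_sdk import InferenceHTTPClient

# Point the client at the local inference server.
client = InferenceHTTPClient(
    api_url="http://localhost:9001",
    api_key="YOUR_ROBOFLOW_API_KEY",  # from your Roboflow account settings
)

# Run the model on a local image. Parameter names for supplying a text or
# audio prompt may differ by model and Inference version.
result = client.infer("image.jpg", model_id="phi-4-multimodal")
print(result)
```

On GPU devices such as an NVIDIA Jetson or T4, the same client code applies; only the server host and hardware change.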