Use the widget below to experiment with SmolVLM2. You can ask questions about images, such as identifying and describing objects like people, vehicles, animals, and household items.
SmolVLM2, developed by engineers on the Hugging Face TB Research team, is part of the “Smol Models” initiative. This initiative is focused on making “efficient and lightweight AI models [...] that can run effectively on-device while maintaining strong performance.” SmolVLM2 is capable of both image and video understanding.
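If you want to try the model directly, the sketch below shows one way to run image understanding with SmolVLM2 through Hugging Face transformers. This is a minimal sketch, assuming a recent transformers release with SmolVLM2 support; the checkpoint name refers to the 2.2B model on the Hugging Face Hub, and the image URL is a placeholder you would replace with your own.

```python
# Minimal sketch: image understanding with SmolVLM2 via transformers.
# Assumes a recent transformers release with SmolVLM2 support and a
# CUDA GPU; swap .to("cuda") for .to("cpu") on CPU-only machines.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"  # 256M and 500M variants also exist
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda")

# Chat-style message with an image URL (placeholder) and a question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

Video inputs follow the same chat-template pattern, with a video entry in the message content in place of the image.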
SmolVLM2 comes in three sizes: 256M, 500M, and 2.2B parameters. The larger the model, the better performance you can expect on your tasks. Compared with the first SmolVLM model, released in November 2024, SmolVLM2 outperforms its predecessor on the MathVista, OCRBench, AI2D, and ScienceQA benchmarks:
According to Hugging Face, the SmolVLM2 models “outperform any existing models per memory consumption. Looking at Video-MME (the go-to scientific benchmark in video), SmolVLM2 joins frontier model families on the 2B range and we lead the pack in the even smaller space.”
We tested SmolVLM2 on various multimodal tasks. Here are our findings, compared to other multimodal models available at the time of the model's release:
SmolVLM2 is licensed under an Apache 2.0 license.
You can use Roboflow Inference to deploy a SmolVLM2 API on your hardware. You can deploy the model on CPU devices (e.g. Raspberry Pi, AI PCs) and GPU devices (e.g. NVIDIA Jetson, NVIDIA T4).
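As a quick preview of what a deployed API looks like, the sketch below sends an image and a prompt to a locally running Inference server. The route, payload shape, and model identifier here are illustrative assumptions rather than the documented interface; the instructions that follow cover the supported setup.

```python
# Hedged sketch of calling a locally hosted SmolVLM2 API.
# The /infer/lmm route, payload fields, and "smolvlm2" model id are
# assumptions for illustration; follow the official instructions below.
import base64
import requests

with open("image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "http://localhost:9001/infer/lmm",  # default local Inference server port
    json={
        "model_id": "smolvlm2",  # hypothetical identifier
        "image": {"type": "base64", "value": image_b64},
        "prompt": "Describe this image.",
        "api_key": "YOUR_ROBOFLOW_API_KEY",
    },
)
print(response.json())
```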
Below are instructions on how to deploy your own model API.