Qwen2.5-VL is a multimodal vision-language model developed by the Qwen team at Alibaba Cloud. You can use Qwen2.5-VL for a wide variety of tasks, including visual question answering, document OCR, and object detection. The model is available in three sizes: 3B, 7B, and 72B.
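To get a feel for how the model is used, below is a minimal visual question answering sketch based on the Hugging Face Transformers integration. It assumes a recent transformers release that ships `Qwen2_5_VLForConditionalGeneration`, plus the `qwen-vl-utils` helper package; the image path and prompt are placeholders.

```python
# Minimal Qwen2.5-VL VQA sketch via Hugging Face Transformers.
# Setup (shell): pip install -U transformers qwen-vl-utils accelerate

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"  # 7B and 72B variants also exist

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Chat-style message with one image and one question; the image path
# below is a placeholder.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image.jpg"},
            {"type": "text", "text": "What objects are in this image?"},
        ],
    }
]

# Build the text prompt and extract the image inputs from the messages.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate, then decode only the newly generated tokens.
generated = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```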
Across benchmarks, the largest Qwen2.5-VL model performs competitively against other state-of-the-art models such as GPT-4o and Gemini 2.0 Flash. For example, on the DocVQA, InfoVQA, and CC-OCR benchmarks, Qwen2.5-VL outperforms GPT-4o.
In the Qwen2.5-VL GitHub repository, the project authors highlight several key tasks on which the model performs better than previous Qwen models, including document parsing, precise object grounding, and long-video understanding.
Qwen2.5-VL is licensed under an Apache 2.0 license.
You can use Roboflow Inference to deploy a Qwen2.5-VL API on your own hardware. You can deploy the model on CPU devices (e.g., Raspberry Pi, AI PCs) and GPU devices (e.g., NVIDIA Jetson, NVIDIA T4).
Below are instructions on how to deploy your own model API.
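As a preview, the sketch below shows roughly what querying a locally deployed model API can look like from Python. This is a minimal sketch, not the authoritative setup: the model ID, the image filename, and the exact request shape for multimodal prompts are assumptions for illustration; check the Roboflow Inference documentation for the identifiers and parameters supported by your install.

```python
# Sketch of querying a locally deployed Qwen2.5-VL API through
# Roboflow Inference. Setup (shell):
#   pip install inference-cli inference-sdk
#   inference server start   # serves on http://localhost:9001 by default

from inference_sdk import InferenceHTTPClient

# Point the client at the local Inference server.
client = InferenceHTTPClient(
    api_url="http://localhost:9001",
    api_key="YOUR_ROBOFLOW_API_KEY",  # placeholder: use your own key
)

# NOTE: the model ID "qwen25-vl-7b" is an assumption for illustration;
# consult the Inference docs for the exact identifier and for how to
# pass a text prompt to multimodal models on your Inference version.
result = client.infer("street.jpg", model_id="qwen25-vl-7b")
print(result)
```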