Use the widget below to experiment with the Vision Transformer. You can classify images containing COCO classes such as people, vehicles, animals, and household items.
The Vision Transformer applies the transformer encoder architecture behind powerful natural language processing models such as BERT to images. Each input image is split into fixed-size patches, which are linearly embedded; position embeddings are then added, and the resulting sequence is fed to a standard transformer encoder. To classify the image, a learnable [CLS] token is prepended to the patch sequence, and its encoder output serves as the image representation.
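As a minimal sketch of the patch-and-embed step described above, the PyTorch module below turns an image into the token sequence a ViT encoder consumes. The class name `ViTEmbedder` and the default sizes (224×224 images, 16×16 patches, 768-dimensional embeddings, matching the ViT-Base configuration) are illustrative choices, not the exact reference implementation:

```python
import torch
import torch.nn as nn

class ViTEmbedder(nn.Module):
    """Turns an image into the token sequence a ViT encoder consumes."""

    def __init__(self, image_size=224, patch_size=16, in_channels=3, dim=768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A strided convolution splits the image into non-overlapping
        # patches and linearly embeds each one in a single step.
        self.patch_embed = nn.Conv2d(
            in_channels, dim, kernel_size=patch_size, stride=patch_size
        )
        # Learnable [CLS] token, prepended to the patch sequence.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Learnable position embeddings, one per token (patches + [CLS]).
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, images):                  # (B, C, H, W)
        x = self.patch_embed(images)            # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)        # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)          # prepend [CLS]
        return x + self.pos_embed               # add position embeddings

tokens = ViTEmbedder()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 197, 768]) -> 196 patches + 1 [CLS]
```

The sequence this produces is what the transformer encoder processes; after encoding, the vector at the [CLS] position is passed to a classification head.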
Applied to image classification, this approach achieves state-of-the-art performance on a variety of benchmarks, rivaling traditional convolutional neural networks.
Images courtesy of Google Research.
Vision Transformer is licensed under an Apache-2.0 license.
You can use Roboflow Inference to deploy a Vision Transformer API on your own hardware. You can deploy the model on CPU devices (e.g. Raspberry Pi, AI PCs) and GPU devices (e.g. NVIDIA Jetson, NVIDIA T4).
Below are instructions on how to deploy your own model API.
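As a rough sketch of what querying such a deployment can look like, the snippet below uses the `inference-sdk` HTTP client against a locally running Roboflow Inference server. The server URL, API key, model ID, and image path are placeholders you would replace with your own values:

```python
# pip install inference-sdk
from inference_sdk import InferenceHTTPClient

# Point the client at a local Roboflow Inference server
# (started e.g. with `inference server start`) or the hosted API.
client = InferenceHTTPClient(
    api_url="http://localhost:9001",     # assumption: server running locally
    api_key="YOUR_ROBOFLOW_API_KEY",     # placeholder API key
)

# "your-model/1" is a placeholder model ID from your Roboflow workspace.
result = client.infer("image.jpg", model_id="your-model/1")
print(result)
```

The same client code works whether the server runs on a CPU device or a GPU device; only the hardware hosting the server changes.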