Name: FastViT
Rating: 2 (1 review)

Overview

Paper abstract:

The recent amalgamation of transformer and convolutional designs has led to steady improvements in accuracy and efficiency of the models. In this work, we introduce FastViT, a hybrid vision transformer architecture that obtains the state-of-the-art latency-accuracy trade-off. To this end, we introduce a novel token mixing operator, RepMixer, a building block of FastViT, that uses structural reparameterization to lower the memory access cost by removing skip-connections in the network. We further apply train-time overparametrization and large kernel convolutions to boost accuracy and empirically show that these choices have minimal effect on latency. We show that - our model is 3.5x faster than CMT, a recent state-of-the-art hybrid transformer architecture, 4.9x faster than EfficientNet, and 1.9x faster than ConvNeXt on a mobile device for the same accuracy on the ImageNet dataset. At similar latency, our model obtains 4.2% better Top-1 accuracy on ImageNet than MobileOne. Our model consistently outperforms competing architectures across several tasks -- image classification, detection, segmentation and 3D mesh regression with significant improvement in latency on both a mobile device and a desktop GPU. Furthermore, our model is highly robust to out-of-distribution samples and corruptions, improving over competing robust models. Code and models are available at this https URL.

Source.
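The abstract's central idea, structural reparameterization, can be illustrated with a small sketch. It is not the FastViT implementation: it assumes a simplified 1D depthwise convolution, omits the batch normalization a real block would also fold, and uses made-up function names and values. It only shows why a skip connection around a depthwise convolution can be removed at inference time: adding 1 to the center tap of each kernel produces the same output with a single pass over the input.

```python
def depthwise_conv1d(x, kernels):
    """Apply one odd-length kernel per channel, zero-padded to keep the length."""
    out = []
    for channel, kernel in zip(x, kernels):
        pad = len(kernel) // 2
        padded = [0.0] * pad + list(channel) + [0.0] * pad
        out.append([
            sum(kernel[k] * padded[i + k] for k in range(len(kernel)))
            for i in range(len(channel))
        ])
    return out


def train_time_block(x, kernels):
    """Train-time form: skip connection plus depthwise convolution."""
    conv = depthwise_conv1d(x, kernels)
    return [[xi + ci for xi, ci in zip(xc, cc)] for xc, cc in zip(x, conv)]


def fold_skip_into_kernels(kernels):
    """Inference-time form: the identity branch becomes +1 on the center tap."""
    folded = []
    for kernel in kernels:
        merged = list(kernel)
        merged[len(kernel) // 2] += 1.0
        folded.append(merged)
    return folded


# Illustrative input: 2 channels, 4 positions each (values are arbitrary)
x = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]]
kernels = [[0.25, 0.5, 0.25], [-1.0, 0.0, 1.0]]

y_train = train_time_block(x, kernels)
y_infer = depthwise_conv1d(x, fold_skip_into_kernels(kernels))

# Largest absolute difference between the two forms
max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(y_train, y_infer)
    for a, b in zip(row_a, row_b)
)
print(max_diff)
```

The folded form reads the input once instead of twice, which is the kind of memory access saving the abstract attributes to removing skip connections.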

Use This Model

First, install Autodistill and Autodistill FastViT:


pip install autodistill autodistill-fastvit

Then, run:


from autodistill_fastvit import FastViT, FASTVIT_IMAGENET_1K_CLASSES
from autodistill.detection import CaptionOntology

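# Option 1: zero-shot classification with no prompts
base_model = FastViT(None)

# Option 2: zero-shot classification with prompts taken from
# FASTVIT_IMAGENET_1K_CLASSES (this assignment replaces the model created above)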
base_model = FastViT(
    ontology=CaptionOntology(
        {
            "coffeemaker": "coffeemaker",
            "ice cream": "ice cream"
        }
    )
)

# Run inference on a single image
predictions = base_model.predict("./example.png")

# Map each predicted class ID to its class name
labels = [FASTVIT_IMAGENET_1K_CLASSES[i] for i in predictions.class_id.tolist()]

print(labels)


Label Data Automatically with FastViT

You can automatically label a dataset using FastViT with the help of Autodistill, an open source package for training computer vision models. Labeling a folder of images takes only a few lines of code. Below, see our tutorials that demonstrate how to use FastViT to train a computer vision model.
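As a rough illustration of what labeling a folder involves, the sketch below walks a directory and records one label per image. It is not Autodistill's own labeling code: `label_folder` and `stand_in_predictor` are hypothetical names, the file names are invented, and the stand-in predictor guesses from the file name so the example runs without downloading a model. In practice you would pass a function that wraps `base_model.predict` and the `FASTVIT_IMAGENET_1K_CLASSES` lookup from the snippet above.

```python
import os
import tempfile


def label_folder(predict_top_label, input_folder, extension=".png"):
    """Return {file name: label} for every file in input_folder with the given extension."""
    labels = {}
    for name in sorted(os.listdir(input_folder)):
        if name.lower().endswith(extension):
            labels[name] = predict_top_label(os.path.join(input_folder, name))
    return labels


def stand_in_predictor(image_path):
    # Hypothetical placeholder: a real pipeline would run the FastViT model here
    if "coffee" in os.path.basename(image_path):
        return "coffeemaker"
    return "ice cream"


# Build a throwaway folder with two empty "images" and one unrelated file
with tempfile.TemporaryDirectory() as folder:
    for name in ("coffee_01.png", "dessert_01.png", "notes.txt"):
        open(os.path.join(folder, name), "wb").close()
    folder_labels = label_folder(stand_in_predictor, folder)

print(folder_labels)
```

Passing the predictor as a function keeps the folder traversal separate from the model, so the same loop works with any classifier that returns a single label per image.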
