Use the widget below to experiment with FastViT. You can classify images into ImageNet-1k classes, such as animals, vehicles, and household items.
Paper abstract:
The recent amalgamation of transformer and convolutional designs has led to steady improvements in accuracy and efficiency of the models. In this work, we introduce FastViT, a hybrid vision transformer architecture that obtains the state-of-the-art latency-accuracy trade-off. To this end, we introduce a novel token mixing operator, RepMixer, a building block of FastViT, that uses structural reparameterization to lower the memory access cost by removing skip-connections in the network. We further apply train-time overparametrization and large kernel convolutions to boost accuracy and empirically show that these choices have minimal effect on latency. We show that our model is 3.5x faster than CMT, a recent state-of-the-art hybrid transformer architecture, 4.9x faster than EfficientNet, and 1.9x faster than ConvNeXt on a mobile device for the same accuracy on the ImageNet dataset. At similar latency, our model obtains 4.2% better Top-1 accuracy on ImageNet than MobileOne. Our model consistently outperforms competing architectures across several tasks -- image classification, detection, segmentation and 3D mesh regression with significant improvement in latency on both a mobile device and a desktop GPU. Furthermore, our model is highly robust to out-of-distribution samples and corruptions, improving over competing robust models. Code and models are available at this https URL.
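To make the reparameterization idea concrete, below is a minimal sketch, not the official FastViT code, of how a train-time block of the form y = x + dwconv(batchnorm(x)) can be folded into a single depthwise convolution at inference; this is the trick RepMixer uses to remove skip-connections. The RepMixerSketch class and reparameterize() method are hypothetical names for illustration.

import torch
import torch.nn as nn

class RepMixerSketch(nn.Module):
    """Illustrative (unofficial) train-time block: y = x + dwconv(bn(x))."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.dim = dim
        self.kernel_size = kernel_size
        self.norm = nn.BatchNorm2d(dim)
        # depthwise conv: groups == channels
        self.conv = nn.Conv2d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim, bias=True)

    def forward(self, x):
        # train-time path keeps the skip connection
        return x + self.conv(self.norm(x))

    @torch.no_grad()
    def reparameterize(self) -> nn.Conv2d:
        """Fold BN into the conv, then absorb the skip connection as an
        identity kernel, so one depthwise conv reproduces x + conv(bn(x))."""
        bn, conv, k = self.norm, self.conv, self.kernel_size
        std = (bn.running_var + bn.eps).sqrt()
        scale = bn.weight / std                       # per-channel BN scale
        shift = bn.bias - bn.running_mean * scale     # per-channel BN shift
        # conv(bn(x)) == conv'(x): rescale weights, shift bias
        w = conv.weight * scale.reshape(-1, 1, 1, 1)
        b = conv.bias + conv.weight.sum(dim=(1, 2, 3)) * shift
        # the residual branch becomes a +1 at the kernel center
        w[:, 0, k // 2, k // 2] += 1.0
        fused = nn.Conv2d(self.dim, self.dim, k, padding=k // 2,
                          groups=self.dim, bias=True)
        fused.weight.copy_(w)
        fused.bias.copy_(b)
        return fused

# sanity check: the fused conv matches the train-time block at inference
block = RepMixerSketch(8).eval()
x = torch.randn(1, 8, 16, 16)
assert torch.allclose(block(x), block.reparameterize()(x), atol=1e-5)

Because the fused convolution reproduces the train-time output up to floating-point error, the skip connection and normalization disappear at inference, which is where the memory access savings described in the abstract come from.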
FastViT is licensed under the terms provided in its official repository.
You can use Roboflow Inference to deploy a FastViT API on your hardware. You can deploy the model on CPU devices (e.g. Raspberry Pi, AI PCs) and GPU devices (e.g. NVIDIA Jetson, NVIDIA T4).
Below are instructions on how to deploy your own model API.
First, install Autodistill and Autodistill FastViT:
pip install autodistill autodistill-fastvit
Then, run:
from autodistill_fastvit import FastViT, FASTVIT_IMAGENET_1K_CLASSES
from autodistill.detection import CaptionOntology

# Option 1: zero-shot classification with no prompts
base_model = FastViT(None)

# Option 2: zero-shot classification with prompts drawn from
# FASTVIT_IMAGENET_1K_CLASSES (this overrides Option 1 above)
base_model = FastViT(
    ontology=CaptionOntology(
        {
            "coffeemaker": "coffeemaker",
            "ice cream": "ice cream"
        }
    )
)

predictions = base_model.predict("./example.png")

# map predicted class IDs back to human-readable class names
labels = [FASTVIT_IMAGENET_1K_CLASSES[i] for i in predictions.class_id.tolist()]
print(labels)
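Since Autodistill base models are built for auto-labeling, you can also annotate a whole folder of images with the ontology above. A minimal sketch, assuming the standard Autodistill label() helper and hypothetical folder paths; check the Autodistill documentation for the exact arguments:

# auto-label a folder of images using the ontology defined above
dataset = base_model.label(
    input_folder="./images",   # assumed path to your unlabeled images
    output_folder="./dataset"  # assumed path where labels are written
)

The resulting dataset can then be used to train a smaller, task-specific model, which is the standard Autodistill workflow.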