Blog
YOLOS - You Only Look At One Sequence
Unlike previous CNN-based YOLO models, the YOLOS backbone is a plain Vision Transformer, much like the first vision transformer used for image classification.
YOLOS splits an image into patches to form "patch tokens", which take the place of the wordpiece tokens traditionally used in NLP. Appended to these are 100 detection tokens: learnable embeddings, each of which feeds into one potential detection.
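The token sequence described above can be sketched in a few lines of NumPy. This is a hypothetical illustration, not the actual YOLOS implementation: the projection and detection-token weights are random placeholders (in the real model they are learned), and the 192-dim embedding size is assumed for illustration.

```python
import numpy as np

def build_token_sequence(image, patch_size=16, num_det_tokens=100, dim=192):
    """Sketch of YOLOS input construction: patch tokens + detection tokens."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Split the image into non-overlapping patches and flatten each one.
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(
        -1, patch_size * patch_size * c)
    # Linearly project each flattened patch to the embedding dimension
    # (random weights here; learned in the real model).
    proj = np.random.default_rng(0).standard_normal((patches.shape[1], dim))
    patch_tokens = patches @ proj
    # 100 learnable detection-token embeddings (placeholder zeros here).
    det_tokens = np.zeros((num_det_tokens, dim))
    # The transformer consumes patch tokens and detection tokens together.
    return np.concatenate([patch_tokens, det_tokens], axis=0)

img = np.random.default_rng(1).standard_normal((512, 512, 3))
tokens = build_token_sequence(img)
print(tokens.shape)  # (1124, 192): 1024 patch tokens + 100 detection tokens
```

For a 512×512 input with 16×16 patches, this yields 32×32 = 1024 patch tokens, so the full sequence is 1124 tokens long.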
Compared to CNN-based YOLO models, YOLOS benefits from the rising tide of transformers in computer vision, and it runs inference without non-maximum suppression (NMS), a tedious post-processing step that makes deploying other YOLO models difficult and slow.
Its design allows it to generalize well across different datasets and tasks, making it an impressive choice for research and experimental applications.
Train YOLOS on your own dataset here.
| Model | Pre-train Epochs | ViT (DeiT) Weight / Log | Fine-tune Epochs | Eval Size | YOLOS Checkpoint / Log | AP @ COCO val |
|---|---|---|---|---|---|---|
| YOLOS-Ti | 300 | FB | 300 | 512 | Baidu Drive, Google Drive / Log | 28.7 |
| YOLOS-S | 200 | Baidu Drive, Google Drive / Log | 150 | 800 | Baidu Drive, Google Drive / Log | 36.1 |
| YOLOS-S | 300 | FB | 150 | 800 | Baidu Drive, Google Drive / Log | 36.1 |
| YOLOS-S (dWr) | 300 | Baidu Drive, Google Drive / Log | 150 | 800 | Baidu Drive, Google Drive / Log | 37.6 |
| YOLOS-B | 1000 | FB | 150 | 800 | Baidu Drive, Google Drive / Log | 42.0 |