LLaVA is an open-source multimodal language model that you can use for visual question answering; it also has limited support for object detection.
Overview
LLaVA-1.5 is an open-source multimodal language model. You can ask LLaVA-1.5 questions in text and optionally provide an image as context for your question. The code for LLaVA-1.5 was released to accompany the "Improved Baselines with Visual Instruction Tuning" paper.
The authors of the paper note in the abstract "With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art [performance] across 11 benchmarks."
LLaVA-1.5 is available in an online demo playground where you can experiment with the model.
Performance
Across 11 academic benchmarks, LLaVA-1.5 achieves state-of-the-art results, outperforming several other multimodal models.
Use This Model
Label Data Automatically with LLaVA-1.5
You can automatically label a dataset using LLaVA-1.5 with help from Autodistill, an open-source package that uses foundation models to label data for training smaller computer vision models. You can label a folder of images automatically with only a few lines of code, as shown in the sketch below. Also see our tutorials that demonstrate how to use LLaVA-1.5 to train a computer vision model.
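The snippet below is a minimal sketch of this workflow using the standard Autodistill interface. The module name autodistill_llava, the LLaVA class, and the example prompts and class names are assumptions for illustration; check the Autodistill documentation for the exact package and class names.

```python
from autodistill.detection import CaptionOntology
from autodistill_llava import LLaVA  # assumed module and class name

# Map the prompts sent to LLaVA-1.5 to the class labels you want in the dataset.
# The prompts and labels below are placeholders for your own classes.
ontology = CaptionOntology({
    "a person wearing a hard hat": "hard hat",
    "a person without a hard hat": "no hard hat",
})

base_model = LLaVA(ontology=ontology)

# Label every image in ./images and save the annotated dataset to ./dataset.
base_model.label(input_folder="./images", output_folder="./dataset")
```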
YOLOv8 uses the YOLOv8 PyTorch TXT annotation format. If your annotations are in a different format, you can use Roboflow's annotation conversion tools to get your data into the right format.
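Once your labeled data is in YOLOv8 PyTorch TXT format, you can fine-tune a model with the Ultralytics Python API. This is a brief sketch; the dataset path, checkpoint, and training parameters are placeholders to adapt to your own project.

```python
from ultralytics import YOLO

# In YOLOv8 PyTorch TXT format, each annotation file lists one object per line:
#   <class_id> <x_center> <y_center> <width> <height>   (coordinates normalized to 0-1)

# Start from a pretrained YOLOv8 nano checkpoint and fine-tune on the labeled dataset.
model = YOLO("yolov8n.pt")
model.train(data="./dataset/data.yaml", epochs=50, imgsz=640)  # placeholder values
```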