YOLO-World Video Inference

Learn how to run inference on frames from a video using the open source supervision Python package.

Overview

Applied to videos, object detection models can yield a range of insights. You can check whether an object is present in a video; you can measure how long an object appears; you can record a list of times at which an object is or is not present.

In this guide, we are going to show how to run inference with YOLO-World on videos.

We will:

1. Load supervision and an object detection model
2. Create a callback to process a target video
3. Process the target video

Without further ado, let's get started!


Install supervision

We'll be using supervision in this guide, an open source Python package with a range of utilities for building computer vision projects. You can install supervision using the following command:

pip install supervision

This guide also loads YOLO-World through Roboflow's inference package, so install that as well:

pip install inference

Load Data and Model

First, we need to load data into a Python program. We'll also need to load a model for use in inference. Create a new Python file and add the following code:

import cv2
import numpy as np
import supervision as sv

from inference.models.yolo_world.yolo_world import YOLOWorld

# Load the YOLO-World model ("l" is the large variant)
model = YOLOWorld(model_id="yolo_world/l")

# Set the classes (text prompts) the model should detect
classes = ["person"]
model.set_classes(classes)


Replace the class list with the objects you want to detect, and adjust the model ID if you want to use a different YOLO-World model size.

Create a Video Processing Callback

Next, we need to write a callback that runs inference and applies whatever logic we want to the predictions. In the example below, we run inference on each frame and plot all predictions.


def process_frame(frame: np.ndarray, _) -> np.ndarray:
    # Run inference on the frame and convert the results for supervision
    results = model.infer(frame)
    detections = sv.Detections.from_inference(results)

    # Build a "class_name confidence" label for each detection
    labels = [
        f"{classes[class_id]} {confidence:0.2f}"
        for class_id, confidence in zip(detections.class_id, detections.confidence)
    ]

    # Draw bounding boxes and labels on the frame
    box_annotator = sv.BoxAnnotator(thickness=4, text_thickness=4, text_scale=2)
    return box_annotator.annotate(scene=frame, detections=detections, labels=labels)


You can also apply filters to only show predictions that meet certain criteria. To learn more about filtering detections, refer to the supervision Detections() documentation.
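As a sketch of the idea, the snippet below uses plain NumPy arrays to mirror how boolean indexing on detections works; inside the callback you would index the detections object directly rather than mock arrays like these.

```python
import numpy as np

# Mock of the per-detection arrays that a detections object exposes
confidence = np.array([0.92, 0.31, 0.78])
class_id = np.array([0, 0, 1])

# Keep only detections above a confidence threshold. With supervision,
# the equivalent one-liner inside the callback would be:
#     detections = detections[detections.confidence > 0.5]
keep = confidence > 0.5
confidence, class_id = confidence[keep], class_id[keep]

print(confidence)  # [0.92 0.78]
print(class_id)    # [0 1]
```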

Process the Video

Finally, we need to run our callback on every frame in our video. Replace VIDEO_PATH with the path to your input video, then run the following code:

sv.process_video(source_path=VIDEO_PATH, target_path="result.mp4", callback=process_frame)
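The overview mentioned recording the times at which an object appears. Here is a minimal, hedged sketch of that idea in plain Python: the per-frame flags would come from checking whether the callback's detections are non-empty, and the frame rate could be read with supervision's VideoInfo utility; `presence_times` and `frame_flags` are illustrative names, not part of any library.

```python
def presence_times(frame_flags, fps):
    """Return timestamps (in seconds) of the frames where the object appears.

    frame_flags: one boolean per frame, True when the object was detected.
    fps: the video's frame rate.
    """
    return [i / fps for i, present in enumerate(frame_flags) if present]

# e.g. detections were found in frames 0, 2, and 3 of a 2 fps clip
print(presence_times([True, False, True, True], fps=2))  # [0.0, 1.0, 1.5]
```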

Next Steps

supervision provides an extensive range of functionalities for working with computer vision models. With supervision, you can:

1. Process and filter detections and segmentation masks from a range of popular models (YOLOv5, Ultralytics YOLOv8, MMDetection, and more).
2. Process and filter classifications.
3. Plot bounding boxes and segmentation masks.

And more! To learn about the full range of functionality in supervision, check out the supervision documentation.