How to use CLIP with an RTSP Stream

Real Time Streaming Protocol (RTSP) is a protocol commonly used to stream video from internet-connected cameras. With supervision and Roboflow Inference, you can run a range of different models using the output of an RTSP stream in a few lines of code.

In this guide, we are going to show you how to run CLIP on frames from an RTSP camera.

To do this, we will:

1. Install supervision and Inference
2. Use the inference.Stream() method to connect to the RTSP stream and run inference
3. Test the model

Without further ado, let's get started!

Step #1: Install supervision and Inference

For this tutorial, you will need two packages: supervision and Inference. You can install them using the following command:

pip install supervision inference

The example code in this guide also imports scikit-learn. If it is not already in your environment, install it with pip install scikit-learn.

Once you have installed supervision and Inference, you are ready to start writing logic to use an RTSP stream with your model.
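Before moving on, it can help to confirm that the modules the example code imports are actually visible to your Python interpreter. The following is a minimal sketch using only the standard library; the helper name missing_packages is hypothetical and not part of supervision or Inference.

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    # Top-level modules imported by the example code in this guide.
    required = ["supervision", "inference", "cv2", "sklearn"]
    missing = missing_packages(required)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

If any module is reported as missing, install the corresponding package and run the check again.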

Step #2: Configure inference.Stream()

The inference.Stream() method allows you to stream data from a webcam or RTSP stream and run predictions on each frame. You select a model, then provide a callback function that receives the model's predictions and the frame on which inference was run.
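To make the callback pattern concrete, here is a toy stand-in for a streaming loop. It is not how Inference is implemented; run_toy_stream and toy_model are hypothetical names used only to show that the callback is called once per frame with the predictions and the frame.

```python
def run_toy_stream(frames, model, on_prediction):
    """Hypothetical stand-in for a streaming loop (not the library's code)."""
    for frame in frames:
        predictions = model(frame)         # run the model on the current frame
        on_prediction(predictions, frame)  # pass the results and the frame to the callback


def toy_model(frame):
    # Pretend "model": reports the length of the frame data.
    return {"size": len(frame)}


seen = []


def render(predictions, frame):
    # Same argument order as the callback in this guide: predictions, then frame.
    seen.append((predictions["size"], frame))


run_toy_stream(["ab", "cde"], toy_model, render)
print(seen)  # prints [(2, 'ab'), (3, 'cde')]
```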

Below, we show you how to use inference.Stream() with CLIP.

You can load the model and run it on your stream using the following code:

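import cv2
import inference

from inference.models import Clip
from sklearn.metrics.pairwise import cosine_similarity

API_KEY = ""   # your Roboflow API key
RTSP_URL = ""  # the URL of your RTSP camera

model = Clip(api_key=API_KEY)

# embed the text prompt once; every frame is compared against it
prompt = "coffee cup"
text_embedding = model.embed_text(prompt)

def render(predictions, frame):
    # cosine similarity between the frame embedding and the prompt embedding,
    # extracted as a single number so it can be clamped and formatted
    similarity = cosine_similarity(predictions["embeddings"], text_embedding)[0][0]

    # turn cosine into a percentage: rescale 0.15-0.40 to 0-100 and clamp
    low, high = 0.15, 0.40
    similarity = (similarity - low) / (high - low)
    similarity = max(0, min(1, similarity)) * 100

    text = f"{prompt} {similarity:.2f}%"
    cv2.putText(frame, text, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 255), 2)

    cv2.imshow("frame", frame)
    cv2.waitKey(1)  # let OpenCV draw the window

# Assumed call shape: the source, model, and on_prediction argument names
# follow the description in this guide; check them against your installed
# version of Inference.
inference.Stream(
    source=RTSP_URL,
    model=model,
    use_main_thread=True, # for opencv display
    on_prediction=render,
)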

Above, we load a model, then pass the model into the inference.Stream() method for use in running inference. We define a callback function called render(), which runs every time a frame is retrieved from the RTSP stream. render() can contain any logic you want to run on each frame.
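The comparison inside render() relies on cosine similarity, which measures how closely two embedding vectors point in the same direction. The sketch below computes it by hand with the standard library so you can see what the scikit-learn call does for a single pair of vectors; the function name cosine is hypothetical, and the short vectors are made-up stand-ins for real CLIP embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    image_vec = [0.2, 0.1, 0.7]  # made-up stand-in for a frame embedding
    text_vec = [0.3, 0.0, 0.6]   # made-up stand-in for a prompt embedding
    print(f"similarity: {cosine(image_vec, text_vec):.3f}")
```

Vectors that point the same way score close to 1, and unrelated (orthogonal) vectors score close to 0.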

In the example code above, we compare each frame with the text prompt, draw the resulting similarity score on the frame, and display the frame in a window. This allows you to watch your model run in real time and understand how it performs.
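The percentage shown on each frame comes from rescaling the raw cosine score. The example code maps the range 0.15 to 0.40 onto 0 to 100 and clamps anything outside it. A standalone sketch of that step, using the hypothetical helper name similarity_to_percent, looks like this:

```python
def similarity_to_percent(similarity, low=0.15, high=0.40):
    """Rescale a cosine score from [low, high] to [0, 100], clamping outside values."""
    scaled = (similarity - low) / (high - low)
    return max(0.0, min(1.0, scaled)) * 100


if __name__ == "__main__":
    for score in (0.10, 0.15, 0.275, 0.40, 0.90):
        print(f"{score:.3f} -> {similarity_to_percent(score):.1f}%")
```

The low and high defaults are taken from the example code; you may want to adjust them after watching the scores your own camera and prompt produce.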

Replace the URL with the URL of your RTSP camera. In addition, replace the API_KEY value with your Roboflow API key. Learn how to retrieve your Roboflow API key.
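Rather than hard-coding the API key and camera URL, you may prefer to read them from environment variables and check them before starting the stream. This is a minimal sketch using only the standard library; the variable names ROBOFLOW_API_KEY and RTSP_URL and the helper load_stream_config are assumptions for illustration, not names that Inference requires.

```python
import os
from urllib.parse import urlparse


def load_stream_config(env=None):
    """Read the API key and RTSP URL from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    api_key = env.get("ROBOFLOW_API_KEY", "")  # hypothetical variable name
    rtsp_url = env.get("RTSP_URL", "")         # hypothetical variable name

    if not api_key:
        raise ValueError("ROBOFLOW_API_KEY is not set")

    parsed = urlparse(rtsp_url)
    if parsed.scheme not in ("rtsp", "rtsps") or not parsed.hostname:
        raise ValueError(f"not a valid RTSP URL: {rtsp_url!r}")

    return api_key, rtsp_url


if __name__ == "__main__":
    # Example values only; 192.0.2.10 is a reserved documentation address.
    example_env = {
        "ROBOFLOW_API_KEY": "example-key",
        "RTSP_URL": "rtsp://192.0.2.10:554/stream1",
    }
    print(load_stream_config(example_env))
```

You could then assign the returned values to API_KEY and to the stream source in the example code.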

Step #3: Test the Stream

Now that you have configured your model and streaming interface, you can test the stream. To do so, run your Python program.

Now you have all you need to start using CLIP with an RTSP stream.

Next steps

supervision provides an extensive range of functionalities for working with computer vision models. With supervision, you can:

1. Process and filter detections and segmentation masks from a range of popular models (YOLOv5, Ultralytics YOLOv8, MMDetection, and more).
2. Display predictions (e.g. bounding boxes, segmentation masks).
3. Annotate images (e.g. trace predictions, draw heatmaps).
4. Compute confusion matrices.

And more! To learn about the full range of functionality in supervision, check out the supervision documentation.

Learn how to use RTSP streams with other models

Below, you can find our guides on how to use streams with other computer vision models.