SAM-CLIP segments objects in an image from both visual and textual inputs, making it well suited to tasks that require precise object identification and separation. Because it incorporates CLIP’s zero-shot capabilities, the model can classify and segment objects it has not explicitly been trained on. SAM-CLIP also supports semantic segmentation, labeling regions of an image according to their semantic meaning.
Use Grounding DINO, Segment Anything, and CLIP to label objects in images.
Below is an image with segmentation masks over all of the McDonald's logos it contains.
This demo was created by sending the prompt `logo` to Grounding DINO and SAM, then classifying each prediction using CLIP with two prompts: `McDonalds` and `Burger King`.
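
The sketch below shows one way to wire this pipeline together, assuming the reference `groundingdino`, `segment-anything`, and OpenAI `clip` packages are installed. The checkpoint and config filenames, the image path, and the detection thresholds are placeholders, not the exact values used in the demo.

```python
# Sketch of the Grounding DINO -> SAM -> CLIP pipeline described above.
# Checkpoint/config paths and "image.jpg" are placeholder assumptions.
import torch
from PIL import Image
from torchvision.ops import box_convert
from groundingdino.util.inference import load_model, load_image, predict
from segment_anything import sam_model_registry, SamPredictor
import clip

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Detect candidate regions with Grounding DINO using the prompt "logo".
dino = load_model(
    "GroundingDINO_SwinT_OGC.py",          # assumed config path
    "groundingdino_swint_ogc.pth",         # assumed checkpoint path
    device=DEVICE,
)
image_source, image = load_image("image.jpg")
boxes, logits, phrases = predict(
    model=dino, image=image, caption="logo",
    box_threshold=0.35, text_threshold=0.25, device=DEVICE,
)

# Grounding DINO returns normalized cxcywh boxes; convert to pixel xyxy.
h, w, _ = image_source.shape
boxes_xyxy = box_convert(
    boxes * torch.tensor([w, h, w, h]), in_fmt="cxcywh", out_fmt="xyxy"
)

# 2. Turn each detected box into a segmentation mask with SAM.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(DEVICE)
predictor = SamPredictor(sam)
predictor.set_image(image_source)

# 3. Classify each detection with CLIP against the two candidate prompts.
labels = ["McDonalds", "Burger King"]
clip_model, preprocess = clip.load("ViT-B/32", device=DEVICE)
text = clip.tokenize(labels).to(DEVICE)

for box in boxes_xyxy:
    # One mask per box; multimask_output=False keeps the single best mask.
    masks, _, _ = predictor.predict(box=box.numpy(), multimask_output=False)

    # Crop the box region and score it against both text prompts.
    x0, y0, x1, y1 = box.int().tolist()
    crop = preprocess(Image.fromarray(image_source[y0:y1, x0:x1]))
    with torch.no_grad():
        logits_per_image, _ = clip_model(crop.unsqueeze(0).to(DEVICE), text)
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()[0]

    print(labels[int(probs.argmax())], probs)
```

Running a script like this on the image above would print one label per detected logo; masks for predictions classified as `McDonalds` can then be drawn onto the image.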