Multimodal vision models allow you to interact with images and information in a different modality (i.e. text). Some multimodal vision models support asking questions about images; others support comparing the similarity of images to text, useful in classification.