Vision-language models are pre-trained on image-text pairs, which enables zero-shot predictions on visual recognition tasks. They can also be applied to multimodal tasks such as visual question answering, visual captioning, and image tagging.
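
As a minimal sketch of what zero-shot recognition with such a model looks like, the snippet below uses a CLIP-style model via Hugging Face Transformers; the checkpoint name, image path, and candidate labels are illustrative assumptions, not specifics from this text.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; any CLIP-style image-text model works similarly.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical local image
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

# Encode the image and the candidate text labels jointly.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores, turned into a probability over the labels.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

Because the label set is supplied as free-form text at inference time, the same pre-trained model can classify against categories it was never explicitly trained on, which is what makes the predictions zero-shot.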