Use the widget below to experiment with 4M. You can detect COCO classes such as people, vehicles, animals, and household items.
Apple's 4M model (Massively Multimodal Masked Modeling) brings multimodal learning to computer vision. Released in December 2023, 4M accepts and produces a variety of modalities, including text, images, geometric data, semantic data, and more. By training across these modalities, 4M becomes a foundational vision and multimodal model that can handle a wide range of vision tasks such as object detection, caption generation, in-painting, and out-painting.
The 4M model was primarily trained on Conceptual Captions 12M (CC12M), which contains 12 million image-caption pairs.
Based on ImageNet-1K (Top-1 accuracy), COCO AP (object detection accuracy), mIoU (semantic segmentation accuracy), and NYU Depth (depth estimation accuracy), the 4M-L model outperforms the compared models in every category except ImageNet-1K, where DeiT III L scores higher.
4M is capable of both generative tasks and object detection. Given a prompt to change the color of a wall (bottom right corner), Apple's 4M model can color the wall blue. Additionally, 4M is easily fine-tunable, which can extend its capabilities in the near future.
4M is licensed under an Apache 2.0 license.
You can use Roboflow Inference to deploy a 4M API on your hardware. You can deploy the model on CPU devices (e.g., Raspberry Pi, AI PCs) and GPU devices (e.g., NVIDIA Jetson, NVIDIA T4).
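As a rough sketch of what that workflow can look like, the Python snippet below uses the Roboflow Inference SDK to load a model and run a prediction. The model identifier shown is a placeholder assumption, not a real model ID; replace it with the identifier from your own deployment, following the instructions below.

```python
# Minimal sketch using the Roboflow Inference SDK (pip install inference).
# NOTE: "your-workspace/your-4m-model/1" is a placeholder assumption, not a
# real model ID; substitute the identifier from your own Roboflow deployment.
from inference import get_model

model = get_model(model_id="your-workspace/your-4m-model/1")

# Run inference on a local image and inspect the raw predictions.
results = model.infer("example.jpg")
print(results)
```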
Below are instructions on how to deploy your own model API.