Edge 416: Inside Apple's 4M-21 Model that Could be the Foundation of its On-Device Multimodal Experience
The model was trained simultaneously across 21 different modalities.
Apple has been late to the generative AI game, but lately it has been pushing its research agenda quite hard. Apple has an ideal playground for innovating in one of the hottest areas of the next wave of generative AI: on-device multimodal models. Powering mobile AI through API integrations with massively large foundation models seems highly impractical and insecure, and Apple is in a unique position to offer alternatives to that paradigm. However, most of Apple’s efforts in small on-device models have been somewhat underwhelming.
This is starting to change.
Recently, Apple released what I consider its most impressive work in small, on-device foundation models with the publication and open source release of 4M-21, a multimodal model that works seamlessly across 21 modalities! The work clearly signals the path for Apple’s on-device model strategy, and the sheer number of modalities is quite striking. However, it builds on previous research that Apple published months ago with the release of its 4M model.
Let’s start there.