Edge 396: Inside Ferret-UI: One of Apple's First Attempts to Unlock Multimodal LLMs for Mobile Devices
The new model excels at mobile screen understanding.
The AI world anxiously waits to see what Apple is going to do in this space. Unlike other tech incumbents such as Microsoft, Google, and Meta, Apple has been relatively quiet when it comes to contributions to AI. A safe bet is that anything Apple does in the space will be tied to mobile applications, leveraging its iPhone/iPad distribution. Not surprisingly, every time Apple Research publishes a paper, it triggers a tremendous level of speculation, and that has certainly been the case with their recent work on mobile screen understanding.
One of the most interesting trends in autonomous agents is based on computer vision models that can infer actions from screens. Earlier this year, we were all amazed by Rabbit's Large Action Model demo at CES. Companies like Adept.ai have been pushing screen understanding as the right way to build autonomous agents. One of the areas in which this paradigm can undoubtedly have an impact is mobile apps. Recently, Apple entered this space by publishing a paper outlining Ferret-UI, a multimodal LLM optimized for mobile screen understanding.