One of the cognitive marvels of the human brain is its ability to simultaneously process information from different sensory inputs such as speech, touch, or vision in order to accomplish a specific task. From infancy, we learn to build representations of the world from many different modalities: objects we see, sounds we hear, verbal descriptions, and more. Recreating the ability to learn from different modalities simultaneously has long been a goal of ML, but most of those efforts remained constrained to research exercises. For decades, most supervised ML models have been highly optimized for a single representation of the information. That’s rapidly changing now. Multimodal ML is becoming a reality.
In the last two years, we have seen the emergence of multimodal ML models applied to real-world scenarios. Natural language and computer vision have proven a powerful combination with the release of models such as OpenAI’s DALL-E or NVIDIA’s GauGAN. This week, Meta AI Research released a new model that combines audio and visual inputs to improve speech recognition. The model uses self-supervision techniques to analyze lip movements from unlabeled videos. That idea would have sounded insane a handful of years ago. While there are still plenty of milestones to reach in individual deep learning modalities, multimodal learning is an essential step towards the goal of building general AI. Little by little, such steps are making it more and more real.
🔺🔻 TheSequence Scope is our Sunday free digest. To receive high-quality educational content about the most relevant concepts, research papers, and developments in the ML world every Tuesday and Thursday, please subscribe to TheSequence Edge 🔺🔻
🗓 Next week in TheSequence Edge:
Edge#155: we discuss A/B Testing for ML Models; we explore how Meta AI uses ML A/B testing to improve its news feed ranking; we overview W&B, one of the top ML experimentation platforms on the market.
Edge#156: we deep dive into the ML mechanisms that power recruiting recommendations at LinkedIn.
Now, let’s review the most important developments in the AI industry this week.
🔎 ML Research
Audio-Visual Models for Speech Recognition
Meta AI Research (FAIR) published a paper proposing a technique that uses both audio and vision to better understand speech →read more on the FAIR team blog
Researchers from DeepMind and Harvard University proposed Hidden Agenda, a two-team social deduction game used to help reinforcement learning agents develop cooperative mechanics →read more in the original research paper
Transformers and Semi-Supervised Learning for Video
Amazon Research published two papers about novel video intelligence techniques powered by transformers and self-supervised learning →read more on Amazon Research blog
Training Rescoring for Speech Recognition
Staying with Amazon Research: the tech giant published a paper proposing an NLU-based method for rescoring the training of speech recognition models used in the Alexa digital assistant →read more on Amazon Research blog
🤖 Cool AI Tech Releases
NVIDIA Canvas, the company’s generative art toolset, got a few updates this week →read more on NVIDIA blog
NVIDIA Omniverse is a newly announced studio for creating virtual worlds →read more on NVIDIA blog
🛠 Real World ML
Noted computer scientist Chip Huyen published a post detailing common challenges and solutions in real-time ML systems →read more in Huyen’s original post
🐦 Follow us on Twitter where we share all our recommendations in bite-sized form
💸 Money in AI