⚪️ ⚫️ Edge#56: DeepMind’s MuZero that Mastered Go, Chess, Shogi and Atari Without Knowing the Rules
MuZero is likely to become a foundational block of a new set of deep learning models with short- and long-term planning capabilities
What’s New in AI is a deep dive into one of the freshest research papers or technology frameworks worth your attention. Our goal is to keep you up to date with new developments in AI in a way that complements the concepts we are debating in other editions of our newsletter.
If you care for what we do, help us spread the word
💥 What’s New in AI: How DeepMind’s MuZero Mastered Go, Chess, Shogi and Atari Without Knowing the Rules
“I’ve seen the future of artificial intelligence (AI) and it’s called MuZero”. Those were the words used by one of my mentors when he read the first preliminary research paper about MuZero, published by DeepMind in 2019. After all, a single deep learning model that can master games like Atari, Go, Chess or Shogi without even knowing the rules seems like something out of a sci-fi book. Well, that’s the essence of MuZero as described by DeepMind in a new research paper published in Nature a few weeks ago.
MuZero represents a major evolution in the use of reinforcement learning algorithms for long-term planning. Models like AlphaGo broke ground by combining reinforcement learning with tree-search algorithms to master games like Go. That model was extended by AlphaGo Zero, which learned Go purely from self-play without any human data, and then by AlphaZero, which mastered not only Go but other perfect-information games like Chess and Shogi. MuZero extends AlphaZero with a learned model and robust search capabilities, allowing it to also master visually complex environments like Atari, where classical planning methods struggle.
Image credit: DeepMind
Conceptually, MuZero presents a solution to one of the toughest challenges in the deep learning space: planning. Since the early days of machine learning, researchers have looked for techniques that can both effectively learn a model of an environment and plan the best course of action within it. Think about a self-driving car or a stock market scenario in which the rules of the environment are constantly changing. Such environments have typically proven incredibly challenging for planning with deep learning models. At a high level, most efforts related to planning in deep neural networks fit into the following categories:
Lookahead Search Systems: Systems of this type rely on knowledge of the environment’s rules or dynamics in order to plan. AlphaZero is a prominent example of models in this group. However, lookahead search struggles when applied to messy, real-world environments whose dynamics are unknown or hard to specify.
Model-Based Systems: Systems of this type try to learn an accurate model of the environment’s dynamics and then plan with it. Modeling every detail of a visually rich domain is extremely hard, however, and in benchmarks like Atari the strongest results had come from model-free agents such as Agent57, which forgo planning altogether.
MuZero combines ideas from both approaches but relies on an incredibly simple principle. Instead of trying to model the entire environment, MuZero focuses solely on the aspects that matter most for making useful planning decisions. Specifically, MuZero decomposes the problem into three elements critical to planning:
The value: how good is the current position?
The policy: which action is the best to take?
The reward: how good was the last action?
For instance, given a position in a game, MuZero uses a representation function h to map the observation to a hidden state, the embedding the model actually plans with. From that hidden state, candidate actions are imagined with a dynamics function g, which predicts the next hidden state and reward, while a prediction function f estimates the policy and value at each imagined state.
Image credit: DeepMind
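To make these three functions concrete, here is a minimal sketch in PyTorch. It is an illustrative assumption rather than DeepMind’s implementation: the class name, layer sizes, and the use of simple linear layers are made up for readability (the real system uses deep convolutional and residual networks).

```python
import torch
import torch.nn as nn

class MuZeroNetworks(nn.Module):
    """Toy sketch of MuZero's three learned functions; all sizes are illustrative."""
    def __init__(self, obs_dim: int, num_actions: int, hidden_dim: int = 64):
        super().__init__()
        # h: representation function — maps an observation to a hidden state
        self.representation = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, hidden_dim)
        )
        # g: dynamics function — maps (hidden state, action) to (next hidden state, reward)
        self.dynamics_state = nn.Linear(hidden_dim + num_actions, hidden_dim)
        self.dynamics_reward = nn.Linear(hidden_dim + num_actions, 1)
        # f: prediction function — maps a hidden state to (policy logits, value)
        self.prediction_policy = nn.Linear(hidden_dim, num_actions)
        self.prediction_value = nn.Linear(hidden_dim, 1)

    def initial_inference(self, obs):
        """h followed by f: evaluates the root position of a search."""
        state = self.representation(obs)
        return state, self.prediction_policy(state), self.prediction_value(state)

    def recurrent_inference(self, state, action_one_hot):
        """g followed by f: evaluates one imagined step inside the search."""
        x = torch.cat([state, action_one_hot], dim=-1)
        next_state = torch.relu(self.dynamics_state(x))
        reward = self.dynamics_reward(x)
        return (next_state, reward,
                self.prediction_policy(next_state), self.prediction_value(next_state))
```

During a search, the model is queried once at the root (h followed by f) and then once per imagined action (g followed by f), so planning never needs to touch the real environment simulator.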
The collected experience is used to train the neural network. It is important to note that this experience includes not only the observations and rewards from the environment but also the results of the searches performed at each step.
Image credit: DeepMind
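One way to picture that experience is a buffer of finished games, where every step stores what was observed, what was played, what reward came back, and what the search concluded. The field and function names below are assumptions chosen for readability, not the actual data structures used by DeepMind.

```python
from collections import deque
from dataclasses import dataclass
from typing import List

@dataclass
class GameStep:
    observation: List[float]    # what the agent saw at this step
    action: int                 # action that was actually played
    reward: float               # reward returned by the environment
    search_policy: List[float]  # normalized visit counts from the tree search
    search_value: float         # value estimated at the root of the search

# A simple replay buffer of finished games. Training samples a position from a
# stored game and unrolls the learned model a few steps forward from there.
replay_buffer: deque = deque(maxlen=10_000)

def save_game(trajectory: List[GameStep]) -> None:
    replay_buffer.append(trajectory)
```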
The training process of MuZero is as creative as the model itself. The model is unrolled alongside the stored trajectory and trained to make three predictions at every step: the value function v predicts the sum of future observed rewards, the policy estimate p predicts the outcome of the earlier search, and the reward estimate r predicts the last observed reward.
Image credit: DeepMind
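Read as a loss function, the training step unrolls the model a few steps from a sampled position and matches each prediction against its target: value against the return built from stored rewards, policy against the stored search policy, and reward against the observed reward. This is a simplified sketch building on the toy networks above; the unroll length, the exact loss terms, and the equal weighting are assumptions (the paper also uses tricks such as value discretization and gradient scaling that are omitted here).

```python
import torch.nn.functional as F

def soft_cross_entropy(logits, target_probs):
    """Cross-entropy against a soft target distribution (the search policy)."""
    return -(target_probs * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

def muzero_loss(nets, obs, actions, target_values, target_policies, target_rewards):
    """Unroll the toy model len(actions) steps and sum the three prediction losses.

    obs:             [batch, obs_dim] observation at the sampled position
    actions:         list of [batch, num_actions] one-hot actions actually taken
    target_values:   list of [batch, 1] n-step returns built from stored rewards
    target_policies: list of [batch, num_actions] search visit-count distributions
    target_rewards:  list of [batch, 1] rewards observed in the environment
    """
    state, policy_logits, value = nets.initial_inference(obs)
    loss = F.mse_loss(value, target_values[0]) + soft_cross_entropy(policy_logits, target_policies[0])
    for k, action in enumerate(actions, start=1):
        state, reward, policy_logits, value = nets.recurrent_inference(state, action)
        loss = (loss
                + F.mse_loss(value, target_values[k])                     # value vs. observed return
                + soft_cross_entropy(policy_logits, target_policies[k])  # policy vs. search result
                + F.mse_loss(reward, target_rewards[k]))                 # reward vs. observed reward
    return loss
```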
The beauty of MuZero is that it achieves state-of-the-art performance in challenging planning tasks such as Go, Chess or Shogi, while also outperforming other reinforcement learning agents in visually complex tasks like Atari. Better planning capabilities not only helped MuZero conceive complex strategies but also allowed it to learn faster and reach better performance. MuZero is likely to become a foundational block of a new set of deep learning models with short- and long-term planning capabilities. From self-driving vehicles to economic planning, there are a large number of AI scenarios that can benefit from planning models like MuZero.
Additional Resources: More details can be found in the MuZero paper published in Nature, as well as in the 2019 preprint.
🧠 The Quiz
Every ten quizzes we reward two random people. Participate! The question is the following:
What is the core principle used by DeepMind’s MuZero in order to learn planning skills across diverse game environments?
Thank you. See you on Sunday 😉
TheSequence is a summary of groundbreaking ML research papers, engaging explanations of ML concepts, and exploration of new ML frameworks and platforms, followed by 65,000+ specialists from top AI labs and companies around the world.
5 minutes of your time, 3 times a week – you will steadily become knowledgeable about everything happening in the AI space.