The Sequence Chat: Daniel J. Mankowitz, DeepMind on Building AlphaDev to Discover New Computer Science Algorithms
One of the researchers behind DeepMind's groundbreaking model that discovered new sorting algorithms shares his insights about the experience.
👤 Quick bio
Tell us a bit about yourself: your background, current role, and how you got started in machine learning.
I was born and raised in Johannesburg, South Africa. I always had an interest in building things and solving problems, with a particular interest in computers and electronics. This is what led me to do my first degree in Electrical and Information Engineering at the University of the Witwatersrand in Johannesburg.
How I got started in Machine Learning:
In the early 2010s, Artificial Intelligence was not yet mainstream, especially in South Africa.
I found out about a Master's program in Machine Learning and Computer Vision at the University of Edinburgh and was determined to attend. I managed to get a scholarship and, through this program, discovered a whole new world of possibility with Artificial Intelligence. One course in particular, Reinforcement Learning, caught my interest, and this led me on a path to a PhD in hierarchical reinforcement learning at the Technion - Israel Institute of Technology.
Current role:
My current role is focused on applying core DeepMind technologies to real-world products and applications. This requires me to deeply understand product needs and requirements, have a thorough understanding of the capabilities of the research technologies, and use my experience to determine whether a particular research technology fits a particular product.
🛠 ML Work
You recently worked on AlphaDev, which reached a major milestone by discovering new sorting algorithms. Could you tell us more about the vision and inspiration behind the project?
Following our success incorporating MuZero into YouTube's video compression pipeline, we realized that we could apply these core RL technologies to real-world applications. We then began to brainstorm where we could take these technologies next. We live in an increasingly digital society with more and more applications, screens and so on, and this leads to an increase in energy demand and consumption. With Moore's law coming to an end and chips reaching their fundamental physical limits, we need new and innovative ways to optimize the computing stack. This led us to identifying fundamental algorithms (such as sorting and hashing) that are called trillions of times every day. If we could improve these algorithms, we could help optimize one important aspect of the stack: the software that runs on it.
One of the most fascinating aspects of AlphaDev is that it transforms the algorithm creation process into a single-player game. What were the design principles of the assembly game that made it so effective for this process?
One aspect is that AlphaDev built the algorithms in assembly language. This was a critical design decision that gave us added flexibility to find new optimizations, because we were operating at a much lower level (i.e., manipulating memory and CPU registers) than typical high-level programming languages such as C++.
Another aspect is that we optimized directly for actual measured latency (speed). This was no small feat: we needed to build a system that could measure sorting algorithms accurately on the order of nanoseconds. There are many sources of noise, so we designed a latency benchmarking server that takes thousands of measurements across multiple machines to generate a single latency value for a given algorithm. Measuring at this scale enabled us to accurately measure and optimize for latency.
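To give a feel for what that involves (this is not DeepMind's actual benchmarking code; the function names and the choice of aggregate are illustrative assumptions), a minimal C++ sketch of the "repeat many times, then reduce to one robust number" idea might look like this:

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Hypothetical sketch: time one candidate sorting routine over many runs and
// reduce the noisy samples to a single latency estimate. AlphaDev's real
// benchmarking server aggregates thousands of measurements across multiple
// machines; this only shows the general shape of such a measurement loop.
template <typename SortFn>
double EstimateLatencyNs(SortFn sort_fn, std::vector<int> input, int repetitions = 10000) {
  std::vector<double> samples;
  samples.reserve(repetitions);
  for (int i = 0; i < repetitions; ++i) {
    std::vector<int> data = input;  // fresh copy so every run sorts unsorted data
    auto start = std::chrono::steady_clock::now();
    sort_fn(data.data(), data.size());
    auto end = std::chrono::steady_clock::now();
    samples.push_back(std::chrono::duration<double, std::nano>(end - start).count());
  }
  // Use a low percentile rather than the mean to reduce the influence of
  // interrupts, cache effects and other noise (a common choice for latency
  // benchmarking, not necessarily the one used for AlphaDev).
  std::nth_element(samples.begin(), samples.begin() + samples.size() / 20, samples.end());
  return samples[samples.size() / 20];
}
```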
AlphaDev's "swap and copy moves" have been referenced as a signal of creativity equivalent to AlphaGo's famous "move 37". Walk us through the relevance of these instructions in the discovery of new algorithms.
When we first began the AlphaDev project, we decided to try to learn a sort3 algorithm (i.e., sorting 3 elements) from scratch. This means that AlphaDev began with an empty buffer and had to iteratively build an algorithm that can sort three elements from the ground up. This is incredibly challenging, as AlphaDev needed to explore a huge space of algorithms efficiently - there are more possible algorithms that can be built from scratch, even for sorting three elements, than there are atoms in the universe!
We visually analyzed the state-of-the-art assembly code for sort3 and couldn't find any way to improve it. We were convinced that AlphaDev could not improve this algorithm further and would probably learn to reproduce the human benchmark. However, to our surprise, it discovered a sort3 algorithm from scratch that had one fewer assembly instruction. We initially thought this was a mistake. However, upon further analysis, we realized that it had discovered what we called the AlphaDev swap move. This is a sequence of instructions that, when applied under specific conditions, saves an instruction. For the swap move, given three values A, B and C, it can be applied when B <= C (see the AlphaDev swap move section and Figures 3a, b and c in our Nature paper).
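To make the flavour of the swap move concrete: the discovery itself lives in branchless x86 assembly, but a simplified C++ rendering of a sort3 sorting network shows the kind of saving involved. Once B and C have been put in order, B <= C is guaranteed, so a later operation that would naively need all three values can get away with only two. The code below is an illustrative sketch of that idea, not the program from the paper:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Simplified illustration of the idea behind the AlphaDev swap move.
// The real optimization is expressed in branchless assembly (cmp/cmov
// sequences); here it is shown at the level of min/max operations.
void Sort3(int& A, int& B, int& C) {
  // Step 1: order B and C. After this line, B <= C always holds.
  if (B > C) std::swap(B, C);

  // Step 2: place the smallest element first.
  // A naive network would compute min(A, B, C) here, but because B <= C is
  // already guaranteed, min(A, B) gives the same result with one fewer
  // operation: this is the kind of saving the swap move exploits in assembly.
  int smallest = std::min(A, B);

  // Step 3: the middle element is max(A, B), bounded above by C.
  int middle = std::min(std::max(A, B), C);

  // Step 4: the largest element is the overall maximum.
  int largest = std::max(std::max(A, B), C);

  A = smallest;
  B = middle;
  C = largest;
  assert(A <= B && B <= C);
}
```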
AlphaDev discovered this same move in other sorting algorithms too, such as when it generated a sort5 algorithm from scratch. We had a similar experience with the AlphaDev copy move, which can be applied under the condition that, for four inputs <A, B, C, D>, we have D >= min(A, C) (see the AlphaDev copy move section and Figures 3d, e and f in our paper).
Another very interesting case is that of variable sort 4 (VarSort4). Here the input to the sorting algorithm is a vector of up to 4 elements. In this case, AlphaDev discovered, from scratch, VarSort4 algorithms up to 29 instructions shorter than the human benchmark, with fundamentally different structures! (See the image below from the Nature News and Views article: figure a is the human benchmark, figure b is the fastest AlphaDev-discovered VarSort4 algorithm.) You can also read the 'New variable sort algorithms' section in our paper.
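For readers unfamiliar with the setup, here is a specification-level C++ sketch of what a variable sort routine does (branch on the input length, then sort). It is neither the human benchmark nor AlphaDev's discovered program, both of which are expressed at the assembly level with very different internal structure:

```cpp
#include <utility>

// General shape of a "variable sort" routine: the caller passes a buffer of
// up to four elements plus its length, and the routine branches on the
// length. Each branch is a small compare-and-swap sorting network.
void VarSort4(int* v, int n) {
  switch (n) {
    case 0:
    case 1:
      return;  // already sorted
    case 2:
      if (v[0] > v[1]) std::swap(v[0], v[1]);
      return;
    case 3:
      // 3-element sorting network
      if (v[0] > v[1]) std::swap(v[0], v[1]);
      if (v[1] > v[2]) std::swap(v[1], v[2]);
      if (v[0] > v[1]) std::swap(v[0], v[1]);
      return;
    default:
      // n == 4: 5-comparator sorting network
      if (v[0] > v[1]) std::swap(v[0], v[1]);
      if (v[2] > v[3]) std::swap(v[2], v[3]);
      if (v[0] > v[2]) std::swap(v[0], v[2]);
      if (v[1] > v[3]) std::swap(v[1], v[3]);
      if (v[1] > v[2]) std::swap(v[1], v[2]);
      return;
  }
}
```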
As a result, we realized that there are optimizations to algorithms that we don't know about, or can't see, but that are still out there. Hopefully these optimizations can also influence how we construct other fundamental algorithms.
In addition to sorting, it seems that AlphaDev has also improved hashing methods used in data serialization. Describe the process and the results in this area.
Once we discovered faster sorting algorithms, we wanted to see whether AlphaDev could generalize to other fundamental computer science algorithms. Using the analogy of a librarian, a sorting algorithm is what a librarian would use to sort books from A to Z. Hashing, on the other hand, provides the librarian with a very efficient way to search for and retrieve a particular book: a book is assigned a unique identifier, which we refer to as a hash, and the librarian then knows exactly where to find the book instead of searching through the entire library. Hashing algorithms for data structures are called trillions of times every day and are used in applications for information storage and retrieval.
Hashing algorithms define their "correctness" in terms of the number of hashing collisions. AlphaDev learned an improved hashing algorithm that had a minimal number of collisions and was 30% faster than the current algorithm in Abseil. As such, it replaced the benchmark in Abseil and is now used by millions of developers, companies and applications around the world.
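To make "number of collisions" concrete, here is a small self-contained C++ sketch that counts how often keys collide in a fixed number of buckets. It uses std::hash purely for illustration; the AlphaDev improvement itself landed in Abseil's hashing code, which is not shown here:

```cpp
#include <cstddef>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

// Illustrative only: count bucket collisions for a hash function over a set
// of keys. A better hash spreads keys more evenly, so fewer buckets hold
// more than one key.
std::size_t CountCollisions(const std::vector<std::string>& keys, std::size_t num_buckets) {
  std::vector<std::size_t> bucket_load(num_buckets, 0);
  std::hash<std::string> hasher;
  for (const auto& key : keys) {
    ++bucket_load[hasher(key) % num_buckets];
  }
  std::size_t collisions = 0;
  for (std::size_t load : bucket_load) {
    if (load > 1) collisions += load - 1;  // every extra key in a bucket is a collision
  }
  return collisions;
}

int main() {
  std::vector<std::string> keys;
  for (int i = 0; i < 100000; ++i) keys.push_back("key-" + std::to_string(i));
  std::printf("collisions: %zu\n", CountCollisions(keys, 1 << 17));
  return 0;
}
```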
AlphaDev is based on DeepMind's AlphaZero model, which achieved superhuman performance in games like Go, chess, and shogi. What aspects of AlphaZero make it a strong foundation for models like AlphaDev?
AlphaZero leverages the power of search and value functions to efficiently explore incredibly large combinatorial problems. For context, discovering a faster sorting algorithm requires searching through a space of algorithms comparable in size to the number of possible games of chess and Go, and to the number of atoms in the universe. The value function gives AlphaDev a prediction of whether the final program will be fast and correct, and this is critical to determining how to efficiently explore this massive space of algorithms: AlphaDev's value function only needs to be trained on the actual measured latency of approximately 0.002% of the algorithms it generates (using its latency benchmarking server).
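As a rough illustration of how a cheap learned value estimate keeps expensive measurements rare, consider the following conceptual C++ sketch. The types and function names are hypothetical and only show the division of labour between value prediction and real latency measurement:

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical type: a Program is the sequence of assembly instructions
// built up by the agent during the single-player game.
using Program = std::vector<std::string>;

// Conceptual sketch: the value function is a cheap learned predictor of
// "will this program end up correct and fast?", while the benchmarking
// server performs the expensive ground-truth latency measurement. Only a
// tiny fraction of generated programs ever reach the second step.
std::vector<Program> SelectForMeasurement(
    const std::vector<Program>& candidates,
    const std::function<double(const Program&)>& value_estimate,  // learned, cheap
    double threshold) {
  std::vector<Program> to_measure;
  for (const Program& p : candidates) {
    if (value_estimate(p) >= threshold) {
      to_measure.push_back(p);  // forward to the latency benchmarking server
    }
  }
  return to_measure;
}
```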
💥 Miscellaneous – a set of rapid-fire questions
What is your favorite area of AI research aside from reinforcement learning?
All areas of AI are fascinating and interesting to me. At the moment I am very interested in Reinforcement Learning from Human Feedback. This technology has shown the ability to better align Large Language Models (LLMs) with human preferences. Model alignment is critical to developing language models that provide value to humans, such as saving time and resources, while also ensuring safe and responsible responses.
Can the AlphaDev approach be generalized to any form of computer science algorithms? Are there any major limitations?
AlphaDev can be generalized to many more algorithms. Some examples include compression algorithms, both lossy (e.g., image compression) and lossless (e.g., file compression), cryptographic hashing, serialization, deserialization and so on.
Currently, AlphaDev does not support algorithms with loops, although this could be supported in future work. It may also struggle if you want AlphaDev to optimize a huge algorithm or codebase (e.g., an entire library). In that case, additional breakthroughs are necessary.
Months ago, DeepMind unveiled another algorithm discovery model called AlphaTensor. Is there any relationship between AlphaTensor and AlphaDev?
Yes, they both have AlphaZero as their base algorithm. The power of search and value functions is critical in both cases. However, they play very different single-player games, which have different properties and algorithmic requirements. For example, in the case of AlphaDev it was critical for us to optimize for actual measured latency.
A controversial question to end: Is algorithm discovery a valid variation of the Turing Test?
This is indeed a tough question :) I think definitions are very important, and I don't have a strong opinion as to whether this is a valid variation or not. What I can say is that we are taking a step toward optimizing the computing stack, which will hopefully enable us to make further progress with AI and help address the ever-growing computational bottleneck of our increasingly digital society.
This is very interesting. I work on the Expert Systems side of things, having (back in 1984) created an Expert System, named XTRAN, that knows more than 40 computer languages, including assemblers, 3GLs, 4GLs, DSLs, scripting languages, and database languages. Its rules language can be used to capture and therefore automate coding skills, including assessment/analysis, transformation/re-engineering, translation, and code creation, as well as automating the manipulation of data and text.
It's also interesting because I have been a software entrepreneur since 1977, and my first commercial product was a sorting utility for the DEC PDP-11 minicomputer. Its unique feature was a high-performance n-way merging algorithm created by a colleague of mine.
And it's interesting, thirdly, because back around 1970 I was handed the care and feeding of an early DBMS that was based on hashing. And earlier, in the mid-1960s, I was given the task of storing a large insurance policy master file (more than a million punched cards) on drum (!) storage for random access. As a result, I researched early hashing algorithms, with the idea of hashing on policy numbers. One thing I learned was that further complicating an already complex hashing algorithm typically worsens its collision behavior. I wonder if AlphaDev has figured that out. :)
Nowadays I'm using hashing to implement associative arrays in XTRAN. Some things never go out of style. :)