🎙 François Chollet: Keras, TensorFlow and New Ways to Measure Machine Intelligence

There is nothing more inspiring than learning from practitioners. Getting to know the experience gained by researchers, engineers, and entrepreneurs doing real ML work can be a great source of insights and inspiration.

Share this interview if you find it enriching. No subscription is needed.


👤 Quick bio / François Chollet

Tell us a bit about yourself: your background, your current role, and how you got started in machine learning.

François Chollet (FC): I work on the Keras team at Google. The goal of my team is to build the tools that power the workflows of machine learning engineers, mainly at Google and other Alphabet companies (Waymo and YouTube are big Keras users), but also outside of Google, since Keras is an open-source project.

As for how I got into AI – initially, as a teenager, I had a kind of philosophical interest in how the mind worked and the nature of consciousness. Naturally, I started reading up on neuropsychology, which intuitively seemed like it should be the scientific field that could answer these questions. Unfortunately, it quickly appeared that neuropsychology could not answer much and only amounted to a vast collection of relatively superficial observations. So I redirected my attention to artificial intelligence – the idea being to understand the mind by attempting to create a model of it from the ground up. But back then, AI had rather little connection to human intelligence or the mind – it was a subfield of computer science focused on algorithms like A-star or SVMs. That was a bit of a disappointment. I ended up converging towards cognitive developmental robotics, an academic niche that looks into computer models of the early stages of human cognitive development. Algorithms that could learn like children do – that seemed like a very promising idea. After graduation, I started broadening my interest to more practical applications and eventually ended up working on making the shovels and pickaxes of the new AI gold rush.

🛠 ML Work  

You are famous within the AI community for spearheading the Keras framework, which simplifies the implementation of machine learning programs. Since then, TensorFlow 2.0 has taken steps to abstract many of the building blocks for machine learning applications. In your mind, how much simpler can machine learning programming get, and what do we need to get there?

FC: There's a lot that remains to be done. Machine learning still has a long way to go in terms of simplification and automation.

To start with, model building and training can be further simplified and automated. Right now, Keras already abstracts away a lot of tedium, but it still requires you to make many decisions on your own – what kind of model architecture or loss function should you use, and so on. I believe hyperparameter tuning, and beyond, AutoML, will take an increasingly important role in the future, replacing manual configuration-fiddling with an automated approach.
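To make the idea concrete, here is a minimal sketch of what automated configuration search boils down to: plain-Python random search over a toy hyperparameter space. Tools like KerasTuner do this (and much more) against real training runs; the objective function and search space below are invented purely for illustration.

```python
import math
import random

def random_search(objective, space, n_trials=100, seed=0):
    """Sample configurations at random; keep the best-scoring one."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {name: rng.choice(values) for name, values in space.items()}
        score = objective(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Stand-in for "train a model, return validation accuracy": a synthetic
# landscape that peaks at lr=1e-3, units=64 (both made up for the demo).
def fake_validation_score(cfg):
    lr_term = -abs(math.log10(cfg["lr"]) + 3)
    units_term = -abs(cfg["units"] - 64) / 64
    return lr_term + units_term

space = {"lr": [1e-1, 1e-2, 1e-3, 1e-4], "units": [16, 32, 64, 128]}
best_cfg, best_score = random_search(fake_validation_score, space)
print(best_cfg, best_score)
```

In a real tuner, each trial is an expensive training run, which is why smarter search strategies (Bayesian optimization, Hyperband) matter in practice.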

Beyond model building and training, where automation has a big role to play, there will always be a significant amount of incompressible difficulty in machine learning and data science. It comes from the need to deeply understand the problem you're solving and curate high-quality datasets to train models on. You can't automate away that sort of expertise, at least not today. So AutoML has its limits. But we can still improve end-to-end productivity for ML engineers and data scientists with a better framework. An ideal framework is one that minimizes time-to-solution, one that lets you focus on where you can provide the most value, such as data curation or feature engineering.

That would be a framework based on "progressive disclosure of complexity": it should let you automate everything you don't want to take care of while enabling you to dive deeper into any part of the workflow that you think is important – only exposing an incremental amount of complexity when you start customizing things. Let the framework get the tedium out of the way when it's convenient, but let the framework get out of the way when you want to step in. That's the general philosophy of Keras, and we're doubling down on that principle going forward.
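As a toy illustration of that principle (pure Python; none of this is actual Keras API): a high-level `fit()` that just works with defaults, a swappable `loss` for light customization, and a `train_step` you can call yourself when you want full control.

```python
def squared_error(pred, target):
    # Default loss: works out of the box for most users.
    return (pred - target) ** 2

def numeric_grad(f, w, eps=1e-6):
    """Central-difference gradient of a scalar function of w."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

def train_step(w, x, y, lr, loss):
    """Lowest level: one explicit gradient-descent step, full control."""
    return w - lr * numeric_grad(lambda wv: loss(wv * x, y), w)

def fit(data, epochs=200, lr=0.05, loss=squared_error):
    """Highest level: sensible defaults. Override `loss` to customize a
    little, or drop down to `train_step` to customize everything."""
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            w = train_step(w, x, y, lr, loss)
    return w

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]     # y = 2x
w_default = fit(data)                            # level 1: defaults just work
w_l1 = fit(data, loss=lambda p, t: abs(p - t))   # level 2: swap one piece
```

Each level exposes only the complexity the user asked for; nothing forces you to see `train_step` until you need it.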

I was fascinated by your paper ‘On the Measure of Intelligence’ and have written extensively about it. In that work, you challenge many of the established methods by which we measure knowledge in machine learning systems. What is intelligence according to you, and how can it be measured?  

FC: I'm glad you found it interesting! I define the general problem of intelligence like this: how can you leverage information from the past to successfully face novel and unexpected situations in the future? Let's look at evolution, the search process that came up with intelligent brains. All biological organisms have access to a certain amount of information, some embedded in their DNA, some extracted from their environment via experience, and they have to use this information to produce successful behavior, in evolutionary terms, throughout their lives. If the set of situations they had to face was mostly static and known in advance, it would be a simple problem: evolution would just figure out the correct behavior via random trial and error and hard-code it into their DNA. And in fact, that's exactly what happens for very basic organisms.

But in practice, every day in your life is different from every previous day and different from any day ever encountered by any of your evolutionary ancestors in the past. You need to be highly adaptable, not just at the species level but also at the individual level. You need to be able to face unknown and surprising situations – that only share abstract similarities with what you've seen previously – and improvise successful behavior on the fly. How do you do that? Intelligence. That's what intelligence was evolved to do. It's a highly adaptable, on-the-fly behavior generation engine. It's the natural successor to hard-coding behaviors via natural selection.

In this context, I define the degree of intelligence of a system as the efficiency with which the system turns the information it has at its disposal into effective behaviors – into skills. Instead of focusing on skill at a predefined, static task, like chess, Go, or Starcraft, this definition puts the focus on the ability to quickly learn to handle new tasks or adapt to a changing environment. If you have two agents that start out with the same information – the same priors – and go through the same experience curriculum, the agent that comes out of it able to handle a wider scope of potential future situations is the more intelligent agent.

Modern machine learning equates “intelligence” with known skills. What are the limitations of that definition, and how can it become a roadblock for the evolution of AI? 

FC: An effect you constantly see in systems design is the "shortcut rule": if you focus on optimizing one success metric, you will achieve your goal, but at the expense of everything in the system that wasn't covered by your success metric. You end up taking every available shortcut towards the goal. For instance, if you look at data science competitions on Kaggle, the driving metric is your leaderboard score. As a result, you end up with models that rank very high on the leaderboard but require a ridiculously high amount of computation and that are big piles of spaghetti code – because these aspects of a model were not taken into account at all. And such models cannot be used in production.

The shortcut effect is everywhere in AI. If you set "playing chess at a human level" as your goal, you will achieve that. But only that: your system will exhibit none of the cognitive abilities that humans use to play chess, and thus it won't generalize to any other task. Because it's easier to come up with A-star than it is to come up with the human brain. You will take the shortest path to your goal, and that shortcut is simply orthogonal to intelligence. And that's exactly what happened when we finally solved chess-playing with AI.

So far, the driving success metric of the field of AI has been to achieve specific skills, to solve specific tasks. Consequently, the field's history has been defined by a series of "successes" where we figured out how to solve these tasks without featuring any intelligence. It's always possible to solve a specific task without featuring intelligence, and that's always easier than solving the general problem of intelligence – so that's the shortcut you will take 100% of the time. If you fix the task, you remove the need to handle uncertainty and novelty, since you can provide an exact algorithmic description of the task, like for chess, or alternatively generate an infinity of training data, like for Starcraft. And since the nature of intelligence is the ability to handle novelty and uncertainty, you're effectively removing the need for intelligence.

Some people might say, "oh, you're moving the goalposts. If I could play chess at that level I would be considered very intelligent." That's anthropocentric thinking. Of course, if a human is good at chess, they are intelligent. You know that this human wasn't born knowing how to play chess, nor did they have access to billions of chess games and thousands of years of practice. They had to use the same innate priors as you had, and the same limited amount of time as you have, to learn how to master chess. So they've demonstrated that they can turn experience into skills with high efficiency – they've demonstrated intelligence. They could have learned to perform any other task in the exact same way because nothing about them was specialized for chess in particular – that's general intelligence. So in the case of a human, learning to master task X is proof of general intelligence. But in the case of a machine, performing the same task X is never proof of any level of intelligence at all. It's all completely intuitive if you think clearly about it.

You are the creator of the Abstraction and Reasoning Corpus (ARC) challenge, which I often see as an IQ test for machines (maybe a bad analogy). What is the ARC challenge, and how is it different from other "machine intelligence" tests?

FC: I think this is a pretty good analogy. It's a general intelligence test, very similar to actual IQ tests (such as Raven's Progressive Matrices), and it is meant to be approachable by both machines and humans.

Two things are fairly unique about ARC:

  • ARC seeks to measure generalization power by only testing you on tasks that you've never seen before. In theory, that means that ARC is a game you can't practice for: the tasks you will get tested on will have their own unique logic that you will have to understand on the fly. You can't just memorize specific strategies from past tasks.

  • ARC is interested in controlling for prior knowledge about the tasks. It's built upon the assumption that all test takers should start from the same set of knowledge priors, which we call "Core Knowledge priors", and which represent the "knowledge systems" that humans are born with. Unlike an IQ test, ARC tasks will never involve acquired knowledge, like knowing English sentences, for instance.
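To make this concrete: public ARC tasks are distributed as JSON, with a few "train" input/output grid pairs and a "test" input, where grids are small arrays of color indices 0–9. Below is a toy task in that shape, plus a deliberately tiny stand-in for a solver: it searches a hypothetical three-program hypothesis space for a program consistent with the train pairs. A real solver would have to synthesize programs over a vastly larger space, on the fly, for logic it has never seen.

```python
# A toy task in (roughly) ARC's JSON shape. The hidden rule of this
# particular task: reflect the grid left-to-right.
task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 4, 0]],      "output": [[0, 4, 3]]},
    ],
    "test": [{"input": [[5, 0, 0], [0, 6, 0]]}],
}

def identity(g): return [row[:] for row in g]
def mirror(g):   return [list(reversed(row)) for row in g]
def flip_ud(g):  return [row[:] for row in reversed(g)]

CANDIDATE_PROGRAMS = [identity, mirror, flip_ud]  # toy hypothesis space

def solve(task):
    """Return the test prediction of the first candidate program that is
    consistent with every train pair, or None if none fits."""
    for program in CANDIDATE_PROGRAMS:
        if all(program(p["input"]) == p["output"] for p in task["train"]):
            return program(task["test"][0]["input"])
    return None

print(solve(task))  # the mirror program fits both train pairs
```

The point of the sketch is the structure, not the solver: you only ever see a handful of demonstration pairs, and the rule must be inferred fresh for every task.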

One of the things that first confused and then fascinated me about ARC was the explicit acknowledgement of knowledge priors. Why are knowledge priors relevant to measuring intelligence in machine learning systems?

FC: You never approach a new problem entirely from scratch – you bring to it preexisting knowledge, called priors. That's certainly true of the AI models we create, but it's also true of newborn babies. We come into the world already knowing a lot about it, assuming a lot about it. If you want to measure the efficiency with which a system mines its past experience to produce skill, you must control for these priors. In the extreme case, a system could be already skillful at birth – like an insect or a chess-playing algorithm – and therefore it would not need to learn anything or adapt to anything. It would feature task-specific skill while featuring no intelligence.

If you want to control for prior knowledge when measuring the intelligence of a system, you need to standardize the set of priors you expect your system to have. And if you want to compare the intelligence of an AI system with human intelligence, the only realistic option is to standardize on innate human knowledge priors. That is what ARC does. What are those priors? It's a big question, and not yet one that we can fully answer. We call them "Core Knowledge", and they encompass things like a basic understanding of intuitive physics, objectness, simple counting, for instance. Professor Elizabeth Spelke has done a lot of research in this area, so if that's a topic you're interested in, I definitely recommend reading her work. 


💥 Miscellaneous – a set of rapid-fire questions  

Is the Turing Test still relevant? Is there a better alternative?

FC: The Turing test was never relevant. In fact, Turing himself presented his "test" as an argumentative device in a philosophical debate about whether machines could think – he never intended it to be a literal test of intelligence. There are two major issues with using it as one, as some folks have been doing since.

First, a big flaw is that it entirely abdicates the responsibility of defining intelligence and how to evaluate it, which is precisely the value of creating a test. Instead, it delegates the task of defining and evaluating intelligence to random human judges, who themselves don't have a proper definition or a proper evaluation process.

Second, and as a direct consequence: if you set "solving the Turing test" as your goal, you have no incentive to understand and create actual intelligence; rather, you are solely encouraged to figure out how to trick humans into believing your chatbot is intelligent. The bar you have to meet is subjective human perception rather than an objective, rigorous definition of what it means to be intelligent.

It's a test that encourages deception rather than progress.

Let's say you're a physicist and you're trying to come up with a test to check whether a "matter teleportation event" occurred. If your test was to ask an audience of witnesses about it, you'd be encouraging the development of prestidigitation tricks, not research in physics. This is the same. The incentives created by the test are all about deception, and that's perfectly reflected in the attempts to solve it so far, like the Eugene Goostman chatbot.

I think the true value of an intelligence test is that it should be actionable – it should light the way towards more intelligent systems and act as a cognitive device for researchers thinking about intelligence. It should help you make progress. And that's what I try to do with the "On the Measure of Intelligence" paper. I'm not trying to provide a gold standard for measuring intelligence; I'm trying to provide a stepping stone to help us better conceptualize the real challenges and make progress.

What are some practical milestones for the next decade of deep learning?  

FC: I mentioned AutoML earlier. I think making AutoML "work" – making it produce a lot of practical value for ML engineers and data scientists – is going to be a big milestone for our field in the next few years. In addition, I think the field of ML is moving towards increasing reuse of model-building components and pre-trained features, much like "traditional" software engineering before it. ML frameworks will have a huge role to play there.

There are a few more important trends as well, like increasingly specialized chips (ASICs), increasingly large models trained across increasingly many devices, and workflows moving away from local GPUs and into data centers in the cloud.

Favorite math paradox? 

FC: It's not exactly a paradox, but I've always found Gödel's incompleteness theorem quite thought-provoking.

Does P equal NP?

FC: No strong opinion here, but all empirical evidence so far seems to point towards P != NP. This actually has important implications for the ability of deep learning to help speed up combinatorial search problems: the approach will always come with certain limitations.