Joe Doliner, CEO of Pachyderm, on developing a canonical ML stack and the main challenges for mainstream developer adoption
It's so inspiring to learn from practitioners. The experience of researchers, engineers, and entrepreneurs doing real ML work is a great source of insight and inspiration. Please share this interview if you find it enriching. No subscription is needed.
You can also leave a comment or ask a question in the comment section below.
Quick bio / Joe Doliner
Tell us a bit about yourself: your background, your current role, and how you got started in machine learning.
Joe Doliner (JD): I'm the co-founder and CEO of Pachyderm, which provides open-source solutions to help machine learning teams tackle their biggest data management challenges. Previously, I was a software engineer at Airbnb on their data science infrastructure team, and before that I was the first engineer at RethinkDB. I consider myself an open-source aficionado to the core.
ML Work
Part of Pachyderm's vision is to develop a "canonical stack" for ML solutions. What are the key building blocks or categories of a canonical ML stack? How should teams decide between a single-stack and best-of-breed solutions?
JD: Pachyderm is one of the founding members of the AIIA¹, the AI Infrastructure Alliance, whose goal is to create a canonical stack for ML: a standard toolset that everyone uses as a first choice to design and create AI apps. We started it because we feel that to create a truly end-to-end ML platform, you need a collection of tools that work together seamlessly.
Think of the canonical stack as a LAMP stack for AI. The LAMP stack created an explosion of apps on the web, most notably WordPress, which became one of the most powerful and flexible web hosting frameworks in the world. Without the LAMP stack, developers can't move up the stack to build amazing applications. They need a powerful foundation to build on.
Right now, AI is mostly done at the big tech companies because they have the power to invest $500 million to roll their own software and the engineers to do it, with money left over to hire the top data scientists and researchers on the planet.
Enterprises and smaller companies aren't going to build their own stacks from the ground up. For AI apps to become as ubiquitous as the apps on your phone, you need a canonical stack that makes it easier for non-tech companies to build AI apps without having to roll their own infrastructure.
The graphic below shows a complete end-to-end stack for ML that works for almost every known use case in ML today. It's not a series of boxes that a logo fits cleanly into, but a stack where one company's solution flows across multiple boxes.
Check out the color-coded version of the diagram below to see where Pachyderm fits in:
You can think of Pachyderm as a massive data lake with versioning and lineage, as well as a framework-agnostic pipeline that's data-driven.
Data-driven means new data or changes in the data can automatically kick off pipeline steps, without you having to write a whole loop that constantly checks for changes to the data or handles the errors that arise while doing so.
Because our pipeline system understands data versions and diffs, as new data is added to the system it can easily kick off incremental processing of that data in the pipeline, processing only the changed data instead of everything from scratch again. That means you don't have to retrain on 200 terabytes of data because 10 GB just flowed into the system. We can train forward, keeping all the old weights of the model.
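To make that concrete, here is a minimal sketch of a Pachyderm pipeline spec, modeled on the "edges" example from the Pachyderm documentation (the repo, image, and script names are illustrative):

```json
{
  "pipeline": { "name": "edges" },
  "transform": {
    "image": "pachyderm/opencv",
    "cmd": ["python3", "/edges.py"]
  },
  "input": {
    "pfs": { "repo": "images", "glob": "/*" }
  }
}
```

The glob pattern splits the input repo into independent units of work (datums). When new files land in the images repo, only the datums that changed get scheduled, which is what makes the incremental behavior described above possible.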
We're often just lumped in as "data versioning," but it's so much more than that.
Our pipelining system can run any framework you can put in a container. While most pipelines can only run Python and maybe one other language, Pachyderm can run Python, R, C++, Java, Rust, Scala, Go, Bash, and anything else you can stuff in a container. If you want to run TensorFlow and PyTorch in different stages, and then two versions of Anaconda and that library you just found from Stanford to do some cutting-edge NLP, you can do it all.
The number of ML frameworks and platforms entering the market is overwhelming. Is the ML space too fragmented for such an early-stage market? In your opinion, which categories are here to stay, and which are likely to become features of larger platforms?
JD: We expect to see consolidation around the key steps in the ML lifecycle as they get more and more widely adopted. The software will expand from its original purpose to cover more and more of the lifecycle.
At its most basic, the ML lifecycle takes machine learning applications from idea to production. There can be some variance depending on the use case, but it typically boils down to the same major steps:
The collection, preparation, and labeling of data.
Experiments and exploration to determine the data and algorithms required for successful models.
Training and evaluation to determine if the selected models are successful for their applications.
Deployment to production and monitoring.
These aren't linear steps executed one by one. They are incredibly iterative, more of a living, breathing system that needs to be developed and built out over time. Think of them as an ML loop. Mastering and scaling these processes is how teams mature their machine learning practices.
Over time it's only natural for companies to build on their strengths and deliver a more comprehensive solution for each stage of the lifecycle. This type of Cambrian explosion is typical of any early market. Before Ford came to dominate car-making with interchangeable parts, there were almost 100 different bespoke car companies.
We see a number of areas that are likely to just fold into a larger solution. A "metadata store" appears in everyone's stack, but what is a metadata store? It's really just a database. We don't see much of a need for an external database that stores all the stages of a pipeline and what happens in that pipeline.
Pachyderm has a "metadata store" that stores all the information about the code, the models, and the data as they're all changing simultaneously. When we talk about data in this context, we don't just mean raw data. We mean all the intermediary data artifacts that are produced throughout this process: cleaned datasets, labeled data and annotations, model features, model artifacts, evaluation metrics, metadata. It's all just data, and it needs to be developed and managed and versioned and tracked and iterated on. Every product out there today has some kind of built-in database, and we don't see any reason it's likely to become a separate product.
Lastly, everyone today has their own pipelining system, and that won't last. We'll end up standardizing on a few ways to do data engineering-focused pipelines and data science-focused pipelines. We expect those types of pipelines to be agnostic in the way that Pachyderm's pipeline is agnostic. Any pipeline that requires the development team to support every language or framework is unscalable in the long run, in the same way that Yahoo couldn't have people go look at every website once the web really started to grow.
One of the areas in which Pachyderm excels is data lineage and versioning. What makes this such an important component of ML solutions, and what techniques can be used to address this challenge?
JD: As Andrew Ng says, "data is food for AI."
In hand-coded logic, the data is secondary. When a programmer designs hand-coded logic, they write all the rules and only touch the data briefly. If you write a website login script, you only touch the data once to get a username and password.
But with machine learning, the models learn the rules from the data.
Data is primary.
In a recent talk, Ng noted that most data scientists spend a lot of time tweaking the models and the hyperparameters of those models for very little result. They might get a 0.2% increase in accuracy playing around with the model. But something altogether different happens when they go back and tweak the data.
They might have 10 people doing labeling who all interpreted the instructions for drawing bounding boxes a little differently. Going back, refining those instructions, and having them re-label 30% of the images jumps performance up 20%. That's a major difference.
Your data is the hidden part of the iceberg. We've got a lot of tools to deal with the part above the water, but very few tools that deal with the data effectively, and that's a major problem because data is the lifeblood of ML.
As teams get bigger and bigger, you're going to need to keep track of shifting datasets for compliance purposes and to find which exact dataset trained a model. If you need to go back in time and find out why a model is exhibiting bias after new data got added, you need to be able to roll back to the exact point where you started training on the newly added data.
In a big data science team, different teams may need totally different versions of the data, formatted or changed in various ways to suit the model they're building. Eventually, keeping track of all those changes without a lineage system becomes impossible. Every model becomes a one-off that you can't recreate. That very quickly becomes a nightmare as models need to get updated and you need an unaltered version of the dataset to start retraining. Without that, your model will learn different rules.
It's not just lineage either. You need an immutable filesystem too. Immutability is not optional.
Without it, you can't guarantee that the data didn't change out from under you, which makes your experiment non-reproducible. If one data scientist distorts the original dataset, overwriting it with a new column in the text, or a different file size, or a different filter on the video, it irrevocably changes the output of the model.
Pachyderm couples lineage with a powerful copy-on-write filesystem that sits atop an object store, like Amazon S3. That keeps the data small: you get unlimited snapshots of the data automatically every time a change is made. Other systems make copies of the data over and over, but that won't scale. If you have a bunch of 1 GB video files and you apply a filter that changes 5 MB of one of them, why make a copy of the whole 1 GB again?
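Here is a toy sketch of the underlying idea, content-addressed chunk storage. This illustrates copy-on-write deduplication in general, not Pachyderm's actual storage engine; the chunk size and file sizes are made up:

```python
import hashlib, os

CHUNK = 8 * 1024 * 1024  # 8 MB chunks (illustrative size)
store = {}               # content-addressed store: hash -> chunk bytes

def snapshot(data: bytes) -> list[str]:
    """Record a version as a list of chunk hashes.
    Only chunks the store hasn't seen before consume new space."""
    manifest = []
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        h = hashlib.sha256(chunk).hexdigest()
        store.setdefault(h, chunk)  # no-op if the chunk already exists
        manifest.append(h)
    return manifest

original = os.urandom(64 * 1024 * 1024)        # a 64 MB "video file"
v1 = snapshot(original)

edited = os.urandom(CHUNK) + original[CHUNK:]  # change only the first 8 MB
v2 = snapshot(edited)

# Both versions are fully recoverable from their manifests,
# yet storage grew by just one chunk, not another 64 MB.
print(len(v1), len(v2), len(store))            # -> 8 8 9
```

The same principle applies at scale: every snapshot is cheap because unchanged chunks are shared between versions.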
Another area of focus of your work is CI/CD, which I find super tricky when it comes to ML pipelines. What are the key CI/CD differences between traditional software and ML workflows?
JD: The key difference with machine learning CI/CD is you have to keep track of the data, the models, and the code as they are all changing at the same time.
The orchestration engine needs to control all of them together, so it can keep track of how they all interrelate and interact with each other at any point in time.
Most of the CI/CD systems today were built to just deal with the code. The data was secondary. It was handled by another system. But with ML you need to keep track of the state of the data as well as the model and the code.
Those three states may diverge all at once or one at a time. Your data may change in one stage while your code remains the same. At another point your code may change along with the data and the model too.
With Pachyderm, every step in the machine learning loop is a transformation. We use JSON or YAML to define a transition from one state to another. It covers everything from data ingestion to model output at the end of the pipeline.
You might write a script to pull all the data from one external data source into the Pachyderm platform, and another script to pull from a different data source. Then you might rename all the files or clean them in another step. In another stage you may lean on a labeling engine like Label Studio, which we integrate with, followed by a series of new training jobs. Each step is part of that workflow and is easily defined in Pachyderm, so the data flows from one step to the next.
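As a rough sketch of how such a multi-step workflow chains together (hypothetical repo, image, and script names; each spec would normally live in its own file), a downstream pipeline simply names an upstream pipeline's output repo as its input:

```json
[
  {
    "pipeline": { "name": "clean" },
    "transform": {
      "image": "python:3.10",
      "cmd": ["python3", "/clean.py"]
    },
    "input": { "pfs": { "repo": "raw_data", "glob": "/*" } }
  },
  {
    "pipeline": { "name": "train" },
    "transform": {
      "image": "tensorflow/tensorflow:latest",
      "cmd": ["python3", "/train.py"]
    },
    "input": { "pfs": { "repo": "clean", "glob": "/" } }
  }
]
```

Because the train pipeline's input is the clean pipeline's output repo, a change anywhere upstream, whether new raw data or a new cleaning script, automatically propagates through the rest of the loop.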
How much simpler can ML development get in the next 3-5 years, and what are the main challenges that need to be addressed to trigger mainstream developer adoption?
JD: It can get a lot simpler, and it will, but don't forget that AI/ML is really young. AlexNet came out in 2012, less than a decade ago, and spurred the current wave of ML and GPUs to power deep learning.
AlexNet took it out of the research labs and into the enterprise. But what big companies like Google found is they couldn't just use the same tools they already had for traditional development. They needed new tools, and they had to start from scratch.
That's because AI/ML is a totally new branch on the software development tree. It's not just the AI/ML infrastructure software that's changing; it's the very way we create AIs that's changing at the same time. Reinforcement learning is taking off right now and will likely need tools we haven't fully conceived of yet as we do it at bigger and bigger scales. Innovation in a new space is not an overnight process.
Traditional software development was mostly baked after many, many years. We had a few innovations, like object-oriented programming and DevOps versus waterfall development, but those innovations were building on well-defined innovations before them. We're still figuring out what we need in ML. Feature stores barely existed a few years ago, and now we have multiple companies designing them. It takes lots of people working on a problem across the world to converge on a solution that works for everyone else and becomes standard.
As we figure out the basic primitives and design patterns in this space, we'll get more advanced and more automated, and that will let us go from putting a few models in production, which required a lot of hand-holding to get there, to putting thousands of models in production fast, in a highly automated way.
Do you like TheSequence? Consider subscribing to support our mission to simplify AI education, one newsletter at a time. You can also give TheSequence as a gift.
Miscellaneous: a set of rapid-fire questions
Favorite math paradox?
JD: Gödel's incompleteness theorem.
Is the Turing Test still relevant? Any clever alternatives?
JD: It's useful as a thought experiment, and it lays the groundwork for thinking about testing AI. The test itself is no longer that useful for testing modern ML.
Probably the best, updated alternative is the Abstraction and Reasoning Corpus (ARC), introduced by François Chollet in his paper "On the Measure of Intelligence." Also check out the Q&A with him here.
Abstraction is what separates humans from machines. We only need to get cut once by a knife to abstract the idea that sharp = danger. We don't need to see a million pictures of jagged rocks to know they will cut us too.
Any book you would recommend to aspiring data scientists?
JD: Deep Learning with Python by François Chollet is brilliant.
Does P equal NP?
JD: The fascinating thing about P = NP is that, unlike other prominent open problems, it doesn't seem like one outcome is more likely than the other. For many open questions, it seems very likely the theorem is true and the proof just eludes us.
It's also worth asking how much it matters whether it's true.
Do we need to know if every NP problem has a shortcut, or is it all right if we find shortcuts to many of those problems as they become important to humans?
We've already seen better algorithms to generate solutions to NP problems, so we don't have to brute-force through every solution. I think AlphaGo and AlphaGo Zero, as well as Google's new chip design software, show that given enough need, humans will find a shortcut.
So maybe that means P does equal NP.
I'll bet on human ingenuity to find a shortcut when it matters most, and I find it hard to believe that the most important problems don't have shortcuts that can help us approximate the best answer.
¹ TheSequence is a member of AIIA.