🎙 Dmitrii Evstiukhin/Provectus: "Four Horsemen of AI Project Failure and How to Deal with Them"

Oct 19, 2022

Getting to know the experience gained by researchers, engineers, and entrepreneurs doing real ML work is an excellent source of insight and inspiration. Share this interview if you like it. No subscription is needed.

👤 Quick bio / Dmitrii Evstiukhin

Tell us a bit about yourself. What is your background and current role, and how did you get started in machine learning?

Dmitrii Evstiukhin (DE): When I was around ten years old, I encountered computers for the first time. My first language was HTML, and my first product was a web page filled with the color green. I was proud of myself! After that, I kept heading in that direction. At some point, I realized that connecting things is much more fun for me than just creating web pages, so I went down the road of networking and system administration. After I laid down the infrastructure, the next logical step coincided with the DevOps revolution in the industry. I realized that connecting the two worlds of development and operations is even more important than connecting networks.

This notion of connection, integration, and continuity kept following me when I became an architect and started building platforms for web developers, analytics platforms for data engineers, and MLOps platforms for ML Engineers and Data Scientists. I approached the ML world from an operational perspective before getting neck-deep.

Currently, I’m the Director of Managed Services at Provectus. We provide Managed DevOps, Managed Data, and Managed AI services, all of which coincide with the sense of value flow that I got so long ago.

🛠 ML Work

ML technologies have evolved at an incredible pace in the last few years, and yet there are plenty of studies suggesting that most ML projects in the real world fail. What are the main causes of the mismatch between high quality technologies and the challenges of delivering end-to-end solutions?

DE: Well, it’s as simple as this:

1. Failure to frame the ML problem from a business challenge or opportunity perspective.
2. Failure to put the right talent in the right roles on the team.
3. Failure to have the right data and ML infrastructure.
4. Failure to properly manage the AI solution in production.

These are the four horsemen of AI project failure. Let’s take a closer look.

The ML problem can be framed based on unrealistic expectations, or on the urge to follow the trend without a business need or opportunity. Any ML problem has to be defined by a close collaboration between business leaders and experienced engineers. Otherwise, one side or the other may be overlooked. You see, it all starts by defining the problem that AI is supposed to solve, but it does not end there. Between these vast areas of work lies a chasm.

Taking step number two is just as challenging as step number one. Unfortunately, it’s not like riding a bicycle — you can’t just hop on and expect a smooth ride to your destination — and this is what most companies whose AI initiatives have failed don’t understand. When you have a problem to solve, it’s important to get the right talent. And here you face an entirely different challenge that becomes a vicious cycle — to recognize genuine expertise and skill, you already have to have them.

If you’re lucky enough to find the right people to build an AI solution, the next challenge awaits — how do you apply the talent? Where will they work? How will you find and access the data? What hardware and software should you use?

You may have used some flash drives to retrieve the data, and some local GPUs to train the model. But now what? Where do you get your ROI? Well, you need to use your model, integrate it with your current business, and make it work on the scale of your business. This task requires yet another set of expertise and people.

We see companies stumbling on different steps of this ladder, and we help them pave the path to success. For example, we have seen several huge enterprises with hundreds of ML use cases on the shelf having issues because they stumbled on the third or the fourth steps. They failed in different ways: sometimes, it was a lack of team balance; other times, it was a technical failure — the wrong choice of technology or insufficient data quality.

To help our customers deal with one or all of these “horsemen” while keeping in mind all the benefits of AI and not focusing on the problems, we created an offering called “Managed AI.” Businesses can apply it to some of the steps or end-to-end, to solve their AI cases at hand.

MLOps is a highly fragmented space, and it can be overwhelming to keep up with all the frameworks and platforms in the space. What are the top five MLOps capabilities that any real-world ML infrastructure solution must have?

DE: You mentioned two critical parts of the problem in one question! There is indeed a sparse and fragmented continuum of ML-related solutions. The main reason is that, based on your organizational structure and your use case, you may need the platform to enable different capabilities. However, if I had to generalize, I would point to these five as the most common and essential:

Scalability. This is the most important and fundamental. Most organizations and use cases require it in one way or another. Occasional re-trainings may require a lot of resources once a week, for instance. Or real-time inference endpoints may need to scale up and down, to adjust to the end-users’ usage patterns. Long story short, if you have AI in production, you want it to be scalable.
Reproducibility. I’d call this capability a differentiator between good and bad platforms. If you can’t reproduce your experiment from a month ago, your platform won’t do a good job. Of course, it requires versioning of everything: data, ML code, pipeline configuration, infrastructure code, experiments, and more.
Integration. Your development environment should be as similar as possible to your production environment. You also need to have all the best practices in place. An additional abstraction layer can help with this; something like the Ray framework mentioned below.
Security. In pursuit of ML-specific features, platform engineers sometimes forget that security is also a critical capability of the platform. Security is something that should always be kept in mind.
Observability. This is key to success in the long run. You have to fully understand what’s going on with everything — your data, your model, infrastructure, code, users, and so on.
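The reproducibility point above ("versioning of everything") can be illustrated with a small sketch. This is not taken from the interview or from any particular platform. It assumes a hypothetical `RunManifest` record and `build_manifest` helper that hash the training data and combine that hash with a code version, pipeline configuration, and infrastructure version into a single fingerprint. Two runs with the same fingerprint used the same recorded inputs, and a change to any one of them produces a different fingerprint.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def sha256_bytes(payload: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical record of everything one experiment run depended on."""
    data_hash: str         # digest of the training data snapshot
    code_version: str      # e.g. a commit identifier for the ML code
    pipeline_config: dict  # hyperparameters and pipeline settings
    infra_version: str     # version label of the infrastructure code

    def fingerprint(self) -> str:
        # Serialize with sorted keys so key order does not change the hash.
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return sha256_bytes(canonical.encode("utf-8"))


def build_manifest(data: bytes, code_version: str,
                   pipeline_config: dict, infra_version: str) -> RunManifest:
    """Hash the raw data and bundle it with the other versioned inputs."""
    return RunManifest(sha256_bytes(data), code_version, pipeline_config, infra_version)


# Illustrative values only: same inputs, config keys given in a different order.
run_a = build_manifest(b"feature rows v1", "abc123", {"lr": 0.01, "epochs": 5}, "infra-v1")
run_b = build_manifest(b"feature rows v1", "abc123", {"epochs": 5, "lr": 0.01}, "infra-v1")
# Same code and config, but the data snapshot changed.
run_c = build_manifest(b"feature rows v2", "abc123", {"lr": 0.01, "epochs": 5}, "infra-v1")

print(run_a.fingerprint() == run_b.fingerprint())  # identical inputs match
print(run_a.fingerprint() == run_c.fingerprint())  # changed data does not
```

Storing such a fingerprint next to each experiment result is one simple way to check, a month later, whether a rerun really used the same data, code, configuration, and infrastructure. A real platform would typically rely on dedicated data and experiment versioning tools instead of a hand-rolled hash.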

As you mentioned, building a holistic solution that suits your case is a serious challenge that requires experience and expertise. In most cases, a more cost-effective approach to this task is buying a SaaS solution, which doesn’t solve the challenge entirely but makes it much more manageable. Or, use a Managed AI approach, where you just supply your model and sign your SLA, and your production-ready model is handled for you.

In recent years, many tech companies have published details of their internal ML architectures. Could you list some ML reference architectures and best practices that you have found particularly inspiring?

DE: Reference architectures are usually very specific to the company, its organizational structure, and its use case. But I could mention one exciting piece of software that Shopify used as a core component of its ML platform. I’m talking about the Ray framework — a prominent piece of technology that has recently gained significant traction within the AI/ML community.

One of the hardest challenges for ML teams is finding the right balance between Data Science and ML Engineering. Could you share your perspective on the division of responsibilities within a team, and the division between these two areas in ML pipelines?

DE: Data Science is all about gaining insights from data. Data Science specialists usually have some domain-specific knowledge, and their job is to help businesses make sense of their data. They can also see, for example, the predictive ability of data to kick-start AI initiatives. Once they have identified the essential features necessary to make predictions, they can hand it over to ML Engineers, who will build an AI solution based on this data. To avoid confusion, think of it like this: nurture Data Science expertise internally and consider involving Managed AI Services for the rest. I’ve looked more into this in one of my recent articles — check it out here!

From a technical standpoint, what are the key areas that need to improve for ML to experience mainstream adoption?

DE: The critical areas of improvement are all things data, all-around observability, and cost optimization of AI/ML technologies. Observability in particular, I would say, because better transparency can help non-technical people understand what’s going on in the system, hence reducing pushback and ensuring better integration of AI/ML teams and solutions.

💥 Miscellaneous – a set of rapid-fire questions

Favorite math paradox?

The self-reference paradox. This paradox is an exquisite reflection of the world’s complexity and ambiguity. Not to mention that this paradox gave rise to modern programming as we know it.

What book can you recommend to an aspiring ML engineer?

I recommend “Why We Make Mistakes: How We Look Without Seeing, Forget Things in Seconds, and Are All Pretty Sure We Are Way Above Average” by Joseph T. Hallinan. Just look at the title! This book will definitely help ML engineers understand certain things much better.

Is the Turing Test still relevant? Any clever alternatives?

The Turing test may still be relevant for determining if AI has reached human-like intelligence, though it is not as straightforward as it once was. With the rise of text-generating models, the tester has to shift focus from checking for coherent speech to probing integration abilities and the comprehension of multi-level hierarchical abstractions.

Most exciting area of deep learning research at the moment?

My personal favorite is Graph Neural Networks (GNNs). I believe that somewhere in there we will finally find the real General AI.

TheSequence
