📝 Guest post: You are probably doing MLOps at a reasonable scale. Embrace it
In TheSequence Guest Post, our partners explain the ML and AI challenges they help solve. In this article, neptune.ai discusses hyperscale and reasonable scale companies and how their needs for ML tooling differ.
You are probably doing MLOps at a reasonable scale. Embrace it.
Solving the right problem and creating a working model, while still crucial, are no longer enough. At more and more companies, ML needs to be deployed to production to show “real value for the business”.
Otherwise, your managers, or your managers’ managers, will start asking questions about the “ROI of our AI investment”. And that means trouble.
The good thing is, many teams, large and small, are past that point, and their models are doing something valuable for the business. The question becomes:
How do you actually deploy, maintain and operate those models in production?
The answer seems to be MLOps.
In 2021, so many teams were looking for tools and best practices around ML operations that MLOps became a big deal. Dozens of tools and startups were created. 2021 was even called “the year of MLOps”. Cool.
But what does it mean to have MLOps set up?
If you read through online resources, the list looks something like this:
reproducible and orchestrated pipelines,
alerts and monitoring,
versioned and traceable models,
auto-scalable model serving endpoints,
data versioning and data lineage,
feature stores,
and so much more.
But do you?
Do you really need those things, or is it just “standard industry best practice”?
Where do those “standard industry best practices” come from anyway?
Most of the good blog posts, whitepapers, conference talks, and tools are created by people from super-advanced, hyperscale companies. Companies like Google, Uber, and Airbnb, who have hundreds of people working on ML problems that serve trillions of requests a month.
That means most of the best practices you find are naturally biased toward hyperscale. But 99% of companies are not doing production ML at hyperscale.
Most companies are either not doing any production ML yet or do it at a reasonable scale, a term coined last year by Jacopo Tagliabue. Reasonable scale as in five ML people, ten models, millions of requests. Reasonable, demanding, but nothing crazy or hyperscale.
Ok, so the best practices are biased toward hyperscale. What is wrong with that?
The problem starts when a reasonable scale team goes with “standard industry best practice” and tries to build or buy a full-blown, hyperscale MLOps system.
Building hyperscale MLOps with the resources of a reasonable scale ML team just cannot work.
Hyperscale companies need everything. Reasonable scale companies need to solve the most important current challenges. They need to be smart and pragmatic about what they need right now.
The tricky part is telling your actual needs apart from potential, nice-to-have, future needs. With so many blog articles and conference talks out there, that is hard. Once you are clear about your reality, you are halfway there.
But there are examples of pragmatic companies achieving great results by embracing reasonable scale MLOps limitations:
Lemonade generates $100M+ in annual recurring revenue from ML models with just 2 ML engineers serving 20 data scientists.
Coveo leverages tools to deliver recommendation systems to thousands of companies with (almost) no ML infrastructure people.
Hypefactors runs NLP/CV data enrichment pipelines on the entire social media landscape with a team of just a few people.
You have probably never heard of them, but their problems and solutions are a lot closer to your use case than that Netflix blog post or Google whitepaper you have open in the other tab.
Ok, so say you want to do it right, what do you do?
One thing that is clear(ish) is that there are five main pillars of MLOps that you need to implement somehow:
Data ingestion (and optionally feature store)
Pipeline and orchestration
Model registry and experiment tracking
Model deployment and serving
Model monitoring
Each of those can be solved with a simple script or a full-blown solution depending on your needs.
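To make the “simple script” end of that spectrum concrete, here is a minimal sketch of the model registry pillar in plain Python. The `register_model` function and the `registry/` folder layout are made up for illustration, not a prescription:

```python
# A minimal "simple script" take on the model registry pillar:
# pickle the artifact and keep a JSON metadata record per version.
# register_model and the registry/ layout are hypothetical.
import hashlib
import json
import pickle
import time
from pathlib import Path

REGISTRY = Path("registry")

def register_model(model, name: str, metrics: dict) -> str:
    """Save the model and record version metadata next to it."""
    REGISTRY.mkdir(exist_ok=True)
    blob = pickle.dumps(model)
    version = hashlib.sha256(blob).hexdigest()[:8]  # content-addressed version
    (REGISTRY / f"{name}-{version}.pkl").write_bytes(blob)
    meta = {"name": name, "version": version,
            "created_at": time.time(), "metrics": metrics}
    (REGISTRY / f"{name}-{version}.json").write_text(json.dumps(meta, indent=2))
    return version
```

A dozen lines like these can carry a small team surprisingly far before a dedicated registry pays off.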
The decision boils down to whether you want:
an end-to-end platform vs a stack of best-in-class point solutions
to buy vs build vs maintain open-source tools (or buy and build and maintain OSS).
The answer, as always, is “it depends”.
Some teams have a fairly standard ML use case and decide to buy an end-to-end ML platform.
By doing so, they get everything-MLOps out of the box, and they can focus on ML.
The problem is that the further away you go from the standard use case, the harder it gets to adjust the platform to your workflow. And everything looks simple and standard at the beginning. Then business needs change, requirements change, and it is not so simple anymore.
And then there is the pricing discussion. Can you justify spending “this much” on an end-to-end enterprise solution when all you really need is just 3 out of 10 components? Sometimes you can, and sometimes you cannot.
Because of all that, many teams stay away from end-to-end and decide to build a canonical MLOps stack from point solutions that solve just some parts very well.
Some of those solutions are in-house tools, some are open-source, some are third-party SaaS or on-prem tools.
Depending on their use case, they may have something as basic as bash scripts for most of their ML operations and get something more advanced for one area where they need it.
For example:
You port your models to native mobile apps. You probably don’t need model monitoring but may need advanced model packaging and deployment (see the packaging sketch after this list).
You have complex pipelines with many models working together. Then you probably need some advanced pipelining and orchestration.
You need to experiment heavily with various model architectures and parameters. You probably need a solid experiment tracking tool.
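To make the first scenario concrete, here is one hedged sketch of “advanced packaging, no monitoring”: exporting a trained PyTorch model to TorchScript so a native app can run it without a Python runtime. The toy `Net` module stands in for your real model:

```python
# Export a PyTorch model to TorchScript for a native mobile app.
# Net is a placeholder; swap in your trained model.
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(16, 2)

    def forward(self, x):
        return self.fc(x)

model = Net().eval()
example = torch.randn(1, 16)                # example input for tracing
scripted = torch.jit.trace(model, example)  # freeze the graph as TorchScript
scripted.save("model.pt")                   # the artifact your mobile app ships
```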
By pragmatically focusing on the problems you actually have right now, you don’t overengineer solutions for the future. You deploy the limited resources you have, as a team doing ML at a reasonable scale, into things that make a difference for your team and business.
Where do we at neptune.ai stand in all this?
We really believe in this pragmatic, reasonable scale MLOps approach.
It starts with the content we create, focusing on ML practitioners from reasonable scale companies solving their real-life problems, and it goes all the way to the product we built.
Regardless of how you solve the other components of your MLOps stack, we want you to use neptune to deal with experiment tracking and model registry problems.
For example:
You start with cron jobs for orchestration,
Then you need more control and integrate Airflow (a minimal DAG sketch follows this list),
Then you reach Kubernetes scale and want to use Kubeflow for that.
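For that middle stage, a minimal Airflow DAG may be all the “more control” you need. The `dag_id`, schedule, and `train.py` script below are all hypothetical:

```python
# A minimal Airflow DAG replacing a nightly cron entry.
# dag_id, schedule, and train.py are made-up examples.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nightly_retrain",
    start_date=datetime(2022, 1, 1),
    schedule_interval="0 2 * * *",  # same cadence the cron job had
    catchup=False,
) as dag:
    retrain = BashOperator(
        task_id="retrain_model",
        bash_command="python train.py",  # the script cron used to call
    )
```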
Great, neptune will work for you as you grow and change other components of your MLOps stack. That is, of course, if you actually need an experiment tracking tool today.
If you can live with spreadsheets or git? Go for it, be pragmatic :)
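And if a spreadsheet really is enough today, it can be as little as appending one row per run to a CSV. The `log_run` helper below is a made-up sketch, assuming every run logs the same fields:

```python
# "Spreadsheet" experiment tracking: one CSV row per run.
# log_run is hypothetical; it assumes all runs log the same fields.
import csv
import time
from pathlib import Path

LOG = Path("experiments.csv")

def log_run(params: dict, metrics: dict) -> None:
    row = {"timestamp": time.time(), **params, **metrics}
    new_file = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=row.keys())
        if new_file:
            writer.writeheader()
        writer.writerow(row)

log_run({"lr": 0.01, "epochs": 10}, {"val_acc": 0.93})
```

When that file stops being enough, that is your signal to reach for a dedicated tracking tool.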