📝 Guest Post: Guide to Building an ML Platform
In this guest post, Stephen Oladele, Developer Advocate and MLOps Technical Content Creator, together with neptune.ai, dives into the topic of building machine learning platforms and shares some best practices followed in the industry. Make sure to continue reading!
Machine learning (ML) platforms are increasingly seen as the solution to consolidating all the components of the ML model lifecycle, from experimentation to production.
These platforms not only provide your team with the tools and infrastructure they need to build and operate models at scale but also apply standard engineering and MLOps principles to all use cases.
However, there's a catch: understanding what makes a successful ML platform and building one is no easy task. With a plethora of tools, frameworks, practices, and technologies available, it can be overwhelming to know where to begin. This guide is designed to help you navigate through the process and understand the key factors that contribute to a successful machine learning platform.
What is a machine learning platform?
An ML platform standardizes the technology stack for your data team around best practices, reducing the incidental complexity of machine learning and better enabling data science teams across projects and workflows.
Why are you building an ML platform? We ask this during product demos, user and support calls, and on our MLOps Live podcast. Generally, people say they do MLOps to make the development and maintenance of production machine learning seamless and efficient.
ML platforms should make machine learning operations (MLOps) easier at every stage of a machine learning project’s life cycle, from prototyping to production at scale, even as the number of models in production grows from one or a few to tens, hundreds, or thousands of models that deliver value to the business.
An ML platform should be designed to:
orchestrate machine learning workflows,
be environment-agnostic (portable to multiple environments),
and work with different libraries and frameworks.
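To make those goals concrete, here is a minimal sketch of a workflow definition that stays orchestrator- and framework-agnostic. The `Step` and `Pipeline` names are hypothetical, not an existing library: the point is the shape of the abstraction, since nothing in the definition references a specific runtime or ML library.

```python
# Minimal sketch of an environment- and framework-agnostic workflow definition.
# `Step` and `Pipeline` are hypothetical names, not an existing library API.
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Step:
    name: str
    fn: Callable[..., Any]                      # any library can be used inside the step
    inputs: list[str] = field(default_factory=list)


@dataclass
class Pipeline:
    steps: list[Step]

    def run_locally(self) -> dict[str, Any]:
        """Execute steps in order; a real platform would hand the same
        definition to whichever orchestrator the environment provides."""
        outputs: dict[str, Any] = {}
        for step in self.steps:
            args = [outputs[name] for name in step.inputs]
            outputs[step.name] = step.fn(*args)
        return outputs


# Step bodies are free to use any framework (pandas, PyTorch, XGBoost, ...).
pipeline = Pipeline(steps=[
    Step("load_data", lambda: [1, 2, 3]),
    Step("train", lambda data: sum(data) / len(data), inputs=["load_data"]),
])
print(pipeline.run_locally())
```

Because the definition is just data plus plain callables, the same pipeline can be executed locally during prototyping or submitted to a managed scheduler in production without rewriting user code.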
But how do you actually do it?
MLOps best practices, learnings, and considerations from ML platform experts
We have distilled some of the best practices and learnings from ML platform teams into the following points.
Embrace iteration on your ML platform
Similar to any other software system, creating your ML platform shouldn't be a one-off task. As your business needs, infrastructure, teams, and workflows evolve, you should keep making changes to your ML platform.
Initially, you may not have a clear vision of what your ideal ML platform should look like. However, by building something that works and consistently improving it, you should be able to create a platform that supports your data scientists and provides business value.
Isaac Vidas, ML Platform Lead at Shopify, shared at Ray Summit 2022 that Shopify’s ML Platform had to go through three different iterations:
“Our ML platform has gone through three iterations in the past. The first iteration was built on an in-house PySpark solution. The second iteration was built as a wrapper around the Google AI Platform (Vertex AI), which ran as a managed service on Google Cloud.
We reevaluated our machine learning platform last year based on various requirements gathered from our users and data scientists, as well as business goals. We decided to build the third iteration on top of open source tools and technologies around our platform goals, with a focus on scalability, fast iterations, and flexibility.”
Take Airbnb, for example: they have built and iterated on their ML platform up to three times over the project’s lifetime. The platform should evolve as the number of use cases your team solves increases.
Be transparent to your users about true infrastructure costs
Another good idea is to make sure that all of your data scientists can see the cost estimate for every job they run in their workspace. This could help them learn how to manage costs better and use resources efficiently.
“We recently included cost estimations (in every user workspace). This means the user is very familiar with the amount of money it takes to run their jobs. We can also have an estimation for the maximum workspace age cost, because we know the amount of time the workspace will run…” — Isaac Vidas, ML Platform Lead at Shopify, at Ray Summit 2022
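A minimal sketch of how such an estimate could be surfaced, assuming a hypothetical rate table and a known maximum workspace age. The numbers and names below are illustrative, not Shopify’s actual implementation:

```python
# Hypothetical cost estimate shown in a user workspace; rates are illustrative.
HOURLY_RATES_USD = {
    "cpu-standard": 0.19,
    "gpu-t4": 0.95,
}


def estimate_workspace_cost(instance_type: str, max_age_hours: float, num_nodes: int = 1) -> float:
    """Upper-bound cost: the workspace cannot outlive max_age_hours."""
    return HOURLY_RATES_USD[instance_type] * max_age_hours * num_nodes


print(f"Estimated max cost: ${estimate_workspace_cost('gpu-t4', max_age_hours=8):.2f}")
```

Even a rough upper bound like this gives users an immediate sense of what their jobs cost and nudges them toward right-sizing resources.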
Documentation is important on and within your platform
Documentation is crucial for any software, including ML platforms. It should be intuitive and comprehensive to facilitate ease of use and adoption by your users.
To ensure clarity, explicitly call out which parts of the platform are not yet mature, and make it easy for users to tell errors caused by their own workflows apart from errors caused by the platform itself.
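One way to make that distinction explicit is to separate platform-owned failures from user-workflow failures at the point where jobs are executed. This is a sketch only; the exception names and the split between error types are hypothetical:

```python
# Sketch: separate platform-owned failures from user-workflow failures so error
# messages tell users who should act. Class names are hypothetical.
class PlatformError(Exception):
    """Failures the platform team owns (infrastructure, scheduling, storage)."""


class UserWorkflowError(Exception):
    """Failures in the user's own code or configuration."""


def run_user_job(entrypoint) -> None:
    try:
        entrypoint()
    except OSError as exc:
        # Assumed split for illustration: treat I/O, storage, and connectivity
        # issues as platform-side failures.
        raise PlatformError(
            "The job failed inside the platform; please contact the platform team."
        ) from exc
    except Exception as exc:
        raise UserWorkflowError(
            "The job failed inside your workflow code; check the traceback above."
        ) from exc
```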
Quick-start guides and easy-to-read how-tos can aid in the successful adoption of the platform. Within the platform, it should also be easy for users to document their workflows. For instance, adding a notes section to the interface for the experiment management component could benefit data scientists.
Documentation should start from the architecture and design phases, which enables you to:
Create complete design documents that explain all the moving parts of the platform and constraints specific to ML projects.
Perform regular architectural reviews to identify weak spots and ensure everyone is on the same page with the project.
Tooling and standardization are key
Standardizing workflows and tools on your platform can increase team efficiency, enable the use of the same workflows for multiple projects, simplify the development and deployment of ML services, and improve collaboration. Learn more from Uber Engineering’s former senior software engineer, Achal Shah.
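For illustration, one common form of standardization is a shared service contract that every project implements, so deployment and serving look the same across teams. The sketch below assumes that approach; `ModelService` and `ChurnModel` are hypothetical names, not Uber’s actual design:

```python
# Sketch of a standardized service contract every project implements, so the
# platform can deploy and serve any model the same way. Names are hypothetical.
from abc import ABC, abstractmethod
from typing import Any


class ModelService(ABC):
    """Every ML service on the platform ships this same interface."""

    @abstractmethod
    def load(self, artifact_uri: str) -> None:
        """Load model weights/artifacts from a storage URI."""

    @abstractmethod
    def predict(self, payload: dict[str, Any]) -> dict[str, Any]:
        """Score a single request; the platform wraps this in a shared serving layer."""


class ChurnModel(ModelService):
    def load(self, artifact_uri: str) -> None:
        self.threshold = 0.5  # stand-in for deserializing a real model artifact

    def predict(self, payload: dict[str, Any]) -> dict[str, Any]:
        score = min(1.0, payload.get("sessions", 0) / 100)
        return {"churn": score > self.threshold, "score": score}


svc = ChurnModel()
svc.load("s3://models/churn/v3")  # hypothetical artifact URI
print(svc.predict({"sessions": 80}))
```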
Be tool agnostic
Be tool-agnostic to facilitate faster adoption and cross-functional team collaboration. Integrating your platform with the organization's existing stack means users don't have to learn an entirely new toolchain just to be productive; forcing them to start from scratch is bound to be a lost cause.
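One pragmatic way to stay tool-agnostic is to have user code depend on a thin interface and plug the organization’s existing tools in behind it. A minimal sketch, where the `Tracker` protocol and the adapter are hypothetical rather than a specific vendor’s API:

```python
# Sketch: user code depends on a thin Tracker interface; adapters plug in
# whichever experiment tracker the organization already uses.
from typing import Protocol


class Tracker(Protocol):
    def log_metric(self, name: str, value: float) -> None: ...


class StdoutTracker:
    """Fallback adapter; real adapters would wrap the org's existing tracker."""

    def log_metric(self, name: str, value: float) -> None:
        print(f"{name}={value}")


def train(tracker: Tracker) -> None:
    for epoch, loss in enumerate([0.9, 0.5, 0.3]):
        tracker.log_metric("loss", loss)  # same call, any backend


train(StdoutTracker())
```

Swapping trackers then becomes a one-line change in platform configuration rather than a rewrite of every training script.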
Make your platform portable
Ensure that your platform is portable across infrastructures so that, even if it starts out on your organization's own infrastructure layer, you won't struggle to move it somewhere else later. Most open-source, end-to-end platforms are portable, and you can use their solutions or design principles as a guide when building your own.
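A small sketch of what portability can mean in practice: environment-specific details come from configuration rather than hard-coded assumptions about one cloud. The variable names below are hypothetical:

```python
# Sketch: environment-specific details come from configuration, never from
# hard-coded assumptions about one cloud or cluster. Names are hypothetical.
import os
from dataclasses import dataclass


@dataclass
class PlatformConfig:
    artifact_store: str   # e.g. s3://..., gs://..., or a local path
    scheduler_url: str    # whichever orchestrator the target environment runs

    @classmethod
    def from_env(cls) -> "PlatformConfig":
        return cls(
            artifact_store=os.environ.get("ML_ARTIFACT_STORE", "/tmp/artifacts"),
            scheduler_url=os.environ.get("ML_SCHEDULER_URL", "http://localhost:8080"),
        )


config = PlatformConfig.from_env()
print(config)
```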
What’s next?
These best practices are just the tip of the iceberg: they are only a small part of the full Guide to Building an ML Platform that you can find on Neptune's MLOps Blog. It's a huge resource that covers:
Components that make up an ML platform;
How to understand your users (data scientists, ML engineers, etc.);
Gathering requirements from your users;
Deciding the best approach to build or adopt ML platforms;
Choosing the perfect tool for your needs;
… And more.
There are also a ton of links to additional resources, like articles, podcasts, whitepapers, and more.