🔂 Edge#209: A New Series About ML Testing
In this issue:
we start a new series about ML testing;
we explore how Uber backtests time-series forecasting models at scale;
we discuss Deepchecks, an ML testing platform you should know about.
Enjoy the learning!
💡 ML Concept of the Day: A New Series About ML Testing
Testing is one of the most critical elements of the lifecycle of machine learning (ML) models and one that is starting to gain prominence in MLOps platforms. ML testing is also a widely covered topic in the research literature, with innovative new techniques being published regularly. Today, we are starting a new series that dives deeper into ML testing concepts and relevant research and provides an overview of the best technology stacks in this nascent space.
The essence of ML testing is to execute explicit checks that validate the behavior of an ML model. This approach contrasts with testing in traditional software applications. In a web or mobile application, developers write the logic and supply tests, in the form of data and expected outputs, to validate the system’s behavior. The cycle is inverted in ML: a test starts with the expected behavior and the corresponding dataset, and the model’s logic is the output.
Plenty of taxonomies can be used to organize ML testing techniques. A very general approach segments them into two main groups relative to the ML model lifecycle:
Pre-Train Tests: Designed to find problems early so the training workflow can be optimized.
Post-Train Tests: The most important type of test in ML, designed to check the behavior of trained models.
Typically, both types of tests should be incorporated into an MLOps pipeline, and tests should cover both code and data. Over the years, many ML testing techniques have been widely covered in research; examples include invariance tests, minimum functionality tests, directional tests, and many others. We will cover these in detail in the next few editions of this newsletter.
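To make the distinction concrete, here is a minimal sketch of two common post-train behavioral tests, an invariance test and a minimum functionality test, written as plain pytest-style functions. The predict() function is an invented stand-in for a real sentiment model, not part of any particular framework.

```python
# A minimal sketch of two common post-train behavioral tests. The predict()
# function is an invented stand-in for a real sentiment model; in practice you
# would call your own trained model's inference API here.

def predict(texts):
    # Trivial keyword rule used only so this sketch runs end to end;
    # replace with a real model's prediction call.
    return ["negative" if "not" in t.lower() else "positive" for t in texts]

def test_invariance_to_city_names():
    # Invariance test: changing an irrelevant detail (the city name)
    # should not change the predicted sentiment.
    original = "The ride to Boston was quick and pleasant."
    perturbed = "The ride to Denver was quick and pleasant."
    assert predict([original]) == predict([perturbed])

def test_minimum_functionality_negation():
    # Minimum functionality test: a trivially clear case the model must get right.
    assert predict(["The driver was not friendly at all."]) == ["negative"]

if __name__ == "__main__":
    test_invariance_to_city_names()
    test_minimum_functionality_negation()
    print("post-train behavioral tests passed")
```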
🔎 ML Research You Should Know: How Uber Backtests Time-Series Forecasting Models at Scale
In the blog post titled Building a Backtesting Service to Measure Model Performance at Uber-scale, the Uber engineering team discusses the architecture used to backtest time-series forecasting models at scale.
The objective: Provide insights about the current architecture and the path Uber followed to streamline the backtesting of time-series forecasting models.
Why is it so important: Backtesting is one of the most challenging aspects of time-series forecasting solutions in the real world.
Diving deeper: Time-series forecasting is a key component of Uber’s machine learning architecture. Across its various products, Uber runs thousands of time-series forecasting models in areas as diverse as ride planning and budget management. Ensuring the accuracy of those models is far from an easy endeavor. The number of models and the scale of computation make Uber’s environment impractical for most backtesting frameworks. Even Uber’s own backtesting frameworks, such as Omphalos, proved effective for specific use cases but unable to scale with Uber’s operations.
To address the limitations of previous efforts, Uber built a new backtesting service and methodology for its time-series forecasting scenarios. From a methodological standpoint, the transportation giant needed to consider elements such as the number of cities or the size of the testing window in order to backtest models efficiently. Models that worked well for one city didn’t necessarily perform well for another. Similarly, some models needed to be backtested in near real time, while others could afford larger windows. All things considered, Uber identified four key vectors relevant to backtesting forecast models:
number of backtesting windows
number of cities
number of model parameters
number of forecast models
Uber complemented that methodology with methods to partition datasets across different time series, as well as a specific metric to measure the accuracy of time-series forecasting models: the mean absolute percentage error (MAPE).
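For reference, MAPE is simply the mean of the absolute percentage errors between actual values and forecasts. Below is a short, self-contained sketch of how it can be computed; the sample numbers are illustrative only.

```python
import numpy as np

def mape(y_true, y_pred) -> float:
    """Mean absolute percentage error, expressed as a percentage.
    Assumes y_true contains no zeros (MAPE is undefined for zero actuals)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

# Example: actual vs. forecast demand over a 4-step backtesting window.
actual   = [120, 135, 150, 160]
forecast = [110, 140, 155, 150]
print(f"MAPE: {mape(actual, forecast):.2f}%")  # ≈ 5.41%
```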
From an architecture standpoint, the new backtesting service consists of a Python library and a service written in Go. The Python library acts as the client. Since many ML models at Uber are written in Python, it was an easy choice to use that language for the client, allowing users to seamlessly onboard, test, and iterate on their models. The Go service is implemented as a series of Cadence workflows. Cadence is an open-source orchestration engine written in Go and built by Uber to execute asynchronous, long-running business logic in a scalable and resilient way. At a high level, ML models are uploaded through Data Science Workbench. Backtesting requests on model data are submitted through the Python library, which relays them to the backtesting Go service. Once an error measurement is calculated, it is either stored in a datastore or immediately put to work by data science teams, who use these prediction errors to optimize ML models in training.
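Uber’s client library is internal and not public, but the workflow described above roughly maps to a request like the one in the hypothetical sketch below; every name in it (BacktestClient, BacktestWindow, submit_backtest, the service URL) is invented purely for illustration.

```python
from dataclasses import dataclass

# Hypothetical illustration only: Uber's backtesting client is internal, so all
# names below (BacktestClient, BacktestWindow, submit_backtest) are invented.

@dataclass
class BacktestWindow:
    train_end: str     # last date included in the training data
    horizon_days: int  # how far ahead the model must forecast

class BacktestClient:
    """Stand-in for the Python client that relays requests to the Go/Cadence service."""

    def __init__(self, service_url: str):
        self.service_url = service_url

    def submit_backtest(self, model_id: str, cities: list[str],
                        windows: list[BacktestWindow], metric: str = "mape") -> str:
        # The real service would fan out a Cadence workflow per
        # (city, window, parameter set) combination and return a request id.
        return f"backtest-{model_id}-{len(cities)}cities-{len(windows)}windows"

client = BacktestClient("https://backtesting.internal.example")
request_id = client.submit_backtest(
    model_id="rides-demand-v3",
    cities=["san_francisco", "sao_paulo"],
    windows=[BacktestWindow(train_end="2020-06-30", horizon_days=28)],
)
print(request_id)
```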
Uber has started to apply the new backtesting service across several time-series use cases such as financial forecasting and budget management. Beyond these initial applications, the new backtesting service could serve as a reference architecture for many organizations building large-scale time-series forecasting solutions.
🤖 ML Technology to Follow: Deepchecks is an ML Testing Platform You Should Know About
Why should I know about this: Deepchecks is one of the most complete ML testing stacks on the market.
What is it: The area of ML testing is still in a very nascent state, but we can already see a growing number of frameworks starting to build foundational capabilities for the space. Among those, Deepchecks has emerged as one of the most capable frameworks for testing and validating ML models.
Deepchecks is built around the concept of Checks: units of logic designed to validate the behavior of an ML model or detect a specific issue. The criterion a Check’s result must satisfy is called a Condition, and a group of Checks is called a Suite. Essentially, Deepchecks executes Suites across different artifacts of an ML solution, such as the training dataset, the model, and its hyperparameters.
For each stage, Deepchecks executes a number of tests, including data integrity checks, train-test validation, and model performance evaluation.
Deepchecks’ test results can be accessed from Jupyter notebooks or in environments such as Weights & Biases.
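As an illustration of how Checks, Conditions, and Suites come together, here is a minimal sketch using the deepchecks tabular API. The DataFrame, label column, and scikit-learn model are placeholders, and exact import paths may vary between library versions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from deepchecks.tabular import Dataset
from deepchecks.tabular.suites import full_suite

# Placeholder data: any labeled DataFrame works; 'target' is the label column.
df = pd.DataFrame({
    "feature_a": range(100),
    "feature_b": [i % 7 for i in range(100)],
    "target": [i % 2 for i in range(100)],
})
train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)

model = RandomForestClassifier(random_state=42).fit(
    train_df.drop(columns="target"), train_df["target"]
)

# Wrap the DataFrames so Deepchecks knows the label and categorical features.
train_ds = Dataset(train_df, label="target", cat_features=[])
test_ds = Dataset(test_df, label="target", cat_features=[])

# Run the built-in full suite: data integrity, train-test validation,
# and model evaluation Checks, each with its default Conditions.
result = full_suite().run(train_dataset=train_ds, test_dataset=test_ds, model=model)
result.show()                         # render inline in a Jupyter notebook
# result.save_as_html("report.html")  # or export a standalone report
```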
How can I use it: Deepchecks is open source and available at https://github.com/deepchecks/deepchecks