Rinat Gareev/Provectus About Data Quality and Enterprise ML Solutions
Learn about the building blocks of MLOps solutions and the best practices that companies should consider to enable data quality controls in ML pipelines
It's inspiring to learn from practitioners. Getting to know the experience gained by researchers, engineers, and entrepreneurs doing real ML work is an excellent source of insight and inspiration. Share this interview if you find it enriching. No subscription is needed.
Quick bio / Rinat Gareev
Tell us a bit about yourself: your background, your current role, and how you got started in machine learning?
Rinat Gareev (RG): I am a Senior ML Solution Architect at Provectus. I help build ML-powered solutions for different domains and at different stages, from discovery workshops to model operationalization and maintenance. I also lead the MLOps practice in the company: I systematize the experience of our ML engineers, ensure consistency between projects, and optimize the development of new solutions and products.
I have a Masterโs degree in Computer Science. My thesis was about ontology-based relation extraction from textual descriptions for items of cultural heritage, such as museum showpieces. This was the first time I looked into natural language processing (NLP) and machine learning.
After graduation, I worked for a few years as a backend developer but got bored and decided to try something else where my math background would be more useful. I went for a full-time junior research position in academia and dove into NLP and ML.
ML Work
Provectus focuses on helping enterprises implement ML solutions in the real world. Could you tell us about some important challenges of enterprise ML solutions at scale, and what are some common mistakes you typically see in these types of implementations?
RG: One frequent challenge is harmoniously combining established software engineering practices with the emerging lifecycle of ML projects (see diagram). Many enterprises started before ML was a thing. Many ML practitioners come from academia. Both sides have their own legacy and habits.
The first ML initiatives in an enterprise are often treated as wild cards, and this attitude may persist even after a series of successful ML applications. In practice, it means that ML tasks are solved in isolation: data is artificially aggregated for experiments; assumptions are implicit, undocumented, and untested; models are trained only towards better metric values, without considering how they will be served, integrated into products, and maintained. Data scientists perform dozens of different hypothesis tests on modeling or feature engineering, and in the middle of that it is easy to forget that data is not static and that not all required transformations can be reproduced in actual production.
These problems require changes in the business and engineering culture. To name a few examples: data scientists should think about how their models will be put into production; data engineers should remember that data is now also consumed by ML applications; DevOps should handle ML assets as first-class citizens, on a par with other code and artifacts. Facilitating these processes without disrupting the data science workflow is the essence of MLOps. There are many building blocks in MLOps solutions, and for each there are several options, open-source or from cloud vendors. Join us at our upcoming webinar, where we will discuss MLOps components and the most common implementation options.
Data quality management is an area where Provectus has been doing a lot of work recently. What are some of the best practices that companies should consider to enable data quality controls in ML pipelines?
RG: We recommend enabling them as early as possible. It is simpler to keep track of data quality from the beginning of development than to be surprised once the model is served in production and have to untangle all the data relationships in the pipelines and their hidden properties.
As you can see in the diagram above, there are quite a few spots in the pipeline where data quality checks make sense. The most crucial are usually where different roles and teams collaborate, such as data engineers, ML engineers, and model users. Establishing data quality checks at these spots can also serve as a contract between developers that will be automatically checked.
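As a rough illustration of such a contract, here is a minimal, library-free sketch that validates a features table at the hand-off point between data engineering and the ML team. The file path, column names, and constraints are hypothetical:

```python
import pandas as pd

# Hypothetical "contract" for the features table handed from data engineering
# to the ML team: expected columns, dtypes, and basic value constraints.
FEATURE_CONTRACT = {
    "user_id": {"dtype": "int64", "nullable": False},
    "age": {"dtype": "int64", "nullable": False, "min": 0, "max": 120},
    "country": {"dtype": "object", "nullable": True},
}

def validate_contract(df, contract):
    """Return a list of human-readable contract violations (empty list == OK)."""
    violations = []
    for col, rules in contract.items():
        if col not in df.columns:
            violations.append(f"missing column: {col}")
            continue
        if str(df[col].dtype) != rules["dtype"]:
            violations.append(f"{col}: expected {rules['dtype']}, got {df[col].dtype}")
        if not rules.get("nullable", True) and df[col].isna().any():
            violations.append(f"{col}: contains nulls")
        if "min" in rules and df[col].min() < rules["min"]:
            violations.append(f"{col}: values below {rules['min']}")
        if "max" in rules and df[col].max() > rules["max"]:
            violations.append(f"{col}: values above {rules['max']}")
    return violations

# Run at the hand-off point of the pipeline and fail fast on violations.
problems = validate_contract(pd.read_parquet("features.parquet"), FEATURE_CONTRACT)
assert not problems, "\n".join(problems)
```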
If the data structure is rich, manually writing even simple validation rules from scratch becomes tedious and error-prone. In such cases, we recommend using data profilers that generate summaries with suggested data checks. These summaries can be used as test suites immediately, or with some modifications after review by a data engineer. Optionally, they can be evaluated on a holdout subset of the data.
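To make the idea concrete, here is a minimal sketch of that profiling workflow: summarize a reference dataset into candidate rules, review them, and then apply them to new batches. The file paths are hypothetical, and real profilers produce far richer suggestions than this:

```python
import pandas as pd

def profile(df):
    """Summarize a reference dataset into candidate data-quality rules."""
    return {
        col: {
            "max_null_fraction": float(df[col].isna().mean()),
            "min": df[col].min() if pd.api.types.is_numeric_dtype(df[col]) else None,
            "max": df[col].max() if pd.api.types.is_numeric_dtype(df[col]) else None,
        }
        for col in df.columns
    }

def check_batch(batch, rules):
    """Apply the profiler-suggested rules to a new batch of data."""
    failures = []
    for col, r in rules.items():
        if batch[col].isna().mean() > r["max_null_fraction"]:
            failures.append(f"{col}: null fraction above profiled baseline")
        if r["min"] is not None and batch[col].min() < r["min"]:
            failures.append(f"{col}: value below profiled minimum")
        if r["max"] is not None and batch[col].max() > r["max"]:
            failures.append(f"{col}: value above profiled maximum")
    return failures

reference = pd.read_parquet("train_features.parquet")  # hypothetical path
rules = profile(reference)  # review/edit these before committing them as a test suite
print(check_batch(pd.read_parquet("new_batch.parquet"), rules))
```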
Data quality controls in ML solutions are not just about data. What is the relationship between data quality and model testing and optimization methods?
RG: Data quality itself is a concept with many aspects.
Once a model is trained, its evaluation metrics measure not only the model and algorithm in isolation but the entire system, including the data and its processing. Here, data means not only the training data per se but also the validation and holdout datasets and their relationships in terms of consistency, homogeneity, and sampling bias; that might become another data test suite.
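As one possible check of that kind, the sketch below compares the distributions of numeric features between hypothetical train and holdout splits with a two-sample Kolmogorov-Smirnov test; the file paths and the significance threshold are made up for illustration:

```python
import pandas as pd
from scipy.stats import ks_2samp

train = pd.read_parquet("train.parquet")      # hypothetical splits
holdout = pd.read_parquet("holdout.parquet")

# Flag numeric features whose train/holdout distributions differ sharply,
# which may indicate sampling bias or a problem in how the splits were made.
for col in train.select_dtypes("number").columns:
    stat, p_value = ks_2samp(train[col].dropna(), holdout[col].dropna())
    if p_value < 0.01:
        print(f"{col}: train/holdout distributions differ (KS={stat:.3f}, p={p_value:.4f})")
```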
Another part of model testing is bias detection. There are many types of bias: some can be introduced by model training algorithms, while others are caused by the training data itself, for example, systematic data collection errors, label annotation errors, and framing effects. Many of them can be prevented by adding specific validation rules to the data QA component.
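For instance, a data QA component could include rules like the following sketch, which checks group representation and label-rate parity across a hypothetical sensitive attribute; the file path, column names, and thresholds are invented for illustration:

```python
import pandas as pd

df = pd.read_parquet("training_data.parquet")  # hypothetical dataset
SENSITIVE = "customer_segment"                 # hypothetical grouping column

# Rule 1: every group must be reasonably represented in the training data.
group_share = df[SENSITIVE].value_counts(normalize=True)
underrepresented = group_share[group_share < 0.05]
assert underrepresented.empty, f"Underrepresented groups: {list(underrepresented.index)}"

# Rule 2: the positive-label rate should not diverge wildly across groups,
# which would hint at systematic collection or annotation bias.
positive_rate = df.groupby(SENSITIVE)["label"].mean()
assert positive_rate.max() - positive_rate.min() < 0.2, (
    f"Label rate varies too much across groups:\n{positive_rate}"
)
```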
Another aspect is that some data quality problems can be solved either in the data preprocessing step or in the model training step of the pipeline. Take label noise, for example: there are training algorithms that can deal with it. However, dealing with it during data preprocessing has several benefits: it reduces algorithm implementation complexity, makes the approach portable to other tasks, and enables teamwork and pipelining.
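One common way (not the only one) to handle label noise during preprocessing is to flag samples whose observed label looks implausible under out-of-fold predictions, as in this sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic binary dataset with ~5% of labels flipped to simulate annotation noise.
X, y = make_classification(n_samples=2000, n_informative=10, random_state=0)
rng = np.random.default_rng(0)
noisy = rng.choice(len(y), size=100, replace=False)
y[noisy] = 1 - y[noisy]

# Out-of-fold predicted probabilities estimate how "surprising" each observed
# label is to a model that never saw that sample during training.
probs = cross_val_predict(
    RandomForestClassifier(n_estimators=200, random_state=0),
    X, y, cv=5, method="predict_proba",
)
label_confidence = probs[np.arange(len(y)), y]  # assumes integer labels 0..K-1

# Flag low-confidence labels for review or removal before production training.
suspect = np.where(label_confidence < 0.1)[0]
print(f"{len(suspect)} potentially mislabeled samples flagged")
```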
The ML technology space is becoming increasingly fragmented. Could you share some platforms and frameworks you like in terms of data quality management, and some of their benefits and limitations?
RG: So far, there are not many options if we talk about open-source frameworks. Great Expectations is our default choice. It allows you to establish data contracts and generate data quality reports. It has the following benefits:
It is lightweight;
The rules engine is decoupled from the actual backend. You can maintain rules in one place and then apply them to different parts of your data pipeline, which greatly reduces the total cost of ownership for data QA;
It works with Pandas and Spark DataFrames, as well as Redshift, Snowflake, BigQuery, etc., out of the box.
One of the main disadvantages of GE is the absence of built-in support for streaming data use cases.
Recently, we contributed an open-source component that allows using Great Expectations in a Kubeflow pipeline.
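For a flavor of how GE rules look, here is a minimal sketch in the classic pandas-dataset style; the file and column names are hypothetical, and the API has evolved across GE versions, so newer releases may require a different entry point:

```python
import great_expectations as ge
import pandas as pd

# Wrap an ordinary DataFrame so expectations can be declared directly on it.
df = ge.from_pandas(pd.read_parquet("features.parquet"))  # hypothetical file

df.expect_column_values_to_not_be_null("user_id")
df.expect_column_values_to_be_between("age", min_value=0, max_value=120)
df.expect_column_values_to_be_in_set("country", ["US", "DE", "FR"])

# Validate the whole suite; the same expectations can later be run against
# Spark DataFrames or a warehouse table without rewriting the rules.
results = df.validate()
print(results["success"])
```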
Another worthy alternative is Deequ. It is built on top of Apache Spark, so if the latter is already part of your architecture, give it a try. In some aspects, it offers more than GE, for example, anomaly detection.
In recent years, methods such as semi-supervised or self-supervised learning have gained prominence for building models that can learn from large unlabeled datasets. How does data quality management for these new techniques compare to that for traditional supervised learning methods?
RG: From the inference perspective, there is not much difference. Once a model is trained and deployed, it has the same requirements for data testing as a model trained with a purely supervised method: data QA checks the inputs and the predictions.
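As a small illustration of such inference-time checks, here is a hypothetical wrapper that validates incoming features and the resulting predictions; it assumes a scikit-learn-style model with predict_proba and numeric feature arrays, and the thresholds are arbitrary:

```python
import numpy as np

def guarded_predict(model, features: np.ndarray) -> np.ndarray:
    """Run inference with basic data QA on both inputs and predictions."""
    # Input checks apply regardless of how the model was trained
    # (supervised, semi-supervised, or self-supervised).
    if np.isnan(features).any():
        raise ValueError("inference payload contains missing values")

    preds = model.predict_proba(features)

    # Prediction checks: probabilities must be valid, and a batch where the
    # model is uniformly unsure is worth flagging for investigation.
    assert np.all((preds >= 0) & (preds <= 1)), "invalid probabilities"
    if preds.max(axis=1).mean() < 0.6:  # arbitrary alerting threshold
        print("warning: unusually low prediction confidence for this batch")
    return preds
```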
During training, these algorithms make many assumptions about data, and in theory these assumptions could become part of data QA. In practice, it is not that easy or obvious: checking some assumptions might be harder to implement than the algorithm itself, while others are algorithm-specific and not portable.
Semi-supervised and self-supervised learning are usually considered advanced techniques, and it is easy to shoot yourself in the foot if they are applied blindly. This is arguably one of the reasons why they have not seen wide adoption yet. Data quality management is also still in an emerging state, so their intersection is even less explored to date. This might be a good opportunity for anyone looking for a research topic.
We recommend
Join Provectus & AWS to learn how to build a robust ML infrastructure on AWS and why MLOps and reproducible ML are crucial to enabling the delivery of ML-driven innovation at scale. It is a very practical webinar, and it's free.
Miscellaneous: a set of rapid-fire questions
Favorite math paradox?
I like sci-fi literature, so I prefer temporal paradoxes. Otherwise, there is one closer to the topic: the accuracy paradox. It teaches us the importance of choosing proper metrics for ML tasks.
What book would you recommend to an aspiring ML engineer?
Nowadays, you can find plenty of well-packed online courses, university lectures on YouTube and books for practitioners with code examples. But if you want to go hardcore, there are always Pattern Recognition and Machine Learning and The Elements of Statistical Learning. They are definitely not for binge reading, but they can occasionally be used as references to better understand the fundamentals.
Is the Turing Test still relevant? Any clever alternatives?
I think that as long as you can still distinguish between chatbots and human operators when you call an enterprise, it is entirely relevant.
Does P equal NP?
I'd like to see a future or an alternative universe where it does. But currently, the world does not operate that way. This might sound grim, but on the other hand, it creates many opportunities for researchers and engineers :)