🎙 German Osin/Provectus About Data Discovery and Observability in ML Solutions
Learn about data discovery and different approaches to resolving data challenges
It’s so inspiring to learn from practitioners. Getting to know the experience gained by researchers, engineers, and entrepreneurs doing real ML work is a great source of insight and inspiration. Share this interview if you find it enriching. No subscription is needed.
👤 Quick bio / German Osin
Tell us a bit about yourself: your background, current role, and how you got started in machine learning?
German Osin (GO): I am a Senior Data Platform Architect at Provectus. I lead several of the company’s open-source products and help build advanced data platform solutions. A big part of my job is bringing AI/ML to data, enabling critical data processing for analytics, deep insights, and Business Intelligence.
My first big experience with ML was in 2013 when I worked on a system for modeling look-alike audiences for targeted advertising. My prior experience was in Big Data and near real-time solutions. So initially, I approached the new challenge as a classical Big Data problem. But it made me take a deep dive into the subject and sent me directly into the ML world, helping me explore it more in-depth. I ended up using the full ML stack and am very happy about it.
🛠 ML Work
Provectus has a long history of helping enterprises implement ML solutions in the real world, and recently open-sourced a new data discovery and observability platform. Can you tell us a bit more about the inspiration for this new project, and elaborate on some of its technical capabilities?
GO: We draw a lot of inspiration from our clients and their stories. After talking to them about the issues they experience when trying to implement ML, we discovered several patterns:
Despite the advent of data catalogs, the majority of data teams (up to 90%) still struggle with data access and discovery, unable to find or reach the data they need for work.
Data discovery still eats up too much time. About 25-50% of data specialists’ working time goes into finding correct and trustworthy data.
Due to the rapid development of ML and Big Data industries, the bar has been set much higher for products that address these issues. They need to offer not only company-wide lineage and data quality assurance but also encompass the entire ML world. Otherwise, organizations are held back from applying ML/AI at scale.
We explored the market and confirmed that there is nothing close to this type of product available, so we decided to design one ourselves.
Data discovery is an area where Provectus has been doing a lot of work recently. What are some of the best practices that companies should consider to enable data discovery controls in ML pipelines, and how do they differ from traditional enterprise data catalog approaches?
GO: With the advent of the data-centric approach in MLOps, several things are becoming increasingly important:
Building data discovery products based on an open-source standard for metadata exchange. This allows for extensive interoperability and makes it easy to quickly plug in any new data sources, tools, and other catalogs when you need them in your system. Check out this research article about why building on an open metadata standard is so important.
Gathering as much metadata as possible to get the full picture. In the data discovery process, metadata helps us evaluate data objects and their trustability. The more metadata you have, the better your results. Modern data discovery platforms need to serve people working in many different roles and relying on all kinds of metadata for evaluation. A system that offers more metadata enhances their efficiency in finding data and doing their jobs.
Avoiding manual metadata collection. Products that you implement should not require any manual work to retrieve the necessary data. Data retrieval should be a straightforward automated collection process that everyone in your company can use, and that does not require a lot of time or infrastructure resources.
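The points above can be made concrete with a small sketch of automated metadata collection. The snippet below turns a crawled table schema into a catalog-ready entity in a hypothetical open metadata format, loosely in the spirit of specs like OpenDataDiscovery; the field names (`oddrn`, `DatasetEntity`, etc.) are illustrative assumptions, not the real specification.

```python
# A minimal sketch of automated metadata collection. The entity shape
# below is a made-up stand-in for an open metadata standard.
import json
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class DatasetField:
    name: str
    type: str

@dataclass
class DatasetEntity:
    oddrn: str                      # URN-style unique identifier (illustrative)
    name: str
    source: str
    fields: List[DatasetField] = field(default_factory=list)

def collect_metadata(table_schema: dict) -> DatasetEntity:
    """Turn a crawled table schema into a catalog-ready entity --
    no manual curation step involved."""
    return DatasetEntity(
        oddrn=f"//postgresql/host/db/tables/{table_schema['table']}",
        name=table_schema["table"],
        source="postgresql",
        fields=[DatasetField(c["name"], c["type"])
                for c in table_schema["columns"]],
    )

# Example: a schema as a crawler might report it.
schema = {"table": "orders",
          "columns": [{"name": "id", "type": "integer"},
                      {"name": "amount", "type": "numeric"}]}
entity = collect_metadata(schema)
print(json.dumps(asdict(entity), indent=2))
```

Because the payload is plain structured data rather than a proprietary format, any tool that speaks the same schema can produce or consume it, which is what makes new sources easy to plug in.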
By contrast, traditional enterprise data catalogs lock up metadata and are not flexible enough to include all the tools a company might need. Sharing metadata is often not in a company’s best interest, so the catalogs were not designed to do so.
Additionally, traditional catalogs completely ignore the world of ML. They overlook ML training jobs, model repositories, model instances, and feature stores that are rapidly becoming a fundamental component of the ML stack. The absence of ML from data lineage leaves a company exposed to unknown and unpredictable black-box issues that can lead to numerous downsides for the business.
We consider it crucial to include ML entities as a part of data discovery and lineage, get truly end-to-end observability, and be proactive about black-box issues. We have implemented these principles in the ODD Platform we’re currently working on.
Also, traditional data catalogs rely on manual metadata collection, which is obsolete in the modern world and will never pay off.
If you’d like to learn more about data discovery and our approach to resolving various data challenges, watch this webcast: First open-source data discovery and observability platform for ML Engineers.
Data management in ML solutions can involve vastly different types of datasets, such as tabular, text, audio, or images. What are some of the commonalities and differences in enabling data observability and lineage analysis across these dataset types?
GO: The ODD Specification we designed treats various types of data equally. Tabular data, text, images, audio – for us, all of them are simply datasets in the system. They differ only in what metadata they contain. This allows our platform to build proper lineage no matter what type of dataset is being used.
Speaking of tabular data, the system should support column-based lineage in addition to object lineage. From the standpoint of data compliance regulations, it's important to understand where and how the data is used.
It's also important to collect data profiling statistics so that anomaly detection can eliminate black-box issues. You need to know that your ML pipelines are working correctly. This is where real end-to-end lineage helps – without it, black-box issues, including anomalies, are impossible to detect.
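To illustrate the column-based lineage mentioned above, here is a toy sketch that stores lineage as directed edges between column identifiers and walks them to answer "where does this column's data come from?". The column names and the dotted naming scheme are made up for the example.

```python
# A toy model of column-level lineage: each edge says a downstream
# column is derived from an upstream one. Names are illustrative.
from collections import defaultdict

# (downstream column, upstream column) pairs
edges = [
    ("reports.revenue.total", "warehouse.orders.amount"),
    ("warehouse.orders.amount", "raw.payments.amount_cents"),
]

upstream = defaultdict(list)
for dst, src in edges:
    upstream[dst].append(src)

def trace(column: str) -> list:
    """List every upstream column that feeds `column`, transitively."""
    result, stack = [], [column]
    while stack:
        for src in upstream[stack.pop()]:
            result.append(src)
            stack.append(src)
    return result

print(trace("reports.revenue.total"))
# -> ['warehouse.orders.amount', 'raw.payments.amount_cents']
```

This kind of traversal is what lets a compliance team answer where a sensitive field ends up, and it works the same whether the nodes are tables, columns, or ML training jobs.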
The MLOps technology space is becoming increasingly fragmented, particularly in areas like data observability. Do you believe data observability will remain as a standalone category, or will it become a feature of larger ML platforms?
GO: The MLOps space is very fragmented because its formation is still in the early stages. Its development is very similar to that of the data world, although there are some differences between them, and ML operates on different sets of entities. But the operations are very similar. My view is that it will eventually become unified and commoditized in the same way it happened in the data world. All ML platforms will include end-to-end lineage for both data and ML, in the same way data platforms include monitoring. It will become a must-have built-in default feature, helping us stay on top of black-box issues and ensuring that everything works correctly.
Today, metadata is gathered mostly through proprietary products that impose vendor lock-in, but this will shift to open standards. The protocol for gathering metadata should be open, eliminating the practices that make both low-level data operations and data discovery very inefficient.
In recent years, we have seen some interesting research on applications of ML methods for streamlining different MLOps areas. Could that be the case for data discovery and observability? Which modern ML methods can be used to improve data discovery and observability in ML models in the near future?
GO: Any data observability relies heavily on anomaly detection. In the world of ML, there are numerous methods for detecting anomalies. So there is a lot of space for ML to develop in the direction of data observability, to predict when there is something wrong with your data.
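As a concrete example of the profiling-plus-anomaly-detection idea described above, the sketch below flags a daily row count whose z-score against recent history exceeds a common rule-of-thumb threshold. The numbers and the threshold are illustrative; production systems typically use more sophisticated ML-based detectors.

```python
# A minimal sketch of profiling-based anomaly detection: compare
# today's row count against recent history using a z-score.
from statistics import mean, stdev

history = [1010, 990, 1005, 1002, 998, 1015, 995]  # recent daily row counts
today = 620                                         # today's row count

mu, sigma = mean(history), stdev(history)
z = (today - mu) / sigma
is_anomaly = abs(z) > 3.0  # rule-of-thumb threshold (illustrative)

print(f"z-score = {z:.1f}, anomaly = {is_anomaly}")
```

A sudden drop like this often means an upstream pipeline silently failed; surfacing it automatically is exactly the black-box protection end-to-end observability is meant to provide.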
For efficient data discovery, it’s important to conduct a quick and personalized search. There can be thousands of similar datasets with minor differences, making them hard to choose from. We all have context surrounding us at work, and personalizing it will help boost productivity. We’re already seeing these types of ML models being used for recommendations in online shops, streaming services, and online search engines.
💎 We recommend
If you are interested in Observability and Data Discovery, we recommend registering for this webinar by Provectus*. It’s free.
💥 Miscellaneous – a set of rapid-fire questions
Favorite math paradox?
When talking about groups of data, it's worth mentioning the Friendship paradox: most people have fewer friends than their friends have, on average. This paradox is applicable to any graph data structure.
What book would you recommend to an aspiring ML engineer?
“Computing Machinery and Intelligence” by Alan Turing. This paper laid the entire foundation for all of data science.
Is the Turing Test still relevant? Any clever alternatives?
I admire Turing too much to consider alternatives.
Does P equal NP?
If there were a simple answer to this question, there would not be a Millennium Prize for it. When we can answer this question, one of the deepest secrets of mathematics will be unveiled.