🎙 Olga Megorskaya/Toloka: Practical Lessons About Data Labeling
and the balance between fully automated, crowdsourced, and hybrid approaches to data labeling
It’s inspiring to learn from practitioners. Getting to know the experience gained by researchers, engineers, and entrepreneurs doing real ML work is an excellent source of insight and inspiration.
👤 Quick bio / Olga Megorskaya
Tell us a bit about yourself: your background, your current role, and how you got started in machine learning?
Olga Megorskaya (OM):
I have a degree in Mathematical Modeling in Economics, and my research interest was in expert judgments and how they can be used to empower statistics-based models. I even started a Ph.D. thesis on this topic but gave it up for lack of the data I needed (so many good ideas die due to a lack of data, don’t they?). Now I have plenty of data on expert judgments (Toloka generates more than 15 million labels every day!), but no time for my thesis.
However, I found myself in the ML domain quite accidentally: while studying, I made some extra money as a search quality assessor at Yandex, the largest Russian IT company and search engine. Later, when I joined the Yandex Search team, I had a chance to participate in developing a search quality evaluation system based on human judgments and then to oversee providing all of Yandex’s ML-powered services with data labeling infrastructure. That was when we started Toloka. We created it to fit our own needs in large-scale industrial ML pipelines, and we have a proven track record of using it to build successful products. I’m proud to know that under the hood of every Yandex product, be it Search, Self-driving Rovers, Voice assistants, or anything else, there is Toloka technology.
🛠 ML Work
Toloka focuses on the important area of data labeling, which comes in many different flavors. How do you see the balance between fully automated, crowdsourced, and hybrid approaches to data labeling?
OM: We started working on data labeling production more than ten years ago and have helped hundreds of teams set up thousands of projects. We know there is no silver bullet: the key is an optimal combination of different methods.
You can think of these methods as a pyramid: a small pool of experts at the top, the crowd in the middle, and automation at the base. Interestingly, the lower the pyramid level, the harder it is to build such a solution and the more technology it requires, but also the more scalability and effectiveness it provides. It is much easier to train and manage a limited number of annotators, but labeling production that relies solely on them is expensive and, worse, does not scale. At the same time, relying only on a purely automated solution may limit the useful signal in your models.
So, in my opinion, the optimal combination consists of:
- experts providing benchmark labels;
- a mathematically managed crowd providing the majority of labels;
- auto labeling solutions used to increase the size of the training set, the quality of labels (by adding an extra vote to each judgment), or the speed of labeling.
At Toloka, we provide both an infrastructure platform that engineers can use to build their own optimal pipelines integrated into the ML production cycle, and pre-set pipelines that combine all three components (experts, crowd, and automation) to obtain the best result.
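To make the "extra vote" idea above concrete, here is a minimal sketch, with a made-up function and labels rather than Toloka's actual implementation, of counting an auto-labeling model's prediction as one additional weighted vote alongside crowd votes:

```python
from collections import defaultdict

def aggregate_with_model_vote(crowd_votes, model_label, model_weight=1.0):
    """Weighted majority vote: each crowd vote counts as 1.0, and the
    auto-labeling model contributes one extra vote with its own weight."""
    scores = defaultdict(float)
    for label in crowd_votes:            # human judgments for one item
        scores[label] += 1.0
    scores[model_label] += model_weight  # model prediction as an extra vote
    return max(scores, key=scores.get)

# Example: three crowd votes plus a model vote for one item
print(aggregate_with_model_vote(["cat", "dog", "cat"], model_label="cat"))  # -> "cat"
```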
Crowdsourced data labeling seems like one of the most obvious models for organizations to systematically build high-quality datasets, but at the same time it can be challenging to scale. Could you elaborate on some of the challenges and best practices for implementing crowdsourced data labeling pipelines at scale?
OM: Creating an effective data labeling pipeline able to deliver stable, high-quality labels at scale requires six steps:
1. Decompose the task. This leads to higher efficiency: simpler tasks are easier to formalize and control, and hence produce higher-quality labels. Designing such tasks takes more effort, but it is worth it.
2. Write comprehensive guidelines. The better you describe the logic, the more consistent the labels will be. It is always very useful for ML engineers to label at least some data samples themselves before launching the project, to get a feel for the data they will later feed to their models.
3. Choose an optimal interface and tooling that help performers increase the speed and quality of labeling.
4. Set up quality control. Our philosophy is to treat crowd management as an engineering and mathematical task: by applying a wide range of quality control techniques, you can build an effective pipeline that is resistant to the mistakes of individual performers.
5. Set up pricing, incentives, and a bonus system that align the motivation of requesters and performers on producing the highest-quality data.
6. Aggregate the results smartly. Smart aggregation takes into account the reliability of each performer and weights their votes accordingly; there are plenty of aggregation models that increase the accuracy of the final data (a minimal sketch follows this list).
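As a minimal sketch of steps 4 and 6, assuming the simplest scheme of estimating each performer's accuracy on expert-labeled benchmark ("golden") tasks and then using it to weight their votes (task and performer names are hypothetical; production systems typically use richer probabilistic models such as Dawid-Skene):

```python
from collections import defaultdict

def performer_accuracy(golden_answers, performer_answers):
    """Estimate each performer's skill as their accuracy on golden tasks."""
    acc = {}
    for performer, answers in performer_answers.items():
        graded = [t for t in answers if t in golden_answers]
        hits = sum(answers[t] == golden_answers[t] for t in graded)
        acc[performer] = hits / len(graded) if graded else 0.5  # neutral prior
    return acc

def weighted_vote(task_votes, acc):
    """Aggregate one task's votes, weighting each performer by estimated skill."""
    scores = defaultdict(float)
    for performer, label in task_votes.items():
        scores[label] += acc.get(performer, 0.5)
    return max(scores, key=scores.get)

golden = {"t1": "spam", "t2": "ham"}
answers = {"alice": {"t1": "spam", "t2": "ham"}, "bob": {"t1": "ham", "t2": "ham"}}
acc = performer_accuracy(golden, answers)
print(weighted_vote({"alice": "spam", "bob": "ham"}, acc))  # alice outweighs bob -> "spam"
```

The point of the design is that quality control and aggregation reinforce each other: the same golden tasks that catch careless performers also supply the weights that make the final label robust to their mistakes.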
By taking these steps, you'll arrive at the most cost-effective, high-quality data labeling pipeline at scale. We provide an open platform for engineers with all the necessary components: a global crowd, a full range of automated quality control methods, pre-set interfaces, dynamic pricing, tools for balancing the speed/quality ratio, optimal matching of tasks and performers, and so on, as well as a free, powerful API to integrate it all into the ML production pipeline.
However, not every team has the resources to set up all these processes themselves, and time-to-result is often the key factor. For such teams, we created a specific solution: pre-set pipelines with an optimal combination of Toloka’s in-house expert labelers, crowd, and automation that deliver the best quality with minimal effort on the requester's side.
In recent years, we have seen the emergence of methods such as pretrained models and self-supervised learning that rely on large volumes of unlabeled data to train models, which can then be fine-tuned for specific scenarios. How would traditional data labeling approaches adapt to this new type of technique?
OM: Indeed, the industry is developing fast, and we expect such technologies to keep gaining ground. I pay attention to the trends that align with our product vision:
The need for human-powered data will not disappear: the more AI moves toward offline applications, the more often people's help is needed to produce training data from scratch. And, of course, people will still need data to fine-tune models, validate their quality before shipping them to production, and monitor them after deployment.
We foresee that these models will democratize ML production and increase the number of actors seeking data labeling solutions. Instead of collecting 1,000,000 labels for one project, we may see a shift to 1,000 labels for each of 1,000 projects. This is where the flexibility of a platform and its ability to support a large variety of task types will play an important role.
In general, I believe that automation will play an increasing role in our industry and significantly boost the effectiveness of data labeling production in terms of accuracy and throughput. In Toloka Apps, we use automation techniques that allow us to substantially reduce the labeling error rate.
These giant pre-trained models are only available for domains that already have lots of data. But in domains such as rare languages, medical data, or applications of AI in the physical world, there is simply no data to train your initial BERT. This is where Toloka, with its global crowd and limitless flexibility, will be in even greater demand. So we expect our job to become more diverse and interesting in the future.
Generative models are actively used to augment labeled datasets in supervised learning scenarios. What are the benefits and drawbacks of these methods for creating high-quality labeled datasets?
OM: Well, the benefits are obvious: adding auto labeling helps increase both the quality and the quantity of collected labels. However, one should be careful not to overfit the model on the same datasets and not to lose the important additional signal that can only be obtained from an independent source (human labels). So ML specialists should not forget to properly validate the quality of their models on independent datasets.
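As a minimal sketch of that last point, using toy scikit-learn data and made-up variable names rather than any real pipeline, the key discipline is that the evaluation holdout is carved out of the independently human-labeled data before any auto-generated labels are mixed into training:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for independently human-labeled data (toy data, for illustration only).
X_human, y_human = make_classification(n_samples=1000, random_state=0)
X_train_h, X_holdout, y_train_h, y_holdout = train_test_split(
    X_human, y_human, test_size=0.2, random_state=0)

# Stand-in for auto-labeled / generated examples used purely for augmentation.
X_synth, y_synth = make_classification(n_samples=2000, random_state=1)

# Augment the training set only; the holdout stays purely human-labeled.
X_train = np.concatenate([X_train_h, X_synth])
y_train = np.concatenate([y_train_h, y_synth])

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy on independent human-labeled holdout:",
      accuracy_score(y_holdout, model.predict(X_holdout)))
```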
Fairness and bias are relevant considerations when creating systematic approaches to data labeling. What are some best practices for embedding bias and fairness evaluation in data labeling processes?
OM: First of all, bias should be considered and avoided at the stage of choosing the data to label, not at the stage of labeling. If we speak about classic data labeling tasks and follow the steps I described above, the problem of bias is reduced by writing comprehensive guidelines that leave minimal room for subjective judgments.
However, in some cases, subjective human judgment is required to obtain important signals. For example, side-by-side comparison tasks are purely subjective by design, precisely so that the subjective perception of objects can be digitized. In these cases, specific models (such as Bradley-Terry) enable us to avoid systematic bias.
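As a rough, self-contained illustration (not Toloka's implementation), Bradley-Terry strengths can be fit from pairwise side-by-side judgments with a simple minorization-maximization loop; the win counts below are invented:

```python
import numpy as np

def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths from a pairwise win matrix.
    wins[i, j] = number of times item i was preferred over item j."""
    n = wins.shape[0]
    p = np.ones(n)                   # initial strengths
    comparisons = wins + wins.T      # total i-vs-j comparisons
    for _ in range(n_iter):
        for i in range(n):
            mask = np.arange(n) != i
            denom = np.sum(comparisons[i, mask] / (p[i] + p[mask]))
            if denom > 0:
                p[i] = wins[i].sum() / denom
        p /= p.sum()                 # normalize for identifiability
    return p

# Made-up side-by-side results for three items (e.g., three result layouts).
wins = np.array([[0, 8, 6],
                 [2, 0, 5],
                 [4, 5, 0]])
print(bradley_terry(wins))  # relative strengths; higher = preferred more often
```

Because the model pools all pairwise judgments into a single strength per item, an individual annotator's idiosyncratic preference is averaged out rather than baked into the final ranking.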
Speaking about fairness, I would like to talk about how annotators are treated. This topic is personally important to me since I have worked as a data annotator myself. I think it is a true shame of our industry that AI development is still powered by the efforts of poorly treated annotators who are forced to sit in gloomy offices for many hours in a row, without the ability to choose tasks, without career prospects, and without free time to devote to education, hobbies, or any other sources of joy and self-development in life.
At Toloka, everything is organized flexibly: self-sufficient people plug into the platform when they are interested, are free to choose any task they want based on the open ratings of requesters, and spend as much time on it as they find reasonable.
In the former model, you need strict managerial effort to ensure good labeling quality; in the latter, quality is managed mathematically. That is why we support research on crowd workers' well-being, to make sure we develop our platform with their interests in mind.
💥 Miscellaneous – a set of rapid-fire questions
Favorite math paradox?
The Monty Hall paradox is an excellent illustration of Bayes' theorem. Bayesian methods are one of the cornerstones of quality management at Toloka: aggregation models, dynamic pricing, dynamic overlap, and so on. These are all cases where we reconsider our understanding of the unknown every time we obtain a new piece of information.
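For readers who want to check the counter-intuitive answer themselves, a quick Monte Carlo simulation (a standalone toy, unrelated to Toloka's systems) reproduces the 1/3 vs. 2/3 split between staying and switching:

```python
import random

def monty_hall(trials=100_000):
    """Estimate the win rate of the 'stay' and 'switch' strategies."""
    stay_wins = switch_wins = 0
    for _ in range(trials):
        car = random.randrange(3)       # door hiding the car
        choice = random.randrange(3)    # contestant's initial pick
        # Host opens a door that is neither the pick nor the car.
        opened = next(d for d in range(3) if d != choice and d != car)
        switched = next(d for d in range(3) if d != choice and d != opened)
        stay_wins += (choice == car)
        switch_wins += (switched == car)
    return stay_wins / trials, switch_wins / trials

print(monty_hall())  # roughly (0.333, 0.667): switching doubles the win probability
```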
And the famous Butterfly Effect. It is tightly connected to your previous question about bias in training models. AI will soon be woven into every sphere of our lives, and it is trained on data mostly labeled by humans. Any systematic bias incorporated into a dataset at the stage of its annotation may lead to systematic bias in the model. As I said, systematic bias can come from poorly formulated guidelines. So such a seemingly minor part of ML production as writing guidelines for annotators can have a far-reaching effect in the future.
What book would you recommend to an aspiring ML engineer?
The Toloka team is full of great ML engineers, so I decided to ask them for their best advice. Our team recommends "Introduction to Machine Learning with Python: A Guide for Data Scientists" and "Machine Learning Engineering."
Is the Turing Test still relevant? Any clever alternatives?
If you intuitively understand the Turing Test as "does the computer convincingly answer the questions asked by a person?", then there is a very interesting article where the author asks GPT-3 questions, and it turns out that GPT-3 consistently answers incorrectly. There are ways to improve the model specifically for this case, but there are other examples. For instance, here the authors show that GPT-3 does not cope well with the task of writing analogies.
Does P equal NP?Â
If it does, we are in trouble :)