⚪️⚫️ Edge#108: How to Improve Model Accuracy with Crowdsourced Data Labeling – Real World Use Cases

Introducing a new format!

Today we want to introduce a new format which is often requested by our readers:  

✔️ Real-World Use Cases 

In these additional issues of TheSequence, we will discuss how different machine learning (ML) concepts are applied in real-world solutions implemented by enterprises and startups. The idea is that, by presenting real-world examples of ML implementations, we can help you better understand how different ML technologies can be practically applied in specific scenarios. Send us your feedback!

🏷 Crowdsourced Data Labeling in the Real World

Following Edge#107, in which we described three main approaches to data labeling and focused on the crowdsourced approach, today we would like to demonstrate how different types of businesses use crowdsourced data labeling platforms to improve their data preparation and enhance model predictions. To that end, we’ve asked our partner Toloka, one of the biggest crowdsourced data labeling platforms, to help us with some real-world use cases.

ML predictive models typically require large volumes of high-quality labeled training data. For many businesses, that’s still one of the biggest roadblocks when implementing ML solutions in the real world. Data labeling in machine learning is one of those things that is easy to trivialize until you need to do it at scale. Automated data labeling technologies have been gaining traction, but they are far from being a silver bullet, as there are plenty of scenarios that require human intervention. In those scenarios, crowdsourced data labeling has emerged as a popular choice for data science teams.

In general, crowdsourced data labeling offers some tangible benefits over alternative approaches:

  • cost efficiency  

  • globally distributed performer base

  • quality control methods 

Today we will look at three very different domains that illustrate how crowdsourced data labeling methods can be used, from standard labeling to more sophisticated human-in-the-loop approaches for improving model accuracy. Creating quality datasets for these cases required the work of millions of skilled Toloka performers (Tolokers) across the globe who collected, labeled and verified large amounts of image, text, speech, audio, and video data.  

1️⃣ Use Case: Online identity verification

Customer: Biometric technology company ID R&D combines extensive R&D capabilities with advances in AI to deliver voice biometrics and passive voice and face liveness detection software. Liveness detection guards against attempts to impersonate a real user with fake faces or voices. Collectively known as spoofing, this type of identity fraud is a challenge for financial companies and tech businesses that work with sensitive user data.

Initial setting: ID R&D built an ML model to distinguish between real faces and 3D masks or photo cutouts with no extra effort required from the user (e.g. recording a dynamic video). For many services, such extra requirements lead to significant conversion drops.

Problem they encountered: Building a large, high-quality image and audio dataset was a crucial challenge in creating the anti-spoofing solution. ID R&D needed millions of user photos and speech samples, both real and fake, to feed to the algorithm so that it could reliably tell live faces and voices apart from intricate fakes.

The solution:

For that task, ID R&D used Toloka, which has over 9 million registered performers worldwide. It still wasn’t easy: even with the largest data labeling services, the monthly active user base is limited to hundreds of thousands of people. To address this challenge, ID R&D asked performers to submit data in different environments, thus boosting the size of the training database. An added benefit was the diversity of Toloka’s performer population. Having performers in over 100 countries made it possible to collect comprehensive data on people of different races, skin colors, ages, and ethnicities, which helps eliminate potential algorithmic bias in biometric recognition.

2️⃣ Use Case: Geo-analytical tool for predicting retail revenue

Customer: Predictive geo-analytical tool BestPlace.ai provides end-to-end visibility of local shopper patterns and helps find the most profitable locations and solutions for offline retail and consumer packaged goods (CPG) brands to increase their revenue.  

Initial setting: BestPlace processes 250+ sources of social, geospatial, and historical data to identify 100+ category-specific consumer behavior patterns. It then builds mathematical models to analyze those patterns and predict people’s behavior, which allows clients to adjust store equipment (e.g. buy coolers) and assortment accordingly.

Problem they encountered: Satellite data and online databases do not give complete and up-to-date information on actual city landscapes, existing points of sale, and consumer behavior. To make accurate predictions about the real world, the online data needs to be complemented with offline information.

The solution:

To verify the data, BestPlace needed to deploy thousands of people in the field without incurring a large cost. Toloka addressed that with a unique type of microtask: pedestrian tasks. These tasks offered vast territory coverage: well-organized performers walked the field and collected offline information quickly and with controlled levels of accuracy (using Toloka’s built-in quality control measurements). That allowed BestPlace to calibrate its mathematical model on verified field data, which significantly improved the model’s accuracy. By assigning 3-5 overlapping performers to the same location, BestPlace was able to verify data, optimize the models, and deliver highly insightful and much more accurate predictions, all at a low price. Toloka’s human-in-the-loop approach proved to be the easiest way to connect the modeled world with the real world.

By combining crowdsourced data verification with AI algorithms, BestPlace is able to achieve up to 95% prediction accuracy and provides valuable insights even to CPG giants like PepsiCo, determining optimal efficiency and assortment, as well as sales potential, for each of their thousands of points of sale.
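To make the idea of overlapping performers concrete, here is a minimal Python sketch of one common way to combine several reports about the same location: a majority vote with an agreement threshold. This is an illustration only, not BestPlace’s or Toloka’s actual aggregation logic; the function name, the location IDs, the label values, and the 0.6 threshold are all hypothetical.

```python
from collections import Counter


def aggregate_by_majority(labels_by_location, min_agreement=0.6):
    """Combine overlapping reports per location with a majority vote.

    labels_by_location maps a (hypothetical) location ID to the list of
    labels submitted by the performers assigned to that location.
    Returns (accepted, disputed): accepted maps location IDs to the
    winning label; disputed lists locations without enough agreement.
    """
    accepted = {}
    disputed = []
    for location_id, labels in labels_by_location.items():
        if not labels:
            # No reports at all: nothing to vote on, send back to the field.
            disputed.append(location_id)
            continue
        top_label, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= min_agreement:
            accepted[location_id] = top_label
        else:
            disputed.append(location_id)
    return accepted, disputed


# Hypothetical reports from 3-5 performers per location.
reports = {
    "loc-001": ["open", "open", "open"],
    "loc-002": ["open", "closed", "closed", "closed", "closed"],
    "loc-003": ["open", "closed", "moved"],
}
accepted, disputed = aggregate_by_majority(reports)
print(accepted)
print(disputed)
```

Locations that end up in the disputed list could be sent out again to additional performers. In practice, platforms often weight votes by each performer’s track record rather than counting them equally; the plain majority vote above is the simplest variant.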

3️⃣ Use Case: Extracting data from paperwork

Customer: Handl, a Y Combinator-backed AI-powered data extraction startup, created an algorithm-based tool that helps large companies analyze, categorize, and retrieve customer information from scanned documents in seconds (cutting expenses for some of them by as much as $500 million annually).

Initial setting: The solution uses computer vision to recognize document types, determine relevant fields, read the data and store it in the database for further analysis.  

Problem they encountered: Even after training the algorithm on hundreds of thousands of observations, a solution like Handl cannot yet work in a fully automated manner. Because it has a certain error rate, it needs a human in the loop to cross-check and validate data recognition, which also helps continuously retrain and improve the model.

The solution:

Handl uses Toloka’s large distributed performer base, which provides human input 24/7 from across the globe, to build human verification into its pipeline. Randomly sampled document portions, as well as documents predicted to contain errors, are distributed among overlapping performers, whose feedback is used to verify recognition quality, eliminate errors, and ultimately retrain the algorithm.
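The routing step described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Handl’s actual pipeline or any Toloka API: it assumes each recognized document carries a model confidence score, and the field names, the 0.9 threshold, and the 5% audit rate are hypothetical.

```python
import random


def select_for_review(documents, confidence_threshold=0.9,
                      sample_rate=0.05, seed=None):
    """Pick documents that should be sent to human performers.

    Each document is a dict with hypothetical keys "id" and "confidence"
    (the model's confidence in its own recognition result).
    Low-confidence documents are always reviewed; a random share of the
    remaining ones is reviewed as well, to audit overall quality.
    """
    rng = random.Random(seed)
    to_review = []
    for doc in documents:
        if doc["confidence"] < confidence_threshold:
            to_review.append((doc["id"], "low_confidence"))
        elif rng.random() < sample_rate:
            to_review.append((doc["id"], "random_audit"))
    return to_review


# Hypothetical recognition results.
docs = [
    {"id": "doc-1", "confidence": 0.99},
    {"id": "doc-2", "confidence": 0.42},
    {"id": "doc-3", "confidence": 0.95},
]

# With the random audit switched off, only the uncertain document is routed.
review_queue = select_for_review(docs, sample_rate=0.0)
print(review_queue)
```

The random audit matters because a model can be confidently wrong: reviewing only low-confidence documents would never surface those errors. The corrected answers from both groups can then be added to the training set for the next retraining round.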


The use cases presented in this edition of TheSequence illustrate the value of crowdsourced data labeling mechanisms in real-world data science pipelines. Even though the use cases were specific to the Toloka platform, the principles and best practices are applicable to other crowdsourced data labeling technology stacks. Crowdsourced data labeling is relatively intuitive to understand, but its implementation at scale is far from trivial. Learning from organizations that leverage these techniques in real-world scenarios provides a unique perspective on this emerging data labeling field.

If you are interested in trying crowdsourced data labeling for your project, check out this free webinar. In the practice session, participants will choose one real language resource production task, experiment with selecting settings for the labeling process, and launch their label collection project. All projects will be run on the real crowd! It’s a free hands-on tutorial, exclusively for the readers of TheSequence: