🎙 Manu Sharma, CEO of Labelbox, about the future of data labeling automation

TheSequence interviews ML practitioners to immerse you in the real world of machine learning and artificial intelligence

There is nothing more inspiring than to learn from practitioners. Getting to know the experience gained by researchers, engineers and entrepreneurs doing real ML work can become a great source of insights and inspiration. Please share these interviews if you find them enriching. No subscription is needed.


👤 Quick bio / Manu Sharma

Tell us a bit about yourself: your background, current role, and how you got started in machine learning.

Manu Sharma (MS): I am the founder and CEO at Labelbox. Labelbox is a training data platform that solves the data bottleneck for AI teams, enabling them to build and scale production machine learning systems. Before Labelbox, I helped build products that made visual data from drones and satellites useful for a broad range of industries such as agriculture, insurance, construction, defense, and intelligence.  

My interest in AI started around 2010 during college, when I discovered neural networks while working with genetic algorithms. Around 2017, while at Planet Labs, I learned about building production AI systems that could detect nearly any object of interest visible to the human eye. Building these AI systems was quite challenging because engineering teams had to make the essential tools and infrastructure to create and manage training data. If we could have, we would have gladly used a commercial solution. Compute and algorithms were starting to become incredibly powerful, cheap, and readily available. Recognizing that the most value is in the training data, I co-founded Labelbox in 2018 with my long-time friends and colleagues Brian Rieger and Dan Rasmuson.

🛠 ML Work  

Data labeling automation is one of the most expensive aspects of any machine learning solution. Tell us how Labelbox addresses this challenge and some of the fundamental capabilities of the platform. 

MS: To build AI, you need compute, an algorithm, and training data. Businesses realize that their core asset is no longer computation pipelines or novel algorithms. In the new AI paradigm, a business’s competitive advantage is all about its capability to create and curate training data. 

A few years ago, there simply weren’t tools and workflows to iterate with training data. AI teams would generally label all the data they could at every iteration step. With better tools and integrated workflows, AI teams are becoming smarter and more sophisticated. AI teams now seek to understand the AI model’s weak areas and focus only on data with the highest chance to increase the model’s performance per iteration. This is a vital capability of any Training Data Platform, and AI teams starting now should think about this workflow from the start. 

Building AI, like software development, is most successful when teams can iterate quickly. Currently, even the most successful AI teams take as much as four weeks for a single iteration. Speeding up this process will allow models to improve and move to production much faster.

At Labelbox, our mission is to build tools and workflows that accelerate AI iteration. Our training data platform is built around three pillars: the ability to annotate data, manage people and processes, and iterate on training data. This integrated approach enables AI teams to iterate faster, minimize costs, and accelerate their roadmap to production scale and positive business impact.

Data labeling seems fundamentally different for different types of datasets, such as text, images, or videos. Can you give us a few examples of how Labelbox is able to automate data labeling across different types of datasets?

MS: There are many ways to create labeled training data. The most common technique is asking humans to input a decision based on specified information. This technique is broadly applicable to a wide range of deep learning applications. The orchestration workflow to capture human perception is common across media types. The major difference is in the human-computer interaction. Let’s look at software development to understand this better. Software developers use different languages and IDEs to write a series of logical statements. They use GitHub or similar products to collaborate with others and orchestrate the downstream processes of testing and deployment. Labelbox is a lot like this for AI. Instead of software developers, it is primarily domain experts who encode knowledge by labeling data using specialized tools. The labeling tool to classify text conversations is different from the one used to track objects in video. However, the underlying workflow is the same across all these media types.

Many AI teams spend a large part of their operating budget on data labeling.  

With Labelbox, AI teams are keeping the labeling costs under control primarily using these two techniques:  

  • Model-assisted labeling: AI teams use their models to generate pre-labels and import them into Labelbox for review. The cost of correcting the pre-labels can be up to 80% lower than the cost of creating the label from scratch. This applies to a broad range of data modalities as long as the tools and workflows are purpose-built and ergonomic.

  • Active learning: What’s better than faster labeling? No labeling. Choosing the right information to label is paramount. Instead of labeling data blindly, the best AI teams are leveraging active learning architecture. They carefully select the data with the highest chance to improve the model’s performance at every iteration. 
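The selection step at the heart of active learning can be sketched as a simple uncertainty-sampling loop: rank unlabeled examples by how unsure the model is, and spend the labeling budget on the most uncertain ones. This is a minimal illustration of the general technique, not Labelbox's implementation; the probability array, scoring rule, and budget are assumptions for the sketch.

```python
import numpy as np

def select_for_labeling(probs: np.ndarray, budget: int) -> np.ndarray:
    """Pick the unlabeled examples the model is least sure about.

    probs: (n_samples, n_classes) predicted class probabilities
    budget: number of examples we can afford to label this iteration
    """
    # Uncertainty = 1 - confidence in the most likely class.
    uncertainty = 1.0 - probs.max(axis=1)
    # Spend the budget on the most uncertain examples first.
    return np.argsort(-uncertainty)[:budget]

# Toy example: 4 unlabeled items, 2 classes, budget for 2 labels.
probs = np.array([[0.95, 0.05],   # confident -> skip
                  [0.55, 0.45],   # uncertain -> label
                  [0.50, 0.50],   # most uncertain -> label
                  [0.90, 0.10]])  # confident -> skip
print(select_for_labeling(probs, budget=2))  # -> [2 1]
```

In practice, teams often swap the scoring rule (entropy, margin, model disagreement) without changing this overall loop.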


One of the things I like about Labelbox is the idea of incorporating ML as part of the training data labeling process. Can you explain the concept of model-assisted data labeling and tell us how it contrasts with traditional data labeling methods?

MS: Labelbox brings a software-first approach to data labeling and offers the cost-saving benefits of labeling automation to its customers. With model-assisted labeling, AI teams can pre-label their data with their models and send it to labeling teams (whether internal or external) for correction. AI teams can get very creative in pre-labeling and bootstrapping techniques.  
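The pre-labeling step described here can be sketched as: run an existing model over unlabeled items and emit suggested labels for human reviewers to correct. Everything below is a hypothetical illustration; the record schema, confidence floor, and `toy_model` are assumptions and not Labelbox's actual format or API.

```python
# Hypothetical pre-labeling sketch -- not the Labelbox API or schema.
def make_prelabels(model, items, confidence_floor=0.5):
    prelabels = []
    for item_id, features in items:
        label, confidence = model(features)
        prelabels.append({
            "item_id": item_id,
            # Only pre-fill answers the model is reasonably sure about;
            # low-confidence items reach reviewers as blank tasks instead.
            "suggested_label": label if confidence >= confidence_floor else None,
            "confidence": confidence,
            "needs_review": True,  # every pre-label is human-verified
        })
    return prelabels

# Toy model: classifies a number as "positive" or "negative".
def toy_model(x):
    return ("positive", 0.9) if x > 0 else ("negative", 0.4)

records = make_prelabels(toy_model, [("a", 3), ("b", -1)])
print(records[0]["suggested_label"])  # "positive"
print(records[1]["suggested_label"])  # None (confidence below floor)
```

The cost saving comes from the review step: correcting a mostly-right suggestion is far cheaper than annotating from scratch, which is where the "up to 80% lower" figure mentioned above applies.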

In recent years, generative models and new architectures such as transformers have achieved state-of-the-art results in generating synthetic data that is nearly indistinguishable from real datasets. What role can these types of ML methods play in the future of data labeling automation, and how can they influence a platform like Labelbox?

MS: Humans have spent a lot of time simulating the real world in photo-realistic games with common objects such as cars, buildings, roads, and people. It makes a ton of sense to leverage these engines to generate training data for such use cases in computer vision. 

Broadly, I am excited about generating augmented data with GAN models. AI models are too brittle at the moment, and diversifying the data can help with that. For example, an AI team might train a model on data captured by a specific camera sensor in particular lighting conditions. Whenever the camera sensor or lighting condition changes, the model will have a higher error rate. To mitigate such issues, AI teams are beginning to use synthetically generated data that introduces a broad range of perturbations that reflect these real-world scenarios. 

I am sure there might be some domains where teams can predominantly use synthetic data. However, the AI teams that use Labelbox are building nuanced and sophisticated AI systems where synthetic data generation techniques are, at best, used as a data augmentation step in the background.

I think we humans will continue to supervise AI systems for a while. Don’t underestimate human ingenuity.  

What could be some of the biggest near-term breakthroughs for automated data labeling and machine learning training in general? 

MS: I think the most significant breakthrough is already happening at the macro level. It’s the advent of data-centric programming, popularly known as the Software 2.0 paradigm. I believe this to be one of the biggest paradigm shifts in computing. Commercial tools and workflows are beginning to emerge that will further accelerate the adoption and development of AI systems. At any rate of improvement in AI software and hardware, we are inevitably entering an era where artificial intelligence will be ubiquitous.

In the short term, I am particularly looking forward to the breakthroughs in AI models and hardware that can temporally understand video. There’s something massive about to happen there.  


💥 Miscellaneous – a set of rapid-fire questions  

TensorFlow or PyTorch? 

MS: Both. Our users use both frameworks.

Favorite math paradox?

MS: I have mostly been fascinated by the Fermi paradox.

Is the Turing Test still relevant? Any clever alternatives?

MS: I think the specifics of the Turing Test might be outdated, but the philosophical idea remains relevant. We need better instrumentation to evaluate intelligence. I wouldn’t be shocked if AI passes the Turing Test soon, yet there will be much more to be desired. I believe our understanding of artificial intelligence will be a moving target for a long time, despite our regularly interacting with, and being surrounded by, superhuman narrow AIs.

Any book you would recommend to aspiring data scientists?

MS: Our Mathematical Universe by Max Tegmark
Zen and the Art of Motorcycle Maintenance by Robert Pirsig

Does P equal NP?

MS: I honestly had to search online to understand this question!