Manu Sharma/CEO of Labelbox about the future of data labeling automation
TheSequence interviews ML practitioners to immerse you in the real world of machine learning and artificial intelligence.
There is nothing more inspiring than to learn from practitioners. Getting to know the experience gained by researchers, engineers and entrepreneurs doing real ML work can become a great source of insights and inspiration. Please share these interviews if you find them enriching. No subscription is needed.
Quick bio / Manu Sharma
Tell us a bit about yourself: your background, your current role, and how you got started in machine learning.
Manu Sharma (MS): I am the founder and CEO at Labelbox. Labelbox is a training data platform that solves the data bottleneck for AI teams, enabling them to build and scale production machine learning systems. Before Labelbox, I helped build products that made visual data from drones and satellites useful for a broad range of industries such as agriculture, insurance, construction, defense, and intelligence.
My interest in AI started around 2010 during college, when I discovered neural networks while working with genetic algorithms. Around 2017, while at Planet Labs, I learned about building production AI systems that could detect nearly any object of interest visible to the human eye. Building these AI systems was quite challenging because engineering teams had to build the essential tools and infrastructure to create and manage training data themselves. If we could, we would have gladly used a commercial solution. Compute and algorithms were starting to become incredibly powerful, cheap, and readily available. Recognizing that the most value is in the training data, I co-founded Labelbox in 2018 with my long-time friends and colleagues Brian Rieger and Dan Rasmuson.
ML Work
Data labeling automation is one of the most expensive aspects of any machine learning solution. Tell us how Labelbox addresses this challenge and some of the fundamental capabilities of the platform.
MS: To build AI, you need compute, an algorithm, and training data. Businesses realize that their core asset is no longer computation pipelines or novel algorithms. In the new AI paradigm, a business’s competitive advantage is all about its capability to create and curate training data.
A few years ago, there simply weren’t tools and workflows to iterate with training data. AI teams would generally label all the data they could at every iteration step. With better tools and integrated workflows, AI teams are becoming smarter and more sophisticated. AI teams now seek to understand the AI model’s weak areas and focus only on data with the highest chance to increase the model’s performance per iteration. This is a vital capability of any Training Data Platform, and AI teams starting now should think about this workflow from the start.
Building AI, like software development, is most successful when it can iterate quickly. Currently, even the most successful AI teams take as much as four weeks for a single iteration. Speeding up this process will allow models to improve and move to production much faster.
At Labelbox, our mission is to build tools and workflows that accelerate AI iteration. Our training data platform is built around three pillars: the ability to annotate data, manage people and processes, and iterate on training data. This integrated approach enables AI teams to iterate faster, minimize costs, accelerate their roadmap to production scale, and positively impact the business.
Data labeling seems fundamentally different for different types of datasets such as text, images or videos. Can you give us a few examples of how Labelbox is able to automate data labeling across different types of datasets?
MS: There are many ways to create labeled training data. The most common technique is asking humans to input a decision based on specified information. This technique is broadly applicable to a wide range of deep learning applications. The orchestration workflow to capture human perception is common across media types. The major difference is in the human-computer interaction. Let’s look at software development to understand this better. Software developers use different languages and IDEs to write a series of logical statements. They use GitHub or similar products to collaborate with others and orchestrate downstream processes of testing and deployment. Labelbox is a lot like this for AI. Instead of software developers, it is primarily domain experts who encode knowledge by labeling data with specialized tools. The labeling tool used to classify text conversations is different from the one used to track objects in video. However, the underlying workflow is the same across all these media types.
Many AI teams spend a large part of their operating budget on data labeling.
With Labelbox, AI teams keep labeling costs under control primarily through these two techniques:
Model-assisted labeling: AI teams use their models to generate pre-labels and import them into Labelbox for review. The cost of correcting the pre-labels can be up to 80% lower than the cost of creating the label from scratch. This applies to a broad range of data modalities as long as the tools and workflows are purpose-built and ergonomic.
Active learning: What’s better than faster labeling? No labeling. Choosing the right information to label is paramount. Instead of labeling data blindly, the best AI teams are leveraging an active learning architecture. They carefully select the data with the highest chance to improve the model’s performance at every iteration.
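The selection step in active learning can be sketched in a few lines. This is a minimal illustration, not Labelbox's implementation: it assumes the team already has class probabilities from its own model for each unlabeled item, and it uses prediction entropy as the uncertainty score, which is one common choice among several. The item ids, the probabilities, and the function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, budget):
    """Return the ids of the `budget` items the model is least certain about.

    `predictions` maps a hypothetical item id to the model's class probabilities.
    """
    ranked = sorted(
        predictions,
        key=lambda item_id: entropy(predictions[item_id]),
        reverse=True,  # highest entropy (most uncertain) first
    )
    return ranked[:budget]


# Toy predictions for four unlabeled images (illustrative numbers only).
predictions = {
    "img_001": [0.98, 0.01, 0.01],
    "img_002": [0.40, 0.35, 0.25],
    "img_003": [0.70, 0.20, 0.10],
    "img_004": [0.34, 0.33, 0.33],
}

to_label = select_for_labeling(predictions, budget=2)
print(to_label)
```

With a budget of two, only the two items with the flattest predicted distributions go to the labeling queue; the confident predictions are left unlabeled for this iteration.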
One of the things I like about Labelbox is the idea of incorporating ML as part of the training data labeling process. Can you explain the concept of model-assisted data labeling and tell us how it contrasts with traditional data labeling methods?
MS: Labelbox brings a software-first approach to data labeling and offers the cost-saving benefits of labeling automation to its customers. With model-assisted labeling, AI teams can pre-label their data with their models and send it to labeling teams (whether internal or external) for correction. AI teams can get very creative in pre-labeling and bootstrapping techniques.
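As a rough sketch of what a pre-labeling step might look like, the snippet below routes a model's predictions into two queues: items whose prediction is confident enough to be sent as a pre-label for correction, and items left to be labeled from scratch. Everything here is an assumption made for illustration: the `Prediction` record, the `route_for_review` helper, the 0.5 threshold, and the output dictionaries are hypothetical and are not Labelbox's API or import format.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    item_id: str       # hypothetical identifier of the data item
    label: str         # class predicted by the team's own model
    confidence: float  # model confidence in [0, 1]


def route_for_review(predictions, min_confidence=0.5):
    """Split predictions into pre-labels to correct and items to label from scratch."""
    prelabeled, from_scratch = [], []
    for pred in predictions:
        if pred.confidence >= min_confidence:
            # Confident enough: reviewers only correct the suggested label.
            prelabeled.append({"item_id": pred.item_id, "prelabel": pred.label})
        else:
            # Too uncertain: a wrong suggestion could bias or slow the reviewer.
            from_scratch.append({"item_id": pred.item_id, "prelabel": None})
    return prelabeled, from_scratch


# Illustrative predictions only.
predictions = [
    Prediction("doc_1", "invoice", 0.93),
    Prediction("doc_2", "receipt", 0.31),
    Prediction("doc_3", "invoice", 0.58),
]

prelabeled, from_scratch = route_for_review(predictions)
print(len(prelabeled), "pre-labels to correct;", len(from_scratch), "to label from scratch")
```

The threshold is a design choice a team would tune for itself: a lower value sends more pre-labels to reviewers, at the risk of more corrections per item.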
In recent years, generative models and new architectures such as transformers have achieved state-of-the-art results in generating synthetic data that is nearly indistinguishable from real datasets. What role can these types of ML methods play in the future of data labeling automation, and how can they influence a platform like Labelbox?
MS: Humans have spent a lot of time simulating the real world in photo-realistic games with common objects such as cars, buildings, roads, and people. It makes a ton of sense to leverage these engines to generate training data for such use cases in computer vision.
Broadly, I am excited about generating augmented data with GAN models. AI models are too brittle at the moment, and diversifying the data can help with that. For example, an AI team might train a model on data captured by a specific camera sensor in particular lighting conditions. Whenever the camera sensor or lighting condition changes, the model will have a higher error rate. To mitigate such issues, AI teams are beginning to use synthetically generated data that introduces a broad range of perturbations that reflect these real-world scenarios.
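To make the idea of perturbations concrete, here is a deliberately simple sketch. It is not a GAN: it swaps in plain brightness scaling and Gaussian noise as stand-ins for lighting and sensor changes, applied to a tiny grayscale image stored as nested lists. The function names, parameter ranges, and the toy image are all assumptions chosen for illustration.

```python
import random


def perturb(image, brightness, noise_std, rng):
    """Scale pixel intensities and add Gaussian noise, clamping to [0, 255]."""
    out = []
    for row in image:
        new_row = []
        for px in row:
            value = px * brightness + rng.gauss(0.0, noise_std)
            new_row.append(max(0, min(255, round(value))))
        out.append(new_row)
    return out


def augment(image, n_variants, seed=0):
    """Generate `n_variants` perturbed copies of `image`, reproducibly."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        brightness = rng.uniform(0.5, 1.5)  # darker or brighter lighting
        noise_std = rng.uniform(0.0, 10.0)  # more or less sensor noise
        variants.append(perturb(image, brightness, noise_std, rng))
    return variants


# A 2x3 grayscale "image" with made-up pixel values.
image = [[0, 64, 128], [192, 255, 32]]
variants = augment(image, n_variants=3)
for variant in variants:
    print(variant)
```

Each variant keeps the original label, so the model sees the same content under conditions it did not encounter in the captured data.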
I am sure there are some domains where teams can rely predominantly on synthetic data. However, AI teams that use Labelbox are building nuanced and sophisticated AI systems where synthetic data generation techniques, at best, are used as a data augmentation step in the background.
I think we humans will continue to supervise AI systems for a while. Don’t underestimate human ingenuity.
What could be some of the biggest near-term breakthroughs for automated data labeling and machine learning training in general?
MS: I think the most significant breakthrough is already happening at the macro level. It’s the advent of data-centric programming, popularly known as the Software 2.0 paradigm. I believe this to be one of the biggest paradigm shifts in computing. Commercial tools and workflows are beginning to emerge that will further accelerate the adoption and development of AI systems. At any rate of improvement of AI software and hardware, we are inevitably entering an era where artificial intelligence will be ubiquitous.
In the short term, I am particularly looking forward to the breakthroughs in AI models and hardware that can temporally understand video. There’s something massive about to happen there.
Miscellaneous – a set of rapid-fire questions
TensorFlow or PyTorch?
MS: Both. Our users use both frameworks.
Favorite math paradox?
MS: I have mostly been fascinated by the Fermi paradox.
Is the Turing Test still relevant? Any clever alternatives?
MS: I think the specifics of the Turing Test might be outdated, but the philosophical idea remains relevant. We need better instrumentation to evaluate intelligence. I wouldn’t be shocked if AI passes the Turing Test soon, yet there will still be much more to be desired. I believe our understanding of artificial intelligence will be a moving target for a long time, even as we regularly interact with, and are surrounded by, superhuman narrow AIs.
Any book you would recommend to aspiring data scientists?
Our Mathematical Universe by Max Tegmark
Zen and the Art of Motorcycle Maintenance by Robert Pirsig
Does P equal NP?
MS: I honestly had to search online to understand this question!