🎙 Hyun Kim/CEO of Superb AI on true data labeling automation

There is nothing more inspiring than learning from practitioners. The experience of researchers, engineers and entrepreneurs doing real ML work can be a great source of insight and inspiration. Please share this interview if you find it enriching. No subscription is needed.


👤 Quick bio / Hyun Kim

Tell us a bit about yourself: your background, your current role, and how you got started in machine learning.

Hyun Kim (HK): I am the co-founder and CEO of Superb AI, an ML DataOps platform that helps computer vision teams automate and manage the full data pipeline, from ingestion and labeling to data quality assessment and delivery. I initially studied Biomedical Engineering and Electrical Engineering at Duke but shifted from genetic engineering to robotics and deep learning. I then pursued a PhD in computer science at Duke with a focus on Robotics and Deep Learning but ended up taking leave to further immerse myself in the world of AI R&D at a corporate research lab. It was during this time that I started to experience the bottlenecks and obstacles that a lot of companies still face to this day: data labeling and management were very manual, and the solutions that were available were nowhere near sufficient.

🛠 ML Work  

Data labeling automation is one of the most expensive aspects of any machine learning solution. Tell us how Superb AI addresses this challenge and some of the fundamental capabilities of the platform. 

HK: I’d like to actually modify the first part of the sentence: true data labeling automation aims to relieve most of the financial and time burdens associated with manual and even AI-assisted data preparation workflows. We, at Superb AI, have been able to incorporate cutting-edge techniques like Bayesian Deep Learning, few-shot learning, transfer learning and AutoML into our data labeling and QA products so that teams can build massively efficient data pipelines. We couple this automation with agile workflow tools so that teams can get into the driver’s seat when necessary.

We recently announced our advanced transfer-learning Auto-Label on our blog.

Automated data labeling seems fundamentally different for different types of datasets such as text, images or videos. Could you give us some examples of these differences and what techniques are typically used to automate labeling across different types of datasets? 

HK: Computer vision is pretty universal, so there hasn’t been a great need to generalize our AI across, let’s say, multiple languages. However, since video data adds a time axis, we have further developed our customizable automation techniques to accurately track objects across many frames. We think our approach to video labeling is much more intelligent than guesstimation techniques like linear interpolation between two frames, which is currently commonplace in the industry.
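For context, the linear-interpolation baseline HK contrasts against can be sketched in a few lines. This is only an illustration of that baseline, not Superb AI's tracking method; the function name `interpolate_boxes` and the `(x_min, y_min, x_max, y_max)` box format are assumptions made for the example.

```python
from typing import List, Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def interpolate_boxes(start: Box, end: Box, num_frames: int) -> List[Box]:
    """Linearly interpolate a box from the first keyframe to the last.

    Returns one box per frame, including both keyframes.
    """
    if num_frames < 2:
        raise ValueError("num_frames must include both keyframes (>= 2)")
    boxes = []
    for i in range(num_frames):
        t = i / (num_frames - 1)  # 0.0 at the first keyframe, 1.0 at the last
        boxes.append(tuple(s + t * (e - s) for s, e in zip(start, end)))
    return boxes


# Two hand-labeled keyframes, with three frames filled in between them.
boxes = interpolate_boxes((0, 0, 10, 10), (40, 20, 50, 30), num_frames=5)
print(boxes[2])  # the middle frame sits exactly halfway between the keyframes
```

The sketch makes the limitation visible: every in-between box assumes the object moves in a straight line at constant speed, so any acceleration, turn or occlusion between the two keyframes is missed.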

Data labeling automation is a big undertaking, but it is almost equally important to estimate the effectiveness of labels in the training process. What techniques are typically used to streamline the data labeling lifecycle in machine learning models?

HK: It’s safe to assume that manually labeling large training datasets is not the most efficient way to approach data preparation. And we also agree that automation for data labeling can come in many different forms. We’ve seen AI-assisted tools and model-assisted workflows that claim to be highly efficient when compared to fully manual processes but do not deliver all that they promise. With our suite of automation tools, we are able to automate most of the data labeling process, and we have coupled that technology with Uncertainty Estimation. Our Uncertainty Estimation uses a combination of Bayesian Deep Learning and Monte Carlo sampling to estimate how uncertain the output of our AI is, which can then be used to implement rapid active learning workflows. A simple example of how teams can utilize the full spectrum of our automation products would be the following: label a small ground truth dataset, use this ground truth dataset to train our transfer learning auto-label, apply auto-label to a large dataset, and use Uncertainty Estimation to quickly identify hard examples for audit. Teams can then take this newly tuned dataset to retrain the auto-label, and the cycle repeats. This is a much more efficient approach to data preparation than manual and even model-assisted labeling (MAL). As a side note, we’ve actually had many clients tell us that model-assisted labeling is less efficient than manually labeling from scratch, for many reasons.
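The "identify hard examples for audit" step in the workflow above can be sketched as follows. This is a generic illustration under stated assumptions, not Superb AI's implementation: it assumes class probabilities are already available from several stochastic forward passes (for example, Monte Carlo dropout), uses predictive entropy as the uncertainty score, and the names `predictive_entropy` and `select_for_audit` as well as the probabilities are made up for the example.

```python
import numpy as np


def predictive_entropy(mc_probs: np.ndarray) -> np.ndarray:
    """Entropy of the averaged prediction.

    mc_probs has shape (T, N, C): T stochastic forward passes,
    N auto-labeled items, C classes. Returns an array of shape (N,).
    """
    mean_probs = mc_probs.mean(axis=0)
    return -(mean_probs * np.log(mean_probs + 1e-12)).sum(axis=1)


def select_for_audit(mc_probs: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k most uncertain items, most uncertain first."""
    scores = predictive_entropy(mc_probs)
    return np.argsort(scores)[::-1][:k]


# Hypothetical probabilities: 3 forward passes, 3 items, 2 classes.
mc_probs = np.array([
    [[0.99, 0.01], [0.9, 0.1], [0.7, 0.3]],
    [[0.99, 0.01], [0.1, 0.9], [0.7, 0.3]],
    [[0.99, 0.01], [0.5, 0.5], [0.7, 0.3]],
])

# Items whose passes disagree or sit near 50/50 go to human reviewers first.
audit_queue = select_for_audit(mc_probs, k=2)
print(audit_queue.tolist())
```

In a loop like the one HK describes, the audited items would be corrected, merged back into the ground truth set, and used to retrain the auto-labeler before the next pass.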

Image credit: Superb AI

One of the biggest challenges of automated data labeling techniques is how to handle uncertainty in the labeling process. What are some recent methods that should be considered to address uncertainty in the training of ML models? 

HK: Measuring the “confidence” of model output is one popular method to assess label quality. However, if the model is overfitted to the given training data, confidence levels can be erroneously high. Therefore, confidence levels cannot be used to measure how much we can “trust” auto-labeled annotations. On the other hand, calculating the uncertainty of the AI is a much more grounded approach because this method statistically measures how much we can trust a model output. Using this, we can obtain an uncertainty measure that is proportional to the probability of model prediction error regardless of model confidence scores and model overfitting. As it relates to our Uncertainty Estimation, we ended up developing a patented hybrid approach that uses a combination of Monte Carlo methods such as BALD (Bayesian Active Learning by Disagreement) and Uncertainty Distribution Modeling methods such as EDL (Evidential Deep Learning) to estimate AI uncertainty.
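The gap between confidence and uncertainty can be illustrated with a BALD-style score, computed as the entropy of the averaged prediction minus the average entropy of the individual predictions. This is a minimal sketch of the generic BALD estimate from sampled forward passes, not Superb AI's patented hybrid method; the function names and probabilities are hypothetical.

```python
import numpy as np


def entropy(p: np.ndarray, axis: int = -1) -> np.ndarray:
    """Shannon entropy (in nats) along the class axis."""
    return -(p * np.log(p + 1e-12)).sum(axis=axis)


def bald_score(mc_probs: np.ndarray) -> np.ndarray:
    """BALD estimate from sampled predictions.

    mc_probs has shape (T, N, C): T stochastic forward passes,
    N items, C classes. Returns an array of shape (N,).
    """
    total = entropy(mc_probs.mean(axis=0))     # entropy of the mean prediction
    expected = entropy(mc_probs).mean(axis=0)  # mean entropy of each pass
    return total - expected


# Hypothetical probabilities: 4 passes, 2 items, 2 classes.
# Item 0: every pass confidently agrees. Item 1: every pass is just as
# confident, but the passes disagree on the class.
mc_probs = np.array([
    [[0.98, 0.02], [0.98, 0.02]],
    [[0.97, 0.03], [0.03, 0.97]],
    [[0.99, 0.01], [0.98, 0.02]],
    [[0.98, 0.02], [0.02, 0.98]],
])

confidence = mc_probs.max(axis=2).mean(axis=0)  # average top-class probability
scores = bald_score(mc_probs)
print(confidence, scores)
```

Both items show an average confidence near 0.98, yet only the second has a high BALD score, which is the kind of case a confidence threshold alone would let through.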

What could be some of the biggest near-term breakthroughs for automated data labeling and machine learning training in general?

HK: We are looking into self-supervised learning as a way to reduce the number of labeled images our models need to be trained on during the initial phase of the data labeling process. Currently, a small batch of data needs to be manually labeled to kick-start our models, which will learn from this small batch of data using techniques such as few-shot learning, transfer learning and AutoML. However, if we can leverage self-supervision in some way, we may be able to reduce this manual input even further.

💥 Miscellaneous – a set of rapid-fire questions  

Is the Turing Test still relevant? Any clever alternatives?

HK: The Turing Test was a simple and elegant way for us to conceptualize AI and build a human-like conversational AI. And I think we’re close to passing the Turing Test with state-of-the-art work like GPT-3. But practically, AGI should be able to do much more than just trick a human evaluator -- it should be able to do everything.

I had to do a bit of research, but there are clever alternatives like the Wozniak Test where a robot makes coffee in a stranger’s home. It’s a funny test, but a true AGI should be able to pass a mixture of all these alternative tests!   

Favorite math paradox? 

HK: Simpson’s Paradox. I remember being baffled when I first learned about it in my statistics class as a high school student. It also reminds me to be very careful and unbiased when interpreting data. 

Any book you would recommend to aspiring data scientists?  

HK: Assuming the person already has the math and statistics background, I’d recommend Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville. I’d also recommend PRML (Pattern Recognition and Machine Learning) by Christopher Bishop, but I’ve seen people (including myself, to be honest!) find Deep Learning focused books more interesting than those on classical machine learning. 

Does P equal NP?

HK: Probably not. But I think we’re getting better at approximating NP problems with deep learning, so how’s P ≈ NP? :)