🎙 Paroma Varma/Snorkel on programmatic approaches to data labeling
and how to bridge the gap between labeling and AI application development
Learning from the experience of researchers, engineers, and entrepreneurs doing real ML work is an excellent source of insight and inspiration. Share this interview if you find it enriching. No subscription is needed.
👤 Quick bio / Paroma Varma
Tell us a bit about yourself: your background, your current role, and how you got started in machine learning.
Paroma Varma (PV): I am a co-founder of Snorkel AI, which started as a research project in the Stanford AI Lab in 2015, where we set out to explore a higher-level interface to machine learning through training data. We have developed the first data-centric platform powered by programmatic labeling to build AI applications. Before founding Snorkel AI, I got my Ph.D. from Stanford University, where my research revolved around weak supervision, or leveraging high-level, noisier signal sources to label training data efficiently.
My first exposure to ML was actually as an undergrad at Berkeley, working in a neuroscience lab! And even during my Ph.D., I was able to work with some amazing collaborators across various areas like radiology. They helped me understand the bottlenecks in applying ML to real-world tasks and inspired my research directions.
🛠 ML Work
Can you describe the problem that the Snorkel Flow platform tries to solve and why it is relevant to machine learning technologists?
PV: Snorkel Flow is the first data-centric platform for building enterprise AI applications. Based on research from Stanford University, it uses a programmatic approach to building and managing training datasets, making it possible to take advantage of state-of-the-art AI without spending person-months labeling data. By capturing the end-to-end AI application development cycle in a single platform, Snorkel Flow allows data scientists, ML engineers, and subject matter experts to collaborate, analyze, and iterate on each part of the ML pipeline in a systematic manner. We’ve seen Fortune 500 enterprises ranging from top US banks to large telecom, biotech, and insurance companies, as well as several governmental agencies, use Snorkel Flow for problems that weren’t practical to solve with AI before. They achieve state-of-the-art accuracy, reduce development time from months to days, and realize significant cost savings, with an adaptable, auditable, and privacy-compliant solution.
Automatic data labeling seems to be the sweet spot of Snorkel Flow. What techniques does Snorkel Flow leverage to streamline the labeling of training datasets?
PV: Instead of relying on manually labeled training datasets, Snorkel Flow relies on weak supervision: using noisy, high-level forms of supervision to inject domain knowledge and label training data programmatically. Users write “labeling functions” (heuristics and other interpretable resources that noisily label training data), which are then denoised using systems and algorithms we developed over the years at Stanford. This combination of using rules as inputs to label training data, and then training powerful ML models that generalize beyond those rules, lets us take advantage of the best of both worlds.
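To make the idea concrete, here is a minimal sketch of what programmatic labeling looks like in the open-source Snorkel library that grew out of the same Stanford research (Snorkel Flow itself is a commercial platform, so treat this as an illustration of the approach, not of its API). The spam-classification task, the df_train DataFrame, and its text column are assumptions made up for the example:

```python
from snorkel.labeling import PandasLFApplier, labeling_function
from snorkel.labeling.model import LabelModel

# Task-specific label values; ABSTAIN lets a function stay silent.
ABSTAIN, HAM, SPAM = -1, 0, 1

@labeling_function()
def lf_contains_link(x):
    # Noisy heuristic: messages containing URLs are often spam.
    return SPAM if "http" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_short_message(x):
    # Another weak signal: very short messages tend to be legitimate.
    return HAM if len(x.text.split()) < 5 else ABSTAIN

lfs = [lf_contains_link, lf_short_message]

# Apply the labeling functions to an unlabeled pandas DataFrame
# (df_train is assumed to exist), producing a label matrix with one
# vote per (example, labeling function) pair.
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df_train)

# The LabelModel denoises the overlapping, conflicting votes and
# combines them into one probabilistic label per example.
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train, n_epochs=500, seed=123)
probs_train = label_model.predict_proba(L_train)
```

The probabilistic labels in probs_train are what a downstream model trains on, which is how the final model can generalize beyond the handwritten rules.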
There seems to be a push in ML research towards models that use less labeled data for training. How do you see the role of methods such as semi-supervised learning or self-supervised learning in the near future of ML applications?
PV: Many lines of research in machine learning are the result of trying to reduce the amount of labeled training data required. Methods like semi-supervised learning and transfer learning address the training data bottleneck without asking subject matter experts to encode their knowledge as part of building ML models. Depending on the task and/or dataset at hand, these methods can work in a complementary manner with weak supervision, capturing subject matter experts’ knowledge efficiently while also exploiting the patterns in existing labeled data and pretrained models.
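As a sketch of that complementarity, continuing with the assumed variables from the snippet above (df_train, L_train, probs_train): any discriminative end model can be trained on the weak labels. The TF-IDF-plus-logistic-regression model here is a stand-in; in practice it could just as well be a pretrained transformer fine-tuned on the same probabilistic labels, which is where transfer learning meets weak supervision:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from snorkel.labeling import filter_unlabeled_dataframe

# Drop examples on which every labeling function abstained,
# since the label model has no signal for them.
df_filtered, probs_filtered = filter_unlabeled_dataframe(
    X=df_train, y=probs_train, L=L_train
)

# Train a simple end model on hard labels derived from the
# probabilistic ones; it can generalize beyond the original rules.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df_filtered.text)
clf = LogisticRegression()
clf.fit(X, probs_filtered.argmax(axis=1))
```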
Incumbents such as Microsoft, Google, and AWS have incorporated data labeling capabilities into their ML platforms. Do you think the data labeling problem is big enough to create standalone companies in the machine learning space, or will it become a feature of broader platforms?
PV: An MIT review reported that only one in ten AI projects generates significant financial benefits. This is partly because 80% of AI application development time is spent on data preparation, management, and labeling. In fact, in a recent poll, we discovered that 80% of AI practitioners believe one out of two projects is blocked by a lack of training data. This means that iterating on training data, rather than just on models, is more important than ever to scale AI and make projects practical. However, manual labeling delivered by large clouds or outsourced services such as Scale, Appen, and Hive does not scale, especially in enterprise settings where data is complex and requires experts to label, is private, or changes rapidly in production, requiring constant relabeling. Snorkel Flow is the only platform that bridges the gap between labeling and AI application development using programmatic approaches.
What could be some of the biggest near-term breakthroughs for automated data labeling and machine learning training in general?
PV: It’s exciting to see how we can continue pushing the boundaries of efficiently capturing domain knowledge from subject matter experts by relying on even higher-level forms of supervision, such as natural language explanations. At a higher level, striking the right balance between automating various parts of the ML pipeline and still capturing the domain knowledge required for various tasks is really interesting, and I’m looking forward to building more of that into Snorkel Flow!
💥 Miscellaneous – a set of rapid-fire questions
TensorFlow or PyTorch?
I remember manually coding up backprop at one point, so I’d have to say both are quite impactful in getting machine learning to the stage it is today :)
Favorite math paradox?
Not a paradox, but the Monty Hall problem is one of the first problems I remember learning about in school.
What book would you recommend to an aspiring ML engineer?
Not a book, but I really like CS 168 at Stanford, which covers concepts useful for data scientists in a very intuitive manner.
Does P equal NP?
Hopefully not, so our cryptosystems don’t break down!