🎙SuperAnnotate's CTO Vahan Petrosyan on the present and future of ML data labeling
It’s so inspiring to learn from practitioners. Getting to know the experience gained by researchers, engineers, and entrepreneurs doing real ML work is an excellent source of insight and inspiration. Share this interview if you find it enriching. No subscription is needed.
👤 Quick bio / Vahan Petrosyan
Tell us a bit about yourself: your background, your current role, and how you got started in machine learning.
Vahan Petrosyan (VP): I am a co-founder and the CTO of SuperAnnotate, which grew out of my Ph.D. research at KTH Royal Institute of Technology in 2018. During the early stages of my research, I was thinking of applying my segmentation algorithm to image editing, but after attending the Conference on Computer Vision and Pattern Recognition (CVPR) in 2018, I realized there was a much bigger opportunity to apply my research to data labeling. Once my brother, then a Ph.D. student in biomedical imaging in Switzerland, and I saw the opportunity, we both dropped out to start the company.
Before my Ph.D. studies in ML, I studied various fields of mathematics and statistics as an undergraduate and graduate student. In particular, I was interested in financial and actuarial mathematics, quantitative economics, statistics, and data visualization. My path to ML started ten years ago when I took an ML course with Prof. Adele Cutler, one of the co-creators of the legendary Random Forests algorithm.
🛠 ML Work
SuperAnnotate is focusing on one of the most important but also crowded areas of ML. Could you tell us more about the vision and capabilities of the platform?
VP: Creating annotations is an important part of our business. While those annotations are done manually at the beginning, automation and the right data selection (namely, active learning) are areas our customers are really excited about when using our platform. We are building the most complete platform, one that can not only efficiently create the ground truth for your unstructured data but also version and manage the created data and annotations. The latter becomes far more important for mature AI companies, since ground truth data is to AI engineers what code is to software developers. Therefore, just as GitHub became the bread and butter of software development, we are becoming the GitHub for ML engineers, where versioning and managing ground truth will be an integral part of any AI development.
SuperAnnotate automates the labeling of different types of datasets, such as text, video, or audio. What are some of the fundamental differences when it comes to labeling techniques?
VP: Active learning is extremely common no matter the data type you are annotating. Unfortunately, it is not yet easy to apply active learning to complex AI tasks, such as multi-class instance segmentation or lidar segmentation.
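To make the idea concrete, here is a minimal sketch of one common active learning strategy, least-confidence uncertainty sampling: the samples the model is least sure about are routed to human annotators first. The probabilities and the `select_for_labeling` helper are illustrative only, not part of SuperAnnotate's platform.

```python
import numpy as np

def select_for_labeling(probs, k):
    """Return indices of the k most uncertain samples (least-confidence)."""
    confidence = probs.max(axis=1)        # top-class probability per sample
    return np.argsort(confidence)[:k]     # lowest confidence = most uncertain

# Hypothetical softmax outputs for 5 unlabeled images over 3 classes
probs = np.array([
    [0.90, 0.05, 0.05],   # very confident -> low labeling value
    [0.40, 0.35, 0.25],   # uncertain
    [0.60, 0.30, 0.10],
    [0.34, 0.33, 0.33],   # nearly uniform -> most uncertain
    [0.80, 0.10, 0.10],
])
print(select_for_labeling(probs, 2))  # -> [3 1]
```

In practice the same loop runs repeatedly: label the selected batch, retrain, and re-score the remaining pool.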
The differences in labeling techniques can be really big across data types. In some complex video annotation cases, creating the annotation might be less time-consuming, and experts could spend more time finding and fixing errors (i.e., quality assurance). This is generally not the case with image annotation, where you have the entire annotated image in front of you. For text annotation, people more often have multiple annotators perform the same task independently rather than doing a quality check on already annotated documents.
Today, data labeling is mostly used for supervised learning models and, somewhat ironically, techniques such as self-supervised or semi-supervised learning are showing promise for automating the labeling of datasets. How do you see the role of these new ML techniques in next-generation data labeling platforms?
VP: Self-supervised and semi-supervised techniques are a great way to increase labeling quality. I am sure the research community will push the algorithms forward so that models learn faster with good data rather than big data. Such techniques should be integrated into, or be part of, next-generation labeling platforms. In my opinion, supporting such techniques and tightly integrating them with the right management and versioning systems will become one of the key components of any successful AI project.
Generative models are another important technique used in the creation of labeled datasets. What are best practices to balance real and synthetic data in training datasets?
VP: What we generally see is that synthetic data can improve model accuracy on certain computer vision tasks. Mixing with a simple 80-20 rule can be a really good place to start. However, generating complex scenes is often a lot harder and requires extremely detailed, pixel-perfect annotations, which can be very time-consuming. Note that even when you use simulated data, you still need the right tools to subset, manage, and version it. Therefore, no matter how you get your ground truth, efficient data and annotation management and versioning are critical for any successful AI project.
The data labeling space seems to be getting crowded, with dozens of startups and incumbents entering it. Do you think data labeling platforms can remain successful standalone companies in the long run, or will they become features of broader ML platforms like AWS SageMaker or Azure ML?
VP: Simple annotation editors, including open-source ones, can be found for any type of data. While many companies provide simple editors, only a few help rapidly scaling startups and enterprise-grade clients build sophisticated ML pipelines. Therefore, I think most companies will not survive once the golden venture times are over. However, there will be a few GitHub-scale platform solutions that will help ML engineers take care of their precious ground truth, the backbone of AI.
As for the broader ML platforms you mentioned, the reality is they currently don't do a great job of providing high-quality annotations or the right software to manage the resulting datasets. We frequently see clients who are woefully unsatisfied with incumbent solutions turn to us for much higher-quality annotations, 5-10x faster time to model, and advanced Data and ML Ops. So I'm sure there is a strong market need and demand for companies like SuperAnnotate to be successful.
💥 Miscellaneous – a set of rapid-fire questions
Favorite math paradox?
As a statistician at heart, Simpson's paradox comes first as my favorite.
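For readers unfamiliar with it, the textbook illustration is the kidney-stone treatment data (Charig et al., 1986), where one treatment has the higher success rate within every subgroup yet the lower rate overall. A quick check of those well-known numbers:

```python
# (successes, trials) per treatment and stone-size group
groups = {
    "small stones": {"A": (81, 87),   "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, trials):
    return successes / trials

# Treatment A wins within EVERY subgroup...
for name, d in groups.items():
    assert rate(*d["A"]) > rate(*d["B"]), name

# ...yet pooling the groups flips the conclusion: B looks better overall.
totals = {
    t: tuple(sum(d[t][i] for d in groups.values()) for i in (0, 1))
    for t in ("A", "B")
}
print(round(rate(*totals["A"]), 3), round(rate(*totals["B"]), 3))  # 0.78 0.826
```

The reversal happens because group sizes are unbalanced: A was mostly applied to the harder large-stone cases.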
What book would you recommend to an aspiring ML engineer?
Probably an old favorite: The Elements of Statistical Learning.
Is the Turing Test still relevant? Any clever alternatives?
GPT-3 comes to mind when thinking about alternatives, but we are still far from passing the test. So yeah, the short answer is: yes!
Does P equal NP?
Hopefully yes, in our lifetime.