🤹♂️ Edge#92: Cogito Brings Human-in-the-Loop Data Annotation to Enterprises
In this issue:
we explain what human-in-the-loop (HIL) data labeling is;
we overview model-assisted labeling (MAL);
we explore Cogito Data Annotation Solutions.
💥 What’s New in AI: Cogito Brings Human-in-the-Loop Data Annotation to Enterprises
There are many tasks in which machine learning has surpassed standalone human performance and, yet, even in those tasks the combination of humans and machines regularly outperforms machines alone. Take chess for example. For decades, top chess engines have achieved levels of performance that are out of the reach of the best human players in the world. However, even the top chess engines can improve their performance when combined with human players. The reason for this improvement is because most chess engines play fundamentally tactical positions but still struggle with highly creative methods, such as sacrificing pieces in order to obtain a better position or to pursue moves that deviate from traditional theoretical lines. Former chess world champion Garry Kasparov explained the combination of humans and machines in chess in his recent book Deep Thinking. Chess is just one example in which injecting human feedback can augment the performance of machine learning tasks. Another one of those examples that is closer to every machine learning project is data annotation.
Data labeling and annotation is an intrinsic element of every machine learning project. In the current market dominated by data-hungry supervised learning methods, the systematic creation of labeled datasets is fundamental for the training and maintenance of machine learning models. Most data science teams debate among themselves between traditional manual data labeling processes or more modern automated techniques. However, there is a third approach that provides a nice middle ground.
Human-in-the-Loop Data Labeling
When building a training pipeline for machine learning models, the typical choices range from manual to automated approaches. Manual data labeling techniques are still dominant in the current machine learning ecosystem and they are simpler and more intuitive to implement. Furthermore, there are many scenarios in which manual data labeling processes can incorporate human domain knowledge or subjective interpretations, which escape automated techniques. However, manual data labeling processes are extremely expensive and time-consuming to scale across a large number of machine learning models.
An alternative to manual data labeling processes is the new generation of automated data annotation platforms which use programmable approaches to streamline the creation of labeled datasets used in the training of machine learning models. Obviously, the automation of data annotation processes is able to systematize the creation of large training datasets. However, these methods often miss edge cases that are essential to improve the performance of machine learning models. Take the example of an image of an ice hockey game, while it might be easy for an automated data labeling script to label individual players and objects, it might miss the fact that there was a recent shot at a goal attempt or that two players collided. This is precisely the type of rich annotation that can be captured by manual processes.
Human-in-the-Loop (HIL) data annotation looks for a middle ground between the purely automated and manual data labeling approaches. As its name indicates, HIL data annotation solutions look to leverage automated data labeling technologies while also complementing them with human knowledge to capture more complex knowledge representation. Among the companies following this school of thought, Cogito stands out given its scale and traction in mission-critical applications.
Cogito Data Annotation Solutions
Cogito is a workforce solution provider for data labeling and annotations based on HIL techniques. By HIL, you should think about Cogito as a highly specialized workforce that leverages data annotation tools to streamline the creation of training datasets. In that sense, Cogito can be classified as a manual data labeling solution as it relies on a specialized workforce to accomplish the different tasks. However, instead of relying just on human workers like crowdsourced data labeling solutions, Cogito focuses on an approach known as model-assisted labeling (MAL). Conceptually, MAL leverages a human workforce to label a small but relevant portion of the training dataset which is used to train a specific model. From there, the Cogito solution will use the outputs of a model to train another model that can refine the labeling of a larger dataset without requiring major human intervention. From a data scientist’s perspective, Cogito is a nice complement to automated data labeling platforms. In fact, its services are integrated into the workflows of several automated data labeling platforms. Beyond that capability, we should think about Cogito as a solution provider that excels in scenarios that are more dependent on human domain or subjective knowledge.
One of the key capabilities of the Cogito workflow solutions is the consistent experience to label highly heterogeneous datasets. The current capabilities of Cogito workflow solutions support the following dataset types:
Image: Image labeling capabilities include areas such as object detection and tracking, semantic segmentation, image classification, image moderation, image-to-text transcription.
Text: Text labeling capabilities include single or multi-label classification, named entity recognition, relationship extraction.
Video: Video labeling capabilities include areas such as frame-by-frame object detection & tracking, video classification, event-based timestamp labeling, live video stream monitoring, video moderation.
3D Points Clouds: Object Detection & Tracking, Semantic Segmentation, Photogrammetric Point Clouds
Audio: Audio labeling capabilities include areas such as speech translation, entity relationship, tone annotation.
This type of broad data-set coverage is one of the key differentiators of the Cogito workflow solutions. For the most part, automated data labeling platforms perform well in structured, text, and some basic image datasets. However, complex video or image labeling techniques are still more difficult to implement. Let’s try to put that into practical perspective, considering a computer vision task that includes images and videos for training a self-driving vehicle model. The Cogito workflow solutions support different labeling techniques, such as classifying cars in video frames by using a bounding-box technique to draw a rectangle around them. However, our self-driving car model might require us to identify the complete 3D shape of a given object. Cogito enables that by using the 3D cuboid method that can draw a cubical structure around a target object. Another useful capability is to identify traffic signs in streets or highways. Cogito supports this by using a polygon labeling method around the traffic signals.
Image credit: Cogito
Now imagine an even more complex scenario in which our self-driving car model needs to identify very specific elements, such as a broken taillight, and associate it with a possible accident. Given that Cogito relies on HIL methodologies, it can accommodate some of those very targeted scenarios that might be hard to achieve with automated data labeling platforms. Another robust capability of Cogito that highlights the benefits of HIL methods is that it can extrapolate semantic information from an approaching vehicle as a sign to stop.
Image credit: Cogito
Although image and video labeling are two of the areas in which Cogito excels, the data annotation solutions also include differentiated capabilities in text and speech annotation. One of my favorites is relationship extraction just because that is really difficult to do with automated methods. Imagine having to build a dataset that details relationships between terms in an audio or text file. This process is cumbersome enough that it often requires machine learning by itself to accomplish the task, which creates a circular problem. Cogito’s HIL methods can certainly be useful in this area.
Image credit: Cogito
Data labeling is typically considered as batch-type, long-running processes that are not tied directly to the real-time behavior of a machine learning model. However, the reality of machine learning production systems is quite different. Many solutions require new labeled information produced in relatively high throughput. To address this challenge, Cogito introduced a capability known as real-time annotation workflows which can receive a stream of data and produce labels in a near real-time manner. This capability can be really useful in scenarios in domains such as transportation or agriculture that interact with real-time data streams.
One of the often-overlooked capabilities of Cogito is domain specialization. For the most part, we tend to think about data labeling as a horizontal task that doesn’t require any specific domain expertise, but is that really the case? Annotating medical images is fundamentally different from annotating pictures of agriculture crops. Even within a specific domain such as healthcare, there are major differences between the annotations in datasets used by models in different specialties, such as oncology or dermatology. Cogito specializes its workflow solutions by different domains, trying to maximize the expertise for a given data labeling task. Among those domains, Cogito has developed an almost unique specialty in annotating medical datasets such as MRIs, X-Rays, CT Scans, and many others. This level of domain specialization is partially possible because Cogito does not use a crowdsourced workforce and, instead, relies on full-time employees that are regularly trained in domain-specific annotations.
One of the benefits of Cogito’s HIL methods is that they can be easily integrated into almost any machine learning pipeline and platform, as they don’t have hard technology requirements. This makes it easy for data science teams to incorporate such solutions without disrupting their normal cycle to train machine learning models. HIL data labeling methods are not the best fit for every machine learning training scenario. They are certainly harder and more expensive to scale. However, in general, they have proven to be a nice and, for now, a necessary complement to automated data labeling methods. Cogito is certainly one of the data annotation solutions that provides an easy entry point for data science teams dabbling in the HIL data labeling space.
Further Reading: More details about Cogito can be found in the company’s blog: https://www.cogitotech.com/blog/