🏷 🔥Training Data Labeling is One of the Hottest Markets in Machine Learning

Feb 07, 2021

📝 Editorial

Building high-quality labeled training datasets is one of the biggest roadblocks in machine learning projects. Labeling training data is not only resource-intensive but also hard to automate at scale. It is easy to underestimate the complexity of assembling large-scale training datasets if we think of it as just a data collection exercise. In reality, training datasets have their own lifecycle, which includes filtering, searching, and judging how effective a dataset is when applied to specific models. Not surprisingly, data labeling is carving out its own place as one of the most important markets in machine learning.

In recent years, we have seen the emergence of well-funded startups trying to automate data labeling for training datasets. Snorkel Flow, Labelbox, Kili Technology, and Keymakr are some prominent examples in this market. Some of these platforms offer generic solutions, while others specialize in domains such as computer vision or language understanding. Just this week, Superb AI announced a sizable new funding round for its data-labeling platform. As with many other segments of the market, the bigger question shadowing the data-labeling space is whether it is big enough to produce standalone companies or whether these products will become features of larger machine learning platforms. For now, the excitement and innovation in the data-labeling space are bringing a lot of energy to the machine learning community.

🔺🔻TheSequence Scope – our Sunday edition with the industry’s development overview – is free. To receive high-quality content about the most relevant developments in the ML world every Tuesday and Thursday, please subscribe to TheSequence Edge 🔺🔻

🗓 Next week in TheSequence Edge:

Edge#61: the concept of AutoML and its different disciplines; the original AutoML paper; and Amazon AutoGluon, which brings deep learning to AutoML.

Edge#62: a view into the data discovery and management architectures implemented at LinkedIn, Uber, Lyft, Airbnb and Netflix.

Now, let’s review the most important developments in the AI industry this week.

🔎 ML Research

English-Language Alexa Learns to Speak Spanish Using the Same Voice

Researchers from Amazon published a blog post detailing a neural text-to-speech model that teaches Alexa to speak Spanish and English fluently using the same voice ->read more on Amazon Research blog

Reasoning Over Tabular Data

Google Research published a paper outlining a model that is able to learn relationships between records in tabular structures and express them in natural language ->read more on Google Research blog

Visual Model-Based Reinforcement Learning

Google Research published a paper discussing different design trade-offs in model-based reinforcement learning methods applied to image analysis ->read more on Google Research blog

🤖 Cool AI Tech Releases

Microsoft Viva

Microsoft unveiled Viva, a new platform that uses different ML methods to increase employee productivity across different aspects, such as communications and learning ->read more in this blog post from the Viva team

Pretrained Vision Models

Microsoft Research open-sourced Microsoft Vision Model ResNet-50, a pretrained vision model that achieves state-of-the-art performance across different computer vision tasks ->read more on Microsoft Research blog

💬 Useful Tweet

See the thread by François Chollet on how NOT to do ML open-source libraries:

François Chollet @fchollet

How not to do ML open-source library development: 1. Clone existing packages, almost feature-by-feature. Possibly copy source code. 2. Arrogantly claim your clone is better than the original. Spread FUD about other libraries. 3. Claim others are copying you (projection much?)

Interested in posting a job listing or sponsoring TheSequence? Let us know by replying to this email.

💸 Money in AI

For ML & AI

🦄 Unified data platform Databricks closed a $1 billion funding round. The company is now valued at $28 billion. Combining the best of data warehouses and data lakes into a lakehouse architecture, Databricks created one platform for collaborating on data, analytics, and AI workloads.
ML experiment tracking startup Weights & Biases raised $45 million in Series B funding. Weights & Biases is one of the top platforms in the market that enables the hyperparameter optimization of ML models. W&B is designed to keep track of ML experiments, evaluate results, and optimize hyperparameter configurations. The platform provides a toolset that works consistently across different ML frameworks. We covered them in detail in Edge#1.
AI data platform Superb AI raised $9.3 million in a financing round. Superb AI develops a platform for AI-enabled training data, claiming to be able to train AI models using incredibly small datasets without requiring human assistance in the workflow.

Automation and optimization

🦄 Robotic process automation startup UiPath, which automates monotonous, repetitive chores traditionally performed by human workers, raised $750 million in Series F funding at a post-money valuation of $35 billion.
Intelligent automation startup Slync.io raised $60 million in its Series B funding round. Slync connects disparate shipping and logistics systems, ingests structured and unstructured datasets, orchestrates teams, and automates processes.
Computing workload optimization startup Granulate raised $30 million in Series B funding. Granulate helps organizations optimize infrastructure performance and costs through AI-driven dynamic and continuous OS-level adaptations.
