🏷🥊 The Fight Against Labeled Dataset Dependencies

The Scope covers the most relevant ML papers, real-world ML use cases, cool tech releases, and $ in AI. Weekly

Sep 12, 2021

📝 Editorial

Supervised learning has dominated machine learning (ML) for the last few decades. The predominance of supervised models in mainstream ML applications seems logical, considering that they are easier to model, interpret, and optimize than unsupervised alternatives. However, supervised ML models have a major limitation: they depend on large, labeled datasets that are very expensive to build and maintain. This dependency is not only technological but also economic, as it has made ML research a privilege of large organizations with access to highly curated datasets. On top of that, supervised learning paradigms are not particularly good at generalizing across multiple tasks. Steadily decreasing the level of supervision in ML models is one of the paramount challenges for the next decade of ML. The ML industry recognizes this and is making significant inroads.

The last few years have seen an explosion of research and implementation efforts to reduce the dependency on labeled datasets. From pretrained models to semi- and self-supervised learning paradigms, we regularly see lightly supervised models match or outperform supervised alternatives across domains such as computer vision, language, and speech. Just this week, Facebook and Salesforce unveiled research efforts that leverage softer forms of supervision for speech analysis and code generation, respectively. In the next few years, we are likely to see these types of models transition from research efforts by big AI labs to mainstream ML applications.

🔺🔻TheSequence Scope – our Sunday edition with the industry’s development overview – is free. To receive high-quality content about the most relevant developments in the ML world every Tuesday and Thursday, please subscribe to TheSequence Edge 🔺🔻

🗓 Next week in TheSequence Edge:

Edge#123: we start a new series about self-supervised learning; discuss the “Self-Supervised Learning, the Dark Matter of Artificial Intelligence” paper; explore VISSL, a framework for self-supervised learning in computer vision.

Edge#124: we do a deep dive into the Pachyderm platform updates.

Now, let’s review the most important developments in the AI industry this week:

🔎 ML Research

Code Intelligence

Salesforce Research published a paper detailing CodeT5, a pretrained programming language model that achieves state-of-the-art performance in 15 code intelligence tasks ->read more on Salesforce Research blog

Textless NLP

Facebook AI Research (FAIR) published a paper introducing a generative model that can master NLP tasks using raw audio files in almost any language ->read more on FAIR blog

Speech Recognition Models for Speech Impairment

Google Research released two papers and an open-source dataset to foster the development of speech recognition models that work for people with speech impairments ->read more on Google Research blog

🛠 Real World ML

Scaling Hadoop YARN at LinkedIn

The LinkedIn engineering team published a blog post detailing the architecture used to scale their Hadoop YARN infrastructure beyond 10,000 nodes ->read more on LinkedIn blog

Uber Jellyfish

Uber engineering published a blog post detailing the architecture behind its schemaless data storage infrastructure called Jellyfish ->read more on Uber engineering blog

🤖 Cool AI Tech Releases

JetBrains DataSpell

JetBrains announced the release of DataSpell, a new IDE optimized for data science programs ->read more on JetBrains blog

AWS S3 Plugin for PyTorch

Amazon released an S3 plugin for PyTorch, which enables the usage of S3 data buckets in PyTorch datasets ->read more on AWS engineering blog

TensorFlow Lite and XNNPACK

TensorFlow unveiled an extended integration with XNNPACK for faster quantized inference ->read more on TensorFlow blog

🗯 Useful Tweet

DeepMind @DeepMind

Introducing the '21 DeepMind x @ai_ucl Reinforcement Learning Lecture Series, a comprehensive introduction to modern RL. Follow along with our researchers as they explore Markov Decision Processes, sample-based learning algorithms & much more: dpmd.ai/2021RLseries 1/2

💸 Money in AI

ML&AI&Quantum

Database startup SingleStore raised $80 million in a Series F funding round led by Insight Partners. Hiring in the US/Portugal/Remote.
Quantum control hardware and software platform Quantum Machines raised a $50 million Series B round led by Red Dot Capital Partners. Hiring mostly in Israel.
Conversational AI startup PolyAI raised $14 million in a funding round led by Silicon Valley’s Khosla Ventures. Hiring in the US and UK.
Computer vision training platform Mobius Labs raised a ~$6.1 million funding round led by Ventech VC. Hiring in Berlin.

AI-powered:

Relationship intelligence platform Affinity raised an $80 million Series C funding round led by Menlo Ventures. Hiring in SF/Toronto/Remote.
Fertility-focused women's health startup Flo raised a $50 million Series B round co-led by VNV Global and Target Global. Hiring worldwide.
Work insights platform Fin raised $20 million in Series A funding, led by Coatue. Hiring in the US.
Virtual meeting platform Vowel raised $13.5 million in a Series A round led by Lobby Capital. Hiring remote.

TheSequence