🏷🥊 The Fight Against Labeled Dataset Dependencies
The Scope covers the most relevant ML papers, real-world ML use cases, cool tech releases, and money in AI, every week.
Supervised learning has dominated machine learning (ML) for the last few decades. The predominance of supervised models in mainstream ML applications seems logical: they are easier to model, interpret, and optimize than non-supervised alternatives. However, supervised ML models carry a big limitation: their dependency on large, labeled datasets, which are very expensive to build and maintain. That dependency is not only technological but also economic, as it has made ML research a privilege of large organizations with access to highly curated datasets. On top of that, supervised learning paradigms are not particularly good at generalizing across multiple tasks. Steadily decreasing the level of supervision in ML models is one of the paramount challenges for the next decade of ML, and the industry recognizes this and is making massive inroads.
The last few years have seen an explosion of research and implementation efforts to reduce the dependency on labeled datasets. From pretrained models to semi- and self-supervised learning paradigms, we regularly see lightly supervised models match and even outperform supervised alternatives across domains such as computer vision, language, and speech. Just this week, Facebook and Salesforce unveiled research efforts that leverage softer forms of supervision for speech analysis and code generation, respectively. In the next few years, we are likely to see these types of models transition from research efforts at big AI labs to mainstream ML applications.
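To make the core idea concrete, here is a minimal, purely illustrative sketch of the self-supervised principle: the training signal is derived from the data itself (here, predicting the next character in unlabeled text) rather than from human-provided labels. The toy corpus and function names are our own invention and are not taken from any of the systems mentioned in this issue.

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus: str) -> dict:
    """Build next-character counts from raw, unlabeled text.
    The 'labels' (next characters) come from the data itself --
    the defining trait of self-supervised learning."""
    counts = defaultdict(Counter)
    for current, nxt in zip(corpus, corpus[1:]):
        counts[current][nxt] += 1
    return counts

def predict_next(model: dict, char: str) -> str:
    """Return the most frequently observed follower of `char`."""
    followers = model.get(char)
    if not followers:
        return ""
    return followers.most_common(1)[0][0]

# No labeled dataset was needed: supervision is extracted from raw text.
model = train_bigram_model("self-supervised learning learns from raw data")
print(predict_next(model, "r"))  # -> "n" ('r' is followed by 'n' twice in the corpus)
```

Modern self-supervised systems apply the same trick at vastly larger scale, masking or corrupting parts of the input and training a model to reconstruct them.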
🔺🔻TheSequence Scope – our Sunday edition with the industry’s development overview – is free. To receive high-quality content about the most relevant developments in the ML world every Tuesday and Thursday, please subscribe to TheSequence Edge 🔺🔻
🗓 Next week in TheSequence Edge:
Edge#123: we start a new series about self-supervised learning; discuss “Self-Supervised Learning, the Dark Matter of Artificial Intelligence” paper; explore VISSL, a framework for self-supervised learning in computer vision.
Edge#124: we take a deep dive into Pachyderm platform updates.
Now, let’s review the most important developments in the AI industry this week.
🔎 ML Research
CodeT5 by Salesforce
Salesforce Research published a paper detailing CodeT5, a pretrained programming language model that achieves state-of-the-art performance in 15 code intelligence tasks ->read more on Salesforce Research blog
NLP From Raw Audio
Facebook AI Research (FAIR) published a paper introducing a generative model that can master NLP tasks using raw audio files in almost any language ->read more on FAIR blog
Speech Recognition Models for Speech Impairment
Google Research released two papers and an open-source dataset to foster the development of speech recognition models that work for people with speech impairments ->read more on Google Research blog
🛠 Real World ML
Scaling Hadoop YARN at LinkedIn
The LinkedIn engineering team published a blog post detailing the architecture used to scale their Hadoop YARN infrastructure beyond 10,000 nodes ->read more on LinkedIn blog
Jellyfish at Uber
Uber Engineering published a blog post detailing the architecture behind Jellyfish, its schemaless data storage infrastructure ->read more on Uber engineering blog
🤖 Cool AI Tech Releases
JetBrains DataSpell
JetBrains announced the release of DataSpell, a new IDE optimized for data science workflows ->read more on JetBrains blog
AWS S3 Plugin for PyTorch
Amazon released an S3 plugin for PyTorch, which enables PyTorch datasets to read data directly from S3 buckets ->read more on AWS engineering blog
TensorFlow Lite and XNNPACK
TensorFlow unveiled an extended integration with XNNPACK for faster quantized inference ->read more on TensorFlow blog
💸 Money in AI