🏷🥊 The Fight Against Labeled Dataset Dependencies

Every week, The Scope covers the most relevant ML papers, real-world ML use cases, cool tech releases, and $ in AI.

📝 Editorial 

Supervised learning has dominated the world of machine learning (ML) for the last few decades. The predominance of supervised models in mainstream ML applications seems logical, considering that they are easier to model, interpret, and optimize than unsupervised alternatives. However, supervised ML models have a major limitation: they depend on large, labeled datasets that are very expensive to build and maintain. This dependency is not only technological but also economic, as it has made ML research a privilege of large organizations with access to highly curated datasets. On top of that, supervised learning paradigms are not particularly good at generalizing across multiple tasks. Steadily decreasing the level of supervision in ML models is one of the paramount challenges for the next decade of ML. The ML industry recognizes this and is making significant inroads.

The last few years have seen an explosion of research and implementation efforts to reduce the dependency on labeled datasets. From pretrained models to semi-supervised and self-supervised learning paradigms, we regularly see lightly supervised models match or outperform supervised alternatives across domains such as computer vision, language, and speech. Just this week, Facebook and Salesforce unveiled research efforts that leverage softer forms of supervision for speech analysis and code generation, respectively. In the next few years, we are likely to see these types of models transition from research efforts by big AI labs to mainstream ML applications.


🔺🔻TheSequence Scope – our Sunday edition with the industry’s development overview – is free. To receive high-quality content about the most relevant developments in the ML world every Tuesday and Thursday, please subscribe to TheSequence Edge 🔺🔻

🗓 Next week in TheSequence Edge:

Edge#123: we start a new series about self-supervised learning; discuss the “Self-Supervised Learning: The Dark Matter of Intelligence” paper; explore VISSL, a framework for self-supervised learning in computer vision.

Edge#124: we take a deep dive into the Pachyderm platform updates.

Now, let’s review the most important developments in the AI industry this week.

🔎 ML Research

Code Intelligence  

Salesforce Research published a paper detailing CodeT5, a pretrained programming language model that achieves state-of-the-art performance in 15 code intelligence tasks ->read more on Salesforce Research blog

Textless NLP 

Facebook AI Research (FAIR) published a paper introducing a generative model that can master NLP tasks using raw audio files in almost any language ->read more on FAIR blog

Speech Recognition Models for Speech Impairment 

Google Research released two papers and an open-source dataset to encourage the development of speech recognition models that work for people with speech impairments ->read more on Google Research blog

🛠 Real World ML

Scaling Hadoop YARN at LinkedIn 

The LinkedIn engineering team published a blog post detailing the architecture used to scale their Hadoop YARN infrastructure beyond 10,000 nodes ->read more on LinkedIn blog

Uber Jellyfish 

Uber engineering published a blog post detailing the architecture behind its schemaless data storage infrastructure called Jellyfish ->read more on Uber engineering blog

🤖 Cool AI Tech Releases

JetBrains DataSpell 

JetBrains announced the release of DataSpell, a new IDE optimized for data science programs ->read more on JetBrains blog

AWS S3 Plugin for PyTorch 

Amazon released an S3 plugin for PyTorch, which enables the usage of S3 data buckets in PyTorch datasets ->read more on AWS engineering blog

TensorFlow Lite and XNNPACK 

TensorFlow unveiled an extended integration with XNNPACK for faster quantized inference ->read more on TensorFlow blog

💸 Money in AI



  • Relationship intelligence platform Affinity raised an $80 million Series C funding round led by Menlo Ventures. Hiring in SF/Toronto/Remote.

  • Fertility-focused women's health startup Flo raised a $50 million Series B round co-led by VNV Global and Target Global. Hiring worldwide.

  • Work insights platform Fin raised $20 million in Series A funding, led by Coatue. Hiring in the US.

  • Virtual meeting platform Vowel raised $13.5 million in a Series A round led by Lobby Capital. Hiring remote.