🤓☝️The Need for Open-Source Datasets and Benchmarks
Weekly newsletter that discusses impactful ML research papers, cool tech releases, the money in AI, and real-life implementations
📝 Editorial
As one of my mentors used to say, “AI researchers optimize too much for publications.” That phrase captures the gap between the claims in research papers and practical implementations. These days, it is nearly impossible to keep up with all relevant research across different areas of machine learning (there is a newsletter that can help with that 😉). Furthermore, trying to recreate the techniques outlined in many AI research papers is often a futile effort. Quite often, the source code of the models is not published and, even more often, the datasets used to train and test the models are not available. In that case, how can you even know whether the proposed models are overfitting to a particular dataset?
To address this challenge, we need more open-source datasets and benchmarks to evaluate machine learning models.
The release of ImageNet ushered in a new era of innovation in computer vision. These days, it is hard to find object recognition models that don’t use ImageNet as a benchmark. That experience needs to be recreated across other areas of deep learning. The challenge is that creating high-quality datasets is very hard and can only be done by companies with the relevant resources. Thankfully, large technology firms have stepped up to this challenge. Just this week, Facebook and Google open-sourced new datasets for speech and language models, respectively. We need more efforts like this.
Bottom line, my recommendation is to be skeptical of research papers that come without open-source models and benchmarks against open-source datasets.
🔺🔻TheSequence Scope – our Sunday edition with the industry’s development overview – is free. To receive high-quality content about the most relevant developments in the ML world every Tuesday and Thursday, please subscribe to TheSequence Edge 🔺🔻
🗓 Next week in TheSequence Edge:
Edge#57: Transformers for time-series; how Uber manages uncertainty in time-series prediction models; and tsfresh – a magical library for feature extraction in time-series datasets.
Edge#58: deep dive into OpenAI’s CLIP and DALL·E that draw inspiration from GPT-3 to connect language and computer vision.
Now, let’s review the most important developments in the AI industry this week.
🔎 ML Research
The Three Mysteries of Deep Learning
Microsoft Research published a fascinating paper discussing three fundamental challenges of deep learning: ensemble, knowledge distillation, and self-distillation ->read more on their blog
Controlling Hallucination in Text Generation Models
Google Research published a paper unveiling ToTTo, an open domain table-to-text generation dataset that controls the correlation between generated text and its source dataset ->read more on Google Research Blog
A leaderboard for human-in-the-loop language model benchmarking
Researchers at the Allen Institute for Artificial Intelligence published a paper introducing GENIE, a new suite of human-in-the-loop leaderboards for generative language tasks ->read more in this blog from the Allen AI team
🤖 Cool AI Tech Releases
Multilingual LibriSpeech
Facebook AI Research (FAIR) open-sourced Multilingual LibriSpeech (MLS), a new dataset to benchmark multilingual speech models ->read more on FAIR blog
Uber Gairos
The Uber engineering team published a detailed blog post describing the architecture behind Gairos, their internal platform for real-time data processing, storage and querying ->read more in this blog post from the Uber engineering team
📌 Job Posting
Beam – a new way to collect your thoughts and experience the internet – is looking for a product-leaning ML engineer for their Head of Machine Learning position. CEO and CTO are in Paris. You can work from anywhere. Are you smart, curious and ridiculously good at ML and NLP? Apply here.
Interested in sponsoring TheSequence? Let us know by replying to this email.
💸 Money in AI
ML and AI startups:
AI startup LatticeFlow raised a $2.8 million seed funding round. LatticeFlow’s goal is to build a product that enables companies to deliver trustworthy AI models that are reliable and safe. That sounds simple, but it is far from trivial: it requires systematic methods for assessing model quality, both to gain confidence in a model’s correctness and to identify deficiencies that AI teams must address.
AI startup AlphaICs raised $8 million in funding. The company has developed a next-generation Real AI Processor (RAP), based on a proprietary highly modular and scalable architecture for edge computing. It enables AI acceleration for low-power edge applications, as well as high-performance edge datacenters.
AI for Business
AI-driven telehealth service K Health raised $132 million in a Series E funding round. The K Health app uses a data set of over 2 billion anonymized medical records, finding subtle patterns in the data to give users personalized health advice.
AI-powered precision oncology platform OncoHost raised $8 million. The company develops AI technology to characterize, analyze, and predict patient response to treatment, enabling personalized treatment strategies with improved outcomes & reduced side effects.
AI-driven enterprise fintech platform Trovata.io raised $20 million in a Series A round. Trovata leverages AI to automate workflows such as cash reporting, analysis, and forecasting, allowing companies to see how much cash they have, manage cash flow, and build and maintain forecasts in real time. It also offers a natural language search tool that lets users find and tag key vendors, customers, and partners across millions of transactions in almost no time at all.
AI-driven agri-tech startup Aerobotics raised $17 million in a Series B round. To quote from their website: “Tree and fruit insights enabled by drone imagery and artificial intelligence.”
Travel and spend management platform TripActions raised $155 million in a Series E round. TripActions uses AI to better match travelers’ personal preferences while helping them comply with their company’s travel policy guidelines, combining a booking platform with payment, expense and reconciliation solutions.
Airborne data collection startup Skyqraft raised $2.2 million in seed funding. They use drones to collect image data about powerlines for automated risk assessment and predictions about the state of the equipment.
Construction-planning tech startup Swapp raised $7 million in venture capital. Swapp leverages AI to streamline and optimize operations, increasing efficiencies for developers and general contractors.
“No-code” chatbot builder Landbot raised an $8 million Series. PR-ing themselves as an anti-AI chatbot in 2018, the startup now builds its identity around conversational AI that focuses on lead conversion through data capturing and personalization.
TheSequence is a summary of groundbreaking ML research papers, engaging explanations of ML concepts, and exploration of new ML frameworks and platforms. It also keeps you up to date with the news, trends, and technology developments in the AI field.