The Generative Audio Momentum

Sundays, The Sequence Scope brings a summary of the most important research papers, technology releases and VC funding deals in the artificial intelligence space.

Jun 25, 2023

Next Week in The Sequence:

Edge 303: Our series about new methods in generative AI continues with an exploration of different retrieval-augmented foundation model techniques. We discuss Meta AI’s famous Atlas paper as well as the innovative Lamini framework for LLM fine-tuning. Please register!
Wednesday we have another special interview.
Edge 304: We deep dive into AlphaDev, DeepMind’s new model that was able to discover new sorting algorithms!

📝 Editorial: The Generative Audio Momentum

The field of generative AI innovation has primarily been dominated by large language models (LLMs) and computer vision models for images. However, other domains like audio, video, and 3D are also progressing rapidly, albeit not at the same pace. While there is an ongoing debate about which domain will achieve mainstream adoption in generative AI, there are early indications that audio/speech might take the lead.

In the audio/speech space, several well-established disciplines, such as speech translation, speech recognition, and audio synthesis, are seeing tremendous momentum in research and technology. That progress is particularly intriguing given that far less pretraining data is available for audio than for language or computer vision.

Just last week, audio/speech models grabbed the headlines in generative AI. Meta AI unveiled VoiceBox, a text-to-speech generative model that has demonstrated proficiency across six languages. Google Research also introduced AudioPaLM, a multimodal speech model that combines techniques from PaLM 2 and AudioLM. ElevenLabs, one of the most captivating startups in the generative AI for audio/speech field, recently announced a new funding round led by a16z, significantly elevating its profile.

The momentum in the generative AI audio/speech domain is palpable. Unlike the LLM or computer vision space, there are currently no clear leaders in generative audio. The race is definitely underway!

🔎 ML Research

VoiceBox

Meta AI Research published a paper unveiling VoiceBox, a text-to-speech generative model that achieved state-of-the-art performance on tasks it was not originally trained on. The model can synthesize speech across six languages, as well as perform noise removal, content editing, style conversion, and diverse sample generation —> Read more.

RoboCat

DeepMind published a paper unveiling RoboCat, a self-improving robotic agent that can perform a variety of tasks. RoboCat combines language understanding with real-world actionability and can generate training data to improve itself —> Read more.

Tart

Researchers from Cornell and Stanford University published a paper discussing Tart, a module that uses logistic regression to improve reasoning in LLMs. The module can be added to existing transformer architectures to improve their deductive abilities —> Read more.

Responsible AI Maturity Model

Microsoft Research published a paper outlining a framework for responsible AI. The framework includes 24 dimensions that compose a roadmap for organizations to achieve responsible AI maturity —> Read more.

AudioPaLM

Google Research published a paper detailing AudioPaLM, a speech language model that handles tasks such as text-to-speech, speech translation, and speech recognition. AudioPaLM combines techniques from models such as PaLM 2 and AudioLM into a single architecture —> Read more.

🤖 Cool AI Tech Releases

Inflection-1

Inflection, the well-funded generative AI startup created by LinkedIn co-founder Reid Hoffman, previewed a version of Inflection-1, the LLM powering its Pi.ai assistant —> Read more.

FinGPT

AI researchers from Columbia University open sourced FinGPT, a foundation model fine-tuned on financial datasets —> Read more.

AnythingLLM

The team from Mintplex Labs open sourced AnythingLLM, a tool that can transform any document or piece of content into an artifact that can be referenced in a conversational interaction with an LLM —> Read more.

🛠 Real World ML

Detecting AI Profile Photos at LinkedIn

LinkedIn discussed the computer vision techniques it uses to detect AI-generated profile photos —> Read more.

GitHub Copilot Tips

The GitHub team discussed best practices and tips for using Copilot —> Read more.

📡AI Radar

Dropbox introduced Dash, an AI assistant for document knowledge management.
Toyota announced that it is using generative AI for engineering design.
AI video creation app Captions announced that it has raised $25 million.
Knowledge platform Stravito unveiled new generative AI capabilities.
Metal raised $2.5 million to boost its new platform for enterprise generative AI apps.
Startup incubator Madrona Venture Labs announced an $11 million fund to incubate AI companies. The fund included the participation of VC firm Madrona Ventures.
Generative AI voice platform ElevenLabs announced a new $19 million funding round.
Parrot, an AI transcription platform for the legal and insurance industries, announced an $11 million Series A.
Amazon is investing $100 million in a new center to help companies adopt generative AI.
Dropbox announced a $50 million fund to invest in generative AI initiatives.
The UK government announced that it is committing £21 million to foster AI adoption within the NHS.
