The Transformer Robots are Here, Just a Different Kind
An impressive week for robotics models from DeepMind and Stanford University, and much more...
Next Week in The Sequence:
Edge 259: Our series about LLM reasoning dives into the fascinating tree-of-thoughts technique, including the original paper. We also review the Language Model Evaluation Harness framework for LLM evaluation.
Edge 260: We dive into Ghostbuster, UC Berkeley's model for detecting LLM-generated content.
You can subscribe below!
📝 Editorial: The Transformer Robots are Here, Just a Different Kind
Robotics has always been one of the most fertile grounds for adopting artificial intelligence (AI) techniques. With recent advancements in computer vision, language, and audio foundation models, we can expect to see a new generation of robotic applications that dazzle us. However, the challenges of building effective robotic solutions extend beyond AI and require deep mastery of the physics of an environment and incredibly effective coordination of perception and action. Collecting the training datasets needed for that coordination typically requires massive effort, but the advent of foundation models has drastically lowered the barrier to entry.
A few months ago, Google DeepMind unveiled the Robotic Transformer 2 (RT-2) models, which use language and computer vision to translate knowledge into robotic actions. Last week, DeepMind followed this research with three notable additions:
AutoRT: A system that leverages vision-language models to deploy robots in completely new environments with minimal human supervision.
SARA-RT: A method that converts RT-2 into a version that is 10% more accurate and 14% faster.
RT-Trajectory: A method for learning control policies for physical actions in robotic applications. It takes a training video and overlays a 2D sketch of the motion that the robot can follow.
These three methods combine foundation models in image, language, and video to improve robotic applications. Certainly, aspects such as perception and its translation into action using foundation models can accelerate robotics to levels we haven’t seen before. The robo transformers are definitely on their way!
📣 apply() Spring ‘24 Call for Speakers!
The next apply() is set for March 14 and we’re looking for speakers! apply() is the biggest virtual ML conference in the world, and is designed to bring together ML practitioners in one space to share best practices, development patterns, and emerging tooling.
Has your team built an ML platform? Pushed ML models to production? Learned valuable lessons about organizing an ML or data science team? If yes, we want to hear from you – submit your talk today!
🔎 ML Research
Robotics with Foundation Models
Google DeepMind published the research and code behind AutoRT, SARA-RT, and RT-Trajectory, three methods that leverage foundation models in robotic scenarios. The three techniques are part of the Robotics Transformer initiative, which aims to help robots navigate environments and make quick decisions —> Read more.
Mobile ALOHA
Researchers from Stanford University unveiled Mobile ALOHA, a very impressive robotic application for object manipulation. The robot uses imitation learning to master a series of complex tasks from specific demonstrations. Watch the videos —> Read more.
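For intuition on the imitation-learning recipe, here is a minimal behavior-cloning sketch in PyTorch. All shapes, dimensions, and the simple MLP policy are illustrative assumptions; Mobile ALOHA's actual system trains transformer-based policies on bimanual teleoperation demonstrations. The core idea is simply supervised regression from observations to the expert's actions.

```python
import torch
import torch.nn as nn

# Minimal behavior-cloning sketch (illustrative only; Mobile ALOHA's
# real policy architecture is more sophisticated than this MLP).
class BCPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

# Hypothetical demonstration data: (observation, expert action) pairs.
obs = torch.randn(1024, 32)      # e.g., proprioceptive + visual features
actions = torch.randn(1024, 14)  # e.g., bimanual joint targets

policy = BCPolicy(obs_dim=32, act_dim=14)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

for step in range(1000):
    pred = policy(obs)
    loss = nn.functional.mse_loss(pred, actions)  # imitate the expert
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```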
GPU Split
Microsoft Research published a paper detailing Splitwise, an optimization technique for GPU utilization. Splitwise works by separating the prompt computation and token generation phases of LLM inference onto different machines —> Read more.
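As a rough sketch of why the split helps: the prompt (prefill) phase is a compute-heavy pass over the full prompt, while the token (decode) phase is a memory-bound loop that reuses the KV cache. The code below illustrates the two phases for a Hugging Face-style causal LM; the function names and in-process handoff are assumptions, since Splitwise itself transfers the KV cache between separate machines.

```python
import torch

# Conceptual sketch of phase-split LLM inference (hypothetical names;
# not Splitwise's actual implementation). `model` is assumed to be a
# Hugging Face-style causal LM that supports KV caching.
def prefill(model, prompt_ids: torch.Tensor):
    """Prompt phase: one compute-heavy pass that builds the KV cache."""
    with torch.no_grad():
        out = model(prompt_ids, use_cache=True)
    return out.logits[:, -1:], out.past_key_values

def decode(model, last_logits, kv_cache, max_new_tokens: int = 32):
    """Token phase: memory-bound, one token per step, reusing the cache."""
    tokens = []
    next_id = last_logits.argmax(-1)  # greedy decoding for simplicity
    for _ in range(max_new_tokens):
        tokens.append(next_id)
        with torch.no_grad():
            out = model(next_id, past_key_values=kv_cache, use_cache=True)
        kv_cache = out.past_key_values
        next_id = out.logits[:, -1:].argmax(-1)
    return torch.cat(tokens, dim=1)

# In Splitwise, prefill() and decode() would run on different machines,
# with the KV cache transferred between them; here both run in-process.
```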
LLM Augmented LLMs
Google DeepMind published a super interesting paper introducing Composition of Augmented Language Models (CALM), a method that augments the capabilities of LLMs with other LLMs. Specifically, CALM introduces cross-attention between models so that they can reuse each other's knowledge representations —> Read more.
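To make the idea concrete, here is a minimal sketch of what a learned cross-attention bridge between two frozen models could look like. The dimensions, names, and single-layer design are illustrative assumptions; the paper learns a small number of such layers while keeping both LLMs frozen.

```python
import torch
import torch.nn as nn

# Minimal sketch of CALM-style composition (assumed shapes and names).
class CrossAttentionBridge(nn.Module):
    def __init__(self, anchor_dim: int, aug_dim: int, n_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(aug_dim, anchor_dim)  # align model widths
        self.attn = nn.MultiheadAttention(anchor_dim, n_heads, batch_first=True)

    def forward(self, anchor_h: torch.Tensor, aug_h: torch.Tensor):
        # The anchor model's hidden states attend over the augmenting
        # model's hidden states, then mix back in via a residual.
        aug_h = self.proj(aug_h)
        mixed, _ = self.attn(query=anchor_h, key=aug_h, value=aug_h)
        return anchor_h + mixed

# Hypothetical hidden states from two frozen LLMs:
anchor_h = torch.randn(2, 128, 1024)  # anchor model, width 1024
aug_h = torch.randn(2, 128, 768)      # augmenting model, width 768
bridge = CrossAttentionBridge(anchor_dim=1024, aug_dim=768)
composed = bridge(anchor_h, aug_h)    # fed to the anchor's next layer
```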
High Quality Text Embeddings Using Synthetic Data
Microsoft Research published a paper detailing a method for obtaining high-quality text embeddings using only synthetic data and LLMs. More impressively, the method seems to require only about a thousand training steps, rather than the billions of data pairs used to pretrain traditional embedding models —> Read more.
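The fine-tuning objective in this line of work is typically a standard in-batch contrastive loss over (query, passage) pairs; a minimal sketch is below, with batch size, embedding width, and all names as illustrative assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn.functional as F

# Sketch of contrastive fine-tuning on LLM-generated pairs (illustrative).
def info_nce_loss(q_emb: torch.Tensor, p_emb: torch.Tensor, tau: float = 0.05):
    """In-batch negatives: each query should match only its own passage."""
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(p_emb, dim=-1)
    logits = q @ p.T / tau            # [B, B] similarity matrix
    labels = torch.arange(q.size(0))  # diagonal entries are positives
    return F.cross_entropy(logits, labels)

# Hypothetical embeddings of one synthetic training batch:
q_emb = torch.randn(16, 4096)  # synthetic queries
p_emb = torch.randn(16, 4096)  # their matching synthetic passages
loss = info_nce_loss(q_emb, p_emb)
```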
OpenVoice
Researchers from decentralized AI platform MyShell published a paper detailing OpenVoice, a voice-cloning model that requires only a short audio clip as input. OpenVoice enables super granular control over voice characteristics such as accent, rhythm, emotion, and intonation, among others —> Read more.
🤖 Cool AI Tech Releases
CrewAI
A new open source framework for orchestrating autonomous agents —> Read more.
📡 AI Radar
In its first keyboard change since the introduction of the Windows key, Microsoft announced it is adding a Copilot key to Windows 11 PCs.
AI search startup Perplexity AI raised $73.6 million at a $520 million valuation.
Databricks published research showing that inference and training performance on the Intel Gaudi 2 accelerator matches NVIDIA's.
OpenAI is set to release its GPT store next week.
Legal AI platform Robin AI announced a $26 million raise.
Nabla, a medical AI assistant platform, announced a $24 million Series B.