The Sequence Opinion #722: From Language to Action: Transformer Architectures as Robotic Foundation Models
Can we have generalist models for robotics?
Building transformer-based models for robotics holds great promise for generalizing across many tasks and robot embodiments. In recent years, researchers have begun applying the same transformer architecture that revolutionized NLP and vision to robotics, aiming to create foundation models for robots. Such models would learn from large-scale, diverse data and could potentially perform many tasks (and even work with different types of robots) without being retrained from scratch for each new skill. This essay surveys the opportunities and challenges in that direction, focusing on how transformer models can generalize across tasks, what makes that hard (data, training, safety), and which lines of work have pushed the frontier.
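To make the "language to action" framing concrete: a common way to fit robot control into a transformer's token interface is to discretize each continuous action dimension into a fixed number of bins, so that actions become token ids the model can predict just like words. The sketch below is a minimal, hypothetical illustration of that idea (uniform binning over a 7-DoF command, 256 bins per dimension, loosely in the spirit of models like RT-1); the function names and ranges are assumptions for illustration, not any specific model's API.

```python
import numpy as np

def discretize_actions(actions, low, high, n_bins=256):
    # Map continuous action values in [low, high] to integer token ids
    # in [0, n_bins - 1] via uniform binning.
    scaled = (actions - low) / (high - low)
    return np.clip((scaled * n_bins).astype(int), 0, n_bins - 1)

def decode_tokens(tokens, low, high, n_bins=256):
    # Map token ids back to the center of their bin, recovering an
    # approximate continuous action.
    return low + (tokens + 0.5) / n_bins * (high - low)

# Hypothetical example: a 7-DoF arm command (6 pose deltas + gripper),
# with every dimension normalized to [-1, 1].
low, high = -1.0, 1.0
action = np.array([0.12, -0.5, 0.99, 0.0, -1.0, 0.33, 1.0])
tokens = discretize_actions(action, low, high)
recovered = decode_tokens(tokens, low, high)

# Round-tripping loses at most one bin width of precision.
assert np.max(np.abs(recovered - action)) <= (high - low) / 256
```

With actions expressed as tokens, a single sequence model can consume interleaved language, image, and state tokens and emit action tokens autoregressively, which is what makes cross-task and cross-embodiment training tractable.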