World Models are Coming and They are Awesome

Two amazing world models were released this week.

Dec 08, 2024

Next Week in The Sequence

Edge 455: Our series about knowledge distillation continues with an overview of graph-based distillation including a research paper that outlines the key methods in that area. We also dive into HuggingFace’s AutoTrain method for simplifying the pretraining of foundation models.
The Sequence Chat: Explores some non-obvious points about the AI race between the US and China, with a key emphasis on robotics.
Edge 456: We dive into FrontierMath, regarded as the hardest math reasoning benchmark in AI.

📝 Editorial: World Models are Coming and They are Awesome

World models are an emerging area of generative AI, regarded by many as one of the major frontiers on the path to some level of AGI. By world models, we are referring to agents that can interact in hyperrealistic environments in which aspects such as understanding of the laws of physics play a key role. With industries such as embodied AI achieving record levels of traction, the demand for world models is virtually insatiable.

The world of AI has witnessed the release of two remarkable world models this week, both capable of generating interactive 3D environments from simple prompts: DeepMind's Genie 2 and a system by World Labs. These groundbreaking tools hold immense potential for AI research, game development, and beyond, promising to accelerate the development of embodied AI agents and enable new creative workflows for prototyping interactive experiences.

Genie 2, a large-scale foundation world model developed by DeepMind, stands out with its ability to generate a vast array of dynamic 3D worlds from single image prompts generated by Imagen 3, Google's text-to-image model. This means that users can input a text description of their desired world, choose their favorite image representation, and then interact with the generated environment, either directly or through an AI agent.

Beyond its impressive world generation capabilities, Genie 2 demonstrates a range of emergent capabilities that make its environments truly interactive. The model can simulate object interactions, including complex actions like opening doors, shooting explosive barrels, and animating characters with various activities. It can also model physical properties like gravity, lighting, and reflections, further enhancing the realism and depth of the generated worlds. Genie 2's ability to model long-horizon memory allows it to accurately render previously seen parts of the world when they come back into view, and it can generate new plausible content on the fly, maintaining consistency for up to a minute.

World Labs, the startup founded by AI pioneer Fei-Fei Li and backed by $230 million in funding, has also introduced a new AI system for creating 3D spaces from simple prompts, though less information is available about its underlying architecture and training data compared to Genie 2. The system, which uses both text and image prompts, allows for exploration of the generated environments using keyboard and mouse controls. Notably, it boasts a user-friendly 3D scene builder that enables interactive manipulation of the generated environment.

One of the key highlights of World Labs' model is its focus on enabling creative workflows. The system allows for the generation of different variations of the same 3D environment from a single prompt, making it easy for artists and designers to experiment and iterate. It also offers various camera effects, including depth of field and dolly zoom, providing users with control over the visual presentation of their generated worlds.

Both Genie 2 and World Labs' 3D world generator represent significant advancements in AI, pushing the boundaries of world model capabilities and opening up exciting new possibilities for researchers, developers, and creators. DeepMind emphasizes Genie 2's potential for training and evaluating embodied agents, highlighting its ability to generate a limitless curriculum of novel worlds. They showcase this by deploying a SIMA agent, developed in collaboration with game developers, to navigate and complete tasks in environments generated by Genie 2.

World Labs, on the other hand, emphasizes the creative potential of their system, showcasing its ability to transform concept art and drawings into interactive environments and highlighting its use in prototyping game levels and generating variations of 3D scenes. Both approaches showcase the versatility and wide-ranging applications of these new 3D world generation tools.

While both DeepMind and World Labs acknowledge that their respective technologies are still in their early stages, their releases mark a significant step towards more sophisticated and accessible world model creation. As these technologies continue to evolve, we can expect even more groundbreaking applications to emerge, blurring the lines between virtual and real and empowering us to create and interact with digital worlds in unprecedented ways.

🔎 ML Research

Genie 2

In "Genie 2: A Large-Scale Foundation World Model", researchers from Google DeepMind, including Jack Parker-Holder and Stephen Spencer, introduced Genie 2. Genie 2 is a large-scale model that can create a variety of 3D environments for training AI agents, overcoming the limitations of using only existing environments —> Read more.

STAR

In "Automated Architecture Synthesis via Targeted Evolution", researchers from Liquid AI, including Armin W. Thomas, presented STAR, which is a system for automatically designing neural network architectures. STAR uses evolutionary algorithms to optimize a numerical representation of model architectures and uses Linear Input-Varying Systems (LIVs), a new way to represent and understand different parts of a neural network —> Read more.

Enterprise-AI Patterns

In "Generating a Low-code Complete Workflow via Task Decomposition and RAG", researchers from ServiceNow formalized Task Decomposition and Retrieval-Augmented Generation (RAG) as design patterns for systems based on generative AI. The authors demonstrated these patterns in a case study on generating workflows, showing how they can be used to create practical, enterprise-level AI applications —> Read more.

GenCast

In "GenCast: Predicting Weather and Extreme Conditions with State-of-the-Art Accuracy", researchers from Google DeepMind, including Ilan Price and Alvaro Sanchez-Gonzalez, introduced GenCast, a new system for weather forecasting. The research aims to make weather predictions more accurate, especially for extreme weather conditions —> Read more.

EfficientTAMs

In "Efficient Track Anything Models", researchers from Meta, including Yunyang Xiong, proposed Efficient Track Anything Models (EfficientTAMs), which are lightweight and efficient models for video object segmentation and object tracking. They showed that vanilla Vision Transformers can perform as well as more complex models like SAM 2 and proposed an efficient memory cross-attention mechanism that improves performance by taking advantage of the way spatial tokens are arranged in memory —> Read more.

AV-Odyssey Bench

In "AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?", researchers from UC Berkeley, Stanford University and Yale introduced DeafTest and AV-Odyssey Bench for evaluating how well Multimodal Large Language Models (MLLMs) can understand both audio and visual information. DeafTest assesses fundamental listening skills, and AV-Odyssey Bench is a comprehensive benchmark covering many tasks and audio attributes —> Read more.

🤖 AI Tech Releases

ChatGPT Pro

OpenAI introduced ChatGPT Pro, a new version that includes unlimited access to all models including o1 —> Read more.

Llama 3.3

Meta announced the release of Llama 3.3, a 70B parameter model that matches the performance of its 405B parameter predecessor —> Read more.

Nova

Amazon introduced Nova, a new family of foundation models —> Read more.

Veo and Imagen

Google’s video and image generation models, Veo and Imagen 3, were made available in the Vertex AI platform —> Read more.

AWS AI Announcements

There were major AI announcements at the AWS re:Invent conference —> Read more.

🛠 Real World AI

Scaling Gen AI at Salesforce

Salesforce shared details about its best practices for RAG and scalability —> Read more.

Fine-tuning Models with Hugging Face

The team from Capital Fund Management shares some details of their fine-tuning strategies with the Hugging Face stack —> Read more.

📡 AI Radar

David Sacks was named AI and Crypto Czar by President-elect Donald Trump.
xAI closed $6 billion in a new equity round.
Salesforce announced major adoption KPIs of its Agentforce platform for enterprise AI agents.
Microsoft unveiled the preview of Copilot Vision.
Cake raised $13 million to simplify open source AI adoption in the enterprise.
Across AI raised $5.75 million for a new memory platform for AI applications.
AI medical company Cleerly raised $106 million in new funding.
Enterpret raised $20.8 million for its AI customer intelligence platform.
Lica raised $4 million to use AI to build professional product videos.
Cohere and CoreWeave partnered for a large data center in Canada.
Aethir, Beam Foundation, and MetaStreet launched a $40 million initiative to accelerate decentralized AI.
X’s Grok is now available to all users.
