The Sequence Radar #679: From Model to Team: Sakana’s Blueprint for Collective AI
Sakana's new inference-time method combines different models seamlessly.
Next Week in The Sequence:
Knowledge: We explore AI creativity evals.
Engineering: We dive into Amazon’s Strands agentic framework.
Opinion: We discuss the limits of autonomy in AI agents.
Research: We dive into Sakana AI’s new AB-MCTS method.
Let’s Go! You can subscribe to The Sequence below:
📝 Editorial: Several Models are Better than One: Sakana’s Blueprint for Collective AI
Sakana AI has rapidly emerged as one of my favorite and most innovative AI research labs in the current landscape. Founded by former Google Brain and DeepMind researchers, the lab has already made headlines with its work on evolutionary methods, model orchestration, and adaptive reasoning.
Now the lab has released an impressive new inference-time framework.
In the latest evolution of inference-time AI, Sakana AI has introduced a compelling framework that pushes beyond traditional single-model reasoning: Adaptive Branching Monte Carlo Tree Search, or AB-MCTS. At its core, AB-MCTS reflects a philosophical shift in how we think about large language model (LLM) reasoning. Rather than treating generation as a flat, linear process, Sakana’s approach reframes inference as a strategic exploration through a search tree—navigating between depth (refinement of existing ideas), width (generation of new hypotheses), and even model selection itself. The result is a system that begins to resemble collaborative, human-like thinking.
AB-MCTS is an inference-time algorithm grounded in the principles of Monte Carlo Tree Search, a method historically associated with planning in board games like Go. Sakana adapts this mechanism to textual reasoning by using Thompson Sampling to decide whether to continue developing a promising response or branch into an unexplored avenue. This adaptive process means the system is no longer bound to fixed temperature sampling or deterministic prompting. Instead, it engages in a kind of probabilistic deliberation, allocating its computational resources to the most promising parts of the solution space as assessed in real time.
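The wider-versus-deeper decision can be pictured as a two-armed bandit. The sketch below is a hypothetical, simplified illustration of that idea using Beta-Bernoulli Thompson Sampling; AB-MCTS's actual posterior model (and the TreeQuest implementation) handles richer, continuous feedback, so treat the class name and the binary reward as assumptions for exposition only.

```python
import random

class BranchDecider:
    """Toy Thompson Sampling over two actions: 'wider' (generate a new
    candidate answer) vs 'deeper' (refine an existing one).

    Illustrative sketch only; not Sakana's actual implementation."""

    def __init__(self):
        # Beta(1, 1) priors: start with no preference between the two actions.
        self.successes = {"wider": 1, "deeper": 1}
        self.failures = {"wider": 1, "deeper": 1}

    def choose(self):
        # Sample a plausible success rate for each action from its posterior,
        # then commit to whichever draw is larger (probabilistic deliberation).
        draws = {a: random.betavariate(self.successes[a], self.failures[a])
                 for a in ("wider", "deeper")}
        return max(draws, key=draws.get)

    def update(self, action, improved):
        # External feedback (e.g. a passing test) counts as a success and
        # shifts future compute toward the action that produced it.
        if improved:
            self.successes[action] += 1
        else:
            self.failures[action] += 1

# Minimal usage: pick an action at the current tree node, then record feedback.
decider = BranchDecider()
action = decider.choose()
decider.update(action, improved=True)
```

Because the posteriors sharpen with every observation, compute drifts toward whichever move has been paying off, without any fixed temperature or deterministic prompt schedule.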
But the real breakthrough lies in the extension: Multi-LLM AB-MCTS. In this paradigm, multiple LLMs—including frontier models like OpenAI’s o4-mini, Google DeepMind’s Gemini 2.5 Pro, and DeepSeek’s R1—are orchestrated into a dynamic ensemble. At each point in the reasoning tree, the system not only decides what to do next (go deeper or go wider) but also who should do it. This introduces a novel third axis to inference: model routing. Initially unbiased, the system learns to favor models that historically perform better on certain subtasks, effectively turning a collection of models into a coherent team.
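One way to picture the third axis is to let Thompson Sampling run over (action, model) pairs, so each step jointly decides what to do and who should do it. The following is a hypothetical sketch under that framing; the model names are illustrative placeholders, not the frontier models named above, and the flat bandit stands in for the per-node posteriors a real tree search would maintain.

```python
import random
from itertools import product

# Joint arms: every combination of branching action and candidate model.
ACTIONS = ("wider", "deeper")
MODELS = ("model-a", "model-b", "model-c")  # illustrative placeholders

# One Beta(1, 1) posterior per arm: initially no model (or action) is favored.
posterior = {arm: [1, 1] for arm in product(ACTIONS, MODELS)}

def choose_arm():
    # Draw a plausible reward for every (action, model) arm; pick the max,
    # so routing and branching are decided in a single sample.
    draws = {arm: random.betavariate(a, b) for arm, (a, b) in posterior.items()}
    return max(draws, key=draws.get)

def update(arm, success):
    # Reward the arm whose output improved the solution; over time the
    # ensemble learns which model suits which kind of step.
    posterior[arm][0 if success else 1] += 1

# Minimal usage: one reasoning step with feedback.
arm = choose_arm()
update(arm, success=True)
```

As evidence accumulates, the draws stop being uniform and the system starts preferring the models that historically perform better on a given kind of subtask, which is exactly the "collection of models into a coherent team" behavior described above.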
The implications for real-world AI systems are profound. By decoupling capability from a single monolithic model, AB-MCTS provides a path to compositional reliability. Enterprises can now imagine deploying systems where reasoning chains are distributed across specialized models, dynamically assigned at runtime based on contextual performance. This not only improves robustness but opens up opportunities for cost optimization, interpretability, and safety. Moreover, Sakana has open-sourced the framework—dubbed TreeQuest—under Apache 2.0, inviting both researchers and practitioners to integrate it into their pipelines.
What Sakana has achieved with AB-MCTS is a blueprint for how we might scale intelligence not just by increasing parameters or data, but by scaling the search process itself. It borrows from the playbooks of both biological evolution and algorithmic planning, combining breadth, depth, and diversity in a structured, learnable way. In doing so, it reframes LLMs as components of larger reasoning ecosystems—systems that can adapt, deliberate, and even self-correct. The age of collective intelligence at inference-time may just be getting started.
🔎 AI Research
Title: SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks
AI Lab: Allen Institute for AI & Yale University
SciArena introduces a community-driven platform to evaluate foundation models on scientific literature tasks using researcher votes, producing a leaderboard based on over 13,000 preference votes. It also proposes SciArena-Eval, a benchmark to assess how well models can serve as automated evaluators compared to human judgments, revealing substantial gaps between LLM-based and human evaluation accuracy.
Title: Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning
AI Lab: Carnegie Mellon University, University of Washington, University of Pennsylvania, The Hong Kong Polytechnic University
This study shows that reinforcement learning (RL) fine-tuning on math reasoning data enhances generalization to other reasoning and non-reasoning tasks, while supervised fine-tuning (SFT) often leads to capability degradation. Using latent-space PCA and token distribution analyses, the authors attribute this to SFT-induced representation drift and propose RL as a more robust training paradigm for transferable intelligence.
Title: Thinking Beyond Tokens: From Brain-Inspired Intelligence to Cognitive Foundations for Artificial General Intelligence and its Societal Impact
AI Lab: Multi-institutional collaboration including University of Central Florida, Cornell, Vector Institute, Meta, Amazon, Oxford, and others
This review outlines a roadmap toward Artificial General Intelligence (AGI) grounded in cognitive neuroscience, agentic AI, memory, and modular reasoning, highlighting the limitations of token-based models and the importance of world models and agent architectures. It calls for interdisciplinary alignment and socially grounded, explainable, and adaptive systems to move from statistical learning to general-purpose intelligence.
Title: Zero-shot Antibody Design in a 24-well Plate
AI Lab: Chai Discovery Team
The paper presents Chai-2, a multimodal generative model that enables zero-shot design of antibodies and miniproteins with experimentally validated success rates of 16% for antibodies and 68% for miniproteins—orders of magnitude higher than prior methods. Demonstrating the ability to design binders to novel targets without known antibodies, Chai-2 reduces discovery time from months to weeks using as few as 20 experimental designs per target.
Title: Wider or Deeper? Scaling LLM Inference-Time Compute with Adaptive Branching Tree Search
AI Lab: Sakana AI
This paper introduces AB-MCTS (Adaptive Branching Monte Carlo Tree Search), a novel inference-time search framework that dynamically decides whether to “go wider” by exploring new answers or “go deeper” by refining existing ones using external feedback, improving LLM performance without additional training. AB-MCTS outperforms repeated sampling and standard MCTS across coding, reasoning, and ML tasks by balancing exploration and exploitation in a principled, budget-aware manner.
Title: Fast and Simplex: 2-Simplicial Attention in Triton
AI Lab: Meta AI and University of Texas at Austin
This paper proposes the 2-simplicial Transformer, a generalization of standard dot-product attention to trilinear attention forms, improving token efficiency and scaling behavior for reasoning, coding, and mathematical tasks. Through a custom Triton kernel implementation, the model exhibits more favorable scaling exponents under token constraints, outperforming traditional Transformers on key benchmarks such as MMLU and GSM8K.
🤖 AI Tech Releases
Ernie 4.5
Baidu released the newest version of its marquee Ernie model.
DeepSWE
Together AI open sourced DeepSWE, a new coding agent based on Qwen3.
🛠 AI in Production
Agents Auditing at Salesforce
Salesforce discusses the architecture powering the Agentforce auditing capabilities.
📡 AI Radar
Meta restructures AI division under Superintelligence Labs: Meta has restructured its AI org, consolidating advanced research under “Superintelligence Labs” in a move to centralize AGI efforts.
Grammarly acquires AI email client Superhuman: Grammarly has acquired Superhuman to integrate its AI writing tools with the premium email client experience, enhancing communication productivity.
Perplexity launches a $200/month subscription plan: Perplexity has introduced a $200/month “Max” tier for power users and enterprises, offering enhanced AI capabilities, custom data integration, and faster responses.
OpenAI condemns Robinhood's OpenAI tokens: OpenAI criticized Robinhood for listing “OpenAI” tokens, clarifying it has no affiliation and warning of potential misuse of its brand.
Y Combinator alum launches $34M fund for YC startups: A Y Combinator alum has raised a $34 million fund backed by Garry Tan to support early-stage YC companies through follow-on investments.
Ilya Sutskever to lead Safe Superintelligence after CEO exit: After co-founder Daniel Gross stepped down, Ilya Sutskever will now lead Safe Superintelligence Inc., focusing solely on building provably safe AI.
Genesis AI launches with $105M from Eclipse and Khosla: Robotics-focused Genesis AI has raised $105 million in seed funding to build foundational AI models for robots, aiming to power next-gen automation.
Levelpath secures $55M for next-gen procurement platform: Procurement tech startup Levelpath raised $55M to streamline enterprise procurement using modern UX and AI-driven workflows.
Campfire raises $35M to take on NetSuite with tiny AI-powered ERP: Campfire, an AI-native ERP startup, secured a $35M Series A to win over startups leaving NetSuite with its lightweight, automated platform.
CoreWeave gets first NVIDIA Blackwell chips from Dell (Bloomberg): Dell has delivered NVIDIA’s new high-end Blackwell AI chips to CoreWeave, making it the first cloud provider to deploy the advanced silicon at scale.