The Sequence Radar #669: MiniMax-M1 is a Very Impressive Model
Next Week in The Sequence:
Here is what we have for next week:
A deep dive into software engineering AI evals.
An overview of Anthropic’s architecture for building a research agent.
A debate about whether reasoning in AI models mirrors the System 1/System 2 distinction.
A review of the MCP-Use framework for integrating with MCP servers.
You can subscribe to The Sequence below:
📝 Editorial: MiniMax-M1 is a Very Impressive Model
Algorithmic innovation is always interesting when it comes to LLMs. Some of the new AI frontiers will be reached by scale and others by algorithmic improvements. Last week brought a very interesting release: a highly innovative model that flew a bit under the radar.
MiniMax-M1 is a new 456B-parameter model (with roughly 46B parameters active per token) that redefines efficiency and scale for open-weight models. If the transformer renaissance taught us anything, it's that context is king. MiniMax-M1 doesn’t just stretch context windows; it reinvents the infrastructure to do so meaningfully, at scale, and at a fraction of the cost typically associated with frontier systems.
At its core, MiniMax-M1 fuses a MoE backbone with a novel form of attention called Lightning Attention, a streamlined, linearized mechanism purpose-built for high-efficiency token processing. This hybrid architecture allows the model to handle up to 1 million tokens of context natively. That’s not extrapolation or memory tricks; it’s real, end-to-end, full-window context, paired with reasoning budgets of up to 80K output tokens, orders of magnitude beyond most open-source baselines.
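To make the linear-attention idea concrete, here is a minimal sketch of causal linear (kernelized) attention, the family Lightning Attention belongs to. The feature map and all names below are illustrative, not MiniMax's actual kernel; the point is that the running state stays a fixed d×d size, so cost grows linearly with sequence length.

```python
import numpy as np

def causal_linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Causal linear attention: O(n * d^2) instead of O(n^2 * d).
    Q, K, V: (n, d) arrays. phi is a positive feature map (an
    illustrative choice, not MiniMax's kernel)."""
    n, d = Q.shape
    S = np.zeros((d, d))      # running sum of outer(phi(k_j), v_j)
    z = np.zeros(d)           # running sum of phi(k_j) for normalization
    out = np.zeros_like(V)
    for i in range(n):
        q, k, v = phi(Q[i]), phi(K[i]), V[i]
        S += np.outer(k, v)   # constant-size state update per token
        z += k
        out[i] = (q @ S) / (q @ z + 1e-6)
    return out
```

Because each token only updates the fixed-size state instead of attending to every previous token, a 1M-token pass never materializes an n×n attention matrix.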
The efficiency gains are staggering. MiniMax-M1 consumes just ~25% of the FLOPs of DeepSeek-R1 at 100K-token generation, and under 50% at 64K. This is due in large part to Lightning Attention, which avoids the quadratic bottleneck of traditional softmax attention by approximating token influence with linear complexity. The architecture interleaves one standard softmax-attention block after every seven Lightning Attention blocks, maintaining semantic fidelity while dramatically lowering inference cost: a rare blend of theory and pragmatism.
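A quick back-of-the-envelope comparison shows why the savings compound with context length. The numbers below are illustrative, not MiniMax's measured figures, and they ignore MoE routing, the periodic softmax blocks, and constant factors:

```python
# Rough per-layer attention cost at context length n, with state dim d.
n, d = 100_000, 128
softmax_flops = n * n * d   # quadratic in sequence length
linear_flops = n * d * d    # linear: fixed d x d state per token
print(f"softmax/linear ratio ~ {softmax_flops / linear_flops:.0f}x")  # ~781x
```

The ratio scales as n/d, so the longer the context, the larger the advantage of the linear blocks.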
Equally impressive is the model's training methodology. The full reinforcement learning run took just three weeks on 512 H800 GPUs, at a rental cost of roughly $534K; MiniMax-M1 shows what a well-orchestrated curriculum and hardware-aware engineering can achieve. The use of CISPO (Clipped Importance Sampling Policy Optimization) in the RL fine-tuning phase is a standout innovation. Rather than clipping updates per token, CISPO clips the importance sampling weights themselves, leading to more stable policy learning in deep MoE hybrids.
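Based on the paper's description, a minimal PyTorch sketch of the CISPO surrogate might look like the following; the clipping bound, the one-sided clip, and the tensor names are assumptions for illustration, not the paper's exact hyperparameters. The key move is detaching the clipped importance weight, so that, unlike PPO-style token clipping, every token keeps a gradient signal:

```python
import torch

def cispo_loss(logp_new, logp_old, advantages, eps_high=0.2):
    """Sketch of the CISPO idea: clip (and detach) the importance-sampling
    weight itself rather than the PPO surrogate, so tokens whose ratios
    fall outside the trust region still contribute gradients."""
    ratio = torch.exp(logp_new - logp_old)                 # IS weight r_t
    clipped_w = torch.clamp(ratio, max=1 + eps_high).detach()
    # REINFORCE-style update weighted by the frozen, clipped IS weight.
    return -(clipped_w * advantages * logp_new).mean()
```

Contrast this with PPO, where tokens outside the clip range are dropped from the gradient entirely; here they are merely down-weighted, which the paper credits for more stable learning.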
Benchmarks bear this out: MiniMax-M1 rivals or surpasses state-of-the-art open-weight LLMs on tasks ranging from AIME math problems (86% accuracy) to LiveCodeBench programming to long-context retrieval. Its balance of scale, efficiency, and architectural originality makes it a landmark release. With Apache 2.0 licensing and planned ecosystem tooling, MiniMax-M1 positions itself not just as a research artifact but as a platform.
In an era where scale often comes at the expense of openness or accessibility, MiniMax-M1 makes a bold counterpoint: we can have million-token contexts, powerful reasoning, and open weights—without burning a billion dollars. That vision, executed with elegance, makes M1 a high watermark in the evolution of open-source AI.
🔎 AI Research
TaskCraft: Automated Generation of Agentic Tasks
AI Lab: OPPO AI Agent Team
TaskCraft introduces a scalable, automated pipeline for generating multi-step, tool-based agentic tasks using depth- and width-based extensions and trajectory-aware verification. The framework outputs over 36,000 synthetic tasks, enabling effective fine-tuning and evaluation of autonomous agents in complex environments.
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
AI Lab: MiniMax
MiniMax-M1 is an open-weight hybrid-attention model combining Mixture-of-Experts with Lightning Attention to enable low-FLOP, long-context reasoning (up to 1 million tokens) and superior performance on coding, tool use, and software engineering tasks. It introduces the CISPO reinforcement learning algorithm and demonstrates strong results with low training cost across diverse benchmarks.
XBENCH: Tracking Agents Productivity Scaling with Profession-Aligned Real-World Evaluations
AI Lab: Multi-university collaboration including CMU, Tsinghua, MIT, Oxford, and more
XBENCH introduces a new benchmark suite for evaluating AI agents' commercial utility in real-world domains like recruitment and marketing by aligning evaluations with actual professional workflows and productivity metrics. It provides tasks derived from live industry operations and aims to track technology-market fit (TMF) across time for profession-specific agents.
New Methods Boost Reasoning in Small and Large Language Models
AI Lab: Microsoft Research Asia
This work introduces several new reasoning techniques including MCTS-guided stepwise thinking, symbolic equivalence checking, and process-level supervision to improve mathematical reasoning in small and large models. These methods notably improve accuracy on benchmarks like AIME and MATH while enhancing generalization across domains like science and code.
All is Not Lost: LLM Recovery without Checkpoints
AI Lab: Gensyn, University of Neuchâtel, TU Delft
This paper introduces CheckFree, a lightweight and storage-free failure recovery method for pipeline-parallel LLM training that reconstructs a failed stage by averaging the weights of its neighboring layers, avoiding costly checkpointing or redundant computation. Its extension, CheckFree+, adds support for recovering first and last stages using out-of-order execution and small weight replication, outperforming traditional recovery techniques in training throughput under realistic failure scenarios.
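As a rough illustration of the recovery rule (assuming homogeneous pipeline stages with identical parameter layouts; the names are hypothetical), the core of the neighbor-averaging idea fits in a few lines:

```python
import torch

def recover_failed_stage(prev_sd: dict, next_sd: dict) -> dict:
    """Rebuild a failed pipeline stage's state dict by averaging the
    weights of its immediate neighbors, per the CheckFree idea. Assumes
    stages share parameter shapes; the first and last stages, which have
    only one neighbor, need the CheckFree+ treatment instead."""
    return {name: (prev_sd[name] + next_sd[name]) / 2 for name in prev_sd}
```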
Revisiting Reinforcement Learning for LLM Reasoning from a Cross-Domain Perspective
AI Lab: UC San Diego, MBZUAI, Carnegie Mellon University, Purdue University
This work introduces GURU, a 92K-example RL reasoning dataset across six domains (Math, Code, Science, Logic, Simulation, Tabular), and trains GURU-7B/32B, models that achieve SOTA performance among open RL-trained models on general reasoning. The study shows that cross-domain RL enhances pretrained-domain reasoning (e.g., math/code), while in-domain RL is required to develop new skills in underrepresented domains, challenging prior assumptions that RL merely elicits existing knowledge.
🤖 AI Tech Releases
Mistral Small 3.2
Mistral released a new version of its marquee small model.
Gemini 2.5
Google made Gemini 2.5 Pro and Flash generally available and previewed Gemini 2.5 Flash-Lite.
OpenAI Agent Demo
OpenAI released a customer service agent demo using its Agents SDK.
🛠 AI in Production
Building a Research Agent
Anthropic shared a remarkable amount of detail about the implementation of its multi-agent research system.
📡 AI Radar
Vast Data is reportedly seeking a new funding round that could value its AI-optimized storage platform at approximately $25 billion.
Mira Murati’s Thinking Machines raised $2 billion.
Character.ai has a new CEO.
Controversial AI startup Cluely raised $15 million.
OpenAI for Government was launched as a new offering tailored to public-sector use of OpenAI’s technology.
Apple and Meta evaluated acquiring Perplexity.
Anysphere introduced an “Ultra” plan at $200 per month for its Cursor AI coding tool, promising ~20× the usage of its Pro tier.
Sword Health secured $40 million at a $4 billion valuation, while delaying its planned IPO to 2028 or beyond.
Alta, an AI-powered virtual stylist inspired by Clueless, raised $11 million in seed funding with heavyweight backers.
SportsVisio garnered $3.2 million to expand its AI-powered sports analytics platform for athletes, coaches, and fans.
Masayoshi Son of SoftBank unveiled plans for a $1 trillion “Project Crystal Land” in Arizona to build a robotics and AI manufacturing hub.
Alibaba-backed AI startup MiniMax (valued near $3 billion) is reportedly preparing for a Hong Kong IPO as soon as this year.
xAI, Elon Musk’s AI company behind Grok, is in talks to raise $4.3 billion in equity (part of a larger $9.3 billion funding package) to support its operations.