TheSequence

The Sequence Opinion #892: The Anatomy of a Good Environment: When Verifiability is Not Enough

Jesus Rodriguez — Thu, 09 Jul 2026 11:02:57 GMT

Quick note: For over two years now, we’ve been running The Sequence without sponsors. Literally every week we receive dozens of inquiries to sponsor The Sequence. This is a passion project that I have been running for over five years now, and we are trying to keep The Sequence true to its ethos of technical depth, objectivity and unique content to help you navigate the world of AI. The best way to help us on that journey is to subscribe below. We understand it is not for everyone, but we appreciate the support.

I was listening to the recent Dwarkesh Patel episode with Grant Sanderson, where he argues that “grindability is just as important as verifiability,” and it crystallized something I’ve been chewing on for months, so I decided to elaborate on that thesis here. The question is deceptively simple: what makes a domain a good domain for AI? Not good in the sense of commercially interesting, but good in the sense that if you point a modern training pipeline at it, capability actually compounds.

The standard answer is verifiability, and it’s not wrong. But I’ve come to believe verifiability is just one axis in a higher-dimensional space, and the domains where AI has been shockingly good (math, code, board games) are the ones that happen to score high on all the axes simultaneously. The domains where progress has been grinding and disappointing, like computer use, robotics, and open-ended knowledge work, are usually strong on one or two axes and quietly broken on the others. Once you see the full set of properties, a lot of otherwise confusing facts about the current moment snap into focus: why reasoning models got good at math before they got good at your inbox, why a cottage industry of RL environment startups suddenly commands billion-dollar budgets, and why I think some of those environments are going to disappoint their buyers.

Let me walk through the axes one at a time. For each one I’ll try to give you a domain that has the property and a domain that conspicuously lacks it, because the contrast is where the intuition lives.

Verifiability

The Sequence AI of the Week #891: Prompting a Spreadsheet: Inside Google’s TabFM for Tabular AI

Jesus Rodriguez — Wed, 08 Jul 2026 11:02:30 GMT

There’s a running joke in machine learning that the field’s most valuable model isn’t a transformer at all — it’s gradient-boosted trees fit on a CSV. For all the noise about frontier models, the workhorses of enterprise ML are still XGBoost pipelines predicting churn, fraud, and credit risk on rows and columns. And the workflow around them has barely changed in a decade: load the table, engineer features, cross-validate, tune hyperparameters, repeat until the AUC stops moving. Every new dataset means starting the ritual over.

Google Research just took direct aim at that ritual. TabFM, released a few days ago, is a foundation model for tabular classification and regression that produces predictions on tables it has never seen, in a single forward pass, with no training, no tuning, and no feature engineering. You hand it the whole problem — training rows, test rows, all of it — as one giant prompt, and it answers. It’s in-context learning, but for spreadsheets. And if the name sounds familiar, it should: this is the same team’s playbook from TimesFM, the time-series foundation model that quietly became one of the most-deployed research artifacts Google has shipped. Understanding TabFM really requires understanding that lineage, so let’s start there.
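
To make the calling convention concrete, here is a minimal sketch of what "the whole problem as one prompt" looks like from the caller's side. It is not TabFM's architecture or API: `predict_in_context` is a hypothetical name, the toy churn table is made up, and a k-nearest-neighbor vote stands in for the model's forward pass. The only point is that labeled rows and query rows go in together and predictions come out with no fit step.

```python
from collections import Counter
from math import dist


def predict_in_context(context_rows, context_labels, query_rows, k=3):
    """Hypothetical stand-in for one forward pass of a tabular foundation model.

    Labeled context rows and unlabeled query rows are passed together; nothing is
    trained or tuned. A k-nearest-neighbor vote replaces the real model here.
    """
    predictions = []
    for query in query_rows:
        # Rank context rows by Euclidean distance to the query row.
        ranked = sorted(range(len(context_rows)), key=lambda i: dist(context_rows[i], query))
        # Majority vote over the k closest labeled rows.
        votes = Counter(context_labels[i] for i in ranked[:k])
        predictions.append(votes.most_common(1)[0][0])
    return predictions


# Made-up churn table: (monthly_spend, support_tickets) -> label
context_rows = [(20.0, 5.0), (25.0, 4.0), (22.0, 6.0), (80.0, 0.0), (90.0, 1.0), (85.0, 0.0)]
context_labels = ["churn", "churn", "churn", "stay", "stay", "stay"]
query_rows = [(23.0, 5.0), (88.0, 1.0)]

print(predict_in_context(context_rows, context_labels, query_rows))
```

Swapping in a new table means changing the arguments, not rerunning a training pipeline, which is the workflow shift the paragraph above describes.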

The TimesFM prelude

The Sequence Knowledge #890: A Brief History of Model Distillation

Jesus Rodriguez — Tue, 07 Jul 2026 11:01:03 GMT

The story most people tell about knowledge distillation starts in 2015, with Geoffrey Hinton, Oriol Vinyals, and Jeff Dean introducing a clever softmax temperature trick and a phrase — “dark knowledge” — that immediately lodged itself in the field’s vocabulary. It is a good story. It is also incomplete by almost a decade.

The real history is quieter, more pragmatic, and worth recovering, because the conceptual moves the field made between 2006 and 2015 still define how we think about distillation today. The vocabulary changed. The diagrams changed. The underlying question — what exactly is being transferred from a teacher to a student? — has not. To understand the modern landscape of on-policy distillation, reasoning distillation, and cross-architecture transfer, it helps to walk back through the three papers that built the foundation, and to notice that each one solved a different problem on its way to inventing the same idea.

2006: Compression as Mimicry

The Sequence Radar #889: Fable 5's Comeback, ZCode's Debut, Claude Science, and the $3.5B Deployment Land Grab

Jesus Rodriguez — Sun, 05 Jul 2026 11:00:35 GMT

Next Week in The Sequence:

We continue our series about model distillation.
The AI of the Week dives into Google’s new TabFM model for tabular and time series data.
The opinion section discusses which domains are a good fit for rapid AI progress and which ones are challenging.

Subscribe and don’t miss out:

📝 Editorial: Fable 5's Comeback, ZCode's Debut, Claude Science, and the $3.5B Deployment Land Grab

This week’s developments in AI point clearly to where the space is going: more capable models, and a growing imperative to build the capacity to deliver them.

Start at the model layer, where we just watched the first frontier model get hot-patched by a government. Claude Fable 5 came back online July 1 after 19 days behind an export-control firewall — pulled three days post-launch when Amazon researchers found a jailbreak that coaxed it into surfacing software vulnerabilities. The fix is beautiful in a systems-engineering way: a classifier that catches the technique in over 99% of attempts and, instead of throwing an error, gracefully degrades the request to Opus 4.8. Not a 403. A fallback route. Frontier deployment now looks like a load balancer with a compliance layer, and Anthropic, Amazon, Microsoft, and Google are literally drafting a CVSS for jailbreaks — severity scoring for prompts. Take a second with that.

While the US was busy negotiating classifiers, Z.ai shipped an existence proof. ZCode is a free “Agentic Development Environment” wrapped around GLM-5.2: a 744B-parameter MoE with 40B active, a genuine one-million-token context window, 28.5 trillion training tokens, MIT-licensed weights — and trained on Huawei silicon, no American chips required. The pitch practically writes itself: self-host the weights and there is no kill switch. The Fable suspension turned “regulatory tail risk” from a slide in someone’s deck into a lived 19-day outage, and Z.ai is selling the mitigation. Same story, opposite sides of the Pacific.

One layer up, Anthropic launched what I’d call claude --science. Claude Science isn’t a new model — it’s a workbench: a coordinating agent with 60+ curated skills and connectors across genomics, proteomics, structural biology, and cheminformatics, rendering protein structures natively, running on your own cluster over SSH, with every figure shipping alongside the exact code and environment that produced it. Reproducibility as a first-class primitive. A reviewer agent even runs alongside the pipeline, flagging citations that don’t resolve — CI/CD for manuscripts. The bet is explicit: do for wet labs what Claude Code did for codebases. If the analogy holds even at 50%, biology just got its terminal moment.

And at the top of the stack, the money moved to the least sexy layer: deployment. Microsoft committed $2.5B and 6,000 engineers to Microsoft Frontier Co., two days after AWS put $1B behind its own forward-deployed engineering push. OpenAI and Anthropic stood up theirs in May. Four of the best-capitalized companies on Earth independently converged on the same gradient: models are no longer the bottleneck. Integration is. The Palantir FDE playbook — engineers embedded in the customer’s messy production environment — is now the industry’s default architecture.

Zoom out and the inversion is complete. Capability is the commodity. Classifiers, weights licenses, workbenches, and boots-on-the-ground are the moat. The frontier isn’t the model anymore. It’s the runtime around it.

🔎 AI Research

SKILLOPT: Executive Strategy for Self-Evolving Agent Skills

AI Lab: Microsoft

Summary: This paper introduces SKILLOPT, a controllable text-space optimizer for training agent skills as the external state of a frozen agent, using trajectory feedback to make bounded edits on a skill document. Across multiple benchmarks and harnesses, SKILLOPT significantly lifts average no-skill accuracy and outperforms competitors without requiring additional inference-time model calls.

Nemotron-Labs-TwoTower: Diffusion Language Modeling with Pretrained Autoregressive Context

AI Lab: NVIDIA

Summary: The authors propose TwoTower, an architecture that separates diffusion language modeling into a frozen autoregressive context tower for clean tokens and a trainable diffusion denoiser tower for noisy blocks. Instantiated on a 30B parameter model, Nemotron-Labs-TwoTower maintains near-baseline autoregressive quality while achieving 2.42x higher wall-clock generation throughput.

TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents

AI Lab: Meta AI

Summary: TUA-Bench is a new benchmark designed to evaluate general-purpose terminal-use agents across 120 diverse, real-world tasks spanning everyday digital activities and specialized scientific workflows. Evaluations reveal that even the strongest frontier agents currently achieve only a 65.8% success rate, highlighting significant remaining challenges in long-horizon planning, error recovery, and tool use within terminal environments.

Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLMs

AI Lab: Yale University & Google Research

Summary: This research introduces Reinforcement Learning with Metacognitive Feedback (RLMF), a paradigm that prioritizes and rewards large language models based on the accuracy of their self-assessed performance judgments. By applying this framework alongside targeted rewriting, models learn to faithfully calibrate and express their internal uncertainty numerically and linguistically across diverse tasks while maintaining task accuracy.

SkillHone: A Harness for Continual Agent Skill Evolution Through Persistent Decision History

AI Lab: WeChat, Tencent Inc.

Summary: SkillHone is a framework for the continual evolution of agent skills that maintains a persistent decision history, recording diagnoses, revisions, redacted evaluation evidence, and outcomes. This persistent context enables role-separated subagents to refine skills across sessions without re-deriving past rationale, leading to state-of-the-art performance on deep-research benchmarks like GAIA and WebWalkerQA-EN.

Introducing TabFM: A zero-shot foundation model for tabular data

AI Lab: Google Research

Summary: TabFM is a zero-shot foundation model that simplifies tabular classification and regression workflows by utilizing in-context learning to bypass the need for traditional model training, hyperparameter tuning, and manual feature engineering. Trained entirely on hundreds of millions of synthetic datasets, its hybrid architecture employs alternating row and column attention to natively capture complex feature interactions, allowing users to generate highly accurate predictions on unseen tables in a single forward pass.

🤖 AI Tech Releases

Fable 5

Anthropic redeployed Fable 5 after the US export controls were lifted.

ZCode

Z.ai released ZCode, a new development environment based on GLM-5.2.

Claude Science

Anthropic announced Claude Science, a workbench for scientists.

📡10 AI News You Need to Know About

Anthropic × Samsung custom chip — Anthropic is in talks with Samsung to explore collaborating on a custom AI chip, though it hasn’t yet decided the chip’s purpose, server role, or power level, while stressing that its diversified Google/Amazon/Nvidia hardware stack remains central to its compute strategy. Original source (The Information broke it): https://www.theinformation.com/articles/anthropic-talks-samsung-manufacture-custom-ai-chip
Meta launches Pocket — Meta quietly launched Pocket, an experimental app born from its Gizmo acquisition that lets users generate and share small interactive apps and games (“gizmos”) from AI text prompts, complete with a scrollable discovery feed. No official Meta announcement exists yet — the closest primary source is the app’s Google Play listing: https://play.google.com/store/apps/details?id=com.facebook.gizmo
Microsoft Frontier Company — Microsoft launched Microsoft Frontier Company, a new operating business backed by a $2.5 billion investment and 6,000 industry and engineering experts to deliver enterprise AI deployments, which Judson Althoff says “goes beyond” the Forward-Deployed Engineering label.
Venice AI unicorn round — Venice, the privacy-first platform offering access to 200+ AI models with client-side encryption and no server-side data storage, raised a $65M Series A led by Dragonfly at a $1B valuation — its first outside capital — while already profitable on $70M+ annualized revenue.
Etched comes out of stealth — Etched emerged from stealth with working first-pass A0 silicon on TSMC’s N4P process, $800M raised (most recently $500M at a $5B post-money valuation), and over $1 billion in signed customer contracts for its rack-scale “frontier inference clusters” shipping this summer.
Amazon’s $1B FDE org — AWS created a dedicated Forward Deployed Engineering organization backed by a $1 billion investment that embeds engineers and purpose-built agents directly inside customer teams, pitching an agentic-first model that compresses deployment timelines from months to days and leaves customers self-sufficient.
Arena hits $100M — Arena, the UC Berkeley-born crowdsourced AI leaderboard, reached $100 million in annualized run-rate revenue just eight months after launching its commercial AI Evaluations service — up from $30M at its January Series A.
Crusoe raising ~$3B — Crusoe, which supplies AI data center capacity to the likes of Meta and Oracle, is in talks to raise about $3 billion in a round investors expect to land around a $30 billion valuation — roughly triple its $10 billion mark from October. Original source: Bloomberg.
ElevenLabs $22B tender — ElevenLabs has held early talks with investors for a secondary tender offer letting employees sell shares at a roughly $22 billion valuation — double its February round — with the tender expected by September. Original source: Bloomberg.
Meta cloud business — Meta is developing plans for a cloud infrastructure business selling access to AI compute and models — including a Bedrock-style offering hosting its Muse Spark models — putting it in direct competition with AWS, Azure, and Google Cloud; the news sent Meta shares up ~9% while hitting neoclouds CoreWeave and Nebius.

The Sequence Opinion #888: Everything You Need to Know About the AI in Space Race

Jesus Rodriguez — Thu, 02 Jul 2026 11:03:47 GMT

The core thesis of this essay is simple to state: space is becoming a new frontier for AI — and one of the most competitive ones. AI's frontiers have always been defined by scarcity. When the scarce thing was ideas, the frontier was architectures; when it was data, the frontier was the open web; when it was FLOPs, the frontier was the fab. Today the scarce thing is energy — grid capacity, cooling water, land, permits — and orbit is the one place in reach where energy is effectively unmetered and no zoning board has jurisdiction. That makes low Earth orbit not a science experiment but contested economic territory, and the industry is treating it exactly that way: trillion-dollar companies, hyperscalers, chipmakers, nation-states, and venture-backed startups are all filing, launching, and spending against each other on compressed timelines, with real hardware already running in orbit. As of December 2025, the first large language model ever trained in space was nanoGPT — Karpathy's minimalist GPT repo — trained on the complete works of Shakespeare aboard an H100 in a 130-pound satellite. The frontier is no longer hypothetical; it has a loss curve.

This essay discusses the core thesis of AI in space: value proposition, key players, architecture differences and much more.

The core thesis: compute is now an energy problem, and space is an energy solution

The Sequence AI of the Week #887: Meta's Autodata: When Models Learn to Make Their Own Lessons

Jesus Rodriguez — Wed, 01 Jul 2026 11:02:36 GMT

Today, we are covering an amazing paper published by Meta last week: https://arxiv.org/abs/2606.25996

There is a quiet shift happening in AI training. For years, the center of gravity was the model: more parameters, more GPUs, better architectures, longer context windows, better optimizers. Data mattered, of course, but data was often treated as something upstream of the real action. You scraped it, filtered it, labeled it, maybe mixed it carefully, and then the training run began.

Meta’s new Autodata work flips that perspective.

The core idea is simple but powerful: what if data creation itself becomes an agentic process? Not a one-shot prompt. Not a static synthetic-data recipe. Not “ask a strong model to generate a million examples and hope the distribution is useful.” Instead, Autodata treats data generation like a miniature research loop. An AI agent creates examples, tests them, studies the failures, updates its recipe, and tries again.
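
The loop described above can be sketched in a few lines. This is a toy under loud assumptions, not Meta's implementation: the "recipe" is reduced to a difficulty center and spread, `generate`, `evaluate`, and `refine` are hypothetical helpers, and a fixed skill threshold stands in for a real student model.

```python
def generate(recipe, n=10):
    """Create n toy examples whose difficulty is spread around the recipe's center."""
    center, spread = recipe["center"], recipe["spread"]
    return [center + spread * (i / (n - 1) - 0.5) for i in range(n)]


def evaluate(examples, student_skill=7.0):
    """Toy test harness: the student 'solves' any example easier than its skill."""
    return [difficulty < student_skill for difficulty in examples]


def refine(recipe, results, target=0.5, step=1.0):
    """Study the pass/fail pattern and move the recipe toward the target pass rate."""
    pass_rate = sum(results) / len(results)
    updated = dict(recipe)
    if pass_rate > target:
        updated["center"] += step  # too easy: make examples harder
    elif pass_rate < target:
        updated["center"] -= step  # too hard: make examples easier
    return updated, pass_rate


recipe = {"center": 2.0, "spread": 4.0}
history = []
for round_index in range(8):
    examples = generate(recipe)                  # create
    results = evaluate(examples)                 # test
    recipe, pass_rate = refine(recipe, results)  # study failures, update the recipe
    history.append(pass_rate)

print(history)
print(recipe)
```

The contrast with a one-shot prompt is the feedback edge: each round's pass rate changes what gets generated next, so the recipe drifts toward examples at the edge of the student's ability instead of staying wherever the first prompt put it.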

The Sequence Knowledge #886: Demystifying Model Distillation

Jesus Rodriguez — Tue, 30 Jun 2026 11:02:25 GMT

The simplest way to understand knowledge distillation is to imagine a very expensive teacher and a very cheap student.

The teacher is a large model: smart, slow, high-capacity, expensive to run. The student is smaller: faster, cheaper, easier to deploy, but usually less capable if trained in the standard way. Distillation asks a very practical question:

Can the student learn not only from the original dataset, but from the teacher’s behavior?

In other words, instead of training the small model directly on reality, we train it on reality as interpreted by the big model.

That sentence is the whole trick.
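
As a minimal sketch of that idea, the snippet below blends the usual hard-label loss with a loss against the teacher's softened output distribution, in the temperature-scaled style associated with the 2015 Hinton, Vinyals, and Dean paper. The logits, `temperature`, and `alpha` are made-up illustration values, and plain Python stands in for a training framework.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_label, temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL(teacher || student) on softened outputs."""
    # "Reality": cross-entropy against the ground-truth label.
    hard_loss = -math.log(softmax(student_logits)[true_label])
    # "Reality as interpreted by the teacher": match the teacher's soft distribution.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    soft_loss = sum(t * math.log(t / s) for t, s in zip(teacher_soft, student_soft))
    # The temperature**2 factor keeps the soft term's gradient scale comparable across temperatures.
    return alpha * hard_loss + (1 - alpha) * temperature ** 2 * soft_loss


teacher_logits = [4.0, 2.5, -1.0]  # the teacher treats class 1 as a plausible runner-up
good_student = [3.8, 2.4, -0.9]
poor_student = [3.8, -0.9, 2.4]    # same top-1 answer, wrong runner-up

print(distillation_loss(good_student, teacher_logits, true_label=0))
print(distillation_loss(poor_student, teacher_logits, true_label=0))
```

Both students assign the same probability to the correct label, so the hard-label term cannot tell them apart. Only the teacher term penalizes the second student for ranking the wrong class as runner-up, which is the extra signal distillation adds.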

A traditional training setup looks like this:

The Sequence Radar #885: Last Week in AI: Models, Games, and the Future of Evaluation

Jesus Rodriguez — Sun, 28 Jun 2026 11:01:33 GMT

Next Week in The Sequence:

We continue our series about distillation.
In the AI of the Week, we discuss Meta’s amazing Autodata paper.
In the opinion section, we debate the amazing topic of AI in space.

📝 Editorial: Last Week in AI: Models, Games, and the Future of Evaluation

This week in AI had the strange feeling of a stack trace resolving itself. For years, the industry has been marching toward the same destination from different directions: better models, richer environments, more autonomous agents, and harder evaluations. This week, those threads snapped together into something legible. AI is no longer just learning to answer. It is learning to act.

Start with OpenAI’s GPT-5.6 release. Or more precisely, its limited preview. The naming alone tells a story: Sol, Terra, Luna. A flagship model, a balanced model, and a fast, cheaper model. The product taxonomy is becoming planetary because the market is no longer asking for “the best model” in the abstract. It wants intelligence at different temperatures: deep reasoning for frontier work, affordable competence for everyday automation, and high-throughput inference for systems that need to move fast.

But the most interesting part of GPT-5.6 is not the benchmark curve. It is the release shape. This is a model launched with a safety architecture, a government coordination layer, and a phased-access strategy. That matters. Frontier AI releases are starting to look less like software updates and more like controlled deployment of critical infrastructure. We used to ask whether a model could write better code. Now we ask who gets access, under which constraints, with what monitoring, and how quickly defenders can use the same capabilities attackers will inevitably want.

Alongside this, Anthropic quietly introduced Claude Tag, a feature that signals another subtle shift in how we interact with models. Claude Tag allows users to structure prompts and responses with explicit semantic markers, making it easier for models to track context, roles, and intent across longer interactions. It is a small interface change with outsized implications: as models become more agentic, the way we communicate with them must evolve from loose conversation into something closer to structured collaboration. Claude Tag hints at a future where prompting is less about clever phrasing and more about designing clear, machine-readable workflows.

Then came General Intuition’s new raise, which feels like the cleanest signal yet that the next data frontier is not text, or even video, but action. The company’s thesis is beautifully nerdy: video games are not just entertainment; they are compressed laboratories of intent, perception, movement, failure, reward, and adaptation. A gameplay clip is not merely pixels. It is pixels plus choices. What did the player see? What did they try? What happened next? That action-labeled loop is exactly what language models are missing when they attempt to reason about the physical world from static media.

In other words, General Intuition is betting that Minecraft, Fortnite-like environments, simulations, and gamer behavior might become for embodied AI what the web was for language models: the messy, gigantic pretraining substrate from which generality emerges.

And then, in the most delightful possible version of this same story, the LayerLens Stratix Cup turned AI evaluation into soccer.

The final between Claude Opus 4.8 and GPT-5.5 was not just a spectacle. It was a different kind of benchmark. Sixteen models wrote their own strategies, controlled teams, adapted between rounds, and survived inside an environment where intelligence had to become policy. Not prose. Not a leaderboard answer. Executable behavior. Claude Opus 4.8 defeating GPT-5.5 1–0 in the final is fun as a result, but the deeper point is methodological: we need arenas where models reveal themselves under pressure, with imperfect information, feedback loops, and consequences.

That is the connective tissue of the week. GPT-5.6 pushed the frontier of controlled capability. General Intuition pushed the frontier of action data. Stratix Cup pushed the frontier of evaluation.

The model is becoming less like a chatbot and more like an organism in a sandbox: sensing, planning, acting, failing, adapting. The future of AI will not be decided only by who has the biggest model. It will be decided by who builds the best worlds for models to learn in, the best guardrails for them to operate within, and the best games to discover what they can actually do.

🔎 AI Research

Autodata: An agentic data scientist to create high quality synthetic data

AI Lab: FAIR at Meta

Summary: This paper introduces Autodata, a framework where an AI agent acts as a data scientist to iteratively generate, evaluate, and refine synthetic training and evaluation data. By meta-optimizing the agent itself, the method significantly improves data quality and downstream model performance across complex reasoning and verifiable tasks.

Improved Large Language Diffusion Models

AI Lab: Gaoling School of Artificial Intelligence, Renmin University of China and ByteDance Seed

Summary: This work presents iLLaDA, an 8B-parameter masked diffusion language model trained from scratch using fully bidirectional attention and scaled to 12 trillion tokens. The model introduces variable-length generation and confidence-based scoring, leading to substantial performance gains over previous diffusion models while remaining competitive with strong autoregressive baselines.

Are We Ready For An Agent-Native Memory System?

AI Lab: Shanghai Jiao Tong University, Tsinghua University, MemTensor (Shanghai) Technology Co., Ltd

Summary: The authors systematically evaluate 12 representative agent memory systems from a data management perspective, decomposing them into representation, extraction, routing, and maintenance modules. Through extensive end-to-end benchmarks, they reveal that no single architecture dominates; rather, effectiveness depends on aligning the memory structure with the specific workload bottleneck and utilizing localized maintenance for cost efficiency.

MEMPROBE: Probing Long-Term Agent Memory via Hidden User-State Recovery

AI Lab: University of Illinois Chicago, KU Leuven, UC San Diego

Summary: MEMPROBE is a novel benchmark that audits long-term agent memory by testing how well an agent can reconstruct a simulated user’s hidden state after a series of interactions. Testing state-of-the-art systems reveals that while agents can easily complete immediate tasks, they struggle to successfully retrieve and consolidate episodic memory, highlighting a major bottleneck in current memory designs.

Qwen-AgentWorld: Language World Models for General Agents

AI Lab: Qwen Team

Summary: This research introduces Qwen-AgentWorld, a foundational language world model designed to simulate seven diverse agentic environments through long chain-of-thought reasoning. By utilizing the model as both a decoupled environment simulator and a unified agent foundation model, the researchers demonstrate significant enhancements in agent training, scalability, and downstream task performance.

Tapered Language Models

AI Lab: Mila, Cornell University, Université de Montréal, CIFAR AI Chair

Summary: This paper proposes Tapered Language Models (TLMs), an architectural design that monotonically tapers parameter capacity across a model’s depth under a fixed total budget, front-loading the capacity to earlier layers. Focusing on MLP width, the authors show that a smooth cosine decay schedule consistently improves perplexity and downstream reasoning accuracy across multiple architectures without increasing overall parameters or compute costs.
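
A rough sketch of what a cosine taper under a fixed budget could look like for MLP widths. The parameterization is an assumption for illustration, not the paper's: `tapered_widths`, the `ratio` between first-layer and last-layer width, and the rescaling step are all hypothetical, and summed width is used as a proxy for the parameter budget (MLP parameters grow linearly with hidden width at a fixed model dimension).

```python
import math


def tapered_widths(num_layers, uniform_width, ratio=3.0):
    """Per-layer MLP widths that decay along a cosine curve while keeping the total fixed.

    ratio is the assumed first-layer / last-layer width ratio (a hypothetical knob).
    """
    # Cosine decay from 1.0 at the first layer down to 1/ratio at the last layer.
    floor = 1.0 / ratio
    shape = [
        floor + (1.0 - floor) * 0.5 * (1.0 + math.cos(math.pi * layer / (num_layers - 1)))
        for layer in range(num_layers)
    ]
    # Rescale so the summed width matches a uniform model with the same budget.
    budget = num_layers * uniform_width
    scale = budget / sum(shape)
    return [scale * s for s in shape]


widths = tapered_widths(num_layers=12, uniform_width=4096)
for layer, width in enumerate(widths):
    print(layer, round(width))
```

In practice the widths would also be rounded to hardware-friendly multiples, which makes the budget match approximate rather than exact.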

🤖 AI Tech Releases

GPT-5.6 Sol

OpenAI unveiled three new models, Sol, Terra, and Luna, as part of its GPT-5.6 suite.

Claude Tag

Anthropic released Claude Tag, a new way for teams to interact with Claude.

Mistral OCR

Mistral released Mistral OCR, its latest document understanding model.

📡10 AI News You Need to Know About

Patronus AI raises $50M Series B — Agent-evaluation startup Patronus AI raised a $50M Series B led by Greenfield Partners (total funding now $70M) and unveiled its first “Digital World Models,” large-scale simulation environments for training and stress-testing AI agents.
General Intuition raises $320M at $2.3B — General Intuition, a spinout of gaming-clip platform Medal, raised $320M at a $2.3B valuation (led by Khosla Ventures) to train “large action model” AI agents on billions of action-labeled gameplay clips for robotics and real-world use.
Netris raises $15M Series A — Network-automation startup Netris raised a $15M Series A led by a16z to expand its NAAM platform, which automates and isolates the networking layer so AI “neocloud” operators can bring GPU clusters online in weeks instead of months.
Cerebras stock plunges after earnings — Cerebras shares fell nearly 20% after its first post-IPO earnings, as a full-year core gross-margin forecast of 38–41% (down from 47% in Q1) spooked investors, with CEO Andrew Feldman arguing the guidance was “misunderstood” and reflects a temporary decision to lease systems back from a customer while it builds data-center capacity.
Groq confirms $650M raise — Six months after Nvidia licensed its chip tech and poached its founder, Groq confirmed a $650M raise (led by Disruptive and Infinitum) and a rebuilt executive bench to pivot toward selling AI inference cloud capacity across its 13 data centers.
Google DeepMind invests $75M in A24 — Google DeepMind announced a “first-of-its-kind” research partnership with film studio A24, including a ~$75M investment, to co-develop AI filmmaking tools with working filmmakers.
General Intuition in talks to raise $300M — This June 18 TechCrunch scoop reported that General Intuition was in talks to raise ~$300M at around a $2B valuation; it’s the rumor that item #2 above later confirmed.
US awards $250M to I-Pulse — The Commerce Department’s CHIPS R&D program signed a definitive agreement to give I-Pulse — a pulsed-power and semiconductor venture co-founded by mining billionaire Robert Friedland — a $250M award to develop silicon-carbide chips for a high-power geothermal drilling technique and defense applications.
SK Hynix files for ~$29.4B US listing — SK Hynix filed to raise up to 45.45 trillion won (~$29.4B) via a Nasdaq ADR listing (expected to begin trading July 10), tapping US investor appetite for AI memory after an ~850% one-year stock run, with proceeds earmarked for HBM fabs, packaging plants, and EUV equipment.
ByteDance seeks $20B offshore loan — The TikTok parent is in early talks with banks for roughly $20B in new offshore borrowing — its largest ever — to help fund an aggressive AI-infrastructure buildout, with a possible three-year term extendable to five.

The Sequence Opinion #884: Self-Driving Labs: The Laboratory That Chooses Its Next Experiment

Jesus Rodriguez — Fri, 26 Jun 2026 10:58:46 GMT

A normal laboratory is already a kind of computer. It has sensors, actuators, memory, protocols, data outputs and error states. But the operating system is usually a human scientist. The scientist decides what to test, transfers samples between instruments, inspects the results, updates their mental model and chooses the next experiment.

A self-driving lab moves part of that loop into software.

The basic idea is simple: connect AI to automated experimental hardware, then let the results of each experiment influence what the system does next. The lab is not just running a long queue of prewritten instructions. It is learning while it works. It makes something, measures it, updates a model and chooses the next move.

This is the key distinction between automation and autonomy. An automated liquid handler can pipette 10,000 wells according to a script. A self-driving lab can run the first few hundred experiments, notice that most of the remaining design space looks unpromising and redirect itself toward better candidates. Automation executes. Autonomy decides.

The simplest mental model is a loop:

design → make → test → learn → design again
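
The difference between executing a script and choosing the next experiment can be sketched with a toy closed loop. Everything here is a stand-in: `run_experiment` is a hypothetical simulated assay with one tunable temperature and a single-peaked response, and the "learn" step is a simple narrowing of the search window rather than the model-based optimization a real self-driving lab would more likely use.

```python
def run_experiment(temperature_c):
    """Hypothetical simulated assay: yield peaks at 63 degrees C (the loop does not know this)."""
    return 100.0 - (temperature_c - 63.0) ** 2


def self_driving_loop(low, high, rounds=6, batch=5):
    """design -> make/test -> learn -> design again, narrowing around the best result."""
    log = []
    for _ in range(rounds):
        # Design: spread a small batch of candidates across the current window.
        step = (high - low) / (batch - 1)
        candidates = [low + i * step for i in range(batch)]
        # Make and test: run each candidate and record (yield, temperature).
        results = [(run_experiment(t), t) for t in candidates]
        log.extend(results)
        # Learn: keep only the neighborhood of the best candidate for the next round.
        _, best_t = max(results)
        low, high = best_t - step, best_t + step
    best_yield, best_t = max(log)
    return best_t, best_yield, len(log)


best_t, best_yield, experiments_run = self_driving_loop(low=20.0, high=100.0)
print(best_t, best_yield, experiments_run)
```

The loop runs 30 experiments and ends with a grid spacing of 0.625 degrees. A prewritten script sweeping the full range from 20 to 100 degrees at that same spacing would need 129 runs, most of them in regions the first round already showed to be unpromising.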

The Sequence AI of the Week #883: Qwen is Getting Into Robotics

Jesus Rodriguez — Thu, 25 Jun 2026 11:01:13 GMT

For about three years now, the Qwen family has lived inside a rectangle. It reads your code, looks at your screenshots, answers your questions, and the whole time it has been doing this behind glass. It can describe a coffee cup in exquisite detail. It cannot pick one up.

That gap — the one between a model that understands the physical world and a model that can move something in it — is the single most honest sentence in Alibaba’s June launch of the Qwen-Robot Suite. The Tongyi Lab team put it plainly: seeing is not acting. The perception and reasoning are already strong. The bottleneck for embodied intelligence is the translation layer between “I see what needs to happen” and “here are the joint torques to make it happen.” Three new models — Qwen-RobotNav, Qwen-RobotManip, and Qwen-RobotWorld — are Alibaba’s bet on closing that gap, and they are interesting less for any single benchmark number than for the shape of the bet.

Let me explain why I think this is the right shape, and where I’d keep my skepticism.

The actual bottleneck is not intelligence, it’s tokenization

The Sequence Knowledge #882: A New Series About Distillation

Jesus Rodriguez — Wed, 24 Jun 2026 10:35:00 GMT

I am super excited about this series, which dives deep into distillation techniques. I use distillation constantly, so I have a lot to say about it :) Over the next few weeks, we are going to cover the evolution of distillation in AI models as well as some of the fundamental techniques in the space.

For most of the modern AI era, progress has been narrated through the language of scale. Bigger models. Bigger datasets. Bigger clusters. Longer context windows. More GPUs. More tokens. More parameters. Scale became the grand explanatory variable, the gravitational force pulling the field forward.

And, to be fair, scale worked.

It gave us models that could write code, reason through math, translate languages, generate images, operate tools, summarize documents, and converse across almost every domain of human knowledge. The frontier model became one of the strangest artifacts in the history of computing: a single neural network that looks less like a program and more like a compressed civilization of patterns.

But scale is not the end of the story. In fact, scale creates the next problem.

The most capable models are also expensive, slow, centralized, difficult to deploy, difficult to specialize, and often impractical for the long tail of real-world use cases. A bank does not always need the largest general-purpose model on earth. It may need a private model that understands compliance workflows. A phone does not need a trillion-parameter oracle in the cloud for every task. It needs fast, local intelligence. A coding agent does not always need a frontier model for every token. It may need a smaller draft model, a specialized debugging model, or a distilled planner trained on expert trajectories. An enterprise does not want generic brilliance. It wants reliable, repeatable, auditable competence.

This is the world in which distillation becomes central.

The Sequence Special #881: The Soccer World Cup of AI Models

Jesus Rodriguez — Mon, 22 Jun 2026 11:34:28 GMT

A fun, personal note to start the week — about AI evaluations, and why we made the best models in the world fight over a virtual ball.

Before we start, watch this 12-second video:

Cool right? Let me explain ;)

A little over a year ago, I co-founded LayerLens on a single bet: that agentic workflows were about to be everywhere, and that evaluations would become a core pillar of the stack — not an afterthought you bolt on once things break in production. LayerLens builds the evaluation and observability layer for that world, working alongside frontier AI teams to ship benchmarks that probe what the standard suites miss.

The thesis was simple to state and hard to execute. For evals to actually matter inside an enterprise, they can’t be academic. They have to be practical, affordable, and grounded in real-world scenarios. A benchmark that costs a fortune to run, or that measures something no one cares about, is just a leaderboard with extra steps. So most of our time goes into building evaluations that are genuinely new — that surface capabilities the usual leaderboards quietly skip over.

Today we have a fun one to share.

Introducing the Stratix Cup

Today, LayerLens is launching the Stratix Cup — a soccer (football, if you insist) tournament in which the top frontier models compete against each other inside a harness that simulates a full soccer environment.

The format is straight out of the World Cup playbook: 16 models, four groups of four, group stage into knockouts, all the way to a single final. Here are the brackets. Every top AI model is there.

The matches are genuinely fun to watch — and weirdly tense. Here’s GLM 5.2 against Gemini 3.5 Flash to give you a feel for it. It’s cool and it looks cool:

Follow @LayerLens_AI on X for hourly updates throughout the tournament — and to throw some support behind a genuinely cool effort.

Why Soccer?

It’s not just World Cup mania (though, fine, that helped).

Games have always been load-bearing in the history of AI. Chess gave us search and evaluation functions. Go gave us self-play and the humbling realization that a network’s “intuition” could outrun human grandmasters. Multiplayer environments gave us coordination, deception, and long-horizon credit assignment. Each one was a clean, adversarial, fully-observable arena where you couldn’t fake competence — either your agent wins or it doesn’t.

Soccer is a great next rung on that ladder. It’s continuous, it’s multi-agent, it punishes brittle strategies, and crucially: you can’t memorize your way to a win. You have to actually reason about a system.

What the Harness Actually Tests

Here’s where it gets interesting. The harness isn’t a single prompt-and-pray call. The structure of a match is what makes it a real agentic evaluation, and it breaks into three distinct phases.

1. Pre-Game. The model reads the match briefing, devises a strategy, writes its team’s code, tests it against baselines, and submits. This is a cold-start task in its purest form: new rules, new constraints, a tight clock, and exactly one submission window. No iterating against a graded oracle. You think, you commit, you live with it.

2. Gameplay. The submitted code now controls all 11 players in real time. And here’s the key detail — the model is not being called every frame. It already authored the policy. What we’re watching is whether the strategy it reasoned its way to in the abstract actually survives contact with a live, adversarial opponent. It’s the gap between “I have a plan” and “the plan works.”

3. Halftime. This is the part I care about most.

At halftime, the model gets access to its own frame log. It can inspect what actually happened in the first half. Maybe the midfield sat too passive. Maybe the defenders all chased the ball and left acres of space behind them. Maybe the attack never formed because the passing logic was too conservative to ever commit. The model then edits its own code and submits a revised strategy for the second half.

That’s the whole game right there. Pre-game tests planning under uncertainty. Gameplay tests whether the plan generalizes. And halftime tests something closer to what we actually want from agents: can you look at evidence of your own failure, diagnose it, and correct course? That’s not a benchmark question. That’s the job.
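The three phases can be sketched as a small driver. Everything here is hypothetical and is not the Stratix Cup harness: `pre_game` and `halftime_revision` stand in for the model writing and editing its team code, and the frame state is a toy dictionary. What the sketch does preserve is the structure described above: the policy is authored once, runs without the model in the loop, and is revised exactly once from its own log.

```python
from dataclasses import dataclass, field
from typing import Callable

Policy = Callable[[dict], str]  # maps a frame state to an action label

@dataclass
class MatchLog:
    frames: list[dict] = field(default_factory=list)

def play_half(policy: Policy, n_frames: int, log: MatchLog) -> None:
    """Gameplay: run the already-authored policy; no model calls per frame."""
    for t in range(n_frames):
        state = {"t": t, "ball_in_own_half": t % 3 == 0}  # toy frame state
        log.frames.append({**state, "action": policy(state)})

def pre_game() -> Policy:
    """Pre-game: stand-in for the model committing to a strategy up front."""
    return lambda state: "hold"

def halftime_revision(log: MatchLog, current: Policy) -> Policy:
    """Halftime: stand-in for the model reading its frame log and editing code."""
    passive = sum(f["action"] == "hold" for f in log.frames) / len(log.frames)
    if passive > 0.8:  # diagnosis: the team sat too passive
        return lambda state: "hold" if state["ball_in_own_half"] else "press"
    return current

def run_match(frames_per_half: int = 30) -> MatchLog:
    log = MatchLog()
    policy = pre_game()
    play_half(policy, frames_per_half, log)   # first half
    policy = halftime_revision(log, policy)   # one revision window
    play_half(policy, frames_per_half, log)   # second half
    return log
```

The design choice worth noticing is that the revision step only sees the log, never a graded score for a candidate fix, which is what separates diagnosis from trial and error.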

Here’s another one — MiniMax M3 against Xiaomi’s genuinely impressive MiMo.

The Tournament Schedule

Broadcasts run Monday through Friday, all times Pacific. Group stage on Mon–Wed, knockouts Thu–Fri. You can follow it at the Stratix Cup website.

Monday, June 22 — Group Stage, Matchday 1

7:00 AM — Opus 4.7 vs GPT-5.5 · Group A
8:00 AM — GLM 5.2 vs Seed 2.0 Lite · Group A
9:00 AM — Gemini 3.1 Pro vs Qwen 3.7 Max · Group B
10:00 AM — Grok 4.3 vs Kimi K2.7 Code · Group B
11:00 AM — GPT-5.4 vs MiniMax M3 · Group C
12:00 PM — DeepSeek V4 Flash vs Nemotron 3 Ultra · Group C
1:00 PM — Gemini 3.5 Flash vs Opus 4.8 ⭐ Marquee · Group D
2:00 PM — MiMo v2.5 Pro vs Mistral Large 3 · Group D
~2:30 PM — End of day: standings recap

Tuesday, June 23 — Group Stage, Matchday 2

7:00 AM — GLM 5.2 vs Opus 4.7 · Group A
8:00 AM — Seed 2.0 Lite vs GPT-5.5 · Group A
9:00 AM — Gemini 3.1 Pro vs Kimi K2.7 Code · Group B
10:00 AM — Qwen 3.7 Max vs Grok 4.3 · Group B
11:00 AM — DeepSeek V4 Flash vs MiniMax M3 · Group C
12:00 PM — Nemotron 3 Ultra vs GPT-5.4 · Group C
1:00 PM — Gemini 3.5 Flash vs Mistral Large 3 · Group D
2:00 PM — Opus 4.8 vs MiMo v2.5 Pro ⭐ Marquee · Group D
~2:30 PM — End of day: updated standings

Wednesday, June 24 — Group Stage, Matchday 3 (Decisive Day)

7:00 AM — GLM 5.2 vs GPT-5.5 · Group A
8:00 AM — Opus 4.7 vs Seed 2.0 Lite · Group A
9:00 AM — Gemini 3.1 Pro vs Grok 4.3 · Group B
10:00 AM — Kimi K2.7 Code vs Qwen 3.7 Max · Group B
11:00 AM — DeepSeek V4 Flash vs GPT-5.4 · Group C
12:00 PM — MiniMax M3 vs Nemotron 3 Ultra · Group C
1:00 PM — Gemini 3.5 Flash vs MiMo v2.5 Pro · Group D
2:00 PM — Mistral Large 3 vs Opus 4.8 · Group D
3:00 PM — Final standings reveal + QF bracket stream (~3:20 PM)

Thursday, June 25 — Quarter-Finals

10:00 AM — GPT-5.5 vs MiMo v2.5 Pro · A1 vs D2
11:00 AM — Grok 4.3 vs MiniMax M3 · B1 vs C2
12:00 PM — DeepSeek V4 Flash vs Kimi K2.7 Code ⭐ Upset · C1 vs B2
1:00 PM — Opus 4.8 vs Opus 4.7 ⭐ Anthropic Civil War · D1 vs A2
2:00 PM — SF bracket reveal stream (~2:15 PM)

Later start on Thursday — the QFs are premium, so we let the afternoon audience build.

Friday, June 26 — Semi-Finals + Final

10:00 AM — GPT-5.5 vs Grok 4.3 · Semi-Final 1
11:00 AM — Kimi K2.7 Code vs Opus 4.8 · Semi-Final 2
12:00 PM — Finalists revealed · community vote, hype build
1:00 PM — ⭐ THE FINAL: GPT-5.5 vs Opus 4.8
1:30 PM — Champion stream: trophy, traces, Season 2 tease (~2:00 PM)

We saved the Final for 1pm PT on purpose — east coast lunch, west coast morning peak, maximum audience.

What to Do Next

Go watch some AI play fun, occasionally chaotic soccer. We’ll be sharing highlights right here in the newsletter over the next week or so.

Follow @LayerLens_AI and show us some love. 😊

The Sequence Radar #880: Last Week in AI: A $60B Cursor Deal, Google's Brain Drain, and Midjourney's Body Scanner

Jesus Rodriguez — Sun, 21 Jun 2026 11:02:38 GMT

Next Week in The Sequence:

We start a new series about distillation where we are going to cover all the cool techniques in the space.
In the AI of the week, we are going to cover Alibaba’s new models for robotics.
In the opinion section, I would like to dive into this fascinating concept of self-driving labs.

Subscribe and don’t miss out:

📝 Editorial: Last Week in AI: A $60B Cursor Deal, Google's Brain Drain, and Midjourney's Body Scanner

I keep a mental map of where AI “lives.” For most of the last decade it lived in a box: a model, an API, a chat window. This week the box broke open in four directions at once, and the interesting part is that none of the breakouts rhyme with each other. They only rhyme structurally.

Start with the one that reads like a typo. SpaceX agreed to acquire Cursor for $60 billion in stock. Sit with the category error for a second. A rocket company is buying a code editor. The clean story is that SpaceX absorbed Cursor to feed its struggling xAI division, but the deeper signal is that AI tooling has become strategic infrastructure on par with launch capacity. Industrial conglomerates no longer partner for AI; they annex it. When the people who build reusable rockets decide that an autocomplete-for-engineers is worth the GDP of a small country, the implied claim is that the model layer is now load-bearing for everything else.

Then the talent map redrew itself in 48 hours. Noam Shazeer—co-author of “Attention Is All You Need,” the paper every one of us has read until the margins are gray—left Google for OpenAI. A day later John Jumper, who shared a Nobel for AlphaFold, left Google DeepMind for Anthropic. Google paid roughly $2.7 billion two years ago to bring Shazeer back. That is the part worth dwelling on: acqui-hires buy retention windows, not loyalty, and when the window closes the most valuable asset simply walks out the door. The frontier is consolidating into a two-body problem, and the bodies are not the ones with the most compute. They are the ones with the most gravity for researchers. Talent, it turns out, is the scarcest accelerator.

And then there is Midjourney, which decided that text-to-image was insufficiently ambitious and announced a full-body medical scanner. Lower a person into a pool ringed with half a million ultrasonic sensors, fire sound through them from every angle, reconstruct a 3D map of muscle, fat, bone, and organ in—eventually—sixty seconds. They call it ultrasonic CT. Holz cheerfully noted there is no AI in the imaging pipeline yet, which is the most honest sentence in the announcement. The prototype takes twenty minutes and has scanned about a dozen people. Treat the sixty-second, fifty-thousand-scanner figure as a North Star, not a spec sheet. But the move is the message: a generative-image lab now believes its reconstruction expertise transfers to atoms.

Here is the throughline. For years we argued about which company would win AI. This week the more interesting question quietly replaced it: what counts as an AI company at all? A rocket builder, two model labs in a talent knife-fight, an image startup reaching into your body. The substrate is leaking out of the box—into hardware, into biology, into the cap tables of firms that build physical things.

Let’s dive in.

🔎 AI Research

LifeSciBench: Evaluating Language Models on Realistic, Expert-Level Tasks in the Life Sciences

AI Lab: OpenAI and Tacit Labs

Summary: LifeSciBench introduces a dataset of 750 expert-authored tasks designed to rigorously evaluate language models on practical, real-world life science workflows rather than simple factual recall. Although GPT-Rosalind achieved the highest performance among evaluated models with a 36.1% task pass rate, the benchmark remains far from saturated, highlighting its utility as a high-resolution tool for measuring scientific reasoning.

FastContext: Training Efficient Repository Explorer for Coding Agents

AI Lab: Unspecified (Shaoqiu Zhang et al.)

Summary: FastContext introduces a specialized, on-demand exploration subagent that separates repository exploration from code solving to preserve token budget and reduce context pollution (Zhang, n.d.). When integrated into coding agents, it improves end-to-end resolution rates while significantly reducing token consumption with only marginal overhead (Zhang, n.d.).

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

AI Lab: Qwen Team

Summary: Qwen-RobotWorld presents a language-conditioned video world model for embodied intelligence that uses natural language as a unified interface to predict future visual trajectories across various robotic and navigation tasks (Zhang, n.d.). This unified approach enables synthetic data generation, scalable virtual environments, and language-guided planning for downstream control (Zhang, n.d.).

The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL

AI Lab: FAIR at Meta, Columbia University, Mila Québec AI Institute, McGill University, and Université de Montréal

Summary: Discriminator-Guided RL (DRL) corrects structural failures in flow- and score-matching models by utilizing a discriminator in a pretrained representation space to estimate the density ratio between data and base-model distributions. This logit serves as a reward without needing human preference data, successfully reducing the distributional gap across various architectures to yield sharper, more coherent image generation.

🤖 AI Tech Releases

Qwen-Robot Suite

Alibaba released three new foundation models for embodied intelligence.

LFM2.5 Retrievers

Liquid AI released two multilingual retrieval models for effective searches across 11 languages.

📡10 AI News You Need to Know About

Gemini co-lead and “Attention Is All You Need” co-author Noam Shazeer is leaving Google for OpenAI less than two years after Google reportedly paid $2.7B to bring him back via the Character.AI deal, with Sam Altman publicly welcoming the hire.
Inference startup Baseten is reportedly close to raising a $1.5B round at a $13B valuation (split-priced, with some investors at $11B), just five months after its $300M Series E.
Medal spinout General Intuition, which builds world models that teach agents spatial-temporal reasoning, is in talks to raise ~$300M at a ~$2B valuation, with backers reportedly including Jeff Bezos and Eric Schmidt.
World-model lab Odyssey raised a $310M Series B at a $1.45B valuation led by Natural Capital, naming AWS its preferred cloud provider and committing to Trainium chips.
Pramaana Labs raised a $27M seed led by Khosla Ventures to apply formal verification (Lean-style proofs) to high-stakes AI domains like tax, law, and drug discovery.
SpaceX agreed to acquire AI coding startup Cursor for $60B in stock to bolster its xAI-based AI division, with the deal expected to close in Q3.
India’s Sarvam became the country’s newest AI unicorn after raising a $234M Series B first close at a $1.5B valuation (of a planned $300M), led by a $150M strategic investment from HCLTech.
Nobel laureate and AlphaFold lead John Jumper is leaving Google DeepMind after nearly nine years to join Anthropic (following a break).
Swiss startup Prem AI, which runs AI models on customers’ own private/air-gapped infrastructure for hedge funds and law firms, is raising a $100M Series A targeting a $500M+ valuation, expected to close in Q3.
Midjourney unveiled “Midjourney Medical” and the Midjourney Scanner, a full-body “ultrasonic CT” device (built on licensed Butterfly Network chips) that CEO David Holz claims will eventually image the whole body in 60 seconds.

The Sequence Opinion #879: When Tokens Become Balance Sheet Items

Jesus Rodriguez — Thu, 18 Jun 2026 10:53:33 GMT

AI tokens are increasingly becoming part of every company's economics. You see large companies measuring and reporting token expenses and forecasts like well-understood accounting units. In reality, we are entering a new era: the token economy. How should we think about it? How should we measure it? Will we see a new generation of software in this space? An ERP for tokens?

Let’s discuss.

The Smallest Billable Thought

Something strange happened quietly, and then all at once: the token became the atomic unit of the AI economy.

Not the parameter. Not the GPU. Not even the model. The token.

The Sequence AI of the Week #878: Inside Google Deepmind's First Real Crack in Next-Token Generation

Jesus Rodriguez — Wed, 17 Jun 2026 10:56:10 GMT

As we wrap up our series about alternatives to transformer architectures, Google DeepMind just released one of the most impressive models in this category. DiffusionGemma is a text-diffusion model that challenges conventional transformer models. Today, we would like to dive deep into the specifics of this model.

Most language models write like a typewriter. They place one token after another, left to right, never revisiting the characters already stamped onto the page. This architecture has carried the entire modern LLM era: GPT-style chatbots, coding copilots, reasoning models, agent frameworks, enterprise assistants. The model predicts the next token, appends it, updates its state, and repeats.

Google’s new DiffusionGemma asks a deceptively simple question: what if text generation did not have to work that way?

Let’s dive in.

The Sequence Knowledge #878: Beyond Transformer: What We Learned

Jesus Rodriguez — Tue, 16 Jun 2026 11:03:13 GMT

Today, we bring you a summary of our series about transformer alternatives.

For the better part of a decade, the entire field has been a giant, spectacularly funded wrapper around a single operation: self-attention. The Transformer didn’t win because it was the most elegant or the most brain-like design. It won because it had the best scaling story and it won the hardware lottery. Every token looks at every other token, the whole thing maps cleanly onto a GPU grid, and you train it all at once. Add data, parameters, compute, context — and the loss curve cooperates. That smoothness is rare. Most clever ideas in deep learning never become industrial. This one did.

But the tax was always there in plain sight. Self-attention buys you something genuinely valuable — perfect, lossless recall over the entire context, with every token able to address every other token directly, and a training pass that parallelizes across the whole sequence at once. That’s the benefit, and it’s a real one. The cost is that attention scales quadratically with sequence length, and autoregressive decoding drags around a KV-cache that grows linearly with every token you’ve already seen. When you’re pushing past a million tokens, or watching a 70B model’s cache eat 40GB of VRAM, O(n²) compute and O(n) memory stop being footnotes and become the actual bill. So the interesting question was never “are Transformers good?” They’re spectacular. The question is whether they’re the final architecture or just the first truly scalable one — soon to be absorbed into something richer.
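The arithmetic behind that bill is easy to check. The sketch below assumes an illustrative 70B-class configuration with grouped-query attention (80 layers, 8 key/value heads, head dimension 128, 16-bit values) and a 131,072-token context; those numbers are my assumptions for the example, not figures from any specific deployment. With them, the cache lands at 40 GiB.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Memory for cached keys and values: linear in sequence length."""
    # The factor of 2 counts one key vector and one value vector per token.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len

def attention_score_flops(seq_len: int, n_layers: int, n_heads: int,
                          head_dim: int) -> int:
    """FLOPs for the QK^T scores alone over a full sequence: quadratic."""
    # seq_len^2 dot products of length head_dim, 2 FLOPs per multiply-add.
    return 2 * n_layers * n_heads * head_dim * seq_len ** 2

if __name__ == "__main__":
    # Assumed configuration, for illustration only.
    cache = kv_cache_bytes(seq_len=131_072, n_layers=80, n_kv_heads=8, head_dim=128)
    print(cache / 2**30)  # 40.0 (GiB)
```

Doubling the context doubles the cache but quadruples the score computation, which is the O(n) versus O(n²) split in concrete terms.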

That was the thesis we set out to test, and the cleanest way to read the eight issues is as four families, each making a different bet against attention.

The first family is recurrent and linear-recurrent models — the RNN comeback and xLSTM. Their pitch is constant memory: instead of a cache that grows forever, they carry a fixed-size hidden state and pay O(n) compute over a sequence rather than O(n²). The classic objection was that RNNs train serially and can’t saturate a GPU, but the modern variants reformulate the recurrence so it parallelizes during training while staying cheap at inference. The benefit is brutally efficient generation; the open challenge is whether a fixed-size state can hold enough to match attention’s exact recall on long-range, retrieval-heavy tasks.

The second family is state space models — the SSM/Mamba line, the most serious challenger of the bunch. SSMs treat a sequence as a continuous linear dynamical system, which gives them a near-magical dual form: a parallelizable convolution for training and a recurrent scan for inference. They get linear scaling and long-context handling almost for free. The trade-off is expressivity — pure SSMs can struggle with precise in-context copying and lookup, which is exactly why the strongest results today are hybrids that interleave a few attention layers among many SSM layers.
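The dual form is easier to believe once you see both halves produce the same numbers. This is a toy sketch assuming a scalar, time-invariant, already-discretized system with hypothetical parameters `a`, `b`, and `c`; real SSM layers use matrix-valued states, and selective variants make the parameters input-dependent, which rules out the plain convolution view.

```python
def ssm_recurrent(x: list[float], a: float, b: float, c: float) -> list[float]:
    """Inference view: carry one fixed-size state and update it per token."""
    h, ys = 0.0, []
    for x_t in x:
        h = a * h + b * x_t      # state update
        ys.append(c * h)         # readout
    return ys

def ssm_convolution(x: list[float], a: float, b: float, c: float) -> list[float]:
    """Training view: the same map as a causal convolution with a fixed kernel."""
    kernel = [c * (a ** k) * b for k in range(len(x))]  # K_k = c * a^k * b
    # Every output position depends only on the inputs, so all positions
    # can be computed independently (and therefore in parallel).
    return [sum(kernel[k] * x[t - k] for k in range(t + 1)) for t in range(len(x))]

if __name__ == "__main__":
    xs = [1.0, 0.5, -2.0, 3.0]
    print(ssm_recurrent(xs, a=0.9, b=0.5, c=2.0))
    print(ssm_convolution(xs, a=0.9, b=0.5, c=2.0))
```

Unrolling the recurrence gives h_t as a sum of a^k · b · x_{t-k} terms, which is exactly what the kernel encodes; the two functions agree up to floating-point rounding.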

The third family is text diffusion — generation that abandons left-to-right decoding entirely, refining a whole sequence in parallel over a handful of denoising steps. The benefit is non-autoregressive speed and bidirectional context at generation time; the challenge is matching the raw quality and controllability of autoregressive models, which LLaDA, Gemini Diffusion, and Mercury are now pushing on hard.

The fourth family is liquid and continuous-time models, which throw out the parallel-lookup mental model altogether in favor of dynamics that evolve continuously in time, aiming for far smaller, more adaptive networks. The benefit is parameter efficiency and a different inductive bias; the challenge is scaling that story to frontier sizes.

None of these has dethroned attention. But the monoculture is over, and the most likely future is explicitly hybrid: attention where exact recall earns its quadratic cost, something linear-time everywhere else.

Here is the full series, in order:

#846 — Beyond Transformer: A New Series — The kickoff, framing the palpable vibe shift on arXiv toward post-attention architectures and the decade we’ve spent as a wrapper around self-attention. It lays out the plan to map every major viable alternative to the Transformer.
#850 — The Unexpected Comeback of RNNs — The case for recurrent networks as the alternative most people overlooked, revisiting why linear-time recurrence is attractive again. It positions modern RNN variants as a serious challenger rather than a relic.
#854 — Return of the King: Unrolling the xLSTM Architecture — Traces the lineage from the 1990s LSTM through the 2017 Transformer pivot into xLSTM, the modernized revival of Hochreiter and Schmidhuber’s design. It explains how reworked gating and scaling let xLSTM compete with attention-based models.
#858 — How State Space Models Went from Curiosity to Serious Transformer Competitor — Charts the rise of SSMs as the O(n²) attention bottleneck becomes a real constraint at million-token contexts and large KV-caches. It argues state space models have quietly matured into a genuine rival to the dominant paradigm.
#862 — Learning About Text Diffusion Models — Introduces text diffusion as one of the most credible non-autoregressive alternatives to transformers. It covers how diffusion-style generation breaks from strict left-to-right next-token prediction.
#866 — Three Text Diffusion Models You Need To Know About — A practical follow-up profiling the leading players in the space: LLaDA, Gemini Diffusion, and Mercury. It compares how each implements diffusion-based text generation.
#870 — Liquid Models and the Search for a Post-Transformer Architecture — Dives into liquid neural networks as one of the more promising non-Transformer architectures, contrasting their continuous-time dynamics with attention’s parallel lookup-table approach. It frames them within the broader hunt for a successor.
#874 — Transformers or Not? — The capstone, asking whether the Transformer is the final architecture or merely the first truly scalable one, soon absorbed into something richer. It leans toward the latter and surveys the full landscape the series has covered.

What’s next: a new series on distillation

If the last series was about changing the architecture, the next one is about compressing it. We’re starting a deep dive into knowledge distillation — the set of techniques for taking a large, expensive teacher model and pressing its capabilities into a smaller, faster student. It’s one of the least glamorous and most economically important ideas in modern AI: it’s how frontier capability actually reaches production. We’ll cover the classics (logit matching, the original Hinton formulation), the modern variants (sequence-level, on-policy, and self-distillation), what actually transfers and what doesn’t, and why nearly every model you can afford to run is, in some sense, a distilled one. See you in the first issue.
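As a preview of the classic formulation, here is a minimal sketch of the soft-target term from logit matching: soften both teacher and student distributions with a temperature, then penalize the KL divergence between them, scaled by the temperature squared. It is written in plain Python for a single example; the function names and the default temperature of 2.0 are my choices for illustration, and in practice this term is usually combined with an ordinary hard-label loss.

```python
import math

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Temperature-scaled softmax; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                          # subtract the max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits: list[float], student_logits: list[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl

if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.5]
    print(distillation_loss(teacher, teacher))          # zero when the student matches
    print(distillation_loss(teacher, [1.0, 1.0, 1.0]))  # positive otherwise
```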

The Sequence Radar #877: Last Week in AI: Anthropic Ships, Apple Borrows, Musk Lists, Bezos Builds

Jesus Rodriguez — Sun, 14 Jun 2026 11:03:27 GMT

Next Week in The Sequence:

We continue our series about alternatives to transformers.
The AI of the week will dive into Fable.
In the opinion section, we are going to discuss AI tokens as units of economics.
We might introduce a new fun section. Playing with a new idea.

Subscribe and don’t miss out:

📝 Editorial: Last Week in AI: Anthropic Ships, Apple Borrows, Musk Lists, Bezos Builds

Some weeks in AI feel like incremental patch releases. This one felt like a major version bump for the entire industry. Four events — a frontier model launch, a consumer assistant reboot, the largest IPO in history, and a $12 billion bet on physical engineering — and if you squint, they’re all chapters of the same story: AI escaping the chat window.

Start with Anthropic. On Tuesday the company released Claude Fable 5 and Claude Mythos 5, and the architecture of the launch is as interesting as the model itself. Both share the same base model; the difference is policy, not weights. Fable 5 ships with conservative safety classifiers that intercept queries in high-risk domains — cybersecurity, biology, chemistry — and fall back to Opus 4.8, while Mythos 5 runs unrestricted for a vetted group of cyber defenders under Project Glasswing. Think of it as the same kernel with different syscall permissions. The benchmarks justify the caution: 80.3% on SWE-Bench Pro, more than ten points clear of Opus 4.8 and over twenty ahead of GPT-5.5. We’ve entered the era where capability and access are explicitly decoupled — the model you can use is a sandboxed view of the model that exists.

Then Apple, finally, showed up. At Tim Cook’s farewell WWDC, the company unveiled Siri AI — a conversational assistant with personal context, onscreen awareness, and a standalone app, reportedly powered by a custom 1.2-trillion-parameter Gemini model under the hood. There’s something deliciously ironic about Apple, the original vertical integrator, outsourcing the brain. But strategically it’s the right call: Apple’s moat was never the model; it’s the distribution and the personal context graph. A billion devices with intimate access to your messages, photos, and calendar is a dataset no lab can replicate. Apple isn’t competing on intelligence; it’s competing on intimacy.

The week’s most audacious move came from Elon Musk. SpaceX went public at roughly $1.77 trillion, raising about $75 billion in the largest IPO ever — and the prospectus reads less like a rocket company and more like an AI infrastructure thesis. Having merged xAI into SpaceX in February, Musk is pitching orbital data centers: up to a million GPU-packed satellites moving training and inference off-planet, where energy is abundant and regulation is thin. xAI lost $6.4 billion on $3.2 billion in revenue last year, so the IPO is effectively the public market underwriting the most capital-intensive scaling hypothesis ever proposed. Compute, it turns out, has an escape velocity.

Finally, Jeff Bezos broke his silence on Prometheus, which raised $12 billion at a $41 billion valuation to build an “artificial general engineer” — AI that designs and manufactures physical systems, from jet engines to drug compounds. Not robotics, Bezos insists. Something closer to CAD with a frontier brain.

Let’s dive in.

🔎 AI Research

Regularized f-Divergence Kernel Tests

AI Lab: Google Research & Google DeepMind

Summary: This paper introduces a unified framework for constructing practical, kernel-based two-sample tests derived from the family of f-divergences. The authors demonstrate that these adaptive tests, particularly the Hockey-Stick divergence, effectively capture diverse localized differences and are highly applicable to tasks like differential privacy auditing and machine unlearning evaluation.

Verifiable Environments Are LEGO Bricks: Recursive Composition for Reasoning Generalization

AI Lab: Qwen Team, Alibaba Group

Summary: The authors propose RACES, a framework that scales up reinforcement learning for language models by recursively assembling verifiable environments like building blocks when their input and output types match. By utilizing composition operators such as SEQUENTIAL and PARALLEL, this approach generates structurally diverse training tasks that significantly improve the reasoning generalization of models on unseen benchmarks.
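To give a feel for type-matched composition, here is a small sketch. It is not the paper's API: the `Env` class, the `sequential` and `parallel` helpers, and the exact semantics of each operator are my assumptions, written only to show how two verifiable environments can be snapped together when their input and output types line up.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Env:
    """Hypothetical verifiable environment with typed input and output."""
    name: str
    in_type: type
    out_type: type
    solve: Callable[[Any], Any]  # reference solution used by the verifier

    def verify(self, task_input: Any, answer: Any) -> bool:
        return answer == self.solve(task_input)

def sequential(first: Env, second: Env) -> Env:
    """Feed the output of `first` into `second`; types must match."""
    if first.out_type is not second.in_type:
        raise TypeError(f"{first.name} output does not match {second.name} input")
    return Env(f"{first.name}>>{second.name}", first.in_type, second.out_type,
               lambda x: second.solve(first.solve(x)))

def parallel(left: Env, right: Env) -> Env:
    """Run both environments on the same input and pair the results."""
    if left.in_type is not right.in_type:
        raise TypeError(f"{left.name} and {right.name} take different inputs")
    return Env(f"({left.name}|{right.name})", left.in_type, tuple,
               lambda x: (left.solve(x), right.solve(x)))

sort_env = Env("sort", list, list, sorted)
sum_env = Env("sum", list, int, sum)

if __name__ == "__main__":
    composed = sequential(sort_env, sum_env)
    print(composed.name, composed.verify([3, 1, 2], 6))
```

Because the composed object is itself an `Env`, it can be composed again, which is where the recursive part of the idea comes from.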

REVISION: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

AI Lab: Microsoft Research

Summary: To address the high token cost associated with visual observations in computer-use agents, this paper introduces REVISION, a framework that trains multimodal models to filter out redundant visual patches across consecutive screenshots. By maintaining essential spatial structure while significantly reducing token accumulation, the method allows agents to process longer interaction histories and achieve higher success rates on complex tasks.

Distilling LLM Feedback for Lean Theorem Proving

AI Lab: FAIR at Meta

Summary: This research explores Feedback Distillation, an on-policy post-training method where a model learns to match its own token-level distribution conditioned on privileged feedback from a stronger language model. Evaluated on Lean 4 theorem proving, the technique preserves greater trajectory diversity and achieves better pass@k scaling than standard GRPO, proving especially powerful when used as an initialization for subsequent reinforcement learning.

Decentralized Multi-Agent Systems with Shared Context

AI Lab: Stanford University

Summary: DELM is a novel multi-agent framework that eliminates the bottleneck of centralized orchestration by relying on a shared, verified context and an asynchronous task queue. Agents independently claim subtasks and contribute compact, verified updates to the global state, leading to superior performance and cost efficiency in both software-engineering testing and long-context reasoning workflows.

🤖 AI Tech Releases

Claude Fable 5 and Mythos 5

Anthropic released its highly anticipated Fable 5 model, a limited Mythos-based model. It also released a version of Mythos 5 to a select group of cybersecurity and infrastructure companies.

Kimi Work

Moonshot AI released Kimi Work, a new agent specialized in work automation.

📡10 AI News You Need to Know About

SpaceX (SPCX) made its Nasdaq debut June 12, 2026, after pricing at $135 per share and raising roughly $75 billion — the largest IPO in stock market history, valuing the company near $1.75 trillion. Shares opened sharply higher and were trading around $161 intraday, with the valuation anchored by Starlink and now bundling in xAI following an all-stock merger earlier this year.
Bezos’s Prometheus raises $12B — Jeff Bezos and Vik Bajaj’s physical-AI startup Prometheus raised $12 billion at a $41 billion valuation to build an “artificial general engineer” that automates the design and manufacturing of complex physical systems from jet engines to drugs.
Mistral AI is in early talks to raise about €3 billion (~$3.5 billion) at a valuation near €20 billion (~$23 billion), nearly double the €11.7 billion valuation from its Series C last September. The new round would bring the three-year-old company's total financing to roughly €6.5 billion across debt and equity, fueling its compute buildout as Europe's leading AI lab competes against larger US and Chinese rivals.
Theker raises $85M — Barcelona-based Theker raised $85 million in what it bills as Europe’s largest-ever robotics Series A to build reconfigurable factory robots whose arms and hands swap out for different tasks rather than specializing in one.
Jedify raises $24M — New York’s Jedify raised a $24 million Series A led by Norwest, with Snowflake as a strategic investor, to build “context graphs” that give enterprise AI agents the business knowledge they need to run in production.
Sandstone raises $30M — Sandstone raised a $30 million Series A led by Lightspeed to bring AI-powered workflow automation (intake, routing, triage, task execution) to in-house corporate legal teams rather than law firms.
OpenAI to acquire Ona — OpenAI agreed to acquire cloud-platform startup Ona, folding its secure, persistent execution environments into the Codex team so AI agents can run long, multi-step tasks for enterprises.
xAI co-founder unveils River AI — xAI co-founder Igor Babuschkin announced River AI, a startup (staffed partly by former xAI and Tesla employees) building personalized AI agents that learn from and remain owned/controlled by individual users rather than large corporations.
Tether backs Neura in $1.4B round — German firm Neura Robotics raised up to $1.4 billion in a Tether-led Series C — also backed by Nvidia, Amazon, Qualcomm and Bosch — to scale humanoid/cognitive-robot production toward millions of units by 2030.
Apple introduces Siri AI — Apple unveiled “Siri AI,” a rebuilt Apple Intelligence–powered assistant with personal-context understanding, onscreen awareness, world knowledge, a dedicated app, and expanded Visual Intelligence, available for developer testing now and as a user beta later this year.

The Sequence Opinion #876: Systems of Record vs. Systems of Action

Jesus Rodriguez — Thu, 11 Jun 2026 11:03:00 GMT

Thesis: agentic AI does not kill SaaS. It changes what enterprise software is fundamentally for. The old winning layer was the system that held canonical state. The new winning layer is the system that can take action against that state safely, reliably, and observably.

For the last twenty years, the enterprise stack has been built around one hidden constant: the human is the actor.

A person logs in. A person reads a dashboard. A person fills out a form. A person updates the opportunity stage, approves the invoice, closes the ticket, moves the candidate, changes the forecast, escalates the account, or checks the compliance box.

The software is basically a database wrapped in forms, permissions, workflows, and a pricing page. This is not an insult. It is an extremely powerful pattern. It gave companies shared memory. It made business state durable. It turned messy organizational reality into tables, fields, roles, reports, and audit logs.

But now the actor is changing.

The Sequence AI of the Week #875: Why Your Language Model Needs a Nap

Jesus Rodriguez — Wed, 10 Jun 2026 10:39:45 GMT

For today’s essay, I would like to cover an incredible paper with a provocative thesis and an even better title that I found myself reading multiple times last week: Language Models Need Sleep….

There’s an awkward fact about the models we all use every day: they don’t learn anything anymore. Whatever a frontier model knows, it learned once, during training, and then somebody hit save. After that it’s a brilliant fossil. It can reason circles around you about events up to its cutoff and then go completely blank about last Tuesday. You can stuff new facts into the context window, sure, but the moment the session ends, that knowledge evaporates like a dream you forgot to write down.

Behrouz, Hashemi, and Mirrokni (Google + Cornell) have a name for this in their new paper, and it’s a good one: it’s anterograde amnesia. The patient with anterograde amnesia keeps every memory from before the injury and can hold a conversation in the moment, but nothing new ever makes the jump into long-term storage. Each day is experienced as if it were the first. Swap “injury” for “end of pre-training” and that is exactly the shape of a Transformer’s memory. It has the deep past (the MLP weights) and the immediate present (the attention cache), and almost nothing connecting them.

The paper’s pitch is that we’ve been missing a step that biology figured out a long time ago. We sleep.

There is no test time

The Sequence Knowledge #874: Transformers or Not?

Jesus Rodriguez — Tue, 09 Jun 2026 11:03:38 GMT

💡 AI Concept of the Day: Transformers or Not?

The Transformer is currently the reference architecture for serious AI. Not because it is obviously the most brain-like, elegant, or efficient design, but because it has the best scaling story. You add data, parameters, compute, context length, better training recipes, better post-training, and the model gets better in a surprisingly smooth way. That is rare. In deep learning, many ideas are clever. Few are industrial.

The Transformer’s superpower is attention. Every token can look at every other token and decide what matters. This is an incredibly general operation. It works for language, code, images, audio, video, protein sequences, robotics tokens, and tool traces. The architecture is simple enough to scale, parallel enough to train efficiently, and expressive enough to absorb huge datasets.

But it has an obvious tax: attention is expensive. Full self-attention scales badly with sequence length. In autoregressive generation, the model accumulates a key-value cache, which grows with context. A Transformer remembers by keeping a large, explicit, token-indexed memory. That is powerful, but it is not how you would design every intelligent system from first principles.
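A quick back-of-the-envelope calculation shows how the key-value cache grows with context. The formula is the standard one for a decoder that stores one key and one value vector per layer, per KV head, per token. The model dimensions below are hypothetical and do not describe any particular system.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Memory for one sequence's KV cache; bytes_per_value=2 assumes fp16/bf16."""
    # The leading 2 counts one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len


# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
config = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for seq_len in (8_192, 131_072):
    gib = kv_cache_bytes(seq_len, **config) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:.0f} GiB per sequence")
```

Under these assumptions each token costs 128 KiB of cache, so an 8K context takes 1 GiB and a 128K context takes 16 GiB for a single sequence. The growth is linear in context length, and it is paid again for every concurrent sequence.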

So the question is not “are Transformers good?” They are spectacular. The question is: are they the final architecture? Or are they the first truly scalable architecture, soon to be absorbed into something richer?

I think the second view is more likely.

TheSequence

The Sequence Opinion #892: The Anatomy of a Good Environment: When Verifiability is Not Enough

Verifiability

The Sequence AI of the Week #891: Prompting a Spreadsheet: Inside Google’s TabFM for Tabular AI

The TimesFM prelude

The Sequence Knowledge #890: A Brief History of Model Distillation

The Sequence Radar #889: Fable 5's Comeback, ZCode's Debut, Claude Science, and the $3.5B Deployment Land Grab

Next Week in The Sequence:

Subscribe and don’t miss out:

📝 Editorial: Fable 5's Comeback, ZCode's Debut, Claude Science, and the $3.5B Deployment Land Grab

🔎 AI Research

🤖 AI Tech Releases

Fable 5

ZCode

Claude Science

📡10 AI News You Need to Know About

The Sequence Opinion #888: Everything You Need to Know About the AI in Space Race

The core thesis: compute is now an energy problem, and space is an energy solution

The Sequence AI of the Week #887: Meta's Autodata: When Models Learn to Make Their Own Lessons

The Sequence Knowledge #886: Demystifying Model Distillation

The Sequence Radar #885: Last Week in AI: Models, Games, and the Future of Evaluation

Next Week in The Sequence:

Subscribe and don’t miss out:

📝 Editorial: Last Week in AI: Models, Games, and the Future of Evaluation

🔎 AI Research

🤖 AI Tech Releases

GPT 5.6 Sol

Claude Tag

Mistral OCR

📡10 AI News You Need to Know About

The Sequence Opinion #884: Self-Driving Labs: The Laboratory That Chooses Its Next Experiment

The Sequence AI of the Week #883: Qwen is Getting Into Robotics

The actual bottleneck is not intelligence, it’s tokenization

The Sequence Knowledge #882: A New Series About Distillation

The Sequence Special #881: The Soccer World Cup of AI Models

Introducing the Stratix Cup

Why Soccer?

What the Harness Actually Tests

The Tournament Schedule

Monday, June 22 — Group Stage, Matchday 1

Tuesday, June 23 — Group Stage, Matchday 2

Wednesday, June 24 — Group Stage, Matchday 3 (Decisive Day)

Thursday, June 25 — Quarter-Finals

Friday, June 26 — Semi-Finals + Final

What to Do Next

The Sequence Radar #880: Last Week in AI: A $60B Cursor Deal, Google's Brain Drain, and Midjourney's Body Scanner

Next Week in The Sequence:

Subscribe and don’t miss out:

📝 Editorial: Last Week in AI: A $60B Cursor Deal, Google's Brain Drain, and Midjourney's Body Scanner

🔎 AI Research

🤖 AI Tech Releases

Qwen-Robot Suite

LFM2.5 Retrievers

📡10 AI News You Need to Know About

The Sequence Opinion #879: When Tokens Become Balance Sheet Items

The Smallest Billable Thought

The Sequence AI of the Week #878: Inside Google Deepmind's First Real Crack in Next-Token Generation

The Sequence Knowledge #878: Beyond Transformer: What We Learned

What’s next: a new series on distillation

The Sequence Radar #877: Last Week in AI: Anthropic Ships, Apple Borrows, Musk Lists, Bezos Builds

Next Week in The Sequence:

Subscribe and don’t miss out:

📝 Editorial: Last Week in AI: Anthropic Ships, Apple Borrows, Musk Lists, Bezos Builds

🔎 AI Research

🤖 AI Tech Releases

Claude Fable 5 and Mythos 5

Kimi Work

📡10 AI News You Need to Know About

The Sequence Opinion #876: Systems of Record vs. Systems of Action

The Sequence AI of the Week #875: Why Your Language Model Needs a Nap

There is no test time

The Sequence Knowledge #874: Transformers or Not?

💡 AI Concept of the Day: Transformers or Not?

The Landscape of Alternatives