The Sequence Radar #885: Last Week in AI: Models, Games, and the Future of Evaluation

New model releases, new agents and a soccer cup.

Jun 28, 2026

Next Week in The Sequence:

We continue our series about distillation.
In the AI of the week, we discuss Meta’s amazing Autodata paper.
In the opinion section, we debate the topic of AI in space.


📝 Editorial: Last Week in AI: Models, Games, and the Future of Evaluation

This week in AI had the strange feeling of a stack trace resolving itself. For years, the industry has been marching toward the same destination from different directions: better models, richer environments, more autonomous agents, and harder evaluations. This week, those threads snapped together into something legible. AI is no longer just learning to answer. It is learning to act.

Start with OpenAI’s GPT-5.6 release. Or more precisely, its limited preview. The naming alone tells a story: Sol, Terra, Luna. A flagship model, a balanced model, and a fast, cheaper model. The product taxonomy is becoming planetary because the market is no longer asking for “the best model” in the abstract. It wants intelligence at different temperatures: deep reasoning for frontier work, affordable competence for everyday automation, and high-throughput inference for systems that need to move fast.

But the most interesting part of GPT-5.6 is not the benchmark curve. It is the release shape. This is a model launched with a safety architecture, a government coordination layer, and a phased-access strategy. That matters. Frontier AI releases are starting to look less like software updates and more like controlled deployment of critical infrastructure. We used to ask whether a model could write better code. Now we ask who gets access, under which constraints, with what monitoring, and how quickly defenders can use the same capabilities attackers will inevitably want.

Alongside this, Anthropic quietly introduced Claude Tag, a feature that signals another subtle shift in how we interact with models. Claude Tag allows users to structure prompts and responses with explicit semantic markers, making it easier for models to track context, roles, and intent across longer interactions. It is a small interface change with outsized implications: as models become more agentic, the way we communicate with them must evolve from loose conversation into something closer to structured collaboration. Claude Tag hints at a future where prompting is less about clever phrasing and more about designing clear, machine-readable workflows.

Then came General Intuition’s new raise, which feels like the cleanest signal yet that the next data frontier is not text, or even video, but action. The company’s thesis is beautifully nerdy: video games are not just entertainment; they are compressed laboratories of intent, perception, movement, failure, reward, and adaptation. A gameplay clip is not merely pixels. It is pixels plus choices. What did the player see? What did they try? What happened next? That action-labeled loop is exactly what language models are missing when they attempt to reason about the physical world from static media.

In other words, General Intuition is betting that Minecraft, Fortnite-like environments, simulations, and gamer behavior might become for embodied AI what the web was for language models: the messy, gigantic pretraining substrate from which generality emerges.

And then, in the most delightful possible version of this same story, the LayerLens Stratix Cup turned AI evaluation into soccer.

The final between Claude Opus 4.8 and GPT-5.5 was not just a spectacle. It was a different kind of benchmark. Sixteen models wrote their own strategies, controlled teams, adapted between rounds, and survived inside an environment where intelligence had to become policy. Not prose. Not a leaderboard answer. Executable behavior. Claude Opus 4.8 defeating GPT-5.5 1–0 in the final is fun as a result, but the deeper point is methodological: we need arenas where models reveal themselves under pressure, with imperfect information, feedback loops, and consequences.

That is the connective tissue of the week. GPT-5.6 pushed the frontier of controlled capability. General Intuition pushed the frontier of action data. Stratix Cup pushed the frontier of evaluation.

The model is becoming less like a chatbot and more like an organism in a sandbox: sensing, planning, acting, failing, adapting. The future of AI will not be decided only by who has the biggest model. It will be decided by who builds the best worlds for models to learn in, the best guardrails for them to operate within, and the best games to discover what they can actually do.

🔎 AI Research

Autodata: An agentic data scientist to create high quality synthetic data

AI Lab: FAIR at Meta

Summary: This paper introduces Autodata, a framework where an AI agent acts as a data scientist to iteratively generate, evaluate, and refine synthetic training and evaluation data. By meta-optimizing the agent itself, the method significantly improves data quality and downstream model performance across complex reasoning and verifiable tasks.
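
The generate, evaluate, refine loop in this summary can be pictured with a small toy sketch. It is not the paper's implementation: `build_dataset`, the `Example` type, the toy callables, the 0.8 threshold, and the retry limits are all hypothetical stand-ins, and the meta-optimization of the agent itself is not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    prompt: str
    answer: str
    score: float = 0.0


def build_dataset(
    generate: Callable[[int], Example],
    evaluate: Callable[[Example], float],
    refine: Callable[[Example], Example],
    target_size: int,
    threshold: float = 0.8,      # assumed quality bar, not from the paper
    max_rounds: int = 3,         # assumed refinement attempts per candidate
    max_candidates: int = 1000,  # hard cap so the loop always terminates
) -> List[Example]:
    """Keep only candidates that pass evaluation, refining failures a few times."""
    kept: List[Example] = []
    for seed in range(max_candidates):
        if len(kept) >= target_size:
            break
        candidate = generate(seed)
        for _ in range(max_rounds):
            candidate.score = evaluate(candidate)
            if candidate.score >= threshold:
                kept.append(candidate)
                break
            candidate = refine(candidate)
    return kept


# Toy, verifiable task: integer addition with some deliberately wrong answers.
def toy_generate(seed: int) -> Example:
    a, b = seed, seed + 1
    error = seed % 2  # odd seeds start with an off-by-one answer
    return Example(prompt=f"{a} + {b}", answer=str(a + b + error))


def toy_evaluate(ex: Example) -> float:
    a, b = (int(t) for t in ex.prompt.split(" + "))
    return 1.0 if int(ex.answer) == a + b else 0.0


def toy_refine(ex: Example) -> Example:
    a, b = (int(t) for t in ex.prompt.split(" + "))
    return Example(prompt=ex.prompt, answer=str(a + b))


dataset = build_dataset(toy_generate, toy_evaluate, toy_refine, target_size=5)
print(len(dataset), "examples kept")
```

In a real system the three callables would presumably be model calls and verifiers rather than arithmetic checks. The structural point is only that data enters the set after passing an explicit evaluation step.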

Improved Large Language Diffusion Models

AI Lab: Gaoling School of Artificial Intelligence, Renmin University of China and ByteDance Seed

Summary: This work presents iLLaDA, an 8B-parameter masked diffusion language model trained from scratch using fully bidirectional attention and scaled to 12 trillion tokens. The model introduces variable-length generation and confidence-based scoring, leading to substantial performance gains over previous diffusion models while remaining competitive with strong autoregressive baselines.

Are We Ready For An Agent-Native Memory System?

AI Lab: Shanghai Jiao Tong University, Tsinghua University, MemTensor (Shanghai) Technology Co., Ltd

Summary: The authors systematically evaluate 12 representative agent memory systems from a data management perspective, decomposing them into representation, extraction, routing, and maintenance modules. Through extensive end-to-end benchmarks, they reveal that no single architecture dominates; rather, effectiveness depends on aligning the memory structure with the specific workload bottleneck and utilizing localized maintenance for cost efficiency.

MEMPROBE: Probing Long-Term Agent Memory via Hidden User-State Recovery

AI Lab: University of Illinois Chicago, KU Leuven, UC San Diego

Summary: MEMPROBE is a new benchmark that audits long-term agent memory by testing how well an agent can reconstruct a simulated user’s hidden state after a series of interactions. Tests on state-of-the-art systems show that while agents can easily complete immediate tasks, they struggle to retrieve and consolidate episodic memory, highlighting a major bottleneck in current memory designs.
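
One way to picture hidden-state recovery is as a comparison between the simulated user's true attributes and what the agent reports after the interactions. The sketch below uses a hypothetical exact-match metric over key/value fields. The function name, the field names, and the scoring rule are assumptions for illustration, not the benchmark's actual protocol.

```python
from typing import Dict


def recovery_score(hidden: Dict[str, str], recovered: Dict[str, str]) -> float:
    """Fraction of hidden-state fields the agent reproduced (case-insensitive)."""
    if not hidden:
        return 1.0
    hits = sum(
        1
        for key, value in hidden.items()
        if recovered.get(key, "").strip().lower() == value.strip().lower()
    )
    return hits / len(hidden)


# Hypothetical user state and an agent's reconstruction of it.
hidden_state = {"diet": "vegetarian", "city": "Lisbon", "budget": "low"}
agent_guess = {"diet": "Vegetarian", "city": "Porto"}

score = recovery_score(hidden_state, agent_guess)
print(f"recovered {score:.0%} of hidden fields")
```

A score like this separates task completion from memory quality: an agent could finish every immediate task and still reconstruct little of the user's state.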

Qwen-AgentWorld: Language World Models for General Agents

AI Lab: Qwen Team

Summary: This research introduces Qwen-AgentWorld, a foundational language world model designed to simulate seven diverse agentic environments through long chain-of-thought reasoning. By utilizing the model as both a decoupled environment simulator and a unified agent foundation model, the researchers demonstrate significant enhancements in agent training, scalability, and downstream task performance.

Tapered Language Models

AI Lab: Mila, Cornell University, Université de Montréal, CIFAR AI Chair

Summary: This paper proposes Tapered Language Models (TLMs), an architectural design that monotonically tapers parameter capacity across a model’s depth under a fixed total budget, front-loading the capacity to earlier layers. Focusing on MLP width, the authors show that a smooth cosine decay schedule consistently improves perplexity and downstream reasoning accuracy across multiple architectures without increasing overall parameters or compute costs.
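
To make the idea of tapering under a fixed budget concrete, here is a minimal sketch of a half-cosine width schedule. It assumes MLP parameter count scales linearly with hidden width, so holding the sum of widths constant holds the MLP budget constant. The function name, the `end_ratio` parameter, and the rounding scheme are hypothetical and may differ from the paper's exact formulation.

```python
import math
from typing import List


def tapered_widths(n_layers: int, base_width: int, end_ratio: float = 0.5) -> List[int]:
    """Per-layer MLP widths that decay along a half-cosine from first to last
    layer while summing exactly to n_layers * base_width."""
    if n_layers < 2:
        return [base_width] * n_layers
    # Unnormalized shape: 1.0 at the first layer, end_ratio at the last.
    shape = [
        end_ratio
        + (1.0 - end_ratio) * 0.5 * (1.0 + math.cos(math.pi * i / (n_layers - 1)))
        for i in range(n_layers)
    ]
    budget = n_layers * base_width
    scale = budget / sum(shape)
    widths = [int(s * scale) for s in shape]  # round down first
    # Give the units lost to rounding back to the earliest layers.
    leftover = budget - sum(widths)
    for i in range(leftover):
        widths[i % n_layers] += 1
    return widths


widths = tapered_widths(n_layers=12, base_width=1024, end_ratio=0.5)
print(widths)
print(sum(widths) == 12 * 1024)
```

The first layers come out wider than the uniform baseline and the last layers narrower, which is the front-loading the summary describes, while the total matches a uniform-width model.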

🤖 AI Tech Releases

GPT-5.6 Sol

OpenAI unveiled three new models, Sol, Terra, and Luna, as part of its GPT-5.6 suite.

Claude Tag

Anthropic released Claude Tag, a new way for teams to interact with Claude.

Mistral OCR

Mistral released Mistral OCR, its latest document understanding model.

📡 10 AI News Stories You Need to Know About

1. Patronus AI raises $50M Series B: Agent-evaluation startup Patronus AI raised a $50M Series B led by Greenfield Partners (total funding now $70M) and unveiled its first “Digital World Models,” large-scale simulation environments for training and stress-testing AI agents.
2. General Intuition raises $320M at $2.3B: General Intuition, a spinout of gaming-clip platform Medal, raised $320M at a $2.3B valuation (led by Khosla Ventures) to train “large action model” AI agents on billions of action-labeled gameplay clips for robotics and real-world use.
3. Netris raises $15M Series A: Network-automation startup Netris raised a $15M Series A led by a16z to expand its NAAM platform, which automates and isolates the networking layer so AI “neocloud” operators can bring GPU clusters online in weeks instead of months.
4. Cerebras stock plunges after earnings: Cerebras shares fell nearly 20% after its first post-IPO earnings, as a full-year core gross-margin forecast of 38–41% (down from 47% in Q1) spooked investors. CEO Andrew Feldman argued the guidance was “misunderstood” and reflects a temporary decision to lease systems back from a customer while the company builds data-center capacity.
5. Groq confirms $650M raise: Six months after Nvidia licensed its chip tech and poached its founder, Groq confirmed a $650M raise (led by Disruptive and Infinitum) and a rebuilt executive bench to pivot toward selling AI inference cloud capacity across its 13 data centers.
6. Google DeepMind invests $75M in A24: Google DeepMind announced a “first-of-its-kind” research partnership with film studio A24, including a ~$75M investment, to co-develop AI filmmaking tools with working filmmakers.
7. General Intuition in talks to raise $300M: This June 18 TechCrunch scoop reported that General Intuition was in talks to raise ~$300M at around a $2B valuation; it’s the rumor that item #2 above later confirmed.
8. US awards $250M to I-Pulse: The Commerce Department’s CHIPS R&D program signed a definitive agreement to give I-Pulse, a pulsed-power and semiconductor venture co-founded by mining billionaire Robert Friedland, a $250M award to develop silicon-carbide chips for a high-power geothermal drilling technique and defense applications.
9. SK Hynix files for ~$29.4B US listing: SK Hynix filed to raise up to 45.45 trillion won (~$29.4B) via a Nasdaq ADR listing (expected to begin trading July 10), tapping US investor appetite for AI memory after an ~850% one-year stock run, with proceeds earmarked for HBM fabs, packaging plants, and EUV equipment.
10. ByteDance seeks $20B offshore loan: The TikTok parent is in early talks with banks for roughly $20B in new offshore borrowing (its largest ever) to help fund an aggressive AI-infrastructure buildout, with a possible three-year term extendable to five.
