The Sequence Radar #803: Last Week in AI: Anthropic and OpenAI’s Battle for the Long Horizon, Goodfire and LayerLens Push AI Accountability
Major AI releases this week.
Next Week in The Sequence:
Our series on world models continues with the Dreamer models, which have become classics in the space.
AI of the week will dive into Goodfire’s work on AI interpretability.
We are back with our interview series, featuring a surprise interview.
The opinion section dives into the new agent-as-a-judge space.
Subscribe and don’t miss out:
📝 Editorial: Last Week in AI: Anthropic and OpenAI’s Battle for the Long Horizon, Goodfire and LayerLens Push AI Accountability
The first week of February 2026 has marked a definitive shift in the artificial intelligence landscape, transitioning from models that merely converse to “agentic” systems that independently think, plan, and execute. This evolution is defined by a surge in autonomous capabilities and a critical new focus on the verification and interpretability of these increasingly complex digital minds.
The Rise of the Agentic Giants
The industry’s primary protagonists, OpenAI and Anthropic, have significantly raised the stakes with simultaneous releases focused on multi-step autonomy. OpenAI debuted GPT-5.3-Codex, a model that signals a new paradigm of self-improvement. OpenAI’s own engineers reportedly used early versions of the model to debug its training, manage deployment, and analyze test results, making it the first model instrumental in its own creation. Available through a dedicated application and CLI, the system behaves like a colleague that can handle projects stretching over days, allowing users to steer its behavior in real time without losing context.
Anthropic countered with Claude Opus 4.6, its most sophisticated model for professional work and agentic workflows to date. Introducing a landmark one-million-token context window, the model can ingest and reason across entire codebases or massive legal archives in a single pass. Its “adaptive thinking” protocol allows the model to autonomously determine when a task requires deeper reasoning, effectively managing its own cognitive resources to solve high-stakes problems with superior reliability.
Decoding the Black Box
As models grow more autonomous, the “black box” problem becomes a matter of systemic safety. Goodfire, an AI research lab dedicated to model interpretability, secured a $150 million Series B funding round this week at a $1.25 billion valuation. Goodfire’s work aims to transform AI from an opaque system into a transparent one that can be debugged like software.
The lab’s platform, Ember, allows researchers to “map out” a model’s internal components and decode neurons to precisely shape behavior and reduce hallucinations. This interpretability-driven approach has already yielded scientific results, such as the discovery of a novel class of Alzheimer’s biomarkers by reverse-engineering an epigenetic foundation model—marking a major milestone for AI in the natural sciences.
The Infrastructure of Accountability
Crucially, as these agents move from guidance to execution, the industry is shifting toward more robust evaluation frameworks. The release of LayerLens’s “agent-as-a-judge” capabilities (I am a co-founder) exemplifies this new era of accountability. While previous evaluations relied on “vibes” or static multiple-choice tests, LayerLens focuses on enabling agentic evaluation capabilities that can verify complex, 50-step trajectories involving tool use and database interactions.
By providing an independent oversight layer that analyzes the reasoning chains and execution artifacts of autonomous systems, LayerLens ensures that agents remain grounded in correctness and UX intent. This move toward “Evals 2.0” is essential for shipping code assistants and autonomous agents that users can truly trust in production environments.
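To make the pattern concrete, here is a minimal sketch of agent-as-a-judge evaluation: a judge model reviews each recorded step of an agent’s trajectory (tool calls, arguments, outputs, stated rationale) instead of only grading the final answer. All names below are hypothetical illustrations and do not reflect LayerLens’s actual API.

```python
from dataclasses import dataclass

@dataclass
class Step:
    tool: str        # tool the agent invoked, e.g. "sql_query"
    args: dict       # arguments passed to the tool
    output: str      # what the tool returned
    rationale: str   # the agent's stated reason for taking this step

def judge_trajectory(steps: list[Step], task: str, ask_judge) -> dict:
    """Score each step of an agent trajectory with a judge model.

    `ask_judge` is any callable that sends a prompt to a judge LLM and
    returns its text response; it stands in for whatever client you use.
    """
    verdicts = []
    for i, step in enumerate(steps):
        prompt = (
            f"Task: {task}\n"
            f"Step {i + 1}: called {step.tool} with {step.args}\n"
            f"Output: {step.output}\n"
            f"Agent's rationale: {step.rationale}\n"
            "Does this step make valid progress toward the task? "
            "Answer PASS or FAIL with a one-line reason."
        )
        verdicts.append(ask_judge(prompt))
    passed = sum(v.startswith("PASS") for v in verdicts)
    return {"verdicts": verdicts, "step_pass_rate": passed / len(steps)}
```

The key point is that the judge sees execution artifacts, not just the final output, which is what separates this style of evaluation from static benchmark scoring.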
Shameless plug: follow LayerLens on X.
We are witnessing the birth of a collaborative intelligence era. Between the agentic coding of OpenAI, the massive reasoning horizons of Anthropic, and the interpretability breakthroughs of Goodfire, the narrative has moved past “what can AI say” to “what can AI reliably do.” With frameworks like LayerLens providing the necessary oversight for this phase transition, the path is being cleared for AI to move from being a sophisticated tool to a verifiable, autonomous partner in the modern economy.
🔎 AI Research
SWE-Universe: Scale Real-World Verifiable Environments to Millions
AI Lab: Qwen Team, Alibaba Group
Summary: This paper introduces SWE-Universe, a framework that automatically constructs over 800,000 verifiable software engineering environments from GitHub pull requests to train and evaluate coding agents. The system uses an autonomous building agent and an in-loop “hacking detector” to ensure that generated tasks are high-fidelity and require actual code execution rather than simple string matching to solve.
Kimi K2.5: Visual Agentic Intelligence
AI Lab: Kimi Team
Summary: Kimi K2.5 is an open-source multimodal model that optimizes text and vision together through joint pre-training and reinforcement learning to enhance performance across both modalities. The report also introduces “Agent Swarm,” a framework that reduces latency by orchestrating parallel sub-agents to decompose and execute complex tasks concurrently.
MARS: Modular Agent with Reflective Search for Automated AI Research
AI Lab: Google Cloud AI Research
Summary: MARS is a framework designed for autonomous AI research that uses budget-aware Monte Carlo Tree Search (MCTS) to balance model performance with the high computational costs of training. It employs a modular “Design-Decompose-Implement” pipeline and comparative reflective memory to better manage complex codebases and distill causal insights from past experiments.
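As a hedged illustration of what budget-aware search could look like (the paper’s actual scoring rule may differ), one way to make MCTS cost-sensitive is to subtract a compute-cost penalty from the standard UCT score during child selection:

```python
import math

def budget_aware_uct(parent_visits: int, children, exploration: float = 1.4,
                     cost_weight: float = 0.1):
    """Select the child maximizing UCT minus a penalty proportional to its
    estimated compute cost. Illustrative only; `children` are assumed to
    be objects with `value`, `visits`, and `expected_cost` attributes."""
    def score(child):
        exploit = child.value / max(child.visits, 1)
        explore = exploration * math.sqrt(
            math.log(parent_visits + 1) / max(child.visits, 1))
        return exploit + explore - cost_weight * child.expected_cost
    return max(children, key=score)
```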
Likelihood-Based Reward Designs for General LLM Reasoning
AI Lab: Meta FAIR
Summary: This study investigates using the log-probability of reference answers as a universal reward signal for fine-tuning LLMs on both verifiable math tasks and non-verifiable long-form answers. The researchers found that log-probability rewards consistently perform well across different setups and models, matching the success of standard binary rewards while significantly improving perplexity.
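As a rough sketch of the core idea (not the paper’s exact recipe), the reward can be computed as the mean log-probability a Hugging Face causal LM assigns to the reference answer, conditioned on the prompt; the paper’s setups also condition on the model’s own sampled reasoning:

```python
import torch
import torch.nn.functional as F

def logprob_reward(model, tokenizer, prompt: str, reference: str) -> float:
    """Reward = mean log-probability of the reference answer tokens,
    conditioned on the prompt. Higher means the model considers the
    reference answer more likely."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    ref_ids = tokenizer(reference, return_tensors="pt",
                        add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, ref_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # The logit at position t predicts the token at position t + 1,
    # so this slice covers exactly the reference tokens.
    ref_logits = logits[:, prompt_ids.size(1) - 1 : -1, :]
    log_probs = F.log_softmax(ref_logits, dim=-1)
    token_lp = log_probs.gather(-1, ref_ids.unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()
```

Unlike a binary exact-match reward, this signal stays dense for long-form answers where no single verifiable string exists.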
ERNIE 5.0 Technical Report
AI Lab: ERNIE Team, Baidu
Summary: ERNIE 5.0 is a trillion-parameter unified autoregressive model trained from scratch to handle understanding and generation across text, image, video, and audio modalities. It features a novel “elastic training” paradigm that allows a single pre-training run to produce a family of sub-models with varying sizes and efficiencies to meet diverse deployment constraints.
When does predictive inverse dynamics outperform behavior cloning?
AI Lab: Microsoft, McGill University (Mila), and Tsinghua University
Summary: This paper provides a theoretical and empirical analysis of why Predictive Inverse Dynamics Models (PIDM) achieve higher sample efficiency than Behavior Cloning (BC) by decomposing the imitation learning problem into future state prediction and action selection. The researchers demonstrate that PIDM introduces a beneficial bias-variance tradeoff where conditioning on future states reduces action uncertainty, allowing it to outperform BC by up to 5x in 2D navigation and 66% in complex 3D gaming environments.
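Schematically (a minimal sketch, not the paper’s architecture), the decomposition looks like this: PIDM first predicts the expert’s next state, then infers the action that reaches it, while behavior cloning regresses actions directly from states.

```python
import torch
import torch.nn as nn

class PIDMPolicy(nn.Module):
    """Predictive inverse dynamics: predict where the expert would go
    next, then infer the action that gets there. Conditioning the action
    head on the predicted future state is what reduces action
    uncertainty in the paper's bias-variance analysis."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        # f(s_t) -> s_{t+1}: trained to predict the expert's next state.
        self.future = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))
        # g(s_t, s_{t+1}) -> a_t: inverse dynamics on the predicted state.
        self.inverse = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim))

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        next_state = self.future(state)
        return self.inverse(torch.cat([state, next_state], dim=-1))

# Behavior cloning, by contrast, maps states straight to actions:
# bc = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
#                    nn.Linear(128, action_dim))
```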
🤖 AI Tech Releases
Opus 4.6
Anthropic released Claude Opus 4.6, its marquee model that excels in coding tasks.
GPT-5.3-Codex
OpenAI released GPT-5.3-Codex, a new model specialized in agentic coding.
Frontier
OpenAI released Frontier, a new platform for AI agent management in enterprise environments.
LayerLens Agent-as-a-Judge
LayerLens (I am a co-founder) launched a new set of AI evaluation capabilities based on agent-as-a-judge techniques.
Qwen-Coder-Next
The Qwen team open sourced Qwen-Coder-Next, a new model designed for agentic coding and local development.
Voxtral Transcribe 2
Mistral released Voxtral Transcribe 2, two new-generation speech-to-text models.
📡 AI Radar
Sapiom Raises $15M to Enable AI Agent Commerce
Summary: Sapiom secured $15 million in seed funding to develop a secure financial layer that allows autonomous AI agents to independently purchase and access the technical tools they need to complete tasks.
Fundamental Emerges from Stealth with $255M for Tabular AI
Summary: Founded by DeepMind alumni, Fundamental launched with $255 million in Series A funding and debuted “NEXUS,” a foundation model specifically designed to extract predictive intelligence from massive enterprise tabular datasets.
Resolve AI Reaches $1B Valuation for Autonomous SRE
Summary: Resolve AI confirmed a $125 million Series A round at a unicorn valuation to scale its “AI SRE” platform, which automates the detection and resolution of complex cloud infrastructure outages.
ElevenLabs Secures $500M to Triple Valuation to $11B
Summary: The AI audio leader raised $500 million in a Series D round led by Sequoia Capital, valuing the company at $11 billion as it expands its “ElevenAgents” platform for conversational AI.
Intel Enters GPU Market Targeting Data Center AI
Summary: Intel CEO Lip-Bu Tan announced a new GPU initiative led by industry veterans to challenge Nvidia’s dominance in accelerated computing for AI training and inference.
Lotus Health Nabs $35M for Free AI-Driven Primary Care
Summary: Lotus Health raised $35 million in Series A funding to expand its infrastructure for a 24/7 “AI doctor” service that provides free primary care consultations.
Linq Secures $20M for Native AI Messaging Integration
Summary: Linq raised $20 million to build the infrastructure that allows AI assistants to live natively within iMessage, RCS, and SMS, treating AI-to-human communication like texting a friend.
Oracle Launches Blockbuster $20-25B Bond Sale for AI Cloud
Summary: Oracle initiated a massive bond offering to fund its $45-50 billion capital expansion plan for 2026, driven by high demand for AI cloud infrastructure from partners like NVIDIA and OpenAI.
Goodfire Valued at $1.25B to Advance AI Interpretability
Summary: AI research lab Goodfire raised $150 million in Series B funding to scale its platform that uses interpretability research to understand and improve how large language models function.
Accrual Launches with $75M to Automate Accounting Workflows
Summary: Backed by General Catalyst, Accrual debuted with $75 million to bring AI-native automation to the preparation and review processes for Top 100 accounting firms.
Cerebras Systems Raises $1B at $23B Valuation
Summary: The AI chipmaker closed a $1 billion Series H round to further the development of its wafer-scale processors, which claim to be significantly faster than traditional GPUs for AI workloads.
VMS Group Backs $100M AI Infrastructure Fund in Hong Kong
Summary: Hong Kong-based family office VMS Group is the anchor backer for a new $100 million venture fund by 3C AGI Partners focused on global AI hardware and data center infrastructure.

