The Sequence Radar #554 : The New DeepSeek R1-0528 is Very Impressive
The new model excels at math and reasoning.
Next Week in The Sequence:
In our series about evals, we discuss multiturn benchmarks. The engineering section dives into Anthropic's amazing Circuits work for ML interpretability. In research, we discuss some of UC Berkeley's recent work in LLM reasoning. Our opinion section examines the state of AI interpretability.
📝 Editorial: The New DeepSeek R1-0528 is Very Impressive
This week, DeepSeek AI pushed the boundaries of open-source language modeling with the release of DeepSeek R1-0528. Building on the foundation of the original R1 release, this update delivers notable gains in mathematical reasoning, code generation, and long-context understanding. With improvements derived from enhanced optimization and post-training fine-tuning, R1-0528 marks a critical step toward closing the performance gap between open models and their proprietary counterparts like GPT-4 and Gemini 1.5.
At its core, DeepSeek R1-0528 preserves the powerful 671B-parameter Mixture-of-Experts (MoE) architecture, activating 37B parameters per forward pass. This architecture delivers high-capacity performance while optimizing for efficiency, especially in inference settings. One standout feature is its support for 64K-token context windows, enabling the model to engage with substantially larger inputs—ideal for technical documents, structured reasoning chains, and multi-step planning.
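The routing idea behind a Mixture-of-Experts layer—activating only a few experts per token so a huge total parameter count yields a small active count—can be sketched in a few lines. The dimensions, expert count, and top-k value below are toy values for illustration, not DeepSeek's actual configuration:

```python
import numpy as np

def moe_layer(x, expert_weights, gate_weights, k=2):
    """Toy Mixture-of-Experts layer: route a token to its top-k experts.

    x:              (d,) input token vector
    expert_weights: list of (d, d) matrices, one per expert
    gate_weights:   (n_experts, d) router matrix
    k:              number of experts activated per token
    """
    logits = gate_weights @ x              # router score for each expert
    top = np.argsort(logits)[-k:]          # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over the selected experts only
    # Only the k selected experts run a forward pass; the rest stay idle,
    # which is how 671B total parameters can mean just 37B active per token.
    return sum(w * (expert_weights[i] @ x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
x = rng.normal(size=d)
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
gate = rng.normal(size=(n_experts, d))
y = moe_layer(x, experts, gate, k=2)       # (8,) output from 2 of 4 experts
```

In a real model the router is learned jointly with the experts and balanced with auxiliary losses; this sketch only shows the select-then-mix step.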
In terms of capability uplift, the model shows remarkable progress in competitive benchmarks. On AIME 2025, DeepSeek R1-0528 jumped from 70% to an impressive 87.5%, showcasing an increasingly sophisticated ability to tackle complex mathematical problems. This leap highlights not just better fine-tuning, but a fundamental improvement in reasoning depth—an essential metric for models serving scientific, technical, and educational use cases.
For software engineering and development workflows, R1-0528 brings meaningful updates. Accuracy on LiveCodeBench rose from 63.5% to 73.3%, confirming improvements in structured code synthesis. The inclusion of JSON-formatted outputs and native function calling support positions the model as a strong candidate for integration into automated pipelines, copilots, and tool-augmented environments where structured outputs are non-negotiable.
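The structured-output pattern mentioned above typically works as sketched below: the caller registers a tool schema, the model replies with a JSON-formatted function call, and the caller parses and dispatches it. The tool name, schema fields, and reply shape here are hypothetical illustrations, not taken from DeepSeek's API documentation:

```python
import json

# A tool definition in the JSON-schema style common to function-calling APIs.
# The function name and parameters are hypothetical examples.
weather_tool = {
    "name": "get_weather",
    "description": "Look up current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def dispatch(model_reply: str):
    """Parse a JSON-formatted function call and route it to a local handler."""
    call = json.loads(model_reply)
    if call.get("name") == "get_weather":
        # A real pipeline would call an actual weather service here.
        return f"weather for {call['arguments']['city']}"
    raise ValueError(f"unknown tool: {call.get('name')}")

# A reply shaped the way function-calling models emit tool invocations:
reply = json.dumps({"name": "get_weather", "arguments": {"city": "Paris"}})
result = dispatch(reply)
```

Because the reply is machine-parseable JSON rather than free text, failures surface as parse or validation errors instead of silent misreads—this is why structured outputs matter for automated pipelines.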
To ensure broad accessibility, DeepSeek also launched a distilled variant: R1-0528-Qwen3-8B. Despite its smaller footprint, this model surpasses Qwen3-8B on AIME 2024 by over 10%, while rivaling much larger competitors like Qwen3-235B-thinking. This reflects DeepSeek's commitment to democratizing frontier performance, enabling developers and researchers with constrained compute resources to access state-of-the-art capabilities.
DeepSeek R1-0528 is more than just a model upgrade—it's a statement. In an ecosystem increasingly dominated by closed systems, DeepSeek continues to advance the case for open, high-performance AI. By combining transparent research practices, scalable deployment options, and world-class performance, R1-0528 signals a future where cutting-edge AI remains accessible to the entire community—not just a privileged few.
Join Me for a Chat About AI Evals and Benchmarks:
🔎 AI Research
FLEX-Judge: Think Once, Judge Anywhere
AI Lab: KAIST AI
Summary: FLEX-Judge is a reasoning-first multimodal evaluator trained on just 1K text-only explanations, achieving zero-shot generalization across images, audio, video, and molecular tasks while outperforming larger commercial models. It leverages textual reasoning alone to train a judge model that generalizes across modalities without modality-specific supervision.
Learning to Reason without External Rewards
AI Lab: UC Berkeley & Yale University
Summary: INTUITOR introduces a novel self-supervised reinforcement learning framework that uses self-certainty as an intrinsic reward, matching supervised methods on math and outperforming them on code generation without any external feedback. In effect, self-certainty replaces gold labels as the reward signal for reinforcement learning.
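One simple way to operationalize such a confidence signal—shown here as an assumption about the general idea, not the paper's exact formula—is to score how far each next-token distribution is from uniform: a confident (peaked) model earns a high intrinsic reward, an uncertain (flat) one earns roughly zero.

```python
import numpy as np

def self_certainty(token_dists):
    """Intrinsic-reward sketch: mean KL(p || uniform) over a sequence's
    next-token distributions. Peaked distributions score high; a perfectly
    uniform distribution scores exactly 0."""
    scores = []
    for p in token_dists:
        p = np.asarray(p, dtype=float)
        u = 1.0 / len(p)                   # uniform probability per token
        nz = p > 0                         # treat 0 * log(0) as 0
        scores.append(np.sum(p[nz] * np.log(p[nz] / u)))
    return float(np.mean(scores))

# Toy 4-token vocabulary, 3 generation steps each:
confident = [[0.97, 0.01, 0.01, 0.01]] * 3
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 3
```

An RL loop could then reinforce generations with high `self_certainty` instead of comparing against gold labels, which is the gist of using model confidence as the reward.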
Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning
AI Lab: Google DeepMind & Northwestern University
Summary: This paper introduces BARL, a novel Bayes-Adaptive RL algorithm that enables large language models to perform test-time reflective reasoning by switching strategies based on posterior beliefs over MDPs. The authors show that BARL significantly outperforms Markovian RL approaches in math reasoning tasks by improving token efficiency and adaptive exploration.
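The "posterior beliefs over MDPs" idea reduces, at its core, to Bayes' rule over a discrete set of hypotheses about which task dynamics (and hence which reasoning strategy) the agent faces. The two-hypothesis setup below is a minimal illustration of that belief update, not BARL's actual algorithm:

```python
import numpy as np

def update_posterior(prior, likelihoods):
    """Bayes' rule over a discrete set of MDP hypotheses:
    posterior ∝ prior × likelihood of the observed outcome under each."""
    post = np.asarray(prior, dtype=float) * np.asarray(likelihoods, dtype=float)
    return post / post.sum()

# Two hypothetical hypotheses about which reasoning strategy the task rewards.
prior = np.array([0.5, 0.5])
# The observed outcome fits hypothesis 1 far better than hypothesis 0:
posterior = update_posterior(prior, likelihoods=[0.1, 0.9])
best = int(np.argmax(posterior))   # the agent would switch to strategy 1
```

Reflective exploration then amounts to acting under the current posterior and switching strategies when new evidence shifts the belief—unlike a Markovian policy, which cannot condition on this accumulated evidence.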
rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset
AI Lab: Microsoft Research Asia
Summary: The authors present rStar-Coder, a dataset of 418K competitive programming problems and 580K verified long-reasoning code solutions, which drastically boosts the performance of Qwen models on code reasoning benchmarks. Their pipeline introduces a robust input-output test case synthesis method and mutual-verification mechanism, achieving state-of-the-art performance even with smaller models.
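The mutual-verification idea—trusting an output only when independent candidate solutions agree on it—can be sketched as a majority vote over shared test inputs. The candidate functions below are hypothetical stand-ins for model-generated solutions, not rStar-Coder's pipeline:

```python
from collections import Counter

def mutual_verify(candidates, test_inputs):
    """Mutual-verification sketch: run every candidate solution on shared
    inputs and keep, per input, the output a strict majority agrees on."""
    labels = {}
    for x in test_inputs:
        outputs = [f(x) for f in candidates]
        winner, count = Counter(outputs).most_common(1)[0]
        if count > len(candidates) // 2:   # require a strict majority
            labels[x] = winner
    return labels

# Three hypothetical candidate solutions for "square x"; one is buggy.
def good1(x): return x * x
def good2(x): return x ** 2
def buggy(x): return x * x + (1 if x == 3 else 0)

labels = mutual_verify([good1, good2, buggy], [1, 2, 3])
```

Agreement among independently generated solutions serves as a cheap correctness proxy, letting a pipeline label synthesized test cases without a human-written reference solution.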
MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs
AI Lab: Fudan University, CUHK MMLab, Shanghai AI Lab
Summary: MME-Reasoning offers a benchmark of 1,188 multimodal reasoning tasks spanning inductive, deductive, and abductive logic, revealing significant limitations in current MLLMs’ logical reasoning. The benchmark includes multiple question types and rigorous metadata annotations, exposing reasoning gaps especially in abductive tasks.
DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research
AI Lab: Carnegie Mellon University, NOVA LINCS, INESC-ID
Summary: DeepResearchGym is an open-source sandbox providing reproducible search APIs and evaluation protocols over ClueWeb22 and FineWeb for benchmarking deep research agents. It supports scalable dense retrieval and long-form response evaluation using LLM-as-a-judge assessments across dimensions like relevance and factual grounding.
Fine-Tuning Large Language Models with User-Level Differential Privacy
AI Lab: Google Research
Summary: This study compares two scalable user-level differential privacy methods (ELS and ULS) for LLM fine-tuning, with a novel privacy accountant that tightens DP guarantees for ELS. Experiments show that ULS generally offers better utility under large compute budgets or strong privacy settings, while maintaining scalability to hundreds of millions of parameters and users.
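The core mechanism behind user-level differential privacy—bounding each user's total influence, then adding calibrated noise—can be sketched as below. This is a minimal illustration of per-user clipping plus Gaussian noise, not the ELS/ULS algorithms from the paper, and the clip norm and noise multiplier are arbitrary toy values:

```python
import numpy as np

def user_level_dp_mean(per_user_grads, clip_norm=1.0, noise_mult=1.0, seed=0):
    """User-level DP sketch: clip each USER's aggregated contribution
    (not each example's), average, then add Gaussian noise scaled to
    the clip norm so no single user dominates the update."""
    rng = np.random.default_rng(seed)
    clipped = []
    for g in per_user_grads:
        g = np.asarray(g, dtype=float)
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / norm) if norm > 0 else g)
    mean = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_mult * clip_norm / len(clipped),
                       size=mean.shape)
    return mean + noise

# Three users' aggregated gradients; the first exceeds the clip norm.
grads = [[3.0, 4.0], [0.3, 0.4], [0.0, 0.0]]
out = user_level_dp_mean(grads, clip_norm=1.0, noise_mult=0.0)
```

Clipping at the user level, rather than the example level, is exactly what upgrades the guarantee from example-level to user-level DP: a user with many examples still contributes at most one clip-norm's worth of signal.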
🤖 AI Tech Releases
DeepSeek-R1-0528
DeepSeek released a new version of its marquee R1 model.
Anthropic Circuits
Anthropic open sourced its circuit interpretability technology.
Perplexity Labs
Perplexity released a new tool that can generate charts, spreadsheets and dashboards.
Codestral Embed
Mistral released Codestral Embed, an embedding model specialized in coding.
🛠 AI in Production
Multi-Task Learning at Netflix
Netflix shared some details about its multi-task prediction strategy for user intent.
📡 AI Radar
Salesforce agreed to buy Informatica for $8 billion.
xAI and Telegram partnered to enable Grok for its users.
Netflix’s Reed Hastings joined Anthropic’s board of directors.
Grammarly raised $1 billion to accelerate sales and acquisitions.
Spott raised $3.2 million for its AI-native recruiting platform.
Context raised $11 million to power an AI-native office suite.
Rillet raised $25 million to enable AI for mid-market accounting.
HuggingFace unveiled two new open source robots.