The Sequence Radar #696: Google AI Ultra’s New Gold-Medal Reasoning Model is Available
The new model is available to Google AI Ultra subscribers in the Gemini app.
Next Week in The Sequence:
Knowledge: We continue with our series about frontier model interpretability.
AI of the Week: A deep dive into E2B micro-containers for coding.
Opinion: Exploring a controversial idea: we might reach AGI by 2030 by scaling compute alone, but if not, we will need algorithmic improvements.
Subscribe Now to Not Miss Anything
📝 Editorial: Google AI Ultra’s New Gold-Medal Reasoning Model is Available
Google just made its math-olympiad gold-medalist model available! Gemini 2.5 Deep Think, released this week to Google AI Ultra subscribers, has already demonstrated its extraordinary reasoning power by securing a gold-medal performance at the 2025 International Math Olympiad, solving four shortlisted proofs under timed, exam-style conditions and matching the best human competitors. The magic of Deep Think is its ability to operationalize truly parallel hypothesis streams, spawning multiple "agents" to explore solution trajectories simultaneously rather than committing to a single linear path. More details here.
At the heart of Deep Think lies the convergence of sparse mixture-of-experts (MoE) and multi-agent prompting. Its MoE backbone dynamically routes tokens among specialized expert sub-networks, delivering enormous total capacity with controlled inference costs. Layered atop this, a multi-agent orchestration mechanism dispatches parallel reasoning threads that independently propose and refine sub-solutions before synthesizing a final answer, thereby mitigating path dependency and amplifying creative problem solving.
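The parallel-agent idea above can be illustrated with a toy sketch. Everything here is hypothetical scaffolding, not Google's implementation: `generate` stands in for one stochastic reasoning trajectory, and the "synthesis" step is simplified to picking the candidate with the highest verifier score.

```python
import random

def generate(prompt, seed):
    # Stand-in for one reasoning "agent": a stochastic solver that
    # returns a candidate solution and a self-assessed quality score.
    rng = random.Random(seed)
    candidate = f"solution-{seed}"
    score = rng.random()  # placeholder for a learned verifier's score
    return candidate, score

def parallel_deep_think(prompt, n_agents=8):
    # Spawn several independent reasoning trajectories, then
    # synthesize by keeping the highest-scoring candidate instead
    # of committing to a single linear solution path.
    candidates = [generate(prompt, seed) for seed in range(n_agents)]
    best, best_score = max(candidates, key=lambda c: c[1])
    return best, best_score

answer, score = parallel_deep_think("Prove the inequality ...")
```

The design point is that exploration breadth substitutes for backtracking: a dead-end trajectory costs nothing beyond its own compute, because other agents continue independently.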
Technically, Gemini 2.5 Deep Think supports a 1 million-token context window and can generate up to 192,000 tokens in one session, making it uniquely capable of deep codebase audits, extended symbolic reasoning, and multimodal investigatory workflows. Its training incorporated bespoke theorem-proving corpora and reinforcement-learning datasets that reward systematic, stepwise deductions, while its multimodal design—across text, vision, audio, and video—paves the way for unified cross-domain reasoning tasks.
Perhaps the most compelling testament to its prowess arrived at the 2025 International Math Olympiad (IMO), where Deep Think was tasked with solving four proofs drawn from the competition’s shortlist. The model achieved a gold‑medal performance—matching the top human scorers—by correctly deriving all four solutions under time constraints comparable to human exam durations. Its parallel‑agent framework enabled simultaneous exploration of multiple proof strategies, trimming solution time by over 50% compared to single‑pass baselines and outperforming earlier dense‑decoder prototypes in both accuracy and elegance of exposition.
Google plans a phased API rollout of Deep Think, offering both “with‑tools” and “without‑tools” variants alongside usage quotas and cost‑management guidelines. In the Gemini Pro UI, Ultra subscribers see a dedicated toggle granting limited daily Deep Think sessions, while developers can soon integrate its capabilities via the Gemini API under a usage‑based pricing model calibrated for its higher computational demands.
Gemini 2.5 Deep Think heralds a paradigm shift: AI systems that autonomously explore and evaluate multiple solution paths rather than following a single monolithic heuristic. Its success at the IMO demonstrates not only raw computational acumen but strategic reasoning akin to human experts. As the technical AI community embraces this leap, crucial conversations will revolve around sustainability, equitable access, and governance frameworks to ensure that next-generation reasoning engines advance collective scientific and technological frontiers responsibly.
🔎 AI Research
Title: Persona Vectors: Monitoring and Controlling Character Traits in Language Models
AI Lab: Anthropic
Summary: This paper introduces "persona vectors"—automatically extracted linear directions in a model’s activation space that correspond to specific character traits (e.g., evil, sycophantic, hallucinating). The authors demonstrate how these vectors can be used to monitor, control, and prevent unwanted personality shifts in large language models during training, deployment, and even before fine-tuning by analyzing training data projections.
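A common way to extract such a linear direction is a difference of mean activations between trait-eliciting and neutral prompts. The sketch below uses synthetic data as a stand-in for real hidden states; the array shapes, `trait_score`, and `steer` helpers are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for hidden activations collected while a model responds
# to trait-eliciting prompts (e.g. sycophantic) vs. neutral prompts.
trait_acts = rng.normal(loc=1.0, size=(100, 64))
neutral_acts = rng.normal(loc=0.0, size=(100, 64))

# A persona vector as a difference-of-means direction in activation space.
persona_vec = trait_acts.mean(axis=0) - neutral_acts.mean(axis=0)
persona_vec /= np.linalg.norm(persona_vec)

def trait_score(activation):
    # Monitoring: project an activation onto the persona direction.
    return float(activation @ persona_vec)

def steer(activation, alpha=-1.0):
    # Control: shift the activation along (or against) the direction;
    # a negative alpha suppresses the trait.
    return activation + alpha * persona_vec
```

Because `persona_vec` is unit-norm, steering by `alpha` shifts the projection by exactly `alpha`, which is what makes the same vector usable for both monitoring and control.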
Title: Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty
AI Lab: MIT
Summary: This paper introduces RLCR, a reinforcement learning framework that trains language models to jointly optimize for answer correctness and confidence calibration using a reward function combining binary correctness and the Brier score. RLCR improves both in-domain and out-of-domain calibration while maintaining accuracy, and enables models to verbalize uncertainty for more reliable and coherent reasoning behavior.
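The combined objective can be sketched in a few lines. The exact weighting and normalization in the paper may differ; this toy version simply adds a binary correctness term to a negated Brier penalty on the verbalized confidence.

```python
def rlcr_reward(correct: bool, confidence: float) -> float:
    # Binary correctness term plus a Brier-style calibration term:
    # the model is rewarded for being right AND for reporting a
    # confidence close to the actual outcome (0 = wrong, 1 = right).
    c = 1.0 if correct else 0.0
    brier_penalty = (confidence - c) ** 2
    return c - brier_penalty
```

Note the incentive structure: a confidently wrong answer (`rlcr_reward(False, 0.9)`) is punished harder than an honestly uncertain wrong one (`rlcr_reward(False, 0.1)`), which is what pushes the model toward calibrated verbalized uncertainty.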
Title: Group Sequence Policy Optimization
AI Lab: Qwen Team, Alibaba Inc.
Summary: This paper presents GSPO, a reinforcement learning algorithm that replaces token-level importance sampling with sequence-level optimization to stabilize training and improve performance in large language models, especially in Mixture-of-Experts architectures. GSPO significantly outperforms GRPO by reducing gradient noise, enabling scalable RL training, and eliminating the need for complex infrastructure strategies like Routing Replay.
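The core swap GSPO makes can be shown numerically: instead of one importance ratio per token, it uses a single length-normalized sequence ratio, which is the geometric mean of the per-token ratios. This is a minimal sketch of that relationship, not the full clipped objective.

```python
import math

def token_level_ratios(logp_new, logp_old):
    # GRPO-style: one importance ratio per token,
    # exp(logp_new_t - logp_old_t).
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def sequence_level_ratio(logp_new, logp_old):
    # GSPO-style: a single length-normalized sequence ratio, equal to
    # the geometric mean of the per-token ratios. Averaging in log
    # space damps the token-level noise that destabilizes training.
    T = len(logp_new)
    return math.exp((sum(logp_new) - sum(logp_old)) / T)
```

Because one noisy token ratio multiplies into the whole token-level product, the geometric-mean form is far less sensitive to individual outlier tokens, which is the paper's motivation for sequence-level optimization in MoE models.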
Title: Falcon-H1: A Family of Hybrid-Head Language Models
AI Lab: Falcon LLM Team (TII UAE)
Summary: This paper introduces Falcon-H1, a new family of large language models built on a hybrid architecture combining Transformer attention with Mamba-based State Space Models (SSMs) to achieve exceptional performance and efficiency across reasoning, multilingual, and long-context tasks. The models, ranging from 0.5B to 34B parameters, consistently outperform much larger models like Llama3-70B and Qwen2.5-72B on various benchmarks, while offering up to 8× faster inference in long-context settings and supporting multilingual capabilities across 18 languages.
Title: SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers
AI Lab: AMD (Advanced Micro Devices, Inc.)
Summary: This paper introduces SAND-Math, an automated pipeline that generates challenging and novel math problems using large language models and systematically amplifies their difficulty through a technique called "Difficulty Hiking." The resulting dataset significantly outperforms other synthetic benchmarks—boosting model performance by up to 17.85 points on AIME25—and proves especially effective for augmenting and training resource-efficient mathematical reasoning models.
Title: MLE-STAR: Machine Learning Engineering Agent via Search and Targeted Refinement
AI Lab: Google Cloud & KAIST
Summary: MLE-STAR is a machine learning engineering agent that combines web search with targeted code block refinement to autonomously construct and iteratively improve ML pipelines using large language models. It significantly outperforms prior agents like AIDE and DS-Agent across diverse Kaggle-style benchmarks by introducing ablation-guided exploration, automated ensembling, and robust safeguards like data leakage and usage checkers.
🤖 AI Tech Releases
GLM 4.5
Chinese AI lab Z.ai (formerly Zhipu AI) launched GLM 4.5, a powerful foundation model with strong agentic capabilities.
Llama Nemotron Super 1.5
NVIDIA released a new version of its Llama Nemotron model optimized for multi-step agentic tasks.
Codestral 25.08
Mistral released Codestral 25.08 as part of a full enterprise coding stack.
📡 AI Radar
Julius secures $10M to bring AI-powered data analysts to every business dashboard.
Robinhood’s CEO backs Harmonic, a startup debuting a math-savvy AI chatbot.
Microsoft transforms Edge into a full AI browser with its new Copilot integration.
Tesla and Samsung strike a $16.5B deal to co-develop custom AI chips for next-gen vehicles.
Groq, the AI chip challenger, nears a new funding round at a $6B valuation as demand for ultra-fast inference grows.
Sparrow lands $35M to automate the chaotic world of employee leave and absence management.
Writer launches Action Agent, enabling enterprise teams to build task-specific AI automations without code.
Anthropic edges closer to a $170B valuation, reportedly raising another $5B in a mega round.
E2B raises its Series A to power agents with persistent memory via cloud-based compute sandboxes.
Fundamental Research Labs banks $33M to build vertical-specific AI agents with multi-modal reasoning skills.
Arcee debuts AFM-4.5B, a customizable enterprise model trained on rigorously filtered data.
PlayerZero raises $15M to ensure AI agents stop pushing buggy code into production.
Two Berkeley dropouts raise $28M to bring full-stack AI to marketing automation.