Discussion about this post

PT Lambert:

Fixed benchmarks like this make no sense. It's akin to having a single SAT or GRE test and reusing it year after year. Even without deliberate memorization or gaming of the benchmarks, LLMs and students will naturally start to do better over time as they adapt to the benchmark content.

The solution is simple: just as with the SAT and GRE, benchmarks need to be refreshed constantly, at least every year or more often, depending on the iteration cycles of AI models.

Charles Fadel:

Just ramping up algebraic complexity is NOT an indication of reasoning, only of more training. To understand mathematical reasoning in LLMs, one would have to devise different benchmarks, as described in: https://curriculumredesign.org/wp-content/uploads/Benchmark-design-criteria-for-mathematical-reasoning-in-LLMs.pdf

Benchmark design criteria for mathematical reasoning in LLMs (2025)

This paper outlines key aspects of developing robust benchmarks for evaluating large language models (LLMs) in mathematical reasoning, highlights limitations of existing assessments, and proposes criteria for comprehensive evaluations.

