Discussion about this post

PT Lambert:

Fixed benchmarks like this make no sense. It's akin to having a single SAT or GRE test and reusing it year after year. Even without deliberate memorization or gaming of the benchmarks, LLMs and students will naturally start to do better over time as they adapt to the benchmark content.

The solution is simple: just as with the SAT and GRE, benchmarks need to be refreshed constantly, at least every year or more often, depending on the iteration cycles of AI models.

Charles Fadel:

Just ramping up algebraic complexity is NOT an indication of reasoning, only of more training. To understand mathematical reasoning in LLMs, one would have to devise different benchmarks, as described in: https://curriculumredesign.org/wp-content/uploads/Benchmark-design-criteria-for-mathematical-reasoning-in-LLMs.pdf

Benchmark design criteria for mathematical reasoning in LLMs (2025)

This paper outlines key aspects of developing robust benchmarks for evaluating large language models (LLMs) in mathematical reasoning, highlights limitations of existing assessments, and proposes criteria for comprehensive evaluations.

