The Sequence Opinion #860: Every Company’s Last eXam: Some Reflections on Practical AI Evals
Some ideas about how companies should think about evaluations.
For today’s essay, I want to explore an idea that has become central to how we think about AI evaluations at LayerLens. This is not an essay about LayerLens, but about a simple and increasingly unavoidable thesis: evals are becoming the fourth pillar of modern AI, alongside compute, data, and models. As AI systems move from chatbots to agents, from demonstrations to production workflows, every meaningful task performed by every agent inside every company will need its own evaluation layer. Not generic benchmarks. Not leaderboard theater. Practical, dynamic, company-specific exams that measure whether an AI system can actually survive contact with real work. I call this idea Every Company’s Last eXam.
Humanity’s Last Exam is a very specific kind of artifact. It is what a field builds when the old report card stops working. The core observation behind it was simple: familiar benchmarks such as MMLU had become too easy to cleanly separate frontier systems, so researchers assembled a harder, broader, multimodal test at the frontier of human knowledge, finalized at 2,500 questions after removing errors and questions that were too easily answerable with search. And then, almost immediately, the benchmark itself taught a second lesson: even “the last exam” needs maintenance. HLE-Verified later showed that noisy items and flawed answers could materially distort comparisons, and that systematic verification could shift measured accuracy by 7 to 10 percentage points on average. In other words, the benchmark was not a stone tablet. It was infrastructure.
That is the right analogy for where enterprise AI is going. Every company now needs its own last exam: a private, living evaluation suite that captures the highest-value, highest-risk, most context-heavy work its agents are supposed to perform. Not a generic IQ test for models. Not another public leaderboard. More like a company-specific CI system for cognition. The public benchmarks still matter, just as SPEC mattered for CPUs and ImageNet mattered for vision, but production truth has moved downstream into proprietary workflows, private documents, internal policies, odd exceptions, and all the sharp edges that never make it into a paper appendix. That is why top frontier labs now emphasize task-specific evals, production-derived datasets, continuous maintenance, and explicit definitions of success rather than vibe-based model selection.
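To make the “CI system for cognition” framing concrete, here is a minimal, hypothetical sketch rather than anything from LayerLens or a frontier lab: each eval case is derived from a production workflow and carries an explicit, executable definition of success, and the suite runs like a CI job that reports a pass rate and the failing case IDs. The agent interface, the EvalCase fields, and the refund-policy example are all illustrative assumptions.

```python
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    """One production-derived task with an explicit definition of success."""
    case_id: str
    task_input: str                 # e.g. a redacted ticket, document, or query
    passes: Callable[[str], bool]   # explicit, executable success criterion

def run_eval_suite(agent: Callable[[str], str], cases: list[EvalCase]) -> dict:
    """Run every case against the agent and report results, like a CI job."""
    results = {c.case_id: c.passes(agent(c.task_input)) for c in cases}
    return {
        "pass_rate": sum(results.values()) / len(results),
        "failures": [cid for cid, ok in results.items() if not ok],
    }

# A hypothetical case pulled from a real (redacted) support workflow.
cases = [
    EvalCase(
        case_id="refunds-001",
        task_input="Customer bought 40 days ago and asks for a refund. Policy: 30-day window.",
        passes=lambda out: "deny" in out.lower()
        or "outside the 30-day window" in out.lower(),
    ),
]

if __name__ == "__main__":
    # Stand-in agent for illustration; in practice this would call your model or agent stack.
    stub_agent = lambda prompt: "I would deny this request; it is outside the 30-day window."
    print(json.dumps(run_eval_suite(stub_agent, cases), indent=2))
```

The point of the sketch is the shape, not the code: cases come from real work, success is defined before the model runs, and the suite is rerun every time the model, the prompt, or the policy changes.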

