Edge 450: Can LLMs Sabotage Human Evaluations?

New research from Anthropic provides some interesting ideas in this area.

Nov 21, 2024

Created Using Midjourney

Controlling the behavior of foundation models has been at the forefront of research in recent years, since it is a prerequisite for mainstream adoption. From a philosophical standpoint, the meta question is whether we can ultimately control intelligent entities that are far smarter than ourselves. Given that we are nowhere near that challenge, a more practical question is whether today's models show emergent behaviors that subvert human evaluations. This is the subject of fascinating new research from Anthropic.

In a new paper, Anthropic proposes a framework for assessing the risk of AI models sabotaging human efforts to control and evaluate them. The framework, called “Sabotage Evaluations”, provides a way to measure and mitigate the risk posed by misaligned models: models whose goals are not fully aligned with human intentions.
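
At a high level, evaluations of this kind pair a potentially misaligned model with an overseer and then score how often attempted sabotage slips past that oversight. The sketch below is a hypothetical illustration of that loop, not Anthropic's code: a stubbed "model under test" occasionally inserts a simulated flaw, a stubbed "overseer" imperfectly flags it, and the harness reports the undetected-sabotage and false-alarm rates. Every name and number here (model_under_test, overseer, run_evaluation, attack_rate, vigilance) is invented for illustration.

```python
"""Minimal sketch of a sabotage-style evaluation loop (illustrative only)."""
from dataclasses import dataclass
import random


@dataclass
class Transcript:
    task: str
    output: str
    sabotaged: bool        # ground truth: did the model insert sabotage?
    flagged: bool = False  # overseer verdict: was the output flagged as suspicious?


def model_under_test(task: str, attack_rate: float = 0.3) -> Transcript:
    """Stand-in for the evaluated model: usually completes the task honestly,
    but sometimes inserts a (simulated) subtle flaw."""
    sabotaged = random.random() < attack_rate
    output = f"answer for {task!r}" + (" [subtle flaw]" if sabotaged else "")
    return Transcript(task=task, output=output, sabotaged=sabotaged)


def overseer(t: Transcript, vigilance: float = 0.7) -> bool:
    """Stand-in for the human or automated monitor: imperfectly detects sabotage."""
    if t.sabotaged:
        return random.random() < vigilance  # true positive with some probability
    return random.random() < 0.05           # occasional false alarm


def run_evaluation(n_tasks: int = 1000) -> dict:
    """Run the model on many tasks, apply oversight, and aggregate the results."""
    transcripts = [model_under_test(f"task-{i}") for i in range(n_tasks)]
    for t in transcripts:
        t.flagged = overseer(t)

    attempted = sum(t.sabotaged for t in transcripts)
    undetected = sum(t.sabotaged and not t.flagged for t in transcripts)
    false_alarms = sum((not t.sabotaged) and t.flagged for t in transcripts)
    return {
        "sabotage_attempts": attempted,
        "undetected_sabotage_rate": undetected / max(attempted, 1),
        "false_alarm_rate": false_alarms / max(n_tasks - attempted, 1),
    }


if __name__ == "__main__":
    print(run_evaluation())
```

The point of such a harness is that the interesting quantity is not whether the model can sabotage at all, but whether it can do so while staying below the overseer's detection threshold, which is the trade-off the evaluation metrics are meant to expose.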

Defining the Threat: Sabotage Capabilities
