TheSequence

The Sequence AI of the Week #721: Stop Blaming Temperature: Fighting Nondeterminism in LLM Inference

Discussing the first publication from Thinking Machines.

Sep 17, 2025
Image created using GPT-5

Large language models should feel deterministic when we ask them to be. Set temperature to zero, fix the seed, and you expect the same bytes back for the same prompt. Yet many practitioners have watched a production endpoint return slightly different completions across requests even under greedy decoding. This isn’t a ghost in the GPU so much as a systems property: modern inference servers perform dynamic batching, kernels make shape‑dependent choices, and tiny numeric nudges can flip a token early in generation and cascade into a visibly different output. A recent essay from Thinking Machines reframes the root cause as batch‑size–dependent numerics and argues that the right fix is not just “turn on deterministic mode” but to make the model’s core kernels batch‑invariant. The result is a practical path to “same input, same output” under real production load.
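
To make the floating-point intuition concrete, here is a minimal, self-contained PyTorch sketch. It is not code from the Thinking Machines essay, and the tensor sizes, chunk count, and logit values are purely illustrative: it only demonstrates the underlying mechanism, namely that summing the same numbers in a different order changes the result slightly, and that a near-tie in the logits can flip under greedy argmax as a consequence.

```python
import torch

# A tiny illustration (not the essay's code) of why batch-dependent numerics
# can change a greedy decode. Floating-point addition is not associative, so
# a kernel that picks a different reduction split at a different batch size
# can produce slightly different values for the same logical row.
torch.manual_seed(0)
row = torch.randn(4096, dtype=torch.float32)

one_shot = row.sum()                                          # one reduction order
chunked = torch.stack([c.sum() for c in row.chunk(8)]).sum()  # a different order

print(f"reduction-order difference: {(one_shot - chunked).item():e}")  # tiny, often nonzero

# If two logits are nearly tied, a difference of that magnitude is enough to
# flip the argmax, and greedy decoding then diverges from that token onward.
logits = torch.tensor([10.000001, 10.000002])   # illustrative near-tie
nudged = logits + torch.tensor([2e-6, 0.0])     # hypothetical numeric nudge
print(torch.argmax(logits).item(), torch.argmax(nudged).item())  # prints 1 then 0
```

Batch-invariant kernels, as the essay argues, pin down the reduction strategy so that the order of operations, and therefore the resulting bits, no longer depends on who else happens to share the batch.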

Let’s dive in.
