The Sequence AI of the Week #721: Stop Blaming Temperature: Fighting Nondeterminism in LLM Inference
Discussing the first publication from Thinking Machines.
Large language models should feel deterministic when we ask them to be. Set temperature to zero, fix the seed, and you expect the same bytes back for the same prompt. Yet many practitioners have watched a production endpoint return slightly different completions across requests even under greedy decoding. This isn’t a ghost in the GPU so much as a systems property: modern inference servers perform dynamic batching, kernels make shape‑dependent choices, and tiny numeric nudges can flip a token early in generation and cascade into a visibly different output. A recent essay from Thinking Machines reframes the root cause as batch‑size–dependent numerics and argues that the right fix is not just “turn on deterministic mode” but to make the model’s core kernels batch‑invariant. The result is a practical path to “same input, same output” under real production load.
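To see why "tiny numeric nudges" appear at all, it helps to remember that floating-point addition is not associative: change the order in which a kernel reduces its partial sums and the result moves by a few last bits. Fast GPU kernels pick that order based on shape and batch size, which is exactly the batch-invariance the essay argues we need to pin down. The snippet below is a minimal illustration of the underlying effect, not code from the essay; the array size and chunking are arbitrary choices for the demo.

```python
import numpy as np

# Floating-point addition is not associative, so the *order* of a reduction
# changes the result by a few ULPs. Fast kernels change that order with
# batch size and shape; batch-invariant kernels keep it fixed.
rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)

# "Small batch" style path: one long sequential accumulation.
seq_sum = np.float32(0.0)
for v in x:
    seq_sum += v

# "Large batch" style path: split the reduction into chunks (think per-tile
# partial sums) and combine the partials afterwards.
chunked_sum = x.reshape(32, 128).sum(axis=1).sum()

# The two sums typically disagree in the last bits, even though the inputs
# and the mathematical result are identical.
print(seq_sum, chunked_sum, bool(seq_sum == chunked_sum))
```

In an inference server the same effect shows up inside matmuls and attention: your request's rows are reduced with whatever tiling the current batch calls for, so the exact logits depend on concurrent load, and when two candidate tokens are nearly tied, even greedy decoding can flip.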
Let’s dive in.