The Sequence Knowledge #886: Demystifying Model Distillation
Understanding the key principles of distillation in simple terms.
The simplest way to understand knowledge distillation is to imagine a very expensive teacher and a very cheap student.
The teacher is a large model: smart, slow, high-capacity, expensive to run. The student is smaller: faster, cheaper, easier to deploy, but usually less capable if trained in the standard way. Distillation asks a very practical question:
Can the student learn not only from the original dataset, but from the teacher’s behavior?
In other words, instead of training the small model only on reality, we also train it on reality as interpreted by the big model.
That sentence is the whole trick.
A traditional training setup looks like this:
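As a rough sketch rather than a definitive recipe, the traditional setup and the one change distillation makes to it can be written in a few lines of plain Python. Everything named here is an assumption made for illustration: the three-class problem, the logit values, and the helper names `softmax` and `cross_entropy` are hypothetical, and a real pipeline would use a deep learning framework and usually a blend of both losses.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(
        t * math.log(p)
        for t, p in zip(target_probs, predicted_probs)
        if t > 0  # zero-weight targets contribute nothing
    )


# Hypothetical outputs for a single training example with three classes.
student_logits = [2.0, 1.0, 0.1]
teacher_logits = [3.0, 2.5, -1.0]

student_probs = softmax(student_logits)

# Traditional training: the target is the dataset label ("reality"),
# encoded as a one-hot vector. Here the true class is assumed to be 0.
hard_label = [1.0, 0.0, 0.0]
hard_loss = cross_entropy(hard_label, student_probs)

# Distillation: the target is the teacher's full output distribution
# ("reality as interpreted by the big model").
teacher_probs = softmax(teacher_logits)
soft_loss = cross_entropy(teacher_probs, student_probs)

print(f"loss against the dataset label: {hard_loss:.3f}")
print(f"loss against the teacher:       {soft_loss:.3f}")
```

The student and the loss function are identical in both cases. The only thing that changes is the target: a one-hot label in the traditional setup, the teacher's probabilities in the distilled one.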

