Edge 455: Building Smaller Foundation Models Using Graph-Based Distillation
Diving into one of the most sophisticated distillation methods in the gen AI space.
In this issue:
An overview of graph-based distillation.
A survey of the main graph-based distillation methods.
The new Hugging Face AutoTrain framework for low-code training of foundation models.
💡 ML Concept of the Day: Understanding Graph-Based Distillation
Throughout this series, we have focused on traditional teacher-student distillation methods that operate on individual data units, such as matching output probabilities or feature transformations between the teacher network (TN) and the student network (SN). While unquestionably effective, these methods often overlook the relationships between data points, a critical factor in helping SNs develop effective data embeddings.
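To make the distinction concrete, below is a minimal PyTorch-style sketch, not a reference implementation, that contrasts classic per-sample soft-label distillation with a simple relational loss matching pairwise similarity graphs over a batch. The function names and tensor shapes are hypothetical assumptions for illustration; they stand in for whatever logits and feature embeddings your TN and SN actually expose.

```python
import torch
import torch.nn.functional as F


def soft_label_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Classic per-sample KD: match the softened output distributions."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)


def relational_kd_loss(student_feats, teacher_feats):
    """Relational KD sketch: match pairwise cosine-similarity graphs built over
    the batch, so the SN learns how samples relate to each other rather than
    only how each sample should be predicted in isolation."""
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    # (batch, batch) similarity matrices act as dense relation graphs over the batch
    s_graph = s @ s.t()
    t_graph = t @ t.t()
    return F.mse_loss(s_graph, t_graph)


# Example usage with random tensors standing in for real TN/SN outputs
s_logits, t_logits = torch.randn(8, 10), torch.randn(8, 10)
s_feats, t_feats = torch.randn(8, 256), torch.randn(8, 256)
loss = soft_label_kd_loss(s_logits, t_logits) + relational_kd_loss(s_feats, t_feats)
```

The relational term transfers structure between data points, which is the kind of knowledge that graph-based methods generalize by deriving the relation graph from attention rather than from a fixed similarity function.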
Graph-based knowledge distillation (GKD) is a cutting-edge technique designed to enhance the performance of small student networks by transferring relational knowledge from a larger teacher network. The key idea is to use attention networks, particularly multi-head attention (MHA), to build a graph representation that captures the relationships between feature vectors. Here’s how it works: