Knowledge Distillation (KD) is a model compression and transfer
learning technique that enables small models (students) to learn from
large models (teachers), maintaining performance close to teacher models
while significantly reducing parameters and computation. Hinton et al.'s
seminal 2015 paper "Distilling the Knowledge in a Neural Network"
sparked a research wave in this field. But knowledge distillation is far
more than simple "soft label" training — it involves temperature
parameter tuning, extracting knowledge at different levels, matching
student-teacher architectures, and numerous technical details.
This article derives the mathematical foundations of knowledge
distillation from first principles, explains why soft labels contain
more information than hard labels, details implementation of
response-based, feature-based, and relation-based distillation,
introduces methods like self-distillation, mutual learning, and online
distillation that don't require pre-trained teachers, and explores
synergistic optimization of quantization, pruning, and distillation.
We'll see that distillation is essentially "compression encoding" of
knowledge — explicitly transferring dark knowledge implicitly learned by
teacher models to student models.
Motivation for Knowledge Distillation
From Model Compression to Knowledge Transfer
Deep neural networks typically require massive parameters to achieve
optimal performance. However, large models face numerous deployment
challenges:
Mobile Deployment: Phones and IoT devices have
limited memory and computation and cannot run models with billions of
parameters
Inference Latency: Real-time systems like
autonomous driving and industrial control require millisecond-level
response
Energy Constraints: Edge devices need long battery
life, large models consume too much power
Traditional model compression methods (pruning, quantization,
low-rank factorization) directly manipulate model structure or
parameters, often causing significant performance degradation. Knowledge
distillation's core idea is to have small models learn large
models' output distributions, not simply fit hard labels.
Dark Knowledge: Information Advantage of Soft Labels
Consider an image classification task where the true label is "cat"
(the hard label is a one-hot vector such as $[1, 0, 0, 0]$ over the
classes cat, tiger, dog, car). A trained teacher model might output a
probability distribution like $[0.85, 0.10, 0.04, 0.01]$ (illustrative
values). Although the teacher predicts the highest probability for
"cat", the other class probabilities also contain valuable information:
High probability for "tiger": indicates this cat
shares visual similarities with tigers (body shape, patterns, etc.)
Low but non-zero probability for "dog": shows cats
and dogs have common features (furry, four legs)
Extremely low probability for "car": indicates cats
and cars have completely different visual features
These non-zero "error" probabilities are what Hinton calls
dark knowledge— they reveal similarity structure
between classes and embody the teacher model's generalization ability
learned during training.
From an information theory perspective, hard labels have entropy of 0
(deterministic one-hot vectors), while soft labels have higher entropy:

$$H(p) = -\sum_i p_i \log p_i > 0$$

Soft labels provide richer supervisory signals, helping student models
learn relationships between classes.
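The entropy gap can be checked in a few lines (the soft distribution below is illustrative, matching the cat/tiger/dog/car example above):

```python
import math

def entropy(p):
    """Shannon entropy in nats; the 0 * log 0 terms are taken as 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

hard = [1.0, 0.0, 0.0, 0.0]       # one-hot "cat"
soft = [0.85, 0.10, 0.04, 0.01]   # illustrative teacher output

print(entropy(hard))   # 0.0: hard labels carry no class-similarity information
print(entropy(soft))   # ~0.54 nats: the dark knowledge lives in this extra entropy
```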
Mathematical Perspective on Distillation: Distribution Matching
Let the teacher parameters be $\theta_t$, the student parameters be
$\theta_s$, the input be $x$, and the output logits be
$z^t = f(x; \theta_t)$ and $z^s = f(x; \theta_s)$. Standard
classification training minimizes cross-entropy:

$$\mathcal{L}_{CE} = -\sum_i y_i \log \sigma_i(z^s)$$

where $y$ is the hard label (one-hot) and $\sigma$ is the softmax
function.
Knowledge distillation has students match teachers' output
distributions:

$$\mathcal{L}_{KD} = -\sum_i \sigma_i(z^t) \log \sigma_i(z^s)$$

This is the cross-entropy of two distributions, equivalent to
minimizing the KL divergence (since $H(\sigma(z^t))$ is constant):

$$D_{KL}\big(\sigma(z^t) \,\|\, \sigma(z^s)\big) = \mathcal{L}_{KD} - H\big(\sigma(z^t)\big)$$

From an optimization perspective, distillation makes the student's
output distribution $\sigma(z^s)$ approximate the teacher's.
Temperature Parameter: Softening Probability Distributions
The problem with using softmax outputs directly is that probability
distributions are often too "peaked": the maximum class probability
approaches 1 and the other class probabilities approach 0, suppressing
dark knowledge.
Hinton introduced the temperature parameter $T$ to soften
distributions:

$$p_i^{(T)} = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

When $T > 1$, the probability distribution becomes smoother:

- $T \to \infty$: All class probabilities approach the uniform distribution
- $T = 1$: Standard softmax
- $T \to 0$: Distribution degenerates to one-hot (argmax)

Intuitive Example: Consider logits $z = [10, 5, 1]$ (illustrative
values):

- $T = 1$: $p \approx [0.993, 0.007, 0.0001]$ (class 3 information almost lost)
- $T = 5$: $p \approx [0.652, 0.240, 0.108]$ (class 3 information preserved)
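The softening effect is easy to verify numerically (the logits here are illustrative, not from any trained model):

```python
import torch
import torch.nn.functional as F

z = torch.tensor([10.0, 5.0, 1.0])   # illustrative logits

for T in (1.0, 5.0, 100.0):
    p = F.softmax(z / T, dim=0)      # temperature-scaled softmax
    print(T, [round(v, 3) for v in p.tolist()])
# T=1   -> [0.993, 0.007, 0.0]    classes 2-3 nearly invisible
# T=5   -> [0.652, 0.24, 0.108]   order kept, dark knowledge visible
# T=100 -> [0.349, 0.332, 0.319]  approaching uniform
```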
Distillation loss at temperature $T$ is defined as:

$$\mathcal{L}_{KD}(T) = D_{KL}\big(p^{t,(T)} \,\|\, p^{s,(T)}\big)$$

where both $p^{t,(T)}$ and $p^{s,(T)}$ are computed with temperature
$T$.
Theoretical Derivation: Why are gradients more
stable at high temperatures?
Taking the derivative with respect to the student logit $z_i^s$
(omitting the normalization term):

$$\frac{\partial \mathcal{L}_{KD}}{\partial z_i^s} = \frac{1}{T}\big(p_i^{s,(T)} - p_i^{t,(T)}\big)$$

As $T$ increases, the explicit gradient factor scales by $1/T$. But
since the probabilities themselves also flatten with $T$, the final
gradient scaling factor is $1/T^2$ (see the Hinton paper appendix).
Therefore, the distillation loss needs to be multiplied by $T^2$ to
balance gradient scales during training:

$$\mathcal{L} = \alpha \, T^2 \, \mathcal{L}_{KD}(T) + (1 - \alpha) \, \mathcal{L}_{CE}$$

where $\alpha$ is the balance coefficient and $\mathcal{L}_{CE}$ is the
standard cross-entropy loss on hard labels.
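Putting the pieces together, a minimal PyTorch sketch of this combined loss might look as follows (T=4 and alpha=0.7 are illustrative defaults, not values prescribed by the text):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """L = alpha * T^2 * KL(p_t^(T) || p_s^(T)) + (1 - alpha) * CE(y, p_s).
    The T^2 factor compensates the 1/T^2 gradient scaling derived above.
    T and alpha here are illustrative, not prescribed values."""
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    kd = F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# toy batch: 8 samples, 10 classes
s = torch.randn(8, 10, requires_grad=True)
t = torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
loss = distillation_loss(s, t, y)
loss.backward()   # gradients flow into the student logits only
```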
Response-Based Distillation: Knowledge Transfer at the Output Layer
Response-based distillation is the most classic distillation method,
using only the model's final layer outputs (logits or probabilities) for
knowledge transfer.
ResNet-34 teacher (73.3% accuracy) distilled to ResNet-18
student
Direct training ResNet-18: 69.8%
Distilled training ResNet-18: 71.4%
1.6% improvement, but still 1.9% gap
Why the Temperature Parameter Works: Information-Theoretic Analysis
From an information theory perspective, temperature $T$ controls the
soft labels' information content. Define the entropy of the softened
distribution:

$$H\big(p^{(T)}\big) = -\sum_i p_i^{(T)} \log p_i^{(T)}$$

It can be proven that $H(p^{(T)})$ increases monotonically with $T$.
Higher temperature means higher entropy: more uncertainty and richer
information.
Specifically, when $T$ is large, the softmax can be Taylor expanded:

$$p_i^{(T)} = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} \approx \frac{1 + z_i / T}{N + \sum_j z_j / T}$$

where $N$ is the number of classes. If the logits are zero-centered
($\sum_j z_j = 0$), then:

$$p_i^{(T)} \approx \frac{1}{N}\Big(1 + \frac{z_i}{T}\Big)$$

This shows that at high temperatures, relative differences in softmax
outputs directly reflect relative differences in logits, unaffected by
the exp function's non-linearity. Student models can more accurately
learn the relative relationships between classes.
Connection Between Distillation and Label Smoothing
Label smoothing is a regularization technique that replaces the hard
label $y$ with:

$$y' = (1 - \epsilon)\, y + \frac{\epsilon}{N}$$

where $\epsilon$ is the smoothing coefficient (typically 0.1) and $N$
is the number of classes.
It can be proven that knowledge distillation is
data-dependent label smoothing in a sense:
Label smoothing uses the same smoothed distribution (uniform) for
all samples
Knowledge distillation uses different smoothed distributions
(teacher's outputs) for each sample
Experiments show distillation typically outperforms label smoothing
because teacher output distributions contain sample-specific information
(e.g., some cat images look more like tigers).
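The distinction can be made concrete in a few lines of PyTorch (the teacher logits below are illustrative, chosen to mimic a tiger-like and a dog-like cat image):

```python
import torch
import torch.nn.functional as F

num_classes, eps = 4, 0.1

# Label smoothing: one fixed target shared by every "cat" sample.
hard = F.one_hot(torch.tensor(0), num_classes).float()
smoothed = (1 - eps) * hard + eps / num_classes
# -> tensor([0.9250, 0.0250, 0.0250, 0.0250])

# Distillation: a per-sample target from the teacher (logits illustrative).
teacher_logits = torch.tensor([[4.0, 2.5, 1.0, -2.0],    # cat that looks tiger-ish
                               [4.0, 0.5, 2.0, -2.0]])   # cat that looks dog-ish
soft = F.softmax(teacher_logits / 2.0, dim=1)            # softened at T = 2
# rows differ: the teacher encodes sample-specific similarity structure
```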
Layer-wise Distillation: Multi-stage Knowledge Transfer
For very deep networks (like ResNet-152), distillation can be
decomposed into multiple stages:
Shallow Layer Distillation: Distill teacher's first
few layers to student's first few layers
Deep Layer Distillation: Distill teacher's last few
layers to student's last few layers
The loss function becomes a multi-term sum:

$$\mathcal{L} = \sum_k \lambda_k \, \mathcal{L}_k$$

where $\mathcal{L}_k$ is the loss at the $k$-th distillation point and
$\lambda_k$ is its weight.
Advantages: Finer-grained knowledge transfer,
suitable when teacher and student architectures differ
significantly.
Disadvantages: Requires manual design of
distillation point locations and weights, expanding hyperparameter
space.
Feature-Based Distillation: Knowledge Transfer at Intermediate Layers
Feature-based distillation utilizes not only output layers but also
intermediate layer feature maps for knowledge transfer.
FitNets: Hint Learning
FitNets (Hints for Thin Deep Nets) was one of the earliest feature
distillation methods, proposed by Romero et al. in 2015.
Core Idea: Have student's intermediate layer
features match teacher's intermediate layer features.
Let the teacher's features at layer $l_t$ be $F^t$ and the student's
features at layer $l_s$ be $F^s$. Since dimensions may differ,
introduce a learnable projection layer $r(\cdot)$:

$$\mathcal{L}_{hint} = \big\| F^t - r(F^s) \big\|_F^2$$

where $\|\cdot\|_F$ is the Frobenius norm.
Training Strategy (Two-stage):
Stage 1: Freeze the teacher, train only the student's
first $g$ layers and the projection layer, minimizing
$\mathcal{L}_{hint}$
Stage 2: Train the entire student with standard
response-based distillation
Shallow Hints: Student learns low-level features
(edges, textures), suitable when student is very small
Deep Hints: Student learns high-level semantic
features, suitable when student capacity approaches teacher
Experiments find single hint layers have limited effectiveness,
multiple hint layers work better (but increase computational cost).
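A sketch of the hint loss might look like this (assuming conv feature maps with matching spatial size; the 1x1-conv regressor is one common choice for the projection, not mandated by the text):

```python
import torch
import torch.nn as nn

class HintLoss(nn.Module):
    """FitNets-style hint: project student features to the teacher's width
    with a learnable regressor, then match via MSE (the Frobenius loss).
    A 1x1 conv is a common regressor when both features are conv maps of
    the same spatial size."""
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        self.regressor = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, f_s, f_t):
        return nn.functional.mse_loss(self.regressor(f_s), f_t)

# thin student (32 channels) vs wide teacher (128 channels), 8x8 spatial
hint = HintLoss(32, 128)
f_s, f_t = torch.randn(2, 32, 8, 8), torch.randn(2, 128, 8, 8)
loss = hint(f_s, f_t)   # stage 1 trains the student's early layers + regressor
```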
Attention Transfer: Attention Map Distillation
Zagoruyko and Komodakis proposed Attention Transfer (AT) in 2017,
using activation statistics of feature maps as "attention maps" for
distillation.
Activation-based Attention:
For a feature map $F \in \mathbb{R}^{C \times H \times W}$, define the
attention map:

$$A = \sum_{c=1}^{C} |F_c|^p$$

where $p$ is typically 1 or 2. $A \in \mathbb{R}^{H \times W}$
represents the activation intensity at each spatial location.
The loss function is:

$$\mathcal{L}_{AT} = \left\| \frac{\mathrm{vec}(A^s)}{\|\mathrm{vec}(A^s)\|_2} - \frac{\mathrm{vec}(A^t)}{\|\mathrm{vec}(A^t)\|_2} \right\|_2$$

Normalization ensures scale invariance.
Gradient-based Attention:
Besides activations, gradients can also serve as attention:

$$G = \left| \frac{\partial \mathcal{L}}{\partial F} \right|$$

Gradient attention reflects which locations contribute most to the
loss, capturing the model's decision process.
Multi-layer Attention Transfer:

$$\mathcal{L} = \mathcal{L}_{CE} + \beta \sum_{l \in \mathcal{I}} \mathcal{L}_{AT}^{(l)}$$

where $\mathcal{I}$ is the selected layer set and $\beta$ is the
weight.
Experimental Results (CIFAR-10):
ResNet-110 teacher (93.5%) → ResNet-20 student
Baseline ResNet-20: 91.3%
Response-based distillation: 91.8%
Attention transfer: 92.4%
Attention distillation provides 0.6% improvement over response-based
distillation, demonstrating intermediate layer knowledge transfer
effectiveness.
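A minimal implementation of activation-based attention transfer might look like this (shapes are illustrative; note the spatial sizes must match while channel counts may differ):

```python
import torch
import torch.nn.functional as F

def attention_map(feat, p=2):
    """Sum of |F_c|^p over channels, flattened and L2-normalized."""
    a = feat.abs().pow(p).sum(dim=1)        # (B, H, W)
    return F.normalize(a.flatten(1), dim=1) # (B, H*W), unit norm

def at_loss(f_s, f_t, p=2):
    """L2 distance between normalized attention maps, batch-averaged."""
    return (attention_map(f_s, p) - attention_map(f_t, p)).norm(dim=1).mean()

# spatial sizes must match; channel counts may differ (64 vs 256 here)
f_s, f_t = torch.randn(4, 64, 8, 8), torch.randn(4, 256, 8, 8)
loss = at_loss(f_s, f_t)
```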
PKT: Probabilistic Knowledge Transfer
Passalis and Tefas proposed Probabilistic Knowledge Transfer (PKT) in
2018, matching statistical properties of feature distributions rather
than individual sample features.
Core Idea: Represent knowledge using pairwise sample
similarities.
For a batch of $B$ samples, compute the feature similarity matrix:

$$K_{ij} = k(f_i, f_j)$$

where $k(\cdot, \cdot)$ is a kernel function (like the Gaussian
kernel). The similarities are turned into conditional probabilities
$p_{j|i}$, and the loss is the KL divergence between the teacher's and
student's distributions:

$$\mathcal{L}_{PKT} = \sum_i \sum_{j \neq i} p_{j|i}^t \, \log \frac{p_{j|i}^t}{p_{j|i}^s}$$

This matches the relational structure between sample pairs rather than
individual sample feature values.
Advantages:
Insensitive to feature dimension differences (no projection layer
needed)
Captures semantic relationships between samples
Disadvantages:
Computational complexity is $O(B^2)$ in the batch size $B$, so the
batch size cannot be too large
Requires reasonable selection of kernel function and bandwidth
NST: Neural Style Transfer-Inspired Distillation
Huang and Wang proposed NST in 2017, inspired by Neural Style Transfer,
using Gram matrices for feature distillation.
For a feature map $F \in \mathbb{R}^{C \times H \times W}$, reshape it
to $F \in \mathbb{R}^{C \times N}$ (where $N = HW$), and define the
Gram matrix:

$$G = \frac{1}{N} F F^\top \in \mathbb{R}^{C \times C}$$

The Gram matrix element $G_{ij}$ represents the correlation between
channel $i$ and channel $j$. The loss function is:

$$\mathcal{L}_{NST} = \big\| G^t - G^s \big\|_F^2$$
Intuition: Gram matrices capture second-order
statistics (covariance) of features, reflecting relationships between
different channels (e.g., co-occurrence patterns of "edge detector" and
"texture detector").
Experiments: On CIFAR-100, NST shows further
improvement over FitNets and AT (about 0.5%-1%).
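A Gram-matrix loss along these lines can be sketched as follows (channel counts must match here; otherwise a projection layer, omitted for brevity, would be needed):

```python
import torch

def gram(feat):
    """Channel Gram matrix G = F F^T / N for feat of shape (B, C, H, W)."""
    b, c, h, w = feat.shape
    f = feat.flatten(2)                            # (B, C, N) with N = H*W
    return torch.bmm(f, f.transpose(1, 2)) / (h * w)

def gram_loss(f_s, f_t):
    """Squared Frobenius distance between Gram matrices, batch-averaged."""
    return (gram(f_s) - gram(f_t)).pow(2).sum(dim=(1, 2)).mean()

# matching channel counts assumed (16 vs 16); shapes are illustrative
f_s, f_t = torch.randn(2, 16, 8, 8), torch.randn(2, 16, 8, 8)
loss = gram_loss(f_s, f_t)
```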
Relation-Based Distillation: Knowledge Transfer Between Samples
Relation-based distillation considers not only individual sample
outputs or features, but also relationships between samples.
RKD: Relational Knowledge Distillation
Park et al. proposed Relational Knowledge Distillation (RKD) in 2019,
defining two types of relations:
Distance-wise Relation:
For a pair of samples $(x_i, x_j)$ with embeddings $(f_i, f_j)$, define
the normalized Euclidean distance:

$$\psi_D(f_i, f_j) = \frac{1}{\mu} \, \| f_i - f_j \|_2$$

where $\mu$ is the normalization factor (the mean pairwise distance in
the batch).
The loss function is:

$$\mathcal{L}_{RKD\text{-}D} = \sum_{(i,j) \in \mathcal{P}} \ell_\delta\big(\psi_D(f_i^t, f_j^t), \, \psi_D(f_i^s, f_j^s)\big)$$

where $\mathcal{P}$ is the sampled pair set and $\ell_\delta$ is the
Huber loss.
Angle-wise Relation:
For a triplet $(x_i, x_j, x_k)$, define the vector angle:

$$\psi_A(f_i, f_j, f_k) = \cos \angle f_i f_j f_k = \big\langle e^{ij}, e^{kj} \big\rangle$$

where $e^{ij} = \frac{f_i - f_j}{\|f_i - f_j\|_2}$ and
$e^{kj} = \frac{f_k - f_j}{\|f_k - f_j\|_2}$.
The loss function is:

$$\mathcal{L}_{RKD\text{-}A} = \sum_{(i,j,k)} \ell_\delta\big(\psi_A(f_i^t, f_j^t, f_k^t), \, \psi_A(f_i^s, f_j^s, f_k^s)\big)$$
Intuition:
Distance relation ensures relative distances between sample pairs
remain consistent (e.g., "cat" and "dog" distance is smaller than "cat"
and "car" distance)
Angle relation ensures relative positional relationships of samples
(e.g., "Persian cat" direction relative to "cat" and "dog")
Total Loss:

$$\mathcal{L}_{RKD} = \lambda_D \, \mathcal{L}_{RKD\text{-}D} + \lambda_A \, \mathcal{L}_{RKD\text{-}A}$$

Experiments show angle relations are more important than distance
relations (the best settings use $\lambda_A > \lambda_D$).
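The distance-wise half can be sketched as follows (using the Huber loss as in RKD, with the mean pairwise distance as the normalizer; embedding sizes are illustrative):

```python
import torch
import torch.nn.functional as F

def pairwise_dist(f):
    """Pairwise Euclidean distances, normalized by their mean (the mu above).
    f: (B, D) embeddings."""
    d = torch.cdist(f, f)            # (B, B), zero diagonal
    return d / d[d > 0].mean()

def rkd_distance_loss(f_s, f_t):
    """Huber loss between normalized student and teacher distance matrices."""
    return F.smooth_l1_loss(pairwise_dist(f_s), pairwise_dist(f_t))

# embedding dimensions may differ, since only distances are compared
f_s, f_t = torch.randn(8, 64), torch.randn(8, 128)
loss = rkd_distance_loss(f_s, f_t)
```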
CRD: Contrastive Representation Distillation
Tian et al. proposed Contrastive Representation Distillation (CRD) in
2020, introducing contrastive learning into the distillation
framework.
Core Idea: Use contrastive learning to maximize
mutual information between student and teacher features.
For a positive pair $(f^t, f^s)$ (the same sample's representations in
teacher and student) and $N$ negative samples $\{f_j^s\}$ (other
samples' representations), define the InfoNCE loss:

$$\mathcal{L}_{CRD} = -\log \frac{\exp(f^t \cdot f^s / \tau)}{\exp(f^t \cdot f^s / \tau) + \sum_{j=1}^{N} \exp(f^t \cdot f_j^s / \tau)}$$

where $\tau$ is a temperature hyperparameter.
Key Difference:
Traditional distillation: Use MSE or KL divergence to match
features
CRD: Use contrastive learning to match features, more focused on
inter-sample discrimination
Experimental Results (CIFAR-100):
ResNet-32x4 teacher (79.4%) → ResNet-8x4 student
Response-based distillation: 73.3%
CRD: 75.5%
CRD is particularly effective on small student models (2%+
improvement).
SP: Similarity-Preserving Distillation
Tung and Mori proposed Similarity-Preserving Distillation (SP) in
2019, requiring student feature similarity matrices to match
teacher's.
For a batch of $B$ samples, define the similarity matrix:

$$G = \tilde{F} \tilde{F}^\top \in \mathbb{R}^{B \times B}$$

where $\tilde{F} \in \mathbb{R}^{B \times D}$ holds the flattened,
row-normalized features. The loss function is:

$$\mathcal{L}_{SP} = \frac{1}{B^2} \big\| G^s - G^t \big\|_F^2$$

Difference from PKT: SP uses cosine similarity, PKT uses kernel
similarity.
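A compact sketch of the SP loss (feature shapes are illustrative; normalizing the flattened features makes the matrix entries cosine similarities):

```python
import torch
import torch.nn.functional as F

def sp_matrix(feat):
    """B x B matrix of cosine similarities between the batch's features."""
    f = F.normalize(feat.flatten(1), dim=1)   # (B, D), unit-norm rows
    return f @ f.t()

def sp_loss(f_s, f_t):
    """Frobenius distance between similarity matrices, scaled by 1/B^2."""
    b = f_s.size(0)
    return (sp_matrix(f_s) - sp_matrix(f_t)).pow(2).sum() / (b * b)

# teacher and student feature shapes may differ freely
f_s, f_t = torch.randn(8, 32, 4, 4), torch.randn(8, 128, 4, 4)
loss = sp_loss(f_s, f_t)
```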
Self-Distillation: Teacher-Free Knowledge Transfer
Self-distillation is a distillation method that doesn't require
pre-trained teachers, where models learn knowledge from their own
earlier versions or different branches.
Born-Again Networks: Iterative Self-Distillation
Furlanello et al. proposed Born-Again Networks (BAN) in 2018,
improving model performance through iterative distillation.
Algorithm Workflow:
1. Train 1st Generation Model: Standard training yields $M_1$
2. Train 2nd Generation Model: Use $M_1$ as teacher to distill $M_2$ ($M_1$ and $M_2$ have the same architecture)
3. Train 3rd Generation Model: Use $M_2$ as teacher to distill $M_3$
4. Repeat until performance saturates
Surprising Finding: Even when teacher and student
have identical architectures, distillation still improves
performance!
Iterative distillation is an implicit form of ensemble learning
Each generation explores different regions of the loss
landscape
Experiments (CIFAR-100):
1st generation DenseNet: 74.3%
2nd generation (BAN): 75.2%
3rd generation: 75.4%
4th generation: 75.5% (saturation)
Deep Mutual Learning (DML)
Zhang et al. proposed Deep Mutual Learning (DML) in 2018, training
multiple student models simultaneously with mutual supervision.
Algorithm Workflow:
For $K$ student models, model $k$'s loss contains two parts:

$$\mathcal{L}_k = \mathcal{L}_{CE}(y, p_k) + \frac{1}{K - 1} \sum_{l \neq k} D_{KL}\big(p_l \,\|\, p_k\big)$$

where $p_k$ is model $k$'s output distribution.
Key Features:
No pre-trained teacher needed: All models train
from scratch
Symmetry: Each model is both student and
teacher
Online learning: Models learn each other's
knowledge in real-time
Theoretical Intuition:
Each model makes different errors during training
Mutual learning helps models avoid each other's errors, similar to
ensemble learning
Eventually each model performs better than training alone
Experiments (CIFAR-100):
Single ResNet-32 training: 70.2%
2 ResNet-32s mutual learning: 72.1%
4 ResNet-32s mutual learning: 72.8%
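One DML step for a single model can be sketched as follows (peer outputs are detached, so each model treats the others as fixed teachers within the step; shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def dml_loss(logits_k, peer_logits, labels):
    """Loss for one model in Deep Mutual Learning: cross-entropy on hard
    labels plus the mean KL divergence toward each peer's distribution."""
    ce = F.cross_entropy(logits_k, labels)
    log_p_k = F.log_softmax(logits_k, dim=1)
    kl = sum(F.kl_div(log_p_k, F.softmax(z.detach(), dim=1),
                      reduction="batchmean")
             for z in peer_logits) / len(peer_logits)
    return ce + kl

# toy logits for K = 3 peer networks; each gets its own loss per step
z1, z2, z3 = (torch.randn(8, 10, requires_grad=True) for _ in range(3))
y = torch.randint(0, 10, (8,))
loss_1 = dml_loss(z1, [z2, z3], y)   # symmetrically for z2 and z3
```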
Online Distillation
Online distillation aggregates knowledge from multiple student models
into a virtual teacher, avoiding pre-trained teacher overhead.
ONE (On-the-Fly Native Ensemble):
Lan et al. proposed ONE in 2018, using a gated weighted average of
multiple branches as the teacher:

$$p^e = \sum_{m=1}^{M} g_m \, p_m$$

Each branch's loss is:

$$\mathcal{L}_m = \mathcal{L}_{CE}(y, p_m) + T^2 \, D_{KL}\big(p^e \,\|\, p_m\big)$$
KDCL (Knowledge Distillation via Collaborative Learning):
Song and Chai proposed this in 2018; in addition to branch-wise
distillation, it also distills at different depths:

$$\mathcal{L} = \sum_{d \in \mathcal{D}} \mathcal{L}_{KD}^{(d)}$$

where $\mathcal{D}$ is the selected depth set (e.g., every 4 layers).
Advantages:
Single training pass completes, saves time
Can use any branch or ensemble multiple branches finally
Synergistic Distillation with Quantization and Pruning
Knowledge distillation is often combined with quantization, pruning
and other compression techniques to achieve higher compression
ratios.
Quantization-Aware Distillation
Quantization maps floating-point parameters to low-bit integers (like
8-bit or 4-bit), but causes accuracy degradation. Distillation can
alleviate this problem.
Algorithm Workflow:
Train Full-Precision Teacher: Standard FP32
training
Quantized Student Initialization: Quantize teacher
parameters to INT8 as student initialization
Distillation Fine-Tuning: Fine-tune quantized
student using teacher's soft labels
The loss function is:

$$\mathcal{L} = \alpha \, T^2 \, D_{KL}\big(p^{t,(T)} \,\|\, \hat{p}^{s,(T)}\big) + (1 - \alpha)\, \mathcal{L}_{CE}$$

where $\hat{p}^s$ is the quantized student's output.
Quantization Details:
For a weight $w$, the quantization formula is:

$$\hat{w} = s \cdot \mathrm{round}(w / s)$$

where $s$ is the scaling factor:

$$s = \frac{\max |w|}{2^{b-1} - 1}$$

and $b$ is the number of bits.
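This symmetric quantization scheme is a few lines of PyTorch (simulated, "fake" quantization on float tensors, as used during quantization-aware fine-tuning):

```python
import torch

def quantize_symmetric(w, bits=8):
    """Symmetric uniform quantization following the formula above:
    s = max|w| / (2^(b-1) - 1), w_hat = s * round(w / s)."""
    qmax = 2 ** (bits - 1) - 1                # 127 for INT8
    s = w.abs().max() / qmax
    return s * torch.clamp(torch.round(w / s), -qmax, qmax)

w = torch.randn(256)
w8 = quantize_symmetric(w, bits=8)
w4 = quantize_symmetric(w, bits=4)
# quantization error is bounded by s/2, so it grows sharply at low bit-widths
```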
Experiments (ResNet-18 on ImageNet):
FP32 baseline: 69.8%
INT8 quantization (no distillation): 68.5% (-1.3%)
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
import torchvision
import torchvision.transforms as transforms
import numpy as np
from typing import List, Tuple, Dict
import copy

# ============== Distillation Loss Functions ==============
# Born-Again training: iteratively train multiple generations of models,
# each using the previous generation as teacher.
```
Comprehensive Q&A
Q1: How to choose the temperature parameter $T$?
A: Temperature parameter selection depends on the task
and data:
Classification tasks: typically $T \in [2, 10]$; $T = 4$
is a common starting point
Regression tasks: Temperature has less effect; it can
be skipped or set to a lower value
Tuning strategy: Grid search on the validation set,
e.g., over $T \in \{1, 2, 4, 8, 16\}$
Principle: Temperature needs to balance two factors:
- $T$ too small: Soft labels degenerate to hard labels, dark knowledge is lost
- $T$ too large: All class probabilities approach uniform, the signal weakens
Empirically, more classes and higher class similarity require higher
temperature.
Q2: How to set the balance coefficient $\alpha$?
A: $\alpha$ controls the weight between the
distillation loss and the classification loss:
High $\alpha$ (like 0.9): More reliance on teacher
knowledge, suitable when the teacher is strong and data is scarce
Low $\alpha$ (like 0.5): More reliance on hard labels,
suitable when teacher and student capacity are similar
Tuning Advice:
- If student capacity is below 1/10 of the teacher's: use a high $\alpha$ (around 0.9)
- If student capacity is 1/4-1/2 of the teacher's: use a moderate $\alpha$ (around 0.5)
- If using multiple combined distillation losses: lower $\alpha$ to 0.5-0.7
Q3: How small should
the student model be?
A: Student capacity depends on deployment
constraints and performance requirements:
Mobile: Typically compress to 1/10 of teacher
parameters, accept 2-5% accuracy loss
Edge devices: Compress to 1/20-1/50, may lose
5-10%
Server optimization: Compress to 1/2-1/4, loss
<1%
Important Finding: Distillation effectiveness is
especially significant when student is very small. When student capacity
approaches teacher, distillation returns diminish.
Q4:
Why is distillation particularly effective for small models?
A: Several theoretical explanations:
Knowledge Compression Under Capacity Constraints:
Small models cannot fit all training data, soft labels provide signals
about which knowledge is most important
Regularization Effect: Soft labels have higher
entropy, preventing small models from overfitting on limited data
Smooth Optimization Landscape: Soft label gradients
are smoother, helping small models find better local optima
Experimental Evidence: When student capacity is extremely small
(parameter count <1% of teacher), distillation can bring 10-20%
relative improvement.
Q5: How to select layers for feature distillation?
A: Layer selection involves a trade-off between generality and task
relevance:
Shallow Layers: Low-level features (edges, textures),
useful for all tasks but limited information
Deep Layers: High-level semantic features, strong
task relevance but may overfit
Recommended Strategy:
- Choose the teacher's middle layers (like ResNet's layer2 and layer3)
- Avoid the first layer (too-basic information) and the last layer (already covered by response-based distillation)
- For multi-layer distillation, weights should increase progressively (deeper layers get higher weights)
Automated Methods: Use NAS or reinforcement learning
to search optimal layer combinations (like MetaDistiller).
Q6: Why is
self-distillation effective?
A: Self-distillation seems paradoxical: student and
teacher have same architecture, how can distillation improve
performance?
Explanations:
1. Regularization: Soft labels provide smoother supervision signals, reducing overfitting
2. Ensemble Effect: Each generation explores different regions of the loss landscape, equivalent to an implicit ensemble
3. Dark Knowledge Refinement: Even with the same architecture, the teacher's soft outputs convey dark knowledge like class relationships

Experimental Evidence:
- Born-Again Networks improve 1-2% on CIFAR-100
- The improvement is more significant when the data volume is smaller
Q7: How
to combine distillation with pruning/quantization?
A: Distillation can significantly alleviate
performance loss from pruning and quantization:
Pruning + Distillation:
1. Train the full-model teacher
2. Prune the teacher to get the initial student
3. Fine-tune the student with teacher soft labels
4. Effect: Typically recovers 50-80% of the pruning loss

Quantization + Distillation:
1. FP32 teacher training
2. Quantized student initialization (INT8 or INT4)
3. Distillation fine-tuning of the quantized student
4. Effect: INT8 nearly lossless, INT4 loss <1%

Simultaneous Application: First prune, then quantize,
with distillation throughout; this can achieve a 10-20x compression
ratio.
Q8:
Are there differences in distillation between NLP and CV?
A: Distillation core principles are same, but
specific implementations differ:
CV Characteristics:
- Feature maps have spatial structure (2D); attention maps, Gram matrices, etc. can be used
- Feature distillation is typically performed at multiple convolutional layers
- Data augmentation (like MixUp) can further improve distillation effectiveness

NLP Characteristics:
- Features are sequences (1D); sequence alignment or pooling methods are used
- Intermediate-layer distillation of BERT-style models (like DistilBERT) is very effective
- Pre-training + distillation is the mainstream paradigm (first pre-train a large model, then distill to a small one)

Commonality: Response-based distillation is effective
in both domains and serves as the baseline method.
Q9: Can we use multiple
teacher models?
A: Yes, called Multi-Teacher
Distillation:
Average Ensemble:

$$p^t = \frac{1}{M} \sum_{m=1}^{M} p^{t_m}$$

The student learns the average distribution of $M$ teachers.
Weighted Ensemble:

$$p^t = \sum_{m=1}^{M} w_m \, p^{t_m}$$

The weights $w_m$ can be fixed (like by teacher accuracy) or
learnable.
Advantages: Ensemble knowledge from multiple
teachers, more robust.
Disadvantages: Need to train multiple teachers, high
cost.
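Building the ensemble soft target is a one-liner per teacher (temperature and weights below are illustrative; uniform weights give the average ensemble):

```python
import torch
import torch.nn.functional as F

def multi_teacher_target(teacher_logits_list, weights=None, T=4.0):
    """Ensemble soft target: weighted average of the teachers'
    temperature-softened distributions. Uniform weights by default."""
    probs = [F.softmax(z / T, dim=1) for z in teacher_logits_list]
    if weights is None:
        weights = [1.0 / len(probs)] * len(probs)
    return sum(w * p for w, p in zip(weights, probs))

t1, t2, t3 = (torch.randn(8, 10) for _ in range(3))
target = multi_teacher_target([t1, t2, t3])               # average ensemble
target_w = multi_teacher_target([t1, t2, t3],
                                weights=[0.5, 0.3, 0.2])  # accuracy-weighted
# each row still sums to 1, so it can feed the KD loss directly
```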
Q10: How
effective is distillation on small datasets?
A: Distillation especially effective on small
datasets:
Reasons:
- Small datasets are prone to overfitting; soft labels provide strong regularization
- Teacher models pre-trained on larger datasets (like ImageNet) transfer prior knowledge

Experiments (medical image classification, 1000 training images):
- Train a small model from scratch: 65% accuracy
- Fine-tune a pretrained model with hard labels: 72%
- Distill the small model from a large model: 75%

Distillation provides a 3% improvement over direct fine-tuning,
demonstrating the value of soft labels on small data.
Q11:
What is the computational overhead of distillation?
A: Additional overhead from distillation
includes:
Teacher Inference: Training requires teacher
forward pass, increases computation by about 50%
Feature Storage (feature distillation): Need to
store intermediate features, increases memory
Multiple Loss Computations: Additional KL
divergence, MSE, etc., very small overhead
Optimization Strategies:
- Offline Distillation: Pre-compute and save the teacher's soft labels, then load them directly during training (saves teacher inference)
- Online Distillation: Dynamically update the teacher, at higher computational cost
- Selective Distillation: Only distill on difficult samples
Inference Stage: Student model deployed
independently, no additional overhead.
Q12:
Relationship between distillation and transfer learning?
A: Distillation is a special type of transfer
learning:
Commonalities:
- Both transfer knowledge from one model (source) to another (target)
- Both leverage prior knowledge to reduce target-task data requirements

Differences:
- Transfer Learning: Typically changes the task (like ImageNet → medical imaging)
- Distillation: Typically keeps the task, changes model capacity
Combination: Cross-task distillation simultaneously
changes tasks and capacity, an active research direction.
Q13:
How does distillation relate to knowledge transfer in humans?
A: Distillation mimics human knowledge transfer
process:
Human Learning:
- Experts teach students not just "correct answers" but also "thinking processes"
- Students learn from experts' confidence levels and the similarities between concepts
- Iterative learning: students may become teachers to the next generation

Distillation Analogy:
- Soft labels are like experts' "confidence" and "similarities between concepts"
- The temperature parameter controls knowledge granularity
- Self-distillation is like iterative refinement of knowledge
This analogy inspires new distillation methods, like meta-knowledge
distillation, curriculum distillation, etc.
Q14: Can distillation
improve robustness?
A: Distillation can improve model robustness to some
extent:
Adversarial Robustness:
- Soft labels provide smoother supervision signals, reducing sensitivity to adversarial perturbations
- Experiments show distilled models have slightly higher adversarial accuracy
- Combining adversarial training with distillation further improves robustness

Distribution Shift Robustness:
- Teacher models trained on broader data distributions can transfer robustness to students
- Cross-domain distillation helps student models generalize to different distributions

Noise Robustness:
- Soft labels' regularization effect makes models less sensitive to label noise
- Particularly effective on small, noisy datasets
Q15:
What is the future direction of distillation research?
A: Several promising research directions:
Automated Distillation Strategy Search:
Use NAS/RL to automatically find optimal distillation layers,
weights, temperatures
Reduce manual tuning effort
Cross-Modal Distillation:
Transfer knowledge from one modality to another (like vision → language)
Multimodal large model distillation
Few-Shot Distillation:
How to effectively distill with very few samples
Combine meta-learning and distillation
Self-Supervised Distillation:
Use self-supervised learning objectives for distillation
Reduce reliance on labeled data
Lifelong Distillation:
Continually distill knowledge as new tasks arrive
Avoid catastrophic forgetting
Related Papers
Classic Papers
Hinton et al., "Distilling the Knowledge in a Neural
Network", NIPS 2014 Workshop
Proposed knowledge distillation, temperature parameter, soft label
concepts
Knowledge distillation is a simple yet powerful idea: have small
models learn large models' "thinking patterns" rather than simply
imitating outputs. Through soft labels, temperature parameters, feature
matching, relation preservation and other techniques, distillation can
significantly reduce model size while maintaining performance close to
original models. From Hinton's pioneering work to recent methods like
CRD and TinyBERT, distillation techniques continue to evolve, becoming a
core tool for model compression and transfer learning. Whether for
mobile deployment, edge computing, or democratization of large models,
knowledge distillation will play a crucial role.
Post title: Transfer Learning (5): Knowledge Distillation
Post author: Chen Kai
Create time: 2024-11-27 09:30:00
Post link: https://www.chenk.top/transfer-learning-5-knowledge-distillation/
Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.