Domain Adaptation is one of the most challenging problems in transfer
learning. In practical applications, training data (source domain) and
test data (target domain) often come from different distributions:
medical images transferred from one hospital to another, recommendation
systems transferred from one country to another, autonomous driving
transferred from sunny to rainy conditions. This distribution
shift can lead to significant performance degradation.
The core goal of domain adaptation is: to learn a model that
performs well on the target domain when the source domain has labeled
data but the target domain has no labels (or few labels). This
requires aligning source and target domain feature distributions while
maintaining discriminative power. This article develops the mathematical
characterization of distribution shift from first principles, derives the
theoretical foundation of unsupervised domain adaptation, explains classic
methods like DANN and MMD in detail, and provides a complete DANN
implementation.
## Domain Shift Problems: Covariate Shift and Label Shift

### Formal Definition of Domains
A **Domain** $\mathcal{D} = \{\mathcal{X}, P(X)\}$ consists of two components:
- Feature Space $\mathcal{X}$: the value space of input variables
- Marginal Distribution $P(X)$: the probability distribution of input variables

A **Task** $\mathcal{T} = \{\mathcal{Y}, P(Y|X)\}$ also consists of two components:
- Label Space $\mathcal{Y}$: the value space of output variables
- Conditional Distribution $P(Y|X)$: the probability distribution of outputs given inputs

Domain adaptation setting:
- Source Domain $\mathcal{D}_S$, with labeled data $\{(x_i^s, y_i^s)\}_{i=1}^{n_s}$
- Target Domain $\mathcal{D}_T$, with unlabeled data $\{x_j^t\}_{j=1}^{n_t}$

Goal: learn a model $f: \mathcal{X} \to \mathcal{Y}$ that performs well on the target domain.
### Covariate Shift

**Definition**: Feature distributions differ but conditional distributions are the same:

$$P_S(X) \ne P_T(X), \qquad P_S(Y|X) = P_T(Y|X)$$

Intuitive examples:
- **Spam Email Classification**: Training data from 2020, test data from 2026. The email topic distribution changes ($P(X)$ changes), but the rules for determining spam given content remain unchanged ($P(Y|X)$ unchanged).
- **Medical Imaging**: Training data from Siemens CT, test data from GE CT. Device imaging characteristics differ ($P(X)$ changes), but the diagnostic criteria for lesions remain constant ($P(Y|X)$ unchanged).
### Importance Weighting

Covariate shift can be corrected through importance weighting. Empirical Risk Minimization (ERM) on the source domain optimizes:

$$\min_f \frac{1}{n_s} \sum_{i=1}^{n_s} \ell(f(x_i^s), y_i^s)$$

But what we really care about is the target domain risk:

$$R_T(f) = \mathbb{E}_{(x,y) \sim P_T}[\ell(f(x), y)]$$

Using importance sampling (valid under covariate shift, since $P_S(Y|X) = P_T(Y|X)$), the target domain risk can be rewritten as:

$$R_T(f) = \mathbb{E}_{(x,y) \sim P_S}\!\left[\frac{P_T(x)}{P_S(x)}\, \ell(f(x), y)\right]$$

where $w(x) = P_T(x)/P_S(x)$ is the importance weight. In practice, we optimize the weighted loss:

$$\min_f \frac{1}{n_s} \sum_{i=1}^{n_s} w(x_i^s)\, \ell(f(x_i^s), y_i^s)$$
### Density Ratio Estimation

The importance weight $w(x) = P_T(x)/P_S(x)$ requires the ratio of two densities. Direct density estimation is difficult in high dimensions (curse of dimensionality), but the density ratio itself can be estimated directly, for example with a probabilistic classifier trained to distinguish source from target samples.
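As a concrete illustration, a domain classifier can convert its predicted probabilities into density ratios: label source samples 0 and target samples 1, then $w(x) = \frac{n_s}{n_t} \cdot \frac{p(\text{target}|x)}{1 - p(\text{target}|x)}$. The sketch below is a minimal illustration (the helper name `estimate_importance_weights` is mine, not from the article), using scikit-learn's logistic regression:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_importance_weights(X_source, X_target):
    """Estimate w(x) = P_T(x) / P_S(x) with a domain classifier.

    A probabilistic classifier p(domain=target | x) gives
    w(x) = (n_s / n_t) * p(target|x) / (1 - p(target|x)).
    """
    n_s, n_t = len(X_source), len(X_target)
    X = np.vstack([X_source, X_target])
    y = np.concatenate([np.zeros(n_s), np.ones(n_t)])  # 0 = source, 1 = target
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    p_target = clf.predict_proba(X_source)[:, 1]
    p_target = np.clip(p_target, 1e-6, 1 - 1e-6)  # avoid division by zero
    return (n_s / n_t) * p_target / (1 - p_target)

# Toy demo: the target distribution is the source shifted to the right
rng = np.random.default_rng(0)
X_s = rng.normal(0.0, 1.0, size=(2000, 1))
X_t = rng.normal(1.0, 1.0, size=(2000, 1))
w = estimate_importance_weights(X_s, X_t)
# Source points closer to the target mean should receive larger weights
assert w[X_s[:, 0] > 1.0].mean() > w[X_s[:, 0] < -1.0].mean()
```

The resulting weights can be plugged directly into the weighted ERM objective above (e.g., as per-sample weights in the loss).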
### Label Shift

**Definition**: Label distributions differ but class-conditional distributions are the same:

$$P_S(Y) \ne P_T(Y), \qquad P_S(X|Y) = P_T(X|Y)$$

Intuitive examples:
- **Medical Diagnosis**: Training data from hospitalized patients (high disease proportion), test data from outpatients (high healthy proportion). Disease prevalence differs ($P(Y)$ changes), but given the disease, the symptom distribution is the same ($P(X|Y)$ unchanged).
- **Recommendation System**: Training data from active users (more young users), test data from all users (more elderly users). The user age distribution differs ($P(Y)$ changes), but given the user type, behavior patterns are the same ($P(X|Y)$ unchanged).
### Label Shift Correction

Using Bayes' theorem, the target domain posterior is:

$$P_T(y|x) = \frac{P_T(x|y)\, P_T(y)}{P_T(x)}$$

Since $P_T(x|y) = P_S(x|y)$ under label shift, this can be rewritten as:

$$P_T(y|x) \propto P_S(y|x) \cdot \frac{P_T(y)}{P_S(y)}$$

Therefore, it suffices to reweight the source model's outputs with $w(y) = P_T(y)/P_S(y)$ and renormalize.

**Label Distribution Estimation**: $P_S(y)$ can be directly estimated from source domain labels. $P_T(y)$ must be estimated from unlabeled target data. One method is Expectation Maximization (EM):

- E-step: Use the current model to predict target domain label posteriors $p_T(y|x_j^t)$
- M-step: Update the label distribution $P_T(y) \leftarrow \frac{1}{n_t} \sum_{j=1}^{n_t} p_T(y|x_j^t)$

Iterate until convergence.
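The EM procedure above can be sketched in a few lines; the `em_label_shift` helper and the toy Gaussian setup are illustrative assumptions, not code from the article:

```python
import numpy as np

def em_label_shift(probs_t, p_s, n_iter=200):
    """EM estimate of the target label distribution P_T(y) under label shift.

    probs_t: (n, K) source-model posteriors p_S(y|x) on unlabeled target data
    p_s:     (K,) source label distribution P_S(y)
    """
    p_t = p_s.copy()
    for _ in range(n_iter):
        # E-step: reweight source posteriors by w(y) = P_T(y) / P_S(y)
        post = probs_t * (p_t / p_s)
        post /= post.sum(axis=1, keepdims=True)   # normalized p_T(y|x)
        # M-step: new P_T(y) is the average target posterior
        p_t = post.mean(axis=0)
    return p_t

# Toy check: calibrated 2-class model, pure label shift, P_S(y) uniform
rng = np.random.default_rng(0)
p_t_true = np.array([0.8, 0.2])
y = rng.choice(2, size=20000, p=p_t_true)
x = rng.normal(2.0 * y, 1.0)                 # class 0 ~ N(0,1), class 1 ~ N(2,1)
p1 = 1.0 / (1.0 + np.exp(-(2.0 * x - 2.0)))  # exact p_S(1|x) for uniform P_S(y)
probs = np.stack([1.0 - p1, p1], axis=1)
p_t_est = em_label_shift(probs, np.array([0.5, 0.5]))
assert abs(p_t_est[0] - p_t_true[0]) < 0.05  # recovers P_T(y=0) ~= 0.8
```

Note that this EM is only well-behaved when the source model's posteriors are reasonably calibrated; with badly calibrated models, the estimate of $P_T(y)$ can drift.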
### Concept Shift

**Definition**: Conditional distributions differ:

$$P_S(Y|X) \ne P_T(Y|X)$$

This is the most difficult case: even with the same features, decision boundaries may differ.

Examples:
- **Sentiment Classification**: Training data from movie reviews, test data from product reviews. The same words may carry different sentiment in different domains.
- **Autonomous Driving**: Training data from the US (right-hand traffic), test data from the UK (left-hand traffic). The driving rules are completely different.

Concept shift usually requires a small amount of labeled target domain data for adaptation.
## Unsupervised Domain Adaptation

Unsupervised Domain Adaptation (UDA) assumes the source domain has labeled data while the target domain is completely unlabeled. The core idea: learn a feature extractor that aligns the source and target domain feature distributions.
### Theoretical Foundation: Ben-David Theory

Ben-David et al. gave a theoretical upper bound for domain adaptation. Let $\mathcal{H}$ be the hypothesis space and $h \in \mathcal{H}$ a hypothesis. The target domain risk can be bounded as:

$$R_T(h) \le R_S(h) + \frac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) + \lambda$$

where:
- $R_S(h)$: source domain risk
- $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T)$: $\mathcal{H}\Delta\mathcal{H}$-distance between source and target domains
- $\lambda = \min_{h^* \in \mathcal{H}} \left[R_S(h^*) + R_T(h^*)\right]$: risk of the ideal joint hypothesis

Intuitive explanation:
1. Small $R_S(h)$: classify well on the source domain
2. Small $d_{\mathcal{H}\Delta\mathcal{H}}$: source and target domains are close
3. Small $\lambda$: source and target tasks are similar (small ideal joint risk)
Balancing these three terms is the core of domain adaptation.
#### $\mathcal{H}\Delta\mathcal{H}$-Distance

The $\mathcal{H}\Delta\mathcal{H}$-distance is defined as:

$$d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) = 2 \sup_{h, h' \in \mathcal{H}} \left| \Pr_{x \sim \mathcal{D}_S}[h(x) \ne h'(x)] - \Pr_{x \sim \mathcal{D}_T}[h(x) \ne h'(x)] \right|$$

Intuitive understanding: if the source and target domains are close, any two hypotheses $h, h'$ should have similar disagreement rates on both domains. In practice, we can approximate this distance using a domain discriminator: train a classifier to distinguish source and target domain samples. If it cannot distinguish them (classifier accuracy close to 50%), the two domains are close.
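This discriminator-based approximation is often reported as the proxy A-distance, $d_A = 2(1 - 2\epsilon)$, where $\epsilon$ is the held-out error of the domain classifier: $d_A \approx 0$ for indistinguishable domains, $d_A \approx 2$ for perfectly separable ones. A small sketch under that convention (the `proxy_a_distance` helper and classifier choice are mine):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def proxy_a_distance(feat_s, feat_t):
    """Proxy A-distance: d_A = 2 * (1 - 2 * err), where err is the held-out
    error of a classifier trained to separate source and target features."""
    X = np.vstack([feat_s, feat_t])
    y = np.concatenate([np.zeros(len(feat_s)), np.ones(len(feat_t))])
    Xtr, Xte, ytr, yte = train_test_split(
        X, y, test_size=0.5, random_state=0, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    err = 1.0 - clf.score(Xte, yte)
    return 2.0 * (1.0 - 2.0 * err)

rng = np.random.default_rng(0)
aligned   = proxy_a_distance(rng.normal(0, 1, (1000, 5)), rng.normal(0, 1, (1000, 5)))
separated = proxy_a_distance(rng.normal(0, 1, (1000, 5)), rng.normal(3, 1, (1000, 5)))
assert aligned < 0.5 < separated   # aligned domains ~ 0, shifted domains ~ 2
```

This is a convenient diagnostic to run on the features of any adaptation method: if alignment works, $d_A$ measured on the learned features should drop.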
## DANN: Domain-Adversarial Neural Network

Domain-Adversarial Neural Network (DANN) is the most classic domain adaptation method, aligning source and target domain feature distributions through adversarial training.

### DANN Architecture

DANN contains three components:
1. **Feature Extractor** $G_f$: maps the input $x$ to feature space
2. **Label Predictor** $G_y$: predicts labels from the features (classifier)
3. **Domain Discriminator** $G_d$: determines which domain the features come from
The loss function has the following parts.

Source domain classification loss:

$$\mathcal{L}_y = \frac{1}{n_s} \sum_{i=1}^{n_s} \ell_{ce}\!\left(G_y(G_f(x_i^s)), y_i^s\right)$$

Domain discrimination loss (with $G_d(\cdot)$ the predicted probability of coming from the source domain):

$$\mathcal{L}_d = -\frac{1}{n_s} \sum_{i=1}^{n_s} \log G_d(G_f(x_i^s)) - \frac{1}{n_t} \sum_{j=1}^{n_t} \log\!\left(1 - G_d(G_f(x_j^t))\right)$$

Total loss (adversarial):

$$E(\theta_f, \theta_y, \theta_d) = \mathcal{L}_y(\theta_f, \theta_y) - \lambda\, \mathcal{L}_d(\theta_f, \theta_d)$$

Adversarial training:
- The domain discriminator $G_d$ minimizes $\mathcal{L}_d$ (distinguishes source and target domains)
- The feature extractor $G_f$ minimizes $\mathcal{L}_y$ while maximizing $\mathcal{L}_d$ (makes $G_d$ unable to distinguish)

This is equivalent to the saddle-point problem:

$$\min_{\theta_f, \theta_y} \max_{\theta_d} \; \mathcal{L}_y(\theta_f, \theta_y) - \lambda\, \mathcal{L}_d(\theta_f, \theta_d)$$
### Gradient Reversal Layer (GRL)

DANN's clever part is the Gradient Reversal Layer (GRL), which is the identity mapping in the forward pass and flips (and scales) gradients in the backward pass:

$$R_\lambda(x) = x, \qquad \frac{\partial R_\lambda}{\partial x} = -\lambda I$$

This way, the domain discriminator's gradients are multiplied by $-\lambda$ before passing back to the feature extractor, implementing adversarial training.
### Adaptive Adversarial Weight

DANN uses an adaptive adversarial weight $\lambda_p$ that increases with training progress:

$$\lambda_p = \frac{2}{1 + \exp(-\gamma p)} - 1$$

where $p \in [0, 1]$ is the training progress (current step / total steps) and $\gamma$ is a hyperparameter (commonly $\gamma = 10$).

Intuition: early in training, first learn source domain classification well, then gradually strengthen domain alignment.
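The schedule is easy to verify numerically; this tiny helper (the name `dann_lambda` is illustrative) implements the formula above:

```python
import math

def dann_lambda(p: float, gamma: float = 10.0) -> float:
    """Adaptive adversarial weight: lambda_p = 2 / (1 + exp(-gamma * p)) - 1,
    where p in [0, 1] is the training progress."""
    return 2.0 / (1.0 + math.exp(-gamma * p)) - 1.0

assert dann_lambda(0.0) == 0.0                 # start: pure source classification
assert 0.99 < dann_lambda(1.0) < 1.0           # end: near full-strength alignment
assert dann_lambda(0.25) < dann_lambda(0.75)   # monotonically increasing
```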
### Theoretical Explanation of DANN

DANN minimizes a weighted sum of the source domain risk $R_S$ and the domain discrepancy. The domain discriminator's loss is a proxy for the domain discrepancy: if the discriminator cannot distinguish source from target (accuracy near 50%), the feature distributions are aligned.

More rigorously, with an optimal discriminator, DANN's adversarial objective is equivalent to minimizing the Jensen-Shannon divergence between the feature distributions:

$$\mathrm{JSD}(P_S^f \,\|\, P_T^f) = \frac{1}{2} \mathrm{KL}(P_S^f \,\|\, M) + \frac{1}{2} \mathrm{KL}(P_T^f \,\|\, M)$$

where $M = \frac{1}{2}(P_S^f + P_T^f)$ is the mixture distribution.
## MMD: Maximum Mean Discrepancy

Maximum Mean Discrepancy (MMD) is another method to measure distribution difference, measuring distance by comparing the means of two distributions in a Reproducing Kernel Hilbert Space (RKHS).

### MMD Definition

Let $\mathcal{H}$ be an RKHS with corresponding kernel function $k(\cdot, \cdot)$. MMD is defined as:

$$\mathrm{MMD}(P, Q) = \left\| \mathbb{E}_{x \sim P}[\phi(x)] - \mathbb{E}_{y \sim Q}[\phi(y)] \right\|_{\mathcal{H}}$$

where $\phi: \mathcal{X} \to \mathcal{H}$ is the kernel feature map, with $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}}$.

Expanding the square:

$$\mathrm{MMD}^2(P, Q) = \mathbb{E}_{x, x' \sim P}[k(x, x')] - 2\, \mathbb{E}_{x \sim P,\, y \sim Q}[k(x, y)] + \mathbb{E}_{y, y' \sim Q}[k(y, y')]$$

Empirical estimation:

$$\widehat{\mathrm{MMD}}^2 = \frac{1}{n^2} \sum_{i,j} k(x_i, x_j) - \frac{2}{nm} \sum_{i,j} k(x_i, y_j) + \frac{1}{m^2} \sum_{i,j} k(y_i, y_j)$$
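The empirical estimator translates directly into a few lines of PyTorch. This is a sketch of the biased estimator with a single Gaussian kernel (the helper names `gaussian_kernel` and `mmd2` are mine):

```python
import torch

def gaussian_kernel(x, y, sigma=1.0):
    """RBF kernel matrix: k(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 sigma^2))."""
    d2 = torch.cdist(x, y).pow(2)
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased empirical estimate of MMD^2(P, Q) from samples x ~ P, y ~ Q:
    mean k(x,x') - 2 mean k(x,y) + mean k(y,y')."""
    return (gaussian_kernel(x, x, sigma).mean()
            - 2 * gaussian_kernel(x, y, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean())

torch.manual_seed(0)
same  = mmd2(torch.randn(500, 3), torch.randn(500, 3))
shift = mmd2(torch.randn(500, 3), torch.randn(500, 3) + 2.0)
assert same.item() < 0.05 < shift.item()  # MMD^2 ~ 0 iff distributions match
```

Because the estimate is differentiable in `x` and `y`, it can be applied to network features and minimized by backpropagation, which is exactly how the deep MMD methods below use it.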
### MMD in Deep Domain Adaptation

Deep Domain Confusion (DDC) and Deep Adaptation Network (DAN) embed MMD into deep networks:

$$\mathcal{L} = \mathcal{L}_{cls} + \lambda\, \mathrm{MMD}^2\!\left(\{\phi_\theta(x_i^s)\}, \{\phi_\theta(x_j^t)\}\right)$$

where $\phi_\theta$ is the feature extractor and $x_i^s, x_j^t$ are source and target domain samples.

**Multi-Kernel MMD**: a single kernel may not sufficiently capture distribution differences, so a linear combination of multiple kernels is used:

$$k = \sum_{u=1}^{m} \beta_u k_u, \qquad \beta_u \ge 0, \qquad \sum_{u=1}^{m} \beta_u = 1$$
### Kernel Selection for MMD

Common kernel functions:

Gaussian kernel (RBF kernel):

$$k(x, y) = \exp\!\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$$

Polynomial kernel:

$$k(x, y) = (x^\top y + c)^d$$

Laplacian kernel:

$$k(x, y) = \exp\!\left(-\frac{\|x - y\|}{\sigma}\right)$$

In practice, one typically uses multiple Gaussian kernels with different bandwidths.
## CORAL: Correlation Alignment

Correlation Alignment (CORAL) reduces distribution difference by aligning second-order statistics (covariances) between the source and target domains.

### CORAL Loss

Let the source domain features be $D_S \in \mathbb{R}^{n_s \times d}$ and the target domain features be $D_T \in \mathbb{R}^{n_t \times d}$. The CORAL loss is:

$$\mathcal{L}_{CORAL} = \frac{1}{4d^2} \left\| C_S - C_T \right\|_F^2$$

where $C_S, C_T$ are the covariance matrices of the source and target domain features:

$$C_S = \frac{1}{n_s - 1} \sum_{i=1}^{n_s} (x_i^s - \bar{x}^s)(x_i^s - \bar{x}^s)^\top$$

and $\bar{x}^s$ is the mean vector (likewise for $C_T$).

**Deep CORAL**: Deep CORAL adds the CORAL loss to deep network training:

$$\mathcal{L} = \mathcal{L}_{cls} + \lambda\, \mathcal{L}_{CORAL}$$

### Intuition of CORAL

The covariance matrix captures correlations between features. Aligning covariance matrices eliminates linear transformation differences between domains. CORAL can be seen as whitening the source features and recoloring them with target statistics:

$$D_S' = D_S\, C_S^{-1/2}\, C_T^{1/2}$$
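A minimal PyTorch sketch of the CORAL loss as used in Deep CORAL (the `coral_loss` helper is illustrative, not the paper's reference code):

```python
import torch

def coral_loss(source, target):
    """CORAL loss: ||C_S - C_T||_F^2 / (4 d^2), where C is the covariance
    matrix of a (batch, d) feature matrix."""
    d = source.size(1)

    def covariance(x):
        xm = x - x.mean(dim=0, keepdim=True)
        return xm.t() @ xm / (x.size(0) - 1)

    diff = covariance(source) - covariance(target)
    return (diff * diff).sum() / (4 * d * d)

torch.manual_seed(0)
s = torch.randn(1000, 8)
t_same   = torch.randn(1000, 8)
t_scaled = torch.randn(1000, 8) * 3.0  # very different second-order statistics
assert coral_loss(s, t_same).item() < coral_loss(s, t_scaled).item()
```

Like the MMD estimator, this loss is differentiable and is simply added (with weight $\lambda$) to the classification loss on batches of source and target features.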
## GAN-Based Domain Adaptation

Generative Adversarial Network (GAN) ideas can also be used for domain adaptation: use generators to transform source domain samples to the target domain style while preserving semantics.
### CycleGAN: Cycle-Consistent Adversarial Networks

CycleGAN implements unpaired domain transformation through a cycle consistency loss.

#### CycleGAN Architecture

It contains two generators and two discriminators:
- $G: \mathcal{X}_S \to \mathcal{X}_T$ (source to target domain)
- $F: \mathcal{X}_T \to \mathcal{X}_S$ (target to source domain)
- $D_T$: discriminates real/fake target domain samples
- $D_S$: discriminates real/fake source domain samples

#### Loss Functions

Adversarial loss (for $G$ and $D_T$; symmetric for $F$ and $D_S$):

$$\mathcal{L}_{GAN}(G, D_T) = \mathbb{E}_{y \sim P_T}[\log D_T(y)] + \mathbb{E}_{x \sim P_S}[\log(1 - D_T(G(x)))]$$

Cycle consistency loss:

$$\mathcal{L}_{cyc} = \mathbb{E}_{x \sim P_S}[\|F(G(x)) - x\|_1] + \mathbb{E}_{y \sim P_T}[\|G(F(y)) - y\|_1]$$

Total loss:

$$\mathcal{L} = \mathcal{L}_{GAN}(G, D_T) + \mathcal{L}_{GAN}(F, D_S) + \lambda\, \mathcal{L}_{cyc}$$

Intuition:
- The adversarial loss makes generated samples look like the target domain
- The cycle consistency loss ensures semantics remain unchanged ($F(G(x)) \approx x$)
#### Domain Adaptation Application

CycleGAN can transform source domain images to the target domain style, then train a classifier on the transformed images:

1. Use CycleGAN to learn $G: \mathcal{X}_S \to \mathcal{X}_T$
2. Transform labeled source samples into pseudo target domain data $\{(G(x_i^s), y_i^s)\}_{i=1}^{n_s}$
3. Train a classifier on the pseudo target domain data
### Pixel-level Domain Adaptation

For vision tasks, domain adaptation can also be performed at the pixel level, as above, rather than only in feature space.

A related feature-level adversarial method is ADDA (Adversarial Discriminative Domain Adaptation). ADDA's advantage is that the source and target domain feature extractors can be different networks, which is more flexible.
## Adaptive Batch Normalization

Batch Normalization (BN) behaves differently during training and testing: training uses batch statistics, testing uses global statistics. This causes problems in domain adaptation.

### BN Domain Shift Problem

The BN layer computes:

$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta$$

During training, $\mu, \sigma$ are the mean and standard deviation of the current batch; during testing, the global running mean/std accumulated on the training set are used.

Problem: if the test data (target domain) distribution differs from the training data (source domain), the global statistics are inaccurate.
### Adaptive BN (AdaBN)

The AdaBN idea is simple: recompute BN statistics on the target domain.

AdaBN algorithm:
1. Train the model normally on the source domain (including BN layers)
2. Run the model on target domain data and compute the mean and variance of each BN layer's input activations $z^{(l)}$:

$$\mu_T^{(l)} = \mathbb{E}_{x \sim P_T}\!\left[z^{(l)}\right], \qquad \left(\sigma_T^{(l)}\right)^2 = \mathrm{Var}_{x \sim P_T}\!\left[z^{(l)}\right]$$

3. During testing, use $\mu_T^{(l)}, \sigma_T^{(l)}$ in place of the original source statistics

Why is this effective? BN statistics capture low-order statistical properties of the data, and aligning these statistics can partially eliminate domain shift. Experiments show AdaBN is especially effective for covariate shift.
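In PyTorch, AdaBN can be sketched by resetting each BN layer's running statistics and re-accumulating them on target batches in train mode; the `adapt_bn` helper below is an illustrative sketch, not the original AdaBN code:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def adapt_bn(model, target_batches):
    """AdaBN: recompute BatchNorm running statistics on target-domain data.

    All learned weights (including BN's affine gamma/beta) stay frozen; only
    the running mean/var used at test time are replaced by target statistics."""
    for m in model.modules():
        if isinstance(m, nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None   # momentum=None -> exact cumulative average
    model.train()               # train mode so BN updates its running stats
    for x in target_batches:
        model(x)
    model.eval()

# Minimal demo: BN stats accumulated on source get replaced by target stats
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.BatchNorm1d(8), nn.ReLU())
model.train()
for _ in range(10):                      # "source training" forward passes
    model(torch.randn(64, 4))
source_mean = model[1].running_mean.clone()

adapt_bn(model, [torch.randn(64, 4) + 5.0 for _ in range(10)])  # shifted target
assert not torch.allclose(source_mean, model[1].running_mean)
```

Note that only the running statistics change; no gradient step is taken, which is why AdaBN is nearly free compared to adversarial methods.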
### TransNorm: Transferable Normalization

TransNorm further decomposes BN into a task-related part and a domain-related part:
- $\mu, \sigma$: domain-related statistics, recomputed on the target domain
- $\gamma, \beta$: task-related affine parameters, kept unchanged
This both adapts to target domain distribution and preserves task
knowledge learned from source domain.
## Complete Implementation: DANN Domain Adaptation

Below is a complete DANN implementation covering the key components: gradient reversal layer, domain discriminator, and adversarial training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from torch.autograd import Function
import numpy as np
from tqdm import tqdm
from sklearn.metrics import accuracy_score
```
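A minimal sketch of those key components — the gradient reversal layer, the three sub-networks, and one adversarial training step — might look like this (the layer sizes and the `DANN`/`GradientReversal` names are illustrative assumptions, not the article's exact implementation):

```python
import torch
import torch.nn as nn
from torch.autograd import Function

class GradientReversal(Function):
    """Identity in the forward pass; multiplies gradients by -lambda backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class DANN(nn.Module):
    def __init__(self, in_dim=20, feat_dim=32, n_classes=3):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                     nn.Linear(64, feat_dim), nn.ReLU())
        self.classifier = nn.Linear(feat_dim, n_classes)
        self.discriminator = nn.Sequential(nn.Linear(feat_dim, 32), nn.ReLU(),
                                           nn.Linear(32, 2))  # source vs target

    def forward(self, x, lambd=1.0):
        f = self.feature(x)
        class_logits = self.classifier(f)
        domain_logits = self.discriminator(GradientReversal.apply(f, lambd))
        return class_logits, domain_logits

# One adversarial training step on synthetic data
torch.manual_seed(0)
model = DANN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

xs, ys = torch.randn(32, 20), torch.randint(0, 3, (32,))  # labeled source batch
xt = torch.randn(32, 20) + 1.0                            # unlabeled target batch

cls_s, dom_s = model(xs, lambd=0.5)
_,     dom_t = model(xt, lambd=0.5)
loss = (ce(cls_s, ys)
        + ce(dom_s, torch.zeros(32, dtype=torch.long))    # source = domain 0
        + ce(dom_t, torch.ones(32, dtype=torch.long)))    # target = domain 1
opt.zero_grad()
loss.backward()  # GRL flips discriminator gradients into the feature extractor
opt.step()
assert torch.isfinite(loss)
```

The single `loss.backward()` call is the whole trick: the discriminator parameters receive ordinary gradients (learning to discriminate), while the feature extractor receives the reversed ones (learning to confuse).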
The `compute_lambda` method dynamically adjusts the adversarial weight based on training progress:

$$\lambda_p = \frac{2}{1 + \exp(-\gamma p)} - 1$$

Early in training, $\lambda_p$ is close to 0 (focus on classification); later it approaches 1 (strengthen domain alignment).
### Adversarial Training

Training simultaneously optimizes the classification loss and the domain discrimination loss:

```python
# Source domain classification loss
class_loss = self.class_criterion(source_class_logits, source_labels)

# Domain discrimination loss
domain_loss = source_domain_loss + target_domain_loss

# Total loss (the gradient reversal layer handles the adversarial sign automatically)
loss = class_loss + domain_loss
loss.backward()
```
### Q1: Why align feature distributions instead of directly training on the source domain?

A model trained only on the source domain may fail on the target domain because its decision boundary can cut through regions where target data density is high.

**Mathematical explanation**: let the decision boundary be $\{x : f(x) = 0\}$. The source data $P_S(x)$ has high density near the learned boundary, so the model fits the boundary precisely there. But the target data $P_T(x)$ may concentrate in regions the source barely covers, where the boundary was never constrained and is therefore unreliable.

**Intuitive example**: training data is daytime images, test data is nighttime images. Even for the same objects, the nighttime feature distribution differs greatly, so the decision boundary needs relearning.

Aligning feature distributions lets source and target data mix in feature space, making the boundary reliable in both domains.
### Q2: Why does DANN use gradient reversal instead of directly maximizing the domain discrimination loss?

Directly maximizing the domain discrimination loss requires alternating optimization: first fix the feature extractor and optimize the domain discriminator, then fix the discriminator and optimize the feature extractor. This is equivalent to GAN-style training and is easily unstable.

The gradient reversal layer allows **single-step joint optimization**: one backward pass simultaneously updates all parameters. The domain discriminator's gradients are automatically flipped before reaching the feature extractor, implementing the adversarial objective.

Mathematically, gradient reversal is equivalent to optimizing:

$$E(\theta_f, \theta_y, \theta_d) = \mathcal{L}_y(\theta_f, \theta_y) - \lambda\, \mathcal{L}_d(\theta_f, \theta_d)$$

where the domain loss $\mathcal{L}_d$ enters with a negative sign for the feature extractor's update (adversarial) and a positive sign for the domain discriminator's update (discrimination).
### Q3: What's the difference between MMD and DANN? Which scenarios suit each?

| Method | Distance Metric | Optimization | Advantages | Disadvantages |
|--------|-----------------|--------------|------------|---------------|
| MMD | Kernel-based distance | Direct minimization | Strong theory, stable | Needs kernel selection, high computation |
| DANN | Jensen-Shannon divergence | Adversarial training | Strong expressiveness, good adaptation | Unstable training, needs hyperparameter tuning |

Application scenarios:
- **MMD**: small domain difference, limited data, stability required
- **DANN**: large domain difference, abundant data, pursuing best performance

In practice, try MMD first (more stable); if the results are insufficient, use DANN.
### Q4: Why is adaptive BN (AdaBN) effective? What problem does it solve?

BN layer statistics (mean, variance) capture **low-order statistical properties** of the data. Even if the source and target domains share the same semantics, these low-order statistics may differ.

Examples:
- **Images**: source domain images are brighter (higher mean), target domain images darker (lower mean)
- **Sensors**: source data from device A (small variance), target data from device B (large variance)

AdaBN replaces the source domain statistics with target domain statistics, eliminating these low-order differences and letting the model focus on high-level semantics.

**Theoretical explanation**: AdaBN is equivalent to normalizing (whitening) the target domain activations so that, after the BN layers, they match the statistics the network saw during source training, eliminating low-order covariate shift.
### Q5: How to choose a domain adaptation method? Is there a decision tree?

Yes, the following workflow:

```
1. Have target domain labeled data?
   ├─ Yes → Supervised domain adaptation (fine-tuning, importance weighting)
   └─ No → go to 2

2. Where is the main source/target difference?
   ├─ Feature distribution P(X) → go to 3
   ├─ Label distribution P(Y) → label shift correction
   └─ Conditional distribution P(Y|X) → need some target domain labels

3. Difference magnitude?
   ├─ Small (covariate shift) → AdaBN, CORAL
   ├─ Medium → MMD, DANN
   └─ Large (cross-modal) → CycleGAN, ADDA

4. Data and computation resources?
   ├─ Limited → AdaBN, CORAL
   └─ Abundant → DANN, ADDA
```
### Q6: Can domain adaptation harm source domain performance?

Yes; this is called **negative transfer**. The reason is that domain alignment may harm discriminativeness: to make source and target features close, the model may discard information that is useful for classification.

Ben-David theory tells us the target domain risk is bounded by three terms:

$$R_T(h) \le R_S(h) + \frac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) + \lambda$$

Excessive alignment may decrease $d_{\mathcal{H}\Delta\mathcal{H}}$ but increase $R_S(h)$ and $\lambda$.

Solutions:
1. **Add a source validation set**: monitor source performance, stop if it degrades
2. **Adjust the adversarial weight**: don't make the adversarial weight too large
3. **Conditional alignment**: only align same-class samples (conditional domain adaptation)
### Q7: How to evaluate domain adaptation effectiveness, besides target domain accuracy?

- **t-SNE Visualization**: reduce source and target features to 2D and check whether the two domains mix
- **Class separation vs. domain mixing**: measure the ratio of inter-class distance to intra-class distance in the aligned feature space; larger is better (classes separated but domains mixed)
- **Per-class Accuracy**: check whether all classes improve (avoid some classes degrading)
### Q8: How to adjust DANN's adversarial weight? Are there automatic adjustment methods?

DANN uses the adaptive weight:

$$\lambda_p = \frac{2}{1 + \exp(-\gamma p)} - 1$$

where $p$ is the training progress and $\gamma$ controls the growth speed.

Choosing the hyperparameter $\gamma$:
- $\gamma$ too small (like 1): the adversarial weight grows too slowly, giving insufficient domain alignment
- $\gamma$ too large (like 100): the adversarial weight grows too fast, disrupting classification learning
- Recommended: $\gamma = 10$

**Automatic adjustment**: validation-set performance can be used to adjust the weight dynamically.
### Q9: How to handle inconsistent classes between source and target (partial/open-set domain adaptation)?

Standard domain adaptation assumes the source and target label spaces are the same ($\mathcal{Y}_S = \mathcal{Y}_T$). But in practice:

- **Partial DA**: target classes are a subset of source classes ($\mathcal{Y}_T \subset \mathcal{Y}_S$)
- **Open-set DA**: the target has classes the source doesn't have ($\mathcal{Y}_T \setminus \mathcal{Y}_S \ne \emptyset$)

Solutions:

Partial DA:
- Only align the source classes that also exist in the target
- Use class weights: give small weights to source classes absent from the target

Open-set DA:
- Add an "unknown" class and detect novel classes in the target
- Use an open-world classifier: reject classification when prediction confidence is low
### Q10: Why does CycleGAN's cycle consistency loss help keep semantics unchanged?

The cycle consistency loss:

$$\mathcal{L}_{cyc} = \mathbb{E}_{x \sim P_S}[\|F(G(x)) - x\|_1] + \mathbb{E}_{y \sim P_T}[\|G(F(y)) - y\|_1]$$

**Intuition**: if $G$ transforms source to target and $F$ transforms target to source, then $F \circ G$ should be close to the identity mapping.

Mathematically, cycle consistency amounts to requiring $G$ and $F$ to be (approximate) inverse mappings:

$$F(G(x)) \approx x, \qquad G(F(y)) \approx y$$

This ensures semantic information is not lost: $x$ can be recovered after applying $G$ and then $F$.

**But note**: cycle consistency cannot completely guarantee unchanged semantics. For example, $G$ could systematically swap semantics (say, map class-A images to class-B style) while $F$ swaps them back; the cycle is still consistent, yet the semantics have visibly changed. Therefore other constraints (like perceptual loss or identity loss) are usually needed as well.
### Q11: How to select the MMD kernel function and bandwidth?

Kernel function selection:
- **Gaussian kernel** (most common): $k(x, y) = \exp\!\left(-\|x - y\|^2 / 2\sigma^2\right)$
- **Multi-kernel combination**: use multiple Gaussian kernels with different bandwidths

Bandwidth $\sigma$ selection. The empirical rule (median heuristic):

$$\sigma = \mathrm{median}\{\|x_i - x_j\| : i \ne j\}$$

i.e., the median of all pairwise sample distances. Intuition: this $\sigma$ makes the kernel neither too local ($\sigma$ too small) nor too global ($\sigma$ too large).

**Multi-kernel MMD**: use a family of bandwidths scaled around the median value, e.g. $\sigma \in \{\sigma_0/4,\, \sigma_0/2,\, \sigma_0,\, 2\sigma_0,\, 4\sigma_0\}$, where $\sigma_0$ is the median heuristic.
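The median heuristic is a one-liner to compute; this sketch (the `median_heuristic` name is mine) also builds a multi-bandwidth family $\{\sigma_0/4, \ldots, 4\sigma_0\}$, which is one common choice rather than a uniquely correct set:

```python
import torch

def median_heuristic(x, y=None):
    """Median-heuristic bandwidth: sigma = median{ ||z_i - z_j|| : i < j }
    over the pooled samples."""
    z = x if y is None else torch.cat([x, y], dim=0)
    d = torch.cdist(z, z)
    iu = torch.triu_indices(len(z), len(z), offset=1)  # pairs with i < j only
    return d[iu[0], iu[1]].median()

torch.manual_seed(0)
sigma = median_heuristic(torch.randn(300, 4), torch.randn(300, 4))
# One common multi-kernel family around the median value
sigmas = [sigma * s for s in (0.25, 0.5, 1.0, 2.0, 4.0)]
assert sigma.item() > 0
```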
### Q12: Where is domain adaptation most valuable in practical applications?

- **Medical imaging**: different hospitals and devices produce different data distributions (CT from Siemens to GE; MRI from 1.5T to 3T)
- **Autonomous driving**: different weather, lighting, and cities produce different data distributions (sunny to rainy; San Francisco to New York)
- **Recommendation systems**: different countries and time periods show different user behaviors (US to China; 2020 to 2026)
- **Sentiment analysis**: different domains express sentiment differently (movie reviews to product reviews; formal text to social media)
- **Object detection/segmentation**: synthetic data to real data (GTA5 game scenes to real street scenes)

Domain adaptation is especially suitable for scenarios where target data is hard to annotate but source data is abundant.
Domain adaptation is one of the most challenging yet most practical problems in transfer learning. This article started from the mathematical characterization of distribution shift (covariate shift, label shift, concept shift), derived the theoretical foundation of unsupervised domain adaptation (Ben-David theory), and explained classic methods like DANN, MMD, and CORAL in detail.
We see that the core of domain adaptation is finding a balance
between aligning feature distributions and
maintaining discriminativeness. DANN implicitly
minimizes domain difference through adversarial training, MMD explicitly
measures distribution distance through kernel functions, and AdaBN
eliminates low-order differences by adjusting statistics. Each method
has its advantages and limitations, requiring selection based on
specific application scenarios.
The complete DANN implementation demonstrates key techniques like
gradient reversal layers, domain discriminators, and adaptive
adversarial weights. In the next chapter, we'll explore Few-Shot
Learning, studying how to learn new categories from extremely
few samples.
Post title: Transfer Learning (3): Domain Adaptation Methods
Post author: Chen Kai
Create time: 2024-11-15 10:15:00
Post link: https://www.chenk.top/transfer-learning-3-domain-adaptation/
Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.