• Transfer Learning (6): Multi-Task Learning

    Multi-Task Learning (MTL) is a machine learning paradigm that improves model generalization by simultaneously learning multiple related tasks. Rich Caruana's pioneering 1997 paper "Multitask Learning" demonstrated how shared representations help models learn more robust features. In modern deep learning, multi-task learning has achieved tremendous success in computer vision (simultaneous detection, segmentation, depth estimation), natural language processing (joint entity recognition and relation extraction), and recommendation systems (simultaneous CTR and CVR prediction). But multi-task learning is far more than simply summing multiple loss functions — how to design shared structures, how to balance learning across different tasks, and how to handle negative transfer between tasks are all questions requiring deep investigation.

    This article derives the mathematical foundations of multi-task learning from first principles, analyzes the pros and cons of hard vs soft parameter sharing, explains task relationship learning and task clustering methods in detail, deeply analyzes gradient conflict problems and solutions (PCGrad, GradNorm, CAGrad, etc.), introduces auxiliary task design principles, and provides a complete multi-task network implementation (including dynamic weight adjustment, gradient projection, task balancing and other industrial-grade techniques). We'll see that multi-task learning essentially seeks a Pareto optimal solution satisfying multiple optimization objectives.
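    Of the gradient-conflict remedies listed above, PCGrad has the simplest core idea: when two task gradients point in conflicting directions (negative inner product), project each onto the normal plane of the other before summing. A minimal NumPy sketch of that projection step for two tasks (illustrative only, not the article's full implementation):

```python
import numpy as np

def pcgrad(g1, g2):
    """PCGrad for two tasks: if the task gradients conflict (negative
    inner product), project each onto the normal plane of the other,
    then sum the (possibly projected) gradients."""
    g1 = np.asarray(g1, dtype=float)
    g2 = np.asarray(g2, dtype=float)
    out1, out2 = g1.copy(), g2.copy()
    if g1 @ g2 < 0:  # conflicting directions
        out1 = g1 - (g1 @ g2) / (g2 @ g2) * g2  # remove component along g2
        out2 = g2 - (g2 @ g1) / (g1 @ g1) * g1  # remove component along g1
    return out1 + out2
```

    After projection, each surviving gradient is orthogonal to the other task's original gradient, so neither update directly undoes the other's progress; non-conflicting gradients pass through unchanged.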

  • Transfer Learning (5): Knowledge Distillation

    Knowledge Distillation (KD) is a model compression and transfer learning technique that enables small models (students) to learn from large models (teachers), maintaining performance close to teacher models while significantly reducing parameters and computation. Hinton et al.'s seminal 2015 paper "Distilling the Knowledge in a Neural Network" sparked a research wave in this field. But knowledge distillation is far more than simple "soft label" training — it involves temperature parameter tuning, extracting knowledge at different levels, matching student-teacher architectures, and numerous technical details.

    This article derives the mathematical foundations of knowledge distillation from first principles, explains why soft labels contain more information than hard labels, details implementation of response-based, feature-based, and relation-based distillation, introduces methods like self-distillation, mutual learning, and online distillation that don't require pre-trained teachers, and explores synergistic optimization of quantization, pruning, and distillation. We'll see that distillation is essentially "compression encoding" of knowledge — explicitly transferring dark knowledge implicitly learned by teacher models to student models.
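    The response-based variant mentioned above reduces to a small formula: soften both teacher and student logits with a temperature T, then minimize the KL divergence between the two distributions, scaled by T² so gradient magnitudes stay comparable across temperatures. A minimal NumPy sketch under those standard assumptions (the article's implementation adds the usual hard-label cross-entropy term on top):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in Hinton et al. (2015)."""
    p = softmax(teacher_logits, T)  # soft teacher targets
    q = softmax(student_logits, T)  # soft student predictions
    return T ** 2 * np.sum(p * (np.log(p) - np.log(q)))
```

    At higher T the teacher's distribution flattens, exposing the relative probabilities of wrong classes — exactly the "dark knowledge" that hard labels discard.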

  • Transfer Learning (4): Few-Shot Learning

    Few-shot learning represents one of the most challenging problems in machine learning. Humans can rapidly learn new concepts from minimal examples: recognizing new species after seeing just a few images, or understanding new linguistic patterns from a handful of instances. Traditional deep learning models, however, require massive amounts of labeled data to train effectively and perform poorly in data-scarce scenarios.

    The goal of few-shot learning is to learn classifiers from only a few examples per class (typically 1-10 samples). This requires models with powerful generalization and transfer capabilities: the ability to learn "how to learn" from known classes and quickly adapt to novel classes. This article derives the mathematical foundations of metric learning and meta-learning from first principles, explains classic methods like Siamese networks, Prototypical networks, and MAML in detail, and provides a complete Prototypical network implementation.
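    The Prototypical network's inference step is compact enough to sketch directly: each class prototype is the mean of its support embeddings, and a query is assigned to the nearest prototype. A minimal NumPy sketch operating on precomputed embeddings (the embedding network itself is omitted):

```python
import numpy as np

def prototype_classify(support, support_labels, query):
    """Prototypical-network inference: class prototype = mean of that
    class's support embeddings; the query goes to the nearest prototype
    under squared Euclidean distance."""
    support = np.asarray(support, dtype=float)
    labels = np.asarray(support_labels)
    classes = np.unique(labels)
    protos = np.stack([support[labels == c].mean(axis=0) for c in classes])
    d = ((np.asarray(query, dtype=float)[None, :] - protos) ** 2).sum(axis=1)
    return classes[d.argmin()]
```

    In the full model, softmax over negative distances gives class probabilities, and the embedding network is trained end-to-end on episodes sampled to mimic the few-shot test regime.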

  • Transfer Learning (3): Domain Adaptation Methods

    Domain Adaptation is one of the most challenging problems in transfer learning. In practical applications, training data (source domain) and test data (target domain) often come from different distributions: medical images transferred from one hospital to another, recommendation systems transferred from one country to another, autonomous driving transferred from sunny to rainy conditions. This distribution shift can lead to significant performance degradation.

    The core goal of domain adaptation is to learn a model that performs well on the target domain when the source domain has labeled data but the target domain has no labels (or few labels). This requires aligning source and target domain feature distributions while maintaining discriminative power. This article derives the mathematical characterization of distribution shift from first principles, develops the theoretical foundation of unsupervised domain adaptation, explains classic methods like DANN and MMD in detail, and provides a complete DANN implementation.
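    The MMD criterion mentioned above measures how far apart two feature distributions are by comparing mean kernel similarities within and across domains. A minimal NumPy sketch of the biased squared-MMD estimate with an RBF kernel (a fixed `gamma` is an illustrative assumption; practical implementations often use a mixture of bandwidths):

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """RBF kernel matrix k(x, y) = exp(-gamma * ||x - y||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(Xs, Xt, gamma=1.0):
    """Biased estimate of squared MMD between source and target samples:
    E[k(s, s')] + E[k(t, t')] - 2 E[k(s, t)]."""
    return (rbf_kernel(Xs, Xs, gamma).mean()
            + rbf_kernel(Xt, Xt, gamma).mean()
            - 2.0 * rbf_kernel(Xs, Xt, gamma).mean())
```

    Adding this quantity (computed on intermediate features) to the supervised source loss pulls the two feature distributions together, which is the core of MMD-based adaptation; DANN achieves a similar alignment adversarially via a domain classifier and gradient reversal.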

  • Transfer Learning (2): Pre-training and Fine-tuning Techniques

    Pre-training and fine-tuning have become one of the most successful transfer learning paradigms in modern deep learning. The emergence of BERT in 2018 fundamentally transformed the NLP research landscape, and pre-trained models have achieved tremendous success in computer vision, speech, and multimodal domains. But why does pre-training work? How should we adjust learning rates during fine-tuning? Which layers should be frozen? These questions involve deep theoretical considerations and engineering trade-offs.

    This article derives the mathematical foundations of pre-training from first principles, analyzes the loss functions of contrastive learning and masked language models, explains various fine-tuning strategies in detail, and provides a complete industrial-grade BERT fine-tuning implementation with gradient accumulation, mixed-precision training, and learning rate scheduling. We'll see that pre-training essentially learns a powerful prior distribution, while fine-tuning performs Bayesian updates with limited labeled data.
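    One of the fine-tuning strategies referenced above, discriminative (layer-wise) learning rates, fits in a few lines: lower layers, which encode general features, get geometrically smaller learning rates than upper, task-specific layers. A minimal sketch of the schedule, assuming a per-layer decay factor (the constants are illustrative defaults, not prescriptions from the article):

```python
def layerwise_lrs(n_layers, base_lr=2e-5, decay=0.95):
    """Discriminative fine-tuning: layer i (0 = embeddings, n_layers-1 =
    top layer) gets base_lr * decay**(n_layers - 1 - i), so lower,
    more general layers are updated more conservatively."""
    return [base_lr * decay ** (n_layers - 1 - i) for i in range(n_layers)]
```

    In a framework like PyTorch, these rates would typically be attached via per-parameter-group options on the optimizer, one group per transformer layer.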

  • Transfer Learning (1): Fundamentals and Core Concepts

    Why can a model trained on ImageNet quickly achieve usable performance on medical imaging? Why can BERT learn text classification from just hundreds of samples after pretraining? The essence of these phenomena is transfer learning — enabling models to transfer existing knowledge to new problems rather than starting from scratch every time.

    In the deep learning era, transfer learning has become standard practice rather than an option. This article systematically covers the mathematical formalization, core concepts, taxonomy, feasibility analysis, and negative transfer issues, along with a complete 200+ line implementation of feature transfer with MMD domain adaptation.

  • Reinforcement Learning (12): RLHF and Large Language Model Applications

    The breakthrough progress of Large Language Models (LLMs) — from GPT-3 to ChatGPT, from Claude to Gemini — stems not only from model scaling and pretraining data growth, but crucially from the introduction of Reinforcement Learning from Human Feedback (RLHF). While pretrained language models can generate fluent text, they often produce harmful content, misinformation, or responses misaligned with user intent. RLHF collects human preference data on model outputs, trains reward models to capture human values, then uses reinforcement learning (PPO) to fine-tune models toward producing more helpful, honest, and harmless responses. InstructGPT systematized the RLHF pipeline, ChatGPT brought it to mainstream awareness, while DPO (Direct Preference Optimization) and RLAIF (RL from AI Feedback) simplified training complexity and data collection costs. Beyond language, reinforcement learning plays a core role in embodied intelligence (robotics, autonomous driving) — from sim-to-real policy transfer to offline-to-online fine-tuning, RL is shaping the next generation of general agents. This chapter systematically examines RLHF's technical details, DPO's theoretical innovations, RLAIF's practical approaches, and RL applications in multimodal and embodied intelligence, with complete code to help you implement a simplified RLHF pipeline.
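    DPO's simplification is easiest to see in its loss: no reward model, no PPO rollouts, just a logistic loss on the margin between how much the policy (relative to a frozen reference model) prefers the chosen response over the rejected one. A minimal scalar sketch, assuming the inputs are total log-probabilities of each response under the policy and reference models:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * [(log pi_w - log ref_w) - (log pi_l - log ref_l)]),
    where pi_* / ref_* are log-probs of the chosen (w) and rejected (l)
    responses under the policy and the frozen reference model."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

    When the policy has not yet moved from the reference, the margin is zero and the loss is log 2; as the policy learns to prefer chosen over rejected responses more than the reference does, the loss falls toward zero.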

  • Graph Contextualized Self-Attention Network (GC-SAN) for Session-based Recommendation

    Session-based recommendation predicts the next clicked item from a short session sequence when long-term user history is missing or unreliable (e.g., anonymous traffic, cold-start users, multi-device sessions). GC-SAN is a hybrid approach: it uses a session graph + GNN to capture local transition patterns and uses self-attention to capture global, long-range dependencies within the same session. The key insight is that “sequence” and “graph” are complementary views of session intent: the sequence expresses order, while the graph exposes repeated transitions and multi-hop relations.
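    The "global dependencies" half of the hybrid is standard scaled dot-product self-attention applied over the GNN's item representations. A minimal single-head NumPy sketch (GC-SAN's actual module is multi-head with separate query/key/value projections and residual connections; this strips it to the attention core):

```python
import numpy as np

def self_attention(H):
    """Single-head scaled dot-product self-attention over item
    representations H (n_items x d): every position attends to every
    other, capturing long-range pairwise dependencies in the session."""
    d = H.shape[1]
    scores = H @ H.T / np.sqrt(d)                 # pairwise similarities
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)             # row-stochastic weights
    return w @ H                                  # attention-mixed outputs
```

    Each output row is a convex combination of all item representations, which is exactly what lets distant clicks in the session influence each other in one step, complementing the GNN's local message passing.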

  • Solving Constrained Mean-Variance Portfolio Optimization Problems Using Spiral Optimization Algorithm

    The classic mean–variance portfolio model is elegant, but real trading constraints (buy-in thresholds, cardinality limits, min/max position sizes) quickly turn it into a hard mixed-integer nonlinear problem. This paper tackles that constrained setting with a modified Spiral Optimization Algorithm (SOA) — a metaheuristic designed to search complex feasible regions where convex solvers or gradient methods are not directly applicable. This note focuses on the formulation (what constraints are added and how), how SOA explores the search space, and what the reported results say about solution quality under practical constraints.
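    To make SOA's search behavior concrete, here is the standard 2-D spiral update the metaheuristic is built on: every candidate point is rotated by an angle θ and contracted by a factor r < 1 around the current best point, so the population spirals inward while still sweeping the surrounding region. This is a sketch of the generic SOA step only; the paper's modified variant and its constraint handling are not reproduced here:

```python
import numpy as np

def spiral_step(points, best, r=0.95, theta=np.pi / 4):
    """One generic SOA update in 2-D: rotate each point by theta and
    contract by r about the current best point, spiraling the
    population toward it while exploring along the way."""
    c, s = np.cos(theta), np.sin(theta)
    R = r * np.array([[c, -s],
                      [s,  c]])        # composed rotation + contraction
    return (points - best) @ R.T + best
```

    Iterating this step shrinks every point's distance to the incumbent best by a factor r per step (while the rotation provides exploration), and the best point itself is a fixed point of the update.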

  • Session-based Recommendation with Graph Neural Networks (SR-GNN)

    Session-based recommendation is challenging when you only observe a short click sequence and have little or no long-term user profile. SR-GNN tackles this by turning each session into a directed graph, where repeated items and multi-step transitions form richer structure than a plain sequence. A gated GNN propagates information over this session graph to learn item representations, and the model then aggregates them into a session representation to score next-item candidates. This note explains the session-graph construction, the gated message passing update, and how SR-GNN produces the final ranking — highlighting why this graph view often outperforms purely sequential baselines on standard SBR benchmarks.
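    The session-graph construction this note describes is mechanical enough to sketch: unique items become nodes, consecutive clicks become directed edges, and edge counts are row-normalized to form the outgoing half of the connection matrix. A minimal NumPy sketch under those assumptions (SR-GNN also builds the symmetric incoming half, omitted here):

```python
import numpy as np

def session_graph(session):
    """Build SR-GNN's directed session graph: nodes are the session's
    unique items (order of first click preserved), edges count
    consecutive-click transitions; returns the node list and the
    row-normalized outgoing adjacency matrix."""
    nodes = list(dict.fromkeys(session))      # unique items, order kept
    idx = {v: i for i, v in enumerate(nodes)}
    A = np.zeros((len(nodes), len(nodes)))
    for a, b in zip(session, session[1:]):
        A[idx[a], idx[b]] += 1.0              # repeated transitions add up
    row = A.sum(axis=1, keepdims=True)
    A_out = np.divide(A, row, out=np.zeros_like(A), where=row > 0)
    return nodes, A_out
```

    On the session [1, 2, 3, 2, 4], item 2 gains two outgoing edges (to 3 and to 4) with weight 0.5 each — structure a plain sequence model never sees — and this matrix is what the gated GNN's message passing propagates over.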