Recommendation Systems (3): Deep Learning Foundation Models
Chen Kai

permalink: "en/recommendation-systems-3-deep-learning-basics/"
date: 2024-05-12 10:00:00
tags:
  - Recommendation Systems
  - Deep Learning
  - Neural Networks
categories: Recommendation Systems
mathjax: true
---

In 2016, Google introduced the Wide & Deep model in Google Play's recommendation system, marking the formal entry of deep learning into the mainstream of recommendation systems. Prior to this, recommendation systems primarily relied on traditional methods such as matrix factorization and collaborative filtering. While these methods achieved success in competitions like the Netflix Prize, they had significant limitations: difficulty handling high-dimensional sparse features, inability to capture nonlinear relationships, and heavy reliance on manual feature engineering.

Deep learning has brought revolutionary changes to recommendation systems. Through multi-layer neural networks, we can automatically learn representations (Embeddings) of users and items, capture complex interaction patterns, handle multimodal features, and train end-to-end on large-scale data. From NCF (Neural Collaborative Filtering) to AutoEncoder-based recommendations, from Wide & Deep to DeepFM, deep learning models have demonstrated powerful capabilities across all stages of recommendation systems, including CTR prediction, recall, and ranking.

This article provides an in-depth exploration of the core concepts, mainstream models, and implementation details of deep learning recommendation systems. We'll start by understanding the essence of Embeddings and why they matter; then dive into classic models like NCF, AutoEncoders (CDAE/VAE), and Wide & Deep; discuss feature engineering and training techniques; and close with a set of complete code implementations and a Q&A section covering more than ten common questions. Whether you're new to recommendation systems or want to systematically understand deep learning recommendation models, this article will help you build a complete knowledge framework.

Deep Learning vs Traditional Methods

Limitations of Traditional Recommendation Methods

Before the rise of deep learning, recommendation systems primarily relied on the following methods:

Matrix Factorization:
- Decomposes the user-item rating matrix into low-dimensional vectors
- Uses vector inner products to predict ratings: \(\hat{r}_{ui} = \mathbf{p}_u^T \mathbf{q}_i\)
- Advantages: simple, interpretable, computationally efficient
- Disadvantages: can only capture linear relationships, difficult to handle high-dimensional sparse features

Collaborative Filtering:
- Makes recommendations based on user or item similarity
- Advantages: no need for content features, can discover unexpected associations
- Disadvantages: severe data sparsity problems, difficult cold start

Factorization Machines (FM):
- Introduce feature interaction terms: \(\hat{y} = w_0 + \sum_{i=1}^n w_i x_i + \sum_{i=1}^n \sum_{j=i+1}^n \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j\)
- Advantages: can handle high-dimensional sparse features, captures second-order interactions
- Disadvantages: capture only second-order interactions; higher-order interactions require manual design

The core problem with these traditional methods is that they are linear, or capture only low-order interactions, while user behavior often contains complex nonlinear patterns. For example, a user might specifically like movies that combine "sci-fi + action + blockbuster," a combination feature that is difficult to express with a simple linear model.

Advantages of Deep Learning

Deep learning, through multi-layer neural networks, brings the following advantages to recommendation systems:

Automatic Feature Learning:
- Traditional methods require manual feature design (e.g., "user age × item category")
- Deep learning automatically learns feature representations through multi-layer nonlinear transformations
- Embedding layers map high-dimensional sparse one-hot encodings to low-dimensional dense vectors

Nonlinear Modeling Capability:
- Multi-layer neural networks can capture arbitrarily complex nonlinear relationships
- Activation functions like ReLU and Sigmoid introduce nonlinearity
- Deep networks can learn high-order feature interactions

Multimodal Feature Fusion:
- Can simultaneously process user profiles, item attributes, behavior sequences, text, images, and other features
- Uses different network structures (CNN, RNN, Transformer) to handle different modalities
- End-to-end training within a unified framework

End-to-End Training:
- The entire pipeline from raw features to final predictions can be jointly optimized
- Gradient backpropagation automatically adjusts all parameters
- Avoids the separation of feature engineering and model training found in traditional pipelines

Performance Comparison

In practical applications, deep learning models typically achieve 5-30% performance improvements over traditional methods:

| Method | AUC | Improvement |
| --- | --- | --- |
| Matrix Factorization | 0.750 | baseline |
| FM | 0.780 | +4.0% |
| Wide & Deep | 0.810 | +8.0% |
| DeepFM | 0.825 | +10.0% |
| DIN | 0.845 | +12.7% |

These improvements primarily come from:
1. Better Feature Representations: learned Embedding vectors carry more information than one-hot encodings
2. More Complex Interaction Patterns: deep networks capture feature combinations that traditional methods cannot express
3. Sequential Modeling Capability: RNNs/Transformers can model temporal dependencies in user behavior sequences

Challenges of Deep Learning

Despite its many advantages, deep learning also brings some challenges:

Computational Complexity:
- Deep networks require substantial computational resources
- Training time may be 10-100 times longer than traditional methods
- GPU acceleration is typically required for production use

Interpretability:
- Black-box models make it difficult to explain why certain items are recommended
- Traditional methods (like matrix factorization) produce vectors that can be intuitively understood
- Additional interpretability tools (such as SHAP, LIME) are often needed

Data Requirements:
- Deep learning requires large amounts of training data
- Cold start problems still exist (new users/new items)
- Carefully designed data augmentation and transfer learning strategies are needed

Hyperparameter Tuning:
- Network structure, learning rate, regularization, and other hyperparameters require extensive experimentation
- The hyperparameter search space is larger than for traditional methods
- Automated tools (such as AutoML) can help

Embedding Deep Dive

What is Embedding

Embedding is one of the core concepts in deep learning. Simply put, Embedding is a technique that maps high-dimensional sparse discrete features to low-dimensional dense continuous vector spaces.

In recommendation systems, the most common discrete features are user IDs and item IDs. Suppose we have 10 million users and 1 million items. Using one-hot encoding:
- User ID: a 10-million-dimensional vector with a single 1 and all other entries 0
- Item ID: a 1-million-dimensional vector with a single 1 and all other entries 0

This representation has serious problems:
1. Curse of Dimensionality: vector dimension equals the number of categories, with enormous storage and computation costs
2. Information Sparsity: 99.9999% of elements are 0, extremely low information density
3. Cannot Express Similarity: the distance between any two one-hot vectors is the same (e.g., Euclidean distance is \(\sqrt{2}\))

Embedding solves these problems:
- Maps 10-million-dimensional user IDs to 128-dimensional dense vectors
- Maps 1-million-dimensional item IDs to 128-dimensional dense vectors
- Similar users/items end up closer in the vector space

Mathematical Principles of Embedding

Embedding is essentially a lookup table. Let the user set be \(U = \{u_1, u_2, \dots, u_m\}\) and the item set be \(I = \{i_1, i_2, \dots, i_n\}\).

One-hot Encoding:
- One-hot vector for user \(u_i\): \(\mathbf{e}_i \in \{0,1\}^m\), where \(e_{ij} = 1\) if and only if \(j = i\)
- One-hot vector for item \(i_j\): \(\mathbf{f}_j \in \{0,1\}^n\), where \(f_{jk} = 1\) if and only if \(k = j\)

Embedding Layer:
- User Embedding matrix: \(\mathbf{P} \in \mathbb{R}^{m \times d}\), where \(d\) is the Embedding dimension
- Item Embedding matrix: \(\mathbf{Q} \in \mathbb{R}^{n \times d}\)
- Embedding vector for user \(u_i\): \(\mathbf{p}_i = \mathbf{P}^T \mathbf{e}_i\) (i.e., the \(i\)-th row of \(\mathbf{P}\))
- Embedding vector for item \(i_j\): \(\mathbf{q}_j = \mathbf{Q}^T \mathbf{f}_j\) (i.e., the \(j\)-th row of \(\mathbf{Q}\))

In implementation, the Embedding layer is typically a learnable parameter matrix:

```python
import torch
import torch.nn as nn

num_users, num_items, embedding_dim = 1000, 500, 64  # small demo sizes

# Learnable lookup tables for users and items
user_embedding = nn.Embedding(num_users, embedding_dim)  # weight shape: [m, d]
item_embedding = nn.Embedding(num_items, embedding_dim)  # weight shape: [n, d]

# Forward pass: an embedding lookup is just a row selection
user_id = torch.LongTensor([123])   # user ID
user_vec = user_embedding(user_id)  # shape: [1, d]
```

Learning Process of Embedding

Embedding vectors are not predefined but learned from training data. The learning objective is: to make similar users/items closer in vector space, and dissimilar ones farther apart.

Collaborative Filtering Perspective:
- If user \(u\) likes item \(i\), then \(\mathbf{p}_u\) and \(\mathbf{q}_i\) should be similar (large inner product)
- If user \(u\) dislikes item \(i\), then \(\mathbf{p}_u\) and \(\mathbf{q}_i\) should be dissimilar (small inner product)
- Loss function: \(\mathcal{L} = \sum_{(u,i) \in \mathcal{D}} (r_{ui} - \mathbf{p}_u^T \mathbf{q}_i)^2\)

Neural Network Perspective:
- The Embedding layer is the first layer of the neural network
- Backpropagation updates the Embedding matrix parameters along with all other weights
- The learned vectors encode latent features of users/items
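To make this concrete, here is a minimal training sketch (with made-up toy ratings) that optimizes the squared-error loss above directly over the two Embedding matrices; all names and data are illustrative:

```python
import torch
import torch.nn as nn

num_users, num_items, d = 100, 200, 16
user_emb = nn.Embedding(num_users, d)
item_emb = nn.Embedding(num_items, d)
optimizer = torch.optim.Adam(
    list(user_emb.parameters()) + list(item_emb.parameters()), lr=0.01)

# Toy observed ratings (user, item, rating) -- placeholder data
users = torch.LongTensor([0, 1, 2])
items = torch.LongTensor([5, 7, 9])
ratings = torch.FloatTensor([4.0, 2.0, 5.0])

for step in range(100):
    optimizer.zero_grad()
    pred = (user_emb(users) * item_emb(items)).sum(dim=1)  # inner product p_u^T q_i
    loss = ((ratings - pred) ** 2).sum()                   # squared-error loss from above
    loss.backward()   # gradients flow into both Embedding matrices
    optimizer.step()
```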

Embedding Dimension Selection

The Embedding dimension \(d\) is an important hyperparameter. Common choices range from 8 to 512, depending on:
- Data Scale: larger user/item counts typically require larger dimensions
- Task Complexity: CTR prediction may need 32-64 dimensions, recall may need 128-256 dimensions
- Computational Resources: larger dimensions increase storage and computation costs

Rule of thumb:
- Small scale (<100K): \(d = 8\)-\(16\)
- Medium scale (100K-1M): \(d = 32\)-\(64\)
- Large scale (>1M): \(d = 64\)-\(128\)
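If a formulaic starting point is preferred, one widely cited heuristic scales the dimension with the fourth root of the number of categories; the multiplier and cap below are assumptions chosen to roughly match the ranges above, not established constants:

```python
def suggest_embedding_dim(num_categories: int, multiplier: float = 2.0, cap: int = 512) -> int:
    """Fourth-root heuristic: d ~ multiplier * num_categories ** 0.25.

    The fourth-root shape is a commonly cited starting point; the multiplier
    and cap are assumptions tuned to roughly reproduce the ranges above.
    """
    return min(cap, max(8, round(multiplier * num_categories ** 0.25)))

print(suggest_embedding_dim(50_000))      # ~30  (small/medium scale)
print(suggest_embedding_dim(1_000_000))   # ~63  (medium scale)
print(suggest_embedding_dim(10_000_000))  # ~112 (large scale)
```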

Embedding Visualization

Through dimensionality reduction techniques (such as t-SNE, PCA), high-dimensional Embeddings can be visualized in 2D space to observe learned structures:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Assume we've trained item_embeddings, shape: [n_items, d]
# Select the first 1000 items for visualization
embeddings_subset = item_embeddings[:1000]

# t-SNE dimensionality reduction to 2D
tsne = TSNE(n_components=2, random_state=42)
embeddings_2d = tsne.fit_transform(embeddings_subset)

# Visualization
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1])
plt.title('Item Embeddings Visualization')
plt.show()
```

Typically, we find that:
- Items of the same category cluster together
- Items with similar functions are closer
- Popular and unpopular items may be distributed in different regions

Pre-training and Fine-tuning Embeddings

In practical applications, Embeddings can be:
1. Randomly Initialized: trained from scratch (most common)
2. Pre-trained: pre-trained on other tasks (e.g., item classification), then fine-tuned
3. Transferred: transferred from other domains (e.g., Word2Vec from NLP)

Advantages of pre-trained Embeddings (see the loading sketch below):
- Accelerate convergence: no need to start from a random state
- Improve performance: leverage external knowledge
- Alleviate cold start: new items can use pre-trained Embeddings
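If pre-trained vectors are available (say, from Word2Vec or an item-content model), PyTorch's nn.Embedding.from_pretrained can initialize the layer from them; a minimal sketch, where the pretrained array is a random stand-in for real vectors:

```python
import numpy as np
import torch
import torch.nn as nn

num_items, d = 500, 64
pretrained = np.random.randn(num_items, d).astype(np.float32)  # stand-in for real pre-trained vectors

# freeze=False keeps the vectors trainable, i.e., fine-tuning rather than freezing
item_embedding = nn.Embedding.from_pretrained(torch.from_numpy(pretrained), freeze=False)
```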

Code Example: Embedding Layer Implementation

```python
import torch
import torch.nn as nn

class EmbeddingLayer(nn.Module):
    """Basic Embedding layer implementation"""

    def __init__(self, num_embeddings, embedding_dim, padding_idx=None):
        """
        Args:
            num_embeddings: Vocabulary size (number of users or items)
            embedding_dim: Embedding dimension
            padding_idx: Padding index (for sequence padding)
        """
        super(EmbeddingLayer, self).__init__()
        self.num_embeddings = num_embeddings
        self.embedding_dim = embedding_dim

        # Create Embedding matrix, randomly initialized
        self.embedding = nn.Embedding(
            num_embeddings=num_embeddings,
            embedding_dim=embedding_dim,
            padding_idx=padding_idx
        )

        # Xavier initialization (optional)
        nn.init.xavier_uniform_(self.embedding.weight)

    def forward(self, indices):
        """
        Args:
            indices: Input indices, shape: [batch_size] or [batch_size, seq_len]
        Returns:
            embeddings: Embedding vectors, shape: [batch_size, embedding_dim]
                        or [batch_size, seq_len, embedding_dim]
        """
        return self.embedding(indices)

# Usage example
num_users = 10000
embedding_dim = 64

# Create user Embedding layer
user_embedding = EmbeddingLayer(num_users, embedding_dim)

# Forward pass
user_ids = torch.LongTensor([0, 1, 2, 3, 4])  # batch_size=5
user_vectors = user_embedding(user_ids)       # Shape: [5, 64]

print(f"Input user IDs: {user_ids}")
print(f"Output embeddings shape: {user_vectors.shape}")
print(f"Sample embedding (user 0): {user_vectors[0][:5]}")  # Print first 5 dimensions
```

Multi-field Embedding

In practical recommendation systems, besides user IDs and item IDs, there are many other discrete features (such as user age, item category, city, etc.). Each feature needs an Embedding layer:

```python
class MultiFieldEmbedding(nn.Module):
    """Multi-field Embedding layer"""

    def __init__(self, field_dims, embedding_dim):
        """
        Args:
            field_dims: Number of categories per field, e.g., [10000, 1000, 50] for 3 fields
            embedding_dim: Embedding dimension
        """
        super(MultiFieldEmbedding, self).__init__()
        self.field_dims = field_dims
        self.embedding_dim = embedding_dim
        self.num_fields = len(field_dims)

        # Create an Embedding layer for each field
        self.embeddings = nn.ModuleList([
            nn.Embedding(field_dim, embedding_dim)
            for field_dim in field_dims
        ])

    def forward(self, x):
        """
        Args:
            x: Input features, shape: [batch_size, num_fields]
        Returns:
            embeddings: Embeddings for all fields, shape: [batch_size, num_fields, embedding_dim]
        """
        # Look up each field's Embedding separately
        embeddings = []
        for i in range(self.num_fields):
            embeddings.append(self.embeddings[i](x[:, i]))

        # Stack into [batch_size, num_fields, embedding_dim]
        return torch.stack(embeddings, dim=1)

# Usage example
field_dims = [10000, 1000, 50, 20]  # User ID, Item ID, Category, City
embedding_dim = 32

multi_embedding = MultiFieldEmbedding(field_dims, embedding_dim)

# Input: [batch_size=4, num_fields=4]
x = torch.LongTensor([
    [123, 456, 5, 10],  # User 123, Item 456, Category 5, City 10
    [124, 457, 5, 11],
    [125, 458, 6, 10],
    [126, 459, 6, 12]
])

embeddings = multi_embedding(x)  # Shape: [4, 4, 32]
print(f"Input shape: {x.shape}")
print(f"Output embeddings shape: {embeddings.shape}")
```

NCF: Neural Collaborative Filtering

Background of NCF

Traditional matrix factorization methods use vector inner products to predict ratings:
\[\hat{r}_{ui} = \mathbf{p}_u^T \mathbf{q}_i\]
This approach has a fundamental problem: the inner product is linear and cannot capture complex nonlinear relationships between users and items. For example, a user might like the combination of "sci-fi + action," but such a combination feature cannot be expressed with a simple inner product.

NCF (Neural Collaborative Filtering), proposed in 2017, replaces inner products with multi-layer neural networks, enabling learning of nonlinear interactions between users and items.

NCF Model Architecture

The NCF model contains three components:

1. GMF (Generalized Matrix Factorization):
- User Embedding: \(\mathbf{p}_u \in \mathbb{R}^d\)
- Item Embedding: \(\mathbf{q}_i \in \mathbb{R}^d\)
- Element-wise (Hadamard) product: \(\mathbf{p}_u \odot \mathbf{q}_i\)
- Output: \(\hat{y}_{ui}^{GMF} = \mathbf{h}^T (\mathbf{p}_u \odot \mathbf{q}_i)\), where \(\mathbf{h}\) is a learnable weight vector

2. MLP (Multi-Layer Perceptron):
- Concatenate user and item Embeddings: \([\mathbf{p}_u; \mathbf{q}_i]\)
- Pass through a multi-layer fully connected network: \(\mathbf{z}_1 = \text{ReLU}(\mathbf{W}_1 [\mathbf{p}_u; \mathbf{q}_i] + \mathbf{b}_1)\), \(\mathbf{z}_2 = \text{ReLU}(\mathbf{W}_2 \mathbf{z}_1 + \mathbf{b}_2)\), ...
- Output: \(\hat{y}_{ui}^{MLP} = \mathbf{h}_{MLP}^T \mathbf{z}_L\)

3. NeuMF (Neural Matrix Factorization):
- Fuses GMF and MLP: \(\hat{y}_{ui} = \sigma(\hat{y}_{ui}^{GMF} + \hat{y}_{ui}^{MLP})\)
- Where \(\sigma\) is the Sigmoid activation function (for binary classification tasks)

Mathematical Formulation of NCF

The complete NCF model can be expressed as:
\[\hat{y}_{ui} = \sigma(\mathbf{h}^T (\mathbf{p}_u \odot \mathbf{q}_i) + \mathbf{h}_{MLP}^T \mathbf{z}_L)\]
Where:
- \(\mathbf{p}_u, \mathbf{q}_i\): user and item Embedding vectors
- \(\odot\): element-wise (Hadamard) product
- \(\mathbf{z}_L\): output of the last MLP layer
- \(\mathbf{h}, \mathbf{h}_{MLP}\): weight vectors of the output layer
- \(\sigma\): Sigmoid function

Loss Function of NCF

For implicit feedback (click/no-click), NCF uses binary cross-entropy loss:
\[\mathcal{L} = -\sum_{(u,i) \in \mathcal{D}} \left[ y_{ui} \log \hat{y}_{ui} + (1-y_{ui}) \log(1-\hat{y}_{ui}) \right]\]
Where \(y_{ui} \in \{0,1\}\) indicates whether user \(u\) interacted with item \(i\).

For explicit feedback (ratings), mean squared error can be used:
\[\mathcal{L} = \sum_{(u,i) \in \mathcal{D}} (r_{ui} - \hat{r}_{ui})^2\]

Complete NCF Implementation

```python
import torch
import torch.nn as nn

class GMF(nn.Module):
    """Generalized Matrix Factorization"""

    def __init__(self, num_users, num_items, embedding_dim):
        super(GMF, self).__init__()
        self.user_embedding = nn.Embedding(num_users, embedding_dim)
        self.item_embedding = nn.Embedding(num_items, embedding_dim)
        self.output_layer = nn.Linear(embedding_dim, 1)

        # Initialization
        nn.init.normal_(self.user_embedding.weight, std=0.01)
        nn.init.normal_(self.item_embedding.weight, std=0.01)

    def forward(self, user_ids, item_ids):
        user_emb = self.user_embedding(user_ids)
        item_emb = self.item_embedding(item_ids)

        # Element-wise product
        element_product = user_emb * item_emb

        # Output
        output = self.output_layer(element_product)
        return output.squeeze()

class MLP(nn.Module):
    """Multi-Layer Perceptron"""

    def __init__(self, num_users, num_items, embedding_dim, layers, dropout=0.0):
        super(MLP, self).__init__()
        self.user_embedding = nn.Embedding(num_users, embedding_dim)
        self.item_embedding = nn.Embedding(num_items, embedding_dim)

        # MLP layers
        mlp_layers = []
        input_dim = embedding_dim * 2  # Concatenate user and item Embeddings
        for output_dim in layers:
            mlp_layers.append(nn.Linear(input_dim, output_dim))
            mlp_layers.append(nn.ReLU())
            if dropout > 0:
                mlp_layers.append(nn.Dropout(dropout))
            input_dim = output_dim
        self.mlp = nn.Sequential(*mlp_layers)

        # Output layer
        self.output_layer = nn.Linear(layers[-1], 1)

        # Initialization
        nn.init.normal_(self.user_embedding.weight, std=0.01)
        nn.init.normal_(self.item_embedding.weight, std=0.01)

    def forward(self, user_ids, item_ids):
        user_emb = self.user_embedding(user_ids)
        item_emb = self.item_embedding(item_ids)

        # Concatenate
        concat = torch.cat([user_emb, item_emb], dim=1)

        # MLP
        mlp_output = self.mlp(concat)

        # Output
        output = self.output_layer(mlp_output)
        return output.squeeze()

class NeuMF(nn.Module):
    """Neural Matrix Factorization"""

    def __init__(self, num_users, num_items, embedding_dim, mlp_layers, dropout=0.0):
        super(NeuMF, self).__init__()
        self.embedding_dim = embedding_dim

        # GMF part
        self.gmf_user_embedding = nn.Embedding(num_users, embedding_dim)
        self.gmf_item_embedding = nn.Embedding(num_items, embedding_dim)

        # MLP part
        self.mlp_user_embedding = nn.Embedding(num_users, embedding_dim)
        self.mlp_item_embedding = nn.Embedding(num_items, embedding_dim)

        # MLP network
        mlp_modules = []
        input_dim = embedding_dim * 2
        for output_dim in mlp_layers:
            mlp_modules.append(nn.Linear(input_dim, output_dim))
            mlp_modules.append(nn.ReLU())
            if dropout > 0:
                mlp_modules.append(nn.Dropout(dropout))
            input_dim = output_dim
        self.mlp = nn.Sequential(*mlp_modules)

        # Output layer (fuses the GMF vector with the last MLP layer)
        self.output_layer = nn.Linear(embedding_dim + mlp_layers[-1], 1)

        # Initialization
        self._init_weights()

    def _init_weights(self):
        nn.init.normal_(self.gmf_user_embedding.weight, std=0.01)
        nn.init.normal_(self.gmf_item_embedding.weight, std=0.01)
        nn.init.normal_(self.mlp_user_embedding.weight, std=0.01)
        nn.init.normal_(self.mlp_item_embedding.weight, std=0.01)

    def forward(self, user_ids, item_ids):
        # GMF part
        gmf_user_emb = self.gmf_user_embedding(user_ids)
        gmf_item_emb = self.gmf_item_embedding(item_ids)
        gmf_output = gmf_user_emb * gmf_item_emb

        # MLP part
        mlp_user_emb = self.mlp_user_embedding(user_ids)
        mlp_item_emb = self.mlp_item_embedding(item_ids)
        mlp_concat = torch.cat([mlp_user_emb, mlp_item_emb], dim=1)
        mlp_output = self.mlp(mlp_concat)

        # Fusion
        concat = torch.cat([gmf_output, mlp_output], dim=1)
        output = self.output_layer(concat)

        return torch.sigmoid(output.squeeze())

# Usage example
num_users = 10000
num_items = 5000
embedding_dim = 64
mlp_layers = [128, 64, 32]

model = NeuMF(num_users, num_items, embedding_dim, mlp_layers, dropout=0.2)

# Training data
user_ids = torch.LongTensor([0, 1, 2, 3, 4])
item_ids = torch.LongTensor([10, 20, 30, 40, 50])
labels = torch.FloatTensor([1, 1, 0, 1, 0])  # Click/no-click

# Forward pass
predictions = model(user_ids, item_ids)
print(f"Predictions: {predictions}")

# Loss calculation
criterion = nn.BCELoss()
loss = criterion(predictions, labels)
print(f"Loss: {loss.item()}")
```

NCF Training Tips

1. Negative Sampling:
- For implicit feedback, negative samples (non-clicks) far outnumber positive samples (clicks)
- Negative sampling is needed to balance the positive/negative ratio
- Common ratios: 1:1 to 1:4 (positive:negative)

2. Learning Rate Scheduling:
- Initial learning rate: 0.001-0.01
- Use learning rate decay (e.g., halve every 10 epochs)
- Or use adaptive optimizers (Adam, AdamW)

3. Regularization:
- L2 regularization to prevent overfitting
- Dropout in the MLP layers, with a rate of 0.2-0.5
- Early stopping based on validation performance

4. Pre-training:
- First pre-train GMF and MLP separately
- Then jointly train them inside NeuMF (a weight-loading sketch follows)
- This can accelerate convergence and improve performance
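A minimal sketch of that hand-off, assuming the GMF, MLP, and NeuMF classes defined above were built with matching dimensions; the alpha-weighted fusion of the two output layers mirrors the scheme described in the original NCF paper:

```python
import torch

def load_pretrained_into_neumf(neumf, gmf, mlp, alpha=0.5):
    """Copy pre-trained GMF/MLP weights into a NeuMF model (sketch).

    Assumes `gmf`, `mlp`, `neumf` are instances of the classes above with
    the same embedding_dim and mlp_layers; alpha weights the two heads.
    """
    # Embedding tables
    neumf.gmf_user_embedding.weight.data.copy_(gmf.user_embedding.weight.data)
    neumf.gmf_item_embedding.weight.data.copy_(gmf.item_embedding.weight.data)
    neumf.mlp_user_embedding.weight.data.copy_(mlp.user_embedding.weight.data)
    neumf.mlp_item_embedding.weight.data.copy_(mlp.item_embedding.weight.data)

    # MLP tower (identical layer layout in both models)
    neumf.mlp.load_state_dict(mlp.mlp.state_dict())

    # Output layer: concatenate the two pre-trained heads, weighted by alpha
    fused_weight = torch.cat([alpha * gmf.output_layer.weight.data,
                              (1 - alpha) * mlp.output_layer.weight.data], dim=1)
    neumf.output_layer.weight.data.copy_(fused_weight)
    neumf.output_layer.bias.data.copy_(
        alpha * gmf.output_layer.bias.data + (1 - alpha) * mlp.output_layer.bias.data)
```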

AutoEncoder Recommendations: CDAE and VAE

Basic Idea of AutoEncoder

AutoEncoder is an unsupervised learning model that attempts to learn low-dimensional representations (encoding) of data, then reconstructs the original data (decoding) from the low-dimensional representation.

In recommendation systems, AutoEncoders can be used for:
1. Dimensionality Reduction: compress the high-dimensional user-item interaction matrix into a low-dimensional space
2. Denoising: recover complete user preferences from sparse, noisy interaction data
3. Generation: generate items users might be interested in

CDAE: Collaborative Denoising Auto-Encoder

CDAE (Collaborative Denoising Auto-Encoder), proposed in 2016, takes a user's interaction history as input and reconstructs the user's complete preferences through a denoising autoencoder.

Model Architecture:
- Input Layer: user interaction vector \(\mathbf{x}_u \in \{0,1\}^n\) (\(n\) is the number of items)
- Encoder Layer: \(\mathbf{h}_u = \sigma(\mathbf{W} \mathbf{x}_u + \mathbf{V} \mathbf{p}_u + \mathbf{b})\), where \(\mathbf{p}_u\) is the user Embedding
- Decoder Layer: \(\hat{\mathbf{x}}_u = \sigma(\mathbf{W}' \mathbf{h}_u + \mathbf{b}')\)
- Loss Function: reconstruction error \(\mathcal{L} = \sum_{u} \|\mathbf{x}_u - \hat{\mathbf{x}}_u\|^2\)

Denoising Mechanism:
- Randomly zero out part of the input during training (input dropout)
- Forces the model to recover complete information from partial information
- Improves model robustness

Complete CDAE Implementation

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CDAE(nn.Module):
    """Collaborative Denoising Auto-Encoder"""

    def __init__(self, num_users, num_items, hidden_dim, corruption_ratio=0.5):
        """
        Args:
            num_users: Number of users
            num_items: Number of items
            hidden_dim: Hidden layer dimension
            corruption_ratio: Noise ratio (proportion of the input that is dropped)
        """
        super(CDAE, self).__init__()
        self.num_users = num_users
        self.num_items = num_items
        self.hidden_dim = hidden_dim
        self.corruption_ratio = corruption_ratio

        # User Embedding
        self.user_embedding = nn.Embedding(num_users, hidden_dim)

        # Encoder layer: item interactions -> hidden layer
        self.encoder = nn.Linear(num_items, hidden_dim)

        # Decoder layer: hidden layer -> item interactions
        self.decoder = nn.Linear(hidden_dim, num_items)

        # Initialization
        nn.init.xavier_uniform_(self.user_embedding.weight)
        nn.init.xavier_uniform_(self.encoder.weight)
        nn.init.xavier_uniform_(self.decoder.weight)

    def forward(self, user_ids, user_items, training=True):
        """
        Args:
            user_ids: User IDs, shape: [batch_size]
            user_items: User interaction vectors, shape: [batch_size, num_items]
            training: Whether in training mode (controls whether noise is added)
        Returns:
            reconstructed: Reconstructed interaction vectors, shape: [batch_size, num_items]
        """
        # Denoising: randomly drop part of the input during training
        if training and self.corruption_ratio > 0:
            # Create a mask that randomly zeroes out some positions
            mask = torch.rand_like(user_items) > self.corruption_ratio
            corrupted_input = user_items * mask.float()
        else:
            corrupted_input = user_items

        # User Embedding
        user_emb = self.user_embedding(user_ids)  # [batch_size, hidden_dim]

        # Encoding: interaction vector -> hidden layer
        encoded = self.encoder(corrupted_input)  # [batch_size, hidden_dim]

        # Fuse the user Embedding with the encoding result
        hidden = F.relu(encoded + user_emb)  # [batch_size, hidden_dim]

        # Decoding: hidden layer -> reconstructed interaction vector
        reconstructed = torch.sigmoid(self.decoder(hidden))  # [batch_size, num_items]

        return reconstructed

    def predict(self, user_ids, user_items):
        """Predict user scores for all items"""
        self.eval()
        with torch.no_grad():
            predictions = self.forward(user_ids, user_items, training=False)
        return predictions

# Usage example
num_users = 1000
num_items = 500
hidden_dim = 128

model = CDAE(num_users, num_items, hidden_dim, corruption_ratio=0.5)

# Training data: a small random binary interaction matrix (placeholder data)
user_ids = torch.LongTensor([0, 1, 2, 3, 4])
user_items = (torch.rand(5, num_items) < 0.1).float()  # Shape: [5, 500], ~10% interactions

# Forward pass
reconstructed = model(user_ids, user_items, training=True)
print(f"Reconstructed shape: {reconstructed.shape}")

# Loss calculation (only over interacted positions)
mask = (user_items > 0).float()
loss = F.mse_loss(reconstructed * mask, user_items * mask)
print(f"Loss: {loss.item()}")

# Prediction: recommend Top-K items
predictions = model.predict(user_ids, user_items)
top_k = 10
top_items = torch.topk(predictions[0], top_k).indices
print(f"Top-{top_k} recommended items for user 0: {top_items}")
```

VAE: Variational Auto-Encoder

VAE (Variational Auto-Encoder), proposed in 2013, is a generative model that gives the AutoEncoder a probabilistic formulation: it learns the latent distribution of the data and can generate new samples from it.

In recommendation systems, VAE can be used for:
1. Generative Recommendations: sample from a user's latent distribution to generate items of interest
2. Uncertainty Modeling: predict not only ratings but also their uncertainty
3. Diverse Recommendations: increase recommendation diversity through sampling

Mathematical Principles of VAE:
- Encoder: learns the posterior distribution \(q_\phi(\mathbf{z}|\mathbf{x})\), where \(\mathbf{z}\) is the latent variable
- Decoder: learns the generative distribution \(p_\theta(\mathbf{x}|\mathbf{z})\)
- Loss Function: the ELBO (Evidence Lower BOund)
\[\mathcal{L} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - \text{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))\]

VAE Recommendation Model (Mult-VAE)

Mult-VAE, proposed in 2018, is a VAE recommendation model that assumes user interaction vectors follow a multinomial distribution.

Model Architecture:
- Encoder: \(\mathbf{z}_u \sim \mathcal{N}(\boldsymbol{\mu}_u, \text{diag}(\boldsymbol{\sigma}_u^2))\), with \(\boldsymbol{\mu}_u = \mathbf{W}_\mu \mathbf{h}_u + \mathbf{b}_\mu\) and \(\log \boldsymbol{\sigma}_u^2 = \mathbf{W}_\sigma \mathbf{h}_u + \mathbf{b}_\sigma\), where \(\mathbf{h}_u\) is the encoding of the user interaction vector
- Sampling (reparameterization): \(\mathbf{z}_u = \boldsymbol{\mu}_u + \boldsymbol{\sigma}_u \odot \boldsymbol{\epsilon}\), where \(\boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I})\)
- Decoder: \(\hat{\mathbf{x}}_u = \text{softmax}(\mathbf{W}_d \mathbf{z}_u + \mathbf{b}_d)\)

Complete Mult-VAE Implementation

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultVAE(nn.Module):
    """Multinomial Variational Auto-Encoder for Recommendation"""

    def __init__(self, num_items, hidden_dims, latent_dim, dropout=0.5):
        """
        Args:
            num_items: Number of items
            hidden_dims: Hidden layer dimensions shared by encoder and decoder, e.g., [600, 200]
            latent_dim: Latent variable dimension
            dropout: Dropout ratio
        """
        super(MultVAE, self).__init__()
        self.num_items = num_items
        self.latent_dim = latent_dim

        # Encoder
        encoder_layers = []
        input_dim = num_items
        for hidden_dim in hidden_dims:
            encoder_layers.append(nn.Linear(input_dim, hidden_dim))
            encoder_layers.append(nn.Tanh())
            encoder_layers.append(nn.Dropout(dropout))
            input_dim = hidden_dim
        self.encoder = nn.Sequential(*encoder_layers)

        # Mean and log variance of the latent variables
        self.mu_layer = nn.Linear(hidden_dims[-1], latent_dim)
        self.logvar_layer = nn.Linear(hidden_dims[-1], latent_dim)

        # Decoder (mirror of the encoder)
        decoder_layers = []
        input_dim = latent_dim
        for hidden_dim in reversed(hidden_dims):
            decoder_layers.append(nn.Linear(input_dim, hidden_dim))
            decoder_layers.append(nn.Tanh())
            decoder_layers.append(nn.Dropout(dropout))
            input_dim = hidden_dim
        self.decoder = nn.Sequential(*decoder_layers)

        # Output layer
        self.output_layer = nn.Linear(hidden_dims[0], num_items)

        # Initialization
        self._init_weights()

    def _init_weights(self):
        for layer in self.modules():
            if isinstance(layer, nn.Linear):
                nn.init.xavier_uniform_(layer.weight)
                nn.init.zeros_(layer.bias)

    def encode(self, user_items):
        """Encode: user interaction vector -> latent variable distribution"""
        h = self.encoder(user_items)
        mu = self.mu_layer(h)
        logvar = self.logvar_layer(h)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        """Reparameterization trick: z = mu + sigma * eps"""
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        """Decode: latent variable -> reconstructed interaction logits"""
        h = self.decoder(z)
        logits = self.output_layer(h)
        return logits

    def forward(self, user_items, beta=1.0):
        """
        Args:
            user_items: User interaction vectors, shape: [batch_size, num_items]
            beta: Weight of the KL divergence (as in beta-VAE)
        Returns:
            logits: Reconstructed interaction logits
            mu: Mean of the latent variables
            logvar: Log variance of the latent variables
            kl_loss: KL divergence loss
        """
        # Encoding
        mu, logvar = self.encode(user_items)

        # Reparameterization
        z = self.reparameterize(mu, logvar)

        # Decoding
        logits = self.decode(z)

        # KL divergence loss
        kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
        kl_loss = beta * kl_loss.mean()

        return logits, mu, logvar, kl_loss

    def predict(self, user_items):
        """Predict scores for all items (uses the posterior mean at inference time)"""
        self.eval()
        with torch.no_grad():
            mu, _ = self.encode(user_items)
            logits = self.decode(mu)  # deterministic: skip sampling when serving
            # Mask already-interacted items so they are not recommended again
            predictions = logits.clone()
            predictions[user_items > 0] = float('-inf')
            return predictions

# Usage example
num_items = 500
hidden_dims = [600, 200]
latent_dim = 50

model = MultVAE(num_items, hidden_dims, latent_dim, dropout=0.5)

# Training data: a small random binary interaction matrix (placeholder data)
user_items = (torch.rand(3, num_items) < 0.1).float()  # Shape: [3, 500]

# Forward pass
logits, mu, logvar, kl_loss = model(user_items, beta=0.2)

# Reconstruction loss (log-likelihood of the multinomial distribution)
reconstruction_loss = -torch.sum(
    F.log_softmax(logits, dim=1) * user_items, dim=1
).mean()

# Total loss
total_loss = reconstruction_loss + kl_loss
print(f"Reconstruction loss: {reconstruction_loss.item()}")
print(f"KL loss: {kl_loss.item()}")
print(f"Total loss: {total_loss.item()}")

# Prediction
predictions = model.predict(user_items)
top_k = 10
top_items = torch.topk(predictions[0], top_k).indices
print(f"Top-{top_k} recommended items: {top_items}")
```

CDAE vs VAE Comparison

| Feature | CDAE | VAE |
| --- | --- | --- |
| Model Type | Deterministic autoencoder | Probabilistic generative model |
| Latent Variable | Fixed vector | Probability distribution |
| Generation Capability | Weak (can only reconstruct) | Strong (can sample and generate) |
| Uncertainty | Cannot model | Can model |
| Training Difficulty | Simple | More complex (requires KL divergence) |
| Recommendation Diversity | Lower | Higher (through sampling) |
| Applicable Scenarios | Dense interaction data | Sparse interaction data |

Wide & Deep Model

Background of Wide & Deep

In 2016, Google proposed the Wide & Deep model in Google Play's recommendation system. The core idea of this model is: combining memorization and generalization.

  • Memorization (Wide part): Learns direct associations between features, such as "users who installed Pandora also installed YouTube"
  • Generalization (Deep part): Learns Embedding representations of features, capturing latent associations between sparse features

Wide & Deep Model Architecture

The Wide & Deep model contains two components:

1. Wide Part (Linear Model):
- Input: raw features and cross features (e.g., "user age × item category")
- Output: \(\hat{y}_{wide} = \mathbf{w}^T \mathbf{x} + b\)
- Role: memorize feature combinations observed in historical data

2. Deep Part (Deep Neural Network):
- Input: Embedding vectors of the sparse features
- Structure: multi-layer fully connected network
- Output: \(\hat{y}_{deep} = \text{MLP}(\text{Embedding}(\mathbf{x}))\)
- Role: generalize to unseen feature combinations

3. Fusion:
- Final output: \(\hat{y} = \sigma(\hat{y}_{wide} + \hat{y}_{deep})\)
- Where \(\sigma\) is the Sigmoid function (for CTR prediction)

Mathematical Formulation of Wide & Deep

The complete Wide & Deep model can be expressed as:
\[\hat{y} = \sigma(\mathbf{w}_{wide}^T [\mathbf{x}, \phi(\mathbf{x})] + \mathbf{w}_{deep}^T \mathbf{a}^{(L)} + b)\]
Where:
- \(\mathbf{x}\): raw features
- \(\phi(\mathbf{x})\): cross features (e.g., \(\phi(\mathbf{x}) = [x_i \cdot x_j]\))
- \(\mathbf{a}^{(L)}\): output of the last layer of the Deep part
- \(\mathbf{w}_{wide}, \mathbf{w}_{deep}, b\): learnable parameters

Computation process of the Deep part:
- \(\mathbf{a}^{(0)} = \text{Embedding}(\mathbf{x})\)
- \(\mathbf{a}^{(l)} = \text{ReLU}(\mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)})\), for \(l = 1, 2, \dots, L\)
- Where \(L\) is the number of layers

Complete Wide & Deep Implementation

```python
import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    """Wide & Deep model"""

    def __init__(self,
                 field_dims,      # Number of categories for each field
                 embedding_dim,   # Embedding dimension
                 deep_layers,     # Hidden layer dimensions of the Deep part
                 dropout=0.0):
        super(WideAndDeep, self).__init__()
        self.field_dims = field_dims
        self.num_fields = len(field_dims)
        self.embedding_dim = embedding_dim

        # Wide part: linear layer (with bias)
        # Input dimension = raw features (+ cross features in a full system;
        # simplified here to raw features only)
        self.wide_linear = nn.Linear(sum(field_dims), 1)

        # Deep part: Embedding layers
        self.embeddings = nn.ModuleList([
            nn.Embedding(field_dim, embedding_dim)
            for field_dim in field_dims
        ])

        # Deep part: MLP
        deep_input_dim = self.num_fields * embedding_dim
        deep_layers_list = []
        for deep_dim in deep_layers:
            deep_layers_list.append(nn.Linear(deep_input_dim, deep_dim))
            deep_layers_list.append(nn.ReLU())
            if dropout > 0:
                deep_layers_list.append(nn.Dropout(dropout))
            deep_input_dim = deep_dim
        self.deep_mlp = nn.Sequential(*deep_layers_list)

        # Output layer of the Deep part
        self.deep_output = nn.Linear(deep_layers[-1], 1)

        # Initialization
        self._init_weights()

    def _init_weights(self):
        # Wide part: Xavier initialization
        nn.init.xavier_uniform_(self.wide_linear.weight)
        nn.init.zeros_(self.wide_linear.bias)

        # Deep part: Embeddings and MLP
        for embedding in self.embeddings:
            nn.init.xavier_uniform_(embedding.weight)

        for layer in self.deep_mlp:
            if isinstance(layer, nn.Linear):
                nn.init.xavier_uniform_(layer.weight)
                nn.init.zeros_(layer.bias)

        nn.init.xavier_uniform_(self.deep_output.weight)
        nn.init.zeros_(self.deep_output.bias)

    def forward(self, x_wide, x_deep):
        """
        Args:
            x_wide: Wide-part input (one-hot encoding), shape: [batch_size, sum(field_dims)]
            x_deep: Deep-part input (field indices), shape: [batch_size, num_fields]
        Returns:
            output: Predicted values, shape: [batch_size]
        """
        # Wide part
        wide_output = self.wide_linear(x_wide)  # [batch_size, 1]

        # Deep part: Embedding
        deep_embeddings = []
        for i in range(self.num_fields):
            deep_embeddings.append(self.embeddings[i](x_deep[:, i]))
        deep_concat = torch.cat(deep_embeddings, dim=1)  # [batch_size, num_fields * embedding_dim]

        # Deep part: MLP
        deep_output = self.deep_mlp(deep_concat)
        deep_output = self.deep_output(deep_output)  # [batch_size, 1]

        # Fusion
        output = wide_output + deep_output
        output = torch.sigmoid(output.squeeze())  # [batch_size]

        return output

# Usage example
field_dims = [10000, 1000, 50, 20]  # User ID, Item ID, Category, City
embedding_dim = 32
deep_layers = [128, 64, 32]

model = WideAndDeep(field_dims, embedding_dim, deep_layers, dropout=0.2)

# Deep part input: field indices (dense)
x_deep = torch.LongTensor([
    [123, 456, 5, 10],  # User 123, Item 456, Category 5, City 10
    [124, 457, 5, 11],
    [125, 458, 6, 10],
    [126, 459, 6, 12]
])  # Shape: [4, 4]

# Wide part input: one-hot encoding (sparse), built from the same indices
# by offsetting each field into the concatenated one-hot vector
batch_size = x_deep.size(0)
offsets = torch.LongTensor([0, 10000, 11000, 11050])  # cumulative sums of field_dims
x_wide = torch.zeros(batch_size, sum(field_dims))
x_wide.scatter_(1, x_deep + offsets, 1.0)

# Forward pass
predictions = model(x_wide, x_deep)
print(f"Predictions: {predictions}")

# Loss calculation
labels = torch.FloatTensor([1, 1, 0, 1])
criterion = nn.BCELoss()
loss = criterion(predictions, labels)
print(f"Loss: {loss.item()}")
```

Optimized Versions of Wide & Deep

In practical applications, Wide & Deep has several optimized versions:

1. DeepFM:
- Replaces the Wide part with an FM
- Automatically learns second-order feature interactions
- Avoids manual design of cross features

2. xDeepFM:
- Introduces a CIN (Compressed Interaction Network)
- Explicitly models high-order feature interactions
- Stronger interaction modeling capability than DeepFM

3. DCN (Deep & Cross Network):
- Replaces the Wide part with a Cross Network (sketched below)
- Automatically learns explicit feature interactions up to a bounded order
- High computational efficiency
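To illustrate the Cross Network idea behind DCN, here is a minimal sketch of stacked cross layers implementing \(\mathbf{x}_{l+1} = \mathbf{x}_0 (\mathbf{w}_l^T \mathbf{x}_l) + \mathbf{b}_l + \mathbf{x}_l\); the module layout and initialization here are assumptions, not the reference implementation:

```python
import torch
import torch.nn as nn

class CrossNetwork(nn.Module):
    """Stack of cross layers: x_{l+1} = x0 * (w_l . x_l) + b_l + x_l."""

    def __init__(self, input_dim, num_layers=3):
        super().__init__()
        self.w = nn.ParameterList(
            [nn.Parameter(torch.randn(input_dim) * 0.01) for _ in range(num_layers)])
        self.b = nn.ParameterList(
            [nn.Parameter(torch.zeros(input_dim)) for _ in range(num_layers)])

    def forward(self, x0):
        x = x0
        for w, b in zip(self.w, self.b):
            # (x . w) is a scalar per sample; each layer adds one interaction order
            xw = (x * w).sum(dim=1, keepdim=True)  # [batch, 1]
            x = x0 * xw + b + x                    # [batch, input_dim]
        return x

# Usage: 3 cross layers over a 128-dimensional concatenated Embedding vector
cross = CrossNetwork(input_dim=128, num_layers=3)
out = cross(torch.randn(4, 128))  # shape: [4, 128]
```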

Feature Engineering

Feature Types

Features in recommendation systems can be divided into the following categories:

1. User Features:
- User ID, age, gender, city, occupation
- User historical behavior statistics (click rate, purchase rate, average rating)
- User profile tags (interest tags, spending power)

2. Item Features:
- Item ID, category, brand, price
- Item statistical features (click rate, purchase rate, average rating)
- Item content features (text descriptions, images)

3. Context Features:
- Time features (hour, day of week, month, whether it's a holiday)
- Device features (device type, operating system, app version)
- Location features (GPS coordinates, city, business district)

4. Interaction Features:
- User-item interaction history (last N clicks, purchases)
- User-category interaction statistics (click counts per category)
- Item-user interaction statistics (profiles of users who clicked the item)

5. Cross Features:
- User features × item features (e.g., "user age × item category")
- Time features × item features (e.g., "time period × item category")
- High-order cross features (e.g., "user age × item category × time period")

Feature Encoding

1. Numerical Features:
- Standardization: \(x' = \frac{x - \mu}{\sigma}\)
- Normalization: \(x' = \frac{x - x_{min}}{x_{max} - x_{min}}\)
- Binning: discretize continuous values, e.g., age divided into "0-18, 19-30, 31-50, 50+"

2. Categorical Features:
- One-hot Encoding: one dimension per category
- Embedding Encoding: map to low-dimensional dense vectors (standard in deep learning)
- Hash Encoding: use a hash function to map categories into a fixed number of buckets

3. Sequential Features:
- Padding: pad sequences of different lengths to the same length
- Pooling: average pooling, max pooling, attention pooling (see the masked-pooling sketch below)
- RNN/Transformer: process with sequence models
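Padding and pooling interact: padded positions must be excluded from the average. A minimal sketch of masked average pooling over a padded item-ID sequence, where padding index 0 and all sizes are assumptions:

```python
import torch
import torch.nn as nn

num_items, d, pad_idx = 1000, 32, 0
item_emb = nn.Embedding(num_items, d, padding_idx=pad_idx)

# Two user histories padded to length 5 (0 = padding)
seqs = torch.LongTensor([[5, 42, 7, 0, 0],
                         [13, 0, 0, 0, 0]])  # [batch, seq_len]
emb = item_emb(seqs)                          # [batch, seq_len, d]

# Average only over real (non-padding) positions
mask = (seqs != pad_idx).unsqueeze(-1).float()                    # [batch, seq_len, 1]
pooled = (emb * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)   # [batch, d]
```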

Feature Selection

Not all features are useful; feature selection is needed:

1. Statistical Methods:
- Mutual Information: measures the dependence between a feature and the target
- Chi-square Test: tests independence between a feature and the target
- Correlation Coefficient: measures linear correlation between a feature and the target

2. Model-based Methods:
- L1 Regularization: automatically drives weights of unimportant features to zero
- Feature Importance: tree-model importances (e.g., XGBoost)
- Permutation Importance: shuffle a feature's values and observe the performance drop

3. Business Methods:
- A/B Testing: deploy features and observe metric changes
- Feature Analysis: analyze feature distributions, missing rates, coverage

Feature Engineering Code Example

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_selection import mutual_info_classif

class FeatureEngineer:
    """Feature engineering utility class"""

    def __init__(self):
        self.scalers = {}
        self.encoders = {}
        self.feature_names = []

    def process_numerical_features(self, df, numerical_cols):
        """Process numerical features: standardization"""
        processed_df = df.copy()
        for col in numerical_cols:
            scaler = StandardScaler()
            processed_df[col] = scaler.fit_transform(df[[col]])
            self.scalers[col] = scaler
        return processed_df

    def process_categorical_features(self, df, categorical_cols):
        """Process categorical features: Label Encoding"""
        processed_df = df.copy()
        for col in categorical_cols:
            encoder = LabelEncoder()
            processed_df[col] = encoder.fit_transform(df[col].astype(str))
            self.encoders[col] = encoder
        return processed_df

    def create_cross_features(self, df, field1, field2):
        """Create cross features"""
        cross_feature = f"{field1}_x_{field2}"
        df[cross_feature] = df[field1].astype(str) + "_" + df[field2].astype(str)
        return df

    def create_binning_features(self, df, numerical_col, bins):
        """Create binning features"""
        bin_feature = f"{numerical_col}_bin"
        df[bin_feature] = pd.cut(df[numerical_col], bins=bins, labels=False)
        return df

    def create_statistical_features(self, df, group_col, agg_col, agg_funcs):
        """Create statistical features (e.g., user average click rate)"""
        stats = df.groupby(group_col)[agg_col].agg(agg_funcs)
        stats.columns = [f"{group_col}_{agg_col}_{func}" for func in agg_funcs]
        df = df.merge(stats, left_on=group_col, right_index=True, how='left')
        return df

    def select_features(self, X, y, k=10):
        """Feature selection based on mutual information"""
        mi_scores = mutual_info_classif(X, y, random_state=42)
        top_k_indices = np.argsort(mi_scores)[-k:]
        return top_k_indices

# Usage example: assume we have user behavior data
data = {
    'user_id': [1, 1, 2, 2, 3, 3],
    'item_id': [10, 20, 10, 30, 20, 30],
    'category': ['A', 'B', 'A', 'C', 'B', 'C'],
    'price': [10.5, 20.3, 10.5, 15.7, 20.3, 15.7],
    'age': [25, 25, 30, 30, 35, 35],
    'click': [1, 1, 0, 1, 1, 0]
}

df = pd.DataFrame(data)

# Feature engineering
fe = FeatureEngineer()

# Create binning features on the raw ages (this must happen before
# standardization, otherwise the bin edges no longer match the scaled values)
df = fe.create_binning_features(df, 'age', bins=[0, 25, 30, 40, 100])

# Create cross features
df = fe.create_cross_features(df, 'user_id', 'category')

# Create statistical features (user average click rate)
df = fe.create_statistical_features(
    df,
    group_col='user_id',
    agg_col='click',
    agg_funcs=['mean', 'sum']
)

# Process numerical features
df = fe.process_numerical_features(df, ['price', 'age'])

# Process categorical features
df = fe.process_categorical_features(df, ['category'])

print(df.head())
```

Training Techniques

Data Preparation

1. Negative Sampling:
- For implicit feedback, negative samples far outnumber positive samples
- Negative sampling is needed to balance the positive/negative ratio
- Common strategies: random negative sampling, popularity-based negative sampling, hard negative sampling

```python
import numpy as np

def negative_sampling(user_items, num_items, num_negatives=4):
    """Sample `num_negatives` negative items for each observed (user, item) pair.

    Args:
        user_items: np.ndarray of shape [num_interactions, 2] with (user_id, item_id) rows
        num_items: total number of items
        num_negatives: negatives per positive
    """
    positive_samples = []
    negative_samples = []

    for user_id, item_id in user_items:
        # Positive sample
        positive_samples.append((user_id, item_id, 1))

        # Candidates: items this user has not interacted with
        user_interacted = set(user_items[user_items[:, 0] == user_id][:, 1])
        negative_candidates = list(set(range(num_items)) - user_interacted)

        # Random sampling without replacement
        negative_items = np.random.choice(
            negative_candidates,
            size=min(num_negatives, len(negative_candidates)),
            replace=False
        )
        for neg_item in negative_items:
            negative_samples.append((user_id, neg_item, 0))

    return positive_samples, negative_samples
```

2. Data Augmentation:
- Time Window Sliding: build training sets with different time windows
- Data Mixing: mix data from different sources
- Noise Injection: add noise during training to improve robustness

3. Data Splitting:
- Time-based Split: split training and test sets chronologically (more realistic; see the sketch below)
- Random Split: random split (may cause data leakage)
- User-based Split: split by users (avoids the same user appearing in both training and test sets)
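A minimal sketch of a time-based split with pandas, assuming an interaction log with a timestamp column; the 80/20 cutoff is an arbitrary choice:

```python
import pandas as pd

def time_based_split(df, time_col='timestamp', test_ratio=0.2):
    """Train on the oldest (1 - test_ratio) of interactions, test on the newest."""
    df = df.sort_values(time_col)
    cutoff = df[time_col].quantile(1 - test_ratio)
    train = df[df[time_col] <= cutoff]
    test = df[df[time_col] > cutoff]
    return train, test

# Usage with a toy interaction log
log = pd.DataFrame({'user_id': [1, 2, 1, 3], 'item_id': [10, 20, 30, 10],
                    'timestamp': pd.to_datetime(['2024-01-01', '2024-02-01',
                                                 '2024-03-01', '2024-04-01'])})
train, test = time_based_split(log)
```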

Model Training

1. Optimizer Selection:
- Adam/AdamW: adaptive learning rates, suitable for most scenarios
- SGD: requires manual learning-rate tuning, but may converge to better solutions
- Adagrad: suitable for sparse gradients

```python
import torch.optim as optim

# Adam optimizer (recommended)
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)

# AdamW optimizer (better weight decay)
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=1e-4)

# SGD optimizer (requires learning rate scheduling)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
```

2. Learning Rate Scheduling:
- StepLR: decay every N epochs
- ExponentialLR: exponential decay
- CosineAnnealingLR: cosine annealing
- ReduceLROnPlateau: adjust automatically based on validation performance

```python
# StepLR: halve the learning rate every 10 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

# CosineAnnealingLR: cosine annealing
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

# ReduceLROnPlateau: reduce the learning rate when validation performance stalls
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=5
)
```

3. Regularization:
- L2 Regularization: implemented via weight_decay
- Dropout: randomly zero out some neurons
- Batch Normalization: normalize activation values
- Early Stopping: stop when validation performance stops improving

```python
import torch.nn as nn

# Dropout
model = nn.Sequential(
    nn.Linear(100, 64),
    nn.ReLU(),
    nn.Dropout(0.2),  # 20% of activations are randomly zeroed
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(32, 1)
)

# Early Stopping (monitors a validation loss, where lower is better)
class EarlyStopping:
    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_loss = None

    def __call__(self, val_loss):
        """Return True when training should stop."""
        if self.best_loss is None or val_loss < self.best_loss - self.min_delta:
            # Improvement: remember it and reset the counter
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1
            if self.counter >= self.patience:
                return True
        return False
```

Training Loop Example

```python
def train_model(model, train_loader, val_loader, num_epochs=50):
    """Complete training loop"""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)

    criterion = nn.BCELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='min', factor=0.5, patience=5
    )
    early_stopping = EarlyStopping(patience=10)

    best_val_loss = float('inf')

    for epoch in range(num_epochs):
        # Training phase
        model.train()
        train_loss = 0.0
        for batch in train_loader:
            user_ids, item_ids, labels = batch
            user_ids = user_ids.to(device)
            item_ids = item_ids.to(device)
            labels = labels.to(device)

            # Forward pass
            optimizer.zero_grad()
            predictions = model(user_ids, item_ids)
            loss = criterion(predictions, labels)

            # Backward pass
            loss.backward()
            optimizer.step()

            train_loss += loss.item()

        train_loss /= len(train_loader)

        # Validation phase
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for batch in val_loader:
                user_ids, item_ids, labels = batch
                user_ids = user_ids.to(device)
                item_ids = item_ids.to(device)
                labels = labels.to(device)

                predictions = model(user_ids, item_ids)
                loss = criterion(predictions, labels)
                val_loss += loss.item()

        val_loss /= len(val_loader)

        # Learning rate scheduling
        scheduler.step(val_loss)

        # Save the best model
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), 'best_model.pth')

        print(f"Epoch {epoch+1}/{num_epochs}: "
              f"Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")

        # Early stopping check (after saving, so the best weights are kept)
        if early_stopping(val_loss):
            print(f"Early stopping at epoch {epoch+1}")
            break

    return model
```

Evaluation Metrics

1. Classification Tasks (CTR Prediction):
- AUC: area under the ROC curve, measures ranking capability
- LogLoss: logarithmic loss, measures the accuracy of predicted probabilities
- Precision@K: proportion of positive samples among the Top-K recommendations
- Recall@K: coverage of positive samples by the Top-K recommendations

2. Regression Tasks (Rating Prediction):
- RMSE: root mean squared error
- MAE: mean absolute error
- NDCG: Normalized Discounted Cumulative Gain (a ranking metric)

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

def evaluate_model(model, test_loader):
    """Evaluate model"""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    model.eval()

    all_predictions = []
    all_labels = []

    with torch.no_grad():
        for batch in test_loader:
            user_ids, item_ids, labels = batch
            user_ids = user_ids.to(device)
            item_ids = item_ids.to(device)

            predictions = model(user_ids, item_ids)
            all_predictions.extend(predictions.cpu().numpy())
            all_labels.extend(labels.numpy())

    # Calculate metrics
    auc = roc_auc_score(all_labels, all_predictions)
    logloss = log_loss(all_labels, all_predictions)

    # Top-K metrics (a global ranking over the whole test set -- a simplification;
    # production systems usually compute these per user and then average)
    k = 10
    sorted_indices = np.argsort(all_predictions)[::-1]
    top_k_labels = [all_labels[i] for i in sorted_indices[:k]]
    precision_k = sum(top_k_labels) / k
    recall_k = sum(top_k_labels) / sum(all_labels)

    return {
        'AUC': auc,
        'LogLoss': logloss,
        f'Precision@{k}': precision_k,
        f'Recall@{k}': recall_k
    }
```
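NDCG, listed above but not computed in evaluate_model, takes only a few extra lines. A minimal sketch for binary relevance; like the Precision@K above, it ranks globally rather than per user, which is a simplification:

```python
import numpy as np

def ndcg_at_k(labels, predictions, k=10):
    """NDCG@k for binary relevance: DCG of the predicted order / DCG of the ideal order."""
    order = np.argsort(predictions)[::-1][:k]
    gains = np.asarray(labels, dtype=float)[order]
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))  # 1/log2(rank+1)
    dcg = float((gains * discounts).sum())

    ideal = np.sort(np.asarray(labels, dtype=float))[::-1][:k]
    idcg = float((ideal * discounts[:len(ideal)]).sum())
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([1, 0, 1, 1, 0], [0.9, 0.8, 0.7, 0.2, 0.1], k=3))  # ~0.70
```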

Q&A: Common Questions

Q1: How to Choose Embedding Dimensions?

A: The choice of Embedding dimension \(d\) requires balancing model capacity against computational cost:

  • Small-scale data (<100K users/items): \(d = 8\)-\(16\) is sufficient
  • Medium-scale data (100K-1M): \(d = 32\)-\(64\) is common
  • Large-scale data (>1M): \(d = 64\)-\(128\), or even 256

Rule of thumb:
1. Start with \(d = 32\), and gradually increase to 64, then 128
2. Watch validation performance; if the improvement is under 1%, stop increasing
3. Consider computational resources: doubling \(d\) doubles the Embedding parameter count

Q2: Why Does NCF Perform Better Than Matrix Factorization?

A: NCF's advantages mainly lie in:

  1. Nonlinear Modeling: Matrix factorization can only capture linear relationships (inner products), while NCF can capture nonlinear relationships through MLP
  2. Feature Fusion: NCF's GMF and MLP parts complement each other; GMF captures simple interactions, MLP captures complex interactions
  3. End-to-End Training: The entire model can be jointly optimized, while matrix factorization typically requires alternating optimization

However, NCF also has disadvantages:
- Higher computational complexity (a full forward pass is needed per prediction)
- Poorer interpretability (black-box model)
- Requires more data to train well

Q3: What's the Difference Between CDAE and VAE?

A: Main differences:

  1. Model Type:
    • CDAE: Deterministic autoencoder, latent variable is a fixed vector
    • VAE: Probabilistic generative model, latent variable is a probability distribution
  2. Generation Capability:
    • CDAE: Can only reconstruct input, cannot generate new samples
    • VAE: Can sample from latent distribution to generate new samples
  3. Uncertainty:
    • CDAE: Cannot model uncertainty
    • VAE: Can model uncertainty through variance of latent distribution
  4. Training:
    • CDAE: Simple training, only needs reconstruction loss
    • VAE: Requires KL divergence term, more complex training
  5. Recommendation Diversity:
    • CDAE: Recommendation results are relatively fixed
    • VAE: Can increase diversity through sampling

Q4: What Are the Respective Roles of Wide and Deep Parts in Wide & Deep?

A:

Wide Part (Memorization):

  • Learns direct associations between features
  • Example: "Users who installed Pandora also installed YouTube"
  • Suitable for handling sparse, high-dimensional cross features
  • Can quickly memorize patterns in historical data

Deep Part (Generalization):

  • Learns Embedding representations of features
  • Captures latent associations between sparse features
  • Can generalize to unseen feature combinations
  • Suitable for handling dense Embedding features

Why Combine Both:

  • Wide only: cannot generalize, can only memorize historical data
  • Deep only: may over-generalize and ignore important direct associations
  • Wide + Deep: both memorizes and generalizes, achieving the best results (a minimal sketch follows)
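A minimal sketch of the idea (layer sizes and feature handling are illustrative, not the paper's production setup):

import torch
import torch.nn as nn

class TinyWideDeep(nn.Module):
    def __init__(self, num_cross_features, vocab_size, emb_dim=16):
        super().__init__()
        # Wide: a linear model over sparse cross features (memorization)
        self.wide = nn.Linear(num_cross_features, 1)
        # Deep: Embedding + MLP over id features (generalization)
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.deep = nn.Sequential(nn.Linear(emb_dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, cross_x, ids):
        wide_logit = self.wide(cross_x)               # memorizes direct associations
        deep_logit = self.deep(self.embedding(ids))   # generalizes via Embeddings
        return torch.sigmoid(wide_logit + deep_logit) # joint prediction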

Q5: How to Handle Cold Start Problems?

A: Cold start is a classic problem in recommendation systems, with the following solutions:

1. New User Cold Start:

  • Popular Recommendations: Recommend popular items
  • Content-based Recommendations: Recommend based on user registration information (age, gender, etc.)
  • Transfer Learning: Transfer preferences from similar users
  • Multi-armed Bandit: Balance exploration and exploitation

2. New Item Cold Start:

  • Content Features: Recommend to similar users based on item attributes (category, tags)
  • Embedding Pre-training: Pre-train Embeddings using item content features (see the sketch below)
  • Collaborative Filtering: Based on interaction data of similar items

3. System Cold Start:

  • External Data: Leverage data from other platforms
  • Expert Rules: Manually designed recommendation rules
  • A/B Testing: Rapid iterative optimization
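As one concrete instance of Embedding pre-training for new items, a new item's vector can be warm-started from the mean of content-similar items. This is a hypothetical helper: `item_emb_table` is assumed to be an nn.Embedding, and `similar_item_ids` is assumed to come from content matching (same category/tags):

import torch

def init_new_item_embedding(item_emb_table, similar_item_ids):
    """Warm-start a new item's Embedding from content-similar items;
    training refines it once interaction data arrives."""
    with torch.no_grad():
        neighbors = item_emb_table.weight[similar_item_ids]  # (n_similar, d)
        return neighbors.mean(dim=0)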

Q6: How to Choose Negative Sampling Strategies?

A: Negative sampling strategies affect model performance:

1. Random Negative Sampling:

  • Simplest: randomly sample from all non-interacted items
  • Suitable for most scenarios
  • May sample items that users "aren't interested in but don't dislike"

2. Popularity-based Negative Sampling:

  • Sample negatives from popular items
  • Assumes that a user not clicking a popular item signals dislike
  • May introduce popularity bias

3. Hard Negative Sampling:

  • Sample negatives that receive high model prediction scores
  • Forces the model to distinguish "easily confused" positives and negatives
  • Improves performance but requires dynamic sampling (the model changes during training)

4. Mixed Strategy:

  • 50% random + 50% popular
  • Or adjust by training phase: random sampling early, hard negative sampling later (a sketch of the first two strategies follows)
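A minimal sketch of the first two strategies under stated assumptions: `pos_items` is a set of the user's observed positives, and `popularity` is a numpy array of per-item interaction counts:

import numpy as np

def sample_negatives(pos_items, num_items, popularity, n_neg=4, strategy='random'):
    """Draw n_neg negatives for one user, skipping observed positives."""
    rng = np.random.default_rng()
    probs = popularity / popularity.sum() if strategy == 'popular' else None
    negatives = []
    while len(negatives) < n_neg:
        cand = int(rng.choice(num_items, p=probs))  # uniform when probs is None
        if cand not in pos_items:                   # pos_items should be a set for speed
            negatives.append(cand)
    return negatives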

Q7: How to Prevent Overfitting?

A: Methods to prevent overfitting:

1. Regularization:

  • L2 Regularization: Implemented via weight_decay, typically 1e-5 to 1e-3
  • Dropout: Randomly zero out some neurons, dropout rate 0.2–0.5
  • Batch Normalization: Normalize activations to stabilize training

2. Data Augmentation:

  • Negative Sampling: Increase the number of negative samples
  • Noise Injection: Add noise during training
  • Data Mixing: Mix data from different sources

3. Model Complexity Control:

  • Reduce Layers: Start with deep networks and gradually reduce
  • Reduce Embedding Dimensions: Lower model capacity
  • Early Stopping: Stop training when validation performance stops improving (see the sketch below)

4. Cross Validation:

  • Use K-fold cross validation to evaluate models
  • Avoids the randomness of a single split
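A sketch combining L2 regularization, Dropout, and early stopping; `model`, `train_one_epoch`, and `validate` are assumed to be defined elsewhere, and the hyperparameter values are typical starting points rather than recommendations:

import torch
import torch.nn as nn

# L2 regularization via weight_decay; Dropout layers go between MLP layers
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
dropout = nn.Dropout(p=0.3)

best_auc, patience, bad_epochs = 0.0, 3, 0
for epoch in range(50):
    train_one_epoch(model, optimizer)
    val_auc = validate(model)
    if val_auc > best_auc:
        best_auc, bad_epochs = val_auc, 0
        torch.save(model.state_dict(), 'best.pt')  # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                                   # validation stopped improving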

Q8: How to Accelerate Model Training?

A: Methods to accelerate training:

1. Hardware Acceleration:

  • GPU: CUDA acceleration, often 10–100x over CPU
  • Multi-GPU: Data parallelism or model parallelism
  • TPU: Google's specialized accelerator, suited to large-scale training

2. Data Optimization:

  • Data Preprocessing: Precompute features ahead of time to avoid doing the work during training
  • Data Loading: Use a multi-process DataLoader (num_workers > 0)
  • Batch Size: Increase the batch size to improve GPU utilization

3. Model Optimization:

  • Mixed Precision Training: Use FP16, roughly 2x speedup (see the sketch below)
  • Gradient Accumulation: Simulate large-batch training
  • Model Pruning: Reduce model parameters

4. Algorithm Optimization:

  • Learning Rate Scheduling: Use warmup to accelerate convergence
  • Optimizer Selection: Adam usually converges faster than SGD
  • Asynchronous Training: Asynchronous updates across machines and GPUs
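A mixed-precision training sketch using PyTorch's AMP utilities; `model` (assumed to return raw logits), `optimizer`, and `train_loader` are assumed to exist:

import torch
import torch.nn.functional as F

scaler = torch.cuda.amp.GradScaler()    # scales the loss to avoid FP16 underflow

for user_ids, item_ids, labels in train_loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():     # forward pass runs in mixed precision
        logits = model(user_ids.cuda(), item_ids.cuda())
        loss = F.binary_cross_entropy_with_logits(logits, labels.cuda().float())
    scaler.scale(loss).backward()       # backward on the scaled loss
    scaler.step(optimizer)              # unscales gradients, then steps
    scaler.update()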

Q9: How to Evaluate Recommendation System Effectiveness?

A: Recommendation system evaluation requires multi-dimensional metrics:

1. Offline Metrics:

  • Accuracy Metrics: AUC, LogLoss, RMSE, MAE
  • Ranking Metrics: NDCG, MRR, MAP
  • Coverage Metrics: Coverage (diversity of recommended items)
  • Diversity Metrics: Intra-list Diversity (differences between items within a recommendation list)

2. Online Metrics:

  • CTR: Click-through rate
  • CVR: Conversion rate (purchase/download)
  • GMV: Gross merchandise value
  • User Retention Rate: Proportion of returning users

3. Business Metrics:

  • User Satisfaction: Ratings and feedback
  • Long-tail Recommendations: Whether unpopular items get recommended
  • Real-time Performance: Recommendation response time

4. A/B Testing:

  • Compare the effects of old and new models
  • Requires sufficient sample size (typically >1000 users)
  • Watch for statistical significance (an NDCG@K sketch follows)
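As a concrete example of a ranking metric, a minimal NDCG@K implementation (the input is a 0/1 relevance list already ordered by descending model score):

import numpy as np

def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@K for one user's ranked list."""
    rel = np.asarray(ranked_relevances, dtype=float)
    n = min(k, rel.size)
    discounts = 1.0 / np.log2(np.arange(2, n + 2))     # 1 / log2(rank + 1)
    dcg = (rel[:n] * discounts).sum()
    idcg = (np.sort(rel)[::-1][:n] * discounts).sum()  # best possible ordering
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([1, 0, 1, 0, 0]))  # ~0.92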

Q10: Can Embeddings Be Visualized?

A: Yes. Common visualization methods:

1. t-SNE:

  • Reduces high-dimensional Embeddings to 2D
  • Lets you observe whether similar items cluster together
  • Suitable for exploratory analysis

2. PCA:

  • Linear dimensionality reduction, fast to compute
  • Preserves the main variance
  • Suitable for preliminary analysis

3. UMAP:

  • Faster than t-SNE with similar results
  • Preserves both local and global structure
  • Suitable for large-scale data

4. Visualization Tools:

  • TensorBoard: TensorFlow's visualization tool
  • Weights & Biases: Online visualization platform
  • Plotly: Interactive visualization

Visualization can help you:

  • Understand what the model has learned
  • Discover anomalies (e.g., some item Embeddings look degenerate)
  • Explain recommendation results (why a particular item was recommended)
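A minimal t-SNE sketch; the random matrix is stand-in data, and the commented line showing how to pull Embeddings from a model assumes an attribute name (`item_embedding`) that depends on your model:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Replace with real Embeddings pulled from a trained model, e.g.
#   item_embeddings = model.item_embedding.weight.detach().cpu().numpy()
item_embeddings = np.random.randn(500, 64)   # stand-in data for the sketch

coords = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(item_embeddings)
plt.scatter(coords[:, 0], coords[:, 1], s=5, alpha=0.6)
plt.title('Item Embeddings (t-SNE)')
plt.show()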

Q11: How to Handle New Categories in Categorical Features?

A: New categories (OOV, Out-of-Vocabulary) are common problems:

1. Default Embedding:

  • Assign a special Embedding vector to new categories
  • Can be randomly initialized or a zero vector
  • Gets updated during training

2. Hash Trick:

  • Use a hash function to map new categories into a fixed set of buckets
  • Example: hash(new_category) % num_categories
  • Hash collisions are possible, but arbitrary new categories can be handled (see the sketch below)

3. Content Features:

  • If new categories carry content (e.g., a text description), initialize Embeddings from it
  • Example: Encode category names with Word2Vec

4. Transfer Learning:

  • Transfer Embeddings from similar categories
  • Example: A new movie category can initialize its Embedding from a similar existing category
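A minimal hash-trick sketch; the bucket count is an illustrative choice:

import hashlib
import torch
import torch.nn as nn

NUM_BUCKETS = 100_000                        # fixed table size chosen up front
embedding = nn.Embedding(NUM_BUCKETS, 32)

def category_to_index(category: str) -> int:
    # Use a stable hash (Python's built-in hash() is salted per process)
    # so indices stay reproducible across runs; collisions are tolerated.
    digest = hashlib.md5(category.encode('utf-8')).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

# Any string, seen or unseen, maps into the fixed Embedding table
vec = embedding(torch.tensor([category_to_index('brand_new_category')]))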

Q12: How to Combine Deep Learning Recommendation Models with Traditional Methods?

A: Can combine in various ways:

1. Model Fusion:

  • Weighted Average: Weighted average of the predictions of multiple models
  • Stacking: Use a meta-model to learn how to combine multiple models
  • Blending: Different models handle different scenarios

2. Feature Fusion:

  • Use the outputs of traditional methods as input features for deep learning models
  • Example: Matrix factorization prediction scores as features

3. Two-stage Recommendation:

  • Recall Stage: Use traditional methods (e.g., Item-CF) to quickly retrieve a candidate set
  • Ranking Stage: Use deep learning models for fine-grained ranking (a sketch follows)

4. Ensemble Learning:

  • Train multiple models with different structures
  • Vote or average to get the final result
  • Usually outperforms any single model
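A two-stage sketch of strategy 3; `itemcf_recall` and `dnn_ranker` are illustrative callables, not a real API:

def recommend(user_id, itemcf_recall, dnn_ranker, k=10, recall_size=500):
    """Cheap traditional recall first, then deep-model ranking on the candidates."""
    candidates = itemcf_recall(user_id, n=recall_size)   # fast candidate retrieval
    scores = dnn_ranker(user_id, candidates)             # fine-grained scoring
    ranked = sorted(zip(candidates, scores), key=lambda t: -t[1])
    return [item for item, _ in ranked[:k]]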

Summary

Deep learning has brought revolutionary changes to recommendation systems. From automatic feature learning through Embeddings, to nonlinear modeling with NCF, from denoising reconstruction with AutoEncoders, to combining memorization and generalization with Wide & Deep, deep learning models have demonstrated powerful capabilities across all recommendation scenarios.

However, deep learning is not a silver bullet. It requires large amounts of data, computational resources, and tuning experience. In practical applications, we need to:

  1. Understand the business scenario: choose an appropriate model architecture
  2. Do feature engineering well: feature quality determines the model's upper limit
  3. Carefully design the training pipeline: data preparation, negative sampling, regularization, evaluation metrics
  4. Continuously iterate and optimize: A/B testing, online monitoring, rapid iteration

Recommendation systems are complex engineering systems, and deep learning is just one component. Only by combining algorithms, engineering, and business can we build truly effective recommendation systems.

Future directions for recommendation systems include:

  • Sequential Recommendation: Use Transformers to model user behavior sequences
  • Reinforcement Learning: Dynamically adjust recommendation strategies
  • Multimodal Recommendation: Fuse text, images, video, and other modalities
  • Explainable Recommendation: Help users understand why items are recommended
  • Fair Recommendation: Avoid recommendation bias and protect user privacy

I hope this article helps you build a complete knowledge framework for deep learning recommendation systems. If you have any questions, feel free to discuss them in the comments.
