Recommendation Systems (3): Deep Learning Foundation Models
Chen Kai

permalink: "en/recommendation-systems-3-deep-learning-basics/"
date: 2024-05-12 10:00:00
tags:
  - Recommendation Systems
  - Deep Learning
  - Neural Networks
categories: Recommendation Systems
mathjax: true
---

In 2016, Google introduced the Wide & Deep model in Google Play's recommendation system, marking the formal entry of deep learning into the mainstream of recommendation systems. Prior to this, recommendation systems primarily relied on traditional methods such as matrix factorization and collaborative filtering. While these methods achieved success in competitions like the Netflix Prize, they had significant limitations: difficulty handling high-dimensional sparse features, inability to capture nonlinear relationships, and heavy reliance on manual feature engineering.

Deep learning has brought revolutionary changes to recommendation systems. Through multi-layer neural networks, we can automatically learn representations (Embeddings) of users and items, capture complex interaction patterns, handle multimodal features, and train end-to-end on large-scale data. From NCF (Neural Collaborative Filtering) to AutoEncoder-based recommendations, from Wide & Deep to DeepFM, deep learning models have demonstrated powerful capabilities across all stages of recommendation systems, including CTR prediction, recall, and ranking.

This article provides an in-depth exploration of the core concepts, mainstream models, and implementation details of deep learning recommendation systems. We'll start by understanding the essence of Embeddings and why they matter; then dive into classic models like NCF, AutoEncoders (CDAE/VAE), and Wide & Deep; discuss feature engineering and training techniques; and close with a set of complete code implementations and a Q&A section covering more than ten common questions. Whether you're new to recommendation systems or want to systematically understand deep learning recommendation models, this article will help you build a complete knowledge framework.

Deep Learning vs Traditional Methods

Limitations of Traditional Recommendation Methods

Before the rise of deep learning, recommendation systems primarily relied on the following methods:

Matrix Factorization:
- Decomposes the user-item rating matrix into low-dimensional vectors
- Uses vector inner products to predict ratings: \(\hat{r}_{ui} = \mathbf{p}_u^T \mathbf{q}_i\)
- Advantages: simple, interpretable, computationally efficient
- Disadvantages: can only capture linear relationships, difficult to handle high-dimensional sparse features

Collaborative Filtering:
- Makes recommendations based on user or item similarity
- Advantages: no need for content features, can discover unexpected associations
- Disadvantages: severe data sparsity problems, difficult cold start

Factorization Machines (FM):
- Introduce feature interaction terms: \(\hat{y} = w_0 + \sum_{i=1}^n w_i x_i + \sum_{i=1}^n \sum_{j=i+1}^n \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j\)
- Advantages: can handle high-dimensional sparse features, captures second-order interactions
- Disadvantages: capture only second-order interactions; higher-order interactions require manual design

The core problem with these traditional methods is that they are linear, or capture only low-order interactions, while user behavior often contains complex nonlinear patterns. For example, a user might specifically like movies that combine "sci-fi + action + blockbuster," a combination feature that is difficult to express with a simple linear model.

Advantages of Deep Learning

Deep learning, through multi-layer neural networks, brings the following advantages to recommendation systems:

Automatic Feature Learning:
- Traditional methods require manual feature design (e.g., "user age × item category")
- Deep learning automatically learns feature representations through multi-layer nonlinear transformations
- Embedding layers map high-dimensional sparse one-hot encodings to low-dimensional dense vectors

Nonlinear Modeling Capability:
- Multi-layer neural networks can capture arbitrarily complex nonlinear relationships
- Activation functions like ReLU and Sigmoid introduce nonlinearity
- Deep networks can learn high-order feature interactions

Multimodal Feature Fusion:
- Can simultaneously process user profiles, item attributes, behavior sequences, text, images, and other features
- Uses different network structures (CNN, RNN, Transformer) to handle different modalities
- End-to-end training within a unified framework

End-to-End Training:
- The entire pipeline from raw features to final predictions can be jointly optimized
- Gradient backpropagation automatically adjusts all parameters
- Avoids the separation of feature engineering and model training found in traditional pipelines

Performance Comparison

In practical applications, deep learning models typically achieve 5-30% performance improvements over traditional methods:

| Method | AUC | Improvement |
| --- | --- | --- |
| Matrix Factorization | 0.750 | baseline |
| FM | 0.780 | +4.0% |
| Wide & Deep | 0.810 | +8.0% |
| DeepFM | 0.825 | +10.0% |
| DIN | 0.845 | +12.7% |

These improvements primarily come from:
1. Better Feature Representations: learned Embedding vectors carry more information than one-hot encodings
2. More Complex Interaction Patterns: deep networks capture feature combinations that traditional methods cannot express
3. Sequential Modeling Capability: RNNs/Transformers can model temporal dependencies in user behavior sequences

Challenges of Deep Learning

Despite its many advantages, deep learning also brings some challenges:

Computational Complexity:
- Deep networks require substantial computational resources
- Training time may be 10-100 times longer than traditional methods
- GPU acceleration is typically required for production use

Interpretability:
- Black-box models make it difficult to explain why certain items are recommended
- Traditional methods (like matrix factorization) produce vectors that can be intuitively understood
- Additional interpretability tools (such as SHAP, LIME) are often needed

Data Requirements:
- Deep learning requires large amounts of training data
- Cold start problems still exist (new users/new items)
- Carefully designed data augmentation and transfer learning strategies are needed

Hyperparameter Tuning:
- Network structure, learning rate, regularization, and other hyperparameters require extensive experimentation
- The hyperparameter search space is larger than for traditional methods
- Automated tools (such as AutoML) can help

Embedding Deep Dive

What is Embedding

Embedding is one of the core concepts in deep learning. Simply put, Embedding is a technique that maps high-dimensional sparse discrete features to low-dimensional dense continuous vector spaces.

In recommendation systems, the most common discrete features are user IDs and item IDs. Suppose we have 10 million users and 1 million items. Using one-hot encoding:
- User ID: a 10-million-dimensional vector with a single 1 and all other entries 0
- Item ID: a 1-million-dimensional vector with a single 1 and all other entries 0

This representation has serious problems:
1. Curse of Dimensionality: vector dimension equals the number of categories, with enormous storage and computation costs
2. Information Sparsity: 99.9999% of elements are 0, extremely low information density
3. Cannot Express Similarity: the distance between any two one-hot vectors is the same (e.g., Euclidean distance is \(\sqrt{2}\))

Embedding solves these problems:
- Maps 10-million-dimensional user IDs to 128-dimensional dense vectors
- Maps 1-million-dimensional item IDs to 128-dimensional dense vectors
- Similar users/items end up closer in the vector space

Mathematical Principles of Embedding

Embedding is essentially a lookup table. Let the user set be \(U = \{u_1, u_2, \dots, u_m\}\) and the item set be \(I = \{i_1, i_2, \dots, i_n\}\).

One-hot Encoding:
- One-hot vector for user \(u_i\): \(\mathbf{e}_i \in \{0,1\}^m\), where \(e_{ij} = 1\) if and only if \(j = i\)
- One-hot vector for item \(i_j\): \(\mathbf{f}_j \in \{0,1\}^n\), where \(f_{jk} = 1\) if and only if \(k = j\)

Embedding Layer:
- User Embedding matrix: \(\mathbf{P} \in \mathbb{R}^{m \times d}\), where \(d\) is the Embedding dimension
- Item Embedding matrix: \(\mathbf{Q} \in \mathbb{R}^{n \times d}\)
- Embedding vector for user \(u_i\): \(\mathbf{p}_i = \mathbf{P}^T \mathbf{e}_i\) (i.e., the \(i\)-th row of \(\mathbf{P}\))
- Embedding vector for item \(i_j\): \(\mathbf{q}_j = \mathbf{Q}^T \mathbf{f}_j\) (i.e., the \(j\)-th row of \(\mathbf{Q}\))

In implementation, the Embedding layer is typically a learnable parameter matrix:

```python
import torch
import torch.nn as nn

num_users, num_items, embedding_dim = 1000, 500, 64  # small demo sizes

# Learnable lookup tables for users and items
user_embedding = nn.Embedding(num_users, embedding_dim)  # weight shape: [m, d]
item_embedding = nn.Embedding(num_items, embedding_dim)  # weight shape: [n, d]

# Forward pass: an embedding lookup is just a row selection
user_id = torch.LongTensor([123])   # user ID
user_vec = user_embedding(user_id)  # shape: [1, d]
```

Learning Process of Embedding

Embedding vectors are not predefined but learned from training data. The learning objective is: to make similar users/items closer in vector space, and dissimilar ones farther apart.

Collaborative Filtering Perspective:
- If user \(u\) likes item \(i\), then \(\mathbf{p}_u\) and \(\mathbf{q}_i\) should be similar (large inner product)
- If user \(u\) dislikes item \(i\), then \(\mathbf{p}_u\) and \(\mathbf{q}_i\) should be dissimilar (small inner product)
- Loss function: \(\mathcal{L} = \sum_{(u,i) \in \mathcal{D}} (r_{ui} - \mathbf{p}_u^T \mathbf{q}_i)^2\)

Neural Network Perspective:
- The Embedding layer is the first layer of the neural network
- Backpropagation updates the Embedding matrix parameters along with all other weights
- The learned vectors encode latent features of users/items
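To make this concrete, here is a minimal training sketch (with made-up toy ratings) that optimizes the squared-error loss above directly over the two Embedding matrices; all names and data are illustrative:

```python
import torch
import torch.nn as nn

num_users, num_items, d = 100, 200, 16
user_emb = nn.Embedding(num_users, d)
item_emb = nn.Embedding(num_items, d)
optimizer = torch.optim.Adam(
    list(user_emb.parameters()) + list(item_emb.parameters()), lr=0.01)

# Toy observed ratings (user, item, rating) -- placeholder data
users = torch.LongTensor([0, 1, 2])
items = torch.LongTensor([5, 7, 9])
ratings = torch.FloatTensor([4.0, 2.0, 5.0])

for step in range(100):
    optimizer.zero_grad()
    pred = (user_emb(users) * item_emb(items)).sum(dim=1)  # inner product p_u^T q_i
    loss = ((ratings - pred) ** 2).sum()                   # squared-error loss from above
    loss.backward()   # gradients flow into both Embedding matrices
    optimizer.step()
```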

Embedding Dimension Selection

The Embedding dimension \(d\) is an important hyperparameter. Common choices range from 8 to 512, depending on:
- Data Scale: larger user/item counts typically require larger dimensions
- Task Complexity: CTR prediction may need 32-64 dimensions, recall may need 128-256 dimensions
- Computational Resources: larger dimensions increase storage and computation costs

Rule of thumb:
- Small scale (<100K): \(d = 8\)-\(16\)
- Medium scale (100K-1M): \(d = 32\)-\(64\)
- Large scale (>1M): \(d = 64\)-\(128\)
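If a formulaic starting point is preferred, one widely cited heuristic scales the dimension with the fourth root of the number of categories; the multiplier and cap below are assumptions chosen to roughly match the ranges above, not established constants:

```python
def suggest_embedding_dim(num_categories: int, multiplier: float = 2.0, cap: int = 512) -> int:
    """Fourth-root heuristic: d ~ multiplier * num_categories ** 0.25.

    The fourth-root shape is a commonly cited starting point; the multiplier
    and cap are assumptions tuned to roughly reproduce the ranges above.
    """
    return min(cap, max(8, round(multiplier * num_categories ** 0.25)))

print(suggest_embedding_dim(50_000))      # ~30  (small/medium scale)
print(suggest_embedding_dim(1_000_000))   # ~63  (medium scale)
print(suggest_embedding_dim(10_000_000))  # ~112 (large scale)
```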

Embedding Visualization

Through dimensionality reduction techniques (such as t-SNE, PCA), high-dimensional Embeddings can be visualized in 2D space to observe learned structures:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Assume we've trained item_embeddings, shape: [n_items, d]
# Select the first 1000 items for visualization
embeddings_subset = item_embeddings[:1000]

# t-SNE dimensionality reduction to 2D
tsne = TSNE(n_components=2, random_state=42)
embeddings_2d = tsne.fit_transform(embeddings_subset)

# Visualization
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1])
plt.title('Item Embeddings Visualization')
plt.show()
```

Typically, we find that:
- Items of the same category cluster together
- Items with similar functions are closer
- Popular and unpopular items may be distributed in different regions

Pre-training and Fine-tuning Embeddings

In practical applications, Embeddings can be:
1. Randomly Initialized: trained from scratch (most common)
2. Pre-trained: pre-trained on other tasks (e.g., item classification), then fine-tuned
3. Transferred: transferred from other domains (e.g., Word2Vec from NLP)

Advantages of pre-trained Embeddings (see the loading sketch below):
- Accelerate convergence: no need to start from a random state
- Improve performance: leverage external knowledge
- Alleviate cold start: new items can use pre-trained Embeddings
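If pre-trained vectors are available (say, from Word2Vec or an item-content model), PyTorch's nn.Embedding.from_pretrained can initialize the layer from them; a minimal sketch, where the pretrained array is a random stand-in for real vectors:

```python
import numpy as np
import torch
import torch.nn as nn

num_items, d = 500, 64
pretrained = np.random.randn(num_items, d).astype(np.float32)  # stand-in for real pre-trained vectors

# freeze=False keeps the vectors trainable, i.e., fine-tuning rather than freezing
item_embedding = nn.Embedding.from_pretrained(torch.from_numpy(pretrained), freeze=False)
```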

Code Example: Embedding Layer Implementation

```python
import torch
import torch.nn as nn

class EmbeddingLayer(nn.Module):
    """Basic Embedding layer implementation"""

    def __init__(self, num_embeddings, embedding_dim, padding_idx=None):
        """
        Args:
            num_embeddings: Vocabulary size (number of users or items)
            embedding_dim: Embedding dimension
            padding_idx: Padding index (for sequence padding)
        """
        super(EmbeddingLayer, self).__init__()
        self.num_embeddings = num_embeddings
        self.embedding_dim = embedding_dim

        # Create Embedding matrix, randomly initialized
        self.embedding = nn.Embedding(
            num_embeddings=num_embeddings,
            embedding_dim=embedding_dim,
            padding_idx=padding_idx
        )

        # Xavier initialization (optional)
        nn.init.xavier_uniform_(self.embedding.weight)

    def forward(self, indices):
        """
        Args:
            indices: Input indices, shape: [batch_size] or [batch_size, seq_len]
        Returns:
            embeddings: Embedding vectors, shape: [batch_size, embedding_dim]
                        or [batch_size, seq_len, embedding_dim]
        """
        return self.embedding(indices)

# Usage example
num_users = 10000
embedding_dim = 64

# Create user Embedding layer
user_embedding = EmbeddingLayer(num_users, embedding_dim)

# Forward pass
user_ids = torch.LongTensor([0, 1, 2, 3, 4])  # batch_size=5
user_vectors = user_embedding(user_ids)       # Shape: [5, 64]

print(f"Input user IDs: {user_ids}")
print(f"Output embeddings shape: {user_vectors.shape}")
print(f"Sample embedding (user 0): {user_vectors[0][:5]}")  # Print first 5 dimensions
```

Multi-field Embedding

In practical recommendation systems, besides user IDs and item IDs, there are many other discrete features (such as user age, item category, city, etc.). Each feature needs an Embedding layer:

```python
class MultiFieldEmbedding(nn.Module):
    """Multi-field Embedding layer"""

    def __init__(self, field_dims, embedding_dim):
        """
        Args:
            field_dims: Number of categories per field, e.g., [10000, 1000, 50] for 3 fields
            embedding_dim: Embedding dimension
        """
        super(MultiFieldEmbedding, self).__init__()
        self.field_dims = field_dims
        self.embedding_dim = embedding_dim
        self.num_fields = len(field_dims)

        # Create an Embedding layer for each field
        self.embeddings = nn.ModuleList([
            nn.Embedding(field_dim, embedding_dim)
            for field_dim in field_dims
        ])

    def forward(self, x):
        """
        Args:
            x: Input features, shape: [batch_size, num_fields]
        Returns:
            embeddings: Embeddings for all fields, shape: [batch_size, num_fields, embedding_dim]
        """
        # Look up each field's Embedding separately
        embeddings = []
        for i in range(self.num_fields):
            embeddings.append(self.embeddings[i](x[:, i]))

        # Stack into [batch_size, num_fields, embedding_dim]
        return torch.stack(embeddings, dim=1)

# Usage example
field_dims = [10000, 1000, 50, 20]  # User ID, Item ID, Category, City
embedding_dim = 32

multi_embedding = MultiFieldEmbedding(field_dims, embedding_dim)

# Input: [batch_size=4, num_fields=4]
x = torch.LongTensor([
    [123, 456, 5, 10],  # User 123, Item 456, Category 5, City 10
    [124, 457, 5, 11],
    [125, 458, 6, 10],
    [126, 459, 6, 12]
])

embeddings = multi_embedding(x)  # Shape: [4, 4, 32]
print(f"Input shape: {x.shape}")
print(f"Output embeddings shape: {embeddings.shape}")
```

NCF: Neural Collaborative Filtering

Background of NCF

Traditional matrix factorization methods use vector inner products to predict ratings:
\[\hat{r}_{ui} = \mathbf{p}_u^T \mathbf{q}_i\]
This approach has a fundamental problem: the inner product is linear and cannot capture complex nonlinear relationships between users and items. For example, a user might like the combination of "sci-fi + action," but such a combination feature cannot be expressed with a simple inner product.

NCF (Neural Collaborative Filtering), proposed in 2017, replaces inner products with multi-layer neural networks, enabling learning of nonlinear interactions between users and items.

NCF Model Architecture

The NCF model contains three components:

1. GMF (Generalized Matrix Factorization):
- User Embedding: \(\mathbf{p}_u \in \mathbb{R}^d\)
- Item Embedding: \(\mathbf{q}_i \in \mathbb{R}^d\)
- Element-wise (Hadamard) product: \(\mathbf{p}_u \odot \mathbf{q}_i\)
- Output: \(\hat{y}_{ui}^{GMF} = \mathbf{h}^T (\mathbf{p}_u \odot \mathbf{q}_i)\), where \(\mathbf{h}\) is a learnable weight vector

2. MLP (Multi-Layer Perceptron):
- Concatenate user and item Embeddings: \([\mathbf{p}_u; \mathbf{q}_i]\)
- Pass through a multi-layer fully connected network: \(\mathbf{z}_1 = \text{ReLU}(\mathbf{W}_1 [\mathbf{p}_u; \mathbf{q}_i] + \mathbf{b}_1)\), \(\mathbf{z}_2 = \text{ReLU}(\mathbf{W}_2 \mathbf{z}_1 + \mathbf{b}_2)\), ...
- Output: \(\hat{y}_{ui}^{MLP} = \mathbf{h}_{MLP}^T \mathbf{z}_L\)

3. NeuMF (Neural Matrix Factorization):
- Fuses GMF and MLP: \(\hat{y}_{ui} = \sigma(\hat{y}_{ui}^{GMF} + \hat{y}_{ui}^{MLP})\)
- Where \(\sigma\) is the Sigmoid activation function (for binary classification tasks)

Mathematical Formulation of NCF

The complete NCF model can be expressed as:
\[\hat{y}_{ui} = \sigma(\mathbf{h}^T (\mathbf{p}_u \odot \mathbf{q}_i) + \mathbf{h}_{MLP}^T \mathbf{z}_L)\]
Where:
- \(\mathbf{p}_u, \mathbf{q}_i\): user and item Embedding vectors
- \(\odot\): element-wise (Hadamard) product
- \(\mathbf{z}_L\): output of the last MLP layer
- \(\mathbf{h}, \mathbf{h}_{MLP}\): weight vectors of the output layer
- \(\sigma\): Sigmoid function

Loss Function of NCF

For implicit feedback (click/no-click), NCF uses binary cross-entropy loss:
\[\mathcal{L} = -\sum_{(u,i) \in \mathcal{D}} \left[ y_{ui} \log \hat{y}_{ui} + (1-y_{ui}) \log(1-\hat{y}_{ui}) \right]\]
Where \(y_{ui} \in \{0,1\}\) indicates whether user \(u\) interacted with item \(i\).

For explicit feedback (ratings), mean squared error can be used:
\[\mathcal{L} = \sum_{(u,i) \in \mathcal{D}} (r_{ui} - \hat{r}_{ui})^2\]

Complete NCF Implementation

```python
import torch
import torch.nn as nn

class GMF(nn.Module):
    """Generalized Matrix Factorization"""

    def __init__(self, num_users, num_items, embedding_dim):
        super(GMF, self).__init__()
        self.user_embedding = nn.Embedding(num_users, embedding_dim)
        self.item_embedding = nn.Embedding(num_items, embedding_dim)
        self.output_layer = nn.Linear(embedding_dim, 1)

        # Initialization
        nn.init.normal_(self.user_embedding.weight, std=0.01)
        nn.init.normal_(self.item_embedding.weight, std=0.01)

    def forward(self, user_ids, item_ids):
        user_emb = self.user_embedding(user_ids)
        item_emb = self.item_embedding(item_ids)

        # Element-wise product
        element_product = user_emb * item_emb

        # Output
        output = self.output_layer(element_product)
        return output.squeeze()

class MLP(nn.Module):
    """Multi-Layer Perceptron"""

    def __init__(self, num_users, num_items, embedding_dim, layers, dropout=0.0):
        super(MLP, self).__init__()
        self.user_embedding = nn.Embedding(num_users, embedding_dim)
        self.item_embedding = nn.Embedding(num_items, embedding_dim)

        # MLP layers
        mlp_layers = []
        input_dim = embedding_dim * 2  # Concatenate user and item Embeddings
        for output_dim in layers:
            mlp_layers.append(nn.Linear(input_dim, output_dim))
            mlp_layers.append(nn.ReLU())
            if dropout > 0:
                mlp_layers.append(nn.Dropout(dropout))
            input_dim = output_dim
        self.mlp = nn.Sequential(*mlp_layers)

        # Output layer
        self.output_layer = nn.Linear(layers[-1], 1)

        # Initialization
        nn.init.normal_(self.user_embedding.weight, std=0.01)
        nn.init.normal_(self.item_embedding.weight, std=0.01)

    def forward(self, user_ids, item_ids):
        user_emb = self.user_embedding(user_ids)
        item_emb = self.item_embedding(item_ids)

        # Concatenate
        concat = torch.cat([user_emb, item_emb], dim=1)

        # MLP
        mlp_output = self.mlp(concat)

        # Output
        output = self.output_layer(mlp_output)
        return output.squeeze()

class NeuMF(nn.Module):
    """Neural Matrix Factorization"""

    def __init__(self, num_users, num_items, embedding_dim, mlp_layers, dropout=0.0):
        super(NeuMF, self).__init__()
        self.embedding_dim = embedding_dim

        # GMF part
        self.gmf_user_embedding = nn.Embedding(num_users, embedding_dim)
        self.gmf_item_embedding = nn.Embedding(num_items, embedding_dim)

        # MLP part
        self.mlp_user_embedding = nn.Embedding(num_users, embedding_dim)
        self.mlp_item_embedding = nn.Embedding(num_items, embedding_dim)

        # MLP network
        mlp_modules = []
        input_dim = embedding_dim * 2
        for output_dim in mlp_layers:
            mlp_modules.append(nn.Linear(input_dim, output_dim))
            mlp_modules.append(nn.ReLU())
            if dropout > 0:
                mlp_modules.append(nn.Dropout(dropout))
            input_dim = output_dim
        self.mlp = nn.Sequential(*mlp_modules)

        # Output layer (fuses the GMF vector with the last MLP layer)
        self.output_layer = nn.Linear(embedding_dim + mlp_layers[-1], 1)

        # Initialization
        self._init_weights()

    def _init_weights(self):
        nn.init.normal_(self.gmf_user_embedding.weight, std=0.01)
        nn.init.normal_(self.gmf_item_embedding.weight, std=0.01)
        nn.init.normal_(self.mlp_user_embedding.weight, std=0.01)
        nn.init.normal_(self.mlp_item_embedding.weight, std=0.01)

    def forward(self, user_ids, item_ids):
        # GMF part
        gmf_user_emb = self.gmf_user_embedding(user_ids)
        gmf_item_emb = self.gmf_item_embedding(item_ids)
        gmf_output = gmf_user_emb * gmf_item_emb

        # MLP part
        mlp_user_emb = self.mlp_user_embedding(user_ids)
        mlp_item_emb = self.mlp_item_embedding(item_ids)
        mlp_concat = torch.cat([mlp_user_emb, mlp_item_emb], dim=1)
        mlp_output = self.mlp(mlp_concat)

        # Fusion
        concat = torch.cat([gmf_output, mlp_output], dim=1)
        output = self.output_layer(concat)

        return torch.sigmoid(output.squeeze())

# Usage example
num_users = 10000
num_items = 5000
embedding_dim = 64
mlp_layers = [128, 64, 32]

model = NeuMF(num_users, num_items, embedding_dim, mlp_layers, dropout=0.2)

# Training data
user_ids = torch.LongTensor([0, 1, 2, 3, 4])
item_ids = torch.LongTensor([10, 20, 30, 40, 50])
labels = torch.FloatTensor([1, 1, 0, 1, 0])  # Click/no-click

# Forward pass
predictions = model(user_ids, item_ids)
print(f"Predictions: {predictions}")

# Loss calculation
criterion = nn.BCELoss()
loss = criterion(predictions, labels)
print(f"Loss: {loss.item()}")
```

NCF Training Tips

1. Negative Sampling:
- For implicit feedback, negative samples (non-clicks) far outnumber positive samples (clicks)
- Negative sampling is needed to balance the positive/negative ratio
- Common ratios: 1:1 to 1:4 (positive:negative)

2. Learning Rate Scheduling:
- Initial learning rate: 0.001-0.01
- Use learning rate decay (e.g., halve every 10 epochs)
- Or use adaptive optimizers (Adam, AdamW)

3. Regularization:
- L2 regularization to prevent overfitting
- Dropout in the MLP layers, with a rate of 0.2-0.5
- Early stopping based on validation performance

4. Pre-training:
- First pre-train GMF and MLP separately
- Then jointly train them inside NeuMF (a weight-loading sketch follows)
- This can accelerate convergence and improve performance
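A minimal sketch of that hand-off, assuming the GMF, MLP, and NeuMF classes defined above were built with matching dimensions; the alpha-weighted fusion of the two output layers mirrors the scheme described in the original NCF paper:

```python
import torch

def load_pretrained_into_neumf(neumf, gmf, mlp, alpha=0.5):
    """Copy pre-trained GMF/MLP weights into a NeuMF model (sketch).

    Assumes `gmf`, `mlp`, `neumf` are instances of the classes above with
    the same embedding_dim and mlp_layers; alpha weights the two heads.
    """
    # Embedding tables
    neumf.gmf_user_embedding.weight.data.copy_(gmf.user_embedding.weight.data)
    neumf.gmf_item_embedding.weight.data.copy_(gmf.item_embedding.weight.data)
    neumf.mlp_user_embedding.weight.data.copy_(mlp.user_embedding.weight.data)
    neumf.mlp_item_embedding.weight.data.copy_(mlp.item_embedding.weight.data)

    # MLP tower (identical layer layout in both models)
    neumf.mlp.load_state_dict(mlp.mlp.state_dict())

    # Output layer: concatenate the two pre-trained heads, weighted by alpha
    fused_weight = torch.cat([alpha * gmf.output_layer.weight.data,
                              (1 - alpha) * mlp.output_layer.weight.data], dim=1)
    neumf.output_layer.weight.data.copy_(fused_weight)
    neumf.output_layer.bias.data.copy_(
        alpha * gmf.output_layer.bias.data + (1 - alpha) * mlp.output_layer.bias.data)
```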

AutoEncoder Recommendations: CDAE and VAE

Basic Idea of AutoEncoder

AutoEncoder is an unsupervised learning model that attempts to learn low-dimensional representations (encoding) of data, then reconstructs the original data (decoding) from the low-dimensional representation.

In recommendation systems, AutoEncoders can be used for:
1. Dimensionality Reduction: compress the high-dimensional user-item interaction matrix into a low-dimensional space
2. Denoising: recover complete user preferences from sparse, noisy interaction data
3. Generation: generate items users might be interested in

CDAE: Collaborative Denoising Auto-Encoder

CDAE (Collaborative Denoising Auto-Encoder), proposed in 2016, takes a user's interaction history as input and reconstructs the user's complete preferences through a denoising autoencoder.

Model Architecture:
- Input Layer: user interaction vector \(\mathbf{x}_u \in \{0,1\}^n\) (\(n\) is the number of items)
- Encoder Layer: \(\mathbf{h}_u = \sigma(\mathbf{W} \mathbf{x}_u + \mathbf{V} \mathbf{p}_u + \mathbf{b})\), where \(\mathbf{p}_u\) is the user Embedding
- Decoder Layer: \(\hat{\mathbf{x}}_u = \sigma(\mathbf{W}' \mathbf{h}_u + \mathbf{b}')\)
- Loss Function: reconstruction error \(\mathcal{L} = \sum_{u} \|\mathbf{x}_u - \hat{\mathbf{x}}_u\|^2\)

Denoising Mechanism:
- Randomly zero out part of the input during training (input dropout)
- Forces the model to recover complete information from partial information
- Improves model robustness

Complete CDAE Implementation

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CDAE(nn.Module):
    """Collaborative Denoising Auto-Encoder"""

    def __init__(self, num_users, num_items, hidden_dim, corruption_ratio=0.5):
        """
        Args:
            num_users: Number of users
            num_items: Number of items
            hidden_dim: Hidden layer dimension
            corruption_ratio: Noise ratio (proportion of the input that is dropped)
        """
        super(CDAE, self).__init__()
        self.num_users = num_users
        self.num_items = num_items
        self.hidden_dim = hidden_dim
        self.corruption_ratio = corruption_ratio

        # User Embedding
        self.user_embedding = nn.Embedding(num_users, hidden_dim)

        # Encoder layer: item interactions -> hidden layer
        self.encoder = nn.Linear(num_items, hidden_dim)

        # Decoder layer: hidden layer -> item interactions
        self.decoder = nn.Linear(hidden_dim, num_items)

        # Initialization
        nn.init.xavier_uniform_(self.user_embedding.weight)
        nn.init.xavier_uniform_(self.encoder.weight)
        nn.init.xavier_uniform_(self.decoder.weight)

    def forward(self, user_ids, user_items, training=True):
        """
        Args:
            user_ids: User IDs, shape: [batch_size]
            user_items: User interaction vectors, shape: [batch_size, num_items]
            training: Whether in training mode (controls whether noise is added)
        Returns:
            reconstructed: Reconstructed interaction vectors, shape: [batch_size, num_items]
        """
        # Denoising: randomly drop part of the input during training
        if training and self.corruption_ratio > 0:
            # Create a mask that randomly zeroes out some positions
            mask = torch.rand_like(user_items) > self.corruption_ratio
            corrupted_input = user_items * mask.float()
        else:
            corrupted_input = user_items

        # User Embedding
        user_emb = self.user_embedding(user_ids)  # [batch_size, hidden_dim]

        # Encoding: interaction vector -> hidden layer
        encoded = self.encoder(corrupted_input)  # [batch_size, hidden_dim]

        # Fuse the user Embedding with the encoding result
        hidden = F.relu(encoded + user_emb)  # [batch_size, hidden_dim]

        # Decoding: hidden layer -> reconstructed interaction vector
        reconstructed = torch.sigmoid(self.decoder(hidden))  # [batch_size, num_items]

        return reconstructed

    def predict(self, user_ids, user_items):
        """Predict user scores for all items"""
        self.eval()
        with torch.no_grad():
            predictions = self.forward(user_ids, user_items, training=False)
        return predictions

# Usage example
num_users = 1000
num_items = 500
hidden_dim = 128

model = CDAE(num_users, num_items, hidden_dim, corruption_ratio=0.5)

# Training data: a small random binary interaction matrix (placeholder data)
user_ids = torch.LongTensor([0, 1, 2, 3, 4])
user_items = (torch.rand(5, num_items) < 0.1).float()  # Shape: [5, 500], ~10% interactions

# Forward pass
reconstructed = model(user_ids, user_items, training=True)
print(f"Reconstructed shape: {reconstructed.shape}")

# Loss calculation (only over interacted positions)
mask = (user_items > 0).float()
loss = F.mse_loss(reconstructed * mask, user_items * mask)
print(f"Loss: {loss.item()}")

# Prediction: recommend Top-K items
predictions = model.predict(user_ids, user_items)
top_k = 10
top_items = torch.topk(predictions[0], top_k).indices
print(f"Top-{top_k} recommended items for user 0: {top_items}")
```

VAE: Variational Auto-Encoder

VAE (Variational Auto-Encoder), proposed in 2013, is a generative model that gives the AutoEncoder a probabilistic formulation: it learns the latent distribution of the data and can generate new samples from it.

In recommendation systems, VAE can be used for:
1. Generative Recommendations: sample from a user's latent distribution to generate items of interest
2. Uncertainty Modeling: predict not only ratings but also their uncertainty
3. Diverse Recommendations: increase recommendation diversity through sampling

Mathematical Principles of VAE:
- Encoder: learns the posterior distribution \(q_\phi(\mathbf{z}|\mathbf{x})\), where \(\mathbf{z}\) is the latent variable
- Decoder: learns the generative distribution \(p_\theta(\mathbf{x}|\mathbf{z})\)
- Loss Function: the ELBO (Evidence Lower BOund)
\[\mathcal{L} = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - \text{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))\]

VAE Recommendation Model (Mult-VAE)

Mult-VAE, proposed in 2018, is a VAE recommendation model that assumes user interaction vectors follow a multinomial distribution.

Model Architecture:
- Encoder: \(\mathbf{z}_u \sim \mathcal{N}(\boldsymbol{\mu}_u, \text{diag}(\boldsymbol{\sigma}_u^2))\), with \(\boldsymbol{\mu}_u = \mathbf{W}_\mu \mathbf{h}_u + \mathbf{b}_\mu\) and \(\log \boldsymbol{\sigma}_u^2 = \mathbf{W}_\sigma \mathbf{h}_u + \mathbf{b}_\sigma\), where \(\mathbf{h}_u\) is the encoding of the user interaction vector
- Sampling (reparameterization): \(\mathbf{z}_u = \boldsymbol{\mu}_u + \boldsymbol{\sigma}_u \odot \boldsymbol{\epsilon}\), where \(\boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I})\)
- Decoder: \(\hat{\mathbf{x}}_u = \text{softmax}(\mathbf{W}_d \mathbf{z}_u + \mathbf{b}_d)\)

Complete Mult-VAE Implementation

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultVAE(nn.Module):
    """Multinomial Variational Auto-Encoder for Recommendation"""

    def __init__(self, num_items, hidden_dims, latent_dim, dropout=0.5):
        """
        Args:
            num_items: Number of items
            hidden_dims: Hidden layer dimensions shared by encoder and decoder, e.g., [600, 200]
            latent_dim: Latent variable dimension
            dropout: Dropout ratio
        """
        super(MultVAE, self).__init__()
        self.num_items = num_items
        self.latent_dim = latent_dim

        # Encoder
        encoder_layers = []
        input_dim = num_items
        for hidden_dim in hidden_dims:
            encoder_layers.append(nn.Linear(input_dim, hidden_dim))
            encoder_layers.append(nn.Tanh())
            encoder_layers.append(nn.Dropout(dropout))
            input_dim = hidden_dim
        self.encoder = nn.Sequential(*encoder_layers)

        # Mean and log variance of the latent variables
        self.mu_layer = nn.Linear(hidden_dims[-1], latent_dim)
        self.logvar_layer = nn.Linear(hidden_dims[-1], latent_dim)

        # Decoder (mirror of the encoder)
        decoder_layers = []
        input_dim = latent_dim
        for hidden_dim in reversed(hidden_dims):
            decoder_layers.append(nn.Linear(input_dim, hidden_dim))
            decoder_layers.append(nn.Tanh())
            decoder_layers.append(nn.Dropout(dropout))
            input_dim = hidden_dim
        self.decoder = nn.Sequential(*decoder_layers)

        # Output layer
        self.output_layer = nn.Linear(hidden_dims[0], num_items)

        # Initialization
        self._init_weights()

    def _init_weights(self):
        for layer in self.modules():
            if isinstance(layer, nn.Linear):
                nn.init.xavier_uniform_(layer.weight)
                nn.init.zeros_(layer.bias)

    def encode(self, user_items):
        """Encode: user interaction vector -> latent variable distribution"""
        h = self.encoder(user_items)
        mu = self.mu_layer(h)
        logvar = self.logvar_layer(h)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        """Reparameterization trick: z = mu + sigma * eps"""
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        """Decode: latent variable -> reconstructed interaction logits"""
        h = self.decoder(z)
        logits = self.output_layer(h)
        return logits

    def forward(self, user_items, beta=1.0):
        """
        Args:
            user_items: User interaction vectors, shape: [batch_size, num_items]
            beta: Weight of the KL divergence (as in beta-VAE)
        Returns:
            logits: Reconstructed interaction logits
            mu: Mean of the latent variables
            logvar: Log variance of the latent variables
            kl_loss: KL divergence loss
        """
        # Encoding
        mu, logvar = self.encode(user_items)

        # Reparameterization
        z = self.reparameterize(mu, logvar)

        # Decoding
        logits = self.decode(z)

        # KL divergence loss
        kl_loss = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
        kl_loss = beta * kl_loss.mean()

        return logits, mu, logvar, kl_loss

    def predict(self, user_items):
        """Predict scores for all items (uses the posterior mean at inference time)"""
        self.eval()
        with torch.no_grad():
            mu, _ = self.encode(user_items)
            logits = self.decode(mu)  # deterministic: skip sampling when serving
            # Mask already-interacted items so they are not recommended again
            predictions = logits.clone()
            predictions[user_items > 0] = float('-inf')
            return predictions

# Usage example
num_items = 500
hidden_dims = [600, 200]
latent_dim = 50

model = MultVAE(num_items, hidden_dims, latent_dim, dropout=0.5)

# Training data: a small random binary interaction matrix (placeholder data)
user_items = (torch.rand(3, num_items) < 0.1).float()  # Shape: [3, 500]

# Forward pass
logits, mu, logvar, kl_loss = model(user_items, beta=0.2)

# Reconstruction loss (log-likelihood of the multinomial distribution)
reconstruction_loss = -torch.sum(
    F.log_softmax(logits, dim=1) * user_items, dim=1
).mean()

# Total loss
total_loss = reconstruction_loss + kl_loss
print(f"Reconstruction loss: {reconstruction_loss.item()}")
print(f"KL loss: {kl_loss.item()}")
print(f"Total loss: {total_loss.item()}")

# Prediction
predictions = model.predict(user_items)
top_k = 10
top_items = torch.topk(predictions[0], top_k).indices
print(f"Top-{top_k} recommended items: {top_items}")
```

CDAE vs VAE Comparison

| Feature | CDAE | VAE |
| --- | --- | --- |
| Model Type | Deterministic autoencoder | Probabilistic generative model |
| Latent Variable | Fixed vector | Probability distribution |
| Generation Capability | Weak (can only reconstruct) | Strong (can sample and generate) |
| Uncertainty | Cannot model | Can model |
| Training Difficulty | Simple | More complex (requires KL divergence) |
| Recommendation Diversity | Lower | Higher (through sampling) |
| Applicable Scenarios | Dense interaction data | Sparse interaction data |

Wide & Deep Model

Background of Wide & Deep

In 2016, Google proposed the Wide & Deep model in Google Play's recommendation system. The core idea of this model is: combining memorization and generalization.

  • Memorization (Wide part): Learns direct associations between features, such as "users who installed Pandora also installed YouTube"
  • Generalization (Deep part): Learns Embedding representations of features, capturing latent associations between sparse features

Wide & Deep Model Architecture

The Wide & Deep model contains two components:

1. Wide Part (Linear Model):
- Input: raw features and cross features (e.g., "user age × item category")
- Output: \(\hat{y}_{wide} = \mathbf{w}^T \mathbf{x} + b\)
- Role: memorize feature combinations observed in historical data

2. Deep Part (Deep Neural Network):
- Input: Embedding vectors of the sparse features
- Structure: multi-layer fully connected network
- Output: \(\hat{y}_{deep} = \text{MLP}(\text{Embedding}(\mathbf{x}))\)
- Role: generalize to unseen feature combinations

3. Fusion:
- Final output: \(\hat{y} = \sigma(\hat{y}_{wide} + \hat{y}_{deep})\)
- Where \(\sigma\) is the Sigmoid function (for CTR prediction)

Mathematical Formulation of Wide & Deep

The complete Wide & Deep model can be expressed as:
\[\hat{y} = \sigma(\mathbf{w}_{wide}^T [\mathbf{x}, \phi(\mathbf{x})] + \mathbf{w}_{deep}^T \mathbf{a}^{(L)} + b)\]
Where:
- \(\mathbf{x}\): raw features
- \(\phi(\mathbf{x})\): cross features (e.g., \(\phi(\mathbf{x}) = [x_i \cdot x_j]\))
- \(\mathbf{a}^{(L)}\): output of the last layer of the Deep part
- \(\mathbf{w}_{wide}, \mathbf{w}_{deep}, b\): learnable parameters

Computation process of the Deep part:
- \(\mathbf{a}^{(0)} = \text{Embedding}(\mathbf{x})\)
- \(\mathbf{a}^{(l)} = \text{ReLU}(\mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)})\), for \(l = 1, 2, \dots, L\)
- Where \(L\) is the number of layers

Complete Wide & Deep Implementation

```python
import torch
import torch.nn as nn

class WideAndDeep(nn.Module):
    """Wide & Deep model"""

    def __init__(self,
                 field_dims,      # Number of categories for each field
                 embedding_dim,   # Embedding dimension
                 deep_layers,     # Hidden layer dimensions of the Deep part
                 dropout=0.0):
        super(WideAndDeep, self).__init__()
        self.field_dims = field_dims
        self.num_fields = len(field_dims)
        self.embedding_dim = embedding_dim

        # Wide part: linear layer (with bias)
        # Input dimension = raw features (+ cross features in a full system;
        # simplified here to raw features only)
        self.wide_linear = nn.Linear(sum(field_dims), 1)

        # Deep part: Embedding layers
        self.embeddings = nn.ModuleList([
            nn.Embedding(field_dim, embedding_dim)
            for field_dim in field_dims
        ])

        # Deep part: MLP
        deep_input_dim = self.num_fields * embedding_dim
        deep_layers_list = []
        for deep_dim in deep_layers:
            deep_layers_list.append(nn.Linear(deep_input_dim, deep_dim))
            deep_layers_list.append(nn.ReLU())
            if dropout > 0:
                deep_layers_list.append(nn.Dropout(dropout))
            deep_input_dim = deep_dim
        self.deep_mlp = nn.Sequential(*deep_layers_list)

        # Output layer of the Deep part
        self.deep_output = nn.Linear(deep_layers[-1], 1)

        # Initialization
        self._init_weights()

    def _init_weights(self):
        # Wide part: Xavier initialization
        nn.init.xavier_uniform_(self.wide_linear.weight)
        nn.init.zeros_(self.wide_linear.bias)

        # Deep part: Embeddings and MLP
        for embedding in self.embeddings:
            nn.init.xavier_uniform_(embedding.weight)

        for layer in self.deep_mlp:
            if isinstance(layer, nn.Linear):
                nn.init.xavier_uniform_(layer.weight)
                nn.init.zeros_(layer.bias)

        nn.init.xavier_uniform_(self.deep_output.weight)
        nn.init.zeros_(self.deep_output.bias)

    def forward(self, x_wide, x_deep):
        """
        Args:
            x_wide: Wide-part input (one-hot encoding), shape: [batch_size, sum(field_dims)]
            x_deep: Deep-part input (field indices), shape: [batch_size, num_fields]
        Returns:
            output: Predicted values, shape: [batch_size]
        """
        # Wide part
        wide_output = self.wide_linear(x_wide)  # [batch_size, 1]

        # Deep part: Embedding
        deep_embeddings = []
        for i in range(self.num_fields):
            deep_embeddings.append(self.embeddings[i](x_deep[:, i]))
        deep_concat = torch.cat(deep_embeddings, dim=1)  # [batch_size, num_fields * embedding_dim]

        # Deep part: MLP
        deep_output = self.deep_mlp(deep_concat)
        deep_output = self.deep_output(deep_output)  # [batch_size, 1]

        # Fusion
        output = wide_output + deep_output
        output = torch.sigmoid(output.squeeze())  # [batch_size]

        return output

# Usage example
field_dims = [10000, 1000, 50, 20]  # User ID, Item ID, Category, City
embedding_dim = 32
deep_layers = [128, 64, 32]

model = WideAndDeep(field_dims, embedding_dim, deep_layers, dropout=0.2)

# Deep part input: field indices (dense)
x_deep = torch.LongTensor([
    [123, 456, 5, 10],  # User 123, Item 456, Category 5, City 10
    [124, 457, 5, 11],
    [125, 458, 6, 10],
    [126, 459, 6, 12]
])  # Shape: [4, 4]

# Wide part input: one-hot encoding (sparse), built from the same indices
# by offsetting each field into the concatenated one-hot vector
batch_size = x_deep.size(0)
offsets = torch.LongTensor([0, 10000, 11000, 11050])  # cumulative sums of field_dims
x_wide = torch.zeros(batch_size, sum(field_dims))
x_wide.scatter_(1, x_deep + offsets, 1.0)

# Forward pass
predictions = model(x_wide, x_deep)
print(f"Predictions: {predictions}")

# Loss calculation
labels = torch.FloatTensor([1, 1, 0, 1])
criterion = nn.BCELoss()
loss = criterion(predictions, labels)
print(f"Loss: {loss.item()}")
```

Optimized Versions of Wide & Deep

In practical applications, Wide & Deep has several optimized versions:

1. DeepFM:
- Replaces the Wide part with an FM
- Automatically learns second-order feature interactions
- Avoids manual design of cross features

2. xDeepFM:
- Introduces a CIN (Compressed Interaction Network)
- Explicitly models high-order feature interactions
- Stronger interaction modeling capability than DeepFM

3. DCN (Deep & Cross Network):
- Replaces the Wide part with a Cross Network (sketched below)
- Automatically learns explicit feature interactions up to a bounded order
- High computational efficiency
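To illustrate the Cross Network idea behind DCN, here is a minimal sketch of stacked cross layers implementing \(\mathbf{x}_{l+1} = \mathbf{x}_0 (\mathbf{w}_l^T \mathbf{x}_l) + \mathbf{b}_l + \mathbf{x}_l\); the module layout and initialization here are assumptions, not the reference implementation:

```python
import torch
import torch.nn as nn

class CrossNetwork(nn.Module):
    """Stack of cross layers: x_{l+1} = x0 * (w_l . x_l) + b_l + x_l."""

    def __init__(self, input_dim, num_layers=3):
        super().__init__()
        self.w = nn.ParameterList(
            [nn.Parameter(torch.randn(input_dim) * 0.01) for _ in range(num_layers)])
        self.b = nn.ParameterList(
            [nn.Parameter(torch.zeros(input_dim)) for _ in range(num_layers)])

    def forward(self, x0):
        x = x0
        for w, b in zip(self.w, self.b):
            # (x . w) is a scalar per sample; each layer adds one interaction order
            xw = (x * w).sum(dim=1, keepdim=True)  # [batch, 1]
            x = x0 * xw + b + x                    # [batch, input_dim]
        return x

# Usage: 3 cross layers over a 128-dimensional concatenated Embedding vector
cross = CrossNetwork(input_dim=128, num_layers=3)
out = cross(torch.randn(4, 128))  # shape: [4, 128]
```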

Feature Engineering

Feature Types

Features in recommendation systems can be divided into the following categories:

1. User Features:
- User ID, age, gender, city, occupation
- User historical behavior statistics (click rate, purchase rate, average rating)
- User profile tags (interest tags, spending power)

2. Item Features:
- Item ID, category, brand, price
- Item statistical features (click rate, purchase rate, average rating)
- Item content features (text descriptions, images)

3. Context Features:
- Time features (hour, day of week, month, whether it's a holiday)
- Device features (device type, operating system, app version)
- Location features (GPS coordinates, city, business district)

4. Interaction Features:
- User-item interaction history (last N clicks, purchases)
- User-category interaction statistics (click counts per category)
- Item-user interaction statistics (profiles of users who clicked the item)

5. Cross Features:
- User features × item features (e.g., "user age × item category")
- Time features × item features (e.g., "time period × item category")
- High-order cross features (e.g., "user age × item category × time period")

Feature Encoding

1. Numerical Features:
- Standardization: \(x' = \frac{x - \mu}{\sigma}\)
- Normalization: \(x' = \frac{x - x_{min}}{x_{max} - x_{min}}\)
- Binning: discretize continuous values, e.g., age divided into "0-18, 19-30, 31-50, 50+"

2. Categorical Features:
- One-hot Encoding: one dimension per category
- Embedding Encoding: map to low-dimensional dense vectors (standard in deep learning)
- Hash Encoding: use a hash function to map categories into a fixed number of buckets

3. Sequential Features:
- Padding: pad sequences of different lengths to the same length
- Pooling: average pooling, max pooling, attention pooling (see the masked-pooling sketch below)
- RNN/Transformer: process with sequence models
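Padding and pooling interact: padded positions must be excluded from the average. A minimal sketch of masked average pooling over a padded item-ID sequence, where padding index 0 and all sizes are assumptions:

```python
import torch
import torch.nn as nn

num_items, d, pad_idx = 1000, 32, 0
item_emb = nn.Embedding(num_items, d, padding_idx=pad_idx)

# Two user histories padded to length 5 (0 = padding)
seqs = torch.LongTensor([[5, 42, 7, 0, 0],
                         [13, 0, 0, 0, 0]])  # [batch, seq_len]
emb = item_emb(seqs)                          # [batch, seq_len, d]

# Average only over real (non-padding) positions
mask = (seqs != pad_idx).unsqueeze(-1).float()                    # [batch, seq_len, 1]
pooled = (emb * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)   # [batch, d]
```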

Feature Selection

Not all features are useful; feature selection is needed:

1. Statistical Methods:
- Mutual Information: measures the dependence between a feature and the target
- Chi-square Test: tests independence between a feature and the target
- Correlation Coefficient: measures linear correlation between a feature and the target

2. Model-based Methods:
- L1 Regularization: automatically drives weights of unimportant features to zero
- Feature Importance: tree-model importances (e.g., XGBoost)
- Permutation Importance: shuffle a feature's values and observe the performance drop

3. Business Methods:
- A/B Testing: deploy features and observe metric changes
- Feature Analysis: analyze feature distributions, missing rates, coverage

Feature Engineering Code Example

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_selection import mutual_info_classif

class FeatureEngineer:
    """Feature engineering utility class"""

    def __init__(self):
        self.scalers = {}
        self.encoders = {}
        self.feature_names = []

    def process_numerical_features(self, df, numerical_cols):
        """Process numerical features: standardization"""
        processed_df = df.copy()
        for col in numerical_cols:
            scaler = StandardScaler()
            processed_df[col] = scaler.fit_transform(df[[col]])
            self.scalers[col] = scaler
        return processed_df

    def process_categorical_features(self, df, categorical_cols):
        """Process categorical features: Label Encoding"""
        processed_df = df.copy()
        for col in categorical_cols:
            encoder = LabelEncoder()
            processed_df[col] = encoder.fit_transform(df[col].astype(str))
            self.encoders[col] = encoder
        return processed_df

    def create_cross_features(self, df, field1, field2):
        """Create cross features"""
        cross_feature = f"{field1}_x_{field2}"
        df[cross_feature] = df[field1].astype(str) + "_" + df[field2].astype(str)
        return df

    def create_binning_features(self, df, numerical_col, bins):
        """Create binning features"""
        bin_feature = f"{numerical_col}_bin"
        df[bin_feature] = pd.cut(df[numerical_col], bins=bins, labels=False)
        return df

    def create_statistical_features(self, df, group_col, agg_col, agg_funcs):
        """Create statistical features (e.g., user average click rate)"""
        stats = df.groupby(group_col)[agg_col].agg(agg_funcs)
        stats.columns = [f"{group_col}_{agg_col}_{func}" for func in agg_funcs]
        df = df.merge(stats, left_on=group_col, right_index=True, how='left')
        return df

    def select_features(self, X, y, k=10):
        """Feature selection based on mutual information"""
        mi_scores = mutual_info_classif(X, y, random_state=42)
        top_k_indices = np.argsort(mi_scores)[-k:]
        return top_k_indices

# Usage example: assume we have user behavior data
data = {
    'user_id': [1, 1, 2, 2, 3, 3],
    'item_id': [10, 20, 10, 30, 20, 30],
    'category': ['A', 'B', 'A', 'C', 'B', 'C'],
    'price': [10.5, 20.3, 10.5, 15.7, 20.3, 15.7],
    'age': [25, 25, 30, 30, 35, 35],
    'click': [1, 1, 0, 1, 1, 0]
}

df = pd.DataFrame(data)

# Feature engineering
fe = FeatureEngineer()

# Create binning features on the raw ages (this must happen before
# standardization, otherwise the bin edges no longer match the scaled values)
df = fe.create_binning_features(df, 'age', bins=[0, 25, 30, 40, 100])

# Create cross features
df = fe.create_cross_features(df, 'user_id', 'category')

# Create statistical features (user average click rate)
df = fe.create_statistical_features(
    df,
    group_col='user_id',
    agg_col='click',
    agg_funcs=['mean', 'sum']
)

# Process numerical features
df = fe.process_numerical_features(df, ['price', 'age'])

# Process categorical features
df = fe.process_categorical_features(df, ['category'])

print(df.head())
```

Training Techniques

Data Preparation

1. Negative Sampling:
- For implicit feedback, negative samples far outnumber positive samples
- Negative sampling is needed to balance the positive/negative ratio
- Common strategies: random negative sampling, popularity-based negative sampling, hard negative sampling

```python
import numpy as np

def negative_sampling(user_items, num_items, num_negatives=4):
    """Sample `num_negatives` negative items for each observed (user, item) pair.

    Args:
        user_items: np.ndarray of shape [num_interactions, 2] with (user_id, item_id) rows
        num_items: total number of items
        num_negatives: negatives per positive
    """
    positive_samples = []
    negative_samples = []

    for user_id, item_id in user_items:
        # Positive sample
        positive_samples.append((user_id, item_id, 1))

        # Candidates: items this user has not interacted with
        user_interacted = set(user_items[user_items[:, 0] == user_id][:, 1])
        negative_candidates = list(set(range(num_items)) - user_interacted)

        # Random sampling without replacement
        negative_items = np.random.choice(
            negative_candidates,
            size=min(num_negatives, len(negative_candidates)),
            replace=False
        )
        for neg_item in negative_items:
            negative_samples.append((user_id, neg_item, 0))

    return positive_samples, negative_samples
```

2. Data Augmentation:
- Time Window Sliding: build training sets with different time windows
- Data Mixing: mix data from different sources
- Noise Injection: add noise during training to improve robustness

3. Data Splitting:
- Time-based Split: split training and test sets chronologically (more realistic; see the sketch below)
- Random Split: random split (may cause data leakage)
- User-based Split: split by users (avoids the same user appearing in both training and test sets)
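A minimal sketch of a time-based split with pandas, assuming an interaction log with a timestamp column; the 80/20 cutoff is an arbitrary choice:

```python
import pandas as pd

def time_based_split(df, time_col='timestamp', test_ratio=0.2):
    """Train on the oldest (1 - test_ratio) of interactions, test on the newest."""
    df = df.sort_values(time_col)
    cutoff = df[time_col].quantile(1 - test_ratio)
    train = df[df[time_col] <= cutoff]
    test = df[df[time_col] > cutoff]
    return train, test

# Usage with a toy interaction log
log = pd.DataFrame({'user_id': [1, 2, 1, 3], 'item_id': [10, 20, 30, 10],
                    'timestamp': pd.to_datetime(['2024-01-01', '2024-02-01',
                                                 '2024-03-01', '2024-04-01'])})
train, test = time_based_split(log)
```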

Model Training

1. Optimizer Selection:
- Adam/AdamW: adaptive learning rates, suitable for most scenarios
- SGD: requires manual learning-rate tuning, but may converge to better solutions
- Adagrad: suitable for sparse gradients

```python
import torch.optim as optim

# Adam optimizer (recommended)
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)

# AdamW optimizer (better weight decay)
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=1e-4)

# SGD optimizer (requires learning rate scheduling)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
```

2. Learning Rate Scheduling:
- StepLR: decay every N epochs
- ExponentialLR: exponential decay
- CosineAnnealingLR: cosine annealing
- ReduceLROnPlateau: adjust automatically based on validation performance

```python
# StepLR: halve the learning rate every 10 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

# CosineAnnealingLR: cosine annealing
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

# ReduceLROnPlateau: reduce the learning rate when validation performance stalls
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=5
)
```

3. Regularization:
- L2 Regularization: implemented via weight_decay
- Dropout: randomly zero out some neurons
- Batch Normalization: normalize activation values
- Early Stopping: stop when validation performance stops improving

```python
import torch.nn as nn

# Dropout
model = nn.Sequential(
    nn.Linear(100, 64),
    nn.ReLU(),
    nn.Dropout(0.2),  # 20% of activations are randomly zeroed
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(32, 1)
)

# Early Stopping (monitors a validation loss, where lower is better)
class EarlyStopping:
    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_loss = None

    def __call__(self, val_loss):
        """Return True when training should stop."""
        if self.best_loss is None or val_loss < self.best_loss - self.min_delta:
            # Improvement: remember it and reset the counter
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1
            if self.counter >= self.patience:
                return True
        return False
```

Training Loop Example

```python
def train_model(model, train_loader, val_loader, num_epochs=50):
    """Complete training loop"""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)

    criterion = nn.BCELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode='min', factor=0.5, patience=5
    )
    early_stopping = EarlyStopping(patience=10)

    best_val_loss = float('inf')

    for epoch in range(num_epochs):
        # Training phase
        model.train()
        train_loss = 0.0
        for batch in train_loader:
            user_ids, item_ids, labels = batch
            user_ids = user_ids.to(device)
            item_ids = item_ids.to(device)
            labels = labels.to(device)

            # Forward pass
            optimizer.zero_grad()
            predictions = model(user_ids, item_ids)
            loss = criterion(predictions, labels)

            # Backward pass
            loss.backward()
            optimizer.step()

            train_loss += loss.item()

        train_loss /= len(train_loader)

        # Validation phase
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for batch in val_loader:
                user_ids, item_ids, labels = batch
                user_ids = user_ids.to(device)
                item_ids = item_ids.to(device)
                labels = labels.to(device)

                predictions = model(user_ids, item_ids)
                loss = criterion(predictions, labels)
                val_loss += loss.item()

        val_loss /= len(val_loader)

        # Learning rate scheduling
        scheduler.step(val_loss)

        # Save the best model
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), 'best_model.pth')

        print(f"Epoch {epoch+1}/{num_epochs}: "
              f"Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")

        # Early stopping check (after saving, so the best weights are kept)
        if early_stopping(val_loss):
            print(f"Early stopping at epoch {epoch+1}")
            break

    return model
```

Evaluation Metrics

1. Classification Tasks (CTR Prediction):
- AUC: area under the ROC curve, measures ranking capability
- LogLoss: logarithmic loss, measures the accuracy of predicted probabilities
- Precision@K: proportion of positive samples among the Top-K recommendations
- Recall@K: coverage of positive samples by the Top-K recommendations

2. Regression Tasks (Rating Prediction):
- RMSE: root mean squared error
- MAE: mean absolute error
- NDCG: Normalized Discounted Cumulative Gain (a ranking metric)

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

def evaluate_model(model, test_loader):
    """Evaluate model"""
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    model.eval()

    all_predictions = []
    all_labels = []

    with torch.no_grad():
        for batch in test_loader:
            user_ids, item_ids, labels = batch
            user_ids = user_ids.to(device)
            item_ids = item_ids.to(device)

            predictions = model(user_ids, item_ids)
            all_predictions.extend(predictions.cpu().numpy())
            all_labels.extend(labels.numpy())

    # Calculate metrics
    auc = roc_auc_score(all_labels, all_predictions)
    logloss = log_loss(all_labels, all_predictions)

    # Top-K metrics (a global ranking over the whole test set -- a simplification;
    # production systems usually compute these per user and then average)
    k = 10
    sorted_indices = np.argsort(all_predictions)[::-1]
    top_k_labels = [all_labels[i] for i in sorted_indices[:k]]
    precision_k = sum(top_k_labels) / k
    recall_k = sum(top_k_labels) / sum(all_labels)

    return {
        'AUC': auc,
        'LogLoss': logloss,
        f'Precision@{k}': precision_k,
        f'Recall@{k}': recall_k
    }
```
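NDCG, listed above but not computed in evaluate_model, takes only a few extra lines. A minimal sketch for binary relevance; like the Precision@K above, it ranks globally rather than per user, which is a simplification:

```python
import numpy as np

def ndcg_at_k(labels, predictions, k=10):
    """NDCG@k for binary relevance: DCG of the predicted order / DCG of the ideal order."""
    order = np.argsort(predictions)[::-1][:k]
    gains = np.asarray(labels, dtype=float)[order]
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))  # 1/log2(rank+1)
    dcg = float((gains * discounts).sum())

    ideal = np.sort(np.asarray(labels, dtype=float))[::-1][:k]
    idcg = float((ideal * discounts[:len(ideal)]).sum())
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([1, 0, 1, 1, 0], [0.9, 0.8, 0.7, 0.2, 0.1], k=3))  # ~0.70
```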

Q&A: Common Questions

Q1: How to Choose Embedding Dimensions?

A: The choice of Embedding dimension \(d\) requires balancing model capacity against computational cost:

  • Small-scale data (<100K users/items): \(d = 8\)-\(16\) is sufficient
  • Medium-scale data (100K-1M): \(d = 32\)-\(64\) is common
  • Large-scale data (>1M): \(d = 64\)-\(128\), or even 256

Rule of thumb:
1. Start with \(d = 32\), and gradually increase to 64, then 128
2. Watch validation performance; if the improvement is under 1%, stop increasing
3. Consider computational resources: doubling \(d\) doubles the Embedding parameter count

Q2: Why Does NCF Perform Better Than Matrix Factorization?

A: NCF's advantages mainly lie in:

  1. Nonlinear Modeling: Matrix factorization can only capture linear relationships (inner products), while NCF can capture nonlinear relationships through MLP
  2. Feature Fusion: NCF's GMF and MLP parts complement each other; GMF captures simple interactions, MLP captures complex interactions
  3. End-to-End Training: The entire model can be jointly optimized, while matrix factorization typically requires alternating optimization

However, NCF also has disadvantages:
- Higher computational complexity (a full forward pass is needed per prediction)
- Poorer interpretability (black-box model)
- Requires more data to train well

Q3: What's the Difference Between CDAE and VAE?

A: Main differences:

  1. Model Type:
    • CDAE: Deterministic autoencoder, latent variable is a fixed vector
    • VAE: Probabilistic generative model, latent variable is a probability distribution
  2. Generation Capability:
    • CDAE: Can only reconstruct input, cannot generate new samples
    • VAE: Can sample from latent distribution to generate new samples
  3. Uncertainty:
    • CDAE: Cannot model uncertainty
    • VAE: Can model uncertainty through variance of latent distribution
  4. Training:
    • CDAE: Simple training, only needs reconstruction loss
    • VAE: Requires KL divergence term, more complex training
  5. Recommendation Diversity:
    • CDAE: Recommendation results are relatively fixed
    • VAE: Can increase diversity through sampling

Q4: What Are the Respective Roles of Wide and Deep Parts in Wide & Deep?

A:

Wide Part (Memorization):

  • Learns direct associations between features
  • Example: "Users who installed Pandora also installed YouTube"
  • Suitable for handling sparse, high-dimensional cross features
  • Can quickly memorize patterns in historical data

Deep Part (Generalization):

  • Learns Embedding representations of features
  • Captures latent associations between sparse features
  • Can generalize to unseen feature combinations
  • Suitable for handling dense Embedding features

Why Combine Both:

  • Wide only: cannot generalize, can only memorize historical data
  • Deep only: may over-generalize and ignore important direct associations
  • Wide + Deep: both memorizes and generalizes, achieving the best results (a minimal sketch follows)
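A minimal sketch of the idea (layer sizes and feature handling are illustrative, not the paper's production setup):

import torch
import torch.nn as nn

class TinyWideDeep(nn.Module):
    def __init__(self, num_cross_features, vocab_size, emb_dim=16):
        super().__init__()
        # Wide: a linear model over sparse cross features (memorization)
        self.wide = nn.Linear(num_cross_features, 1)
        # Deep: Embedding + MLP over id features (generalization)
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.deep = nn.Sequential(nn.Linear(emb_dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, cross_x, ids):
        wide_logit = self.wide(cross_x)               # memorizes direct associations
        deep_logit = self.deep(self.embedding(ids))   # generalizes via Embeddings
        return torch.sigmoid(wide_logit + deep_logit) # joint prediction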

Q5: How to Handle Cold Start Problems?

A: Cold start is a classic problem in recommendation systems, with the following solutions:

1. New User Cold Start:

  • Popular Recommendations: Recommend popular items
  • Content-based Recommendations: Recommend based on user registration information (age, gender, etc.)
  • Transfer Learning: Transfer preferences from similar users
  • Multi-armed Bandit: Balance exploration and exploitation

2. New Item Cold Start:

  • Content Features: Recommend to similar users based on item attributes (category, tags)
  • Embedding Pre-training: Pre-train Embeddings using item content features (see the sketch below)
  • Collaborative Filtering: Based on interaction data of similar items

3. System Cold Start:

  • External Data: Leverage data from other platforms
  • Expert Rules: Manually designed recommendation rules
  • A/B Testing: Rapid iterative optimization
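As one concrete instance of Embedding pre-training for new items, a new item's vector can be warm-started from the mean of content-similar items. This is a hypothetical helper: `item_emb_table` is assumed to be an nn.Embedding, and `similar_item_ids` is assumed to come from content matching (same category/tags):

import torch

def init_new_item_embedding(item_emb_table, similar_item_ids):
    """Warm-start a new item's Embedding from content-similar items;
    training refines it once interaction data arrives."""
    with torch.no_grad():
        neighbors = item_emb_table.weight[similar_item_ids]  # (n_similar, d)
        return neighbors.mean(dim=0)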

Q6: How to Choose Negative Sampling Strategies?

A: Negative sampling strategies affect model performance:

1. Random Negative Sampling:

  • Simplest: randomly sample from all non-interacted items
  • Suitable for most scenarios
  • May sample items that users "aren't interested in but don't dislike"

2. Popularity-based Negative Sampling:

  • Sample negatives from popular items
  • Assumes that a user not clicking a popular item signals dislike
  • May introduce popularity bias

3. Hard Negative Sampling:

  • Sample negatives that receive high model prediction scores
  • Forces the model to distinguish "easily confused" positives and negatives
  • Improves performance but requires dynamic sampling (the model changes during training)

4. Mixed Strategy:

  • 50% random + 50% popular
  • Or adjust by training phase: random sampling early, hard negative sampling later (a sketch of the first two strategies follows)
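A minimal sketch of the first two strategies under stated assumptions: `pos_items` is a set of the user's observed positives, and `popularity` is a numpy array of per-item interaction counts:

import numpy as np

def sample_negatives(pos_items, num_items, popularity, n_neg=4, strategy='random'):
    """Draw n_neg negatives for one user, skipping observed positives."""
    rng = np.random.default_rng()
    probs = popularity / popularity.sum() if strategy == 'popular' else None
    negatives = []
    while len(negatives) < n_neg:
        cand = int(rng.choice(num_items, p=probs))  # uniform when probs is None
        if cand not in pos_items:                   # pos_items should be a set for speed
            negatives.append(cand)
    return negatives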

Q7: How to Prevent Overfitting?

A: Methods to prevent overfitting:

1. Regularization:

  • L2 Regularization: Implemented via weight_decay, typically 1e-5 to 1e-3
  • Dropout: Randomly zero out some neurons, dropout rate 0.2–0.5
  • Batch Normalization: Normalize activations to stabilize training

2. Data Augmentation:

  • Negative Sampling: Increase the number of negative samples
  • Noise Injection: Add noise during training
  • Data Mixing: Mix data from different sources

3. Model Complexity Control:

  • Reduce Layers: Start with deep networks and gradually reduce
  • Reduce Embedding Dimensions: Lower model capacity
  • Early Stopping: Stop training when validation performance stops improving (see the sketch below)

4. Cross Validation:

  • Use K-fold cross validation to evaluate models
  • Avoids the randomness of a single split
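A sketch combining L2 regularization, Dropout, and early stopping; `model`, `train_one_epoch`, and `validate` are assumed to be defined elsewhere, and the hyperparameter values are typical starting points rather than recommendations:

import torch
import torch.nn as nn

# L2 regularization via weight_decay; Dropout layers go between MLP layers
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
dropout = nn.Dropout(p=0.3)

best_auc, patience, bad_epochs = 0.0, 3, 0
for epoch in range(50):
    train_one_epoch(model, optimizer)
    val_auc = validate(model)
    if val_auc > best_auc:
        best_auc, bad_epochs = val_auc, 0
        torch.save(model.state_dict(), 'best.pt')  # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                                   # validation stopped improving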

Q8: How to Accelerate Model Training?

A: Methods to accelerate training:

1. Hardware Acceleration:

  • GPU: CUDA acceleration, often 10–100x over CPU
  • Multi-GPU: Data parallelism or model parallelism
  • TPU: Google's specialized accelerator, suited to large-scale training

2. Data Optimization:

  • Data Preprocessing: Precompute features ahead of time to avoid doing the work during training
  • Data Loading: Use a multi-process DataLoader (num_workers > 0)
  • Batch Size: Increase the batch size to improve GPU utilization

3. Model Optimization:

  • Mixed Precision Training: Use FP16, roughly 2x speedup (see the sketch below)
  • Gradient Accumulation: Simulate large-batch training
  • Model Pruning: Reduce model parameters

4. Algorithm Optimization:

  • Learning Rate Scheduling: Use warmup to accelerate convergence
  • Optimizer Selection: Adam usually converges faster than SGD
  • Asynchronous Training: Asynchronous updates across machines and GPUs
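A mixed-precision training sketch using PyTorch's AMP utilities; `model` (assumed to return raw logits), `optimizer`, and `train_loader` are assumed to exist:

import torch
import torch.nn.functional as F

scaler = torch.cuda.amp.GradScaler()    # scales the loss to avoid FP16 underflow

for user_ids, item_ids, labels in train_loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():     # forward pass runs in mixed precision
        logits = model(user_ids.cuda(), item_ids.cuda())
        loss = F.binary_cross_entropy_with_logits(logits, labels.cuda().float())
    scaler.scale(loss).backward()       # backward on the scaled loss
    scaler.step(optimizer)              # unscales gradients, then steps
    scaler.update()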

Q9: How to Evaluate Recommendation System Effectiveness?

A: Recommendation system evaluation requires multi-dimensional metrics:

1. Offline Metrics:

  • Accuracy Metrics: AUC, LogLoss, RMSE, MAE
  • Ranking Metrics: NDCG, MRR, MAP
  • Coverage Metrics: Coverage (diversity of recommended items)
  • Diversity Metrics: Intra-list Diversity (differences between items within a recommendation list)

2. Online Metrics:

  • CTR: Click-through rate
  • CVR: Conversion rate (purchase/download)
  • GMV: Gross merchandise value
  • User Retention Rate: Proportion of returning users

3. Business Metrics:

  • User Satisfaction: Ratings and feedback
  • Long-tail Recommendations: Whether unpopular items get recommended
  • Real-time Performance: Recommendation response time

4. A/B Testing:

  • Compare the effects of old and new models
  • Requires sufficient sample size (typically >1000 users)
  • Watch for statistical significance (an NDCG@K sketch follows)
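As a concrete example of a ranking metric, a minimal NDCG@K implementation (the input is a 0/1 relevance list already ordered by descending model score):

import numpy as np

def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@K for one user's ranked list."""
    rel = np.asarray(ranked_relevances, dtype=float)
    n = min(k, rel.size)
    discounts = 1.0 / np.log2(np.arange(2, n + 2))     # 1 / log2(rank + 1)
    dcg = (rel[:n] * discounts).sum()
    idcg = (np.sort(rel)[::-1][:n] * discounts).sum()  # best possible ordering
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_k([1, 0, 1, 0, 0]))  # ~0.92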

Q10: Can Embeddings Be Visualized?

A: Yes. Common visualization methods:

1. t-SNE:

  • Reduces high-dimensional Embeddings to 2D
  • Lets you observe whether similar items cluster together
  • Suitable for exploratory analysis

2. PCA:

  • Linear dimensionality reduction, fast to compute
  • Preserves the main variance
  • Suitable for preliminary analysis

3. UMAP:

  • Faster than t-SNE with similar results
  • Preserves both local and global structure
  • Suitable for large-scale data

4. Visualization Tools:

  • TensorBoard: TensorFlow's visualization tool
  • Weights & Biases: Online visualization platform
  • Plotly: Interactive visualization

Visualization can help you:

  • Understand what the model has learned
  • Discover anomalies (e.g., some item Embeddings look degenerate)
  • Explain recommendation results (why a particular item was recommended)
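A minimal t-SNE sketch; the random matrix is stand-in data, and the commented line showing how to pull Embeddings from a model assumes an attribute name (`item_embedding`) that depends on your model:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Replace with real Embeddings pulled from a trained model, e.g.
#   item_embeddings = model.item_embedding.weight.detach().cpu().numpy()
item_embeddings = np.random.randn(500, 64)   # stand-in data for the sketch

coords = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(item_embeddings)
plt.scatter(coords[:, 0], coords[:, 1], s=5, alpha=0.6)
plt.title('Item Embeddings (t-SNE)')
plt.show()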

Q11: How to Handle New Categories in Categorical Features?

A: New categories (OOV, Out-of-Vocabulary) are common problems:

1. Default Embedding:

  • Assign a special Embedding vector to new categories
  • Can be randomly initialized or a zero vector
  • Gets updated during training

2. Hash Trick:

  • Use a hash function to map new categories into a fixed set of buckets
  • Example: hash(new_category) % num_categories
  • Hash collisions are possible, but arbitrary new categories can be handled (see the sketch below)

3. Content Features:

  • If new categories carry content (e.g., a text description), initialize Embeddings from it
  • Example: Encode category names with Word2Vec

4. Transfer Learning:

  • Transfer Embeddings from similar categories
  • Example: A new movie category can initialize its Embedding from a similar existing category
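A minimal hash-trick sketch; the bucket count is an illustrative choice:

import hashlib
import torch
import torch.nn as nn

NUM_BUCKETS = 100_000                        # fixed table size chosen up front
embedding = nn.Embedding(NUM_BUCKETS, 32)

def category_to_index(category: str) -> int:
    # Use a stable hash (Python's built-in hash() is salted per process)
    # so indices stay reproducible across runs; collisions are tolerated.
    digest = hashlib.md5(category.encode('utf-8')).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

# Any string, seen or unseen, maps into the fixed Embedding table
vec = embedding(torch.tensor([category_to_index('brand_new_category')]))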

Q12: How to Combine Deep Learning Recommendation Models with Traditional Methods?

A: Can combine in various ways:

1. Model Fusion:

  • Weighted Average: Weighted average of the predictions of multiple models
  • Stacking: Use a meta-model to learn how to combine multiple models
  • Blending: Different models handle different scenarios

2. Feature Fusion:

  • Use the outputs of traditional methods as input features for deep learning models
  • Example: Matrix factorization prediction scores as features

3. Two-stage Recommendation:

  • Recall Stage: Use traditional methods (e.g., Item-CF) to quickly retrieve a candidate set
  • Ranking Stage: Use deep learning models for fine-grained ranking (a sketch follows)

4. Ensemble Learning:

  • Train multiple models with different structures
  • Vote or average to get the final result
  • Usually outperforms any single model
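A two-stage sketch of strategy 3; `itemcf_recall` and `dnn_ranker` are illustrative callables, not a real API:

def recommend(user_id, itemcf_recall, dnn_ranker, k=10, recall_size=500):
    """Cheap traditional recall first, then deep-model ranking on the candidates."""
    candidates = itemcf_recall(user_id, n=recall_size)   # fast candidate retrieval
    scores = dnn_ranker(user_id, candidates)             # fine-grained scoring
    ranked = sorted(zip(candidates, scores), key=lambda t: -t[1])
    return [item for item, _ in ranked[:k]]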

Summary

Deep learning has brought revolutionary changes to recommendation systems. From automatic feature learning through Embeddings, to nonlinear modeling with NCF, from denoising reconstruction with AutoEncoders, to combining memorization and generalization with Wide & Deep, deep learning models have demonstrated powerful capabilities across all recommendation scenarios.

However, deep learning is not a silver bullet. It requires large amounts of data, computational resources, and tuning experience. In practical applications, we need to:

  1. Understand the business scenario: choose an appropriate model architecture
  2. Do feature engineering well: feature quality determines the model's upper limit
  3. Carefully design the training pipeline: data preparation, negative sampling, regularization, evaluation metrics
  4. Continuously iterate and optimize: A/B testing, online monitoring, rapid iteration

Recommendation systems are complex engineering systems, and deep learning is just one component. Only by combining algorithms, engineering, and business can we build truly effective recommendation systems.

Future directions for recommendation systems include:

  • Sequential Recommendation: Use Transformers to model user behavior sequences
  • Reinforcement Learning: Dynamically adjust recommendation strategies
  • Multimodal Recommendation: Fuse text, images, video, and other modalities
  • Explainable Recommendation: Help users understand why items are recommended
  • Fair Recommendation: Avoid recommendation bias and protect user privacy

I hope this article helps you build a complete knowledge framework for deep learning recommendation systems. If you have any questions, feel free to discuss them in the comments.
