Recommendation Systems (2): Collaborative Filtering and Matrix Factorization
Chen Kai BOSS

permalink: "en/recommendation-systems-2-collaborative-filtering/"
date: 2024-05-07 14:30:00
tags:
  - Recommendation Systems
  - Collaborative Filtering
  - Matrix Factorization
categories: Recommendation Systems
mathjax: true
---

Imagine you've just finished watching The Shawshank Redemption and want to find a similar movie. Traditional approaches might filter by genre, director, or actors, but collaborative filtering takes a different path — it doesn't care about the movie's attributes. Instead, it observes what "users with similar taste to you also liked." If users A and B both loved The Shawshank Redemption and Forrest Gump, and user A also enjoyed The Pursuit of Happyness, the system will recommend that movie to user B.

Collaborative filtering is the cornerstone of recommendation systems, from the GroupLens system in the 1990s to the Netflix Prize competition, to today's personalized recommendations on major platforms. Matrix factorization is the mathematical core of collaborative filtering, decomposing the user-item rating matrix into low-dimensional vectors and using vector inner products to predict ratings. This approach solves the data sparsity problem while laying the foundation for deep learning-based recommendation systems.

This article provides an in-depth exploration of collaborative filtering's two paradigms (User-CF and Item-CF), similarity calculation methods, the mathematical principles of matrix factorization, SVD/SVD++ algorithms, ALS optimization methods, BPR ranking learning, FM factorization machines, implicit feedback handling, bias term design, and complete Python implementation code. Each algorithm includes detailed mathematical derivations, code examples, and Q&A sections to help you master recommendation system core algorithms from theory to practice.

The Core Idea of Collaborative Filtering

What is Collaborative Filtering?

Collaborative Filtering (CF) is based on a core assumption: similar users will have similar preferences, and similar items will be liked by similar users. It doesn't need to know item content features (like movie genres or actors) or user profiles (like age or gender) — it only requires user historical behavior data (ratings, clicks, purchases, etc.) to make recommendations.

Collaborative filtering falls into two main categories:

  1. User-Based Collaborative Filtering (User-CF): Find users similar to the target user and recommend items those similar users liked.
  2. Item-Based Collaborative Filtering (Item-CF): Find items similar to the target item. If a user likes a certain item, recommend similar items.

An Intuitive Example

Suppose we have a rating matrix (1-5 scale) of 5 users for 5 movies:

| User  | Shawshank | Forrest Gump | Pursuit | Titanic | Godfather |
|-------|-----------|--------------|---------|---------|-----------|
| Alice | 5 | 5 | ? | 3 | 4 |
| Bob   | 5 | 5 | 4 | 2 | ? |
| Carol | 4 | 4 | 3 | 5 | 5 |
| David | 2 | 1 | ? | 4 | 3 |
| Eve   | ? | ? | 2 | 3 | 4 |

User-CF approach: Alice and Bob both gave high ratings to The Shawshank Redemption and Forrest Gump, indicating similar taste. Bob rated The Pursuit of Happyness 4, so recommend this movie to Alice.

Item-CF approach: The Shawshank Redemption and Forrest Gump both received high ratings from Alice, Bob, and Carol, indicating these movies are similar. If a user likes The Shawshank Redemption, recommend Forrest Gump.
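This intuition can be checked numerically. A minimal sketch (using the ratings from the table above, with abbreviated item names) computes the cosine similarity between Alice's and Bob's ratings over the three movies both have rated:

```python
import numpy as np

# Ratings from the example table; missing ratings are simply absent.
ratings = {
    'Alice': {'Shawshank': 5, 'Forrest Gump': 5, 'Titanic': 3, 'Godfather': 4},
    'Bob':   {'Shawshank': 5, 'Forrest Gump': 5, 'Pursuit': 4, 'Titanic': 2},
}

def cosine_on_common(u, v):
    """Cosine similarity restricted to co-rated items."""
    common = ratings[u].keys() & ratings[v].keys()
    a = np.array([ratings[u][i] for i in common], dtype=float)
    b = np.array([ratings[v][i] for i in common], dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f"sim(Alice, Bob) = {cosine_on_common('Alice', 'Bob'):.4f}")  # ≈ 0.9921
```

The value near 1 confirms the "similar taste" reading of the table.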

Advantages and Limitations of Collaborative Filtering

Advantages:

- No content features needed: doesn't require movie genres, actors, etc.
- Discovers unexpected associations: may surface non-intuitive correlations like "people who buy diapers also buy beer"
- Cross-domain recommendations: can recommend items across different categories

Limitations:

- Cold start problem: new users or items lack historical data
- Data sparsity: user-item matrices are typically very sparse (often over 99% missing values)
- Scalability: similarity computation cost grows with user and item counts
- Popularity bias: popular items are recommended easily while long-tail items are ignored

User-Based Collaborative Filtering (User-CF)

Algorithm Flow

The core steps of User-CF:

  1. Calculate user similarity: Find the \(k\) users most similar to target user \(u\) (called "neighbors")
  2. Predict ratings: Based on these \(k\) neighbors' ratings for item \(i\), predict user \(u\)'s rating for item \(i\)
  3. Generate recommendations: Select the \(N\) items with highest predicted ratings to recommend to the user

Similarity Calculation Methods

Similarity measures how close two users' rating vectors are. Let \(\mathbf{r}_u\) and \(\mathbf{r}_v\) be users \(u\)'s and \(v\)'s rating vectors over the item set \(I_{uv}\) (items both users rated).

1. Cosine Similarity

Treat user ratings as vectors and calculate the cosine of the angle between them:

\[\text{sim}(u, v) = \cos(\mathbf{r}_u, \mathbf{r}_v) = \frac{\mathbf{r}_u \cdot \mathbf{r}_v}{\|\mathbf{r}_u\| \cdot \|\mathbf{r}_v\|} = \frac{\sum_{i \in I_{uv}} r_{ui} \cdot r_{vi}}{\sqrt{\sum_{i \in I_{uv}} r_{ui}^2} \cdot \sqrt{\sum_{i \in I_{uv}} r_{vi}^2}}\]

Cosine similarity ranges over \([-1, 1]\), with larger values indicating greater similarity.

Characteristics:

- Only considers rating direction (relative preference), not absolute values
- Insensitive to rating scale
- Suitable for scenarios with different rating ranges

2. Pearson Correlation Coefficient

Considers differences in user rating means, computing a standardized correlation:

\[\text{sim}(u, v) = \frac{\sum_{i \in I_{uv}} (r_{ui} - \bar{r}_u)(r_{vi} - \bar{r}_v)}{\sqrt{\sum_{i \in I_{uv}} (r_{ui} - \bar{r}_u)^2} \cdot \sqrt{\sum_{i \in I_{uv}} (r_{vi} - \bar{r}_v)^2}}\]

where \(\bar{r}_u = \frac{1}{|I_u|} \sum_{i \in I_u} r_{ui}\) is user \(u\)'s average rating.

Characteristics:

- Accounts for user rating habits (some users tend to rate high, others low)
- Range \([-1, 1]\)
- More commonly used than cosine similarity

3. Adjusted Cosine Similarity

Subtract each user's average rating before computing cosine similarity:

\[\text{sim}(u, v) = \frac{\sum_{i \in I_{uv}} (r_{ui} - \bar{r}_u)(r_{vi} - \bar{r}_v)}{\sqrt{\sum_{i \in I_{uv}} (r_{ui} - \bar{r}_u)^2} \cdot \sqrt{\sum_{i \in I_{uv}} (r_{vi} - \bar{r}_v)^2}}\]

Written this way, the formula coincides with the Pearson correlation above (the subtlety is only which mean is subtracted; here it is the user's mean over all their ratings). The distinction matters more in the item-based setting discussed later, where adjusted cosine subtracts the user mean while Pearson would subtract the item mean.

4. Euclidean Distance

Calculate the Euclidean distance between rating vectors, then convert it to a similarity:

\[d(u, v) = \sqrt{\sum_{i \in I_{uv}} (r_{ui} - r_{vi})^2}\]

\[\text{sim}(u, v) = \frac{1}{1 + d(u, v)}\]

Characteristics:

- Smaller distance means higher similarity
- Sensitive to outliers
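The differences between these measures show up clearly for two users with identical relative preferences but shifted scales. In this sketch, user `a` rates three items [5, 4, 3] and user `b` rates the same items [3, 2, 1]:

```python
import numpy as np

a = np.array([5.0, 4.0, 3.0])   # a generous rater
b = np.array([3.0, 2.0, 1.0])   # a harsh rater with the same relative preferences

cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

ac, bc = a - a.mean(), b - b.mean()          # mean-centering, as in Pearson
pearson = ac @ bc / (np.linalg.norm(ac) * np.linalg.norm(bc))

euclid_sim = 1 / (1 + np.linalg.norm(a - b))

print(f"cosine: {cos:.4f}, pearson: {pearson:.4f}, euclidean sim: {euclid_sim:.4f}")
# cosine ≈ 0.9827, pearson ≈ 1.0000, euclidean sim ≈ 0.2240
```

Pearson treats the two users as perfectly aligned because it removes each user's rating level first, while the Euclidean-based similarity heavily penalizes the scale shift.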

Rating Prediction Formula

After finding the \(k\) most similar users, predict user \(u\)'s rating for item \(i\):

\[\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v \in N_k(u)} \text{sim}(u, v) \cdot (r_{vi} - \bar{r}_v)}{\sum_{v \in N_k(u)} |\text{sim}(u, v)|}\]

where:

- \(\bar{r}_u\) is user \(u\)'s average rating
- \(N_k(u)\) is the set of \(k\) users most similar to user \(u\)
- \(\text{sim}(u, v)\) is the similarity between users \(u\) and \(v\)
- \(r_{vi} - \bar{r}_v\) is user \(v\)'s rating deviation for item \(i\) relative to their average

Formula explanation:

- The base score is user \(u\)'s average rating
- To it we add a weighted average of similar users' rating deviations for item \(i\)
- The weights are similarities — higher similarity means greater influence
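As a worked example of the formula, suppose Alice's two nearest neighbors for The Pursuit of Happyness are Bob and Carol. The ratings and user means below come from the example table, while the two similarity values are illustrative assumptions:

```python
alice_mean = 4.25                 # mean of Alice's four ratings (5, 5, 3, 4)
neighbors = [
    # (assumed sim(Alice, v), v's rating of Pursuit, v's mean rating)
    (0.99, 4, 4.0),               # Bob
    (0.50, 3, 4.2),               # Carol
]

numerator = sum(s * (r - m) for s, r, m in neighbors)
denominator = sum(abs(s) for s, _, _ in neighbors)
prediction = alice_mean + numerator / denominator
print(f"predicted rating: {prediction:.2f}")  # ≈ 3.85
```

Bob rated Pursuit exactly at his own mean, so he contributes no deviation; Carol rated it below hers, which pulls the prediction slightly below Alice's average.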

User-CF Code Implementation

```python
import numpy as np
from scipy.spatial.distance import cosine
from scipy.stats import pearsonr

class UserBasedCF:
    def __init__(self, k=20, similarity='pearson'):
        """
        User-Based Collaborative Filtering

        Parameters:
        -----------
        k : int
            Number of neighbors
        similarity : str
            Similarity method: 'cosine', 'pearson', 'euclidean'
        """
        self.k = k
        self.similarity = similarity
        self.user_item_matrix = None
        self.user_means = None
        self.similarity_matrix = None

    def fit(self, user_item_matrix):
        """
        Train the model

        Parameters:
        -----------
        user_item_matrix : dict
            {user_id: {item_id: rating}}
        """
        self.user_item_matrix = user_item_matrix
        self.user_means = {}

        # Calculate each user's average rating
        for user_id, items in user_item_matrix.items():
            ratings = list(items.values())
            self.user_means[user_id] = np.mean(ratings) if ratings else 0

        # Calculate user similarity matrix
        self._compute_similarity_matrix()

    def _compute_similarity_matrix(self):
        """Calculate user similarity matrix"""
        users = list(self.user_item_matrix.keys())
        n_users = len(users)
        self.similarity_matrix = np.zeros((n_users, n_users))
        self.user_to_idx = {user: idx for idx, user in enumerate(users)}
        self.idx_to_user = {idx: user for idx, user in enumerate(users)}

        for i in range(n_users):
            for j in range(i + 1, n_users):
                sim = self._compute_similarity(users[i], users[j])
                self.similarity_matrix[i][j] = sim
                self.similarity_matrix[j][i] = sim

    def _compute_similarity(self, user1, user2):
        """Calculate similarity between two users"""
        items1 = set(self.user_item_matrix[user1].keys())
        items2 = set(self.user_item_matrix[user2].keys())
        common_items = items1 & items2

        if len(common_items) == 0:
            return 0

        ratings1 = [self.user_item_matrix[user1][item] for item in common_items]
        ratings2 = [self.user_item_matrix[user2][item] for item in common_items]

        if self.similarity == 'cosine':
            # Cosine similarity (scipy's `cosine` is a distance, hence 1 - d)
            return 1 - cosine(ratings1, ratings2)
        elif self.similarity == 'pearson':
            # Pearson correlation coefficient (undefined for fewer than 2 points)
            if len(ratings1) < 2:
                return 0
            corr, _ = pearsonr(ratings1, ratings2)
            return corr if not np.isnan(corr) else 0
        elif self.similarity == 'euclidean':
            # Euclidean distance converted to similarity
            dist = np.sqrt(np.sum((np.array(ratings1) - np.array(ratings2)) ** 2))
            return 1 / (1 + dist)
        else:
            raise ValueError(f"Unknown similarity: {self.similarity}")

    def predict(self, user_id, item_id):
        """
        Predict user's rating for an item

        Returns:
        --------
        float
            Predicted rating
        """
        if user_id not in self.user_item_matrix:
            return self.user_means.get(user_id, 3.0)  # Default rating

        user_idx = self.user_to_idx[user_id]

        # Collect neighbors who rated item_id: (similarity, rating, neighbor's mean).
        # Keeping the three values together avoids misaligning ratings with means
        # after the top-k selection below.
        neighbors = []
        for other_user, items in self.user_item_matrix.items():
            if other_user == user_id or item_id not in items:
                continue
            sim = self.similarity_matrix[user_idx][self.user_to_idx[other_user]]
            if sim > 0:  # Only consider positive similarity
                neighbors.append((sim, items[item_id], self.user_means[other_user]))

        if not neighbors:
            return self.user_means.get(user_id, 3.0)

        # Select the top-k most similar users
        neighbors.sort(key=lambda x: x[0], reverse=True)
        neighbors = neighbors[:self.k]

        # Weighted average of neighbors' rating deviations
        user_mean = self.user_means[user_id]
        numerator = sum(sim * (rating - mean) for sim, rating, mean in neighbors)
        denominator = sum(abs(sim) for sim, _, _ in neighbors)

        if denominator == 0:
            return user_mean

        return user_mean + numerator / denominator

    def recommend(self, user_id, n=10):
        """
        Recommend items to user

        Returns:
        --------
        list
            [(item_id, predicted_rating), ...]
        """
        if user_id not in self.user_item_matrix:
            return []

        user_items = set(self.user_item_matrix[user_id].keys())
        all_items = set()
        for items in self.user_item_matrix.values():
            all_items.update(items.keys())

        candidate_items = all_items - user_items

        predictions = []
        for item_id in candidate_items:
            pred_rating = self.predict(user_id, item_id)
            predictions.append((item_id, pred_rating))

        # Sort by predicted rating
        predictions.sort(key=lambda x: x[1], reverse=True)
        return predictions[:n]


# Usage example
if __name__ == "__main__":
    # Construct example data
    user_item_matrix = {
        'Alice': {'Shawshank': 5, 'Forrest Gump': 5, 'Titanic': 3, 'Godfather': 4},
        'Bob': {'Shawshank': 5, 'Forrest Gump': 5, 'Pursuit': 4, 'Titanic': 2},
        'Carol': {'Shawshank': 4, 'Forrest Gump': 4, 'Pursuit': 3, 'Titanic': 5, 'Godfather': 5},
        'David': {'Shawshank': 2, 'Forrest Gump': 1, 'Titanic': 4, 'Godfather': 3},
        'Eve': {'Pursuit': 2, 'Titanic': 3, 'Godfather': 4}
    }

    # Train model
    model = UserBasedCF(k=2, similarity='pearson')
    model.fit(user_item_matrix)

    # Predict Alice's rating for "Pursuit"
    pred = model.predict('Alice', 'Pursuit')
    print(f"Alice's predicted rating for 'Pursuit': {pred:.2f}")

    # Recommend movies to Alice
    recommendations = model.recommend('Alice', n=3)
    print("\nMovies recommended to Alice:")
    for item, rating in recommendations:
        print(f"  {item}: {rating:.2f}")
```

Advantages and Disadvantages of User-CF

Advantages:

- Intuitive and easy to understand; mirrors how people recommend things to each other
- Can discover users' latent interests
- Recommendations are easy to explain

Disadvantages:

- High computational complexity: \(O(m^2 \cdot n)\), where \(m\) is the number of users and \(n\) the number of items
- Similarity matrix storage and computation costs increase dramatically as the user count grows
- Hard to find users with common ratings when data is sparse
- Severe cold start problem for new users

Item-Based Collaborative Filtering (Item-CF)

Algorithm Flow

The core steps of Item-CF:

  1. Calculate item similarity: Find the \(k\) items most similar to target item \(i\)
  2. Predict ratings: Based on user \(u\)'s ratings for these \(k\) similar items, predict user \(u\)'s rating for item \(i\)
  3. Generate recommendations: Select the \(N\) items with highest predicted ratings to recommend to the user

Why is Item-CF More Commonly Used?

Compared to User-CF, Item-CF has the following advantages:

  1. Item count is usually less than user count: Lower cost to compute item similarity matrix
  2. Item similarity is more stable: Item attributes change slowly, similarity can be pre-computed and cached
  3. More stable recommendations: When user interests change, item similarity doesn't change immediately
  4. Better explainability: "Because you liked A, we recommend similar B"

Amazon's 2003 paper on item-to-item collaborative filtering (Linden, Smith, and York, IEEE Internet Computing) showed that Item-CF scales far better and performs well in practice, and the approach became the mainstream method in industry.

Item Similarity Calculation

Let items \(i\) and \(j\) be rated by the user set \(U_{ij}\) (users who rated both items).

1. Cosine Similarity

\[\text{sim}(i, j) = \frac{\sum_{u \in U_{ij}} r_{ui} \cdot r_{uj}}{\sqrt{\sum_{u \in U_{ij}} r_{ui}^2} \cdot \sqrt{\sum_{u \in U_{ij}} r_{uj}^2}}\]

2. Adjusted Cosine Similarity (Commonly Used)

\[\text{sim}(i, j) = \frac{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_u)(r_{uj} - \bar{r}_u)}{\sqrt{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_u)^2} \cdot \sqrt{\sum_{u \in U_{ij}} (r_{uj} - \bar{r}_u)^2}}\]

Note: here we subtract the user's average rating, not the item's average rating. This eliminates the influence of individual users' rating habits.

Rating Prediction Formula

\[\hat{r}_{ui} = \frac{\sum_{j \in N_k(i)} \text{sim}(i, j) \cdot r_{uj}}{\sum_{j \in N_k(i)} |\text{sim}(i, j)|}\]

where \(N_k(i)\) is the set of \(k\) items most similar to item \(i\).

If also accounting for the user's average rating:

\[\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{j \in N_k(i)} \text{sim}(i, j) \cdot (r_{uj} - \bar{r}_u)}{\sum_{j \in N_k(i)} |\text{sim}(i, j)|}\]
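A quick numeric sketch of the mean-adjusted version: predicting Alice's rating for The Pursuit of Happyness from two items she has rated. Her ratings come from the example table, while the two item similarities are illustrative assumptions:

```python
alice_mean = 4.25                     # mean of Alice's ratings (5, 5, 3, 4)
similar_items = [
    # (assumed sim(Pursuit, j), Alice's rating of j)
    (0.7, 5),                         # Forrest Gump
    (0.5, 3),                         # Titanic
]

num = sum(s * (r - alice_mean) for s, r in similar_items)
den = sum(abs(s) for s, _ in similar_items)
pred = alice_mean + num / den
print(f"predicted rating: {pred:.2f}")  # ≈ 4.17
```

Forrest Gump pulls the prediction above Alice's mean and Titanic pulls it below; the similarity weights decide which wins.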

Item-CF Code Implementation

```python
import numpy as np
from collections import defaultdict

class ItemBasedCF:
    def __init__(self, k=20, similarity='adjusted_cosine'):
        """
        Item-Based Collaborative Filtering

        Parameters:
        -----------
        k : int
            Number of similar items
        similarity : str
            Similarity method: 'cosine', 'adjusted_cosine'
        """
        self.k = k
        self.similarity = similarity
        self.user_item_matrix = None
        self.item_user_matrix = None
        self.user_means = None
        self.similarity_matrix = None

    def fit(self, user_item_matrix):
        """
        Train the model

        Parameters:
        -----------
        user_item_matrix : dict
            {user_id: {item_id: rating}}
        """
        self.user_item_matrix = user_item_matrix

        # Build item-user matrix
        self.item_user_matrix = defaultdict(dict)
        for user_id, items in user_item_matrix.items():
            for item_id, rating in items.items():
                self.item_user_matrix[item_id][user_id] = rating

        # Calculate user average ratings
        self.user_means = {}
        for user_id, items in user_item_matrix.items():
            ratings = list(items.values())
            self.user_means[user_id] = np.mean(ratings) if ratings else 0

        # Calculate item similarity matrix
        self._compute_similarity_matrix()

    def _compute_similarity_matrix(self):
        """Calculate item similarity matrix (stored sparsely as a dict of dicts)"""
        items = list(self.item_user_matrix.keys())
        self.similarity_matrix = {}

        for i, item1 in enumerate(items):
            for item2 in items[i + 1:]:
                sim = self._compute_similarity(item1, item2)
                if sim > 0:  # Only store positive similarity
                    self.similarity_matrix.setdefault(item1, {})[item2] = sim
                    self.similarity_matrix.setdefault(item2, {})[item1] = sim

    def _compute_similarity(self, item1, item2):
        """Calculate similarity between two items"""
        users1 = set(self.item_user_matrix[item1].keys())
        users2 = set(self.item_user_matrix[item2].keys())
        common_users = users1 & users2

        if len(common_users) == 0:
            return 0

        if self.similarity == 'cosine':
            # Cosine similarity
            ratings1 = [self.item_user_matrix[item1][u] for u in common_users]
            ratings2 = [self.item_user_matrix[item2][u] for u in common_users]
            dot_product = sum(r1 * r2 for r1, r2 in zip(ratings1, ratings2))
            norm1 = np.sqrt(sum(r ** 2 for r in ratings1))
            norm2 = np.sqrt(sum(r ** 2 for r in ratings2))
            if norm1 == 0 or norm2 == 0:
                return 0
            return dot_product / (norm1 * norm2)

        elif self.similarity == 'adjusted_cosine':
            # Adjusted cosine similarity: subtract each user's mean rating
            numerator = 0
            denom1 = 0
            denom2 = 0

            for user in common_users:
                diff1 = self.item_user_matrix[item1][user] - self.user_means[user]
                diff2 = self.item_user_matrix[item2][user] - self.user_means[user]
                numerator += diff1 * diff2
                denom1 += diff1 ** 2
                denom2 += diff2 ** 2

            if denom1 == 0 or denom2 == 0:
                return 0
            return numerator / (np.sqrt(denom1) * np.sqrt(denom2))

        else:
            raise ValueError(f"Unknown similarity: {self.similarity}")

    def predict(self, user_id, item_id):
        """
        Predict user's rating for an item

        Returns:
        --------
        float
            Predicted rating
        """
        if user_id not in self.user_item_matrix:
            return self.user_means.get(user_id, 3.0)
        if item_id not in self.item_user_matrix or item_id not in self.similarity_matrix:
            return self.user_means.get(user_id, 3.0)

        # Items rated by the user
        user_items = set(self.user_item_matrix[user_id].keys())

        # Similar items that the user has already rated
        rated_similar_items = {
            j: sim for j, sim in self.similarity_matrix[item_id].items()
            if j in user_items
        }
        if len(rated_similar_items) == 0:
            return self.user_means.get(user_id, 3.0)

        # Select top-k similar items
        sorted_items = sorted(rated_similar_items.items(),
                              key=lambda x: x[1], reverse=True)[:self.k]

        # Weighted average prediction (mean-adjusted)
        user_mean = self.user_means[user_id]
        numerator = sum(sim * (self.user_item_matrix[user_id][item] - user_mean)
                        for item, sim in sorted_items)
        denominator = sum(abs(sim) for _, sim in sorted_items)

        if denominator == 0:
            return user_mean

        return user_mean + numerator / denominator

    def recommend(self, user_id, n=10):
        """
        Recommend items to user

        Returns:
        --------
        list
            [(item_id, predicted_rating), ...]
        """
        if user_id not in self.user_item_matrix:
            return []

        user_items = set(self.user_item_matrix[user_id].keys())
        candidate_items = set(self.item_user_matrix.keys()) - user_items

        predictions = [(item_id, self.predict(user_id, item_id))
                       for item_id in candidate_items]
        predictions.sort(key=lambda x: x[1], reverse=True)
        return predictions[:n]


# Usage example
if __name__ == "__main__":
    user_item_matrix = {
        'Alice': {'Shawshank': 5, 'Forrest Gump': 5, 'Titanic': 3, 'Godfather': 4},
        'Bob': {'Shawshank': 5, 'Forrest Gump': 5, 'Pursuit': 4, 'Titanic': 2},
        'Carol': {'Shawshank': 4, 'Forrest Gump': 4, 'Pursuit': 3, 'Titanic': 5, 'Godfather': 5},
        'David': {'Shawshank': 2, 'Forrest Gump': 1, 'Titanic': 4, 'Godfather': 3},
        'Eve': {'Pursuit': 2, 'Titanic': 3, 'Godfather': 4}
    }

    model = ItemBasedCF(k=2, similarity='adjusted_cosine')
    model.fit(user_item_matrix)

    pred = model.predict('Alice', 'Pursuit')
    print(f"Alice's predicted rating for 'Pursuit': {pred:.2f}")

    recommendations = model.recommend('Alice', n=3)
    print("\nMovies recommended to Alice:")
    for item, rating in recommendations:
        print(f"  {item}: {rating:.2f}")
```

Matrix Factorization

From Collaborative Filtering to Matrix Factorization

Traditional collaborative filtering methods (User-CF and Item-CF) have obvious scalability issues: when user and item counts reach millions, the cost of computing and storing similarity matrices becomes unacceptable. Matrix Factorization (MF) solves the sparsity problem and dramatically reduces computational complexity by decomposing high-dimensional sparse matrices into products of low-dimensional dense matrices.

The Basic Idea of Matrix Factorization

The core of matrix factorization is decomposing the user-item rating matrix \(R \in \mathbb{R}^{m \times n}\) (\(m\) users, \(n\) items) into the product of two low-dimensional matrices:

\[R \approx P \cdot Q^T\]

where:

- \(P \in \mathbb{R}^{m \times k}\) is the user feature matrix; each row \(\mathbf{p}_u\) is user \(u\)'s \(k\)-dimensional latent feature vector
- \(Q \in \mathbb{R}^{n \times k}\) is the item feature matrix; each row \(\mathbf{q}_i\) is item \(i\)'s \(k\)-dimensional latent feature vector
- \(k\) is the latent feature dimension (typically \(k \ll \min(m, n)\), e.g., 50 or 100)

Predict user \(u\)'s rating for item \(i\) as:

\[\hat{r}_{ui} = \mathbf{p}_u \cdot \mathbf{q}_i^T = \sum_{f=1}^{k} p_{uf} \cdot q_{if}\]
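The factorization is easy to state in code. A minimal sketch with random factor matrices (the dimensions are arbitrary) shows that the full predicted rating matrix is just \(P Q^T\), and a single prediction is an inner product of two \(k\)-dimensional vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 5, 5, 3                    # 5 users, 5 items, 3 latent factors

P = rng.normal(size=(m, k))          # user factor matrix
Q = rng.normal(size=(n, k))          # item factor matrix

R_hat = P @ Q.T                      # all predicted ratings at once, shape (m, n)

assert R_hat.shape == (m, n)
# One prediction = inner product of a user row and an item row
assert abs(R_hat[1, 2] - P[1] @ Q[2]) < 1e-12
```

Storing \(P\) and \(Q\) takes \((m + n) \cdot k\) numbers instead of the \(m \cdot n\) entries of the full matrix, which is the source of the scalability gain.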

Geometric Interpretation of Matrix Factorization

Matrix factorization can be understood as:

- User vector \(\mathbf{p}_u\): the user's preference strength along \(k\) latent dimensions
- Item vector \(\mathbf{q}_i\): the item's feature strength along \(k\) latent dimensions
- Inner product \(\mathbf{p}_u \cdot \mathbf{q}_i\): the degree of match between user preference and item features

For example, if \(k=3\), the dimensions might correspond to:

- Dimension 1: Action vs. Art films
- Dimension 2: Comedy vs. Tragedy
- Dimension 3: Modern vs. Classical

User Alice's vector might be \([0.8, 0.2, 0.6]\) (likes action, doesn't like comedy much, prefers modern), and The Shawshank Redemption's vector might be \([0.3, 0.1, 0.9]\) (more art than action, little comedy, modern), giving the inner product \(0.8 \times 0.3 + 0.2 \times 0.1 + 0.6 \times 0.9 = 0.80\), indicating a high match.

Objective Function

Matrix factorization aims to find \(P\) and \(Q\) such that predicted ratings are as close as possible to the true ratings. Define the loss function:

\[\mathcal{L} = \sum_{(u,i) \in \mathcal{R}} (r_{ui} - \hat{r}_{ui})^2 + \lambda (\|\mathbf{p}_u\|^2 + \|\mathbf{q}_i\|^2)\]

where:

- \(\mathcal{R}\) is the set of known ratings
- the first term is the squared error loss (MSE)
- the second term is regularization to prevent overfitting
- \(\lambda\) is the regularization coefficient

Optimization Method: Stochastic Gradient Descent (SGD)

For each known rating \((u, i)\), calculate the prediction error:

\[e_{ui} = r_{ui} - \hat{r}_{ui}\]

Then update the user and item vectors:

\[\mathbf{p}_u \leftarrow \mathbf{p}_u + \alpha (e_{ui} \cdot \mathbf{q}_i - \lambda \mathbf{p}_u)\]

\[\mathbf{q}_i \leftarrow \mathbf{q}_i + \alpha (e_{ui} \cdot \mathbf{p}_u - \lambda \mathbf{q}_i)\]

where \(\alpha\) is the learning rate.
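The update rules follow from differentiating the loss for a single observed rating. Writing the per-sample loss as \(\ell = e_{ui}^2 + \lambda(\|\mathbf{p}_u\|^2 + \|\mathbf{q}_i\|^2)\) with \(e_{ui} = r_{ui} - \mathbf{p}_u \cdot \mathbf{q}_i^T\):

\[\frac{\partial \ell}{\partial \mathbf{p}_u} = -2 e_{ui} \, \mathbf{q}_i + 2\lambda \, \mathbf{p}_u, \qquad \frac{\partial \ell}{\partial \mathbf{q}_i} = -2 e_{ui} \, \mathbf{p}_u + 2\lambda \, \mathbf{q}_i\]

Stepping in the negative gradient direction and absorbing the constant factor 2 into the learning rate \(\alpha\) yields exactly the SGD updates above.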

Basic Matrix Factorization Code Implementation

```python
import numpy as np
import random

class MatrixFactorization:
    def __init__(self, n_factors=50, learning_rate=0.01, reg=0.01, n_epochs=20):
        """
        Basic Matrix Factorization model

        Parameters:
        -----------
        n_factors : int
            Latent feature dimension
        learning_rate : float
            Learning rate
        reg : float
            Regularization coefficient
        n_epochs : int
            Number of iterations
        """
        self.n_factors = n_factors
        self.learning_rate = learning_rate
        self.reg = reg
        self.n_epochs = n_epochs
        self.user_factors = None
        self.item_factors = None
        self.global_mean = None

    def fit(self, user_item_matrix):
        """Train the model with SGD"""
        self.user_item_matrix = user_item_matrix
        users = list(user_item_matrix.keys())
        items = set()
        for items_dict in user_item_matrix.values():
            items.update(items_dict.keys())
        items = list(items)

        self.user_to_idx = {user: idx for idx, user in enumerate(users)}
        self.idx_to_user = {idx: user for idx, user in enumerate(users)}
        self.item_to_idx = {item: idx for idx, item in enumerate(items)}
        self.idx_to_item = {idx: item for idx, item in enumerate(items)}

        n_users = len(users)
        n_items = len(items)

        # Initialize user and item feature matrices
        self.user_factors = np.random.normal(scale=0.1, size=(n_users, self.n_factors))
        self.item_factors = np.random.normal(scale=0.1, size=(n_items, self.n_factors))

        # Calculate global mean (fallback for unknown users/items)
        all_ratings = [r for items_dict in user_item_matrix.values()
                       for r in items_dict.values()]
        self.global_mean = np.mean(all_ratings) if all_ratings else 0

        # Build training samples
        samples = []
        for user_id, items_dict in user_item_matrix.items():
            user_idx = self.user_to_idx[user_id]
            for item_id, rating in items_dict.items():
                samples.append((user_idx, self.item_to_idx[item_id], rating))

        # Training
        for epoch in range(self.n_epochs):
            random.shuffle(samples)
            total_loss = 0

            for user_idx, item_idx, rating in samples:
                # Predict rating and compute error
                pred = np.dot(self.user_factors[user_idx], self.item_factors[item_idx])
                error = rating - pred

                # Update parameters (copy old factors so both updates use them)
                user_factor = self.user_factors[user_idx].copy()
                item_factor = self.item_factors[item_idx].copy()

                self.user_factors[user_idx] += self.learning_rate * (
                    error * item_factor - self.reg * self.user_factors[user_idx]
                )
                self.item_factors[item_idx] += self.learning_rate * (
                    error * user_factor - self.reg * self.item_factors[item_idx]
                )

                total_loss += error ** 2

            if (epoch + 1) % 5 == 0:
                rmse = np.sqrt(total_loss / len(samples))
                print(f"Epoch {epoch + 1}, RMSE: {rmse:.4f}")

    def predict(self, user_id, item_id):
        """Predict rating"""
        if user_id not in self.user_to_idx or item_id not in self.item_to_idx:
            return self.global_mean

        user_idx = self.user_to_idx[user_id]
        item_idx = self.item_to_idx[item_id]
        return np.dot(self.user_factors[user_idx], self.item_factors[item_idx])

    def recommend(self, user_id, n=10):
        """Recommend the n unrated items with highest predicted rating"""
        if user_id not in self.user_to_idx:
            return []

        rated = set(self.user_item_matrix[user_id].keys())
        predictions = []
        for item_idx in range(len(self.item_factors)):
            item_id = self.idx_to_item[item_idx]
            if item_id in rated:
                continue
            predictions.append((item_id, self.predict(user_id, item_id)))

        predictions.sort(key=lambda x: x[1], reverse=True)
        return predictions[:n]
```

Matrix Factorization with Bias Terms

Why Do We Need Bias Terms?

Basic matrix factorization assumes ratings are determined only by user and item latent features, but in reality:

- User bias: some users tend to rate high, others low
- Item bias: popular items have higher average ratings, unpopular items lower
- Global bias: the overall rating level (e.g., an average of 3.5)

After introducing bias terms, the prediction formula becomes:

\[\hat{r}_{ui} = \mu + b_u + b_i + \mathbf{p}_u \cdot \mathbf{q}_i^T\]

where:

- \(\mu\) is the global average rating
- \(b_u\) is user \(u\)'s bias (relative to the global average)
- \(b_i\) is item \(i\)'s bias (relative to the global average)

Calculation of Bias Terms

Initialization:

\[b_u = \frac{1}{|I_u|} \sum_{i \in I_u} (r_{ui} - \mu)\]

\[b_i = \frac{1}{|U_i|} \sum_{u \in U_i} (r_{ui} - \mu - b_u)\]

where \(I_u\) is the set of items rated by user \(u\) and \(U_i\) is the set of users who rated item \(i\).
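These initialization formulas already define a useful bias-only baseline \(\mu + b_u + b_i\) before any latent factors are learned. A small sketch computes it for Alice and The Pursuit of Happyness from the example table:

```python
import numpy as np

ratings = {
    'Alice': {'Shawshank': 5, 'Forrest Gump': 5, 'Titanic': 3, 'Godfather': 4},
    'Bob':   {'Shawshank': 5, 'Forrest Gump': 5, 'Pursuit': 4, 'Titanic': 2},
    'Carol': {'Shawshank': 4, 'Forrest Gump': 4, 'Pursuit': 3, 'Titanic': 5, 'Godfather': 5},
    'David': {'Shawshank': 2, 'Forrest Gump': 1, 'Titanic': 4, 'Godfather': 3},
    'Eve':   {'Pursuit': 2, 'Titanic': 3, 'Godfather': 4},
}

# Global mean over all 20 observed ratings
mu = np.mean([r for items in ratings.values() for r in items.values()])

# b_u: each user's mean deviation from the global mean
user_bias = {u: np.mean(list(items.values())) - mu for u, items in ratings.items()}

def item_bias(item):
    """b_i: mean residual after removing the global mean and each rater's bias."""
    return np.mean([items[item] - mu - user_bias[u]
                    for u, items in ratings.items() if item in items])

baseline = mu + user_bias['Alice'] + item_bias('Pursuit')
print(f"mu = {mu:.2f}, baseline(Alice, Pursuit) = {baseline:.2f}")
# mu = 3.65, baseline ≈ 3.52
```

Even without latent factors, this baseline captures that Alice rates above average while Pursuit is rated below average by its raters; the factor term \(\mathbf{p}_u \cdot \mathbf{q}_i^T\) then only has to model the remaining interaction.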

Matrix Factorization with Bias Code

class BiasedMatrixFactorization:
    def __init__(self, n_factors=50, learning_rate=0.01, reg=0.01, reg_bias=0.01, n_epochs=20):
        """Matrix Factorization model with bias terms"""
        self.n_factors = n_factors
        self.learning_rate = learning_rate
        self.reg = reg
        self.reg_bias = reg_bias
        self.n_epochs = n_epochs
        self.user_factors = None
        self.item_factors = None
        self.user_bias = None
        self.item_bias = None
        self.global_mean = None

    def fit(self, user_item_matrix):
        """Train the model"""
        users = list(user_item_matrix.keys())
        items = set()
        for items_dict in user_item_matrix.values():
            items.update(items_dict.keys())
        items = list(items)

        self.user_to_idx = {user: idx for idx, user in enumerate(users)}
        self.idx_to_user = {idx: user for idx, user in enumerate(users)}
        self.item_to_idx = {item: idx for idx, item in enumerate(items)}
        self.idx_to_item = {idx: item for idx, item in enumerate(items)}

        n_users = len(users)
        n_items = len(items)

        # Initialize
        self.user_factors = np.random.normal(scale=0.1, size=(n_users, self.n_factors))
        self.item_factors = np.random.normal(scale=0.1, size=(n_items, self.n_factors))
        self.user_bias = np.zeros(n_users)
        self.item_bias = np.zeros(n_items)

        # Calculate global mean
        all_ratings = []
        for items_dict in user_item_matrix.values():
            all_ratings.extend(items_dict.values())
        self.global_mean = np.mean(all_ratings) if all_ratings else 0

        # Initialize bias terms
        for user_id, items_dict in user_item_matrix.items():
            user_idx = self.user_to_idx[user_id]
            ratings = list(items_dict.values())
            self.user_bias[user_idx] = np.mean(ratings) - self.global_mean

        for item_idx in range(n_items):
            item_id = self.idx_to_item[item_idx]
            ratings = []
            for user_id, items_dict in user_item_matrix.items():
                if item_id in items_dict:
                    ratings.append(items_dict[item_id])
            if ratings:
                self.item_bias[item_idx] = np.mean(ratings) - self.global_mean

        # Build training samples
        samples = []
        for user_id, items_dict in user_item_matrix.items():
            user_idx = self.user_to_idx[user_id]
            for item_id, rating in items_dict.items():
                item_idx = self.item_to_idx[item_id]
                samples.append((user_idx, item_idx, rating))

        # Training
        for epoch in range(self.n_epochs):
            random.shuffle(samples)
            total_loss = 0

            for user_idx, item_idx, rating in samples:
                # Predict rating
                pred = (self.global_mean +
                        self.user_bias[user_idx] +
                        self.item_bias[item_idx] +
                        np.dot(self.user_factors[user_idx], self.item_factors[item_idx]))
                error = rating - pred

                # Update parameters
                user_factor = self.user_factors[user_idx].copy()
                item_factor = self.item_factors[item_idx].copy()

                self.user_factors[user_idx] += self.learning_rate * (
                    error * item_factor - self.reg * self.user_factors[user_idx]
                )
                self.item_factors[item_idx] += self.learning_rate * (
                    error * user_factor - self.reg * self.item_factors[item_idx]
                )
                self.user_bias[user_idx] += self.learning_rate * (
                    error - self.reg_bias * self.user_bias[user_idx]
                )
                self.item_bias[item_idx] += self.learning_rate * (
                    error - self.reg_bias * self.item_bias[item_idx]
                )

                total_loss += error ** 2

            if (epoch + 1) % 5 == 0:
                rmse = np.sqrt(total_loss / len(samples))
                print(f"Epoch {epoch + 1}, RMSE: {rmse:.4f}")

    def predict(self, user_id, item_id):
        """Predict rating"""
        if user_id not in self.user_to_idx or item_id not in self.item_to_idx:
            return self.global_mean

        user_idx = self.user_to_idx[user_id]
        item_idx = self.item_to_idx[item_id]

        pred = (self.global_mean +
                self.user_bias[user_idx] +
                self.item_bias[item_idx] +
                np.dot(self.user_factors[user_idx], self.item_factors[item_idx]))
        return pred

    def recommend(self, user_id, n=10):
        """Recommend items"""
        if user_id not in self.user_to_idx:
            return []

        predictions = []
        for item_idx in range(len(self.item_factors)):
            item_id = self.idx_to_item[item_idx]
            pred = self.predict(user_id, item_id)
            predictions.append((item_id, pred))

        predictions.sort(key=lambda x: x[1], reverse=True)
        return predictions[:n]

SVD and SVD++

Singular Value Decomposition (SVD)

Traditional SVD requires a complete matrix, but rating matrices are sparse. In the Netflix Prize competition, researchers proposed SVD methods suitable for sparse matrices, which are essentially matrix factorization with bias terms.

SVD++: Incorporating Implicit Feedback

SVD++ introduces implicit feedback on top of SVD. Implicit feedback refers to user behaviors that don't involve explicit ratings but may express preferences, such as:

  • Browsing without purchasing
  • Clicking without rating
  • Watch time
  • Favorites and shares

SVD++ prediction formula:\[\hat{r}_{ui} = \mu + b_u + b_i + \mathbf{q}_i^T \left( \mathbf{p}_u + |I_u|^{-\frac{1}{2}} \sum_{j \in I_u} \mathbf{y}_j \right)\]where:

  • \(\mathbf{p}_u\) is user \(u\)'s explicit feedback feature vector
  • \(I_u\) is the set of items with implicit feedback from user \(u\)
  • \(\mathbf{y}_j\) is item \(j\)'s implicit feedback feature vector
  • \(|I_u|^{-\frac{1}{2}}\) is the normalization factor

SVD++ Code Implementation

class SVDPlusPlus:
    def __init__(self, n_factors=50, learning_rate=0.01, reg=0.01, reg_bias=0.01, n_epochs=20):
        """SVD++ model"""
        self.n_factors = n_factors
        self.learning_rate = learning_rate
        self.reg = reg
        self.reg_bias = reg_bias
        self.n_epochs = n_epochs
        self.user_factors = None
        self.item_factors = None
        self.item_implicit_factors = None
        self.user_bias = None
        self.item_bias = None
        self.global_mean = None

    def fit(self, user_item_matrix, implicit_feedback=None):
        """Train the model"""
        users = list(user_item_matrix.keys())
        items = set()
        for items_dict in user_item_matrix.values():
            items.update(items_dict.keys())
        if implicit_feedback:
            for item_list in implicit_feedback.values():
                items.update(item_list)
        items = list(items)

        self.user_to_idx = {user: idx for idx, user in enumerate(users)}
        self.idx_to_user = {idx: user for idx, user in enumerate(users)}
        self.item_to_idx = {item: idx for idx, item in enumerate(items)}
        self.idx_to_item = {idx: item for idx, item in enumerate(items)}

        n_users = len(users)
        n_items = len(items)

        # Initialize
        self.user_factors = np.random.normal(scale=0.1, size=(n_users, self.n_factors))
        self.item_factors = np.random.normal(scale=0.1, size=(n_items, self.n_factors))
        self.item_implicit_factors = np.random.normal(scale=0.1, size=(n_items, self.n_factors))
        self.user_bias = np.zeros(n_users)
        self.item_bias = np.zeros(n_items)

        # Calculate global mean
        all_ratings = []
        for items_dict in user_item_matrix.values():
            all_ratings.extend(items_dict.values())
        self.global_mean = np.mean(all_ratings) if all_ratings else 0

        # Initialize bias
        for user_id, items_dict in user_item_matrix.items():
            user_idx = self.user_to_idx[user_id]
            ratings = list(items_dict.values())
            self.user_bias[user_idx] = np.mean(ratings) - self.global_mean

        for item_idx in range(n_items):
            item_id = self.idx_to_item[item_idx]
            ratings = []
            for user_id, items_dict in user_item_matrix.items():
                if item_id in items_dict:
                    ratings.append(items_dict[item_id])
            if ratings:
                self.item_bias[item_idx] = np.mean(ratings) - self.global_mean

        # Build implicit feedback sets
        if implicit_feedback is None:
            implicit_feedback = {}
            for user_id, items_dict in user_item_matrix.items():
                implicit_feedback[user_id] = list(items_dict.keys())

        self.user_implicit_items = {}
        for user_id in users:
            if user_id in implicit_feedback:
                self.user_implicit_items[user_id] = [
                    self.item_to_idx[item] for item in implicit_feedback[user_id]
                    if item in self.item_to_idx
                ]
            else:
                self.user_implicit_items[user_id] = []

        # Build training samples
        samples = []
        for user_id, items_dict in user_item_matrix.items():
            user_idx = self.user_to_idx[user_id]
            for item_id, rating in items_dict.items():
                item_idx = self.item_to_idx[item_id]
                samples.append((user_idx, item_idx, rating))

        # Training
        for epoch in range(self.n_epochs):
            random.shuffle(samples)
            total_loss = 0

            for user_idx, item_idx, rating in samples:
                # Calculate enhanced user feature vector
                user_id = self.idx_to_user[user_idx]
                implicit_items = self.user_implicit_items[user_id]

                if len(implicit_items) > 0:
                    implicit_sum = np.sum(
                        [self.item_implicit_factors[j] for j in implicit_items], axis=0
                    )
                    norm_factor = 1.0 / np.sqrt(len(implicit_items))
                    enhanced_user_factor = self.user_factors[user_idx] + norm_factor * implicit_sum
                else:
                    enhanced_user_factor = self.user_factors[user_idx]

                # Predict rating
                pred = (self.global_mean +
                        self.user_bias[user_idx] +
                        self.item_bias[item_idx] +
                        np.dot(enhanced_user_factor, self.item_factors[item_idx]))
                error = rating - pred

                # Update parameters
                user_factor = self.user_factors[user_idx].copy()
                item_factor = self.item_factors[item_idx].copy()

                # Update explicit features
                self.user_factors[user_idx] += self.learning_rate * (
                    error * item_factor - self.reg * self.user_factors[user_idx]
                )
                self.item_factors[item_idx] += self.learning_rate * (
                    error * enhanced_user_factor - self.reg * self.item_factors[item_idx]
                )

                # Update implicit features
                if len(implicit_items) > 0:
                    norm_factor = 1.0 / np.sqrt(len(implicit_items))
                    for j in implicit_items:
                        self.item_implicit_factors[j] += self.learning_rate * (
                            error * norm_factor * item_factor -
                            self.reg * self.item_implicit_factors[j]
                        )

                # Update bias
                self.user_bias[user_idx] += self.learning_rate * (
                    error - self.reg_bias * self.user_bias[user_idx]
                )
                self.item_bias[item_idx] += self.learning_rate * (
                    error - self.reg_bias * self.item_bias[item_idx]
                )

                total_loss += error ** 2

            if (epoch + 1) % 5 == 0:
                rmse = np.sqrt(total_loss / len(samples))
                print(f"Epoch {epoch + 1}, RMSE: {rmse:.4f}")

    def predict(self, user_id, item_id):
        """Predict rating"""
        if user_id not in self.user_to_idx or item_id not in self.item_to_idx:
            return self.global_mean

        user_idx = self.user_to_idx[user_id]
        item_idx = self.item_to_idx[item_id]

        # Calculate enhanced user feature vector
        implicit_items = self.user_implicit_items.get(user_id, [])
        if len(implicit_items) > 0:
            implicit_sum = np.sum(
                [self.item_implicit_factors[j] for j in implicit_items], axis=0
            )
            norm_factor = 1.0 / np.sqrt(len(implicit_items))
            enhanced_user_factor = self.user_factors[user_idx] + norm_factor * implicit_sum
        else:
            enhanced_user_factor = self.user_factors[user_idx]

        pred = (self.global_mean +
                self.user_bias[user_idx] +
                self.item_bias[item_idx] +
                np.dot(enhanced_user_factor, self.item_factors[item_idx]))
        return pred

    def recommend(self, user_id, n=10):
        """Recommend items"""
        if user_id not in self.user_to_idx:
            return []

        predictions = []
        for item_idx in range(len(self.item_factors)):
            item_id = self.idx_to_item[item_idx]
            pred = self.predict(user_id, item_id)
            predictions.append((item_id, pred))

        predictions.sort(key=lambda x: x[1], reverse=True)
        return predictions[:n]

ALS: Alternating Least Squares

Why Do We Need ALS?

Stochastic Gradient Descent (SGD) is simple but slow on large-scale data. Alternating Least Squares (ALS) optimizes one variable while fixing the other, enabling parallelization and significantly improving training speed.

ALS Algorithm Principle

ALS alternates between optimizing user vectors and item vectors:

  1. Fix item vectors \(Q\), optimize user vectors \(P\):
    • For each user \(u\), solve the least squares problem:\[\min_{\mathbf{p}_u} \sum_{i \in I_u} (r_{ui} - \mathbf{p}_u^T \mathbf{q}_i)^2 + \lambda \|\mathbf{p}_u\|^2\]
    • Analytical solution:\[\mathbf{p}_u = (Q_u^T Q_u + \lambda I)^{-1} Q_u^T \mathbf{r}_u\]where \(Q_u\) is the feature matrix of items rated by user \(u\) and \(\mathbf{r}_u\) is the corresponding rating vector.
  2. Fix user vectors \(P\), optimize item vectors \(Q\):
    • For each item \(i\), solve the least squares problem:\[\min_{\mathbf{q}_i} \sum_{u \in U_i} (r_{ui} - \mathbf{p}_u^T \mathbf{q}_i)^2 + \lambda \|\mathbf{q}_i\|^2\]
    • Analytical solution:\[\mathbf{q}_i = (P_i^T P_i + \lambda I)^{-1} P_i^T \mathbf{r}_i\]where \(P_i\) is the feature matrix of users who rated item \(i\) and \(\mathbf{r}_i\) is the corresponding rating vector.
  3. Alternate iterations: Repeat steps 1 and 2 until convergence.
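The closed-form step can be verified numerically. The following standalone sketch (synthetic data, separate from the implementation later in this section) checks that the ridge solution for one user is a stationary point of that user's regularized least-squares objective:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_rated, lam = 5, 8, 0.1

Q_u = rng.normal(size=(n_rated, k))   # factors of items rated by user u
r_u = rng.normal(size=n_rated)        # the user's ratings of those items

# Closed-form ridge solution: p_u = (Q_u^T Q_u + lambda I)^{-1} Q_u^T r_u
p_u = np.linalg.solve(Q_u.T @ Q_u + lam * np.eye(k), Q_u.T @ r_u)

# Gradient of sum_i (r_ui - p^T q_i)^2 + lambda ||p||^2 at p_u
grad = -2 * Q_u.T @ (r_u - Q_u @ p_u) + 2 * lam * p_u
assert np.allclose(grad, 0)  # the closed form zeroes the gradient
```

Because each user's (and each item's) subproblem has this independent closed-form solution, the per-user solves can be distributed across workers, which is exactly what makes ALS parallel-friendly.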

Advantages of ALS

  • Parallelizable: Optimization for different users/items can be computed in parallel
  • Fast convergence: Each iteration has an analytical solution
  • Suitable for large-scale data: Frameworks like Spark MLlib implement ALS

ALS Code Implementation

class ALSMatrixFactorization:
    def __init__(self, n_factors=50, reg=0.01, n_epochs=20):
        """ALS Matrix Factorization model"""
        self.n_factors = n_factors
        self.reg = reg
        self.n_epochs = n_epochs
        self.user_factors = None
        self.item_factors = None

    def fit(self, user_item_matrix):
        """Train the model"""
        users = list(user_item_matrix.keys())
        items = set()
        for items_dict in user_item_matrix.values():
            items.update(items_dict.keys())
        items = list(items)

        self.user_to_idx = {user: idx for idx, user in enumerate(users)}
        self.idx_to_user = {idx: user for idx, user in enumerate(users)}
        self.item_to_idx = {item: idx for idx, item in enumerate(items)}
        self.idx_to_item = {idx: item for idx, item in enumerate(items)}

        n_users = len(users)
        n_items = len(items)

        # Initialize
        self.user_factors = np.random.normal(scale=0.1, size=(n_users, self.n_factors))
        self.item_factors = np.random.normal(scale=0.1, size=(n_items, self.n_factors))

        # Build user-item rating matrix (sparse)
        self.user_items = {}
        self.item_users = {}
        for user_id, items_dict in user_item_matrix.items():
            user_idx = self.user_to_idx[user_id]
            self.user_items[user_idx] = {}
            for item_id, rating in items_dict.items():
                item_idx = self.item_to_idx[item_id]
                self.user_items[user_idx][item_idx] = rating
                if item_idx not in self.item_users:
                    self.item_users[item_idx] = {}
                self.item_users[item_idx][user_idx] = rating

        # ALS iteration
        lambda_eye = self.reg * np.eye(self.n_factors)

        for epoch in range(self.n_epochs):
            # Fix item vectors, optimize user vectors
            for user_idx in range(n_users):
                if user_idx not in self.user_items:
                    continue

                items_rated = list(self.user_items[user_idx].keys())
                ratings = [self.user_items[user_idx][i] for i in items_rated]

                if len(items_rated) == 0:
                    continue

                Q_u = self.item_factors[items_rated]
                r_u = np.array(ratings)

                # Solve (Q_u^T Q_u + lambda I) p_u = Q_u^T r_u
                A = Q_u.T.dot(Q_u) + lambda_eye
                b = Q_u.T.dot(r_u)
                self.user_factors[user_idx] = np.linalg.solve(A, b)

            # Fix user vectors, optimize item vectors
            for item_idx in range(n_items):
                if item_idx not in self.item_users:
                    continue

                users_rated = list(self.item_users[item_idx].keys())
                ratings = [self.item_users[item_idx][u] for u in users_rated]

                if len(users_rated) == 0:
                    continue

                P_i = self.user_factors[users_rated]
                r_i = np.array(ratings)

                # Solve (P_i^T P_i + lambda I) q_i = P_i^T r_i
                A = P_i.T.dot(P_i) + lambda_eye
                b = P_i.T.dot(r_i)
                self.item_factors[item_idx] = np.linalg.solve(A, b)

            # Calculate RMSE
            if (epoch + 1) % 5 == 0:
                total_error = 0
                count = 0
                for user_idx, items_dict in self.user_items.items():
                    for item_idx, rating in items_dict.items():
                        pred = np.dot(self.user_factors[user_idx], self.item_factors[item_idx])
                        total_error += (rating - pred) ** 2
                        count += 1
                rmse = np.sqrt(total_error / count) if count > 0 else 0
                print(f"Epoch {epoch + 1}, RMSE: {rmse:.4f}")

    def predict(self, user_id, item_id):
        """Predict rating"""
        if user_id not in self.user_to_idx or item_id not in self.item_to_idx:
            return 0

        user_idx = self.user_to_idx[user_id]
        item_idx = self.item_to_idx[item_id]
        pred = np.dot(self.user_factors[user_idx], self.item_factors[item_idx])
        return pred

    def recommend(self, user_id, n=10):
        """Recommend items"""
        if user_id not in self.user_to_idx:
            return []

        predictions = []
        for item_idx in range(len(self.item_factors)):
            item_id = self.idx_to_item[item_idx]
            pred = self.predict(user_id, item_id)
            predictions.append((item_id, pred))

        predictions.sort(key=lambda x: x[1], reverse=True)
        return predictions[:n]

BPR: Bayesian Personalized Ranking

From Rating Prediction to Ranking Learning

Traditional matrix factorization methods (MF, SVD++) solve a rating prediction task: predicting user ratings for items. But in real recommendation scenarios, we care more about ranking: placing items users are most likely to like at the top.

BPR (Bayesian Personalized Ranking) transforms the recommendation problem into a ranking learning problem, directly optimizing a smooth surrogate of ranking metrics (such as AUC) rather than rating prediction RMSE.

Core Idea of BPR

BPR assumes: users prefer items they've rated over items they haven't rated.

For user \(u\), if they have positive feedback (like a rating or click) for item \(i\) but no feedback for item \(j\), then \(u\) prefers \(i\) over \(j\):\[r_{ui} > r_{uj}\]

BPR Objective Function

BPR uses Bayesian methods to maximize the posterior probability:\[p(\Theta | >_u) \propto p(>_u | \Theta) p(\Theta)\]where:

  • \(>_u\) represents user \(u\)'s preference ranking
  • \(\Theta\) are the model parameters (user and item vectors)

Define:\[p(>_u | \Theta) = \prod_{(u,i,j) \in D_S} p(i >_u j | \Theta)\]where \(D_S\) is the training sample set; each sample is a triple \((u, i, j)\), meaning user \(u\) prefers item \(i\) over \(j\).

Define:\[p(i >_u j | \Theta) = \sigma(\hat{r}_{uij})\]where \(\hat{r}_{uij} = \hat{r}_{ui} - \hat{r}_{uj}\) and \(\sigma\) is the sigmoid function.

Final objective function (log-likelihood):\[\mathcal{L} = \sum_{(u,i,j) \in D_S} \ln \sigma(\hat{r}_{uij}) - \lambda \|\Theta\|^2\]

BPR Optimization

Use stochastic gradient descent to update parameters:

For each sample \((u, i, j)\):\[\frac{\partial \mathcal{L}}{\partial \Theta} = \frac{e^{-\hat{r}_{uij}}}{1 + e^{-\hat{r}_{uij}}} \cdot \frac{\partial \hat{r}_{uij}}{\partial \Theta} - \lambda \Theta\]For the matrix factorization model \(\hat{r}_{ui} = \mathbf{p}_u^T \mathbf{q}_i\):\[\frac{\partial \hat{r}_{uij}}{\partial \mathbf{p}_u} = \mathbf{q}_i - \mathbf{q}_j, \qquad \frac{\partial \hat{r}_{uij}}{\partial \mathbf{q}_i} = \mathbf{p}_u, \qquad \frac{\partial \hat{r}_{uij}}{\partial \mathbf{q}_j} = -\mathbf{p}_u\]

BPR Code Implementation

class BPRMatrixFactorization:
    def __init__(self, n_factors=50, learning_rate=0.01, reg=0.01, n_epochs=20):
        """BPR Matrix Factorization model"""
        self.n_factors = n_factors
        self.learning_rate = learning_rate
        self.reg = reg
        self.n_epochs = n_epochs
        self.user_factors = None
        self.item_factors = None

    def fit(self, user_item_matrix):
        """Train the model"""
        users = list(user_item_matrix.keys())
        items = set()
        for items_dict in user_item_matrix.values():
            items.update(items_dict.keys())
        items = list(items)

        self.user_to_idx = {user: idx for idx, user in enumerate(users)}
        self.idx_to_user = {idx: user for idx, user in enumerate(users)}
        self.item_to_idx = {item: idx for idx, item in enumerate(items)}
        self.idx_to_item = {idx: item for idx, item in enumerate(items)}

        n_users = len(users)
        n_items = len(items)

        # Initialize
        self.user_factors = np.random.normal(scale=0.1, size=(n_users, self.n_factors))
        self.item_factors = np.random.normal(scale=0.1, size=(n_items, self.n_factors))

        # Build training samples: for each user, positive items vs negative items
        self.user_items = {}
        for user_id, items_dict in user_item_matrix.items():
            user_idx = self.user_to_idx[user_id]
            self.user_items[user_idx] = set(items_dict.keys())

        all_items = set(items)

        # Training
        for epoch in range(self.n_epochs):
            total_loss = 0
            sample_count = 0

            for user_idx in range(n_users):
                if user_idx not in self.user_items:
                    continue

                positive_items = self.user_items[user_idx]
                negative_items = all_items - positive_items

                if len(positive_items) == 0 or len(negative_items) == 0:
                    continue

                # For each positive sample, sample a negative sample
                for pos_item_id in positive_items:
                    pos_item_idx = self.item_to_idx[pos_item_id]

                    # Randomly sample a negative item
                    neg_item_id = random.choice(list(negative_items))
                    neg_item_idx = self.item_to_idx[neg_item_id]

                    # Calculate predictions
                    r_ui = np.dot(self.user_factors[user_idx], self.item_factors[pos_item_idx])
                    r_uj = np.dot(self.user_factors[user_idx], self.item_factors[neg_item_idx])
                    r_uij = r_ui - r_uj

                    # Gradient multiplier: sigma(-r_uij) = e^{-r_uij} / (1 + e^{-r_uij})
                    x_uij = 1 / (1 + np.exp(r_uij))

                    # Update parameters
                    user_factor = self.user_factors[user_idx].copy()

                    self.user_factors[user_idx] += self.learning_rate * (
                        x_uij * (self.item_factors[pos_item_idx] - self.item_factors[neg_item_idx]) -
                        self.reg * self.user_factors[user_idx]
                    )
                    self.item_factors[pos_item_idx] += self.learning_rate * (
                        x_uij * user_factor - self.reg * self.item_factors[pos_item_idx]
                    )
                    self.item_factors[neg_item_idx] += self.learning_rate * (
                        -x_uij * user_factor - self.reg * self.item_factors[neg_item_idx]
                    )

                    total_loss += np.log(1 / (1 + np.exp(-r_uij)))
                    sample_count += 1

            if (epoch + 1) % 5 == 0:
                avg_loss = total_loss / sample_count if sample_count > 0 else 0
                print(f"Epoch {epoch + 1}, Average Loss: {avg_loss:.4f}")

    def predict(self, user_id, item_id):
        """Predict user's preference score for an item"""
        if user_id not in self.user_to_idx or item_id not in self.item_to_idx:
            return 0

        user_idx = self.user_to_idx[user_id]
        item_idx = self.item_to_idx[item_id]
        pred = np.dot(self.user_factors[user_idx], self.item_factors[item_idx])
        return pred

    def recommend(self, user_id, n=10):
        """Recommend items"""
        if user_id not in self.user_to_idx:
            return []

        predictions = []
        for item_idx in range(len(self.item_factors)):
            item_id = self.idx_to_item[item_idx]
            pred = self.predict(user_id, item_id)
            predictions.append((item_id, pred))

        predictions.sort(key=lambda x: x[1], reverse=True)
        return predictions[:n]

Factorization Machines (FM)

Motivation for FM

Factorization Machines (FM), proposed by Steffen Rendle in 2010, were initially designed to solve sparse-feature regression and classification problems. In recommendation systems, FM can simultaneously utilize:

  • User features: age, gender, location, etc.
  • Item features: category, price, brand, etc.
  • User-item interactions: historical ratings, clicks, etc.

FM Model

FM prediction formula:\[\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j\]where:

  • \(w_0\) is the global bias
  • \(w_i\) is the weight for feature \(i\)
  • \(\mathbf{v}_i\) is the \(k\)-dimensional latent vector for feature \(i\)
  • \(\langle \mathbf{v}_i, \mathbf{v}_j \rangle = \sum_{f=1}^{k} v_{i,f} \cdot v_{j,f}\) is the vector inner product

Advantages of FM

  1. Handles sparse features: Captures feature interactions through vector inner products, even when feature pairs rarely appear in training data

  2. Linear time complexity: Can be rewritten as linear complexity:\[\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j = \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_{i=1}^{n} v_{i,f} x_i \right)^2 - \sum_{i=1}^{n} v_{i,f}^2 x_i^2 \right]\]

  3. General purpose: Can be used for regression, classification, ranking, and other tasks
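The linear-time reformulation in point 2 is easy to verify numerically. The following standalone sketch (random data, not part of the implementation later in this section) compares the naive \(O(n^2 k)\) pairwise sum with the \(O(nk)\) formula:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 20, 4
v = rng.normal(size=(n, k))   # latent vectors v_i
x = rng.normal(size=n)        # feature values x_i

# Naive O(n^2 k) pairwise interaction sum
naive = sum(v[i] @ v[j] * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))

# Linear-time O(n k) reformulation:
# 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ]
vx = v * x[:, None]
fast = 0.5 * ((vx.sum(axis=0) ** 2).sum() - (vx ** 2).sum())

assert np.isclose(naive, fast)
```

This identity is what makes FM training and prediction practical on high-dimensional sparse inputs: for a sparse \(\mathbf{x}\), only the nonzero entries contribute to the sums.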

FM Applications in Recommendation Systems

In recommendation systems, the feature vector \(\mathbf{x}\) typically includes:

  • User one-hot encoding: e.g., \([0, 0, 1, 0, ...]\) represents user 3
  • Item one-hot encoding: e.g., \([0, 1, 0, 0, ...]\) represents item 2
  • Other features: user age, item category, etc.

FM can automatically learn:

  • User-item interactions: \(\langle \mathbf{v}_u, \mathbf{v}_i \rangle\)
  • User-feature interactions: \(\langle \mathbf{v}_u, \mathbf{v}_{age} \rangle\)
  • Item-feature interactions: \(\langle \mathbf{v}_i, \mathbf{v}_{category} \rangle\)

FM Code Implementation

class FactorizationMachine:
    def __init__(self, n_factors=10, learning_rate=0.01, reg=0.01, n_epochs=20):
        """Factorization Machine"""
        self.n_factors = n_factors
        self.learning_rate = learning_rate
        self.reg = reg
        self.n_epochs = n_epochs
        self.w0 = 0
        self.w = None
        self.v = None

    def fit(self, X, y):
        """
        Train the model

        Parameters:
        -----------
        X : np.ndarray
            Feature matrix, shape=(n_samples, n_features)
        y : np.ndarray
            Labels, shape=(n_samples,)
        """
        n_samples, n_features = X.shape

        # Initialize parameters
        self.w0 = np.mean(y)
        self.w = np.zeros(n_features)
        self.v = np.random.normal(scale=0.1, size=(n_features, self.n_factors))

        # Training
        for epoch in range(self.n_epochs):
            total_loss = 0

            for idx in range(n_samples):
                x = X[idx]
                y_true = y[idx]

                # Predict
                y_pred = self._predict_single(x)
                error = y_true - y_pred

                # Update w0
                self.w0 += self.learning_rate * error

                # Update w
                for i in range(n_features):
                    if x[i] != 0:
                        self.w[i] += self.learning_rate * (error * x[i] - self.reg * self.w[i])

                # Update v
                for i in range(n_features):
                    if x[i] != 0:
                        # Gradient of the interaction term w.r.t. v[i]:
                        # x_i * sum_{j != i} v_j x_j (= x_i * sum_j v_j x_j - v_i x_i^2)
                        sum_vx = np.sum([self.v[j] * x[j] for j in range(n_features) if j != i], axis=0)
                        grad_v = error * x[i] * sum_vx - self.reg * self.v[i]
                        self.v[i] += self.learning_rate * grad_v

                total_loss += error ** 2

            if (epoch + 1) % 5 == 0:
                rmse = np.sqrt(total_loss / n_samples)
                print(f"Epoch {epoch + 1}, RMSE: {rmse:.4f}")

    def _predict_single(self, x):
        """Predict single sample"""
        # Linear term
        linear = np.dot(self.w, x)

        # Interaction term (using the linear-time formula)
        interaction = 0
        sum_vx = np.zeros(self.n_factors)
        sum_vx_sq = 0

        for i in range(len(x)):
            if x[i] != 0:
                vx = self.v[i] * x[i]
                sum_vx += vx
                sum_vx_sq += np.dot(vx, vx)

        interaction = 0.5 * (np.dot(sum_vx, sum_vx) - sum_vx_sq)

        return self.w0 + linear + interaction

    def predict(self, X):
        """Predict"""
        predictions = []
        for x in X:
            predictions.append(self._predict_single(x))
        return np.array(predictions)

Implicit Feedback Handling

Explicit vs. Implicit Feedback

  • Explicit feedback: User's explicit ratings (1-5 stars), likes, favorites, etc.
  • Implicit feedback: User behavior data, such as clicks, view time, purchases, etc.

Characteristics of Implicit Feedback

  1. Large data volume: Implicit feedback is much more abundant than explicit feedback
  2. High noise: Clicks don't necessarily mean likes, could be accidental
  3. Only positive samples: No explicit negative samples (no click doesn't mean dislike)

Methods for Handling Implicit Feedback

1. Weighted Matrix Factorization

Assign different weights to different implicit feedback:\[\mathcal{L} = \sum_{(u,i)} c_{ui} (r_{ui} - \hat{r}_{ui})^2 + \lambda (\|\mathbf{p}_u\|^2 + \|\mathbf{q}_i\|^2)\]where \(c_{ui}\) is a confidence weight; typically:

  • More clicks mean higher weight
  • Longer view time means higher weight

2. Negative Sample Sampling

Sample negative samples from uninteracted items, pair with positive samples for training.

3. Using BPR

BPR naturally suits implicit feedback, as it only needs positive vs. negative sample comparisons.

Implicit Feedback Code Example

class ImplicitFeedbackMF:
    def __init__(self, n_factors=50, learning_rate=0.01, reg=0.01, n_epochs=20):
        """Implicit Feedback Matrix Factorization"""
        self.n_factors = n_factors
        self.learning_rate = learning_rate
        self.reg = reg
        self.n_epochs = n_epochs
        self.user_factors = None
        self.item_factors = None

    def fit(self, user_item_interactions, confidence_func=None):
        """
        Train the model

        Parameters:
        -----------
        user_item_interactions : dict
            {user_id: {item_id: interaction_count}}
        confidence_func : callable, optional
            Confidence function, converts interaction count to confidence
        """
        if confidence_func is None:
            # Default confidence function: c = 1 + alpha * log(1 + count)
            confidence_func = lambda count: 1 + 10 * np.log1p(count)

        users = list(user_item_interactions.keys())
        items = set()
        for items_dict in user_item_interactions.values():
            items.update(items_dict.keys())
        items = list(items)

        self.user_to_idx = {user: idx for idx, user in enumerate(users)}
        self.idx_to_user = {idx: user for idx, user in enumerate(users)}
        self.item_to_idx = {item: idx for idx, item in enumerate(items)}
        self.idx_to_item = {idx: item for idx, item in enumerate(items)}

        n_users = len(users)
        n_items = len(items)

        # Initialize
        self.user_factors = np.random.normal(scale=0.1, size=(n_users, self.n_factors))
        self.item_factors = np.random.normal(scale=0.1, size=(n_items, self.n_factors))

        # Build training samples (with confidence)
        samples = []
        for user_id, items_dict in user_item_interactions.items():
            user_idx = self.user_to_idx[user_id]
            for item_id, count in items_dict.items():
                item_idx = self.item_to_idx[item_id]
                confidence = confidence_func(count)
                samples.append((user_idx, item_idx, confidence))

        # Training
        for epoch in range(self.n_epochs):
            random.shuffle(samples)
            total_loss = 0

            for user_idx, item_idx, confidence in samples:
                # Predict (implicit feedback target value is 1)
                pred = np.dot(self.user_factors[user_idx], self.item_factors[item_idx])
                error = 1 - pred  # Target value is 1

                # Update parameters (weighted)
                user_factor = self.user_factors[user_idx].copy()
                item_factor = self.item_factors[item_idx].copy()

                self.user_factors[user_idx] += self.learning_rate * confidence * (
                    error * item_factor - self.reg * self.user_factors[user_idx]
                )
                self.item_factors[item_idx] += self.learning_rate * confidence * (
                    error * user_factor - self.reg * self.item_factors[item_idx]
                )

                total_loss += confidence * error ** 2

            if (epoch + 1) % 5 == 0:
                rmse = np.sqrt(total_loss / len(samples))
                print(f"Epoch {epoch + 1}, Weighted RMSE: {rmse:.4f}")

    def predict(self, user_id, item_id):
        """Predict user's preference for an item"""
        if user_id not in self.user_to_idx or item_id not in self.item_to_idx:
            return 0

        user_idx = self.user_to_idx[user_id]
        item_idx = self.item_to_idx[item_id]
        pred = np.dot(self.user_factors[user_idx], self.item_factors[item_idx])
        return pred

    def recommend(self, user_id, n=10):
        """Recommend items"""
        if user_id not in self.user_to_idx:
            return []

        predictions = []
        for item_idx in range(len(self.item_factors)):
            item_id = self.idx_to_item[item_idx]
            pred = self.predict(user_id, item_id)
            predictions.append((item_id, pred))

        predictions.sort(key=lambda x: x[1], reverse=True)
        return predictions[:n]

Frequently Asked Questions (Q&A)

Q1: Which should I choose, User-CF or Item-CF?

A: The choice depends on the specific scenario:

  • Item-CF is more commonly used: Item count is usually less than user count, lower computation cost; item similarity is more stable and can be pre-computed; recommendations are more stable.
  • User-CF is suitable for: Relatively fewer users; user interests change quickly, need real-time capture; want to discover users' latent interests.

Practical advice: In most cases, choose Item-CF unless there are special requirements.

Q2: How to choose the latent feature dimension \(k\) for matrix factorization?

A: Choosing \(k\) requires trade-offs:

  • \(k\) too small: insufficient model expressiveness, underfitting
  • \(k\) too large: easy to overfit, high computation cost

Rule of thumb:

  • Small scale (< 100K users): \(k\) = 10-50
  • Medium scale (100K-1M users): \(k\) = 50-100
  • Large scale (> 1M users): \(k\) = 100-200

In practice, use cross-validation: test different \(k\) values on a validation set and choose the one with the best RMSE/NDCG.

Q3: How to handle the cold start problem?

A: Cold start is divided into user cold start and item cold start:

User cold start:

  • Use user registration info (age, gender, location) to initialize user vectors
  • Fall back to popular-item recommendations as a default strategy
  • Guide users to label their interests during onboarding

Item cold start:

  • Use item content features (category, tags) to initialize item vectors
  • Use content similarity to recommend similar items
  • Use metadata such as the item's publisher and publish time

Hybrid strategy: Use content-based recommendations during cold start, switch to collaborative filtering after accumulating data.

Q4: Which is better, SGD or ALS?

A: Each has advantages:

SGD:

  • Advantages: simple implementation, low memory usage, suitable for online learning
  • Disadvantages: slow convergence, hard to parallelize

ALS:

  • Advantages: fast convergence, parallelizable, suitable for large-scale data
  • Disadvantages: high memory usage (the factor matrices must be kept in memory), not suitable for online learning

Selection advice:

  • Small-scale data: SGD
  • Large-scale data: ALS (e.g., Spark MLlib)
  • Online learning: SGD

Q5: Why does BPR perform better than traditional rating prediction methods?

A: BPR's advantages:

  1. Directly optimizes ranking metrics: BPR optimizes ranking (AUC) rather than rating prediction RMSE, better aligned with recommendation task goals
  2. Suitable for implicit feedback: Only needs positive vs. negative sample comparisons, doesn't need precise ratings
  3. More personalized: Each user's ranking is independent, not affected by other users' ratings

Experiments show: In implicit feedback scenarios, BPR's AUC is typically 5-10% higher than MF.
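Point 1 above (optimizing ranking rather than ratings) is concrete in the BPR-MF SGD step. A hedged sketch following Rendle et al. (2009): for a sampled triple of user \(u\), positive item \(i\), and negative item \(j\), the update pushes the score of \(i\) above the score of \(j\) instead of fitting any absolute rating.

```python
import numpy as np

def bpr_step(P, Q, u, i, j, lr=0.05, reg=0.01):
    """One SGD step on the BPR objective for triple (u, i, j), in place."""
    x_uij = P[u] @ (Q[i] - Q[j])            # pairwise score difference
    g = 1.0 / (1.0 + np.exp(x_uij))         # 1 - sigmoid(x_uij): gradient scale
    pu, qi, qj = P[u].copy(), Q[i].copy(), Q[j].copy()
    P[u] += lr * (g * (qi - qj) - reg * pu)
    Q[i] += lr * (g * pu - reg * qi)
    Q[j] += lr * (-g * pu - reg * qj)
```

Note that only the *difference* of item scores appears in the gradient, which is why BPR needs no rating values, only positive/negative item pairs per user.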

Q6: How to handle data sparsity?

A: Data sparsity is a core challenge in recommendation systems:

  1. Matrix factorization: Decomposing high-dimensional sparse matrices into low-dimensional dense matrices is an effective method
  2. Regularization: Prevents overfitting, improves generalization
  3. Utilize implicit feedback: Implicit feedback has large volume, can supplement explicit feedback
  4. Feature engineering: Use auxiliary info (categories, tags) for users and items
  5. Dimensionality reduction: PCA, SVD, etc.

Q7: What is the role of bias terms?

A: Bias terms capture systematic biases in ratings:

  • User bias \(b_u\): captures user rating habits (some users tend to rate high, others low)
  • Item bias \(b_i\): captures item popularity (popular items have higher average ratings)
  • Global bias \(\mu\): the overall rating level

Experiments show: After adding bias terms, RMSE typically decreases by 10-20%.
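The biased-MF prediction these terms add up to is \(\hat{r}_{ui} = \mu + b_u + b_i + \mathbf{p}_u \cdot \mathbf{q}_i\); as a one-line function (variable names are illustrative):

```python
import numpy as np

def predict(mu, b_u, b_i, p_u, q_i):
    """Biased matrix-factorization prediction for one (user, item) pair."""
    return mu + b_u + b_i + float(np.dot(p_u, q_i))
```

For example, a generous user (\(b_u = 0.4\)) rating a popular item (\(b_i = 0.3\)) on a platform whose global mean is 3.5 already sits at 4.2 before any personalization; the latent dot product adds only the personalized residual.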

Q8: How to construct implicit feedback in SVD++?

A: Implicit feedback can be extracted from various user behaviors:

  1. Browsing behavior: Items browsed but not rated by users
  2. Click behavior: Items clicked by users
  3. Purchase behavior: Items purchased by users (even without rating)
  4. Favorite behavior: Items favorited by users
  5. View time: Video/audio view time exceeding threshold

Note: Implicit feedback needs normalization, typically using \(|I_u|^{-1/2}\) as the normalization factor, where \(I_u\) is the set of items user \(u\) has interacted with.
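The normalized implicit term can be sketched as follows. In SVD++ the explicit user vector \(\mathbf{p}_u\) is augmented with the scaled sum of implicit item factors \(\mathbf{y}_j\) over \(I_u\) (function and variable names here are illustrative):

```python
import numpy as np

def svdpp_user_vector(p_u, Y, I_u):
    """SVD++-style effective user vector.

    p_u : (k,) explicit user factors
    Y   : (n_items, k) implicit item factors
    I_u : indices of items user u interacted with
    """
    if len(I_u) == 0:
        return p_u                         # no implicit signal: fall back to p_u
    # |I_u|^(-1/2) scaling keeps very active and inactive users comparable.
    return p_u + Y[list(I_u)].sum(axis=0) / np.sqrt(len(I_u))
```

Without the \(|I_u|^{-1/2}\) factor, users with thousands of interactions would get implicit sums of much larger magnitude than casual users, swamping \(\mathbf{p}_u\).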

Q9: What is the relationship between FM and matrix factorization?

A: FM is a generalization of matrix factorization:

  • Matrix factorization: Only considers user-item interactions, feature vectors are one-hot encodings
  • FM: Can simultaneously consider user features, item features, and interaction features, more flexible

Mathematical relationship: If FM's feature vector only contains user and item one-hot encodings, FM reduces to matrix factorization.

FM advantage: Can fuse more features (like user age, item category), improving recommendation effectiveness.
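The reduction claim above can be checked numerically. Using Rendle's \(O(kn)\) rewriting of the FM pairwise term, a feature vector that is just the concatenated user and item one-hot encodings yields exactly \(\langle \mathbf{v}_u, \mathbf{v}_i \rangle\), i.e., plain matrix factorization:

```python
import numpy as np

def fm_pairwise(V, x):
    """sum_{i<j} <v_i, v_j> x_i x_j, computed in O(kn) (Rendle 2010)."""
    s = V.T @ x                       # (k,) sum of active factor vectors
    s2 = (V ** 2).T @ (x ** 2)        # (k,) sum of their squares
    return 0.5 * float(np.sum(s ** 2 - s2))

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 6, 3
V = rng.normal(size=(n_users + n_items, k))
u, i = 1, 2
x = np.zeros(n_users + n_items)
x[u] = 1.0                            # one-hot user
x[n_users + i] = 1.0                  # one-hot item
# With only these two features active, the pairwise term is <v_u, v_i>.
assert np.isclose(fm_pairwise(V, x), V[u] @ V[n_users + i])
```

Adding more nonzero features to `x` (user age bucket, item category, ...) is all it takes to go beyond MF, which is the flexibility point made above.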

Q10: How to evaluate recommendation system effectiveness?

A: Recommendation system evaluation metrics fall into two categories:

Offline metrics:

  • Rating prediction: RMSE, MAE
  • Ranking: AUC, NDCG, Precision@K, Recall@K

Online metrics:

  • Click-through rate (CTR) of recommended items
  • Conversion rate: purchase/conversion rate of recommended items
  • User retention after adopting the recommendation system

Practice advice: Offline metrics for model selection and hyperparameter tuning, online A/B testing for final validation.
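The two offline-metric families above are short to implement; a minimal sketch (names illustrative):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error for rating prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = recommended[:k]
    return len(set(top_k) & set(relevant)) / k
```

For example, predictions `[3.5, 3.0, 4.5]` against true ratings `[4, 3, 5]` give an RMSE of about 0.41, and a top-3 list containing one of two relevant items scores Precision@3 = 1/3.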

Q11: What initialization methods are there for matrix factorization?

A: Common initialization methods:

  1. Random initialization: Sample from a normal distribution \(\mathcal{N}(0, 0.1)\)

  2. Xavier initialization: \(\mathcal{N}(0, 1/n)\), where \(n\) is the input dimension

  3. Pre-trained initialization: Initialize using similarity matrices from other models (like Item-CF)

  4. Mean initialization: Initialize bias terms with user/item average ratings

Recommendation: Usually use random initialization, combined with learning rate decay.

Q12: How to handle boundary issues in rating prediction?

A: Rating predictions may exceed valid range (e.g., 1-5):

  1. Truncation: Limit predicted values to the range \([r_{\min}, r_{\max}]\)

  2. Sigmoid mapping: Use a sigmoid function to map predictions into \([r_{\min}, r_{\max}]\)

  3. No handling: If proportion exceeding range is small, can leave as is

Practice: In most cases, truncation is the simplest and most effective method.
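The truncation strategy is a single call, assuming a 1-5 rating scale:

```python
import numpy as np

# Clip raw model outputs into the valid rating range [1, 5].
raw = np.array([0.3, 2.7, 4.1, 5.8])
clipped = np.clip(raw, 1.0, 5.0)   # out-of-range values snap to the bounds
```

Here `0.3` becomes `1.0` and `5.8` becomes `5.0`, while in-range predictions pass through unchanged.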

Q13: How to accelerate matrix factorization training?

A: Acceleration methods:

  1. Use ALS: ALS can be parallelized, faster than SGD

  2. Negative sampling: In BPR, only sample part of negative samples

  3. Feature dimensionality reduction: Reduce latent feature dimension\(k\)

  4. Batch updates: Use mini-batch SGD

  5. Use GPU: Utilize GPU to accelerate matrix operations

  6. Distributed training: Use Spark, TensorFlow, and other distributed frameworks

Q14: How to solve overfitting in matrix factorization?

A: Methods to prevent overfitting:

  1. Regularization: L2 regularization \(\lambda (\|\mathbf{p}_u\|^2 + \|\mathbf{q}_i\|^2)\)

  2. Early stopping: Monitor performance on validation set, stop when performance no longer improves

  3. Dropout: Randomly drop part of features (commonly used in deep learning)

  4. Reduce feature dimension: lower the latent dimension \(k\)

  5. Increase training data: More training samples help generalization

Recommendation: Regularization is the most commonly used method; \(\lambda\) is typically set between \(0.01\) and \(0.1\).
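The L2-regularized SGD update implied by minimizing \((r_{ui} - \mathbf{p}_u \cdot \mathbf{q}_i)^2 + \lambda(\|\mathbf{p}_u\|^2 + \|\mathbf{q}_i\|^2)\) can be sketched as one step (names illustrative; bias terms omitted for brevity):

```python
import numpy as np

def sgd_step(p, q, r, lr=0.01, lam=0.05):
    """One regularized SGD step on a single observed rating r."""
    e = r - p @ q                         # prediction error
    p_new = p + lr * (e * q - lam * p)    # gradient step with weight decay
    q_new = q + lr * (e * p - lam * q)
    return p_new, q_new
```

The `- lam * p` term shrinks the factors toward zero on every update, which is the mechanism that keeps rarely-seen users and items from memorizing their few ratings.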

Q15: How to implement real-time recommendations?

A: Real-time recommendation implementation strategies:

  1. Incremental updates: Use online learning algorithms (like SGD), incrementally update model when new data arrives
  2. Caching mechanism: Pre-compute recommendations for popular users, cache them
  3. Feature caching: Cache user and item feature vectors, avoid repeated computation
  4. Stream processing: Use Kafka, Flink, and other stream processing frameworks for real-time user behavior processing
  5. Hybrid strategy: Offline model + online rules, offline model updates periodically, online rules adjust in real-time

Architecture: Usually adopt Lambda architecture, offline batch processing + online stream processing.

Summary

This article provides an in-depth exploration of collaborative filtering and matrix factorization core algorithms:

  1. Collaborative Filtering: Basic principles, similarity calculations, and code implementations of User-CF and Item-CF
  2. Matrix Factorization: Mathematical principles and implementations of basic MF, MF with bias, SVD++
  3. Optimization Methods: Comparison and selection of SGD and ALS
  4. Ranking Learning: BPR's transformation from rating prediction to ranking learning
  5. Factorization Machines: FM's ability to handle sparse features
  6. Implicit Feedback: How to handle implicit feedback data
  7. Practical Techniques: Bias terms, regularization, cold start, etc.

These algorithms form the foundation of recommendation systems. Understanding them helps in learning more advanced deep learning methods. In practice, choose appropriate algorithms based on data scale and business scenarios, and validate effectiveness through experiments.

References

  1. Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, 42(8), 30-37. IEEE
  2. Rendle, S., Freudenthaler, C., Gantner, Z., & Schmidt-Thieme, L. (2009). BPR: Bayesian personalized ranking from implicit feedback. UAI. arXiv:1205.2618
  3. Rendle, S. (2010). Factorization machines. ICDM. DOI:10.1109/ICDM.2010.127
  4. Koren, Y. (2008). Factorization meets the neighborhood: a multifaceted collaborative filtering model. KDD. DOI:10.1145/1401890.1401944
  5. Hu, Y., Koren, Y., & Volinsky, C. (2008). Collaborative filtering for implicit feedback datasets. ICDM. DOI:10.1109/ICDM.2008.22