Have you ever wondered how Netflix generates its movie recommendations? I am always fascinated by them: they seem to know my preferences better than I do myself. Such recommendation systems are very complex, but most are based on a technique called collaborative filtering. The idea is to look at the current user's previous ratings, find other users with similar preferences, and recommend to the current user the products that those similar users have rated positively.

In this post, we will use the MovieLens dataset to train a recommender system based on collaborative filtering. This dataset contains 100,000 ratings from 943 users on 1,682 movies, where each user has rated at least 20 movies. Although the dataset contains further information that might be helpful, we will concentrate solely on the correlation between user ratings.

Prepare the dataset
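
The raw files can be downloaded from GroupLens. Here is a minimal sketch for fetching and unpacking them (assuming the standard ml-100k.zip archive URL; adjust the target directory if you are not running on Colab):

# minimal sketch: fetch and unpack MovieLens 100k
# (assumes the standard GroupLens archive URL)
import urllib.request, zipfile

url = "https://files.grouplens.org/datasets/movielens/ml-100k.zip"
urllib.request.urlretrieve(url, "ml-100k.zip")
with zipfile.ZipFile("ml-100k.zip") as zf:
    zf.extractall("/content/")  # creates /content/ml-100k/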

import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data_utils
import numpy as np
from sklearn.model_selection import train_test_split
import copy
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
# load the rating data
path = "/content/ml-100k/"
ratings = pd.read_csv(path + 'u.data', delimiter='\t', header=None,
                      usecols=(0,1,2), names=['user','movie','rating'])
ratings.head()
   user  movie  rating
0   196    242       3
1   186    302       3
2    22    377       1
3   244     51       2
4   166    346       1
# load the movie data
movies = pd.read_csv(path + 'u.item',  delimiter='|', encoding='latin-1',
                     usecols=(0,1), names=('movie','title'), header=None)
movies.head()
   movie              title
0      1   Toy Story (1995)
1      2   GoldenEye (1995)
2      3  Four Rooms (1995)
3      4  Get Shorty (1995)
4      5     Copycat (1995)
# merge the dataset
ratings = ratings.merge(movies)
ratings.head()
   user  movie  rating         title
0   196    242       3  Kolya (1996)
1    63    242       3  Kolya (1996)
2   226    242       5  Kolya (1996)
3   154    242       3  Kolya (1996)
4   306    242       5  Kolya (1996)
# shift ids to be 0-based so they can index directly into embedding layers
ratings['user'] = ratings['user'] - 1
ratings['movie'] = ratings['movie'] - 1
movies['movie'] = movies['movie'] - 1
dataset_sizes = ratings.shape
dataset_sizes
(100000, 4)
n_users = len(ratings["user"].unique())
n_movies = len(ratings["movie"].unique())
n_users, n_movies
(943, 1682)
# movies with max. number of ratings
ratings["title"].value_counts()[:20]
Star Wars (1977)                    583
Contact (1997)                      509
Fargo (1996)                        508
Return of the Jedi (1983)           507
Liar Liar (1997)                    485
English Patient, The (1996)         481
Scream (1996)                       478
Toy Story (1995)                    452
Air Force One (1997)                431
Independence Day (ID4) (1996)       429
Raiders of the Lost Ark (1981)      420
Godfather, The (1972)               413
Pulp Fiction (1994)                 394
Twelve Monkeys (1995)               392
Silence of the Lambs, The (1991)    390
Jerry Maguire (1996)                384
Chasing Amy (1997)                  379
Rock, The (1996)                    378
Empire Strikes Back, The (1980)     367
Star Trek: First Contact (1996)     365
Name: title, dtype: int64
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
# targets: ratings as float32 with shape (n, 1); features: (user, movie) id pairs
target = torch.tensor(ratings['rating'].values.astype(np.float32), device=device)[..., np.newaxis]
features = torch.tensor(ratings.drop(['rating', 'title'], axis=1).values.astype(np.int64), device=device)  # long indices for nn.Embedding
# split datasets
data_tensor = data_utils.TensorDataset(features, target)
train_data, test_data = train_test_split(data_tensor, test_size=0.2, random_state=42)
len(train_data), len(test_data)
(80000, 20000)
bs = 100
train_loader = data_utils.DataLoader(train_data, batch_size=bs, shuffle=True)
test_loader = data_utils.DataLoader(test_data, batch_size=bs, shuffle=True)
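
Each batch from the loader yields a (user, movie) index pair per row plus the corresponding rating. A quick sanity check (shapes only; the actual rows vary because the loader shuffles):

# peek at one batch: features are (user, movie) index pairs, targets are ratings
xb, yb = next(iter(train_loader))
xb.shape, yb.shape  # (torch.Size([100, 2]), torch.Size([100, 1]))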

Define the model

In this section, we are going to use PyTorch to build our network. The network is made up of four embedding layers, which map a user or movie id into the latent spaces described below. The user_factors embedding layer represents each user's preferences along a set of latent traits. The movie_factors layer, on the other hand, describes how strongly each movie exhibits those traits. Of course, some movies are inherently good or bad, independent of who rates them, so we need a movie_bias to store this information. Likewise, some users are more critical than others, and the user_bias encodes this.
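
Concretely, the raw predicted rating for a user/movie pair is the dot product of their latent vectors plus the two biases. A toy sketch with made-up numbers and 3 latent factors:

# toy example with made-up numbers (3 latent factors)
user_vec   = np.array([0.5, -0.2, 0.8])  # how much this user likes each latent trait
movie_vec  = np.array([0.9,  0.1, 0.4])  # how strongly this movie exhibits each trait
user_bias  = 0.1                         # this user rates slightly above average
movie_bias = 0.3                         # this movie is liked regardless of taste

raw_score = user_vec @ movie_vec + user_bias + movie_bias
raw_score  # 0.45 - 0.02 + 0.32 + 0.1 + 0.3 = 1.15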

We see that these four layers capture exactly the core idea of collaborative filtering. Using deep learning, they can be learned directly from user ratings, using only the user id and movie id, without any further information. We will see in the next section that this is enough to make good recommendations.

# helper function to force predictions to be between 0 and 5.
def sigmoid_range(x, low, high):
    "Sigmoid function with range `(low, high)`"
    return torch.sigmoid(x) * (high - low) + low
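
For example, even very large raw scores get squashed into the target rating range:

# extreme raw scores are mapped into (0, 5.5)
sigmoid_range(torch.tensor([-10., 0., 10.]), 0, 5.5)
# ≈ tensor([0.0002, 2.7500, 5.4998])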

PyTorch provides a very simple way to define an embedding layer, namely nn.Embedding. It is essentially a trainable lookup table: indexing it with an id returns that id's latent vector. A tiny sketch:
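
# nn.Embedding is a lookup table: row i of the weight matrix is the vector for id i
emb = nn.Embedding(num_embeddings=5, embedding_dim=3)
emb(torch.tensor([0, 4])).shape  # torch.Size([2, 3])

With this building block, the full model looks as follows: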

class EmbeddingNet(nn.Module):
  def __init__(self, n_users, n_movies, n_factors = 10, y_range=(0,5.5)):
      super().__init__()
      self.user_factors = nn.Embedding(n_users, n_factors)
      self.user_bias = nn.Embedding(n_users, 1)
      self.movie_factors = nn.Embedding(n_movies, n_factors)
      self.movie_bias = nn.Embedding(n_movies, 1)
      self.y_range = y_range
  def forward(self, user_idx, movie_idx):
      users = self.user_factors(user_idx)
      movies = self.movie_factors(movie_idx)
      res = (users * movies).sum(dim=1, keepdim=True)
      res += self.user_bias(user_idx) + self.movie_bias(movie_idx)
      return sigmoid_range(res, *self.y_range)
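
A quick smoke test on a dummy batch (hypothetical indices, just to confirm the output shape):

# smoke test with dummy (user, movie) indices
m = EmbeddingNet(n_users=943, n_movies=1682)
m(torch.tensor([0, 1, 2]), torch.tensor([10, 20, 30])).shape
# torch.Size([3, 1]) -- one predicted rating per (user, movie) pair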

Define training objects

# training loop parameters
lr = 3e-3
wd = 3e-4

n_epochs = 30
patience = 20
no_improvements = 0
best_loss = np.inf
best_weights = None
n_factors = 50

model = EmbeddingNet(n_users, n_movies, n_factors).to(device)

criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=wd)
scheduler = optim.lr_scheduler.LinearLR(optimizer, start_factor=0.5, total_iters=10)  # optional warm-up; left unstepped in the loop below

Show model info:

model
EmbeddingNet(
  (user_factors): Embedding(943, 50)
  (user_bias): Embedding(943, 1)
  (movie_factors): Embedding(1682, 50)
  (movie_bias): Embedding(1682, 1)
)

Train the model

train_history = []
val_history = []
lr_history = []

for i in range(n_epochs):
  train_loss = 0
  n_batches = 0
  # train phase
  for features, targets in train_loader:
      optimizer.zero_grad()
      output = model(features[...,0], features[...,1])
      loss = criterion(output, targets)
 
      loss.backward()
      optimizer.step()
      
      train_loss += loss.detach()
      n_batches += 1

  #scheduler.step()
      
  train_loss = train_loss/n_batches
  train_history.append(train_loss)
  lr_history.append(scheduler.get_last_lr())
  
  # validation
  n_batches = 0
  val_loss = 0
  with torch.no_grad():
    for features, targets in test_loader:
      output = model(features[...,0], features[...,1])
      loss = criterion(output, targets)
      val_loss += loss
      n_batches += 1  
  val_loss = val_loss/n_batches
  val_history.append(val_loss)
  if (val_loss < best_loss):
    best_weights = copy.deepcopy(model.state_dict())
    best_loss = val_loss
  print("Epoch {}, train_loss: {}, val_loss: {}"    
        .format(i, train_loss.cpu().numpy(), val_loss.cpu().numpy()))
Epoch 0, train_loss: 7.457712173461914, val_loss: 7.2247538566589355
Epoch 1, train_loss: 6.389175891876221, val_loss: 6.712790012359619
Epoch 2, train_loss: 5.461848258972168, val_loss: 6.160996913909912
Epoch 3, train_loss: 4.655222415924072, val_loss: 5.5575456619262695
Epoch 4, train_loss: 3.8971118927001953, val_loss: 4.903194904327393
Epoch 5, train_loss: 3.1772820949554443, val_loss: 4.215761184692383
Epoch 6, train_loss: 2.510025978088379, val_loss: 3.52919340133667
Epoch 7, train_loss: 1.9204601049423218, val_loss: 2.887474298477173
Epoch 8, train_loss: 1.4384808540344238, val_loss: 2.3278536796569824
Epoch 9, train_loss: 1.0748509168624878, val_loss: 1.8732044696807861
Epoch 10, train_loss: 0.8238030076026917, val_loss: 1.5306777954101562
Epoch 11, train_loss: 0.669340193271637, val_loss: 1.2942407131195068
Epoch 12, train_loss: 0.5864189267158508, val_loss: 1.1392508745193481
Epoch 13, train_loss: 0.5462559461593628, val_loss: 1.041947603225708
Epoch 14, train_loss: 0.5291838049888611, val_loss: 0.9798373579978943
Epoch 15, train_loss: 0.5221965312957764, val_loss: 0.9406192302703857
Epoch 16, train_loss: 0.5192626714706421, val_loss: 0.9124849438667297
Epoch 17, train_loss: 0.5155234932899475, val_loss: 0.8936429023742676
Epoch 18, train_loss: 0.5117744207382202, val_loss: 0.8807430267333984
Epoch 19, train_loss: 0.5071832537651062, val_loss: 0.8699108362197876
Epoch 20, train_loss: 0.5019437074661255, val_loss: 0.8627417087554932
Epoch 21, train_loss: 0.49624019861221313, val_loss: 0.8580852746963501
Epoch 22, train_loss: 0.4914500415325165, val_loss: 0.8545466661453247
Epoch 23, train_loss: 0.4865557551383972, val_loss: 0.8525440692901611
Epoch 24, train_loss: 0.4807068109512329, val_loss: 0.8509074449539185
Epoch 25, train_loss: 0.47658705711364746, val_loss: 0.849090039730072
Epoch 26, train_loss: 0.4731192886829376, val_loss: 0.8473911285400391
Epoch 27, train_loss: 0.46931734681129456, val_loss: 0.8465709090232849
Epoch 28, train_loss: 0.4657156467437744, val_loss: 0.8458096385002136
Epoch 29, train_loss: 0.4617335796356201, val_loss: 0.8460785746574402
model.load_state_dict(best_weights)
<All keys matched successfully>

Visualize training history

# collect the per-epoch losses into plain numpy arrays for plotting
train_history = torch.stack(train_history).cpu().numpy()
val_history = torch.stack(val_history).cpu().numpy()
lr_history = np.asarray(lr_history)
fig = plt.figure()
plt.plot(train_history, label='train loss')
plt.plot(val_history, label='validation loss')
plt.legend()

[Plot: training and validation loss per epoch]

Visualize the results

# map a list of movie ids back to titles
def get_movie(ids):
    return [movies.loc[i]["title"] for i in ids]
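
For example (movie ids are 0-based after the shift above):

get_movie([0, 1])  # ['Toy Story (1995)', 'GoldenEye (1995)']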

We can try guessing the user ratings on the test set:

with torch.no_grad():
    for features, targets in test_loader:
        target_predicted = model(features[...,0], features[...,1])
        movie_names = get_movie(features[...,1].cpu().numpy())
        loss = criterion(target_predicted, targets)
        print(loss)
        target_predicted = target_predicted.squeeze().cpu().numpy()
        user_idx = features[...,0].cpu().numpy()
        movies_idx = features[...,1].cpu().numpy()
        target_real = targets.squeeze().cpu().numpy()
        pred_df = pd.DataFrame(list(zip(user_idx, movies_idx, movie_names, target_real, target_predicted)),
                               columns=['user', 'movie', 'title', 'rating', 'prediction'])
        break
pred_df.head(20)
tensor(0.7353)
    user  movie                                           title  rating  prediction
0    746    418                             Mary Poppins (1964)     5.0    3.960400
1    777    404                      Mission: Impossible (1996)     3.0    2.279045
2    785    230                           Batman Returns (1992)     2.0    3.065361
3    879    139  Homeward Bound: The Incredible Journey (1993)     4.0    3.125930
4     23    257                                  Contact (1997)     4.0    4.397166
5    303    680                               Wishmaster (1997)     2.0    2.716514
6     59    135            Mr. Smith Goes to Washington (1939)     4.0    4.160026
7    415     14                       Mr. Holland's Opus (1995)     4.0    4.215237
8    739    331                           Kiss the Girls (1997)     3.0    3.314857
9    886    430                               Highlander (1986)     3.0    4.366575
10   933     68                             Forrest Gump (1994)     5.0    3.462778
11    84     41                                   Clerks (1994)     3.0    3.208103
12   881    242                            Jungle2Jungle (1997)     4.0    2.592931
13    12    910                                 Twilight (1998)     2.0    2.822413
14   393    402                                   Batman (1989)     4.0    3.974622
15   404    591                               True Crime (1995)     1.0    1.466990
16   637    509                   Magnificent Seven, The (1954)     3.0    3.588601
17   486    540                            Mortal Kombat (1995)     3.0    2.984062
18   270    413                         My Favorite Year (1982)     4.0    3.552820
19   343    123                                Lone Star (1996)     5.0    4.303177

We can find out which movies are “the best” and “the worst” just by looking at the movie_bias layer weights:

# lowest bias: movies rated poorly regardless of who watches them
movie_bias = model.movie_bias.weight
idxs = movie_bias.argsort(0)[:5].squeeze().cpu().numpy()
get_movie(idxs)
['Children of the Corn: The Gathering (1996)',
 'Lawnmower Man 2: Beyond Cyberspace (1996)',
 'Mortal Kombat: Annihilation (1997)',
 'Jury Duty (1995)',
 'Striptease (1996)']

These movies have IMDb scores of 4.2, 2.5, 3.7, 4.3, and 4.5, respectively. Our model does detect the movies that are not appreciated by film critics.

# highest bias: movies liked regardless of individual taste
idxs = movie_bias.argsort(0, descending=True)[:5].squeeze().cpu().numpy()
get_movie(idxs)
["Schindler's List (1993)",
 'Shawshank Redemption, The (1994)',
 'Casablanca (1942)',
 'Rear Window (1954)',
 'Star Wars (1977)']

Here too, the model assigns a high bias to movies that actually have very high IMDb scores. This validates the ability of our simple model to find “good” and “bad” movies.

Get top k favourite movies

We can get the k movies with the highest learned bias as follows:

# get the top k movies ranked by their learned bias
def get_top_k(k):
  with torch.no_grad():
    all_bias = model.movie_bias(torch.tensor(np.arange(n_movies), device=device))
  idx = all_bias.squeeze().cpu().argsort(descending=True)[:k].numpy()
  return get_movie(idx), idx
k_name, k_idx = get_top_k(40)
k_name
["Schindler's List (1993)",
 'Shawshank Redemption, The (1994)',
 'Casablanca (1942)',
 'Rear Window (1954)',
 'Star Wars (1977)',
 'Close Shave, A (1995)',
 'Usual Suspects, The (1995)',
 'Wrong Trousers, The (1993)',
 'Silence of the Lambs, The (1991)',
 'Good Will Hunting (1997)',
 'Raiders of the Lost Ark (1981)',
 'North by Northwest (1959)',
 "One Flew Over the Cuckoo's Nest (1975)",
 'Vertigo (1958)',
 'Titanic (1997)',
 'L.A. Confidential (1997)',
 'Godfather, The (1972)',
 'Boot, Das (1981)',
 'To Kill a Mockingbird (1962)',
 '12 Angry Men (1957)',
 'Wallace & Gromit: The Best of Aardman Animation (1996)',
 'Manchurian Candidate, The (1962)',
 'As Good As It Gets (1997)',
 'Third Man, The (1949)',
 'Citizen Kane (1941)',
 'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)',
 'Secrets & Lies (1996)',
 "It's a Wonderful Life (1946)",
 'Empire Strikes Back, The (1980)',
 'Fargo (1996)',
 'Henry V (1989)',
 'Blade Runner (1982)',
 'Amadeus (1984)',
 'Braveheart (1995)',
 'Bridge on the River Kwai, The (1957)',
 'Princess Bride, The (1987)',
 'Maltese Falcon, The (1941)',
 'African Queen, The (1951)',
 'Glory (1989)',
 'Wizard of Oz, The (1939)']

Visualize top k movies

We employed a rather large latent space, which cannot be visualized directly. Luckily, we can use PCA to reduce its dimensionality and plot the result:

with torch.no_grad():
  movie_factors = model.movie_factors(torch.tensor(np.arange(n_movies), device=device)).cpu().numpy()
# project the 50-dimensional movie factors down to 3 principal components
pca = PCA(n_components=3)
pca.fit(movie_factors)
reduced_embed = pca.transform(movie_factors)
X = reduced_embed[k_idx][:,0]
Y = reduced_embed[k_idx][:,2]
plt.figure(figsize=(15,15))
plt.scatter(X, Y)
for i, x, y in zip(k_name,X, Y):
    plt.text(x,y,i, color=np.random.rand(3)*0.7, fontsize=11)

[Plot: the top-40 movies scattered in the PCA-reduced latent space]

Make recommendations using the distance between two movies

To compute the distance between two movies, we can take the cosine similarity between their movie_factors (not movie_bias) representations. This defines similarity in terms of how users rated both movies.
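
As a reminder, the cosine similarity of two vectors is their dot product divided by the product of their norms (PyTorch provides this as nn.CosineSimilarity); a minimal numpy sketch:

# cosine similarity: 1.0 for identical directions, -1.0 for opposite ones
def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

cosine_sim(np.array([1., 0.]), np.array([1., 1.]))  # ≈ 0.7071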

# helper function: look up a movie's id from its title
def movie_o2i(s):
  return movies[movies["title"] == s]["movie"].values
# get the movies closest to a given title in the latent space
def get_closest_movie(s1):
  i1 = torch.tensor(movie_o2i(s1))
  movie_factors_torch = torch.tensor(movie_factors)
  distances = nn.CosineSimilarity(dim=1)(movie_factors_torch, movie_factors_torch[i1])
  # the movie itself comes first with similarity 1, so take the top 6
  idx = distances.argsort(descending=True)[:6].numpy()
  print(idx)
  return get_movie(idx)

s1 = 'Snow White and the Seven Dwarfs (1937)' # kids movie
get_closest_movie(s1)
[ 98 403 500 431 841 135]
['Snow White and the Seven Dwarfs (1937)',
 'Pinocchio (1940)',
 'Dumbo (1941)',
 'Fantasia (1940)',
 'Pollyanna (1960)',
 'Mr. Smith Goes to Washington (1939)']


We see that our model makes very good recommendations in this example: the movies recommended for “Snow White” are mostly Disney kids' movies.

s1 = 'Star Wars (1977)' # the first two suggested films are actually Star Wars films
get_closest_movie(s1)
[ 49 171 180 173 108 122]
['Star Wars (1977)',
 'Empire Strikes Back, The (1980)',
 'Return of the Jedi (1983)',
 'Raiders of the Lost Ark (1981)',
 'Mystery Science Theater 3000: The Movie (1996)',
 'Frighteners, The (1996)']


Here too, the model recommends very relevant movies. Indeed, the first two recommendations are also Star Wars movies, the next two are sci-fi movies, and the last one is a thriller.

Conclusion

In this post, we trained a simple deep learning model with PyTorch for collaborative filtering. The model is able to identify good and not-so-good movies, and can make plausible movie recommendations.