Train our first movie recommender
In this tutorial, we build our first recommender system using a simple algorithm called P3alpha.
We will learn:
How to represent the implicit feedback dataset as a sparse matrix.
How to fit irspack’s models using the sparse matrix representation.
How to make a recommendation using our API.
[1]:
import numpy as np
import scipy.sparse as sps
from irspack.dataset import MovieLens1MDataManager
from irspack import P3alphaRecommender
Read Movielens 1M dataset
We first load the MovieLens 1M dataset. The first time you run this, you will be asked for permission to download the dataset.
[2]:
loader = MovieLens1MDataManager()
df = loader.read_interaction()
df.head()
[2]:
| userId | movieId | rating | timestamp |
---|---|---|---|---|
0 | 1 | 1193 | 5 | 2000-12-31 22:12:40 |
1 | 1 | 661 | 3 | 2000-12-31 22:35:09 |
2 | 1 | 914 | 3 | 2000-12-31 22:32:48 |
3 | 1 | 3408 | 4 | 2000-12-31 22:04:35 |
4 | 1 | 2355 | 5 | 2001-01-06 23:38:11 |
df stores the users’ watch event history.
Although the rating information is available in this case, we will not use this column. What matters to an implicit-feedback recommender system is “which user interacted with which item (movie)”.
Using loader, we can also read the dataframe of movie metadata:
[3]:
movies = loader.read_item_info()
movies.head()
[3]:
movieId | title | genres | release_year |
---|---|---|---|
1 | Toy Story (1995) | Animation|Children's|Comedy | 1995 |
2 | Jumanji (1995) | Adventure|Children's|Fantasy | 1995 |
3 | Grumpier Old Men (1995) | Comedy|Romance | 1995 |
4 | Waiting to Exhale (1995) | Comedy|Drama | 1995 |
5 | Father of the Bride Part II (1995) | Comedy | 1995 |
Represent your data as a sparse matrix
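We represent the data as a sparse matrix \(X\), whose element \(X_{ui}\) is given by
\[
X_{ui} = \begin{cases} 1 & \text{if user } u \text{ has watched movie } i \\ 0 & \text{otherwise} \end{cases}
\]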
For this purpose, we use the np.unique function with return_inverse=True. This returns a tuple that consists of
1. The sorted array of unique user/movie ids appearing in the original user/movie id array
2. For each element of the original user/movie id array, its position in the array from item 1.
So if we do
[4]:
unique_user_ids, user_index = np.unique(df.userId, return_inverse=True)
unique_movie_ids, movie_index = np.unique(df.movieId, return_inverse=True)
then unique_user_ids[user_index] and unique_movie_ids[movie_index] are equal to the original arrays:
[5]:
assert np.all( unique_user_ids[user_index] == df.userId.values )
assert np.all( unique_movie_ids[movie_index] == df.movieId.values )
Thus, we can think of user_index and movie_index as representing the row and column positions of non-zero elements, respectively.
Now \(X\) can be constructed as a SciPy sparse CSR matrix as follows.
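To make the role of return_inverse concrete, here is a toy illustration on a hand-made id array. The ids below are made up for the example and are not taken from MovieLens.

```python
import numpy as np

# Hypothetical raw ids, in the order they appear in an interaction log.
raw_ids = np.array([10, 30, 10, 20])

unique_ids, index = np.unique(raw_ids, return_inverse=True)

print(unique_ids)  # [10 20 30]  (sorted unique ids)
print(index)       # [0 2 0 1]   (position of each raw id in unique_ids)

# Indexing the unique ids with the inverse index restores the input.
assert np.array_equal(unique_ids[index], raw_ids)
```

The inverse index is a contiguous, zero-based relabelling of the raw ids, which is the property needed for matrix row and column positions.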
[6]:
X = sps.csr_matrix(
    (
        np.ones(df.shape[0]), # values of non-zero elements
        (
            user_index, # rows of non-zero elements
            movie_index # cols of non-zero elements
        )
    )
)
X
[6]:
<6040x3706 sparse matrix of type '<class 'numpy.float64'>'
with 1000209 stored elements in Compressed Sparse Row format>
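One detail to keep in mind when building the matrix this way: to our understanding, SciPy sums the values of duplicate (row, column) pairs when it builds a CSR matrix from coordinates. If an interaction log contained the same user-item pair more than once, the corresponding entry would be larger than 1. The sketch below uses made-up toy indices to show the effect and one possible way to binarize the result. Whether binarizing is appropriate depends on your data.

```python
import numpy as np
import scipy.sparse as sps

# Toy coordinates: the pair (row 0, col 2) appears twice.
rows = np.array([0, 0, 1, 0])
cols = np.array([0, 2, 1, 2])

X_toy = sps.csr_matrix((np.ones(len(rows)), (rows, cols)))

# The shape is inferred from the largest row/column index.
print(X_toy.shape)   # (2, 3)
# The duplicated pair has been summed into a single stored element.
print(X_toy[0, 2])   # 2.0

# One way to get back to a 0/1 matrix: overwrite the stored values.
X_bin = X_toy.copy()
X_bin.data[:] = 1.0
print(X_bin[0, 2])   # 1.0
```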
This pattern comes up so often that irspack provides a df_to_sparse function for it:
[7]:
from irspack import df_to_sparse
X_, unique_user_ids_, unique_item_ids_ = df_to_sparse(df, 'userId', 'movieId')
# X_ is identical to X.
assert (X_ - X).getnnz() == 0
Fit the recommender.
We fit P3alphaRecommender against X.
[8]:
recommender = P3alphaRecommender(X)
recommender.learn()
[8]:
<irspack.recommenders.p3.P3alphaRecommender at 0x7f61e7e679d0>
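For intuition about what the fitted model represents, the following is a dense NumPy sketch of the textbook P3alpha idea: a random walk item → user → item on the interaction graph, with the transition probabilities raised to the power alpha. This is only an illustration under our own assumptions, not irspack's implementation, which works on sparse matrices and may apply further options. The function name p3alpha_item_weights and the toy matrix are hypothetical.

```python
import numpy as np

def p3alpha_item_weights(X, alpha=1.0):
    """Item-to-item weights from a two-step random walk (illustrative only)."""
    X = np.asarray(X, dtype=float)
    eps = 1e-12  # avoids division by zero for empty rows
    # user -> item transition probabilities (each row sums to 1)
    P_ui = X / np.maximum(X.sum(axis=1, keepdims=True), eps)
    # item -> user transition probabilities (each row sums to 1)
    P_iu = X.T / np.maximum(X.T.sum(axis=1, keepdims=True), eps)
    # Raise each probability to the power alpha, then chain the two steps.
    return (P_iu ** alpha) @ (P_ui ** alpha)

# Toy data: 3 users x 3 items. Everyone watched item 0,
# users 0 and 2 also watched item 1, and user 1 also watched item 2.
X_toy = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
])
W = p3alpha_item_weights(X_toy, alpha=1.0)

# A new user who watched only item 0 gets the scores profile @ W.
profile = np.array([[1.0, 0.0, 0.0]])
toy_score = profile @ W

# Item 1 outranks item 2: two of the three item-0 watchers also watched it.
assert toy_score[0, 1] > toy_score[0, 2]
```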
Check the recommender’s output
Suppose there is a new user who has just watched “Toy Story”. Let us see what would be recommended for this user.
We first represent the user’s watch profile as another sparse matrix (which contains a single non-zero element).
[9]:
movie_id_vs_movie_index = { mid: i for i, mid in enumerate(unique_movie_ids)}
toystory_id = 1
toystory_watcher_matrix = sps.csr_matrix(
([1], ([0], [movie_id_vs_movie_index[toystory_id]])),
shape=(1, len(unique_movie_ids)) # this time shape parameter is required
)
movies.loc[toystory_id]
[9]:
title Toy Story (1995)
genres Animation|Children's|Comedy
release_year 1995
Name: 1, dtype: object
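The same construction extends to a user who has watched several movies. The sketch below wraps it in a small helper. The name make_profile and the toy id array are hypothetical and are not part of irspack.

```python
import numpy as np
import scipy.sparse as sps

def make_profile(watched_ids, all_item_ids):
    """Build a 1 x n_items CSR row for one user (hypothetical helper)."""
    item_list = np.asarray(all_item_ids).tolist()
    id_to_index = {item_id: i for i, item_id in enumerate(item_list)}
    cols = np.array([id_to_index[item_id] for item_id in watched_ids])
    rows = np.zeros(len(cols), dtype=int)  # a single user, so always row 0
    vals = np.ones(len(cols))
    # The shape is given explicitly so that unwatched trailing items are kept.
    return sps.csr_matrix((vals, (rows, cols)), shape=(1, len(item_list)))

# Toy catalogue of four item ids; the user watched ids 1 and 8.
toy_item_ids = np.array([1, 2, 5, 8])
toy_profile = make_profile([1, 8], toy_item_ids)
print(toy_profile.toarray())  # [[1. 0. 0. 1.]]
```

An id that is not in the catalogue raises a KeyError here, which is usually preferable to silently dropping it.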
Since this user is new (previously unseen) to the recommender, we use the get_score_cold_user_remove_seen method.
remove_seen means that we mask the scores of the items that the user has already watched (in this case, Toy Story) so that such items are not recommended again.
As you can see, the score corresponding to “Toy Story” is \(-\infty\).
[10]:
score = recommender.get_score_cold_user_remove_seen(
toystory_watcher_matrix
)
# Id 1 (index 0) is masked (has -infinity score)
score
[10]:
array([[ -inf, 8.18606963e-04, 4.30083199e-04, ...,
4.30589311e-05, 1.09994485e-05, 2.71571993e-04]])
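The masking step itself can be pictured with plain NumPy. The snippet below is a minimal sketch of the idea on made-up scores. It is not how irspack implements remove_seen internally.

```python
import numpy as np

# Hypothetical raw scores for one user over three items.
raw_score = np.array([[0.9, 0.2, 0.5]])
# The user's watch profile: only item 0 has been seen.
seen = np.array([[1, 0, 0]])

# Replace the score of every seen item with -infinity.
masked = np.where(seen > 0, -np.inf, raw_score)

# Item 0 had the highest raw score, but after masking it can never win.
assert np.isneginf(masked[0, 0])
assert masked[0].argmax() == 2
```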
To get the recommendation, we argsort the score in descending order and convert “movie index” (which starts from 0) to “movie id”.
[11]:
recommended_movie_index = score[0].argsort()[::-1][:10]
recommended_movie_ids = unique_movie_ids[recommended_movie_index]
# Top-10 recommendations
recommended_movie_ids
[11]:
array([2858, 1265, 2396, 3114, 260, 1210, 1196, 1270, 2028, 34])
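A full argsort is fine for a few thousand items. For much larger catalogues, one possible alternative is np.argpartition, which selects the top k entries without sorting the whole array. The helper below is a sketch under that assumption. The name top_k_ids and the toy arrays are hypothetical.

```python
import numpy as np

def top_k_ids(score_row, item_ids, k):
    """Return the ids of the k highest-scoring items, best first."""
    k = min(k, len(score_row))
    # Select the k largest scores (in arbitrary order) without a full sort.
    candidates = np.argpartition(-score_row, k - 1)[:k]
    # Sort only those k candidates by descending score.
    ordered = candidates[np.argsort(-score_row[candidates])]
    return item_ids[ordered]

toy_score = np.array([-np.inf, 0.3, 0.1, 0.7])  # index 0 is a masked item
toy_ids = np.array([1, 5, 9, 12])

print(top_k_ids(toy_score, toy_ids, 2))  # [12  5]
```

When several items have exactly the same score, the order among them may differ from the argsort-based version.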
And here are the titles of the recommendations.
[12]:
movies.reindex(recommended_movie_ids)
[12]:
movieId | title | genres | release_year |
---|---|---|---|
2858 | American Beauty (1999) | Comedy|Drama | 1999 |
1265 | Groundhog Day (1993) | Comedy|Romance | 1993 |
2396 | Shakespeare in Love (1998) | Comedy|Romance | 1998 |
3114 | Toy Story 2 (1999) | Animation|Children's|Comedy | 1999 |
260 | Star Wars: Episode IV - A New Hope (1977) | Action|Adventure|Fantasy|Sci-Fi | 1977 |
1210 | Star Wars: Episode VI - Return of the Jedi (1983) | Action|Adventure|Romance|Sci-Fi|War | 1983 |
1196 | Star Wars: Episode V - The Empire Strikes Back... | Action|Adventure|Drama|Sci-Fi|War | 1980 |
1270 | Back to the Future (1985) | Comedy|Sci-Fi | 1985 |
2028 | Saving Private Ryan (1998) | Action|Drama|War | 1998 |
34 | Babe (1995) | Children's|Comedy|Drama | 1995 |
The above pattern (mapping item IDs to indexes, creating sparse matrices, and converting the indexes of recommended items back to item IDs) is quite common, so we have also created a convenience class that handles the item index/ID mapping:
[13]:
from irspack.utils.id_mapping import ItemIDMapper
id_mapper = ItemIDMapper(
item_ids=unique_movie_ids
)
id_and_scores = id_mapper.recommend_for_new_user(
recommender,
[toystory_id], cutoff = 10
)
movies.reindex(
[ item_id for item_id, score in id_and_scores ]
)
[13]:
movieId | title | genres | release_year |
---|---|---|---|
2858 | American Beauty (1999) | Comedy|Drama | 1999 |
1265 | Groundhog Day (1993) | Comedy|Romance | 1993 |
2396 | Shakespeare in Love (1998) | Comedy|Romance | 1998 |
3114 | Toy Story 2 (1999) | Animation|Children's|Comedy | 1999 |
260 | Star Wars: Episode IV - A New Hope (1977) | Action|Adventure|Fantasy|Sci-Fi | 1977 |
1210 | Star Wars: Episode VI - Return of the Jedi (1983) | Action|Adventure|Romance|Sci-Fi|War | 1983 |
1196 | Star Wars: Episode V - The Empire Strikes Back... | Action|Adventure|Drama|Sci-Fi|War | 1980 |
1270 | Back to the Future (1985) | Comedy|Sci-Fi | 1985 |
2028 | Saving Private Ryan (1998) | Action|Drama|War | 1998 |
34 | Babe (1995) | Children's|Comedy|Drama | 1995 |
While the above result might make sense, it is not optimal. To get better results, we have to tune the recommender’s hyperparameters against some accuracy metric measured on a validation set.
In the next tutorial, we will see how to define the hold-out and validation score.