Train our first movie recommender

In this tutorial, we build our first recommender system using a simple algorithm called P3alpha.

We will learn

  • How to represent the implicit feedback dataset as a sparse matrix.

  • How to fit irspack’s models using the sparse matrix representation.

  • How to make a recommendation using our API.

[1]:
import numpy as np
import scipy.sparse as sps

from irspack.dataset import MovieLens1MDataManager
from irspack import P3alphaRecommender

Read the MovieLens 1M dataset

We first load the MovieLens 1M dataset. The first time you run this, you will be asked for permission to download the dataset.

[2]:
loader = MovieLens1MDataManager()

df = loader.read_interaction()
df.head()
[2]:
   userId  movieId  rating            timestamp
0       1     1193       5  2000-12-31 22:12:40
1       1      661       3  2000-12-31 22:35:09
2       1      914       3  2000-12-31 22:32:48
3       1     3408       4  2000-12-31 22:04:35
4       1     2355       5  2001-01-06 23:38:11

df stores the users’ watch event history.

Although rating information is available in this dataset, we will not use that column. What matters to an implicit-feedback recommender system is “which user interacted with which item (movie)”.

The loader can also read the dataframe of movie metadata:

[3]:
movies = loader.read_item_info()
movies.head()
[3]:
movieId  title                               genres                        release_year
1        Toy Story (1995)                    Animation|Children's|Comedy   1995
2        Jumanji (1995)                      Adventure|Children's|Fantasy  1995
3        Grumpier Old Men (1995)             Comedy|Romance                1995
4        Waiting to Exhale (1995)            Comedy|Drama                  1995
5        Father of the Bride Part II (1995)  Comedy                        1995

Represent your data as a sparse matrix

We represent the data as a sparse matrix \(X\), whose element \(X_{ui}\) is given by

\[\begin{split}X_{ui} = \begin{cases} 1 & \text{if the user }u\text{ has watched the item (movie) } i \\ 0 & \text{otherwise} \end{cases}\end{split}\]

For this purpose, we use the np.unique function with return_inverse=True. It returns a tuple that consists of

  1. The list of unique user/movie ids appearing in the original user/movie id array

  2. For each element of the original user/movie id array, its position in the array from 1.
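To make the two return values concrete, here is a minimal sketch on a handful of made-up ids (they are not taken from MovieLens). The names raw_ids, unique_ids and inverse are illustrative only.

```python
import numpy as np

# Made-up raw ids, purely for illustration.
raw_ids = np.array([42, 7, 42, 19, 7])

unique_ids, inverse = np.unique(raw_ids, return_inverse=True)

print(unique_ids)  # sorted unique ids: [ 7 19 42]
print(inverse)     # position of each raw id in unique_ids: [2 0 2 1 0]

# Indexing the unique ids with the inverse reproduces the input.
print(unique_ids[inverse])  # [42  7 42 19  7]
```

Note that the unique ids come back sorted, so the resulting indices follow the numeric order of the ids, not their order of first appearance.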

So if we do

[4]:
unique_user_ids, user_index = np.unique(df.userId, return_inverse=True)
unique_movie_ids, movie_index = np.unique(df.movieId, return_inverse=True)

then unique_user_ids[user_index] and unique_movie_ids[movie_index] are equal to the original arrays:

[5]:
assert np.all( unique_user_ids[user_index] == df.userId.values )
assert np.all( unique_movie_ids[movie_index] == df.movieId.values )

Thus, we can think of user_index and movie_index as representing the row and column positions of non-zero elements, respectively.

Now \(X\) can be constructed as a SciPy sparse CSR matrix as follows.

[6]:
X = sps.csr_matrix(
    (
        np.ones(df.shape[0]), # values of non-zero elements
        (
            user_index, # rows of non-zero elements
            movie_index # cols of non-zero elements
        )
    )
)

X
[6]:
<6040x3706 sparse matrix of type '<class 'numpy.float64'>'
        with 1000209 stored elements in Compressed Sparse Row format>

This pattern comes up so often that irspack provides the df_to_sparse function for it:

[7]:
from irspack import df_to_sparse
X_, unique_user_ids_, unique_item_ids_ = df_to_sparse(df, 'userId', 'movieId')

# X_ is identical to X.
assert (X_ - X).getnnz() == 0
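One detail of the manual csr_matrix construction is worth keeping in mind for your own data: when the same (row, column) pair appears more than once, SciPy sums the values, so the entry is no longer 0 or 1. The sketch below uses a made-up log in which one pair is repeated, and shows one possible way to get back to a binary matrix. The names X_toy and X_binary are illustrative only.

```python
import numpy as np
import scipy.sparse as sps

# Made-up interaction log: user 0 watched item 1 twice.
rows = np.array([0, 0, 1, 2])
cols = np.array([1, 1, 0, 2])

X_toy = sps.csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(3, 3))
print(X_toy.toarray())  # entry (0, 1) is 2.0 because duplicates are summed

# Overwrite every stored value with 1 to recover a 0/1 matrix.
X_binary = X_toy.copy()
X_binary.data[:] = 1.0
print(X_binary.toarray())
```

Whether repeated interactions should be collapsed like this, or kept as counts, depends on the model you intend to fit.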

Fit the recommender

We fit P3alphaRecommender against X.

[8]:
recommender = P3alphaRecommender(X)
recommender.learn()
[8]:
<irspack.recommenders.p3.P3alphaRecommender at 0x7f61e7e679d0>

Check the recommender’s output

Suppose there is a new user who has just watched “Toy Story”. Let us see what would be recommended for this user.

We first represent the user’s watch profile as another sparse matrix (which contains a single non-zero element).

[9]:
movie_id_vs_movie_index = { mid: i for i, mid in enumerate(unique_movie_ids)}

toystory_id = 1
toystory_watcher_matrix = sps.csr_matrix(
    ([1], ([0], [movie_id_vs_movie_index[toystory_id]])),
    shape=(1, len(unique_movie_ids)) # this time shape parameter is required
)

movies.loc[toystory_id]
[9]:
title                      Toy Story (1995)
genres          Animation|Children's|Comedy
release_year                           1995
Name: 1, dtype: object

Since this user is new (previously unseen) to the recommender, we use the get_score_cold_user_remove_seen method.

remove_seen means that we mask the scores of the items that the user has already watched (in this case, Toy Story) so that they are not recommended again.

As you can see, the score corresponding to “Toy Story” is \(-\infty\).

[10]:
score = recommender.get_score_cold_user_remove_seen(
    toystory_watcher_matrix
)

# Id 1 (index 0) is masked (has -infinity score)
score
[10]:
array([[          -inf, 8.18606963e-04, 4.30083199e-04, ...,
        4.30589311e-05, 1.09994485e-05, 2.71571993e-04]])
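The masking step can be pictured with plain NumPy. The following is only a sketch of the idea on made-up scores, and it is not a description of how irspack implements it internally. The names raw_score, profile and masked_score are illustrative.

```python
import numpy as np
import scipy.sparse as sps

# Made-up scores for one user over four items.
raw_score = np.array([[0.9, 0.2, 0.5, 0.1]])

# The user's profile: item index 0 has already been watched.
profile = sps.csr_matrix(([1.0], ([0], [0])), shape=(1, 4))

masked_score = raw_score.copy()
# nonzero() returns the (row, column) positions of the stored entries,
# which we use to overwrite the scores of already-seen items.
masked_score[profile.nonzero()] = -np.inf
print(masked_score)
```

Because \(-\infty\) is smaller than any finite score, a masked item can never enter the top of a descending ranking.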

To get the recommendation, we argsort the score in descending order and convert each “movie index” (which starts from 0) to a “movie id”.

[11]:
recommended_movie_index = score[0].argsort()[::-1][:10]
recommended_movie_ids = unique_movie_ids[recommended_movie_index]

# Top-10 recommendations
recommended_movie_ids
[11]:
array([2858, 1265, 2396, 3114,  260, 1210, 1196, 1270, 2028,   34])
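Sorting the whole score vector is fine for a few thousand items. For much larger catalogues, one common alternative is to select the top k with np.argpartition and then sort only those k entries. The sketch below, on a made-up score vector, checks that both approaches agree. The names toy_score, top_k_sorted and top_k_partial are illustrative.

```python
import numpy as np

# Made-up score vector; index 0 plays the role of a masked (seen) item.
toy_score = np.array([-np.inf, 0.3, 0.9, 0.1, 0.7, 0.5])
k = 3

# Full sort, as in the cell above.
top_k_sorted = toy_score.argsort()[::-1][:k]

# Partial selection: find the k largest scores in arbitrary order,
# then sort only those k candidates by descending score.
candidates = np.argpartition(-toy_score, k - 1)[:k]
top_k_partial = candidates[np.argsort(-toy_score[candidates])]

print(top_k_sorted)   # [2 4 5]
print(top_k_partial)  # [2 4 5]
```

When several items share exactly the same score, the two approaches may order the tied items differently.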

And here are the titles of the ten recommended movies.

[12]:
movies.reindex(recommended_movie_ids)
[12]:
movieId  title                                              genres                               release_year
2858     American Beauty (1999)                             Comedy|Drama                         1999
1265     Groundhog Day (1993)                               Comedy|Romance                       1993
2396     Shakespeare in Love (1998)                         Comedy|Romance                       1998
3114     Toy Story 2 (1999)                                 Animation|Children's|Comedy          1999
260      Star Wars: Episode IV - A New Hope (1977)          Action|Adventure|Fantasy|Sci-Fi      1977
1210     Star Wars: Episode VI - Return of the Jedi (1983)  Action|Adventure|Romance|Sci-Fi|War  1983
1196     Star Wars: Episode V - The Empire Strikes Back...  Action|Adventure|Drama|Sci-Fi|War    1980
1270     Back to the Future (1985)                          Comedy|Sci-Fi                        1985
2028     Saving Private Ryan (1998)                         Action|Drama|War                     1998
34       Babe (1995)                                        Children's|Comedy|Drama              1995

The above pattern (mapping item IDs to indexes, creating sparse matrices, and converting the indexes of recommended items back to item IDs) is quite common, so we have also created a convenience class that handles the item index/ID mapping:

[13]:
from irspack.utils.id_mapping import ItemIDMapper

id_mapper = ItemIDMapper(
    item_ids=unique_movie_ids
)
id_and_scores = id_mapper.recommend_for_new_user(
    recommender,
    [toystory_id], cutoff = 10
)
movies.reindex(
    [ item_id for item_id, score in id_and_scores ]
)
[13]:
movieId  title                                              genres                               release_year
2858     American Beauty (1999)                             Comedy|Drama                         1999
1265     Groundhog Day (1993)                               Comedy|Romance                       1993
2396     Shakespeare in Love (1998)                         Comedy|Romance                       1998
3114     Toy Story 2 (1999)                                 Animation|Children's|Comedy          1999
260      Star Wars: Episode IV - A New Hope (1977)          Action|Adventure|Fantasy|Sci-Fi      1977
1210     Star Wars: Episode VI - Return of the Jedi (1983)  Action|Adventure|Romance|Sci-Fi|War  1983
1196     Star Wars: Episode V - The Empire Strikes Back...  Action|Adventure|Drama|Sci-Fi|War    1980
1270     Back to the Future (1985)                          Comedy|Sci-Fi                        1985
2028     Saving Private Ryan (1998)                         Action|Drama|War                     1998
34       Babe (1995)                                        Children's|Comedy|Drama              1995
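To see what such a mapping helper has to do, here is a minimal sketch of a hypothetical class, ToyIDMapper. It is written only for illustration and is not irspack's ItemIDMapper; its method names and return format are assumptions. It bundles the three manual steps from the earlier cells: ID to index lookup, building the one-row profile matrix, and converting top-scoring indexes back to IDs.

```python
import numpy as np
import scipy.sparse as sps


class ToyIDMapper:
    """Hypothetical helper for illustration; not irspack's ItemIDMapper."""

    def __init__(self, item_ids):
        self.item_ids = np.asarray(item_ids)
        self.id_to_index = {
            item_id: i for i, item_id in enumerate(self.item_ids.tolist())
        }

    def profile_to_matrix(self, watched_ids):
        """Encode a list of watched item ids as a 1 x n_items sparse row."""
        cols = [self.id_to_index[item_id] for item_id in watched_ids]
        rows = [0] * len(cols)
        data = [1.0] * len(cols)
        return sps.csr_matrix((data, (rows, cols)), shape=(1, len(self.item_ids)))

    def top_ids(self, score_row, cutoff):
        """Return (item id, score) pairs for the `cutoff` highest scores."""
        top_index = np.argsort(score_row)[::-1][:cutoff]
        return [(self.item_ids[i].item(), float(score_row[i])) for i in top_index]


mapper = ToyIDMapper([10, 20, 30, 40])          # made-up item ids
profile = mapper.profile_to_matrix([20])        # this user watched item 20
toy_score = np.array([0.4, -np.inf, 0.9, 0.1])  # made-up scores, item 20 masked
print(mapper.top_ids(toy_score, cutoff=2))      # [(30, 0.9), (10, 0.4)]
```

In a real pipeline, the score row would come from the recommender (for example, from a call like the one in cell [10]) and not from a hand-written array.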

While the MovieLens recommendations above might make sense, they are not optimal. To get better results, we have to tune the recommender’s hyperparameters against some accuracy metric measured on a validation set.

In the next tutorial, we will see how to define the hold-out set and the validation score.
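As a rough preview of what “hold-out” means, the sketch below splits a made-up interaction log into a training matrix and a validation matrix by randomly holding out some interactions. This is just one simple scheme under our own assumptions (the 25% ratio, the random per-interaction split, and the helper to_matrix are all illustrative) and is not necessarily the procedure the next tutorial uses.

```python
import numpy as np
import scipy.sparse as sps

rng = np.random.default_rng(0)

# Made-up interaction pairs (user index, item index).
user_index = np.array([0, 0, 0, 1, 1, 2, 2, 2])
item_index = np.array([0, 1, 2, 1, 3, 0, 2, 3])
shape = (3, 4)

# Mark roughly a quarter of the interactions as held out for validation.
is_validation = rng.random(len(user_index)) < 0.25


def to_matrix(mask):
    """Build a 0/1 sparse matrix from the interactions selected by `mask`."""
    return sps.csr_matrix(
        (np.ones(int(mask.sum())), (user_index[mask], item_index[mask])),
        shape=shape,
    )


X_train = to_matrix(~is_validation)  # used to fit the recommender
X_valid = to_matrix(is_validation)   # used only to measure accuracy

print(X_train.nnz, X_valid.nnz)
```

A model fitted on X_train can then be judged by how highly it ranks the held-out items in X_valid.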