Train our first movie recommender

In this tutorial, we build our first recommender system using a simple algorithm called P3alpha.

We will learn:

How to represent the implicit feedback dataset as a sparse matrix.
How to fit irspack’s models using the sparse matrix representation.
How to make a recommendation using our API.

[1]:

import numpy as np
import scipy.sparse as sps

from irspack.dataset import MovieLens1MDataManager
from irspack import P3alphaRecommender

Read the MovieLens 1M dataset

We first load the MovieLens 1M dataset. The first time you run this, you will be asked for permission to download the dataset.

[2]:

loader = MovieLens1MDataManager()

df = loader.read_interaction()
df.head()

[2]:

	userId	movieId	rating	timestamp
0	1	1193	5	2000-12-31 22:12:40
1	1	661	3	2000-12-31 22:35:09
2	1	914	3	2000-12-31 22:32:48
3	1	3408	4	2000-12-31 22:04:35
4	1	2355	5	2001-01-06 23:38:11

df stores the users’ watch event history.

Although rating information is available in this case, we will not use this column. What matters to an implicit-feedback-based recommender system is “which user interacted with which item (movie)”.

Using loader, we can also read the dataframe of movie metadata:

[3]:

movies = loader.read_item_info()
movies.head()

[3]:

	title	genres	release_year
movieId
1	Toy Story (1995)	Animation|Children's|Comedy	1995
2	Jumanji (1995)	Adventure|Children's|Fantasy	1995
3	Grumpier Old Men (1995)	Comedy|Romance	1995
4	Waiting to Exhale (1995)	Comedy|Drama	1995
5	Father of the Bride Part II (1995)	Comedy	1995

Represent your data as a sparse matrix

We represent the data as a sparse matrix \(X\), whose element \(X_{ui}\) is given by

\[\begin{split}X_{ui} = \begin{cases} 1 & \text{if the user }u\text{ has watched the item (movie) } i \\ 0 & \text{otherwise} \end{cases}\end{split}\]

For this purpose, we use the np.unique function with return_inverse=True. This returns a tuple consisting of:

The list of unique user/movie ids appearing in the original user/movie id array
The position of each element of the original user/movie id array within that array of unique ids.
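To make the two return values concrete, here is a minimal sketch on a made-up id array (the values in raw_ids are purely illustrative and stand in for a column such as df.userId):

```python
import numpy as np

# Toy id array with repeats and gaps, standing in for a real id column
raw_ids = np.array([10, 42, 10, 7, 42, 42])

unique_ids, index = np.unique(raw_ids, return_inverse=True)

print(unique_ids)  # sorted unique ids: [ 7 10 42]
print(index)       # position of each raw id in unique_ids: [1 2 1 0 2 2]

# Indexing the unique ids with the inverse index recovers the input
print(unique_ids[index])  # [10 42 10  7 42 42]
```

Note that the returned index is contiguous and starts from 0 even though the raw ids are not, which is what makes it usable as a matrix row or column position.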

So if we do

[4]:

unique_user_ids, user_index = np.unique(df.userId, return_inverse=True)
unique_movie_ids, movie_index = np.unique(df.movieId, return_inverse=True)

then unique_user_ids[user_index] and unique_movie_ids[movie_index] are equal to the original arrays:

[5]:

assert np.all( unique_user_ids[user_index] == df.userId.values )
assert np.all( unique_movie_ids[movie_index] == df.movieId.values )

Thus, we can think of user_index and movie_index as representing the row and column positions of non-zero elements, respectively.

Now \(X\) can be constructed as a SciPy CSR sparse matrix as follows.

[6]:

X = sps.csr_matrix(
    (
        np.ones(df.shape[0]), # values of non-zero elements
        (
            user_index, # rows of non-zero elements
            movie_index # cols of non-zero elements
        )
    )
)

X

[6]:

<6040x3706 sparse matrix of type '<class 'numpy.float64'>'
        with 1000209 stored elements in Compressed Sparse Row format>
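The same construction can be inspected on a tiny example where the dense form is small enough to print. This is only a sketch with made-up interactions. It also illustrates a SciPy behavior worth keeping in mind: when the same (row, col) pair appears more than once, the values are summed, so a log with repeated events would need to be deduplicated or binarized to match the 0/1 definition of \(X\).

```python
import numpy as np
import scipy.sparse as sps

# Toy interaction log: 3 users, 4 items, one (row, col) pair per watch event
rows = np.array([0, 0, 1, 2, 2])
cols = np.array([0, 3, 1, 0, 2])

X_toy = sps.csr_matrix((np.ones(len(rows)), (rows, cols)))

print(X_toy.shape)  # inferred from the largest row/col index: (3, 4)
print(X_toy.toarray())

# Duplicate (row, col) pairs are summed rather than overwritten
dup = sps.csr_matrix((np.ones(2), ([0, 0], [1, 1])), shape=(1, 2))
print(dup.toarray())  # [[0. 2.]]
```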

This pattern comes up so often that irspack provides the df_to_sparse function:

[7]:

from irspack import df_to_sparse
X_, unique_user_ids_, unique_item_ids_ = df_to_sparse(df, 'userId', 'movieId')

# X_ is identical to X.
assert (X_ - X).getnnz() == 0

Fit the recommender

We fit P3alphaRecommender against X.

[8]:

recommender = P3alphaRecommender(X)
recommender.learn()

[8]:

<irspack.recommenders.p3.P3alphaRecommender at 0x7f61e7e679d0>
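For intuition about what is being fitted, the following is a rough dense NumPy sketch of the random-walk idea usually associated with P3alpha: item-to-item weights are obtained by walking item → user → item, with each transition probability raised to a power alpha. It is an illustration under simplifying assumptions (tiny made-up data, dense arrays, every user and item has at least one interaction) and is not irspack's implementation, which may differ in details and options.

```python
import numpy as np

# Toy binary interaction matrix: 4 users x 3 items (made-up data)
X_toy = np.array(
    [
        [1, 1, 0],
        [0, 1, 1],
        [1, 0, 1],
        [1, 1, 0],
    ],
    dtype=float,
)
alpha = 1.0  # exponent applied to each transition probability

# user -> item transition probabilities (each row sums to 1)
P_ui = X_toy / X_toy.sum(axis=1, keepdims=True)
# item -> user transition probabilities (each row sums to 1)
P_iu = X_toy.T / X_toy.T.sum(axis=1, keepdims=True)

# item -> user -> item weights
W = (P_iu ** alpha) @ (P_ui ** alpha)

# Score every item for every user by propagating their history through W
score_toy = X_toy @ W
print(W.round(3))
print(score_toy.round(3))
```

With alpha equal to 1, each row of W is a probability distribution over items. Other values of alpha reweight the transitions, which is one of the hyperparameters one would tune.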

Check the recommender’s output

Suppose there is a new user who has just watched “Toy Story”. Let us see what would be recommended to this user.

We first represent the user’s watch profile as another sparse matrix (which contains a single non-zero element).

[9]:

movie_id_vs_movie_index = { mid: i for i, mid in enumerate(unique_movie_ids)}

toystory_id = 1
toystory_watcher_matrix = sps.csr_matrix(
    ([1], ([0], [movie_id_vs_movie_index[toystory_id]])),
    shape=(1, len(unique_movie_ids)) # this time the shape parameter is required
)

movies.loc[toystory_id]

[9]:

title                      Toy Story (1995)
genres          Animation|Children's|Comedy
release_year                           1995
Name: 1, dtype: object

Since this user is new (previously unseen) to the recommender, we use the get_score_cold_user_remove_seen method.

remove_seen means that we mask the scores of the items the user has already watched (in this case, Toy Story) so that they are not recommended again.

As you can see, the score for “Toy Story” is \(-\infty\).

[10]:

score = recommender.get_score_cold_user_remove_seen(
    toystory_watcher_matrix
)

# Id 1 (index 0) is masked (has a score of -infinity)
score

[10]:

array([[          -inf, 8.18606963e-04, 4.30083199e-04, ...,
        4.30589311e-05, 1.09994485e-05, 2.71571993e-04]])
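The effect of masking seen items can be reproduced with plain NumPy. The sketch below uses made-up scores and a made-up profile. It illustrates the idea only and is not how irspack implements it internally.

```python
import numpy as np
import scipy.sparse as sps

# Hypothetical raw scores for one user over 5 items
raw_score = np.array([[0.9, 0.1, 0.4, 0.3, 0.2]])

# The user's profile: items 0 and 3 have already been watched
profile = sps.csr_matrix(([1.0, 1.0], ([0, 0], [0, 3])), shape=(1, 5))

# Overwrite the scores at the profile's non-zero positions with -inf
masked = raw_score.copy()
masked[profile.nonzero()] = -np.inf

print(masked)
```

Because \(-\infty\) compares lower than any finite score, masked items can never reach the top of a descending sort.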

To get the recommendation, we argsort the score in descending order and convert each “movie index” (which starts from 0) to a “movie id”.

[11]:

recommended_movie_index = score[0].argsort()[::-1][:10]
recommended_movie_ids = unique_movie_ids[recommended_movie_index]

# Top-10 recommendations
recommended_movie_ids

[11]:

array([2858, 1265, 2396, 3114,  260, 1210, 1196, 1270, 2028,   34])
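The index-to-id conversion is easier to follow on a small example. The ids and scores below are made up for illustration. The second half shows np.argpartition as one possible alternative that avoids fully sorting a large score vector; the tutorial itself uses the plain argsort.

```python
import numpy as np

item_ids = np.array([101, 205, 309, 412, 518])   # hypothetical item ids
score = np.array([-np.inf, 0.1, 0.4, 0.3, 0.2])  # index 0 is a masked item
k = 3

# Full sort: ascending argsort, reversed, then cut to the top k
top_index = score.argsort()[::-1][:k]
top_ids = item_ids[top_index]
print(top_index)  # [2 3 4]
print(top_ids)    # [309 412 518]

# Alternative: pick the k best candidates first, then sort only those
candidates = np.argpartition(-score, k)[:k]
candidates = candidates[np.argsort(-score[candidates])]
print(item_ids[candidates])  # [309 412 518]
```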

And here are the titles of the recommendations.

[12]:

movies.reindex(recommended_movie_ids)

[12]:

	title	genres	release_year
movieId
2858	American Beauty (1999)	Comedy|Drama	1999
1265	Groundhog Day (1993)	Comedy|Romance	1993
2396	Shakespeare in Love (1998)	Comedy|Romance	1998
3114	Toy Story 2 (1999)	Animation|Children's|Comedy	1999
260	Star Wars: Episode IV - A New Hope (1977)	Action|Adventure|Fantasy|Sci-Fi	1977
1210	Star Wars: Episode VI - Return of the Jedi (1983)	Action|Adventure|Romance|Sci-Fi|War	1983
1196	Star Wars: Episode V - The Empire Strikes Back...	Action|Adventure|Drama|Sci-Fi|War	1980
1270	Back to the Future (1985)	Comedy|Sci-Fi	1985
2028	Saving Private Ryan (1998)	Action|Drama|War	1998
34	Babe (1995)	Children's|Comedy|Drama	1995

The above pattern (mapping item IDs to indexes, creating sparse matrices, and converting the indexes of recommended items back to item IDs) is quite common, so irspack also provides a convenience class that handles the item index/ID mapping:

[13]:

from irspack.utils.id_mapping import ItemIDMapper

id_mapper = ItemIDMapper(
    item_ids=unique_movie_ids
)
id_and_scores = id_mapper.recommend_for_new_user(
    recommender,
    [toystory_id], cutoff = 10
)
movies.reindex(
    [ item_id for item_id, score in id_and_scores ]
)

[13]:

	title	genres	release_year
movieId
2858	American Beauty (1999)	Comedy|Drama	1999
1265	Groundhog Day (1993)	Comedy|Romance	1993
2396	Shakespeare in Love (1998)	Comedy|Romance	1998
3114	Toy Story 2 (1999)	Animation|Children's|Comedy	1999
260	Star Wars: Episode IV - A New Hope (1977)	Action|Adventure|Fantasy|Sci-Fi	1977
1210	Star Wars: Episode VI - Return of the Jedi (1983)	Action|Adventure|Romance|Sci-Fi|War	1983
1196	Star Wars: Episode V - The Empire Strikes Back...	Action|Adventure|Drama|Sci-Fi|War	1980
1270	Back to the Future (1985)	Comedy|Sci-Fi	1985
2028	Saving Private Ryan (1998)	Action|Drama|War	1998
34	Babe (1995)	Children's|Comedy|Drama	1995

While the above result might make sense, it is not optimal. To get better results, we have to tune the recommender’s hyperparameters against some accuracy metric measured on a validation set.

In the next tutorial, we will see how to define a hold-out set and a validation score.