Evaluation of recommender systems

In this tutorial, we explain how to evaluate recommender systems trained with implicit feedback using the hold-out method.

[1]:
from irspack.dataset import MovieLens1MDataManager
from irspack import (
    P3alphaRecommender, TopPopRecommender,
    rowwise_train_test_split, Evaluator,
    df_to_sparse
)

Read the ML1M dataset again.

As in the previous tutorial, we load the rating dataset and construct a sparse matrix.

[2]:
loader = MovieLens1MDataManager()

df = loader.read_interaction()

X, unique_user_ids, unique_movie_ids = df_to_sparse(
    df, 'userId', 'movieId'
)
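
As a quick sanity check (our own addition, not part of the original notebook), the resulting matrix should have one row per unique user and one column per unique movie:

# Each row of X corresponds to an entry of unique_user_ids,
# and each column to an entry of unique_movie_ids.
print(X.shape)
assert X.shape == (len(unique_user_ids), len(unique_movie_ids))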

Split scheme 1. Hold-out for all users.

To evaluate the performance of a recommender system trained with implicit feedback, the standard approach is to hide a subset of the known user-item interactions as a validation set and see how highly the recommender ranks these hidden ground truths:

Perform hold out for all users.

We have prepared a fast implementation of such a split (with random selection of the held-out subset) in the rowwise_train_test_split function:

[3]:
X_train, X_valid = rowwise_train_test_split(X, test_ratio=0.2, random_state=0)

assert X_train.shape == X_valid.shape

They sum back to the original matrix:

[4]:
X - (X_train + X_valid) # 0 stored elements
[4]:
<6040x3706 sparse matrix of type '<class 'numpy.float64'>'
        with 0 stored elements in Compressed Sparse Row format>

Their non-zero elements do not overlap:

[5]:
X_train.multiply(X_valid) # Element-wise multiplication yields 0 stored elements
[5]:
<6040x3706 sparse matrix of type '<class 'numpy.float64'>'
        with 0 stored elements in Compressed Sparse Row format>
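
To make the splitting concrete, here is a rough sketch (our own illustration, not irspack's actual implementation) of how such a row-wise hold-out could be done for a single user's row with plain numpy and scipy:

import numpy as np

rng = np.random.default_rng(0)

row = X[0]  # the first user's interaction row (1 x n_items CSR matrix)
held_out = rng.choice(
    row.indices, size=int(len(row.indices) * 0.2), replace=False
)  # randomly pick ~20% of the items this user interacted with
is_valid = np.isin(row.indices, held_out)

valid_row = row.copy()
valid_row.data = np.where(is_valid, row.data, 0)  # keep only the held-out interactions
valid_row.eliminate_zeros()

train_row = row.copy()
train_row.data = np.where(is_valid, 0, row.data)  # keep the remaining interactions
train_row.eliminate_zeros()

assert (train_row + valid_row - row).nnz == 0  # together they restore the original row

rowwise_train_test_split performs this kind of per-row random selection for every user at once.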

This scheme, however, can be slow because we have to compute recommendation scores for all users during evaluation. In the next tutorial, we will therefore use a sub-sampled version of this split.

Obtain the evaluation metrics

Now we define the Evaluator object, which will measure the performance of various recommender systems based on X_valid (the meaning of offset=0 will be clarified in the next tutorial).

[6]:
evaluator = Evaluator(X_valid, offset=0)

We fit P3alphaRecommender using X_train, and compute its accuracy metrics against X_valid using evaluator.

Internally, the evaluator calls the recommender’s get_score_remove_seen method, sorts the resulting scores to obtain a ranking, and compares that ranking against the stored validation interactions (a manual sketch of this procedure follows the next cell).

[7]:
recommender = P3alphaRecommender(X_train, top_k=100)
recommender.learn()

evaluator.get_scores(recommender, cutoffs=[5, 10])
[7]:
OrderedDict([('hit@5', 0.7988410596026491),
             ('recall@5', 0.09333713653053713),
             ('ndcg@5', 0.396153990817333),
             ('map@5', 0.06632471989643292),
             ('precision@5', 0.37317880794701996),
             ('gini_index@5', 0.9738255665593293),
             ('entropy@5', 4.777123828102488),
             ('appeared_item@5', 536.0),
             ('hit@10', 0.8925496688741722),
             ('recall@10', 0.15053594583416965),
             ('ndcg@10', 0.3662089065311077),
             ('map@10', 0.08982972949880254),
             ('precision@10', 0.32049668874172194),
             ('gini_index@10', 0.961654047669253),
             ('entropy@10', 5.1898345912683315),
             ('appeared_item@10', 764.0)])
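
To illustrate the procedure described above, the following sketch computes recall@5 by hand for the first few users. This is our own illustration, assuming get_score_remove_seen accepts an array of user indices and returns a dense score array with already-seen items masked out; the exact denominators and edge cases may differ slightly from Evaluator's implementation.

import numpy as np

user_indices = np.arange(100)  # a handful of users, for illustration
scores = recommender.get_score_remove_seen(user_indices)

cutoff = 5
# highest-scored items per user (masked, already-seen items sort to the bottom)
top_items = np.argsort(-np.asarray(scores), axis=1)[:, :cutoff]

ground_truth = X_valid[user_indices].toarray() > 0
recalls = []
for recommended, relevant in zip(top_items, ground_truth):
    if relevant.sum() == 0:
        continue  # users without validation interactions are skipped
    recalls.append(relevant[recommended].sum() / relevant.sum())

print(np.mean(recalls))  # roughly comparable to the recall@5 reported above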

Comparison with a simple baseline

Now that we have a quantitative way to measure the recommenders’ performance, we can compare different algorithms.

As a simple baseline, we fit TopPopRecommender, which recommends items in descending order of their popularity in the training set, regardless of each user’s watch history. (Note, however, that items a user has already seen will not be recommended again.)
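
The ranking this baseline produces is easy to reproduce by hand. The following sketch (our own, assuming popularity is simply the per-item interaction count, i.e. the column sums of X_train) lists the five globally most popular items:

import numpy as np

popularity = np.asarray(X_train.sum(axis=0)).ravel()  # training interactions per item
top5 = np.argsort(-popularity)[:5]                     # item indices, most popular first
print([unique_movie_ids[i] for i in top5])             # corresponding movie ids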

[8]:
toppop_recommender = TopPopRecommender(X_train)
toppop_recommender.learn()

evaluator.get_scores(toppop_recommender, cutoffs=[5, 10])
[8]:
OrderedDict([('hit@5', 0.5473509933774835),
             ('recall@5', 0.04097159018092859),
             ('ndcg@5', 0.21914854330480912),
             ('map@5', 0.025239375245273265),
             ('precision@5', 0.20884105960264907),
             ('gini_index@5', 0.9972226173414867),
             ('entropy@5', 2.608344593727912),
             ('appeared_item@5', 42.0),
             ('hit@10', 0.6637417218543047),
             ('recall@10', 0.0667908851617647),
             ('ndcg@10', 0.19939297384808613),
             ('map@10', 0.033298013667913656),
             ('precision@10', 0.17811258278145692),
             ('gini_index@10', 0.9950046639957398),
             ('entropy@10', 3.1812807131889786),
             ('appeared_item@10', 69.0)])

So we see that P3alphaRecommender indeed exhibits better accuracy scores than the rather trivial TopPopRecommender.
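
For easier side-by-side reading, the two result dictionaries can also be combined into a single table, for example with pandas (a small convenience of our own, not something Evaluator provides):

import pandas as pd

comparison = pd.DataFrame({
    "P3alpha": evaluator.get_scores(recommender, cutoffs=[5, 10]),
    "TopPop": evaluator.get_scores(toppop_recommender, cutoffs=[5, 10]),
})
print(comparison)  # one row per metric, one column per recommender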

In the next tutorial, we will optimize the recommender’s performance using the hold-out method.