Hyperparameter Optimization

In this tutorial, we first demonstrate how P3alphaRecommender's performance can be optimized by its Optuna-backed tune function.

Then, by further splitting the ground-truth interactions into train, validation, and test sets, we compare several recommenders' performance, optimizing on the validation set and measuring on the test set.

[1]:
from IPython.display import clear_output, display
import numpy as np
import scipy.sparse as sps
from sklearn.model_selection import train_test_split

from irspack.dataset import MovieLens1MDataManager
from irspack import (
    P3alphaRecommender, rowwise_train_test_split, Evaluator,
    df_to_sparse
)

Read the ML1M dataset again.

We again prepare the sparse matrix X.

[2]:
loader = MovieLens1MDataManager()

df = loader.read_interaction()

movies = loader.read_item_info()
movies.head()


X, unique_user_ids, unique_movie_ids = df_to_sparse(
    df, 'userId', 'movieId'
)

Split scheme 2. Hold-out for partial users.

To perform the hyperparameter optimization, we have to repeatedly measure the accuracy metrics on the validation set. As mentioned in the previous tutorial, doing this for all users is time-consuming (often heavier than the recommender's learning process), so we restrict the evaluation to a subset of users as follows:

  1. First, split the users into "train", "validation" (and "test") groups.

  2. For train users, feed all their interactions into the recommender. For validation (test) users, hold out part of their interactions (the "prediction" part) and feed the rest (the "learning" part) into the recommender.

  3. After the fit, ask the recommender to output scores only for the validation (test) users, and see how it ranks their held-out interactions.

Perform hold-out for a part of the users.

Although we have prepared another function for this procedure, let us first do it manually.

[3]:
# Split users into train and validation users.

X_train_user, X_valid_user = train_test_split(X, test_size=.4, random_state=0)

# Split the validation users' interactions into a learning part (50%) and a prediction part (50%).

X_valid_learn, X_valid_predict = rowwise_train_test_split(
    X_valid_user, test_ratio=.5, random_state=0
)
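
As a quick sanity check, note that rowwise_train_test_split keeps the original matrix shape and splits the nonzero entries of each row into two disjoint parts (a minimal sketch, assuming no entries are dropped in the split):

# Both parts keep the shape of the original validation-user matrix.
assert X_valid_learn.shape == X_valid_user.shape
assert X_valid_predict.shape == X_valid_user.shape
# Together, the two parts account for all of the validation users' interactions.
assert X_valid_learn.nnz + X_valid_predict.nnz == X_valid_user.nnz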

Define the evaluator and optimize the validation metric

As illustrated above, we will use

  • All the train users' interactions (X_train_user)

  • The validation users' learning-part interactions (X_valid_learn)

as the recommender's training data, and the validation users' remaining interactions (X_valid_predict) as the held-out ground truth:

[4]:
X_train_val_learn = sps.vstack([X_train_user, X_valid_learn])
evaluator = Evaluator(
    X_valid_predict, offset=X_train_user.shape[0],
    target_metric='ndcg', cutoff=20
)

The offset parameter specifies where the validation user block begins (i.e., where the train user block ends) in the vertically stacked matrix.
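
Concretely, row i of X_train_val_learn corresponds to train user i when i < offset, and to validation user i - offset otherwise. A minimal check of this layout (assuming vstack preserves the row order):

n_train_users = X_train_user.shape[0]
# Rows [0, n_train_users) are train users; the remaining rows are the
# validation users whose held-out interactions live in X_valid_predict.
assert X_train_val_learn.shape[0] == n_train_users + X_valid_learn.shape[0]
assert X_valid_predict.shape[0] == X_valid_learn.shape[0]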

Now we start the optimization.

[5]:
best_params, validation_results = P3alphaRecommender.tune(
    X_train_val_learn, evaluator, random_seed=0, n_trials=20
)
clear_output() # output is a bit lengthy

The best ndcg@20 value is

[6]:
validation_results['ndcg@20'].max()
[6]:
0.5159628863136182

which was obtained with these hyperparameters:

[7]:
best_params
[7]:
{'top_k': 217, 'normalize_weight': True}

Meanwhile, the default arguments of P3alphaRecommender (which we have used so far) attain ndcg@20 = 0.4084, so this is indeed a significant improvement:

[8]:
rec_default = P3alphaRecommender(X_train_val_learn).learn()
evaluator.get_score(rec_default)['ndcg']
[8]:
0.4084060191998281
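
If you want to double-check the tuned result, refit on the same training matrix with best_params; the score should roughly reproduce the best trial's value (a sketch reusing the Evaluator API shown above):

rec_best = P3alphaRecommender(X_train_val_learn, **best_params).learn()
evaluator.get_score(rec_best)['ndcg']  # should be close to ~0.516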

Check the recommender’s output again

Let us check how our recommender has evolved since the first tutorial. We consider the same setting (a new user who has watched "Toy Story"), but fit the recommender using the tuned parameters.

[9]:
rec_tuned = P3alphaRecommender(X, **best_params).learn()

from irspack import ItemIDMapper
id_mapper = ItemIDMapper(unique_movie_ids)
[10]:
toystory_id = 1
recommended_id_and_score = id_mapper.recommend_for_new_user(
    rec_tuned, user_profile=[toystory_id], cutoff=10
)

# Top-10 recommendations
movies.reindex([movie_id for movie_id, score in recommended_id_and_score])
[10]:
title genres release_year
movieId
1265 Groundhog Day (1993) Comedy|Romance 1993
2396 Shakespeare in Love (1998) Comedy|Romance 1998
3114 Toy Story 2 (1999) Animation|Children's|Comedy 1999
1270 Back to the Future (1985) Comedy|Sci-Fi 1985
2028 Saving Private Ryan (1998) Action|Drama|War 1998
34 Babe (1995) Children's|Comedy|Drama 1995
2571 Matrix, The (1999) Action|Sci-Fi|Thriller 1999
356 Forrest Gump (1994) Comedy|Romance|War 1994
2355 Bug's Life, A (1998) Animation|Children's|Comedy 1998
1197 Princess Bride, The (1987) Action|Adventure|Comedy|Romance 1987

Note how drastically the recommended items have changed (the increased prominence of the "Children's" genre, the disappearance of the "Star Wars" series, etc.).

A train/validation/test split example

To rigorously compare the performance of various recommender algorithms, we should measure the final score against the test set, not the validation set. This is now straightforward.

To begin with, we have prepared a function called split_dataframe_partial_user_holdout, which splits the users in the original dataframe into train/validation/test users, holding out part of the interactions for the validation/test users:

[11]:
from irspack.split import split_dataframe_partial_user_holdout

dataset, item_ids = split_dataframe_partial_user_holdout(
    df, 'userId', 'movieId', val_user_ratio=.3, test_user_ratio=.3,
    heldout_ratio_val=.5, heldout_ratio_test=.5
)

dataset
[11]:
{'train': <irspack.split.userwise.UserTrainTestInteractionPair at 0x7ff61f483430>,
 'val': <irspack.split.userwise.UserTrainTestInteractionPair at 0x7ff61f4831f0>,
 'test': <irspack.split.userwise.UserTrainTestInteractionPair at 0x7ff61f482bc0>}

As you can see, the returned dataset is a dictionary which stores the train/validation/test users' interactions as instances of UserTrainTestInteractionPair.

[12]:
train_users = dataset['train']
val_users = dataset['val']
test_users = dataset['test']

# Concatenate train/validation users into one.
train_and_val_users = train_users.concat(val_users)
[13]:
val_users.X_train
[13]:
<1812x3706 sparse matrix of type '<class 'numpy.float64'>'
        with 152333 stored elements in Compressed Sparse Row format>
[14]:
val_users.X_test
[14]:
<1812x3706 sparse matrix of type '<class 'numpy.float64'>'
        with 151435 stored elements in Compressed Sparse Row format>
[15]:
val_users.X_all # which equals val_users.X_train + val_users.X_test
[15]:
<1812x3706 sparse matrix of type '<class 'numpy.float64'>'
        with 303768 stored elements in Compressed Sparse Row format>
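
We can verify this equality directly (a quick sanity check; it assumes the learn and test parts are disjoint splits of X_all):

diff = val_users.X_all - (val_users.X_train + val_users.X_test)
assert abs(diff).sum() == 0  # no entry differs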
[16]:
# For train users, there is no "test" interaction held out.
train_users.X_test
[16]:
<2416x3706 sparse matrix of type '<class 'numpy.float64'>'
        with 0 stored elements in Compressed Sparse Row format>

For each recommender algorithm (here P3alpha, RP3beta, IALS and DenseSLIM), we perform:

  1. Hyperparameter optimization. During this phase, we use all the train users' interactions and the validation users' train interactions as the source of learning, and the validation users' test interactions as the held-out ground truth.

  2. Evaluation. During this phase, we use all the train/validation users' interactions as well as the test users' train interactions as the source of learning, and fit the model using the parameters obtained in the optimization phase. Then we measure the recommender's performance against the test users' test interactions.

[17]:
from typing import Type
from irspack import DenseSLIMRecommender, RP3betaRecommender, IALSRecommender, BaseRecommender
[18]:
val_evaluator = Evaluator(
    val_users.X_test,
    offset=train_users.n_users,
    cutoff=20, target_metric="ndcg"
)
test_evaluator = Evaluator(
    test_users.X_test,
    offset=train_and_val_users.n_users
)
test_results = []
recommender_name_vs_best_parameter = {}
recommender_class: Type[BaseRecommender]
for recommender_class in [IALSRecommender, DenseSLIMRecommender, P3alphaRecommender, RP3betaRecommender]:
    print(f'Start tuning {recommender_class.__name__}.')
    best_params, validation_results_df = recommender_class.tune(
        sps.vstack([train_users.X_all, val_users.X_train]),
        val_evaluator, n_trials=40, random_seed=0
    )
    recommender = recommender_class(
        sps.vstack([train_and_val_users.X_all, test_users.X_train]),
        **best_params
    ).learn()
    recommender_name_vs_best_parameter[recommender_class.__name__] = best_params

    test_score = dict(
        algorithm=recommender_class.__name__,
        **test_evaluator.get_scores(recommender, cutoffs=[20])
    )
    test_results.append(test_score)
    clear_output()

As you can see below, iALS and DenseSLIM outperform the others in terms of the accuracy metrics (recall, ndcg, map).

iALS also performs well on the diversity metrics (entropy, gini_index, appeared_item).

[19]:
import pandas as pd
pd.DataFrame(test_results)
[19]:
algorithm hit@20 recall@20 ndcg@20 map@20 precision@20 gini_index@20 entropy@20 appeared_item@20
0 IALSRecommender 0.996137 0.208540 0.576164 0.135950 0.528201 0.915915 5.989211 1108.0
1 DenseSLIMRecommender 0.995033 0.207463 0.572965 0.135436 0.525055 0.926926 5.859120 1018.0
2 P3alphaRecommender 0.993377 0.182982 0.526259 0.114564 0.477152 0.962736 5.174319 690.0
3 RP3betaRecommender 0.995033 0.188152 0.537070 0.119353 0.486010 0.957136 5.297020 846.0

Let's ask each recommender: "What would you recommend to a user who has just seen 'Toy Story'?"

iALS, DenseSLIM, and RP3beta rank "Toy Story 2" at the top of the recommendation list, which seems appropriate.

[20]:
for recommender_class in [IALSRecommender, DenseSLIMRecommender, RP3betaRecommender, P3alphaRecommender]:
    rec_tuned = recommender_class(X, **recommender_name_vs_best_parameter[recommender_class.__name__]).learn()

    toystory_id = 1
    recommended_id_and_score = id_mapper.recommend_for_new_user(
        rec_tuned, user_profile=[toystory_id], cutoff=10
    )
    print(f"{recommender_class.__name__}'s result:")
    # Top-10 recommendations
    display(movies.reindex([movie_id for movie_id, score in recommended_id_and_score]))
IALSRecommender's result:
title genres release_year
movieId
3114 Toy Story 2 (1999) Animation|Children's|Comedy 1999
34 Babe (1995) Children's|Comedy|Drama 1995
2355 Bug's Life, A (1998) Animation|Children's|Comedy 1998
1265 Groundhog Day (1993) Comedy|Romance 1993
588 Aladdin (1992) Animation|Children's|Comedy|Musical 1992
2396 Shakespeare in Love (1998) Comedy|Romance 1998
2321 Pleasantville (1998) Comedy 1998
356 Forrest Gump (1994) Comedy|Romance|War 1994
1148 Wrong Trousers, The (1993) Animation|Comedy 1993
595 Beauty and the Beast (1991) Animation|Children's|Musical 1991
DenseSLIMRecommender's result:
title genres release_year
movieId
3114 Toy Story 2 (1999) Animation|Children's|Comedy 1999
2355 Bug's Life, A (1998) Animation|Children's|Comedy 1998
34 Babe (1995) Children's|Comedy|Drama 1995
588 Aladdin (1992) Animation|Children's|Comedy|Musical 1992
1265 Groundhog Day (1993) Comedy|Romance 1993
2396 Shakespeare in Love (1998) Comedy|Romance 1998
356 Forrest Gump (1994) Comedy|Romance|War 1994
1148 Wrong Trousers, The (1993) Animation|Comedy 1993
1641 Full Monty, The (1997) Comedy 1997
1923 There's Something About Mary (1998) Comedy 1998
RP3betaRecommender's result:
title genres release_year
movieId
3114 Toy Story 2 (1999) Animation|Children's|Comedy 1999
1265 Groundhog Day (1993) Comedy|Romance 1993
2396 Shakespeare in Love (1998) Comedy|Romance 1998
34 Babe (1995) Children's|Comedy|Drama 1995
2355 Bug's Life, A (1998) Animation|Children's|Comedy 1998
1270 Back to the Future (1985) Comedy|Sci-Fi 1985
260 Star Wars: Episode IV - A New Hope (1977) Action|Adventure|Fantasy|Sci-Fi 1977
2028 Saving Private Ryan (1998) Action|Drama|War 1998
356 Forrest Gump (1994) Comedy|Romance|War 1994
1210 Star Wars: Episode VI - Return of the Jedi (1983) Action|Adventure|Romance|Sci-Fi|War 1983
P3alphaRecommender's result:
title genres release_year
movieId
1265 Groundhog Day (1993) Comedy|Romance 1993
2396 Shakespeare in Love (1998) Comedy|Romance 1998
3114 Toy Story 2 (1999) Animation|Children's|Comedy 1999
1270 Back to the Future (1985) Comedy|Sci-Fi 1985
2028 Saving Private Ryan (1998) Action|Drama|War 1998
34 Babe (1995) Children's|Comedy|Drama 1995
356 Forrest Gump (1994) Comedy|Romance|War 1994
2355 Bug's Life, A (1998) Animation|Children's|Comedy 1998
1197 Princess Bride, The (1987) Action|Adventure|Comedy|Romance 1987
588 Aladdin (1992) Animation|Children's|Comedy|Musical 1992