Hyperparameter Optimization

In this tutorial, we first demonstrate how P3alphaRecommender's performance can be optimized by its Optuna-backed tune function.

Then, by further splitting the ground-truth interactions into train, validation, and test sets, we compare several recommenders' performance, optimizing on the validation set and measuring on the test set.

[1]:
from IPython.display import clear_output, display
import numpy as np
import scipy.sparse as sps
from sklearn.model_selection import train_test_split

from irspack.dataset import MovieLens1MDataManager
from irspack import (
    P3alphaRecommender, rowwise_train_test_split, Evaluator,
    df_to_sparse
)

Read the ML1M dataset again.

We again prepare the sparse matrix X.

[2]:
loader = MovieLens1MDataManager()

df = loader.read_interaction()

movies = loader.read_item_info()
movies.head()


X, unique_user_ids, unique_movie_ids = df_to_sparse(
    df, 'userId', 'movieId'
)

Split scheme 2. Hold-out for partial users.

To perform the hyperparameter optimization, we have to repeatedly measure the accuracy metrics on the validation set. As mentioned in the previous tutorial, doing this for all users is time-consuming (often heavier than the recommender's learning process), so we restrict the evaluation to a subset of users as follows:

  1. First, split the users into "train", "validation" (and "test") groups.

  2. For train users, feed all their interactions into the recommender. For validation (test) users, hold out part of their interactions (the "prediction" part) and feed the rest (the "learning" part) into the recommender.

  3. After the fit, ask the recommender to output scores only for the validation (test) users, and see how it ranks their held-out interactions.

Perform hold-out for a part of the users.

Although we have prepared another function for this procedure, let us first do it manually.

[3]:
# Split users into train and validation users.

X_train_user, X_valid_user = train_test_split(X, test_size=.4, random_state=0)

# Split the validation users' interactions into a learning part (50%) and a prediction part (50%).

X_valid_learn, X_valid_predict = rowwise_train_test_split(
    X_valid_user, test_ratio=.5, random_state=0
)
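
As a quick sanity check, note that rowwise_train_test_split keeps the original matrix shape and splits the nonzero entries of each row into two disjoint parts (a minimal sketch, assuming no entries are dropped in the split):

# Both parts keep the shape of the original validation-user matrix.
assert X_valid_learn.shape == X_valid_user.shape
assert X_valid_predict.shape == X_valid_user.shape
# Together, the two parts account for all of the validation users' interactions.
assert X_valid_learn.nnz + X_valid_predict.nnz == X_valid_user.nnz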

Define the evaluator and optimize the validation metric

As illustrated above, we will use

  • All the train users' interactions (X_train_user)

  • The validation users' learning-part interactions (X_valid_learn)

as the recommender's training data, and the validation users' remaining interactions (X_valid_predict) as the held-out ground truth:

[4]:
X_train_val_learn = sps.vstack([X_train_user, X_valid_learn])
evaluator = Evaluator(
    X_valid_predict, offset=X_train_user.shape[0],
    target_metric='ndcg', cutoff=20
)

The offset parameter specifies where the validation user block begins (i.e., where the train user block ends) in the vertically stacked matrix.
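
Concretely, row i of X_train_val_learn corresponds to train user i when i < offset, and to validation user i - offset otherwise. A minimal check of this layout (assuming vstack preserves the row order):

n_train_users = X_train_user.shape[0]
# Rows [0, n_train_users) are train users; the remaining rows are the
# validation users whose held-out interactions live in X_valid_predict.
assert X_train_val_learn.shape[0] == n_train_users + X_valid_learn.shape[0]
assert X_valid_predict.shape[0] == X_valid_learn.shape[0]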

Now we start the optimization.

[5]:
best_params, validation_results = P3alphaRecommender.tune(
    X_train_val_learn, evaluator, random_seed=0, n_trials=20
)
clear_output() # output is a bit lengthy

The best ndcg@20 value is

[6]:
validation_results['ndcg@20'].max()
[6]:
0.5159628863136182

which was obtained with these hyperparameters:

[7]:
best_params
[7]:
{'top_k': 217, 'normalize_weight': True}

Meanwhile, the default arguments of P3alphaRecommender (which we have used so far) attain ndcg@20 = 0.4084, so this is indeed a significant improvement:

[8]:
rec_default = P3alphaRecommender(X_train_val_learn).learn()
evaluator.get_score(rec_default)['ndcg']
[8]:
0.4084060191998281
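
If you want to double-check the tuned result, refit on the same training matrix with best_params; the score should roughly reproduce the best trial's value (a sketch reusing the Evaluator API shown above):

rec_best = P3alphaRecommender(X_train_val_learn, **best_params).learn()
evaluator.get_score(rec_best)['ndcg']  # should be close to ~0.516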

Check the recommender’s output again

Let us check how our recommender has evolved since the first tutorial. We consider the same setting (a new user who has watched "Toy Story"), but fit the recommender using the tuned parameters.

[9]:
rec_tuned = P3alphaRecommender(X, **best_params).learn()

from irspack import ItemIDMapper
id_mapper = ItemIDMapper(unique_movie_ids)
[10]:
toystory_id = 1
recommended_id_and_score = id_mapper.recommend_for_new_user(
    rec_tuned, user_profile=[toystory_id], cutoff=10
)

# Top-10 recommendations
movies.reindex([movie_id for movie_id, score in recommended_id_and_score])
[10]:
title genres release_year
movieId
1265 Groundhog Day (1993) Comedy|Romance 1993
2396 Shakespeare in Love (1998) Comedy|Romance 1998
3114 Toy Story 2 (1999) Animation|Children's|Comedy 1999
1270 Back to the Future (1985) Comedy|Sci-Fi 1985
2028 Saving Private Ryan (1998) Action|Drama|War 1998
34 Babe (1995) Children's|Comedy|Drama 1995
2571 Matrix, The (1999) Action|Sci-Fi|Thriller 1999
356 Forrest Gump (1994) Comedy|Romance|War 1994
2355 Bug's Life, A (1998) Animation|Children's|Comedy 1998
1197 Princess Bride, The (1987) Action|Adventure|Comedy|Romance 1987

Note how drastically the recommended items have changed (the increased prominence of the "Children's" genre, the disappearance of the "Star Wars" series, etc.).

A train/validation/test split example

To rigorously compare the performance of various recommender algorithms, we should measure the final score against the test set, not the validation set. This is now straightforward.

To begin with, we have prepared a function called split_dataframe_partial_user_holdout, which splits the users in the original dataframe into train/validation/test users, holding out part of the interactions for the validation/test users:

[11]:
from irspack.split import split_dataframe_partial_user_holdout

dataset, item_ids = split_dataframe_partial_user_holdout(
    df, 'userId', 'movieId', val_user_ratio=.3, test_user_ratio=.3,
    heldout_ratio_val=.5, heldout_ratio_test=.5
)

dataset
[11]:
{'train': <irspack.split.userwise.UserTrainTestInteractionPair at 0x7ff61f483430>,
 'val': <irspack.split.userwise.UserTrainTestInteractionPair at 0x7ff61f4831f0>,
 'test': <irspack.split.userwise.UserTrainTestInteractionPair at 0x7ff61f482bc0>}

As you can see, the returned dataset is a dictionary which stores the train/validation/test users' interactions as instances of UserTrainTestInteractionPair.

[12]:
train_users = dataset['train']
val_users = dataset['val']
test_users = dataset['test']

# Concatenate train/validation users into one.
train_and_val_users = train_users.concat(val_users)
[13]:
val_users.X_train
[13]:
<1812x3706 sparse matrix of type '<class 'numpy.float64'>'
        with 152333 stored elements in Compressed Sparse Row format>
[14]:
val_users.X_test
[14]:
<1812x3706 sparse matrix of type '<class 'numpy.float64'>'
        with 151435 stored elements in Compressed Sparse Row format>
[15]:
val_users.X_all # which equals val_users.X_train + val_users.X_test
[15]:
<1812x3706 sparse matrix of type '<class 'numpy.float64'>'
        with 303768 stored elements in Compressed Sparse Row format>
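
We can verify this equality directly (a quick sanity check; it assumes the learn and test parts are disjoint splits of X_all):

diff = val_users.X_all - (val_users.X_train + val_users.X_test)
assert abs(diff).sum() == 0  # no entry differs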
[16]:
# For train users, there is no "test" interaction held out.
train_users.X_test
[16]:
<2416x3706 sparse matrix of type '<class 'numpy.float64'>'
        with 0 stored elements in Compressed Sparse Row format>

For each recommender algorithm (here P3alpha, RP3beta, IALS and DenseSLIM), we perform:

  1. Hyperparameter optimization. During this phase, we use all the train users' interactions and the validation users' train interactions as the source of learning, and the validation users' test interactions as the held-out ground truth.

  2. Evaluation. During this phase, we use all the train/validation users' interactions as well as the test users' train interactions as the source of learning, and fit the model using the parameters obtained in the optimization phase. Then we measure the recommender's performance against the test users' test interactions.

[17]:
from typing import Type
from irspack import DenseSLIMRecommender, RP3betaRecommender, IALSRecommender, BaseRecommender
[18]:
val_evaluator = Evaluator(
    val_users.X_test,
    offset=train_users.n_users,
    cutoff=20, target_metric="ndcg"
)
test_evaluator = Evaluator(
    test_users.X_test,
    offset=train_and_val_users.n_users
)
test_results = []
recommender_name_vs_best_parameter = {}
recommender_class: Type[BaseRecommender]
for recommender_class in [IALSRecommender, DenseSLIMRecommender, P3alphaRecommender, RP3betaRecommender]:
    print(f'Start tuning {recommender_class.__name__}.')
    best_params, validation_results_df = recommender_class.tune(
        sps.vstack([train_users.X_all, val_users.X_train]),
        val_evaluator, n_trials=40, random_seed=0
    )
    recommender = recommender_class(
        sps.vstack([train_and_val_users.X_all, test_users.X_train]),
        **best_params
    ).learn()
    recommender_name_vs_best_parameter[recommender_class.__name__] = best_params

    test_score = dict(
        algorithm=recommender_class.__name__,
        **test_evaluator.get_scores(recommender, cutoffs=[20])
    )
    test_results.append(test_score)
    clear_output()

As you can see below, iALS and DenseSLIM outperform the others in terms of the accuracy metrics (recall, ndcg, map).

iALS also performs well on the diversity metrics (entropy, gini_index, appeared_item).

[19]:
import pandas as pd
pd.DataFrame(test_results)
[19]:
algorithm hit@20 recall@20 ndcg@20 map@20 precision@20 gini_index@20 entropy@20 appeared_item@20
0 IALSRecommender 0.996137 0.208540 0.576164 0.135950 0.528201 0.915915 5.989211 1108.0
1 DenseSLIMRecommender 0.995033 0.207463 0.572965 0.135436 0.525055 0.926926 5.859120 1018.0
2 P3alphaRecommender 0.993377 0.182982 0.526259 0.114564 0.477152 0.962736 5.174319 690.0
3 RP3betaRecommender 0.995033 0.188152 0.537070 0.119353 0.486010 0.957136 5.297020 846.0

Let's ask each recommender: "What would you recommend to a user who has just seen 'Toy Story'?"

iALS, DenseSLIM, and RP3beta rank "Toy Story 2" at the top of the recommendation list, which seems appropriate.

[20]:
for recommender_class in [IALSRecommender, DenseSLIMRecommender, RP3betaRecommender, P3alphaRecommender]:
    rec_tuned = recommender_class(X, **recommender_name_vs_best_parameter[recommender_class.__name__]).learn()

    toystory_id = 1
    recommended_id_and_score = id_mapper.recommend_for_new_user(
        rec_tuned, user_profile=[toystory_id], cutoff=10
    )
    print(f"{recommender_class.__name__}'s result:")
    # Top-10 recommendations
    display(movies.reindex([movie_id for movie_id, score in recommended_id_and_score]))
IALSRecommender's result:
title genres release_year
movieId
3114 Toy Story 2 (1999) Animation|Children's|Comedy 1999
34 Babe (1995) Children's|Comedy|Drama 1995
2355 Bug's Life, A (1998) Animation|Children's|Comedy 1998
1265 Groundhog Day (1993) Comedy|Romance 1993
588 Aladdin (1992) Animation|Children's|Comedy|Musical 1992
2396 Shakespeare in Love (1998) Comedy|Romance 1998
2321 Pleasantville (1998) Comedy 1998
356 Forrest Gump (1994) Comedy|Romance|War 1994
1148 Wrong Trousers, The (1993) Animation|Comedy 1993
595 Beauty and the Beast (1991) Animation|Children's|Musical 1991
DenseSLIMRecommender's result:
title genres release_year
movieId
3114 Toy Story 2 (1999) Animation|Children's|Comedy 1999
2355 Bug's Life, A (1998) Animation|Children's|Comedy 1998
34 Babe (1995) Children's|Comedy|Drama 1995
588 Aladdin (1992) Animation|Children's|Comedy|Musical 1992
1265 Groundhog Day (1993) Comedy|Romance 1993
2396 Shakespeare in Love (1998) Comedy|Romance 1998
356 Forrest Gump (1994) Comedy|Romance|War 1994
1148 Wrong Trousers, The (1993) Animation|Comedy 1993
1641 Full Monty, The (1997) Comedy 1997
1923 There's Something About Mary (1998) Comedy 1998
RP3betaRecommender's result:
title genres release_year
movieId
3114 Toy Story 2 (1999) Animation|Children's|Comedy 1999
1265 Groundhog Day (1993) Comedy|Romance 1993
2396 Shakespeare in Love (1998) Comedy|Romance 1998
34 Babe (1995) Children's|Comedy|Drama 1995
2355 Bug's Life, A (1998) Animation|Children's|Comedy 1998
1270 Back to the Future (1985) Comedy|Sci-Fi 1985
260 Star Wars: Episode IV - A New Hope (1977) Action|Adventure|Fantasy|Sci-Fi 1977
2028 Saving Private Ryan (1998) Action|Drama|War 1998
356 Forrest Gump (1994) Comedy|Romance|War 1994
1210 Star Wars: Episode VI - Return of the Jedi (1983) Action|Adventure|Romance|Sci-Fi|War 1983
P3alphaRecommender's result:
title genres release_year
movieId
1265 Groundhog Day (1993) Comedy|Romance 1993
2396 Shakespeare in Love (1998) Comedy|Romance 1998
3114 Toy Story 2 (1999) Animation|Children's|Comedy 1999
1270 Back to the Future (1985) Comedy|Sci-Fi 1985
2028 Saving Private Ryan (1998) Action|Drama|War 1998
34 Babe (1995) Children's|Comedy|Drama 1995
356 Forrest Gump (1994) Comedy|Romance|War 1994
2355 Bug's Life, A (1998) Animation|Children's|Comedy 1998
1197 Princess Bride, The (1987) Action|Adventure|Comedy|Romance 1987
588 Aladdin (1992) Animation|Children's|Comedy|Musical 1992