irspack.split.split_dataframe_partial_user_holdout

irspack.split.split_dataframe_partial_user_holdout(df_all, user_column, item_column, time_column=None, rating_column=None, n_val_user=None, n_test_user=None, val_user_ratio=0.1, test_user_ratio=0.1, heldout_ratio_val=0.5, n_heldout_val=None, heldout_ratio_test=0.5, n_heldout_test=None, ceil_n_heldout=False, random_state=None)

Splits the DataFrame and builds interaction matrices, holding out random interactions for a subset of randomly selected users (whom we call “validation users” and “test users”).

Parameters:
  • df_all (DataFrame) – The user-item interaction event log.

  • user_column (str) – The column name for user_id.

  • item_column (str) – The column name for item_id.

  • time_column (Optional[str]) – The column name (if any) specifying the time of the interaction. If this is set, the split will be based on time, and some of the most recent interactions will be held out for each user. Defaults to None.

  • rating_column (Optional[str]) – The column name for ratings. If None, the rating will be treated as 1 for all interactions. Defaults to None.

  • n_val_user (Optional[int]) – The number of “validation users”. Defaults to None.

  • n_test_user (Optional[int]) – The number of “test users”. Defaults to None.

  • val_user_ratio (float) – The fraction of “validation users” with respect to all users. Ignored when n_val_user is set. Defaults to 0.1.

  • test_user_ratio (float) – The fraction of “test users” with respect to all users. Ignored when n_test_user is set. Defaults to 0.1.

  • heldout_ratio_val (float) – The fraction of each “validation user's” interactions to hold out. Ignored if n_heldout_val is specified. Defaults to 0.5.

  • n_heldout_val (Optional[int]) – The maximum number of held-out interactions for “validation users”. Defaults to None.

  • heldout_ratio_test (float) – The fraction of each “test user's” interactions to hold out. Ignored if n_heldout_test is specified. Defaults to 0.5.

  • n_heldout_test (Optional[int]) – The maximum number of held-out interactions for “test users”. Defaults to None.

  • ceil_n_heldout (bool) – If True, the number of held-out interactions for user u will be ceil(heldout_ratio_val * N_u) for “validation users” and ceil(heldout_ratio_test * N_u) for “test users”, where N_u is the number of interactions of user u. If False, the floor function is used instead. Defaults to False.

  • random_state (Union[None, int, RandomState]) – The random state for this procedure. Defaults to None.
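
The interplay between the heldout_ratio_*, n_heldout_* and ceil_n_heldout parameters can be sketched as follows. This is only an illustration of the rule as documented above, not irspack's actual code: the helper name n_heldout_for_user is hypothetical, and capping the count at N_u when n_heldout is given is an assumption.

```python
import math
from typing import Optional


def n_heldout_for_user(
    n_interactions: int,
    heldout_ratio: float = 0.5,
    n_heldout: Optional[int] = None,
    ceil_n_heldout: bool = False,
) -> int:
    """Hypothetical helper: how many interactions to hold out for one user."""
    if n_heldout is not None:
        # The ratio is ignored; n_heldout acts as an upper bound.
        # Assumption: a user with fewer interactions is capped at N_u.
        return min(n_heldout, n_interactions)
    # Otherwise round heldout_ratio * N_u up or down.
    rounding = math.ceil if ceil_n_heldout else math.floor
    return rounding(heldout_ratio * n_interactions)


print(n_heldout_for_user(5))                       # floor(0.5 * 5) = 2
print(n_heldout_for_user(5, ceil_n_heldout=True))  # ceil(0.5 * 5) = 3
print(n_heldout_for_user(5, n_heldout=3))          # ratio ignored: 3
print(n_heldout_for_user(2, n_heldout=3))          # capped at N_u: 2
```

The choice of ceil_n_heldout matters mostly for users with few interactions: with the default floor rounding and a ratio of 0.5, a user with a single interaction has nothing held out.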

Raises:

ValueError – When n_val_user + n_test_user exceeds the total number of users.
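
The user-count logic and the error condition can be illustrated with a small standard-library sketch. The helper names resolve_user_counts and pick_users are hypothetical, and rounding the ratio-based counts down with int() is an assumption; the library may round differently.

```python
import random
from typing import List, Optional, Sequence, Tuple


def resolve_user_counts(
    n_users: int,
    n_val_user: Optional[int] = None,
    n_test_user: Optional[int] = None,
    val_user_ratio: float = 0.1,
    test_user_ratio: float = 0.1,
) -> Tuple[int, int]:
    """Hypothetical helper: explicit counts take precedence over ratios."""
    if n_val_user is None:
        n_val_user = int(n_users * val_user_ratio)  # assumed rounding
    if n_test_user is None:
        n_test_user = int(n_users * test_user_ratio)  # assumed rounding
    if n_val_user + n_test_user > n_users:
        raise ValueError("n_val_user + n_test_user exceeds the number of users.")
    return n_val_user, n_test_user


def pick_users(
    user_ids: Sequence[int], n_val: int, n_test: int, seed: Optional[int] = None
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle the users once, then slice off disjoint val and test groups."""
    shuffled = list(user_ids)
    random.Random(seed).shuffle(shuffled)
    val_users = shuffled[:n_val]
    test_users = shuffled[n_val:n_val + n_test]
    remaining_users = shuffled[n_val + n_test:]
    return remaining_users, val_users, test_users


n_val, n_test = resolve_user_counts(1000, n_val_user=30)
remaining, val_users, test_users = pick_users(range(1000), n_val, n_test, seed=0)
print(len(remaining), len(val_users), len(test_users))
```

Slicing a single shuffled list guarantees that the validation and test groups never overlap, which is why the sum of the two counts cannot exceed the number of users.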

Returns:

A tuple consisting of

  1. A dictionary with "train", "val", and "test" as its keys and the corresponding datasets as its values.

  2. A list of unique item ids (which correspond to the columns of the datasets).

Return type:

tuple