Type: Package
Title: Evaluation Metrics for Implicit-Feedback Recommender Systems
Version: 0.1.6-3
Author: David Cortes
Maintainer: David Cortes <david.cortes.rivera@gmail.com>
URL: https://github.com/david-cortes/recometrics
BugReports: https://github.com/david-cortes/recometrics/issues
Description: Calculates evaluation metrics for implicit-feedback recommender systems that are based on low-rank matrix factorization models, given the fitted model matrices and data, thus allowing comparison of models from a variety of libraries. Metrics include P@K (precision-at-k, for top-K recommendations), R@K (recall at k), AP@K (average precision at k), NDCG@K (normalized discounted cumulative gain at k), Hit@K (from which the 'Hit Rate' is calculated), RR@K (reciprocal rank at k, from which the 'MRR' or 'mean reciprocal rank' is calculated), ROC-AUC (area under the receiver-operating characteristic curve), and PR-AUC (area under the precision-recall curve). These are calculated on a per-user basis according to the ranking of items induced by the model, using efficient multi-threaded routines. Also provides functions for creating train-test splits for model fitting and evaluation.
LinkingTo: Rcpp, float
Imports: Rcpp (>= 1.0.1), Matrix (>= 1.3-4), MatrixExtra (>= 0.1.6), float, RhpcBLASctl, methods
Suggests: recommenderlab (>= 0.2-7), cmfrec (>= 3.2.0), data.table, knitr, rmarkdown, kableExtra, testthat
VignetteBuilder: knitr
License: BSD_2_clause + file LICENSE
RoxygenNote: 7.1.1
StagedInstall: TRUE
Biarch: TRUE
NeedsCompilation: yes
Packaged: 2023-02-17 17:12:29 UTC; david
Repository: CRAN
Date/Publication: 2023-02-19 23:00:02 UTC

Calculate Recommendation Quality Metrics

Description

Calculates recommendation quality metrics for implicit-feedback recommender systems (fit to user-item interactions data such as "number of times that a user played each song in a music service") that are based on low-rank matrix factorization or for which predicted scores can be reduced to a dot product between user and item factors/components.

These metrics are calculated on a per-user basis, by producing a ranking of the items according to model predictions (in descending order), ignoring the items that are in the training data for each user. The items that were not consumed by the user (not present in 'X_train' and not present in 'X_test') are considered "negative" entries, while the items in 'X_test' are considered "positive" entries, and the items present in 'X_train' are ignored for these calculations.

The metrics that can be calculated by this function are:

- P@K (precision-at-k)
- Truncated P@K
- R@K (recall-at-k)
- AP@K (average precision-at-k)
- Truncated AP@K
- NDCG@K (normalized discounted cumulative gain at k)
- Hit@K (from which the 'Hit Rate' is calculated)
- RR@K (reciprocal rank at k, from which the 'MRR' or 'mean reciprocal rank' is calculated)
- ROC-AUC (area under the ROC curve)
- PR-AUC (area under the precision-recall curve)

Metrics can be calculated for a given value of 'k' (e.g. "P@3"), or for values ranging from 1 to 'k' (e.g. ["P@1", "P@2", "P@3"]).
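As a hedged illustration of how these per-user metrics are defined (the item numbers and scores below are made up, and this is plain base R rather than the package's multi-threaded routines), P@K for a single user can be computed by hand:

```r
## Toy, hand-rolled P@K for one user (illustrative only). Item 1 is in the
## training data and is excluded from the ranking; items 3 and 5 are the
## "positive" test items; the rest are "negative".
scores <- c(0.9, 0.1, 0.8, 0.4, 0.7)  # model scores for items 1..5
train  <- c(1L)                       # items consumed in the training data
test   <- c(3L, 5L)                   # positive (test) items
k      <- 2L
pool   <- setdiff(seq_along(scores), train)            # candidate items
ranked <- pool[order(scores[pool], decreasing = TRUE)] # descending by score
p_at_k <- sum(ranked[1:k] %in% test) / k
p_at_k  # here both of the top-2 ranked items are test positives, so P@2 = 1
```

Note how the training item is removed from the candidate pool before ranking, matching the behavior described above.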

This package does NOT cover other, more specialized metrics; one might also want to look at complementary indicators such as the diversity or novelty of the recommended lists.

Usage

calc.reco.metrics(
  X_train,
  X_test,
  A,
  B,
  k = 5L,
  item_biases = NULL,
  as_df = TRUE,
  by_rows = FALSE,
  sort_indices = TRUE,
  precision = TRUE,
  trunc_precision = FALSE,
  recall = FALSE,
  average_precision = TRUE,
  trunc_average_precision = FALSE,
  ndcg = TRUE,
  hit = FALSE,
  rr = FALSE,
  roc_auc = FALSE,
  pr_auc = FALSE,
  all_metrics = FALSE,
  rename_k = TRUE,
  break_ties_with_noise = TRUE,
  min_pos_test = 1L,
  min_items_pool = 2L,
  consider_cold_start = TRUE,
  cumulative = FALSE,
  nthreads = parallel::detectCores(),
  seed = 1L
)

Arguments

X_train

Training data for user-item interactions, with rows denoting users, columns denoting items, and values corresponding to confidence scores. Entries in 'X_train' and 'X_test' for each user should not intersect (that is, if an item is in the training data as a non-missing entry, it should not be in the test data as non-missing, and vice versa).

Should be passed as a sparse matrix in CSR format (class 'dgRMatrix' from package 'Matrix', can be converted to that format using 'MatrixExtra::as.csr.matrix'). Items not consumed by the user should not be present in this matrix.

Alternatively, if there is no training data, can pass 'NULL', in which case it will look only at the test data.

This matrix and 'X_test' are not meant to contain negative values, and if 'X_test' does contain any, it will still be assumed for all metrics other than NDCG that such items are deemed better for the user than the missing/zero-valued items (that is, implicit feedback is not meant to signal dislikes).
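As a quick sketch (toy random data, using only the 'Matrix' package), interaction data can be produced directly in the expected CSR format:

```r
## Generating toy interaction data directly as CSR ('dgRMatrix').
## 'repr = "R"' requests row-major (CSR) storage; abs() makes the values
## non-negative, as implicit-feedback confidence scores should be.
library(Matrix)
set.seed(1)
X <- abs(rsparsematrix(4, 6, 0.5, repr = "R"))
inherits(X, "dgRMatrix")  # TRUE: suitable for 'X_train' / 'X_test'
```

A dense or CSC matrix could instead be converted with 'MatrixExtra::as.csr.matrix', as mentioned above.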

X_test

Test data for user-item interactions. Same format as 'X_train'.

A

The user factors. If the number of users is 'm' and the number of factors is 'p', should have dimension '[p, m]' if passing 'by_rows=FALSE' (the default), or dimension '[m, p]' if passing 'by_rows=TRUE' (in which case it will be internally transposed due to R's column-major storage order). Can be passed as a dense matrix from base R (class 'matrix'), or as a matrix from package float (class 'float32') - if passed as 'float32', will do the calculations in single precision (which is faster and uses less memory) and output the calculated metrics as 'float32' arrays.

It is assumed that the model score for a given item 'j' for user 'i' is calculated as the inner product or dot product between the corresponding vectors \mathbf{a}_i \cdot \mathbf{b}_j (columns 'i' and 'j' of 'A' and 'B', respectively, when passing 'by_rows=FALSE'), with higher scores meaning that the item is deemed better for that user, and the top-K recommendations produced by ranking these scores in descending order.

Alternatively, for evaluation of non-personalized models, can pass 'NULL' here and for 'B', in which case 'item_biases' must be passed.
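The scoring rule described above can be sketched in a few lines of base R (toy numbers, assuming the 'by_rows=FALSE' column layout):

```r
## Factors stored one column per user/item, as with by_rows=FALSE.
set.seed(123)
p <- 2L; m <- 3L; n <- 4L
A <- matrix(rnorm(p * m), nrow = p)  # user factors, dimension [p, m]
B <- matrix(rnorm(p * n), nrow = p)  # item factors, dimension [p, n]
scores <- crossprod(A, B)            # [m, n]; scores[i, j] = a_i . b_j
top_items_user1 <- order(scores[1L, ], decreasing = TRUE)  # ranking for user 1
```

This is only a sketch of the assumed prediction rule; the package computes these products internally via BLAS, without materializing the full score matrix for all users at once.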

B

The item factors, in the same format as 'A'.

k

The number of top recommendations to consider for the metrics (as in "precision-at-k" or "P@K").

item_biases

Optional item biases/intercepts (fixed base score that is added to the predictions of each item). If passed, the biases will be appended to 'B' as an extra factor, with a matching factor of all-ones appended to 'A'.

Alternatively, for non-personalized models which have only item-by-item scores, can pass 'NULL' for 'A' and 'B' while passing only 'item_biases'.
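The bias handling described above amounts to the following (toy matrices; a sketch of the documented behavior, not the package's internal code):

```r
## Appending item biases as an extra latent factor: the biases become an
## extra row of B, matched by a row of ones in A, so each item's score
## gains its fixed base score through the dot product.
A <- matrix(1:4, nrow = 2)            # toy user factors [p, m]
B <- matrix(1:6, nrow = 2)            # toy item factors [p, n]
item_biases <- c(0.5, -0.2, 0.1)      # one bias per item
A_ext <- rbind(A, 1)                  # extra all-ones factor for users
B_ext <- rbind(B, item_biases)        # biases appended to the item factors
score_11 <- crossprod(A_ext, B_ext)[1, 1]
score_11 == sum(A[, 1] * B[, 1]) + item_biases[1]  # TRUE
```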

as_df

Whether to output the result as a 'data.frame'. If passing 'FALSE', the results will be returned as a list of vectors or matrices depending on what is passed for 'cumulative'. If 'A' and 'B' are passed as 'float32' matrices, the resulting 'float32' arrays will be converted to base R's arrays in order to be able to create a 'data.frame'.

by_rows

Whether the latent factors/components are ordered by rows, in which case they will be transposed beforehand (see documentation for 'A').

sort_indices

Whether to sort the indices of the 'X' data in case they are not sorted already. Skipping this step (by passing 'sort_indices=FALSE') makes the function faster and reduces its memory usage, but is only safe when the indices are already sorted.

If the 'X_train' and 'X_test' matrices were created using functions from the 'Matrix' package such as 'Matrix::spMatrix' or 'Matrix::Matrix', the indices will always be sorted, but if creating them manually through S4 methods or as the output of other software, the indices can end up unsorted.

precision

Whether to calculate precision metrics or not.

trunc_precision

Whether to calculate truncated precision metrics or not. Note that this is output as a separate metric from "precision" and they are not mutually exclusive options.

recall

Whether to calculate recall metrics or not.

average_precision

Whether to calculate average precision metrics or not.

trunc_average_precision

Whether to calculate truncated average precision metrics or not. Note that this is output as a separate metric from "average_precision" and they are not mutually exclusive options.

ndcg

Whether to calculate NDCG (normalized discounted cumulative gain) metrics or not.

hit

Whether to calculate Hit metrics or not.

rr

Whether to calculate RR (reciprocal rank) metrics or not.

roc_auc

Whether to calculate ROC-AUC (area under the ROC curve) metrics or not.

pr_auc

Whether to calculate PR-AUC (area under the PR curve) metrics or not.

all_metrics

Passing 'TRUE' here is equivalent to passing 'TRUE' to all the individual metric arguments.

rename_k

If passing 'as_df=TRUE' and 'cumulative=FALSE', whether to rename the 'k' in the resulting column names to the actual value of 'k' that was used (e.g. "p_at_k" -> "p_at_5").

break_ties_with_noise

Whether to add a small amount of noise '~Uniform(-10^-12, 10^-12)' in order to break ties at random, in case there are any ties in the ranking. This is not recommended unless one expects ties (can happen if e.g. some factors are set to all-zeros for some items), as it has the potential to slightly alter the ranking.

min_pos_test

Minimum number of positive entries (non-zero entries in the test set) that users need to have in order to calculate metrics for that user. If a given user does not meet the threshold, the metrics will be set to 'NA'.

min_items_pool

Minimum number of items (sum of positive and negative items) that a user must have in order to calculate metrics for that user. If a given user does not meet the threshold, the metrics will be set to 'NA'.

consider_cold_start

Whether to calculate metrics in situations in which some user has test data but no positive (non-zero) entries in the training data. If passing 'FALSE' and such cases are encountered, the metrics will be set to 'NA'.

Will be automatically set to 'TRUE' when passing 'NULL' for 'X_train'.

cumulative

Whether to calculate the metrics cumulatively (e.g. [P@1, P@2, P@3] if passing 'k=3') for all values up to 'k', or only for a single desired 'k' (e.g. only P@3 if passing 'k=3').

nthreads

Number of parallel threads to use. Parallelization is done at the user level, so passing more threads than there are users will not result in a speed-up. Be aware that memory consumption grows with the number of threads used.

seed

Seed used for random number generation. Only used when passing 'break_ties_with_noise=TRUE'.

Details

Metrics for a given user will be set to 'NA' in the following situations:

- The user has fewer than 'min_pos_test' positive (non-zero) entries in the test data.
- The user has fewer than 'min_items_pool' items in total (positive plus negative).
- The user has no positive entries in the training data and 'consider_cold_start=FALSE' was passed.

The NDCG@K metric with 'cumulative=TRUE' will have lower decimal precision than with 'cumulative=FALSE' when using 'float32' inputs - this is extremely unlikely to be noticeable in typical datasets and small 'k', but for large 'k' and large (absolute) values in 'X_test', it might make a difference after a couple of decimal points.

Internally, it relies on BLAS function calls, so it's recommended to use R with an optimized BLAS library such as OpenBLAS or MKL for better speed - for example, OpenBLAS can be linked into R on Windows, and Microsoft's R distribution comes with MKL preinstalled.

Doing computations in 'float32' precision depends on the package 'float', and as such comes with some caveats, chief among them lower numeric precision than base R's double-precision arithmetic.

Value

Will return the calculated metrics on a per-user basis (each user corresponding to a row). With 'as_df=TRUE' (the default), this is a 'data.frame' with one column per metric; with 'as_df=FALSE', it is a list containing one vector per metric (when 'cumulative=FALSE') or one matrix per metric with 'k' columns (when 'cumulative=TRUE').

The 'ROC-AUC' and 'PR-AUC' metrics will be named just "roc_auc" and "pr_auc", since they are calculated for the full ranked predictions without stopping at 'k'.

Examples

### (See the package vignette for a better example)
library(recometrics)
library(Matrix)
library(MatrixExtra)

### Generating random data
n_users <- 10L
n_items <- 20L
n_factors <- 3L
k <- 4L
set.seed(1)
UserFactors <- matrix(rnorm(n_users * n_factors), nrow=n_factors)
ItemFactors <- matrix(rnorm(n_items * n_factors), nrow=n_factors)
X <- Matrix::rsparsematrix(n_users, n_items, .5, repr="R")
X <- abs(X)

### Generating a random train-test split
data_split <- create.reco.train.test(X, split_type="all")
X_train <- data_split$X_train
X_test <- data_split$X_test

### Calculating these metrics
### (should be poor quality, since data is random)
metrics <- calc.reco.metrics(
    X_train, X_test,
    UserFactors, ItemFactors,
    k=k, as_df=TRUE,
    nthreads=1L
)
print(metrics)

Create Train-Test Splits of Implicit-Feedback Data

Description

Creates train-test splits of implicit-feedback data (recorded user-item interactions) for fitting and evaluating models for recommender systems.

These splits choose "test users" and "items for a given user" separately, offering three modes of splitting the data, selected through the 'split_type' argument: 'all', 'separated', or 'joined'.

Usage

create.reco.train.test(
  X,
  split_type = "separated",
  users_test_fraction = 0.1,
  max_test_users = 10000L,
  items_test_fraction = 0.3,
  min_items_pool = 2L,
  min_pos_test = 1L,
  consider_cold_start = FALSE,
  seed = 1L
)

Arguments

X

The implicit feedback data to split into training-testing-remainder for evaluating recommender systems. Should be passed as a sparse CSR matrix from the 'Matrix' package (class 'dgRMatrix'). Users should correspond to rows, items to columns, and non-zero values to observed user-item interactions.

split_type

Type of data split to generate. Allowed values are: 'all', 'separated', 'joined' (see the function description above for more details).

users_test_fraction

Target fraction of the users to set as test (see the function documentation for more details). If the number represented by this fraction exceeds the number set by 'max_test_users', then the actual number will be set to 'max_test_users'. Note however that the result might contain fewer users if there are not enough users in the data meeting the minimum desired criteria.

If passing 'NULL', will not take a fraction, but will instead take the number that is passed for 'max_test_users'.

Ignored when passing 'split_type="all"'.

max_test_users

Maximum number of users to set as test. Note that this will only be applied for choosing the minimum between this and 'nrow(X)*users_test_fraction', while the actual number might end up being lower if there are not enough users meeting the desired minimum conditions.

If passing 'NULL' for 'users_test_fraction', will interpret this as the number of test users to take.

Ignored when passing 'split_type="all"'.

items_test_fraction

Target fraction of the data (items) to set for test for each user. Should be a number between zero and one (non-inclusive). The actual number of test entries for each user will be determined as 'round(n_entries_user*items_test_fraction)', thus in a long-tailed distribution (typical for recommender systems), the actual fraction that will be obtained is likely to be slightly lower than what is passed here.

Note that items are sampled independently for each user, thus the items that are in the test set for some users might be in the training set for different users.
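As a small worked illustration of the rounding rule above (using R's default 'round', which rounds halves to the nearest even number):

```r
## A user with 5 interactions and items_test_fraction = 0.3 contributes
## round(5 * 0.3) = round(1.5) = 2 test items (R rounds halves to even);
## a user with 3 interactions contributes round(0.9) = 1 test item.
round(5 * 0.3)  # 2
round(3 * 0.3)  # 1
```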

min_items_pool

Minimum number of items (sum of positive and negative items) that a user must have in order to be eligible as test user.

min_pos_test

Minimum number of positive entries (non-zero entries in the test set) that users would need to have in order to be eligible as test user.

consider_cold_start

Whether to still set users as eligible for test in situations in which some user would have test data but no positive (non-zero) entries in the training data. This can only happen when passing 'items_test_fraction>=0.5'.

seed

Seed to use for random number generation.

Value

Will return a list with two to four elements, depending on the requested split type.

The training and testing items for each user will not intersect, and each item in the original 'X' data for a given test user will be assigned to either the training or the testing sets.
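The non-intersection guarantee above can be checked on any split; here is a minimal sketch with hand-made matrices standing in for the function's output (only the 'Matrix' package is used):

```r
## Hypothetical train/test matrices standing in for create.reco.train.test()
## output: no user-item pair may be non-zero in both matrices.
library(Matrix)
X_train <- sparseMatrix(i = c(1, 2), j = c(1, 3), x = 1, dims = c(2, 4))
X_test  <- sparseMatrix(i = c(1, 2), j = c(2, 4), x = 1, dims = c(2, 4))
overlap <- sum((X_train != 0) & (X_test != 0))
overlap == 0  # TRUE: training and testing entries do not intersect
```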