Version: 1.0.4
Title: Prediction Explanation with Dependence-Aware Shapley Values
Description: Complex machine learning models are often hard to interpret. However, in many situations it is crucial to understand and explain why a model made a specific prediction. Shapley values are the only prediction explanation framework with a solid theoretical foundation. Previously known methods for estimating Shapley values do, however, assume feature independence. This package implements methods that account for any feature dependence, and thereby produce more accurate estimates of the true Shapley values. An accompanying 'Python' wrapper ('shaprpy') is available through the GitHub repository.
URL: https://norskregnesentral.github.io/shapr/, https://github.com/NorskRegnesentral/shapr/
BugReports: https://github.com/NorskRegnesentral/shapr/issues
License: MIT + file LICENSE
Encoding: UTF-8
ByteCompile: true
Language: en-US
RoxygenNote: 7.3.2
Depends: R (≥ 3.5.0)
Imports: stats, data.table (≥ 1.15.0), Rcpp (≥ 0.12.15), Matrix, future.apply, methods, cli, rlang
Suggests: ranger, xgboost, mgcv, testthat (≥ 3.0.0), knitr, rmarkdown, roxygen2, ggplot2, gbm, party, partykit, waldo, progressr, future, ggbeeswarm, vdiffr, forecast, torch, GGally, coro, parsnip, recipes, workflows, tune, dials, yardstick, hardhat, rsample
LinkingTo: RcppArmadillo, Rcpp
VignetteBuilder: knitr
Config/testthat/edition: 3
NeedsCompilation: yes
Packaged: 2025-04-28 11:26:18 UTC; jullum
Author: Martin Jullum [cre, aut], Lars Henry Berge Olsen [aut], Annabelle Redelmeier [aut], Jon Lachmann [aut], Nikolai Sellereite [aut], Anders Løland [ctb], Jens Christian Wahl [ctb], Camilla Lingjærde [ctb], Norsk Regnesentral [cph, fnd]
Maintainer: Martin Jullum <Martin.Jullum@nr.no>
Repository: CRAN
Date/Publication: 2025-04-28 13:00:02 UTC

shapr: Prediction Explanation with Dependence-Aware Shapley Values

Description

Complex machine learning models are often hard to interpret. However, in many situations it is crucial to understand and explain why a model made a specific prediction. Shapley values are the only prediction explanation framework with a solid theoretical foundation. Previously known methods for estimating Shapley values do, however, assume feature independence. This package implements methods that account for any feature dependence, and thereby produce more accurate estimates of the true Shapley values. An accompanying 'Python' wrapper ('shaprpy') is available through the GitHub repository.

Author(s)

Maintainer: Martin Jullum <Martin.Jullum@nr.no>

Authors:

Lars Henry Berge Olsen [aut]

Annabelle Redelmeier [aut]

Jon Lachmann [aut]

Nikolai Sellereite [aut]

Other contributors:

Anders Løland [ctb]

Jens Christian Wahl [ctb]

Camilla Lingjærde [ctb]

Norsk Regnesentral [cph, fnd]

See Also

Useful links:

https://norskregnesentral.github.io/shapr/

https://github.com/NorskRegnesentral/shapr/

Report bugs at https://github.com/NorskRegnesentral/shapr/issues

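A minimal quick-start sketch using the built-in airquality data (an illustration only; see the explain() documentation below for complete examples):

library(shapr)

data("airquality")
df <- airquality[complete.cases(airquality), ]
x_var <- c("Solar.R", "Wind", "Temp")

# Fit a simple model and explain the predictions for the last three rows
model <- lm(Ozone ~ Solar.R + Wind + Temp, data = head(df, -3))

explanation <- explain(
  model = model,
  x_explain = tail(df[, x_var], 3),
  x_train = head(df[, x_var], -3),
  approach = "gaussian",
  phi0 = mean(head(df, -3)$Ozone)
)
print(explanation$shapley_values_est)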

Additional setup for regression-based methods

Description

Additional setup for regression-based methods

Usage

additional_regression_setup(internal, model, predict_model)

Arguments

internal

List. Not used directly, but passed through from explain().

Value

The (updated) internal list


AICc formula for several sets, alternative definition

Description

AICc formula for several sets, alternative definition

Usage

aicc_full_cpp(h, X_list, mcov_list, S_scale_dist, y_list, negative)

Arguments

h

Numeric. Specifies the scaling (sigma).

X_list

List. Contains matrices with the appropriate features of the training data

mcov_list

List. Contains the covariance matrices of the matrices in X_list

S_scale_dist

Logical. Indicates whether Mahalanobis distance should be scaled with the number of variables.

y_list

List. Contains the appropriate (temporary) response variables.

negative

Logical. Whether to return the negative of the AICc value.

Value

Scalar with the numeric value of the AICc formula

Author(s)

Martin Jullum


Temporary function for computing the full AICc with several X matrices

Description

Temporary function for computing the full AICc with several X matrices

Usage

aicc_full_single_cpp(X, mcov, S_scale_dist, h, y)

Arguments

X

Matrix with the appropriate features of the training data.

mcov

Matrix. The covariance matrix of X.

S_scale_dist

Logical. Indicates whether the Mahalanobis distance should be scaled with the number of variables.

h

Numeric. Specifies the scaling (sigma).

y

Vector. Represents the (temporary) response variable.

Value

Scalar with the numeric value of the AICc formula.

Author(s)

Martin Jullum
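Examples

A direct-call sketch of this internal routine (accessed via shapr::: since it is not part of the user-facing API; the argument values are illustrative and the exact accepted types are an assumption based on the Usage section above):

set.seed(1)
X <- matrix(rnorm(20), ncol = 2)  # 10 observations, 2 features
mcov <- stats::cov(X)
y <- rnorm(10)
shapr:::aicc_full_single_cpp(X = X, mcov = mcov, S_scale_dist = TRUE, h = 0.5, y = y)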


Appends the new vS_list to the previous vS_list

Description

Appends the new vS_list to the previous vS_list

Usage

append_vS_list(vS_list, internal)

Arguments

vS_list

List. Output from compute_vS().

internal

List. Not used directly, but passed through from explain().

Value

The vS_list after being merged with previously computed vS_lists (stored in internal)


A torch::nn_module() Representing a categorical_to_one_hot_layer

Description

The categorical_to_one_hot_layer module/layer expands categorical features into one-hot vectors, as multi-layer perceptrons are known to work better with this data representation. It also replaces NaNs with zeros so that subsequent layers work correctly.

Usage

categorical_to_one_hot_layer(
  one_hot_max_sizes,
  add_nans_map_for_columns = NULL
)

Arguments

one_hot_max_sizes

A torch tensor of dimension n_features containing the one-hot sizes of the n_features features. That is, if the ith feature is a categorical feature with 5 levels, then one_hot_max_sizes[i] = 5. The size for continuous features can be either 0 or 1.

add_nans_map_for_columns

Optional list containing the indices of columns whose is_nan masks are to be appended to the result tensor. This option is necessary for the full encoder to distinguish whether a value is to be reconstructed or not.

Details

Note that the module works with mixed data represented as 2-dimensional inputs, and it works correctly with missing values in the ground truth as long as they are represented by NaNs.

Author(s)

Lars Henry Berge Olsen
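Examples

A construction-only sketch (the module is internal, so shapr::: access is assumed, and the torch package is required; the forward-pass call signature is internal and therefore omitted):

if (requireNamespace("torch", quietly = TRUE)) {
  # One categorical feature with 5 levels, followed by two continuous features
  layer <- shapr:::categorical_to_one_hot_layer(one_hot_max_sizes = c(5, 1, 1))
}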


Check that all explicands have at least one valid MC sample when computing causal Shapley values

Description

Check that all explicands have at least one valid MC sample when computing causal Shapley values

Usage

check_categorical_valid_MCsamp(dt, n_explain, n_MC_samples, joint_prob_dt)

Arguments

dt

Data.table containing the generated MC samples (and conditional values) after each sampling step

n_MC_samples

Positive integer. For most approaches, it indicates the maximum number of samples to use in the Monte Carlo integration of every conditional expectation. For approach="ctree", n_MC_samples corresponds to the number of samples from the leaf node (see an exception related to the ctree.sample argument in setup_approach.ctree()). For approach="empirical", n_MC_samples is the K parameter in equations (14-15) of Aas et al. (2021), i.e. the maximum number of observations (with the largest weights) that are used; see also the empirical.eta argument in setup_approach.empirical().

Details

For undocumented arguments, see setup_approach.categorical().

Author(s)

Lars Henry Berge Olsen


Checks the convergence according to the convergence threshold

Description

Checks the convergence according to the convergence threshold

Usage

check_convergence(internal)

Arguments

internal

List. Not used directly, but passed through from explain().

Value

The (updated) internal list


Check that the group parameter has the right form and content

Description

Check that the group parameter has the right form and content

Usage

check_groups(feature_names, group)
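As an illustration, a sketch of a valid group specification (check_groups() is internal, so shapr::: access is assumed here; the group list partitions the feature names, as for the group argument of explain()):

feature_names <- c("Temp", "Wind", "Month", "Day")
group <- list(A = c("Temp", "Wind"), B = c("Month", "Day"))
shapr:::check_groups(feature_names = feature_names, group = group)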

Function that checks the verbose parameter

Description

Function that checks the verbose parameter

Usage

check_verbose(verbose)

Arguments

verbose

String vector or NULL. Specifies the verbosity (printout detail level) through one or more of the strings "basic", "progress", "convergence", "shapley" and "vS_details". "basic" (default) displays basic information about the computation being performed, in addition to some messages about parameters being set or checks being unavailable due to specific input. "progress" displays information about where in the calculation process the function currently is. "convergence" displays information on how close to convergence the Shapley value estimates are (only when iterative = TRUE). "shapley" displays intermediate Shapley value estimates and standard deviations (only when iterative = TRUE) and the final estimates. "vS_details" displays information about the v(S) estimates. This is most relevant for approach %in% c("regression_separate", "regression_surrogate", "vaeac"). NULL means no printout. Note that any combination of these strings can be used. E.g. verbose = c("basic", "vS_details") displays basic information plus details about the v(S) estimation process.

Value

The function does not return anything.

Author(s)

Lars Henry Berge Olsen, Martin Jullum
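Examples

A usage sketch (check_verbose() is called internally by explain(), so shapr::: access is assumed here):

# Validates without error:
shapr:::check_verbose(c("basic", "convergence"))
# In practice the same values are passed to explain(), e.g.:
# explain(..., verbose = c("basic", "vS_details"))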


Printing messages in compute_vS with cli

Description

Printing messages in compute_vS with cli

Usage

cli_compute_vS(internal)

Arguments

internal

List. Not used directly, but passed through from explain().

Value

No return value (but prints compute_vS messages with cli)


Printing messages in iterative procedure with cli

Description

Printing messages in iterative procedure with cli

Usage

cli_iter(verbose, internal, iter)

Arguments

verbose

String vector or NULL. Specifies the verbosity (printout detail level) through one or more of the strings "basic", "progress", "convergence", "shapley" and "vS_details". "basic" (default) displays basic information about the computation being performed, in addition to some messages about parameters being set or checks being unavailable due to specific input. "progress" displays information about where in the calculation process the function currently is. "convergence" displays information on how close to convergence the Shapley value estimates are (only when iterative = TRUE). "shapley" displays intermediate Shapley value estimates and standard deviations (only when iterative = TRUE) and the final estimates. "vS_details" displays information about the v(S) estimates. This is most relevant for approach %in% c("regression_separate", "regression_surrogate", "vaeac"). NULL means no printout. Note that any combination of these strings can be used. E.g. verbose = c("basic", "vS_details") displays basic information plus details about the v(S) estimation process.

internal

List. Not used directly, but passed through from explain().

iter

Integer. The iteration number. Only used internally.

Value

No return value (but prints iterative messages with cli)


Printing startup messages with cli

Description

Printing startup messages with cli

Usage

cli_startup(internal, model_class, verbose)

Arguments

internal

List. Not used directly, but passed through from explain().

model_class

String. The class of the model.

verbose

String vector or NULL. Specifies the verbosity (printout detail level) through one or more of the strings "basic", "progress", "convergence", "shapley" and "vS_details". "basic" (default) displays basic information about the computation being performed, in addition to some messages about parameters being set or checks being unavailable due to specific input. "progress" displays information about where in the calculation process the function currently is. "convergence" displays information on how close to convergence the Shapley value estimates are (only when iterative = TRUE). "shapley" displays intermediate Shapley value estimates and standard deviations (only when iterative = TRUE) and the final estimates. "vS_details" displays information about the v(S) estimates. This is most relevant for approach %in% c("regression_separate", "regression_surrogate", "vaeac"). NULL means no printout. Note that any combination of these strings can be used. E.g. verbose = c("basic", "vS_details") displays basic information plus details about the v(S) estimation process.

Value

No return value (but prints startup messages with cli)


Create a header topline with cli

Description

Create a header topline with cli

Usage

cli_topline(verbose, testing, init_time, type, is_python)

Arguments

verbose

String vector or NULL. Specifies the verbosity (printout detail level) through one or more of the strings "basic", "progress", "convergence", "shapley" and "vS_details". "basic" (default) displays basic information about the computation being performed, in addition to some messages about parameters being set or checks being unavailable due to specific input. "progress" displays information about where in the calculation process the function currently is. "convergence" displays information on how close to convergence the Shapley value estimates are (only when iterative = TRUE). "shapley" displays intermediate Shapley value estimates and standard deviations (only when iterative = TRUE) and the final estimates. "vS_details" displays information about the v(S) estimates. This is most relevant for approach %in% c("regression_separate", "regression_surrogate", "vaeac"). NULL means no printout. Note that any combination of these strings can be used. E.g. verbose = c("basic", "vS_details") displays basic information plus details about the v(S) estimation process.

testing

Logical. Only used to remove random components like timing from the object output when comparing output with testthat. Defaults to FALSE.

init_time

POSIXct object. The time when the explain() function was called, as returned by Sys.time(). Used to calculate the time it took to run the full explain() call.

type

Character. Either "regular" or "forecast", corresponding to the function that setup() is called from, and correspondingly the type of explanation to generate.

is_python

Logical. Indicates whether the function is called from the Python wrapper. The default is FALSE, which is never changed when calling the function via explain() in R. The parameter is later used to disallow running the AICc versions of the empirical method, as they require data-based optimization, which is not supported in shaprpy.

Value

No return value (but prints header with cli unless verbose is NULL)


Get coalition matrix

Description

Get coalition matrix

Usage

coalition_matrix_cpp(coalitions, m)

Arguments

coalitions

List. Each element is an integer vector representing a valid combination of features/feature groups.

m

Integer. Number of features/feature groups.

Value

Matrix

Author(s)

Nikolai Sellereite, Martin Jullum
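Examples

A small sketch (internal function accessed via shapr:::; arguments per the Usage section above):

coalitions <- list(integer(0), 1L, 2L, c(1L, 2L))
S <- shapr:::coalition_matrix_cpp(coalitions, m = 2L)
# S is a 4 x 2 binary matrix; row i has ones in the columns listed in coalitions[[i]]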


Mean Squared Error of the Contribution Function v(S)

Description

Function that computes the Mean Squared Error (MSEv) of the contribution function v(S), as proposed by Frye et al. (2019) and used by Olsen et al. (2022).

Usage

compute_MSEv_eval_crit(
  internal,
  dt_vS,
  MSEv_uniform_comb_weights,
  MSEv_skip_empty_full_comb = TRUE
)

Arguments

internal

List. Holds all parameters, data, functions and computed objects used within explain(). The list contains one or more of the elements parameters, data, objects, iter_list, timing_list, main_timing_list, output, and iter_timing_list.

dt_vS

Data.table of dimension n_coalitions times n_explain + 1 containing the contribution function estimates. The first column is assumed to be named id_coalition and to contain the ids of the coalitions. The last row is assumed to correspond to the full coalition, i.e., it contains the predicted responses for the observations which are to be explained.

MSEv_uniform_comb_weights

Logical. If TRUE (default), the function weights the coalitions uniformly when computing the MSEv criterion. If FALSE, the function uses the Shapley kernel weights to weight the coalitions when computing the MSEv criterion. Note that the Shapley kernel weights are replaced by the sampling frequency when not all coalitions are considered.

MSEv_skip_empty_full_comb

Logical. If TRUE (default), we exclude the empty and grand coalitions when computing the MSEv evaluation criterion. This is reasonable as they are identical for all methods, i.e., their contribution function is independent of the method used, as they are special cases not affected by it. If FALSE, we include the empty and grand coalitions. In this situation, we also recommend setting MSEv_uniform_comb_weights = TRUE, as otherwise the large weights for the empty and grand coalitions will outweigh all other coalitions and make the MSEv criterion uninformative.

Details

The MSEv evaluation criterion relies on access to neither the true contribution functions nor the true Shapley values. A lower value indicates better approximations; however, the scale and magnitude of the MSEv criterion are not directly interpretable with regard to the precision of the final estimated Shapley values. Olsen et al. (2024) illustrates in Figure 11 a fairly strong linear relationship between the MSEv criterion and the MAE between the estimated and true Shapley values in a simulation study. Note that explicands refer to the observations whose predictions we are to explain.
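For orientation, a sketch of the criterion with uniform weights over the included coalitions (an illustration based on the description above, not necessarily the exact internal formula; f(x_i) denotes the model prediction for explicand i, \hat{v}(S, x_i) the estimated contribution function, N_S the number of included coalitions, and N_e the number of explicands):

MSE_v = \frac{1}{N_S N_e} \sum_{S} \sum_{i=1}^{N_e} \left( f(x_i) - \hat{v}(S, x_i) \right)^2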

Value

List containing:

MSEv

A data.table with the overall MSEv evaluation criterion averaged over both the coalitions and observations/explicands. The data.table also contains the standard deviation of the MSEv values for each explicand (only averaged over the coalitions) divided by the square root of the number of explicands.

MSEv_explicand

A data.table with the mean squared error for each explicand, i.e., only averaged over the coalitions.

MSEv_coalition

A data.table with the mean squared error for each coalition, i.e., only averaged over the explicands/observations. The data.table also contains the standard deviation of the MSEv values for each coalition divided by the square root of the number of explicands.

Author(s)

Lars Henry Berge Olsen

References


Computes the Shapley values and their standard deviation given the v(S)

Description

Computes the Shapley values and their standard deviation given the v(S)

Usage

compute_estimates(internal, vS_list)

Arguments

internal

List. Not used directly, but passed through from explain().

vS_list

List. Output from compute_vS().

Value

The (updated) internal list


Compute Shapley values

Description

Compute Shapley values

Usage

compute_shapley(internal, dt_vS)

Arguments

internal

List. Holds all parameters, data, functions and computed objects used within explain(). The list contains one or more of the elements parameters, data, objects, iter_list, timing_list, main_timing_list, output, and iter_timing_list.

dt_vS

The contribution matrix.

Value

A data.table with Shapley values for each test observation.
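For orientation, the kernelSHAP estimator underlying this computation solves a weighted least squares problem. A standard formulation (a sketch, not necessarily the exact internal implementation) is

\hat{\phi} = (Z^\top W Z)^{-1} Z^\top W v,

where Z is the binary coalition matrix, W the diagonal matrix of Shapley kernel weights, and v the vector of estimated contribution function values for an explicand.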


Gathers and computes the timing of the different parts of the explain function.

Description

Gathers and computes the timing of the different parts of the explain function.

Usage

compute_time(internal)

Arguments

internal

List. Not used directly, but passed through from explain().

Value

List of reformatted timing information


Computes v(S) for all feature subsets S.

Description

Computes v(S) for all feature subsets S.

Usage

compute_vS(internal, model, predict_model)

Arguments

internal

List. Not used directly, but passed through from explain().

Value

List of v(S) for different coalitions S, optionally also with the samples used to estimate v(S)


Convert feature names into feature indices

Description

Function that takes a causal_ordering specified using strings and converts these strings to feature indices.

Usage

convert_feature_name_to_idx(causal_ordering, labels, feat_group_txt)

Arguments

causal_ordering

List. Only used for asymmetric and/or causal Shapley value explanations; not applicable for regular (symmetric, non-causal) explanations. causal_ordering is an unnamed list of vectors specifying the components of the partial causal ordering that the coalitions must respect. Each vector represents a component and contains one or more features/groups identified by their names (strings) or indices (integers). If causal_ordering is NULL (default), no causal ordering is assumed and all possible coalitions are allowed. No causal ordering is equivalent to a causal ordering with a single component that includes all features (list(1:n_features)) or groups (list(1:n_groups)) for feature-wise and group-wise Shapley values, respectively. For feature-wise Shapley values and causal_ordering = list(c(1, 2), c(3, 4)), the interpretation is that features 1 and 2 are the ancestors of features 3 and 4, while features 3 and 4 are on the same level. Note: All features/groups must be included in the causal_ordering without any duplicates.

labels

Vector of strings containing (the order of) the feature names.

feat_group_txt

String that is either "feature" or "group", based on whether shapr is computing feature-wise or group-wise Shapley values.

Value

The causal_ordering list, but with feature indices (w.r.t. labels) instead of feature names.

Author(s)

Lars Henry Berge Olsen
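Examples

A small sketch (internal function accessed via shapr:::):

shapr:::convert_feature_name_to_idx(
  causal_ordering = list(c("A", "B"), "C"),
  labels = c("A", "B", "C"),
  feat_group_txt = "feature"
)
# Expected result: list(c(1, 2), 3)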


Correction term with trace_input in AICc formula

Description

Correction term with trace_input in AICc formula

Usage

correction_matrix_cpp(tr_H, n)

Arguments

tr_H

Numeric. The trace of H.

n

Numeric. The number of rows in H.

Value

Scalar

Author(s)

Martin Jullum


Define coalitions, and fetch additional information about each unique coalition

Description

Define coalitions, and fetch additional information about each unique coalition

Usage

create_coalition_table(
  m,
  exact = TRUE,
  n_coalitions = 200,
  n_coal_each_size = choose(m, seq(m - 1)),
  weight_zero_m = 10^6,
  paired_shap_sampling = TRUE,
  prev_X = NULL,
  n_samps_scale = 10,
  coal_feature_list = as.list(seq_len(m)),
  approach0 = "gaussian",
  kernelSHAP_reweighting = "none",
  semi_deterministic_sampling = FALSE,
  dt_coal_samp_info = NULL,
  dt_valid_causal_coalitions = NULL
)

Arguments

m

Positive integer. Total number of features/groups.

exact

Logical. If TRUE all 2^m coalitions are generated, otherwise a subsample of the coalitions is used.

n_coalitions

Positive integer. Note that if exact = TRUE, n_coalitions is ignored.

n_coal_each_size

Vector of integers of length m-1. The number of valid coalitions of each coalition size 1, 2,..., m-1. For symmetric Shapley values, this is choose(m, seq(m-1)) (default). While for asymmetric Shapley values, this is the number of valid coalitions of each size in the causal ordering. Used to correctly normalize the Shapley weights.

weight_zero_m

Numeric. The value to use as a replacement for infinite coalition weights when doing numerical operations.

paired_shap_sampling

Logical. Whether to do paired sampling of coalitions.

prev_X

data.table. The X data.table from the previous iteration.

n_samps_scale

Positive integer. Factor by which to oversample coalitions relative to n_coalitions. Sampling coalitions is cheap, while checking for n_coalitions unique coalitions is expensive; we therefore sample n_samps_scale times as many coalitions, keep only those up to the point where n_coalitions unique coalitions are obtained, and discard the rest.

coal_feature_list

List. A list mapping each coalition to the features it contains.

approach0

Character vector. Contains the approach to be used for estimation of each coalition size. Same as approach in explain().

kernelSHAP_reweighting

String. How to reweight the sampling frequency weights in the kernelSHAP solution after sampling. The aim of this is to reduce the randomness and thereby the variance of the Shapley value estimates. The options are one of 'none', 'on_N', 'on_all', 'on_all_cond' (default). 'none' means no reweighting, i.e. the sampling frequency weights are used as is. 'on_N' means the sampling frequencies are averaged over all coalitions with the same original sampling probabilities. 'on_all' means the original sampling probabilities are used for all coalitions. 'on_all_cond' means the original sampling probabilities are used for all coalitions, while adjusting for the probability that they are sampled at least once. 'on_all_cond' is preferred as it performs the best in simulation studies, see Olsen & Jullum (2024).

semi_deterministic_sampling

Logical. If FALSE (default), we sample from all coalitions. If TRUE, the sampling of coalitions is semi-deterministic: coalitions that are expected to be sampled, given the number of coalitions, are deterministically included, and sampling is carried out among the remaining coalitions only. This reduces the variance of the Shapley value estimates, and corresponds to the PySHAP* strategy in the paper by Olsen & Jullum (2024).

dt_coal_samp_info

data.table. Contains information about which coalitions should be deterministically included and which can be sampled, in addition to the sampling probabilities of each available coalition size, and the weight given to the sampled and deterministically included coalitions (excluding the empty and grand coalitions, which are given the weight_zero_m weight).

dt_valid_causal_coalitions

data.table. Only applicable for asymmetric Shapley value explanations, and NULL for symmetric Shapley values. The data.table contains information about the coalitions that respect the causal ordering.

Value

A data.table with info about the coalitions to use

Author(s)

Nikolai Sellereite, Martin Jullum, Lars Henry Berge Olsen


Build all the conditional inference trees

Description

Build all the conditional inference trees

Usage

create_ctree(
  given_ind,
  x_train,
  mincriterion,
  minsplit,
  minbucket,
  use_partykit = "on_error"
)

Arguments

given_ind

Integer vector. Indicates which features are conditioned on.

x_train

Data.table with training data.

use_partykit

String. In some semi-rare cases party::ctree() runs into an error related to the LINPACK routines used by R. To get around this problem, one may fall back to using the newer (but slower) partykit::ctree() function, which is a reimplementation of the same method. Setting this parameter to "on_error" (default) falls back to partykit::ctree() if party::ctree() fails. Other options are "never", which always uses party::ctree(), and "always", which always uses partykit::ctree(). A warning message is created whenever partykit::ctree() is used.

Details

See the documentation of the setup_approach.ctree() function for undocumented parameters.

Value

List with conditional inference tree and the variables conditioned/not conditioned on.

Author(s)

Annabelle Redelmeier, Martin Jullum
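Examples

A sketch of building the trees for one conditioning set (internal function accessed via shapr:::; requires the party package, and the argument values are illustrative):

if (requireNamespace("party", quietly = TRUE)) {
  set.seed(1)
  x_train <- data.table::data.table(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
  # Condition on features 1 and 2
  tree <- shapr:::create_ctree(
    given_ind = c(1L, 2L),
    x_train = x_train,
    mincriterion = 0.95,
    minsplit = 20,
    minbucket = 7
  )
}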


Create marginal categorical data for causal Shapley values

Description

This function is used when we generate marginal data for the categorical approach and there are several sampling steps. We need to treat this case separately: in the marginal step, we must NOT generate feature values whose combination with the feature values we condition on in S is absent from categorical.joint_prob_dt, as we would then be unable to progress further in the chain of sampling steps. E.g., let X1, X2, and X3 each take values in (1,2,3), let X2 = 2 be known, and let the causal structure be X1 -> X2 -> X3. Assume that P(X1 = 1, X2 = 2, X3 = 3) = P(X1 = 2, X2 = 2, X3 = 3) = 1/2. Then there is no point generating X1 = 3, as we then cannot generate X3. The solution is to generate only the values which can proceed through the whole chain of sampling steps. To do that, we have to ensure that the marginal sampling respects the valid feature coalitions for all sets of conditional features, i.e., the features in features_steps_cond_on. We sample from the valid coalitions using the MARGINAL probabilities.

Usage

create_marginal_data_cat(
  n_MC_samples,
  x_explain,
  Sbar_features,
  S_original,
  joint_prob_dt
)

Arguments

n_MC_samples

Positive integer. For most approaches, it indicates the maximum number of samples to use in the Monte Carlo integration of every conditional expectation. For approach="ctree", n_MC_samples corresponds to the number of samples from the leaf node (see an exception related to the ctree.sample argument in setup_approach.ctree()). For approach="empirical", n_MC_samples is the K parameter in equations (14-15) of Aas et al. (2021), i.e. the maximum number of observations (with the largest weights) that are used; see also the empirical.eta argument in setup_approach.empirical().

x_explain

Matrix or data.frame/data.table. Contains the features whose predictions ought to be explained.

Sbar_features

Vector of integers containing the feature indices to generate marginal observations for. That is, if Sbar_features is c(1,4), then we sample n_MC_samples observations from P(X_1, X_4). That is, we sample the first and fourth feature values from the same valid feature coalition using the marginal probability, so we do not break the dependence between them.

S_original

Vector of integers containing the feature indices of the original coalition S. I.e., not the features in the current sampling step, but the features that are known to us before starting the chain of sampling steps.

Details

For undocumented arguments, see setup_approach.categorical().

Value

Data.table of dimension (n_MC_samples * nrow(x_explain)) x length(Sbar_features) with the sampled observations.

Author(s)

Lars Henry Berge Olsen


Generate marginal Gaussian data using Cholesky decomposition

Description

Given a multivariate Gaussian distribution, this function creates data from specified marginals of said distribution.

Usage

create_marginal_data_gaussian(n_MC_samples, Sbar_features, mu, cov_mat)

Arguments

n_MC_samples

Integer. The number of samples to generate.

Sbar_features

Vector of integers indicating which marginals to sample from.

mu

Numeric vector containing the expected values for all features in the multivariate Gaussian distribution.

cov_mat

Numeric matrix containing the covariance between all features in the multivariate Gaussian distribution.

Author(s)

Lars Henry Berge Olsen
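Examples

A sketch drawing from the marginals of a trivariate standard Gaussian (internal function accessed via shapr:::; values are illustrative):

mu <- rep(0, 3)
cov_mat <- diag(3)
# Draw 5 samples from the (X_1, X_3) marginal of N(mu, cov_mat)
shapr:::create_marginal_data_gaussian(
  n_MC_samples = 5, Sbar_features = c(1, 3), mu = mu, cov_mat = cov_mat
)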


Function that samples data from the empirical marginal training distribution

Description

Sample observations from the empirical distribution P(X) using the training dataset.

Usage

create_marginal_data_training(
  x_train,
  n_explain,
  Sbar_features,
  n_MC_samples = 1000,
  stable_version = TRUE
)

Arguments

x_train

Data.table with training data.

Sbar_features

Vector of integers containing the feature indices to generate marginal observations for. That is, if Sbar_features is c(1,4), then we sample n_MC_samples observations from P(X_1, X_4) using the empirical training observations (with replacement). That is, we sample the first and fourth feature values from the same training observation, so we do not break the dependence between them.

stable_version

Logical. If TRUE and n_MC_samples > n_train, we include each training observation n_MC_samples %/% n_train times and sample the remaining n_MC_samples %% n_train samples. Only the latter is done when n_MC_samples < n_train. This is done separately for each explicand. If FALSE, we sample randomly from the observations.

Value

Data.table of dimension n_MC_samples x length(Sbar_features) with the sampled observations.

Author(s)

Lars Henry Berge Olsen
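Examples

A sketch (internal function accessed via shapr:::), also illustrating the stable_version split described above:

set.seed(1)
x_train <- data.table::data.table(x1 = rnorm(4), x2 = rnorm(4), x3 = rnorm(4))
# With n_MC_samples = 10 and n_train = 4, stable_version = TRUE includes each
# training observation 10 %/% 4 = 2 times and draws the remaining
# 10 %% 4 = 2 samples randomly, separately for each explicand
shapr:::create_marginal_data_training(
  x_train = x_train, n_explain = 1,
  Sbar_features = c(1, 3), n_MC_samples = 10, stable_version = TRUE
)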


Exported documentation helper function.

Description

Exported documentation helper function.

Usage

default_doc_export(internal, iter, index_features)

Arguments

internal

List. Not used directly, but passed through from explain().

iter

Integer. The iteration number. Only used internally.

index_features

Positive integer vector. Specifies the id_coalition to apply to the present method. NULL means all coalitions. Only used internally.


Unexported documentation helper function.

Description

Unexported documentation helper function.

Usage

default_doc_internal(
  internal,
  model,
  predict_model,
  x_explain,
  x_train,
  n_features,
  W_kernel,
  S,
  dt_vS,
  output_size,
  ...
)

Arguments

internal

List. Holds all parameters, data, functions and computed objects used within explain(). The list contains one or more of the elements parameters, data, objects, iter_list, timing_list, main_timing_list, output, and iter_timing_list.

model

Object. The model whose predictions ought to be explained. See the documentation of explain() for details.

predict_model

Function. The prediction function used when model is not natively supported. See the documentation of explain() for details.

x_explain

Data.table with the features of the observation whose predictions ought to be explained (test data).

x_train

Data.table with training data.

n_features

Positive integer. The number of features.

W_kernel

Numeric matrix. Contains all non-scaled weights between training and test observations for all coalitions. The dimension equals n_train x m.

S

Integer matrix of dimension n_coalitions x m, where n_coalitions and m equal the total number of sampled/non-sampled coalitions and the total number of unique features, respectively. Note that m = ncol(x_train).

dt_vS

Data.table of dimension n_coalitions times n_explain + 1 containing the contribution function estimates. The first column is assumed to be named id_coalition and to contain the ids of the coalitions. The last row is assumed to correspond to the full coalition, i.e., it contains the predicted responses for the observations which are to be explained.

output_size

Scalar integer. Specifies the dimension of the output from the prediction model for every observation.

...

Further arguments passed to approach-specific functions.

Value

The internal list. It holds all parameters, data, and computed objects used within explain().


Get table with all (exact) coalitions

Description

Get table with all (exact) coalitions

Usage

exact_coalition_table(
  m,
  max_fixed_coal_size = ceiling((m - 1)/2),
  dt_valid_causal_coalitions = NULL,
  weight_zero_m = 10^6
)

Arguments

m

Positive integer. Total number of features/groups.

dt_valid_causal_coalitions

data.table. Only applicable for asymmetric Shapley value explanations, and NULL for symmetric Shapley values. The data.table contains information about the coalitions that respect the causal ordering.

weight_zero_m

Numeric. The value to use as a replacement for infinite coalition weights when doing numerical operations.
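For orientation, a sketch listing all coalitions for m = 3 features (internal function accessed via shapr:::):

X <- shapr:::exact_coalition_table(m = 3)
# A data.table with 2^3 = 8 rows, one per coalition, including Shapley kernel
# weights (with weight_zero_m replacing the infinite weights of the empty and
# grand coalitions)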


Explain the output of machine learning models with dependence-aware (conditional/observational) Shapley values

Description

Computes dependence-aware Shapley values for observations in x_explain from the specified model by using the method specified in approach to estimate the conditional expectation. See Aas et al. (2021) for a thorough introduction to dependence-aware prediction explanation with Shapley values.

Usage

explain(
  model,
  x_explain,
  x_train,
  approach,
  phi0,
  iterative = NULL,
  max_n_coalitions = NULL,
  group = NULL,
  n_MC_samples = 1000,
  seed = NULL,
  verbose = "basic",
  predict_model = NULL,
  get_model_specs = NULL,
  prev_shapr_object = NULL,
  asymmetric = FALSE,
  causal_ordering = NULL,
  confounding = NULL,
  extra_computation_args = list(),
  iterative_args = list(),
  output_args = list(),
  ...
)

Arguments

model

Model object. Specifies the model whose predictions we want to explain. Run get_supported_models() for a table of which models explain supports natively. Unsupported models can still be explained by passing predict_model and (optionally) get_model_specs, see details for more information.

x_explain

Matrix or data.frame/data.table. Contains the features whose predictions ought to be explained.

x_train

Matrix or data.frame/data.table. Contains the data used to estimate the (conditional) distributions for the features needed to properly estimate the conditional expectations in the Shapley formula.

approach

Character vector of length 1, or of length one less than the number of features. All elements should be either "gaussian", "copula", "empirical", "ctree", "vaeac", "categorical", "timeseries", "independence", "regression_separate", or "regression_surrogate". The two regression approaches cannot be combined with any other approach. See details for more information.

phi0

Numeric. The prediction value for unseen data, i.e. an estimate of the expected prediction without conditioning on any features. Typically we set this value equal to the mean of the response variable in our training data, but other choices such as the mean of the predictions in the training data are also reasonable.

iterative

Logical or NULL. If NULL (default), the argument is set to TRUE if there are more than 5 features/groups, and FALSE otherwise. If TRUE, the Shapley values are estimated iteratively, which provides sufficiently accurate Shapley value estimates faster. First an initial number of coalitions is sampled, then bootstrapping is used to estimate the variance of the Shapley values. A convergence criterion determines whether the variances of the Shapley values are sufficiently small. If the variances are too high, we estimate the number of required samples to reach convergence, and thereby add more coalitions. The process is repeated until the variances are below the threshold. Specifics related to the iterative process and convergence criterion are set through iterative_args.

max_n_coalitions

Integer. The upper limit on the number of unique feature/group coalitions to use in the iterative procedure (if iterative = TRUE). If iterative = FALSE it represents the number of feature/group coalitions to use directly. The quantity refers to the number of unique feature coalitions if group = NULL, and group coalitions if group != NULL. max_n_coalitions = NULL corresponds to max_n_coalitions=2^n_features.

group

List. If NULL, regular feature-wise Shapley values are computed. If provided, group-wise Shapley values are computed. group then has length equal to the number of groups, and each list element contains a character vector with the features included in the corresponding group. See Jullum et al. (2021) for more information on group-wise Shapley values.

n_MC_samples

Positive integer. For most approaches, it indicates the maximum number of samples to use in the Monte Carlo integration of every conditional expectation. For approach="ctree", n_MC_samples corresponds to the number of samples from the leaf node (see an exception related to the ctree.sample argument in setup_approach.ctree()). For approach="empirical", n_MC_samples is the K parameter in equations (14-15) of Aas et al. (2021), i.e. the maximum number of observations (with the largest weights) that are used; see also the empirical.eta argument in setup_approach.empirical().

seed

Positive integer. Specifies the seed before any randomness based code is being run. If NULL (default) no seed is set in the calling environment.

verbose

String vector or NULL. Specifies the verbosity (printout detail level) through one or more of the strings "basic", "progress", "convergence", "shapley" and "vS_details". "basic" (default) displays basic information about the computation being performed, in addition to some messages about parameters being set or checks being unavailable due to specific input. "progress" displays information about where in the calculation process the function currently is. "convergence" displays information on how close to convergence the Shapley value estimates are (only when iterative = TRUE). "shapley" displays intermediate Shapley value estimates and standard deviations (only when iterative = TRUE) and the final estimates. "vS_details" displays information about the v(S) estimates. This is most relevant for approach %in% c("regression_separate", "regression_surrogate", "vaeac"). NULL means no printout. Note that any combination of these strings can be used. E.g. verbose = c("basic", "vS_details") displays basic information plus details about the v(S) estimation process.

predict_model

Function. The prediction function used when model is not natively supported. (Run get_supported_models() for a list of natively supported models.) The function must have two arguments, model and newdata which specify, respectively, the model and a data.frame/data.table to compute predictions for. The function must give the prediction as a numeric vector. NULL (the default) uses functions specified internally. Can also be used to override the default function for natively supported model classes.

get_model_specs

Function. An optional function for checking model/data consistency when model is not natively supported. (Run get_supported_models() for a list of natively supported models.) The function takes model as argument and provides a list with 3 elements:

labels

Character vector with the names of each feature.

classes

Character vector with the classes of each feature.

factor_levels

Character vector with the levels for any categorical features.

If NULL (the default) internal functions are used for natively supported model classes, and the checking is disabled for unsupported model classes. Can also be used to override the default function for natively supported model classes.

prev_shapr_object

shapr object or string. If an object of class shapr is provided, or string with a path to where intermediate results are stored, then the function will use the previous object to continue the computation. This is useful if the computation is interrupted or you want higher accuracy than already obtained, and therefore want to continue the iterative estimation. See the general usage vignette for examples.

asymmetric

Logical. If FALSE (default), explain computes regular symmetric Shapley values. If TRUE, then explain computes asymmetric Shapley values based on the (partial) causal ordering given by causal_ordering. That is, explain only uses the feature combinations/coalitions that respect the causal ordering when computing the asymmetric Shapley values. If asymmetric is TRUE and confounding is NULL (default), then explain computes asymmetric conditional Shapley values as specified in Frye et al. (2020). If confounding is provided, i.e., not NULL, then explain computes asymmetric causal Shapley values as specified in Heskes et al. (2020).

causal_ordering

List. Only used for asymmetric and/or causal Shapley value explanations; not applicable for regular (symmetric, non-causal) explanations. causal_ordering is an unnamed list of vectors specifying the components of the partial causal ordering that the coalitions must respect. Each vector represents a component and contains one or more features/groups identified by their names (strings) or indices (integers). If causal_ordering is NULL (default), no causal ordering is assumed and all possible coalitions are allowed. No causal ordering is equivalent to a causal ordering with a single component that includes all features (list(1:n_features)) or groups (list(1:n_groups)) for feature-wise and group-wise Shapley values, respectively. For feature-wise Shapley values and causal_ordering = list(c(1, 2), c(3, 4)), the interpretation is that features 1 and 2 are the ancestors of features 3 and 4, while features 3 and 4 are on the same level. Note: All features/groups must be included in the causal_ordering without any duplicates.

confounding

Logical vector. Only used for causal Shapley value explanations; not applicable for regular (symmetric, non-causal) explanations. confounding is a vector of logicals specifying whether confounding is assumed or not for each component in the causal_ordering. If NULL (default), then no assumption about the confounding structure is made and explain computes asymmetric/symmetric conditional Shapley values, depending on the value of asymmetric. If confounding is a single logical, i.e., FALSE or TRUE, then this assumption is applied globally to all components in the causal ordering. Otherwise, confounding must be a vector of logicals of the same length as causal_ordering, indicating the confounding assumption for each component. When confounding is specified, explain computes asymmetric/symmetric causal Shapley values, depending on the value of asymmetric. The approach cannot be regression_separate or regression_surrogate, as the regression-based approaches are not applicable to the causal Shapley value methodology.

extra_computation_args

Named list. Specifies extra arguments related to the computation of the Shapley values. See get_extra_comp_args_default() for description of the arguments and their default values.

iterative_args

Named list. Specifies the arguments for the iterative procedure. See get_iterative_args_default() for description of the arguments and their default values.

output_args

Named list. Specifies certain arguments related to the output of the function. See get_output_args_default() for description of the arguments and their default values.

...

Arguments passed on to setup_approach.categorical, setup_approach.copula, setup_approach.ctree, setup_approach.empirical, setup_approach.gaussian, setup_approach.independence, setup_approach.regression_separate, setup_approach.regression_surrogate, setup_approach.timeseries, setup_approach.vaeac

categorical.joint_prob_dt

Data.table. (Optional) Contains the joint probability distribution for each combination of feature values. NULL means it is estimated from x_train and x_explain.

categorical.epsilon

Numeric value. (Optional) If categorical.joint_prob_dt is not supplied, probabilities/frequencies are estimated using x_train. If certain observations occur in x_explain but NOT in x_train, then epsilon is used as the proportion of times these observations occur in the training data. In theory, this proportion should be zero, but that causes an error later in the Shapley computation.

internal

List. Not used directly, but passed through from explain().

ctree.mincriterion

Numeric scalar or vector. Either a scalar or a vector of length equal to the number of features in the model. The value equals 1 - alpha, where alpha is the nominal level of the conditional independence tests. If it is a vector, it indicates which value to use when conditioning on various numbers of features. The default value is 0.95.

ctree.minsplit

Numeric scalar. Determines the minimum sum of weights in a node required for the node to be considered for splitting. The default value is 20.

ctree.minbucket

Numeric scalar. Determines the minimum sum of weights in a terminal node required for a split. The default value is 7.

ctree.sample

Boolean. If TRUE (default), the method always samples n_MC_samples observations from the leaf nodes (with replacement). If FALSE and the number of observations in the leaf node is less than n_MC_samples, the method takes all observations in the leaf. If FALSE and the number of observations in the leaf node is more than n_MC_samples, the method samples n_MC_samples observations (with replacement). This means that there will always be sampling in the leaf unless ctree.sample = FALSE and the number of observations in the node is less than n_MC_samples.

empirical.type

Character. (default = "fixed_sigma") Should be equal to either "independence", "fixed_sigma", "AICc_each_k", or "AICc_full". "independence" is deprecated; use approach = "independence" instead. "fixed_sigma" uses a fixed bandwidth (set through empirical.fixed_sigma) in the kernel density estimation. "AICc_each_k" and "AICc_full" optimize the bandwidth using the AICc criterion, with respectively one bandwidth per coalition size and one bandwidth for all coalition sizes.

empirical.eta

Numeric scalar. Needs to satisfy 0 < eta <= 1. The default value is 0.95. Represents the minimum proportion of the total empirical weight that data samples should use. If e.g. eta = .8, we choose the K samples with the largest weights so that the sum of their weights accounts for 80% of the total weight. eta is the eta parameter in equation (15) of Aas et al. (2021).

empirical.fixed_sigma

Positive numeric scalar. The default value is 0.1. Represents the kernel bandwidth in the distance computation used when conditioning on all different coalitions. Only used when empirical.type = "fixed_sigma"

empirical.n_samples_aicc

Positive integer. Number of samples to consider in the AICc optimization. The default value is 1000. Only used when empirical.type is either "AICc_each_k" or "AICc_full".

empirical.eval_max_aicc

Positive integer. Maximum number of iterations when optimizing the AICc. The default value is 20. Only used when empirical.type is either "AICc_each_k" or "AICc_full".

empirical.start_aicc

Numeric. Start value of the sigma parameter when optimizing the AICc. The default value is 0.1. Only used when empirical.type is either "AICc_each_k" or "AICc_full".

empirical.cov_mat

Numeric matrix. (Optional) The covariance matrix of the data generating distribution used to define the Mahalanobis distance. NULL means it is estimated from x_train.

gaussian.mu

Numeric vector. (Optional) Contains the mean of the data generating distribution. NULL means it is estimated from x_train.

gaussian.cov_mat

Numeric matrix. (Optional) Contains the covariance matrix of the data generating distribution. NULL means it is estimated from x_train.

regression.model

A tidymodels object of class model_specs. Default is a linear regression model, i.e., parsnip::linear_reg(). See tidymodels for all possible models, and see the vignette for how to add new/own models. Note, to make it easier to call explain() from Python, the regression.model parameter can also be a string specifying the model, which will be parsed and evaluated. For example, "parsnip::rand_forest(mtry = hardhat::tune(), trees = 100, engine = 'ranger', mode = 'regression')" is also a valid input. It is essential to include the package prefix if the package is not loaded.

regression.tune_values

Either NULL (default), a data.frame/data.table/tibble, or a function. The data.frame must contain the possible hyperparameter value combinations to try. The column names must match the names of the tunable parameters specified in regression.model. If regression.tune_values is a function, it should take one argument x, which is the training data for the current coalition, and return a data.frame/data.table/tibble with the properties described above. Using a function allows the hyperparameter values to change based on the size of the coalition. See the regression vignette for several examples. Note, to make it easier to call explain() from Python, regression.tune_values can also be a string containing an R function. For example, "function(x) return(dials::grid_regular(dials::mtry(c(1, ncol(x))), levels = 3))" is also a valid input. It is essential to include the package prefix if the package is not loaded.

regression.vfold_cv_para

Either NULL (default) or a named list containing the parameters to be sent to rsample::vfold_cv(). See the regression vignette for several examples.

regression.recipe_func

Either NULL (default) or a function that takes in a recipes::recipe() object and returns a modified recipes::recipe() with potentially additional recipe steps. See the regression vignette for several examples. Note, to make it easier to call explain() from Python, regression.recipe_func can also be a string containing an R function. For example, "function(recipe) return(recipes::step_ns(recipe, recipes::all_numeric_predictors(), deg_free = 2))" is also a valid input. It is essential to include the package prefix if the package is not loaded.

regression.surrogate_n_comb

Positive integer. Specifies the number of unique coalitions to apply to each training observation. The default is the number of sampled coalitions in the present iteration. Any integer between 1 and the default is allowed. Larger values require more memory, but may improve the surrogate model. If the user sets a value lower than the maximum, we sample this number of unique coalitions separately for each training observation. That way, on average, all coalitions are trained equally often.

timeseries.fixed_sigma

Positive numeric scalar. Represents the kernel bandwidth in the distance computation. The default value is 2.

timeseries.bounds

Numeric vector of length two. Specifies the lower and upper bounds of the timeseries. The default is c(NULL, NULL), i.e. no bounds. If one or both of these bounds are not NULL, we restrict the sampled time series to be between these bounds. This is useful if the underlying time series are scaled between 0 and 1, for example.

vaeac.depth

Positive integer (default is 3). The number of hidden layers in the neural networks of the masked encoder, full encoder, and decoder.

vaeac.width

Positive integer (default is 32). The number of neurons in each hidden layer in the neural networks of the masked encoder, full encoder, and decoder.

vaeac.latent_dim

Positive integer (default is 8). The number of dimensions in the latent space.

vaeac.lr

Positive numeric (default is 0.001). The learning rate used in the torch::optim_adam() optimizer.

vaeac.activation_function

An torch::nn_module() representing an activation function such as, e.g., torch::nn_relu() (default), torch::nn_leaky_relu(), torch::nn_selu(), or torch::nn_sigmoid().

vaeac.n_vaeacs_initialize

Positive integer (default is 4). The number of different vaeac models to initiate at the start. The best performing one is picked after vaeac.extra_parameters$epochs_initiation_phase epochs (default is 2), and training continues with that one.

vaeac.epochs

Positive integer (default is 100). The number of epochs to train the final vaeac model. This includes vaeac.extra_parameters$epochs_initiation_phase, where the default is 2.

vaeac.extra_parameters

Named list with extra parameters to the vaeac approach. See vaeac_get_extra_para_default() for description of possible additional parameters and their default values.

Details

The shapr package implements kernelSHAP estimation of dependence-aware Shapley values with eight different Monte Carlo-based approaches for estimating the conditional distributions of the data. These are all introduced in the general usage vignette (from R: vignette("general_usage", package = "shapr")). Moreover, Aas et al. (2021) gives a general introduction to dependence-aware Shapley values and introduces the three approaches "empirical", "gaussian", and "copula", and also discusses "independence". Redelmeier et al. (2020) introduces the approach "ctree". Olsen et al. (2022) introduces the "vaeac" approach. The "timeseries" approach is discussed in Jullum et al. (2021). shapr has also implemented two regression-based approaches, "regression_separate" and "regression_surrogate", as described in Olsen et al. (2024). It is also possible to combine the different approaches; see the general usage vignette for more information.

The package also supports the computation of causal and asymmetric Shapley values as introduced by Heskes et al. (2020) and Frye et al. (2020). Asymmetric Shapley values were proposed by Frye et al. (2020) as a way to incorporate causal knowledge in the real world by restricting the possible feature combinations/coalitions when computing the Shapley values to those consistent with a (partial) causal ordering. Causal Shapley values were proposed by Heskes et al. (2020) as a way to explain the total effect of features on the prediction, taking into account their causal relationships, by adapting the sampling procedure in shapr.

The package allows for parallelized computation with progress updates through the tightly connected future and progressr packages. See the examples below. For iterative estimation (iterative = TRUE), intermediate results may also be printed to the console (according to the verbose argument). Moreover, the intermediate results are written to disk. This, combined with batch computation of the v(S) values, enables fast and accurate estimation of the Shapley values in a memory-friendly manner.

Value

Object of class c("shapr", "list"). Contains the following items:

shapley_values_est

data.table with the estimated Shapley values, with the explained observations in the rows and the features along the columns. The column none is the part of the prediction not attributed to any of the features (given by the argument phi0).

shapley_values_sd

data.table with the standard deviation of the Shapley values, reflecting the uncertainty. Note that this only reflects the coalition sampling part of the kernelSHAP procedure, and is therefore by definition 0 when all coalitions are used. Only present when extra_computation_args$compute_sd = TRUE, which is the default when iterative = TRUE.

internal

List with the different parameters, data, functions and other output used internally.

pred_explain

Numeric vector with the predictions for the explained observations

MSEv

List with the values of the MSEv evaluation criterion for the approach. See the MSEv evaluation section in the general usage vignette for details.

timing

List containing timing information for the different parts of the computation. init_time and end_time give the time stamps for the start and end of the computation. total_time_secs gives the total time in seconds for the complete execution of explain(). main_timing_secs gives the time in seconds for the main computations. iter_timing_secs gives, for each iteration of the iterative estimation, the time spent on the different parts of the iterative estimation routine.

Author(s)

Martin Jullum, Lars Henry Berge Olsen

References

Examples



# Load example data
data("airquality")
airquality <- airquality[complete.cases(airquality), ]
x_var <- c("Solar.R", "Wind", "Temp", "Month")
y_var <- "Ozone"

# Split data into test- and training data
data_train <- head(airquality, -3)
data_explain <- tail(airquality, 3)

x_train <- data_train[, x_var]
x_explain <- data_explain[, x_var]

# Fit a linear model
lm_formula <- as.formula(paste0(y_var, " ~ ", paste0(x_var, collapse = " + ")))
model <- lm(lm_formula, data = data_train)

# Explain predictions
p <- mean(data_train[, y_var])

# (Optionally) enable parallelization via the future package
if (requireNamespace("future", quietly = TRUE)) {
  future::plan("multisession", workers = 2)
}


# (Optionally) enable progress updates within every iteration via the progressr package
if (requireNamespace("progressr", quietly = TRUE)) {
  progressr::handlers(global = TRUE)
}

# Empirical approach
explain1 <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "empirical",
  phi0 = p,
  n_MC_samples = 1e2
)

# Gaussian approach
explain2 <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "gaussian",
  phi0 = p,
  n_MC_samples = 1e2
)

# Gaussian copula approach
explain3 <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "copula",
  phi0 = p,
  n_MC_samples = 1e2
)

if (requireNamespace("party", quietly = TRUE)) {
  # ctree approach
  explain4 <- explain(
    model = model,
    x_explain = x_explain,
    x_train = x_train,
    approach = "ctree",
    phi0 = p,
    n_MC_samples = 1e2
  )
}

# Combined approach
approach <- c("gaussian", "gaussian", "empirical")
explain5 <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = approach,
  phi0 = p,
  n_MC_samples = 1e2
)

# Print the Shapley values
print(explain1$shapley_values_est)

# Plot the results
if (requireNamespace("ggplot2", quietly = TRUE)) {
  plot(explain1)
  plot(explain1, plot_type = "waterfall")
}

# Group-wise explanations
group_list <- list(A = c("Temp", "Month"), B = c("Wind", "Solar.R"))

explain_groups <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  group = group_list,
  approach = "empirical",
  phi0 = p,
  n_MC_samples = 1e2
)
print(explain_groups$shapley_values_est)

# Separate and surrogate regression approaches with linear regression models.
req_pkgs <- c("parsnip", "recipes", "workflows", "rsample", "tune", "yardstick")
if (all(vapply(req_pkgs, requireNamespace, logical(1), quietly = TRUE))) {
  explain_separate_lm <- explain(
    model = model,
    x_explain = x_explain,
    x_train = x_train,
    phi0 = p,
    approach = "regression_separate",
    regression.model = parsnip::linear_reg()
  )

  explain_surrogate_lm <- explain(
    model = model,
    x_explain = x_explain,
    x_train = x_train,
    phi0 = p,
    approach = "regression_surrogate",
    regression.model = parsnip::linear_reg()
  )
}

# Iterative estimation
# For illustration purposes only. By default not used for such small dimensions as here

# Gaussian approach
explain_iterative <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "gaussian",
  phi0 = p,
  n_MC_samples = 1e2,
  iterative = TRUE,
  iterative_args = list(initial_n_coalitions = 10)
)
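
# Inspect parts of the returned object; the element names are documented in
# the Value section above
explain_iterative$shapley_values_est # the estimated Shapley values
explain_iterative$shapley_values_sd # their standard deviations
explain_iterative$pred_explain # predictions for the explained observations
explain_iterative$timing$total_time_secs # total computation time in seconds
str(explain_iterative$MSEv) # the MSEv evaluation criterion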




Explain a forecast from time series models with dependence-aware (conditional/observational) Shapley values

Description

Computes dependence-aware Shapley values for observations in explain_idx from the specified model by using the method specified in approach to estimate the conditional expectation. See Aas et al. (2021) for a thorough introduction to dependence-aware prediction explanation with Shapley values.

Usage

explain_forecast(
  model,
  y,
  xreg = NULL,
  train_idx = NULL,
  explain_idx,
  explain_y_lags,
  explain_xreg_lags = explain_y_lags,
  horizon,
  approach,
  phi0,
  max_n_coalitions = NULL,
  iterative = NULL,
  group_lags = TRUE,
  group = NULL,
  n_MC_samples = 1000,
  seed = NULL,
  predict_model = NULL,
  get_model_specs = NULL,
  verbose = "basic",
  extra_computation_args = list(),
  iterative_args = list(),
  output_args = list(),
  ...
)

Arguments

model

Model object. Specifies the model whose predictions we want to explain. Run get_supported_models() for a table of which models explain supports natively. Unsupported models can still be explained by passing predict_model and (optionally) get_model_specs, see details for more information.

y

Matrix, data.frame/data.table or a numeric vector. Contains the endogenous variables used to estimate the (conditional) distributions needed to properly estimate the conditional expectations in the Shapley formula including the observations to be explained.

xreg

Matrix, data.frame/data.table or a numeric vector. Contains the exogenous variables used to estimate the (conditional) distributions needed to properly estimate the conditional expectations in the Shapley formula including the observations to be explained. As exogenous variables are used contemporaneously when producing a forecast, this item should contain nrow(y) + horizon rows.

train_idx

Numeric vector. The row indices in y and xreg denoting points in time to use when estimating the conditional expectations in the Shapley value formula. If train_idx = NULL (default), all indices not selected to be explained will be used.

explain_idx

Numeric vector. The row indices in y and xreg denoting points in time to explain.

explain_y_lags

Numeric vector. Denotes the number of lags that should be used for each variable in y when making a forecast.

explain_xreg_lags

Numeric vector. If xreg != NULL, denotes the number of lags that should be used for each variable in xreg when making a forecast.

horizon

Numeric. The forecast horizon to explain. Passed to the predict_model function.

approach

Character vector of length 1, or of length one less than the number of features. All elements should be either "gaussian", "copula", "empirical", "ctree", "vaeac", "categorical", "timeseries", "independence", "regression_separate", or "regression_surrogate". The two regression approaches cannot be combined with any other approach. See details for more information.

phi0

Numeric. The prediction value for unseen data, i.e. an estimate of the expected prediction without conditioning on any features. Typically we set this value equal to the mean of the response variable in our training data, but other choices such as the mean of the predictions in the training data are also reasonable.

max_n_coalitions

Integer. The upper limit on the number of unique feature/group coalitions to use in the iterative procedure (if iterative = TRUE). If iterative = FALSE it represents the number of feature/group coalitions to use directly. The quantity refers to the number of unique feature coalitions if group = NULL, and group coalitions if group != NULL. max_n_coalitions = NULL corresponds to max_n_coalitions=2^n_features.

iterative

Logical or NULL. If NULL (default), the argument is set to TRUE if there are more than 5 features/groups, and FALSE otherwise. If TRUE, the Shapley values are estimated iteratively, which provides sufficiently accurate Shapley value estimates faster. First an initial number of coalitions is sampled, then bootstrapping is used to estimate the variance of the Shapley values. A convergence criterion is used to determine if the variances of the Shapley values are sufficiently small. If the variances are too high, we estimate the number of required samples to reach convergence, and thereby add more coalitions. The process is repeated until the variances are below the threshold. Specifics related to the iterative process and the convergence criterion are set through iterative_args.

group_lags

Logical. If TRUE all lags of each variable are grouped together and explained as a group. If FALSE all lags of each variable are explained individually.

group

List. If NULL regular feature wise Shapley values are computed. If provided, group wise Shapley values are computed. group then has length equal to the number of groups. The list element contains character vectors with the features included in each of the different groups. See Jullum et al. (2021) for more information on group wise Shapley values.

n_MC_samples

Positive integer. For most approaches, it indicates the maximum number of samples to use in the Monte Carlo integration of every conditional expectation. For approach="ctree", n_MC_samples corresponds to the number of samples from the leaf node (see an exception related to the ctree.sample argument in setup_approach.ctree()). For approach="empirical", n_MC_samples is the K parameter in equations (14-15) of Aas et al. (2021), i.e. the maximum number of observations (with the largest weights) that are used, see also the empirical.eta argument in setup_approach.empirical().

seed

Positive integer. Specifies the seed before any randomness-based code is run. If NULL (default), no seed is set in the calling environment.

predict_model

Function. The prediction function used when model is not natively supported. (Run get_supported_models() for a list of natively supported models.) The function must have two arguments, model and newdata which specify, respectively, the model and a data.frame/data.table to compute predictions for. The function must give the prediction as a numeric vector. NULL (the default) uses functions specified internally. Can also be used to override the default function for natively supported model classes.

get_model_specs

Function. An optional function for checking model/data consistency when model is not natively supported. (Run get_supported_models() for a list of natively supported models.) The function takes model as argument and provides a list with 3 elements:

labels

Character vector with the names of each feature.

classes

Character vector with the classes of each feature.

factor_levels

Character vector with the levels for any categorical features.

If NULL (the default) internal functions are used for natively supported model classes, and the checking is disabled for unsupported model classes. Can also be used to override the default function for natively supported model classes.

verbose

String vector or NULL. Specifies the verbosity (printout detail level) through one or more of the strings "basic", "progress", "convergence", "shapley" and "vS_details". "basic" (default) displays basic information about the computation which is being performed, in addition to some messages about parameters being set or checks being unavailable due to specific input. "progress" displays information about where in the calculation process the function currently is. "convergence" displays information on how close to convergence the Shapley value estimates are (only when iterative = TRUE). "shapley" displays intermediate Shapley value estimates and standard deviations (only when iterative = TRUE) and the final estimates. "vS_details" displays information about the v_S estimates. This is most relevant for approach %in% c("regression_separate", "regression_surrogate", "vaeac"). NULL means no printout. Note that any combination of these strings can be used. E.g. verbose = c("basic", "vS_details") will display basic information plus details about the v(S) estimation process.

extra_computation_args

Named list. Specifies extra arguments related to the computation of the Shapley values. See get_extra_comp_args_default() for description of the arguments and their default values.

iterative_args

Named list. Specifies the arguments for the iterative procedure. See get_iterative_args_default() for description of the arguments and their default values.

output_args

Named list. Specifies certain arguments related to the output of the function. See get_output_args_default() for description of the arguments and their default values.

...

Arguments passed on to setup_approach.categorical, setup_approach.copula, setup_approach.ctree, setup_approach.empirical, setup_approach.gaussian, setup_approach.independence, setup_approach.timeseries, setup_approach.vaeac

categorical.joint_prob_dt

Data.table. (Optional) Containing the joint probability distribution for each combination of feature values. NULL means it is estimated from x_train and x_explain.

categorical.epsilon

Numeric value. (Optional) If categorical.joint_prob_dt is not supplied, probabilities/frequencies are estimated using x_train. If certain observations occur in x_explain and NOT in x_train, then epsilon is used as the proportion of times that these observations occur in the training data. In theory, this proportion should be zero, but this causes an error later in the Shapley computation.

internal

List. Not used directly, but passed through from explain().

ctree.mincriterion

Numeric scalar or vector. Either a scalar or vector of length equal to the number of features in the model. The value is equal to 1 - \alpha where \alpha is the nominal level of the conditional independence tests. If it is a vector, this indicates which value to use when conditioning on various numbers of features. The default value is 0.95.

ctree.minsplit

Numeric scalar. Determines the minimum value that the sum of the weights in the left and right daughter nodes must reach for a split to be made. The default value is 20.

ctree.minbucket

Numeric scalar. Determines the minimum sum of weights required in a terminal node. The default value is 7.

ctree.sample

Boolean. If TRUE (default), then the method always samples n_MC_samples observations from the leaf nodes (with replacement). If FALSE and the number of observations in the leaf node is less than n_MC_samples, the method takes all observations in the leaf. If FALSE and the number of observations in the leaf node is more than n_MC_samples, the method samples n_MC_samples observations (with replacement). This means that there will always be sampling in the leaf unless ctree.sample = FALSE and the number of observations in the node is less than n_MC_samples.

empirical.type

Character. (default = "fixed_sigma") Should be equal to one of "independence", "fixed_sigma", "AICc_each_k" or "AICc_full". "independence" is deprecated; use approach = "independence" instead. "fixed_sigma" uses a fixed bandwidth (set through empirical.fixed_sigma) in the kernel density estimation. "AICc_each_k" and "AICc_full" optimize the bandwidth using the AICc criterion, with respectively one bandwidth per coalition size and one bandwidth for all coalition sizes.

empirical.eta

Numeric scalar. Needs to satisfy 0 < eta <= 1. The default value is 0.95. Represents the minimum proportion of the total empirical weight that data samples should use. If e.g. eta = .8, we choose the K samples with the largest weights such that the sum of their weights accounts for 80% of the total weight. eta is the \eta parameter in equation (15) of Aas et al. (2021).

empirical.fixed_sigma

Positive numeric scalar. The default value is 0.1. Represents the kernel bandwidth in the distance computation used when conditioning on all different coalitions. Only used when empirical.type = "fixed_sigma".

empirical.n_samples_aicc

Positive integer. Number of samples to consider in the AICc optimization. The default value is 1000. Only used when empirical.type is either "AICc_each_k" or "AICc_full".

empirical.eval_max_aicc

Positive integer. Maximum number of iterations when optimizing the AICc. The default value is 20. Only used when empirical.type is either "AICc_each_k" or "AICc_full".

empirical.start_aicc

Numeric. Start value of the sigma parameter when optimizing the AICc. The default value is 0.1. Only used when empirical.type is either "AICc_each_k" or "AICc_full".

empirical.cov_mat

Numeric matrix. (Optional) The covariance matrix of the data generating distribution used to define the Mahalanobis distance. NULL means it is estimated from x_train.

gaussian.mu

Numeric vector. (Optional) Containing the mean of the data generating distribution. NULL means it is estimated from the x_train.

gaussian.cov_mat

Numeric matrix. (Optional) Containing the covariance matrix of the data generating distribution. NULL means it is estimated from the x_train.

timeseries.fixed_sigma

Positive numeric scalar. Represents the kernel bandwidth in the distance computation. The default value is 2.

timeseries.bounds

Numeric vector of length two. Specifies the lower and upper bounds of the timeseries. The default is c(NULL, NULL), i.e. no bounds. If one or both of these bounds are not NULL, we restrict the sampled time series to be between these bounds. This is useful if the underlying time series are scaled between 0 and 1, for example.

vaeac.depth

Positive integer (default is 3). The number of hidden layers in the neural networks of the masked encoder, full encoder, and decoder.

vaeac.width

Positive integer (default is 32). The number of neurons in each hidden layer in the neural networks of the masked encoder, full encoder, and decoder.

vaeac.latent_dim

Positive integer (default is 8). The number of dimensions in the latent space.

vaeac.lr

Positive numeric (default is 0.001). The learning rate used in the torch::optim_adam() optimizer.

vaeac.activation_function

An torch::nn_module() representing an activation function such as, e.g., torch::nn_relu() (default), torch::nn_leaky_relu(), torch::nn_selu(), or torch::nn_sigmoid().

vaeac.n_vaeacs_initialize

Positive integer (default is 4). The number of different vaeac models to initialize at the start. The best performing one after vaeac.extra_parameters$epochs_initiation_phase epochs (default is 2) is picked, and training continues on that one.

vaeac.epochs

Positive integer (default is 100). The number of epochs to train the final vaeac model. This includes vaeac.extra_parameters$epochs_initiation_phase, where the default is 2.

vaeac.extra_parameters

Named list with extra parameters to the vaeac approach. See vaeac_get_extra_para_default() for description of possible additional parameters and their default values.

Details

This function explains a forecast of length horizon. The argument train_idx is analogous to x_train in explain(); however, it contains only the time indices of where in the data the forecast should start for each training sample. In the same way, explain_idx defines the time index (or indices) which will precede the forecast(s) to be explained.

As any autoregressive forecast model will require a set of lags to make a forecast at an arbitrary point in time, explain_y_lags and explain_xreg_lags define how many lags are required to "refit" the model at any given time index. This allows the different approaches to work in the same way they do for time-invariant models.

See the forecasting section of the general usage vignette for further details.

Value

Object of class c("shapr", "list"). Contains the following items:

shapley_values_est

data.table with the estimated Shapley values, with the explained observations in the rows and the features along the columns. The column none is the part of the prediction not devoted to any of the features (given by the argument phi0).

shapley_values_sd

data.table with the standard deviations of the Shapley value estimates, reflecting their uncertainty. Note that this only reflects the coalition sampling part of the kernelSHAP procedure, and is therefore by definition 0 when all coalitions are used. Only present when extra_computation_args$compute_sd = TRUE, which is the default when iterative = TRUE.

internal

List with the different parameters, data, functions and other output used internally.

pred_explain

Numeric vector with the predictions for the explained observations

MSEv

List with the values of the MSEv evaluation criterion for the approach. See the MSEv evaluation section in the general usage vignette for details.

timing

List containing timing information for the different parts of the computation. init_time and end_time give the time stamps for the start and end of the computation. total_time_secs gives the total time in seconds for the complete execution of explain(). main_timing_secs gives the time in seconds for the main computations. iter_timing_secs gives, for each iteration of the iterative estimation, the time spent on the different parts of the iterative estimation routine.

Author(s)

Jon Lachmann, Martin Jullum

References

Examples


# Load example data
data("airquality")
data <- data.table::as.data.table(airquality)

# Fit an AR(2) model.
model_ar_temp <- ar(data$Temp, aic = FALSE, order.max = 2)

# Calculate the zero prediction values for a three step forecast.
p0_ar <- rep(mean(data$Temp), 3)

# Empirical approach, explaining forecasts starting at T = 152 and T = 153.
explain_forecast(
  model = model_ar_temp,
  y = data[, "Temp"],
  train_idx = 2:151,
  explain_idx = 152:153,
  explain_y_lags = 2,
  horizon = 3,
  approach = "empirical",
  phi0 = p0_ar,
  group_lags = FALSE
)
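
# A hypothetical sketch with an exogenous variable: an ARIMA model where Wind
# is used as xreg. Note that xreg must contain nrow(y) + horizon rows.
model_arimax_temp <- arima(data$Temp[1:150], order = c(2, 0, 0), xreg = data$Wind[1:150])

explain_forecast(
  model = model_arimax_temp,
  y = data[1:150, "Temp"],
  xreg = data[1:153, "Wind"], # 150 fitting rows + horizon = 3
  train_idx = 2:148,
  explain_idx = 149:150,
  explain_y_lags = 2,
  explain_xreg_lags = 2,
  horizon = 3,
  approach = "empirical",
  phi0 = rep(mean(data$Temp[1:150]), 3),
  group_lags = TRUE
)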



Gathers the final output to create the explanation object

Description

Gathers the final output to create the explanation object

Usage

finalize_explanation(internal)

Arguments

internal

List. Not used directly, but passed through from explain().

Value

List of reformatted output information extracted from internal


A torch::nn_module() Representing a gauss_cat_loss

Description

The gauss_cat_loss module computes the log probability of the groundtruth for each object given the mask and the distribution parameters. That is, the log-likelihoods of the true/full training observations based on the generative distribution parameters distr_params inferred from the masked versions of the observations.

Usage

gauss_cat_loss(one_hot_max_sizes, min_sigma = 1e-04, min_prob = 1e-04)

Arguments

one_hot_max_sizes

A torch tensor of dimension n_features containing the one hot sizes of the n_features features. That is, if the ith feature is a categorical feature with 5 levels, then one_hot_max_sizes[i] = 5. The size for a continuous feature can be either 0 or 1.

min_sigma

For stability it might be desirable that the minimal sigma is not too close to zero.

min_prob

For stability it might be desirable that the minimal probability is not too close to zero.

Details

Note that the module works with mixed data represented as 2-dimensional inputs and it works correctly with missing values in groundtruth as long as they are represented by NaNs.

Author(s)

Lars Henry Berge Olsen


A torch::nn_module() Representing a gauss_cat_parameters

Description

The gauss_cat_parameters module extracts the parameters from the inferred generative Gaussian and categorical distributions for the continuous and categorical features, respectively.

If one_hot_max_sizes is [4, 1, 1, 2], then the inferred distribution parameters for one observation are the vector [p_{00}, p_{01}, p_{02}, p_{03}, \mu_1, \sigma_1, \mu_2, \sigma_2, p_{30}, p_{31}], where \operatorname{Softmax}([p_{00}, p_{01}, p_{02}, p_{03}]) and \operatorname{Softmax}([p_{30}, p_{31}]) are the probabilities of the categories of the first and the fourth feature, respectively, in the model generative distribution, and Gaussian(\mu_1, \sigma_1^2) and Gaussian(\mu_2, \sigma_2^2) are the model generative distributions for the second and the third features.
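
As a worked illustration (a sketch of the bookkeeping, not the internal implementation), the length of this parameter vector is the number of categories for each categorical feature plus two (\mu, \sigma) for each continuous feature:

one_hot_max_sizes <- c(4, 1, 1, 2) # sizes <= 1 denote continuous features
sum(ifelse(one_hot_max_sizes <= 1, 2, one_hot_max_sizes)) # 4 + 2 + 2 + 2 = 10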

Usage

gauss_cat_parameters(one_hot_max_sizes, min_sigma = 1e-04, min_prob = 1e-04)

Arguments

one_hot_max_sizes

A torch tensor of dimension n_features containing the one hot sizes of the n_features features. That is, if the ith feature is a categorical feature with 5 levels, then one_hot_max_sizes[i] = 5. The size for a continuous feature can be either 0 or 1.

min_sigma

For stability it might be desirable that the minimal sigma is not too close to zero.

min_prob

For stability it might be desirable that the minimal probability is not too close to zero.

Author(s)

Lars Henry Berge Olsen


A torch::nn_module() Representing a gauss_cat_sampler_most_likely

Description

The gauss_cat_sampler_most_likely generates the most likely samples from the generative distribution defined by the output of the vaeac. I.e., the layer will return the mean and most probable class for the Gaussian (continuous features) and categorical (categorical features) distributions, respectively.

Usage

gauss_cat_sampler_most_likely(
  one_hot_max_sizes,
  min_sigma = 1e-04,
  min_prob = 1e-04
)

Arguments

one_hot_max_sizes

A torch tensor of dimension n_features containing the one hot sizes of the n_features features. That is, if the ith feature is a categorical feature with 5 levels, then one_hot_max_sizes[i] = 5. The size for a continuous feature can be either 0 or 1.

min_sigma

For stability it might be desirable that the minimal sigma is not too close to zero.

min_prob

For stability it might be desirable that the minimal probability is not too close to zero.

Value

A gauss_cat_sampler_most_likely object.

Author(s)

Lars Henry Berge Olsen


A torch::nn_module() Representing a gauss_cat_sampler_random

Description

The gauss_cat_sampler_random generates random samples from the generative distribution defined by the output of the vaeac. The random sample is generated by sampling from the inferred Gaussian and categorical distributions for the continuous and categorical features, respectively.

Usage

gauss_cat_sampler_random(
  one_hot_max_sizes,
  min_sigma = 1e-04,
  min_prob = 1e-04
)

Arguments

one_hot_max_sizes

A torch tensor of dimension n_features containing the one hot sizes of the n_features features. That is, if the ith feature is a categorical feature with 5 levels, then one_hot_max_sizes[i] = 5. The size for a continuous feature can be either 0 or 1.

min_sigma

For stability it might be desirable that the minimal sigma is not too close to zero.

min_prob

For stability it might be desirable that the minimal probability is not too close to zero.

Author(s)

Lars Henry Berge Olsen


Transforms a sample to standardized normal distribution

Description

Transforms a sample to standardized normal distribution

Usage

gaussian_transform(x)

Arguments

x

Numeric vector. The data which should be transformed to a standard normal distribution.

Value

Numeric vector of length length(x)
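
Conceptually, this is the standard rank-based Gaussian transform; a minimal sketch of the idea (the exact internals of gaussian_transform may differ):

x <- rexp(100) # some non-Gaussian data
z <- qnorm(rank(x) / (length(x) + 1)) # approximately standard normal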

Author(s)

Martin Jullum


Transforms new data to standardized normal (dimension 1) based on other data transformations

Description

Transforms new data to standardized normal (dimension 1) based on other data transformations

Usage

gaussian_transform_separate(yx, n_y)

Arguments

yx

Numeric vector. The first n_y items are the data to be transformed, and the last part is the data with the original transformation.

n_y

Positive integer. Number of elements of yx that belong to the Gaussian data.

Value

Vector of back-transformed Gaussian data

Author(s)

Martin Jullum


Get the steps for generating MC samples for coalitions following a causal ordering

Description

Get the steps for generating MC samples for coalitions following a causal ordering

Usage

get_S_causal_steps(S, causal_ordering, confounding, as_string = FALSE)

Arguments

S

Integer matrix of dimension n_coalitions_valid x m, where n_coalitions_valid equals the total number of valid coalitions that respect the causal ordering given in causal_ordering and m equals the total number of features.

causal_ordering

List. Not applicable for (regular) non-causal or asymmetric explanations. causal_ordering is an unnamed list of vectors specifying the components of the partial causal ordering that the coalitions must respect. Each vector represents a component and contains one or more features/groups identified by their names (strings) or indices (integers). If causal_ordering is NULL (default), no causal ordering is assumed and all possible coalitions are allowed. No causal ordering is equivalent to a causal ordering with a single component that includes all features (list(1:n_features)) or groups (list(1:n_groups)) for feature-wise and group-wise Shapley values, respectively. For feature-wise Shapley values and causal_ordering = list(c(1, 2), c(3, 4)), the interpretation is that features 1 and 2 are the ancestors of features 3 and 4, while features 3 and 4 are on the same level. Note: All features/groups must be included in the causal_ordering without any duplicates.

confounding

Logical vector. Not applicable for (regular) non-causal or asymmetric explanations. confounding is a vector of logicals specifying whether confounding is assumed or not for each component in the causal_ordering. If NULL (default), then no assumption about the confounding structure is made and explain computes asymmetric/symmetric conditional Shapley values, depending on the value of asymmetric. If confounding is a single logical, i.e., FALSE or TRUE, then this assumption is set globally for all components in the causal ordering. Otherwise, confounding must be a vector of logicals of the same length as causal_ordering, indicating the confounding assumption for each component. When confounding is specified, then explain computes asymmetric/symmetric causal Shapley values, depending on the value of asymmetric. The approach cannot be regression_separate or regression_surrogate, as the regression-based approaches are not applicable to the causal Shapley value methodology.

as_string

Boolean. If the returned object is to be a list of lists of integers or a list of vectors of strings.

Value

Depends on the value of the parameter as_string. If as_string is TRUE, then results[j] is a vector of strings specifying the process of generating the samples for coalition j. The length of results[j] is the number of steps, and results[j][i] is a string of the form features_to_sample|features_to_condition_on. If the features_to_condition_on part is blank, then we are to sample from the marginal distribution. If as_string is FALSE, we instead return a list where results[[j]][[i]] contains the elements Sbar and S, representing the features to sample and condition on, respectively.

Author(s)

Lars Henry Berge Olsen


get_cov_mat

Description

get_cov_mat

Usage

get_cov_mat(x_train, min_eigen_value = 1e-06)

Arguments

x_train

Matrix or data.frame/data.table. Contains the data used to estimate the (conditional) distributions for the features needed to properly estimate the conditional expectations in the Shapley formula.

min_eigen_value

Numeric. Specifies the smallest allowed eigenvalue before the covariance matrix of x_train is assumed not to be positive definite, in which case Matrix::nearPD() is used to find the nearest positive definite matrix.


Set up data for explain_forecast

Description

Set up data for explain_forecast

Usage

get_data_forecast(
  y,
  xreg,
  train_idx,
  explain_idx,
  explain_y_lags,
  explain_xreg_lags,
  horizon
)

Arguments

y

Matrix, data.frame/data.table or a numeric vector. Contains the endogenous variables used to estimate the (conditional) distributions needed to properly estimate the conditional expectations in the Shapley formula including the observations to be explained.

xreg

Matrix, data.frame/data.table or a numeric vector. Contains the exogenous variables used to estimate the (conditional) distributions needed to properly estimate the conditional expectations in the Shapley formula including the observations to be explained. As exogenous variables are used contemporaneously when producing a forecast, this item should contain nrow(y) + horizon rows.

train_idx

Numeric vector. The row indices in y and xreg denoting points in time to use when estimating the conditional expectations in the Shapley value formula. If train_idx = NULL (default), all indices not selected to be explained will be used.

explain_idx

Numeric vector. The row indices in y and xreg denoting points in time to explain.

explain_y_lags

Numeric vector. Denotes the number of lags that should be used for each variable in y when making a forecast.

explain_xreg_lags

Numeric vector. If xreg != NULL, denotes the number of lags that should be used for each variable in xreg when making a forecast.

horizon

Numeric. The forecast horizon to explain. Passed to the predict_model function.

Value

A list containing


Fetches feature information from a given data set

Description

Fetches feature information from a given data set

Usage

get_data_specs(x)

Arguments

x

data.frame or data.table. The data to extract feature information from.

Details

This function is used to extract the feature information to be checked against the corresponding information extracted from the model and other data sets. The function is only called internally.

Value

A list with the following elements:

labels

character vector with the feature names to compute Shapley values for

classes

a named character vector with the labels as names and the class types as elements

factor_levels

a named list with the labels as names and character vectors with the factor levels as elements (NULL if the feature is not a factor)

Author(s)

Martin Jullum


Gets the default values for the extra computation arguments

Description

Gets the default values for the extra computation arguments

Usage

get_extra_comp_args_default(
  internal,
  paired_shap_sampling = isFALSE(internal$parameters$asymmetric),
  semi_deterministic_sampling = FALSE,
  kernelSHAP_reweighting = "on_all_cond",
  compute_sd = isFALSE(internal$parameters$exact),
  n_boot_samps = 100,
  vS_batching_method = "future",
  max_batch_size = 10,
  min_n_batches = 10
)

Arguments

internal

List. Not used directly, but passed through from explain().

paired_shap_sampling

Logical. If TRUE, paired versions of all sampled coalitions are also included in the computation. That is, if there are 5 features and e.g. the coalition (1, 3, 5) is sampled, then the paired coalition (2, 4) is also used for computing the Shapley values. This is done to reduce the variance of the Shapley value estimates. TRUE is the default and is recommended for highest accuracy. For asymmetric Shapley values, FALSE is the default and the only legal value.

semi_deterministic_sampling

Logical. If FALSE (default), then we sample from all coalitions. If TRUE, the sampling of coalitions is semi-deterministic: coalitions that are expected to be sampled, given the number of coalitions, are deterministically included, such that we sample among fewer coalitions. This is done to reduce the variance of the Shapley value estimates, and corresponds to the PySHAP* strategy in the paper Olsen & Jullum (2024).

kernelSHAP_reweighting

String. How to reweight the sampling frequency weights in the kernelSHAP solution after sampling. The aim of this is to reduce the randomness and thereby the variance of the Shapley value estimates. The options are one of 'none', 'on_N', 'on_all', 'on_all_cond' (default). 'none' means no reweighting, i.e. the sampling frequency weights are used as is. 'on_N' means the sampling frequencies are averaged over all coalitions with the same original sampling probabilities. 'on_all' means the original sampling probabilities are used for all coalitions. 'on_all_cond' means the original sampling probabilities are used for all coalitions, while adjusting for the probability that they are sampled at least once. 'on_all_cond' is preferred as it performs the best in simulation studies, see Olsen & Jullum (2024).

compute_sd

Logical. Whether to estimate the standard deviations of the Shapley value estimates. The default is TRUE whenever sampling-based kernelSHAP is applied (either iteratively or with a fixed number of coalitions).

n_boot_samps

Integer. The number of bootstrapped samples (i.e. samples with replacement) from the set of all coalitions used to estimate the standard deviations of the Shapley value estimates.

vS_batching_method

String. The method used to perform batch computing of vS. "future" (default) utilizes future.apply::future_apply (via the future package), enabling parallelized computation and progress updates via progressr. Alternatively, "forloop" can be used for straightforward sequential computation, which is mainly useful for package development and debugging.

max_batch_size

Integer. The maximum number of coalitions to estimate simultaneously within each iteration. A larger number requires more memory, but may have a slight computational advantage.

min_n_batches

Integer. The minimum number of batches to split the computation into within each iteration. Larger numbers give more frequent progress updates. If parallelization is applied, this should be set no smaller than the number of parallel workers.

Value

A list with the default values for the extra computation arguments.
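
A minimal sketch of overriding selected defaults through the extra_computation_args argument of explain(), reusing the model, x_explain, x_train and p objects from the explain() examples:

explanation <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "gaussian",
  phi0 = p,
  extra_computation_args = list(
    n_boot_samps = 200, # more bootstrap samples for the sd estimates
    min_n_batches = 2 # allow fewer, larger batches
  )
)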

Author(s)

Martin Jullum

References


This includes both extra parameters and other objects

Description

This includes both extra parameters and other objects

Usage

get_extra_parameters(internal, type)

Gets the feature specifications from the model

Description

Gets the feature specifications from the model

Usage

get_feature_specs(get_model_specs, model)

Arguments

get_model_specs

Function. An optional function for checking model/data consistency when model is not natively supported. (Run get_supported_models() for a list of natively supported models.) The function takes model as argument and provides a list with 3 elements:

labels

Character vector with the names of each feature.

classes

Character vector with the classes of each features.

factor_levels

Character vector with the levels for any categorical features.

If NULL (the default) internal functions are used for natively supported model classes, and the checking is disabled for unsupported model classes. Can also be used to override the default function for natively supported model classes.

model

Model object. Specifies the model whose predictions we want to explain. Run get_supported_models() for a table of which models explain supports natively. Unsupported models can still be explained by passing predict_model and (optionally) get_model_specs, see details for more information.


Function to specify arguments of the iterative estimation procedure

Description

Function to specify arguments of the iterative estimation procedure

Usage

get_iterative_args_default(
  internal,
  initial_n_coalitions = ceiling(min(200, max(5, internal$parameters$n_features,
    (2^internal$parameters$n_features)/10), internal$parameters$max_n_coalitions)),
  fixed_n_coalitions_per_iter = NULL,
  max_iter = 20,
  convergence_tol = 0.02,
  n_coal_next_iter_factor_vec = c(seq(0.1, 1, by = 0.1), rep(1, max_iter - 10))
)

Arguments

internal

List. Not used directly, but passed through from explain().

initial_n_coalitions

Integer. Number of coalitions to use in the first estimation iteration.

fixed_n_coalitions_per_iter

Integer. Number of coalitions to use in each iteration. NULL (default) means it is set adaptively, based on estimates of the number of coalitions needed to reach the set convergence threshold.

max_iter

Integer. Maximum number of estimation iterations.

convergence_tol

Numeric. The t variable in the convergence threshold formula on page 6 in the paper Covert and Lee (2021), 'Improving KernelSHAP: Practical Shapley Value Estimation via Linear Regression' https://arxiv.org/pdf/2012.01536. Smaller values require more coalitions before convergence is reached.

n_coal_next_iter_factor_vec

Numeric vector. The number of coalitions needed to reach convergence in the next iteration is estimated, and the number of coalitions actually used in iteration i is set to this estimate multiplied by n_coal_next_iter_factor_vec[i]. It is wise to start with smaller factors to avoid using too many coalitions due to uncertain estimates in the first iterations.

Details

The function sets default values for the iterative estimation procedure, according to the function defaults. If the argument iterative of explain() is FALSE, it sets parameters corresponding to the use of a non-iterative estimation procedure.

Value

A list with the default values for the iterative estimation procedure

Author(s)

Martin Jullum


Get the number of coalitions that respects the causal ordering

Description

Get the number of coalitions that respects the causal ordering

Usage

get_max_n_coalitions_causal(causal_ordering)

Arguments

causal_ordering

List. Not applicable for (regular) non-causal or asymmetric explanations. causal_ordering is an unnamed list of vectors specifying the components of the partial causal ordering that the coalitions must respect. Each vector represents a component and contains one or more features/groups identified by their names (strings) or indices (integers). If causal_ordering is NULL (default), no causal ordering is assumed and all possible coalitions are allowed. No causal ordering is equivalent to a causal ordering with a single component that includes all features (list(1:n_features)) or groups (list(1:n_groups)) for feature-wise and group-wise Shapley values, respectively. For feature-wise Shapley values and causal_ordering = list(c(1, 2), c(3, 4)), the interpretation is that features 1 and 2 are the ancestors of features 3 and 4, while features 3 and 4 are on the same level. Note: All features/groups must be included in the causal_ordering without any duplicates.

Details

The function computes the number of coalitions that respects the causal ordering by computing the number of coalitions in each partial causal component and then summing these. We compute the number of coalitions in the ith partial causal component as 2^n - 1, where n is the number of features in the ith partial causal component; we subtract one as we do not want to include the situation where no features in the ith partial causal component are present. In the end, we add 1 for the empty coalition.
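
For example, causal_ordering = list(1:2, 3:4) gives (2^2 - 1) + (2^2 - 1) + 1 = 7 coalitions. A minimal sketch of this computation, assuming the components are given as integer vectors:

# Sum the non-empty coalitions within each partial causal component,
# then add 1 for the empty coalition
causal_ordering <- list(1:2, 3:4)
sum(vapply(causal_ordering, function(comp) 2^length(comp) - 1, numeric(1))) + 1 # 7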

Value

Integer. The (maximum) number of coalitions that respects the causal ordering.

Author(s)

Lars Henry Berge Olsen


Fetches feature information from natively supported models

Description

This function is used to extract the feature information from the model to be checked against the corresponding feature information in the data passed to explain().

NOTE: You should never need to call this function explicitly. It is exported only to be more easily accessible to users, see details.

Usage

get_model_specs(x)

## Default S3 method:
get_model_specs(x)

## S3 method for class 'ar'
get_model_specs(x)

## S3 method for class 'Arima'
get_model_specs(x)

## S3 method for class 'forecast_ARIMA'
get_model_specs(x)

## S3 method for class 'glm'
get_model_specs(x)

## S3 method for class 'lm'
get_model_specs(x)

## S3 method for class 'gam'
get_model_specs(x)

## S3 method for class 'ranger'
get_model_specs(x)

## S3 method for class 'workflow'
get_model_specs(x)

## S3 method for class 'xgb.Booster'
get_model_specs(x)

Arguments

x

Model object for the model to be explained.

Details

If you are explaining a model not supported natively, you may (optionally) enable such checking by creating this function yourself and passing it on to explain().

Value

A list with the following elements:

labels

character vector with the feature names to compute Shapley values for

classes

a named character vector with the labels as names and the class type as elements

factor_levels

a named list with the labels as names and character vectors with the factor levels as elements (NULL if the feature is not a factor)

Author(s)

Martin Jullum

See Also

For model classes not supported natively, you NEED to create an analogue to predict_model(). See its help file for details.

Examples

# Load example data
data("airquality")
airquality <- airquality[complete.cases(airquality), ]
# Split data into test- and training data
x_train <- head(airquality, -3)
x_explain <- tail(airquality, 3)
# Fit a linear model
model <- lm(Ozone ~ Solar.R + Wind + Temp + Month, data = x_train)
get_model_specs(model)


get_mu_vec

Description

get_mu_vec

Usage

get_mu_vec(x_train)

Arguments

x_train

Matrix or data.frame/data.table. Contains the data used to estimate the (conditional) distributions for the features needed to properly estimate the conditional expectations in the Shapley formula.


Gets the default values for the output arguments

Description

Gets the default values for the output arguments

Usage

get_output_args_default(
  keep_samp_for_vS = FALSE,
  MSEv_uniform_comb_weights = TRUE,
  saving_path = tempfile("shapr_obj_", fileext = ".rds")
)

Arguments

keep_samp_for_vS

Logical. Indicates whether the samples used in the Monte Carlo estimation of v_S should be returned (in internal$output). Not used for approach="regression_separate" or approach="regression_surrogate".

MSEv_uniform_comb_weights

Logical. If TRUE (default), then the function weights the coalitions uniformly when computing the MSEv criterion. If FALSE, then the function uses the Shapley kernel weights to weight the coalitions when computing the MSEv criterion. Note that the Shapley kernel weights are replaced by the sampling frequencies when not all coalitions are considered.

saving_path

String. The path to the directory where the results of the iterative estimation procedure should be saved. Defaults to a temporary directory.

Value

A list of default output arguments.

Author(s)

Martin Jullum


Get predict_model function

Description

Get predict_model function

Usage

get_predict_model(predict_model, model)

Arguments

predict_model

Function. The prediction function used when model is not natively supported. See the documentation of explain() for details.

model

Model object. The model whose predictions ought to be explained. See the documentation of explain() for details.


Gets the implemented approaches

Description

Gets the implemented approaches

Usage

get_supported_approaches()

Value

Character vector. The names of the implemented approaches that can be passed to argument approach in explain().


Provides a data.table with the supported models

Description

Provides a data.table with the supported models

Usage

get_supported_models()

Value

A data.table with the supported models.


Get all coalitions satisfying the causal ordering

Description

This function is only relevant when we are computing asymmetric Shapley values. For symmetric Shapley values (both regular and causal), all coalitions are allowed.

Usage

get_valid_causal_coalitions(
  causal_ordering,
  sort_features_in_coalitions = TRUE
)

Arguments

causal_ordering

List. Not applicable for (regular) non-causal or asymmetric explanations. causal_ordering is an unnamed list of vectors specifying the components of the partial causal ordering that the coalitions must respect. Each vector represents a component and contains one or more features/groups identified by their names (strings) or indices (integers). If causal_ordering is NULL (default), no causal ordering is assumed and all possible coalitions are allowed. No causal ordering is equivalent to a causal ordering with a single component that includes all features (list(1:n_features)) or groups (list(1:n_groups)) for feature-wise and group-wise Shapley values, respectively. For feature-wise Shapley values and causal_ordering = list(c(1, 2), c(3, 4)), the interpretation is that features 1 and 2 are the ancestors of features 3 and 4, while features 3 and 4 are on the same level. Note: All features/groups must be included in the causal_ordering without any duplicates.

sort_features_in_coalitions

Boolean. If TRUE, then the feature indices in the coalitions are sorted in increasing order. If FALSE, then the function maintains the order of features within each group given in causal_ordering.

Value

List of vectors containing all coalitions that respects the causal ordering.

Author(s)

Lars Henry Berge Olsen


Set up user provided groups for explanation in a forecast model.

Description

Set up user provided groups for explanation in a forecast model.

Usage

group_forecast_setup(group, horizon_features)

Arguments

group

The list of groups to be explained.

horizon_features

A list of features per horizon, to split appropriate groups over.

Value

A list containing


Computing single H matrix in AICc-function using the Mahalanobis distance

Description

Computing single H matrix in AICc-function using the Mahalanobis distance

Usage

hat_matrix_cpp(X, mcov, S_scale_dist, h)

Arguments

X

matrix.

mcov

Matrix. The covariance matrix of X.

S_scale_dist

Logical. Indicates whether the Mahalanobis distance should be scaled with the number of variables.

h

numeric specifying the scaling (sigma)

Value

Matrix of dimension ncol(X)*ncol(X)

Author(s)

Martin Jullum


Transforms new data to a standardized normal distribution

Description

Transforms new data to a standardized normal distribution

Usage

inv_gaussian_transform_cpp(z, x)

Arguments

z

arma::mat. The data are the Gaussian Monte Carlos samples to transform.

x

arma::mat. The data with the original transformation. Used to conduct the transformation of z.

Value

arma::mat of the same dimension as z

Author(s)

Lars Henry Berge Olsen


Lag a matrix of variables a specific number of lags for each variable.

Description

Lag a matrix of variables a specific number of lags for each variable.

Usage

lag_data(x, lags)

Arguments

x

The matrix of variables (one variable per column).

lags

A numeric vector denoting how many lags each variable should have.

Value

A list with two items


(Generalized) Mahalanobis distance

Description

Used to get the Euclidean distance as well by setting mcov = diag(m).

Usage

mahalanobis_distance_cpp(
  featureList,
  Xtrain_mat,
  Xexplain_mat,
  mcov,
  S_scale_dist
)

Arguments

featureList

List. Contains the vectors indicating all feature combinations that should be included in the computations. Assumes that the first one is empty.

Xtrain_mat

Matrix. Training data in matrix form.

Xexplain_mat

Matrix. Explanation data in matrix form.

mcov

Matrix. The covariance matrix of X.

S_scale_dist

Logical. Indicates whether the Mahalanobis distance should be scaled with the number of variables.

Value

Array of three dimensions. Contains the squared distances between all training and test observations for all feature combinations passed to the function.

Author(s)

Martin Jullum


Missing Completely at Random (MCAR) Mask Generator

Description

A mask generator which masks the entries in the input completely at random.

Usage

mcar_mask_generator(masking_ratio = 0.5, paired_sampling = FALSE)

Arguments

masking_ratio

Numeric between 0 and 1. The probability for an entry in the generated mask to be 1 (masked).

paired_sampling

Boolean. Whether to use paired sampling, i.e., to include both S and \bar{S}. If TRUE, then batch must be sampled using paired_sampler(), which ensures that the batch contains two instances of each original observation. That is, batch = [X_1, X_1, X_2, X_2, X_3, X_3, ...], where each entry X_j is a row of dimension p (i.e., the number of features).

Details

The mask generator masks each element in the batch (N x p) using a component-wise independent Bernoulli distribution with probability masking_ratio. The default value for masking_ratio is 0.5, in which case all masks are equally likely to be generated, including the empty and full masks. The function returns a mask of the same shape as the input batch, and the batch can contain missing values, indicated by the "NaN" token, which will always be masked.
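
A minimal R sketch of the masking rule on hypothetical toy inputs (the actual generator operates on torch tensors):

set.seed(1)
N <- 4; p <- 3; masking_ratio <- 0.5
batch <- matrix(rnorm(N * p), N, p)
batch[2, 3] <- NaN # missing entries are always masked
mask <- matrix(rbinom(N * p, 1, masking_ratio), N, p)
mask[is.nan(batch)] <- 1
mask # 1 = masked, 0 = observed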

Shape

Author(s)

Lars Henry Berge Olsen


A torch::nn_module() Representing a Memory Layer

Description

The layer is used to make skip-connections inside a torch::nn_sequential() network or between several torch::nn_sequential() networks without unnecessary code complication.

Usage

memory_layer(id, shared_env, output = FALSE, add = FALSE, verbose = FALSE)

Arguments

id

A unique id to use as a key in the storage list.

shared_env

A shared environment for all instances of memory_layer where the inputs are stored.

output

Boolean variable indicating if the memory layer is to store input in storage or extract from storage.

add

Boolean variable indicating if the extracted value is to be added or concatenated to the input. Only applicable when output = TRUE.

verbose

Boolean variable indicating if we want to give printouts to the user.

Details

If output = FALSE, this layer stores its input in the shared_env with the key id and then passes the input on to the next layer; this is how the memory layer is used in the masked encoder. If output = TRUE, this layer takes the stored tensor from the storage; this is how the memory layer is used in the decoder. If add = TRUE, it returns the sum of the stored vector and the input, otherwise it returns their concatenation. If the tensor with the specified id is not in storage when the layer with output = TRUE is called, an exception is raised.
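
A hedged sketch of typical usage, assuming the torch package is installed (the layer sizes are arbitrary): the first network stores its input under id "#1", and the second adds the stored tensor back to its output.

shared_env <- new.env() # environment shared by the paired memory layers
net1 <- torch::nn_sequential(
  memory_layer(id = "#1", shared_env = shared_env), # stores its input
  torch::nn_linear(8, 8),
  torch::nn_relu()
)
net2 <- torch::nn_sequential(
  torch::nn_linear(8, 8),
  memory_layer(id = "#1", shared_env = shared_env, output = TRUE, add = TRUE) # adds the stored tensor
)
out <- net2(net1(torch::torch_randn(2, 8)))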

Author(s)

Lars Henry Berge Olsen


Check that the type of model is supported by the native implementation of the model class

Description

The function checks whether the model given by x is supported. If x is not a supported model, the function returns an error message; otherwise it returns NULL (meaning all models of this class are supported).

Usage

model_checker(x)

## Default S3 method:
model_checker(x)

## S3 method for class 'ar'
model_checker(x)

## S3 method for class 'Arima'
model_checker(x)

## S3 method for class 'forecast_ARIMA'
model_checker(x)

## S3 method for class 'glm'
model_checker(x)

## S3 method for class 'lm'
model_checker(x)

## S3 method for class 'gam'
model_checker(x)

## S3 method for class 'ranger'
model_checker(x)

## S3 method for class 'workflow'
model_checker(x)

## S3 method for class 'xgb.Booster'
model_checker(x)

Arguments

x

Model object for the model to be explained.

Value

Error or NULL

See Also

See predict_model() for more information about what type of models shapr currently support.


Generate permutations of training data using test observations

Description

Generate permutations of training data using test observations

Usage

observation_impute(
  W_kernel,
  S,
  x_train,
  x_explain,
  empirical.eta = 0.7,
  n_MC_samples = 1000
)

Arguments

W_kernel

Numeric matrix. Contains all nonscaled weights between training and test observations for all coalitions. The dimension equals n_train x m.

S

Integer matrix of dimension n_coalitions x m, where n_coalitions and m equal the total number of sampled/non-sampled coalitions and the total number of unique features, respectively. Note that m = ncol(x_train).

x_train

Data.table with training data.

x_explain

Data.table with the features of the observation whose predictions ought to be explained (test data).

Value

data.table

Author(s)

Nikolai Sellereite


Get imputed data

Description

Get imputed data

Usage

observation_impute_cpp(index_xtrain, index_s, x_train, x_explain, S)

Arguments

index_xtrain

Positive integer. Represents a sequence of row indices from x_train, i.e. min(index_xtrain) >= 1 and max(index_xtrain) <= nrow(x_train).

index_s

Positive integer. Represents a sequence of row indices from S, i.e. min(index_s) >= 1 and max(index_s) <= nrow(S).

x_train

Matrix. Contains the training data.

x_explain

Matrix with 1 row. Contains the features of the observation for a single prediction.

S

arma::mat. Matrix of dimension (n_coalitions, n_features) containing binary representations of the used coalitions. S cannot contain the empty or grand coalition, i.e., a row containing only zeros or ones. This is not a problem internally in shapr as the empty and grand coalitions are treated differently.

Details

S(i, j) = 1 if and only if feature j is present in feature combination i, otherwise S(i, j) = 0. I.e. if m = 3, there are 2^3 = 8 unique ways to combine the features. In this case dim(S) = c(8, 3). Let's call the features x1, x2, x3 and take a closer look at the combination represented by s = c(x1, x2). If this combination is represented by the second row, the following is true: S[2, 1:3] = c(1, 1, 0).

The returned object, X, is a numeric matrix where dim(X) = c(length(index_xtrain), ncol(x_train)). If feature j is present in the k-th observation, that is S[index_s[k], j] == 1, then X[k, j] = x_explain[1, j]. Otherwise, X[k, j] = x_train[index_xtrain[k], j].
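
A pure-R sketch of this rule on hypothetical toy inputs (the actual function is implemented in C++):

x_train <- matrix(1:6, nrow = 2) # 2 training observations, 3 features
x_explain <- matrix(c(10, 20, 30), nrow = 1) # the observation to explain
S <- rbind(c(1, 1, 0), c(0, 1, 1)) # two coalitions
index_xtrain <- c(1, 2)
index_s <- c(1, 2)
X <- x_train[index_xtrain, , drop = FALSE]
for (k in seq_along(index_xtrain)) {
  present <- S[index_s[k], ] == 1
  X[k, present] <- x_explain[1, present] # present features come from x_explain
}
X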

Value

Numeric matrix

Author(s)

Nikolai Sellereite


Sampling Paired Observations

Description

A sampler used to create batches where each instance is sampled twice.

Usage

paired_sampler(vaeac_dataset_object, shuffle = FALSE)

Arguments

vaeac_dataset_object

A vaeac_dataset() object containing the data.

shuffle

Boolean. If TRUE, then the data is shuffled. If FALSE, then the data is returned in chronological order.

Details

A sampler object that allows for paired sampling by always including each observation from the vaeac_dataset() twice. A torch::sampler() object can be used with torch::dataloader() when creating batches from a torch dataset torch::dataset(). See https://rdrr.io/cran/torch/src/R/utils-data-sampler.R for more information. This function does not use batch iterators, which might increase the speed.

Author(s)

Lars Henry Berge Olsen


Plot of the Shapley value explanations

Description

Plots the individual prediction explanations.

Usage

## S3 method for class 'shapr'
plot(
  x,
  plot_type = "bar",
  digits = 3,
  index_x_explain = NULL,
  top_k_features = NULL,
  col = NULL,
  bar_plot_phi0 = TRUE,
  bar_plot_order = "largest_first",
  scatter_features = NULL,
  scatter_hist = TRUE,
  include_group_feature_means = FALSE,
  beeswarm_cex = 1/length(index_x_explain)^(1/4),
  ...
)

Arguments

x

A shapr object. The output from explain().

plot_type

Character. Specifies the type of plot to produce. "bar" (the default) gives a regular horizontal bar plot of the Shapley value magnitudes. "waterfall" gives a waterfall plot indicating the changes in the prediction score due to each feature's contribution (its Shapley value). "scatter" plots the feature values on the x-axis and Shapley values on the y-axis, as well as (optionally) a background scatter_hist showing the distribution of the feature data. "beeswarm" summarizes the distribution of the Shapley values along the x-axis for all the features. Each point gives the Shapley value of a given instance, where the points are colored by the feature value of that instance.

digits

Integer. Number of significant digits to use in the feature description. Applicable for plot_type "bar" and "waterfall".

index_x_explain

Integer vector. Which of the test observations to plot. E.g. if you have explained 10 observations using explain(), you can generate a plot for the first 5 observations by setting index_x_explain = 1:5.

top_k_features

Integer. How many features to include in the plot. E.g. if you have 15 features in your model you can plot the 5 most important features, for each explanation, by setting top_k_features = 5. Applicable for plot_type "bar" and "waterfall".

col

Character vector (where length depends on plot type). The color codes (hex codes or other names understood by ggplot2::ggplot()) for positive and negative Shapley values, respectively. The default is col=NULL, plotting with the default colors respective to the plot type. For plot_type = "bar" and plot_type = "waterfall", the default is c("#00BA38","#F8766D"). For plot_type = "beeswarm", the default is c("#F8766D","yellow","#00BA38"). For plot_type = "scatter", the default is "#619CFF".

If you want to alter the colors in the plot, the length of the col vector depends on plot type. For plot_type = "bar" or plot_type = "waterfall", two colors should be provided, first for positive and then for negative Shapley values. For plot_type = "beeswarm", either two or three colors can be given. If two colors are given, then the first color determines the color that points with high feature values will have, and the second determines the color of points with low feature values. If three colors are given, then the first colors high feature values, the second colors mid-range feature values, and the third colors low feature values. For instance, col = c("red", "yellow", "blue") will make high values red, mid-range values yellow, and low values blue. For plot_type = "scatter", a single color is to be given, which determines the color of the points on the scatter plot.

bar_plot_phi0

Logical. Whether to include phi0 in the plot for plot_type = "bar".

bar_plot_order

Character. Specifies the order in which to plot the features with respect to the magnitude of the Shapley values with plot_type = "bar": "largest_first" (the default) plots the features ordered from largest to smallest absolute Shapley value. "smallest_first" plots the features ordered from smallest to largest absolute Shapley value. "original" plots the features in the original order of the data table.

scatter_features

Integer or character vector. Only used for plot_type = "scatter". Specifies what features to include in (scatter) plot. Can be a numerical vector indicating feature index, or a character vector, indicating the name(s) of the feature(s) to plot.

scatter_hist

Logical. Only used for plot_type = "scatter". Whether to include a scatter_hist indicating the distribution of the data when making the scatter plot. Note that the bins are scaled so that when all the bins are stacked they fit the span of the y-axis of the plot.

include_group_feature_means

Logical. Whether to include the average feature value in a group on the y-axis or not. If FALSE (default), then no value is shown for the groups. If TRUE, then shapr includes the mean of the features in each group.

beeswarm_cex

Numeric. The cex argument of ggbeeswarm::geom_beeswarm(), controlling the spacing in the beeswarm plots.

...

Other arguments passed to underlying functions, like ggbeeswarm::geom_beeswarm() for plot_type = "beeswarm".

Details

See the examples below, or vignette("general_usage", package = "shapr"), for examples of how to use the function.

Value

ggplot object with plots of the Shapley value explanations

Author(s)

Martin Jullum, Vilde Ung, Lars Henry Berge Olsen

Examples


if (requireNamespace("party", quietly = TRUE)) {
  data("airquality")
  airquality <- airquality[complete.cases(airquality), ]
  x_var <- c("Solar.R", "Wind", "Temp", "Month")
  y_var <- "Ozone"

  # Split data into test- and training data
  data_train <- head(airquality, -50)
  data_explain <- tail(airquality, 50)

  x_train <- data_train[, x_var]
  x_explain <- data_explain[, x_var]

  # Fit a linear model
  lm_formula <- as.formula(paste0(y_var, " ~ ", paste0(x_var, collapse = " + ")))
  model <- lm(lm_formula, data = data_train)

  # Explain predictions
  p <- mean(data_train[, y_var])

  # Empirical approach
  x <- explain(
    model = model,
    x_explain = x_explain,
    x_train = x_train,
    approach = "empirical",
    phi0 = p,
    n_MC_samples = 1e2
  )

  if (all(vapply(c("ggplot2", "ggbeeswarm"), requireNamespace, logical(1), quietly = TRUE))) {
    # The default plotting option is a bar plot of the Shapley values
    # We draw bar plots for the first 4 observations
    plot(x, index_x_explain = 1:4)

    # We can also make waterfall plots
    plot(x, plot_type = "waterfall", index_x_explain = 1:4)
    # And only showing the 2 features with largest contribution
    plot(x, plot_type = "waterfall", index_x_explain = 1:4, top_k_features = 2)

    # Or scatter plots showing the distribution of the Shapley values and feature values
    plot(x, plot_type = "scatter")
    # And only for a specific feature
    plot(x, plot_type = "scatter", scatter_features = "Temp")

    # Or a beeswarm plot summarising the Shapley values and feature values for all features
    plot(x, plot_type = "beeswarm")
    plot(x, plot_type = "beeswarm", col = c("red", "black")) # we can change colors

    # Additional arguments can be passed to ggbeeswarm::geom_beeswarm() using the '...' argument.
    # For instance, sometimes the beeswarm plots overlap too much.
    # This can be fixed with the corral = "wrap" argument.
    # See ?ggbeeswarm::geom_beeswarm for more information.
    plot(x, plot_type = "beeswarm", corral = "wrap")
  }

  # Example of scatter and beeswarm plot with factor variables
  airquality$Month_factor <- as.factor(month.abb[airquality$Month])
  airquality <- airquality[complete.cases(airquality), ]
  x_var <- c("Solar.R", "Wind", "Temp", "Month_factor")
  y_var <- "Ozone"

  # Split data into test- and training data
  data_train <- airquality
  data_explain <- tail(airquality, 50)

  x_train <- data_train[, x_var]
  x_explain <- data_explain[, x_var]

  # Fit a linear model
  lm_formula <- as.formula(paste0(y_var, " ~ ", paste0(x_var, collapse = " + ")))
  model <- lm(lm_formula, data = data_train)

  # Explain predictions
  p <- mean(data_train[, y_var])

  # ctree approach
  x <- explain(
    model = model,
    x_explain = x_explain,
    x_train = x_train,
    approach = "ctree",
    phi0 = p,
    n_MC_samples = 1e2
  )

  if (all(vapply(c("ggplot2", "ggbeeswarm"), requireNamespace, logical(1), quietly = TRUE))) {
    plot(x, plot_type = "scatter")
    plot(x, plot_type = "beeswarm")
  }
}



Plots of the MSEv Evaluation Criterion

Description

Make plots to visualize and compare the MSEv evaluation criterion for a list of explain() objects applied to the same data and model. The function creates bar plots and line plots with points to illustrate the overall MSEv evaluation criterion, but also for each observation/explicand and coalition by only averaging over the coalitions and observations/explicands, respectively.

Usage

plot_MSEv_eval_crit(
  explanation_list,
  index_x_explain = NULL,
  id_coalition = NULL,
  CI_level = if (length(explanation_list[[1]]$pred_explain) < 20) NULL else 0.95,
  geom_col_width = 0.9,
  plot_type = "overall"
)

Arguments

explanation_list

A list of explain() objects applied to the same data and model. If the entries in the list are named, then the function uses these names. Otherwise, they default to the approach names (with integer suffix for duplicates) for the explanation objects in explanation_list.

index_x_explain

Integer vector. Which of the test observations to plot. E.g. if you have explained 10 observations using explain(), you can generate a plot for the first 5 observations by setting index_x_explain = 1:5.

id_coalition

Integer vector. Which of the coalitions to plot. E.g. if you used n_coalitions = 16 in explain(), you can generate a plot for the first 5 coalitions and the 10th by setting id_coalition = c(1:5, 10).

CI_level

Positive numeric between zero and one. Default is 0.95 if the number of observations to explain is larger than 20, otherwise CI_level = NULL, which removes the confidence intervals. The level of the approximate confidence intervals for the overall MSEv and the MSEv_coalition. The confidence intervals are based on the fact that the MSEv scores are means over the observations/explicands, and that means are approximately normal. Since the standard deviations are estimated, we use the t-quantile from the t-distribution with N_explicands - 1 degrees of freedom corresponding to the provided level. Here, N_explicands is the number of observations/explicands, and the intervals are MSEv +/- t * SD(MSEv)/sqrt(N_explicands). Note that the explain() function already scales the standard deviation by sqrt(N_explicands); thus, the CIs are MSEv +/- t * MSEv_sd, where the values MSEv and MSEv_sd are extracted from the MSEv data.tables in the objects in the explanation_list. See the sketch below the argument list.

geom_col_width

Numeric. Bar width. By default, set to 90% of the ggplot2::resolution() of the data.

plot_type

Character vector. The possible options are "overall" (default), "comb", and "explicand". If plot_type = "overall", then the plot (one bar plot) associated with the overall MSEv evaluation criterion for each method is created, i.e., when averaging over both the coalitions and observations/explicands. If plot_type = "comb", then the plots (one line plot and one bar plot) associated with the MSEv evaluation criterion for each coalition are created, i.e., when we only average over the observations/explicands. If plot_type = "explicand", then the plots (one line plot and one bar plot) associated with the MSEv evaluation criterion for each observation/explicand are created, i.e., when we only average over the coalitions. If plot_type is a vector of one or several of "overall", "comb", and "explicand", then the associated plots are created.
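The confidence interval computation described under CI_level amounts to the following sketch (illustration only; MSEv and MSEv_sd denote the values extracted from the MSEv data.tables):

# t-based approximate confidence interval for the MSEv criterion.
MSEv_CI <- function(MSEv, MSEv_sd, n_explicands, CI_level = 0.95) {
  t_quantile <- stats::qt(1 - (1 - CI_level) / 2, df = n_explicands - 1)
  c(lower = MSEv - t_quantile * MSEv_sd, upper = MSEv + t_quantile * MSEv_sd)
}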

Value

Either a single ggplot2::ggplot() object of the MSEv criterion when plot_type = "overall", or a list of ggplot2::ggplot() objects based on the plot_type parameter.

Author(s)

Lars Henry Berge Olsen

Examples


if (requireNamespace("xgboost", quietly = TRUE) && requireNamespace("ggplot2", quietly = TRUE)) {
  # Get the data
  data("airquality")
  data <- data.table::as.data.table(airquality)
  data <- data[complete.cases(data), ]

  # Define the features and the response
  x_var <- c("Solar.R", "Wind", "Temp", "Month")
  y_var <- "Ozone"

  # Split data into test and training data set
  ind_x_explain <- 1:25
  x_train <- data[-ind_x_explain, ..x_var]
  y_train <- data[-ind_x_explain, get(y_var)]
  x_explain <- data[ind_x_explain, ..x_var]

  # Fitting a basic xgboost model to the training data
  model <- xgboost::xgboost(
    data = as.matrix(x_train),
    label = y_train,
    nrounds = 20,
    verbose = FALSE
  )

  # Specifying the phi_0, i.e. the expected prediction without any features
  phi0 <- mean(y_train)

  # Independence approach
  explanation_independence <- explain(
    model = model,
    x_explain = x_explain,
    x_train = x_train,
    approach = "independence",
    phi0 = phi0,
    n_MC_samples = 1e2
  )

  # Gaussian 1e1 approach
  explanation_gaussian_1e1 <- explain(
    model = model,
    x_explain = x_explain,
    x_train = x_train,
    approach = "gaussian",
    phi0 = phi0,
    n_MC_samples = 1e1
  )

  # Gaussian 1e2 approach
  explanation_gaussian_1e2 <- explain(
    model = model,
    x_explain = x_explain,
    x_train = x_train,
    approach = "gaussian",
    phi0 = phi0,
    n_MC_samples = 1e2
  )

  # ctree approach
  explanation_ctree <- explain(
    model = model,
    x_explain = x_explain,
    x_train = x_train,
    approach = "ctree",
    phi0 = phi0,
    n_MC_samples = 1e2
  )

  # Combined approach
  explanation_combined <- explain(
    model = model,
    x_explain = x_explain,
    x_train = x_train,
    approach = c("gaussian", "independence", "ctree"),
    phi0 = phi0,
    n_MC_samples = 1e2
  )

  # Create a list of explanations with names
  explanation_list_named <- list(
    "Ind." = explanation_independence,
    "Gaus. 1e1" = explanation_gaussian_1e1,
    "Gaus. 1e2" = explanation_gaussian_1e2,
    "Ctree" = explanation_ctree,
    "Combined" = explanation_combined
  )

  # Create the default MSEv plot where we average over both the coalitions and observations
  # with approximate 95% confidence intervals
  plot_MSEv_eval_crit(explanation_list_named, CI_level = 0.95, plot_type = "overall")

  # Can also create plots of the MSEv criterion averaged only over the coalitions or observations.
  MSEv_figures <- plot_MSEv_eval_crit(explanation_list_named,
    CI_level = 0.95,
    plot_type = c("overall", "comb", "explicand")
  )
  MSEv_figures$MSEv_bar
  MSEv_figures$MSEv_coalition_bar
  MSEv_figures$MSEv_explicand_bar

  # When there are many coalitions or observations, then it can be easier to look at line plots
  MSEv_figures$MSEv_coalition_line_point
  MSEv_figures$MSEv_explicand_line_point

  # We can specify which observations or coalitions to plot
  plot_MSEv_eval_crit(explanation_list_named,
    plot_type = "explicand",
    index_x_explain = c(1, 3:4, 6),
    CI_level = 0.95
  )$MSEv_explicand_bar
  plot_MSEv_eval_crit(explanation_list_named,
    plot_type = "comb",
    id_coalition = c(3, 4, 9, 13:15),
    CI_level = 0.95
  )$MSEv_coalition_bar

  # We can alter the figures if other palette schemes or design is wanted
  bar_text_n_decimals <- 1
  MSEv_figures$MSEv_bar +
    ggplot2::scale_x_discrete(limits = rev(levels(MSEv_figures$MSEv_bar$data$Method))) +
    ggplot2::coord_flip() +
    ggplot2::scale_fill_discrete() + # Default ggplot2 palette
    ggplot2::theme_minimal() + # This must be set before the other theme call
    ggplot2::theme(
      plot.title = ggplot2::element_text(size = 10),
      legend.position = "bottom"
    ) +
    ggplot2::guides(fill = ggplot2::guide_legend(nrow = 1, ncol = 6)) +
    ggplot2::geom_text(
      ggplot2::aes(label = sprintf(
        paste("%.", sprintf("%d", bar_text_n_decimals), "f", sep = ""),
        round(MSEv, bar_text_n_decimals)
      )),
      vjust = -1.1, # This value must be altered based on the plot dimension
      hjust = 1.1, # This value must be altered based on the plot dimension
      color = "black",
      position = ggplot2::position_dodge(0.9),
      size = 5
    )
}



Shapley value bar plots for several explanation objects

Description

Make plots to visualize and compare the estimated Shapley values for a list of explain() objects applied to the same data and model. For group-wise Shapley values, the feature values plotted are the mean feature values of all features in each group.

Usage

plot_SV_several_approaches(
  explanation_list,
  index_explicands = NULL,
  index_explicands_sort = FALSE,
  only_these_features = NULL,
  plot_phi0 = FALSE,
  digits = 4,
  add_zero_line = FALSE,
  axis_labels_n_dodge = NULL,
  axis_labels_rotate_angle = NULL,
  horizontal_bars = TRUE,
  facet_scales = "free",
  facet_ncol = 2,
  geom_col_width = 0.85,
  brewer_palette = NULL,
  include_group_feature_means = FALSE
)

Arguments

explanation_list

A list of explain() objects applied to the same data and model. If the entries in the list are named, then the function uses these names. Otherwise, they default to the approach names (with integer suffix for duplicates) for the explanation objects in explanation_list.

index_explicands

Integer vector. Which of the explicands (test observations) to plot. E.g. if you have explained 10 observations using explain(), you can generate a plot for the first 5 observations/explicands and the 10th by setting index_explicands = c(1:5, 10). The argument index_explicands_sort must be FALSE to plot the explicands in the order specified in index_explicands.

index_explicands_sort

Boolean. If FALSE (default), then shapr plots the explicands in the order specified in index_explicands. If TRUE, then shapr sorts the indices in increasing order based on their id.

only_these_features

String vector containing the names of the features to include in the bar plots.

plot_phi0

Boolean. Whether to include phi0 in the bar plots.

digits

Integer. Number of significant digits to use in the feature description. Applicable for plot_type "bar" and "waterfall".

add_zero_line

Boolean. Whether to add a black line at a feature contribution of 0.

axis_labels_n_dodge

Integer. The number of rows that should be used to render the labels. This is useful for displaying labels that would otherwise overlap.

axis_labels_rotate_angle

Numeric. The angle of the axis label, where 0 means horizontal, 45 means tilted, and 90 means vertical. Compared to setting the angle in ggplot2::theme() / ggplot2::element_text(), this also uses some heuristics to automatically pick the hjust and vjust that you probably want.

horizontal_bars

Boolean. Flip Cartesian coordinates so that horizontal becomes vertical, and vertical, horizontal. This is primarily useful for converting geoms and statistics which display y conditional on x, to x conditional on y. See ggplot2::coord_flip().

facet_scales

Should scales be free ("free", the default), fixed ("fixed"), or free in one dimension ("free_x", "free_y")? The user has to change the latter manually depending on the value of horizontal_bars.

facet_ncol

Integer. The number of columns in the facet grid. Default is facet_ncol = 2.

geom_col_width

Numeric. Bar width. By default, set to 85% of the ggplot2::resolution() of the data.

brewer_palette

String. Name of one of the color palettes from RColorBrewer::RColorBrewer(). If NULL, then the function uses the default ggplot2::ggplot() color scheme. The following palettes are available for use with these scales:

Diverging

BrBG, PiYG, PRGn, PuOr, RdBu, RdGy, RdYlBu, RdYlGn, Spectral

Qualitative

Accent, Dark2, Paired, Pastel1, Pastel2, Set1, Set2, Set3

Sequential

Blues, BuGn, BuPu, GnBu, Greens, Greys, Oranges, OrRd, PuBu, PuBuGn, PuRd, Purples, RdPu, Reds, YlGn, YlGnBu, YlOrBr, YlOrRd

include_group_feature_means

Logical. Whether to include the average feature value in a group on the y-axis or not. If FALSE (default), then no value is shown for the groups. If TRUE, then shapr includes the mean of the features in each group.

Value

A ggplot2::ggplot() object.

Author(s)

Lars Henry Berge Olsen

Examples

## Not run: 
if (requireNamespace("xgboost", quietly = TRUE) && requireNamespace("ggplot2", quietly = TRUE)) {
  # Get the data
  data("airquality")
  data <- data.table::as.data.table(airquality)
  data <- data[complete.cases(data), ]

  # Define the features and the response
  x_var <- c("Solar.R", "Wind", "Temp", "Month")
  y_var <- "Ozone"

  # Split data into test and training data set
  ind_x_explain <- 1:12
  x_train <- data[-ind_x_explain, ..x_var]
  y_train <- data[-ind_x_explain, get(y_var)]
  x_explain <- data[ind_x_explain, ..x_var]

  # Fitting a basic xgboost model to the training data
  model <- xgboost::xgboost(
    data = as.matrix(x_train),
    label = y_train,
    nrounds = 20,
    verbose = FALSE
  )

  # Specifying the phi_0, i.e. the expected prediction without any features
  phi0 <- mean(y_train)

  # Independence approach
  explanation_independence <- explain(
    model = model,
    x_explain = x_explain,
    x_train = x_train,
    approach = "independence",
    phi0 = phi0,
    n_MC_samples = 1e2
  )

  # Empirical approach
  explanation_empirical <- explain(
    model = model,
    x_explain = x_explain,
    x_train = x_train,
    approach = "empirical",
    phi0 = phi0,
    n_MC_samples = 1e2
  )

  # Gaussian 1e1 approach
  explanation_gaussian_1e1 <- explain(
    model = model,
    x_explain = x_explain,
    x_train = x_train,
    approach = "gaussian",
    phi0 = phi0,
    n_MC_samples = 1e1
  )

  # Gaussian 1e2 approach
  explanation_gaussian_1e2 <- explain(
    model = model,
    x_explain = x_explain,
    x_train = x_train,
    approach = "gaussian",
    phi0 = phi0,
    n_MC_samples = 1e2
  )

  # Combined approach
  explanation_combined <- explain(
    model = model,
    x_explain = x_explain,
    x_train = x_train,
    approach = c("gaussian", "ctree", "empirical"),
    phi0 = phi0,
    n_MC_samples = 1e2
  )

  # Create a list of explanations with names
  explanation_list <- list(
    "Ind." = explanation_independence,
    "Emp." = explanation_empirical,
    "Gaus. 1e1" = explanation_gaussian_1e1,
    "Gaus. 1e2" = explanation_gaussian_1e2,
    "Combined" = explanation_combined
  )

  # The function uses the provided names.
  plot_SV_several_approaches(explanation_list)

  # We can change the number of columns in the grid of plots and add other visual alterations
  plot_SV_several_approaches(explanation_list,
    facet_ncol = 3,
    facet_scales = "free_y",
    add_zero_line = TRUE,
    digits = 2,
    brewer_palette = "Paired",
    geom_col_width = 0.6
  ) +
    ggplot2::theme_minimal() +
    ggplot2::theme(legend.position = "bottom", plot.title = ggplot2::element_text(size = 0))


  # We can specify which explicands to plot to get less chaotic plots and make the bars vertical
  plot_SV_several_approaches(explanation_list,
    index_explicands = c(1:2, 5, 10),
    horizontal_bars = FALSE,
    axis_labels_rotate_angle = 45
  )

  # We can change the order of the features by specifying the
  # order using the `only_these_features` parameter.
  plot_SV_several_approaches(explanation_list,
    index_explicands = c(1:2, 5, 10),
    only_these_features = c("Temp", "Solar.R", "Month", "Wind")
  )

  # We can also remove certain features if we are not interested in them
  # or want to focus on, e.g., two features. The function will give a
  # message if the user specifies invalid feature names.
  plot_SV_several_approaches(explanation_list,
    index_explicands = c(1:2, 5, 10),
    only_these_features = c("Temp", "Solar.R"),
    plot_phi0 = TRUE
  )
}

## End(Not run)


Plot the training VLB and validation IWAE for vaeac models

Description

This function makes (ggplot2::ggplot()) figures of the training VLB and the validation IWAE for a list of explain() objects with approach = "vaeac". See setup_approach() for more information about the vaeac approach. Two figures are returned by the function. In the first figure, each object in explanation_list gets its own facet, while in the second figure, we plot the criteria in each facet for all objects.

Usage

plot_vaeac_eval_crit(
  explanation_list,
  plot_from_nth_epoch = 1,
  plot_every_nth_epoch = 1,
  criteria = c("VLB", "IWAE"),
  plot_type = c("method", "criterion"),
  facet_wrap_scales = "fixed",
  facet_wrap_ncol = NULL
)

Arguments

explanation_list

A list of explain() objects applied to the same data and model, all with approach = "vaeac". If the entries in the list are named, then the function uses these names. Otherwise, it defaults to the approach names (with integer suffix for duplicates) for the explanation objects in explanation_list.

plot_from_nth_epoch

Integer. If we are only to plot the results from the nth epoch onwards. The first epochs can be large in absolute value and make the rest of the plot difficult to interpret.

plot_every_nth_epoch

Integer. If we are only to plot every nth epoch. Useful to illustrate the overall trend, as there can be a lot of fluctuation and oscillation in the values between epochs.

criteria

Character vector. The possible options are "VLB", "IWAE", "IWAE_running". Default is the first two.

plot_type

Character vector. The possible options are "method" and "criterion". Default is to plot both.

facet_wrap_scales

String. Should the scales be fixed ("fixed", the default), free ("free"), or free in one dimension ("free_x", "free_y").

facet_wrap_ncol

Integer. Number of columns in the facet wrap.

Details

See Olsen et al. (2022) or the blog post for a summary of the VLB and IWAE.

Value

Either a single ggplot2::ggplot() object or a list of ggplot2::ggplot() objects based on the plot_type parameter.

Author(s)

Lars Henry Berge Olsen

References

Olsen, L. H. B., Glad, I. K., Jullum, M., & Aas, K. (2022). Using Shapley values and variational autoencoders to explain predictive models with dependent mixed features. Journal of Machine Learning Research, 23(213), 1-51.
Examples



if (requireNamespace("xgboost", quietly = TRUE) &&
  requireNamespace("torch", quietly = TRUE) &&
  torch::torch_is_installed()) {
  data("airquality")
  data <- data.table::as.data.table(airquality)
  data <- data[complete.cases(data), ]

  x_var <- c("Solar.R", "Wind", "Temp", "Month")
  y_var <- "Ozone"

  ind_x_explain <- 1:6
  x_train <- data[-ind_x_explain, ..x_var]
  y_train <- data[-ind_x_explain, get(y_var)]
  x_explain <- data[ind_x_explain, ..x_var]

  # Fitting a basic xgboost model to the training data
  model <- xgboost::xgboost(
    data = as.matrix(x_train),
    label = y_train,
    nrounds = 100,
    verbose = FALSE
  )

  # Specifying the phi_0, i.e. the expected prediction without any features
  p0 <- mean(y_train)

  # Train vaeac with and without paired sampling
  explanation_paired <- explain(
    model = model,
    x_explain = x_explain,
    x_train = x_train,
    approach = "vaeac",
    phi0 = p0,
    n_MC_samples = 1, # As we are only interested in the training of the vaeac
    vaeac.epochs = 10, # Should be higher in applications.
    vaeac.n_vaeacs_initialize = 1,
    vaeac.width = 16,
    vaeac.depth = 2,
    vaeac.extra_parameters = list(vaeac.paired_sampling = TRUE)
  )

  explanation_regular <- explain(
    model = model,
    x_explain = x_explain,
    x_train = x_train,
    approach = "vaeac",
    phi0 = p0,
    n_MC_samples = 1, # As we are only interested in the training of the vaeac
    vaeac.epochs = 10, # Should be higher in applications.
    vaeac.width = 16,
    vaeac.depth = 2,
    vaeac.n_vaeacs_initialize = 1,
    vaeac.extra_parameters = list(vaeac.paired_sampling = FALSE)
  )

  # Collect the explanation objects in a named list
  explanation_list <- list(
    "Regular sampling" = explanation_regular,
    "Paired sampling" = explanation_paired
  )

  # Call the function with the named list, will use the provided names
  plot_vaeac_eval_crit(explanation_list = explanation_list)

  # The function also works if we have only one method,
  # but then one should only look at the method plot.
  plot_vaeac_eval_crit(
    explanation_list = explanation_list[2],
    plot_type = "method"
  )

  # Can alter the plot
  plot_vaeac_eval_crit(
    explanation_list = explanation_list,
    plot_from_nth_epoch = 2,
    plot_every_nth_epoch = 2,
    facet_wrap_scales = "free"
  )

  # If we only want the VLB
  plot_vaeac_eval_crit(
    explanation_list = explanation_list,
    criteria = "VLB",
    plot_type = "criterion"
  )

  # If we only want the criterion version
  tmp_fig_criterion <-
    plot_vaeac_eval_crit(explanation_list = explanation_list, plot_type = "criterion")

  # Since tmp_fig_criterion is a ggplot2 object, we can alter it
  # by, e.g., adding points or smooths with se bands
  tmp_fig_criterion + ggplot2::geom_point(shape = "circle", size = 1, ggplot2::aes(col = Method))
  tmp_fig_criterion$layers[[1]] <- NULL
  tmp_fig_criterion + ggplot2::geom_smooth(method = "loess", formula = y ~ x, se = TRUE) +
    ggplot2::scale_color_brewer(palette = "Set1") +
    ggplot2::theme_minimal()
}



Plot Pairwise Plots for Imputed and True Data

Description

A function that creates a matrix of plots (GGally::ggpairs()) from generated imputations from the unconditioned distribution p(x) estimated by a vaeac model, and then compares the imputed values with data from the true distribution (if provided). See GGally::ggpairs() and the corresponding vignette for an introduction.

Usage

plot_vaeac_imputed_ggpairs(
  explanation,
  which_vaeac_model = "best",
  x_true = NULL,
  add_title = TRUE,
  alpha = 0.5,
  upper_cont = c("cor", "points", "smooth", "smooth_loess", "density", "blank"),
  upper_cat = c("count", "cross", "ratio", "facetbar", "blank"),
  upper_mix = c("box", "box_no_facet", "dot", "dot_no_facet", "facethist",
    "facetdensity", "denstrip", "blank"),
  lower_cont = c("points", "smooth", "smooth_loess", "density", "cor", "blank"),
  lower_cat = c("facetbar", "ratio", "count", "cross", "blank"),
  lower_mix = c("facetdensity", "box", "box_no_facet", "dot", "dot_no_facet",
    "facethist", "denstrip", "blank"),
  diag_cont = c("densityDiag", "barDiag", "blankDiag"),
  diag_cat = c("barDiag", "blankDiag"),
  cor_method = c("pearson", "kendall", "spearman")
)

Arguments

explanation

Shapr list. The output list from the explain() function.

which_vaeac_model

String. Indicating which vaeac model to use when generating the samples. Possible options are always 'best', 'best_running', and 'last'. All possible options can be obtained by calling names(explanation$internal$parameters$vaeac$models).

x_true

Data.table containing the data from the distribution that the vaeac model is fitted to.

add_title

Logical. If TRUE, then a title is added to the plot based on the internal description of the vaeac model specified in which_vaeac_model.

alpha

Numeric between 0 and 1 (default is 0.5). The degree of color transparency.

upper_cont

String. Type of plot to use in upper triangle for continuous features, see GGally::ggpairs(). Possible options are: 'cor' (default), 'points', 'smooth', 'smooth_loess', 'density', and 'blank'.

upper_cat

String. Type of plot to use in upper triangle for categorical features, see GGally::ggpairs(). Possible options are: 'count' (default), 'cross', 'ratio', 'facetbar', and 'blank'.

upper_mix

String. Type of plot to use in upper triangle for mixed features, see GGally::ggpairs(). Possible options are: 'box' (default), 'box_no_facet', 'dot', 'dot_no_facet', 'facethist', 'facetdensity', 'denstrip', and 'blank'

lower_cont

String. Type of plot to use in lower triangle for continuous features, see GGally::ggpairs(). Possible options are: 'points' (default), 'smooth', 'smooth_loess', 'density', 'cor', and 'blank'.

lower_cat

String. Type of plot to use in lower triangle for categorical features, see GGally::ggpairs(). Possible options are: 'facetbar' (default), 'ratio', 'count', 'cross', and 'blank'.

lower_mix

String. Type of plot to use in lower triangle for mixed features, see GGally::ggpairs(). Possible options are: 'facetdensity' (default), 'box', 'box_no_facet', 'dot', 'dot_no_facet', 'facethist', 'denstrip', and 'blank'.

diag_cont

String. Type of plot to use on the diagonal for continuous features, see GGally::ggpairs(). Possible options are: 'densityDiag' (default), 'barDiag', and 'blankDiag'.

diag_cat

String. Type of plot to use on the diagonal for categorical features, see GGally::ggpairs(). Possible options are: 'barDiag' (default) and 'blankDiag'.

cor_method

String. Type of correlation measure, see GGally::ggpairs(). Possible options are: 'pearson' (default), 'kendall', and 'spearman'.

Value

A GGally::ggpairs() figure.

Author(s)

Lars Henry Berge Olsen

Examples



if (requireNamespace("xgboost", quietly = TRUE) &&
  requireNamespace("ggplot2", quietly = TRUE) &&
  requireNamespace("torch", quietly = TRUE) &&
  torch::torch_is_installed()) {
  data("airquality")
  data <- data.table::as.data.table(airquality)
  data <- data[complete.cases(data), ]

  x_var <- c("Solar.R", "Wind", "Temp", "Month")
  y_var <- "Ozone"

  ind_x_explain <- 1:6
  x_train <- data[-ind_x_explain, ..x_var]
  y_train <- data[-ind_x_explain, get(y_var)]
  x_explain <- data[ind_x_explain, ..x_var]

  # Fitting a basic xgboost model to the training data
  model <- xgboost::xgboost(
    data = as.matrix(x_train),
    label = y_train,
    nrounds = 100,
    verbose = FALSE
  )

  explanation <- shapr::explain(
    model = model,
    x_explain = x_explain,
    x_train = x_train,
    approach = "vaeac",
    phi0 = mean(y_train),
    n_MC_samples = 1,
    vaeac.epochs = 10,
    vaeac.n_vaeacs_initialize = 1
  )

  # Plot the results
  figure <- shapr::plot_vaeac_imputed_ggpairs(
    explanation = explanation,
    which_vaeac_model = "best",
    x_true = x_train,
    add_title = TRUE
  )
  figure

  # Note that this is a ggplot2 object which we can alter, e.g., by changing the colors.
  figure +
    ggplot2::scale_color_manual(values = c("#E69F00", "#999999")) +
    ggplot2::scale_fill_manual(values = c("#E69F00", "#999999"))
}


Generate predictions for input data with specified model

Description

Performs prediction of the response for stats::lm(), stats::glm(), ranger::ranger(), mgcv::gam(), workflows::workflow() (i.e., tidymodels models), and xgboost::xgb.train() models with binary or continuous response. See details for more information.

Usage

predict_model(x, newdata, ...)

## Default S3 method:
predict_model(x, newdata, ...)

## S3 method for class 'ar'
predict_model(x, newdata, newreg, horizon, ...)

## S3 method for class 'Arima'
predict_model(
  x,
  newdata,
  newreg,
  horizon,
  explain_idx,
  explain_lags,
  y,
  xreg,
  ...
)

## S3 method for class 'forecast_ARIMA'
predict_model(x, newdata, newreg, horizon, ...)

## S3 method for class 'glm'
predict_model(x, newdata, ...)

## S3 method for class 'lm'
predict_model(x, newdata, ...)

## S3 method for class 'gam'
predict_model(x, newdata, ...)

## S3 method for class 'ranger'
predict_model(x, newdata, ...)

## S3 method for class 'workflow'
predict_model(x, newdata, ...)

## S3 method for class 'xgb.Booster'
predict_model(x, newdata, ...)

Arguments

x

Model object for the model to be explained.

newdata

A data.frame/data.table with the features to predict from.

...

The newreg and horizon parameters used in models passed to explain_forecast().

horizon

Numeric. The forecast horizon to explain. Passed to the predict_model function.

explain_idx

Numeric vector. The row indices in data and reg denoting points in time to explain.

y

Matrix, data.frame/data.table or a numeric vector. Contains the endogenous variables used to estimate the (conditional) distributions needed to properly estimate the conditional expectations in the Shapley formula including the observations to be explained.

xreg

Matrix, data.frame/data.table or a numeric vector. Contains the exogenous variables used to estimate the (conditional) distributions needed to properly estimate the conditional expectations in the Shapley formula including the observations to be explained. As exogenous variables are used contemporaneously when producing a forecast, this item should contain nrow(y) + horizon rows.

Details

The following models are natively supported: stats::lm(), stats::glm(), mgcv::gam(), ranger::ranger(), workflows::workflow() (i.e., tidymodels models), and xgboost::xgb.train(), in addition to stats::ar(), stats::arima(), and forecast::Arima() models through explain_forecast().

If you have a binary classification model, we always return the probability prediction for a single class.

If you are explaining a model not supported natively, you need to create the predict_model() function yourself and pass it on as an argument to explain(), as sketched below.

For more details on how to explain such non-supported models (i.e. custom models), see the Advanced usage section of the general usage:
From R: vignette("general_usage", package = "shapr")
Web: https://norskregnesentral.github.io/shapr/articles/general_usage.html#explain-custom-models
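A minimal sketch of such a user-defined prediction function for a hypothetical model class "my_model" (the class name and predict() behavior are assumptions for illustration):

# Must return one numeric prediction per row of newdata.
predict_model.my_model <- function(x, newdata, ...) {
  as.numeric(predict(x, newdata = as.data.frame(newdata)))
}
# Then pass it on: explain(model, x_explain, x_train, approach = "empirical",
#                          phi0 = p0, predict_model = predict_model.my_model)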

Value

Numeric. Vector of size equal to the number of rows in newdata.

Author(s)

Martin Jullum

Examples

# Load example data
data("airquality")
airquality <- airquality[complete.cases(airquality), ]
# Split data into test- and training data
x_train <- head(airquality, -3)
x_explain <- tail(airquality, 3)
# Fit a linear model
model <- lm(Ozone ~ Solar.R + Wind + Temp + Month, data = x_train)

# Predicting for a model with a standardized format
predict_model(x = model, newdata = x_explain)

Generate data used for predictions and Monte Carlo integration

Description

Generate data used for predictions and Monte Carlo integration

Usage

prepare_data(internal, index_features = NULL, ...)

## S3 method for class 'categorical'
prepare_data(internal, index_features = NULL, ...)

## S3 method for class 'copula'
prepare_data(internal, index_features, ...)

## S3 method for class 'ctree'
prepare_data(internal, index_features = NULL, ...)

## S3 method for class 'empirical'
prepare_data(internal, index_features = NULL, ...)

## S3 method for class 'gaussian'
prepare_data(internal, index_features, ...)

## S3 method for class 'independence'
prepare_data(internal, index_features = NULL, ...)

## S3 method for class 'regression_separate'
prepare_data(internal, index_features = NULL, ...)

## S3 method for class 'regression_surrogate'
prepare_data(internal, index_features = NULL, ...)

## S3 method for class 'timeseries'
prepare_data(internal, index_features = NULL, ...)

## S3 method for class 'vaeac'
prepare_data(internal, index_features = NULL, ...)

Arguments

internal

List. Not used directly, but passed through from explain().

index_features

Positive integer vector. Specifies the id_coalition to apply to the present method. NULL means all coalitions. Only used internally.

...

Currently not used.

Value

A data.table containing simulated data used to estimate the contribution function by Monte Carlo integration.

Author(s)

Martin Jullum

Annabelle Redelmeier and Lars Henry Berge Olsen

Lars Henry Berge Olsen

Martin Jullum


Generate data used for predictions and Monte Carlo integration for causal Shapley values

Description

This function loops over the given coalitions, and for each coalition it extracts the chain of relevant sampling steps provided in internal$object$S_causal. This chain can contain sampling from marginal and conditional distributions. We use the approach given by internal$parameters$approach to generate the samples from the conditional distributions, and we iteratively call prepare_data() with a modified internal_copy list to reuse code. However, this also means that chains with the same conditional distributions will retrain a model of said conditional distributions several times. For the marginal distribution, we sample from the Gaussian marginals when the approach is gaussian and from the marginals of the training data for all other approaches. Note that we could extend the code to sample from the marginal (gaussian) copula, too, when approach is copula.
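For instance, sampling from the training-data marginals (the fallback described above for non-Gaussian approaches) amounts to resampling the observed feature values, as in this illustrative helper (not the internal code):

# Draw n_MC_samples marginal MC samples for a single feature by
# resampling the training data with replacement.
sample_marginal <- function(x_train_col, n_MC_samples) {
  sample(x_train_col, size = n_MC_samples, replace = TRUE)
}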

Usage

prepare_data_causal(internal, index_features = NULL, ...)

Arguments

internal

List. Not used directly, but passed through from explain().

index_features

Positive integer vector. Specifies the id_coalition to apply to the present method. NULL means all coalitions. Only used internally.

...

Currently not used.

Value

A data.table containing simulated data that respects the (partial) causal ordering and the confounding assumptions. The data is used to estimate the contribution function by Monte Carlo integration.

Author(s)

Lars Henry Berge Olsen


Generate (Gaussian) Copula MC samples

Description

Generate (Gaussian) Copula MC samples

Usage

prepare_data_copula_cpp(
  MC_samples_mat,
  x_explain_mat,
  x_explain_gaussian_mat,
  x_train_mat,
  S,
  mu,
  cov_mat
)

Arguments

MC_samples_mat

arma::mat. Matrix of dimension (n_MC_samples, n_features) containing samples from the univariate standard normal.

x_explain_mat

arma::mat. Matrix of dimension (n_explain, n_features) containing the observations to explain.

x_explain_gaussian_mat

arma::mat. Matrix of dimension (n_explain, n_features) containing the observations to explain after being transformed using the Gaussian transform, i.e., the samples have been transformed to a standardized normal distribution.

x_train_mat

arma::mat. Matrix of dimension (n_train, n_features) containing the training observations.

S

arma::mat. Matrix of dimension (n_coalitions, n_features) containing binary representations of the used coalitions. S cannot contain the empty or grand coalition, i.e., a row containing only zeros or ones. This is not a problem internally in shapr as the empty and grand coalitions are treated differently.

mu

arma::vec. Vector of length n_features containing the mean of each feature after being transformed using the Gaussian transform, i.e., the samples have been transformed to a standardized normal distribution.

cov_mat

arma::mat. Matrix of dimension (n_features, n_features) containing the pairwise covariance between all pairs of features after being transformed using the Gaussian transform, i.e., the samples have been transformed to a standardized normal distribution.

Value

An arma::cube/3D array of dimension (n_MC_samples, n_explain * n_coalitions, n_features), where the columns (,j,) are matrices of dimension (n_MC_samples, n_features) containing the conditional Gaussian copula MC samples for each explicand and coalition on the original scale.
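The Gaussian transform referred to above maps each feature to the standard normal scale; a common rank-based version is sketched below (an assumed form, shown for illustration):

# Rank-based transform of a numeric vector to the standard normal scale.
gaussian_transform_sketch <- function(x) {
  stats::qnorm(rank(x) / (length(x) + 1))
}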

Author(s)

Lars Henry Berge Olsen


Generate (Gaussian) Copula MC samples for the causal setup with a single MC sample for each explicand

Description

Generate (Gaussian) Copula MC samples for the causal setup with a single MC sample for each explicand

Usage

prepare_data_copula_cpp_caus(
  MC_samples_mat,
  x_explain_mat,
  x_explain_gaussian_mat,
  x_train_mat,
  S,
  mu,
  cov_mat
)

Arguments

MC_samples_mat

arma::mat. Matrix of dimension (n_MC_samples, n_features) containing samples from the univariate standard normal.

x_explain_mat

arma::mat. Matrix of dimension (n_explain, n_features) containing the observations to explain.

x_explain_gaussian_mat

arma::mat. Matrix of dimension (n_explain, n_features) containing the observations to explain after being transformed using the Gaussian transform, i.e., the samples have been transformed to a standardized normal distribution.

x_train_mat

arma::mat. Matrix of dimension (n_train, n_features) containing the training observations.

S

arma::mat. Matrix of dimension (n_coalitions, n_features) containing binary representations of the used coalitions. S cannot contain the empty or grand coalition, i.e., a row containing only zeros or ones. This is not a problem internally in shapr as the empty and grand coalitions are treated differently.

mu

arma::vec. Vector of length n_features containing the mean of each feature after being transformed using the Gaussian transform, i.e., the samples have been transformed to a standardized normal distribution.

cov_mat

arma::mat. Matrix of dimension (n_features, n_features) containing the pairwise covariance between all pairs of features after being transformed using the Gaussian transform, i.e., the samples have been transformed to a standardized normal distribution.

Value

An arma::cube/3D array of dimension (n_MC_samples, n_explain * n_coalitions, n_features), where the columns (,j,) are matrices of dimension (n_MC_samples, n_features) containing the conditional Gaussian copula MC samples for each explicand and coalition on the original scale.

Author(s)

Lars Henry Berge Olsen


Generate Gaussian MC samples

Description

Generate Gaussian MC samples

Usage

prepare_data_gaussian_cpp(MC_samples_mat, x_explain_mat, S, mu, cov_mat)

Arguments

MC_samples_mat

arma::mat. Matrix of dimension (n_MC_samples, n_features) containing samples from the univariate standard normal.

x_explain_mat

arma::mat. Matrix of dimension (n_explain, n_features) containing the observations to explain.

S

arma::mat. Matrix of dimension (n_coalitions, n_features) containing binary representations of the used coalitions. S cannot contain the empty or grand coalition, i.e., a row containing only zeros or ones. This is not a problem internally in shapr as the empty and grand coalitions are treated differently.

mu

arma::vec. Vector of length n_features containing the mean of each feature.

cov_mat

arma::mat. Matrix of dimension (n_features, n_features) containing the covariance matrix of the features.

Value

An arma::cube/3D array of dimension (n_MC_samples, n_explain * n_coalitions, n_features), where the columns (,j,) are matrices of dimension (n_MC_samples, n_features) containing the conditional Gaussian MC samples for each explicand and coalition.
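The conditional distributions sampled from here follow from standard multivariate-normal conditioning; the sketch below shows the textbook formulas, not the internal C++ code:

# Conditional mean and covariance of the features outside coalition S,
# given the observed values x_S of the features in S (indices S_idx).
cond_gaussian_params <- function(mu, cov_mat, S_idx, x_S) {
  Sbar <- setdiff(seq_along(mu), S_idx)
  A <- cov_mat[Sbar, S_idx, drop = FALSE] %*%
    solve(cov_mat[S_idx, S_idx, drop = FALSE])
  list(
    mean = mu[Sbar] + as.vector(A %*% (x_S - mu[S_idx])),
    cov = cov_mat[Sbar, Sbar, drop = FALSE] -
      A %*% cov_mat[S_idx, Sbar, drop = FALSE]
  )
}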

Author(s)

Lars Henry Berge Olsen


Generate Gaussian MC samples for the causal setup with a single MC sample for each explicand

Description

Generate Gaussian MC samples for the causal setup with a single MC sample for each explicand

Usage

prepare_data_gaussian_cpp_caus(MC_samples_mat, x_explain_mat, S, mu, cov_mat)

Arguments

MC_samples_mat

arma::mat. Matrix of dimension (n_MC_samples, n_features) containing samples from the univariate standard normal.

x_explain_mat

arma::mat. Matrix of dimension (n_explain, n_features) containing the observations to explain.

S

arma::mat. Matrix of dimension (n_coalitions, n_features) containing binary representations of the used coalitions. S cannot contain the empty or grand coalition, i.e., a row containing only zeros or ones. This is not a problem internally in shapr as the empty and grand coalitions are treated differently.

mu

arma::vec. Vector of length n_features containing the mean of each feature.

cov_mat

arma::mat. Matrix of dimension (n_features, n_features) containing the covariance matrix of the features.

Value

An arma::cube/3D array of dimension (n_MC_samples, n_explain * n_coalitions, n_features), where the columns (,j,) are matrices of dimension (n_MC_samples, n_features) containing the conditional Gaussian MC samples for each explicand and coalition.

Author(s)

Lars Henry Berge Olsen


Compute the conditional probabilities for a single coalition for the categorical approach

Description

The prepare_data.categorical() function is slow when evaluated for a single coalition. This is a bottleneck for Causal Shapley values, which call said function many times with single coalitions.

Usage

prepare_data_single_coalition(internal, index_features)

Arguments

internal

List. Holds all parameters, data, functions and computed objects used within explain(). The list contains one or more of the elements parameters, data, objects, iter_list, timing_list, main_timing_list, output, and iter_timing_list.

Author(s)

Lars Henry Berge Olsen


Prepares the next iteration of the iterative sampling algorithm

Description

Prepares the next iteration of the iterative sampling algorithm

Usage

prepare_next_iteration(internal)

Arguments

internal

List. Not used directly, but passed through from explain().

Value

The (updated) internal list


Print method for shapr objects

Description

Print method for shapr objects

Usage

## S3 method for class 'shapr'
print(x, digits = 4, ...)

Arguments

x

A shapr object

digits

Scalar integer. Number of digits to display to the console.

...

Unused

Value

No return value (but prints the Shapley values to the console)


Prints iterative information

Description

Prints iterative information

Usage

print_iter(internal)

Arguments

internal

List. Not used directly, but passed through from explain().

Value

No return value (but prints iterative information)


Treat factors as numeric values

Description

Factors are given a numeric value above the highest numeric value in the data. The values of the different levels are sorted by factor and then by level.
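A toy illustration of the encoding (hypothetical values, not the internal implementation):

# With numeric data whose maximum is 10, the levels of a factor column
# are mapped to values above that maximum, in level order.
num_max <- 10
factor_levels <- c("a", "b", "c")
lookup <- data.frame(level = factor_levels,
                     value = num_max + seq_along(factor_levels))
lookup # "a" -> 11, "b" -> 12, "c" -> 13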

Usage

process_factor_data(dt, factor_cols)

Arguments

dt

data.table to plot

factor_cols

Columns that are factors or character

Value

A list containing: a lookup table mapping each factor and level to its numeric value, a data.table similar to the input data but with numeric values for the factors, and the maximum feature value.


Compute the quantiles using quantile type seven

Description

Compute the quantiles using quantile type seven

Usage

quantile_type7_cpp(x, probs)

Arguments

x

arma::vec. Numeric vector whose sample quantiles are wanted.

probs

arma::vec. Numeric vector of probabilities with values between zero and one.

Details

Using quantile type number seven from stats::quantile in R.
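Since type seven is the default in base R, the result can be checked directly against stats::quantile():

x <- c(2, 7, 1, 9, 4)
stats::quantile(x, probs = c(0.25, 0.5, 0.75), type = 7)
# quantile_type7_cpp(x, probs = c(0.25, 0.5, 0.75)) should match these values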

Value

A vector of length length(probs) with the quantiles is returned.

Author(s)

Lars Henry Berge Olsen


Set up exogenous regressors for explanation in a forecast model.

Description

Set up exogenous regressors for explanation in a forecast model.

Usage

reg_forecast_setup(x, horizon, group)

Arguments

x

A matrix with the exogenous variables.

horizon

Numeric. The forecast horizon to explain. Passed to the predict_model function.

group

The list of endogenous groups, to append exogenous groups to.

Value

A list containing


Check that needed libraries are installed

Description

This function checks that the parsnip, recipes, workflows, tune, dials, yardstick, hardhat, and rsample packages are available.

Usage

regression.check_namespaces()

Author(s)

Lars Henry Berge Olsen


Check regression parameters

Description

Check regression parameters

Usage

regression.check_parameters(internal)

Arguments

internal

List. Holds all parameters, data, functions and computed objects used within explain(). The list contains one or more of the elements parameters, data, objects, iter_list, timing_list, main_timing_list, output, and iter_timing_list.

Value

The same internal list, but with the added logical indicator internal$parameters$regression.tune, indicating whether we are to tune the regression model/models.

Author(s)

Lars Henry Berge Olsen


Check regression.recipe_func

Description

Check that regression.recipe_func is a function that returns the RHS of the formula for arbitrary feature name inputs.

Usage

regression.check_recipe_func(regression.recipe_func, x_explain)

Arguments

regression.recipe_func

Either NULL (default) or a function that takes in a recipes::recipe() object and returns a modified recipes::recipe() with potentially additional recipe steps. See the regression vignette for several examples. Note, to make it easier to call explain() from Python, the regression.recipe_func can also be a string containing an R function. For example, "function(recipe) return(recipes::step_ns(recipe, recipes::all_numeric_predictors(), deg_free = 2))" is also a valid input. It is essential to include the package prefix if the package is not loaded.

x_explain

Data.table with the features of the observation whose predictions ought to be explained (test data).

Author(s)

Lars Henry Berge Olsen


Check the regression.surrogate_n_comb parameter

Description

Check that regression.surrogate_n_comb is either NULL or a valid integer.

Usage

regression.check_sur_n_comb(regression.surrogate_n_comb, n_coalitions)

Arguments

regression.surrogate_n_comb

Positive integer. Specifies the number of unique coalitions to apply to each training observation. The default is the number of sampled coalitions in the present iteration. Any integer between 1 and the default is allowed. Larger values require more memory, but may improve the surrogate model. If the user sets a value lower than the maximum, we sample this amount of unique coalitions separately for each training observation. That is, on average, all coalitions should be equally trained.

n_coalitions

Integer. The number of used coalitions (including the empty and grand coalition).

Author(s)

Lars Henry Berge Olsen


Check the parameters that are sent to rsample::vfold_cv()

Description

Check that regression.vfold_cv_para is either NULL or a named list that only contains recognized parameters.

Usage

regression.check_vfold_cv_para(regression.vfold_cv_para)

Arguments

regression.vfold_cv_para

Either NULL (default) or a named list containing the parameters to be sent to rsample::vfold_cv(). See the regression vignette for several examples.

Author(s)

Lars Henry Berge Olsen


Produce message with the results of the cross-validation procedure

Description

Produce message with the results of the cross-validation procedure

Usage

regression.cv_message(
  regression.results,
  regression.grid,
  n_cv = 10,
  current_comb
)

Arguments

regression.results

The results of the CV procedures.

regression.grid

Object containing the hyperparameter values.

n_cv

Integer (default is 10) specifying the number of CV hyperparameter configurations to print.

current_comb

Integer vector. The current combination of features, passed to verbosity printing function.

Author(s)

Lars Henry Berge Olsen


Convert the string into an R object

Description

Convert the string into an R object

Usage

regression.get_string_to_R(string)

Arguments

string

A character vector/string containing the text to convert into R code.
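Presumably this boils down to parsing and evaluating the string, along the lines of the sketch below (an assumption, not necessarily the actual implementation):

# Turn a string containing R code into the corresponding R object.
string_to_R_sketch <- function(string) {
  eval(parse(text = string))
}
f <- string_to_R_sketch("function(x) x + 1")
f(1) # returns 2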

Author(s)

Lars Henry Berge Olsen


Get if model is to be tuned

Description

That is, if the regression model contains hyperparameters we are to tune using cross validation. See tidymodels for default model hyperparameters.
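For instance, a model specification with a hyperparameter marked for tuning looks like the following sketch (requires the parsnip and hardhat packages):

# A model with a tunable hyperparameter; the function should then
# report that cross-validation tuning is needed.
if (requireNamespace("parsnip", quietly = TRUE) &&
  requireNamespace("hardhat", quietly = TRUE)) {
  tuned_model <- parsnip::decision_tree(
    tree_depth = hardhat::tune(),
    engine = "rpart",
    mode = "regression"
  )
}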

Usage

regression.get_tune(regression.model, regression.tune_values, x_train)

Arguments

regression.model

A tidymodels object of class model_specs. Default is a linear regression model, i.e., parsnip::linear_reg(). See tidymodels for all possible models, and see the vignette for how to add new/own models. Note, to make it easier to call explain() from Python, the regression.model parameter can also be a string specifying the model, which will be parsed and evaluated. For example, "parsnip::rand_forest(mtry = hardhat::tune(), trees = 100, engine = 'ranger', mode = 'regression')" is also a valid input. It is essential to include the package prefix if the package is not loaded.

regression.tune_values

Either NULL (default), a data.frame/data.table/tibble, or a function. The data.frame must contain the possible hyperparameter value combinations to try. The column names must match the names of the tunable parameters specified in regression.model. If regression.tune_values is a function, then it should take one argument x, which is the training data for the current coalition, and return a data.frame/data.table/tibble with the properties described above. Using a function allows the hyperparameter values to change based on the size of the coalition. See the regression vignette for several examples. Note, to make it easier to call explain() from Python, regression.tune_values can also be a string containing an R function. For example, "function(x) return(dials::grid_regular(dials::mtry(c(1, ncol(x))), levels = 3))" is also a valid input. It is essential to include the package prefix if the package is not loaded.

x_train

Data.table with training data.

Value

A boolean variable indicating if the regression model is to be tuned.

Author(s)

Lars Henry Berge Olsen


Get the predicted responses

Description

Get the predicted responses

Usage

regression.get_y_hat(internal, model, predict_model)

Arguments

internal

List. Holds all parameters, data, functions and computed objects used within explain(). The list contains one or more of the elements parameters, data, objects, iter_list, timing_list, main_timing_list, output, and iter_timing_list.

model

Object. The model object that ought to be explained. See the documentation of explain() for details.

predict_model

Function. The prediction function used when model is not natively supported. See the documentation of explain() for details.

Value

The same internal list, but with the added vectors internal$data$x_train_y_hat and internal$data$x_explain_y_hat containing the predicted responses of the training and explanation data.

Author(s)

Lars Henry Berge Olsen


Augment the training data and the explicands

Description

Augment the training data and the explicands

Usage

regression.surrogate_aug_data(
  internal,
  x,
  y_hat = NULL,
  index_features = NULL,
  augment_masks_as_factor = FALSE,
  augment_include_grand = FALSE,
  augment_add_id_coal = FALSE,
  augment_comb_prob = NULL,
  augment_weights = NULL
)

Arguments

internal

List. Holds all parameters, data, functions and computed objects used within explain(). The list contains one or more of the elements parameters, data, objects, iter_list, timing_list, main_timing_list, output, and iter_timing_list.

x

Data.table containing the training data.

y_hat

Vector of numerics (optional) containing the predicted responses for the observations in x.

index_features

Array of integers (optional) containing which coalitions to consider. Must be provided if x is the explicands.

augment_masks_as_factor

Logical (default is FALSE). If TRUE, then the binary masks are converted to factors. If FALSE, then the binary masks are numerics.

augment_include_grand

Logical (default is FALSE). If TRUE, then the grand coalition is included. If index_features are provided, then augment_include_grand has no effect. Note that if we sample the coalitions, then the grand coalition is equally likely to be sampled as the other coalitions (or weighted if augment_comb_prob is provided).

augment_add_id_coal

Logical (default is FALSE). If TRUE, an additional column is added indicating which coalition was applied.

augment_comb_prob

Array of numerics (default is NULL). The length of the array must match the number of coalitions being considered, where each entry specifies the probability of sampling the corresponding coalition. This is useful if we want to generate more training data for some specific coalitions. One possible choice would be augment_comb_prob = if (use_Shapley_weights) internal$objects$X$shapley_weight[2:actual_n_coalitions] else NULL.

augment_weights

String (optional). Specifies which type of weights to add to the observations. If NULL (default), no weights are added. If "Shapley", the Shapley weights for the different coalitions are added to the corresponding observations where the coalitions were applied. If "uniform", all observations get an equal weight of one.

Value

A data.table containing the augmented data.

Author(s)

Lars Henry Berge Olsen


Train a tidymodels model via workflows

Description

Function that trains a tidymodels model via workflows based on the provided input parameters. This function allows for cross validating the hyperparameters of the model.

Usage

regression.train_model(
  x,
  seed = 1,
  verbose = NULL,
  regression.model = parsnip::linear_reg(),
  regression.tune = FALSE,
  regression.tune_values = NULL,
  regression.vfold_cv_para = NULL,
  regression.recipe_func = NULL,
  regression.response_var = "y_hat",
  regression.surrogate_n_comb = NULL,
  current_comb = NULL
)

Arguments

x

Data.table containing the training data.

seed

Positive integer. Sets the seed before any code involving randomness is run. If NULL (default), no seed is set in the calling environment.

verbose

String vector or NULL. Specifies the verbosity (printout detail level) through one or more of the strings "basic", "progress", "convergence", "shapley" and "vS_details". "basic" (default) displays basic information about the computation being performed, in addition to some messages about parameters being set or checks being unavailable due to specific input. "progress" displays information about where in the calculation process the function currently is. "convergence" displays information on how close to convergence the Shapley value estimates are (only when iterative = TRUE). "shapley" displays intermediate Shapley value estimates and standard deviations (only when iterative = TRUE), as well as the final estimates. "vS_details" displays information about the v(S) estimates. This is most relevant for approach %in% c("regression_separate", "regression_surrogate", "vaeac"). NULL means no printout. Note that any combination of these strings can be used. E.g., verbose = c("basic", "vS_details") will display basic information plus details about the v(S) estimation process.

regression.model

A tidymodels object of class model_specs. Default is a linear regression model, i.e., parsnip::linear_reg(). See tidymodels for all possible models, and see the vignette for how to add new/own models. Note, to make it easier to call explain() from Python, the regression.model parameter can also be a string specifying the model which will be parsed and evaluated. For example, ⁠"parsnip::rand_forest(mtry = hardhat::tune(), trees = 100, engine = "ranger", mode = "regression")"⁠ is also a valid input. It is essential to include the package prefix if the package is not loaded.

regression.tune

Logical (default is FALSE). If TRUE, then we tune the hyperparameters based on the values provided in regression.tune_values. Note that no checks are conducted here, as this is checked earlier in setup_approach.regression_separate and setup_approach.regression_surrogate.

regression.tune_values

Either NULL (default), a data.frame/data.table/tibble, or a function. The data.frame must contain the possible hyperparameter value combinations to try. The column names must match the names of the tunable parameters specified in regression.model. If regression.tune_values is a function, then it should take one argument x, which is the training data for the current coalition, and return a data.frame/data.table/tibble with the properties described above. Using a function allows the hyperparameter values to change based on the size of the coalition. See the regression vignette for several examples. Note that, to make it easier to call explain() from Python, regression.tune_values can also be a string containing an R function. For example, "function(x) return(dials::grid_regular(dials::mtry(c(1, ncol(x))), levels = 3))" is also valid input. It is essential to include the package prefix if the package is not loaded.

regression.vfold_cv_para

Either NULL (default) or a named list containing the parameters to be sent to rsample::vfold_cv(). See the regression vignette for several examples.

regression.recipe_func

Either NULL (default) or a function that takes in a recipes::recipe() object and returns a modified recipes::recipe() with potentially additional recipe steps. See the regression vignette for several examples. Note that, to make it easier to call explain() from Python, regression.recipe_func can also be a string containing an R function. For example, "function(recipe) return(recipes::step_ns(recipe, recipes::all_numeric_predictors(), deg_free = 2))" is also valid input. It is essential to include the package prefix if the package is not loaded.

regression.response_var

String (default is "y_hat") containing the name of the response variable.

regression.surrogate_n_comb

Integer (default is NULL). The number of times each training observation has been augmented. If NULL, then we assume that we are doing separate regression.

current_comb

Integer vector. The current combination of features, passed to the verbosity printing function.

Value

A trained tidymodels model based on the provided input parameters.

Author(s)

Lars Henry Berge Olsen
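regression.train_model() is internal; the arguments above are normally supplied through explain(). A minimal end-to-end sketch, assuming the airquality data and a plain lm() model purely for illustration:

library(shapr)

data <- as.data.frame(na.omit(airquality))
x_var <- c("Solar.R", "Wind", "Temp", "Month")
model <- lm(Ozone ~ Solar.R + Wind + Temp + Month, data = data[-(1:6), ])

# Separate regression with a cross-validated decision tree as the v(S) model.
explanation <- explain(
  model = model,
  x_explain = data[1:6, x_var],
  x_train = data[-(1:6), x_var],
  approach = "regression_separate",
  phi0 = mean(data[-(1:6), "Ozone"]),
  regression.model = parsnip::decision_tree(
    tree_depth = hardhat::tune(), engine = "rpart", mode = "regression"
  ),
  regression.tune_values = data.frame(tree_depth = c(1, 2, 5)),
  regression.vfold_cv_para = list(v = 5)
)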


Auxiliary function for the vignettes

Description

Function that asks whether the main and vaeac vignettes have been built using the rebuild-long-running-vignette.R function. This is only useful when using devtools to release shapr to CRAN. See devtools::release() for more information.

Usage

release_questions()

Function for computing sigma_hat_sq

Description

Function for computing sigma_hat_sq

Usage

rss_cpp(H, y)

Arguments

H

Matrix. Output from hat_matrix_cpp()

y

Vector. Represents the (temporary) response variable.

Value

Scalar

Author(s)

Martin Jullum


Get table with sampled coalitions using the semi-deterministic sampling approach

Description

Get table with sampled coalitions using the semi-deterministic sampling approach

Usage

sample_coalition_table(
  m,
  n_coalitions = 200,
  n_coal_each_size = choose(m, seq(m - 1)),
  weight_zero_m = 10^6,
  paired_shap_sampling = TRUE,
  prev_X = NULL,
  kernelSHAP_reweighting = "on_all_cond",
  semi_deterministic_sampling = FALSE,
  dt_coal_samp_info = NULL,
  dt_valid_causal_coalitions = NULL,
  n_samps_scale = 10
)

Arguments

m

Positive integer. Total number of features/groups.

n_coalitions

Positive integer. Note that if exact = TRUE, n_coalitions is ignored.

n_coal_each_size

Vector of integers of length m-1. The number of valid coalitions of each coalition size 1, 2, ..., m-1. For symmetric Shapley values, this is choose(m, seq(m-1)) (default), while for asymmetric Shapley values, it is the number of valid coalitions of each size in the causal ordering. Used to correctly normalize the Shapley weights.

weight_zero_m

Numeric. The value to use as a replacement for infinite coalition weights when doing numerical operations.

paired_shap_sampling

Logical. Whether to do paired sampling of coalitions.

prev_X

data.table. The X data.table from the previous iteration.

kernelSHAP_reweighting

String. How to reweight the sampling frequency weights in the kernelSHAP solution after sampling. The aim of this is to reduce the randomness and thereby the variance of the Shapley value estimates. The options are one of 'none', 'on_N', 'on_all', 'on_all_cond' (default). 'none' means no reweighting, i.e. the sampling frequency weights are used as is. 'on_N' means the sampling frequencies are averaged over all coalitions with the same original sampling probabilities. 'on_all' means the original sampling probabilities are used for all coalitions. 'on_all_cond' means the original sampling probabilities are used for all coalitions, while adjusting for the probability that they are sampled at least once. 'on_all_cond' is preferred as it performs the best in simulation studies, see Olsen & Jullum (2024).

semi_deterministic_sampling

Logical. If FALSE (default), then we sample from all coalitions. If TRUE, the sampling of coalitions is semi-deterministic: coalitions that are expected to be sampled, given the number of coalitions, are instead included deterministically, and we sample only among the remaining coalitions. This is done to reduce the variance of the Shapley value estimates, and corresponds to the PySHAP* strategy in Olsen & Jullum (2024).

dt_coal_samp_info

data.table. Contains information about which coalitions should be deterministically included and which can be sampled, in addition to the sampling probabilities of each available coalition size, and the weight given to the sampled and deterministically included coalitions (excluding the empty and grand coalitions, which are given the weight_zero_m weight).

dt_valid_causal_coalitions

data.table. Only applicable for asymmetric Shapley value explanations, and is NULL for symmetric Shapley values. The data.table contains information about the coalitions that respect the causal ordering.

n_samps_scale

Positive integer. Scales the number of coalitions to sample: since sampling is cheap, while checking for n_coalitions unique coalitions is expensive, we oversample by a factor of n_samps_scale, keep the coalitions up to the point where n_coalitions unique ones are obtained, and discard the rest.
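To make the sampling scheme concrete, here is an illustrative, self-contained sketch of paired coalition sampling with kernelSHAP size weights. It mirrors the idea described above, but is not the internal implementation:

set.seed(1)
m <- 5
sizes <- seq(m - 1)
p_size <- (m - 1) / (sizes * (m - sizes))  # kernelSHAP weight per coalition size
p_size <- p_size / sum(p_size)

sample_paired_coalition <- function() {
  k <- sample(sizes, 1, prob = p_size)            # draw a coalition size
  S <- sort(sample.int(m, k))                     # draw the coalition itself
  list(S = S, S_paired = setdiff(seq_len(m), S))  # pair S with its complement
}
sample_paired_coalition()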


We here return a vector of strings/characters, i.e., a CharacterVector, where each string is a space-separated list of integers.

Description

We here return a vector of strings/characters, i.e., a CharacterVector, where each string is a space-separated list of integers.

Usage

sample_coalitions_cpp_str_paired(m, n_coalitions, paired_shap_sampling = TRUE)

Arguments

m

Positive integer. Total number of features/groups.

n_coalitions

Positive integer. The number of coalitions to sample.

paired_shap_sampling

Logical. Whether to do paired sampling of coalitions.


Helper function to sample a combination of training and testing rows without risk of drawing the same observation twice.

Description

Helper function to sample a combination of training and testing rows without risk of drawing the same observation twice.

Usage

sample_combinations(ntrain, ntest, nsamples, joint_sampling = TRUE)

Arguments

ntrain

Positive integer. Number of training observations to sample from.

ntest

Positive integer. Number of test observations to sample from.

nsamples

Positive integer. Number of samples.

joint_sampling

Logical. Indicates whether training and test data should be sampled separately or in a joint sampling space. If they are sampled separately (which typically would be used when optimizing more than one distribution at once), we sample with replacement if nsamples > ntrain. Note that this solution is not optimal. Be careful if you are doing optimization over every test observation when nsamples > ntrain.

Value

data.frame

Author(s)

Martin Jullum
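A small call sketch; sample_combinations() is internal, so it is accessed via ':::' here (an assumption about typical usage, mirroring the arguments above):

set.seed(1)
# Draw 5 train/test index pairs from 100 training rows and 20 test rows.
shapr:::sample_combinations(ntrain = 100, ntest = 20, nsamples = 5, joint_sampling = FALSE)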


Sample ctree variables from a given conditional inference tree

Description

Sample ctree variables from a given conditional inference tree

Usage

sample_ctree(tree, n_MC_samples, x_explain, x_train, n_features, sample)

Arguments

tree

List. Contains tree, which is an object of type ctree built with the party package, and given_ind, the features to condition upon.

n_MC_samples

Scalar integer. Corresponds to the number of samples from the leaf node. See an exception when sample = FALSE in setup_approach.ctree().

x_explain

Data.table with the features of the observation whose predictions ought to be explained (test data).

x_train

Data.table with training data.

n_features

Positive integer. The number of features.

Details

See the documentation of the setup_approach.ctree() function for undocumented parameters.

Value

data.table with n_MC_samples (conditional) samples drawn from the leaf nodes of the conditional inference tree

Author(s)

Annabelle Redelmeier


Saves the intermediate results to disk

Description

Saves the intermediate results to disk

Usage

save_results(internal)

Arguments

internal

List. Not used directly, but passed through from explain().

Value

No return value (but saves the intermediate results to disk)


check_setup

Description

check_setup

Usage

setup(
  x_train,
  x_explain,
  approach,
  phi0,
  output_size = 1,
  max_n_coalitions,
  group,
  n_MC_samples,
  seed,
  feature_specs,
  type = "regular",
  horizon = NULL,
  y = NULL,
  xreg = NULL,
  train_idx = NULL,
  explain_idx = NULL,
  explain_y_lags = NULL,
  explain_xreg_lags = NULL,
  group_lags = NULL,
  verbose,
  iterative = NULL,
  iterative_args = list(),
  is_python = FALSE,
  testing = FALSE,
  init_time = NULL,
  prev_shapr_object = NULL,
  asymmetric = FALSE,
  causal_ordering = NULL,
  confounding = NULL,
  output_args = list(),
  extra_computation_args = list(),
  ...
)

Arguments

x_train

Matrix or data.frame/data.table. Contains the data used to estimate the (conditional) distributions for the features needed to properly estimate the conditional expectations in the Shapley formula.

x_explain

Matrix or data.frame/data.table. Contains the features whose predictions ought to be explained.

approach

Character vector of length 1, or one less than the number of features. All elements should be one of "gaussian", "copula", "empirical", "ctree", "vaeac", "categorical", "timeseries", "independence", "regression_separate", or "regression_surrogate". The two regression approaches cannot be combined with any other approach. See details for more information.

phi0

Numeric. The prediction value for unseen data, i.e. an estimate of the expected prediction without conditioning on any features. Typically we set this value equal to the mean of the response variable in our training data, but other choices such as the mean of the predictions in the training data are also reasonable.

output_size

Scalar integer. Specifies the dimension of the output from the prediction model for every observation.

max_n_coalitions

Integer. The upper limit on the number of unique feature/group coalitions to use in the iterative procedure (if iterative = TRUE). If iterative = FALSE, it represents the number of feature/group coalitions to use directly. The quantity refers to the number of unique feature coalitions if group = NULL, and group coalitions if group != NULL. max_n_coalitions = NULL corresponds to max_n_coalitions = 2^n_features.

group

List. If NULL regular feature wise Shapley values are computed. If provided, group wise Shapley values are computed. group then has length equal to the number of groups. The list element contains character vectors with the features included in each of the different groups. See Jullum et al. (2021) for more information on group wise Shapley values.

n_MC_samples

Positive integer. For most approaches, it indicates the maximum number of samples to use in the Monte Carlo integration of every conditional expectation. For approach="ctree", n_MC_samples corresponds to the number of samples from the leaf node (see an exception related to the ctree.sample argument in setup_approach.ctree()). For approach="empirical", n_MC_samples is the K parameter in equations (14-15) of Aas et al. (2021), i.e. the maximum number of observations (with the largest weights) that are used; see also the empirical.eta argument in setup_approach.empirical().

seed

Positive integer. Sets the seed before any code involving randomness is run. If NULL (default), no seed is set in the calling environment.

feature_specs

List. The output from get_model_specs() or get_data_specs(). Contains the 3 elements:

labels

Character vector with the names of each feature.

classes

Character vector with the classes of each feature.

factor_levels

Character vector with the levels for any categorical features.

type

Character. Either "regular" or "forecast", depending on which function setup() is called from, and thereby the type of explanation that should be generated.

horizon

Numeric. The forecast horizon to explain. Passed to the predict_model function.

y

Matrix, data.frame/data.table or a numeric vector. Contains the endogenous variables used to estimate the (conditional) distributions needed to properly estimate the conditional expectations in the Shapley formula including the observations to be explained.

xreg

Matrix, data.frame/data.table or a numeric vector. Contains the exogenous variables used to estimate the (conditional) distributions needed to properly estimate the conditional expectations in the Shapley formula including the observations to be explained. As exogenous variables are used contemporaneously when producing a forecast, this item should contain nrow(y) + horizon rows.

train_idx

Numeric vector. The row indices in data and reg denoting points in time to use when estimating the conditional expectations in the Shapley value formula. If train_idx = NULL (default) all indices not selected to be explained will be used.

explain_idx

Numeric vector. The row indices in data and reg denoting points in time to explain.

explain_y_lags

Numeric vector. Denotes the number of lags that should be used for each variable in y when making a forecast.

explain_xreg_lags

Numeric vector. If xreg != NULL, denotes the number of lags that should be used for each variable in xreg when making a forecast.

group_lags

Logical. If TRUE all lags of each variable are grouped together and explained as a group. If FALSE all lags of each variable are explained individually.

verbose

String vector or NULL. Specifies the verbosity (printout detail level) through one or more of the strings "basic", "progress", "convergence", "shapley" and "vS_details". "basic" (default) displays basic information about the computation being performed, in addition to some messages about parameters being set or checks being unavailable due to specific input. "progress" displays information about where in the calculation process the function currently is. "convergence" displays information on how close to convergence the Shapley value estimates are (only when iterative = TRUE). "shapley" displays intermediate Shapley value estimates and standard deviations (only when iterative = TRUE), as well as the final estimates. "vS_details" displays information about the v(S) estimates. This is most relevant for approach %in% c("regression_separate", "regression_surrogate", "vaeac"). NULL means no printout. Note that any combination of these strings can be used. E.g., verbose = c("basic", "vS_details") will display basic information plus details about the v(S) estimation process.

iterative

Logical or NULL. If NULL (default), the argument is set to TRUE if there are more than 5 features/groups, and FALSE otherwise. If eventually TRUE, the Shapley values are estimated iteratively, which provides sufficiently accurate Shapley value estimates faster. First an initial number of coalitions is sampled, then bootstrapping is used to estimate the variance of the Shapley values. A convergence criterion determines whether the variances of the Shapley values are sufficiently small. If the variances are too high, we estimate the number of required samples to reach convergence, and add more coalitions accordingly. The process is repeated until the variances are below the threshold. Specifics related to the iterative process and the convergence criterion are set through iterative_args.

iterative_args

Named list. Specifies the arguments for the iterative procedure. See get_iterative_args_default() for description of the arguments and their default values.

is_python

Logical. Indicates whether the function is called from the Python wrapper. Default is FALSE which is never changed when calling the function via explain() in R. The parameter is later used to disallow running the AICc-versions of the empirical method as that requires data based optimization, which is not supported in shaprpy.

testing

Logical. Only used to remove random components like timing from the object output when comparing output with testthat. Defaults to FALSE.

init_time

POSIXct object. The time when the explain() function was called, as outputted by Sys.time(). Used to calculate the time it took to run the full explain call.

prev_shapr_object

shapr object or string. If an object of class shapr is provided, or a string with a path to where intermediate results are stored, then the function will use the previous object to continue the computation. This is useful if the computation is interrupted or if you want higher accuracy than already obtained, and therefore want to continue the iterative estimation. See the general usage vignette for examples.

asymmetric

Logical. Not applicable for (regular) non-causal or asymmetric explanations. If FALSE (default), explain computes regular symmetric Shapley values. If TRUE, then explain computes asymmetric Shapley values based on the (partial) causal ordering given by causal_ordering. That is, explain only uses the feature combinations/coalitions that respect the causal ordering when computing the asymmetric Shapley values. If asymmetric is TRUE and confounding is NULL (default), then explain computes asymmetric conditional Shapley values as specified in Frye et al. (2020). If confounding is provided, i.e., not NULL, then explain computes asymmetric causal Shapley values as specified in Heskes et al. (2020).

causal_ordering

List. Not applicable for (regular) non-causal or asymmetric explanations. causal_ordering is an unnamed list of vectors specifying the components of the partial causal ordering that the coalitions must respect. Each vector represents a component and contains one or more features/groups identified by their names (strings) or indices (integers). If causal_ordering is NULL (default), no causal ordering is assumed and all possible coalitions are allowed. No causal ordering is equivalent to a causal ordering with a single component that includes all features (list(1:n_features)) or groups (list(1:n_groups)) for feature-wise and group-wise Shapley values, respectively. For feature-wise Shapley values and causal_ordering = list(c(1, 2), c(3, 4)), the interpretation is that features 1 and 2 are the ancestors of features 3 and 4, while features 3 and 4 are on the same level. Note: All features/groups must be included in the causal_ordering without any duplicates.

confounding

Logical vector. Not applicable for (regular) non-causal or asymmetric explanations. confounding is a vector of logicals specifying whether confounding is assumed or not for each component in the causal_ordering. If NULL (default), then no assumption about the confounding structure is made and explain computes asymmetric/symmetric conditional Shapley values, depending on the value of asymmetric. If confounding is a single logical, i.e., FALSE or TRUE, then this assumption is set globally for all components in the causal ordering. Otherwise, confounding must be a vector of logicals of the same length as causal_ordering, indicating the confounding assumption for each component. When confounding is specified, then explain computes asymmetric/symmetric causal Shapley values, depending on the value of asymmetric. The approach cannot be regression_separate or regression_surrogate, as the regression-based approaches are not applicable to the causal Shapley value methodology.

output_args

Named list. Specifies certain arguments related to the output of the function. See get_output_args_default() for description of the arguments and their default values.

extra_computation_args

Named list. Specifies extra arguments related to the computation of the Shapley values. See get_extra_comp_args_default() for description of the arguments and their default values.

...

Further arguments passed to specific approaches, see below.

Value

An internal list, containing parameters, info, data and computations needed for the later computations. The list is expanded and modified in other functions.


Set up the framework for the chosen approach

Description

The different choices of approach take different (optional) parameters, which are forwarded from explain(). See the general usage vignette for more information about the different approaches.

Usage

setup_approach(internal, ...)

## S3 method for class 'combined'
setup_approach(internal, ...)

## S3 method for class 'categorical'
setup_approach(
  internal,
  categorical.joint_prob_dt = NULL,
  categorical.epsilon = 0.001,
  ...
)

## S3 method for class 'copula'
setup_approach(internal, ...)

## S3 method for class 'ctree'
setup_approach(
  internal,
  ctree.mincriterion = 0.95,
  ctree.minsplit = 20,
  ctree.minbucket = 7,
  ctree.sample = TRUE,
  ...
)

## S3 method for class 'empirical'
setup_approach(
  internal,
  empirical.type = "fixed_sigma",
  empirical.eta = 0.95,
  empirical.fixed_sigma = 0.1,
  empirical.n_samples_aicc = 1000,
  empirical.eval_max_aicc = 20,
  empirical.start_aicc = 0.1,
  empirical.cov_mat = NULL,
  model = NULL,
  predict_model = NULL,
  ...
)

## S3 method for class 'gaussian'
setup_approach(internal, gaussian.mu = NULL, gaussian.cov_mat = NULL, ...)

## S3 method for class 'independence'
setup_approach(internal, ...)

## S3 method for class 'regression_separate'
setup_approach(
  internal,
  regression.model = parsnip::linear_reg(),
  regression.tune_values = NULL,
  regression.vfold_cv_para = NULL,
  regression.recipe_func = NULL,
  ...
)

## S3 method for class 'regression_surrogate'
setup_approach(
  internal,
  regression.model = parsnip::linear_reg(),
  regression.tune_values = NULL,
  regression.vfold_cv_para = NULL,
  regression.recipe_func = NULL,
  regression.surrogate_n_comb =
    internal$iter_list[[length(internal$iter_list)]]$n_coalitions - 2,
  ...
)

## S3 method for class 'timeseries'
setup_approach(
  internal,
  timeseries.fixed_sigma = 2,
  timeseries.bounds = c(NULL, NULL),
  ...
)

## S3 method for class 'vaeac'
setup_approach(
  internal,
  vaeac.depth = 3,
  vaeac.width = 32,
  vaeac.latent_dim = 8,
  vaeac.activation_function = torch::nn_relu,
  vaeac.lr = 0.001,
  vaeac.n_vaeacs_initialize = 4,
  vaeac.epochs = 100,
  vaeac.extra_parameters = list(),
  ...
)

Arguments

internal

List. Not used directly, but passed through from explain().

...

Arguments passed to specific classes. See below.

categorical.joint_prob_dt

Data.table. (Optional) Containing the joint probability distribution for each combination of feature values. NULL means it is estimated from the x_train and x_explain.

categorical.epsilon

Numeric value. (Optional) If categorical.joint_prob_dt is not supplied, probabilities/frequencies are estimated using x_train. If certain observations occur in x_explain but NOT in x_train, then epsilon is used as the proportion of times these observations occur in the training data. In theory, this proportion should be zero, but that causes an error later in the Shapley computation.

ctree.mincriterion

Numeric scalar or vector. Either a scalar or vector of length equal to the number of features in the model. The value is equal to 1 - \alpha where \alpha is the nominal level of the conditional independence tests. If it is a vector, this indicates which value to use when conditioning on various numbers of features. The default value is 0.95.

ctree.minsplit

Numeric scalar. Determines the minimum sum of weights in the left and right daughter nodes required for a split. The default value is 20.

ctree.minbucket

Numeric scalar. Determines the minimum sum of weights in a terminal node required for a split. The default value is 7.

ctree.sample

Boolean. If TRUE (default), then the method always samples n_MC_samples observations from the leaf nodes (with replacement). If FALSE and the number of observations in the leaf node is less than n_MC_samples, the method takes all observations in the leaf. If FALSE and the number of observations in the leaf node is more than n_MC_samples, the method samples n_MC_samples observations (with replacement). This means that there will always be sampling in the leaf unless sample = FALSE and the number of observations in the node is less than n_MC_samples.

empirical.type

Character. (default = "fixed_sigma") Should be equal to one of "independence", "fixed_sigma", "AICc_each_k", or "AICc_full". "independence" is deprecated; use approach = "independence" instead. "fixed_sigma" uses a fixed bandwidth (set through empirical.fixed_sigma) in the kernel density estimation. "AICc_each_k" and "AICc_full" optimize the bandwidth using the AICc criterion, with respectively one bandwidth per coalition size and one bandwidth for all coalition sizes.

empirical.eta

Numeric scalar. Needs to satisfy 0 < eta <= 1. The default value is 0.95. Represents the minimum proportion of the total empirical weight that the data samples should use. If, e.g., eta = .8, we choose the K samples with the largest weights so that the sum of the weights accounts for 80% of the total weight. eta is the \eta parameter in equation (15) of Aas et al. (2021).

empirical.fixed_sigma

Positive numeric scalar. The default value is 0.1. Represents the kernel bandwidth in the distance computation used when conditioning on all different coalitions. Only used when empirical.type = "fixed_sigma".

empirical.n_samples_aicc

Positive integer. Number of samples to consider in the AICc optimization. The default value is 1000. Only used when empirical.type is either "AICc_each_k" or "AICc_full".

empirical.eval_max_aicc

Positive integer. Maximum number of iterations when optimizing the AICc. The default value is 20. Only used when empirical.type is either "AICc_each_k" or "AICc_full".

empirical.start_aicc

Numeric. Start value of the sigma parameter when optimizing the AICc. The default value is 0.1. Only used when empirical.type is either "AICc_each_k" or "AICc_full".

empirical.cov_mat

Numeric matrix. (Optional) The covariance matrix of the data generating distribution used to define the Mahalanobis distance. NULL means it is estimated from x_train.

model

Object. The model object that ought to be explained. See the documentation of explain() for details.

predict_model

Function. The prediction function used when model is not natively supported. See the documentation of explain() for details.

gaussian.mu

Numeric vector. (Optional) Containing the mean of the data generating distribution. NULL means it is estimated from the x_train.

gaussian.cov_mat

Numeric matrix. (Optional) Containing the covariance matrix of the data generating distribution. NULL means it is estimated from the x_train.

regression.model

A tidymodels object of class model_spec. Default is a linear regression model, i.e., parsnip::linear_reg(). See tidymodels for all possible models, and see the vignette for how to add new/own models. Note that, to make it easier to call explain() from Python, the regression.model parameter can also be a string specifying the model, which will be parsed and evaluated. For example, "parsnip::rand_forest(mtry = hardhat::tune(), trees = 100, engine = 'ranger', mode = 'regression')" is also valid input. It is essential to include the package prefix if the package is not loaded.

regression.tune_values

Either NULL (default), a data.frame/data.table/tibble, or a function. The data.frame must contain the possible hyperparameter value combinations to try. The column names must match the names of the tunable parameters specified in regression.model. If regression.tune_values is a function, then it should take one argument x, which is the training data for the current coalition, and return a data.frame/data.table/tibble with the properties described above. Using a function allows the hyperparameter values to change based on the size of the coalition. See the regression vignette for several examples. Note that, to make it easier to call explain() from Python, regression.tune_values can also be a string containing an R function. For example, "function(x) return(dials::grid_regular(dials::mtry(c(1, ncol(x))), levels = 3))" is also valid input. It is essential to include the package prefix if the package is not loaded.

regression.vfold_cv_para

Either NULL (default) or a named list containing the parameters to be sent to rsample::vfold_cv(). See the regression vignette for several examples.

regression.recipe_func

Either NULL (default) or a function that takes in a recipes::recipe() object and returns a modified recipes::recipe() with potentially additional recipe steps. See the regression vignette for several examples. Note that, to make it easier to call explain() from Python, regression.recipe_func can also be a string containing an R function. For example, "function(recipe) return(recipes::step_ns(recipe, recipes::all_numeric_predictors(), deg_free = 2))" is also valid input. It is essential to include the package prefix if the package is not loaded.

regression.surrogate_n_comb

Positive integer. Specifies the number of unique coalitions to apply to each training observation. The default is the number of sampled coalitions in the present iteration. Any integer between 1 and the default is allowed. Larger values require more memory, but may improve the surrogate model. If the user sets a value lower than the maximum, we sample this number of unique coalitions separately for each training observation. That is, on average, all coalitions should be equally represented in the training data.

timeseries.fixed_sigma

Positive numeric scalar. Represents the kernel bandwidth in the distance computation. The default value is 2.

timeseries.bounds

Numeric vector of length two. Specifies the lower and upper bounds of the timeseries. The default is c(NULL, NULL), i.e. no bounds. If one or both of these bounds are not NULL, we restrict the sampled time series to be between these bounds. This is useful if the underlying time series are scaled between 0 and 1, for example.

vaeac.depth

Positive integer (default is 3). The number of hidden layers in the neural networks of the masked encoder, full encoder, and decoder.

vaeac.width

Positive integer (default is 32). The number of neurons in each hidden layer in the neural networks of the masked encoder, full encoder, and decoder.

vaeac.latent_dim

Positive integer (default is 8). The number of dimensions in the latent space.

vaeac.activation_function

An torch::nn_module() representing an activation function such as, e.g., torch::nn_relu() (default), torch::nn_leaky_relu(), torch::nn_selu(), or torch::nn_sigmoid().

vaeac.lr

Positive numeric (default is 0.001). The learning rate used in the torch::optim_adam() optimizer.

vaeac.n_vaeacs_initialize

Positive integer (default is 4). The number of different vaeac models to initialize at the start. The best performing one is picked after vaeac.extra_parameters$epochs_initiation_phase epochs (default is 2), and training continues with that one.

vaeac.epochs

Positive integer (default is 100). The number of epochs to train the final vaeac model. This includes vaeac.extra_parameters$epochs_initiation_phase, where the default is 2.

vaeac.extra_parameters

Named list with extra parameters to the vaeac approach. See vaeac_get_extra_para_default() for description of possible additional parameters and their default values.

Value

Updated internal object with the approach set up

Author(s)

Martin Jullum

Lars Henry Berge Olsen
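The approach-specific arguments above are supplied through explain(). A minimal sketch with the Gaussian approach, assuming the airquality data and a plain lm() model purely for illustration; gaussian.mu and gaussian.cov_mat are left NULL, so they are estimated from x_train:

library(shapr)

data <- as.data.frame(na.omit(airquality))
x_var <- c("Solar.R", "Wind", "Temp", "Month")
model <- lm(Ozone ~ Solar.R + Wind + Temp + Month, data = data[-(1:6), ])

explanation <- explain(
  model = model,
  x_explain = data[1:6, x_var],
  x_train = data[-(1:6), x_var],
  approach = "gaussian",
  phi0 = mean(data[-(1:6), "Ozone"]),
  n_MC_samples = 100
)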


Set up the kernelSHAP framework

Description

Set up the kernelSHAP framework

Usage

shapley_setup(internal)

Arguments

internal

List. Not used directly, but passed through from explain().

Value

The internal list updated with the coalitions to be estimated


Calculate Shapley weight

Description

Calculate Shapley weight

Usage

shapley_weights(m, N, n_components, weight_zero_m = 10^6)

Arguments

m

Positive integer. Total number of features/groups.

N

Positive integer. The number of unique coalitions when sampling n_components features/feature groups, without replacement, from a sample space consisting of m different features/feature groups.

n_components

Positive integer. Represents the number of features/feature groups you want to sample from a feature space consisting of m unique features/feature groups. Note that 0 <= n_components <= m.

weight_zero_m

Numeric. The value to use as a replacement for infinite coalition weights when doing numerical operations.

Value

Numeric

Author(s)

Nikolai Sellereite
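A worked sketch, assuming the standard kernelSHAP weight w = (m - 1) / (N * n_components * (m - n_components)) with N = choose(m, n_components); the infinite weights of the empty and grand coalitions are replaced by weight_zero_m:

m <- 5
n_components <- 0:m
N <- choose(m, n_components)
w <- (m - 1) / (N * n_components * (m - n_components))
w[!is.finite(w)] <- 10^6  # the weight_zero_m replacement for sizes 0 and m
w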


A torch::nn_module() Representing a skip connection

Description

Skip connection over the sequence of layers in the constructor. The module passes input data sequentially through these layers and then adds the original data to the result.

Usage

skip_connection(...)

Arguments

...

network modules such as, e.g., torch::nn_linear(), torch::nn_relu(), and memory_layer() objects. See vaeac() for more information.

Author(s)

Lars Henry Berge Olsen


A torch::nn_module() Representing a specified_masks_mask_generator

Description

A mask generator which masks the entries based on sampling provided 1D masks with corresponding probabilities. Used for Shapley value estimation when only a subset of coalitions are used to compute the Shapley values.

Usage

specified_masks_mask_generator(masks, masks_probs, paired_sampling = FALSE)

Arguments

masks

Matrix/Tensor of possible/allowed 'masks' which we sample from.

masks_probs

Array of 'probabilities' for each of the masks specified in 'masks'. Note that they do not need to be between 0 and 1 (they can, e.g., be sampling frequencies), as they are rescaled; they only need to be positive.

paired_sampling

Boolean. Whether to do paired sampling, i.e., include both S and \bar{S}. If TRUE, then the batch must be sampled using 'paired_sampler', which creates batches where the first half and second half of the rows are duplicates of each other. That is, batch = [row1, row1, row2, row2, row3, row3, ...].

Author(s)

Lars Henry Berge Olsen


A torch::nn_module() Representing a specified_prob_mask_generator

Description

A mask generator which masks the entries based on specified probabilities.

Usage

specified_prob_mask_generator(masking_probs, paired_sampling = FALSE)

Arguments

masking_probs

A numeric vector of length M+1 containing the probabilities of masking d of the M entries, for d = 0, 1, ..., M, for each observation.

paired_sampling

Boolean. Whether to do paired sampling, i.e., include both S and \bar{S}. If TRUE, then the batch must be sampled using 'paired_sampler', which creates batches where the first half and second half of the rows are duplicates of each other. That is, batch = [row1, row1, row2, row2, row3, row3, ...].

Details

A class that takes in the probabilities of having d masked entries. I.e., for M-dimensional data, masking_probs is of length M+1, where the d'th entry is the probability of having d-1 masked values.

A mask generator that first samples the number of entries 'd' to be masked in the 'M'-dimensional observation 'x' in the batch, based on the given M+1 probabilities. The 'd' masked entries are then sampled uniformly from the 'M' possible feature indices.

Note that mcar_mask_generator with p = 0.5 is the same as using specified_prob_mask_generator() with masking_probs = choose(M, 0:M), where M is the number of features. This function was initially created to check if increasing the probability of a mask with many masked features improved vaeac's performance by focusing more on these situations during training.
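A plain-R sketch of the sampling scheme described above (the actual generator operates on torch tensors); M and the probabilities are illustrative:

set.seed(1)
M <- 4
masking_probs <- choose(M, 0:M)  # equivalent to MCAR masking with p = 0.5
d <- sample(0:M, 1, prob = masking_probs / sum(masking_probs))  # number of masked entries
mask <- rep(0, M)
mask[sample.int(M, d)] <- 1      # the d masked indices are drawn uniformly
mask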


Model testing function

Description

Model testing function

Usage

test_predict_model(x_test, predict_model, model, internal)

Arguments

predict_model

Function. The prediction function used when model is not natively supported. See the documentation of explain() for details.

model

Object. The model object that ought to be explained. See the documentation of explain() for details.

internal

List. Holds all parameters, data, functions and computed objects used within explain(). The list contains one or more of the elements parameters, data, objects, iter_list, timing_list, main_timing_list, output, and iter_timing_list.


Cleans out certain output arguments to allow perfect reproducibility of the output

Description

Cleans out certain output arguments to allow perfect reproducibility of the output

Usage

testing_cleanup(output)

Value

Cleaned up version of the output list used for testthat testing

Author(s)

Lars Henry Berge Olsen, Martin Jullum


Initializing a vaeac model

Description

Class that represents a vaeac model, i.e., the class creates the neural networks in the vaeac model and necessary training utilities. For more details, see Olsen et al. (2022).

Usage

vaeac(
  one_hot_max_sizes,
  width = 32,
  depth = 3,
  latent_dim = 8,
  activation_function = torch::nn_relu,
  skip_conn_layer = FALSE,
  skip_conn_masked_enc_dec = FALSE,
  batch_normalization = FALSE,
  paired_sampling = FALSE,
  mask_generator_name = c("mcar_mask_generator", "specified_prob_mask_generator",
    "specified_masks_mask_generator"),
  masking_ratio = 0.5,
  mask_gen_coalitions = NULL,
  mask_gen_coalitions_prob = NULL,
  sigma_mu = 10000,
  sigma_sigma = 1e-04
)

Arguments

one_hot_max_sizes

A torch tensor of dimension n_features containing the one-hot sizes of the n_features features. That is, if the i-th feature is a categorical feature with 5 levels, then one_hot_max_sizes[i] = 5. The size for continuous features can be either 0 or 1.

width

Integer. The number of neurons in each hidden layer in the neural networks of the masked encoder, full encoder, and decoder.

depth

Integer. The number of hidden layers in the neural networks of the masked encoder, full encoder, and decoder.

latent_dim

Integer. The number of dimensions in the latent space.

activation_function

A torch::nn_module() representing an activation function such as, e.g., torch::nn_relu(), torch::nn_leaky_relu(), torch::nn_selu(), torch::nn_sigmoid().

skip_conn_layer

Boolean. If we are to use skip connections in each layer, see skip_connection(). If TRUE, then we add the input to the outcome of each hidden layer, so the output becomes X + \operatorname{activation}(WX + b). I.e., the identity skip connection.

skip_conn_masked_enc_dec

Boolean. If we are to apply concatenating skip connections between the layers in the masked encoder and decoder. The first layer of the masked encoder will be linked to the last layer of the decoder. The second layer of the masked encoder will be linked to the second to last layer of the decoder, and so on.

batch_normalization

Boolean. If we are to use batch normalization after the activation function. Note that if skip_conn_layer is TRUE, then the normalization is done after the addition from the skip connection. I.e., we batch normalize the whole quantity X + activation(WX + b).

paired_sampling

Boolean. If we are doing paired sampling. I.e., if we are to include both coalition S and \bar{S} when we sample coalitions during training for each batch.

mask_generator_name

String specifying the type of mask generator to use. Need to be one of 'mcar_mask_generator', 'specified_prob_mask_generator', and 'specified_masks_mask_generator'.

masking_ratio

Scalar. The probability for an entry in the generated mask to be 1 (masked). Not used if mask_gen_coalitions is given.

mask_gen_coalitions

Matrix containing the different coalitions to learn. Must be given if mask_generator_name = 'specified_masks_mask_generator'.

mask_gen_coalitions_prob

Numeric array containing the probabilities for sampling each coalition (mask) in mask_gen_coalitions.

sigma_mu

Numeric representing a hyperparameter in the normal-gamma prior used on the masked encoder, see Section 3.3.1 in Olsen et al. (2022).

sigma_sigma

Numeric representing a hyperparameter in the normal-gamma prior used on the masked encoder, see Section 3.3.1 in Olsen et al. (2022).

Details

This function builds the neural networks (masked encoder, full encoder, decoder) given the list of one-hot max sizes of the features in the dataset we use to train the vaeac model, and the provided parameters for the networks. It also creates, e.g., the reconstruction log-probability function and methods for sampling from the decoder output, and then uses these to create the vaeac model.

Value

Returns a list with the neural networks of the masked encoder, full encoder, and decoder together with reconstruction log probability function, optimizer constructor, sampler from the decoder output, mask generator, batch size, and scale factor for the stability of the variational lower bound optimization.

make_observed

Apply Mask to Batch to Create Observed Batch

Applies the mask to a batch to create the corresponding observed batch, i.e., the version of the batch where the masked entries are hidden from the encoder.

make_latent_distributions

Compute the Latent Distributions Inferred by the Encoders

Compute the parameters for the latent normal distributions inferred by the encoders. If only_masked_encoder = TRUE, then we only compute the latent normal distributions inferred by the masked encoder. This is used in the deployment phase when we do not have access to the full observation.

masked_encoder_regularization

Compute the Regularizers for the Latent Distribution Inferred by the Masked Encoder.

The masked encoder (prior) distribution regularization in the latent space. This is used to compute the extended variational lower bound used to train vaeac, see Section 3.3.1 in Olsen et al. (2022). Although the regularization prevents the masked encoder's distribution parameters from going to infinity, the model usually does not diverge even without it. With the recommended default regularization parameters, it has almost no effect on the learning process near zero.

batch_vlb

Compute the Variational Lower Bound for the Observations in the Batch

Compute differentiable lower bound for the given batch of objects and mask. Used as the (negative) loss function for training the vaeac model.

batch_iwae

Compute IWAE log likelihood estimate with K samples per object.

Technically, it is differentiable, but it is recommended to use it for evaluation purposes inside torch::with_no_grad() in order to save memory; with torch::with_no_grad() the method requires almost no extra memory, even for very large K. The method makes K independent passes through the decoder network, so the batch size is the same as for training with batch_vlb. IWAE is an abbreviation for the importance-weighted autoencoder estimator:

\log p_{\theta, \psi}(x|y) \approx \log \frac{1}{K} \sum_{i=1}^K \frac{p_\theta(x|z_i, y)\, p_\psi(z_i|y)}{q_\phi(z_i|x,y)}
= \log \sum_{i=1}^K \exp\big(\log p_\theta(x|z_i, y) + \log p_\psi(z_i|y) - \log q_\phi(z_i|x,y)\big) - \log(K)
= \operatorname{logsumexp}\big(\log p_\theta(x|z_i, y) + \log p_\psi(z_i|y) - \log q_\phi(z_i|x,y)\big) - \log(K)
= \operatorname{logsumexp}(\text{rec\_loss} + \text{prior\_log\_prob} - \text{proposal\_log\_prob}) - \log(K),

where z_i \sim q_\phi(z|x,y).

generate_samples_params

Generate the parameters of the generative distributions for samples from the batch.

The function makes K latent representations for each object in the batch and sends these latent representations through the decoder to obtain the parameters of the generative distributions, i.e., means and variances for the normal distributions (continuous features) and probabilities for the categorical distributions (categorical features). The second axis is used to index the samples for an object, i.e., if the batch shape is [n x D1 x D2], then the result shape is [n x K x D1 x D2]. It is better to use it inside torch::with_no_grad() in order to save memory. With torch::with_no_grad() the method does not require extra memory except the memory for the result.

Author(s)

Lars Henry Berge Olsen
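In practice the vaeac model is set up through explain() rather than by calling vaeac() directly. A minimal sketch, assuming the airquality data, a plain lm() model, and deliberately small training settings (requires the torch package with a working backend):

library(shapr)

data <- as.data.frame(na.omit(airquality))
x_var <- c("Solar.R", "Wind", "Temp", "Month")
model <- lm(Ozone ~ Solar.R + Wind + Temp + Month, data = data[-(1:6), ])

explanation <- explain(
  model = model,
  x_explain = data[1:6, x_var],
  x_train = data[-(1:6), x_var],
  approach = "vaeac",
  phi0 = mean(data[-(1:6), "Ozone"]),
  vaeac.epochs = 10,             # small values to keep the sketch fast
  vaeac.n_vaeacs_initialize = 2
)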


Creates Categorical Distributions

Description

Function that takes in a tensor containing the logits for each of the K classes. Each row corresponds to an observation. Each row is sent through the softmax function to convert the logits to probabilities that sum to one. The function also clamps the probabilities between a minimum and maximum probability. Note that we still normalize them afterwards, so the final probabilities can be marginally below or above the thresholds.

Usage

vaeac_categorical_parse_params(params, min_prob = 0, max_prob = 1)

Arguments

params

Tensor of dimension batch_size x K containing the logits for each of the K classes and batch_size observations.

min_prob

For stability it might be desirable that the minimal probability is not too close to zero.

max_prob

For stability it might be desirable that the maximal probability is not too close to one.

Details

Take a Tensor (e.g., a part of the neural network output) and return a torch::distr_categorical() distribution. The input tensor, after applying softmax over the last axis, contains a batch of the categorical probabilities, so there are no restrictions on the input tensor. Technically, this function treats the last axis as the categorical probabilities, but Categorical takes only 2D input where the first axis is the batch axis and the second corresponds to the probabilities, so practically the function requires 2D input with the batch of probabilities for one categorical feature. min_prob is the minimal probability for each class. After clipping the probabilities from below and above, they are renormalized in order to be a valid distribution. This regularization is required for numerical stability and may be considered a neural network architecture choice without any change to the probabilistic model. Note that the softmax function is given by \operatorname{Softmax}(x_i) = \exp(x_i) / \sum_{j} \exp(x_j), where the logits x_i can take any value, negative or positive. The output satisfies \operatorname{Softmax}(x_i) \in [0,1] and \sum_{i} \operatorname{Softmax}(x_i) = 1.

Value

A torch::distr_categorical() distribution with the provided probabilities for each class.

Author(s)

Lars Henry Berge Olsen
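A plain-R sketch of the parsing steps described above (the actual function operates on torch tensors and returns a torch::distr_categorical()); the logits and thresholds are illustrative:

logits <- c(2, 0.5, -1)
p <- exp(logits) / sum(exp(logits))  # softmax: logits -> probabilities
p <- pmin(pmax(p, 0.01), 0.99)       # clamp to [min_prob, max_prob]
p <- p / sum(p)                      # renormalize to a valid distribution
p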


Function that checks the provided activation function

Description

Function that checks the provided activation function

Usage

vaeac_check_activation_func(activation_function)

Arguments

activation_function

An torch::nn_module() representing an activation function such as, e.g., torch::nn_relu() (default), torch::nn_leaky_relu(), torch::nn_selu(), or torch::nn_sigmoid().

Value

The function does not return anything.

Author(s)

Lars Henry Berge Olsen


Function that checks for access to CUDA

Description

Function that checks for access to CUDA

Usage

vaeac_check_cuda(cuda, verbose)

Arguments

cuda

Logical (default is FALSE). If TRUE, then the vaeac model will be trained using CUDA/GPU. If torch::cuda_is_available() is FALSE, then we fall back to using the CPU. If FALSE, we use the CPU. Using a GPU for smaller tabular datasets often does not improve efficiency. See vignette("installation", package = "torch") for help with enabling GPU support (only Linux and Windows).

verbose

String vector or NULL. Specifies the verbosity (printout detail level) through one or more of the strings "basic", "progress", "convergence", "shapley" and "vS_details". "basic" (default) displays basic information about the computation being performed, in addition to some messages about parameters being set or checks being unavailable due to specific input. "progress" displays information about where in the calculation process the function currently is. "convergence" displays information on how close to convergence the Shapley value estimates are (only when iterative = TRUE). "shapley" displays intermediate Shapley value estimates and standard deviations (only when iterative = TRUE), as well as the final estimates. "vS_details" displays information about the v(S) estimates. This is most relevant for approach %in% c("regression_separate", "regression_surrogate", "vaeac"). NULL means no printout. Note that any combination of these strings can be used. E.g., verbose = c("basic", "vS_details") will display basic information plus details about the v(S) estimation process.

Value

The function does not return anything.

Author(s)

Lars Henry Berge Olsen


Function that checks provided epoch arguments

Description

Function that checks provided epoch arguments

Usage

vaeac_check_epoch_values(
  epochs,
  epochs_initiation_phase,
  epochs_early_stopping,
  save_every_nth_epoch,
  verbose
)

Arguments

epochs

Positive integer (default is 100). The number of epochs to train the final vaeac model. This includes epochs_initiation_phase, where the default is 2.

epochs_initiation_phase

Positive integer (default is 2). The number of epochs to run each of the n_vaeacs_initialize vaeac models before continuing to train only the best performing model.

epochs_early_stopping

Positive integer (default is NULL). The training stops if there has been no improvement in the validation IWAE for epochs_early_stopping epochs. If the user wants the training process to be solely based on this training criterion, then epochs in explain() should be set to a large number. If NULL, then shapr will internally set epochs_early_stopping = vaeac.epochs such that early stopping does not occur.

save_every_nth_epoch

Positive integer (default is NULL). If provided, then the vaeac model is saved after every save_every_nth_epoch-th epoch.

verbose

String vector or NULL. Specifies the verbosity (printout detail level) through one or more of the strings "basic", "progress", "convergence", "shapley" and "vS_details". "basic" (default) displays basic information about the computation being performed, in addition to some messages about parameters being set or checks being unavailable due to specific input. "progress" displays information about where in the calculation process the function currently is. "convergence" displays information on how close to convergence the Shapley value estimates are (only when iterative = TRUE). "shapley" displays intermediate Shapley value estimates and standard deviations (only when iterative = TRUE), as well as the final estimates. "vS_details" displays information about the v(S) estimates. This is most relevant for approach %in% c("regression_separate", "regression_surrogate", "vaeac"). NULL means no printout. Note that any combination of these strings can be used. E.g., verbose = c("basic", "vS_details") will display basic information plus details about the v(S) estimation process.

Value

The function does not return anything.

Author(s)

Lars Henry Berge Olsen


Check vaeac.extra_parameters list

Description

Check vaeac.extra_parameters list

Usage

vaeac_check_extra_named_list(vaeac.extra_parameters)

Arguments

vaeac.extra_parameters

List containing the extra parameters to the vaeac approach

Author(s)

Lars Henry Berge Olsen


Function that checks logicals

Description

Function that checks logicals

Usage

vaeac_check_logicals(named_list_logicals)

Arguments

named_list_logicals

List containing named entries. I.e., list(a = TRUE, b = FALSE).

Value

The function does not return anything.

Author(s)

Lars Henry Berge Olsen
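
Examples

A minimal sketch of the calling convention for this internal checker; the function is not exported, so it is accessed via shapr::: here, and the assumption that invalid entries trigger an error is based on the function's purpose rather than verified output.

# Passes silently: both entries are length-one logicals.
shapr:::vaeac_check_logicals(list(cuda = FALSE, save_data = TRUE))

# Presumably signals an error, as "yes" is not a logical:
# shapr:::vaeac_check_logicals(list(cuda = "yes"))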


Function that checks the specified masking scheme

Description

Function that checks the specified masking scheme

Usage

vaeac_check_mask_gen(mask_gen_coalitions, mask_gen_coalitions_prob, x_train)

Arguments

mask_gen_coalitions

Matrix (default is NULL). Matrix containing the coalitions that the vaeac model will be trained on, see specified_masks_mask_generator(). This parameter is used internally in shapr when we only consider a subset of coalitions, i.e., when n_coalitions < 2^{n_{\text{features}}}, and for group Shapley, i.e., when group is specified in explain().

mask_gen_coalitions_prob

Numeric array (default is NULL). Array of length equal to the height of mask_gen_coalitions containing the probabilities of sampling the corresponding coalitions in mask_gen_coalitions.

x_train

A data.table containing the training data. Categorical data must have class names 1,2,\dots,K.

Value

The function does not return anything.

Author(s)

Lars Henry Berge Olsen


Function that checks that the masking ratio argument is valid

Description

Function that checks that the masking ratio argument is valid

Usage

vaeac_check_masking_ratio(masking_ratio, n_features)

Arguments

masking_ratio

Numeric (default is 0.5). Probability of masking a feature in the mcar_mask_generator() (MCAR = Missing Completely At Random). The MCAR masking scheme ensures that the vaeac model can do arbitrary conditioning, as the model is trained on all coalitions. masking_ratio will be overruled if mask_gen_coalitions is specified.

n_features

The number of features, i.e., the number of columns in the training data.

Value

The function does not return anything.

Author(s)

Lars Henry Berge Olsen


Function that calls all vaeac parameters check functions

Description

Function that calls all vaeac parameters check functions

Usage

vaeac_check_parameters(
  x_train,
  model_description,
  folder_to_save_model,
  cuda,
  n_vaeacs_initialize,
  epochs_initiation_phase,
  epochs,
  epochs_early_stopping,
  save_every_nth_epoch,
  val_ratio,
  val_iwae_n_samples,
  depth,
  width,
  latent_dim,
  lr,
  batch_size,
  running_avg_n_values,
  activation_function,
  skip_conn_layer,
  skip_conn_masked_enc_dec,
  batch_normalization,
  paired_sampling,
  masking_ratio,
  mask_gen_coalitions,
  mask_gen_coalitions_prob,
  sigma_mu,
  sigma_sigma,
  save_data,
  log_exp_cont_feat,
  which_vaeac_model,
  verbose,
  seed,
  ...
)

Arguments

x_train

A data.table containing the training data. Categorical data must have class names 1,2,\dots,K.

model_description

String (default is make.names(Sys.time())). String containing, e.g., the name of the data distribution or additional parameter information. Used in the save name of the fitted model. If not provided, then a name will be generated based on base::Sys.time() to ensure a unique name. We use base::make.names() to ensure a valid file name for all operating systems.

folder_to_save_model

String (default is base::tempdir()). String specifying a path to a folder where the function is to save the fitted vaeac model. Note that the path will be removed from the returned explain() object if vaeac.save_model = FALSE.

cuda

Logical (default is FALSE). If TRUE, then the vaeac model is trained using cuda/GPU. If torch::cuda_is_available() is FALSE, we fall back to the CPU. If FALSE, we use the CPU. Using a GPU for smaller tabular data sets often does not improve efficiency. See vignette("installation", package = "torch") for help enabling GPU support (only Linux and Windows).

n_vaeacs_initialize

Positive integer (default is 4). The number of different vaeac models to initialize at the start. The best performing model after epochs_initiation_phase epochs (default is 2) is then picked and trained further.

epochs_initiation_phase

Positive integer (default is 2). The number of epochs to run each of the n_vaeacs_initialize vaeac models before continuing to train only the best performing model.

epochs

Positive integer (default is 100). The number of epochs to train the final vaeac model. This includes epochs_initiation_phase, where the default is 2.

epochs_early_stopping

Positive integer (default is NULL). The training stops if there has been no improvement in the validation IWAE for epochs_early_stopping epochs. If the user wants the training process to be solely based on this training criterion, then epochs in explain() should be set to a large number. If NULL, then shapr will internally set epochs_early_stopping = vaeac.epochs such that early stopping does not occur.

save_every_nth_epoch

Positive integer (default is NULL). If provided, then the vaeac model is saved after every save_every_nth_epoch'th epoch.

val_ratio

Numeric (default is 0.25). Scalar between 0 and 1 indicating the ratio of instances from the input data which will be used as validation data. That is, val_ratio = 0.25 means that 75% of the provided data is used as training data, while the remaining 25% is used as validation data.

val_iwae_n_samples

Positive integer (default is 25). The number of generated samples used to compute the IWAE criterion when validating the vaeac model on the validation data.

depth

Positive integer (default is 3). The number of hidden layers in the neural networks of the masked encoder, full encoder, and decoder.

width

Positive integer (default is 32). The number of neurons in each hidden layer in the neural networks of the masked encoder, full encoder, and decoder.

latent_dim

Positive integer (default is 8). The number of dimensions in the latent space.

lr

Positive numeric (default is 0.001). The learning rate used in the torch::optim_adam() optimizer.

batch_size

Positive integer (default is 64). The number of samples to include in each batch during the training of the vaeac model. Used in torch::dataloader().

running_avg_n_values

Positive integer (default is 5). The number of previous IWAE values to include when we compute the running means of the IWAE criterion.

activation_function

A torch::nn_module() representing an activation function such as, e.g., torch::nn_relu() (default), torch::nn_leaky_relu(), torch::nn_selu(), or torch::nn_sigmoid().

skip_conn_layer

Logical (default is TRUE). If TRUE, we apply identity skip connections in each layer, see skip_connection(). That is, we add the input X to the outcome of each hidden layer, so the output becomes X + activation(WX + b).

skip_conn_masked_enc_dec

Logical (default is TRUE). If TRUE, we apply concatenate skip connections between the layers in the masked encoder and decoder. The first layer of the masked encoder will be linked to the last layer of the decoder. The second layer of the masked encoder will be linked to the second to last layer of the decoder, and so on.

batch_normalization

Logical (default is FALSE). If TRUE, we apply batch normalization after the activation function. Note that if skip_conn_layer = TRUE, then the normalization is applied after the inclusion of the skip connection. That is, we batch normalize the whole quantity X + activation(WX + b).

paired_sampling

Logical (default is TRUE). If TRUE, we apply paired sampling to the training batches. That is, the training observations in each batch will be duplicated, where the first instance will be masked by S while the second instance will be masked by \bar{S}. This ensures that the training of the vaeac model becomes more stable as the model has access to the full version of each training observation. However, this will increase the training time due to more complex implementation and doubling the size of each batch. See paired_sampler() for more information.

masking_ratio

Numeric (default is 0.5). Probability of masking a feature in the mcar_mask_generator() (MCAR = Missing Completely At Random). The MCAR masking scheme ensures that the vaeac model can do arbitrary conditioning, as the model is trained on all coalitions. masking_ratio will be overruled if mask_gen_coalitions is specified.

mask_gen_coalitions

Matrix (default is NULL). Matrix containing the coalitions that the vaeac model will be trained on, see specified_masks_mask_generator(). This parameter is used internally in shapr when we only consider a subset of coalitions, i.e., when n_coalitions < 2^{n_{\text{features}}}, and for group Shapley, i.e., when group is specified in explain().

mask_gen_coalitions_prob

Numeric array (default is NULL). Array of length equal to the height of mask_gen_coalitions containing the probabilities of sampling the corresponding coalitions in mask_gen_coalitions.

sigma_mu

Numeric (default is 1e4). One of two hyperparameter values in the normal-gamma prior used in the masked encoder, see Section 3.3.1 in Olsen et al. (2022).

sigma_sigma

Numeric (default is 1e-4). One of two hyperparameter values in the normal-gamma prior used in the masked encoder, see Section 3.3.1 in Olsen et al. (2022).

save_data

Logical (default is FALSE). If TRUE, then the data is stored together with the model. Useful if one wants to continue training the model later using vaeac_train_model_continue().

log_exp_cont_feat

Logical (default is FALSE). If TRUE, we \log transform all continuous features before sending the data to vaeac(), and vaeac_postprocess_data() takes the \exp of the results to get back to strictly positive values when the vaeac model is used to impute missing values/generate the Monte Carlo samples. The vaeac model creates unbounded Monte Carlo sample values, so if the continuous features are strictly positive (as for, e.g., the Burr distribution and the Abalone data set), it can be advantageous to \log transform the data to unbounded form before using vaeac.

which_vaeac_model

String (default is "best"). The name of the vaeac model (snapshots from different epochs) to use when generating the Monte Carlo samples. The standard choices are: "best" (epoch with lowest IWAE), "best_running" (epoch with lowest running IWAE, see vaeac.running_avg_n_values), and "last" (the last epoch). Note that additional choices are available if vaeac.save_every_nth_epoch is provided. For example, if vaeac.save_every_nth_epoch = 5, then vaeac.which_vaeac_model can also take the values "epoch_5", "epoch_10", "epoch_15", and so on.

verbose

String vector or NULL. Specifies the verbosity (printout detail level) through one or more of the strings "basic", "progress", "convergence", "shapley", and "vS_details". "basic" (default) displays basic information about the computation being performed, in addition to some messages about parameters being set or checks being unavailable due to specific input. "progress" displays information about where in the calculation process the function currently is. "convergence" displays information on how close the Shapley value estimates are to convergence (only when iterative = TRUE). "shapley" displays intermediate Shapley value estimates and standard deviations (only when iterative = TRUE), as well as the final estimates. "vS_details" displays information about the v_S estimates. This is most relevant for approach %in% c("regression_separate", "regression_surrogate", "vaeac"). NULL means no printout. Note that any combination of these strings can be used. E.g., verbose = c("basic", "vS_details") displays basic information plus details about the v(S) estimation process.

seed

Positive integer (default is 1). Seed for reproducibility. Specifies the seed before any randomness based code is being run.

...

List of extra parameters, currently not used.

Value

The function does not return anything.

Author(s)

Lars Henry Berge Olsen


Function that checks positive integers

Description

Function that checks positive integers

Usage

vaeac_check_positive_integers(named_list_positive_integers)

Arguments

named_list_positive_integers

List containing named entries. I.e., list(a = 1, b = 2).

Value

The function does not return anything.

Author(s)

Lars Henry Berge Olsen


Function that checks positive numerics

Description

Function that checks positive numerics

Usage

vaeac_check_positive_numerics(named_list_positive_numerics)

Arguments

named_list_positive_numerics

List containing named entries. I.e., list(a = 0.2, b = 10^3).

Value

The function does not return anything.

Author(s)

Lars Henry Berge Olsen


Function that checks probabilities

Description

Function that checks probabilities

Usage

vaeac_check_probabilities(named_list_probabilities)

Arguments

named_list_probabilities

List containing named entries. I.e., list(a = 0.2, b = 0.9).

Value

The function does not return anything.

Author(s)

Lars Henry Berge Olsen


Function that checks that the save folder exists and that the file name is valid

Description

Function that checks that the save folder exists and that the file name is valid

Usage

vaeac_check_save_names(folder_to_save_model, model_description)

Arguments

folder_to_save_model

String (default is base::tempdir()). String specifying a path to a folder where the function is to save the fitted vaeac model. Note that the path will be removed from the returned explain() object if vaeac.save_model = FALSE.

model_description

String (default is make.names(Sys.time())). String containing, e.g., the name of the data distribution or additional parameter information. Used in the save name of the fitted model. If not provided, then a name will be generated based on base::Sys.time() to ensure a unique name. We use base::make.names() to ensure a valid file name for all operating systems.

Value

The function does not return anything.

Author(s)

Lars Henry Berge Olsen


Function that gives a warning about disk usage

Description

Function that gives a warning about disk usage

Usage

vaeac_check_save_parameters(
  save_data,
  epochs,
  save_every_nth_epoch,
  x_train_size,
  verbose
)

Arguments

save_data

Logical (default is FALSE). If TRUE, then the data is stored together with the model. Useful if one wants to continue training the model later using vaeac_train_model_continue().

epochs

Positive integer (default is 100). The number of epochs to train the final vaeac model. This includes epochs_initiation_phase, where the default is 2.

save_every_nth_epoch

Positive integer (default is NULL). If provided, then the vaeac model is saved after every save_every_nth_epoch'th epoch.

x_train_size

The object size of the x_train object.

verbose

String vector or NULL. Specifies the verbosity (printout detail level) through one or more of the strings "basic", "progress", "convergence", "shapley", and "vS_details". "basic" (default) displays basic information about the computation being performed, in addition to some messages about parameters being set or checks being unavailable due to specific input. "progress" displays information about where in the calculation process the function currently is. "convergence" displays information on how close the Shapley value estimates are to convergence (only when iterative = TRUE). "shapley" displays intermediate Shapley value estimates and standard deviations (only when iterative = TRUE), as well as the final estimates. "vS_details" displays information about the v_S estimates. This is most relevant for approach %in% c("regression_separate", "regression_surrogate", "vaeac"). NULL means no printout. Note that any combination of these strings can be used. E.g., verbose = c("basic", "vS_details") displays basic information plus details about the v(S) estimation process.

Value

The function does not return anything.

Author(s)

Lars Henry Berge Olsen


Function that checks for valid vaeac model name

Description

Function that checks for valid vaeac model name

Usage

vaeac_check_which_vaeac_model(
  which_vaeac_model,
  epochs,
  save_every_nth_epoch = NULL
)

Arguments

which_vaeac_model

String (default is "best"). The name of the vaeac model (snapshots from different epochs) to use when generating the Monte Carlo samples. The standard choices are: "best" (epoch with lowest IWAE), "best_running" (epoch with lowest running IWAE, see vaeac.running_avg_n_values), and "last" (the last epoch). Note that additional choices are available if vaeac.save_every_nth_epoch is provided. For example, if vaeac.save_every_nth_epoch = 5, then vaeac.which_vaeac_model can also take the values "epoch_5", "epoch_10", "epoch_15", and so on.

epochs

Positive integer (default is 100). The number of epochs to train the final vaeac model. This includes epochs_initiation_phase, where the default is 2.

save_every_nth_epoch

Positive integer (default is NULL). If provided, then the vaeac model is saved after every save_every_nth_epoch'th epoch.

Value

The function does not return anything.

Author(s)

Lars Henry Berge Olsen


Function that checks the feature names of data and vaeac model

Description

Function that checks the feature names of data and vaeac model

Usage

vaeac_check_x_colnames(feature_names_vaeac, feature_names_new)

Arguments

feature_names_vaeac

Array of strings containing the feature names of the vaeac model.

feature_names_new

Array of strings containing the feature names to compare with.

Value

The function does not return anything.

Author(s)

Lars Henry Berge Olsen


Compute Featurewise Means and Standard Deviations

Description

Returns the means and standard deviations for all continuous features in the data set. Categorical features get mean = 0 and sd = 1 by default.

Usage

vaeac_compute_normalization(data, one_hot_max_sizes)

Arguments

data

A torch_tensor of dimension n_observation x n_features containing the data.

one_hot_max_sizes

A torch tensor of dimension n_features containing the one hot sizes of the n_features features. That is, if the ith feature is a categorical feature with 5 levels, then one_hot_max_sizes[i] = 5. The sizes for the continuous features can either be 0 or 1.

Value

List containing the means and the standard deviations of the different features.

Author(s)

Lars Henry Berge Olsen
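
Examples

A minimal sketch using a toy tensor, assuming access to the internal helper via shapr:::. The first two columns are continuous features (one hot size 1) and the third is a categorical feature with 3 levels, following the one_hot_max_sizes convention described above.

library(torch)

data <- torch_tensor(matrix(
  c(1.2, 0.5, 1,
    2.4, 1.5, 2,
    3.6, 2.5, 3),
  ncol = 3, byrow = TRUE
))
one_hot_max_sizes <- torch_tensor(c(1, 1, 3))

# Featurewise means and sds for the continuous columns; the categorical
# column gets mean = 0 and sd = 1, as stated in the description.
norm <- shapr:::vaeac_compute_normalization(data, one_hot_max_sizes)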


Dataset used by the vaeac model

Description

Converts the data into a torch::dataset(), which the vaeac model creates batches from.

Usage

vaeac_dataset(X, one_hot_max_sizes)

Arguments

X

A torch_tensor containing the data of shape N x p, where N and p are the number of observations and features, respectively.

one_hot_max_sizes

A torch tensor of dimension n_features containing the one hot sizes of the n_features features. That is, if the ith feature is a categorical feature with 5 levels, then one_hot_max_sizes[i] = 5. The sizes for the continuous features can either be 0 or 1.

Details

This function creates a torch::dataset() object, which represents a map from keys to data samples. It is used by the torch::dataloader() to load the data from which the batches are extracted for all epochs in the training phase of the neural network. Note that a dataset object is an R6 instance, see https://r6.r-lib.org/articles/Introduction.html, which is classical object-oriented programming with self reference. I.e., vaeac_dataset() is a subclass of type torch::dataset().

Author(s)

Lars Henry Berge Olsen
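
Examples

A minimal sketch of how this dataset class feeds a torch::dataloader(), assuming access to the internal constructor via shapr::: and a toy tensor with two continuous features and one categorical feature with 3 levels.

library(torch)

X <- torch_tensor(matrix(
  c(0.1, 1.0, 1,
    0.4, 2.0, 3,
    0.7, 0.5, 2,
    0.9, 1.5, 1),
  ncol = 3, byrow = TRUE
))
one_hot_max_sizes <- torch_tensor(c(1, 1, 3))

ds <- shapr:::vaeac_dataset(X, one_hot_max_sizes)

# The dataloader draws the training batches from the dataset.
dl <- dataloader(ds, batch_size = 2, shuffle = TRUE)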


Extends Incomplete Batches by Sampling Extra Data from Dataloader

Description

If the height of the batch is less than batch_size, this function extends the batch with data from the torch::dataloader() until the batch reaches the required size. Note that batch is a tensor.

Usage

vaeac_extend_batch(batch, dataloader, batch_size)

Arguments

batch

The batch to check. If its height is less than batch_size, it is extended with additional data from the dataloader until it reaches the right size.

dataloader

A torch::dataloader() object from which we can create an iterator object and load data to extend the batch.

batch_size

Integer. The number of samples to include in each batch.

Value

Returns the extended batch with the correct batch_size.

Author(s)

Lars Henry Berge Olsen


Function that extracts additional objects from the environment to the state list

Description

The function extracts the objects that we are going to save together with the vaeac model to make it possible to train the model further and to evaluate it. The environment should be the local environment inside the vaeac_train_model_auxiliary() function.

Usage

vaeac_get_current_save_state(environment)

Arguments

environment

The base::environment() where the objects are stored.

Value

List containing the values of epoch, train_vlb, val_iwae, val_iwae_running, and the state_dict() of the vaeac model and optimizer.

Author(s)

Lars Henry Berge Olsen


Function to set up data loaders and save file names

Description

Function to set up data loaders and save file names

Usage

vaeac_get_data_objects(
  x_train,
  log_exp_cont_feat,
  val_ratio,
  batch_size,
  paired_sampling,
  model_description,
  depth,
  width,
  latent_dim,
  lr,
  epochs,
  save_every_nth_epoch,
  folder_to_save_model,
  train_indices = NULL,
  val_indices = NULL
)

Arguments

x_train

A data.table containing the training data. Categorical data must have class names 1,2,\dots,K.

log_exp_cont_feat

Logical (default is FALSE). If TRUE, we \log transform all continuous features before sending the data to vaeac(), and vaeac_postprocess_data() takes the \exp of the results to get back to strictly positive values when the vaeac model is used to impute missing values/generate the Monte Carlo samples. The vaeac model creates unbounded Monte Carlo sample values, so if the continuous features are strictly positive (as for, e.g., the Burr distribution and the Abalone data set), it can be advantageous to \log transform the data to unbounded form before using vaeac.

val_ratio

Numeric (default is 0.25). Scalar between 0 and 1 indicating the ratio of instances from the input data which will be used as validation data. That is, val_ratio = 0.25 means that 75% of the provided data is used as training data, while the remaining 25% is used as validation data.

batch_size

Positive integer (default is 64). The number of samples to include in each batch during the training of the vaeac model. Used in torch::dataloader().

paired_sampling

Logical (default is TRUE). If TRUE, we apply paired sampling to the training batches. That is, the training observations in each batch will be duplicated, where the first instance will be masked by S while the second instance will be masked by \bar{S}. This ensures that the training of the vaeac model becomes more stable as the model has access to the full version of each training observation. However, this will increase the training time due to more complex implementation and doubling the size of each batch. See paired_sampler() for more information.

model_description

String (default is make.names(Sys.time())). String containing, e.g., the name of the data distribution or additional parameter information. Used in the save name of the fitted model. If not provided, then a name will be generated based on base::Sys.time() to ensure a unique name. We use base::make.names() to ensure a valid file name for all operating systems.

depth

Positive integer (default is 3). The number of hidden layers in the neural networks of the masked encoder, full encoder, and decoder.

width

Positive integer (default is 32). The number of neurons in each hidden layer in the neural networks of the masked encoder, full encoder, and decoder.

latent_dim

Positive integer (default is 8). The number of dimensions in the latent space.

lr

Positive numeric (default is 0.001). The learning rate used in the torch::optim_adam() optimizer.

epochs

Positive integer (default is 100). The number of epochs to train the final vaeac model. This includes epochs_initiation_phase, where the default is 2.

save_every_nth_epoch

Positive integer (default is NULL). If provided, then the vaeac model is saved after every save_every_nth_epoch'th epoch.

folder_to_save_model

String (default is base::tempdir()). String specifying a path to a folder where the function is to save the fitted vaeac model. Note that the path will be removed from the returned explain() object if vaeac.save_model = FALSE.

train_indices

Numeric array (optional) containing the indices of the training observations. No checks are conducted to validate the indices.

val_indices

Numeric array (optional) containing the indices of the validation observations. No checks are conducted to validate the indices.

Value

List of objects needed to train the vaeac model


Extract the Training VLB and Validation IWAE from a list of explanation objects using the vaeac approach

Description

Extract the Training VLB and Validation IWAE from a list of explanation objects using the vaeac approach

Usage

vaeac_get_evaluation_criteria(explanation_list)

Arguments

explanation_list

A list of explain() objects applied to the same data and model, where vaeac must be the used approach. If the entries in the list are named, then the function uses these names. Otherwise, it defaults to the approach names (with an integer suffix for duplicates) for the explanation objects in explanation_list.

Value

A data.table containing the training VLB, validation IWAE, and running validation IWAE at each epoch for each vaeac model.

Author(s)

Lars Henry Berge Olsen
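
Examples

A minimal sketch, assuming explanation_paired and explanation_regular are two hypothetical explain() objects fitted with approach = "vaeac" on the same data and model.

criteria <- vaeac_get_evaluation_criteria(
  list(paired = explanation_paired, regular = explanation_regular)
)

# 'criteria' holds the training VLB, validation IWAE, and running validation
# IWAE per epoch, with the list names identifying the two vaeac models.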


Function to specify the extra parameters in the vaeac model

Description

In this function, we specify the default values for the extra parameters used in explain() for approach = "vaeac".

Usage

vaeac_get_extra_para_default(
  vaeac.model_description = make.names(Sys.time()),
  vaeac.folder_to_save_model = tempdir(),
  vaeac.pretrained_vaeac_model = NULL,
  vaeac.cuda = FALSE,
  vaeac.epochs_initiation_phase = 2,
  vaeac.epochs_early_stopping = NULL,
  vaeac.save_every_nth_epoch = NULL,
  vaeac.val_ratio = 0.25,
  vaeac.val_iwae_n_samples = 25,
  vaeac.batch_size = 64,
  vaeac.batch_size_sampling = NULL,
  vaeac.running_avg_n_values = 5,
  vaeac.skip_conn_layer = TRUE,
  vaeac.skip_conn_masked_enc_dec = TRUE,
  vaeac.batch_normalization = FALSE,
  vaeac.paired_sampling = TRUE,
  vaeac.masking_ratio = 0.5,
  vaeac.mask_gen_coalitions = NULL,
  vaeac.mask_gen_coalitions_prob = NULL,
  vaeac.sigma_mu = 10000,
  vaeac.sigma_sigma = 1e-04,
  vaeac.sample_random = TRUE,
  vaeac.save_data = FALSE,
  vaeac.log_exp_cont_feat = FALSE,
  vaeac.which_vaeac_model = "best",
  vaeac.save_model = TRUE
)

Arguments

vaeac.model_description

String (default is make.names(Sys.time())). String containing, e.g., the name of the data distribution or additional parameter information. Used in the save name of the fitted model. If not provided, then a name will be generated based on base::Sys.time() to ensure a unique name. We use base::make.names() to ensure a valid file name for all operating systems.

vaeac.folder_to_save_model

String (default is base::tempdir()). String specifying a path to a folder where the function is to save the fitted vaeac model. Note that the path will be removed from the returned explain() object if vaeac.save_model = FALSE. Furthermore, the model cannot be moved from its original folder if we are to use the vaeac_train_model_continue() function to continue training the model.

vaeac.pretrained_vaeac_model

List or String (default is NULL). 1) Either a list of class vaeac, i.e., the list stored in explanation$internal$parameters$vaeac where explanation is the returned list from an earlier call to the explain() function. 2) A string containing the path to where the vaeac model is stored on disk, for example, explanation$internal$parameters$vaeac$models$best.

vaeac.cuda

Logical (default is FALSE). If TRUE, then the vaeac model is trained using cuda/GPU. If torch::cuda_is_available() is FALSE, we fall back to the CPU. If FALSE, we use the CPU. Using a GPU for smaller tabular data sets often does not improve efficiency. See vignette("installation", package = "torch") for help enabling GPU support (only Linux and Windows).

vaeac.epochs_initiation_phase

Positive integer (default is 2). The number of epochs to run each of the vaeac.n_vaeacs_initialize vaeac models before continuing to train only the best performing model.

vaeac.epochs_early_stopping

Positive integer (default is NULL). The training stops if there has been no improvement in the validation IWAE for vaeac.epochs_early_stopping epochs. If the user wants the training process to be solely based on this training criterion, then vaeac.epochs in explain() should be set to a large number. If NULL, then shapr will internally set vaeac.epochs_early_stopping = vaeac.epochs such that early stopping does not occur.

vaeac.save_every_nth_epoch

Positive integer (default is NULL). If provided, then the vaeac model is saved after every vaeac.save_every_nth_epoch'th epoch.

vaeac.val_ratio

Numeric (default is 0.25). Scalar between 0 and 1 indicating the ratio of instances from the input data which will be used as validation data. That is, vaeac.val_ratio = 0.25 means that 75% of the provided data is used as training data, while the remaining 25% is used as validation data.

vaeac.val_iwae_n_samples

Positive integer (default is 25). The number of generated samples used to compute the IWAE criterion when validating the vaeac model on the validation data.

vaeac.batch_size

Positive integer (default is 64). The number of samples to include in each batch during the training of the vaeac model. Used in torch::dataloader().

vaeac.batch_size_sampling

Positive integer (default is NULL). The number of samples to include in each batch when generating the Monte Carlo samples. If NULL, then the function generates the Monte Carlo samples for the provided coalitions and all explicands sent to explain() at the same time. The number of coalitions is determined by the n_batches used by explain(). We recommend tweaking extra_computation_args$max_batch_size and extra_computation_args$min_n_batches rather than vaeac.batch_size_sampling. Larger batch sizes are often much faster, provided sufficient memory.

vaeac.running_avg_n_values

Positive integer (default is 5). The number of previous IWAE values to include when we compute the running means of the IWAE criterion.

vaeac.skip_conn_layer

Logical (default is TRUE). If TRUE, we apply identity skip connections in each layer, see skip_connection(). That is, we add the input X to the outcome of each hidden layer, so the output becomes X + activation(WX + b).

vaeac.skip_conn_masked_enc_dec

Logical (default is TRUE). If TRUE, we apply concatenate skip connections between the layers in the masked encoder and decoder. The first layer of the masked encoder will be linked to the last layer of the decoder. The second layer of the masked encoder will be linked to the second to last layer of the decoder, and so on.

vaeac.batch_normalization

Logical (default is FALSE). If TRUE, we apply batch normalization after the activation function. Note that if vaeac.skip_conn_layer = TRUE, then the normalization is applied after the inclusion of the skip connection. That is, we batch normalize the whole quantity X + activation(WX + b).

vaeac.paired_sampling

Logical (default is TRUE). If TRUE, we apply paired sampling to the training batches. That is, the training observations in each batch will be duplicated, where the first instance will be masked by S while the second instance will be masked by \bar{S}. This ensures that the training of the vaeac model becomes more stable as the model has access to the full version of each training observation. However, this will increase the training time due to more complex implementation and doubling the size of each batch. See paired_sampler() for more information.

vaeac.masking_ratio

Numeric (default is 0.5). Probability of masking a feature in the mcar_mask_generator() (MCAR = Missing Completely At Random). The MCAR masking scheme ensures that the vaeac model can do arbitrary conditioning, as the model is trained on all coalitions. vaeac.masking_ratio will be overruled if vaeac.mask_gen_coalitions is specified.

vaeac.mask_gen_coalitions

Matrix (default is NULL). Matrix containing the coalitions that the vaeac model will be trained on, see specified_masks_mask_generator(). This parameter is used internally in shapr when we only consider a subset of coalitions, i.e., when n_coalitions < 2^{n_{\text{features}}}, and for group Shapley, i.e., when group is specified in explain().

vaeac.mask_gen_coalitions_prob

Numeric array (default is NULL). Array of length equal to the height of vaeac.mask_gen_coalitions containing the probabilities of sampling the corresponding coalitions in vaeac.mask_gen_coalitions.

vaeac.sigma_mu

Numeric (default is 1e4). One of two hyperparameter values in the normal-gamma prior used in the masked encoder, see Section 3.3.1 in Olsen et al. (2022).

vaeac.sigma_sigma

Numeric (default is 1e-4). One of two hyperparameter values in the normal-gamma prior used in the masked encoder, see Section 3.3.1 in Olsen et al. (2022).

vaeac.sample_random

Logical (default is TRUE). If TRUE, the function generates random Monte Carlo samples from the inferred generative distributions. If FALSE, the function uses the most likely values, i.e., the mean for continuous features and the class with the highest probability for categorical features.

vaeac.save_data

Logical (default is FALSE). If TRUE, then the data is stored together with the model. Useful if one wants to continue training the model later using vaeac_train_model_continue().

vaeac.log_exp_cont_feat

Logical (default is FALSE). If TRUE, we \log transform all continuous features before sending the data to vaeac(), and vaeac_postprocess_data() takes the \exp of the results to get back to strictly positive values when the vaeac model is used to impute missing values/generate the Monte Carlo samples. The vaeac model creates unbounded Monte Carlo sample values, so if the continuous features are strictly positive (as for, e.g., the Burr distribution and the Abalone data set), it can be advantageous to \log transform the data to unbounded form before using vaeac.

vaeac.which_vaeac_model

String (default is "best"). The name of the vaeac model (snapshots from different epochs) to use when generating the Monte Carlo samples. The standard choices are: "best" (epoch with lowest IWAE), "best_running" (epoch with lowest running IWAE, see vaeac.running_avg_n_values), and "last" (the last epoch). Note that additional choices are available if vaeac.save_every_nth_epoch is provided. For example, if vaeac.save_every_nth_epoch = 5, then vaeac.which_vaeac_model can also take the values "epoch_5", "epoch_10", "epoch_15", and so on.

vaeac.save_model

Boolean. If TRUE (default), the vaeac model is saved either in a base::tempdir() folder or in a user-specified location given by vaeac.folder_to_save_model. If FALSE, then both the paths to the model and the model itself are deleted from the returned object from explain().

Details

The vaeac model consists of three neural networks (a full encoder, a masked encoder, and a decoder) based on the provided vaeac.depth and vaeac.width. The encoders map the full and masked input representations to latent representations, respectively, where the dimension is given by vaeac.latent_dim. The latent representations are sent to the decoder to go back to the real feature space and provide a samplable probabilistic representation, from which the Monte Carlo samples are generated. By default, we use the vaeac model at the epoch with the lowest validation error (IWAE), but other possibilities are available by setting the vaeac.which_vaeac_model parameter. See Olsen et al. (2022) for more details.

Value

Named list of the default values of the vaeac extra parameter arguments specified in this function call. Note that both vaeac.model_description and vaeac.folder_to_save_model will change with time and R session.

Author(s)

Lars Henry Berge Olsen

References

Olsen, L. H. B., Glad, I. K., Jullum, M., & Aas, K. (2022). Using Shapley values and variational autoencoders to explain predictive models with dependent mixed features. Journal of Machine Learning Research, 23(213), 1-51.
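
Examples

A minimal sketch of how to override a few of the defaults, assuming hypothetical objects model, x_train, x_explain, and a baseline phi0; the remaining extra parameters keep the defaults from this function.

vaeac.extra_parameters <- vaeac_get_extra_para_default(
  vaeac.epochs_initiation_phase = 5,
  vaeac.paired_sampling = FALSE
)

explanation <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "vaeac",
  phi0 = phi0,
  vaeac.extra_parameters = vaeac.extra_parameters
)

Note that the unchanged defaults, e.g., vaeac.val_ratio = 0.25, are filled in by vaeac_get_extra_para_default() itself, so only the deviations need to be spelled out.
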

Function that extracts the state list objects from the environment

Description

The function extracts the objects that we are going to save together with the vaeac model to make it possible to train the model further and to evaluate it. The environment should be the local environment inside the vaeac_train_model_auxiliary() function.

Usage

vaeac_get_full_state_list(environment)

Arguments

environment

The base::environment() where the objects are stored.

Value

List containing the values of norm_mean, norm_std, model_description, folder_to_save_model, n_train, n_features, one_hot_max_sizes, epochs, epochs_specified, epochs_early_stopping, early_stopping_applied, running_avg_n_values, paired_sampling, mask_generator_name, masking_ratio, mask_gen_coalitions, mask_gen_coalitions_prob, val_ratio, val_iwae_n_samples, n_vaeacs_initialize, epochs_initiation_phase, width, depth, latent_dim, activation_function, lr, batch_size, skip_conn_layer, skip_conn_masked_enc_dec, batch_normalization, cuda, train_indices, val_indices, save_every_nth_epoch, sigma_mu, sigma_sigma, feature_list, col_cat_names, col_cont_names, col_cat, col_cont, cat_in_dataset, map_new_to_original_names, map_original_to_new_names, log_exp_cont_feat, save_data, verbose, seed, and vaeac_save_file_names.

Author(s)

Lars Henry Berge Olsen


Function that determines which mask generator to use

Description

Function that determines which mask generator to use

Usage

vaeac_get_mask_generator_name(
  mask_gen_coalitions,
  mask_gen_coalitions_prob,
  masking_ratio,
  verbose
)

Arguments

mask_gen_coalitions

Matrix (default is NULL). Matrix containing the coalitions that the vaeac model will be trained on, see specified_masks_mask_generator(). This parameter is used internally in shapr when we only consider a subset of coalitions, i.e., when n_coalitions < 2^{n_{\text{features}}}, and for group Shapley, i.e., when group is specified in explain().

mask_gen_coalitions_prob

Numeric array (default is NULL). Array of length equal to the height of mask_gen_coalitions containing the probabilities of sampling the corresponding coalitions in mask_gen_coalitions.

masking_ratio

Numeric (default is 0.5). Probability of masking a feature in the mcar_mask_generator() (MCAR = Missing Completely At Random). The MCAR masking scheme ensures that the vaeac model can do arbitrary conditioning, as the model is trained on all coalitions. masking_ratio will be overruled if mask_gen_coalitions is specified.

verbose

String vector or NULL. Specifies the verbosity (printout detail level) through one or more of the strings "basic", "progress", "convergence", "shapley", and "vS_details". "basic" (default) displays basic information about the computation being performed, in addition to some messages about parameters being set or checks being unavailable due to specific input. "progress" displays information about where in the calculation process the function currently is. "convergence" displays information on how close the Shapley value estimates are to convergence (only when iterative = TRUE). "shapley" displays intermediate Shapley value estimates and standard deviations (only when iterative = TRUE), as well as the final estimates. "vS_details" displays information about the v_S estimates. This is most relevant for approach %in% c("regression_separate", "regression_surrogate", "vaeac"). NULL means no printout. Note that any combination of these strings can be used. E.g., verbose = c("basic", "vS_details") displays basic information plus details about the v(S) estimation process.

Value

The function does not return anything.

Author(s)

Lars Henry Berge Olsen


Function to load a vaeac model and set it in the right state and mode

Description

Function to load a vaeac model and set it in the right state and mode

Usage

vaeac_get_model_from_checkp(checkpoint, cuda, mode_train)

Arguments

checkpoint

List. This must be a loaded vaeac save object. That is, torch::torch_load('vaeac_save_path').

cuda

Logical (default is FALSE). If TRUE, then the vaeac model is trained using cuda/GPU. If torch::cuda_is_available() is FALSE, we fall back to the CPU. If FALSE, we use the CPU. Using a GPU for smaller tabular data sets often does not improve efficiency. See vignette("installation", package = "torch") for help enabling GPU support (only Linux and Windows).

mode_train

Logical. If TRUE, the returned vaeac model is set to be in training mode. If FALSE, the returned vaeac model is set to be in evaluation mode.

Value

A vaeac model with the correct state (based on checkpoint), sent to the desired hardware (based on cuda), and in the right mode (based on mode_train).

Author(s)

Lars Henry Berge Olsen


Function to get a string of a value with a specific number of decimals

Description

Function to get a string of a value with a specific number of decimals

Usage

vaeac_get_n_decimals(value, n_decimals = 3)

Arguments

value

The number to get n_decimals for.

n_decimals

Positive integer. The number of decimals. Default is three.

Value

String of value with n_decimals decimals.

Author(s)

Lars Henry Berge Olsen
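
Examples

A minimal sketch of the expected behavior, assuming access to the internal helper via shapr:::; the commented results follow from the description above and are not verified output.

shapr:::vaeac_get_n_decimals(3.14159)              # expected "3.142"
shapr:::vaeac_get_n_decimals(2.5, n_decimals = 1)  # expected "2.5"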


Function to create the optimizer used to train vaeac

Description

Only torch::optim_adam() is currently supported, but it is easy to add additional optimizers later.

Usage

vaeac_get_optimizer(vaeac_model, lr, optimizer_name = "adam")

Arguments

vaeac_model

A vaeac model created using vaeac().

lr

Positive numeric (default is 0.001). The learning rate used in the torch::optim_adam() optimizer.

optimizer_name

String containing the name of the torch::optimizer() to use.

Value

A torch::optim_adam() optimizer connected to the parameters of the vaeac_model.

Author(s)

Lars Henry Berge Olsen
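
Examples

A minimal sketch, assuming vaeac_model is a hypothetical model created by vaeac() and accessing the internal helper via shapr:::; only "adam" is supported, as stated in the description.

optimizer <- shapr:::vaeac_get_optimizer(
  vaeac_model = vaeac_model,
  lr = 0.001,
  optimizer_name = "adam"
)

# 'optimizer' is a torch::optim_adam() object tied to the parameters of vaeac_model.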


Function that creates the save file names for the vaeac model

Description

Function that creates the save file names for the vaeac model

Usage

vaeac_get_save_file_names(
  model_description,
  n_features,
  n_train,
  depth,
  width,
  latent_dim,
  lr,
  epochs,
  save_every_nth_epoch,
  folder_to_save_model = NULL
)

Arguments

model_description

String (default is make.names(Sys.time())). String containing, e.g., the name of the data distribution or additional parameter information. Used in the save name of the fitted model. If not provided, then a name will be generated based on base::Sys.time() to ensure a unique name. We use base::make.names() to ensure a valid file name for all operating systems.

depth

Positive integer (default is 3). The number of hidden layers in the neural networks of the masked encoder, full encoder, and decoder.

width

Positive integer (default is 32). The number of neurons in each hidden layer in the neural networks of the masked encoder, full encoder, and decoder.

latent_dim

Positive integer (default is 8). The number of dimensions in the latent space.

lr

Positive numeric (default is 0.001). The learning rate used in the torch::optim_adam() optimizer.

epochs

Positive integer (default is 100). The number of epochs to train the final vaeac model. This includes epochs_initiation_phase, where the default is 2.

save_every_nth_epoch

Positive integer (default is NULL). If provided, then the vaeac model is saved after every save_every_nth_epoch'th epoch.

folder_to_save_model

String (default is base::tempdir()). String specifying a path to a folder where the function is to save the fitted vaeac model. Note that the path will be removed from the returned explain() object if vaeac.save_model = FALSE.

Value

Array of strings containing the save file names to use when training the vaeac model. The first three names correspond to the best, best_running, and last epochs, in that order.

Author(s)

Lars Henry Berge Olsen


Compute the Importance Sampling Estimator (Validation Error)

Description

Compute the Importance Sampling Estimator which the vaeac model uses to evaluate its performance on the validation data.

Usage

vaeac_get_val_iwae(
  val_dataloader,
  mask_generator,
  batch_size,
  vaeac_model,
  val_iwae_n_samples
)

Arguments

val_dataloader

A torch dataloader which loads the validation data.

mask_generator

A mask generator object that generates the masks.

batch_size

Integer. The number of samples to include in each batch.

vaeac_model

The vaeac model.

val_iwae_n_samples

Number of samples to generate for computing the IWAE for each validation sample.

Details

Computes the mean IWAE log-likelihood estimate of the validation set. IWAE is an abbreviation for Importance Sampling Estimator,

\log p_{\theta, \psi}(x|y) \approx \log \frac{1}{S} \sum_{i=1}^{S} \frac{p_\theta(x|z_i, y) \, p_\psi(z_i|y)}{q_\phi(z_i|x, y)},

where z_i \sim q_\phi(z|x,y). For more details, see Olsen et al. (2022).

Value

The average IWAE over all instances in the validation data set.

Author(s)

Lars Henry Berge Olsen


Function to extend the explicands and apply all relevant masks/coalitions

Description

Function to extend the explicands and apply all relevant masks/coalitions

Usage

vaeac_get_x_explain_extended(x_explain, S, index_features)

Arguments

x_explain

Matrix or data.frame/data.table. Contains the features whose predictions ought to be explained.

S

The internal$objects$S matrix containing the possible coalitions.

index_features

Positive integer vector. Specifies the id_coalition to apply to the present method. NULL means all coalitions. Only used internally.

Value

The extended version of x_explain where the masks from S with indices index_features have been applied.

Author(s)

Lars Henry Berge Olsen


Impute Missing Values Using Vaeac

Description

Impute Missing Values Using Vaeac

Usage

vaeac_impute_missing_entries(
  x_explain_with_NaNs,
  n_MC_samples,
  vaeac_model,
  checkpoint,
  sampler,
  batch_size,
  verbose = NULL,
  seed = NULL,
  n_explain = NULL,
  index_features = NULL
)

Arguments

x_explain_with_NaNs

A 2D matrix, where the missing entries to impute are represented by NaN.

n_MC_samples

Integer. The number of imputed versions we create for each row in x_explain_with_NaNs.

vaeac_model

An initialized vaeac model that we are going to use to generate the MC samples.

checkpoint

List containing the parameters of the vaeac model.

sampler

A sampler object used to sample the MC samples.

batch_size

Positive integer (default is 64). The number of samples to include in each batch during the training of the vaeac model. Used in torch::dataloader().

verbose

String vector or NULL. Specifies the verbosity (printout detail level) through one or more of the strings "basic", "progress", "convergence", "shapley", and "vS_details". "basic" (default) displays basic information about the computation being performed, in addition to some messages about parameters being set or checks being unavailable due to specific input. "progress" displays information about where in the calculation process the function currently is. "convergence" displays information on how close the Shapley value estimates are to convergence (only when iterative = TRUE). "shapley" displays intermediate Shapley value estimates and standard deviations (only when iterative = TRUE), as well as the final estimates. "vS_details" displays information about the v_S estimates. This is most relevant for approach %in% c("regression_separate", "regression_surrogate", "vaeac"). NULL means no printout. Note that any combination of these strings can be used. E.g., verbose = c("basic", "vS_details") displays basic information plus details about the v(S) estimation process.

seed

Positive integer (default is 1). Seed for reproducibility. Specifies the seed before any randomness based code is being run.

n_explain

Positive integer. The number of explicands.

index_features

Optional integer vector. Used internally in shapr package to index the coalitions.

Details

Function that imputes the missing values in a 2D matrix where each row constitutes an individual. The values are sampled from the conditional distribution estimated by a vaeac model.

Value

A data.table where the missing values (NaN) in x_explain_with_NaNs have been imputed n_MC_samples times. The data table will contain extra id columns if index_features and n_explain are provided.

Author(s)

Lars Henry Berge Olsen


Compute the KL Divergence Between Two Gaussian Distributions.

Description

Computes the KL divergence between univariate normal distributions using the analytical formula, see https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Multivariate_normal_distributions.

Usage

vaeac_kl_normal_normal(p, q)

Arguments

p

A torch::distr_normal() object.

q

A torch::distr_normal() object.

Value

The KL divergence between the two Gaussian distributions.

Author(s)

Lars Henry Berge Olsen
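
Examples

A minimal sketch, assuming access to the internal function via shapr:::; the commented value follows from the analytical formula referenced above.

library(torch)

p <- distr_normal(loc = torch_tensor(0), scale = torch_tensor(1))
q <- distr_normal(loc = torch_tensor(1), scale = torch_tensor(2))

# Analytically: log(2/1) + (1^2 + (0 - 1)^2) / (2 * 2^2) - 1/2 = log(2) - 1/4 ~ 0.443.
shapr:::vaeac_kl_normal_normal(p, q)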


Creates Normal Distributions

Description

Function that takes in a tensor where the first half of the columns contains the means of the normal distributions, while the latter half of the columns contains the standard deviations. The standard deviations are clamped with min_sigma to ensure stable results. If params is of dimension batch_size x 8, the function will create 4 independent normal distributions for each of the observations (batch_size observations in total).

Usage

vaeac_normal_parse_params(params, min_sigma = 1e-04)

Arguments

params

Tensor of dimension batch_size x 2*n_features containing the means and standard deviations to be used in the normal distributions for the batch_size observations.

min_sigma

For stability it might be desirable that the minimal sigma is not too close to zero.

Details

Take a tensor (e.g., neural network output) and return a torch::distr_normal() distribution. This normal distribution is component-wise independent, and its dimensionality depends on the input shape. The first half of the channels is the mean (\mu) of the distribution, and the softplus of the second half is the standard deviation (\sigma), so there are no restrictions on the input tensor. min_sigma is the minimal value of \sigma. I.e., if the above softplus is less than min_sigma, then \sigma is clipped from below at min_sigma. This regularization is required for numerical stability and may be considered a neural network architecture choice without any change to the probabilistic model.

Value

A torch::distr_normal() distribution with the provided means and standard deviations.

Author(s)

Lars Henry Berge Olsen
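
Examples

A minimal sketch, assuming access to the internal function via shapr:::. With one feature, the first column holds the means and the softplus of the second column gives the standard deviations, clamped from below at min_sigma.

library(torch)

# Two observations, one feature: columns are (mean, pre-softplus sd).
params <- torch_tensor(matrix(
  c(0, 1,
    2, -20),
  ncol = 2, byrow = TRUE
))

distr <- shapr:::vaeac_normal_parse_params(params, min_sigma = 1e-04)

# The second observation's sd would be softplus(-20), roughly 2e-09,
# so it is clipped up to min_sigma = 1e-04 for numerical stability.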


Normalize mixed data for vaeac

Description

Compute the mean and std for each continuous feature, while the categorical features will have mean 0 and std 1.

Usage

vaeac_normalize_data(
  data_torch,
  one_hot_max_sizes,
  norm_mean = NULL,
  norm_std = NULL
)

Arguments

one_hot_max_sizes

A torch tensor of dimension n_features containing the one hot sizes of the n_features features. That is, if the ith feature is a categorical feature with 5 levels, then one_hot_max_sizes[i] = 5. The sizes for the continuous features can either be 0 or 1.

norm_mean

Torch tensor (optional). A 1D array containing the means of the columns of data_torch.

norm_std

Torch tensor (optional). A 1D array containing the stds of the columns of data_torch.

Value

A list containing the normalized version of data_torch, norm_mean, and norm_std.

Author(s)

Lars Henry Berge Olsen


Postprocess Data Generated by a vaeac Model

Description

The vaeac model generates numerical values. This function converts the categorical features back from numerics with class labels 1,2,...,K to factors with the original class labels.

Usage

vaeac_postprocess_data(data, vaeac_model_state_list)

Arguments

data

data.table containing the data generated by a vaeac model

vaeac_model_state_list

List. The returned list from the vaeac_preprocess_data function or a loaded checkpoint list of a saved vaeac object.

Value

data.table with the generated data from a vaeac model where the categorical features now have the original class names.

Author(s)

Lars Henry Berge Olsen


Preprocess Data for the vaeac approach

Description

vaeac only supports numerical values. This function converts categorical features to numerics with class labels 1,2,...,K, and keeps track of the map between the original and new class labels. It also computes the one_hot_max_sizes.

Usage

vaeac_preprocess_data(
  data,
  log_exp_cont_feat = FALSE,
  normalize = TRUE,
  norm_mean = NULL,
  norm_std = NULL
)

Arguments

data

matrix/data.frame/data.table containing the training data. Only the features and not the response.

log_exp_cont_feat

Boolean. If TRUE, we log transform all continuous features before sending the data to vaeac. vaeac creates unbounded values, so if the continuous features are strictly positive (as for the Burr and Abalone data), it can be advantageous to log-transform the data to unbounded form before using vaeac. If TRUE, then vaeac_postprocess_data will take the exp of the results to get back to strictly positive values.

norm_mean

Torch tensor (optional). A 1D array containing the means of the columns of the data.

norm_std

Torch tensor (optional). A 1D array containing the stds of the columns of the data.

Value

List containing the data which can be used in vaeac, the maps between the original and new class names for the categorical features, the one_hot_max_sizes, and a list of information about the data.

Author(s)

Lars Henry Berge Olsen
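
Examples

A minimal sketch, assuming access to the internal function via shapr:::; the element names of the returned list are not spelled out above, so only the call is shown.

library(data.table)

dt <- data.table(
  x_cont = c(0.5, 1.5, 2.5),
  x_cat  = factor(c("a", "b", "a"))
)

# Recodes "a"/"b" to the numeric class labels 1/2, records the label map,
# and computes the one_hot_max_sizes for the two features.
pre <- shapr:::vaeac_preprocess_data(dt)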


Function to print out a training summary for the vaeac model

Description

Function to print out a training summary for the vaeac model

Usage

vaeac_print_train_summary(best_epoch, best_epoch_running, last_state)

Arguments

best_epoch

Positive integer. The epoch with the lowest validation error.

best_epoch_running

Positive integer. The epoch with the lowest running validation error.

last_state

The state list (i.e., the saved vaeac object) of vaeac model at the epoch with the lowest IWAE.

Value

This function only prints out a message.

Author(s)

Lars Henry Berge Olsen


Function that saves the state list and the current save state of the vaeac model

Description

Function that saves the state list and the current save state of the vaeac model

Usage

vaeac_save_state(state_list, file_name, return_state = FALSE)

Arguments

state_list

List containing all the parameters in the state.

file_name

String containing the file path.

return_state

Logical. Whether to return the state list or not.

Value

The function returns the state list if return_state is TRUE; otherwise, it does not return anything.

Author(s)

Lars Henry Berge Olsen


Train the Vaeac Model

Description

Function that fits a vaeac model to the given dataset based on the provided parameters, as described in Olsen et al. (2022). Note that all default parameters specified below originate from setup_approach.vaeac() and vaeac_get_extra_para_default().

Usage

vaeac_train_model(
  x_train,
  model_description,
  folder_to_save_model,
  cuda,
  n_vaeacs_initialize,
  epochs_initiation_phase,
  epochs,
  epochs_early_stopping,
  save_every_nth_epoch,
  val_ratio,
  val_iwae_n_samples,
  depth,
  width,
  latent_dim,
  lr,
  batch_size,
  running_avg_n_values,
  activation_function,
  skip_conn_layer,
  skip_conn_masked_enc_dec,
  batch_normalization,
  paired_sampling,
  masking_ratio,
  mask_gen_coalitions,
  mask_gen_coalitions_prob,
  sigma_mu,
  sigma_sigma,
  save_data,
  log_exp_cont_feat,
  which_vaeac_model,
  verbose,
  seed,
  ...
)

Arguments

x_train

A data.table containing the training data. Categorical data must have class names 1,2,\dots,K.

model_description

String (default is make.names(Sys.time())). String containing, e.g., the name of the data distribution or additional parameter information. Used in the save name of the fitted model. If not provided, then a name will be generated based on base::Sys.time() to ensure a unique name. We use base::make.names() to ensure a valid file name for all operating systems.

folder_to_save_model

String (default is base::tempdir()). String specifying a path to a folder where the function is to save the fitted vaeac model. Note that the path will be removed from the returned explain() object if vaeac.save_model = FALSE.

cuda

Logical (default is FALSE). If TRUE, then the vaeac model is trained using cuda/GPU. If torch::cuda_is_available() is FALSE, we fall back to the CPU. If FALSE, we use the CPU. Using a GPU for smaller tabular data sets often does not improve efficiency. See vignette("installation", package = "torch") for help enabling GPU support (only Linux and Windows).

n_vaeacs_initialize

Positive integer (default is 4). The number of different vaeac models to initialize at the start. The best performing model after epochs_initiation_phase epochs (default is 2) is then picked and trained further.

epochs_initiation_phase

Positive integer (default is 2). The number of epochs to run each of the n_vaeacs_initialize vaeac models before continuing to train only the best performing model.

epochs

Positive integer (default is 100). The number of epochs to train the final vaeac model. This includes epochs_initiation_phase, where the default is 2.

epochs_early_stopping

Positive integer (default is NULL). The training stops if there has been no improvement in the validation IWAE for epochs_early_stopping epochs. If the user wants the training process to be solely based on this training criterion, then epochs in explain() should be set to a large number. If NULL, then shapr will internally set epochs_early_stopping = vaeac.epochs such that early stopping does not occur.

save_every_nth_epoch

Positive integer (default is NULL). If provided, then the vaeac model is saved after every save_every_nth_epoch-th epoch.

val_ratio

Numeric (default is 0.25). Scalar between 0 and 1 indicating the ratio of instances from the input data which will be used as validation data. That is, val_ratio = 0.25 means that 75% of the provided data is used as training data, while the remaining 25% is used as validation data.

val_iwae_n_samples

Positive integer (default is 25). The number of generated samples used to compute the IWAE criterion when validating the vaeac model on the validation data.

depth

Positive integer (default is 3). The number of hidden layers in the neural networks of the masked encoder, full encoder, and decoder.

width

Positive integer (default is 32). The number of neurons in each hidden layer in the neural networks of the masked encoder, full encoder, and decoder.

latent_dim

Positive integer (default is 8). The number of dimensions in the latent space.

lr

Positive numeric (default is 0.001). The learning rate used in the torch::optim_adam() optimizer.

batch_size

Positive integer (default is 64). The number of samples to include in each batch during the training of the vaeac model. Used in torch::dataloader().

running_avg_n_values

Positive integer (default is 5). The number of previous IWAE values to include when we compute the running mean of the IWAE criterion.

activation_function

A torch::nn_module() representing an activation function, such as, e.g., torch::nn_relu() (default), torch::nn_leaky_relu(), torch::nn_selu(), or torch::nn_sigmoid().

skip_conn_layer

Logical (default is TRUE). If TRUE, we apply identity skip connections in each layer, see skip_connection(). That is, we add the input X to the outcome of each hidden layer, so the output becomes X + activation(WX + b); a minimal illustration is sketched after this argument list.

skip_conn_masked_enc_dec

Logical (default is TRUE). If TRUE, we apply concatenate skip connections between the layers in the masked encoder and decoder. The first layer of the masked encoder will be linked to the last layer of the decoder. The second layer of the masked encoder will be linked to the second to last layer of the decoder, and so on.

batch_normalization

Logical (default is FALSE). If TRUE, we apply batch normalization after the activation function. Note that if skip_conn_layer = TRUE, then the normalization is applied after the inclusion of the skip connection. That is, we batch normalize the whole quantity X + activation(WX + b).

paired_sampling

Logical (default is TRUE). If TRUE, we apply paired sampling to the training batches. That is, the training observations in each batch are duplicated, where the first instance is masked by S while the second instance is masked by the complement of S. This ensures that the training of the vaeac model becomes more stable, as the model has access to the full version of each training observation. However, this increases the training time due to the more complex implementation and the doubling of each batch size. See paired_sampler() for more information.

masking_ratio

Numeric (default is 0.5). Probability of masking a feature in the mcar_mask_generator() (MCAR = Missing Completely At Random). The MCAR masking scheme ensures that the vaeac model can do arbitrary conditioning, as all coalitions are encountered during training. masking_ratio will be overruled if mask_gen_coalitions is specified.

mask_gen_coalitions

Matrix (default is NULL). Matrix containing the coalitions that the vaeac model will be trained on, see specified_masks_mask_generator(). This parameter is used internally in shapr when we only consider a subset of coalitions, i.e., when n_coalitions < 2^(number of features), and for group Shapley, i.e., when group is specified in explain().

mask_gen_coalitions_prob

Numeric array (default is NULL). Array of length equal to the height of mask_gen_coalitions containing the probabilities of sampling the corresponding coalitions in mask_gen_coalitions.

sigma_mu

Numeric (default is 1e4). One of two hyperparameter values in the normal-gamma prior used in the masked encoder, see Section 3.3.1 in Olsen et al. (2022).

sigma_sigma

Numeric (default is 1e-4). One of two hyperparameter values in the normal-gamma prior used in the masked encoder, see Section 3.3.1 in Olsen et al. (2022).

save_data

Logical (default is FALSE). If TRUE, then the data is stored together with the model. Useful if one wants to continue training the model later using vaeac_train_model_continue().

log_exp_cont_feat

Logical (default is FALSE). If TRUE, we log-transform all continuous features before sending the data to vaeac(). The vaeac model creates unbounded Monte Carlo sample values. Thus, if the continuous features are strictly positive (as for, e.g., the Burr distribution and the Abalone data set), it can be advantageous to log-transform the data to unbounded form before using vaeac. If TRUE, then vaeac_postprocess_data() will take the exp of the results to get back to strictly positive values when using the vaeac model to impute missing values/generate the Monte Carlo samples.

which_vaeac_model

String (default is "best"). The name of the vaeac model (snapshots from different epochs) to use when generating the Monte Carlo samples. The standard choices are: "best" (epoch with lowest IWAE), "best_running" (epoch with lowest running IWAE, see vaeac.running_avg_n_values), and "last" (the last epoch). Note that additional choices are available if vaeac.save_every_nth_epoch is provided. For example, if vaeac.save_every_nth_epoch = 5, then vaeac.which_vaeac_model can also take the values "epoch_5", "epoch_10", "epoch_15", and so on.

verbose

String vector or NULL. Specifies the verbosity (printout detail level) through one or more of the strings "basic", "progress", "convergence", "shapley", and "vS_details". "basic" (default) displays basic information about the computation which is being performed, in addition to some messages about parameters being set or checks being unavailable due to specific input. "progress" displays information about where in the calculation process the function currently is. "convergence" displays information on how close to convergence the Shapley value estimates are (only when iterative = TRUE). "shapley" displays intermediate Shapley value estimates and standard deviations (only when iterative = TRUE) and the final estimates. "vS_details" displays information about the v_S estimates. This is most relevant for approach %in% c("regression_separate", "regression_surrogate", "vaeac"). NULL means no printout. Note that any combination of these strings can be used. E.g., verbose = c("basic", "vS_details") displays basic information plus details about the v(S) estimation process.

seed

Positive integer (default is 1). Seed for reproducibility. Specifies the seed before any code involving randomness is run.

...

List of extra parameters, currently not used.
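
A minimal torch illustration of the identity skip connection described for skip_conn_layer above (a hypothetical module for exposition, not shapr's internal skip_connection() implementation):

library(torch)

# A layer whose output is x + activation(W x + b); the identity skip
# requires the input and output dimensions to be equal
skip_layer_sketch <- nn_module(
  initialize = function(n_features) {
    self$linear <- nn_linear(n_features, n_features)
    self$activation <- nn_relu()
  },
  forward = function(x) {
    x + self$activation(self$linear(x))
  }
)

layer <- skip_layer_sketch(n_features = 4)
out <- layer(torch_randn(10, 4))  # output shape: 10 x 4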

Details

The vaeac model consists of three neural networks: a masked encoder, a full encoder, and a decoder. The networks share the depth, width, and activation_function parameters. The encoders map x_train to a latent representation of dimension latent_dim, while the decoder maps the latent representations back to the feature space. See Olsen et al. (2022) for more details. The function first initiates n_vaeacs_initialize vaeac models with different randomly initialized network parameters to remedy poor initializations. After epochs_initiation_phase epochs, the n_vaeacs_initialize vaeac models are compared and the function continues to train only the best performing one for a total of epochs epochs. The networks are trained using the ADAM optimizer with learning rate lr.
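
As a minimal usage sketch, the vaeac approach is typically invoked through explain() rather than by calling vaeac_train_model() directly. The objects model, x_explain, x_train, and p0 below are hypothetical, and the vaeac.-prefixed argument names are assumptions based on setup_approach.vaeac():

library(shapr)

# Explain predictions with the vaeac approach; the vaeac model is trained
# internally using the parameters documented above
explanation <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "vaeac",
  phi0 = p0,
  vaeac.epochs = 10,             # total number of training epochs
  vaeac.n_vaeacs_initialize = 2  # initiate two models, keep the best
)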

Value

A list containing the training/validation errors and paths to where the vaeac models are saved on the disk.

Author(s)

Lars Henry Berge Olsen

References

Olsen, L. H. B., Glad, I. K., Jullum, M., & Aas, K. (2022). Using Shapley values and variational autoencoders to explain predictive models with dependent mixed features. Journal of Machine Learning Research, 23(213), 1-51.

Function used to train a vaeac model

Description

This function is used both in the initialization phase, where we train several newly initiated vaeac models, and when continuing to train the best performing vaeac model for the remaining number of epochs. We are in the former setting when initialization_idx is provided and in the latter when it is NULL. When it is NULL, we save the vaeac models with the lowest VLB, IWAE, and running IWAE, together with the models at the epochs given by save_every_nth_epoch, to disk.
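
As an illustration of the early-stopping bookkeeping described above, here is a minimal self-contained sketch with toy numbers (not the internal code; internally the criterion is the validation IWAE):

# Stop when the validation error has not improved for
# epochs_early_stopping consecutive epochs
epochs_early_stopping <- 3
val_error <- c(10, 8, 7.5, 7.6, 7.4, 7.45, 7.44, 7.46)  # toy per-epoch values
best <- Inf
epochs_since_best <- 0
for (epoch in seq_along(val_error)) {
  if (val_error[epoch] < best) {
    best <- val_error[epoch]
    epochs_since_best <- 0
  } else {
    epochs_since_best <- epochs_since_best + 1
  }
  if (epochs_since_best >= epochs_early_stopping) {
    cat("Early stopping at epoch", epoch, "\n")
    break
  }
}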

Usage

vaeac_train_model_auxiliary(
  vaeac_model,
  optimizer,
  train_dataloader,
  val_dataloader,
  val_iwae_n_samples,
  running_avg_n_values,
  verbose,
  cuda,
  epochs,
  save_every_nth_epoch,
  epochs_early_stopping,
  epochs_start = 1,
  progressr_bar = NULL,
  vaeac_save_file_names = NULL,
  state_list = NULL,
  initialization_idx = NULL,
  n_vaeacs_initialize = NULL,
  train_vlb = NULL,
  val_iwae = NULL,
  val_iwae_running = NULL
)

Arguments

vaeac_model

A vaeac() object. The vaeac model this function is to train.

optimizer

A torch::optimizer() object. See vaeac_get_optimizer().

train_dataloader

A torch::dataloader() containing the training data for the vaeac model.

val_dataloader

A torch::dataloader() containing the validation data for the vaeac model.

val_iwae_n_samples

Positive integer (default is 25). The number of generated samples used to compute the IWAE criterion when validating the vaeac model on the validation data.

running_avg_n_values

Positive integer (default is 5). The number of previous IWAE values to include when we compute the running mean of the IWAE criterion.

verbose

String vector or NULL. Specifies the verbosity (printout detail level) through one or more of the strings "basic", "progress", "convergence", "shapley", and "vS_details". "basic" (default) displays basic information about the computation which is being performed, in addition to some messages about parameters being set or checks being unavailable due to specific input. "progress" displays information about where in the calculation process the function currently is. "convergence" displays information on how close to convergence the Shapley value estimates are (only when iterative = TRUE). "shapley" displays intermediate Shapley value estimates and standard deviations (only when iterative = TRUE) and the final estimates. "vS_details" displays information about the v_S estimates. This is most relevant for approach %in% c("regression_separate", "regression_surrogate", "vaeac"). NULL means no printout. Note that any combination of these strings can be used. E.g., verbose = c("basic", "vS_details") displays basic information plus details about the v(S) estimation process.

cuda

Logical (default is FALSE). If TRUE, then the vaeac model will be trained using cuda/GPU. If torch::cuda_is_available() is FALSE, then we fall back to using the CPU. If FALSE, we use the CPU. Using a GPU for smaller tabular datasets often does not improve efficiency. See vignette("installation", package = "torch") for help with enabling GPU support (only Linux and Windows).

epochs

Positive integer (default is 100). The number of epochs to train the final vaeac model. This includes epochs_initiation_phase, where the default is 2.

save_every_nth_epoch

Positive integer (default is NULL). If provided, then the vaeac model is saved after every save_every_nth_epoch-th epoch.

epochs_early_stopping

Positive integer (default is NULL). The training stops if there has been no improvement in the validation IWAE for epochs_early_stopping epochs. If the user wants the training process to be solely based on this training criterion, then epochs in explain() should be set to a large number. If NULL, then shapr will internally set epochs_early_stopping = vaeac.epochs such that early stopping does not occur.

epochs_start

Positive integer (default is 1). The epoch at which the training starts.

progressr_bar

A progressr::progressor() object (default is NULL) to keep track of progress.

vaeac_save_file_names

Array of strings containing the save file names for the vaeac model.

state_list

Named list containing the objects returned from vaeac_get_full_state_list().

initialization_idx

Positive integer (default is NULL). The index of the current vaeac model in the initialization phase.

n_vaeacs_initialize

Positive integer (default is 4). The number of different vaeac models to initiate at the start. The best performing one is picked after epochs_initiation_phase epochs (default is 2), and training continues on that one.

train_vlb

A one-dimensional torch::torch_tensor() (default is NULL) containing previous values of the training VLB.

val_iwae

A one-dimensional torch::torch_tensor() (default is NULL) containing previous values of the validation IWAE.

val_iwae_running

A one-dimensional torch::torch_tensor() (default is NULL) containing previous values of the running validation IWAE.

Value

Depends on whether we are in the initialization phase or not: either the trained vaeac model, or a list specifying where the vaeac models are stored on disk together with the model parameters.

Author(s)

Lars Henry Berge Olsen


Continue to Train the vaeac Model

Description

Function that loads a previously trained vaeac model and continues the training, either on new data or on the same dataset it was trained on before. If given a new dataset, we assume that it has the same distribution and one_hot_max_sizes as the original dataset.

Usage

vaeac_train_model_continue(
  explanation,
  epochs_new,
  lr_new = NULL,
  x_train = NULL,
  save_data = FALSE,
  verbose = NULL,
  seed = 1
)

Arguments

explanation

An explain() object for which vaeac was the used approach.

epochs_new

Positive integer. The number of extra epochs to conduct.

lr_new

Positive numeric (default is NULL). If provided, overwrites the old learning rate used in the ADAM optimizer.

x_train

A data.table containing the training data. Categorical data must have class names 1, 2, ..., K.

save_data

Logical (default is FALSE). If TRUE, then the data is stored together with the model. Useful if one wants to continue training the model later using vaeac_train_model_continue().

verbose

String vector or NULL. Specifies the verbosity (printout detail level) through one or more of the strings "basic", "progress", "convergence", "shapley", and "vS_details". "basic" (default) displays basic information about the computation which is being performed, in addition to some messages about parameters being set or checks being unavailable due to specific input. "progress" displays information about where in the calculation process the function currently is. "convergence" displays information on how close to convergence the Shapley value estimates are (only when iterative = TRUE). "shapley" displays intermediate Shapley value estimates and standard deviations (only when iterative = TRUE) and the final estimates. "vS_details" displays information about the v_S estimates. This is most relevant for approach %in% c("regression_separate", "regression_surrogate", "vaeac"). NULL means no printout. Note that any combination of these strings can be used. E.g., verbose = c("basic", "vS_details") displays basic information plus details about the v(S) estimation process.

seed

Positive integer (default is 1). Seed for reproducibility. Specifies the seed before any code involving randomness is run.

Value

A list containing the training/validation errors and paths to where the vaeac models are saved on the disk.
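
A minimal usage sketch, assuming explanation is an explain() object computed with approach = "vaeac" (the numeric values are hypothetical):

# Continue training the fitted vaeac model for 10 extra epochs with a
# lower learning rate
vaeac_continued <- vaeac_train_model_continue(
  explanation = explanation,
  epochs_new = 10,
  lr_new = 1e-4
)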

Author(s)

Lars Henry Berge Olsen

References

Olsen, L. H. B., Glad, I. K., Jullum, M., & Aas, K. (2022). Using Shapley values and variational autoencoders to explain predictive models with dependent mixed features. Journal of Machine Learning Research, 23(213), 1-51.

Move vaeac parameters to correct location

Description

This function ensures that the main and extra parameters for the vaeac approach are located at their correct locations.

Usage

vaeac_update_para_locations(parameters)

Arguments

parameters

List. The internal$parameters list created inside the explain() function.

Value

Updated version of parameters where all vaeac parameters are located at the correct location.

Author(s)

Lars Henry Berge Olsen


Function that checks and adds a pre-trained vaeac model

Description

Function that checks and adds a pre-trained vaeac model

Usage

vaeac_update_pretrained_model(parameters)

Arguments

parameters

List containing the parameters used within explain().

Value

This function adds a valid pre-trained vaeac model to the parameters list.

Author(s)

Lars Henry Berge Olsen


Calculate weighted matrix

Description

Calculate weighted matrix

Usage

weight_matrix(X, normalize_W_weights = TRUE)

Arguments

X

data.table. Output from create_coalition_table().

normalize_W_weights

Logical. Whether to normalize the weights for the coalitions to sum to 1 for increased numerical stability before solving the WLS (weighted least squares). Applies to all coalitions except the first and the last, i.e., coalitions 1 and 2^m (the empty and grand coalitions).

Value

Numeric matrix. See weight_matrix_cpp() for more information.

Author(s)

Nikolai Sellereite, Martin Jullum


Calculate weight matrix

Description

Calculate weight matrix

Usage

weight_matrix_cpp(coalitions, m, n, w)

Arguments

coalitions

List. Each element is an integer vector representing a valid combination of features/feature groups.

m

Integer. Number of features/feature groups.

n

Integer. Number of combinations.

w

Numeric vector of length n. w[i] equals the Shapley weight of feature/feature group combination i, represented by coalitions[[i]].

Value

Matrix of dimension n x (m + 1).
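
For background, this matrix stems from the weighted least squares formulation used to compute Shapley values from the contribution function. A minimal base R sketch of the underlying algebra (hypothetical Z, w, and v; not the internal implementation):

# Z: n x (m + 1) matrix with an intercept column and coalition indicators
# w: length-n vector of Shapley kernel weights
# v: length-n vector of contribution function values v(S)
set.seed(1)
m <- 3
n <- 2^m
Z <- cbind(1, t(sapply(0:(n - 1), function(i) as.integer(intToBits(i)[1:m]))))
w <- runif(n)  # placeholder weights
v <- rnorm(n)  # placeholder v(S) values

Zw <- Z * w                             # row-wise weighting: diag(w) %*% Z
phi <- solve(t(Zw) %*% Z, t(Zw) %*% v)  # phi0 and the m Shapley values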

Author(s)

Nikolai Sellereite, Martin Jullum