Type: | Package |
Title: | Predictive Power Score |
Version: | 0.0.5 |
Description: | The Predictive Power Score (PPS) is an asymmetric, data-type-agnostic score that can detect linear or non-linear relationships between two variables. The score ranges from 0 (no predictive power) to 1 (perfect predictive power). PPS can be useful for data exploration purposes, in the same way correlation analysis is. For more information on PPS, see https://github.com/paulvanderlaken/ppsr. |
License: | GPL (≥ 3) |
Encoding: | UTF-8 |
Suggests: | testthat (≥ 2.0.0) |
Config/testthat/edition: | 3 |
Config/testthat/parallel: | true |
RoxygenNote: | 7.2.3 |
Imports: | ggplot2 (≥ 3.3.3), parsnip (≥ 0.1.5), rpart (≥ 4.1.15), withr (≥ 2.4.1), gridExtra (≥ 2.3) |
NeedsCompilation: | no |
Packaged: | 2024-02-18 11:57:33 UTC; pvdl |
Author: | Paul van der Laken [aut, cre, cph] |
Maintainer: | Paul van der Laken <paulvanderlaken@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2024-02-18 12:30:02 UTC |
ppsr: An R implementation of the Predictive Power Score (PPS)
Description
The PPS is an asymmetric, data-type-agnostic score that can detect linear or non-linear relationships between two columns. The score ranges from 0 (no predictive power) to 1 (perfect predictive power). It can be used as an alternative to the correlation (matrix).
Lists all algorithms currently supported
Description
Lists all algorithms currently supported
Usage
available_algorithms()
Value
a list of all available parsnip engines
Examples
available_algorithms()
Lists all evaluation metrics currently supported
Description
Lists all evaluation metrics currently supported
Usage
available_evaluation_metrics()
Value
a list of all available evaluation metrics and their implementation in functional form
Examples
available_evaluation_metrics()
Normalizes the original score compared to a naive baseline score. The calculation that is performed depends on the type of model.
Description
Normalizes the original score compared to a naive baseline score. The calculation that is performed depends on the type of model.
Usage
normalize_score(baseline_score, model_score, type)
Arguments
baseline_score |
float, the evaluation metric score for a naive baseline (model) |
model_score |
float, the evaluation metric score for a statistical model |
type |
character, type of model |
Value
numeric vector of length one, normalized score
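To illustrate how a normalization against a baseline can depend on the model type, the sketch below assumes that regression metrics are errors (lower is better, as with MAE) and that classification metrics are scores with a maximum of 1 (higher is better, as with a weighted F1). The function name normalize_sketch and the exact formulas are assumptions made for illustration and may differ from the package's own implementation.

```r
# Hypothetical sketch; not the package's own normalize_score().
normalize_sketch <- function(baseline_score, model_score, type) {
  if (type == "regression") {
    # Error metric (lower is better): share of the baseline error removed
    if (baseline_score == 0) return(0)  # nothing to improve upon; avoids 0/0
    normalized <- 1 - model_score / baseline_score
  } else if (type == "classification") {
    # Score metric with maximum 1 (higher is better): share of the gap closed
    if (baseline_score >= 1) return(0)
    normalized <- (model_score - baseline_score) / (1 - baseline_score)
  } else {
    stop("type must be 'regression' or 'classification'")
  }
  # Clamp to [0, 1] so that a model worse than the baseline scores 0
  max(0, min(1, normalized))
}

normalize_sketch(baseline_score = 2.0, model_score = 0.5, type = "regression")
normalize_sketch(baseline_score = 0.4, model_score = 0.7, type = "classification")
```

Under these assumptions, a model that performs no better than the naive baseline receives a normalized score of 0, which matches the documented lower bound of the PPS.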
Calculate predictive power score for x on y
Description
Calculate predictive power score for x on y
Usage
score(
df,
x,
y,
algorithm = "tree",
metrics = list(regression = "MAE", classification = "F1_weighted"),
cv_folds = 5,
seed = 1,
verbose = TRUE
)
Arguments
df |
data.frame containing columns for x and y |
x |
string, column name of predictor variable |
y |
string, column name of target variable |
algorithm |
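string, see available_algorithms() |
metrics |
named list of evaluation metrics, one for regression and one for classification, see available_evaluation_metrics() |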
cv_folds |
float, number of cross-validation folds |
seed |
float, seed to ensure reproducibility/stability |
verbose |
boolean, whether to print notifications |
Value
a named list, potentially containing
- x
the name of the predictor variable
- y
the name of the target variable
- result_type
text showing how to interpret the resulting score
- pps
the predictive power score
- metric
the evaluation metric used to compute the PPS
- baseline_score
the score of a naive model on the evaluation metric
- model_score
the score of the predictive model on the evaluation metric
- cv_folds
how many cross-validation folds were used
- seed
the seed that was set
- algorithm
text showing what algorithm was used
- model_type
text showing whether classification or regression was used
Examples
score(iris, x = 'Petal.Length', y = 'Species')
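Because the score is asymmetric, it can be informative to compute it in both directions and compare. The sketch below assumes the ppsr package is installed (it is skipped otherwise) and only uses the arguments and list elements documented above; the variable names forward and backward are illustrative.

```r
# Assumes the ppsr package is installed; the block is skipped otherwise.
forward <- NULL
backward <- NULL
if (requireNamespace("ppsr", quietly = TRUE)) {
  forward  <- ppsr::score(iris, x = "Petal.Length", y = "Species", verbose = FALSE)
  backward <- ppsr::score(iris, x = "Species", y = "Petal.Length", verbose = FALSE)
  # The two directions generally differ because the score is asymmetric
  print(c(forward = forward$pps, backward = backward$pps))
  # Inspect how the forward score was obtained
  print(forward[c("model_type", "metric", "baseline_score", "model_score")])
}
```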
Calculate correlation coefficients for whole dataframe
Description
Calculate correlation coefficients for whole dataframe
Usage
score_correlations(df, ...)
Arguments
df |
data.frame containing columns for x and y |
... |
arguments to pass to stats::cor() |
Value
a data.frame with x-y correlation coefficients
Examples
score_correlations(iris)
Calculate predictive power scores for whole dataframe
Description
Iterates through the columns of the dataframe, calculating the predictive power score for every possible combination of x and y.
Usage
score_df(df, ..., do_parallel = FALSE, n_cores = -1)
Arguments
df |
data.frame containing columns for x and y |
... |
any arguments passed to score() |
do_parallel |
bool, whether to perform score() calls in parallel |
n_cores |
numeric, number of cores to use, defaults to maximum minus 1 |
Value
a data.frame containing
- x
the name of the predictor variable
- y
the name of the target variable
- result_type
text showing how to interpret the resulting score
- pps
the predictive power score
- metric
the evaluation metric used to compute the PPS
- baseline_score
the score of a naive model on the evaluation metric
- model_score
the score of the predictive model on the evaluation metric
- cv_folds
how many cross-validation folds were used
- seed
the seed that was set
- algorithm
text showing what algorithm was used
- model_type
text showing whether classification or regression was used
Examples
score_df(iris)
score_df(mtcars, do_parallel = TRUE, n_cores = 2)
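Since the result is an ordinary data.frame, it can be filtered and sorted with base R. The following sketch assumes the ppsr package is installed (it is skipped otherwise) and relies only on the x, y, and pps columns described under Value; the name top_pairs is illustrative.

```r
# Assumes the ppsr package is installed; the block is skipped otherwise.
top_pairs <- NULL
if (requireNamespace("ppsr", quietly = TRUE)) {
  scores <- ppsr::score_df(iris)
  # Drop the rows where a variable predicts itself
  scores <- scores[as.character(scores$x) != as.character(scores$y), ]
  # Strongest predictor-target pairs first
  top_pairs <- scores[order(scores$pps, decreasing = TRUE), c("x", "y", "pps")]
  print(head(top_pairs, 5))
}
```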
Calculate predictive power score matrix
Description
Iterates through the columns of the dataset, calculating the predictive power score for every possible combination of x and y. Note that the targets are on the rows, and the features on the columns.
Usage
score_matrix(df, ...)
Arguments
df |
data.frame containing columns for x and y |
... |
any arguments passed to score_df() |
Value
a matrix of numeric values, representing predictive power scores
Examples
score_matrix(iris)
score_matrix(mtcars, do_parallel = TRUE, n_cores = 2)
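The orientation of the matrix (targets on the rows, features on the columns) can be illustrated by pivoting a long table of scores by hand. The sketch below uses base R only; the variable names and the pps values in the toy table are made up purely for illustration and do not come from a real dataset.

```r
# Toy long-format scores; the pps values are invented for illustration only.
long <- data.frame(
  x   = c("a", "a", "b", "b"),
  y   = c("a", "b", "a", "b"),
  pps = c(1.0, 0.6, 0.1, 1.0),
  stringsAsFactors = FALSE
)

vars <- unique(long$x)
mat <- matrix(NA_real_, nrow = length(vars), ncol = length(vars),
              dimnames = list(target = vars, feature = vars))
for (i in seq_len(nrow(long))) {
  # Row = target (y), column = feature (x)
  mat[long$y[i], long$x[i]] <- long$pps[i]
}
mat
```

In this layout, mat["b", "a"] holds the score of feature a predicting target b, and it does not need to equal mat["a", "b"].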
Calculates out-of-sample model performance of a statistical model
Description
Calculates out-of-sample model performance of a statistical model
Usage
score_model(train, test, model, x, y, metric)
Arguments
train |
df, training data, containing variable y |
test |
df, test data, containing variable y |
model |
parsnip model object, with mode preset |
x |
character, column name of predictor variable |
y |
character, column name of target variable |
metric |
character, name of evaluation metric being used, see available_evaluation_metrics() |
Value
numeric vector of length one, evaluation score for predictions using the statistical model
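The idea of out-of-sample performance can be sketched without the package: fit on the training rows, predict the held-out rows, and score the predictions. The example below is only an approximation of that idea. It calls rpart directly rather than going through a parsnip model object, uses MAE as the metric, and picks an arbitrary split of mtcars, all of which are assumptions made for illustration.

```r
# Sketch only: uses rpart directly instead of a parsnip model object.
mae <- NULL
if (requireNamespace("rpart", quietly = TRUE)) {
  set.seed(1)
  test_rows <- sample(nrow(mtcars), size = 8)
  train <- mtcars[-test_rows, ]
  test  <- mtcars[test_rows, ]
  # Single-predictor regression tree fitted on the training rows only
  fit <- rpart::rpart(mpg ~ wt, data = train, method = "anova")
  predictions <- predict(fit, newdata = test)
  # Mean absolute error on the held-out rows
  mae <- mean(abs(test$mpg - predictions))
  print(mae)
}
```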
Calculate out-of-sample model performance of naive baseline model
Description
Calculates the out-of-sample model performance of a naive baseline model. The calculation that is performed depends on the type of model. For regression models, the mean is used as the prediction. For classification, a model predicting random values and a model predicting modal values are both evaluated, and the better of the two is taken as the baseline score.
Usage
score_naive(train, test, x, y, type, metric)
Arguments
train |
df, training data, containing variable y |
test |
df, test data, containing variable y |
x |
character, column name of predictor variable |
y |
character, column name of target variable |
type |
character, type of model |
metric |
character, evaluation metric being used |
Value
numeric vector of length one, evaluation score for predictions using naive model
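The two kinds of baseline described above can be sketched in base R. The function names below are hypothetical, and plain accuracy stands in for the classification metric, which is an assumption; the package supports other metrics (see available_evaluation_metrics()).

```r
# Regression: predict the training mean for every test row, scored with MAE
naive_regression_mae <- function(train, test, y) {
  mean(abs(test[[y]] - mean(train[[y]])))
}

# Classification: take the better of a modal-class and a random-guess model.
# Plain accuracy is used as a stand-in metric (an assumption).
naive_classification_accuracy <- function(train, test, y, seed = 1) {
  classes <- as.character(train[[y]])
  truth <- as.character(test[[y]])
  # Modal model: always predict the most frequent training class
  modal_class <- names(which.max(table(classes)))
  accuracy_modal <- mean(truth == modal_class)
  # Random model: draw a class uniformly at random for every test row
  set.seed(seed)  # note: this changes the global RNG state
  random_guess <- sample(unique(classes), length(truth), replace = TRUE)
  accuracy_random <- mean(truth == random_guess)
  max(accuracy_modal, accuracy_random)
}

train <- iris[seq(1, 150, by = 2), ]
test  <- iris[seq(2, 150, by = 2), ]
naive_regression_mae(train, test, "Sepal.Length")
naive_classification_accuracy(train, test, "Species")
```

Taking the better of the two classification baselines makes the baseline harder to beat, so a model has to improve on both trivial strategies before its normalized score rises above 0.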
Calculate predictive power scores for y
Description
Calculates the predictive power scores for the specified y variable using every column in the dataset as x, including itself.
Usage
score_predictors(df, y, ..., do_parallel = FALSE, n_cores = -1)
Arguments
df |
data.frame containing columns for x and y |
y |
string, column name of target variable |
... |
any arguments passed to score() |
do_parallel |
bool, whether to perform score() calls in parallel |
n_cores |
numeric, number of cores to use, defaults to maximum minus 1 |
Value
a data.frame containing
- x
the name of the predictor variable
- y
the name of the target variable
- result_type
text showing how to interpret the resulting score
- pps
the predictive power score
- metric
the evaluation metric used to compute the PPS
- baseline_score
the score of a naive model on the evaluation metric
- model_score
the score of the predictive model on the evaluation metric
- cv_folds
how many cross-validation folds were used
- seed
the seed that was set
- algorithm
text showing what algorithm was used
- model_type
text showing whether classification or regression was used
Examples
score_predictors(df = iris, y = 'Species')
score_predictors(df = mtcars, y = 'mpg', do_parallel = TRUE, n_cores = 2)
Visualize the PPS & correlation matrices
Description
Visualize the PPS & correlation matrices
Usage
visualize_both(
df,
color_value_positive = "#08306B",
color_value_negative = "#8b0000",
color_text = "#FFFFFF",
include_missings = TRUE,
nrow = 1,
...
)
Arguments
df |
data.frame containing columns for x and y |
color_value_positive |
color used for upper limit of gradient (high positive correlation) |
color_value_negative |
color used for lower limit of gradient (high negative correlation) |
color_text |
string, hex value or color name used for text, best to pick high contrast with the gradient colors |
include_missings |
bool, whether to include the variables without correlation values in the plot |
nrow |
numeric, number of rows, either 1 or 2 |
... |
any arguments passed on to the underlying scoring functions, such as do_parallel and n_cores |
Value
a grob object, a grid with two ggplot2 heatmap visualizations
Examples
visualize_both(iris)
visualize_both(mtcars, do_parallel = TRUE, n_cores = 2)
Visualize the correlation matrix
Description
Visualize the correlation matrix
Usage
visualize_correlations(
df,
color_value_positive = "#08306B",
color_value_negative = "#8b0000",
color_text = "#FFFFFF",
include_missings = FALSE,
...
)
Arguments
df |
data.frame containing columns for x and y |
color_value_positive |
color used for upper limit of gradient (high positive correlation) |
color_value_negative |
color used for lower limit of gradient (high negative correlation) |
color_text |
color used for text, best to pick high contrast with the gradient colors |
include_missings |
bool, whether to include the variables without correlation values in the plot |
... |
arguments to pass to stats::cor() |
Value
a ggplot object, a heatmap visualization
Examples
visualize_correlations(iris)
Visualize the Predictive Power scores of the entire dataframe, or given a target
Description
If y is specified, visualize_pps returns a barplot of the PPS of every predictor on the specified target variable.
If y is not specified, visualize_pps returns a heatmap visualization of the PPS for all X-Y combinations in a dataframe.
Usage
visualize_pps(
df,
y = NULL,
color_value_high = "#08306B",
color_value_low = "#FFFFFF",
color_text = "#FFFFFF",
include_target = TRUE,
...
)
Arguments
df |
data.frame containing columns for x and y |
y |
string, column name of target variable, can be left NULL |
color_value_high |
string, hex value or color name used for upper limit of PPS gradient (high PPS) |
color_value_low |
string, hex value or color name used for lower limit of PPS gradient (low PPS) |
color_text |
string, hex value or color name used for text, best to pick high contrast with color_value_high |
include_target |
boolean, whether to include the target variable in the barplot |
... |
any arguments passed on to the underlying scoring functions, such as do_parallel and n_cores |
Value
a ggplot object, a vertical barplot or heatmap visualization
Examples
visualize_pps(iris, y = 'Species')
visualize_pps(iris)
visualize_pps(mtcars, do_parallel = TRUE, n_cores = 2)
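Because the return value is a ggplot object, further ggplot2 layers can be added to it in the usual way. The sketch below assumes that both ppsr and ggplot2 are installed (it is skipped otherwise); the plot title is an arbitrary illustration.

```r
# Assumes ppsr and ggplot2 are installed; the block is skipped otherwise.
p <- NULL
if (requireNamespace("ppsr", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE)) {
  # Barplot of every predictor's PPS for the target, with a custom title
  p <- ppsr::visualize_pps(iris, y = "Species") +
    ggplot2::labs(title = "Predictors of Species")
  print(p)
}
```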