Type: Package
Title: Memory-Based Learning in Spectral Chemometrics
Version: 2.2.4
Date: 2025-05-19
Maintainer: Leonardo Ramirez-Lopez <ramirez.lopez.leo@gmail.com>
BugReports: https://github.com/l-ramirez-lopez/resemble/issues
Description: Functions for dissimilarity analysis and memory-based learning (MBL, a.k.a local modeling) in complex spectral data sets. Most of these functions are based on the methods presented in Ramirez-Lopez et al. (2013) <doi:10.1016/j.geoderma.2012.12.014>.
License: MIT + file LICENSE
URL: http://l-ramirez-lopez.github.io/resemble/
Depends: R (≥ 3.5.0)
Imports: foreach, iterators, Rcpp (≥ 1.0.3), mathjaxr (≥ 1.0), magrittr (≥ 1.5.0), lifecycle (≥ 0.2.0), data.table (≥ 1.9.8)
Suggests: prospectr, parallel, doParallel, testthat, formatR, rmarkdown, bookdown, knitr
LinkingTo: Rcpp, RcppArmadillo
RdMacros: mathjaxr
VignetteBuilder: knitr
NeedsCompilation: yes
Repository: CRAN
RoxygenNote: 7.3.2
Encoding: UTF-8
Config/VersionName: olbap
Packaged: 2025-05-19 22:10:07 UTC; leo
Author: Leonardo Ramirez-Lopez
Date/Publication: 2025-05-19 22:30:02 UTC
Overview of the functions in the resemble package
Description
Functions for memory-based learning
Details
This is version 2.2.4 – olbap of the package. It implements a number of functions useful for modeling complex spectral data (e.g. NIR, IR). The package includes functions for dimensionality reduction, computing spectral dissimilarity matrices, nearest neighbor search, and modeling spectral data using memory-based learning. This package builds upon the methods presented in Ramirez-Lopez et al. (2013) doi:10.1016/j.geoderma.2012.12.014.
Development versions can be found in the GitHub repository of the package at https://github.com/l-ramirez-lopez/resemble.
The functions available for dimensionality reduction are: ortho_projection and related methods.
The functions available for computing dissimilarity matrices are: dissimilarity, f_diss, cor_diss, sid and ortho_diss.
The functions available for evaluating dissimilarity matrices are: sim_eval.
The functions available for nearest neighbor search: search_neighbors.
The functions available for modeling spectral data: mbl, supported by local_fit_pls, local_fit_wapls and local_fit_gpr.
Other supplementary functions: mbl_control and get_predictions.
Author(s)
Maintainer / Creator: Leonardo Ramirez-Lopez ramirez.lopez.leo@gmail.com
Authors:
Leonardo Ramirez-Lopez (ORCID)
Antoine Stevens (ORCID)
Claudio Orellano
Raphael Viscarra Rossel (ORCID)
Zefang Shen
Craig Lobsey (ORCID)
Alex Wadoux (ORCID)
References
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.
See Also
Useful links:
http://l-ramirez-lopez.github.io/resemble/
Report bugs at https://github.com/l-ramirez-lopez/resemble/issues
Print method for an object of class local_ortho_diss
Description
prints the subsets of local_ortho_diss objects
Usage
## S3 method for class 'local_ortho_diss'
x[rows, columns, drop = FALSE, ...]
Arguments
x |
an object of class local_ortho_diss. |
rows |
the indices of the rows |
columns |
the indices of the columns |
drop |
drop argument |
... |
not used |
checks the pc_selection argument
Description
internal
Usage
check_pc_arguments(
n_rows_x,
n_cols_x,
pc_selection,
default_max_comp = 40,
default_max_cumvar = 0.99,
default_max_var = 0.01
)
Correlation and moving correlation dissimilarity measurements (cor_diss)
Description
Computes correlation and moving correlation dissimilarity matrices.
Usage
cor_diss(Xr, Xu = NULL, ws = NULL,
center = TRUE, scale = FALSE)
Arguments
Xr |
a matrix. |
Xu |
an optional matrix containing data of a second set of observations. |
ws |
for moving correlation dissimilarity, an odd integer value which
specifies the window size. If NULL (default), no moving window is used and the (global) correlation dissimilarity is computed. |
center |
a logical indicating if the spectral data must be centered. |
scale |
a logical indicating if the spectral data must be scaled. |
Details
The correlation dissimilarity \(d\) between two observations \(x_i\) and \(x_j\) is based on the Pearson correlation coefficient (\(\rho\)) and it can be computed as follows:
\[d(x_i, x_j) = \frac{1}{2}(1 - \rho(x_i, x_j))\]
The above formula is used when ws = NULL.
On the other hand (when ws != NULL), the moving correlation dissimilarity between two observations \(x_i\) and \(x_j\) is computed by averaging the correlation dissimilarities obtained over a moving window of size \(ws\), which rolls sequentially from 1 up to \(p - ws\), where \(p\) is the number of variables of the observations.
The function does not accept input data containing missing values.
Value
a matrix of the computed dissimilarities.
Author(s)
Antoine Stevens and Leonardo Ramirez-Lopez
Examples
library(prospectr)
data(NIRsoil)
Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]
cor_diss(Xr = Xr)
cor_diss(Xr = Xr, Xu = Xu)
cor_diss(Xr = Xr, ws = 41)
cor_diss(Xr = Xr, Xu = Xu, ws = 41)
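# As a quick check (a sketch, not part of the package examples), the
# ws = NULL case above can be reproduced directly from the Pearson
# correlation between observations when no centering/scaling is applied:
d_manual <- 0.5 * (1 - cor(t(Xr)))
max(abs(d_manual - cor_diss(Xr = Xr, center = FALSE, scale = FALSE)))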
From dissimilarity matrix to neighbors
Description
internal
Usage
diss_to_neighbors(
diss_matrix,
k = NULL,
k_diss = NULL,
k_range = NULL,
spike = NULL,
return_dissimilarity = FALSE,
skip_first = FALSE
)
Arguments
diss_matrix |
a matrix representing the dissimilarities between
observations in a matrix |
k |
an integer value indicating the k-nearest neighbors of each
observation in |
k_diss |
an integer value indicating a dissimilarity threshold.
For each observation in |
k_range |
an integer vector of length 2 which specifies the minimum
(first value) and the maximum (second value) number of neighbors to be
retained when the |
spike |
a vector of integers indicating what observations in |
return_dissimilarity |
logical indicating if the input dissimilarity must be mirrored in the output. |
skip_first |
a logical indicating whether to skip the first neighbor or
not. Default is |
Dissimilarity computation between matrices
Description
This is a wrapper that integrates the different dissimilarity functions offered by the package. It computes the dissimilarities between observations in numerical matrices by using a specified dissimilarity measure.
Usage
dissimilarity(Xr, Xu = NULL,
diss_method = c("pca", "pca.nipals", "pls", "mpls",
"cor", "euclid", "cosine", "sid"),
Yr = NULL, gh = FALSE, pc_selection = list("var", 0.01),
return_projection = FALSE, ws = NULL,
center = TRUE, scale = FALSE, documentation = character(),
...)
Arguments
Xr |
a matrix containing the data. |
Xu |
an optional matrix containing data of a second set of observations
with |
diss_method |
a character string indicating the method to be used to compute the dissimilarities between observations. Options are:
|
Yr |
a numeric matrix of
|
gh |
a logical indicating if the Mahalanobis distance (in the pls score space) between each observation and the pls centre/mean must be computed. |
pc_selection |
a list of length 2 to be passed onto the
The default is Optionally, the |
return_projection |
a logical indicating if the projection(s) must be
returned. Projections are used if the |
ws |
an odd integer value which specifies the window size, when
|
center |
a logical indicating if |
scale |
a logical indicating if |
documentation |
an optional character string that can be used to
describe anything related to the |
... |
other arguments passed to the dissimilarity functions
( |
Details
This function is a wrapper for ortho_diss
, cor_diss
,
f_diss
, sid
. Check the documentation of these
functions for further details.
Value
A list with the following components:
dissimilarity: the resulting dissimilarity matrix.
projection: an ortho_projection object. Only output if return_projection = TRUE and if diss_method = "pca", diss_method = "pca.nipals", diss_method = "pls" or diss_method = "mpls". This object contains the projection used to compute the dissimilarity matrix. In case of local dissimilarity matrices, the projection corresponds to the global projection used to select the neighborhoods (see the ortho_diss function for further details).
gh: a list containing the GH distances as well as the pls projection used to compute the GH.
Author(s)
References
Shenk, J., Westerhaus, M., and Berzaghi, P. 1997. Investigation of a LOCAL calibration procedure for near infrared instruments. Journal of Near Infrared Spectroscopy, 5, 223-232.
Westerhaus, M. 2014. Eastern Analytical Symposium Award for outstanding achievements in near infrared spectroscopy: my contributions to near infrared spectroscopy. NIR news, 25(8), 16-20.
See Also
ortho_diss
cor_diss
f_diss
sid
.
Examples
library(prospectr)
data(NIRsoil)
# Filter the data using the first derivative with Savitzky and Golay
# smoothing filter and a window size of 11 spectral variables and a
# polynomial order of 4
sg <- savitzkyGolay(NIRsoil$spc, m = 1, p = 4, w = 15)
# Replace the original spectra with the filtered ones
NIRsoil$spc <- sg
Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)]
Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]
Xu <- Xu[!is.na(Yu), ]
Xr <- Xr[!is.na(Yr), ]
Yu <- Yu[!is.na(Yu)]
Yr <- Yr[!is.na(Yr)]
dsm_pca <- dissimilarity(
Xr = Xr, Xu = Xu,
diss_method = c("pca"),
Yr = Yr, gh = TRUE,
pc_selection = list("opc", 30),
return_projection = TRUE
)
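# The components described under 'Value' can then be inspected directly
# (continuing the example above):
dim(dsm_pca$dissimilarity)        # dissimilarities between Xr and Xu
dsm_pca$projection                # returned because return_projection = TRUE
str(dsm_pca$gh, max.level = 1)    # GH distances and the pls projection used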
A function for transforming a matrix from its Euclidean space to its Mahalanobis space
Description
For internal use only
Usage
euclid_to_mahal(X, sm_method = c("svd", "eigen"))
evaluation of multiple distances obtained with multiple PCs
Description
internal
Usage
eval_multi_pc_diss(
scores,
side_info,
from = 1,
to = ncol(scores),
steps = 1,
method = c("pc", "pls"),
check_dims = TRUE
)
Euclidean, Mahalanobis and cosine dissimilarity measurements
Description
This function is used to compute the dissimilarity between observations based on Euclidean or Mahalanobis distance measures or on cosine dissimilarity measures (a.k.a spectral angle mapper).
Usage
f_diss(Xr, Xu = NULL, diss_method = "euclid",
center = TRUE, scale = FALSE)
Arguments
Xr |
a matrix containing the (reference) data. |
Xu |
an optional matrix containing data of a second set of observations (samples). |
diss_method |
the method for computing the dissimilarity between
observations.
Options are |
center |
a logical indicating if the spectral data |
scale |
a logical indicating if |
Details
The results obtained for Euclidean dissimilarity are equivalent to those returned by the stats::dist() function, but are scaled differently. However, f_diss is considerably faster (which can be advantageous when computing dissimilarities for very large matrices). The final scaling of the dissimilarity scores in f_diss is done by using the number of variables to scale the squared dissimilarity scores. See the examples section for a comparison between stats::dist() and f_diss.
In the case of both the Euclidean and Mahalanobis distances, the scaled dissimilarity matrix \(D\) between observations in a given matrix \(X\) is computed as follows:
\[d(x_i, x_j)^{2} = \sum (x_i - x_j)M^{-1}(x_i - x_j)^{\mathrm{T}}\]
\[d_{scaled}(x_i, x_j) = \sqrt{\frac{1}{p}d(x_i, x_j)^{2}}\]
where \(p\) is the number of variables in \(X\), \(M\) is the identity matrix in the case of the Euclidean distance and the variance-covariance matrix of \(X\) in the case of the Mahalanobis distance. The Mahalanobis distance can also be viewed as the Euclidean distance after applying a linear transformation of the original variables. Such a linear transformation is done by using a factorization of the inverse covariance matrix as \(M^{-1} = W^{T}W\), where \(W\) is merely the square root of \(M^{-1}\) and can be found by using a singular value decomposition.
Note that when attempting to compute the Mahalanobis distance on a dataset with highly correlated variables (i.e. spectral variables) the variance-covariance matrix may result in a singular matrix which cannot be inverted and therefore the distance cannot be computed. This is also the case when the number of observations in the dataset is smaller than the number of variables.
For the computation of the Mahalanobis distance, the factorization method described above is used.
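As an illustration of this factorization (a base R sketch on small simulated data, not the package's internal routine):
# Mahalanobis distance as Euclidean distance after the linear transformation W,
# where M^-1 = W'W is obtained from a singular value decomposition
set.seed(1)
X <- matrix(rnorm(50 * 5), 50, 5)
m_inv <- solve(cov(X))
s <- svd(m_inv)
W <- diag(sqrt(s$d)) %*% t(s$v)            # so that t(W) %*% W = m_inv
Xw <- X %*% t(W)                           # transformed variables
d_euclid_w <- as.matrix(dist(Xw))[, 1]     # Euclidean distances to observation 1
d_mahal <- sqrt(mahalanobis(X, X[1, ], cov(X)))
max(abs(d_euclid_w - d_mahal))             # ~0 (up to numerical precision)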
The cosine dissimilarity \(c\) between two observations \(x_i\) and \(x_j\) is computed as follows:
\[c(x_i, x_j) = cos^{-1}{\frac{\sum_{k=1}^{p}x_{i,k} x_{j,k}}{\sqrt{\sum_{k=1}^{p} x_{i,k}^{2}} \sqrt{\sum_{k=1}^{p} x_{j,k}^{2}}}}\]where \(p\) is the number of variables of the observations.
The function does not accept input data containing missing values.
NOTE: The computed distances are divided by the number of variables/columns
in Xr
.
Value
a matrix of the computed dissimilarities.
Author(s)
Leonardo Ramirez-Lopez and Antoine Stevens
Examples
library(prospectr)
data(NIRsoil)
Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]
# Euclidean distances between all the observations in Xr
ed <- f_diss(Xr = Xr, diss_method = "euclid")
# Equivalence with the dist() function of base R
ed_dist <- (as.matrix(dist(Xr))^2 / ncol(Xr))^0.5
round(ed_dist - ed, 5)
# Comparing the computational time
iter <- 20
tm <- proc.time()
for (i in 1:iter) {
f_diss(Xr)
}
f_diss_time <- proc.time() - tm
tm_2 <- proc.time()
for (i in 1:iter) {
dist(Xr)
}
dist_time <- proc.time() - tm_2
f_diss_time
dist_time
# Euclidean distances between observations in Xr and observations in Xu
ed_xr_xu <- f_diss(Xr, Xu)
# Mahalanobis distance computed on the first 20 spectral variables
md_xr_xu <- f_diss(Xr[, 1:20], Xu[, 1:20], "mahalanobis")
# Cosine dissimilarity matrix
cdiss_xr_xu <- f_diss(Xr, Xu, "cosine")
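# The cosine dissimilarity formula from the Details section, written out
# directly for a pair of observations (a sketch of the equation itself,
# not the package's internal code)
cosine_diss <- function(x_i, x_j) {
  acos(sum(x_i * x_j) / (sqrt(sum(x_i^2)) * sqrt(sum(x_j^2))))
}
cosine_diss(Xr[1, ], Xr[2, ])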
A fast distance algorithm for two matrices written in C++
Description
Computes distances between two data matrices using "euclid", "cor", "cosine"
Usage
fast_diss(X, Y, method)
Arguments
X |
a matrix |
Y |
a matrix |
method |
a |
Value
a distance matrix
Author(s)
Antoine Stevens and Leonardo Ramirez-Lopez
A fast algorithm of (squared) Euclidean cross-distance for vectors written in C++
Description
A fast (parallel for linux) algorithm of (squared) Euclidean cross-distance for vectors written in C++
Usage
fast_diss_vector(X)
Arguments
X |
a vector. |
Details
used internally in ortho_projection
Value
a vector of distance (lower triangle of the distance matrix, stored by column)
Author(s)
Antoine Stevens
Local multivariate regression
Description
internal
Usage
fit_and_predict(
x,
y,
pred_method,
scale = FALSE,
weights = NULL,
newdata,
pls_c = NULL,
CV = FALSE,
tune = FALSE,
number = 10,
p = 0.75,
group = NULL,
noise_variance = 0.001,
range_prediction_limits = TRUE,
pls_max_iter = 1,
pls_tol = 1e-06,
modified = FALSE,
seed = NULL
)
format internal messages
Description
internal
Usage
format_xr_xu_indices(xr_xu_names)
Arguments
xr_xu_names |
the names of Xr and Xu |
Cross validation for Gaussian process regression
Description
internal
Usage
gaussian_pr_cv(
x,
y,
scale,
weights = NULL,
p = 0.75,
number = 10,
group = NULL,
noise_variance = 0.001,
retrieve = c("final_model", "none"),
seed = NULL
)
Gaussian process regression with linear kernel (gaussian_process)
Description
Carries out a gaussian process regression with a linear kernel (dot product). For internal use only!
Usage
gaussian_process(X, Y, noisev, scale)
Arguments
X |
a matrix of predictor variables |
Y |
a matrix with a single response variable |
noisev |
a value indicating the variance of the noise for Gaussian process regression. Default is 0.001. |
scale |
a logical indicating whether both the predictors and the response variable must be scaled to zero mean and unit variance. |
Value
a list containing the following elements:
b: the regression coefficients.
Xz: the (final transformed) matrix of predictor variables.
alpha: the alpha matrix.
is.scaled: logical indicating whether both the predictors and response variable were scaled to zero mean and unit variance.
Xcenter: if the predictors were scaled, the centering vector used for X.
Xscale: if the predictors were scaled, the scaling vector used for X.
Ycenter: if the data were scaled, the centering vector used for Y.
Yscale: if the data were scaled, the scaling vector used for Y.
Author(s)
Leonardo Ramirez-Lopez
Internal Cpp function for performing leave-group-out cross validations for gaussian process
Description
For internal use only!.
Usage
gaussian_process_cv(X, Y, mindices, pindices, noisev = 0.001,
scale = TRUE, statistics = TRUE)
Arguments
X |
a matrix of predictor variables. |
Y |
a matrix of a single response variable. |
mindices |
a matrix with |
pindices |
a matrix with |
scale |
a logical indicating whether both the predictors and the response variable must be scaled to zero mean and unit variance. |
statistics |
a logical value indicating whether the precision and accuracy statistics are to be returned, otherwise the predictions for each validation segment are retrieved. |
Value
a list containing the following one-row matrices:
rmse.seg: the RMSEs.
st.rmse.seg: the standardized RMSEs.
rsq.seg: the coefficients of determination.
Author(s)
Leonardo Ramirez-Lopez
Function for identifying the column in a matrix with the largest standard deviation
Description
Identifies the column with the largest standard deviation. For internal use only!
Usage
get_col_largest_sd(X)
Arguments
X |
a matrix. |
Value
a value indicating the index of the column with the largest standard deviation.
Author(s)
Leonardo Ramirez-Lopez
Standard deviation of columns
Description
For internal use only!
Usage
get_col_sds(x)
Function for computing the mean of each column in a matrix
Description
Computes the mean of each column in a matrix. For internal use only!
Usage
get_column_means(X)
Arguments
X |
a matrix. |
Value
a vector of mean values.
Author(s)
Leonardo Ramirez-Lopez
Function for computing the standard deviation of each column in a matrix
Description
Computes the standard deviation of each column in a matrix. For internal use only!
Usage
get_column_sds(X)
Arguments
X |
a matrix. |
Value
a vector of standard deviation values.
Author(s)
Leonardo Ramirez-Lopez
Function for computing sum of each column in a matrix
Description
Computes the sum of each column in a matrix. For internal use only!
Usage
get_column_sums(X)
Arguments
X |
a matrix. |
Value
a vector of column sums.
Author(s)
Leonardo Ramirez-Lopez
get the evaluation results for categorical data
Description
internal
Usage
get_eval_categorical(y, indices_closest)
get the evaluation results for continuous data
Description
internal
Usage
get_eval_continuous(y, indices_closest)
A function to obtain the local neighbors based on dissimilarity matrices from orthogonal projections.
Description
internal function. This function is used to obtain the local neighbors based on dissimilarity matrices from orthogonal projections. These neighbors are obtained from an orthogonal projection on a set of precomputed neighbors. This function is used internally by the mbl function. ortho_diss(..., .local = TRUE) operates in the same way; however, for mbl it is more efficient to do the re-search of the neighbors inside its main for loop.
Usage
get_ith_local_neighbors(
ith_xr,
ith_xu,
ith_yr,
ith_yu = NULL,
diss_usage = "none",
ith_neig_indices,
k = NULL,
k_diss = NULL,
k_range = NULL,
spike = NULL,
diss_method,
pc_selection,
ith_group = NULL,
center,
scale,
...
)
Arguments
ith_xr |
the set of neighbors of a Xu observation found in Xr |
ith_xu |
the Xu observation |
ith_yr |
the response values of the set of neighbors of the Xu observation found in Xr |
ith_yu |
the response value of the xu observation |
diss_usage |
a character string indicating if the dissimilarity data will be used as predictors ("predictors") or not ("none"). |
ith_neig_indices |
a vector of the original indices of the Xr neighbors. |
k |
the number of nearest neighbors to select from the already identified neighbors |
k_diss |
the distance threshold to select the neighbors from the already identified neighbors |
k_range |
a min and max number of allowed neighbors when |
spike |
a vector with the indices of the observations forced to be
retained as neighbors. They have to be present in all the neighborhoods and
at the top of |
diss_method |
the ortho_diss() method |
pc_selection |
the pc_selection argument as in ortho_diss() |
ith_group |
the vector containing the group labels of |
center |
center the data in the local diss computation? |
scale |
scale the data in the local diss computation? |
Value
a list:
ith_xr: the new Xr data of the neighbors for the ith observation (if
diss_usage = "predictors"
, this data is combined with the local dissmilarity scores of the neighbors of Xu)ith_yr: the new Yr data of the neighbors for the ith observation
ith_xu: the ith Xu observation (if
diss_usage = "predictors"
, this data is combined with the local dissmilarity scores to its Xr neighborsith_yu: the ith Yu observation
ith_neigh_diss: the new dissimilarity scores of the neighbors for the ith observation
ith_group: the group labels for the new ith_xr
n_k: the number of neighbors
ith_components: the number of components used
Author(s)
Leonardo Ramirez-Lopez
Internal Cpp function for computing the weights of the PLS components necessary for weighted average PLS
Description
For internal use only!.
Usage
get_local_pls_weights(projection_mat,
xloadings,
coefficients,
new_x,
min_component,
max_component,
scale,
Xcenter,
Xscale)
Arguments
projection_mat |
the projection matrix generated either by the |
xloadings |
. |
coefficients |
the matrix of regression coefficients. |
new_x |
a matrix of one new spectra to be predicted. |
min_component |
an integer indicating the minimum number of pls components. |
max_component |
an integer indicating the maximum number of pls components. |
scale |
a logical indicating whether the matrix of predictors used to create the regression model was scaled. |
Xcenter |
a matrix of one row with the values that must be used for centering |
Xscale |
if |
Value
a matrix of one row with the weights for each component between the max. and min. specified.
Author(s)
Leonardo Ramirez-Lopez
A function to get the neighbor information
Description
This function gathers information on all neighborhoods of the Xu observations found in Xr. This information is required during local regressions.
Usage
get_neighbor_info(
Xr,
Xu,
diss_method,
Yr = NULL,
k = NULL,
k_diss = NULL,
k_range = NULL,
spike = NULL,
pc_selection,
return_dissimilarity,
center,
scale,
gh,
diss_usage,
allow_parallel = FALSE,
...
)
Details
For local pca and pls distances, the local dissimilarity matrices are not computed here, as it is cheaper to compute them during the local regressions. Instead, the global distances (required for the later computation of the local dissimilarity matrices) are output.
Extract predictions from an object of class mbl
Description
Extract predictions from an object of class mbl
Usage
get_predictions(object)
Arguments
object |
an object of class |
Value
a data.table of predicted values according to either k
or k_dist
Author(s)
Leonardo Ramirez-Lopez and Antoine Stevens
See Also
A function to assign values to sample distribution strata
Description
for internal use only! This function takes a continuous variable, creates n strata based on its distribution and assigns the corresponding stratum to every value.
Usage
get_sample_strata(y, n = NULL, probs = NULL)
Arguments
y |
a matrix of one column with the response variable. |
n |
the number of strata. |
Value
a data table with the input y
and the corresponding strata to
every value.
A function for stratified calibration/validation sampling
Description
for internal use only! This function selects samples based on provided strata.
Usage
get_samples_from_strata(
y,
original_order,
strata,
samples_per_strata,
sampling_for = c("calibration", "validation"),
replacement = FALSE
)
Arguments
original_order |
a matrix of one column with the response variable. |
strata |
the number of strata. |
sampling_for |
sampling to select the calibration samples ("calibration") or sampling to select the validation samples ("validation"). |
replacement |
logical indicating if sampling with replacement must be done. |
Value
a list with the indices of the calibration and validation samples.
Internal function for computing the weights of the PLS components necessary for weighted average PLS
Description
internal
Usage
get_wapls_weights(pls_model, original_x, type = "w1", new_x = NULL, pls_c)
Arguments
pls_model |
either an object returned by the |
original_x |
the original spectral matrix which was used for calibrating the pls model. |
type |
type of weight to be computed. The only available option (for
the moment) is |
new_x |
a vector of a new spectral observation. When "w1" is selected, new_x must be specified. |
pls_c |
a vector of length 2 which contains both the minimum and maximum number of PLS components for which the weights must be computed. |
Value
get_wapls_weights
returns a vector of weights for each PLS
component specified
Author(s)
Leonardo Ramirez-Lopez and Antoine Stevens
Computes the weights for pls regressions
Description
This is an internal function that computes the weights required for obtaining each vector of pls scores. Implementation is done in C++ for improved performance.
Usage
get_weights(X, Y, algorithm = "pls", xls_min_w = 3L, xls_max_w = 15L)
Arguments
X |
a numeric matrix of spectral data. |
Y |
a matrix of one column with the response variable. |
algorithm |
a character string indicating what method to use. Options are:
|
xls_min_w |
an integer indicating the minimum window size for the "xls"
method. Only used if |
xls_max_w |
an integer indicating the maximum window size for the "xls"
method. Only used if |
Value
a matrix
of one column containing the weights.
Author(s)
Leonardo Ramirez-Lopez and Claudio Orellano
References
Shenk, J. S., & Westerhaus, M. O. (1991). Populations structuring of near infrared spectra and modified partial least squares regression. Crop Science, 31(6), 1548-1555.
Westerhaus, M. (2014). Eastern Analytical Symposium Award for outstanding achievements in near infrared spectroscopy: my contributions to near infrared spectroscopy. NIR news, 25(8), 16-20.
An iterator for local prediction data in mbl
Description
internal function. It collects only the data necessary to execute a local prediction for the mbl function based on a list of neighbors. Not valid for local dissimilarity (e.g. for ortho_diss(..., .local = TRUE)).
Usage
ith_mbl_neighbor(
Xr,
Xu = NULL,
Yr,
Yu = NULL,
diss_usage = "none",
neighbor_indices,
neighbor_diss = NULL,
diss_xr_xr = NULL,
group = NULL
)
Arguments
Xr |
the Xr matrix in mbl. |
Xu |
the Xu matrix in mbl. Default |
Yr |
the Yr matrix in mbl. |
Yu |
the Yu matrix in mbl. Default |
diss_usage |
a character string indicating if the dissimilarity data will be used as predictors ("predictors") or not ("none"). |
neighbor_indices |
a matrix with the indices of neighbors of every Xu found in Xr. |
neighbor_diss |
a matrix with the dissimilarity scores for the neighbors
of every Xu found in Xr. This matrix is organized in the same way as
|
diss_xr_xr |
a dissimilarity matrix between samples in Xr. |
group |
a factor representing the group labels of Xr. |
Details
isubset looks at the order of the k nearest neighbors in each column of D and re-organizes the rows of x accordingly.
Value
an object of class
iterator giving the following list:
ith_xr: the Xr data of the neighbors for the ith observation (if
diss_usage = "predictors"
, this data is combined with the local dissmilarity scores of the neighbors of Xu (or Xr if Xu was not provided))ith_yr: the Yr data of the neighbors for the ith observation
ith_xu: the ith Xu observation (or Xr if Xu was not provided). If
diss_usage = "predictors"
, this data is combined with the local dissmilarity scores to its Xr neighbors.ith_yu: the ith Yu observation (or Yr observation if Xu was not provided).
ith_neigh_diss: the dissimilarity scores of the neighbors for the ith observation.
ith_group: the group labels for ith_xr.
n_k: the number of neighbors.
Author(s)
Leonardo Ramirez-Lopez
iterator for nearest neighbor subsets
Description
internal
Usage
ith_subsets_ortho_diss(x, xu = NULL, y, kindx, na_rm = FALSE)
Arguments
x |
a reference matrix |
xu |
a second matrix |
y |
a matrix of side information |
kindx |
a matrix of nearest neighbor indices |
na_rm |
logical indicating whether NAs must be removed. |
Local fit functions
Description
These functions define the way in which each local fit/prediction is done
within each iteration in the mbl
function.
Usage
local_fit_pls(pls_c, modified = FALSE, max_iter = 100, tol = 1e-6)
local_fit_wapls(min_pls_c, max_pls_c, modified = FALSE,
max_iter = 100, tol = 1e-6)
local_fit_gpr(noise_variance = 0.001)
Arguments
pls_c |
an integer indicating the number of pls components to be used in
the local regressions when the partial least squares ( |
modified |
a logical indicating whether the modified version of the pls
algorithm (Shenk and Westerhaus, 1991 and Westerhaus, 2014). Default is
|
max_iter |
an integer indicating the maximum number of iterations in
case |
tol |
a numeric value indicating the convergence for calculating the scores. Default is 1-e6. |
min_pls_c |
an integer indicating the minimum number of pls components
to be used in the local regressions when the weighted average partial least
squares ( |
max_pls_c |
integer indicating the maximum number of pls components
to be used in the local regressions when the weighted average partial least
squares ( |
noise_variance |
a numeric value indicating the variance of the noise
for Gaussian process local regressions ( |
Details
These functions are used to indicate how to fit
the regression models within the mbl
function.
There are three possible options for performing these regressions:
Partial least squares (pls, local_fit_pls): it uses the orthogonal scores (non-linear iterative partial least squares, nipals) algorithm. The only parameter which needs to be optimized is the number of pls components.
Weighted average pls (local_fit_wapls): this method was developed by Shenk et al. (1997) and is used as the regression method in the widely known LOCAL algorithm. It uses multiple models generated by multiple pls components (i.e. between a minimum and a maximum number of pls components). At each local partition the final predicted value is an ensemble (weighted average) of all the predicted values generated by the multiple pls models. The weight for each component is calculated as follows:
\[w_{j} = \frac{1}{s_{1:j}\times g_{j}}\]
where \(s_{1:j}\) is the root mean square of the spectral reconstruction error of the unknown (or target) observation(s) when a total of \(j\) pls components are used and \(g_{j}\) is the root mean square of the squared regression coefficients corresponding to the \(j\)th pls component (see Shenk et al., 1997 for more details).
Gaussian process with dot product covariance (local_fit_gpr): Gaussian process regression is a probabilistic and non-parametric Bayesian method. It is commonly described as a collection of random variables which have a joint Gaussian distribution and it is characterized by both a mean and a covariance function (Rasmussen and Williams, 2006). The covariance function used in the implemented method is the dot product. The only parameter to be taken into account in this method is the noise. In this method, the prediction of the response variable of a new observation (\(y_u\)) from its predictor variables (\(x_u\)) is carried out by first computing a prediction vector (\(A\)). It is derived from a set of reference/training observations containing both a response vector (\(Y\)) and predictors (\(X\)) as follows:
\[A = (X X^{T} + \sigma^2 I)^{-1} Y\]
where \(\sigma^2\) denotes the variance of the noise and \(I\) the identity matrix (with dimensions equal to the number of observations in \(X\)). The prediction of \(y_{u}\) is then done as follows:
\[\hat{y}_{u} = (x_{u} X^{T}) A\]
(a minimal R sketch of this prediction step is given below).
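The following sketch reproduces the dot-product Gaussian process prediction above in base R (names are illustrative only; centering/scaling is omitted; this is not the package's internal implementation):
# X: n x p reference predictors, Y: n x 1 response, x_u: one new observation (1 x p)
gp_dot_predict <- function(X, Y, x_u, noise_variance = 0.001) {
  n <- nrow(X)
  A <- solve(tcrossprod(X) + noise_variance * diag(n), Y)  # A = (XX' + sigma^2 I)^-1 Y
  drop(x_u %*% t(X) %*% A)                                 # y_hat = (x_u X') A
}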
The modified argument in the pls methods (local_fit_pls() and local_fit_wapls()) is used to indicate whether a modified version of the pls algorithm (modified pls or mpls) is to be used. The modified pls was proposed by Shenk and Westerhaus (1991, see also Westerhaus, 2014) and it differs from the standard pls method in the way the weights of the predictors (used to compute the matrix of scores) are obtained. While pls uses the covariance between response(s) and predictors (and later their deflated versions corresponding to each pls component iteration) to obtain these weights, the modified pls uses the correlation as weights. The authors indicate that by using correlation, a larger portion of the response variable(s) can be explained.
Value
An object of class local_fit
mirroring the input arguments.
Author(s)
References
Shenk, J. S., & Westerhaus, M. O. 1991. Populations structuring of near infrared spectra and modified partial least squares regression. Crop Science, 31(6), 1548-1555.
Shenk, J., Westerhaus, M., and Berzaghi, P. 1997. Investigation of a LOCAL calibration procedure for near infrared instruments. Journal of Near Infrared Spectroscopy, 5, 223-232.
Rasmussen, C.E., Williams, C.K. Gaussian Processes for Machine Learning. Massachusetts Institute of Technology: MIT-Press, 2006.
Westerhaus, M. 2014. Eastern Analytical Symposium Award for outstanding achievements in near infrared spectroscopy: my contributions to near infrared spectroscopy. NIR news, 25(8), 16-20.
See Also
Examples
local_fit_wapls(min_pls_c = 3, max_pls_c = 12)
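# The other local fit functions are constructed in the same way, for example:
local_fit_pls(pls_c = 5)
local_fit_gpr(noise_variance = 0.001)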
local ortho dissimilarity matrices initialized by a global dissimilarity matrix
Description
internal
Usage
local_ortho_diss(
k_index_matrix,
Xr,
Yr,
Xu,
diss_method,
pc_selection,
center,
scale,
allow_parallel,
...
)
Arguments
k_index_matrix |
a matrix of nearest neighnbor indices |
Xr |
argument passed to ortho_projection |
Yr |
argument passed to ortho_projection |
Xu |
argument passed to ortho_projection |
diss_method |
argument passed to ortho_projection |
pc_selection |
argument passed to ortho_projection |
center |
argument passed to ortho_projection |
scale |
argument passed to ortho_projection |
A function for memory-based learning (mbl)
Description
This function is implemented for memory-based learning (a.k.a. instance-based learning or local regression) which is a non-linear lazy learning approach for predicting a given response variable from a set of predictor variables. For each observation in a prediction set, a specific local regression is carried out based on a subset of similar observations (nearest neighbors) selected from a reference set. The local model is then used to predict the response value of the target (prediction) observation. Therefore this function does not yield a global regression model.
Usage
mbl(Xr, Yr, Xu, Yu = NULL, k, k_diss, k_range, spike = NULL,
method = local_fit_wapls(min_pls_c = 3, max_pls_c = min(dim(Xr), 15)),
diss_method = "pca", diss_usage = "predictors", gh = TRUE,
pc_selection = list(method = "opc", value = min(dim(Xr), 40)),
control = mbl_control(), group = NULL, center = TRUE, scale = FALSE,
verbose = TRUE, documentation = character(), seed = NULL, ...)
Arguments
Xr |
a matrix of predictor variables of the reference data (observations in rows and variables in columns). |
Yr |
a numeric matrix of one column containing the values of the response variable corresponding to the reference data. |
Xu |
a matrix of predictor variables of the data to be predicted (observations in rows and variables in columns). |
Yu |
an optional matrix of one column containing the values of the
response variable corresponding to the data to be predicted. Default is
|
k |
a vector of integers specifying the sequence of k-nearest
neighbors to be tested. Either |
k_diss |
a numeric vector specifying the sequence of dissimilarity
thresholds to be tested for the selection of the nearest neighbors found in
|
k_range |
an integer vector of length 2 which specifies the minimum
(first value) and the maximum (second value) number of neighbors to be
retained when the |
spike |
an integer vector (with positive and/or negative values) indicating
the indices of observations in |
method |
an object of class |
diss_method |
a character string indicating the spectral dissimilarity metric to be used in the selection of the nearest neighbors of each observation. Options are:
Alternatively, a matrix of dissimilarities can also be passed to this
argument. This matrix is supposed to be a user-defined matrix
representing the dissimilarities between observations in |
diss_usage |
a character string specifying how the dissimilarity
information shall be used. The possible options are: |
gh |
a logical indicating if the global Mahalanobis distance (in the pls
score space) between each observation and the pls mean (centre) must be
computed. This metric is known as the GH distance in the literature. Note
that this computation is based on the number of pls components determined by
using the |
pc_selection |
a list of length 2 used for the computation of GH (if
The list
|
control |
a list created with the |
group |
an optional factor (or character vector
that can be coerced to |
center |
a logical if the predictor variables must be centred at each
local segment (before regression). In addition, if |
scale |
a logical indicating if the predictor variables must be scaled
to unit variance at each local segment (before regression). In addition, if
|
verbose |
a logical indicating whether or not to print a progress bar
for each observation to be predicted. Default is |
documentation |
an optional character string that can be used to
describe anything related to the |
seed |
an integer value containing the random number generator (RNG)
state for random number generation. This argument can be used for
reproducibility purposes (for random sampling) in the cross-validation
results. Default is |
... |
further arguments to be passed to the |
Details
The argument spike
can be used to indicate what reference observations
in Xr
must be kept in the neighborhood of every single Xu
observation. If a vector of length \(m\) is passed to this argument,
this means that the \(m\) original neighbors with the largest
dissimilarities to the target observations will be forced out of the
neighborhood. Spiking might be useful in cases where
some reference observations are known to be somehow related to the ones in
Xu
and therefore might be relevant for fitting the local models. See
Guerrero et al. (2010) for an example on the benefits of spiking.
The mbl
function uses the dissimilarity
function to
compute the dissimilarities between Xr
and Xu
. The dissimilarity
method to be used is specified in the diss_method
argument.
Arguments to dissimilarity
as well as further arguments to the
functions used inside dissimilarity
(i.e. ortho_diss
cor_diss
f_diss
sid
) can be passed to those functions by using ...
.
The diss_usage
argument is used to specify whether the dissimilarity
information must be used within the local regressions and, if so, how.
When diss_usage = "predictors"
the local (square symmetric)
dissimilarity matrix corresponding the selected neighborhood is used as
source of additional predictors (i.e the columns of this local matrix are
treated as predictor variables). In some cases this results in an improvement
of the prediction performance (Ramirez-Lopez et al., 2013a).
If diss_usage = "weights"
, the neighbors of the query point
(\(xu_{j}\)) are weighted according to their dissimilarity to
\(xu_{j}\) before carrying out each local regression. The following
tricubic function (Cleveland and Devlin, 1988; Naes et al., 1990) is used for
computing the final weights based on the measured dissimilarities:
\[W_{j} = (1 - v_{j}^{3})^{3}\]
where, if \(xr_{i} \in\) neighbors of \(xu_{j}\):
\[v_{j}(xu_{j}) = d(xr_{i}, xu_{j})\]
otherwise:
\[v_{j}(xu_{j}) = 0\]
In the above formulas \(d(xr_{i}, xu_{j})\) represents the dissimilarity between the query point and each object in \(Xr\).
When diss_usage = "none"
is chosen the dissimilarity information is
not used.
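As an illustration of the tricubic weighting described above (a sketch only; the exact internal scaling of the dissimilarities applied by mbl is not reproduced here, the dissimilarities are simply rescaled to the neighborhood maximum as an assumption):
# Tricubic weights from neighbor dissimilarities (illustrative sketch)
tricube_weights <- function(d) {
  v <- d / max(d)   # assumed rescaling to [0, 1] within the neighborhood
  (1 - v^3)^3
}
round(tricube_weights(c(0.05, 0.2, 0.5, 0.8)), 3)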
The global Mahalanobis distance (a.k.a GH) is computed based on the scores
of a pls projection. A pls projection model is built using Yr and Xr, and this model is used to obtain the pls scores of the Xu
observations. The Mahalanobis distance between each Xu
observation in
(the pls space) and the centre of Xr
is then computed. The number of
pls components is optimized based on the parameters passed to the
pc_selection
argument. In addition, the mbl
function also
reports the GH distance for the observations in Xr
.
Some aspects of the mbl process, such as the type of internal validation,
parameter tuning, what extra objects to return, permission for parallel
execution, prediction limits, etc, can be specified by using the
mbl_control
function.
By using the group
argument one can specify groups of observations
that have something in common (e.g. observations with very similar origin).
The purpose of group
is to avoid biased cross-validation results due
to pseudo-replication. This argument allows to select calibration points
that are independent from the validation ones. In this regard, when
validation_type = "local_cv"
(used in mbl_control
function), then the p
argument refers to the percentage of groups of
observations (rather than single observations) to be retained in each
sampling iteration at each local segment.
Value
a list
of class mbl
with the following components
(sorted either by k
or k_diss
):
call
: the call to mbl.cntrl_param
: the list with the control parameters passed to control.Xu_neighbors
: a list containing two elements: a matrix ofXr
indices corresponding to the neighbors ofXu
and a matrix of dissimilarities between eachXu
observation and its corresponding neighbor inXr
.dissimilarities
: a list with the method used to obtain the dissimilarity matrices and the dissimilarity matrix corresponding to \(D(Xr, Xu)\). This object is returned only if thereturn_dissimilarity
argument in thecontrol
list was set toTRUE
.n_predictions
: the total number of observations predicted.gh
: ifgh = TRUE
, a list containing the global Mahalanobis distance values for the observations inXr
andXu
as well as the results of the global pls projection object used to obtain the GH values.validation_results
: a list of validation results for "local cross validation" (returned if thevalidation_type
incontrol
list was set to"local_cv"
), "nearest neighbor validation" (returned if thevalidation_type
incontrol
list was set to"NNv"
) and "Yu prediction statistics" (returned ifYu
was supplied).“results
: a list of data tables containing the results of the predictions for each eitherk
ork_diss
. Each data table contains the following columns:o_index
: The index of the predicted observation.k_diss
: This column is only output if thek_diss
argument is used. It indicates the corresponding dissimilarity threshold for selecting the neighbors.k_original
: This column is only output if thek_diss
argument is used. It indicates the number of neighbors that were originally found when the given dissimilarity threshold is used.k
: This column indicates the final number of neighbors used.npls
: This column is only output if thepls
regression method was used. It indicates the final number of pls components used.min_pls
: This column is only output ifwapls
regression method was used. It indicates the final number of minimum pls components used. If no optimization was set, it retrieves the original minimum pls components passed to themethod
argument.max_pls
: This column is only output if thewapls
regression method was used. It indicates the final number of maximum pls components used. If no optimization was set, it retrieves the original maximum pls components passed to themethod
argument.yu_obs
: The input values given inYu
(the response variable corresponding to the data to be predicted). IfYu = NULL
, thenNA
s are retrieved.pred
: The predicted Yu values.yr_min_obs
: The minimum reference value (of the response variable) in the neighborhood.yr_max_obs
: The maximum reference value (of the response variable) in the neighborhood.index_nearest_in_Xr
: The index of the nearest neighbor found inXr
.index_farthest_in_Xr
: The index of the farthest neighbor found inXr
.y_nearest
: The reference value (Yr
) corresponding to the nearest neighbor found inXr
.y_nearest_pred
: This column is only output if the validation method in the object passed tocontrol
was set to"NNv"
. It represents the predicted value of the nearest neighbor observation found inXr
. This prediction come from model fitted with the remaining observations in the neighborhood of the target observation inXu
.loc_rmse_cv
: This column is only output if the validation method in the object passed tocontrol
was set to'local_cv'
. It represents the RMSE of the cross-validation computed for the neighborhood of the target observation inXu
.loc_st_rmse_cv
: This column is only output if the validation method in the object passed tocontrol
was set to'local_cv'
. It represents the standardized RMSE of the cross-validation computed for the neighborhood of the target observation inXu
.dist_nearest
: The distance to the nearest neighbor.dist_farthest
: The distance to the farthest neighbor.loc_n_components
: This column is only output if the dissimilarity method used is one of"pca"
,"pca.nipals"
or"pls"
and in addition the dissimilarities are requested to be computed locally by passing.local = TRUE
to thembl
function. See.local
argument in theortho_diss
function.
seed
: a value mirroring the one passed to seed.documentation
: a character string mirroring the one provided in thedocumentation
argument.
When the k_diss
argument is used, the printed results show a table
with a column named p_bounded. It represents the percentage of
observations for which the neighbors selected by the given dissimilarity
threshold were outside the boundaries specified in the k_range
argument.
Author(s)
Leonardo Ramirez-Lopez and Antoine Stevens
References
Cleveland, W. S., and Devlin, S. J. 1988. Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American Statistical Association, 83, 596-610.
Guerrero, C., Zornoza, R., Gómez, I., Mataix-Beneyto, J. 2010. Spiking of NIR regional models using observations from target sites: Effect of model size on prediction accuracy. Geoderma, 158(1-2), 66-77.
Naes, T., Isaksson, T., Kowalski, B. 1990. Locally weighted regression and scatter correction for near-infrared reflectance data. Analytical Chemistry 62, 664-673.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.
Rasmussen, C.E., Williams, C.K. Gaussian Processes for Machine Learning. Massachusetts Institute of Technology: MIT-Press, 2006.
Shenk, J., Westerhaus, M., and Berzaghi, P. 1997. Investigation of a LOCAL calibration procedure for near infrared instruments. Journal of Near Infrared Spectroscopy, 5, 223-232.
See Also
mbl_control
, f_diss
,
cor_diss
, sid
, ortho_diss
,
search_neighbors
, local_fit
Examples
library(prospectr)
data(NIRsoil)
# Preprocess the data using detrend plus first derivative with Savitzky and
# Golay smoothing filter
sg_det <- savitzkyGolay(
detrend(NIRsoil$spc,
wav = as.numeric(colnames(NIRsoil$spc))
),
m = 1,
p = 1,
w = 7
)
NIRsoil$spc_pr <- sg_det
# split into training and testing sets
test_x <- NIRsoil$spc_pr[NIRsoil$train == 0 & !is.na(NIRsoil$CEC), ]
test_y <- NIRsoil$CEC[NIRsoil$train == 0 & !is.na(NIRsoil$CEC)]
train_y <- NIRsoil$CEC[NIRsoil$train == 1 & !is.na(NIRsoil$CEC)]
train_x <- NIRsoil$spc_pr[NIRsoil$train == 1 & !is.na(NIRsoil$CEC), ]
# Example 1
# A mbl implemented in Ramirez-Lopez et al. (2013,
# the spectrum-based learner)
# Example 1.1
# An example where Yu is supposed to be unknown, but the Xu
# (spectral variables) are known
my_control <- mbl_control(validation_type = "NNv")
## The neighborhood sizes to test
ks <- seq(40, 140, by = 20)
sbl <- mbl(
Xr = train_x,
Yr = train_y,
Xu = test_x,
k = ks,
method = local_fit_gpr(),
control = my_control,
scale = TRUE
)
sbl
plot(sbl)
get_predictions(sbl)
# Example 1.2
# If Yu is actually known...
sbl_2 <- mbl(
Xr = train_x,
Yr = train_y,
Xu = test_x,
Yu = test_y,
k = ks,
method = local_fit_gpr(),
control = my_control
)
sbl_2
plot(sbl_2)
# Example 2
# the LOCAL algorithm (Shenk et al., 1997)
local_algorithm <- mbl(
Xr = train_x,
Yr = train_y,
Xu = test_x,
Yu = test_y,
k = ks,
method = local_fit_wapls(min_pls_c = 3, max_pls_c = 15),
diss_method = "cor",
diss_usage = "none",
control = my_control
)
local_algorithm
plot(local_algorithm)
# Example 3
# A variation of the LOCAL algorithm (using the optimized pc
# dissmilarity matrix) and dissimilarity matrix as source of
# additional preditors
local_algorithm_2 <- mbl(
Xr = train_x,
Yr = train_y,
Xu = test_x,
Yu = test_y,
k = ks,
method = local_fit_wapls(min_pls_c = 3, max_pls_c = 15),
diss_method = "pca",
diss_usage = "predictors",
control = my_control
)
local_algorithm_2
plot(local_algorithm_2)
# Example 4
# Running the mbl function in parallel with example 2
n_cores <- 2
if (parallel::detectCores() < 2) {
n_cores <- 1
}
# Alternatively:
# n_cores <- parallel::detectCores() - 1
# if (n_cores == 0) {
# n_cores <- 1
# }
library(doParallel)
clust <- makeCluster(n_cores)
registerDoParallel(clust)
# Alernatively:
# library(doSNOW)
# clust <- makeCluster(n_cores, type = "SOCK")
# registerDoSNOW(clust)
# getDoParWorkers()
local_algorithm_par <- mbl(
Xr = train_x,
Yr = train_y,
Xu = test_x,
Yu = test_y,
k = ks,
method = local_fit_wapls(min_pls_c = 3, max_pls_c = 15),
diss_method = "cor",
diss_usage = "none",
control = my_control
)
local_algorithm_par
registerDoSEQ()
try(stopCluster(clust))
# Example 5
# Using local pls distances
with_local_diss <- mbl(
Xr = train_x,
Yr = train_y,
Xu = test_x,
Yu = test_y,
k = ks,
method = local_fit_wapls(min_pls_c = 3, max_pls_c = 15),
diss_method = "pls",
diss_usage = "predictors",
control = my_control,
.local = TRUE,
pre_k = 150
)
with_local_diss
plot(with_local_diss)
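# The prediction tables documented under 'Value' can also be inspected
# directly. For example, continuing Example 2 (a sketch using the documented
# columns yu_obs and pred):
res_k <- local_algorithm$results[[1]]
plot(res_k$yu_obs, res_k$pred, xlab = "Observed CEC", ylab = "Predicted CEC")
abline(0, 1)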
A function that controls some aspects of the memory-based learning
process in the mbl
function
Description
This function is used to further control some aspects of the memory-based
learning process in the mbl
function.
Usage
mbl_control(
return_dissimilarity = FALSE,
validation_type = c("NNv", "local_cv"),
tune_locally = TRUE,
number = 10,
p = 0.75,
range_prediction_limits = TRUE,
progress = TRUE,
allow_parallel = TRUE
)
Arguments
return_dissimilarity |
a logical indicating if the dissimilarity matrix
between |
validation_type |
a character vector which indicates the (internal) validation
method(s) to be used for assessing the global performance of the local models.
Possible options are: |
tune_locally |
a logical. It only applies when
|
number |
an integer indicating the number of sampling iterations at
each local segment when |
p |
a numeric value indicating the percentage of observations to be retained
at each sampling iteration at each local segment when |
range_prediction_limits |
a logical. It indicates whether the prediction
limits at each local regression are determined by the range of the response
variable within each neighborhood. When the predicted value is outside
this range, it will be automatically replaced with the value of the nearest
range value. If |
progress |
a logical indicating whether or not to print a progress bar
for each observation to be predicted. Default is |
allow_parallel |
a logical indicating if parallel execution is allowed.
If |
Details
The validation methods available for assessing the predictive performance of the memory-based learning method used are described as follows:
Leave-nearest-neighbor-out cross-validation (
"NNv"
): From the group of neighbors of each observation to be predicted, the nearest observation (i.e. the most similar observation) is excluded and then a local model is fitted using the remaining neighbors. This model is then used to predict the value of the target response variable of the nearest observation. These predicted values are finally cross validated with the actual values (See Ramirez-Lopez et al. (2013a) for additional details). This method is faster than"local_cv"
.Local leave-group-out cross-validation (
"local_cv"
): The group of neighbors of each observation to be predicted is partitioned into different equal size subsets. Each partition is selected based on a stratified random sampling which takes into account the values of the response variable of the corresponding set of neighbors. The selected local subset is used as local validation subset and the remaining observations are used for fitting a model. This model is used to predict the target response variable values of the local validation subset and the local root mean square error is computed. This process is repeated \(m\) times and the final local error is computed as the average of the local root mean square error of all the \(m\) iterations. In thembl
function \(m\) is controlled by thenumber
argument and the size of the subsets is controlled by thep
argument which indicates the percentage of observations to be selected from the subset of nearest neighbours. The global error of the predictions is computed as the average of the local root mean square errors.No validation (
"none"
): No validation is carried out. If"none"
is selected along with"NNv"
and/or"local_cv"
, then it will be ignored and the respective validation(s) will be carried out.
Value
a list
mirroring the specified parameters
Author(s)
Leonardo Ramirez-Lopez and Antoine Stevens
References
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.
See Also
f_diss
, cor_diss
, sid
,
ortho_diss
, mbl
Examples
# A control list with the default parameters
mbl_control()
Moving/rolling correlation distance of two matrices
Description
Computes a moving window correlation distance between two data matrices
Usage
moving_cor_diss(X,Y,w)
Arguments
X |
a matrix |
Y |
a matrix |
w |
window size (must be odd) |
Value
a matrix of correlation distance
Author(s)
Leonardo Ramirez-Lopez and Antoine Stevens
orthogonal scores algorithm of partial least squares (opls)
Description
Computes orthogonal scores partial least squares (opls) regressions with the NIPALS algorithm. It allows multiple response variables. It does not return the variance information of the components. NOTE: For internal use only!
Usage
opls(X,
Y,
ncomp,
scale,
maxiter,
tol,
algorithm = "pls",
xls_min_w = 3,
xls_max_w = 15)
Arguments
X |
a matrix of predictor variables. |
Y |
a matrix of either a single or multiple response variables. |
ncomp |
the number of pls components. |
scale |
logical indicating whether |
maxiter |
maximum number of iterations. |
tol |
limit for convergence of the algorithm in the nipals algorithm. |
algorithm |
(for weights computation) a character string indicating
what method to use. Options are:
|
xls_min_w |
(for weights computation) an integer indicating the minimum window size for the "xls"
method. Only used if |
xls_max_w |
(for weights computation) an integer indicating the maximum window size for the "xls"
method. Only used if |
Value
a list containing the following elements:
coefficients: the matrix of regression coefficients.
bo: a matrix of one row containing the intercepts for each component.
scores: the matrix of scores.
X_loadings: the matrix of X loadings.
Y_loadings: the matrix of Y loadings.
projection_mat: the projection matrix.
Y: the Y input.
transf: a list containing two objects: Xcenter and Xscale.
weights: the matrix of weights.
Author(s)
Leonardo Ramirez-Lopez
Internal Cpp function for performing leave-group-out cross-validations for pls regression
Description
For internal use only!.
Usage
opls_cv_cpp(X, Y, scale, method,
mindices, pindices,
min_component, ncomp,
new_x,
maxiter, tol,
wapls_grid,
algorithm,
statistics = TRUE)
Arguments
X |
a matrix of predictor variables. |
Y |
a matrix of a single response variable. |
scale |
a logical indicating whether the matrix of predictors
( |
method |
the method used for regression. One of the following options:
|
mindices |
a matrix with |
pindices |
a matrix with |
min_component |
an integer indicating the number of minimum pls
components (if the |
ncomp |
an integer indicating the number of pls components. |
new_x |
a matrix of one row corresponding to the observation to be
predicted (if the |
maxiter |
maximum number of iterations. |
tol |
convergence limit for the nipals algorithm. |
wapls_grid |
the grid on which the search for the best combination of
minimum and maximum pls factors of |
algorithm |
either pls ( |
statistics |
a logical value indicating whether the precision and accuracy statistics are to be returned, otherwise the predictions for each validation segment are retrieved. |
Value
if statistics = TRUE, a list containing the following one-row matrices:
rmse_seg: the RMSEs.
st_rmse_seg: the standardized RMSEs.
rsq_seg: the coefficients of determination.
if statistics = FALSE, a list containing the following one-row matrices:
predictions: the predictions of each of the validation segments in pindices. Each column in pindices contains the validation indices of a segment.
st_rmse_seg: the standardized RMSEs.
rsq_seg: the coefficients of determination.
If method = "wapls", data of the pls weights are output in this list (compweights).
If method = "completewapls1", data of all the combinations of components passed in wapls_grid are output in this list (complete_compweights).
Author(s)
Leonardo Ramirez-Lopez
orthogonal scores algorithm of partial least squares (opls) projection
Description
Computes orthogonal scores partial least squares (opls) projection with the NIPALS algorithm. It allows multiple response variables. Although the main use of the function is for projection, it also retrieves regression coefficients. NOTE: For internal use only!
Usage
opls_for_projection(X, Y, ncomp, scale,
maxiter, tol,
pcSelmethod = "var",
pcSelvalue = 0.01,
algorithm = "pls",
xls_min_w = 3,
xls_max_w = 15)
Arguments
X |
a matrix of predictor variables. |
Y |
a matrix of either a single or multiple response variables. |
ncomp |
the number of pls components. |
scale |
logical indicating whether |
maxiter |
maximum number of iterations. |
tol |
convergence limit for the nipals algorithm. |
pcSelmethod |
if |
pcSelvalue |
a numerical value that complements the selected method
( |
algorithm |
(for weights computation) a character string indicating
what method to use. Options are:
|
xls_min_w |
(for weights computation) an integer indicating the minimum window size for the "xls"
method. Only used if |
xls_max_w |
(for weights computation) an integer indicating the maximum window size for the "xls"
method. Only used if |
Value
a list containing the following elements:
coefficients: the matrix of regression coefficients.
bo: a matrix of one row containing the intercepts for each component.
scores: the matrix of scores.
X_loadings: the matrix of X loadings.
Y_loadings: the matrix of Y loadings.
projection_mat: the projection matrix.
Y: the Y input.
variance: a list containing two objects: x_var and y_var. These objects contain information on the explained variance for the X and Y matrices respectively.
transf: a list containing two objects: Xcenter and Xscale.
weights: the matrix of weights.
Author(s)
Leonardo Ramirez-Lopez
orthogonal scores algorithm of partial least squares (opls_get_all)
Description
Computes orthogonal scores partial least squares (opls_get_all) regressions with the NIPALS algorithm. It retrieves a comprehensive set of pls outputs (e.g. vip and selectivity ratio). It allows multiple response variables. NOTE: For internal use only!
Usage
opls_get_all(X,
Y,
ncomp,
scale,
maxiter,
tol,
algorithm = "pls",
xls_min_w = 3,
xls_max_w = 15)
Arguments
X |
a matrix of predictor variables. |
Y |
a matrix of either a single or multiple response variables. |
ncomp |
the number of pls components. |
scale |
logical indicating whether |
maxiter |
maximum number of iterations. |
tol |
convergence limit for the nipals algorithm. |
algorithm |
(for weights computation) a character string indicating
what method to use. Options are:
|
xls_min_w |
(for weights computation) an integer indicating the minimum window size for the "xls"
method. Only used if |
xls_max_w |
(for weights computation) an integer indicating the maximum window size for the "xls"
method. Only used if |
Value
a list containing the following elements:
ncomp: the number of components used.
coefficients: the matrix of regression coefficients.
bo: a matrix of one row containing the intercepts for each component.
scores: the matrix of scores.
X_loadings: the matrix of X loadings.
Y_loadings: the matrix of Y loadings.
vip: the matrix of variable importance in projection (VIP) scores.
selectivity_ratio: the matrix of selectivity ratios (see Rajalahti, Tarja, et al. 2009).
Y: the Y input.
variance: a list containing two objects: x_var and y_var. These objects contain information on the explained variance for the X and Y matrices respectively.
transf: a list containing two objects: Xcenter and Xscale.
weights: the matrix of weights.
Author(s)
Leonardo Ramirez-Lopez
fast orthogonal scores algorithm of partial least squares (opls)
Description
Computes orthogonal scores partial least squares (opls)
regressions with the NIPALS algorithm. It allows multiple response variables.
In contrast to the opls
function, this one does not compute unnecessary
data for (local) regression.
For internal use only!
Usage
opls_get_basics(X, Y, ncomp, scale,
maxiter, tol,
algorithm = "pls",
xls_min_w = 3,
xls_max_w = 15)
Arguments
X |
a matrix of predictor variables. |
Y |
a matrix of either a single or multiple response variables. |
ncomp |
the number of pls components. |
scale |
logical indicating whether |
maxiter |
maximum number of iterations. |
tol |
convergence limit for the nipals algorithm. |
algorithm |
(for weights computation) a character string indicating
what method to use. Options are:
|
xls_min_w |
(for weights computation) an integer indicating the minimum window size for the "xls"
method. Only used if |
xls_max_w |
(for weights computation) an integer indicating the maximum window size for the "xls"
method. Only used if |
Value
a list containing the following elements:
coefficients: the matrix of regression coefficients.
bo: a matrix of one row containing the intercepts for each component.
Y_loadings: the matrix of Y loadings.
projection_mat: the projection matrix.
transf: a list containing two objects: Xcenter and Xscale.
Author(s)
Leonardo Ramirez-Lopez
orthogonal scores algorithm of partial least squares (opls)
Description
Computes orthogonal scores partial least squares (opls) regressions with the NIPALS algorithm. It allows multiple response variables. It does not return the variance information of the components. NOTE: For internal use only!
Usage
opls_gs(Xr,
Yr,
Xu,
ncomp,
scale,
response = FALSE,
reconstruction = TRUE,
similarity = TRUE,
fresponse = TRUE,
algorithm = "pls")
Arguments
Xr |
a matrix of predictor variables for the training set. |
Yr |
a matrix of a single response variable for the training set. |
Xu |
a matrix of predictor variables for the test set. |
ncomp |
the number of pls components. |
scale |
logical indicating whether |
response |
logical indicating whether to compute the prediction of |
reconstruction |
logical indicating whether to compute the reconstruction error of |
similarity |
logical indicating whether to compute the distance score between |
fresponse |
logical indicating whether to compute the score of the variance not explained for |
algorithm |
(for weights computation) a character string indicating
what method to use. Options are:
|
Value
a list containing the following elements:
ncomp: the number of components.
pred_response: the response predictions for Xu.
rmse_reconstruction: the rmse of the reconstruction for Xu.
score_dissimilarity: the distance score between Xr and Xu.
Author(s)
Leonardo Ramirez-Lopez
A function to construct optimal sample strata based on the distribution of the given y
Description
For internal use only! This function computes the optimal strata from the distribution of the given y.
Usage
optim_sample_strata(y, n)
Arguments
y |
a matrix of one column with the response variable. |
n |
number of samples that must be sampled. |
Value
a list with two data.table
objects: sample_strata
contains
the optimal strata, whereas samples_to_get
contains information on how
many samples per stratum are supposed to be drawn.
A function for computing dissimilarity matrices from orthogonal projections (ortho_diss)
Description
This function computes dissimilarities (in an orthogonal space) between either observations in a given set or between observations in two different sets. The dissimilarities are computed based on either principal component projection or partial least squares projection of the data. After projecting the data, the Mahalanobis distance is applied.
Usage
ortho_diss(Xr, Xu = NULL,
Yr = NULL,
pc_selection = list(method = "var", value = 0.01),
diss_method = "pca",
.local = FALSE,
pre_k,
center = TRUE,
scale = FALSE,
compute_all = FALSE,
return_projection = FALSE,
allow_parallel = TRUE, ...)
Arguments
Xr |
a matrix containing |
Xu |
an optional matrix containing data of a second set of observations
with |
Yr |
a matrix of
|
pc_selection |
a list of length 2 which specifies the method to be used
for optimizing the number of components (principal components or pls factors)
to be retained. This list must contain two elements (in the following order):
Default is Optionally, the |
diss_method |
a character value indicating the type of projection on which
the dissimilarities must be computed. This argument is equivalent to
See the |
.local |
a logical indicating whether or not to compute the dissimilarities
locally (i.e. projecting locally the data) by using the |
pre_k |
if |
center |
a logical indicating if the |
scale |
a logical indicating if the |
compute_all |
a logical. In case |
return_projection |
a logical. If |
allow_parallel |
a logical (default TRUE). It allows parallel computing
of the local distance matrices (i.e. when |
... |
additional arguments to be passed to the
|
Details
When .local = TRUE
, first a global dissimilarity matrix is computed based on
the parameters specified. Then, by using this matrix for each target
observation, a given set of nearest neighbors (pre_k
) are identified.
These neighbors (together with the target observation) are projected
(from the original data space) onto a (local) orthogonal space (using the
same parameters specified in the function). In this projected space the
Mahalanobis distance between the target observation and its neighbors is
recomputed. A missing value is assigned to the observations that do not belong to
this set of neighbors (non-neighbor observations).
In this case the dissimilarity matrix cannot be considered as a distance
metric since it does not necessarily satisfy the symmetry condition for
distance matrices (i.e. given two observations \(x_i\) and \(x_j\), the local
dissimilarity (\(d\)) between them is relative since generally
\(d(x_i, x_j) \neq d(x_j, x_i)\)). On the other hand, when
.local = FALSE
, the dissimilarity matrix obtained can be considered as
a distance matrix.
In the cases where "Yr"
is required to compute the dissimilarities and
if .local = TRUE
, care must be taken as some neighborhoods might
not have enough observations with non-missing "Yr"
values, which might retrieve
unreliable dissimilarity computations.
If "opc"
or "manual"
are used in pc_selection$method
and .local = TRUE
, the minimum number of observations with non-missing
"Yr"
values at each neighborhood is determined by
pc_selection$value
(i.e. the maximum number of components to compute).
Value
a list
of class ortho_diss
with the following elements:
n_components: the number of components (either principal components or partial least squares components) used for computing the global dissimilarities.
global_variance_info: the information about the explained variance(s) of the projection. When .local = TRUE, the information corresponds to the global projection done prior to computing the local projections.
local_n_components: if .local = TRUE, a data.table which specifies the number of local components (either principal components or partial least squares components) used for computing the dissimilarity between each target observation and its neighbor observations.
dissimilarity: the computed dissimilarity matrix. If .local = FALSE, a distance matrix. If .local = TRUE, a matrix of class local_ortho_diss. In this case, each column represents the dissimilarity between a target observation and its neighbor observations.
projection: if return_projection = TRUE, an ortho_projection object.
Author(s)
References
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.
See Also
Examples
library(prospectr)
data(NIRsoil)
Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Yu <- NIRsoil[!as.logical(NIRsoil$train), "CEC", drop = FALSE]
Yr <- NIRsoil[as.logical(NIRsoil$train), "CEC", drop = FALSE]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]
Xu <- Xu[!is.na(Yu), ]
Yu <- Yu[!is.na(Yu), , drop = FALSE]
Xr <- Xr[!is.na(Yr), ]
Yr <- Yr[!is.na(Yr), , drop = FALSE]
# Computation of the orthogonal dissimilarity matrix using the
# default parameters
pca_diss <- ortho_diss(Xr, Xu)
# Computation of a principal component dissimilarity matrix using
# the "opc" method for the selection of the principal components
pca_diss_optim <- ortho_diss(
Xr, Xu, Yr,
pc_selection = list("opc", 40),
compute_all = TRUE
)
# Computation of a partial least squares (PLS) dissimilarity
# matrix using the "opc" method for the selection of the PLS
# components
pls_diss_optim <- ortho_diss(
Xr = Xr, Xu = Xu,
Yr = Yr,
pc_selection = list("opc", 40),
diss_method = "pls"
)
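The local variant described in the Details section can be requested through .local and pre_k; a minimal sketch reusing the objects above (the pre_k and component values are illustrative only):
# Computation of local dissimilarity matrices: a global matrix is first
# computed and, for each observation, the dissimilarities to its pre_k
# nearest neighbors are recomputed in a local orthogonal space
local_diss <- ortho_diss(
  Xr = Xr, Xu = Xu,
  Yr = Yr,
  pc_selection = list("opc", 30),
  diss_method = "pca",
  .local = TRUE,
  pre_k = 200
)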
Orthogonal projections using principal component analysis and partial least squares
Description
Functions to perform orthogonal projections of high dimensional data matrices using principal component analysis (pca) and partial least squares (pls).
Usage
ortho_projection(Xr, Xu = NULL,
Yr = NULL,
method = "pca",
pc_selection = list(method = "var", value = 0.01),
center = TRUE, scale = FALSE, ...)
pc_projection(Xr, Xu = NULL, Yr = NULL,
pc_selection = list(method = "var", value = 0.01),
center = TRUE, scale = FALSE,
method = "pca",
tol = 1e-6, max_iter = 1000, ...)
pls_projection(Xr, Xu = NULL, Yr,
pc_selection = list(method = "opc", value = min(dim(Xr), 40)),
scale = FALSE, method = "pls",
tol = 1e-6, max_iter = 1000, ...)
## S3 method for class 'ortho_projection'
predict(object, newdata, ...)
Arguments
Xr |
a matrix of observations. |
Xu |
an optional matrix containing data of a second set of observations. |
Yr |
if the method used in the |
method |
the method for projecting the data. Options are:
|
pc_selection |
a list of length 2 which specifies the method to be used
for optimizing the number of components (principal components or pls factors)
to be retained. This list must contain two elements (in the following order):
The list |
center |
a logical indicating if the data |
scale |
a logical indicating if |
... |
additional arguments to be passed
to |
tol |
tolerance limit for convergence of the nipals algorithm (default is 1e-06). In the case of PLS this applies only to Yr with more than one variable. |
max_iter |
maximum number of iterations (default is 1000). In the case of
|
object |
object of class |
newdata |
an optional data frame or matrix in which to look for variables with which to predict. If omitted, the scores are used. It must contain the same number of columns as the data used to build the projection, in the same order. |
Details
In the case of method = "pca"
, the algorithm used is the singular value
decomposition in which a given data matrix (\(X\)) is factorized as follows:
where \(U\) and \(V\) are orthogonal matrices, being the left and right singular vectors of \(X\) respectively, \(D\) is a diagonal matrix containing the singular values of \(X\) and \(V\) is the is a matrix of the right singular vectors of \(X\). The matrix of principal component scores is obtained by a matrix multiplication of \(U\) and \(D\), and the matrix of principal component loadings is equivalent to the matrix \(V\).
When method = "pca.nipals"
, the algorithm used for principal component
analysis is the non-linear iterative partial least squares (nipals).
In the case of the of the partial least squares projection (a.k.a projection
to latent structures) the nipals regression algorithm is used by default.
Details on the "nipals" algorithm are presented in Martens (1991). Another
method called modified pls ('mpls'
) can also be used. The modified
pls was proposed by Shenk and Westerhaus (1991, see also Westerhaus, 2014) and it
differs from the standard pls method in the way the weights of the Xr
(used to compute the matrix of scores) are obtained. While pls uses the covariance
between Yr
and Xr
(and later their deflated versions
corresponding at each pls component iteration) to obtain these weights, the modified pls
uses the correlation as weights. The authors indicate that by using correlation,
a larger portion of the response variable(s) can be explained.
When method = "opc"
, the selection of the components is carried out by
using an iterative method based on the side information concept
(Ramirez-Lopez et al. 2013a, 2013b). First let be \(P\) a sequence of
retained components (so that \(P = 1, 2, ...,k \)).
At each iteration, the function computes a dissimilarity matrix retaining
\(p_i\) components. The side information values of the observations are then
compared against the side information values of their most spectrally similar
observations (i.e. their closest Xr
observations).
The optimal number of components retrieved by the function is the one that
minimizes the root mean squared differences (RMSD) in the case of continuous
variables, or maximizes the kappa index in the case of categorical variables.
In this process, the sim_eval
function is used.
Note that for the "opc"
method Yr
is required (i.e. the
side information of the observations).
Value
a list
of class ortho_projection
with the following
components:
scores: a matrix of scores corresponding to the observations in Xr (and Xu if it was provided). The components retrieved correspond to the ones optimized or specified.
X_loadings: a matrix of loadings corresponding to the explanatory variables. The components retrieved correspond to the ones optimized or specified.
Y_loadings: a matrix of partial least squares loadings corresponding to Yr. The components retrieved correspond to the ones optimized or specified. This object is only returned if the partial least squares algorithm was used.
weigths: a matrix of partial least squares ("pls") weights. This object is only returned if the "pls" algorithm was used.
projection_mat: a matrix that can be used to project new data onto a "pls" space. This object is only returned if the "pls" algorithm was used.
variance: a list with information on the original variance and the explained variances. This list contains a matrix indicating the amount of variance explained by each component (var), the ratio between the variance explained by each single component and the original variance (explained_var) and the cumulative ratio of explained variance (cumulative_explained_var). The amount of variance explained by each component is computed by multiplying its score vector by its corresponding loading vector and calculating the variance of the result. These values are computed based on the observations used to create the projection matrices. For example, if the "pls" method was used, these values are computed based only on the data that contain information on Yr (i.e. the Xr data). If the principal component method is used, these values are computed on the basis of Xr and Xu (if it applies) since both matrices are employed in the computation of the projection matrix (loadings in this case).
sdv: the standard deviation of the retrieved scores. This vector can be different from the "sd" in variance.
n_components: the number of components (either principal components or partial least squares components) used for computing the global dissimilarity scores.
opc_evaluation: a matrix containing the statistics computed for optimizing the number of principal components based on the variable(s) specified in the Yr argument. If Yr was a continuous vector or matrix, this object indicates the root mean square of differences (rmse) for each number of components. If Yr was a categorical variable, this object indicates the kappa values for each number of components. This object is returned only if "opc" was used within the pc_selection argument. See the sim_eval function for more details.
method: the ortho_projection method used.
predict.ortho_projection returns a matrix of scores projected for newdata.
Author(s)
References
Martens, H. (1991). Multivariate calibration. John Wiley & Sons.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.
Shenk, J. S., & Westerhaus, M. O. 1991. Populations structuring of near infrared spectra and modified partial least squares regression. Crop Science, 31(6), 1548-1555.
Shenk, J., Westerhaus, M., and Berzaghi, P. 1997. Investigation of a LOCAL calibration procedure for near infrared instruments. Journal of Near Infrared Spectroscopy, 5, 223-232.
Westerhaus, M. 2014. Eastern Analytical Symposium Award for outstanding achievements in near infrared spectroscopy: my contributions to near infrared spectroscopy. NIR news, 25(8), 16-20.
See Also
Examples
library(prospectr)
data(NIRsoil)
# Preprocess the data using detrend plus first derivative with Savitzky-Golay
# smoothing filter
sg_det <- savitzkyGolay(
detrend(NIRsoil$spc,
wav = as.numeric(colnames(NIRsoil$spc))
),
m = 1,
p = 1,
w = 7
)
NIRsoil$spc_pr <- sg_det
# split into training and testing sets
test_x <- NIRsoil$spc_pr[NIRsoil$train == 0 & !is.na(NIRsoil$CEC), ]
test_y <- NIRsoil$CEC[NIRsoil$train == 0 & !is.na(NIRsoil$CEC)]
train_y <- NIRsoil$CEC[NIRsoil$train == 1 & !is.na(NIRsoil$CEC)]
train_x <- NIRsoil$spc_pr[NIRsoil$train == 1 & !is.na(NIRsoil$CEC), ]
# A principal component analysis using 5 components
pca_projected <- ortho_projection(train_x, pc_selection = list("manual", 5))
pca_projected
# A principal components projection using the "opc" method
# for the selection of the optimal number of components
pca_projected_2 <- ortho_projection(
Xr = train_x, Xu = test_x, Yr = train_y,
method = "pca",
pc_selection = list("opc", 40)
)
pca_projected_2
plot(pca_projected_2)
# A partial least squares projection using the "opc" method
# for the selection of the optimal number of components
pls_projected <- ortho_projection(
Xr = train_x, Xu = test_x, Yr = train_y,
method = "pls",
pc_selection = list("opc", 40)
)
pls_projected
plot(pls_projected)
# A partial least squares projection using the "cumvar" method
# for the selection of the optimal number of components
pls_projected_2 <- ortho_projection(
Xr = train_x, Xu = test_x, Yr = train_y,
method = "pls",
pc_selection = list("cumvar", 0.99)
)
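The predict method documented above can then be used to project new observations onto an existing projection; a short sketch reusing pls_projected and test_x from the examples above:
# Project the test spectra onto the fitted PLS space
new_scores <- predict(pls_projected, newdata = test_x)
dim(new_scores)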
Function for computing the overall variance of a matrix
Description
Computes the variance of a matrix. For internal use only!
Usage
overall_var(X)
Arguments
X |
a matrix. |
Value
a vector of standard deviation values.
Author(s)
Leonardo Ramirez-Lopez
Principal components based on the non-linear iterative partial least squares (nipals) algorithm
Description
Computes principal components using the non-linear iterative partial least squares (nipals) algorithm. For internal use only!
Usage
pca_nipals(X, ncomp, center, scale,
maxiter, tol,
pcSelmethod = "var",
pcSelvalue = 0.01)
Arguments
X |
a matrix of predictor variables. |
ncomp |
the number of pls components. |
scale |
logical indicating whether |
maxiter |
maximum number of iterations. |
tol |
convergence limit for the nipals algorithm. |
pcSelmethod |
the method for selecting the number of components.
Options are: |
pcSelvalue |
a numerical value that complements the selected method ( |
Value
a list containing the following elements:
pc_scores: a matrix of principal component scores.
pc_loadings: a matrix of principal component loadings.
variance: a matrix of the variance of the principal components.
scale: a list containing two objects: center and scale, which correspond to the vectors used to center and scale the input matrix.
Author(s)
Leonardo Ramirez-Lopez
Get the package version info
Description
returns package info.
Usage
pkg_info(pkg = "resemble")
Arguments
pkg |
the package name, i.e. "resemble". |
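A trivial call; if the function is not exported by the installed version, it can be reached with resemble:::pkg_info:
# Retrieve the version information of the installed package
pkg_info(pkg = "resemble")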
Plot method for an object of class mbl
Description
Plots the content of an object of class mbl
Usage
## S3 method for class 'mbl'
plot(x, g = c("validation", "gh"), param = "rmse", pls_c = c(1,2), ...)
Arguments
x |
an object of class |
g |
a character vector indicating what results shall be plotted.
Options are: |
param |
a character string indicating what validation statistics shall be
plotted. The following options are available: |
pls_c |
a numeric vector of length one or two indicating the pls factors to be
plotted. Default is |
... |
some arguments to be passed to the plot methods. |
Details
For plotting the pls scores from the pls score matrix (of more than one column), this matrix is first transformed from the Euclidean space to the Mahalanobis space. This is done by multiplying the score matrix by the square root of its covariance matrix. The square root of this matrix is estimated using a singular value decomposition.
Author(s)
Leonardo Ramirez-Lopez and Antoine Stevens
See Also
Examples
library(prospectr)
data(NIRsoil)
Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)]
Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]
Xu <- Xu[!is.na(Yu), ]
Yu <- Yu[!is.na(Yu)]
Xr <- Xr[!is.na(Yr), ]
Yr <- Yr[!is.na(Yr)]
ctrl <- mbl_control(validation_type = "NNv")
ex_1 <- mbl(
Yr = Yr, Xr = Xr, Xu = Xu,
diss_method = "cor",
diss_usage = "none",
gh = TRUE,
control = ctrl,
k = seq(50, 250, 30)
)
plot(ex_1)
plot(ex_1, g = "gh", pls_c = c(2, 3))
Plot method for an object of class ortho_projection
Description
Plots objects of class ortho_projection
Usage
## S3 method for class 'ortho_projection'
plot(x, col = "dodgerblue", ...)
Arguments
x |
an object of class |
col |
the color of the plots (default is "dodgerblue") |
... |
arguments to be passed to methods. |
Author(s)
Leonardo Ramirez-Lopez and Antoine Stevens
See Also
Cross validation for PLS regression
Description
for internal use only!
Usage
pls_cv(
x,
y,
ncomp,
method = c("pls", "wapls"),
center = TRUE,
scale,
min_component = 1,
new_x = matrix(0, 1, 1),
weights = NULL,
p = 0.75,
number = 10,
group = NULL,
retrieve = TRUE,
tune = TRUE,
max_iter = 1,
tol = 1e-06,
seed = NULL,
modified = FALSE
)
Prediction function for the gaussian_process
function (Gaussian process regression with dot product covariance)
Description
Predicts response values based on a model generated by the gaussian_process
function (Gaussian process regression with dot product covariance). For internal use only!.
Usage
predict_gaussian_process(Xz, alpha, newdata, scale, Xcenter, Xscale, Ycenter, Yscale)
Arguments
newdata |
a matrix containing the predictor variables |
scale |
a logical indicating whether the matrix of predictors used to create the regression model
(in the |
Xcenter |
if |
Xscale |
if |
Ycenter |
if |
Yscale |
if |
Value
a matrix of predicted values
Author(s)
Leonardo Ramirez-Lopez
Prediction function for the opls
and fopls
functions
Description
Predicts response values based on a model generated either by the opls
or the fopls
functions.
For internal use only!.
Usage
predict_opls(bo, b, ncomp, newdata, scale, Xscale)
Arguments
bo |
a numeric value indicating the intercept. |
b |
the matrix of regression coefficients. |
ncomp |
an integer value indicating how many components must be used in the prediction. |
newdata |
a matrix containing the predictor variables. |
scale |
a logical indicating whether the matrix of predictors used to create the regression model was scaled. |
Xscale |
if |
Value
a matrix of predicted values.
Author(s)
Leonardo Ramirez-Lopez
Print method for an object of class local_fit
Description
Prints the contents of an object of class local_fit
Usage
## S3 method for class 'local_fit'
print(x, ...)
Arguments
x |
an object of class |
... |
not yet functional. |
Author(s)
Leonardo Ramirez-Lopez
Print method for an object of class ortho_diss
Description
Prints the content of an object of class ortho_diss
Usage
## S3 method for class 'local_ortho_diss'
print(x, ...)
Arguments
x |
an object of class |
... |
arguments to be passed to methods (not yet functional). |
Author(s)
Leonardo Ramirez-Lopez and Antoine Stevens
Print method for an object of class mbl
Description
Prints the content of an object of class mbl
Usage
## S3 method for class 'mbl'
print(x, ...)
Arguments
x |
an object of class |
... |
arguments to be passed to methods (not functional). |
Author(s)
Leonardo Ramirez-Lopez and Antoine Stevens
Print method for an object of class ortho_projection
Description
Prints the contents of an object of class ortho_projection
Usage
## S3 method for class 'ortho_projection'
print(x, ...)
Arguments
x |
an object of class |
... |
arguments to be passed to methods (not yet functional). |
Author(s)
Leonardo Ramirez-Lopez
Projection function for the opls
function
Description
Projects new spectra onto a PLS space based on a model generated either by the opls
or the opls2
functions.
For internal use only!.
Usage
project_opls(projection_mat, ncomp, newdata, scale, Xcenter, Xscale)
Arguments
projection_mat |
the projection matrix generated by the |
ncomp |
an integer value indicating how many components must be used in the prediction. |
newdata |
a matrix containing the predictor variables. |
scale |
a logical indicating whether the matrix of predictors used to create the regression model was scaled. |
Xcenter |
a matrix of one row with the values that must be used for centering |
Xscale |
if |
Value
a matrix corresponding to the new spectra projected onto the PLS space
Author(s)
Leonardo Ramirez-Lopez
Projection to pls and then re-construction
Description
Projects spectra onto a PLS space and then reconstructs them back.
Usage
reconstruction_error(x,
projection_mat,
xloadings,
scale,
Xcenter,
Xscale,
scale_back = FALSE)
Arguments
x |
a matrix to project. |
projection_mat |
the projection matrix generated by the |
xloadings |
the loadings matrix generated by the |
scale |
logical indicating if scaling is required |
Xcenter |
a matrix of one row with the centering values |
Xscale |
a matrix of one row with the scaling values |
scale_back |
compute the reconstruction error after de-centering the data and de-scaling it. |
Value
a matrix of 1 row and 1 column.
Author(s)
Leonardo Ramirez-Lopez
A function to create calibration and validation sample sets for leave-group-out cross-validation
Description
for internal use only! This is stratified sampling based on the values of a continuous response variable (y). If group is provided, the sampling is done based on the groups and the average of y per group. This function is used to create calibration and validation groups for leave-group-out cross-validations (or leave-group-of-groups-out cross-validation if group argument is provided).
Usage
sample_stratified(y, p, number, group = NULL, replacement = FALSE, seed = NULL)
Arguments
y |
a matrix of one column with the response variable. |
p |
the percentage of samples (or groups if group argument is used) to retain in the validation_indices set |
number |
the number of sample groups to be created |
group |
the labels for each sample in |
replacement |
a logical indicating whether sampling with replacement is required for the calibration set. |
seed |
an integer for random number generator (default |
Value
a list with two matrices (hold_in
and
hold_out
) giving the indices of the observations in each
column. The number of columns represents the number of sampling repetitions.
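A minimal sketch of a call with simulated response values (the function is internal, so it is accessed with ::: here; the p and number values are illustrative only):
set.seed(1)
y <- matrix(rnorm(100), ncol = 1)
# 10 sampling repetitions built from the distribution of y
strata <- resemble:::sample_stratified(y = y, p = 0.75, number = 10, seed = 1)
str(strata)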
A function for searching in a given reference set the neighbors of another given set of observations (search_neighbors)
Description
This function searches in a reference set the neighbors of the observations provided in another set.
Usage
search_neighbors(Xr, Xu, diss_method = c("pca", "pca.nipals", "pls", "mpls",
"cor", "euclid", "cosine", "sid"),
Yr = NULL, k, k_diss, k_range, spike = NULL,
pc_selection = list("var", 0.01),
return_projection = FALSE, return_dissimilarity = FALSE,
ws = NULL,
center = TRUE, scale = FALSE,
documentation = character(), ...)
Arguments
Xr |
a matrix of reference (spectral) observations where the neighbor search is to be conducted. See details. |
Xu |
an optional matrix of (spectral) observations for which its
neighbors are to be searched in |
diss_method |
a character string indicating the spectral dissimilarity metric to be used in the selection of the nearest neighbors of each observation.
|
Yr |
a numeric matrix of
|
k |
an integer value indicating the k-nearest neighbors of each
observation in |
k_diss |
an integer value indicating a dissimilarity threshold.
For each observation in |
k_range |
an integer vector of length 2 which specifies the minimum
(first value) and the maximum (second value) number of neighbors to be
retained when the |
spike |
a vector of integers (with positive and/or negative values)
indicating what observations in |
pc_selection |
a list of length 2 to be passed onto the
The default is Optionally, the |
return_projection |
a logical indicating if the projection(s) must be
returned. Projections are used if the |
return_dissimilarity |
a logical indicating if the dissimilarity matrix used for neighbor search must be returned. |
ws |
an odd integer value which specifies the window size, when
|
center |
a logical indicating if the |
scale |
a logical indicating if the |
documentation |
an optional character string that can be used to
describe anything related to the |
... |
further arguments to be passed to the |
Details
This function may be especially useful when the reference set (Xr
) is
very large. In some cases the number of observations in the reference set
can be reduced by removing irrelevant observations (i.e. observations that are not
neighbors of a particular target set). For example, this function can be
used to reduce the size of the reference set before running the
mbl
function.
This function uses the dissimilarity
function to compute the
dissimilarities between Xr
and Xu
. Arguments to
dissimilarity
as well as further arguments to the functions
used inside dissimilarity
(i.e. ortho_diss, cor_diss, f_diss, sid) can be passed to
those functions as additional arguments (i.e. ...
).
If no matrix is passed to Xu
, the neighbor search is conducted for the
observations in Xr
that are found within that matrix. If a matrix is
passed to Xu
, the neighbors of Xu
are searched in the Xr
matrix.
Value
a list
containing the following elements:
neighbors_diss: a matrix of the Xr dissimilarity scores corresponding to the neighbors of each Xr observation (or Xu observation, in case Xu was supplied). The neighbor dissimilarity scores are organized by columns and are sorted in ascending order.
neighbors: a matrix of the Xr indices corresponding to the neighbors of each observation in Xu. The neighbor indices are organized by columns and are sorted in ascending order by their dissimilarity score.
unique_neighbors: a vector of the indices in Xr identified as neighbors of any observation in Xr (or in Xu, in case it was supplied). This is obtained by converting the neighbors matrix into a vector and applying the unique function.
k_diss_info: a data.table that is returned only if the k_diss argument was used. It comprises three columns: the first one (Xr_index or Xu_index) indicates the index of the observations in Xr (or in Xu, in case it was supplied), the second column (n_k) indicates the number of neighbors found in Xr and the third column (final_n_k) indicates the final number of neighbors selected, bounded by the k_range argument.
dissimilarity: if return_dissimilarity = TRUE, the dissimilarity object used (as computed by the dissimilarity function).
projection: an ortho_projection object. Only output if return_projection = TRUE and if diss_method = "pca", diss_method = "pca.nipals" or diss_method = "pls". This object contains the projection used to compute the dissimilarity matrix. In case of local dissimilarity matrices, the projection corresponds to the global projection used to select the neighborhoods (see the ortho_diss function for further details).
Author(s)
References
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.
See Also
dissimilarity
ortho_diss
cor_diss
f_diss
sid
mbl
Examples
library(prospectr)
data(NIRsoil)
Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)]
Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]
Xu <- Xu[!is.na(Yu), ]
Yu <- Yu[!is.na(Yu)]
Xr <- Xr[!is.na(Yr), ]
Yr <- Yr[!is.na(Yr)]
# Identify the neighbor observations using the correlation dissimilarity and
# default parameters
# (In this example all the observations in Xr belong at least to the
# first 100 neighbors of one observation in Xu)
ex1 <- search_neighbors(
Xr = Xr, Xu = Xu,
diss_method = "cor",
k = 40
)
# Identify the neighbor observations using principal component (PC)
# and partial least squares (PLS) dissimilarities, and using the "opc"
# approach for selecting the number of components
ex2 <- search_neighbors(
Xr = Xr, Xu = Xu,
diss_method = "pca",
Yr = Yr, k = 50,
pc_selection = list("opc", 40),
scale = TRUE
)
# Observations that do not belong to any neighborhood
seq(1, nrow(Xr))[!seq(1, nrow(Xr)) %in% ex2$unique_neighbors]
ex3 <- search_neighbors(
Xr = Xr, Xu = Xu,
diss_method = "pls",
Yr = Yr, k = 50,
pc_selection = list("opc", 40),
scale = TRUE
)
# Observations that do not belong to any neighborhood
seq(1, nrow(Xr))[!seq(1, nrow(Xr)) %in% ex3$unique_neighbors]
# Identify the neighbor observations using local PLS dissimilarities
# Here, 150 neighbors are used to compute a local dissimilarity matrix
# and then this matrix is used to select 50 neighbors
ex4 <- search_neighbors(
Xr = Xr, Xu = Xu,
diss_method = "pls",
Yr = Yr, k = 50,
pc_selection = list("opc", 40),
scale = TRUE,
.local = TRUE,
pre_k = 150
)
A function for computing the spectral information divergence between spectra (sid)
Description
This function computes the spectral information divergence/dissimilarity between spectra based on the Kullback-Leibler divergence algorithm (see details).
Usage
sid(Xr, Xu = NULL,
mode = "density",
center = FALSE, scale = FALSE,
kernel = "gaussian",
n = if(mode == "density") round(0.5 * ncol(Xr)),
bw = "nrd0",
reg = 1e-04,
...)
Arguments
Xr |
a matrix containing the spectral (reference) data. |
Xu |
an optional matrix containing the spectral data of a second set of observations. |
mode |
the method to be used for computing the spectral information
divergence. Options are |
center |
a logical indicating if the computations must be carried out on
the centred |
scale |
a logical indicating if the computations must be carried out on
the variance scaled |
kernel |
if |
n |
if |
bw |
if |
reg |
a numerical value larger than 0 which indicates a regularization parameter. Values (probabilities) below this threshold are replaced by this value for numerical stability. Default is 1e-4. |
... |
additional arguments to be passed to the
|
Details
This function computes the spectral information divergence (distance)
between spectra.
When mode = "density"
, the function first computes the probability
distribution of each spectrum which result in a matrix of density
distribution estimates. The density distributions of all the observations in
the datasets are compared based on the Kullback-Leibler divergence algorithm.
When mode = "feature"
, the Kullback-Leibler divergence between all
the observations is computed directly on the spectral variables.
The spectral information divergence (SID) algorithm (Chang, 2000) uses the
Kullback-Leibler divergence (\(KL\)) or relative entropy
(Kullback and Leibler, 1951) to account for the vis-NIR information provided
by each spectrum. The SID between two spectra (\(x_{i}\) and
\(x_{j}\)) is computed as follows:
\[sid(x_i, x_j) = KL(x_i \parallel x_j) + KL(x_j \parallel x_i) = \sum_{l=1}^{k} p_l \log\left(\frac{p_l}{q_l}\right) + \sum_{l=1}^{k} q_l \log\left(\frac{q_l}{p_l}\right)\]
where \(k\) represents the number of variables or spectral features, and \(p\) and \(q\) are the probability vectors of \(x_{i}\) and \(x_{j}\) respectively, which are calculated as:
\[p = \frac{x_i}{\sum_{l=1}^{k} x_{i,l}}\]
\[q = \frac{x_j}{\sum_{l=1}^{k} x_{j,l}}\]
From the above equations it can be seen that the original SID algorithm
assumes that all the components in the data matrices are nonnegative.
Therefore centering cannot be applied when mode = "feature"
. If a
data matrix with negative values is provided and mode = "feature"
,
the sid
function automatically scales the matrix as follows:
\[X_{s} = \frac{X-min(X)}{max(X)-min(X)}\]
or
\[X_{s} = \frac{X-min(X, Xu)}{max(X, Xu)-min(X, Xu)}\]
\[Xu_{s} = \frac{Xu-min(X, Xu)}{max(X, Xu)-min(X, Xu)}\]
if Xu
is specified. The 0 values are replaced by a regularization
parameter (reg
argument) for numerical stability.
The default of the sid
function is to compute the SID based on the
density distributions of the spectra (mode = "density"
). For each
spectrum in X
the density distribution is computed using the
density
function of the stats
package.
The 0 values of the estimated density distributions of the spectra are
replaced by a regularization parameter ("reg"
argument) for numerical
stability. Finally, the divergence between the computed spectral histograms
is computed using the SID algorithm. Note that if mode = "density"
,
the sid
function will accept negative values and matrix centering
will be possible.
Value
a list
with the following components:
sid: if only "X" is specified (i.e. Xu = NULL), a square symmetric matrix of SID distances between all the components in "X". If both "X" and "Xu" are specified, a matrix of SID distances between the components in "X" and the components in "Xu", where the rows represent the objects in "X" and the columns represent the objects in "Xu".
Xr: the (centered and/or scaled if specified) spectral X matrix.
Xu: the (centered and/or scaled if specified) spectral Xu matrix.
densityDisXr: if mode = "density", the computed density distributions of Xr.
densityDisXu: if mode = "density", the computed density distributions of Xu.
Author(s)
Leonardo Ramirez-Lopez
References
Chang, C.I. 2000. An information theoretic-based approach to spectral variability, similarity and discriminability for hyperspectral image analysis. IEEE Transactions on Information Theory 46, 1927-1932.
See Also
Examples
library(prospectr)
data(NIRsoil)
Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)]
Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]
Xu <- Xu[!is.na(Yu), ]
Xr <- Xr[!is.na(Yr), ]
# Example 1
# Compute the SID distance between all the observations in Xr
xr_sid <- sid(Xr)
xr_sid
# Example 2
# Compute the SID distance between the observations in Xr and the observations
# in Xu
xr_xu_sid <- sid(Xr, Xu)
xr_xu_sid
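As discussed in the Details section, the divergence can also be computed directly on the spectral variables; a short sketch using the same data:
# Example 3
# Compute the SID distance directly on the spectral variables
# (mode = "feature"; matrices with negative values are internally
# min-max scaled as described in the details)
xr_xu_sid_feature <- sid(Xr, Xu, mode = "feature")
xr_xu_sid_feature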
A function for evaluating dissimilarity matrices (sim_eval)
Description
This function searches for the most similar observation (closest neighbor) of each observation in a given dataset based on a dissimilarity (e.g. distance matrix). The observations are compared against their corresponding closest observations in terms of their side information provided. The root mean square of differences and the correlation coefficient are used for continuous variables and for discrete variables the kappa index is used.
Usage
sim_eval(d, side_info)
Arguments
d |
a symmetric matrix of dissimilarity scores between observations of a given dataset. Alternatively, a vector with the dissimilarity scores of the lower triangle (without the diagonal values) can be used (see details). |
side_info |
a matrix containing the side information corresponding to the observations in the dataset from which the dissimilarity matrix was computed. It can be either a numeric matrix with one or multiple columns/variables or a matrix with one character variable (discrete variable). If it is numeric, the root mean square of differences is used for assessing the similarity between the observations and their corresponding most similar observations in terms of the side information provided. If it is a character variable, then the kappa index is used. See details. |
Details
For the evaluation of dissimilarity matrices this function uses side
information (information about one variable which is available for a
group of observations, Ramirez-Lopez et al., 2013). It is assumed that there
is a (direct or indirect) correlation between this side informative variable
and the variables from which the dissimilarity was computed.
If side_info
is numeric, the root mean square of differences (RMSD)
is used for assessing the similarity between the observations and their
corresponding most similar observations in terms of the side information
provided. It is computed as follows:
\[j(i) = NN(xr_i, Xr^{-i})\]
\[RMSD = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (y_i - y_{j(i)})^2}\]
where \(NN(xr_i, Xr^{-i})\) represents a function to obtain the index of the nearest neighbor observation found in \(Xr\) (excluding the \(i\)th observation) for \(xr_i\), \(y_{i}\) is the value of the side variable of the \(i\)th observation, \(y_{j(i)}\) is the value of the side variable of the nearest neighbor of the \(i\)th observation and \(m\) is the total number of observations.
If side_info
is a factor the kappa index (\(\kappa\)) is
used instead of the RMSD. It is computed as follows:
\[\kappa = \frac{p_{o} - p_{e}}{1 - p_{e}}\]
where both \(p_o\) and \(p_e\) are two different agreement indices between the side information of the observations and the side information of their corresponding nearest observations (i.e. most similar observations). While \(p_o\) is the relative agreement, \(p_e\) is the agreement expected by chance.
This functions accepts vectors to be passed to argument d
, in this
case, the vector must represent the lower triangle of a dissimilarity matrix
(e.g. as returned by the stats::dist()
function of stats
).
Value
sim_eval
returns a list with the following components:
"
eval
: either the RMSD (and the correlation coefficient) or the kappa indexfirst_nn
: a matrix containing the original side informative variable in the first half of the columns, and the side informative values of the corresponding nearest neighbors in the second half of the columns.
Author(s)
References
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex datasets. Geoderma 195-196, 268-279.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.
Examples
library(prospectr)
data(NIRsoil)
sg <- savitzkyGolay(NIRsoil$spc, p = 3, w = 11, m = 0)
# Replace the original spectra with the filtered ones
NIRsoil$spc <- sg
Yr <- NIRsoil$Nt[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]
# Example 1
# Compute a principal components distance
pca_d <- ortho_diss(Xr, pc_selection = list("manual", 8))$dissimilarity
# Example 1.1
# Evaluate the distance matrix on the basis of the
# side information (Yr) associated with Xr
se <- sim_eval(pca_d, side_info = as.matrix(Yr))
# The final evaluation results
se$eval
# The final values of the side information (Yr) and the values of
# the side information corresponding to the first nearest neighbors
# found by using the distance matrix
se$first_nn
# Example 1.2
# Evaluate the distance matrix on the basis of two side
# information (Yr and Yr2)
# variables associated with Xr
Yr_2 <- NIRsoil$CEC[as.logical(NIRsoil$train)]
se_2 <- sim_eval(d = pca_d, side_info = cbind(Yr, Yr_2))
# The final evaluation results
se_2$eval
# The final values of the side information variables and the values
# of the side information variables corresponding to the first
# nearest neighbors found by using the distance matrix
se_2$first_nn
# Example 2
# Evaluate the distances produced by retaining different number of
# principal components (this is the same principle used in the
# optimized principal components approach ("opc"))
# first project the data
pca_2 <- ortho_projection(Xr, pc_selection = list("manual", 30))
results <- matrix(NA, pca_2$n_components, 3)
colnames(results) <- c("pcs", "rmsd", "r")
results[, 1] <- 1:pca_2$n_components
for (i in 1:pca_2$n_components) {
ith_d <- f_diss(pca_2$scores[, 1:i, drop = FALSE], scale = TRUE)
ith_eval <- sim_eval(ith_d, side_info = as.matrix(Yr))
results[i, 2:3] <- as.vector(ith_eval$eval)
}
plot(results)
# Example 3
# Example 3.1
# Evaluate a dissimilarity matrix computed using the correlation
# method
cd <- cor_diss(Xr)
eval_corr_diss <- sim_eval(cd, side_info = as.matrix(Yr))
eval_corr_diss$eval
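As noted in the Details section, d can also be passed as the lower-triangle vector of a dissimilarity matrix; a minimal sketch using stats::dist() on a few of the principal component scores computed above (the number of components is illustrative only):
# Example 4
# Pass the lower triangle of a (Euclidean) distance matrix as a vector
d_vec <- as.vector(dist(pca_2$scores[, 1:5]))
eval_dist_vec <- sim_eval(d_vec, side_info = as.matrix(Yr))
eval_dist_vec$eval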
Square root of (square) symmetric matrices
Description
For internal use only
Usage
sqrt_sm(X, method = c("svd", "eigen"))
A function to compute row-wise index of minimum values of a square distance matrix
Description
For internal use only
Usage
which_min(X)
Arguments
X |
a square matrix of distances |
Details
Used internally to find the nearest neighbors
Value
a vector of the indices of the minimum value in each row of the input matrix
Author(s)
Antoine Stevens
A function to compute indices of minimum values of a distance vector
Description
For internal use only
Usage
which_min_vector(X)
Arguments
X |
a vector of distances |
Details
Used internally to find the nearest neighbors.
It searches in a lower (or upper) triangular matrix; therefore this must be the format of the
input data. The piece of code int len = (sqrt(X.size()*8+1)+1)/2
generated an error in CRAN
since sqrt
cannot be applied to integers.
Value
a vector of the indices of the nearest neighbors
Author(s)
Antoine Stevens