Type: Package
Title: Memory-Based Learning in Spectral Chemometrics
Version: 2.2.4
Date: 2025-05-19
Maintainer: Leonardo Ramirez-Lopez <ramirez.lopez.leo@gmail.com>
BugReports: https://github.com/l-ramirez-lopez/resemble/issues
Description: Functions for dissimilarity analysis and memory-based learning (MBL, a.k.a local modeling) in complex spectral data sets. Most of these functions are based on the methods presented in Ramirez-Lopez et al. (2013) <doi:10.1016/j.geoderma.2012.12.014>.
License: MIT + file LICENSE
URL: http://l-ramirez-lopez.github.io/resemble/
Depends: R (≥ 3.5.0)
Imports: foreach, iterators, Rcpp (≥ 1.0.3), mathjaxr (≥ 1.0), magrittr (≥ 1.5.0), lifecycle (≥ 0.2.0), data.table (≥ 1.9.8)
Suggests: prospectr, parallel, doParallel, testthat, formatR, rmarkdown, bookdown, knitr
LinkingTo: Rcpp, RcppArmadillo
RdMacros: mathjaxr
VignetteBuilder: knitr
NeedsCompilation: yes
Repository: CRAN
RoxygenNote: 7.3.2
Encoding: UTF-8
Config/VersionName: olbap
Packaged: 2025-05-19 22:10:07 UTC; leo
Author: Leonardo Ramirez-Lopez [aut, cre], Antoine Stevens [aut, ctb], Claudio Orellano [ctb], Raphael Viscarra Rossel [ctb], Alex Wadoux [ctb]
Date/Publication: 2025-05-19 22:30:02 UTC

Overview of the functions in the resemble package

Description

Maturing lifecycle

Functions for memory-based learning

Details

This is version 2.2.4 (olbap) of the package. It implements a number of functions useful for modeling complex spectral data (e.g. NIR, IR). The package includes functions for dimensionality reduction, computing spectral dissimilarity matrices, nearest neighbor search, and modeling spectral data using memory-based learning. This package builds upon the methods presented in Ramirez-Lopez et al. (2013) doi:10.1016/j.geoderma.2012.12.014.

Development versions can be found in the GitHub repository of the package at https://github.com/l-ramirez-lopez/resemble.

The functions available for dimensionality reduction are:

The functions available for computing dissimilarity matrices are:

The functions available for evaluating dissimilarity matrices are:

The functions available for nearest neighbor search:

The functions available for modeling spectral data:

Other supplementary functions:

Author(s)

Maintainer / Creator: Leonardo Ramirez-Lopez ramirez.lopez.leo@gmail.com

Authors:

References

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.

See Also

Useful links:


Print method for an object of class local_ortho_diss

Description

prints the subsets of local_ortho_diss objects

Usage

## S3 method for class 'local_ortho_diss'
x[rows, columns, drop = FALSE, ...]

Arguments

x

local_ortho_diss matrix

rows

the indices of the rows

columns

the indices of the columns

drop

drop argument

...

not used


checks the pc_selection argument

Description

internal

Usage

check_pc_arguments(
  n_rows_x,
  n_cols_x,
  pc_selection,
  default_max_comp = 40,
  default_max_cumvar = 0.99,
  default_max_var = 0.01
)

Correlation and moving correlation dissimilarity measurements (cor_diss)

Description

Stable lifecycle

Computes correlation and moving correlation dissimilarity matrices.

Usage

cor_diss(Xr, Xu = NULL, ws = NULL,
         center = TRUE, scale = FALSE)

Arguments

Xr

a matrix.

Xu

an optional matrix containing data of a second set of observations.

ws

for moving correlation dissimilarity, an odd integer value which specifies the window size. If ws = NULL, then the window size will be equal to the number of variables (columns), i.e. the standard correlation will be used instead of the moving correlation. See details.

center

a logical indicating if the spectral data Xr (and Xu if specified) must be centered. If Xu is provided, the data is centered on the basis of \(Xr \cup Xu\).

scale

a logical indicating if Xr (and Xu if specified) must be scaled. If Xu is provided the data is scaled on the basis of \(Xr \cup Xu\).

Details

The correlation dissimilarity \(d\) between two observations \(x_i\) and \(x_j\) is based on Pearson's correlation coefficient (\(\rho\)) and is computed as follows:

\[d(x_i, x_j) = \frac{1}{2}(1 - \rho(x_i, x_j))\]

The above formula is used when ws = NULL. On the other hand (when ws != NULL) the moving correlation dissimilarity between two observations \(x_i\) and \(x_j\) is computed as follows:

\[d(x_i, x_j; ws) = \frac{1}{2 ws}\sum_{k=1}^{p-ws}\left(1 - \rho(x_{i,(k:k+ws)}, x_{j,(k:k+ws)})\right)\]

where \(ws\) represents a given window size which rolls sequentially from 1 up to \(p - ws\) and \(p\) is the number of variables of the observations.

The function does not accept input data containing missing values.
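
The two formulas above can be transcribed into base R as follows. This is a minimal sketch rather than the package's implementation: the helper names cor_diss_sketch and moving_cor_diss_pair are hypothetical, and the window indexing simply follows the formula as written (windows k to k + ws, for k = 1, ..., p - ws).

```r
# Correlation dissimilarity between all rows of X: d = (1 - rho) / 2
cor_diss_sketch <- function(X) {
  rho <- cor(t(X)) # Pearson correlation between observations (rows)
  (1 - rho) / 2
}

# Moving correlation dissimilarity between two observations,
# transcribed literally from the formula (hypothetical helper)
moving_cor_diss_pair <- function(xi, xj, ws) {
  p <- length(xi)
  total <- 0
  for (k in seq_len(p - ws)) {
    idx <- k:(k + ws)
    total <- total + (1 - cor(xi[idx], xj[idx]))
  }
  total / (2 * ws)
}

set.seed(1)
X <- matrix(rnorm(5 * 30), nrow = 5) # 5 observations, 30 variables
d <- cor_diss_sketch(X)
d_mov <- moving_cor_diss_pair(X[1, ], X[2, ], ws = 11)
```

With ws = NULL, cor_diss corresponds to the first helper; the second helper only illustrates how the window rolls along the variables.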

Value

a matrix of the computed dissimilarities.

Author(s)

Antoine Stevens and Leonardo Ramirez-Lopez

Examples


library(prospectr)
data(NIRsoil)

Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]

cor_diss(Xr = Xr)

cor_diss(Xr = Xr, Xu = Xu)

cor_diss(Xr = Xr, ws = 41)

cor_diss(Xr = Xr, Xu = Xu, ws = 41)


From dissimilarity matrix to neighbors

Description

internal

Usage

diss_to_neighbors(
  diss_matrix,
  k = NULL,
  k_diss = NULL,
  k_range = NULL,
  spike = NULL,
  return_dissimilarity = FALSE,
  skip_first = FALSE
)

Arguments

diss_matrix

a matrix representing the dissimilarities between observations in a matrix Xu and observations in another matrix Xr, with Xr in rows and Xu in columns.

k

an integer value indicating the k-nearest neighbors of each observation in Xu that must be selected from Xr.

k_diss

an integer value indicating a dissimilarity threshold. For each observation in Xu, its nearest neighbors in Xr are selected as those for which their dissimilarity to Xu is below this k_diss threshold. This threshold depends on the corresponding dissimilarity metric specified in diss_method. Either k or k_diss must be specified.

k_range

an integer vector of length 2 which specifies the minimum (first value) and the maximum (second value) number of neighbors to be retained when the k_diss is given.

spike

a vector of integers indicating what observations in Xr (and Yr) must be 'forced' to always be part of all the neighborhoods.

return_dissimilarity

logical indicating if the input dissimilarity must be mirrored in the output.

skip_first

a logical indicating whether to skip the first neighbor or not. Default is FALSE. This is used when the search is being conducted in a symmetric matrix of distances (i.e. to avoid that the nearest neighbor of each observation is itself).
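
The two selection modes described by k and by k_diss with k_range can be sketched in base R. This is an illustration under assumptions, not the internal code: select_by_threshold is a hypothetical helper, and clamping the neighbor count to k_range is one reading of "minimum and maximum number of neighbors to be retained".

```r
set.seed(42)
# Toy dissimilarity matrix: 6 Xr observations (rows), 3 Xu observations (columns)
diss <- matrix(runif(6 * 3), nrow = 6, ncol = 3)

# Fixed k: indices of the k least dissimilar Xr rows for each Xu column
k <- 2
knn_idx <- apply(diss, 2, function(d) order(d)[seq_len(k)])

# Threshold: keep the Xr rows below k_diss, with the count clamped to k_range
select_by_threshold <- function(d, k_diss, k_range) {
  n_below <- sum(d < k_diss)
  n_keep <- min(max(n_below, k_range[1]), k_range[2])
  order(d)[seq_len(n_keep)]
}
thr_idx <- lapply(
  seq_len(ncol(diss)),
  function(j) select_by_threshold(diss[, j], k_diss = 0.3, k_range = c(1, 4))
)
```

Each column of knn_idx lists the neighbors of one Xu observation, nearest first; thr_idx is a list because the neighborhood size can differ between observations.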


Dissimilarity computation between matrices

Description

This is a wrapper that integrates the different dissimilarity functions offered by the package. It computes the dissimilarities between observations in numerical matrices by using a specified dissimilarity measure.

Usage

dissimilarity(Xr, Xu = NULL,
              diss_method = c("pca", "pca.nipals", "pls", "mpls",
                              "cor", "euclid", "cosine", "sid"),
              Yr = NULL, gh = FALSE, pc_selection = list("var", 0.01),
              return_projection = FALSE, ws = NULL,
              center = TRUE, scale = FALSE, documentation = character(),
              ...)

Arguments

Xr

a matrix containing n observations/rows and p variables/columns.

Xu

an optional matrix containing data of a second set of observations with p variables/columns.

diss_method

a character string indicating the method to be used to compute the dissimilarities between observations. Options are:

  • "pca": Mahalanobis distance computed on the matrix of scores of a Principal Component (PC) projection of Xr (and Xu if provided). PC projection is done using the singular value decomposition (SVD) algorithm. See ortho_diss function.

  • "pca.nipals": Mahalanobis distance computed on the matrix of scores of a Principal Component (PC) projection of Xr (and Xu if provided). PC projection is done using the non-linear iterative partial least squares (nipals) algorithm. See ortho_diss function.

  • "pls": Mahalanobis distance computed on the matrix of scores of a partial least squares projection of Xr (and Xu if provided). In this case, Yr is always required. See ortho_diss function.

  • "mpls": Mahalanobis distance computed on the matrix of scores of a modified partial least squares projection (Shenk and Westerhaus, 1991; Westerhaus, 2014) of Xr (and Xu if provided). In this case, Yr is always required. See ortho_diss function.

  • "cor": based on the correlation coefficient between observations. See cor_diss function.

  • "euclid": Euclidean distance between observations. See f_diss function.

  • "cosine": Cosine distance between observations. See f_diss function.

  • "sid": spectral information divergence between observations. See sid function.

Yr

a numeric matrix of n observations used as side information of Xr for the ortho_diss methods (i.e. pca, pca.nipals or pls). It is required when:

  • diss_method = "pls"

  • diss_method = "pca" with "opc" used as the method in the pc_selection argument. See ortho_diss.

  • gh = TRUE

gh

a logical indicating if the Mahalanobis distance (in the pls score space) between each observation and the pls centre/mean must be computed.

pc_selection

a list of length 2 to be passed onto the ortho_diss methods. It is required if the method selected in diss_method is any of "pca", "pca.nipals" or "pls" or if gh = TRUE. This argument is used for optimizing the number of components (principal components or pls factors) to be retained. This list must contain two elements in the following order: method (a character indicating the method for selecting the number of components) and value (a numerical value that complements the selected method). The methods available are:

  • "opc": optimized principal component selection based on Ramirez-Lopez et al. (2013a, 2013b). The optimal number of components (of a set of observations) is the one for which its distance matrix minimizes the differences between the Yr value of each observation and the Yr value of its closest observation. In this case, value must be a value (larger than 0 and below the minimum dimension of Xr or Xr and Xu combined) indicating the maximum number of principal components to be tested. See the ortho_projection function for more details.

  • "cumvar": selection of the principal components based on a given cumulative amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of cumulative variance that the combination of retained components should explain.

  • "var": selection of the principal components based on a given amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of variance that a single component should explain in order to be retained.

  • "manual": for manually specifying a fixed number of principal components. In this case, value must be a value (larger than 0 and below the minimum dimension of Xr or Xr and Xu combined) indicating the number of components to be retained.

The default is list(method = "var", value = 0.01).

Optionally, the pc_selection argument admits "opc" or "cumvar" or "var" or "manual" as a single character string. In such a case the default "value" when either "opc" or "manual" are used is 40. When "cumvar" is used the default "value" is set to 0.99 and when "var" is used, the default "value" is set to 0.01.

return_projection

a logical indicating if the projection(s) must be returned. Projections are used if the ortho_diss methods are called (i.e. diss_method = "pca", diss_method = "pca.nipals" or diss_method = "pls") or when gh = TRUE. In case gh = TRUE and a ortho_diss method is used (in the diss_method argument), both projections are returned.

ws

an odd integer value which specifies the window size, when diss_method = "cor" (cor_diss method) for moving correlation dissimilarity. If ws = NULL (default), then the window size will be equal to the number of variables (columns), i.e. the standard correlation will be used instead of the moving correlation. See cor_diss function.

center

a logical indicating if Xr (and Xu if provided) must be centered. If Xu is provided the data is centered around the mean of the pooled Xr and Xu matrices (\(Xr \cup Xu\)). For dissimilarity computations based on diss_method = "pls", the data is always centered.

scale

a logical indicating if Xr (and Xu if provided) must be scaled. If Xu is provided the data is scaled based on the standard deviation of the pooled Xr and Xu matrices (\(Xr \cup Xu\)). If center = TRUE, scaling is applied after centering.

documentation

an optional character string that can be used to describe anything related to the mbl call (e.g. description of the input data). Default: character(). NOTE: this is an experimental argument.

...

other arguments passed to the dissimilarity functions (ortho_diss, cor_diss, f_diss or sid).

Details

This function is a wrapper for ortho_diss, cor_diss, f_diss, sid. Check the documentation of these functions for further details.
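
The component-selection rules described for pc_selection can be illustrated with base R. This is a rough sketch under assumptions, not the code used by ortho_diss: it relies on stats::prcomp(), uses toy data, and approximates the "opc" search with Euclidean distances on standardized scores. With "manual", the number of components is simply fixed by the user.

```r
set.seed(7)
# Toy data: 50 observations, 10 variables with decreasing variance
X <- matrix(rnorm(50 * 10), nrow = 50) %*% diag(10:1)
y <- X[, 1] + rnorm(50, sd = 0.5) # toy side information (plays the role of Yr)

pca <- prcomp(X, center = TRUE, scale. = FALSE)
explained <- pca$sdev^2 / sum(pca$sdev^2)

# "var": keep every component that explains at least `value` of the variance
n_var <- sum(explained >= 0.01)

# "cumvar": smallest number of components reaching the cumulative target
n_cumvar <- which(cumsum(explained) >= 0.99)[1]

# "opc": for each candidate number of components, find the nearest neighbor
# of every observation in the standardized score space and measure how much
# the side information of each pair differs; keep the minimizer
max_comp <- 8
rmsd <- sapply(seq_len(max_comp), function(n) {
  scores <- scale(pca$x[, seq_len(n), drop = FALSE])
  d <- as.matrix(dist(scores))
  diag(d) <- Inf # an observation cannot be its own neighbor
  nn <- apply(d, 1, which.min)
  sqrt(mean((y - y[nn])^2))
})
n_opc <- which.min(rmsd)
```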

Value

A list with the following components:

Author(s)

Leonardo Ramirez-Lopez

References

Shenk, J., Westerhaus, M., and Berzaghi, P. 1997. Investigation of a LOCAL calibration procedure for near infrared instruments. Journal of Near Infrared Spectroscopy, 5, 223-232.

Westerhaus, M. 2014. Eastern Analytical Symposium Award for outstanding achievements in near infrared spectroscopy: my contributions to near infrared spectroscopy. NIR news, 25(8), 16-20.

See Also

ortho_diss, cor_diss, f_diss, sid.

Examples

library(prospectr)
data(NIRsoil)

# Filter the data using the first derivative with Savitzky and Golay
# smoothing filter and a window size of 15 spectral variables and a
# polynomial order of 4
sg <- savitzkyGolay(NIRsoil$spc, m = 1, p = 4, w = 15)

# Replace the original spectra with the filtered ones
NIRsoil$spc <- sg

Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)]

Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]

Xu <- Xu[!is.na(Yu), ]
Xr <- Xr[!is.na(Yr), ]

Yu <- Yu[!is.na(Yu)]
Yr <- Yr[!is.na(Yr)]

dsm_pca <- dissimilarity(
  Xr = Xr, Xu = Xu,
  diss_method = c("pca"),
  Yr = Yr, gh = TRUE,
  pc_selection = list("opc", 30),
  return_projection = TRUE
)

A function for transforming a matrix from its Euclidean space to its Mahalanobis space

Description

For internal use only

Usage

euclid_to_mahal(X, sm_method = c("svd", "eigen"))

evaluation of multiple distances obtained with multiple PCs

Description

internal

Usage

eval_multi_pc_diss(
  scores,
  side_info,
  from = 1,
  to = ncol(scores),
  steps = 1,
  method = c("pc", "pls"),
  check_dims = TRUE
)

Euclidean, Mahalanobis and cosine dissimilarity measurements

Description

Stable lifecycle

This function is used to compute the dissimilarity between observations based on Euclidean or Mahalanobis distance measures or on cosine dissimilarity measures (a.k.a spectral angle mapper).

Usage

f_diss(Xr, Xu = NULL, diss_method = "euclid",
       center = TRUE, scale = FALSE)

Arguments

Xr

a matrix containing the (reference) data.

Xu

an optional matrix containing data of a second set of observations (samples).

diss_method

the method for computing the dissimilarity between observations. Options are "euclid" (Euclidean distance), "mahalanobis" (Mahalanobis distance) and "cosine" (cosine distance, a.k.a spectral angle mapper). See details.

center

a logical indicating if the spectral data Xr (and Xu if specified) must be centered. If Xu is provided, the data is centered on the basis of \(Xr \cup Xu\).

scale

a logical indicating if Xr (and Xu if specified) must be scaled. If Xu is provided the data is scaled on the basis of \(Xr \cup Xu\).

Details

The results obtained for the Euclidean dissimilarity are equivalent to those returned by the stats::dist() function, but are scaled differently: in f_diss, the squared dissimilarity scores are divided by the number of variables before the square root is taken. In addition, f_diss is considerably faster (which can be advantageous when computing dissimilarities for very large matrices). See the examples section for a comparison between stats::dist() and f_diss.

In the case of both the Euclidean and Mahalanobis distances, the scaled dissimilarity matrix \(D\) between observations in a given matrix \(X\) is computed as follows:

\[d(x_i, x_j)^{2} = \sum (x_i - x_j)M^{-1}(x_i - x_j)^{\mathrm{T}}\]

\[d_{scaled}(x_i, x_j) = \sqrt{\frac{1}{p}d(x_i, x_j)^{2}}\]

where \(p\) is the number of variables in \(X\), \(M\) is the identity matrix in the case of the Euclidean distance and the variance-covariance matrix of \(X\) in the case of the Mahalanobis distance. The Mahalanobis distance can also be viewed as the Euclidean distance after applying a linear transformation of the original variables. Such a linear transformation is done by using a factorization of the inverse covariance matrix as \(M^{-1} = W^{\mathrm{T}}W\), where \(W\) is merely the square root of \(M^{-1}\), which can be found by using a singular value decomposition.

Note that when attempting to compute the Mahalanobis distance on a dataset with highly correlated variables (i.e. spectral variables) the variance-covariance matrix may result in a singular matrix which cannot be inverted and therefore the distance cannot be computed. This is also the case when the number of observations in the dataset is smaller than the number of variables.

For the computation of the Mahalanobis distance, the factorization method described above is used.

The cosine dissimilarity \(c\) between two observations \(x_i\) and \(x_j\) is computed as follows:

\[c(x_i, x_j) = cos^{-1}{\frac{\sum_{k=1}^{p}x_{i,k} x_{j,k}}{\sqrt{\sum_{k=1}^{p} x_{i,k}^{2}} \sqrt{\sum_{k=1}^{p} x_{j,k}^{2}}}}\]

where \(p\) is the number of variables of the observations. The function does not accept input data containing missing values. NOTE: The computed distances are divided by the number of variables/columns in Xr.
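
The formulas in this section can be reproduced with base R. This is a minimal sketch for illustration, not the C++ code behind f_diss: cosine_diss_pair is a hypothetical helper that returns the raw angle, and the Mahalanobis result is cross-checked against stats::mahalanobis().

```r
set.seed(3)
X <- matrix(rnorm(8 * 4), nrow = 8) # 8 observations, 4 variables
p <- ncol(X)

# Scaled Euclidean dissimilarity: sqrt(squared distance / p)
d_euclid <- sqrt(as.matrix(dist(X))^2 / p)

# Mahalanobis as Euclidean after a linear transformation:
# W is the square root of the inverse covariance matrix, obtained via SVD
M <- cov(X)
s <- svd(M)
W <- s$u %*% diag(1 / sqrt(s$d)) %*% t(s$u)
d_mahal <- sqrt(as.matrix(dist(X %*% W))^2 / p)

# Cross-check the first column against stats::mahalanobis()
ref <- sqrt(mahalanobis(X, center = X[1, ], cov = M) / p)

# Cosine dissimilarity (spectral angle) between two observations
cosine_diss_pair <- function(xi, xj) {
  cs <- sum(xi * xj) / (sqrt(sum(xi^2)) * sqrt(sum(xj^2)))
  acos(min(1, max(-1, cs))) # clamp to guard against rounding error
}
```

The toy matrix has more observations than variables, so its covariance matrix can be inverted; with real spectra this is often not the case, as noted above.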

Value

a matrix of the computed dissimilarities.

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens

Examples


library(prospectr)
data(NIRsoil)

Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]

# Euclidean distances between all the observations in Xr

ed <- f_diss(Xr = Xr, diss_method = "euclid")

# Equivalence with the dist() function of R base
ed_dist <- (as.matrix(dist(Xr))^2 / ncol(Xr))^0.5
round(ed_dist - ed, 5)

# Comparing the computational time
iter <- 20
tm <- proc.time()
for (i in 1:iter) {
  f_diss(Xr)
}
f_diss_time <- proc.time() - tm

tm_2 <- proc.time()
for (i in 1:iter) {
  dist(Xr)
}
dist_time <- proc.time() - tm_2

f_diss_time
dist_time

# Euclidean distances between observations in Xr and observations in Xu
ed_xr_xu <- f_diss(Xr, Xu)

# Mahalanobis distance computed on the first 20 spectral variables
md_xr_xu <- f_diss(Xr[, 1:20], Xu[, 1:20], "mahalanobis")

# Cosine dissimilarity matrix
cdiss_xr_xu <- f_diss(Xr, Xu, "cosine")


A fast distance algorithm for two matrices written in C++

Description

Computes distances between two data matrices using "euclid", "cor", "cosine"

Usage

fast_diss(X, Y, method)

Arguments

X

a matrix

Y

a matrix

method

a string with possible values "euclid", "cor", "cosine"

Value

a distance matrix

Author(s)

Antoine Stevens and Leonardo Ramirez-Lopez


A fast algorithm of (squared) Euclidean cross-distance for vectors written in C++

Description

A fast (parallel for Linux) algorithm of (squared) Euclidean cross-distance for vectors written in C++

Usage

fast_diss_vector(X)

Arguments

X

a vector.

Details

used internally in ortho_projection

Value

a vector of distance (lower triangle of the distance matrix, stored by column)
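
The storage layout described above (lower triangle of the distance matrix, stored by column) can be illustrated in base R. This sketch assumes squared Euclidean distances, as the title suggests; it does not call the C++ routine.

```r
x <- c(1, 4, 6, 10)

# Full matrix of squared differences between all pairs of elements
d_full <- outer(x, x, function(a, b) (a - b)^2)

# Lower triangle extracted column by column:
# pairs (2,1), (3,1), (4,1), (3,2), (4,2), (4,3)
d_lower <- d_full[lower.tri(d_full)]
```

The same ordering is used by stats::dist(), so as.vector(dist(x))^2 gives an identical vector.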

Author(s)

Antoine Stevens


Local multivariate regression

Description

internal

Usage

fit_and_predict(
  x,
  y,
  pred_method,
  scale = FALSE,
  weights = NULL,
  newdata,
  pls_c = NULL,
  CV = FALSE,
  tune = FALSE,
  number = 10,
  p = 0.75,
  group = NULL,
  noise_variance = 0.001,
  range_prediction_limits = TRUE,
  pls_max_iter = 1,
  pls_tol = 1e-06,
  modified = FALSE,
  seed = NULL
)

format internal messages

Description

internal

Usage

format_xr_xu_indices(xr_xu_names)

Arguments

xr_xu_names

the names of Xr and Xu


Cross validation for Gaussian process regression

Description

internal

Usage

gaussian_pr_cv(
  x,
  y,
  scale,
  weights = NULL,
  p = 0.75,
  number = 10,
  group = NULL,
  noise_variance = 0.001,
  retrieve = c("final_model", "none"),
  seed = NULL
)

Gaussian process regression with linear kernel (gaussian_process)

Description

Carries out a gaussian process regression with a linear kernel (dot product). For internal use only!

Usage

gaussian_process(X, Y, noisev, scale)

Arguments

X

a matrix of predictor variables

Y

a matrix with a single response variable

noisev

a value indicating the variance of the noise for Gaussian process regression. Default is 0.001.

scale

a logical indicating whether both the predictors and the response variable must be scaled to zero mean and unit variance.
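
Details

As a rough illustration of what a Gaussian process regression with a linear (dot product) kernel computes, the posterior mean can be written in a few lines of base R. This is a textbook sketch, not the package's C++ implementation: gp_linear_fit and gp_linear_predict are hypothetical names, and centering and scaling are omitted.

```r
# Fit: solve (K + noisev * I) alpha = Y, with K = X X^T
gp_linear_fit <- function(X, Y, noisev = 0.001) {
  K <- X %*% t(X) # linear (dot product) kernel
  alpha <- solve(K + diag(noisev, nrow(X)), Y)
  list(X = X, alpha = alpha)
}

# Predict: kernel between new and training data times alpha
gp_linear_predict <- function(fit, newdata) {
  newdata %*% t(fit$X) %*% fit$alpha
}

set.seed(2)
X <- matrix(rnorm(30 * 3), nrow = 30)
Y <- X %*% c(1, -2, 0.5) + rnorm(30, sd = 0.01)
fit <- gp_linear_fit(X, Y)
pred <- gp_linear_predict(fit, X[1:5, , drop = FALSE])
```

With this kernel the predictions coincide with those of a ridge regression whose penalty equals the noise variance, which is a convenient way to check the sketch.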

Value

a list containing the following elements:

Author(s)

Leonardo Ramirez-Lopez


Internal Cpp function for performing leave-group-out cross validations for gaussian process

Description

For internal use only!

Usage

gaussian_process_cv(X, Y, mindices, pindices, noisev = 0.001,  
scale = TRUE, statistics = TRUE)

Arguments

X

a matrix of predictor variables.

Y

a matrix of a single response variable.

mindices

a matrix with n rows and m columns where m is equivalent to the number of resampling iterations. The elements of each column indicate the indices of the observations to be used for modeling at each iteration.

pindices

a matrix with k rows and m columns where m is equivalent to the number of resampling iterations. The elements of each column indicate the indices of the observations to be used for predicting at each iteration.

scale

a logical indicating whether both the predictors and the response variable must be scaled to zero mean and unit variance.

statistics

a logical value indicating whether the precision and accuracy statistics are to be returned, otherwise the predictions for each validation segment are retrieved.
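
Details

The layout of mindices and pindices can be sketched in base R as follows. This is only an assumption about how such index matrices might be built for a leave-group-out scheme (here, 75% of the observations for modeling in each of m iterations); the package may construct them differently.

```r
set.seed(11)
n <- 20 # number of observations
m <- 5 # number of resampling iterations
p <- 0.75 # proportion of observations used for modeling
n_cal <- floor(n * p)

# One column per iteration: indices used for modeling
mindices <- replicate(m, sort(sample.int(n, n_cal)))

# One column per iteration: the remaining indices, used for predicting
pindices <- apply(mindices, 2, function(idx) setdiff(seq_len(n), idx))
```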

Value

a list containing the following one-row matrices:

Author(s)

Leonardo Ramirez-Lopez


Function for identifying the column in a matrix with the largest standard deviation

Description

Identifies the column with the largest standard deviation. For internal use only!

Usage

get_col_largest_sd(X)

Arguments

X

a matrix.

Value

a value indicating the index of the column with the largest standard deviation.
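
Details

A base R equivalent of this operation, shown here only as a reference point for what the compiled function returns, could look like this:

```r
X <- cbind(a = c(1, 2, 3), b = c(10, 20, 30), c = c(5, 5, 5))

# Standard deviation of each column, then the position of the largest one
col_sds <- apply(X, 2, sd)
largest <- unname(which.max(col_sds))
```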

Author(s)

Leonardo Ramirez-Lopez


Standard deviation of columns

Description

For internal use only!

Usage

get_col_sds(x)

Function for computing the mean of each column in a matrix

Description

Computes the mean of each column in a matrix. For internal use only!

Usage

get_column_means(X)

Arguments

X

a matrix.

Value

a vector of mean values.

Author(s)

Leonardo Ramirez-Lopez


Function for computing the standard deviation of each column in a matrix

Description

Computes the standard deviation of each column in a matrix. For internal use only!

Usage

get_column_sds(X)

Arguments

X

a matrix.

Value

a vector of standard deviation values.

Author(s)

Leonardo Ramirez-Lopez


Function for computing sum of each column in a matrix

Description

Computes the sum of each column in a matrix. For internal use only!

Usage

get_column_sums(X)

Arguments

X

a matrix.

Value

a vector with the sum of each column.

Author(s)

Leonardo Ramirez-Lopez


get the evaluation results for categorical data

Description

internal

Usage

get_eval_categorical(y, indices_closest)

get the evaluation results for continuous data

Description

internal

Usage

get_eval_continuous(y, indices_closest)

A function to obtain the local neighbors based on dissimilarity matrices from orthogonal projections.

Description

Internal function. This function is used to obtain the local neighbors based on dissimilarity matrices from orthogonal projections. These neighbors are obtained from an orthogonal projection on a set of precomputed neighbors. This function is used internally by the mbl function. ortho_diss(, .local = TRUE) operates in the same way; however, for mbl it is more efficient to redo the neighbor search inside its main for loop.

Usage

get_ith_local_neighbors(
  ith_xr,
  ith_xu,
  ith_yr,
  ith_yu = NULL,
  diss_usage = "none",
  ith_neig_indices,
  k = NULL,
  k_diss = NULL,
  k_range = NULL,
  spike = NULL,
  diss_method,
  pc_selection,
  ith_group = NULL,
  center,
  scale,
  ...
)

Arguments

ith_xr

the set of neighbors of a Xu observation found in Xr

ith_xu

the Xu observation

ith_yr

the response values of the set of neighbors of the Xu observation found in Xr

ith_yu

the response value of the xu observation

diss_usage

a character string indicating if the dissimilarity data will be used as predictors ("predictors") or not ("none").

ith_neig_indices

a vector of the original indices of the Xr neighbors.

k

the number of nearest neighbors to select from the already identified neighbors

k_diss

the distance threshold to select the neighbors from the already identified neighbors

k_range

a min and max number of allowed neighbors when k_diss is used

spike

a vector with the indices of the observations forced to be retained as neighbors. They have to be present in all the neighborhoods and at the top of neighbor_indices.

diss_method

the ortho_diss() method

pc_selection

the pc_selection argument as in ortho_diss()

ith_group

the vector containing the group labels of ith_xr.

center

center the data in the local diss computation?

scale

scale the data in the local diss computation?

Value

a list:

Author(s)

Leonardo Ramirez-Lopez


Internal Cpp function for computing the weights of the PLS components necessary for weighted average PLS

Description

For internal use only!

Usage

get_local_pls_weights(projection_mat, 
          xloadings, 
          coefficients, 
          new_x, 
          min_component, 
          max_component, 
          scale, 
          Xcenter, 
          Xscale)

Arguments

projection_mat

the projection matrix generated by the opls function.

xloadings

.

coefficients

the matrix of regression coefficients.

new_x

a matrix containing one new spectrum to be predicted.

min_component

an integer indicating the minimum number of pls components.

max_component

an integer indicating the maximum number of pls components.

scale

a logical indicating whether the matrix of predictors used to create the regression model was scaled.

Xcenter

a matrix of one row with the values that must be used for centering newdata.

Xscale

if scale = TRUE a matrix of one row with the values that must be used for scaling newdata.

Value

a matrix of one row with the weights for each component between the max. and min. specified.

Author(s)

Leonardo Ramirez-Lopez


A function to get the neighbor information

Description

This function gathers information of all neighborhoods of the Xu observations found in Xr. This information is required during local regressions.

Usage

get_neighbor_info(
  Xr,
  Xu,
  diss_method,
  Yr = NULL,
  k = NULL,
  k_diss = NULL,
  k_range = NULL,
  spike = NULL,
  pc_selection,
  return_dissimilarity,
  center,
  scale,
  gh,
  diss_usage,
  allow_parallel = FALSE,
  ...
)

Details

For local pca and pls distances, the local dissimilarity matrices are not computed, as it is cheaper to compute them during the local regressions. Instead, the global distances (required for the later computation of the local dissimilarity matrices) are output.


Extract predictions from an object of class mbl

Description

Stable lifecycle

Extract predictions from an object of class mbl

Usage

get_predictions(object)

Arguments

object

an object of class mbl as returned by mbl

Value

a data.table of predicted values according to either k or k_diss

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens

See Also

mbl


A function to assign values to sample distribution strata

Description

For internal use only! This function takes a continuous variable, creates n strata based on its distribution and assigns the corresponding strata to every value.

Usage

get_sample_strata(y, n = NULL, probs = NULL)

Arguments

y

a matrix of one column with the response variable.

n

the number of strata.

Value

a data table with the input y and the corresponding strata to every value.
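
Details

The idea of creating strata from the distribution of a continuous variable can be sketched with quantile breaks in base R. This is an illustration under assumptions (equal-probability breaks, a plain data.frame as output), not the internal implementation.

```r
set.seed(5)
y <- rnorm(40) # toy response variable
n <- 4 # number of strata

# Equal-probability breaks taken from the empirical distribution of y
breaks <- quantile(y, probs = seq(0, 1, length.out = n + 1))

# Assign every value to its stratum (1 = lowest values, n = highest)
strata_df <- data.frame(
  y = y,
  stratum = cut(y, breaks = breaks, include.lowest = TRUE, labels = FALSE)
)
```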


A function for stratified calibration/validation sampling

Description

for internal use only! This function selects samples based on provided strata.

Usage

get_samples_from_strata(
  y,
  original_order,
  strata,
  samples_per_strata,
  sampling_for = c("calibration", "validation"),
  replacement = FALSE
)

Arguments

original_order

a matrix of one column with the response variable.

strata

the number of strata.

sampling_for

sampling to select the calibration samples ("calibration") or sampling to select the validation samples ("validation").

replacement

logical indicating if sampling with replacement must be done.

Value

a list with the indices of the calibration and validation samples.


Internal function for computing the weights of the PLS components necessary for weighted average PLS

Description

internal

Usage

get_wapls_weights(pls_model, original_x, type = "w1", new_x = NULL, pls_c)

Arguments

pls_model

either an object returned by the pls_cv function or an object as returned by the opls_get_basics function which contains a pls model.

original_x

the original spectral matrix which was used for calibrating the pls model.

type

type of weight to be computed. The only available option (for the moment) is "w1". See details on the mbl function where it is explained how "w1" is computed within the "wapls" regression.

new_x

a vector of a new spectral observation. When "w1" is selected, new_x must be specified.

pls_c

a vector of length 2 which contains both the minimum and maximum number of PLS components for which the weights must be computed.

Value

get_wapls_weights returns a vector of weights for each PLS component specified

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens


Computes the weights for pls regressions

Description

This is an internal function that computes the weights required for obtaining each vector of pls scores. Implementation is done in C++ for improved performance.

Usage

get_weights(X, Y, algorithm = "pls", xls_min_w = 3L, xls_max_w = 15L)

Arguments

X

a numeric matrix of spectral data.

Y

a matrix of one column with the response variable.

algorithm

a character string indicating what method to use. Options are: 'pls' for pls (using covariance between X and Y), 'mpls' for modified pls (using correlation between X and Y as in Shenk and Westerhaus, 1991; Westerhaus 2014) or 'xls' for extended pls (as implemented in BUCHI NIRWise PLUS software).

xls_min_w

an integer indicating the minimum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 3 (as in BUCHI NIRWise PLUS software).

xls_max_w

an integer indicating the maximum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 15 (as in BUCHI NIRWise PLUS software).

Value

a matrix of one column containing the weights.
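
Details

The difference between the 'pls' and 'mpls' options (covariance versus correlation between X and Y) can be sketched in base R for a first component. This is a simplified illustration, not the C++ implementation: normalizing the weights to unit length is an assumption, and the 'xls' option is not covered.

```r
set.seed(9)
# Toy spectra with columns of very different scale, then mean-centered
X <- matrix(rnorm(30 * 6), nrow = 30) %*% diag(1:6)
X <- scale(X, center = TRUE, scale = FALSE)
Y <- scale(rnorm(30), center = TRUE, scale = FALSE) # one-column response

# 'pls': weights proportional to the covariance between each variable and Y
w_pls <- crossprod(X, Y)
w_pls <- w_pls / sqrt(sum(w_pls^2))

# 'mpls': weights proportional to the correlation between each variable and Y
w_mpls <- cor(X, Y)
w_mpls <- w_mpls / sqrt(sum(w_mpls^2))
```

When all columns of X share the same standard deviation, the two sets of weights coincide; they differ only when the variables have different scales.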

Author(s)

Leonardo Ramirez-Lopez and Claudio Orellano

References

Shenk, J. S., & Westerhaus, M. O. (1991). Populations structuring of near infrared spectra and modified partial least squares regression. Crop Science, 31(6), 1548-1555.

Westerhaus, M. (2014). Eastern Analytical Symposium Award for outstanding achievements in near infrared spectroscopy: my contributions to near infrared spectroscopy. NIR news, 25(8), 16-20.


An iterator for local prediction data in mbl

Description

Internal function. It collects only the data necessary to execute a local prediction for the mbl function based on a list of neighbors. Not valid for local dissimilarity (e.g. for ortho_diss(..., .local = TRUE)).

Usage

ith_mbl_neighbor(
  Xr,
  Xu = NULL,
  Yr,
  Yu = NULL,
  diss_usage = "none",
  neighbor_indices,
  neighbor_diss = NULL,
  diss_xr_xr = NULL,
  group = NULL
)

Arguments

Xr

the Xr matrix in mbl.

Xu

the Xu matrix in mbl. Default NULL. If not provided, the function will iterate for each {Yr, Xr} to get the respective neighbors.

Yr

the Yr matrix in mbl.

Yu

the Yu matrix in mbl. Default NULL.

diss_usage

a character string indicating if the dissimilarity data will be used as predictors ("predictors") or not ("none").

neighbor_indices

a matrix with the indices of neighbors of every Xu found in Xr.

neighbor_diss

a matrix with the dissimilarity scores for the neighbors of every Xu found in Xr. This matrix is organized in the same way as neighbor_indices.

diss_xr_xr

a dissimilarity matrix between samples in Xr.

group

a factor representing the group labels of Xr.

Details

isubset will look at the order of knn in each col of D and re-organize the rows of x accordingly

Value

an object of class iterator giving the following list:

Author(s)

Leonardo Ramirez-Lopez


iterator for nearest neighbor subsets

Description

internal

Usage

ith_subsets_ortho_diss(x, xu = NULL, y, kindx, na_rm = FALSE)

Arguments

x

a reference matrix

xu

a second matrix

y

a matrix of side information

kindx

a matrix of nearest neighbor indices

na_rm

logical indicating whether NAs must be removed.


Local fit functions

Description

These functions define the way in which each local fit/prediction is done within each iteration in the mbl function.

Usage

local_fit_pls(pls_c, modified = FALSE, max_iter = 100, tol = 1e-6)

local_fit_wapls(min_pls_c, max_pls_c, modified = FALSE,
                max_iter = 100, tol = 1e-6)

local_fit_gpr(noise_variance = 0.001)

Arguments

pls_c

an integer indicating the number of pls components to be used in the local regressions when the partial least squares (local_fit_pls) method is used.

modified

a logical indicating whether the modified version of the pls algorithm (Shenk and Westerhaus, 1991 and Westerhaus, 2014) is to be used. Default is FALSE.

max_iter

an integer indicating the maximum number of iterations in case tol is not reached. Default is 100.

tol

a numeric value indicating the convergence tolerance for calculating the scores. Default is 1e-6.

min_pls_c

an integer indicating the minimum number of pls components to be used in the local regressions when the weighted average partial least squares (local_fit_wapls) method is used. See details.

max_pls_c

integer indicating the maximum number of pls components to be used in the local regressions when the weighted average partial least squares (local_fit_wapls) method is used. See details.

noise_variance

a numeric value indicating the variance of the noise for Gaussian process local regressions (local_fit_gpr). Default is 0.001.

Details

These functions are used to indicate how to fit the regression models within the mbl function.

There are three possible options for performing these regressions:

The modified argument in the pls methods (local_fit_pls() and local_fit_wapls()) is used to indicate if a modified version of the pls algorithm (modified pls or mpls) is to be used. The modified pls was proposed by Shenk and Westerhaus (1991, see also Westerhaus, 2014) and it differs from the standard pls method in the way the weights of the predictors (used to compute the matrix of scores) are obtained. While pls uses the covariance between response(s) and predictors (and later their deflated versions corresponding to each pls component iteration) to obtain these weights, the modified pls uses the correlation as weights. The authors indicate that by using correlation, a larger portion of the response variable(s) can be explained.
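The difference between the two weighting schemes can be sketched for the first component with base R. This is only an illustration on simulated data (all object names are hypothetical, and the weights are normalized to unit length as an assumption); it is not the C++ code used by the package.

```r
set.seed(1)
X <- matrix(rnorm(50 * 6), nrow = 50)
X[, 1] <- X[, 1] * 10 # one predictor with a much larger variance
y <- 0.1 * X[, 1] + X[, 2] + rnorm(50, sd = 0.1)

# Standard pls: weights proportional to the covariance between X and y
Xc <- scale(X, center = TRUE, scale = FALSE)
yc <- y - mean(y)
w_pls <- drop(crossprod(Xc, yc))
w_pls <- w_pls / sqrt(sum(w_pls^2))

# Modified pls: weights proportional to the correlation between X and y
w_mpls <- drop(cor(X, y))
w_mpls <- w_mpls / sqrt(sum(w_mpls^2))

round(cbind(pls = w_pls, mpls = w_mpls), 3)
```

With covariance-based weights the high-variance predictor dominates, whereas correlation-based weights do not depend on the variance of each predictor.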

Value

An object of class local_fit mirroring the input arguments.

Author(s)

Leonardo Ramirez-Lopez

References

Shenk, J. S., & Westerhaus, M. O. 1991. Populations structuring of near infrared spectra and modified partial least squares regression. Crop Science, 31(6), 1548-1555.

Shenk, J., Westerhaus, M., and Berzaghi, P. 1997. Investigation of a LOCAL calibration procedure for near infrared instruments. Journal of Near Infrared Spectroscopy, 5, 223-232.

Rasmussen, C.E., Williams, C.K. Gaussian Processes for Machine Learning. Massachusetts Institute of Technology: MIT-Press, 2006.

Westerhaus, M. 2014. Eastern Analytical Symposium Award for outstanding achievements in near infrared spectroscopy: my contributions to near infrared spectroscopy. NIR news, 25(8), 16-20.

See Also

mbl

Examples

local_fit_wapls(min_pls_c = 3, max_pls_c = 12)

local ortho dissimilarity matrices initialized by a global dissimilarity matrix

Description

internal

Usage

local_ortho_diss(
  k_index_matrix,
  Xr,
  Yr,
  Xu,
  diss_method,
  pc_selection,
  center,
  scale,
  allow_parallel,
  ...
)

Arguments

k_index_matrix

a matrix of nearest neighbor indices

Xr

argument passed to ortho_projection

Yr

argument passed to ortho_projection

Xu

argument passed to ortho_projection

diss_method

argument passed to ortho_projection

pc_selection

argument passed to ortho_projection

center

argument passed to ortho_projection

scale

argument passed to ortho_projection


A function for memory-based learning (mbl)

Description

This function is implemented for memory-based learning (a.k.a. instance-based learning or local regression) which is a non-linear lazy learning approach for predicting a given response variable from a set of predictor variables. For each observation in a prediction set, a specific local regression is carried out based on a subset of similar observations (nearest neighbors) selected from a reference set. The local model is then used to predict the response value of the target (prediction) observation. Therefore this function does not yield a global regression model.
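The general idea can be sketched in base R as follows. This is a simplified illustration only: it assumes Euclidean distances, a single fixed k and an ordinary linear model in place of the local pls or Gaussian process fits, and all object names are hypothetical. It is not the implementation used by mbl.

```r
set.seed(10)
Xr <- matrix(runif(200 * 2), ncol = 2) # reference predictors
Yr <- sin(3 * Xr[, 1]) + Xr[, 2]^2 + rnorm(200, sd = 0.05)
Xu <- matrix(runif(5 * 2), ncol = 2) # observations to be predicted
k <- 30 # neighborhood size

local_predictions <- vapply(seq_len(nrow(Xu)), function(i) {
  # Dissimilarity between the i-th target observation and every reference one
  d <- sqrt(colSums((t(Xr) - Xu[i, ])^2))
  # Indices of the k nearest neighbors in the reference set
  nn <- order(d)[seq_len(k)]
  # Fit a model with the neighbors only and predict the target observation
  local_data <- data.frame(y = Yr[nn], x1 = Xr[nn, 1], x2 = Xr[nn, 2])
  fit <- lm(y ~ x1 + x2, data = local_data)
  predict(fit, newdata = data.frame(x1 = Xu[i, 1], x2 = Xu[i, 2]))
}, numeric(1))

local_predictions
```

One model is fitted and discarded per target observation, which is why no global model is available after the predictions are made.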

Usage

mbl(Xr, Yr, Xu, Yu = NULL, k, k_diss, k_range, spike = NULL,
    method = local_fit_wapls(min_pls_c = 3, max_pls_c = min(dim(Xr), 15)),
    diss_method = "pca", diss_usage = "predictors", gh = TRUE,
    pc_selection = list(method = "opc", value = min(dim(Xr), 40)),
    control = mbl_control(), group = NULL, center = TRUE, scale = FALSE,
    verbose = TRUE, documentation = character(), seed = NULL, ...)

Arguments

Xr

a matrix of predictor variables of the reference data (observations in rows and variables in columns).

Yr

a numeric matrix of one column containing the values of the response variable corresponding to the reference data.

Xu

a matrix of predictor variables of the data to be predicted (observations in rows and variables in columns).

Yu

an optional matrix of one column containing the values of the response variable corresponding to the data to be predicted. Default is NULL.

k

a vector of integers specifying the sequence of k-nearest neighbors to be tested. Either k or k_diss must be specified. This vector will be automatically sorted into ascending order. If non-integer numbers are passed, they will be coerced to the next upper integers.

k_diss

a numeric vector specifying the sequence of dissimilarity thresholds to be tested for the selection of the nearest neighbors found in Xr around each observation in Xu. These thresholds depend on the corresponding dissimilarity measure specified in the object passed to control. Either k or k_diss must be specified.

k_range

an integer vector of length 2 which specifies the minimum (first value) and the maximum (second value) number of neighbors to be retained when the k_diss is given.

spike

an integer vector (with positive and/or negative values) indicating the indices of observations in Xr that must either be forced into or avoided in the neighborhoods of every Xu observation. Default is NULL (i.e. no observations are forced or avoided). Note that this argument is not intended for increasing or reducing the neighborhood size, which is only controlled by k or k_diss and k_range. By forcing observations into the neighborhood, some of the farthest observations may be forced out of the neighborhood. In contrast, by avoiding observations in the neighborhood, some of the farthest observations may be included in the neighborhood. See details.

method

an object of class local_fit which indicates the type of regression to conduct at each local segment as well as additional parameters affecting this regression. See local_fit function.

diss_method

a character string indicating the spectral dissimilarity metric to be used in the selection of the nearest neighbors of each observation. Options are:

  • "pca" (Default): Mahalanobis distance computed on the matrix of scores of a Principal Component (PC) projection of Xr and Xu. PC projection is done using the singular value decomposition (SVD) algorithm. See ortho_diss function.

  • "pca.nipals": Mahalanobis distance computed on the matrix of scores of a Principal Component (PC) projection of Xr and Xu. PC projection is done using the non-linear iterative partial least squares (nipals) algorithm. See ortho_diss function.

  • "pls": Mahalanobis distance computed on the matrix of scores of a partial least squares projection of Xr and Xu. In this case, Yr is always required. See ortho_diss function.

  • "cor": correlation coefficient between observations. See cor_diss function.

  • "euclid": Euclidean distance between observations. See f_diss function.

  • "cosine": Cosine distance between observations. See f_diss function.

  • "sid": spectral information divergence between observations. See sid function.

Alternatively, a matrix of dissimilarities can also be passed to this argument. This matrix is expected to be a user-defined matrix representing the dissimilarities between observations in Xr and Xu. When diss_usage = "predictors", this matrix must be square (derived from a matrix of the form rbind(Xr, Xu)) and its diagonal values must be zeros (since the dissimilarity between an object and itself must be 0). On the other hand, if diss_usage is set to either "weights" or "none", it must be a matrix representing the dissimilarity of each observation in Xu to each observation in Xr. The number of columns of the input matrix must be equal to the number of rows in Xu and the number of rows must be equal to the number of rows in Xr.

diss_usage

a character string specifying how the dissimilarity information shall be used. The possible options are: "predictors", "weights" and "none" (see details below). Default is "predictors".

gh

a logical indicating if the global Mahalanobis distance (in the pls score space) between each observation and the pls mean (centre) must be computed. This metric is known as the GH distance in the literature. Note that this computation is based on the number of pls components determined by using the pc_selection argument. See details.

pc_selection

a list of length 2 used for the computation of GH (if gh = TRUE) as well as in the computation of the dissimilarity methods based on ortho_diss (i.e. when diss_method is one of: "pca", "pca.nipals" or "pls"). This argument is used for optimizing the number of components (principal components or pls factors) to be retained for dissimilarity/distance computation purposes only (i.e. not for regression). This list must contain two elements in the following order: method (a character indicating the method for selecting the number of components) and value (a numerical value that complements the selected method). The methods available are:

  • "opc": optimized principal component selection based on Ramirez-Lopez et al. (2013a, 2013b). The optimal number of components (of a set of observations) is the one for which its distance matrix minimizes the differences between the Yr value of each observation and the Yr value of its closest observation. In this case value must be a value (larger than 0 and below the minimum dimension of Xr or Xr and Xu combined) indicating the maximum number of principal components to be tested. See the ortho_projection function for more details.

  • "cumvar": selection of the principal components based on a given cumulative amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of cumulative variance that the combination of retained components should explain.

  • "var": selection of the principal components based on a given amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of variance that a single component should explain in order to be retained.

  • "manual": for manually specifying a fixed number of principal components. In this case, value must be a value (larger than 0 and below the minimum dimension of Xr or Xr and Xu combined) indicating the number of components to be retained.

The list list(method = "opc", value = min(dim(Xr), 40)) is the default. Optionally, the pc_selection argument admits "opc" or "cumvar" or "var" or "manual" as a single character string. In such a case the default "value" when either "opc" or "manual" are used is 40. When "cumvar" is used the default "value" is set to 0.99 and when "var" is used, the default "value" is set to 0.01.
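The selection rules above can be sketched with base R on simulated data. The sketch assumes a PCA computed with prcomp (the package uses its own projection functions), and all object names are hypothetical. For "opc", Euclidean distances on standardized PC scores are used because they are equivalent to Mahalanobis distances in the score space.

```r
set.seed(3)
Xr <- matrix(rnorm(40 * 10), nrow = 40)
Yr <- Xr[, 1] + 0.5 * Xr[, 2] + rnorm(40, sd = 0.1)
pca <- prcomp(Xr, center = TRUE)
explained <- pca$sdev^2 / sum(pca$sdev^2)

# "cumvar": smallest number of components reaching the cumulative threshold
n_cumvar <- which(cumsum(explained) >= 0.99)[1]

# "var": components that individually explain at least the given variance
n_var <- sum(explained >= 0.01)

# "opc": number of components whose nearest neighbors are most similar in Yr
max_comp <- 6
rmsd <- vapply(seq_len(max_comp), function(n_comp) {
  z <- scale(pca$x[, seq_len(n_comp), drop = FALSE])
  d <- as.matrix(dist(z))
  diag(d) <- Inf # an observation cannot be its own neighbor
  nearest <- apply(d, 1, which.min)
  sqrt(mean((Yr - Yr[nearest])^2))
}, numeric(1))
n_opc <- which.min(rmsd)

c(cumvar = n_cumvar, var = n_var, opc = n_opc)
```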

control

a list created with the mbl_control function which contains additional parameters that control some aspects of the mbl function (cross-validation, parameter tuning, etc). The default list is as returned by mbl_control(). See the mbl_control function for more details.

group

an optional factor (or character vector that can be coerced to factor by as.factor) that assigns a group/class label to each observation in Xr (e.g. groups can be given by spectra collected from the same batch of measurements, from the same observation, from observations with very similar origin, etc). This is taken into account for internal leave-group-out cross validation for pls tuning (factor optimization) to avoid pseudo-replication. When one observation is selected for cross-validation, all observations of the same group are removed together and assigned to validation. The length of the vector must be equal to the number of observations in the reference/training set (i.e. nrow(Xr)). See details.

center

a logical indicating if the predictor variables must be centred at each local segment (before regression). In addition, if TRUE, Xr and Xu will be centred for dissimilarity computations.

scale

a logical indicating if the predictor variables must be scaled to unit variance at each local segment (before regression). In addition, if TRUE, Xr and Xu will be scaled for dissimilarity computations.

verbose

a logical indicating whether or not to print a progress bar for each observation to be predicted. Default is TRUE. Note: In case parallel processing is used, these progress bars will not be printed.

documentation

an optional character string that can be used to describe anything related to the mbl call (e.g. description of the input data). Default: character(). NOTE: this is an experimental argument.

seed

an integer value containing the random number generator (RNG) state for random number generation. This argument can be used for reproducibility purposes (for random sampling) in the cross-validation results. Default is NULL, i.e. no RNG is applied.

...

further arguments to be passed to the dissimilarity function. See details.

Details

The argument spike can be used to indicate what reference observations in Xr must be kept in the neighborhood of every single Xu observation. If a vector of length \(m\) is passed to this argument, this means that the \(m\) original neighbors with the largest dissimilarities to the target observations will be forced out of the neighborhood. Spiking might be useful in cases where some reference observations are known to be somehow related to the ones in Xu and therefore might be relevant for fitting the local models. See Guerrero et al. (2010) for an example on the benefits of spiking.
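A minimal sketch of how spiking can reshape a neighborhood of fixed size is given below. The helper spiked_neighbors is hypothetical (it is not part of the package) and it assumes that positive indices are forced in, negative indices are avoided and the remaining places are filled with the nearest non-spiked observations.

```r
spiked_neighbors <- function(d, k, spike = NULL) {
  spike <- as.integer(spike)
  forced <- spike[spike > 0]
  avoided <- abs(spike[spike < 0])
  # Remaining candidates sorted from the nearest to the farthest
  candidates <- setdiff(order(d), c(forced, avoided))
  # Forced observations take places away from the farthest neighbors
  c(forced, head(candidates, k - length(forced)))
}

# Dissimilarities between one Xu observation and six Xr observations
d <- c(0.5, 0.1, 0.9, 0.3, 0.7, 0.2)

spiked_neighbors(d, k = 3)
spiked_neighbors(d, k = 3, spike = 3)
spiked_neighbors(d, k = 3, spike = -2)
```

In all three calls the neighborhood size remains 3; only its composition changes.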

The mbl function uses the dissimilarity function to compute the dissimilarities between Xr and Xu. The dissimilarity method to be used is specified in the diss_method argument. Arguments to dissimilarity as well as further arguments to the functions used inside dissimilarity (i.e. ortho_diss, cor_diss, f_diss and sid) can be passed to those functions by using ....

The diss_usage argument is used to specify whether the dissimilarity information must be used within the local regressions and, if so, how. When diss_usage = "predictors" the local (square symmetric) dissimilarity matrix corresponding to the selected neighborhood is used as a source of additional predictors (i.e. the columns of this local matrix are treated as predictor variables). In some cases this results in an improvement of the prediction performance (Ramirez-Lopez et al., 2013a). If diss_usage = "weights", the neighbors of the query point (\(xu_{j}\)) are weighted according to their dissimilarity to \(xu_{j}\) before carrying out each local regression. The following tricubic function (Cleveland and Devlin, 1988; Naes et al., 1990) is used for computing the final weights based on the measured dissimilarities:

\[W_{j} = (1 - v^{3})^{3}\]

where, if \(xr_{i}\) is among the neighbors of \(xu_{j}\):

\[v_{j}(xu_{j}) = d(xr_{i}, xu_{j})\]

otherwise:

\[v_{j}(xu_{j}) = 0\]

In the above formulas \(d(xr_{i}, xu_{j})\) represents the dissimilarity between the query point and each object in \(Xr\). When diss_usage = "none" is chosen the dissimilarity information is not used.
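The tricubic weighting can be illustrated with a few lines of base R. The helper tricubic_weights is hypothetical, and rescaling the dissimilarities by their maximum (so that the farthest neighbor gets a weight of 0) is an assumption made here for illustration only.

```r
tricubic_weights <- function(d) {
  # Assumption: dissimilarities are rescaled so that the farthest neighbor has v = 1
  v <- d / max(d)
  (1 - v^3)^3
}

# Dissimilarities between the query point and three of its neighbors
d <- c(0, 0.5, 1)
tricubic_weights(d)
```

The weights decrease monotonically with the dissimilarity, so the closest neighbors have the largest influence on the local regression.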

The global Mahalanobis distance (a.k.a. GH) is computed based on the scores of a pls projection. A pls projection model is built with {Yr}, {Xr} and this model is used to obtain the pls scores of the Xu observations. The Mahalanobis distance between each Xu observation (in the pls space) and the centre of Xr is then computed. The number of pls components is optimized based on the parameters passed to the pc_selection argument. In addition, the mbl function also reports the GH distance for the observations in Xr.
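The computation can be sketched with base R. To keep the sketch self-contained, a PCA projection (prcomp) is used here in place of the pls projection described above, the number of components is fixed arbitrarily and all object names are hypothetical. Note that stats::mahalanobis returns squared distances.

```r
set.seed(42)
Xr <- matrix(rnorm(60 * 8), nrow = 60)
Xu <- matrix(rnorm(5 * 8), nrow = 5)

# Projection built on Xr only (PCA used here as a stand-in for pls)
pca <- prcomp(Xr, center = TRUE, scale. = FALSE)
n_comp <- 3
scores_r <- pca$x[, seq_len(n_comp)]
scores_u <- predict(pca, Xu)[, seq_len(n_comp)]

# Squared Mahalanobis distance of each observation to the centre of Xr
centre_r <- colMeans(scores_r)
gh_u <- mahalanobis(scores_u, center = centre_r, cov = cov(scores_r))
gh_r <- mahalanobis(scores_r, center = centre_r, cov = cov(scores_r))

gh_u
```

Large values indicate observations that lie far from the centre of the reference set in the projected space.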

Some aspects of the mbl process, such as the type of internal validation, parameter tuning, what extra objects to return, permission for parallel execution, prediction limits, etc, can be specified by using the mbl_control function.

By using the group argument one can specify groups of observations that have something in common (e.g. observations with very similar origin). The purpose of group is to avoid biased cross-validation results due to pseudo-replication. This argument makes it possible to select calibration points that are independent from the validation ones. In this regard, when validation_type = "local_cv" (used in mbl_control function), then the p argument refers to the percentage of groups of observations (rather than single observations) to be retained in each sampling iteration at each local segment.
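Sampling by groups rather than by single observations can be sketched as follows. The group labels are simulated, and rounding the number of retained groups down with floor is an assumption made for illustration.

```r
set.seed(7)
# Eight hypothetical sites with three replicate spectra each
group <- factor(rep(paste0("site_", 1:8), each = 3))
p <- 0.75

# Whole groups (not single observations) are retained for calibration
groups <- levels(group)
calibration_groups <- sample(groups, size = floor(p * length(groups)))
in_calibration <- group %in% calibration_groups

table(group, in_calibration)
```

Every group falls entirely in either the calibration or the validation subset, so replicates of the same site never appear on both sides.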

Value

a list of class mbl with the following components (sorted either by k or k_diss):

When the k_diss argument is used, the printed results show a table with a column named 'p_bounded'. It represents the percentage of observations for which the neighbors selected by the given dissimilarity threshold were outside the boundaries specified in the k_range argument.
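The interplay between a dissimilarity threshold and k_range can be sketched in base R. The helper select_by_threshold is hypothetical, and the use of "<=" for the threshold comparison is an assumption.

```r
select_by_threshold <- function(d, k_diss, k_range) {
  # Number of reference observations within the dissimilarity threshold
  n_within <- sum(d <= k_diss)
  # The final neighborhood size is forced into the limits given by k_range
  k <- min(max(n_within, k_range[1]), k_range[2])
  list(neighbors = order(d)[seq_len(k)], bounded = n_within != k)
}

d <- c(0.05, 0.2, 0.08, 0.5, 0.11, 0.9)

# Only two observations are within 0.1, so the lower limit (3) applies
select_by_threshold(d, k_diss = 0.1, k_range = c(3, 5))

# Five observations are within 0.6, which is inside the limits
select_by_threshold(d, k_diss = 0.6, k_range = c(3, 5))
```

Under these assumptions, the percentage of target observations for which bounded is TRUE corresponds to the idea behind p_bounded.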

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens

References

Cleveland, W. S., and Devlin, S. J. 1988. Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American Statistical Association, 83, 596-610.

Guerrero, C., Zornoza, R., Gómez, I., Mataix-Beneyto, J. 2010. Spiking of NIR regional models using observations from target sites: Effect of model size on prediction accuracy. Geoderma, 158(1-2), 66-77.

Naes, T., Isaksson, T., Kowalski, B. 1990. Locally weighted regression and scatter correction for near-infrared reflectance data. Analytical Chemistry 62, 664-673.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

Rasmussen, C.E., Williams, C.K. Gaussian Processes for Machine Learning. Massachusetts Institute of Technology: MIT-Press, 2006.

Shenk, J., Westerhaus, M., and Berzaghi, P. 1997. Investigation of a LOCAL calibration procedure for near infrared instruments. Journal of Near Infrared Spectroscopy, 5, 223-232.

See Also

mbl_control, f_diss, cor_diss, sid, ortho_diss, search_neighbors, local_fit

Examples


library(prospectr)
data(NIRsoil)

# Preprocess the data using detrend plus first derivative with Savitzky and
# Golay smoothing filter
sg_det <- savitzkyGolay(
  detrend(NIRsoil$spc,
    wav = as.numeric(colnames(NIRsoil$spc))
  ),
  m = 1,
  p = 1,
  w = 7
)

NIRsoil$spc_pr <- sg_det

# split into training and testing sets
test_x <- NIRsoil$spc_pr[NIRsoil$train == 0 & !is.na(NIRsoil$CEC), ]
test_y <- NIRsoil$CEC[NIRsoil$train == 0 & !is.na(NIRsoil$CEC)]

train_y <- NIRsoil$CEC[NIRsoil$train == 1 & !is.na(NIRsoil$CEC)]
train_x <- NIRsoil$spc_pr[NIRsoil$train == 1 & !is.na(NIRsoil$CEC), ]

# Example 1
# An mbl as implemented in Ramirez-Lopez et al. (2013,
# the spectrum-based learner)
# Example 1.1
# An example where Yu is supposed to be unknown, but the Xu
# (spectral variables) are known
my_control <- mbl_control(validation_type = "NNv")

## The neighborhood sizes to test
ks <- seq(40, 140, by = 20)

sbl <- mbl(
  Xr = train_x,
  Yr = train_y,
  Xu = test_x,
  k = ks,
  method = local_fit_gpr(),
  control = my_control,
  scale = TRUE
)
sbl
plot(sbl)
get_predictions(sbl)

# Example 1.2
# If Yu is actually known...
sbl_2 <- mbl(
  Xr = train_x,
  Yr = train_y,
  Xu = test_x,
  Yu = test_y,
  k = ks,
  method = local_fit_gpr(),
  control = my_control
)
sbl_2
plot(sbl_2)

# Example 2
# the LOCAL algorithm (Shenk et al., 1997)
local_algorithm <- mbl(
  Xr = train_x,
  Yr = train_y,
  Xu = test_x,
  Yu = test_y,
  k = ks,
  method = local_fit_wapls(min_pls_c = 3, max_pls_c = 15),
  diss_method = "cor",
  diss_usage = "none",
  control = my_control
)
local_algorithm
plot(local_algorithm)

# Example 3
# A variation of the LOCAL algorithm (using the optimized pc
# dissimilarity matrix) and dissimilarity matrix as source of
# additional predictors
local_algorithm_2 <- mbl(
  Xr = train_x,
  Yr = train_y,
  Xu = test_x,
  Yu = test_y,
  k = ks,
  method = local_fit_wapls(min_pls_c = 3, max_pls_c = 15),
  diss_method = "pca",
  diss_usage = "predictors",
  control = my_control
)
local_algorithm_2
plot(local_algorithm_2)

# Example 4
# Running the mbl function in parallel with example 2

n_cores <- 2

if (parallel::detectCores() < 2) {
  n_cores <- 1
}

# Alternatively:
# n_cores <- parallel::detectCores() - 1
# if (n_cores == 0) {
#  n_cores <- 1
# }

library(doParallel)
clust <- makeCluster(n_cores)
registerDoParallel(clust)

# Alternatively:
# library(doSNOW)
# clust <- makeCluster(n_cores, type = "SOCK")
# registerDoSNOW(clust)
# getDoParWorkers()

local_algorithm_par <- mbl(
  Xr = train_x,
  Yr = train_y,
  Xu = test_x,
  Yu = test_y,
  k = ks,
  method = local_fit_wapls(min_pls_c = 3, max_pls_c = 15),
  diss_method = "cor",
  diss_usage = "none",
  control = my_control
)
local_algorithm_par

registerDoSEQ()
try(stopCluster(clust))

# Example 5
# Using local pls distances
with_local_diss <- mbl(
  Xr = train_x,
  Yr = train_y,
  Xu = test_x,
  Yu = test_y,
  k = ks,
  method = local_fit_wapls(min_pls_c = 3, max_pls_c = 15),
  diss_method = "pls",
  diss_usage = "predictors",
  control = my_control,
  .local = TRUE,
  pre_k = 150
)
with_local_diss
plot(with_local_diss)


A function that controls some aspects of the memory-based learning process in the mbl function

Description

Experimental lifecycle

This function is used to further control some aspects of the memory-based learning process in the mbl function.

Usage

mbl_control(
  return_dissimilarity = FALSE,
  validation_type = c("NNv", "local_cv"),
  tune_locally = TRUE,
  number = 10,
  p = 0.75,
  range_prediction_limits = TRUE,
  progress = TRUE,
  allow_parallel = TRUE
)

Arguments

return_dissimilarity

a logical indicating if the dissimilarity matrix between Xr and Xu must be returned.

validation_type

a character vector which indicates the (internal) validation method(s) to be used for assessing the global performance of the local models. Possible options are: "NNv" and "local_cv". Alternatively "none" can be used when cross-validation is not required (see details below).

tune_locally

a logical. It only applies when validation_type = "local_cv" and "pls" or "wapls" fitting algorithms are used. If TRUE, the parameters of the local pls-based models (i.e. pls factors for the "pls" method and minimum and maximum pls factors for the "wapls" method) are tuned at each local segment. Default is TRUE.

number

an integer indicating the number of sampling iterations at each local segment when "local_cv" is selected in the validation_type argument. Default is 10.

p

a numeric value indicating the percentage of observations to be retained at each sampling iteration at each local segment when "local_cv" is selected in the validation_type argument. Default is 0.75 (75 %).

range_prediction_limits

a logical. It indicates whether the prediction limits at each local regression are determined by the range of the response variable within each neighborhood. When the predicted value is outside this range, it will be automatically replaced with the value of the nearest range value. If FALSE, no prediction limits are imposed. Default is TRUE.

progress

a logical indicating whether or not to print a progress bar for each observation to be predicted. Default is TRUE. Note: In case parallel processing is used, these progress bars will not be printed.

allow_parallel

a logical indicating if parallel execution is allowed. If TRUE, this parallelism is applied to the loop in mbl in which each iteration takes care of a single observation in Xu. The parallelization of this for loop is implemented using the foreach function of the foreach package. Default is TRUE.
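The effect of range_prediction_limits can be sketched with a small base R helper. The function limit_prediction is hypothetical and only illustrates the replacement rule described for that argument.

```r
limit_prediction <- function(pred, y_neighbors) {
  # Range of the response variable within the neighborhood
  rng <- range(y_neighbors)
  # Predictions outside the range are replaced by the nearest range limit
  min(max(pred, rng[1]), rng[2])
}

y_neighbors <- c(3, 8, 10)
limit_prediction(12.4, y_neighbors) # above the range
limit_prediction(5, y_neighbors) # inside the range, left unchanged
limit_prediction(1, y_neighbors) # below the range
```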

Details

The validation methods available for assessing the predictive performance of the memory-based learning method used are described as follows:

Value

a list mirroring the specified parameters

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens

References

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

See Also

f_diss, cor_diss, sid, ortho_diss, mbl

Examples

# A control list with the default parameters
mbl_control()

Moving/rolling correlation distance of two matrices

Description

Computes a moving window correlation distance between two data matrices

Usage

moving_cor_diss(X,Y,w)

Arguments

X

a matrix

Y

a matrix

w

window size (must be odd)

Value

a matrix of correlation distance
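A moving window correlation distance between two single observations can be sketched as follows. The helper moving_cor_dist is hypothetical, and the exact formula (the average of (1 - correlation) / 2 over all windows of size w) is an assumption that may differ from the one implemented in moving_cor_diss.

```r
moving_cor_dist <- function(x, y, w) {
  stopifnot(w %% 2 == 1, length(x) == length(y), w <= length(x))
  # Starting position of every window of size w
  starts <- seq_len(length(x) - w + 1)
  rho <- vapply(starts, function(s) {
    idx <- s:(s + w - 1)
    cor(x[idx], y[idx])
  }, numeric(1))
  # Assumed scaling: 0 for identical shapes, 1 for perfectly opposite shapes
  mean((1 - rho) / 2)
}

x <- sin(seq(0, 6, length.out = 50))
moving_cor_dist(x, x, w = 11)
moving_cor_dist(x, -x, w = 11)
```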

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens


orthogonal scores algorithm of partial least squares (opls)

Description

Computes orthogonal scores partial least squares (opls) regressions with the NIPALS algorithm. It allows multiple response variables. It does not return the variance information of the components. NOTE: For internal use only!
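The orthogonal scores approach can be sketched in base R for the simplest case of a single response variable, where no inner iterations are required. The function pls1_scores is hypothetical and is not the C++ implementation of the package.

```r
pls1_scores <- function(X, y, ncomp) {
  X <- scale(X, center = TRUE, scale = FALSE)
  y <- y - mean(y)
  scores <- matrix(0, nrow(X), ncomp)
  for (a in seq_len(ncomp)) {
    # Weights: direction of maximum covariance between X and y
    w <- crossprod(X, y)
    w <- w / sqrt(sum(w^2))
    # Scores, X loadings and y loading for the current component
    t_a <- X %*% w
    p <- crossprod(X, t_a) / sum(t_a^2)
    q <- sum(y * t_a) / sum(t_a^2)
    # Deflation: remove what the current component explains
    X <- X - t_a %*% t(p)
    y <- y - t_a * q
    scores[, a] <- t_a
  }
  scores
}

set.seed(11)
X <- matrix(rnorm(30 * 5), nrow = 30)
y <- X[, 1] - X[, 3] + rnorm(30, sd = 0.1)
scores <- pls1_scores(X, y, ncomp = 3)

# Off-diagonal elements are (numerically) zero: the scores are orthogonal
round(crossprod(scores), 8)
```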

Usage

opls(X, 
     Y, 
     ncomp, 
     scale, 
     maxiter, 
     tol, 
     algorithm = "pls", 
     xls_min_w = 3, 
     xls_max_w = 15)

Arguments

X

a matrix of predictor variables.

Y

a matrix of either a single or multiple response variables.

ncomp

the number of pls components.

scale

logical indicating whether X must be scaled.

maxiter

maximum number of iterations.

tol

limit for convergence of the algorithm in the nipals algorithm.

algorithm

(for weights computation) a character string indicating what method to use. Options are: 'pls' for pls (using covariance between X and Y), 'mpls' for modified pls (using correlation between X and Y) or 'xls' for extended pls (as implemented in BUCHI NIRWise PLUS software).

xls_min_w

(for weights computation) an integer indicating the minimum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 3 (as in BUCHI NIRWise PLUS software).

xls_max_w

(for weights computation) an integer indicating the maximum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 15 (as in BUCHI NIRWise PLUS software).

Value

a list containing the following elements:

Author(s)

Leonardo Ramirez-Lopez


Internal Cpp function for performing leave-group-out cross-validations for pls regression

Description

For internal use only!

Usage

opls_cv_cpp(X, Y, scale, method, 
                  mindices, pindices, 
                  min_component, ncomp, 
                  new_x, 
                  maxiter, tol, 
                  wapls_grid, 
                  algorithm, 
                  statistics = TRUE)

Arguments

X

a matrix of predictor variables.

Y

a matrix of a single response variable.

scale

a logical indicating whether the matrix of predictors (X) must be scaled.

method

the method used for regression. One of the following options: 'pls' or 'wapls' or 'completewapls1p'.

mindices

a matrix with n rows and m columns where m is equivalent to the number of resampling iterations. The elements of each column indicate the indices of the observations to be used for modeling at each iteration.

pindices

a matrix with k rows and m columns where m is equivalent to the number of resampling iterations. The elements of each column indicate the indices of the observations to be used for predicting at each iteration.

min_component

an integer indicating the number of minimum pls components (if the method = 'pls').

ncomp

an integer indicating the number of pls components.

new_x

a matrix of one row corresponding to the observation to be predicted (if the method = 'wapls').

maxiter

maximum number of iterations.

tol

limit for convergence of the algorithm in the nipals algorithm.

wapls_grid

the grid on which the search for the best combination of minimum and maximum pls factors of 'wapls' is based on in case method = 'completewapls1p'.

algorithm

either pls ('pls') or modified pls ('mpls'). See get_weights function.

statistics

a logical value indicating whether the precision and accuracy statistics are to be returned, otherwise the predictions for each validation segment are retrieved.
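The structure expected for mindices and pindices can be sketched in base R. The sampling scheme shown here (random subsets without replacement, with the remaining observations used for prediction) is an assumption, and all object names other than mindices and pindices are hypothetical.

```r
set.seed(5)
n <- 20 # number of observations
number <- 4 # number of resampling iterations (columns)
p <- 0.75 # proportion of observations used for modeling
n_model <- floor(p * n)

# One column per iteration with the indices used for modeling
mindices <- replicate(number, sort(sample(n, n_model)))

# One column per iteration with the indices used for predicting
pindices <- apply(mindices, 2, function(idx) setdiff(seq_len(n), idx))

dim(mindices)
dim(pindices)
```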

Value

if statistics = TRUE a list containing the following one-row matrices:

if statistics = FALSE a list containing the following one-row matrices:

If method = "wapls", data of the pls weights are output in this list (compweights).

If method = "completewapls1p", data of all the combinations of components passed in wapls_grid are output in this list (complete_compweights).

Author(s)

Leonardo Ramirez-Lopez


orthogonal scores algorithm of partial least squares (opls) projection

Description

Computes orthogonal scores partial least squares (opls) projection with the NIPALS algorithm. It allows multiple response variables. Although the main use of the function is for projection, it also retrieves regression coefficients. NOTE: For internal use only!

Usage

opls_for_projection(X, Y, ncomp, scale,
                    maxiter, tol,
                    pcSelmethod = "var",
                    pcSelvalue = 0.01, 
                    algorithm = "pls", 
                    xls_min_w = 3, 
                    xls_max_w = 15)

Arguments

X

a matrix of predictor variables.

Y

a matrix of either a single or multiple response variables.

ncomp

the number of pls components.

scale

logical indicating whether X must be scaled.

maxiter

maximum number of iterations.

tol

convergence tolerance of the nipals algorithm.

pcSelmethod

if regression = TRUE, the method for selecting the number of components. Options are: 'manual', 'cumvar' (for selecting the number of principal components based on a given cumulative amount of explained variance) and 'var' (for selecting the number of principal components based on a given amount of explained variance). Default is 'var'.

pcSelvalue

a numerical value that complements the selected method (pcSelmethod). If 'cumvar' is chosen, pcSelvalue must be a value (larger than 0 and below 1) indicating the maximum amount of cumulative variance that the retained components should explain. If 'var' is chosen (default), pcSelvalue must be a value (larger than 0 and below 1) indicating that components that explain (individually) a variance lower than this threshold must be excluded. Default is 0.01. If 'manual' is chosen, pcSelvalue has no effect and the number of components retrieved is the one specified in ncomp.

algorithm

(for weights computation) a character string indicating what method to use. Options are: 'pls' for pls (using covariance between X and Y), 'mpls' for modified pls (using correlation between X and Y) or 'xls' for extended pls (as implemented in BUCHI NIRWise PLUS software).

xls_min_w

(for weights computation) an integer indicating the minimum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 3 (as in BUCHI NIRWise PLUS software).

xls_max_w

(for weights computation) an integer indicating the maximum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 15 (as in BUCHI NIRWise PLUS software).

Value

a list containing the following elements:

Author(s)

Leonardo Ramirez-Lopez


orthogonal scores algorithm of partial least squares (opls_get_all)

Description

Computes orthogonal scores partial least squares (opls_get_all) regressions with the NIPALS algorithm. It retrieves a comprehensive set of pls outputs (e.g. vip and sensitivity radius). It allows multiple response variables. NOTE: For internal use only!

Usage

opls_get_all(X, 
             Y, 
             ncomp, 
             scale, 
             maxiter, 
             tol, 
             algorithm = "pls", 
             xls_min_w = 3, 
             xls_max_w = 15)

Arguments

X

a matrix of predictor variables.

Y

a matrix of either a single or multiple response variables.

ncomp

the number of pls components.

scale

logical indicating whether X must be scaled.

maxiter

maximum number of iterations.

tol

convergence tolerance of the nipals algorithm.

algorithm

(for weights computation) a character string indicating what method to use. Options are: 'pls' for pls (using covariance between X and Y), 'mpls' for modified pls (using correlation between X and Y) or 'xls' for extended pls (as implemented in BUCHI NIRWise PLUS software).

xls_min_w

(for weights computation) an integer indicating the minimum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 3 (as in BUCHI NIRWise PLUS software).

xls_max_w

(for weights computation) an integer indicating the maximum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 15 (as in BUCHI NIRWise PLUS software).

Value

a list containing the following elements:

Author(s)

Leonardo Ramirez-Lopez


fast orthogonal scores algorithm of partial least squares (opls)

Description

Computes orthogonal scores partial least squares (opls) regressions with the NIPALS algorithm. It allows multiple response variables. In contrast to the opls function, this one does not compute unnecessary data for (local) regression. For internal use only!

Usage

opls_get_basics(X, Y, ncomp, scale, 
                maxiter, tol, 
                algorithm = "pls", 
                xls_min_w = 3, 
                xls_max_w = 15)

Arguments

X

a matrix of predictor variables.

Y

a matrix of either a single or multiple response variables.

ncomp

the number of pls components.

scale

logical indicating whether X must be scaled.

maxiter

maximum number of iterations.

tol

convergence tolerance of the nipals algorithm.

algorithm

(for weights computation) a character string indicating what method to use. Options are: 'pls' for pls (using covariance between X and Y), 'mpls' for modified pls (using correlation between X and Y) or 'xls' for extended pls (as implemented in BUCHI NIRWise PLUS software).

xls_min_w

(for weights computation) an integer indicating the minimum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 3 (as in BUCHI NIRWise PLUS software).

xls_max_w

(for weights computation) an integer indicating the maximum window size for the "xls" method. Only used if algorithm = 'xls'. Default is 15 (as in BUCHI NIRWise PLUS software).

Value

a list containing the following elements:

Author(s)

Leonardo Ramirez-Lopez


orthogonal scores algorithm of partial least squares (opls)

Description

Computes orthogonal scores partial least squares (opls) regressions with the NIPALS algorithm. It allows multiple response variables. It does not return the variance information of the components. NOTE: For internal use only!

Usage

opls_gs(Xr, 
        Yr,
        Xu, 
        ncomp,
        scale,     
        response = FALSE, 
        reconstruction = TRUE,
        similarity = TRUE,
        fresponse = TRUE,
        algorithm = "pls")

Arguments

Xr

a matrix of predictor variables for the training set.

Yr

a matrix of a single response variable for the training set.

Xu

a matrix of predictor variables for the test set.

ncomp

the number of pls components.

scale

logical indicating whether X must be scaled.

response

logical indicating whether to compute the prediction of Yu.

reconstruction

logical indicating whether to compute the reconstruction error of Xu.

similarity

logical indicating whether to compute the distance score between Xr and Xu (in the pls space).

fresponse

logical indicating whether to compute the score of the variance not explained for Yu.

algorithm

(for weights computation) a character string indicating what method to use. Options are: 'pls' for pls (using covariance between X and Y) or 'mpls' for modified pls (using correlation between X and Y).

Value

a list containing the following elements:

Author(s)

Leonardo Ramirez-Lopez


A function to construct optimal strata for the samples, based on the distribution of the given y.

Description

For internal use only! This function computes the optimal strata from the distribution of the given y.

Usage

optim_sample_strata(y, n)

Arguments

y

a matrix of one column with the response variable.

n

number of samples that must be sampled.

Value

a list with two data.table objects: sample_strata contains the optimal strata, whereas samples_to_get contains information on how many samples per stratum are supposed to be drawn.


A function for computing dissimilarity matrices from orthogonal projections (ortho_diss)

Description

This function computes dissimilarities (in an orthogonal space) between either observations in a given set or between observations in two different sets. The dissimilarities are computed based on either principal component projection or partial least squares projection of the data. After projecting the data, the Mahalanobis distance is applied.
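
The idea of projecting first and then applying the Mahalanobis distance can be sketched in base R as follows. This is only an illustration under stated assumptions: the simulated Xr and Xu, the fixed choice of 3 components and the helper objects are hypothetical, and the package may apply additional scaling to the resulting values.

```r
set.seed(1)
Xr <- matrix(rnorm(30 * 8), nrow = 30) # hypothetical reference set
Xu <- matrix(rnorm(5 * 8), nrow = 5)   # hypothetical second set

# center on the pooled data and project onto the first 3 components
X <- rbind(Xr, Xu)
Xc <- sweep(X, 2, colMeans(X))
sv <- svd(Xc)
scores <- (sv$u %*% diag(sv$d))[, 1:3]

# principal component scores are uncorrelated, so the Mahalanobis distance
# reduces to a Euclidean distance on unit-variance scores
z <- sweep(scores, 2, apply(scores, 2, sd), "/")
zr <- z[1:30, ]
zu <- z[31:35, ]
d2 <- outer(rowSums(zr^2), rowSums(zu^2), "+") - 2 * zr %*% t(zu)
d <- sqrt(pmax(d2, 0))

dim(d) # rows correspond to Xr, columns to Xu
```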

Usage

ortho_diss(Xr, Xu = NULL,
           Yr = NULL,
           pc_selection = list(method = "var", value = 0.01),
           diss_method = "pca",
           .local = FALSE,
           pre_k,
           center = TRUE,
           scale = FALSE,
           compute_all = FALSE,
           return_projection = FALSE,
           allow_parallel = TRUE, ...)

Arguments

Xr

a matrix containing n reference observations/rows and p variables/columns.

Xu

an optional matrix containing data of a second set of observations with p variables/columns.

Yr

a matrix of n rows and one or more columns (variables) with side information corresponding to the observations in Xr (e.g. response variables). It can be numeric with multiple variables/columns, or character with one single column. This argument is required if:

  • diss_method == 'pls': Yr is required to project the variables to orthogonal directions such that the covariance between the extracted pls components and Yr is maximized.

  • pc_selection$method == 'opc': Yr is required to optimize the number of components. The optimal number of projected components is the one for which its distance matrix minimizes the differences between the Yr value of each observation and the Yr value of its closest observation. See sim_eval.

pc_selection

a list of length 2 which specifies the method to be used for optimizing the number of components (principal components or pls factors) to be retained. This list must contain two elements (in the following order): method (a character indicating the method for selecting the number of components) and value (a numerical value that complements the selected method). The methods available are:

  • "opc": optimized principal component selection based on Ramirez-Lopez et al. (2013a, 2013b). The optimal number of components (of a given set of observations) is the one for which its distance matrix minimizes the differences between the Yr value of each observation and the Yr value of its closest observation. In this case, value must be a value (larger than 0 and below min(nrow(Xr) + nrow(Xu), ncol(Xr))) indicating the maximum number of principal components to be tested. See the ortho_projection function for more details.

  • "cumvar": selection of the principal components based on a given cumulative amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of cumulative variance that the combination of retained components should explain.

  • "var": selection of the principal components based on a given amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of variance that a single component should explain in order to be retained.

  • "manual": for manually specifying a fixed number of principal components. In this case, value must be a value (larger than 0 and below the minimum dimension of Xr or Xr and Xu combined) indicating the number of components to be retained.

Default is list(method = "var", value = 0.01).

Optionally, the pc_selection argument admits "opc" or "cumvar" or "var" or "manual" as a single character string. In such case, the default "value" when either "opc" or "manual" are used is 40. When "cumvar" is used the default "value" is set to 0.99 and when "var" is used, the default "value" is set to 0.01.
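
The variance-based selection rules described above can be sketched in base R. The simulated matrix and the object names are hypothetical; the thresholds 0.01 and 0.99 are the defaults quoted in the text.

```r
set.seed(1)
X <- matrix(rnorm(50 * 10), nrow = 50) %*% diag(10:1) # hypothetical data
Xc <- sweep(X, 2, colMeans(X))

# proportion of variance explained by each principal component
ev <- svd(Xc)$d^2
explained <- ev / sum(ev)

# "var": keep components that individually explain at least `value`
n_var <- sum(explained >= 0.01)

# "cumvar": keep the fewest components whose cumulative explained
# variance reaches `value`
n_cumvar <- which(cumsum(explained) >= 0.99)[1]

# "manual": keep a fixed number of components
n_manual <- 5

c(var = n_var, cumvar = n_cumvar, manual = n_manual)
```

The "opc" rule is not covered here because it depends on the side information in Yr rather than on explained variance.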

diss_method

a character value indicating the type of projection on which the dissimilarities must be computed. This argument is equivalent to method argument in the ortho_projection function. Options are:

  • "pca": principal component analysis using the singular value decomposition algorithm.

  • "pca.nipals": principal component analysis using the non-linear iterative partial least squares algorithm.

  • "pls": partial least squares.

  • "mpls": modified partial least squares (Shenk and Westerhaus, 1991 and Westerhaus, 2014).

See the ortho_projection function for further details on the projection methods.

.local

a logical indicating whether or not to compute the dissimilarities locally (i.e. projecting locally the data) by using the pre_k nearest neighbor observations of each target observation. Default is FALSE. See details.

pre_k

if .local = TRUE, an integer value which indicates the number of nearest neighbors to (pre-)retain for each observation to compute the (local) orthogonal dissimilarities to each observation in its neighborhood.

center

a logical indicating if the Xr and Xu must be centered. If Xu is provided the data is centered around the mean of the pooled Xr and Xu matrices (\(Xr \cup Xu\)). For dissimilarity computations based on pls, the data is always centered for the projections.

scale

a logical indicating if the Xr and Xu must be scaled. If Xu is provided the data is scaled based on the standard deviation of the pooled Xr and Xu matrices (\(Xr \cup Xu\)). If center = TRUE, scaling is applied after centering.
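
A minimal base R sketch of centering and scaling on the pooled matrices is given below. The simulated data and object names are hypothetical; the sketch only illustrates that the same pooled means and standard deviations are applied to both sets.

```r
set.seed(1)
Xr <- matrix(rnorm(20 * 4, mean = 5), nrow = 20) # hypothetical reference set
Xu <- matrix(rnorm(10 * 4, mean = 6), nrow = 10) # hypothetical second set

# statistics of the pooled data
pooled <- rbind(Xr, Xu)
mu <- colMeans(pooled)
sdv <- apply(pooled, 2, sd)

# centering first, then scaling, both based on the pooled statistics
Xr_cs <- sweep(sweep(Xr, 2, mu, "-"), 2, sdv, "/")
Xu_cs <- sweep(sweep(Xu, 2, mu, "-"), 2, sdv, "/")
```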

compute_all

a logical. In case Xu is specified it indicates whether or not the distances between all the elements resulting from the pooled Xr and Xu matrices (\(Xr \cup Xu\)) must be computed.

return_projection

a logical. If TRUE the ortho_projection object on which the dissimilarities are computed will be returned. Default is FALSE. Note that for .local = TRUE only the initial projection is returned (i.e. local projections are not).

allow_parallel

a logical (default TRUE). It allows parallel computing of the local distance matrices (i.e. when .local = TRUE). This is done via foreach function of the 'foreach' package.

...

additional arguments to be passed to the ortho_projection function.

Details

When .local = TRUE, first a global dissimilarity matrix is computed based on the parameters specified. Then, by using this matrix for each target observation, a given set of nearest neighbors (pre_k) is identified. These neighbors (together with the target observation) are projected (from the original data space) onto a (local) orthogonal space (using the same parameters specified in the function). In this projected space the Mahalanobis distance between the target observation and its neighbors is recomputed. A missing value is assigned to the observations that do not belong to this set of neighbors (non-neighbor observations). In this case the dissimilarity matrix cannot be considered as a distance metric since it does not necessarily satisfy the symmetry condition for distance matrices (i.e. given two observations \(x_i\) and \(x_j\), the local dissimilarity (\(d\)) between them is relative since generally \(d(x_i, x_j) \neq d(x_j, x_i)\)). On the other hand, when .local = FALSE, the dissimilarity matrix obtained can be considered as a distance matrix.
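
The local procedure can be sketched in base R as follows. This is an illustration only: the simulated data, the fixed number of components (n_comp) and the helper pc_mahal_to_first are hypothetical and do not reproduce the internal code of the package.

```r
set.seed(1)
X <- matrix(rnorm(40 * 6), nrow = 40) # hypothetical observations
pre_k <- 10
n_comp <- 2 # hypothetical fixed number of components

# Mahalanobis distances from the first row of `x` to all its rows, computed
# in the space of the first `n_comp` principal components of `x`
pc_mahal_to_first <- function(x, n_comp) {
  sv <- svd(sweep(x, 2, colMeans(x)))
  # unit-variance scores
  sc <- sv$u[, seq_len(n_comp), drop = FALSE] * sqrt(nrow(x) - 1)
  sqrt(rowSums(sweep(sc, 2, sc[1, ])^2))
}

# step 1: global dissimilarity matrix
sv <- svd(sweep(X, 2, colMeans(X)))
z <- sv$u[, seq_len(n_comp)] * sqrt(nrow(X) - 1)
global_d <- as.matrix(dist(z))

# step 2: local re-projection for each target observation
local_d <- matrix(NA_real_, nrow(X), nrow(X))
for (i in seq_len(nrow(X))) {
  # pre_k nearest neighbors of observation i (excluding itself)
  nn <- order(global_d[i, ])[2:(pre_k + 1)]
  # target (first row) and its neighbors are projected together
  local_d[i, nn] <- pc_mahal_to_first(X[c(i, nn), ], n_comp)[-1]
}

# non-neighbors stay NA; row i and column i generally differ
isSymmetric(local_d)
```

Because each row is computed in its own local projection, the entry for the pair (i, j) need not match the entry for (j, i), which is the asymmetry described above.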

In the cases where "Yr" is required to compute the dissimilarities and if .local = TRUE, care must be taken as some neighborhoods might not have enough observations with non-missing "Yr" values, which might retrieve unreliable dissimilarity computations.

If "opc" or "manual" are used in pc_selection$method and .local = TRUE, the minimum number of observations with non-missing "Yr" values at each neighborhood is determined by pc_selection$value (i.e. the maximum number of components to compute).

Value

a list of class ortho_diss with the following elements:

Author(s)

Leonardo Ramirez-Lopez

References

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

See Also

ortho_projection, sim_eval

Examples

library(prospectr)
data(NIRsoil)

Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Yu <- NIRsoil[!as.logical(NIRsoil$train), "CEC", drop = FALSE]
Yr <- NIRsoil[as.logical(NIRsoil$train), "CEC", drop = FALSE]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]

Xu <- Xu[!is.na(Yu), ]
Yu <- Yu[!is.na(Yu), , drop = FALSE]

Xr <- Xr[!is.na(Yr), ]
Yr <- Yr[!is.na(Yr), , drop = FALSE]

# Computation of the orthogonal dissimilarity matrix using the
# default parameters
pca_diss <- ortho_diss(Xr, Xu)

# Computation of a principal component dissimilarity matrix using
# the "opc" method for the selection of the principal components
pca_diss_optim <- ortho_diss(
  Xr, Xu, Yr,
  pc_selection = list("opc", 40),
  compute_all = TRUE
)

# Computation of a partial least squares (PLS) dissimilarity
# matrix using the "opc" method for the selection of the PLS
# components
pls_diss_optim <- ortho_diss(
  Xr = Xr, Xu = Xu,
  Yr = Yr,
  pc_selection = list("opc", 40),
  diss_method = "pls"
)

Orthogonal projections using principal component analysis and partial least squares

Description

Functions to perform orthogonal projections of high dimensional data matrices using principal component analysis (pca) and partial least squares (pls).

Usage

ortho_projection(Xr, Xu = NULL,
                 Yr = NULL,
                 method = "pca",
                 pc_selection = list(method = "var", value = 0.01),
                 center = TRUE, scale = FALSE, ...)

pc_projection(Xr, Xu = NULL, Yr = NULL,
              pc_selection = list(method = "var", value = 0.01),
              center = TRUE, scale = FALSE,
              method = "pca",
              tol = 1e-6, max_iter = 1000, ...)

pls_projection(Xr, Xu = NULL, Yr,
               pc_selection = list(method = "opc", value = min(dim(Xr), 40)),
               scale = FALSE, method = "pls",
               tol = 1e-6, max_iter = 1000, ...)

## S3 method for class 'ortho_projection'
predict(object, newdata, ...)

Arguments

Xr

a matrix of observations.

Xu

an optional matrix containing data of a second set of observations.

Yr

if the method used in the pc_selection argument is "opc" or if method = "pls", then it must be a matrix containing the side information corresponding to the spectra in Xr. It is equivalent to the side_info parameter of the sim_eval function. In case method = "pca", a matrix (with one or more continuous variables) can also be used as input. The root mean square of differences (rmsd) is used for assessing the similarity between the observations and their corresponding most similar observations in terms of the side information provided. A single discrete variable of class factor can also be passed. In that case, the kappa index is used. See sim_eval function for more details.

method

the method for projecting the data. Options are:

  • "pca": principal component analysis using the singular value decomposition algorithm.

  • "pca.nipals": principal component analysis using the non-linear iterative partial least squares algorithm.

  • "pls": partial least squares.

  • "mpls": modified partial least squares. See details.

pc_selection

a list of length 2 which specifies the method to be used for optimizing the number of components (principal components or pls factors) to be retained. This list must contain two elements (in the following order): method (a character indicating the method for selecting the number of components) and value (a numerical value that complements the selected method). The methods available are:

  • "opc": optimized principal component selection based on Ramirez-Lopez et al. (2013a, 2013b). The optimal number of components of a given set of observations is the one for which its distance matrix minimizes the differences between the Yr value of each observation and the Yr value of its closest observation. In this case value must be a value (larger than 0 and below min(nrow(Xr) + nrow(Xu), ncol(Xr))) indicating the maximum number of principal components to be tested. See details.

  • "cumvar": selection of the principal components based on a given cumulative amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of cumulative variance that the combination of retained components should explain.

  • "var": selection of the principal components based on a given amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of variance that a single component should explain in order to be retained.

  • "manual": for manually specifying a fixed number of principal components. In this case, value must be a value (larger than 0 and below the minimum dimension of Xr or Xr and Xu combined) indicating the number of components to be retained.

The list list(method = "var", value = 0.01) is the default. Optionally, the pc_selection argument admits "opc" or "cumvar" or "var" or "manual" as a single character string. In such a case the default "value" when either "opc" or "manual" are used is 40. When "cumvar" is used the default "value" is set to 0.99 and when "var" is used, the default "value" is set to 0.01.

center

a logical indicating if the data Xr (and Xu if specified) must be centered. If Xu is specified the data is centered on the basis of \(Xr \cup Xu\). NOTE: This argument only applies to the principal components projection. For pls projections the data is always centered.

scale

a logical indicating if Xr (and Xu if specified) must be scaled. If Xu is specified the data is scaled on the basis of \(Xr \cup Xu\).

...

additional arguments to be passed to pc_projection or pls_projection.

tol

tolerance limit for convergence of the algorithm in the nipals algorithm (default is 1e-06). In the case of PLS this applies only to Yr with more than one variable.

max_iter

maximum number of iterations (default is 1000). In the case of method = "pls" this applies only to Yr matrices with more than one variable.

object

object of class "ortho_projection".

newdata

an optional data frame or matrix in which to look for variables with which to predict. If omitted, the scores are used. It must contain the same number of columns, to be used in the same order.

Details

In the case of method = "pca", the algorithm used is the singular value decomposition in which a given data matrix (\(X\)) is factorized as follows:

\[X = UDV^{T}\]

where \(U\) and \(V\) are orthogonal matrices containing the left and right singular vectors of \(X\) respectively, and \(D\) is a diagonal matrix containing the singular values of \(X\). The matrix of principal component scores is obtained by a matrix multiplication of \(U\) and \(D\), and the matrix of principal component loadings is equivalent to the matrix \(V\).
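
The relation between the factorization, the scores and the loadings can be checked with base R. The simulated matrix and the object names are hypothetical; the last lines illustrate how new observations could be projected with the same center and loadings.

```r
set.seed(1)
X <- matrix(rnorm(25 * 6), nrow = 25) # hypothetical data matrix
center <- colMeans(X)
Xc <- sweep(X, 2, center)

sv <- svd(Xc)
scores <- sv$u %*% diag(sv$d) # U D
loadings <- sv$v              # V

# the scores are also the projection of the centered data onto V
all.equal(scores, Xc %*% loadings)

# projecting new observations uses the same center and loadings
new_x <- matrix(rnorm(3 * 6), nrow = 3)
new_scores <- sweep(new_x, 2, center) %*% loadings
```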

When method = "pca.nipals", the algorithm used for principal component analysis is the non-linear iterative partial least squares (nipals).
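
For orientation, a textbook version of the nipals iteration for principal components is sketched below in base R. The function nipals_pca, its convergence rule and its starting vector are assumptions made for this example and may differ from the compiled implementation used by the package.

```r
nipals_pca <- function(X, ncomp, tol = 1e-6, max_iter = 1000) {
  Xd <- sweep(X, 2, colMeans(X)) # centered copy that gets deflated
  scores <- matrix(0, nrow(X), ncomp)
  loadings <- matrix(0, ncol(X), ncomp)
  for (a in seq_len(ncomp)) {
    # start from the column with the largest variance
    t_a <- Xd[, which.max(apply(Xd, 2, var))]
    for (it in seq_len(max_iter)) {
      p_a <- crossprod(Xd, t_a) / sum(t_a^2) # loadings for current scores
      p_a <- p_a / sqrt(sum(p_a^2))          # normalize to unit length
      t_new <- Xd %*% p_a                    # updated scores
      converged <- sum((t_new - t_a)^2) < tol * sum(t_new^2)
      t_a <- t_new
      if (converged) break
    }
    scores[, a] <- t_a
    loadings[, a] <- p_a
    Xd <- Xd - t_a %*% t(p_a) # remove the extracted component
  }
  list(scores = scores, loadings = loadings)
}

set.seed(1)
X <- matrix(rnorm(30 * 5), nrow = 30) %*% diag(5:1) # hypothetical data
fit <- nipals_pca(X, ncomp = 2, tol = 1e-12)
```

Up to the sign of each component, the scores obtained this way should agree closely with those from the singular value decomposition.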

In the case of the partial least squares projection (a.k.a. projection to latent structures), the nipals regression algorithm is used by default. Details on the "nipals" algorithm are presented in Martens (1991). Another method called modified pls ('mpls') can also be used. The modified pls was proposed by Shenk and Westerhaus (1991, see also Westerhaus, 2014) and it differs from the standard pls method in the way the weights of Xr (used to compute the matrix of scores) are obtained. While pls uses the covariance between Yr and Xr (and later their deflated versions corresponding to each pls component iteration) to obtain these weights, the modified pls uses the correlation as weights. The authors indicate that by using correlation, a larger portion of the response variable(s) can be explained.
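
The difference between the two weighting schemes can be illustrated for the first component with base R. The simulated Xr and Yr, the helper unit and the normalization to unit length are assumptions of this sketch, not the package's internal code.

```r
set.seed(1)
# hypothetical predictors with very different variances per column
Xr <- matrix(rnorm(40 * 5), nrow = 40) %*% diag(c(10, 5, 2, 1, 0.5))
Yr <- Xr[, 4] + rnorm(40, sd = 0.1) # hypothetical response

Xc <- sweep(Xr, 2, colMeans(Xr))
yc <- Yr - mean(Yr)

unit <- function(v) v / sqrt(sum(v^2))

w_pls <- unit(drop(cov(Xc, yc)))  # standard pls: covariance-based weights
w_mpls <- unit(drop(cor(Xc, yc))) # modified pls: correlation-based weights

# scores of the first component under each weighting
t_pls <- Xc %*% w_pls
t_mpls <- Xc %*% w_mpls

round(cbind(w_pls, w_mpls), 3)
```

Covariance-based weights are influenced by the variance of each predictor, whereas correlation-based weights depend only on how strongly each predictor is associated with the response.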

When "opc" is the method specified in pc_selection, the selection of the components is carried out by using an iterative method based on the side information concept (Ramirez-Lopez et al. 2013a, 2013b). First, let \(P\) be a sequence of retained components (so that \(P = 1, 2, ..., k\)). At each iteration, the function computes a dissimilarity matrix retaining \(p_i\) components. The side information value of each observation is then compared against the side information value of its most spectrally similar observation (closest Xr observation). The optimal number of components retrieved by the function is the one that minimizes the root mean squared differences (RMSD) in the case of continuous variables, or maximizes the kappa index in the case of categorical variables. In this process, the sim_eval function is used. Note that for the "opc" method Yr is required (i.e. the side information of the observations).
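
For a continuous side information variable, the search can be sketched in base R as below. The simulated data, the maximum of 6 components and the object names are hypothetical; the package relies on sim_eval for this step, which is not reproduced here.

```r
set.seed(1)
n <- 60
X <- matrix(rnorm(n * 8), nrow = n)             # hypothetical spectra
y <- X[, 1] + 0.5 * X[, 2] + rnorm(n, sd = 0.2) # hypothetical side information

sv <- svd(sweep(X, 2, colMeans(X)))
z <- sv$u * sqrt(n - 1) # unit-variance principal component scores

max_comp <- 6
rmsd <- sapply(seq_len(max_comp), function(p) {
  d <- as.matrix(dist(z[, seq_len(p), drop = FALSE]))
  diag(d) <- Inf               # an observation cannot be its own neighbor
  nn <- apply(d, 1, which.min) # closest observation
  sqrt(mean((y - y[nn])^2))    # RMSD of the side information
})

opc <- which.min(rmsd) # number of components with the lowest RMSD
```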

Value

a list of class ortho_projection with the following components:

predict.ortho_projection returns a matrix of scores projected for newdata.

Author(s)

Leonardo Ramirez-Lopez

References

Martens, H. (1991). Multivariate calibration. John Wiley & Sons.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

Shenk, J. S., & Westerhaus, M. O. 1991. Population structuring of near infrared spectra and modified partial least squares regression. Crop Science, 31(6), 1548-1555.

Shenk, J., Westerhaus, M., and Berzaghi, P. 1997. Investigation of a LOCAL calibration procedure for near infrared instruments. Journal of Near Infrared Spectroscopy, 5, 223-232.

Westerhaus, M. 2014. Eastern Analytical Symposium Award for outstanding achievements in near infrared spectroscopy: my contributions to near infrared spectroscopy. NIR news, 25(8), 16-20.

See Also

ortho_diss, sim_eval, mbl

Examples


library(prospectr)
data(NIRsoil)

# Preprocess the data using detrend plus first derivative with Savitzky and
# Golay smoothing filter
sg_det <- savitzkyGolay(
  detrend(NIRsoil$spc,
    wav = as.numeric(colnames(NIRsoil$spc))
  ),
  m = 1,
  p = 1,
  w = 7
)
NIRsoil$spc_pr <- sg_det

# split into training and testing sets
test_x <- NIRsoil$spc_pr[NIRsoil$train == 0 & !is.na(NIRsoil$CEC), ]
test_y <- NIRsoil$CEC[NIRsoil$train == 0 & !is.na(NIRsoil$CEC)]

train_y <- NIRsoil$CEC[NIRsoil$train == 1 & !is.na(NIRsoil$CEC)]
train_x <- NIRsoil$spc_pr[NIRsoil$train == 1 & !is.na(NIRsoil$CEC), ]

# A principal component analysis using 5 components
pca_projected <- ortho_projection(train_x, pc_selection = list("manual", 5))
pca_projected

# A principal components projection using the "opc" method
# for the selection of the optimal number of components
pca_projected_2 <- ortho_projection(
  Xr = train_x, Xu = test_x, Yr = train_y,
  method = "pca",
  pc_selection = list("opc", 40)
)
pca_projected_2
plot(pca_projected_2)

# A partial least squares projection using the "opc" method
# for the selection of the optimal number of components
pls_projected <- ortho_projection(
  Xr = train_x, Xu = test_x, Yr = train_y,
  method = "pls",
  pc_selection = list("opc", 40)
)
pls_projected
plot(pls_projected)

# A partial least squares projection using the "cumvar" method
# for the selection of the optimal number of components
pls_projected_2 <- ortho_projection(
  Xr = train_x, Xu = test_x, Yr = train_y,
  method = "pls",
  pc_selection = list("cumvar", 0.99)
)


Function for computing the overall variance of a matrix

Description

Computes the variance of a matrix. For internal use only!

Usage

overall_var(X)

Arguments

X

a matrix.

Value

a vector of standard deviation values.

Author(s)

Leonardo Ramirez-Lopez


Principal components based on the non-linear iterative partial least squares (nipals) algorithm

Description

Computes principal components with the NIPALS algorithm. For internal use only!

Usage

pca_nipals(X, ncomp, center, scale,
           maxiter, tol,
           pcSelmethod = "var",
           pcSelvalue = 0.01)

Arguments

X

a matrix of predictor variables.

ncomp

the number of principal components.

scale

logical indicating whether X must be scaled.

maxiter

maximum number of iterations.

tol

convergence tolerance of the nipals algorithm.

pcSelmethod

the method for selecting the number of components. Options are: 'cumvar' (for selecting the number of principal components based on a given cumulative amount of explained variance) and "var" (for selecting the number of principal components based on a given amount of explained variance). Default is 'var'

pcSelvalue

a numerical value that complements the selected method (pcSelmethod). If "cumvar" is chosen, it must be a value (larger than 0 and below 1) indicating the maximum amount of cumulative variance that the retained components should explain. If "var" is chosen, it must be a value (larger than 0 and below 1) indicating that components that explain (individually) a variance lower than this threshold must be excluded. If "manual" is chosen, it must be a value specifying the desired number of principal components to retain. Default is 0.01.

Value

a list containing the following elements:

Author(s)

Leonardo Ramirez-Lopez


Get the package version info

Description

returns package info.

Usage

pkg_info(pkg = "resemble")

Arguments

pkg

the package name, i.e. "resemble".


Plot method for an object of class mbl

Description

Plots the content of an object of class mbl

Usage

## S3 method for class 'mbl'
plot(x, g = c("validation", "gh"), param = "rmse", pls_c = c(1,2), ...)

Arguments

x

an object of class mbl (as returned by mbl).

g

a character vector indicating what results shall be plotted. Options are: "validation" (for plotting the validation results) and/or "gh" (for plotting the pls scores used to compute the GH distance. See details).

param

a character string indicating what validation statistics shall be plotted. The following options are available: "rmse", "st_rmse" or "r2". These options are only available if the mbl object contains validation results.

pls_c

a numeric vector of length one or two indicating the pls factors to be plotted. Default is c(1, 2). It is only available if "gh" is specified in the g argument.

...

some arguments to be passed to the plot methods.

Details

For plotting the pls scores from the pls score matrix (of more than one column), this matrix is first transformed from the Euclidean space to the Mahalanobis space. This is done by multiplying the score matrix by the inverse square root of its covariance matrix. The square root of this matrix is estimated using a singular value decomposition.
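
The transformation can be sketched in base R as below, assuming that the matrix applied is the inverse square root of the covariance matrix (which is what makes Euclidean distances in the new space equal to Mahalanobis distances in the original one). The simulated score matrix and object names are hypothetical.

```r
set.seed(1)
# hypothetical correlated score matrix with 3 columns
mix <- matrix(c(2, 1, 0, 0, 1, 1, 0, 0, 3), nrow = 3)
scores <- matrix(rnorm(50 * 3), nrow = 50) %*% mix

# inverse square root of the covariance matrix via its SVD
sv <- svd(cov(scores))
inv_sqrt_cov <- sv$u %*% diag(1 / sqrt(sv$d)) %*% t(sv$v)

m_scores <- scores %*% inv_sqrt_cov

# squared Euclidean distance in the transformed space versus squared
# Mahalanobis distance in the original space
d_euclid <- sum((m_scores[1, ] - m_scores[2, ])^2)
d_mahal <- mahalanobis(scores[1, ], scores[2, ], cov(scores))
all.equal(d_euclid, d_mahal)
```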

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens

See Also

mbl

Examples


library(prospectr)

data(NIRsoil)

Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)]
Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]

Xu <- Xu[!is.na(Yu), ]
Yu <- Yu[!is.na(Yu)]

Xr <- Xr[!is.na(Yr), ]
Yr <- Yr[!is.na(Yr)]

ctrl <- mbl_control(validation_type = "NNv")

ex_1 <- mbl(
  Yr = Yr, Xr = Xr, Xu = Xu,
  diss_method = "cor",
  diss_usage = "none",
  gh = TRUE,
  control = ctrl,
  k = seq(50, 250, 30)
)

plot(ex_1)
plot(ex_1, g = "gh", pls_c = c(2, 3))


Plot method for an object of class ortho_projection

Description

Plots objects of class ortho_projection

Usage

## S3 method for class 'ortho_projection'
plot(x, col = "dodgerblue", ...)

Arguments

x

an object of class ortho_projection (as returned by ortho_projection).

col

the color of the plots (default is "dodgerblue")

...

arguments to be passed to methods.

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens

See Also

ortho_projection


Cross validation for PLS regression

Description

for internal use only!

Usage

pls_cv(
  x,
  y,
  ncomp,
  method = c("pls", "wapls"),
  center = TRUE,
  scale,
  min_component = 1,
  new_x = matrix(0, 1, 1),
  weights = NULL,
  p = 0.75,
  number = 10,
  group = NULL,
  retrieve = TRUE,
  tune = TRUE,
  max_iter = 1,
  tol = 1e-06,
  seed = NULL,
  modified = FALSE
)

Prediction function for the gaussian_process function (Gaussian process regression with dot product covariance)

Description

Predicts response values based on a model generated by the gaussian_process function (Gaussian process regression with dot product covariance). For internal use only!

Usage

predict_gaussian_process(Xz, alpha, newdata, scale, Xcenter, Xscale, Ycenter, Yscale)

Arguments

newdata

a matrix containing the predictor variables

scale

a logical indicating whether the matrix of predictors used to create the regression model (in the gaussian_process function) was scaled

Xcenter

if center = TRUE a matrix of one row with the values that must be used for centering newdata.

Xscale

if scale = TRUE a matrix of one row with the values that must be used for scaling newdata.

Ycenter

if center = TRUE a matrix of one row with the values that must be used for accounting for the centering of the response variable.

Yscale

if scale = TRUE a matrix of one row with the values that must be used for accounting for the scaling of the response variable.

Value

a matrix of predicted values

Author(s)

Leonardo Ramirez-Lopez


Prediction function for the opls and fopls functions

Description

Predicts response values based on a model generated by either the opls or the fopls function. For internal use only!

Usage

predict_opls(bo, b, ncomp, newdata, scale, Xscale)

Arguments

bo

a numeric value indicating the intercept.

b

the matrix of regression coefficients.

ncomp

an integer value indicating how many components must be used in the prediction.

newdata

a matrix containing the predictor variables.

scale

a logical indicating whether the matrix of predictors used to create the regression model was scaled.

Xscale

if scale = TRUE a matrix of one row with the values that must be used for scaling newdata.

Value

a matrix of predicted values.

Author(s)

Leonardo Ramirez-Lopez


Print method for an object of class local_fit

Description

Prints the contents of an object of class local_fit

Usage

## S3 method for class 'local_fit'
print(x, ...)

Arguments

x

an object of class local_fit

...

not yet functional.

Author(s)

Leonardo Ramirez-Lopez


Print method for an object of class local_ortho_diss

Description

Prints the content of an object of class local_ortho_diss

Usage

## S3 method for class 'local_ortho_diss'
print(x, ...)

Arguments

x

an object of class local_ortho_diss (returned by ortho_diss when it uses .local = TRUE).

...

arguments to be passed to methods (not yet functional).

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens


Print method for an object of class mbl

Description

Prints the content of an object of class mbl

Usage

## S3 method for class 'mbl'
print(x, ...)

Arguments

x

an object of class mbl (as returned by the mbl function).

...

arguments to be passed to methods (not functional).

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens


Print method for an object of class ortho_projection

Description

Prints the contents of an object of class ortho_projection

Usage

## S3 method for class 'ortho_projection'
print(x, ...)

Arguments

x

an object of class ortho_projection (as returned by the ortho_projection function).

...

arguments to be passed to methods (not yet functional).

Author(s)

Leonardo Ramirez-Lopez


Projection function for the opls function

Description

Projects new spectra onto a PLS space based on a model generated by either the opls or the opls2 function. For internal use only!

Usage

project_opls(projection_mat, ncomp, newdata, scale, Xcenter, Xscale)

Arguments

projection_mat

the projection matrix generated by the opls function.

ncomp

an integer value indicating how many components must be used in the projection.

newdata

a matrix containing the predictor variables.

scale

a logical indicating whether the matrix of predictors used to create the regression model was scaled.

Xcenter

a matrix of one row with the values that must be used for centering newdata.

Xscale

if scale = TRUE a matrix of one row with the values that must be used for scaling newdata.

Value

a matrix corresponding to the new spectra projected onto the PLS space

Author(s)

Leonardo Ramirez-Lopez


Projection to pls and then re-construction

Description

Projects spectra onto a PLS space and then reconstructs them.

Usage

reconstruction_error(x, projection_mat, xloadings, scale, Xcenter, Xscale,
                     scale_back = FALSE)

Arguments

x

a matrix to project.

projection_mat

the projection matrix generated by the opls_get_basics function.

xloadings

the loadings matrix generated by the opls_get_basics function.

scale

logical indicating if scaling is required

Xcenter

a matrix of one row with the centering values

Xscale

a matrix of one row with the scaling values

scale_back

a logical indicating whether the reconstruction error must be computed after de-centering and de-scaling the data. Default is FALSE.

Value

a matrix of 1 row and 1 column.
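As a rough illustration of the project-then-reconstruct idea, the sketch below uses a principal component projection as a stand-in for the PLS projection and measures the error as a root mean squared difference. Both choices are assumptions made for the example; the toy data and all object names are hypothetical and do not reproduce the internal implementation.

```r
# Toy data: 50 observations, 10 variables (simulated)
set.seed(42)
x <- matrix(rnorm(50 * 10), 50, 10)

# Centering values stored as a one-row matrix
Xcenter <- matrix(colMeans(x), nrow = 1)
xc <- sweep(x, 2, Xcenter[1, ], "-")

# Stand-in projection: for PCA the projection matrix and the loadings
# are built from the same right singular vectors
sv <- svd(xc)
ncomp <- 3
projection_mat <- sv$v[, seq_len(ncomp)]    # variables x components
xloadings <- t(sv$v[, seq_len(ncomp)])      # components x variables

# Project onto the reduced space and reconstruct the (centered) spectra
scores <- xc %*% projection_mat
x_hat <- scores %*% xloadings

# Assumed error measure: root mean squared reconstruction residual
rec_error <- matrix(sqrt(mean((xc - x_hat)^2)), 1, 1)
rec_error
```

Retaining more components lowers the error; with all 10 components the toy reconstruction is exact up to rounding.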

Author(s)

Leonardo Ramirez-Lopez


A function to create calibration and validation sample sets for leave-group-out cross-validation

Description

For internal use only! This function performs stratified sampling based on the values of a continuous response variable (y). If group is provided, the sampling is done based on the groups and the average of y per group. It is used to create calibration and validation groups for leave-group-out cross-validation (or leave-group-of-groups-out cross-validation if the group argument is provided).

Usage

sample_stratified(y, p, number, group = NULL, replacement = FALSE, seed = NULL)

Arguments

y

a matrix of one column with the response variable.

p

the percentage of samples (or groups if group argument is used) to retain in the validation_indices set

number

the number of sample groups to be created.

group

the labels for each sample in y indicating the group each observation belongs to.

replacement

a logical indicating whether sampling with replacement is required for the calibration set.

seed

an integer for random number generator (default NULL).

Value

a list with two matrices (hold_in and hold_out) giving the indices of the observations in each column. The number of columns represents the number of sampling repetitions.
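The sketch below shows one plausible way to build such stratified splits on a continuous response in base R. It is not the internal algorithm: the helper name stratified_split is hypothetical, it ignores the group and replacement options, and it assumes that p is the proportion of observations kept in hold_in.

```r
# Hypothetical helper: stratified hold-out sampling on a continuous response
stratified_split <- function(y, p = 0.75, number = 10, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  n <- length(y)
  n_out <- n - floor(p * n)
  # Strata: consecutive blocks of the observations ranked by y
  strata <- cut(rank(y, ties.method = "first"), breaks = n_out, labels = FALSE)
  # One observation per stratum is held out in each repetition
  hold_out <- replicate(number, {
    vapply(
      split(seq_len(n), strata),
      function(idx) idx[sample.int(length(idx), 1)],
      integer(1)
    )
  })
  # The remaining observations form the calibration (hold-in) set
  hold_in <- apply(hold_out, 2, function(out) setdiff(seq_len(n), out))
  list(hold_in = hold_in, hold_out = hold_out)
}

set.seed(5)
y_toy <- rnorm(40)
sp <- stratified_split(y_toy, p = 0.75, number = 5, seed = 1)
dim(sp$hold_in)
dim(sp$hold_out)
```

Each column of the two matrices is one sampling repetition, and the held-out observations span the whole range of y because one observation is drawn from every stratum.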


A function for searching in a given reference set the neighbors of another given set of observations (search_neighbors)

Description

This function searches in a reference set the neighbors of the observations provided in another set.

Usage

search_neighbors(Xr, Xu, diss_method = c("pca", "pca.nipals", "pls", "mpls",
                                         "cor", "euclid", "cosine", "sid"),
                 Yr = NULL, k, k_diss, k_range, spike = NULL,
                 pc_selection = list("var", 0.01),
                 return_projection = FALSE, return_dissimilarity = FALSE,
                 ws = NULL,
                 center = TRUE, scale = FALSE,
                 documentation = character(), ...)

Arguments

Xr

a matrix of reference (spectral) observations where the neighbor search is to be conducted. See details.

Xu

an optional matrix of (spectral) observations whose neighbors are to be searched in Xr. Default is NULL. See details.

diss_method

a character string indicating the spectral dissimilarity metric to be used in the selection of the nearest neighbors of each observation.

  • "pca": Mahalanobis distance computed on the matrix of scores of a Principal Component (PC) projection of Xr (and Xu if supplied). PC projection is done using the singular value decomposition (SVD) algorithm. See ortho_diss function.

  • "pca.nipals": Mahalanobis distance computed on the matrix of scores of a Principal Component (PC) projection of Xr (and Xu if supplied). PC projection is done using the non-linear iterative partial least squares (NIPALS) algorithm. See ortho_diss function.

  • "pls": Mahalanobis distance computed on the matrix of scores of a partial least squares projection of Xr (and Xu if supplied). In this case, Yr is always required. See ortho_diss function.

  • "mpls": Mahalanobis distance computed on the matrix of scores of a modified partial least squares projection (Shenk and Westerhaus, 1991; Westerhaus, 2014) of Xr (and Xu if provided). In this case, Yr is always required. See ortho_diss function.

  • "cor": correlation coefficient between observations. See cor_diss function.

  • "euclid": Euclidean distance between observations. See f_diss function.

  • "cosine": Cosine distance between observations. See f_diss function.

  • "sid": spectral information divergence between observations. See sid function.

Yr

a numeric matrix of n observations used as side information of Xr for the ortho_diss methods (i.e. pca, pca.nipals or pls). It is required when:

  • diss_method = "pls"

  • diss_method = "pca" with "opc" used as the method in the pc_selection argument. See ortho_diss().

k

an integer value indicating the k-nearest neighbors of each observation in Xu that must be selected from Xr.

k_diss

a numeric value indicating a dissimilarity threshold. For each observation in Xu, its nearest neighbors in Xr are selected as those for which their dissimilarity to that observation is below this k_diss threshold. This threshold depends on the corresponding dissimilarity metric specified in diss_method. Either k or k_diss must be specified.

k_range

an integer vector of length 2 which specifies the minimum (first value) and the maximum (second value) number of neighbors to be retained when the k_diss is given.

spike

a vector of integers (with positive and/or negative values) indicating what observations in Xr (and Yr) must be forced into or avoided in the neighborhoods.

pc_selection

a list of length 2 to be passed onto the ortho_diss methods. It is required if the method selected in diss_method is any of "pca", "pca.nipals" or "pls". This argument is used for optimizing the number of components (principal components or pls factors) to be retained. This list must contain two elements in the following order: method (a character indicating the method for selecting the number of components) and value (a numerical value that complements the selected method). The methods available are:

  • "opc": optimized principal component selection based on Ramirez-Lopez et al. (2013a, 2013b). The optimal number of components (for a set of observations) is the one for which its distance matrix minimizes the differences between the Yr value of each observation and the Yr value of its closest observation. In this case value must be a value (larger than 0 and below the minimum dimension of Xr or Xr and Xu combined) indicating the maximum number of principal components to be tested. See the ortho_projection function for more details.

  • "cumvar": selection of the principal components based on a given cumulative amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of cumulative variance that the combination of retained components should explain.

  • "var": selection of the principal components based on a given amount of explained variance. In this case, value must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of variance that a single component should explain in order to be retained.

  • "manual": for manually specifying a fixed number of principal components. In this case, value must be a value (larger than 0 and below the minimum dimension of Xr or Xr and Xu combined) indicating the number of components to be retained.

The default is list(method = "var", value = 0.01).

Optionally, the pc_selection argument admits "opc" or "cumvar" or "var" or "manual" as a single character string. In such a case the default "value" when either "opc" or "manual" are used is 40. When "cumvar" is used the default "value" is set to 0.99 and when "var" is used, the default "value" is set to 0.01.
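The "var", "cumvar" and "manual" rules can be sketched with a plain singular value decomposition in base R. The simulated matrix and all object names are hypothetical, and the exact comparison operators used by the package at the thresholds are not confirmed here; the sketch only illustrates the logic of each rule.

```r
# Simulated matrix with decreasing variance across 20 variables
set.seed(7)
X <- matrix(rnorm(100 * 20), 100, 20) %*% diag(seq(3, 0.1, length.out = 20))
Xc <- scale(X, center = TRUE, scale = FALSE)

# Proportion of variance explained by each principal component
sv <- svd(Xc)
expl_var <- sv$d^2 / sum(sv$d^2)

# "var": keep the components that individually explain at least 1%
n_var <- sum(expl_var >= 0.01)

# "cumvar": smallest number of components that jointly explain at least 99%
n_cumvar <- which(cumsum(expl_var) >= 0.99)[1]

# "manual": the number of components is fixed by the user
n_manual <- 5

c(var = n_var, cumvar = n_cumvar, manual = n_manual)
```

The "opc" rule is different in nature because it needs the side information in Yr to rank the candidate numbers of components.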

return_projection

a logical indicating if the projection(s) must be returned. Projections are used if the ortho_diss methods are called (i.e. diss_method = "pca", diss_method = "pca.nipals" or diss_method = "pls").

return_dissimilarity

a logical indicating if the dissimilarity matrix used for neighbor search must be returned.

ws

an odd integer value which specifies the window size when diss_method = "cor" (cor_diss method) for moving correlation dissimilarity. If ws = NULL (default), the window size is equal to the number of variables (columns), i.e. the normal correlation is used instead of the moving correlation. See cor_diss function.

center

a logical indicating if the Xr and Xu matrices must be centered. If Xu is provided the data is centered around the mean of the pooled Xr and Xu matrices (\(Xr \cup Xu\)). For dissimilarity computations based on diss_method = pls, the data is always centered.

scale

a logical indicating if the Xr and Xu matrices must be scaled. If Xu is provided the data is scaled based on the standard deviation of the pooled Xr and Xu matrices (\(Xr \cup Xu\)). If center = TRUE, scaling is applied after centering.

documentation

an optional character string that can be used to describe anything related to the function call (e.g. description of the input data). Default: character(). NOTE: this is an experimental argument.

...

further arguments to be passed to the dissimilarity function. See details.

Details

This function may be especially useful when the reference set (Xr) is very large. In some cases the number of observations in the reference set can be reduced by removing irrelevant observations (i.e. observations that are not neighbors of a particular target set). For example, this function can be used to reduce the size of the reference set before running the mbl function.

This function uses the dissimilarity function to compute the dissimilarities between Xr and Xu. Arguments to dissimilarity, as well as further arguments to the functions used inside dissimilarity (i.e. ortho_diss, cor_diss, f_diss, sid), can be passed to those functions as additional arguments (i.e. ...).

If no matrix is passed to Xu, the neighbors of each observation in Xr are searched within Xr itself. If a matrix is passed to Xu, the neighbors of Xu are searched in the Xr matrix.
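The selection logic can be pictured with a small base R sketch that works directly on a dissimilarity matrix. It uses plain Euclidean distances on simulated data, and every object name (Xr_toy, Xu_toy, neighbors, n_k, not_needed) is hypothetical; it illustrates the k and k_diss/k_range ideas rather than the function's actual code.

```r
# Simulated reference (30 obs.) and target (4 obs.) sets with 5 variables
set.seed(3)
Xr_toy <- matrix(rnorm(30 * 5), 30, 5)
Xu_toy <- matrix(rnorm(4 * 5), 4, 5)

# Dissimilarities: rows correspond to Xr_toy, columns to Xu_toy
d <- as.matrix(dist(rbind(Xr_toy, Xu_toy)))[1:30, 31:34]

# Fixed k: indices of the k nearest reference observations per target
k <- 5
neighbors <- apply(d, 2, function(col) order(col)[seq_len(k)])

# Reference observations that appear in at least one neighborhood
unique_neighbors <- sort(unique(as.vector(neighbors)))

# Threshold-based search: count neighbors below k_diss, bounded by k_range
k_diss <- 2.5
k_range <- c(3, 10)
n_k <- pmin(pmax(colSums(d < k_diss), k_range[1]), k_range[2])

# Reference observations that could be dropped before local modeling
not_needed <- setdiff(seq_len(nrow(Xr_toy)), unique_neighbors)
```

Dropping the observations in not_needed is what reduces the reference set when it is much larger than the set of relevant neighbors.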

Value

a list containing the following elements:

Author(s)

Leonardo Ramirez-Lopez.

References

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

See Also

dissimilarity ortho_diss cor_diss f_diss sid mbl

Examples


library(prospectr)

data(NIRsoil)

Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)]
Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]

Xu <- Xu[!is.na(Yu), ]
Yu <- Yu[!is.na(Yu)]

Xr <- Xr[!is.na(Yr), ]
Yr <- Yr[!is.na(Yr)]

# Identify the neighbor observations using the correlation dissimilarity and
# default parameters
# (In this example all the observations in Xr belong at least to the
# first 100 neighbors of one observation in Xu)
ex1 <- search_neighbors(
  Xr = Xr, Xu = Xu,
  diss_method = "cor",
  k = 40
)

# Identify the neighbor observations using principal component (PC)
# and partial least squares (PLS) dissimilarities, and using the "opc"
# approach for selecting the number of components
ex2 <- search_neighbors(
  Xr = Xr, Xu = Xu,
  diss_method = "pca",
  Yr = Yr, k = 50,
  pc_selection = list("opc", 40),
  scale = TRUE
)

# Observations that do not belong to any neighborhood
seq(1, nrow(Xr))[!seq(1, nrow(Xr)) %in% ex2$unique_neighbors]

ex3 <- search_neighbors(
  Xr = Xr, Xu = Xu,
  diss_method = "pls",
  Yr = Yr, k = 50,
  pc_selection = list("opc", 40),
  scale = TRUE
)
# Observations that do not belong to any neighborhood
seq(1, nrow(Xr))[!seq(1, nrow(Xr)) %in% ex3$unique_neighbors]

# Identify the neighbor observations using local PLS dissimilarities
# Here, 150 neighbors are used to compute a local dissimilarity matrix
# and then this matrix is used to select 50 neighbors
ex4 <- search_neighbors(
  Xr = Xr, Xu = Xu,
  diss_method = "pls",
  Yr = Yr, k = 50,
  pc_selection = list("opc", 40),
  scale = TRUE,
  .local = TRUE,
  pre_k = 150
)


A function for computing the spectral information divergence between spectra (sid)

Description

Experimental lifecycle

This function computes the spectral information divergence/dissimilarity between spectra based on the Kullback-Leibler divergence algorithm (see details).

Usage

sid(Xr, Xu = NULL,
    mode = "density",
    center = FALSE, scale = FALSE,
    kernel = "gaussian",
    n = if(mode == "density") round(0.5 * ncol(Xr)),
    bw = "nrd0",
    reg = 1e-04,
    ...)

Arguments

Xr

a matrix containing the spectral (reference) data.

Xu

an optional matrix containing the spectral data of a second set of observations.

mode

the method to be used for computing the spectral information divergence. Options are "density" (default) for computing the divergence values on the density distributions of the spectral observations, and "feature" for computing the divergence values on the spectral variables. See details.

center

a logical indicating if the computations must be carried out on the centred Xr and Xu (if specified) matrices. If mode = "feature", centring is not carried out since this option does not accept the negative values that are generated after centring the matrices. Default is FALSE. See details.

scale

a logical indicating if the computations must be carried out on the variance scaled Xr and Xu (if specified) matrices. Default is FALSE.

kernel

if mode = "density" a character string indicating the smoothing kernel to be used. It must be one of "gaussian" (default), "rectangular", "triangular", "epanechnikov", "biweight", "cosine" or "optcosine". See the density function of the stats package.

n

if mode = "density" a numerical value indicating the number of equally spaced points at which the density is to be estimated. See the density function of the stats package for further details. Default is round(0.5 * ncol(Xr)).

bw

if mode = "density" a numerical value indicating the smoothing kernel bandwidth to be used. Optionally, the character string "nrd0" can be used, in which case the bandwidth is computed with the bw.nrd0 function of the stats package. See the density and the bw.nrd0 functions for more details. By default "nrd0" is used: the bandwidth is computed as bw.nrd0(as.vector(Xr)) or, if Xu is specified, as bw.nrd0(as.vector(rbind(Xr, Xu))).

reg

a numerical value larger than 0 which indicates a regularization parameter. Values (probabilities) below this threshold are replaced by this value for numerical stability. Default is 1e-4.

...

additional arguments to be passed to the density function of the stats package.

Details

This function computes the spectral information divergence (distance) between spectra. When mode = "density", the function first computes the probability distribution of each spectrum, which results in a matrix of density distribution estimates. The density distributions of all the observations in the datasets are then compared based on the Kullback-Leibler divergence algorithm. When mode = "feature", the Kullback-Leibler divergence between all the observations is computed directly on the spectral variables. The spectral information divergence (SID) algorithm (Chang, 2000) uses the Kullback-Leibler divergence (\(KL\)) or relative entropy (Kullback and Leibler, 1951) to account for the vis-NIR information provided by each spectrum. The SID between two spectra (\(x_{i}\) and \(x_{j}\)) is computed as follows:

\[sid(x_{i},x_{j}) = KL(x_{i} \parallel x_{j}) + KL(x_{j} \parallel x_{i})\] \[sid(x_{i},x_{j}) = \sum_{l=1}^{k} p_l \log\left(\frac{p_l}{q_l}\right) + \sum_{l=1}^{k} q_l \log\left(\frac{q_l}{p_l}\right)\]

where \(k\) represents the number of variables or spectral features, \(p\) and \(q\) are the probability vectors of \(x_{i}\) and \(x_{j}\) respectively which are calculated as:

\[p = \frac{x_i}{\sum_{l=1}^{k} x_{i,l}}\] \[q = \frac{x_j}{\sum_{l=1}^{k} x_{j,l}}\]

From the above equations it can be seen that the original SID algorithm assumes that all the components in the data matrices are nonnegative. Therefore centering cannot be applied when mode = "feature". If a data matrix with negative values is provided and mode = "feature", the sid function automatically scales the matrix as follows:

\[X_s = \frac{X-min(X)}{max(X)-min(X)}\]

or

\[X_{s} = \frac{X-min(X, Xu)}{max(X, Xu)-min(X, Xu)}\] \[Xu_{s} = \frac{Xu-min(X, Xu)}{max(X, Xu)-min(X, Xu)}\]

if Xu is specified. The 0 values are replaced by a regularization parameter (reg argument) for numerical stability. The default of the sid function is to compute the SID based on the density distributions of the spectra (mode = "density"). For each spectrum in X the density distribution is computed using the density function of the stats package. The 0 values of the estimated density distributions of the spectra are replaced by a regularization parameter (reg argument) for numerical stability. Finally, the divergence between the computed spectral histograms is computed using the SID algorithm. Note that if mode = "density", the sid function will accept negative values and matrix centering will be possible.
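For two individual spectra, the feature-based formula can be written in a few lines of base R. The helper sid_feature and the toy vectors are hypothetical, and the way the regularization is applied here (probabilities below reg are raised to reg) is an assumption based on the description of the reg argument, not a copy of the package code.

```r
# Hypothetical helper: SID between two nonnegative spectra (feature mode)
sid_feature <- function(x_i, x_j, reg = 1e-4) {
  # Probability vectors: each spectrum divided by the sum of its values
  p <- x_i / sum(x_i)
  q <- x_j / sum(x_j)
  # Assumed regularization: very small probabilities are raised to reg
  p[p < reg] <- reg
  q[q < reg] <- reg
  # Sum of the two Kullback-Leibler divergences
  sum(p * log(p / q)) + sum(q * log(q / p))
}

# Two toy spectra with six variables each
x_i <- c(0.2, 0.4, 0.6, 0.5, 0.3, 0.1)
x_j <- c(0.1, 0.3, 0.7, 0.6, 0.2, 0.1)

sid_feature(x_i, x_j)
sid_feature(x_j, x_i)  # same value: the measure is symmetric
sid_feature(x_i, x_i)  # zero for identical spectra
```

Because the two divergence terms are added, the result does not depend on the order of the spectra, unlike a single Kullback-Leibler divergence.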

Value

a list with the following components:

Author(s)

Leonardo Ramirez-Lopez

References

Chang, C.I. 2000. An information theoretic-based approach to spectral variability, similarity and discriminability for hyperspectral image analysis. IEEE Transactions on Information Theory 46, 1927-1932.

See Also

density

Examples


library(prospectr)

data(NIRsoil)

Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)]
Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]

Xu <- Xu[!is.na(Yu), ]
Xr <- Xr[!is.na(Yr), ]

# Example 1
# Compute the SID distance between all the observations in Xr
xr_sid <- sid(Xr)
xr_sid

# Example 2
# Compute the SID distance between the observations in Xr and the observations
# in Xu
xr_xu_sid <- sid(Xr, Xu)
xr_xu_sid


A function for evaluating dissimilarity matrices (sim_eval)

Description

Stable lifecycle

This function searches for the most similar observation (closest neighbor) of each observation in a given dataset based on a dissimilarity matrix (e.g. a distance matrix). The observations are compared against their corresponding closest observations in terms of the side information provided. The root mean square of differences and the correlation coefficient are used for continuous variables, and the kappa index is used for discrete variables.

Usage

sim_eval(d, side_info)

Arguments

d

a symmetric matrix of dissimilarity scores between observations of a given dataset. Alternatively, a vector with the dissimilarity scores of the lower triangle (without the diagonal values) can be used (see details).

side_info

a matrix containing the side information corresponding to the observations in the dataset from which the dissimilarity matrix was computed. It can be either a numeric matrix with one or multiple columns/variables or a matrix with one character variable (discrete variable). If it is numeric, the root mean square of differences is used for assessing the similarity between the observations and their corresponding most similar observations in terms of the side information provided. If it is a character variable, then the kappa index is used. See details.

Details

For the evaluation of dissimilarity matrices this function uses side information (information about one variable which is available for a group of observations, Ramirez-Lopez et al., 2013). It is assumed that there is a (direct or indirect) correlation between this side informative variable and the variables from which the dissimilarity was computed. If side_info is numeric, the root mean square of differences (RMSD) is used for assessing the similarity between the observations and their corresponding most similar observations in terms of the side information provided. It is computed as follows:

\[j(i) = NN(xr_i, Xr^{-i})\] \[RMSD = \sqrt{\frac{1}{m} \sum_{i=1}^{m} {(y_i - y_{j(i)})^2}}\]

where \(NN(xr_i, Xr^{-i})\) represents a function to obtain the index of the nearest neighbor observation found in \(Xr\) (excluding the \(i\)th observation) for \(xr_i\), \(y_{i}\) is the value of the side variable of the \(i\)th observation, \(y_{j(i)}\) is the value of the side variable of the nearest neighbor of the \(i\)th observation and \(m\) is the total number of observations.

If side_info is a factor the kappa index (\(\kappa\)) is used instead of the RMSD. It is computed as follows:

\[\kappa = \frac{p_{o}-p_{e}}{1-p_{e}}\]

where \(p_o\) and \(p_e\) are two different agreement indices between the side information of the observations and the side information of their corresponding nearest observations (i.e. most similar observations). While \(p_o\) is the relative agreement, \(p_e\) is the agreement expected by chance.

This function accepts vectors to be passed to argument d. In this case, the vector must represent the lower triangle of a dissimilarity matrix (e.g. as returned by the stats::dist() function).
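The two evaluation statistics can be reproduced by hand on a toy dataset, which may help when interpreting the output. The sketch below uses base R only; the simulated data and all object names are hypothetical and the code is an illustration of the formulas, not the function's implementation.

```r
# Simulated data: 25 observations, side variable related to the first column
set.seed(11)
X <- matrix(rnorm(25 * 4), 25, 4)
y <- X[, 1] + rnorm(25, sd = 0.1)

# Dissimilarity matrix; the diagonal is masked so that an observation
# cannot be selected as its own nearest neighbor
d <- as.matrix(dist(X))
diag(d) <- Inf
nn <- apply(d, 1, which.min)

# Continuous side information: RMSD and correlation with the nearest neighbor
rmsd <- sqrt(mean((y - y[nn])^2))
r <- cor(y, y[nn])

# Discrete side information: kappa index from the agreement table
cls <- ifelse(y > median(y), "high", "low")
lv <- sort(unique(cls))
tab <- table(factor(cls, lv), factor(cls[nn], lv))
p_o <- sum(diag(tab)) / sum(tab)                      # observed agreement
p_e <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2  # agreement by chance
kappa_index <- (p_o - p_e) / (1 - p_e)
```

Lower RMSD values (or higher kappa values) indicate that the dissimilarity matrix places observations with similar side information close to each other.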

Value

sim_eval returns a list with the following components:

Author(s)

Leonardo Ramirez-Lopez

References

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex datasets. Geoderma 195-196, 268-279.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

Examples


library(prospectr)
data(NIRsoil)

sg <- savitzkyGolay(NIRsoil$spc, p = 3, w = 11, m = 0)

# Replace the original spectra with the filtered ones
NIRsoil$spc <- sg

Yr <- NIRsoil$Nt[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]

# Example 1
# Compute a principal components distance
pca_d <- ortho_diss(Xr, pc_selection = list("manual", 8))$dissimilarity

# Example 1.1
# Evaluate the distance matrix on the basis of the
# side information (Yr) associated with Xr
se <- sim_eval(pca_d, side_info = as.matrix(Yr))

# The final evaluation results
se$eval

# The final values of the side information (Yr) and the values of
# the side information corresponding to the first nearest neighbors
# found by using the distance matrix
se$first_nn

# Example 1.2
# Evaluate the distance matrix on the basis of two side
# information variables (Yr and Yr_2) associated with Xr
Yr_2 <- NIRsoil$CEC[as.logical(NIRsoil$train)]
se_2 <- sim_eval(d = pca_d, side_info = cbind(Yr, Yr_2))

# The final evaluation results
se_2$eval

# The final values of the side information variables and the values
# of the side information variables corresponding to the first
# nearest neighbors found by using the distance matrix
se_2$first_nn

# Example 2
# Evaluate the distances produced by retaining different number of
# principal components (this is the same principle used in the
# optimized principal components approach ("opc"))

# first project the data
pca_2 <- ortho_projection(Xr, pc_selection = list("manual", 30))

results <- matrix(NA, pca_2$n_components, 3)
colnames(results) <- c("pcs", "rmsd", "r")
results[, 1] <- 1:pca_2$n_components
for (i in 1:pca_2$n_components) {
  ith_d <- f_diss(pca_2$scores[, 1:i, drop = FALSE], scale = TRUE)
  ith_eval <- sim_eval(ith_d, side_info = as.matrix(Yr))
  results[i, 2:3] <- as.vector(ith_eval$eval)
}
plot(results)

# Example 3
# Example 3.1
# Evaluate a dissimilarity matrix computed using the correlation
# method
cd <- cor_diss(Xr)
eval_corr_diss <- sim_eval(cd, side_info = as.matrix(Yr))
eval_corr_diss$eval


Square root of (square) symmetric matrices

Description

For internal use only

Usage

sqrt_sm(X, method = c("svd", "eigen"))

A function to compute row-wise index of minimum values of a square distance matrix

Description

For internal use only

Usage

which_min(X)

Arguments

X

a square matrix of distances

Details

Used internally to find the nearest neighbors

Value

a vector of the indices of the minimum value in each row of the input matrix

Author(s)

Antoine Stevens


A function to compute indices of minimum values of a distance vector

Description

For internal use only

Usage

which_min_vector(X)

Arguments

X

a vector of distances

Details

Used internally to find the nearest neighbors. It searches in a lower (or upper) triangular matrix, so the input data must be provided in this format. The piece of code int len = (sqrt(X.size()*8+1)+1)/2 generated an error in CRAN since sqrt cannot be applied to integers.
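The expression quoted above recovers the number of observations n from the length of the triangle vector, since a lower triangle without diagonal holds n(n - 1)/2 values. A minimal base R sketch of this relation and of a nearest neighbor lookup on such a vector follows; the toy data and object names are hypothetical and the code is not the compiled routine used by the package.

```r
# Toy data: 6 observations; dist() returns the lower triangle as a vector
set.seed(2)
X <- matrix(rnorm(6 * 3), 6, 3)
d_vec <- as.vector(dist(X))

# Recover the number of observations from the vector length
len <- length(d_vec)
n <- (sqrt(8 * len + 1) + 1) / 2

# Rebuild the symmetric matrix (column-wise lower triangle) and mask the
# diagonal so that an observation is not its own nearest neighbor
full <- matrix(0, n, n)
full[lower.tri(full)] <- d_vec
full <- full + t(full)
diag(full) <- Inf

# Index of the nearest neighbor of each observation
nn <- apply(full, 1, which.min)
nn
```

Working on the vector form avoids storing both triangles of a symmetric matrix, which matters when the number of observations is large.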

Value

a vector of the indices of the nearest neighbors

Author(s)

Antoine Stevens