Type: | Package |
Title: | Spatially Smoothed MRCD Estimator |
Version: | 1.1.0 |
Maintainer: | Patricia Puchhammer <patricia.puchhammer@tuwien.ac.at> |
Description: | Estimation of the Spatially Smoothed Minimum Regularized Determinant (ssMRCD) estimator and its usage in an ssMRCD-based outlier detection method as described in Puchhammer and Filzmoser (2023) <doi:10.1080/10618600.2023.2277875> and for sparse robust PCA for multi-source data described in Puchhammer, Wilms and Filzmoser (2024) <doi:10.48550/arXiv.2407.16299>. Included are also complementary visualization and parameter tuning tools. |
License: | GPL-3 |
Encoding: | UTF-8 |
LazyData: | true |
Imports: | stats, grDevices, graphics, robustbase, scales, car, dbscan, plot3D, dplyr, ggplot2, expm, foreach, doParallel, rrcov, DescTools, rootSolve, parallel, Matrix, reshape2 |
RoxygenNote: | 7.2.3 |
Suggests: | knitr, rmarkdown, testthat (≥ 3.0.0) |
Config/testthat/edition: | 3 |
Depends: | R (≥ 4.0.0) |
VignetteBuilder: | knitr |
NeedsCompilation: | no |
Packaged: | 2024-08-23 09:56:19 UTC; puchhammer |
Author: | Patricia Puchhammer [aut, cre, cph], Peter Filzmoser [aut] |
Repository: | CRAN |
Date/Publication: | 2024-08-23 11:10:02 UTC |
Align Loadings of Principal Components
Description
Aligns loadings per neighborhood for better visualization and comparison. Different options are available.
Usage
align_PC(PC, N, p, type = "largest", vec = NULL)
Arguments
PC |
matrix of loadings of size Np x k |
N |
integer, number of groups/neighborhoods |
p |
integer, number of variables |
type |
character indicating how loadings are aligned (see details),
options are |
vec |
|
Details
For input type
possible values are "largest", "maxvar","nonzero","mean","scalar"
.
For option "maxvar"
the variable with the highest absolute value in the loading
is scaled to be positive (per neighborhood, per loading).
For option "nonzero"
the variable with largest distance to zero in the entries is
scaled to be positive (per neighborhood, per loading).
For option "scalar"
the variable is scaled in a way, that the scalar product
between the loading and the respective part of vec
is positive (per neighborhood, per loading).
If vec
is of size p
times k
, the same vector is used for all neighborhoods.
Option "mean"
is option "scalar"
with vec
being the mean of the loadings per variable across neighborhoods.
Option "largest"
scales the largest absolute value to be positive per neighborhood and per PC.
Option "none"
does nothing and returns PC
.
Value
Returns a matrix of loadings of size Np
times k
.
Examples
x = matrix(c(1, 0, 0, 0, sqrt(0.5), -sqrt(0.5), 0, 0,
0, sqrt(1/3), -sqrt(1/3), sqrt(1/3), sqrt(0.5), sqrt(0.5), 0, 0),
ncol = 2)
align_PC(PC = x, N = 2, p = 4, type = "largest")
align_PC(PC = x, N = 2, p = 4, type = "mean")
Biplot for PCAloc
Description
Biplot for PCAloc
Usage
## S3 method for class 'PCAloc'
biplot(x, ...)
Arguments
x |
object of class PCAloc. |
... |
other input arguments, see details. |
Details
Additional parameters that can be given to the function are:
shape | point shape |
size | point size |
alpha | transparency |
color | either "variable" or "groups"
indication how points should be coloured. |
Value
Returns version of biplot for PCAloc object.
Examples
# set seed
set.seed(236)
# make data
data = matrix(rnorm(2000), ncol = 4)
groups = sample(1:10, 500, replace = TRUE)
W = time_weights(N = 10, c(3,2,1))
# calculate covariance matrices
covs = ssMRCD(data, groups = groups, weights = W, lambda = 0.3)
# sparse PCA
pca = sparsePCAloc(eta = 0.3, gamma = 0.7, cor = FALSE, COVS = covs$MRCDcov,
n_max = 1000, increase_rho = list(TRUE, 50, 1), trace = FALSE)
# plot biplot
biplot(pca, alpha = 0.4, shape = 16, size = 2, color = "variable")
Contamination Through Swapping
Description
This function swaps observations completely random in order to introduce contamination
in the data. Used in parameter_tuning
.
Usage
contamination_random(cont, data)
Arguments
cont |
numeric, amount of contamination in data. |
data |
data whose observations should be switched. |
Value
A matrix with switched observations.
Examples
# set seed
set.seed(1)
# get data
data(weatherAUT2021)
# switch 5% of observations
contamination_random(cont = 0.05, data = weatherAUT2021[,1:6])
Objective function value for local sparse PCA
Description
Objective function value for local sparse PCA
Usage
eval_objective(PC, eta, gamma, COVS)
Arguments
PC |
vectorised component to evaluate. |
eta |
degree of sparsity. |
gamma |
distribution of sparsity between groupwise ( |
COVS |
list of covariance matrices used for PCA |
Value
Returns value of the objective function for given v
.
Examples
S1 = matrix(c(1, 0.9, 0.8, 0.5,
0.9, 1.1, 0.7, 0.4,
0.8, 0.7, 1.5, 0.2,
0.5, 0.4, 0.2, 1), ncol = 4)
S2 = t(S1)%*% S1
S2 = S2/2
eval_objective(PC = c(1,0,0,0,sqrt(2),0,0,-sqrt(2)),
eta = 1, gamma = 0.5,
COVS = list(S1, S2))
Explained Variance summarized over Groups
Description
Explained Variance summarized over Groups
Usage
explained_var(COVS, PC, k, type = "scaled", cor = FALSE, gamma = 0.5)
Arguments
COVS |
list of covariance matrices |
PC |
matrix-like object holding the loadings of length np |
k |
which component should be evaluated |
type |
character, either |
cor |
logical, if |
gamma |
scalar between 0 and 1 indicatig distribution of sparsity. |
Value
Returns scalar
Examples
S1 = matrix(c(1, 0.9, 0.8, 0.5,
0.9, 1.1, 0.7, 0.4,
0.8, 0.7, 1.5, 0.2,
0.5, 0.4, 0.2, 1), ncol = 4)
S2 = t(S1)%*% S1
S2 = S2/2
explained_var(COVS = list(S1, S2),
PC = c(1,0,0,0,sqrt(2),0,0,-sqrt(2)),
k = 1,
cor = FALSE,
gamma = 0.5)
explained_var(COVS = list(cov2cor(S1), cov2cor(S2)),
PC = c(1,0,0,0,sqrt(2),0,0,-sqrt(2)),
k = 1,
cor = TRUE,
gamma = 0.5)
Inverse Geographic Weight Matrix
Description
Calculates a inverse-distance based weight matrix for the function ssMRCD
(see details).
Usage
geo_weights(coordinates, groups)
Arguments
coordinates |
matrix of coordinates of observations. |
groups |
vector of neighborhood groups. |
Details
First, the centers (means of the coordinates given) c_i
of each neighborhood is calculated.
Then, the Euclidean distance between the centers is calculated and the weight is based on
the inverse distance between two neighborhoods,
w_{ij} = \frac{1}{dist(c_i, c_j)}.
It is scaled according to a weight matrix.
Value
Returns a weighting matrix W
and the coordinates of the centers per neighborhood centersN
.
See Also
Examples
coordinates = matrix(rnorm(1000), ncol = 2, nrow = 500)
groups = sample(1:5, 500, replace = TRUE)
geo_weights(coordinates, groups)
Creates Grid-Based Neighborhood Structure
Description
This function creates a grid-based neighborhood structure for the ssMRCD
function using cut-off values for two coordinate axis.
Usage
groups_gridbased(x, y, cutx, cuty)
Arguments
x |
vector of first coordinate of data set. |
y |
vector of second coordinate of data set. |
cutx |
cut-offs for first coordinate. |
cuty |
cut-offs for second coordinate. |
Value
Returns a neighborhood assignment vector for the coordinates x
and y
.
Examples
# get data
data(weatherAUT2021)
# set cut-off values
cut_lon = c(9:16, 18)
cut_lat = c(46, 47, 47.5, 48, 49)
# create neighborhood assignments
groups_gridbased(weatherAUT2021$lon,
weatherAUT2021$lat,
cut_lon,
cut_lat)
Local Outlier Detection Technique based on ssMRCD
Description
This function applies the local outlier detection method based on the spatially smoothed MRCD estimator developed in Puchhammer and Filzmoser (2023).
Usage
local_outliers_ssMRCD(
data,
coords,
groups,
lambda,
weights = NULL,
k = NULL,
dist = NULL
)
Arguments
data |
data matrix with measured values. |
coords |
matrix of coordinates of observations. |
groups |
vector of neighborhood assignments. |
lambda |
scalar used for spatial smoothing (see also |
weights |
weight matrix used in |
k |
integer, if given the |
dist |
scalar, if given the neighbors closer than given distance are used for next distances. If |
Value
Returns an object of class "locOuts"
with following components:
outliers | indices of found outliers. |
next_distance | vector of next distances for all observations. |
cutoff | upper fence of adjusted boxplot (see adjbox ) used as cutoff value for next distances. |
coords | matrix of observation coordinates. |
data | matrix of observation values. |
groups | vector of neighborhood assignments. |
k, dist | specifications regarding neighbor comparisons. |
centersN | coordinates of centers of neighborhoods. |
matneighbor | matrix storing information which observations where used to calculate next distance for each observation (per row). 1 indicates it is used. |
ssMRCD | object of class "ssMRCD" and output of ssMRCD covariance estimation. |
References
Puchhammer P. and Filzmoser P. (2023): Spatially smoothed robust covariance estimation for local outlier detection. doi:10.48550/arXiv.2305.05371
See Also
See also functions ssMRCD, plot.locOuts, summary.locOuts
.
Examples
# data construction
data = matrix(rnorm(2000), ncol = 4)
coords = matrix(rnorm(1000), ncol = 2)
groups = sample(1:10, 500, replace = TRUE)
lambda = 0.3
# apply function
outs = local_outliers_ssMRCD(data = data,
coords = coords,
groups = groups,
lambda = lambda,
k = 10)
outs
Calculation of Objective Function
Description
Calculation of the value of the objective function for the ssMRCD
for a given list of matrices,
lambda and a weighting matrix according to formula (3) in Puchhammer and Filzmoser (2023).
Usage
objective_matrix(matrix_list, lambda, weights)
Arguments
matrix_list |
a list of matrices |
lambda |
scalar smoothing parameter |
weights |
matrix of weights |
Value
Returns the value of the objective function using matrices K_i
.
References
Puchhammer P. and Filzmoser P. (2023): Spatially smoothed robust covariance estimation for local outlier detection. doi:10.48550/arXiv.2305.05371
Examples
# construct matrices
k1 = matrix(c(1,2,3,4), nrow = 2)
k2 = matrix(c(1,3,5,7), nrow = 2)
# construct weighting matrix
W = matrix(c(0, 1, 1, 0), nrow = 2)
objective_matrix(list(k1, k2), 0.5, W)
Optimal Smoothing Parameter for ssMRCD based on Local Outliers
Description
This function provides insight into the effects of different parameter settings.
Usage
parameter_tuning(
data,
coords,
groups,
lambda = c(0, 0.25, 0.5, 0.75, 0.9),
weights = NULL,
k = NULL,
dist = NULL,
cont = 0.05,
repetitions = 5
)
Arguments
data |
matrix with observations. |
coords |
matrix of coordinates of these observations. |
groups |
numeric vector, the neighborhood structure that should be used for |
lambda |
scalar, the smoothing parameter. |
weights |
weighting matrix used in |
k |
vector of possible k-values to evaluate. |
dist |
vector of possible dist-values to evaluate. |
cont |
level of contamination, between 0 and 1. |
repetitions |
number of repetitions wanted to have a good picture of the best parameter combination. |
Value
Returns a matrix of average false-negative rate (FNR) values and the total number of outliers found by the method as aproxy for the false-positive rate. Be aware that the FNR does not take into account that there are also natural outliers included in the data set that might or might not be found. Also a plot is returned representing these average. The best parameter selection depends on the goal of the analysis.
Examples
# get data set
data("weatherAUT2021")
# make neighborhood assignments
cut_lon = c(9:16, 18)
cut_lat = c(46, 47, 47.5, 48, 49)
N = ssMRCD::groups_gridbased(weatherAUT2021$lon, weatherAUT2021$lat, cut_lon, cut_lat)
table(N)
N[N == 2] = 1
N[N == 3] = 4
N[N == 5] = 4
N[N == 6] = 7
N[N == 11] = 15
N = as.numeric(as.factor(N))
# tune parameters
set.seed(123)
parameter_tuning(data = weatherAUT2021[, 1:6 ],
coords = weatherAUT2021[, c("lon", "lat")],
groups = N,
lambda = c(0.5, 0.75),
k = c(10),
repetitions = 1)
Plotting method PCAloc object
Description
Plotting method PCAloc object
Usage
## S3 method for class 'PCAloc'
plot(
x,
type = c("loadings", "screeplot", "scores", "score_distances", "biplot"),
...
)
Arguments
x |
object of class PCAloc |
type |
character indicating the type of plot, see details. |
... |
further arguments passed down. |
Value
Returns plots in ggplot2.
Examples
# set seed
set.seed(236)
# create data and setup
data = matrix(rnorm(2000), ncol = 4)
groups = sample(1:10, 500, replace = TRUE)
W = time_weights(N = 10, c(3,2,1))
# calculate covariances
covs = ssMRCD(data, groups = groups, weights = W, lambda = 0.3)
# calculate sparse PCA
pca = sparsePCAloc(eta = 0.3, gamma = 0.7, cor = FALSE, COVS = covs$MRCDcov,
n_max = 1000, increase_rho = list(TRUE, 50, 1), trace = FALSE)
# align loadings
pca$PC = align_PC(PC = pca$PC, N = pca$N, p = pca$p, type = "mean")
# plot different PCA plots
plot(x = pca, type = "score_distances", groups = groups, X = data, ssMRCD = covs, k = 2)
plot(x = pca, type = "biplot", color = "variable")
plot(x = pca, type = "scores", groups = groups, X = data, ssMRCD = covs, k = 1)
plot(x = pca, type = "screeplot")
plot(x = pca, type = "loadings", k = 1)
Diagnostic Plots for Local Outlier Detection
Description
This function plots different diagnostic plots for local outlier detection.
It can be applied to an object of class "locOuts"
which is the output of the function local_outliers_ssMRCD
.
Usage
## S3 method for class 'locOuts'
plot(
x,
type = c("hist", "spatial", "lines", "3D"),
colour = "all",
focus = NULL,
pos = NULL,
alpha = 0.3,
data = NULL,
add_map = TRUE,
...
)
Arguments
x |
a locOuts object obtained by the function |
type |
vector containing the types of plots that should be plotted, possible values |
colour |
character specifying the color scheme (see details). Possible values |
focus |
an integer being the index of the observation whose neighborhood should be analysed more closely. |
pos |
integer specifying the position of the text "cut-off" in the histogram (see |
alpha |
scalar specifying the transparency level of the points plotted for plot type |
data |
optional data frame or matrix used for plot of type |
add_map |
TRUE if a map should be plotted along the line plot ( |
... |
further parameters passed on to base-R plotting functions. |
Details
Regarding the parameter type
the value "hist"
corresponds to a plot of the
histogram of the next distances together with the used cutoff-value.
When using "spatial"
the coordinates of each observation are plotted and colorized according to the color setting.
The "lines"
plot is used with the index focus
of one observation whose out/inlyingness to its neighborhood
should by plotted. The whole data set is scaled to the range [0,1] and the scaled value of the selected observation and
its neighbors are plotted. Outliers are plotted in orange.
The "3D"
setting leads to a 3D-plot using the colour setting as height.
The view can be adapted using the parameters theta
and phi
.
For the colour
setting possible values are "all"
(all next distances are
used and colored in an orange palette), "onlyOuts"
(only outliers are
plotted in orange, inliers are plotted in grey) and "outScore"
(the next
distance divided by the cutoff value is used to colourize the points; inliers are colorized in blue, outliers in orange).
Value
Returns plots regarding next distances and spatial context.
See Also
Examples
# set seed
set.seed(1)
# make locOuts object
data = matrix(rnorm(2000), ncol = 4)
coords = matrix(rnorm(1000), ncol = 2)
groups = sample(1:10, 500, replace = TRUE)
lambda = 0.3
# local outlier detection
outs = local_outliers_ssMRCD(data = data,
coords = coords,
groups = groups,
lambda = lambda,
k = 10)
# plot results
plot(outs, type = "hist")
plot(outs, type = "spatial", colour = "outScore")
plot(outs, type = "3D", colour = "outScore", theta = 0)
plot(outs, type ="lines", focus = outs$outliers[1])
Plot Method for ssMRCD Object
Description
Plots diagnostics for function output of ssMRCD
regarding convergence behavior
and the resulting covariances matrices.
Usage
## S3 method for class 'ssMRCD'
plot(
x,
type = c("convergence", "ellipses"),
centersN = NULL,
colour_scheme = "none",
xlim_upper = 9,
manual_rescale = 1,
legend = TRUE,
xlim = NULL,
ylim = NULL,
...
)
Arguments
x |
object of class |
type |
type of plot, possible values are |
centersN |
for plot type |
colour_scheme |
coloring scheme used for plot type |
xlim_upper |
numeric giving the upper x limit for plot type |
manual_rescale |
for plot type |
legend |
logical, if color legend should be included. |
xlim |
vector of xlim (see |
ylim |
vector of ylim (see |
... |
further plotting parameters. |
Details
For type = "convergence"
a plot is produced displaying the convergence behaviour.
Each line represents a different initial value used for the c-step iteration. On the x-axis the
iteration step is plotted with the corresponding value of the objective function. Not monotonically
lines are plotted in red.
For type = "ellipses"
and more than a 2-dimensional data setting plotting the exact tolerance ellipse is
not possible anymore. Instead the two eigenvectors with highest eigenvalue from the
MCD used on the full data set without neighborhood assignments are taken and used as axis for
the tolerance ellipses of the ssMRCD covariance estimators. The tolerance ellipse for the global MCD
covariance is plotted in grey in the upper left corner. It is possible to set the colour scheme
to "trace"
to see the overall amount of variabilty and compare the plotted covariance and
the real trace to see how much variance is not plotted. For "regularity"
the regularization of each
covariance is shown.
Value
Returns plots of the ssMRCD methodology and results.
See Also
ssMRCD, summary.ssMRCD,
local_outliers_ssMRCD, plot.locOuts
Examples
# set seed
set.seed(1)
# create data set
data = matrix(rnorm(2000), ncol = 4)
coords = matrix(rnorm(1000), ncol = 2)
groups = sample(1:10, 500, replace = TRUE)
lambda = 0.3
# calculate ssMRCD by using the local outlier detection method
outs = local_outliers_ssMRCD(data = data,
coords = coords,
groups = groups,
lambda = lambda,
k = 10)
# plot ssMRCD object included in outs
plot(x = outs$ssMRCD,
centersN = outs$centersN,
colour_scheme = "trace",
legend = FALSE)
Plots of loadings of PCAloc object
Description
Plots of loadings of PCAloc object
Usage
plot_loadings(object, ...)
Arguments
object |
object of class PCAloc |
... |
other input arguments, see details. |
Details
Additional parameters that can be given to the function are:
text | logical if values should be added as text. |
size | point size. |
tolerance | tolerance for rounding to zero. |
k | integer, which component scores should be plotted. |
groupnames | names of groups. |
varnames | names of variables. |
textrotate | angle of text rotation, if included. |
Value
Returns loading heatmap for component k
.
Examples
# set seed
set.seed(236)
data = matrix(rnorm(2000), ncol = 4)
groups = sample(1:10, 500, replace = TRUE)
W = time_weights(N = 10, c(3,2,1))
# calculate covariance matrices
covs = ssMRCD(data, groups = groups, weights = W, lambda = 0.3)
# sparse PCA
pca = sparsePCAloc(eta = 0.3, gamma = 0.7, cor = FALSE, COVS = covs$MRCDcov,
n_max = 1000, increase_rho = list(TRUE, 50, 1), trace = FALSE)
# plot score distances
plot_loadings(object = pca,
k = 1,
size = 2)
Distance-distance plot of scores of PCA
Description
Distance-distance plot of scores of PCA
Usage
plot_score_distances(X, PC, groups, ssMRCD, k, ...)
Arguments
X |
data matrix. |
PC |
loadings from PCA. |
groups |
vector containing group assignments. |
ssMRCD |
ssMRCD object. |
k |
integer of how many components should be used. |
... |
other input arguments, see details. |
Details
Additional parameters that can be given to the function are:
shape | point shape |
size | point size |
alpha | transparency |
Value
Returns distance-distance plot of orthogonal and score distance.
Examples
# set seed
set.seed(236)
data = matrix(rnorm(2000), ncol = 4)
groups = sample(1:10, 500, replace = TRUE)
W = time_weights(N = 10, c(3,2,1))
# calculate covariance matrices
covs = ssMRCD(data, groups = groups, weights = W, lambda = 0.3)
# sparse PCA
pca = sparsePCAloc(eta = 0.3, gamma = 0.7, cor = FALSE, COVS = covs$MRCDcov,
n_max = 1000, increase_rho = list(TRUE, 50, 1), trace = FALSE)
# plot score distances
plot_score_distances(PC = pca$PC,
groups = groups,
X = data,
ssMRCD = covs,
k = 2,
alpha = 0.4,
shape = 16,
size = 2)
Plots of score distribution
Description
Plots of score distribution
Usage
plot_scores(X, PC, groups, ssMRCD, ...)
Arguments
X |
data matrix. |
PC |
loadings from PCA. |
groups |
vector containing group assignments. |
ssMRCD |
ssMRCD object. |
... |
other input arguments, see details. |
Details
Additional parameters that can be given to the function are:
shape | point shape |
size | point size |
alpha | transparency |
k | integer, which component scores should be plotted |
Value
Returns histograms of scores for component k
.
Examples
# set seed
set.seed(236)
data = matrix(rnorm(2000), ncol = 4)
groups = sample(1:10, 500, replace = TRUE)
W = time_weights(N = 10, c(3,2,1))
# calculate covariance matrices
covs = ssMRCD(data, groups = groups, weights = W, lambda = 0.3)
# sparse PCA
pca = sparsePCAloc(eta = 0.3, gamma = 0.7, cor = FALSE, COVS = covs$MRCDcov,
n_max = 1000, increase_rho = list(TRUE, 50, 1), trace = FALSE)
# plot score distances
plot_scores(PC = pca$PC,
groups = groups,
X = data,
ssMRCD = covs,
k = 1,
alpha = 0.4,
shape = 16,
size = 2)
Rescale Weight Matrix
Description
Given a matrix with values for neighborhood influences the function rescales
the matrix in order to get an appropriate weight matrix used for the function ssMRCD
.
Usage
rescale_weights(W)
Arguments
W |
weight matrix with diagonals equal to zero and at least one positive entry per row. |
Value
An appropriately scaled weight matrix.
See Also
ssMRCD, local_outliers_ssMRCD, geo_weights
Examples
W = matrix(c(0, 1, 2,
1, 0, 1,
2, 1, 0), nrow = 3)
rescale_weights(W)
Extracting Residuals from Local Fit
Description
Extracting Residuals from Local Fit
Usage
## S3 method for class 'ssMRCD'
residuals(object, ...)
Arguments
object |
|
... |
see details |
Details
Other input variables are:
remove_outliers | logical (default FALSE ). If TRUE, only residuals
from not outlying observations are calculated. If FALSE, trimmed residuals are used (see alpha ). |
X | matrix of new data, if data from the ssMRCD object is used. |
groups | vector of groups for new data, if NULL data from the ssMRCD object is used. |
mean | logical (default FALSE ), specifying if mean of trimmed
observations is returned or all residuals. |
If X
and groups
are provided, alpha
is set to one and all residuals are used.
If remove_outliers
is TRUE, alpha
is set to 1 automatically.
Value
Returns either all residuals or the mean of the residual norms lower than the alpha
- Quantile.
Examples
# create data set
x1 = matrix(runif(200), ncol = 2)
x2 = matrix(rnorm(200), ncol = 2)
x = list(x1, x2)
# create weighting matrix
W = matrix(c(0, 1, 1, 0), ncol = 2)
# calculate ssMRCD
localCovs = ssMRCD(x, weights = W, lambda = 0.5)
# residuals of model
residuals(localCovs, remove_outliers = TRUE, mean = FALSE)
# residuals of new data
residuals(localCovs,
X = matrix(rnorm(20), ncol = 2, nrow = 10),
groups = rep(2, 10),
mean =TRUE)
Restructure Data Matrix as List
Description
This function restructures neighborhood information given by a data matrix
containing all information and one neighborhood assignment vector. It returns a list
of data matrices used in ssMRCD
.
Usage
restructure_as_list(data, groups)
Arguments
data |
data matrix with all observations. |
groups |
numeric neighborhood assignment vector. |
Value
Returns a list containing the observations per neighborhood assignment. The list is sorted according to the order of the first appearance in the groups vector.
Examples
# data matrix
data = matrix(rnorm(n = 3000), ncol = 3)
N_assign = sample(x = 1:10, size = 1000, replace = TRUE)
restructure_as_list(data, N_assign)
Scale Data Locally
Description
Scale Data Locally
Usage
scale_ssMRCD(
ssMRCD,
X = NULL,
groups = NULL,
multivariate = FALSE,
center_only = FALSE
)
Arguments
ssMRCD |
|
X |
matrix, new data to scale with ssMRCD estimation. |
groups |
vector, group assignments of new data |
multivariate |
logical, |
center_only |
logical, if |
Value
Returns matrix of observations. If X = NULL
X from the ssMRCD object is
used and sorted according to group numbering.
See Also
Examples
# create data set
x1 = matrix(runif(200), ncol = 2)
x2 = matrix(rnorm(200), ncol = 2)
x = list(x1, x2)
# create weighting matrix
W = matrix(c(0, 1, 1, 0), ncol = 2)
# calculate ssMRCD
localCovs = ssMRCD(x, weights = W, lambda = 0.5)
# scale used data
scale_ssMRCD(localCovs,
multivariate = TRUE)
# scale new data
scale_ssMRCD(localCovs,
X = matrix(rnorm(20), ncol = 2, nrow = 10),
groups = rep(2, 10),
multivariate =TRUE)
Calculate Scores for local sparse PCA
Description
Calculate Scores for local sparse PCA
Usage
scores(X, PC, groups, ssMRCD = NULL)
Arguments
X |
data set as matrix. |
PC |
loading matrix. |
groups |
vector of grouping structure (numeric). |
ssMRCD |
ssMRCD object used for scaling |
Value
Returns a list with scores and univariately and locally centered and scaled observations.
See Also
Examples
# create data set
x1 = matrix(runif(200), ncol = 2)
x2 = matrix(rnorm(200), ncol = 2)
x = list(x1, x2)
# create weighting matrix
W = matrix(c(0, 1, 1, 0), ncol = 2)
# calculate ssMRCD
loccovs = ssMRCD(x, weights = W, lambda = 0.5)
# calculate PCA
pca = sparsePCAloc(eta = 1, gamma = 0.5, cor = FALSE,
COVS = loccovs$MRCDcov,
increase_rho = list(FALSE, 20, 1))
# calculate scores
scores(X = rbind(x1, x2), PC = pca$PC,
groups = rep(c(1,2), each = 100), ssMRCD = loccovs)
Orthogonal Distances for PCAloc
Description
Orthogonal Distances for PCAloc
Usage
scores.OD(X, PC, groups, ssMRCD)
Arguments
X |
data matrix of observations. |
PC |
loadings of sparse local PCA. |
groups |
grouping vector for locality. |
ssMRCD |
ssMRCD object used for PCA calculation. |
Value
Returns vector of orthogonal distances of observations.
See Also
scores
, scores.SD
, sparsePCAloc
, scale_ssMRCD
Examples
# create data set
x1 = matrix(runif(200), ncol = 2)
x2 = matrix(rnorm(200), ncol = 2)
x = list(x1, x2)
# create weighting matrix
W = matrix(c(0, 1, 1, 0), ncol = 2)
# calculate ssMRCD
loccovs = ssMRCD(x, weights = W, lambda = 0.5)
# calculate PCA
pca = sparsePCAloc(eta = 1, gamma = 0.5, cor = FALSE,
COVS = loccovs$MRCDcov,
increase_rho = list(FALSE, 20, 1))
# calculate scores
scores.OD(X = rbind(x1, x2), PC = pca$PC,
groups = rep(c(1,2), each = 100), ssMRCD = loccovs)
Score Distances for PCAloc
Description
Score Distances for PCAloc
Usage
scores.SD(X, PC, groups, ssMRCD)
Arguments
X |
data matrix of observations. |
PC |
loadings of sparse local PCA. |
groups |
grouping vector for locality. |
ssMRCD |
ssMRCD object used for PCA calculation. |
Value
Returns vector of score distances of observations.
See Also
scores
, scores.OD
, sparsePCAloc
, scale_ssMRCD
Examples
# create data set
x1 = matrix(runif(200), ncol = 2)
x2 = matrix(rnorm(200), ncol = 2)
x = list(x1, x2)
# create weighting matrix
W = matrix(c(0, 1, 1, 0), ncol = 2)
# calculate ssMRCD
loccovs = ssMRCD(x, weights = W, lambda = 0.5)
# calculate PCA
pca = sparsePCAloc(eta = 1, gamma = 0.5, cor = FALSE,
COVS = loccovs$MRCDcov,
increase_rho = list(FALSE, 20, 1))
# calculate scores
scores.SD(X = rbind(x1, x2), PC = pca$PC,
groups = rep(c(1,2), each = 100), ssMRCD = loccovs)
Screeplot for PCAloc
Description
Screeplot for PCAloc
Usage
## S3 method for class 'PCAloc'
screeplot(x, ...)
Arguments
x |
object of class PCAloc. |
... |
other input arguments, see details. |
Details
Additional parameters that can be given to the function are:
text | logical if text should be plotted |
size | text size |
cutoff | cutoff line for scree plot |
groupnames | name of groups |
textrotate | angle of text, if text is plotted. |
Value
Returns version of scree plot and cumulative explained variance per group for PCAloc object.
Examples
# set seed
set.seed(236)
data = matrix(rnorm(2000), ncol = 4)
groups = sample(1:10, 500, replace = TRUE)
W = time_weights(N = 10, c(3,2,1))
# calculate covariance matrices
covs = ssMRCD(data, groups = groups, weights = W, lambda = 0.3)
# sparse PCA
pca = sparsePCAloc(eta = 0.3, gamma = 0.7, cor = FALSE, COVS = covs$MRCDcov,
n_max = 1000, increase_rho = list(TRUE, 50, 1), trace = FALSE)
# plot biplot
screeplot(pca, text = TRUE, cutoff = 0.8, size = 2)
Optimal Smoothing Parameter for ssMRCD based on Residuals
Description
The optimal smoothing value for the ssMRCD estimator is based on the residuals and the trimmed mean of the norm.
Usage
select_smoothing(
X,
groups,
weights,
lambda = seq(0, 1, 0.1),
TM = NULL,
alpha = 0.75,
seed = 123436,
return_all = TRUE,
cores = 1
)
Arguments
X |
data matrix containing observations. |
groups |
grouping vector corresponding to |
weights |
weight matrix for groups, see |
lambda |
vector of parameter values for smoothing, between 0 and 1. |
TM |
target matrix, if not given MCD (or MRCD if non regular) is used with default values and |
alpha |
percentage of outliers to be expected. |
seed |
seed for ssMRCD calculations. |
return_all |
logical, if FALSE the function returns only the optimal lambda. |
cores |
integer, number of cores used for parallel computing. |
Value
lambda_opt | optimal lambda for smoothing. |
COVS | ssMRCD object with optimal parameter setting. |
plot | plot for optimal parameter setting. |
residuals | mean of norm of residuals for varying lambda. |
Examples
# create data set
x1 = matrix(runif(200), ncol = 2)
x2 = matrix(rnorm(200), ncol = 2)
# create weighting matrix
W = matrix(c(0, 1, 1, 0), ncol = 2)
select_smoothing (X = rbind(x1, x2),
groups = rep(c(1,2), each = 100),
weights = W,
lambda = seq(0, 1, 0.1),
return_all = TRUE,
cores = 1)
Optimal Sparsity Parameter Selection for PCA
Description
Optimal Sparsity Parameter Selection for PCA
Usage
select_sparsity(
COVS,
k = 1,
rho = NULL,
cor = FALSE,
eta = seq(0, 5, by = 0.2),
gamma = seq(0, 1, 0.05),
eps_threshold = 0.001,
eps_root = 0.1,
eps_ADMM = 1e-04,
n_max = 300,
adjust_eta = FALSE,
cores = 1,
increase_rho = list(TRUE, 100, 1),
convergence_plot = FALSE,
trace = FALSE,
stop.sparse = TRUE
)
Arguments
COVS |
list of covariance or correlation matrices. |
k |
number of components to be returned. |
rho |
penalty parameter for ADMM. |
cor |
logical, if starting values for covariances or correlation matrices should be used. |
eta |
vector of possible values for degree of sparsity. |
gamma |
vector of possible values for distribution of sparsity. If only one value is provided, the optimal eta is calculated. |
eps_threshold |
tolerance for thresholding. |
eps_root |
tolerance for root finder. |
eps_ADMM |
tolerance for ADMM iterations. |
n_max |
maximal number of ADMM iterations. |
adjust_eta |
if eta should be adjusted for further components. |
cores |
number of cores for parallel computing. |
increase_rho |
list of settings for improved automated calculation and convergence. See Details. |
convergence_plot |
logical, if convergence plot should be plotted. Not applicable for |
trace |
logical, if messages should be displayed. Not applicable for |
stop.sparse |
calculate if AUC should be calculated for PCAs until full sparsity is reached ( |
Details
The input increase_rho
consists of a logical indicating if rho should be adjusted
if algorithm did not converged within the given maximal number of iterations. Two integers specify the
maximal rho
that is allowed and the step size.
Value
Returns list with
PCA | object of type PCAloc. |
PC | local loadings of PCA |
gamma | optimal value for gamma. |
eta | optimal value for eta. |
eta_tpo | values of Trade-Off-Product for eta from optimization process. |
auc | area under the curve for varying gamma values. |
pars | parameters and respective sparsity entrywise and mixed and explained variance. |
plot | ggplot object for optimal parameter selection. |
plot_info | additional data for plotting functions. |
Examples
C1 = matrix(c(1,0,0,0.9), ncol = 2)
C2 = matrix(c(1.1, 0.1, 0.1, 1), ncol = 2)
C3 = matrix(c(1.2, 0.2, 0.2, 1), ncol = 2)
select_sparsity(COVS = list(C1, C2, C3),
k = 1,
rho = 5,
eta = c(0, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5 ,0.75, 1),
gamma =c(0, 0.25, 0.5, 0.75, 1),
eps_threshold = 0.005,
increase_rho = list(FALSE, 20, 5))
Calculate Sparse Principle Components
Description
Calculate Sparse Principle Components
Usage
sparsePCAloc(
eta,
gamma,
COVS,
cor = FALSE,
rho = NULL,
k = NULL,
eps_threshold = NULL,
eps_ADMM = 1e-04,
n_max = 200,
eps_root = 0.1,
maxiter_root = 50,
increase_rho = list(TRUE, 100, 1),
convergence_plot = TRUE,
starting_value = NULL,
adjust_eta = TRUE,
trace = TRUE
)
Arguments
eta |
numeric, degree of sparsity. |
gamma |
numeric, distribution of sparsity. |
COVS |
list of covariance or correlation matrices. |
cor |
logical, if starting value for correlation or covariance matrices should be used. |
rho |
numeric bigger than zero, penalty for ADMM. |
k |
number of components to calculate. |
eps_threshold |
tolerance for thresholding. |
eps_ADMM |
tolerance for ADMM convergence. |
n_max |
number of maximal iterations. |
eps_root |
tolerance for root finder. |
maxiter_root |
maximal number of iterations for root finder. |
increase_rho |
list with entries for stable convergence. See Details. |
convergence_plot |
logical, if convergence plot should be displayed. |
starting_value |
optional given starting value. |
adjust_eta |
logical, if eta should be adjusted by the variance. |
trace |
logical, if messages should be displayed. |
Details
The input increase_rho
consists of a logical indicating if rho should be adjusted
if algorithm did not converged within the given maximal number of iterations. Two integers specify the
maximal rho
that is allowed and the step size.
Value
An object of class "PCAloc"
containing the following elements:
PC | Matrix of dimension Np x k of stacked loading vectors. |
p | Number of variables. |
N | Number of neighborhoods. |
k | Number of components. |
COVS | List of covariance matrices sorted by neighborhood. |
gamma | Sparsity distribution. |
eta | Amount of sparsity. |
converged | Logical, if ADMM converged with given specifications. |
n_steps | Number of steps used. |
summary | Description of result per component. |
residuals | Primary and secondary residuals. |
Examples
C1 = diag(c(1.1, 0.9, 0.6))
C2 = matrix(c(1.1, 0.1, -0.1,
0.1, 1.0, -0.2,
-0.1, -0.2, 0.7), ncol = 3)
C3 = (C1 + C2)/2
sparsePCAloc(eta = 1, gamma = 0.5, cor = FALSE, COVS = list(C1, C2, C3),
n_max = 100, increase_rho = list(FALSE, 100, 1))
Entry-wise Sparsity in the Loadings
Description
Entry-wise Sparsity in the Loadings
Usage
sparsity_entries(PC, N, p, tolerance = 0, k = 1, scaled = TRUE)
Arguments
PC |
matrix-like object of PCs. |
N |
integer, number of groups. |
p |
integer, number of variables. |
tolerance |
tolerance for sparsity. |
k |
integer or integer vector of which component should be used. |
scaled |
logical, if total number or percentage of possible sparse entries should be returned. |
Value
Returns either a percentage (scaled = TRUE
) or the amount of zero-values entries (scaled = FALSE
).
Examples
PC = matrix(c(1,0,2,3,0,7,0,1,0,1,0.001,0), ncol = 2)
sparsity_entries(PC, N = 2, p = 3, tolerance = 0, k = 1, scaled = FALSE)
sparsity_entries(PC, N = 2, p = 3, tolerance = 0.001, k = 2, scaled = TRUE)
Group-wise Sparsity in the Loadings
Description
Group-wise Sparsity in the Loadings
Usage
sparsity_group(PC, N, p, tolerance = 0, k = 1, scaled = TRUE)
Arguments
PC |
matrix-like object of PCs. |
N |
integer, number of groups. |
p |
integer, number of variables. |
tolerance |
tolerance for sparsity. |
k |
integer, which components should be used. Does not work for multiple PCs simultaneously. |
scaled |
logical, if total number or percentage of possible sparse entries should be returned. |
Value
Returns either a matrix of percentages (scaled = TRUE
) or the amounts
of zero-values entries (scaled = FALSE
) for each group/neighborhood.
Examples
PC = matrix(c(1,0,2,3,0,7,0,1,0,1,0.001,0), ncol = 2)
sparsity_group(PC, N = 2, p = 3, tolerance = 0, k = 1, scaled = FALSE)
sparsity_group(PC, N = 2, p = 3, tolerance = 0.001, k = 2, scaled = TRUE)
Mixed Sparsity of the Loadings
Description
Mixed Sparsity of the Loadings
Usage
sparsity_mixed(PC, p, N, k = 1, tolerance = 0.001, mean = "arithmetic")
Arguments
PC |
matrix-like object of PCs. |
p |
integer, number of variables. |
N |
integer, number of groups. |
k |
integer, which components should be used. Does not work for multiple PCs simultaneously. |
tolerance |
tolerance for sparsity. |
mean |
if |
Value
Returns the geometric mean of the percentage of entry-wise and group-wise sparsity.
Examples
PC = matrix(c(1,0,2,3,0,7,0,1,0,1,0.001,0), ncol = 2)
sparsity_mixed(PC, N = 2, p = 3, tolerance = 0, k = 1)
sparsity_mixed(PC, N = 2, p = 3, tolerance = 0.001, k = 2, mean = "geometric")
Entry-wise Sparsity in the Loadings per Group
Description
Entry-wise Sparsity in the Loadings per Group
Usage
sparsity_summary(PC, N, p, tolerance = 0, k = 1, scaled = FALSE)
Arguments
PC |
matrix-like object of PCs. |
N |
integer, number of groups. |
p |
integer, number of variables. |
tolerance |
tolerance for sparsity. |
k |
integer or integer vector of which component should be used. |
scaled |
logical, if total number or percentage of possible sparse entries should be returned. |
Value
Returns either a matrix of percentages (scaled = TRUE
) or the amounts
of zero-values entries (scaled = FALSE
) for each group/neighborhood.
Examples
PC = matrix(c(1,0,2,3,0,7,0,1,0,1,0.001,0), ncol = 2)
sparsity_summary(PC, N = 2, p = 3, tolerance = 0, k = 1, scaled = FALSE)
sparsity_summary(PC, N = 2, p = 3, tolerance = 0.001, k = 2, scaled = TRUE)
Spatially Smoothed MRCD Estimator
Description
The ssMRCD function calculates the spatially smoothed MRCD estimator from Puchhammer and Filzmoser (2023).
Usage
ssMRCD(
x,
groups = NULL,
weights,
lambda,
TM = NULL,
alpha = 0.75,
maxcond = 50,
maxcsteps = 200,
n_initialhsets = NULL
)
Arguments
x |
a list of matrices containing the observations per neighborhood sorted which can be obtained by the function |
groups |
vector of neighborhood assignments |
weights |
weighting matrix, symmetrical, rows sum up to one and diagonals need to be zero (see also |
lambda |
numeric between 0 and 1. |
TM |
target matrix (optional), default value is the covMcd from robustbase. |
alpha |
numeric, proportion of values included, between 0.5 and 1. |
maxcond |
optional, maximal condition number used for rho-estimation. |
maxcsteps |
maximal number of c-steps before algorithm stops. |
n_initialhsets |
number of initial h-sets, default is 6 times number of neighborhoods. |
Value
An object of class "ssMRCD"
containing the following elements:
MRCDcov | List of ssMRCD-covariance matrices sorted by neighborhood. |
MRCDicov | List of inverse ssMRCD-covariance matrices sorted by neighborhood. |
MRCDmu | List of ssMRCD-mean vectors sorted by neighborhood. |
mX | List of data matrices sorted by neighborhood. |
N | Number of neighborhoods. |
mT | Target matrix. |
rho | Vector of regularization values sorted by neighborhood. |
alpha | Scalar what percentage of observations should be used. |
h | Vector of how many observations are used per neighborhood, sorted. |
numiter | The number of iterations for the best initial h-set combination. |
c_alpha | Consistency factor for normality. |
weights | The weighting matrix. |
lambda | Smoothing factor. |
obj_fun_values | A matrix with objective function values for all initial h-set combinations (rows) and iterations (columns). |
best6pack | initial h-set combinations with best objective function value after c-step iterations. |
Kcov | returns MRCD-estimates without smoothing. |
References
Puchhammer P. and Filzmoser P. (2023): Spatially smoothed robust covariance estimation for local outlier detection. doi:10.48550/arXiv.2305.05371
See Also
plot.ssMRCD, summary.ssMRCD, restructure_as_list
Examples
# create data set
x1 = matrix(runif(200), ncol = 2)
x2 = matrix(rnorm(200), ncol = 2)
x = list(x1, x2)
# create weighting matrix
W = matrix(c(0, 1, 1, 0), ncol = 2)
# calculate ssMRCD
ssMRCD(x, weights = W, lambda = 0.5)
Summary method for PCAloc
Description
Summary method for PCAloc
Usage
## S3 method for class 'PCAloc'
summary(object, ...)
Arguments
object |
object of class PCAloc |
... |
other input variables. |
Value
Summary for PCAloc
See Also
Examples
#'
C1 = diag(c(1.1, 0.9, 0.6))
C2 = matrix(c(1.1, 0.1, -0.1,
0.1, 1.0, -0.2,
-0.1, -0.2, 0.7), ncol = 3)
C3 = (C1 + C2)/2
pca = sparsePCAloc(eta = 1, gamma = 0.5, cor = FALSE, COVS = list(C1, C2, C3),
n_max = 100, increase_rho = list(FALSE, 100, 1), trace = FALSE)
summary(pca)
Summary of Local Outlier Detection
Description
Prints a summary of the locOuts object obtained by the function local_outliers_ssMRCD
.
Usage
## S3 method for class 'locOuts'
summary(object, ...)
Arguments
object |
a locOuts object. |
... |
further parameters passed on. |
Value
Prints a summary of the locOuts
object.
See Also
Examples
# set seed
set.seed(1)
# make locOuts object
data = matrix(rnorm(2000), ncol = 4)
coords = matrix(rnorm(1000), ncol = 2)
groups = sample(1:10, 500, replace = TRUE)
lambda = 0.3
# local outlier detection
outs = local_outliers_ssMRCD(data = data,
coords = coords,
groups = groups,
lambda = lambda,
k = 10)
# summary method
summary(outs)
Summary Method for ssMRCD Object
Description
Summarises most important information of output ssMRCD
.
Usage
## S3 method for class 'ssMRCD'
summary(object, ...)
Arguments
object |
object of class |
... |
further parameters. |
Value
Prints a summary of the ssMRCD
object.
See Also
See also ssMRCD
, plot.ssMRCD
.
Band weight matrix for time series groupings
Description
Band weight matrix for time series groupings
Usage
time_weights(N, off_diag)
Arguments
N |
number of groups. |
off_diag |
vector for off-diagonal values unequal to zero. |
Value
Returns weight matrix for time series groups appropriate for ssMRCD
.
See Also
Examples
time_weights(N = 10, off_diag = c(2,1))
Austrian Weather Data 2021
Description
This data is a subset of the GeoSphere Austria monthly weather data of 2021 averaged using the median. Stations with missing values are removed.
Usage
weatherAUT2021
Format
A data frame with 183 rows and 10 columns:
- name
Unique name of the weather station in German.
- lon, lat
Longitude and latitude of the weather station.
- alt
Altitude of the weather station (meter).
- p
Average air pressure (hPa).
- s
Monthly sum of sunshine duration (hours).
- vv
Wind velocity (meter/second).
- t
Air temperature in 2 meters above the ground in (°C).
- rsum
Average daily sum of precipitation (mm).
- rel
Relative air humidity (percent).
Source
The original data was downloaded here (December 2022): https://data.hub.geosphere.at/dataset/klima-v1-1m.
References
Data Source: GeoSphere Austria - https://data.hub.geosphere.at.
Examples
data(weatherAUT2021)
summary(weatherAUT2021)
Vienna Weather Time Series (1960-2023)
Description
This data is a subset of the GeoSphere Austria daily weather data of the time 1960-2023 for the weather station Hohe Warte in Vienna.
Usage
weatherHoheWarte
Format
A data frame with 23372 rows and 18 columns including 13 weather measurements:
- time
Time of measurement in date format.
- cloud_cover
Daily mean of cloud coverage, values between 1 and 100.
- global_radiation
Daily sum of global radiation (J/cm^2).
- vapor_pressure
Daily mean of vapour pressuer (hPa).
- max_wind_speed
Maximal wind speed (m/s).
- air_pressure
Daily mean of air pressure (hPa).
- relative_humidity
Daily mean of relative humidity (percent).
- precipitation
Daily sum of precepitation (mm).
- sight
Sight distance at 1pm (m).
- sunshine_duration
Daily sum of sunshine duration (h).
- temperature_max
Daily maximum of temperature at 2m air height (°C).
- temperature_min
Daily minimum of temperature at 2m air height (°C).
- temperature_mean
Daily mean of temperature at 2m air height (°C).
- wind_velocity
Daily mean of wind speed (m/s).
- year
Year of measurement.
- month
Month of measurement.
- day
Day of the year of measurement.
- season
Season of measuremen (1 = winter, 2 = spring, 3 = summer, 4 = fall).
Source
The original data was downloaded here (April 2024): https://data.hub.geosphere.at/dataset/klima-v2-1d.
References
Data Source: GeoSphere Austria - https://data.hub.geosphere.at.
Examples
data(weatherHoheWarte)
summary(weatherHoheWarte)