Type: | Package |
Title: | Principal Component Pursuit for Environmental Epidemiology |
Version: | 1.0.0 |
Description: | Implementation of the pattern recognition technique Principal Component Pursuit tailored to environmental health data, as described in Gibson et al (2022) <doi:10.1289/EHP10479>. |
License: | GPL (≥ 3) |
URL: | https://columbia-prime.github.io/pcpr/, https://github.com/Columbia-PRIME/pcpr |
BugReports: | https://github.com/Columbia-PRIME/pcpr/issues |
Depends: | R (≥ 3.5.0) |
Imports: | checkmate, dplyr, foreach, furrr, future, magrittr, progressr, purrr, rlang, tidyselect |
Suggests: | GGally, ggplot2, knitr, lubridate, metR, rmarkdown, tidyr |
VignetteBuilder: | knitr |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.3.2 |
NeedsCompilation: | no |
Packaged: | 2025-03-26 18:44:18 UTC; lgc2035 |
Author: | Lawrence G. Chillrud
|
Maintainer: | Lawrence G. Chillrud <lawrencechillrud@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2025-03-27 18:20:02 UTC |
pcpr: Principal Component Pursuit for Environmental Epidemiology
Description
Implementation of the pattern recognition technique Principal Component Pursuit tailored to environmental health data, as described in Gibson et al (2022) doi:10.1289/EHP10479.
Author(s)
Maintainer: Lawrence G. Chillrud lawrencechillrud@gmail.com (ORCID) [copyright holder]
Authors:
Jaime Benavides (ORCID)
Elizabeth A. Gibson (ORCID)
Junhui Zhang (ORCID)
Jingkai Yan (ORCID)
John N. Wright
Jeff Goldsmith
Marianthi-Anna Kioumourtzoglou (ORCID)
Other contributors:
Columbia University (00hj8s172) [funder]
See Also
Useful links:
Report bugs at https://github.com/Columbia-PRIME/pcpr/issues
Retrieve default PCP parameter settings for given matrix
Description
get_pcp_defaults()
calculates "default" PCP parameter settings lambda
,
mu
(used in root_pcp()
), and eta
(used in rrmc()
) for a given data
matrix D
.
The "default" values of lambda
and mu
offer theoretical guarantees
of optimal estimation performance. Candès et al. (2011) obtained the
guarantee for lambda
, while
Zhang et al. (2021)
obtained the result for mu
. It has not yet been proven whether or
not eta
enjoys similar properties.
In practice it is common to find different optimal parameter values
after tuning these parameters in a grid search. Therefore, it is
recommended to use these defaults primarily to help define a reasonable
initial parameter search space to pass into grid_search_cv()
.
Usage
get_pcp_defaults(D)
Arguments
D |
The input data matrix. |
Value
A list containing:
-
lambda
: The theoretically optimallambda
value used inroot_pcp()
. -
mu
: The theoretically optimalmu
value used inroot_pcp()
. -
eta
: The defaulteta
value used inrrmc()
.
The intuition behind PCP parameters
root_pcp()
's objective function is given by:
\min_{L, S} ||L||_* + \lambda ||S||_1 + \mu ||L + S - D||_F
-
lambda
controls the sparsity ofroot_pcp()
's outputS
matrix; larger values oflambda
penalize non-zero entries inS
more stringently, driving the recovery of sparserS
matrices. Therefore, if you a priori expect few outlying events in your model, you might expect a grid search to recover relatively largerlambda
values, and vice-versa. -
mu
adjustsroot_pcp()
's sensitivity to noise; larger values ofmu
penalize errors between the predicted model and the observed data (i.e. noise), more severely. Environmental data subject to higher noise levels therefore require aroot_pcp()
model equipped with smallermu
values (since higher noise means a greater discrepancy between the observed mixture and the true underlying low-rank and sparse model). In virtually noise-free settings (e.g. simulations), larger values ofmu
would be appropriate.
rrmc()
's objective function is given by:
\min_{L, S} I_{rank(L) \leq r} + \eta ||S||_0 + ||L + S - D||_F^2
-
eta
controls the sparsity ofrrmc()
's outputS
matrix, just aslambda
does forroot_pcp()
. Because there are no other parameters scaling the noise term,eta
can be thought of as a ratio betweenroot_pcp()
'slambda
andmu
: Larger values ofeta
will place a greater emphasis on penalizing the non-zero entries inS
over penalizing the errors between the predicted and observed data (the dense noiseZ
).
The calculation of the "default" PCP parameters
-
lambda
is calculated as\lambda = 1 / \sqrt{\max(n, p)},
wheren
andp
are the dimensions of the input matrixD_{n \times p}
Candès et al. (2011). -
mu
is calculated as\mu = \sqrt{\frac{\min(n, p)}{2}},
wheren
andp
are as above [Zhang et al. (2021)]. -
eta
is simply\eta = \frac{\lambda}{\mu}
.
References
Candès, Emmanuel J., Xiaodong Li, Yi Ma, and John Wright. "Robust principal component analysis?." Journal of the ACM (JACM) 58, no. 3 (2011): 1-37.
Zhang, Junhui, Jingkai Yan, and John Wright. "Square root principal component pursuit: tuning-free noisy robust matrix recovery." Advances in Neural Information Processing Systems 34 (2021): 29464-29475. [available here]
See Also
Examples
# Examine the queens PM2.5 data
queens
# Get rid of the Date column
D <- as.matrix(queens[, 2:ncol(queens)])
# Get default PCP parameters
default_params <- get_pcp_defaults(D)
# Use default parameters to define parameter search space
scaling_factors <- sort(c(10^seq(-2, 4, 1), 2 * 10^seq(-2, 4, 1)))
etas_to_grid_search <- default_params$eta * scaling_factors
etas_to_grid_search
Cross-validated grid search for PCP models
Description
grid_search_cv()
conducts a Monte Carlo style cross-validated grid search
of PCP parameters for a given data matrix D
, PCP function pcp_fn
, and
grid of parameter settings to search through grid
. The run time of the grid
search can be sped up using bespoke parallelization settings. The call to
grid_search_cv()
can be wrapped in a call to progressr::with_progress()
for progress bar updates. See the below sections for details.
Usage
grid_search_cv(
D,
pcp_fn,
grid,
...,
parallel_strategy = "sequential",
num_workers = 1,
perc_test = 0.05,
num_runs = 100,
return_all_tests = FALSE,
verbose = TRUE
)
Arguments
D |
The input data matrix (can contain |
pcp_fn |
The PCP function to use when grid searching. Must be either
|
grid |
A |
... |
Any parameters required by |
parallel_strategy |
(Optional) The parallelization strategy used when
conducting the grid search (to be passed on to the |
num_workers |
(Optional) An integer specifying the number of workers to
use when parallelizing the grid search, to be passed on to
|
perc_test |
(Optional) The fraction of entries of |
num_runs |
(Optional) The number of times to test a given parameter
setting. By default, |
return_all_tests |
(Optional) A logical indicating if you would like the
output from all the calls made to |
verbose |
(Optional) A logical indicating if you would like verbose
output displayed or not. By default, |
Value
A list containing:
-
all_stats
: Adata.frame
containing the statistics of every run comprising the grid search. These statistics include the parameter settings for the run, along with therun_num
(used as the seed for the corruption step, step 1 in the grid search procedure), the relative error for the runrel_err
, the rank of the recovered L matrixL_rank
, the sparsity of the recovered S matrixS_sparsity
, the number ofiterations
PCP took to reach convergence (forroot_pcp()
only), and the error statusrun_error
of the PCP run (NA
if no error, otherwise a character string). -
summary_stats
: Adata.frame
containing a summary of the information inall_stats
. Summary made by column-wise averaging the results inall_stats
. -
metadata
: A character string containing the metadata associated with the grid search instance.
If return_all_tests = TRUE
then the following are also returned as part
of the list:
-
L_mats
: A list containing all theL
matrices returned from PCP throughout the grid search. Therefore,length(L_mats) == nrow(all_stats)
. Rowi
inall_stats
corresponds toL_mats[[i]]
. -
S_mats
: A list containing all the S matrices returned from PCP throughout the grid search. Therefore,length(S_mats) == nrow(all_stats)
. Rowi
inall_stats
corresponds toS_mats[[i]]
. -
test_mats
: A list oflength(num_runs)
containing all the corrupted test mats (and their masks) used throughout the grid search. Note:all_stats$run[i]
corresponds totest_mats[[i]]
. -
original_mat
: The original data matrixD
. -
constant_params
: A copy of the constant parameters that were originally passed to the grid search (for record keeping).
The Monte Carlo style cross-validation procedure
Each hyperparameter setting is cross-validated by:
Randomly corrupting
perc_test
percent of the entries inD
as missing (i.e.NA
values), yieldingD_tilde
. Done viasim_na()
.Running the PCP function
pcp_fn
onD_tilde
, yielding estimatesL
andS
.Recording the relative recovery error of
L
compared with the input data matrixD
for only those values that were imputed as missing during the corruption step (step 1 above). Mathematically, calculate:||P_{\Omega^c}(D - L)||_F / ||P_{\Omega^c}(D)||_F
, whereP_{\Omega^c}
selects only those entries whereis.na(D_tilde) == TRUE
.Repeating steps 1-3 for a total of
num_runs
-many times, where each "run" has a unique random seed from1
tonum_runs
associated with it.Performance statistics can then be calculated for each "run", and then summarized across all runs for average model performance statistics.
Best practices for perc_test
and num_runs
Experimentally, this grid search procedure retrieves the best performing
PCP parameter settings when perc_test
is relatively low, e.g.
perc_test = 0.05
, or 5%, and num_runs
is relatively high, e.g.
num_runs = 100
.
The larger perc_test
is, the more the test set turns into a matrix
completion problem, rather than the desired matrix decomposition problem. To
better resemble the actual problem PCP will be faced with come inference
time, perc_test
should therefore be kept relatively low.
Choosing a reasonable value for num_runs
is dependent on the need to keep
perc_test
relatively low. Ideally, a large enough num_runs
is used so
that many (if not all) of the entries in D
are likely to eventually be
tested. Note that since test set entries are chosen randomly for all runs 1
through num_runs
, in the pathologically worst case scenario, the same exact
test set could be drawn each time. In the best case scenario, a different
test set is obtained each run, providing balanced coverage of D
. Viewed
another way, the smaller num_runs
is, the more the results are susceptible
to overfitting to the relatively few selected test sets.
Interpretaion of results
Once the grid search of has been conducted, the optimal hyperparameters can
be chosen by examining the output statistics summary_stats
. Below are a
few suggestions for how to interpret the summary_stats
table:
Generally speaking, the first thing a user will want to inspect is the
rel_err
statistic, capturing the relative discrepancy between recovered test sets and their original, observed (yet possibly noisy) values. Lowerrel_err
means the PCP model was better able to recover the held-out test set. So, in general, the best parameter settings are those with the lowestrel_err
. Having said this, it is important to remember that this statistic should be taken with a grain of salt: Because no ground truthL
matrix exists, therel_err
measurement is forced to rely on the comparison between the noisy observed data matrixD
and the estimated low-rank modelL
. So therel_err
metric is an "apples to oranges" relative error. For data that is a priori expected to be subject to a high degree of noise, it may actually be better to discard parameter settings with suspiciously lowrel_err
s (in which case the solution may be hallucinating an inaccurate low-rank structure from the observed noise).For grid searches using
root_pcp()
as the PCP model, parameters that fail to converge can be discarded. Generally, fewerroot_pcp()
iterations (num_iter
) taken to reach convergence portend a more reliable / stable solution. In rare cases, the user may need to increaseroot_pcp()
'smax_iter
argument to reach convergence.rrmc()
does not report convergence metadata, as its optimization scheme runs for a fixed number of iterations.Parameter settings with unreasonable sparsity or rank measurements can also be discarded. Here, "unreasonable" means these reported metrics flagrantly contradict prior assumptions, knowledge, or work. For instance, most air pollution datasets contain a number of extreme exposure events, so PCP solutions returning sparse
S
models with 100% sparsity have obviously been regularized too heavily. Solutions with lower sparsities should be preferred. Note that reported sparsity and rank measurements are estimates heavily dependent on thethresh
set by thesparsity()
&matrix_rank()
functions. E.g. it could be that the actual average matrix rank is much higher or lower when a threshold that better takes into account the relative scale of the singular values is used. Likewise for the sparsity estimations. Also, recall that the given value forperc_test
artifically sets a sparsity floor, since those missing entries in the test set cannot be recovered in theS
matrix. E.g. ifperc_test = 0.05
, then no parameter setting will have an estimated sparsity lower than 5%.
See Also
sim_na()
, sparsity()
, matrix_rank()
, get_pcp_defaults()
Examples
#### -------Simple simulated PCP problem-------####
# First we will simulate a simple dataset with the sim_data() function.
# The dataset will be a 100x10 matrix comprised of:
# 1. A rank-3 component as the ground truth L matrix;
# 2. A ground truth sparse component S w/outliers along the diagonal; and
# 3. A dense Gaussian noise component
data <- sim_data()
#### -------Tiny grid search-------####
# Here is a tiny grid search just to test the function quickly.
# In practice we would recommend a larger grid search.
# For examples of larger searches, see the vignettes.
gs <- grid_search_cv(
data$D,
rrmc,
data.frame("eta" = 0.35),
r = 3,
num_runs = 2
)
gs$summary_stats
Hard-thresholding operator
Description
hard_threshold()
implements the hard-thresholding operator on a given
matrix D
, making D
sparser: elements of D
whose absolute value are less
than a given threshold thresh
are set to 0, i.e. D[|D| < thresh] = 0
.
This is used in the non-convex PCP function rrmc()
to provide a non-convex
replacement for the prox_l1()
method used in the convex PCP function
root_pcp()
. It is used to iteratively model the sparse S
matrix with the
help of an adaptive threshold (thresh
changes over the course of
optimization).
Usage
hard_threshold(D, thresh)
Arguments
D |
The input data matrix. |
thresh |
The scalar-valued hard-threshold acting on |
Value
The hard-thresholded matrix.
Examples
set.seed(42)
D <- matrix(rnorm(25), 5, 5)
S <- hard_threshold(D, thresh = 1)
D
S
Impute missing values in given matrix
Description
impute_matrix()
imputes the missing NA
values in a given matrix using a
given imputation_scheme
.
Usage
impute_matrix(D, imputation_scheme)
Arguments
D |
The input data matrix. |
imputation_scheme |
The values to replace missing
|
Value
The imputed matrix.
See Also
sim_na()
, sim_lod()
, sim_data()
Examples
#### ------------Imputation with a scalar------------####
# simulate a small 5x5 mixture
D <- sim_data(5, 5)$D
# corrupt the mixture with 40% missing observations
D_tilde <- sim_na(D, 0.4)$D_tilde
D_tilde
# impute missing values with 0
impute_matrix(D_tilde, 0)
# impute missing values with -1
impute_matrix(D_tilde, -1)
#### ------------Imputation with a vector------------####
# impute missing values with the column-mean
impute_matrix(D_tilde, apply(D_tilde, 2, mean, na.rm = TRUE))
# impute missing values with the column-min
impute_matrix(D_tilde, apply(D_tilde, 2, min, na.rm = TRUE))
#### ------------Imputation with a matrix------------####
# impute missing values with random Gaussian noise
noise <- matrix(rnorm(prod(dim(D_tilde))), nrow(D_tilde), ncol(D_tilde))
impute_matrix(D_tilde, noise)
#### ------------Imputation with LOD/sqrt(2)------------####
D <- sim_data(5, 5)$D
lod_info <- sim_lod(D, q = 0.2)
D_tilde <- lod_info$D_tilde
D_tilde
lod <- lod_info$lod
impute_matrix(D_tilde, lod / sqrt(2))
Estimate rank of a given matrix
Description
matrix_rank()
estimates the rank of a given data matrix D
by counting the
number of "practically nonzero" singular values of D
.
The rank of a matrix is the number of linearly independent columns or rows in the matrix, governing the structure of the data. It can intuitively be thought of as the number of inherent latent patterns in the data.
A singular value s
is determined to be "practically nonzero" if
s \geq s_{max} \cdot thresh
, i.e. if it is greater than or equal to the
maximum singular value in D
scaled by a given threshold thresh
.
Usage
matrix_rank(D, thresh = NULL)
Arguments
D |
The input data matrix (cannot have |
thresh |
(Optional) A double |
Value
An integer estimating the rank of D
.
See Also
Examples
data <- sim_data()
matrix_rank(data$D)
matrix_rank(data$L)
Project matrix to rank r
Description
proj_rank_r()
implements a best (i.e. closest) rank-r
approximation of
an input matrix.
This is computed via a simple truncated singular value decomposition (SVD),
retaining the first r
leading singular values/vectors of D
.
This is equivalent to solving the following optimization problem:
min ||X-D||_F s.t. rank(X) <= r
, where X
is the approximated solution
and D
is the input matrix.
proj_rank_r()
is used to iteratively model the low-rank L
matrix in the
non-convex PCP function rrmc()
, providing a non-convex replacement for the
prox_nuclear()
method used in the convex PCP function root_pcp()
.
Intuitively, proj_rank_r()
can also be thought of as providing a PCA
estimate of a rank-r
matrix L
from observed data D
.
Usage
proj_rank_r(D, r)
Arguments
D |
The input data matrix (cannot have |
r |
The rank that |
Value
The best rank-r
approximation to D
via a truncated SVD.
See Also
Examples
# Simulating a simple dataset D with the sim_data() function.
# The dataset will be a 10x5 matrix comprised of:
# 1. A rank-1 component as the ground truth L matrix; and
# 2. A dense Gaussian noise component corrupting L, making L full-rank
data <- sim_data(10, 5, 1, numeric(), 0.01)
# The observed matrix D is full-rank, while L is rank-1:
data.frame("D_rank" = matrix_rank(data$D), "L_rank" = matrix_rank(data$L))
before_proj_err <- norm(data$D - data$L, "F") / norm(data$L, "F")
# Projecting D onto the nearest rank-1 approximation, X, via proj_rank_r()
X <- proj_rank_r(data$D, r = 1)
after_proj_err <- norm(X - data$L, "F") / norm(data$L, "F")
proj_v_obs_err <- norm(X - data$D, "F") / norm(data$D, "F")
data.frame(
"Observed_error" = before_proj_err,
"Projected_error" = after_proj_err,
"Projected_vs_observed_error" = proj_v_obs_err
)
Daily chemical concentrations of 26 PM2.5 species from Queens, NYC (2001-2021)
Description
A dataset containing the chemical concentrations (in µg/m^3) of 26 PM2.5
species measured every three to six days from 04/04/2001 through 12/30/2021
in Queens, New York City. Data obtained from the U.S. Environmental
Protection Agency's Air Quality System data mart (site ID: 36-081-0124
).
Usage
queens
Format
A tibble with 2443 rows and 27 variables:
-
Date
: The date the PM2.5 measurements were made -
...
: The remaining 26 variables are the 26 PM2.5 species (in µg/m^3): Al, NH4, As, Ba, Br, Cd, Ca, Cl, Cr, Cu, EC, Fe, Pb, Mg, Mn, Ni, OC, K, Se, Si, Na, S, Ti, NO3, V, Zn
Source
https://epa.maps.arcgis.com/apps/webappviewer/index.html?id=5f239fd3e72f424f98ef3d5def547eb5
References
US Environmental Protection Agency. Air Quality System Data Mart internet database available via https://www.epa.gov/outdoor-air-quality-data. Accessed July 15, 2022.
Examples
queens
Square root principal component pursuit (convex PCP)
Description
root_pcp()
implements the convex PCP algorithm "Square root principal
component pursuit" as described in
Zhang et al. (2021)
, outfitted with environmental health (EH)-specific extensions as described
in Gibson et al. (2022).
Given an observed data matrix D
, and regularization parameters lambda
and
mu
, root_pcp()
aims to find the best low-rank and sparse estimates L
and S
. The L
matrix encodes latent patterns that govern the observed
data. The S
matrix captures any extreme events in the data unexplained by
the underlying patterns in L
.
Being convex, root_pcp()
determines the rank r
, or number of latent
patterns in the data, autonomously during it's optimization. As such, the
user does not need to specify the desired rank r
of the output L
matrix
as in the non-convex PCP model rrmc()
.
Experimentally, the root_pcp()
approach to PCP modeling has best been able
to handle those datasets that are governed by well-defined underlying
patterns, characterized by quickly decaying singular values. This is typical
of imaging and video data, but uncommon for EH data. For observed data with a
complex low rank structure (slowly decaying singular values), like EH data,
rrmc()
may offer a better model estimate.
Three EH-specific extensions are currently supported by root_pcp()
:
The model can handle missing values in the input data matrix
D
;The model can also handle measurements that fall below the limit of detection (LOD), if provided
LOD
information by the user; andThe model is also equipped with an optional non-negativity constraint on the low-rank
L
matrix, ensuring that all output values inL
are> 0
.
Usage
root_pcp(
D,
lambda = NULL,
mu = NULL,
LOD = -Inf,
non_negative = TRUE,
max_iter = 10000,
verbose = FALSE
)
Arguments
D |
The input data matrix (can contain |
lambda , mu |
(Optional) A pair of doubles each in the range |
LOD |
(Optional) The limit of detection (LOD) data. Entries in
By default, |
non_negative |
(Optional) A logical indicating whether or not the
non-negativity constraint should be used to constrain the output |
max_iter |
(Optional) An integer specifying the maximum number of
iterations to allow PCP before giving up on meeting PCP's convergence
criteria. By default, |
verbose |
(Optional) A logical indicating whether or not to print
information in real time over the course of PCP's optimization. By
default, |
Value
A list containing:
-
L
: The rank-r
low-rank matrix encoding ther
-many latent patterns governing the observed input data matrixD
.dim(L)
will be the same asdim(D)
. To explicitly obtain the underlying patterns,L
can be used as the input to any matrix factorization technique of choice, e.g. PCA, factor analysis, or non-negative matrix factorization. -
S
: The sparse matrix containing the rare outlying or extreme observations inD
that are not explained by the underlying patterns in the correspondingL
matrix.dim(S)
will be the same asdim(D)
. Most entries inS
are0
, while non-zero entries identify the extreme outlying observations inD
. -
num_iter
: The number of iterations taken to reach convergence. Ifnum_iter == max_iter
thenroot_pcp()
did not converge. -
objective
: A vector containing the values ofroot_pcp()
's objective function over the course of optimization. -
converged
: A boolean indicating whether the convergence criteria were met beforemax_iter
was reached.
The objective function
root_pcp()
optimizes the following objective function:
\min_{L, S} ||L||_* + \lambda ||S||_1 + \mu ||L + S - D||_F
The first term is the nuclear norm of the L
matrix, incentivizing L
to be
low-rank. The second term is the \ell_1
norm of the S
matrix,
encouraging S
to be sparse. The third term is the Frobenius norm
applied to the model's noise, ensuring that the estimated low-rank and sparse
models L
and S
together have high fidelity to the observed data D
.
The objective is not smooth nor differentiable, however it is convex and
separable. As such, it is optimized using the Alternating Direction
Method of Multipliers (ADMM) algorithm Boyd et al. (2011), Gao et al. (2020).
The lambda
and mu
parameters
-
lambda
controls the sparsity ofroot_pcp()
's outputS
matrix; larger values oflambda
penalize non-zero entries inS
more stringently, driving the recovery of sparserS
matrices. Therefore, if you a priori expect few outlying events in your model, you might expect a grid search to recover relatively largerlambda
values, and vice-versa. -
mu
adjustsroot_pcp()
's sensitivity to noise; larger values ofmu
penalize errors between the predicted model and the observed data (i.e. noise), more severely. Environmental data subject to higher noise levels therefore require aroot_pcp()
model equipped with smallermu
values (since higher noise means a greater discrepancy between the observed mixture and the true underlying low-rank and sparse model). In virtually noise-free settings (e.g. simulations), larger values ofmu
would be appropriate.
The default values of lambda
and mu
offer theoretical guarantees
of optimal estimation performance, and stable recovery of L
and S
. By
"stable", we mean root_pcp()
's reconstruction error is, in the worst case,
proportional to the magnitude of the noise corrupting the observed data
(||Z||_F
), often outperforming this upper bound.
Candès et al. (2011) obtained the guarantee for lambda
, while
Zhang et al. (2021)
obtained the result for mu
.
Environmental health specific extensions
We refer interested readers to Gibson et al. (2022) for the complete details regarding the EH-specific extensions.
Missing value functionality: PCP assumes that the same data generating
mechanisms govern both the missing and the observed entries in D
. Because
PCP primarily seeks accurate estimation of patterns rather than
individual observations, this assumption is reasonable, but in some edge
cases may not always be justified. Missing values in D
are therefore
reconstructed in the recovered low-rank L
matrix according to the
underlying patterns in L
. There are three corollaries to keep in mind
regarding the quality of recovered missing observations:
Recovery of missing entries in
D
relies on accurate estimation ofL
;The fewer observations there are in
D
, the harder it is to accurately reconstructL
(therefore estimation of both unobserved and observed measurements inL
degrades); andGreater proportions of missingness in
D
artifically drive up the sparsity of the estimatedS
matrix. This is because it is not possible to recover a sparse event inS
when the corresponding entry inD
is unobserved. By definition, sparse events inS
cannot be explained by the consistent patterns inL
. Practically, if 20% of the entries inD
are missing, then at least 20% of the entries inS
will be 0.
Handling measurements below the limit of detection: When equipped with
LOD information, PCP treats any estimations of values known to be below the
LOD as equally valid if their approximations fall between 0 and the LOD. Over
the course of optimization, observations below the LOD are pushed into this
known range [0, LOD]
using penalties from above and below: should a
< LOD
estimate be < 0
, it is stringently penalized, since
measured observations cannot be negative. On the other hand, if a < LOD
estimate is >
the LOD, it is also heavily penalized: less so than when
< 0
, but more so than observations known to be above the LOD, because
we have prior information that these observations must be below LOD.
Observations known to be above the LOD are penalized as usual, using the
Frobenius norm in the above objective function.
Gibson et al. (2022) demonstrates that
in experimental settings with up to 50% of the data corrupted below the LOD,
PCP with the LOD extension boasts superior accuracy of recovered L
models
compared to PCA coupled with LOD / \sqrt{2}
imputation. PCP even
outperforms PCA in low-noise scenarios with as much as 75% of the data
corrupted below the LOD. The few situations in which PCA bettered PCP were
those pathological cases in which D
was characterized by extreme noise and
huge proportions (i.e., 75%) of observations falling below the LOD.
The non-negativity constraint on L
: To enhance interpretability of
PCP-rendered solutions, there is an optional non-negativity constraint
that can be imposed on the L
matrix to ensure all estimated values
within it are \geq 0
. This prevents researchers from having to deal
with negative observation values and questions surrounding their meaning
and utility. Non-negative L
models also allow for seamless use of methods
such as non-negative matrix factorization to extract non-negative patterns.
The non-negativity constraint is incorporated in the ADMM splitting technique
via the introduction of an additional optimization variable and corresponding
constraint.
References
Zhang, Junhui, Jingkai Yan, and John Wright. "Square root principal component pursuit: tuning-free noisy robust matrix recovery." Advances in Neural Information Processing Systems 34 (2021): 29464-29475. [available here]
Gibson, Elizabeth A., Junhui Zhang, Jingkai Yan, Lawrence Chillrud, Jaime Benavides, Yanelli Nunez, Julie B. Herbstman, Jeff Goldsmith, John Wright, and Marianthi-Anna Kioumourtzoglou. "Principal component pursuit for pattern identification in environmental mixtures." Environmental Health Perspectives 130, no. 11 (2022): 117008.
Boyd, Stephen, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. "Distributed optimization and statistical learning via the alternating direction method of multipliers." Foundations and Trends in Machine learning 3, no. 1 (2011): 1-122.
Gao, Wenbo, Donald Goldfarb, and Frank E. Curtis. "ADMM for multiaffine constrained optimization." Optimization Methods and Software 35, no. 2 (2020): 257-303.
Candès, Emmanuel J., Xiaodong Li, Yi Ma, and John Wright. "Robust principal component analysis?." Journal of the ACM (JACM) 58, no. 3 (2011): 1-37.
See Also
Examples
#### -------Simple simulated PCP problem-------####
# First we will simulate a simple dataset with the sim_data() function.
# The dataset will be a 100x10 matrix comprised of:
# 1. A rank-2 component as the ground truth L matrix;
# 2. A ground truth sparse component S w/outliers along the diagonal; and
# 3. A dense Gaussian noise component
data <- sim_data(r = 2, sigma = 0.1)
# Best practice is to conduct a grid search with grid_search_cv() function,
# but we skip that here for brevity.
pcp_model <- root_pcp(data$D, lambda = 0.225, mu = 3.04)
data.frame(
"Estimated_L_rank" = matrix_rank(pcp_model$L, 5e-2),
"Observed_relative_error" = norm(data$L - data$D, "F") / norm(data$L, "F"),
"PCA_error" = norm(data$L - proj_rank_r(data$D, r = 2), "F") / norm(data$L, "F"),
"PCP_L_error" = norm(data$L - pcp_model$L, "F") / norm(data$L, "F"),
"PCP_S_error" = norm(data$S - pcp_model$S, "F") / norm(data$S, "F")
)
Rank-based robust matrix completion (non-convex PCP)
Description
rrmc()
implements the non-convex PCP algorithm "Rank-based robust matrix
completion" as described in
Cherapanamjeri et al. (2017)
(see Algorithm 3), outfitted with environmental health (EH)-specific
extensions as described in Gibson et al. (2022).
Given an observed data matrix D
, maximum rank to search up to r
, and
regularization parameter eta
, rrmc()
seeks to find the best low-rank
and sparse estimates L
and S
using an incremental rank-based strategy.
The L
matrix encodes latent patterns that govern the observed data.
The S
matrix captures any extreme events in the data unexplained by the
underlying patterns in L
.
rrmc()
's incremental rank-based strategy first estimates a rank-1
model
(L^{(1)}, S^{(1)})
, before using the rank-1
model as the initialization
point to then construct a rank-2
model (L^{(2)}, S^{(2)})
, and so on,
until the desired rank-r
model (L^{(r)}, S^{(r)})
is recovered. All
models from ranks 1
through r
are returned by rrmc()
in this way.
Experimentally, the rrmc()
approach to PCP has best been able to handle
those datasets that are governed by complex underlying patterns characterized
by slowly decaying singular values, such as EH data. For observed data with a
well-defined low rank structure (rapidly decaying singular values),
root_pcp()
may offer a better model estimate.
Two EH-specific extensions are currently supported by rrmc()
:
The model can handle missing values in the input data matrix
D
; andThe model can also handle measurements that fall below the limit of detection (LOD), if provided
LOD
information by the user.
Support for a non-negativity constraint on rrmc()
's output will be added in
a future release of pcpr
.
Usage
rrmc(D, r, eta = NULL, LOD = -Inf)
Arguments
D |
The input data matrix (can contain |
r |
An integer |
eta |
(Optional) A double in the range |
LOD |
(Optional) The limit of detection (LOD) data. Entries in
By default, |
Value
A list containing:
-
L
: The rank-r
low-rank matrix encoding ther
-many latent patterns governing the observed input data matrixD
.dim(L)
will be the same asdim(D)
. To explicitly obtain the underlying patterns,L
can be used as the input to any matrix factorization technique of choice, e.g. PCA, factor analysis, or non-negative matrix factorization. -
S
: The sparse matrix containing the rare outlying or extreme observations inD
that are not explained by the underlying patterns in the correspondingL
matrix.dim(S)
will be the same asdim(D)
. Most entries inS
are0
, while non-zero entries identify the extreme outlying observations inD
. -
L_list
: A list of ther
-manyL
matrices recovered over the course ofrrmc()
's iterative optimization procedure. The first element inL_list
corresponds to the rank-1
L
matrix, the second to the rank-2
L
matrix, and so on. -
S_list
: A list of ther
-many correspondingS
matrices recovered over the course ofrrmc()
's iterative optimization procedure. The first element inS_list
corresponds to the rank-1
solution'sS
matrix, the second to the rank-2
solution'sS
matrix, and so on. -
objective
: A vector containing the values ofrrmc()
's objective function over the course of optimization.
The objective function
rrmc()
implicitly optimizes the following objective function:
\min_{L, S} I_{rank(L) \leq r} + \eta ||S||_0 + ||L + S - D||_F^2
The first term is the indicator function checking that the L
matrix is
strictly rank r
or less, implemented using a rank r
projection operator
proj_rank_r()
. The second term is the \ell_0
norm applied to the S
matrix to encourage sparsity, and is implemented with the help of an adaptive
hard-thresholding operator hard_threshold()
. The third term is the squared
Frobenius norm applied to the model's noise.
The eta
parameter
The eta
parameter scales the sparse penalty applied to rrmc()
's output
sparse S
matrix. Larger values of eta
penalize non-zero entries in S
more stringently, driving the recovery of sparser S
matrices.
Because there are no other parameters scaling the other terms in rrmc()
's
objective function, eta
can intuitively be thought of as the dial that
balances the model's sensitivity to extreme events (placed in S
) and
its sensitivity to noise Z
(captured by the last term in the objective,
which measures the discrepancy between the between the predicted model
and the observed data). Larger values of eta
will place a
greater emphasis on penalizing the non-zero entries in S
over penalizing
the errors between the predicted and observed data Z = L + S - D
.
Environmental health specific extensions
We refer interested readers to Gibson et al. (2022) for the complete details regarding the EH-specific extensions.
Missing value functionality: PCP assumes that the same data generating
mechanisms govern both the missing and the observed entries in D
. Because
PCP primarily seeks accurate estimation of patterns rather than
individual observations, this assumption is reasonable, but in some edge
cases may not always be justified. Missing values in D
are therefore
reconstructed in the recovered low-rank L
matrix according to the
underlying patterns in L
. There are three corollaries to keep in mind
regarding the quality of recovered missing observations:
Recovery of missing entries in
D
relies on accurate estimation ofL
;The fewer observations there are in
D
, the harder it is to accurately reconstructL
(therefore estimation of both unobserved and observed measurements inL
degrades); andGreater proportions of missingness in
D
artifically drive up the sparsity of the estimatedS
matrix. This is because it is not possible to recover a sparse event inS
when the corresponding entry inD
is unobserved. By definition, sparse events inS
cannot be explained by the consistent patterns inL
. Practically, if 20% of the entries inD
are missing, then at least 20% of the entries inS
will be 0.
Handling measurements below the limit of detection: When equipped with
LOD information, PCP treats any estimations of values known to be below the
LOD as equally valid if their approximations fall between 0 and the LOD. Over
the course of optimization, observations below the LOD are pushed into this
known range [0, LOD]
using penalties from above and below: should a
< LOD
estimate be < 0
, it is stringently penalized, since
measured observations cannot be negative. On the other hand, if a < LOD
estimate is >
the LOD, it is also heavily penalized: less so than when
< 0
, but more so than observations known to be above the LOD, because
we have prior information that these observations must be below LOD.
Observations known to be above the LOD are penalized as usual, using the
Frobenius norm in the above objective function.
Gibson et al. (2022) demonstrates that
in experimental settings with up to 50% of the data corrupted below the LOD,
PCP with the LOD extension boasts superior accuracy of recovered L
models
compared to PCA coupled with LOD / \sqrt{2}
imputation. PCP even
outperforms PCA in low-noise scenarios with as much as 75% of the data
corrupted below the LOD. The few situations in which PCA bettered PCP were
those pathological cases in which D
was characterized by extreme noise and
huge proportions (i.e., 75%) of observations falling below the LOD.
References
Cherapanamjeri, Yeshwanth, Kartik Gupta, and Prateek Jain. "Nearly optimal robust matrix completion." International Conference on Machine Learning. PMLR, 2017. [available here]
Gibson, Elizabeth A., Junhui Zhang, Jingkai Yan, Lawrence Chillrud, Jaime Benavides, Yanelli Nunez, Julie B. Herbstman, Jeff Goldsmith, John Wright, and Marianthi-Anna Kioumourtzoglou. "Principal component pursuit for pattern identification in environmental mixtures." Environmental Health Perspectives 130, no. 11 (2022): 117008.
See Also
Examples
#### -------Simple simulated PCP problem-------####
# First we will simulate a simple dataset with the sim_data() function.
# The dataset will be a 100x10 matrix comprised of:
# 1. A rank-3 component as the ground truth L matrix;
# 2. A ground truth sparse component S w/outliers along the diagonal; and
# 3. A dense Gaussian noise component
data <- sim_data()
# Best practice is to conduct a grid search with grid_search_cv() function,
# but we skip that here for brevity.
pcp_model <- rrmc(data$D, r = 3, eta = 0.35)
data.frame(
"Observed_relative_error" = norm(data$L - data$D, "F") / norm(data$L, "F"),
"PCA_error" = norm(data$L - proj_rank_r(data$D, r = 3), "F") / norm(data$L, "F"),
"PCP_L_error" = norm(data$L - pcp_model$L, "F") / norm(data$L, "F"),
"PCP_S_error" = norm(data$S - pcp_model$S, "F") / norm(data$S, "F")
)
Simulate simple mixtures data
Description
sim_data()
generates a simulated dataset D = L + S + Z
for
experimentation with Principal Component Pursuit (PCP) algorithms.
Usage
sim_data(
n = 100,
p = 10,
r = 3,
sparse_nonzero_idxs = NULL,
sigma = 0.05,
seed = 42
)
Arguments
n , p |
(Optional) A pair of integers specifying the simulated dataset's
number of |
r |
(Optional) An integer specifying the rank of the simulated dataset's
low-rank component. Intuitively, the number of latent patterns governing
the simulated dataset. Must be that |
sparse_nonzero_idxs |
(Optional) An integer vector with
|
sigma |
(Optional) A double specifying the standard deviation of the
dense (Gaussian) noise component |
seed |
(Optional) An integer specifying the seed for random number
generation. By default, |
Details
The data is simulated as follows:
L <- matrix(runif(n * r), n, r) %*% matrix(runif(r * p), r, p)
S <- matrix(0, n, p)
S[sparse_nonzero_idxs] <- 1
Z <- matrix(rnorm(n * p, sd = sigma), n, p)
D <- L + S + Z
Value
A list containing:
-
D
: The observed data matrix, whereD = L + S + Z
. -
L
: The ground truth rank-r
low-rank matrix. -
S
: The ground truth sparse matrix. -
S
: The ground truth dense (Gaussian) noise matrix.
See Also
sim_na()
, sim_lod()
, impute_matrix()
Examples
# rank 3 example
data <- sim_data()
matrix_rank(data$D)
matrix_rank(data$L)
# rank 7 example
data <- sim_data(n = 1000, p = 25, r = 7)
matrix_rank(data$D)
matrix_rank(data$L)
Simulate limit of detection data
Description
sim_lod()
simulates putting the columns of a given matrix D
under a limit
of detection (LOD) by calculating the given quantile q
of each column and
corrupting all values < the quantile to NA
, returning the newly corrupted
matrix, the binary corruption mask, and a vector of column LODs.
Usage
sim_lod(D, q)
Arguments
D |
The input data matrix. |
q |
A double in the range |
Value
A list containing:
-
D_tilde
: The original matrixD
corrupted with < LODNA
values. -
tilde_mask
: A binary matrix ofdim(D)
specifying the locations of corrupted entries (1
) and uncorrupted entries (0
). -
lod
: A vector withlength(lod) == ncol(D)
providing the simulated LOD values corresponding to each column in theD_tilde
.
See Also
sim_na()
, impute_matrix()
, sim_data()
Examples
D <- sim_data(5, 5, sigma = 0.8)$D
D
sim_lod(D, q = 0.2)
Simulate random missingness in a given matrix
Description
sim_na()
corrupts a given data matrix D
such that a random perc
percent of its entries are set to be missing (set to NA
). Used by
grid_search_cv()
in constructing test matrices for PCP models. Can be
used for experimentation with PCP models.
Note: only observed values can be corrupted as NA
. This means if a matrix
D
already has e.g. 20% of its values missing, then
sim_na(D, perc = 0.2)
would result in a matrix with 40% of
its values as missing.
Should e.g. perc = 0.6
be passed as input when D
only has e.g. 10% of its
entries left as observed, then all remaining corruptable entries will be
set to NA
.
Usage
sim_na(D, perc, seed = 42)
Arguments
D |
The input data matrix. |
perc |
A double in the range |
seed |
(Optional) An integer specifying the seed for the random
selection of entries in |
Value
A list containing:
-
D_tilde
: The original matrixD
with a randomperc
percent of its entries set toNA
. -
tilde_mask
: A binary matrix ofdim(D)
specifying the locations of corrupted entries (1
) and uncorrupted entries (0
).
See Also
grid_search_cv()
, sim_lod()
, impute_matrix()
, sim_data()
Examples
# Simple example corrupting 20% of a 5x5 matrix
D <- matrix(1:25, 5, 5)
corrupted_data <- sim_na(D, perc = 0.2)
corrupted_data$D_tilde
sum(is.na(corrupted_data$D_tilde)) / prod(dim(corrupted_data$D_tilde))
# Now corrupting another 20% ontop of the original 20%
double_corrupted <- sim_na(corrupted_data$D_tilde, perc = 0.2)
double_corrupted$D_tilde
sum(is.na(double_corrupted$D_tilde)) / prod(dim(double_corrupted$D_tilde))
# Corrupting the remaining entries by passing in a large value for perc
all_corrupted <- sim_na(double_corrupted$D_tilde, perc = 1)
all_corrupted$D_tilde
Compute singular values of given matrix
Description
sing()
calculates the singular values of a given data matrix D
. This is
done with a call to svd()
, and is included in pcpr
to enable the quick
characterization of a data matrix's raw low-rank structure, to help decide
whether rrmc()
or root_pcp()
is the more appropriate PCP algorithm to
employ in conjunction with D
.
Experimentally, the rrmc()
approach to PCP has best been able to handle
those datasets that are governed by complex underlying patterns characterized
by slowly decaying singular values, such as EH data. For observed data with a
well-defined low rank structure (rapidly decaying singular values),
root_pcp()
may offer a better model estimate.
Usage
sing(D)
Arguments
D |
The input data matrix (cannot have |
Value
A numeric vector containing the singular values of D
.
References
"Singular value decomposition" Wikipedia article.
See Also
Examples
data <- sim_data()
sing(data$D)
Estimate sparsity of given matrix
Description
sparsity()
estimates the percentage of entries in a given data matrix D
whose values are "practically zero". If the absolute value of an entry is
below a given threshold parameter thresh
, then that value is determined
to be "practically zero", increasing the estimated sparsity of D
. Note
that NA
values are imputed as 0 before the sparsity calculation is made.
Usage
sparsity(D, thresh = 1e-04)
Arguments
D |
The input data matrix. |
thresh |
(Optional) A numeric threshold |
Value
The sparsity of D
, measured as the percentage of entries in D
that are "practically zero".
See Also
Examples
sparsity(matrix(rep(c(1, 0), 8), 4, 4))
sparsity(matrix(0:8, 3, 3))
sparsity(matrix(0, 3, 3))