Type: | Package |
Title: | Measure Dependence Between Categorical and Continuous Variables |
Version: | 0.1.0 |
Date: | 2023-11-19 |
Description: | Semi-distance and mean-variance (MV) index are proposed to measure the dependence between a categorical random variable and a continuous variable. Test of independence and feature screening for classification problems can be implemented via the two dependence measures. For the details of the methods, see Zhong et al. (2023) <doi:10.1080/01621459.2023.2284988>; Cui and Zhong (2019) <doi:10.1016/j.csda.2019.05.004>; Cui, Li and Zhong (2015) <doi:10.1080/01621459.2014.920256>. |
License: | MIT + file LICENSE |
Encoding: | UTF-8 |
Imports: | energy, FNN, furrr, purrr, Rcpp, stats |
RoxygenNote: | 7.2.3 |
LinkingTo: | Rcpp, RcppArmadillo |
Suggests: | testthat (≥ 3.0.0) |
Config/testthat/edition: | 3 |
URL: | https://github.com/wzhong41/semidist |
BugReports: | https://github.com/wzhong41/semidist/issues |
NeedsCompilation: | yes |
Packaged: | 2023-11-20 19:43:13 UTC; Chain |
Author: | Wei Zhong [aut], Zhuoxi Li [aut, cre, cph], Wenwen Guo [aut], Hengjian Cui [aut], Runze Li [aut] |
Maintainer: | Zhuoxi Li <chainchei@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2023-11-21 06:50:02 UTC |
semidist: Measure Dependence Between Categorical and Continuous Variables
Description
Semi-distance and mean-variance (MV) index are proposed to measure the dependence between a categorical random variable and a continuous variable. Test of independence and feature screening for classification problems can be implemented via the two dependence measures. For the details of the methods, see Zhong et al. (2023) doi:10.1080/01621459.2023.2284988; Cui and Zhong (2019) doi:10.1016/j.csda.2019.05.004; Cui, Li and Zhong (2015) doi:10.1080/01621459.2014.920256.
Author(s)
Maintainer: Zhuoxi Li chainchei@gmail.com [copyright holder]
Authors:
Wei Zhong wzhong@xmu.edu.cn
Wenwen Guo guowenwen114@163.com
Hengjian Cui hjcui@bnu.edu.cn
Runze Li ril4@psu.edu
See Also
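Useful links:
- https://github.com/wzhong41/semidist
- Report bugs at https://github.com/wzhong41/semidist/issues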
Mutual information independence test (categorical-continuous case)
Description
Implement the mutual information independence test (MINT) (Berrett and Samworth, 2019), with a modification in how the mutual information (MI) between a categorical random variable and a continuous variable is estimated. The modification is based on the idea of Ross (2014).
MINTsemiperm() implements the permutation independence test via mutual information, but the parameter k should be pre-specified.
MINTsemiauto() automatically selects an appropriate k based on a data-driven procedure, and conducts MINTsemiperm() with the k chosen.
Usage
MINTsemiperm(X, y, k, B = 1000)
MINTsemiauto(X, y, kmax, B1 = 1000, B2 = 1000)
Arguments
X | Data of multivariate continuous variables, which should be an n by p matrix. |
y | Data of categorical variables, which should be a factor of length n. |
k | Number of nearest neighbors. See References for details. |
B, B1, B2 | Number of permutations to use. Defaults to 1000. |
kmax | Maximum k in the automatic search for the optimal k. |
Value
A list with class "indtest" containing the following components:
- method: name of the test;
- name_data: names of the X and y;
- n: sample size of the data;
- num_perm: number of replications in permutation test;
- stat: test statistic;
- pvalue: computed p-value.
For MINTsemiauto(), the list also contains:
- kmax: maximum k in the automatic search for optimal k;
- kopt: optimal k chosen.
References
Berrett, Thomas B., and Richard J. Samworth. "Nonparametric independence testing via mutual information." Biometrika 106, no. 3 (2019): 547-566.
Ross, Brian C. "Mutual information between discrete and continuous data sets." PloS one 9, no. 2 (2014): e87357.
Examples
X <- mtcars[, c("mpg", "disp", "drat", "wt")]
y <- factor(mtcars[, "am"])
MINTsemiperm(X, y, 5)
MINTsemiauto(X, y, kmax = 32)
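The nearest-neighbor idea of Ross (2014) that the Description refers to can be sketched in base R as follows. This is only an illustration under stated assumptions: `knn_mi_sketch` is a hypothetical name, not a function of this package, and the estimator used inside MINTsemiperm() may differ in its details (tie handling, distance metric, neighbor search).

```r
# Hypothetical helper (not part of semidist): a Ross (2014)-style k-nearest-
# neighbour estimate of the mutual information between continuous X and
# categorical y. Assumes every class has more than k observations.
knn_mi_sketch <- function(X, y, k) {
  X <- as.matrix(X)
  y <- droplevels(as.factor(y))
  n <- nrow(X)
  g <- as.integer(y)                  # class code of each observation
  n_y <- tabulate(g)[g]               # class size of each observation
  stopifnot(length(g) == n, all(tabulate(g) > k))
  D <- as.matrix(dist(X))             # pairwise Euclidean distances
  m <- vapply(seq_len(n), function(i) {
    same <- setdiff(which(g == g[i]), i)
    d_k <- sort(D[i, same])[k]        # distance to the k-th same-class neighbour
    sum(D[i, -i] <= d_k)              # neighbours of any class within d_k
  }, numeric(1))
  digamma(n) - mean(digamma(n_y)) + digamma(k) - mean(digamma(m))
}

# Two well-separated classes: every neighbourhood is pure, so m equals k and
# the estimate reduces to digamma(n) - digamma(n / 2).
set.seed(1)
x_dep <- c(rnorm(50), rnorm(50, mean = 100))
y_dep <- factor(rep(c("a", "b"), each = 50))
mi_dep <- knn_mi_sketch(x_dep, y_dep, k = 3)
print(mi_dep)
```

A permutation test would recompute such an estimate on shuffled copies of y and compare the observed value against that reference distribution.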
Mean Variance (MV) statistics
Description
Compute the statistics of mean variance (MV) index, which can measure the dependence between a univariate continuous variable and a categorical variable. See Cui, Li and Zhong (2015); Cui and Zhong (2019) for details.
Usage
mv(x, y, return_mat = FALSE)
Arguments
x | Data of univariate continuous variables, which should be a vector of length n. |
y | Data of categorical variables, which should be a factor of length n. |
return_mat | A boolean. If TRUE, the matrices of the distances of the indicator for x <= x_i are also returned. |
Value
The value of the corresponding sample statistic.
If the argument return_mat of mv() is set as TRUE, a list with the following elements will be returned:
- mv: the MV index statistic;
- mat_x: the matrices of the distances of the indicator for x <= x_i.
See Also
- mv_test() for implementing independence test via MV index;
- mv_sis() for implementing feature screening via MV index.
Examples
x <- mtcars[, "mpg"]
y <- factor(mtcars[, "am"])
print(mv(x, y))
# Man-made independent data -------------------------------------------------
n <- 30; R <- 5; prob <- rep(1/R, R)
x <- rnorm(n)
y <- factor(sample(1:R, size = n, replace = TRUE, prob = prob), levels = 1:R)
print(mv(x, y))
# Man-made functionally dependent data --------------------------------------
n <- 30; R <- 3
x <- rep(0, n)
x[1:10] <- 0.3; x[11:20] <- 0.2; x[21:30] <- -0.1
y <- factor(rep(1:3, each = 10))
print(mv(x, y))
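For readers who want to see what an MV-type statistic looks like, here is a minimal base-R sketch. It assumes the sample form (1/n) * sum over classes r and observations i of p_hat_r * (F_hat_r(x_i) - F_hat(x_i))^2, which is how the MV index is commonly written following Cui, Li and Zhong (2015). The name `mv_sketch` is hypothetical, and mv() itself may be implemented differently.

```r
# Hypothetical helper (not part of semidist): sample MV index under the
# assumed form  (1/n) * sum_r sum_i p_hat_r * (F_hat_r(x_i) - F_hat(x_i))^2.
mv_sketch <- function(x, y) {
  y <- droplevels(as.factor(y))
  n <- length(x)
  stopifnot(length(y) == n)
  F_all <- ecdf(x)(x)                        # pooled empirical CDF at each x_i
  p_hat <- as.numeric(table(y)) / n          # class proportions
  total <- 0
  for (r in seq_along(levels(y))) {
    F_r <- ecdf(x[y == levels(y)[r]])(x)     # class-r empirical CDF at each x_i
    total <- total + p_hat[r] * mean((F_r - F_all)^2)
  }
  total
}

# Functionally dependent data from the Examples above
x <- rep(0, 30)
x[1:10] <- 0.3; x[11:20] <- 0.2; x[21:30] <- -0.1
y <- factor(rep(1:3, each = 10))
mv_dep <- mv_sketch(x, y)
print(mv_dep)   # 4/27 under the assumed formula
```

The statistic is zero when every class-conditional empirical CDF coincides with the pooled one, and grows as the classes separate.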
Feature screening via MV Index
Description
Implement the feature screening for the classification problem via MV index.
Usage
mv_sis(X, y, d = NULL, parallel = FALSE)
Arguments
X | Data of multivariate covariates, which should be an n by p matrix. |
y | Data of categorical response, which should be a factor of length n. |
d | An integer specifying how many features should be kept after screening. Defaults to NULL. |
parallel | A boolean indicating whether to calculate in parallel. Defaults to FALSE. |
Value
A list of the objects about the implemented feature screening:
- measurement: sample MV index calculated for each single covariate;
- selected: indices or names (if available as colnames of X) of covariates that are selected after feature screening;
- ordering: order of the calculated measurements of each single covariate. The first one is the largest, and the last is the smallest.
Examples
X <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
y <- factor(mtcars[, "am"])
mv_sis(X, y, d = 4)
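The screening logic described above (score each covariate marginally, rank, keep the top d) can be sketched generically. In the sketch below `screen_sketch` is a hypothetical helper, and the score is a plainly swapped-in stand-in, the R-squared of a one-way fit, not the MV index that mv_sis() uses.

```r
# Hypothetical helper (not part of semidist): marginal screening with an
# arbitrary dependence measure `measure(x, y)`; larger means more dependent.
screen_sketch <- function(X, y, measure, d) {
  X <- as.data.frame(X)
  stopifnot(d >= 1, d <= ncol(X))
  scores <- vapply(X, function(col) measure(col, y), numeric(1))
  ordering <- order(scores, decreasing = TRUE)   # first index has the largest score
  list(measurement = scores,
       selected = names(X)[ordering[seq_len(d)]],
       ordering = ordering)
}

# Stand-in measure for the demo only: share of variance explained by the classes.
r_squared <- function(x, y) summary(lm(x ~ y))$r.squared

X <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
y <- factor(mtcars[, "am"])
res <- screen_sketch(X, y, r_squared, d = 4)
print(res$selected)
```

Any measure with the same signature, including mv() from this package, could be passed in place of the stand-in.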
MV independence test
Description
Implement the MV independence test via a permutation test or via the asymptotic approximation.
Usage
mv_test(x, y, test_type = "perm", num_perm = 10000)
Arguments
x | Data of univariate continuous variables, which should be a vector of length n. |
y | Data of categorical variables, which should be a factor of length n. |
test_type | Type of the test: "perm" (the default) for the permutation test, or "asym" for the asymptotic approximation. See the Reference for details. |
num_perm | The number of replications in permutation test. |
Value
A list with class "indtest" containing the following components:
- method: name of the test;
- name_data: names of the x and y;
- n: sample size of the data;
- num_perm: number of replications in permutation test;
- stat: test statistic;
- pvalue: computed p-value. (Notice: the asymptotic test cannot return a p-value, only the critical values crit_vals for the 90%, 95% and 99% confidence levels.)
Examples
x <- mtcars[, "mpg"]
y <- factor(mtcars[, "am"])
test <- mv_test(x, y)
print(test)
test_asym <- mv_test(x, y, test_type = "asym")
print(test_asym)
# Man-made independent data -------------------------------------------------
n <- 30; R <- 5; prob <- rep(1/R, R)
x <- rnorm(n)
y <- factor(sample(1:R, size = n, replace = TRUE, prob = prob), levels = 1:R)
test <- mv_test(x, y)
print(test)
test_asym <- mv_test(x, y, test_type = "asym")
print(test_asym)
# Man-made functionally dependent data --------------------------------------
n <- 30; R <- 3
x <- rep(0, n)
x[1:10] <- 0.3; x[11:20] <- 0.2; x[21:30] <- -0.1
y <- factor(rep(1:3, each = 10))
test <- mv_test(x, y)
print(test)
test_asym <- mv_test(x, y, test_type = "asym")
print(test_asym)
Print Method for Independence Tests Between Categorical and Continuous Variables
Description
Print an object of class "indtest" with a simple print method.
Usage
## S3 method for class 'indtest'
print(x, digits = getOption("digits"), ...)
Arguments
x | An object of class "indtest". |
digits | Minimal number of significant digits. |
... | Further arguments passed to or from other methods. |
Value
None
Examples
# Man-made functionally dependent data --------------------------------------
n <- 30; R <- 3
x <- rep(0, n)
x[1:10] <- 0.3; x[11:20] <- 0.2; x[21:30] <- -0.1
y <- factor(rep(1:3, each = 10))
test <- mv_test(x, y)
print(test)
test_asym <- mv_test(x, y, test_type = "asym")
print(test_asym)
Feature screening via semi-distance correlation
Description
Implement the (grouped) feature screening for the classification problem via semi-distance correlation.
Usage
sd_sis(X, y, group_info = NULL, d = NULL, parallel = FALSE)
Arguments
X | Data of multivariate covariates, which should be an n by p matrix. |
y | Data of categorical response, which should be a factor of length n. |
group_info | A list specifying the group information, with elements being sets of indices of covariates in the same group. Defaults to NULL. The names of the list can help recognize the group. |
d | An integer specifying at least how many (single) features should be kept after screening. Defaults to NULL. |
parallel | A boolean indicating whether to calculate in parallel. Defaults to FALSE. |
Value
A list of the objects about the implemented feature screening:
- group_info: group information;
- measurement: sample semi-distance correlations calculated for the groups specified in group_info;
- selected: indices/names of (single) covariates that are selected after feature screening;
- ordering: order of the calculated measurements of the groups specified in group_info. The first one is the largest, and the last is the smallest.
See Also
sdcor() for calculating the sample semi-distance correlation.
Examples
X <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
y <- factor(mtcars[, "am"])
sd_sis(X, y, d = 4)
# Suppose we have prior information for the group structure as
# ("mpg", "drat"), ("disp", "hp") and ("wt", "qsec")
group_info <- list(
mpg_drat = c("mpg", "drat"),
disp_hp = c("disp", "hp"),
wt_qsec = c("wt", "qsec")
)
sd_sis(X, y, group_info, d = 4)
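Because d is described as the minimum number of single features to keep, grouped screening has to add whole groups until that count is reached. The sketch below shows one way this selection step could work. `select_groups_sketch` is a hypothetical helper, the scores are made-up illustrative numbers rather than real semi-distance correlations, and sd_sis() may resolve the cut-off differently.

```r
# Hypothetical helper (not part of semidist): keep the best-scoring groups
# until at least d single features are retained.
select_groups_sketch <- function(scores, group_info, d) {
  stopifnot(length(scores) == length(group_info))
  ordering <- order(scores, decreasing = TRUE)
  sizes <- lengths(group_info)[ordering]
  n_groups <- which(cumsum(sizes) >= d)[1]     # first point with >= d features
  if (is.na(n_groups)) n_groups <- length(ordering)
  list(selected = unlist(group_info[ordering[seq_len(n_groups)]], use.names = FALSE),
       ordering = ordering)
}

group_info <- list(
  mpg_drat = c("mpg", "drat"),
  disp_hp = c("disp", "hp"),
  wt_qsec = c("wt", "qsec")
)
# Made-up group scores, for illustration only
toy_scores <- c(mpg_drat = 0.5, disp_hp = 0.2, wt_qsec = 0.7)
sel <- select_groups_sketch(toy_scores, group_info, d = 3)
print(sel$selected)   # "wt" "qsec" "mpg" "drat"
```

With d = 3 and groups of size two, the two best groups are kept, so four features are returned, which is why d reads as a lower bound rather than an exact count.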
Semi-distance independence test
Description
Implement the semi-distance independence test via permutation test, or via the asymptotic approximation when the dimensionality of continuous variables p is high.
Usage
sd_test(X, y, test_type = "perm", num_perm = 10000)
Arguments
X | Data of multivariate continuous variables, which should be an n by p matrix. |
y | Data of categorical variables, which should be a factor of length n. |
test_type | Type of the test: "perm" (the default) for the permutation test, or "asym" for the asymptotic approximation. See the Reference for details. |
num_perm | The number of replications in permutation test. Defaults to 10000. See Details and Reference. |
Details
The semi-distance independence test statistic is
$$T_n = n \cdot \widetilde{\text{SDcov}}_n(X, y),$$
where $\widetilde{\text{SDcov}}_n(X, y)$ can be computed by sdcov(X, y, type = "U").
For the permutation test (test_type = "perm"), a total of $K$ permutation replications are conducted, and the argument num_perm specifies $K$. The p-value of the permutation test is computed by
$$\text{p-value} = \left(\sum_{k=1}^K I(T^{\ast (k)}_{n} \ge T_{n}) + 1\right) / (K + 1),$$
where $T_{n}$ is the semi-distance test statistic and $T^{\ast (k)}_{n}$ is the test statistic computed on the $k$-th permutation sample.
When the dimension of the continuous variables is high, the asymptotic approximation approach can be applied (test_type = "asym"), which is computationally faster since no permutation is needed.
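The permutation p-value formula above can be sketched for an arbitrary statistic as follows. `perm_pvalue_sketch` and `mean_gap` are hypothetical names; the demo deliberately uses a simple stand-in statistic (absolute difference in group means) rather than the semi-distance statistic, so it runs in base R alone.

```r
# Hypothetical helper (not part of semidist): permutation p-value
#   (sum_k I(T*_k >= T) + 1) / (K + 1)
# for any statistic stat_fun(X, y) where larger values mean more dependence.
perm_pvalue_sketch <- function(X, y, stat_fun, K = 199) {
  t_obs <- stat_fun(X, y)
  # Shuffling y breaks any dependence while keeping both marginals fixed.
  t_perm <- replicate(K, stat_fun(X, sample(y)))
  (sum(t_perm >= t_obs) + 1) / (K + 1)
}

# Stand-in statistic for the demo only: absolute difference in group means.
mean_gap <- function(x, y) abs(diff(as.vector(tapply(x, y, mean))))

set.seed(1)
x <- mtcars[, "mpg"]
y <- factor(mtcars[, "am"])
p_val <- perm_pvalue_sketch(x, y, mean_gap, K = 199)
# A constant statistic can never reject: every permuted value ties the observed one.
p_const <- perm_pvalue_sketch(x, y, function(X, y) 0, K = 199)
print(c(p_val = p_val, p_const = p_const))
```

The added 1 in numerator and denominator keeps the p-value strictly positive, so the smallest attainable value is 1 / (K + 1).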
Value
A list with class "indtest" containing the following components:
- method: name of the test;
- name_data: names of the X and y;
- n: sample size of the data;
- test_type: type of the test;
- num_perm: number of replications in permutation test, if test_type = "perm";
- stat: test statistic;
- pvalue: computed p-value.
See Also
sdcov() for computing the statistic of semi-distance covariance.
Examples
X <- mtcars[, c("mpg", "disp", "drat", "wt")]
y <- factor(mtcars[, "am"])
test <- sd_test(X, y)
print(test)
# Man-made independent data -------------------------------------------------
n <- 30; R <- 5; p <- 3; prob <- rep(1/R, R)
X <- matrix(rnorm(n*p), n, p)
y <- factor(sample(1:R, size = n, replace = TRUE, prob = prob), levels = 1:R)
test <- sd_test(X, y)
print(test)
# Man-made functionally dependent data --------------------------------------
n <- 30; R <- 3; p <- 3
X <- matrix(0, n, p)
X[1:10, 1] <- 1; X[11:20, 2] <- 1; X[21:30, 3] <- 1
y <- factor(rep(1:3, each = 10))
test <- sd_test(X, y)
print(test)
# Man-made high-dimensionally independent data ------------------------------
n <- 30; R <- 3; p <- 100
X <- matrix(rnorm(n*p), n, p)
y <- factor(rep(1:3, each = 10))
test <- sd_test(X, y)
print(test)
test <- sd_test(X, y, test_type = "asym")
print(test)
# Man-made high-dimensionally dependent data --------------------------------
n <- 30; R <- 3; p <- 100
X <- matrix(0, n, p)
X[1:10, 1] <- 1; X[11:20, 2] <- 1; X[21:30, 3] <- 1
y <- factor(rep(1:3, each = 10))
test <- sd_test(X, y)
print(test)
test <- sd_test(X, y, test_type = "asym")
print(test)
Semi-distance covariance and correlation statistics
Description
Compute the statistics (or sample estimates) of semi-distance covariance and correlation. The semi-distance correlation is a standardized version of semi-distance covariance, and it can measure the dependence between a multivariate continuous variable and a categorical variable. See Details for the definition of semi-distance covariance and semi-distance correlation.
Usage
sdcov(X, y, type = "V", return_mat = FALSE)
sdcor(X, y)
Arguments
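X | Data of multivariate continuous variables, which should be an n by p matrix. |
y | Data of categorical variables, which should be a factor of length n. |
type | Type of statistic: "V" (the default) for the V-statistic, or "U" for the estimated U-statistic. See Details. |
return_mat | A boolean. If TRUE, the matrices of the distances of X and the divergences of y are also returned. |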
Details
For $\bm{X} \in \mathbb{R}^{p}$ and $Y \in \{1, 2, \cdots, R\}$, the (population-level) semi-distance covariance is defined as
$$\mathrm{SDcov}(\bm{X}, Y) = \mathrm{E}\left[\|\bm{X}-\widetilde{\bm{X}}\|\left(1-\sum_{r=1}^R I(Y=r,\widetilde{Y}=r)/p_r\right)\right],$$
where $p_r = P(Y = r)$ and $(\widetilde{\bm{X}}, \widetilde{Y})$ is an iid copy of $(\bm{X}, Y)$.
The (population-level) semi-distance correlation is defined as
$$\mathrm{SDcor}(\bm{X}, Y) = \dfrac{\mathrm{SDcov}(\bm{X}, Y)}{\mathrm{dvar}(\bm{X})\sqrt{R-1}},$$
where $\mathrm{dvar}(\bm{X})$ is the distance variance (Szekely, Rizzo, and Bakirov 2007) of $\bm{X}$.
With $n$ observations $\{(\bm{X}_i, Y_i)\}_{i=1}^{n}$, sdcov() and sdcor() can compute the sample estimates for the semi-distance covariance and correlation.
If type = "V", the semi-distance covariance statistic is computed as a V-statistic, which takes a very similar form to the energy-based statistic with double centering, and is always non-negative. Specifically,
$$\text{SDcov}_n(\bm{X}, y) = \frac{1}{n^2} \sum_{k=1}^{n} \sum_{l=1}^{n} A_{kl} B_{kl},$$
where
$$A_{kl} = a_{kl} - \bar{a}_{k.} - \bar{a}_{.l} + \bar{a}_{..}$$
is the double centering (Szekely, Rizzo, and Bakirov 2007) of $a_{kl} = \| \bm{X}_k - \bm{X}_l \|$, and
$$B_{kl} = 1 - \sum_{r=1}^{R} I(Y_k = r) I(Y_l = r) / \hat{p}_r$$
with $\hat{p}_r = n_r / n = n^{-1}\sum_{i=1}^{n} I(Y_i = r)$.
The semi-distance correlation statistic is
$$\text{SDcor}_n(\bm{X}, y) = \dfrac{\text{SDcov}_n(\bm{X}, y)}{\text{dvar}_n(\bm{X})\sqrt{R - 1}},$$
where $\text{dvar}_n(\bm{X})$ is the V-statistic of distance variance of $\bm{X}$.
If type = "U", then the semi-distance covariance statistic is computed as an "estimated U-statistic", which is utilized in the independence test statistic and is not necessarily non-negative. Specifically,
$$\widetilde{\text{SDcov}}_n(\bm{X}, y) = \frac{1}{n(n-1)} \sum_{i \ne j} \| \bm{X}_i - \bm{X}_j \| \left(1 - \sum_{r=1}^{R} I(Y_i = r) I(Y_j = r) / \tilde{p}_r\right),$$
where $\tilde{p}_r = (n_r-1) / (n-1) = (n-1)^{-1}\left(\sum_{i=1}^{n} I(Y_i = r) - 1\right)$. Note that the test statistic of the semi-distance independence test is
$$T_n = n \cdot \widetilde{\text{SDcov}}_n(\bm{X}, y).$$
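The sample formulas in these Details translate fairly directly into base R. The sketch below is an illustration, not the package's implementation: `sd_stats_sketch` is a hypothetical name, and it assumes that the sample distance variance enters on the square-root scale, sqrt(mean(A^2)), which is the scaling under which the correlation is bounded by 1.

```r
# Hypothetical helper (not part of semidist): V- and U-type semi-distance
# covariance and the V-type correlation. Assumes every class has >= 2 rows.
sd_stats_sketch <- function(X, y) {
  X <- as.matrix(X)
  g <- as.integer(droplevels(as.factor(y)))     # class code of each observation
  n <- length(g); R <- max(g)
  stopifnot(nrow(X) == n, all(tabulate(g) >= 2))
  a <- as.matrix(dist(X))                       # a_kl = ||X_k - X_l||
  A <- a - outer(rowMeans(a), colMeans(a), "+") + mean(a)   # double centering
  same <- outer(g, g, "==")                     # I(Y_k = Y_l)
  n_r <- tabulate(g)[g]                         # class size of each observation
  B_v <- 1 - same / (n_r / n)                   # uses p_hat_r = n_r / n
  B_u <- 1 - same / ((n_r - 1) / (n - 1))       # uses p_tilde_r
  sdcov_v <- sum(A * B_v) / n^2
  sdcov_u <- sum(a * B_u) / (n * (n - 1))       # diag(a) is 0, so i = j adds nothing
  dvar_v <- sqrt(mean(A^2))                     # assumption: square-root scale
  list(sdcov_v = sdcov_v, sdcov_u = sdcov_u,
       sdcor = sdcov_v / (dvar_v * sqrt(R - 1)))
}

# Functionally dependent data: each class sits on its own coordinate axis
X <- matrix(0, 30, 3)
X[1:10, 1] <- 1; X[11:20, 2] <- 1; X[21:30, 3] <- 1
y <- factor(rep(1:3, each = 10))
out <- sd_stats_sketch(X, y)
print(out)   # sdcov_v = 2*sqrt(2)/3, sdcov_u = 20*sqrt(2)/29, sdcor = 1
```

In this example the centered distance matrix is an exact multiple of the matrix B, which is why the correlation under the sketch reaches its upper bound of 1.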
Value
The value of the corresponding sample statistic.
If the argument return_mat of sdcov() is set as TRUE, a list with the following elements will be returned:
- sdcov: the semi-distance covariance statistic;
- mat_x, mat_y: the matrices of the distances of X and the divergences of y, respectively.
See Also
- sd_test() for implementing independence test via semi-distance covariance;
- sd_sis() for implementing groupwise feature screening via semi-distance correlation.
Examples
X <- mtcars[, c("mpg", "disp", "drat", "wt")]
y <- factor(mtcars[, "am"])
print(sdcov(X, y))
print(sdcor(X, y))
# Man-made independent data -------------------------------------------------
n <- 30; R <- 5; p <- 3; prob <- rep(1/R, R)
X <- matrix(rnorm(n*p), n, p)
y <- factor(sample(1:R, size = n, replace = TRUE, prob = prob), levels = 1:R)
print(sdcov(X, y))
print(sdcor(X, y))
# Man-made functionally dependent data --------------------------------------
n <- 30; R <- 3; p <- 3
X <- matrix(0, n, p)
X[1:10, 1] <- 1; X[11:20, 2] <- 1; X[21:30, 3] <- 1
y <- factor(rep(1:3, each = 10))
print(sdcov(X, y))
print(sdcor(X, y))
Switch the representation of a categorical object
Description
Categorical data with n observations and R levels can typically be represented in two forms in R: a factor of length n, or an n by R indicator matrix with elements being 0 or 1. This function switches a categorical object from one form to the other.
Usage
switch_cat_repr(obj)
Arguments
obj |
an object representing categorical data, either a factor or an indicator matrix with each row representing an observation. |
Value
The categorical object in the other form.
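The two representations can be converted into each other with a few lines of base R. The helpers below, `factor_to_indicator` and `indicator_to_factor`, are hypothetical names written for illustration; switch_cat_repr() may handle level names and input validation differently.

```r
# Hypothetical helpers (not part of semidist).
# Factor of length n  ->  n by R matrix of 0/1 indicators.
factor_to_indicator <- function(f) {
  f <- as.factor(f)
  M <- outer(as.integer(f), seq_len(nlevels(f)), "==") * 1L
  colnames(M) <- levels(f)
  M
}

# n by R indicator matrix (exactly one 1 per row)  ->  factor of length n.
indicator_to_factor <- function(M) {
  stopifnot(is.matrix(M), all(M %in% c(0, 1)), all(rowSums(M) == 1))
  lev <- colnames(M)
  if (is.null(lev)) lev <- as.character(seq_len(ncol(M)))
  factor(lev[max.col(M, ties.method = "first")], levels = lev)
}

f <- factor(c("a", "b", "a", "c"))
M <- factor_to_indicator(f)
f_back <- indicator_to_factor(M)
print(M)
print(f_back)
```

Each row of the indicator matrix sums to 1, so the round trip recovers the original factor and its levels.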
Estimate the trace of the covariance matrix and its square
Description
For a design matrix $\mathbf{X}$, estimate the trace of its covariance matrix $\Sigma = \mathrm{cov}(\mathbf{X})$ and the trace of the squared covariance matrix $\Sigma^2$.
Usage
tr_estimate(X)
Arguments
X |
The design matrix. |
Value
A list with elements:
- tr_S: estimate for trace of $\Sigma$;
- tr_S2: estimate for trace of $\Sigma^2$.
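As a point of comparison, the most naive way to obtain the two quantities is to plug in the sample covariance matrix. The sketch below does exactly that. `tr_plugin_sketch` is a hypothetical name, and this plug-in approach is an assumption for illustration: tr_estimate() may use a different, bias-corrected estimator, since the plug-in estimate of the trace of $\Sigma^2$ tends to be inflated when p is large relative to n.

```r
# Hypothetical helper (not part of semidist): naive plug-in estimates based
# on the sample covariance matrix S = cov(X).
tr_plugin_sketch <- function(X) {
  X <- as.matrix(X)
  S <- cov(X)
  list(tr_S = sum(diag(S)),    # trace of S = sum of the column variances
       tr_S2 = sum(S^2))       # trace of S %*% S, since S is symmetric
}

X <- as.matrix(mtcars[, c("mpg", "disp", "drat", "wt")])
est <- tr_plugin_sketch(X)
print(est)
```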