Type: | Package |
Title: | Multiple Imputation in Cluster Analysis |
Version: | 1.2.8 |
Description: | Implementation of a framework for cluster analysis with selection of the final number of clusters and an optional variable selection procedure. The package is designed to integrate the results of multiple imputed datasets while accounting for the uncertainty that the imputations introduce in the final results. In addition, the package can also be used for a cluster analysis of the complete cases of a single dataset. The package also includes specific methods to summarize and plot the results. The methods are described in Basagana et al. (2013) <doi:10.1093/aje/kws289>. |
Depends: | R (≥ 4.1) |
License: | GPL-3 |
Encoding: | UTF-8 |
LazyData: | true |
Suggests: | knitr, xtable, rmarkdown |
Imports: | doBy, combinat, flexclust, graphics, irr, matrixStats, stats, utils |
Author: | Jose Barrera-Gomez |
Maintainer: | Jose Barrera-Gomez <jose.barrera@isglobal.org> |
RoxygenNote: | 7.1.2 |
NeedsCompilation: | no |
Packaged: | 2022-02-07 17:00:17 UTC; jbarrera |
Repository: | CRAN |
Date/Publication: | 2022-02-07 17:20:02 UTC |
miclust-package: integrating multiple imputation with cluster analysis
Description
Cluster analysis with selection of the final number of clusters and an optional variable selection procedure. The package is designed to integrate the results of multiply imputed data sets while accounting for the uncertainty that the imputations introduce in the final results. See ‘Procedure’ below for further details on how the tool works.
Procedure
The tool consists of a two-step procedure. In the first step, the user provides
the data to be analyzed. They can be a single data.frame or a list of data.frames
including the raw data and the imputed data sets. In the latter case, getdata
needs to be used first to prepare the data. In the second step, the miclust
function performs k-means clustering with selection of the final number of
clusters and an optional (backward or forward) variable selection procedure.
Specific summary and plot methods are provided to summarize and visualize the
impact of the imputations on the results.
Authors
Jose Barrera-Gomez (maintainer, <jose.barrera@isglobal.org>) and Xavier Basagana.
References
The methodology used in the package is described in
Basagana X, Barrera-Gomez J, Benet M, Anto JM, Garcia-Aymerich J. A Framework for Multiple Imputation in Cluster Analysis. American Journal of Epidemiology. 2013;177(7):718-725.
Computes probabilities of (relabeled) cluster and kappas.
Description
assignprobandkappas returns a list with information on the probabilities of
cluster membership and Cohen's kappas.
Usage
assignprobandkappas(variables, k, metriccent, data, initialcluster)
Arguments
variables |
internally provided by |
k |
internally provided by |
metriccent |
internally provided by |
data |
internally provided by |
initialcluster |
internally provided by |
Value
internal value to be used by summary.miclust.
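The two quantities named above can be illustrated with a minimal base R sketch. It is not the package's internal code: the helper names cohen_kappa_sketch and assign_prob_sketch are hypothetical, and the sketch assumes that cluster labels have already been made comparable across imputations.

```r
# Hypothetical helpers, not part of miclust.

# Cohen's kappa between two cluster vectors of equal length.
cohen_kappa_sketch <- function(a, b) {
  stopifnot(length(a) == length(b))
  lev <- sort(unique(c(a, b)))
  tab <- table(factor(a, levels = lev), factor(b, levels = lev))
  n <- sum(tab)
  p_obs <- sum(diag(tab)) / n                       # observed agreement
  p_exp <- sum(rowSums(tab) * colSums(tab)) / n^2   # agreement expected by chance
  (p_obs - p_exp) / (1 - p_exp)
}

# Probability of cluster membership: for each individual (row), the proportion
# of imputations (columns) in which it was assigned to each of the k clusters.
assign_prob_sketch <- function(clusters, k) {
  sapply(seq_len(k), function(j) rowMeans(clusters == j))
}

# Toy data: 3 individuals, 3 imputations, labels already aligned.
cl <- cbind(c(1, 1, 2), c(1, 2, 2), c(1, 1, 2))
probs <- assign_prob_sketch(cl, k = 2)
probs

# Kappa of the first imputation against the second one.
cohen_kappa_sketch(cl[, 1], cl[, 2])
```

Each row of probs sums to 1. A kappa of 1 means two imputations give identical partitions.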
Center data.
Description
centerdata centers all variables at the mean.
Usage
centerdata(data)
Arguments
data |
internally provided by |
Value
internal value to be used by the standardizedata function.
Computes centroid.
Description
centroid computes the centroid for each cluster (mean or median).
Usage
centroid(data, cluster, centpos)
Arguments
data |
internally provided by |
cluster |
internally provided by |
centpos |
internally provided by |
Value
internal value to be used by the doclusterkmeans function.
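As an illustration only, a per-cluster centroid based on means or medians could be computed in base R as follows. The helper centroid_sketch is hypothetical and does not reproduce the internal function. The values "means" and "medians" are taken from the centpos argument of miclust.

```r
# Hypothetical helper, not part of miclust.
centroid_sketch <- function(data, cluster, centpos = c("means", "medians")) {
  centpos <- match.arg(centpos)
  stat <- if (centpos == "means") mean else median
  # Split rows by cluster and summarize each variable within each cluster.
  blocks <- split(as.data.frame(data), cluster)
  # Result: one row per cluster, one column per variable.
  do.call(rbind, lapply(blocks, function(block) sapply(block, stat)))
}

toy <- data.frame(x = c(1, 2, 9, 10, 100), y = c(0, 2, 4, 6, 8))
cl <- c(1, 1, 2, 2, 2)
cent_means <- centroid_sketch(toy, cl, "means")
cent_medians <- centroid_sketch(toy, cl, "medians")
cent_means
cent_medians
```

In the toy data the outlying value 100 pulls the mean-based centroid of cluster 2 far more than the median-based one.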
Performs K-means clustering with optional variable selection.
Description
doclusterkmeans performs K-means clustering with optional variable selection.
Usage
doclusterkmeans(
search,
data,
k,
metriccent,
inertiapower = 1,
maxvars = NULL,
centpos,
initcl
)
Arguments
search |
internally provided by |
data |
internally provided by |
k |
internally provided by |
metriccent |
internally provided by |
inertiapower |
internally provided by |
maxvars |
internally provided by |
centpos |
internally provided by |
initcl |
internally provided by |
Value
internal value to be used by the miclust function.
Performs K-means with backward selection.
Description
doclusterkmeansbackward performs K-means clustering with backward variable
selection.
Usage
doclusterkmeansbackward(data, k, metriccent, inertiapower = 1, centpos, initcl)
Arguments
data |
internally provided by |
k |
internally provided by |
metriccent |
internally provided by |
inertiapower |
internally provided by |
centpos |
internally provided by |
initcl |
internally provided by |
Value
internal value to be used by the doclusterkmeans function.
Performs K-means with forward selection.
Description
doclusterkmeansforward performs K-means clustering with forward variable
selection.
Usage
doclusterkmeansforward(
data,
k,
metriccent,
inertiapower = 1,
maxvars,
centpos,
initcl
)
Arguments
data |
internally provided by |
k |
internally provided by |
metriccent |
internally provided by |
inertiapower |
internally provided by |
maxvars |
internally provided by |
centpos |
internally provided by |
initcl |
internally provided by |
Value
internal value to be used by the doclusterkmeans function.
Performs K-means with forward selection.
Description
doclusterkmeansforwardhc performs K-means clustering with forward variable
selection (option hc).
Usage
doclusterkmeansforwardhc(
data,
k,
metriccent,
inertiapower = 1,
maxvars,
centpos
)
Arguments
data |
internally provided by |
k |
internally provided by |
metriccent |
internally provided by |
inertiapower |
internally provided by |
maxvars |
internally provided by |
centpos |
internally provided by |
Value
internal value to be used by the doclusterkmeans function.
Performs K-means with forward selection.
Description
doclusterkmeansforwardrand performs K-means clustering with forward variable
selection (option rand).
Usage
doclusterkmeansforwardrand(
data,
k,
metriccent,
inertiapower = 1,
maxvars,
centpos
)
Arguments
data |
internally provided by |
k |
internally provided by |
metriccent |
internally provided by |
inertiapower |
internally provided by |
maxvars |
internally provided by |
centpos |
internally provided by |
Value
internal value to be used by the doclusterkmeans function.
Performs K-means without variable selection.
Description
doclusterkmeansnone performs K-means clustering without variable selection.
Usage
doclusterkmeansnone(data, k, metriccent, inertiapower = 1, centpos, initcl)
Arguments
data |
internally provided by |
k |
internally provided by |
metriccent |
internally provided by |
inertiapower |
internally provided by |
centpos |
internally provided by |
initcl |
internally provided by |
Value
internal value to be used by the doclusterkmeans function.
Computes Euclidean distance.
Description
euclidean computes the Euclidean distance from data to centers.
Usage
euclidean(x, centers)
Arguments
x |
internally provided by |
centers |
internally provided by |
Value
internal value to be used by the miclust function.
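For illustration, the distance from each observation to each center can be sketched in base R as below. The helper euclidean_sketch is hypothetical. It assumes that x holds one observation per row and centers holds one center per row, and it returns a matrix with one row per observation and one column per center. The layout of the internal function's return value is not documented here.

```r
# Hypothetical helper, not part of miclust.
euclidean_sketch <- function(x, centers) {
  x <- as.matrix(x)
  centers <- as.matrix(centers)
  stopifnot(ncol(x) == ncol(centers))
  d <- matrix(0, nrow = nrow(x), ncol = nrow(centers))
  for (j in seq_len(nrow(centers))) {
    # Subtract the j-th center from every row of x.
    delta <- sweep(x, 2, centers[j, ], FUN = "-")
    d[, j] <- sqrt(rowSums(delta^2))
  }
  d
}

x <- rbind(c(0, 0), c(3, 4))
centers <- rbind(c(0, 0), c(3, 0))
d_euc <- euclidean_sketch(x, centers)
d_euc
```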
Computes CritCF.
Description
getcritcfkcca computes CritCF (see references).
Usage
getcritcfkcca(kmeansfitted, inertiapower)
Arguments
kmeansfitted |
internally provided by |
inertiapower |
internally provided by |
Value
internal value to be used by the doclusterkmeans function.
Creates a midata object.
Description
Creates an object of class midata to be clustered by the function miclust.
Usage
getdata(data)
Arguments
data |
a |
Details
All variables in the data frames in impdata are standardized by getdata, so
categorical variables need to be coded with numeric values. Standardization
is performed by centering all variables at the mean and then dividing by the
standard deviation (or by the difference between the maximum and the minimum
values for binary variables). This standardization is applied only to the
imputed data sets. The standardization of the raw data is applied internally by
miclust if needed (which is the case when only the raw data are analyzed, i.e.
a complete case analysis).
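The standardization rule described above can be sketched in base R as follows. The helper standardize_sketch is hypothetical and is not the package's standardizedata function. In particular, treating a variable as binary when it has exactly two distinct non-missing values is an assumption of this sketch.

```r
# Hypothetical helper, not part of miclust.
standardize_sketch <- function(data) {
  as.data.frame(lapply(data, function(v) {
    centered <- v - mean(v, na.rm = TRUE)
    # Assumption: "binary" means exactly two distinct observed values.
    is_binary <- length(unique(v[!is.na(v)])) == 2
    scale_by <- if (is_binary) {
      diff(range(v, na.rm = TRUE))   # maximum minus minimum
    } else {
      sd(v, na.rm = TRUE)            # standard deviation
    }
    centered / scale_by
  }))
}

toy <- data.frame(bmi = c(20, 25, 30), hyp = c(1, 2, 2))
std <- standardize_sketch(toy)
std
```

After this transformation every variable has mean zero, and a binary variable spans a range of exactly one unit.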
Value
An object of classes c("list", "midata") including the following items:
- rawdata
a data frame containing the raw data.
- impdata
if data is an object of class list, impdata is a list containing the standardized imputed data sets.
See Also
Examples
### data minhanes:
data(minhanes)
class(minhanes)
### number of imputed datasets:
length(minhanes) - 1
### raw data with missing values:
summary(minhanes[[1]])
### first imputed data set:
minhanes[[2]]
summary(minhanes[[2]])
### data preparation for a complete case cluster analysis:
data1 <- getdata(minhanes[[1]])
class(data1)
names(data1)
### there are no imputed data sets:
data1$impdata
### data preparation for a multiple imputation cluster analysis:
data2 <- getdata(minhanes)
class(data2)
names(data2)
### number of imputed data sets:
length(data2$impdata)
### imputed data sets are standardized:
summary(data2$rawdata)
summary(data2$impdata[[1]])
Computes initial centroids.
Description
getinitialcentroids computes initial centroids for the clustering process.
Usage
getinitialcentroids(data, ncentr)
Arguments
data |
internally provided by |
ncentr |
internally provided by |
Value
internal value to be used by the doclusterkmeans function.
Calculates the ranked selection frequency of the variables.
Description
Creates a ranked selection frequency for all the variables that have been
selected at least once across the analyzed imputed data sets.
getvariablesfrequency can be useful for customizing the plot of these
frequencies, as shown in the Examples below.
Usage
getvariablesfrequency(x, k = NULL)
Arguments
x |
an object of class |
k |
the number of clusters. The default value is the optimal number of clusters
obtained by the function |
Value
A list including the following items:
- percfreq
vector of the selection frequencies (percentage of times) of the variables in decreasing order.
- varnames
names of the variables.
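To make the structure of the returned list concrete, the following base R sketch builds a ranked selection frequency from a toy list of selected variable sets, one set per imputed data set. The helper freq_sketch and the toy input are hypothetical. The function getvariablesfrequency itself works on a miclust object.

```r
# Hypothetical helper, not part of miclust.
freq_sketch <- function(selected) {
  # Count in how many imputed data sets each variable was selected.
  counts <- table(unlist(selected))
  # Convert to percentages and rank in decreasing order.
  perc <- sort(100 * counts / length(selected), decreasing = TRUE)
  list(percfreq = as.numeric(perc), varnames = names(perc))
}

# Toy input: variables selected in each of 4 imputed data sets.
selected <- list(c("bmi", "chl"), "bmi", c("bmi", "hyp"), c("bmi", "chl"))
res <- freq_sketch(selected)
res
```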
See Also
Examples
### see examples in miclust.
Computes Manhattan distance.
Description
manhattan computes the Manhattan distance from data to centers.
Usage
manhattan(x, centers)
Arguments
x |
internally provided by |
centers |
internally provided by |
Value
internal value to be used by the miclust function.
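A minimal base R sketch of the Manhattan (sum of absolute differences) distance from each observation to each center is given below. The helper manhattan_sketch is hypothetical, and the layout of the result (observations in rows, centers in columns) is an assumption of this sketch.

```r
# Hypothetical helper, not part of miclust.
manhattan_sketch <- function(x, centers) {
  x <- as.matrix(x)
  centers <- as.matrix(centers)
  stopifnot(ncol(x) == ncol(centers))
  # For each center (row of `centers`), sum absolute coordinate differences.
  d <- apply(centers, 1, function(ctr) {
    rowSums(abs(sweep(x, 2, ctr, FUN = "-")))
  })
  # Force a matrix even when x has a single row.
  matrix(d, nrow = nrow(x), ncol = nrow(centers))
}

x <- rbind(c(0, 0), c(3, 4))
centers <- rbind(c(0, 0), c(3, 0))
d_man <- manhattan_sketch(x, centers)
d_man
```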
Cluster analysis in multiple imputed data sets with optional variable selection.
Description
Performs cluster analysis in multiple imputed data sets with optional variable
selection. Results can be summarized and visualized with the summary and plot
methods.
Usage
miclust(
data,
method = "kmeans",
search = c("none", "backward", "forward"),
ks = 2:3,
maxvars = NULL,
usedimp = NULL,
distance = c("manhattan", "euclidean"),
centpos = c("means", "medians"),
initcl = c("hc", "rand"),
verbose = TRUE,
seed = NULL
)
Arguments
data |
object of class |
method |
clustering method. Currently, only |
search |
search algorithm for the variable selection procedure: |
ks |
the values of the explored number of clusters. Default is exploring 2 and 3 clusters. |
maxvars |
if |
usedimp |
numeric. Which imputed data sets must be included in the cluster
analysis. If |
distance |
two metrics are allowed to compute distances: |
centpos |
position computation of the cluster centroid. If |
initcl |
starting values for the clustering algorithm. If |
verbose |
a logical value indicating output status messages. Default is |
seed |
a number. Seed for reproducibility of results. Default is |
Details
The optimal number of clusters and the final set of variables are selected according to CritCF. CritCF is defined as
CritCF = \left(\frac{2m}{2m + 1} \cdot \frac{1}{1 + W / B}\right)^{\frac{1 + \log_2(k + 1)}{1 + \log_2(m + 1)}},
where m is the number of variables, k is the number of clusters, and W and B
are the within- and between-cluster inertias. Higher values of CritCF are
preferred (Breaban, 2011). See References below for further details about the
clustering algorithm.
For computational reasons, option "rand" is suggested instead of "hc" for high
dimensional data.
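The CritCF formula above can be evaluated directly. The following is a minimal sketch, not the package's getcritcfkcca: the helper critcf_sketch is hypothetical, and using tot.withinss and betweenss from stats::kmeans as W and B is an assumption made only for illustration, since miclust computes inertias with its own distance and inertiapower settings.

```r
# Hypothetical helper, not part of miclust.
critcf_sketch <- function(m, k, W, B) {
  stopifnot(m >= 1, k >= 1, W >= 0, B > 0)
  base <- (2 * m / (2 * m + 1)) * (1 / (1 + W / B))
  exponent <- (1 + log2(k + 1)) / (1 + log2(m + 1))
  base^exponent
}

# Compact clusters (small W relative to B) score higher than diffuse ones.
critcf_sketch(m = 4, k = 2, W = 10, B = 30)
critcf_sketch(m = 4, k = 2, W = 30, B = 10)

# Illustration with stats::kmeans on standardized built-in data
# (assumption: sums of squares stand in for the inertias W and B).
set.seed(1)
x <- scale(iris[, 1:4])
fit <- kmeans(x, centers = 3, nstart = 5)
crit <- critcf_sketch(m = ncol(x), k = 3,
                      W = fit$tot.withinss, B = fit$betweenss)
crit
```

Because the base of the power lies between 0 and 1 and the exponent is positive, CritCF always lies between 0 and 1.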
Value
A list with class "miclust" including the following items:
- clustering
a list of lists containing the results of the clustering algorithm for each analyzed data set and for each analyzed number of clusters. Includes information about selected variables and the cluster vector.
- completecasesperc
if data contains a single data frame, percentage of complete cases in data.
- data
input data.
- ks
the values of the explored number of clusters.
- usedimp
indicator of the imputed data sets used.
- kfin
optimal number of clusters.
- critcf
if data contains a single data frame, critcf contains the optimal (maximum) value of CritCF (see Details) and the number of selected variables in the reduction procedure for each explored number of clusters. If data is a list, critcf contains the optimal value of CritCF for each imputed data set and for each explored value of the number of clusters.
- numberofselectedvars
number of selected variables.
- selectedkdistribution
if data is a list, frequency of selection of each analyzed number of clusters.
- method
input method.
- search
input search.
- maxvars
input maxvars.
- distance
input distance.
- centpos
input centpos.
- selmetriccent
an object of class kccaFamily needed by the specific summary method.
- initcl
input initcl.
References
Basagana X, Barrera-Gomez J, Benet M, Anto JM, Garcia-Aymerich J. A Framework for Multiple Imputation in Cluster Analysis. American Journal of Epidemiology. 2013;177(7):718-725.
Breaban M, Luchian H. A unifying criterion for unsupervised clustering and feature selection. Pattern Recognition. 2011;44(4):854-865.
See Also
getdata for data preparation before using miclust.
Examples
### data preparation:
minhanes1 <- getdata(data = minhanes)
##################
###
### Example 1:
###
### Multiple imputation clustering process with backward variable selection
###
##################
### using only the imputations 1 to 10 for the clustering process and exploring
### 2 vs. 3 clusters:
minhanes1clust <- miclust(data = minhanes1, search = "backward", ks = 2:3,
usedimp = 1:10, seed = 4321)
minhanes1clust
minhanes1clust$kfin ### optimal number of clusters
### graphical summary:
plot(minhanes1clust)
### selection frequency of the variables for the optimal number of clusters:
y <- getvariablesfrequency(minhanes1clust)
y
plot(y$percfreq, type = "h", main = "", xlab = "Variable",
ylab = "Percentage of times selected", xlim = 0.5 + c(0, length(y$varnames)),
lwd = 15, col = "blue", xaxt = "n")
axis(1, at = 1:length(y$varnames), labels = y$varnames)
### default summary for the optimal number of clusters:
summary(minhanes1clust)
## summary forcing 3 clusters:
summary(minhanes1clust, k = 3)
##################
###
### Example 2:
###
### Same analysis but without variable selection
###
##################
minhanes2clust <- miclust(data = minhanes1, ks = 2:3, usedimp = 1:10, seed = 4321)
minhanes2clust
plot(minhanes2clust)
summary(minhanes2clust)
##################
###
### Example 3:
###
### Complete case clustering process with backward variable selection
###
##################
nhanes0 <- getdata(data = minhanes[[1]])
nhanes2clust <- miclust(data = nhanes0, search = "backward", ks = 2:3, seed = 4321)
nhanes2clust
summary(nhanes2clust)
### nothing to plot for a single data set analysis
# plot(nhanes2clust)
##################
###
### Example 4:
###
### Complete case clustering process without variable selection
###
##################
nhanes3clust <- miclust(data = nhanes0, ks = 2:3, seed = 4321)
nhanes3clust
summary(nhanes3clust)
Multiple imputation for nhanes data.
Description
A list with 101 data sets. The first data set contains the nhanes data from the
mice package. The remaining data sets were obtained by applying the multiple
imputation function mice from package mice.
Usage
minhanes
Format
A list of 101 data.frames, each with 25 observations of the following 4 variables:
- age
age group (1 = 20-39, 2 = 40-59, 3 = 60+). Treated as numerical.
- bmi
body mass index (kg/m^2)
- hyp
hypertensive (1 = no, 2 = yes). Treated as numerical.
- chl
total serum cholesterol (mg/dL)
Source
https://CRAN.R-project.org/package=mice
Examples
data(minhanes)
### raw data:
minhanes[[1]]
summary(minhanes[[1]])
### number of imputed data sets:
length(minhanes) - 1
### first imputed data set:
minhanes[[2]]
summary(minhanes[[2]])
Shows a graphical representation of the results.
Description
Creates a graphical representation of the results of miclust.
Usage
## S3 method for class 'miclust'
plot(x, k = NULL, ...)
Arguments
x |
object of class |
k |
number of clusters. The default value is the optimal number of clusters
obtained by |
... |
further arguments for the plot function. |
Value
a plot to visualize the clustering results.
See Also
Prints the results.
Description
Creates a summary print of the results of miclust.
Usage
## S3 method for class 'miclust'
print(x, ...)
Arguments
x |
object of class |
... |
further arguments for the print method. |
Value
prints a description of the clustering main results.
Prints the summary of results.
Description
Prints the summary of the results of summary.miclust.
Usage
## S3 method for class 'summary.miclust'
print(x, digits = 2, ...)
Arguments
x |
object of class |
digits |
digits for the print method. Default is 2. |
... |
further arguments for the print method. |
Value
a print of the summary of the results generated by summary.miclust.
See Also
Relabel clusters.
Description
relabelclusters relabels the clusters so that they all have the same meaning in
all the data sets.
Usage
relabelclusters(refcluster, cluster)
Arguments
refcluster |
internally provided by |
cluster |
internally provided by |
Value
internal value to be used by the assignprobandkappas function.
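One plausible way to give clusters the same meaning across data sets is to try every permutation of the labels and keep the one that agrees most often with a reference cluster vector. The sketch below does this in base R. The helpers all_perms and relabel_sketch are hypothetical, and the agreement criterion is an assumption. The internal relabelclusters function may use a different rule.

```r
# Hypothetical helpers, not part of miclust.

# All permutations of a vector (plain recursion; adequate for a small number
# of clusters, since the count grows as factorial(k)).
all_perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in all_perms(v[-i])) {
      out[[length(out) + 1]] <- c(v[i], p)
    }
  }
  out
}

# Relabel `cluster` so that it agrees as much as possible with `refcluster`.
relabel_sketch <- function(refcluster, cluster) {
  stopifnot(length(refcluster) == length(cluster))
  labels <- sort(unique(c(refcluster, cluster)))
  best <- cluster
  best_agree <- -1
  for (p in all_perms(labels)) {
    # Map the i-th label to the i-th element of the permutation.
    candidate <- p[match(cluster, labels)]
    agree <- sum(candidate == refcluster)
    if (agree > best_agree) {
      best <- candidate
      best_agree <- agree
    }
  }
  best
}

ref <- c(1, 1, 2, 2, 3)
cl <- c(2, 2, 3, 3, 1)   # same partition as ref, different labels
relabeled <- relabel_sketch(ref, cl)
relabeled
```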
Standardize data.
Description
standardizedata standardizes variables in data.
Usage
standardizedata(data)
Arguments
data |
internally provided by |
Value
internal value to be used by the getdata function.
Summarizes the results.
Description
Performs a within-cluster descriptive analysis of the variables after the
clustering process performed by the function miclust.
Usage
## S3 method for class 'miclust'
summary(object, k = NULL, quantilevars = NULL, ...)
Arguments
object |
object of class |
k |
number of clusters. The default value is the optimal number of clusters
obtained by |
quantilevars |
numeric. If a variable selection procedure was used, the
cut-off percentile in order to decide the number of selected variables in the
variable reduction procedure by decreasing order of presence along the imputations
results. The default value is |
... |
further arguments for the summary method. |
Value
An object with classes c("list", "summary.miclust") including the following items:
- allocationprobabilities
if imputations were analyzed, descriptive summary of the probability of cluster assignment.
- classmatrix
if imputations were analyzed, the individual probabilities of cluster assignment.
- cluster
if imputations were analyzed, the final individual cluster assignment.
- clusterssize
if imputations were analyzed, size of the imputed cluster and between-imputations summary of the cluster size.
- clustervector
if a single data set (raw data set) has been clustered, a vector containing the individuals' cluster assignments.
- clustervectors
if imputed data sets have been clustered, the individual cluster assignment in each imputation.
- completecasesperc
if a single data set (raw data set) has been clustered, the percentage of complete cases in the data set.
- k
number of clusters.
- kappas
if imputations were analyzed, the Cohen's kappa values after comparing the cluster vector in the first imputation with the cluster vector in each of the remaining imputations.
- kappadistribution
a summary of kappas.
- m
number of imputations used in the descriptive analysis, which is the total number of imputations provided.
- quantilevars
if variable selection was performed, the input value of quantilevars.
- search
search algorithm for the variable selection procedure.
- selectedvariables
if variable selection was performed, the selected variables obtained considering quantilevars.
- selectedvarspresence
if imputations were analyzed and variable selection was performed, the presence of the selected variables along imputations.
- summarybycluster
within-cluster descriptive analysis of the selected variables.
- usedimp
indicator of imputations used in the clustering procedure.
See Also
Examples
### see examples in miclust.