Title: | Perform Monothetic Clustering with Extensions to Circular Data |
Version: | 1.2.1 |
Description: | Implementation of the Monothetic Clustering algorithm (Chavent, 1998 <doi:10.1016/S0167-8655(98)00087-7>) on continuous data sets. A lot of extensions are included in the package, including applying Monothetic clustering on data sets with circular variables, visualizations with the results, and permutation and cross-validation based tests to support the decision on the number of clusters. |
License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] |
URL: | https://vinhtantran.github.io/monoClust/, https://github.com/vinhtantran/monoClust |
BugReports: | https://github.com/vinhtantran/monoClust/issues |
Depends: | R (≥ 3.3.0) |
Imports: | cluster (≥ 2.0.5), doParallel, dplyr (≥ 1.0.0), foreach, ggplot2, graphics, grDevices, parallel, permute, purrr (≥ 0.3.0), rlang (≥ 0.3.0), stats, stringr (≥ 0.5), tibble (≥ 3.0.0), tidyr (≥ 1.0.0) |
Suggests: | knitr, mice, rmarkdown, covr, testthat (≥ 3.0.0) |
VignetteBuilder: | knitr |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.1.1 |
Config/testthat/edition: | 3 |
NeedsCompilation: | no |
Packaged: | 2021-02-15 00:10:00 UTC; vinht |
Author: | Tan Tran |
Maintainer: | Tan Tran <vinhtantran@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2021-02-15 15:00:02 UTC |
monoClust: Perform Monothetic Clustering with Extensions to Circular Data
Description
Implementation of the Monothetic Clustering algorithm (Chavent, 1998 <doi:10.1016/S0167-8655(98)00087-7>) for continuous data sets. The package includes several extensions: applying monothetic clustering to data sets with circular variables, visualizing the results, and permutation- and cross-validation-based tests that support the choice of the number of clusters.
Author(s)
Maintainer: Tan Tran vinhtantran@gmail.com (ORCID)
Authors:
Brian McGuire mcguirebc@gmail.com
Mark Greenwood greenwood@montana.edu (ORCID)
See Also
Useful links:
Report bugs at https://github.com/vinhtantran/monoClust/issues
Monothetic Clustering
Description
Creates a MonoClust object after partitioning the data set using Monothetic Clustering.
Usage
MonoClust(
toclust,
cir.var = NULL,
variables = NULL,
distmethod = NULL,
digits = getOption("digits"),
nclusters = 2L,
minsplit = 5L,
minbucket = round(minsplit/3),
ncores = 1L
)
Arguments
toclust |
Data set as a data frame. |
cir.var |
Index or name of the circular variable in the data set. |
variables |
List of variables selected for clustering procedure. It could be a vector of variable indexes, or a vector of variable names. |
distmethod |
Distance method to use with the data set. Can be chosen from "euclidean" (for Euclidean distance), "manhattan" (for Manhattan distance), or "gower" (for Gower distance). If not set, Euclidean distance is used unless |
digits |
Number of significant digits printed in the output. |
nclusters |
Number of clusters created. Default is 2. |
minsplit |
The minimum number of observations that must exist in a node in order for a split to be attempted. Default is 5. |
minbucket |
The minimum number of observations in any terminal leaf node. Default is round(minsplit / 3). |
ncores |
Number of CPU cores on the current host. If greater than 1,
parallel processing with |
Value
A MonoClust object. See MonoClust.object.
References
Chavent, M. (1998). A monothetic clustering method. Pattern Recognition Letters, 19(11), 989-996. doi: 10.1016/S0167-8655(98)00087-7.
Tran, T. V. (2019). Monothetic Cluster Analysis with Extensions to Circular and Functional Data. Montana State University - Bozeman.
Examples
# Very simple data set
library(cluster)
data(ruspini)
ruspini4sol <- MonoClust(ruspini, nclusters = 4)
ruspini4sol
# data with circular variable
library(monoClust)
data(wind_sensit_2007)
# Use a small data set
set.seed(12345)
wind_reduced <- wind_sensit_2007[sample.int(nrow(wind_sensit_2007), 10), ]
circular_wind <- MonoClust(wind_reduced, cir.var = 3, nclusters = 2)
circular_wind
Monothetic Clustering Tree Object
Description
The structure and objects contained in MonoClust, an object returned from the MonoClust() function and used as the input in other functions in the package.
Value
- frame
Data frame in the form of a tibble::tibble() representing a tree structure with one row for each node. The columns include:
  - number
  Index of the node. Depth of a node can be derived by number %/% 2.
  - var
  Name of the variable used in the split at a node, or "<leaf>" if it is a leaf node.
  - cut
  Splitting value: values of var smaller than cut go to the left branch, and values greater than cut go to the right branch.
  - n
  Cluster size, the number of observations in that cluster.
  - inertia
  Inertia value of the cluster at that node.
  - bipartsplitrow
  Position of the next split row in the data set (that position will belong to the left (smaller) node).
  - bipartsplitcol
  Position of the next split variable in the data set.
  - inertiadel
  Proportion of the inertia value of the cluster at that node to the inertia of the root.
  - medoid
  Position of the data point regarded as the medoid of its cluster.
  - loc
  y-coordinate of the splitting node to facilitate showing on the tree. See plot.MonoClust() for details.
  - split.order
  Order of the splits, with the root being 0.
  - inertia_explained
  Percent inertia explained as described in Chavent (2007). It is 1 - (sum(current inertia)/inertia[1]).
  - alt
  A nested tibble of alternate splits at a node. It contains bipartsplitrow and bipartsplitcol with the same meaning as above. Note that this is for information only. Currently monoClust does not support choosing an alternate splitting route; if needed, MonoClust() with nclusters = 2 can be run step by step.
- membership
Vector of the same length as the number of rows in the data, containing the value of frame$number corresponding to the leaf node that an observation falls into.
- dist
Distance matrix calculated using the method indicated in the distmethod argument of MonoClust().
- terms
Vector of variable names in the data that were used to split.
- centroids
Data frame with one row for the centroid value of each cluster.
- medoids
Named vector of positions of the data points regarded as medoids of clusters.
- alt
Indicator of whether an alternate splitting route occurred when splitting.
- circularroot
List of values designed for the circular variable in the data set. var is the name of the circular variable and cut is its first best split value. If no circular variable is available, both objects are NULL.
References
Chavent, M., Lechevallier, Y., & Briant, O. (2007). DIVCLUS-T: A monothetic divisive hierarchical clustering method. Computational Statistics & Data Analysis, 52(2), 687-701. doi: 10.1016/j.csda.2007.03.013.
Coerce Similar Object to MonoClust
Description
The function turns a MonoClust-similar object into a MonoClust object so that it can use the functions supported for MonoClust, such as print.MonoClust() and plot.MonoClust().
Usage
as_MonoClust(x, ...)
## Default S3 method:
as_MonoClust(x, ...)
Arguments
x |
An object that can be coerced to MonoClust object. |
... |
For extensibility. |
Details
as_MonoClust() is an S3 generic. The function itself doesn't run unless it is implemented for another similar object. Currently, this function is not implemented within the monoClust package.
Find Centroid of the Cluster
Description
The centroid is the point whose coordinates are the means of the values in its cluster.
Usage
centroid(data, frame, cloc)
Arguments
data |
Original data set. |
frame |
The split tree transferred as data frame. |
cloc |
Vector of current cluster membership. |
Value
A data frame with coordinates of centroids
First Gate Function
Description
This function checks which nodes are available to split, calls find_split() on each of them, decides which node creates the best split, and calls splitter() to perform the split.
Usage
checkem(
data,
cuts,
frame,
cloc,
dist,
variables,
minsplit,
minbucket,
split_order,
ncores
)
Arguments
data |
Original data set. |
cuts |
Cuts data set, which has the next higher value of each variable in the original data set. |
frame |
The split tree transferred as data frame. |
cloc |
Vector of current cluster membership. |
dist |
Distance matrix of all observations in the data. |
variables |
List of variables selected for clustering procedure. It could be a vector of variable indexes, or a vector of variable names. |
minsplit |
The minimum number of observations that must exist in a node in order for a split to be attempted. |
minbucket |
The minimum number of observations in any terminal leaf
node. Default is |
split_order |
Control argument that tracks how many splits have been done. |
ncores |
Number of CPU cores on the current host. |
Value
It is not supposed to return anything because the global environment is used. However, if there is nothing left to split, it returns 0 to tell the caller to stop running the loop.
Add/Subtract Circular Values in Degrees/Radian
Description
Add/subtract two circular variables in degrees (%cd+% and %cd-%) and radians (%cr+% and %cr-%).
Usage
x %cd+% y
x %cd-% y
x %cr+% y
x %cr-% y
Arguments
x , y |
Circular values in degrees/radians. |
Value
A value in [0, 360) for degrees or in [0, 2*pi) for radians.
Examples
90 %cd+% 90
250 %cd+% 200
25 %cd-% 80
pi %cr+% (pi/2)
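The wrap-around behaviour of these operators can be reproduced with modular arithmetic. The following base R sketch is only an illustration and is not taken from the package source; circ_add_deg(), circ_sub_deg(), and circ_add_rad() are hypothetical helper names.

```r
# Hypothetical helpers (not part of monoClust): wrap sums and differences
# of angles back into [0, 360) degrees or [0, 2*pi) radians.
circ_add_deg <- function(x, y) (x + y) %% 360
circ_sub_deg <- function(x, y) (x - y) %% 360
circ_add_rad <- function(x, y) (x + y) %% (2 * pi)

circ_add_deg(90, 90)      # 180
circ_add_deg(250, 200)    # 90
circ_sub_deg(25, 80)      # 305
circ_add_rad(pi, pi / 2)  # 3 * pi / 2
```

R's %% operator returns a result with the sign of the divisor, so negative differences wrap to the upper end of the range without extra handling.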
Distance Matrix of Circular Variables
Description
Calculates the distance matrix of observations with circular variables using an adapted version of Gower's distance. This distance should be compatible with the Gower's distance for other variable types.
Usage
circ_dist(frame)
Arguments
frame |
A data frame in which all columns are circular variables measured in degrees. |
Details
The distance between two observations i and j on a circular variable q is suggested to be
d(y_{iq}, y_{jq}) = \frac{180 - |180 - |y_{iq} - y_{jq}||}{180}.
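The formula can be checked by hand for a single circular variable. The sketch below is a base R illustration under the assumption that angles lie in [0, 360); circ_dissim() is a hypothetical name, and the code does not reproduce how circ_dist() combines several variables.

```r
# Hypothetical helper: circular dissimilarity for angles in degrees,
# scaled so that opposite directions (180 degrees apart) give 1.
circ_dissim <- function(a, b) (180 - abs(180 - abs(a - b))) / 180

angles <- c(10, 350, 190)
# Pairwise dissimilarities collected into a "dist" object.
d <- as.dist(outer(angles, angles, circ_dissim))
d
```

Here 10 and 350 degrees are only 20 degrees apart on the circle, so their dissimilarity is 20/180, while 10 and 190 degrees are opposite and get the maximum value of 1.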
Value
Object of class "dist".
References
Tran, T. V. (2019). Chapter 3. Monothetic Cluster Analysis with Extensions to Circular and Functional Data. Montana State University - Bozeman.
Examples
# Make a sample data set of 20 observations with 2 circular variables
data <- data.frame(var1 = sample.int(359, 20),
var2 = sample.int(359, 20))
circ_dist(data)
Cluster Statistics Calculation
Description
Calculates the Calinski-Harabasz pseudo-F (Calinski and Harabasz, 1974) and the average silhouette width (Rousseeuw, 1987).
Usage
cluster_stats(d, clustering)
Arguments
d |
Distance object (as generated by |
clustering |
Integer vector of length of the number of cases, which indicates a clustering. The clusters have to be numbered from 1 to the number of clusters. |
Value
- f_stat
Calinski-Harabasz's pseudo-F.
- asw
Average silhouette width.
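As an illustration of the first statistic, the pseudo-F can be computed from a distance object alone. The sketch below is not the package's code: pseudo_f() and inertia() are hypothetical names, and it assumes inertia is the sum of squared pairwise distances divided by twice the cluster size.

```r
# Assumed inertia from a distance matrix: sum of squared distances / (2 * n).
inertia <- function(dm) sum(dm^2) / (2 * nrow(dm))

# Hypothetical helper: pseudo-F = (between / (k - 1)) / (within / (n - k)).
pseudo_f <- function(d, clustering) {
  dm <- as.matrix(d)
  n <- nrow(dm)
  k <- length(unique(clustering))
  total <- inertia(dm)
  within <- sum(vapply(split(seq_len(n), clustering), function(idx) {
    inertia(dm[idx, idx, drop = FALSE])
  }, numeric(1)))
  ((total - within) / (k - 1)) / (within / (n - k))
}

x <- c(0, 1, 10, 11)
pseudo_f(dist(x), c(1, 1, 2, 2))  # 200
```

For this toy vector the total inertia is 101 and the within-cluster inertia is 1, which gives (100 / 1) / (1 / 2) = 200.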
References
Caliński, T. and Harabasz, J. (1974). "A dendrite method for cluster analysis". In: Communications in Statistics 3.1, pp. 1–27. doi: 10.1080/03610927408827101.
Rousseeuw, P. J. (1987). "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis". In: Journal of Computational and Applied Mathematics 20, pp. 53–65. ISSN: 03770427. doi: 10.1016/0377-0427(87)90125-7.
Create Labels for Split Variables
Description
This function prints variable labels for a MonoClust tree.
Usage
create_labels(x, abbrev, digits = getOption("digits"), ...)
Arguments
x |
MonoClust result object. |
abbrev |
Whether to print the abbreviated versions of variable names. Can be either "no" (default), "short", or "abbreviate". Short forms of them can also be used. If "no", the labels recorded in If "short", variable names will be turned into "V1", "V2", ... If "abbreviate", |
digits |
Number of significant digits to print. |
... |
Optional arguments to |
Value
A list containing two elements:
- varnames
A named vector of labels corresponding to the variable names (given as the vector's names).
- labels
Vector of labels of splitting rules to be displayed.
Cross-Validation Test on MonoClust
Description
Perform a cross-validation test for different numbers of clusters in Monothetic Clustering.
Usage
cv.test(data, fold = 10L, minnodes = 2L, maxnodes = 10L, ncores = 1L, ...)
Arguments
data |
Data set to be partitioned. |
fold |
Number of folds (k). |
minnodes |
Minimum number of clusters to be checked. |
maxnodes |
Maximum number of clusters to be checked. |
ncores |
Number of CPU cores on the current host. When set to NULL, all available cores are used. |
... |
Other parameters transferred to |
Details
The k-fold cross-validation randomly partitions the data into k subsets of equal (or close to equal) size. k - 1 subsets are used as the training data set to create a tree with a desired number of leaves, and the remaining subset is used as the validation data set to evaluate the predictive performance of the trained tree. The process repeats with each subset as the validation set (m = 1, \ldots, k), and the mean squared difference,
MSE_m = \frac{1}{n_m} \sum_{q=1}^Q \sum_{i \in m} d^2_{euc}(y_{iq}, \hat{y}_{(-i)q}),
is calculated, where n_m is the number of observations in subset m, \hat{y}_{(-i)q} is the mean on variable q of the cluster, created from the training data, into which the observed value y_{iq} of the validation data set falls, and d^2_{euc}(y_{iq}, \hat{y}_{(-i)q}) is the squared Euclidean distance (dissimilarity) between the two values on variable q. The average of these test errors over the k subsets is the cross-validation-based estimate of the mean squared error of predicting a new observation,
CV_k = \overline{MSE} = \frac{1}{k} \sum_{m=1}^k MSE_m.
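The fold loop described above can be sketched in base R. This is an illustration only, not the code behind cv.test(): fit_split() is a hypothetical stand-in for a fitted tree (a single split at the training median of the first variable), and cv_mse() is a hypothetical name.

```r
# Hypothetical stand-in for a fitted tree: one monothetic split at the
# training median of the first variable, giving two clusters.
fit_split <- function(train) {
  cut <- median(train[[1]])
  left <- train[[1]] < cut
  list(cut = cut,
       means = rbind(colMeans(train[left, , drop = FALSE]),
                     colMeans(train[!left, , drop = FALSE])))
}

cv_mse <- function(data, k) {
  n <- nrow(data)
  fold <- sample(rep(seq_len(k), length.out = n))  # random, near-equal folds
  mse <- vapply(seq_len(k), function(m) {
    train <- data[fold != m, , drop = FALSE]
    valid <- data[fold == m, , drop = FALSE]
    fit <- fit_split(train)
    cl <- ifelse(valid[[1]] < fit$cut, 1L, 2L)  # cluster each held-out row falls into
    pred <- fit$means[cl, , drop = FALSE]       # cluster means from the training data
    sum((as.matrix(valid) - pred)^2) / nrow(valid)  # MSE_m
  }, numeric(1))
  c(CV = mean(mse), SD = sd(mse))
}

set.seed(1)
toy <- data.frame(x = c(rnorm(20, 0), rnorm(20, 10)),
                  y = c(rnorm(20, 0), rnorm(20, 10)))
res <- cv_mse(toy, k = 5)
res
```

Each fold contributes one MSE_m, and the mean of those values is the cross-validation estimate; the spread across folds is reported next to it.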
Value
An object of class cv.MonoClust containing a data frame of the mean squared error and its standard deviation.
Note
This function supports parallel processing with foreach::foreach(). It distributes MonoClust calls to processes.
See Also
plot.cv.MonoClust(), MonoClust(), predict.MonoClust()
Examples
library(cluster)
data(ruspini)
# Leave-one-out cross-validation
cv.test(ruspini, fold = 1, minnodes = 2, maxnodes = 4)
# 5-fold cross-validation
cv.test(ruspini, fold = 5, minnodes = 2, maxnodes = 4)
Make Error Bars
Description
Make Error Bars
Usage
error_bar(x, y, upper, lower = upper, length = 0.1, ...)
Arguments
x |
x coordinates. |
y |
y coordinates. |
upper |
Distance from y to the upper bar. |
lower |
Distance from y to the lower bar. |
length |
Length of the horizontal bar. |
... |
Other arguments to |
Value
Plot
Find the Closest Cut
Description
Find the cuts for a quantitative variable. These cuts are the candidates considered when bi-partitioning the data. For each value in a quantitative column, the cut is the next larger value in that column; for the largest value, it is that value + 1.
Usage
find_closest(col)
Arguments
col |
a quantitative vector. |
Value
a quantitative vector which contains the closest higher cut.
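The rule described for this function can be written in a few lines of base R. The sketch below is an illustration of the description rather than the package's implementation; next_higher() is a hypothetical name.

```r
# Hypothetical helper: for each value, return the next larger distinct value
# in the vector; the largest value maps to itself + 1.
next_higher <- function(col) {
  vals <- sort(unique(col))
  nxt <- c(vals[-1], vals[length(vals)] + 1)
  nxt[match(col, vals)]
}

next_higher(c(3, 1, 2, 3))  # 4 2 3 4
```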
Find the Best Split
Description
Find the best split in terms of reduction in inertia for the transferred node, indicated by row. Find the terminal node with the greatest change in inertia and bi-partition it.
Usage
find_split(
data,
cuts,
frame_row,
cloc,
dist,
variables,
minsplit,
minbucket,
ncores
)
Arguments
data |
Original data set. |
cuts |
Cuts data set, which has the next higher value of each variable in the original data set. |
frame_row |
One row of the split tree as data frame. |
cloc |
Vector of current cluster membership. |
dist |
Distance matrix of all observations in the data. |
variables |
List of variables selected for clustering procedure. It could be a vector of variable indexes, or a vector of variable names. |
minsplit |
The minimum number of observations that must exist in a node in order for a split to be attempted. |
minbucket |
The minimum number of observations in any terminal leaf
node. Default is |
ncores |
Number of CPU cores on the current host. |
Value
The updated frame_row with the next split updated.
GGPlot the Mean Square Error with Error Bar for +/- 1 Standard Error
Description
GGPlot the Mean Square Error with Error Bar for +/- 1 Standard Error
Usage
ggcv(
cv.obj,
title = "MSE for CV of monothetic clustering",
xlab = "Number of clusters",
ylab = "MSE +/- 1 SE",
type = c("b", "p", "l"),
linetype = 2,
err.col = "red",
err.width = 0.2
)
Arguments
cv.obj |
A |
title |
Overall title for the plot. |
xlab |
Title for x axis. |
ylab |
Title for y axis. |
type |
What type of plot should be drawn. Choosing between |
linetype |
The line type. See |
err.col |
Color of the error bars. |
err.width |
Width of the bars. |
Value
A ggplot2 object.
See Also
Plot using base R plot.cv.MonoClust()
Examples
library(cluster)
data(ruspini)
# 10-fold cross-validation
cptable <- cv.test(ruspini, minnodes = 2, maxnodes = 4)
ggcv(cptable)
Parallel Coordinates Plot with Circular Variables
Description
Make a parallel coordinates plot in which the circular variables are plotted as ellipses. The function currently works well with data that has one circular variable.
Usage
ggpcp(
data,
circ.var = NULL,
is.degree = TRUE,
rotate = 0,
north = 0,
cw = FALSE,
order.appear = NULL,
linetype = 1,
size = 0.5,
alpha = 0.5,
clustering,
medoids = NULL,
cluster.col = NULL,
show.medoids = FALSE,
labelsize = 4,
xlab = "Variables",
ylab = NULL,
legend.cluster = "groups"
)
Arguments
data |
Data set. |
circ.var |
Circular variable(s) in the data set, indicated by names or index in the data set. |
is.degree |
Whether the unit of the circular variables is degree or not
(radian). Default is |
rotate |
The rotate (offset, shift) of the circular variable, in radians. Default is 0 (no rotation). |
north |
What value of the circular variable is labeled North. Default is 0 radian. |
cw |
Which direction of the circular variable is considered increasing
in value, clockwise ( |
order.appear |
The order of appearance of the variables, listed by a vector of names or index. If set, length has to be equal to the number of variables in the data set. |
linetype |
Line type. Default is solid line. See details in
|
size |
Size of a line is its width in mm. Default is 0.5. See details in
|
alpha |
The transparency of the lines. Default is 0.5. |
clustering |
Cluster membership. |
medoids |
Vector of medoid observations of cluster. Only required when
|
cluster.col |
Color of clusters, indicated by a vector. If set, the length of this vector must be equal to the number of clusters in |
show.medoids |
Whether to highlight the medoid lines or not. Default is FALSE. |
labelsize |
The size of labels on the plot. Default is 4. |
xlab |
Labels for x-axis. |
ylab |
Labels for y-axis. |
legend.cluster |
Labels for group membership. Implemented by setting
label for ggplot |
Value
A ggplot2 object.
Examples
# Set color constant
COLOR4 <- c("#e41a1c", "#377eb8", "#4daf4a", "#984ea3")
# Reduce the size of the data for the sake of example speed
set.seed(12345)
wind_reduced <- wind_sensit_2007[sample.int(nrow(wind_sensit_2007), 50), ]
sol42007 <- MonoClust(wind_reduced, cir.var = 3, nclusters = 4)
library(ggplot2)
ggpcp(data = wind_reduced,
      circ.var = "WDIR",
      # To improve aesthetics
      rotate = pi*3/4-0.3,
      order.appear = c("WDIR", "has.sensit", "WS"),
      alpha = 0.5,
      clustering = sol42007$membership,
      medoids = sol42007$medoids,
      cluster.col = COLOR4,
      show.medoids = TRUE) +
  theme(panel.background = element_rect(color = "white"),
        panel.border = element_rect(color = "white", fill = NA),
        panel.grid.major = element_line(color = "#f0f0f0"),
        panel.grid.minor = element_blank(),
        axis.line = element_line(color = "black"),
        legend.key = element_rect(color = NA),
        legend.position = "bottom",
        legend.direction = "horizontal",
        legend.title = element_text(face = "italic"),
        legend.justification = "center")
Cluster Inertia Calculation
Description
Calculate inertia for a given subset of the distance matrix from the original data set provided to x. Assumes that distance matrices are stored as matrices and not distance objects.
Usage
inertia_calc(x)
Arguments
x |
Distance matrix, not an object of some distance measure. |
Value
Inertia value of the matrix, following the formula in Chavent (1998). If x is a single number, returns 0.
Examples
data(iris)
# Euclidean distance on first 20 rows of the 4 continuous variables
dist_mat <- as.matrix(dist(iris[1:20, 1:4]))
inertia_calc(dist_mat)
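One common way to express inertia through pairwise distances is the sum of squared distances divided by twice the number of observations. The base R sketch below assumes that form (it is not copied from the package) and checks it against the within-cluster sum of squares around the centroid; inertia_from_dist() is a hypothetical name.

```r
# Assumed form of inertia: sum of squared pairwise distances / (2 * n).
inertia_from_dist <- function(x) sum(x^2) / (2 * nrow(x))

pts <- data.frame(a = c(0, 2, 4), b = c(1, 1, 4))
dm <- as.matrix(dist(pts))  # full symmetric matrix, not a "dist" object

# Sum of squared deviations from the centroid, for comparison.
wss <- sum(scale(pts, scale = FALSE)^2)
c(inertia_from_dist(dm), wss)  # both 14
```

With Euclidean distances the two quantities agree, which is why a distance matrix alone is enough to compute the inertia of a cluster.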
Test If The Object is A MonoClust
Description
This function returns TRUE for MonoClust objects, and FALSE for all other objects.
Usage
is_MonoClust(mono_obj)
Arguments
mono_obj |
An object. |
Value
TRUE if the object inherits from the MonoClust class.
Create Jump Table
Description
Create a jump table from the MonoClust's frame object. number and var will be used to create the table.
Usage
make_jump_table(frame)
Arguments
frame |
MonoClust's frame object |
Value
Jump table with number, var, and two new columns, left and right, indicating the left and right node numbers at each split.
Find Medoid of the Cluster
Description
The medoid is the point that has the minimum total distance to all other points in the cluster.
Usage
medoid(members, dist_mat)
Arguments
members |
index vector indicating which observation belongs to the cluster. |
dist_mat |
distance matrix of the whole data set. A class of |
Value
index of the medoid point in the members vector.
Examples
library(cluster)
data(ruspini)
ruspini4sol <- MonoClust(ruspini, nclusters = 4)
ruspini4sol
medoid(which(ruspini4sol$membership == 4), ruspini4sol$dist)
# Check with the output with "4" label
ruspini4sol$medoids
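The idea behind the medoid can be shown without the package. The sketch below is a base R illustration, assuming the medoid is the member with the smallest sum of distances to the other members; find_medoid() is a hypothetical name, and it returns the observation index rather than a position within members.

```r
# Hypothetical helper: the member with the smallest total distance to the
# other members of the cluster.
find_medoid <- function(members, dist_mat) {
  sub <- as.matrix(dist_mat)[members, members, drop = FALSE]
  members[which.min(rowSums(sub))]
}

x <- c(0, 1, 10, 50, 52)
find_medoid(1:3, dist(x))  # 2
```

Within the first three points the row sums of distances are 11, 10, and 19, so the second observation is the medoid.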
Create A New Node for Split Data Frame
Description
This function is a helper that makes sure the default values of the split data frame are correct when unspecified. It helps reduce type errors, especially when moving to dplyr, which is stricter about data types.
Usage
new_node(
number,
var,
cut = -99L,
n,
inertia,
bipartsplitrow = -99L,
bipartsplitcol = -99L,
inertiadel = 0,
inertia_explained = -99,
medoid,
loc,
split.order = -99L,
alt = list(tibble::tibble(bipartsplitrow = numeric(), bipartsplitcol = numeric()))
)
Arguments
number |
Row index of the data frame. |
var |
Whether it is a leaf, or the name of the next split variable. |
cut |
The splitting value, so values (of |
n |
Cluster size. Number of observations in that cluster. |
inertia |
Inertia value of the cluster at that node. |
bipartsplitrow |
Position of the next split row in the data set (that position will belong to left node (smaller)). |
bipartsplitcol |
Position of the next split variable in the data set. |
inertiadel |
The proportion of inertia value of the cluster at that node to the inertia of the root. |
inertia_explained |
Percent inertia explained as described in Chavent (2007) |
medoid |
Position of the data point regarded as the medoid of its cluster. |
loc |
y-coordinate of the splitting node to facilitate showing on the
tree. See |
split.order |
Order of the splits. Root is 0, and increasing. |
alt |
Indicator of an alternative cut yielding the same reduction in inertia at that split. |
Value
A tibble with only one row and the correct default data type even for unspecified variables.
References
Chavent, M., Lechevallier, Y., & Briant, O. (2007). DIVCLUS-T: A monothetic divisive hierarchical clustering method. Computational Statistics & Data Analysis, 52(2), 687–701. https://doi.org/10.1016/j.csda.2007.03.013
Permutation Test on Monothetic Tree
Description
Tests the significance of each monothetic clustering split by permutation methods. The "simple-withhold" method ("sw") shuffles the observations between two groups without the splitting variable. The other two methods shuffle the values in the splitting variable to create a new data set, then either split again on that variable ("resplit-limit", "rl") or use all variables as the splitting candidates ("resplit-nolimit", "rn").
Usage
perm.test(
object,
data,
auto.pick = FALSE,
sig.val = 0.05,
method = c("sw", "rl", "rn"),
rep = 1000L,
stat = c("f", "aw"),
bon.adj = TRUE,
ncores = 1L
)
Arguments
object |
The |
data |
The data set which is being clustered. |
auto.pick |
Whether the algorithm stops when p-value becomes larger than
|
sig.val |
Significance value to decide when to stop splitting. This
option is ignored if |
method |
Can be chosen between |
rep |
Number of permutations required to calculate test statistic. |
stat |
Statistic to use. Choosing between |
bon.adj |
Whether to adjust for multiple testing problem using Bonferroni correction. |
ncores |
Number of CPU cores on the current host. When set to NULL, all available cores are used. |
Details
Permutation Methods
Simple-Withhold: Shuffle the observations between two proposed clusters
The stat values calculated from the shuffles create the reference distribution used to find the p-value. Because the splitting variable that was chosen is already the best in terms of reduction of inertia, that variable is withheld from the distance matrix used in the permutation test.
Resplit-Limit: Shuffle splitting variable, split again on that variable
This method shuffles the values of the splitting variable while keeping the other variables fixed to create a new data set; the chosen stat is then calculated for each rep and compared with the observed stat.
Resplit-Nolimit: Shuffle splitting variable, split on all variables
Similar to Resplit-Limit, but all variables are splitting candidates.
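The simple-withhold idea can be sketched in base R as follows. This is a rough illustration and not the code behind perm.test(): sw_p_value() is a hypothetical name, the statistic (mean between-cluster distance over mean within-cluster distance) is a stand-in for the pseudo-F or average silhouette width, and the (1 + count) / (rep + 1) p-value convention is an assumption.

```r
# Hypothetical sketch of a simple-withhold style permutation test.
sw_p_value <- function(data, split_var, clustering, rep = 999) {
  # Withhold the splitting variable from the distance matrix.
  kept <- setdiff(names(data), split_var)
  d <- as.matrix(dist(data[, kept, drop = FALSE]))
  off_diag <- row(d) != col(d)
  stat <- function(cl) {
    same <- outer(cl, cl, "==")
    mean(d[!same]) / mean(d[same & off_diag])
  }
  observed <- stat(clustering)
  # Shuffle observations between the two proposed clusters.
  permuted <- replicate(rep, stat(sample(clustering)))
  (1 + sum(permuted >= observed)) / (rep + 1)
}

set.seed(42)
toy <- data.frame(x = c(1:10, 101:110),
                  y = c(rnorm(10), rnorm(10, mean = 50)))
cl <- rep(1:2, each = 10)
p <- sw_p_value(toy, split_var = "x", clustering = cl, rep = 199)
p
```

Because the two groups also differ strongly on y, shuffled labels almost never separate the data as well as the observed split, and the p-value is small.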
Bonferroni Correction
A hypothesis test that occurs lower in the monothetic clustering tree can have its p-value corrected for the multiple tests performed before it in order to reach that node. The formula is
adj.p = unadj.p \times depth,
where depth is 1 at the root node.
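The adjustment is a one-line computation. In the base R sketch below, adjust_p() is a hypothetical name, and capping the result at 1 is an added assumption that is not stated in the formula above.

```r
# Hypothetical helper: depth-based Bonferroni adjustment, capped at 1
# (the cap is an assumption, not part of the documented formula).
adjust_p <- function(unadj_p, depth) pmin(1, unadj_p * depth)

adjust_p(c(0.01, 0.02, 0.6), depth = c(1, 2, 3))  # 0.01 0.04 1.00
```

A split at the root keeps its p-value unchanged, while a split two levels down must clear a threshold three times stricter.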
Value
The same MonoClust object with an extra column (p-value), as well as the numofclusters object if auto.pick = TRUE.
Note
This function uses foreach::foreach() to facilitate parallel processing. It distributes reps to processes.
References
Calinski, T. and Harabasz, J. (1974). "A dendrite method for cluster analysis". In: Communications in Statistics 3.1, pp. 1-27. doi: 10.1080/03610927408827101.
Rousseeuw, P. J. (1987). "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis". In: Journal of Computational and Applied Mathematics 20, pp. 53-65. ISSN: 03770427. doi: 10.1016/0377-0427(87)90125-7.
Examples
library(cluster)
data(ruspini)
ruspini6sol <- MonoClust(ruspini, nclusters = 6)
ruspini6.p_value <- perm.test(ruspini6sol, data = ruspini, method = "sw",
rep = 1000)
ruspini6.p_value
Plot MonoClust Splitting Rule Tree
Description
Plot the MonoClust tree in the form of a dendrogram.
Usage
## S3 method for class 'MonoClust'
plot(
x,
uniform = FALSE,
branch = 1,
margin = c(0.12, 0.02, 0, 0.05),
minbranch = 0.3,
text = TRUE,
which = 4,
stats = TRUE,
abbrev = c("no", "short", "abbreviate"),
digits = getOption("digits") - 2,
cols = NULL,
col.type = c("l", "p", "b"),
rel.loc.x = TRUE,
show.pval = TRUE,
...
)
Arguments
x |
MonoClust result object. |
uniform |
If TRUE, uniform vertical spacing of the nodes is used; this may be less cluttered when fitting a large plot onto a page. The default is to use a non-uniform spacing proportional to the inertia in the fit. |
branch |
Controls the shape of the branches from parent to child node. Any number from 0 to 1 is allowed. A value of 1 gives square-shouldered branches, a value of 0 gives V-shaped branches, with other values being intermediate. |
margin |
An extra fraction of white space to leave around the borders of the tree. (Long labels sometimes get cut off by the default computation). |
minbranch |
Set the minimum length for a branch to |
text |
Whether to print the labels on the tree. |
which |
Labeling modes, which are:
|
stats |
Whether to show statistics (cluster sizes and medoid points) on the tree. |
abbrev |
Whether to print the abbreviated versions of variable names. Can be either "no" (default), "short", or "abbreviate". Short forms of them can also be used. If "no", the labels recorded in If "short", variable names will be turned into "V1", "V2", ... If "abbreviate", |
digits |
Number of significant digits to print. |
cols |
Whether to show color bars at the leaves. This helps match this tree plot with other plots whose cluster memberships are colored. It only works when |
col.type |
When |
rel.loc.x |
Whether to use the relative distance between clusters as x coordinate of the leaves. Default is TRUE. |
show.pval |
If MonoClust object has been run through |
... |
Arguments to be passed to |
Value
A plot of splitting rule.
Examples
library(cluster)
data(ruspini)
# MonoClust tree
ruspini4sol <- MonoClust(ruspini, nclusters = 4)
plot(ruspini4sol)
# MonoClust tree after permutation test is run
ruspini6sol <- MonoClust(ruspini, nclusters = 6)
ruspini6_test <- perm.test(ruspini6sol,
data = ruspini,
method = "sw",
rep = 1000)
plot(ruspini6_test, branch = 1, uniform = TRUE)
Plot the Mean Square Error with Error Bar for +/- 1 Standard Error
Description
Plot the Mean Square Error with Error Bar for +/- 1 Standard Error
Usage
## S3 method for class 'cv.MonoClust'
plot(
x,
main = "MSE for CV of monothetic clustering",
xlab = "Number of clusters",
ylab = "MSE +/- 1 SE",
type = "b",
lty = 2,
err.col = "red",
err.width = 0.1,
...
)
Arguments
x |
A |
main |
Overall title for the plot. |
xlab |
Title for x axis. |
ylab |
Title for y axis. |
type |
What type of plot should be drawn. See |
lty |
The line type. |
err.col |
Color of the error bars. |
err.width |
Width of the bars. |
... |
Arguments to be passed to |
Value
A line plot with error bars.
See Also
Plot using ggplot2 ggcv()
Examples
library(cluster)
data(ruspini)
# 10-fold cross-validation
cptable <- cv.test(ruspini, minnodes = 2, maxnodes = 4)
plot(cptable)
Calculate Branch Coordinates
Description
Calculate Branch Coordinates
Usage
plot_prep_branch(x, y, node, branch = 0)
Arguments
x |
Nodes x-coordinates. |
y |
Nodes y-coordinates. |
node |
Nodes row number. |
branch |
Controls the shape of the branches from parent to child node. Any number from 0 to 1 is allowed. A value of 1 gives square-shouldered branches, a value of 0 gives V-shaped branches, with other values being intermediate. |
Value
Branch coordinates in a list of x and y axis.
Calculate Nodes Coordinates
Description
Calculate Nodes Coordinates
Usage
plot_prep_node(tree, uniform = FALSE, minbranch = 0.3)
Arguments
tree |
MonoClust result object. |
uniform |
If TRUE, uniform vertical spacing of the nodes is used; this may be less cluttered when fitting a large plot onto a page. The default is to use a non-uniform spacing proportional to the inertia in the fit. |
minbranch |
Set the minimum length for a branch to |
Value
Nodes coordinates in a list of x and y axis.
Plot the monoClust Tree.
Description
This function plots the MonoClust tree. It is partially inspired by the rpart package.
Usage
plot_tree(
x,
uniform = FALSE,
branch = 1,
margin = 0,
minbranch = 0.3,
rel.loc.x = TRUE,
...
)
Arguments
x |
MonoClust result object. |
uniform |
If TRUE, uniform vertical spacing of the nodes is used; this may be less cluttered when fitting a large plot onto a page. The default is to use a non-uniform spacing proportional to the inertia in the fit. |
branch |
Controls the shape of the branches from parent to child node. Any number from 0 to 1 is allowed. A value of 1 gives square-shouldered branches, a value of 0 gives V-shaped branches, with other values being intermediate. |
margin |
An extra fraction of white space to leave around the borders of the tree. (Long labels sometimes get cut off by the default computation). |
minbranch |
Set the minimum length for a branch to |
rel.loc.x |
Whether to use the relative distance between clusters as x coordinate of the leaves. Default is TRUE. |
... |
Arguments to be passed to |
Value
Plot of tree
Predictions from a MonoClust Object
Description
Predict the cluster memberships of a new data set from a MonoClust object.
Usage
## S3 method for class 'MonoClust'
predict(object, newdata, type = c("centroid", "medoid"), ...)
Arguments
object |
MonoClust result object. |
newdata |
Data frame containing the values to be predicted. If missing, the memberships of the MonoClust object are returned. |
type |
Type of returned cluster representatives. Either "centroid" or "medoid". |
... |
Further arguments passed to or from other methods. |
Value
A tibble with the cluster index in cname and either the centroid values or the medoid observation indices, depending on the value of the type argument.
Examples
library(cluster)
data(ruspini)
set.seed(1234)
test_index <- sample(1:nrow(ruspini), nrow(ruspini)/5)
train_index <- setdiff(1:nrow(ruspini), test_index)
ruspini_train <- ruspini[train_index, ]
ruspini_test <- ruspini[test_index, ]
ruspini_train_4sol <- MonoClust(ruspini_train, nclusters = 4)
predict(ruspini_train_4sol, newdata = ruspini_test)
Print Monothetic Clustering Results
Description
Render the MonoClust split tree in an easy-to-read format with important information such as terminal nodes, p-value (if possible), etc.
Usage
## S3 method for class 'MonoClust'
print(
x,
abbrev = c("no", "short", "abbreviate"),
spaces = 2L,
digits = getOption("digits"),
...
)
Arguments
x |
MonoClust result object. |
abbrev |
Whether to print the abbreviated versions of variable names. Can be either "no" (default), "short", or "abbreviate". Short forms of them can also be used. If "no", the labels recorded in If "short", variable names will be turned into "V1", "V2", ... If "abbreviate", |
spaces |
Number of spaces to indent between two tree levels. |
digits |
Number of significant digits to print. |
... |
Optional arguments to |
Value
A nicely displayed MonoClust split tree.
Examples
library(cluster)
data(ruspini)
ruspini4sol <- MonoClust(ruspini, nclusters = 4)
print(ruspini4sol, digits = 2)
Print MonoClust Cross-Validation Result
Description
Print MonoClust Cross-Validation Result
Usage
## S3 method for class 'cv.MonoClust'
print(x, ...)
Arguments
x |
A cv.MonoClust object. |
... |
Further arguments passed to or from other methods. |
Examples
library(cluster)
data(ruspini)
# 10-fold cross-validation
cp_table <- cv.test(ruspini, minnodes = 2, maxnodes = 4)
print(cp_table)
Split Function
Description
Given split_row, the row position in the cluster frame at which to split, this function performs the split and calculates all necessary information for the splitting tree and the cluster memberships.
Usage
splitter(data, cuts, split_row, frame, cloc, dist, split_order = 0L)
Arguments
data |
Original data set. |
cuts |
Cuts data set, which has the next higher value of each variable in the original data set. |
split_row |
The row index in frame that would be split on. |
frame |
The split tree transferred as data frame. |
cloc |
Vector of current cluster membership. |
dist |
Distance matrix of all observations in the data. |
split_order |
Control argument that tracks how many splits have been done. |
Value
Updated frame and cloc, saved in a list.
Hypothesis Test at Split
Description
Hypothesis Test at Split
Usage
test_split(members_l, members_r, method, data, split_var, rep, stat, ncores)
Arguments
members_l , members_r |
Vector of the index of observations that are members of the left child node and the right child node, respectively. |
method |
Can be chosen between |
data |
The data set which is being clustered. |
split_var |
Splitting variable at current split. |
rep |
Number of permutations required to calculate test statistic. |
stat |
Statistic to use. Choosing between |
ncores |
Number of CPU cores on the current host. When set to NULL, all available cores are used. |
Value
p-value of the test
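As a rough illustration of how a permutation-based p-value can be obtained from rep label shuffles, here is a minimal sketch in base R. It is not the package's implementation: the helper name perm_pvalue, the single numeric variable, and the choice of an absolute difference in group means as the statistic are all assumptions made for the example.

```r
# Hypothetical helper (not part of monoClust): permutation p-value for the
# absolute difference in means between two non-empty groups of observations.
perm_pvalue <- function(x, members_l, members_r, rep = 999L) {
  values <- x[c(members_l, members_r)]
  left <- seq_len(length(members_l))

  # Statistic on the observed left/right assignment.
  observed <- abs(mean(values[left]) - mean(values[-left]))

  # Statistic recomputed after shuffling the values across the two groups.
  permuted <- replicate(rep, {
    shuffled <- sample(values)
    abs(mean(shuffled[left]) - mean(shuffled[-left]))
  })

  # Count the observed assignment itself so the p-value is never exactly 0.
  (sum(permuted >= observed) + 1) / (rep + 1)
}

set.seed(1)
x <- c(rnorm(20, mean = 0), rnorm(20, mean = 5))
p_separated <- perm_pvalue(x, members_l = 1:20, members_r = 21:40)
```

With well-separated groups such as these, the returned p-value should be small; the smallest attainable value is 1 / (rep + 1).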
Implementation of Print Labels on MonoClust Tree
Description
This function plots the labels onto the MonoClust tree. It is partially inspired by the rpart package.
Usage
text_tree(
x,
which = 4,
digits = getOption("digits") - 2,
stats = TRUE,
abbrev,
cols = NULL,
cols.type = c("l", "p", "b"),
rel.loc.x = TRUE,
show.pval = TRUE,
uniform = FALSE,
minbranch = 0.3,
...
)
Arguments
x |
MonoClust result object. |
which |
Labeling modes, which are:
|
digits |
Number of significant digits to print. |
stats |
Whether to show statistics (cluster sizes and medoid points) on the tree. |
abbrev |
Whether to print the abbreviated versions of variable names. Can be either "no" (default), "short", or "abbreviate". Short forms of them can also be used. If "no", the labels recorded in If "short", variable names will be turned into "V1", "V2", ... If "abbreviate", |
cols |
Whether to show color bars at the leaves. This helps match this tree plot with other plots whose cluster memberships are colored. It only works when |
rel.loc.x |
Whether to use the relative distance between clusters as x coordinate of the leaves. Default is TRUE. |
show.pval |
If MonoClust object has been run through |
uniform |
If TRUE, uniform vertical spacing of the nodes is used; this may be less cluttered when fitting a large plot onto a page. The default is to use a non-uniform spacing proportional to the inertia in the fit. |
minbranch |
Set the minimum length for a branch to |
... |
Extra arguments that would be transferred to |
Value
Labels on tree.
Transform Between Degree and Radian
Description
This function transforms a circular angle from degree to radian or from radian to degree.
Usage
torad(x)
todeg(x)
Arguments
x |
A degree value for torad, or a radian value for todeg. |
Value
A radian value for torad, or a degree value for todeg.
Examples
torad(90)
torad(-45)
todeg(pi/2)
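For reference, the underlying conversion is the standard one: multiply by pi / 180 to go from degrees to radians, and by 180 / pi to go back. The sketch below uses the hypothetical names deg_to_rad and rad_to_deg to avoid implying anything about how torad and todeg are written internally.

```r
# Hypothetical stand-ins illustrating the standard degree/radian conversion.
deg_to_rad <- function(deg) deg * pi / 180
rad_to_deg <- function(rad) rad * 180 / pi

quarter_turn <- deg_to_rad(90)              # a quarter turn, i.e. pi / 2
round_trip <- rad_to_deg(deg_to_rad(-45))   # converting back recovers -45
```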
Find Tree Depth Based on Node Indexes
Description
Find Tree Depth Based on Node Indexes
Usage
tree_depth(nodes)
Arguments
nodes |
Vector of node indexes in the tree. |
Details
When the MonoClust tree is built, a split node with index k receives children with indexes 2k (left) and 2k + 1 (right). The depth of a node can therefore be recovered by back-transforming its index with a base-2 logarithm.
Value
Depth of each node, where 0 is the root relative to the input.
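The back-transform described in Details can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the package source: the name node_depth is hypothetical, the root is assumed to have index 1, and "relative to the input" is read as shifting the depths so that the shallowest supplied node gets depth 0.

```r
# Hypothetical sketch: depth from node index, assuming the root has index 1
# and children of node k are indexed 2k (left) and 2k + 1 (right).
node_depth <- function(nodes) {
  # floor(log2(k)) is the absolute depth; the small tolerance guards
  # against floating-point error at exact powers of two.
  depth <- floor(log2(nodes) + 1e-7)
  # Shift so the shallowest node in the input is treated as depth 0.
  depth - min(depth)
}

full_tree <- node_depth(1:7)        # root, two children, four grandchildren
subtree <- node_depth(c(2, 4, 5))   # node 2 acts as the root of its subtree
```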
Traverse a Tree to Find the Leaves (Terminal Nodes)
Description
Traverse a Tree to Find the Leaves (Terminal Nodes)
Usage
tree_walk(new_point, jump_table)
Arguments
new_point |
New data point |
jump_table |
Jump table |
Value
The index of the terminal node after traversing the new data point on the tree.
Existence of Microorganisms Carried in Wind
Description
This data set is part of a study on microorganisms carried in strong föhn winds at the Bonney Riegel location of Taylor Valley, an ice-free area of the Antarctic continent. Wind direction and wind speed data were obtained from the meteorological station. Wind direction was recorded every 30 seconds and wind speed every 4 seconds at 1.15 meters above the ground surface. The recorded wind directions and speeds were averaged over 15-minute intervals. For wind direction, winds from the north are defined as 0/360 degrees and winds from the east as 90 degrees. The 2007 data were collected August 4–11, 2007.
Usage
wind_sensit_2007
Format
A data frame with 671 rows and 3 variables:
- has.sensit
A binary variable of the existence of particles in the wind (1) or not (0).
- WS
Wind speed measured in m/s.
- WDIR
Wind direction in degrees, where 0 indicates "from the north" and 90 indicates "from the east".
Source
Sabacka, M., Priscu, J. C., Basagic, H. J., Fountain, A. G., Wall, D. H., Virginia, R. A., and Greenwood, M. C. (2012). "Aeolian flux of biotic and abiotic material in Taylor Valley, Antarctica". In: Geomorphology 155-156, pp. 102-111. issn: 0169555X. doi: 10.1016/j.geomorph.2011.12.009.
Existence of Microorganisms Carried in Wind
Description
This data set is part of a study on microorganisms carried in strong föhn winds at the Bonney Riegel location of Taylor Valley, an ice-free area of the Antarctic continent. Wind direction and wind speed data were obtained from the meteorological station. Wind direction was recorded every 30 seconds and wind speed every 4 seconds at 1.15 meters above the ground surface. The recorded wind directions and speeds were averaged over 15-minute intervals. For wind direction, winds from the north are defined as 0/360 degrees and winds from the east as 90 degrees. The 2008 data were collected July 7–14, 2008.
Usage
wind_sensit_2008
Format
A data frame with 673 rows and 3 variables:
- has.sensit
A binary variable of the existence of particles in the wind (1) or not (0).
- WS
Wind speed measured in m/s.
- WDIR
Wind direction in degrees, where 0 indicates "from the north" and 90 indicates "from the east".
Source
Sabacka, M., Priscu, J. C., Basagic, H. J., Fountain, A. G., Wall, D. H., Virginia, R. A., and Greenwood, M. C. (2012). "Aeolian flux of biotic and abiotic material in Taylor Valley, Antarctica". In: Geomorphology 155-156, pp. 102-111. issn: 0169555X. doi: 10.1016/j.geomorph.2011.12.009.