Help for package mulgar

Title:

Functions for Pre-Processing Data for Multivariate Data Visualisation using Tours

Version:

1.0.5

Description:

This is a companion to the book Cook, D. and Laa, U. (2023) https://dicook.github.io/mulgar_book/ "Interactively exploring high-dimensional data and models in R". by Cook and Laa. It contains useful functions for processing data in preparation for visualising with a tour. There are also several sample data sets.

Depends:

R (≥ 4.1.0)

Imports:

geozoo, tibble, ggplot2, tidyr, dplyr, purrr, stats, methods

Suggests:

tourr, ggdendro, colorspace, mclust, kohonen, GGally

License:

MIT + file LICENSE

Encoding:

UTF-8

LazyData:

true

LazyDataCompression:

bzip2

RoxygenNote:

7.3.2

URL:

https://dicook.github.io/mulgar/, https://github.com/dicook/mulgar

BugReports:

https://github.com/dicook/mulgar/issues

NeedsCompilation:

Packaged:

2025-04-07 01:55:56 UTC; cookd

Author:

Dianne Cook

[aut, cre], Ursula Laa

[aut]

Maintainer:

Dianne Cook <dicook@monash.edu>

Repository:

CRAN

Date/Publication:

2025-04-07 02:30:02 UTC

mulgar: Functions for Pre-Processing Data for Multivariate Data Visualisation using Tours

Description

Author(s)

Maintainer: Dianne Cook dicook@monash.edu (ORCID)

Authors:

Ursula Laa ursula.laa@boku.ac.at (ORCID)

AFLW player statistics

Description

This is data from the 2021 Women's Australian Football League. These are average player statistics across the season, with game statistics provided by the fitzRoy package. If you are new to the game of AFL, there is a nice explanation on Wikipedia. The primary analysis is to summarise the variation using principal component analysis, which gives information about relationships between the statistics or skills sets common in players. One also might be tempted to cluster the players, but there are no obvious clusters so it could be frustrating. At best one could partition the players into groups, while recognising there are no absolutely distinct and separated groups.

Format

A dataset with 381 rows and 35 columns

Details

id, given_name, surname, number, position, team: player identification details
time_pct, ..., clearances: player statistics for the match

Examples


require(dplyr)
data(aflw)
glimpse(aflw)

Data sets with anomalies

Description

Simulated data with anomalies

Usage

anomaly1

anomaly2

anomaly3

anomaly4

anomaly5

Format

A datasets with anomalies

Details

x1, x2, x3, x4: numeric variables

Source

Created by Di Cook.

Examples

require(GGally)
data(anomaly1)
data(anomaly2)
ggscatmat(anomaly1)
ggscatmat(anomaly2)

Data sets with different types of association

Description

Simulated data with various associations

Usage

assoc1

assoc2

assoc3

Format

A datasets with various association

Details

x1, x2, x3, x4: numeric variables

Examples

require(GGally)
data(assoc1)
ggscatmat(assoc1)

3D plane in 5D

Description

This data is simulated to use for testing. It has three dimensions of variability and two of noise. It is created from a 3 factor model. All variables are linearly associated.

Format

A dataset with 200 rows and 5 columns

Details

x1, x2, x3, x4, x5: five numeric variables

Examples


box_pca <- prcomp(box)
ggscree(box_pca)

Australian bushfires 2019-2020

Description

This data was collated by Weihao (Patrick) Li as part of his Honours research at Monash University. It contains fire ignitions as detected from satellite hotspots, and processed using the spotoroo package, augmented with measurements on weather, vegetation, proximity to human activity. The cause variable is predicted based on historical fire ignition data collected by County Fire Authority personnel.

Format

A dataset with 1021 rows and 60 columns

Details

id, lon, lat, time: unique id, and spatiotemporal information for each fire ignition
FOR_CODE, FOR_TYPE, COVER, HEIGHT, FOREST: vegetation variables
rf, arf7-arf720: average rainfall, on that day, and over last 7, ..., 720 days
se, ase7-ase720: solar exposure, on that day, and over last 7, ..., 720 days
maxt, amaxt7-amaxt720: max temperature, on that day, and over last 7, ..., 720 days
mint, amint7-amint720: min temperature, on that day, and over last 7, ..., 720 days
ws, aws_m0-aws_m24: average wind speed, on that day, and for last 1-24 months
dist_road, log_dist_road: distance to nearest road
dist_cfa, log_dist_cfa: distance to nearest county fire authority facility
dist_camp, log_dist_camp: distance to nearest camp site
cause: predicted ignition cause, accident, arson, burning_off, lightning

Examples


require(dplyr)
data(bushfires)
glimpse(bushfires)

Compute Mahalanobis distances between all pairs of observations

Description

For a data matrix, compute the sample variance-covariance, which is used to compute the Mahalanobis distance.

Usage

calc_mv_dist(x)

Arguments

x

multivariate data set

Details

This is useful for checking distance arise from a multivariate normal sample.

Value

vector of length n

Examples

require(ggplot2)
require(tibble)
data(aflw)
aflw_std <- apply(aflw[,7:35], 2, function(x)
                    (x-mean(x, na.rm=TRUE))/
							       sd(x, na.rm=TRUE))
d <- calc_mv_dist(aflw_std[,c("goals","behinds",
                               "kicks","disposals")])
d <- as_tibble(d, .name_repair="minimal")
ggplot(d, aes(x=value)) + geom_histogram()

Calculate the norm of a vector

Description

Returns the square root of the sum of squares of a vector

Usage

calc_norm(x)

Arguments

x

numeric vector

Value

numeric value

Examples

x <- rnorm(5)
calc_norm(x)

Cluster challenge data sets

Description

Simulated data with different structures

Usage

c1

c2

c3

c4

c5

c6

c7

Format

A datasets with differing number of rows and columns

Details

x1, x2, ...: numeric variables

Source

Created by Di Cook.

Examples

require(ggplot2)
data(c1)
ggplot(c1, aes(x=x1, y=x2)) +
  geom_point() + theme(aspect.ratio=1)

Three clusters in 5D

Description

This data is simulated to use for testing. It has three elliptical clusters in mostly variables 2 and 4. They are not equidistant.

Format

A dataset with 300 rows and 6 columns

Details

x1, x2, x3, x4, x5: five numeric variables
cl: class variable

Examples


clusters_pca <- prcomp(clusters[,1:5])
ggscree(clusters_pca)

Four unusually shaped clusters in 4D

Description

This data is simulated to use for testing. It has two small spherical clusters, and a curve cluster and a sine wave cluster.

Format

A dataset with 300 rows and 6 columns

Details

x1, x2, x3, x4: five numeric variables

Examples


require(ggplot2)
ggplot(clusters_nonlin, aes(x=x1, y=x2)) +
  geom_point() +
  theme(aspect.ratio=1)

This function turns a projection sequence into a tibble

Description

Take an array of a projection sequence, and turn into a tibble with numbered projections

Usage

convert_proj_tibble(t1)

Arguments

t1

tour projection sequence

Value

tbl1 tibble

Examples


require(tourr)
t1 <- interpolate(save_history(flea[, 1:6], grand_tour(4), max = 2))
tbl1 <- convert_proj_tibble(t1)

Data sets generated using copulas

Description

Simulated data from covsim, using different copula models

Usage

copclayton

copjoe

copfrank

copnorm

Format

A datasets with various association

Details

x1, x2, x3, x4, x5: numeric variables

Examples

require(GGally)
data(copclayton)
ggscatmat(copclayton)

Generate points on the surface of an ellipse

Description

This function generates points by transforming points on the surface of a sphere.

Usage

gen_vc_ellipse(vc, xm = rep(0, ncol(vc)), n = 500)

Arguments

vc

symmetric square matrix describing the variance-covariance matrix which defines the shape of the ellipse.

xm

center of the ellipse, a vector of length equal to the dimension of vc

n

number of points to generate

Value

matrix of size n x p

Examples

require(ggplot2)
require(tibble)
ell2d <- gen_vc_ellipse(vc = matrix(c(4, 2, 2, 6),
                        ncol=2, byrow=TRUE),
                        xm = c(1,1))
ell2d <- as_tibble(ell2d)
ggplot(ell2d, aes(x = V1, y = V2)) + geom_point() +
  theme(aspect.ratio=1)

Ellipse matching data center and variance

Description

This function generates points on the surface of an ellipse with the same center and variance-covariance of the provided data.

Usage

gen_xvar_ellipse(x, n = 100, nstd = 1)

Arguments

x

multivariate data set.

n

number of points to generate

nstd

scale factor for size of ellipse, in terms of number of standard deviations

Details

This is useful for checking the equal variance-covariance assumption from linear discriminant analysis.

Value

matrix of size n x p

Examples

data(aflw)
aflw_vc <- gen_xvar_ellipse(aflw[,c("goals","behinds",
                               "kicks","disposals")], n=500)
require(ggplot2)
ggplot(aflw_vc, aes(x=goals, y=behinds)) + geom_point() +
  theme(aspect.ratio=1)
if (interactive()) {
  require(tourr)
  animate_slice(aflw_vc, rescale=TRUE, v_rel=0.02)
  aflw_all <- rbind(aflw_vc, aflw[,c("goals","behinds",
                               "kicks","disposals")])
  clrs <- c(rep("orange", 500), rep("black", nrow(aflw)))
  animate_xy(aflw_all, col=clrs)
}

Produces an mclust summary plot with ggplot

Description

Takes data returned by mclustBIC(), converts to a tibble for plotting.

Usage

ggmcbic(mc, cl = 1:nrow(mc), top = ncol(mc))

Arguments

mc

mclustBIC object

cl

subset of clusters to show

top

number to indicate how many models to show, default "all"

Value

mc_bic a ggplot object

Examples

require(mclust)
data(clusters)
clusters_BIC <- mclustBIC(clusters[,1:5], G=2:6)
ggmcbic(clusters_BIC)
ggmcbic(clusters_BIC, top=4)

data(simple_clusters)
clusters_BIC <- mclustBIC(simple_clusters[,1:2])
ggmcbic(clusters_BIC, cl=2:5, top=3)

This function produces a simple scree plot

Description

Takes a PCA object returned by prcomp(), extracts the standard deviations of the principal components (PC), and plots these against the PC number. The guidance line assumes that all of the variables have been standardised prior to PCA.

Usage

ggscree(pc, q = 2, guide = TRUE, cumulative = FALSE)

Arguments

pc

PCA object

q

number of principal components to show, default 2 (you should change)

guide

logical whether to compute and add a typical value of the variance, if the data was full-dimensional

cumulative

logical whether to draw cumulative variance

Value

scree a ggplot object

Examples

data(aflw)
aflw_std <- apply(aflw[,7:35], 2, function(x)
                    (x-mean(x, na.rm=TRUE))/
							       sd(x, na.rm=TRUE))
aflw_pca <- prcomp(aflw_std[,c("goals","behinds",
                               "kicks","disposals")])
ggscree(aflw_pca, q=3)

Generate an axis-parallel slice display

Description

Following the slice definition available in tourr this function returns a ggplot2 display of a slice defined via the projection onto two of the variables. Note that because the underlying function works with any projection, the axis labels need to be set by the user.

Usage

ggslice(data, h, v1 = 1, v2 = 2, center = NULL, col = NULL)

Arguments

data

data frame containing only variables used for the display

h

slice thickness

v1

column number of variable mapped to x-axis

v2

column number of variable mapped to y-axis

center

center point vector used for anchoring the slice, if NULL the mean of the data is used

col

grouping vector mapped to color in the display

Value

ggplot2 object showing the sliced data

Examples

d <- geozoo::sphere.hollow(4, 1000)$points
ggslice(d, 0.3, 1, 2)
ggslice(d, 0.3, 1, 2, center = c(0, 0, 0.7, 0))

Generate slice display

Description

Generate slice display

Usage

ggslice_projection(data, h, proj, center = NULL, col = NULL)

Arguments

data

data frame containing only variables used for the display

h

slice thickness

proj

projection matrix from p to 2 dimensions

center

center point vector used for anchoring the slice, if NULL the mean of the data is used

col

grouping vector mapped to color in the display

Value

ggplot2 object showing the sliced data

Examples

d <- geozoo::sphere.hollow(4, 1000)$points
ggslice_projection(d, 0.3, tourr::basis_random(4))
ggslice_projection(d, 0.3, tourr::basis_random(4),
                   center = c(0.4, 0.4, 0.4, 0.4))

Generate a dendrogram to be added to data

Description

Supplements a data set with information needed to draw a dendrogram. Intermediate cluster nodes are added as needed, and positioned at the centroid of the combined clusters. Note that categorical variables need to be factors.

Usage

hierfly(data, h = NULL, metric = "euclidean", method = "ward.D2", scale = TRUE)

Arguments

data

data set

h

an hclust object

metric

distance metric to use, see dist for list of possibilities

method

cluster distance measure to use, see hclust for details

scale

logical value whether to scale data or not, default TRUE

Value

list with data and edges and segments

Examples

data(clusters)
cl_dist <- dist(clusters[,1:5])
cl_hw <- hclust(cl_dist, method="ward.D2")
require(ggdendro)
ggdendrogram(cl_hw, type = "triangle", labels = FALSE)
clusters$clw <- factor(cutree(cl_hw, 3))
cl_hfly <- hierfly(clusters, cl_hw, scale=FALSE)
if (interactive()) {
  require(tourr)
  glyphs <- c(16, 46)
  pch <- glyphs[cl_hfly$data$node+1]
  require(colorspace)
  clrs <- heat_hcl(length(unique(cl_hfly$data$clw)))
  pcol <- clrs[cl_hfly$data$clw]
  ecol <- clrs[cl_hfly$data$clw[cl_hfly$edges[,1]]]
  animate_xy(cl_hfly$data[,1:5], edges=cl_hfly$edges,
    col=pcol, pch=pch, edges.col=ecol,
    axes="bottomleft")
}

Computes the ellipses of an mclust model

Description

Takes data returned by Mclust(), extracts parameter estimates, and computes points on ellipses.

Usage

mc_ellipse(mc, npts = 100)

Arguments

mc

Mclust object

npts

Number of points to simulate for each cluster, default 100

Value

mc_ellipses data frame

Examples

require(mclust)
data(simple_clusters)
clusters_mc <- Mclust(simple_clusters[,1:2],
                     G=2,
                     modelname="EEI")
mce <- mc_ellipse(clusters_mc, npts=400)
require(ggplot2)
sc <- simple_clusters
sc$cl <- factor(clusters_mc$classification)
ggplot() +
  geom_point(data=sc, aes(x=x1, y=x2, colour=cl)) +
  geom_point(data=mce$ell, aes(x=x1, y=x2, colour=cl), shape=4) +
  geom_point(data=mce$mn, aes(x=x1, y=x2, colour=cl), shape=3) +
  theme(aspect.ratio=1, legend.position="none")

Multiple clusters of different sizes, shapes and distance from each other

Description

This data is originally from http://ifs.tuwien.ac.at/dm/download/multiChallenge-matrix.txt, and provided as a challenge for non-linear dimension reduction.It was used as an example in Lee, Laa, Cook (2023) https://doi.org/10.52933/jdssv.v2i3.

Format

A dataset with 400 rows and 11 columns

Details

group: cluster label
x1, ... x10: numeric variables

Examples


require(ggplot2)
ggplot(multicluster, aes(x=x1, y=x2)) +
  geom_point() + theme(aspect.ratio=1)

Normalise a vector to have length 1

Description

Returns the normalised vector, where the sum of squares is equal to 1

Usage

norm_vec(x)

Arguments

x

numeric vector

Value

numeric vector

Examples

x <- rnorm(5)
norm_vec(x)

Create wire frame of PCA model

Description

This function takes the PCA and produces a wire frame of the PCA to examine with the data in a tour. The purpose is to see how well the variance is explained. The model will be centered at the mean, and extend 3 SDs towards the edge of the data, which is assuming that the data is standardised.

Usage

pca_model(pc, d = 2, s = 1)

Arguments

pc

PCA object

d

number of dimensions to use, default=2

s

scale model, default=1

Value

a list of points and edges

Examples

data(plane)
plane_pca <- prcomp(plane)
plane_m <- pca_model(plane_pca)
plane_m_d <- rbind(plane_m$points, plane)
if (interactive()) {
  require(tourr)
  animate_xy(plane_m_d, edges=plane_m$edges, axes="bottomleft")
}

PISA scores

Description

This is data from the 2018 testing, available from https://webfs.oecd.org/pisa2018/SPSS_STU_QQQ.zip. A subset of the data containing samples from Australia and Indonesia, and the simulated scores for math, reading and science.

Format

A data set with 10548 rows and 31 columns

Details

CNT: Country (Australia, Indonesia)
PV1MATH-PV10SCIE: simulated scores for math, reading and science

Examples


require(dplyr)
data(pisa)
pisa |> count(CNT)

2D plane in 5D

Description

This data is simulated to use for testing. It has two dimensions of variability and three of noise. It is created from a 2 factor model, where all variables are related.

Format

A data set with 100 rows and 5 columns

Details

x1, x2, x3, x4, x5: five numeric variables

Examples


plane_pca <- prcomp(plane)
ggscree(plane_pca)

Non-linear relationship in 5D

Description

This data is simulated to use for testing. It has three dimensions of variability and two of noise. It is created from a 2 factor non-linear model. All variables are associated.

Format

A dataset with 100 rows and 5 columns

Details

x1, x2, x3, x4, x5: five numeric variables

Examples


plane_nonlin_pca <- prcomp(plane_nonlin)
ggscree(plane_nonlin_pca)

Compute pooled variance-covariance matrix

Description

This function computes the group variance-covariance matrices, and produces a weighted average. It is useful for examining the linear discriminant analysis model.

Usage

pooled_vc(x, cl, prior = rep(1/length(unique(cl)), length(unique(cl))))

Arguments

x

multivariate data set, matrix.

cl

class variable

prior

prior probability for each class, must sum to 1, default all equal

Value

matrix

Examples

data(clusters)
pooled_vc(clusters[,1:5], clusters$cl)

Generate a sample from a multivariate normal

Description

This function generates a sample of size n from a multivariate normal distribution

Usage

rmvn(n = 100, p = 5, mn = rep(0, p), vc = diag(rep(1, p)))

Arguments

n

number of points to generate

p

dimension

mn

mean of the distribution, a vector of length equal to the dimension of vc

vc

symmetric square matrix describing the variance-covariance matrix which defines the shape of the ellipse.

Value

matrix of size n x p

Examples

require(ggplot2)
d <- mulgar::rmvn(n=100, p=2, mn = c(1,1),
                  vc = matrix(c(4, 2, 2, 6),
                         ncol=2, byrow=TRUE))
ggplot(data.frame(d), aes(x = x1, y = x2)) +
  geom_point() + theme(aspect.ratio=1)

Two clusters in 2D

Description

This data is simulated to use for testing. It has two spherical clusters, and two variables.

Format

A dataset with 137 rows and 3 columns

Details

x1, x2: two numeric variables
cl: class variable

Examples


require(ggplot2)
ggplot(simple_clusters, aes(x=x1, y=x2)) +
  geom_point() + theme(aspect.ratio=1)

Images of sketches for testing

Description

This data is a subset of images from https://quickdraw.withgoogle.com The subset was created using the quickdraw R package at https://huizezhang-sherry.github.io/quickdraw/. It has 6 different groups: banana, boomerang, cactus, flip flops, kangaroo. Each image is 28x28 pixels.

Format

A data frame with 1200 rows and 786 columns

Details

V1-V784: grey scale 0-255
word: all NA, you need to predict this
id: unique id for each sketch

Examples


require(ggplot2)
data("sketches_test")
x <- sketches_test[sample(1:nrow(sketches_test), 1), ]
xm <- data.frame(gry=t(as.matrix(x[,1:784])),
        x=rep(1:28, 28),
        y=rep(28:1, rep(28, 28)))
ggplot(xm, aes(x=x, y=y, fill=gry)) +
  geom_tile() +
  scale_fill_gradientn(colors = gray.colors(256, start = 0, end = 1, rev = TRUE )) +
  theme_void() + theme(legend.position="none")

Images of sketches for training

Description

Format

A data frame with 5998 rows and 786 columns

Details

V1-V784: grey scale 0-255
word: what the person was asked to draw
id: unique id for each sketch

Examples


require(ggplot2)
data("sketches_train")
x <- sketches_train[sample(1:nrow(sketches_train), 1), ]
# print(x$word)
xm <- data.frame(gry=t(as.matrix(x[,1:784])),
        x=rep(1:28, 28),
        y=rep(28:1, rep(28, 28)))
ggplot(xm, aes(x=x, y=y, fill=gry)) +
  geom_tile() +
  scale_fill_gradientn(colors = gray.colors(256, start = 0, end = 1, rev = TRUE )) +
  theme_void() + theme(legend.position="none")

Process the output from SOM to display the map and data

Description

This function generates a grid of points to match the nodes from the self-organising map (SOM), and jitters points from the data so they can be seen relative to the grid. This allows the clustering of points by SOM to be inspected.

Usage

som_model(x_som, j_val = 0.5)

Arguments

x_som

object returned by kohonen::som

j_val

amount of jitter, should range from 0-1, default 0.3

Value

data this object contains
- original variables from the data
- map1, map2 location of observations in 2D som map, jittered
- distance distances between observations and the closest node
- id row id of data
net this object contains
- values of the nodes in the high-d space
- map1, map2 nodes of the som net
- distance distances between observations and the closest node
- id row id of net
edges from, to specifying row ids of net to connect with lines
edges_s x, xend, y, yend for segments to draw lines to form 2D map

Examples

require(kohonen)
data(clusters)
c_grid <- kohonen::somgrid(xdim = 5, ydim = 5,
  topo = 'rectangular')
c_som <- kohonen::som(as.matrix(clusters[,1:5]), grid = c_grid)
c_data_net <- som_model(c_som)
require(ggplot2)
ggplot() +
  geom_segment(data=c_data_net$edges_s,
    aes(x=x, xend=xend, y=y, yend=yend)) +
  geom_point(data=c_data_net$data, aes(x=map1, y=map2),
    colour="orange", size=2, alpha=0.5)

mulgar: Functions for Pre-Processing Data for Multivariate Data Visualisation using Tours

Description

Author(s)

See Also

AFLW player statistics

Description

Format

Details

Examples

Data sets with anomalies

Description

Usage

Format

Details

Source

Examples

Data sets with different types of association

Description

Usage

Format

Details

Examples

3D plane in 5D

Description

Format

Details

See Also

Examples

Australian bushfires 2019-2020

Description

Format

Details

Examples

Compute Mahalanobis distances between all pairs of observations

Description

Usage

Arguments

Details

Value

Examples

Calculate the norm of a vector

Description

Usage

Arguments

Value

Examples

Cluster challenge data sets

Description

Usage

Format

Details

Source

Examples

Three clusters in 5D

Description

Format

Details

See Also

Examples

Four unusually shaped clusters in 4D

Description

Format

Details

See Also

Examples

This function turns a projection sequence into a tibble

Description

Usage

Arguments

Value

Examples

Data sets generated using copulas

Description

Usage

Format

Details

Examples

Generate points on the surface of an ellipse

Description

Usage