Title: | Turnkey Visualisations for Exploratory Data Analysis |
Version: | 0.1.0 |
Description: | Provides interactive visualisations for exploratory data analysis of high-dimensional datasets. Includes parallel coordinate plots for exploring large datasets with mostly quantitative features, but also stacked one-dimensional visualisations that more effectively show missingness and complex categorical relationships in smaller datasets. |
License: | MIT + file LICENSE |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
URL: | https://github.com/CCICB/ggEDA, https://ccicb.github.io/ggEDA/ |
BugReports: | https://github.com/CCICB/ggEDA/issues |
Imports: | assertions (≥ 0.2.0), cli, ggiraph (≥ 0.8.11), ggplot2, ggtext, grDevices, patchwork (≥ 1.3.0), rank (≥ 0.1.1), rlang |
Suggests: | covr, infotheo, knitr, rmarkdown, testthat (≥ 3.0.0), TSP |
Config/Needs/website: | uwot, datarium, palmerpenguins |
Config/testthat/edition: | 3 |
Depends: | R (≥ 3.5) |
LazyData: | true |
NeedsCompilation: | no |
Packaged: | 2025-05-05 07:54:36 UTC; selkamand |
Author: | Sam El-Kamand |
Maintainer: | Sam El-Kamand <sam.elkamand@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2025-05-07 12:00:02 UTC |
ggEDA: Turnkey Visualisations for Exploratory Data Analysis
Description
Provides interactive visualisations for exploratory data analysis of high-dimensional datasets. Includes parallel coordinate plots for exploring large datasets with mostly quantitative features, but also stacked one-dimensional visualisations that more effectively show missingness and complex categorical relationships in smaller datasets.
Author(s)
Maintainer: Sam El-Kamand sam.elkamand@gmail.com (ORCID)
Other contributors:
Children's Cancer Institute Australia [copyright holder]
See Also
Useful links:
Report bugs at https://github.com/CCICB/ggEDA/issues
Baseball Fans Dataset
Description
An artificially generated dataset describing basic demographics and accessorization choices of baseball fans as part of a hypothetical market research study from stadium merchandise vendors. None of the data are real; they were produced for illustrative and testing purposes only.
Usage
baseballfans
Format
baseballfans
A data frame with 19 rows and 10 columns:
- ID
Unique integer identifier for each individual.
- Age
Age in years at time of observation.
- Gender
Self‐reported gender (“Male” or “Female”).
- EyeColour
Eye color (“Brown”, “Green”, “Blue”), or missing (NA) if not recorded.
- Height
Height in centimeters; missing (NA) if not recorded.
- HairColour
Hair color (“Black”, “Blond”, “Red”, “Brown”).
- Glasses
Logical flag (TRUE/FALSE) indicating whether the individual wears glasses.
- WearingHat
Logical flag (TRUE/FALSE) indicating whether the individual is wearing a hat.
- WearingHat_tooltip
Type of hat worn, if any (e.g., “baseball cap”, “stetson”, “fedora”, “top hat”); empty when WearingHat == FALSE.
- Date
Date of observation in day/month/year format (e.g., 9/05/2023). Stored as a character vector.
Source
Synthetic data; no real persons were observed.
Details
This mock dataset was created to demonstrate ggEDA functionality. All entries are fictional.
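Because Date is stored as a character vector in day/month/year format, it usually needs converting before any date arithmetic. The following is a minimal base R sketch; it uses the example value from the column description plus one made-up value, not the dataset itself.

```r
# First value is the example from the column description (day/month/year);
# the second value is made up for illustration.
date_chr <- c("9/05/2023", "21/11/2022")

# %d and %m accept single-digit days and months when parsing.
date_parsed <- as.Date(date_chr, format = "%d/%m/%Y")

# ISO 8601 representation of the parsed dates
format(date_parsed, "%Y-%m-%d")
```

The same call should work on the real column, e.g. `as.Date(baseballfans$Date, format = "%d/%m/%Y")`, assuming every non-missing entry follows the documented format.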
Make strings prettier for printing
Description
Takes an input string and 'beautify' by converting underscores to spaces and
Usage
beautify(string, autodetect_units = TRUE)
Arguments
string |
input string |
autodetect_units |
automatically detect units (e.g. mm, kg, etc) and wrap in brackets. |
Value
string
Parse a tibble and ensure it meets standards
Description
Parse a tibble and ensure it meets standards
Usage
column_info_table(
data,
maxlevels = 6,
col_id = NULL,
cols_to_plot,
tooltip_column_suffix = "_tooltip",
ignore_column_regex = "_ignore$",
palettes,
colours_default,
colours_default_logical,
verbose
)
Arguments
data |
data.frame to autoplot (data.frame) |
maxlevels |
for categorical variables, what is the maximum number of distinct values to allow (too many will make it hard to find a palette that suits). (number) |
col_id |
name of column to use as an identifier. If null, artificial IDs will be created based on row-number. |
cols_to_plot |
names of columns in data that should be plotted. By default plots all valid columns (character) |
tooltip_column_suffix |
the suffix added to a column name that indicates column should be used as a tooltip (string) |
ignore_column_regex |
a regex string that, if matches a column name, will cause that column to be excluded from plotting (string). If NULL no regex check will be performed. (default: "_ignore$") |
palettes |
A list of named vectors. List names correspond to data column names (categorical only). Vector names correspond to the levels of those columns, and vector values are the colours those levels are mapped to. |
colours_default |
Default colors for categorical variables without a custom palette. |
colours_default_logical |
Colors for binary variables: a vector of three colors representing |
verbose |
Numeric value indicating the verbosity level:
|
Value
tibble with the following columns:
colnames
coltype (categorical/numeric/tooltip/invalid)
ndistinct (number of distinct values)
plottable (should this column be plotted)
tooltip_col (the name of the column to use as the tooltip) or NA if no obvious tooltip column found
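To make the shape of the returned table concrete, here is a rough base R sketch that builds a similar summary for a small made-up data frame. The classification rules (how ndistinct is counted, when a column is plottable) are assumptions made for illustration and are not taken from the package source; the "invalid" type is not modelled.

```r
# Made-up data frame for illustration; column names echo the baseballfans dataset.
df <- data.frame(
  Age = c(31, 45, 27, 52),
  WearingHat = c(TRUE, FALSE, TRUE, FALSE),
  WearingHat_tooltip = c("fedora", "", "top hat", ""),
  notes_ignore = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)
tooltip_column_suffix <- "_tooltip"
ignore_column_regex <- "_ignore$"
maxlevels <- 6

cols <- names(df)
is_tooltip <- endsWith(cols, tooltip_column_suffix)
is_num <- unname(vapply(df, is.numeric, logical(1)))

info <- data.frame(
  colnames = cols,
  coltype = ifelse(is_tooltip, "tooltip", ifelse(is_num, "numeric", "categorical")),
  ndistinct = unname(vapply(df, function(v) length(unique(v)), integer(1))),
  stringsAsFactors = FALSE
)

# Assumed rule: tooltip and ignored columns are never plotted, and categorical
# columns are only plotted when they have at most `maxlevels` distinct values.
info$plottable <- !is_tooltip & !grepl(ignore_column_regex, cols) &
  (is_num | info$ndistinct <= maxlevels)

# A column's tooltip is the column whose name is "<column><suffix>", if present.
candidate <- paste0(cols, tooltip_column_suffix)
info$tooltip_col <- ifelse(candidate %in% cols, candidate, NA_character_)

info
```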
Count Edge Crossings for All Numeric Column Pairs
Description
Computes the total number of edge crossings between all pairs of numeric columns in a given dataset.
Usage
count_all_edge_crossings(
data,
approximate = FALSE,
subsample_prop = 0.4,
recalibrate = FALSE
)
Arguments
data |
A |
approximate |
estimate crossings based on a subsample of the data. See |
subsample_prop |
only used when approximate = TRUE. If 0-1, controls the proportion of data to be sampled to speed up computation. If a whole number other than 0 or 1, represents the number of rows subsampled |
recalibrate |
when approximating crossings via subsetting, should number of crossings calculated for the subsample be upscaled to match the full count. (turned off by default since it amplifies sampling error). |
Details
The function:
- Filters the input data to retain only numeric columns.
- Computes all possible pairs of numeric columns.
- Uses count_edge_crossings() to calculate crossings for each pair.
- Returns the results in a summarized data frame.
Value
A data.frame with three columns:
- col1
The name of the first column in the pair.
- col2
The name of the second column in the pair.
- crossings
Total number of edge crossings for that pair.
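The steps listed under Details can be sketched in base R as follows. The data frame and the `crossings()` helper are hypothetical stand-ins written for this illustration (the helper follows the crossing definition used by count_edge_crossings()); this is not the package's implementation and it ignores the approximate, subsample_prop and recalibrate arguments.

```r
# Tiny made-up data frame; column g is non-numeric and should be dropped.
df <- data.frame(
  a = c(1, 2, 3), b = c(3, 2, 1), c = c(1, 3, 2), g = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: edges i and j cross when (l[i]-l[j]) * (r[i]-r[j]) < 0.
crossings <- function(l, r) {
  idx <- utils::combn(length(l), 2)  # all row pairs (i, j) with i < j
  sum((l[idx[1, ]] - l[idx[2, ]]) * (r[idx[1, ]] - r[idx[2, ]]) < 0)
}

# 1. Keep numeric columns. 2. Enumerate all column pairs.
numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]
pairs <- utils::combn(numeric_cols, 2)

# 3. Count crossings per pair. 4. Summarise as a data frame.
result <- data.frame(
  col1 = pairs[1, ],
  col2 = pairs[2, ],
  crossings = vapply(
    seq_len(ncol(pairs)),
    function(k) crossings(df[[pairs[1, k]]], df[[pairs[2, k]]]),
    integer(1)
  ),
  stringsAsFactors = FALSE
)
result
```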
Count Edge Crossings in Parallel Coordinates
Description
Calculates the total number of edge crossings between two numeric vectors in a 2-column parallel coordinates setup. Each axis represents one of the columns.
Usage
count_edge_crossings(l, r)
Arguments
l |
A numeric vector representing values on the left axis. Must have the
same length as |
r |
A numeric vector representing values on the right axis. Must have
the same length as |
Details
An edge crossing occurs when two edges intersect between the axes.
Formally, edges (l[i], r[i]) and (l[j], r[j]) cross if (l[i] - l[j]) * (r[i] - r[j]) < 0.
Value
An integer indicating the total number of edge crossings.
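The crossing condition in Details translates almost directly into vectorised R. The function below, `count_crossings_sketch()`, is a hypothetical name for a minimal sketch and is not the package's implementation; it builds two n-by-n matrices, so it is only suitable for small inputs.

```r
# Hypothetical helper: counts pairs (i, j), i < j, whose edges cross.
count_crossings_sketch <- function(l, r) {
  stopifnot(is.numeric(l), is.numeric(r), length(l) == length(r))
  dl <- outer(l, l, "-")  # dl[i, j] is l[i] - l[j]
  dr <- outer(r, r, "-")  # dr[i, j] is r[i] - r[j]
  crossed <- (dl * dr) < 0
  # The matrix is symmetric, so count each pair once via the upper triangle.
  sum(crossed[upper.tri(crossed)])
}

count_crossings_sketch(l = c(1, 2, 3), r = c(3, 2, 1))  # order fully reversed
count_crossings_sketch(l = c(1, 2, 3), r = c(1, 2, 3))  # order preserved
```

In the first call every pair of edges crosses, giving 3; in the second no pair crosses, giving 0. Ties on either axis give a product of zero and therefore do not count as crossings under this definition.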
Create a Distance Matrix from Edge Crossing Data
Description
Converts the results of count_all_edge_crossings() into a distance matrix, where each entry represents the number of crossings between two columns.
Usage
create_distance_matrix(data, as.dist = FALSE)
Arguments
data |
A data frame with columns |
as.dist |
Logical; if |
Value
A square matrix of distances, or a dist object if as.dist = TRUE.
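As a rough illustration of the conversion, the base R sketch below turns a long table of pairwise crossings into a symmetric matrix. The input columns (col1, col2, crossings) are assumed from the Value section of count_all_edge_crossings(), and the values are made up.

```r
# Hypothetical input in the shape returned by count_all_edge_crossings().
crossings_df <- data.frame(
  col1 = c("a", "a", "b"),
  col2 = c("b", "c", "c"),
  crossings = c(3, 1, 2),
  stringsAsFactors = FALSE
)

axes <- unique(c(crossings_df$col1, crossings_df$col2))
mx <- matrix(0, nrow = length(axes), ncol = length(axes),
             dimnames = list(axes, axes))

# Fill both triangles so the matrix is symmetric; the diagonal stays 0.
mx[cbind(crossings_df$col1, crossings_df$col2)] <- crossings_df$crossings
mx[cbind(crossings_df$col2, crossings_df$col1)] <- crossings_df$crossings

mx
stats::as.dist(mx)  # base R 'dist' representation of the same values
```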
Reorder Factor Levels by Descending Frequency
Description
Reorders the levels of a factor by their frequency, in descending order.
Usage
fct_infreq(x)
Arguments
x |
A factor or an object coerced to a factor. |
Value
A factor with levels ordered by descending frequency.
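The same reordering can be sketched in base R with table() and sort(). This is an illustration of the idea rather than the function's actual source, and it makes no claim about how ties between equally frequent levels are resolved.

```r
x <- factor(c("b", "a", "c", "a", "c", "c"))

# Count each level, then rebuild the factor with levels in descending count order.
counts <- table(x)
x_sorted <- factor(x, levels = names(sort(counts, decreasing = TRUE)))

levels(x)         # original (alphabetical) level order
levels(x_sorted)  # most frequent level first
```

Only the level order changes; the values stored in the factor are untouched.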
Relevel Factor by Specified Levels
Description
Reorder the levels of a factor by moving specified levels to a new position.
Usage
fct_relevel_base(x, ..., after = 0)
Arguments
x |
A factor to be releveled. |
... |
Levels to move in the factor. |
after |
A numeric scalar specifying the position after which the moved
levels should be placed. Use |
Value
A factor with the specified levels moved to the chosen position.
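A minimal base R sketch of this kind of releveling is shown below. `relevel_sketch()` is a hypothetical name, and interpreting `after` as a position among the levels that were not moved is an assumption made for this illustration.

```r
# Hypothetical helper: move the levels given in ... to just after position `after`.
relevel_sketch <- function(x, ..., after = 0) {
  stopifnot(is.factor(x))
  moved <- c(...)
  stopifnot(all(moved %in% levels(x)))
  remaining <- setdiff(levels(x), moved)
  # append() inserts `moved` after position `after` of the remaining levels.
  factor(x, levels = append(remaining, moved, after = after))
}

x <- factor(c("low", "mid", "high"), levels = c("high", "low", "mid"))

levels(relevel_sketch(x, "low", "mid"))       # moved levels go to the front
levels(relevel_sketch(x, "high", after = 2))  # "high" goes after two other levels
```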
Reverse the Levels of a Factor
Description
Reverses the existing level order of a factor.
Usage
fct_rev(x)
Arguments
x |
A factor or an object coerced to a factor. |
Value
A factor with reversed levels.
Compute the Total Path Distance for an Axis Order
Description
Given a sequence of axis names and a distance matrix, sums pairwise distances along the path.
Usage
feature_vector_to_total_path_distance(axis_names, mx)
Arguments
axis_names |
A character vector indicating the axis order. |
mx |
A matrix of distances, with row and column names matching |
Value
A numeric value representing the total distance.
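The computation amounts to looking up the distance between each consecutive pair of axes and summing. The sketch below uses a made-up 3 x 3 distance matrix and a hypothetical helper name, `path_distance_sketch()`; it treats the order as an open path rather than a closed tour.

```r
# Made-up symmetric distance matrix with named rows and columns.
axes <- c("a", "b", "c")
mx <- matrix(
  c(0, 3, 1,
    3, 0, 2,
    1, 2, 0),
  nrow = 3, byrow = TRUE, dimnames = list(axes, axes)
)

# Hypothetical helper: sum distances between consecutive axes in the order.
path_distance_sketch <- function(axis_names, mx) {
  from <- head(axis_names, -1)
  to <- tail(axis_names, -1)
  sum(mx[cbind(from, to)])
}

path_distance_sketch(c("a", "b", "c"), mx)  # d(a,b) + d(b,c) = 3 + 2
path_distance_sketch(c("b", "c", "a"), mx)  # d(b,c) + d(c,a) = 2 + 1
```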
Optimize Axis Ordering Directly from a Data Frame
Description
Computes the number of edge crossings between all numeric columns in data, converts this information into a distance matrix, and then determines an optimal ordering of the columns based on the specified method.
Usage
get_optimal_axis_order(
data,
verbose = TRUE,
method = "auto",
metric = c("mutinfo", "crossings", "crossings_fast"),
return_detailed = FALSE
)
Arguments
data |
A |
verbose |
A logical value; if |
method |
A character string specifying the method. Options are |
metric |
Which metric to use as the distance between axes to minimise. Options are:
- mutinfo (mutual information): minimise mutual distance (1 - uniminmax of the mutinfo similarity matrix calculated by emp).
- crossings: minimise the total number of edge crossings (warning: slow to compute for large datasets).
- crossings_fast: same as crossings, but calculated on a subset of the data (100 rows). |
return_detailed |
A logical; if |
Value
A character vector of axis names in the chosen order, or a list with additional data if return_detailed = TRUE.
Parallel Coordinate Plots
Description
Visualize relationships between numeric variables and categorical groupings using parallel coordinate plots.
Usage
ggparallel(
data,
col_id = NULL,
col_colour = NULL,
highlight = NULL,
interactive = TRUE,
order_columns_by = c("appearance", "random", "auto"),
order_observations_by = c("frequency", "original"),
verbose = TRUE,
palette_colour = palette.colors(palette = "Set2"),
palette_highlight = c("red", "grey90"),
convert_binary_numeric_to_factor = TRUE,
scaling = c("uniminmax", "none"),
return = c("plot", "data"),
options = ggparallel_options()
)
Arguments
data |
A data frame containing the variables to plot. |
col_id |
The name of the column to use as an identifier. If |
col_colour |
Name of the column to use for coloring lines in the plot. If |
highlight |
A level from |
interactive |
Produce interactive ggiraph visualisation (flag) |
order_columns_by |
Strategy for ordering columns in the plot. Options include:
|
order_observations_by |
Strategy for ordering lines in the plot. Options include:
Ignored if |
verbose |
Logical; whether to display informative messages during execution. (default: |
palette_colour |
A named vector of colors for categorical levels in |
palette_highlight |
A two-color vector for highlighting ( |
convert_binary_numeric_to_factor |
Logical; whether to convert numeric columns containing only 0, 1, and NA to factors. (default: |
scaling |
Method for scaling numeric variables. Options include:
|
return |
What to return. Options include:
|
options |
A list of additional visualization parameters created by |
Value
A ggplot object or a processed data frame, depending on the return parameter.
Examples
ggparallel(
data = minibeans,
col_colour = "Class",
order_columns_by = "auto"
)
ggparallel(
data = minibeans,
col_colour = "Class",
highlight = "DERMASON",
order_columns_by = "auto"
)
# Customise appearance using options argument
ggparallel(
data = minibeans,
col_colour = "Class",
order_columns_by = "auto",
options = ggparallel_options(show_legend = FALSE)
)
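The scaling argument defaults to "uniminmax". Assuming this means univariate min-max scaling, as the name suggests (each numeric column rescaled independently so its minimum maps to 0 and its maximum to 1), the transformation can be sketched in base R as below. The helper name and the handling of constant columns are choices made for this sketch, not documented package behaviour.

```r
# Hypothetical helper: rescale one numeric vector to the [0, 1] range.
uniminmax_sketch <- function(v) {
  rng <- range(v, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    # Constant column: place every non-missing value at the midpoint (assumption).
    return(ifelse(is.na(v), NA_real_, 0.5))
  }
  (v - rng[1]) / (rng[2] - rng[1])
}

# Apply column by column to the numeric columns of a built-in dataset.
scaled <- as.data.frame(lapply(iris[1:4], uniminmax_sketch))
sapply(scaled, range)
```

After scaling, every axis spans the same vertical range, so lines can be compared across variables measured in different units.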
Visual Parameters for ggparallel Plots
Description
Configures aesthetic and layout settings for plots generated by ggparallel.
Usage
ggparallel_options(
show_legend = TRUE,
show_legend_titles = FALSE,
legend_position = c("bottom", "right", "left", "top"),
legend_title_position = c("left", "top", "bottom", "right"),
legend_nrow = NULL,
legend_ncol = NULL,
legend_key_size = 1,
beautify_text = TRUE,
max_digits_bounds = 1,
x_axis_text_angle = 90,
x_axis_text_hjust = 0,
x_axis_text_vjust = 0.5,
fontsize_x_axis_text = 12,
show_column_names = TRUE,
show_points = FALSE,
show_bounds_labels = FALSE,
show_bounds_rect = FALSE,
line_alpha = 0.5,
line_width = NULL,
line_type = 1,
x_axis_gridlines = ggplot2::element_line(colour = "black"),
interactive_svg_width = NULL,
interactive_svg_height = NULL
)
Arguments
show_legend |
Display the legend on the plot (flag). |
show_legend_titles |
Display titles for legends (flag). |
legend_position |
Position of the legend ("right", "left", "bottom", "top"). |
legend_title_position |
Position of the legend title ("top", "bottom", "left", "right"). |
legend_nrow |
Number of rows in the legend (number). |
legend_ncol |
Number of columns in the legend. If set, |
legend_key_size |
Size of the legend key symbols. (number). |
beautify_text |
Beautify y-axis text and legend titles by capitalizing words and adding spaces (flag). |
max_digits_bounds |
Number of digits to round the axis bounds label text to (number) |
x_axis_text_angle |
Angle of the x axis text describing column names (number) |
x_axis_text_hjust |
Horizontal Justification of the x axis text describing column names (number) |
x_axis_text_vjust |
Vertical Justification of the x axis text describing column names (number) |
fontsize_x_axis_text |
fontsize of the x-axis text describing column names (number) |
show_column_names |
Show column names as x axis text (flag) |
show_points |
Show points (flag) |
show_bounds_labels |
Show bounds (min and max value) of each feature with labels above / below the axes (flag) |
show_bounds_rect |
Show bounds (min and max value) of each feature with a rectangular graphic (flag) |
line_alpha |
Alpha of line geom (number) |
line_width |
Width of the line geom (number) |
line_type |
Type of line geom (number or string. see |
x_axis_gridlines |
Customise look of x axis gridlines. Must be either a call to |
interactive_svg_width , interactive_svg_height |
Width and height of the interactive graphic region (in inches). Only used when |
Value
A list of visualization parameters for ggparallel.
Examples
ggparallel(
data = minibeans,
col_colour = "Class",
order_columns_by = "auto"
)
ggparallel(
data = minibeans,
col_colour = "Class",
highlight = "DERMASON",
order_columns_by = "auto"
)
# Customise appearance using options argument
ggparallel(
data = minibeans,
col_colour = "Class",
order_columns_by = "auto",
options = ggparallel_options(show_legend = FALSE)
)
AutoPlot an entire data.frame
Description
Visualize all columns in a data frame with ggEDA's vertically aligned plots and automatic plot selection based on variable type. Plots are fully interactive, and custom tooltips can be added.
Usage
ggstack(
data,
col_id = NULL,
col_sort = NULL,
order_matches_sort = TRUE,
maxlevels = 7,
verbose = 2,
drop_unused_id_levels = FALSE,
interactive = TRUE,
return = c("plot", "column_info", "data"),
palettes = NULL,
sort_type = c("frequency", "alphabetical"),
desc = TRUE,
limit_plots = TRUE,
max_plottable_cols = 10,
cols_to_plot = NULL,
tooltip_column_suffix = "_tooltip",
ignore_column_regex = "_ignore$",
convert_binary_numeric_to_factor = TRUE,
options = ggstack_options(show_legend = !interactive)
)
Arguments
data |
data.frame to autoplot (data.frame) |
col_id |
name of column to use as an identifier. If null, artificial IDs will be created based on row-number. |
col_sort |
name of columns to sort on. To do a hierarchical sort, supply a vector of column names in the order they should be sorted (character). |
order_matches_sort |
should the column plots be stacked top-to-bottom in the order they appear in |
maxlevels |
for categorical variables, what is the maximum number of distinct values to allow (too many will make it hard to find a palette that suits). (number) |
verbose |
Numeric value indicating the verbosity level:
|
drop_unused_id_levels |
if col_id is a factor with unused levels, should these be dropped or included in visualisation |
interactive |
produce interactive ggiraph visualisation (flag) |
return |
a string describing what this function should return. Options include:
|
palettes |
A list of named vectors. List names correspond to data column names (categorical only). Vector names correspond to the levels of those columns, and vector values are the colours those levels are mapped to. |
sort_type |
controls how categorical variables are sorted.
Numerical variables are always sorted in numerical order irrespective of the value given here.
Options are |
desc |
sort in descending order (flag) |
limit_plots |
throw an error when there are > |
max_plottable_cols |
maximum number of columns that can be plotted (default: 10) (number) |
cols_to_plot |
names of columns in data that should be plotted. By default plots all valid columns (character) |
tooltip_column_suffix |
the suffix added to a column name that indicates column should be used as a tooltip (string) |
ignore_column_regex |
a regex string that, if matches a column name, will cause that column to be excluded from plotting (string). If NULL no regex check will be performed. (default: "_ignore$") |
convert_binary_numeric_to_factor |
If a numeric column contains only values 0, 1, & NA, then automatically convert to a factor. |
options |
a list of additional visual parameters created by calling |
Value
ggiraph interactive visualisation
Examples
# Create Basic Plot
ggstack(baseballfans, col_id = "ID", col_sort = "Glasses")
# Configure plot using ggstack_options()
ggstack(
lazy_birdwatcher,
col_sort = "Magpies",
palettes = list(
Birdwatcher = c(Robert = "#E69F00", Catherine = "#999999"),
Day = c(Weekday = "#999999", Weekend = "#009E73")
),
options = ggstack_options(
show_legend = TRUE,
fontsize_barplot_y_numbers = 12,
legend_text_size = 16,
legend_key_size = 1,
legend_nrow = 1,
)
)
Visual Parameters for ggstack Plots
Description
Configures aesthetic and layout settings for plots generated by ggstack.
Usage
ggstack_options(
colours_default = c("#66C2A5", "#FC8D62", "#8DA0CB", "#E78AC3", "#A6D854", "#FFD92F",
"#E5C494"),
colours_default_logical = c(`TRUE` = "#648fff", `FALSE` = "#dc267f"),
colours_missing = "grey90",
show_legend_titles = FALSE,
legend_title_position = c("top", "bottom", "left", "right"),
legend_nrow = 4,
legend_ncol = NULL,
legend_title_size = NULL,
legend_text_size = NULL,
legend_key_size = 0.3,
legend_orientation_heatmap = c("horizontal", "vertical"),
show_legend = TRUE,
legend_position = c("right", "left", "bottom", "top"),
na_marker = "!",
na_marker_size = 8,
na_marker_colour = "black",
show_na_marker_categorical = FALSE,
show_na_marker_heatmap = FALSE,
colours_heatmap_low = "purple",
colours_heatmap_high = "seagreen",
transform_heatmap = c("identity", "log10", "log2"),
fontsize_values_heatmap = 3,
show_values_heatmap = FALSE,
colours_values_heatmap = "white",
vertical_spacing = 0,
numeric_plot_type = c("bar", "heatmap"),
y_axis_position = c("left", "right"),
width = 0.9,
relative_height_numeric = 4,
cli_header = "Running ggstack",
interactive_svg_width = NULL,
interactive_svg_height = NULL,
fontsize_barplot_y_numbers = 8,
max_digits_barplot_y_numbers = 3,
fontsize_y_title = 12,
beautify_text = TRUE
)
Arguments
colours_default |
Default colors for categorical variables without a custom palette. |
colours_default_logical |
Colors for binary variables: a vector of three colors representing |
colours_missing |
Color for missing ( |
show_legend_titles |
Display titles for legends (flag). |
legend_title_position |
Position of the legend title ("top", "bottom", "left", "right"). |
legend_nrow |
Number of rows in the legend (number). |
legend_ncol |
Number of columns in the legend. If set, |
legend_title_size |
Size of the legend title text (number). |
legend_text_size |
Size of the text within the legend (number). |
legend_key_size |
Size of the legend key symbols (number). |
legend_orientation_heatmap |
should legend orientation be "horizontal" or "vertical". |
show_legend |
Display the legend on the plot (flag). |
legend_position |
Position of the legend ("right", "left", "bottom", "top"). |
na_marker |
Text used to mark |
na_marker_size |
Size of the text marker for |
na_marker_colour |
Color of the |
show_na_marker_categorical |
Show a marker for |
show_na_marker_heatmap |
Show a marker for |
colours_heatmap_low |
Color for the lowest value in heatmaps (string). |
colours_heatmap_high |
Color for the highest value in heatmaps (string). |
transform_heatmap |
Transformation to apply before visualizing heatmap values ("identity", "log10", "log2"). |
fontsize_values_heatmap |
Font size for heatmap values (number). |
show_values_heatmap |
Display numerical values on heatmap tiles (flag). |
colours_values_heatmap |
Color for heatmap values (string). |
vertical_spacing |
Space between each data row in points (number). |
numeric_plot_type |
Type of visualization for numeric data: "bar" or "heatmap". |
y_axis_position |
Position of the y-axis ("left" or "right"). |
width |
Controls how much space is present between bars and tiles within each plot. Takes values from 0 to 1, where a value of 1 makes bars/tiles take up 100% of the available space (no gaps between bars). |
relative_height_numeric |
how many times taller should numeric plots be relative to categorical tile plots. Only taken into account if numeric_plot_type == "bar" (number) |
cli_header |
Text used for h1 header. Included so it can be tweaked by packages that use ggstack, so they can customise how the info messages appear. |
interactive_svg_width , interactive_svg_height |
width and height of the interactive graphic region (in inches). Only used when |
fontsize_barplot_y_numbers |
fontsize of the text describing numeric barplot max & min values (number). |
max_digits_barplot_y_numbers |
Number of digits to round the numeric barplot max and min values to (number). |
fontsize_y_title |
fontsize of the y axis titles (a.k.a the data.frame column names) (number). |
beautify_text |
Beautify y-axis text and legend titles by capitalizing words and adding spaces (flag). |
Value
A list of visualization parameters for ggstack.
Examples
# Create Basic Plot
ggstack(baseballfans, col_id = "ID", col_sort = "Glasses")
# Configure plot using ggstack_options()
ggstack(
lazy_birdwatcher,
col_sort = "Magpies",
palettes = list(
Birdwatcher = c(Robert = "#E69F00", Catherine = "#999999"),
Day = c(Weekday = "#999999", Weekend = "#009E73")
),
options = ggstack_options(
show_legend = TRUE,
fontsize_barplot_y_numbers = 12,
legend_text_size = 16,
legend_key_size = 1,
legend_nrow = 1,
)
)
Determine Whether Two Edges Cross
Description
Given the positions of two edges on the left and right axes, decides if they intersect in a parallel coordinates setup.
Usage
is_crossing(l1, r1, l2, r2)
Arguments
l1 |
Numeric position of the first edge on the left axis. |
r1 |
Numeric position of the first edge on the right axis. |
l2 |
Numeric position of the second edge on the left axis. |
r2 |
Numeric position of the second edge on the right axis. |
Value
A logical value. TRUE if they cross, FALSE otherwise.
Lazy Birdwatcher Dataset
Description
A simulated dataset describing the number of magpies observed by two birdwatchers.
Usage
lazy_birdwatcher
Format
lazy_birdwatcher
A data frame with 45 rows and 3 columns:
- Magpies
Number of magpies observed
- Day
Was the day of observation a weekday or a weekend?
- Birdwatcher
Name of the birdwatcher
Dry Beans Dataset
Description
A subsample of the Koklu & Ozkan (2020) dry beans dataset produced by imaging a total of 13,611 grains from 7 varieties of dry beans. The original dataset contains 13,611 observations, but here we include a random subsample of 1000.
Usage
minibeans
Format
minibeans
A data frame with 1000 rows and 17 columns:
- Area
The area of a bean zone and the number of pixels within its boundaries.
- Perimeter
Bean circumference is defined as the length of its border.
- Major axis length
The distance between the ends of the longest line that can be drawn from a bean.
- Minor axis length
The longest line that can be drawn from the bean while standing perpendicular to the main axis.
- Aspect ratio
Defines the relationship between the major axis length (L) and the minor axis length (l).
- Eccentricity
Eccentricity of the ellipse having the same moments as the region.
- Convex area
Number of pixels in the smallest convex polygon that can contain the area of a bean seed.
- Equivalent diameter
The diameter of a circle having the same area as a bean seed area.
- Extent
The ratio of the pixels in the bounding box to the bean area.
- Solidity
Also known as convexity. The ratio of the pixels in the convex shell to those found in beans.
- Roundness
Calculated with the following formula: (4 * pi * A) / P^2, where A is the area and P is the perimeter.
- Compactness
Measures the roundness of an object: Ed/L, where Ed is the equivalent diameter and L is the major axis length.
- ShapeFactor1
Shape factor 1.
- ShapeFactor2
Shape factor 2.
- ShapeFactor3
Shape factor 3.
- ShapeFactor4
Shape factor 4.
- Class
Seker, Barbunya, Bombay, Cali, Dermason, Horoz, and Sira.
Source
Koklu, M, and IA Ozkan. 2020. Multiclass Classification of Dry Beans Using Computer Vision and Machine Learning Techniques. Computers and Electronics in Agriculture, 174: 105507. doi: 10.1016/j.compag.2020.105507, https://doi.org/10.24432/C50S4B
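As a worked check of the Roundness formula, (4 * pi * A) / P^2 equals 1 for a perfect circle and falls below 1 for less round shapes. The short sketch below evaluates it for idealised shapes rather than for rows of minibeans; the function name is hypothetical.

```r
# Roundness as defined in the Format section: (4 * pi * A) / P^2
roundness_sketch <- function(area, perimeter) (4 * pi * area) / perimeter^2

# Circle of radius 2: A = pi * r^2, P = 2 * pi * r
roundness_sketch(area = pi * 2^2, perimeter = 2 * pi * 2)

# Square of side 3: A = s^2, P = 4 * s, which simplifies to pi / 4
roundness_sketch(area = 3^2, perimeter = 4 * 3)
```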
Compute Mutual Information
Description
Computes mutual information between each feature in the features data frame and the target vector.
The features are discretized using the "equalfreq" method from infotheo::discretize().
Usage
mutinfo(features, target, return_colnames = FALSE)
Arguments
features |
A data frame of features. These will be discretized using the "equalfreq" method
(see |
target |
A vector (character or factor) representing the variable to compute mutual information with. |
return_colnames |
Logical; if |
Value
If return_colnames = FALSE, a named numeric vector of mutual information scores is returned (one for each column in features), sorted in descending order.
The names of the vector correspond to the column names of features.
If return_colnames = TRUE, only the ordered column names of features are returned.
Examples
data(iris)
# Compute mutual information scores
mutinfo(iris[1:4], iris[[5]])
# Get column names ordered by mutual information with target column (most mutual info first)
mutinfo(iris[1:4], iris[[5]], return_colnames = TRUE)
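For readers unfamiliar with the quantity being computed, the base R sketch below estimates mutual information from a contingency table after a simple equal-frequency binning. Both helpers are hypothetical and written for illustration; they do not call infotheo, the number of bins (3) is an arbitrary choice, and the resulting scores should not be expected to match mutinfo() numerically.

```r
# Hypothetical helper: mutual information (in nats) between two discrete vectors.
mi_sketch <- function(x, y) {
  joint <- table(x, y)
  joint <- joint / sum(joint)                        # joint probabilities p(x, y)
  expected <- outer(rowSums(joint), colSums(joint))  # p(x) * p(y)
  nonzero <- joint > 0                               # skip empty cells (0 * log 0 = 0)
  sum(joint[nonzero] * log(joint[nonzero] / expected[nonzero]))
}

# Hypothetical helper: equal-frequency binning via quantile breakpoints.
equalfreq_sketch <- function(v, nbins = 3) {
  breaks <- unique(quantile(v, probs = seq(0, 1, length.out = nbins + 1)))
  cut(v, breaks = breaks, include.lowest = TRUE)
}

scores <- vapply(
  iris[1:4],
  function(col) mi_sketch(equalfreq_sketch(col), iris$Species),
  numeric(1)
)
sort(scores, decreasing = TRUE)
```

Higher scores indicate features whose binned values carry more information about the target.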
Optimise the Ordering of Axes Using Distance Matrix
Description
Finds an ordering of axes that minimises a pairwise distance metric (usually the number of crossings). Offers brute-force and heuristic approaches.
Usage
optimise_axis_ordering_from_matrix(
mx,
method = c("auto", "brute_force", "repetitive_nn_with_2opt"),
return_detailed = FALSE,
verbose = TRUE
)
Arguments
mx |
A matrix or |
method |
A character string specifying the method. Can be |
return_detailed |
Logical; if |
verbose |
Logical; if |
Value
If return_detailed = FALSE, returns a character vector of axis names in the chosen order. Otherwise, returns a list with additional data.
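To give a feel for the heuristic side of this problem, the sketch below implements a plain repeated nearest-neighbour search in base R: start from each axis in turn, greedily append the closest unvisited axis, and keep the shortest path found. It is a simplified stand-in written for this illustration. It omits the 2-opt refinement named in the method argument, treats the ordering as an open path, and is not the package's implementation.

```r
# Made-up symmetric distance matrix with named rows and columns.
axes <- c("a", "b", "c")
mx <- matrix(
  c(0, 3, 1,
    3, 0, 2,
    1, 2, 0),
  nrow = 3, byrow = TRUE, dimnames = list(axes, axes)
)

# Hypothetical helper: repeated nearest-neighbour ordering over all start axes.
nn_order_sketch <- function(mx) {
  axes <- rownames(mx)
  path_length <- function(p) sum(mx[cbind(head(p, -1), tail(p, -1))])
  best <- NULL
  for (start in axes) {
    path <- start
    while (length(path) < length(axes)) {
      candidates <- setdiff(axes, path)
      last <- path[length(path)]
      # Greedy step: append the unvisited axis closest to the current end.
      path <- c(path, candidates[which.min(mx[last, candidates])])
    }
    if (is.null(best) || path_length(path) < path_length(best)) best <- path
  }
  best
}

nn_order_sketch(mx)
```

For this matrix the search returns the order a, c, b with a total path distance of 3. Greedy searches like this are fast but can miss the true optimum, which is why exhaustive search remains useful when the number of axes is small.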
Generate Permutations of the Integers 1..n
Description
Creates a matrix of all permutations for the integers from 1 to n.
Usage
permutations(n)
Arguments
n |
Number of elements to permute. |
Value
A matrix where each row is a permutation of 1..n.
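One way to build such a matrix is recursively: take every permutation of 1..(n-1) and insert n at each possible position. The base R sketch below follows that approach under a hypothetical name, `permutations_sketch()`; the row order it produces is a property of this sketch and may differ from the package function.

```r
# Hypothetical helper: all permutations of 1..n, one per row.
permutations_sketch <- function(n) {
  if (n == 1) return(matrix(1, nrow = 1))
  smaller <- permutations_sketch(n - 1)
  rows <- list()
  for (i in seq_len(nrow(smaller))) {
    for (pos in 0:(n - 1)) {
      # Insert n after position `pos` of the i-th shorter permutation.
      rows[[length(rows) + 1]] <- append(smaller[i, ], n, after = pos)
    }
  }
  do.call(rbind, rows)
}

p <- permutations_sketch(3)
p
nrow(p)  # 3! rows
```

The number of rows grows as n!, so exhaustive enumeration is only practical for a handful of axes.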
Generate All Permutations of Axis Names
Description
Takes a character vector of axis names and returns a matrix of permutations.
Usage
permute_axis_names(axis_names)
Arguments
axis_names |
A character vector of axis names. |
Value
A matrix where each row represents one permutation of axis_names.
GGplot breaks
Description
Find sensible positions for two breaks on a ggplot2 axis
Usage
sensible_2_breaks(vector)
Arguments
vector |
vector fed into ggplot axis you want to define sensible breaks for |
Value
Vector of length 2. The first element describes the upper break position, the second describes the lower break.