Type: | Package |
Title: | Fast, Robust, and High-Quality Synthetic Data Generation with a Tuneable Privacy-Utility Trade-Off |
Version: | 0.5.0 |
Maintainer: | Mark van der Loo <mark.vanderloo@gmail.com> |
Description: | Synthesize numeric, categorical, mixed and time series data. Data circumstances including mixed (or zero-inflated) distributions and missing data patterns are reproduced in the synthetic data. A single parameter allows balancing between high-quality synthetic data that represents correlations of the original data and lower quality but more privacy safe synthetic data without correlations. Tuning can be done per variable or for the whole dataset. |
License: | EUPL version 1.1 | EUPL version 1.2 [expanded from: EUPL] |
URL: | https://github.com/markvanderloo/synthesizer |
Imports: | stats |
VignetteBuilder: | simplermarkdown |
Depends: | R (≥ 3.5.0) |
Suggests: | tinytest, simplermarkdown |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
NeedsCompilation: | no |
Packaged: | 2025-07-10 10:24:19 UTC; mplo |
Author: | Mark van der Loo |
Repository: | CRAN |
Date/Publication: | 2025-07-10 13:10:05 UTC |
synthesizer: Fast, Robust, and High-Quality Synthetic Data Generation with a Tuneable Privacy-Utility Trade-Off
Description
Synthesize numeric, categorical, mixed and time series data. Data circumstances including mixed (or zero-inflated) distributions and missing data patterns are reproduced in the synthetic data. A single parameter allows balancing between high-quality synthetic data that represents correlations of the original data and lower quality but more privacy safe synthetic data without correlations. Tuning can be done per variable or for the whole dataset.
Author(s)
Maintainer: Mark van der Loo mark.vanderloo@gmail.com (ORCID)
See Also
Useful links:
Create a function that generates synthetic data
Description
Create a function that accepts a non-negative integer n
, and
that returns synthetic data sampled from the emperical (multivariate)
distribution of x
.
Usage
make_synthesizer(x, ...)
## S3 method for class 'numeric'
make_synthesizer(x, na.rm = FALSE, ...)
## S3 method for class 'integer'
make_synthesizer(x, na.rm = FALSE, ...)
## S3 method for class 'logical'
make_synthesizer(x, na.rm = FALSE, ...)
## S3 method for class 'factor'
make_synthesizer(x, na.rm = FALSE, ...)
## S3 method for class 'character'
make_synthesizer(x, na.rm = FALSE, ...)
## S3 method for class 'ts'
make_synthesizer(x, ...)
## S3 method for class 'data.frame'
make_synthesizer(x, na.rm = FALSE, ...)
Arguments
x |
|
... |
arguments passed to other methods |
na.rm |
|
Value
A function
accepting a single integer argument: the number
of synthesized values or records to return. For objects of class
ts
n
must be equal to the length of the original data
(this is set as the default).
See Also
Other synthesis:
synthesize()
Examples
synth <- make_synthesizer(cars$speed)
synth(10)
synth <- make_synthesizer(iris)
synth(6)
synth(150)
synth(250)
Create synthetic version of a dataset
Description
Create n
values or records based on the emperical (multivariate)
distribution of y
. For data frames it is possible to decorrelate synthetic
from the original variables by lowering the value for the rankcor
parameter.
Usage
synthesize(x, na.rm = FALSE, n = NROW(x), rankcor = 1)
Arguments
x |
|
na.rm |
|
n |
|
rankcor |
|
Value
A data object of the same type and structure as x
.
Note
The utility of a synthetic variable is lowered by decorelating the rank
correlation between the real and synthetic data. If rankcor=1
, the
synthetic data will ordered such that it has the same rank order as the
original data. If rankcor=0
, no such reordering will take place. For
values between 0 and 1, blocks of data are randomly selected and randomly
permuted iteratively until the rank correlation between original and
synthetic data drops below the parameter.
See Also
Other synthesis:
make_synthesizer()
Examples
synthesize(cars$speed,10)
synthesize(cars)
synthesize(cars,25)
s1 <- synthesize(iris, rankcor=1)
s2 <- synthesize(iris, rankcor=0.5)
s3 <- synthesize(iris, rankcor=c("Species"=0.5))
oldpar <- par(mfrow=c(2,2), pch=16, las=1)
plot(Sepal.Length ~ Sepal.Width, data=iris, col=iris$Species, main="Iris")
plot(Sepal.Length ~ Sepal.Width, data=s1, col=s1$Species, main="Synthetic Iris")
plot(Sepal.Length ~ Sepal.Width, data=s2, col=s2$Species, main="Low utility Iris")
plot(Sepal.Length ~ Sepal.Width, data=s3, col=s3$Species, main="Low utility Species")
par(oldpar)