Type: | Package |
Title: | Utilities for Preparation of Data Analysis |
Version: | 1.0.3 |
Date: | 2020-05-18 |
Author: | Josef Frank |
Maintainer: | Josef Frank <josef.frank@gmx.ch> |
Description: | Miscellaneous small utilities are provided to mitigate issues with messy, inconsistent or high dimensional data and help for preprocessing and preparing analyses. |
Imports: | data.table |
License: | GPL-3 |
Suggests: | knitr, rmarkdown |
NeedsCompilation: | no |
LazyData: | true |
Packaged: | 2020-05-18 20:31:09 UTC; josef |
Repository: | CRAN |
Date/Publication: | 2020-05-18 22:10:02 UTC |
Convert proportional data to M-Values
Description
Proportional data are commonly modelled using a GLM approach with a logit link function. When the logit transformation is instead performed separately in advance, simple OLS methods can be applied.
Usage
beta2m(b)
Arguments
b |
vector or matrix holding the original data |
Details
Data are transformed according to
M=\log_2\frac{b}{1-b}
The input data are assumed to lie in the range 0<b<1. Data outside this range will lead to missing values. Corner cases (b=0 or b=1) can be handled by use of fixlimits().
Value
A named vector/matrix with the same dimensions as b, holding the log2-transformed values
Examples
a <- sapply(c(0.01,0.05,0.5,0.8,0.9),function(x) rbinom(30,100,x)/100)
matplot(a,pch=20)
matplot(beta2m(a),pch=20)
matplot(a,beta2m(a),pch=20)
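The transformation can also be checked directly against base R; this is a minimal illustration assuming only the formula stated in the Details section, not the package internals:
b <- 1:99 / 100
all.equal(beta2m(b), log2(b / (1 - b)))  # TRUE if the stated formula holds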
Select first existing value of several columns
Description
Inspired by the SQL function of the same name. Given several vectors of equal length, the function selects, for each observation, the first value that is not NA, in the order provided.
Usage
coalesce(...)
Arguments
... |
vectors holding the values, separated by "," (commonly columns of a data.frame) |
Details
For each position across the supplied vectors, the first non-NA value, in the order the vectors are provided, is selected.
Value
A named vector holding the supplemented values
Examples
a1 = c(1,NA,NA,NA)
a2 = c(2,2,NA,NA)
a3 = c(NA,3,3,NA)
cbind(a1,a2,a3,suppl=coalesce(a1,a2,a3))
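The core semantics can be sketched in a few lines of base R; this is an illustration of the behaviour described above (the helper name coalesce_sketch is hypothetical), not the package source:
coalesce_sketch <- function(...) Reduce(function(x, y) ifelse(is.na(x), y, x), list(...))
coalesce_sketch(a1, a2, a3)  # 1 2 3 NA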
Supplement missing values in mapping of data
Description
In properly normalized databases, 1:1 mappings should be complete and unique. In real-world data, however, ID mappings or database key candidates are repeated over and over across observations, partly with missing data in the case of merged data sets. fillmap supplements NAs in mapping variables as far as possible.
Usage
fillmap(x, y, what = "xy", rmdup=FALSE, rmmiss=FALSE,
printori=FALSE)
Arguments
x , y |
vectors of equal length, holding the mapping values, separated by "," |
what |
data to be returned, either the 1st ("x") or 2nd argument ("y"), or a data.table containing both ("xy") |
rmdup |
remove duplicates from mapping (TRUE) or return all rows in original order (FALSE) |
rmmiss |
remove rows where no mapping could be found (TRUE) or return all rows (FALSE) |
printori |
print original variables side by side |
Details
fillmap assumes a 1:1 mapping between the provided variables, as is commonly the case, for example, in ID translation steps.
For all cases where a proper unambiguous 1:1 matching exists, the missing values are filled in.
Value
Vector or data.table with the original mapping data, where NAs are filled in with supplemented data where possible
Examples
library(data.table)
pheno1 <- data.frame(id1=c(1,2,3,4),id2=c(11,22,NA,NA),phenodat=c(NA,NA,NA,"d"))
pheno2 <- data.frame(id1=c(NA,NA,NA),id2=c(11,22,33),phenodat=c("a","b","c"))
pheno3 <- data.frame(id1=c(4,3),id2=c(44,33),phenodat=c(NA,NA))
phenoges <- rbind(rbind(pheno1,pheno2),pheno3)
with(phenoges,fillmap(id1,phenodat))
with(phenoges,fillmap(id1,phenodat,rmdup=TRUE))
with(phenoges,fillmap(id1,phenodat,rmmiss=TRUE))
with(phenoges,fillmap(id1,phenodat,rmdup=TRUE,rmmiss=TRUE))
with(phenoges,fillmap(id2,phenodat))
with(phenoges,fillmap(id2,phenodat,rmdup=TRUE))
with(phenoges,fillmap(id2,phenodat,rmmiss=TRUE))
with(phenoges,fillmap(id2,phenodat,rmdup=TRUE,rmmiss=TRUE))
phenosupp <- with(phenoges,fillmap(id1,id2))
names(phenosupp) <- c("id1","id2")
phenosupp$phenodat <- fillmap(phenosupp$id1,phenoges$phenodat,what="y")
unique(phenosupp)
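The supplementation idea can be sketched as a lookup built from the complete pairs; this is a simplified, hypothetical illustration, not the package implementation (the real fillmap additionally handles duplicates, ambiguities and the rmdup/rmmiss options):
fill_y_sketch <- function(x, y) {
  lut <- unique(data.frame(x = x, y = y)[!is.na(x) & !is.na(y), ])
  y[is.na(y)] <- lut$y[match(x[is.na(y)], lut$x)]  # fill missing y from matching x
  y
}
fill_y_sketch(phenoges$id1, phenoges$phenodat)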
Filter data set using PCA
Description
Noise removal in a data set by means of principal component analysis. Optionally calculate distances (reconstruction error and Mahalanobis distance).
Usage
filterpca(x,npc=NULL,pcs=NULL,scale.=F,
method=c("k","t"),resulttype=c("p","d","b"),lambda=NULL)
Arguments
x |
data set |
npc |
Number of leading principal components to be used for reconstruction of data set after filtering (positive integer) or number of last components to be skipped (negative integer). |
pcs |
Vector of integers providing column numbers of components to be included for reconstruction (positive numbers) or components to be skipped (negative numbers). In case of mixed signs negative numbers are ignored. |
scale. |
should values be scaled to unit variance before PCA? |
method |
One of either "k" or "t", with the following meaning: "k": no further filtering apart from ignoring some components when projecting back into the original space; "t": additionally threshold the data by setting all values with absolute value below lambda to 0 |
resulttype |
Type of resulting value, either matrix of projected values (p), distances (d) or a list containing both (b) |
lambda |
cutoff to be used for thresholding the data. lambda = NULL instructs the function to use a predefined value of 5% of the mean deviation |
Details
The function performs PCA on the provided data set. Noise is removed by reconstructing the original values from only a subset of the extracted PCs, by thresholding PC scores (setting all values with absolute value below the provided cutoff to 0), or by a combination of both.
Value
Depending on requested resulttype:
p |
Matrix with original observations projected back onto original attribute space after filtering |
d |
Data frame with the Mahalanobis distance of the observations, calculated only on the subset of requested PCs, and with the reconstruction error |
b |
List containing both values mentioned above |
Examples
a = iris[-5]
b0 = filterpca(a,npc=4,res="p")
b1 = filterpca(a,npc=3,res="p")
b2 = filterpca(a,npc=2,res="p")
pairs(b0,pch=20,col=iris$Species)
pairs(b1,pch=20,col=iris$Species)
pairs(b2,pch=20,col=iris$Species)
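The reconstruction step for the "keep the leading k components" case can be sketched with base R's prcomp; this is a simplified sketch, not the package implementation (thresholding and distance computation are omitted):
pca_filter_sketch <- function(x, k) {
  p <- prcomp(x)                                             # centers by default
  recon <- p$x[, 1:k, drop = FALSE] %*% t(p$rotation[, 1:k, drop = FALSE])
  sweep(recon, 2, p$center, "+")                             # undo the centering
}
round(head(pca_filter_sketch(iris[-5], 2)), 2)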
Fix extremes for logit transformation
Description
Change extreme values in proportional data prior to logit transformation
Usage
fixlimits(x)
Arguments
x |
name of vector to adjust |
Details
The function assumes a data range of 0<=x<=1. Data outside this range are regarded as measurement errors and recoded to NA. In order to avoid generating missings during the logit transformation, values >=1 and <=0 respectively are shifted to lie strictly within the range (0,1), excluding the borders themselves: they are recoded to the mean of the respective border and the most extreme nearest neighbour.
Value
vector of same length as x with adjusted values
Examples
fixlimits(0:5/5)
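My reading of the recoding rule, as a hypothetical sketch rather than the package source: out-of-range values become NA, and exact 0/1 values are pulled inside the open interval by averaging with the nearest interior extreme.
fixlimits_sketch <- function(x) {
  x[which(x < 0 | x > 1)] <- NA
  hi <- max(x[x < 1], na.rm = TRUE)  # largest value below 1
  lo <- min(x[x > 0], na.rm = TRUE)  # smallest value above 0
  x[which(x >= 1)] <- (1 + hi) / 2
  x[which(x <= 0)] <- (0 + lo) / 2
  x
}
fixlimits_sketch(0:5 / 5)  # 0.1 0.2 0.4 0.6 0.8 0.9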
Detect inconsistencies in 1:1 mapping
Description
In properly normalized databases, no inconsistencies should be present. In real-world data, however, ID mappings or database key candidates are repeated over and over across observations, especially in multi-centric studies with basic research data. incons tries to detect and flag these mapping discrepancies.
Usage
incons(x, y, printproblems=FALSE)
Arguments
x , y |
vectors of equal length, holding the mapping values, separated by "," |
printproblems |
Should a table of found problems be printed in addition to the returned flag? |
Details
incons assumes a 1:1 mapping between the provided variables, as is commonly the case, for example, in ID translation steps.
Value
A named vector indicating for each observation whether ambiguous mapping occurs (TRUE) or the mapping is clean (FALSE)
Examples
id1 = c(1,2,2,3,4)
id2 = c("a","b","c","d","d")
ambiguous <- incons(id1,id2,print=TRUE)
data.frame(id1,id2,ambiguous)
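The detection logic can be sketched by counting distinct partners per value; a hypothetical illustration of what incons flags (NA handling omitted), not the package source:
flag_ambiguous <- function(x, y) {
  nx <- tapply(y, x, function(v) length(unique(v)))  # distinct y per x value
  ny <- tapply(x, y, function(v) length(unique(v)))  # distinct x per y value
  unname(nx[as.character(x)] > 1 | ny[as.character(y)] > 1)
}
flag_ambiguous(id1, id2)  # FALSE TRUE TRUE TRUE TRUE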
Convert logit transformed M-Values of proportional data back to original 0/1 range
Description
Although analyses of proportional data are conducted in M space, for publication figures the estimated values are commonly shown in the original space (range between 0 and 1). This function provides back-scaling of the M values to the original space by inverting the logit transformation done by beta2m().
Usage
m2beta(M)
Arguments
M |
vector or matrix holding the M-values to be transformed back |
Details
Data are transformed according to
b=\frac{2^M}{2^M+1}
Value
A named vector/matrix with same dimensions as M and transformed values
Examples
b = 1:99 / 100
M = beta2m(b)
plot(b,m2beta(M))
print(all.equal(b, m2beta(M)))
Normalize numeric variable to the range [0,1]
Description
Changes range of numeric variables to have min=0 and max=1
Usage
normalize(x)
Arguments
x |
name of object to normalize |
Details
The function changes the range of the named numeric vector to finally have min(x)=0 and max(x)=1.
Value
vector of same length as x with normalized values
Examples
normalize(1:5)
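Presumably this is the usual linear rescale; a one-line sketch under that assumption, not the package source:
normalize_sketch <- function(x) (x - min(x)) / (max(x) - min(x))
normalize_sketch(1:5)  # 0.00 0.25 0.50 0.75 1.00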
PCA on automatically selected attributes in high dimensional data
Description
Conduct PCA on variables with biggest variance in high dimensional data matrix
Usage
pcv(x, cols=5, sites=5000)
Arguments
x |
name of data matrix |
cols |
number of principal components to extract |
sites |
number of attributes to consider |
Details
pcv assumes data in a numeric matrix in variable-major format, i.e. every line corresponds to a variable, while the columns correspond to the individual observations. This is commonly the case for data from high-throughput experiments, where the number of data points per individual is high (> 10,000) while the size of batches is comparably small (dozens to hundreds). Variables with missing values are disregarded for the selection.
Use t() to transpose individual-major data sets beforehand.
pcv selects the attributes with the highest variance, up to the number provided, but takes care to limit these to the actual size of the present data set.
This is often used as a first step in high-throughput measurements to detect global effects of known batch variables.
Value
matrix with rows corresponding to observations and columns to extracted components. Values denote the scores on the extracted components for the respective observations.
Examples
pcs <- pcv(t(iris[1:4]),cols=2)
cor(pcs,iris[-5])
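The selection and extraction steps can be sketched with base R; this is my reading of the Details as a hypothetical sketch, not the package implementation:
pcv_sketch <- function(x, cols = 5, sites = 5000) {
  x <- x[rowSums(is.na(x)) == 0, , drop = FALSE]   # disregard variables with NAs
  v <- apply(x, 1, var)
  top <- order(v, decreasing = TRUE)[seq_len(min(sites, nrow(x)))]
  scores <- prcomp(t(x[top, , drop = FALSE]))$x    # observations become rows
  scores[, seq_len(min(cols, ncol(scores))), drop = FALSE]
}
cor(pcv_sketch(t(iris[1:4]), cols = 2), iris[-5])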
Batch effect removal by mean centering and shifting
Description
Remove known categorical batch effects from high dimensional data sets
Usage
rmbat(x,batches)
Arguments
x |
name of object to be processed. This is a matrix in attribute-major format (rows correspond to variables, columns to observations) |
batches |
Vector with batch identifiers for each of the columns in x |
Details
For each variable the mean values of all batches are shifted to the grand mean of the total sample. In case several independent batch effects are present in the data set, these can either be combined into one batch variable, or the batches can be removed one at a time by chaining the processing and calling the function with each of the batch variables in turn.
Value
matrix with same dimensions as x and batch effects removed
Note
This function is intended for use with methods that do not inherently allow inclusion of covariates in the analysis itself, e.g. PCA or heatmaps. If methods are used that allow inclusion of batches in the analysis, such as linear models, that is preferred, as the approach above can otherwise greatly reduce power if batches are correlated with the effect variable.
Examples
# create data set
n_obs = 8
n_var = 10
predictor <- rep(0:1,n_obs*0.5)
pure_effect <- outer(rnorm(n_var),predictor)
error <- matrix(rnorm(n_var*n_obs),n_var,n_obs)
batch1 <- rep(1:2,each=n_obs*0.5)
batch2 <- rep(c(1,2,1,2),each=n_obs*0.25)
batch_effect1 <- outer(rnorm(n_var)*2,scale(batch1))[,,1]
batch_effect2 <- outer(rnorm(n_var)*4,scale(batch2))[,,1]
batch_effect <- batch_effect1 + batch_effect2
data_measured <- pure_effect + batch_effect + error
zero = outer(rep(0,n_var),rep(0,n_obs))
b1 <- rmbat(batch_effect1,batch1)
b2 <- rmbat(batch_effect2,batch2)
b12a <- rmbat(batch_effect1,paste(batch1,batch2))
b12b <- rmbat(rmbat(batch_effect,batch1),batch2) # remove the two effects by chaining
all.equal(b1,zero)
all.equal(b2,zero)
all.equal(b12a,zero)
all.equal(b12b,zero)
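The centering described in the Details can be sketched as follows; a hypothetical sketch of per-batch mean shifting, not the package source:
rmbat_sketch <- function(x, batches) {
  grand <- rowMeans(x)                              # per-variable grand mean
  for (b in unique(batches)) {
    idx <- batches == b
    x[, idx] <- x[, idx] - rowMeans(x[, idx, drop = FALSE]) + grand
  }
  x
}
all.equal(rmbat_sketch(batch_effect1, batch1), zero)  # TRUE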
Compute Variance inflation factor
Description
Calculate variance inflation factors (VIF) for all numeric variables contained in data set
Usage
vifx(x)
Arguments
x |
name of data frame for which the VIFs should be computed |
Details
The function takes the data set named in x and builds a linear model for each numeric variable it contains, regressing the selected variable on all other numeric variables in the data set.
The multiple R-squared is computed and transformed to the VIF using the following formula:
VIF_i=\frac{1}{1-R_i^2}
Value
A named vector with names given by the contained numeric variables and values by the computed respective VIFs
Examples
vifx(iris)
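The computation can be sketched with base R's lm; a hypothetical illustration of the formula above, not the package source:
vif_sketch <- function(df) {
  num <- df[sapply(df, is.numeric)]
  sapply(names(num), function(v) {
    r2 <- summary(lm(reformulate(setdiff(names(num), v), v), data = num))$r.squared
    1 / (1 - r2)                                   # VIF_i = 1 / (1 - R_i^2)
  })
}
vif_sketch(iris)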
Create text data files using convenient defaults
Description
Several presets are provided for creating text data files. The functions are based on write.table(), with some predefined extras to save time when writing data sets to plain text files.
Usage
write.tab(data, filename, sep="\t", quote=FALSE, row.names=FALSE, ...)
write.space(data, filename, sep=" ", quote=FALSE, row.names=FALSE, ...)
Arguments
data |
name of object to be saved |
filename |
File name |
sep |
Column separator, see details |
quote |
Should text data be quoted? Default FALSE |
row.names |
Whether row names should be included in output, default FALSE |
... |
Further arguments passed on to write.table() |
Details
Both functions take the file name as second positional argument and call write.table(). The difference between the two is that write.tab predefines the column separator as "\t", while write.space uses the write.table default " ".
Examples
## Not run:
write.tab(iris,"~/iris_tab.txt")
write.space(iris,"~/iris_space.txt")
## End(Not run)
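The wrapper pattern itself is small; a sketch under the defaults listed above (hypothetical, not the package source):
write_tab_sketch <- function(data, filename, sep = "\t", quote = FALSE,
                             row.names = FALSE, ...) {
  write.table(data, filename, sep = sep, quote = quote, row.names = row.names, ...)
}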