| Type: | Package |
| Title: | Toolkit for Analysis of Genomic Data |
| Version: | 5.3.1 |
| Description: | A toolkit for analysis of genomic data. The 'misha' package implements an efficient data structure for storing genomic data, and provides a set of functions for data extraction, manipulation and analysis. Some of the 2D genome algorithms were described in Yaffe and Tanay (2011) <doi:10.1038/ng.947>. |
| License: | MIT + file LICENSE |
| URL: | https://tanaylab.github.io/misha/, https://github.com/tanaylab/misha |
| BugReports: | https://github.com/tanaylab/misha/issues |
| Depends: | R (≥ 3.0.0) |
| Imports: | magrittr, curl, digest, ps, parallel, utils |
| Suggests: | data.table, dplyr, glue, knitr, readr, rmarkdown, spelling, stats, stringr, testthat (≥ 3.0.0), tibble, withr |
| Config/testthat/edition: | 3 |
| Config/testthat/start-first: | liftover, multifasta-import |
| Encoding: | UTF-8 |
| Language: | en-US |
| LazyLoad: | yes |
| NeedsCompilation: | yes |
| OS_type: | unix |
| RoxygenNote: | 7.3.2 |
| VignetteBuilder: | knitr |
| Packaged: | 2025-12-10 10:19:53 UTC; aviezerl |
| Author: | Misha Hoichman [aut], Aviezer Lifshitz [aut, cre], Eitan Yaffe [aut], Amos Tanay [aut], Weizmann Institute of Science [cph] |
| Maintainer: | Aviezer Lifshitz <aviezer.lifshitz@weizmann.ac.il> |
| Repository: | CRAN |
| Date/Publication: | 2025-12-10 11:30:02 UTC |
Toolkit for analysis of genomic data
Description
The 'misha' package is intended to help users efficiently analyze genomic data obtained from various experiments.
Details
For a complete list of help resources, use library(help = "misha").
The following options are available for the package. Use the 'options' function to alter their values.
| NAME | DEFAULT | DESCRIPTION |
| gmax.data.size | AUTO | Auto-configured based on system RAM and processes. Formula: min((RAM * 0.7) / gmax.processes, 10GB). Prevents excessive memory usage by 'gextract', 'gscreen', etc. |
| gbig.intervals.size | 1000000 | Minimal number of intervals in a big intervals set format |
| gmax.mem.usage | 10000000 | Maximal memory consumption of all child processes in KB before the limiting algorithm is invoked. |
| gmax.processes | AUTO | Auto-configured to 70% of CPU cores. Maximal number of processes for multitasking. |
| gmax.processes2core | 2 | Maximal number of processes per CPU core for multitasking |
| gmin.scope4process | 10000 | Minimal scope range (for 2D: surface) assigned to a process in multitasking mode. |
| gbuf.size | 1000 | Size of track expression values buffer. |
| gtrack.chunk.size | 100000 | Chunk size in bytes of a 2D track. If '0' chunk size is unlimited. |
| gtrack.num.chunks | 0 | Maximal number of 2D track chunks simultaneously stored in memory. |
| gmultitasking | TRUE | Enable/disable automatic parallelization. Small datasets (< gmax.processes * 1000 records) use single-threaded mode. |
More information about the options can be found in 'User manual' of the package.
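As a rough illustration of the options table, the sketch below sets two options with base R's 'options' function and reproduces the documented 'gmax.data.size' formula and the 'gmultitasking' small-dataset threshold. The helper names 'auto_max_data_size' and 'runs_single_threaded' are hypothetical, and reading 10GB as 10e9 bytes is an assumption; the package's internal computation may differ.

```r
# Package options are ordinary R options, so options()/getOption() apply.
old <- options(gmax.processes = 4, gmultitasking = TRUE)
getOption("gmax.processes")

# Hypothetical helper mirroring the documented formula:
#   min((RAM * 0.7) / gmax.processes, 10GB)
# Assumption: 10GB is taken as 10e9 bytes.
auto_max_data_size <- function(ram_bytes, processes) {
  min((ram_bytes * 0.7) / processes, 10e9)
}

# Hypothetical helper for the documented threshold: fewer than
# gmax.processes * 1000 records means single-threaded mode.
runs_single_threaded <- function(n_records, processes) {
  n_records < processes * 1000
}

size_small_ram <- auto_max_data_size(16e9, getOption("gmax.processes"))
size_large_ram <- auto_max_data_size(512e9, getOption("gmax.processes"))
single <- runs_single_threaded(2500, getOption("gmax.processes"))

options(old) # restore the previous option values
```

With 16 GB of RAM and 4 processes the formula is below the cap, while with 512 GB the 10GB cap takes over.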
Author(s)
Maintainer: Aviezer Lifshitz aviezer.lifshitz@weizmann.ac.il
Authors:
Misha Hoichman misha@hoichman.com
Eitan Yaffe eitan.yaffe@weizmann.ac.il
Amos Tanay amos.tanay@weizmann.ac.il
Other contributors:
Weizmann Institute of Science [copyright holder]
See Also
Useful links:
Report bugs at https://github.com/tanaylab/misha/issues
Pipe operator
Description
See magrittr::%>% for details.
Usage
lhs %>% rhs
Arguments
lhs |
A value or the magrittr placeholder. |
rhs |
A function call using the magrittr semantics. |
Value
The result of calling 'rhs(lhs)'.
An environment for storing the package global variables
Description
An environment for storing the package global variables
Usage
.misha
Format
An object of class environment of length 10.
Calculates quantiles of a track expression for bins
Description
Calculates quantiles of a track expression for bins.
Usage
gbins.quantiles(
...,
expr = NULL,
percentiles = 0.5,
intervals = get("ALLGENOME", envir = .misha),
include.lowest = FALSE,
iterator = NULL,
band = NULL
)
Arguments
... |
pairs of 'bin_expr' and 'breaks', where 'bin_expr' is a track expression whose values determine the bin and 'breaks' defines the bins. See |
expr |
track expression for which quantiles are calculated |
percentiles |
an array of percentiles of quantiles in [0, 1] range |
intervals |
genomic scope for which the function is applied. |
include.lowest |
if 'TRUE', the lowest value of the range determined by breaks is included |
iterator |
track expression iterator. If 'NULL' iterator is determined implicitly based on track expressions. |
band |
track expression band. If 'NULL' no band is used. |
Details
This function is a binned version of 'gquantiles'. For each iterator interval the value of 'bin_expr' is calculated and assigned to the corresponding bin determined by 'breaks'. The quantiles of 'expr' are then calculated separately for each bin.
The bins can be multi-dimensional depending on the number of 'bin_expr'-'breaks' pairs.
The range of bins is determined by 'breaks' argument. For example: 'breaks=c(x1, x2, x3, x4)' represents three different intervals (bins): (x1, x2], (x2, x3], (x3, x4].
If 'include.lowest' is 'TRUE' the lowest value will be included in the first interval, i.e. in [x1, x2].
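The bin semantics described here match what base R's 'cut' produces with its default right-closed intervals, so the idea can be sketched without a database. This is only a conceptual emulation on made-up vectors: 'bin_values' and 'expr_values' are hypothetical stand-ins for the per-interval values of 'bin_expr' and 'expr', and base R's 'quantile' is not claimed to match the package's quantile algorithm.

```r
breaks <- c(0, 0.2, 0.4, 2)
bin_values  <- c(0, 0.1, 0.2, 0.3, 0.4, 1.5)  # made-up 'bin_expr' values
expr_values <- c(10, 20, 30, 40, 50, 60)      # made-up 'expr' values

# include.lowest = FALSE: bins are (0, 0.2], (0.2, 0.4], (0.4, 2];
# the value 0 falls outside every bin and is dropped.
bins_open <- cut(bin_values, breaks = breaks, include.lowest = FALSE)
med_open <- tapply(expr_values, bins_open, quantile, probs = 0.5)

# include.lowest = TRUE: the first bin becomes [0, 0.2], so 0 is kept.
bins_closed <- cut(bin_values, breaks = breaks, include.lowest = TRUE)
med_closed <- tapply(expr_values, bins_closed, quantile, probs = 0.5)

med_open
med_closed
```

Only the first bin changes between the two runs, because 'include.lowest' affects nothing but the left edge of the lowest bin.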
Value
Multi-dimensional array representing quantiles for each percentile and bin.
See Also
gquantiles, gintervals.quantiles,
gdist
Examples
gdb.init_examples()
gbins.quantiles("dense_track", c(0, 0.2, 0.4, 2), "sparse_track",
percentiles = c(0.2, 0.5),
intervals = gintervals(1),
iterator = "dense_track"
)
Calculates summary statistics of a track expression for bins
Description
Calculates summary statistics of a track expression for bins.
Usage
gbins.summary(
...,
expr = NULL,
intervals = get("ALLGENOME", envir = .misha),
include.lowest = FALSE,
iterator = NULL,
band = NULL
)
Arguments
... |
pairs of 'bin_expr' and 'breaks', where 'bin_expr' is a track expression whose values determine the bin and 'breaks' defines the bins. See |
expr |
track expression for which summary statistics is calculated |
intervals |
genomic scope for which the function is applied |
include.lowest |
if 'TRUE', the lowest value of the range determined by breaks is included |
iterator |
track expression iterator. If 'NULL' iterator is determined implicitly based on track expressions. |
band |
track expression band. If 'NULL' no band is used. |
Details
This function is a binned version of 'gsummary'. For each iterator interval the value of 'bin_expr' is calculated and assigned to the corresponding bin determined by 'breaks'. The summary statistics of 'expr' are then calculated separately for each bin.
The bins can be multi-dimensional depending on the number of 'bin_expr'-'breaks' pairs.
The range of bins is determined by 'breaks' argument. For example: 'breaks=c(x1, x2, x3, x4)' represents three different intervals (bins): (x1, x2], (x2, x3], (x3, x4].
If 'include.lowest' is 'TRUE' the lowest value will be included in the first interval, i.e. in [x1, x2].
Value
Multi-dimensional array representing summary statistics for each bin.
See Also
gsummary, gintervals.summary,
gdist
Examples
gdb.init_examples()
gbins.summary("dense_track", c(0, 0.2, 0.4, 2), "sparse_track",
intervals = gintervals(1), iterator = "dense_track"
)
Calculates distribution of contact distances
Description
Calculates distribution of contact distances.
Usage
gcis_decay(
expr = NULL,
breaks = NULL,
src = NULL,
domain = NULL,
intervals = NULL,
include.lowest = FALSE,
iterator = NULL,
band = NULL
)
Arguments
expr |
track expression |
breaks |
breaks that determine the bin |
src |
source intervals |
domain |
domain intervals |
intervals |
genomic scope for which the function is applied |
include.lowest |
if 'TRUE', the lowest value of the range determined by breaks is included |
iterator |
2D track expression iterator. If 'NULL' iterator is determined implicitly based on track expressions. |
band |
track expression band. If 'NULL' no band is used. |
Details
A 2D iterator interval '(chrom1, start1, end1, chrom2, start2, end2)' is said to represent a contact between two 1D intervals I1 and I2: '(chrom1, start1, end1)' and '(chrom2, start2, end2)'.
For contacts where 'chrom1' equals 'chrom2' and I1 is within source intervals the function calculates the distribution of distances between I1 and I2. The distribution is calculated separately for intra-domain and inter-domain contacts.
An interval is within source intervals if the union of all source intervals fully covers it. 'src' intervals are allowed to contain overlapping intervals.
Two intervals I1 and I2 are within the same domain (intra-domain contact) if there is a domain interval that fully covers both I1 and I2. Otherwise the contact is considered to be inter-domain. 'domain' must contain only non-overlapping intervals.
The distance between I1 and I2 is the absolute distance between the centers of these intervals, i.e.: '|(start1 + end1 - start2 - end2) / 2|'.
The range of distances for which the distribution is calculated is defined by 'breaks' argument. For example: 'breaks=c(x1, x2, x3, x4)' represents three different intervals (bins): (x1, x2], (x2, x3], (x3, x4].
If 'include.lowest' is 'TRUE' the lowest value will be included in the first interval, i.e. in [x1, x2].
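The distance and domain rules can be sketched in base R as follows. This is an illustration on made-up coordinates rather than the package's implementation: the helper names 'contact_distance' and 'is_intra_domain' are hypothetical, and 'domain' is assumed to be a data frame of non-overlapping intervals on the same chromosome as the contact.

```r
# Distance between the centers of I1 and I2:
#   |(start1 + end1 - start2 - end2) / 2|
contact_distance <- function(start1, end1, start2, end2) {
  abs((start1 + end1 - start2 - end2) / 2)
}

# Hypothetical helper: TRUE if a single domain interval covers both
# I1 and I2 (intra-domain contact), FALSE otherwise (inter-domain).
is_intra_domain <- function(start1, end1, start2, end2, domain) {
  lo <- min(start1, start2)
  hi <- max(end1, end2)
  any(domain$start <= lo & domain$end >= hi)
}

domain <- data.frame(start = c(0, 5000), end = c(5000, 20000))

d_a <- contact_distance(100, 200, 1000, 1100)    # both inside [0, 5000)
d_b <- contact_distance(4000, 4100, 6000, 6100)  # spans two domains
intra_a <- is_intra_domain(100, 200, 1000, 1100, domain)
intra_b <- is_intra_domain(4000, 4100, 6000, 6100, domain)

# Assign the distances to right-closed bins, as 'breaks' does.
breaks <- c(0, 500, 1000, 5000)
bin_index <- as.integer(cut(c(d_a, d_b), breaks))
```

The first contact is counted in the intra-domain distribution and the second in the inter-domain one, each in the bin given by 'bin_index'.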
Value
2-dimensional vector representing the distribution of contact distances for inter and intra domains.
See Also
gdist, gtrack.2d.import_contacts
Examples
gdb.init_examples()
src <- rbind(
gintervals(1, 10, 100),
gintervals(1, 200, 300),
gintervals(1, 400, 500),
gintervals(1, 600, 700),
gintervals(1, 7000, 9100),
gintervals(1, 9000, 18000),
gintervals(1, 30000, 31000),
gintervals(2, 1130, 15000)
)
domain <- rbind(
gintervals(1, 0, 483000),
gintervals(2, 0, 300000)
)
gcis_decay("rects_track", 50000 * (1:10), src, domain)
Runs R commands on a cluster
Description
Runs R commands on a cluster that supports SGE.
Usage
gcluster.run(
...,
opt.flags = "",
max.jobs = 400,
debug = FALSE,
R = "R",
control_dir = NULL
)
Arguments
... |
R commands |
opt.flags |
optional flags for qsub command |
max.jobs |
maximal number of simultaneously submitted jobs |
debug |
if 'TRUE', additional reports are printed |
R |
command that launches R |
control_dir |
directory where the control files are stored. Note that this directory should be accessible from all nodes. If 'NULL', a temporary directory is created under the current misha database. |
Details
This function runs R commands on a cluster by distributing them among cluster nodes. It must run on a machine that supports Sun Grid Engine (SGE). The order in which the commands are executed cannot be guaranteed, so the commands must be independent of each other.
Optional flags to 'qsub' command can be passed through 'opt.flags' parameter. Users are strongly recommended to use only '-l' flag as other flags might interfere with those that are already used (-terse, -S, -o, -e, -V). For additional information please refer to the manual of 'qsub'.
The maximal number of simultaneously submitted jobs is controlled by 'max.jobs'.
Set 'debug' argument to 'TRUE' to print additional reports.
'gcluster.run' launches R on the cluster nodes to execute the commands. 'R' argument specifies how R executable should be invoked.
Value
Return value ('retv') is a list, such that 'retv[[i]]' represents the result of the run of command number 'i'. Each result consists of 4 fields that can be accessed by 'retv[[i]]$FIELDNAME':
| FIELDNAME | DESCRIPTION |
| exit.status | Exit status of the command. Possible values: 'success', 'failure' or 'interrupted'. |
| retv | Return value of the command. |
| stdout | Standard output of the command. |
| stderr | Standard error of the command. |
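Because running on a cluster is not always possible, the sketch below uses a hand-built list with the four documented fields to show how a return value of this shape might be inspected. The contents of 'retv' are made up for illustration and are not real 'gcluster.run' output.

```r
# Mock of a return value with the documented fields (hypothetical data).
retv <- list(
  list(exit.status = "success", retv = 42, stdout = "done\n", stderr = ""),
  list(exit.status = "failure", retv = NULL, stdout = "", stderr = "Error: boom\n"),
  list(exit.status = "success", retv = c(a = 1, b = 2), stdout = "", stderr = "")
)

# Collect the exit status of every command.
status <- vapply(retv, function(res) res$exit.status, character(1))

# Indices of commands that did not succeed, and their standard error.
failed <- which(status != "success")
for (i in failed) {
  message("command ", i, " ended with '", status[i], "': ", retv[[i]]$stderr)
}

# Return values of the commands that succeeded.
results <- lapply(retv[status == "success"], function(res) res$retv)
```

Checking 'exit.status' before using 'retv' avoids treating the result of a failed or interrupted command as valid.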
Examples
gdb.init_examples()
# Run only on systems with Sun Grid Engine (SGE)
if (FALSE) {
v <- 17
gcluster.run(
gsummary("dense_track + v"),
{
intervs <- gscreen("dense_track > 0.1", gintervals(1, 2))
gsummary("sparse_track", intervs)
},
gsummary("rects_track")
)
}
Computes auto-correlation between the strands for a file of mapped sequences
Description
Calculates auto-correlation between plus and minus strands for the given chromosome in a file of mapped sequences.
Usage
gcompute_strands_autocorr(
file = NULL,
chrom = NULL,
binsize = NULL,
maxread = 400,
cols.order = c(9, 11, 13, 14),
min.coord = 0,
max.coord = 3e+08
)
Arguments
file |
the name of the file containing mapped sequences |
chrom |
chromosome for which the auto-correlation is computed |
binsize |
size of the bin. The auto-correlation is calculated for bins in the range of [-maxread, maxread] |
maxread |
maximal length of the sequence used for statistics |
cols.order |
order of sequence, chromosome, coordinate and strand columns in file |
min.coord |
minimal coordinate used for statistics |
max.coord |
maximal coordinate used for statistics |
Details
This function calculates auto-correlation between plus and minus strands for the given chromosome in a file of mapped sequences. Each line in the file describes one read. Each column is separated by a TAB character.
The following columns must be present in the file: sequence, chromosome, coordinate and strand. The positions of these columns are controlled by the 'cols.order' argument. The default value of 'cols.order' is a vector (9,11,13,14), meaning that the sequence is expected at column number 9, the chromosome at column 11, the coordinate at column 13 and the strand at column 14. The first column should be referenced by 1 and not by 0.
Coordinates that are not in [min.coord, max.coord] range are ignored.
gcompute_strands_autocorr outputs the total statistics and the auto-correlation given by bins. The size of the bin is indicated by 'binsize' parameter. Statistics is calculated for bins in the range of [-maxread, maxread].
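To make the role of 'cols.order' concrete, the following base R sketch builds one made-up TAB-separated line and picks the four required columns by their 1-based positions. The line contents, including the "+" strand encoding, are assumptions for illustration; the actual file format accepted by the function is not specified here.

```r
cols.order <- c(9, 11, 13, 14)  # sequence, chromosome, coordinate, strand
min.coord <- 0
max.coord <- 3e+08

# Build one made-up read line with 14 TAB-separated columns.
fields <- rep(".", 14)
fields[cols.order] <- c("ACGTACGT", "chr1", "12345", "+")
line <- paste(fields, collapse = "\t")

# Pick the required columns by their 1-based positions.
parsed <- strsplit(line, "\t", fixed = TRUE)[[1]][cols.order]
names(parsed) <- c("sequence", "chromosome", "coordinate", "strand")

# Coordinates outside [min.coord, max.coord] are ignored.
coord <- as.numeric(parsed[["coordinate"]])
used <- coord >= min.coord && coord <= max.coord
```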
Value
Statistics for each strand and auto-correlation by given bins.
Examples
gdb.init_examples()
gcompute_strands_autocorr(paste(.misha$GROOT, "reads", sep = "/"),
"chr1", 50,
maxread = 300
)
Change Database to Indexed Genome Format
Description
Converts a per-chromosome database to indexed genome format with a single consolidated genome.seq file and genome.idx index. Optionally also converts tracks and interval sets to indexed format.
Usage
gdb.convert_to_indexed(
groot = NULL,
remove_old_files = FALSE,
force = FALSE,
validate = TRUE,
convert_tracks = FALSE,
convert_intervals = FALSE,
verbose = FALSE,
chunk_size = 104857600
)
Arguments
groot |
Root directory of the database to change to indexed format. If NULL, uses the currently active database. |
remove_old_files |
Logical. If TRUE, removes old per-chromosome files after successful conversion. Default: FALSE. |
force |
Logical. If TRUE, forces the conversion without confirmation. Default: FALSE. |
validate |
Logical. If TRUE, validates the conversion by comparing sequences. Default: TRUE. |
convert_tracks |
Logical. If TRUE, also converts all eligible tracks to indexed format. Default: FALSE. |
convert_intervals |
Logical. If TRUE, also converts all eligible interval sets to indexed format. Default: FALSE. |
verbose |
Logical. If TRUE, prints verbose messages. Default: FALSE. |
chunk_size |
Integer. The size of the chunk to read from the sequence files. Default: 104857600 (100MB). Reduce if you are running into memory issues. |
Details
This function converts a per-chromosome database (with separate .seq files per contig) to indexed format (single genome.seq + genome.idx). The indexed format provides better performance and scalability, especially for genomes with many contigs.
Important: Preserving Chromosome Order
For exact conversion that produces bit-for-bit identical results before and after conversion, you should load the source database first using gsetroot() or gdb.init():
- If database is loaded: Uses chromosome order from ALLGENOME (exact preservation)
- If database is not loaded: Uses order from chrom_sizes.txt (may differ from ALLGENOME)
This ensures that the converted database has the exact same chromosome ordering, which affects iteration order, interval IDs, and other operations that depend on chromosome order.
The conversion process:
- Checks if database is already in indexed format
- Gets chromosome order from ALLGENOME (if loaded) or chrom_sizes.txt
- Consolidates all per-chromosome .seq files into genome.seq
- Creates genome.idx with CRC64 checksum
- Optionally validates the conversion
- Optionally removes old .seq files
- If convert_tracks=TRUE, converts all eligible 1D tracks (dense, sparse, array)
- If convert_intervals=TRUE, converts all eligible interval sets (1D and 2D)
Tracks and intervals that cannot be converted (and are skipped):
- Tracks: 2D tracks, virtual tracks, single-file tracks, already converted tracks
- Intervals: Single-file interval sets, already converted interval sets
Value
Invisible NULL
See Also
gdb.create, gdb.init, gtrack.convert_to_indexed, gintervals.convert_to_indexed, gintervals.2d.convert_to_indexed
Examples
## Not run:
# Recommended: Load database first for exact conversion
gsetroot("/path/to/database")
gdb.convert_to_indexed(
convert_tracks = TRUE,
convert_intervals = TRUE,
remove_old_files = TRUE,
verbose = TRUE
)
# Convert current database to indexed format (genome only)
gdb.convert_to_indexed()
# Convert specific database without loading it first
# Note: chromosome order may differ from ALLGENOME
gdb.convert_to_indexed(groot = "/path/to/database")
# Convert genome and all tracks to indexed format
gdb.convert_to_indexed(convert_tracks = TRUE)
# Full conversion with validation and cleanup
gsetroot("/path/to/database") # Load first for exact order preservation
gdb.convert_to_indexed(
convert_tracks = TRUE,
convert_intervals = TRUE,
remove_old_files = TRUE,
validate = TRUE,
verbose = TRUE
)
## End(Not run)
Creates a new Genomic Database
Description
Creates a new Genomic Database.
Usage
gdb.create(
groot = NULL,
fasta = NULL,
genes.file = NULL,
annots.file = NULL,
annots.names = NULL,
format = NULL,
verbose = FALSE
)
Arguments
groot |
path to newly created database |
fasta |
an array of names or URLs of FASTA files. Can contain wildcards for multiple files |
genes.file |
name or URL of file that contains genes. If 'NULL' no genes are imported |
annots.file |
name or URL of file that contains annotations. If 'NULL' no annotations are imported |
annots.names |
annotations names |
format |
database format: "indexed" (default, single genome.seq + genome.idx)
or "per-chromosome" (separate .seq file per contig). If NULL, uses the value from
|
verbose |
if TRUE, prints verbose messages |
Details
This function creates a new Genomic Database at the location specified by 'groot'. FASTA files are converted to 'Seq' format and appropriate 'chrom_sizes.txt' file is generated (see "User Manual" for more details).
Two database formats are supported:
- indexed: Single genome.seq + genome.idx (default). Recommended for genomes with many contigs. Provides better performance and scalability.
- per-chromosome: Separate .seq file per contig.
If 'genes.file' is not 'NULL' four sets of intervals are created in the
database: tss, exons, utr3 and utr5. See
gintervals.import_genes for more details about importing genes
intervals.
'fasta', 'genes.file' and 'annots.file' can each be either a file path or a URL in the form of 'ftp://[address]/[file]'. 'fasta' can also contain wildcards to indicate multiple files. The files that these arguments point to can be zipped or unzipped.
See the 'Genomes' vignette for details on how to create a database from common genome sources.
Value
None.
See Also
gdb.init, gdb.reload,
gintervals.import_genes
Examples
# ftp <- "ftp://hgdownload.soe.ucsc.edu/goldenPath/mm10"
# mm10_dir <- file.path(tempdir(), "mm10")
# # only a single chromosome is loaded in this example
# # see "Genomes" vignette how to download all of them and how
# # to download other genomes
# gdb.create(
# mm10_dir,
# paste(ftp, "chromosomes", paste0(
# "chr", c("X"),
# ".fa.gz"
# ), sep = "/"),
# paste(ftp, "database/knownGene.txt.gz", sep = "/"),
# paste(ftp, "database/kgXref.txt.gz", sep = "/"),
# c(
# "kgID", "mRNA", "spID", "spDisplayID", "geneSymbol",
# "refseq", "protAcc", "description", "rfamAcc",
# "tRnaName"
# )
# )
# gdb.init(mm10_dir)
# gintervals.ls()
# gintervals.all()
Create and Load a Genome Database
Description
This function downloads, extracts, and loads a misha genome database for the specified genome.
Usage
gdb.create_genome(genome, path = getwd(), tmpdir = tempdir())
Arguments
genome |
A character string specifying the genome to download. Supported genomes are "mm9", "mm10", "mm39", "hg19", and "hg38". |
path |
A character string specifying the directory where the genome will be extracted. Defaults to genome name (e.g. "mm10") in the current working directory. |
tmpdir |
A character string specifying the directory for storing temporary files. This is used for storing the downloaded genome file. |
Details
The function checks if the specified genome is available. If it is, the function constructs the download URL, downloads the genome file to 'tmpdir', extracts it to the specified directory, and loads the genome database using gsetroot. The function also calls gdb.reload to reload the genome database.
Value
None.
Examples
mm10_dir <- tempdir()
gdb.create_genome("mm10", path = mm10_dir)
list.files(file.path(mm10_dir, "mm10"))
gsetroot(file.path(mm10_dir, "mm10"))
gintervals.ls()
Returns a list of read-only track attributes
Description
Returns a list of read-only track attributes.
Usage
gdb.get_readonly_attrs()
Details
This function returns a list of read-only track attributes. These attributes are not allowed to be modified or deleted.
If no attributes are marked as read-only a 'NULL' is returned.
Value
A list of read-only track attributes.
See Also
gdb.set_readonly_attrs,
gtrack.attr.get, gtrack.attr.set
Get Database Information
Description
Returns information about a misha genome database including format, number of chromosomes, total genome size, and whether it uses the indexed format.
Usage
gdb.info(groot = NULL)
Arguments
groot |
Root directory of the database. If NULL, uses the currently active database. |
Value
A list with database information:
- path: Full path to the database
- is_db: TRUE if this is a valid misha database
- format: "indexed" or "per-chromosome"
- num_chromosomes: Number of chromosomes/contigs
- genome_size: Total length of genome in bases
- chromosomes: Data frame with chromosome names and sizes
Examples
## Not run:
# Get info about currently active database
info <- gdb.info()
cat("Database format:", info$format, "\n")
cat("Genome size:", info$genome_size / 1e6, "Mb\n")
# Get info about specific database
info <- gdb.info("/path/to/database")
## End(Not run)
Initializes connection with Genomic Database
Description
Initializes connection with Genomic Database: loads the list of tracks, intervals, etc.
Usage
gdb.init(groot = NULL, dir = NULL, rescan = FALSE)
gdb.init_examples()
gsetroot(groot = NULL, dir = NULL, rescan = FALSE)
Arguments
groot |
the root directory of the Genomic Database |
dir |
the current working directory inside the Genomic Database |
rescan |
indicates whether the file structure should be rescanned |
Details
'gdb.init' initializes the connection with the Genomic Database. It is typically called before any other function. When the package is attached it internally calls 'gdb.init_examples', which opens the connection with the database located at 'PKGDIR/trackdb/test' directory, where 'PKGDIR' is the directory where the package is installed.
The current working directory inside the Genomic Database is set to 'dir'. If 'dir' is 'NULL', the current working directory is set to 'GROOT/tracks'.
If 'rescan' is 'TRUE', the list of tracks and intervals is obtained by rescanning the directory structure under the current working directory. Otherwise 'gdb.init' attempts to use the cached list that resides in 'groot/.db.cache' file.
Upon completion the connection is established with the database. If
auto-completion mode is switched on (see 'gset_input_method') the list of
tracks and intervals sets is loaded and added as variables to the global
environment allowing auto-completion of object names with <TAB> key. Also a
few variables are defined in an environment called .misha, and can be
accessed using .misha$variable, e.g. .misha$ALLGENOME.
These variables should not be modified by the user.
| GROOT | Root directory of Genomic Database |
| GWD | Current working directory inside Genomic Database |
| GTRACKS | List of all available tracks |
| GINTERVS | List of all available intervals |
| GVTRACKS | List of all available virtual tracks |
| ALLGENOME | List of all chromosomes and their sizes |
| GITERATOR.INTERVALS | A set of iterator intervals for which the track expression is evaluated |
When option 'gmulticontig.indexed_format' is set to TRUE, the function loads a database with "indexed" track format.
Value
None.
See Also
gdb.reload, gdb.create,
gdir.cd, gtrack.ls, gintervals.ls,
gvtrack.ls
Mark cached track list as dirty
Description
When tracks or interval sets are modified outside of misha (e.g. files copied
manually), the cached inventory may become out of date. Calling this helper
marks the cache as dirty so the next gsetroot() forces a rescan.
Usage
gdb.mark_cache_dirty()
Value
Invisible TRUE if the dirty flag was written, FALSE otherwise.
Reloads database from the disk
Description
Reloads database from disk: list of tracks, intervals, etc.
Usage
gdb.reload(rescan = TRUE)
Arguments
rescan |
indicates whether the file structure should be rescanned |
Details
Reloads Genomic Database from disk: list of tracks, intervals, etc. Use this function if you manually add tracks or if for any reason the database becomes corrupted. If 'rescan' is 'TRUE', the list of tracks and intervals is obtained by rescanning the directory structure under the current working directory. Otherwise 'gdb.reload' attempts to use the cached list that resides in 'GROOT/.db.cache' file.
Value
No return value, called for side effects.
See Also
gdb.init, gdb.create,
gdir.cd,
Sets read-only track attributes
Description
Sets read-only track attributes.
Usage
gdb.set_readonly_attrs(attrs)
Arguments
attrs |
a vector of read-only attributes names or 'NULL' |
Details
This function sets the list of read-only track attributes. The specified attributes may or may not already exist in the tracks.
If 'attrs' is 'NULL' the list of read-only attributes is emptied.
Value
None.
See Also
gdb.get_readonly_attrs,
gtrack.attr.get, gtrack.attr.set
Changes current working directory in Genomic Database
Description
Changes current working directory in Genomic Database.
Usage
gdir.cd(dir = NULL)
Arguments
dir |
directory path |
Details
This function changes the current working directory in Genomic Database (not to be confused with shell's current working directory). The list of database objects - tracks, intervals, track variables - is rescanned recursively under 'dir'. Object names are updated with respect to the new current working directory. Example: a track named 'subdir.dense' will be referred to as 'dense' once the current working directory is set to 'subdir'. All virtual tracks are removed.
Value
None.
See Also
gdb.init, gdir.cwd,
gdir.create, gdir.rm
Examples
gdb.init_examples()
gdir.cd("subdir")
gtrack.ls()
gdir.cd("..")
gtrack.ls()
Creates a new directory in Genomic Database
Description
Creates a new directory in Genomic Database.
Usage
gdir.create(dir = NULL, showWarnings = TRUE, mode = "0777")
Arguments
dir |
directory path |
showWarnings |
see 'dir.create' |
mode |
see 'dir.create' |
Details
This function creates a new directory in Genomic Database. Creates only the last element in the specified path.
Value
None.
Note
A new directory cannot be created within an existing track directory.
See Also
dir.create, gdb.init,
gdir.cwd, gdir.rm
Returns the current working directory in Genomic Database
Description
Returns the absolute path of the current working directory in Genomic Database.
Usage
gdir.cwd()
Details
This function returns the absolute path of the current working directory in Genomic Database (not to be confused with shell's current working directory).
Value
A character string of the path.
See Also
gdb.init, gdir.cd,
gdir.create, gdir.rm
Deletes a directory from Genomic Database
Description
Deletes a directory from Genomic Database.
Usage
gdir.rm(dir = NULL, recursive = FALSE, force = FALSE)
Arguments
dir |
directory path |
recursive |
if 'TRUE', the directory is deleted recursively |
force |
if 'TRUE', suppresses user confirmation of tracks/intervals removal |
Details
This function deletes a directory from Genomic Database. If 'recursive' is 'TRUE', the directory is deleted with all the files/directories it contains. If the directory contains tracks or intervals, the user is prompted to confirm the deletion. Set 'force' to 'TRUE' to suppress the prompt.
Value
None.
See Also
gdb.init, gdir.create,
gdir.cd, gdir.cwd
Calculates distribution of track expressions
Description
Calculates distribution of track expressions' values over the given set of bins.
Usage
gdist(
...,
intervals = NULL,
include.lowest = FALSE,
iterator = NULL,
band = NULL
)
Arguments
... |
pairs of 'expr', 'breaks' where 'expr' is a track expression and the breaks determine the bin |
intervals |
genomic scope for which the function is applied |
include.lowest |
if 'TRUE', the lowest value of the range determined by breaks is included |
iterator |
track expression iterator. If 'NULL' iterator is determined implicitly based on track expressions. |
band |
track expression band. If 'NULL' no band is used. |
Details
This function calculates the distribution of values of the numeric track expressions over the given set of bins.
The range of bins is determined by 'breaks' argument. For example: 'breaks=c(x1, x2, x3, x4)' represents three different intervals (bins): (x1, x2], (x2, x3], (x3, x4].
If 'include.lowest' is 'TRUE' the lowest value will be included in the first interval, i.e. in [x1, x2].
'gdist' can work with any number of dimensions. If more than one 'expr'-'breaks' pair is passed, the result is a multidimensional vector, and an individual value can be accessed by [i1,i2,...,iN] notation, where 'i1' is the bin index for the first track expression and 'iN' is the bin index for the last track expression.
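The multidimensional result can be pictured with base R's 'cut' and 'table' on made-up vectors. This is a conceptual sketch only: 'x' and 'y' are hypothetical stand-ins for the values of two track expressions over the same iterator intervals, and the object returned by 'gdist' itself is not claimed to be a 'table'.

```r
# Made-up values of two expressions over the same five intervals.
x <- c(0.05, 0.15, 0.15, 0.35, 0.9)
y <- c(0.1, 0.5, 0.5, 1.5, 0.1)

breaks_x <- c(0, 0.1, 0.2, 0.4)  # bins (0, 0.1], (0.1, 0.2], (0.2, 0.4]
breaks_y <- c(0, 0.2, 1, 2)      # bins (0, 0.2], (0.2, 1], (1, 2]

# Count intervals per pair of bins; values outside the breaks
# (here x = 0.9) become NA and are not counted.
dist2d <- table(cut(x, breaks_x), cut(y, breaks_y))

dim(dist2d)
dist2d[2, 2]  # intervals with x in (0.1, 0.2] and y in (0.2, 1]
```

The first index selects the bin of the first expression and the second index the bin of the second, matching the [i1,i2,...,iN] notation.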
Value
N-dimensional vector where N is the number of 'expr'-'breaks' pairs.
Examples
gdb.init_examples()
## calculate the distribution of dense_track for bins:
## (0, 0.2], (0.2, 0.5] and (0.5, 1]
gdist("dense_track", c(0, 0.2, 0.5, 1))
## calculate two-dimensional distribution:
## dense_track vs. sparse_track
gdist("dense_track", seq(0, 1, by = 0.1), "sparse_track",
seq(0, 2, by = 0.2),
iterator = 100
)
Returns evaluated track expression
Description
Returns the result of track expressions evaluation for each of the iterator intervals.
Usage
gextract(
...,
intervals = NULL,
colnames = NULL,
iterator = NULL,
band = NULL,
file = NULL,
intervals.set.out = NULL
)
Arguments
... |
track expression |
intervals |
genomic scope for which the function is applied |
colnames |
sets the column names in the returned value. If 'NULL', names are set to the track expressions. |
iterator |
track expression iterator. If 'NULL' iterator is determined implicitly based on track expressions. |
band |
track expression band. If 'NULL' no band is used. |
file |
file name where the function result is optionally outputted in tab-delimited format |
intervals.set.out |
intervals set name where the function result is optionally outputted |
Details
This function returns the result of track expressions evaluation for each of the iterator intervals. The returned value is a set of intervals with an additional column for each of the track expressions. This value can be used as an input for any other function that accepts intervals. If the intervals inside the 'intervals' argument overlap, gextract returns the overlapped coordinates more than once.
The order inside the result might not be the same as the order of intervals. An additional column 'intervalID' is added to the return value. Use this column to refer to the index of the original interval from the supplied 'intervals'.
If 'file' parameter is not 'NULL' the result is written to a tab-delimited text file (without 'intervalID' column) rather than returned to the user. This can be especially useful when the result is too big to fit into the physical memory. The resulting file can be used as an input for 'gtrack.import' or 'gtrack.array.import' functions.
If 'intervals.set.out' is not 'NULL' the result is saved as an intervals set. Similarly to 'file' parameter 'intervals.set.out' can be useful to overcome the limits of the physical memory.
'colnames' parameter controls the names of the columns that contain the evaluated expressions. By default the column names match the track expressions.
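The role of 'intervalID' can be sketched in plain R. The data frame 'res' below is hand-written to look like a 'gextract' result (its values are invented, not extracted from a track), and 'name' is a hypothetical extra column on the input intervals.

```r
# Input intervals with a hypothetical extra column 'name'.
intervs <- data.frame(
    chrom = "chr1",
    start = c(0, 500),
    end = c(100, 600),
    name = c("a", "b")
)

# Hand-written stand-in for a gextract() result: one row per iterator
# interval, with 'intervalID' pointing at the row of 'intervs'.
res <- data.frame(
    chrom = "chr1",
    start = c(0, 50, 500, 550),
    end = c(50, 100, 550, 600),
    dense_track = c(0.1, 0.2, 0.3, 0.4),
    intervalID = c(1, 1, 2, 2)
)

# Carry a column of the original intervals over to the result rows.
res$name <- intervs$name[res$intervalID]

# Summarize the extracted values per original interval.
means <- tapply(res$dense_track, res$intervalID, mean)
```

Because the lookup goes through 'intervalID' rather than row position, it keeps working when the rows of the result are not in the order of the supplied intervals.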
Value
If 'file' and 'intervals.set.out' are 'NULL', a set of intervals with an additional column for each of the track expressions and an 'intervalID' column.
See Also
gtrack.array.extract, gsample,
gtrack.import, gtrack.array.import,
glookup, gpartition, gdist
Examples
gdb.init_examples()
## get values of 'dense_track' for [0, 400), chrom 1
gextract("dense_track", gintervals(1, 0, 400))
## get values of 'rects_track' (a 2D track) for a 2D interval
gextract(
"rects_track",
gintervals.2d("chr1", 0, 4000, "chr2", 2000, 5000)
)
Creates a set of 1D intervals
Description
Creates a set of 1D intervals.
Usage
gintervals(chroms = NULL, starts = 0, ends = -1, strands = NULL)
Arguments
chroms |
chromosomes - an array of strings with or without "chr" prefixes or an array of integers (like: '1' for "chr1") |
starts |
an array of start coordinates |
ends |
an array of end coordinates. If '-1' chromosome size is assumed. |
strands |
'NULL' or an array consisting of '-1', '0' or '1' values |
Details
This function returns a set of one-dimensional intervals. The returned value can be used in all functions that accept 'intervals' argument.
A set of one-dimensional intervals is a data frame whose first three columns are 'chrom', 'start' and 'end'. Each row of the data frame represents a genomic interval of the specified chromosome in the range of [start, end). Additional columns may be present in a 1D intervals object, but they must come after the three obligatory ones.
If 'strands' argument is not 'NULL' an additional column "strand" is added to the intervals. The possible values of a strand can be '1' (plus strand), '-1' (minus strand) or '0' (unknown).
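The layout described above can be reproduced by hand with 'data.frame'. This is a minimal sketch with invented coordinates; 'length' is a hypothetical extra column, placed after the obligatory ones.

```r
# Obligatory columns first, then 'strand', then any extra columns.
intervs <- data.frame(
    chrom = c("chr1", "chr1"),
    start = c(0, 100),
    end = c(100, 250),
    strand = c(1, -1)
)

# Intervals are half-open, [start, end), so the length is simply end - start.
intervs$length <- intervs$end - intervs$start

# [0, 100) and [100, 250) touch but share no coordinate.
shared_bp <- max(0, min(intervs$end) - max(intervs$start))
```

The half-open convention is what makes the two intervals above adjacent rather than overlapping: coordinate 100 belongs only to the second one.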
Value
A data frame representing the intervals.
See Also
gintervals.2d, gintervals.force_range
Examples
gdb.init_examples()
## the following 2 calls produce identical results (all of chr1)
gintervals(1)
gintervals("1")
## all of chrX
gintervals("chrX")
gintervals(1, 1000)
gintervals(c("chr2", "chrX"), 10, c(3000, 5000))
Creates a set of 2D intervals
Description
Creates a set of 2D intervals.
Usage
gintervals.2d(
chroms1 = NULL,
starts1 = 0,
ends1 = -1,
chroms2 = NULL,
starts2 = 0,
ends2 = -1
)
Arguments
chroms1 |
chromosomes1 - an array of strings with or without "chr" prefixes or an array of integers (like: '1' for "chr1") |
starts1 |
an array of start1 coordinates |
ends1 |
an array of end1 coordinates. If '-1' chromosome size is assumed. |
chroms2 |
chromosomes2 - an array of strings with or without "chr" prefixes or an array of integers (like: '1' for "chr1"). If 'NULL', 'chroms2' is assumed to be equal to 'chroms1'. |
starts2 |
an array of start2 coordinates |
ends2 |
an array of end2 coordinates. If '-1' chromosome size is assumed. |
Details
This function returns a set of two-dimensional intervals. The returned value can be used in all functions that accept 'intervals' argument.
A set of two-dimensional intervals is a data frame whose first six columns are 'chrom1', 'start1', 'end1', 'chrom2', 'start2' and 'end2'. Each row of the data frame represents two genomic intervals from two chromosomes, each in the range of [start, end). Additional columns may be present in a 2D intervals object, but they must come after the six obligatory ones.
Value
A data frame representing the intervals.
See Also
gintervals, gintervals.force_range
Examples
gdb.init_examples()
## the following 2 calls produce identical results (chr1 vs. chr1)
gintervals.2d(1)
gintervals.2d("1")
## chrX vs. chrX
gintervals.2d("chrX")
gintervals.2d(1, 1000, 2000, "chrX", 400, 800)
gintervals.2d(c("chr2", "chrX"), 10, c(3000, 5000), 1)
Returns 2D intervals that cover the whole genome
Description
Returns 2D intervals that cover the whole genome.
Usage
gintervals.2d.all()
Details
This function returns a set of two-dimensional intervals that cover the whole genome as it is defined by 'chrom_sizes.txt' file.
Value
A data frame representing the intervals.
Intersects two-dimensional intervals with a band
Description
Intersects two-dimensional intervals with a band.
Usage
gintervals.2d.band_intersect(
intervals = NULL,
band = NULL,
intervals.set.out = NULL
)
Arguments
intervals |
two-dimensional intervals |
band |
track expression band. If 'NULL' no band is used. |
intervals.set.out |
intervals set name where the function result is optionally outputted |
Details
This function intersects each two-dimensional interval from 'intervals' with 'band'. If the intersection is not empty, the interval is shrunk to the minimal rectangle that contains the band and added to the return value.
If 'intervals.set.out' is not 'NULL' the result is saved as an intervals set. Use this parameter if the result size exceeds the limits of the physical memory.
Value
If 'intervals.set.out' is 'NULL' a data frame representing the intervals.
See Also
gintervals.2d, gintervals.intersect
Examples
gdb.init_examples()
gintervals.2d.band_intersect(gintervals.2d(1), c(10000, 20000))
Convert 2D interval set to indexed format
Description
Converts a per-chromosome interval set to indexed format (intervals2d.dat + intervals2d.idx) which reduces file descriptor usage.
Usage
gintervals.2d.convert_to_indexed(
set.name = NULL,
remove.old = FALSE,
force = FALSE
)
Arguments
set.name |
name of 2D interval set to convert |
remove.old |
if TRUE, removes old per-chromosome files after successful conversion |
force |
if TRUE, re-converts even if already in indexed format |
Details
The indexed format stores all chromosome pairs in a single intervals2d.dat file with an intervals2d.idx index file. This dramatically reduces file descriptor usage, especially for genomes with many chromosomes (N*(N-1)/2 files to just 2).
Only non-empty pairs are stored in the index, avoiding O(N^2) space overhead.
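As a rough illustration of the file counts involved, the N*(N-1)/2 figure quoted above can be evaluated for a few genome sizes. The chromosome counts are arbitrary examples, not properties of any particular database.

```r
# Arbitrary example chromosome counts.
n_chroms <- c(24, 100, 1000)

# Number of per-pair files according to the N*(N-1)/2 figure above.
per_pair_files <- n_chroms * (n_chroms - 1) / 2

# The indexed format always uses two files (.dat and .idx).
data.frame(n_chroms, per_pair_files, indexed_files = 2)
```

For 24 chromosomes this gives 276 per-pair files, and for 1000 contigs it gives 499500, which is where the saving in file descriptors becomes significant.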
The conversion process:
1. Scans the directory for existing per-pair files
2. Creates temporary intervals2d.dat.tmp and intervals2d.idx.tmp files
3. Concatenates all per-pair files into intervals2d.dat.tmp
4. Builds the index with pair offsets and checksums
5. Atomically renames the temporary files to their final names
6. Optionally removes the old per-pair files
The indexed format is 100% backward compatible.
Value
invisible NULL
Examples
## Not run:
# Convert a 2D interval set
gintervals.2d.convert_to_indexed("my_2d_intervals")
# Convert and remove old files
gintervals.2d.convert_to_indexed("my_2d_intervals", remove.old = TRUE)
# Force re-conversion
gintervals.2d.convert_to_indexed("my_2d_intervals", force = TRUE)
## End(Not run)
Returns 1D intervals that cover the whole genome
Description
Returns 1D intervals that cover the whole genome.
Usage
gintervals.all()
Details
This function returns a set of one-dimensional intervals that cover the whole genome as it is defined by 'chrom_sizes.txt' file.
Value
A data frame representing the intervals.
Annotates 1D intervals using nearest neighbors
Description
Annotates one-dimensional intervals by finding nearest neighbors in another set of intervals and adding selected columns from the neighbors to the original intervals.
Usage
gintervals.annotate(
intervals,
annotation_intervals,
annotation_columns = NULL,
column_names = NULL,
dist_column = "dist",
max_dist = Inf,
na_value = NA,
maxneighbors = 1,
tie_method = c("first", "min.start", "min.end"),
overwrite = FALSE,
keep_order = TRUE,
intervals.set.out = NULL,
...
)
Arguments
intervals |
Intervals to annotate (1D). |
annotation_intervals |
Source intervals containing annotation data (1D). |
annotation_columns |
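Character vector of column names to copy from annotation_intervals. If NULL (default), all non-basic columns are copied. |
column_names |
Optional custom names for the annotation columns. If provided, must have the same length as annotation_columns. |
dist_column |
Name of the distance column to include. Use NULL to omit the distance column. |
max_dist |
Maximum absolute distance. When finite, neighbors farther away than max_dist are treated as missing and their annotations are filled with na_value. |
na_value |
Value(s) to use for annotations when the neighbor is beyond max_dist or no neighbor is found. |
maxneighbors |
Maximum number of neighbors per interval (duplicates intervals as needed). Defaults to 1. |
tie_method |
Tie-breaking when distances are equal: one of "first" (arbitrary but stable), "min.start" (smaller neighbor start first), or "min.end" (smaller neighbor end first). Applies when maxneighbors > 1. |
overwrite |
When FALSE (default), a name collision between base columns and annotation columns raises an error; when TRUE, the base columns are replaced by the annotation columns. |
keep_order |
If TRUE (default), the original order of intervals is preserved in the result. |
intervals.set.out |
intervals set name where the function result is optionally outputted |
... |
Additional arguments forwarded to gintervals.neighbors. |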
Details
The function wraps and extends gintervals.neighbors to provide
convenient column selection/renaming, optional distance inclusion, distance
thresholding with custom NA values, multiple neighbors per interval, and
deterministic tie-breaking. Currently supports 1D intervals only.
- When annotation_columns = NULL, all non-basic columns present in
annotation_intervals are included.
- Setting dist_column = NULL omits the distance column.
- If no neighbor is found for an interval, annotation columns are filled with
na_value and the distance (when present) is NA_real_.
- Column name collisions are handled as follows: when overwrite=FALSE
a clear error is emitted; when overwrite=TRUE, base columns with the
same names are replaced by annotation columns.
Value
A data frame containing the original intervals plus the requested
annotation columns (and optional distance column). If
maxneighbors > 1, rows may be duplicated per input interval to
accommodate multiple neighbors.
Examples
# Prepare toy data
intervs <- gintervals(1, c(1000, 5000), c(1100, 5050))
ann <- gintervals(1, c(900, 5400), c(950, 5500))
ann$remark <- c("a", "b")
ann$score <- c(10, 20)
# Basic usage with default columns (all non-basic columns)
gintervals.annotate(intervs, ann)
# Select specific columns, with custom names and distance column name
gintervals.annotate(
intervs, ann,
annotation_columns = c("remark"),
column_names = c("ann_remark"),
dist_column = "ann_dist"
)
# Distance threshold with scalar NA replacement
gintervals.annotate(
intervs, ann,
annotation_columns = c("remark"),
max_dist = 200,
na_value = "no_ann"
)
# Multiple neighbors with deterministic tie-breaking
nbrs <- gintervals.annotate(
gintervals(1, 1000, 1100),
{
x <- gintervals(1, c(800, 1200), c(900, 1300))
x$label <- c("left", "right")
x
},
annotation_columns = "label",
maxneighbors = 2,
tie_method = "min.start"
)
nbrs
# Overwrite existing columns in the base intervals
intervs2 <- intervs
intervs2$remark <- c("orig1", "orig2")
gintervals.annotate(intervs2, ann, annotation_columns = "remark", overwrite = TRUE)
Transforms existing intervals to a chain format
Description
Transforms existing intervals to a chain format by validating required columns and adding chain attributes.
Usage
gintervals.as_chain(
intervals = NULL,
src_overlap_policy = "error",
tgt_overlap_policy = "auto",
min_score = NULL
)
Arguments
intervals |
a data frame with chain columns: chrom, start, end, strand, chromsrc, startsrc, endsrc, strandsrc, chain_id, score |
src_overlap_policy |
source overlap policy: "error", "keep", or "discard" |
tgt_overlap_policy |
target overlap policy: "error", "auto", "auto_first", "auto_longer", "auto_score", "discard", "keep", or "agg" |
min_score |
optional minimum alignment score threshold |
Details
This function checks that the input intervals data frame has all the required columns for a chain format and adds the necessary attributes. A chain format requires both target coordinates (chrom, start, end, strand) and source coordinates (chromsrc, startsrc, endsrc, strandsrc), as well as chain_id and score columns.
Value
A data frame in chain format with chain attributes set
See Also
gintervals.load_chain, gintervals.liftover
Examples
gdb.init_examples()
# Create a chain from existing intervals
chain_data <- data.frame(
chrom = "chr1",
start = 1000,
end = 2000,
strand = 0,
chromsrc = "chr1",
startsrc = 5000,
endsrc = 6000,
strandsrc = 0,
chain_id = 1L,
score = 1000.0
)
chain <- gintervals.as_chain(chain_data)
Converts intervals to canonic form
Description
Converts intervals to canonic form.
Usage
gintervals.canonic(intervals = NULL, unify_touching_intervals = TRUE)
Arguments
intervals |
intervals to be converted |
unify_touching_intervals |
if 'TRUE', touching one-dimensional intervals are unified |
Details
This function converts 'intervals' into a "canonic" form: properly sorted with no overlaps. The result can be used later in the functions that require the intervals to be in canonic form. Use 'unify_touching_intervals' to control whether intervals that touch each other (i.e. the end coordinate of one equals the start coordinate of the other) are unified. 'unify_touching_intervals' is ignored if two-dimensional intervals are used.
Since 'gintervals.canonic' unifies overlapping or touching intervals, the number of returned intervals might be smaller than the number of original intervals. To let the user find the origin of each new interval, a 'mapping' attribute is attached to the result. It maps the original intervals to the resulting intervals. Use 'attr(retv_of_gintervals.canonic, "mapping")' to retrieve the map.
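The idea of unifying intervals while recording a mapping can be sketched in base R. This is not the implementation used by 'gintervals.canonic'; 'canonic_sketch' is a hypothetical helper that assumes all intervals lie on a single chromosome.

```r
# Hypothetical helper: merge overlapping (and optionally touching) intervals
# on ONE chromosome and report which merged interval each input row joined.
canonic_sketch <- function(start, end, unify_touching = TRUE) {
    ord <- order(start, end)
    s <- start[ord]
    e <- end[ord]
    # Largest end coordinate seen before each (sorted) interval.
    prev_max_end <- c(-Inf, head(cummax(e), -1))
    # A new merged interval begins where the sorted interval starts past
    # everything seen so far (touching counts as "past" only if not unified).
    new_group <- if (unify_touching) s > prev_max_end else s >= prev_max_end
    group <- cumsum(new_group)
    merged <- data.frame(
        start = as.numeric(tapply(s, group, min)),
        end = as.numeric(tapply(e, group, max))
    )
    # mapping[i] is the index of the merged interval that input i belongs to.
    mapping <- integer(length(start))
    mapping[ord] <- group
    list(intervals = merged, mapping = mapping)
}

res_sketch <- canonic_sketch(
    start = c(11000, 100, 10000, 10500),
    end = c(12000, 200, 13000, 10600)
)
res_sketch$mapping # 2 1 2 2
```

Sorting by start and tracking the running maximum of the end coordinates is a common way to merge intervals in one pass; the mapping is then obtained by undoing the sort.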
Value
A data frame representing the canonic intervals and an attribute 'mapping' that maps the original intervals to the resulting ones.
Examples
gdb.init_examples()
## Create intervals manually by using 'data.frame'.
## Note that we add an additional column 'data'.
## Return value:
## chrom start end data
## 1 chr1 11000 12000 10
## 2 chr1 100 200 20
## 3 chr1 10000 13000 30
## 4 chr1 10500 10600 40
intervs <- data.frame(
chrom = "chr1",
start = c(11000, 100, 10000, 10500),
end = c(12000, 200, 13000, 10600),
data = c(10, 20, 30, 40)
)
## Convert the intervals into the canonic form.
## The function discards any columns besides chrom, start and end.
## Return value:
## chrom start end
## 1 chr1 100 200
## 2 chr1 10000 13000
res <- gintervals.canonic(intervs)
## By inspecting mapping attribute we can see how the new
## intervals were created: "2 1 2 2" means that the first
## interval in the result was created from the second interval in
## the original set (we look for the indices in mapping where "1"
## appears). Likewise the second interval in the result was
## created from 3 intervals in the original set. Their indices are
## 1, 3 and 4 (once again we look for the indices in mapping where
## "2" appears).
## Return value:
## 2 1 2 2
attr(res, "mapping")
## Finally (and that is the most useful part of 'mapping'
## attribute): we add a new column 'data' to our result which is
## the mean value of the original data column. The trick is done
## using 'tapply' on par with 'mapping' attribute. For example,
## 20.00000 is the result of 'mean(intervs[2,]$data)' while
## 26.66667 is a result of 'mean(intervs[c(1,3,4),]$data)'.
## 'res' after the following call:
## chrom start end data
## 1 chr1 100 200 20.00000
## 2 chr1 10000 13000 26.66667
res$data <- tapply(intervs$data, attr(res, "mapping"), mean)
Returns number of intervals per chromosome
Description
Returns number of intervals per chromosome (or chromosome pair).
Usage
gintervals.chrom_sizes(intervals = NULL)
Arguments
intervals |
intervals set |
Details
This function returns number of intervals per chromosome (for 1D intervals) or chromosome pair (for 2D intervals).
Value
Data frame representing number of intervals per chromosome (for 1D intervals) or chromosome pair (for 2D intervals).
See Also
gintervals.load, gintervals.save,
gintervals.exists, gintervals.ls,
gintervals, gintervals.2d
Examples
gdb.init_examples()
gintervals.chrom_sizes("annotations")
Convert 1D interval set to indexed format
Description
Converts a per-chromosome interval set to indexed format (intervals.dat + intervals.idx) which reduces file descriptor usage.
Usage
gintervals.convert_to_indexed(
set.name = NULL,
remove.old = FALSE,
force = FALSE
)
Arguments
set.name |
name of interval set to convert |
remove.old |
if TRUE, removes old per-chromosome files after successful conversion |
force |
if TRUE, re-converts even if already in indexed format |
Details
The indexed format stores all chromosomes in a single intervals.dat file with an intervals.idx index file. This reduces file descriptor usage from N files (one per chromosome) to just 2 files.
The conversion process:
1. Creates temporary intervals.dat.tmp and intervals.idx.tmp files
2. Concatenates all per-chromosome files into intervals.dat.tmp
3. Builds the index with offsets and checksums
4. Atomically renames the temporary files to their final names
5. Optionally removes the old per-chromosome files
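The "write to a temporary file, then rename" pattern used in the steps above can be sketched in a few lines of base R. This is a generic illustration, not the package's conversion code: 'write_atomically' is a hypothetical helper, and the file name and contents are invented.

```r
# Hypothetical helper: write to '<final>.tmp' first, then rename it into
# place, so readers never observe a half-written final file.
write_atomically <- function(lines, final_path) {
    tmp_path <- paste0(final_path, ".tmp")
    writeLines(lines, tmp_path)
    if (!file.rename(tmp_path, final_path)) {
        unlink(tmp_path)
        stop("could not rename ", tmp_path, " to ", final_path)
    }
    invisible(final_path)
}

# Invented demo file in the session's temporary directory.
target <- file.path(tempdir(), "intervals_demo.dat")
write_atomically(c("chr1\t0\t100", "chr2\t50\t80"), target)
readLines(target)
```

On Unix systems a rename within one file system replaces the target in a single step, which is why the temporary file is created next to the final one rather than in a different directory.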
The indexed format is 100% backward compatible.
Value
invisible NULL
See Also
gintervals.save, gintervals.load
Examples
## Not run:
# Convert an interval set
gintervals.convert_to_indexed("my_intervals")
# Convert and remove old files
gintervals.convert_to_indexed("my_intervals", remove.old = TRUE)
# Force re-conversion
gintervals.convert_to_indexed("my_intervals", force = TRUE)
## End(Not run)
Calculate fraction of genomic space covered by intervals
Description
Returns the fraction of a genomic space that is covered by a set of intervals.
Usage
gintervals.coverage_fraction(intervals1 = NULL, intervals2 = NULL)
Arguments
intervals1 |
set of one-dimensional intervals (the covering set) |
intervals2 |
set of one-dimensional intervals to be covered (default: NULL, meaning the entire genome) |
Details
This function calculates what fraction of 'intervals2' is covered by 'intervals1'. If 'intervals2' is NULL, it calculates the fraction of the entire genome that is covered by 'intervals1'. Overlapping intervals in either set are automatically unified before calculation.
Value
A single numeric value between 0 and 1 representing the fraction of 'intervals2' (or the genome) covered by 'intervals1'.
See Also
gintervals, gintervals.intersect,
gintervals.covered_bp, gintervals.all
Examples
gdb.init_examples()
# Create some intervals
intervs1 <- gscreen("dense_track > 0.15")
intervs2 <- gintervals(c("chr1", "chr2"), 0, c(100000, 100000))
# Calculate fraction of intervs2 covered by intervs1
gintervals.coverage_fraction(intervs1, intervs2)
# Calculate fraction of entire genome covered by intervs1
gintervals.coverage_fraction(intervs1)
Calculate total base pairs covered by intervals
Description
Returns the total number of base pairs covered by a set of intervals.
Usage
gintervals.covered_bp(intervals = NULL)
Arguments
intervals |
set of one-dimensional intervals |
Details
This function first canonicalizes the intervals to remove overlaps and touching intervals, then sums up the lengths of all resulting intervals. Overlapping intervals are counted only once.
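The counting rule can be sketched in base R without canonicalizing explicitly. 'covered_bp_sketch' is a hypothetical helper, not the package's implementation; it walks the intervals of each chromosome in start order and only counts the part of each interval that extends beyond what was already covered.

```r
# Hypothetical helper: total covered base pairs, counting overlaps once.
covered_bp_sketch <- function(intervs) {
    total <- 0
    for (rows in split(intervs, intervs$chrom)) {
        ord <- order(rows$start)
        s <- rows$start[ord]
        e <- rows$end[ord]
        # Largest end coordinate seen before each (sorted) interval.
        prev_max_end <- c(-Inf, head(cummax(e), -1))
        # Count only the stretch that lies beyond the already covered region.
        total <- total + sum(pmax(0, e - pmax(s, prev_max_end)))
    }
    total
}

intervs_demo <- data.frame(
    chrom = c("chr1", "chr1", "chr2"),
    start = c(100, 150, 1000),
    end = c(200, 250, 2000)
)
covered_bp_sketch(intervs_demo) # 1150
```

On chr1 the first interval contributes 100 bp and the second only the 50 bp beyond coordinate 200; chr2 adds 1000 bp.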
Value
A single numeric value representing the total number of base pairs covered by the intervals.
See Also
gintervals, gintervals.canonic,
gintervals.coverage_fraction
Examples
gdb.init_examples()
# Create some intervals
intervs <- gintervals(
c("chr1", "chr1", "chr2"),
c(100, 150, 1000),
c(200, 250, 2000)
)
# Calculate total bp covered
# Note: intervals [100,200) and [150,250) overlap, so the naive sum of
# lengths, (200-100) + (250-150) + (2000-1000) = 1200, double-counts [150,200).
# After canonicalization: [100,250) + [1000,2000) = 150 + 1000 = 1150
gintervals.covered_bp(intervs)
Calculates difference of two intervals sets
Description
Returns difference of two sets of intervals.
Usage
gintervals.diff(intervals1 = NULL, intervals2 = NULL, intervals.set.out = NULL)
Arguments
intervals1, intervals2 |
set of one-dimensional intervals |
intervals.set.out |
intervals set name where the function result is optionally outputted |
Details
This function returns a genomic space that is covered by 'intervals1' but not covered by 'intervals2'.
If 'intervals.set.out' is not 'NULL' the result is saved as an intervals set. Use this parameter if the result size exceeds the limits of the physical memory.
Value
If 'intervals.set.out' is 'NULL' a data frame representing the intervals.
See Also
gintervals, gintervals.intersect,
gintervals.union
Examples
gdb.init_examples()
intervs1 <- gscreen("dense_track > 0.15")
intervs2 <- gscreen("dense_track < 0.2")
## 'res3' is equal to 'res4'
res3 <- gintervals.diff(intervs1, intervs2)
res4 <- gscreen("dense_track >= 0.2")
Tests for a named intervals set existence
Description
Tests for a named intervals set existence.
Usage
gintervals.exists(intervals.set = NULL)
Arguments
intervals.set |
name of an intervals set |
Details
This function returns 'TRUE' if a named intervals set exists in Genomic Database.
Value
'TRUE' if a named intervals set exists. Otherwise 'FALSE'.
See Also
gintervals.ls, gintervals.load,
gintervals.rm, gintervals.save,
gintervals, gintervals.2d
Examples
gdb.init_examples()
gintervals.exists("annotations")
Limits intervals to chromosomal range
Description
Limits intervals to chromosomal range.
Usage
gintervals.force_range(intervals = NULL, intervals.set.out = NULL)
Arguments
intervals |
intervals |
intervals.set.out |
intervals set name where the function result is optionally outputted |
Details
This function forces the intervals to be within the chromosomal range [0, chrom length) by altering the intervals' boundaries. Intervals that lie entirely outside of the chromosomal range are eliminated. The new intervals are returned.
If 'intervals.set.out' is not 'NULL' the result is saved as an intervals set. Use this parameter if the result size exceeds the limits of the physical memory.
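The clamping rule can be written out in base R as follows. This is only a sketch: 'chrom_len' is a made-up chromosome length (the real one comes from the database), and all intervals are assumed to be on the same chromosome.

```r
# Made-up chromosome length; the real value is defined by the database.
chrom_len <- 500000

intervs_raw <- data.frame(
    chrom = "chr1",
    start = c(11000, -100, 10000, 10500, 600000),
    end = c(12000, 200, 13000000, 10600, 700000)
)

# Pull the boundaries into [0, chrom_len).
clamped <- transform(
    intervs_raw,
    start = pmax(start, 0),
    end = pmin(end, chrom_len)
)

# Intervals that were entirely outside the range end up empty; drop them.
clamped <- clamped[clamped$start < clamped$end, ]
```

The last interval starts beyond the assumed chromosome end, so after clamping its end is smaller than its start and it is removed.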
Value
If 'intervals.set.out' is 'NULL' a data frame representing the intervals.
See Also
gintervals, gintervals.2d,
gintervals.canonic
Examples
gdb.init_examples()
intervs <- data.frame(
chrom = "chr1",
start = c(11000, -100, 10000, 10500),
end = c(12000, 200, 13000000, 10600)
)
gintervals.force_range(intervs)
Imports genes and annotations from files
Description
Imports genes and annotations from files.
Usage
gintervals.import_genes(
genes.file = NULL,
annots.file = NULL,
annots.names = NULL
)
Arguments
genes.file |
name or URL of file that contains genes |
annots.file |
name or URL of file that contains annotations. If 'NULL' no annotations are imported |
annots.names |
annotations names |
Details
This function reads a definition of genes from 'genes.file' and returns four sets of intervals: TSS, exons, 3utr and 5utr. In addition to the regular intervals columns 'strand' column is added. It contains '1' values for '+' strands and '-1' values for '-' strands.
If annotation file 'annots.file' is given, the annotations are attached to the intervals as well. The names of the annotations as they would appear in the return value must be specified in 'annots.names' argument.
Both 'genes.file' and 'annots.file' can be either a file path or URL in a form of 'ftp://[address]/[file]'. Files that these arguments point to can be zipped or unzipped.
Examples of 'genes.file' and 'annots.file' can be found here:
ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz
ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/kgXref.txt.gz
If a few intervals overlap (for example: two TSS regions) they are all unified to an interval that covers the whole overlapping region. 'strand' value is set to '0' if two or more of the overlapping intervals have different strands. The annotations of the overlapping intervals are concatenated to a single character string separated by semicolons. Identical values of overlapping intervals' annotation are eliminated.
Value
A list of four intervals sets named 'tss', 'exons', 'utr3' and 'utr5'. 'strand' column and annotations are attached to the intervals.
Calculates an intersection of two sets of intervals
Description
Calculates an intersection of two sets of intervals.
Usage
gintervals.intersect(
intervals1 = NULL,
intervals2 = NULL,
intervals.set.out = NULL
)
Arguments
intervals1, intervals2 |
set of intervals |
intervals.set.out |
intervals set name where the function result is optionally outputted |
Details
This function returns intervals that represent the genomic space covered by both 'intervals1' and 'intervals2'.
If 'intervals.set.out' is not 'NULL' the result is saved as an intervals set. Use this parameter if the result size exceeds the limits of the physical memory.
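For half-open intervals, the intersection of two intervals is [max(start), min(end)) whenever that range is non-empty. The base R sketch below applies this to every pair; 'intersect_sketch' is a hypothetical helper that assumes both sets are on one chromosome and free of internal overlaps, and it is not how the package computes the result.

```r
# Hypothetical helper: pairwise intersection of two 1D interval sets that
# live on the same chromosome and have no internal overlaps.
intersect_sketch <- function(a, b) {
    pairs <- expand.grid(i = seq_len(nrow(a)), j = seq_len(nrow(b)))
    s <- pmax(a$start[pairs$i], b$start[pairs$j])
    e <- pmin(a$end[pairs$i], b$end[pairs$j])
    keep <- s < e # an empty range means the pair does not overlap
    out <- data.frame(start = s[keep], end = e[keep])
    out[order(out$start), ]
}

set_a <- data.frame(start = c(0, 200), end = c(100, 300))
set_b <- data.frame(start = 50, end = 250)
common <- intersect_sketch(set_a, set_b)
common # [50, 100) and [200, 250)
```

Checking all pairs is quadratic in the number of intervals, which is acceptable for an illustration but not for genome-scale sets.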
Value
If 'intervals.set.out' is 'NULL' a data frame representing the intersection of intervals.
See Also
gintervals.2d.band_intersect,
gintervals.diff, gintervals.union,
gintervals, gintervals.2d
Examples
gdb.init_examples()
intervs1 <- gscreen("dense_track > 0.15")
intervs2 <- gscreen("dense_track < 0.2")
## 'intervs3' and 'intervs4' are identical
intervs3 <- gintervals.intersect(intervs1, intervs2)
intervs4 <- gscreen("dense_track > 0.15 & dense_track < 0.2")
Tests for big intervals set
Description
Tests for big intervals set.
Usage
gintervals.is.bigset(intervals.set = NULL)
Arguments
intervals.set |
name of an intervals set |
Details
This function tests whether 'intervals.set' is a big intervals set. Intervals set is big if it is stored in big intervals set format and given the current limits it cannot be fully loaded into memory.
Memory limit is controlled by 'gmax.data.size' option (see: 'getOption("gmax.data.size")').
Value
'TRUE' if intervals set is big, otherwise 'FALSE'.
See Also
gintervals.load, gintervals.save,
gintervals.exists, gintervals.ls
Examples
gdb.init_examples()
gintervals.is.bigset("annotations")
Converts intervals from another assembly
Description
Converts intervals from another assembly to the current one.
Usage
gintervals.liftover(
intervals = NULL,
chain = NULL,
src_overlap_policy = "error",
tgt_overlap_policy = "auto",
min_score = NULL,
include_metadata = FALSE,
canonic = FALSE,
value_col = NULL,
multi_target_agg = c("mean", "median", "sum", "min", "max", "count", "first", "last",
"nth", "max.coverage_len", "min.coverage_len", "max.coverage_frac",
"min.coverage_frac"),
params = NULL,
na.rm = TRUE,
min_n = NULL
)
Arguments
intervals |
intervals from another assembly |
chain |
name of chain file or data frame as returned by 'gintervals.load_chain' |
src_overlap_policy |
policy for handling source overlaps: "error" (default), "keep", or "discard". "keep" allows one source interval to map to multiple target intervals, "discard" discards all source intervals that have overlaps and "error" throws an error if source overlaps are detected. |
tgt_overlap_policy |
policy for handling target overlaps (default: "auto"). See 'gintervals.load_chain' for a description of the policies. |
min_score |
optional minimum alignment score threshold. Chains with scores below this value are filtered out. Useful for excluding low-quality alignments. |
include_metadata |
logical; if TRUE, adds 'score' column to the output indicating the alignment score of the chain used for each mapping. Only applicable with "auto_score" or "auto" policy. |
canonic |
logical; if TRUE, merges adjacent target intervals that originated from the same source interval (same intervalID) and same chain (same chain_id). This is useful when a source interval maps to multiple adjacent target blocks due to chain gaps. |
value_col |
optional character string specifying the name of a numeric column in the intervals data frame to track through the liftover. When specified, this column's values are preserved in the output with the same column name. Use with multi_target_agg to aggregate values when multiple source intervals map to overlapping target regions. |
multi_target_agg |
aggregation method to use when value_col is specified. One of: "mean", "median", "sum", "min", "max", "count", "first", "last", "nth", "max.coverage_len", "min.coverage_len", "max.coverage_frac", "min.coverage_frac". Default: "mean". Ignored when value_col is NULL. |
params |
additional parameters for specific aggregation methods. Currently only used for "nth" aggregation, where it specifies which element to select (e.g., params = 2 for second element, or params = list(n = 2)). |
na.rm |
logical; if TRUE (default), NA values are removed before aggregation. If FALSE, any NA in the values will cause the result to be NA. Only used when value_col is specified. |
min_n |
optional minimum number of non-NA observations required for aggregation. If fewer observations are available, the result is NA. NULL (default) means no minimum. Only used when value_col is specified. |
Details
This function converts 'intervals' from another assembly to the current one. Chain file instructs how the conversion of coordinates should be done. It can be either a name of a chain file or a data frame in the same format as returned by 'gintervals.load_chain' function.
The converted intervals are returned. An additional column named 'intervalID' is added to the resulting data frame. For each of the resulting intervals it indicates the index of the original interval.
Note: When passing a pre-loaded chain (data frame), overlap policies cannot be specified - they are taken from the chain's attributes that were set during loading. When passing a chain file path, policies can be specified and will be used for loading.
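How 'na.rm' and 'min_n' interact during aggregation, as described in the arguments above, can be pictured with a small base R function. 'agg_sketch' is a hypothetical helper written from that description alone; it is not the code path used by 'gintervals.liftover' and covers only aggregation functions that accept an 'na.rm' argument.

```r
# Hypothetical helper mirroring the documented 'na.rm' / 'min_n' semantics.
agg_sketch <- function(values, fun = mean, na.rm = TRUE, min_n = NULL) {
    n_obs <- sum(!is.na(values))
    # Too few non-NA observations: the aggregated value is NA.
    if (!is.null(min_n) && n_obs < min_n) {
        return(NA_real_)
    }
    # With na.rm = FALSE any NA among the values propagates to the result.
    fun(values, na.rm = na.rm)
}

vals <- c(1, NA, 3)
agg_sketch(vals) # 2
agg_sketch(vals, na.rm = FALSE) # NA
agg_sketch(vals, min_n = 3) # NA: only two non-NA observations
agg_sketch(vals, fun = sum) # 4
```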
Value
A data frame representing the converted intervals. For 1D intervals, always includes 'intervalID' (index of original interval) and 'chain_id' (identifier of the chain that produced the mapping) columns. The chain_id column is essential for distinguishing results when a source interval maps to multiple target regions via different chains (duplications). When include_metadata=TRUE, also includes 'score' column. When value_col is specified, includes the value column with its original name.
See Also
gintervals.load_chain, gtrack.liftover,
gintervals
Examples
gdb.init_examples()
chainfile <- paste(.misha$GROOT, "data/test.chain", sep = "/")
intervs <- data.frame(
chrom = "chr25", start = c(0, 7000),
end = c(6000, 20000)
)
# Liftover with default policies
gintervals.liftover(intervs, chainfile)
# Liftover keeping source overlaps (one source interval may map to multiple targets)
# gintervals.liftover(intervs, chainfile, src_overlap_policy = "keep")
Loads a named intervals set
Description
Loads a named intervals set.
Usage
gintervals.load(
intervals.set = NULL,
chrom = NULL,
chrom1 = NULL,
chrom2 = NULL
)
Arguments
intervals.set |
name of an intervals set |
chrom |
chromosome for 1D intervals set |
chrom1 |
first chromosome for 2D intervals set |
chrom2 |
second chromosome for 2D intervals set |
Details
This function loads and returns intervals stored in a named intervals set.
If intervals set contains 1D intervals and 'chrom' is not 'NULL' only the intervals of the given chromosome are returned.
Likewise if intervals set contains 2D intervals and 'chrom1', 'chrom2' are not 'NULL' only the intervals of the given pair of chromosomes are returned.
For big intervals sets 'chrom' parameter (1D case) / 'chrom1', 'chrom2' parameters (2D case) must be specified. In other words: big intervals sets can be loaded only by chromosome or chromosome pair.
Value
A data frame representing the intervals.
See Also
gintervals.save, gintervals.is.bigset,
gintervals.exists, gintervals.ls,
gintervals, gintervals.2d
Examples
gdb.init_examples()
gintervals.load("annotations")
Loads assembly conversion table from a chain file
Description
Loads assembly conversion table from a chain file.
Usage
gintervals.load_chain(
file = NULL,
src_overlap_policy = "error",
tgt_overlap_policy = "auto",
src_groot = NULL,
min_score = NULL
)
Arguments
file |
name of chain file |
src_overlap_policy |
policy for handling source overlaps: "error" (default), "keep", or "discard". "keep" allows one source interval to map to multiple target intervals, "discard" discards all source intervals that have overlaps and "error" throws an error if source overlaps are detected. |
tgt_overlap_policy |
policy for handling target overlaps. One of:
| ||||||||||||||||||||
src_groot |
optional path to source genome database for validating source chromosomes and coordinates. If provided, the function temporarily switches to this database to verify that all source chromosomes exist and coordinates are within bounds, then restores the original database. | ||||||||||||||||||||
min_score |
optional minimum alignment score threshold. Chains with scores below this value are filtered out. Useful for excluding low-quality alignments. |
Details
This function reads a file in 'chain' format and returns an assembly conversion table that can be used in 'gtrack.liftover' and 'gintervals.liftover'.
Source overlaps occur when the same source genome position maps to multiple target genome positions. Target overlaps occur when multiple source positions map to overlapping regions in the target genome.
The 'src_overlap_policy' controls how source overlaps are handled:
"error" (default): Throw an error if source overlaps are detected
"keep": Keep all mappings, allowing one source to map to multiple targets
"discard": Remove all chain intervals involved in source overlaps
The 'tgt_overlap_policy' controls how target overlaps are handled:
"error": Throw an error if target overlaps are detected
"auto" (default) / "auto_first": Keep the first overlapping chain (original file order) by trimming or discarding later overlaps while keeping source/target lengths consistent
"auto_longer": Keep the longer overlapping chain and trim/drop the shorter ones
"discard": Remove all chain intervals involved in target overlaps
"keep": Allow target overlaps to remain untouched (liftover must be able to handle them)
Value
A data frame representing assembly conversion table with columns: chrom, start, end, strand, chromsrc, startsrc, endsrc, strandsrc, chain_id, score.
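The source-overlap policies from Details can be pictured with a small base R sketch. It does not call 'misha' and is not the package's implementation: the 'chain' data frame, its values and the helper 'src_overlap_involved' are hypothetical, and only the column names 'chromsrc', 'startsrc' and 'endsrc' are borrowed from the table above. Half-open intervals are assumed.

```r
# Hypothetical chain intervals (source side only), half-open [startsrc, endsrc)
chain <- data.frame(
  chromsrc = c("chr1", "chr1", "chr1", "chr2"),
  startsrc = c(0, 50, 200, 0),
  endsrc   = c(100, 150, 300, 100)
)

# Flag every row whose source interval overlaps another row on the same chromosome
src_overlap_involved <- function(df) {
  vapply(seq_len(nrow(df)), function(i) {
    hits <- df$chromsrc == df$chromsrc[i] &
      df$startsrc < df$endsrc[i] &
      df$endsrc > df$startsrc[i]
    hits[i] <- FALSE # a row does not overlap itself
    any(hits)
  }, logical(1))
}

involved <- src_overlap_involved(chain)

# "error":   would stop here because any(involved) is TRUE
# "keep":    would leave the table unchanged
# "discard": drops every row involved in a source overlap
discarded <- chain[!involved, ]
```

The pairwise scan is quadratic and only meant for reading; a real chain file would call for a sort-based sweep.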
See Also
gintervals.liftover, gtrack.liftover
Examples
gdb.init_examples()
chainfile <- paste(.misha$GROOT, "data/test.chain", sep = "/")
# Load chain file with default policies
gintervals.load_chain(chainfile)
Returns a list of named intervals sets
Description
Returns a list of named intervals sets in Genomic Database.
Usage
gintervals.ls(
pattern = "",
ignore.case = FALSE,
perl = FALSE,
fixed = FALSE,
useBytes = FALSE
)
Arguments
pattern, ignore.case, perl, fixed, useBytes |
see 'grep' |
Details
This function returns a list of named intervals sets that match the pattern (see 'grep'). If called without any arguments all named intervals sets are returned.
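Because the arguments follow 'grep', the pattern is a regular expression rather than a shell glob. The base R sketch below shows the difference on a hypothetical vector of set names; it calls 'grep' directly, on the assumption that 'gintervals.ls' forwards these arguments unchanged.

```r
# Hypothetical set names; gintervals.ls() itself is not called here
set_names <- c("annotations", "coding", "my.annots", "anno")

# As a regular expression, "annot*" means "anno" followed by zero or more "t",
# matched anywhere in the name
loose <- grep("annot*", set_names, value = TRUE)

# Anchor the pattern to keep only names that start with "annot"
anchored <- grep("^annot", set_names, value = TRUE)

# fixed = TRUE treats the pattern as a literal string
literal <- grep("my.annots", set_names, value = TRUE, fixed = TRUE)
```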
Value
An array that contains the names of intervals sets.
See Also
grep, gintervals.exists,
gintervals.load, gintervals.save,
gintervals.rm, gintervals,
gintervals.2d
Examples
gdb.init_examples()
gintervals.ls()
gintervals.ls(pattern = "annot*")
Applies a function to values of track expressions
Description
Applies a function to values of track expressions for each interval.
Usage
gintervals.mapply(
FUN = NULL,
...,
intervals = NULL,
enable.gapply.intervals = FALSE,
iterator = NULL,
band = NULL,
intervals.set.out = NULL,
colnames = "value"
)
Arguments
FUN |
function to apply, found via 'match.fun' |
... |
track expressions whose values are used as arguments for 'FUN' |
intervals |
intervals for which track expressions are calculated |
enable.gapply.intervals |
if 'TRUE', then a variable 'GAPPLY.INTERVALS' is available |
iterator |
track expression iterator. If 'NULL' iterator is determined implicitly based on track expressions. |
band |
track expression band. If 'NULL' no band is used. |
intervals.set.out |
intervals set name where the function result is optionally outputted |
colnames |
name of the column that contains the return values of 'FUN'. Default is "value". |
Details
This function evaluates track expressions for each interval from 'intervals'. The resulting vectors are then passed as arguments to 'FUN'.
If the intervals are one-dimensional and have an additional column named 'strand' whose value is '-1', the values of the track expression are placed in the vector in reverse order.
The current interval index (1-based) is stored in the 'GAPPLY.INTERVID' variable, which is available during the execution of 'gintervals.mapply'. There is no guarantee about the order in which the intervals are processed. Do not rely on any specific order; use the 'GAPPLY.INTERVID' variable to detect the current interval id.
If 'enable.gapply.intervals' is 'TRUE', an additional variable 'GAPPLY.INTERVALS' is defined during the execution of 'gintervals.mapply'. This variable stores the current iterator intervals prior to track expression evaluation. Please note that setting 'enable.gapply.intervals' to 'TRUE' might severely affect the run-time of the function.
Note: all changes made to the R environment by 'FUN' are discarded if multitasking mode is switched on. One should also refrain from performing any other operations in 'FUN' that might not be "thread-safe", such as updating files. Please switch off multitasking ('options(gmultitasking = FALSE)') if you wish to perform such operations.
If 'intervals.set.out' is not 'NULL' the result is saved as an intervals set. Use this parameter if the result size exceeds the limits of the physical memory.
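The strand rule matters only for order-sensitive functions. The base R sketch below imitates it on made-up data: 'values', 'strand' and 'apply_with_strand' are hypothetical stand-ins for the per-interval vectors that 'gintervals.mapply' would build, not part of the package.

```r
# Hypothetical per-interval values of one track expression
values <- list(c(1, 2, 3), c(4, 5, 6))
strand <- c(1, -1)

# Imitate the documented rule: reverse the vector when strand is -1,
# then hand it to FUN
apply_with_strand <- function(FUN, values, strand) {
  FUN <- match.fun(FUN)
  vapply(seq_along(values), function(i) {
    v <- if (strand[i] == -1) rev(values[[i]]) else values[[i]]
    FUN(v)
  }, numeric(1))
}

# An order-sensitive function: the first value in reading direction
first_value <- function(x) x[1]
res <- apply_with_strand(first_value, values, strand)

# An order-insensitive function such as max is unaffected by the reversal
res_max <- apply_with_strand(max, values, strand)
```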
Value
If 'intervals.set.out' is 'NULL' a data frame representing intervals with an additional column that contains the return values of 'FUN'. The name of this additional column is specified by the 'colnames' parameter.
See Also
Examples
gdb.init_examples()
gintervals.mapply(
max, "dense_track",
gintervals(c(1, 2), 0, 10000)
)
gintervals.mapply(
function(x, y) {
max(x + y)
}, "dense_track",
"sparse_track", gintervals(c(1, 2), 0, 10000),
iterator = "sparse_track"
)
# Using custom column name
gintervals.mapply(
max, "dense_track",
gintervals(c(1, 2), 0, 10000),
colnames = "max_value"
)
Mark overlapping intervals with a group ID
Description
Mark overlapping intervals with a group ID
Usage
gintervals.mark_overlaps(
intervals,
group_col = "overlap_group",
unify_touching_intervals = TRUE
)
Arguments
intervals |
intervals set |
group_col |
name of the column to store the overlap group IDs (default: "overlap_group") |
unify_touching_intervals |
if 'TRUE', touching one-dimensional intervals are unified |
Value
The intervals set with an additional column containing group IDs from gintervals.canonic mapping. All overlapping intervals will have the same group ID.
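A minimal base R sketch of how such group IDs can be derived, assuming all intervals lie on one chromosome and that groups are numbered in coordinate order. The helper 'mark_groups_sketch' is hypothetical, and the numbering produced by the package may differ.

```r
# Assign a group ID to each interval; overlapping (and optionally touching)
# intervals share an ID. Single chromosome assumed.
mark_groups_sketch <- function(start, end, unify_touching = TRUE) {
  ord <- order(start, end)
  s <- start[ord]
  e <- end[ord]
  grp <- integer(length(s))
  cur_end <- -Inf
  g <- 0L
  for (i in seq_along(s)) {
    # A new group starts when the interval lies beyond the current group's end
    starts_new <- if (unify_touching) s[i] > cur_end else s[i] >= cur_end
    if (starts_new) {
      g <- g + 1L
      cur_end <- e[i]
    } else {
      cur_end <- max(cur_end, e[i])
    }
    grp[i] <- g
  }
  out <- integer(length(s))
  out[ord] <- grp # restore the original row order
  out
}

groups <- mark_groups_sketch(
  start = c(11000, 100, 10000, 10500),
  end   = c(12000, 200, 13000, 10600)
)
```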
Examples
gdb.init_examples()
# Create sample overlapping intervals
intervs <- data.frame(
chrom = "chr1",
start = c(11000, 100, 10000, 10500),
end = c(12000, 200, 13000, 10600),
data = c(10, 20, 30, 40)
)
# Mark overlapping intervals
intervs_marked <- gintervals.mark_overlaps(intervs)
# Use custom column name
intervs_marked <- gintervals.mark_overlaps(intervs, group_col = "my_groups")
Finds neighbors between two sets of intervals
Description
For each interval in 'intervals1', finds the closest intervals from 'intervals2'. Distance directionality can be determined by either the strand of the target intervals (intervals2, default) or the query intervals (intervals1). When no strand column is present, all intervals are treated as positive strand (strand = 1).
Usage
gintervals.neighbors(
intervals1 = NULL,
intervals2 = NULL,
maxneighbors = 1,
mindist = -1e+09,
maxdist = 1e+09,
mindist1 = -1e+09,
maxdist1 = 1e+09,
mindist2 = -1e+09,
maxdist2 = 1e+09,
na.if.notfound = FALSE,
use_intervals1_strand = FALSE,
warn.ignored.strand = TRUE,
intervals.set.out = NULL
)
Arguments
intervals1, intervals2 |
intervals |
maxneighbors |
maximal number of neighbors |
mindist, maxdist |
distance range for 1D intervals |
mindist1, maxdist1, mindist2, maxdist2 |
distance range for 2D intervals |
na.if.notfound |
if 'TRUE' return 'NA' interval if no matching neighbors were found, otherwise omit the interval in the answer |
use_intervals1_strand |
if 'TRUE' use intervals1 strand column for distance directionality instead of intervals2 strand. If intervals1 has no strand column, all intervals are treated as positive strand (strand = 1). Invalid strand values (not -1 or 1) will cause an error. |
warn.ignored.strand |
if 'TRUE' (default) show warning when 'intervals1' contains a strand column that will be ignored for distance calculation |
intervals.set.out |
intervals set name where the function result is optionally outputted |
Details
This function finds for each interval in 'intervals1' the closest 'maxneighbors' intervals from 'intervals2'.
For 1D intervals the distance must fall in the range of ['mindist', 'maxdist'].
Distance is defined as the number of base pairs between the last base pair of the query interval and the first base pair of the target interval.
**Strand handling:** By default, distance directionality is determined by the 'strand' column in 'intervals2' (if present). If 'use_intervals1_strand' is TRUE, distance directionality is instead determined by the 'strand' column in 'intervals1'. This is particularly useful for TSS analysis where you want upstream/downstream distances relative to gene direction.
**Distance calculation modes:**
**use_intervals1_strand = FALSE (default):** Uses intervals2 strand for directionality
**use_intervals1_strand = TRUE:** Uses intervals1 strand for directionality
**Important:** When 'use_intervals1_strand = TRUE', distance signs are interpreted as:
**+ strand queries:** Negative distances = upstream, Positive distances = downstream
**- strand queries:** Negative distances = downstream, Positive distances = upstream
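The sign convention can be illustrated with a short base R sketch. It assumes the raw distance is the gap between the nearest edges of two half-open intervals (zero on overlap) and that the query strand simply flips the sign; the exact coordinates that 'gintervals.neighbors' measures between may differ, and 'signed_dist_sketch' is a hypothetical helper.

```r
# Signed distance from a query to a target, oriented by the query strand
signed_dist_sketch <- function(q_start, q_end, t_start, t_end, q_strand = 1) {
  if (t_end <= q_start) {
    raw <- t_end - q_start # target lies at lower coordinates: negative
  } else if (t_start >= q_end) {
    raw <- t_start - q_end # target lies at higher coordinates: positive
  } else {
    raw <- 0               # intervals overlap
  }
  raw * q_strand
}

# + strand query: a target at lower coordinates is upstream (negative)
d_plus <- signed_dist_sketch(1000, 1001, 800, 900, q_strand = 1)

# - strand query: a target at lower coordinates is downstream (positive)
d_minus <- signed_dist_sketch(2000, 2001, 1800, 1900, q_strand = -1)
```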
For 2D intervals two distances are calculated and returned for each axis. The distances must fall in the range of ['mindist1', 'maxdist1'] for axis 1 and ['mindist2', 'maxdist2'] for axis 2. For selecting the closest 'maxneighbors' intervals Manhattan distance is used (i.e. dist1+dist2).
**Note:** 'use_intervals1_strand' is not yet supported for 2D intervals.
The names of the returned columns are made unique using
make.unique(colnames(df), sep = ""), assuming 'df' is the result.
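For the 1D case the effect of this call can be checked in base R. The column vector below is a hypothetical stand-in for the combined columns of 'intervals1', 'intervals2' and the distance column.

```r
# Columns of intervals1, then intervals2, then the distance column
cols <- c("chrom", "start", "end", "chrom", "start", "end", "dist")

# Duplicated names receive a numeric suffix with no separator
unique_cols <- make.unique(cols, sep = "")
```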
If 'intervals.set.out' is not 'NULL' the result is saved as an intervals set. Use this parameter if the result size exceeds the limits of the physical memory.
Value
If 'intervals.set.out' is 'NULL' a data frame containing the pairs of intervals from 'intervals1' and 'intervals2' and an additional column named 'dist' ('dist1' and 'dist2' for 2D intervals) representing the distance between the corresponding intervals. The columns of the intervals from 'intervals2' are renamed to 'chrom1', 'start1' and 'end1', and for 2D intervals to 'chrom11', 'start11', 'end11' and 'chrom22', 'start22', 'end22'. If 'na.if.notfound' is 'TRUE', the data frame contains all the intervals from 'intervals1' including those for which no matching neighbor was found. For the latter intervals an 'NA' neighboring interval is reported. If 'na.if.notfound' is 'FALSE', the data frame contains only the intervals from 'intervals1' for which at least one matching neighbor was found.
See Also
gintervals, gintervals.neighbors.upstream,
gintervals.neighbors.downstream
Examples
gdb.init_examples()
# Basic intervals
intervs1 <- giterator.intervals("dense_track",
gintervals(1, 0, 4000),
iterator = 233
)
intervs2 <- giterator.intervals(
"sparse_track",
gintervals(1, 0, 2000)
)
# Original behavior - no strand considerations
gintervals.neighbors(intervs1, intervs2, 10,
mindist = -300,
maxdist = 500
)
# Add strand to intervals2 - affects distance directionality (original behavior)
intervs2$strand <- c(1, 1, -1, 1)
gintervals.neighbors(intervs1, intervs2, 10,
mindist = -300,
maxdist = 500
)
# TSS analysis example - use intervals1 (TSS) strand for directionality
tss <- data.frame(
chrom = c("chr1", "chr1", "chr1"),
start = c(1000, 2000, 3000),
end = c(1001, 2001, 3001),
strand = c(1, -1, 1), # +, -, +
gene = c("GeneA", "GeneB", "GeneC")
)
features <- data.frame(
chrom = "chr1",
start = c(500, 800, 1200, 1800, 2200, 2800, 3200),
end = c(600, 900, 1300, 1900, 2300, 2900, 3300),
feature_id = paste0("F", 1:7)
)
# Use TSS strand for distance directionality
result <- gintervals.neighbors(tss, features,
maxneighbors = 2,
mindist = -1000, maxdist = 1000,
use_intervals1_strand = TRUE
)
# Convenience functions for common TSS analysis
# Find upstream neighbors (negative distances for + strand genes)
upstream <- gintervals.neighbors.upstream(tss, features,
maxneighbors = 2, maxdist = 1000
)
# Find downstream neighbors (positive distances for + strand genes)
downstream <- gintervals.neighbors.downstream(tss, features,
maxneighbors = 2, maxdist = 1000
)
# Find both directions
both_directions <- gintervals.neighbors.directional(tss, features,
maxneighbors_upstream = 1,
maxneighbors_downstream = 1,
maxdist = 1000
)
Directional neighbor finding functions
Description
These functions find neighbors using query strand directionality, where upstream/downstream directionality is determined by the strand of the query intervals rather than the target intervals. This is particularly useful for TSS analysis where you want distances relative to gene direction.
Usage
gintervals.neighbors.upstream(
query_intervals,
target_intervals,
maxneighbors = 1,
maxdist = 1e+09,
...
)
gintervals.neighbors.downstream(
query_intervals,
target_intervals,
maxneighbors = 1,
maxdist = 1e+09,
...
)
gintervals.neighbors.directional(
query_intervals,
target_intervals,
maxneighbors_upstream = 1,
maxneighbors_downstream = 1,
maxdist = 1e+09,
...
)
Arguments
query_intervals |
intervals with strand information (query intervals) |
target_intervals |
intervals to search for neighbors |
maxneighbors |
maximum number of neighbors per query interval (default: 1) |
maxdist |
maximum distance to search (default: 1e+09) |
... |
additional arguments passed to |
maxneighbors_upstream |
maximum upstream neighbors per query interval (default: 1) |
maxneighbors_downstream |
maximum downstream neighbors per query interval (default: 1) |
Details
**Distance interpretation:**
**Positive strand queries:** upstream distances < 0, downstream distances > 0
**Negative strand queries:** upstream distances > 0, downstream distances < 0
If no strand column is present, all intervals are treated as positive strand.
Value
- gintervals.neighbors.upstream
data frame of upstream neighbors
- gintervals.neighbors.downstream
data frame of downstream neighbors
- gintervals.neighbors.directional
list with 'upstream' and 'downstream' components
See Also
Examples
gdb.init_examples()
# Create TSS intervals with strand information
tss <- data.frame(
chrom = c("chr1", "chr1", "chr1"),
start = c(1000, 2000, 3000),
end = c(1001, 2001, 3001),
strand = c(1, -1, 1), # +, -, +
gene = c("GeneA", "GeneB", "GeneC")
)
# Create regulatory features
features <- data.frame(
chrom = "chr1",
start = c(500, 800, 1200, 1800, 2200, 2800, 3200),
end = c(600, 900, 1300, 1900, 2300, 2900, 3300),
feature_id = paste0("F", 1:7)
)
# Find upstream neighbors (promoter analysis)
upstream <- gintervals.neighbors.upstream(tss, features,
maxneighbors = 2, maxdist = 1000
)
print(upstream)
# Find downstream neighbors (gene body analysis)
downstream <- gintervals.neighbors.downstream(tss, features,
maxneighbors = 2, maxdist = 5000
)
print(downstream)
# Find both directions in one call
both <- gintervals.neighbors.directional(tss, features,
maxneighbors_upstream = 1,
maxneighbors_downstream = 1,
maxdist = 1000
)
print(both$upstream)
print(both$downstream)
Normalize intervals to a fixed size
Description
This function normalizes intervals by computing their centers and then expanding them to a fixed size, while ensuring they don't cross chromosome boundaries.
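A base R sketch of the center-and-expand idea follows. The rounding of the center, the way 'size' is split around it and the clamping at chromosome edges are all assumptions made for illustration; 'normalize_sketch' and 'chrom_len' are hypothetical names, and the package may treat edges differently (for example by shifting instead of truncating).

```r
# Recenter intervals and give them a fixed width, clamped to [0, chrom_len]
normalize_sketch <- function(start, end, size, chrom_len) {
  center <- (start + end) %/% 2
  new_start <- center - size %/% 2
  new_end <- new_start + size
  # Keep the result inside the chromosome (this shortens intervals near edges)
  data.frame(
    start = pmax(new_start, 0),
    end   = pmin(new_end, chrom_len)
  )
}

res <- normalize_sketch(
  start = c(1000, 5000), end = c(2000, 6000),
  size = 500, chrom_len = 1e6
)
```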
Usage
gintervals.normalize(intervals = NULL, size = NULL, intervals.set.out = NULL)
Arguments
intervals |
intervals set |
size |
target size for normalized intervals (must be a positive integer) |
intervals.set.out |
intervals set name where the function result is saved. If NULL, the result is returned to the user. |
Value
Normalized intervals set with fixed size, or NULL if result is saved to intervals.set.out
See Also
Examples
gdb.init_examples()
intervs <- gintervals(1, c(1000, 5000), c(2000, 6000))
gintervals.normalize(intervs, 500)
Returns the path on disk of an interval set
Description
Returns the path on disk of an interval set.
Usage
gintervals.path(intervals.set = NULL)
Arguments
intervals.set |
name of an interval set or a vector of interval set names |
Details
This function returns the actual file system path where an interval set is stored. The function works with a single interval set name or a vector of names.
Value
A character vector containing the full paths to the interval sets on disk.
See Also
gintervals.exists, gintervals.ls,
gtrack.path
Examples
gdb.init_examples()
gintervals.path("annotations")
gintervals.path(c("annotations", "coding"))
Calculates quantiles of a track expression for intervals
Description
Calculates quantiles of a track expression for intervals.
Usage
gintervals.quantiles(
expr = NULL,
percentiles = 0.5,
intervals = NULL,
iterator = NULL,
band = NULL,
intervals.set.out = NULL
)
Arguments
expr |
track expression for which quantiles are calculated |
percentiles |
an array of percentiles, each in the [0, 1] range, for which quantiles are calculated |
intervals |
set of intervals |
iterator |
track expression iterator. If 'NULL' iterator is determined implicitly based on track expressions. |
band |
track expression band. If 'NULL' no band is used. |
intervals.set.out |
intervals set name where the function result is optionally outputted |
Details
This function calculates quantiles of 'expr' for each interval in 'intervals'.
If 'intervals.set.out' is not 'NULL' the result is saved as an intervals set. Use this parameter if the result size exceeds the limits of the physical memory.
Value
If 'intervals.set.out' is 'NULL' a set of intervals with additional columns representing quantiles for each percentile.
See Also
Examples
gdb.init_examples()
intervs <- gintervals(c(1, 2), 0, 5000)
gintervals.quantiles("dense_track",
percentiles = c(0.5, 0.3, 0.9), intervs
)
Generate random genome intervals
Description
Generate random genome intervals with a specified number of regions of a specified size. This function samples intervals uniformly across the genome, weighted by chromosome length.
Usage
gintervals.random(
size,
n,
dist_from_edge = 3000000,
chromosomes = NULL,
filter = NULL
)
Arguments
size |
The size of the intervals to generate (in base pairs) |
n |
The number of intervals to generate |
dist_from_edge |
The minimum distance from the edge of the chromosome for a region to start or end (default: 3e6) |
chromosomes |
The chromosomes to sample from (default: all chromosomes). Can be a character vector of chromosome names. |
filter |
A set of intervals to exclude from sampling (default: NULL). Generated intervals will not overlap with these regions. |
Details
The function samples intervals randomly across the genome, with chromosomes weighted by their length. Each interval is guaranteed to:
Be of the specified size
Start and end at least 'dist_from_edge' bases away from chromosome boundaries
Fall entirely within a single chromosome
Not overlap with any intervals in the 'filter' (if provided)
When a filter is provided, the function pre-computes valid genome segments (regions not in the filter) and samples from these segments. Note that this can be slow when the filter contains many intervals.
The function uses R's random number generator, so set.seed() can be used for reproducibility.
This function is implemented in C++ for high performance and can generate millions of intervals quickly.
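The sampling scheme without a filter can be sketched in base R. This is not the package's C++ code: 'random_intervals_sketch' and the chromosome sizes are hypothetical, and the sketch weights chromosomes by the number of valid start positions, which is one possible reading of "weighted by chromosome length".

```r
# Hypothetical chromosome sizes; a real run would take them from the database
chrom_sizes <- c(chr1 = 5e6, chr2 = 3e6)

random_intervals_sketch <- function(size, n, chrom_sizes, dist_from_edge = 0) {
  # Number of valid start positions on each chromosome
  usable <- chrom_sizes - 2 * dist_from_edge - size + 1
  stopifnot(all(usable > 0))
  # Pick chromosomes with probability proportional to their usable length
  chrom <- sample(names(chrom_sizes), n, replace = TRUE, prob = usable)
  # Pick a start uniformly among the valid positions of the chosen chromosome
  start <- unname(dist_from_edge + floor(runif(n) * usable[chrom]))
  data.frame(chrom = chrom, start = start, end = start + size)
}

# Same seed, same intervals
set.seed(123)
a <- random_intervals_sketch(100, 1000, chrom_sizes, dist_from_edge = 1e6)
set.seed(123)
b <- random_intervals_sketch(100, 1000, chrom_sizes, dist_from_edge = 1e6)
```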
Value
A data.frame with columns chrom, start, and end representing genomic intervals
Examples
## Not run:
gdb.init_examples()
# Generate 1000 random intervals of 100bp
intervals <- gintervals.random(100, 1000)
head(intervals)
# Generate intervals only on chr1 and chr2
intervals <- gintervals.random(100, 1000, chromosomes = c("chr1", "chr2"))
# Generate intervals avoiding specific regions
filter_regions <- gintervals(c("chr1", "chr2"), c(1000, 5000), c(2000, 6000))
intervals <- gintervals.random(100, 1000, filter = filter_regions)
# Verify no overlaps with filter
overlaps <- gintervals.intersect(intervals, filter_regions)
nrow(overlaps) # Should be 0
# For reproducibility
set.seed(123)
intervals1 <- gintervals.random(100, 100)
set.seed(123)
intervals2 <- gintervals.random(100, 100)
identical(intervals1, intervals2) # TRUE
## End(Not run)
Combines several sets of intervals
Description
Combines several sets of intervals into one set.
Usage
gintervals.rbind(..., intervals.set.out = NULL)
Arguments
... |
intervals sets to combine |
intervals.set.out |
intervals set name where the function result is optionally outputted |
intervals |
intervals set |
Details
This function combines several intervals sets into one set. It works in a similar manner as 'rbind' yet it is faster. Also it supports intervals sets that are stored in files including the big intervals sets.
If 'intervals.set.out' is not 'NULL' the result is saved as an intervals set. If the format of the output intervals is set to be "big" (determined implicitly based on the result size and options), the order of the resulting intervals is altered as they are sorted by chromosome (or chromosome pair, for 2D).
Value
If 'intervals.set.out' is 'NULL' a data frame combining intervals sets.
See Also
gintervals, gintervals.2d,
gintervals.canonic
Examples
gdb.init_examples()
intervs1 <- gextract("sparse_track", gintervals(c(1, 2), 1000, 4000))
intervs2 <- gextract("sparse_track", gintervals(c(2, "X"), 2000, 5000))
gintervals.save("testintervs", intervs2)
gintervals.rbind(intervs1, "testintervs")
gintervals.rm("testintervs", force = TRUE)
Deletes a named intervals set
Description
Deletes a named intervals set.
Usage
gintervals.rm(intervals.set = NULL, force = FALSE)
Arguments
intervals.set |
name of an intervals set |
force |
if 'TRUE', suppresses user confirmation of a named intervals set removal |
Details
This function deletes a named intervals set from the Genomic Database. By default 'gintervals.rm' requires the user to interactively confirm the deletion. Set 'force' to 'TRUE' to suppress the user prompt.
Value
None.
See Also
gintervals.save, gintervals.exists,
gintervals.ls, gintervals,
gintervals.2d
Examples
gdb.init_examples()
intervs <- gintervals(c(1, 2))
gintervals.save("testintervs", intervs)
gintervals.ls()
gintervals.rm("testintervs", force = TRUE)
gintervals.ls()
Creates a named intervals set
Description
Saves intervals to a named intervals set.
Usage
gintervals.save(intervals.set.out = NULL, intervals = NULL)
Arguments
intervals.set.out |
name of the new intervals set |
intervals |
intervals to save |
Details
This function saves 'intervals' as a named intervals set.
Value
None.
See Also
gintervals.rm, gintervals.load,
gintervals.exists, gintervals.ls,
gintervals, gintervals.2d
Examples
gdb.init_examples()
intervs <- gintervals(c(1, 2))
gintervals.save("testintervs", intervs)
gintervals.ls()
gintervals.rm("testintervs", force = TRUE)
Calculates summary statistics of track expression for intervals
Description
Calculates summary statistics of track expression for intervals.
Usage
gintervals.summary(
expr = NULL,
intervals = NULL,
iterator = NULL,
band = NULL,
intervals.set.out = NULL
)
Arguments
expr |
track expression |
intervals |
set of intervals |
iterator |
track expression iterator. If 'NULL' iterator is determined implicitly based on track expression. |
band |
track expression band. If 'NULL' no band is used. |
intervals.set.out |
intervals set name where the function result is optionally outputted |
Details
This function returns summary statistics of a track expression for each interval in 'intervals': total number of bins, total number of bins whose value is NaN, min, max, sum, mean and standard deviation of the values.
If 'intervals.set.out' is not 'NULL' the result is saved as an intervals set. Use this parameter if the result size exceeds the limits of the physical memory.
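The seven statistics can be reproduced for a single interval in base R. The bin values and the element names below are hypothetical (the column names in the package output are not taken from this page), and the use of the sample standard deviation via 'sd' is an assumption.

```r
# Hypothetical bin values of a track expression inside one interval
bin_values <- c(0.1, 0.4, NaN, 0.3, 0.2)

summary_sketch <- function(x) {
  valid <- x[!is.nan(x)] # NaN bins are counted but excluded from the statistics
  c(
    total = length(x),
    nan   = sum(is.nan(x)),
    min   = min(valid),
    max   = max(valid),
    sum   = sum(valid),
    mean  = mean(valid),
    sd    = sd(valid)
  )
}

stats <- summary_sketch(bin_values)
```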
Value
If 'intervals.set.out' is 'NULL' a set of intervals with additional columns representing summary statistics for each interval.
See Also
Examples
gdb.init_examples()
intervs <- gintervals(c(1, 2), 0, 5000)
gintervals.summary("dense_track", intervs)
Calculates a union of two sets of intervals
Description
Calculates a union of two sets of intervals.
Usage
gintervals.union(
intervals1 = NULL,
intervals2 = NULL,
intervals.set.out = NULL
)
Arguments
intervals1, intervals2 |
set of one-dimensional intervals |
intervals.set.out |
intervals set name where the function result is optionally outputted |
Details
This function returns intervals that represent a genomic space covered by either 'intervals1' or 'intervals2'.
If 'intervals.set.out' is not 'NULL' the result is saved as an intervals set. Use this parameter if the result size exceeds the limits of the physical memory.
Value
If 'intervals.set.out' is 'NULL' a data frame representing the union of intervals.
See Also
gintervals.intersect, gintervals.diff,
gintervals, gintervals.2d
Examples
gdb.init_examples()
intervs1 <- gscreen("dense_track > 0.15 & dense_track < 0.18")
intervs2 <- gscreen("dense_track >= 0.18 & dense_track < 0.2")
## 'intervs3' and 'intervs4' are identical
intervs3 <- gintervals.union(intervs1, intervs2)
intervs4 <- gscreen("dense_track > 0.15 & dense_track < 0.2")
Updates a named intervals set
Description
Updates a named intervals set.
Usage
gintervals.update(
intervals.set = NULL,
intervals = "",
chrom = NULL,
chrom1 = NULL,
chrom2 = NULL
)
Arguments
intervals.set |
name of an intervals set |
intervals |
intervals or 'NULL' |
chrom |
chromosome for 1D intervals set |
chrom1 |
first chromosome for 2D intervals set |
chrom2 |
second chromosome for 2D intervals set |
Details
This function replaces all intervals of given chromosome (or chromosome pair) within 'intervals.set' with 'intervals'. Chromosome is specified by 'chrom' for 1D intervals set or 'chrom1', 'chrom2' for 2D intervals set.
If 'intervals' is 'NULL' all intervals of given chromosome are removed from 'intervals.set'.
Value
None.
See Also
gintervals.save, gintervals.load,
gintervals.exists, gintervals.ls
Examples
gdb.init_examples()
intervs <- gscreen(
"sparse_track > 0.2",
gintervals(c(1, 2), 0, 10000)
)
gintervals.save("testintervs", intervs)
gintervals.load("testintervs")
gintervals.update("testintervs", intervs[intervs$chrom == "chr2", ][1:5, ], chrom = 2)
gintervals.load("testintervs")
gintervals.update("testintervs", NULL, chrom = 2)
gintervals.load("testintervs")
gintervals.rm("testintervs", force = TRUE)
Creates a cartesian-grid iterator
Description
Creates a cartesian grid two-dimensional iterator that can be used by any function that accepts an iterator argument.
Usage
giterator.cartesian_grid(
intervals1 = NULL,
expansion1 = NULL,
intervals2 = NULL,
expansion2 = NULL,
min.band.idx = NULL,
max.band.idx = NULL
)
Arguments
intervals1 |
one-dimensional intervals |
expansion1 |
an array of integers that define expansion around intervals1 centers |
intervals2 |
one-dimensional intervals. If 'NULL' then 'intervals2' is considered to be equal to 'intervals1' |
expansion2 |
an array of integers that define expansion around intervals2 centers. If 'NULL' then 'expansion2' is considered to be equal to 'expansion1' |
min.band.idx, max.band.idx |
integers that limit iterator intervals to band |
Details
This function creates and returns a cartesian grid two-dimensional iterator that can be used by any function that accepts an iterator argument.
Assume 'centers1' and 'centers2' to be the central points of each interval from 'intervals1' and 'intervals2', and 'C1', 'C2' to be two points from 'centers1', 'centers2' respectively. Assume also that the values in 'expansion1' and 'expansion2' are unique and sorted.
'giterator.cartesian_grid' creates a set of all possible unique and non-overlapping two-dimensional intervals of form: '(chrom1, start1, end1, chrom2, start2, end2)'. Each '(chrom1, start1, end1)' is created by taking a point 'C1' - '(chrom1, coord1)' and converting it to 'start1' and 'end1' such that 'start1 == coord1+E1[i]', 'end1 == coord1+E1[i+1]', where 'E1[i]' is one of the sorted 'expansion1' values. Overlaps between rectangles or expansion beyond the limits of chromosome are avoided.
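The expansion rule along one axis can be sketched in base R. The helper 'expand_center_sketch' is hypothetical and covers only the arithmetic 'start == coord + E[i]', 'end == coord + E[i+1]'; it does not reproduce the overlap avoidance or the chromosome-limit handling described above.

```r
# Turn one center coordinate into consecutive segments defined by the
# sorted expansion values
expand_center_sketch <- function(coord, expansion) {
  e <- sort(unique(expansion))
  data.frame(
    start = coord + head(e, -1),
    end   = coord + tail(e, -1)
  )
}

# Center at 200 with expansion c(-20, 50, 100) yields two segments
segs <- expand_center_sketch(200, c(-20, 50, 100))
```

With k expansion values each center yields k - 1 segments per axis.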
'min.band.idx' and 'max.band.idx' parameters control whether a pair of 'C1' and 'C2' is skipped or not. If both of these parameters are not 'NULL' AND if both 'C1' and 'C2' share the same chromosome AND the delta of indices of 'C1' and 'C2' ('C1 index - C2 index') lies within the '[min.band.idx, max.band.idx]' range, only then is the pair used to create the intervals. Otherwise the 'C1-C2' pair is filtered out. Note: if 'min.band.idx' and 'max.band.idx' are not 'NULL', i.e. band indices filtering is applied, then the 'intervals2' parameter must be set to 'NULL'.
Value
A list containing the definition of cartesian iterator.
See Also
Examples
gdb.init_examples()
intervs1 <- gintervals(
c(1, 1, 2), c(100, 300, 200),
c(300, 500, 300)
)
intervs2 <- gintervals(
c(1, 2, 2), c(400, 1000, 3000),
c(800, 2000, 4000)
)
itr <- giterator.cartesian_grid(
intervs1, c(-20, 100), intervs2,
c(-40, -10, 50)
)
giterator.intervals(iterator = itr)
itr <- giterator.cartesian_grid(intervs1, c(-20, 50, 100))
giterator.intervals(iterator = itr)
itr <- giterator.cartesian_grid(intervs1, c(-20, 50, 100),
min.band.idx = -1,
max.band.idx = 0
)
giterator.intervals(iterator = itr)
Returns iterator intervals
Description
Returns iterator intervals given track expression, scope, iterator and band.
Usage
giterator.intervals(
expr = NULL,
intervals = .misha$ALLGENOME,
iterator = NULL,
band = NULL,
intervals.set.out = NULL
)
Arguments
expr |
track expression |
intervals |
genomic scope |
iterator |
track expression iterator. If 'NULL' iterator is determined implicitly based on track expression. |
band |
track expression band. If 'NULL' no band is used. |
intervals.set.out |
intervals set name where the function result is optionally outputted |
Details
This function returns the set of intervals used by the iterator for the given track expression, genomic scope, iterator and band. Some functions accept an iterator without accepting a track expression (like 'gtrack.create_pwm_energy'). These functions generate the values for each iterator interval by themselves. Set 'expr' to 'NULL' to simulate the work of these functions.
If 'intervals.set.out' is not 'NULL' the result is saved as an intervals set. Use this parameter if the result size exceeds the limits of the physical memory.
Value
If 'intervals.set.out' is 'NULL' a data frame representing iterator intervals.
See Also
Examples
gdb.init_examples()
## iterator is set implicitly to bin size of 'dense' track
giterator.intervals("dense_track", gintervals(1, 0, 200))
## iterator = 30
giterator.intervals("dense_track", gintervals(1, 0, 200), 30)
## iterator is an intervals set named 'annotations'
giterator.intervals("dense_track", .misha$ALLGENOME, "annotations")
## iterator is set implicitly to intervals of 'array_track' track
giterator.intervals("array_track", gintervals(1, 0, 200))
## iterator is a rectangle 100000 by 50000
giterator.intervals(
"rects_track",
gintervals.2d(chroms1 = 1, chroms2 = "chrX"),
c(100000, 50000)
)
Returns values from a lookup table based on track expression
Description
Evaluates track expression and translates the values into bin indices that are used in turn to retrieve and return values from a lookup table.
Usage
glookup(
lookup_table = NULL,
...,
intervals = NULL,
include.lowest = FALSE,
force.binning = TRUE,
iterator = NULL,
band = NULL,
intervals.set.out = NULL
)
Arguments
lookup_table |
a multi-dimensional array containing the values that are returned by the function |
... |
pairs of 'expr', 'breaks' where 'expr' is a track expression and the breaks determine the bin |
intervals |
genomic scope for which the function is applied |
include.lowest |
if 'TRUE', the lowest value of the range determined by breaks is included |
force.binning |
if 'TRUE', the values smaller than the minimal break will be translated to index 1, and the values that exceed the maximal break will be translated to index N-1 where N is the number of breaks. If 'FALSE' the out-of-range values will produce NaN values. |
iterator |
track expression iterator. If 'NULL' iterator is determined implicitly based on track expressions. |
band |
track expression band. If 'NULL' no band is used. |
intervals.set.out |
intervals set name where the function result is optionally outputted |
Details
This function evaluates the track expression for all iterator intervals and translates each value into an index based on the breaks. This index is then used to address the lookup table and return the corresponding value. More than one 'expr'-'breaks' pair can be used. In that case 'lookup_table' is addressed in a multidimensional manner, i.e. 'lookup_table[i1, i2, ...]'.
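The multidimensional addressing can be illustrated in plain R, without a genomic database. This is only a sketch, not the package's implementation: it assumes that base R's 'cut' reproduces the right-closed bins, and the vectors 'v1' and 'v2' are hypothetical stand-ins for the values of two track expressions.

```r
# Hypothetical stand-ins for the values of two track expressions
v1 <- c(0.11, 0.15, 0.19)
v2 <- c(0.32, 0.34, 0.36)

breaks1 <- seq(0.1, 0.2, length.out = 6) # 6 breaks define 5 bins
breaks2 <- seq(0.31, 0.37, length.out = 4) # 4 breaks define 3 bins
lookup_table <- array(1:15, dim = c(5, 3))

# 1-based bin index per expression; cut() uses right-closed bins (a, b]
i1 <- cut(v1, breaks1, labels = FALSE)
i2 <- cut(v2, breaks2, labels = FALSE)

# Address the table as lookup_table[i1, i2], one (i1, i2) pair per value
result <- lookup_table[cbind(i1, i2)]
result
```

Indexing with a two-column matrix picks one table cell per row of indices, which mirrors the one-value-per-iterator-interval behavior described above.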
The range of bins is determined by 'breaks' argument. For example: 'breaks = c(x1, x2, x3, x4)' represents three different intervals (bins): (x1, x2], (x2, x3], (x3, x4].
If 'include.lowest' is 'TRUE' then the lowest value is included in the first interval, i.e. in [x1, x2].
'force.binning' parameter controls what should be done when the value of 'expr' exceeds the range determined by 'breaks'. If 'force.binning' is 'TRUE' then values smaller than the minimal break will be translated to index 1, and the values exceeding the maximal break will be translated to index 'M-1' where 'M' is the number of breaks. If 'force.binning' is 'FALSE' the out-of-range values will produce 'NaN' values.
Regardless of 'force.binning' value if the value of 'expr' is 'NaN' then result is 'NaN' too.
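The binning rules above ('breaks', 'include.lowest', 'force.binning' and the handling of 'NaN') can be sketched in base R. The helpers 'bin_index' and 'lookup_1d' are hypothetical names introduced for illustration only. They are not part of the package and only mimic the documented rules for a one-dimensional table.

```r
# Hypothetical helper: translate values into 1-based bin indices
bin_index <- function(x, breaks, include.lowest = FALSE, force.binning = TRUE) {
    # cut() uses right-closed bins (b[i], b[i + 1]] and returns NA for
    # out-of-range or NaN input
    idx <- cut(x, breaks, labels = FALSE, include.lowest = include.lowest)
    if (force.binning) {
        # which() drops NaN comparisons, so NaN input stays unassigned
        idx[which(x < min(breaks))] <- 1L
        idx[which(x > max(breaks))] <- length(breaks) - 1L
    }
    idx
}

# Hypothetical helper: address a one-dimensional lookup table
lookup_1d <- function(lookup_table, x, breaks, ...) {
    idx <- bin_index(x, breaks, ...)
    out <- lookup_table[idx]
    out[is.na(idx)] <- NaN # unbinned values produce NaN
    out
}

breaks <- c(0, 1, 2, 3)
lookup_table <- c(10, 20, 30)
x <- c(-5, 0.5, 1, 1.5, 3, 7, NaN)

lookup_1d(lookup_table, x, breaks) # out-of-range values are forced into bins
lookup_1d(lookup_table, x, breaks, force.binning = FALSE) # they become NaN
lookup_1d(lookup_table, 0, breaks, include.lowest = TRUE) # 0 joins the first bin
```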
The order inside the result might not be the same as the order of intervals. Use 'intervalID' column to refer to the index of the original interval from the supplied 'intervals'.
If 'intervals.set.out' is not 'NULL' the result (without 'intervalID' column) is saved as an intervals set. Use this parameter if the result size exceeds the limits of the physical memory.
Value
If 'intervals.set.out' is 'NULL' a set of intervals with additional 'value' and 'intervalID' columns.
See Also
gtrack.lookup, gextract,
gpartition, gdist
Examples
gdb.init_examples()
## one-dimensional lookup table
breaks1 <- seq(0.1, 0.2, length.out = 6)
glookup(1:5, "dense_track", breaks1, gintervals(1, 0, 200))
## two-dimensional lookup table
t <- array(1:15, dim = c(5, 3))
breaks2 <- seq(0.31, 0.37, length.out = 4)
glookup(
t, "dense_track", breaks1, "2 * dense_track", breaks2,
gintervals(1, 0, 200)
)
Partitions the values of track expression
Description
Converts the values of track expression to intervals that match corresponding bin.
Usage
gpartition(
expr = NULL,
breaks = NULL,
intervals = NULL,
include.lowest = FALSE,
iterator = NULL,
band = NULL,
intervals.set.out = NULL
)
Arguments
expr |
track expression |
breaks |
breaks that determine the bin |
intervals |
genomic scope for which the function is applied |
include.lowest |
if 'TRUE', the lowest value of the range determined by breaks is included |
iterator |
track expression iterator. If 'NULL' iterator is determined implicitly based on track expression. |
band |
track expression band. If 'NULL' no band is used. |
intervals.set.out |
intervals set name where the function result is optionally outputted |
Details
This function first converts the values of the track expression into a 1-based bin index according to the 'breaks' argument. It then returns the intervals with the corresponding bin index.
The range of bins is determined by 'breaks' argument. For example: 'breaks=c(x1, x2, x3, x4)' represents three different intervals (bins): (x1, x2], (x2, x3], (x3, x4].
If 'include.lowest' is 'TRUE' the lowest value will be included in the first interval, i.e. in [x1, x2].
If 'intervals.set.out' is not 'NULL' the result is saved as an intervals set. Use this parameter if the result size exceeds the limits of the physical memory.
Value
If 'intervals.set.out' is 'NULL' a set of intervals with an additional column that indicates the corresponding bin index.
See Also
gscreen, gextract,
glookup, gdist
Examples
gdb.init_examples()
breaks <- seq(0, 0.2, by = 0.05)
gpartition("dense_track", breaks, gintervals(1, 0, 5000))
Calculates quantiles of a track expression
Description
Calculates the quantiles of a track expression for the given percentiles.
Usage
gquantiles(
expr = NULL,
percentiles = 0.5,
intervals = get("ALLGENOME", envir = .misha),
iterator = NULL,
band = NULL
)
Arguments
expr |
track expression |
percentiles |
an array of percentiles in the [0, 1] range for which the quantiles are calculated |
intervals |
genomic scope for which the function is applied |
iterator |
track expression iterator. If 'NULL' iterator is determined implicitly based on track expression. |
band |
track expression band. If 'NULL' no band is used. |
Details
This function calculates the quantiles for the given percentiles.
If data size exceeds the limit (see: 'getOption("gmax.data.size")'), the data is randomly sampled to fit the limit. A warning message is generated. The seed of the pseudo-random generator can be controlled through 'grnd.seed' option.
Note: this function can run in multitasking mode. Sampling may vary according to the extent of multitasking. Since multitasking depends on the number of available CPU cores, running the function on two different machines might give different results. Switch off multitasking if you want to achieve identical results on any machine. For more information regarding multitasking, refer to the "User Manual".
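The effect of down-sampling on quantile estimates can be illustrated in base R. The helper 'sampled_quantiles' and its 'max.size' and 'seed' arguments are hypothetical and stand in for the size limit and the seed option. The sketch uses base R's 'quantile' with its default estimator, which is an assumption and may differ from the estimator used by the package.

```r
# Hypothetical helper, not part of the package: quantiles with down-sampling
sampled_quantiles <- function(values, percentiles, max.size, seed) {
    if (length(values) > max.size) {
        warning("data size exceeds the limit; quantiles are computed on a random sample")
        set.seed(seed) # a fixed seed makes the sample reproducible
        values <- values[sample.int(length(values), max.size)]
    }
    quantile(values, probs = percentiles, names = FALSE, na.rm = TRUE)
}

values <- as.numeric(1:100000)
p <- c(0.1, 0.6, 0.8)

q1 <- suppressWarnings(sampled_quantiles(values, p, max.size = 1000, seed = 17))
q2 <- suppressWarnings(sampled_quantiles(values, p, max.size = 1000, seed = 17))
identical(q1, q2) # same seed and same data give the same estimate

# Below the limit no sampling takes place and the result is exact
sampled_quantiles(1:100, 0.5, max.size = 1000, seed = 17)
```

With a fixed seed the estimate is reproducible within one process. This sketch does not model multitasking, where the data may be split differently between processes.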
Value
An array that represents quantiles.
See Also
gbins.quantiles, gintervals.quantiles,
gdist
Examples
gdb.init_examples()
gquantiles("dense_track", c(0.1, 0.6, 0.8), gintervals(c(1, 2)))
Get reverse complement of DNA sequence
Description
Takes a DNA sequence string and returns its reverse complement.
Usage
grevcomp(seq)
Arguments
seq |
A character vector containing DNA sequences (using A,C,G,T). Ignores other characters and NA values. |
Value
A character vector of the same length as the input, containing the reverse complement sequences
Examples
grevcomp("ACTG") # Returns "CAGT"
grevcomp(c("ACTG", "GGCC")) # Returns c("CAGT", "GGCC")
grevcomp(c("ACTG", NA, "GGCC")) # Returns c("CAGT", NA, "GGCC")
Returns samples from the values of track expression
Description
Returns a sample of the specified size from the values of track expression.
Usage
gsample(expr = NULL, n = NULL, intervals = NULL, iterator = NULL, band = NULL)
Arguments
expr |
track expression |
n |
a number of items to choose |
intervals |
genomic scope for which the function is applied |
iterator |
track expression iterator. If 'NULL' iterator is determined implicitly based on track expression. |
band |
track expression band. If 'NULL' no band is used. |
Details
This function returns a sample of the specified size from the values of track expression. If 'n' is less than the total number of values, the data is randomly sampled. The seed of the pseudo-random generator can be controlled through 'grnd.seed' option.
If 'n' is higher than the total number of values, all values are returned (yet reshuffled).
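The two cases above can be mimicked in base R on an ordinary numeric vector. The helper 'sample_values' is a hypothetical name used only for illustration; it samples without replacement, which is an assumption consistent with "all values are returned" when 'n' exceeds the number of values.

```r
# Hypothetical helper: draw up to n values without replacement
sample_values <- function(values, n) {
    # sample.int() avoids the sample() pitfall for length-1 numeric input
    values[sample.int(length(values), min(n, length(values)))]
}

set.seed(1) # plays the role of a fixed random seed
x <- c(0.2, 0.5, 0.1, 0.9)

s_small <- sample_values(x, 2) # n smaller than the data: a random subset
s_all <- sample_values(x, 10) # n larger than the data: all values, reshuffled
```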
Value
An array that represents the sampled values.
See Also
Examples
gdb.init_examples()
gsample("sparse_track", 10)
Finds intervals that match track expression
Description
Finds all intervals where track expression is 'TRUE'.
Usage
gscreen(
expr = NULL,
intervals = NULL,
iterator = NULL,
band = NULL,
intervals.set.out = NULL
)
Arguments
expr |
logical track expression |
intervals |
genomic scope for which the function is applied |
iterator |
track expression iterator. If 'NULL' iterator is determined implicitly based on track expression. |
band |
track expression band. If 'NULL' no band is used. |
intervals.set.out |
intervals set name where the function result is optionally outputted |
Details
This function finds all intervals where track expression's value is 'TRUE'.
If 'intervals.set.out' is not 'NULL' the result is saved as an intervals set. Use this parameter if the result size exceeds the limits of the physical memory.
Value
If 'intervals.set.out' is 'NULL' a set of intervals that match track expression.
See Also
Examples
gdb.init_examples()
gscreen("dense_track > 0.2 & sparse_track < 0.4",
iterator = "dense_track"
)
Divides track expression into segments
Description
Divides the values of track expression into segments by using Wilcoxon test.
Usage
gsegment(
expr = NULL,
minsegment = NULL,
maxpval = 0.05,
onetailed = TRUE,
intervals = NULL,
iterator = NULL,
intervals.set.out = NULL
)
Arguments
expr |
track expression |
minsegment |
minimal segment size |
maxpval |
maximal P-value that separates two adjacent segments |
onetailed |
if 'TRUE', Wilcoxon test is performed one tailed, otherwise two tailed |
intervals |
genomic scope for which the function is applied |
iterator |
track expression iterator of "fixed bin" type. If 'NULL' iterator is determined implicitly based on track expression. |
intervals.set.out |
intervals set name where the function result is optionally outputted |
Details
This function divides the values of track expression into segments, where each segment is at least 'minsegment' in size and the P-value of comparing the segment with the first 'minsegment' values from the next segment is at most 'maxpval'. Comparison is done using Wilcoxon (also known as Mann-Whitney) test.
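The separation criterion can be tried out on plain numeric vectors with 'wilcox.test' from the 'stats' package. The helper 'separates' is hypothetical and covers only the pairwise comparison, not the segmentation procedure itself. It uses the two-tailed test because the direction of the one-tailed variant is not specified here.

```r
# Hypothetical helper: do two runs of values count as different segments?
separates <- function(segment, next_values, maxpval = 0.05) {
    p <- wilcox.test(segment, next_values, alternative = "two.sided")$p.value
    p <= maxpval
}

low <- c(1.1, 1.3, 0.9, 1.2, 1.0, 1.4, 0.8, 1.5)
high <- c(5.2, 5.0, 4.9, 5.3, 5.1, 4.8, 5.4, 5.5)
mixed <- c(1.15, 0.95, 1.25, 1.05, 1.35, 0.85, 1.45, 1.55)

separates(low, high) # TRUE: the two runs do not overlap at all
separates(low, mixed) # FALSE: the two runs are interleaved
```

Lowering 'maxpval' makes the criterion stricter, so fewer boundaries are accepted and segments tend to be longer.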
If 'intervals.set.out' is not 'NULL' the result is saved as an intervals set. Use this parameter if the result size exceeds the limits of the physical memory.
Value
If 'intervals.set.out' is 'NULL' a set of intervals where each interval represents a segment.
See Also
Examples
gdb.init_examples()
gsegment("dense_track", 5000, 0.0001)
Complement DNA sequence
Description
Takes a DNA sequence string and returns its complement (without reversing).
Usage
gseq.comp(seq)
Arguments
seq |
A character vector containing DNA sequences (using A,C,G,T). Preserves case and handles NA values. |
Value
A character vector of the same length as the input, containing the complemented sequences
See Also
Examples
gseq.comp("ACTG") # Returns "TGAC"
gseq.comp(c("ACTG", "GGCC")) # Returns c("TGAC", "CCGG")
gseq.comp(c("ACTG", NA, "GGCC")) # Returns c("TGAC", NA, "CCGG")
Returns DNA sequences
Description
Returns DNA sequences for given intervals
Usage
gseq.extract(intervals = NULL)
Arguments
intervals |
intervals for which DNA sequence is returned |
Details
This function returns an array of sequence strings for each interval from 'intervals'. If intervals contain an additional 'strand' column and its value is '-1', the reverse-complementary sequence is returned.
Value
An array of character strings representing DNA sequence.
See Also
Examples
gdb.init_examples()
intervs <- gintervals(c(1, 2), 10000, 10020)
gseq.extract(intervs)
Score DNA sequences with a k-mer over a region of interest
Description
Counts exact matches of a k-mer in DNA sequences over a specified region of interest
(ROI). The ROI is defined by start_pos and end_pos (1-based, inclusive),
with optional extension controlled by extend.
Usage
gseq.kmer(
seqs,
kmer,
mode = c("count", "frac"),
strand = 0L,
start_pos = NULL,
end_pos = NULL,
extend = FALSE,
skip_gaps = TRUE,
gap_chars = c("-", ".")
)
Arguments
seqs |
character vector of DNA sequences (A/C/G/T/N; case-insensitive) |
kmer |
single character string containing the k-mer to search for (A/C/G/T only) |
mode |
character; one of "count" or "frac" |
strand |
integer; 1=forward, -1=reverse, 0=both strands (default: 0) |
start_pos |
integer or NULL; 1-based inclusive start of ROI (default: 1) |
end_pos |
integer or NULL; 1-based inclusive end of ROI (default: sequence length) |
extend |
logical or integer; extension of allowed window starts (default: FALSE) |
skip_gaps |
logical; if TRUE, treat gap characters as holes and skip them while scanning. Windows are k consecutive non-gap bases (default: TRUE) |
gap_chars |
character vector; which characters count as gaps (default: c("-", ".")) |
Details
This function counts k-mer occurrences in DNA sequences directly without requiring
a genomics database. For detailed documentation on k-mer counting parameters, see
gvtrack.create (functions "kmer.count" and "kmer.frac").
The ROI (region of interest) is defined by start_pos and end_pos.
The extend parameter controls whether k-mer matches can extend beyond the ROI boundaries.
For palindromic k-mers, use strand=1 or -1 to avoid double counting.
When skip_gaps=TRUE, characters specified in gap_chars are treated as gaps.
Windows are defined as k consecutive non-gap bases. The frac denominator counts the
number of possible logical starts (non-gap windows) in the region. start_pos and
end_pos are interpreted as physical coordinates on the full sequence.
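The counting rules can be sketched in base R for the simplest case: whole sequence as ROI, no gaps and no extension. All helper names below are hypothetical. The sketch assumes that a reverse-strand match of a k-mer equals a forward-strand match of its reverse complement, and that counting both strands adds the two counts, which is what the note on palindromic k-mers suggests.

```r
# Hypothetical helpers, not part of the package
revcomp <- function(s) {
    chars <- strsplit(chartr("ACGT", "TGCA", toupper(s)), "")[[1]]
    paste(rev(chars), collapse = "")
}

count_forward <- function(s, kmer) {
    k <- nchar(kmer)
    n_windows <- nchar(s) - k + 1 # number of possible window starts
    if (n_windows < 1) return(0L)
    starts <- seq_len(n_windows)
    windows <- substring(toupper(s), starts, starts + k - 1)
    sum(windows == toupper(kmer))
}

# A reverse-strand match is a forward match of the reverse complement
count_reverse <- function(s, kmer) count_forward(s, revcomp(kmer))

frac_forward <- function(s, kmer) {
    n_windows <- nchar(s) - nchar(kmer) + 1
    if (n_windows < 1) return(0)
    count_forward(s, kmer) / n_windows
}

count_forward("CGCGCGCGCG", "CG") # 5 forward matches
count_reverse("CGCGCGCGCG", "CG") # 5 again: "CG" is its own reverse complement
count_forward("ACGTACGTACGT", "AC") # 3
count_reverse("ACGTACGTACGT", "AC") # 3, i.e. forward matches of "GT"
frac_forward("CGCGCGCGCG", "CG") # 5 matches out of 9 windows
```

For the palindromic "CG", adding the two strand counts gives 10 for five physical sites, which is the double counting that a single-strand setting avoids.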
Value
Numeric vector with counts (for "count" mode) or fractions (for "frac" mode). Returns 0 when sequence is too short or ROI is invalid.
See Also
gvtrack.create for detailed k-mer parameter documentation
Examples
## Not run:
# Example sequences
seqs <- c("CGCGCGCGCG", "ATATATATAT", "ACGTACGTACGT")
# Count CG dinucleotides on both strands
gseq.kmer(seqs, "CG", mode = "count", strand = 0)
# Count on forward strand only
gseq.kmer(seqs, "CG", mode = "count", strand = 1)
# Get CG fraction
gseq.kmer(seqs, "CG", mode = "frac", strand = 0)
# Count in a specific region
gseq.kmer(seqs, "CG", mode = "count", start_pos = 2, end_pos = 8)
# Allow k-mer to extend beyond ROI boundaries
gseq.kmer(seqs, "CG", mode = "count", start_pos = 2, end_pos = 8, extend = TRUE)
# Calculate GC content by summing G and C fractions
g_frac <- gseq.kmer(seqs, "G", mode = "frac", strand = 1)
c_frac <- gseq.kmer(seqs, "C", mode = "frac", strand = 1)
gc_content <- g_frac + c_frac
gc_content
# Compare AT counts on different strands
at_forward <- gseq.kmer(seqs, "AT", mode = "count", strand = 1)
at_reverse <- gseq.kmer(seqs, "AT", mode = "count", strand = -1)
at_both <- gseq.kmer(seqs, "AT", mode = "count", strand = 0)
data.frame(forward = at_forward, reverse = at_reverse, both = at_both)
## End(Not run)
Score DNA sequences with a PWM over a region of interest
Description
Scores full DNA sequences using a Position Weight Matrix (PWM) over a specified
region of interest (ROI). The ROI is defined by start_pos and end_pos
(1-based, inclusive), with optional extension controlled by extend.
All reported positions are on the full input sequence.
Usage
gseq.pwm(
seqs,
pssm,
mode = c("lse", "max", "pos", "count"),
bidirect = TRUE,
strand = 0L,
score.thresh = 0,
start_pos = NULL,
end_pos = NULL,
extend = FALSE,
spat.factor = NULL,
spat.bin = 1L,
spat.min = NULL,
spat.max = NULL,
return_strand = FALSE,
skip_gaps = TRUE,
gap_chars = c("-", "."),
neutral_chars = c("N", "n", "*"),
neutral_chars_policy = c("average", "log_quarter", "na"),
prior = 0.01
)
Arguments
seqs |
character vector of DNA sequences (A/C/G/T/N; case-insensitive) |
pssm |
numeric matrix or data frame with columns named A, C, G, T (additional columns are allowed and will be ignored) |
mode |
character; one of "lse", "max", "pos", or "count" |
bidirect |
logical; if TRUE, scans both strands (default: TRUE) |
strand |
integer; 1=forward, -1=reverse, 0=both strands (default: 0) |
score.thresh |
numeric; score threshold for mode = "count" |
start_pos |
integer or NULL; 1-based inclusive start of ROI (default: 1) |
end_pos |
integer or NULL; 1-based inclusive end of ROI (default: sequence length) |
extend |
logical or integer; extension of allowed window starts (default: FALSE) |
spat.factor |
numeric vector; spatial weighting factors (optional) |
spat.bin |
integer; bin size for spatial weighting |
spat.min |
numeric; start of scanning window |
spat.max |
numeric; end of scanning window |
return_strand |
logical; if TRUE and mode = "pos", returns a data.frame with pos and strand columns (default: FALSE) |
skip_gaps |
logical; if TRUE, treat gap characters as holes and skip them while scanning. Windows are w consecutive non-gap bases (default: TRUE) |
gap_chars |
character vector; which characters count as gaps (default: c("-", ".")) |
neutral_chars |
character vector; bases treated as unknown and scored with the average log probability per position (default: c("N", "n", "*")) |
neutral_chars_policy |
character string; how to treat neutral characters. One of "average", "log_quarter", or "na" (default: "average") |
prior |
numeric; pseudocount added to frequencies (default: 0.01). Set to 0 for no pseudocounts. |
Details
This function scores DNA sequences directly without requiring a genomics database.
For detailed documentation on PWM scoring modes, parameters, and spatial weighting,
see gvtrack.create (functions "pwm", "pwm.max", "pwm.max.pos", "pwm.count").
The ROI (region of interest) is defined by start_pos and end_pos.
The extend parameter controls whether motif matches can extend beyond the ROI boundaries.
When skip_gaps=TRUE, characters specified in gap_chars are treated as gaps.
Windows are defined as w consecutive non-gap bases. All positions (pos) are reported
as 1-based indices on the original full sequence (including gaps). start_pos and
end_pos are interpreted as physical coordinates on the full sequence.
Neutral characters (neutral_chars, default c("N", "n", "*")) are treated as
unknown bases in both orientations. Each neutral contributes the mean log-probability of the
corresponding PSSM column, yielding identical penalties on forward and reverse strands without
hard-coded background scores. In mode = "max" the reported value is the single best
strand score after applying any spatial weights; forward and reverse contributions are not
aggregated. This matches the default behavior of the PWM virtual tracks (pwm.max,
pwm.max.pos, etc.).
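The scoring idea can be sketched in base R for the simplest setting: forward strand only, no pseudocount, no spatial weights and no gaps. The helper 'window_scores' is hypothetical and is not the package's implementation. It assumes natural logarithms, scores a neutral character with the mean log-probability of its motif position, and treats "lse" as the log of the summed exponentiated window scores.

```r
pssm <- matrix(
    c(
        0.7, 0.1, 0.1, 0.1, # Position 1: mostly A
        0.1, 0.7, 0.1, 0.1, # Position 2: mostly C
        0.1, 0.1, 0.7, 0.1, # Position 3: mostly G
        0.1, 0.1, 0.1, 0.7 # Position 4: mostly T
    ),
    ncol = 4, byrow = TRUE,
    dimnames = list(NULL, c("A", "C", "G", "T"))
)

# Hypothetical helper: log-probability score of every forward window
window_scores <- function(s, pssm) {
    logp <- log(pssm[, c("A", "C", "G", "T"), drop = FALSE])
    w <- nrow(logp) # motif width
    bases <- strsplit(toupper(s), "")[[1]]
    n_windows <- length(bases) - w + 1
    if (n_windows < 1) return(numeric(0))
    vapply(seq_len(n_windows), function(start) {
        win <- bases[start:(start + w - 1)]
        per_base <- vapply(seq_len(w), function(i) {
            # known base: its log-probability; anything else: position mean
            if (win[i] %in% colnames(logp)) unname(logp[i, win[i]]) else mean(logp[i, ])
        }, numeric(1))
        sum(per_base)
    }, numeric(1))
}

scores <- window_scores("GGGGACGTCCCC", pssm)
max(scores) # best window score, here the exact "ACGT" match
which.max(scores) # 1-based start of the best window
log(sum(exp(scores))) # log-sum-exp over all windows
```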
Value
Numeric vector (for "lse"/"max"/"count" modes), integer vector (for "pos" mode),
or data.frame with pos and strand columns (for "pos" mode with
return_strand=TRUE). Returns NA when no valid windows exist.
See Also
gvtrack.create for detailed PWM parameter documentation
Examples
## Not run:
# Create a PSSM (position-specific scoring matrix) with frequency values
pssm <- matrix(
c(
0.7, 0.1, 0.1, 0.1, # Position 1: mostly A
0.1, 0.7, 0.1, 0.1, # Position 2: mostly C
0.1, 0.1, 0.7, 0.1, # Position 3: mostly G
0.1, 0.1, 0.1, 0.7 # Position 4: mostly T
),
ncol = 4, byrow = TRUE
)
colnames(pssm) <- c("A", "C", "G", "T")
# Example sequences
seqs <- c("ACGTACGTACGT", "GGGGACGTCCCC", "TTTTTTTTTTT")
# Score sequences using log-sum-exp (default mode)
gseq.pwm(seqs, pssm, mode = "lse")
# Get maximum score
gseq.pwm(seqs, pssm, mode = "max")
# Find position of best match
gseq.pwm(seqs, pssm, mode = "pos")
# Find position with strand information
gseq.pwm(seqs, pssm, mode = "pos", bidirect = TRUE, return_strand = TRUE)
# Count matches above threshold
gseq.pwm(seqs, pssm, mode = "count", score.thresh = 0.5)
# Score only a region of interest
gseq.pwm(seqs, pssm, mode = "max", start_pos = 3, end_pos = 10)
# Allow matches to extend beyond ROI boundaries
gseq.pwm(seqs, pssm, mode = "count", start_pos = 5, end_pos = 8, extend = TRUE)
# Spatial weighting example: higher weight in the center
spatial_weights <- c(0.5, 1.0, 2.0, 1.0, 0.5)
gseq.pwm(seqs, pssm,
mode = "lse",
spat.factor = spatial_weights,
spat.bin = 2
)
## End(Not run)
Reverse DNA sequence
Description
Takes a DNA sequence string and returns its reverse (without complementing).
Usage
gseq.rev(seq)
Arguments
seq |
A character vector containing DNA sequences. Preserves case and handles NA values. |
Value
A character vector of the same length as the input, containing the reversed sequences
See Also
Examples
gseq.rev("ACTG") # Returns "GTCA"
gseq.rev(c("ACTG", "GGCC")) # Returns c("GTCA", "CCGG")
gseq.rev(c("ACTG", NA, "GGCC")) # Returns c("GTCA", NA, "CCGG")
Get reverse complement of DNA sequence
Description
Alias for grevcomp. Takes a DNA sequence string and returns its reverse complement.
Usage
gseq.revcomp(seq)
Arguments
seq |
A character vector containing DNA sequences (using A,C,G,T). Ignores other characters and NA values. |
Value
A character vector of the same length as the input, containing the reverse complement sequences
See Also
Examples
gseq.revcomp("ACTG") # Returns "CAGT"
gseq.revcomp(c("ACTG", "GGCC")) # Returns c("CAGT", "GGCC")
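The relationship between complementing, reversing and reverse-complementing can be shown with a few lines of base R. The '*_sketch' helpers below are hypothetical and only illustrate that a reverse complement is the reverse of the complement. They preserve case and pass NA values through.

```r
# Hypothetical helpers, not part of the package
comp_sketch <- function(x) chartr("ACGTacgt", "TGCAtgca", x)

rev_sketch <- function(x) {
    out <- vapply(
        strsplit(x, ""),
        function(ch) paste(rev(ch), collapse = ""),
        character(1)
    )
    out[is.na(x)] <- NA_character_ # keep missing sequences missing
    out
}

revcomp_sketch <- function(x) rev_sketch(comp_sketch(x))

comp_sketch("ACTG") # "TGAC"
rev_sketch("ACTG") # "GTCA"
revcomp_sketch(c("ACTG", NA, "GGCC")) # c("CAGT", NA, "GGCC")
```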
Calculates summary statistics of track expression
Description
Calculates summary statistics of track expression.
Usage
gsummary(expr = NULL, intervals = NULL, iterator = NULL, band = NULL)
Arguments
expr |
track expression |
intervals |
genomic scope for which the function is applied |
iterator |
track expression iterator. If 'NULL' iterator is determined implicitly based on track expression. |
band |
track expression band. If 'NULL' no band is used. |
Details
This function returns summary statistics of a track expression: total number of bins, total number of bins whose value is NaN, min, max, sum, mean and standard deviation of the values.
Value
An array that represents summary statistics.
See Also
gintervals.summary, gbins.summary
Examples
gdb.init_examples()
gsummary("rects_track")
Creates a 'Rectangles' track from intervals and values
Description
Creates a 'Rectangles' track from intervals and values.
Usage
gtrack.2d.create(
track = NULL,
description = NULL,
intervals = NULL,
values = NULL
)
Arguments
track |
track name |
description |
a character string description |
intervals |
a set of two-dimensional intervals |
values |
an array of numeric values - one for each interval |
Details
This function creates a new 'Rectangles' (two-dimensional) track with values at given intervals. 'description' is added as a track attribute.
Value
None.
See Also
gtrack.create, gtrack.create_sparse,
gtrack.smooth, gtrack.modify,
gtrack.rm, gtrack.info,
gdir.create, gtrack.attr.get
Examples
gdb.init_examples()
intervs1 <- gintervals.2d(
1, (1:4) * 200, (1:4) * 200 + 100,
1, (1:4) * 300, (1:4) * 300 + 200
)
intervs2 <- gintervals.2d(
"X", (7:10) * 100, (7:10) * 100 + 50,
2, (1:4) * 200, (1:4) * 200 + 130
)
intervs <- rbind(intervs1, intervs2)
gtrack.2d.create(
"test_rects", "Test 2d track", intervs,
runif(dim(intervs)[1], 1, 100)
)
gextract("test_rects", .misha$ALLGENOME)
gtrack.rm("test_rects", force = TRUE)
Creates a 2D track from tab-delimited file
Description
Creates a 2D track from tab-delimited file(s).
Usage
gtrack.2d.import(track = NULL, description = NULL, file = NULL)
Arguments
track |
track name |
description |
a character string description |
file |
vector of file paths |
Details
This function creates a 2D track from one or more tab-delimited files. Each file must start with a header describing the columns. The first 6 columns must have the following names: 'chrom1', 'start1', 'end1', 'chrom2', 'start2', 'end2'. The last column is designated for the value and it may have an arbitrary name. The header is followed by a list of intervals and a value for each interval. Overlapping intervals are forbidden.
One can learn about the format of the tab-delimited file by running 'gextract' function on a 2D track with a 'file' parameter set to the name of the file.
If all the imported intervals represent a point (i.e. end == start + 1) a 'Points' track is created otherwise it is a 'Rectangles' track.
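A file in the layout described above can be drafted with base R before it is imported. This is only a sketch: the data and the column name 'value' are made up, and the exact format is best confirmed by inspecting a file written by 'gextract'. The last lines apply the point rule (end == start + 1) to predict which track type would result.

```r
# Made-up 2D intervals; the first six column names follow the required header
rects <- data.frame(
    chrom1 = c("chr1", "chr1"), start1 = c(100, 500), end1 = c(101, 501),
    chrom2 = c("chr1", "chrX"), start2 = c(300, 900), end2 = c(301, 901),
    value = c(2.5, 7),
    stringsAsFactors = FALSE
)

f <- tempfile(fileext = ".txt")
write.table(rects, f, sep = "\t", quote = FALSE, row.names = FALSE)
header <- readLines(f, n = 1) # tab-separated column names

# Read the file back and check whether every interval is a single point
tab <- read.delim(f, stringsAsFactors = FALSE)
all_points <- all(tab$end1 == tab$start1 + 1 & tab$end2 == tab$start2 + 1)
track_type <- if (all_points) "Points" else "Rectangles"
unlink(f)
```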
'description' is added as a track attribute.
Note: temporary files are created in the directory of the track during the run of the function. A few of them need to be kept open simultaneously. If the number of chromosomes and / or intervals is particularly high, a few thousand files might need to be open simultaneously. Some operating systems limit the number of open files per user, in which case the function might fail with "Too many open files" or a similar error. Possible workarounds:
1. Increase the limit of simultaneously opened files (the way varies depending on your operating system).
2. Increase the value of 'gmax.data.size' option. Higher values of 'gmax.data.size' option will increase the memory usage of the function but create fewer temporary files.
Value
None.
See Also
gtrack.rm, gtrack.info,
gdir.create
Creates a track from a file of inter-genomic contacts
Description
Creates a track from a file of inter-genomic contacts.
Usage
gtrack.2d.import_contacts(
track = NULL,
description = NULL,
contacts = NULL,
fends = NULL,
allow.duplicates = TRUE
)
Arguments
track |
track name |
description |
a character string description |
contacts |
vector of contacts files |
fends |
name of fragment ends file |
allow.duplicates |
if 'TRUE' duplicated contacts are allowed |
Details
This function creates a 'Points' (two-dimensional) track from contacts files. If 'allow.duplicates' is 'TRUE' duplicated contacts are allowed and summed up, otherwise an error is reported.
Contacts (coord1, coord2) within the same chromosome are automatically doubled to include also '(coord2, coord1)' unless 'coord1' equals 'coord2'.
Contacts may come in one or more files.
If 'fends' is 'NULL' the contacts file is expected to be in "intervals-value" tab-separated format. The file starts with a header defining the column names. The first 6 columns must have the following names: 'chrom1', 'start1', 'end1', 'chrom2', 'start2', 'end2'. The last column is designated for the value and it may have an arbitrary name. The header is followed by a list of intervals and a value for each interval. An interval of form (chrom1, start1, end1, chrom2, start2, end2) is added as a point (X, Y) to the resulting track where X = (start1 + end1) / 2 and Y = (start2 + end2) / 2.
One can see an example of "intervals-value" format by running 'gextract' function on a 2D track with a 'file' parameter set to the name of the file.
If 'fends' is not 'NULL' contacts file is expected to be in "fends-value" tab-separated format. It should start with a header containing at least 3 column names 'fend1', 'fend2' and 'count' in arbitrary order followed by lines each defining a contact between two fragment ends.
| COLUMN | VALUE | DESCRIPTION |
| fend1 | Integer | ID of the first fragment end |
| fend2 | Integer | ID of the second fragment end |
| count | Numeric | Value associated with the contact |
A fragment ends file is also in tab-separated format. It should start with a header containing at least 3 column names 'fend', 'chr' and 'coord' in arbitrary order followed by lines each defining a single fragment end.
| COLUMN | VALUE | DESCRIPTION |
| fend | Unique integer | ID of the fragment end |
| chr | Chromosome name | Can be specified with or without "chr" prefix, like: "X" or "chrX" |
| coord | Integer | Coordinate |
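How the two tables combine can be sketched in base R with small made-up data frames. This is an illustration of the documented rules, not the package's implementation: fragment end IDs are resolved to coordinates, duplicated contacts are summed (as with 'allow.duplicates = TRUE'), and contacts within the same chromosome are mirrored.

```r
# Made-up fragment ends and contacts in the layouts described above
fends <- data.frame(
    fend = c(1L, 2L, 3L), chr = c("1", "1", "X"), coord = c(100L, 500L, 900L),
    stringsAsFactors = FALSE
)
contacts <- data.frame(fend1 = c(1L, 1L, 2L), fend2 = c(2L, 2L, 3L), count = c(1, 2, 5))

# Resolve each fragment end ID to its chromosome and coordinate
i1 <- match(contacts$fend1, fends$fend)
i2 <- match(contacts$fend2, fends$fend)
pts <- data.frame(
    chrom1 = fends$chr[i1], coord1 = fends$coord[i1],
    chrom2 = fends$chr[i2], coord2 = fends$coord[i2],
    value = contacts$count,
    stringsAsFactors = FALSE
)

# Duplicated contacts are summed up
pts <- aggregate(value ~ chrom1 + coord1 + chrom2 + coord2, data = pts, FUN = sum)

# Contacts within one chromosome are doubled as (coord2, coord1)
intra <- as.character(pts$chrom1) == as.character(pts$chrom2) & pts$coord1 != pts$coord2
mirrored <- data.frame(
    chrom1 = pts$chrom2[intra], coord1 = pts$coord2[intra],
    chrom2 = pts$chrom1[intra], coord2 = pts$coord1[intra],
    value = pts$value[intra],
    stringsAsFactors = FALSE
)
all_pts <- rbind(pts, mirrored)
all_pts
```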
'description' is added as a track attribute.
Note: temporary files are created in the directory of the track during the run of the function. A few of them need to be kept open simultaneously. If the number of chromosomes and / or contacts is particularly high, a few thousand files might need to be open simultaneously. Some operating systems limit the number of open files per user, in which case the function might fail with "Too many open files" or a similar error. Possible workarounds:
1. Increase the limit of simultaneously opened files (the way varies depending on your operating system).
2. Increase the value of 'gmax.data.size' option. Higher values of 'gmax.data.size' option will increase the memory usage of the function but create fewer temporary files.
Value
None.
See Also
gtrack.2d.import, gtrack.rm,
gtrack.info, gdir.create
Returns values from 'Array' track
Description
Returns values from 'Array' track.
Usage
gtrack.array.extract(
track = NULL,
slice = NULL,
intervals = NULL,
file = NULL,
intervals.set.out = NULL
)
Arguments
track |
track name |
slice |
a vector of column names or column indices or 'NULL' |
intervals |
genomic scope for which the function is applied |
file |
file name where the function result is to be saved. If 'NULL' result is returned to the user. |
intervals.set.out |
intervals set name where the function result is optionally outputted |
Details
This function returns the column values of an 'Array' track in the genomic scope specified by 'intervals'. 'slice' parameter determines which columns should appear in the result. The columns can be indicated by their names or their indices. If 'slice' is 'NULL' the values of all track columns are returned.
The order inside the result might not be the same as the order of intervals. An additional column 'intervalID' is added to the return value. Use this column to refer to the index of the original interval from the supplied 'intervals'.
If 'file' parameter is not 'NULL' the result is saved to a tab-delimited text file (without 'intervalID' column) rather than returned to the user. This can be especially useful when the result is too big to fit into the physical memory. The resulting file can be used as an input for 'gtrack.array.import' function.
If 'intervals.set.out' is not 'NULL' the result is saved as an intervals set. Similarly to 'file' parameter 'intervals.set.out' can be useful to overcome the limits of the physical memory.
Value
If 'file' and 'intervals.set.out' are 'NULL' a set of intervals with additional columns for 'Array' track column values and 'intervalID'.
See Also
gextract, gtrack.array.get_colnames,
gtrack.array.import
Examples
gdb.init_examples()
gtrack.array.extract(
"array_track", c("col3", "col5"),
gintervals(1, 0, 2000)
)
Returns column names of array track
Description
Returns column names of array track.
Usage
gtrack.array.get_colnames(track = NULL)
Arguments
track |
track name |
Details
This function returns the column names of an array track.
Value
A character vector with column names.
See Also
gtrack.array.set_colnames,
gtrack.array.extract, gvtrack.array.slice,
gtrack.info
Examples
gtrack.array.get_colnames("array_track")
Creates an array track from array tracks or files
Description
Creates an array track from array tracks or files.
Usage
gtrack.array.import(track = NULL, description = NULL, ...)
Arguments
track |
name of the newly created track |
description |
a character string description |
... |
array track or name of a tab-delimited file |
Details
This function creates a new 'Array' track from one or more "sources". Each source can be either another 'Array' track or a tab-delimited file that contains one-dimensional intervals and column values that should be added to the newly created track. One can learn about the exact format of the file by running 'gtrack.array.extract' or 'gextract' functions with a 'file' parameter and inspecting the output file.
There might be more than one source used to create the new track. In that case the new track will contain the columns from all the sources. The equally named columns are merged. Intervals that appear in one source but not in the other are added and the values for the missing columns are set to NaN. Intervals with all NaN values are not added. Partial overlaps between two intervals from different sources are forbidden.
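The way columns from several sources are combined can be pictured with a full outer join in base R. The two data frames below are made-up sources with distinctly named columns. The sketch does not model the merging of equally named columns, the dropping of all-NaN intervals or the check for partial overlaps.

```r
key <- c("chrom", "start", "end")

# Two made-up sources with one value column each
src1 <- data.frame(
    chrom = "chr1", start = c(0, 100), end = c(50, 150), col1 = c(1, 2),
    stringsAsFactors = FALSE
)
src2 <- data.frame(
    chrom = "chr1", start = c(100, 200), end = c(150, 250), col2 = c(10, 20),
    stringsAsFactors = FALSE
)

# Full outer join: intervals present in only one source are kept
merged <- merge(src1, src2, by = key, all = TRUE)

# merge() marks missing values as NA; the text above describes them as NaN
for (col in setdiff(names(merged), key)) {
    merged[[col]][is.na(merged[[col]])] <- NaN
}
merged
```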
'description' is added as a track attribute.
Value
None.
See Also
gextract, gtrack.array.extract,
gtrack.array.set_colnames, gtrack.rm,
gtrack.info, gdir.create
Examples
f1 <- tempfile()
gextract("sparse_track", gintervals(1, 5000, 20000), file = f1)
f2 <- tempfile()
gtrack.array.extract("array_track", c("col2", "col3", "col4"),
gintervals(1, 0, 20000),
file = f2
)
f3 <- tempfile()
gtrack.array.extract("array_track", c("col1", "col3"),
gintervals(1, 0, 20000),
file = f3
)
gtrack.array.import("test_track1", "Test array track 1", f1, f2)
gtrack.array.extract("test_track1", NULL, .misha$ALLGENOME)
gtrack.array.import(
"test_track2", "Test array track 2",
"test_track1", f3
)
gtrack.array.extract("test_track2", NULL, .misha$ALLGENOME)
gtrack.rm("test_track1", TRUE)
gtrack.rm("test_track2", TRUE)
unlink(c(f1, f2, f3))
Sets column names of array track
Description
Sets column names of array track.
Usage
gtrack.array.set_colnames(track = NULL, names = NULL)
Arguments
track |
track name |
names |
vector of column names |
Details
This function sets the column names of an array track.
Value
None.
See Also
gtrack.array.get_colnames,
gtrack.array.extract, gvtrack.array.slice,
gtrack.info
Examples
old.names <- gtrack.array.get_colnames("array_track")
new.names <- paste("modified", old.names, sep = "_")
gtrack.array.set_colnames("array_track", new.names)
gtrack.array.get_colnames("array_track")
gtrack.array.set_colnames("array_track", old.names)
gtrack.array.get_colnames("array_track")
Returns track attributes values
Description
Returns track attributes values.
Usage
gtrack.attr.export(tracks = NULL, attrs = NULL)
Arguments
tracks |
a vector of track names or 'NULL' |
attrs |
a vector of attribute names or 'NULL' |
Details
This function returns a data frame that contains track attributes values. Column names of the data frame consist of the attribute names, row names contain the track names.
The list of required tracks is specified by 'tracks' argument. If 'tracks' is 'NULL' the attribute values of all existing tracks are returned.
Likewise, the list of required attributes is controlled by the 'attrs' argument. If 'attrs' is 'NULL', all attribute values of the specified tracks are returned, and the columns are sorted by the "popularity" of an attribute, i.e. the number of tracks containing this attribute. This sorting is not applied if 'attrs' is not 'NULL'.
Empty character string in a table cell marks a non-existing attribute.
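The layout of the returned table can be pictured with a small base R sketch. The attribute names and values below are invented for illustration, and the tie-breaking order between equally popular attributes is an assumption of the sketch rather than documented behaviour.

```r
# Hypothetical per-track attributes (names and values invented for the sketch).
attrs <- list(
  dense_track  = c(created.by = "gtrack.create", description = "dense"),
  sparse_track = c(created.by = "gtrack.create_sparse"),
  array_track  = c(created.by = "gtrack.array.import", owner = "lab")
)

all_names <- unique(unlist(lapply(attrs, names)))
# One row per track, one column per attribute; "" marks a non-existing attribute.
tab <- t(sapply(attrs, function(a) {
  v <- a[all_names]
  v[is.na(v)] <- ""
  unname(v)
}))
colnames(tab) <- all_names
tab <- as.data.frame(tab, stringsAsFactors = FALSE)

# Sort columns by "popularity": the number of tracks that have the attribute.
popularity <- colSums(tab != "")
tab <- tab[, order(-popularity), drop = FALSE]
print(tab)
```

Row names hold the track names and the most widely used attribute ('created.by' in this toy data) comes first.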
Value
A data frame containing track attributes values.
See Also
gtrack.attr.import, gtrack.attr.get,
gtrack.attr.set
Examples
gdb.init_examples()
gtrack.attr.export()
gtrack.attr.export(tracks = c("sparse_track", "dense_track"))
gtrack.attr.export(attrs = "created.by")
Returns value of a track attribute
Description
Returns value of a track attribute.
Usage
gtrack.attr.get(track = NULL, attr = NULL)
Arguments
track |
track name |
attr |
attribute name |
Details
This function returns the value of a track attribute. If the attribute does not exist an empty string is returned.
Value
Track attribute value.
See Also
gtrack.attr.import, gtrack.attr.set
Examples
gdb.init_examples()
gtrack.attr.set("sparse_track", "test_attr", "value")
gtrack.attr.get("sparse_track", "test_attr")
gtrack.attr.set("sparse_track", "test_attr", "")
Imports track attributes values
Description
Imports track attributes values.
Usage
gtrack.attr.import(table = NULL, remove.others = FALSE)
Arguments
table |
a data frame containing attribute values |
remove.others |
specifies what to do with the attributes that are not in the table |
Details
This function imports attribute values contained in a data frame 'table'. The format of a table is similar to the one returned by 'gtrack.attr.export'. The values of the table must be character strings. Column names of the table should specify the attribute names, while row names should contain the track names.
The specified attributes of the specified tracks are modified. If an attribute value is an empty string this attribute is removed from the track.
If 'remove.others' is 'TRUE' all non-readonly attributes that do not appear in the table are removed, otherwise they are preserved unchanged.
Error is reported on an attempt to modify a value of a read-only attribute.
Value
None.
See Also
gtrack.attr.export, gtrack.attr.set,
gtrack.attr.get, gdb.get_readonly_attrs
Examples
gdb.init_examples()
t <- gtrack.attr.export()
t$newattr <- as.character(1:dim(t)[1])
gtrack.attr.import(t)
gtrack.attr.export(attrs = "newattr")
# roll-back the changes
t$newattr <- ""
gtrack.attr.import(t)
Assigns value to a track attribute
Description
Assigns value to a track attribute.
Usage
gtrack.attr.set(track = NULL, attr = NULL, value = NULL)
Arguments
track |
track name |
attr |
attribute name |
value |
value |
Details
This function creates a track attribute and assigns 'value' to it. If the attribute already exists its value is overwritten.
If 'value' is an empty string the attribute is removed.
Error is reported on an attempt to modify a value of a read-only attribute.
Value
None.
See Also
gtrack.attr.get, gtrack.attr.import,
gtrack.var.set, gdb.get_readonly_attrs
Examples
gdb.init_examples()
gtrack.attr.set("sparse_track", "test_attr", "value")
gtrack.attr.get("sparse_track", "test_attr")
gtrack.attr.set("sparse_track", "test_attr", "")
Converts a track to the most current format
Description
Converts a track (if needed) to the most current format.
Usage
gtrack.convert(src.track = NULL, tgt.track = NULL)
Arguments
src.track |
source track name |
tgt.track |
target track name. If 'NULL' the source track is overwritten. |
Details
This function converts a track to the most current format. It should be used if a track created by an old version of the library cannot be read anymore by the newer version. The old track is given by 'src.track'. After conversion a new track 'tgt.track' is created. If 'tgt.track' is 'NULL' the source track is overwritten.
Value
None
See Also
gtrack.create, gtrack.2d.create,
gtrack.create_sparse
Convert a track to indexed format
Description
Converts a per-chromosome track to indexed format (track.dat + track.idx).
Usage
gtrack.convert_to_indexed(track = NULL)
Arguments
track |
track name to convert |
Details
This function converts a track from the per-chromosome file format to single-file indexed format. The indexed format dramatically reduces file descriptor usage for genomes with many contigs and provides better performance for parallel access.
The function performs the following steps:
Validates that all per-chromosome files have consistent metadata
Creates track.dat by concatenating all per-chromosome files
Creates track.idx with offset/length information for each chromosome
Uses atomic operations (fsync + rename) to ensure data integrity
Removes the old per-chromosome files after successful conversion
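The concatenate, index and rename steps listed above can be sketched in base R on toy files. This is only an illustration under stated assumptions: the real on-disk formats of 'track.dat' and 'track.idx' are not described here, so the tab-separated text index is hypothetical, the metadata validation step is skipped, and base R has no fsync, so only the rename part of the atomic update is shown.

```r
# Toy per-chromosome files (contents are arbitrary bytes for the sketch).
dir <- tempfile("track_")
dir.create(dir)
chroms <- c("chr1", "chr2", "chrX")
for (i in seq_along(chroms)) {
  writeBin(as.raw(rep(i, i * 4)), file.path(dir, chroms[i]))
}

# Concatenate into a temporary data file while recording offset and length.
dat_tmp <- file.path(dir, "track.dat.tmp")
out <- file(dat_tmp, "wb")
idx <- data.frame(chrom = chroms, offset = 0, length = 0)
offset <- 0
for (i in seq_along(chroms)) {
  path <- file.path(dir, chroms[i])
  bytes <- readBin(path, "raw", n = file.size(path))
  writeBin(bytes, out)
  idx$offset[i] <- offset
  idx$length[i] <- length(bytes)
  offset <- offset + length(bytes)
}
close(out)

# Write the index under a temporary name too, then rename both into place.
idx_tmp <- file.path(dir, "track.idx.tmp")
write.table(idx, idx_tmp, sep = "\t", quote = FALSE, row.names = FALSE)
stopifnot(file.rename(dat_tmp, file.path(dir, "track.dat")),
          file.rename(idx_tmp, file.path(dir, "track.idx")))

# Only after both renames succeed are the per-chromosome files removed.
unlink(file.path(dir, chroms))
print(idx)
```

Writing to temporary names first means a reader never sees a half-written 'track.dat' or 'track.idx'; an interrupted run leaves the original per-chromosome files untouched.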
Value
None
See Also
gtrack.create, gtrack.create_sparse, gtrack.create_dense
Examples
## Not run:
# Convert a track to indexed format
gtrack.convert_to_indexed("my_track")
## End(Not run)
Creates a track from a track expression
Description
Creates a track from a track expression.
Usage
gtrack.create(
track = NULL,
description = NULL,
expr = NULL,
iterator = NULL,
band = NULL
)
Arguments
track |
track name |
description |
a character string description |
expr |
track expression |
iterator |
track expression iterator. If 'NULL' iterator is determined implicitly based on track expression. |
band |
track expression band. If 'NULL' no band is used. |
Details
This function creates a new track named 'track'. The values of the track are determined by evaluating 'expr', a numeric track expression. The type of the new track is determined by the type of the iterator: a 'Fixed bin', 'Sparse' or 'Rectangles' track is created accordingly. 'description' is added as a track attribute.
Value
None.
See Also
gtrack.2d.create, gtrack.create_sparse,
gtrack.smooth, gtrack.modify,
gtrack.rm, gtrack.info,
gdir.create
Examples
gdb.init_examples()
## Creates a new track that is a sum of values from 'dense' and
## 2 * non-nan values of 'sparse' track. The new track type is
## Dense with a bin size equal to 70.
gtrack.create("mixed_track", "Test track",
"dense_track +
replace(sparse_track, is.nan(sparse_track), 0) * 2",
iterator = 70
)
gtrack.info("mixed_track")
gtrack.rm("mixed_track", force = TRUE)
Creates a 'Dense' track from intervals and values
Description
Creates a 'Dense' track from intervals and values.
Usage
gtrack.create_dense(
track = NULL,
description = NULL,
intervals = NULL,
values = NULL,
binsize = NULL,
defval = NaN
)
Arguments
track |
track name |
description |
a character string description |
intervals |
a set of one-dimensional intervals |
values |
an array of numeric values - one for each interval |
binsize |
bin size of the newly created 'Dense' track |
defval |
default track value for genomic regions not covered by the intervals |
Details
This function creates a new 'Dense' track with values at given intervals. 'description' is added as a track attribute.
Value
None.
See Also
gtrack.create_sparse, gtrack.import,
gtrack.modify, gtrack.rm,
gtrack.info
Examples
gdb.init_examples()
intervs <- gintervals.load("annotations")
gtrack.create_dense(
"test_dense", "Test dense track", intervs,
1:dim(intervs)[1], 50, 0
)
gextract("test_dense", .misha$ALLGENOME)
gtrack.rm("test_dense", force = TRUE)
Create directories needed for track creation
Description
This function creates the directories needed for track creation. For example, if the track name is 'proj.sample.my_track', this function creates the directories 'proj' and 'sample'. Use this function with caution - a long track name may create a deep directory structure.
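How a dotted track name maps to nested directories can be sketched as follows. The helper `track_dirs` is hypothetical and only computes relative directory names; where these directories live inside the database is not covered by the sketch.

```r
# Hypothetical helper: list the directories implied by a dotted track name.
track_dirs <- function(track) {
  parts <- strsplit(track, ".", fixed = TRUE)[[1]]
  dirs <- parts[-length(parts)]      # the last component is the track itself
  # Cumulative paths: "proj", then "proj/sample", and so on.
  vapply(seq_along(dirs),
         function(i) do.call(file.path, as.list(dirs[seq_len(i)])),
         character(1))
}

track_dirs("proj.sample.my_track")
track_dirs("my_track")
```

Each extra dot in the name adds one more directory level, which is why very long dotted names lead to deep directory structures.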
Usage
gtrack.create_dirs(track, mode = "0777")
Arguments
track |
name of the track |
mode |
see 'dir.create' |
Value
None.
Examples
gdb.init_examples()
# This creates the directories 'proj' and 'sample'
gtrack.create_dirs("proj.sample.my_track")
Creates a new track from PSSM energy function
Description
Creates a new track from PSSM energy function.
Usage
gtrack.create_pwm_energy(
track = NULL,
description = NULL,
pssmset = NULL,
pssmid = NULL,
prior = NULL,
iterator = NULL
)
Arguments
track |
track name |
description |
a character string description |
pssmset |
name of PSSM set: 'pssmset.key' and 'pssmset.data' must be present in 'GROOT/pssms' directory |
pssmid |
PSSM id |
prior |
prior |
iterator |
track expression iterator for the newly created track |
Details
This function creates a new track with values of a PSSM energy function. PSSM parameters (nucleotide probability per position and pluralization) are determined by 'pssmset' key and data files ('pssmset.key' and 'pssmset.data'). These two files must be located in 'GROOT/pssms' directory. The type of the created track is determined by the type of the iterator. 'description' is added as a track attribute.
Value
None.
See Also
gtrack.create, gtrack.2d.create,
gtrack.create_sparse, gtrack.smooth,
gtrack.modify, gtrack.rm,
gtrack.info, gdir.create
Examples
gdb.init_examples()
gtrack.create_pwm_energy("pwm_energy_track", "Test track", "pssm",
3, 0.01,
iterator = 100
)
gextract("pwm_energy_track", gintervals(1, 0, 1000))
Creates a 'Sparse' track from intervals and values
Description
Creates a 'Sparse' track from intervals and values.
Usage
gtrack.create_sparse(
track = NULL,
description = NULL,
intervals = NULL,
values = NULL
)
Arguments
track |
track name |
description |
a character string description |
intervals |
a set of one-dimensional intervals |
values |
an array of numeric values - one for each interval |
Details
This function creates a new 'Sparse' track with values at given intervals. 'description' is added as a track attribute.
Value
None.
See Also
gtrack.create, gtrack.2d.create,
gtrack.smooth, gtrack.modify,
gtrack.rm, gtrack.info,
gdir.create
Examples
gdb.init_examples()
intervs <- gintervals.load("annotations")
gtrack.create_sparse(
"test_sparse", "Test track", intervs,
1:dim(intervs)[1]
)
gextract("test_sparse", .misha$ALLGENOME)
gtrack.rm("test_sparse", force = TRUE)
Tests for a track existence
Description
Tests for a track existence.
Usage
gtrack.exists(track = NULL)
Arguments
track |
track name |
Details
This function returns 'TRUE' if a track exists in Genomic Database.
Value
'TRUE' if a track exists. Otherwise 'FALSE'.
See Also
gtrack.ls, gtrack.info,
gtrack.create, gtrack.rm
Examples
gdb.init_examples()
gtrack.exists("dense_track")
Creates a track from WIG / BigWig / BedGraph / BED / tab-delimited file
Description
Creates a track from WIG / BigWig / BedGraph / BED / tab-delimited file.
Usage
gtrack.import(
track = NULL,
description = NULL,
file = NULL,
binsize = NULL,
defval = NaN,
attrs = NULL
)
Arguments
track |
track name |
description |
a character string description |
file |
file path |
binsize |
bin size of the newly created 'Dense' track or '0' for a 'Sparse' track |
defval |
default track value |
attrs |
a named vector or list of attributes to be set on the track after import |
Details
This function creates a track from a WIG / BigWig / BedGraph / BED / tab-delimited file. Zipped files are supported (file name must have '.gz' or '.zip' suffix).
Tab-delimited files must start with a header line with the following column names (tab-separated): 'chrom', 'start', 'end', and exactly one value column name (e.g. 'value'). Each subsequent line provides a single interval:
chrom - chromosome name (e.g. 'chr1')
start - 0-based start coordinate (inclusive)
end - 0-based end coordinate (exclusive)
value - numeric value (floating point allowed); exactly one value column is supported
Columns must be separated by tabs. Coordinates must refer to chromosomes existing in the current genome. Missing values can be specified as 'NaN'.
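A file in the tab-delimited layout described above can be produced with base R. This is a minimal sketch: the intervals and values are invented, and the value column is stored as text so that the missing value is written literally as 'NaN'.

```r
# Toy intervals with a single value column (coordinates are invented).
tab <- data.frame(
  chrom = c("chr1", "chr1", "chr2"),
  start = c(0, 100, 500),
  end   = c(100, 250, 800),
  # Kept as character so the missing value is written literally as "NaN".
  value = as.character(c(1.5, NaN, 3.2))
)

path <- tempfile(fileext = ".txt")
# Tab-separated, header line included, no quoting, no row names.
write.table(tab, path, sep = "\t", quote = FALSE, row.names = FALSE)
writeLines(readLines(path))
```

The resulting path could then be passed as the 'file' argument, provided the chromosome names exist in the current genome.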
BED files (.bed/.bed.gz/.bed.zip) are also supported. If the BED 'score' column (5th column) exists and is numeric, it is used as the interval value; otherwise a constant value of 1 is used. For BED inputs, 'binsize' controls the output type: if 'binsize' is 0 the track is 'Sparse'; otherwise the track is 'Dense' with bin-averaged values based on overlaps with BED intervals (and 'defval' for regions not covered).
If 'binsize' is 0 the resulting track is created in 'Sparse' format. Otherwise the 'Dense' format is chosen with a bin size equal to 'binsize'. The values that were not defined in the input file are substituted by the 'defval' value.
'description' is added as a track attribute.
Value
None.
See Also
gtrack.import_set, gtrack.rm,
gtrack.info, gdir.create, gextract
Examples
gdb.init_examples()
# Create a simple WIG file for demonstration
temp_file <- tempfile(fileext = ".wig")
writeLines(c(
"track type=wiggle_0 name=\"example track\"",
"fixedStep chrom=chr1 start=1 step=1",
"1.5",
"2.0",
"1.8",
"3.2"
), temp_file)
# Basic import
gtrack.import("example_track", "Example track from WIG file",
temp_file,
binsize = 1
)
gtrack.info("example_track")
gtrack.rm("example_track", force = TRUE)
# Import with custom attributes
attrs <- c("author" = "researcher", "version" = "1.0", "experiment" = "test")
gtrack.import("example_track_with_attrs", "Example track with attributes",
temp_file,
binsize = 1, attrs = attrs
)
# Check that attributes were set
gtrack.attr.get("example_track_with_attrs", "author")
gtrack.attr.get("example_track_with_attrs", "version")
gtrack.attr.get("example_track_with_attrs", "experiment")
# Clean up
gtrack.rm("example_track_with_attrs", force = TRUE)
Creates a track from a file of mapped sequences
Description
Creates a track from a file of mapped sequences.
Usage
gtrack.import_mappedseq(
track = NULL,
description = NULL,
file = NULL,
pileup = 0,
binsize = -1,
cols.order = c(9, 11, 13, 14),
remove.dups = TRUE
)
Arguments
track |
track name |
description |
a character string description |
file |
name of mapped sequences file |
pileup |
interval expansion |
binsize |
bin size of a dense track |
cols.order |
order of sequence, chromosome, coordinate and strand columns in mapped sequences file or NULL if SAM file is used |
remove.dups |
if 'TRUE' the duplicated coordinates are counted only once. |
Details
This function creates a track from a file of mapped sequences. The file can be in SAM format or in a general TAB delimited text format where each line describes a single read.
For a SAM file 'cols.order' must be set to 'NULL'.
For a general TAB delimited text format the following columns must be present in the file: sequence, chromosome, coordinate and strand. The position of these columns should be specified in 'cols.order' argument. The default value of 'cols.order' is an array of (9, 11, 13, 14) meaning that sequence is expected to be found at column number 9, chromosome - at column 11, coordinate - at column 13 and strand - at column 14. The column indices are 1-based, i.e. the first column is referenced by 1. Chromosome needs a prefix 'chr' e.g. 'chr1'. Valid strand values are '+' or 'F' for forward strand and '-' or 'R' for the reverse strand.
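The role of 'cols.order' can be illustrated by picking the four fields out of a single line in base R. The line content below is invented and the parsing is a sketch of the column convention only, not of how the package reads the file.

```r
# One toy line with 14 tab-separated fields; only columns 9, 11, 13 and 14 matter.
fields <- paste0("f", 1:14)
fields[c(9, 11, 13, 14)] <- c("ACGTACGT", "chr1", "12345", "R")
line <- paste(fields, collapse = "\t")

cols.order <- c(9, 11, 13, 14)   # sequence, chromosome, coordinate, strand
parts <- strsplit(line, "\t", fixed = TRUE)[[1]]
read <- as.list(parts[cols.order])          # 1-based column indices
names(read) <- c("sequence", "chrom", "coord", "strand")
read$coord <- as.numeric(read$coord)

# Accept both spellings of each strand and normalise them to "+" / "-".
stopifnot(read$strand %in% c("+", "F", "-", "R"))
read$strand <- if (read$strand %in% c("+", "F")) "+" else "-"
str(read)
```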
Each read at given coordinate can be "expanded" to cover an interval rather than a single point. The length of the interval is controlled by 'pileup' argument. The direction of expansion depends on the strand value. If 'pileup' is '0', no expansion is performed and the read is converted to a single point. The track is created in sparse format. If 'pileup' is greater than zero, the output track is in dense format. 'binsize' controls the bin size of the dense track.
If 'remove.dups' is 'TRUE' the duplicated coordinates are counted only once.
'description' is added as a track attribute.
'gtrack.import_mappedseq' returns the statistics of the conversion process.
Value
A list of conversion process statistics.
See Also
gtrack.rm, gtrack.info,
gdir.create
Creates one or more tracks from multiple WIG / BigWig / BedGraph / tab-delimited files on disk or FTP
Description
Creates one or more tracks from WIG / BigWig / BedGraph / tab-delimited files on disk or FTP.
Usage
gtrack.import_set(
description = NULL,
path = NULL,
binsize = NULL,
track.prefix = NULL,
defval = NaN
)
Arguments
description |
a character string description |
path |
file path or URL (may contain wildcards) |
binsize |
bin size of the newly created 'Dense' track or '0' for a 'Sparse' track |
track.prefix |
prefix for a track name |
defval |
default track value |
Details
This function is similar to 'gtrack.import' however unlike the latter it can create multiple tracks. Additionally the files can be fetched from an FTP server.
The files are expected to be in WIG / BigWig / BedGraph / tab-delimited formats. One can learn about the format of the tab-delimited file by running 'gextract' function with a 'file' parameter set to the name of the file. Zipped files are supported (file name must have '.gz' or '.zip' suffix).
Files are specified by 'path' argument. 'path' can be also a URL of an FTP server in the form of 'ftp://[address]/[files]'. If 'path' is a URL, the files are first downloaded from FTP server to a temporary directory and then imported to tracks. The temporary directory is created at 'GROOT/downloads'.
Regardless of whether 'path' is a file path or a URL, it can contain wildcards. Hence multiple files can be imported (and downloaded) at once.
If 'binsize' is 0 the resulting tracks are created in 'Sparse' format. Otherwise the 'Dense' format is chosen with a bin size equal to 'binsize'. The values that were not defined in the input file are substituted by the 'defval' value.
The name of each created track is of the form '[track.prefix][filename]', where 'filename' is the name of the WIG file. For example, if 'track.prefix' equals "wigs." and an input file name is 'mydata', a track named 'wigs.mydata' is created. If 'track.prefix' is 'NULL' no prefix is prepended to the name of the created track.
Existing tracks are not overwritten and no new directories are automatically created.
'description' is added to the created tracks as an attribute.
'gtrack.import_set' does not stop if an error occurs while importing a file. It rather continues importing the rest of the files.
'gtrack.import_set' returns the names of the files that were successfully imported and those that failed.
Value
Names of files that were successfully imported and those that failed.
See Also
gtrack.import, gwget,
gtrack.rm, gtrack.info,
gdir.create, gextract
Returns information about a track
Description
Returns information about a track.
Usage
gtrack.info(track = NULL, validate = FALSE)
Arguments
track |
track name |
validate |
if TRUE, validates the track index file integrity (for indexed tracks). Default: FALSE |
Details
Returns information about the track (type, dimensions, size in bytes, etc.). The fields in the returned value vary depending on the type of the track.
Value
A list that contains track properties
See Also
Examples
gdb.init_examples()
gtrack.info("dense_track")
gtrack.info("rects_track")
Imports a track from another assembly
Description
Imports a track from another assembly.
Usage
gtrack.liftover(
track = NULL,
description = NULL,
src.track.dir = NULL,
chain = NULL,
src_overlap_policy = "error",
tgt_overlap_policy = "auto",
multi_target_agg = c("mean", "median", "sum", "min", "max", "count", "first", "last",
"nth", "max.coverage_len", "min.coverage_len", "max.coverage_frac",
"min.coverage_frac"),
params = NULL,
na.rm = TRUE,
min_n = NULL,
min_score = NULL
)
Arguments
track |
name of a created track |
description |
a character string description |
src.track.dir |
path to the directory of the source track |
chain |
name of chain file or data frame as returned by 'gintervals.load_chain' |
src_overlap_policy |
policy for handling source overlaps: "error" (default), "keep", or "discard". "keep" allows one source interval to map to multiple target intervals, "discard" discards all source intervals that have overlaps and "error" throws an error if source overlaps are detected. |
tgt_overlap_policy |
policy for handling target overlaps (default: "auto") |
multi_target_agg |
aggregation/selection policy for contributors that land on the same target locus. When multiple source intervals map to overlapping regions in the target genome (after applying tgt_overlap_policy), their values must be combined into a single value. |
params |
additional parameters for aggregation (e.g., for "nth" aggregation) |
na.rm |
logical indicating whether NA values should be removed before aggregation (default: TRUE) |
min_n |
minimum number of non-NA values required for aggregation. If fewer values are available, the result will be NA. |
min_score |
optional minimum alignment score threshold. Chains with scores below this value are filtered out. Useful for excluding low-quality alignments. |
Details
This function imports a track located in 'src.track.dir' of another assembly to the current database. Chain file instructs how the conversion of coordinates should be done. It can be either a name of a chain file or a data frame in the same format as returned by 'gintervals.load_chain' function. The name of the newly created track is specified by 'track' argument and 'description' is added as a track attribute.
Note: When passing a pre-loaded chain (data frame), overlap policies cannot be specified - they are taken from the chain's attributes that were set during loading. When passing a chain file path, policies can be specified and will be used for loading. Aggregation parameters (multi_target_agg, params, na.rm, min_n) can always be specified regardless of chain type.
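How 'na.rm' and 'min_n' interact with an aggregation choice can be sketched in base R. The helper `aggregate_contributors` is hypothetical and covers only a few of the listed aggregations; it follows the argument descriptions above and is not the package's implementation.

```r
# Hypothetical helper: combine the values of contributors landing on one target locus.
aggregate_contributors <- function(values, agg = "mean", na.rm = TRUE, min_n = NULL) {
  n_valid <- sum(!is.na(values))
  # Fewer non-NA values than 'min_n': the result is NA.
  if (!is.null(min_n) && n_valid < min_n) return(NA_real_)
  if (na.rm) values <- values[!is.na(values)]
  if (length(values) == 0) return(NA_real_)
  switch(agg,
    mean  = mean(values),
    sum   = sum(values),
    min   = min(values),
    max   = max(values),
    count = as.numeric(length(values)),
    stop("aggregation not covered by this sketch: ", agg)
  )
}

vals <- c(1, NA, 3)
aggregate_contributors(vals)                  # NA removed before averaging
aggregate_contributors(vals, na.rm = FALSE)   # NA propagates
aggregate_contributors(vals, min_n = 3)       # only two non-NA values
```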
Value
None.
Note
Terminology note for UCSC chain format users: In the UCSC chain format specification, the fields prefixed with 't' (tName, tStart, tEnd, etc.) are called "target" or "reference", while fields prefixed with 'q' (qName, qStart, qEnd, etc.) are called "query". However, misha uses reversed terminology: UCSC's "target/reference" corresponds to misha's "source" (chromsrc, startsrc, endsrc), and UCSC's "query" corresponds to misha's "target" (chrom, start, end).
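The terminology mapping can be written down as a column rename. This sketch uses the UCSC field names and the misha column names quoted in the note; the coordinate values are invented, and a real chain data frame may contain further columns that are not shown here.

```r
# One toy UCSC-style chain block (values invented for the sketch).
ucsc <- data.frame(tName = "chr1", tStart = 1000, tEnd = 2000,
                   qName = "chr2", qStart = 5000, qEnd = 6000)

# UCSC target/reference (t*) -> misha source; UCSC query (q*) -> misha target.
to_misha <- c(tName = "chromsrc", tStart = "startsrc", tEnd = "endsrc",
              qName = "chrom",    qStart = "start",    qEnd = "end")

misha_like <- ucsc
names(misha_like) <- unname(to_misha[names(ucsc)])
print(misha_like)
```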
See Also
gintervals.load_chain,
gintervals.liftover
Creates a new track from a lookup table based on track expression
Description
Evaluates track expression and translates the values into bin indices that are used in turn to retrieve values from a lookup table and create a track.
Usage
gtrack.lookup(
track = NULL,
description = NULL,
lookup_table = NULL,
...,
include.lowest = FALSE,
force.binning = TRUE,
iterator = NULL,
band = NULL
)
Arguments
track |
track name |
description |
a character string description |
lookup_table |
a multi-dimensional array containing the values that are returned by the function |
... |
pairs of track expressions and breaks |
include.lowest |
if 'TRUE', the lowest value of the range determined by breaks is included |
force.binning |
if 'TRUE', the values smaller than the minimal break will be translated to index 1, and the values that exceed the maximal break will be translated to index N-1 where N is the number of breaks. If 'FALSE' the out-of-range values will produce NaN values. |
iterator |
track expression iterator. If 'NULL' iterator is determined implicitly based on track expressions. |
band |
track expression band. If 'NULL' no band is used. |
Details
This function evaluates the track expression for all iterator intervals and translates this value into an index based on the breaks. This index is then used to address the lookup table and create with its values a new track. More than one 'expr'-'breaks' pair can be used. In that case 'lookup_table' is addressed in a multidimensional manner, i.e. 'lookup_table[i1, i2, ...]'.
The range of bins is determined by 'breaks' argument. For example: 'breaks = c(x1, x2, x3, x4)' represents three different intervals (bins): (x1, x2], (x2, x3], (x3, x4].
If 'include.lowest' is 'TRUE' the lowest value is included in the first interval, i.e. in [x1, x2].
'force.binning' parameter controls what should be done when the value of 'expr' exceeds the range determined by 'breaks'. If 'force.binning' is 'TRUE' then values smaller than the minimal break will be translated to index 1, and the values exceeding the maximal break will be translated to index 'M-1' where 'M' is the number of breaks. If 'force.binning' is 'FALSE' the out-of-range values will produce 'NaN' values.
Regardless of 'force.binning' value if the value of 'expr' is 'NaN' then the value in the track would be 'NaN' too.
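The binning rules can be mimicked in base R with 'cut'. The helper `bin_index` is hypothetical and only mirrors the description above for a single expression; as an assumption of the sketch, a value exactly equal to the minimal break is treated as below the range when 'include.lowest' is 'FALSE', and out-of-range results appear as NA rather than NaN.

```r
# Hypothetical helper mirroring the binning rules for one expression.
bin_index <- function(x, breaks, include.lowest = FALSE, force.binning = TRUE) {
  # Bins are (b1, b2], (b2, b3], ...; NaN and out-of-range values give NA here.
  idx <- cut(x, breaks, labels = FALSE, right = TRUE,
             include.lowest = include.lowest)
  if (force.binning) {
    n <- length(breaks)
    out <- is.na(idx) & !is.na(x)
    idx[which(out & x <= breaks[1])] <- 1L       # below the range -> first bin
    idx[which(out & x >= breaks[n])] <- n - 1L   # above the range -> last bin
  }
  idx
}

breaks <- c(0, 1, 2, 3)
lookup_table <- c(10, 20, 30)          # one value per bin
x <- c(-1, 0.5, 1, 2.5, 4, NaN)
lookup_table[bin_index(x, breaks)]
lookup_table[bin_index(x, breaks, force.binning = FALSE)]
```

With several expression-breaks pairs the same index computation would be done once per pair, and the indices used together to address a multi-dimensional lookup table.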
'description' is added as a track attribute.
Value
None.
See Also
glookup, gtrack.2d.create,
gtrack.create_sparse, gtrack.smooth,
gtrack.modify, gtrack.rm,
gtrack.info, gdir.create
Examples
gdb.init_examples()
## one-dimensional example
breaks1 <- seq(0.1, 0.2, length.out = 6)
gtrack.lookup(
"lookup_track", "Test track", 1:5, "dense_track",
breaks1
)
gtrack.rm("lookup_track", force = TRUE)
## two-dimensional example
t <- array(1:15, dim = c(5, 3))
breaks2 <- seq(0.31, 0.37, length.out = 4)
gtrack.lookup(
"lookup_track", "Test track", t, "dense_track",
breaks1, "2 * dense_track", breaks2
)
gtrack.rm("lookup_track", force = TRUE)
Returns a list of track names
Description
Returns a list of track names in Genomic Database.
Usage
gtrack.ls(
...,
ignore.case = FALSE,
perl = FALSE,
fixed = FALSE,
useBytes = FALSE
)
Arguments
... |
these arguments are of either form 'pattern' or 'attribute = pattern' |
ignore.case, perl, fixed, useBytes |
see 'grep' |
Details
This function returns a list of tracks whose name or track attribute value match a pattern (see 'grep'). If called without any arguments all tracks are returned.
If pattern is specified without a track attribute (i.e. in the form of 'pattern') then filtering is applied to the track names. If pattern is supplied with a track attribute (i.e. in the form of 'name = pattern') then track attribute is matched against the pattern.
Multiple patterns are applied one after another. The resulting list of tracks matches all the patterns.
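Because patterns follow 'grep' semantics, they are regular expressions rather than shell wildcards. The base R sketch below uses invented track names and invented 'created.by' values to show this, and to show how several patterns narrow the result.

```r
tracks <- c("dense_track", "sparse_track", "model_track")

# "den*" is a regular expression: "de" followed by zero or more "n".
# It therefore also matches "model_track".
grep("den*", tracks, value = TRUE)
# Anchoring the pattern restricts the match to names that start with "den".
grep("^den", tracks, value = TRUE)

# Hypothetical "created.by" attribute values, one per track.
created.by <- c(dense_track = "gtrack.create",
                sparse_track = "gtrack.create_sparse",
                model_track = "gtrack.create")

# Several patterns are combined with AND: keep tracks that match all of them.
by_name <- grepl("_track$", tracks)
by_attr <- unname(grepl("create_sparse", created.by[tracks]))
tracks[by_name & by_attr]
```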
Value
An array that contains the names of tracks that match the supplied patterns.
See Also
grep, gtrack.exists,
gtrack.create, gtrack.rm
Examples
gdb.init_examples()
# get all track names
gtrack.ls()
# get track names that match the pattern "den*"
gtrack.ls("den*")
# get track names whose "created.by" attribute match the pattern
# "create_sparse"
gtrack.ls(created.by = "create_sparse")
# get track names whose names match the pattern "den*" and whose
# "created.by" attribute match the pattern "track"
gtrack.ls("den*", created.by = "track")
Modifies track contents
Description
Modifies 'Dense' track contents.
Usage
gtrack.modify(track = NULL, expr = NULL, intervals = NULL)
Arguments
track |
track name |
expr |
track expression |
intervals |
genomic scope for which track is modified |
Details
This function modifies the contents of a 'Dense' track by the values of 'expr'. 'intervals' argument controls which portion of the track is modified. The iterator policy is set internally to the bin size of the track.
Value
None.
Examples
gdb.init_examples()
intervs <- gintervals(1, 300, 800)
gextract("dense_track", intervs)
gtrack.modify("dense_track", "dense_track * 2", intervs)
gextract("dense_track", intervs)
gtrack.modify("dense_track", "dense_track / 2", intervs)
Returns the path on disk of a track
Description
Returns the path on disk of a track.
Usage
gtrack.path(track = NULL)
Arguments
track |
track name or a vector of track names |
Details
This function returns the actual file system path where a track is stored. The function works with a single track name or a vector of track names.
Value
A character vector containing the full paths to the tracks on disk.
See Also
gtrack.exists, gtrack.ls,
gintervals.path
Examples
gdb.init_examples()
gtrack.path("dense_track")
gtrack.path(c("dense_track", "sparse_track"))
Deletes a track
Description
Deletes a track.
Usage
gtrack.rm(track = NULL, force = FALSE)
Arguments
track |
track name |
force |
if 'TRUE', suppresses user confirmation of a named track removal |
Details
This function deletes a track from the Genomic Database. By default 'gtrack.rm' requires the user to interactively confirm the deletion. Set 'force' to 'TRUE' to suppress the user prompt.
Value
None.
See Also
gtrack.exists, gtrack.ls,
gtrack.create, gtrack.2d.create,
gtrack.create_sparse, gtrack.smooth
Examples
gdb.init_examples()
gtrack.create("new_track", "Test track", "2 * dense_track")
gtrack.exists("new_track")
gtrack.rm("new_track", force = TRUE)
gtrack.exists("new_track")
Creates a new track from smoothed values of track expression
Description
Creates a new track from smoothed values of track expression.
Usage
gtrack.smooth(
track = NULL,
description = NULL,
expr = NULL,
winsize = NULL,
weight_thr = 0,
smooth_nans = FALSE,
alg = "LINEAR_RAMP",
iterator = NULL
)
Arguments
track |
track name |
description |
a character string description |
expr |
track expression |
winsize |
size of smoothing window |
weight_thr |
smoothing weight threshold |
smooth_nans |
if 'FALSE' track value is always set to 'NaN' if central window value is 'NaN', otherwise it is calculated from the rest of non 'NaN' values |
alg |
smoothing algorithm - "MEAN" or "LINEAR_RAMP" |
iterator |
track expression iterator of 'Fixed bin' type |
Details
This function creates a new 'Dense' track named 'track'. The values of the track are results of smoothing the values of 'expr'.
Each track value at coordinate 'C' is determined by smoothing non 'NaN' values of 'expr' over the window around 'C'. The window size is controlled by 'winsize' and is given in coordinate units (not in number of bins), defining the total regions to be considered when smoothing (on both sides of the central point). Two different algorithms can be used for smoothing:
"MEAN" - an arithmetic average.
"LINEAR_RAMP" - a weighted arithmetic average, where the weights linearly decrease as the distance from the center of the window increases.
'weight_thr' determines the function behavior when some of the values in the window are missing or 'NaN' (missing values may occur at the edges of each chromosome when the window covers an area beyond chromosome boundaries). 'weight_thr' sets the weight sum threshold below which smoothing algorithm returns 'NaN' rather than a smoothing value based on non 'NaN' values in the window.
'smooth_nans' controls what would be the smoothed value if the central value in the window is 'NaN'. If 'smooth_nans' is 'FALSE' then the smoothed value is set to 'NaN' regardless of 'weight_thr' parameter. Otherwise it is calculated normally.
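The interplay of the two algorithms with 'weight_thr' and 'smooth_nans' can be sketched on a plain numeric vector of bin values. Everything specific here is an assumption of the sketch: the helper `smooth_bins` is hypothetical, the window is given as a number of bins on each side ('half') rather than in coordinate units, the ramp weights are `half + 1 - |offset|`, and 'weight_thr' is compared with the raw sum of weights. The package's exact weighting may differ.

```r
# Hypothetical helper; bin-based window and weights are assumptions of this sketch.
smooth_bins <- function(x, half, alg = "LINEAR_RAMP", weight_thr = 0,
                        smooth_nans = FALSE) {
  offsets <- -half:half
  # "MEAN": equal weights. "LINEAR_RAMP": weights fall off linearly from the center.
  w <- if (alg == "MEAN") rep(1, length(offsets)) else half + 1 - abs(offsets)
  vapply(seq_along(x), function(i) {
    # A NaN central value stays NaN unless 'smooth_nans' is TRUE.
    if (!smooth_nans && is.nan(x[i])) return(NaN)
    j <- i + offsets
    ok <- j >= 1 & j <= length(x)     # bins beyond the chromosome are missing
    vals <- x[j[ok]]
    wts <- w[ok]
    use <- !is.nan(vals)
    # Too little weight left in the window: return NaN.
    if (!any(use) || sum(wts[use]) < weight_thr) return(NaN)
    sum(vals[use] * wts[use]) / sum(wts[use])
  }, numeric(1))
}

x <- c(1, 2, NaN, 4, 5)
smooth_bins(x, half = 1, alg = "MEAN")
smooth_bins(x, half = 1, alg = "MEAN", smooth_nans = TRUE)
smooth_bins(x, half = 1, alg = "LINEAR_RAMP", weight_thr = 4)
```

Raising 'weight_thr' makes the result NaN wherever the window has lost too much weight, which happens near NaN values and at the ends of the vector.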
'description' is added as a track attribute.
Iterator policy must be of "fixed bin" type.
Value
None.
See Also
gtrack.create, gtrack.2d.create,
gtrack.create_sparse, gtrack.modify,
gtrack.rm, gtrack.info,
gdir.create
Examples
gdb.init_examples()
gtrack.smooth("smoothed_track", "Test track", "dense_track", 500)
gextract("dense_track", "smoothed_track", gintervals(1, 0, 1000))
gtrack.rm("smoothed_track", force = TRUE)
Returns value of a track variable
Description
Returns value of a track variable.
Usage
gtrack.var.get(track = NULL, var = NULL)
Arguments
track |
track name |
var |
track variable name |
Details
This function returns the value of a track variable. If the variable does not exist an error is reported.
Value
Track variable value.
See Also
gtrack.var.set, gtrack.var.ls,
gtrack.var.rm
Examples
gdb.init_examples()
gtrack.var.set("sparse_track", "test_var", 1:10)
gtrack.var.get("sparse_track", "test_var")
gtrack.var.rm("sparse_track", "test_var")
Returns a list of track variables for a track
Description
Returns a list of track variables for a track.
Usage
gtrack.var.ls(
track = NULL,
pattern = "",
ignore.case = FALSE,
perl = FALSE,
fixed = FALSE,
useBytes = FALSE
)
Arguments
track |
track name |
pattern, ignore.case, perl, fixed, useBytes |
see 'grep' |
Details
This function returns a list of track variables of a track that match the pattern (see 'grep'). If called without a pattern, all track variables of the track are returned.
Value
An array that contains the names of track variables.
See Also
grep, gtrack.var.get,
gtrack.var.set, gtrack.var.rm
Examples
gdb.init_examples()
gtrack.var.ls("sparse_track")
gtrack.var.set("sparse_track", "test_var1", 1:10)
gtrack.var.set("sparse_track", "test_var2", "v")
gtrack.var.ls("sparse_track")
gtrack.var.ls("sparse_track", pattern = "2")
gtrack.var.rm("sparse_track", "test_var1")
gtrack.var.rm("sparse_track", "test_var2")
Deletes a track variable
Description
Deletes a track variable.
Usage
gtrack.var.rm(track = NULL, var = NULL)
Arguments
track |
track name |
var |
track variable name |
Details
This function deletes a track variable.
Value
None.
See Also
gtrack.var.get, gtrack.var.set,
gtrack.var.ls
Examples
gdb.init_examples()
gtrack.var.set("sparse_track", "test_var1", 1:10)
gtrack.var.set("sparse_track", "test_var2", "v")
gtrack.var.ls("sparse_track")
gtrack.var.rm("sparse_track", "test_var1")
gtrack.var.rm("sparse_track", "test_var2")
gtrack.var.ls("sparse_track")
Assigns value to a track variable
Description
Assigns value to a track variable.
Usage
gtrack.var.set(track = NULL, var = NULL, value = NULL)
Arguments
track |
track name |
var |
track variable name |
value |
value |
Details
This function creates a track variable and assigns 'value' to it. If the track variable already exists its value is overwritten.
Value
None.
See Also
gtrack.var.get, gtrack.var.ls,
gtrack.var.rm
Examples
gdb.init_examples()
gtrack.var.set("sparse_track", "test_var", 1:10)
gtrack.var.get("sparse_track", "test_var")
gtrack.var.rm("sparse_track", "test_var")
Defines rules for a single value calculation of a virtual 'Array' track
Description
Defines how a single value within an interval is obtained for a virtual track based on an 'Array' track.
Usage
gvtrack.array.slice(vtrack = NULL, slice = NULL, func = "avg", params = NULL)
Arguments
vtrack |
virtual track name |
slice |
a vector of column names or column indices or 'NULL' |
func, params |
see below |
Details
A track (regular or virtual) used in a track expression is expected to return one value for each track interval. 'Array' tracks store multiple values per interval (one for each 'column'), so when such a track is used in a track expression one must define how a single value is derived from the several column values.
By default, if an 'Array' track is used in a track expression, its interval value is the average of all column values that are not NaN. 'gvtrack.array.slice' allows selecting specific columns and specifying the function applied to their values.
The 'slice' parameter selects the columns. Columns can be indicated by their names or their indices. If 'slice' is 'NULL' the non-NaN values of all track columns are used.
'func' parameter determines the function applied to the columns' values. Use the following table for a reference of all valid functions and parameters combinations:
func = "avg", params = NULL
Average of columns' values.
func = "max", params = NULL
Maximum of columns' values.
func = "min", params = NULL
Minimum of columns' values.
func = "stdev", params = NULL
Unbiased standard deviation of columns' values.
func = "sum", params = NULL
Sum of columns' values.
func = "quantile", params = [Percentile in the range of [0, 1]]
Quantile of columns' values.
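To make the table above concrete, here is a minimal base R sketch that reduces the column values of a single interval to one value. The function `slice_value` and the sample column names are hypothetical, and the quantile definition used (base R's default) is an assumption; the package may compute quantiles differently.

```r
# Illustrative sketch only: reduce the columns of one 'Array' track interval
# to a single value. 'row' is a named numeric vector (one entry per column).
slice_value <- function(row, slice = NULL, func = "avg", params = NULL) {
    if (!is.null(slice)) row <- row[slice]   # select by names or by indices
    v <- row[!is.nan(row)]                   # NaN columns are ignored
    if (length(v) == 0) return(NaN)
    switch(func,
        avg = mean(v),
        max = max(v),
        min = min(v),
        stdev = sd(v),                       # unbiased (n - 1) estimator
        sum = sum(v),
        # base R's default quantile definition is assumed here
        quantile = unname(quantile(v, probs = params)),
        stop("unknown func: ", func)
    )
}

row <- c(col1 = 1, col2 = 5, col3 = NaN, col4 = 3)
slice_value(row)                              # average of 1, 5 and 3
slice_value(row, c("col2", "col4"), "max")    # maximum of 5 and 3
slice_value(row, 1:4, "quantile", 0.5)        # median of 1, 3 and 5
```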
Value
None.
See Also
gvtrack.create,
gtrack.array.get_colnames, gtrack.array.extract
Examples
gdb.init_examples()
gvtrack.create("vtrack1", "array_track")
gvtrack.array.slice("vtrack1", c("col2", "col4"), "max")
gextract("vtrack1", gintervals(1, 0, 1000))
Creates a new virtual track
Description
Creates a new virtual track.
Usage
gvtrack.create(
vtrack = NULL,
src = NULL,
func = NULL,
params = NULL,
dim = NULL,
sshift = NULL,
eshift = NULL,
filter = NULL,
...
)
Arguments
vtrack |
virtual track name |
src |
source (track/intervals). NULL for sequence-based functions (PWM, k-mer, masked). For value-based
tracks, provide a data frame with columns chrom, start, end, and one numeric value column |
func |
function name (see below) |
params |
function parameters (see below) |
dim |
use 'NULL' or '0' for 1D iterators. '1' converts 2D iterator to (chrom1, start1, end1), '2' converts 2D iterator to (chrom2, start2, end2) |
sshift |
shift of 'start' coordinate |
eshift |
shift of 'end' coordinate |
filter |
genomic mask to apply (intervals to exclude, a list of such interval sets, or 'NULL' for no filter); see gvtrack.filter |
... |
additional PWM parameters |
Details
This function creates a new virtual track named 'vtrack' with the given source, function and parameters. 'src' can be either a track, intervals (1D or 2D), or a data frame with intervals and a numeric value column (value-based track). The tables below summarize the supported combinations.
Value-based tracks
Value-based tracks are data frames containing genomic intervals with associated
numeric values. They function as in-memory sparse tracks without requiring
track creation in the database. To create a value-based track, provide a data
frame with columns chrom, start, end, and one numeric
value column (any name is acceptable). Value-based tracks support all track-based
summarizer functions (e.g., avg, min, max, sum,
stddev, quantile, nearest, exists, size,
first, last, sample, and position functions), but do not
support overlapping intervals. They behave like sparse tracks in aggregation:
values are aggregated using count-based averaging (each interval contributes equally
regardless of length), not coverage-based averaging.
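The difference between count-based and coverage-based averaging can be seen in a small base R calculation. This is only a sketch with made-up intervals and a single iterator interval; the variable names are hypothetical and nothing here calls the package.

```r
# Illustrative sketch only: two ways to average interval values inside one
# iterator interval [it_start, it_end).
vals <- data.frame(
    start = c(100, 300),
    end   = c(200, 700),
    score = c(10, 20)
)
it_start <- 0
it_end <- 1000

# Length of the overlap between each source interval and the iterator interval
ovl <- pmax(0, pmin(vals$end, it_end) - pmax(vals$start, it_start))
hit <- ovl > 0

# Count-based: every overlapping interval contributes equally
count_avg <- mean(vals$score[hit])

# Coverage-based: intervals are weighted by the number of covered bases
coverage_avg <- sum(vals$score[hit] * ovl[hit]) / sum(ovl[hit])

count_avg      # (10 + 20) / 2 = 15
coverage_avg   # (10 * 100 + 20 * 400) / 500 = 18
```

Per the description above, a value-based track used with "avg" corresponds to the first number, not the second.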
Track-based summarizers
| Source | func | params | Description |
| Track | avg | NULL | Average track value in the iterator interval. |
| Track (1D) | exists | vals (optional) | Returns 1 if any value exists (or specific vals if provided), 0 otherwise. |
| Track (1D) | first | NULL | First value in the iterator interval. |
| Track (1D) | last | NULL | Last value in the iterator interval. |
| Track | max | NULL | Maximum track value in the iterator interval. |
| Track | min | NULL | Minimum track value in the iterator interval. |
| Dense / Sparse / Array track | nearest | NULL | Average value inside the iterator; for sparse tracks with no samples in the interval, falls back to the closest sample outside the interval (by genomic distance). |
| Track (1D) | sample | NULL | Uniformly sampled source value from the iterator interval. |
| Track (1D) | size | NULL | Number of non-NaN values in the iterator interval. |
| Dense / Sparse / Array track | stddev | NULL | Unbiased standard deviation of values in the iterator interval. |
| Dense / Sparse / Array track | sum | NULL | Sum of values in the iterator interval. |
| Dense / Sparse / Array track | quantile | Percentile in [0, 1] | Quantile of values in the iterator interval. |
| Dense track | global.percentile | NULL | Percentile of the interval average relative to the full-track distribution. |
| Dense track | global.percentile.max | NULL | Percentile of the interval maximum relative to the full-track distribution. |
| Dense track | global.percentile.min | NULL | Percentile of the interval minimum relative to the full-track distribution. |
Track position summarizers
| Source | func | params | Description |
| Track (1D) | first.pos.abs | NULL | Absolute genomic coordinate of the first value. |
| Track (1D) | first.pos.relative | NULL | Zero-based position (relative to interval start) of the first value. |
| Track (1D) | last.pos.abs | NULL | Absolute genomic coordinate of the last value. |
| Track (1D) | last.pos.relative | NULL | Zero-based position (relative to interval start) of the last value. |
| Track (1D) | max.pos.abs | NULL | Absolute genomic coordinate of the maximum value inside the iterator interval. |
| Track (1D) | max.pos.relative | NULL | Zero-based position (relative to interval start) of the maximum value. |
| Track (1D) | min.pos.abs | NULL | Absolute genomic coordinate of the minimum value inside the iterator interval. |
| Track (1D) | min.pos.relative | NULL | Zero-based position (relative to interval start) of the minimum value. |
| Track (1D) | sample.pos.abs | NULL | Absolute genomic coordinate of a uniformly sampled value. |
| Track (1D) | sample.pos.relative | NULL | Zero-based position (relative to interval start) of a uniformly sampled value. |
For max.pos.relative, min.pos.relative, first.pos.relative, last.pos.relative, sample.pos.relative,
iterator modifiers (including sshift /
eshift and 1D projections generated via gvtrack.iterator) are
applied before the position is reported. In other words, the returned
coordinate is always 0-based and measured from the start of the iterator
interval after all modifier adjustments.
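A short arithmetic sketch of this rule, using hypothetical numbers rather than package output:

```r
# Illustrative sketch only: how a relative position relates to an absolute one.
iter_start <- 1000     # iterator interval [1000, 1100)
sshift <- -50          # modifier set via gvtrack.iterator
max_pos_abs <- 1020    # hypothetical result of "max.pos.abs"

# The reference point is the interval start after the shift is applied
shifted_start <- iter_start + sshift
max_pos_relative <- max_pos_abs - shifted_start
max_pos_relative       # 70, not 20
```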
Interval-based summarizers
| Source | func | params | Description |
| 1D intervals | distance | Minimal distance from center (default 0) | Signed distance using normalized formula when inside intervals, distance to edge when outside; see notes below for exact formula. |
| 1D intervals | distance.center | NULL | Distance from iterator center to the closest interval center, NA if outside all intervals. |
| 1D intervals | distance.edge | NULL | Edge-to-edge distance from iterator interval to closest source interval (like gintervals.neighbors); see notes below for strand handling. |
| 1D intervals | coverage | NULL | Fraction of iterator length covered by source intervals (after unifying overlaps). |
| 1D intervals | neighbor.count | Max distance (>= 0) | Number of source intervals whose edge-to-edge distance from the iterator interval is within params (no unification). |
2D track summarizers
| Source | func | params | Description |
| 2D track | area | NULL | Area covered by intersections of track rectangles with the iterator interval. |
| 2D track | weighted.sum | NULL | Weighted sum of values where each weight equals the intersection area. |
Motif (PWM) summarizers
| Source | func | Key params | Description |
| NULL (sequence) | pwm | pssm, bidirect, prior, extend, spat_* | Log-sum-exp score of motif likelihoods across all anchors inside the iterator interval. |
| NULL (sequence) | pwm.max | pssm, bidirect, prior, extend, spat_* | Maximum log-likelihood score among all anchors (per-position union across strands). |
| NULL (sequence) | pwm.max.pos | pssm, bidirect, prior, extend, spat_* | 1-based position of the best-scoring anchor (signed by strand when bidirect = TRUE); coordinates are always relative to the iterator interval after any gvtrack.iterator() shifts/extensions. |
| NULL (sequence) | pwm.count | pssm, score.thresh, bidirect, prior, extend, strand, spat_* | Count of anchors whose score exceeds score.thresh (per-position union). |
K-mer summarizers
| Source | func | Key params | Description |
| NULL (sequence) | kmer.count | kmer, extend, strand | Number of k-mer occurrences whose anchor lies inside the iterator interval. |
| NULL (sequence) | kmer.frac | kmer, extend, strand | Fraction of possible anchors within the interval that match the k-mer. |
Masked sequence summarizers
| Source | func | Key params | Description |
| NULL (sequence) | masked.count | NULL | Number of masked (lowercase) base pairs in the iterator interval. |
| NULL (sequence) | masked.frac | NULL | Fraction of base pairs in the iterator interval that are masked (lowercase). |
The sections below provide additional notes for motif, interval, k-mer, and masked sequence functions.
Motif (PWM) notes
- pssm: Position-specific scoring matrix (matrix or data frame) with columns A, C, G, T; extra columns are ignored.
- bidirect: When TRUE (default), both strands are scanned and combined per genomic start (per-position union). The strand argument is ignored. When FALSE, only the strand specified by strand is scanned.
- prior: Pseudocount added to frequencies (default 0.01). Set to 0 to disable.
- extend: Extends the fetched sequence so boundary-anchored motifs retain full context (default TRUE). The END coordinate is padded by motif_length - 1 for all strand modes; anchors must still start inside the iterator.
- Neutral characters (N, n, *) contribute the mean log-probability of the corresponding PSSM column on both strands.
- strand: Used only when bidirect = FALSE; 1 scans the forward strand, -1 scans the reverse strand. For pwm.max.pos, strand = -1 reports the hit position at the end of the match (still relative to the forward orientation).
- score.thresh: Threshold for pwm.count. Anchors with log-likelihood >= score.thresh are counted; only one count per genomic start.
- Spatial weighting (spat_factor, spat_bin, spat_min, spat_max): optional position-dependent weights applied in log-space. Provide a positive numeric vector spat_factor; spat_bin (integer > 0) defines bin width; spat_min/spat_max restrict the scanning window.
- pwm.max.pos: Positions are reported 1-based relative to the final scan window (after iterator shifts and spatial trimming). Ties resolve to the most 5' anchor; the forward strand wins ties at the same coordinate. Values are signed when bidirect = TRUE (positive for forward, negative for reverse).
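The relationship between the "pwm", "pwm.max" and "pwm.max.pos" summaries can be sketched in base R for a single short sequence. This is a simplified, forward-strand-only illustration, not the package's scoring code: the helper `pwm_scores` is hypothetical, the way the pseudocount is applied (added to every entry, then each row renormalized) is an assumption, and neutral characters, sequence extension and spatial weights are not handled.

```r
# Illustrative sketch only: forward-strand motif scores for one sequence.
pwm_scores <- function(seq, pssm, prior = 0.01) {
    # Assumed handling of the pseudocount: add it, then renormalize each row
    p <- pssm + prior
    p <- p / rowSums(p)
    bases <- strsplit(toupper(seq), "")[[1]]
    k <- nrow(p)
    n_anchor <- length(bases) - k + 1
    vapply(seq_len(n_anchor), function(a) {
        # (motif position, base column) pairs for the window starting at anchor a
        idx <- cbind(seq_len(k), match(bases[a:(a + k - 1)], colnames(p)))
        sum(log(p[idx]))    # log-likelihood of the motif at anchor a
    }, numeric(1))
}

pssm <- matrix(c(
    0.7, 0.1, 0.1, 0.1,
    0.1, 0.7, 0.1, 0.1
), ncol = 4, byrow = TRUE, dimnames = list(NULL, c("A", "C", "G", "T")))

s <- pwm_scores("GGACGG", pssm)
pwm_max <- max(s)                                   # analogue of "pwm.max"
pwm_max_pos <- which.max(s)                         # 1-based; first maximum wins ties
pwm_total <- pwm_max + log(sum(exp(s - pwm_max)))   # stable log-sum-exp, analogue of "pwm"
```

Because the log-sum-exp adds the contributions of all anchors, `pwm_total` is never smaller than `pwm_max`.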
Spatial weighting enables position-dependent weighting for modeling positional biases. Bins are 0-indexed from the
scan start. When using gvtrack.iterator() shifts (e.g., sshift = -50, eshift = 50), bins index from
the expanded scan window start, not the original interval. Both strands use the same bin at each
genomic position. Positions beyond the last bin reuse the final bin's weight. If the window size is
not divisible by spat_bin, the last bin is shorter (e.g., scanning 500 bp with 40 bp bins yields
bins 0-11 of 40 bp plus bin 12 of 20 bp). Use spat_min and spat_max to restrict scanning to a
range divisible by spat_bin if needed.
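The binning rules in this paragraph can be checked with a few lines of base R. The helper `spat_bin_index` is hypothetical, and treating "applied in log-space" as adding log(weight) to a log-likelihood score is an assumption of this sketch.

```r
# Illustrative sketch only: map 0-based offsets from the scan start to bins.
spat_bin_index <- function(offset, spat_bin, n_bins) {
    # Positions beyond the last bin reuse the final bin
    pmin(offset %/% spat_bin, n_bins - 1)
}

# A 500 bp scan window with 40 bp bins: offsets 0..499
bins <- spat_bin_index(0:499, spat_bin = 40, n_bins = 13)
table(bins)    # bins 0-11 hold 40 positions each, bin 12 holds 20

# With only 5 weights, offsets from 200 onwards reuse the weight of bin 4
spat_factor <- c(0.5, 1.0, 2.0, 1.0, 0.5)
w <- spat_factor[spat_bin_index(c(0, 85, 199, 260), 40, length(spat_factor)) + 1]

# Assumed form of the log-space weighting: score + log(w)
log(w)
```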
PWM parameters can be supplied either as a single list (params) or via named arguments (see examples).
Interval distance notes
distance: Given the center 'C' of the current iterator interval, returns 'DC * X/2' where 'DC' is the normalized distance to the center of the interval that contains 'C', and 'X' is the value of the parameter (default: 0). If no interval contains 'C', the result is 'D + X/2' where 'D' is the distance between 'C' and the edge of the closest interval.
distance.center: Given the center 'C' of the current iterator interval, returns NaN if 'C' is outside of all intervals, otherwise returns the distance between 'C' and the center of the closest interval.
distance.edge: Computes edge-to-edge distance from the iterator interval to the closest source interval, using the same calculation as gintervals.neighbors. Returns 0 for overlapping intervals. Distance sign depends on the strand column of source intervals; returns unsigned (absolute) distance if no strand column exists. Returns NA if no source intervals exist on the current chromosome.
For distance and distance.center, distance can be positive or negative depending on the position of the coordinate relative to the interval and the strand (-1 or 1) of the interval. Distance is always positive if strand = 0 or if the strand column is missing. The result is NA if no intervals exist for the current chromosome.
Difference between distance functions: The distance function measures from the center of the iterator interval (a single coordinate point) to the closest edge of source intervals when outside, or returns a normalized distance within the interval when inside. The distance.center function measures from the center of the iterator interval to the center of source intervals. The distance.edge function measures edge-to-edge distance between intervals, exactly like gintervals.neighbors. Use distance.edge when you need the same distance computation as gintervals.neighbors within a virtual track context.
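The three distance functions can be contrasted with a small base R sketch. It ignores strand (so every result is unsigned), and it rests on assumptions that the text does not spell out: "normalized distance" is taken to be the distance to the interval center divided by half the interval length, intervals are treated as half-open, and the gap between two non-overlapping intervals is measured without any off-by-one adjustment. The function names are hypothetical.

```r
# Illustrative sketch only, ignoring strand (all results unsigned).
# Intervals are half-open [start, end).
# Assumption: "normalized distance" = |C - center| / (half the interval length).
dist_fn <- function(C, starts, ends, X = 0) {
    inside <- which(C >= starts & C < ends)
    if (length(inside) > 0) {
        i <- inside[1]
        center <- (starts[i] + ends[i]) / 2
        DC <- abs(C - center) / ((ends[i] - starts[i]) / 2)
        return(DC * X / 2)
    }
    D <- min(abs(c(starts, ends) - C))   # distance to the closest edge
    D + X / 2
}

dist_center_fn <- function(C, starts, ends) {
    inside <- C >= starts & C < ends
    if (!any(inside)) return(NaN)
    min(abs(C - (starts[inside] + ends[inside]) / 2))
}

dist_edge_fn <- function(s, e, starts, ends) {
    # Gap between [s, e) and each source interval; 0 when they overlap
    min(pmax(0, starts - e, s - ends))
}

starts <- c(100, 500)
ends <- c(200, 600)
dist_fn(175, starts, ends, X = 10)      # inside: 0.5 * 10 / 2 = 2.5
dist_fn(250, starts, ends, X = 10)      # outside: 50 + 10 / 2 = 55
dist_center_fn(175, starts, ends)       # 25
dist_edge_fn(220, 260, starts, ends)    # 20
```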
K-mer notes
- kmer: DNA sequence (case-insensitive) to count.
- extend: If TRUE (default), counts kmers whose anchor lies in the interval even if the kmer extends beyond it; when FALSE, only kmers fully contained in the interval are considered.
- strand: 1 counts forward-strand occurrences, -1 counts reverse-strand occurrences, 0 counts both strands (default). For palindromic kmers, consider using 1 or -1 to avoid double counting.
K-mer parameters can be supplied as a list or via named arguments (see examples).
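The effect of 'extend' on which anchors are considered can be illustrated on a plain string. This base R sketch covers the forward strand only; `kmer_count` is a hypothetical helper, and using the number of considered anchors as the denominator of the fraction is an assumption based on the description of "kmer.frac".

```r
# Illustrative sketch only: forward-strand k-mer anchors in a 0-based,
# half-open interval [start, end) of a sequence string.
kmer_count <- function(seq, kmer, start, end, extend = TRUE) {
    k <- nchar(kmer)
    # With extend = FALSE the whole k-mer has to fit inside the interval
    last_anchor <- if (extend) end - 1 else end - k
    last_anchor <- min(last_anchor, nchar(seq) - k)
    if (last_anchor < start) return(c(count = 0, anchors = 0))
    anchors <- start:last_anchor
    words <- substring(seq, anchors + 1, anchors + k)   # substring() is 1-based
    c(count = sum(toupper(words) == toupper(kmer)), anchors = length(anchors))
}

seq <- "ACGCGTTCGA"
full <- kmer_count(seq, "CG", start = 0, end = 4)                     # anchors 0..3
strict <- kmer_count(seq, "CG", start = 0, end = 4, extend = FALSE)   # anchors 0..2
full[["count"]] / full[["anchors"]]    # analogue of "kmer.frac"
```

With `extend = TRUE` the "CG" anchored at position 3 is counted even though its second base lies outside the interval; with `extend = FALSE` it is not.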
Modify iterator behavior with 'gvtrack.iterator' or 'gvtrack.iterator.2d'.
Value
None.
See Also
gvtrack.info, gvtrack.iterator,
gvtrack.iterator.2d, gvtrack.array.slice,
gvtrack.ls, gvtrack.rm,
gvtrack.filter
Examples
gdb.init_examples()
gvtrack.create("vtrack1", "dense_track", "max")
gvtrack.create("vtrack2", "dense_track", "quantile", 0.5)
gextract("dense_track", "vtrack1", "vtrack2",
gintervals(1, 0, 10000),
iterator = 1000
)
gvtrack.create("vtrack3", "dense_track", "global.percentile")
gvtrack.create("vtrack4", "annotations", "distance")
gdist(
"vtrack3", seq(0, 1, l = 10), "vtrack4",
seq(-500, 500, 200)
)
gvtrack.create("cov", "annotations", "coverage")
gextract("cov", gintervals(1, 0, 1000), iterator = 100)
pssm <- matrix(
c(
0.7, 0.1, 0.1, 0.1, # Example PSSM
0.1, 0.7, 0.1, 0.1,
0.1, 0.1, 0.7, 0.1,
0.1, 0.1, 0.7, 0.1,
0.1, 0.1, 0.7, 0.1,
0.1, 0.1, 0.7, 0.1
),
ncol = 4, byrow = TRUE
)
colnames(pssm) <- c("A", "C", "G", "T")
gvtrack.create(
"motif_score", NULL, "pwm",
list(pssm = pssm, bidirect = TRUE, prior = 0.01)
)
gvtrack.create("max_motif_score", NULL, "pwm.max",
pssm = pssm, bidirect = TRUE, prior = 0.01
)
gvtrack.create("max_motif_pos", NULL, "pwm.max.pos",
pssm = pssm
)
gextract(
c(
"dense_track", "motif_score", "max_motif_score",
"max_motif_pos"
),
gintervals(1, 0, 10000),
iterator = 500
)
# Kmer counting examples
gvtrack.create("cg_count", NULL, "kmer.count", kmer = "CG", strand = 1)
gvtrack.create("cg_frac", NULL, "kmer.frac", kmer = "CG", strand = 1)
gextract(c("cg_count", "cg_frac"), gintervals(1, 0, 10000), iterator = 1000)
gvtrack.create("at_pos", NULL, "kmer.count", kmer = "AT", strand = 1)
gvtrack.create("at_neg", NULL, "kmer.count", kmer = "AT", strand = -1)
gvtrack.create("at_both", NULL, "kmer.count", kmer = "AT", strand = 0)
gextract(c("at_pos", "at_neg", "at_both"), gintervals(1, 0, 10000), iterator = 1000)
# GC content
gvtrack.create("g_frac", NULL, "kmer.frac", kmer = "G")
gvtrack.create("c_frac", NULL, "kmer.frac", kmer = "C")
gextract("g_frac + c_frac", gintervals(1, 0, 10000),
iterator = 1000,
colnames = "gc_content"
)
# Masked base pair counting
gvtrack.create("masked_count", NULL, "masked.count")
gvtrack.create("masked_frac", NULL, "masked.frac")
gextract(c("masked_count", "masked_frac"), gintervals(1, 0, 10000), iterator = 1000)
# Combined with GC content (unmasked regions only)
gvtrack.create("gc", NULL, "kmer.frac", kmer = "G")
gextract("gc * (1 - masked_frac)",
gintervals(1, 0, 10000),
iterator = 1000,
colnames = "gc_unmasked"
)
# Value-based track examples
# Create a data frame with intervals and numeric values
intervals_with_values <- data.frame(
chrom = "chr1",
start = c(100, 300, 500),
end = c(200, 400, 600),
score = c(10, 20, 30)
)
# Use as value-based sparse track (functions like sparse track)
gvtrack.create("value_track", intervals_with_values, "avg")
gvtrack.create("value_track_max", intervals_with_values, "max")
gextract(c("value_track", "value_track_max"),
gintervals(1, 0, 10000),
iterator = 1000
)
# Spatial PWM examples
# Create a PWM with higher weight in the center of intervals
pssm <- matrix(
c(
0.7, 0.1, 0.1, 0.1,
0.1, 0.7, 0.1, 0.1,
0.1, 0.1, 0.7, 0.1,
0.1, 0.1, 0.1, 0.7
),
ncol = 4, byrow = TRUE
)
colnames(pssm) <- c("A", "C", "G", "T")
# Spatial factors: low weight at edges, high in center
# For 200bp intervals with 40bp bins: bins 0, 40, 80, 120, 160
spatial_weights <- c(0.5, 1.0, 2.0, 1.0, 0.5)
gvtrack.create(
"spatial_pwm", NULL, "pwm",
list(
pssm = pssm,
bidirect = TRUE,
spat_factor = spatial_weights,
spat_bin = 40L
)
)
# Compare with non-spatial PWM
gvtrack.create(
"regular_pwm", NULL, "pwm",
list(pssm = pssm, bidirect = TRUE)
)
gextract(c("spatial_pwm", "regular_pwm"),
gintervals(1, 0, 10000),
iterator = 200
)
# Using spatial parameters with iterator shifts
gvtrack.create(
"spatial_extended", NULL, "pwm.max",
pssm = pssm,
spat_factor = c(0.5, 1.0, 2.0, 2.5, 2.0, 1.0, 0.5),
spat_bin = 40L
)
# Scan window will be 280bp (100bp + 2*90bp)
gvtrack.iterator("spatial_extended", sshift = -90, eshift = 90)
gextract("spatial_extended", gintervals(1, 0, 10000), iterator = 100)
# Using spat_min/spat_max to restrict scanning to a window
# For 500bp intervals, scan only positions 30-470 (440bp window)
gvtrack.create(
"window_pwm", NULL, "pwm",
pssm = pssm,
bidirect = TRUE,
spat_min = 30, # 1-based position
spat_max = 470 # 1-based position
)
gextract("window_pwm", gintervals(1, 0, 10000), iterator = 500)
# Combining spatial weighting with window restriction
# Scan positions 50-450 with spatial weights favoring the center
gvtrack.create(
"window_spatial_pwm", NULL, "pwm",
pssm = pssm,
bidirect = TRUE,
spat_factor = c(0.5, 1.0, 2.0, 2.5, 2.0, 1.0, 0.5, 1.0, 0.5, 0.5),
spat_bin = 40L,
spat_min = 50,
spat_max = 450
)
gextract("window_spatial_pwm", gintervals(1, 0, 10000), iterator = 500)
Attach or clear a genomic mask filter on a virtual track
Description
Attaches or clears a genomic mask filter on a virtual track. When a filter is attached, the virtual track function is evaluated only over the unmasked regions (i.e., regions not covered by the filter intervals).
Usage
gvtrack.filter(vtrack = NULL, filter = NULL)
Arguments
vtrack |
virtual track name |
filter |
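genomic mask to apply. Can be intervals to exclude, a list of interval sets (combined automatically), or 'NULL' to clear the filter (see Examples) |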
Details
The filter defines regions to exclude from virtual track evaluation.
The virtual track function will be evaluated only on the complement of the filter.
Once a filter is attached to a virtual track, it applies to all subsequent extractions
of that virtual track until explicitly cleared with filter = NULL.
Order of Operations:
Filters are applied after iterator modifiers (sshift/eshift/dim). The order is:
1. Apply iterator modifiers (gvtrack.iterator with sshift/eshift)
2. Subtract mask from the modified intervals
3. Evaluate virtual track function over unmasked regions
Semantics by function type:
- Aggregations (avg/sum/min/max/stddev/quantile): Length-weighted over unmasked regions
- coverage: Returns (covered length in unmasked region) / (total unmasked length)
- distance/distance.center: Unaffected by mask (pure geometry)
- PWM/kmer: Masked bases act as hard boundaries; matches cannot span masked regions. Important: When extend=TRUE (the default), motifs at the boundaries of unmasked segments can use bases from the adjacent masked regions to complete the motif scoring. For example, if a 4bp motif starts at position 1998 in an unmasked region that ends at 2000, and positions 2000-2002 are masked, the motif will still be scored using the masked bases. In other words, match start positions must be in unmasked regions, but the motif sequence itself can extend into masked regions when extend=TRUE. Set extend=FALSE to prevent any use of masked bases in scoring.
Completely Masked Intervals:
If an entire iterator interval is masked, the function returns NA (not 0).
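The aggregation semantics above can be sketched in base R on a vector of per-base values. This is an illustration under simplifying assumptions (one value per base, 0-based half-open coordinates, a single iterator interval); `masked_avg` is a hypothetical helper and does not reflect how the package stores or scans tracks.

```r
# Illustrative sketch only: evaluate "avg" over the unmasked part of one
# iterator interval, using per-base values (half-open, 0-based coordinates).
masked_avg <- function(values, it_start, it_end, mask_starts, mask_ends) {
    pos <- it_start:(it_end - 1)
    masked <- rep(FALSE, length(pos))
    for (i in seq_along(mask_starts)) {
        masked <- masked | (pos >= mask_starts[i] & pos < mask_ends[i])
    }
    if (all(masked)) return(NA_real_)    # completely masked interval
    mean(values[pos[!masked] + 1])       # R vectors are 1-based
}

values <- c(rep(1, 100), rep(5, 100), rep(1, 100))   # bases 0..299
masked_avg(values, 0, 300, mask_starts = 100, mask_ends = 200)   # 1
masked_avg(values, 0, 300, integer(0), integer(0))               # 700 / 300
masked_avg(values, 100, 200, 50, 250)                            # NA
```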
Value
None (invisibly).
See Also
gvtrack.create, gvtrack.iterator, gvtrack.info
Examples
gdb.init_examples()
## Basic usage: Excluding specific regions
gvtrack.create("vtrack1", "dense_track", func = "avg")
# Create intervals to mask (e.g., repetitive regions)
repeats <- gintervals(c(1, 1), c(100, 500), c(200, 600))
# Attach filter - track will be evaluated excluding these regions
gvtrack.filter("vtrack1", filter = repeats)
# Extract values - masked regions are excluded from calculation
result_filtered <- gextract("vtrack1", gintervals(1, 0, 1000))
# Check filter info
gvtrack.info("vtrack1")
# Clear the filter and compare
gvtrack.filter("vtrack1", filter = NULL)
result_unfiltered <- gextract("vtrack1", gintervals(1, 0, 1000))
## Using multiple filter sources (combined automatically)
centromeres <- gintervals(1, 10000, 15000)
telomeres <- gintervals(1, 0, 1000)
combined_mask <- list(repeats, centromeres, telomeres)
gvtrack.filter("vtrack1", filter = combined_mask)
result_multi_filter <- gextract("vtrack1", gintervals(1, 0, 20000))
## Filters work with iterator modifiers
gvtrack.create("vtrack2", "dense_track", func = "sum")
gvtrack.filter("vtrack2", filter = repeats)
gvtrack.iterator("vtrack2", sshift = -50, eshift = 50)
# Iterator shifts applied first, then mask subtracted
result_shifted <- gextract("vtrack2", gintervals(1, 1000, 2000), iterator = 100)
Returns the definition of a virtual track
Description
Returns the definition of a virtual track.
Usage
gvtrack.info(vtrack = NULL)
Arguments
vtrack |
virtual track name |
Details
This function returns the internal representation of a virtual track.
Value
Internal representation of a virtual track.
Examples
gdb.init_examples()
gvtrack.create("vtrack1", "dense_track", "max")
gvtrack.info("vtrack1")
Defines modification rules for a one-dimensional iterator in a virtual track
Description
Defines modification rules for a one-dimensional iterator in a virtual track.
Usage
gvtrack.iterator(vtrack = NULL, dim = NULL, sshift = 0, eshift = 0)
Arguments
vtrack |
virtual track name |
dim |
use 'NULL' or '0' for 1D iterators. '1' converts 2D iterator to (chrom1, start1, end1), '2' converts 2D iterator to (chrom2, start2, end2) |
sshift |
shift of 'start' coordinate |
eshift |
shift of 'end' coordinate |
Details
This function defines modification rules for one-dimensional iterator intervals in a virtual track.
'dim' converts a 2D iterator interval (chrom1, start1, end1, chrom2, start2, end2) to a 1D interval. If 'dim' is '1' the interval is converted to (chrom1, start1, end1). If 'dim' is '2' the interval is converted to (chrom2, start2, end2). If 1D iterator is used 'dim' must be set to 'NULL' or '0' (meaning: no conversion is made).
Iterator interval's 'start' coordinate is modified by adding 'sshift'. Similarly 'end' coordinate is altered by adding 'eshift'.
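The two modifications described here, projection by 'dim' and shifting by 'sshift'/'eshift', amount to the following base R sketch. `modify_interval` is a hypothetical helper written for illustration; it does not model any clipping to chromosome boundaries that the package may perform.

```r
# Illustrative sketch only: transform one iterator interval (a one-row data frame).
modify_interval <- function(iv, dim = NULL, sshift = 0, eshift = 0) {
    if (!is.null(dim) && dim != 0) {
        # Project a 2D interval onto its first or second axis
        sfx <- as.character(dim)
        iv <- data.frame(
            chrom = iv[[paste0("chrom", sfx)]],
            start = iv[[paste0("start", sfx)]],
            end   = iv[[paste0("end", sfx)]]
        )
    }
    iv$start <- iv$start + sshift
    iv$end <- iv$end + eshift
    iv
}

# 1D: widen [100, 200) by 50 bases on each side
modify_interval(data.frame(chrom = "chr1", start = 100, end = 200),
    sshift = -50, eshift = 50
)

# 2D: keep only the second axis of the rectangle
iv2d <- data.frame(
    chrom1 = "chr1", start1 = 0, end1 = 1000,
    chrom2 = "chr2", start2 = 5000, end2 = 6000
)
modify_interval(iv2d, dim = 2)
```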
Value
None.
See Also
gvtrack.create, gvtrack.iterator.2d
Examples
gdb.init_examples()
gvtrack.create("vtrack1", "dense_track")
gvtrack.iterator("vtrack1", sshift = 200, eshift = 200)
gextract("dense_track", "vtrack1", gintervals(1, 0, 500))
gvtrack.create("vtrack2", "dense_track")
gvtrack.iterator("vtrack2", dim = 1)
gextract("vtrack2", gintervals.2d(1, 0, 1000, 1, 0, -1),
iterator = "rects_track"
)
Defines modification rules for a two-dimensional iterator in a virtual track
Description
Defines modification rules for a two-dimensional iterator in a virtual track.
Usage
gvtrack.iterator.2d(
vtrack = NULL,
sshift1 = 0,
eshift1 = 0,
sshift2 = 0,
eshift2 = 0
)
Arguments
vtrack |
virtual track name |
sshift1 |
shift of 'start1' coordinate |
eshift1 |
shift of 'end1' coordinate |
sshift2 |
shift of 'start2' coordinate |
eshift2 |
shift of 'end2' coordinate |
Details
This function defines modification rules for two-dimensional iterator intervals in a virtual track.
Iterator interval's 'start1' coordinate is modified by adding 'sshift1'. Similarly 'end1', 'start2', 'end2' coordinates are altered by adding 'eshift1', 'sshift2' and 'eshift2' respectively.
Value
None.
See Also
gvtrack.create, gvtrack.iterator
Examples
gdb.init_examples()
gvtrack.create("vtrack1", "rects_track")
gvtrack.iterator.2d("vtrack1", sshift1 = 1000, eshift1 = 2000)
gextract(
"rects_track", "vtrack1",
gintervals.2d(1, 0, 5000, 2, 0, 5000)
)
Returns a list of virtual track names
Description
Returns a list of virtual track names.
Usage
gvtrack.ls(
pattern = "",
ignore.case = FALSE,
perl = FALSE,
fixed = FALSE,
useBytes = FALSE
)
Arguments
pattern, ignore.case, perl, fixed, useBytes |
see 'grep' |
Details
This function returns a list of virtual tracks that exist in the current R environment and match the pattern (see 'grep'). If called without any arguments all virtual tracks are returned.
Value
An array that contains the names of virtual tracks.
See Also
grep, gvtrack.create,
gvtrack.rm
Examples
gdb.init_examples()
gvtrack.create("vtrack1", "dense_track", "max")
gvtrack.create("vtrack2", "dense_track", "quantile", 0.5)
gvtrack.ls()
gvtrack.ls(pattern = "*2")
Deletes a virtual track
Description
Deletes a virtual track.
Usage
gvtrack.rm(vtrack = NULL)
Arguments
vtrack |
virtual track name |
Details
This function deletes a virtual track from the current R environment.
Value
None.
Examples
gdb.init_examples()
gvtrack.create("vtrack1", "dense_track", "max")
gvtrack.create("vtrack2", "dense_track", "quantile", 0.5)
gvtrack.ls()
gvtrack.rm("vtrack1")
gvtrack.ls()
Downloads files from FTP server
Description
Downloads multiple files from an FTP server.
Usage
gwget(url = NULL, path = NULL)
Arguments
url |
URL of FTP server |
path |
directory path where the downloaded files are stored |
Details
This function downloads files from the FTP server given by 'url'. The address in 'url' can contain wildcards to download more than one file at once. Files are downloaded to the directory given by the 'path' argument. If 'path' is 'NULL', files are downloaded into 'GROOT/downloads'.
Value
An array of file names that have been downloaded.
Examples
gdb.init_examples()
outdir <- tempdir()
gwget("ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/md5sum.txt", path = outdir)
Calculates Wilcoxon test on sliding windows over track expression
Description
Calculates Wilcoxon test on sliding windows over the values of track expression.
Usage
gwilcox(
expr = NULL,
winsize1 = NULL,
winsize2 = NULL,
maxpval = 0.05,
onetailed = TRUE,
what2find = 1,
intervals = NULL,
iterator = NULL,
intervals.set.out = NULL
)
Arguments
expr |
track expression |
winsize1 |
number of values in the first sliding window |
winsize2 |
number of values in the second sliding window |
maxpval |
maximal P-value |
onetailed |
if 'TRUE', Wilcoxon test is performed one tailed, otherwise two tailed |
what2find |
if '-1', lows are searched. If '1', peaks are searched. If '0', both peaks and lows are searched |
intervals |
genomic scope for which the function is applied |
iterator |
track expression iterator of "fixed bin" type. If 'NULL' iterator is determined implicitly based on track expression. |
intervals.set.out |
intervals set name where the function result is optionally outputted |
Details
This function runs a Wilcoxon test (also known as a Mann-Whitney test) over the values of track expression in the two sliding windows having an identical center. The sizes of the windows are specified by 'winsize1' and 'winsize2'. 'gwilcox' returns intervals where the smaller window tested against a larger window gives a P-value below 'maxpval'. The test can be one or two tailed.
'what2find' argument controls what should be searched: peaks, lows or both.
If 'intervals.set.out' is not 'NULL' the result is saved as an intervals set. Use this parameter if the result size exceeds the limits of the physical memory.
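The idea of testing a small window against a larger window with the same center can be sketched with base R's `wilcox.test` on a plain numeric vector. This is not the package's algorithm: `sliding_wilcox` is a hypothetical helper, it searches for peaks only, it uses the normal approximation (`exact = FALSE`), and it reports one P-value per center rather than merged intervals.

```r
# Illustrative sketch only: P-value of a small window against a larger window
# with the same center, at each position of a numeric vector.
sliding_wilcox <- function(x, winsize_small, winsize_large, onetailed = TRUE) {
    hs <- winsize_small %/% 2
    hl <- winsize_large %/% 2
    centers <- (hl + 1):(length(x) - hl)   # centers where both windows fit
    alt <- if (onetailed) "greater" else "two.sided"   # "greater" looks for peaks
    pval <- vapply(centers, function(ctr) {
        small <- x[(ctr - hs):(ctr + hs)]
        large <- x[(ctr - hl):(ctr + hl)]
        wilcox.test(small, large, alternative = alt, exact = FALSE)$p.value
    }, numeric(1))
    data.frame(center = centers, pval = pval)
}

# Deterministic toy signal with one peak around position 100
x <- (1:200 %% 7) / 10
x[95:105] <- x[95:105] + 5
res <- sliding_wilcox(x, winsize_small = 11, winsize_large = 101)
res[res$pval < 0.01, ]    # centers whose small window stands out as a peak
```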
Value
If 'intervals.set.out' is 'NULL', a data frame of the intervals whose P-value is below 'maxpval', with an additional 'pval' column.
Examples
gdb.init_examples()
gwilcox("dense_track", 100000, 1000,
maxpval = 0.01,
what2find = 1
)