Title: | Alt String Implementation |
Version: | 0.17.0 |
Date: | 2025-07-12 |
Maintainer: | Travers Ching <traversc@gmail.com> |
Description: | Provides an extendable, performant and multithreaded 'alt-string' implementation backed by 'C++' vectors and strings. |
License: | GPL-3 |
Biarch: | true |
Encoding: | UTF-8 |
Depends: | R (≥ 3.6.0) |
SystemRequirements: | GNU make |
LinkingTo: | Rcpp (≥ 0.12.18.3), RcppParallel (≥ 5.1.4) |
Imports: | Rcpp, RcppParallel |
Suggests: | qs2, qs, knitr, rmarkdown, usethis, dplyr, stringr, rlang |
VignetteBuilder: | knitr |
RoxygenNote: | 7.3.2 |
Copyright: | Copyright for the bundled 'PCRE2' library is held by University of Cambridge, Zoltan Herczeg and Tilera Corporation (Stack-less Just-In-Time compiler); Copyright for the bundled 'xxHash' code is held by Yann Collet. |
URL: | https://github.com/traversc/stringfish |
BugReports: | https://github.com/traversc/stringfish/issues |
NeedsCompilation: | yes |
Packaged: | 2025-07-13 05:58:59 UTC; tching |
Author: | Travers Ching [aut, cre, cph], Phillip Hazel [ctb] (Bundled PCRE2 code), Zoltan Herczeg [ctb, cph] (Bundled PCRE2 code), University of Cambridge [cph] (Bundled PCRE2 code), Tilera Corporation [cph] (Stack-less Just-In-Time compiler bundled with PCRE2), Yann Collet [ctb, cph] (Yann Collet is the author of the bundled xxHash code) |
Repository: | CRAN |
Date/Publication: | 2025-07-13 06:40:02 UTC |
convert_to_sf
Description
Converts a character vector to a stringfish vector
Usage
convert_to_sf(x)
sf_convert(x)
Arguments
x |
A character vector |
Details
Converts a character vector to a stringfish vector. The opposite of 'materialize'.
Value
The converted character vector
Examples
if(getRversion() >= "3.5.0") {
x <- convert_to_sf(letters)
}
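The round trip described in Details can be sketched as follows. This is a minimal illustration that assumes the stringfish package is installed; the variable names are made up for the example.

```r
library(stringfish)

# Convert an ordinary character vector into a stringfish (alt-rep) vector
x <- convert_to_sf(c("alpha", "beta", "gamma"))
type_before <- get_string_type(x)

# materialize() goes the other way and forces a regular R representation
y <- materialize(x)
type_after <- get_string_type(y)

print(type_before)
print(type_after)
```

Printing both types is a quick way to confirm whether a vector is still backed by the C++ representation.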
get_string_type
Description
Returns the type of the character vector
Usage
get_string_type(x)
Arguments
x |
the vector |
Details
A function that returns the type of character vector. Possible values are "normal vector", "stringfish vector", "stringfish vector (materialized)" or "other alt-rep vector"
Value
The type of vector
Examples
if(getRversion() >= "3.5.0") {
x <- sf_vector(10)
get_string_type(x) # returns "stringfish vector"
x <- character(10)
get_string_type(x) # returns "normal vector"
}
materialize
Description
Materializes an alt-rep object
Usage
materialize(x)
Arguments
x |
An alt-rep object |
Details
Materializes any alt-rep object and then returns it. Note: the object is materialized regardless of whether the return value is assigned to a variable.
Value
x
Examples
if(getRversion() >= "3.5.0") {
x <- sf_vector(10)
sf_assign(x, 1, "hello world")
sf_assign(x, 2, "another string")
x <- materialize(x)
}
random_strings
Description
A function that generates random strings
Usage
random_strings(N, string_size = 50, charset = "abcdefghijklmnopqrstuvwxyz",
vector_mode = "stringfish")
Arguments
N |
The number of strings to generate |
string_size |
The length of the strings |
charset |
The characters used to generate the random strings (default: abcdefghijklmnopqrstuvwxyz) |
vector_mode |
The type of character vector to generate (either stringfish or normal, default: stringfish) |
Details
Generates N random strings, each string_size characters long, using only the characters in charset. The result is returned as either a stringfish vector or a normal character vector, depending on vector_mode.
Value
A character vector of the random strings
See Also
gsub
Examples
if(getRversion() >= "3.5.0") {
set.seed(1)
x <- random_strings(1e6, 80, "ACGT", vector_mode = "stringfish")
}
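As a small sanity check of the arguments, the sketch below generates a handful of strings and verifies their length and alphabet with base R functions. It assumes the stringfish package is installed; the sizes and the "ACGT" charset are arbitrary choices.

```r
library(stringfish)

set.seed(42)
# 5 strings of 8 characters each, drawn from a DNA-style alphabet
x <- random_strings(5, string_size = 8, charset = "ACGT", vector_mode = "normal")

# Every string should have the requested length ...
lens <- nchar(x)
# ... and contain only characters from the charset
only_charset <- grepl("^[ACGT]+$", x)

print(lens)
print(only_charset)
```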
sf_assign
Description
Assigns a new string to a stringfish vector or any other character vector
Usage
sf_assign(x, i, e)
Arguments
x |
the vector |
i |
the index to assign to |
e |
the new string to replace at i in x |
Details
A function to assign a new element to an existing character vector. If the vector is a stringfish vector, it does so without materialization.
Value
No return value, the function assigns an element to an existing stringfish vector
Examples
if(getRversion() >= "3.5.0") {
x <- sf_vector(10)
sf_assign(x, 1, "hello world")
sf_assign(x, 2, "another string")
}
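The "without materialization" behaviour can be checked with get_string_type. This is a sketch that assumes the stringfish package is installed; the type is inspected immediately after assignment, before any other operation touches the vector.

```r
library(stringfish)

x <- sf_vector(3)
sf_assign(x, 1, "first")
sf_assign(x, 3, "third")

# Checked right after assignment, before anything else accesses x
type_after_assign <- get_string_type(x)

first <- x[1]
third <- x[3]

print(type_after_assign)
```

Note that sf_assign modifies x in place, so there is no need to reassign the result.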
sf_collapse
Description
Pastes a series of strings together separated by the 'collapse' parameter
Usage
sf_collapse(x, collapse)
Arguments
x |
A character vector |
collapse |
A single string |
Details
This works the same way as 'paste0(x, collapse=collapse)'
Value
A single string with all values in 'x' pasted together, separated by 'collapse'.
See Also
paste0, paste
Examples
if(getRversion() >= "3.5.0") {
x <- c("hello", "\xe4\xb8\x96\xe7\x95\x8c")
Encoding(x) <- "UTF-8"
sf_collapse(x, " ") # "hello world" in Japanese
sf_collapse(letters, "") # returns the alphabet
}
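To make the equivalence with base R concrete, here is a minimal side-by-side comparison. It assumes the stringfish package is installed; the input vector is arbitrary.

```r
library(stringfish)

x <- c("a", "b", "c")

# Both calls join the elements of x with "-" between them
sf_result   <- sf_collapse(x, "-")
base_result <- paste0(x, collapse = "-")

print(sf_result)
print(base_result)
```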
sf_compare
Description
Returns a logical vector testing equality of strings from two string vectors
Usage
sf_compare(x, y, nthreads = getOption("stringfish.nthreads", 1L))
sf_equals(x, y, nthreads = getOption("stringfish.nthreads", 1L))
Arguments
x |
A character vector of length 1 or the same non-zero length as y |
y |
Another character vector of length 1 or the same non-zero length as x |
nthreads |
Number of threads to use |
Details
Note: the function tests for both string and encoding equality
Value
A logical vector
Examples
if(getRversion() >= "3.5.0") {
sf_compare(letters, "a")
}
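The sketch below shows element-wise comparison, recycling of a length-1 argument, and the "stringfish.nthreads" option that supplies the default for nthreads. It assumes the stringfish package is installed, and it assumes that the thread count affects speed only, not the result.

```r
library(stringfish)

x <- c("a", "b", "c")
y <- c("a", "x", "c")

elementwise <- sf_compare(x, y)    # same-length comparison
recycled    <- sf_compare(x, "a")  # length-1 y compared against each element of x

# Temporarily change the default thread count via the package option
old <- options(stringfish.nthreads = 2L)
threaded <- sf_compare(x, y)
options(old)  # restore the previous setting

print(elementwise)
print(recycled)
```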
sf_concat
Description
Appends vectors together
Usage
sf_concat(...)
sfc(...)
Arguments
... |
Any number of vectors, coerced to character vector if necessary |
Value
A concatenated stringfish vector
Examples
if(getRversion() >= "3.5.0") {
sf_concat(letters, 1:5)
}
sf_ends
Description
A function for detecting a pattern at the end of a string
Usage
sf_ends(subject, pattern, ...)
Arguments
subject |
A character vector |
pattern |
A string to look for at the end |
... |
Parameters passed to sf_grepl |
Value
A logical vector: TRUE if there is a match, FALSE if there is no match, NA if the subject was NA
See Also
endsWith, sf_starts
Examples
if(getRversion() >= "3.5.0") {
x <- c("alpha", "beta", "gamma", "delta", "epsilon")
sf_ends(x, "a")
}
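For a plain literal suffix, the result can be compared against base R's endsWith. This sketch assumes the stringfish package is installed. Because extra arguments are passed on to sf_grepl, the pattern may be interpreted as a regular expression, so the comparison deliberately uses a single letter.

```r
library(stringfish)

x <- c("alpha", "beta", "gamma", "delta", "epsilon")

sf_result   <- sf_ends(x, "a")
base_result <- endsWith(x, "a")

print(sf_result)
print(base_result)
```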
sf_grepl
Description
A function that matches patterns and returns a logical vector
Usage
sf_grepl(subject, pattern, encode_mode = "auto", fixed = FALSE,
nthreads = getOption("stringfish.nthreads", 1L))
Arguments
subject |
The subject character vector to search |
pattern |
The pattern to search for |
encode_mode |
"auto", "UTF-8" or "byte". Determines whether multi-byte (UTF-8) characters or single-byte characters are used. |
fixed |
determines whether the pattern parameter should be interpreted literally or as a regular expression |
nthreads |
Number of threads to use |
Details
The function uses the PCRE2 library, which is also used internally by R. The encoding is based on the pattern string (or forced via the encode_mode parameter). Note: the order of parameters is switched compared to the base R function 'grepl', with subject being first. See also: https://www.pcre.org/current/doc/html/pcre2api.html for more documentation on match syntax.
Value
A logical vector with the same length as subject
See Also
grepl
Examples
if(getRversion() >= "3.5.0") {
x <- sf_vector(10)
sf_assign(x, 1, "hello world")
pattern <- "^hello"
sf_grepl(x, pattern)
}
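The effect of the fixed argument is easiest to see with a pattern that contains a regular expression metacharacter. The following sketch assumes the stringfish package is installed; the inputs are chosen only for illustration.

```r
library(stringfish)

x <- c("a.c", "abc")

# As a regular expression, "." matches any character
as_regex <- sf_grepl(x, "a.c")

# With fixed = TRUE, "." must match a literal dot
as_literal <- sf_grepl(x, "a.c", fixed = TRUE)

print(as_regex)
print(as_literal)
```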
sf_gsub
Description
A function that performs pattern substitution
Usage
sf_gsub(subject, pattern, replacement, encode_mode = "auto", fixed = FALSE,
nthreads = getOption("stringfish.nthreads", 1L))
Arguments
subject |
The subject character vector to search |
pattern |
The pattern to search for |
replacement |
The replacement string |
encode_mode |
"auto", "UTF-8" or "byte". Determines whether multi-byte (UTF-8) characters or single-byte characters are used. |
fixed |
determines whether the pattern parameter should be interpreted literally or as a regular expression |
nthreads |
Number of threads to use |
Details
The function uses the PCRE2 library, which is also used internally by R. However, the syntax may differ slightly, e.g. capture groups are written "\1" in R but "$1" in PCRE2 (as in Perl). The encoding of the output is determined by the pattern (or forced using the encode_mode parameter), and the encodings should be compatible: mixing ASCII and UTF-8 is okay, but mixing UTF-8 and latin1 is not. Note: the order of parameters is switched compared to the base R function 'gsub', with subject being first. See also: https://www.pcre.org/current/doc/html/pcre2api.html for more documentation on match syntax.
Value
A stringfish vector of the replacement string
See Also
gsub
Examples
if(getRversion() >= "3.5.0") {
x <- "hello world"
pattern <- "^hello (.+)"
replacement <- "goodbye $1"
sf_gsub(x, pattern, replacement)
}
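The two differences from base R noted in Details (argument order and capture group syntax) can be seen side by side in this sketch, which assumes the stringfish package is installed.

```r
library(stringfish)

x <- "hello world"

# stringfish: subject first, capture group referenced as $1
sf_result <- sf_gsub(x, "^hello (.+)", "goodbye $1")

# base R: pattern first, capture group referenced as \\1
base_result <- gsub("^hello (.+)", "goodbye \\1", x)

print(sf_result)
print(base_result)
```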
sf_iconv
Description
Converts encoding of one character vector to another
Usage
sf_iconv(x, from, to, nthreads = getOption("stringfish.nthreads", 1L))
Arguments
x |
A character vector |
from |
the encoding to assume of 'x' |
to |
the new encoding |
nthreads |
Number of threads to use |
Details
This is an analogue of the base R function 'iconv'. It converts a string from one encoding (e.g. latin1 or UTF-8) to another.
Value
the converted character vector as a stringfish vector
See Also
iconv
Examples
if(getRversion() >= "3.5.0") {
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
sf_iconv(x, "latin1", "UTF-8")
}
sf_match
Description
Returns a vector of the positions of x in table
Usage
sf_match(x, table, nthreads = getOption("stringfish.nthreads", 1L))
Arguments
x |
A character vector to search for in table |
table |
A character vector to be matched against x |
nthreads |
Number of threads to use |
Details
Note: similarly to the base R function, long "table" vectors are not supported. This is due to the maximum integer value that can be returned ('.Machine$integer.max')
Value
An integer vector of the indices of each x element's position in table
See Also
match
Examples
if(getRversion() >= "3.5.0") {
sf_match("c", letters)
}
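A short comparison with base R's match, including an element that is not present in table, might look like this. The sketch assumes the stringfish package is installed and that unmatched elements are reported as NA, as in base R.

```r
library(stringfish)

x <- c("c", "z", "not a letter")

sf_result   <- sf_match(x, letters)
base_result <- match(x, letters)

print(sf_result)
print(base_result)
```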
sf_nchar
Description
Counts the number of characters in a character vector
Usage
sf_nchar(x, type = "chars", nthreads = getOption("stringfish.nthreads", 1L))
Arguments
x |
A character vector |
type |
The type of counting to perform ("chars" or "bytes", default: "chars") |
nthreads |
Number of threads to use |
Details
Returns the number of characters per string. The type of counting only matters for UTF-8 strings, where a character can be represented by multiple bytes.
Value
An integer vector of the number of characters
See Also
nchar
Examples
if(getRversion() >= "3.5.0") {
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
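x <- sf_iconv(x, "latin1", "UTF-8")
sf_nchar(x) # 6 characters
sf_nchar(x, type = "bytes") # 7 bytes, because the c-cedilla takes two bytes in UTF-8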
}
sf_paste
Description
Pastes a series of strings together
Usage
sf_paste(..., sep = "", nthreads = getOption("stringfish.nthreads", 1L))
Arguments
... |
Any number of character vector strings |
sep |
The separating string between strings |
nthreads |
Number of threads to use |
Details
This works the same way as 'paste(..., sep=sep)'
Value
A character vector where elements of the arguments are pasted together
See Also
paste0, paste
Examples
if(getRversion() >= "3.5.0") {
x <- letters
y <- LETTERS
sf_paste(x,y, sep = ":")
}
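As a minimal check against base R, the sketch below pastes two short vectors with a separator. It assumes the stringfish package is installed; the inputs are arbitrary.

```r
library(stringfish)

left  <- c("a", "b")
right <- c("X", "Y")

# Element-wise pasting with ":" between the pieces
sf_result   <- sf_paste(left, right, sep = ":")
base_result <- paste(left, right, sep = ":")

print(sf_result)
print(base_result)
```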
sf_readLines
Description
A function that reads a file line by line
Usage
sf_readLines(file, encoding = "UTF-8")
Arguments
file |
The file name |
encoding |
The encoding to use (Default: UTF-8) |
Details
A function for reading in text data using 'std::ifstream'.
Value
A stringfish vector of the lines in a file
See Also
readLines
Examples
if(getRversion() >= "3.5.0") {
file <- tempfile()
sf_writeLines(letters, file)
sf_readLines(file)
}
sf_split
Description
A function to split strings by a delimiter
Usage
sf_split(subject, split, encode_mode = "auto", fixed = FALSE,
nthreads = getOption("stringfish.nthreads", 1L))
Arguments
subject |
A character vector |
split |
A delimiter to split the string by |
encode_mode |
"auto", "UTF-8" or "byte". Determines whether multi-byte (UTF-8) characters or single-byte characters are used. |
fixed |
determines whether the split parameter should be interpreted literally or as a regular expression |
nthreads |
Number of threads to use |
Value
A list of stringfish character vectors
See Also
strsplit
Examples
if(getRversion() >= "3.5.0") {
sf_split(datasets::state.name, "\\s") # split U.S. state names by any space character
}
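The shape of the return value (one vector per input string) is illustrated below with a literal delimiter. The sketch assumes the stringfish package is installed and uses base R's strsplit for comparison.

```r
library(stringfish)

x <- c("a,b,c", "d,e")

# fixed = TRUE treats "," as a literal delimiter rather than a regex
parts      <- sf_split(x, ",", fixed = TRUE)
base_parts <- strsplit(x, ",", fixed = TRUE)

print(lengths(parts))  # number of pieces per input string
print(base_parts)
```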
sf_starts
Description
A function for detecting a pattern at the start of a string
Usage
sf_starts(subject, pattern, ...)
Arguments
subject |
A character vector |
pattern |
A string to look for at the start |
... |
Parameters passed to sf_grepl |
Value
A logical vector: TRUE if there is a match, FALSE if there is no match, NA if the subject was NA
See Also
startsWith, sf_ends
Examples
if(getRversion() >= "3.5.0") {
x <- c("alpha", "beta", "gamma", "delta", "epsilon")
sf_starts(x, "a")
}
sf_substr
Description
Extracts substrings from a character vector
Usage
sf_substr(x, start, stop, nthreads = getOption("stringfish.nthreads", 1L))
Arguments
x |
A character vector |
start |
The position of the first character to extract |
stop |
The position of the last character to extract |
nthreads |
Number of threads to use |
Details
This works the same way as 'substr', but in addition allows negative indexing. Negative indices count backwards from the end of the string, with -1 being the last character.
Value
A stringfish vector of substrings
See Also
substr
Examples
if(getRversion() >= "3.5.0") {
x <- c("fa\xE7ile", "hello world")
Encoding(x) <- "latin1"
x <- sf_iconv(x, "latin1", "UTF-8")
sf_substr(x, 4, -1) # extracts from the 4th character to the last
## [1] "ile" "lo world"
}
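Negative indices on both ends can be used to take a suffix of known length. The sketch below assumes the stringfish package is installed and shows the base R arithmetic that the negative indices replace.

```r
library(stringfish)

x <- "hello world"

# -5 is the fifth character from the end, -1 is the last character
last_five <- sf_substr(x, -5, -1)

# The same extraction in base R needs explicit length arithmetic
base_equiv <- substr(x, nchar(x) - 4, nchar(x))

print(last_five)
print(base_equiv)
```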
sf_tolower
Description
A function converting a string to all lowercase
Usage
sf_tolower(x)
Arguments
x |
A character vector |
Details
Note: the function only converts ASCII characters.
Value
A stringfish vector where all uppercase is converted to lowercase
See Also
tolower
Examples
if(getRversion() >= "3.5.0") {
x <- LETTERS
sf_tolower(x)
}
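The ASCII-only note in Details matters for accented text. In the sketch below (which assumes the stringfish package is installed), the accented capital is expected to stay unchanged while the ASCII letters are lowercased; the input word is just an illustration.

```r
library(stringfish)

# "\u{c9}" is a capital E with an acute accent, which is not an ASCII character
x <- "\u{c9}COLE"

lowered <- sf_tolower(x)

print(lowered)
```

If full Unicode case conversion is needed, base R's tolower may be a better fit, depending on the locale.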
sf_toupper
Description
A function converting a string to all uppercase
Usage
sf_toupper(x)
Arguments
x |
A character vector |
Details
Note: the function only converts ASCII characters.
Value
A stringfish vector where all lowercase is converted to uppercase
See Also
toupper
Examples
if(getRversion() >= "3.5.0") {
x <- letters
sf_toupper(x)
}
sf_trim
Description
A function to remove leading/trailing whitespace
Usage
sf_trim(subject, which = c("both", "left", "right"), whitespace = "[ \\t\\r\\n]", ...)
Arguments
subject |
A character vector |
which |
"both", "left", or "right" determines which white space is removed |
whitespace |
Whitespace characters (default: "[ \\t\\r\\n]") |
... |
Parameters passed to sf_gsub |
Value
A stringfish vector with the leading and/or trailing whitespace removed
See Also
trimws
Examples
if(getRversion() >= "3.5.0") {
x <- c(" alpha ", " beta", " gamma ", "delta ", "epsilon ")
sf_trim(x)
}
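The three values of the which argument can be compared directly. This is a small sketch that assumes the stringfish package is installed; the inputs use plain spaces only.

```r
library(stringfish)

x <- c("  alpha  ", "beta  ")

left_only  <- sf_trim(x, which = "left")   # strip leading whitespace
right_only <- sf_trim(x, which = "right")  # strip trailing whitespace
both       <- sf_trim(x)                   # default: strip both sides

print(left_only)
print(right_only)
print(both)
```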
sf_vector
Description
Creates a new stringfish vector
Usage
sf_vector(len)
Arguments
len |
length of the new vector |
Details
This function creates a new stringfish vector, an alt-rep character vector backed by a C++ "std::vector" as the internal memory representation. The vector type is "sfstring", which is a simple C++ class containing a "std::string" and a single byte (uint8_t) representing the encoding.
Value
A new (empty) stringfish vector
Examples
if(getRversion() >= "3.5.0") {
x <- sf_vector(10)
sf_assign(x, 1, "hello world")
sf_assign(x, 2, "another string")
}
sf_writeLines
Description
A function that writes a character vector to a file line by line
Usage
sf_writeLines(text, file, sep = "\n", na_value = "NA", encode_mode = "UTF-8")
Arguments
text |
A character vector to write to file |
file |
Name of the file to write to |
sep |
The line separator character(s) |
na_value |
What to write in case of an NA string |
encode_mode |
"UTF-8" or "byte". If "UTF-8", all strings are re-encoded as UTF-8. |
Details
A function for writing text data using 'std::ofstream'.
See Also
writeLines
Examples
if(getRversion() >= "3.5.0") {
file <- tempfile()
sf_writeLines(letters, file)
sf_readLines(file)
}
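The na_value argument can be illustrated by writing a vector that contains a missing value and reading it back with base R's readLines. The sketch assumes the stringfish package is installed; the "<missing>" placeholder is an arbitrary choice.

```r
library(stringfish)

path <- tempfile()

# The NA element is written using the na_value placeholder
sf_writeLines(c("a", NA, "b"), path, na_value = "<missing>")

lines_back <- readLines(path)
unlink(path)  # clean up the temporary file

print(lines_back)
```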
string_identical
Description
A stricter comparison of string equality
Usage
string_identical(x, y)
Arguments
x |
A character vector |
y |
Another character vector to compare to x |
Value
TRUE if strings are identical, including encoding
See Also
identical
Examples
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
y <- iconv(x, "latin1", "UTF-8")
identical(x, y) # TRUE
string_identical(x, y) # FALSE