Title: Tools for Preparing Text for Tokenizers
Version: 1.0.2
Description: Tokenizers break text into pieces that are more usable by machine learning models. Many tokenizers share some preparation steps. This package provides those shared steps, along with a simple tokenizer.
License: Apache License (≥ 2)
URL: https://github.com/macmillancontentscience/piecemaker, https://macmillancontentscience.github.io/piecemaker/
BugReports: https://github.com/macmillancontentscience/piecemaker/issues
Depends: R (≥ 2.10)
Imports: cli, glue, rlang (≥ 0.4.2), stringi, stringr
Suggests: covr, testthat (≥ 3.0.0)
Config/testthat/edition: 3
Encoding: UTF-8
RoxygenNote: 7.2.3
NeedsCompilation: no
Packaged: 2023-06-02 18:40:35 UTC; jonth
Author: Jon Harmon
Maintainer: Jon Harmon <jonthegeek@gmail.com>
Repository: CRAN
Date/Publication: 2023-06-02 19:50:03 UTC
piecemaker: Tools for Preparing Text for Tokenizers
Description
Tokenizers break text into pieces that are more usable by machine learning models. Many tokenizers share some preparation steps. This package provides those shared steps, along with a simple tokenizer.
Author(s)
Maintainer: Jon Harmon <jonthegeek@gmail.com> (ORCID)
Authors:
Jonathan Bratt <jonathan.bratt@macmillan.com> (ORCID)
Other contributors:
Bedford Freeman & Worth Pub Grp LLC DBA Macmillan Learning [copyright holder]
See Also
Useful links:
https://github.com/macmillancontentscience/piecemaker
https://macmillancontentscience.github.io/piecemaker/
Report bugs at https://github.com/macmillancontentscience/piecemaker/issues
Make Regex for Unicode Blocks
Description
Make Regex for Unicode Blocks
Usage
.make_unicode_block_regex(unicode_block_name)
Arguments
unicode_block_name: The name of the Unicode block as it appears in the Wikipedia list of Unicode blocks.
Value
A regex character class, in square brackets, matching any character in that block.
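This helper is internal and not exported, but the technique can be sketched in a few lines of base R. The following is a hypothetical illustration, not the package's actual implementation; `make_block_regex()` and the hard-coded code point bounds are assumptions for the example.

```r
# Hypothetical sketch: build a regex character class for a Unicode block
# from its first and last code points. Embedding the literal characters via
# intToUtf8() avoids engine-specific escape syntax.
make_block_regex <- function(first, last) {
  sprintf("[%s-%s]", intToUtf8(first), intToUtf8(last))
}

# The CJK Unified Ideographs block spans U+4E00..U+9FFF.
cjk <- make_block_regex(0x4E00, 0x9FFF)
grepl(cjk, "\u4E2D")  # TRUE: U+4E2D is inside the block
grepl(cjk, "a")       # FALSE
```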
Space Text by a Regex Selector
Description
Space Text by a Regex Selector
Usage
.space_regex_selector(text, regex_selector)
Arguments
text: Character; text to space.
regex_selector: A regular expression that selects the blocks of text you want to space.
Value
A character vector the same length as text, with spaces around the specified blocks.
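Since this helper is internal, the general technique can be shown with a hedged base-R sketch; `space_selector()` below is an assumed name for illustration, not the package's code.

```r
# Hypothetical sketch: surround every match of a regex with spaces, then
# collapse the doubled spaces that adjacent matches can leave behind.
space_selector <- function(text, regex_selector) {
  out <- gsub(sprintf("(%s)", regex_selector), " \\1 ", text)
  trimws(gsub(" +", " ", out))
}

space_selector("one,two;three", "[,;]")  # "one , two ; three"
```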
Split Text on Spaces
Description
This is an extremely simple tokenizer that splits text on spaces. It also optionally applies the cleaning processes from prepare_text().
Usage
prepare_and_tokenize(text, prepare = TRUE, ...)
Arguments
text: A character vector to clean.
prepare: Logical; should the text be passed through prepare_text()?
...: Arguments passed on to prepare_text().
Value
The text as a list of character vectors. Each element of each vector is roughly equivalent to a word.
Examples
prepare_and_tokenize("This is some text.")
prepare_and_tokenize("This is some text.", space_punctuation = FALSE)
Prepare Text for Tokenization
Description
This function combines the other functions in this package to prepare text for tokenization. The text gets converted to valid UTF-8 (if possible), and then various cleaning functions are applied.
Usage
prepare_text(
text,
squish_whitespace = TRUE,
remove_terminal_hyphens = TRUE,
remove_control_characters = TRUE,
remove_replacement_characters = TRUE,
remove_diacritics = TRUE,
space_cjk = TRUE,
space_punctuation = TRUE,
space_hyphens = TRUE,
space_abbreviations = TRUE
)
Arguments
text: A character vector to clean.
squish_whitespace: Logical scalar; squish whitespace characters (using stringr::str_squish())?
remove_terminal_hyphens: Logical; should hyphens at the end of lines after a word be removed? For example, "un-\nbroken" would become "unbroken".
remove_control_characters: Logical scalar; remove control characters?
remove_replacement_characters: Logical scalar; remove the "replacement character", U+FFFD?
remove_diacritics: Logical scalar; remove diacritical marks (accents, etc.) from characters?
space_cjk: Logical scalar; add spaces around Chinese/Japanese/Korean ideographs?
space_punctuation: Logical scalar; add spaces around punctuation (to make it easier to keep punctuation during tokenization)?
space_hyphens: Logical; treat hyphens between letters and at the start/end of words as punctuation? Other hyphens are always treated as punctuation.
space_abbreviations: Logical; treat apostrophes between letters as punctuation? Other apostrophes are always treated as punctuation.
Value
The character vector, cleaned as specified.
Examples
piece1 <- " This is a \n\nfa\xE7ile\n\n example.\n"
# Specify encoding so this example behaves the same on all systems.
Encoding(piece1) <- "latin1"
example_text <- paste(
piece1,
"It has the bell character, \a, and the replacement character,",
intToUtf8(65533)
)
prepare_text(example_text)
prepare_text(example_text, squish_whitespace = FALSE)
prepare_text(example_text, remove_control_characters = FALSE)
prepare_text(example_text, remove_replacement_characters = FALSE)
prepare_text(example_text, remove_diacritics = FALSE)
Remove Non-Character Characters
Description
Unicode includes several control codes, such as U+0000
(NULL, used in
null-terminated strings) and U+000D
(carriage return). This function
removes all such characters from text.
Usage
remove_control_characters(text)
Arguments
text: A character vector to clean.
Details
Note: We highly recommend that you first condense all space-like characters (including new lines) before removing control codes. You can easily do so with str_squish(). We also recommend validating text at the start of any cleaning process using validate_utf8().
Value
The character vector without control characters.
Examples
remove_control_characters("Line 1\nLine2")
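The ordering advice in Details can be seen with plain base R: "\n" is itself a control character, so stripping control codes before condensing whitespace fuses words across line breaks. A sketch of the hazard (not the package's implementation):

```r
txt <- "un\nbroken words"

# Removing control characters first silently fuses the line-broken word:
gsub("[[:cntrl:]]", "", txt)                     # "unbroken words"

# Squishing whitespace first preserves the word boundary:
gsub("[[:cntrl:]]", "", gsub("\\s+", " ", txt))  # "un broken words"
```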
Remove Diacritical Marks on Characters
Description
Accent characters and other diacritical marks are often difficult to type, and thus can be missing from text. To normalize the various ways a user might spell a word that should have a diacritical mark, you can convert all such characters to their simpler equivalent character.
Usage
remove_diacritics(text)
Arguments
text: A character vector to clean.
Value
The character vector with simpler character representations.
Examples
# This text can appear differently between machines if we aren't careful, so
# we explicitly encode the desired characters.
sample_text <- "fa\u00e7ile r\u00e9sum\u00e9"
sample_text
remove_diacritics(sample_text)
Remove the Unicode Replacement Character
Description
The replacement character, U+FFFD
, is used to mark characters that
could not be loaded. These characters might be a sign of encoding issues, so
it is advisable to investigate and try to eliminate any cases in your text,
but in the end these characters will almost definitely confuse downstream
processes.
Usage
remove_replacement_characters(text)
Arguments
text: A character vector to clean.
Value
The character vector with replacement characters removed.
Examples
remove_replacement_characters(
paste(
"The replacement character:",
intToUtf8(65533)
)
)
Add Spaces Around CJK Ideographs
Description
To tokenize Chinese, Japanese, and Korean (CJK) characters, it's convenient to add spaces around the characters.
Usage
space_cjk(text)
Arguments
text: A character vector to clean.
Value
A character vector the same length as the input text, with spaces added between ideographs.
Examples
to_space <- intToUtf8(13312:13320)
to_space
space_cjk(to_space)
Add Spaces Around Punctuation
Description
To keep punctuation during tokenization, it's convenient to add spacing around punctuation. This function does that, with options to keep certain types of punctuation together as part of the word.
Usage
space_punctuation(text, space_hyphens = TRUE, space_abbreviations = TRUE)
Arguments
text: A character vector to clean.
space_hyphens: Logical; treat hyphens between letters and at the start/end of words as punctuation? Other hyphens are always treated as punctuation.
space_abbreviations: Logical; treat apostrophes between letters as punctuation? Other apostrophes are always treated as punctuation.
Value
A character vector the same length as the input text, with spaces added around punctuation characters.
Examples
to_space <- "This is some 'gosh-darn' $5 text. Isn't it lovely?"
to_space
space_punctuation(to_space)
space_punctuation(to_space, space_hyphens = FALSE)
space_punctuation(to_space, space_abbreviations = FALSE)
Remove Extra Whitespace
Description
This function is mostly a wrapper around stringr::str_squish(), with the additional option to remove hyphens at the ends of lines.
Usage
squish_whitespace(text, remove_terminal_hyphens = TRUE)
Arguments
text: A character vector to clean.
remove_terminal_hyphens: Logical; should hyphens at the end of lines after a word be removed? For example, "un-\nbroken" would become "unbroken".
Value
The character vector with spacing at the start and end removed, and with internal spacing reduced to a single space character each.
Examples
sample_text <- "This had many space char-\nacters."
squish_whitespace(sample_text)
Break Text at Spaces
Description
This is an extremely simple tokenizer, breaking only and exactly on the space character. It is intended to work in tandem with prepare_text(), so that spaces are cleaned up and inserted as necessary before the tokenizer runs. This function and prepare_text() are combined in prepare_and_tokenize().
Usage
tokenize_space(text)
Arguments
text: A character vector to clean.
Value
The text as a list of character vectors (one vector per element of text). Each element of each vector is roughly equivalent to a word.
Examples
tokenize_space("This is some text.")
Clean Up Text to UTF-8
Description
Text cleaning works best if the encoding is known. This function attempts to convert text to UTF-8 encoding, and provides an informative error if that is not possible.
Usage
validate_utf8(text)
Arguments
text: A character vector to clean.
Value
The text with formal UTF-8 encoding, if possible.
Examples
text <- "fa\xE7ile"
# Specify the encoding so the example is the same on all systems.
Encoding(text) <- "latin1"
validate_utf8(text)
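For comparison, the same idea can be approximated with base R alone: re-encode from the declared encoding to UTF-8, then confirm the bytes are valid. This is a hedged sketch of the general technique, not the package's implementation (validate_utf8() additionally raises an informative error when conversion is impossible).

```r
# Base-R sketch: re-encode a latin1 string to UTF-8 and verify validity.
x <- "fa\xE7ile"
Encoding(x) <- "latin1"
y <- enc2utf8(x)
Encoding(y)        # "UTF-8"
all(validUTF8(y))  # TRUE
```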