Type: Package
Title: Morpheme Tokenization
Version: 1.2.3
Description: Tokenize text into morphemes. The morphemepiece algorithm uses a lookup table to determine the morpheme breakdown of words, and falls back on a modified wordpiece tokenization algorithm for words not found in the lookup table.
URL: https://github.com/macmillancontentscience/morphemepiece
BugReports: https://github.com/macmillancontentscience/morphemepiece/issues
License: Apache License (>= 2)
Encoding: UTF-8
RoxygenNote: 7.1.2
Imports: dlr (>= 1.0.0), fastmatch, magrittr, memoise (>= 2.0.0), morphemepiece.data, piecemaker (>= 1.0.0), purrr (>= 0.3.4), readr, rlang, stringr (>= 1.4.0)
Suggests: dplyr, fs, ggplot2, here, knitr, remotes, rmarkdown, testthat (>= 3.0.0), utils
VignetteBuilder: knitr
Config/testthat/edition: 3
NeedsCompilation: no
Packaged: 2022-04-16 13:57:47 UTC; jonathan.bratt
Author: Jonathan Bratt
Maintainer: Jonathan Bratt <jonathan.bratt@macmillan.com>
Repository: CRAN
Date/Publication: 2022-04-16 14:12:29 UTC
morphemepiece: Morpheme Tokenization
Description
Tokenize words into morphemes (the smallest units of meaning).
Determine Vocabulary Casedness
Description
Determine whether or not a wordpiece vocabulary is case-sensitive.
Usage
.infer_case_from_vocab(vocab)
Arguments
vocab: The vocabulary as a character vector.
Details
If none of the tokens in the vocabulary start with a capital letter, the vocabulary is assumed to be uncased. Note that tokens like "[CLS]" contain uppercase letters, but don't start with uppercase letters.
Value
TRUE if the vocabulary is cased, FALSE if uncased.
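A minimal sketch of the rule described above (an illustration only; the internal function may differ in detail): a vocabulary counts as cased if any token begins with a capital letter.

# Sketch of the documented rule, not the package's implementation.
infer_case_sketch <- function(vocab) {
  any(grepl("^[A-Z]", vocab))
}
infer_case_sketch(c("[CLS]", "the", "dog"))  # FALSE: "[CLS]" starts with "["
infer_case_sketch(c("The", "the", "dog"))    # TRUE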
Tokenize an Input Word-by-word
Description
Tokenize an Input Word-by-word
Usage
.mp_tokenize_single_string(words, vocab, lookup, unk_token, max_chars)
Arguments
words: Character; a vector of words (generated by space-tokenizing a single input).
vocab: A morphemepiece vocabulary.
lookup: A morphemepiece lookup table.
unk_token: Token to represent unknown words.
max_chars: Maximum length of word recognized.
Value
A named integer vector of tokenized words.
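A sketch of the word-by-word pattern this function implements, assuming the per-word tokenizer behaves as .mp_tokenize_word_lookup is documented below (the helper name here is hypothetical, not the package's code):

# Each pre-split word is tokenized independently; results are concatenated.
tokenize_each_sketch <- function(words, tokenize_one) {
  unlist(lapply(words, tokenize_one), use.names = TRUE)
}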
Tokenize a Word
Description
Tokenize a single "word" (no whitespace). The word can technically contain punctuation, but typically punctuation has been split off by this point.
Usage
.mp_tokenize_word(
word,
vocab_split,
dir = 1,
allow_compounds = TRUE,
unk_token = "[UNK]",
max_chars = 100
)
Arguments
word: Word to tokenize.
vocab_split: List of character vectors containing vocabulary words. Should have components named "prefixes", "words", "suffixes".
dir: Integer; if 1 (the default), look for tokens starting at the beginning of the word. Otherwise, start at the end.
allow_compounds: Logical; whether to allow multiple whole words in the breakdown.
unk_token: Token to represent unknown words.
max_chars: Maximum length of word recognized.
Details
This is an adaptation of wordpiece:::.tokenize_word. The main difference is that it is designed to work with a morphemepiece vocabulary, which can include prefixes (denoted like "pre##"). As in wordpiece, the algorithm uses a repeated greedy search for the largest piece from the vocabulary found within the word, but here the search can start from either the beginning or the end of the word (controlled by the dir parameter). The input vocabulary must be split into prefixes, suffixes, and "words".
Value
Input word as a list of tokens.
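For concreteness, here is the vocab_split structure the Arguments describe, with made-up tokens, and the greedy walk the Details describe:

# Hypothetical split vocabulary; the component names come from the docs above.
vocab_split <- list(
  prefixes = c("un##", "re##"),
  words    = c("believ", "able", "run"),
  suffixes = c("##able", "##s")
)
# With dir = 1, a greedy pass over "unbelievable" would peel the longest
# matching prefix "un##", then the word "believ", then the suffix "##able".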
Tokenize a Word Bidirectionally
Description
Apply .mp_tokenize_word in both directions and pick the result with fewer pieces.
Usage
.mp_tokenize_word_bidir(
word,
vocab_split,
unk_token,
max_chars,
allow_compounds = TRUE
)
Arguments
word: Character scalar; word to tokenize.
vocab_split: List of character vectors containing vocabulary words. Should have components named "prefixes", "words", "suffixes".
unk_token: Token to represent unknown words.
max_chars: Maximum length of word recognized.
allow_compounds: Logical; whether to allow multiple whole words in the breakdown. Default is TRUE. This option will not be exposed to end users; it is kept here for documentation and development purposes.
Value
Input word as a list of tokens.
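A minimal sketch of the rule in the Description, written against the internal function's documented interface above (the tie-breaking direction is an assumption):

mp_bidir_sketch <- function(word, vocab_split, unk_token, max_chars) {
  fwd <- .mp_tokenize_word(word, vocab_split, dir = 1,
                           unk_token = unk_token, max_chars = max_chars)
  bwd <- .mp_tokenize_word(word, vocab_split, dir = -1,
                           unk_token = unk_token, max_chars = max_chars)
  # Keep the breakdown with fewer pieces; prefer the forward result on
  # ties (assumed tie-break).
  if (length(bwd) < length(fwd)) bwd else fwd
}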
Tokenize a Word Including Lookup
Description
Look the word up in the lookup table; if it is not found, fall back to the greedy tokenization algorithm.
Usage
.mp_tokenize_word_lookup(word, vocab, lookup, unk_token, max_chars)
Arguments
word: Character scalar; word to tokenize.
vocab: A morphemepiece vocabulary.
lookup: A morphemepiece lookup table.
unk_token: Token to represent unknown words.
max_chars: Maximum length of word recognized.
Value
Input word, broken into tokens.
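A sketch of the lookup-first strategy (assuming the lookup maps each word to its space-separated breakdown, consistent with the lookup file format documented under load_lookup below):

lookup_first_sketch <- function(word, lookup, fallback) {
  if (word %in% names(lookup)) {
    strsplit(lookup[[word]], " ")[[1]]  # reuse the saved breakdown
  } else {
    fallback(word)                      # fall back to the greedy algorithm
  }
}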
Constructor for Class morphemepiece_vocabulary
Description
Constructor for Class morphemepiece_vocabulary
Usage
.new_morphemepiece_vocabulary(vocab, vocab_split, is_cased)
Arguments
vocab: Character vector; the "actual" vocabulary.
vocab_split: List of character vectors; the split vocabulary.
is_cased: Logical; whether the vocabulary is cased.
Value
The vocabulary with is_cased attached as an attribute and the class morphemepiece_vocabulary applied. The split vocabulary is also attached as an attribute.
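The usual S3 constructor shape implied by that Value (a sketch; the attribute name used for the split vocabulary is an assumption):

new_mp_vocabulary_sketch <- function(vocab, vocab_split, is_cased) {
  structure(
    vocab,
    is_cased    = is_cased,
    vocab_split = vocab_split,  # attribute name assumed
    class       = c("morphemepiece_vocabulary", class(vocab))
  )
}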
Process a Morphemepiece Vocabulary for Tokenization
Description
Process a Morphemepiece Vocabulary for Tokenization
Usage
.process_mp_vocab(v)
## Default S3 method:
.process_mp_vocab(v)
## S3 method for class 'morphemepiece_vocabulary'
.process_mp_vocab(v)
## S3 method for class 'integer'
.process_mp_vocab(v)
## S3 method for class 'character'
.process_mp_vocab(v)
Arguments
v: An object of class morphemepiece_vocabulary (methods are also provided for character and integer vectors).
Value
A character vector of tokens for tokenization.
Validator for Objects of Class morphemepiece_vocabulary
Description
Validator for Objects of Class morphemepiece_vocabulary
Usage
.validate_morphemepiece_vocabulary(vocab)
Arguments
vocab: A morphemepiece_vocabulary object to validate.
Value
vocab, if the object passes the checks; otherwise, aborts with an informative message.
Load a morphemepiece lookup file
Description
Usually you will want to use the included lookup that can be accessed via morphemepiece_lookup(). This function can be used to load a different lookup from a file.
Usage
load_lookup(lookup_file)
Arguments
lookup_file: Path to the lookup file. The file is assumed to be a text file, with one word per line. The lookup value, if different from the word, follows the word on the same line, after a space.
Value
The lookup as a named character vector. Names are the words in the lookup.
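For concreteness, here is a made-up lookup file in that format and a sketch of parsing it (load_lookup's actual parsing may differ):

lines <- c(
  "running run ##ing",  # word, then its breakdown after a space
  "dogs dog ##s",
  "cat"                 # no value: the word maps to itself
)
words  <- sub(" .*$", "", lines)
values <- ifelse(grepl(" ", lines), sub("^[^ ]+ ", "", lines), words)
lookup <- setNames(values, words)
lookup["running"]  # "run ##ing"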
Load a lookup file, or retrieve from cache
Description
Usually you will want to use the included lookup that can be accessed via morphemepiece_lookup(). This function can be used to load (and cache) a different lookup from a file.
Usage
load_or_retrieve_lookup(lookup_file)
Arguments
lookup_file: Path to the lookup file. The file is assumed to be a text file, with one word per line. The lookup value, if different from the word, follows the word on the same line, after a space.
Value
The lookup table as a named character vector.
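Given the package's memoise import, the caching presumably follows this general pattern (a sketch, not the actual implementation, which may also cache to disk via dlr):

library(memoise)
load_or_retrieve_sketch <- memoise(load_lookup)
# First call reads the file; repeated calls with the same path hit the cache.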
Load a vocabulary file, or retrieve from cache
Description
Usually you will want to use the included vocabulary that can be accessed via morphemepiece_vocab(). This function can be used to load (and cache) a different vocabulary from a file.
Usage
load_or_retrieve_vocab(vocab_file)
Arguments
vocab_file: Path to the vocabulary file. The file is assumed to be a text file, with one token per line, with the line number (starting at zero) corresponding to the index of that token in the vocabulary.
Value
The vocab as a character vector of tokens. The casedness of the vocabulary is inferred and attached as the "is_cased" attribute. The vocabulary indices are taken to be the positions of the tokens, starting at zero for historical consistency.
Note that from the perspective of a neural net, the numeric indices are the tokens, and the mapping from token to index is fixed. If we changed the indexing, it would break any pre-trained models using that vocabulary.
Load a vocabulary file
Description
Usually you will want to use the included vocabulary that can be accessed via morphemepiece_vocab(). This function can be used to load a different vocabulary from a file.
Usage
load_vocab(vocab_file)
Arguments
vocab_file: Path to the vocabulary file. The file is assumed to be a text file, with one token per line, with the line number (starting at zero) corresponding to the index of that token in the vocabulary.
Value
The vocab as a character vector of tokens. The casedness of the vocabulary is inferred and attached as the "is_cased" attribute. The vocabulary indices are taken to be the positions of the tokens, starting at zero for historical consistency.
Note that from the perspective of a neural net, the numeric indices are the tokens, and the mapping from token to index is fixed. If we changed the indexing, it would break any pre-trained models using that vocabulary.
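The zero-based indexing works out to "R position minus one". For example:

vocab_tokens <- c("[PAD]", "[UNK]", "the")  # ids 0, 1, 2 in file order
match("the", vocab_tokens) - 1              # 2: the token's fixed id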
Retrieve Directory for Morphemepiece Cache
Description
The morphemepiece cache directory is a platform- and user-specific path where morphemepiece saves caches (such as a downloaded lookup). You can override the default location in a few ways:
- Option: morphemepiece.dir. Use set_morphemepiece_cache_dir() to set a specific cache directory for this session.
- Environment variable: MORPHEMEPIECE_CACHE_DIR. Set this environment variable to specify a morphemepiece cache directory for all sessions.
- Environment variable: R_USER_CACHE_DIR. Set this environment variable to specify a cache directory root for all packages that use the caching system.
Usage
morphemepiece_cache_dir()
Value
A character vector with the normalized path to the cache.
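The overrides above, from most to least session-specific (putting the variables in ~/.Renviron is the standard way to persist them):

set_morphemepiece_cache_dir("~/my_mp_cache")  # this session only
# In ~/.Renviron, for all sessions:
#   MORPHEMEPIECE_CACHE_DIR=~/my_mp_cache
#   R_USER_CACHE_DIR=~/.cache
morphemepiece_cache_dir()  # returns the normalized path now in effect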
Tokenize Sequence with Morpheme Pieces
Description
Given a single sequence of text and a morphemepiece vocabulary, tokenizes the text.
Usage
morphemepiece_tokenize(
text,
vocab = morphemepiece_vocab(),
lookup = morphemepiece_lookup(),
unk_token = "[UNK]",
max_chars = 100
)
Arguments
text: Character scalar; text to tokenize.
vocab: A morphemepiece vocabulary.
lookup: A morphemepiece lookup table.
unk_token: Token to represent unknown words.
max_chars: Maximum length of word recognized.
Value
A character vector of tokenized text. (Later, this should become a named integer vector, as in the wordpiece package.)
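Basic usage with the bundled vocabulary and lookup (requires the morphemepiece.data package; the breakdown shown is illustrative, not a guaranteed output):

library(morphemepiece)
morphemepiece_tokenize("Morphemepiece is unbelievable!")
# Might break "unbelievable" into pieces like "un##" "believe" "##able".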
Format a Token List as a Vocabulary
Description
We use a character vector with class morphemepiece_vocabulary to provide information about tokens used in morphemepiece_tokenize(). This function takes a character vector of tokens and puts it into that format.
Usage
prepare_vocab(token_list)
Arguments
token_list: A character vector of tokens.
Value
The vocab as a character vector of tokens. The casedness of the vocabulary is inferred and attached as the "is_cased" attribute. The vocabulary indices are taken to be the positions of the tokens, starting at zero for historical consistency.
Note that from the perspective of a neural net, the numeric indices are the tokens, and the mapping from token to index is fixed. If we changed the indexing, it would break any pre-trained models using that vocabulary.
Examples
my_vocab <- prepare_vocab(c("some", "example", "tokens"))
class(my_vocab)             # includes "morphemepiece_vocabulary"
attr(my_vocab, "is_cased")  # FALSE: no token starts with a capital letter
Objects exported from other packages
Description
These objects are imported from the following packages; see those packages for their documentation.
- fastmatch
- magrittr
- morphemepiece.data
- rlang
Set a Cache Directory for Morphemepiece
Description
Use this function to override the cache path used by morphemepiece for the current session. Set the MORPHEMEPIECE_CACHE_DIR environment variable for a more permanent change.
Usage
set_morphemepiece_cache_dir(cache_dir = NULL)
Arguments
cache_dir: Character scalar; a path to a cache directory.
Value
A normalized path to a cache directory. The directory is created if the user has write access and the directory does not exist.
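For example, to keep this session's downloads in a throwaway directory:

set_morphemepiece_cache_dir(file.path(tempdir(), "morphemepiece"))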