Type: Package
Title: Linguistic Matching and Accommodation
Version: 1.0.7
Author: Micah Iserman
Maintainer: Micah Iserman <micah.iserman@gmail.com>
Description: Measure similarity between texts. Offers a variety of processing tools and similarity metrics to facilitate flexible representation of texts and matching. Implements forms of Language Style Matching (Ireland & Pennebaker, 2010) <doi:10.1037/a0020386> and Latent Semantic Analysis (Landauer & Dumais, 1997) <doi:10.1037/0033-295X.104.2.211>.
URL: https://miserman.github.io/lingmatch/
BugReports: https://github.com/miserman/lingmatch/issues
Depends: R (>= 3.5), methods, Matrix
Imports: Rcpp, RcppParallel
License: GPL-2 | GPL-3 [expanded from: GPL (>= 2)]
RoxygenNote: 7.3.1
Suggests: knitr, rmarkdown, splot, testthat (>= 2.1.0)
LinkingTo: Rcpp, RcppParallel, BH
NeedsCompilation: yes
Packaged: 2024-05-03 16:17:58 UTC; Admin
Repository: CRAN
Date/Publication: 2024-05-03 16:30:02 UTC
Assess Dictionary Categories Within a Latent Semantic Space
Description
Assess Dictionary Categories Within a Latent Semantic Space
Usage
dictionary_meta(dict, space = "auto", n_spaces = 5, suggest = FALSE,
suggestion_terms = 10, suggest_stopwords = FALSE,
suggest_discriminate = TRUE, expand_cutoff_freq = 0.98,
expand_cutoff_spaces = 10, dimension_prop = 1, pairwise = TRUE,
glob = TRUE, space_dir = getOption("lingmatch.lspace.dir"),
verbose = TRUE)
Arguments
dict: A vector of terms, list of such vectors, or a matrix-like object to be categorized by read.dic.
space: A vector space used to calculate similarities between terms. Names of spaces (see select.lspace), a matrix with terms as row names, or "auto" to select a space based on matched terms.
n_spaces: Number of spaces to draw from if space is "auto".
suggest: Logical; if TRUE, will search for terms to suggest adding to each category.
suggestion_terms: Number of terms to use when selecting suggested additions.
suggest_stopwords: Logical; if TRUE, will include function words among suggestions.
suggest_discriminate: Logical; if TRUE, will discount suggested terms that are similar to other categories.
expand_cutoff_freq: Proportion of mapped terms to include when expanding dictionary terms. Applies when space is a character (referring to a space to be loaded).
expand_cutoff_spaces: Number of spaces in which a term has to appear to be considered for expansion. Applies when space is a character (referring to a space to be loaded).
dimension_prop: Proportion of dimensions to use when searching for suggested additions, where less than 1 will calculate similarities to the category core using fewer dimensions of the space.
pairwise: Logical; if FALSE, will compare candidate terms with a single, averaged category vector rather than with each category term.
glob: Logical; if TRUE, converts glob-style asterisks in terms to regular expressions.
space_dir: Directory from which space should be loaded.
verbose: Logical; if FALSE, will not show status messages.
Value
A list:
- expanded: A version of dict with fuzzy terms expanded.
- summary: A summary of each dictionary category.
- terms: Match (expanded term) similarities within terms and categories.
- suggested: If suggest is TRUE, a list with suggested additions for each dictionary category. Each entry is a named numeric vector with similarities for each suggested term.
See Also
To just expand fuzzy terms, see report_term_matches().
Similar information is provided in the dictionary builder web tool.
Other Dictionary functions: download.dict(), lma_patcat(), lma_termcat(), read.dic(), report_term_matches(), select.dict()
Examples
if (dir.exists("~/Latent Semantic Spaces")) {
dict <- list(
furniture = c("table", "chair", "desk*", "couch*", "sofa*"),
well_adjusted = c("happy", "bright*", "friend*", "she", "he", "they")
)
dictionary_meta(dict, space_dir = "~/Latent Semantic Spaces")
}
Download Dictionaries
Description
Downloads the specified dictionaries from osf.io/y6g5b.
Usage
download.dict(dict = "lusi", check.md5 = TRUE, mode = "wb",
dir = getOption("lingmatch.dict.dir"), overwrite = FALSE)
Arguments
dict: One or more names of dictionaries to download, or "all" for all available.
check.md5: Logical; if TRUE (default), checks the downloaded file's MD5 checksum against the expected value.
mode: A character specifying the file write mode; default is 'wb'. See download.file.
dir: Directory in which to save the dictionary; defaults to getOption("lingmatch.dict.dir").
overwrite: Logical; if TRUE, will replace any existing files.
Value
Path to the downloaded dictionary, or a list of such if multiple were downloaded.
See Also
Other Dictionary functions: dictionary_meta(), lma_patcat(), lma_termcat(), read.dic(), report_term_matches(), select.dict()
Examples
## Not run:
download.dict("lusi", dir = "~/Dictionaries")
## End(Not run)
Download Latent Semantic Spaces
Description
Downloads the specified semantic space from osf.io/489he.
Usage
download.lspace(space = "100k_lsa", decompress = TRUE, check.md5 = TRUE,
mode = "wb", dir = getOption("lingmatch.lspace.dir"),
overwrite = FALSE)
Arguments
space: Name of one or more spaces you want to download, or "all" for all available; "100k_lsa" by default.
decompress: Logical; if TRUE (default), decompresses the downloaded file.
check.md5: Logical; if TRUE (default), checks the downloaded file's MD5 checksum against the expected value.
mode: A character specifying the file write mode; default is 'wb'. See download.file.
dir: Directory in which to save the space. Specify this here, or set the lspace directory option (e.g., options(lingmatch.lspace.dir = "~/Latent Semantic Spaces")), or use lma_initdirs.
overwrite: Logical; if TRUE, will replace any existing files.
Value
A character vector with paths to the [1] data and [2] term files.
See Also
Other Latent Semantic Space functions: lma_lspace(), select.lspace(), standardize.lspace()
Examples
## Not run:
download.lspace("glove_crawl", dir = "~/Latent Semantic Spaces")
## End(Not run)
Linguistic Matching and Accommodation
Description
Offers a variety of methods to assess linguistic matching or accommodation, where matching is general similarity (sometimes called homophily), and accommodation is some form of conditional similarity (accounting for some base-rate or precedent; sometimes called alignment).
Usage
lingmatch(input = NULL, comp = mean, data = NULL, group = NULL, ...,
comp.data = NULL, comp.group = NULL, order = NULL, drop = FALSE,
all.levels = FALSE, type = "lsm")
Arguments
input: Texts to be compared; a vector, document-term matrix (dtm; with terms as column names), or path to a file (.txt or .csv, with texts separated by one or more lines/rows).
comp: Defines the comparison to be made: a function to aggregate input texts (mean by default), a number or logical/numeric vector identifying particular texts or rows, a character partially matching 'sequential' (to compare consecutive texts; see the Grouping and Comparisons section) or 'auto' (to select the most correlated of the built-in LSM profiles), or a vector or matrix-like object (such as a precalculated profile) to compare texts with directly.
data: A matrix-like object as a reference for column names, if variables are referred to in other arguments (e.g., lingmatch(text, data = data)).
group: A logical or factor-like vector the same length as input, used to define groups.
...: Passes arguments to lma_dtm, lma_weight, lma_termcat, and/or lma_lspace (e.g., dict, space, weight), and lma_simets (e.g., metric).
comp.data: A matrix-like object as a source for comp variables.
comp.group: The column name of the grouping variable(s) in comp.data; if group contains references to column names, and comp.group is not specified, group variables will be looked for in comp.data.
order: A numeric vector the same length as input, used to sort texts before sequential comparisons.
drop: logical; if TRUE, will drop columns with a sum of 0.
all.levels: logical; if FALSE, multiple group variables are combined before splitting; if TRUE, splits are made by each group variable in sequence.
type: A character at least partially matching 'lsm' or 'lsa'; applies default settings aligning with the standard calculations of each type: 'lsm' weights the dtm as percentages, categorizes it with the function-word categories from lma_dict(1:9), and compares profiles with the Canberra-based similarity metric; 'lsa' weights the dtm with tf-idf, maps it onto the "100k_lsa" latent semantic space, and compares documents with cosine similarity.
Details
There are a great many points of decision in the assessment of linguistic similarity and/or accommodation, partly inherited from the great many points of decision inherent in the numerical representation of language. Two general types of matching are implemented here as sets of defaults: Language/Linguistic Style Matching (LSM; Niederhoffer & Pennebaker, 2002; Ireland & Pennebaker, 2010), and Latent Semantic Analysis/Similarity (LSA; Landauer & Dumais, 1997; Babcock, Ta, & Ickes, 2014). See the type argument for specifics.
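For instance, the two sets of defaults can be applied to the same texts (a minimal sketch; note that type = "lsa" uses the "100k_lsa" space, which would need to be downloaded first, e.g., with download.lspace()):
texts <- c("an example text", "another example text")
lingmatch(texts, type = "lsm")
lingmatch(texts, type = "lsa")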
Value
A list with processed components of the input, information about the comparison, and results of the comparison:
- dtm: A sparse matrix; the raw count-dtm, or a version of the original input if it is more processed.
- processed: A matrix-like object; a processed version of the input (e.g., weighted and categorized).
- comp.type: A string describing the comparison if applicable.
- comp: A vector or matrix-like object; the comparison data if applicable.
- group: A string describing the group if applicable.
- sim: Result of lma_simets.
Grouping and Comparisons
Defining groups and comparisons can sometimes be a bit complicated, and requires dataset-specific knowledge, so it can't always (readily) be done automatically. Variables entered in the group argument are treated differently depending on their position and other arguments:
- Splitting: By default, groups are treated as if they define separate chunks of data in which comparisons should be calculated. Functions used to calculate comparisons, and pairwise comparisons, are performed separately in each of these groups. For example, if you wanted to compare each text with the mean of all texts in its condition, a group variable could identify and split by condition (see the sketch after this list). Given multiple grouping variables, calculations will either be done in each split (if all.levels = TRUE; applied in sequence so that groups become smaller and smaller), or once after all splits are made (if all.levels = FALSE). This makes for 'one to many' comparisons with either calculated or preexisting standards (i.e., the profile of the current data, or a precalculated profile, respectively).
- Comparison ID: When comparison data is identified in comp, groups are assumed to apply to both input and comp (either both in data, or separately between data and comp.data, in which case comp.group may be needed if the same grouping variable has different names between data and comp.data). In this case, multiple grouping variables are combined into a single factor assumed to uniquely identify a comparison. This makes for 'one to many' comparisons with specific texts (as in the case of manipulated prompts or text-based conditions).
- Speaker ID: If comp matches 'sequential', the last grouping variable entered is assumed to identify something like speakers (i.e., a factor with two or more levels and multiple observations per level). In this case, the data are assumed to be ordered (or ordered once sorted by order if specified). Any additional grouping variables before the last are treated as splitting groups. This can set up for probabilistic accommodation metrics. At the moment, when sequential comparisons are made within groups, similarity scores between speakers are averaged, resulting in mean matching between speakers within the group.
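As a minimal sketch of the splitting case, with made-up texts and conditions, each text here is compared with the mean of the texts in its condition:
texts <- c("an example text", "another example text", "a third text", "a fourth text")
condition <- c("a", "a", "b", "b")
lingmatch(texts, mean, group = condition)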
References
Babcock, M. J., Ta, V. P., & Ickes, W. (2014). Latent semantic similarity and language style matching in initial dyadic interactions. Journal of Language and Social Psychology, 33, 78-88.
Ireland, M. E., & Pennebaker, J. W. (2010). Language style matching in writing: synchrony in essays, correspondence, and poetry. Journal of Personality and Social Psychology, 99, 549.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104, 211.
Niederhoffer, K. G., & Pennebaker, J. W. (2002). Linguistic style matching in social interaction. Journal of Language and Social Psychology, 21, 337-360.
See Also
For a general text processing function, see lma_process().
Examples
# compare single strings
lingmatch("Compare this sentence.", "With this other sentence.")
# compare each entry in a character vector with...
texts <- c(
"One bit of text as an entry...",
"Maybe multiple sentences in an entry. Maybe essays or posts or a book.",
"Could be lines or a column from a read-in file..."
)
## one another
lingmatch(texts)
## the first
lingmatch(texts, 1)
## the next
lingmatch(texts, "seq")
## the set average
lingmatch(texts, mean)
## other entries in a group
lingmatch(texts, group = c("a", "a", "b"))
## one another, without stop words
lingmatch(texts, exclude = "function")
## a standard average (based on function words)
lingmatch(texts, "auto", dict = lma_dict(1:9))
English Function Word Category and Special Character Lists
Description
Returns a list of function words based on the Linguistic Inquiry and Word Count 2015 dictionary (in terms of category names – words were selected independently), or a list of special characters and patterns.
Usage
lma_dict(..., as.regex = TRUE, as.function = FALSE)
Arguments
...: Numbers or letters corresponding to category names: ppron, ipron, article, adverb, conj, prep, auxverb, negate, quant, interrog, number, interjection, or special.
as.regex: Logical: if FALSE, term lists are returned without regular expression markup.
as.function: Logical or a function: if specified and not FALSE, returns a function which checks input terms against the selected categories, using the function entered as as.function (grepl by default).
Value
A list with a vector of terms for each category, or (when as.function = TRUE) a function which accepts an initial "terms" argument (a character vector), and any additional arguments determined by the function entered as as.function (grepl by default).
Note
The special category is not returned unless specifically requested. It is a list of regular expression strings attempting to capture special things like ellipses and emojis, or sets of special characters (those outside of the Basic Latin range; [^\u0020-\u007F]), which can be used for character conversions. If special is part of the returned list, as.regex is set to TRUE.
The special list is always used by both lma_dtm and lma_termcat. When creating a dtm, special is used to clean the original input (so that, by default, the punctuation involved in ellipses and emojis is treated differently – as ellipses and emojis rather than as periods and parens and colons and such). When categorizing a dtm, the input dictionary is passed through the special lists to be sure the terms in the dtm match up with the dictionary (so, for example, ": (" would be replaced with "repfrown" in both the text and dictionary).
See Also
To score texts with these categories, use lma_termcat().
Examples
# return the full dictionary (excluding special)
lma_dict()
# return the standard 7 LSM categories
lma_dict(1:7)
# return just a few categories without regular expression
lma_dict(neg, ppron, aux, as.regex = FALSE)
# return special specifically
lma_dict(special)
# returning a function
is.ppron <- lma_dict(ppron, as.function = TRUE)
is.ppron(c("i", "am", "you", "were"))
in.lsmcat <- lma_dict(1:7, as.function = TRUE)
in.lsmcat(c("a", "frog", "for", "me"))
## use as a stopword filter
is.stopword <- lma_dict(as.function = TRUE)
dtm <- lma_dtm("Most of these words might not be all that relevant.")
dtm[, !is.stopword(colnames(dtm))]
## use to replace special characters
clean <- lma_dict(special, as.function = gsub)
clean(c(
"\u201Ccurly quotes\u201D", "na\u00EFve", "typographer\u2019s apostrophe",
"en\u2013dash", "em\u2014dash"
))
Document-Term Matrix Creation
Description
Creates a document-term matrix (dtm) from a set of texts.
Usage
lma_dtm(text, exclude = NULL, context = NULL, replace.special = FALSE,
numbers = FALSE, punct = FALSE, urls = TRUE, emojis = FALSE,
to.lower = TRUE, word.break = " +", dc.min = 0, dc.max = Inf,
sparse = TRUE, tokens.only = FALSE)
Arguments
text: Texts to be processed. This can be a vector (such as a column in a data frame) or list. When a list, these can be in the form returned with tokens.only = TRUE, or a list with named vectors, where names are tokens and values are frequencies or the like.
exclude: A character vector of words to be excluded. If exclude is a single string matching 'function', lma_dict(1:9) will be used.
context: A character vector used to reformat text based on look-ahead/behind. For example, you might attempt to disambiguate like by reformatting certain likes (e.g., context = c('(i) like*', '(you) like*'), where words in parentheses are the unaltered context, and asterisks mark the words to be reformatted).
replace.special: Logical: if TRUE, special characters are replaced with regular equivalents using the lma_dict special list.
numbers: Logical: if TRUE, numbers are preserved as terms.
punct: Logical: if TRUE, punctuation marks are preserved as terms.
urls: Logical: if FALSE, removes urls before processing.
emojis: Logical: if TRUE, emojis (e.g., ':)') are converted to tokens (e.g., 'repsmile').
to.lower: Logical: if FALSE, retains case (otherwise, text is converted to lower case).
word.break: A regular expression string determining the way words are split. Default is ' +', which breaks words at one or more blank spaces.
dc.min: Numeric: excludes terms appearing in the set number or fewer documents. Default is 0 (no limit).
dc.max: Numeric: excludes terms appearing in the set number of documents or more. Default is Inf (no limit).
sparse: Logical: if FALSE, a regular, dense matrix is returned.
tokens.only: Logical: if TRUE, returns a list rather than a matrix, with entries for unique tokens, their frequencies, word counts, and token indices per text (see the Value section).
Value
A sparse matrix (or regular matrix if sparse = FALSE), with a row per text and a column per term, or a list if tokens.only = TRUE. Includes an attribute with options (opts), and attributes with word count (WC) and column sums (colsums) if tokens.only = FALSE.
Note
This is a relatively simple way to make a dtm. To calculate the (more or less) standard forms of LSM and LSS, a somewhat raw dtm should be fine, because both processes essentially use dictionaries (obviating stemming) and weighting or categorization (largely obviating 'stop word' removal). The exact effect of additional processing will depend on the dictionary/semantic space and weighting scheme used (particularly for LSA). This function also does some processing which may matter if you plan on categorizing with categories that have terms with look-ahead/behind assertions (like LIWC dictionaries). Otherwise, other methods may be faster, more memory efficient, and/or more featureful.
Examples
text <- c(
"Why, hello there! How are you this evening?",
"I am well, thank you for your inquiry!",
"You are a most good at social interactions person!",
"Why, thank you! You're not all bad yourself!"
)
lma_dtm(text)
# return tokens only
(tokens <- lma_dtm(text, tokens.only = TRUE))
## convert those to a regular DTM
lma_dtm(tokens)
# convert a list-representation to a sparse matrix
lma_dtm(list(
doc1 = c(why = 1, hello = 1, there = 1),
doc2 = c(i = 1, am = 1, well = 1)
))
Initialize Directories for Dictionaries and Latent Semantic Spaces
Description
Creates directories for dictionaries and latent semantic spaces if needed, sets them as the
lingmatch.dict.dir
and lingmatch.lspace.dir
options if they are not already set,
and creates links to them in their expected locations ('~/Dictionaries'
and
'~/Latent Semantic Spaces'
) by default if applicable.
Usage
lma_initdirs(base = "", dict = "Dictionaries",
lspace = "Latent Semantic Spaces", link = TRUE)
Arguments
base: Path to a directory in which to create the dict and lspace subdirectories.
dict: Path to the dictionaries directory relative to base; "Dictionaries" by default.
lspace: Path to the latent semantic spaces directory relative to base; "Latent Semantic Spaces" by default.
link: Logical; if TRUE (default), creates links from the expected locations to the created directories, if they differ.
Value
Paths to the [1] dictionaries and [2] latent semantic space directories, or a single path if only dict or lspace is specified.
Examples
## Not run:
# set up the expected dictionary and latent semantic space directories
lma_initdirs("~")
# set up directories elsewhere, and links to the expected locations
lma_initdirs("d:")
# point options and create links to preexisting directories
lma_initdirs("~/NLP_Resources", "Dicts", "Dicts/Embeddings")
# create just a dictionaries directory and set the
# lingmatch.dict.dir option without creating a link
lma_initdirs(dict = "z:/external_dictionaries", link = FALSE)
## End(Not run)
Latent Semantic Space (Embeddings) Operations
Description
Map a document-term matrix onto a latent semantic space, extract terms from a latent semantic space (if dtm is a character vector, or map.space = FALSE), or perform a singular value decomposition of a document-term matrix (if dtm is a matrix and space is missing).
Usage
lma_lspace(dtm = "", space, map.space = TRUE, fill.missing = FALSE,
term.map = NULL, dim.cutoff = 0.5, keep.dim = FALSE,
use.scan = FALSE, dir = getOption("lingmatch.lspace.dir"))
Arguments
dtm: A matrix with terms as column names, or a character vector of terms to be extracted from a specified space. If this is of length 1 and space is missing, it will be treated as space.
space: A matrix with terms as rownames. If missing, this will be the right singular vectors of a singular value decomposition of dtm. If a character, a file matching the name will be searched for in dir (see select.lspace).
map.space: Logical: if FALSE, returns the selected space without mapping dtm onto it.
fill.missing: Logical: if TRUE and terms are being extracted from a space, adds rows of 0s for terms not found in the space, such that the returned space has a row for every requested term.
term.map: A matrix with space names as column names, terms as row names, and indices as values (like the term_map from select.lspace), used to locate terms in a stored space.
dim.cutoff: If a space is being calculated, the proportion of variance (squared singular values) used to decide on the number of dimensions to retain (e.g., the default of .5 retains the dimensions accounting for half of the variance).
keep.dim: Logical: if TRUE and a space is being calculated, a version of the dtm with terms as columns is returned, just with fewer dimensions informing its values.
use.scan: Logical: if TRUE, reads in space files with scan rather than the default reader.
dir: Path to a folder containing spaces; default is getOption("lingmatch.lspace.dir").
Value
A matrix or sparse matrix with either (a) a row per term and column per latent dimension (a latent space, either calculated from the input, or retrieved when map.space = FALSE), (b) a row per document and column per latent dimension (when a dtm is mapped to a space), or (c) a row per document and column per term (when a space is calculated and keep.dim = TRUE).
Note
A traditional latent semantic space is a selection of right singular vectors from the singular value decomposition of a dtm (svd(dtm)$v[, 1:k], where k is the selected number of dimensions, decided here by dim.cutoff).
Mapping a new dtm into a latent semantic space consists of multiplying common terms: dtm[, ct] %*% space[ct, ], where ct = colnames(dtm)[colnames(dtm) %in% rownames(space)] – the terms common between the dtm and the space. This results in a matrix with documents as rows, and dimensions as columns, replacing terms.
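This mapping can also be performed manually; a minimal sketch with toy objects, where the space is just random vectors:
dtm <- matrix(c(1, 0, 2, 1), 2, dimnames = list(NULL, c("cat", "car")))
space <- matrix(rnorm(9), 3, dimnames = list(c("cat", "car", "cart"), NULL))
ct <- colnames(dtm)[colnames(dtm) %in% rownames(space)]
dtm[, ct] %*% space[ct, ] # documents (rows) by latent dimensions (columns)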
See Also
Other Latent Semantic Space functions: download.lspace(), select.lspace(), standardize.lspace()
Examples
text <- c(
paste(
"Hey, I like kittens. I think all kinds of cats really are just the",
"best pet ever."
),
paste(
"Oh year? Well I really like cars. All the wheels and the turbos...",
"I think that's the best ever."
),
paste(
"You know what? Poo on you. Cats, dogs, rabbits -- you know, living",
"creatures... to think you'd care about anything else!"
),
paste(
"You can stick to your opinion. You can be wrong if you want. You know",
"what life's about? Supercharging, diesel guzzling, exhaust spewing,",
"piston moving ignitions."
)
)
dtm <- lma_dtm(text)
# calculate a latent semantic space from the example text
lss <- lma_lspace(dtm)
# show that document similarities between the truncated and full space are the same
spaces <- list(
full = lma_lspace(dtm, keep.dim = TRUE),
truncated = lma_lspace(dtm, lss)
)
sapply(spaces, lma_simets, metric = "cosine")
## Not run:
# specify a directory containing spaces,
# or where you would like to download spaces
space_dir <- "~/Latent Semantic Spaces"
# map to a pretrained space
ddm <- lma_lspace(dtm, "100k", dir = space_dir)
# load the matching subset of the space
# without mapping
lss_100k_part <- lma_lspace(colnames(dtm), "100k", dir = space_dir)
## or
lss_100k_part <- lma_lspace(dtm, "100k", map.space = FALSE, dir = space_dir)
# load the full space
lss_100k <- lma_lspace("100k", dir = space_dir)
## or
lss_100k <- lma_lspace(space = "100k", dir = space_dir)
## End(Not run)
Calculate Text-Based Metastatistics
Description
Calculate simple descriptive statistics from text.
Usage
lma_meta(text)
Arguments
text |
A character vector of texts. |
Value
A data.frame:
- characters: Total number of characters.
- syllables: Total number of syllables, as estimated by split length of 'a+[eu]*|e+a*|i+|o+[ui]*|u+|y+[aeiou]*' - 1.
- words: Total number of words (raw word count).
- unique_words: Number of unique words (binary word count).
- clauses: Number of clauses, as marked by commas, colons, semicolons, dashes, or brackets within sentences.
- sentences: Number of sentences, as marked by periods, question marks, exclamation points, or new line characters.
- words_per_clause: Average number of words per clause.
- words_per_sentence: Average number of words per sentence.
- sixltr: Number of words 6 or more characters long.
- characters_per_word: Average number of characters per word (characters / words).
- syllables_per_word: Average number of syllables per word (syllables / words).
- type_token_ratio: Ratio of unique to total words: unique_words / words.
- reading_grade: Flesch-Kincaid grade level: .39 * words / sentences + 11.8 * syllables / words - 15.59.
- numbers: Number of terms starting with numbers.
- punct: Number of terms starting with non-alphanumeric characters.
- periods: Number of periods.
- commas: Number of commas.
- qmarks: Number of question marks.
- exclams: Number of exclamation points.
- quotes: Number of quotation marks (single and double).
- apostrophes: Number of apostrophes, defined as any modified letter apostrophe, or backtick or single straight or curly quote surrounded by letters.
- brackets: Number of bracketing characters (including parentheses, and square, curly, and angle brackets).
- orgmarks: Number of characters used for organization or structuring (including dashes, forward slashes, colons, and semicolons).
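The derived columns can be reproduced from the count columns; for example, the reading grade of a text (a minimal check of the Flesch-Kincaid formula above):
meta <- lma_meta("Why, hello there! How are you this evening?")
meta$reading_grade
with(meta, .39 * words / sentences + 11.8 * syllables / words - 15.59)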
Examples
text <- c(
succinct = "It is here.",
verbose = "Hear me now. I shall tell you about it. It is here. Do you hear?",
couched = "I might be wrong, but it seems to me that it might be here.",
bigwords = "Object located thither.",
excited = "It's there! It's there! It's there!",
drippy = "It's 'there', right? Not 'here'? 'there'? Are you Sure?",
struggly = "It's here -- in that place where it is. Like... the 1st place (here)."
)
lma_meta(text)
Categorize Texts
Description
Categorize raw texts using a pattern-based dictionary.
Usage
lma_patcat(text, dict = NULL, pattern.weights = "weight",
pattern.categories = "category", bias = NULL, to.lower = TRUE,
return.dtm = FALSE, drop.zeros = FALSE, exclusive = TRUE,
boundary = NULL, fixed = TRUE, globtoregex = FALSE,
name.map = c(intname = "_intercept", term = "term"),
dir = getOption("lingmatch.dict.dir"))
Arguments
text: A vector of text to be categorized. Texts are padded by 2 spaces, and potentially lowercased.
dict: At least a vector of terms (patterns), usually a matrix-like object with columns for terms, categories, and weights.
pattern.weights: A vector of weights corresponding to terms in dict, or the name of a column in dict containing weights; "weight" by default.
pattern.categories: A vector of category names corresponding to terms in dict, or the name of a column in dict containing categories; "category" by default.
bias: A constant to add to each category after weighting and summing. Can be a vector with names corresponding to the unique values in dict's category column, or it can be included in dict as a term matching the intname entry of name.map.
to.lower: Logical indicating whether text should be converted to lower case before processing.
return.dtm: Logical; if TRUE, a term-level document-term matrix is returned, rather than a matrix of category sums.
drop.zeros: logical; if TRUE, categories or terms with no matches are removed.
exclusive: Logical; if FALSE, each dictionary term is searched for in the original text, so overlapping terms can all match; by default, matched text is removed as terms are matched, with terms sorted by length so that longer terms are matched first.
boundary: A string to add to the beginning and end of each dictionary term. If TRUE, boundary will be set to " " (a space), avoiding within-word matches.
fixed: Logical; if FALSE, terms are treated as regular expressions.
globtoregex: Logical; if TRUE, glob-style asterisks in terms are converted to regular expressions.
name.map: A named character vector: intname, the name of a term representing the category bias ("_intercept" by default), and term, the name of the column in dict containing terms ("term" by default). Missing names are added, so names can be specified positionally (e.g., c('_int', 'terms')) or partially (e.g., c(term = 'terms')).
dir: Path to a folder in which to look for dict if it is the name of a file to be read in (see read.dic).
Value
A matrix with a row per text and columns per dictionary category, or (when return.dtm = TRUE) a sparse matrix with a row per text and column per term. Includes a WC attribute with original word counts, and a categories attribute with row indices associated with each category if return.dtm = TRUE.
See Also
For applying term-based dictionaries (to a document-term matrix) see lma_termcat().
Other Dictionary functions: dictionary_meta(), download.dict(), lma_termcat(), read.dic(), report_term_matches(), select.dict()
Examples
# example text
text <- c(
paste(
"Oh, what youth was! What I had and gave away.",
"What I took and spent and saw. What I lost. And now? Ruin."
),
paste(
"God, are you so bored?! You just want what's gone from us all?",
"I miss the you that was too. I love that you."
),
paste(
"Tomorrow! Tomorrow--nay, even tonight--you wait, as I am about to change.",
"Soon I will off to revert. Please wait."
)
)
# make a document-term matrix with pre-specified terms only
lma_patcat(text, c("bored?!", "i lo", ". "), return.dtm = TRUE)
# get counts of sets of letters
lma_patcat(text, list(c("a", "b", "c"), c("d", "e", "f")))
# same thing with regular expressions
lma_patcat(text, list("[abc]", "[def]"), fixed = FALSE)
# match only words
lma_patcat(text, list("i"), boundary = TRUE)
# match only words, ignoring punctuation
lma_patcat(
text, c("you", "tomorrow", "was"),
fixed = FALSE,
boundary = "\\b", return.dtm = TRUE
)
## Not run:
# read in the temporal orientation lexicon from the World Well-Being Project
tempori <- read.csv(paste0(
"https://raw.githubusercontent.com/wwbp/lexica/master/",
"temporal_orientation/temporal_orientation_lexicon.csv"
))
lma_patcat(text, tempori)
# or use the standardized version
tempori_std <- read.dic("wwbp_prospection", dir = "~/Dictionaries")
lma_patcat(text, tempori_std)
## get scores on the same scale by adjusting the standardized values
tempori_std[, -1] <- tempori_std[, -1] / 100 *
select.dict("wwbp_prospection")$selected[, "original_max"]
lma_patcat(text, tempori_std)[, unique(tempori$category)]
## End(Not run)
Process Text
Description
A wrapper to other pre-processing functions, potentially from read.segments, to lma_dtm or lma_patcat, to lma_weight, then lma_termcat or lma_lspace, and optionally including lma_meta output.
Usage
lma_process(input = NULL, ..., meta = TRUE, coverage = FALSE)
Arguments
input: A vector of text, or path to a text file or folder.
...: arguments to be passed to read.segments, lma_dtm, lma_patcat, lma_weight, lma_termcat, and/or lma_lspace, as related to the sequence described above.
meta: Logical; if FALSE, metastatistics (lma_meta output) are not included.
coverage: Logical; if TRUE and a dictionary is being applied, the coverage (number of unique term matches) of each category is included.
Value
A matrix with texts represented by rows, and features in columns, unless there are multiple rows per output (e.g., when a latent semantic space is applied without terms being mapped) in which case only the special output is returned (e.g., a matrix with terms as rows and latent dimensions in columns).
See Also
If you just want to compare texts, see the lingmatch() function.
Examples
# starting with some texts in a vector
texts <- c(
"Firstly, I would like to say, and with all due respect...",
"Please, proceed. I hope you feel you can speak freely...",
"Oh, of course, I just hope to be clear, and not cause offense...",
"Oh, no, don't monitor yourself on my account..."
)
# by default, term counts and metastatistics are returned
lma_process(texts)
# add dictionary and percent arguments for standard dictionary-based results
lma_process(texts, dict = lma_dict(), percent = TRUE)
# add space and weight arguments for standard word-centroid vectors
lma_process(texts, space = lma_lspace(texts), weight = "tfidf")
Similarity Calculations
Description
Enter a numerical matrix, set of vectors, or set of matrices to calculate similarity per vector.
Usage
lma_simets(a, b = NULL, metric = NULL, group = NULL, lag = 0,
agg = TRUE, agg.mean = TRUE, pairwise = TRUE, symmetrical = FALSE,
mean = FALSE, return.list = FALSE)
Arguments
a: A vector or matrix. If a vector, b must also be provided. If a matrix and b is missing, each row will be compared with each other row.
b: A vector or matrix to be compared with a or rows of a.
metric: A character or vector of characters at least partially matching one of the available metric names (or 'all' to explicitly include all metrics), or a number or vector of numbers indicating the metric by index: (1) jaccard, (2) euclidean, (3) canberra, (4) cosine, and (5) pearson.
group: If b is missing and a has multiple rows, a vector used to sequentially group rows of a (e.g., turns within conversations; see the examples).
lag: Amount to adjust the b index; either rows if b has multiple rows (e.g., for lag = 1, a[1, ] is compared with b[2, ]), or values otherwise (e.g., for lag = 1, a[1] is compared with b[2]).
agg: Logical: if FALSE, only the boundary rows between groups are compared (see the examples).
agg.mean: Logical: if FALSE, consecutive rows within groups are summed rather than averaged.
pairwise: Logical: if FALSE and a and b are matrices with the same number of rows, only corresponding rows are compared, rather than all pairs of rows.
symmetrical: Logical: if TRUE and pairwise comparisons within a are made, the full similarity matrix is returned, rather than just the lower triangle.
mean: Logical: if TRUE, a single mean per metric is returned, rather than a value per comparison.
return.list: Logical: if TRUE, a list is always returned, with an entry per metric, even when only one metric is requested.
Details
Use RcppParallel::setThreadOptions to change parallelization options; e.g., run RcppParallel::setThreadOptions(4) before a call to lma_simets to set the number of CPU threads to 4.
Value
Output varies based on the dimensions of a and b:
- A vector with a value per metric, when a and b are both vectors.
- A vector with a value per row, any time a single value is expected per row: a or b is a vector, a and b are matrices with the same number of rows and pairwise = FALSE, a group is specified, or mean = TRUE, and only one metric is requested.
- A data.frame with a column per metric, when multiple metrics are requested in the previous case.
- A sparse matrix with a metric attribute with the metric name, for pairwise comparisons within an a matrix or between an a and b matrix, when only 1 metric is requested.
- A list with a sparse matrix per metric, when multiple metrics are requested in the previous case.
Examples
text <- c(
"words of speaker A", "more words from speaker A",
"words from speaker B", "more words from speaker B"
)
(dtm <- lma_dtm(text))
# compare each entry
lma_simets(dtm)
# compare each entry with the mean of all entries
lma_simets(dtm, colMeans(dtm))
# compare by group (corresponding to speakers and turns in this case)
speaker <- c("A", "A", "B", "B")
## by default, consecutive rows from the same group are averaged:
lma_simets(dtm, group = speaker)
## with agg = FALSE, only the rows at the boundary between
## groups (rows 2 and 3 in this case) are used:
lma_simets(dtm, group = speaker, agg = FALSE)
Document-Term Matrix Categorization
Description
Reduces the dimensions of a document-term matrix by dictionary-based categorization.
Usage
lma_termcat(dtm, dict, term.weights = NULL, bias = NULL,
bias.name = "_intercept", escape = TRUE, partial = FALSE,
glob = TRUE, term.filter = NULL, term.break = 20000,
to.lower = FALSE, dir = getOption("lingmatch.dict.dir"),
coverage = FALSE)
Arguments
dtm: A matrix with terms as column names.
dict: The name of a provided dictionary (osf.io/y6g5b/wiki) or of a file found in dir, or a list object with named character vectors as word lists, or a matrix-like object with columns for terms, categories, and weights (to be read in with read.dic).
term.weights: A list or vector of weights corresponding to terms in dict.
bias: A list or named vector specifying a constant to add to the named category. If a term matching bias.name is included in a category, it will be removed, and its weight will be used as that category's bias.
bias.name: A character specifying a term to be used as a category bias; default is '_intercept'.
escape: Logical indicating whether the terms in dict should be escaped (special regular expression characters treated as literal). Set to FALSE if terms are regular expressions.
partial: Logical; if TRUE, terms are not required to match full words (i.e., they are not bound to word boundaries).
glob: Logical; if TRUE (default), asterisks in terms are treated as glob-style wildcards.
term.filter: A regular expression string used to format the text of each term (passed to gsub). For example, if terms are part-of-speech tagged (e.g., 'a_DT'), '_.*' would remove the tag.
term.break: If a category has more than term.break terms, it will be processed in chunks of term.break terms, to avoid memory issues; default is 20000.
to.lower: Logical; if TRUE, dictionary terms are lowercased before matching.
dir: Path to a folder in which to look for dict; will look in getOption('lingmatch.dict.dir') and '~/Dictionaries' by default.
coverage: Logical; if TRUE, calculates the number of unique term matches for each category, in addition to weighted sums.
Value
A matrix with a row per dtm row and columns per dictionary category (with added coverage_ versions if coverage is TRUE), and a WC attribute with original word counts.
See Also
For applying pattern-based dictionaries (to raw text) see lma_patcat().
Other Dictionary functions: dictionary_meta(), download.dict(), lma_patcat(), read.dic(), report_term_matches(), select.dict()
Examples
dict <- list(category = c("cat", "dog", "pet*"))
lma_termcat(c(
"cat, cat, cat, cat, cat, cat, cat, cat",
"a cat, dog, or anything petlike, really",
"petite petrochemical petitioned petty peter for petrified petunia petals"
), dict, coverage = TRUE)
## Not run:
# Score texts with the NRC Affect Intensity Lexicon
dict <- readLines("https://saifmohammad.com/WebDocs/NRC-AffectIntensity-Lexicon.txt")
dict <- read.table(
text = dict[-seq_len(grep("term\tscore", dict, fixed = TRUE)[[1]])],
col.names = c("term", "weight", "category")
)
text <- c(
angry = paste(
"We are outraged by their hateful brutality,",
"and by the way they terrorize us with their hatred."
),
fearful = paste(
"The horrific torture of that terrorist was tantamount",
"to the terrorism of terrorists."
),
joyous = "I am jubilant to be celebrating the bliss of this happiest happiness.",
sad = paste(
"They are nearly suicidal in their mourning after",
"the tragic and heartbreaking holocaust."
)
)
emotion_scores <- lma_termcat(text, dict)
if (require("splot")) splot(emotion_scores ~ names(text), leg = "out")
## or use the standardized version (which includes more categories)
emotion_scores <- lma_termcat(text, "nrc_eil", dir = "~/Dictionaries")
emotion_scores <- emotion_scores[, c("anger", "fear", "joy", "sadness")]
if (require("splot")) splot(emotion_scores ~ names(text), leg = "out")
## End(Not run)
Document-Term Matrix Weighting
Description
Weight a document-term matrix.
Usage
lma_weight(dtm, weight = "count", normalize = TRUE, wc.complete = TRUE,
log.base = 10, alpha = 1, pois.x = 1L, doc.only = FALSE,
percent = FALSE)
Arguments
dtm: A matrix with words as column names.
weight: A string referring at least partially to one (or a combination; see note) of the available weighting methods. Term weights (applied uniquely to each cell): 'binary', 'log', 'sqrt', 'count', and 'amplify'. Document weights (applied by column): 'df', 'dflog', 'dfmax', 'dfmlog', 'idf', 'ridf', 'normal', 'dpois', 'ppois', and 'entropy'. Alternatively, 'pmi' or 'ppmi' apply a pointwise mutual information weighting scheme.
normalize: Logical: if FALSE, the dtm is not divided by document word counts before being weighted.
wc.complete: If the dtm was made with lma_dtm (has a 'WC' attribute), word counts for frequencies can be based on the raw count (default; wc.complete = TRUE), or the sum of each dtm row (wc.complete = FALSE).
log.base: The base of logs, applied to any weight using log. Default is 10.
alpha: A scaling factor applied to document frequency as part of pointwise mutual information weighting, or amplify's power (dtm ^ alpha).
pois.x: integer; quantile or probability of the poisson distribution (dpois(pois.x, colSums(dtm) / nrow(dtm))).
doc.only: Logical: if TRUE, only document weights are returned (a term-named vector rather than a weighted dtm).
percent: Logical; if TRUE, frequencies are multiplied by 100.
Value
A weighted version of dtm, with a type attribute added (attr(dtm, 'type')).
Note
Term weights work to adjust differences in counts within documents, with differences meaning increasingly more from binary to log to sqrt to count to amplify.
Document weights work to treat words differently based on their between-document or overall frequency. When term frequencies are constant, dpois, idf, ridf, and normal give less common words increasingly more weight, and dfmax, dfmlog, ppois, df, dflog, and entropy give less common words increasingly less weight.
weight can either be a vector with two characters, corresponding to term weight and document weight (e.g., c('count', 'idf')), or it can be a string with term and document weights separated by any of :\*_/; ,- (e.g., 'count-idf'). 'tf' is also acceptable for 'count', and 'tfidf' will be parsed as c('count', 'idf'), though this is a special case.
For weight, term or document weights can be entered individually; term weights alone will not apply any document weight, and document weights alone will apply a 'count' term weight (unless doc.only = TRUE, in which case a term-named vector of document weights is returned instead of a weighted dtm).
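For example, each of these calls applies the same count-idf weighting (a minimal sketch):
dtm <- lma_dtm(c("a bit of text", "another bit of text here"))
lma_weight(dtm, c("count", "idf"))
lma_weight(dtm, "count-idf")
lma_weight(dtm, "tfidf")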
Examples
# visualize term and document weights
## term weights
term_weights <- c("binary", "log", "sqrt", "count", "amplify")
Weighted <- sapply(term_weights, function(w) lma_weight(1:20, w, FALSE))
if (require(splot)) splot(Weighted ~ 1:20, labx = "Raw Count", lines = "co")
## document weights
doc_weights <- c(
"df", "dflog", "dfmax", "dfmlog", "idf", "ridf",
"normal", "dpois", "ppois", "entropy"
)
weight_range <- function(w, value = 1) {
m <- diag(20)
m[upper.tri(m, TRUE)] <- if (is.numeric(value)) {
value
} else {
unlist(lapply(
1:20, function(v) rep(if (value == "inverted") 21 - v else v, v)
))
}
lma_weight(m, w, FALSE, doc.only = TRUE)
}
if (require(splot)) {
category <- rep(c("df", "idf", "normal", "poisson", "entropy"), c(4, 2, 1, 2, 1))
op <- list(
laby = "Relative (Scaled) Weight", labx = "Document Frequency",
leg = "outside", lines = "connected", mv.scale = TRUE, note = FALSE
)
splot(
sapply(doc_weights, weight_range) ~ 1:20,
options = op, title = "Same Term, Varying Document Frequencies",
sud = "All term frequencies are 1.",
colorby = list(category, grade = TRUE)
)
splot(
sapply(doc_weights, weight_range, value = "sequence") ~ 1:20,
options = op, title = "Term as Document Frequencies",
sud = "Non-zero terms are the number of non-zero terms.",
colorby = list(category, grade = TRUE)
)
splot(
sapply(doc_weights, weight_range, value = "inverted") ~ 1:20,
options = op, title = "Term Opposite of Document Frequencies",
sud = "Non-zero terms are the number of zero terms + 1.",
colorby = list(category, grade = TRUE)
)
}
Read/Write Dictionary Files
Description
Read in or write dictionary files in Comma-Separated Values (.csv; weighted) or Linguistic Inquiry and Word Count (.dic; non-weighted) format.
Usage
read.dic(path, cats = NULL, type = "asis", as.weighted = FALSE,
dir = getOption("lingmatch.dict.dir"), ..., term.name = "term",
category.name = "category", raw = FALSE)
write.dic(dict, filename = NULL, type = "asis", as.weighted = FALSE,
save = TRUE)
Arguments
path: Path to a file, a name corresponding to a file in dir (e.g., "lusi"; see select.dict), or a matrix-like object to be categorized or flattened.
cats: A character vector of category names to be returned. All categories are returned by default.
type: A character indicating whether and how terms should be altered. Unspecified or matching 'asis' leaves terms as they are. Other options convert glob-style wildcards to regular expressions.
as.weighted: Logical; if TRUE, prepares the dictionary to be returned in a weighted format (a data.frame rather than a list). For write.dic, determines the written format (.csv if TRUE, .dic otherwise).
dir: Path to a folder containing dictionaries, or where you would like dictionaries to be downloaded; passed to select.dict and/or download.dict if path is the name of a dictionary.
...: Passes arguments to readLines.
term.name, category.name: Strings identifying column names in path containing terms and categories respectively.
raw: Logical or a character. As logical, indicates if path should be treated as a raw dictionary (a string to be parsed rather than a file path). As a character, the raw dictionary text itself.
dict: A list with a named entry of terms for each category, or a data.frame with terms in one column, and categories or weights in the rest.
filename: The name of the file to be saved.
save: Logical: if FALSE, returns the formatted dictionary rather than writing it to a file.
Value
read.dic: A list (unweighted) with an entry for each category containing character vectors of terms, or a data.frame (weighted) with columns for terms (first, "term") and weights (all subsequent, with category labels as names).
write.dic: A version of the written dictionary – a raw character vector for unweighted dictionaries, or a data.frame for weighted dictionaries.
See Also
Other Dictionary functions: dictionary_meta(), download.dict(), lma_patcat(), lma_termcat(), report_term_matches(), select.dict()
Examples
# make a small murder related dictionary
dict <- list(
kill = c("kill*", "murd*", "wound*", "die*"),
death = c("death*", "dying", "die*", "kill*")
)
# convert it to a weighted format
(dict_weighted <- read.dic(dict, as.weighted = TRUE))
# categorize it back
read.dic(dict_weighted)
# convert it to a string without writing to a file
cat(raw_dict <- write.dic(dict, save = FALSE))
# parse it back in
read.dic(raw = raw_dict)
## Not run:
# save it as a .dic file
write.dic(dict, "murder")
# read it back in as a list
read.dic("murder.dic")
# read in the Moral Foundations or LUSI dictionaries from urls
moral_dict <- read.dic("https://osf.io/download/whjt2")
lusi_dict <- read.dic("https://osf.io/download/29ayf")
# save and read in a version of the General Inquirer dictionary
inquirer <- read.dic("inquirer", dir = "~/Dictionaries")
## End(Not run)
Read and Segment Multiple Texts
Description
Split texts by word count or specific characters. Input texts directly, or read them in from files.
Usage
read.segments(path = ".", segment = NULL, ext = ".txt", subdir = FALSE,
segment.size = -1, bysentence = FALSE, end_in_quotes = TRUE,
preclean = FALSE, text = NULL)
Arguments
path: Path to a folder containing files, or a vector of paths to files. If no folders or files are recognized in path, it is treated as text.
segment: Specifies how the text of each file should be segmented. If a character, split at that character; '\n' by default. If a number, texts will be broken into that many segments, each with a roughly equal number of words.
ext: The extension of the files you want to read in. '.txt' by default.
subdir: Logical; if TRUE, files in folders within path are also read in.
segment.size: Numeric; if specified, segment will be ignored, and texts will be broken into segments containing roughly segment.size words each.
bysentence: Logical; if TRUE, and segment is a number or segment.size is specified, sentences will be kept together, rather than potentially being broken across segments.
end_in_quotes: Logical; if FALSE, sentence-ending marks (.?!) within quotes are not treated as sentence ends.
preclean: Logical; if TRUE, text is cleaned (e.g., repeated characters are removed) before segmentation.
text: A character vector with text to be split, used in place of path.
Value
A data.frame with columns for file names (input), segment number within file (segment), word count for each segment (WC), and the text of each segment (text).
Examples
# split preloaded text
read.segments("split this text into two segments", 2)
## Not run:
# read in all files from the package directory
texts <- read.segments(path.package("lingmatch"), ext = "")
texts[, -4]
# segment .txt files in dir in a few ways:
dir <- "path/to/files"
## into 1 line segments
texts_lines <- read.segments(dir)
## into 5 even segments each
texts_5segs <- read.segments(dir, 5)
## into 50 word segments
texts_50words <- read.segments(dir, segment.size = 50)
## into 1 sentence segments
texts_1sent <- read.segments(dir, segment.size = 1, bysentence = TRUE)
## End(Not run)
Generate a Report of Term Matches
Description
Extract matches to fuzzy terms (globs/wildcards or regular expressions) from provided text, in order to assess their appropriateness for inclusion in a dictionary.
Usage
report_term_matches(dict, text = NULL, space = NULL, glob = TRUE,
parse_phrases = TRUE, tolower = TRUE, punct = TRUE, special = TRUE,
as_terms = FALSE, bysentence = FALSE, as_string = TRUE,
term_map_freq = 1, term_map_spaces = 1, outFile = NULL,
space_dir = getOption("lingmatch.lspace.dir"), verbose = TRUE)
Arguments
dict: A vector of terms, list of such vectors, or a matrix-like object to be categorized by read.dic.
text: A vector of text to extract matches from. If not specified, will use the terms in the term map (see select.lspace).
space: A vector space used to calculate similarities between term matches. Name of a space (see select.lspace), a matrix with terms as row names, or TRUE to select a space based on matched terms.
glob: Logical; if TRUE, converts glob-style asterisks in terms to regular expressions.
parse_phrases: Logical; if TRUE (default), treats spaces in terms as word boundaries, with each word matched separately.
tolower: Logical; if FALSE, keeps case, making matches case-sensitive.
punct: Logical; if FALSE, removes punctuation marks before matching.
special: Logical; if FALSE, does not convert special characters (see lma_dict) before matching.
as_terms: Logical; if TRUE, treats text as a vector of terms rather than of documents.
bysentence: Logical; if TRUE, extracts matches from sentences rather than from full texts.
as_string: Logical; if FALSE, returns matches as data.frames rather than comma-separated strings.
term_map_freq: Proportion of terms to include when using the term map as a source of terms. Applies when text is not specified.
term_map_spaces: Number of spaces in which a term has to appear to be included. Applies when text is not specified.
outFile: File path to write results to, always ending in .csv.
space_dir: Directory from which space should be loaded.
verbose: Logical; if FALSE, does not show status messages.
Value
A data.frame of results, with a row for each unique term, and the following columns:
- term: The originally entered term.
- regex: The converted and applied regular expression form of the term.
- categories: Comma-separated category names, if dict is a list with named entries.
- count: Total number of matches to the term.
- max_count: Number of matches to the most representative (that with the highest average similarity) variant of the term.
- variants: Number of variants of the term.
- space: Name of the latent semantic space, if one was used.
- mean_sim: Average similarity to the most representative variant among terms found in the space, if one was used.
- min_sim: Minimal similarity to the most representative variant.
- matches: Variants, with counts and similarity (Pearson's r) to the most representative term (if a space was specified). Either in the form of a comma-separated string or a data.frame (if as_string is FALSE).
Note
Matches are extracted for each term independently, so they may not align with some implementations of dictionaries. For instance, by default lma_patcat matches destructively, and sorts terms by length such that shorter terms will not match the same text as longer terms with which they overlap. Here, the match would show up for both terms.
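For instance, in this sketch (assuming default settings), "adore" would be reported as a match for both the ad* and adore* terms, whereas lma_patcat would by default let only the longer term match it:
report_term_matches(list(ads = c("ad*", "adore*")), "I adore ads.")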
See Also
For a more complete assessment of dictionaries, see dictionary_meta().
Similar information is provided in the dictionary builder web tool.
Other Dictionary functions: dictionary_meta(), download.dict(), lma_patcat(), lma_termcat(), read.dic(), select.dict()
Examples
text <- c(
"I am sadly homeless, and suffering from depression :(",
"This wholesome happiness brings joy to my heart! :D:D:D",
"They are joyous in these fearsome happenings D:",
"I feel weightless now that my sadness has been depressed! :()"
)
dict <- list(
sad = c("*less", "sad*", "depres*", ":("),
happy = c("*some", "happ*", "joy*", "d:"),
self = c("i *", "my *")
)
report_term_matches(dict, text)
Select Dictionaries
Description
Retrieve information and links to dictionaries (lexicons/word lists) available at osf.io/y6g5b.
Usage
select.dict(query = NULL, dir = getOption("lingmatch.dict.dir"),
check.md5 = TRUE, mode = "wb")
Arguments
query: A character matching a dictionary name, or a set of keywords to search for in dictionary information.
dir: Path to a folder containing dictionaries, or where you want them to be saved. Will look in getOption('lingmatch.dict.dir') and '~/Dictionaries' by default.
check.md5: Logical; if TRUE (default), checks downloaded files against stored MD5 checksums.
mode: Passed to download.file if files are downloaded.
Value
A list with varying entries:
- info: The version of osf.io/kjqb8 stored internally; a data.frame with dictionary names as row names, and information about each dictionary in columns. Also described at osf.io/y6g5b/wiki/dict_variables. Here, short (corresponding to the file name [{short}.(csv|dic)] and wiki urls [https://osf.io/y6g5b/wiki/{short}]) is set as row names and removed:
  - name: Full name of the dictionary.
  - description: Description of the dictionary, relating to its purpose and development.
  - note: Notes about processing decisions that additionally alter the original.
  - constructor: How the dictionary was constructed:
    - algorithm: Terms were selected by some automated process, potentially learned from data or other resources.
    - crowd: Several individuals rated the terms, and in aggregate those ratings translate to categories and weights.
    - mixed: Some combination of the other methods, usually in some iterative process.
    - team: One or more individuals make decisions about term inclusions, categories, and weights.
  - subject: Broad, rough subject or purpose of the dictionary:
    - emotion: Terms relate to emotions, potentially exemplifying or expressing them.
    - general: A large range of categories, aiming to capture the content of the text.
    - impression: Terms are categorized and weighted based on the impression they might give.
    - language: Terms are categorized or weighted based on their linguistic features, such as part of speech, specificity, or area of use.
    - social: Terms relate to social phenomena, such as characteristics or concerns of social entities.
  - terms: Number of unique terms across categories.
  - term_type: Format of the terms:
    - glob: Include asterisks which denote inclusion of any characters until a word boundary.
    - glob+: Glob-style asterisks with regular expressions within terms.
    - ngram: Includes any number of words as a term, separated by spaces.
    - pattern: A string of characters, potentially within or between words, or spanning words.
    - regex: Regular expressions.
    - stem: Unigrams with common endings removed.
    - unigram: Complete single words.
  - weighted: Indicates whether weights are associated with terms. This determines the file type of the dictionary: dictionaries with weights are stored as .csv, and those without are stored as .dic files.
  - regex_characters: Logical indicating whether special regular expression characters are present in any term, which might need to be escaped if the terms are used in regular expressions. Glob-type terms allow complete parens (at least one open and one closed, indicating preceding or following words), and initial and terminal asterisks. For all other terms, [](){}*.^$+?\| are counted as regex characters. These could be escaped in R with gsub('([][)(}{*.^$+?\\|])', '\\\\\\1', terms) if terms is a character vector, and in Python with (importing re) [re.sub(r'([][(){}*.^$+?\|])', r'\\\1', term) for term in terms] if terms is a list.
  - categories: Category names in the order in which they appear in the dictionary file, separated by commas.
  - ncategories: Number of categories.
  - original_max: Maximum value of the original dictionary before standardization: original values / max(original values) * 100. Dictionaries with no weights are considered to have a max of 1.
  - osf: ID of the file on OSF, translating to the file's URL: https://osf.io/{osf}.
  - wiki: URL of the dictionary's wiki.
  - downloaded: Path to the file if downloaded, and '' otherwise.
- selected: A subset of info selected by query.
See Also
Other Dictionary functions:
dictionary_meta()
,
download.dict()
,
lma_patcat()
,
lma_termcat()
,
read.dic()
,
report_term_matches()
Examples
# just retrieve information about available dictionaries
dicts <- select.dict()$info
dicts[1:10, 4:9]
# select all dictionaries mentioning sentiment or emotion
sentiment_dicts <- select.dict("sentiment emotion")$selected
sentiment_dicts[1:10, 4:9]
Select Latent Semantic Spaces
Description
Retrieve information and links to latent semantic spaces (sets of word vectors/embeddings) available at osf.io/489he, and optionally download their term mappings (osf.io/xr7jv).
Usage
select.lspace(query = NULL, dir = getOption("lingmatch.lspace.dir"),
terms = NULL, get.map = FALSE, check.md5 = TRUE, mode = "wb")
Arguments
query: A character used to select spaces, based on names or other features. If length is over 1, it is treated as terms.
dir: Path to a directory containing lma_term_map.rda and downloaded spaces; will look in getOption('lingmatch.lspace.dir') and '~/Latent Semantic Spaces' by default.
terms: A character vector of terms to search for in the downloaded term map, to calculate coverage of spaces, or select by coverage if query is not specified.
get.map: Logical; if TRUE and lma_term_map.rda is not found in dir, downloads the term map (osf.io/xr7jv).
check.md5: Logical; if TRUE (default), checks the downloaded term map against the stored MD5 checksum.
mode: Passed to download.file if the term map is downloaded.
Value
A list with varying entries:
- info: The version of osf.io/9yzca stored internally; a data.frame with spaces as row names, and information about each space in columns:
  - terms: number of terms in the space
  - corpus: corpus(es) on which the space was trained
  - model: model from which the space was trained
  - dimensions: number of dimensions in the model (columns of the space)
  - model_info: some parameter details about the model
  - original_max: maximum value used to normalize the space; the original space would be (vectors * original_max) / 100
  - osf_dat: OSF id for the .dat files; the URL would be https://osf.io/{osf_dat}
  - osf_terms: OSF id for the _terms.txt files; the URL would be https://osf.io/{osf_terms}
  - wiki: link to the wiki for the space
  - downloaded: path to the .dat file if downloaded, and '' otherwise.
- selected: A subset of info selected by query.
- term_map: If get.map is TRUE or lma_term_map.rda is found in dir, a copy of osf.io/xr7jv, which has space names as column names, terms as row names, and indices as values, with 0 indicating the term is not present in the associated space.
See Also
Other Latent Semantic Space functions: download.lspace(), lma_lspace(), standardize.lspace()
Examples
# just retrieve information about available spaces
spaces <- select.lspace()
spaces$info[1:10, c("terms", "dimensions", "original_max")]
# retrieve all spaces that used word2vec
w2v_spaces <- select.lspace("word2vec")$selected
w2v_spaces[, c("terms", "dimensions", "original_max")]
## Not run:
# select spaces by terms
select.lspace(terms = c(
"part-time", "i/o", "'cause", "brexit", "debuffs"
))$selected[, c("terms", "coverage")]
## End(Not run)
Standardize a Latent Semantic Space
Description
Reformat a .rda file which has a matrix with terms as row names, or a plain-text embeddings file which has a term at the start of each line, and consistent delimiting characters. Plain-text files are processed line-by-line, so large spaces can be reformatted RAM-conservatively.
Usage
standardize.lspace(infile, name, sep = " ", digits = 9,
dir = getOption("lingmatch.lspace.dir"), outdir = dir, remove = "",
term_check = "^[a-zA-Z]+$|^['a-zA-Z][a-zA-Z.'\\/-]*[a-zA-Z.]$",
verbose = FALSE)
Arguments
infile: Name of the .rda or plain-text file relative to dir, e.g., "EN_100k_lsa.rda" or "crawl-300d-2M.vec".
name: Base name of the reformatted file and term file; e.g., "glove" would result in glove.dat and glove_terms.txt in outdir.
sep: Delimiting character between values in each line, e.g., " " or "\t". Only applies to plain-text files.
digits: Number of digits to round values to; default is 9.
dir: Path to folder containing infile; default is getOption('lingmatch.lspace.dir').
outdir: Path to folder in which to save standardized files; default is dir.
remove: A string with a regex pattern to be removed from term names (i.e., gsub(remove, "", term)).
term_check: A string with a regex pattern by which to filter terms; i.e., only lines with fully matched terms are written to the reformatted file. The default attempts to retain only regular words, including those with dashes, forward slashes, and periods. Set to an empty string ("") to write all lines regardless of term.
verbose: Logical: if TRUE, prints progress messages.
Value
Path to the standardized [1] data file and [2] terms file if applicable.
See Also
Other Latent Semantic Space functions: download.lspace(), lma_lspace(), select.lspace()
Examples
## Not run:
# from https://sites.google.com/site/fritzgntr/software-resources/semantic_spaces
standardize.lspace("EN_100k_lsa.rda", "100k_lsa")
# from https://fasttext.cc/docs/en/english-vectors.html
standardize.lspace("crawl-300d-2M.vec", "facebook_crawl")
# Standardized versions of these spaces can also be downloaded with download.lspace.
## End(Not run)