Type: Package
Title: An Integrated Framework for Textual Sentiment Time Series Aggregation and Prediction
Version: 1.0.1
Maintainer: Samuel Borms <borms_sam@hotmail.com>
Description: Optimized prediction based on textual sentiment, accounting for the intrinsic challenge that sentiment can be computed and pooled across texts and time in various ways. See Ardia et al. (2021) <doi:10.18637/jss.v099.i02>.
Depends: R (≥ 3.3.0)
License: GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]
BugReports: https://github.com/SentometricsResearch/sentometrics/issues
URL: https://sentometrics-research.com/sentometrics/
Encoding: UTF-8
LazyData: true
Suggests: covr, doParallel, e1071, lexicon, MCS, NLP, parallel, randomForest, stopwords, testthat, tm
Imports: caret, compiler, data.table, foreach, ggplot2, glmnet, ISOweek, quanteda, Rcpp (≥ 0.12.13), RcppRoll, RcppParallel, stats, stringi, utils
LinkingTo: Rcpp, RcppArmadillo, RcppParallel
RoxygenNote: 7.3.2
SystemRequirements: GNU make
NeedsCompilation: yes
Packaged: 2025-04-03 08:00:50 UTC; saborms
Author: Samuel Borms
Repository: CRAN
Date/Publication: 2025-04-03 09:10:02 UTC
sentometrics: An Integrated Framework for Textual Sentiment Time Series Aggregation and Prediction
Description
The sentometrics package is an integrated framework for textual sentiment time series aggregation and prediction. It accounts for the intrinsic challenge that, for a given text, sentiment can be computed in many different ways, as well as the large number of possibilities to pool sentiment across texts and time. This additional layer of manipulation does not exist in standard text mining and time series analysis packages. The package therefore integrates the fast quantification of sentiment from texts, the aggregation into different sentiment time series and the optimized prediction based on these measures.
Main functions
Corpus (features) generation: sento_corpus, add_features, as.sento_corpus
Sentiment computation and aggregation into sentiment measures: ctr_agg, sento_lexicons, compute_sentiment, aggregate.sentiment, as.sentiment, sento_measures, peakdocs, peakdates, aggregate.sento_measures
Sparse modeling: ctr_model, sento_model
Prediction and post-modeling analysis: predict.sento_model, attributions
Note
Please cite the package in publications. Use citation("sentometrics").
Author(s)
Maintainer: Samuel Borms <borms_sam@hotmail.com>
Authors:
David Ardia <david.ardia@hec.ca>
Keven Bluteau <keven.bluteau@usherbrooke.ca>
Kris Boudt <kris.boudt@vub.be>
Other contributors:
Jeroen Van Pelt <jeroenvanpelt@hotmail.com> [contributor]
Andres Algaba <andres.algaba@vub.be> [contributor]
References
Ardia, Bluteau, Borms and Boudt (2021). The R Package sentometrics to Compute, Aggregate, and Predict with Textual Sentiment. Journal of Statistical Software 99(2), 1-40, doi:10.18637/jss.v099.i02.
Ardia, Bluteau and Boudt (2019). Questioning the news about economic growth: Sparse forecasting using thousands of news-based sentiment values. International Journal of Forecasting 35, 1370-1386, doi:10.1016/j.ijforecast.2018.10.010.
See Also
Useful links:
Report bugs at https://github.com/SentometricsResearch/sentometrics/issues
Add feature columns to a (sento_)corpus object
Description
Adds new feature columns, either user-supplied or based on keyword(s)/regex pattern search, to
a provided sento_corpus
or a quanteda corpus
object.
Usage
add_features(
corpus,
featuresdf = NULL,
keywords = NULL,
do.binary = TRUE,
do.regex = FALSE
)
Arguments
corpus: a sento_corpus object created with sento_corpus, or a quanteda corpus object.
featuresdf: a named data.frame with the numeric feature values to add, one column per feature.
keywords: a named list of keyword(s)/regex patterns; every element generates a new feature column named after it.
do.binary: a logical; if FALSE, keyword occurrences are scaled between 0 and 1 instead of marked as binary (see Details).
do.regex: a logical vector indicating which keywords elements should be evaluated as proper regex expressions (see Details).
Details
If a provided feature name is already part of the corpus, it will be replaced. The featuresdf and keywords arguments can be provided at the same time, or only one of them, leaving the other at NULL. We use the stringi package for searching the keywords. The do.regex argument points to the corresponding elements in keywords. For FALSE, we transform the keywords into a simple regex expression, involving "\b" for exact word boundary matching and (if multiple keywords) | as OR operator. The elements associated to TRUE do not undergo this transformation, and are evaluated as given, if the corresponding keywords vector consists of only one expression. For a large corpus and/or complex regex patterns, this function may require some patience. Scaling between 0 and 1 is performed via min-max normalization, per column.
Value
An updated corpus
object.
Author(s)
Samuel Borms
Examples
set.seed(505)
# construct a corpus and add (a) feature(s) to it
corpus <- quanteda::corpus_sample(
sento_corpus(corpusdf = sentometrics::usnews), 500
)
corpus1 <- add_features(corpus,
featuresdf = data.frame(random = runif(quanteda::ndoc(corpus))))
corpus2 <- add_features(corpus,
keywords = list(pres = "president", war = "war"),
do.binary = FALSE)
corpus3 <- add_features(corpus,
keywords = list(pres = c("Obama", "US president")))
corpus4 <- add_features(corpus,
featuresdf = data.frame(all = 1),
keywords = list(pres1 = "Obama|US [p|P]resident",
pres2 = "\\bObama\\b|\\bUS president\\b",
war = "war"),
do.regex = c(TRUE, TRUE, FALSE))
sum(quanteda::docvars(corpus3, "pres")) ==
sum(quanteda::docvars(corpus4, "pres2")) # TRUE
# adding a complementary feature
nonpres <- data.frame(nonpres = as.numeric(!quanteda::docvars(corpus3, "pres")))
corpus3 <- add_features(corpus3, featuresdf = nonpres)
Aggregate textual sentiment across sentences, documents and time
Description
Aggregates textual sentiment scores at sentence- or document-level into a panel of textual
sentiment measures. Can also be used to aggregate sentence-level sentiment scores into
document-level sentiment scores. This function is called within the sento_measures
function.
Usage
## S3 method for class 'sentiment'
aggregate(x, ctr, do.full = TRUE, ...)
Arguments
x: a sentiment object created with compute_sentiment or as.sentiment.
ctr: output from a ctr_agg call.
do.full: if TRUE (default), the scores are aggregated all the way into a sento_measures object; if FALSE, sentence-level scores are only aggregated into document-level sentiment.
...: not used.
Value
A document-level sentiment
object or a fully aggregated sento_measures
object.
Author(s)
Samuel Borms, Keven Bluteau
See Also
compute_sentiment
, ctr_agg
, sento_measures
Examples
set.seed(505)
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")
# computation of sentiment
corpus <- sento_corpus(corpusdf = usnews)
corpusSample <- quanteda::corpus_sample(corpus, size = 500)
l1 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")],
list_valence_shifters[["en"]])
l2 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")],
list_valence_shifters[["en"]][, c("x", "t")])
sent1 <- compute_sentiment(corpusSample, l1, how = "counts")
sent2 <- compute_sentiment(corpusSample, l2, do.sentence = TRUE)
sent3 <- compute_sentiment(as.character(corpusSample), l2,
do.sentence = TRUE)
ctr <- ctr_agg(howTime = c("linear"), by = "year", lag = 3)
# aggregate into sentiment measures
sm1 <- aggregate(sent1, ctr)
sm2 <- aggregate(sent2, ctr)
# two-step aggregation (first into document-level sentiment)
sd2 <- aggregate(sent2, ctr, do.full = FALSE)
sm3 <- aggregate(sd2, ctr)
# aggregation of a sentiment data.table
cols <- c("word_count", names(l2)[-length(l2)])
sd3 <- sent3[, lapply(.SD, sum), by = "id", .SDcols = cols]
Aggregate sentiment measures
Description
Aggregates sentiment measures by combining them across the lexicons, features, and time weighting schemes dimensions. For do.global = FALSE, the combination occurs by taking the mean of the relevant measures. For do.global = TRUE, this function aggregates all sentiment measures into a weighted global textual sentiment measure for each of the dimensions.
Usage
## S3 method for class 'sento_measures'
aggregate(
x,
features = NULL,
lexicons = NULL,
time = NULL,
do.global = FALSE,
do.keep = FALSE,
...
)
Arguments
x: a sento_measures object created with sento_measures.
features: a named list, where each element holds the features to combine into a new feature named after that element; for do.global = TRUE, a numeric vector of feature weights instead.
lexicons: a named list, analogous to features, for the lexicons dimension.
time: a named list, analogous to features, for the time weighting schemes dimension.
do.global: a logical indicating whether to aggregate the measures into weighted global sentiment indices (see Details).
do.keep: a logical indicating whether the original sentiment measures should be kept alongside the aggregated ones.
...: not used.
Details
If do.global = TRUE
, the measures are constructed from weights that indicate the importance (and sign)
along each component from the lexicons
, features
, and time
dimensions. There is no restriction in
terms of allowed weights. For example, the global index based on the supplied lexicon weights ("globLex"
) is obtained
first by multiplying every sentiment measure with its corresponding weight (meaning, the weight given to the lexicon the
sentiment is computed with), then by taking the average per date.
Value
If do.global = FALSE
, a modified sento_measures
object, with the aggregated sentiment
measures, including updated information and statistics, but the original sentiment scores data.table
untouched.
If do.global = TRUE
, a data.table
with the different types of weighted global sentiment measures,
named "globLex"
, "globFeat"
, "globTime"
and "global"
, with "date"
as the first
column. The last measure is an average of the three other measures.
Author(s)
Samuel Borms
Examples
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")
# construct a sento_measures object to start with
corpus <- sento_corpus(corpusdf = usnews)
corpusSample <- quanteda::corpus_sample(corpus, size = 500)
l <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")],
list_valence_shifters[["en"]])
ctr <- ctr_agg(howTime = c("equal_weight", "linear"),
by = "year", lag = 3)
sento_measures <- sento_measures(corpusSample, l, ctr)
# aggregation across specified components
smAgg <- aggregate(sento_measures,
time = list(W = c("equal_weight", "linear")),
features = list(journals = c("wsj", "wapo")),
do.keep = TRUE)
# aggregation in full
dims <- get_dimensions(sento_measures)
smFull <- aggregate(sento_measures,
lexicons = list(L = dims[["lexicons"]]),
time = list(T = dims[["time"]]),
features = list(F = dims[["features"]]))
# "global" aggregation
smGlobal <- aggregate(sento_measures, do.global = TRUE,
lexicons = c(0.3, 0.1),
features = c(1, -0.5, 0.3, 1.2),
time = NULL)
## Not run:
# aggregation won't work, but produces informative error message
aggregate(sento_measures,
time = list(W = c("equal_weight", "almon1")),
lexicons = list(LEX = c("LM_en")),
features = list(journals = c("notInHere", "wapo")))
## End(Not run)
Get the sentiment measures
Description
Extracts the sentiment measures data.table
in either wide (by default)
or long format.
Usage
## S3 method for class 'sento_measures'
as.data.table(x, keep.rownames = FALSE, format = "wide", ...)
Arguments
x: a sento_measures object created with sento_measures.
keep.rownames: see data.table::as.data.table.
format: a single character vector, one of "wide" (default) or "long".
...: not used.
Value
The panel of sentiment measures under sento_measures[["measures"]]
,
in wide or long format.
Author(s)
Samuel Borms
Examples
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")
sm <- sento_measures(sento_corpus(corpusdf = usnews[1:200, ]),
sento_lexicons(list_lexicons["LM_en"]),
ctr_agg(lag = 3))
data.table::as.data.table(sm)
data.table::as.data.table(sm, format = "long")
Convert a sentiment table to a sentiment object
Description
Converts a properly structured sentiment table into a sentiment object, which can be used for further aggregation with the aggregate.sentiment function. This allows one to start from sentiment scores not necessarily computed with compute_sentiment.
Usage
as.sentiment(s)
Arguments
s: a properly structured sentiment data.table or data.frame, with an "id", a "date" and a "word_count" column, followed by numeric sentiment scores columns, optionally named following the "lexicon--feature" convention (see Examples).
Value
A sentiment
object.
Author(s)
Samuel Borms
Examples
set.seed(505)
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
ids <- paste0("id", 1:200)
dates <- sample(seq(as.Date("2015-01-01"), as.Date("2018-01-01"), by = "day"), 200, TRUE)
word_count <- sample(150:850, 200, replace = TRUE)
sent <- matrix(rnorm(200 * 8), nrow = 200)
s1 <- s2 <- data.table::data.table(id = ids, date = dates, word_count = word_count, sent)
s3 <- data.frame(id = ids, date = dates, word_count = word_count, sent,
stringsAsFactors = FALSE)
s4 <- compute_sentiment(usnews$texts[201:400],
sento_lexicons(list_lexicons["GI_en"]),
"counts", do.sentence = TRUE)
m <- "method"
colnames(s1)[-c(1:3)] <- paste0(m, 1:8)
sent1 <- as.sentiment(s1)
colnames(s2)[-c(1:3)] <- c(paste0(m, 1:4, "--", "feat1"), paste0(m, 1:4, "--", "feat2"))
sent2 <- as.sentiment(s2)
colnames(s3)[-c(1:3)] <- c(paste0(m, 1:3, "--", "feat1"), paste0(m, 1:3, "--", "feat2"),
paste0(m, 4:5))
sent3 <- as.sentiment(s3)
s4[, "date" := rep(dates, s4[, max(sentence_id), by = id][[2]])]
sent4 <- as.sentiment(s4)
# further aggregation from then on is easy...
sentMeas1 <- aggregate(sent1, ctr_agg(lag = 10))
sent5 <- aggregate(sent4, ctr_agg(howDocs = "proportional"), do.full = FALSE)
Convert a quanteda or tm corpus object into a sento_corpus object
Description
Converts most common quanteda and tm corpus objects into a
sento_corpus
object. Appropriate available metadata is integrated as features;
for a quanteda corpus, this can come from docvars(x)
, for a tm corpus,
only meta(x, type = "indexed")
metadata is considered.
Usage
as.sento_corpus(x, dates = NULL, do.clean = FALSE)
Arguments
x: a quanteda corpus object, or a tm SimpleCorpus or VCorpus object.
dates: an optional sequence of dates as "yyyy-mm-dd", of the same length as the number of documents; overwrites any existing "date" metadata.
do.clean: see sento_corpus.
Value
A sento_corpus
object, as returned by the sento_corpus
function.
Author(s)
Samuel Borms
See Also
corpus
, SimpleCorpus
, VCorpus
,
sento_corpus
Examples
data("usnews", package = "sentometrics")
txt <- system.file("texts", "txt", package = "tm")
reuters <- system.file("texts", "crude", package = "tm")
# reshuffle usnews data.frame for use in quanteda and tm
dates <- usnews$date
usnews$wrong <- "notNumeric"
colnames(usnews)[c(1, 3)] <- c("doc_id", "text")
# conversion from a quanteda corpus
qcorp <- quanteda::corpus(usnews,
text_field = "text", docid_field = "doc_id")
corp1 <- as.sento_corpus(qcorp)
corp2 <- as.sento_corpus(qcorp, sample(dates)) # overwrites "date" column
# conversion from a tm SimpleCorpus corpus (DataframeSource)
tmSCdf <- tm::SimpleCorpus(tm::DataframeSource(usnews))
corp3 <- as.sento_corpus(tmSCdf)
# conversion from a tm SimpleCorpus corpus (DirSource)
tmSCdir <- tm::SimpleCorpus(tm::DirSource(txt))
corp4 <- as.sento_corpus(tmSCdir, dates[1:length(tmSCdir)])
# conversion from a tm VCorpus corpus (DataframeSource)
tmVCdf <- tm::VCorpus(tm::DataframeSource(usnews))
corp5 <- as.sento_corpus(tmVCdf)
# conversion from a tm VCorpus corpus (DirSource)
tmVCdir <- tm::VCorpus(tm::DirSource(reuters),
list(reader = tm::readReut21578XMLasPlain))
corp6 <- as.sento_corpus(tmVCdir, dates[1:length(tmVCdir)])
Retrieve top-down model sentiment attributions
Description
Computes the attributions to predictions for a (given) number of dates at all possible sentiment dimensions, based on the coefficients associated with each sentiment measure, as estimated in the provided model object.
Usage
attributions(
model,
sento_measures,
do.lags = TRUE,
do.normalize = FALSE,
refDates = NULL,
factor = NULL
)
Arguments
model: a sento_model or sento_modelIter object.
sento_measures: the sento_measures object used to estimate the model.
do.lags: a logical indicating whether the attribution to each time lag should be computed.
do.normalize: a logical indicating whether the attributions should be normalized.
refDates: the dates (as "yyyy-mm-dd") at which the attributions are to be computed; if NULL (default), all available prediction dates are used.
factor: the factor level as a single character vector for which attributions are computed, in case of (multi)nomial models.
Details
See sento_model
for an elaborate modeling example including the calculation and plotting of
attributions. The attribution for logistic models is represented in terms of log odds. For binomial models, it is
calculated with respect to the last factor level or factor column. A NULL
value for document-level attribution
on a given date means no documents are directly implicated in the associated prediction.
Value
A list
of class attributions
, with "documents"
, "lags"
, "lexicons"
,
"features"
and "time"
as attribution dimensions. The last four dimensions are
data.table
s having a "date"
column and the other columns the different components of the dimension, with
the attributions as values. Document-level attribution is further decomposed into a data.table
per date, with
"id"
, "date"
and "attrib"
columns. If do.lags = FALSE
, the "lags"
element is set
to NULL
.
Author(s)
Samuel Borms, Keven Bluteau
See Also
sento_model, sento_measures
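Examples
A hedged sketch, reusing the objects built in the sento_model examples:
## Not run:
# assumes 'out1' (a sento_model object) and 'sento_measures'
# as constructed in the sento_model examples
attributions1 <- attributions(out1, sento_measures,
                              refDates = get_dates(sento_measures)[20:25])
attributions1[["features"]] # attribution per date and feature
plot(attributions1, "features")
## End(Not run)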
Compute textual sentiment across features and lexicons
Description
Given a corpus of texts, computes sentiment per document or sentence using the valence shifting augmented bag-of-words approach, based on the lexicons provided and a choice of aggregation across words.
Usage
compute_sentiment(
x,
lexicons,
how = "proportional",
tokens = NULL,
do.sentence = FALSE,
nCore = 1
)
Arguments
x: either a sento_corpus object created with sento_corpus, a quanteda corpus object, a tm SimpleCorpus object, a tm VCorpus object, or a character vector.
lexicons: a sento_lexicons object created with sento_lexicons, or, for a multi-language corpus, a named list of sento_lexicons objects with the languages as names.
how: a single character vector defining how word sentiment is aggregated within documents (or sentences); see get_hows()$words for the supported options.
tokens: a list of tokenized documents (or sentences), to bypass the internal tokenization; NULL by default (see Details).
do.sentence: a logical indicating whether sentiment should be computed sentence-by-sentence.
nCore: a positive integer indicating the number of cores to use for a parallel sentiment computation.
Details
For a separate calculation of positive (resp. negative) sentiment, provide distinct positive (resp. negative) lexicons (see the do.split option in the sento_lexicons function). All NAs are converted to 0, under the assumption that this is equivalent to no sentiment. By default tokens = NULL, meaning the corpus is internally tokenized as unigrams, with punctuation and numbers (but not stopwords) removed. All tokens are converted to lowercase, in line with what the sento_lexicons function does for the lexicons and valence shifters. Word counts are based on that same tokenization.
Value
If x
is a sento_corpus
object: a sentiment
object, i.e., a data.table
containing
the sentiment scores data.table
with an "id"
, a "date"
and a "word_count"
column,
and all lexicon-feature sentiment scores columns. The tokenized sentences are not provided but can be
obtained as stringi::stri_split_boundaries(texts, type = "sentence")
. A sentiment
object can
be aggregated (into time series) with the aggregate.sentiment
function.
If x
is a quanteda corpus
object: a sentiment scores
data.table
with an "id"
and a "word_count"
column, and all lexicon-feature
sentiment scores columns.
If x
is a tm SimpleCorpus
object, a tm VCorpus
object, or a character
vector: a sentiment scores data.table
with an auto-created "id"
column, a "word_count"
column, and all lexicon sentiment scores columns.
When do.sentence = TRUE
, an additional "sentence_id"
column along the
"id"
column is added.
Calculation
If the lexicons argument has no "valence" element, the sentiment computed corresponds to simple unigram matching with the lexicons [unigrams approach]. If valence shifters are included in lexicons with a corresponding "y" column, the polarity of a word detected from a lexicon is multiplied by the associated value of a valence shifter if it appears right before the detected word (examples: not good or can't defend) [bigrams approach]. If the valence table contains a "t" column, valence shifters are searched for in a cluster centered around a detected polarity word [clusters approach]. The latter approach is a simplified version of the one utilized by the sentimentr package. A cluster amounts to four words before and two words after a polarity word. A cluster never overlaps with a preceding one. Roughly speaking, the polarity of a cluster is calculated as n(1 + 0.80d)S + Σs. The polarity score of the detected word is S, s represents the polarities of any other sentiment words in the cluster, and d is the difference between the number of amplifiers (t = 2) and the number of deamplifiers (t = 3). If there is an odd number of negators (t = 1), n = -1 and amplifiers are counted as deamplifiers, else n = 1.
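As a hypothetical illustration of the cluster polarity formula, consider the phrase "not very good": the detected polarity word good has S = 1, and its cluster contains one negator (not) and one amplifier (very). The odd number of negators sets n = -1 and turns the amplifier into a deamplifier, so d = 0 - 1 = -1, and the cluster polarity becomes -1 × (1 + 0.80 × (-1)) × 1 = -0.2.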
The sentence-level sentiment calculation approaches each sentence as if it were a document. Depending on the input, either the unigrams, bigrams or clusters approach is used. We enhanced the latter approach to follow the default sentimentr settings more closely. There, a cluster consists of five words before and two words after a polarized word, and is limited to the words after a previous comma and before a next comma. Adversative conjunctions (t = 4) are accounted for here. The cluster is reweighted based on the value 1 + 0.25adv, where adv is the difference between the number of adversative conjunctions found before and after the polarized word.
Author(s)
Samuel Borms, Jeroen Van Pelt, Andres Algaba
Examples
data("usnews", package = "sentometrics")
txt <- system.file("texts", "txt", package = "tm")
reuters <- system.file("texts", "crude", package = "tm")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")
l1 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")])
l2 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")],
list_valence_shifters[["en"]])
l3 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")],
list_valence_shifters[["en"]][, c("x", "t")])
# from a sento_corpus object - unigrams approach
corpus <- sento_corpus(corpusdf = usnews)
corpusSample <- quanteda::corpus_sample(corpus, size = 200)
sent1 <- compute_sentiment(corpusSample, l1, how = "proportionalPol")
# from a character vector - bigrams approach
sent2 <- compute_sentiment(usnews[["texts"]][1:200], l2, how = "counts")
# from a corpus object - clusters approach
corpusQ <- quanteda::corpus(usnews, text_field = "texts")
corpusQSample <- quanteda::corpus_sample(corpusQ, size = 200)
sent3 <- compute_sentiment(corpusQSample, l3, how = "counts")
# from an already tokenized corpus - using the 'tokens' argument
toks <- as.list(quanteda::tokens(corpusQSample, what = "fastestword"))
sent4 <- compute_sentiment(corpusQSample, l1[1], how = "counts", tokens = toks)
# from a SimpleCorpus object - unigrams approach
scorp <- tm::SimpleCorpus(tm::DirSource(txt))
sent5 <- compute_sentiment(scorp, l1, how = "proportional")
# from a VCorpus object - unigrams approach
## in contrast to what as.sento_corpus(vcorp) would do, the
## sentiment calculator handles multiple character vectors within
## a single corpus element as separate documents
vcorp <- tm::VCorpus(tm::DirSource(reuters))
sent6 <- compute_sentiment(vcorp, l1)
# from a sento_corpus object - unigrams approach with tf-idf weighting
sent7 <- compute_sentiment(corpusSample, l1, how = "TFIDF")
# sentence-by-sentence computation
sent8 <- compute_sentiment(corpusSample, l1, how = "proportionalSquareRoot",
do.sentence = TRUE)
# from a (fake) multilingual corpus
usnews[["language"]] <- "en" # add language column
usnews$language[1:100] <- "fr"
lEn <- sento_lexicons(list("FEEL_en" = list_lexicons$FEEL_en_tr,
"HENRY" = list_lexicons$HENRY_en),
list_valence_shifters$en)
lFr <- sento_lexicons(list("FEEL_fr" = list_lexicons$FEEL_fr),
list_valence_shifters$fr)
lexicons <- list(en = lEn, fr = lFr)
corpusLang <- sento_corpus(corpusdf = usnews[1:250, ])
sent9 <- compute_sentiment(corpusLang, lexicons, how = "proportional")
Summarize the sento_corpus object
Description
Summarizes the sento_corpus
object and returns insights about the evolution of
documents, features and tokens over time.
Usage
corpus_summarize(x, by = "day", features = NULL)
Arguments
x: a sento_corpus object created with sento_corpus.
by: a single character vector defining the summarizing frequency, either "day" (default), "week", "month" or "year".
features: a character vector of the features to include in the summary; if NULL (default), all features are considered.
Details
This function summarizes the sento_corpus
object by generating statistics about
documents, features and tokens over time. The insights can be narrowed down to a chosen set of metadata
features. The same tokenization as in the sentiment calculation in compute_sentiment
is used.
Value
A list containing:
stats: a data.table with the statistics on the evolution of documents, features and tokens over time.
plots: a list of plots visualizing these statistics.
Author(s)
Jeroen Van Pelt, Samuel Borms, Andres Algaba
Examples
data("usnews", package = "sentometrics")
corpus <- sento_corpus(usnews)
# summary of corpus by day
summary1 <- corpus_summarize(corpus)
# summary of corpus by month for both journals
summary2 <- corpus_summarize(corpus, by = "month",
features = c("wsj", "wapo"))
Set up control for aggregation into sentiment measures
Description
Sets up control object for (computation of textual sentiment and) aggregation into textual sentiment measures.
Usage
ctr_agg(
howWithin = "proportional",
howDocs = "equal_weight",
howTime = "equal_weight",
do.sentence = FALSE,
do.ignoreZeros = TRUE,
by = "day",
lag = 1,
fill = "zero",
alphaExpDocs = 0.1,
alphasExp = seq(0.1, 0.5, by = 0.1),
do.inverseExp = FALSE,
ordersAlm = 1:3,
do.inverseAlm = TRUE,
aBeta = 1:4,
bBeta = 1:4,
weights = NULL,
tokens = NULL,
nCore = 1
)
Arguments
howWithin: a single character vector defining how word sentiment is aggregated within documents (see get_hows()$words).
howDocs: a single character vector defining how sentiment is aggregated across documents (or sentences) per date (see get_hows()$docs).
howTime: a character vector defining how sentiment is aggregated across dates, to construct the time series (see get_hows()$time).
do.sentence: see compute_sentiment.
do.ignoreZeros: a logical indicating whether zero sentiment values should be ignored in the aggregation across documents (or sentences).
by: a single character vector defining the date frequency of the time series, either "year", "month", "week" or "day".
lag: a single integer defining the time lag, i.e., the number of dates aggregated over.
fill: a single character vector, one of "zero", "latest" or "none", to control the addition of missing dates (see measures_fill).
alphaExpDocs: a single numeric smoothing factor, used if howDocs = "exponential".
alphasExp: a numeric vector of smoothing factors, used if "exponential" is an element of howTime.
do.inverseExp: a logical indicating whether the inverse exponential weighting schemes should be added as well.
ordersAlm: a numeric vector of Almon polynomial orders, used if "almon" is an element of howTime.
do.inverseAlm: a logical indicating whether the inverse Almon polynomials should be added as well.
aBeta: a numeric vector of alpha parameters for the Beta time weighting schemes, used if "beta" is an element of howTime.
bBeta: a numeric vector of beta parameters for the Beta time weighting schemes, used if "beta" is an element of howTime.
weights: optional own weighting scheme(s), used if provided as a data.frame with the number of rows equal to the desired lag, and if "own" is an element of howTime.
tokens: see compute_sentiment.
nCore: see compute_sentiment.
Details
For available options on how aggregation can occur (via the howWithin
,
howDocs
and howTime
arguments), inspect get_hows
. The control parameters
associated to howDocs
are used both for aggregation across documents and across sentences.
Value
A list
encapsulating the control parameters.
Author(s)
Samuel Borms, Keven Bluteau
See Also
measures_fill
, almons
, compute_sentiment
Examples
set.seed(505)
# simple control function
ctr1 <- ctr_agg(howTime = "linear", by = "year", lag = 3)
# more elaborate control function (particular attention to time weighting schemes)
ctr2 <- ctr_agg(howWithin = "proportionalPol",
howDocs = "exponential",
howTime = c("equal_weight", "linear", "almon", "beta", "exponential", "own"),
do.ignoreZeros = TRUE,
by = "day",
lag = 20,
ordersAlm = 1:3,
do.inverseAlm = TRUE,
alphasExp = c(0.20, 0.50, 0.70, 0.95),
aBeta = c(1, 3),
bBeta = c(1, 3, 4, 7),
weights = data.frame(myWeights = runif(20)),
alphaExpDocs = 0.3)
# set up control function with one linear and two chosen Almon weighting schemes
a <- weights_almon(n = 70, orders = 1:3, do.inverse = TRUE, do.normalize = TRUE)
ctr3 <- ctr_agg(howTime = c("linear", "own"), by = "year", lag = 70,
weights = data.frame(a1 = a[, 1], a2 = a[, 3]),
do.sentence = TRUE)
Set up control for sentiment-based sparse regression modeling
Description
Sets up control object for linear or nonlinear modeling of a response variable onto a large panel of
textual sentiment measures (and potentially other variables). See sento_model
for details on the
estimation and calibration procedure.
Usage
ctr_model(
model = c("gaussian", "binomial", "multinomial"),
type = c("BIC", "AIC", "Cp", "cv"),
do.intercept = TRUE,
do.iter = FALSE,
h = 0,
oos = 0,
do.difference = FALSE,
alphas = seq(0, 1, by = 0.2),
lambdas = NULL,
nSample = NULL,
trainWindow = NULL,
testWindow = NULL,
start = 1,
do.shrinkage.x = FALSE,
do.progress = TRUE,
nCore = 1
)
Arguments
model: a character vector, one of "gaussian" (linear regression), "binomial" or "multinomial" (logistic regressions).
type: a character vector defining the calibration approach, one of "BIC", "AIC", "Cp" (information criteria) or "cv" (cross-validation).
do.intercept: a logical indicating whether an intercept should be included.
do.iter: a logical indicating whether the modeling should be performed iteratively, with out-of-sample predictions.
h: an integer defining the prediction horizon, shifting the response variable relative to the explanatory variables.
oos: a non-negative integer defining the number of periods to skip between the in-sample estimation and the out-of-sample prediction(s).
do.difference: a logical indicating whether the response variable should be differenced.
alphas: a numeric vector of candidate alpha values (between 0 and 1) for the elastic net calibration.
lambdas: a numeric vector of candidate lambda values; if NULL (default), the sequence is generated internally by glmnet.
nSample: a positive integer defining the (rolling) sample size for the iterative estimation.
trainWindow: a positive integer defining the training window size in the cross-validation.
testWindow: a positive integer defining the test window size in the cross-validation.
start: a positive integer indicating at which point the sample iterations should start.
do.shrinkage.x: a logical indicating whether the other explanatory variables in x should be subject to shrinkage as well.
do.progress: a logical indicating whether progress statements should be displayed.
nCore: a positive integer indicating the number of cores to use for a parallel iterative estimation.
Value
A list
encapsulating the control parameters.
Author(s)
Samuel Borms, Keven Bluteau
References
Tibshirani and Taylor (2012). Degrees of freedom in LASSO problems. The Annals of Statistics 40, 1198-1232, doi:10.1214/12-AOS1003.
Zou, Hastie and Tibshirani (2007). On the degrees of freedom of the LASSO. The Annals of Statistics 35, 2173-2192, doi:10.1214/009053607000000127.
See Also
sento_model
Examples
# information criterion based model control functions
ctrIC1 <- ctr_model(model = "gaussian", type = "BIC", do.iter = FALSE, h = 0,
alphas = seq(0, 1, by = 0.10))
ctrIC2 <- ctr_model(model = "gaussian", type = "AIC", do.iter = TRUE, h = 4, nSample = 100,
do.difference = TRUE, oos = 3)
# cross-validation based model control functions
ctrCV1 <- ctr_model(model = "gaussian", type = "cv", do.iter = FALSE, h = 0,
trainWindow = 250, testWindow = 4, oos = 0, do.progress = TRUE)
ctrCV2 <- ctr_model(model = "binomial", type = "cv", h = 0, trainWindow = 250,
testWindow = 4, oos = 0, do.progress = TRUE)
ctrCV3 <- ctr_model(model = "multinomial", type = "cv", h = 2, trainWindow = 250,
testWindow = 4, oos = 2, do.progress = TRUE)
ctrCV4 <- ctr_model(model = "gaussian", type = "cv", do.iter = TRUE, h = 0, trainWindow = 45,
testWindow = 4, oos = 0, nSample = 70, do.progress = TRUE)
Datasets with defunct names
Description
These are datasets that have been renamed and removed.
Details
The dataset lexicons
is defunct, use list_lexicons
instead.
The dataset valence
is defunct, use list_valence_shifters
instead.
Differencing of sentiment measures
Description
Differences the sentiment measures from a sento_measures
object.
Usage
## S3 method for class 'sento_measures'
diff(x, lag = 1, differences = 1, ...)
Arguments
x: a sento_measures object created with sento_measures.
lag: an integer indicating which lag to use.
differences: an integer indicating the order of the difference.
...: not used.
Value
A modified sento_measures
object, with the measures replaced by the differenced measures as well as updated
statistics.
Author(s)
Samuel Borms
Examples
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")
# construct a sento_measures object to start with
corpus <- sento_corpus(corpusdf = usnews)
corpusSample <- quanteda::corpus_sample(corpus, size = 500)
l <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")], list_valence_shifters[["en"]])
ctr <- ctr_agg(howTime = c("equal_weight", "linear"), by = "year", lag = 3)
sento_measures <- sento_measures(corpusSample, l, ctr)
# first-order difference sentiment measures with a lag of two
diffed <- diff(sento_measures, lag = 2, differences = 1)
Monthly U.S. Economic Policy Uncertainty index
Description
Monthly news-based U.S. Economic Policy Uncertainty (EPU) index (Baker, Bloom and Davis, 2016). Goes from January 1985 to July 2018, and includes a binomial and a multinomial example series. The following columns are present:
date: Date as "yyyy-mm-01".
index: A numeric monthly index value.
above: A factor with value "above" if the index is greater than the mean of the entire series, else "below".
aboveMulti: A factor with values "above+", "above", "below" and "below-" if the index is greater than the 75% quantile and the 50% quantile, or smaller than the 50% quantile and the 25% quantile, respectively and in a mutually exclusive sense.
Usage
data("epu")
Format
A data.frame
with 403 rows and 4 columns.
Source
Measuring Economic Policy Uncertainty. Retrieved August 24, 2018.
References
Baker, Bloom and Davis (2016). Measuring Economic Policy Uncertainty. The Quarterly Journal of Economics 131, 1593-1636, doi:10.1093/qje/qjw024.
Examples
data("epu", package = "sentometrics")
head(epu)
Get the dates of the sentiment measures/time series
Description
Returns the dates of the sentiment time series.
Usage
get_dates(sento_measures)
Arguments
sento_measures: a sento_measures object created with sento_measures.
Value
The "date"
column in sento_measures[["measures"]]
as a character
vector.
Author(s)
Samuel Borms
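Examples
A small usage sketch, mirroring the construction used in the as.data.table examples:
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
sm <- sento_measures(sento_corpus(corpusdf = usnews[1:200, ]),
                     sento_lexicons(list_lexicons["LM_en"]),
                     ctr_agg(lag = 3))
get_dates(sm)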
Get the dimensions of the sentiment measures
Description
Returns the components across all three dimensions of the sentiment measures.
Usage
get_dimensions(sento_measures)
Arguments
sento_measures: a sento_measures object created with sento_measures.
Value
The "features"
, "lexicons"
and "time"
elements in sento_measures
.
Author(s)
Samuel Borms
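Examples
A small usage sketch, mirroring the construction used in the as.data.table examples:
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
sm <- sento_measures(sento_corpus(corpusdf = usnews[1:200, ]),
                     sento_lexicons(list_lexicons["LM_en"]),
                     ctr_agg(lag = 3))
get_dimensions(sm) # the "features", "lexicons" and "time" components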
Options supported to perform aggregation into sentiment measures
Description
Outputs the supported aggregation arguments. Call for information purposes only. Used within
ctr_agg
to check if supplied aggregation hows are supported.
Usage
get_hows()
Details
See the package's vignette for a detailed explanation of all aggregation options.
Value
A list with the supported aggregation hows for the arguments howWithin ("words"), howDocs ("docs") and howTime ("time"), to be supplied to ctr_agg.
See Also
ctr_agg
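Examples
A direct call, listing the supported options per aggregation level:
hows <- get_hows()
hows$words # within-document aggregation options
hows$docs  # across-document aggregation options
hows$time  # across-time aggregation options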
Retrieve loss data from a selection of models
Description
Structures specific performance data for a set of different sento_modelIter
objects as loss data.
Can then be used, for instance, as an input to create a model confidence set (Hansen, Lunde and Nason, 2011) with
the MCS package.
Usage
get_loss_data(models, loss = c("DA", "error", "errorSq", "AD", "accuracy"))
Arguments
models: a named list of sento_modelIter objects.
loss: a single character vector, one of "DA", "error", "errorSq", "AD" or "accuracy", defining the loss metric to compute.
Value
A matrix
of loss data.
Author(s)
Samuel Borms
References
Hansen, Lunde and Nason (2011). The model confidence set. Econometrica 79, 453-497, doi:10.3982/ECTA5771.
See Also
sento_model
Examples
## Not run:
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")
data("epu", package = "sentometrics")
set.seed(505)
# construct two sento_measures objects
corpusAll <- sento_corpus(corpusdf = usnews)
corpus <- quanteda::corpus_subset(corpusAll, date >= "1997-01-01" & date < "2014-10-01")
l <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")], list_valence_shifters[["en"]])
ctrA <- ctr_agg(howWithin = "proportionalPol", howDocs = "proportional",
howTime = c("equal_weight", "linear"), by = "month", lag = 3)
sentMeas <- sento_measures(corpus, l, ctrA)
# prepare y and other x variables
y <- epu[epu$date %in% get_dates(sentMeas), "index"]
length(y) == nobs(sentMeas) # TRUE
x <- data.frame(runif(length(y)), rnorm(length(y))) # two other (random) x variables
colnames(x) <- c("x1", "x2")
# estimate different type of regressions
ctrM <- ctr_model(model = "gaussian", type = "AIC", do.iter = TRUE,
h = 0, nSample = 120, start = 50)
out1 <- sento_model(sentMeas, y, x = x, ctr = ctrM)
out2 <- sento_model(sentMeas, y, x = NULL, ctr = ctrM)
out3 <- sento_model(subset(sentMeas, select = "linear"), y, x = x, ctr = ctrM)
out4 <- sento_model(subset(sentMeas, select = "linear"), y, x = NULL, ctr = ctrM)
lossData <- get_loss_data(models = list(m1 = out1, m2 = out2, m3 = out3, m4 = out4),
loss = "errorSq")
mcs <- MCS::MCSprocedure(lossData)
## End(Not run)
Built-in lexicons
Description
A list
containing all built-in lexicons as a data.table
with two columns: an x
column with the words,
and a y
column with the polarities. The list
element names incorporate consecutively the name and language
(based on the two-letter ISO code convention as in stopwords
), and "_tr"
as
suffix if the lexicon is translated. The translation was done via Microsoft Translator through Microsoft
Word. Only the entries that conform to the original language entry after retranslation, and those that have actually been
translated, are kept. The last condition is assumed to be fulfilled when the translation differs from the original entry.
All words are unigrams and in lowercase. The built-in lexicons are the following:
FEEL_en_tr
FEEL_fr (Abdaoui, Azé, Bringay and Poncelet, 2017)
FEEL_nl_tr
GI_en (General Inquirer, i.e. Harvard IV-4 combined with Laswell)
GI_fr_tr
GI_nl_tr
HENRY_en (Henry, 2008)
HENRY_fr_tr
HENRY_nl_tr
LM_en (Loughran and McDonald, 2011)
LM_fr_tr
LM_nl_tr
Other useful lexicons can be found in the lexicon package, more specifically the datasets preceded by
hash_sentiment_
.
Usage
data("list_lexicons")
Format
A list
with all built-in lexicons, appropriately named as "NAME_language(_tr)"
.
Source
FEEL lexicon. Retrieved November 1, 2017.
GI lexicon. Retrieved November 1, 2017.
HENRY lexicon. Retrieved November 1, 2017.
LM lexicon. Retrieved November 1, 2017.
References
Abdaoui, Azé, Bringay and Poncelet (2017). FEEL: French Expanded Emotion Lexicon. Language Resources & Evaluation 51, 833-855, doi:10.1007/s10579-016-9364-5.
Henry (2008). Are investors influenced by how earnings press releases are written?. Journal of Business Communication 45, 363-407, doi:10.1177/0021943608319388.
Loughran and McDonald (2011). When is a liability not a liability? Textual analysis, dictionaries, and 10-Ks. Journal of Finance 66, 35-65, doi:10.1111/j.1540-6261.2010.01625.x.
Examples
data("list_lexicons", package = "sentometrics")
list_lexicons[c("FEEL_en_tr", "LM_en")]
Built-in valence word lists
Description
A list
containing all built-in valence word lists, as data.table
s with three columns: an x
column with
the words, a y
column with the values associated to each word, and a t
column with the type of valence
shifter (1
= negators, 2
= amplifiers, 3
= deamplifiers,
4
= adversative conjunctions). The list
element names indicate the language
(based on the two-letter ISO code convention as in stopwords
) of the valence word list.
All non-English word lists are translated via Microsoft Translator through Microsoft Word. Only the entries whose
translation differs from the original entry are kept. All words are unigrams and in lowercase. The built-in valence word
lists are available in following languages:
English ("en")
French ("fr")
Dutch ("nl")
Usage
data("list_valence_shifters")
Format
A list
with all built-in valence word lists, appropriately named.
Source
hash_valence_shifters
(English valence shifters). Retrieved August 24, 2018.
Examples
data("list_valence_shifters", package = "sentometrics")
list_valence_shifters["en"]
Add and fill missing dates to sentiment measures
Description
Adds missing dates between the earliest and latest date of a sento_measures object, or between two more extreme boundary dates, such that the time series are continuous date-wise. Fills in any missing date with either 0 or the most recent non-missing value.
Usage
measures_fill(
sento_measures,
fill = "zero",
dateBefore = NULL,
dateAfter = NULL
)
Arguments
sento_measures: a sento_measures object created with sento_measures.
fill: an element of c("zero", "latest"); the first fills missing dates with 0, the second with the most recent non-missing value.
dateBefore: a date as "yyyy-mm-dd", to stretch the sentiment time series from before its earliest date; if NULL (default), no stretching occurs.
dateAfter: a date as "yyyy-mm-dd", to stretch the sentiment time series up to after its latest date; if NULL (default), no stretching occurs.
Details
The dateBefore
and dateAfter
dates are converted according to the sento_measures[["by"]]
frequency.
Value
A modified sento_measures
object.
Author(s)
Samuel Borms
Examples
# construct a sento_measures object to start with
corpus <- sento_corpus(corpusdf = sentometrics::usnews)
corpusSample <- quanteda::corpus_sample(corpus, size = 500)
l <- sento_lexicons(sentometrics::list_lexicons[c("LM_en", "HENRY_en")],
sentometrics::list_valence_shifters[["en"]])
ctr <- ctr_agg(howTime = c("equal_weight", "linear"), by = "day", lag = 7, fill = "none")
sento_measures <- sento_measures(corpusSample, l, ctr)
# fill measures
f1 <- measures_fill(sento_measures)
f2 <- measures_fill(sento_measures, fill = "latest")
f3 <- measures_fill(sento_measures, fill = "zero",
dateBefore = get_dates(sento_measures)[1] - 10,
dateAfter = tail(get_dates(sento_measures), 1) + 15)
Update sentiment measures
Description
Updates a sento_measures
object based on a new sento_corpus
provided.
Sentiment for the unseen corpus texts is calculated and aggregated applying the control variables from the input sento_measures object.
Usage
measures_update(sento_measures, sento_corpus, lexicons)
Arguments
sento_measures: a sento_measures object created with sento_measures.
sento_corpus: a sento_corpus object with the new texts, created with sento_corpus.
lexicons: a sento_lexicons object, the same as used to create the input sento_measures object.
Value
An updated sento_measures
object.
Author(s)
Jeroen Van Pelt, Samuel Borms, Andres Algaba
See Also
sento_measures
, compute_sentiment
Examples
data("usnews", package = "sentometrics")
corpus1 <- sento_corpus(usnews[1:500, ])
corpus2 <- sento_corpus(usnews[400:2000, ])
ctr <- ctr_agg(howTime = "linear", by = "year", lag = 3)
l <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")],
list_valence_shifters[["en"]])
sento_measures <- sento_measures(corpus1, l, ctr)
sento_measuresNew <- measures_update(sento_measures, corpus2, l)
Merge sentiment objects horizontally and/or vertically
Description
Combines multiple sentiment
objects with possibly different column names
into a new sentiment
object. Here, too, any resulting NA
values are converted to zero.
Usage
## S3 method for class 'sentiment'
merge(...)
Arguments
...: two or more sentiment objects to merge.
Value
The new, combined, sentiment
object, ordered by "date"
and "id"
.
Author(s)
Samuel Borms
Examples
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")
l1 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")])
l2 <- sento_lexicons(list_lexicons[c("FEEL_en_tr")])
l3 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en", "FEEL_en_tr")])
corp1 <- sento_corpus(corpusdf = usnews[1:200, ])
corp2 <- sento_corpus(corpusdf = usnews[201:450, ])
corp3 <- sento_corpus(corpusdf = usnews[401:700, ])
s1 <- compute_sentiment(corp1, l1, "proportionalPol")
s2 <- compute_sentiment(corp2, l1, "counts")
s3 <- compute_sentiment(corp3, l1, "counts")
s4 <- compute_sentiment(corp2, l1, "counts", do.sentence = TRUE)
s5 <- compute_sentiment(corp3, l2, "proportional", do.sentence = TRUE)
s6 <- compute_sentiment(corp3, l1, "counts", do.sentence = TRUE)
s7 <- compute_sentiment(corp3, l3, "UShaped", do.sentence = TRUE)
# straightforward row-wise merge
m1 <- merge(s1, s2, s3)
nrow(m1) == 700 # TRUE
# another straightforward row-wise merge
m2 <- merge(s4, s6)
# merge of sentence and non-sentence calculations
m3 <- merge(s3, s6)
# different methods add columns
m4 <- merge(s4, s5)
nrow(m4) == nrow(m2) # TRUE
# different methods and weighting add rows and columns
## rows are added only when the different weighting
## approach for a specific method gives other sentiment values
m5 <- merge(s4, s7)
nrow(m5) > nrow(m4) # TRUE
Get number of sentiment measures
Description
Returns the number of sentiment measures.
Usage
nmeasures(sento_measures)
Arguments
sento_measures: a sento_measures object created with sento_measures.
Value
The number of sentiment measures in the input sento_measures
object.
Author(s)
Samuel Borms
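Examples
A small usage sketch, mirroring the construction used in the as.data.table examples:
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
sm <- sento_measures(sento_corpus(corpusdf = usnews[1:200, ]),
                     sento_lexicons(list_lexicons["LM_en"]),
                     ctr_agg(lag = 3))
nmeasures(sm) # lexicons x features x time weighting schemes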
Get number of observations in the sentiment measures
Description
Returns the number of data points available in the sentiment measures.
Usage
## S3 method for class 'sento_measures'
nobs(object, ...)
Arguments
object: a sento_measures object created with sento_measures.
...: not used.
Value
The number of rows (observations/data points) in object[["measures"]]
.
Author(s)
Samuel Borms
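Examples
A small usage sketch, mirroring the construction used in the as.data.table examples:
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
sm <- sento_measures(sento_corpus(corpusdf = usnews[1:200, ]),
                     sento_lexicons(list_lexicons["LM_en"]),
                     ctr_agg(lag = 3))
nobs(sm) # number of dates in the sentiment time series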
Extract dates related to sentiment time series peaks
Description
This function extracts the dates for which aggregated time series sentiment is most extreme (lowest, highest or both in absolute terms). The extracted dates are unique, even when, for example, all most extreme sentiment values (for different sentiment measures) occur on only one date.
Usage
peakdates(sento_measures, n = 10, type = "both", do.average = FALSE)
Arguments
sento_measures: a sento_measures object created with sento_measures.
n: a positive numeric value defining the number of dates to extract; a value between 0 and 1 is interpreted as a quantile.
type: a character vector, either "both" (default; most extreme in absolute terms), "pos" or "neg".
do.average: a logical indicating whether the sentiment measures should first be averaged per date before the extraction.
Value
A vector of type "Date"
corresponding to the n
extracted sentiment peak dates.
Author(s)
Samuel Borms
Examples
set.seed(505)
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")
# construct a sento_measures object to start with
corpus <- sento_corpus(corpusdf = usnews)
corpusSample <- quanteda::corpus_sample(corpus, size = 500)
l <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")], list_valence_shifters[["en"]])
ctr <- ctr_agg(howTime = c("equal_weight", "linear"), by = "month", lag = 3)
sento_measures <- sento_measures(corpusSample, l, ctr)
# extract the peaks
peaksAbs <- peakdates(sento_measures, n = 5)
peaksAbsQuantile <- peakdates(sento_measures, n = 0.50)
peaksPos <- peakdates(sento_measures, n = 5, type = "pos")
peaksNeg <- peakdates(sento_measures, n = 5, type = "neg")
Extract documents related to sentiment peaks
Description
This function extracts the documents with most extreme sentiment (lowest, highest or both in absolute terms). The extracted documents are unique, even when, for example, all most extreme sentiment values (across sentiment calculation methods) occur only for one document.
Usage
peakdocs(sentiment, n = 10, type = "both", do.average = FALSE)
Arguments
sentiment: a sentiment object created with compute_sentiment or as.sentiment.
n: a positive numeric value defining the number of documents to extract; a value between 0 and 1 is interpreted as a quantile.
type: a character vector, either "both" (default; most extreme in absolute terms), "pos" or "neg".
do.average: a logical indicating whether the sentiment scores should first be averaged per document before the extraction.
Value
A vector of type "character"
corresponding to the n
extracted document identifiers.
Author(s)
Samuel Borms
Examples
set.seed(505)
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")
l <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")])
corpus <- sento_corpus(corpusdf = usnews)
corpusSample <- quanteda::corpus_sample(corpus, size = 200)
sent <- compute_sentiment(corpusSample, l, how = "proportionalPol")
# extract the peaks
peaksAbs <- peakdocs(sent, n = 5)
peaksAbsQuantile <- peakdocs(sent, n = 0.50)
peaksPos <- peakdocs(sent, n = 5, type = "pos")
peaksNeg <- peakdocs(sent, n = 5, type = "neg")
Plot prediction attributions at specified level
Description
Shows a plot of the attributions along the dimension provided, stacked per date.
Usage
## S3 method for class 'attributions'
plot(x, group = "features", ...)
Arguments
x: an attributions object created with attributions.
group: a value from c("lags", "lexicons", "features", "time").
...: not used.
Details
See sento_model
for an elaborate modeling example including the calculation and plotting of
attributions. This function does not handle the plotting of the attribution of individual documents, since there are
often a lot of documents involved and they appear only once at one date (even though a document may contribute to
predictions at several dates, depending on the number of lags in the time aggregation).
Value
Returns a simple ggplot
object, which can be added onto (or to alter its default elements) by using
the +
operator. By default, a legend is positioned at the top if the number of components of the dimension is at most twelve.
Author(s)
Samuel Borms, Keven Bluteau
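Examples
A hedged sketch, reusing the attributions computed in the sento_model examples:
## Not run:
# assumes 'attributions1' as constructed in the sento_model examples
plot(attributions1, "lexicons")
plot(attributions1, "time")
## End(Not run)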
Plot sentiment measures
Description
Plotting method that shows all sentiment measures from the provided sento_measures
object in one plot, or the average along one of the lexicons, features and time weighting dimensions.
Usage
## S3 method for class 'sento_measures'
plot(x, group = "all", ...)
Arguments
x: a sento_measures object created with sento_measures.
group: a value from c("all", "lexicons", "features", "time").
...: not used.
Value
Returns a simple ggplot
object, which can be added onto (or to alter its default elements) by using
the +
operator (see example). By default, a legend is positioned at the top if there are at most twelve line graphs plotted and group is different from "all".
Author(s)
Samuel Borms
Examples
# construct a sento_measures object to start with
corpus <- sento_corpus(corpusdf = sentometrics::usnews)
corpusSample <- quanteda::corpus_sample(corpus, size = 500)
l <- sento_lexicons(sentometrics::list_lexicons[c("LM_en")],
sentometrics::list_valence_shifters[["en"]])
ctr <- ctr_agg(howTime = c("equal_weight", "linear"), by = "month", lag = 3)
sm <- sento_measures(corpusSample, l, ctr)
# plot sentiment measures
plot(sm, "features")
## Not run:
# adjust appearance of plot
library("ggplot2")
p <- plot(sm)
p <- p +
scale_x_date(name = "year", date_labels = "%Y") +
scale_y_continuous(name = "newName")
p
## End(Not run)
Plot iterative predictions versus realized values
Description
Displays a plot of all predictions made through the iterative model computation as incorporated in the
input sento_modelIter
object, as well as the corresponding true values.
Usage
## S3 method for class 'sento_modelIter'
plot(x, ...)
Arguments
x: a sento_modelIter object created with sento_model.
...: not used.
Details
See sento_model
for an elaborate modeling example including the plotting of out-of-sample
performance.
Value
Returns a simple ggplot
object, which can be added onto (or to alter its default elements) by using
the +
operator.
Author(s)
Samuel Borms
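Examples
A hedged sketch, reusing the iterative model estimated in the sento_model examples:
## Not run:
# assumes 'out2' (a sento_modelIter object) from the sento_model examples
p <- plot(out2)
p + ggplot2::theme_minimal() # the ggplot object can be modified further
## End(Not run)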
Make predictions from a sento_model object
Description
Prediction method for sento_model
class, with usage along the lines of
predict.glmnet
, but simplified in terms of parameters.
Usage
## S3 method for class 'sento_model'
predict(object, newx, type = "response", offset = NULL, ...)
Arguments
object: a sento_model object created with sento_model.
newx: a data matrix used for the prediction(s), with columns in the same order as the variables used in the estimation.
type: type of prediction required, a value from c("link", "response", "class"); see predict.glmnet.
offset: not used.
...: not used.
Value
A prediction output depending on the type
argument.
Author(s)
Samuel Borms
See Also
sento_model, predict.glmnet
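Examples
A hedged sketch, reusing the objects built in the sento_model examples:
## Not run:
# assumes 'out1', 'sento_measures' and 'x' from the sento_model examples
newx <- cbind(data.table::as.data.table(sento_measures)[, -1], x)[30:40, ]
preds <- predict(out1, newx = as.matrix(newx), type = "response")
## End(Not run)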
Scaling and centering of sentiment measures
Description
Scales and centers the sentiment measures from a sento_measures
object, column-per-column. By default,
the measures are normalized. NA
s are removed first.
Usage
## S3 method for class 'sento_measures'
scale(x, center = TRUE, scale = TRUE)
Arguments
x: a sento_measures object created with sento_measures.
center: a logical or a numeric matrix (see Details).
scale: a logical or a numeric matrix (see Details).
Details
If one of the arguments center or scale is a matrix, this operation will be applied first, and any further centering or scaling is computed on that data.
Value
A modified sento_measures
object, with the measures replaced by the scaled measures as well as updated
statistics.
Author(s)
Samuel Borms
Examples
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")
set.seed(505)
# construct a sento_measures object to start with
corpus <- sento_corpus(corpusdf = usnews)
corpusSample <- quanteda::corpus_sample(corpus, size = 500)
l <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")])
ctr <- ctr_agg(howTime = c("equal_weight", "linear"), by = "year", lag = 3)
sento_measures <- sento_measures(corpusSample, l, ctr)
# scale sentiment measures to zero mean and unit standard deviation
sc1 <- scale(sento_measures)
n <- nobs(sento_measures)
m <- nmeasures(sento_measures)
# subtract a matrix
sc2 <- scale(sento_measures, center = matrix(runif(n * m), n, m), scale = FALSE)
# divide every row observation based on a one-column matrix, then center
sc3 <- scale(sento_measures, center = TRUE, scale = matrix(runif(n)))
Create a sento_corpus object
Description
Formalizes a collection of texts into a sento_corpus
object derived from the quanteda
corpus
object. The quanteda package provides a robust text mining infrastructure
(see their website), including a handy corpus manipulation toolset. This function
performs a set of checks on the input data and prepares the corpus for further analysis by structurally
integrating a date dimension and numeric metadata features.
Usage
sento_corpus(corpusdf, do.clean = FALSE)
Arguments
corpusdf: a data.frame (or data.table) with an "id", a "date" and a "texts" column, optionally followed by numeric feature columns and/or a "language" column (see Details).
do.clean: a logical indicating whether the texts should undergo a cleaning routine before the corpus is constructed.
Details
A sento_corpus
object is a specialized instance of a quanteda corpus
. Any
quanteda function applicable to its corpus
object can also be applied to a sento_corpus
object. However, changing a given sento_corpus
object too drastically using some of quanteda's functions might
alter the very structure the corpus is meant to have (as defined in the corpusdf
argument) to be able to be used as
an input in other functions of the sentometrics package. There are functions, including
corpus_sample
or corpus_subset
, that do not change the actual corpus
structure and may come in handy.
To add additional features, use add_features. Binary features are useful as a mechanism to select the texts which have to be integrated in the respective feature-based sentiment measure(s), but this applies only when do.ignoreZeros = TRUE. Because of this (implicit) selection that can be performed, having complementary features (e.g., "economy" and "noneconomy") makes sense.
It is also possible to add one non-numerical feature, that is, "language"
, to designate the language
of the corpus texts. When this feature is provided, a list
of lexicons for different
languages is expected in the compute_sentiment
function.
Value
A sento_corpus
object, derived from a quanteda corpus
object. The corpus is ordered by date.
Author(s)
Samuel Borms
See Also
add_features, as.sento_corpus
Examples
data("usnews", package = "sentometrics")
# corpus construction
corp <- sento_corpus(corpusdf = usnews)
# take a random subset making use of quanteda
corpusSmall <- quanteda::corpus_sample(corp, size = 500)
# deleting a feature
quanteda::docvars(corp, field = "wapo") <- NULL
# deleting all features results in the addition of a dummy feature
quanteda::docvars(corp, field = c("economy", "noneconomy", "wsj")) <- NULL
## Not run:
# to add or replace features, use the add_features() function...
quanteda::docvars(corp, field = c("wsj", "new")) <- 1
## End(Not run)
# corpus creation when no features are present
corpusDummy <- sento_corpus(corpusdf = usnews[, 1:3])
# corpus creation with a qualitative language feature
usnews[["language"]] <- "en"
usnews[["language"]][c(200:400)] <- "nl"
corpusLang <- sento_corpus(corpusdf = usnews)
Set up lexicons (and valence word list) for use in sentiment analysis
Description
Structures provided lexicon(s) and optionally valence words. One can for example combine (part of) the
built-in lexicons from data("list_lexicons")
with other lexicons, and add one of the built-in valence word lists
from data("list_valence_shifters")
. This function makes the output coherent, by converting all words to
lowercase and checking for duplicates. All entries consisting of more than one word are discarded, as required for
bag-of-words sentiment analysis.
Usage
sento_lexicons(lexiconsIn, valenceIn = NULL, do.split = FALSE)
Arguments
lexiconsIn: a named list of (raw) lexicons, each element as a data.table or a data.frame with the words and the polarity scores in the first two columns.
valenceIn: a single valence word list as a data.table or a data.frame, with an "x" column of words and a "y" and/or "t" column (see the Calculation section of compute_sentiment); if NULL (default), no valence word list is included.
do.split: a logical indicating whether every lexicon should be split into a positive ("_POS") and a negative ("_NEG") part.
Value
A list
of class sento_lexicons
with each lexicon as a separate element according to its name, as a
data.table
, and optionally an element named valence
that comprises the valence words. Every "x"
column
contains the words, every "y"
column contains the scores. The "t"
column for valence shifters
contains the different types.
Author(s)
Samuel Borms
Examples
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")
# lexicons straight from built-in word lists
l1 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")])
# including a self-made lexicon, with and without valence shifters
lexIn <- c(list(myLexicon = data.table::data.table(w = c("nice", "boring"), s = c(2, -1))),
list_lexicons[c("GI_en")])
valIn <- list_valence_shifters[["en"]]
l2 <- sento_lexicons(lexIn)
l3 <- sento_lexicons(lexIn, valIn)
l4 <- sento_lexicons(lexIn, valIn[, c("x", "y")], do.split = TRUE)
l5 <- sento_lexicons(lexIn, valIn[, c("x", "t")], do.split = TRUE)
l6 <- l5[c("GI_en_POS", "valence")] # preserves sento_lexicons class
## Not run:
# include lexicons from lexicon package
lexIn2 <- list(hul = lexicon::hash_sentiment_huliu, joc = lexicon::hash_sentiment_jockers)
l7 <- sento_lexicons(c(lexIn, lexIn2), valIn)
## End(Not run)
## Not run:
# faulty extraction, no replacement allowed
l5["valence"]
l2[0]
l3[22]
l4[1] <- l2[1]
l4[[1]] <- l2[[1]]
l4$GI_en_NEG <- l2$myLexicon
## End(Not run)
One-way road towards a sento_measures object
Description
Wrapper function which assembles calls to compute_sentiment
and aggregate
.
Serves as the most direct way towards a panel of textual sentiment measures as a sento_measures
object.
Usage
sento_measures(sento_corpus, lexicons, ctr)
Arguments
sento_corpus: a sento_corpus object created with sento_corpus.
lexicons: a sento_lexicons object created with sento_lexicons.
ctr: output from a ctr_agg call.
Details
As a general rule, the names of the features, lexicons and time weighting schemes may not contain any ‘-’ symbol.
Value
A sento_measures
object, which is a list
containing:
measures: a data.table with a "date" column and all textual sentiment measures as remaining columns.
features: a character vector of the features across which sentiment was aggregated.
lexicons: a character vector of the lexicons used.
time: a character vector of the time weighting schemes used.
stats: a data.table with summary statistics of the sentiment measures.
sentiment: the document-level sentiment scores data.table, as from compute_sentiment.
attribWeights: a list of the weights used, required by attributions.
ctr: a list encapsulating the control parameters.
Author(s)
Samuel Borms, Keven Bluteau
See Also
compute_sentiment
, aggregate
, measures_update
Examples
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")
# construct a sento_measures object to start with
corpus <- sento_corpus(corpusdf = usnews)
corpusSample <- quanteda::corpus_sample(corpus, size = 500)
l <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")], list_valence_shifters[["en"]])
ctr <- ctr_agg(howWithin = "counts",
howDocs = "proportional",
howTime = c("equal_weight", "linear", "almon"),
by = "month",
lag = 3,
ordersAlm = 1:3,
do.inverseAlm = TRUE)
sento_measures <- sento_measures(corpusSample, l, ctr)
summary(sento_measures)
Optimized and automated sentiment-based sparse regression
Description
Linear or nonlinear penalized regression of any dependent variable on the wide number of sentiment measures and potentially other explanatory variables. Either performs a regression given the provided variables at once, or computes regressions sequentially for a given sample size over a longer time horizon, with associated prediction performance metrics.
Usage
sento_model(sento_measures, y, x = NULL, ctr)
Arguments
sento_measures: a sento_measures object created with sento_measures.
y: a one-column data.frame or a numeric vector capturing the response variable.
x: a named data.frame with other explanatory variables as numeric columns, or NULL (default) to use the sentiment measures only.
ctr: output from a ctr_model call.
Details
Models are computed using the elastic net regularization as implemented in the glmnet package, to account for
the multidimensionality of the sentiment measures. Independent variables are normalized in the regression process, but
coefficients are returned in their original space. For a helpful introduction to glmnet, we refer to their
vignette. The optimal elastic net parameters lambda and alpha are calibrated either through an information criterion to be specified or through cross-validation (based on the "rolling forecasting origin" principle, using the train function).
In the latter case, the training metric is automatically set to "RMSE"
for a linear model and to "Accuracy"
for a logistic model. We suppress many of the details that can be supplied to the glmnet
and
train
functions we rely on, for the sake of user-friendliness.
Value
If ctr$do.iter = FALSE
, a sento_model
object which is a list
containing:
reg: optimized regression, i.e., a model-specific glmnet object, including for example the estimated coefficients.
model: the input argument ctr$model.
alpha: calibrated alpha.
lambda: calibrated lambda.
trained: output from the train call (if type = "cv").
ic: a list with the information criterion outputs (if calibration is criterion-based).
dates: sample reference dates as a two-element character vector.
nVar: a vector of size two, with respectively the number of sentiment measures, and the number of other explanatory variables inputted.
discarded: a named logical vector indicating which sentiment measures were discarded in the regression.
If ctr$do.iter = TRUE
, a sento_modelIter
object which is a list
containing:
models: all sparse regressions, i.e., separate sento_model objects for every iteration.
alphas: calibrated alphas.
lambdas: calibrated lambdas.
performance: a data.table with performance-related measures per out-of-sample prediction.
Author(s)
Samuel Borms, Keven Bluteau
See Also
ctr_model, glmnet, train, attributions
Examples
## Not run:
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")
data("epu", package = "sentometrics")
set.seed(505)
# construct a sento_measures object to start with
corpusAll <- sento_corpus(corpusdf = usnews)
corpus <- quanteda::corpus_subset(corpusAll, date >= "2004-01-01")
l <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")])
ctr <- ctr_agg(howWithin = "counts", howDocs = "proportional",
howTime = c("equal_weight", "linear"),
by = "month", lag = 3)
sento_measures <- sento_measures(corpus, l, ctr)
# prepare y and other x variables
y <- epu[epu$date %in% get_dates(sento_measures), "index"]
length(y) == nobs(sento_measures) # TRUE
x <- data.frame(runif(length(y)), rnorm(length(y))) # two other (random) x variables
colnames(x) <- c("x1", "x2")
# a linear model based on the Akaike information criterion
ctrIC <- ctr_model(model = "gaussian", type = "AIC", do.iter = FALSE, h = 4,
do.difference = TRUE)
out1 <- sento_model(sento_measures, y, x = x, ctr = ctrIC)
# attribution and prediction as post-analysis
attributions1 <- attributions(out1, sento_measures,
refDates = get_dates(sento_measures)[20:25])
plot(attributions1, "features")
nx <- nmeasures(sento_measures) + ncol(x)
newx <- runif(nx) * cbind(data.table::as.data.table(sento_measures)[, -1], x)[30:40, ]
preds <- predict(out1, newx = as.matrix(newx), type = "link")
# an iterative out-of-sample analysis, parallelized
ctrIter <- ctr_model(model = "gaussian", type = "BIC", do.iter = TRUE, h = 3,
oos = 2, alphas = c(0.25, 0.75), nSample = 75, nCore = 2)
out2 <- sento_model(sento_measures, y, x = x, ctr = ctrIter)
summary(out2)
# plot predicted vs. realized values
p <- plot(out2)
p
# a cross-validation based model, parallelized
cl <- parallel::makeCluster(2)
doParallel::registerDoParallel(cl)
ctrCV <- ctr_model(model = "gaussian", type = "cv", do.iter = FALSE,
h = 0, alphas = c(0.10, 0.50, 0.90), trainWindow = 70,
testWindow = 10, oos = 0, do.progress = TRUE)
out3 <- sento_model(sento_measures, y, x = x, ctr = ctrCV)
parallel::stopCluster(cl)
foreach::registerDoSEQ()
summary(out3)
# a cross-validation based model for a binomial target
yb <- epu[epu$date %in% get_dates(sento_measures), "above"]
ctrCVb <- ctr_model(model = "binomial", type = "cv", do.iter = FALSE,
h = 0, alphas = c(0.10, 0.50, 0.90), trainWindow = 70,
testWindow = 10, oos = 0, do.progress = TRUE)
out4 <- sento_model(sento_measures, yb, x = x, ctr = ctrCVb)
summary(out4)
## End(Not run)
Defunct functions
Description
Functions defunct due to changed naming or because their functionality was dropped. See the NEWS file for more information about when and why functions became defunct.
Usage
ctr_merge(...)
perform_MCS(...)
fill_measures(...)
merge_measures(...)
to_global(...)
subset_measures(...)
select_measures(...)
setup_lexicons(...)
retrieve_attributions(...)
perform_agg(...)
plot_attributions(...)
almons(...)
exponentials(...)
to_sentocorpus(...)
to_sentiment(...)
get_measures(...)
measures_subset(...)
measures_select(...)
measures_delete(...)
sentiment_bind(...)
measures_merge(...)
measures_global(...)
sento_app(...)
Arguments
... | allowed input arguments. |
Deprecated functions
Description
Functions deprecated due to changed naming or because their functionality was dropped. The general (but not blindly followed) rule is that deprecated functions become defunct after one major or two minor package updates. See the NEWS file for more information about when and why functions were deprecated.
Subset sentiment measures
Description
Subsets rows of the sentiment measures based on their dates or values, and/or selects or deletes specific sentiment measures.
Usage
## S3 method for class 'sento_measures'
subset(x, subset = NULL, select = NULL, delete = NULL, ...)
Arguments
x | a sento_measures object created with sento_measures. |
subset | a logical (non-character) expression indicating the rows to keep, in terms of the date column or the values of the sentiment measures, or a numeric vector of row indices. |
select | a character vector of the lexicon, feature and time weighting scheme names whose combinations should be selected, or a list of such character vectors for more specific combinations. |
delete | see the select argument, but to delete the indicated sentiment measures instead. |
... | not used. |
Value
A modified sento_measures object, with only the remaining rows and sentiment measures, including updated information and statistics, but the original sentiment scores data.table untouched.
Author(s)
Samuel Borms
Examples
data("usnews", package = "sentometrics")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")
# construct a sento_measures object to start with
corpus <- sento_corpus(corpusdf = usnews)
corpusSample <- quanteda::corpus_sample(corpus, size = 500)
l <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")])
ctr <- ctr_agg(howTime = c("equal_weight", "linear"), by = "year", lag = 3)
sm <- sento_measures(corpusSample, l, ctr)
# three specified indices in required list format
three <- as.list(
stringi::stri_split(c("LM_en--economy--linear",
"HENRY_en--wsj--equal_weight",
"HENRY_en--wapo--equal_weight"),
regex = "--")
)
# different subsets
sub1 <- subset(sm, HENRY_en--economy--equal_weight >= 0.01)
sub2 <- subset(sm, date %in% get_dates(sm)[3:12])
sub3 <- subset(sm, 3:12)
sub4 <- subset(sm, 1:100) # warning
# different selections
sel1 <- subset(sm, select = "equal_weight")
sel2 <- subset(sm, select = c("equal_weight", "linear"))
sel3 <- subset(sm, select = c("linear", "LM_en"))
sel4 <- subset(sm, select = list(c("linear", "wsj"), c("linear", "economy")))
sel5 <- subset(sm, select = three)
# different deletions
del1 <- subset(sm, delete = "equal_weight")
del2 <- subset(sm, delete = c("linear", "LM_en"))
del3 <- subset(sm, delete = list(c("linear", "wsj"), c("linear", "economy")))
del4 <- subset(sm, delete = c("equal_weight", "linear")) # warning
del5 <- subset(sm, delete = three)
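# a quick sanity check of the subsetting and selection behavior above
# (a minimal sketch, using accessors that appear elsewhere in this manual)
nmeasures(sm) # number of measures before
nmeasures(sel1) # fewer measures remain after a selection
nobs(sm) # number of dates before
nobs(sub3) # fewer rows remain after a row subset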
Texts (not) relevant to the U.S. economy
Description
A collection of texts annotated by humans in terms of relevance to the U.S. economy or not. The texts come from two major journals in the U.S. (The Wall Street Journal and The Washington Post) and cover 4145 documents between 1995 and 2014. It contains the following information:
id. A character ID identifier.
date. Date as "yyyy-mm-dd".
texts. Texts in character format.
wsj. Equals 1 if the article comes from The Wall Street Journal.
wapo. Equals 1 if the article comes from The Washington Post (complementary to 'wsj').
economy. Equals 1 if the article is relevant to the U.S. economy.
noneconomy. Equals 1 if the article is not relevant to the U.S. economy (complementary to 'economy').
Usage
data("usnews")
Format
A data.frame, formatted as required to be an input for sento_corpus.
Source
Economic News Article Tone and Relevance dataset. Retrieved November 1, 2017.
Examples
data("usnews", package = "sentometrics")
usnews[3192, "texts"]
usnews[1:5, c("id", "date", "texts")]
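# the complementary feature columns described above can be verified
# with a simple base R tabulation (a minimal sketch)
colSums(usnews[, c("wsj", "wapo")]) # number of articles per source
colSums(usnews[, c("economy", "noneconomy")]) # relevant vs. not relevant
stopifnot(all(usnews$wsj + usnews$wapo == 1)) # the two sources are complementary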
Compute Almon polynomials
Description
Computes Almon polynomial weighting curves. Handy to self-select specific time aggregation weighting schemes for input in ctr_agg using the weights argument.
Usage
weights_almon(n, orders = 1:3, do.inverse = TRUE, do.normalize = TRUE)
Arguments
n | a single numeric to indicate the lag length (number of lags). |
orders | a numeric vector of the Almon polynomial orders. |
do.inverse | a logical, if TRUE the inverse Almon polynomial curves are computed as well. |
do.normalize | a logical, if TRUE the weights are normalized to sum to one. |
Details
The Almon polynomial formula implemented is:
(1 - (1 - i/n)^{r})(1 - i/n)^{R - r}
, where i
is the lag index ordered from
1 to n
. The inverse is computed by changing i/n
to 1 - i/n
.
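The formula above is easily reproduced in base R. The sketch below is illustrative only (not the package implementation); it assumes R = max(orders) and normalizes each curve to sum to one:
almon_sketch <- function(n, orders = 1:3) {
  i <- 1:n
  R <- max(orders) # assumption: R is the maximum order considered
  w <- sapply(orders, function(r) (1 - (1 - i/n)^r) * (1 - i/n)^(R - r))
  apply(w, 2, function(x) x / sum(x)) # normalize each curve to sum to one
}
almon_sketch(12)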
Value
A data.frame of all Almon polynomial weighting curves, of size length(orders) (times two if do.inverse = TRUE).
See Also
ctr_agg
Compute Beta weighting curves
Description
Computes Beta weighting curves as in Ghysels, Sinko and Valkanov (2007). Handy to self-select specific time aggregation weighting schemes for input in ctr_agg using the weights argument.
Usage
weights_beta(n, a = 1:4, b = 1:4, do.normalize = TRUE)
Arguments
n | a single numeric to indicate the lag length (number of lags). |
a | a numeric vector of values for the first Beta decay parameter. |
b | a numeric vector of values for the second Beta decay parameter. |
do.normalize | a logical, if TRUE the weights are normalized to sum to one. |
Details
The Beta weighting abides by the following formula: f(i/n; a, b) / \sum_{i} f(i/n; a, b), where i is the lag index ordered from 1 to n, a and b are two decay parameters, and f(x; a, b) = (x^{a - 1}(1 - x)^{b - 1}\Gamma(a + b)) / (\Gamma(a)\Gamma(b)), where \Gamma(.) is the gamma function.
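A direct base R sketch of the stated formula (illustrative only, not the package implementation) is:
beta_weights_sketch <- function(n, a = 2, b = 3) {
  x <- (1:n) / n
  f <- x^(a - 1) * (1 - x)^(b - 1) * gamma(a + b) / (gamma(a) * gamma(b))
  f / sum(f) # normalize so the weights sum to one
}
beta_weights_sketch(12) # note: the package sets all weights to 1 when n = 1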
Value
A data.frame of Beta weighting curves per combination of a and b. If n = 1, all weights are set to 1.
References
Ghysels, Sinko and Valkanov (2007). MIDAS regressions: Further results and new directions. Econometric Reviews 26, 53-90, doi:10.1080/07474930600972467.
See Also
ctr_agg
Compute exponential weighting curves
Description
Computes exponential weighting curves. Handy to self-select specific time aggregation weighting schemes for input in ctr_agg using the weights argument.
Usage
weights_exponential(
n,
alphas = seq(0.1, 0.5, by = 0.1),
do.inverse = FALSE,
do.normalize = TRUE
)
Arguments
n | a single numeric to indicate the lag length (number of lags). |
alphas | a numeric vector of decay factors. |
do.inverse | a logical, if TRUE the inverse exponential curves are computed as well. |
do.normalize | a logical, if TRUE the weights are normalized to sum to one. |
Value
A data.frame of exponential weighting curves per value of alphas.
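No closed-form expression is given in this manual for the exponential curves. Purely as an illustration, a generic exponential-decay weighting with decay factor alpha could look as follows (a hypothetical sketch; the package's exact parameterization may differ):
exp_weights_sketch <- function(n, alpha = 0.3) {
  w <- alpha^(1:n) # hypothetical decay form
  w / sum(w) # normalize so the weights sum to one
}
exp_weights_sketch(12)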