Type: | Package |
Title: | 'KorAP' Web Service Client Package |
Version: | 1.1.0 |
Description: | A client package that makes the 'KorAP' web service API accessible from R. The corpus analysis platform 'KorAP' has been developed as a scientific tool to make potentially large, stratified and multiply annotated corpora, such as the 'German Reference Corpus DeReKo' or the 'Corpus of the Contemporary Romanian Language CoRoLa', accessible for linguists to let them verify hypotheses and to find interesting patterns in real language use. The 'RKorAPClient' package provides access to 'KorAP' and the corpora behind it for user-created R code, as a programmatic alternative to the 'KorAP' web user-interface. You can learn more about 'KorAP' and use it directly on 'DeReKo' at https://korap.ids-mannheim.de/. |
Depends: | R (≥ 4.1.0) |
Language: | en-US |
License: | BSD_2_clause + file LICENSE |
URL: | https://github.com/KorAP/RKorAPClient/, https://korap.ids-mannheim.de/, https://www.ids-mannheim.de/digspra/kl/projekte/korap |
BugReports: | https://github.com/KorAP/RKorAPClient/issues |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
Imports: | R.cache, broom, ggplot2, tibble, magrittr, tidyr, dplyr, lubridate, highcharter, jsonlite, keyring, utils, httr2, curl, methods, PTXQC, purrr, stringr, urltools |
Suggests: | lifecycle, testthat, htmlwidgets, rmarkdown, shiny, vcd, kableExtra, knitr, purrrlyr, raster, tidyverse |
Collate: | 'logging.R' 'KorAPConnection.R' 'KorAPCorpusStats.R' 'RKorAPClient-package.R' 'KorAPQuery.R' 'association-scores.R' 'ci.R' 'collocationAnalysis.R' 'collocationScoreQuery.R' 'hc_add_onclick_korap_search.R' 'hc_freq_by_year_ci.R' 'misc.R' 'reexports.R' 'textMetadata.R' |
NeedsCompilation: | no |
Packaged: | 2025-06-26 15:12:25 UTC; kupietz |
Author: | Marc Kupietz [aut, cre], Nils Diewald [ctb], Leibniz Institute for the German Language [cph, fnd] |
Maintainer: | Marc Kupietz <kupietz@ids-mannheim.de> |
Repository: | CRAN |
Date/Publication: | 2025-06-26 16:10:02 UTC |
R Client for KorAP Corpus Analysis Platform
Description
RKorAPClient provides programmatic access to KorAP corpus analysis platform instances, enabling corpus linguistic research on large corpora such as DeReKo, CoRoLa, and DeLiKo@DNB.
Main Functions
- Connection: KorAPConnection(), auth(), persistAccessToken()
- Search: corpusQuery(), fetchAll(), fetchNext()
- Analysis: corpusStats(), frequencyQuery(), collocationAnalysis()
Quick Start
library(RKorAPClient)
# Connect and search
kcon <- KorAPConnection()
query <- corpusQuery(kcon, "Ameisenplage")
results <- fetchAll(query)
# Access results
results@collectedMatches
results@totalResults
Common Workflows
Basic Search:
kcon <- KorAPConnection()
kcon |> corpusQuery("search term") |> fetchAll()
Frequency Analysis:
frequencyQuery(kcon, c("term1", "term2"), vc="pubDate in 2020")
Corpus Statistics:
corpusStats(kcon, vc="textType=Zeitung", as.df=TRUE)
Author(s)
Maintainer: Marc Kupietz kupietz@ids-mannheim.de
Other contributors:
Nils Diewald diewald@ids-mannheim.de [contributor]
Leibniz Institute for the German Language [copyright holder, funder]
References
Kupietz, Marc / Diewald, Nils / Margaretha, Eliza (2020): RKorAPClient: An R package for accessing the German Reference Corpus DeReKo via KorAP. In: Calzolari, Nicoletta, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis (eds.): Proceedings of The 12th Language Resources and Evaluation Conference (LREC 2020) Marseille: European Language Resources Association (ELRA), 7017-7023. http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.867.pdf
See Also
Useful links:
Report bugs at https://github.com/KorAP/RKorAPClient/issues
Connect to KorAP Server
Description
KorAPConnection()
creates a connection to a KorAP server for corpus queries.
This is your starting point for all corpus analysis tasks.
Arguments
KorAPUrl |
URL of the web user interface of the KorAP server instance you want to access.
Defaults to the environment variable |
apiVersion |
which version of KorAP's API you want to connect to. Defaults to "v1.0". |
apiUrl |
URL of the KorAP web service. If not provided, it will be constructed from KorAPUrl and apiVersion. |
accessToken |
OAuth2 access token. For queries on corpus parts with restricted access (e.g. textual queries on IPR protected data), you need to authorize your application with an access token. You can obtain an access token in the OAuth settings of your KorAP web interface. More details are explained in the authorization section of the RKorAPClient Readme on GitHub.
To use authorization based on an access token in subsequent queries, initialize your KorAP connection with:
kco <- KorAPConnection(accessToken="<access token>")
To make the API token persistent for the currently used KorAP instance, call:
persistAccessToken(kco)
This will store it in your keyring using the keyring package. Subsequent KorAPConnection() calls will then automatically retrieve the token from your keyring. To stop using a persisted token, call clearAccessToken(kco).
An alternative to using an access token is to use a browser-based OAuth2 workflow to obtain an access token. This can be done with the auth() function. |
oauthClient |
OAuth2 client object. |
oauthScope |
OAuth2 scope. Defaults to "search match_info". |
authorizationSupported |
logical that indicates if authorization is supported/necessary for the current KorAP instance. Automatically set during initialization. |
userAgent |
user agent string. Defaults to "R-KorAP-Client". |
timeout |
timeout in seconds for API requests (this does not influence server internal timeouts). Defaults to 240 seconds. |
verbose |
logical that decides whether following operations will default to be verbose. Defaults to FALSE. |
cache |
logical that decides if API calls are cached locally. You can clear the cache with clearCache(). |
Details
Use KorAPConnection() to connect, then corpusQuery() to search, and fetchAll() to retrieve results. For authorized access to restricted corpora, use auth() or provide an accessToken.
The KorAPConnection object contains various configuration slots for advanced users: KorAPUrl (server URL), apiVersion, accessToken (OAuth2 token), timeout (request timeout), verbose (logging), cache (local caching), and other technical parameters. Most users can ignore these implementation details.
Value
KorAPConnection()
object that can be used e.g. with corpusQuery()
Basic Workflow
# Connect to KorAP
kcon <- KorAPConnection()
# Search for a term
query <- corpusQuery(kcon, "Ameisenplage")
# Get all results
results <- fetchAll(query)
Authorization
For access to restricted corpora, authorize your connection:
kcon <- KorAPConnection() |> auth()
See Also
Other initialization functions:
auth,KorAPConnection-method
,
clearAccessToken,KorAPConnection-method
,
persistAccessToken,KorAPConnection-method
KorAPCorpusStats class (internal)
Description
Internal class for corpus statistics storage. Users work with the corpusStats() function instead.
Usage
## S4 method for signature 'KorAPCorpusStats'
show(object)
Arguments
object |
KorAPCorpusStats object |
KorAPQuery class (internal)
Description
Internal class for query state management. Users work with corpusQuery(), fetchAll(), and fetchNext() instead.
Usage
buildWebUIRequestUrlFromString(KorAPUrl, query, vc = "", ql = "poliqarp")
buildWebUIRequestUrl(
kco,
query = if (missing(KorAPUrl)) {
stop("At least one of the parameters query and KorAPUrl must be specified.", call. =
FALSE)
} else {
httr2::url_parse(KorAPUrl)$query$q
},
vc = if (missing(KorAPUrl)) "" else httr2::url_parse(KorAPUrl)$query$cq,
KorAPUrl,
ql = if (missing(KorAPUrl)) "poliqarp" else httr2::url_parse(KorAPUrl)$query$ql
)
## S3 method for class 'KorAPQuery'
format(x, ...)
## S4 method for signature 'KorAPQuery'
show(object)
Arguments
x |
KorAPQuery object |
... |
further arguments passed to or from other methods |
object |
KorAPQuery object |
Internal API call method
Description
Internal API call method
Usage
## S4 method for signature 'KorAPConnection'
apiCall(
kco,
url,
json = TRUE,
getHeaders = FALSE,
cache = kco@cache,
timeout = kco@timeout
)
Arguments
kco |
KorAPConnection object |
url |
request url |
json |
logical that determines if JSON result is expected |
getHeaders |
logical that determines if headers and content should be returned (as a list) |
Association score functions
Description
Functions to calculate different collocation association scores between a node (target word) and words in a window around it. The functions are primarily used by collocationScoreQuery().
pmi: pointwise mutual information
mi2: pointwise mutual information squared (Daille 1994), also referred to as mutual dependency (Thanopoulos et al. 2002)
mi3: pointwise mutual information cubed (Daille 1994), also referred to as log-frequency biased mutual dependency (Thanopoulos et al. 2002)
logDice: log-Dice coefficient, a heuristic measure that is popular in lexicography (Rychlý 2008)
ll: log-likelihood (Dunning 1993) using Stefan Evert's (2004) simplified implementation
Usage
defaultAssociationScoreFunctions()
pmi(O1, O2, O, N, E, window_size)
mi2(O1, O2, O, N, E, window_size)
mi3(O1, O2, O, N, E, window_size)
logDice(O1, O2, O, N, E, window_size)
ll(O1, O2, O, N, E, window_size)
Arguments
O1 |
observed absolute frequency of node |
O2 |
observed absolute frequency of collocate |
O |
observed absolute frequency of collocation |
N |
corpus size |
E |
expected absolute frequency of collocation (already adjusted to window size) |
window_size |
total window size around node (left neighbour count + right neighbour count) |
Value
association score
References
Daille, B. (1994): Approche mixte pour l’extraction automatique de terminologie: statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7.
Thanopoulos, A., Fakotakis, N., Kokkinakis, G. (2002): Comparative evaluation of collocation extraction metrics. In: Proc. of LREC 2002: 620–625.
Rychlý, Pavel (2008): A lexicographer-friendly association score. In Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN, 6–9. https://www.fi.muni.cz/usr/sojka/download/raslan2008/13.pdf.
Dunning, T. (1993): Accurate methods for the statistics of surprise and coincidence. Comput. Linguist. 19, 1 (March 1993), 61-74.
Evert, Stefan (2004): The Statistics of Word Cooccurrences: Word Pairs and Collocations. PhD dissertation, IMS, University of Stuttgart. Published in 2005, URN urn:nbn:de:bsz:93-opus-23714. Free PDF available from https://purl.org/stefan.evert/PUB/Evert2004phd.pdf
See Also
Other collocation analysis functions:
collocationAnalysis,KorAPConnection-method
,
collocationScoreQuery,KorAPConnection-method
,
synsemanticStopwords()
Examples
## Not run:
KorAPConnection(verbose = TRUE) %>%
collocationScoreQuery("Perlen", c("verziertes", "Säue"),
scoreFunctions = append(defaultAssociationScoreFunctions(),
list(localMI = function(O1, O2, O, N, E, window_size) {
O * log2(O/E)
})))
## End(Not run)
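The score functions listed above can be tried offline with made-up counts. The following is only a sketch using common textbook definitions of the measures; it is not copied from the package source. The helper names (`expected_frequency`, `pmi_sketch`, and so on) are hypothetical, and the formula for `E` (window size times the product of the marginal frequencies, divided by corpus size) is an assumption about how the expected frequency is adjusted to the window size.

```r
# Sketch only: textbook definitions, not the package's own implementation.
# Assumption: expected frequency adjusted to the window size.
expected_frequency <- function(O1, O2, N, window_size) window_size * O1 * O2 / N

pmi_sketch <- function(O, E) log2(O / E)        # pointwise mutual information
mi2_sketch <- function(O, E) log2(O^2 / E)      # PMI squared
mi3_sketch <- function(O, E) log2(O^3 / E)      # PMI cubed
local_mi_sketch <- function(O, E) O * log2(O / E)  # same form as localMI in the example above

# Made-up counts for illustration
O1 <- 1000          # frequency of the node
O2 <- 500           # frequency of the collocate
O <- 40             # observed co-occurrences
N <- 1e6            # corpus size
window_size <- 10   # left + right context

E <- expected_frequency(O1, O2, N, window_size)
scores <- c(
  pmi = pmi_sketch(O, E),
  mi2 = mi2_sketch(O, E),
  mi3 = mi3_sketch(O, E),
  localMI = local_mi_sketch(O, E)
)
print(scores)
```

Any function with the signature function(O1, O2, O, N, E, window_size) can be passed in the same way as localMI in the example above.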
Authorize RKorAPClient
Description
Authorize RKorAPClient to make KorAP queries and download results on behalf of the user.
Usage
## S4 method for signature 'KorAPConnection'
auth(
kco,
app_id = generic_kor_app_id,
app_secret = NULL,
scope = kco@oauthScope
)
Arguments
kco |
KorAPConnection object |
app_id |
OAuth2 application id. Defaults to the generic KorAP client application id. |
app_secret |
OAuth2 application secret. Used with confidential client applications. Defaults to |
scope |
OAuth2 scope. Defaults to "search match_info". |
Value
KorAPConnection object with access token set in @accessToken.
See Also
persistAccessToken()
, clearAccessToken()
Other initialization functions:
KorAPConnection-class
,
clearAccessToken,KorAPConnection-method
,
persistAccessToken,KorAPConnection-method
Examples
## Not run:
kco <- KorAPConnection(verbose = TRUE) %>% auth()
df <- collocationAnalysis(kco, "focus([marmot/p=ADJA] {Ameisenplage})",
leftContextSize = 1, rightContextSize = 0
)
## End(Not run)
Calculate and format ETA for batch operations
Description
Helper function to calculate estimated time of arrival based on elapsed time and progress through a batch operation.
Usage
calculate_eta(current_item, total_items, start_time)
Arguments
current_item |
current item number (1-based) |
total_items |
total number of items to process |
start_time |
POSIXct start time of the operation |
Value
character string with formatted ETA and completion time or empty string if not calculable
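The idea behind an elapsed-time ETA can be sketched in a few lines of base R. This is an illustration under the assumption of a simple linear extrapolation (average time per finished item times the number of remaining items); the helper `eta_seconds_sketch` and its `now` argument are hypothetical and not part of the package.

```r
# Hypothetical helper, not the package's calculate_eta().
# Returns the estimated remaining seconds, or NA if not calculable.
eta_seconds_sketch <- function(current_item, total_items, start_time, now = Sys.time()) {
  if (current_item < 1 || current_item > total_items) return(NA_real_)
  elapsed <- as.numeric(difftime(now, start_time, units = "secs"))
  seconds_per_item <- elapsed / current_item
  seconds_per_item * (total_items - current_item)
}

# Fixed timestamps so the result is reproducible
start <- as.POSIXct("2025-01-01 12:00:00", tz = "UTC")
now <- start + 60                                   # 60 seconds elapsed
remaining <- eta_seconds_sketch(3, 12, start, now)  # 3 of 12 items done
completion <- format(now + remaining, "%H:%M:%S", tz = "UTC")
print(remaining)
print(completion)
```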
Calculate sophisticated ETA using median of recent non-cached times
Description
Advanced ETA calculation that excludes cached responses and uses median of recent timing data for more stable estimates. This is particularly useful for operations where some responses may be cached and much faster.
Usage
calculate_sophisticated_eta(
individual_times,
current_item,
total_items,
cache_threshold = 0.1,
window_size = 5
)
Arguments
individual_times |
numeric vector of individual item processing times |
current_item |
current item number (1-based) |
total_items |
total number of items to process |
cache_threshold |
minimum time in seconds to consider as non-cached (default: 0.1) |
window_size |
number of recent non-cached times to use for median calculation (default: 5) |
Value
list with eta_seconds, estimated_completion_time, and is_cached flag
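A minimal sketch of the median-based approach, assuming that timings below `cache_threshold` are treated as cached and dropped, and that the estimate is the median of the last `window_size` remaining timings multiplied by the number of outstanding items. The function `median_eta_sketch` is hypothetical and returns only the remaining seconds, not the full list described above.

```r
# Hypothetical helper, not the package's calculate_sophisticated_eta().
median_eta_sketch <- function(individual_times, current_item, total_items,
                              cache_threshold = 0.1, window_size = 5) {
  # Keep only timings that look like real (non-cached) requests
  non_cached <- individual_times[!is.na(individual_times) &
                                   individual_times >= cache_threshold]
  if (length(non_cached) == 0) return(NA_real_)
  recent <- tail(non_cached, window_size)
  median(recent) * (total_items - current_item)
}

# Two of the six timings are fast enough to count as cached
times <- c(2.0, 0.01, 2.4, 0.02, 1.8, 2.2)
eta <- median_eta_sketch(times, current_item = 6, total_items = 10)
print(eta)
```

Using the median rather than the mean keeps a single slow request from dominating the estimate.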
Add confidence interval and relative frequency variables
Description
Using prop.test(), ci adds three columns to a data frame:
- relative frequency (f)
- lower bound of a confidence interval (conf.low)
- upper bound of a confidence interval (conf.high)
ipm: Convenience function for converting frequency tables to instances per million.
percent: Convenience function for converting frequency tables of alternative variants (generated with as.alternatives=TRUE) to percent.
queryStringToLabel: Converts a vector of query or vc strings to typically appropriate legend labels by clipping off prefixes and suffixes that are common to all query strings.
geom_freq_by_year_ci: Experimental convenience function for plotting typical frequency by year graphs with confidence intervals using ggplot2. Warning: This function may be moved to a new package.
Usage
ci(df, x = totalResults, N = total, conf.level = 0.95)
ipm(df)
percent(df)
queryStringToLabel(data, pubDateOnly = FALSE, excludePubDate = FALSE)
geom_freq_by_year_ci(mapping = aes(ymin = conf.low, ymax = conf.high), ...)
Arguments
df |
table returned from |
x |
column with the observed absolute frequency. |
N |
column with the total frequencies |
conf.level |
confidence level of the returned confidence interval. Must be a single number between 0 and 1. |
data |
string or vector of query or vc definition strings |
pubDateOnly |
discard all but the publication date |
excludePubDate |
discard publication date constraints |
mapping |
Set of aesthetic mappings created by aes() or aes_(). If specified and inherit.aes = TRUE (the default), it is combined with the default mapping at the top level of the plot. You must supply mapping if there is no plot mapping. |
... |
Other arguments passed to geom_ribbon, geom_line, and geom_click_point. |
Details
Given a table with columns f, conf.low, and conf.high, ipm adds a column ipm and multiplies conf.low and conf.high by 10^6.
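The two steps (confidence interval via prop.test(), then conversion to instances per million) can be reproduced offline on a hand-made table. This is a sketch in base R, not the package's implementation; the helpers `add_ci_sketch` and `to_ipm_sketch` are hypothetical, the counts are invented, and the column names totalResults and total follow the defaults shown in the Usage section.

```r
# Made-up frequency table (no server access needed)
freq <- data.frame(
  query = c("term1", "term2"),
  totalResults = c(120, 45),   # observed absolute frequencies
  total = c(1e6, 2e6)          # sizes of the corresponding (virtual) corpora
)

# Hypothetical stand-in for ci(): adds f, conf.low and conf.high
add_ci_sketch <- function(df, conf.level = 0.95) {
  bounds <- vapply(seq_len(nrow(df)), function(i) {
    as.numeric(prop.test(df$totalResults[i], df$total[i],
                         conf.level = conf.level)$conf.int)
  }, numeric(2))
  df$f <- df$totalResults / df$total
  df$conf.low <- bounds[1, ]
  df$conf.high <- bounds[2, ]
  df
}

# Hypothetical stand-in for ipm(): adds ipm and rescales the bounds
to_ipm_sketch <- function(df) {
  df$ipm <- df$f * 1e6
  df$conf.low <- df$conf.low * 1e6
  df$conf.high <- df$conf.high * 1e6
  df
}

with_ci <- add_ci_sketch(freq)
with_ipm <- to_ipm_sketch(with_ci)
print(with_ipm)
```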
Value
ipm: original table with additional column ipm and converted columns conf.low and conf.high
percent: original table with converted columns f, conf.low and conf.high
queryStringToLabel: string or vector of strings with clipped off common prefixes and suffixes
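The clipping of common prefixes and suffixes can be illustrated with a small character-level sketch. This is an assumption about the general idea only: the helpers below are hypothetical, and queryStringToLabel() may clip at different boundaries and handles publication dates specially.

```r
# Number of leading characters shared by all strings in x
common_prefix_length <- function(x) {
  chars <- strsplit(x, "")
  shortest <- min(lengths(chars))
  n <- 0
  while (n < shortest &&
         length(unique(vapply(chars, function(ch) ch[n + 1], character(1)))) == 1) {
    n <- n + 1
  }
  n
}

# Reverse each string so that suffixes can be handled like prefixes
reverse_strings <- function(x) {
  vapply(strsplit(x, ""), function(ch) paste(rev(ch), collapse = ""), character(1))
}

# Hypothetical, simplified counterpart of queryStringToLabel()
clip_common_affixes <- function(x) {
  x <- substring(x, common_prefix_length(x) + 1)
  substring(x, 1, nchar(x) - common_prefix_length(reverse_strings(x)))
}

labels <- clip_common_affixes(c("[marmot/m=mood:subj]", "[marmot/m=mood:ind]"))
print(labels)
```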
See Also
ci
is already included in frequencyQuery()
Examples
## Not run:
library(ggplot2)
kco <- KorAPConnection(verbose=TRUE)
expand_grid(year=2015:2018, alternatives=c("Hate Speech", "Hatespeech")) %>%
bind_cols(corpusQuery(kco, .$alternatives, sprintf("pubDate in %d", .$year))) %>%
mutate(total=corpusStats(kco, vc=vc)$tokens) %>%
ci() %>%
ggplot(aes(x=year, y=f, fill=query, color=query, ymin=conf.low, ymax=conf.high)) +
geom_point() + geom_line() + geom_ribbon(alpha=.3)
## End(Not run)
## Not run:
KorAPConnection() %>% frequencyQuery("Test", paste0("pubDate in ", 2000:2002)) %>% ipm()
## End(Not run)
## Not run:
KorAPConnection() %>%
frequencyQuery(c("Tollpatsch", "Tolpatsch"),
vc=paste0("pubDate in ", 2000:2002),
as.alternatives = TRUE) %>%
percent()
## End(Not run)
queryStringToLabel(paste("textType = /Zeit.*/ & pubDate in", c(2010:2019)))
queryStringToLabel(c("[marmot/m=mood:subj]", "[marmot/m=mood:ind]"))
queryStringToLabel(c("wegen dem [tt/p=NN]", "wegen des [tt/p=NN]"))
## Not run:
library(ggplot2)
kco <- KorAPConnection(verbose=TRUE)
expand_grid(condition = c("textDomain = /Wirtschaft.*/", "textDomain != /Wirtschaft.*/"),
year = (2005:2011)) %>%
cbind(frequencyQuery(kco, "[tt/l=Heuschrecke]",
paste0(.$condition," & pubDate in ", .$year))) %>%
ipm() %>%
ggplot(aes(year, ipm, fill = condition, color = condition)) +
geom_freq_by_year_ci()
## End(Not run)
Clear access token from keyring and KorAPConnection object
Description
Clear access token from keyring and KorAPConnection object
Usage
## S4 method for signature 'KorAPConnection'
clearAccessToken(kco)
Arguments
kco |
KorAPConnection object |
Value
KorAPConnection object with access token set to NULL.
See Also
Other initialization functions:
KorAPConnection-class
,
auth,KorAPConnection-method
,
persistAccessToken,KorAPConnection-method
Examples
## Not run:
kco <- KorAPConnection()
kco <- clearAccessToken(kco)
## End(Not run)
Clear local cache
Description
Clears the local cache of API responses for the current RKorAPClient version. Useful when you want to force fresh data retrieval or free up disk space.
Usage
## S4 method for signature 'KorAPConnection'
clearCache(kco)
Arguments
kco |
KorAPConnection object |
Value
Invisible NULL (function called for side effects)
Examples
## Not run:
kco <- KorAPConnection()
clearCache(kco)
## End(Not run)
Collocation analysis
Description
Performs a collocation analysis for the given node (or query) in the given virtual corpus.
Usage
## S4 method for signature 'KorAPConnection'
collocationAnalysis(
kco,
node,
vc = "",
lemmatizeNodeQuery = FALSE,
minOccur = 5,
leftContextSize = 5,
rightContextSize = 5,
topCollocatesLimit = 200,
searchHitsSampleLimit = 20000,
ignoreCollocateCase = FALSE,
withinSpan = ifelse(exactFrequencies, "base/s=s", ""),
exactFrequencies = TRUE,
stopwords = append(RKorAPClient::synsemanticStopwords(), node),
seed = 7,
expand = length(vc) != length(node),
maxRecurse = 0,
addExamples = FALSE,
thresholdScore = "logDice",
threshold = 2,
localStopwords = c(),
collocateFilterRegex = "^[:alnum:]+-?[:alnum:]*$",
...
)
Arguments
kco |
|
node |
target word |
vc |
string describing the virtual corpus in which the query should be performed. An empty string (default) means the whole corpus, as far as it is license-wise accessible. |
lemmatizeNodeQuery |
if TRUE, node query will be lemmatized, i.e. |
minOccur |
minimum absolute number of observed co-occurrences to consider a collocate candidate |
leftContextSize |
size of the left context window |
rightContextSize |
size of the right context window |
topCollocatesLimit |
limit analysis to the n most frequent collocates in the search hits sample |
searchHitsSampleLimit |
limit the size of the search hits sample |
ignoreCollocateCase |
logical, set to TRUE if collocate case should be ignored |
withinSpan |
KorAP span specification (see https://korap.ids-mannheim.de/doc/ql/poliqarp-plus?embedded=true#spans) for collocations to be searched within. Defaults to |
exactFrequencies |
if FALSE, extrapolate observed co-occurrence frequencies from frequencies in search hits sample, otherwise retrieve exact co-occurrence frequencies |
stopwords |
vector of stopwords not to be considered as collocates |
seed |
seed for random page collecting order |
expand |
if TRUE, |
maxRecurse |
apply collocation analysis recursively |
addExamples |
If TRUE, examples for instances of collocations will be added in a column |
thresholdScore |
association score function (see |
threshold |
minimum value of |
localStopwords |
vector of stopwords that will not be considered as collocates in the current function call, but that will not be passed to recursive calls |
collocateFilterRegex |
allow only collocates matching the regular expression |
... |
more arguments will be passed to |
Details
The collocation analysis is currently implemented on the client side, as some of the functionality is not yet provided by the KorAP backend. Mainly for this reason it is very slow (several minutes, up to hours), but on the other hand very flexible. You can, for example, perform the analysis in arbitrary virtual corpora, use complex node queries, and look for expression-internal collocates using the focus function (see examples and demo).
To increase speed at the cost of accuracy and possible false negatives, you can decrease searchHitsSampleLimit and/or topCollocatesLimit and/or set exactFrequencies to FALSE.
Note that some outdated non-DeReKo back-ends might not yet support returning tokenized matches (warning issued). In this case, the client library will fall back to client-side tokenization which might be slightly less accurate. This might lead to false negatives and to frequencies that differ from corresponding ones acquired via the web user interface.
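The client-side part of the analysis, counting candidate collocates in a window around the node, can be sketched offline on already tokenized matches. This is a simplified illustration, not the package's algorithm: the function `count_collocates_sketch` is hypothetical, the toy matches are invented, and sampling, case handling, and score computation are left out.

```r
# Hypothetical helper: count tokens within a window around each node occurrence
count_collocates_sketch <- function(matches, node, left = 1, right = 1,
                                    stopwords = character(0)) {
  neighbours <- unlist(lapply(matches, function(tokens) {
    hits <- which(tokens == node)
    unlist(lapply(hits, function(i) {
      window <- seq(max(1, i - left), min(length(tokens), i + right))
      tokens[setdiff(window, i)]
    }))
  }))
  # Drop stopwords and the node itself, as the stopwords default does
  neighbours <- neighbours[!neighbours %in% c(stopwords, node)]
  counts <- vapply(split(neighbours, neighbours), length, integer(1))
  sort(counts, decreasing = TRUE)
}

# Invented, pre-tokenized search hits
matches <- list(
  c("eine", "volle", "Packung", "Kekse"),
  c("die", "volle", "Packung", "bekommen"),
  c("Packung", "Kekse", "gekauft")
)
counts <- count_collocates_sketch(matches, "Packung", stopwords = c("eine", "die"))
print(counts)

# Candidates that reach a minimum number of co-occurrences (cf. minOccur)
frequent <- names(counts)[counts >= 2]
```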
Value
Tibble with top collocates, association scores, corresponding URLs for web user interface queries, etc.
See Also
Other collocation analysis functions:
association-score-functions
,
collocationScoreQuery,KorAPConnection-method
,
synsemanticStopwords()
Examples
## Not run:
# Find top collocates of "Packung" inside and outside the sports domain.
KorAPConnection(verbose = TRUE) |>
collocationAnalysis("Packung",
vc = c("textClass=sport", "textClass!=sport"),
leftContextSize = 1, rightContextSize = 1, topCollocatesLimit = 20
) |>
dplyr::filter(logDice >= 5)
## End(Not run)
## Not run:
# Identify the most prominent light verb construction with "in ... setzen".
# Note that, currently, the use of focus function disallows exactFrequencies.
KorAPConnection(verbose = TRUE) |>
collocationAnalysis("focus(in [tt/p=NN] {[tt/l=setzen]})",
leftContextSize = 1, rightContextSize = 0, exactFrequencies = FALSE, topCollocatesLimit = 20
)
## End(Not run)
Query frequencies of a node and a collocate and calculate collocation association scores
Description
Computes various collocation association scores based on frequencyQuery()s for a target word and a collocate.
Usage
## S4 method for signature 'KorAPConnection'
collocationScoreQuery(
kco,
node,
collocate,
vc = "",
lemmatizeNodeQuery = FALSE,
lemmatizeCollocateQuery = FALSE,
leftContextSize = 5,
rightContextSize = 5,
scoreFunctions = defaultAssociationScoreFunctions(),
smoothingConstant = 0.5,
observed = NA,
ignoreCollocateCase = FALSE,
withinSpan = "base/s=s"
)
Arguments
kco |
|
node |
target word |
collocate |
collocate of target word |
vc |
string describing the virtual corpus in which the query should be performed. An empty string (default) means the whole corpus, as far as it is license-wise accessible. |
lemmatizeNodeQuery |
logical, set to TRUE if node query should be lemmatized, i.e. |
lemmatizeCollocateQuery |
logical, set to TRUE if collocate query should be lemmatized, i.e. |
leftContextSize |
size of the left context window |
rightContextSize |
size of the right context window |
scoreFunctions |
named list of score functions of the form function(O1, O2, O, N, E, window_size), see e.g. pmi |
smoothingConstant |
smoothing constant will be added to all observed values |
observed |
if collocation frequencies are already known (or estimated from a sample) they can be passed as a vector here, otherwise: NA |
ignoreCollocateCase |
logical, set to TRUE if collocate case should be ignored |
withinSpan |
KorAP span specification (see https://korap.ids-mannheim.de/doc/ql/poliqarp-plus?embedded=true#spans) for collocations to be searched within. Defaults to |
Value
tibble with query KorAP web request URL, all observed values and association scores
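Why a smoothing constant is useful can be seen from a log-based score with a zero count. The sketch below is an illustration only: `pmi_sketch` is a hypothetical textbook PMI, and for brevity the constant is added to the co-occurrence count alone, whereas the parameter description says it is added to all observed values.

```r
# Hypothetical textbook PMI, not the package's pmi()
pmi_sketch <- function(O, E) log2(O / E)

E <- 5                       # made-up expected co-occurrence frequency
O <- 0                       # the pair was never observed together
smoothing_constant <- 0.5    # same value as the documented default

unsmoothed <- pmi_sketch(O, E)                      # log2(0) is -Inf
smoothed <- pmi_sketch(O + smoothing_constant, E)   # finite score
print(c(unsmoothed = unsmoothed, smoothed = smoothed))
```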
See Also
Other collocation analysis functions:
association-score-functions
,
collocationAnalysis,KorAPConnection-method
,
synsemanticStopwords()
Examples
## Not run:
KorAPConnection(verbose = TRUE) |>
collocationScoreQuery("Grund", "triftiger")
## End(Not run)
## Not run:
KorAPConnection(verbose = TRUE) |>
collocationScoreQuery("Grund", c("guter", "triftiger"),
scoreFunctions = list(localMI = function(O1, O2, O, N, E, window_size) { O * log2(O/E) }) )
## End(Not run)
## Not run:
library(highcharter)
library(tidyr)
KorAPConnection(verbose = TRUE) |>
collocationScoreQuery("Team", "agil", vc = paste("pubDate in", c(2014:2018)),
lemmatizeNodeQuery = TRUE, lemmatizeCollocateQuery = TRUE) |>
pivot_longer(14:last_col(), names_to = "measure", values_to = "score") |>
hchart(type="spline", hcaes(label, score, group=measure)) |>
hc_add_onclick_korap_search()
## End(Not run)
Search corpus for query terms
Description
corpusQuery
performs a corpus query via a connection to a KorAP-API-server
Usage
## S4 method for signature 'KorAPConnection'
corpusQuery(
kco,
query = if (missing(KorAPUrl)) {
stop("At least one of the parameters query and KorAPUrl must be specified.", call. =
FALSE)
} else {
httr2::url_parse(KorAPUrl)$query$q
},
vc = if (missing(KorAPUrl)) "" else httr2::url_parse(KorAPUrl)$query$cq,
KorAPUrl,
metadataOnly = TRUE,
ql = if (missing(KorAPUrl)) "poliqarp" else httr2::url_parse(KorAPUrl)$query$ql,
fields = c("corpusSigle", "textSigle", "pubDate", "pubPlace", "availability",
"textClass", "snippet", "tokens"),
accessRewriteFatal = TRUE,
verbose = kco@verbose,
expand = length(vc) != length(query),
as.df = FALSE,
context = NULL
)
Arguments
kco |
|
query |
string that contains the corpus query. The query language depends on the |
vc |
string describing the virtual corpus in which the query should be performed. An empty string (default) means the whole corpus, as far as it is license-wise accessible. |
KorAPUrl |
instead of providing the query and vc string parameters, you can also simply copy a KorAP query URL from your browser and use it here (and in |
metadataOnly |
logical that determines whether queries should return only metadata without any snippets. This can also be useful to prevent access rewrites. Note that the default value is TRUE.
If you want your corpus queries to return not only metadata, but also KWICs, you need to authorize
your RKorAPClient application as explained in the
authorization section
of the RKorAPClient Readme on GitHub and set the |
ql |
string to choose the query language (see section on Query Parameters in the Kustvakt-Wiki for possible values). |
fields |
character vector specifying which metadata fields to retrieve for each match. Available fields depend on the corpus. For DeReKo (German Reference Corpus), possible fields include:
Use |
accessRewriteFatal |
abort if query or given vc had to be rewritten due to insufficient rights (not yet implemented). |
verbose |
print some info |
expand |
logical that decides if |
as.df |
return result as data frame instead of as S4 object? |
context |
string that specifies the size of the left and the right context returned in |
Value
Depending on the as.df parameter, a tibble or a KorAPQuery() object that, among other information, contains the total number of results in @totalResults. The resulting object can be used to fetch all query results (with fetchAll()) or the next page of results (with fetchNext()).
A corresponding URL to be used within a web browser is contained in @webUIRequestUrl.
Please make sure to check $collection$rewrites to see if any unforeseen access rewrites of the query's virtual corpus had to be performed.
References
https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/9026
See Also
KorAPConnection()
, fetchNext()
, fetchRest()
, fetchAll()
, corpusStats()
Other corpus search functions:
fetchAll,KorAPQuery-method
,
fetchNext,KorAPQuery-method
Examples
## Not run:
# Fetch basic metadata for "Ameisenplage"
KorAPConnection() |>
corpusQuery("Ameisenplage") |>
fetchAll()
# Fetch specific metadata fields for bibliographic analysis
query <- KorAPConnection() |>
corpusQuery("Ameisenplage",
fields = c("textSigle", "author", "title", "pubDate", "pubPlace", "textType"))
results <- fetchAll(query)
results@collectedMatches
## End(Not run)
## Not run:
# Use the copy of a KorAP-web-frontend URL for an API query of "Ameise" in a virtual corpus
# and show the number of query hits (but don't fetch them).
KorAPConnection(verbose = TRUE) |>
corpusQuery(
KorAPUrl =
"https://korap.ids-mannheim.de/?q=Ameise&cq=pubDate+since+2017&ql=poliqarp"
)
## End(Not run)
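How the q, cq, and ql parts of such a URL map to the query, vc, and ql arguments can be shown without contacting a server. The Usage section indicates that the package relies on httr2::url_parse() for this; the sketch below uses a hypothetical base R helper, `parse_korap_url_sketch`, so that it runs without extra packages. It assumes every parameter has a value and does not cover every URL form.

```r
# Hypothetical base R helper, not part of the package
parse_korap_url_sketch <- function(url) {
  query_string <- sub("^[^?]*\\?", "", url)
  pairs <- strsplit(strsplit(query_string, "&", fixed = TRUE)[[1]], "=", fixed = TRUE)
  values <- vapply(pairs, function(p) {
    # "+" encodes a space in query strings; the rest is percent-decoded
    utils::URLdecode(gsub("+", " ", p[2], fixed = TRUE))
  }, character(1))
  names(values) <- vapply(pairs, function(p) p[1], character(1))
  as.list(values)
}

params <- parse_korap_url_sketch(
  "https://korap.ids-mannheim.de/?q=Ameise&cq=pubDate+since+2017&ql=poliqarp"
)
print(params$q)    # query string
print(params$cq)   # virtual corpus definition
print(params$ql)   # query language
```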
## Not run:
# Plot the time/frequency curve of "Ameisenplage"
kco <- KorAPConnection(verbose = TRUE)
kco |>
corpusQuery("Ameisenplage") |>
fetchAll() |>
slot("collectedMatches") |>
mutate(year = lubridate::year(pubDate)) |>
dplyr::select(year) |>
group_by(year) |>
summarise(Count = dplyr::n()) |>
mutate(Freq = mapply(function(f, y) {
f / corpusStats(kco, paste("pubDate in", y))@tokens
}, Count, year)) |>
dplyr::select(-Count) |>
complete(year = min(year):max(year), fill = list(Freq = 0)) |>
plot(type = "l")
## End(Not run)
Get corpus size and statistics
Description
Retrieve information about corpus size (documents, tokens, sentences, paragraphs) for the entire corpus or a virtual corpus subset.
Usage
## S4 method for signature 'KorAPConnection'
corpusStats(kco, vc = "", verbose = kco@verbose, as.df = FALSE)
Arguments
kco |
|
vc |
string describing the virtual corpus. An empty string (default) means the whole corpus, as far as it is license-wise accessible. |
verbose |
logical. If |
as.df |
return result as data frame instead of as S4 object? |
Value
Object containing corpus statistics with the following information:
vc
Virtual corpus definition used (empty string for entire corpus)
documents
Total number of documents in the (virtual) corpus
tokens
Total number of word tokens in the (virtual) corpus
sentences
Total number of sentences in the (virtual) corpus
paragraphs
Total number of paragraphs in the (virtual) corpus
webUIRequestUrl
URL to view this corpus subset in KorAP web interface
When as.df=TRUE, returns a data frame with these columns.
When as.df=FALSE (default), returns a KorAPCorpusStats object with these values as slots.
Usage
# Get statistics for entire corpus
kcon <- KorAPConnection()
stats <- corpusStats(kcon)
# Get statistics for a specific time period
stats <- corpusStats(kcon, "pubDate in 2020")
# Access the number of tokens
stats@tokens
Examples
## Not run:
kco <- KorAPConnection()
# Get statistics for entire corpus (returns S4 object)
stats <- corpusStats(kco)
stats@tokens # Access number of tokens
# Get statistics for newspaper texts from 2017 (as data frame)
df <- corpusStats(kco, "pubDate in 2017 & textType=/Zeitung.*/", as.df = TRUE)
df$documents # Access number of documents
# Compare corpus sizes across years
years <- 2015:2020
sizes <- sapply(years, function(y) {
corpusStats(kco, paste("pubDate in", y))@tokens
})
## End(Not run)
Fetch all results of a KorAP query.
Description
fetchAll
fetches all results of a KorAP query.
Usage
## S4 method for signature 'KorAPQuery'
fetchAll(kqo, verbose = kqo@korapConnection@verbose, ...)
Arguments
kqo |
object obtained from |
verbose |
print progress information if true |
... |
further arguments passed to |
Value
The updated kqo object with all results in @collectedMatches
See Also
Other corpus search functions:
corpusQuery,KorAPConnection-method
,
fetchNext,KorAPQuery-method
Examples
## Not run:
# Fetch all metadata of every query hit for "Ameisenplage" and show a summary
q <- KorAPConnection() |>
corpusQuery("Ameisenplage") |>
fetchAll()
q@collectedMatches
# Fetch also all KWICs
q <- KorAPConnection() |> auth() |>
corpusQuery("Ameisenplage", metadataOnly = FALSE) |>
fetchAll()
q@collectedMatches
# Retrieve title and text sigle metadata of all texts published on 1958-03-12
q <- KorAPConnection() |>
corpusQuery("<base/s=t>", # this matches each text once
vc = "pubDate in 1958-03-12",
fields = c("textSigle", "title")
) |>
fetchAll()
q@collectedMatches
## End(Not run)
Fetch the next bunch of results of a KorAP query.
Description
fetchNext fetches the next bunch of results of a KorAP query.
Usage
## S4 method for signature 'KorAPQuery'
fetchNext(
kqo,
offset = kqo@nextStartIndex,
maxFetch = maxResultsPerPage,
verbose = kqo@korapConnection@verbose,
randomizePageOrder = FALSE
)
Arguments
kqo |
object obtained from |
offset |
start offset for query results to fetch |
maxFetch |
maximum number of query results to fetch |
verbose |
print progress information if true |
randomizePageOrder |
fetch result pages in pseudo random order if true. Use |
Value
The kqo input object with updated slots collectedMatches, apiResponse, nextStartIndex, hasMoreMatches
References
https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/9026
See Also
Other corpus search functions:
corpusQuery,KorAPConnection-method,
fetchAll,KorAPQuery-method
Examples
## Not run:
q <- KorAPConnection() |>
corpusQuery("Ameisenplage") |>
fetchNext()
q@collectedMatches
## End(Not run)
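The slots nextStartIndex, hasMoreMatches and collectedMatches suggest offset-based paging. The following is a minimal sketch of that idea in base R. It runs against a mock result set rather than the KorAP API, and the names all_matches, fetch_page_sketch and state are hypothetical, not part of the package.

```r
# Mock result set standing in for server-side matches (hypothetical data).
all_matches <- data.frame(textSigle = sprintf("DOC/%03d", 1:23))

# Fetch one page starting at the current 0-based offset and update the state.
fetch_page_sketch <- function(state, page_size = 10) {
  start <- state$nextStartIndex
  end <- min(start + page_size, nrow(all_matches))
  page <- all_matches[seq_len(end - start) + start, , drop = FALSE]
  state$collectedMatches <- rbind(state$collectedMatches, page)
  state$nextStartIndex <- end
  state$hasMoreMatches <- end < nrow(all_matches)
  state
}

# Keep fetching pages until nothing is left (the idea behind "fetch all").
state <- list(collectedMatches = NULL, nextStartIndex = 0, hasMoreMatches = TRUE)
while (state$hasMoreMatches) {
  state <- fetch_page_sketch(state)
}
nrow(state$collectedMatches)
```

In this sketch a single call to fetch_page_sketch() plays the role of one paged fetch, and the while loop plays the role of fetching everything.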
Fetches the remaining results of a KorAP query.
Description
Fetches the remaining results of a KorAP query.
Usage
## S4 method for signature 'KorAPQuery'
fetchRest(kqo, verbose = kqo@korapConnection@verbose, ...)
Arguments
kqo |
object obtained from |
verbose |
print progress information if true |
... |
further arguments passed to |
Value
The updated kqo object with remaining results in @collectedMatches
Examples
## Not run:
q <- KorAPConnection() |>
corpusQuery("Ameisenplage") |>
fetchRest()
q@collectedMatches
## End(Not run)
Format duration in seconds to human-readable format
Description
Converts a duration in seconds to a formatted string with days, hours, minutes, and seconds. Used for ETA calculations and progress reporting.
Usage
format_duration(seconds)
Arguments
seconds |
numeric duration in seconds |
Value
character string with formatted duration
Examples
## Not run:
format_duration(3661) # "01h 01m 01s"
format_duration(86461) # "1d 00h 01m 01s"
## End(Not run)
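For readers who want to see how such a string can be built, here is a minimal base R sketch that reproduces the two outputs shown above. The function name format_duration_sketch is hypothetical, and the package's own implementation may differ, for example in how it handles short durations or NA.

```r
# Hypothetical re-implementation for illustration; not the package's code.
format_duration_sketch <- function(seconds) {
  stopifnot(is.numeric(seconds), length(seconds) == 1, !is.na(seconds), seconds >= 0)
  s <- round(seconds)
  days  <- s %/% 86400
  hours <- (s %% 86400) %/% 3600
  mins  <- (s %% 3600) %/% 60
  secs  <- s %% 60
  hms <- sprintf("%02dh %02dm %02ds",
                 as.integer(hours), as.integer(mins), as.integer(secs))
  # Prefix the day count only when the duration spans at least one day.
  if (days > 0) paste0(days, "d ", hms) else hms
}

format_duration_sketch(3661)
format_duration_sketch(86461)
```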
Format ETA information for display
Description
Helper function to format ETA information consistently across different methods.
Usage
format_eta_display(eta_seconds, estimated_completion_time)
Arguments
eta_seconds |
numeric ETA in seconds (can be NA) |
estimated_completion_time |
POSIXct estimated completion time (can be NA) |
Value
character string with formatted ETA or empty string if NA
Query frequencies of search expressions in virtual corpora
Description
frequencyQuery combines corpusQuery(), corpusStats() and ci() to compute a tibble with the absolute and relative frequencies and confidence intervals of one or multiple search terms across one or multiple virtual corpora.
Usage
## S4 method for signature 'KorAPConnection'
frequencyQuery(
kco,
query,
vc = "",
conf.level = 0.95,
as.alternatives = FALSE,
...
)
Arguments
kco |
|
query |
corpus query string(s) (can be a vector). The query language depends on the |
vc |
virtual corpus definition(s) (can be a vector) |
conf.level |
confidence level of the returned confidence interval (passed through |
as.alternatives |
LOGICAL that specifies if the query terms should be treated as alternatives. If |
... |
further arguments passed to or from other methods (see |
Value
A tibble, with each row containing the following result columns for query and vc combinations:
- query: the query string used for the frequency analysis.
- totalResults: absolute frequency of query matches in the vc.
- vc: virtual corpus used for the query.
- webUIRequestUrl: URL of the corresponding web UI request with respect to query and vc.
- total: total number of words in vc.
- f: relative frequency of query matches in the vc.
- conf.low: lower bound of the confidence interval for the relative frequency, given conf.level.
- conf.high: upper bound of the confidence interval for the relative frequency, given conf.level.
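The relation between totalResults, total, f and the confidence bounds can be illustrated in base R. The sketch below uses made-up counts and stats::prop.test() as one common way to obtain a confidence interval for a proportion. Whether frequencyQuery() computes its interval in exactly this way is an assumption here, and the ipm column is added only for illustration.

```r
# Made-up counts for illustration; these are not real corpus figures.
freq <- data.frame(
  query = c("Ameisenplage", "Heuschrecke"),
  totalResults = c(1200, 300),
  total = c(5e8, 5e8)
)

# Relative frequency: matches divided by corpus size.
freq$f <- freq$totalResults / freq$total

# One-sample proportion confidence interval per row (assumed method).
bounds <- t(mapply(
  function(x, n) prop.test(x, n, conf.level = 0.95)$conf.int,
  freq$totalResults, freq$total
))
freq$conf.low <- bounds[, 1]
freq$conf.high <- bounds[, 2]

# Instances per million words, a common unit for plotting.
freq$ipm <- freq$f * 1e6
freq
```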
Examples
## Not run:
KorAPConnection(verbose = TRUE) |>
frequencyQuery(c("Mücke", "Schnake"), paste0("pubDate in ", 2000:2003))
## End(Not run)
Get cache indicator string
Description
Helper function to generate cache indicator for logging.
Usage
get_cache_indicator(is_cached, cache_threshold = 0.1)
Arguments
is_cached |
logical indicating if the item was cached |
cache_threshold |
minimum time threshold for non-cached items |
Value
character string with cache indicator or empty string
Add KorAP search click events to highchart plots
Description
Adds on-click events to data points of highcharts that were constructed with frequencyQuery() or collocationScoreQuery(). Clicks on data points then launch KorAP web UI queries for the given query term and virtual corpus in a separate tab.
Usage
hc_add_onclick_korap_search(hc)
Arguments
hc |
A highchart htmlwidget object generated by e.g. |
Value
The input highchart object with added on-click events.
See Also
Other highcharter-helpers:
hc_freq_by_year_ci()
Examples
## Not run:
library(highcharter)
library(tidyr)
KorAPConnection(verbose = TRUE) %>%
collocationScoreQuery("Team", "agil", vc = paste("pubDate in", c(2014:2018)),
lemmatizeNodeQuery = TRUE, lemmatizeCollocateQuery = TRUE) %>%
pivot_longer(c("O", "E")) %>%
hchart(type="spline", hcaes(label, value, group=name)) %>%
hc_add_onclick_korap_search()
## End(Not run)
Plot interactive frequency curves with confidence intervals
Description
Convenience function for plotting typical frequency by year graphs with confidence intervals using highcharter.
Warning: This function may be moved to a new package.
Usage
hc_freq_by_year_ci(
df,
as.alternatives = FALSE,
ylabel = if (as.alternatives) "%" else "ipm",
smooth = FALSE,
...
)
Arguments
df |
data frame like the value of a |
as.alternatives |
boolean that decides whether queries should be treated as mutually exclusive and exhaustive with respect to some meaningful class (e.g. spelling variants of a certain word form). |
ylabel |
defaults to |
smooth |
boolean decides whether the graph is smoothed using the highcharts plot types spline and areasplinerange. |
... |
additional arguments passed to |
Value
A highchart htmlwidget object containing the frequency plot.
See Also
Other highcharter-helpers:
hc_add_onclick_korap_search()
Examples
## Not run:
year <- c(1990:2018)
alternatives <- c("macht []{0,3} Sinn", "ergibt []{0,3} Sinn")
KorAPConnection(verbose = TRUE) %>%
frequencyQuery(query = alternatives,
vc = paste("textType = /Zeit.*/ & pubDate in", year),
as.alternatives = TRUE) %>%
hc_freq_by_year_ci(as.alternatives = TRUE)
kco <- KorAPConnection(verbose = TRUE)
expand_grid(
condition = c("textDomain = /Wirtschaft.*/", "textDomain != /Wirtschaft.*/"),
year = (2005:2011)
) %>%
cbind(frequencyQuery(
kco,
"[tt/l=Heuschrecke]",
paste0(.$condition, " & pubDate in ", .$year)
)) %>%
hc_freq_by_year_ci()
## End(Not run)
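The default y-axis label switches from "ipm" to "%" when as.alternatives is TRUE, which suggests that each query is then expressed as a share of all alternatives within the same virtual corpus. The base R sketch below shows that computation on made-up counts. That the package derives its percentages exactly this way is an assumption, and the data frame is hypothetical.

```r
# Made-up hit counts for two alternatives in two virtual corpora.
freq <- data.frame(
  vc = rep(c("pubDate in 2017", "pubDate in 2018"), each = 2),
  query = rep(c("macht Sinn", "ergibt Sinn"), times = 2),
  totalResults = c(300, 100, 240, 160)
)

# Total hits of all alternatives per virtual corpus.
freq$total <- ave(freq$totalResults, freq$vc, FUN = sum)

# Share of each alternative within its virtual corpus, in percent.
freq$percent <- 100 * freq$totalResults / freq$total
freq
```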
Initialize KorAPConnection object
Description
Initialize KorAPConnection object
Usage
## S4 method for signature 'KorAPConnection'
initialize(
.Object,
KorAPUrl = if (is.null(Sys.getenv("KORAP_URL")) | Sys.getenv("KORAP_URL") == "") {
"https://korap.ids-mannheim.de/"
} else {
Sys.getenv("KORAP_URL")
},
apiVersion = "v1.0",
apiUrl,
accessToken = getAccessToken(KorAPUrl),
oauthClient = NULL,
oauthScope = "search match_info",
authorizationSupported = TRUE,
userAgent = "R-KorAP-Client",
timeout = 240,
verbose = FALSE,
cache = TRUE
)
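The default for KorAPUrl falls back to the public KorAP instance unless the environment variable KORAP_URL is set to a non-empty value. A minimal sketch of that fallback follows. The helper name resolve_korap_url_sketch is hypothetical, and nzchar() is used here as a compact equivalent of the empty-string check.

```r
# Hypothetical helper illustrating the KORAP_URL fallback; not part of the package.
resolve_korap_url_sketch <- function() {
  url <- Sys.getenv("KORAP_URL")  # returns "" when the variable is unset
  if (nzchar(url)) url else "https://korap.ids-mannheim.de/"
}

resolve_korap_url_sketch()
```

Setting KORAP_URL before starting R (or with Sys.setenv() in the session) is therefore one way to point new connections at a different instance without passing KorAPUrl explicitly.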
Initialize KorAPQuery object
Description
Initialize KorAPQuery object
Usage
## S4 method for signature 'KorAPQuery'
initialize(
.Object,
korapConnection = NULL,
request = NULL,
vc = "",
totalResults = 0,
nextStartIndex = 0,
fields = c("corpusSigle", "textSigle", "pubDate", "pubPlace", "availability",
"textClass", "snippet", "tokens"),
requestUrl = "",
webUIRequestUrl = "",
apiResponse = NULL,
hasMoreMatches = FALSE,
collectedMatches = NULL
)
Arguments
.Object |
… |
korapConnection |
KorAPConnection object |
request |
query part of the request URL |
vc |
definition of a virtual corpus |
totalResults |
number of hits the query has yielded |
nextStartIndex |
at what index to start the next fetch of query results |
fields |
what data / metadata fields should be collected |
requestUrl |
complete URL of the API request |
webUIRequestUrl |
URL of a web frontend request corresponding to the API request |
apiResponse |
data-frame representation of the JSON response of the API request |
hasMoreMatches |
logical that signals if more query results can be fetched |
collectedMatches |
matches already fetched from the KorAP-API-server |
Logging utilities for RKorAPClient
Description
This module provides centralized logging functions used throughout the package for progress reporting and ETA calculations.
Log informational messages with optional coloring.
Usage
log_info(v, ...)
Arguments
v |
logical flag indicating whether to output the message |
... |
message components to concatenate and display |
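The argument description implies a logger that is gated by a verbosity flag. A minimal base R sketch of that pattern is shown below. The name log_info_sketch is hypothetical, the output stream and any coloring used by the package are not reproduced, and the real function may behave differently.

```r
# Hypothetical verbose-gated logger; prints only when v is TRUE.
log_info_sketch <- function(v, ...) {
  if (isTRUE(v)) cat(..., sep = "")  # concatenate components without separators
  invisible(NULL)
}

verbose <- TRUE
log_info_sketch(verbose, "Fetched ", 50, " results\n")
log_info_sketch(FALSE, "This message is suppressed\n")
```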
Merge duplicate collocate rows and re-calculate association scores and URLs. Useful if collocation analyses were performed separately for collocates on the left and right side of a node.
Description
Merge duplicate collocate rows and re-calculate association scores and URLs. Useful if collocation analyses were performed separately for collocates on the left and right side of a node.
Usage
mergeDuplicateCollocates(..., smoothingConstant = 0.5)
Arguments
... |
tibbles with collocate rows returned from |
smoothingConstant |
original smoothing constant (to be added only once to the observed values) |
Value
tibble with unique collocate rows
Persist current access token in keyring
Description
Persist current access token in keyring
Usage
## S4 method for signature 'KorAPConnection'
persistAccessToken(kco, accessToken = kco@accessToken)
Arguments
kco |
KorAPConnection object |
accessToken |
access token to be persisted. If not supplied, the current access token of the KorAPConnection object will be used. |
Value
KorAPConnection object.
See Also
Other initialization functions:
KorAPConnection-class,
auth,KorAPConnection-method,
clearAccessToken,KorAPConnection-method
Examples
## Not run:
kco <- KorAPConnection(accessToken = "e739u6eOzkwADQPdVChxFg")
persistAccessToken(kco)
kco <- KorAPConnection() %>%
auth(app_id = "<my application id>") %>%
persistAccessToken()
## End(Not run)
Objects exported from other packages
Description
These objects are imported from other packages. Follow the links below to see their documentation.
- broom
- dplyr
- lubridate
- magrittr
- tibble
- tidyr
Display KorAPConnection object
Description
Display KorAPConnection object
Usage
## S4 method for signature 'KorAPConnection'
show(object)
Arguments
object |
KorAPConnection object |
Preliminary synsemantic stopwords function
Description
Preliminary synsemantic stopwords function to be used in collocation analysis.
Usage
synsemanticStopwords(...)
Arguments
... |
future arguments for language detection |
Details
Currently only suitable for German. See stopwords package for other languages.
Value
Vector of synsemantic stopwords.
See Also
Other collocation analysis functions:
association-score-functions,
collocationAnalysis,KorAPConnection-method,
collocationScoreQuery,KorAPConnection-method
Retrieve metadata for a text, identified by its sigle (id)
Description
Retrieves metadata for a text, identified by its sigle (id), using the corresponding KorAP API (see Kustvakt Wiki).
To retrieve the metadata for every text in a virtual corpus, use corpusQuery() with <base/s=t> as query, instead.
Usage
## S4 method for signature 'KorAPConnection'
textMetadata(kco, textSigle, verbose = kco@verbose)
Arguments
kco |
|
textSigle |
unique text id (concatenation of corpus, document and text ids, separated by |
verbose |
logical. If |
Value
Tibble with columns for each metadata property. In case of errors, such as non-existing texts/sigles, the tibble will also contain a column called errors.
If there are metadata columns you cannot make sense of, please ignore them. The function simply returns all the metadata it gets from the server.
Examples
## Not run:
KorAPConnection() |> textMetadata(c("WUD17/A97/08542", "WUD17/B96/57558", "WUD17/A97/08541"))
## End(Not run)