Type: Package
Title: Automated Topic Labeling with Language Models
Version: 0.2.0
Date: 2024-10-21
Maintainer: Jonas Rieger <rieger@statistik.tu-dortmund.de>
Description: Leveraging (large) language models for automatic topic labeling. The main function converts a list of top terms into a label for each topic. Hence, it is complementary to any topic modeling package that produces a list of top terms for each topic. While human judgement is indispensable for topic validation (i.e., inspecting top terms and most representative documents), automatic topic labeling can be a valuable tool for researchers in various scenarios.
License: GPL (≥ 3)
Encoding: UTF-8
Depends: R (≥ 3.6.0)
Imports: checkmate (≥ 1.8.5), httr, progress, stats, jsonlite
RoxygenNote: 7.3.2
URL: https://github.com/PetersFritz/topiclabels
BugReports: https://github.com/PetersFritz/topiclabels/issues
Suggests: knitr, rmarkdown, testthat (≥ 3.0.0)
Config/testthat/edition: 3
NeedsCompilation: no
Packaged: 2024-10-21 14:06:56 UTC; riege
Author: Jonas Rieger
Repository: CRAN
Date/Publication: 2024-10-21 14:30:02 UTC
Automated Topic Labeling with Language Models
Description
Leveraging (large) language models for automatic topic labeling. The main function converts a list of top terms into a label for each topic. Hence, it is complementary to any topic modeling package that produces a list of top terms for each topic. While human judgement is indispensable for topic validation (i.e., inspecting top terms and most representative documents), automatic topic labeling can be a valuable tool for researchers in various scenarios.
Labeling function: label_topics
Constructor: as.lm_topic_labels
Author(s)
Maintainer: Jonas Rieger <rieger@statistik.tu-dortmund.de>
Authors:
Fritz Peters <fpeters3@sheffield.ac.uk>
Andreas Fischer <andreasfischer1985@web.de>
Tim Lauer <tl@leibniz-psychology.org>
André Bittermann <abi@leibniz-psychology.org>
See Also
Useful links:
Report bugs at https://github.com/PetersFritz/topiclabels/issues
lm_topic_labels object
Description
Constructor for lm_topic_labels objects used in this package.
Usage
as.lm_topic_labels(
  x,
  terms,
  prompts,
  model,
  params,
  with_token,
  time,
  model_output,
  labels
)
is.lm_topic_labels(obj, verbose = FALSE)
Arguments
x
terms
prompts
model
params
with_token
time
model_output
labels
obj
verbose
Details
If you call as.lm_topic_labels on an object x that already has the structure of an lm_topic_labels object (in particular, an lm_topic_labels object itself), the additional arguments (terms, model, ...) can be used to override the corresponding elements.
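The kind of structural check that is.lm_topic_labels performs can be pictured with a small base R sketch. This is not the package's implementation: the function name looks_like_lm_topic_labels is hypothetical, the element names are assumed to mirror the constructor arguments, and the only type rule encoded is the one hinted at in the Examples, namely that model must be a single character string.

```r
# Assumption: an lm_topic_labels object is a named list whose elements
# carry the same names as the constructor arguments.
expected_names <- c("terms", "prompts", "model", "params", "with_token",
                    "time", "model_output", "labels")

# Hypothetical checker, loosely mimicking is.lm_topic_labels(obj, verbose)
looks_like_lm_topic_labels <- function(obj, verbose = FALSE) {
  if (!is.list(obj)) {
    if (verbose) message("object is not a list")
    return(FALSE)
  }
  missing_names <- setdiff(expected_names, names(obj))
  if (length(missing_names) > 0L) {
    if (verbose) message("missing elements: ",
                         paste(missing_names, collapse = ", "))
    return(FALSE)
  }
  # [[ ]] avoids the partial matching that $ performs on lists
  model <- obj[["model"]]
  if (!is.character(model) || length(model) != 1L) {
    if (verbose) message("'model' is not a single character string")
    return(FALSE)
  }
  TRUE
}

# Made-up object for illustration
candidate <- list(terms = list(c("gas", "power", "wind")), prompts = "p",
                  model = "manual labels", params = list(),
                  with_token = FALSE, time = 0, model_output = "o",
                  labels = "Energy Supply")
looks_like_lm_topic_labels(candidate)
candidate$model <- 3.5  # invalid modification, as in the Examples
looks_like_lm_topic_labels(candidate, verbose = TRUE)
```

The real function may check more (for example element types and lengths); the sketch only shows why an assignment such as model = 3.5 would be flagged.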
Value
[named list] lm_topic_labels object.
Examples
## Not run:
token = "" # please insert your hf token here
topwords_matrix = matrix(c("zidane", "figo", "kroos",
                           "gas", "power", "wind"), ncol = 2)
obj = label_topics(topwords_matrix, token = token)
obj$model
obj_modified = as.lm_topic_labels(obj, model = "It is possible to modify individual entries")
obj_modified$model
obj_modified$model = 3.5 # example for an invalid modification
is.lm_topic_labels(obj_modified, verbose = TRUE)
obj_manual = as.lm_topic_labels(terms = list(c("zidane", "figo", "kroos"),
                                             c("gas", "power", "wind")),
                                model = "manual labels",
                                labels = c("Football Players", "Energy Supply"))
## End(Not run)
Automatically label topics using language models based on top terms
Description
Performs an automated labeling process of topics from topic models using language models. For this, the top terms and (optionally) a short context description are used.
Usage
label_topics(...)
## Default S3 method:
label_topics(
  terms,
  model = "mistralai/Mixtral-8x7B-Instruct-v0.1",
  params = list(),
  token = NA_character_,
  context = "",
  sep_terms = "; ",
  max_length_label = 5L,
  prompt_type = c("json", "plain", "json-roles"),
  max_wait = 0L,
  progress = TRUE,
  ...
)
## S3 method for class 'labelTopics'
label_topics(terms, stm_type = c("prob", "frex", "lift", "score"), ...)
Arguments
...: additional arguments
terms
model
params
token
context
sep_terms
max_length_label
prompt_type
max_wait
progress
stm_type
Details
The function builds prompts from the top terms and sends them to language models hosted on Hugging Face. The output is then post-processed so that a label for each topic is extracted automatically. If the automatically extracted labels contain errors, labels can instead be extracted from the original model output, either manually or with custom functions, using the model_output entry of the lm_topic_labels object.
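If the built-in extraction misses a label, a custom extractor over the raw model output can be quite short. The base R sketch below assumes that the raw outputs are available as a character vector (standing in for the model_output entry, whose exact structure is not specified here) and that the model was asked for a JSON reply with a field called 'label'. The function extract_label and the sample strings are made up for illustration.

```r
# Made-up raw outputs standing in for obj$model_output (assumed to be
# one string per topic)
raw_output <- c(
  '{"label": "Football Players"}',
  'Sure! Here is the JSON: {"label": "Energy Supply"} Hope this helps.',
  'no json here'
)

# Hypothetical helper: pull the value of the "label" field out of each
# string, returning NA where no such field is found
extract_label <- function(x) {
  pattern <- '"label"[[:space:]]*:[[:space:]]*"([^"]*)"'
  hits <- regmatches(x, regexec(pattern, x))
  unname(vapply(
    hits,
    function(hit) if (length(hit) == 2L) hit[2L] else NA_character_,
    character(1)
  ))
}

labels <- extract_label(raw_output)
labels
```

Topics for which the result is NA are the ones worth inspecting by hand in the raw output.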
The implemented default parameters for the models HuggingFaceH4/zephyr-7b-beta, tiiuae/falcon-7b-instruct, and mistralai/Mixtral-8x7B-Instruct-v0.1 are:
max_new_tokens: 300
return_full_text: FALSE
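One way to picture how entries passed via params could interact with these defaults is a plain list merge. The sketch below uses utils::modifyList from base R. Whether the package merges parameters this way is an assumption, and temperature is included only as an illustrative extra parameter.

```r
# Defaults as documented above
default_params <- list(max_new_tokens = 300, return_full_text = FALSE)

# Illustrative user input: one default overridden, one extra entry
user_params <- list(max_new_tokens = 100, temperature = 0.7)

# Entries in user_params replace defaults of the same name;
# defaults that are not mentioned are kept
params <- utils::modifyList(default_params, user_params)
str(params)
```

With an empty list, as in the default params = list(), such a merge would leave the defaults unchanged.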
Implemented prompt types are:
json: the language model is asked to respond in JSON format with a single field called 'label', specifying the best label for the topic
plain: the language model is asked to return an answer that should only consist of the best label for the topic
json-roles: the language model is asked to respond in JSON format with a single field called 'label', specifying the best label for the topic; in addition, the model is queried using identifiers for <|user|> input and the beginning of the <|assistant|> output
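The difference between the three prompt types can be made concrete with a small prompt builder. This is only a sketch: build_prompt is a hypothetical helper, the prompt wording is invented and will differ from what the package actually sends, and max_length_label is assumed to be a maximum number of words. Only the argument names, the 'label' field, and the <|user|> and <|assistant|> identifiers come from the documentation above.

```r
# Hypothetical prompt builder illustrating the three prompt types
build_prompt <- function(terms, context = "", sep_terms = "; ",
                         max_length_label = 5L,
                         prompt_type = c("json", "plain", "json-roles")) {
  prompt_type <- match.arg(prompt_type)
  # Collapse the top terms of one topic into a single string
  term_string <- paste(terms, collapse = sep_terms)
  task <- sprintf(
    "Find a label of at most %d words for the topic with these top terms: %s.",
    max_length_label, term_string
  )
  # Prepend the optional context description
  if (nzchar(context)) task <- paste0("Context: ", context, ". ", task)
  json_hint <- "Respond in JSON format with a single field called 'label'."
  switch(prompt_type,
    plain = paste(task, "Answer only with the label."),
    json = paste(task, json_hint),
    "json-roles" = paste0("<|user|>\n", task, " ", json_hint,
                          "\n<|assistant|>\n")
  )
}

cat(build_prompt(c("wind", "feuer", "luft"),
                 context = "Elements of the Earth",
                 prompt_type = "json-roles"))
```

The role identifiers matter mainly for chat-tuned models that were trained on such markers; for other models the plain or json variants may be the safer choice.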
Value
[named list] lm_topic_labels object.
Examples
## Not run:
token = "" # please insert your hf token here
topwords_matrix = matrix(c("zidane", "figo", "kroos",
                           "gas", "power", "wind"), ncol = 2)
label_topics(topwords_matrix, token = token)
label_topics(list(c("zidane", "figo", "kroos"),
                  c("gas", "power", "wind")),
             token = token)
label_topics(list(c("zidane", "figo", "ronaldo"),
                  c("gas", "power", "wind")),
             token = token)
label_topics(list("wind", "greta", "hambach"),
             token = token)
label_topics(list("wind", "fire", "air"),
             token = token)
label_topics(list("wind", "feuer", "luft"),
             token = token)
label_topics(list("wind", "feuer", "luft"),
             context = "Elements of the Earth",
             token = token)
## End(Not run)