Type: Package
Title: Customisable Stop-Words in 110 Languages
Version: 0.9.1
Date: 2021-10-24
Author: Silvie Cinkova [aut], Maciej Eder [aut, cre]
Maintainer: Maciej Eder <maciejeder@gmail.com>
Depends: R (≥ 3.5.0)
Imports: dplyr
Description: Functions to generate stop-word lists in 110 languages, in a way consistent across all the languages supported. The generated lists are based on the morphological tagset from the Universal Dependencies.
License: GPL (≥ 3)
Encoding: UTF-8
Suggests: knitr, rmarkdown
VignetteBuilder: knitr
NeedsCompilation: no
Packaged: 2021-10-26 13:54:31 UTC; m
Repository: CRAN
Date/Publication: 2021-10-27 12:10:02 UTC

Customisable Lists of Stop-Words in 110 Languages

Description

The idea behind this package is to give the user control over the stop-word selection.

Details

The core generate_stoplist function relies on multilingual_stoplist, a large data frame derived from the current release of the Universal Dependencies Treebanks. We have included all languages whose corpora total more than 10,000 tokens, which is large enough to cover all common closed-class words, such as prepositions, conjunctions, and auxiliary verbs. The data comes encoded in UTF-8.

Author(s)

Silvie Cinková, Maciej Eder

References

The data set is based on the official release of Version 2.1 of Universal Dependencies.

https://universaldependencies.org

Nivre, Joakim; Agić, Željko; Ahrenberg, Lars; et al., 2017, Universal Dependencies 2.1, LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11234/1-2515.

See Also

list_supported_languages, multilingual_stoplist


Listing of stop words in different languages.

Description

Generate a vector of stop words in one or several languages.

Usage

generate_stoplist(language = NULL, output_form = 1)

Arguments

language

A single string or a character vector; NULL by default. The strings can be language names or ISO-639 language codes, as listed by list_supported_languages(). Names and codes can be freely combined and are case-sensitive. When no language is recognized, the following error message appears: "The language name or language id you have selected is not supported. (Or you didn't specify a language at all). Check out the supported languages by calling 'list_supported_languages'."

output_form

Default 1, alternatively 2 or 3. Option 1 returns a character vector of unique stop-word forms. Option 2 returns a named vector whose elements are the stop-word forms and whose names are the associated stop classes. One word form can occur with several stop classes; hence the word forms in this vector are not unique, unlike in Option 1. Option 3 returns a data frame filtered according to the language selection.
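Because language matching is case-sensitive and an unrecognized selection produces an error message, it can be convenient to guard the call. The following is only a sketch: try_stoplist and mock_generator are hypothetical names that do not exist in the package, and the sketch assumes the message is raised as a regular R error. The generator is passed in as an argument so that the code runs without the package; in practice it would be generate_stoplist.

```r
# Hypothetical helper (not part of the package): returns NULL instead of
# failing when the stop-list call signals an error.
try_stoplist <- function(generator, language, output_form = 1) {
  tryCatch(
    generator(language = language, output_form = output_form),
    error = function(e) {
      message("No stop list for ", paste(language, collapse = ", "),
              ": ", conditionMessage(e))
      NULL
    }
  )
}

# Mock generator standing in for generate_stoplist so the sketch is
# self-contained. It recognizes only "English" and "en" (assumed values).
mock_generator <- function(language = NULL, output_form = 1) {
  if (is.null(language) || !any(language %in% c("English", "en"))) {
    stop("The language name or language id you have selected is not supported.")
  }
  c("the", "of", "and")
}

ok  <- try_stoplist(mock_generator, "English")
bad <- try_stoplist(mock_generator, "english")  # lower-case name is not matched
```

With the package attached, the same helper could be called as try_stoplist(generate_stoplist, c("English", "en")), assuming those values appear in the output of list_supported_languages().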

Value

The function comes with three output options.

All outputs are encoded in UTF-8.
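The relationship between the three output options can be pictured with a toy data frame. This is a sketch only: the column names (language, word_form, stop_class), the class labels and the rows are hypothetical stand-ins and are not taken from multilingual_stoplist, and the code is not the package's implementation.

```r
# Toy stand-in for the package data; column names and values are hypothetical.
toy <- data.frame(
  language   = c("English", "English", "English", "Czech"),
  word_form  = c("to", "to", "the", "a"),
  stop_class = c("ADP", "PART", "DET", "CCONJ"),
  stringsAsFactors = FALSE
)

# Option 3 analogue: the data frame filtered by the language selection.
form3 <- toy[toy$language == "English", ]

# Option 2 analogue: word forms named by their stop class.
# "to" appears twice because it carries two different classes.
form2 <- setNames(form3$word_form, form3$stop_class)

# Option 1 analogue: unique word forms without names.
form1 <- unique(form3$word_form)

print(form1)
print(form2)
```

Here "to" is listed under two classes, so it occurs twice in the named vector but only once in the vector of unique word forms.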

Warning

Author(s)

Silvie Cinková, Maciej Eder

References

The underlying data frame 'multilingual_stoplist' is based on the official release of Version 2.8.1 of Universal Dependencies.

https://universaldependencies.org

Zeman, Daniel; et al., 2021, Universal Dependencies 2.8.1, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11234/1-3687.

See Also

list_supported_languages, multilingual_stoplist

Examples

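# Option 1: character vector of unique word forms
generate_stoplist(language = "English", output_form = 1)

# Option 2: named vector; the names are the stop classes
generate_stoplist(language = "English", output_form = 2)

# Option 3: data frame filtered according to the language selection
generate_stoplist(language = "English", output_form = 3)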
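A typical use of the Option 1 output is to drop stop words from a tokenized text. The sketch below uses a short hand-written vector in place of the real stop list so that it runs without the package. The sample sentence and the simple tokenization rule are illustrative assumptions, not something the package provides.

```r
# Stand-in stop list so the sketch runs on its own. With the package attached,
# it could be replaced by:
#   stops <- generate_stoplist(language = "English", output_form = 1)
stops <- c("the", "of", "and", "to", "is")

text <- "The history of the language is long and tangled"

# Naive tokenization: lower-case the text, split on non-letters,
# and drop any empty strings left by the split.
tokens <- strsplit(tolower(text), "[^[:alpha:]]+")[[1]]
tokens <- tokens[nzchar(tokens)]

# Keep only the tokens that are not in the stop list.
content_words <- tokens[!tokens %in% stops]
print(content_words)
```

Lower-casing the tokens is a choice of this sketch. Whether it is appropriate depends on how the word forms in the actual stop list are cased.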


Listing of the languages supported by generate_stoplist, with their names and ISO-639 codes, in a data frame.

Description

Generate a data frame containing language names and their corresponding ISO-639 codes, together with the number of stop words for each language.

Usage

list_supported_languages()

Arguments

No arguments.

Value

A grouped tibble (data frame) with three columns:

Author(s)

Silvie Cinková, Maciej Eder

References

The underlying data frame 'multilingual_stoplist' is based on the official release of Version 2.8.1 of Universal Dependencies.

https://universaldependencies.org

Zeman, Daniel; et al., 2021, Universal Dependencies 2.8.1, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11234/1-3687.

See Also

generate_stoplist, multilingual_stoplist

Examples

list_supported_languages()

Multilingual Stop-Word List

Description

This dataset contains a data frame with individual word forms in rows. It lets you build your desired stop-word list by selecting on part of speech and on various frequency counts.

Format

A data frame encoded in UTF-8, with the following columns:

Details

This data frame has been derived from an official release of the Universal Dependencies (UD) treebanks. Treebanks are text corpora with linguistic annotation. The UD syntactic annotation follows the principles of dependency syntax. The annotation encompasses for each text token:

Source

The data set is based on the official release of Version 2.8.1 of the Universal Dependencies stored in the LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, Czech Republic, http://hdl.handle.net/11234/1-3687.

References

https://universaldependencies.org

Zeman, Daniel; et al., 2021, Universal Dependencies 2.8.1, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University, http://hdl.handle.net/11234/1-3687.