Help for package moranajp

Title:

Morphological Analysis for Japanese

Version:

0.9.7

Description:

Supports morphological analysis for Japanese by using 'MeCab' https://taku910.github.io/mecab/, 'Sudachi' https://github.com/WorksApplications/Sudachi, 'Chamame' https://chamame.ninjal.ac.jp/, or 'Ginza' https://github.com/megagonlabs/ginza. Can input a data.frame and obtain all results of 'MeCab' and the row number of the original data.frame as a text id.

License:

MIT + file LICENSE

Depends:

R (≥ 3.5.0)

URL:

https://github.com/matutosi/moranajp, https://matutosi.github.io/moranajp/

BugReports:

https://github.com/matutosi/moranajp/issues

Imports:

dplyr, ggplot2, ggraph, grid, igraph, purrr, rlang, rvest, stats, stringr, stringi, tibble, tidyr, utils

Suggests:

devtools, knitr, rmarkdown, spelling, testthat (≥ 3.0.0)

VignetteBuilder:

knitr

Config/testthat/edition:

Encoding:

UTF-8

LazyData:

true

RoxygenNote:

7.3.1

Language:

en-US

NeedsCompilation:

Packaged:

2024-07-30 05:51:00 UTC; matutosi

Author:

Toshikazu Matsumura [aut, cre]

Maintainer:

Toshikazu Matsumura <matutosi@gmail.com>

Repository:

CRAN

Date/Publication:

2024-08-01 08:40:02 UTC

Add group id column into result of morphological analysis

Description

Add group id column into result of morphological analysis

Usage

add_group(
  tbl,
  col,
  brk = "EOS",
  grp = "group",
  cond = NULL,
  end_with_brk = TRUE
)

Arguments

tbl

A dataframe

col

A string to specify the column including breaks

brk

A string to specify breaks

grp

A string to specify group

cond

A string to specify condition

end_with_brk

A logical

Value

A dataframe

Examples

brk <- "EOS"
tbl <- tibble::tibble(col=c(rep("a", 2), brk, rep("b", 3), brk, rep("c", 4), brk))
add_group(tbl, col = "col")
add_group(tbl, col = "col", end_with_brk = FALSE)

Add id in each group

Description

Add id in each group

Usage

add_id(tbl, grp = "group", id = "id")

Arguments

tbl

A dataframe

grp, id

A string to specify the column of group and id

Value

A dataframe

Examples

brk <- "EOS"
tbl <- tibble::tibble(col=c(rep("a", 2), brk, rep("b", 3), brk, rep("c", 4), brk))
add_group(tbl, col = "col") |>
  add_id(id = "id_in_group")

Wrapper function for add_group() to add sentence id

Description

Wrapper function for add_group() to add sentence id

Usage

add_sentence_no(df, s_id = "sentence")

Arguments

df

A dataframe

s_id

A string for sentence colame

Value

A dataframe

Examples

review_mecab |>
  unescape_utf() |>
  add_sentence_no() |>
  print(n=200)

Add id column into result of morphological analysis

Description

Internal function for moranajp_all(). Add text_id column when there is brk ("BPMJP"). "BPMJP": Break Point Of MoranaJP

Usage

add_text_id(tbl, method, brk = "BPMJP")

Arguments

tbl

A tibble or data.frame.

method

A text. Method to use: "mecab", "ginza", "sudachi_a", "sudachi_b", "sudachi_c", or "chamame". "a", "b" and "c" specify the mode of splitting. "a" split shortest, "b" middle and "c" longest. See https://github.com/WorksApplications/Sudachi for detail. "chamame" use https://chamame.ninjal.ac.jp/ and rvest.

brk

A string of break point

Value

A data.frame with column "text_id".

Clean up result of morphological analyzed data frame

Description

Clean up result of morphological analyzed data frame

Usage

clean_up(df, add_depend = FALSE, ...)

pos_filter(df)

add_depend_ginza(df)

delete_stop_words(df, use_common_data = TRUE, add_stop_words = NULL, ...)

replace_words(
  df,
  synonym_df = tibble::tibble(),
  synonym_from = "",
  synonym_to = "",
  ...
)

term_lemma(df)

term_pos_0(df)

term_pos_1(df)

Arguments

df

A dataframe including result of morphological analysis.

add_depend

A logical. Available for ginza

...

Extra arguments to internal functions.

use_common_data

A logical. TRUE: use data(stop_words).

add_stop_words

A string vector adding into stop words. When use_common_data is TRUE and add_stop_words are given, both of them will be used as stop_words.

synonym_df

A data.frame including synonym word pairs. The first column: replace from, the second: replace to.

synonym_from, synonym_to

A string vector. Length of synonym_from and synonym_to should be the same. When synonym_df and synonym pairs (synonym_from and synonym_to) are given, both of them will be used as synonym.

Value

A data.frame.

Examples

data(neko_mecab)
data(neko_ginza)
data(review_sudachi_c)
data(synonym)
synonym <- 
  synonym |> unescape_utf()

neko_mecab <- 
  neko_mecab |>
  unescape_utf() |>
  print()

neko_mecab |>
  clean_up(use_common_data = TRUE, synonym_df = synonym)

review_ginza |>
  unescape_utf() |>
  add_sentence_no() |>
  clean_up(add_depend = TRUE, use_common_data = TRUE, synonym_df = synonym)

review_sudachi_c |>
  unescape_utf() |>
  add_sentence_no() |>
  clean_up(use_common_data = TRUE, synonym_df = synonym)

Combine words after morphological analysis

Description

Combine words after morphological analysis

Usage

combine_words(df, combi, sep = "-")

combi_words(x, combi, sep = "-")

Arguments

df

A dataframe including result of morphological analysis.

combi

A string (combi_words()) or string vector (combine_words()) to combine words.

sep

A string of separator of words

x

A pair of string joining with "-"

Value

A data.frame with combined words.

Examples

x <- letters[1:10]
combi <- c("b-c")
combi_words(x, combi)
expected <- c("a", "bc", NA, "d", "e", "f", "g", "h", "i", "j")
testthat::expect_equal(combi_words(x, combi), expected)

df <- unescape_utf(review_chamame) |> head(20)
combi <- unescape_utf(
           c("\\u751f\\u7269-\\u591a\\u69d8", "\\u8fb2\\u5730-\\u306f"       ,
             "\\u8fb2\\u7523-\\u7269"       , "\\u751f\\u7523-\\u3059\\u308b"))
combine_words(df, combi)

Draw bigram network using morphological analysis data.

Description

Draw bigram network using morphological analysis data.

Usage

draw_bigram_network(df, draw = TRUE, ...)

bigram(df, group = "sentence", depend = FALSE, term_depend = NULL, ...)

trigram(df, group = "sentence")

bigram_depend(df, group = "sentence")

bigram_network(bigram, rand_seed = 12, threshold = 100, ...)

word_freq(df, big_net, ...)

bigram_network_plot(
  big_net,
  freq,
  ...,
  arrow_size = 5,
  circle_size = 5,
  text_size = 5,
  font_family = "",
  arrow_col = "darkgreen",
  circle_col = "skyblue",
  x_limits = NULL,
  y_limits = NULL,
  no_scale = FALSE
)

Arguments

df

A dataframe including result of morphological analysis.

draw

A logical.

...

Extra arguments to internal functions.

group

A string to specify sentence.

depend

A logical.

term_depend

A string of dependent terms column to use bigram.

bigram

A result of bigram().

rand_seed

A numeric.

threshold

A numeric used as threshold for frequency of bigram.

big_net

A result of bigram_network().

freq

A numeric of word frequency in bigram_network. Can be got using word_freq().

arrow_size, circle_size, text_size

A numeric.

font_family

A string.

arrow_col, circle_col

A string to specify arrow and circle color in bigram network.

x_limits, y_limits

A Pair of numeric to specify range.

no_scale

A logical. FALSE: Not draw x and y axis.

Value

A list including df (input), bigram, freq (frequency) and gg (ggplot2 object of bigram network plot).

Examples


sentences <- 50
len <- 30
n <- sentences * len
x <- letters
prob <- (length(x):1) ^ 3
df <- 
  tibble::tibble(
    lemma = sample(x = x, size = n, replace = TRUE, prob = prob),
    sentence = rep(seq(sentences), each = len))
draw_bigram_network(df)

Generate code like "stringi::stri_unescape_unicode(...)"

Description

Generate code like "stringi::stri_unescape_unicode(...)"

Usage

escape_japanese(x)

Arguments

x

A string or vector of Japanese

Value

    A string or vector

Examples

stringi::stri_unescape_unicode("\\u8868\\u5c64\\u5f62") |>
  print() |>
  escape_japanese()

iconv x

Description

iconv x

Usage

iconv_x(x, iconv = "", reverse = FALSE)

Arguments

x

A string vector or a tibble.

iconv

A text. Convert encoding of MeCab output. Default (""): don't convert. "CP932_UTF-8": iconv(output, from = "Shift-JIS" to = "UTF-8") "EUC_UTF-8" : iconv(output, from = "eucjp", to = "UTF-8") iconv is also used to convert input text before running MeCab. "CP932_UTF-8": iconv(input, from = "UTF-8", to = "Shift-JIS")

reverse

A logical.

Value

    A string vector.

Make groups by splitting string length

Description

Using 'MeCab' for morphological analysis. Keep other colnames in dataframe.

Usage

make_groups(
  tbl,
  text_col = "text",
  length = 8000,
  tmp_group = "tmp_group",
  str_length = "str_length"
)

make_groups_sub(tbl, text_col, n_group, tmp_group, str_length)

max_sum_str_length(tbl, tmp_group, str_length)

Arguments

tbl

A tibble or data.frame.

text_col

A text. Colnames for morphological analysis.

length

A numeric.

tmp_group, str_length

A string to use temporary.

n_group

A numeric.

Value

A tibble. Output of morphological analysis and added column "text_id".

A string

A character vector

A data.frame

Examples


  # sample data of Japanese sentences
  data(neko)
  neko <-
      neko |>
      unescape_utf()
  # chamame
  neko |>
    moranajp_all(method = "chamame") |>
        print(n=100)

## Not run: 
  # Need to install 'mecab', 'ginza', or 'sudachi' in local PC

  # mecab
  bin_dir <- "d:/pf/mecab/bin"
  iconv <- "CP932_UTF-8"
  neko |>
    moranajp_all(text_col = "text", bin_dir = bin_dir, iconv = iconv) |>
        print(n=100)

  # ginza
  neko |>
    moranajp_all(text_col = "text", method = "ginza") |>
      print(n=100)

  # sudachi
  bin_dir <- "d:/pf/sudachi"
  iconv <- "CP932_UTF-8"
  neko |>
    moranajp_all(text_col = "text", bin_dir = bin_dir,
                 method = "sudachi_a", iconv = iconv) |>
        print(n=100)

## End(Not run)

Morphological analysis for a specific column in dataframe

Description

Using 'MeCab' for morphological analysis. Keep other colnames in dataframe.

Usage

moranajp_all(
  tbl,
  bin_dir = "",
  method = "mecab",
  text_col = "text",
  option = "",
  iconv = "",
  col_lang = "jp"
)

moranajp(tbl, bin_dir, method, text_col, option = "", iconv = "", col_lang)

remove_linebreaks(tbl, text_col)

separate_cols_ginza(tbl, col_lang)

make_input(tbl, text_col, iconv, brk = "BPMJP ")

make_cmd(method, bin_dir, option = "")

make_cmd_mecab(option = "")

out_cols_mecab(col_lang = "jp")

out_cols_ginza(col_lang = "jp")

out_cols_sudachi(col_lang = "jp")

out_cols_jp()

out_cols_en()

out_cols()

mecab_all(tbl, text_col = "text", bin_dir = "")

mecab(tbl, bin_dir)

Arguments

tbl

A tibble or data.frame.

bin_dir

A text. Directory of mecab.

method

text_col

A text. Colnames for morphological analysis.

option

A text. Options for mecab. "-b" option is already set by moranajp. To see option, use "mecab -h" in command (win) or terminal (Mac).

iconv

col_lang

A text. "jp" or "en"

brk

A string of break point

Value

A tibble. Output of morphological analysis and added column "text_id".

A string

A character vector

A data.frame

Examples


  # sample data of Japanese sentences
  data(neko)
  neko <-
      neko |>
      unescape_utf()
  # chamame
  neko |>
    moranajp_all(method = "chamame") |>
        print(n=100)

## Not run: 
  # Need to install 'mecab', 'ginza', or 'sudachi' in local PC

  # mecab
  bin_dir <- "d:/pf/mecab/bin"
  iconv <- "CP932_UTF-8"
  neko |>
    moranajp_all(text_col = "text", bin_dir = bin_dir, iconv = iconv) |>
        print(n=100)

  # ginza
  neko |>
    moranajp_all(text_col = "text", method = "ginza") |>
      print(n=100)

  # sudachi
  bin_dir <- "d:/pf/sudachi"
  iconv <- "CP932_UTF-8"
  neko |>
    moranajp_all(text_col = "text", bin_dir = bin_dir,
                 method = "sudachi_a", iconv = iconv) |>
        print(n=100)

## End(Not run)

The first part of 'I Am a Cat' by Soseki Natsume

Description

The first part of 'I Am a Cat' by Soseki Natsume

Usage

neko

Format

A data frame with 9 rows and 1 variable:

text: Body text. Escaped by stringi::stri_escape_unicode().

Examples

data(neko)
neko |>
  unescape_utf()

Analyzed data of neko by chamame

Description

chamame: https://chamame.ninjal.ac.jp/index.html

Usage

neko_chamame

Format

A data frame with 2959 rows and 7 variable: (column names are escaped by stringi::stri_escape_unicode(), stringi::stri_unescape_unicode() or unescape_utf() will show Japanese)

text_id: id
\u8868\u5c64\u5f62: result of chamame
\u54c1\u8a5e: result of chamame
\u54c1\u8a5e\u7d30\u5206\u985e1: result of chamame
\u54c1\u8a5e\u7d30\u5206\u985e2: result of chamame
\u54c1\u8a5e\u7d30\u5206\u985e3: result of chamame
\u539f\u5f62: result of chamame

Examples

data(neko_chamame)
neko_chamame |>
  unescape_utf()

Analyzed data of neko by GiNZA

Description

GiNZA: https://megagonlabs.github.io/ginza/

Usage

neko_ginza

Format

A data frame with 2945 rows and 13 variable:

text_id: id
id: result of GiNZA
\u8868\u5c64\u5f62: result of GiNZA
\u539f\u5f62: result of GiNZA
UD\u54c1\u8a5e\u30bf\u30b0: result of GiNZA
\u54c1\u8a5e: result of GiNZA
\u54c1\u8a5e\u7d30\u5206\u985e1: result of GiNZA
\u54c1\u8a5e\u7d30\u5206\u985e2: result of GiNZA
\u5c5e\u6027: result of GiNZA
\u4fc2\u53d7\u5143: result of GiNZA
\u4fc2\u53d7\u30bf\u30b0: result of GiNZA
\u4fc2\u53d7\u30da\u30a2: result of GiNZA
\u305d\u306e\u4ed6: result of GiNZA

Examples

data(neko_ginza)
neko_ginza |>
  unescape_utf()

Analyzed data of neko by MeCab

Description

MeCab: https://taku910.github.io/mecab/

Usage

neko_mecab

Format

A data frame with 2884 rows and 11 variable: (column names are escaped by stringi::stri_escape_unicode(), stringi::stri_unescape_unicode() or unescape_utf() will show Japanese)

text_id: id
\u8868\u5c64\u5f62: result of MeCab
\u54c1\u8a5e: result of MeCab
\u54c1\u8a5e\u7d30\u5206\u985e1: result of MeCab
\u54c1\u8a5e\u7d30\u5206\u985e2: result of MeCab
\u54c1\u8a5e\u7d30\u5206\u985e3: result of MeCab
\u6d3b\u7528\u578b: result of MeCab
\u6d3b\u7528\u5f62: result of MeCab
\u539f\u5f62: result of MeCab
\u8aad\u307f: result of MeCab
\u767a\u97f3: result of MeCab

Examples

data(neko_mecab)
neko_mecab |>
  unescape_utf()

Analyzed data of neko by Sudachi

Description

Sudachi: https://github.com/WorksApplications/Sudachi

Usage

neko_sudachi_a

neko_sudachi_b

neko_sudachi_c

Format

A data frame with 3130 rows and 9 variable:

text_id: id
\u8868\u5c64\u5f62: result of Sudachi
\u54c1\u8a5e: result of Sudachi
\u54c1\u8a5e\u7d30\u5206\u985e1: result of Sudachi
\u54c1\u8a5e\u7d30\u5206\u985e2: result of Sudachi
\u54c1\u8a5e\u7d30\u5206\u985e3: result of Sudachi
\u54c1\u8a5e\u7d30\u5206\u985e4: result of Sudachi
\u54c1\u8a5e\u7d30\u5206\u985e5: result of Sudachi
\u539f\u5f62: result of Sudachi

A data frame with 3088 rows and 9 variable:

A data frame with 3080 rows and 9 variable:

Examples

data(neko_sudachi_a)
neko_sudachi_a |>
  unescape_utf()

Morphological analysis for Japanese text by web chamame

Description

Using https://chamame.ninjal.ac.jp/ and rvest.

Usage

out_cols_chamame(col_lang = "jp")

web_chamame(text, col_lang = "jp")

html_radio_set(form, ...)

is_radio(fields)

Arguments

col_lang

A text. "jp" or "en"

text

A text.

form

vest_form object

...

dynamic-dots Name-value pairs giving radio button to modify.

fields

$fields in vest_form object

Value

A character vector

A dataframe

vest_form object

A boolean or vector

Examples

text <-
  paste0("\\u3059",
         paste0(rep("\\u3082",8),collapse=""),
         "\\u306e\\u3046\\u3061") |>
  unescape_utf()
web_chamame(text)

Remove break point and other unused rows from the result of morphological analysis

Description

Internal function for moranajp_all().

Usage

remove_brk(tbl, method, brk = "BPMJP")

Arguments

tbl

A tibble or data.frame.

method

brk

A string of break point

Value

A data.frame.

Full text of review article

Description

Full text of review article

Usage

review

Format

A data frame with 457 rows and 4 variables:

text: Body text. Escaped by stringi::stri_escape_unicode(). Body text. Escaped by stringi::stri_escape_unicode(). Citation is as below. Matsumura et al. 2014. Conditions and conservation for biodiversity of the semi-natural grassland vegetation on rice paddy levees. Vegetation Science, 31, 193-218. doi = 10.15031/vegsci.31.193 https://www.jstage.jst.go.jp/article/vegsci/31/2/31_193/_article/-char/en
chap: chapter
sect: section
para: paragraph

Examples

data(review)
review |>
  unescape_utf()

Analyzed data of review by chamame

Description

chamame: https://chamame.ninjal.ac.jp/index.html

Usage

review_chamame

Format

A data frame with 21125 rows and 10 variable (column names are escaped by stringi::stri_escape_unicode(), stringi::stri_unescape_unicode() or unescape_utf() will show Japanese)

text_id: id
chap: chapter
sect: section
para: paragraph
\u8868\u5c64\u5f62: result of chamame
\u54c1\u8a5e: result of chamame
\u54c1\u8a5e\u7d30\u5206\u985e1: result of chamame
\u54c1\u8a5e\u7d30\u5206\u985e2: result of chamame
\u54c1\u8a5e\u7d30\u5206\u985e3: result of chamame
\u539f\u5f62: result of chamame

Examples

data(review_chamame)
review_chamame |>
  unescape_utf()

Analyzed data of review by GiNZA

Description

GiNZA: https://megagonlabs.github.io/ginza/

Usage

review_ginza

Format

A data frame with 19514 rows and 16 variable:

text_id: id
chap: chapter
sect: section
para: paragraph
id: result of GiNZA
\u8868\u5c64\u5f62: result of GiNZA
\u539f\u5f62: result of GiNZA
UD\u54c1\u8a5e\u30bf\u30b0: result of GiNZA
\u54c1\u8a5e: result of GiNZA
\u54c1\u8a5e\u7d30\u5206\u985e1: result of GiNZA
\u54c1\u8a5e\u7d30\u5206\u985e2: result of GiNZA
\u5c5e\u6027: result of GiNZA
\u4fc2\u53d7\u5143: result of GiNZA
\u4fc2\u53d7\u30bf\u30b0: result of GiNZA
\u4fc2\u53d7\u30da\u30a2: result of GiNZA
\u305d\u306e\u4ed6: result of GiNZA

Examples

data(review_ginza)
review_ginza |>
  unescape_utf()

Analyzed data of review by MeCab

Description

MeCab: https://taku910.github.io/mecab/

Usage

review_mecab

Format

A data frame with 199985 rows and 14 variable: (column names are escaped by stringi::stri_escape_unicode(), stringi::stri_unescape_unicode() or unescape_utf() will show Japanese)

text_id: id
chap: chapter
sect: section
para: paragraph
\u8868\u5c64\u5f62: result of MeCab
\u54c1\u8a5e: result of MeCab
\u54c1\u8a5e\u7d30\u5206\u985e1: result of MeCab
\u54c1\u8a5e\u7d30\u5206\u985e2: result of MeCab
\u54c1\u8a5e\u7d30\u5206\u985e3: result of MeCab
\u6d3b\u7528\u578b: result of MeCab
\u6d3b\u7528\u5f62: result of MeCab
\u539f\u5f62: result of MeCab
\u8aad\u307f: result of MeCab
\u767a\u97f3: result of MeCab

Examples

data(review_mecab)
review_mecab |>
  unescape_utf()

Analyzed data of review by Sudachi

Description

Sudachi: https://github.com/WorksApplications/Sudachi

Usage

review_sudachi_a

review_sudachi_b

review_sudachi_c

Format

A data frame with 20100 rows and 12 variable:

text_id: id
chap: chapter
sect: section
para: paragraph
\u8868\u5c64\u5f62: result of Sudachi
\u54c1\u8a5e: result of Sudachi
\u54c1\u8a5e\u7d30\u5206\u985e1: result of Sudachi
\u54c1\u8a5e\u7d30\u5206\u985e2: result of Sudachi
\u54c1\u8a5e\u7d30\u5206\u985e3: result of Sudachi
\u54c1\u8a5e\u7d30\u5206\u985e4: result of Sudachi
\u54c1\u8a5e\u7d30\u5206\u985e5: result of Sudachi
\u539f\u5f62: result of Sudachi

A data frame with 19565 rows and 12 variable:

A data frame with 19526 rows and 12 variable:

Examples

data(review_sudachi_a)
review_sudachi_a |>
  unescape_utf()

Stop words for morphological analysis

Description

Stop words for morphological analysis

Usage

stop_words

Format

A data frame with 310 rows and 1 variable:

stop_word: Stop words can be used with delete_stop_words(). Escaped by stringi::stri_escape_unicode(). Downloaded from http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt

Examples

data(stop_words)
stop_words |>
  unescape_utf()

An example of synonym word pairs

Description

An example of synonym word pairs

Usage

synonym

Format

A data frame with 25 rows and 2 variables:

from: Words to be replaced from. Escaped by stringi::stri_escape_unicode().
to: Words to be replaced to.

Examples

data(synonym)
synonym |>
  unescape_utf()

Add ids.

Description

Add ids.

Usage

text_id_with_break(x, brk, end_with_brk = TRUE)

add_text_id_df(df, col, brk, end_with_brk = TRUE)

Arguments

x

A string vector.

brk

A string to specify the break between ids.

end_with_brk

A logical. TRUE: brk means the end of groups. FALSE: brk means the beginning of groups.

df

A dataframe.

col

A string to specify the column.

Value

id_with_break() returns id vector, add_id_df() returns dataframe.

Examples

tmp <- c("a", "brk", "b", "brk", "c")
brk <- "brk"
text_id_with_break(tmp, brk)
add_text_id_df(tibble::tibble(tmp), col = "tmp", "brk")

Wrapper functions for escape and unescape unicode

Description

Wrapper functions for escape and unescape unicode

Usage

unescape_utf(x)

escape_utf(x)

Arguments

x

A dataframe or character vector

Value

A dataframe or character vector

Examples

data(review_mecab)
review_mecab |>
  print() |>
  unescape_utf() |>
  print() |>
  escape_utf()