Type: | Package |
Title: | R Source Code Similarity Evaluation by Variable/Function Names |
Version: | 0.2.1 |
Date: | 2022-01-20 |
Description: | Evaluates R source codes by variable and/or functions names. Similar source codes should deliver similarity coefficients near one. Since neither the frequency nor the order of the used names is considered, a manual inspection of the R source code is required to check for similarity. Possible use cases include detection of code clones for improving software quality and of plagiarism amongst students' assignments. |
License: | GPL-3 |
URL: | https://github.com/sigbertklinke/rscc (development version) |
Imports: | crayon, formatR, highlight, igraph, tm |
Encoding: | UTF-8 |
RoxygenNote: | 7.1.2 |
Suggests: | rmarkdown, knitr |
VignetteBuilder: | knitr |
NeedsCompilation: | no |
Packaged: | 2022-01-20 11:31:07 UTC; sk |
Author: | Sigbert Klinke [aut, cre] |
Maintainer: | Sigbert Klinke <sigbert@hu-berlin.de> |
Repository: | CRAN |
Date/Publication: | 2022-01-20 12:02:42 UTC |
as.igraph
Description
Converts a data frame of similarity coefficients into a graph.
Usage
as_igraph(x, tol = 100 * .Machine$double.eps, tol1 = 8 * tol, ...)
Arguments
x |
a similarity object |
tol |
numeric scalar >= 0. Smaller differences are not
considered, see |
tol1 |
numeric scalar >= 0. |
... |
further parameters used by igraph::graph_from_adjacency_matrix |
Value
an igraph object
Examples
files <- list.files(path=system.file("examples", package="rscc"), pattern="*.R$", full.names = TRUE)
prgs <- sourcecode(files, title=basename(files))
docs <- documents(prgs)
simm <- similarities(docs)
# a similarity coefficients equal to zero does not create an edge!
g <- as_igraph(simm, diag=FALSE)
# thicker edges have higher similarity coefficients
plot(g, edge.width=1+3*igraph::E(g)$weight)
browse
Description
Creates a temporary HTML file with source codes and opens it into a browser using browseURL
.
Note that the source code is reformatted.
Usage
browse(prgs, simdf, n = (simdf[, 3] > 0), width.cutoff = 60, css = NULL)
Arguments
prgs |
sourcecode object |
simdf |
similarity object |
n |
integer: comparisons to show (default: |
width.cutoff |
integer: an integer in [20, 500]: if a line's character length is at or over this number, the function will try to break it into a new line (default: |
css |
character: file name of CSS style for highlighting the R code |
Value
invisibly the name of the temporary HTML file
Examples
# example files are taken from https://CRAN.R-project.org/package=SimilaR
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names=TRUE)
prgs <- sourcecode(files)
simm <- similarities(documents(prgs))
simdf <- matrix2dataframe(simm)
if (interactive()) browse(prgs, simdf)
documents
Description
Creates word vectors from parsed sourec code objects. If
-
type=="vars"
then the names ofall.vars(.)
, -
type=="funs"
then the namas ofsetdiff(all.names(.), all.vars(.)
, and -
type=="names"
then the names ofall.names(.)
are used.
Usage
documents(
prgs,
type = c("vars", "funs", "names"),
ignore.case = TRUE,
minlen = 2,
...
)
Arguments
prgs |
prgs sourcecode object |
type |
character: either |
ignore.case |
logical: If TRUE, case is ignored for computing (default: |
minlen |
integer: minimal name length to be considered (default: |
... |
unused |
Value
a
Examples
# example files are taken from https://CRAN.R-project.org/package=SimilaR
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names=TRUE)
prgs <- sourcecode(files, basename=TRUE)
docs <- documents(prgs)
docs
freq_table
Description
Computes a frequency table of words and documents.
Usage
freq_table(docs, ...)
Arguments
docs |
documents object |
... |
unused |
Value
a matrix with similarities
Examples
# example files are taken from https://CRAN.R-project.org/package=SimilaR
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names=TRUE)
prgs <- sourcecode(files, basename=TRUE)
docs <- documents(prgs)
freq_table (docs)
matrix2dataframe
Description
Converts a numeric matrix to a data frame with decreasing or increasing values: First column row index, second column col index and third column the value. If the matrix is symmetric, only the upper triangle is taken into account.
Usage
matrix2dataframe(
m,
decreasing = TRUE,
tol = 100 * .Machine$double.eps,
tol1 = 8 * tol,
...
)
Arguments
m |
numeric: a matrix of values |
decreasing |
logical: should the sort order be increasing or decreasing (default: |
tol |
numeric scalar >= 0. Smaller differences are not
considered, see |
tol1 |
numeric scalar >= 0. |
... |
further arguments passed to methods; the matrix method
passes these to |
Value
a data frame with an attribute matrix
with m
Examples
# non-symmetric
x <- matrix(runif(9), ncol=3)
matrix2dataframe(x)
same_file
Description
same_file
Usage
same_file(m, replacement = 0)
Arguments
m |
matrix object with row- and columnnames |
replacement |
value for replacement (default: |
Value
matrix
Examples
m <- matrix(runif(25), ncol=5)
colnames(m) <- rownames(m) <- c(sprintf("m[%.f]", 1:3), sprintf("m2[%.f]", 1:2))
m
same_file(m)
sim_coeff
Description
Internal function for faster computation. No checks on input will be performed.
Usage
sim_coeff(set1, set2, setfull, coeff)
Arguments
set1 |
character: unique vector of words |
set2 |
character: unique vector of words |
setfull |
character: unique vector of texts to compare |
coeff |
character: name of similarity coefficient to use |
Value
value of similarity coefficient
similarity_coeff
Description
Computes a similarity coefficient based on the unique elements set1
and set2
in relation to setfull
. If setfull
is NULL
then setfull
is set
to unique(c(set1, set2))
. For more details, see the vignette vignette("rscc")
.
Usage
similarity_coeff(
set1,
set2,
setfull = NULL,
coeff = c("jaccard", "braun", "dice", "hamann", "kappa", "kulczynski", "ochiai",
"phi", "russelrao", "matching", "simpson", "sneath", "tanimoto", "yule")
)
Arguments
set1 |
vector: elements to compare |
set2 |
vector: elements to compare |
setfull |
vector: elements to compare (default: |
coeff |
character: coefficient to compute (default: |
Value
a numeric similarity coefficient
Examples
s1 <- 1:3
s2 <- 1:5
similarity_coeff(s1, s2)
s1 <- letters[1:3]
s2 <- LETTERS[1:5]
similarity_coeff(s1, s2)
similarities
Description
sims
and similarities
both calculate for each pair of source code objects
the similarity coefficients and return a data frame with the coefficients in descending order.
A larger coefficient means a greater similarity.
Usage
sims(...)
similarities(
docs,
all = FALSE,
coeff = c("jaccard", "braun", "dice", "hamann", "kappa", "kulczynski", "ochiai",
"phi", "russelrao", "matching", "simpson", "sneath", "tanimoto", "yule")
)
Arguments
... |
all parameters in |
docs |
document object |
all |
logical: should the similarity coefficients computed based on all sourcecode objects or just the two considered (default: |
coeff |
character: coefficient to compute (default: |
Value
a data frame with the results
Examples
# example files are taken from https://CRAN.R-project.org/package=SimilaR
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names=TRUE)
prgs <- sourcecode(files, basename=TRUE)
docs <- documents(prgs)
similarities(docs)
# further steps
# m <- similarities(docs)
# df <- matrix2dataframe(m)
# head(df, n=20)
# browse(prgs, df, n=5)
sourcecode
Description
Reads and parses files with R source code.
Usage
sourcecode(x, ...)
## Default S3 method:
sourcecode(x, title = x, silent = FALSE, minlines = -1, ...)
Arguments
x |
character: filenames |
... |
unused |
title |
character: vector of program titles (default: |
silent |
logical: should the report of messages be suppressed (default: |
minlines |
integer: only expressions with |
Value
a sourcecode object
Examples
# example files are taken from https://CRAN.R-project.org/package=SimilaR
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names=TRUE)
prgs <- sourcecode(files)
tfidf
Description
Computes the term frequency–inverse document frequency uses tha cosine of the angles between the documents as similarity measure. Since R source code is provided no stemming or stop words are applied.
Usage
tfidf(docs)
Arguments
docs |
document object |
Value
similarity matrix
Examples
files <- list.files(system.file("examples", package="rscc"), "*.R$", full.names = TRUE)
prgs <- sourcecode(files, basename=TRUE, silent=TRUE)
docs <- documents(prgs)
tfidf(docs)
# further steps
# m <- tfidf(docs)
# df <- matrix2dataframe(m)
# head(df, n=20)
# browse(prgs, df, n=5)