Type: Package
Title: Fuzzy String Comparison
Version: 0.2.0
Description: Provides string similarity calculations inspired by the Python 'thefuzz' package. Compare strings by edit distance, similarity ratio, best matching substring, ordered token matching and set-based token matching. A range of edit distance measures are available thanks to the 'stringdist' package.
License: GPL-3
URL: https://github.com/lewinfox/levitate/, https://www.lewinfox.com/levitate/
BugReports: https://github.com/lewinfox/levitate/issues
Depends: R (≥ 2.10)
Imports: rlang, stringdist
Suggests: glue, knitr, pkgdown, rmarkdown, styler, testthat
VignetteBuilder: knitr
Encoding: UTF-8
Language: en-GB
LazyData: true
RoxygenNote: 7.2.3
NeedsCompilation: no
Packaged: 2023-09-30 23:43:43 UTC; lewin
Author: Lewin Appleton-Fox [aut, cre, cph]
Maintainer: Lewin Appleton-Fox <lewin.a.f@gmail.com>
Repository: CRAN
Date/Publication: 2023-10-01 00:00:02 UTC
Default parameters inherited by other documentation
Description
Default parameters inherited by other documentation
Arguments
a, b: The input strings
pairwise: Boolean. If
useNames: Boolean. Use input vectors as row and column names?
...: Additional arguments to be passed to
See Also
stringdist::stringdist(), stringdist::stringsim() for details on the underlying functions and the additional options available.
Hotel room listings
Description
The dataset contains 85 descriptions of the same hotel rooms from Expedia and Booking.com.
Usage
hotel_rooms
Format
A data frame with the following columns:
- expedia: The title of the room's listing on expedia.com
- booking: The title of the room's listing on booking.com
Source
Based on a dataset compiled by Susan Li.
Internal functions
Description
lev_partial_ratio() and lev_token_set_ratio() are hard to vectorise in one go, so in the interests of lazy thinking these "internal" versions contain the logic to operate on single-length inputs, and the calling functions just apply() them as needed.
Usage
internal_lev_token_set_ratio(a, b, pairwise = TRUE, useNames = !pairwise, ...)
internal_lev_partial_ratio(a, b, pairwise = TRUE, useNames = !pairwise, ...)
Arguments
a, b: The input strings. For these "internal" functions these must be length 1
pairwise: Boolean. If
useNames: Boolean. Use input vectors as row and column names?
...: Additional arguments to be passed to
Functions
- internal_lev_token_set_ratio(): See lev_token_set_ratio().
- internal_lev_partial_ratio(): See lev_partial_ratio().
Get the best matched string from a list of candidates
Description
Given an input string and multiple candidates, return the candidate with the best score as calculated by .fn.
Usage
lev_best_match(input, candidates, .fn = lev_ratio, ..., decreasing = TRUE)
Arguments
input: A single string
candidates: One or more candidate strings to score
.fn: The scoring function to use, as a string or function object. Defaults to lev_ratio().
...: Additional arguments to pass to
decreasing: If
Value
A string
See Also
Examples
lev_best_match("bilbo", c("frodo", "gandalf", "legolas"))
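The selection step can be sketched in base R as "score every candidate, keep the best one". This is only an illustration under stated assumptions: ratio_sketch() and best_match_sketch() are hypothetical names, utils::adist() stands in for the stringdist-backed scoring, and reading decreasing = TRUE as "the highest score wins" is a guess, so results may differ from lev_best_match().

```r
# Hypothetical stand-in for lev_ratio(): 1 - edit distance / longer length.
# Uses base R's adist() rather than the 'stringdist' package.
ratio_sketch <- function(a, b) {
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) return(1)
  1 - drop(utils::adist(a, b)) / longest
}

# Hypothetical sketch of the selection logic, not the package's code.
best_match_sketch <- function(input, candidates, .fn = ratio_sketch,
                              decreasing = TRUE) {
  # One score per candidate.
  scores <- vapply(candidates, function(cand) .fn(input, cand), numeric(1))
  # Assumption: decreasing = TRUE means the highest score is the best.
  pick <- if (decreasing) which.max(scores) else which.min(scores)
  candidates[[pick]]
}

best <- best_match_sketch("bilbo", c("frodo", "gandalf", "legolas"))
print(best)
```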
String distance metrics
Description
Uses stringdist::stringdistmatrix() to compute a range of string distance metrics.
Usage
lev_distance(a, b, pairwise = TRUE, useNames = TRUE, ...)
Arguments
a, b: The input strings
pairwise: Boolean. If
useNames: Boolean. Use input vectors as row and column names?
...: Additional arguments to be passed to
Value
A numeric scalar, vector or matrix depending on the length of the inputs. See "Details".
Details
This is a thin wrapper around stringdist::stringdistmatrix() and mainly exists to coerce the output into the simplest possible format (via lev_simplify_matrix()).
The function will return the simplest possible data structure permitted by the length of the inputs a and b. This will be a scalar if a and b are length 1, a vector if either (but not both) is length > 1, and a matrix otherwise.
Other options
In addition to useNames, stringdist::stringdistmatrix() provides a range of options to control the matching, which can be passed using `...`. Refer to the stringdist documentation for more information.
Examples
lev_distance("Bilbo", "Frodo")
lev_distance("Bilbo", c("Frodo", "Merry"))
lev_distance("Bilbo", c("Frodo", "Merry"), useNames = FALSE)
lev_distance(c("Bilbo", "Gandalf"), c("Frodo", "Merry"))
Ratio of the best-matching substring
Description
Find the best lev_ratio() between substrings.
Usage
lev_partial_ratio(a, b, pairwise = TRUE, useNames = TRUE, ...)
Arguments
a, b: The input strings
pairwise: Boolean. If
useNames: Boolean. Use input vectors as row and column names?
...: Additional arguments to be passed to
Value
A numeric scalar, vector or matrix depending on the length of the inputs.
Details
If string a has length len_a and is shorter than string b, this function finds the highest lev_ratio() of all the len_a-long substrings of b (and vice versa).
Examples
lev_ratio("Bruce Springsteen", "Bruce Springsteen and the E Street Band")
# Here the two "Bruce Springsteen" strings will match perfectly.
lev_partial_ratio("Bruce Springsteen", "Bruce Springsteen and the E Street Band")
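The sliding-window search described under Details can be sketched in base R for a single pair of strings. This is a minimal illustration, not the package's implementation: ratio_sketch() and partial_ratio_sketch() are hypothetical names, and utils::adist() is used as a stand-in for the stringdist-backed lev_ratio(), so exact scores may differ.

```r
# Hypothetical stand-in for lev_ratio(): 1 - edit distance / longer length.
ratio_sketch <- function(a, b) {
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) return(1)
  1 - drop(utils::adist(a, b)) / longest
}

# Hypothetical sketch for single-length inputs only.
partial_ratio_sketch <- function(a, b) {
  # Make `short` the shorter of the two strings.
  if (nchar(a) > nchar(b)) {
    short <- b
    long <- a
  } else {
    short <- a
    long <- b
  }
  n <- nchar(short)
  if (n == 0) return(ratio_sketch(short, long))
  # Every window of `n` consecutive characters in the longer string.
  starts <- seq_len(nchar(long) - n + 1)
  windows <- substring(long, starts, starts + n - 1)
  # Keep the best score over all windows.
  max(vapply(windows, ratio_sketch, numeric(1), b = short))
}

full <- ratio_sketch("Bruce Springsteen", "Bruce Springsteen and the E Street Band")
partial <- partial_ratio_sketch("Bruce Springsteen", "Bruce Springsteen and the E Street Band")
print(c(full = full, partial = partial))
```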
String similarity ratio
Description
String similarity ratio
Usage
lev_ratio(a, b, pairwise = TRUE, useNames = TRUE, ...)
Arguments
a, b: The input strings
pairwise: Boolean. If
useNames: Boolean. Use input vectors as row and column names?
...: Additional arguments to be passed to
Value
A numeric scalar, vector or matrix depending on the length of the inputs.
Details
This is a thin wrapper around stringdist::stringsimmatrix() and mainly exists to coerce the output into the simplest possible format (via lev_simplify_matrix()).
The function will return the simplest possible data structure permitted by the length of the inputs a and b. This will be a scalar if a and b are length 1, a vector if either (but not both) is length > 1, and a matrix otherwise.
Examples
lev_ratio("Bilbo", "Frodo")
lev_ratio("Bilbo", c("Frodo", "Merry"))
lev_ratio("Bilbo", c("Frodo", "Merry"), useNames = FALSE)
lev_ratio(c("Bilbo", "Gandalf"), c("Frodo", "Merry"))
Score multiple candidate strings against a single input
Description
Given a single input string and multiple candidates, compute scores for each candidate.
Usage
lev_score_multiple(input, candidates, .fn = lev_ratio, ..., decreasing = TRUE)
Arguments
input: A single string
candidates: One or more candidate strings to score
.fn: The scoring function to use, as a string or function object. Defaults to lev_ratio().
...: Additional arguments to pass to
decreasing: If
Value
A list where the keys are candidates and the values are the scores. The list is sorted according to the decreasing parameter, so by default higher scores are first.
See Also
Examples
lev_score_multiple("bilbo", c("frodo", "gandalf", "legolas"))
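The shape of the return value (a named list sorted by score) can be illustrated with a small base R sketch. The names ratio_sketch() and score_multiple_sketch() are hypothetical, and utils::adist() is a stand-in for the package's stringdist-backed scoring, so the numbers it produces are not necessarily those of lev_score_multiple().

```r
# Hypothetical stand-in for lev_ratio(): 1 - edit distance / longer length.
ratio_sketch <- function(a, b) {
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) return(1)
  1 - drop(utils::adist(a, b)) / longest
}

# Hypothetical sketch: score, sort, and return a list keyed by candidate.
score_multiple_sketch <- function(input, candidates, .fn = ratio_sketch,
                                  decreasing = TRUE) {
  # vapply() names each score after its candidate string.
  scores <- vapply(candidates, function(cand) .fn(input, cand), numeric(1))
  as.list(sort(scores, decreasing = decreasing))
}

scores <- score_multiple_sketch("bilbo", c("frodo", "gandalf", "legolas"))
str(scores)
```

Because the list is already sorted, the first name is the best match under the chosen ordering.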
Simplify a matrix
Description
Given an input matrix, try to simplify it to a scalar or vector. This requires that one or both of the dimensions are 1. If the matrix has dimnames() and the output has more than one item, name the elements according to the longest dimname.
Usage
lev_simplify_matrix(m)
Arguments
m: A matrix. If
Value
A scalar, vector or matrix as described above.
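The simplification rule in the Description can be sketched as follows. simplify_matrix_sketch() is a hypothetical name and the code is an illustration of the stated rule rather than the package's implementation. It assumes the input is always a matrix.

```r
# Hypothetical sketch of the rule described above, not the package's code.
simplify_matrix_sketch <- function(m) {
  stopifnot(is.matrix(m))
  if (nrow(m) == 1 && ncol(m) == 1) {
    # 1 x 1 matrix: return a bare, unnamed scalar.
    return(unname(m[1, 1]))
  }
  if (nrow(m) == 1) {
    # One row: the columns are the longer dimension, so use their names.
    return(stats::setNames(as.vector(m), colnames(m)))
  }
  if (ncol(m) == 1) {
    # One column: the rows are the longer dimension, so use their names.
    return(stats::setNames(as.vector(m), rownames(m)))
  }
  # Both dimensions exceed 1, so the matrix is returned unchanged.
  m
}

m <- matrix(1:2, nrow = 1, dimnames = list("Bilbo", c("Frodo", "Merry")))
print(simplify_matrix_sketch(m))
```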
Matching based on common tokens
Description
Compare strings based on shared tokens.
Usage
lev_token_set_ratio(a, b, pairwise = TRUE, useNames = TRUE, ...)
Arguments
a, b: The input strings
pairwise: Boolean. If
useNames: Boolean. Use input vectors as row and column names?
...: Additional arguments to be passed to
Value
A numeric scalar, vector or matrix depending on the length of the inputs.
Details
Similar to lev_token_sort_ratio(), this function breaks the input down into tokens. It then identifies any common tokens between strings and creates three new strings:
x <- {common_tokens}
y <- {common_tokens}{remaining_unique_tokens_from_string_a}
z <- {common_tokens}{remaining_unique_tokens_from_string_b}
and performs three pairwise lev_ratio() calculations between them (x vs y, y vs z and x vs z). The highest of those three ratios is returned.
See Also
Examples
x <- "the quick brown fox jumps over the lazy dog"
y <- "my lazy dog was jumped over by a quick brown fox"
lev_ratio(x, y)
lev_token_sort_ratio(x, y)
lev_token_set_ratio(x, y)
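The three-string construction described under Details can be sketched in base R for a single pair of inputs. Several details here are assumptions rather than documented behaviour: tokens are sorted alphabetically and joined with single spaces, utils::adist() stands in for the stringdist-backed lev_ratio(), and all function names ending in _sketch are hypothetical.

```r
# Hypothetical stand-in for lev_ratio(): 1 - edit distance / longer length.
ratio_sketch <- function(a, b) {
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) return(1)
  1 - drop(utils::adist(a, b)) / longest
}

# Split on runs of non-alphanumeric characters and drop empty tokens.
tokenise_sketch <- function(s) {
  tokens <- strsplit(s, "[^[:alnum:]]+")[[1]]
  tokens[nzchar(tokens)]
}

# Hypothetical sketch for single-length inputs only.
token_set_ratio_sketch <- function(a, b) {
  tok_a <- unique(tokenise_sketch(a))
  tok_b <- unique(tokenise_sketch(b))
  common <- sort(intersect(tok_a, tok_b))
  only_a <- sort(setdiff(tok_a, tok_b))
  only_b <- sort(setdiff(tok_b, tok_a))
  # Assumption: tokens are joined with single spaces.
  x <- paste(common, collapse = " ")
  y <- paste(c(common, only_a), collapse = " ")
  z <- paste(c(common, only_b), collapse = " ")
  # The highest of the three pairwise ratios is the result.
  max(ratio_sketch(x, y), ratio_sketch(y, z), ratio_sketch(x, z))
}

s1 <- "the quick brown fox jumps over the lazy dog"
s2 <- "my lazy dog was jumped over by a quick brown fox"
print(token_set_ratio_sketch(s1, s2))
```

When one string's tokens are a subset of the other's, x equals y or z, so the sketch returns 1.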
Ordered token matching
Description
Compares strings by tokenising them, sorting the tokens alphabetically and then computing the lev_ratio() of the result. This means that the order of words is irrelevant, which can be helpful in some circumstances.
Usage
lev_token_sort_ratio(a, b, pairwise = TRUE, useNames = TRUE, ...)
Arguments
a, b: The input strings
pairwise: Boolean. If
useNames: Boolean. Use input vectors as row and column names?
...: Additional arguments to be passed to
Value
A numeric scalar, vector or matrix depending on the length of the inputs.
See Also
Examples
x <- "Episode IV - Star Wars: A New Hope"
y <- "Star Wars Episode IV - New Hope"
# Because the order of words is different the simple approach gives a low match ratio.
lev_ratio(x, y)
# The sorted token approach ignores word order.
lev_token_sort_ratio(x, y)
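The tokenise, sort and compare steps from the Description can be sketched in base R. This is an illustration only: the _sketch function names are hypothetical, utils::adist() stands in for the stringdist-backed lev_ratio(), and joining sorted tokens with a single space follows the str_token_sort() description.

```r
# Hypothetical stand-in for lev_ratio(): 1 - edit distance / longer length.
ratio_sketch <- function(a, b) {
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) return(1)
  1 - drop(utils::adist(a, b)) / longest
}

# Tokenise, sort alphabetically and re-join with single spaces.
token_sort_sketch <- function(s) {
  tokens <- strsplit(s, "[^[:alnum:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  paste(sort(tokens), collapse = " ")
}

# Hypothetical sketch for single-length inputs only.
token_sort_ratio_sketch <- function(a, b) {
  ratio_sketch(token_sort_sketch(a), token_sort_sketch(b))
}

s1 <- "Episode IV - Star Wars: A New Hope"
s2 <- "Star Wars Episode IV - New Hope"
plain <- ratio_sketch(s1, s2)
sorted <- token_sort_ratio_sketch(s1, s2)
print(c(plain = plain, sorted = sorted))
```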
Weighted token similarity measure
Description
Computes similarity but allows you to assign weights to specific tokens. This is useful, for example, when you have a frequently-occurring string that doesn't contain useful information. See examples.
Usage
lev_weighted_token_ratio(a, b, weights = list(), ...)
Arguments
a, b: The input strings
weights: List of token weights. For example,
...: Additional arguments to be passed to
Value
A float
Details
The algorithm used here is as follows:
- Tokenise the input strings
- Compute the edit distance between each pair of tokens
- Compute the maximum edit distance between each pair of tokens
- Apply any weights from the weights argument
- Return 1 - (sum(weighted_edit_distances) / sum(weighted_max_edit_distance))
See Also
Other weighted token functions: lev_weighted_token_set_ratio(), lev_weighted_token_sort_ratio()
Examples
lev_weighted_token_ratio("jim ltd", "tim ltd")
lev_weighted_token_ratio("tim ltd", "jim ltd", weights = list(ltd = 0.1))
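The steps listed under Details can be sketched in base R. The documentation does not say how tokens are paired or which token's weight applies to a pair, so this sketch makes several assumptions: tokens are paired by position, the shorter token vector is padded with empty strings, the maximum edit distance of a pair is the length of its longer token, and a pair takes the smallest weight listed for either of its tokens (default 1). All _sketch names are hypothetical and utils::adist() stands in for stringdist.

```r
# Split on runs of non-alphanumeric characters and drop empty tokens.
tokenise_sketch <- function(s) {
  tokens <- strsplit(s, "[^[:alnum:]]+")[[1]]
  tokens[nzchar(tokens)]
}

# Assumption: a pair uses the smallest weight listed for either token.
pair_weight_sketch <- function(x, y, weights) {
  w <- unlist(weights[intersect(c(x, y), names(weights))])
  if (length(w) == 0) 1 else min(w)
}

# Hypothetical sketch for single-length inputs only.
weighted_token_ratio_sketch <- function(a, b, weights = list()) {
  tok_a <- tokenise_sketch(a)
  tok_b <- tokenise_sketch(b)
  n <- max(length(tok_a), length(tok_b))
  if (n == 0) return(1)
  # Assumption: pad the shorter side so tokens can be paired by position.
  tok_a <- c(tok_a, rep("", n - length(tok_a)))
  tok_b <- c(tok_b, rep("", n - length(tok_b)))
  dists <- mapply(function(x, y) drop(utils::adist(x, y)), tok_a, tok_b)
  max_dists <- pmax(nchar(tok_a), nchar(tok_b))
  w <- mapply(pair_weight_sketch, tok_a, tok_b,
              MoreArgs = list(weights = weights))
  1 - sum(w * dists) / sum(w * max_dists)
}

unweighted <- weighted_token_ratio_sketch("jim ltd", "tim ltd")
weighted <- weighted_token_ratio_sketch("tim ltd", "jim ltd", weights = list(ltd = 0.1))
print(c(unweighted = unweighted, weighted = weighted))
```

In this sketch, down-weighting "ltd" shrinks the contribution of the uninformative exact match, so the one-letter difference between "jim" and "tim" counts for more and the score drops.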
Weighted version of lev_token_set_ratio()
Description
Weighted version of lev_token_set_ratio()
Usage
lev_weighted_token_set_ratio(a, b, weights = list(), ...)
Arguments
a, b: The input strings
weights: List of token weights. For example,
...: Additional arguments to be passed to
Value
Float
See Also
Other weighted token functions: lev_weighted_token_ratio(), lev_weighted_token_sort_ratio()
Weighted version of lev_token_sort_ratio()
Description
This function tokenises inputs, sorts tokens and computes similarities for each pair of tokens. Similarity scores are weighted based on the weights argument, and a total similarity score is returned in the same manner as lev_weighted_token_ratio().
Usage
lev_weighted_token_sort_ratio(a, b, weights = list(), ...)
Arguments
a, b: The input strings
weights: List of token weights. For example,
...: Additional arguments to be passed to
Value
Float
See Also
Other weighted token functions: lev_weighted_token_ratio(), lev_weighted_token_set_ratio()
Find all substrings of a given length
Description
Find all substrings of a given length
Usage
str_all_substrings(x, n)
Arguments
x: The input string. Non-character inputs will be coerced with
n: The length of the desired substrings.
Value
A character vector containing all the length n substrings of x. If x has length > 1 then a list is returned containing an entry for each input element.
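A minimal base R sketch of the behaviour described above is shown here. all_substrings_sketch() is a hypothetical name. Coercing with as.character() and returning an empty vector when n is larger than the string are assumptions, not documented behaviour.

```r
# Hypothetical sketch, not the package's implementation.
all_substrings_sketch <- function(x, n) {
  # Assumption: non-character inputs are coerced with as.character().
  x <- as.character(x)
  one <- function(s) {
    # Assumption: no substrings exist when n is out of range.
    if (n < 1 || nchar(s) < n) return(character(0))
    starts <- seq_len(nchar(s) - n + 1)
    substring(s, starts, starts + n - 1)
  }
  # A single string gives a character vector, several give a list.
  if (length(x) == 1) one(x) else lapply(x, one)
}

print(all_substrings_sketch("hello", 3))
print(all_substrings_sketch(c("hello", "abc"), 2))
```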
Tokenise and sort a string
Description
Given an input string, tokenise it (using str_tokenise()), sort the tokens alphabetically and return the result as a single space-separated string.
Usage
str_token_sort(x)
Arguments
x: The input string. Non-character inputs will be coerced with
Value
A character vector the same length as x containing the sorted tokens.
Tokenise a string
Description
Splits an input string into tokens. A wrapper for strsplit() with a predefined split regex of [^[:alnum:]]+, i.e. any run of characters other than [a-zA-Z0-9].
Usage
str_tokenise(x, split = "[^[:alnum:]]+")
Arguments
x: The input string. Non-character inputs will be coerced with
split: The regular expression to split on. See
Value
A list containing one character vector for each element of x.
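The effect of the default split pattern can be seen by calling base R's strsplit() directly, which is what the Description says this function wraps. The example below is only an illustration of the regex and does not call str_tokenise() itself. The variable names are arbitrary.

```r
x <- c("Hello, world!", "one-two three")

# With the trailing `+`, a run of separators counts as a single split point.
with_plus <- strsplit(x, "[^[:alnum:]]+")

# Without it, adjacent separators (here ", ") leave an empty token behind.
without_plus <- strsplit(x, "[^[:alnum:]]")

print(with_plus)
print(without_plus[[1]])
```

The result is a list with one character vector per input element, matching the Value described above.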