Title: | Multifile Record Linkage and Duplicate Detection |
Version: | 0.1.1 |
Description: | Implementation of the methodology of Aleshin-Guendel & Sadinle (2022) <doi:10.1080/01621459.2021.2013242>. It handles the general problem of multifile record linkage and duplicate detection, where any number of files are to be linked, and any of the files may have duplicates. |
Depends: | R (≥ 3.5.0) |
License: | GPL-3 |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.1.2 |
URL: | https://github.com/aleshing/multilink |
BugReports: | https://github.com/aleshing/multilink/issues |
Imports: | igraph, RecordLinkage, Rcpp, utils, mcclust, geosphere, stringr |
LinkingTo: | Rcpp, RcppArmadillo |
NeedsCompilation: | yes |
Packaged: | 2023-06-08 20:25:20 UTC; sergealeshin-guendel |
Author: | Serge Aleshin-Guendel [aut, cre] |
Maintainer: | Serge Aleshin-Guendel <saleshinguendel@gmail.com> |
Repository: | CRAN |
Date/Publication: | 2023-06-09 14:20:07 UTC |
Create Comparison Data
Description
Create comparison data for all pairs of records, except for those records in files which are assumed to have no duplicates.
Usage
create_comparison_data(
records,
types,
breaks,
file_sizes,
duplicates,
verbose = TRUE
)
Arguments
records |
A |
types |
A |
breaks |
A |
file_sizes |
A |
duplicates |
A |
verbose |
A |
Details
The purpose of this function is to construct comparison vectors for each pair
of records. In order to construct these vectors, one needs to specify the
types
and breaks
arguments. The types
argument specifies
how each field should be compared, and the breaks
argument specifies
how to discretize these comparisons.
Currently, the types
argument supports three types of field
comparisons: binary, absolute difference, and the normalized Levenshtein
distance. Please contact the package maintainer if you need a new type of
comparison to be supported.
The breaks
argument should be a list
, with with one element for
each field. If a field is being compared with a binary comparison, i.e.
types[f]="bi"
, then the corresponding element of breaks
should
be NA
, i.e. breaks[[f]]=NA
. If a field is being compared with a
numeric or string comparison, then the corresponding element of breaks
should be a vector of cut points used to discretize the comparisons. To give
more detail, suppose you pass in cut points
breaks[[f]]=c(cut_1, ...,cut_L)
. These cut points
discretize the range of the comparisons into L+1
intervals:
I_0=(-\infty, cut_1], I_1=(cut_1, cut_2], ..., I_L=(cut_L, \infty]
. The
raw comparisons, which lie in [0,\infty)
for numeric comparisons and
[0,1]
for string comparisons, are then replaced with indicators of
which interval the comparisons lie in. The interval I_0
corresponds to
the lowest level of disagreement for a comparison, while the interval
I_L
corresponds to the highest level of disagreement for a comparison.
Value
a list containing:
record_pairs
A
data.frame
, where each row contains the pair of records being compared in the corresponding row ofcomparisons
. The rows are sorted in ascending order according to the first column, with ties broken according to the second column in ascending order. For any given row, the first column is less than the second column, i.e.record_pairs[i, 1] < record_pairs[i, 2]
for each rowi
.comparisons
A
logical
matrix, where each row contains the comparisons for the record pair in the corresponding row ofrecord_pairs
. Comparisons are in the same order as the columns ofrecords
, and are represented byL + 1
columns ofTRUE/FALSE
indicators, whereL + 1
is the number of disagreement levels for the field based onbreaks
.K
The number of files, assumed to be of class
numeric
.file_sizes
A
numeric
vector of lengthK
, indicating the size of each file.duplicates
A
numeric
vector of lengthK
, indicating which files are assumed to have duplicates.duplicates[k]
should be1
if filek
has duplicates, andduplicates[k]
should be0
if filek
has no duplicates. If any files do not have duplicates, we strongly recommend that the largest such file is organized to be the first file.field_levels
A
numeric
vector indicating the number of disagreement levels for each field.file_labels
An
integer
vector of lengthsum(file_sizes)
, wherefile_labels[i]
indicates which file recordi
is in.fp_matrix
An
integer
matrix, wherefp_matrix[k1, k2]
is a label for the file pair(k1, k2)
. Note thatfp_matrix[k1, k2] = fp_matrix[k2, k1]
.rp_to_fp
A
logical
matrix that indicates which record pairs belong to which file pairs.rp_to_fp[fp, rp]
isTRUE
if the recordsrecord_pairs[rp, ]
belong to the file pairfp
, and is FALSE otherwise. Note thatfp
is given by the labeling infp_matrix
.ab
An
integer
vector, of lengthncol(comparisons) * K * (K + 1) / 2
that indicates how many record pairs there are with a given disagreement level for a given field, for each file pair.file_sizes_not_included
A
numeric
vector of0
s. This element is non-zero whenreduce_comparison_data
is used.ab_not_included
A
numeric
vector of0
s. This element is non-zero whenreduce_comparison_data
is used.labels
NA
. This element is notNA
whenreduce_comparison_data
is used.pairs_to_keep
NA
. This element is notNA
whenreduce_comparison_data
is used.cc
0
. This element is non-zero whenreduce_comparison_data
is used.
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi: 10.1080/01621459.2021.2013242][arXiv]
Examples
## Example with small no duplicate dataset
data(no_dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
file_sizes = no_dup_data_small$file_sizes,
duplicates = c(0, 0, 0))
## Example with small duplicate dataset
data(dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
file_sizes = dup_data_small$file_sizes,
duplicates = c(1, 1, 1))
Duplicate Dataset
Description
A dataset containing 867
simulated records from 3
files with
no duplicate records in each file.
Usage
dup_data
Format
A list with three elements:
- records
A
data.frame
with the records, containing7
fields, from all three files, in the format used for input tocreate_comparison_data
.- file_sizes
The size of each file.
- IDs
The true partition of the records, represented as an
integer
vector of arbitrary labels of lengthsum(file_sizes)
.
Source
Extracted from the datasets used in the simulation study of the paper. The datasets were generated using code from Peter Christen's group https://dmm.anu.edu.au/geco/index.php.
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi: 10.1080/01621459.2021.2013242][arXiv]
Examples
data(dup_data)
# There are 500 entities represented in the records
length(unique(dup_data$IDs))
Small Duplicate Dataset
Description
A dataset containing 96
simulated records from 3
files with
no duplicate records in each file, subset from dup_data
.
Usage
dup_data_small
Format
A list with three elements:
- records
A
data.frame
with the records, containing7
fields, from all three files, in the format used for input tocreate_comparison_data
.- file_sizes
The size of each file.
- IDs
The true partition of the records, represented as an
integer
vector of arbitrary labels of lengthsum(file_sizes)
.
Source
Extracted from the datasets used in the simulation study of the paper. The datasets were generated using code from Peter Christen's group https://dmm.anu.edu.au/geco/index.php.
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi: 10.1080/01621459.2021.2013242][arXiv]
Examples
data(dup_data_small)
# There are 96 entities represented in the records
length(unique(dup_data_small$IDs))
Find the Bayes Estimate of a Partition
Description
Find the (approximate) Bayes estimate of a partition based on MCMC samples of the partition and a specified loss function.
Usage
find_bayes_estimate(
partitions,
burn_in,
L_FNM = 1,
L_FM1 = 1,
L_FM2 = 2,
L_A = Inf,
max_cc_size = nrow(partitions),
verbose = TRUE
)
Arguments
partitions |
Posterior samples of the partition, where each column
is one sample and the partition is represented as an |
burn_in |
The number of samples to discard for burn in. |
L_FNM |
Positive loss for a false non-match. Default is |
L_FM1 |
Positive loss for a type 1 false match. Default is |
L_FM2 |
Positive loss for a type 2 false match. Default is |
L_A |
Positive loss for abstaining from making a decision for a record.
Default is |
max_cc_size |
The maximum allowable connected component size over which
the posterior expected loss is minimized. Default is |
verbose |
A |
Value
A vector, the same length of a column of partitions
containing the
(approximate) Bayes estimate of the partition. If !is.infinite(L_A)
the output may be a partial estimate. A positive number l
in index
i
indicates that record i
is in the same cluster as every other
record j
with l
in index j
. A value of -1
in
index i
indicates that the Bayes estimate abstained from making a
decision for record i
.
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi: 10.1080/01621459.2021.2013242][arXiv]
Examples
# Example with small no duplicate dataset
data(no_dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
file_sizes = no_dup_data_small$file_sizes,
duplicates = c(0, 0, 0))
# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
dup_count_prior_family = NA, dup_count_prior_pars = NA,
n_prior_family = "uniform", n_prior_pars = NA)
# Find initialization for the matching (this step is optional)
# The following line corresponds to only keeping pairs of records as
# potential matches in the initialization for which neither gname nor fname
# disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
(comparison_list$comparisons[, "fname_DL_3"] != TRUE)
Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)
# Run the Gibbs sampler
results <- gibbs_sampler(comparison_list, prior_list, n_iter = 1000,
Z_init = Z_init, seed = 42)
# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)
# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)
# Example with small duplicate dataset
data(dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
file_sizes = dup_data_small$file_sizes,
duplicates = c(1, 1, 1))
# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
(comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
pairs_to_keep, cc = 1)
# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
dup_count_prior_pars = list(c(1), c(1), c(1)), n_prior_family = "uniform",
n_prior_pars = NA)
# Run the Gibbs sampler
results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000,
seed = 42)
# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)
# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)
Gibbs Sampler for Posterior Inference
Description
Run a Gibbs sampler to explore the posterior distribution of partitions of records.
Usage
gibbs_sampler(
comparison_list,
prior_list,
n_iter = 2000,
Z_init = 1:sum(comparison_list$file_sizes),
seed = 70,
single_likelihood = FALSE,
chaperones_info = NA,
verbose = TRUE
)
Arguments
comparison_list |
The output from a call to
|
prior_list |
The output from a call to |
n_iter |
The number of iterations of the Gibbs sampler to run. |
Z_init |
Initialization of the partition of records, represented as an
|
seed |
The seed to use while running the Gibbs sampler. |
single_likelihood |
A |
chaperones_info |
If |
verbose |
A |
Details
Given the prior specified using specify_prior
, this function
runs a Gibbs sampler to explore the posterior distribution of partitions of
records, conditional on the comparison data created using
create_comparison_data
or reduce_comparison_data
.
Value
a list containing:
m
Posterior samples of the
m
parameters. Each column is one sample.u
Posterior samples of the
u
parameters. Each column is one sample.partitions
Posterior samples of the partition. Each column is one sample. Note that the partition is represented as an
integer
vector of arbitrary labels of lengthsum(comparison_list$file_sizes)
.contingency_tables
Posterior samples of the overlap table. Each column is one sample. This incorporates counts of records determined not to be candidate matches to any other records using
reduce_comparison_data
.cluster_sizes
Posterior samples of the size of each cluster (associated with an arbitrary label from
1
tosum(comparison_list$file_sizes)
). Each column is one sample.sampling_time
The time in seconds it took to run the sampler.
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi: 10.1080/01621459.2021.2013242][arXiv]
Jeffrey Miller, Brenda Betancourt, Abbas Zaidi, Hanna Wallach, & Rebecca C. Steorts (2015). Microclustering: When the cluster sizes grow sublinearly with the size of the data set. NeurIPS Bayesian Nonparametrics: The Next Generation Workshop Series. [arXiv]
Brenda Betancourt, Giacomo Zanella, Jeffrey Miller, Hanna Wallach, Abbas Zaidi, & Rebecca C. Steorts (2016). Flexible Models for Microclustering with Application to Entity Resolution. Advances in neural information processing systems. [Published] [arXiv]
Examples
# Example with small no duplicate dataset
data(no_dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
file_sizes = no_dup_data_small$file_sizes,
duplicates = c(0, 0, 0))
# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
dup_count_prior_family = NA, dup_count_prior_pars = NA,
n_prior_family = "uniform", n_prior_pars = NA)
# Find initialization for the matching (this step is optional)
# The following line corresponds to only keeping pairs of records as
# potential matches in the initialization for which neither gname nor fname
# disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
(comparison_list$comparisons[, "fname_DL_3"] != TRUE)
Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)
# Run the Gibbs sampler
{
results <- gibbs_sampler(comparison_list, prior_list, n_iter = 1000,
Z_init = Z_init, seed = 42)
}
# Example with small duplicate dataset
data(dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
file_sizes = dup_data_small$file_sizes,
duplicates = c(1, 1, 1))
# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
(comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
pairs_to_keep, cc = 1)
# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
dup_count_prior_pars = list(c(1), c(1), c(1)), n_prior_family = "uniform",
n_prior_pars = NA)
# Run the Gibbs sampler
{
results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000,
seed = 42)
}
Initialize the Partition
Description
Generate an initialization for the partition in the case when it is assumed there are no duplicates in all files (so that the partition is a matching).
Usage
initialize_partition(comparison_list, pairs_to_keep, seed = NA)
Arguments
comparison_list |
the output from a call to
|
pairs_to_keep |
A |
seed |
The seed to use to generate the initialization. |
Details
When it is assumed that there are no duplicates in all files, and
reduce_comparison_data
is not used to reduce the number of
potential matches, the Gibbs sampler used for posterior inference may
experience slow mixing when using an initialization for the partition where
each record is in its own cluster (the default option for the Gibbs sampler).
The purpose of this function is to provide an alternative initialization
scheme.
To use this initialization scheme, the user passes in a logical
vector
that indicates which record pairs are potential matches according to an
indexing method (as in reduce_comparison_data
). Note that this
indexing is only used to generate the initialization, it is not used for
inference. The initialization scheme first finds the transitive closure of
the potential matches, which partitions the records into blocks. Within each
block of records, the scheme randomly selects a record from each file, and
these selected records are then placed in the same cluster for the partition
initialization. All other records are placed in their own clusters.
Value
an integer
vector of arbitrary labels of length
sum(comparison_list$file_sizes)
, giving an initialization for the
partition.
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi: 10.1080/01621459.2021.2013242][arXiv]
Examples
# Example with small no duplicate dataset
data(no_dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
file_sizes = no_dup_data_small$file_sizes,
duplicates = c(0, 0, 0))
# Find initialization for the matching
# The following line corresponds to only keeping pairs of records as
# potential matches in the initialization for which neither gname nor fname
# disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
(comparison_list$comparisons[, "fname_DL_3"] != TRUE)
Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)
Multifile Record Linkage and Duplicate Detection
Description
The multilink package implements the methodology of Aleshin-Guendel & Sadinle (2022). It handles the general problem of multifile record linkage and duplicate detection, where any number of files are to be linked, and any of the files may have duplicates.
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi: 10.1080/01621459.2021.2013242] [arXiv]
Examples
# Here we demonstrate an example workflow with the small no duplicate dataset
data(no_dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
file_sizes = no_dup_data_small$file_sizes,
duplicates = c(0, 0, 0))
# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
dup_count_prior_family = NA, dup_count_prior_pars = NA,
n_prior_family = "uniform", n_prior_pars = NA)
# Find initialization for the matching (this step is optional)
# The following line corresponds to only keeping pairs of records as
# potential matches in the initialization for which neither gname nor fname
# disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
(comparison_list$comparisons[, "fname_DL_3"] != TRUE)
Z_init <- initialize_partition(comparison_list, pairs_to_keep, seed = 42)
# Run the Gibbs sampler
results <- gibbs_sampler(comparison_list, prior_list, n_iter = 1000,
Z_init = Z_init, seed = 42)
# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)
# The number of clusters in the full estimate
length(unique(full_estimate))
# The number of entities represented in the records
length(unique(no_dup_data_small$IDs))
# Find which record pairs are truly coreferent based on IDs
true_links <- no_dup_data_small$IDs[comparison_list$record_pairs[, 1]] ==
no_dup_data_small$IDs[comparison_list$record_pairs[, 2]]
# Find which record pairs are in the same clusters in the full estimate
full_estimate_links <- full_estimate[comparison_list$record_pairs[, 1]] ==
full_estimate[comparison_list$record_pairs[, 2]]
# Find the number of true matches in the full estimate
true_matches <- sum(full_estimate_links & true_links)
# Precision of the full estimate
true_matches / sum(full_estimate_links)
# Recall of the full estimate
true_matches / sum(true_links)
# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)
# The partial estimate abstains from making decisions for how many records?
sum(partial_estimate == -1)
# For the records which decisions were made for in the partial estimate,
# there are how many clusters?
length(unique(partial_estimate))
# Abstain rate of partial_estimate
sum(partial_estimate == -1) / length(partial_estimate)
# Relabel records where we abstained
partial_estimate[which(partial_estimate == -1)] <- length(partial_estimate) +
which(partial_estimate == -1)
# Find which record pairs are in the same clusters in the full estimate
partial_estimate_links <-
partial_estimate[comparison_list$record_pairs[, 1]] ==
partial_estimate[comparison_list$record_pairs[, 2]]
# Find the number of true matches in the partial estimate
true_matches_A <- sum(partial_estimate_links & true_links)
# Precision of the partial estimate
true_matches_A / sum(partial_estimate_links)
# Here we demonstrate an example workflow with the small duplicate dataset
data(dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
file_sizes = dup_data_small$file_sizes,
duplicates = c(1, 1, 1))
# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
(comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
pairs_to_keep, cc = 1)
# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
dup_count_prior_pars = list(c(1), c(1), c(1)), n_prior_family = "uniform",
n_prior_pars = NA)
# Run the Gibbs sampler
results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000,
seed = 42)
# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)
# The number of clusters in the full estimate (including records records
# determined not to be candidate matches to any other records using
# reduce_comparison_data)
length(unique(full_estimate)) +
sum(reduced_comparison_list$file_sizes_not_included)
# The number of entities represented in the records
length(unique(dup_data_small$IDs))
# Find which record pairs are truly coreferent based on IDs
true_links <- dup_data_small$IDs[comparison_list$record_pairs[, 1]] ==
dup_data_small$IDs[comparison_list$record_pairs[, 2]]
# Focus on the record pairs that were candidate matches
true_links_reduced <- true_links[reduced_comparison_list$pairs_to_keep]
# Calculate the number of prior false non-matches based on the indexing
# scheme used
prior_fnm <-
nrow(comparison_list$record_pairs[true_links &
(!reduced_comparison_list$pairs_to_keep), ])
# Find which record pairs are in the same clusters in the full estimate
full_estimate_links <-
full_estimate[reduced_comparison_list$record_pairs[, 1]] ==
full_estimate[reduced_comparison_list$record_pairs[, 2]]
# Find the number of true matches in the full estimate
true_matches <- sum(full_estimate_links & true_links_reduced)
# Precision of the full estimate
true_matches / sum(full_estimate_links)
# Recall of the full estimate
true_matches / (sum(true_links_reduced) + prior_fnm)
# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)
# The partial estimate abstains from making decisions for how many records?
sum(partial_estimate == -1)
# For the records which decisions were made for in the partial estimate,
# there are how many clusters? (including records determined not to be
# candidate matches to any other records using reduce_comparison_data)
length(unique(partial_estimate)) +
sum(reduced_comparison_list$file_sizes_not_included)
# Abstain rate of partial_estimat (excluding records determined not
# to be candidate matches to any other records using reduce_comparison_data)
sum(partial_estimate == -1) / length(partial_estimate)
# Relabel records where we abstained
partial_estimate[which(partial_estimate == -1)] <- length(partial_estimate) +
which(partial_estimate == -1)
# Find which record pairs are in the same clusters in the full estimate
partial_estimate_links <-
partial_estimate[reduced_comparison_list$record_pairs[, 1]] ==
partial_estimate[reduced_comparison_list$record_pairs[, 2]]
# Find the number of true matches in the partial estimate
true_matches_A <- sum(partial_estimate_links & true_links_reduced)
# Precision of the partial estimate
true_matches_A / sum(partial_estimate_links)
# Relabel the full and partial Bayes estimates
full_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list,
full_estimate)
partial_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list,
partial_estimate)
# Add columns to the records corresponding to their full and partial
# Bayes estimates
dup_data_small$records <- cbind(dup_data_small$records,
full_estimate_id = full_estimate_relabel$link_id,
partial_estimate_id = partial_estimate_relabel$link_id)
No Duplicate Dataset
Description
A dataset containing 730
simulated records from 3
files with
no duplicate records in each file.
Usage
no_dup_data
Format
A list with three elements:
- records
A
data.frame
with the records, containing7
fields, from all three files, in the format used for input tocreate_comparison_data
.- file_sizes
The size of each file.
- IDs
The true partition of the records, represented as an
integer
vector of arbitrary labels of lengthsum(file_sizes)
.
Source
Extracted from the datasets used in the simulation study of the paper. The datasets were generated using code from Peter Christen's group https://dmm.anu.edu.au/geco/index.php.
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi: 10.1080/01621459.2021.2013242] [arXiv]
Examples
data(no_dup_data)
# There are 500 entities represented in the records
length(unique(no_dup_data$IDs))
Small No Duplicate Dataset
Description
A dataset containing 71
simulated records from 3
files with
no duplicate records in each file, subset from no_dup_data
.
Usage
no_dup_data_small
Format
A list with three elements:
- records
A
data.frame
with the records, containing7
fields, from all three files, in the format used for input tocreate_comparison_data
.- file_sizes
The size of each file.
- IDs
The true partition of the records, represented as an
integer
vector of arbitrary labels of lengthsum(file_sizes)
.
Source
Extracted from the datasets used in the simulation study of the paper. The datasets were generated using code from Peter Christen's group https://dmm.anu.edu.au/geco/index.php.
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi: 10.1080/01621459.2021.2013242] [arXiv]
Examples
data(no_dup_data_small)
# There are 71 entities represented in the records
length(unique(no_dup_data_small$IDs))
Reduce Comparison Data Size
Description
Use indexing to reduce the number of record pairs that are potential matches.
Usage
reduce_comparison_data(comparison_list, pairs_to_keep, cc = 1)
Arguments
comparison_list |
The output of a call to
|
pairs_to_keep |
A |
cc |
A |
Details
When using comparison-based record linkage methods, scalability is a concern,
as the number of record pairs is quadratic in the number of records. In
order to address these concerns, it's common to declare certain record pairs
to not be potential matches a priori, using indexing methods. The user is
free to index using any method they like, as long as they can produce a
logical
vector that indicates which record pairs are potential matches
according to their indexing method. We recommend, if the user chosen indexing
method does not output potential matches that are transitive, to set the
cc
argument to 1
. By transitive we mean, for any three records
i
, j
, and k
, if i
and j
are potential matches,
and j
and k
are potential matches, then i
and k
are
potential matches. Non-transitive indexing schemes can lead to poor mixing of
the Gibbs sampler used for posterior inference, and suggests that the
indexing method used may have been too stringent.
If indexing is used, it may be the case that some records are declared to not
be potential matches to any other records. In this case, the indexing method
has made the decision that these records have no matches, and thus we can
remove them from the data set and relabel the remaining records; see the
documentation for labels
for information on how to go between the
original labeling and the new labeling.
If indexing is used, comparisons for record pairs that aren't potential matches are still used during inference, where they're used to inform the distribution of comparisons for non-matches.
Value
a list containing:
record_pairs
A
data.frame
, where each row contains the pair of records being compared in the corresponding row ofcomparisons
. The rows are sorted in ascending order according to the first column, with ties broken according to the second column in ascending order. For any given row, the first column is less than the second column, i.e.record_pairs[i, 1] < record_pairs[i, 2]
for each rowi
. If according topairs_to_keep
there are records which are not potential matches to any other records, the remaining records are relabeled (seelabels
).comparisons
A
logical
matrix, where each row contains the comparisons between the record pair in the corresponding row ofrecord_pairs
. Comparisons are in the same order as the columns ofrecords
, and are represented byL + 1
columns ofTRUE/FALSE
indicators, whereL + 1
is the number of disagreement levels for the field based onbreaks
.K
The number of files, assumed to be of class
numeric
.file_sizes
A
numeric
vector of lengthK
, indicating the size of each file. If according topairs_to_keep
there are records which are not potential matches to any other records, the remaining records are relabeled (seelabels
), andfile_sizes
now represents the sizes of each file after removing such records.duplicates
A
numeric
vector of lengthK
, indicating which files are assumed to have duplicates.duplicates[k]
should be1
if filek
has duplicates, andduplicates[k]
should be0
if filek
has no duplicates.field_levels
A
numeric
vector indicating the number of disagreement levels for each field.file_labels
An
integer
vector of lengthsum(file_sizes)
, wherefile_labels[i]
indicated which file recordi
is in.fp_matrix
An
integer
matrix, wherefp_matrix[k1, k2]
is a label for the file pair(k1, k2)
. Note thatfp_matrix[k1, k2] = fp_matrix[k2, k1]
.rp_to_fp
A
logical
matrix that indicates which record pairs belong to which file pairs.rp_to_fp[fp, rp]
isTRUE
if the recordsrecord_pairs[rp, ]
belong to the file pairfp
, and is FALSE otherwise. Note thatfp
is given by the labeling infp_matrix
.ab
An
integer
vector, of lengthncol(comparisons) * K * (K + 1) / 2
that indicates how many record pairs there are with a given disagreement level for a given field, for each file pair.file_sizes_not_included
If according to
pairs_to_keep
there are records which are not potential matches to any other records, the remaining records are relabeled (seelabels
), andfile_sizes_not_included
indicates, for each file, the number of such records that were removed.ab_not_included
For record pairs not included according to
pairs_to_keep
, this is aninteger
vector, of lengthncol(comparisons) * K * (K + 1) / 2
that indicates how many record pairs there are with a given disagreement level for a given field, for each file pair.labels
If according to
pairs_to_keep
there are records which are not potential matches to any other records, the remaining records are relabeled.labels
provides a dictionary that indicates, for each of the new labels, which record in the original labeling the new label corresponds to. In particular, the first column indicates the record in the original labeling, and the second column indicates the new labeling.pairs_to_keep
A
logical
vector, the same length ascomparison_list$record_pairs
, indicating which record pairs were kept as potential matches. This may not be the same as the inputpairs_to_keep
ifcc
was set to 1.cc
A
numeric
indicator of whether the connected components of the potential matches are closed under transitivity.
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi: 10.1080/01621459.2021.2013242][arXiv]
Examples
# Example with small duplicate dataset
data(dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
file_sizes = dup_data_small$file_sizes,
duplicates = c(1, 1, 1))
# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
(comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
pairs_to_keep, cc = 1)
Relabel the Bayes Estimate of a Partition
Description
Relabel the Bayes estimate of a partition, for use after using indexing to reduce the number of record pairs that are potential matches.
Usage
relabel_bayes_estimate(reduced_comparison_list, bayes_estimate)
Arguments
reduced_comparison_list |
The output from a call to
|
bayes_estimate |
The output from a call to
|
Details
When the function reduce_comparison_data
is used to reduce the
number of record pairs that are potential matches, it may be the case that
some records are declared to not be potential matches to any other records.
In this case, the indexing method has made the decision that these records
have no matches, and thus we can remove them from the data set and relabel
the remaining records; see the documentation for labels
in
reduce_comparison_data
for information on how to go between the
original labeling and the new labeling. The purpose of this function is to
relabel the output of find_bayes_estimate
when the function
reduce_comparison_data
is used, so that the user doesn't have
to do this relabeling themselves.
Value
A data.frame
, with as many rows as
sum(reduced_comparison_list$file_sizes +
reduced_comparison_list$file_sizes_not_included)
, i.e. the number of
records originally input to create_comparison_data
, before
indexing occurred. This data.frame
has two columns,
"original_labels"
and "link_id"
. Given row i
of
records
originally input to create_comparison_data
,
the linkage id according to bayes_estimate
is given by the i
th
row of the link_id
column. See the documentation for
find_bayes_estimate
for information on how to interpret this
linkage id.
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi: 10.1080/01621459.2021.2013242][arXiv]
Examples
# Example with small duplicate dataset
data(dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
file_sizes = dup_data_small$file_sizes,
duplicates = c(1, 1, 1))
# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
(comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
pairs_to_keep, cc = 1)
# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
dup_count_prior_pars = list(c(1), c(1), c(1)), n_prior_family = "uniform",
n_prior_pars = NA)
# Run the Gibbs sampler
{
results <- gibbs_sampler(reduced_comparison_list, prior_list, n_iter = 1000,
seed = 42)
# Find the full Bayes estimate
full_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = Inf, max_cc_size = 50)
# Find the partial Bayes estimate
partial_estimate <- find_bayes_estimate(results$partitions, burn_in = 100,
L_FNM = 1, L_FM1 = 1, L_FM2 = 2, L_A = 0.1, max_cc_size = 12)
# Relabel the full and partial Bayes estimates
full_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list,
full_estimate)
partial_estimate_relabel <- relabel_bayes_estimate(reduced_comparison_list,
partial_estimate)
# Add columns to the records corresponding to their full and partial
# Bayes estimates
dup_data_small$records <- cbind(dup_data_small$records,
full_estimate_id = full_estimate_relabel$link_id,
partial_estimate_id = partial_estimate_relabel$link_id)
}
Specify the Prior Distributions
Description
Specify the prior distributions for the m
and u
parameters of the
models for comparison data among matches and non-matches, and the partition.
Usage
specify_prior(
comparison_list,
mus = NA,
nus = NA,
flat = 0,
alphas = NA,
dup_upper_bound = NA,
dup_count_prior_family = NA,
dup_count_prior_pars = NA,
n_prior_family = NA,
n_prior_pars = NA
)
Arguments
comparison_list |
the output from a call to
|
mus , nus |
The hyperparameters of the Dirichlet priors for the |
flat |
A |
alphas |
The hyperparameters for the Dirichlet-multinomial overlap table
prior, a positive |
dup_upper_bound |
A |
dup_count_prior_family |
A |
dup_count_prior_pars |
A |
n_prior_family |
A |
n_prior_pars |
Currently set to |
Details
The purpose of this function is to specify prior distributions for all
parameters of the model. Please note that if
reduce_comparison_data
is used to the reduce the number of
record pairs that are potential matches, then the output of
reduce_comparison_data
(not
create_comparison_data
) should be used as input.
For the hyperparameters of the Dirichlet priors for the m
and u
parameters for the comparisons among matches and non-matches,
respectively, we recommend using a flat prior. This is accomplished by
setting mus=NA
and nus=NA
. Informative prior specifications
are possible, but in practice they will be overwhelmed by the large number of
comparisons.
For the prior for partitions, we do not recommend using a flat prior. Instead
we recommend using our structure prior for partitions. By setting
flat=0
and the remaining arguments to NA
, one obtains the
default specification for the structured prior that we have found to perform
well in simulation studies. The structured prior for partitions is specified
as follows:
Specify a prior for
n
, the number of clusters represented in the records. Note that this includes records determined not to be potential matches to any other records usingreduce_comparison_data
. Currently, a uniform prior and a scale prior forn
are supported. Our default specification uses a uniform prior.Specify a prior for the overlap table (see the documentation for
alphas
for more information). Currently a Dirichlet-multinomial prior is supported. Our default specification sets all hyperparameters of the Dirichlet-multinomial prior to1
.For each file, specify a prior for the number of duplicates in each cluster. As a part of this prior, we specify the maximum number of records in a cluster for each file, through
dup_upper_bound
. When there are assumed to be no duplicates in a file, the maximum number of records in a cluster for that file is set to1
. When there are assumed to be duplicates in a file, we recommend setting the maximum number of records in a cluster for that file to be less than the file size, if prior knowledge allows. Currently, a Poisson prior for the the number of duplicates in each cluster is supported. Our default specification uses a Poisson prior with mean1
.
Please contact the package maintainer if you need new prior families
for n
or the number of duplicates in each cluster to be supported.
Value
a list containing:
mus
The hyperparameters of the Dirichlet priors for the
m
parameters for the comparisons among matches.nus
The hyperparameters of the Dirichlet priors for the
u
parameters for the comparisons among non-matches. Includes data from comparisons of record pairs that were declared to not be potential matches usingreduce_comparison_data
.flat
A
numeric
indicator of whether a flat prior for partitions should be used.flat
is1
if a flat prior is used, andflat
is0
if a structured prior is used.no_dups
A
numeric
indicator of whether no duplicates are allowed in all of the files.alphas
The hyperparameters for the Dirichlet-multinomial overlap table prior, a positive
numeric
vector of length2 ^ comparison_list$K
, where the first element is0
.alpha_0
The sum of
alphas
.dup_upper_bound
A
numeric
vector indicating the maximum number of duplicates, from each file, allowed in each cluster. For a given filek
,dup_upper_bound[k]
should be between1
andcomparison_list$file_sizes[k]
, i.e. even if you don't want to impose an upper bound, you have to implicitly place an upper bound: the number of records in a file.log_dup_count_prior
A
list
containing the log density of the prior distribution for the number of duplicates in each cluster, for each file.log_n_prior
A
numeric
vector containing the log density of the prior distribution for the number of clusters represented in the records.nus_specified
The
nus
before data from comparisons of record pairs that were declared to not be potential matches usingreduce_comparison_data
are added. Used for input checking.
References
Serge Aleshin-Guendel & Mauricio Sadinle (2022). Multifile Partitioning for Record Linkage and Duplicate Detection. Journal of the American Statistical Association. [doi: 10.1080/01621459.2021.2013242] [arXiv]
Examples
# Example with small no duplicate dataset
data(no_dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(no_dup_data_small$records,
types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
file_sizes = no_dup_data_small$file_sizes,
duplicates = c(0, 0, 0))
# Specify the prior
prior_list <- specify_prior(comparison_list, mus = NA, nus = NA, flat = 0,
alphas = rep(1, 7), dup_upper_bound = c(1, 1, 1),
dup_count_prior_family = NA, dup_count_prior_pars = NA,
n_prior_family = "uniform", n_prior_pars = NA)
# Example with small duplicate dataset
data(dup_data_small)
# Create the comparison data
comparison_list <- create_comparison_data(dup_data_small$records,
types = c("bi", "lv", "lv", "lv", "lv", "bi", "bi"),
breaks = list(NA, c(0, 0.25, 0.5), c(0, 0.25, 0.5),
c(0, 0.25, 0.5), c(0, 0.25, 0.5), NA, NA),
file_sizes = dup_data_small$file_sizes,
duplicates = c(1, 1, 1))
# Reduce the comparison data
# The following line corresponds to only keeping pairs of records for which
# neither gname nor fname disagree at the highest level
pairs_to_keep <- (comparison_list$comparisons[, "gname_DL_3"] != TRUE) &
(comparison_list$comparisons[, "fname_DL_3"] != TRUE)
reduced_comparison_list <- reduce_comparison_data(comparison_list,
pairs_to_keep, cc = 1)
# Specify the prior
prior_list <- specify_prior(reduced_comparison_list, mus = NA, nus = NA,
flat = 0, alphas = rep(1, 7), dup_upper_bound = c(10, 10, 10),
dup_count_prior_family = c("Poisson", "Poisson", "Poisson"),
dup_count_prior_pars = list(c(1), c(1), c(1)), n_prior_family = "uniform",
n_prior_pars = NA)