Type: Package
Title: Read Cancer Records in the NAACCR Format
Version: 3.1.1
Maintainer: Nathan Werth <nwerth@pa.gov>
Description: Functions for reading cancer record files which follow a format defined by the North American Association of Central Cancer Registries (NAACCR).
URL: https://github.com/WerthPADOH/naaccr
BugReports: https://github.com/WerthPADOH/naaccr/issues
Depends: R (≥ 2.10)
Imports: data.table, stringi, utils, XML
Suggests: devtools, httr, jsonlite, magrittr, testthat, ISOcodes, xml2, rmarkdown, roxygen2, rvest
License: MIT + file LICENSE
Copyright: file COPYRIGHTS
Encoding: UTF-8
LazyData: true
NeedsCompilation: no
Packaged: 2024-09-20 13:34:31 UTC; nwerth
Author: Nathan Werth [aut, cre], Pennsylvania Department of Health [cph], North American Association of Cancer Registries [cph], World Health Organization [cph], United States Centers for Disease Control and Prevention [cph], United States Bureau of the Census [cph], United States National Program of Cancer Registries [cph]
Repository: CRAN
Date/Publication: 2024-09-20 14:20:05 UTC
Coerce to a naaccr_record dataset
Convert objects into naaccr_record objects, if a method exists.
Description
Coerce to a naaccr_record dataset
Convert objects into naaccr_record objects, if a method exists.
Usage
as.naaccr_record(x, keep_unknown = FALSE, version = NULL, format = NULL, ...)
## S3 method for class 'list'
as.naaccr_record(x, keep_unknown = FALSE, version = NULL, format = NULL, ...)
## S3 method for class 'data.frame'
as.naaccr_record(x, keep_unknown = FALSE, version = NULL, format = NULL, ...)
Arguments
x |
An R object. |
keep_unknown |
Logical indicating whether values of "unknown" should be a level in the factor or NA. |
version |
An integer specifying the NAACCR format version for parsing the records. Use this or format. |
format |
A record_format object. |
... |
Additional arguments passed to or from methods. |
Value
An object of class naaccr_record
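A minimal sketch of the data.frame method, assuming the naaccr package is installed (the block is guarded so it is skipped otherwise). The column names are NAACCR XML field names used elsewhere in this manual; the sample values are made up for illustration.

```r
if (requireNamespace("naaccr", quietly = TRUE)) {
  library(naaccr)
  # Raw codes as they would appear in a NAACCR file (illustrative values)
  raw <- data.frame(
    ageAtDiagnosis = c("089", "000"),
    dateOfDiagnosis = c("20070402", "20170915"),
    stringsAsFactors = FALSE
  )
  # Coerce the plain data.frame into an analysis-ready naaccr_record
  recs <- as.naaccr_record(raw)
  print(class(recs))
}
```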
Clean city names
Description
Clean city names
Usage
clean_address_city(city, keep_unknown = FALSE)
Arguments
city |
A character vector of city names. |
keep_unknown |
If FALSE (the default), values meaning "unknown" are replaced with NA. |
Value
A character vector with leading and trailing whitespace removed.
If keep_unknown is FALSE, blanks and "UNKNOWN" are replaced with NA.
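The rule described above can be sketched in base R. The helper sketch_clean_city is hypothetical and is not the package's implementation; it only mirrors the documented behavior (trim whitespace, then turn blanks and "UNKNOWN" into NA unless keep_unknown is TRUE).

```r
# Hypothetical helper illustrating the documented rule; NOT the package's code.
sketch_clean_city <- function(city, keep_unknown = FALSE) {
  city <- trimws(city)  # remove leading and trailing whitespace
  if (!keep_unknown) {
    # blanks and the literal "UNKNOWN" become missing values
    city[city %in% c("", "UNKNOWN")] <- NA
  }
  city
}

cities <- c("  Harrisburg ", "UNKNOWN", "")
cleaned <- sketch_clean_city(cities)
kept <- sketch_clean_city(cities, keep_unknown = TRUE)
print(cleaned)
print(kept)
```

The same pattern applies to the other character cleaners in this manual; only the set of values treated as "unknown" differs.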
Clean house number and street values
Description
Clean house number and street values
Usage
clean_address_number_and_street(location, keep_unknown = FALSE)
Arguments
location |
A character vector of house numbers and street names. |
keep_unknown |
If FALSE (the default), values meaning "unknown" are replaced with NA. |
Value
A character vector with leading and trailing whitespace removed.
If keep_unknown is FALSE, blanks and "UNKNOWN" are replaced with NA.
Clean patient ages
Description
Clean patient ages
Usage
clean_age(age, keep_unknown = FALSE)
Arguments
age |
|
keep_unknown |
If FALSE (the default), values meaning "unknown" are replaced with NA. |
Value
An integer vector of ages.
If keep_unknown is FALSE, values representing unknown ages are replaced with NA.
Clean Census block group codes
Description
Clean Census block group codes
Usage
clean_census_block(block, keep_unknown = FALSE)
Arguments
block |
A character vector of Census block group codes. |
keep_unknown |
If FALSE (the default), values meaning "unknown" are replaced with NA. |
Value
A character vector with leading and trailing whitespace removed.
If keep_unknown is FALSE, blanks and values representing unknown block groups are replaced with NA.
Clean Census tract group codes
Description
Clean Census tract group codes
Usage
clean_census_tract(tract, keep_unknown = FALSE)
Arguments
tract |
A character vector of Census tract group codes. |
keep_unknown |
If FALSE (the default), values meaning "unknown" are replaced with NA. |
Value
A character vector with leading and trailing whitespace removed.
If keep_unknown is FALSE, blanks and values representing unknown Census Tracts are replaced with NA.
Clean counts
Description
Replaces any values of all 9's with NA (if keep_unknown is FALSE) and converts the rest to integers.
Usage
clean_count(count, width, keep_unknown = FALSE)
Arguments
count |
A character vector of counts (integer characters only). |
width |
Integer giving the character width of the field. |
keep_unknown |
If FALSE (the default), values meaning "unknown" are replaced with NA. |
Value
Integer vector of count.
If keep_unknown is FALSE, values representing unknown counts are replaced with NA.
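The "all 9's" rule depends on the field width, which a short base R sketch can make concrete. The helper sketch_clean_count is hypothetical and is not the package's implementation; it assumes the unknown code is the digit 9 repeated width times, as the description states.

```r
# Hypothetical helper illustrating the documented rule; NOT the package's code.
sketch_clean_count <- function(count, width, keep_unknown = FALSE) {
  count <- trimws(count)
  all_nines <- strrep("9", width)  # e.g. "99" when width = 2
  is_unknown <- !is.na(count) & count == all_nines
  out <- suppressWarnings(as.integer(count))  # non-numeric text becomes NA
  if (!keep_unknown) {
    out[is_unknown] <- NA_integer_
  }
  out
}

counts <- c("01", "12", "99")
cleaned <- sketch_clean_count(counts, width = 2)
kept <- sketch_clean_count(counts, width = 2, keep_unknown = TRUE)
print(cleaned)
print(kept)
```

Note that "99" is only treated as unknown because width is 2; in a three-character field it would be an ordinary count.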
Clean county FIPS codes
Description
Clean county FIPS codes
Usage
clean_county_fips(county, keep_unknown = FALSE)
Arguments
county |
A character vector of county FIPS codes. |
keep_unknown |
If FALSE (the default), values meaning "unknown" are replaced with NA. |
Value
A character vector with leading and trailing whitespace removed.
If keep_unknown is FALSE, blanks and values representing unknown counties are replaced with NA.
Clean facility identification numbers
Description
Clean facility identification numbers
Usage
clean_facility_id(fin, keep_unknown = FALSE)
Arguments
fin |
A character vector of facility identification numbers (FIN). |
keep_unknown |
If FALSE (the default), values meaning "unknown" are replaced with NA. |
Value
A character vector with leading and trailing whitespace removed.
If keep_unknown is FALSE, blanks and values representing unknown facilities are replaced with NA.
Clean ICD-9-CM codes
Description
Clean ICD-9-CM codes
Usage
clean_icd_9_cm(code, keep_unknown = FALSE)
Arguments
code |
A character vector of ICD-9-CM codes. |
keep_unknown |
If FALSE (the default), values meaning "unknown" are replaced with NA. |
Value
A character vector with leading and trailing whitespace removed.
If keep_unknown is FALSE, blanks and the ICD-9-CM code for "unknown" ("00000") are replaced with NA.
Clean cause of death codes
Description
Clean cause of death codes
Usage
clean_icd_code(code, keep_unknown = FALSE)
Arguments
code |
A character vector of ICD-7, ICD-8, ICD-9, and/or ICD-10 codes. |
keep_unknown |
If FALSE (the default), values meaning "unknown" are replaced with NA. |
Value
A character vector with leading and trailing whitespace removed.
If keep_unknown is FALSE, blanks and the ICD codes for "unknown" ("0000", "7777" and "7797") are replaced with NA.
Clean physician identification numbers
Description
Clean physician identification numbers
Usage
clean_physician_id(physician, keep_unknown = FALSE)
Arguments
physician |
A character vector of medical license number or facility-generated codes. |
keep_unknown |
If FALSE (the default), values meaning "unknown" are replaced with NA. |
Value
A character vector with leading and trailing whitespace removed.
If keep_unknown is FALSE, blanks and values representing unknown physicians or non-applicable are replaced with NA.
Clean postal codes
Description
Clean postal codes
Usage
clean_postal(postal, keep_unknown = FALSE)
Arguments
postal |
A character vector of postal codes. |
keep_unknown |
If FALSE (the default), values meaning "unknown" are replaced with NA. |
Value
A character vector with leading and trailing whitespace removed.
If keep_unknown is FALSE, blanks and values representing uncertain postal codes are replaced with NA.
Clean Social Security ID numbers
Description
Clean Social Security ID numbers
Usage
clean_ssn(number, keep_unknown = FALSE)
Arguments
number |
A character vector of Social Security identification numbers. No spaces or punctuation, only numbers. |
keep_unknown |
If FALSE (the default), values meaning "unknown" are replaced with NA. |
Value
A character vector with leading and trailing whitespace removed.
If keep_unknown is FALSE, blanks and values representing unknown Social Security ID numbers are replaced with NA.
Clean telephone numbers
Description
Clean telephone numbers
Usage
clean_telephone(number, keep_unknown = FALSE)
Arguments
number |
A character vector of telephone numbers. No spaces or punctuation, only numbers. |
keep_unknown |
If FALSE (the default), values meaning "unknown" are replaced with NA. |
Value
A character vector with leading and trailing whitespace removed.
If keep_unknown is FALSE, blanks and values representing unknown numbers or patients without a number are replaced with NA.
Clean free-form text
Description
Clean free-form text
Usage
clean_text(text, keep_unknown = FALSE)
Arguments
text |
A character vector of free text values. |
keep_unknown |
If FALSE (the default), values meaning "unknown" are replaced with NA. |
Value
A character vector with leading and trailing whitespace removed.
If keep_unknown is FALSE, blank values are replaced with NA.
List of possible values for a field
Description
These lists give the levels for each categorical or flag field from the NAACCR formats. It is intended to help researchers
Usage
field_levels
field_levels_all
Format
A named list, where the names are for categorical fields or sentinel flags, and the values are the possible levels for each field.
An object of class list of length 340.
Details
field_levels does not include levels representing "unknown."
field_levels_all does include the "unknown" levels.
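A short sketch of how the two lists might be explored, assuming the naaccr package is installed (the block is guarded so it is skipped otherwise). It also assumes both lists share the same names; rather than guess a field name, it looks up whichever field is listed first.

```r
if (requireNamespace("naaccr", quietly = TRUE)) {
  library(naaccr)
  # Names are the fields (or sentinel flags) that have defined levels
  print(head(names(field_levels)))
  # Compare one field's levels without and with the "unknown" levels
  first_field <- names(field_levels)[1]
  print(field_levels[[first_field]])
  print(field_levels_all[[first_field]])
}
```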
Interpret NAACCR-style booleans
Description
Interpret NAACCR-style booleans
Usage
naaccr_boolean(flag, false_value = c("0", "1"))
Arguments
flag |
Character vector of flags. |
false_value |
The flag value to interpret as FALSE. |
Value
A logical vector with the interpreted values of flag.
Any original values not seen as TRUE or FALSE are converted to NA.
Examples
x <- c("0", "1", "2", "9", NA)
naaccr_boolean(x)
naaccr_boolean(x, false_value = "1")
Parse NAACCR-formatted dates
Description
Parse NAACCR-formatted dates
Usage
naaccr_date(date)
Arguments
date |
Character vector of dates in NAACCR format (YYYYMMDD). |
Value
A Date vector. Any incomplete or invalid dates are converted to NA. The original strings can be retrieved with the naaccr_encode function.
Examples
input <- c("20151031", "201408 ", "99999999")
d <- naaccr_date(input)
d
naaccr_encode(d, "dateOfDiagnosis")
Parse NAACCR-formatted datetimes
Description
Parse NAACCR-formatted datetimes
Usage
naaccr_datetime(datetime, tz = "")
Arguments
datetime |
Character vector of datetimes in HL7 OBR-7 format. |
tz |
Time zone specification to be used for the conversion, if one is required. System-specific (see time zones), but "" is the current time zone, and "GMT" is UTC. |
Value
A POSIXct vector. Any incomplete or invalid datetimes are converted to NA. The original strings can be retrieved with the naaccr_encode function.
Examples
input <- c("20151031100856", "20140822 ", "99999999")
d <- naaccr_datetime(input)
d
naaccr_encode(d, "pathDateSpecCollect1")
Format a value as a string according to the NAACCR format
Description
Format a value as a string according to the NAACCR format
Usage
naaccr_encode(x, field, flag = NULL, version = NULL, format = NULL)
Arguments
x |
Vector of values. |
field |
Character string naming the field. |
flag |
Character vector of flags for the field. Only needed if the field contains sentinel values. |
version |
An integer specifying the NAACCR format version for parsing the records. Use this or format. |
format |
A record_format object. |
Value
Character vector of the values as they would be encoded in a NAACCR-formatted text file.
Examples
r <- naaccr_record(
ageAtDiagnosis = c("089", "000", "200"),
dateOfDiagnosis = c("20070402", "201709 ", " ")
)
r
mapply(FUN = naaccr_encode, x = r, field = names(r))
Replace NAACCR codes with understandable factors
Description
Replace NAACCR codes with understandable factors
Usage
naaccr_factor(x, field, keep_unknown = FALSE, ...)
Arguments
x |
Vector (usually character) of codes. |
field |
String giving the XML name of the NAACCR field to code. |
keep_unknown |
Logical indicating whether values of "unknown" should be a level in the factor or NA. |
... |
Additional arguments passed onto |
Value
A factor vector version of x. The levels are short descriptions instead of the basic NAACCR codes. Codes which stood for "unknown" with no further information are replaced with NA.
If field names a text or site-specific field, x will be returned unchanged with a warning.
Examples
naaccr_factor(c("20", "43", "99"), "radRegionalRxModality")
naaccr_factor(c("USA", "GER", "XEN"), "addrAtDxCountry")
# Default: NA for unknowns,
naaccr_factor(c("1", "8", "9"), "tumorGrowthPattern")
naaccr_factor(c("1", "8", "9"), "tumorGrowthPattern", keep_unknown = TRUE)
Field definitions from all NAACCR format versions
Description
See record_format.
Usage
naaccr_formats
naaccr_format_12
naaccr_format_13
naaccr_format_14
naaccr_format_15
naaccr_format_16
naaccr_format_18
naaccr_format_21
naaccr_format_22
naaccr_format_23
naaccr_format_24
naaccr_format_25
Format
An object of class list of length 22.
An object of class record_format (inherits from data.table, data.frame) with 509 rows and 12 columns.
An object of class record_format (inherits from data.table, data.frame) with 529 rows and 12 columns.
An object of class record_format (inherits from data.table, data.frame) with 548 rows and 12 columns.
An object of class record_format (inherits from data.table, data.frame) with 555 rows and 12 columns.
An object of class record_format (inherits from data.table, data.frame) with 587 rows and 12 columns.
An object of class record_format (inherits from data.table, data.frame) with 791 rows and 12 columns.
An object of class record_format (inherits from data.table, data.frame) with 800 rows and 12 columns.
An object of class record_format (inherits from data.table, data.frame) with 810 rows and 12 columns.
An object of class record_format (inherits from data.table, data.frame) with 782 rows and 12 columns.
An object of class record_format (inherits from data.table, data.frame) with 783 rows and 12 columns.
An object of class record_format (inherits from data.table, data.frame) with 780 rows and 12 columns.
Details
Each naaccr_format_XX object is a data.table defining the fields for each version of NAACCR's record file format.
naaccr_formats is a list of these record formats, with each name being the two- or three-digit code for the format.
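A brief sketch of looking up a format by its version code, assuming the naaccr package is installed (the block is guarded so it is skipped otherwise). The name "18" is an assumption based on the two-digit codes described above; if it is not a name in the list, the lookup simply returns NULL.

```r
if (requireNamespace("naaccr", quietly = TRUE)) {
  library(naaccr)
  # Names are the two- or three-digit version codes
  print(names(naaccr_formats))
  # Look up one version by its code (assumed here to be "18")
  fmt18 <- naaccr_formats[["18"]]
  print(class(fmt18))
  print(dim(fmt18))
  # The standalone object for the same version
  print(head(naaccr_format_18[["name"]]))
}
```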
Interpret basic over-ride flags
Description
Interpret basic over-ride flags
Usage
naaccr_override(flag)
Arguments
flag |
Character vector of over-ride flags. Its values should only
include |
Value
A logical vector with the interpreted values of flag.
The interpretation follows these rules: "1" goes to TRUE (reviewed and confirmed as reported), "" (blank) goes to FALSE (not reviewed or reviewed and corrected), and all other values go to NA.
Examples
naaccr_override(c("", "1", NA, "9"))
Analysis-ready NAACCR records
Description
Subclass of data.frame for doing analysis with NAACCR records.
Usage
naaccr_record(..., keep_unknown = FALSE, version = NULL, format = NULL)
Arguments
... |
Arguments of the form |
keep_unknown |
Logical indicating whether values of "unknown" should be a level in the factor or NA. |
version |
An integer specifying the NAACCR format version for parsing the records. Use this or format. |
format |
A record_format object. |
Details
naaccr_record creates a data.frame of cancer incidence records ready for analysis:
columns are of appropriate classes, coded values are replaced with factors, and unknowns are replaced with NA.
Value
A naaccr_record with columns named using the NAACCR XML scheme.
It inherits from data.frame.
Parse the 14 values from the geocodingQualityCodeDetail field
Description
Parse the 14 values from the geocodingQualityCodeDetail field
Usage
parse_geocoding_quality_codes(value)
Arguments
value |
Character vector of values from the geocodingQualityCodeDetail field. |
Value
A data.frame with the following columns:
geocodingQualityInputType - (factor) Type of address given to geocoder. Has the following levels: "full address", "street only", "number, no street", "city only", "zip and city", "zip only", "error".
geocodingQualityStreetType - (factor) Type of street for address. Has the following levels: "street", "PO box", "rural route", "highway contract route", "star route", "error".
geocodingQualityStreet - (factor) Quality of match for the street name. Has the following levels: "100% match", "soundex match", "street name different", "missing street name", "error".
geocodingQualityZip - (factor) Quality of match for the ZIP code. Has the following levels: "100% match", "5th digit different", "4th digit different", "3rd digit different", "2nd digit different", "1st digit different", "more than one digit different", "invalid ZIP", "error".
geocodingQualityCity - (factor) Quality of match for the city name. Has the following levels: "100% match", "alias match", "soundex match", "no match", "error".
geocodingQualityCityRefs - (factor) Number of city reference data sets that don't match the geocoding result. Has the following levels: "all match", "1 reference unmatched", "2 to 4 references unmatched", "5 or more references unmatched", "no references matched", "error".
geocodingQualityDirectionals - (factor) Whether the street directionals are present in the input and feature data sets. Has the following levels: "all match", "missing feature pre and post directionals", "missing input pre and post directionals", "both pre and post directionals do not match", "feature missing post directional", "input missing post directional", "post directionals do not match", "missing feature pre directional", "missing feature pre directional and input post directional", "missing feature pre directional and post directionals do not match", "missing input pre directional", "missing input pre directional and missing feature post directional", "missing input pre directional and post directionals do not match", "pre directionals do not match", "pre directionals do not match and missing feature post directional", "pre directionals do not match and missing input post directional".
geocodingQualityQualifiers - (factor) Whether the address qualifiers are present in the input and feature data sets. Has the following levels: "all match", "missing feature pre and post qualifiers", "missing input pre and post qualifiers", "both pre and post qualifiers do not match", "feature missing post qualifier", "input missing post qualifier", "post qualifiers do not match", "missing feature pre qualifier", "missing feature pre qualifier and input post qualifier", "missing feature pre qualifier and post qualifiers do not match", "missing input pre qualifier", "missing input pre qualifier and missing feature post qualifier", "missing input pre qualifier and post qualifiers do not match", "pre qualifiers do not match", "pre qualifiers do not match and missing feature post qualifier", "pre qualifiers do not match and missing input post qualifier".
geocodingQualityDistance - (factor) Average distance between the possible matched parcels and their respective possible matched streets. Has the following levels: "< 10m", "10m-100m", "100m-500m", "500m-1km", "1km-5km", "> 5km", "error".
geocodingQualityOutliers - (factor) Distribution of distances between the possible matched parcels and their respective possible matched streets. Has the following levels: "100% within 10m", "60% within 10m and 40% within 100m", "60% within 10m and 40% within 500m", "60% within 10m and 40% within 1km", "60% within 10m and 40% within 5km", "60% within 10m and at least 1 over 5km exists", "30% within 10m and 70% within 100m", "30% within 10m and 70% within 500m", "30% within 10m and 70% within 1km", "30% within 10m and 70% within 5km", "30% within 10m and at least 1 over 5km exists", "error".
geocodingQualityCensusBlockGroups - (factor) Consistency of geocoded result against Census Block Group references. Has the following levels: "all match", "at least one reference different", "no Census data", "error".
geocodingQualityCensusTracts - (factor) Consistency of geocoded result against Census Tract references. Has the following levels: "all match", "at least one reference different", "no Census data", "error".
geocodingQualityCensusCounties - (factor) Consistency of geocoded result against Census County references. Has the following levels: "all match", "at least one reference different", "no Census data", "error".
geocodingQualityRefMatchCount - (integer) Number of reference data sets matched by geocoding result.
Read NAACCR records from a file
Description
Read and parse cancer incidence records according to a NAACCR format from either fixed-width files (read_naaccr and read_naaccr_plain) or XML documents (read_naaccr_xml and read_naaccr_xml_plain).
Usage
read_naaccr_plain(
input,
version = NULL,
format = NULL,
keep_fields = NULL,
skip = 0,
nrows = Inf,
buffersize = 10000,
encoding = getOption("encoding")
)
read_naaccr(
input,
version = NULL,
format = NULL,
keep_fields = NULL,
keep_unknown = FALSE,
skip = 0,
nrows = Inf,
buffersize = 10000,
encoding = getOption("encoding"),
...
)
read_naaccr_xml_plain(
input,
version = NULL,
format = NULL,
keep_fields = NULL,
as_text = FALSE,
encoding = getOption("encoding")
)
read_naaccr_xml(
input,
version = NULL,
format = NULL,
keep_fields = NULL,
keep_unknown = FALSE,
as_text = FALSE,
encoding = getOption("encoding"),
...
)
Arguments
input |
Either a string with a file name (containing no |
version |
An integer specifying the NAACCR format version for parsing the records. Use this or format. |
format |
A record_format object. |
keep_fields |
Character vector of XML field names to keep in the
dataset. If |
skip |
An integer specifying the number of lines of the data file to skip before beginning to read data. |
nrows |
A number specifying the maximum number of records to read.
|
buffersize |
Maximum number of lines to read at one time. |
encoding |
String giving the input's encoding. See the 'Encoding'
section of |
keep_unknown |
Logical indicating whether values of "unknown" should be a level in the factor or NA. |
... |
Additional arguments passed onto |
as_text |
Logical indicating (if |
Details
read_naaccr and read_naaccr_xml return data sets suited for analysis in R.
read_naaccr_plain and read_naaccr_xml_plain return data sets with the unchanged record values.
Anyone who wants to analyze the records in R should use read_naaccr or read_naaccr_xml.
In the returned naaccr_record, columns are of appropriate classes, coded values are replaced with factors, and unknowns are replaced with NA.
read_naaccr_plain and read_naaccr_xml_plain are "format strict" ways to read incidence records.
All values returned are the literal character values from the records.
The only processing done is that leading and trailing whitespace is trimmed.
This is useful if the values will be passed to other software that expects the plain NAACCR values.
For read_naaccr_plain and read_naaccr, if the version and format arguments are left NULL, the default format is version 18. This was the last format to be used for fixed-width files.
Value
For read_naaccr, a data.frame of the records.
The columns included depend on the NAACCR record_format version.
Columns are atomic vectors; there are too many to describe them all.
For read_naaccr_plain, a data.frame based on the record_format specified by either the version or format argument.
The names of the columns will be those in the format's name column.
All columns are character vectors.
Note
Some of the parameter text was shamelessly copied from the read.table and read.fwf help pages.
References
North American Association of Central Cancer Registries (October 2018). Standards for Cancer Registries Volume II: Data Standards and Data Dictionary. Twenty first edition. https://apps.naaccr.org/data-dictionary/.
North American Association of Central Cancer Registries (April 2019). NAACCR XML Data Exchange Standard. Version 1.4. https://www.naaccr.org/xml-data-exchange-standard/.
Examples
# This file has synthetic abstract records
incfile <- system.file(
"extdata", "synthetic-naaccr-18-abstract.txt",
package = "naaccr"
)
fields <- c("ageAtDiagnosis", "sex", "sequenceNumberCentral")
# Plain reader: literal character values only
read_naaccr_plain(incfile, version = 18, keep_fields = fields)
# Analysis-ready reader
recs <- read_naaccr(incfile, version = 18, keep_fields = fields)
recs
# Note sequenceNumberCentral has been split in two: a number and a flag
summary(recs[["sequenceNumberCentral"]])
summary(recs[["sequenceNumberCentralFlag"]])
Define custom fields for NAACCR records
Description
Create a record_format object, which is used to read NAACCR records.
Usage
record_format(
name,
item,
start_col = NA_integer_,
end_col = NA_integer_,
type = "character",
alignment = "left",
padding = " ",
parent = "Tumor",
cleaner = list(NULL),
unknown_finder = list(NULL),
name_literal = NA_character_,
width = NA_integer_
)
as.record_format(x, ...)
Arguments
name |
Item name appropriate for a |
item |
NAACCR item number. |
start_col |
First column of the field in a fixed-width record. |
end_col |
*Deprecated: Use the width parameter instead.* |
type |
Name of the column class. |
alignment |
Alignment of the field in fixed-width files. Either
|
padding |
Single-character strings to use for padding in fixed-width files. |
parent |
Name of the parent node to include this field under when writing to an XML file. Values can be "NaaccrData", "Patient", or "Tumor". |
cleaner |
(Optional) List of functions to handle special cases of
cleaning field data (e.g., convert all values to uppercase).
Values of |
unknown_finder |
(Optional) List of functions to detect when codes mean
the actual values are unknown or not applicable.
Values of |
name_literal |
(Optional) Item name in plain language. |
width |
(Optional) Item width in characters. |
x |
Object to be coerced to a record_format. |
... |
Other arguments passed to |
Details
To define registry-specific fields in addition to the standard fields, create a record_format object for the registry-specific fields and combine it with one of the formats provided with the package using rbind.
Value
An object of class "record_format" which has the following columns:
name - (character) XML field name.
item - (integer) Field item number.
start_col - (integer) First column of the field in a fixed-width text file. If NA, the field will not be read from or written to fixed-width files. It will be included in XML files.
end_col - (integer) (*Deprecated: Use width instead.*) Last column of the field in a fixed-width text file. If NA, the field will not be read from or written to fixed-width files. This is the norm for fields only found in XML formats.
type - (factor) R class for the column vector.
alignment - (factor) Alignment of the field's values in a fixed-width text file.
padding - (character) String used for padding field values in a fixed-width text file.
parent - (factor) Parent XML node for the field. One of "NaaccrData", "Patient", or "Tumor".
cleaner - (list of function objects) Function to prepare the field's values for analysis. Values of NULL will use the standard cleaner functions for the type (see below).
unknown_finder - (list of function objects) Function to detect codes meaning the actual values are missing or unknown for the field.
name_literal - (character) Field name in plain language.
width - (integer) Character width of the field values. Mostly meant for reading and writing flat files.
Format Types
The levels that type can take, along with the functions used to process them when reading a file:
address - (clean_address_number_and_street) Street number and street name parts of an address.
age - (clean_age) Age in years.
boolean01 - (naaccr_boolean, with false_value = "0") True/false, where "0" means false and "1" means true.
boolean12 - (naaccr_boolean, with false_value = "1") True/false, where "1" means false and "2" means true.
census_block - (clean_census_block) Census Block ID number.
census_tract - (clean_census_tract) Census Tract ID number.
character - (clean_text) Miscellaneous text.
city - (clean_address_city) City name.
count - (clean_count) Integer count.
county - (clean_county_fips) County FIPS code.
Date - (as.Date, with format = "%Y%m%d") NAACCR-formatted date (YYYYMMDD).
datetime - (as.POSIXct, with format = "%Y%m%d%H%M%S") NAACCR-formatted datetime (YYYYMMDDHHMMSS).
facility - (clean_facility_id) Facility ID number.
icd_9 - (clean_icd_9_cm) ICD-9-CM code.
icd_code - (clean_icd_code) ICD-9 or ICD-10 code.
integer - (as.integer) Miscellaneous whole number.
numeric - (as.numeric) Miscellaneous decimal number.
override - (naaccr_override) Field describing why another field's value was over-ridden.
physician - (clean_physician_id) Physician ID number.
postal - (clean_postal) Postal code for an address (a.k.a. ZIP code in the United States).
ssn - (clean_ssn) Social Security Number.
telephone - (clean_telephone) 10-digit telephone number.
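The Date and datetime types are listed as plain base R conversions, so their effect on incomplete values can be sketched without the package. The sample strings below are illustrative; the time zone "UTC" is chosen here only to make the result reproducible.

```r
# NAACCR-style dates: complete, month-only (space padded), and all 9's
dates <- c("20151031", "201408  ", "99999999")
parsed <- as.Date(dates, format = "%Y%m%d")
print(parsed)
print(is.na(parsed))  # only the complete date parses

# NAACCR-style datetimes: complete, and date-only
stamps <- c("20151031100856", "20140822")
times <- as.POSIXct(stamps, format = "%Y%m%d%H%M%S", tz = "UTC")
print(times)
```

Anything that does not match the full format comes back as NA, which is consistent with how naaccr_date and naaccr_datetime are described in this manual.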
Examples
my_fields <- record_format(
name = c("foo", "bar", "baz"),
item = c(2163, 1180, 1181),
start_col = c(975, 1381, NA),
width = c(1, 55, 4),
type = c("numeric", "facility", "character"),
parent = c("Patient", "Tumor", "Tumor"),
cleaner = list(NULL, NULL, trimws)
)
my_format <- rbind(naaccr_format_16, my_fields)
Separate a field's continuous and sentinel values
Description
Separate a sentineled field's values into two vectors: one with the continuous data and one with the sentinel values.
Usage
split_sentineled(x, field)
Arguments
x |
Vector (usually character) of codes. |
field |
String giving the XML name of the NAACCR field to code. |
Value
If field is a sentineled field, a data.frame with two columns. The first is a numeric version of the continuous values from x. Its name is the value of field. The second is a factor with levels representing the sentinel values. For all non-missing values in the numeric vector, the respective value in the factor is NA. If a value of x was not valid, the respective row will be NA for the continuous and flag values.
If field is not a sentineled field, a data.frame with just x is returned with a warning.
Examples
node_codes <- c("10", "20", "90", "95", "99", NA)
s <- split_sentineled(node_codes, "regionalNodesPositive")
print(s)
s[is.na(s[["regionalNodesPositive"]]), "regionalNodesPositiveFlag"]
Unpack tumor sequence number data
Description
Separate the multiple types of information in sequenceNumberCentral and sequenceNumberHospital into multiple columns.
Usage
split_sequence_number(x)
Arguments
x |
Vector (usually character) of sequence number codes. |
Value
A data.frame with four columns:
sequenceNumber - (integer) The number of the tumor in chronological sequence for the patient.
reportable - (logical) If TRUE, then the tumor is required to be reported by SEER/NPCR standards. If FALSE, it is either non-malignant or defined as reportable by the registry.
onlyTumor - (logical) If TRUE, this is the only known SEER/NPCR-reportable or the only known non-SEER/NPCR-reportable tumor for the patient.
sequenceFlag - (factor) Special flags, such as unknowns or changes in reporting requirements. Created using split_sentineled.
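A minimal usage sketch, assuming the naaccr package is installed (the block is guarded so it is skipped otherwise). The input codes are illustrative sequence number values chosen for this example; the exact output is not shown because it depends on the package's code tables.

```r
if (requireNamespace("naaccr", quietly = TRUE)) {
  library(naaccr)
  # Illustrative sequence number codes, given as two-character strings
  codes <- c("00", "01", "02", "60", "99")
  seq_info <- split_sequence_number(codes)
  print(names(seq_info))  # column names described above
  print(seq_info)
}
```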
Replace labels for unknown with NA
Description
Replace labels for unknown with NA
Usage
unknown_to_na(x, ...)
## S3 method for class 'naaccr_record'
unknown_to_na(x, ...)
## S3 method for class 'factor'
unknown_to_na(x, field, ...)
Arguments
x |
Either a factor created with naaccr_factor or a naaccr_record object. |
... |
Further arguments passed to or from other methods. |
field |
String giving the XML name of the NAACCR field to code. |
Value
If x was a factor, then the result is a vector with the values of x, except all levels which effectively mean "unknown" are replaced with NA.
The returned factor won't have those in its levels, either.
If x is a naaccr_record object, then the result is the naaccr_record created by applying this function to all columns of x.
Examples
r <- naaccr_record(
sex = c("1", "2", "9"),
kras = c("8", "9", "3"),
keep_unknown = TRUE
)
r
unknown_to_na(r[["sex"]], field = "sex")
unknown_to_na(r)
Write records in NAACCR format
Description
Write records from a naaccr_record object to a connection in fixed-width format, according to a specific version of the NAACCR format.
Usage
write_naaccr(records, con, version = NULL, format = NULL, encoding = "UTF-8")
Arguments
records |
A naaccr_record object. |
con |
Either a character string naming a file or a connection. |
version |
An integer specifying the NAACCR format version for parsing the records. Use this or format. |
format |
A record_format object. |
encoding |
String specifying the character encoding for the output file. |
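A round-trip sketch, assuming the naaccr package is installed (the block is guarded so it is skipped otherwise). It reads the synthetic file shipped with the package, as in the read_naaccr examples, and writes the records to a temporary file; the check that one line is written per record is an expectation of this sketch, not a documented guarantee.

```r
if (requireNamespace("naaccr", quietly = TRUE)) {
  library(naaccr)
  # Synthetic abstract records bundled with the package
  incfile <- system.file(
    "extdata", "synthetic-naaccr-18-abstract.txt",
    package = "naaccr"
  )
  recs <- read_naaccr(incfile, version = 18)
  # Write the records back out in the same fixed-width version
  outfile <- tempfile(fileext = ".txt")
  write_naaccr(recs, outfile, version = 18)
  # Compare the number of written lines with the number of records
  print(length(readLines(outfile)) == nrow(recs))
}
```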
Write records to a NAACCR-formatted XML file
Description
Write records to a NAACCR-formatted XML file
Usage
write_naaccr_xml(
records,
con,
version = NULL,
format = NULL,
base_dictionary = NULL,
user_dictionary = NULL,
encoding = "UTF-8"
)
Arguments
records |
A naaccr_record object. |
con |
Either a character string naming a file or a connection. |
version |
An integer specifying the NAACCR format version for parsing the records. Use this or format. |
format |
A record_format object. |
base_dictionary |
URI for the dictionary defining the NAACCR data items.
If this is |
user_dictionary |
URI for the dictionary defining the user-specified
data items. If |
encoding |
String specifying the character encoding for the output file. |