Title: | Create Codebooks from Data Frames |
Version: | 0.1.8 |
Maintainer: | Brad Cannell <brad.cannell@gmail.com> |
Description: | Quickly and easily create codebooks (i.e. data dictionaries) directly from a data frame. |
License: | MIT + file LICENSE |
Encoding: | UTF-8 |
RoxygenNote: | 7.2.2 |
URL: | https://github.com/brad-cannell/codebookr, https://brad-cannell.github.io/codebookr/ |
BugReports: | https://github.com/brad-cannell/codebookr/issues |
Depends: | R (≥ 2.10) |
LazyData: | true |
Suggests: | hms, knitr, rmarkdown, testthat |
Imports: | haven (≥ 2.5.0), flextable, dplyr, officer, purrr, rlang, stringr, tibble, tidyr |
NeedsCompilation: | no |
Packaged: | 2024-02-17 19:57:04 UTC; bradcannell |
Author: | Brad Cannell [aut, cre, cph] |
Repository: | CRAN |
Date/Publication: | 2024-02-19 08:20:08 UTC |
Add Attributes to Columns
Description
Add arbitrary attributes to columns (e.g., description, source, column type). These attributes can later be accessed to fill in the column attributes table.
Usage
cb_add_col_attributes(df, .x, ...)
Arguments
df |
Data frame of interest |
.x |
Column of interest in df |
... |
Arbitrary list of attributes (i.e., attribute = "value") |
Details
Typically, though not necessarily, the first step in creating your codebook will be to add column attributes to your data. The cb_add_col_attributes() function is a convenience function that allows you to add arbitrary attributes to the columns of the data frame. These attributes can later be accessed to fill in the column attributes table of the codebook document. Column attributes can serve a similar function to variable labels in SAS or Stata; however, you can assign many different attributes to a column and they can contain any kind of information you want.
Although the cb_add_col_attributes() function will allow you to add any attributes you want, there are currently only five special attributes that the codebook() function will recognize and add to the column attributes table of the codebook document. They are:
- description:
Although you may add any text you desire to the description attribute, it is intended to be used to describe the question/process that generated the data contained in the column. Many statistical software packages refer to this as a variable label. If the data was imported from SAS, Stata, or SPSS with variable labels using the haven package, codebook will automatically recognize them. There is no need to manually create them. However, you may overwrite the imported variable label for any column by adding a description attribute.
- source:
Although you may add any text you desire to the source attribute, it is intended to be used to describe where the data contained in the column originally came from. For example, if the current data frame was created by merging multiple data sets together, you may want to use the source attribute to identify the data set it originates from. As another example, if the current data frame contains longitudinal data, you may want to use the source attribute to identify the wave(s) in which data for this column was collected.
- col_type:
The col_type attribute is intended to provide additional information above and beyond the Data type (i.e., column class) about the values in the column. For example, you may have a column of 0's and 1's, which will have a numeric data type. However, you may want to inform data users that this is really a dummy variable where the 0's and 1's represent discrete categories (No and Yes). Another way to think about it is that the Data type attribute is how R understands the column and the Column type attribute is how humans should understand the column. Currently accepted values are: Numeric, Categorical, or Time.
Perhaps even more importantly, setting the col_type attribute helps R determine which descriptive statistics to calculate for the bottom half of the column attributes table. Inside of the codebook() function, the cb_add_summary_stats() function will attempt to figure out whether the column is:
  - numeric
  - categorical - many categories (e.g., participant id)
  - categorical - few categories (e.g., sex)
  - time - including dates
Again, this matters because the table of summary stats shown in the codebook document depends on the value cb_add_summary_stats() chooses. However, the user can directly tell cb_add_summary_stats() which summary stats to calculate by adding a col_type attribute to a column with one of the following values: Numeric, Categorical, or Time.
- value_labels:
Although you may pass any named vector you desire to the value_labels attribute, it is intended to inform your data users about how to correctly interpret numerically coded categorical variables. For example, you may have a column of 0's and 1's that represent discrete categories (i.e., "No" and "Yes") instead of numerical quantities. In many other software packages (e.g., SAS, Stata, and SPSS), you can layer "No" and "Yes" labels on top of the 0's and 1's to improve the readability of your analysis output. These are commonly referred to as value labels. The R programming language does not really have value labels in the same way that other popular statistical software applications do. R users can (and typically should) coerce numerically coded categorical variables into factors; however, coercing a numeric vector to a factor is not the same as adding value labels to a numeric vector because the underlying numeric values can change in the process of creating the factor. For this, and other reasons, many R programmers choose to create a new factor version of a numerically encoded variable as opposed to overwriting/transforming the numerically encoded variable. In those cases, you may want to inform your data users about how to correctly interpret numerically coded categorical variables. Adding value labels to your codebook is one way of doing so.
  - Add value labels to columns as a named vector to the value_labels attribute. For example, value_labels = c("No" = 0, "Yes" = 1).
  - If the data was imported from SAS, Stata, or SPSS with value labels using the haven package, codebook will automatically recognize them. There is no need to manually create them. However, you may overwrite the imported value labels for any column by adding a value_labels attribute as shown in the example below.
- skip_pattern:
Although you may add any text you desire to the skip_pattern attribute, it is intended to be used to describe skip patterns in the data collection tools that impact which study participants were exposed to each study item. For example, if a question in your data was only asked of participants who were enrolled in the study for at least 10 days, then you may want to add a note like "Not asked if days < 10" to the skip pattern section of the column attributes table.
Value
Returns the same data frame (or tibble) passed to the df argument with column attributes added.
Examples
library(dplyr, warn.conflicts = FALSE)
library(codebookr)
data(study)
study <- study %>%
cb_add_col_attributes(
.x = likert,
description = "An example Likert scale item",
source = "Exposure questionnaire",
col_type = "categorical",
value_labels = c(
"Very dissatisfied" = 1,
"Somewhat dissatisfied" = 2,
"Neither satisfied nor dissatisfied" = 3,
"Somewhat satisfied" = 4,
"Very satisfied" = 5
),
skip_pattern = "Not asked if days < 10"
)
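The idea of attaching metadata to a column and reading it back later can be sketched with base R alone. This is a minimal illustration, assuming the attributes are stored as ordinary R attributes on the column vector (the text above does not show the implementation); the data frame df is a made-up stand-in for study.

```r
# A small stand-in data frame; the column name mirrors the `study` example.
df <- data.frame(likert = c(1, 3, 5, 2))

# Attach column-level metadata as ordinary R attributes.
# Assumption: this is comparable to what cb_add_col_attributes() stores.
attr(df$likert, "description") <- "An example Likert scale item"
attr(df$likert, "value_labels") <- c("Very dissatisfied" = 1, "Very satisfied" = 5)

# Read the metadata back from the column.
attr(df$likert, "description")
names(attributes(df$likert))
```

Because the metadata travels with the column itself, any later step that receives the data frame can look it up without a separate lookup table.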
Add Description Text to Codebook
Description
Basically, just checks for the number of paragraphs in the description and then runs cb_add_text for each one.
Usage
cb_add_description(rdocx, description)
Arguments
rdocx |
rdocx object created with |
description |
Text description of the dataset |
Value
rdocx object
Create Formatted Section Header
Description
Create Formatted Section Header
Usage
cb_add_section_header(rdocx, text = NA)
Arguments
rdocx |
rdocx object created with |
text |
Text of section header |
Value
rdocx object
Calculate Appropriate Statistics for Variable Type
Description
The input to cb_add_summary_stats is a data frame and a column from that data frame in the format cb_add_summary_stats(study, "id"). The column name is a character string because it is passed from a for loop in the codebook function. The purpose of cb_add_summary_stats is to attempt to figure out whether the column is:
- Numeric (e.g., height)
- Categorical - many categories (e.g., participant id)
- Categorical - few categories (e.g., gender)
- Time - including dates
This matters because the table of summary stats shown in the codebook document depends on the value cb_add_summary_stats chooses.
Usage
cb_add_summary_stats(
df,
.x,
many_cats = 10,
num_to_cat = 4,
digits = 2,
n_extreme_cats = 5
)
Arguments
df |
Data frame of interest |
.x |
Column of interest |
many_cats |
The many_cats argument sets the cutoff value that partially (i.e., along with the col_type attribute) determines whether cb_add_summary_stats will categorize the variable as categorical with few categories or categorical with many categories. The number of categories that constitutes "many" is defined by the value passed to the many_cats argument. The default is 10. |
num_to_cat |
The num_to_cat argument sets the cutoff value that partially (i.e., along with the col_type attribute) determines whether cb_add_summary_stats will categorize a numeric as categorical. If the col_type attribute is not set for a column AND the number of unique non-missing values is <= num_to_cat, then cb_add_summary_stats will guess that the variable is categorical. The default value for num_to_cat is 4. |
digits |
Number of digits after the decimal to display |
n_extreme_cats |
Number of extreme values to display when the column is
classified as |
Details
The user can tell the cb_add_summary_stats function what to choose explicitly by giving the column a col_type attribute set to one of the following values:
- Numeric. For example, height and/or weight.
study <- cb_add_col_attributes(study, height, col_type = "numeric")
- Categorical. How many categories versus few categories is determined is described below.
study <- cb_add_col_attributes(study, id, col_type = "categorical")
- Time. Dates, times, and datetimes.
study <- cb_add_col_attributes(study, date, col_type = "time")
If the user does not explicitly set the col_type attribute to one of these values, then cb_add_summary_stats will guess which col_type attribute to assign to each column based on the column's class and the number of unique non-missing values it has.
However, the number of unique non-missing values isn't used in an absolute way (e.g., 10 or more unique values is ALWAYS many_cats). Instead, the number of unique non-missing values is used relative to the values passed to the many_cats parameter and/or the num_to_cat parameter, depending on the class of the column.
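The interplay between column class, num_to_cat, and many_cats might look roughly like the following. This is a hedged sketch only: guess_col_type() is a hypothetical helper written for illustration, not a function in the package, and the exact comparisons (for example, whether reaching many_cats counts as "many") are assumptions.

```r
# Hypothetical helper, not part of codebookr: illustrates how the two cutoffs
# could interact with a column's class when no col_type attribute is set.
guess_col_type <- function(x, many_cats = 10, num_to_cat = 4) {
  # Count unique non-missing values.
  n_unique <- length(unique(x[!is.na(x)]))
  # Dates, datetimes, and durations are treated as time.
  if (inherits(x, c("Date", "POSIXct", "difftime"))) {
    return("time")
  }
  # Numeric columns stay numeric unless they have very few distinct values.
  if (is.numeric(x) && n_unique > num_to_cat) {
    return("numeric")
  }
  # Assumption: reaching many_cats counts as "many".
  if (n_unique >= many_cats) "categorical - many" else "categorical - few"
}

guess_col_type(c(0, 1, 1, 0, NA))            # numeric with few distinct values
guess_col_type(seq(60, 75, by = 0.5))        # numeric with many distinct values
guess_col_type(sprintf("id_%02d", 1:20))     # character with many distinct values
guess_col_type(as.Date("2024-01-01") + 0:5)  # dates
```

Under these assumptions a 0/1 column is guessed to be categorical with few categories, which is the situation where setting col_type explicitly is most useful if the guess is wrong.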
Value
A tibble of results
See Also
Other add_summary_stats: cb_summary_stats_few_cats(), cb_summary_stats_many_cats(), cb_summary_stats_numeric(), cb_summary_stats_time()
Create Formatted Text
Description
Create Formatted Text
Usage
cb_add_text(rdocx, text = NA)
Arguments
rdocx |
rdocx object created with |
text |
Arbitrary text |
Value
rdocx object
Optionally Add Title and Subtitle to Codebook
Description
This function is not intended to be a stand-alone function. It is intended to be used by the codebook function.
Usage
cb_add_title(rdocx, title = NA, subtitle = NA)
Arguments
rdocx |
rdocx object created with |
title |
Optional title |
subtitle |
Optional subtitle |
Value
rdocx object
Get Column Attributes
Description
Used in codebook() to create the top half of the column attributes table.
Usage
cb_get_col_attributes(df, .x, keep_blank_attributes = keep_blank_attributes)
Arguments
df |
Data frame of interest |
.x |
Column of interest in df |
keep_blank_attributes |
By default, the column attributes table will omit
the Column description, Source information, Column type, and value labels
rows from the column attributes table in the codebook document if those
attributes haven't been set. In other words, it won't show blank rows for
those attributes. Passing |
Details
Typically, though not necessarily, the first step in creating your codebook will be to add column attributes to your data. The cb_add_col_attributes() function is a convenience function that allows you to add arbitrary attributes to columns (e.g., description, source, column type). These attributes can later be accessed to fill in the column attributes table of the codebook document. Column attributes can serve a similar function to variable labels in SAS or Stata; however, you can assign many different attributes to a column and they can contain any kind of information you want.
Although the cb_add_col_attributes() function will allow you to add any attributes you want, there are currently only four special attributes that the codebook() function (via cb_get_col_attributes()) will recognize and add to the column attributes table of the codebook document. They are:
- description: Although you may add any text you desire to the description attribute, it is intended to be used to describe the question/process that generated the data contained in the column. Many statistical software packages refer to this as a variable label.
- source: Although you may add any text you desire to the source attribute, it is intended to be used to describe where the data contained in the column originally came from. For example, if the current data frame was created by merging multiple data sets together, you may want to use the source attribute to identify the data set it originates from. As another example, if the current data frame contains longitudinal data, you may want to use the source attribute to identify the wave(s) in which data for this column was collected.
- col_type: The col_type attribute is intended to provide additional information above and beyond the Data type (i.e., column class) about the values in the column. For example, you may have a column of 0's and 1's, which will have a numeric data type. However, you may want to inform data users that this is really a dummy variable where the 0's and 1's represent discrete categories (No and Yes). Another way to think about it is that the Data type attribute is how R understands the column and the Column type attribute is how humans should understand the column. Currently accepted values are: Numeric, Categorical, or Time.
Perhaps even more importantly, setting the col_type attribute helps R determine which descriptive statistics to calculate for the bottom half of the column attributes table. Inside of the codebook() function, the cb_add_summary_stats() function will attempt to figure out whether the column is numeric, categorical - many categories (e.g., participant id), categorical - few categories (e.g., sex), or time - including dates. Again, this matters because the table of summary stats shown in the codebook document depends on the value cb_add_summary_stats() chooses. However, the user can directly tell cb_add_summary_stats() which summary stats to calculate by adding a col_type attribute to a column with one of the following values: Numeric, Categorical, or Time.
- value_labels: Although you may pass any named vector you desire to the value_labels attribute, it is intended to inform your data users about how to correctly interpret numerically coded categorical variables. For example, you may have a column of 0's and 1's that represent discrete categories (i.e., "No" and "Yes") instead of numerical quantities. In many other software packages (e.g., SAS, Stata, and SPSS), you can layer "No" and "Yes" labels on top of the 0's and 1's to improve the readability of your analysis output. These are commonly referred to as value labels. The R programming language does not really have value labels in the same way that other popular statistical software applications do. R users can (and typically should) coerce numerically coded categorical variables into factors; however, coercing a numeric vector to a factor is not the same as adding value labels to a numeric vector because the underlying numeric values can change in the process of creating the factor. For this, and other reasons, many R programmers choose to create a new factor version of a numerically encoded variable as opposed to overwriting/transforming the numerically encoded variable. In those cases, you may want to inform your data users about how to correctly interpret numerically coded categorical variables. Adding value labels to your codebook is one way of doing so.
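The point that coercing to a factor can change the underlying numeric values is easy to check in base R. The vector x below is made up for illustration; the named vector has the same shape as the value_labels example given for cb_add_col_attributes().

```r
# A numerically coded categorical variable: 0 = No, 1 = Yes.
x <- c(0, 1, 1, 0)

# Coercing to a factor replaces the 0/1 codes with level positions 1/2.
f <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
as.integer(f)

# A named vector of value labels leaves x untouched and can still be used
# to translate codes into labels when needed.
value_labels <- c("No" = 0, "Yes" = 1)
names(value_labels)[match(x, value_labels)]
```

Here as.integer(f) returns 1 2 2 1 rather than the original 0 1 1 0, while the lookup through value_labels recovers "No" "Yes" "Yes" "No" without altering x.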
Value
A tibble of column attributes
Compute Summary Statistics for Categorical Variables with Few Categories
Description
Compute Summary Statistics for Categorical Variables with Few Categories
Usage
cb_summary_stats_few_cats(df, .x, digits = 2)
Arguments
df |
Data frame of interest |
.x |
Column of interest |
digits |
Number of digits after decimal to display |
Value
A tibble
See Also
Other add_summary_stats: cb_add_summary_stats(), cb_summary_stats_many_cats(), cb_summary_stats_numeric(), cb_summary_stats_time()
Compute Summary Statistics for Categorical Variables with Many Categories
Description
Compute Summary Statistics for Categorical Variables with Many Categories
Usage
cb_summary_stats_many_cats(df, .x, n_extreme_cats = 5)
Arguments
df |
Data frame of interest |
.x |
Column of interest |
n_extreme_cats |
Number of extreme values to display |
Value
A tibble
See Also
Other add_summary_stats: cb_add_summary_stats(), cb_summary_stats_few_cats(), cb_summary_stats_numeric(), cb_summary_stats_time()
Compute Summary Statistics for Numeric Variables
Description
Compute Summary Statistics for Numeric Variables
Usage
cb_summary_stats_numeric(df, .x, digits = 2)
Arguments
df |
Data frame of interest |
.x |
Column of interest |
digits |
Number of digits after decimal to display |
Value
A tibble
See Also
Other add_summary_stats: cb_add_summary_stats(), cb_summary_stats_few_cats(), cb_summary_stats_many_cats(), cb_summary_stats_time()
Compute Summary Statistics for Date or Time Variables
Description
Compute Summary Statistics for Date or Time Variables
Usage
cb_summary_stats_time(df, .x, digits = 2)
Arguments
df |
Data frame of interest |
.x |
Column of interest |
digits |
Number of digits after decimal to display |
Value
A tibble
See Also
Other add_summary_stats: cb_add_summary_stats(), cb_summary_stats_few_cats(), cb_summary_stats_many_cats(), cb_summary_stats_numeric()
Create Formatted Flextable From Summary Statistics
Description
Create Formatted Flextable From Summary Statistics
Usage
cb_summary_stats_to_ft(df, ...)
## S3 method for class 'summary_numeric'
cb_summary_stats_to_ft(df, col_width = 1.3, ...)
## S3 method for class 'summary_many_cats'
cb_summary_stats_to_ft(df, col_width = 1.62, ...)
## S3 method for class 'summary_few_cats'
cb_summary_stats_to_ft(df, col_width = 1.62, ...)
## S3 method for class 'summary_time'
cb_summary_stats_to_ft(df, col_width = 1.62, ...)
Arguments
df |
Data frame of summary statistics |
... |
Other stuff |
col_width |
Set the width of the column that will appear in the Word table |
Value
Flextable object
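The Usage above lists one S3 method per summary class, each with its own default col_width. The dispatch pattern can be sketched in base R as follows. The generic col_width_for() is hypothetical and exists only for this illustration, and it is an assumption here that the summary statistics tables carry classes such as summary_numeric so that the matching method is selected.

```r
# Hypothetical generic mirroring the dispatch pattern shown in the Usage above.
col_width_for <- function(df, ...) UseMethod("col_width_for")

# One method per summary class, each with its own default width.
col_width_for.summary_numeric <- function(df, col_width = 1.3, ...) col_width
col_width_for.summary_few_cats <- function(df, col_width = 1.62, ...) col_width

# A made-up summary table tagged with the class used for dispatch.
stats <- data.frame(mean = 65.2, sd = 3.1)
class(stats) <- c("summary_numeric", class(stats))

col_width_for(stats)                 # uses the summary_numeric default
col_width_for(stats, col_width = 2)  # the default can be overridden
```

Putting the class on the summary table lets a single call site format every kind of summary without branching on the column type again.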
Format Column Attributes Flextable
Description
Format Column Attributes Flextable
Usage
cb_theme_col_attr(ft)
Arguments
ft |
A flextable object |
Value
A flextable object
Format Data Frame Attributes Flextable
Description
Format Data Frame Attributes Flextable
Usage
cb_theme_df_attributes(ft)
Arguments
ft |
A flextable object |
Value
A flextable object
Automate creation of a data codebook
Description
The codebook function assists with the creation of a codebook for a given data frame.
Usage
codebook(
df,
title = NA,
subtitle = NA,
description = NA,
keep_blank_attributes = FALSE,
no_summary_stats = NULL
)
Arguments
df |
The data frame the codebook will describe |
title |
An optional title that will appear at the top of the Word codebook document |
subtitle |
An optional subtitle that will appear at the top of the Word codebook document |
description |
An optional text description of the dataset that will appear on the first page of the Word codebook document |
keep_blank_attributes |
TRUE or FALSE. By default, the column attributes
table will omit the Column description, Source information, Column type,
value labels, and skip pattern rows from the column attributes table in
the codebook document if those attributes haven't been set. In other
words, it won't show blank rows for those attributes. Passing |
no_summary_stats |
A character vector of column names. The summary statistics will not be added to column attributes table for any column passed to this argument. This can be useful when a column contains values that are sensitive or may be used to identify individual people (e.g., names, addresses, etc.) and the individual values for that column should not appear in the codebook. |
Details
codebook() expects that df is a data frame that you have read into memory from a saved data file. Please provide the path to the saved data file. This function gets selected attributes about the file saved at path and stores those attributes in a data frame, which is later turned into a flextable and added to the codebook document.
Typically, though not necessarily, the first step in creating your codebook will be to add column attributes to your data. The cb_add_col_attributes() function is a convenience function that allows you to add arbitrary attributes to the columns of the data frame. These attributes can later be accessed to fill in the column attributes table of the codebook document. Column attributes can serve a similar function to variable labels in SAS or Stata; however, you can assign many different attributes to a column and they can contain any kind of information you want.
For details, see cb_add_col_attributes.
Value
An rdocx object that can be printed to a Word document
Examples
## Not run:
study_codebook <- codebook(
df = study,
title = "My Example Study",
subtitle = "A Subtitle for My Example Study Codebook",
description = "Brief (or long) description of the data."
)
# Create the Word codebook document
print(study_codebook, "example_codebook.docx")
## End(Not run)
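The keep_blank_attributes and no_summary_stats arguments described above can be combined in a single call. The following is a sketch rather than a tested recipe: the choice of address as the sensitive column is only an example, and the requireNamespace() guard is added so the snippet does nothing where the package is not installed.

```r
# Columns whose individual values should not appear in the codebook.
# The choice of "address" is illustrative.
sensitive_cols <- c("address")

if (requireNamespace("codebookr", quietly = TRUE)) {
  study_codebook <- codebookr::codebook(
    df = codebookr::study,
    title = "My Example Study",
    keep_blank_attributes = TRUE,     # show rows even for unset attributes
    no_summary_stats = sensitive_cols # omit summary stats for these columns
  )
}
```

The column still appears in the codebook with its attributes; only its summary statistics are left out.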
Simulated study data.
Description
This is the code to create the study data - a simulated dataset that can be used to demonstrate how to use the codebook package.
Usage
study
Format
A data frame with 20 rows and 10 variables:
- id
Participant's study identification number
- address
Participant's home address
- sex
Biological sex of the participant assigned at birth, female/male
- date
Participant's date of enrollment
- time
Participant's time of enrollment
- date_time
Participant's date and time of enrollment
- days
Total number of days the participant was enrolled in the study
- height
Participant's height in inches at date of enrollment
- likert
An example Likert scale item, 1-5
- outcome
Participant experienced the outcome of interest, TRUE or FALSE