vignettes/tbl_summary.Rmd
The tbl_summary()
function calculates
descriptive statistics for continuous, categorical, and
dichotomous variables in R, and presents the results in
a beautiful, customizable summary table ready for
publication (for example, Table 1 or demographic tables).
This vignette will walk a reader through the
tbl_summary()
function, and the various functions available
to modify and make additions to an existing table summary object.
We’ll be using the trial
data set throughout this example.
This set contains data from 200 patients who received one of two types of chemotherapy (Drug A or Drug B). The outcomes are tumor response and death.
Each variable in the data frame has been assigned an
attribute label
(e.g. attr(trial$trt, "label") == "Chemotherapy Treatment")
with the {labelled}
package. These labels are displayed in the {gtsummary} output table by
default. Using {gtsummary} on a data frame without labels will simply
print variable names in place of variable labels; there is also an
option to add labels later.
Variable  Class      Label
trt       character  Chemotherapy Treatment
age       numeric    Age
marker    numeric    Marker Level (ng/mL)
stage     factor     T Stage
grade     factor     Grade
response  integer    Tumor Response
death     integer    Patient Died
ttdeath   numeric    Months to Death/Censor
Includes mix of continuous, dichotomous, and categorical variables 
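As a small sketch of how labels behave, the code below reads one label attribute from trial and then summarizes an unlabelled data frame with and without the label= argument. The data frame df and its columns are hypothetical and exist only for illustration.

```r
library(gtsummary)

# Label attribute shipped with the trial data
attr(trial$trt, "label")

# Hypothetical unlabelled data frame (not part of the tutorial data)
df <- data.frame(
  height_cm = c(150, 162, 171, 180),
  smoker    = c("no", "yes", "no", "no")
)

# Without labels, the column names are printed in the table
tbl_plain <- tbl_summary(df)

# Labels can be supplied later through the label= argument
tbl_labelled <- tbl_summary(
  df,
  label = list(height_cm ~ "Height (cm)", smoker ~ "Smoker")
)
```

Printing tbl_plain and tbl_labelled side by side shows the difference in the first column.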
head(trial)
#> # A tibble: 6 × 8
#> trt age marker stage grade response death ttdeath
#> <chr> <dbl> <dbl> <fct> <fct> <int> <int> <dbl>
#> 1 Drug A 23 0.16 T1 II 0 0 24
#> 2 Drug B 9 1.11 T2 I 1 0 24
#> 3 Drug A 31 0.277 T1 II 0 0 24
#> 4 Drug A NA 2.07 T3 III 1 1 17.6
#> 5 Drug A 51 2.77 T4 III 1 1 16.4
#> 6 Drug B 39 0.613 T4 I 0 1 15.6
For brevity, in this tutorial we’ll use a subset of the variables from the trial data set.
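The examples that follow refer to an object named trial2. Judging from the columns that appear in the output tables (treatment, age, and grade), a minimal sketch of how it could be created is:

```r
library(gtsummary)
library(dplyr)

# Assumed subset: the three columns shown in this tutorial's tables
trial2 <- trial %>% select(trt, age, grade)
```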
The default output from tbl_summary()
is meant to be
publication ready.
Let’s start by creating a table of summary statistics from the
trial
data set. The tbl_summary()
function can
take, at minimum, a data frame as the only input, and returns
descriptive statistics for each column in the data frame.
trial2 %>% tbl_summary()
Characteristic  N = 200^{1} 

Chemotherapy Treatment  
Drug A  98 (49%) 
Drug B  102 (51%) 
Age  47 (38, 57) 
Unknown  11 
Grade  
I  68 (34%) 
II  68 (34%) 
III  64 (32%) 
^{1} n (%); Median (IQR) 
Note the sensible defaults with this basic usage; each of the defaults may be customized.
Variable types are automatically detected so that appropriate descriptive statistics are calculated.
Label attributes from the data set are automatically printed.
Missing values are listed as “Unknown” in the table.
Variable levels are indented and footnotes are added.
For this study data the summary statistics should be split by
treatment group, which can be done by using the
by=
argument. To compare two or more
groups, include add_p()
with the function call, which detects variable type and uses an
appropriate statistical test.
trial2 %>% tbl_summary(by = trt) %>% add_p()
Characteristic  Drug A, N = 98^{1}  Drug B, N = 102^{1}  p-value^{2}
Age             46 (37, 59)         48 (39, 56)          0.7
Unknown         7                   4
Grade                                                    0.9
I               35 (36%)            33 (32%)
II              32 (33%)            36 (35%)
III             31 (32%)            33 (32%)
^{1} Median (IQR); n (%)
^{2} Wilcoxon rank sum test; Pearson's Chi-squared test
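There are four primary ways to customize the output of the summary table.
Use tbl_summary() function arguments
Add information to a summary table with add_*() functions
Modify the table's appearance with {gtsummary} formatting functions
Modify the table's appearance with {gt} package functions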
The tbl_summary()
function includes many input options
for modifying the appearance.
Argument      Description
label         specify the variable labels printed in table
type          specify the variable type (e.g. continuous, categorical, etc.)
statistic     change the summary statistics presented
digits        number of digits the summary statistics will be rounded to
missing       whether to display a row with the number of missing observations
missing_text  text label for the missing number row
sort          change the sorting of categorical levels by frequency
percent       print column, row, or cell percentages
include       list of variables to include in summary table
Example modifying tbl_summary()
arguments.
trial2 %>%
  tbl_summary(
    by = trt,
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} / {N} ({p}%)"
    ),
    digits = all_continuous() ~ 2,
    label = grade ~ "Tumor Grade",
    missing_text = "(Missing)"
  )
Characteristic  Drug A, N = 98^{1}  Drug B, N = 102^{1} 

Age  47.01 (14.71)  47.45 (14.01) 
(Missing)  7  4 
Tumor Grade  
I  35 / 98 (36%)  33 / 102 (32%) 
II  32 / 98 (33%)  36 / 102 (35%) 
III  31 / 98 (32%)  33 / 102 (32%) 
^{1} Mean (SD); n / N (%) 
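The arguments not used in the example above follow the same pattern. The sketch below requests row percentages, orders the grade levels by frequency, and always prints the missing row. It rebuilds trial2 from the columns shown in this tutorial's tables (an assumption), and the argument values are illustrative only.

```r
library(gtsummary)
library(dplyr)

# Assumed subset used throughout the tutorial
trial2 <- trial %>% select(trt, age, grade)

tbl_args <- trial2 %>%
  tbl_summary(
    by = trt,
    percent = "row",                          # row rather than column percentages
    sort = all_categorical() ~ "frequency",   # most frequent level first
    missing = "always",                       # show the missing row for every variable
    missing_text = "(Missing)"
  )
# Print tbl_args to render the table
```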
There are multiple ways to specify the statistic=
argument: a single formula, a list of formulas, or a named list.
The following table shows equivalent ways to specify the mean statistic
for continuous variables age
and marker.
Any
{gtsummary} function argument that accepts formulas will accept each of
these variations.
Select with Helpers  Select by Variable Name  Select with Named List 






— 
— 

— 
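The equivalent forms described above can be sketched in code. The four calls below request the mean for age and marker using a helper in a single formula, a list of formulas, a selection by variable name, and a named list. The final comparison is only a sanity check added for illustration, not part of the package workflow.

```r
library(gtsummary)
library(dplyr)

df <- trial %>% select(age, marker)

# 1. Single formula with a selector helper
t1 <- tbl_summary(df, statistic = all_continuous() ~ "{mean}")
# 2. List of formulas
t2 <- tbl_summary(df, statistic = list(all_continuous() ~ "{mean}"))
# 3. Formula selecting by variable name
t3 <- tbl_summary(df, statistic = c(age, marker) ~ "{mean}")
# 4. Named list
t4 <- tbl_summary(df, statistic = list(age = "{mean}", marker = "{mean}"))

# The four specifications are expected to give the same table contents
same <- all(vapply(
  list(t2, t3, t4),
  function(t) isTRUE(all.equal(as_tibble(t1), as_tibble(t))),
  logical(1)
))
```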
The {gtsummary} package has functions for adding information or
statistics to tbl_summary()
tables.
Function          Description
add_p()           add p-values to the output comparing values across groups
add_overall()     add a column with overall summary statistics
add_n()           add a column with N (or N missing) for each variable
add_difference()  add column for difference between two groups, confidence interval, and p-value
add_stat_label()  add label for the summary statistics shown in each row
add_stat()        generic function to add a column with user-defined values
add_q()           add a column of q-values to control for multiple comparisons
The {gtsummary} package comes with functions specifically made to modify and format summary tables.
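Function                  Description
modify_header()           update column headers
modify_footnote()         update column footnote
modify_spanning_header()  update spanning headers
modify_caption()          update table caption/title
bold_labels()             bold variable labels
bold_levels()             bold variable levels
italicize_labels()        italicize variable labels
italicize_levels()        italicize variable levels
bold_p()                  bold significant p-values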
Example adding tbl_summary()
family functions
trial2 %>%
  tbl_summary(by = trt) %>%
  add_p(pvalue_fun = ~style_pvalue(.x, digits = 2)) %>%
  add_overall() %>%
  add_n() %>%
  modify_header(label ~ "**Variable**") %>%
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Treatment Received**") %>%
  modify_footnote(
    all_stat_cols() ~ "Median (IQR) or Frequency (%)"
  ) %>%
  modify_caption("**Table 1. Patient Characteristics**") %>%
  bold_labels()
Variable  N    Overall, N = 200^{1}  Treatment Received                       p-value^{2}
                                     Drug A, N = 98^{1}  Drug B, N = 102^{1}
Age       189  47 (38, 57)           46 (37, 59)         48 (39, 56)          0.72
Unknown        11                    7                   4
Grade     200                                                                 0.87
I              68 (34%)              35 (36%)            33 (32%)
II             68 (34%)              32 (33%)            36 (35%)
III            64 (32%)              31 (32%)            33 (32%)
^{1} Median (IQR) or Frequency (%)
^{2} Wilcoxon rank sum test; Pearson's Chi-squared test
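Two of the listed functions that the example above does not use are add_q() and bold_p(). A minimal sketch, assuming trial2 holds the three columns shown in this tutorial and using an arbitrary 0.10 threshold purely for illustration:

```r
library(gtsummary)
library(dplyr)

# Assumed subset used throughout the tutorial
trial2 <- trial %>% select(trt, age, grade)

tbl_q <- trial2 %>%
  tbl_summary(by = trt) %>%
  add_p() %>%                  # p-values from the default tests
  add_q(method = "fdr") %>%    # q-values adjusted for multiple comparisons
  bold_p(t = 0.10, q = TRUE)   # bold q-values below the illustrative threshold
```

With only two variables the adjustment has little practical effect. The point is the order of the calls: add_q() needs the p-value column that add_p() creates.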
The {gt} package is packed with many great functions for modifying table output—too many to list here. Review the package’s website for a full listing.
To use the {gt} package functions with {gtsummary} tables, the
summary table must first be converted into a gt
object. To
this end, use the as_gt()
function after modifications have
been completed with {gtsummary} functions.
trial2 %>%
  tbl_summary(by = trt, missing = "no") %>%
  add_n() %>%
  as_gt() %>%
  gt::tab_source_note(gt::md("*This data is simulated*"))
Characteristic  N    Drug A, N = 98^{1}  Drug B, N = 102^{1}
Age             189  46 (37, 59)         48 (39, 56)
Grade           200
I                    35 (36%)            33 (32%)
II                   32 (33%)            36 (35%)
III                  31 (32%)            33 (32%)
This data is simulated
^{1} Median (IQR); n (%)
There is flexibility in how you select variables for {gtsummary}
arguments, which allows for many customization opportunities! For
example, if you want to show age and the marker levels to one decimal
place in tbl_summary()
, you can pass
digits = c(age, marker) ~ 1
. The selecting input is
flexible, and you may also pass quoted column names.
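As a sketch of the two selection styles just described, the calls below set one decimal place for age and marker using bare names and then quoted names. The comparison at the end is an illustrative check only.

```r
library(gtsummary)
library(dplyr)

df <- trial %>% select(trt, age, marker)

# Bare column names
t_bare <- tbl_summary(df, by = trt, digits = c(age, marker) ~ 1)

# Quoted column names
t_quoted <- tbl_summary(df, by = trt, digits = c("age", "marker") ~ 1)

# Both selections are expected to produce the same table contents
same_digits <- isTRUE(all.equal(as_tibble(t_bare), as_tibble(t_quoted)))
```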
Going beyond typing out specific variables in your data set, you can use:
All {tidyselect}
helpers available throughout the tidyverse, such as
starts_with()
, contains()
, and
everything()
(i.e. anything you can use with the
dplyr::select()
function), can be used with
{gtsummary}.
Additional {gtsummary} selectors that are included in the package to supplement tidyselect functions.
Summary type: There are two primary ways, all_continuous() and all_categorical(), to
select variables by their summary type. This is useful, for example,
when you wish to report the mean and standard deviation for all
continuous variables:
statistic = all_continuous() ~ "{mean} ({sd})"
.
Dichotomous variables are, by default, included with
all_categorical()
.
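A short sketch of the summary-type selectors, using three columns of trial chosen for illustration. Setting dichotomous = FALSE in all_categorical() excludes the yes/no variable response, which then keeps its default statistic.

```r
library(gtsummary)
library(dplyr)

tbl_types <- trial %>%
  select(age, response, grade) %>%
  tbl_summary(
    statistic = list(
      # applies to age
      all_continuous() ~ "{mean} ({sd})",
      # applies to grade only; response is dichotomous and is skipped
      all_categorical(dichotomous = FALSE) ~ "{n} / {N}"
    )
  )
```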
Continuous variables may also be summarized on multiple lines—a
common format in some journals. To update the continuous variables to
summarize on multiple lines, update the summary type to
"continuous2"
(for summaries on two or more lines).
trial2 %>%
  select(age, trt) %>%
  tbl_summary(
    by = trt,
    type = all_continuous() ~ "continuous2",
    statistic = all_continuous() ~ c(
      "{N_nonmiss}",
      "{median} ({p25}, {p75})",
      "{min}, {max}"
    ),
    missing = "no"
  ) %>%
  add_p(pvalue_fun = ~style_pvalue(.x, digits = 2))
Characteristic  Drug A, N = 98  Drug B, N = 102  p-value^{1}
Age                                              0.72
N               91              98
Median (IQR)    46 (37, 59)     48 (39, 56)
Range           6, 78           9, 83
^{1} Wilcoxon rank sum test
The information in this section applies to all {gtsummary} objects.
The {gtsummary} table has two important internal objects:
Internal Object  Description
.$table_body     data frame that is printed as the gtsummary output table
.$table_styling  contains instructions for styling
When you print output from the tbl_summary()
function
into the R console or into an R markdown document, the
.$table_body
data frame is formatted using the instructions
listed in .$table_styling
. The default printer converts the
{gtsummary} object to a {gt} object with as_gt()
via a
sequence of {gt} commands executed on .$table_body
. Here’s
an example of the first few calls saved with
tbl_summary()
:
tbl_summary(trial2) %>% as_gt(return_calls = TRUE) %>% head(n = 4)
#> $gt
#> gt::gt(data = x$table_body, groupname_col = NULL, caption = NULL)
#>
#> $fmt_missing
#> $fmt_missing[[1]]
#> gt::sub_missing(columns = gt::everything(), missing_text = "")
#>
#>
#> $cols_align
#> $cols_align[[1]]
#> gt::cols_align(columns = c("variable", "var_type", "var_label",
#> "row_type", "stat_0"), align = "center")
#>
#> $cols_align[[2]]
#> gt::cols_align(columns = "label", align = "left")
#>
#>
#> $indent
#> $indent[[1]]
#> gt::text_transform(locations = gt::cells_body(columns = "label",
#> rows = c(2L, 3L, 5L, 7L, 8L, 9L)), fn = function(x) paste0(" ",
#> x))
The {gt} functions are called in the order they appear, beginning
with gt::gt()
.
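To see which steps are available before including or excluding any of them, the stored calls can be inspected by name. This is a minimal sketch, assuming trial2 holds the three columns shown in this tutorial; the exact set of step names may differ between package versions.

```r
library(gtsummary)
library(dplyr)

# Assumed subset used throughout the tutorial
trial2 <- trial %>% select(trt, age, grade)

calls <- tbl_summary(trial2) %>% as_gt(return_calls = TRUE)

# Names of the stored steps, in the order they are executed
step_names <- names(calls)

# The calls behind a single step, here the column alignment
align_calls <- calls[["cols_align"]]
```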
If the user does not want a specific {gt} function to run (i.e. would
like to change default printing), any {gt} call can be excluded in the
as_gt()
function. In the example below, the default
alignment is restored.
After the as_gt()
function is run, additional formatting
may be added to the table using {gt} functions. In the example below, a
source note is added to the table.
tbl_summary(trial2, by = trt) %>%
  as_gt(include = -cols_align) %>%
  gt::tab_source_note(gt::md("*This data is simulated*"))
Characteristic  Drug A, N = 98^{1}  Drug B, N = 102^{1} 

Age  46 (37, 59)  48 (39, 56) 
Unknown  7  4 
Grade  
I  35 (36%)  33 (32%) 
II  32 (33%)  36 (35%) 
III  31 (32%)  33 (32%) 
This data is simulated  
^{1} Median (IQR); n (%) 
The {gtsummary} tbl_summary()
function and the related
functions have sensible defaults for rounding and presenting results. If
you, however, would like to change the defaults there are a few options.
The default options can be changed using the {gtsummary} themes function
set_gtsummary_theme()
. The package includes prespecified
themes, and you can also create your own. Themes can control baseline
behavior, for example, how p-values and percentages are rounded, which
statistics are presented in tbl_summary()
, default
statistical tests in add_p()
, etc.
For details on creating a theme and setting personal defaults, review the themes vignette.
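As a brief sketch of the theme workflow, the code below activates one of the themes shipped with the package, builds a table under it, and then restores the defaults. The choice of the JAMA journal theme is arbitrary, and trial2 is assumed to hold the three columns shown in this tutorial.

```r
library(gtsummary)
library(dplyr)

# Assumed subset used throughout the tutorial
trial2 <- trial %>% select(trt, age, grade)

# Activate a theme shipped with the package; it stays active until reset
theme_gtsummary_journal(journal = "jama")
tbl_themed <- tbl_summary(trial2, by = trt)

# Restore the package defaults for later tables
reset_gtsummary_theme()
tbl_default <- tbl_summary(trial2, by = trt)
```

Because a theme is a session-wide setting, resetting it after use keeps later tables from silently inheriting it.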
The {gtsummary} package also supports survey data (objects created
with the {survey} package)
via the tbl_svysummary()
function. The syntax of
tbl_svysummary()
and tbl_summary()
is nearly
identical, and the examples above apply to survey summaries as well.
To begin, install the {survey} package and load the
apiclus1
data set.
install.packages("survey")
# loading the api data set
data(api, package = "survey")
Before we begin, we convert the data frame to a survey object, registering the ID and weighting columns, and setting the finite population correction column.
svy_apiclus1 <-
  survey::svydesign(
    id = ~dnum,
    weights = ~pw,
    data = apiclus1,
    fpc = ~fpc
  )
After creating the survey object, we can now summarize it similarly
to a standard data frame using tbl_svysummary()
. Like
tbl_summary()
, tbl_svysummary()
accepts the
by=
argument and works with the add_p()
and
add_overall()
functions.
It is not possible to pass custom functions to the
statistic=
argument of tbl_svysummary()
. You
must use one of the predefined
summary statistic functions (e.g. {mean}
,
{median}
) which leverage functions from the {survey}
package to calculate weighted statistics.
svy_apiclus1 %>%
  tbl_svysummary(
    # stratify summary statistics by the "both" column
    by = both,
    # summarize a subset of the columns
    include = c(api00, api99, both),
    # adding labels to table
    label = list(api00 ~ "API in 2000",
                 api99 ~ "API in 1999")
  ) %>%
  add_p() %>% # comparing values by "both" column
  add_overall() %>%
  # adding spanning header
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Met Both Targets**")
Characteristic  Overall, N = 6,194^{1}  Met Both Targets                        p-value^{2}
                                        No, N = 1,692^{1}  Yes, N = 4,502^{1}
API in 2000     652 (552, 718)          631 (556, 710)     654 (551, 722)      0.4
API in 1999     615 (512, 691)          632 (548, 698)     611 (497, 686)      0.2
^{1} Median (IQR)
^{2} Wilcoxon rank-sum test for complex survey samples
tbl_svysummary()
can also handle weighted survey data
where each row represents several individuals:
Titanic %>%
  as_tibble() %>%
  survey::svydesign(data = ., ids = ~ 1, weights = ~ n) %>%
  tbl_svysummary(include = c(Age, Survived))
Characteristic  N = 2,201^{1} 

Age  
Adult  2,092 (95%) 
Child  109 (5.0%) 
Survived  711 (32%) 
^{1} n (%) 
Use tbl_cross()
to compare two categorical variables in
your data. tbl_cross()
is a wrapper for
tbl_summary()
that:
Uses percent = "cell" by default.
Adds row and column totals (customizable through the margin argument).
Displays missing data in the row and column variables (customizable through the missing argument).
         Chemotherapy Treatment  Total       p-value^{1}
         Drug A     Drug B
T Stage                                      0.9
T1       28 (14%)   25 (12%)     53 (26%)
T2       25 (12%)   29 (14%)     54 (27%)
T3       22 (11%)   21 (10%)     43 (22%)
T4       23 (12%)   27 (14%)     50 (25%)
Total    98 (49%)   102 (51%)    200 (100%)
^{1} Pearson's Chi-squared test
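The call that produced the cross table is not shown in the text. Judging from the row variable (T Stage), the column variable (treatment), the cell percentages, and the p-value column, a call along these lines is a reasonable assumption:

```r
library(gtsummary)
library(dplyr)

# Assumed reconstruction of the cross table: stage by treatment
tbl_stage <- trial %>%
  tbl_cross(row = stage, col = trt, percent = "cell") %>%
  add_p()   # adds the p-value comparing stage across treatments
```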