# Reproducibility with {gtsummary}

### Daniel D. Sjoberg, Karissa Whiting

### August 28, 2020
### R/Medicine

<img src="images/msk-white-logo.png" width=25%>
 
#### bit.ly/rmed-gtsummary
 
---

# the problem: (ir)reproducibility

- **Low-quality code** in medical research is part of the problem

- Low-quality code is more likely to **contain errors**

- Reproducibility is often **cumbersome** and **time-consuming**

???

- reproducibility in medical research is a long-reported problem, and the code is an important part of this

- Doug Altman criticized the quality of medical research in 1994 and called it a scandal

- I don't need to explain this to the people here though! One of the big themes of this year's conference is reproducibility

- At European Urology we asked authors to submit their code.
    - one-third of papers with nontrivial statistics indicated they did not use code

---
# gtsummary: (a part of) the solution
<img src="images/gtsummary-HexSticker.png" width=45%>

???

- Nice output is HARD, and often takes a lot of custom coding and time. In a pinch, it's easier to not follow best practices.

- Wanted to create a package geared towards medical research that makes creating the most common tables as simple as possible

- Make it simple
    
    - Also very flexible
    
    - Most packages for R use APA formatting...not helpful for medical journals (also, life, but that is just my opinion!)

- Eliminate the step of tweaking after you export your results

---
# gtsummary: (a part of) the solution

### Types of summaries:
.center[
.xlarge[
.pull-left[
**"Table 1"-types**

**Cross tabulation**
]

**Survival data**
]
]

_Report statistics from tables **in-line**_

_**Themes** to control aspects of all tables_

_Choices on **print engine**_

???

Includes survey data as well with `tbl_svysummary()`

---

# summarize data with tbl_summary()

---
# summarize data with tbl_summary()

**Goal**: Summarize data by treatment groups:
- Age
- Tumor Response
- Tumor Grade

```r
library(gtsummary)
library(tidyverse)

sm_trial <-
  trial %>%
  select(trt, age, grade, response)
```

.pull-right[
<img src="images/gt_trial_info.png" width=90%>
]

???
- This is an abbreviated version of the example data used in the package help files/documentation.

- note that the columns have been labeled using the {labelled} package, and those labels are used throughout the package

---
# summarize data with tbl_summary()
.pull-left[
.large[
**basic tbl_summary()**
]

```r
tbl_summary_1 <-
  sm_trial %>%
  tbl_summary(by = trt)
```

- Default statistics are `median (IQR)` for continuous variables, and `n (%)` for categorical/dichotomous data

- Variables coded as `0/1`, `TRUE/FALSE`, and `Yes/No` are presented dichotomously by default
]
.pull-right[
<img src="images/tbl_summary_1.png" width=85%>
]
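
The dichotomous default can be overridden when both levels of a 0/1 variable should be listed. A minimal sketch, assuming the `type` argument of `tbl_summary()` and the 0/1-coded `response` column of `trial`; the object name `tbl_summary_cat` is only illustrative:

```r
library(gtsummary)
library(dplyr)

# `response` is coded 0/1, so by default it is summarized on a single row.
# Assumption for illustration: we want one row per level instead.
tbl_summary_cat <-
  trial %>%
  select(trt, age, grade, response) %>%
  tbl_summary(
    by = trt,
    type = response ~ "categorical"
  )
```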

???
- Go slow here

- summarizing a data set is the MOST important analysis

- summarize data first!  you will often catch mistakes.  Data is complicated, and understanding it up front is important.

- Communicating a summary of the data ALONG with analytic results is necessary (others may catch mistakes you're not aware of)

- {gtsummary} is for presenting results; other great packages are available for summarizing data for yourself (e.g. skimr)

- just one line of code

- all functions beginning with `tbl_*` create a new table

- this is how I use the package 95% of the time...so easy

- three types of data shown here (explain them)

---
# summarize data with tbl_summary()

```r
tbl_summary_2 <-
  sm_trial %>%
  tbl_summary(
    by = trt,
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} / {N} ({p}%)"
    ),
    label = age ~ "Patient Age",
    digits = age ~ 2,
    missing_text = "N Unknown"
  )
```

.medium[
- `statistic` Report mean and standard deviation for continuous variables
- `label` Specify label for age
- `digits` Specify number of decimals to round to
- `missing_text` Text to appear for N missing
]
.pull-right[
<img src="images/tbl_summary_2.png" width=95%>
]

???

- defaults are great, let's change the default behavior

- statistics can be changed to anything...literally any R function (e.g. variance)

- discuss the formula notation
    - it's like `case_when()`, condition/variable on LHS and result on RHS
    - one formula doesn't need to be in a list, but more than one must be listed

- the vignette has more examples
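
A possible illustration of the two points above, as a sketch rather than the package's canonical usage: a statistic can name an arbitrary summary function (here `var`), and a single formula can be passed alone while several must be wrapped in `list()`. The choice of the `marker` column and the object name `tbl_summary_var` are assumptions made for the example.

```r
library(gtsummary)
library(dplyr)

tbl_summary_var <-
  trial %>%
  select(trt, age, marker) %>%
  tbl_summary(
    by = trt,
    # a single formula can be passed on its own
    label = age ~ "Patient Age",
    # two or more formulas are wrapped in list()
    statistic = list(
      age ~ "{mean} ({var})",
      marker ~ "{median} ({min}, {max})"
    ),
    # hide the rows counting missing values
    missing = "no"
  )
```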

---
background-image: url(images/Dan-tbl_summary_small_example.png)
background-position: center
background-size: cover

---
# {gtsummary} + formulas
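
The formula notation reads like `dplyr::case_when()`: what to match goes on the left-hand side and what to apply goes on the right. A minimal side-by-side sketch; the vector `x` and the object names are made up for illustration:

```r
library(gtsummary)
library(dplyr)

# case_when(): condition on the left, result on the right
x <- c(35, 52, 71)
age_group <- case_when(
  x < 40 ~ "<40",
  x < 60 ~ "40-59",
  TRUE   ~ "60+"
)

# gtsummary: variable or selector on the left, setting on the right
tbl_formula <-
  trial %>%
  select(trt, age, grade) %>%
  tbl_summary(
    by = trt,
    label = list(age ~ "Patient Age", grade ~ "Tumor Grade"),
    statistic = all_continuous() ~ "{mean} ({sd})"
  )
```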

???
- case_when uses similar syntax

---
# tbl_summary() + helper functions

- `add_*()` add additional column of statistics or information, e.g. p-values, q-values, overall statistics, N obs., and more

- `modify_*()` modify table headers, spanning headers, and footnotes

- `bold_*()/italicize_*()` style labels, variable levels, significant p-values


???
The modify functions and the bold functions work on ALL gtsummary tables

---
background-image: url(images/Dan-tbl_summary_big_example.png)
background-position: center
background-size: cover

---
# cross table with tbl_cross()

`tbl_cross()` is a wrapper for `tbl_summary()` for **n x m** tables

```r
tbl_cross_1 <-
  sm_trial %>%
  tbl_cross(
    row = trt,
    col = grade,
    percent = "row",
    margin = "row"
  ) %>%
  add_p(source_note = TRUE)
```

---

# summarizing models with tbl_regression()

---
# summarize models with tbl_regression()

```r
library(gtsummary)
library(tidyverse)

m1 <- glm(
  response ~ age + stage,
  data = trial,
  family = binomial(link = "logit")
)

summary(m1)
```
.large[
- Display **odds ratio** estimates and **confidence intervals**

- Display **p-values** for covariates

- Show **reference levels** for categorical variables
]


---
# summarize models with tbl_regression()
.pull-left[
.large[
**basic tbl_regression() code**
]

```r
tbl_logreg <-
  tbl_regression(m1, exponentiate = TRUE)
```

- **Variable labels** are displayed

- Coefficients are exponentiated and **Odds Ratios** are displayed
]
.pull-right[
<img src="images/tbl_logreg.png" width=90%>
]

???
- Model estimates and confidence intervals are rounded and nicely formatted.

- P-values above 0.9 are presented as “>0.9” and below 0.001 are presented as “<0.001”. Non-significant p-values are only rounded to one decimal, while those close to or below the significance threshold (default 0.05) have additional decimal places by default.

- Variable levels are indented and footnotes are added if printed using {gt}. Tables can alternatively be printed using `knitr::kable()`.
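
The p-value formatting described above can be tried directly with `style_pvalue()`, the same function passed to `pvalue_fun` elsewhere in this deck. A small sketch; the vector `p` is made up for illustration:

```r
library(gtsummary)

# illustrative p-values: large, moderate, near the threshold, very small
p <- c(0.95, 0.45, 0.04, 0.0004)

# default formatting: fewer decimals for large p-values
p_default <- style_pvalue(p)

# more decimals for large p-values
p_two_digits <- style_pvalue(p, digits = 2)

p_default
p_two_digits
```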

---
# summarize models with tbl_regression()
.pull-left[
.large[
**customize regression tables**
]

```r
tbl_logreg2 <-
  tbl_regression(
    m1,
    exponentiate = TRUE,
    pvalue_fun = ~style_pvalue(.x, digits = 2)
  ) %>%
  bold_labels() %>%
  italicize_levels() %>%
  add_global_p() %>%
  bold_p(t = 0.1)
```

.medium[
- `exponentiate` - Exponentiate model coefficients to display ORs
- `pvalue_fun` - Round and format p-values
- `add_global_p()` - Calculate global p-values for categorical variables
- `bold_p()` - Bold p-values at a specific threshold
]
]

???
- use arguments and helper functions to customize

---
background-image: url(images/tbl_regression_markup.png)
background-position: center
background-size: contain

---

# tbl_uvregression() univariate models
.pull-left[
.large[
**basic tbl_uvregression() code**
]

```r
tbl_uvreg <-
  trial %>%
  select(age, stage, response) %>%
  tbl_uvregression(
    method = glm,
    y = response,
    method.args = list(family = binomial),
    exponentiate = TRUE
  )
```

- Arguments and helper functions like `exponentiate`, `bold_*()`, `add_global_p()` can also be used with `tbl_uvregression()`
]
.pull-right[
<img src="images/tbl_uvreg.png" width=90%>
]

???
- OR was recognized due to exponentiate arg

---

# inline reporting with inline_text()


---

# inline reporting with inline_text()

**In Report:** 
The odds ratio for age is 1.02 (95% CI 1.00, 1.04; p=0.091)
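
A sentence like the one above can be generated from the table object instead of being typed by hand. A minimal sketch, assuming the logistic model and `tbl_logreg` table defined on the earlier slides (rebuilt here so the chunk runs on its own):

```r
library(gtsummary)

m1 <- glm(
  response ~ age + stage,
  data = trial,
  family = binomial(link = "logit")
)
tbl_logreg <- tbl_regression(m1, exponentiate = TRUE)

# statistics for a continuous covariate
res_age <- inline_text(tbl_logreg, variable = "age")

# statistics for one level of a categorical covariate
res_t3 <- inline_text(tbl_logreg, variable = "stage", level = "T3")

# In R Markdown text this would be written as:
# The odds ratio for age is `r inline_text(tbl_logreg, variable = "age")`
res_age
```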

---

# tbl_merge() and tbl_stack()

```r
tbl_merge_1 <-
  tbl_merge(
    list(tbl_uvreg, tbl_logreg2),
    tab_spanner = c("**Univariable**", "**Multivariable**")
  )
```
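
Where `tbl_merge()` places tables side by side, `tbl_stack()` appends them vertically. A minimal sketch; the two models and the row labels are illustrative assumptions, not taken from the talk:

```r
library(gtsummary)

# unadjusted treatment effect
t_unadj <-
  tbl_regression(
    glm(response ~ trt, data = trial, family = binomial),
    exponentiate = TRUE,
    label = trt ~ "Treatment (unadjusted)"
  )

# treatment effect adjusted for grade and stage; show only the trt rows
t_adj <-
  tbl_regression(
    glm(response ~ trt + grade + stage, data = trial, family = binomial),
    include = trt,
    exponentiate = TRUE,
    label = trt ~ "Treatment (adjusted)"
  )

# one table with the adjusted rows below the unadjusted rows
tbl_stack_1 <- tbl_stack(list(t_unadj, t_adj))
```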

---

# advanced customization with themes

---
# {gtsummary} + Themes

.large[
- A **theme** is a defined set of customization preferences that can be easily set and reused.

- Themes control **default settings for existing functions** (e.g. always present mean instead of median in `tbl_summary()`)

- Themes control more **fine-grained customization** not available via arguments or helper functions

- Easily use one of the **available package themes**, or **create your own**!

]

???

---
# {gtsummary} + Themes


---
# {gtsummary} + Themes

- `theme_gtsummary_journal()` - formats tables to specific publication guidelines.

---
# {gtsummary} + Themes

- `theme_gtsummary_journal()` - formats tables to specific publication guidelines.

- `theme_gtsummary_language()` - translates table text

---
# {gtsummary} + Themes

- `theme_gtsummary_journal()` - formats tables to specific publication guidelines.

- `theme_gtsummary_language()` - translates table text

- `theme_gtsummary_compact()` - reduces padding and font size

---
# {gtsummary} + Themes

- `theme_gtsummary_journal()` - formats tables to specific publication guidelines.

- `theme_gtsummary_language()` - translates table text

- `theme_gtsummary_compact()` - reduces padding and font size

- `set_gtsummary_theme(my_theme)` - create your own!
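
A sketch of how these calls might be used in practice. The theme element names in `my_theme` (`"pkgwide-str:theme_name"`, `"pkgwide-fn:pvalue_fun"`) are assumptions that can differ between package versions, so check the themes vignette for the installed release:

```r
library(gtsummary)

# built-in themes: set them before building a table
theme_gtsummary_journal(journal = "jama")
theme_gtsummary_compact()
tbl_themed <- tbl_summary(trial[c("trt", "age", "grade")], by = trt)

# return to the package defaults
reset_gtsummary_theme()

# a custom theme is a named list of theme elements
# (element names are assumed; they may vary by gtsummary version)
my_theme <- list(
  "pkgwide-str:theme_name" = "My Theme",
  "pkgwide-fn:pvalue_fun" = function(x) style_pvalue(x, digits = 2)
)
set_gtsummary_theme(my_theme)
tbl_custom <- tbl_summary(trial[c("trt", "age", "grade")], by = trt)
reset_gtsummary_theme()
```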

---

# {gtsummary} + R Markdown

---
background-image: url(images/gtsummary_rmarkdown.png)
background-position: center
background-size: cover

# {gtsummary} + R Markdown
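
The print engine can be chosen per table inside an R Markdown chunk. A minimal sketch, assuming the {gt} and {knitr} packages are installed; the object names are illustrative:

```r
library(gtsummary)

tbl <- tbl_summary(trial[c("trt", "age", "grade")], by = trt)

# convert to a {gt} table, e.g. for HTML output
gt_tbl <- as_gt(tbl)

# convert to a knitr::kable() table, e.g. for plain markdown output
kable_tbl <- as_kable(tbl)
```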

---

# Thank You

.medium[
**{gtsummary} authors & contributors:** 
Daniel D. Sjoberg (@statistishdan), Margie Hannum (@Margaret_Hannum), Karissa Whiting (@karissawhiting), Emily Zabor, Michael Curry, Esther Drill, Jessica Flynn, Joseph Larmarange, Stephanie Lobaugh, Gustavo Zapata Wainberg

Check out package docs at <a href="http://www.danieldsjoberg.com/gtsummary">http://www.danieldsjoberg.com/gtsummary</a>
]

---

# code example 1

```r
trial %>%
  select(age, grade, trt) %>%
  tbl_summary(
    # stratify table by treatment
    by = trt,
    # show mean and SD for age, and add denom for grade
    statistic = list(age ~ "{mean} ({sd})",
                     grade ~ "{n} / {N} ({p}%)"),
    # update label for grade
    label = grade ~ "Tumor Grade",
    # updated text for missing values
    missing_text = "N Unknown"
  )
```

---
background-image: url(images/Dan-tbl_summary_small_example.png)
background-position: center
background-size: cover

---
# code example 2

```r
# summarize variables by treatment received
trial %>%
  select(age, grade, trt) %>%
  tbl_summary(by = trt) %>%
  # add a column with statistics not stratified by treatment
  add_overall() %>%
  # compare values across treatments
  add_p() %>%
  # add a column with number of non-missing observations
  add_n() %>%
  # bold the variable labels
  bold_labels() %>%
  # add a header over the treatment columns
  modify_spanning_header(c(stat_1, stat_2) ~ "**Treatment Received**") %>%
  # update the column header
  modify_header(label ~ "**Variable**") %>%
  # update the default footnote for the statistics presented
  modify_footnote(starts_with("stat_") ~ "Median (IQR) or Frequency (%)")
```

---
background-image: url(images/Dan-tbl_summary_big_example.png)
background-position: center
background-size: cover

```r
pagedown::chrome_print("index.html")
```