class: center, middle, inverse, title-slide # Introduction to {gt} + {gtsummary} Packages ### Daniel D. Sjoberg
Memorial Sloan Kettering Cancer Center ### June 27, 2019 --- class: center background-image: url(images/gt_logo.png) background-size: contain # {gt} package ??? - package from Rstudio - you may remember it from Stefania's review of the RStudio conference - gt is the grammar of tables - seeks to create a unifying code base for HTML, PDF, and RTF output - HTML and PDF are great. RTF is getting close - {gt} is wonderful, and we are doing the most brief introduction to it. - I recommend you look into further yourself --- # {gt} philosophy .large[ "We can construct a wide variety of useful tables with a cohesive set of table parts. These include the *table header*, the *stub*, the *stub head*, the *column labels*, the *table body*, and the *table footer*." ] <p align="center"><img src="images/gt_parts_of_a_table.png" width=60%></p> ??? - package has sets of functions for modifying each piece of a table - we'll review the most common/useful - {gt} documentation is so organized and color coded. - It's easy to find what you're looking for. --- # {gt} installation .large[ - {gt} is not on CRAN. - Use the code below to install from GitHub. ] ```r remotes::install_github("rstudio/gt") ``` .large[ - While you're at it, install {gtsummary} as well. ] ```r remotes::install_github("ddsjoberg/gtsummary") ``` .large[ - There is a version of {gtsummary} on CRAN, but with limited functionality. - Use the version on GitHub (<a href="https://github.com/ddsjoberg/gtsummary">www.github.com/ddsjoberg/gtsummary</a>). - The full version of {gtsummary} be released on CRAN after {gt} is released. ] --- # {gt} examples: the data .large[When used alone, the `gt()` function prints a data frame. But so much more is possible!] .pull-left[ ```r library(gt) # loading gtsummary for the data library(gtsummary) gt_trial_head <- head(trial) %>% * gt() ``` <p align="center"><img src="images/gt_trial_head.png" width=98%></p> ] -- .pull-right[ <p align="center"><img src="images/gt_trial_info.png" width=85%></p> ] ??? - the most common and useful is `gt()` - `gt()` is the first function to run, required every time - AFTER CLICK - Data set contains different types of data - Data is labelled! (leads to nice labels be default) --- # {gt} examples: the viewer .large[ - {gt} tables print to the RStudio viewer when in the global environment. ] <p align="center"><img src="images/gt_in_viewer.PNG" width=64%></p> .large[ - {gt} tables also print in R markdown documents (HTML, PDF, RTF), Shiny apps, etc. ] ??? - Word *.docx is NOT an output type. - Word can, however, read RTF documents. - RTF is how SAS creates in highly customized output, for example --- # {gt} examples: formatting columns ```r *trial_summary <- trial %>% group_by(trt) %>% summarise_at(vars(age, marker), mean, na.rm = TRUE) ``` .pull-left[ ### Raw Summary Statistics ```r gt_print <- * gt(trial_summary) ``` <p align="center"><img src="images/gt_print.png" width=64%></p> ] -- .pull-right[ ### Formatted Summary Statistics ```r gt_format <- gt(trial_summary) %>% * fmt_number(columns = vars(age), decimals = 0) %>% * fmt_number(columns = "marker", decimals = 2) ``` <img src="images/gt_format.png" width=50%></p> ] .bottom[.large[Each column can be formatted without creating a character version of the column!]] ??? - Remember this trial_summary object. We're using it for the next few slides - the raw print is not pretty - AFTER CLICK - we can add formatting, here we round columns with `fmt_number()` - we don't have to create character versions of our columns! wonderful! - use `vars()` because we can select multiple columns. - Input allows ALL {tidyselect} functions. - also accepts character vector of names --- # {gt} examples: formatting cells ```r gt_fmt_cell <- trial_summary %>% gather("variable", "mean", -trt) %>% * gt() %>% * fmt_number(columns = vars(mean), rows = (variable == "age"), decimals = 0) %>% * fmt_number(columns = vars(mean), rows = (variable == "marker"), decimals = 2) ``` .pull-left[ <p align="center"><img src="images/gt_fmt_cell.png" width=75%></p> ] .pull-right[.large[ - Use the `rows = ` argument to pinpoint a cell to format. - There are many formatting functions available: `fmt_percent()`, `fmt_currency()`, `fmt_date()`, `fmt_time()`, `fmt_missing()`, and more. - You can write your own function and pass it to `fmt()` to format a table. ]] ??? - the `fmt_number()` function can also used to format a single cell - lots of formatting functions to choose from - write your own formatting functions --- # {gt} examples: grouping data ```r gt_group <- trial_summary %>% gather("variable", "mean", -trt) %>% * gt(groupname_col = "trt") %>% fmt_number(columns = vars(mean), rows = variable == "age", decimals = 0) %>% fmt_number(columns = vars(mean), rows = variable == "marker", decimals = 2) ``` .pull-left[ <p align="center"><img src="images/gt_group.png" width=35%></p> ] .pull-right[.large[ - Use the `groupname_col = ` argument to specify a column to group results. - The grouping column is not printed and a stub row for each group is added. ]] --- # {gt} examples: column formatting ```r gt_cols <- trial_summary %>% gt() %>% fmt_number(columns = vars(age), decimals = 0) %>% fmt_number(columns = vars(marker), decimals = 2) %>% * cols_label(trt = md("**Treatment**"), age = md("**Age**"), marker = md("**Marker**")) %>% * tab_spanner(label = "Patient Characteristics", columns = vars(age, marker)) ``` .pull-left[ <p align="center"><img src="images/gt_cols.png" width=95%></p> ] .pull-right[.large[ - The `cols_label()` function modifies the column headers. - The `tab_spanner()` function includes a spanning header row. - The `md()` function interprets input text as Markdown (see also `html()`). ]] --- # {gt} examples: titles & footnotes ```r gt_title_footnote <- trial_summary %>% gt() %>% fmt_number(columns = vars(age), decimals = 0) %>% fmt_number(columns = vars(marker), decimals = 2) %>% cols_label(trt = md("**Treatment**"), age = md("**Age**"), marker = md("**Marker**")) %>% * tab_header(title = "Patient Characteristics", subtitle = "Presented by treatment") %>% * tab_footnote(footnote = "Statistic presented is the mean.", * locations = cells_column_labels(columns = vars(age, marker))) ``` .pull-left[ <p align="center"><img src="images/gt_title_footnote.png" width=70%></p> ] .pull-right[.large[ - It's easy to include titles, subtitles, footnotes, and source notes in {gt} tables. - The footnotes are automatically numbered based on where they appear in the table. ]] --- class: center background-image: url(images/gt_functions.svg) background-size: contain # much much more {gt} to learn ??? - color coded documentation --- class: center background-image: url(images/gtsummary_logo1.png) background-size: contain # {gtsummary} ??? - Still working on the hex sticker! - Font? - Bubble placement is so important --- # {gtsummary} introduction <p align="center"><img src="images/biostatR_flow.png" width=70%></p> .large[ - {gtsummary} will soon be a part of the biostatR-verse of packages. - The package uses {gt} as its back end to create tables. - Used to summarize data frames, regression models, and more. - Has a tidy API, sensible defaults (meaning minimal code), and is highly customizable. ] ??? - some functions from biostatR v0.1 have been exported to mskR - mskR and biostatR are both undergoing an uncoupling. - biostatR will comprise of three packages - biostatR won't fully undergo the uncoupling until {gt} is released on CRAN (or their RTF output is improved) --- # {gtsummary} summarize data with tbl_summary() .large[Let's review the data once more] .pull-left[ <p align="center"><img src="images/gt_trial_info.png" width=90%></p> ] .pull-right[ .large[For brevity, we'll use an abbreviated version of the trial data set with fewer columns.] ```r sm_trial <- trial %>% select(trt, age, response, grade) ``` ] --- # {gtsummary} summarize data with tbl_summary() .pull-left[ ```r tbl_summary_1 <- * tbl_summary(sm_trial, by = "trt") ``` .large[ - Default statistics are median (IQR) for continuous variables, and n (percent) for categorical data. - By default, variables coded as 0/1, TRUE/FALSE, and Yes/No are presented dichotomously. ] ] .pull-right[ <p align="center"><img src="images/tbl_summary_1.png" width=100%></p> ] ??? - Go slow here - summarizing a data set is the MOST important analysis - summarize data first! you will often catch mistakes. Data is complicated, and understanding it up front is important. - Communicating a summary of the data ALONG with analytic results in necessary (others may catch mistakes you're not aware of) - {gtsummary} is for presenting results, other great packages are available for summarizing data for your self (e.g. skimr) - just one line of code - all functions beginning with `tbl_*` create a new tables - this is how I used the package 95% percent of the time...so easy - three types of data shown here (explain them) --- # {gtsummary} summarize data with tbl_summary() .pull-left[ ```r tbl_summary_2 <- tbl_summary(sm_trial, by = "trt") %>% * add_p() ``` .large[ - To compare values across two or more groups, use the `add_p()` function. - The default tests are the Wilcoxon rank-sum test for continuous variables, chi-square test of independence for most categorical data, and Fisher's exact test for categorical data with low expected counts. ] ] .pull-right[ <p align="center"><img src="images/tbl_summary_2.png" width=100%></p> ] ??? - `add_p()` account for another 4% of how I use the function - all functions beginning with `tbl_*` create a new tables, and `add_*` add information to an existing table - explain the test defaults - 2 or more groups - random effects for correlated data --- # {gtsummary} and the {glue} package: an aside .large[ - {glue} is similar to paste (but I like it so much more). - Embed R expressions in curly braces. - They are then evaluated and inserted into the argument string. ] ```r name = "Daniel" x = 1 *glue::glue("{name} is number {x}") ``` ``` ## Daniel is number 1 ``` .large[ - Expression can be complex. ] ```r *glue::glue("{name} is number {((x + 100) * 10) - 1009}") ``` ``` ## Daniel is number 1 ``` ??? - an aside - glue is used in some {gtsummary} function arguments, also {gt} --- # {gtsummary} summarize data with tbl_summary() .pull-left[ ```r tbl_summary_3 <- sm_trial %>% tbl_summary( by = "trt", * statistic = list( * all_continuous() ~ "{mean} ({sd})", * all_categorical() ~ "{n} / {N} ({p}%)" * ), * label = "age" ~ "Patient Age" ) %>% * add_p(test = all_continuous() ~ "t.test") ``` .large[ - Report mean and standard deviation for continuous variables. - Specify label for age variable. - Report p-values from the t-test. ] ] .pull-right[ <p align="center"><img src="images/tbl_summary_3.png" width=100%></p> ] ??? - defaults are great, let's change the default behavior - statistics can be changed to anything...literally any R function - discuss the formula notation - it's like `case_when()`, condition/variable on LHS and result on RHS - one formula doesn't need to be in a list, but more than one must be listed - the vignette has examples with more examples --- # {gtsummary} summarize data with tbl_summary() .pull-left[ ```r tbl_summary_4 <- sm_trial %>% tbl_summary( by = "trt", * type = "response" ~ "categorical", statistic = all_continuous() ~ "{mean} ({sd})", * digits = vars(age) ~ c(0, 1) ) %>% add_p(test = all_continuous() ~ "t.test") %>% * add_stat_label() ``` .large[ - Report levels for the response variable. - Modify the default rounding for age. - Add column of statistics presented. - Footnote about statistics is gone! ] ] .pull-right[ <p align="center"><img src="images/tbl_summary_4.png" width=100%></p> ] ??? - further discuss formula notation - just like {gt} can use both select helpers OR characters vector of names - discuss digits and how it's used - discuss `stat_label = `, and mention the footnote was omitted --- # {gtsummary} summarize data with tbl_summary() .large[Advanced Customization - It's natural a {gtsummary} package user would want to customize the aesthetics of the table with one or more of the many {gt} functions available. - Every function in {gt} is available to use with a {gtsummary} object. 1. Create a {gtsummary} table. 1. Convert the table to a {gt} object with the `as_gt()` function. 1. Continue formatting as a {gt} table with any {gt} function. ] ??? Discuss `as_gt()` and how to use --- # {gtsummary} summarize data with tbl_summary() .pull-left[ .large[Advanced Customization] ```r tbl_summary_5 <- sm_trial %>% tbl_summary(by = "trt") %>% # convert from gtsummary object to gt object * as_gt() %>% # modify with gt functions * tab_spanner( * label = "Randomization Group", * columns = starts_with("stat_") * ) ``` .footnote[More on this in the `tbl_summary()` <a href="http://www.danieldsjoberg.com/gtsummary/articles/tbl_summary.html#advanced-customization">vignette</a>] ] .pull-right[ <p align="center"><img src="images/tbl_summary_5.png" width=90%></p> ] --- class: left # {gtsummary} summarize data with tbl_summary() .large[ Review the tbl_summary vignette for more details http://www.danieldsjoberg.com/gtsummary/articles/tbl_summary.html ] .pull-left[.large[ - Reporting any statistic for continuous variables, including user-written functions. - More on dichotomous variables and how to specify the level printed. - Missing data options (e.g. report as a column rather than a row, always report N missing even when no missing data, modify missing text, etc.). ]] .pull-right[.large[ - Sort categorical variables by frequency. - Report row percent, rather than column percent. - Report q-values from various methods like false discovery rate. - Sort data by ascending p-values when comparisons have been made. ]] ??? There is more that we are not covering here --- # {gtsummary} summarize models with tbl_regression() ### Raw Output ```r *m1 <- glm(response ~ trt + grade + age, data = trial, family = binomial) m1 ``` ``` ## ## Call: glm(formula = response ~ trt + grade + age, family = binomial, ## data = trial) ## ## Coefficients: ## (Intercept) trtPlacebo gradeII gradeIII age ## 0.449477 -0.660514 -0.504973 -0.167732 -0.004199 ## ## Degrees of Freedom: 181 Total (i.e. Null); 177 Residual ## (18 observations deleted due to missingness) ## Null Deviance: 249.1 ## Residual Deviance: 242.8 AIC: 252.8 ``` ??? - it's not pretty - most often I want the odds ratios from a logistic regression, not the betas - format from every type of model is different and difficult to work with --- # {gtsummary} summarize models with tbl_regression() ### {broom} Output ```r *broom::tidy(m1, conf.int = TRUE, exponentiate = TRUE) ``` ``` ## # A tibble: 5 x 7 ## term estimate std.error statistic p.value conf.low conf.high ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 1.57 0.565 0.796 0.426 0.520 4.81 ## 2 trtPlacebo 0.517 0.309 -2.14 0.0324 0.280 0.941 ## 3 gradeII 0.604 0.390 -1.30 0.195 0.279 1.29 ## 4 gradeIII 0.846 0.369 -0.455 0.649 0.409 1.74 ## 5 age 0.996 0.0106 -0.395 0.693 0.975 1.02 ``` ??? - MUCH MUCH better! - all models returned with consistent format - but does not include reference groups - still needs additional modification before it can be presented --- # {gtsummary} summarize models with tbl_regression() ### {gtsummary} Output ```r *tbl_regression_1 <- tbl_regression(m1, exponentiate = TRUE) ``` .pull-left[.large[ - `tbl_regression()` accepts regression model object as inputs. - Reference groups added to the table. - Logistic regression model with odds ratio header and footnote. ]] .pull-right[ <p align="center"><img src="images/tbl_regression_1.png" width=90%></p> ] ??? - This table is ready for publication in a single line of code! - That is something no other package I know of can do - The back end for the function is {broom} and {gt}, meaning that there is broad support for most regression model types, and the resulting tables are gorgeous and customizable. - Common regression models, such as logistic regression and Cox regression, are automatically identified and the tables are created with appropriate headers. - build the regression model on your own....we are not in the business of model estimation or checking --- # {gtsummary} summarize models with tbl_regression() .pull-left[ ```r tbl_regression_2 <- m1 %>% tbl_regression(exponentiate = TRUE) %>% * add_global_p() ``` .large[ - Replace individual p-values for categorical variables with global p-value for the entire variable. ] ] .pull-right[ <p align="center"><img src="images/tbl_regression_2.png" width=100%></p> ] ??? Thank you to Stefania for suggesting an improved name change for `add_global_p()`! --- # {gtsummary} summarize models with tbl_regression() .pull-left[ ```r library(survival) tbl_regression_3 <- coxph(Surv(ttdeath, death) ~ trt + grade + age, data = trial) %>% tbl_regression(exponentiate = TRUE) tbl_regression_4 <- * tbl_merge( * tbls = list(tbl_regression_1, tbl_regression_3), * tab_spanner = c("Tumor Response", "Time to Death") * ) ``` .large[ - Build Cox regression model with same predictors as previous model. - Merge the two regression models with the same predictors and present results side-by-side. ] ] .pull-right[ <p align="center"><img src="images/tbl_regression_4.png" width=100%></p> ] ??? - side-by-side regression results is common in cancer research (e.g. time to recurrence, then time to death) - stacking two or more models is also possible - easy to create custom tables that are formatted beautifully --- # {gtsummary} summarize data with tbl_uvregression() .pull-left[ ```r library(survival) tbl_uvregression_1 <- * tbl_uvregression( * sm_trial, * method = glm, * y = response, * method.args = list(family = binomial), * exponentiate = TRUE * ) ``` .large[ - Table of univariate regression models. - Specify the outcome, and the remaining variables in data frame serve as predictors. ] ] .pull-right[ <p align="center"><img src="images/tbl_uvregression_1.png" width=100%></p> ] ??? - Tables of univariate results can be good for exploratory analysis - Code is similar to {ggplot2} `geom_smooth()` and `stat_smooth()` - also great with time-to-event endpoint when you cannot do a `tbl_summary()` to get bivariate p-values --- # {gtsummary} summarize data with tbl_survival() .pull-left[ ```r *fit1 <- survfit(Surv(ttdeath, death) ~ trt, * data = trial) survminer::ggsurvplot( fit = fit1, xlab = "Months", ylab = "Overall survival probability", legend.title = "Treatment Group", legend.labs = c("Drug", "Placebo"), break.x.by = 6, censor = FALSE, risk.table = TRUE, risk.table.y.text = FALSE ) ``` ] .pull-right[ ![](index_files/figure-html/unnamed-chunk-44-1.png)<!-- --> ] ??? - You've probably seen something like this before - It's a Kaplan-Meier curve. It shows the probability of being free from an event (e.g. cancer recurrence after treatment) - we can use {gtsummary} to grab estimates from curves like this --- # {gtsummary} summarize data with tbl_survival() .pull-left[ ```r tbl_survival_1 <- fit1 %>% * tbl_survival(times = c(12, 24), * label = "{time} Month") ``` .large[ - First, use `survfit()` to estimate survival times. - Create table of estimates with `tbl_survival()`. - Can use this function to print survival quantiles as well, e.g. median survival. ] ] .pull-right[ <p align="center"><img src="images/tbl_survival_1.png" width=85%></p> ] --- # {gtsummary} reporting results with inline_text() .large[ - Tables are important, but we often need to report results in-line in a report. - Any statistic reported in a {gtsummary} table can be extracted and reported in-line in a R Markdown document with the `inline_text()` function. ```r inline_text(tbl_regression_1, variable = "trt", level = "Placebo") ``` ``` 0.52 (95% CI 0.28, 0.94; p=0.032) ``` - The pattern of what is reported can be modified with the `pattern = ` argument. - Default is `pattern = "{estimate} ({conf.level*100}% CI {conf.low}, {conf.high}; {p.value})"`. ] ??? - discuss importance of reproducible results - data is constantly updating - this functionality assures you won't miss updating a reported estimate in a document - for me, this is one the most powerful parts of the {gtsummary} package - something I've never seen in another package --- class: center # {gtsummary} .large[ • Every function is documented further in the help file • • Check out the package website for vignettes including detailed examples and explanations • <img src = "images/open-book-white.png" width="2.4%" height="2.4%"> {gtsummary} documentation <a href="http://www.danieldsjoberg.com/gtsummary/">danieldsjoberg.com/gtsummary/</a> <img src = "images/github_icon.png" width="2.4%" height="2.4%"> {gtsummary} package <a href="https://github.com/ddsjoberg/gtsummary">github.com/ddsjoberg/gtsummary</a> <img src = "images/slide_show_icon.png" width="2.4%" height="2.4%"> slides at <a href="http://www.danieldsjoberg.com/gt-and-gtsummary-presentation">danieldsjoberg.com/gt-and-gtsummary-presentation</a> <img src = "images/github_icon.png" width="2.4%" height="2.4%"> source code for slides at <a href="https://github.com/ddsjoberg/gt-and-gtsummary-presentation">github.com/ddsjoberg/gt-and-gtsummary-presentation</a> <img src = "images/github_icon.png" width="2.4%" height="2.4%"> {gt} package <a href="https://github.com/rstudio/gt">github.com/rstudio/gt</a> ] ??? Go star {gtsummary} on GitHub...we're already to 50+! --- # {gtsummary} Advanced .large[ {gtsummary} output is a list that prints as a {gt} table. ] ```r names(tbl_summary_1) ``` ``` ## [1] "gt_calls" "table_body" "meta_data" "inputs" "call_list" ## [6] "by" "df_by" ``` .pull-left[ ```r pluck(tbl_summary_1, "table_body") %>% head() ``` ``` ## # A tibble: 6 x 5 ## variable row_type label stat_1 stat_2 ## <chr> <chr> <chr> <chr> <chr> ## 1 age label Age, yrs 47 (39, 58) 45 (36, 54) ## 2 age missing Unknown 6 3 ## 3 response label Tumor Response 53 (51%) 30 (34%) ## 4 response missing Unknown 4 5 ## 5 grade label Grade <NA> <NA> ## 6 grade level I 38 (36%) 29 (31%) ``` ] .pull-right[ ```r pluck(tbl_summary_1, "gt_calls") %>% head(n = 4) ``` ``` ## $gt ## gt(data = x$table_body) ## ## $cols_label_label ## cols_label(label = md('**Characteristic**')) ## ## $cols_align ## cols_align(align = 'center') %>% cols_align(align = 'left', columns = vars(label)) ## ## $cols_hide ## cols_hide(columns = vars(variable, row_type)) ``` ] ??? If there is time, review the structure of a {gtsummary} object Essentially, what is going on is that the {gt} calls on the right are called on the table on the left whenever the object is printed. Understanding this structure will help you modify if you need. If there is a {gt} call that formats in a way you don't like, convert your object with `as_gt()` and use the `omit =` argument to leave out the gt call you don't like. You can replace it with whatever you choose.