Exercise 1

The Data

Using the National Health and Nutrition Examination Survey Data (nhefs), we will assess whether quitting smoking leads to a decrease in the risk of death. We’ll also investigate whether potential confounding factors–age, sex, blood pressure, diabetes, and exercise.

To begin, let’s pare down the data frame and assign column labels.

library(gtsummary)
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

theme_gtsummary_compact()

Setting theme `Compact`

df_nhefs <-
  causaldata::nhefs %>%
  select(death, qsmk, age, sex, sbp, dbp, exercise) %>%
  drop_na() %>%
  mutate(
    qsmk =
      factor(
        qsmk,
        levels = 0:1,
        labels = c("Did not Quit", "Quit")
      ),
    sex = 
      case_when(
        sex == 0 ~ "Male",
        sex == 1 ~ "Female"
      ),
    exercise = 
      factor(
        exercise,
        levels = 0:2,
        labels = c("Much exercise", "Moderate exercise", "Little or no exercise")
      ) %>%
      fct_rev()
  ) %>%
  labelled::set_variable_labels(
    death = "Participant Passed Away",
    qsmk = "Quit Smoking",
    age = "Age",
    sex = "Sex",
    sbp = "Systolic BP",
    dbp = "Diastolic BP",
    exercise = "Exercise Level"
  )

skimr::skim(df_nhefs)

Data summary
Name	df_nhefs
Number of rows	1548
Number of columns	7
_______________________
Column type frequency:
character	1
factor	2
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
sex	0	1	4	6	0	2	0

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
qsmk	0	1	FALSE	2	Did: 1157, Qui: 391
exercise	0	1	FALSE	3	Mod: 651, Lit: 597, Muc: 300

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
death	1	0.18	0.39	0	0	0	0	1	▇▁▁▁▂
age	1	43.62	11.99	25	33	43	53	74	▇▆▇▅▂
sbp	1	128.70	19.00	87	116	126	140	229	▃▇▂▁▁
dbp	1	77.74	10.63	47	70	77	85	130	▁▇▅▁▁

Exercise 2

Prepare a table describing the cohort by whether or not the participant quit smoking. Do not include death in the summary table. Consider using the gtsummary functions that build on a summary table.

gts_patient_characteristics <-
  df_nhefs %>%
  tbl_summary(
    by = qsmk,
    include = -death
  ) %>%
  add_p() %>%
  add_stat_label() %>%
  modify_spanning_header(
    all_stat_cols() ~ "**Smoking Status**"
  )
gts_patient_characteristics

Characteristic	Smoking Status		p-value¹
Characteristic	Did not Quit, N = 1,157	Quit, N = 391	p-value¹
Age, Median (IQR)	42 (33, 51)	46 (35, 56)	<0.001
Sex, n (%)			0.005
Female	619 (54%)	177 (45%)
Male	538 (46%)	214 (55%)
Systolic BP, Median (IQR)	125 (115, 138)	130 (118, 142)	<0.001
Diastolic BP, Median (IQR)	77 (70, 84)	78 (71, 86)	0.016
Exercise Level, n (%)			0.2
Little or no exercise	438 (38%)	159 (41%)
Moderate exercise	482 (42%)	169 (43%)
Much exercise	237 (20%)	63 (16%)
¹ Wilcoxon rank sum test; Pearson's Chi-squared test

Exercise 3

Is there a difference in death rates by smoking status?

Prepare an unadjusted and adjusted rate difference in the table.

# Visit https://www.danieldsjoberg.com/gtsummary/reference/tests.html for a listing of included tests.
# Tests that make use of the `add_difference(adj.vars=)` argument are adjusted analyses.

gts_death_unadjusted <-
  df_nhefs %>%
  tbl_summary(
    by = qsmk, 
    include = death,
    statistic = all_categorical() ~ "{p}%"
  ) %>%
  add_difference() %>%
  add_stat_label() 
gts_death_unadjusted

Characteristic	Did not Quit, N = 1,157	Quit, N = 391	Difference¹	95% CI^1,2	p-value¹
Participant Passed Away, %	17%	22%	-4.9%	-9.7%, -0.07%	0.037
¹ Two sample test for equality of proportions
² CI = Confidence Interval

gts_death_adjusted <-
  df_nhefs %>%
  tbl_summary(
    by = qsmk, 
    include = death,
    statistic = all_categorical() ~ "{p}%"
  ) %>%
  add_difference(
    test = all_categorical() ~ "emmeans",
    adj.vars = c(age, sex, sbp, dbp, exercise)
  ) %>%
  add_stat_label() 
gts_death_adjusted

Characteristic	Did not Quit, N = 1,157	Quit, N = 391	Adjusted Difference¹	95% CI^1,2	p-value¹
Participant Passed Away, %	17%	22%	1.0%	-2.2%, 4.2%	0.5
¹ Regression least-squares adjusted mean difference
² CI = Confidence Interval

Exercise 4

Build a logistic regression model with death as the outcome. Include smoking and the other variables as covariates. Summarize your regression model in a table, reporting the odds ratios, confidence intervals, and p-values.

mod <- glm(death ~ qsmk + age + sex + sbp + dbp + exercise, 
           data = df_nhefs, family = binomial)

gts_mod <-
  tbl_regression(mod, exponentiate = TRUE) %>%
  add_global_p() %>%
  bold_p() %>%
  italicize_levels() %>%
  modify_caption("**Logistic regression model predicting death**") %>%
  modify_header(label = "**Variable** (N = {N})")
gts_mod

**Logistic regression model predicting death**
Variable (N = 1548)	OR¹	95% CI¹	p-value
Quit Smoking			0.5
Did not Quit	—	—
Quit	0.90	0.64, 1.26
Age	1.11	1.10, 1.13	<0.001
Sex			<0.001
Female	—	—
Male	1.78	1.31, 2.43
Systolic BP	1.02	1.01, 1.03	<0.001
Diastolic BP	1.00	0.98, 1.01	0.6
Exercise Level			0.3
Little or no exercise	—	—
Moderate exercise	0.78	0.56, 1.08
Much exercise	0.81	0.52, 1.25
¹ OR = Odds Ratio, CI = Confidence Interval

Exercise 5

Write a brief summary of the results above using inline_text() to report values from the tables directly into the markdown report.

median_age_quit <-
  inline_text(gts_patient_characteristics, 
              variable = age, 
              column = "Quit", 
              pattern = "{median}")
median_age_no_quit <-
  inline_text(gts_patient_characteristics, 
              variable = age, 
              column = "Did not Quit", 
              pattern = "{median}")
pvalue_age <- 
  inline_text(gts_patient_characteristics, 
              variable = age,
              column = p.value)
unadj_difference <-
  inline_text(gts_death_unadjusted,
              variable = death,
              pattern = "difference {estimate}; 95% CI {ci}; {p.value}")

adj_qsmk_or <-
  inline_text(gts_mod,
              variable = qsmk,
              level = "Quit",
              pattern = "odds ratio {estimate}; 95% CI {ci}")

The analysis assessing the relationship between quitting smoking and subsequent death within the next 20 years included 1548 participants. The median age among those who quit was higher compared to those who did not (46 vs 42; p<0.001).

On univariate analysis, participants who did not quit smoking had higher rates of death (difference -4.9%; 95% CI -9.7%, -0.07%; p=0.037). However, on multivariable analysis, the relationship was not longer significant (odds ratio 0.90; 95% CI 0.64, 1.26).