Yates’ Continuity Correction

The {gtsummary} package function tbl_summary() %>% add_p() had its default statistical test slightly modified in version 1.3.6 for 2-by-2 table comparisons. Previously, 2-by-2 tables were compared with chisq.test(correct = TRUE). However, the simulation study below demonstrates the superior properties of chisq.test(correct = FALSE), and we have updated the default and removed the continuity correction.

Simulating results

We will simulate many studies assuming various prevalence rates of two binary variables and at various sample sizes. Event rates between \(0.05 \le p \le p +\delta \le 0.50\) were investigated, and sample sizes between 20 and 500. The power and size of the chi-squared tests were tabulated.

Attach needed libraries.

library(tidyverse)
library(gtsummary)

Set the number of simulated studies that will be run for each combination of rates and sample size.

simn <- 10000

Simulate the trial results.

sim_results <-
  list(
    simn = seq_len(simn),                              # number of simulations per scenario
    n = c(20, 50, 100, 500),                           # sample sizes
    p = c(0.05, 0.10, 0.20, 0.50),                     # rate of success in group 1
    delta = c(0.0, 0.05, 0.10, 0.20, 0.30, 0.40, 0.45) # p + delta = rate of success in group 2
  ) %>%
  cross_df() %>%
  # deleting duplicated scenarios (due to symmetry of test)
  filter(p + delta <= 0.5) %>%
  mutate(
    # simulating  data set with specified success rates, and calculating p-values
    df_result = pmap(
      list(n, p, delta),
      function(n, p, delta) {
        data = 
          bind_rows(
            tibble(value = runif(n/2) < p, group = 1),
            tibble(value = runif(n/2) < p + delta, group = 2),
          )
        
        tribble(
          ~test, ~reject, 
          "With Correction", with(data, table(value, group)) %>% 
            {tryCatch(chisq.test(., correct = TRUE), error = function(e) NA)$p.value} %>% 
            {. < 0.05},
          "Without Correction", with(data, table(value, group)) %>% 
            {tryCatch(chisq.test(., correct = FALSE), error = function(e) NA)$p.value} %>% 
            {. < 0.05}
        )
      }
    )
  ) %>%
  unnest(df_result) %>% 
  # deleting sets of results where at least one of the p-values are missing
  group_by(simn, n, p, delta) %>% 
  filter(sum(is.na(reject)) == 0) %>%
  ungroup()

Summarize simulation results.

# calculating power (or level) of test across simulations
results <- 
  sim_results %>%
  group_by(n, p, delta, test) %>%
  summarise(power = mean(reject),
            simn = n(),
            .groups = "drop")

Power of test

results %>%
  filter(p != 0.5) %>%
  mutate(n = str_glue("N = {n}") %>% factor(levels = unique(.)), 
         p = str_glue("{p * 100}%") %>% factor(levels = unique(.))) %>%
  ggplot(aes(x = delta, y = power, color = test)) +
  geom_line() +
  geom_point() +
  geom_hline(yintercept = 0.05) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
  facet_grid(n ~ p, scale = "free") +
  labs(color = "",
       y = "Power",
       x = "Delta") +
  theme(legend.position="bottom")

When n is large, both variants of the chi-squared test yield similar results. However, when n is 100 or less, the test without continuity correction exhibits higher power for the various prevalences of the binary variables.

Level of test

results %>%
  filter(delta == 0) %>%
  mutate(n = str_glue("N = {n}") %>% factor(levels = unique(.))) %>%
  ggplot(aes(x = p, y = power, color = test)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
  facet_wrap(~n) +
  geom_hline(yintercept = 0.05) +
  labs(color = "",
       y = "Prob. of Rejecting H0",
       x = "Event Rate") +
  theme(legend.position="bottom")

In the simulation study, we rejected the null hypothesis when the p-value was less than 0.05. The true rate of rejection should be 5% by definition. The test without continuity correction is closer to this threshold, thus preserving more accuratly the true level of the test.

Restore Previous Behavior

If a user wishes to continue using the continuity correction, there are two options.

Specify the variables you’d like to compare with the chi-squared test without continuity correction. For example, this line of code specifies all categorical variables are compared with "chisq.test" that uses the continuity correction: trial %>% tbl_summary(by = trt) %>% add_p(all_categorical() ~ "chisq.test")
If you want to make the global change for the entire script or R session, use the gtsummary themes. Run the following line anywhere in your R script, which will make the chi-sequared test with continuity correction the default: set_gtsummary_theme(list("add_p.tbl_summary-attr:test.continuous_by2" = "chisq.test"))

2021-01-06 13:13:36 EST

Yates’ Continuity Correction

Assessment of Utility

Simulating results

Power of test

Level of test

Restore Previous Behavior