This tutorial has moved. Visit http://www.decisioncurveanalysis.org for the latest updates.
Diagnostic and prognostic models are typically evaluated with measures of accuracy, such as the area-under-the-curve (AUC) or the Brier score, that do not address clinical consequences. Decision-analytic techniques allow assessment of clinical outcomes but generally require collection of extensive additional information and are cumbersome to apply to models that yield a risk estimate on a continuous scale from 0 to 1. Decision curve analysis allows incorporation of clinical consequences, thus addressing the question of whether a model would do more good than harm, but is able to do so without excessive extraneous data.
This document will walk you through how to perform a decision curve analysis in a variety of different settings, and then how to interpret the resulting curves. In decision curve analysis, models are compared to two default strategies: 1) assume that all patients are test positive and therefore treat everyone, or 2) assume that all patients are test negative and offer treatment to no one. “Treatment” is considered in the widest possible sense, not only drugs, radiotherapy or surgery, but advice, further diagnostic procedures or more intensive monitoring.
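To make the two default strategies concrete, here is a minimal sketch of how they are scored, assuming the standard net benefit definition from the decision curve analysis literature: net benefit = TP/n − FP/n × pt/(1 − pt), where pt is the threshold probability. The function names and the 14% prevalence are illustrative assumptions and are not part of any DCA package.

```python
def net_benefit_treat_all(prevalence, threshold):
    """Net benefit of treating everyone at a given threshold probability.

    Under "treat all", every patient with the disease is a true positive
    and every patient without it is a false positive.
    """
    odds = threshold / (1 - threshold)
    return prevalence - (1 - prevalence) * odds


def net_benefit_treat_none(threshold):
    """Treating no one produces no true or false positives, so net benefit is 0."""
    return 0.0


# Example prevalence (an assumed value, used only for illustration)
prevalence = 0.14
for threshold in (0.05, 0.14, 0.30):
    nb_all = net_benefit_treat_all(prevalence, threshold)
    nb_none = net_benefit_treat_none(threshold)
    print(f"threshold={threshold:.2f}  treat all={nb_all:+.3f}  treat none={nb_none:+.3f}")
```

Under this definition, the treat-all strategy has zero net benefit when the threshold probability equals the prevalence, and negative net benefit at higher thresholds.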
For more on DCA, visit www.decisioncurveanalysis.org: you’ll find the original articles explaining the theory and mathematical derivation of net benefit along with papers justifying the advantages of decision curve analysis over other methods of model evaluation. Below we will walk through how to perform decision curve analysis for binary and time-to-event outcomes using R, Stata, SAS, and Python. Code is provided for all languages and can be downloaded or simply copied and pasted into your application to see how it runs. For simplicity’s sake, however, we only show output from the R functions; output is very similar irrespective of programming language.
Use the scripts below to install the decision curve analysis functions and/or load them for use.
# install dcurves to perform DCA from CRAN
install.packages("dcurves")
# install other packages used in this tutorial
install.packages(
c("tidyverse", "survival", "gt", "broom",
"gtsummary", "rsample", "labelled")
)
# load packages
library(dcurves)
library(tidyverse)
library(gtsummary)
* install dca functions from GitHub.com
net install dca, from("https://raw.github.com/ddsjoberg/dca.stata/master/") replace
/* source the dca macros from GitHub.com */
/* you can also navigate to GitHub.com and save the macros locally */
FILENAME dca URL "https://raw.githubusercontent.com/ddsjoberg/dca.sas/main/dca.sas";
FILENAME stdca URL "https://raw.githubusercontent.com/ddsjoberg/dca.sas/main/stdca.sas";
%INCLUDE dca;
%INCLUDE stdca;
# install dcurves to perform DCA (first install package via pip)
# pip install dcurves
from dcurves import dca, plot_graphs
# install other packages used in this tutorial
# pip install pandas numpy statsmodels lifelines
import pandas as pd
import numpy as np
import statsmodels.api as sm
import lifelines
We will be working with an example data set containing information about cancer diagnosis. The data set includes information on 750 patients who have recently discovered they have a gene mutation that puts them at a higher risk for harboring cancer. Each patient has been biopsied and we know their cancer status. It is known that older patients with a family history of cancer have a higher probability of harboring cancer. A clinical chemist has recently discovered a marker that she believes can distinguish between patients with and without cancer. We wish to assess whether or not the new marker does indeed distinguish between patients with and without cancer. If the marker does indeed predict well, many patients will not need to undergo a painful biopsy.
We will go through, step by step, how to import your data, build models based on multiple variables, and use those models to obtain predicted probabilities. The first step is to import your data, label the variables and produce a table of summary statistics. The second step is to begin building your model. Because we have a binary outcome (i.e. the outcome of our model has two levels: cancer or no cancer), we will be using a logistic regression model.
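As a rough illustration of how a logistic regression model yields predicted probabilities, the sketch below applies the inverse logit function to a linear predictor. The intercept and coefficient are hypothetical values chosen for illustration only; they are not estimates from the tutorial data.

```python
import math


def inverse_logit(linear_predictor):
    """Map a linear predictor (log-odds) to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical coefficients (assumptions, not fitted to the tutorial data)
intercept = -2.0
coef_famhistory = 0.6


def predicted_risk(famhistory):
    """Predicted probability of cancer for famhistory coded 0 or 1."""
    return inverse_logit(intercept + coef_famhistory * famhistory)


risk_without = predicted_risk(0)
risk_with = predicted_risk(1)
print(f"risk without family history: {risk_without:.3f}")
print(f"risk with family history:    {risk_with:.3f}")
```

The ratio of the odds implied by these two probabilities equals the exponentiated coefficient, which is the odds ratio a logistic regression summary reports.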
# import data
df_cancer_dx <-
readr::read_csv(
file = "https://raw.githubusercontent.com/ddsjoberg/dca-tutorial/main/data/df_cancer_dx.csv"
) %>%
# assign variable labels. these labels will be carried through in the `dca()` output
labelled::set_variable_labels(
patientid = "Patient ID",
cancer = "Cancer Diagnosis",
risk_group = "Risk Group",
age = "Patient Age",
famhistory = "Family History",
marker = "Marker",
cancerpredmarker = "Prediction Model"
)
# summarize data
df_cancer_dx %>%
select(-patientid) %>%
tbl_summary(type = all_dichotomous() ~ "categorical")
* import data
import delimited "https://raw.githubusercontent.com/ddsjoberg/dca-tutorial/main/data/df_cancer_dx.csv", clear
* assign variable labels. these labels will be carried through in the DCA output
label variable patientid "Patient ID"
label variable cancer "Cancer Diagnosis"
label variable risk_group "Risk Group"
label variable age "Patient Age"
label variable famhistory "Family History"
label variable marker "Marker"
label variable cancerpredmarker "Prediction Model"
FILENAME cancer URL "https://raw.githubusercontent.com/ddsjoberg/dca-tutorial/main/data/df_cancer_dx.csv";
PROC IMPORT FILE = cancer OUT = work.data_cancer DBMS = CSV;
RUN;
* assign variable labels. these labels will be carried through in the DCA output;
DATA data_cancer;
SET data_cancer;
LABEL patientid = "Patient ID"
cancer = "Cancer Diagnosis"
risk_group = "Risk Group"
age = "Patient Age"
famhistory = "Family History"
marker = "Marker"
cancerpredmarker = "Prediction Model";
RUN;
# import data
df_cancer_dx = pd.read_csv('https://raw.githubusercontent.com/ddsjoberg/dca-tutorial/main/data/df_cancer_dx.csv')
| Characteristic | N = 750¹ |
|---|---|
| Cancer Diagnosis | |
| 0 | 645 (86%) |
| 1 | 105 (14%) |
| Risk Group | |
| high | 21 (2.8%) |
| intermediate | 335 (45%) |
| low | 394 (53%) |
| Patient Age | 65.1 (61.7, 68.3) |
| Family History | |
| 0 | 635 (85%) |
| 1 | 115 (15%) |
| Marker | 0.64 (0.29, 1.38) |
| Prediction Model | 0.06 (0.02, 0.18) |
| ¹ n (%); Median (IQR) | |
First, we want to confirm family history of cancer is indeed associated with the biopsy result.
# build logistic regression model
mod1 <- glm(cancer ~ famhistory, data = df_cancer_dx, family = binomial)
# model summary
mod1_summary <- tbl_regression(mod1, exponentiate = TRUE)
mod1_summary
* Test whether family history is associated with cancer
logit cancer famhistory
* Test whether family history is associated with cancer;
PROC LOGISTIC DATA = data_cancer DESCENDING;
MODEL cancer = famhistory;
RUN;
# build logistic regression model
mod1 = sm.GLM.from_formula('cancer ~ famhistory', data=df_cancer_dx, family=sm.families.Binomial())
mod1_results = mod1.fit()
# model summary
print(mod1_results.summary())
| Characteristic | OR¹ | 95% CI¹ | p-value |
|---|---|---|---|
| Family History | 1.80 | 1.07, 2.96 | 0.022 |
| ¹ OR = Odds Ratio, CI = Confidence Interval | | | |
Via logistic regression with cancer as the outcome, we can see that family history is associated with biopsy outcome, with an odds ratio of 1.80 (95% CI 1.07, 2.96; p=0.022). Decision curve analysis can help us assess the clinical utility of using family history to predict biopsy outcome.
dca(cancer ~ famhistory, data = df_cancer_dx) %>%
plot()
* Run the decision curve: family history is coded as 0 or 1, i.e. a probability
* so no need to specify the “probability” option
dca cancer famhistory
* Run the decision curve: family history is coded as 0 or 1, i.e. a probability, so no need to specify the “probability” option;
%DCA(data = data_cancer, outcome = cancer, predictors = famhistory, graph = yes);
dca_famhistory_df = \
dca(
data=df_cancer_dx,
outcome='cancer',
modelnames=['famhistory']
)
plot_graphs(
plot_df=dca_famhistory_df,
graph_type='net_benefit',
y_limits=[-0.05, 0.2]
)
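The `dca` calls above treat a predictor coded 0 or 1 as a predicted probability. The sketch below shows the kind of calculation this implies at a single threshold. It assumes the standard net benefit definition (TP/n − FP/n × pt/(1 − pt)) and that a patient is counted as test positive when the prediction is at or above the threshold. The `net_benefit` helper and the ten-patient data are hypothetical and are not taken from the tutorial data set or from any package.

```python
def net_benefit(outcomes, predictions, threshold):
    """Net benefit of treating patients whose prediction is >= threshold.

    outcomes: list of 0/1 disease indicators
    predictions: list of predicted probabilities (a 0/1 test also works)
    threshold: threshold probability, strictly between 0 and 1
    """
    n = len(outcomes)
    pairs = list(zip(outcomes, predictions))
    true_pos = sum(1 for y, p in pairs if p >= threshold and y == 1)
    false_pos = sum(1 for y, p in pairs if p >= threshold and y == 0)
    return true_pos / n - (false_pos / n) * threshold / (1 - threshold)


# Hypothetical ten-patient example (invented for illustration)
cancer = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
famhistory = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
treat_all = [1] * len(cancer)  # everyone is test positive

threshold = 0.2
nb_famhistory = net_benefit(cancer, famhistory, threshold)
nb_treat_all = net_benefit(cancer, treat_all, threshold)
print(f"family history: {nb_famhistory:.3f}")
print(f"treat all:      {nb_treat_all:.3f}")
```

Because a binary predictor takes only the values 0 and 1, the set of test-positive patients is the same at every threshold between 0 and 1; only the weight applied to false positives changes along the curve.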