3  Models for Binary Outcomes

Logistic regression and variations

Published

Last modified: 2024-12-13 0:43:41 AM



Configuring R

Functions from these packages will be used throughout this document:

Show R code
library(conflicted) # check for conflicting function definitions
# library(printr) # inserts help-file output into markdown output
library(rmarkdown) # Convert R Markdown documents into a variety of formats.
library(pander) # format tables for markdown
library(ggplot2) # graphics
library(ggeasy) # help with graphics
library(ggfortify) # help with graphics
library(dplyr) # manipulate data
library(tibble) # `tibble`s extend `data.frame`s
library(magrittr) # `%>%` and other additional piping tools
library(haven) # import Stata files
library(knitr) # format R output for markdown
library(tidyr) # Tools to help to create tidy data
library(plotly) # interactive graphics
library(dobson) # datasets from Dobson and Barnett 2018
library(parameters) # format model output tables for markdown
library(latex2exp) # use LaTeX in R code (for figures and tables)
library(fs) # filesystem path manipulations
library(survival) # survival analysis
library(survminer) # survival analysis graphics
library(KMsurv) # datasets from Klein and Moeschberger
library(webshot2) # convert interactive content to static for pdf
library(forcats) # functions for categorical variables ("factors")
library(stringr) # functions for dealing with strings
library(lubridate) # functions for dealing with dates and times

Here are some R settings I use in this document:

Show R code
rm(list = ls()) # delete any data that's already loaded into R

conflicts_prefer(dplyr::filter)
ggplot2::theme_set(
  ggplot2::theme_bw() + 
        # ggplot2::labs(col = "") +
    ggplot2::theme(
      legend.position = "bottom",
      text = ggplot2::element_text(size = 12, family = "serif")))

knitr::opts_chunk$set(message = FALSE)
options('digits' = 4)

panderOptions("big.mark", ",")
pander::panderOptions("table.emphasize.rownames", FALSE)
pander::panderOptions("table.split.table", Inf)
conflicts_prefer(dplyr::filter) # use the `filter()` function from dplyr() by default
legend_text_size = 9

Acknowledgements

This content is adapted from:

3.1 Introduction

3.1.1 What is logistic regression?

Logistic regression is a framework for modeling binary outcomes, conditional on one or more predictors (a.k.a. covariates).


Exercise 3.1 (Examples of binary outcomes) What are some examples of binary outcomes in the health sciences?


Solution. Examples of binary outcomes include:

  • exposure (exposed vs unexposed)
  • disease (diseased vs healthy)
  • recovery (recovered vs unrecovered)
  • relapse (relapse vs remission)
  • return to hospital (returned vs not)
  • vital status (dead vs alive)

Logistic regression uses the Bernoulli distribution to model the outcome variable, conditional on one or more covariates.


Exercise 3.2 Write down a mathematical definition of the Bernoulli distribution.


Solution. The Bernoulli distribution family for a random variable \(X\) is defined as:

\[ \begin{aligned} \Pr(X=x) &= \mathbb{1}_{x\in \left\{0,1\right\}}\pi^x(1-\pi)^{1-x}\\ &= \begin{cases} \pi, & x=1\\ 1-\pi, & x=0 \end{cases} \end{aligned} \]


3.1.2 Logistic regression versus linear regression

Logistic regression differs from linear regression, which uses the Gaussian (“normal”) distribution to model the outcome variable, conditional on the covariates.


Exercise 3.3 Recall: what kinds of outcomes is linear regression used for?


Solution. Linear regression is typically used for numerical outcomes that aren’t event counts or waiting times for an event. Examples of outcomes that are often analyzed using linear regression include weight, height, and income.

3.2 Risk Estimation and Prediction

In Epi 203, you have already seen methods for modeling binary outcomes using one covariate that is also binary (such as exposure/non-exposure). In this section, we review one-covariate analyses, with a special focus on risk ratios and odds ratios, which are important concepts for interpreting logistic regression.


Example 3.1 (Oral Contraceptive Use and Heart Attack)  

  • Research question: how does oral contraceptive (OC) use affect the risk of myocardial infarction (MI; a.k.a. heart attack)?

This was an issue when oral contraceptives were first developed, because the original formulations used higher concentrations of hormones. Modern OCs don’t have this issue.

Table 3.1 contains simulated data for an imaginary follow-up (a.k.a. prospective) study in which two groups are identified, one using OCs and another not using OCs, and both groups are tracked for three years to determine how many in each group have MIs.

Show R code
library(dplyr)
oc_mi = 
  tribble(
    ~OC, ~MI, ~Total,
    "OC use", 13, 5000,
    "No OC use", 7, 10000
  ) |> 
  mutate(`No MI` = Total - MI) |> 
  relocate(`No MI`, .after = MI)

totals = 
  oc_mi |> 
  summarize(across(c(MI, `No MI`, Total), sum)) |> 
  mutate(OC = "Total")

tbl_oc_mi = bind_rows(oc_mi, totals)

tbl_oc_mi
Table 3.1: Simulated data from study of oral contraceptive use and heart attack risk

Exercise 3.4 Review: estimate the probabilities of MI for OC users and non-OC users in Example 3.1.


Solution. \[\hat{\text{p}}(MI|OC) = \frac{13}{5000} = 0.0026\]

\[\hat{\text{p}}(MI|\neg OC) = \frac{7}{10000} = 7\times 10^{-4}\]


Controls

Two meanings of “controls”

Depending on context, “controls” can mean either individuals who don’t experience an exposure of interest, or individuals who don’t experience an outcome of interest.


Definition 3.1 (cases and controls in retrospective studies) In retrospective studies, participants who experience the outcome of interest are called cases, while participants who don’t experience that outcome are called controls.


Definition 3.2 (treatment groups and control groups in prospective studies) In prospective studies, the group of participants who experience the treatment or exposure of interest is called the treatment group, while the participants who receive the baseline or comparison treatment (for example, clinical trial participants who receive a placebo or a standard-of-care treatment rather than an experimental treatment) are called controls.

3.3 Comparing Probabilities

3.3.1 Risk differences

The simplest comparison of two probabilities, \(\pi_1\) and \(\pi_2\), is the difference of their values:

Definition 3.3 (Risk difference) The risk difference of two probabilities, \(\pi_1\) and \(\pi_2\), is the difference of their values: \[\delta(\pi_1,\pi_2) \stackrel{\text{def}}{=}\pi_1 - \pi_2\]


Example 3.2 (Difference in MI risk) In Example 3.1, the maximum likelihood estimate of the difference in MI risk between OC users and OC non-users is:

\[ \begin{aligned} \hat\delta(\pi(OC), \pi(\neg OC)) &= \delta(\hat\pi(OC), \hat\pi(\neg OC))\\ &= \hat\pi(OC) - \hat\pi(\neg OC)\\ &= 0.0026 - 7\times 10^{-4}\\ &= 0.0019 \end{aligned} \]
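As a quick numerical check (a sketch using the counts from Table 3.1, mirroring the risk-ratio calculation below):

Show R code
p_oc  = 13 / 5000   # estimated P(MI | OC)
p_noc = 7 / 10000   # estimated P(MI | no OC)
p_oc - p_noc        # risk difference estimate: 0.0019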


3.3.2 Risk ratios

Definition 3.4 (Relative risk ratios)  

The relative risk of probability \(\pi_1\) compared to another probability \(\pi_2\), also called the risk ratio, relative risk ratio, probability ratio, or rate ratio, is the ratio of those probabilities:

\[\rho(\pi_1,\pi_2) = \frac{\pi_1}{\pi_2}\]


Example 3.3  

Above, we estimated that:

\[\hat{\text{p}}(MI|OC) = 0.0026\]

\[\hat{\text{p}}(MI|\neg OC) = 7\times 10^{-4}\]

So we might estimate that the relative risk of MI for OC versus non-OC is:

Show R code
rr = (13/5000)/(7/10000)

\[ \begin{aligned} \hat\rho(OC, \neg OC) &=\frac{\hat{\text{p}}(MI|OC)}{\hat{\text{p}}(MI|\neg OC)}\\ &= \frac{0.0026}{7\times 10^{-4}}\\ &= 3.7143 \end{aligned} \]

We might summarize this result by saying that “the estimated probability of MI among OC users was 3.7143 times as high as the estimated probability among OC non-users.”


3.3.3 Relative risk differences

Definition 3.5 (Relative risk difference)  

Sometimes, we divide the risk difference by the comparison probability; the result is called the relative risk difference:

\[\xi(\pi_1,\pi_2) \stackrel{\text{def}}{=}\frac{\delta(\pi_1,\pi_2)}{\pi_2}\]


Theorem 3.1 (Relative risk difference equals risk ratio minus 1) \[\xi(\pi_1,\pi_2) = \rho(\pi_1,\pi_2) - 1\]


Proof. \[ \begin{aligned} \xi(\pi_1,\pi_2) &\stackrel{\text{def}}{=}\frac{\delta(\pi_1,\pi_2)}{\pi_2} \\&= \frac{\pi_1-\pi_2}{\pi_2} \\&= \frac{\pi_1}{\pi_2} - 1 \\&= \rho(\pi_1,\pi_2) - 1 \end{aligned} \]
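We can verify Theorem 3.1 numerically for the OC data (a quick sketch using the estimated risks from Exercise 3.4):

Show R code
p_oc  = 13 / 5000
p_noc = 7 / 10000
(p_oc - p_noc) / p_noc   # relative risk difference: 2.714
(p_oc / p_noc) - 1       # risk ratio minus 1: also 2.714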


3.3.4 Changing reference groups in risk comparisons

Risk differences, risk ratios, and relative risk differences are defined by two probabilities, plus a choice of which probability is the baseline or reference probability (i.e., which probability is the subtrahend of the risk difference or the denominator of the risk ratio).

\[\delta(\pi_2,\pi_1) = -\delta(\pi_1, \pi_2)\]

\[\rho(\pi_2,\pi_1) = \left(\rho(\pi_1,\pi_2)\right)^{-1}\] \[\xi(\pi_2,\pi_1) = \left(\xi(\pi_1,\pi_2) + 1\right)^{-1} - 1\]

Exercise 3.5 Prove the relationships above.


Example 3.4 (Switching the reference group in a risk ratio) Above, we estimated that the risk ratio of OC versus non-OC is:

\[ \begin{aligned} \rho(OC, \neg OC) &= 3.7143 \end{aligned} \]

In comparison, the risk ratio for non-OC versus OC is:

\[ \begin{aligned} \rho(\neg OC, OC) &=\frac{\hat{\text{p}}(MI|\neg OC)}{\hat{\text{p}}(MI|OC)}\\ &= \frac{7\times 10^{-4}}{0.0026}\\ &= 0.2692\\ &= \frac{1}{\rho(OC, \neg OC)} \end{aligned} \]

3.4 Odds and Odds Ratios

3.4.1 Odds and probabilities

In logistic regression, we will make use of a mathematically-convenient transformation of probability, called odds.

Definition 3.6 (Odds) The odds of an outcome \(A\), which we will represent using \(\omega\) (“omega”), is the probability that the outcome occurs, divided by the probability that it doesn’t occur:

\[\omega(A) \stackrel{\text{def}}{=}\frac{\Pr(A)}{\Pr(\neg A)}\]


Theorem 3.2 If the probability of an outcome \(A\) is \(\Pr(A)=\pi\), then the corresponding odds of \(A\) is:

\[\text{odds}\left\{\pi\right\} = \frac{\pi}{1-\pi} \tag{3.1}\]


Proof. \[ \begin{aligned} \Pr(\neg A) &= 1 - \Pr(A) \\ &= 1 - \pi \end{aligned} \]

\[ \begin{aligned} \therefore\omega(A) &\stackrel{\text{def}}{=}\frac{\Pr(A)}{\Pr(\neg A)} \\ &= \frac{\pi}{1-\pi} \end{aligned} \]


The function in Equation 3.1, which transforms probabilities into odds, can be called the odds function. Figure 3.1 graphs the shape of this function.

Show R code
odds = function(pi) pi / (1 - pi)
library(ggplot2)
ggplot() +
  geom_function(
    fun = odds,
    arrow = arrow(ends = "last"),
    mapping = aes(col = "odds function")
  ) +
  xlim(0, .99) +
  xlab("Probability") +
  ylab("Odds") +
  geom_abline(aes(
    intercept = 0,
    slope = 1,
    col = "y=x"
  )) +
  theme_bw() +
  labs(colour = "") +
  theme(legend.position = "bottom")
Figure 3.1: Odds versus probability

Example 3.5 (Computing odds from probabilities) In Exercise 3.4, we estimated that the probability of MI, given OC use, is \(\pi(OC) \stackrel{\text{def}}{=}\Pr(MI|OC) = 0.0026\). If this estimate is correct, then the odds of MI, given OC use, is:

\[ \begin{aligned} \omega(OC) &\stackrel{\text{def}}{=}\frac{\Pr(MI|OC)}{\Pr(\neg MI|OC)}\\ &=\frac{\Pr(MI|OC)}{1-\Pr(MI|OC)}\\ &=\frac{\pi(OC)}{1-\pi(OC)}\\ &=\frac{0.0026}{1-0.0026}\\ &\approx 0.002607 \end{aligned} \]


Exercise 3.6 (Computing odds from probabilities) Estimate the odds of MI, for non-OC users.

Solution. \[ \omega(\neg OC) = \frac{\hat{\text{p}}(MI|\neg OC)}{1-\hat{\text{p}}(MI|\neg OC)} = \frac{7/10000}{9993/10000} = \frac{7}{9993} \approx 7.0049\times 10^{-4} \]


Theorem 3.3 (One-sample MLE for odds) Let \(X_1,...X_n\) be a set of \(n\) \(\text{iid}\) Bernoulli trials, and let \(X = \sum_{i=1}^nX_i\) be their sum.

Then the maximum likelihood estimate of the odds of \(X=1\), \(\omega\), is:

\[ \hat{\omega}= \frac{x}{n-x} \]


Proof. The MLE of \(\pi\) is the sample proportion, \(\hat\pi = \frac{x}{n}\), and by the invariance property of MLEs, \(\hat{\omega} = \frac{\hat\pi}{1-\hat\pi}\). Now,

\[ \begin{aligned} 1-\hat\pi &= 1-\frac{x}{n}\\ &= \frac{n}{n} - \frac{x}{n}\\ &= \frac{n - x}{n} \end{aligned} \]

Thus, the estimated odds is:

\[ \begin{aligned} \frac{\hat\pi}{1-\hat\pi} &= \frac{\left(\frac{x}{n}\right)}{\left(\frac{n-x}{n}\right)}\\ &= \frac{x}{n-x} \end{aligned} \]

That is, odds can be calculated directly as “# events” divided by “# nonevents” (without needing to calculate \(\hat\pi\) and \(1-\hat\pi\) first).


Example 3.6 (calculating odds using the shortcut) In Example 3.5, we calculated \[ \begin{aligned} \omega(OC) &=0.0026 \end{aligned} \]

Let’s recalculate this result using our shortcut.


Solution 3.1. \[ \begin{aligned} \omega(OC) &=\frac{13}{5000-13}\\ &=0.0026 \end{aligned} \]

Same answer as in Example 3.5!
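We can also confirm the shortcut numerically (a small sketch using the counts from Table 3.1):

Show R code
13 / (5000 - 13)                  # shortcut: # events / # nonevents
(13 / 5000) / (1 - (13 / 5000))   # definition: pi-hat / (1 - pi-hat); same value, ~0.002607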


Theorem 3.4 (Simplified expression for odds function)  

An equivalent expression for the odds function is

\[ \text{odds}\left\{\pi\right\} = \left(\left(\pi\right)^{-1}-1\right)^{-1} \tag{3.2}\]


Exercise 3.7 Prove that Equation 3.2 is equivalent to Definition 3.6.


Theorem 3.5 (Derivative of odds function) \[{\text{odds}}'\left\{\pi\right\} = \frac{1}{\left(1-\pi\right)^2}\]


Proof. \[ \begin{aligned} {\text{odds}}'\left\{\pi\right\} &= \frac{\partial}{\partial \pi}\left(\frac{\pi}{1-\pi}\right) \\ &= \frac {\frac{\partial}{\partial \pi}\pi} {1-\pi} - \left(\frac{\pi}{\left(1-\pi\right)^2} \cdot\frac{\partial}{\partial \pi}\left(1-\pi\right)\right) \\ &= \frac{1}{1-\pi} - \frac{\pi}{\left(1-\pi\right)^2} \cdot(-1) \\ &= \frac{1}{1-\pi} + \frac{\pi}{\left(1-\pi\right)^2} \\ &= \frac{1-\pi}{\left(1-\pi\right)^2} + \frac{\pi}{\left(1-\pi\right)^2} \\ &= \frac{1-\pi + \pi}{\left(1-\pi\right)^2} \\ &= \frac{1}{\left(1-\pi\right)^2} \end{aligned} \]


Odds of rare events

For rare events (small \(\pi\)), odds and probabilities are nearly equal, because \(1-\pi \approx 1\) (see Figure 3.1).

For example, in Example 3.5, the probability and odds differ by \(6.7776\times 10^{-6}\).


Exercise 3.8 What odds value corresponds to the probability \(\pi = 0.2\), and what is the numerical difference between these two values?


Solution. \[ \omega= \frac{\pi}{1-\pi} =\frac{.2}{.8} = .25 \] \[ \omega- \pi = .25 - .2 = .05 \]


Theorem 3.6 Let \(\omega= \frac{\pi}{1-\pi}\). Then:

\[\omega- \pi = \omega\cdot\pi\]


Exercise 3.9 Prove Theorem 3.6.


Solution. \[ \begin{aligned} \omega- \pi &= \frac{\pi}{1-\pi} - \pi \\ &= \frac{\pi}{1-\pi} - \frac{\pi(1-\pi)}{1-\pi} \\ &= \frac{\pi}{1-\pi} - \frac{\pi - \pi^2}{1-\pi} \\ &= \frac{\pi - (\pi - \pi^2)}{1-\pi} \\ &= \frac{\pi - \pi + \pi^2}{1-\pi} \\ &= \frac{\pi^2}{1-\pi} \\ &= \frac{\pi}{1-\pi} \pi \\ &= \omega\pi \end{aligned} \]


Lemma 3.1 (Odds of a non-event) If \(\pi\) is the probability of event \(A\) and \(\omega\) is the corresponding odds of \(A\), then the odds of \(\neg A\) are:

\[ \omega(\neg A) = \frac{1-\pi}{\pi} \]

Proof. Left to the reader.


3.4.2 The inverse odds function

Definition 3.7 (inverse odds function) The inverse odds function,

\[\text{invodds}\left\{\omega\right\} \stackrel{\text{def}}{=}\frac{\omega}{1 + \omega}\] converts odds into their corresponding probabilities (Figure 3.2).

The inverse-odds function takes an odds as input and produces a probability as output. Its domain of inputs is \([0,\infty)\) and its range of outputs is \([0,1)\).

I haven’t seen anyone give the inverse-odds function a concise name; maybe \(\text{prob}()\)?


Show R code
odds_inv = function(omega) (1 + omega^-1)^-1
ggplot() +
  geom_function(fun = odds_inv, aes(col = "inverse-odds")) +
  xlab("Odds") +
  ylab("Probability") +
  xlim(0,5) +
  ylim(0,1) +
  geom_abline(aes(intercept = 0, slope = 1, col = "x=y"))
Figure 3.2: The inverse odds function, \(\text{invodds}\left\{\omega\right\}\)

Exercise 3.10 What probability corresponds to an odds of \(\omega= 1\), and what is the numerical difference between these two values?


Solution. \[ \pi(1) = \frac{1}{1+1} =\frac{1}{2} = .5 \] \[ \omega- \pi(1) = 1 - .5 = .5 \]


Lemma 3.2 (Simplified expression for inverse odds function)  

An equivalent expression for the inverse odds function is

\[ \pi(\omega) = (1+\omega^{-1})^{-1} \tag{3.3}\]


Exercise 3.11 Prove that Equation 3.3 is equivalent to Definition 3.7.


Lemma 3.3 (One minus inverse-odds) \[1 - \text{invodds}\left\{\omega\right\} = \frac{1}{1+\omega}\]


Proof. \[ \begin{aligned} 1 - \text{invodds}\left\{\omega\right\} &= 1 - \frac{\omega}{1 + \omega} \\ &= \frac{{\color{red}1+\omega}}{1 + \omega} - \frac{{\color{blue}\omega}}{1 + \omega} \\ &= \frac{{\color{red}(1+\omega)} - {\color{blue}\omega}}{1 + \omega} \\ &= \frac{1 + \omega- \omega}{1 + \omega} \\ &= \frac{1}{1 + \omega} \end{aligned} \]


Theorem 3.7 If \(\omega\) is the odds of event \(A\), then the probability that \(A\) does not occur is:

\[\Pr(\neg A) = \frac{1}{1+\omega}\]


Proof.

Use Lemma 3.3:

\[ \begin{aligned} \Pr(\neg A) &= 1 - \Pr(A) \\ &= 1 - \text{invodds}\left\{\omega\right\} \\ &= \frac{1}{1 + \omega} \end{aligned} \]


Theorem 3.8 (Derivative of inverse odds function) \[{\text{invodds}}'\left\{\omega\right\} = \frac{1}{\left(1+\omega\right)^2}\]


Proof.

Use the quotient rule:

\[ \begin{aligned} {\text{invodds}}'(\omega) &= \frac{\partial}{\partial \omega} \text{invodds}\left\{\omega\right\} \\ &= \frac{\partial}{\partial \omega} \frac{\omega}{1+\omega} \\ &= \frac{\frac{\partial}{\partial \omega}\omega}{1+\omega} - \frac{\omega}{\left(1+\omega\right)^2} \cdot\frac{\partial}{\partial \omega}(1+\omega) \\ &= \frac{1}{1+\omega} - \frac{\omega}{\left(1+\omega\right)^2} \cdot 1 \\ &= \frac{1}{1+\omega} - \frac{\omega}{\left(1+\omega\right)^2} \\ &= \frac{1+\omega}{\left(1+\omega\right)^2} - \frac{\omega}{\left(1+\omega\right)^2} \\ &= \frac{1+\omega- \omega}{\left(1+\omega\right)^2} \\ &= \frac{1}{\left(1+\omega\right)^2} \end{aligned} \]


Corollary 3.1 \[{\text{invodds}}'\left\{\omega\right\} = \left(1 - \text{invodds}\left\{\omega\right\}\right)^2\]


3.4.3 Odds ratios

Now that we have defined odds, we can introduce another way of comparing event probabilities: odds ratios.

Definition 3.8 (Odds ratio) The odds ratio for two odds \(\omega_1\), \(\omega_2\) is their ratio:

\[\theta(\omega_1, \omega_2) = \frac{\omega_1}{\omega_2}\]


Example 3.7 (Calculating odds ratios) In Example 3.1, the odds ratio for OC users versus OC-non-users is:

\[ \begin{aligned} \theta(\omega(OC), \omega(\neg OC)) &= \frac{\omega(OC)}{\omega(\neg OC)}\\ &= \frac{0.002607}{7.0049\times 10^{-4}}\\ &= 3.7214\\ \end{aligned} \]


A shortcut for calculating odds ratio estimates

The general form of a two-by-two table is shown in Table 3.2.

Table 3.2: A generic 2x2 table
Event Non-Event Total
Exposed a b a+b
Non-exposed c d c+d
Total a+c b+d a+b+c+d

From this table, we have:

  • \(\hat\pi(Event|Exposed) = a/(a+b)\)

  • \(\hat\pi(\neg Event|Exposed) = b/(a+b)\)

  • \(\hat\omega(Event|Exposed) = \frac{\left(\frac{a}{a+b}\right)}{\left(\frac{b}{a+b}\right)}=\frac{a}{b}\)

  • \(\hat\omega(Event|\neg Exposed) = \frac{c}{d}\) (see Exercise 3.12)

  • \(\theta(Exposed,\neg Exposed) = \frac{\frac{a}{b}}{\frac{c}{d}} = \frac{ad}{bc}\)
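Here is a small sketch applying the \(\frac{ad}{bc}\) shortcut to the counts in Table 3.1 (\(a = 13\), \(b = 4987\), \(c = 7\), \(d = 9993\)):

Show R code
# odds ratio via the ad/bc shortcut, using the counts from Table 3.1:
(13 * 9993) / (4987 * 7)   # ~3.721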


Exercise 3.12 Given Table 3.2, show that \(\hat\omega(Event|\neg Exposed) = \frac{c}{d}\).


Properties of odds ratios

Odds ratios have a special property: we can swap a covariate with the outcome, and the odds ratio remains the same.

Theorem 3.9 (Odds ratios are reversible) For any two events \(A\), \(B\):

\[\theta(A|B) = \theta(B|A)\]


Proof. \[ \begin{aligned} \theta(A|B) &\stackrel{\text{def}}{=}\frac{\omega(A|B)}{\omega(A|\neg B)} \\ &= \frac {\left(\frac{\text{p}(A|B)}{\text{p}(\neg A|B)}\right)} {\left(\frac{\text{p}(A|\neg B)}{\text{p}(\neg A| \neg B)}\right)} \\ &= \left(\frac{\text{p}(A|B)}{\text{p}(\neg A|B)}\right) \left(\frac{\text{p}(A|\neg B)}{\text{p}(\neg A| \neg B)}\right)^{-1} \\ &= \left(\frac{\text{p}(A|B)}{\text{p}(\neg A|B)}\right) \left(\frac{\text{p}(\neg A| \neg B)}{\text{p}(A|\neg B)}\right) \\ &= \left(\frac{\text{p}(A|B)}{\text{p}(\neg A|B)} \cdot \frac{\text{p}(B)}{\text{p}(B)}\right) \left(\frac{\text{p}(\neg A| \neg B)}{\text{p}(A|\neg B)} \cdot \frac{\text{p}(\neg B)}{\text{p}(\neg B)}\right) \\ &= \left(\frac{\text{p}(A,B)}{\text{p}(\neg A,B)}\right) \left(\frac{\text{p}(\neg A, \neg B)}{\text{p}(A, \neg B)}\right) \\ &= \left(\frac{\text{p}(B,A)}{{\color{red}\text{p}(B,\neg A)}}\right) \left(\frac{\text{p}(\neg B, \neg A)}{{\color{blue}\text{p}(\neg B, A)}}\right) \\ &= \left(\frac{\text{p}(B,A)}{{\color{blue}\text{p}(\neg B, A)}}\right) \left(\frac{\text{p}(\neg B, \neg A)}{{\color{red}\text{p}(B,\neg A)}}\right) \\ &= \text{[reverse the preceding steps]} \\ &= \theta(B|A) \end{aligned} \]


Example 3.8 In Example 3.1, we have:

\[ \begin{aligned} \theta(MI; OC) &\stackrel{\text{def}}{=} \frac{\omega(MI|OC)}{\omega(MI|\neg OC)}\\ &\stackrel{\text{def}}{=}\frac {\left(\frac{\Pr(MI|OC)}{\Pr(\neg MI|OC)}\right)} {\left(\frac{\Pr(MI|\neg OC)}{\Pr(\neg MI|\neg OC)}\right)}\\ &= \frac {\left(\frac{\Pr(MI,OC)}{\Pr(\neg MI,OC)}\right)} {\left(\frac{\Pr(MI,\neg OC)}{\Pr(\neg MI,\neg OC)}\right)}\\ &= \left(\frac{\Pr(MI,OC)}{\Pr(\neg MI,OC)}\right) \left(\frac{\Pr(\neg MI,\neg OC)}{\Pr(MI,\neg OC)}\right)\\ &= \left(\frac{\Pr(MI,OC)}{\Pr(MI,\neg OC)}\right) \left(\frac{\Pr(\neg MI,\neg OC)}{\Pr(\neg MI,OC)}\right)\\ &= \left(\frac{\Pr(OC,MI)}{\Pr(\neg OC,MI)}\right) \left(\frac{\Pr(\neg OC,\neg MI)}{\Pr(OC,\neg MI)}\right)\\ &= \left(\frac{\Pr(OC|MI)}{\Pr(\neg OC|MI)}\right) \left(\frac{\Pr(\neg OC|\neg MI)}{\Pr(OC|\neg MI)}\right)\\ &= \frac{\left(\frac{\Pr(OC|MI)}{\Pr(\neg OC|MI)}\right)} {\left(\frac{\Pr(OC|\neg MI)}{\Pr(\neg OC|\neg MI)}\right)}\\ &\stackrel{\text{def}}{=}\frac{\omega(OC|MI)} {\omega(OC|\neg MI)}\\ &\stackrel{\text{def}}{=}\theta(OC; MI) \end{aligned} \]
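We can also check this reversibility numerically with the counts from Table 3.1 (a sanity check, not a proof):

Show R code
odds_MI_given_OC   = 13 / 4987    # omega(MI | OC)
odds_MI_given_noOC = 7 / 9993     # omega(MI | not OC)
odds_OC_given_MI   = 13 / 7       # omega(OC | MI)
odds_OC_given_noMI = 4987 / 9993  # omega(OC | not MI)
odds_MI_given_OC / odds_MI_given_noOC   # theta(MI; OC), ~3.721
odds_OC_given_MI / odds_OC_given_noMI   # theta(OC; MI), also ~3.721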


Exercise 3.13 For Table 3.2, show that \(\hat\theta(Exposed, Unexposed) = \hat\theta(Event, \neg Event)\).


Conditional odds ratios have the same reversibility property:

Theorem 3.10 (Conditional odds ratios are reversible) For any three events \(A\), \(B\), \(C\):

\[\theta(A|B,C) = \theta(B|A,C)\]


Proof. Apply the same steps as for Theorem 3.9, inserting \(C\) into the conditions (RHS of \(|\)) of every expression.


Odds Ratios vs Probability (Risk) Ratios

When the outcome is rare (i.e., its probability is small) for both groups being compared in an odds ratio, the odds of the outcome will be similar to the probability of the outcome, and thus the risk ratio will be similar to the odds ratio.

Case 1: rare events

For rare events, odds ratios and probability (a.k.a. risk, a.k.a. prevalence) ratios will be close:

\(\pi_1 = .01\) \(\pi_2 = .02\)

Show R code
pi1 = .01
pi2 = .02
pi2/pi1
#> [1] 2
odds(pi2)/odds(pi1)
#> [1] 2.02

Example 3.9 In Example 3.1, the outcome is rare for both OC and non-OC participants, so the odds for both groups are similar to the corresponding probabilities, and the odds ratio is similar to the risk ratio.


Case 2: frequent events

\(\pi_1 = .4\) \(\pi_2 = .5\)

For more frequently-occurring outcomes, this won’t be the case:

Show R code
pi1 = .4
pi2 = .5
pi2/pi1
#> [1] 1.25
odds(pi2)/odds(pi1)
#> [1] 1.5

Odds Ratios in Case-Control Studies

Table 3.1 simulates a follow-up study in which two populations were followed and the number of MIs was observed. The risks are \(P(MI|OC)\) and \(P(MI|\neg OC)\), and we can estimate these risks from the data.

But suppose we had a case-control study in which we had 100 women with MI and selected a comparison group of 100 women without MI (matched as groups on age, etc.). Then MI status is fixed by the sampling design, so we cannot estimate \(P(MI|OC)\), and we cannot compute the risk ratio. However, the odds ratio can still be computed.

The disease odds ratio is the odds for the disease in the exposed group divided by the odds for the disease in the unexposed group, and we cannot validly compute and use these separate parts.

We can still validly compute and use the exposure odds ratio, which is the odds for exposure in the disease group divided by the odds for exposure in the non-diseased group (because exposure can be treated as random):

\[ \hat\theta(OC|MI) = \frac{\hat{\omega}(OC|MI)}{\hat{\omega}(OC|\neg MI)} \]

And these two odds ratios, \(\hat\theta(MI|OC)\) and \(\hat\theta(OC|MI)\), are mathematically equivalent, as we saw in Section 3.4.3.2:

\[\hat\theta(MI|OC) = \hat\theta(OC|MI)\]


Exercise 3.14 Calculate the odds ratio of MI with respect to OC use, assuming that Table 3.1 comes from a case-control study. Confirm that the result is the same as in Example 3.7.


Solution.

  • \(\omega(OC|MI) = P(OC|MI)/(1 - P(OC|MI)) = \frac{13}{7} = 1.8571\)

  • \(\omega(OC|\neg MI) = P(OC|\neg MI)/(1 - P(OC|\neg MI)) = \frac{4987}{9993} = 0.499\)

  • \(\theta(OC,MI) = \frac{\omega(OC|MI)}{\omega(OC|\neg MI)} = \frac{13/7}{4987/9993} = 3.7214\)

This is the same estimate we calculated in Example 3.7.


Odds Ratios in Cross-Sectional Studies

  • If a cross-sectional study is a probability sample of a population (which it rarely is) then we can estimate risks.

  • If it is a sample, but not an unbiased probability sample, then we need to treat it in the same way as a case-control study.

  • We can validly estimate odds ratios in either case.

  • But we usually cannot validly estimate risks or risk ratios.

3.5 The logit and expit functions

3.5.1 The logit function

Definition 3.9 (logit function)  

The logit function of a probability \(\pi\) is the natural logarithm of the odds function of \(\pi\):

\[\text{logit}(\pi) \stackrel{\text{def}}{=}\text{log}\left\{\omega(\pi)\right\}\]


Figure 3.3 shows the shape of the \(\text{logit}()\) function.

Show R code
logit = function(p) log(odds(p))

logit_plot = 
  ggplot() + 
  geom_function(
    fun = logit,
    arrow = arrow(ends = "both")) + 
  xlim(.001, .999) + 
  ylab("logit(p)") +
  xlab("p") +
  theme_bw()
print(logit_plot)
Figure 3.3: The logit function

Theorem 3.11 \[\text{logit}(\pi) = \text{log}\left\{\frac{\pi}{1-\pi}\right\} \tag{3.4}\]


Exercise 3.15 (Compose the logit function) Prove Theorem 3.11.


Proof. Apply Definition 3.9 and then Definition 3.6 (details left to the reader).


Theorem 3.12 (Derivative of logit function) \[\text{logit}'(\pi) = \frac{1}{(\pi)(1-\pi)}\]


Proof. \[ \begin{aligned} \text{logit}'(\pi) &= \frac{\partial}{\partial \pi}\text{logit}(\pi) \\ &= \frac{\partial}{\partial \pi}\text{log}\left\{\omega(\pi)\right\} \\ &= \frac{\omega'(\pi)}{\omega(\pi)} \\ &= \omega'(\pi) \frac{1}{\omega(\pi)} \\ &= \frac{1}{\left(1-\pi\right)^2}\frac{1-\pi}{\pi} \\ &= \frac{1}{(\pi)(1-\pi)} \end{aligned} \]


3.5.2 The expit function

Definition 3.10 (expit, logistic, inverse-logit) The expit function (Figure 3.4) of a log-odds \(\eta\), also known as the inverse-logit function or logistic function, is the inverse-odds of the exponential of \(\eta\):

\[\text{expit}(\eta) \stackrel{\text{def}}{=}\text{invodds}\left\{\text{exp}\left\{\eta\right\}\right\}\]

Show R code
expit = function(eta)
  exp(eta) / (1 + exp(eta))
library(ggplot2)
expit_plot =
  ggplot() +
  geom_function(fun = expit,
                arrow = arrow(ends = "both")) +
  xlim(-8, 8) +
  ylim(0, 1) +
  ylab(expression(expit(eta))) +
  xlab(expression(eta)) +
  theme_bw()
print(expit_plot)
Figure 3.4: The expit function

Theorem 3.13 (logit and expit are each others’ inverses) \[\text{logit}\left\{\text{expit}\left\{\eta\right\}\right\} = \eta\]

\[\text{expit}\left\{\text{logit}\left\{\pi\right\}\right\} = \pi\]


Proof. Left to the reader.
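We can at least check Theorem 3.13 numerically, using the `logit()` and `expit()` functions defined in the R code above:

Show R code
eta = c(-3, -1, 0, 1, 3)
logit(expit(eta))           # recovers eta
p = c(0.05, 0.25, 0.5, 0.9)
expit(logit(p))             # recovers p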


Theorem 3.14 (Expressions for expit function) \[ \begin{aligned} \text{expit}(\eta) &= \frac{\text{exp}\left\{\eta\right\}}{1+\text{exp}\left\{\eta\right\}} \\ &= (1 + \text{exp}\left\{-\eta\right\})^{-1} \end{aligned} \]


Proof. Apply definitions and Lemma 3.2. Details left to the reader.


Lemma 3.4 \[1-\text{expit}\left\{\eta\right\} = \left(1+\text{exp}\left\{\eta\right\}\right)^{-1}\]


Proof. Using Lemma 3.3:

\[ \begin{aligned} 1 - \text{expit}\left\{\eta\right\} &= 1 - \text{invodds}\left\{\text{exp}\left\{\eta\right\}\right\} \\ &= \frac{1}{1+\text{exp}\left\{\eta\right\}} \\ &= \left(1 + \text{exp}\left\{\eta\right\}\right)^{-1} \end{aligned} \]


Lemma 3.5 \[\text{expit}'\left\{\eta\right\} = (\text{expit}\left\{\eta\right\}) (1 - \text{expit}\left\{\eta\right\})\]


Proof. Using Theorem 3.8:

\[ \begin{aligned} \text{expit}'\left\{\eta\right\} &= \frac{\partial}{\partial \eta} \text{expit}\left\{\eta\right\} \\ &= \frac{\partial}{\partial \eta} \text{invodds}\left\{\text{exp}\left\{\eta\right\}\right\} \\ &= {\text{invodds}}'\left\{\text{exp}\left\{\eta\right\}\right\} \frac{\partial}{\partial \eta}\text{exp}\left\{\eta\right\} \\ &= \frac{1}{\left(1 + \text{exp}\left\{\eta\right\}\right)^2} \text{exp}\left\{\eta\right\} \\ &= \frac{\text{exp}\left\{\eta\right\}}{\left(1 + \text{exp}\left\{\eta\right\}\right)^2} \\ &= \frac{\text{exp}\left\{\eta\right\}}{1 + \text{exp}\left\{\eta\right\}} \frac{1}{1 + \text{exp}\left\{\eta\right\}} \\ &= \text{expit}\left\{\eta\right\} (1-\text{expit}\left\{\eta\right\}) \end{aligned} \]


Proof. Alternatively, we can use Theorem 3.14:

\[ \begin{aligned} \text{expit}'\left\{\eta\right\} &= \frac{\partial}{\partial \eta} \text{expit}\left\{\eta\right\} \\ &= \frac{\partial}{\partial \eta} (1 + \text{exp}\left\{-\eta\right\})^{-1} \\ &= -(1 + \text{exp}\left\{-\eta\right\})^{-2} \frac{\partial}{\partial \eta} (1 + \text{exp}\left\{-\eta\right\}) \\ &= -(1 + \text{exp}\left\{-\eta\right\})^{-2} (-\text{exp}\left\{-\eta\right\}) \\ &= (1 + \text{exp}\left\{-\eta\right\})^{-2} (\text{exp}\left\{-\eta\right\}) \\ &= (1 + \text{exp}\left\{-\eta\right\})^{-1} \frac{\text{exp}\left\{-\eta\right\}}{1 + \text{exp}\left\{-\eta\right\}} \\ &= (1 + \text{exp}\left\{-\eta\right\})^{-1} \frac{1}{1 + \text{exp}\left\{\eta\right\}} \\ &= \text{expit}\left\{\eta\right\} (1-\text{expit}\left\{\eta\right\}) \end{aligned} \]

3.5.3 Diagram of expit and logit

\[ \underbrace{\pi}_{\atop{\Pr(Y=1)} } \overbrace{ \underbrace{ \underset{ \xleftarrow[\frac{\omega}{1+\omega}]{} } { \xrightarrow{\frac{\pi}{1-\pi}} } \underbrace{\omega}_{\text{odds}(Y=1)} \underset{ \xleftarrow[\text{exp}\left\{\eta\right\}]{} } { \xrightarrow{\text{log}\left\{\omega\right\}} } }_{\text{expit}(\eta)} }^{\text{logit}(\pi)} \underbrace{\eta}_{\atop{\text{log-odds}(Y=1)}} \]

3.6 Introduction to logistic regression

  • In Example 3.1, we estimated the risk and the odds of MI for two groups, defined by oral contraceptive use.

  • If the predictor is quantitative (dose) or there is more than one predictor, the task becomes more difficult.

  • In this case, we will use logistic regression, a generalization of the linear regression models you have been using that accommodates a binary response instead of a continuous one.

3.6.1 Binary outcomes models - one group, no covariates

\[ \begin{aligned} \text{P}(Y=1) &= \pi\\ \text{P}(Y=0) &= 1-\pi\\ \text{P}(Y=y) &= \pi^y (1-\pi)^{1-y}\\ \mathbf y &= (y_1, ..., y_n)\\ \mathcal L(\pi;\mathbf y) &= \pi^{\sum y_i} (1-\pi)^{n - \sum y_i}\\ \ell(\pi, \mathbf y) &= \left({\sum y_i}\right) \text{log}\left\{\pi\right\} + \left(n - \sum y_i\right) \text{log}\left\{1-\pi\right\}\\ &= \left({\sum y_i}\right) \left(\text{log}\left\{\pi\right\} - \text{log}\left\{1-\pi\right\}\right) + n \cdot \text{log}\left\{1-\pi\right\}\\ &= \left({\sum y_i}\right) \text{log}\left\{\frac{\pi}{ 1-\pi}\right\} + n \cdot \text{log}\left\{1-\pi\right\} \\ &= \left({\sum y_i}\right) \text{logit}(\pi) + n \cdot \text{log}\left\{1-\pi\right\} \end{aligned} \]
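The sketch below evaluates this log-likelihood for a small made-up outcome vector (hypothetical data, purely for illustration) and confirms numerically that it is maximized at the sample proportion \(\hat\pi = \frac{\sum y_i}{n}\):

Show R code
y = c(1, 0, 0, 1, 1, 0, 0, 0)   # hypothetical Bernoulli outcomes
n = length(y)
# log-likelihood in the form derived above (uses the logit() function defined earlier):
loglik = function(pi) sum(y) * logit(pi) + n * log(1 - pi)
mean(y)                                                               # sample proportion: 0.375
optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum  # also ~0.375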

3.6.2 Binary outcomes - general

\[ \begin{aligned} \text{P}(Y_i=1) &= \pi_i \\ \text{P}(Y_i=0) &= 1-\pi_i \end{aligned} \]

\[\text{P}(Y_i=y_i) = (\pi_i)^{y_i} (1-\pi_i)^{1-y_i}\]

\[\mathcal{L}_i(\pi_i) = \text{P}(Y_i=y_i)\]

\[ \begin{aligned} \ell_i(\pi_i) &= \text{log}\left\{\mathcal{L}_i(\pi_i)\right\} \\ &= y_i \text{log}\left\{\pi_i\right\} + (1-y_i) \text{log}\left\{1-\pi_i\right\} \end{aligned} \]


For \(\text{iid}\) data \(\tilde{y} = (y_1, ..., y_n)\):

\[ \begin{aligned} \mathcal{L}(\pi;\tilde{y}) &= \text{P}(Y_1=y_1, \ldots, Y_n = y_n) \\ &= \prod_{i=1}^n \text{P}(Y_i=y_i) \\ &= \prod_{i=1}^n \mathcal{L}_i(\pi_i) \end{aligned} \]

3.6.3 Modeling \(\pi_i\) as a function of \(X_i\)

If there are only a few distinct \(X_i\) values, we can model \(\pi_i\) separately for each value of \(X_i\).

Otherwise, we need regression.

\[ \begin{aligned} \pi(x) &\equiv \Pr(Y=1|X=x)\\ &= f(x^\top\beta) \end{aligned} \]

Typically, we use the \(\text{expit}\) inverse-link:

\[\pi(\tilde{x}) = \text{expit}(\tilde{x}'\beta) \tag{3.5}\]
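For example, with a hypothetical coefficient vector \(\tilde{\beta} = (-5, 0.07)\) and covariate pattern \(\tilde{x} = (1, 50)\) (an intercept term plus one predictor), Equation 3.5 gives:

Show R code
beta = c(-5, 0.07)     # hypothetical intercept and slope
x = c(1, 50)           # covariate pattern: intercept term and x = 50
expit(sum(x * beta))   # P(Y = 1 | X = x) = expit(-1.5), ~0.18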

3.6.4 Meet the beetles

Show R code
library(glmx)

data(BeetleMortality, package = "glmx")
beetles = BeetleMortality |>
  mutate(
    pct = died/n,
    survived = n - died
  )

plot1 = 
  beetles |> 
  ggplot(aes(x = dose, y = pct)) +
  geom_point(aes(size = n)) +
  xlab("Dose (log mg/L)") +
  ylab("Mortality rate (%)") +
  scale_y_continuous(labels = scales::percent) +
  scale_size(range = c(1,2)) +
  theme_bw(base_size = 18)

print(plot1)
Figure 3.5: Mortality rates of adult flour beetles after five hours’ exposure to gaseous carbon disulphide (Bliss 1935)

3.6.5 Why don’t we use linear regression?

Show R code
beetles_long = beetles  |> 
   reframe(
    .by = everything(),
    outcome = c(
      rep(1, times = died), 
      rep(0, times = survived))) |> 
  as_tibble()

lm1 = beetles_long |> lm(formula = outcome ~ dose)
f.linear = function(x) predict(lm1, newdata = data.frame(dose = x))

range1 = range(beetles$dose) + c(-.2, .2)

plot2 = 
  plot1 + 
  geom_function(
    fun = f.linear, 
    aes(col = "Straight line")) +
  labs(colour="Model", size = "")

plot2 |> print()
Figure 3.6: Mortality rates of adult flour beetles after five hours’ exposure to gaseous carbon disulphide (Bliss 1935)

3.6.6 Zoom out

Show R code
(plot2 + expand_limits(x = c(1.6, 2))) |> print()
Figure 3.7: Mortality rates of adult flour beetles after five hours’ exposure to gaseous carbon disulphide (Bliss 1935)

3.6.7 log transformation of dose?

Show R code
lm2 = beetles_long |> lm(formula = outcome ~ log(dose))
f.linearlog = function(x) predict(lm2, newdata = data.frame(dose = x))

plot3 = plot2 + 
  expand_limits(x = c(1.6, 2)) +
  geom_function(fun = f.linearlog, aes(col = "Log-transform dose"))
(plot3 + expand_limits(x = c(1.6, 2))) |> print()
Figure 3.8: Mortality rates of adult flour beetles after five hours’ exposure to gaseous carbon disulphide (Bliss 1935)

3.6.8 Logistic regression

Show R code

#| label: fig-beetles_5
#| fig-cap: "Mortality rates of adult flour beetles after five hours' exposure to gaseous carbon disulphide (Bliss 1935)"

beetles_glm_grouped = beetles |>
  glm(formula = cbind(died, survived) ~ dose, family = "binomial")
f = function(x)
  beetles_glm_grouped |>
  predict(newdata = data.frame(dose = x), type = "response")

plot4 = plot3 + geom_function(fun = f, aes(col = "Logistic regression"))
plot4 |> print()


3.6.9 Three parts to regression models

  • What distribution does the outcome have for a specific subpopulation defined by covariates? (outcome model)

  • How does the combination of covariates relate to the mean? (link function)

  • How do the covariates combine? (linear predictor, interactions)


3.6.10 Logistic regression in R

Show R code

beetles_glm_grouped = 
  beetles |> 
  glm(
    formula = cbind(died, survived) ~ dose, 
    family = "binomial")

library(parameters)
beetles_glm_grouped |> 
  parameters() |> 
  print_md()
Parameter Log-Odds SE 95% CI z p
(Intercept) -60.72 5.18 (-71.44, -51.08) -11.72 < .001
dose 34.27 2.91 (28.85, 40.30) 11.77 < .001

Fitted values:

Show R code
fitted.values(beetles_glm_grouped)
#>      1      2      3      4      5      6      7      8 
#> 0.0586 0.1640 0.3621 0.6053 0.7952 0.9032 0.9552 0.9790
predict(beetles_glm_grouped, type = "response")
#>      1      2      3      4      5      6      7      8 
#> 0.0586 0.1640 0.3621 0.6053 0.7952 0.9032 0.9552 0.9790
predict(beetles_glm_grouped, type = "link")
#>       1       2       3       4       5       6       7       8 
#> -2.7766 -1.6286 -0.5662  0.4277  1.3564  2.2337  3.0596  3.8444

fit_y = beetles$n * fitted.values(beetles_glm_grouped)

3.6.11 Individual observations

Show R code
beetles_long
Table 3.3: beetles data in long format

Show R code
beetles_glm_ungrouped = 
  beetles_long |> 
  glm(
    formula = outcome ~ dose, 
    family = "binomial")

beetles_glm_ungrouped |> parameters() |> print_md()
Table 3.4: logistic regression model for beetles data with individual Bernoulli data
Parameter Log-Odds SE 95% CI z p
(Intercept) -60.72 5.18 (-71.44, -51.08) -11.72 < .001
dose 34.27 2.91 (28.85, 40.30) 11.77 < .001

Here’s the previous version again:

Show R code
beetles_glm_grouped |> parameters() |> print_md()
Table 3.5: logistic regression model for beetles data with grouped (binomial) data
Parameter Log-Odds SE 95% CI z p
(Intercept) -60.72 5.18 (-71.44, -51.08) -11.72 < .001
dose 34.27 2.91 (28.85, 40.30) 11.77 < .001

They seem the same! But not quite:

Show R code

logLik(beetles_glm_grouped)
#> 'log Lik.' -18.72 (df=2)
logLik(beetles_glm_ungrouped)
#> 'log Lik.' -186.2 (df=2)

The difference is due to the binomial coefficient \(\left(n\atop x \right)\) which isn’t included in the individual-observations (Bernoulli) version of the model.

3.7 Multiple logistic regression

3.7.1 Coronary heart disease (WCGS) study data

Let’s use the data from the Western Collaborative Group Study (WCGS) (Rosenman et al. (1975)) to explore multiple logistic regression:

From Vittinghoff et al. (2012):

“The Western Collaborative Group Study (WCGS) was a large epidemiological study designed to investigate the ‘type A’ behavior pattern and coronary heart disease (CHD).”

From Wikipedia, “Type A and Type B personality theory”:

“The hypothesis describes Type A individuals as outgoing, ambitious, rigidly organized, highly status-conscious, impatient, anxious, proactive, and concerned with time management….

The hypothesis describes Type B individuals as a contrast to those of Type A. Type B personalities, by definition, are noted to live at lower stress levels. They typically work steadily and may enjoy achievement, although they have a greater tendency to disregard physical or mental stress when they do not achieve.”


Study design

from ?faraway::wcgs:

3154 healthy young men aged 39-59 from the San Francisco area were assessed for their personality type. All were free from coronary heart disease at the start of the research. Eight and a half years later change in CHD status was recorded.

Details (from faraway::wcgs)

The WCGS began in 1960 with 3,524 male volunteers who were employed by 11 California companies. Subjects were 39 to 59 years old and free of heart disease as determined by electrocardiogram. After the initial screening, the study population dropped to 3,154 and the number of companies to 10 because of various exclusions. The cohort comprised both blue- and white-collar employees.


3.7.2 Baseline data collection

socio-demographic characteristics:

  • age
  • education
  • marital status
  • income
  • occupation

physical and physiological measurements:

  • height
  • weight
  • blood pressure
  • electrocardiogram
  • corneal arcus

biochemical measurements:

  • cholesterol and lipoprotein fractions

medical and family history and use of medications

behavioral data:

  • Type A interview
  • smoking
  • exercise
  • alcohol use

Later surveys added data on:

  • anthropometry
  • triglycerides
  • Jenkins Activity Survey
  • caffeine use

Average follow-up continued for 8.5 years with repeat examinations.

3.7.3 Load the data

Here, I load the data:

Show R code
### load the data directly from a UCSF website:
library(haven)
url = paste0(
    # I'm breaking up the url into two chunks for readability
    "https://regression.ucsf.edu/sites/g/files/",
    "tkssra6706/f/wysiwyg/home/data/wcgs.dta")
wcgs = haven::read_dta(url)
Show R code
wcgs |> head()
Table 3.6: wcgs data

3.7.4 Data cleaning

Now let’s do some data cleaning

Show R code
library(arsenal) # provides `set_labels()`
library(forcats) # provides `as_factor()`
library(haven)
library(plotly)
wcgs = wcgs |> 
  mutate(
    age = age |> 
      arsenal::set_labels("Age (years)"),
    
    arcus = 
      arcus |> 
      as.logical() |> 
      arsenal::set_labels("Arcus Senilis"),
    
    time169 = 
      time169 |> 
      as.numeric() |> 
      arsenal::set_labels("Observation (follow up) time (days)"),
    
    dibpat =
      dibpat |> 
      as_factor() |> 
      relevel(ref = "Type B") |> 
      arsenal::set_labels("Behavioral Pattern"),
    
    typchd69 = typchd69 |> 
      labelled(
        label = "Type of CHD Event",
        labels = 
          c(
            "None" = 0, 
            "infdeath" = 1,
            "silent" = 2,
            "angina" = 3)),
    
    # turn stata-style labelled variables in to R-style factors:
    across(
      where(is.labelled), 
      haven::as_factor)
  )

3.7.5 What’s in the data

Here’s a table of the data:

Show R code
wcgs |>
  select(-c(id, uni, t1)) |>
  tableby(chd69 ~ ., data = _) |>
  summary(
    pfootnote = TRUE,
    title =
      "Baseline characteristics by CHD status at end of follow-up")
Baseline characteristics by CHD status at end of follow-up
No (N=2897) Yes (N=257) Total (N=3154) p value
Age (years) < 0.001¹
   Mean (SD) 46.082 (5.457) 48.490 (5.801) 46.279 (5.524)
   Range 39.000 - 59.000 39.000 - 59.000 39.000 - 59.000
Arcus Senilis < 0.001²
   N-Miss 0 2 2
   FALSE 2058 (71.0%) 153 (60.0%) 2211 (70.1%)
   TRUE 839 (29.0%) 102 (40.0%) 941 (29.9%)
Behavioral Pattern < 0.001²
   A1 234 (8.1%) 30 (11.7%) 264 (8.4%)
   A2 1177 (40.6%) 148 (57.6%) 1325 (42.0%)
   B3 1155 (39.9%) 61 (23.7%) 1216 (38.6%)
   B4 331 (11.4%) 18 (7.0%) 349 (11.1%)
Body Mass Index (kg/m2) < 0.001¹
   Mean (SD) 24.471 (2.561) 25.055 (2.579) 24.518 (2.567)
   Range 11.191 - 37.653 19.225 - 38.947 11.191 - 38.947
Total Cholesterol < 0.001¹
   N-Miss 12 0 12
   Mean (SD) 224.261 (42.217) 250.070 (49.396) 226.372 (43.420)
   Range 103.000 - 400.000 155.000 - 645.000 103.000 - 645.000
Diastolic Blood Pressure < 0.001¹
   Mean (SD) 81.723 (9.621) 85.315 (10.311) 82.016 (9.727)
   Range 58.000 - 150.000 64.000 - 122.000 58.000 - 150.000
Behavioral Pattern < 0.001²
   Type B 1486 (51.3%) 79 (30.7%) 1565 (49.6%)
   Type A 1411 (48.7%) 178 (69.3%) 1589 (50.4%)
Height (inches) 0.290¹
   Mean (SD) 69.764 (2.539) 69.938 (2.410) 69.778 (2.529)
   Range 60.000 - 78.000 63.000 - 77.000 60.000 - 78.000
Ln of Systolic Blood Pressure < 0.001¹
   Mean (SD) 4.846 (0.110) 4.900 (0.125) 4.850 (0.112)
   Range 4.585 - 5.438 4.605 - 5.298 4.585 - 5.438
Ln of Weight < 0.001¹
   Mean (SD) 5.126 (0.123) 5.155 (0.118) 5.128 (0.123)
   Range 4.357 - 5.670 4.868 - 5.768 4.357 - 5.768
Cigarettes per day < 0.001¹
   Mean (SD) 11.151 (14.329) 16.665 (15.657) 11.601 (14.518)
   Range 0.000 - 99.000 0.000 - 60.000 0.000 - 99.000
Systolic Blood Pressure < 0.001¹
   Mean (SD) 128.034 (14.746) 135.385 (17.473) 128.633 (15.118)
   Range 98.000 - 230.000 100.000 - 200.000 98.000 - 230.000
Current smoking < 0.001²
   No 1554 (53.6%) 98 (38.1%) 1652 (52.4%)
   Yes 1343 (46.4%) 159 (61.9%) 1502 (47.6%)
Observation (follow up) time (days) < 0.001¹
   Mean (SD) 2775.158 (562.205) 1654.700 (859.297) 2683.859 (666.524)
   Range 238.000 - 3430.000 18.000 - 3229.000 18.000 - 3430.000
Type of CHD Event
   None 0 (0.0%) 0 (0.0%) 0 (0.0%)
   infdeath 2897 (100.0%) 0 (0.0%) 2897 (91.9%)
   silent 0 (0.0%) 135 (52.5%) 135 (4.3%)
   angina 0 (0.0%) 71 (27.6%) 71 (2.3%)
   4 0 (0.0%) 51 (19.8%) 51 (1.6%)
Weight (lbs) < 0.001¹
   Mean (SD) 169.554 (21.010) 174.463 (21.573) 169.954 (21.096)
   Range 78.000 - 290.000 130.000 - 320.000 78.000 - 320.000
Weight Category < 0.001²
   < 140 217 (7.5%) 15 (5.8%) 232 (7.4%)
   140-170 1440 (49.7%) 98 (38.1%) 1538 (48.8%)
   170-200 1049 (36.2%) 122 (47.5%) 1171 (37.1%)
   > 200 191 (6.6%) 22 (8.6%) 213 (6.8%)
RECODE of age (Age) < 0.001²
   35-40 512 (17.7%) 31 (12.1%) 543 (17.2%)
   41-45 1036 (35.8%) 55 (21.4%) 1091 (34.6%)
   46-50 680 (23.5%) 70 (27.2%) 750 (23.8%)
   51-55 463 (16.0%) 65 (25.3%) 528 (16.7%)
   56-60 206 (7.1%) 36 (14.0%) 242 (7.7%)
  ¹ Linear Model ANOVA
  ² Pearson’s Chi-squared test

3.7.6 Data by age and personality type

For now, we will look at the interaction between age and personality type (dibpat). To make it easier to visualize the data, we summarize the event rates for each combination of age and personality type:

Show R code
chd_grouped_data = 
  wcgs |> 
  summarize(
    .by = c(age, dibpat),
    n = sum(chd69 %in% c("Yes", "No")),
    x = sum(chd69 == "Yes")) |> 
  mutate(
    `n - x` = n - x,
    `p(chd)` = (x / n) |> 
      labelled(label = "CHD Event by 1969"),
    `odds(chd)` = `p(chd)` / (1 - `p(chd)`), 
    `logit(chd)` = log(`odds(chd)`)
  )

chd_grouped_data

3.7.7 Graphical exploration

Show R code
library(ggplot2)
library(ggeasy)
library(scales)
chd_plot_probs = 
  chd_grouped_data |> 
  ggplot(
    aes(
      x = age, 
      y = `p(chd)`, 
      col = dibpat)
  ) +
  geom_point(aes(size = n), alpha = .7) + 
  scale_size(range = c(1,4)) +
  geom_line() +
  theme_bw() +
  ylab("P(CHD Event by 1969)") +
  scale_y_continuous(
    labels = scales::label_percent(),
    sec.axis = sec_axis(
      ~ odds(.),
      name = "odds(CHD Event by 1969)")) +
  ggeasy::easy_labs() +
  theme(legend.position = "bottom")

print(chd_plot_probs)
Figure 3.9: CHD rates by age group, probability scale

Odds scale

Show R code
trans_odds = trans_new(
  name = "odds", 
  transform = odds, 
  inverse = odds_inv)

chd_plot_odds = chd_plot_probs + 
  scale_y_continuous(
    trans = trans_odds, # this line changes the vertical spacing
    name = chd_plot_probs$labels$y,
    sec.axis = sec_axis(
      ~ odds(.),
      name = "odds(CHD Event by 1969)"))

print(chd_plot_odds)
Figure 3.10: CHD rates by age group, odds spacing

Log-odds (logit) scale

Show R code
trans_logit = trans_new(
  name = "logit", 
  transform = logit, 
  inverse = expit)

chd_plot_logit = 
  chd_plot_probs + 
  scale_y_continuous(
    trans = trans_logit, # this line changes the vertical spacing
    name = chd_plot_probs$labels$y,
    breaks = c(seq(.01, .1, by = .01), .15, .2),
    minor_breaks = NULL,
    sec.axis = sec_axis(
      ~ logit(.),
      name = "log(odds(CHD Event by 1969))"))

print(chd_plot_logit)
Figure 3.11: CHD data (logit-scale)

3.7.8 Logistic regression models for CHD data

Here, we fit stratified models for CHD by personality type.

Show R code
chd_glm_strat = glm(
  "formula" = chd69 == "Yes" ~ dibpat + dibpat:age - 1, 
  "data" = wcgs,
  "family" = binomial(link = "logit")
)

chd_glm_strat |> parameters() |> print_md()
Table 3.7: CHD model, stratified parametrization
Parameter Log-Odds SE 95% CI z p
dibpat (Type B) -5.80 0.98 (-7.73, -3.90) -5.95 < .001
dibpat (Type A) -5.50 0.67 (-6.83, -4.19) -8.18 < .001
dibpat (Type B) × age 0.06 0.02 (0.02, 0.10) 3.01 0.003
dibpat (Type A) × age 0.07 0.01 (0.05, 0.10) 5.24 < .001

We can get the corresponding odds ratios (\(e^{\beta}\)s) by passing exponentiate = TRUE to parameters():

Show R code
chd_glm_strat |> 
  parameters(exponentiate = TRUE) |> 
  print_md()
Table 3.8: Odds ratio estimates for CHD model
Parameter Odds Ratio SE 95% CI z p
dibpat (Type B) 3.02e-03 2.94e-03 (4.40e-04, 0.02) -5.95 < .001
dibpat (Type A) 4.09e-03 2.75e-03 (1.08e-03, 0.02) -8.18 < .001
dibpat (Type B) × age 1.06 0.02 (1.02, 1.11) 3.01 0.003
dibpat (Type A) × age 1.07 0.01 (1.05, 1.10) 5.24 < .001

3.7.9 Models superimposed on data

We can graph our fitted models on each scale (probability, odds, log-odds).


probability scale

Show R code

curve_type_A = function(x) 
{
  chd_glm_strat |> predict(
    type = "response",
    newdata = tibble(age = x, dibpat = "Type A"))
}

curve_type_B = function(x) 
{
  chd_glm_strat |> predict(
    type = "response",
    newdata = tibble(age = x, dibpat = "Type B"))
}

chd_plot_probs_2 =
  chd_plot_probs +
  geom_function(
    fun = curve_type_A,
    aes(col = "Type A")
  ) +
  geom_function(
    fun = curve_type_B,
    aes(col = "Type B")
  )
print(chd_plot_probs_2)


odds scale

Show R code
# curve_type_A = function(x) 
# {
#   chd_glm_strat |> predict(
#     type = "link",
#     newdata = tibble(age = x, dibpat = "Type A")) |> exp()
# }
# curve_type_B = function(x) 
# {
#   chd_glm_strat |> predict(
#     type = "link",
#     newdata = tibble(age = x, dibpat = "Type B")) |> exp()
# }

chd_plot_odds_2 =
  chd_plot_odds +
  geom_function(
    fun = curve_type_A,
    aes(col = "Type A")
  ) +
  geom_function(
    fun = curve_type_B,
    aes(col = "Type B")
  )
print(chd_plot_odds_2)
Figure 3.12

log-odds (logit) scale

Show R code
chd_plot_logit_2 =
  chd_plot_logit +
  geom_function(
    fun = curve_type_A,
    aes(col = "Type A")
  ) +
  geom_function(
    fun = curve_type_B,
    aes(col = "Type B")
  )

print(chd_plot_logit_2)
Figure 3.13

3.7.10 reference-group and contrast parametrization

We can also use the corner-point parametrization (with reference groups and contrasts):

Show R code
chd_glm_contrasts = 
  wcgs |> 
  glm(
    "data" = _,
    "formula" = chd69 == "Yes" ~ dibpat*I(age - 50), 
    "family" = binomial(link = "logit")
  )

chd_glm_contrasts |> 
  parameters() |> 
  print_md()
Table 3.9: CHD model (corner-point parametrization)
Parameter Log-Odds SE 95% CI z p
(Intercept) -2.73 0.13 (-2.98, -2.49) -21.45 < .001
dibpat (Type A) 0.82 0.15 (0.53, 1.13) 5.42 < .001
age - 50 0.06 0.02 (0.02, 0.10) 3.01 0.003
dibpat (Type A) × age - 50 0.01 0.02 (-0.04, 0.06) 0.42 0.674

Compare with Table 3.8.


Exercise 3.16 If I give you the coefficients from one of these parametrizations (say, the stratified model in Table 3.7), how would you get the coefficients of the other (the reference-group model in Table 3.9)?


Theorem 3.15 For the logistic regression model:

  • \(Y_i|\tilde{X}_i \ \sim_{⫫}\ \text{Ber}(\pi(\tilde{X}_i))\)
  • \(\pi(\tilde{x}) = \text{expit}\left\{\tilde{x}'\tilde{\beta}\right\}\)

Consider two covariate patterns, \(\tilde{x}\) and \(\tilde{x^*}\).

The odds ratio comparing these covariate patterns is:

\[ \theta(\tilde{x},\tilde{x^*}) = \text{exp}\left\{(\tilde{x}-\tilde{x^*})^{\top} \tilde{\beta}\right\} \]


Proof. \[ \begin{aligned} \theta(\tilde{x},\tilde{x^*}) &= \frac {\omega(Y=1 | \tilde{X}= \tilde{x})} {\omega(Y=1 | \tilde{X}= \tilde{x^*})} \\ &= \frac{\text{exp}\left\{\tilde{x}^{\top}\tilde{\beta}\right\}}{\text{exp}\left\{{\tilde{x^*}}^{\top} \tilde{\beta}\right\}} \\ &= \text{exp}\left\{\tilde{x}^{\top}\tilde{\beta}- {\tilde{x^*}}^{\top} \tilde{\beta}\right\} \\ &= \text{exp}\left\{(\tilde{x}^{\top} - {\tilde{x^*}}^{\top}) \tilde{\beta}\right\} \\ &= \text{exp}\left\{{(\tilde{x}- \tilde{x^*})}^{\top} \tilde{\beta}\right\} \end{aligned} \]
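As a sketch of Theorem 3.15 in action, we can compare two Type A covariate patterns from the stratified CHD model whose ages differ by 10 years; the odds ratio is \(\text{exp}\left\{10 \cdot \beta_{\text{age, Type A}}\right\}\) (computed here via `predict()` on the link scale, rather than by extracting coefficients by name):

Show R code
eta_55 = predict(chd_glm_strat, type = "link",
                 newdata = tibble(age = 55, dibpat = "Type A"))
eta_45 = predict(chd_glm_strat, type = "link",
                 newdata = tibble(age = 45, dibpat = "Type A"))
exp(eta_55 - eta_45)   # OR for a 10-year age difference among Type A men; roughly exp(10 * 0.07), i.e. about 2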

3.8 Fitting logistic regression models

3.8.1 Maximum likelihood estimation for \(\text{ciid}\) data

Assume:

  • \(Y_i|\tilde{X}_i \ \sim_{⫫}\ \text{Ber}(\pi(X_i))\)
  • \(\pi(\tilde{x}) = \text{expit}\left\{\tilde{x}'\tilde{\beta}\right\}\)

log-likelihood function

\[ \begin{aligned} \ell(\tilde{\beta}, \tilde{y}) &= \text{log}\left\{\mathcal{L}(\tilde{\beta}, \tilde{y}) \right\} \\ &= \sum_{i=1}^n{\color{red}\ell_i}(\pi(\tilde{x}_i)) \end{aligned} \tag{3.6}\]


\[ \begin{aligned} {\color{red}\ell_i}(\pi) &= y_i \text{log}\left\{\pi\right\} + (1 - y_i) \text{log}\left\{1-\pi\right\} \\ &= y_i \text{log}\left\{\pi\right\} + (1 \cdot\text{log}\left\{1-\pi\right\} - y_i \cdot\text{log}\left\{1-\pi\right\}) \\ &= y_i \text{log}\left\{\pi\right\} + (\text{log}\left\{1-\pi\right\} - y_i \text{log}\left\{1-\pi\right\}) \\ &= y_i \text{log}\left\{\pi\right\} + \text{log}\left\{1-\pi\right\} - y_i \text{log}\left\{{\color{blue}1-\pi}\right\} \\ &= y_i \text{log}\left\{\pi\right\} - y_i \text{log}\left\{{\color{blue}1-\pi}\right\} + \text{log}\left\{1-\pi\right\} \\ &= (y_i \text{log}\left\{\pi\right\} - y_i \text{log}\left\{{\color{blue}1-\pi}\right\}) + \text{log}\left\{1-\pi\right\} \\ &= y_i (\text{log}\left\{{\color{red}\pi}\right\} - \text{log}\left\{{\color{blue}1-\pi}\right\}) + \text{log}\left\{1-\pi\right\} \\ &= y_i \left(\text{log}\left\{\frac{{\color{red}\pi}}{{\color{blue}1-\pi}}\right\}\right) + \text{log}\left\{1-\pi\right\} \\ &= y_i (\text{logit}(\pi)) + \text{log}\left\{1-\pi\right\} \end{aligned} \]


score function

\[ \begin{aligned} \ell'(\tilde{\beta}) &\stackrel{\text{def}}{=}\frac{\partial}{\partial \tilde{\beta}} \ell(\tilde{\beta}) \\ &= \frac{\partial}{\partial \tilde{\beta}} \sum_{i=1}^n\ell_i(\tilde{\beta}) \\ &= \sum_{i=1}^n\frac{\partial}{\partial \tilde{\beta}} \ell_i(\tilde{\beta}) \\ &= \sum_{i=1}^n\ell'_i(\tilde{\beta}) \end{aligned} \]


\[ \begin{aligned} \ell_i'(\tilde{\beta}) &= \frac{\partial}{\partial \tilde{\beta}} y_i \left(\text{logit}\left\{\pi_i\right\}\right) + \text{log}\left\{1-\pi_i\right\} \\ &= \frac{\partial}{\partial \tilde{\beta}}\left\{y_i \left(\tilde{x}_i'\tilde{\beta}\right) + \text{log}\left\{1-\pi_i\right\}\right\} \\ &= \left\{y_i \frac{\partial}{\partial \tilde{\beta}}\left(\tilde{x}_i'\tilde{\beta}\right) + \frac{\partial}{\partial \tilde{\beta}}\text{log}\left\{1-\pi_i\right\}\right\} \\ &= \left\{\tilde{x}_i y_i + \frac{\partial}{\partial \tilde{\beta}}\text{log}\left\{1-\text{expit}(\tilde{x}_i'\tilde{\beta})\right\}\right\} \\ &= \left\{\tilde{x}_i y_i + \frac{\partial}{\partial \tilde{\beta}}\text{log}\left\{\left(1+\text{exp}\left\{\tilde{x}_i'\tilde{\beta}\right\}\right)^{-1}\right\}\right\} \\ &= \left\{\tilde{x}_i y_i - \frac{\partial}{\partial \tilde{\beta}}\text{log}\left\{1+\text{exp}\left\{\tilde{x}_i'\tilde{\beta}\right\}\right\}\right\} \end{aligned} \]


Now we need to apply the chain rule:

\[ \frac{\partial}{\partial \beta}\text{log}\left\{1+\text{exp}\left\{\tilde{x}_i'\beta\right\}\right\} = \frac{1}{1+\text{exp}\left\{\tilde{x}_i'\beta\right\}} \frac{\partial}{\partial \beta}\left\{1+\text{exp}\left\{\tilde{x}_i'\beta\right\}\right\} \]

\[ \begin{aligned} \frac{\partial}{\partial \beta}\left\{1+\text{exp}\left\{\tilde{x}_i'\beta\right\}\right\} &= \text{exp}\left\{\tilde{x}_i'\beta\right\} \frac{\partial}{\partial \beta}\tilde{x}_i'\beta \\ &= \tilde{x}_i \text{exp}\left\{\tilde{x}_i'\beta\right\} \end{aligned} \]

So:

\[ \begin{aligned} \frac{\partial}{\partial \beta}\text{log}\left\{1+\text{exp}\left\{\tilde{x}_i'\beta\right\}\right\} &= \frac{1}{1+\text{exp}\left\{\tilde{x}_i'\beta\right\}} \text{exp}\left\{\tilde{x}_i'\beta\right\} \tilde{x}_i \\ &= \frac{\text{exp}\left\{\tilde{x}_i'\beta\right\}}{1+\text{exp}\left\{\tilde{x}_i'\beta\right\}} \tilde{x}_i \\ &= \tilde{x}_i \text{expit}\left\{\tilde{x}_i'\beta\right\} \end{aligned} \]


So:

\[ \begin{aligned} \ell_i'(\tilde{\beta}) &= \tilde{x}_i y_i - \tilde{x}_i \text{expit}\left\{\tilde{x}_i'\beta\right\} \\ &= \tilde{x}_i (y_i - \text{expit}\left\{\tilde{x}_i'\beta\right\}) \\ &= \tilde{x}_i (y_i - \pi_i) \\ &= \tilde{x}_i (y_i - \mathbb{E}[Y_i|\tilde{X}_i=\tilde{x}_i]) \\ &= \tilde{x}_i \ \varepsilon(y_i|\tilde{X}_i=\tilde{x}_i) \end{aligned} \]

This last expression is essentially the same as we found in linear regression.


Putting the pieces of \(\ell'(\tilde{\beta})\) back together, we have:

\[ \ell'(\tilde{\beta}) = \sum_{i=1}^n\left\{\tilde{x}_i(y_i - \text{expit}\left\{\tilde{x}_i'\beta\right\}) \right\} \]

Setting \(\ell'(\tilde{\beta}; \tilde{y}) = 0\) gives us:

\[\sum_{i=1}^n\left\{\tilde{x}_i(y_i - \text{expit}\left\{\tilde{x}_i'\beta\right\}) \right\} = 0 \tag{3.7}\]


In general, the estimating equation \(\ell'(\tilde{\beta}; \tilde{y}) = 0\) cannot be solved analytically.

Instead, we can use the Newton-Raphson method:

\[ {\widehat{\theta}}^* \leftarrow {\widehat{\theta}}^* - \left(\ell''\left(\tilde{y};{\widehat{\theta}}^*\right)\right)^{-1} \ell'\left(\tilde{y};{\widehat{\theta}}^*\right) \]

We make an iterative series of guesses, and each guess helps us make the next guess better (i.e., higher log-likelihood). You can see some information about this process like so:

Show R code
options(digits = 8)
temp = 
  wcgs |> 
  glm(
    control = glm.control(trace = TRUE),
    data = _,
    formula = chd69 == "Yes" ~ dibpat*age, 
    family = binomial(link = "logit")
  )
#> Deviance = 1775.7899 Iterations - 1
#> Deviance = 1708.5396 Iterations - 2
#> Deviance = 1704.0434 Iterations - 3
#> Deviance = 1703.9833 Iterations - 4
#> Deviance = 1703.9832 Iterations - 5
#> Deviance = 1703.9832 Iterations - 6

After each iteration of the fitting procedure, the deviance (\(2(\ell_{\text{full}} - \ell(\hat\beta))\) ) is printed. You can see that the algorithm took six iterations to converge to a solution where the likelihood wasn’t changing much anymore.
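Below is a minimal hand-coded sketch of the Newton-Raphson update for this model (for the canonical logit link, the iteratively reweighted least squares algorithm used by `glm()` performs an equivalent update); it uses a fixed number of iterations rather than a convergence check, just for illustration:

Show R code
X = model.matrix(~ dibpat * age, data = wcgs)   # design matrix
y = as.numeric(wcgs$chd69 == "Yes")             # binary outcome
beta = rep(0, ncol(X))                          # starting guess
for (iter in 1:10) {
  pi_hat = expit(as.vector(X %*% beta))         # fitted probabilities
  score = t(X) %*% (y - pi_hat)                 # ell'(beta)
  w = pi_hat * (1 - pi_hat)                     # Bernoulli variances
  info = t(X) %*% (X * w)                       # -ell''(beta) = X' W X
  beta = beta + solve(info, score)              # Newton-Raphson update
}
cbind(newton = as.vector(beta), glm = coef(temp))  # the two fits agree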

3.9 Model comparisons for logistic models

3.9.1 Deviance test

We can compare the maximized log-likelihood of our model, \(\ell(\hat\beta; \mathbf x)\), versus the log-likelihood of the full model (aka saturated model aka maximal model), \(\ell_{\text{full}}\), which has one parameter per covariate pattern. With enough data, \(2(\ell_{\text{full}} - \ell(\hat\beta; \mathbf x)) \dot \sim \chi^2(N - p)\), where \(N\) is the number of distinct covariate patterns and \(p\) is the number of \(\beta\) parameters in our model. A significant p-value for this deviance statistic indicates that there’s some detectable pattern in the data that our model isn’t flexible enough to catch.

Caution

The deviance statistic needs to have a large amount of data for each covariate pattern for the \(\chi^2\) approximation to hold. A guideline from Dobson is that if there are \(q\) distinct covariate patterns \(x_1...,x_q\), with \(n_1,...,n_q\) observations per pattern, then the expected frequencies \(n_k \cdot \pi(x_k)\) should be at least 1 for every pattern \(k\in 1:q\).

If you have covariates measured on a continuous scale, most covariate patterns will contain only one or a few observations, and you may not be able to use the deviance test to assess goodness of fit.
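When the test is applicable, here is a sketch of how it can be computed in R (`fit` is a placeholder name for a glm fit to grouped, binomial-count data):

Show R code
# Sketch: deviance goodness-of-fit test for a grouped-data logistic model.
dev_stat = deviance(fit)   # 2 * (loglik of saturated model - loglik of fitted model)
df = df.residual(fit)      # N (covariate patterns) - p (parameters)
pchisq(dev_stat, df = df, lower.tail = FALSE)  # goodness-of-fit p-value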

3.9.2 Hosmer-Lemeshow test

If our covariate patterns produce groups that are too small, a reasonable solution is to make bigger groups by merging some of the covariate-pattern groups together.

Hosmer and Lemeshow (1980) proposed that we group the patterns by their predicted probabilities according to the model of interest. For example, you could group all of the observations with predicted probabilities of 10% or less together, then group the observations with predicted probabilities above 10% and up to 20% together, and so on; \(g=10\) categories in all.

Then we can construct a statistic \[X^2 = \sum_{c=1}^g \frac{(o_c - e_c)^2}{e_c}\] where \(o_c\) is the number of events observed in group \(c\), and \(e_c\) is the number of events expected in group \(c\) (based on the sum of the fitted values \(\hat\pi_i\) for observations in group \(c\)).

If each group has enough observations in it, you can compare \(X^2\) to a \(\chi^2\) distribution; by simulation, the degrees of freedom has been found to be approximately \(g-2\).

For our CHD model, this procedure would be:

Show R code
wcgs = 
  wcgs |> 
  mutate(
    pred_probs_glm1 = chd_glm_strat |> fitted(),
    pred_prob_cats1 = 
      pred_probs_glm1 |> 
      cut(breaks = seq(0, 1, by = .1), 
          include.lowest = TRUE))

HL_table = 
  wcgs |> 
  summarize(
    .by = pred_prob_cats1,
    n = n(),
    o = sum(chd69 == "Yes"),
    e = sum(pred_probs_glm1)
  )

library(pander)
HL_table |> pander()
| pred_prob_cats1 | n | o | e |
|-----------------|---|---|---|
| (0.1,0.2] | 785 | 116 | 108 |
| (0.2,0.3] | 64 | 12 | 13.77 |
| [0,0.1] | 2,305 | 129 | 135.2 |
Show R code

X2 = HL_table |> 
  summarize(
    `X^2` = sum((o-e)^2/e)
  ) |> 
  pull(`X^2`)
print(X2)
#> [1] 1.1102871

pval1 = pchisq(X2, lower = FALSE, df = nrow(HL_table) - 2)

Our statistic is \(X^2 = 1.11028711\); \(P(\chi^2(1) > 1.11028711) = 0.29201955\), which is our p-value for testing lack of fit.

Unfortunately that grouping plan left us with just three categories with any observations, so instead of grouping by 10% increments of predicted probability, typically analysts use deciles of the predicted probabilities:

Show R code
wcgs = 
  wcgs |> 
  mutate(
    pred_probs_glm1 = chd_glm_strat |> fitted(),
    pred_prob_cats1 = 
      pred_probs_glm1 |> 
      cut(breaks = quantile(pred_probs_glm1, seq(0, 1, by = .1)), 
          include.lowest = TRUE))

HL_table = 
  wcgs |> 
  summarize(
    .by = pred_prob_cats1,
    n = n(),
    o = sum(chd69 == "Yes"),
    e = sum(pred_probs_glm1)
  )

HL_table |> pander()
| pred_prob_cats1 | n | o | e |
|-----------------|---|---|---|
| (0.114,0.147] | 275 | 48 | 36.81 |
| (0.147,0.222] | 314 | 51 | 57.19 |
| (0.0774,0.0942] | 371 | 27 | 32.56 |
| (0.0942,0.114] | 282 | 30 | 29.89 |
| (0.0633,0.069] | 237 | 17 | 15.97 |
| (0.069,0.0774] | 306 | 20 | 22.95 |
| (0.0487,0.0633] | 413 | 27 | 24.1 |
| (0.0409,0.0487] | 310 | 14 | 14.15 |
| [0.0322,0.0363] | 407 | 16 | 13.91 |
| (0.0363,0.0409] | 239 | 7 | 9.48 |
Show R code

X2 = HL_table |> 
  summarize(
    `X^2` = sum((o-e)^2/e)
  ) |> 
  pull(`X^2`)

print(X2)
#> [1] 6.7811383

pval1 = pchisq(X2, lower = FALSE, df = nrow(HL_table) - 2)

Now we have more evenly split categories. The p-value is \(0.56041994\); still no significant evidence of lack of fit.

Graphically, we can compare the observed and expected event counts in each group:

Show R code

HL_plot = 
  HL_table |> 
  ggplot(aes(x = pred_prob_cats1)) + 
  geom_line(aes(y = e, x = pred_prob_cats1, group = "Expected", col = "Expected")) +
  geom_point(aes(y = e, size = n, col = "Expected")) +
  geom_point(aes(y = o, size = n, col = "Observed")) +
  geom_line(aes(y = o, col = "Observed", group = "Observed")) +
  scale_size(range = c(1,4)) +
  theme_bw() +
  ylab("number of CHD events") +
  theme(axis.text.x = element_text(angle = 45))
Show R code
ggplotly(HL_plot)

3.9.3 Comparing models

  • AIC = \(-2 * \ell(\hat\theta) + 2 * p\) [lower is better]
  • BIC = \(-2 * \ell(\hat\theta) + p * \text{log}(n)\) [lower is better]
  • likelihood ratio tests, for nested models [a larger test statistic is stronger evidence for the larger model]; see the sketch below
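Here is a sketch of these comparisons in R. The stratified model `chd_glm_strat` was fit earlier; the additive model `chd_glm_add` is a simpler model fit here only for illustration:

Show R code
# Sketch: comparing the stratified CHD model with a simpler additive model.
chd_glm_add = glm(
  formula = chd69 == "Yes" ~ dibpat + age,
  data = wcgs,
  family = binomial(link = "logit"))

AIC(chd_glm_add, chd_glm_strat)  # lower is better
BIC(chd_glm_add, chd_glm_strat)  # lower is better

# if (as here) the additive model is nested in the stratified model,
# anova() provides a likelihood ratio test:
anova(chd_glm_add, chd_glm_strat, test = "LRT")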

3.10 Residual-based diagnostics

3.10.1 Logistic regression residuals only work for grouped data

Residual diagnostics for logistic regression are only informative when most covariate patterns have more than one observation.

Here we will create the grouped-data version of our CHD model from the WCGS study:

Show R code

wcgs_grouped = 
  wcgs |> 
  summarize(
    .by = c(dibpat, age),
    n = n(),
    chd = sum(chd69 == "Yes"),
    `!chd` = sum(chd69 == "No")
  )

chd_glm_strat_grouped = glm(
  "formula" = cbind(chd, `!chd`) ~ dibpat + dibpat:age - 1, 
  "data" = wcgs_grouped,
  "family" = binomial(link = "logit")
)

chd_glm_strat_grouped |> parameters() |> print_md()
| Parameter | Log-Odds | SE | 95% CI | z | p |
|-----------|----------|----|--------|---|---|
| dibpat (Type B) | -5.80 | 0.98 | (-7.73, -3.90) | -5.95 | < .001 |
| dibpat (Type A) | -5.50 | 0.67 | (-6.83, -4.19) | -8.18 | < .001 |
| dibpat (Type B) × age | 0.06 | 0.02 | (0.02, 0.10) | 3.01 | 0.003 |
| dibpat (Type A) × age | 0.07 | 0.01 | (0.05, 0.10) | 5.24 | < .001 |

3.10.2 (Response) residuals

\[e_k \stackrel{\text{def}}{=}\bar y_k - \hat{\pi}(x_k)\]

(\(k\) indexes the covariate patterns)
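As a quick check of this definition, here is a sketch comparing the by-hand computation with `resid(type = "response")` for the grouped CHD model fit above:

Show R code
# Sketch: response residuals computed by hand for the grouped CHD model.
y_bar = wcgs_grouped$chd / wcgs_grouped$n   # observed event proportions per pattern
pi_hat = fitted(chd_glm_strat_grouped)      # fitted probabilities per pattern
all.equal(
  unname(y_bar - pi_hat),
  unname(resid(chd_glm_strat_grouped, type = "response")))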

We can graph these residuals \(e_k\) against the fitted values \(\hat\pi(x_k)\):

Show R code
wcgs_grouped = 
  wcgs_grouped |> 
  mutate(
    fitted = chd_glm_strat_grouped |> fitted(),
    fitted_logit = fitted |> logit(),
    response_resids = 
      chd_glm_strat_grouped |> resid(type = "response")
  )

wcgs_response_resid_plot = 
  wcgs_grouped |> 
  ggplot(
    mapping = aes(
      x = fitted,
      y = response_resids
    )
  ) + 
  geom_point(
    aes(col = dibpat)
  ) +
  geom_hline(yintercept = 0) + 
  geom_smooth(
    se = TRUE,
    method.args = list(
      span=2/3,
      degree=1,
      family="symmetric",
      iterations=3),
    method = stats::loess)
Don’t worry about these geom_smooth() options for now; I chose them to match autoplot() as closely as I can. plot.glm and autoplot use stats::lowess instead of stats::loess; stats::lowess is older, hard to use with geom_smooth, and hard to match exactly with stats::loess; see https://support.bioconductor.org/p/2323/.
Show R code
wcgs_response_resid_plot |> ggplotly()

We can see a slight fan shape here: observations on the right have larger variance. This is expected, since \(\text{Var}(\bar y) = \pi(1-\pi)/n\) increases as \(\pi\) approaches 0.5, and all of the fitted probabilities here are below 0.5.

3.10.3 Pearson residuals

The fan-shape in the response residuals plot isn’t necessarily a concern here, since we haven’t made an assumption of constant residual variance, as we did for linear regression.

However, we might want to divide by the standard error in order to make the graph easier to interpret. Here’s one way to do that:

The Pearson (chi-squared) residual for covariate pattern \(k\) is: \[ \begin{aligned} X_k &= \frac{\bar y_k - \hat\pi_k}{\sqrt{\hat \pi_k (1-\hat\pi_k)/n_k}} \end{aligned} \]

where \[ \begin{aligned} \hat\pi_k &\stackrel{\text{def}}{=}\hat\pi(x_k)\\ &\stackrel{\text{def}}{=}\hat P(Y=1|X=x_k)\\ &\stackrel{\text{def}}{=}\text{expit}(x_k'\hat \beta)\\ &\stackrel{\text{def}}{=}\text{expit}(\hat \beta_0 + \sum_{j=1}^p \hat \beta_j x_{kj}) \end{aligned} \]
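As a quick check of this definition, here is a sketch comparing the by-hand computation with `resid(type = "pearson")` for the grouped CHD model:

Show R code
# Sketch: Pearson residuals computed by hand for the grouped CHD model.
y_bar = wcgs_grouped$chd / wcgs_grouped$n   # observed event proportions per pattern
pi_hat = fitted(chd_glm_strat_grouped)      # fitted probabilities per pattern
n_k = wcgs_grouped$n                        # observations per pattern
X_k = (y_bar - pi_hat) / sqrt(pi_hat * (1 - pi_hat) / n_k)
all.equal(
  unname(X_k),
  unname(resid(chd_glm_strat_grouped, type = "pearson")))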

Let’s take a look at the Pearson residuals for our CHD model from the WCGS data (graphed against the fitted values on the logit scale):

Show R code
library(ggfortify)
Show R code
autoplot(chd_glm_strat_grouped, which = 1, ncol = 1) |> print()

The fan-shape is gone, and these residuals don’t show any obvious signs of model fit issues.

Pearson residuals plot for beetles data

If we create the same plot for the beetles model, we see some strong evidence of a lack of fit:

Show R code
autoplot(beetles_glm_grouped, which = 1, ncol = 1) |> print()

Pearson residuals with individual (ungrouped) data

What happens if we try to compute residuals without grouping the data by covariate pattern?

Show R code
library(ggfortify)
Show R code
autoplot(chd_glm_strat, which = 1, ncol = 1) |> print()

Meaningless: with ungrouped binary data, each residual can only take one of two values for a given fitted value, so the plot shows two curves and tells us little about model fit.

Residuals plot by hand (optional section)

If you want to check your understanding of what these residual plots are, try building them yourself:

Show R code

wcgs_grouped = 
  wcgs_grouped |> 
  mutate(
    fitted = chd_glm_strat_grouped |> fitted(),
    fitted_logit = fitted |> logit(),
    resids = chd_glm_strat_grouped |> resid(type = "pearson")
  )

wcgs_resid_plot1 = 
  wcgs_grouped |> 
  ggplot(
    mapping = aes(
      x = fitted_logit,
      y = resids
    )
  ) + 
  geom_point(
    aes(col = dibpat)
  ) +
  geom_hline(yintercept = 0) + 
  geom_smooth(se = FALSE, 
              method.args = list(
                span=2/3,
                degree=1,
                family="symmetric",
                iterations=3,
                surface="direct"
                # span = 2/3, 
                # iterations = 3
              ),
              method = stats::loess)
# plot.glm and autoplot use stats::lowess, which is hard to use with 
# geom_smooth and hard to match exactly; 
# see https://support.bioconductor.org/p/2323/
Show R code
wcgs_resid_plot1 |> ggplotly()

3.10.4 Pearson chi-squared goodness of fit test

The Pearson chi-squared goodness of fit statistic is: \[ X^2 = \sum_{k=1}^N X_k^2 \] where \(N\) is again the number of distinct covariate patterns. Under the null hypothesis that the model in question is correct (i.e., sufficiently complex), \(X^2\ \dot \sim\ \chi^2(N-p)\).

Show R code

X = chd_glm_strat_grouped |> 
  resid(type = "pearson")

chisq_stat = sum(X^2)

pval = pchisq(
  chisq_stat, 
  lower = FALSE, 
  df = length(X) - length(coef(chd_glm_strat_grouped)))

For our CHD model, the p-value for this test is 0.26523556; no significant evidence of a lack of fit at the 0.05 level.

Standardized Pearson residuals

Especially for small data sets, we might want to adjust our residuals for leverage: high-leverage observations pull the fitted values toward themselves, which shrinks the variance of their residuals, so we divide by \(\sqrt{1-h_k}\) to put the residuals back on a common scale:

\[r_{P_k} = \frac{X_k}{\sqrt{1-h_k}}\]

where \(h_k\) is the leverage of \(X_k\). The functions autoplot() and plot.lm() use these for some of their graphs.
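A sketch of this adjustment, using `hatvalues()` for the leverages and comparing with `rstandard()`:

Show R code
# Sketch: standardized Pearson residuals for the grouped CHD model.
h_k = hatvalues(chd_glm_strat_grouped)                 # leverages
X_k = resid(chd_glm_strat_grouped, type = "pearson")   # Pearson residuals
all.equal(
  X_k / sqrt(1 - h_k),
  rstandard(chd_glm_strat_grouped, type = "pearson"))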

3.10.5 Deviance residuals

For large sample sizes, the Pearson and deviance residuals will be approximately equal. However, deviance residuals from covariate patterns with few observations can be unreliable (high variance).

\[d_k = \text{sign}(y_k - n_k \hat \pi_k)\left\{\sqrt{2[\ell_{\text{full}}(x_k) - \ell(\hat\beta; x_k)]}\right\}\]

Standardized deviance residuals

\[r_{D_k} = \frac{d_k}{\sqrt{1-h_k}}\]
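Similarly, a sketch for the deviance versions:

Show R code
# Sketch: standardized deviance residuals for the grouped CHD model.
d_k = resid(chd_glm_strat_grouped, type = "deviance")  # deviance residuals
h_k = hatvalues(chd_glm_strat_grouped)                 # leverages
all.equal(
  d_k / sqrt(1 - h_k),
  rstandard(chd_glm_strat_grouped, type = "deviance"))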

3.10.6 Diagnostic plots

Let’s take a look at the full set of autoplot() diagnostics now for our CHD model:

Show R code
chd_glm_strat_grouped |> autoplot(which = 1:6) |> print()
Figure 3.14: Diagnostics for CHD model

Things look pretty good here. The QQ plot is still usable: with large samples, the residuals should be approximately Gaussian.

Beetles

Let’s look at the beetles model diagnostic plots for comparison:

Show R code
beetles_glm_grouped |> autoplot(which = 1:6) |> print()
Figure 3.15: Diagnostics for logistic model of BeetleMortality data

Hard to tell much from so little data, but there might be some issues here.

3.12 Quasibinomial

See Hua Zhou’s lecture notes

3.13 Further reading

  • Hosmer, Lemeshow, and Sturdivant (2013) is a classic textbook on logistic regression