4  Models for Count Outcomes

Poisson regression and variations

Published

Last modified: 2024-06-11: 10:06:25 (AM)


Acknowledgements

This content is adapted from:

  • Dobson and Barnett (2018), Chapter 9
  • Vittinghoff et al. (2012), Chapter 8


Configuring R

Functions from these packages will be used throughout this document:

Show R code
library(conflicted) # check for conflicting function definitions
# library(printr) # inserts help-file output into markdown output
library(rmarkdown) # Convert R Markdown documents into a variety of formats.
library(pander) # format tables for markdown
library(ggplot2) # graphics
library(ggeasy) # help with graphics
library(ggfortify) # help with graphics
library(dplyr) # manipulate data
library(tibble) # `tibble`s extend `data.frame`s
library(magrittr) # `%>%` and other additional piping tools
library(haven) # import Stata files
library(knitr) # format R output for markdown
library(tidyr) # Tools to help to create tidy data
library(plotly) # interactive graphics
library(dobson) # datasets from Dobson and Barnett 2018
library(parameters) # format model output tables for markdown
library(haven) # import Stata files
library(latex2exp) # use LaTeX in R code (for figures and tables)
library(fs) # filesystem path manipulations
library(survival) # survival analysis
library(survminer) # survival analysis graphics
library(KMsurv) # datasets from Klein and Moeschberger
library(parameters) # format model output tables for
library(webshot2) # convert interactive content to static for pdf
library(forcats) # functions for categorical variables ("factors")
library(stringr) # functions for dealing with strings
library(lubridate) # functions for dealing with dates and times

Here are some R settings I use in this document:

Show R code
rm(list = ls()) # delete any data that's already loaded into R

conflicts_prefer(dplyr::filter)
ggplot2::theme_set(
  ggplot2::theme_bw() + 
        # ggplot2::labs(col = "") +
    ggplot2::theme(
      legend.position = "bottom",
      text = ggplot2::element_text(size = 12, family = "serif")))

knitr::opts_chunk$set(message = FALSE)
options('digits' = 4)

panderOptions("big.mark", ",")
pander::panderOptions("table.emphasize.rownames", FALSE)
pander::panderOptions("table.split.table", Inf)
conflicts_prefer(dplyr::filter) # use the `filter()` function from dplyr() by default
legend_text_size = 9

4.1 Introduction

This chapter presents models for count data outcomes. With covariates, the event rate \(\lambda\) becomes a function of the covariates \(\tilde{X}= (X_1, \dots,X_n)\). Typically, count data models use a \(\text{log}\left\{\right\}\) link function, and thus an \(\text{exp}\left\{\right\}\) inverse-link function. That is:

\[ \begin{aligned} \mathbb{E}[Y | \tilde{X}= \tilde{x}, T = t] &= \mu(\tilde{x},t) \\ \mu(\tilde{x},t) &= \lambda(\tilde{x})\cdot t \\ \lambda(\tilde{x}) &= \text{exp}\left\{\eta(\tilde{x})\right\} \\ \eta(\tilde{x}) &= \tilde{x}'\tilde \beta = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p \end{aligned} \]

Therefore, \[ \begin{aligned} \text{log}\left\{\mathbb{E}[Y | \tilde{X}= \tilde{x},T=t]\right\} &= \text{log}\left\{\mu(\tilde{x})\right\}\\ &=\text{log}\left\{\lambda(\tilde{x}) \cdot t\right\}\\ &=\text{log}\left\{\lambda(\tilde{x})\right\} + \text{log}\left\{t\right\}\\ &=\text{log}\left\{\text{exp}\left\{\eta(\tilde{x})\right\}\right\} + \text{log}\left\{t\right\}\\ &=\eta(\tilde{x}) + \text{log}\left\{t\right\}\\ &=\tilde{x}'\tilde\beta + \text{log}\left\{t\right\}\\ &=(\beta_0 +\beta_1 x_1+\dots + \beta_p x_p) + \text{log}\left\{t\right\}\\ \end{aligned} \]

In contrast with the other covariates (represented by \(\tilde{X}\)), \(T\) enters this expression with a \(\text{log}\left\{\right\}\) transformation and without a corresponding \(\beta\) coefficient.

Note

Terms that enter the linear component of a model without a coefficient, such as \(\text{log}\left\{t\right\}\) here, are called offsets.

4.1.1 Rate ratios

Differences on the log-rate scale become ratios on the rate scale, because

\[\text{exp}\left\{a-b\right\} = \frac{\text{exp}\left\{a\right\}}{\text{exp}\left\{b\right\}}\]

(recall from Algebra 2)

Therefore, according to this model, differences of \(\delta\) in covariate \(x_j\) correspond to rate ratios of \(\text{exp}\left\{\beta_j \cdot \delta\right\}\).

That is, letting \(\tilde{X}_{-j}\) denote vector \(\tilde{X}\) with element \(j\) removed:

\[ \begin{aligned} &{ \left\{ \text{log}\left\{\mathbb{E}[Y |{\color{red}{X_j = a}}, \tilde{X}_{-j}=\tilde{x}_{-j},T=t]\right\} \atop {-\text{log}\left\{\mathbb{E}[Y |{\color{red}{X_j = b}}, \tilde{X}_{-j}=\tilde{x}_{-j},T=t]\right\}} \right\} }\\ &= {\left\{ \text{log}\left\{t\right\} + \beta_0 + \beta_1 x_1 + ... + {\color{red}{\beta_j (a)}} + ...+\beta_p x_p \atop {-\text{log}\left\{t\right\} + \beta_0 + \beta_1 x_1 + ... + {\color{red}{\beta_j (b)}} + ...+\beta_p x_p} \right\}}\\ &= \color{red}{\beta_j(a-b)} \end{aligned} \]

And accordingly,

\[ \begin{aligned} \frac {\mathbb{E}[Y |{\color{red}{X_j = a}}, \tilde{X}_{-j} = \tilde{x}_{-j}, T = t] } { \mathbb{E}[Y |{\color{red}{X_j = b}}, \tilde{X}_{-j}=\tilde{x}_{-j},T=t] } = \text{exp}\left\{{\color{red}{\beta_j(a-b)}}\right\} \end{aligned} \]

4.2 Inference for count regression models

4.2.1 Confidence intervals for regression coefficients and rate ratios

As usual:

\[ \beta \in \left[\hat \beta{\color{red}\pm} z_{1 - \frac{\alpha}{2}}\cdot \hat{\text{se}}\left(\hat \beta\right)\right] \]

Rate ratios: exponentiate CI endpoints

\[ \text{exp}\left\{\beta\right\} \in \left[\text{exp}\left\{\hat \beta{\color{red}\pm} z_{1 - \frac{\alpha}{2}}\cdot \hat{\text{se}}\left(\hat \beta\right)\right\} \right] \]

4.2.2 Hypothesis tests for regression coefficients

\[ t = \frac{\hat \beta - \beta_0}{\hat{\text{se}}\left(\hat \beta\right)} \]

Compare \(t\) or \(|t|\) to the tails of the standard Gaussian distribution, according to the null hypothesis.

4.2.3 Comparing nested models

log(likelihood ratio) tests, as usual.

4.3 Prediction

\[ \begin{aligned} \hat y &\stackrel{\text{def}}{=}\hat{\mathbb{E}}[Y|\tilde{X}= \tilde{x},T=t]\\ &=\hat\mu(\tilde{x}, t)\\ &=\hat\lambda(\tilde{x}) \cdot t\\ &=\text{exp}\left\{\hat\eta(\tilde{x})\right\} \cdot t\\ &=\text{exp}\left\{\tilde{x}'\hat{\boldsymbol{\beta}}\right\} \cdot t \end{aligned} \]

4.4 Diagnostics

4.4.1 Residuals

Observation residuals

\[e \stackrel{\text{def}}{=}y - \hat y\]

Pearson residuals

\[r = \frac{e}{\hat{\text{se}}\left(e\right)} \approx \frac{e}{\sqrt{\hat y}}\]

Standardized Pearson residuals

\[r_p = \frac{r}{\sqrt{1-h}}\] where \(h\) is the “leverage” (which we will continue to leave undefined).

Deviance residuals

\[ d_k = \text{sign}(y - \hat y)\left\{\sqrt{2[\ell_{\text{full}}(y) - \ell(\hat\beta; y)]}\right\} \]

Note

\[\text{sign}(x) \stackrel{\text{def}}{=}\frac{x}{|x|}\] In other words:

  • \(\text{sign}(x) = -1\) if \(x < 0\)
  • \(\text{sign}(x) = 0\) if \(x = 0\)
  • \(\text{sign}(x) = 1\) if \(x > 0\)

4.5 Zero-inflation

4.5.1 Models for zero-inflated counts

We assume a latent (unobserved) binary variable, \(Z\), which we model using logistic regression:

\[P(Z=1|X=x) = \pi(x) = \text{expit}(\gamma_0 + \gamma_1 x_1 +...)\]

According to this model, if \(Z=1\), then \(Y\) will always be zero, regardless of \(X\) and \(T\):

\[P(Y=0|Z=1,X=x,T=t) = 1\]

Otherwise (if \(Z=0\)), \(Y\) will have a Poisson distribution, conditional on \(X\) and \(T\), as above.

Even though we never observe \(Z\), we can estimate the parameters \(\gamma_0\)-\(\gamma_p\), via maximum likelihood:

\[ \begin{aligned} P(Y=y|X=x,T=t) &= P(Y=y,Z=1|...) + P(Y=y,Z=0|...) \end{aligned} \] (by the Law of Total Probability)

where \[ \begin{aligned} P(Y=y,Z=z|...) &= P(Y=y|Z=z,...)P(Z=z|...) \end{aligned} \]


Exercise 4.1 Expand \(P(Y=0|X=x,T=t)\), \(P(Y=1|X=x,T=t)\) and \(P(Y=y|X=x,T=t)\) into expressions involving \(P(Z=1|X=x,T=t)\) and \(P(Y=y|Z=0,X=x,T=t)\).


Exercise 4.2 Derive the expected value and variance of \(Y\), conditional on \(X\) and \(T\), as functions of \(P(Z=1|X=x,T=t)\) and \(\mathbb{E}[Y|Z=0,X=x,T=t]\).

4.6 Over-dispersion


The Poisson distribution model forces the variance to equal the mean. In practice, many count distributions will have a variance substantially larger than the mean (or occasionally, smaller).

Definition 4.1 (Overdispersion) A random variable \(X\) is overdispersed relative to a model \(\text{p}(X=x)\) if if its empirical variance in a dataset is larger than the value is predicted by the fitted model \(\hat{\text{p}}(X=x)\).

c.f. Dobson and Barnett (2018) §3.2.1, 7.7, 9.8; Vittinghoff et al. (2012) §8.1.5; and https://en.wikipedia.org/wiki/Overdispersion.

When we encounter overdispersion, we can try to reduce the residual variance by adding more covariates.

4.6.1 Negative binomial models

There are alternatives to the Poisson model. Most notably, the negative binomial model.

We can still model \(\mu\) as a function of \(X\) and \(T\) as before, and we can combine this model with zero-inflation (as the conditional distribution for the non-zero component).

4.6.2 Quasipoisson

An alternative to Negative binomial is the “quasipoisson” distribution. I’ve never used it, but it seems to be a method-of-moments type approach rather than maximum likelihood. It models the variance as \(\text{Var}\left(Y\right) = \mu\theta\), and estimates \(\theta\) accordingly.

See ?quasipoisson in R for more.

4.7 Example: needle-sharing

(adapted from Vittinghoff et al. (2012), §8)

Show R code
library(tidyverse)
library(haven)
needles = 
  "inst/extdata/needle_sharing.dta" |> 
  read_dta() |>
  as_tibble() |>
  mutate(
    hivstat =
      hivstat |>
      case_match(
        1 ~ "HIV+",
        0 ~ "HIV-") |> 
      factor() |> 
      relevel(ref = "HIV-"),
    polydrug =
      polydrug |>
      case_match(
        1 ~ "multiple drugs used",
        0 ~ "one drug used") |>
      factor() |>
      relevel(ref = "one drug used"),
    homeless = 
      homeless |> 
      case_match(
        1 ~ "homeless", 
        0 ~ "not homeless") |>
      factor() |>
      relevel(ref = "not homeless"),
    sex = sex |> factor() |> relevel(ref = "M"))
needles
Table 4.1: Needle-sharing data

Show R code
library(ggplot2)

needles |> 
  ggplot(
    aes(
      x = age,
      y = shared_syr,
      shape = sex,
      col = ethn
    )
  ) + 
  geom_point(
    size = 3, 
    alpha = .5) +
  facet_grid(
    cols = vars(sex, polydrug), 
    rows = vars(homeless)) +
  theme(legend.position = "bottom")
Figure 4.1: Rates of needle sharing

Covariate counts

Table 4.2: Counts of observations in needles dataset by sex, unhoused status, and multiple drug use
Show R code
needles |> 
  dplyr::select(sex, homeless, polydrug) |> 
  summary()
#>     sex             homeless                 polydrug  
#>  M    :97   not homeless:63   one drug used      :109  
#>  F    :30   homeless    :61   multiple drugs used: 19  
#>  Trans: 1   NA's        : 4

There’s only one individual with sex = Trans, which unfortunately isn’t enough data to analyze. We will remove that individual:

remove singleton observation with sex == Trans
needles = needles |> filter(sex != "Trans")

4.7.1 models

Show R code
glm1 = glm(
  data = needles,
  family = stats::poisson,
  shared_syr ~ age + sex + homeless*polydrug
)

library(parameters)
glm1 |> parameters(exponentiate = TRUE) |> 
  print_md()
Table 4.3: Poisson model for needle-sharing data
Parameter IRR SE 95% CI z p
(Intercept) 4.52 1.15 (2.74, 7.45) 5.90 < .001
age 0.97 5.58e-03 (0.96, 0.98) -5.41 < .001
sex (F) 1.98 0.23 (1.58, 2.49) 5.88 < .001
homeless (homeless) 3.58 0.45 (2.79, 4.59) 10.06 < .001
polydrug (multiple drugs used) 1.45e-07 5.82e-05 (0.00, Inf) -0.04 0.969
homeless (homeless) × polydrug (multiple drugs used) 1.27e+06 5.12e+08 (0.00, Inf) 0.03 0.972
Show R code
Table 4.4: Diagnostics for Poisson model

Table 4.5: Negative binomial model for needle-sharing data
Show R code
library(MASS) #need this for glm.nb()
glm1.nb = glm.nb(
  data = needles,
  shared_syr ~ age + sex + homeless*polydrug
)
summary(glm1.nb)
#> 
#> Call:
#> glm.nb(formula = shared_syr ~ age + sex + homeless * polydrug, 
#>     data = needles, init.theta = 0.08436295825, link = log)
#> 
#> Coefficients:
#>                                               Estimate Std. Error z value
#> (Intercept)                                   9.91e-01   1.71e+00    0.58
#> age                                          -2.76e-02   3.82e-02   -0.72
#> sexF                                          1.06e+00   8.07e-01    1.32
#> homelesshomeless                              1.65e+00   7.22e-01    2.29
#> polydrugmultiple drugs used                  -2.46e+01   3.61e+04    0.00
#> homelesshomeless:polydrugmultiple drugs used  2.32e+01   3.61e+04    0.00
#>                                              Pr(>|z|)  
#> (Intercept)                                     0.563  
#> age                                             0.469  
#> sexF                                            0.187  
#> homelesshomeless                                0.022 *
#> polydrugmultiple drugs used                     0.999  
#> homelesshomeless:polydrugmultiple drugs used    0.999  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for Negative Binomial(0.0844) family taken to be 1)
#> 
#>     Null deviance: 69.193  on 119  degrees of freedom
#> Residual deviance: 57.782  on 114  degrees of freedom
#>   (7 observations deleted due to missingness)
#> AIC: 315.5
#> 
#> Number of Fisher Scoring iterations: 1
#> 
#> 
#>               Theta:  0.0844 
#>           Std. Err.:  0.0197 
#> 
#>  2 x log-likelihood:  -301.5060
Show R code
tibble(name = names(coef(glm1)), poisson = coef(glm1), nb = coef(glm1.nb))
Table 4.6: Poisson versus Negative Binomial Regression coefficient estimates

zero-inflation

Show R code
library(glmmTMB)
zinf_fit1 = glmmTMB(
  family = "poisson",
  data  = needles,
  formula = shared_syr ~ age + sex + homeless*polydrug,
  ziformula = ~ age + sex + homeless + polydrug # fit won't converge with interaction
)

zinf_fit1 |> 
  parameters(exponentiate = TRUE) |> 
  print_md()
Table 4.7: Zero-inflated poisson model
# Fixed Effects
Parameter IRR SE 95% CI z p
(Intercept) 3.16 0.82 (1.90, 5.25) 4.44 < .001
age 1.01 5.88e-03 (1.00, 1.02) 1.74 0.081
sex [F] 3.43 0.44 (2.67, 4.40) 9.68 < .001
homeless [homeless] 3.44 0.47 (2.63, 4.50) 9.03 < .001
polydrug [multiple drugs used] 1.85e-09 1.21e-05 (0.00, Inf) -3.08e-03 0.998
homeless [homeless] × polydrug [multiple drugs used] 1.38e+08 9.04e+11 (0.00, Inf) 2.87e-03 0.998
# Zero-Inflation
Parameter Odds Ratio SE 95% CI z p
(Intercept) 0.49 0.54 (0.06, 4.25) -0.65 0.514
age 1.05 0.03 (1.00, 1.10) 1.95 0.051
sex [F] 1.44 0.84 (0.46, 4.50) 0.62 0.533
homeless [homeless] 0.68 0.34 (0.26, 1.80) -0.78 0.436
polydrug [multiple drugs used] 1.15 0.91 (0.24, 5.43) 0.18 0.858

Another R package for zero-inflated models is pscl (Zeileis, Kleiber, and Jackman (2008)).

zero-inflated negative binomial model

Show R code
library(glmmTMB)
zinf_fit1 = glmmTMB(
  family = nbinom2,
  data  = needles,
  formula = shared_syr ~ age + sex + homeless*polydrug,
  ziformula = ~ age + sex + homeless + polydrug # fit won't converge with interaction
)

zinf_fit1 |> 
  parameters(exponentiate = TRUE) |> 
  print_md()
Table 4.8: Zero-inflated negative binomial model
# Fixed Effects
Parameter IRR SE 95% CI z p
(Intercept) 1.06 1.48 (0.07, 16.52) 0.04 0.969
age 1.02 0.03 (0.96, 1.08) 0.53 0.599
sex [F] 6.86 6.36 (1.12, 42.16) 2.08 0.038
homeless [homeless] 6.44 4.59 (1.60, 26.01) 2.62 0.009
polydrug [multiple drugs used] 8.25e-10 7.07e-06 (0.00, Inf) -2.44e-03 0.998
homeless [homeless] × polydrug [multiple drugs used] 2.36e+08 2.02e+12 (0.00, Inf) 2.25e-03 0.998
# Zero-Inflation
Parameter Odds Ratio SE 95% CI z p
(Intercept) 0.10 0.20 (1.47e-03, 6.14) -1.11 0.269
age 1.07 0.04 (0.99, 1.15) 1.78 0.075
sex [F] 2.72 2.40 (0.48, 15.33) 1.13 0.258
homeless [homeless] 1.15 0.86 (0.27, 4.96) 0.19 0.853
polydrug [multiple drugs used] 0.75 0.86 (0.08, 7.12) -0.25 0.799
# Dispersion
Parameter Coefficient 95% CI
(Intercept) 0.44 (0.11, 1.71)

4.8 More on count regression