Appendix B — Probability

Published

Last modified: 2024-05-16 19:20:14



Configuring R

Functions from these packages will be used throughout this document:

Show R code
library(conflicted) # check for conflicting function definitions
# library(printr) # inserts help-file output into markdown output
library(rmarkdown) # convert R Markdown documents into a variety of formats
library(pander) # format tables for markdown
library(ggplot2) # graphics
library(ggeasy) # help with graphics
library(ggfortify) # help with graphics
library(dplyr) # manipulate data
library(tibble) # `tibble`s extend `data.frame`s
library(magrittr) # `%>%` and other additional piping tools
library(haven) # import Stata files
library(knitr) # format R output for markdown
library(tidyr) # tools to help create tidy data
library(plotly) # interactive graphics
library(dobson) # datasets from Dobson and Barnett 2018
library(parameters) # format model output tables for markdown
library(latex2exp) # use LaTeX in R code (for figures and tables)
library(fs) # filesystem path manipulations
library(survival) # survival analysis
library(survminer) # survival analysis graphics
library(KMsurv) # datasets from Klein and Moeschberger
library(webshot2) # convert interactive content to static for pdf
library(forcats) # functions for categorical variables ("factors")
library(stringr) # functions for dealing with strings
library(lubridate) # functions for dealing with dates and times

Here are some R settings I use in this document:

Show R code
rm(list = ls()) # delete any data that's already loaded into R

conflicts_prefer(dplyr::filter) # use the filter() function from the dplyr package by default
ggplot2::theme_set(
  ggplot2::theme_bw() + 
        # ggplot2::labs(col = "") +
    ggplot2::theme(
      legend.position = "bottom",
      text = ggplot2::element_text(size = 12, family = "serif")))

knitr::opts_chunk$set(message = FALSE)
options('digits' = 4)

panderOptions("big.mark", ",")
pander::panderOptions("table.emphasize.rownames", FALSE)
pander::panderOptions("table.split.table", Inf)
legend_text_size = 9

B.1 Random variables

B.1.1 Binary variables

Definition B.1 (binary variable) A binary variable is a random variable which has only two possible values in its range.

Exercise B.1 (Examples of binary variables) What are some examples of binary variables in the health sciences?


Solution. Examples of binary outcomes include:

  • exposure (exposed vs unexposed)
  • disease (diseased vs healthy)
  • recovery (recovered vs unrecovered)
  • relapse (relapse vs remission)
  • return to hospital (returned vs not)
  • vital status (dead vs alive)

B.1.2 Count variables

Definition B.2 (Count variable) A count variable is a random variable whose possible values are some subset of the non-negative integers; that is, a random variable \(X\) such that:

\[\mathcal{R}(X) \subseteq \mathbb{N}\]

Exercise B.2 What are some examples of count variables?


Solution. Examples of count variables include:

  • number of new cases of a disease in a cohort (disease incidence)
  • number of car accidents
  • number of worksite accidents
  • size of a population


Exposure magnitude

Definition B.3 (Exposure magnitude)  

For many count outcomes, there is some sense of exposure magnitude, population size, or duration of observation (Table B.1).

Table B.1: Examples of exposure units

  outcome                       exposure units
  ----------------------------  --------------------------------------------
  infectious disease incidence  number of individuals exposed, time at risk
  car accidents                 miles driven
  worksite accidents            person-hours worked
  population size               size of habitat

Exposure units are similar to the number of trials in a binomial distribution, but in non-binomial count outcomes, there can be more than one event per unit of exposure.

We can use \(t\) to represent continuous-valued exposures/observation durations, and \(n\) to represent discrete-valued exposures.


Definition B.4 (Event rate)  

When the concept of an exposure magnitude is meaningful, the mean of the outcome \(Y\) is typically modeled as an event rate (denoted \(\lambda\)) times the exposure magnitude (\(t\)). That is:

\[\mathbb{E}[Y|T=t] \stackrel{\text{def}}{=}\mu = \lambda \cdot t\]

\[\lambda \stackrel{\text{def}}{=}\frac{\mu}{t} \tag{B.1}\]
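
For example (a hypothetical illustration; all numbers are made up): if events occur at a rate of \(\lambda = 0.02\) events per person-year and a cohort is observed for \(t = 500\) person-years, then \(\mu = 0.02 \cdot 500 = 10\) expected events:

Show R code
lambda <- 2 / 100    # hypothetical rate: 2 events per 100 person-years
t      <- 500        # hypothetical exposure magnitude: 500 person-years
mu     <- lambda * t # expected event count: mu = lambda * t
mu                   # = 10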


Theorem B.1 When the exposure magnitude is 0, there is no opportunity for events to occur:

\[\mathbb{E}[Y|T=0] = 0\]


Proof. \[\mathbb{E}[Y|T=0] = \lambda \cdot 0 = 0\]


Probability distributions for count outcomes


B.2 Key probability distributions

B.2.1 The Bernoulli distribution

Definition B.5 (Bernoulli distribution) The Bernoulli distribution family for a random variable \(X\), with parameter \(\pi = \Pr(X=1)\), is defined as:

\[ \begin{aligned} \Pr(X=x) &= \mathbb{1}_{x\in \left\{0,1\right\}}\pi^x(1-\pi)^{1-x}\\ &= \begin{cases} \pi, & x=1\\ 1-\pi, & x=0 \end{cases} \end{aligned} \]
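
For illustration, here is a quick simulation of Bernoulli draws in base R (a minimal sketch; the value \(\pi = 0.3\) is arbitrary). A Bernoulli(\(\pi\)) variable is a binomial variable with one trial:

Show R code
set.seed(1) # arbitrary seed, for reproducibility
p <- 0.3    # hypothetical value of the parameter pi
x <- rbinom(n = 10000, size = 1, prob = p) # Bernoulli(p) = Binomial(1, p)
dbinom(1, size = 1, prob = p) # P(X = 1) = pi
mean(x)                       # sample mean approximates E[X] = pi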


B.2.2 The Poisson distribution

Figure B.1: “Les Poissons” (portrait of Siméon Denis Poisson)


Definition B.6 (Poisson distribution) A random variable \(Y\) follows a Poisson distribution with mean parameter \(\mu > 0\), denoted \(Y \sim \text{Pois}(\mu)\), if its range is \[\mathcal{R}(Y) = \left\{0, 1, 2, ...\right\} = \mathbb{N}\] and its probability mass function is

\[\text{P}(Y = y) = \frac{\mu^{y} e^{-\mu}}{y!}, y \in \mathbb{N} \tag{B.2}\]

(see Figure B.2)

\[\text{P}(Y \le y) = e^{-\mu} \sum_{j=0}^{\left \lfloor{y}\right \rfloor}\frac{\mu^j}{j!} \tag{B.3}\]

(see Figure B.3)


Show R code
library(dplyr)
pois_dists = tibble(
  mu = c(0.5, 1, 2, 5, 10, 20)) |> 
  reframe(
    .by = mu,
    x = 0:30
  ) |> 
  mutate(
    `P(X = x)` = dpois(x, lambda = mu),
    `P(X <= x)` = ppois(x, lambda = mu),
    mu = factor(mu)
  )

library(ggplot2)
library(latex2exp)

plot0 = pois_dists |> 
  ggplot(
    aes(
      x = x,
      y = `P(X = x)`,
      fill = mu,
      col = mu)) +
  theme(legend.position = "bottom") +
  labs(
    fill = latex2exp::TeX("$\\mu$"),
    col = latex2exp::TeX("$\\mu$"),
    y = latex2exp::TeX("$\\Pr_{\\mu}(X = x)$"))

plot1 = plot0 + 
  geom_col(position = "identity", alpha  = .5) +
  facet_wrap(~mu)
  # geom_point(alpha = 0.75) +
  # geom_line(alpha = 0.75)
print(plot1)
Figure B.2: Poisson PMFs, by mean parameter \(\mu\)

Show R code
library(ggplot2)

plot2 = 
  plot0 + 
  geom_step(alpha = 0.75) +
  aes(y = `P(X <= x)`) + 
  labs(y = latex2exp::TeX("$\\Pr_{\\mu}(X \\leq x)$"))

print(plot2)
Figure B.3: Poisson CDFs

Exercise B.3 (Poisson distribution functions) Let \(X \sim \text{Pois}(\mu = 3.75)\).

Compute:

  • \(\text{P}(X = 4 | \mu = 3.75)\)
  • \(\text{P}(X \le 7 | \mu = 3.75)\)
  • \(\text{P}(X > 5 | \mu = 3.75)\)

Solution.

  • \(\text{P}(X=4) = 0.19378025\)
  • \(\text{P}(X\le 7) = 0.96237866\)
  • \(\text{P}(X > 5) = 0.17711717\)
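
These values can be computed with base R's Poisson distribution functions:

Show R code
dpois(4, lambda = 3.75)     # P(X = 4)
ppois(7, lambda = 3.75)     # P(X <= 7)
1 - ppois(5, lambda = 3.75) # P(X > 5)
ppois(5, lambda = 3.75, lower.tail = FALSE) # equivalent to the line above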

Theorem B.2 (Properties of the Poisson distribution) If \(X \sim \text{Pois}(\mu)\), then:

  • \(\mathbb{E}[X] = \mu\)
  • \(\text{Var}(X) = \mu\)

Exercise B.4 Prove Theorem B.2.


Solution. \[ \begin{aligned} \mathbb{E}[X] &= \sum_{x=0}^\infty x \cdot \text{P}(X=x)\\ &= 0 \cdot \text{P}(X=0) + \sum_{x=1}^\infty x \cdot \text{P}(X=x)\\ &= \sum_{x=1}^\infty x \cdot \frac{\mu^x e^{-\mu}}{x!}\\ &= \sum_{x=1}^\infty x \cdot \frac{\mu^x e^{-\mu}}{x \cdot (x-1)!} & [\text{definition of factorial ("!") function}]\\ &= \sum_{x=1}^\infty \frac{\mu^x e^{-\mu}}{(x-1)!}\\ &= \mu \cdot \sum_{x=1}^\infty \frac{\mu^{x-1} e^{-\mu}}{(x-1)!}\\ &= \mu \cdot \sum_{y=0}^\infty \frac{\mu^{y} e^{-\mu}}{y!} & [\text{substituting } y \stackrel{\text{def}}{=}x-1]\\ &= \mu \cdot 1 & [\text{because PMFs sum to 1}]\\ &= \mu \end{aligned} \]

For the variance, a similar calculation shows \(\mathbb{E}[X(X-1)] = \mu^2\), so \(\text{Var}(X) = \mathbb{E}[X^2] - \mu^2 = (\mu^2 + \mu) - \mu^2 = \mu\).
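
Here is a quick Monte Carlo check of Theorem B.2 (the choice \(\mu = 3.75\) is arbitrary):

Show R code
set.seed(1) # arbitrary seed
mu <- 3.75  # arbitrary mean parameter
x  <- rpois(1e6, lambda = mu)
mean(x)     # close to mu
var(x)      # also close to mu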


B.2.3 Accounting for exposure

If the exposures/observation durations, denoted \(T=t\) or \(N=n\), vary between observations, we model:

\[\mu = \lambda\cdot t\]

\(\lambda\) is interpreted as the “expected event rate per unit of exposure”; that is,

\[\lambda = \frac{\mathbb{E}[Y|T=t]}{t}\]

Important

The exposure magnitude, \(T\), is similar to a covariate in linear or logistic regression. However, there is an important difference: in count regression, there is no intercept corresponding to \(\mathbb{E}[Y|T=0]\). In other words, this model assumes that if there is no exposure, there can’t be any events.

Theorem B.3 If \(\mu = \lambda\cdot t\), then:

\[\text{log}\left\{\mu \right\}= \text{log}\left\{\lambda\right\} + \text{log}\left\{t\right\}\]

Definition B.7 (Offset) When the linear component of a model involves a term without an unknown coefficient, that term is called an offset.
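
Here is a minimal sketch of how an offset enters a Poisson regression in R (simulated data; all parameter values are arbitrary):

Show R code
set.seed(1)
n      <- 200                          # hypothetical sample size
t      <- runif(n, min = 0.5, max = 5) # simulated exposure magnitudes
lambda <- 2                            # true event rate per unit of exposure
y      <- rpois(n, lambda * t)         # counts with mean mu = lambda * t

# log(t) enters the linear component as an offset (fixed coefficient of 1):
fit <- glm(y ~ 1, offset = log(t), family = poisson)
exp(coef(fit)) # estimate of lambda; should be close to 2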


B.2.4 The Negative-Binomial distribution


B.3 Characteristics of probability distributions

Definition B.8 (Density function) The density function \(f(t)\) or \(\text{p}(T=t)\) for a random variable \(T\) at value \(t\) can be defined as the derivative of the cumulative probability function \(P(T\le t)\); that is:

\[f(t) \stackrel{\text{def}}{=}\frac{\partial}{\partial t} \Pr(T\le t)\]
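
To illustrate (using the exponential distribution, an arbitrary choice), numerically differentiating the CDF recovers the density:

Show R code
rate <- 2    # arbitrary rate parameter
t    <- 1.5  # arbitrary evaluation point
h    <- 1e-6 # small step for numerical differentiation
(pexp(t + h, rate) - pexp(t - h, rate)) / (2 * h) # numerical derivative of CDF
dexp(t, rate)                                     # density; matches the above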

Definition B.9 (Hazard function) The hazard function for a random variable \(T\) at value \(t\) is the conditional density of \(T\) at \(t\), given \(T\ge t\); that is:

\[h(t) \stackrel{\text{def}}{=}p(T=t|T\ge t)\]

If \(T\) represents the time at which an event occurs, then \(h(t)\) is the probability that the event occurs at time \(t\), given that it has not occurred prior to time \(t\).
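
For a continuous \(T\), this definition works out to \(h(t) = f(t) / \Pr(T \ge t)\). For example (an arbitrary choice of distribution and rate), the exponential distribution has a constant hazard:

Show R code
rate <- 2               # arbitrary rate parameter
t    <- c(0.5, 1, 2, 4) # arbitrary evaluation points
dexp(t, rate) / (1 - pexp(t, rate)) # h(t) = f(t) / P(T >= t); equals `rate`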

Definition B.10 (Expectation, expected value, population mean) The expectation, expected value, or population mean of a continuous random variable \(X\), denoted \(\mathbb{E}\left[X\right]\), \(\mu(X)\), or \(\mu_X\), is the weighted mean of \(X\)’s possible values, weighted by the probability density function of those values:

\[\mathbb{E}\left[X\right] = \int_{x\in \mathcal{R}(X)} x \cdot \text{p}(X=x)dx\]

The expectation, expected value, or population mean of a discrete random variable \(X\), denoted \(\mathbb{E}\left[X\right]\), \(\mu(X)\), or \(\mu_X\), is the mean of \(X\)’s possible values, weighted by the probability mass function of those values:

\[\mathbb{E}\left[X\right] = \sum_{x \in \mathcal{R}(X)} x \cdot \text{P}(X=x)\]

(cf. https://en.wikipedia.org/wiki/Expected_value)
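
For a discrete random variable, this weighted sum can be computed directly in R; for example, for the Poisson distribution from earlier (truncating the infinite support at 100, which is numerically sufficient here):

Show R code
x <- 0:100 # effectively the whole support of Pois(3.75)
sum(x * dpois(x, lambda = 3.75)) # weighted sum: E[X] = mu = 3.75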


Theorem B.4 (Expectation of the Bernoulli distribution) The expectation of a Bernoulli random variable with parameter \(\pi\) is:

\[\mathbb{E}\left[X\right] = \pi\]


Proof. \[ \begin{aligned} \mathbb{E}\left[X\right] &= \sum_{x\in \mathcal{R}(X)} x \cdot\text{P}(X=x) \\&= \sum_{x\in \left\{0,1\right\}} x \cdot\text{P}(X=x) \\&= \left(0 \cdot\text{P}(X=0)\right) + \left(1 \cdot\text{P}(X=1)\right) \\&= \left(0 \cdot(1-\pi)\right) + \left(1 \cdot\pi\right) \\&= 0 + \pi \\&= \pi \end{aligned} \]