
Last modified: 2026-04-29 22:46:30 (UTC)


Configuring R

Functions from these packages will be used throughout this document:

Show R code
library(conflicted) # check for conflicting function definitions
# library(printr) # inserts help-file output into markdown output
library(rmarkdown) # Convert R Markdown documents into a variety of formats.
library(pander) # format tables for markdown
library(ggplot2) # graphics
library(ggfortify) # help with graphics
library(dplyr) # manipulate data
library(tibble) # `tibble`s extend `data.frame`s
library(magrittr) # `%>%` and other additional piping tools
library(haven) # import Stata files
library(knitr) # format R output for markdown
library(tidyr) # Tools to help to create tidy data
library(plotly) # interactive graphics
library(dobson) # datasets from Dobson and Barnett 2018
library(parameters) # format model output tables for markdown
library(latex2exp) # use LaTeX in R code (for figures and tables)
library(fs) # filesystem path manipulations
library(survival) # survival analysis
library(survminer) # survival analysis graphics
library(KMsurv) # datasets from Klein and Moeschberger
library(webshot2) # convert interactive content to static for pdf
library(forcats) # functions for categorical variables ("factors")
library(stringr) # functions for dealing with strings
library(lubridate) # functions for dealing with dates and times
library(broom) # Summarizes key information about statistical objects in tidy tibbles
library(broom.helpers) # Provides suite of functions to work with regression model 'broom::tidy()' tibbles

Here are some R settings I use in this document:

Show R code
rm(list = ls()) # delete any data that's already loaded into R

conflicts_prefer(dplyr::filter)
ggplot2::theme_set(
  ggplot2::theme_bw() + 
        # ggplot2::labs(col = "") +
    ggplot2::theme(
      legend.position = "bottom",
      text = ggplot2::element_text(size = 12, family = "serif")))

knitr::opts_chunk$set(message = FALSE)
options('digits' = 6)

panderOptions("big.mark", ",")
pander::panderOptions("table.emphasize.rownames", FALSE)
pander::panderOptions("table.split.table", Inf)
legend_text_size = 9
run_graphs = TRUE

Most of the content in this chapter should be review from UC Davis Epi 202.

1 Core properties of probabilities

1.1 Defining probabilities

Definition 1 (Probability measure) A probability measure, often denoted \(\Pr()\) or \(\text{P}()\), is a function whose domain is a \(\sigma\)-algebra of possible outcomes, \(\mathscr{S}\), and which satisfies the following properties:

  1. For any statistical event \(A \in \mathscr{S}\), \(\Pr(A) \ge 0\).

  2. The probability of the union of all outcomes (\(\Omega \stackrel{\text{def}}{=}\cup \mathscr{S}\)) is 1:

\[\Pr(\Omega) = 1\]

  3. The probability of the union of countably many mutually disjoint events \(A_1, A_2, \ldots\) (where \(A_i \cap A_j = \emptyset\) for all \(i \neq j\)) is equal to the sum of their probabilities (countable additivity or sigma-additivity):

\[\Pr\!\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \Pr(A_i)\]

Property 3 (countable additivity) is stronger than finite additivity, which only requires

\[\Pr(A_1 \cup \cdots \cup A_n) = \sum_{i=1}^{n} \Pr(A_i)\]

for every finite collection of mutually disjoint events. Countable additivity implies finite additivity (set \(A_{n+1} = A_{n+2} = \cdots = \emptyset\) in property 3, using \(\Pr(\emptyset) = 0\)), but not vice versa: there exist set functions that satisfy finite additivity but fail countable additivity (see Wikipedia: Sigma-additive set function — An additive function which is not σ-additive). Requiring countable additivity enables results such as the continuity of probability (if \(A_1 \supseteq A_2 \supseteq \cdots\) with \(\bigcap_i A_i = \emptyset\), then \(\Pr(A_i) \to 0\)) and underpins the Theorem 4 for countable partitions.


Theorem 1 If \(A\) and \(B\) are statistical events and \(A\subseteq B\), then \(\Pr(A \cap B) = \Pr(A)\).


Proof. If \(A \subseteq B\), then \(A \cap B = A\); therefore \(\Pr(A \cap B) = \Pr(A)\).


Theorem 2 \[\Pr(A) + \Pr(\neg A) = 1\]


Proof. \(A\) and \(\neg A\) are disjoint and \(A \cup \neg A = \Omega\), so by properties 2 and 3 of Definition 1: \(\Pr(A) + \Pr(\neg A) = \Pr(A \cup \neg A) = \Pr(\Omega) = 1\).


Corollary 1 \[\Pr(\neg A) = 1 - \Pr(A)\]


Proof. By Theorem 2 and algebra.


Corollary 2 If the probability of an outcome \(A\) is \(\Pr(A)=\pi\), then the probability that \(A\) does not occur is:

\[\Pr(\neg A)= 1 - \pi\]


Proof. Using Corollary 1:

\[ \begin{aligned} \Pr(\neg A) &= 1 - \Pr(A) \\ &= 1 - \pi \end{aligned} \]


1.2 Conditional probability

Definition 2 (Conditional probability) For two events \(A\) and \(B\) with \(\Pr(B) > 0\), the conditional probability of \(A\) given \(B\), denoted \(\Pr(A \mid B)\), is:

\[\Pr(A \mid B) \stackrel{\text{def}}{=}\frac{\Pr(A \cap B)}{\Pr(B)}\]


Theorem 3 (Law of conditional probability) For any two events \(A\) and \(B\) with \(\Pr(B) > 0\):

\[\Pr(A \cap B) = \Pr(A \mid B) \cdot\Pr(B)\]


Proof. Rearranging Definition 2:

\[ \begin{aligned} \Pr(A \mid B) &= \frac{\Pr(A \cap B)}{\Pr(B)} \\ \Pr(A \cap B) &= \Pr(A \mid B) \cdot\Pr(B) \end{aligned} \]


Example 1 (Applying the law of conditional probability) Suppose 30% of adults exercise regularly (\(\Pr(E) = 0.30\)), and among adults who exercise regularly, 60% have low blood pressure (\(\Pr(L \mid E) = 0.60\)).

Then the probability that a randomly selected adult both exercises regularly and has low blood pressure is:

\[ \begin{aligned} \Pr(L \cap E) &= \Pr(L \mid E) \cdot\Pr(E) \\&= 0.60 \cdot 0.30 \\&= 0.18 \end{aligned} \]


Theorem 4 (Law of total probability) If \(B_1, B_2, \ldots\) is a countable partition of the sample space (i.e., countably many mutually exclusive events whose union is the entire sample space), then for any event \(A\):

\[\Pr(A) = \sum_{i=1}^{\infty} \Pr(A \mid B_i) \cdot\Pr(B_i)\]


Proof. Since \(B_1, B_2, \ldots\) partition the sample space, the events \(A \cap B_1, A \cap B_2, \ldots\) are mutually exclusive and their union is \(A\). By property 3 of Definition 1 (countable additivity), and then by Theorem 3:

\[ \begin{aligned} \Pr(A) &= \sum_{i=1}^{\infty} \Pr(A \cap B_i) \\&= \sum_{i=1}^{\infty} \Pr(A \mid B_i) \cdot\Pr(B_i) \end{aligned} \]


Theorem 5 (Bayes’ theorem) For any two events \(A\) and \(B\) with \(\Pr(A) > 0\) and \(\Pr(B) > 0\):

\[\Pr(A \mid B) = \frac{\Pr(B \mid A) \cdot\Pr(A)}{\Pr(B)}\]


Proof. Apply Definition 2 to both \(\Pr(A \mid B)\) and \(\Pr(B \mid A)\):

\[ \begin{aligned} \Pr(A \mid B) &= \frac{\Pr(A \cap B)}{\Pr(B)} \\&= \frac{\Pr(B \mid A) \cdot\Pr(A)}{\Pr(B)} \end{aligned} \]

The second equality follows from Theorem 3 applied to \(\Pr(B \cap A) = \Pr(B \mid A) \cdot\Pr(A)\).


Example 2 (Positive predictive value of a medical test) Suppose a disease test has 99% sensitivity and 99% specificity, and the prevalence of the disease in the population is 7%.

Let \(D\) be the event “person has the disease” and \(+\) be the event “test is positive”. Then:

  • \(\Pr(+ \mid D) = 0.99\) (sensitivity)
  • \(\Pr(\neg + \mid \neg D) = 0.99\) (specificity), so the false positive rate is \(\Pr(+ \mid \neg D) = 1 - 0.99 = 0.01\)
  • \(\Pr(D) = 0.07\) (prevalence)

By Bayes’ theorem (Theorem 5) and the law of total probability (Theorem 4):

\[ \begin{aligned} \Pr(D \mid +) &= \frac{\Pr(+ \mid D) \cdot\Pr(D)}{\Pr(+)} \\&= \frac{\Pr(+ \mid D) \cdot\Pr(D)}{\Pr(+ \mid D) \cdot\Pr(D) + \Pr(+ \mid \neg D) \cdot\Pr(\neg D)} \\&= \frac{0.99 \cdot 0.07}{0.99 \cdot 0.07 + 0.01 \cdot 0.93} \\&= \frac{0.0693}{0.0693 + 0.0093} \\&= \frac{0.0693}{0.0786} \\&\approx 0.88 \end{aligned} \]

Even with a highly accurate test (99% sensitive and 99% specific), only about 88% of people who test positive actually have the disease, because the disease prevalence is relatively low (7%).
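This calculation is straightforward to reproduce in R, using the example's sensitivity, specificity, and prevalence:

```r
# test characteristics from the example above
sens <- 0.99 # sensitivity, P(+ | D)
spec <- 0.99 # specificity, P(- | not D)
prev <- 0.07 # prevalence, P(D)

# law of total probability: P(+)
p_pos <- sens * prev + (1 - spec) * (1 - prev)

# Bayes' theorem: P(D | +), the positive predictive value
ppv <- sens * prev / p_pos
ppv # approximately 0.88
```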

2 Key probability distributions

Some distributions are typically used for outcome models (Table 1); other distributions are typically used for test statistics (Table 2).

Table 1: Distributions typically used for outcome models
Distribution Uses
Bernoulli Binary outcomes
Binomial Sums of Bernoulli outcomes
Poisson Unbounded count outcomes
Geometric Counts of non-events before an event occurs
Negative binomial Mixtures of Poisson distributions; counts of non-events until a given number of events occurs
Normal (Gaussian) Continuous outcomes without a more specific distribution
Exponential Time-to-event outcomes
Gamma Time-to-event outcomes
Weibull Time-to-event outcomes
Log-normal Time-to-event outcomes

Table 2: Distributions typically used for test statistics
Distribution Uses
\(\chi^2\) Regression comparisons (asymptotic), contingency table independence tests, goodness-of-fit tests
\(F\) Gaussian model comparisons (exact)
\(Z\) (standard normal) Proportions, means, regression coefficients (asymptotic)
\(T\) Means, regression coefficients in Gaussian outcome models (exact)

2.1 The Bernoulli distribution

Definition 3 (Bernoulli distribution) The Bernoulli distribution family for a random variable \(X\) is defined as:

\[ \begin{aligned} \Pr(X=x) &= \text{1}_{x\in {\left\{0,1\right\}}}\pi^x(1-\pi)^{1-x}\\ &= \begin{cases} \pi, & x=1\\ 1-\pi, & x=0 \end{cases} \end{aligned} \]
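In R, Bernoulli probabilities can be computed as binomial probabilities with `size = 1`; for example, with a hypothetical \(\pi = 0.3\):

```r
prob <- 0.3 # hypothetical event probability pi
dbinom(0:1, size = 1, prob = prob) # P(X = 0) = 0.7, P(X = 1) = 0.3
```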


2.2 The Poisson distribution

Figure 1: “Les Poissons”
Siméon Denis Poisson


Exercise 1 Define the Poisson distribution.


Solution 1.

Definition 4 (Poisson distribution) \[\text{P}(Y = y) = \frac{\mu^{y} e^{-\mu}}{y!}, \quad y \in \mathbb{N} \tag{1}\]

(see Figure 2)


Exercise 2 What is the range of possible values for a Poisson distribution?


Solution 2. \[\mathcal{R}(Y) = {\left\{0, 1, 2, ...\right\}} = \mathbb{N}\]


Theorem 6 (CDF of Poisson distribution) \[\text{P}(Y \le y) = e^{-\mu} \sum_{j=0}^{\left \lfloor{y}\right \rfloor}\frac{\mu^j}{j!} \tag{2}\]

(see Figure 3)


Show R code
library(dplyr)
pois_dists <- tibble(
  mu = c(0.5, 1, 2, 5, 10, 20)
) |>
  reframe(
    .by = mu,
    x = 0:30
  ) |>
  mutate(
    `P(X = x)` = dpois(x, lambda = mu),
    `P(X <= x)` = ppois(x, lambda = mu),
    mu = factor(mu)
  )

library(ggplot2)
library(latex2exp)

plot0 <- pois_dists |>
  ggplot(
    aes(
      x = x,
      y = `P(X = x)`,
      fill = mu,
      col = mu
    )
  ) +
  theme(legend.position = "bottom") +
  labs(
    fill = latex2exp::TeX("$\\mu$"),
    col = latex2exp::TeX("$\\mu$"),
    y = latex2exp::TeX("$\\Pr_{\\mu}(X = x)$")
  )

plot1 <- plot0 +
  geom_segment(yend = 0) +
  facet_wrap(~mu)

print(plot1)
Figure 2: Poisson PMFs, by mean parameter \(\mu\)

Show R code
library(ggplot2)

plot2 <-
  plot0 +
  geom_step(alpha = 0.75) +
  aes(y = `P(X <= x)`) +
  labs(y = latex2exp::TeX("$\\Pr_{\\mu}(X \\leq x)$"))

print(plot2)
Figure 3: Poisson CDFs

Exercise 3 (Poisson distribution functions) Let \(X \sim \text{Pois}(\mu = 3.75)\).

Compute:

  • \(\text{P}(X = 4 | \mu = 3.75)\)
  • \(\text{P}(X \le 7 | \mu = 3.75)\)
  • \(\text{P}(X > 5 | \mu = 3.75)\)

Solution.

  • \(\text{P}(X=4) = 0.19378\)
  • \(\text{P}(X\le 7) = 0.962379\)
  • \(\text{P}(X > 5) = 0.177117\)
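These values come directly from R's Poisson distribution functions:

```r
mu <- 3.75
dpois(4, lambda = mu)                     # P(X = 4)
ppois(7, lambda = mu)                     # P(X <= 7)
ppois(5, lambda = mu, lower.tail = FALSE) # P(X > 5)
```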

Theorem 7 (Properties of the Poisson distribution) If \(X \sim \text{Pois}(\mu)\), then:

  • \(\text{E}[X] = \mu\)
  • \(\text{Var}(X) = \mu\)
  • \(\text{P}(X=x) = \frac{\mu}{x} \text{P}(X = x-1)\)
  • For \(x < \mu\), \(\text{P}(X=x) > \text{P}(X = x-1)\)
  • For \(x = \mu\) (which requires \(\mu\) to be an integer), \(\text{P}(X=x) = \text{P}(X = x-1)\)
  • For \(x > \mu\), \(\text{P}(X=x) < \text{P}(X = x-1)\)
  • \(\arg \max_{x} \text{P}(X=x) = \left \lfloor{\mu}\right \rfloor\)

Exercise 4 Prove Theorem 7.


Solution. \[ \begin{aligned} \text{E}[X] &= \sum_{x=0}^\infty x \cdot P(X=x)\\ &= 0 \cdot P(X=0) + \sum_{x=1}^\infty x \cdot P(X=x)\\ &= \sum_{x=1}^\infty x \cdot P(X=x)\\ &= \sum_{x=1}^\infty x \cdot \frac{\mu^x e^{-\mu}}{x!}\\ &= \sum_{x=1}^\infty x \cdot \frac{\mu^x e^{-\mu}}{x \cdot (x-1)!} & [\text{definition of the factorial ("!") function}]\\ &= \sum_{x=1}^\infty \frac{\mu^x e^{-\mu}}{(x-1)!}\\ &= \sum_{x=1}^\infty \frac{(\mu \cdot \mu^{x-1}) e^{-\mu}}{(x-1)!}\\ &= \mu \cdot \sum_{x=1}^\infty \frac{\mu^{x-1} e^{-\mu}}{(x-1)!}\\ &= \mu \cdot \sum_{y=0}^\infty \frac{\mu^{y} e^{-\mu}}{y!} &[\text{substituting } y \stackrel{\text{def}}{=}x-1]\\ &= \mu \cdot 1 &[\text{because PMFs sum to 1}]\\ &= \mu \end{aligned} \]

See also https://statproofbook.github.io/P/poiss-mean.

For the variance, see https://statproofbook.github.io/P/poiss-var.
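Both moment identities can also be checked numerically by summing PMF terms directly, truncating the infinite sum where the remaining mass is negligible (the value \(\mu = 2.5\) is an arbitrary choice):

```r
mu <- 2.5  # arbitrary mean parameter
x <- 0:200 # truncation point; remaining tail mass is negligible
p <- dpois(x, lambda = mu)

sum(x * p)          # E[X], equals mu
sum((x - mu)^2 * p) # Var(X), also equals mu
```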


Accounting for exposure

Definition 5 (Exposure magnitude) For many count outcomes, there is some sense of an exposure magnitude, such as population size, or duration of observation, which multiplicatively rescales the expected (mean) count.


Exercise 5 What are some examples of exposure magnitudes?


Solution.

Table 3: Examples of exposure units
outcome exposure units
disease incidence number of individuals exposed; time at risk
car accidents miles driven
worksite accidents person-hours worked
population size size of habitat

Exposure units are similar to the number of trials in a binomial distribution, but in non-binomial count outcomes, there can be more than one event per unit of exposure.

We can use \(t\) to represent continuous-valued exposures/observation durations, and \(n\) to represent discrete-valued exposures.


Definition 6 (Event rate)  

For a count outcome \(Y\) with exposure magnitude \(t\), the event rate (denoted \(\lambda\)) is defined as the mean of \(Y\) divided by the exposure magnitude. That is:

\[\mu \stackrel{\text{def}}{=}\text{E}[Y|T=t]\]

\[\lambda \stackrel{\text{def}}{=}\frac{\mu}{t} \tag{3}\]

Event rate is somewhat analogous to odds in binary outcome models; it typically serves as an intermediate transformation between the mean of the outcome and the linear component of the model. However, in contrast with the odds function, the transformation \(\lambda = \mu/t\) is not considered part of the Poisson model’s link function, and it treats the exposure magnitude covariate differently from the other covariates.


Theorem 8 (Transformation function from event rate to mean) For a count variable with mean \(\mu\), event rate \(\lambda\), and exposure magnitude \(t\):

\[\mu = \lambda \cdot t \tag{4}\]


Proof. Start from the definition of event rate (Equation 3) and use algebra to solve for \(\mu\).


Equation 4 is analogous to the inverse-odds function for binary variables.


Theorem 9 When the exposure magnitude is 0, there is no opportunity for events to occur:

\[\text{E}[Y|T=0] = 0\]


Proof. \[\text{E}[Y|T=0] = \lambda \cdot 0 = 0\]


Important

The exposure magnitude, \(T\), is similar to a covariate in linear or logistic regression. However, there is an important difference: in count regression, there is no intercept corresponding to \(\text{E}[Y|T=0]\). In other words, this model assumes that if there is no exposure, there can’t be any events.

Theorem 10 If \(\mu = \lambda\cdot t\), then:

\[\log{\mu} = \log{\lambda} + \log{t}\]

Definition 7 (Offset) When the linear component of a model involves a term without an unknown coefficient, that term is called an offset.
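In R's `glm()`, Theorem 10 is implemented by including `offset(log(t))` in the model formula, which adds \(\log{t}\) to the linear predictor with a fixed coefficient of 1. A minimal sketch with simulated data (all variable names and parameter values here are illustrative):

```r
set.seed(1)
n <- 100
t <- runif(n, 1, 10)         # exposure magnitudes
x <- rnorm(n)                # a covariate
lambda <- exp(0.2 + 0.5 * x) # true event rates
y <- rpois(n, lambda * t)    # counts with mean mu = lambda * t

# log(t) enters the linear predictor as an offset (fixed coefficient of 1):
fit <- glm(y ~ x + offset(log(t)), family = poisson)
coef(fit) # estimates of the rate-model intercept (0.2) and slope (0.5)
```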


Theorem 11 If \(X\) and \(Y\) are independent Poisson random variables with means \(\mu_X\) and \(\mu_Y\), their sum, \(Z=X+Y\), is also a Poisson random variable, with mean \(\mu_Z = \mu_X + \mu_Y\).



2.3 The Negative-Binomial distribution

Definition 8 (Negative binomial distribution) \[ \text{P}(Y=y) = \frac{\mu^y}{y!} \cdot \frac{\Gamma(\rho + y)}{\Gamma(\rho) \cdot (\rho + \mu)^y} \cdot \left(1+\frac{\mu}{\rho}\right)^{-\rho} \]

where \(\rho\) is an overdispersion parameter and \(\Gamma(x) = (x-1)!\) for positive integers \(x\).

You don’t need to memorize or understand this expression.

As \(\rho \rightarrow \infty\), the second term converges to 1 and the third term converges to \(\text{exp}{\left\{-\mu\right\}}\), which brings us back to the Poisson distribution.


Theorem 12 If \(Y \sim \text{NegBin}(\mu, \rho)\), then:

  • \(\text{E}[Y] = \mu\)
  • \(\text{Var}{\left(Y\right)} = \mu + \frac{\mu^2}{\rho} > \mu\)
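R's `dnbinom()` supports this parameterization through its `mu` and `size` (\(= \rho\)) arguments, so both moments can be checked numerically (the parameter values here are arbitrary):

```r
mu <- 4    # arbitrary mean parameter
rho <- 2   # arbitrary overdispersion parameter
y <- 0:500 # truncation point; remaining tail mass is negligible
p <- dnbinom(y, mu = mu, size = rho)

sum(y * p)          # E[Y] = mu = 4
sum((y - mu)^2 * p) # Var(Y) = mu + mu^2 / rho = 12
```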

2.4 Weibull Distribution

\[ \begin{aligned} p(t)&= \alpha\lambda t^{\alpha-1}\text{e}^{-\lambda t^\alpha}\\ {\lambda}(t)&=\alpha\lambda t^{\alpha-1}\\ \text{S}(t)&=\text{e}^{-\lambda t^\alpha}\\ \text{E}(T)&= \Gamma(1+1/\alpha)\cdot \lambda^{-1/\alpha} \end{aligned} \]

When \(\alpha=1\), this is the exponential distribution. When \(\alpha>1\), the hazard is increasing, and when \(\alpha < 1\), the hazard is decreasing. This provides more flexibility than the exponential distribution.

We will see more of this distribution later.
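Note that R's `*weibull()` functions use a shape/scale parameterization, \(\text{S}(t) = \text{e}^{-(t/\sigma)^\alpha}\), which corresponds to the rate parameterization above via \(\lambda = \sigma^{-\alpha}\). A numerical check with arbitrary parameter values:

```r
alpha <- 1.5                 # shape parameter (arbitrary)
lambda <- 0.8                # rate parameter as used above (arbitrary)
sigma <- lambda^(-1 / alpha) # R's scale parameter

t <- 2
# survival function S(t) = exp(-lambda * t^alpha), computed two ways:
pweibull(t, shape = alpha, scale = sigma, lower.tail = FALSE)
exp(-lambda * t^alpha) # same value

# mean E[T] = Gamma(1 + 1/alpha) * lambda^(-1/alpha)
gamma(1 + 1 / alpha) * lambda^(-1 / alpha)
```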

3 Characteristics of probability distributions

3.1 Probability density function

Definition 9 (probability density) If \(X\) is a continuous random variable, then the probability density of \(X\) at value \(x\), denoted \(f(x)\), \(f_X(x)\), \(\text{p}(x)\), \(\text{p}_X(x)\), or \(\text{p}(X=x)\), is defined as the limit of the probability (mass) that \(X\) is in an interval around \(x\), divided by the width of that interval, as that width reduces to 0.

\[ \begin{aligned} f(x) &\stackrel{\text{def}}{=}\lim_{\Delta \rightarrow 0} \frac{\text{P}(X \in [x, x + \Delta])}{\Delta} \end{aligned} \]

See also Rothman et al. (2021) (Chapter 22, p. 535) and https://en.wikipedia.org/wiki/Probability_density_function#Formal_definition


Definition 10 (Cumulative distribution function (CDF)) For a random variable \(X\), its population CDF is \[F(t)=\Pr(X\le t), \quad t\in\mathbb{R}.\]

Definition 11 (Quantile function (population inverse CDF)) For a random variable \(X\) with cumulative distribution function (CDF) \(F\), its population quantile function (generalized inverse of \(F\)) is \[Q(p)=\inf\{t:F(t)\ge p\}, \quad 0<p\le 1.\]


Theorem 13 (Density function is derivative of CDF) The density function \(f(t)\) or \(\text{p}(T=t)\) for a random variable \(T\) at value \(t\) is equal to the derivative of the cumulative probability function \(F(t) \stackrel{\text{def}}{=}P(T\le t)\); that is:

\[f(t) \stackrel{\text{def}}{=}\frac{\partial}{\partial t} F(t)\]


Theorem 14 (Density functions integrate to 1) For any density function \(f(x)\),

\[\int_{x \in \mathcal{R}(X)} f(x) dx = 1\]


3.2 Hazard function

Definition 12 (Hazard function, hazard rate, hazard rate function)  

The hazard function (also called the hazard rate or hazard rate function) for a random variable \(T\) at value \(t\), typically denoted \(\text{h}(t)\) 1 or \(\lambda(t)\), 2 is the conditional density of \(T\) at \(t\), given \(T \ge t\). That is:

\[{\lambda}(t) \stackrel{\text{def}}{=}\text{p}(T=t|T\ge t)\]

If \(T\) represents the time at which an event occurs, then \({\lambda}(t)\) is the probability that the event occurs at time \(t\), given that it has not occurred prior to time \(t\).


Table 4: Probability distribution functions
Name Symbols Definition
Probability density function (PDF) \(\text{f}(t), \text{p}(t)\) \(\text{p}(T=t)\)
Cumulative distribution function (CDF) \(\text{F}(t), \text{P}(t)\) \(\text{P}(T\leq t)\)
Survival function \(\text{S}(t), \bar{\text{F}}(t)\) \(\text{P}(T > t)\)
Hazard function \(\lambda(t), \text{h}(t)\) \(\text{p}(T=t|T\ge t)\)
Cumulative hazard function \(\Lambda(t), \text{H}(t)\) \(\int_{u=-\infty}^t {\lambda}(u)du\)
Log-hazard function \(\eta(t)\) \(\text{log}{\left\{{\lambda}(t)\right\}}\)

\[ \text{f}(t) \xleftarrow[\text{S}(t){\lambda}(t)]{-S'(t)} \text{S}(t) \xleftarrow[]{\text{exp}{\left\{-{\Lambda}(t)\right\}}} {\Lambda}(t) \xleftarrow[]{\int_{u=0}^t {\lambda}(u)du} {\lambda}(t) \xleftarrow[]{\text{exp}{\left\{\eta(t)\right\}}} \eta(t) \]

\[ \text{f}(t) \xrightarrow[\int_{u=t}^\infty \text{f}(u)du]{\text{f}(t)/{\lambda}(t)} \text{S}(t) \xrightarrow[-\log{\text{S}(t)}]{} {\Lambda}(t) \xrightarrow[{\Lambda}'(t)]{} {\lambda}(t) \xrightarrow[\text{log}{\left\{{\lambda}(t)\right\}}]{} \eta(t) \]
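These conversions can be verified numerically for, e.g., an exponential distribution (the rate is arbitrary), whose hazard function is constant:

```r
rate <- 0.5 # arbitrary constant hazard (exponential rate)
t <- 3

S <- pexp(t, rate, lower.tail = FALSE) # survival function S(t)
f <- dexp(t, rate)                     # density f(t)
Lambda <- -log(S)                      # cumulative hazard: -log(S(t))
lambda <- f / S                        # hazard: f(t) / S(t)

c(S = S, f = f, Lambda = Lambda, lambda = lambda)
# for the exponential distribution, lambda == rate and Lambda == rate * t
```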


3.3 Expectation

Definition 13 (Expectation, expected value, population mean ) The expectation, expected value, or population mean of a continuous random variable \(X\), denoted \(\text{E}{\left[X\right]}\), \(\mu(X)\), or \(\mu_X\), is the weighted mean of \(X\)’s possible values, weighted by the probability density function of those values:

\[\text{E}{\left[X\right]} = \int_{x\in \mathcal{R}(X)} x \cdot \text{p}(X=x)dx\]

The expectation, expected value, or population mean of a discrete random variable \(X\), denoted \(\text{E}{\left[X\right]}\), \(\mu(X)\), or \(\mu_X\), is the mean of \(X\)’s possible values, weighted by the probability mass function of those values:

\[\text{E}{\left[X\right]} = \sum_{x \in \mathcal{R}(X)} x \cdot \text{P}(X=x)\]

(c.f. https://en.wikipedia.org/wiki/Expected_value)


Theorem 15 (Expectation of the Bernoulli distribution) The expectation of a Bernoulli random variable with parameter \(\pi\) is:

\[\text{E}{\left[X\right]} = \pi\]


Proof. \[ \begin{aligned} \text{E}{\left[X\right]} &= \sum_{x\in \mathcal{R}(X)} x \cdot\text{P}(X=x) \\&= \sum_{x\in {\left\{0,1\right\}}} x \cdot\text{P}(X=x) \\&= {\left(0 \cdot\text{P}(X=0)\right)} + {\left(1 \cdot\text{P}(X=1)\right)} \\&= {\left(0 \cdot(1-\pi)\right)} + {\left(1 \cdot\pi\right)} \\&= 0 + \pi \\&= \pi \end{aligned} \]


Theorem 16 (Expectation of time-to-event variables) If \(T\) is a non-negative random variable, then:

\[\text{E}{\left[T\right]} = \int_{t=0}^{\infty}\text{S}(t)dt\]
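For example, for an exponential distribution with rate 2 (chosen arbitrarily), integrating the survival function recovers the mean \(1/2\):

```r
rate <- 2 # arbitrary rate parameter
integrate(
  function(t) pexp(t, rate, lower.tail = FALSE), # S(t)
  lower = 0, upper = Inf
)$value # equals 1 / rate = 0.5
```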


Theorem 17 (Law of the Unconscious Statistician (LOTUS)) For any function \(g\) of a discrete random variable \(X\):

\[\text{E}{\left[g(X)\right]} = \sum_{x \in \mathcal{R}(X)} g(x) \cdot\text{P}(X=x)\]


Proof. Let \(Y = g(X)\). By Definition 13 applied to \(Y\):

\[ \begin{aligned} \text{E}{\left[g(X)\right]} &= \text{E}{\left[Y\right]} \\&= \sum_{y \in \mathcal{R}(Y)} y \cdot\text{P}(Y=y) \\&= \sum_{y \in \mathcal{R}(Y)} y \cdot\text{P}(g(X)=y) \\&= \sum_{y \in \mathcal{R}(Y)} y \cdot\sum_{\substack{x \in \mathcal{R}(X) \\ g(x) = y}} \text{P}(X=x) \\&= \sum_{x \in \mathcal{R}(X)} g(x) \cdot\text{P}(X=x) \end{aligned} \]

where the last equality follows by rearranging the double sum, grouping each term \(x\) by its image \(y = g(x)\).


LOTUS says that to compute \(\text{E}{\left[g(X)\right]}\), we do not need to first find the distribution of \(g(X)\); we can compute the expectation directly using the distribution of \(X\).

For a continuous random variable \(X\) with density \(\text{p}(X=x)\), the analogous formula is:

\[\text{E}{\left[g(X)\right]} = \int_{x \in \mathcal{R}(X)} g(x) \cdot\text{p}(X=x)\, dx\]


Example 3 (Expected value of \(X^2\) for a Bernoulli variable) Let \(X \sim \text{Ber}(\pi)\). By LOTUS (Theorem 17):

\[ \begin{aligned} \text{E}{\left[X^2\right]} &= \sum_{x \in {\left\{0,1\right\}}} x^2 \cdot\text{P}(X=x) \\&= 0^2 \cdot\text{P}(X=0) + 1^2 \cdot\text{P}(X=1) \\&= 0^2 \cdot(1-\pi) + 1^2 \cdot\pi \\&= 0 + \pi \\&= \pi \end{aligned} \]


Definition 14 (Conditional expectation) Discrete case. Let \(X\) and \(Y\) be jointly distributed discrete random variables. The conditional probability mass function of \(Y\) given \(X = x\) (for values of \(x\) with \(\text{P}(X = x) > 0\)) is:

\[\text{P}(Y = y \mid X = x) \stackrel{\text{def}}{=}\frac{\text{P}(X = x,\, Y = y)}{\text{P}(X = x)}\]

The conditional expectation of \(Y\) given \(X = x\) is:

\[\text{E}{\left[Y \mid X = x\right]} \stackrel{\text{def}}{=}\sum_{y \in \mathcal{R}(Y)} y \cdot\text{P}(Y = y \mid X = x)\]

Continuous case. Let \(X\) and \(Y\) be jointly distributed continuous random variables with joint density \(\text{p}(X = x,\, Y = y)\) and marginal density \(\text{p}(X = x)\). The conditional probability density function of \(Y\) given \(X = x\) (for values of \(x\) with \(\text{p}(X = x) > 0\)) is:

\[\text{p}(Y = y \mid X = x) \stackrel{\text{def}}{=}\frac{\text{p}(X = x,\, Y = y)}{\text{p}(X = x)}\]

The conditional expectation of \(Y\) given \(X = x\) is:

\[\text{E}{\left[Y \mid X = x\right]} \stackrel{\text{def}}{=}\int_{y \in \mathcal{R}(Y)} y \cdot\text{p}(Y = y \mid X = x)\, dy\]

Conditional expectation function. The conditional expectation function \(\text{E}{\left[Y \mid X\right]}\) is the function (and hence random variable) of \(X\) obtained by evaluating \(\text{E}{\left[Y \mid X = x\right]}\) at \(X\); that is, \(\text{E}{\left[Y \mid X\right]} = g(X)\) where \(g(x) \stackrel{\text{def}}{=}\text{E}{\left[Y \mid X = x\right]}\).


Theorem 18 (Law of iterated expectations) For any two random variables \(X\) and \(Y\):

\[\text{E}{\left[Y\right]} = \text{E}{\left[\text{E}{\left[Y \mid X\right]}\right]}\]

Alternate names for this identity include: the tower rule, the tower property, the law of total expectation, and the smoothing theorem.


Proof. Discrete case. When \(X\) and \(Y\) are discrete, applying Definition 13 to \(\text{E}{\left[\text{E}{\left[Y \mid X\right]}\right]}\) and then the law of total probability (Theorem 4) applied to the countable partition \(\{X = x : x \in \mathcal{R}(X)\}\):

\[ \begin{aligned} \text{E}{\left[\text{E}{\left[Y \mid X\right]}\right]} &= \sum_{x \in \mathcal{R}(X)} \text{E}{\left[Y \mid X=x\right]} \cdot\text{P}(X=x) \\&= \sum_{x \in \mathcal{R}(X)} {\left(\sum_{y \in \mathcal{R}(Y)} y \cdot\text{P}(Y=y \mid X=x)\right)} \cdot\text{P}(X=x) \\&= \sum_{y \in \mathcal{R}(Y)} y \cdot\sum_{x \in \mathcal{R}(X)} \text{P}(Y=y \mid X=x) \cdot\text{P}(X=x) \\&= \sum_{y \in \mathcal{R}(Y)} y \cdot\text{P}(Y=y) \\&= \text{E}{\left[Y\right]} \end{aligned} \]

Continuous case. When \(X\) and \(Y\) are continuous, applying Definition 13 to \(\text{E}{\left[\text{E}{\left[Y \mid X\right]}\right]}\) and then using Definition 14 for \(\text{E}{\left[Y \mid X=x\right]}\):

\[ \begin{aligned} \text{E}{\left[\text{E}{\left[Y \mid X\right]}\right]} &= \int_{x \in \mathcal{R}(X)} \text{E}{\left[Y \mid X=x\right]} \cdot\text{p}(X=x)\, dx \\&= \int_{x \in \mathcal{R}(X)} {\left(\int_{y \in \mathcal{R}(Y)} y \cdot\text{p}(Y=y \mid X=x)\, dy\right)} \cdot\text{p}(X=x)\, dx \\&= \int_{y \in \mathcal{R}(Y)} y \cdot{\left(\int_{x \in \mathcal{R}(X)} \text{p}(Y=y \mid X=x) \cdot\text{p}(X=x)\, dx\right)}\, dy \\&= \int_{y \in \mathcal{R}(Y)} y \cdot\text{p}(Y=y)\, dy \\&= \text{E}{\left[Y\right]} \end{aligned} \]

where the third equality exchanges the order of integration by Fubini’s theorem, and the fourth equality uses \(\int_{x} \text{p}(Y=y \mid X=x) \cdot\text{p}(X=x)\, dx = \int_{x} \text{p}(X=x, Y=y)\, dx = \text{p}(Y=y)\) (marginalization of the joint density).


Theorem 19 (Conditional law of iterated expectations) For random variables \(X\), \(Y\), and \(Z\):

\[\text{E}{\left[Y \mid Z\right]} = \text{E}{\left[\text{E}{\left[Y \mid X,Z\right]} \mid Z\right]}\]

This is the tower rule applied conditionally on \(Z\).


Proof. For each fixed value \(z\) with positive probability or density:

Discrete case. Conditioning on \(Z=z\), and applying the law of total probability to the partition \(\{X=x : x \in \mathcal{R}(X)\}\) under the conditional distribution given \(Z=z\):

\[ \begin{aligned} \text{E}{\left[\text{E}{\left[Y \mid X,Z\right]} \mid Z=z\right]} &= \sum_{x \in \mathcal{R}(X)} \text{E}{\left[Y \mid X=x,Z=z\right]} \cdot\text{P}(X=x \mid Z=z) \\&= \text{E}{\left[Y \mid Z=z\right]} \end{aligned} \]

Continuous case. Conditioning on \(Z=z\), and integrating over \(X\) under the conditional density \(\text{p}(X=x \mid Z=z)\):

\[ \begin{aligned} \text{E}{\left[\text{E}{\left[Y \mid X,Z\right]} \mid Z=z\right]} &= \int_{x \in \mathcal{R}(X)} \text{E}{\left[Y \mid X=x,Z=z\right]} \cdot\text{p}(X=x \mid Z=z)\, dx \\&= \text{E}{\left[Y \mid Z=z\right]} \end{aligned} \]

Therefore, as random variables of \(Z\), \(\text{E}{\left[Y \mid Z\right]} = \text{E}{\left[\text{E}{\left[Y \mid X,Z\right]} \mid Z\right]}\).


Example 4 (Marginal expectation from conditional expectations) Suppose \(X\) is a binary random variable indicating treatment assignment (\(X=1\) treated, \(X=0\) control), with \(\text{P}(X=1) = 0.5\), and suppose the outcome \(Y\) has conditional expectations:

\[\text{E}{\left[Y \mid X=1\right]} = 10, \quad \text{E}{\left[Y \mid X=0\right]} = 6\]

By the law of iterated expectations (Theorem 18):

\[ \begin{aligned} \text{E}{\left[Y\right]} &= \text{E}{\left[\text{E}{\left[Y \mid X\right]}\right]} \\&= \text{E}{\left[Y \mid X=1\right]} \cdot\text{P}(X=1) + \text{E}{\left[Y \mid X=0\right]} \cdot\text{P}(X=0) \\&= 10 \cdot 0.5 + 6 \cdot 0.5 \\&= 5 + 3 \\&= 8 \end{aligned} \]
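Since \(X\) is discrete, this marginal expectation is just a probability-weighted average of the conditional expectations:

```r
cond_means <- c(10, 6) # E[Y | X = 1], E[Y | X = 0]
probs <- c(0.5, 0.5)   # P(X = 1), P(X = 0)
weighted.mean(cond_means, w = probs) # E[Y] = 8
```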


Definition 15 (Expectation of a random matrix) For a random matrix \(\mathbf{A}\) of size \(m \times n\) with \((i,j)\)-th element \(A_{ij}\), the expectation \(\text{E}\mathbf{A}\) is the \(m \times n\) matrix whose \((i,j)\)-th element is \(\text{E}{\left[A_{ij}\right]}\):

\[ \text{E}\mathbf{A} \stackrel{\text{def}}{=}\begin{pmatrix} \text{E}{\left[A_{11}\right]} & \text{E}{\left[A_{12}\right]} & \cdots & \text{E}{\left[A_{1n}\right]} \\ \text{E}{\left[A_{21}\right]} & \text{E}{\left[A_{22}\right]} & \cdots & \text{E}{\left[A_{2n}\right]} \\ \vdots & \vdots & \ddots & \vdots \\ \text{E}{\left[A_{m1}\right]} & \text{E}{\left[A_{m2}\right]} & \cdots & \text{E}{\left[A_{mn}\right]} \end{pmatrix} \]

In other words, expectation is applied element-wise to a random matrix.


3.4 Deviation, error, and noise

Definition 16 (Deviation) A deviation is the difference between a value and a reference value. For any quantity \(z\) and reference value \(r\):

\[z - r\]

In probability and statistics, “deviation” often means deviation from a population mean. For a random variable \(Y\):

\[Y - \text{E}{\left[Y\right]}\]


Definition 17 (Deviation from a population or subpopulation mean) In probabilistic models, we call this quantity a deviation from a mean. It is often also called an error or noise term in other sources. For the random variable \(Y\), define the deviation from its mean as:

\[e(Y) \stackrel{\text{def}}{=}Y - \text{E}{\left[Y\right]}\]

For a realized observation \(y\): \[e(y) \stackrel{\text{def}}{=}y - \text{E}{\left[Y\right]}\]

In regression settings, the reference mean is often conditional on covariates: \(e(y_i) \stackrel{\text{def}}{=}y_i - \text{E}{\left[Y_i \mid X_i\right]}\).

In this course, we prefer “deviation” for this mean-deviation quantity. The terms “error” and “noise” are common aliases. We use “residual” (defined in the Linear regression chapter) for deviations from fitted values. For notation in this course, we use \(e(\cdot)\) for these model/data deviations, and reserve \(\varepsilon{\left(\cdot\right)}\) for estimator-to-estimand deviations (see Estimation).



3.6 The Central Limit Theorem

Sums of many independent (or nearly independent) random variables, each contributing only a small share of the total variance, tend to have bell-shaped (approximately Gaussian) distributions.

For example, consider the sum of five dice (Figure 4).

Show R code
library(dplyr)
dist = 
  expand.grid(1:6, 1:6, 1:6, 1:6, 1:6) |> 
  rowwise() |>
  mutate(total = sum(c_across(everything()))) |> 
  ungroup() |> 
  count(total) |> 
  mutate(`p(X=x)` = n/sum(n))

library(ggplot2)

dist |> 
  ggplot() +
  aes(x = total, y = `p(X=x)`) +
  geom_col() +
  xlab("sum of dice (x)") +
  ylab("Probability of outcome, Pr(X=x)") +
  expand_limits(y = 0)

  
  
Figure 4: Distribution of the sum of five dice

In comparison, the outcome of just one die is not bell-shaped (Figure 5).

Show R code
library(dplyr)
dist = 
  expand.grid(1:6) |> 
  rowwise() |>
  mutate(total = sum(c_across(everything()))) |> 
  ungroup() |> 
  count(total) |> 
  mutate(`p(X=x)` = n/sum(n))

library(ggplot2)

dist |> 
  ggplot() +
  aes(x = total, y = `p(X=x)`) +
  geom_col() +
  xlab("outcome of one die (x)") +
  ylab("Probability of outcome, Pr(X=x)") +
  expand_limits(y = 0)

  
  
Figure 5: Distribution of the outcome of one die

What distribution does a single die have?

Answer: discrete uniform on 1:6.

4 Additional resources

References

Dobson, Annette J, and Adrian G Barnett. 2018. An Introduction to Generalized Linear Models. 4th ed. CRC press. https://doi.org/10.1201/9781315182780.
Kalbfleisch, John D, and Ross L Prentice. 2011. The Statistical Analysis of Failure Time Data. John Wiley & Sons.
Klein, John P, and Melvin L Moeschberger. 2003. Survival Analysis: Techniques for Censored and Truncated Data. Vol. 1230. Springer. https://link.springer.com/book/10.1007/b97377.
Kleinbaum, David G, and Mitchel Klein. 2012. Survival Analysis: A Self-Learning Text. 3rd ed. Springer. https://link.springer.com/book/10.1007/978-1-4419-6646-9.
Miller, Steven J. 2017. The Probability Lifesaver : All the Tools You Need to Understand Chance. A Princeton Lifesaver Study Guide. Princeton University Press. https://press.princeton.edu/books/hardcover/9780691149547/the-probability-lifesaver.
Rothman, Kenneth J., Timothy L. Lash, Tyler J. VanderWeele, and Sebastien Haneuse. 2021. Modern Epidemiology. Fourth edition. Wolters Kluwer.
Vittinghoff, Eric, David V Glidden, Stephen C Shiboski, and Charles E McCulloch. 2012. Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models. 2nd ed. Springer. https://doi.org/10.1007/978-1-4614-1353-0.

Footnotes

  1. for example in Dobson and Barnett (2018), Vittinghoff et al. (2012), Klein and Moeschberger (2003), and Kleinbaum and Klein (2012)↩︎

  2. for example, in Rothman et al. (2021) and Kalbfleisch and Prentice (2011)↩︎