Last modified: 2026-04-14 19:59:04 (UTC)


Configuring R

Functions from these packages will be used throughout this document:

Show R code
library(conflicted) # check for conflicting function definitions
# library(printr) # inserts help-file output into markdown output
library(rmarkdown) # Convert R Markdown documents into a variety of formats.
library(pander) # format tables for markdown
library(ggplot2) # graphics
library(ggfortify) # help with graphics
library(dplyr) # manipulate data
library(tibble) # `tibble`s extend `data.frame`s
library(magrittr) # `%>%` and other additional piping tools
library(haven) # import Stata files
library(knitr) # format R output for markdown
library(tidyr) # Tools to help to create tidy data
library(plotly) # interactive graphics
library(dobson) # datasets from Dobson and Barnett 2018
library(parameters) # format model output tables for markdown
library(latex2exp) # use LaTeX in R code (for figures and tables)
library(fs) # filesystem path manipulations
library(survival) # survival analysis
library(survminer) # survival analysis graphics
library(KMsurv) # datasets from Klein and Moeschberger
library(webshot2) # convert interactive content to static for pdf
library(forcats) # functions for categorical variables ("factors")
library(stringr) # functions for dealing with strings
library(lubridate) # functions for dealing with dates and times

Here are some R settings I use in this document:

Show R code
rm(list = ls()) # delete any data that's already loaded into R

conflicts_prefer(dplyr::filter) # use the filter() function from dplyr by default
ggplot2::theme_set(
  ggplot2::theme_bw() + 
        # ggplot2::labs(col = "") +
    ggplot2::theme(
      legend.position = "bottom",
      text = ggplot2::element_text(size = 12, family = "serif")))

knitr::opts_chunk$set(message = FALSE)
options('digits' = 6)

panderOptions("big.mark", ",")
pander::panderOptions("table.emphasize.rownames", FALSE)
pander::panderOptions("table.split.table", Inf)
legend_text_size = 9
run_graphs = TRUE

Most of the content in this chapter should be review from UC Davis Epi 202.

1 Core properties of probabilities

1.1 Defining probabilities

Definition 1 (Probability measure) A probability measure, often denoted \(\Pr()\) or \(\text{P}()\), is a function whose domain is a \(\sigma\)-algebra of possible outcomes, \(\mathscr{S}\), and which satisfies the following properties:

  1. For any statistical event \(A \in \mathscr{S}\), \(\Pr(A) \ge 0\).

  2. The probability of the union of all outcomes (\(\Omega \stackrel{\text{def}}{=}\cup \mathscr{S}\)) is 1:

\[\Pr(\Omega) = 1\]

  3. The probability of the union of countably many mutually disjoint events \(A_1, A_2, \ldots\) (where \(A_i \cap A_j = \emptyset\) for all \(i \neq j\)) is equal to the sum of their probabilities (countable additivity or sigma-additivity):

\[\Pr\!\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \Pr(A_i)\]

Property 3 (countable additivity) is stronger than finite additivity, which only requires

\[\Pr(A_1 \cup \cdots \cup A_n) = \sum_{i=1}^{n} \Pr(A_i)\]

for every finite collection of mutually disjoint events. Countable additivity implies finite additivity (set \(A_{n+1} = A_{n+2} = \cdots = \emptyset\) in property 3, using \(\Pr(\emptyset) = 0\)), but not vice versa: there exist set functions that satisfy finite additivity but fail countable additivity (see Wikipedia: Sigma-additive set function — An additive function which is not σ-additive). Requiring countable additivity enables results such as the continuity of probability (if \(A_1 \supseteq A_2 \supseteq \cdots\) with \(\bigcap_i A_i = \emptyset\), then \(\Pr(A_i) \to 0\)) and underpins the Theorem 4 for countable partitions.


Theorem 1 If \(A\) and \(B\) are statistical events and \(A\subseteq B\), then \(\Pr(A \cap B) = \Pr(A)\).


Proof. Since \(A \subseteq B\), we have \(A \cap B = A\), and therefore \(\Pr(A \cap B) = \Pr(A)\).


Theorem 2 \[\Pr(A) + \Pr(\neg A) = 1\]


Proof. \(A\) and \(\neg A\) are disjoint and \(A \cup \neg A = \Omega\), so by properties 2 and 3 of Definition 1, \(\Pr(A) + \Pr(\neg A) = \Pr(\Omega) = 1\).


Corollary 1 \[\Pr(\neg A) = 1 - \Pr(A)\]


Proof. By Theorem 2 and algebra.


Corollary 2 If the probability of an outcome \(A\) is \(\Pr(A)=\pi\), then the probability that \(A\) does not occur is:

\[\Pr(\neg A)= 1 - \pi\]


Proof. Using Corollary 1:

\[ \begin{aligned} \Pr(\neg A) &= 1 - \Pr(A) \\ &= 1 - \pi \end{aligned} \]


1.2 Conditional probability

Definition 2 (Conditional probability) For two events \(A\) and \(B\) with \(\Pr(B) > 0\), the conditional probability of \(A\) given \(B\), denoted \(\Pr(A \mid B)\), is:

\[\Pr(A \mid B) \stackrel{\text{def}}{=}\frac{\Pr(A \cap B)}{\Pr(B)}\]


Theorem 3 (Law of conditional probability) For any two events \(A\) and \(B\) with \(\Pr(B) > 0\):

\[\Pr(A \cap B) = \Pr(A \mid B) \cdot\Pr(B)\]


Proof. Rearranging Definition 2:

\[ \begin{aligned} \Pr(A \mid B) &= \frac{\Pr(A \cap B)}{\Pr(B)} \\ \Pr(A \cap B) &= \Pr(A \mid B) \cdot\Pr(B) \end{aligned} \]


Example 1 (Applying the law of conditional probability) Suppose 30% of adults exercise regularly (\(\Pr(E) = 0.30\)), and among adults who exercise regularly, 60% have low blood pressure (\(\Pr(L \mid E) = 0.60\)).

Then the probability that a randomly selected adult both exercises regularly and has low blood pressure is:

\[ \begin{aligned} \Pr(L \cap E) &= \Pr(L \mid E) \cdot\Pr(E) \\&= 0.60 \cdot 0.30 \\&= 0.18 \end{aligned} \]
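As a quick check, this arithmetic can be reproduced in R (the object names are just illustrative):

```r
p_E <- 0.30         # P(E): probability of regular exercise
p_L_given_E <- 0.60 # P(L | E): probability of low blood pressure, given exercise

# law of conditional probability: P(L and E) = P(L | E) * P(E)
p_L_and_E <- p_L_given_E * p_E
p_L_and_E
#> [1] 0.18
```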


Theorem 4 (Law of total probability) If \(B_1, B_2, \ldots\) is a countable partition of the sample space (i.e., countably many mutually exclusive events whose union is the entire sample space), then for any event \(A\):

\[\Pr(A) = \sum_{i=1}^{\infty} \Pr(A \mid B_i) \cdot\Pr(B_i)\]


Proof. Since \(B_1, B_2, \ldots\) partition the sample space, the events \(A \cap B_1, A \cap B_2, \ldots\) are mutually exclusive and their union is \(A\). By property 3 of Definition 1 (countable additivity), and then by Theorem 3:

\[ \begin{aligned} \Pr(A) &= \sum_{i=1}^{\infty} \Pr(A \cap B_i) \\&= \sum_{i=1}^{\infty} \Pr(A \mid B_i) \cdot\Pr(B_i) \end{aligned} \]


Theorem 5 (Bayes’ theorem) For any two events \(A\) and \(B\) with \(\Pr(A) > 0\) and \(\Pr(B) > 0\):

\[\Pr(A \mid B) = \frac{\Pr(B \mid A) \cdot\Pr(A)}{\Pr(B)}\]


Proof. Apply Definition 2 to both \(\Pr(A \mid B)\) and \(\Pr(B \mid A)\):

\[ \begin{aligned} \Pr(A \mid B) &= \frac{\Pr(A \cap B)}{\Pr(B)} \\&= \frac{\Pr(B \mid A) \cdot\Pr(A)}{\Pr(B)} \end{aligned} \]

The second equality follows from Theorem 3 applied to \(\Pr(B \cap A) = \Pr(B \mid A) \cdot\Pr(A)\).


Example 2 (Positive predictive value of a medical test) Suppose a disease test has 99% sensitivity and 99% specificity, and the prevalence of the disease in the population is 7%.

Let \(D\) be the event “person has the disease” and \(+\) be the event “test is positive”. Then:

  • \(\Pr(+ \mid D) = 0.99\) (sensitivity)
  • \(\Pr(\neg + \mid \neg D) = 0.99\) (specificity), so the false positive rate is \(\Pr(+ \mid \neg D) = 1 - 0.99 = 0.01\)
  • \(\Pr(D) = 0.07\) (prevalence)

By Bayes’ theorem (Theorem 5) and the law of total probability (Theorem 4):

\[ \begin{aligned} \Pr(D \mid +) &= \frac{\Pr(+ \mid D) \cdot\Pr(D)}{\Pr(+)} \\&= \frac{\Pr(+ \mid D) \cdot\Pr(D)}{\Pr(+ \mid D) \cdot\Pr(D) + \Pr(+ \mid \neg D) \cdot\Pr(\neg D)} \\&= \frac{0.99 \cdot 0.07}{0.99 \cdot 0.07 + 0.01 \cdot 0.93} \\&= \frac{0.0693}{0.0693 + 0.0093} \\&= \frac{0.0693}{0.0786} \\&\approx 0.88 \end{aligned} \]

Even with a highly accurate test (99% sensitive and 99% specific), only about 88% of people who test positive actually have the disease, because the disease prevalence is relatively low (7%).
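The calculation above can be verified in R (object names such as `sens` and `prev` are illustrative):

```r
sens <- 0.99 # P(+ | D): sensitivity
spec <- 0.99 # P(- | not D): specificity
prev <- 0.07 # P(D): prevalence

# law of total probability: P(+) = P(+ | D) P(D) + P(+ | not D) P(not D)
p_pos <- sens * prev + (1 - spec) * (1 - prev)

# Bayes' theorem: P(D | +)
ppv <- sens * prev / p_pos
round(ppv, 2)
#> [1] 0.88
```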

2 Random variables

2.1 Binary variables

Definition 3 (binary variable) A binary variable is a random variable which has only two possible values in its range.

Exercise 1 (Examples of binary variables) What are some examples of binary variables in the health sciences?


Solution. Examples of binary outcomes include:

  • exposure (exposed vs unexposed)
  • disease (diseased vs healthy)
  • recovery (recovered vs unrecovered)
  • relapse (relapse vs remission)
  • return to hospital (returned vs not)
  • vital status (dead vs alive)

2.2 Count variables

Definition 4 (Count variable) A count variable is a random variable whose possible values are some subset of the non-negative integers; that is, a random variable \(X\) such that:

\[\mathcal{R}(X) \subseteq \text{N}\]


Exercise 2 What are some examples of count variables?


Solution. Examples of count variables include: number of new cases of a disease, number of car accidents, number of worksite accidents, and population size.


Definition 5 (Exposure magnitude) For many count outcomes, there is some sense of an exposure magnitude, such as population size, or duration of observation, which multiplicatively rescales the expected (mean) count.


Exercise 3 What are some examples of exposure magnitudes?


Solution.

Table 1: Examples of exposure units

| outcome             | exposure units                              |
|---------------------|---------------------------------------------|
| disease incidence   | number of individuals exposed; time at risk |
| car accidents       | miles driven                                |
| worksite accidents  | person-hours worked                         |
| population size     | size of habitat                             |

Exposure units are similar to the number of trials in a binomial distribution, but in non-binomial count outcomes, there can be more than one event per unit of exposure.

We can use \(t\) to represent continuous-valued exposures/observation durations, and \(n\) to represent discrete-valued exposures.


Definition 6 (Event rate)  

For a count outcome \(Y\) with exposure magnitude \(t\), the event rate (denoted \(\lambda\)) is defined as the mean of \(Y\) divided by the exposure magnitude. That is:

\[\mu \stackrel{\text{def}}{=}\text{E}[Y|T=t]\]

\[\lambda \stackrel{\text{def}}{=}\frac{\mu}{t} \tag{1}\]

Event rate is somewhat analogous to odds in binary outcome models; it typically serves as an intermediate transformation between the mean of the outcome and the linear component of the model. However, in contrast with the odds function, the transformation \(\lambda = \mu/t\) is not considered part of the Poisson model’s link function, and it treats the exposure magnitude covariate differently from the other covariates.


Theorem 6 (Transformation function from event rate to mean) For a count variable with mean \(\mu\), event rate \(\lambda\), and exposure magnitude \(t\):

\[\therefore\mu = \lambda \cdot t \tag{2}\]


Proof. Start from the definition of event rate (Equation 1) and solve for \(\mu\): \(\lambda = \mu / t\), so \(\mu = \lambda \cdot t\).


Equation 2 is analogous to the inverse-odds function for binary variables.
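A minimal numeric illustration of Equation 2 in R (the rate and exposure values here are invented):

```r
lambda <- 0.2 # event rate: expected events per person-year
t <- 5        # exposure magnitude: person-years observed

# mean count: mu = lambda * t
mu <- lambda * t
mu
#> [1] 1
```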


Theorem 7 When the exposure magnitude is 0, there is no opportunity for events to occur:

\[\text{E}[Y|T=0] = 0\]


Proof. \[\text{E}[Y|T=0] = \lambda \cdot 0 = 0\]


Probability distributions for count outcomes


3 Key probability distributions


Some distributions are typically used for outcome models (Table 2); other distributions are typically used for test statistics (Table 3).

Table 2: Distributions typically used for outcome models

| Distribution        | Uses                                                                                          |
|---------------------|-----------------------------------------------------------------------------------------------|
| Bernoulli           | Binary outcomes                                                                               |
| Binomial            | Sums of Bernoulli outcomes                                                                    |
| Poisson             | Unbounded count outcomes                                                                      |
| Geometric           | Counts of non-events before an event occurs                                                   |
| Negative binomial   | Mixtures of Poisson distributions; counts of non-events until a given number of events occurs |
| Normal (Gaussian)   | Continuous outcomes without a more specific distribution                                      |
| Exponential         | Time-to-event outcomes                                                                        |
| Gamma               | Time-to-event outcomes                                                                        |
| Weibull             | Time-to-event outcomes                                                                        |
| Log-normal          | Time-to-event outcomes                                                                        |

Table 3: Distributions typically used for test statistics

| Distribution            | Uses                                                                                         |
|-------------------------|----------------------------------------------------------------------------------------------|
| \(\chi^2\)              | Regression comparisons (asymptotic); contingency table independence tests; goodness-of-fit tests |
| \(F\)                   | Gaussian model comparisons (exact)                                                           |
| \(Z\) (standard normal) | Proportions, means, regression coefficients (asymptotic)                                     |
| \(T\)                   | Means, regression coefficients in Gaussian outcome models (exact)                            |

3.1 The Bernoulli distribution

Definition 7 (Bernoulli distribution) The Bernoulli distribution family for a random variable \(X\) is defined as:

\[ \begin{aligned} \Pr(X=x) &= \text{1}_{x\in {\left\{0,1\right\}}}\pi^x(1-\pi)^{1-x}\\ &= \left\{{\pi, x=1}\atop{1-\pi, x=0}\right. \end{aligned} \]
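In R, a Bernoulli PMF can be evaluated with `dbinom()` using `size = 1` (the value \(\pi = 0.3\) is just for illustration):

```r
p <- 0.3 # the Bernoulli parameter pi

# P(X = 1) and P(X = 0) for X ~ Ber(p):
dbinom(1, size = 1, prob = p)
#> [1] 0.3
dbinom(0, size = 1, prob = p)
#> [1] 0.7
```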


3.2 The Poisson distribution

Figure 1: “Les Poissons” (portrait of Siméon Denis Poisson)


Exercise 4 Define the Poisson distribution.


Solution 1.

Definition 8 (Poisson distribution) \[\text{P}(Y = y) = \frac{\mu^{y} e^{-\mu}}{y!}, y \in \text{N} \tag{3}\]

(see Figure 2)


Exercise 5 What is the range of possible values for a Poisson distribution?


Solution 2. \[\mathcal{R}(Y) = {\left\{0, 1, 2, ...\right\}} = \text{N}\]


Theorem 8 (CDF of Poisson distribution) \[\text{P}(Y \le y) = e^{-\mu} \sum_{j=0}^{\left \lfloor{y}\right \rfloor}\frac{\mu^j}{j!} \tag{4}\]

(see Figure 3)


Show R code
library(dplyr)
pois_dists <- tibble(
  mu = c(0.5, 1, 2, 5, 10, 20)
) |>
  reframe(
    .by = mu,
    x = 0:30
  ) |>
  mutate(
    `P(X = x)` = dpois(x, lambda = mu),
    `P(X <= x)` = ppois(x, lambda = mu),
    mu = factor(mu)
  )

library(ggplot2)
library(latex2exp)

plot0 <- pois_dists |>
  ggplot(
    aes(
      x = x,
      y = `P(X = x)`,
      fill = mu,
      col = mu
    )
  ) +
  theme(legend.position = "bottom") +
  labs(
    fill = latex2exp::TeX("$\\mu$"),
    col = latex2exp::TeX("$\\mu$"),
    y = latex2exp::TeX("$\\Pr_{\\mu}(X = x)$")
  )

plot1 <- plot0 +
  geom_segment(yend = 0) +
  facet_wrap(~mu)

print(plot1)
Figure 2: Poisson PMFs, by mean parameter \(\mu\)

Show R code
library(ggplot2)

plot2 <-
  plot0 +
  geom_step(alpha = 0.75) +
  aes(y = `P(X <= x)`) +
  labs(y = latex2exp::TeX("$\\Pr_{\\mu}(X \\leq x)$"))

print(plot2)
Figure 3: Poisson CDFs

Exercise 6 (Poisson distribution functions) Let \(X \sim \text{Pois}(\mu = 3.75)\).

Compute:

  • \(\text{P}(X = 4 | \mu = 3.75)\)
  • \(\text{P}(X \le 7 | \mu = 3.75)\)
  • \(\text{P}(X > 5 | \mu = 3.75)\)

Solution.

  • \(\text{P}(X=4) = 0.19378\)
  • \(\text{P}(X\le 7) = 0.962379\)
  • \(\text{P}(X > 5) = 0.177117\)
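These probabilities can be computed with R's `dpois()` and `ppois()` functions:

```r
mu <- 3.75

dpois(4, lambda = mu)                     # P(X = 4), approx 0.19378
ppois(7, lambda = mu)                     # P(X <= 7), approx 0.962379
ppois(5, lambda = mu, lower.tail = FALSE) # P(X > 5), approx 0.177117
```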

Theorem 9 (Properties of the Poisson distribution) If \(X \sim \text{Pois}(\mu)\), then:

  • \(\text{E}[X] = \mu\)
  • \(\text{Var}(X) = \mu\)
  • \(\text{P}(X=x) = \frac{\mu}{x} \text{P}(X = x-1)\)
  • For \(x < \mu\), \(\text{P}(X=x) > \text{P}(X = x-1)\)
  • For \(x = \mu\) (only possible when \(\mu\) is an integer), \(\text{P}(X=x) = \text{P}(X = x-1)\)
  • For \(x > \mu\), \(\text{P}(X=x) < \text{P}(X = x-1)\)
  • \(\arg \max_{x} \text{P}(X=x) = \left \lfloor{\mu}\right \rfloor\)
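These properties can be checked numerically in R (using an illustrative \(\mu = 3.75\)):

```r
mu <- 3.75
x <- 0:100 # truncated support; P(X > 100) is negligible here
pmf <- dpois(x, lambda = mu)

# E[X] = sum of x * P(X = x):
sum(x * pmf)                    # equals mu

# Var(X) = E[X^2] - (E[X])^2:
sum(x^2 * pmf) - sum(x * pmf)^2 # also equals mu

# recurrence: P(X = x) = (mu / x) * P(X = x - 1)
all.equal(dpois(4, mu), (mu / 4) * dpois(3, mu))
#> [1] TRUE

# the mode is floor(mu):
x[which.max(pmf)]
#> [1] 3
```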

Exercise 7 Prove Theorem 9.


Solution. \[ \begin{aligned} \text{E}[X] &= \sum_{x=0}^\infty x \cdot P(X=x)\\ &= 0 \cdot P(X=0) + \sum_{x=1}^\infty x \cdot P(X=x)\\ &= \sum_{x=1}^\infty x \cdot P(X=x)\\ &= \sum_{x=1}^\infty x \cdot \frac{\mu^x e^{-\mu}}{x!}\\ &= \sum_{x=1}^\infty x \cdot \frac{\mu^x e^{-\mu}}{x \cdot (x-1)!} & [\text{definition of the factorial ("!") function}]\\ &= \sum_{x=1}^\infty \frac{\mu^x e^{-\mu}}{(x-1)!}\\ &= \sum_{x=1}^\infty \frac{(\mu \cdot \mu^{x-1}) e^{-\mu}}{(x-1)!}\\ &= \mu \cdot \sum_{x=1}^\infty \frac{\mu^{x-1} e^{-\mu}}{(x-1)!}\\ &= \mu \cdot \sum_{y=0}^\infty \frac{\mu^{y} e^{-\mu}}{y!} &[\text{substituting } y \stackrel{\text{def}}{=}x-1]\\ &= \mu \cdot 1 &[\text{because PMFs sum to 1}]\\ &= \mu \end{aligned} \]

See also https://statproofbook.github.io/P/poiss-mean.

For the variance, see https://statproofbook.github.io/P/poiss-var.


Accounting for exposure

If the exposures/observation durations, denoted \(T=t\) or \(N=n\), vary between observations, we model:

\[\mu = \lambda\cdot t\]

\(\lambda\) is interpreted as the “expected event rate per unit of exposure”; that is,

\[\lambda = \frac{\text{E}[Y|T=t]}{t}\]

Important

The exposure magnitude, \(T\), is similar to a covariate in linear or logistic regression. However, there is an important difference: in count regression, there is no intercept corresponding to \(\text{E}[Y|T=0]\). In other words, this model assumes that if there is no exposure, there can’t be any events.

Theorem 10 If \(\mu = \lambda\cdot t\), then:

\[\log{\mu} = \log{\lambda} + \log{t}\]

Definition 9 (Offset) When the linear component of a model involves a term without an unknown coefficient, that term is called an offset.
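As a sketch of how an offset enters a Poisson regression in R: the code below simulates counts with varying exposures and recovers the rate parameters with `glm()`. The sample size and coefficient values are invented for illustration.

```r
set.seed(1)
n <- 2000
x <- rbinom(n, size = 1, prob = 0.5) # a binary covariate
t <- runif(n, min = 0.5, max = 1.5)  # exposure magnitudes
lambda <- exp(-1 + 0.5 * x)          # event rate per unit of exposure
y <- rpois(n, lambda * t)            # mean: mu = lambda * t

# log(t) enters the linear predictor as an offset (fixed coefficient of 1):
fit <- glm(y ~ x + offset(log(t)), family = poisson)
coef(fit) # approximately (-1, 0.5)
```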


Theorem 11 If \(X\) and \(Y\) are independent Poisson random variables with means \(\mu_X\) and \(\mu_Y\), their sum, \(Z=X+Y\), is also a Poisson random variable, with mean \(\mu_Z = \mu_X + \mu_Y\).
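Theorem 11 can be checked numerically in R by convolving two Poisson PMFs (the parameter values are illustrative):

```r
mu_x <- 3
mu_y <- 5
z <- 4

# P(Z = z) by convolution: sum over k of P(X = k) * P(Y = z - k)
p_conv <- sum(dpois(0:z, mu_x) * dpois(z - (0:z), mu_y))

# compare with the Pois(mu_x + mu_y) PMF:
all.equal(p_conv, dpois(z, mu_x + mu_y))
#> [1] TRUE
```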



3.3 The Negative-Binomial distribution

Definition 10 (Negative binomial distribution) \[ \text{P}(Y=y) = \frac{\mu^y}{y!} \cdot \frac{\Gamma(\rho + y)}{\Gamma(\rho) \cdot (\rho + \mu)^y} \cdot \left(1+\frac{\mu}{\rho}\right)^{-\rho} \]

where \(\rho\) is an overdispersion parameter and \(\Gamma(x) = (x-1)!\) for positive integers \(x\).

You don’t need to memorize or understand this expression.

As \(\rho \rightarrow \infty\), the second term converges to 1 and the third term converges to \(\text{exp}{\left\{-\mu\right\}}\), which brings us back to the Poisson distribution.
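This parameterization corresponds to R's `dnbinom()` with `size` \(= \rho\) and `mu` \(= \mu\); both the PMF formula and the Poisson limit can be checked numerically (the parameter values are illustrative):

```r
mu <- 2
rho <- 3
y <- 4

# the PMF formula from Definition 10:
p_formula <- mu^y / factorial(y) *
  gamma(rho + y) / (gamma(rho) * (rho + mu)^y) *
  (1 + mu / rho)^(-rho)

all.equal(p_formula, dnbinom(y, size = rho, mu = mu))
#> [1] TRUE

# as rho -> infinity, the PMF approaches the Poisson PMF:
dnbinom(y, size = 1e8, mu = mu) - dpois(y, mu) # approximately 0
```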


Theorem 12 If \(Y \sim \text{NegBin}(\mu, \rho)\), then:

  • \(\text{E}[Y] = \mu\)
  • \(\text{Var}{\left(Y\right)} = \mu + \frac{\mu^2}{\rho} > \mu\)

3.4 Weibull Distribution

\[ \begin{aligned} p(t)&= \alpha\lambda t^{\alpha-1}\text{e}^{-\lambda t^\alpha}\\ {\lambda}(t)&=\alpha\lambda t^{\alpha-1}\\ \text{S}(t)&=\text{e}^{-\lambda t^\alpha}\\ \text{E}(T)&= \Gamma(1+1/\alpha)\cdot \lambda^{-1/\alpha} \end{aligned} \]

When \(\alpha=1\), this is the exponential distribution. When \(\alpha>1\), the hazard is increasing, and when \(\alpha < 1\), the hazard is decreasing; this provides more flexibility than the exponential distribution. (Here \(\lambda\) without an argument denotes a parameter of the distribution, distinct from the hazard function \({\lambda}(t)\).)

We will see more of this distribution later.
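These formulas map onto R's built-in Weibull functions via `shape` \(= \alpha\) and `scale` \(= \lambda^{-1/\alpha}\); a quick numeric check (with invented parameter values):

```r
alpha <- 2    # shape parameter
lambda <- 0.5 # the parameter lambda as used above
t <- 1.3

# survival function: S(t) = exp(-lambda * t^alpha)
s_formula <- exp(-lambda * t^alpha)
s_builtin <- pweibull(t, shape = alpha, scale = lambda^(-1 / alpha),
                      lower.tail = FALSE)
all.equal(s_formula, s_builtin)
#> [1] TRUE

# E(T) = Gamma(1 + 1/alpha) * lambda^(-1/alpha)
gamma(1 + 1 / alpha) * lambda^(-1 / alpha)
```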

4 Characteristics of probability distributions

4.1 Probability density function

Definition 11 (probability density) If \(X\) is a continuous random variable, then the probability density of \(X\) at value \(x\), denoted \(f(x)\), \(f_X(x)\), \(\text{p}(x)\), \(\text{p}_X(x)\), or \(\text{p}(X=x)\), is defined as the limit of the probability (mass) that \(X\) is in an interval around \(x\), divided by the width of that interval, as that width reduces to 0.

\[ \begin{aligned} f(x) &\stackrel{\text{def}}{=}\lim_{\Delta \rightarrow 0} \frac{\text{P}(X \in [x, x + \Delta])}{\Delta} \end{aligned} \]

See also Rothman et al. (2021) (Chapter 22, p. 535) and https://en.wikipedia.org/wiki/Probability_density_function#Formal_definition


Theorem 13 (Density function is derivative of CDF) The density function \(f(t)\) or \(\text{p}(T=t)\) for a random variable \(T\) at value \(t\) is equal to the derivative of the cumulative probability function \(F(t) \stackrel{\text{def}}{=}P(T\le t)\); that is:

\[f(t) \stackrel{\text{def}}{=}\frac{\partial}{\partial t} F(t)\]
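Theorem 13 can be checked numerically with a finite-difference approximation, using the standard normal distribution as an example:

```r
t <- 0.7
h <- 1e-6

# finite-difference derivative of the CDF:
deriv_approx <- (pnorm(t + h) - pnorm(t)) / h

# compare with the density function:
deriv_approx - dnorm(t) # approximately 0
```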


Theorem 14 (Density functions integrate to 1) For any density function \(f(x)\),

\[\int_{x \in \mathcal{R}(X)} f(x) dx = 1\]


4.2 Hazard function

Definition 12 (Hazard function, hazard rate, hazard rate function)  

The hazard function, hazard rate, hazard rate function, for a random variable \(T\) at value \(t\), typically denoted as \(\text{h}(t)\) 1 or \(\lambda(t)\), 2 is the conditional density of \(T\) at \(t\), given \(T \ge t\). That is:

\[{\lambda}(t) \stackrel{\text{def}}{=}\text{p}(T=t|T\ge t)\]

If \(T\) represents the time at which an event occurs, then \({\lambda}(t)\) is the probability that the event occurs at time \(t\), given that it has not occurred prior to time \(t\).


Table 4: Probability distribution functions

| Name | Symbols | Definition |
|---|---|---|
| Probability density function (PDF) | \(\text{f}(t), \text{p}(t)\) | \(\text{p}(T=t)\) |
| Cumulative distribution function (CDF) | \(\text{F}(t), \text{P}(t)\) | \(\text{P}(T\leq t)\) |
| Survival function | \(\text{S}(t), \bar{\text{F}}(t)\) | \(\text{P}(T > t)\) |
| Hazard function | \(\lambda(t), \text{h}(t)\) | \(\text{p}(T=t|T\ge t)\) |
| Cumulative hazard function | \(\Lambda(t), \text{H}(t)\) | \(\int_{u=-\infty}^t {\lambda}(u)du\) |
| Log-hazard function | \(\eta(t)\) | \(\text{log}{\left\{{\lambda}(t)\right\}}\) |

\[ \text{f}(t) \xleftarrow[\text{S}(t){\lambda}(t)]{-S'(t)} \text{S}(t) \xleftarrow[]{\text{exp}{\left\{-{\Lambda}(t)\right\}}} {\Lambda}(t) \xleftarrow[]{\int_{u=0}^t {\lambda}(u)du} {\lambda}(t) \xleftarrow[]{\text{exp}{\left\{\eta(t)\right\}}} \eta(t) \]

\[ \text{f}(t) \xrightarrow[\int_{u=t}^\infty \text{f}(u)du]{\text{f}(t)/{\lambda}(t)} \text{S}(t) \xrightarrow[-\log{\text{S}(t)}]{} {\Lambda}(t) \xrightarrow[{\Lambda}'(t)]{} {\lambda}(t) \xrightarrow[\text{log}{\left\{{\lambda}(t)\right\}}]{} \eta(t) \]


4.3 Expectation

Definition 13 (Expectation, expected value, population mean ) The expectation, expected value, or population mean of a continuous random variable \(X\), denoted \(\text{E}{\left[X\right]}\), \(\mu(X)\), or \(\mu_X\), is the weighted mean of \(X\)’s possible values, weighted by the probability density function of those values:

\[\text{E}{\left[X\right]} = \int_{x\in \mathcal{R}(X)} x \cdot \text{p}(X=x)dx\]

The expectation, expected value, or population mean of a discrete random variable \(X\), denoted \(\text{E}{\left[X\right]}\), \(\mu(X)\), or \(\mu_X\), is the mean of \(X\)’s possible values, weighted by the probability mass function of those values:

\[\text{E}{\left[X\right]} = \sum_{x \in \mathcal{R}(X)} x \cdot \text{P}(X=x)\]

(c.f. https://en.wikipedia.org/wiki/Expected_value)


Theorem 15 (Expectation of the Bernoulli distribution) The expectation of a Bernoulli random variable with parameter \(\pi\) is:

\[\text{E}{\left[X\right]} = \pi\]


Proof. \[ \begin{aligned} \text{E}{\left[X\right]} &= \sum_{x\in \mathcal{R}(X)} x \cdot\text{P}(X=x) \\&= \sum_{x\in {\left\{0,1\right\}}} x \cdot\text{P}(X=x) \\&= {\left(0 \cdot\text{P}(X=0)\right)} + {\left(1 \cdot\text{P}(X=1)\right)} \\&= {\left(0 \cdot(1-\pi)\right)} + {\left(1 \cdot\pi\right)} \\&= 0 + \pi \\&= \pi \end{aligned} \]


Theorem 16 (Expectation of time-to-event variables) If \(T\) is a non-negative random variable, then:

\[\text{E}{\left[T\right]} = \int_{t=0}^{\infty}\text{S}(t)dt\]
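Theorem 16 can be checked numerically in R with an exponential distribution, whose mean is the reciprocal of its rate:

```r
rate <- 2

# integrate the survival function S(t) = P(T > t) over [0, infinity):
mean_from_surv <- integrate(
  function(t) pexp(t, rate = rate, lower.tail = FALSE),
  lower = 0, upper = Inf
)$value

mean_from_surv # approximately 1 / rate = 0.5
```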


Theorem 17 (Law of the Unconscious Statistician (LOTUS)) For any function \(g\) of a discrete random variable \(X\):

\[\text{E}{\left[g(X)\right]} = \sum_{x \in \mathcal{R}(X)} g(x) \cdot\text{P}(X=x)\]


Proof. Let \(Y = g(X)\). By Definition 13 applied to \(Y\):

\[ \begin{aligned} \text{E}{\left[g(X)\right]} &= \text{E}{\left[Y\right]} \\&= \sum_{y \in \mathcal{R}(Y)} y \cdot\text{P}(Y=y) \\&= \sum_{y \in \mathcal{R}(Y)} y \cdot\text{P}(g(X)=y) \\&= \sum_{y \in \mathcal{R}(Y)} y \cdot\sum_{\substack{x \in \mathcal{R}(X) \\ g(x) = y}} \text{P}(X=x) \\&= \sum_{x \in \mathcal{R}(X)} g(x) \cdot\text{P}(X=x) \end{aligned} \]

where the last equality follows by rearranging the double sum, grouping each term \(x\) by its image \(y = g(x)\).


LOTUS says that to compute \(\text{E}{\left[g(X)\right]}\), we do not need to first find the distribution of \(g(X)\); we can compute the expectation directly using the distribution of \(X\).

For a continuous random variable \(X\) with density \(\text{p}(X=x)\), the analogous formula is:

\[\text{E}{\left[g(X)\right]} = \int_{x \in \mathcal{R}(X)} g(x) \cdot\text{p}(X=x)\, dx\]


Example 3 (Expected value of \(X^2\) for a Bernoulli variable) Let \(X \sim \text{Ber}(\pi)\). By LOTUS (Theorem 17):

\[ \begin{aligned} \text{E}{\left[X^2\right]} &= \sum_{x \in {\left\{0,1\right\}}} x^2 \cdot\text{P}(X=x) \\&= 0^2 \cdot\text{P}(X=0) + 1^2 \cdot\text{P}(X=1) \\&= 0^2 \cdot(1-\pi) + 1^2 \cdot\pi \\&= 0 + \pi \\&= \pi \end{aligned} \]


Definition 14 (Conditional expectation) Discrete case. Let \(X\) and \(Y\) be jointly distributed discrete random variables. The conditional probability mass function of \(Y\) given \(X = x\) (for values of \(x\) with \(\text{P}(X = x) > 0\)) is:

\[\text{P}(Y = y \mid X = x) \stackrel{\text{def}}{=}\frac{\text{P}(X = x,\, Y = y)}{\text{P}(X = x)}\]

The conditional expectation of \(Y\) given \(X = x\) is:

\[\text{E}{\left[Y \mid X = x\right]} \stackrel{\text{def}}{=}\sum_{y \in \mathcal{R}(Y)} y \cdot\text{P}(Y = y \mid X = x)\]

Continuous case. Let \(X\) and \(Y\) be jointly distributed continuous random variables with joint density \(\text{p}(X = x,\, Y = y)\) and marginal density \(\text{p}(X = x)\). The conditional probability density function of \(Y\) given \(X = x\) (for values of \(x\) with \(\text{p}(X = x) > 0\)) is:

\[\text{p}(Y = y \mid X = x) \stackrel{\text{def}}{=}\frac{\text{p}(X = x,\, Y = y)}{\text{p}(X = x)}\]

The conditional expectation of \(Y\) given \(X = x\) is:

\[\text{E}{\left[Y \mid X = x\right]} \stackrel{\text{def}}{=}\int_{y \in \mathcal{R}(Y)} y \cdot\text{p}(Y = y \mid X = x)\, dy\]

Conditional expectation function. The conditional expectation function \(\text{E}{\left[Y \mid X\right]}\) is the function (and hence random variable) of \(X\) obtained by evaluating \(\text{E}{\left[Y \mid X = x\right]}\) at \(X\); that is, \(\text{E}{\left[Y \mid X\right]} = g(X)\) where \(g(x) \stackrel{\text{def}}{=}\text{E}{\left[Y \mid X = x\right]}\).


Theorem 18 (Law of iterated expectations) For any two random variables \(X\) and \(Y\):

\[\text{E}{\left[Y\right]} = \text{E}{\left[\text{E}{\left[Y \mid X\right]}\right]}\]


Proof. Discrete case. When \(X\) and \(Y\) are discrete, applying Definition 13 to \(\text{E}{\left[\text{E}{\left[Y \mid X\right]}\right]}\) and then the law of total probability (Theorem 4) applied to the countable partition \(\{X = x : x \in \mathcal{R}(X)\}\):

\[ \begin{aligned} \text{E}{\left[\text{E}{\left[Y \mid X\right]}\right]} &= \sum_{x \in \mathcal{R}(X)} \text{E}{\left[Y \mid X=x\right]} \cdot\text{P}(X=x) \\&= \sum_{x \in \mathcal{R}(X)} {\left(\sum_{y \in \mathcal{R}(Y)} y \cdot\text{P}(Y=y \mid X=x)\right)} \cdot\text{P}(X=x) \\&= \sum_{y \in \mathcal{R}(Y)} y \cdot\sum_{x \in \mathcal{R}(X)} \text{P}(Y=y \mid X=x) \cdot\text{P}(X=x) \\&= \sum_{y \in \mathcal{R}(Y)} y \cdot\text{P}(Y=y) \\&= \text{E}{\left[Y\right]} \end{aligned} \]

Continuous case. When \(X\) and \(Y\) are continuous, applying Definition 13 to \(\text{E}{\left[\text{E}{\left[Y \mid X\right]}\right]}\) and then using Definition 14 for \(\text{E}{\left[Y \mid X=x\right]}\):

\[ \begin{aligned} \text{E}{\left[\text{E}{\left[Y \mid X\right]}\right]} &= \int_{x \in \mathcal{R}(X)} \text{E}{\left[Y \mid X=x\right]} \cdot\text{p}(X=x)\, dx \\&= \int_{x \in \mathcal{R}(X)} {\left(\int_{y \in \mathcal{R}(Y)} y \cdot\text{p}(Y=y \mid X=x)\, dy\right)} \cdot\text{p}(X=x)\, dx \\&= \int_{y \in \mathcal{R}(Y)} y \cdot{\left(\int_{x \in \mathcal{R}(X)} \text{p}(Y=y \mid X=x) \cdot\text{p}(X=x)\, dx\right)}\, dy \\&= \int_{y \in \mathcal{R}(Y)} y \cdot\text{p}(Y=y)\, dy \\&= \text{E}{\left[Y\right]} \end{aligned} \]

where the third equality exchanges the order of integration by Fubini’s theorem, and the fourth equality uses \(\int_{x} \text{p}(Y=y \mid X=x) \cdot\text{p}(X=x)\, dx = \int_{x} \text{p}(X=x, Y=y)\, dx = \text{p}(Y=y)\) (marginalization of the joint density).


Example 4 (Marginal expectation from conditional expectations) Suppose \(X\) is a binary random variable indicating treatment assignment (\(X=1\) treated, \(X=0\) control), with \(\text{P}(X=1) = 0.5\), and suppose the outcome \(Y\) has conditional expectations:

\[\text{E}{\left[Y \mid X=1\right]} = 10, \quad \text{E}{\left[Y \mid X=0\right]} = 6\]

By the law of iterated expectations (Theorem 18):

\[ \begin{aligned} \text{E}{\left[Y\right]} &= \text{E}{\left[\text{E}{\left[Y \mid X\right]}\right]} \\&= \text{E}{\left[Y \mid X=1\right]} \cdot\text{P}(X=1) + \text{E}{\left[Y \mid X=0\right]} \cdot\text{P}(X=0) \\&= 10 \cdot 0.5 + 6 \cdot 0.5 \\&= 5 + 3 \\&= 8 \end{aligned} \]


Definition 15 (Expectation of a random matrix) For a random matrix \(\mathbf{A}\) of size \(m \times n\) with \((i,j)\)-th element \(A_{ij}\), the expectation \(\text{E}\mathbf{A}\) is the \(m \times n\) matrix whose \((i,j)\)-th element is \(\text{E}{\left[A_{ij}\right]}\):

\[ \text{E}\mathbf{A} \stackrel{\text{def}}{=}\begin{pmatrix} \text{E}{\left[A_{11}\right]} & \text{E}{\left[A_{12}\right]} & \cdots & \text{E}{\left[A_{1n}\right]} \\ \text{E}{\left[A_{21}\right]} & \text{E}{\left[A_{22}\right]} & \cdots & \text{E}{\left[A_{2n}\right]} \\ \vdots & \vdots & \ddots & \vdots \\ \text{E}{\left[A_{m1}\right]} & \text{E}{\left[A_{m2}\right]} & \cdots & \text{E}{\left[A_{mn}\right]} \end{pmatrix} \]

In other words, expectation is applied element-wise to a random matrix.


4.5 The Central Limit Theorem

The sum of many independent (or nearly independent) random variables, each contributing only a small share of the total variance, produces an approximately bell-shaped (Gaussian) distribution.

For example, consider the sum of five dice (Figure 4).

Show R code
library(dplyr)
dist = 
  expand.grid(1:6, 1:6, 1:6, 1:6, 1:6) |> 
  rowwise() |>
  mutate(total = sum(c_across(everything()))) |> 
  ungroup() |> 
  count(total) |> 
  mutate(`p(X=x)` = n/sum(n))

library(ggplot2)

dist |> 
  ggplot() +
  aes(x = total, y = `p(X=x)`) +
  geom_col() +
  xlab("sum of dice (x)") +
  ylab("Probability of outcome, Pr(X=x)") +
  expand_limits(y = 0)

  
  
Figure 4: Distribution of the sum of five dice

In comparison, the outcome of just one die is not bell-shaped (Figure 5).

Show R code
library(dplyr)
dist = 
  expand.grid(1:6) |> 
  rowwise() |>
  mutate(total = sum(c_across(everything()))) |> 
  ungroup() |> 
  count(total) |> 
  mutate(`p(X=x)` = n/sum(n))

library(ggplot2)

dist |> 
  ggplot() +
  aes(x = total, y = `p(X=x)`) +
  geom_col() +
  xlab("outcome of one die (x)") +
  ylab("Probability of outcome, Pr(X=x)") +
  expand_limits(y = 0)

  
  
Figure 5: Distribution of the outcome of one die

What distribution does a single die have?

Answer: discrete uniform on 1:6.

5 Additional resources


References

Dobson, Annette J, and Adrian G Barnett. 2018. An Introduction to Generalized Linear Models. 4th ed. CRC press. https://doi.org/10.1201/9781315182780.
Kalbfleisch, John D, and Ross L Prentice. 2011. The Statistical Analysis of Failure Time Data. John Wiley & Sons.
Klein, John P, and Melvin L Moeschberger. 2003. Survival Analysis: Techniques for Censored and Truncated Data. Vol. 1230. Springer. https://link.springer.com/book/10.1007/b97377.
Kleinbaum, David G, and Mitchel Klein. 2012. Survival Analysis: A Self-Learning Text. 3rd ed. Springer. https://link.springer.com/book/10.1007/978-1-4419-6646-9.
Miller, Steven J. 2017. The Probability Lifesaver : All the Tools You Need to Understand Chance. A Princeton Lifesaver Study Guide. Princeton University Press. https://press.princeton.edu/books/hardcover/9780691149547/the-probability-lifesaver.
Rothman, Kenneth J., Timothy L. Lash, Tyler J. VanderWeele, and Sebastien Haneuse. 2021. Modern Epidemiology. Fourth edition. Wolters Kluwer.
Vittinghoff, Eric, David V Glidden, Stephen C Shiboski, and Charles E McCulloch. 2012. Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models. 2nd ed. Springer. https://doi.org/10.1007/978-1-4614-1353-0.

Footnotes

  1. for example in Dobson and Barnett (2018), Vittinghoff et al. (2012), Klein and Moeschberger (2003), and Kleinbaum and Klein (2012)

  2. for example, in Rothman et al. (2021) and Kalbfleisch and Prentice (2011)