Probability

Configuring R

Functions from these packages will be used throughout this document:

[R code]
library(conflicted) # check for conflicting function definitions
# library(printr) # inserts help-file output into markdown output
library(rmarkdown) # Convert R Markdown documents into a variety of formats.
library(pander) # format tables for markdown
library(ggplot2) # graphics
library(ggfortify) # help with graphics
library(dplyr) # manipulate data
library(tibble) # `tibble`s extend `data.frame`s
library(magrittr) # `%>%` and other additional piping tools
library(haven) # import Stata files
library(knitr) # format R output for markdown
library(tidyr) # Tools to help to create tidy data
library(plotly) # interactive graphics
library(dobson) # datasets from Dobson and Barnett 2018
library(parameters) # format model output tables for markdown
library(latex2exp) # use LaTeX in R code (for figures and tables)
library(fs) # filesystem path manipulations
library(survival) # survival analysis
library(survminer) # survival analysis graphics
library(KMsurv) # datasets from Klein and Moeschberger
library(webshot2) # convert interactive content to static for pdf
library(forcats) # functions for categorical variables ("factors")
library(stringr) # functions for dealing with strings
library(lubridate) # functions for dealing with dates and times
library(broom) # Summarizes key information about statistical objects in tidy tibbles
library(broom.helpers) # Provides suite of functions to work with regression model 'broom::tidy()' tibbles

Here are some R settings I use in this document:

[R code]
rm(list = ls()) # delete any data that's already loaded into R

conflicts_prefer(dplyr::filter) # use the filter() function from dplyr by default
ggplot2::theme_set(
  ggplot2::theme_bw() + 
        # ggplot2::labs(col = "") +
    ggplot2::theme(
      legend.position = "bottom",
      text = ggplot2::element_text(size = 12, family = "serif")))

knitr::opts_chunk$set(message = FALSE)
options('digits' = 6)

panderOptions("big.mark", ",")
pander::panderOptions("table.emphasize.rownames", FALSE)
pander::panderOptions("table.split.table", Inf)
legend_text_size = 9
run_graphs = TRUE

Most of the content in this chapter should be review from UC Davis Epi 202.

1 Core properties of probabilities

1.1 Defining probabilities

Definition 1 (Probability measure) A probability measure, often denoted \(\Pr()\) or \(\text{P}()\), is a function whose domain is a \(\sigma\)-algebra \(\mathscr{S}\) of events (sets of possible outcomes) and which satisfies the following properties:

  1. For any statistical event \(A \in \mathscr{S}\), \(\Pr(A) \ge 0\).

  2. The probability of the union of all outcomes (\(\Omega \stackrel{\text{def}}{=}\cup \mathscr{S}\)) is 1:

\[\Pr(\Omega) = 1\]

  3. The probability of the union of countably many mutually disjoint events \(A_1, A_2, \ldots\) (where \(A_i \cap A_j = \emptyset\) for all \(i \neq j\)) is equal to the sum of their probabilities (countable additivity or sigma-additivity):

\[\Pr\!\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \Pr(A_i)\]

Theorem 1 If \(A\) and \(B\) are statistical events and \(A\subseteq B\), then \(\Pr(A \cap B) = \Pr(A)\).

Proof. Since \(A \subseteq B\), we have \(A \cap B = A\), and therefore \(\Pr(A \cap B) = \Pr(A)\).

Theorem 2 \[\Pr(A) + \Pr(\neg A) = 1\]

Proof. By properties 2 and 3 of Definition 1.

Corollary 1 \[\Pr(\neg A) = 1 - \Pr(A)\]

Proof. By Theorem 2 and algebra.

Corollary 2 If the probability of an outcome \(A\) is \(\Pr(A)=\pi\), then the probability that \(A\) does not occur is:

\[\Pr(\neg A)= 1 - \pi\]

Proof. Using Corollary 1:

\[ \begin{aligned} \Pr(\neg A) &= 1 - \Pr(A) \\ &= 1 - \pi \end{aligned} \]

1.2 Conditional probability

Definition 2 (Conditional probability) For two events \(A\) and \(B\) with \(\Pr(B) > 0\), the conditional probability of \(A\) given \(B\), denoted \(\Pr(A \mid B)\), is:

\[\Pr(A \mid B) \stackrel{\text{def}}{=}\frac{\Pr(A \cap B)}{\Pr(B)}\]

Theorem 3 (Law of conditional probability) For any two events \(A\) and \(B\) with \(\Pr(B) > 0\):

\[\Pr(A \cap B) = \Pr(A \mid B) \cdot\Pr(B)\]

Proof. Rearranging Definition 2:

\[ \begin{aligned} \Pr(A \mid B) &= \frac{\Pr(A \cap B)}{\Pr(B)} \\ \Pr(A \cap B) &= \Pr(A \mid B) \cdot\Pr(B) \end{aligned} \]

Example 1 (Applying the law of conditional probability) Suppose 30% of adults exercise regularly (\(\Pr(E) = 0.30\)), and among adults who exercise regularly, 60% have low blood pressure (\(\Pr(L \mid E) = 0.60\)).

Then the probability that a randomly selected adult both exercises regularly and has low blood pressure is:

\[ \begin{aligned} \Pr(L \cap E) &= \Pr(L \mid E) \cdot\Pr(E) \\&= 0.60 \cdot 0.30 \\&= 0.18 \end{aligned} \]
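This arithmetic can be checked directly in R; the two inputs are just the probabilities named above:

```r
p_E <- 0.30         # P(E): exercises regularly
p_L_given_E <- 0.60 # P(L | E): low blood pressure, given regular exercise

# law of conditional probability: P(L and E) = P(L | E) * P(E)
p_L_given_E * p_E
#> [1] 0.18
```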

Theorem 4 (Law of total probability) If \(B_1, B_2, \ldots\) is a countable partition of the sample space (i.e., countably many mutually exclusive events whose union is the entire sample space), then for any event \(A\):

\[\Pr(A) = \sum_{i=1}^{\infty} \Pr(A \mid B_i) \cdot\Pr(B_i)\]

Proof. Since \(B_1, B_2, \ldots\) partition the sample space, the events \(A \cap B_1, A \cap B_2, \ldots\) are mutually exclusive and their union is \(A\). By property 3 of Definition 1 (countable additivity), and then by Theorem 3:

\[ \begin{aligned} \Pr(A) &= \sum_{i=1}^{\infty} \Pr(A \cap B_i) \\&= \sum_{i=1}^{\infty} \Pr(A \mid B_i) \cdot\Pr(B_i) \end{aligned} \]

Theorem 5 (Bayes’ theorem) For any two events \(A\) and \(B\) with \(\Pr(A) > 0\) and \(\Pr(B) > 0\):

\[\Pr(A \mid B) = \frac{\Pr(B \mid A) \cdot\Pr(A)}{\Pr(B)}\]

Proof. Apply Definition 2 to both \(\Pr(A \mid B)\) and \(\Pr(B \mid A)\):

\[ \begin{aligned} \Pr(A \mid B) &= \frac{\Pr(A \cap B)}{\Pr(B)} \\&= \frac{\Pr(B \mid A) \cdot\Pr(A)}{\Pr(B)} \end{aligned} \]

The second equality follows from Theorem 3 applied to \(\Pr(B \cap A) = \Pr(B \mid A) \cdot\Pr(A)\).

Example 2 (Positive predictive value of a medical test) Suppose a disease test has 99% sensitivity and 99% specificity, and the prevalence of the disease in the population is 7%.

Let \(D\) be the event “person has the disease” and \(+\) be the event “test is positive”. Then:

  • \(\Pr(+ \mid D) = 0.99\) (sensitivity)
  • \(\Pr(\neg + \mid \neg D) = 0.99\) (specificity), so the false positive rate is \(\Pr(+ \mid \neg D) = 1 - 0.99 = 0.01\)
  • \(\Pr(D) = 0.07\) (prevalence)

By Bayes’ theorem (Theorem 5) and the law of total probability (Theorem 4):

\[ \begin{aligned} \Pr(D \mid +) &= \frac{\Pr(+ \mid D) \cdot\Pr(D)}{\Pr(+)} \\&= \frac{\Pr(+ \mid D) \cdot\Pr(D)}{\Pr(+ \mid D) \cdot\Pr(D) + \Pr(+ \mid \neg D) \cdot\Pr(\neg D)} \\&= \frac{0.99 \cdot 0.07}{0.99 \cdot 0.07 + 0.01 \cdot 0.93} \\&= \frac{0.0693}{0.0693 + 0.0093} \\&= \frac{0.0693}{0.0786} \\&\approx 0.88 \end{aligned} \]

Even with a highly accurate test (99% sensitive and 99% specific), only about 88% of people who test positive actually have the disease, because the disease prevalence is relatively low (7%).
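The same calculation can be carried out in R, with `sens`, `spec`, and `prev` standing for the three inputs above:

```r
sens <- 0.99 # P(+ | D), sensitivity
spec <- 0.99 # P(- | not D), specificity
prev <- 0.07 # P(D), prevalence

# law of total probability: P(+)
p_pos <- sens * prev + (1 - spec) * (1 - prev)

# Bayes' theorem: P(D | +), the positive predictive value
sens * prev / p_pos
```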

2 Key probability distributions

Table 1: Distributions typically used for outcome models

| Distribution      | Uses                                                                                          |
|-------------------|-----------------------------------------------------------------------------------------------|
| Bernoulli         | Binary outcomes                                                                               |
| Binomial          | Sums of Bernoulli outcomes                                                                    |
| Poisson           | Unbounded count outcomes                                                                      |
| Geometric         | Counts of non-events before an event occurs                                                   |
| Negative binomial | Mixtures of Poisson distributions; counts of non-events until a given number of events occurs |
| Normal (Gaussian) | Continuous outcomes without a more specific distribution                                      |
| Exponential       | Time-to-event outcomes                                                                        |
| Gamma             | Time-to-event outcomes                                                                        |
| Weibull           | Time-to-event outcomes                                                                        |
| Log-normal        | Time-to-event outcomes                                                                        |

Table 2: Distributions typically used for test statistics

| Distribution            | Uses                                                                                             |
|-------------------------|--------------------------------------------------------------------------------------------------|
| \(\chi^2\)              | Regression comparisons (asymptotic); contingency table independence tests; goodness-of-fit tests |
| \(F\)                   | Gaussian model comparisons (exact)                                                               |
| \(Z\) (standard normal) | Proportions, means, regression coefficients (asymptotic)                                         |
| \(T\)                   | Means, regression coefficients in Gaussian outcome models (exact)                                |

2.1 The Bernoulli distribution

Definition 3 (Bernoulli distribution) The Bernoulli distribution family for a random variable \(X\) is defined as:

\[ \begin{aligned} \Pr(X=x) &= \text{1}_{x\in {\left\{0,1\right\}}}\pi^x(1-\pi)^{1-x}\\ &= \begin{cases} \pi, & x=1\\ 1-\pi, & x=0 \end{cases} \end{aligned} \]
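In R, the Bernoulli PMF is available as the binomial PMF with `size = 1`; here `p` is an example value of the parameter \(\pi\):

```r
p <- 0.3 # example value of the Bernoulli parameter pi

# P(X = 0) and P(X = 1) for X ~ Ber(p):
dbinom(0:1, size = 1, prob = p)
#> [1] 0.7 0.3
```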

2.2 The Poisson distribution

Figure 1: “Les Poissons”
Siméon Denis Poisson

Exercise 1 Define the Poisson distribution.

Solution 1.

Definition 4 (Poisson distribution) \[\text{P}(Y = y) = \frac{\mu^{y} e^{-\mu}}{y!}, \quad y \in \mathbb{N} \tag{1}\]

Exercise 2 What is the range of possible values for a Poisson distribution?

Solution 2. \[\mathcal{R}(Y) = {\left\{0, 1, 2, \ldots\right\}} = \mathbb{N}\]

Theorem 6 (CDF of Poisson distribution) \[\text{P}(Y \le y) = e^{-\mu} \sum_{j=0}^{\left \lfloor{y}\right \rfloor}\frac{\mu^j}{j!} \tag{2}\]
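Equation 2 can be checked against R's built-in `ppois()`; the values of `mu` and `y` below are arbitrary:

```r
mu <- 2
y <- 3.7 # the CDF is a step function, so only the floor(y) = 3 terms contribute

# right-hand side of Equation 2:
manual <- exp(-mu) * sum(mu^(0:floor(y)) / factorial(0:floor(y)))

# built-in CDF:
builtin <- ppois(y, lambda = mu)

all.equal(manual, builtin)
#> [1] TRUE
```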

[R code]
library(dplyr)
pois_dists <- tibble(
  mu = c(0.5, 1, 2, 5, 10, 20)
) |>
  reframe(
    .by = mu,
    x = 0:30
  ) |>
  mutate(
    `P(X = x)` = dpois(x, lambda = mu),
    `P(X <= x)` = ppois(x, lambda = mu),
    mu = factor(mu)
  )

library(ggplot2)
library(latex2exp)

plot0 <- pois_dists |>
  ggplot(
    aes(
      x = x,
      y = `P(X = x)`,
      fill = mu,
      col = mu
    )
  ) +
  theme(legend.position = "bottom") +
  labs(
    fill = latex2exp::TeX("$\\mu$"),
    col = latex2exp::TeX("$\\mu$"),
    y = latex2exp::TeX("$\\Pr_{\\mu}(X = x)$")
  )

plot1 <- plot0 +
  geom_segment(yend = 0) +
  facet_wrap(~mu)

print(plot1)
Figure 2: Poisson PMFs, by mean parameter \(\mu\)
[R code]
library(ggplot2)

plot2 <-
  plot0 +
  geom_step(alpha = 0.75) +
  aes(y = `P(X <= x)`) +
  labs(y = latex2exp::TeX("$\\Pr_{\\mu}(X \\leq x)$"))

print(plot2)
Figure 3: Poisson CDFs

Exercise 3 (Poisson distribution functions) Let \(X \sim \text{Pois}(\mu = 3.75)\).

Compute:

  • \(\text{P}(X = 4 | \mu = 3.75)\)
  • \(\text{P}(X \le 7 | \mu = 3.75)\)
  • \(\text{P}(X > 5 | \mu = 3.75)\)

Solution.

  • \(\text{P}(X=4) = 0.19378\)
  • \(\text{P}(X\le 7) = 0.962379\)
  • \(\text{P}(X > 5) = 0.177117\)
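These values come from R's `dpois()` and `ppois()` functions:

```r
dpois(4, lambda = 3.75)                     # P(X = 4)
ppois(7, lambda = 3.75)                     # P(X <= 7)
ppois(5, lambda = 3.75, lower.tail = FALSE) # P(X > 5)
```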

Theorem 7 (Properties of the Poisson distribution) If \(X \sim \text{Pois}(\mu)\), then:

  • \(\text{E}[X] = \mu\)
  • \(\text{Var}(X) = \mu\)
  • \(\text{P}(X=x) = \frac{\mu}{x} \text{P}(X = x-1)\) for \(x \ge 1\)
  • For \(x < \mu\), \(\text{P}(X=x) > \text{P}(X = x-1)\)
  • For \(x = \mu\), \(\text{P}(X=x) = \text{P}(X = x-1)\)
  • For \(x > \mu\), \(\text{P}(X=x) < \text{P}(X = x-1)\)
  • \(\arg \max_{x} \text{P}(X=x) = \left \lfloor{\mu}\right \rfloor\)
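The recursion and mode properties can be verified numerically; the value of `mu` below is arbitrary:

```r
mu <- 3.75
x <- 1:20

# recursion: P(X = x) = (mu / x) * P(X = x - 1), up to floating-point error
max(abs(dpois(x, mu) - (mu / x) * dpois(x - 1, mu)))

# the mode of the PMF is floor(mu); subtract 1 because the support starts at 0
which.max(dpois(0:20, mu)) - 1
```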

Exercise 4 Prove Theorem 7.

Solution. \[ \begin{aligned} \text{E}[X] &= \sum_{x=0}^\infty x \cdot P(X=x)\\ &= 0 \cdot P(X=0) + \sum_{x=1}^\infty x \cdot P(X=x)\\ &= \sum_{x=1}^\infty x \cdot P(X=x)\\ &= \sum_{x=1}^\infty x \cdot \frac{\mu^x e^{-\mu}}{x!}\\ &= \sum_{x=1}^\infty x \cdot \frac{\mu^x e^{-\mu}}{x \cdot (x-1)!} & [\text{definition of the factorial ("!") function}]\\ &= \sum_{x=1}^\infty \frac{\mu^x e^{-\mu}}{ (x-1)!}\\ &= \sum_{x=1}^\infty \frac{(\mu \cdot \mu^{x-1}) e^{-\mu}}{ (x-1)!}\\ &= \mu \cdot \sum_{x=1}^\infty \frac{\mu^{x-1} e^{-\mu}}{ (x-1)!}\\ &= \mu \cdot \sum_{y=0}^\infty \frac{\mu^{y} e^{-\mu}}{ y!} &[\text{substituting } y \stackrel{\text{def}}{=}x-1]\\ &= \mu \cdot 1 &[\text{because PMFs sum to 1}]\\ &= \mu \end{aligned} \]

See also https://statproofbook.github.io/P/poiss-mean.

For the variance, see https://statproofbook.github.io/P/poiss-var.

Accounting for exposure

Definition 5 (Exposure magnitude) For many count outcomes, there is some sense of an exposure magnitude, such as population size, or duration of observation, which multiplicatively rescales the expected (mean) count.

Exercise 5 What are some examples of exposure magnitudes?

Solution.

Table 3: Examples of exposure units

| Outcome            | Exposure units                              |
|--------------------|---------------------------------------------|
| disease incidence  | number of individuals exposed; time at risk |
| car accidents      | miles driven                                |
| worksite accidents | person-hours worked                         |
| population size    | size of habitat                             |

Definition 6 (Event rate)  

\[\mu \stackrel{\text{def}}{=}\text{E}[Y|T=t]\]

\[\lambda \stackrel{\text{def}}{=}\frac{\mu}{t} \tag{3}\]

Theorem 8 (Transformation function from event rate to mean) For a count variable with mean \(\mu\), event rate \(\lambda\), and exposure magnitude \(t\):

\[\mu = \lambda \cdot t \tag{4}\]

Proof. Start from the definition of the event rate (Equation 3) and use algebra to solve for \(\mu\).

Equation 4 is analogous to the inverse-odds function for binary variables.

Theorem 9 When the exposure magnitude is 0, there is no opportunity for events to occur:

\[\text{E}[Y|T=0] = 0\]

Proof. \[\text{E}[Y|T=0] = \lambda \cdot 0 = 0\]

Important

The exposure magnitude, \(T\), is similar to a covariate in linear or logistic regression. However, there is an important difference: in count regression, there is no intercept corresponding to \(\text{E}[Y|T=0]\). In other words, this model assumes that if there is no exposure, there can’t be any events.

Theorem 10 If \(\mu = \lambda\cdot t\), then:

\[\log{\mu} = \log{\lambda} + \log{t}\]

Definition 7 (Offset) When the linear component of a model involves a term without an unknown coefficient, that term is called an offset.

Theorem 11 If \(X\) and \(Y\) are independent Poisson random variables with means \(\mu_X\) and \(\mu_Y\), their sum, \(Z=X+Y\), is also a Poisson random variable, with mean \(\mu_Z = \mu_X + \mu_Y\).
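A quick numerical check of Theorem 11, convolving two Poisson PMFs; the means and the evaluation point `z` are arbitrary:

```r
mu_x <- 2
mu_y <- 3
z <- 4

# P(Z = z) by convolution: sum over all ways to split z between X and Y
conv <- sum(dpois(0:z, mu_x) * dpois(z - (0:z), mu_y))

# P(Z = z) directly, using Z ~ Pois(mu_x + mu_y):
direct <- dpois(z, mu_x + mu_y)

all.equal(conv, direct)
#> [1] TRUE
```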

2.3 The Negative-Binomial distribution

Definition 8 (Negative binomial distribution) \[ \text{P}(Y=y) = \frac{\mu^y}{y!} \cdot \frac{\Gamma(\rho + y)}{\Gamma(\rho) \cdot (\rho + \mu)^y} \cdot \left(1+\frac{\mu}{\rho}\right)^{-\rho} \]

where \(\rho\) is an overdispersion parameter and \(\Gamma\) is the gamma function, which satisfies \(\Gamma(x) = (x-1)!\) for positive integers \(x\).

Theorem 12 If \(Y \sim \text{NegBin}(\mu, \rho)\), then:

  • \(\text{E}[Y] = \mu\)
  • \(\text{Var}{\left(Y\right)} = \mu + \frac{\mu^2}{\rho} > \mu\)
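In R's `dnbinom()`, the `mu` argument matches \(\mu\) and `size` matches the dispersion parameter \(\rho\); the moments in Theorem 12 can then be checked by weighted sums over the support (the parameter values below are arbitrary):

```r
mu <- 4
rho <- 2
x <- 0:1000 # wide enough that the truncated tail is negligible

p <- dnbinom(x, size = rho, mu = mu) # R's `size` plays the role of rho

sum(x * p)          # E[Y]: should be close to mu = 4
sum((x - mu)^2 * p) # Var(Y): should be close to mu + mu^2 / rho = 12
```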

2.4 Weibull Distribution

\[ \begin{aligned} \text{p}(t)&= \alpha\lambda t^{\alpha-1}\text{e}^{-\lambda t^\alpha}\\ {\lambda}(t)&=\alpha\lambda t^{\alpha-1}\\ \text{S}(t)&=\text{e}^{-\lambda t^\alpha}\\ \text{E}{\left[T\right]}&= \Gamma(1+1/\alpha)\cdot \lambda^{-1/\alpha} \end{aligned} \]

When \(\alpha=1\), this is the exponential distribution. When \(\alpha>1\), the hazard is increasing, and when \(\alpha < 1\), the hazard is decreasing. This provides more flexibility than the exponential distribution.
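Note that R's built-in `dweibull()`/`pweibull()` use a shape-scale parameterization, \(\text{S}(t) = \exp\{-(t/\sigma)^\alpha\}\), so the rate \(\lambda\) above corresponds to scale \(\sigma = \lambda^{-1/\alpha}\). A sketch of the mapping, with arbitrary parameter values:

```r
alpha <- 1.5
lambda <- 0.8
t <- 2

# survival function via R's shape-scale parameterization:
s_R <- pweibull(t, shape = alpha, scale = lambda^(-1 / alpha), lower.tail = FALSE)

# survival function from the rate parameterization above:
s_formula <- exp(-lambda * t^alpha)

all.equal(s_R, s_formula)
#> [1] TRUE
```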

We will see more of this distribution later.

3 Characteristics of probability distributions

3.1 Probability density function

Definition 9 (probability density) If \(X\) is a continuous random variable, then the probability density of \(X\) at value \(x\), denoted \(f(x)\), \(f_X(x)\), \(\text{p}(x)\), \(\text{p}_X(x)\), or \(\text{p}(X=x)\), is defined as the limit of the probability (mass) that \(X\) is in an interval around \(x\), divided by the width of that interval, as that width reduces to 0.

\[ \begin{aligned} f(x) &\stackrel{\text{def}}{=}\lim_{\Delta \rightarrow 0} \frac{\text{P}(X \in [x, x + \Delta])}{\Delta} \end{aligned} \]

Definition 10 (Cumulative distribution function (CDF)) For a random variable \(X\), its population CDF is \[F(t)=\Pr(X\le t), \quad t\in\mathbb{R}.\]

Definition 11 (Quantile function (population inverse CDF)) For a random variable \(X\) with cumulative distribution function (CDF) \(F\), its population quantile function (generalized inverse of \(F\)) is \[Q(p)=\inf\{t:F(t)\ge p\}, \quad 0<p\le 1.\]

Theorem 13 (Density function is derivative of CDF) The density function \(f(t)\) or \(\text{p}(T=t)\) for a random variable \(T\) at value \(t\) is equal to the derivative of the cumulative probability function \(F(t) \stackrel{\text{def}}{=}P(T\le t)\); that is:

\[f(t) \stackrel{\text{def}}{=}\frac{\partial}{\partial t} F(t)\]

Theorem 14 (Density functions integrate to 1) For any density function \(f(x)\),

\[\int_{x \in \mathcal{R}(X)} f(x) dx = 1\]

3.2 Hazard function

Definition 12 (Hazard function, hazard rate, hazard rate function)  

\[{\lambda}(t) \stackrel{\text{def}}{=}\text{p}(T=t|T\ge t)\]

Table 4: Probability distribution functions

| Name                                   | Symbols                                | Definition                                      |
|----------------------------------------|----------------------------------------|-------------------------------------------------|
| Probability density function (PDF)     | \(\text{f}(t), \text{p}(t)\)           | \(\text{p}(T=t)\)                               |
| Cumulative distribution function (CDF) | \(\text{F}(t), \text{P}(t)\)           | \(\text{P}(T\leq t)\)                           |
| Survival function                      | \(\text{S}(t), \bar{\text{F}}(t)\)     | \(\text{P}(T > t)\)                             |
| Hazard function                        | \(\lambda(t), \text{h}(t)\)            | \(\text{p}(T=t \mid T\ge t)\)                   |
| Cumulative hazard function             | \(\Lambda(t), \text{H}(t)\)            | \(\int_{u=-\infty}^t {\lambda}(u)du\)           |
| Log-hazard function                    | \(\eta(t)\)                            | \(\text{log}{\left\{{\lambda}(t)\right\}}\)     |

\[ \text{f}(t) \xleftarrow[\text{S}(t){\lambda}(t)]{-S'(t)} \text{S}(t) \xleftarrow[]{\text{exp}{\left\{-{\Lambda}(t)\right\}}} {\Lambda}(t) \xleftarrow[]{\int_{u=0}^t {\lambda}(u)du} {\lambda}(t) \xleftarrow[]{\text{exp}{\left\{\eta(t)\right\}}} \eta(t) \]

\[ \text{f}(t) \xrightarrow[\int_{u=t}^\infty \text{f}(u)du]{\text{f}(t)/{\lambda}(t)} \text{S}(t) \xrightarrow[-\log{\text{S}(t)}]{} {\Lambda}(t) \xrightarrow[{\Lambda}'(t)]{} {\lambda}(t) \xrightarrow[\text{log}{\left\{{\lambda}(t)\right\}}]{} \eta(t) \]
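These relationships can be spot-checked for the exponential distribution, whose hazard is constant and equal to its rate parameter; the rate and time points below are arbitrary:

```r
rate <- 0.5
t <- c(0.5, 1, 2, 5)

# hazard = density / survival; constant for the exponential distribution:
dexp(t, rate) / pexp(t, rate, lower.tail = FALSE)
```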

3.3 Expectation

Definition 13 (Expectation, expected value, population mean ) The expectation, expected value, or population mean of a continuous random variable \(X\), denoted \(\text{E}{\left[X\right]}\), \(\mu(X)\), or \(\mu_X\), is the weighted mean of \(X\)’s possible values, weighted by the probability density function of those values:

\[\text{E}{\left[X\right]} = \int_{x\in \mathcal{R}(X)} x \cdot \text{p}(X=x)dx\]

The expectation, expected value, or population mean of a discrete random variable \(X\), denoted \(\text{E}{\left[X\right]}\), \(\mu(X)\), or \(\mu_X\), is the mean of \(X\)’s possible values, weighted by the probability mass function of those values:

\[\text{E}{\left[X\right]} = \sum_{x \in \mathcal{R}(X)} x \cdot \text{P}(X=x)\]

(c.f. https://en.wikipedia.org/wiki/Expected_value)

Theorem 15 (Expectation of the Bernoulli distribution) The expectation of a Bernoulli random variable with parameter \(\pi\) is:

\[\text{E}{\left[X\right]} = \pi\]

Proof. \[ \begin{aligned} \text{E}{\left[X\right]} &= \sum_{x\in \mathcal{R}(X)} x \cdot\text{P}(X=x) \\&= \sum_{x\in {\left\{0,1\right\}}} x \cdot\text{P}(X=x) \\&= {\left(0 \cdot\text{P}(X=0)\right)} + {\left(1 \cdot\text{P}(X=1)\right)} \\&= {\left(0 \cdot(1-\pi)\right)} + {\left(1 \cdot\pi\right)} \\&= 0 + \pi \\&= \pi \end{aligned} \]

Theorem 16 (Expectation of time-to-event variables) If \(T\) is a non-negative random variable, then:

\[\text{E}{\left[T\right]} = \int_{t=0}^{\infty}\text{S}(t)\,dt\]
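For example, for an exponential distribution with rate 0.5 (so mean 2), numerically integrating the survival function recovers the mean:

```r
rate <- 0.5

# E[T] = integral of S(t) over t from 0 to infinity; should equal 1 / rate = 2
integrate(
  function(t) pexp(t, rate, lower.tail = FALSE),
  lower = 0, upper = Inf
)$value
```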

Theorem 17 (Law of the Unconscious Statistician (LOTUS)) For any function \(g\) of a discrete random variable \(X\):

\[\text{E}{\left[g(X)\right]} = \sum_{x \in \mathcal{R}(X)} g(x) \cdot\text{P}(X=x)\]

Proof. Let \(Y = g(X)\). By Definition 13 applied to \(Y\):

\[ \begin{aligned} \text{E}{\left[g(X)\right]} &= \text{E}{\left[Y\right]} \\&= \sum_{y \in \mathcal{R}(Y)} y \cdot\text{P}(Y=y) \\&= \sum_{y \in \mathcal{R}(Y)} y \cdot\text{P}(g(X)=y) \\&= \sum_{y \in \mathcal{R}(Y)} y \cdot\sum_{\substack{x \in \mathcal{R}(X) \\ g(x) = y}} \text{P}(X=x) \\&= \sum_{x \in \mathcal{R}(X)} g(x) \cdot\text{P}(X=x) \end{aligned} \]

where the last equality follows by rearranging the double sum, grouping each term \(x\) by its image \(y = g(x)\).

Example 3 (Expected value of \(X^2\) for a Bernoulli variable) Let \(X \sim \text{Ber}(\pi)\). By LOTUS (Theorem 17):

\[ \begin{aligned} \text{E}{\left[X^2\right]} &= \sum_{x \in {\left\{0,1\right\}}} x^2 \cdot\text{P}(X=x) \\&= 0^2 \cdot\text{P}(X=0) + 1^2 \cdot\text{P}(X=1) \\&= 0^2 \cdot(1-\pi) + 1^2 \cdot\pi \\&= 0 + \pi \\&= \pi \end{aligned} \]

Definition 14 (Conditional expectation) Discrete case. Let \(X\) and \(Y\) be jointly distributed discrete random variables. The conditional probability mass function of \(Y\) given \(X = x\) (for values of \(x\) with \(\text{P}(X = x) > 0\)) is:

\[\text{P}(Y = y \mid X = x) \stackrel{\text{def}}{=}\frac{\text{P}(X = x,\, Y = y)}{\text{P}(X = x)}\]

The conditional expectation of \(Y\) given \(X = x\) is:

\[\text{E}{\left[Y \mid X = x\right]} \stackrel{\text{def}}{=}\sum_{y \in \mathcal{R}(Y)} y \cdot\text{P}(Y = y \mid X = x)\]

Continuous case. Let \(X\) and \(Y\) be jointly distributed continuous random variables with joint density \(\text{p}(X = x,\, Y = y)\) and marginal density \(\text{p}(X = x)\). The conditional probability density function of \(Y\) given \(X = x\) (for values of \(x\) with \(\text{p}(X = x) > 0\)) is:

\[\text{p}(Y = y \mid X = x) \stackrel{\text{def}}{=}\frac{\text{p}(X = x,\, Y = y)}{\text{p}(X = x)}\]

The conditional expectation of \(Y\) given \(X = x\) is:

\[\text{E}{\left[Y \mid X = x\right]} \stackrel{\text{def}}{=}\int_{y \in \mathcal{R}(Y)} y \cdot\text{p}(Y = y \mid X = x)\, dy\]

Conditional expectation function. The conditional expectation function \(\text{E}{\left[Y \mid X\right]}\) is the function (and hence random variable) of \(X\) obtained by evaluating \(\text{E}{\left[Y \mid X = x\right]}\) at \(X\); that is, \(\text{E}{\left[Y \mid X\right]} = g(X)\) where \(g(x) \stackrel{\text{def}}{=}\text{E}{\left[Y \mid X = x\right]}\).

Theorem 18 (Law of iterated expectations) For any two random variables \(X\) and \(Y\):

\[\text{E}{\left[Y\right]} = \text{E}{\left[\text{E}{\left[Y \mid X\right]}\right]}\]

Proof. Discrete case. When \(X\) and \(Y\) are discrete, applying Definition 13 to \(\text{E}{\left[\text{E}{\left[Y \mid X\right]}\right]}\) and then the law of total probability (Theorem 4) applied to the countable partition \(\{X = x : x \in \mathcal{R}(X)\}\):

\[ \begin{aligned} \text{E}{\left[\text{E}{\left[Y \mid X\right]}\right]} &= \sum_{x \in \mathcal{R}(X)} \text{E}{\left[Y \mid X=x\right]} \cdot\text{P}(X=x) \\&= \sum_{x \in \mathcal{R}(X)} {\left(\sum_{y \in \mathcal{R}(Y)} y \cdot\text{P}(Y=y \mid X=x)\right)} \cdot\text{P}(X=x) \\&= \sum_{y \in \mathcal{R}(Y)} y \cdot\sum_{x \in \mathcal{R}(X)} \text{P}(Y=y \mid X=x) \cdot\text{P}(X=x) \\&= \sum_{y \in \mathcal{R}(Y)} y \cdot\text{P}(Y=y) \\&= \text{E}{\left[Y\right]} \end{aligned} \]

Continuous case. When \(X\) and \(Y\) are continuous, applying Definition 13 to \(\text{E}{\left[\text{E}{\left[Y \mid X\right]}\right]}\) and then using Definition 14 for \(\text{E}{\left[Y \mid X=x\right]}\):

\[ \begin{aligned} \text{E}{\left[\text{E}{\left[Y \mid X\right]}\right]} &= \int_{x \in \mathcal{R}(X)} \text{E}{\left[Y \mid X=x\right]} \cdot\text{p}(X=x)\, dx \\&= \int_{x \in \mathcal{R}(X)} {\left(\int_{y \in \mathcal{R}(Y)} y \cdot\text{p}(Y=y \mid X=x)\, dy\right)} \cdot\text{p}(X=x)\, dx \\&= \int_{y \in \mathcal{R}(Y)} y \cdot{\left(\int_{x \in \mathcal{R}(X)} \text{p}(Y=y \mid X=x) \cdot\text{p}(X=x)\, dx\right)}\, dy \\&= \int_{y \in \mathcal{R}(Y)} y \cdot\text{p}(Y=y)\, dy \\&= \text{E}{\left[Y\right]} \end{aligned} \]

where the third equality exchanges the order of integration by Fubini’s theorem, and the fourth equality uses \(\int_{x} \text{p}(Y=y \mid X=x) \cdot\text{p}(X=x)\, dx = \int_{x} \text{p}(X=x, Y=y)\, dx = \text{p}(Y=y)\) (marginalization of the joint density).

Theorem 19 (Conditional law of iterated expectations) For random variables \(X\), \(Y\), and \(Z\):

\[\text{E}{\left[Y \mid Z\right]} = \text{E}{\left[\text{E}{\left[Y \mid X,Z\right]} \mid Z\right]}\]

Proof. For each fixed value \(z\) with positive probability or density:

Discrete case. Conditioning on \(Z=z\), and applying the law of total probability to the partition \(\{X=x : x \in \mathcal{R}(X)\}\) under the conditional distribution given \(Z=z\):

\[ \begin{aligned} \text{E}{\left[\text{E}{\left[Y \mid X,Z\right]} \mid Z=z\right]} &= \sum_{x \in \mathcal{R}(X)} \text{E}{\left[Y \mid X=x,Z=z\right]} \cdot\text{P}(X=x \mid Z=z) \\&= \text{E}{\left[Y \mid Z=z\right]} \end{aligned} \]

Continuous case. Conditioning on \(Z=z\), and integrating over \(X\) under the conditional density \(\text{p}(X=x \mid Z=z)\):

\[ \begin{aligned} \text{E}{\left[\text{E}{\left[Y \mid X,Z\right]} \mid Z=z\right]} &= \int_{x \in \mathcal{R}(X)} \text{E}{\left[Y \mid X=x,Z=z\right]} \cdot\text{p}(X=x \mid Z=z)\, dx \\&= \text{E}{\left[Y \mid Z=z\right]} \end{aligned} \]

Therefore, as random variables of \(Z\), \(\text{E}{\left[Y \mid Z\right]} = \text{E}{\left[\text{E}{\left[Y \mid X,Z\right]} \mid Z\right]}\).

Example 4 (Marginal expectation from conditional expectations) Suppose \(X\) is a binary random variable indicating treatment assignment (\(X=1\) treated, \(X=0\) control), with \(\text{P}(X=1) = 0.5\), and suppose the outcome \(Y\) has conditional expectations:

\[\text{E}{\left[Y \mid X=1\right]} = 10, \quad \text{E}{\left[Y \mid X=0\right]} = 6\]

By the law of iterated expectations (Theorem 18):

\[ \begin{aligned} \text{E}{\left[Y\right]} &= \text{E}{\left[\text{E}{\left[Y \mid X\right]}\right]} \\&= \text{E}{\left[Y \mid X=1\right]} \cdot\text{P}(X=1) + \text{E}{\left[Y \mid X=0\right]} \cdot\text{P}(X=0) \\&= 10 \cdot 0.5 + 6 \cdot 0.5 \\&= 5 + 3 \\&= 8 \end{aligned} \]
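The same weighted average in R:

```r
p_x <- c(0.5, 0.5)      # P(X = 0), P(X = 1)
e_y_given_x <- c(6, 10) # E[Y | X = 0], E[Y | X = 1]

# law of iterated expectations: E[Y] = sum over x of E[Y | X = x] * P(X = x)
sum(e_y_given_x * p_x)
#> [1] 8
```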

Definition 15 (Expectation of a random matrix) For a random matrix \(\mathbf{A}\) of size \(m \times n\) with \((i,j)\)-th element \(A_{ij}\), the expectation \(\text{E}\mathbf{A}\) is the \(m \times n\) matrix whose \((i,j)\)-th element is \(\text{E}{\left[A_{ij}\right]}\):

\[ \text{E}\mathbf{A} \stackrel{\text{def}}{=}\begin{pmatrix} \text{E}{\left[A_{11}\right]} & \text{E}{\left[A_{12}\right]} & \cdots & \text{E}{\left[A_{1n}\right]} \\ \text{E}{\left[A_{21}\right]} & \text{E}{\left[A_{22}\right]} & \cdots & \text{E}{\left[A_{2n}\right]} \\ \vdots & \vdots & \ddots & \vdots \\ \text{E}{\left[A_{m1}\right]} & \text{E}{\left[A_{m2}\right]} & \cdots & \text{E}{\left[A_{mn}\right]} \end{pmatrix} \]

In other words, expectation is applied element-wise to a random matrix.

3.4 Deviation, error, and noise

Definition 16 (Deviation) A deviation is the difference between a value and a reference value. For any quantity \(z\) and reference value \(r\):

\[z - r\]

In probability and statistics, “deviation” often means deviation from a population mean. For a random variable \(Y\):

\[Y - \text{E}{\left[Y\right]}\]

Definition 17 (Deviation from a population or subpopulation mean) In probabilistic models, we call this quantity a deviation from a mean. It is often also called an error or noise term in other sources. For the random variable \(Y\), define the deviation from its mean as:

\[e(Y) \stackrel{\text{def}}{=}Y - \text{E}{\left[Y\right]}\]

For a realized observation \(y\): \[e(y) \stackrel{\text{def}}{=}y - \text{E}{\left[Y\right]}\]

In regression settings, the reference mean is often conditional on covariates: \(e(y_i) \stackrel{\text{def}}{=}y_i - \text{E}{\left[Y_i \mid X_i\right]}\).

In this course, we prefer “deviation” for this mean-deviation quantity. The terms “error” and “noise” are common aliases. We use “residual” (defined in the Linear regression chapter) for deviations from fitted values. For notation in this course, we use \(e(\cdot)\) for these model/data deviations, and reserve \(\varepsilon{\left(\cdot\right)}\) for estimator-to-estimand deviations (see Estimation).

Definition 18 (Variance) The variance of a random variable \(X\), denoted \(\text{Var}{\left(X\right)}\), is the expected squared deviation of \(X\) from its mean:

\[\text{Var}{\left(X\right)} \stackrel{\text{def}}{=}\text{E}{\left[{\left(X - \text{E}{\left[X\right]}\right)}^2\right]}\]

Theorem 20 (Simplified expression for variance) \[\text{Var}{\left(X\right)}=\text{E}{\left[X^2\right]} - {\left(\text{E}{\left[X\right]}\right)}^2\]


Proof. By linearity of expectation, we have:

\[ \begin{aligned} \text{Var}{\left(X\right)} &\stackrel{\text{def}}{=}\text{E}{\left[(X-\text{E}{\left[X\right]})^2\right]}\\ &=\text{E}{\left[X^2 - 2X\text{E}{\left[X\right]} + {\left(\text{E}{\left[X\right]}\right)}^2\right]}\\ &=\text{E}{\left[X^2\right]} - \text{E}{\left[2X\text{E}{\left[X\right]}\right]} + \text{E}{\left[{\left(\text{E}{\left[X\right]}\right)}^2\right]}\\ &=\text{E}{\left[X^2\right]} - 2\text{E}{\left[X\right]}\text{E}{\left[X\right]} + {\left(\text{E}{\left[X\right]}\right)}^2\\ &=\text{E}{\left[X^2\right]} - {\left(\text{E}{\left[X\right]}\right)}^2\\ \end{aligned} \]
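A numerical check of this shortcut formula, using the Poisson distribution (for which \(\text{Var}(X) = \mu\)); the value of `mu` is arbitrary:

```r
mu <- 3
x <- 0:100 # wide enough that the truncated tail is negligible
p <- dpois(x, mu)

EX  <- sum(x * p)   # E[X]
EX2 <- sum(x^2 * p) # E[X^2]

EX2 - EX^2 # should be close to Var(X) = mu = 3
```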

Theorem 21 (Law of total variance) For random variables \(X\) and \(Y\):

\[\text{Var}{\left(Y\right)} = \text{E}{\left[\text{Var}{\left(Y \mid X\right)}\right]} + \text{Var}{\left(\text{E}{\left[Y \mid X\right]}\right)}\]

where \(\text{Var}{\left(Y \mid X\right)} \stackrel{\text{def}}{=}\text{E}{\left[(Y-\text{E}{\left[Y \mid X\right]})^2 \mid X\right]}\).

Proof. Write \(Y-\text{E}{\left[Y\right]} = {\left(Y-\text{E}{\left[Y \mid X\right]}\right)} + {\left(\text{E}{\left[Y \mid X\right]}-\text{E}{\left[Y\right]}\right)}\). Then:

\[ \begin{aligned} {\left(Y-\text{E}{\left[Y\right]}\right)}^2 &= {\left(Y-\text{E}{\left[Y \mid X\right]}\right)}^2 + {\left(\text{E}{\left[Y \mid X\right]}-\text{E}{\left[Y\right]}\right)}^2 + 2{\left(Y-\text{E}{\left[Y \mid X\right]}\right)}{\left(\text{E}{\left[Y \mid X\right]}-\text{E}{\left[Y\right]}\right)} \end{aligned} \]

Taking expectation:

\[ \begin{aligned} \text{Var}{\left(Y\right)} &= \text{E}{\left[{\left(Y-\text{E}{\left[Y \mid X\right]}\right)}^2\right]} + \text{E}{\left[{\left(\text{E}{\left[Y \mid X\right]}-\text{E}{\left[Y\right]}\right)}^2\right]} \\&\quad + 2\text{E}{\left[{\left(Y-\text{E}{\left[Y \mid X\right]}\right)}{\left(\text{E}{\left[Y \mid X\right]}-\text{E}{\left[Y\right]}\right)}\right]} \end{aligned} \]

For the cross-term:

Discrete case.

\[ \begin{aligned} \text{E}{\left[{\left(Y-\text{E}{\left[Y \mid X\right]}\right)}{\left(\text{E}{\left[Y \mid X\right]}-\text{E}{\left[Y\right]}\right)}\right]} &= \sum_{x \in \mathcal{R}(X)} \text{E}{\left[ {\left(Y-\text{E}{\left[Y \mid X\right]}\right)} {\left(\text{E}{\left[Y \mid X\right]}-\text{E}{\left[Y\right]}\right)} \mid X=x \right]} \cdot\text{P}(X=x) \\&= \sum_{x \in \mathcal{R}(X)} {\left(\text{E}{\left[Y \mid X=x\right]}-\text{E}{\left[Y\right]}\right)} \cdot\text{E}{\left[Y-\text{E}{\left[Y \mid X=x\right]}\mid X=x\right]} \cdot\text{P}(X=x) \\&= 0 \end{aligned} \]

Continuous case.

\[ \begin{aligned} \text{E}{\left[{\left(Y-\text{E}{\left[Y \mid X\right]}\right)}{\left(\text{E}{\left[Y \mid X\right]}-\text{E}{\left[Y\right]}\right)}\right]} &= \int_{x \in \mathcal{R}(X)} \text{E}{\left[ {\left(Y-\text{E}{\left[Y \mid X\right]}\right)} {\left(\text{E}{\left[Y \mid X\right]}-\text{E}{\left[Y\right]}\right)} \mid X=x \right]} \cdot\text{p}(X=x)\, dx \\&= \int_{x \in \mathcal{R}(X)} {\left(\text{E}{\left[Y \mid X=x\right]}-\text{E}{\left[Y\right]}\right)} \cdot\text{E}{\left[Y-\text{E}{\left[Y \mid X=x\right]}\mid X=x\right]} \cdot\text{p}(X=x)\, dx \\&= 0 \end{aligned} \]

Therefore:

\[ \begin{aligned} \text{Var}{\left(Y\right)} &= \text{E}{\left[{\left(Y-\text{E}{\left[Y \mid X\right]}\right)}^2\right]} + \text{E}{\left[{\left(\text{E}{\left[Y \mid X\right]}-\text{E}{\left[Y\right]}\right)}^2\right]} \\&= \text{E}{\left[\text{Var}{\left(Y \mid X\right)}\right]} + \text{Var}{\left(\text{E}{\left[Y \mid X\right]}\right)} \end{aligned} \]
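As a worked check of the law of total variance, consider a hypothetical discrete example: \(X \sim \text{Ber}(0.5)\), with \(Y \mid X=0\) uniform on \(\{4, 8\}\) and \(Y \mid X=1\) uniform on \(\{7, 13\}\). Both sides of the decomposition can be computed exactly:

```r
# joint distribution: each of the four (x, y) pairs has probability 0.25
y <- c(4, 8, 7, 13)
p_y <- rep(0.25, 4)

# direct calculation of Var(Y) from the marginal distribution of Y:
var_direct <- sum(y^2 * p_y) - sum(y * p_y)^2

# decomposition: E[Var(Y | X)] + Var(E[Y | X])
p_x <- c(0.5, 0.5) # P(X = 0), P(X = 1)
m <- c(6, 10)      # E[Y | X = x]: means of {4, 8} and {7, 13}
v <- c(4, 9)       # Var(Y | X = x): variances of {4, 8} and {7, 13}
var_decomp <- sum(v * p_x) + sum((m - sum(m * p_x))^2 * p_x)

c(var_direct, var_decomp)
#> [1] 10.5 10.5
```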

Definition 19 (Precision) The precision of a random variable \(X\), often denoted \(\tau(X)\), \(\tau_X\), or shorthanded as \(\tau\), is the inverse of that random variable’s variance; that is:

\[\tau(X) \stackrel{\text{def}}{=}{\left(\text{Var}{\left(X\right)}\right)}^{-1}\]

Definition 20 (Standard deviation) The standard deviation of a random variable \(X\) is the square-root of the variance of \(X\):

\[\text{SD}{\left(X\right)} \stackrel{\text{def}}{=}\sqrt{\text{Var}{\left(X\right)}}\]

Definition 21 (Covariance) For any two one-dimensional random variables, \(X,Y\):

\[\text{Cov}{\left(X,Y\right)} \stackrel{\text{def}}{=}\text{E}{\left[(X - \text{E}{\left[X\right]})(Y - \text{E}{\left[Y\right]})\right]}\]

Theorem 22 \[\text{Cov}{\left(X,Y\right)}= \text{E}{\left[XY\right]} - \text{E}{\left[X\right]} \text{E}{\left[Y\right]}\]

Theorem 23 (Law of total covariance) For random variables \(X\), \(Y\), and \(Z\):

\[\text{Cov}{\left(Y,Z\right)} = \text{E}{\left[\text{Cov}{\left(Y,Z \mid X\right)}\right]} + \text{Cov}{\left(\text{E}{\left[Y \mid X\right]}, \text{E}{\left[Z \mid X\right]}\right)}\]

where \(\text{Cov}{\left(Y,Z \mid X\right)} \stackrel{\text{def}}{=}\text{E}{\left[(Y-\text{E}{\left[Y \mid X\right]})(Z-\text{E}{\left[Z \mid X\right]}) \mid X\right]}\).

Proof. Write:

\[ \begin{aligned} Y-\text{E}{\left[Y\right]} &= {\left(Y-\text{E}{\left[Y \mid X\right]}\right)} + {\left(\text{E}{\left[Y \mid X\right]}-\text{E}{\left[Y\right]}\right)} \\ Z-\text{E}{\left[Z\right]} &= {\left(Z-\text{E}{\left[Z \mid X\right]}\right)} + {\left(\text{E}{\left[Z \mid X\right]}-\text{E}{\left[Z\right]}\right)} \end{aligned} \]

Then:

\[ \begin{aligned} \text{Cov}{\left(Y,Z\right)} &= \text{E}{\left[{\left(Y-\text{E}{\left[Y\right]}\right)}{\left(Z-\text{E}{\left[Z\right]}\right)}\right]} \\&= \text{E}{\left[{\left(Y-\text{E}{\left[Y \mid X\right]}\right)}{\left(Z-\text{E}{\left[Z \mid X\right]}\right)}\right]} \\&\quad + \text{E}{\left[{\left(Y-\text{E}{\left[Y \mid X\right]}\right)}{\left(\text{E}{\left[Z \mid X\right]}-\text{E}{\left[Z\right]}\right)}\right]} \\&\quad + \text{E}{\left[{\left(\text{E}{\left[Y \mid X\right]}-\text{E}{\left[Y\right]}\right)}{\left(Z-\text{E}{\left[Z \mid X\right]}\right)}\right]} \\&\quad + \text{E}{\left[{\left(\text{E}{\left[Y \mid X\right]}-\text{E}{\left[Y\right]}\right)}{\left(\text{E}{\left[Z \mid X\right]}-\text{E}{\left[Z\right]}\right)}\right]} \end{aligned} \]

For the two mixed terms:

Discrete case.

\[ \begin{aligned} \text{E}{\left[{\left(Y-\text{E}{\left[Y \mid X\right]}\right)}{\left(\text{E}{\left[Z \mid X\right]}-\text{E}{\left[Z\right]}\right)}\right]} &= \sum_{x \in \mathcal{R}(X)} \text{E}{\left[ {\left(Y-\text{E}{\left[Y \mid X\right]}\right)} {\left(\text{E}{\left[Z \mid X\right]}-\text{E}{\left[Z\right]}\right)} \mid X=x \right]} \cdot\text{P}(X=x) \\&= \sum_{x \in \mathcal{R}(X)} {\left(\text{E}{\left[Z \mid X=x\right]}-\text{E}{\left[Z\right]}\right)} \cdot\text{E}{\left[Y-\text{E}{\left[Y \mid X=x\right]} \mid X=x\right]} \cdot\text{P}(X=x) \\&= 0 \end{aligned} \]

and similarly:

\[ \text{E}{\left[{\left(\text{E}{\left[Y \mid X\right]}-\text{E}{\left[Y\right]}\right)}{\left(Z-\text{E}{\left[Z \mid X\right]}\right)}\right]}=0. \]

Continuous case.

\[ \begin{aligned} \text{E}{\left[{\left(Y-\text{E}{\left[Y \mid X\right]}\right)}{\left(\text{E}{\left[Z \mid X\right]}-\text{E}{\left[Z\right]}\right)}\right]} &= \int_{x \in \mathcal{R}(X)} \text{E}{\left[ {\left(Y-\text{E}{\left[Y \mid X\right]}\right)} {\left(\text{E}{\left[Z \mid X\right]}-\text{E}{\left[Z\right]}\right)} \mid X=x \right]} \cdot\text{p}(X=x)\, dx \\&= \int_{x \in \mathcal{R}(X)} {\left(\text{E}{\left[Z \mid X=x\right]}-\text{E}{\left[Z\right]}\right)} \cdot\text{E}{\left[Y-\text{E}{\left[Y \mid X=x\right]} \mid X=x\right]} \cdot\text{p}(X=x)\, dx \\&= 0 \end{aligned} \]

and similarly:

\[ \text{E}{\left[{\left(\text{E}{\left[Y \mid X\right]}-\text{E}{\left[Y\right]}\right)}{\left(Z-\text{E}{\left[Z \mid X\right]}\right)}\right]}=0. \]

Hence:

\[ \begin{aligned} \text{Cov}{\left(Y,Z\right)} &= \text{E}{\left[{\left(Y-\text{E}{\left[Y \mid X\right]}\right)}{\left(Z-\text{E}{\left[Z \mid X\right]}\right)}\right]} + \text{E}{\left[{\left(\text{E}{\left[Y \mid X\right]}-\text{E}{\left[Y\right]}\right)}{\left(\text{E}{\left[Z \mid X\right]}-\text{E}{\left[Z\right]}\right)}\right]} \\&= \text{E}{\left[\text{Cov}{\left(Y,Z \mid X\right)}\right]} + \text{Cov}{\left(\text{E}{\left[Y \mid X\right]}, \text{E}{\left[Z \mid X\right]}\right)} \end{aligned} \]
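The law of total covariance can likewise be verified numerically. The joint distribution of \((X, Y, Z)\) below is a made-up example, not from the text:

[R code]

```r
library(dplyr)

# A made-up joint distribution of (X, Y, Z); probabilities sum to 1:
joint = tibble::tribble(
  ~x, ~y, ~z,   ~p,
   0,  1,  0, 0.10,
   0,  2,  1, 0.20,
   0,  3,  1, 0.20,
   1,  1,  2, 0.15,
   1,  4,  0, 0.25,
   1,  5,  3, 0.10
)

EY = with(joint, sum(y * p))
EZ = with(joint, sum(z * p))
cov_yz = with(joint, sum((y - EY) * (z - EZ) * p))  # Cov(Y, Z)

cond = joint |>
  group_by(x) |>
  summarize(
    px     = sum(p),
    EYx    = sum(y * p) / px,                        # E[Y | X = x]
    EZx    = sum(z * p) / px,                        # E[Z | X = x]
    CovYZx = sum((y - EYx) * (z - EZx) * p) / px     # Cov(Y, Z | X = x)
  )

ECov = with(cond, sum(CovYZx * px))                  # E[Cov(Y,Z|X)]
CovE = with(cond, sum((EYx - EY) * (EZx - EZ) * px)) # Cov(E[Y|X], E[Z|X])

all.equal(cov_yz, ECov + CovE)  # TRUE
```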

Lemma 1 (The covariance of a variable with itself is its variance) For any random variable \(X\):

\[\text{Cov}{\left(X,X\right)} = \text{Var}{\left(X\right)}\]

Proof. \[ \begin{aligned} \text{Cov}{\left(X,X\right)} &= \text{E}{\left[XX\right]} - \text{E}{\left[X\right]}\text{E}{\left[X\right]} \\&= \text{E}{\left[X^2\right]} - {\left(\text{E}{\left[X\right]}\right)}^2 \\ &= \text{Var}{\left(X\right)} \end{aligned} \]

Definition 22 (Variance/covariance of a \(p \times 1\) random vector) For a \(p \times 1\) dimensional random vector \(\tilde{X}\),

\[ \begin{aligned} \text{Var}{\left(\tilde{X}\right)} &\stackrel{\text{def}}{=}\text{Cov}{\left(\tilde{X}\right)} \\ &\stackrel{\text{def}}{=}\text{E}{\left[{\left(\tilde{X}- \text{E}\tilde{X}\right)} {{\left(\tilde{X}- \text{E}\tilde{X}\right)}}^{\top}\right]} \end{aligned} \]

Theorem 24 (Elements of the variance-covariance matrix are pairwise covariances) For a \(p \times 1\) random vector \(\tilde{X}= {(X_1, \ldots, X_p)}^{\top}\), the \((i,j)\)-th element of \(\text{Var}{\left(\tilde{X}\right)}\) is \(\text{Cov}{\left(X_i, X_j\right)}\):

\[ \text{Var}{\left(\tilde{X}\right)}= \begin{pmatrix} \text{Var}{\left(X_1\right)} & \text{Cov}{\left(X_1, X_2\right)} & \cdots & \text{Cov}{\left(X_1, X_p\right)} \\ \text{Cov}{\left(X_2, X_1\right)} & \text{Var}{\left(X_2\right)} & \cdots & \text{Cov}{\left(X_2, X_p\right)} \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}{\left(X_p, X_1\right)} & \text{Cov}{\left(X_p, X_2\right)} & \cdots & \text{Var}{\left(X_p\right)} \end{pmatrix} \]

Proof. Let \(\mu_i = \text{E}{\left[X_i\right]}\) for \(i = 1, \ldots, p\), so \(\text{E}\tilde{X}= {(\mu_1, \ldots, \mu_p)}^{\top}\). By Definition 22:

\[ \begin{aligned} \text{Var}{\left(\tilde{X}\right)} &= \text{E}{\left[ {\left(\tilde{X}- \text{E}\tilde{X}\right)} {{\left(\tilde{X}- \text{E}\tilde{X}\right)}}^{\top} \right]} \\ &= \text{E}{\left[ \begin{pmatrix}X_1 - \mu_1 \\ \vdots \\ X_p - \mu_p\end{pmatrix} \begin{pmatrix}X_1 - \mu_1 & \cdots & X_p - \mu_p\end{pmatrix} \right]} \\ &= \text{E}{\left[ \begin{pmatrix} (X_1 - \mu_1)(X_1 - \mu_1) & \cdots & (X_1 - \mu_1)(X_p - \mu_p) \\ \vdots & \ddots & \vdots \\ (X_p - \mu_p)(X_1 - \mu_1) & \cdots & (X_p - \mu_p)(X_p - \mu_p) \end{pmatrix} \right]} \\ &= \begin{pmatrix} \text{E}{\left[(X_1 - \mu_1)(X_1 - \mu_1)\right]} & \cdots & \text{E}{\left[(X_1 - \mu_1)(X_p - \mu_p)\right]} \\ \vdots & \ddots & \vdots \\ \text{E}{\left[(X_p - \mu_p)(X_1 - \mu_1)\right]} & \cdots & \text{E}{\left[(X_p - \mu_p)(X_p - \mu_p)\right]} \end{pmatrix} \\ &= \begin{pmatrix} \text{Cov}{\left(X_1, X_1\right)} & \cdots & \text{Cov}{\left(X_1, X_p\right)} \\ \vdots & \ddots & \vdots \\ \text{Cov}{\left(X_p, X_1\right)} & \cdots & \text{Cov}{\left(X_p, X_p\right)} \end{pmatrix} \\ &= \begin{pmatrix} \text{Var}{\left(X_1\right)} & \cdots & \text{Cov}{\left(X_1, X_p\right)} \\ \vdots & \ddots & \vdots \\ \text{Cov}{\left(X_p, X_1\right)} & \cdots & \text{Var}{\left(X_p\right)} \end{pmatrix} \end{aligned} \]

where the second-to-last equality applies Definition 21 and the final equality applies Lemma 1.

Theorem 25 (Alternate expression for variance of a random vector) \[ \begin{aligned} \text{Var}{\left(\tilde{X}\right)} &= \text{E}{\left[\tilde{X}{\tilde{X}}^{\top}\right]} - {\left(\text{E}\tilde{X}\right)} {{\left(\text{E}\tilde{X}\right)}}^{\top} \end{aligned} \]

Proof. \[ \begin{aligned} \text{Var}{\left(\tilde{X}\right)} &= \text{E}{\left[ {\left(\tilde{X}- \text{E}\tilde{X}\right)} {{\left(\tilde{X}- \text{E}\tilde{X}\right)}}^{\top} \right]} \\ &= \text{E}{\left[ \tilde{X}{\tilde{X}}^{\top} - \tilde{X}{{\left(\text{E}\tilde{X}\right)}}^{\top} - {\left(\text{E}\tilde{X}\right)} {\tilde{X}}^{\top} + {\left(\text{E}\tilde{X}\right)} {{\left(\text{E}\tilde{X}\right)}}^{\top} \right]} \\ &= \text{E}{\left[\tilde{X}{\tilde{X}}^{\top}\right]} - {\left(\text{E}\tilde{X}\right)} {{\left(\text{E}\tilde{X}\right)}}^{\top} - {\left(\text{E}\tilde{X}\right)} {{\left(\text{E}\tilde{X}\right)}}^{\top} + {\left(\text{E}\tilde{X}\right)} {{\left(\text{E}\tilde{X}\right)}}^{\top} \\ &= \text{E}{\left[\tilde{X}{\tilde{X}}^{\top}\right]} - {\left(\text{E}\tilde{X}\right)} {{\left(\text{E}\tilde{X}\right)}}^{\top} \end{aligned} \]
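Definition 22 and Theorems 24–25 can be checked numerically on a small equally-likely population of 2-dimensional vectors (the data below are illustrative):

[R code]

```r
# A small equally-likely population of 2-dimensional vectors:
X = cbind(x1 = c(1, 2, 4, 7),
          x2 = c(2, 1, 5, 3))
n  = nrow(X)
mu = colMeans(X)                          # E[X~]

# Definition 22: E[(X~ - E[X~])(X~ - E[X~])^T]
centered = sweep(X, 2, mu)
V = t(centered) %*% centered / n

# Theorem 25: E[X~ X~^T] - (E[X~])(E[X~])^T
V_alt = t(X) %*% X / n - outer(mu, mu)
all.equal(V, V_alt, check.attributes = FALSE)  # TRUE

# Theorem 24: diagonal entries are the variances
pop_var = function(v) mean((v - mean(v))^2)
all.equal(diag(V), apply(X, 2, pop_var), check.attributes = FALSE)  # TRUE
```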

Theorem 26 (Variance of a linear combination) For any vector of random variables \(\tilde{X}= (X_1, \ldots, X_n)\) and corresponding vector of constants \(\tilde{a}= (a_1, \ldots, a_n)\), the variance of their linear combination is:

\[ \begin{aligned} \text{Var}{\left(\tilde{a}\cdot \tilde{X}\right)} &= \text{Var}{\left(\sum_{i=1}^na_i X_i\right)} \\ &= \tilde{a}^{\top} \text{Var}{\left(\tilde{X}\right)} \tilde{a} \\ &= \sum_{i=1}^n\sum_{j=1}^n a_i a_j \text{Cov}{\left(X_i,X_j\right)} \end{aligned} \]

Proof. Left to the reader…
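Although the proof is omitted, the identity is easy to spot-check numerically (a check, not a proof). The population and coefficient vector below are illustrative:

[R code]

```r
# A small equally-likely population of 3-dimensional vectors
# and an arbitrary coefficient vector (illustrative values):
X = cbind(c(1, 2, 4, 7),
          c(2, 1, 5, 3),
          c(0, 3, 3, 2))
n = nrow(X)
a = c(2, -1, 0.5)

# Population variance-covariance matrix Var(X~):
centered = sweep(X, 2, colMeans(X))
Sigma = t(centered) %*% centered / n

# Left-hand side: Var(a . X~), computed directly from the scalar a . X~
lin = as.vector(X %*% a)
lhs = mean((lin - mean(lin))^2)

# Right-hand side: a^T Var(X~) a
rhs = as.vector(t(a) %*% Sigma %*% a)

all.equal(lhs, rhs)  # TRUE
```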

Corollary 3 For any two random variables \(X\) and \(Y\) and scalars \(a\) and \(b\):

\[\text{Var}{\left(aX + bY\right)} = a^2 \text{Var}{\left(X\right)} + b^2 \text{Var}{\left(Y\right)} + 2(a \cdot b) \text{Cov}{\left(X,Y\right)}\]

Proof. Apply Theorem 26 with \(n=2\), \(X_1 = X\), and \(X_2 = Y\).

Or, see https://statproofbook.github.io/P/var-lincomb.html

Definition 23 (homoskedastic, heteroskedastic) A random variable \(Y\) is homoskedastic (with respect to covariates \(X\)) if the variance of \(Y\) does not vary with \(X\):

\[\text{Var}(Y|X=x) = \sigma^2, \forall x\]

Otherwise it is heteroskedastic.

Definition 24 (Statistical independence) A set of random variables \(X_1, \ldots, X_n\) are statistically independent if their joint probability is equal to the product of their marginal probabilities:

\[\Pr(X_1=x_1, \ldots, X_n = x_n) = \prod_{i=1}^n{\Pr(X_i=x_i)}\]
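For example, two fair dice are independent: tabulating the joint distribution, computing the marginals, and comparing the joint probability of each pair with the product of the marginals confirms Definition 24 (a small illustrative check):

[R code]

```r
library(dplyr)

# Joint distribution of two fair dice: all 36 pairs equally likely.
joint = expand.grid(x1 = 1:6, x2 = 1:6) |>
  mutate(p = 1 / 36)

# Marginal distributions of each die:
m1 = joint |> group_by(x1) |> summarize(p1 = sum(p))
m2 = joint |> group_by(x2) |> summarize(p2 = sum(p))

# Definition 24: joint probability = product of marginals
check = joint |>
  left_join(m1, by = "x1") |>
  left_join(m2, by = "x2") |>
  mutate(product = p1 * p2)

all.equal(check$p, check$product)  # TRUE
```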

Definition 25 (Conditional independence) A set of random variables \(Y_1, \ldots, Y_n\) are conditionally statistically independent given a set of covariates \(X_1, \ldots, X_n\) if the joint probability of the \(Y_i\)s given the \(X_i\)s is equal to the product of their individual conditional probabilities:

\[\Pr(Y_1=y_1, \ldots, Y_n = y_n|X_1=x_1, \ldots, X_n = x_n) = \prod_{i=1}^n{\Pr(Y_i=y_i|X_i=x_i)}\]

Definition 26 (Identically distributed) A set of random variables \(X_1, \ldots, X_n\) are identically distributed if they have the same range \(\mathcal{R}(X)\) and if their marginal distributions \(\text{P}(X_1=x_1), ..., \text{P}(X_n=x_n)\) are all equal to some shared distribution \(\text{P}(X=x)\):

\[ \forall i\in {\left\{1:n\right\}}, \forall x \in \mathcal{R}(X): \text{P}(X_i=x) = \text{P}(X=x) \]

Definition 27 (Conditionally identically distributed) A set of random variables \(Y_1, \ldots, Y_n\) are conditionally identically distributed given a set of covariates \(X_1, \ldots, X_n\) if \(Y_1, \ldots, Y_n\) have the same range \(\mathcal{R}(Y)\) and if the conditional distributions \(\text{P}(Y_i=y_i|X_i =x_i)\) are all equal to the same distribution \(\text{P}(Y=y|X=x)\):

\[ \forall i\in {\left\{1:n\right\}}, \forall x \in \mathcal{R}(X), \forall y \in \mathcal{R}(Y): \text{P}(Y_i=y|X_i=x) = \text{P}(Y=y|X=x) \]

Definition 28 (Independent and identically distributed) A set of random variables \(X_1, \ldots, X_n\) are independent and identically distributed (shorthand: “\(X_i\ \text{iid}\)”) if they are statistically independent and identically distributed.

Definition 29 (Conditionally independent and identically distributed) A set of random variables \(Y_1, \ldots, Y_n\) are conditionally independent and identically distributed (shorthand: “\(Y_i | X_i\ \text{ciid}\)” or just “\(Y_i |X_i\ \text{iid}\)”) given a set of covariates \(X_1, \ldots, X_n\) if \(Y_1, \ldots, Y_n\) are conditionally independent given \(X_1, \ldots, X_n\) and \(Y_1, \ldots, Y_n\) are identically distributed given \(X_1, \ldots, X_n\).

3.6 The Central Limit Theorem

Summing many independent or nearly-independent random variables, each with a variance that is small relative to the number of variables being summed, produces an approximately bell-shaped (Gaussian) distribution.

For example, consider the sum of five dice (Figure 4).

[R code]
library(dplyr)
dist = 
  expand.grid(1:6, 1:6, 1:6, 1:6, 1:6) |> 
  rowwise() |>
  mutate(total = sum(c_across(everything()))) |> 
  ungroup() |> 
  count(total) |> 
  mutate(`p(X=x)` = n/sum(n))

library(ggplot2)

dist |> 
  ggplot() +
  aes(x = total, y = `p(X=x)`) +
  geom_col() +
  xlab("sum of dice (x)") +
  ylab("Probability of outcome, Pr(X=x)") +
  expand_limits(y = 0)

  
  
Figure 4: Distribution of the sum of five dice

In comparison, the outcome of just one die is not bell-shaped (Figure 5).

[R code]
library(dplyr)
dist = 
  expand.grid(1:6) |> 
  rowwise() |>
  mutate(total = sum(c_across(everything()))) |> 
  ungroup() |> 
  count(total) |> 
  mutate(`p(X=x)` = n/sum(n))

library(ggplot2)

dist |> 
  ggplot() +
  aes(x = total, y = `p(X=x)`) +
  geom_col() +
  xlab("outcome of one die (x)") +
  ylab("Probability of outcome, Pr(X=x)") +
  expand_limits(y = 0)

  
  
Figure 5: Distribution of the outcome of one die

What distribution does a single die have?

Answer: discrete uniform on 1:6.
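Its mean and variance follow directly from the definitions of expectation and variance (this short computation is an addition for illustration):

[R code]

```r
# Moments of the discrete uniform distribution on 1:6:
x = 1:6
p = rep(1 / 6, 6)

EX   = sum(x * p)            # E[X] = 3.5
VarX = sum((x - EX)^2 * p)   # Var(X) = 35/12 ~= 2.92

c(mean = EX, variance = VarX)
```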

4 Additional resources

References

Dobson, Annette J, and Adrian G Barnett. 2018. An Introduction to Generalized Linear Models. 4th ed. CRC press. https://doi.org/10.1201/9781315182780.
Kalbfleisch, John D, and Ross L Prentice. 2011. The Statistical Analysis of Failure Time Data. John Wiley & Sons.
Klein, John P, and Melvin L Moeschberger. 2003. Survival Analysis: Techniques for Censored and Truncated Data. Vol. 1230. Springer. https://link.springer.com/book/10.1007/b97377.
Kleinbaum, David G, and Mitchel Klein. 2012. Survival Analysis: A Self-Learning Text. 3rd ed. Springer. https://link.springer.com/book/10.1007/978-1-4419-6646-9.
Miller, Steven J. 2017. The Probability Lifesaver : All the Tools You Need to Understand Chance. A Princeton Lifesaver Study Guide. Princeton University Press. https://press.princeton.edu/books/hardcover/9780691149547/the-probability-lifesaver.
Rothman, Kenneth J., Timothy L. Lash, Tyler J. VanderWeele, and Sebastien Haneuse. 2021. Modern Epidemiology. Fourth edition. Wolters Kluwer.
Vittinghoff, Eric, David V Glidden, Stephen C Shiboski, and Charles E McCulloch. 2012. Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models. 2nd ed. Springer. https://doi.org/10.1007/978-1-4614-1353-0.