Functions from these packages will be used throughout this document:
[R code]
library(conflicted) # check for conflicting function definitions
# library(printr) # inserts help-file output into markdown output
library(rmarkdown) # convert R Markdown documents into a variety of formats
library(pander) # format tables for markdown
library(ggplot2) # graphics
library(ggfortify) # help with graphics
library(dplyr) # manipulate data
library(tibble) # `tibble`s extend `data.frame`s
library(magrittr) # `%>%` and other additional piping tools
library(haven) # import Stata files
library(knitr) # format R output for markdown
library(tidyr) # tools to help create tidy data
library(plotly) # interactive graphics
library(dobson) # datasets from Dobson and Barnett 2018
library(parameters) # format model output tables for markdown
library(latex2exp) # use LaTeX in R code (for figures and tables)
library(fs) # filesystem path manipulations
library(survival) # survival analysis
library(survminer) # survival analysis graphics
library(KMsurv) # datasets from Klein and Moeschberger
library(webshot2) # convert interactive content to static for pdf
library(forcats) # functions for categorical variables ("factors")
library(stringr) # functions for dealing with strings
library(lubridate) # functions for dealing with dates and times
library(broom) # summarizes key information about statistical objects in tidy tibbles
library(broom.helpers) # suite of functions to work with regression model `broom::tidy()` tibbles
Here are some R settings I use in this document:
[R code]
rm(list = ls()) # delete any data that's already loaded into R

conflicts_prefer(dplyr::filter) # use the `filter()` function from dplyr by default

ggplot2::theme_set(
  ggplot2::theme_bw() +
    # ggplot2::labs(col = "") +
    ggplot2::theme(
      legend.position = "bottom",
      text = ggplot2::element_text(size = 12, family = "serif")
    )
)

knitr::opts_chunk$set(message = FALSE)
options("digits" = 6)

pander::panderOptions("big.mark", ",")
pander::panderOptions("table.emphasize.rownames", FALSE)
pander::panderOptions("table.split.table", Inf)

legend_text_size = 9
run_graphs = TRUE
Most of the content in this chapter should be review from UC Davis Epi 202.
1 Core properties of probabilities
1.1 Defining probabilities
Definition 1 (Probability measure) A probability measure, often denoted \(\Pr()\) or \(\operatorname{P}()\), is a function whose domain is a \(\sigma\)-algebra of possible outcomes, \(\mathscr{S}\), and which satisfies the following properties:
1. For any statistical event \(A \in \mathscr{S}\), \(\Pr(A) \ge 0\).
2. The probability of the union of all outcomes (\(\Omega \stackrel{\text{def}}{=}\cup \mathscr{S}\)) is 1:
\[\Pr(\Omega) = 1\]
3. The probability of the union of countably many mutually disjoint events \(A_1, A_2, \ldots\) (where \(A_i \cap A_j = \emptyset\) for all \(i \neq j\)) is equal to the sum of their probabilities (countable additivity or sigma-additivity):
\[\Pr\mathopen{}\left(\bigcup_{i=1}^{\infty} A_i\right)\mathclose{} = \sum_{i=1}^{\infty} \Pr(A_i)\]
Corollary 2 (Complement rule in probability (\(\pi\)) notation) If the probability of an outcome \(A\) is \(\Pr(A)=\pi\), then the probability that \(A\) does not occur is:
\[\Pr(\neg A) = 1 - \pi\]
Definition 2 (Conditional probability) For two events \(A\) and \(B\) with \(\Pr(B) > 0\), the conditional probability of \(A\) given \(B\), denoted \(\Pr(A \mid B)\), is:
\[\Pr(A \mid B) \stackrel{\text{def}}{=}\frac{\Pr(A \cap B)}{\Pr(B)}\]
Theorem 3 (Law of conditional probability) For any two events \(A\) and \(B\) with \(\Pr(B) > 0\):
\[
\begin{aligned}
\Pr(A \mid B) &= \frac{\Pr(A \cap B)}{\Pr(B)}
\\ \Pr(A \cap B) &= \Pr(A \mid B) \cdot\Pr(B)
\end{aligned}
\]
Example 1 (Applying the law of conditional probability) Suppose 30% of adults exercise regularly (\(\Pr(E) = 0.30\)), and among adults who exercise regularly, 60% have low blood pressure (\(\Pr(L \mid E) = 0.60\)).
Then the probability that a randomly selected adult both exercises regularly and has low blood pressure is:
\[\Pr(E \cap L) = \Pr(L \mid E) \cdot\Pr(E) = 0.60 \cdot 0.30 = 0.18\]
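Theorem 4 (Law of total probability) If \(B_1, B_2, \ldots\) is a countable partition of the sample space (i.e., countably many mutually exclusive events whose union is the entire sample space), then for any event \(A\):
\[\Pr(A) = \sum_{i} \Pr(A \mid B_i) \cdot\Pr(B_i)\]
Proof. Since \(B_1, B_2, \ldots\) partition the sample space, the events \(A \cap B_1, A \cap B_2, \ldots\) are mutually exclusive and their union is \(A\). By property 3 of Definition 1 (countable additivity), and then by Theorem 3:
\[
\begin{aligned}
\Pr(A)
&= \Pr\mathopen{}\left(\bigcup_{i} (A \cap B_i)\right)\mathclose{}
\\&= \sum_{i} \Pr(A \cap B_i)
\\&= \sum_{i} \Pr(A \mid B_i) \cdot\Pr(B_i)
\end{aligned}
\]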
Theorem 5 (Bayes’ theorem) For any two events \(A\) and \(B\) with \(\Pr(A) > 0\) and \(\Pr(B) > 0\):
\[\Pr(A \mid B) = \frac{\Pr(B \mid A) \cdot\Pr(A)}{\Pr(B)}\]
Proof. Apply Definition 2 to both \(\Pr(A \mid B)\) and \(\Pr(B \mid A)\):
\[
\begin{aligned}
\Pr(A \mid B)
&= \frac{\Pr(A \cap B)}{\Pr(B)}
\\&= \frac{\Pr(B \mid A) \cdot\Pr(A)}{\Pr(B)}
\end{aligned}
\]
The second equality follows from Theorem 3 applied to \(\Pr(B \cap A) = \Pr(B \mid A) \cdot\Pr(A)\).
Example 2 (Positive predictive value of a medical test) Suppose a disease test has 99% sensitivity and 99% specificity, and the prevalence of the disease in the population is 7%.
Let \(D\) be the event “person has the disease” and \(+\) be the event “test is positive”. Then:
\(\Pr(+ \mid D) = 0.99\) (sensitivity)
\(\Pr(\neg + \mid \neg D) = 0.99\) (specificity), so the false positive rate is \(\Pr(+ \mid \neg D) = 1 - 0.99 = 0.01\)
\(\Pr(D) = 0.07\) (prevalence)
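By Bayes’ theorem (Theorem 5) and the law of total probability (Theorem 4):
\[
\begin{aligned}
\Pr(D \mid +)
&= \frac{\Pr(+ \mid D) \cdot\Pr(D)}{\Pr(+ \mid D) \cdot\Pr(D) + \Pr(+ \mid \neg D) \cdot\Pr(\neg D)}
\\&= \frac{0.99 \cdot 0.07}{0.99 \cdot 0.07 + 0.01 \cdot 0.93}
\\&= \frac{0.0693}{0.0786}
\\&\approx 0.88
\end{aligned}
\]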
Even with a highly accurate test (99% sensitive and 99% specific), only about 88% of people who test positive actually have the disease, because the disease prevalence is relatively low (7%).
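The arithmetic in Example 2 can be reproduced in a few lines of R. This is a minimal sketch; the variable names are ours, and the inputs are the sensitivity, specificity, and prevalence stated in the example.

```r
# Inputs taken from Example 2
sensitivity <- 0.99 # Pr(+ | D)
specificity <- 0.99 # Pr(not + | not D)
prevalence <- 0.07  # Pr(D)

# Law of total probability:
# Pr(+) = Pr(+ | D) Pr(D) + Pr(+ | not D) Pr(not D)
false_positive_rate <- 1 - specificity
pr_positive <- sensitivity * prevalence +
  false_positive_rate * (1 - prevalence)

# Bayes' theorem: Pr(D | +) = Pr(+ | D) Pr(D) / Pr(+)
ppv <- sensitivity * prevalence / pr_positive
round(ppv, 3) # about 0.882
```

Changing `prevalence` while holding the other two inputs fixed shows how strongly the positive predictive value depends on how common the disease is.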
2 Key probability distributions
Table 1: Distributions typically used for outcome models
| Distribution | Uses |
|---|---|
| Bernoulli | Binary outcomes |
| Binomial | Sums of Bernoulli outcomes |
| Poisson | Unbounded count outcomes |
| Geometric | Counts of non-events before an event occurs |
| Negative binomial | Mixtures of Poisson distributions; counts of non-events until a given number of events occurs |
| Normal (Gaussian) | Continuous outcomes without a more specific distribution |
| Exponential | Time-to-event outcomes |
| Gamma | Time-to-event outcomes |
| Weibull | Time-to-event outcomes |
| Log-normal | Time-to-event outcomes |
Table 2: Distributions typically used for test statistics
Definition 5 (Exposure magnitude) For many count outcomes, there is some sense of an exposure magnitude, such as population size, or duration of observation, which multiplicatively rescales the expected (mean) count.
Exercise 5 What are some examples of exposure magnitudes?
Theorem 8 (Transformation function from event rate to mean) For a count variable with mean \(\mu\), event rate \(\lambda\), and exposure magnitude \(t\):
\[\mu = \lambda \cdot t \tag{4}\]
Solution. Start from the definition of the event rate, \(\lambda = \mu / t\), and multiply both sides by \(t\) to solve for \(\mu\).
Equation 4 is analogous to the inverse-odds function for binary variables.
Theorem 9 (No exposure means no expected events) When the exposure magnitude is 0, there is no opportunity for events to occur:
\[t = 0 \implies \mu = \lambda \cdot 0 = 0\]
The exposure magnitude, \(T\), is similar to a covariate in linear or logistic regression. However, there is an important difference: in count regression, there is no intercept corresponding to \(\operatorname{E}[Y|T=0]\). In other words, this model assumes that if there is no exposure, there can’t be any events.
Theorem 10 (Exposure is additive on the log scale) If \(\mu = \lambda\cdot t\), then:
\[\log{\mu} = \log{\lambda} + \log{t}\]
Definition 7 (Offset) When the linear component of a model involves a term without an unknown coefficient, that term is called an offset.
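To make Theorem 10 and Definition 7 concrete, here is a minimal sketch of a Poisson regression in which \(\log{t}\) enters the linear component as an offset. The data are hypothetical and were constructed so that the event rate is exactly 2 events per unit of exposure; the names `exposure`, `events`, and `rate_hat` are ours.

```r
# Hypothetical data: counts observed over different exposure magnitudes,
# constructed so that events = 2 * exposure
exposure <- c(1, 2, 3, 4, 5)
events <- c(2, 4, 6, 8, 10)

# log(mu) = log(lambda) + log(t):
# log(t) has a fixed coefficient of 1, so it is supplied as an offset
fit <- glm(
  events ~ 1 + offset(log(exposure)),
  family = poisson(link = "log")
)

# The intercept estimates log(lambda); exponentiate to recover the rate
rate_hat <- unname(exp(coef(fit)[1]))
rate_hat
```

With an intercept-only model, the fitted rate equals the total number of events divided by the total exposure, which is 30 / 15 = 2 for these data.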
Theorem 11 (Sum of independent Poisson random variables) If \(X\) and \(Y\) are independent Poisson random variables with means \(\mu_X\) and \(\mu_Y\), their sum, \(Z=X+Y\), is also a Poisson random variable, with mean \(\mu_Z = \mu_X + \mu_Y\).
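Theorem 11 can be checked numerically by convolving two Poisson probability mass functions and comparing the result with the Poisson mass function whose mean is \(\mu_X + \mu_Y\). The means below are arbitrary values chosen for illustration, and `pmf_sum()` is a helper defined here, not a library function.

```r
mu_x <- 1.5 # assumed mean of X
mu_y <- 2.5 # assumed mean of Y

# Pr(Z = z) by convolution: sum over k of Pr(X = k) * Pr(Y = z - k)
pmf_sum <- function(z) {
  k <- 0:z
  sum(dpois(k, mu_x) * dpois(z - k, mu_y))
}

z_values <- 0:15
by_convolution <- vapply(z_values, pmf_sum, numeric(1))
by_theorem <- dpois(z_values, mu_x + mu_y)

# TRUE if the two mass functions agree up to floating-point error
all.equal(by_convolution, by_theorem)
```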
When \(\alpha=1\) this is the exponential. When \(\alpha>1\) the hazard is increasing and when \(\alpha < 1\) the hazard is decreasing. This provides more flexibility than the exponential.
We will see more of this distribution later.
3 Characteristics of probability distributions
3.1 Probability density function
Definition 9 (probability density) If \(X\) is a continuous random variable, then the probability density of \(X\) at value \(x\), denoted \(f(x)\), \(f_X(x)\), \(\operatorname{p}(x)\), \(\operatorname{p}_X(x)\), or \(\operatorname{p}(X=x)\), is defined as the limit of the probability (mass) that \(X\) is in an interval around \(x\), divided by the width of that interval, as that width reduces to 0.
Definition 10 (Cumulative distribution function (CDF)) For a random variable \(X\), its population CDF is \[F(t)=\Pr(X\le t), \quad t\in\mathbb{R}.\]
Definition 11 (Quantile function (population inverse CDF)) For a random variable \(X\) with cumulative distribution function (CDF) \(F\), its population quantile function (generalized inverse of \(F\)) is \[Q(p)=\inf\{t:F(t)\ge p\}, \quad 0<p\le 1.\]
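For a discrete distribution, the infimum in Definition 11 can be found by a direct search over the support. The sketch below assumes \(X \sim \mathrm{Binomial}(10, 0.3)\) purely for illustration and compares the search with R's built-in `qbinom()`; `quantile_by_definition()` is a helper defined here.

```r
# Assumed example distribution: X ~ Binomial(size = 10, prob = 0.3)
support <- 0:10
cdf <- pbinom(support, size = 10, prob = 0.3)

# Q(p) = inf{t : F(t) >= p}, searched over the support of X
quantile_by_definition <- function(p) {
  min(support[cdf >= p])
}

p_values <- c(0.1, 0.5, 0.9)
q_manual <- vapply(p_values, quantile_by_definition, numeric(1))
q_builtin <- qbinom(p_values, size = 10, prob = 0.3)

rbind(p = p_values, manual = q_manual, builtin = q_builtin)
```

Because the CDF of a discrete variable is a step function, many values of \(p\) map to the same quantile, which is why the generalized inverse is needed instead of an ordinary inverse.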
Theorem 13 (Density function is derivative of CDF) The density function \(f(t)\) or \(\operatorname{p}(T=t)\) for a random variable \(T\) at value \(t\) is equal to the derivative of the cumulative probability function \(F(t) \stackrel{\text{def}}{=}\Pr(T\le t)\); that is:
\[f(t) = \frac{\partial}{\partial t} F(t)\]
Definition 13 (Expectation, expected value, population mean ) The expectation, expected value, or population mean of a continuous random variable \(X\), denoted \(\operatorname{E}\mathopen{}\left[X\right]\mathclose{}\), \(\mu(X)\), or \(\mu_X\), is the weighted mean of \(X\)’s possible values, weighted by the probability density function of those values:
\[\operatorname{E}\mathopen{}\left[X\right]\mathclose{} = \int_{x\in \mathcal{R}(X)} x \cdot \operatorname{p}(X=x)dx\]
The expectation, expected value, or population mean of a discrete random variable \(X\), denoted \(\operatorname{E}\mathopen{}\left[X\right]\mathclose{}\), \(\mu(X)\), or \(\mu_X\), is the mean of \(X\)’s possible values, weighted by the probability mass function of those values:
\[\operatorname{E}\mathopen{}\left[X\right]\mathclose{} = \sum_{x \in \mathcal{R}(X)} x \cdot \operatorname{P}(X=x)\]
Proof. We prove the continuous case, in which \(T\) has a density \(\operatorname{f}\). Define \(g(t, u) = \operatorname{f}(u) \cdot \mathbb{1}\mathopen{}\left(0 \le t \le u\right)\mathclose{}\) on the product space \([0, \infty) \times [0, \infty)\); \(g\) is nonnegative everywhere and vanishes outside the (unbounded) triangular region \(D = \{(t, u) : 0 \le t \le u < \infty\}\).
Since \(\operatorname{f}(u) \ge 0\), hypothesis (a) of Fubini–Tonelli (the nonnegative case, Tonelli’s theorem) applies, and we may exchange the order of integration over \(D\):
\[
\begin{aligned}
\operatorname{E}\mathopen{}\left[T\right]\mathclose{}
&= \int_{u=0}^{\infty} u \cdot \operatorname{f}(u)\, du
\\&= \int_{u=0}^{\infty} \mathopen{}\left(\int_{t=0}^{u} dt\right)\mathclose{} \operatorname{f}(u)\, du
\\&= \int_{t=0}^{\infty} \int_{u=t}^{\infty} \operatorname{f}(u)\, du\, dt
\\&= \int_{t=0}^{\infty} \operatorname{S}(t)\, dt
\end{aligned}
\]
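Example 3 (Mean of an exponential random variable via survival function) Let \(T \sim \mathrm{Exponential}(\lambda)\), so \(\operatorname{S}(t) = \text{e}^{-\lambda t}\) for \(t \ge 0\). By Theorem 16:
\[\operatorname{E}\mathopen{}\left[T\right]\mathclose{} = \int_{t=0}^{\infty} \operatorname{S}(t)\, dt = \int_{t=0}^{\infty} \text{e}^{-\lambda t}\, dt = \frac{1}{\lambda}\]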
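As a numerical check on Example 3, the sketch below integrates the survival function with base R's `integrate()` and compares the result with the mean computed directly from the density. The rate `lambda <- 2` is an arbitrary value chosen for illustration.

```r
lambda <- 2 # assumed rate, chosen only for illustration

# Mean via the survival function: integral of S(t) = exp(-lambda * t)
survival <- function(t) exp(-lambda * t)
mean_via_survival <- integrate(survival, lower = 0, upper = Inf)$value

# Mean via the density: integral of t * f(t), with f(t) = lambda * exp(-lambda * t)
mean_via_density <- integrate(
  function(t) t * dexp(t, rate = lambda),
  lower = 0, upper = Inf
)$value

c(
  via_survival = mean_via_survival,
  via_density = mean_via_density,
  closed_form = 1 / lambda
)
```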
Definition 14 (Conditional expectation) Discrete case. Let \(X\) and \(Y\) be jointly distributed discrete random variables. The conditional probability mass function of \(Y\) given \(X = x\) (for values of \(x\) with \(\operatorname{P}(X = x) > 0\)) is:
\[\operatorname{P}(Y = y \mid X = x) \stackrel{\text{def}}{=}\frac{\operatorname{P}(X = x,\, Y = y)}{\operatorname{P}(X = x)}\]
The conditional expectation of \(Y\) given \(X = x\) is:
\[\operatorname{E}\mathopen{}\left[Y \mid X = x\right]\mathclose{} \stackrel{\text{def}}{=}\sum_{y \in \mathcal{R}(Y)} y \cdot\operatorname{P}(Y = y \mid X = x)\]
Continuous case. Let \(X\) and \(Y\) be jointly distributed continuous random variables with joint density \(\operatorname{p}(X = x,\, Y = y)\) and marginal density \(\operatorname{p}(X = x)\). The conditional probability density function of \(Y\) given \(X = x\) (for values of \(x\) with \(\operatorname{p}(X = x) > 0\)) is:
\[\operatorname{p}(Y = y \mid X = x) \stackrel{\text{def}}{=}\frac{\operatorname{p}(X = x,\, Y = y)}{\operatorname{p}(X = x)}\]
The conditional expectation of \(Y\) given \(X = x\) is:
\[\operatorname{E}\mathopen{}\left[Y \mid X = x\right]\mathclose{} \stackrel{\text{def}}{=}\int_{y \in \mathcal{R}(Y)} y \cdot\operatorname{p}(Y = y \mid X = x)\, dy\]
Conditional expectation function. The conditional expectation function \(\operatorname{E}\mathopen{}\left[Y \mid X\right]\mathclose{}\) is the function (and hence random variable) of \(X\) obtained by evaluating \(\operatorname{E}\mathopen{}\left[Y \mid X = x\right]\mathclose{}\) at \(X\); that is, \(\operatorname{E}\mathopen{}\left[Y \mid X\right]\mathclose{} = g(X)\) where \(g(x) \stackrel{\text{def}}{=}\operatorname{E}\mathopen{}\left[Y \mid X = x\right]\mathclose{}\).
3.4 Fubini–Tonelli for expectations
The Riemann version of Fubini’s theorem, stated in the math-prereqs chapter, lets us switch the order of integration for continuous integrands on simple regions. For expectations against probability measures we use its measure-theoretic form, which holds on any σ-finite measure space. The σ-finiteness hypothesis is automatic for probability measures (every probability measure is finite, hence σ-finite), so Fubini–Tonelli yields the corollary below directly.
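Corollary 3 (Joint-distribution form (without independence; corollary of Fubini–Tonelli)) Let \((X, Y)\) be jointly distributed random variables whose joint distribution has a density \(f_{X,Y}\) with respect to a product of σ-finite reference measures \(\mu_X \otimes \mu_Y\) on \(\mathcal{R}(X) \times \mathcal{R}(Y)\), and let \(h : \mathcal{R}(X) \times \mathcal{R}(Y) \to \mathbb{R}\) be measurable. If either
(a) \(h \ge 0\), or
(b) \(\operatorname{E}\mathopen{}\left[\mathopen{}\left|h(X, Y)\right|\mathclose{}\right]\mathclose{} < \infty\),
then
\[
\begin{aligned}
\operatorname{E}\mathopen{}\left[h(X, Y)\right]\mathclose{}
&= \int_{\mathcal{R}(X)} \int_{\mathcal{R}(Y)} h(x,y)\, f_{X,Y}(x,y)\, d\mu_Y(y)\, d\mu_X(x)
\\&= \int_{\mathcal{R}(Y)} \int_{\mathcal{R}(X)} h(x,y)\, f_{X,Y}(x,y)\, d\mu_X(x)\, d\mu_Y(y)
\end{aligned}
\]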
The choice of reference measures covers three cases:
Both continuous: \(\mu_X = \mu_Y = \text{Lebesgue measure}\); \(f_{X,Y}\) is the joint probability density function (PDF), and \(\int g(x)\,d\mu_X(x) = \int g(x)\,dx\).
Both discrete: \(\mu_X = \mu_Y = \text{counting measure}\); \(f_{X,Y}(x,y) = \operatorname{P}(X = x,\, Y = y)\) is the joint probability mass function (PMF), and \(\int g(x)\,d\mu_X(x) = \sum_{x \in \mathcal{R}(X)} g(x)\).
Mixed (one continuous, one discrete): one reference measure is Lebesgue and the other is counting; \(f_{X,Y}(x,y) = f_{X \mid Y}(x \mid y)\,\operatorname{P}(Y = y)\) (or \(\operatorname{P}(X = x \mid Y = y)\,f_Y(y)\) if \(X\) is discrete and \(Y\) continuous), and the iterated integrals combine an ordinary integral with a sum.
Proof. Apply Fubini–Tonelli with \(\mu_1 = \mu_X\) and \(\mu_2 = \mu_Y\) to the integrand \(h(x,y)\,f_{X,Y}(x,y)\) on \(\mathcal{R}(X) \times \mathcal{R}(Y)\). Lebesgue measure and counting measure on a countable set are each σ-finite, so \(\mu_X \otimes \mu_Y\) is σ-finite in all three cases. The relevant hypothesis is (a) when \(h \ge 0\) and (b) when \(\operatorname{E}\mathopen{}\left[\mathopen{}\left|h(X, Y)\right|\mathclose{}\right]\mathclose{} < \infty\). Independence is not required. When \(X\) and \(Y\) are independent, \(f_{X,Y}(x,y) = f_X(x)\,f_Y(y)\) (or \(\operatorname{P}(X=x,Y=y) = \operatorname{P}(X=x)\,\operatorname{P}(Y=y)\) in the discrete case), and the iterated integrals factor into separate integrals over the marginals.
Example 5 (Expectation of a product of independent variables) Let \(X \sim \mathrm{Uniform}(0, 1)\) and \(Y \sim \mathrm{Uniform}(0, 2)\), independently distributed. Compute \(\operatorname{E}\mathopen{}\left[XY\right]\mathclose{}\).
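We apply Corollary 3 (both-continuous case) with \(h(x, y) = xy\). Since \(X\) and \(Y\) are independent with densities \(f_X(x) = 1\) on \([0,1]\) and \(f_Y(y) = \tfrac{1}{2}\) on \([0,2]\), the joint density factors as \(f_{X,Y}(x,y) = f_X(x)\,f_Y(y) = \tfrac{1}{2}\), and \(\mu_X = \mu_Y = \text{Lebesgue measure}\):
\[
\begin{aligned}
\operatorname{E}\mathopen{}\left[XY\right]\mathclose{}
&= \int_{x=0}^{1} \int_{y=0}^{2} xy \cdot \tfrac{1}{2}\, dy\, dx
\\&= \int_{x=0}^{1} x \mathopen{}\left(\int_{y=0}^{2} \tfrac{y}{2}\, dy\right)\mathclose{} dx
\\&= \int_{x=0}^{1} x \cdot 1\, dx
\\&= \tfrac{1}{2}
\end{aligned}
\]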
As a check: \(\operatorname{E}\mathopen{}\left[X\right]\mathclose{} = \tfrac{1}{2}\), \(\operatorname{E}\mathopen{}\left[Y\right]\mathclose{} = 1\), and \(\operatorname{E}\mathopen{}\left[X\right]\mathclose{}\operatorname{E}\mathopen{}\left[Y\right]\mathclose{} = \tfrac{1}{2}\).
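The iterated integral in Example 5 can also be evaluated numerically. The sketch below nests two calls to base R's `integrate()`; `joint_density()` and `inner()` are helpers defined here, and the inner integral is wrapped in `vapply()` because `integrate()` expects a vectorized integrand.

```r
# Joint density of independent X ~ Uniform(0, 1) and Y ~ Uniform(0, 2)
joint_density <- function(x, y) dunif(x, 0, 1) * dunif(y, 0, 2)

# Inner integral over y for each fixed x, vectorized over x
inner <- function(x) {
  vapply(x, function(x_i) {
    integrate(
      function(y) x_i * y * joint_density(x_i, y),
      lower = 0, upper = 2
    )$value
  }, numeric(1))
}

# Outer integral over x
e_xy <- integrate(inner, lower = 0, upper = 1)$value
e_xy
```

The result should agree with the product \(\operatorname{E}[X]\operatorname{E}[Y] = \tfrac{1}{2}\), up to numerical integration error.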
Example 6 (When independence fails: a counterexample) Correctly applying Corollary 3 requires the actual joint density \(f_{X,Y}\) — not the product of marginals \(f_X(x)\,f_Y(y)\), which is valid only when \(X\) and \(Y\) are independent. Using the wrong joint density gives the wrong answer.
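Let \(X \sim \mathrm{Uniform}(0, 1)\) and set \(Y = X\) (so \(X\) and \(Y\) are perfectly correlated and not independent). Then \(XY = X^2\), so the correct value is \(\operatorname{E}\mathopen{}\left[XY\right]\mathclose{} = \operatorname{E}\mathopen{}\left[X^2\right]\mathclose{} = \int_{0}^{1} x^2\, dx = \tfrac{1}{3}\).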
Note that Fubini–Tonelli’s own conditions still hold here (\(h(x,y) = xy\) is nonnegative and integrable), so the error is not a failure of Fubini–Tonelli. Rather, the error is using the wrong measure: the joint distribution of \((X, X)\) is concentrated on the diagonal \(\{(x, x) : x \in [0, 1]\} \subset [0, 1]^2\), which has Lebesgue measure zero in \(\mathbb{R}^2\). The joint distribution is therefore not absolutely continuous with respect to two-dimensional Lebesgue measure, so no joint density \(f_{X,Y}\) on \([0, 1]^2\) exists, which is the reference density Corollary 3 requires.
The calculation below is what someone would erroneously write if they assumed independence and used \(f_X(x)\,f_Y(y)\) as a “joint density”, a function that does not in fact correspond to the joint distribution of \((X, X)\). The marginals \(X \sim \mathrm{Uniform}(0,1)\) and \(Y \sim \mathrm{Uniform}(0,1)\) do have densities \(f_X = f_Y = 1\), but the product \(f_X(x)\,f_Y(y) = 1\) on \([0, 1]^2\) is the density of an independent pair, not of \((X, X)\):
\[\int_{x=0}^{1} \int_{y=0}^{1} xy \cdot 1\, dy\, dx = \mathopen{}\left(\int_{0}^{1} x\, dx\right)\mathclose{} \mathopen{}\left(\int_{0}^{1} y\, dy\right)\mathclose{} = \tfrac{1}{2} \cdot \tfrac{1}{2} = \tfrac{1}{4}\]
This recovers \(\operatorname{E}\mathopen{}\left[XY\right]\mathclose{}\) for independent uniforms (\(\tfrac{1}{4}\)), not \(\operatorname{E}\mathopen{}\left[XX\right]\mathclose{}\) for the perfectly correlated pair (\(\tfrac{1}{3}\)). The lesson is that Corollary 3 requires the actual joint density \(f_{X,Y}\). For independent \((X, Y)\), this factors as \(f_X(x)\,f_Y(y)\); for dependent \((X, Y)\), \(f_{X,Y}\) need not factor — and for \((X, X)\), no joint density on \(\mathbb{R}^2\) exists at all, so Corollary 3 simply does not apply.
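The two values contrasted in Example 6 can be reproduced numerically. This sketch uses one-dimensional integrals only, since the correct calculation for \((X, X)\) reduces to \(\operatorname{E}[X^2]\); the variable names are ours.

```r
# Correct: Y = X, so E[XY] = E[X^2], a one-dimensional integral
e_xx <- integrate(function(x) x^2 * dunif(x, 0, 1), lower = 0, upper = 1)$value

# Erroneous: treating the pair as independent gives E[X] * E[Y]
e_x <- integrate(function(x) x * dunif(x, 0, 1), lower = 0, upper = 1)$value
e_wrong <- e_x * e_x

c(correct = e_xx, assuming_independence = e_wrong)
```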
[R code]
set.seed(204)
n <- 400
x_dep <- runif(n)
y_dep <- x_dep
x_ind <- runif(n)
y_ind <- runif(n)

plotly::plot_ly() |>
  plotly::add_trace(
    type = "scatter", mode = "markers",
    x = x_ind, y = y_ind,
    name = "Assumed independent (X<sub>1</sub>, X<sub>2</sub>)",
    marker = list(size = 5, color = "#999999", opacity = 0.5)
  ) |>
  plotly::add_trace(
    type = "scatter", mode = "markers",
    x = x_dep, y = y_dep,
    name = "Actual (X, X) on diagonal",
    marker = list(size = 6, color = "#b40426")
  ) |>
  plotly::layout(
    xaxis = list(title = "x", range = c(0, 1), scaleanchor = "y"),
    yaxis = list(title = "y", range = c(0, 1)),
    legend = list(orientation = "h", y = -0.2)
  )
Figure 4: Samples from the joint distribution of \((X, X)\) (red, on the diagonal) versus an independent pair \((X_1, X_2)\) with the same marginals (grey, scattered over \([0, 1]^2\)). The actual joint mass for \((X, X)\) is concentrated on a 1-dimensional diagonal — a set of Lebesgue measure zero in \(\mathbb{R}^2\) — so no joint density on \([0, 1]^2\) exists, and the “\(f_X(x)\,f_Y(y) = 1\)” calculation integrates against the wrong measure (the grey distribution).
Example 7 (Both-continuous case: joint PDF on a non-rectangular support) Let \((X, Y)\) have joint density \(f_{X,Y}(x, y) = 2\) for \(0 \le x \le y \le 1\) (and \(0\) otherwise). Compute \(\operatorname{E}\mathopen{}\left[X + Y\right]\mathclose{}\).
[R code]
n_grid <- 51
x_seq <- seq(0, 1, length.out = n_grid)
y_seq <- seq(0, 1, length.out = n_grid)

# density is 2 on the support (x <= y) and undefined (NA) elsewhere
z_mat <- outer(x_seq, y_seq, function(x, y) {
  z <- rep(2, length(x))
  z[x > y] <- NA
  z
})

plotly::plot_ly(x = ~x_seq, y = ~y_seq, z = ~t(z_mat)) |>
  plotly::add_surface(showscale = FALSE) |>
  plotly::layout(
    scene = list(
      xaxis = list(title = "x"),
      yaxis = list(title = "y"),
      zaxis = list(title = "f(x, y)", range = c(0, 2.5)),
      camera = list(eye = list(x = 1.6, y = -1.6, z = 0.8))
    )
  )
Figure 5: Joint density \(f_{X,Y}(x, y) = 2\) on the triangular support \(\{(x, y) : 0 \le x \le y \le 1\}\), and zero elsewhere. The total “volume” under the density is \(2 \cdot \tfrac{1}{2} = 1\), as required.
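For Example 7, the expectation is the iterated integral \(\int_{0}^{1}\int_{x}^{1} (x + y) \cdot 2\, dy\, dx\), which works out to 1 by hand (the inner integral is \(1 + 2x - 3x^2\)). The sketch below evaluates the same integral numerically; note that the inner limits depend on \(x\) because the support is triangular. The helper `inner()` is defined here for illustration.

```r
# Inner integral over y from x to 1 (the support requires x <= y <= 1)
inner <- function(x) {
  vapply(x, function(x_i) {
    integrate(
      function(y) (x_i + y) * 2, # h(x, y) * f(x, y), with f = 2 on the support
      lower = x_i, upper = 1
    )$value
  }, numeric(1))
}

# Outer integral over x from 0 to 1
e_sum <- integrate(inner, lower = 0, upper = 1)$value
e_sum
```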
Example 8 (Both-discrete case: joint PMF) Let \((X, Y)\) be discrete with joint probability mass function:
| | \(Y = 0\) | \(Y = 1\) |
|---|---|---|
| \(X = 0\) | \(0.2\) | \(0.3\) |
| \(X = 1\) | \(0.1\) | \(0.4\) |
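Compute \(\operatorname{E}\mathopen{}\left[X + Y\right]\mathclose{}\) using Corollary 3 with \(\mu_X = \mu_Y = \text{counting measure}\) and \(h(x,y) = x + y\).
\[
\begin{aligned}
\operatorname{E}\mathopen{}\left[X + Y\right]\mathclose{}
&= \sum_{x \in \{0,1\}} \sum_{y \in \{0,1\}} (x + y) \cdot\operatorname{P}(X = x,\, Y = y)
\\&= 0(0.2) + 1(0.3) + 1(0.1) + 2(0.4)
\\&= 1.2
\end{aligned}
\]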
As a check: \(\operatorname{E}\mathopen{}\left[X\right]\mathclose{} = 0(0.5) + 1(0.5) = 0.5\) and \(\operatorname{E}\mathopen{}\left[Y\right]\mathclose{} = 0(0.3) + 1(0.7) = 0.7\), so \(\operatorname{E}\mathopen{}\left[X + Y\right]\mathclose{} = \operatorname{E}\mathopen{}\left[X\right]\mathclose{} + \operatorname{E}\mathopen{}\left[Y\right]\mathclose{} = 1.2\).
Note that \(X\) and \(Y\) are not independent here: \(\operatorname{P}(X = 0, Y = 0) = 0.2 \neq 0.15 = \operatorname{P}(X = 0)\,\operatorname{P}(Y = 0)\). Corollary 3 applies regardless, since it requires only the actual joint mass function, not independence.
Figure 6: Joint probability mass function \(\operatorname{P}(X = x, Y = y)\). Marginal totals: \(\operatorname{P}(X = 0) = 0.5\), \(\operatorname{P}(X = 1) = 0.5\), \(\operatorname{P}(Y = 0) = 0.3\), \(\operatorname{P}(Y = 1) = 0.7\).
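With counting measure, the double integral in Corollary 3 becomes a double sum, which is a one-line matrix computation in R. The sketch below stores the joint PMF from Example 8 as a matrix (rows index \(X\), columns index \(Y\)); the object names are ours.

```r
x_values <- c(0, 1)
y_values <- c(0, 1)

# Joint PMF from Example 8: rows are X = 0, 1; columns are Y = 0, 1
pmf <- matrix(
  c(0.2, 0.3,
    0.1, 0.4),
  nrow = 2, byrow = TRUE,
  dimnames = list(X = c("0", "1"), Y = c("0", "1"))
)

# h(x, y) = x + y evaluated on the same grid
h <- outer(x_values, y_values, "+")

# E[h(X, Y)] = sum over x and y of h(x, y) * P(X = x, Y = y)
e_sum <- sum(h * pmf)

# Marginal PMFs, for the check via E[X] + E[Y]
p_x <- rowSums(pmf)
p_y <- colSums(pmf)
c(joint = e_sum, marginals = sum(x_values * p_x) + sum(y_values * p_y))
```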
Example 9 (Mixed case: one continuous variable, one discrete variable) Let \(Y \sim \mathrm{Bernoulli}(0.6)\) and, given \(Y = y\), let \(X \mid Y = y \sim \mathrm{Uniform}(0,\, y + 1)\). Compute \(\operatorname{E}\mathopen{}\left[X\right]\mathclose{}\) using Corollary 3 with \(\mu_X = \text{Lebesgue measure}\), \(\mu_Y = \text{counting measure}\), and \(h(x, y) = x\).
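The joint density with respect to Lebesgue \(\times\) counting measure is \(f_{X,Y}(x, y) = f_{X \mid Y}(x \mid y)\,\operatorname{P}(Y = y)\):
\[
\begin{aligned}
f_{X,Y}(x,\, 0) &= 1 \cdot 0.4 = 0.4 &&\text{ for } x \in [0,1];\\
f_{X,Y}(x,\, 1) &= \tfrac{1}{2} \cdot 0.6 = 0.3 &&\text{ for } x \in [0,2].
\end{aligned}
\]
Summing over \(y\) and integrating over \(x\):
\[
\begin{aligned}
\operatorname{E}\mathopen{}\left[X\right]\mathclose{}
&= \sum_{y \in \{0,1\}} \int x \cdot f_{X,Y}(x, y)\, dx
\\&= \int_{0}^{1} 0.4\,x\, dx + \int_{0}^{2} 0.3\,x\, dx
\\&= 0.2 + 0.6
\\&= 0.8
\end{aligned}
\]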
As a check using the law of total expectation: \(\operatorname{E}\mathopen{}\left[X \mid Y = 0\right]\mathclose{} = \tfrac{1}{2}\) and \(\operatorname{E}\mathopen{}\left[X \mid Y = 1\right]\mathclose{} = 1\), so \(\operatorname{E}\mathopen{}\left[X\right]\mathclose{} = \tfrac{1}{2}(0.4) + 1(0.6) = 0.2 + 0.6 = 0.8\).
Figure 7: Joint density \(f_{X,Y}(x, y) = f_{X \mid Y}(x \mid y)\,\operatorname{P}(Y = y)\) for each value of the discrete variable \(Y\). The area under each component integrates to \(\operatorname{P}(Y = y)\): \(0.4 \cdot 1 = 0.4\) (blue) and \(0.3 \cdot 2 = 0.6\) (red), summing to 1.
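The mixed case in Example 9 combines a sum over \(y\) with an integral over \(x\). A minimal sketch, using the probabilities and conditional distributions stated in the example; `integral_over_x()` is a helper defined here.

```r
y_values <- c(0, 1)
p_y <- c(0.4, 0.6) # P(Y = 0), P(Y = 1) for Y ~ Bernoulli(0.6)

# Integrate x * f(x, y) over x for one value of y,
# where f(x, y) = f(x | y) * P(Y = y) and X | Y = y ~ Uniform(0, y + 1)
integral_over_x <- function(y, p) {
  integrate(
    function(x) x * dunif(x, 0, y + 1) * p,
    lower = 0, upper = y + 1
  )$value
}

# Sum the component integrals over the values of the discrete variable
e_x <- sum(mapply(integral_over_x, y_values, p_y))
e_x
```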
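Theorem 18 (Law of iterated expectations) For any two random variables \(X\) and \(Y\):
\[\operatorname{E}\mathopen{}\left[Y\right]\mathclose{} = \operatorname{E}\mathopen{}\left[\operatorname{E}\mathopen{}\left[Y \mid X\right]\mathclose{}\right]\mathclose{}\]
Proof. Discrete case. When \(X\) and \(Y\) are discrete, applying Definition 13 to \(\operatorname{E}\mathopen{}\left[\operatorname{E}\mathopen{}\left[Y \mid X\right]\mathclose{}\right]\mathclose{}\) and then the law of total probability (Theorem 4) applied to the countable partition \(\{X = x : x \in \mathcal{R}(X)\}\):
\[
\begin{aligned}
\operatorname{E}\mathopen{}\left[\operatorname{E}\mathopen{}\left[Y \mid X\right]\mathclose{}\right]\mathclose{}
&= \sum_{x \in \mathcal{R}(X)} \operatorname{E}\mathopen{}\left[Y \mid X = x\right]\mathclose{} \cdot\operatorname{P}(X = x)
\\&= \sum_{x \in \mathcal{R}(X)} \sum_{y \in \mathcal{R}(Y)} y \cdot\operatorname{P}(Y = y \mid X = x) \cdot\operatorname{P}(X = x)
\\&= \sum_{y \in \mathcal{R}(Y)} y \sum_{x \in \mathcal{R}(X)} \operatorname{P}(Y = y \mid X = x) \cdot\operatorname{P}(X = x)
\\&= \sum_{y \in \mathcal{R}(Y)} y \cdot\operatorname{P}(Y = y)
\\&= \operatorname{E}\mathopen{}\left[Y\right]\mathclose{}
\end{aligned}
\]
Continuous case. When \(X\) and \(Y\) are continuous, applying Definition 13 to \(\operatorname{E}\mathopen{}\left[\operatorname{E}\mathopen{}\left[Y \mid X\right]\mathclose{}\right]\mathclose{}\) and then using Definition 14 for \(\operatorname{E}\mathopen{}\left[Y \mid X=x\right]\mathclose{}\):
\[
\begin{aligned}
\operatorname{E}\mathopen{}\left[\operatorname{E}\mathopen{}\left[Y \mid X\right]\mathclose{}\right]\mathclose{}
&= \int_{x} \operatorname{E}\mathopen{}\left[Y \mid X = x\right]\mathclose{} \cdot\operatorname{p}(X=x)\, dx
\\&= \int_{x} \mathopen{}\left(\int_{y} y \cdot\operatorname{p}(Y=y \mid X=x)\, dy\right)\mathclose{} \operatorname{p}(X=x)\, dx
\\&= \int_{y} y \mathopen{}\left(\int_{x} \operatorname{p}(Y=y \mid X=x) \cdot\operatorname{p}(X=x)\, dx\right)\mathclose{} dy
\\&= \int_{y} y \cdot\operatorname{p}(Y=y)\, dy
\\&= \operatorname{E}\mathopen{}\left[Y\right]\mathclose{}
\end{aligned}
\]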
where the third equality exchanges the order of integration by hypothesis (b) of Fubini–Tonelli (the absolute-integrability case, Fubini’s theorem); this requires \(\operatorname{E}\mathopen{}\left[\mathopen{}\left|Y\right|\mathclose{}\right]\mathclose{} < \infty\), which is implicit in \(\operatorname{E}\mathopen{}\left[Y\right]\mathclose{}\) being defined, and the fourth equality uses \(\int_{x} \operatorname{p}(Y=y \mid X=x) \cdot\operatorname{p}(X=x)\, dx = \int_{x} \operatorname{p}(X=x, Y=y)\, dx = \operatorname{p}(Y=y)\) (marginalization of the joint density).
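Theorem 19 (Conditional law of iterated expectations) For random variables \(X\), \(Y\), and \(Z\):
\[\operatorname{E}\mathopen{}\left[Y \mid Z\right]\mathclose{} = \operatorname{E}\mathopen{}\left[\operatorname{E}\mathopen{}\left[Y \mid X,Z\right]\mathclose{} \mid Z\right]\mathclose{}\]
Proof. For each fixed value \(z\) with positive probability or density:
Discrete case. Conditioning on \(Z=z\), and applying the law of total probability to the partition \(\{X=x : x \in \mathcal{R}(X)\}\) under the conditional distribution given \(Z=z\):
\[
\begin{aligned}
\operatorname{E}\mathopen{}\left[\operatorname{E}\mathopen{}\left[Y \mid X,Z\right]\mathclose{} \mid Z=z\right]\mathclose{}
&= \sum_{x \in \mathcal{R}(X)} \operatorname{E}\mathopen{}\left[Y \mid X=x, Z=z\right]\mathclose{} \cdot\operatorname{P}(X=x \mid Z=z)
\\&= \sum_{x \in \mathcal{R}(X)} \sum_{y \in \mathcal{R}(Y)} y \cdot\operatorname{P}(Y=y \mid X=x, Z=z) \cdot\operatorname{P}(X=x \mid Z=z)
\\&= \sum_{y \in \mathcal{R}(Y)} y \sum_{x \in \mathcal{R}(X)} \operatorname{P}(X=x, Y=y \mid Z=z)
\\&= \sum_{y \in \mathcal{R}(Y)} y \cdot\operatorname{P}(Y=y \mid Z=z)
\\&= \operatorname{E}\mathopen{}\left[Y \mid Z=z\right]\mathclose{}
\end{aligned}
\]
Therefore, as functions of the random variable \(Z\), \(\operatorname{E}\mathopen{}\left[Y \mid Z\right]\mathclose{} = \operatorname{E}\mathopen{}\left[\operatorname{E}\mathopen{}\left[Y \mid X,Z\right]\mathclose{} \mid Z\right]\mathclose{}\).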
Example 10 (Marginal expectation from conditional expectations) Suppose \(X\) is a binary random variable indicating treatment assignment (\(X=1\) treated, \(X=0\) control), with \(\operatorname{P}(X=1) = 0.5\), and suppose the outcome \(Y\) has conditional expectations:
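Definition 15 (Expectation of a random matrix) For a random matrix \(\mathbf{A}\) of size \(m \times n\) with \((i,j)\)-th element \(A_{ij}\), the expectation \(\operatorname{E}\mathbf{A}\) is the \(m \times n\) matrix whose \((i,j)\)-th element is \(\operatorname{E}\mathopen{}\left[A_{ij}\right]\mathclose{}\):
\[{\mathopen{}\left(\operatorname{E}\mathbf{A}\right)\mathclose{}}_{ij} = \operatorname{E}\mathopen{}\left[A_{ij}\right]\mathclose{}, \quad i = 1, \ldots, m, \quad j = 1, \ldots, n\]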
Definition 17 (Deviation from a population or subpopulation mean) For the random variable \(Y\), define the deviation from its mean as:
\[e(Y) \stackrel{\text{def}}{=}Y - \operatorname{E}\mathopen{}\left[Y\right]\mathclose{}\]
In probabilistic models, we call this quantity a deviation from a mean. Other sources often call it an error or noise term.
For a realized observation \(y\): \[e(y) \stackrel{\text{def}}{=}y - \operatorname{E}\mathopen{}\left[Y\right]\mathclose{}\]
In regression settings, the reference mean is often conditional on covariates: \(e(y_i) \stackrel{\text{def}}{=}y_i - \operatorname{E}\mathopen{}\left[Y_i \mid X_i\right]\mathclose{}\).
In this course, we prefer “deviation” for this mean-deviation quantity. The terms “error” and “noise” are common aliases. We use “residual” (defined in the Linear regression chapter) for deviations from fitted values. For notation in this course, we use \(e(\cdot)\) for these model/data deviations, and reserve \(\varepsilon\mathopen{}\left(\cdot\right)\mathclose{}\) for estimator-to-estimand deviations (see Estimation).
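Definition 18 (Variance) The variance of a random variable \(X\) is the expectation of the squared difference between \(X\) and \(\operatorname{E}\mathopen{}\left[X\right]\mathclose{}\); that is:
\[\operatorname{Var}\mathopen{}\left(X\right)\mathclose{} \stackrel{\text{def}}{=}\operatorname{E}\mathopen{}\left[{\mathopen{}\left(X - \operatorname{E}\mathopen{}\left[X\right]\mathclose{}\right)\mathclose{}}^2\right]\mathclose{}\]
Definition 19 (Precision) The precision of a random variable \(X\), often denoted \(\tau(X)\), \(\tau_X\), or abbreviated as \(\tau\), is the inverse of that random variable’s variance; that is:
\[\tau(X) \stackrel{\text{def}}{=}\frac{1}{\operatorname{Var}\mathopen{}\left(X\right)\mathclose{}}\]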
Theorem 22 (Alternative formula for covariance) \[\operatorname{Cov}\mathopen{}\left(X,Y\right)\mathclose{}= \operatorname{E}\mathopen{}\left[XY\right]\mathclose{} - \operatorname{E}\mathopen{}\left[X\right]\mathclose{} \operatorname{E}\mathopen{}\left[Y\right]\mathclose{}\]
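Theorem 22 can be verified on a small discrete example. The sketch below reuses the joint PMF from Example 8 and compares the shortcut formula with the usual definition of covariance, \(\operatorname{E}[(X - \operatorname{E}[X])(Y - \operatorname{E}[Y])]\), which we take as given here. The grid matrices are helpers defined for illustration.

```r
# Joint PMF from Example 8: rows are X = 0, 1; columns are Y = 0, 1
pmf <- matrix(c(0.2, 0.3, 0.1, 0.4), nrow = 2, byrow = TRUE)

# Values of x and y at each cell of the PMF matrix
x_grid <- matrix(c(0, 1), nrow = 2, ncol = 2)               # x varies by row
y_grid <- matrix(c(0, 1), nrow = 2, ncol = 2, byrow = TRUE) # y varies by column

e_x <- sum(x_grid * pmf)
e_y <- sum(y_grid * pmf)

# Definition: E[(X - E[X]) (Y - E[Y])]
cov_definition <- sum((x_grid - e_x) * (y_grid - e_y) * pmf)

# Theorem 22: E[XY] - E[X] E[Y]
cov_shortcut <- sum(x_grid * y_grid * pmf) - e_x * e_y

c(definition = cov_definition, shortcut = cov_shortcut)
```

Both expressions give 0.05 for this PMF, a small positive covariance consistent with the dependence noted in Example 8.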
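Theorem 23 (Law of total covariance) For random variables \(X\), \(Y\), and \(Z\):
\[\operatorname{Cov}\mathopen{}\left(X,Y\right)\mathclose{} = \operatorname{E}\mathopen{}\left[\operatorname{Cov}\mathopen{}\left(X,Y \mid Z\right)\mathclose{}\right]\mathclose{} + \operatorname{Cov}\mathopen{}\left(\operatorname{E}\mathopen{}\left[X \mid Z\right]\mathclose{}, \operatorname{E}\mathopen{}\left[Y \mid Z\right]\mathclose{}\right)\mathclose{}\]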
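Theorem 24 (Elements of the variance-covariance matrix are pairwise covariances) For a \(p \times 1\) random vector \(\tilde{X}= {(X_1, \ldots, X_p)}^{\top}\), the \((i,j)\)-th element of \(\operatorname{Var}\mathopen{}\left(\tilde{X}\right)\mathclose{}\) is \(\operatorname{Cov}\mathopen{}\left(X_i, X_j\right)\mathclose{}\):
\[{\mathopen{}\left[\operatorname{Var}\mathopen{}\left(\tilde{X}\right)\mathclose{}\right]\mathclose{}}_{ij} = \operatorname{Cov}\mathopen{}\left(X_i, X_j\right)\mathclose{}\]
Proof. Let \(\mu_i = \operatorname{E}\mathopen{}\left[X_i\right]\mathclose{}\) for \(i = 1, \ldots, p\), so \(\operatorname{E}\tilde{X}= {(\mu_1, \ldots, \mu_p)}^{\top}\). By Definition 22:
\[
\begin{aligned}
\operatorname{Var}\mathopen{}\left(\tilde{X}\right)\mathclose{}
&= \operatorname{E}\mathopen{}\left[\mathopen{}\left(\tilde{X}- \operatorname{E}\tilde{X}\right)\mathclose{} {\mathopen{}\left(\tilde{X}- \operatorname{E}\tilde{X}\right)\mathclose{}}^{\top}\right]\mathclose{}
\\ {\mathopen{}\left[\operatorname{Var}\mathopen{}\left(\tilde{X}\right)\mathclose{}\right]\mathclose{}}_{ij}
&= \operatorname{E}\mathopen{}\left[(X_i - \mu_i)(X_j - \mu_j)\right]\mathclose{}
\\&= \operatorname{Cov}\mathopen{}\left(X_i, X_j\right)\mathclose{}
\end{aligned}
\]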
Theorem 25 (Alternate expression for variance of a random vector)
\[
\begin{aligned}
\operatorname{Var}\mathopen{}\left(\tilde{X}\right)\mathclose{}
&= \operatorname{E}\mathopen{}\left[\tilde{X}{\tilde{X}}^{\top}\right]\mathclose{} - \mathopen{}\left(\operatorname{E}\tilde{X}\right)\mathclose{} {\mathopen{}\left(\operatorname{E}\tilde{X}\right)\mathclose{}}^{\top}
\end{aligned}
\]
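Theorem 26 (Variance of a linear combination) For any vector of random variables \(\tilde{X}= (X_1, \ldots, X_n)\) and corresponding vector of constants \(\tilde{a}= (a_1, \ldots, a_n)\), the variance of their linear combination is:
\[\operatorname{Var}\mathopen{}\left(\tilde{a}\cdot \tilde{X}\right)\mathclose{} = {\tilde{a}}^{\top} \operatorname{Var}\mathopen{}\left(\tilde{X}\right)\mathclose{} \tilde{a}= \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \operatorname{Cov}\mathopen{}\left(X_i, X_j\right)\mathclose{}\]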
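The same identity holds exactly for sample variances and covariances, which gives a quick way to check Theorem 26 in R. The sketch below uses three columns of the built-in `mtcars` data set as stand-ins for \(X_1, X_2, X_3\); the coefficient vector `a` is arbitrary. Sample quantities are used here only as an illustration of the algebra, not as population values.

```r
# Three columns of a built-in data set stand in for X_1, X_2, X_3
X <- as.matrix(mtcars[, c("mpg", "hp", "wt")])
a <- c(2, -1, 0.5) # arbitrary constants, chosen for illustration

# Variance of the linear combination, computed directly
var_direct <- var(as.vector(X %*% a))

# Quadratic form: t(a) %*% Var(X) %*% a
var_formula <- as.numeric(t(a) %*% cov(X) %*% a)

all.equal(var_direct, var_formula)
```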
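Definition 23 (homoskedastic, heteroskedastic) A random variable \(Y\) is homoskedastic (with respect to covariates \(X\)) if the variance of \(Y\) does not vary with \(X\):
\[\operatorname{Var}\mathopen{}\left(Y \mid X = x\right)\mathclose{} = \sigma^2 \quad \text{for all } x \in \mathcal{R}(X)\]
Otherwise, \(Y\) is heteroskedastic.
Definition 24 (Statistical independence) A set of random variables \(X_1, \ldots, X_n\) are statistically independent if their joint probability is equal to the product of their marginal probabilities:
\[\operatorname{P}(X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^{n} \operatorname{P}(X_i = x_i)\]
Definition 25 (Conditional independence) A set of random variables \(Y_1, \ldots, Y_n\) are conditionally statistically independent given a set of covariates \(X_1, \ldots, X_n\) if the joint probability of the \(Y_i\)s given the \(X_i\)s is equal to the product of their individual conditional probabilities:
\[\operatorname{P}(Y_1 = y_1, \ldots, Y_n = y_n \mid X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^{n} \operatorname{P}(Y_i = y_i \mid X_i = x_i)\]
Definition 26 (Identically distributed) A set of random variables \(X_1, \ldots, X_n\) are identically distributed if they have the same range \(\mathcal{R}(X)\) and if their marginal distributions \(\operatorname{P}(X_1=x_1), \ldots, \operatorname{P}(X_n=x_n)\) are all equal to some shared distribution \(\operatorname{P}(X=x)\):
\[\operatorname{P}(X_i = x) = \operatorname{P}(X = x) \quad \text{for all } i \in \{1, \ldots, n\} \text{ and } x \in \mathcal{R}(X)\]
Definition 27 (Conditionally identically distributed) A set of random variables \(Y_1, \ldots, Y_n\) are conditionally identically distributed given a set of covariates \(X_1, \ldots, X_n\) if \(Y_1, \ldots, Y_n\) have the same range \(\mathcal{R}(Y)\) and if the distributions \(\operatorname{P}(Y_i=y_i|X_i =x_i)\) are all equal to the same distribution \(\operatorname{P}(Y=y|X=x)\):
\[\operatorname{P}(Y_i = y \mid X_i = x) = \operatorname{P}(Y = y \mid X = x) \quad \text{for all } i \in \{1, \ldots, n\},\ x \in \mathcal{R}(X),\ y \in \mathcal{R}(Y)\]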
Definition 28 (Independent and identically distributed) A set of random variables \(X_1, \ldots, X_n\) are independent and identically distributed (shorthand: “\(X_i\ \operatorname{iid}\)”) if they are statistically independent and identically distributed.
Definition 29 (Conditionally independent and identically distributed) A set of random variables \(Y_1, \ldots, Y_n\) are conditionally independent and identically distributed (shorthand: “\(Y_i | X_i\ \operatorname{ciid}\)” or just “\(Y_i |X_i\ \operatorname{iid}\)”) given a set of covariates \(X_1, \ldots, X_n\) if \(Y_1, \ldots, Y_n\) are conditionally independent given \(X_1, \ldots, X_n\) and \(Y_1, \ldots, Y_n\) are identically distributed given \(X_1, \ldots, X_n\).
3.7 The Central Limit Theorem
The sum of many independent (or nearly independent) random variables, each contributing only a small share of the total variance, has an approximately bell-shaped (normal) distribution.
For example, consider the sum of five dice (Figure 8).
[R code]
library(dplyr)
# enumerate all 6^5 equally likely outcomes and tabulate their totals
dist = expand.grid(1:6, 1:6, 1:6, 1:6, 1:6) |>
  rowwise() |>
  mutate(total = sum(c_across(everything()))) |>
  ungroup() |>
  count(total) |>
  mutate(`p(X=x)` = n / sum(n))

library(ggplot2)
dist |>
  ggplot() +
  aes(x = total, y = `p(X=x)`) +
  geom_col() +
  xlab("sum of dice (x)") +
  ylab("Probability of outcome, Pr(X=x)") +
  expand_limits(y = 0)
Figure 8: Distribution of the sum of five dice
In comparison, the outcome of just one die is not bell-shaped (Figure 9).
[R code]
library(dplyr)
# enumerate the 6 equally likely outcomes of a single die
dist = expand.grid(1:6) |>
  rowwise() |>
  mutate(total = sum(c_across(everything()))) |>
  ungroup() |>
  count(total) |>
  mutate(`p(X=x)` = n / sum(n))

library(ggplot2)
dist |>
  ggplot() +
  aes(x = total, y = `p(X=x)`) +
  geom_col() +
  xlab("outcome of one die (x)") +
  ylab("Probability of outcome, Pr(X=x)") +
  expand_limits(y = 0)
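The dice example can also be computed without enumerating every outcome, by repeatedly convolving the single-die distribution with itself. The sketch below does this in base R and measures how far the exact distribution of the sum of five dice is from a normal density with the same mean and variance. `convolve_pmf()` is a helper defined here, and the comparison with `dnorm()` is our illustration of the bell shape, not a formal statement of the theorem.

```r
one_die <- rep(1 / 6, 6) # Pr(X = 1), ..., Pr(X = 6)

# Convolve two probability vectors; element i of the result is the
# probability of the i-th smallest possible total
convolve_pmf <- function(p, q) {
  out <- numeric(length(p) + length(q) - 1)
  for (i in seq_along(p)) {
    idx <- i:(i + length(q) - 1)
    out[idx] <- out[idx] + p[i] * q
  }
  out
}

# Distribution of the sum of five dice; possible totals are 5, ..., 30
pmf <- Reduce(convolve_pmf, rep(list(one_die), 5))
totals <- 5:30

mean_total <- sum(totals * pmf)
var_total <- sum((totals - mean_total)^2 * pmf)

# Normal density with matching mean and variance, evaluated at each total
normal_approx <- dnorm(totals, mean = mean_total, sd = sqrt(var_total))
max(abs(pmf - normal_approx)) # largest pointwise discrepancy
```

Replacing `5` with a larger number of dice (and adjusting `totals` to match) is one way to see the discrepancy shrink as more variables are summed.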