Math is not just a way of calculating numerical answers; it is a way of thinking, using clear definitions for concepts and rigorous logic to organize our thoughts and back up our assertions.
Cheng (2025)
These lecture notes use:
Some key results are listed here.
Theorem 1 (Equalities are transitive) If \(a=b\) and \(b=c\), then \(a=c\)
Theorem 2 (Substituting equivalent expressions) If \(a = b\), then for any function \(f(x)\), \(f(a) = f(b)\)
Theorem 3 (Adding to both sides of an inequality) If \(a<b\), then \(a+c < b+c\)
Theorem 4 (negating both sides of an inequality) If \(a < b\), then: \(-a > -b\)
Theorem 5 (Multiplying both sides of an inequality by a positive number) If \(a < b\) and \(c > 0\), then \(ca < cb\). (If \(c = 0\), both sides equal zero, so only the weak inequality \(ca \le cb\) holds.)
Theorem 6 (Negation is multiplication by \(-1\)) \[-a = (-1)*a\]
Definition 1 (Infimum (greatest lower bound)) The infimum of a nonempty set \(A \subseteq \mathbb{R}\), written \(\inf A\), is the greatest real number \(m\) satisfying \(m \le a\) for all \(a \in A\):
\[\inf A \stackrel{\text{def}}{=}\max_{t \in \mathbb{R}}\mathopen{}\left\{t : \forall a \in A, a \ge t\right\}\mathclose{}\]
If the infimum belongs to \(A\), it equals the minimum: \(\inf A = \min A\).
Example 1 (Numerical examples of infimum)
Definition 2 (Supremum (least upper bound)) The supremum of a nonempty set \(A \subseteq \mathbb{R}\), written \(\sup A\), is the smallest real number \(M\) satisfying \(M \ge a\) for all \(a \in A\):
\[\sup A \stackrel{\text{def}}{=}\min_{t \in \mathbb{R}}\mathopen{}\left\{t : \forall a \in A, a \le t\right\}\mathclose{}\]
If the supremum belongs to \(A\), it equals the maximum: \(\sup A = \max A\).
Example 2 (Numerical examples of supremum)
Theorem 7 (adding zero changes nothing) \[a+0=a\]
Theorem 8 (Sums are symmetric) \[a+b = b+a\]
Theorem 9 (Sums are associative)
\[(a + b) + c = a + (b + c)\]
Theorem 10 (Multiplying by 1 changes nothing) \[a \times 1 = a\]
Theorem 11 (Products are symmetric) \[a \times b = b \times a\]
Theorem 12 (Products are associative) \[(a \times b) \times c = a \times (b \times c)\]
Theorem 13 (Division can be written as a product) \[\frac {a}{b} = a \times \frac{1}{b}\]
Theorem 14 (Multiplication is distributive) \[a(b+c) = ab + ac\]
Definition 3 (Quotients, fractions, rates)
\[\frac{a}{b}\]
Definition 4 (Ratios) A ratio is a quotient in which the numerator and denominator are measured using the same unit scales.
Definition 5 (Proportion) In statistics, a “proportion” typically means a ratio where the numerator represents a subset of the denominator.
Definition 6 (Proportional) Two functions \(f(x)\) and \(g(x)\) are proportional if their ratio \(\frac{f(x)}{g(x)}\) does not depend on \(x\). (cf. https://en.wikipedia.org/wiki/Proportionality_(mathematics))
Additional reference for elementary algebra: https://en.wikipedia.org/wiki/Population_proportion#Mathematical_definition
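Theorem 15 (logarithm of a product is the sum of the logs of the factors) \[ \operatorname{log}\mathopen{}\left\{a\cdot b\right\}\mathclose{} = \operatorname{log}\mathopen{}\left\{a\right\}\mathclose{} + \operatorname{log}\mathopen{}\left\{b\right\}\mathclose{} \]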
Corollary 1 (logarithm of a quotient)
\[\log{\frac{a}{b}} = \log{a} - \log{b}\]
Theorem 16 (logarithm of an exponential function) \[ \operatorname{log}\mathopen{}\left\{a^b\right\}\mathclose{} = b \cdot\operatorname{log}\mathopen{}\left\{a\right\}\mathclose{} \]
Theorem 17 (exponential of a sum)
\[\operatorname{exp}\mathopen{}\left\{a+b\right\}\mathclose{} = \operatorname{exp}\mathopen{}\left\{a\right\}\mathclose{} \cdot\operatorname{exp}\mathopen{}\left\{b\right\}\mathclose{}\]
Corollary 2 (exponential of a difference)
\[\operatorname{exp}\mathopen{}\left\{a-b\right\}\mathclose{} = \frac{\operatorname{exp}\mathopen{}\left\{a\right\}\mathclose{}}{\operatorname{exp}\mathopen{}\left\{b\right\}\mathclose{}}\]
Theorem 18 (exponential of a product) \[a^{bc} = \mathopen{}\left(a^b\right)\mathclose{}^c = \mathopen{}\left(a^c\right)\mathclose{}^b\]
Corollary 3 (natural exponential of a product) \[\operatorname{exp}\mathopen{}\left\{ab\right\}\mathclose{} = (\operatorname{exp}\mathopen{}\left\{a\right\}\mathclose{})^b = (\operatorname{exp}\mathopen{}\left\{b\right\}\mathclose{})^a\]
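The logarithm and exponential identities above can be spot-checked numerically. The following is a minimal sketch in R; the sample values `a`, `b`, and `k` are arbitrary choices for illustration (with `a > 0` and `b > 0` so that the logarithms are defined), and `all.equal()` is used because floating-point results agree only up to rounding error.

```r
# Arbitrary positive sample values (assumptions for illustration only)
a <- 2.5
b <- 4
k <- 3

# Each entry compares the two sides of one identity, up to rounding error
checks <- c(
  log_product   = isTRUE(all.equal(log(a * b), log(a) + log(b))),  # Theorem 15
  log_quotient  = isTRUE(all.equal(log(a / b), log(a) - log(b))),  # Corollary 1
  log_power     = isTRUE(all.equal(log(a^b), b * log(a))),         # Theorem 16
  exp_sum       = isTRUE(all.equal(exp(a + b), exp(a) * exp(b))),  # Theorem 17
  exp_diff      = isTRUE(all.equal(exp(a - b), exp(a) / exp(b))),  # Corollary 2
  power_product = isTRUE(all.equal(a^(b * k), (a^b)^k)),           # Theorem 18
  exp_product   = isTRUE(all.equal(exp(a * b), exp(a)^b))          # Corollary 3
)
print(checks)
```

A numerical check at one point is not a proof; it only guards against misremembering the direction of an identity.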
Exercise 1 For \(a \ge 0,~b,c \in \mathbb{R}\), when does \((a^b)^c = a^{(b^c)}\)?
Solution 1. Short answer: rarely (that’s all you need to know for this course).
Long answer:
If \((a^b)^c = a^{(b^c)}\), then since \((a^b)^c = a^{bc}\), we have: \[a^{bc} = a^{(b^c)}\] \[\operatorname{log}\mathopen{}\left\{a^{bc}\right\}\mathclose{} = \operatorname{log}\mathopen{}\left\{a^{(b^c)}\right\}\mathclose{}\] \[bc \cdot \operatorname{log}\mathopen{}\left\{a\right\}\mathclose{} = b^c\cdot \operatorname{log}\mathopen{}\left\{a\right\}\mathclose{} \tag{1}\]
Equation 1 holds in each of the following cases:
In particular, when \(a=0\) and \(c=0\), \(bc = 0\) and \(b^c = 1\) (for any \(b \in \mathbb{R}\)), so \(\operatorname{sign}\mathopen{}\left\{bc\right\}\mathclose{}\neq \operatorname{sign}\mathopen{}\left\{b^c\right\}\mathclose{}\), and \((a^b)^c \neq a^{(b^c)}\):
\[ \begin{aligned} (a^b)^c &= (0^b)^0 \\ &= 1 \end{aligned} \]
\[ \begin{aligned} a^{(b^c)} &= 0^{(b^0)} \\ &= 0^1 \\ &= 0 \end{aligned} \]
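The counterexample above can be reproduced in R, which follows the same convention \(0^0 = 1\) that the derivation uses. This is a small sketch; the value `b <- 2` is an arbitrary choice, and the second pair of lines adds a generic case with hypothetical values \(a = 2\), \(b = 3\), \(c = 2\).

```r
# Case from the text: a = 0, c = 0, with an arbitrary b
a <- 0
b <- 2
c0 <- 0

left  <- (a^b)^c0   # (0^2)^0 = 0^0, which R evaluates as 1
right <- a^(b^c0)   # 0^(2^0) = 0^1 = 0

# A generic case: the two groupings usually differ
left_generic  <- (2^3)^2   # 8^2 = 64
right_generic <- 2^(3^2)   # 2^9 = 512

print(c(left = left, right = right))
print(c(left_generic = left_generic, right_generic = right_generic))
```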
Exercise 2 For \(b,c \in \mathbb{R}\), when does \(b^c = bc\)?
Solution 2. \(bc = b^c\) in each of the following cases:
See the red contours in Figure 2 for a visualization.
`b*c_f` <- function(b, c) b*c
`b^c_f` <- function(b, c) b^c
values_b <- seq(0, 5, by = .01)
values_c <- seq(-.5, 3, by = .01)
`b*c` <- outer(values_b, values_c, `b*c_f`)
`b^c` <- outer(values_b, values_c, `b^c_f`)
`b^c`[is.infinite(`b^c`)] = NA
opacity <- .3
z_min <- min(`b*c`, `b^c`, na.rm = TRUE)
z_max <- 5
plotly::plot_ly(
x = ~values_b,
y = ~values_c
) |>
plotly::add_surface(
z = ~ t(`b*c`),
contours = list(
z = list(
show = TRUE,
start = -1,
end = 1,
size = .1
)
),
name = "b*c",
showscale = FALSE,
opacity = opacity,
colorscale = list(c(0, 1), c("green", "green"))
) |>
plotly::add_surface(
opacity = opacity,
colorscale = list(c(0, 1), c("red", "red")),
z = ~ t(`b^c`),
contours = list(
z = list(
show = TRUE,
start = z_min,
end = z_max,
size = .2
)
),
showscale = FALSE,
name = "b^c"
) |>
plotly::layout(
scene = list(
xaxis = list(
# type = "log",
title = "b"
),
yaxis = list(
# type = "log",
title = "c"
),
zaxis = list(
# type = "log",
range = c(z_min, z_max),
title = "outcome"
),
camera = list(eye = list(x = -1.25, y = -1.25, z = 0.5)),
aspectratio = list(x = .9, y = .8, z = 0.7)
)
)

`b^c - b*c_f` <- function(b, c) `b^c_f`(b,c) - `b*c_f`(b,c)
mat1 <- outer(values_b, values_c, `b^c - b*c_f`)
mat1[is.infinite(mat1)] = NA
opacity <- .3
plotly::plot_ly(
x = ~values_b,
y = ~values_c
) |>
plotly::add_surface(
z = ~ t(mat1),
contours = list(
z = list(
show = TRUE,
start = 0,
end = 1,
size = 1,
color = "red"
)
),
name = "b^c - b*c",
showscale = TRUE,
opacity = opacity
) |>
plotly::layout(
scene = list(
xaxis = list(
# type = "log",
title = "b"
),
yaxis = list(
# type = "log",
title = "c"
),
zaxis = list(
title = "outcome"
),
camera = list(eye = list(x = -1.25, y = -1.25, z = 0.5)),
aspectratio = list(x = .9, y = .8, z = 0.7)
)
)

Theorem 19 (\(\operatorname{exp}\mathopen{}\left\{\right\}\mathclose{}\) and \(\operatorname{log}\mathopen{}\left\{\right\}\mathclose{}\) are mutual inverses) \[\operatorname{exp}\mathopen{}\left\{\operatorname{log}\mathopen{}\left\{a\right\}\mathclose{}\right\}\mathclose{} = \operatorname{log}\mathopen{}\left\{\operatorname{exp}\mathopen{}\left\{a\right\}\mathclose{}\right\}\mathclose{} = a\]
Theorem 20 (Constant rule) \[\frac{\partial}{\partial x}c = 0\]
Theorem 21 (Constant factor rule) If \(a\) is constant with respect to \(x\), then: \[\frac{\partial}{\partial x}ay = a \frac{\partial y}{\partial x}\]
Theorem 22 (Power rule) \[\frac{\partial}{\partial x}x^q = qx^{q-1}\]
Theorem 23 (Derivative of natural logarithm) \[\operatorname{log}'\mathopen{}\left\{x\right\}\mathclose{} = \frac{1}{x} = x^{-1}\]
Theorem 24 (derivative of exponential) \[\operatorname{exp}'\mathopen{}\left\{x\right\}\mathclose{} = \operatorname{exp}\mathopen{}\left\{x\right\}\mathclose{}\]
Theorem 25 (Product rule) \[(ab)' = ab' + ba'\]
Theorem 26 (Quotient rule) \[(a/b)' = a'/b - (a/b^2)b'\]
Theorem 27 (Chain rule) \[\begin{aligned} \frac{\partial a}{\partial c} &= \frac{\partial a}{\partial b} \frac{\partial b}{\partial c} \\ &= \frac{\partial b}{\partial c} \frac{\partial a}{\partial b} \end{aligned} \]
or in Euler/Lagrange notation:
\[(f(g(x)))' = g'(x) f'(g(x))\]
Corollary 4 (Chain rule for logarithms) \[ \frac{\partial}{\partial x}\log{f(x)} = \frac{f'(x)}{f(x)} \]
Proof. Apply Theorem 27 and Theorem 23.
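Corollary 4 can be checked numerically with a central finite difference. The sketch below assumes a hypothetical example function \(f(x) = x^2 + 1\) (chosen so that \(f(x) > 0\) everywhere) and a helper `num_deriv()` that is not part of base R; both are illustrations, not anything defined in these notes.

```r
# Central-difference approximation to a derivative (illustrative helper)
num_deriv <- function(f, x, h = 1e-6) {
  (f(x + h) - f(x - h)) / (2 * h)
}

# Hypothetical example: f(x) = x^2 + 1, with f'(x) = 2x by the power rule
f <- function(x) x^2 + 1
f_prime <- function(x) 2 * x

x0 <- 1.5
lhs <- num_deriv(function(x) log(f(x)), x0)  # d/dx log f(x), approximated
rhs <- f_prime(x0) / f(x0)                   # f'(x) / f(x), from Corollary 4

print(c(numerical = lhs, formula = rhs))
```

The two values should agree to several decimal places; the remaining gap comes from the finite step size `h` and floating-point rounding.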
Integration is the inverse operation of differentiation: it recovers a function from its derivative and accumulates quantities such as areas, totals, and probabilities. We begin with antiderivatives, then state basic integration rules, and conclude with the Fundamental Theorem of Calculus and a worked example from probability.
Definition 7 (Antiderivative) A function \(F\) is an antiderivative of \(f\) on an interval \(I\) if:
\[\frac{\partial}{\partial x} F(x) = f(x), \quad \forall x \in I\]
The family of all antiderivatives of \(f\) is written as the indefinite integral:
\[\int f(x)\,dx = F(x) + C\]
where \(C\) is an arbitrary constant of integration.
(Larson and Edwards 2018, sec. 4.1, pp. 248–249)
Example 3 (Antiderivative of a power function) For \(f(x) = x^2\), an antiderivative is \(F(x) = \frac{x^3}{3}\), since \(\frac{\partial}{\partial x}\frac{x^3}{3} = x^2 = f(x)\).
Adding any constant \(C\) gives another antiderivative; for example, with \(C = 7\), \(F(x) = \frac{x^3}{3} + 7\) also satisfies \(F'(x) = x^2\), since adding a constant does not change the derivative. See Figure 3.
Theorem 28 (Basic integration rules) Each antiderivative below is defined only up to an arbitrary constant \(C\) (see Definition 7); the table omits \(+ C\) from every row for brevity.
| Function \(f(x)\) | Antiderivative \(F(x)\) | Condition |
|---|---|---|
| \(c\) | \(cx\) | — |
| \(x^n\) | \(\dfrac{x^{n+1}}{n+1}\) | \(n \ne -1\) |
| \(\dfrac{1}{x}\) | \(\ln\mathopen{}\left|x\right|\mathclose{}\) | \(x \ne 0\) |
| \(\text{e}^{x}\) | \(\text{e}^{x}\) | (self-antiderivative) |
| \(\text{e}^{cx}\) | \(\dfrac{1}{c}\text{e}^{cx}\) | \(c \ne 0\) |
| \(c \cdot f(x)\) | \(c \cdot F(x)\) | — |
| \(f(x) + g(x)\) | \(F(x) + G(x)\) | — |
The first two rows and the bottom two rows (linearity) are from (Larson and Edwards 2018, sec. 4.1, p. 250 “Basic Integration Rules”); \(1/x\) is from (Larson and Edwards 2018, sec. 5.2, Theorem 5.5, p. 324); \(\text{e}^{x}\) and \(\text{e}^{cx}\) are from (Larson and Edwards 2018, sec. 5.4, Theorem 5.12, p. 346).
Example 4 (Antiderivative of \(3x^2 - 1\)) By the power rule (\(n = 2\)) and linearity from Theorem 28:
\[ \int \mathopen{}\left(3x^2 - 1\right)\mathclose{}\,dx = 3 \cdot\frac{x^3}{3} - x + C = x^3 - x + C. \]
Verify by differentiating: \(\frac{\partial}{\partial x}\mathopen{}\left(x^3 - x + C\right)\mathclose{} = 3x^2 - 1 = f(x)\), as required.
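The same verification can be sketched in R using the base function `D()`, which differentiates simple expressions symbolically. The evaluation points `xs` and the constant `C = 7` are arbitrary choices for illustration.

```r
# Candidate antiderivative from Example 4, including the constant C
F_expr <- quote(x^3 - x + C)

# Symbolic derivative with respect to x
dF <- D(F_expr, "x")
print(dF)

# Compare the derivative with f(x) = 3x^2 - 1 at a few arbitrary points
xs <- c(-2, 0, 0.5, 3)
derivative_values <- eval(dF, list(x = xs, C = 7))
target_values <- 3 * xs^2 - 1

print(cbind(x = xs, derivative = derivative_values, f = target_values))
```

Because the derivative of the constant is zero, the result does not depend on the value supplied for `C`.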
Definition 8 (Differentiable function) A function \(f\) is differentiable at \(x = c\) if the limit
\[f'(c) = \lim_{h \to 0} \frac{f(c + h) - f(c)}{h}\]
exists and is finite. \(f\) is differentiable on an interval if it is differentiable at every interior point; at a closed endpoint, the appropriate one-sided derivative is used.
(Larson and Edwards 2018, sec. 2.1, p. 100)
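Definition 9 (Continuous function) A function \(f\) is continuous at \(x = c\) if all three conditions hold: (1) \(f(c)\) is defined; (2) \(\lim_{x \to c} f(x)\) exists; and (3) \(\lim_{x \to c} f(x) = f(c)\).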
\(f\) is continuous on a closed interval \([a, b]\) if it is continuous at every point of \([a, b]\).
(Larson and Edwards 2018, sec. 1.4, p. 73)
Definition 10 (Riemann integrable) A bounded function \(f\) is Riemann integrable on \([a, b]\) if the Riemann integral
\[\int_a^b f(x)\,dx = \lim_{n \to \infty} \sum_{i=1}^n f(x_i^*)\,\Delta x, \quad \Delta x = \frac{b - a}{n},\]
exists and is finite (for equal-width partitions of width \(\Delta x = (b - a)/n\)), where \(x_i^*\) is any point in the \(i\)-th subinterval.
(Larson and Edwards 2018, sec. 4.3, p. 272)
General Riemann integrability
More generally, using partitions \(\mathcal{P}\) of arbitrary mesh — subintervals of varying widths \(\Delta x_i\) — a bounded function \(f\) is Riemann integrable on \([a, b]\) if
\[\int_a^b f(x)\,dx = \lim_{\|\mathcal{P}\| \to 0} \sum_{i=1}^n f(x_i^*)\,\Delta x_i\]
exists and is finite, where \(\|\mathcal{P}\| = \max_i \Delta x_i\) is the mesh of the partition.
Theorem 29 (Equivalence of Riemann sum formulations) For continuous \(f\) on a closed interval \([a, b]\), the equal-width Riemann sum (Definition 10) and the arbitrary-mesh Riemann sum (in the callout above) give the same value (Rudin 1976, chap. 6). The equal-width form in Definition 10 is used throughout Epi 204.
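To make the equal-width Riemann sum in Definition 10 concrete, here is a minimal sketch in R. The helper name `riemann_sum()`, the choice of midpoints for \(x_i^*\), and the test function \(f(x) = x^2\) on \([0, 1]\) are all assumptions made for illustration; the definition allows any point in each subinterval.

```r
# Equal-width Riemann sum with x_i^* taken as the midpoint of each subinterval
riemann_sum <- function(f, a, b, n) {
  dx <- (b - a) / n                        # common width of the n subintervals
  x_star <- a + (seq_len(n) - 0.5) * dx    # midpoint of the i-th subinterval
  sum(f(x_star)) * dx
}

# Hypothetical test case: the exact integral of x^2 over [0, 1] is 1/3
f <- function(x) x^2
n_values <- c(10, 100, 1000)
approximations <- sapply(n_values, function(n) riemann_sum(f, 0, 1, n))
errors <- abs(approximations - 1 / 3)

print(data.frame(n = n_values, approximation = approximations, error = errors))
```

The error shrinks as \(n\) grows, which is the limiting behavior the definition describes.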
Before stating the Fundamental Theorem of Calculus, we record two prerequisite results. The FTC requires the integrand \(f\) to be continuous, and the two theorems below establish where continuity comes from (differentiability \(\Rightarrow\) continuity) and what it buys us (continuity \(\Rightarrow\) integrability).
Theorem 30 (Differentiability implies continuity) If \(f\) is differentiable at \(x = c\), then \(f\) is continuous at \(x = c\).
(Larson and Edwards 2018, Theorem 2.1, p. 106)
Example 5 (Differentiable, hence continuous: \(x^3 - x\)) \(f(x) = x^3 - x\) is differentiable everywhere (with derivative \(f'(x) = 3x^2 - 1\)), so by Theorem 30 it is continuous everywhere.
Example 6 (Continuous but not differentiable: \(\mathopen{}\left|x\right|\mathclose{}\)) The absolute-value function \(f(x) = \mathopen{}\left|x\right|\mathclose{}\) is continuous at \(x = 0\) (\(\lim_{x \to 0}\mathopen{}\left|x\right|\mathclose{} = 0 = \mathopen{}\left|0\right|\mathclose{}\)), but it is not differentiable at \(x = 0\): the left-derivative is \(-1\) and the right-derivative is \(+1\).
This shows that the converse of Theorem 30 fails: continuity does not imply differentiability. See Figure 4.
Theorem 31 (Continuity implies integrability) If \(f\) is continuous on the closed interval \([a, b]\), then \(f\) is integrable on \([a, b]\) (i.e., the Riemann integral \(\int_a^b f(x)\,dx\) exists and is finite).
(Larson and Edwards 2018, Theorem 4.4, p. 272)
Example 7 (Continuous, hence integrable: polynomials) Every polynomial is continuous on \(\mathbb{R}\), so by Theorem 31 every polynomial is integrable on every closed interval \([a, b]\).
Example 8 (Integrable but not continuous: a step function) Let \(f(x) = 0\) for \(x < \tfrac{1}{2}\) and \(f(x) = 1\) for \(x \ge \tfrac{1}{2}\). Then \(f\) is discontinuous at \(x = \tfrac{1}{2}\), but it is integrable on \([0, 1]\):
\[ \int_0^1 f(x)\,dx = \int_0^{1/2} 0\,dx + \int_{1/2}^1 1\,dx = 0 + \tfrac{1}{2} = \tfrac{1}{2}. \]
This shows that the converse of Theorem 31 fails: integrability does not imply continuity. See Figure 5.
step_df <- data.frame(
x = c(0, 0.5, 0.5, 1),
y = c(0, 0, 1, 1),
segment = c("left", "left", "right", "right")
)
ggplot() +
geom_rect(
aes(xmin = 0.5, xmax = 1, ymin = 0, ymax = 1),
fill = "steelblue", alpha = 0.3
) +
geom_line(
data = step_df,
aes(x = x, y = y, group = segment),
linewidth = 1
) +
geom_point(aes(x = 0.5, y = 0), shape = 1, size = 3) +
geom_point(aes(x = 0.5, y = 1), shape = 16, size = 3) +
scale_x_continuous(breaks = c(0, 0.5, 1), labels = c("0", "1/2", "1")) +
scale_y_continuous(limits = c(-0.1, 1.2)) +
labs(x = "x", y = "f(x)") +
theme_minimal()

Together, Theorem 30 and Theorem 31 establish the chain:
\[\text{differentiable} \;\Rightarrow\; \text{continuous} \;\Rightarrow\; \text{integrable}\]
Example 6 and Example 8 show that neither implication reverses in general.
Theorem 32 (Fundamental Theorem of Calculus) Let \(f\) be a continuous function on a closed interval \([a, b]\).
Part 1 (Derivative of an integral). Define \(F(x) = \int_a^x f(t)\,dt\) for \(x \in [a, b]\). Then \(F\) is differentiable and:
\[\frac{\partial}{\partial x}\int_a^x f(t)\,dt = f(x) \tag{2}\]
Note
Continuity on all of \([a, b]\) is a sufficient condition. More generally, Part 1 holds at any individual point \(x\) where \(f\) is integrable on \([a, b]\) (see Definition 10) and continuous at \(x\) (see Definition 9), even if \(f\) has jump discontinuities elsewhere (Rudin 1976, Theorem 6.20, p. 133).
(Larson and Edwards 2018, Theorem 4.11, p. 288)
Part 2 (Evaluation theorem). The \(F\) here may be any antiderivative of \(f\) — not just the accumulation function from Part 1. If \(F\) is an antiderivative of \(f\) on \([a, b]\) (i.e., \(\frac{\partial}{\partial x} F(x) = f(x)\) for all \(x \in [a, b]\)), then:
\[\int_a^b f(x)\,dx = F(b) - F(a) \tag{3}\]
Equivalently, with \(b\) replaced by a variable upper limit \(x\), integrating the derivative of \(F\) recovers the net change in \(F\):
\[\int_a^x F'(t)\,dt = F(x) - F(a) \tag{4}\]
or equivalently in Leibniz notation:
\[\int_a^x \frac{d F}{d t}\,dt = F(x) - F(a)\]
(Banner 2007, chap. 18; Larson and Edwards 2018, Theorem 4.9, p. 282)
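The two parts of the FTC together express that differentiation and integration are inverse operations: by Part 1, differentiating the accumulated integral of \(f\) returns \(f\) (Equation 2); by Part 2, integrating the derivative of \(F\) returns the net change in \(F\) (Equation 4).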
The standard form of the FTC assumes \(f\) is continuous on \([a, b]\); continuity is sufficient but not strictly necessary (see the preceding callout note for the more general statement). Since differentiability implies continuity (Theorem 30), the FTC applies in particular whenever \(f\) is differentiable — a common situation in Epi 204.
Example 9 (FTC Part 1 visualized: accumulation function for \(f(t) = 2t\)) Take \(f(t) = 2t\) on \([0, 2]\). The accumulation function from \(0\) is
\[F(x) \;\stackrel{\text{def}}{=}\; \int_0^x 2t\,dt \;=\; \mathopen{}\left[t^2\right]\mathclose{}_{t=0}^{t=x} \;=\; x^2 - 0^2 \;=\; x^2,\]
so \(F(x) = x^2\), and indeed \(F'(x) = 2x = f(x)\), as Theorem 32 Part 1 predicts. Figure 6 shows the integrand on the left (shaded area equals \(F(x)\) at each \(x\)) and the accumulation function \(F(x) = x^2\) on the right (its slope at \(x\) equals \(f(x) = 2x\)).
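Example 9 can also be checked numerically. The sketch below builds the accumulation function with base R's `integrate()` and then differentiates it with a central difference; the evaluation point `x0` and step size `h` are arbitrary choices for illustration.

```r
# Integrand from Example 9
f <- function(t) 2 * t

# Accumulation function F(x) = integral of f from 0 to x, computed numerically
F_acc <- function(x) integrate(f, lower = 0, upper = x)$value

x0 <- 1.5    # arbitrary point in [0, 2]
h <- 1e-4    # finite-difference step

area <- F_acc(x0)                                    # should be close to x0^2
slope <- (F_acc(x0 + h) - F_acc(x0 - h)) / (2 * h)   # should be close to f(x0)

print(c(area = area, x0_squared = x0^2))
print(c(slope = slope, f_at_x0 = f(x0)))
```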
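# Setup assumed for illustration: the original setup chunk is not shown,
# so the values of x_marks and x_focus below are hypothetical.
library(ggplot2)
x_marks <- c(0.5, 1, 1.5)
x_focus <- max(x_marks)
ggplot() +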
geom_area(
data = data.frame(t = seq(0, x_focus, length.out = 200)),
aes(x = t, y = 2 * t),
fill = "steelblue", alpha = 0.4
) +
geom_function(fun = \(t) 2 * t, xlim = c(0, 2.2), linewidth = 1) +
geom_vline(
data = data.frame(x = x_marks),
aes(xintercept = x, color = factor(x)),
linetype = "dashed", linewidth = 0.6
) +
labs(x = "t", y = "f(t) = 2t", color = "x") +
theme_minimal() +
theme(legend.position = "bottom")

slope_df <- data.frame(
x = x_marks,
Fx = x_marks^2,
slope = 2 * x_marks
)
ggplot() +
geom_function(fun = \(x) x^2, xlim = c(0, 2.2), linewidth = 1) +
geom_point(
data = slope_df,
aes(x = x, y = Fx, color = factor(x)),
size = 3
) +
geom_segment(
data = slope_df,
aes(
x = x - 0.3, xend = x + 0.3,
y = Fx - 0.3 * slope, yend = Fx + 0.3 * slope,
color = factor(x)
),
linewidth = 0.8
) +
labs(x = "x", y = expression(F(x) == x^2), color = "x") +
theme_minimal() +
theme(legend.position = "bottom")

Example 10 (CDF and PDF of the exponential distribution) In what follows, \(f\) denotes the PDF and \(F\) the CDF, the same letters as the antiderivative pair in Definition 7, because the FTC will show \(F\) is exactly an antiderivative of \(f\).
For the exponential distribution with rate parameter \(\lambda > 0\), the probability density function (PDF) is (Kleinbaum and Klein 2012, sec. II, p. 295, “Survival and Hazard Functions for Selected Distributions”):
\[f(t) = \lambda \text{e}^{-\lambda t}, \quad t \ge 0\]
FTC Part 2 gives the cumulative distribution function (CDF) from the PDF. Apply the \(\text{e}^{cx}\) rule from Theorem 28 with \(c = -\lambda\) to antidifferentiate the integrand:
\[ \begin{aligned} F(t) &= \int_0^t \lambda \text{e}^{-\lambda u}\,du \\ &= \mathopen{}\left[\lambda \cdot\frac{1}{-\lambda}\text{e}^{-\lambda u}\right]\mathclose{}_{u=0}^{u=t} \\ &= \mathopen{}\left[(-1)\text{e}^{-\lambda u}\right]\mathclose{}_{u=0}^{u=t} \\ &= \mathopen{}\left[-\text{e}^{-\lambda u}\right]\mathclose{}_{u=0}^{u=t} \\ &= -\text{e}^{-\lambda t} - \mathopen{}\left(-\text{e}^{0}\right)\mathclose{} \\ &= -\text{e}^{-\lambda t} - (-1) \\ &= 1 - \text{e}^{-\lambda t} \end{aligned} \]
FTC Part 1 recovers the PDF from the CDF:
\[ \begin{aligned} \frac{\partial}{\partial t} F(t) &= \frac{\partial}{\partial t}\mathopen{}\left(1 - \text{e}^{-\lambda t}\right)\mathclose{} \\ &= 0 - (-\lambda)\text{e}^{-\lambda t} \\ &= \lambda\text{e}^{-\lambda t} \\ &= f(t) \end{aligned} \]
For a concrete instance: with \(\lambda = 1\) (standard exponential), the probability that \(T \le 2\) is:
\[ F(2) = 1 - \text{e}^{-1 \cdot 2} = 1 - \text{e}^{-2} \approx 1 - 0.135 = 0.865 \]
See Figure 7.
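The closed-form CDF can be compared against numerical integration of the PDF and against R's built-in exponential CDF `pexp()`. This is a short sketch using \(\lambda = 1\) and \(t = 2\) from the concrete instance above; the function name `pdf_exp` is introduced here only for illustration.

```r
lambda <- 1
t_value <- 2

# PDF of the exponential distribution, as given in Example 10
pdf_exp <- function(t) lambda * exp(-lambda * t)

cdf_closed_form    <- 1 - exp(-lambda * t_value)                        # FTC Part 2 result
cdf_by_integration <- integrate(pdf_exp, lower = 0, upper = t_value)$value
cdf_builtin        <- pexp(t_value, rate = lambda)                      # base R (stats)

print(round(c(
  closed_form = cdf_closed_form,
  integration = cdf_by_integration,
  builtin     = cdf_builtin
), 3))
```

All three values should round to 0.865.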
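# Setup inferred from the text (lambda = 1, t = 2); t_max is an assumed
# plotting limit, since the original setup chunk is not shown.
library(ggplot2)
lambda <- 1
t_focus <- 2
t_max <- 5
F_at_focus <- 1 - exp(-lambda * t_focus)
ggplot() +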
geom_function(
fun = \(t) 1 - exp(-lambda * t),
xlim = c(0, t_max), linewidth = 1
) +
geom_point(
aes(x = t_focus, y = F_at_focus),
size = 3, color = "steelblue"
) +
geom_segment(
aes(x = t_focus, xend = t_focus, y = 0, yend = F_at_focus),
linetype = "dashed", color = "steelblue"
) +
geom_segment(
aes(x = 0, xend = t_focus, y = F_at_focus, yend = F_at_focus),
linetype = "dashed", color = "steelblue"
) +
labs(x = "t", y = "F(t)") +
theme_minimal()

The Fubini–Tonelli theorem states conditions under which the order of integration in a double integral can be exchanged. We state two versions: the Riemann version (Theorem 33) is what Epi 204 directly uses for double integrals of continuous functions on simple regions; the σ-finite measure-theoretic version (Theorem 34) is included to make the joint-distribution form corollary in the probability chapter follow from a stated theorem rather than from an aside.
Theorem 33 (Fubini’s theorem (Riemann version)) Let \(f\) be continuous on a plane region \(R \subseteq \mathbb{R}^2\).
Vertically simple region. If \(R\) is defined by \(a \le x \le b\) and \(g_1(x) \le y \le g_2(x)\), where \(g_1\) and \(g_2\) are continuous on \([a, b]\), then
\[ \begin{aligned} \iint_R f(x, y)\,dA &= \int_a^b \int_{g_1(x)}^{g_2(x)} f(x, y)\,dy\,dx. \end{aligned} \]
Horizontally simple region. If \(R\) is defined by \(c \le y \le d\) and \(h_1(y) \le x \le h_2(y)\), where \(h_1\) and \(h_2\) are continuous on \([c, d]\), then
\[ \begin{aligned} \iint_R f(x, y)\,dA &= \int_c^d \int_{h_1(y)}^{h_2(y)} f(x, y)\,dx\,dy. \end{aligned} \]
When \(R\) can be described both ways, the two iterated integrals are equal — so the order of integration can be exchanged.
(Larson and Edwards 2018, Theorem 14.2, p. 982)
Theorem 34 (Fubini–Tonelli theorem (measure-theoretic form)) Let \((\Omega_1, \mathcal F_1, \mu_1)\) and \((\Omega_2, \mathcal F_2, \mu_2)\) be σ-finite measure spaces, and let \(f : \Omega_1 \times \Omega_2 \to \mathbb{R}\) be measurable with respect to the product σ-algebra \(\mathcal F_1 \otimes \mathcal F_2\). If either
(a) \(f \ge 0\) almost everywhere with respect to \(\mu_1 \otimes \mu_2\) (Tonelli’s theorem), or
(b) \(\int_{\Omega_1 \times \Omega_2} \mathopen{}\left|f\right|\mathclose{}\,d(\mu_1 \otimes \mu_2) < \infty\) (Fubini’s theorem),
then both iterated integrals exist, agree with the double integral, and equal each other:
\[ \begin{aligned} \int_{\Omega_1 \times \Omega_2} f\,d(\mu_1 \otimes \mu_2) &= \int_{\Omega_1} \mathopen{}\left(\int_{\Omega_2} f(\omega_1, \omega_2)\,d\mu_2(\omega_2)\right)\mathclose{}\,d\mu_1(\omega_1)\\ &= \int_{\Omega_2} \mathopen{}\left(\int_{\Omega_1} f(\omega_1, \omega_2)\,d\mu_1(\omega_1)\right)\mathclose{}\,d\mu_2(\omega_2). \end{aligned} \]
(Billingsley 1995, Theorem 18.3; Gut 2013, Theorem 9.1, p. 65; Fubini 1907; Wikipedia contributors 2024)
Corollary 5 (Continuous functions on a rectangle (corollary of Theorem 33)) If \(f : [a, b] \times [c, d] \to \mathbb{R}\) is continuous on the closed bounded rectangle \([a, b] \times [c, d]\), then:
\[ \begin{aligned} \int_a^b \mathopen{}\left(\int_c^d f(x, y)\,dy\right)\mathclose{}\,dx &= \int_c^d \mathopen{}\left(\int_a^b f(x, y)\,dx\right)\mathclose{}\,dy\\ &= \iint_{[a,b]\times[c,d]} f(x, y)\,dx\,dy. \end{aligned} \]
(Larson and Edwards 2018, Theorem 14.2, p. 982)
Proof. A closed bounded rectangle \([a, b] \times [c, d]\) is both vertically simple (with \(g_1 \equiv c\), \(g_2 \equiv d\)) and horizontally simple (with \(h_1 \equiv a\), \(h_2 \equiv b\)). Applying both parts of Theorem 33 to \(f\) on this rectangle gives the two iterated forms shown.
Example 11 (Evaluating a double integral on a rectangle) Adapted from (Larson and Edwards 2018, sec. 14.2, Example 2, pp. 982–983).
Evaluate \(\displaystyle\iint_R \mathopen{}\left(1 - \frac{1}{2}x^2 - \frac{1}{2}y^2\right)\mathclose{}\,dA\) on the unit square \(R = \{(x, y) : 0 \le x \le 1,\; 0 \le y \le 1\}\).
The integrand is continuous on \(R\), so Corollary 5 applies and either order of integration yields the same value.
Integrating \(y\) first, then \(x\) (the order Larson chooses):
\[ \begin{aligned} \iint_R \mathopen{}\left(1 - \frac{1}{2}x^2 - \frac{1}{2}y^2\right)\mathclose{}\,dA &= \int_0^1\!\int_0^1 \mathopen{}\left(1 - \frac{1}{2}x^2 - \frac{1}{2}y^2\right)\mathclose{}\,dy\,dx \\&= \int_0^1 \mathopen{}\left[\mathopen{}\left(1 - \frac{1}{2}x^2\right)\mathclose{}y - \frac{y^3}{6}\right]\mathclose{}_0^1\,dx \\&= \int_0^1 \mathopen{}\left(\frac{5}{6} - \frac{1}{2}x^2\right)\mathclose{}\,dx \\&= \mathopen{}\left[\frac{5}{6}x - \frac{x^3}{6}\right]\mathclose{}_0^1 \\&= \frac{2}{3} \end{aligned} \]
Integrating \(x\) first, then \(y\) (verifying the order can be swapped):
The integrand is symmetric in \(x\) and \(y\), so the same arithmetic with the roles swapped gives:
\[ \int_0^1\!\int_0^1 \mathopen{}\left(1 - \frac{1}{2}x^2 - \frac{1}{2}y^2\right)\mathclose{}\,dx\,dy = \frac{2}{3} \]
Both orders give \(\frac{2}{3}\), as Corollary 5 guarantees.
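Both iterated integrals can be approximated numerically by nesting base R's `integrate()`. The sketch below is one way to do this; the helper names `inner_over_y` and `inner_over_x` are introduced here for illustration, and `sapply()` is needed because `integrate()` passes a vector of evaluation points to its integrand.

```r
# Integrand from Example 11
f <- function(x, y) 1 - x^2 / 2 - y^2 / 2

# Order dy dx: for each fixed x, integrate over y in [0, 1]
inner_over_y <- function(x) {
  sapply(x, function(xi) integrate(function(y) f(xi, y), 0, 1)$value)
}
dy_dx <- integrate(inner_over_y, 0, 1)$value

# Order dx dy: for each fixed y, integrate over x in [0, 1]
inner_over_x <- function(y) {
  sapply(y, function(yi) integrate(function(x) f(x, yi), 0, 1)$value)
}
dx_dy <- integrate(inner_over_x, 0, 1)$value

print(c(dy_dx = dy_dx, dx_dy = dx_dy, exact = 2 / 3))
```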
n_grid <- 41
x_seq <- seq(0, 1, length.out = n_grid)
y_seq <- seq(0, 1, length.out = n_grid)
z_mat <- outer(x_seq, y_seq, function(x, y) 1 - x^2 / 2 - y^2 / 2)
plotly::plot_ly(x = ~x_seq, y = ~y_seq, z = ~t(z_mat)) |>
plotly::add_surface(showscale = FALSE) |>
plotly::layout(scene = list(
xaxis = list(title = "x"),
yaxis = list(title = "y"),
zaxis = list(title = "z", range = c(0, 1))
))

Example 12 (Changing the order of integration for a non-rectangular region) Adapted from (Larson and Edwards 2018, sec. 14.2, Example 4, pp. 984–985).
Find the volume of the solid bounded by the surface \(z = \text{e}^{-x^2}\) and the planes \(z = 0\), \(y = 0\), \(y = x\), and \(x = 1\).
The base of the solid in the \(xy\)-plane is the triangular region \(D = \{(x, y) : 0 \le x \le 1,\; 0 \le y \le x\}\), so the volume is
\[\iint_D \text{e}^{-x^2}\,dA.\]
Order \(dx\,dy\) is intractable. Re-describing \(D\) as \(D = \{(x, y) : 0 \le y \le 1,\; y \le x \le 1\}\), the inner integral is
\[\int_y^1 \text{e}^{-x^2}\,dx,\]
which has no elementary antiderivative.
Order \(dy\,dx\) works. Applying Theorem 33 Part 1 (\(\text{e}^{-x^2}\) is continuous and \(D\) is the vertically simple region \(0 \le x \le 1\), \(0 \le y \le x\)):
\[ \begin{aligned} \iint_D \text{e}^{-x^2}\,dA &= \int_0^1\!\int_0^x \text{e}^{-x^2}\,dy\,dx \\&= \int_0^1 \text{e}^{-x^2}\mathopen{}\left(\int_0^x dy\right)\mathclose{}\,dx \\&= \int_0^1 x\,\text{e}^{-x^2}\,dx \\&= \mathopen{}\left[-\frac{1}{2}\,\text{e}^{-x^2}\right]\mathclose{}_0^1 \\&= -\frac{1}{2}\mathopen{}\left(\text{e}^{-1} - 1\right)\mathclose{} \\&= \frac{e - 1}{2e} \\&\approx 0.316 \end{aligned} \]
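The result can be checked numerically with a nested call to base R's `integrate()`, using the \(dy\,dx\) order so that the inner limits are \(0\) and \(x\). This is a sketch only; the helper name `inner_over_y` is an illustration, and `rep()` is used because the integrand does not depend on \(y\) but `integrate()` expects one value per evaluation point.

```r
# Inner integral over y from 0 to x, for each x supplied by the outer integrate()
inner_over_y <- function(x) {
  sapply(x, function(xi) {
    integrate(function(y) rep(exp(-xi^2), length(y)), lower = 0, upper = xi)$value
  })
}

volume_numeric <- integrate(inner_over_y, lower = 0, upper = 1)$value
volume_exact <- (exp(1) - 1) / (2 * exp(1))   # closed form from the derivation

print(round(c(numeric = volume_numeric, exact = volume_exact), 3))
```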
n_grid <- 51
x_seq <- seq(0, 1, length.out = n_grid)
y_seq <- seq(0, 1, length.out = n_grid)
z_mat <- outer(x_seq, y_seq, function(x, y) {
z <- exp(-x^2)
z[y > x] <- NA
z
})
plotly::plot_ly(x = ~x_seq, y = ~y_seq, z = ~t(z_mat)) |>
plotly::add_surface(showscale = FALSE) |>
plotly::layout(scene = list(
xaxis = list(title = "x"),
yaxis = list(title = "y"),
zaxis = list(title = "z = exp(-x^2)"),
camera = list(eye = list(x = 1.6, y = -1.6, z = 0.8))
))

Example 13 (When conditions fail: a counterexample) The conditions in Theorem 33 are not merely technical: when they fail, iterated integrals can exist yet disagree.
Let \[f(x, y) = \frac{x^2 - y^2}{(x^2 + y^2)^2}\] on the unit square \(R = [0, 1] \times [0, 1]\). Strictly, \(f\) is defined on \(R \setminus \{(0, 0)\}\): the denominator vanishes at the origin, so \(f\) is undefined there (we return to this point below).
Integrating \(y\) first, then \(x\):
Using \(\displaystyle\frac{\partial}{\partial y}\frac{y}{x^2 + y^2} = \frac{x^2 - y^2}{(x^2 + y^2)^2}\):
\[ \begin{aligned} \int_0^1\!\int_0^1 f(x, y)\,dy\,dx &= \int_0^1 \mathopen{}\left[\frac{y}{x^2 + y^2}\right]\mathclose{}_{y=0}^{y=1}\,dx \\&= \int_0^1 \frac{1}{x^2 + 1}\,dx \\&= \mathopen{}\left[\arctan(x)\right]\mathclose{}_0^1 \\&= \frac{\pi}{4} \end{aligned} \]
Integrating \(x\) first, then \(y\):
Using \(\displaystyle\frac{\partial}{\partial x}\mathopen{}\left(-\frac{x}{x^2 + y^2}\right)\mathclose{} = \frac{x^2 - y^2}{(x^2 + y^2)^2}\):
\[ \begin{aligned} \int_0^1\!\int_0^1 f(x, y)\,dx\,dy &= \int_0^1 \mathopen{}\left[-\frac{x}{x^2 + y^2}\right]\mathclose{}_{x=0}^{x=1}\,dy \\&= \int_0^1 \mathopen{}\left(-\frac{1}{1 + y^2}\right)\mathclose{}\,dy \\&= -\mathopen{}\left[\arctan(y)\right]\mathclose{}_0^1 \\&= -\frac{\pi}{4} \end{aligned} \]
Conclusion: \(\dfrac{\pi}{4} \neq -\dfrac{\pi}{4}\), so the two iterated integrals are unequal. Neither Theorem 33 nor Corollary 5 applies here.
Why the hypotheses fail: Theorem 33 requires \(f\) to be continuous on \(R\). The denominator \((x^2 + y^2)^2\) vanishes at the origin \((0, 0) \in R\), so \(f\) is not even defined there — let alone continuous — and the theorem does not apply.
For Theorem 34, the failure shows up as \(\iint_R |f|\,dA = \infty\), which violates the absolute-integrability hypothesis (b). Switching to polar coordinates \((r, \theta)\) near the origin, the integrand satisfies \(|f(x, y)| = \mathopen{}\left|x^2 - y^2\right|\mathclose{}/(x^2 + y^2)^2 = \mathopen{}\left|\cos 2\theta\right|\mathclose{}/r^2\), so
\[ \begin{aligned} \iint_R |f|\,dA &\ge \int_0^{\pi/2}\!\int_0^{\epsilon} \frac{\mathopen{}\left|\cos 2\theta\right|\mathclose{}}{r^2}\, r\,dr\,d\theta\\ &= \mathopen{}\left(\int_0^{\pi/2}\mathopen{}\left|\cos 2\theta\right|\mathclose{}\,d\theta\right)\mathclose{} \int_0^{\epsilon} \frac{dr}{r}\\ &= +\infty, \end{aligned} \]
since \(\int_0^{\epsilon} dr/r\) diverges.
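The two iterated integrals can be reproduced numerically in a cautious way. Because the integrand is singular at the origin, the sketch below does not integrate \(f\) directly near \((0, 0)\); instead it spot-checks the closed-form inner integral at one point away from the origin (the value \(x = 0.7\) is an arbitrary choice) and then integrates the closed-form inner results. The helper names are introduced here for illustration.

```r
# Integrand from Example 13 (undefined at the origin)
f <- function(x, y) (x^2 - y^2) / (x^2 + y^2)^2

# Closed-form inner integrals, taken from the antiderivatives in the text
inner_dy <- function(x) 1 / (x^2 + 1)     # integral of f over y in [0, 1]
inner_dx <- function(y) -1 / (1 + y^2)    # integral of f over x in [0, 1]

# Spot-check the first closed form numerically at x = 0.7
inner_check <- integrate(function(y) f(0.7, y), lower = 0, upper = 1)$value

# Outer integrals in the two orders
dy_then_dx <- integrate(inner_dy, lower = 0, upper = 1)$value
dx_then_dy <- integrate(inner_dx, lower = 0, upper = 1)$value

print(c(inner_check = inner_check, closed_form = inner_dy(0.7)))
print(c(dy_then_dx = dy_then_dx, dx_then_dy = dx_then_dy, pi_over_4 = pi / 4))
```

The two orders give values of opposite sign, matching the hand computation.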
n_grid <- 81
eps <- 0.04
x_seq <- seq(eps, 1, length.out = n_grid)
y_seq <- seq(eps, 1, length.out = n_grid)
z_mat <- outer(x_seq, y_seq, function(x, y) (x^2 - y^2) / (x^2 + y^2)^2)
z_clip <- 50
z_mat[z_mat > z_clip] <- z_clip
z_mat[z_mat < -z_clip] <- -z_clip
plotly::plot_ly(x = ~x_seq, y = ~y_seq, z = ~t(z_mat)) |>
plotly::add_surface(
showscale = FALSE,
colorscale = list(
list(0, "#3b4cc0"),
list(0.5, "#dddddd"),
list(1, "#b40426")
)
) |>
plotly::layout(scene = list(
xaxis = list(title = "x"),
yaxis = list(title = "y"),
zaxis = list(title = "f(x, y)", range = c(-z_clip, z_clip)),
camera = list(eye = list(x = 1.6, y = 1.6, z = 0.6))
))

Definition 11 (Column vector) A column vector of length \(p\) is an ordered list of \(p\) numbers, written vertically:
\[ \tilde{x}= \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{p} \end{bmatrix} \]
Definition 12 (Transpose) The transpose of a column vector \(\tilde{x}\) is the row vector with the same sequence of entries, written horizontally:
\[ {\tilde{x}}^{\top} \equiv \tilde{x}' \equiv [x_1,\; x_2,\; \ldots,\; x_p] \]
Definition 13 (Zero vector) The zero vector \(\tilde{0}\) of length \(p\) has all entries equal to zero:
\[ \tilde{0}= \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \]
Definition 14 (Ones vector) The ones vector \(\tilde{1}\) of length \(p\) has all entries equal to one:
\[ \tilde{1} = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} \]
Definition 15 (Indicator vector / standard basis vector) The \(j\)-th indicator vector (or standard basis vector) \(\tilde{e}_j\) of length \(p\) has a \(1\) in position \(j\) and \(0\)s elsewhere:
\[ (\tilde{e}_j)_i = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases} \qquad \tilde{e}_j = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \leftarrow \text{position } j \]
Theorem 35 (Indicator vectors select entries) For any vector \(\tilde{x}\) of length \(p\) and any \(j \in \{1, \ldots, p\}\):
\[{\tilde{e}_j}^{\top}\tilde{x}= x_j\]
Proof. Writing the product componentwise:
\[ \begin{aligned} {\tilde{e}_j}^{\top}\tilde{x} &= \sum_{i=1}^{p} (\tilde{e}_j)_i\, x_i \\&= \sum_{i=1}^{p} \begin{cases} 1 \cdot x_i & \text{if } i = j \\ 0 \cdot x_i & \text{if } i \neq j \end{cases} \\&= x_j \end{aligned} \]
Definition 16 (Dot product/linear combination/inner product) For any two real-valued vectors \(\tilde{x}= (x_1, \ldots, x_n)\) and \(\tilde{y}= (y_1, \ldots, y_n)\) of the same length \(n\), the dot product (also called the linear combination or inner product) of \(\tilde{x}\) and \(\tilde{y}\) is:
\[\tilde{x}\cdot \tilde{y}= \tilde{x}^{\top} \tilde{y}\stackrel{\text{def}}{=}\sum_{i=1}^nx_i y_i\]
Theorem 36 (Dot product is symmetric) The dot product is symmetric:
\[\tilde{x}\cdot \tilde{y}= \tilde{y}\cdot \tilde{x}\]
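Proof. Apply Theorem 11 (products are symmetric) to each term of the sum in Definition 16:
\[\tilde{x}\cdot \tilde{y}= \sum_{i=1}^n x_i y_i = \sum_{i=1}^n y_i x_i = \tilde{y}\cdot \tilde{x}\]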
Example 14 (Dot product as matrix multiplication) The dot product of two column vectors \(\tilde{x}\) and \(\tilde{\beta}\) can be written as a matrix product of the row vector \({\tilde{x}}^{\top}\) with the column vector \(\tilde{\beta}\):
\[ \begin{aligned} \tilde{x}\cdot \tilde{\beta} &= {\tilde{x}}^{\top}\, \tilde{\beta} \\ &= [x_1,\; x_2,\; \ldots,\; x_p] \begin{bmatrix} \beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_{p} \end{bmatrix} \\ &= x_1\beta_1 + x_2\beta_2 + \cdots + x_p \beta_p \end{aligned} \]
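As a quick numerical check, the following R sketch computes the same dot product two ways: first from the sum in Definition 16, and then as the row-by-column matrix product in Example 14. The values of `x` and `beta` are made up for illustration.

```r
# Hypothetical example vectors of length p = 3.
x <- c(1, 2, 3)
beta <- c(0.5, -1, 2)

# Definition 16: sum of the elementwise products x_i * beta_i.
dot_sum <- sum(x * beta)

# Example 14: (1 x p row vector) %*% (p x 1 column vector) gives a 1 x 1 matrix.
dot_mat <- t(x) %*% beta

# drop() turns the 1 x 1 matrix into a plain number for comparison.
c(dot_sum, drop(dot_mat))
```

Both entries of the printed vector should agree, and swapping `x` and `beta` should leave the result unchanged (Theorem 36).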
Definition 17 (Orthogonal vectors) Two vectors \(\tilde{x}\) and \(\tilde{y}\) of the same length are orthogonal (written \(\tilde{x}\perp \tilde{y}\)) if their dot product is zero:
\[\tilde{x}\perp \tilde{y}\iff {\tilde{x}}^{\top}\tilde{y}= 0\]
Definition 18 (Orthonormal vectors) A set of vectors \(\{\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_k\}\) is orthonormal if the vectors are mutually orthogonal and each has unit length:
\[{\tilde{x}_i}^{\top}\tilde{x}_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}\]
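For a concrete case (the vectors here are chosen only for illustration), consider the following two vectors of length 2:
\[ \tilde{x}_1 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 1 \end{bmatrix} \qquad \tilde{x}_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ -1 \end{bmatrix} \]
Checking each pair against Definition 18:
\[ \begin{aligned} {\tilde{x}_1}^{\top}\tilde{x}_1 &= \tfrac{1}{2}(1 \cdot 1 + 1 \cdot 1) = 1 \\ {\tilde{x}_2}^{\top}\tilde{x}_2 &= \tfrac{1}{2}(1 \cdot 1 + (-1)(-1)) = 1 \\ {\tilde{x}_1}^{\top}\tilde{x}_2 &= \tfrac{1}{2}(1 \cdot 1 + 1 \cdot (-1)) = 0 \end{aligned} \]
So \(\{\tilde{x}_1, \tilde{x}_2\}\) is an orthonormal set. Without the factor \(1/\sqrt{2}\), the two vectors would still be orthogonal (Definition 17), but not orthonormal.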
Definition 19 (Matrix) A matrix of dimensions \(m \times n\) is a rectangular array of \(m \cdot n\) numbers, arranged in \(m\) rows and \(n\) columns:
\[ \mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} \]
Definition 20 (Matrix transpose) The transpose of an \(m \times n\) matrix \(\mathbf{A}\) is the \(n \times m\) matrix \({\mathbf{A}}^{\top}\) obtained by swapping the rows and columns of \(\mathbf{A}\):
\[({\mathbf{A}}^{\top})_{ij} = a_{ji}\]
Theorem 37 (Transpose of a sum) For matrices \(\mathbf{A}\) and \(\mathbf{B}\) of the same dimensions: \[{(\mathbf{A} + \mathbf{B})}^{\top} = {\mathbf{A}}^{\top} + {\mathbf{B}}^{\top}\]
In particular, for column vectors \(\tilde{x}\) and \(\tilde{y}\):
\[{(\tilde{x}+ \tilde{y})}^{\top} = {\tilde{x}}^{\top} + {\tilde{y}}^{\top}\]
Theorem 38 (Transpose of a product) For compatible matrices \(\mathbf{A}\) and \(\mathbf{B}\):
\[{(\mathbf{A}\mathbf{B})}^{\top} = {\mathbf{B}}^{\top}\,{\mathbf{A}}^{\top}\]
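The reversal of order in Theorem 38 is easy to check numerically. The R sketch below uses two small made-up matrices; `A` and `B` are not from the notes.

```r
# Hypothetical compatible matrices: A is 2 x 3, B is 3 x 2.
A <- matrix(1:6, nrow = 2)
B <- matrix(c(1, 0, 2, -1, 3, 1), nrow = 3)

# Left side: transpose of the product (a 2 x 2 matrix).
lhs <- t(A %*% B)

# Right side: product of the transposes, in reverse order.
rhs <- t(B) %*% t(A)

all.equal(lhs, rhs)
```

Note that `t(A) %*% t(B)` (same order, not reversed) is a 3 x 3 matrix here, so it cannot equal the 2 x 2 matrix `t(A %*% B)`.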
Definition 21 (Zero matrix) The \(m \times n\) zero matrix \(\mathbf{0}_{m \times n}\) (or \(\mathbf{0}\) when dimensions are clear from context) has all entries equal to zero:
\[ \mathbf{0}_{m \times n} = \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix} \]
Definition 22 (Matrix addition) Two matrices \(\mathbf{A}\) and \(\mathbf{B}\) of the same dimensions \(m \times n\) can be added element-wise:
\[(\mathbf{A} + \mathbf{B})_{ij} = a_{ij} + b_{ij}\]
Theorem 39 (Matrix addition is commutative) \[\mathbf{A} + \mathbf{B} = \mathbf{B} + \mathbf{A}\]
Theorem 40 (Matrix addition is associative) \[(\mathbf{A} + \mathbf{B}) + \mathbf{C} = \mathbf{A} + (\mathbf{B} + \mathbf{C})\]
Theorem 41 (Zero matrix is the additive identity) \[\mathbf{A} + \mathbf{0} = \mathbf{A}\]
Theorem 42 (Additive inverse) For any matrix \(\mathbf{A}\), the matrix \(-\mathbf{A}\) (defined by \((-\mathbf{A})_{ij} = -a_{ij}\)) satisfies:
\[\mathbf{A} + (-\mathbf{A}) = \mathbf{0}\]
Definition 23 (Scalar multiplication) A matrix \(\mathbf{A}\) can be multiplied by a scalar \(c\):
\[(c\mathbf{A})_{ij} = c \cdot a_{ij}\]
Definition 24 (Matrix multiplication) The product of an \(m \times k\) matrix \(\mathbf{A}\) and a \(k \times n\) matrix \(\mathbf{B}\) is the \(m \times n\) matrix \(\mathbf{C} = \mathbf{A}\mathbf{B}\) with entries:
\[c_{ij} = \sum_{s=1}^{k} a_{is}\, b_{sj}\]
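As a small worked example (the matrices are chosen only for illustration), take \(m = k = n = 2\):
\[ \begin{aligned} \mathbf{A}\mathbf{B} &= \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \\ &= \begin{bmatrix} 1 \cdot 0 + 2 \cdot 1 & 1 \cdot 1 + 2 \cdot 0 \\ 3 \cdot 0 + 4 \cdot 1 & 3 \cdot 1 + 4 \cdot 0 \end{bmatrix} \\ &= \begin{bmatrix} 2 & 1 \\ 4 & 3 \end{bmatrix} \end{aligned} \]
Each entry \(c_{ij}\) is the dot product (Definition 16) of row \(i\) of \(\mathbf{A}\) with column \(j\) of \(\mathbf{B}\).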
Theorem 43 (Matrix multiplication is associative) \[(\mathbf{A}\mathbf{B})\mathbf{C} = \mathbf{A}(\mathbf{B}\mathbf{C})\]
Theorem 44 (Matrix multiplication is distributive over addition) \[\mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{A}\mathbf{B} + \mathbf{A}\mathbf{C}\]
\[(\mathbf{A} + \mathbf{B})\mathbf{C} = \mathbf{A}\mathbf{C} + \mathbf{B}\mathbf{C}\]
Definition 25 (Matrix-vector multiplication) The product of an \(m \times p\) matrix \(\mathbf{A}\) and a \(p \times 1\) column vector \(\tilde{x}\) is the \(m \times 1\) column vector \(\mathbf{A}\tilde{x}\) with entries:
\[(\mathbf{A}\tilde{x})_i = \sum_{j=1}^{p} a_{ij}\, x_j\]
Definition 26 (Square matrix) A matrix is square if it has the same number of rows as columns. The number of rows (= columns) is the order of the matrix.
Definition 27 (Matrix power) For a square matrix \(\mathbf{A}\) of order \(p\) and a positive integer \(k\), the \(k\)-th power of \(\mathbf{A}\) is:
\[\mathbf{A}^k = \underbrace{\mathbf{A}\,\mathbf{A}\cdots\mathbf{A}}_{k \text{ copies}}\]
In particular, \(\mathbf{A}^2 = \mathbf{A}\mathbf{A}\).
Definition 28 (Identity matrix) The \(p \times p\) identity matrix \(\mathbf{I}_p\) (or \(\mathbf{I}\) when the size is clear from context) has ones on the main diagonal and zeros elsewhere:
\[ (\mathbf{I}_p)_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases} \qquad \mathbf{I}_p = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} \]
Theorem 45 (Identity matrix is a multiplicative identity) For any \(m \times p\) matrix \(\mathbf{A}\):
\[\mathbf{A}\,\mathbf{I}_p = \mathbf{A}\]
\[\mathbf{I}_m\,\mathbf{A} = \mathbf{A}\]
Definition 29 (Symmetric matrix) A square matrix \(\mathbf{A}\) is symmetric if \({\mathbf{A}}^{\top} = \mathbf{A}\), i.e., \(a_{ij} = a_{ji}\) for all \(i\) and \(j\).
Definition 30 (Diagonal matrix) A square matrix \(\mathbf{D}\) is a diagonal matrix if all off-diagonal entries are zero: \(d_{ij} = 0\) whenever \(i \neq j\):
\[ \mathbf{D} = \begin{bmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_p \end{bmatrix} \]
Definition 31 (Matrix inverse) For a square \(p \times p\) matrix \(\mathbf{A}\), the inverse \(\mathbf{A}^{-1}\) (if it exists) is the unique matrix satisfying:
\[\mathbf{A}\,\mathbf{A}^{-1} = \mathbf{A}^{-1}\,\mathbf{A} = \mathbf{I}_p\]
Theorem 46 (Inverse of a product) For invertible matrices \(\mathbf{A}\) and \(\mathbf{B}\):
\[(\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}\]
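Theorem 46 can be checked directly against Definition 31, using associativity (Theorem 43) and the identity property (Theorem 45). A sketch of the argument, assuming \(\mathbf{A}\) and \(\mathbf{B}\) are both \(p \times p\):
\[ \begin{aligned} (\mathbf{A}\mathbf{B})(\mathbf{B}^{-1}\mathbf{A}^{-1}) &= \mathbf{A}(\mathbf{B}\mathbf{B}^{-1})\mathbf{A}^{-1} \\ &= \mathbf{A}\,\mathbf{I}_p\,\mathbf{A}^{-1} \\ &= \mathbf{A}\mathbf{A}^{-1} \\ &= \mathbf{I}_p \end{aligned} \]
The same steps applied to \((\mathbf{B}^{-1}\mathbf{A}^{-1})(\mathbf{A}\mathbf{B})\) also give \(\mathbf{I}_p\), so \(\mathbf{B}^{-1}\mathbf{A}^{-1}\) satisfies both conditions in Definition 31.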
Definition 32 (Idempotent matrix) A square matrix \(\mathbf{A}\) is idempotent if
\[\mathbf{A}^2 = \mathbf{A}\]
Definition 33 (Projection matrix) A square matrix \(\mathbf{P}\) is a projection matrix (also called an orthogonal projector) if it is both symmetric and idempotent:
\[{\mathbf{P}}^{\top} = \mathbf{P} \qquad \text{and} \qquad \mathbf{P}^2 = \mathbf{P}\]
Theorem 47 (Complement of a projection matrix) If \(\mathbf{P}\) is a projection matrix, then \(\mathbf{I} - \mathbf{P}\) is also a projection matrix.
Proof. We verify symmetry and idempotency.
Symmetry: \[{(\mathbf{I} - \mathbf{P})}^{\top} = {\mathbf{I}}^{\top} - {\mathbf{P}}^{\top} = \mathbf{I} - \mathbf{P}\]
Idempotency: \[\begin{aligned} (\mathbf{I} - \mathbf{P})^2 &= (\mathbf{I} - \mathbf{P})(\mathbf{I} - \mathbf{P}) \\ &= \mathbf{I} - \mathbf{P} - \mathbf{P} + \mathbf{P}^2 \\ &= \mathbf{I} - \mathbf{P} - \mathbf{P} + \mathbf{P} \\ &= \mathbf{I} - \mathbf{P} \end{aligned}\]
Theorem 48 (Hat matrix is a projection matrix) In a linear regression model with full-rank design matrix \(\mathbf{X}\), the hat matrix
\[\mathbf{H} = \mathbf{X}({\mathbf{X}}^{\top}\mathbf{X})^{-1}{\mathbf{X}}^{\top}\]
is a projection matrix.
Proof. We verify symmetry and idempotency.
Symmetry: \[\begin{aligned} {\mathbf{H}}^{\top} &= {\left(\mathbf{X}({\mathbf{X}}^{\top}\mathbf{X})^{-1}{\mathbf{X}}^{\top}\right)}^{\top} \\ &= {({\mathbf{X}}^{\top})}^{\top} \cdot {\left(({\mathbf{X}}^{\top}\mathbf{X})^{-1}\right)}^{\top} \cdot {\mathbf{X}}^{\top} \\ &= \mathbf{X}\cdot ({\mathbf{X}}^{\top}\mathbf{X})^{-1} \cdot {\mathbf{X}}^{\top} \\ &= \mathbf{H} \end{aligned}\]
where the third line uses \({({\mathbf{X}}^{\top})}^{\top} = \mathbf{X}\) and the fact that \({\mathbf{X}}^{\top}\mathbf{X}\) is symmetric, so its inverse is also symmetric (\({\left(({\mathbf{X}}^{\top}\mathbf{X})^{-1}\right)}^{\top} = ({\mathbf{X}}^{\top}\mathbf{X})^{-1}\)).
Idempotency: \[\begin{aligned} \mathbf{H}^2 &= \mathbf{X}({\mathbf{X}}^{\top}\mathbf{X})^{-1}{\mathbf{X}}^{\top} \cdot \mathbf{X}({\mathbf{X}}^{\top}\mathbf{X})^{-1}{\mathbf{X}}^{\top} \\ &= \mathbf{X}({\mathbf{X}}^{\top}\mathbf{X})^{-1}({\mathbf{X}}^{\top}\mathbf{X})({\mathbf{X}}^{\top}\mathbf{X})^{-1}{\mathbf{X}}^{\top} \\ &= \mathbf{X}({\mathbf{X}}^{\top}\mathbf{X})^{-1}{\mathbf{X}}^{\top} \\ &= \mathbf{H} \end{aligned}\]
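The two properties can also be checked numerically. The R sketch below builds the hat matrix for a small made-up design matrix (an intercept column plus one predictor). The explicit call to `solve()` mirrors the formula for transparency; it is not meant as a recommendation for how to fit regression models in practice.

```r
# Hypothetical full-rank design matrix: n = 5 rows, p = 2 columns.
X <- cbind(1, c(1, 2, 3, 4, 5))

# H = X (X'X)^{-1} X'; solve() inverts the p x p matrix X'X.
H <- X %*% solve(t(X) %*% X) %*% t(X)

# Symmetry: H equals its transpose (up to floating-point error).
is_symmetric <- isTRUE(all.equal(H, t(H)))

# Idempotency: applying H twice is the same as applying it once.
is_idempotent <- isTRUE(all.equal(H %*% H, H))

c(symmetric = is_symmetric, idempotent = is_idempotent)
```

The comparisons use `all.equal()` rather than `==` because the matrix products are computed in floating-point arithmetic.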
Theorem 49 (Projection matrices produce orthogonal decompositions) If \(\mathbf{P}\) is a projection matrix and \(\tilde{v}\) is any vector of compatible dimension, then the two components of the decomposition
\[\tilde{v} = \underbrace{\mathbf{P}\tilde{v}}_{\text{projected}} + \underbrace{(\mathbf{I} - \mathbf{P})\tilde{v}}_{\text{residual}}\]
are orthogonal:
\[\mathbf{P}\tilde{v} \;\perp\; (\mathbf{I} - \mathbf{P})\tilde{v}\]
Proof. \[\begin{aligned} {(\mathbf{P}\tilde{v})}^{\top}\,(\mathbf{I} - \mathbf{P})\tilde{v} &= {\tilde{v}}^{\top}\,{\mathbf{P}}^{\top}\,(\mathbf{I} - \mathbf{P})\tilde{v} \\ &= {\tilde{v}}^{\top}\,\mathbf{P}\,(\mathbf{I} - \mathbf{P})\tilde{v} \\ &= {\tilde{v}}^{\top}\,(\mathbf{P} - \mathbf{P}^2)\tilde{v} \\ &= {\tilde{v}}^{\top}\,(\mathbf{P} - \mathbf{P})\tilde{v} \\ &= {\tilde{v}}^{\top}\,\mathbf{0}\,\tilde{v} \\ &= 0 \end{aligned}\]
where the second line uses symmetry (\({\mathbf{P}}^{\top} = \mathbf{P}\)) and the fourth line uses idempotency (\(\mathbf{P}^2 = \mathbf{P}\)).
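To see Theorem 49 in action, the R sketch below projects a made-up vector `v` using the hat matrix of a small made-up design matrix (a projection matrix by Theorem 48) and checks that the two components are orthogonal and add back up to `v`.

```r
# Hypothetical design matrix and the projection matrix it induces.
X <- cbind(1, c(1, 2, 3, 4, 5))
P <- X %*% solve(t(X) %*% X) %*% t(X)

# Hypothetical vector to decompose.
v <- c(2, -1, 0, 3, 5)

# The two components of the decomposition v = P v + (I - P) v.
projected <- drop(P %*% v)
residual <- drop((diag(5) - P) %*% v)

# Their dot product should be zero up to floating-point error.
dot <- sum(projected * residual)
dot
```

In regression terms, `projected` plays the role of the fitted values and `residual` the role of the residuals when `v` is the outcome vector.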
Definition 34 (Quadratic form) A quadratic form is a mathematical expression of the form
\[{\tilde{x}}^{\top}\, \mathbf{S}\, \tilde{x}\]
where \(\tilde{x}\) is a \(p \times 1\) vector and \(\mathbf{S}\) is a \(p \times p\) matrix.
Theorem 50 (Symmetric part of a quadratic form) If \(\mathbf{S}\) is a square matrix, then
\[ {\tilde{x}}^{\top}\mathbf{S}\tilde{x} = {\tilde{x}}^{\top}\left(\frac{1}{2}(\mathbf{S}+{\mathbf{S}}^{\top})\right)\tilde{x}. \]
So the value of a quadratic form depends only on the symmetric part of \(\mathbf{S}\).
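A short argument for Theorem 50: the quadratic form is a scalar, so it equals its own transpose, and Theorem 38 lets us transpose the product factor by factor:
\[ {\tilde{x}}^{\top}\mathbf{S}\tilde{x} = {\left({\tilde{x}}^{\top}\mathbf{S}\tilde{x}\right)}^{\top} = {\tilde{x}}^{\top}{\mathbf{S}}^{\top}\tilde{x} \]
Averaging the two equal expressions and using distributivity (Theorem 44) gives:
\[ \begin{aligned} {\tilde{x}}^{\top}\mathbf{S}\tilde{x} &= \frac{1}{2}\left({\tilde{x}}^{\top}\mathbf{S}\tilde{x} + {\tilde{x}}^{\top}{\mathbf{S}}^{\top}\tilde{x}\right) \\ &= {\tilde{x}}^{\top}\left(\frac{1}{2}(\mathbf{S}+{\mathbf{S}}^{\top})\right)\tilde{x} \end{aligned} \]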
Definition 35 (Design matrix) In a regression model with \(n\) observations and \(p\) predictors, the design matrix (or model matrix) \(\mathbf{X}\) is the \(n \times p\) matrix whose \(i\)-th row is the covariate vector \({\tilde{x}_i}^{\top}\) for observation \(i\):
\[ \mathbf{X}= \begin{bmatrix} {\tilde{x}_1}^{\top} \\ {\tilde{x}_2}^{\top} \\ \vdots \\ {\tilde{x}_n}^{\top} \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} \]
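In R, `model.matrix()` builds a design matrix from a formula and a data frame. The sketch below uses a hypothetical data frame `dat` with made-up columns `age` and `dose`. Note that R adds an intercept column of ones by default, so the result has one more column than the number of predictors named in the formula.

```r
# Hypothetical data: n = 4 observations, two predictors.
dat <- data.frame(
  age = c(30, 45, 52, 61),
  dose = c(1, 2, 1, 3)
)

# Design matrix with an intercept column plus one column per predictor.
X <- model.matrix(~ age + dose, data = dat)

dim(X)   # number of rows, number of columns
X[2, ]   # covariate vector for observation 2
```

Each row of `X` is the (transposed) covariate vector for one observation, matching the layout in Definition 35.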
(adapted from Fieller (2016), §7.2)
Let \(\tilde{x}\) and \(\tilde{\beta}\) be column vectors of length \(p\) (see Definition 11 and Definition 16).
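Definition 36 (Vector derivative) If \(f(\tilde{\beta})\) is a function that takes a vector \(\tilde{\beta}\) as input, such as \(f(\tilde{\beta}) = {\tilde{x}}^{\top}\tilde{\beta}\), then the derivative of \(f\) with respect to \(\tilde{\beta}\) is the column vector of its partial derivatives: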
\[ \frac{\partial}{\partial \tilde{\beta}} f(\tilde{\beta}) = \begin{bmatrix} \frac{\partial}{\partial \beta_1}f(\tilde{\beta}) \\ \frac{\partial}{\partial \beta_2}f(\tilde{\beta}) \\ \vdots \\ \frac{\partial}{\partial \beta_p}f(\tilde{\beta}) \end{bmatrix} \]
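Definition 37 (Row-vector derivative) If \(f(\tilde{\beta})\) is a function that takes a vector \(\tilde{\beta}\) as input, such as \(f(\tilde{\beta}) = {\tilde{x}}^{\top}\tilde{\beta}\), then the derivative of \(f\) with respect to \({\tilde{\beta}}^{\top}\) is the row vector of its partial derivatives: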
\[ \frac{\partial}{\partial \tilde{\beta}^{\top}} f(\tilde{\beta}) = \begin{bmatrix} \frac{\partial}{\partial \beta_1}f(\tilde{\beta}) & \frac{\partial}{\partial \beta_2}f(\tilde{\beta}) & \cdots & \frac{\partial}{\partial \beta_p}f(\tilde{\beta}) \end{bmatrix} \]
Theorem 51 (Row and column derivatives are transposes) \[\frac{\partial}{\partial \tilde{\beta}^{\top}} f(\tilde{\beta}) = \mathopen{}\left(\frac{\partial}{\partial \tilde{\beta}} f(\tilde{\beta})\right)\mathclose{}^{\top}\]
\[\frac{\partial}{\partial \tilde{\beta}} f(\tilde{\beta}) = \mathopen{}\left(\frac{\partial}{\partial \tilde{\beta}^{\top}} f(\tilde{\beta})\right)\mathclose{}^{\top}\]
Theorem 52 (Derivative of a dot product) \[ \frac{\partial}{\partial \tilde{\beta}} \tilde{x}\cdot \tilde{\beta}= \frac{\partial}{\partial \tilde{\beta}} \tilde{\beta}\cdot \tilde{x}= \tilde{x} \]
Proof. \[ \begin{aligned} \frac{\partial}{\partial \tilde{\beta}} ({\tilde{x}}^{\top}\tilde{\beta}) &= \begin{bmatrix} \frac{\partial}{\partial \beta_1}(x_1\beta_1+x_2\beta_2 +\cdots+x_p \beta_p ) \\ \frac{\partial}{\partial \beta_2}(x_1\beta_1+x_2\beta_2 +\cdots+x_p \beta_p ) \\ \vdots \\ \frac{\partial}{\partial \beta_p}(x_1\beta_1+x_2\beta_2 +\cdots+x_p \beta_p ) \end{bmatrix} \\ &= \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{p} \end{bmatrix} \\ &= \tilde{x} \end{aligned} \]
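Theorem 53 (Derivative of a quadratic form) For a quadratic form (Definition 34), if \(\mathbf{S}\) is a symmetric \(p\times p\) matrix (Definition 29) that is constant with respect to \(\tilde{\beta}\), then:
\[ \frac{\partial}{\partial \tilde{\beta}} {\tilde{\beta}}^{\top}\mathbf{S}\tilde{\beta} = 2\mathbf{S}\tilde{\beta} \]
If \(\mathbf{S}\) is not symmetric, the derivative is \((\mathbf{S} + {\mathbf{S}}^{\top})\tilde{\beta}\) instead.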
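A componentwise sketch of why this holds: write the quadratic form as a double sum with entries \(s_{ij}\) of \(\mathbf{S}\), and differentiate with respect to a single component \(\beta_k\). The index \(k\) is introduced here only for the derivation.
\[ \begin{aligned} \frac{\partial}{\partial \beta_k} \sum_{i=1}^{p}\sum_{j=1}^{p} \beta_i\, s_{ij}\, \beta_j &= \sum_{j=1}^{p} s_{kj}\, \beta_j + \sum_{i=1}^{p} \beta_i\, s_{ik} \\ &= (\mathbf{S}\tilde{\beta})_k + ({\mathbf{S}}^{\top}\tilde{\beta})_k \end{aligned} \]
Stacking the \(p\) components (Definition 36) gives \((\mathbf{S} + {\mathbf{S}}^{\top})\tilde{\beta}\), which equals \(2\mathbf{S}\tilde{\beta}\) when \(\mathbf{S}\) is symmetric (Definition 29).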
Corollary 6 (Derivative of a simple quadratic form) \[ \frac{\partial}{\partial \tilde{\beta}} \tilde{\beta}'\tilde{\beta}= 2\tilde{\beta} \]
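These derivative formulas can be sanity-checked with finite differences. The R sketch below uses a made-up non-symmetric matrix `S` and a made-up point `beta`; the helper names `quad_form` and `num_grad` are hypothetical. It compares a numerical gradient with \((\mathbf{S} + {\mathbf{S}}^{\top})\tilde{\beta}\), which is what Theorem 53 gives once \(\mathbf{S}\) is replaced by its symmetric part (Theorem 50), and then checks Corollary 6 by setting \(\mathbf{S} = \mathbf{I}\).

```r
# Hypothetical non-symmetric matrix (rows: (2, 0) and (1, 3)) and evaluation point.
S <- matrix(c(2, 1, 0, 3), nrow = 2)
beta <- c(1, -2)

# Quadratic form b' M b, returned as a plain number.
quad_form <- function(b, M) drop(t(b) %*% M %*% b)

# Central finite-difference approximation to the gradient of b' M b.
num_grad <- function(b, M, h = 1e-6) {
  sapply(seq_along(b), function(k) {
    step <- replace(numeric(length(b)), k, h)  # h in position k, 0 elsewhere
    (quad_form(b + step, M) - quad_form(b - step, M)) / (2 * h)
  })
}

grad_numeric <- num_grad(beta, S)
grad_general <- drop((S + t(S)) %*% beta)   # (S + S') beta

# With M = I the quadratic form is beta' beta, with gradient 2 * beta.
grad_identity <- num_grad(beta, diag(2))

rbind(grad_numeric, grad_general)
```

For this non-symmetric `S`, the vector `2 * S %*% beta` does not match the numerical gradient, which illustrates why symmetry matters for the \(2\mathbf{S}\tilde{\beta}\) form.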
Theorem 54 (Vector chain rule) If \(z = f(y)\) and \(y = g(\tilde{x})\) are differentiable functions, then: \[\frac{\partial z}{\partial \tilde{x}} = \frac{\partial y}{\partial \tilde{x}} \frac{\partial z}{\partial y}\]
or in Euler/Lagrange notation:
\[(f(g(\tilde{x})))' = g'(\tilde{x})\, f'(g(\tilde{x}))\]
Corollary 7 (Vector chain rule for quadratic forms) \[\frac{\partial}{\partial \tilde{\beta}}{\mathopen{}\left(\tilde{\varepsilon}(\tilde{\beta})\cdot \tilde{\varepsilon}(\tilde{\beta})\right)\mathclose{}} = \mathopen{}\left(\frac{\partial}{\partial \tilde{\beta}}\tilde{\varepsilon}(\tilde{\beta})\right)\mathclose{} \mathopen{}\left(2 \tilde{\varepsilon}(\tilde{\beta})\right)\mathclose{}\]
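As an illustration, suppose (this is an assumption made for the example, not something stated above) that \(\tilde{\varepsilon}(\tilde{\beta}) = \tilde{y} - \mathbf{X}\tilde{\beta}\), where \(\tilde{y}\) is a vector of \(n\) outcomes and \(\mathbf{X}\) is the design matrix from Definition 35. The \(i\)-th entry is \(\varepsilon_i = y_i - {\tilde{x}_i}^{\top}\tilde{\beta}\), so by Theorem 52 its derivative with respect to \(\tilde{\beta}\) is \(-\tilde{x}_i\). Arranging these \(n\) column vectors side by side gives the \(p \times n\) matrix:
\[ \frac{\partial}{\partial \tilde{\beta}}\tilde{\varepsilon}(\tilde{\beta}) = \begin{bmatrix} -\tilde{x}_1 & -\tilde{x}_2 & \cdots & -\tilde{x}_n \end{bmatrix} = -{\mathbf{X}}^{\top} \]
Applying Corollary 7 then gives:
\[ \begin{aligned} \frac{\partial}{\partial \tilde{\beta}}\left(\tilde{\varepsilon}(\tilde{\beta})\cdot \tilde{\varepsilon}(\tilde{\beta})\right) &= \left(-{\mathbf{X}}^{\top}\right)\left(2(\tilde{y} - \mathbf{X}\tilde{\beta})\right) \\ &= -2\,{\mathbf{X}}^{\top}(\tilde{y} - \mathbf{X}\tilde{\beta}) \end{aligned} \]
The result is a \(p \times 1\) column vector, matching the shape required by Definition 36.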