Mathematics Prerequisites

Math is not just a way of calculating numerical answers; it is a way of thinking, using clear definitions for concepts and rigorous logic to organize our thoughts and back up our assertions.

Cheng (2025)

These lecture notes use:

algebra
precalculus
univariate calculus
linear algebra
vector calculus

Some key results are listed here.

1 Algebra

1.1 Elementary Algebra

Equalities

Theorem 1 (Equalities are transitive) If \(a=b\) and \(b=c\), then \(a=c\)

Theorem 2 (Substituting equivalent expressions) If \(a = b\), then for any function \(f(x)\), \(f(a) = f(b)\)

Inequalities

Theorem 3 (Adding to both sides of an inequality) If \(a<b\), then \(a+c < b+c\)

Theorem 4 (negating both sides of an inequality) If \(a < b\), then: \(-a > -b\)

Theorem 5 (Multiplying both sides of an inequality by a nonnegative number) If \(a < b\) and \(c \geq 0\), then \(ca \le cb\), with strict inequality \(ca < cb\) when \(c > 0\).

Theorem 6 (Negation is multiplication by \(-1\)) \[-a = (-1)*a\]

Infimum and supremum

Definition 1 (Infimum (greatest lower bound)) The infimum of a nonempty set \(A \subseteq \mathbb{R}\) that is bounded below, written \(\inf A\), is the greatest real number \(m\) satisfying \(m \le a\) for all \(a \in A\):

\[\inf A \stackrel{\text{def}}{=}\max_{t \in \mathbb{R}}\mathopen{}\left\{t : \forall a \in A, a \ge t\right\}\mathclose{}\]

If the infimum belongs to \(A\), it equals the minimum: \(\inf A = \min A\).

Example 1 (Numerical examples of infimum)

\(\inf\{1, 2, 3\} = 1\), since \(1\) is the smallest element.
\(\inf(0.5, 1] = 0.5 = \min[0.5, 1]\): for intervals open below, the infimum equals the minimum of the corresponding closed-below interval, even though \(0.5 \notin (0.5, 1]\). More generally, \(\inf(c, b] = \min[c, b] = c\) for any \(c < b\).
\(\inf\{t \ge 0 : t > 0.5\} = 0.5\), even though \(0.5\) itself is not in the set.

Definition 2 (Supremum (least upper bound)) The supremum of a nonempty set \(A \subseteq \mathbb{R}\) that is bounded above, written \(\sup A\), is the smallest real number \(M\) satisfying \(M \ge a\) for all \(a \in A\):

\[\sup A \stackrel{\text{def}}{=}\min_{t \in \mathbb{R}}\mathopen{}\left\{t : \forall a \in A, a \le t\right\}\mathclose{}\]

If the supremum belongs to \(A\), it equals the maximum: \(\sup A = \max A\).

Example 2 (Numerical examples of supremum)

\(\sup\{1, 2, 3\} = 3\), since \(3\) is the largest element.
\(\sup\{t \ge 0 : t < 0.5\} = 0.5\), even though \(0.5\) itself is not in the set.
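The numerical examples above can be explored in R. The following is a minimal sketch, not a proof: for a finite set the infimum and supremum are simply `min()` and `max()`, while for the half-open interval \((0.5, 1]\) we can only illustrate that sample points get arbitrarily close to \(0.5\) without reaching it. The sample points \(0.5 + 10^{-k}\) and the variable names are choices made for this illustration.

```r
# Finite set: the infimum and supremum are attained, so they equal min and max.
A <- c(1, 2, 3)
inf_A <- min(A)
sup_A <- max(A)

# Half-open interval (0.5, 1]: sample points 0.5 + 10^(-k) approach 0.5 from
# above. Every sample is strictly greater than 0.5, yet the gap shrinks toward
# zero, which is consistent with 0.5 being the greatest lower bound.
k <- 1:8
samples <- 0.5 + 10^(-k)
all_above <- all(samples > 0.5)
gap <- min(samples) - 0.5

print(c(inf_A = inf_A, sup_A = sup_A))
print(all_above)
print(gap)
```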

Sums

Theorem 7 (adding zero changes nothing) \[a+0=a\]

Theorem 8 (Sums are symmetric) \[a+b = b+a\]

Theorem 9 (Sums are associative)

\[(a + b) + c = a + (b + c)\]

Products

Theorem 10 (Multiplying by 1 changes nothing) \[a \times 1 = a\]

Theorem 11 (Products are symmetric) \[a \times b = b \times a\]

Theorem 12 (Products are associative) \[(a \times b) \times c = a \times (b \times c)\]

Division

Theorem 13 (Division can be written as a product) \[\frac {a}{b} = a \times \frac{1}{b}\]

Sums and products together

Theorem 14 (Multiplication is distributive) \[a(b+c) = ab + ac\]

Quotients

Definition 3 (Quotients, fractions, rates)

\[\frac{a}{b}\]

Definition 4 (Ratios) A ratio is a quotient in which the numerator and denominator are measured using the same unit scales.

Definition 5 (Proportion) In statistics, a “proportion” typically means a ratio where the numerator represents a subset of the denominator.

Definition 6 (Proportional) Two functions \(f(x)\) and \(g(x)\) are proportional if their ratio \(\frac{f(x)}{g(x)}\) does not depend on \(x\). (cf. https://en.wikipedia.org/wiki/Proportionality_(mathematics))

Additional reference for elementary algebra: https://en.wikipedia.org/wiki/Population_proportion#Mathematical_definition

Exponentials and Logarithms

Theorem 15 (logarithm of a product is the sum of the logs of the factors) \[ \log(a\cdot b) = \log{a} + \log{b} \]

Corollary 1 (logarithm of a quotient)

\[\log{\frac{a}{b}} = \log{a} - \log{b}\]

Theorem 16 (logarithm of an exponential function) \[ \operatorname{log}\mathopen{}\left\{a^b\right\}\mathclose{} = b \cdot\operatorname{log}\mathopen{}\left\{a\right\}\mathclose{} \]

Theorem 17 (exponential of a sum)

\[\operatorname{exp}\mathopen{}\left\{a+b\right\}\mathclose{} = \operatorname{exp}\mathopen{}\left\{a\right\}\mathclose{} \cdot\operatorname{exp}\mathopen{}\left\{b\right\}\mathclose{}\]

Corollary 2 (exponential of a difference)

\[\operatorname{exp}\mathopen{}\left\{a-b\right\}\mathclose{} = \frac{\operatorname{exp}\mathopen{}\left\{a\right\}\mathclose{}}{\operatorname{exp}\mathopen{}\left\{b\right\}\mathclose{}}\]

Theorem 18 (exponential of a product) \[a^{bc} = \mathopen{}\left(a^b\right)\mathclose{}^c = \mathopen{}\left(a^c\right)\mathclose{}^b\]

Corollary 3 (natural exponential of a product) \[\operatorname{exp}\mathopen{}\left\{ab\right\}\mathclose{} = (\operatorname{exp}\mathopen{}\left\{a\right\}\mathclose{})^b = (\operatorname{exp}\mathopen{}\left\{b\right\}\mathclose{})^a\]
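As a quick sanity check, the identities in Theorems 15 to 18 and their corollaries can be verified numerically in R. This is only a sketch at a single arbitrary point (\(a = 2.5\), \(b = 4\), and a third exponent called `k` here to avoid clashing with R's `c()` function); agreement at one point illustrates the rules but does not prove them.

```r
a <- 2.5
b <- 4
k <- 1.5  # plays the role of c in Theorem 18

# all.equal() compares up to floating-point tolerance.
checks <- c(
  log_product    = isTRUE(all.equal(log(a * b), log(a) + log(b))),
  log_quotient   = isTRUE(all.equal(log(a / b), log(a) - log(b))),
  log_power      = isTRUE(all.equal(log(a^b), b * log(a))),
  exp_sum        = isTRUE(all.equal(exp(a + b), exp(a) * exp(b))),
  exp_difference = isTRUE(all.equal(exp(a - b), exp(a) / exp(b))),
  power_product  = isTRUE(all.equal(a^(b * k), (a^b)^k)),
  exp_product    = isTRUE(all.equal(exp(a * b), exp(a)^b))
)
print(checks)
```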

Exercise 1 For \(a \ge 0,~b,c \in \mathbb{R}\), when does \((a^b)^c = a^{(b^c)}\)?

Solution 1. Short answer: rarely (that’s all you need to know for this course).

Long answer:

If \((a^b)^c = a^{(b^c)}\), then since \((a^b)^c = a^{bc}\), we have: \[a^{bc} = a^{(b^c)}\] \[\operatorname{log}\mathopen{}\left\{a^{bc}\right\}\mathclose{} = \operatorname{log}\mathopen{}\left\{a^{(b^c)}\right\}\mathclose{}\] \[bc \cdot \operatorname{log}\mathopen{}\left\{a\right\}\mathclose{} = b^c\cdot \operatorname{log}\mathopen{}\left\{a\right\}\mathclose{} \tag{1}\]

Equation 1 holds in each of the following cases:

\(bc = b^c\) (see Exercise 2).
\(a=1\) (i.e., \(\operatorname{log}\mathopen{}\left\{a\right\}\mathclose{} = 0\)).
\(a=0\) (i.e., \(\operatorname{log}\mathopen{}\left\{a\right\}\mathclose{}= -\infty\)) and \(\operatorname{sign}\mathopen{}\left\{bc\right\}\mathclose{}=\operatorname{sign}\mathopen{}\left\{b^c\right\}\mathclose{}\).

In particular, when \(a=0\) and \(c=0\), \(bc = 0\) and \(b^c = 1\) (for any \(b \in \mathbb{R}\)), so \(\operatorname{sign}\mathopen{}\left\{bc\right\}\mathclose{}\neq \operatorname{sign}\mathopen{}\left\{b^c\right\}\mathclose{}\), and \((a^b)^c \neq a^{(b^c)}\):

\[ \begin{aligned} (a^b)^c &= (0^b)^0 \\ &= 1 \end{aligned} \]

\[ \begin{aligned} a^{(b^c)} &= 0^{(b^0)} \\ &= 0^1 \\ &= 0 \end{aligned} \]
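The \(a = 0\), \(c = 0\) case can be reproduced in R, which follows the convention \(0^0 = 1\). The value \(b = 3\) below is an arbitrary choice for illustration, and `k` stands in for \(c\).

```r
a <- 0
b <- 3  # arbitrary; the text's argument holds for any real b
k <- 0  # plays the role of c

lhs <- (a^b)^k  # (0^3)^0 = 0^0, which R evaluates as 1
rhs <- a^(b^k)  # 0^(3^0) = 0^1 = 0

print(c(lhs = lhs, rhs = rhs))
```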

Exercise 2 For \(b,c \in \mathbb{R}\), when does \(b^c = bc\)?

Solution 2. \(bc = b^c\) in each of the following cases:

\(c = 1\).
\(b=0\) and \(c > 0\).
\(b = \operatorname{exp}\mathopen{}\left\{\frac{\log{c}}{c-1}\right\}\mathclose{}\) (for \(c > 0\), \(c \neq 1\); this solves \(b^{c-1} = c\)).
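The third case can be spot-checked numerically. The sketch below evaluates \(b = \operatorname{exp}\{\log(c)/(c-1)\}\) at a few arbitrarily chosen positive values of \(c\) (stored in `k`, and excluding \(c = 1\), where the formula divides by zero) and confirms that \(b^c\) and \(bc\) agree up to floating-point error.

```r
k <- c(0.25, 0.5, 2, 3)     # values of c; c = 1 is excluded
b <- exp(log(k) / (k - 1))  # candidate solutions of b^c = b * c

# Largest absolute discrepancy between the two sides across the grid.
max_gap <- max(abs(b^k - b * k))

print(data.frame(c = k, b = b, b_pow_c = b^k, b_times_c = b * k))
print(max_gap)
```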

See the red contours in Figure 2 for a visualization.

[R code]

`b*c_f` <- function(b, c) b*c
`b^c_f` <- function(b, c) b^c
values_b <- seq(0, 5, by = .01)
values_c <- seq(-.5, 3, by = .01)

`b*c` <- outer(values_b, values_c, `b*c_f`)
`b^c` <- outer(values_b, values_c, `b^c_f`)
`b^c`[is.infinite(`b^c`)] = NA

opacity <- .3
z_min <- min(`b*c`, `b^c`, na.rm = TRUE)
z_max <- 5
plotly::plot_ly(
  x = ~values_b,
  y = ~values_c
) |>
  plotly::add_surface(
    z = ~ t(`b*c`),
    contours = list(
      z = list(
        show = TRUE,
        start = -1,
        end = 1,
        size = .1
      )
    ),
    name = "b*c",
    showscale = FALSE,
    opacity = opacity,
    colorscale = list(c(0, 1), c("green", "green"))
  ) |>
  plotly::add_surface(
    opacity = opacity,
    colorscale = list(c(0, 1), c("red", "red")),
    z = ~ t(`b^c`),
    contours = list(
      z = list(
        show = TRUE,
        start = z_min,
        end = z_max,
        size = .2
      )
    ),
    showscale = FALSE,
    name = "b^c"
  ) |>
  plotly::layout(
    scene = list(
      xaxis = list(
        # type = "log",
        title = "b"
      ),
      yaxis = list(
        # type = "log",
        title = "c"
      ),
      zaxis = list(
        # type = "log",
        range = c(z_min, z_max),
        title = "outcome"
      ),
      camera = list(eye = list(x = -1.25, y = -1.25, z = 0.5)),
      aspectratio = list(x = .9, y = .8, z = 0.7)
    )
  )

Figure 1: Graph of \(b*c\) and \(b^c\)

[R code]

`b^c - b*c_f` <- function(b, c) `b^c_f`(b,c) - `b*c_f`(b,c)

mat1 <- outer(values_b, values_c, `b^c - b*c_f`)
mat1[is.infinite(mat1)] = NA

opacity <- .3
plotly::plot_ly(
  x = ~values_b,
  y = ~values_c
) |>
  plotly::add_surface(
    z = ~ t(mat1),
    contours = list(
      z = list(
        show = TRUE,
        start = 0,
        end = 1,
        size = 1,
        color = "red"
      )
    ),
    name = "b^c - b*c",
    showscale = TRUE,
    opacity = opacity
  ) |>
  plotly::layout(
    scene = list(
      xaxis = list(
        # type = "log",
        title = "b"
      ),
      yaxis = list(
        # type = "log",
        title = "c"
      ),
      zaxis = list(
        title = "outcome"
      ),
      camera = list(eye = list(x = -1.25, y = -1.25, z = 0.5)),
      aspectratio = list(x = .9, y = .8, z = 0.7)
    )
  )

Figure 2: Graph of \(b^c - b*c\). Red contour lines show where \(b^c = b*c\).

Theorem 19 (\(\operatorname{exp}\mathopen{}\left\{\right\}\mathclose{}\) and \(\operatorname{log}\mathopen{}\left\{\right\}\mathclose{}\) are mutual inverses) \[\operatorname{exp}\mathopen{}\left\{\operatorname{log}\mathopen{}\left\{a\right\}\mathclose{}\right\}\mathclose{} = \operatorname{log}\mathopen{}\left\{\operatorname{exp}\mathopen{}\left\{a\right\}\mathclose{}\right\}\mathclose{} = a\]

2 Derivatives

Theorem 20 (Constant rule) \[\frac{\partial}{\partial x}c = 0\]

Theorem 21 (Constant factor rule) If \(a\) is constant with respect to \(x\), then: \[\frac{\partial}{\partial x}ay = a \frac{\partial y}{\partial x}\]

Theorem 22 (Power rule) \[\frac{\partial}{\partial x}x^q = qx^{q-1}\]

Theorem 23 (Derivative of natural logarithm) \[\operatorname{log}'\mathopen{}\left\{x\right\}\mathclose{} = \frac{1}{x} = x^{-1}\]

Theorem 24 (derivative of exponential) \[\operatorname{exp}'\mathopen{}\left\{x\right\}\mathclose{} = \operatorname{exp}\mathopen{}\left\{x\right\}\mathclose{}\]

Theorem 25 (Product rule) \[(ab)' = ab' + ba'\]

Theorem 26 (Quotient rule) \[(a/b)' = a'/b - (a/b^2)b'\]

Theorem 27 (Chain rule) \[\begin{aligned} \frac{\partial a}{\partial c} &= \frac{\partial a}{\partial b} \frac{\partial b}{\partial c} \\ &= \frac{\partial b}{\partial c} \frac{\partial a}{\partial b} \end{aligned} \]

or in Euler/Lagrange notation:

\[(f(g(x)))' = g'(x) f'(g(x))\]

Corollary 4 (Chain rule for logarithms) \[ \frac{\partial}{\partial x}\log{f(x)} = \frac{f'(x)}{f(x)} \]

Proof. Apply Theorem 27 and Theorem 23.
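The product rule, quotient rule, and chain rule for logarithms can be checked numerically with a central-difference approximation. This is a sketch under stated assumptions: the helper `num_deriv()`, the test functions \(u(x) = x^3\) and \(v(x) = \operatorname{exp}\{2x\}\), and the evaluation point \(x_0 = 1.3\) are all choices made for this illustration.

```r
# Central-difference approximation to f'(x); h is a small step size.
num_deriv <- function(f, x, h = 1e-5) (f(x + h) - f(x - h)) / (2 * h)

x0 <- 1.3
u  <- function(x) x^3           # u'(x) = 3 x^2       (power rule)
v  <- function(x) exp(2 * x)    # v'(x) = 2 exp(2 x)  (chain rule)
du <- function(x) 3 * x^2
dv <- function(x) 2 * exp(2 * x)

# Each row compares a numerical derivative with the corresponding formula.
results <- rbind(
  product   = c(numeric = num_deriv(function(x) u(x) * v(x), x0),
                formula = u(x0) * dv(x0) + v(x0) * du(x0)),
  quotient  = c(numeric = num_deriv(function(x) u(x) / v(x), x0),
                formula = du(x0) / v(x0) - (u(x0) / v(x0)^2) * dv(x0)),
  log_chain = c(numeric = num_deriv(function(x) log(u(x)), x0),
                formula = du(x0) / u(x0))
)
print(results)
```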

3 Integration

Integration is the inverse operation of differentiation: it recovers a function from its derivative and accumulates quantities such as areas, totals, and probabilities. We begin with antiderivatives, then state basic integration rules, and conclude with the Fundamental Theorem of Calculus and a worked example from probability.

3.1 Antiderivatives

Definition 7 (Antiderivative) A function \(F\) is an antiderivative of \(f\) on an interval \(I\) if:

\[\frac{\partial}{\partial x} F(x) = f(x), \quad \forall x \in I\]

The family of all antiderivatives of \(f\) is written as the indefinite integral:

\[\int f(x)\,dx = F(x) + C\]

where \(C\) is an arbitrary constant of integration.

(Larson and Edwards 2018, sec. 4.1, pp. 248–249)

Example 3 (Antiderivative of a power function) For \(f(x) = x^2\), an antiderivative is \(F(x) = \frac{x^3}{3}\), since \(\frac{\partial}{\partial x}\frac{x^3}{3} = x^2 = f(x)\).

Adding any constant \(C\) gives another antiderivative; for example, with \(C = 7\), \(F(x) = \frac{x^3}{3} + 7\) also satisfies \(F'(x) = x^2\), since adding a constant does not change the derivative. See Figure 3.

Figure 3: The function \(f(x) = x^2\) and five antiderivatives \(F(x) = x^3/3 + C\) for \(C \in \{-2, -1, 0, 1, 2\}\). Each antiderivative has the same derivative \(f\); they differ only by a vertical shift.

Theorem 28 (Basic integration rules) Each antiderivative below is defined only up to an arbitrary constant \(C\) (see Definition 7); the table omits \(+ C\) from every row for brevity.

Function \(f(x)\)	Antiderivative \(F(x)\)	Condition
\(c\)	\(cx\)	—
\(x^n\)	\(\dfrac{x^{n+1}}{n+1}\)	\(n \ne -1\)
\(\dfrac{1}{x}\)	\(\ln\mathopen{}\left|x\right|\mathclose{}\)	\(x \ne 0\)
\(\text{e}^{x}\)	\(\text{e}^{x}\)	(self-antiderivative)
\(\text{e}^{cx}\)	\(\dfrac{1}{c}\text{e}^{cx}\)	\(c \ne 0\)
\(c \cdot f(x)\)	\(c \cdot F(x)\)	—
\(f(x) + g(x)\)	\(F(x) + G(x)\)	—

The first two rows and the bottom two rows (linearity) are from (Larson and Edwards 2018, sec. 4.1, p. 250 “Basic Integration Rules”); \(1/x\) is from (Larson and Edwards 2018, sec. 5.2, Theorem 5.5, p. 324); \(\text{e}^{x}\) and \(\text{e}^{cx}\) are from (Larson and Edwards 2018, sec. 5.4, Theorem 5.12, p. 346).

Example 4 (Antiderivative of \(3x^2 - 1\)) By the power rule (\(n = 2\)) and linearity from Theorem 28:

\[ \int \mathopen{}\left(3x^2 - 1\right)\mathclose{}\,dx = 3 \cdot\frac{x^3}{3} - x + C = x^3 - x + C. \]

Verify by differentiating: \(\frac{\partial}{\partial x}\mathopen{}\left(x^3 - x + C\right)\mathclose{} = 3x^2 - 1 = f(x)\), as required.
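The same verification can be sketched in R using the symbolic differentiator `D()` from the base `stats` package. The grid of \(x\) values and the choice \(C = 7\) are arbitrary; since the derivative does not involve \(C\), any constant gives the same result.

```r
# Symbolic derivative of the candidate antiderivative F(x) = x^3 - x + C.
F_expr  <- quote(x^3 - x + C)
dF_expr <- D(F_expr, "x")

# Evaluate F'(x) and f(x) = 3x^2 - 1 on a grid and compare.
x_grid    <- seq(-2, 2, by = 0.5)
dF_values <- eval(dF_expr, list(x = x_grid, C = 7))
f_values  <- 3 * x_grid^2 - 1

print(all.equal(dF_values, f_values))
```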

3.2 Regularity Conditions

Definition 8 (Differentiable function) A function \(f\) is differentiable at \(x = c\) if the limit

\[f'(c) = \lim_{h \to 0} \frac{f(c + h) - f(c)}{h}\]

exists and is finite. \(f\) is differentiable on an interval if it is differentiable at every interior point; at a closed endpoint, the appropriate one-sided derivative is used.

(Larson and Edwards 2018, sec. 2.1, p. 100)

Definition 9 (Continuous function) A function \(f\) is continuous at \(x = c\) if all three conditions hold:

\(f(c)\) is defined,
\(\lim_{x \to c} f(x)\) exists, and
\(\lim_{x \to c} f(x) = f(c)\).

\(f\) is continuous on a closed interval \([a, b]\) if it is continuous at every point of \([a, b]\).

(Larson and Edwards 2018, sec. 1.4, p. 73)

Definition 10 (Riemann integrable) A bounded function \(f\) is Riemann integrable on \([a, b]\) if the Riemann integral

\[\int_a^b f(x)\,dx = \lim_{n \to \infty} \sum_{i=1}^n f(x_i^*)\,\Delta x, \quad \Delta x = \frac{b - a}{n},\]

exists and is finite (for equal-width partitions of width \(\Delta x = (b - a)/n\)), where \(x_i^*\) is any point in the \(i\)-th subinterval.

(Larson and Edwards 2018, sec. 4.3, p. 272)
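To make Definition 10 concrete, here is a minimal sketch of an equal-width Riemann sum in R. The helper name `riemann_sum()`, the use of midpoints as the sample points \(x_i^*\), and the test function \(f(x) = x^2\) on \([0, 1]\) (whose integral is \(1/3\)) are choices made for this illustration; any other choice of sample point within each subinterval would also be valid.

```r
# Equal-width Riemann sum of f on [a, b] with n subintervals, using the
# midpoint of each subinterval as the sample point x_i^*.
riemann_sum <- function(f, a, b, n) {
  dx <- (b - a) / n
  midpoints <- a + (seq_len(n) - 0.5) * dx
  sum(f(midpoints)) * dx
}

f <- function(x) x^2
n_values <- c(10, 100, 1000, 10000)
approximations <- sapply(n_values, function(n) riemann_sum(f, 0, 1, n))

# The error relative to the exact value 1/3 shrinks as n grows.
print(data.frame(n = n_values,
                 sum = approximations,
                 error = approximations - 1 / 3))
```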

General Riemann integrability

More generally, using partitions \(\mathcal{P}\) of arbitrary mesh — subintervals of varying widths \(\Delta x_i\) — a bounded function \(f\) is Riemann integrable on \([a, b]\) if

\[\int_a^b f(x)\,dx = \lim_{\|\mathcal{P}\| \to 0} \sum_{i=1}^n f(x_i^*)\,\Delta x_i\]

exists and is finite, where \(\|\mathcal{P}\| = \max_i \Delta x_i\) is the mesh of the partition.

Theorem 29 (Equivalence of Riemann sum formulations) For continuous \(f\) on a closed interval \([a, b]\), the equal-width Riemann sum (Definition 10) and the arbitrary-mesh Riemann sum (in the callout above) give the same value (Rudin 1976, chap. 6). The equal-width form in Definition 10 is used throughout Epi 204.

Before stating the Fundamental Theorem of Calculus, we record two prerequisite results. The FTC requires the integrand \(f\) to be continuous, and the two theorems below establish where continuity comes from (differentiability \(\Rightarrow\) continuity) and what it buys us (continuity \(\Rightarrow\) integrability).

Theorem 30 (Differentiability implies continuity) If \(f\) is differentiable at \(x = c\), then \(f\) is continuous at \(x = c\).

(Larson and Edwards 2018, Theorem 2.1, p. 106)

Example 5 (Differentiable, hence continuous: \(x^3 - x\)) \(f(x) = x^3 - x\) is differentiable everywhere (with derivative \(f'(x) = 3x^2 - 1\)), so by Theorem 30 it is continuous everywhere.

Example 6 (Continuous but not differentiable: \(\mathopen{}\left|x\right|\mathclose{}\)) The absolute-value function \(f(x) = \mathopen{}\left|x\right|\mathclose{}\) is continuous at \(x = 0\) (\(\lim_{x \to 0}\mathopen{}\left|x\right|\mathclose{} = 0 = \mathopen{}\left|0\right|\mathclose{}\)), but it is not differentiable at \(x = 0\): the left-derivative is \(-1\) and the right-derivative is \(+1\).

This shows that the converse of Theorem 30 fails: continuity does not imply differentiability. See Figure 4.

[R code]

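library(ggplot2)  # provides ggplot(), geom_function(), geom_point(), ...

# Plot |x| on [-2, 2] and mark the corner at the origin.
ggplot() +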
  geom_function(fun = abs, xlim = c(-2, 2), linewidth = 1) +
  geom_point(aes(x = 0, y = 0), size = 3) +
  labs(x = "x", y = expression(f(x) == abs(x))) +
  theme_minimal()

Figure 4: \(f(x) = \mathopen{}\left|x\right|\mathclose{}\) has a sharp corner at \(x = 0\) (not differentiable there) but is continuous everywhere: no gaps or jumps.

Theorem 31 (Continuity implies integrability) If \(f\) is continuous on the closed interval \([a, b]\), then \(f\) is integrable on \([a, b]\) (i.e., the Riemann integral \(\int_a^b f(x)\,dx\) exists and is finite).

(Larson and Edwards 2018, Theorem 4.4, p. 272)

Example 7 (Continuous, hence integrable: polynomials) Every polynomial is continuous on \(\mathbb{R}\), so by Theorem 31 every polynomial is integrable on every closed interval \([a, b]\).

Example 8 (Integrable but not continuous: a step function) Let \(f(x) = 0\) for \(x < \tfrac{1}{2}\) and \(f(x) = 1\) for \(x \ge \tfrac{1}{2}\). Then \(f\) is discontinuous at \(x = \tfrac{1}{2}\), but it is integrable on \([0, 1]\):

\[ \int_0^1 f(x)\,dx = \int_0^{1/2} 0\,dx + \int_{1/2}^1 1\,dx = 0 + \tfrac{1}{2} = \tfrac{1}{2}. \]

This shows that the converse of Theorem 31 fails: integrability does not imply continuity. See Figure 5.

[R code]

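library(ggplot2)  # provides ggplot(), geom_rect(), geom_line(), ...

# Endpoints of the two horizontal segments of the step function.
step_df <- data.frame(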
  x = c(0, 0.5, 0.5, 1),
  y = c(0, 0, 1, 1),
  segment = c("left", "left", "right", "right")
)

ggplot() +
  geom_rect(
    aes(xmin = 0.5, xmax = 1, ymin = 0, ymax = 1),
    fill = "steelblue", alpha = 0.3
  ) +
  geom_line(
    data = step_df,
    aes(x = x, y = y, group = segment),
    linewidth = 1
  ) +
  geom_point(aes(x = 0.5, y = 0), shape = 1, size = 3) +
  geom_point(aes(x = 0.5, y = 1), shape = 16, size = 3) +
  scale_x_continuous(breaks = c(0, 0.5, 1), labels = c("0", "1/2", "1")) +
  scale_y_continuous(limits = c(-0.1, 1.2)) +
  labs(x = "x", y = "f(x)") +
  theme_minimal()

Figure 5: Step function: \(f(x) = 0\) on \([0, \tfrac{1}{2})\) (open circle at the jump) and \(f(x) = 1\) on \([\tfrac{1}{2}, 1]\) (filled circle). The shaded rectangle has area \(\tfrac{1}{2}\), matching the integral computed above.
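The integral in Example 8 can also be approximated directly from the equal-width Riemann sum in Definition 10. The sketch below uses midpoints as sample points and \(n = 1000\) subintervals, both arbitrary choices; the jump at \(x = \tfrac{1}{2}\) does not prevent the sum from settling at \(\tfrac{1}{2}\).

```r
# Step function from Example 8: 0 for x < 1/2, 1 for x >= 1/2.
step_f <- function(x) as.numeric(x >= 0.5)

n  <- 1000
dx <- 1 / n
midpoints <- (seq_len(n) - 0.5) * dx  # sample point in each subinterval

# Riemann sum: half of the midpoints lie at or above 1/2.
step_integral <- sum(step_f(midpoints)) * dx
print(step_integral)
```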

Together, Theorem 30 and Theorem 31 establish the chain:

\[\text{differentiable} \;\Rightarrow\; \text{continuous} \;\Rightarrow\; \text{integrable}\]

Example 6 and Example 8 show that neither implication reverses in general.

3.3 Fundamental Theorem of Calculus

Theorem 32 (Fundamental Theorem of Calculus) Let \(f\) be a continuous function on a closed interval \([a, b]\).

Part 1 (Derivative of an integral). Define \(F(x) = \int_a^x f(t)\,dt\) for \(x \in [a, b]\). Then \(F\) is differentiable and:

\[\frac{\partial}{\partial x}\int_a^x f(t)\,dt = f(x) \tag{2}\]

Note

Continuity on all of \([a, b]\) is a sufficient condition. More generally, if \(f\) is integrable on \([a, b]\) (see Definition 10), then Part 1 holds at every point \(x\) at which \(f\) is continuous (see Definition 9), even if \(f\) has jump discontinuities elsewhere (Rudin 1976, Theorem 6.20, p. 133).

(Larson and Edwards 2018, Theorem 4.11, p. 288)

Part 2 (Evaluation theorem). The \(F\) here may be any antiderivative of \(f\) — not just the accumulation function from Part 1. If \(F\) is an antiderivative of \(f\) on \([a, b]\) (i.e., \(\frac{\partial}{\partial x} F(x) = f(x)\) for all \(x \in [a, b]\)), then:

\[\int_a^b f(x)\,dx = F(b) - F(a) \tag{3}\]

Equivalently, with \(b\) replaced by a variable upper limit \(x\), integrating the derivative of \(F\) recovers the net change in \(F\):

\[\int_a^x F'(t)\,dt = F(x) - F(a) \tag{4}\]

or equivalently in Leibniz notation:

\[\int_a^x \frac{d F}{d t}\,dt = F(x) - F(a)\]

(Banner 2007, chap. 18; Larson and Edwards 2018, Theorem 4.9, p. 282)

The two parts of the FTC together express that differentiation and integration are inverse operations:

Part 1: differentiating the integral of \(f\) recovers \(f\) (Equation 2).
Part 2: the integral of \(f\) over \([a, b]\) equals the difference of any antiderivative’s values at the endpoints (Equation 3), which rearranges to “integrating the derivative of \(F\) recovers the net change in \(F\)” (Equation 4).

The standard form of the FTC assumes \(f\) is continuous on \([a, b]\); continuity is sufficient but not strictly necessary (see the preceding callout note for the more general statement). Since differentiability implies continuity (Theorem 30), the FTC applies in particular whenever \(f\) is differentiable — a common situation in Epi 204.

Example 9 (FTC Part 1 visualized: accumulation function for \(f(t) = 2t\)) Take \(f(t) = 2t\) on \([0, 2]\). The accumulation function from \(0\) is

\[F(x) \;\stackrel{\text{def}}{=}\; \int_0^x 2t\,dt \;=\; \mathopen{}\left[t^2\right]\mathclose{}_{t=0}^{t=x} \;=\; x^2 - 0^2 \;=\; x^2,\]

so \(F(x) = x^2\), and indeed \(F'(x) = 2x = f(x)\), as Theorem 32 Part 1 predicts. Figure 6 shows the integrand on the left (shaded area equals \(F(x)\) at each \(x\)) and the accumulation function \(F(x) = x^2\) on the right (its slope at \(x\) equals \(f(x) = 2x\)).

Figure 6: Left: \(f(t) = 2t\); the shaded area \(\int_0^{1.5} 2t\,dt = F(1.5) = 2.25\); vertical lines mark \(x \in \{1, 1.5, 2\}\). Right: \(F(x) = x^2\); for each marked \(x\), the tangent slope equals \(f(x) = 2x\).
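Example 9 can be reproduced numerically. The sketch below builds the accumulation function with base R's `integrate()` and then differentiates it with a central difference. The name `F_acc`, the step size `h`, and the evaluation points are assumptions made for this illustration (the points are kept inside \((0, 2)\) so that \(x \pm h\) stays in the interval).

```r
f <- function(t) 2 * t

# Accumulation function F(x) = integral of f from 0 to x, computed numerically.
F_acc <- function(x) {
  sapply(x, function(xi) integrate(f, lower = 0, upper = xi)$value)
}

x_marked <- c(0.5, 1, 1.5)
h <- 1e-4

# Central-difference slope of F at each marked x; FTC Part 1 says it equals f(x).
slope <- (F_acc(x_marked + h) - F_acc(x_marked - h)) / (2 * h)

print(data.frame(x = x_marked,
                 F = F_acc(x_marked),
                 slope = slope,
                 f = f(x_marked)))
```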

Example 10 (CDF and PDF of the exponential distribution) In what follows, \(f\) denotes the PDF and \(F\) the CDF — the same letters as the antiderivative pair in Definition 7, because the FTC will show \(F\) is exactly an antiderivative of \(f\).

For the exponential distribution with rate parameter \(\lambda > 0\), the probability density function (PDF) is (Kleinbaum and Klein 2012, sec. II, p. 295, “Survival and Hazard Functions for Selected Distributions”):

\[f(t) = \lambda \text{e}^{-\lambda t}, \quad t \ge 0\]

FTC Part 2 gives the cumulative distribution function (CDF) from the PDF. Apply the \(\text{e}^{cx}\) rule from Theorem 28 with \(c = -\lambda\) to antidifferentiate the integrand:

\[ \begin{aligned} F(t) &= \int_0^t \lambda \text{e}^{-\lambda u}\,du \\ &= \mathopen{}\left[\lambda \cdot\frac{1}{-\lambda}\text{e}^{-\lambda u}\right]\mathclose{}_{u=0}^{u=t} \\ &= \mathopen{}\left[(-1)\text{e}^{-\lambda u}\right]\mathclose{}_{u=0}^{u=t} \\ &= \mathopen{}\left[-\text{e}^{-\lambda u}\right]\mathclose{}_{u=0}^{u=t} \\ &= -\text{e}^{-\lambda t} - \mathopen{}\left(-\text{e}^{0}\right)\mathclose{} \\ &= -\text{e}^{-\lambda t} - (-1) \\ &= 1 - \text{e}^{-\lambda t} \end{aligned} \]

FTC Part 1 recovers the PDF from the CDF:

\[ \begin{aligned} \frac{\partial}{\partial t} F(t) &= \frac{\partial}{\partial t}\mathopen{}\left(1 - \text{e}^{-\lambda t}\right)\mathclose{} \\ &= 0 - (-\lambda)\text{e}^{-\lambda t} \\ &= \lambda\text{e}^{-\lambda t} \\ &= f(t) \end{aligned} \]

For a concrete instance: with \(\lambda = 1\) (standard exponential), the probability that \(T \le 2\) is:

\[ F(2) = 1 - \text{e}^{-1 \cdot 2} = 1 - \text{e}^{-2} \approx 1 - 0.135 = 0.865 \]

See Figure 7.

Figure 7: Exponential distribution with \(\lambda = 1\). Left: the PDF \(f(t) = \lambda \text{e}^{-\lambda t}\); the shaded area under the curve from \(0\) to \(2\) equals \(F(2) \approx 0.865\). Right: the CDF \(F(t) = 1 - \text{e}^{-\lambda t}\); the dashed lines mark the value \(F(2)\) computed via FTC Part 2.
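The value \(F(2) \approx 0.865\) can be cross-checked three ways in R: the closed form derived above, the built-in exponential CDF `pexp()`, and numerical integration of the built-in density `dexp()`. This is a sketch for the specific case \(\lambda = 1\), \(t = 2\); the variable names are chosen for illustration.

```r
lambda <- 1
t0 <- 2

cdf_formula <- 1 - exp(-lambda * t0)     # closed form from FTC Part 2
cdf_builtin <- pexp(t0, rate = lambda)   # R's exponential CDF
cdf_numeric <- integrate(dexp, lower = 0, upper = t0,
                         rate = lambda)$value  # integral of the PDF

print(c(formula = cdf_formula, builtin = cdf_builtin, numeric = cdf_numeric))
```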

4 Double Integrals

The Fubini–Tonelli theorem states conditions under which the order of integration in a double integral can be exchanged. We state two versions: the Riemann version (Theorem 33) is what Epi 204 directly uses for double integrals of continuous functions on simple regions; the σ-finite measure-theoretic version (Theorem 34) is included to make the joint-distribution form corollary in the probability chapter follow from a stated theorem rather than from an aside.

Theorem 33 (Fubini’s theorem (Riemann version)) Let \(f\) be continuous on a plane region \(R \subseteq \mathbb{R}^2\).

Vertically simple region. If \(R\) is defined by \(a \le x \le b\) and \(g_1(x) \le y \le g_2(x)\), where \(g_1\) and \(g_2\) are continuous on \([a, b]\), then

\[ \begin{aligned} \iint_R f(x, y)\,dA &= \int_a^b \int_{g_1(x)}^{g_2(x)} f(x, y)\,dy\,dx. \end{aligned} \]
Horizontally simple region. If \(R\) is defined by \(c \le y \le d\) and \(h_1(y) \le x \le h_2(y)\), where \(h_1\) and \(h_2\) are continuous on \([c, d]\), then

\[ \begin{aligned} \iint_R f(x, y)\,dA &= \int_c^d \int_{h_1(y)}^{h_2(y)} f(x, y)\,dx\,dy. \end{aligned} \]

When \(R\) can be described both ways, the two iterated integrals are equal — so the order of integration can be exchanged.

(Larson and Edwards 2018, Theorem 14.2, p. 982)

Example 11 (Changing the order of integration for a non-rectangular region) Adapted from (Larson and Edwards 2018, sec. 14.2, Example 4, pp. 984–985).

Let \(X\) and \(Y\) be independent \(\operatorname{Uniform}(0, 1)\) random variables, with joint density \(f(x, y) = 1\) on the unit square \([0, 1]^2\). Define the function \(g(x, y) = \text{e}^{-x^2}\,\mathbb{1}\mathopen{}\left(y \le x\right)\mathclose{}\), and compute its expectation \(\operatorname{E}\mathopen{}\left[g(X, Y)\right]\mathclose{}\).

Because the joint density equals \(1\) on \([0, 1]^2\), this expectation is the double integral of \(g\) over the unit square:

\[\operatorname{E}\mathopen{}\left[g(X, Y)\right]\mathclose{} = \iint_{[0, 1]^2} g(x, y)\,dA.\]

The indicator factor \(\mathbb{1}\mathopen{}\left(y \le x\right)\mathclose{}\) equals \(1\) on the triangular region where \(y \le x\) and \(0\) elsewhere, so only that region, namely \(D = \{(x, y) : x \in [0, 1],\; y \in [0, x]\}\) (Figure 8), contributes, and there \(g(x, y) = \text{e}^{-x^2}\):

\[\operatorname{E}\mathopen{}\left[g(X, Y)\right]\mathclose{} = \iint_D \text{e}^{-x^2}\,dA.\]

[R code]

region <- data.frame(x = c(0, 1, 1), y = c(0, 0, 1))
ggplot2::ggplot(region, ggplot2::aes(x = x, y = y)) +
  ggplot2::geom_polygon(
    fill = "steelblue", alpha = 0.4, color = "black", linewidth = 0.7
  ) +
  ggplot2::annotate(
    "text", x = 0.65, y = 0.28, label = "D", size = 8, fontface = "italic"
  ) +
  ggplot2::annotate(
    "text", x = 0.36, y = 0.52, label = "y == x",
    parse = TRUE, angle = 45
  ) +
  ggplot2::annotate(
    "text", x = 1.08, y = 0.5, label = "x == 1",
    parse = TRUE, angle = 90, hjust = 0.5
  ) +
  ggplot2::annotate(
    "text", x = 0.5, y = -0.1, label = "y == 0", parse = TRUE
  ) +
  ggplot2::scale_x_continuous(breaks = c(0, 0.5, 1), limits = c(-0.1, 1.3)) +
  ggplot2::scale_y_continuous(breaks = c(0, 0.5, 1), limits = c(-0.2, 1.2)) +
  ggplot2::coord_equal() +
  ggplot2::labs(x = "x", y = "y") +
  ggplot2::theme_minimal()

Figure 8: Triangular integration region \(D = \{(x, y) : x \in [0, 1],\; y \in [0, x]\}\), bounded below by \(y = 0\), above-left by \(y = x\), and on the right by \(x = 1\).

Order \(dx\,dy\) is intractable. Re-describing \(D\) as \(D = \{(x, y) : y \in [0, 1],\; x \in [y, 1]\}\), the inner integral is

\[\int_y^1 \text{e}^{-x^2}\,dx,\]

which has no elementary antiderivative.

Order \(dy\,dx\) works. Applying Theorem 33 Part 1 (\(\text{e}^{-x^2}\) is continuous and \(D\) is the vertically simple region \(x \in [0, 1]\), \(y \in [0, x]\)):

\[ \begin{aligned} \iint_D \text{e}^{-x^2}\,dA &= \int_0^1\!\int_0^x \text{e}^{-x^2}\,dy\,dx \\&= \int_0^1 \text{e}^{-x^2}\mathopen{}\left(\int_0^x dy\right)\mathclose{}\,dx \\&= \int_0^1 x\,\text{e}^{-x^2}\,dx \\&= \mathopen{}\left[-\frac{1}{2}\,\text{e}^{-x^2}\right]\mathclose{}_0^1 \\&= -\frac{1}{2}\mathopen{}\left(\text{e}^{-1} - 1\right)\mathclose{} \\&= \frac{e - 1}{2e} \\&\approx 0.316 \end{aligned} \]

The solid whose volume equals this integral is shown in Figure 9.

[R code]

n_grid <- 51
x_seq <- seq(0, 1, length.out = n_grid)
y_seq <- seq(0, 1, length.out = n_grid)

z_mat <- outer(x_seq, y_seq, function(x, y) {
  z <- exp(-x^2)
  z[y > x] <- NA
  z
})

plotly::plot_ly(x = ~x_seq, y = ~y_seq, z = ~t(z_mat)) |>
  plotly::add_surface(showscale = FALSE) |>
  plotly::layout(scene = list(
    xaxis = list(title = "x"),
    yaxis = list(title = "y"),
    zaxis = list(title = "z = exp(-x^2)"),
    camera = list(eye = list(x = 1.6, y = -1.6, z = 0.8))
  ))

Figure 9: Surface \(z = \text{e}^{-x^2}\) over the region \(D = \{(x, y) : x \in [0, 1],\; y \in [0, x]\}\). The surface depends only on \(x\) (constant in \(y\)), so for each \(x\) the inner integral over \(y \in [0, x]\) contributes \(x \cdot \text{e}^{-x^2}\).
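The result of Example 11 can be checked numerically in two ways: by nesting calls to `integrate()` in the order \(dy\,dx\), and by a Monte Carlo estimate of \(\operatorname{E}[g(X, Y)]\) from simulated uniform draws. This is a sketch: the seed, the sample size, and the helper name `inner` are arbitrary choices, and the Monte Carlo estimate only agrees with the exact value up to simulation error.

```r
# Iterated integral in the order dy dx. The inner integrand is constant in y,
# so rep() is used to return a vector of the length integrate() expects.
inner <- function(x) {
  sapply(x, function(xi) {
    integrate(function(y) rep(exp(-xi^2), length(y)),
              lower = 0, upper = xi)$value
  })
}
iterated <- integrate(inner, lower = 0, upper = 1)$value

closed_form <- (exp(1) - 1) / (2 * exp(1))

# Monte Carlo estimate of E[g(X, Y)] with X, Y independent Uniform(0, 1).
set.seed(1)
n <- 1e5
x <- runif(n)
y <- runif(n)
monte_carlo <- mean(exp(-x^2) * (y <= x))

print(c(iterated = iterated,
        closed_form = closed_form,
        monte_carlo = monte_carlo))
```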

Example 12 (When conditions fail: a counterexample) The conditions in Theorem 33 are not merely technical — when they fail, iterated integrals can exist yet disagree.

Let \[f(x, y) = \frac{x^2 - y^2}{(x^2 + y^2)^2}\] on the unit square \(R = [0, 1] \times [0, 1]\). Strictly, \(f\) is defined on \(R \setminus \{(0, 0)\}\): the denominator vanishes at the origin, so \(f\) is undefined there (we return to this point below).

Integrating \(y\) first, then \(x\):

Using \(\displaystyle\frac{\partial}{\partial y}\frac{y}{x^2 + y^2} = \frac{x^2 - y^2}{(x^2 + y^2)^2}\):

\[ \begin{aligned} \int_0^1\!\int_0^1 f(x, y)\,dy\,dx &= \int_0^1 \mathopen{}\left[\frac{y}{x^2 + y^2}\right]\mathclose{}_{y=0}^{y=1}\,dx \\&= \int_0^1 \frac{1}{x^2 + 1}\,dx \\&= \mathopen{}\left[\arctan(x)\right]\mathclose{}_0^1 \\&= \frac{\pi}{4} \end{aligned} \]

Integrating \(x\) first, then \(y\):

Using \(\displaystyle\frac{\partial}{\partial x}\mathopen{}\left(-\frac{x}{x^2 + y^2}\right)\mathclose{} = \frac{x^2 - y^2}{(x^2 + y^2)^2}\):

\[ \begin{aligned} \int_0^1\!\int_0^1 f(x, y)\,dx\,dy &= \int_0^1 \mathopen{}\left[-\frac{x}{x^2 + y^2}\right]\mathclose{}_{x=0}^{x=1}\,dy \\&= \int_0^1 \mathopen{}\left(-\frac{1}{1 + y^2}\right)\mathclose{}\,dy \\&= -\mathopen{}\left[\arctan(y)\right]\mathclose{}_0^1 \\&= -\frac{\pi}{4} \end{aligned} \]

Conclusion: \(\dfrac{\pi}{4} \neq -\dfrac{\pi}{4}\), so the two iterated integrals are unequal. Theorem 33 does not apply here.

Why Theorem 33’s hypothesis fails: Theorem 33 requires \(f\) to be continuous on \(R\). The denominator \((x^2 + y^2)^2\) vanishes at the origin \((0, 0) \in R\), so \(f\) is not even defined there — let alone continuous — and the theorem does not apply.

(Wikipedia contributors 2024)

The surface, and the singularity at the origin responsible for the failure, are shown in Figure 10.

[R code]

n_grid <- 81
eps <- 0.04
x_seq <- seq(eps, 1, length.out = n_grid)
y_seq <- seq(eps, 1, length.out = n_grid)

z_mat <- outer(x_seq, y_seq, function(x, y) (x^2 - y^2) / (x^2 + y^2)^2)
z_clip <- 50
z_mat[z_mat > z_clip] <- z_clip
z_mat[z_mat < -z_clip] <- -z_clip

plotly::plot_ly(x = ~x_seq, y = ~y_seq, z = ~t(z_mat)) |>
  plotly::add_surface(
    showscale = FALSE,
    colorscale = list(
      list(0, "#3b4cc0"),
      list(0.5, "#dddddd"),
      list(1, "#b40426")
    )
  ) |>
  plotly::layout(scene = list(
    xaxis = list(title = "x"),
    yaxis = list(title = "y"),
    zaxis = list(title = "f(x, y)", range = c(-z_clip, z_clip)),
    camera = list(eye = list(x = 1.6, y = 1.6, z = 0.6))
  ))

Figure 10: Surface \(f(x, y) = (x^2 - y^2)/(x^2 + y^2)^2\) on \([0, 1]^2\), sampled away from the origin and clipped to \([-50, 50]\) for display. The function diverges to \(+\infty\) as \((x, y) \to (0, 0)\) along the \(x\)-axis (red ridge, \(f > 0\) when \(|x| > |y|\)) and to \(-\infty\) as \((x, y) \to (0, 0)\) along the \(y\)-axis (blue ridge, \(f < 0\) when \(|y| > |x|\)). The singularity at \((0, 0)\) is why \(f\) is not continuous on \(R\) and Theorem 33 does not apply.

Corollary 5 (Continuous functions on a rectangle (corollary of Theorem 33)) If \(f : [a, b] \times [c, d] \to \mathbb{R}\) is continuous on the closed bounded rectangle \([a, b] \times [c, d]\), then:

\[ \begin{aligned} \int_a^b \mathopen{}\left(\int_c^d f(x, y)\,dy\right)\mathclose{}\,dx &= \int_c^d \mathopen{}\left(\int_a^b f(x, y)\,dx\right)\mathclose{}\,dy\\ &= \iint_{[a,b]\times[c,d]} f(x, y)\,dx\,dy. \end{aligned} \]

(Larson and Edwards 2018, Theorem 14.2, p. 982)

Proof. A closed bounded rectangle \([a, b] \times [c, d]\) is both vertically simple (with \(g_1 \equiv c\), \(g_2 \equiv d\)) and horizontally simple (with \(h_1 \equiv a\), \(h_2 \equiv b\)). Applying both parts of Theorem 33 to \(f\) on this rectangle gives the two iterated forms shown.

Example 13 (Evaluating a double integral on a rectangle) Structure adapted from (Larson and Edwards 2018, sec. 14.2, Example 2, pp. 982–983); the integrand \(x^2 + y^2\) is original, chosen so the integral equals \(\operatorname{E}\mathopen{}\left[g(X, Y)\right]\mathclose{}\) for \(g(x, y) = x^2 + y^2\).

Let \(X\) and \(Y\) be independent \(\operatorname{Uniform}(0, 1)\) random variables, with joint density \(f(x, y) = 1\) on the unit square \(R = \{(x, y) : x \in [0, 1],\; y \in [0, 1]\}\) (Figure 11). Define the function \(g(x, y) = x^2 + y^2\), and compute its expectation \(\operatorname{E}\mathopen{}\left[g(X, Y)\right]\mathclose{}\).

Because the joint density equals \(1\) on \(R\), this expectation is the double integral of \(g\) over \(R\):

\[\operatorname{E}\mathopen{}\left[g(X, Y)\right]\mathclose{} = \iint_R \mathopen{}\left(x^2 + y^2\right)\mathclose{}\,dA.\]

[R code]

ggplot2::ggplot() +
  ggplot2::annotate(
    "rect", xmin = 0, xmax = 1, ymin = 0, ymax = 1,
    fill = "steelblue", alpha = 0.4, color = "black", linewidth = 0.7
  ) +
  ggplot2::annotate(
    "text", x = 0.5, y = 0.5, label = "R", size = 8, fontface = "italic"
  ) +
  ggplot2::scale_x_continuous(breaks = c(0, 1), limits = c(-0.15, 1.25)) +
  ggplot2::scale_y_continuous(breaks = c(0, 1), limits = c(-0.15, 1.25)) +
  ggplot2::coord_equal() +
  ggplot2::labs(x = "x", y = "y") +
  ggplot2::theme_minimal()

Figure 11: Integration region \(R = [0, 1]^2\), the unit square.

The integrand is continuous on \(R\), so Corollary 5 applies and either order of integration yields the same value.

Integrating \(y\) first, then \(x\):

\[ \begin{aligned} \operatorname{E}\mathopen{}\left[g(X, Y)\right]\mathclose{} &= \int_0^1\!\int_0^1 \mathopen{}\left(x^2 + y^2\right)\mathclose{}\,dy\,dx \\&= \int_0^1 \mathopen{}\left[x^2 y + \frac{y^3}{3}\right]\mathclose{}_0^1\,dx \\&= \int_0^1 \mathopen{}\left(x^2 + \frac{1}{3}\right)\mathclose{}\,dx \\&= \mathopen{}\left[\frac{x^3}{3} + \frac{x}{3}\right]\mathclose{}_0^1 \\&= \frac{2}{3} \end{aligned} \]

Integrating \(x\) first, then \(y\) (verifying the order can be swapped):

\[ \begin{aligned} \int_0^1\!\int_0^1 \mathopen{}\left(x^2 + y^2\right)\mathclose{}\,dx\,dy &= \int_0^1 \mathopen{}\left[\frac{x^3}{3} + y^2 x\right]\mathclose{}_0^1\,dy \\&= \int_0^1 \mathopen{}\left(\frac{1}{3} + y^2\right)\mathclose{}\,dy \\&= \mathopen{}\left[\frac{y}{3} + \frac{y^3}{3}\right]\mathclose{}_0^1 \\&= \frac{2}{3} \end{aligned} \]

Both orders give \(\frac{2}{3}\), as Corollary 5 guarantees.

As a cross-check, linearity of expectation gives the same value: since \(\operatorname{E}\mathopen{}\left[X^2\right]\mathclose{} = \int_0^1 x^2\,dx = \frac{1}{3}\) for \(X \sim \operatorname{Uniform}(0, 1)\) (and likewise for \(Y\)),

\[\operatorname{E}\mathopen{}\left[g(X, Y)\right]\mathclose{} = \operatorname{E}\mathopen{}\left[X^2 + Y^2\right]\mathclose{} = \operatorname{E}\mathopen{}\left[X^2\right]\mathclose{} + \operatorname{E}\mathopen{}\left[Y^2\right]\mathclose{} = \frac{1}{3} + \frac{1}{3} = \frac{2}{3}.\]
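The two iterated integrals can also be checked numerically. The following is a minimal sketch using base R's `integrate()`; the helper names `g`, `inner_dy`, and `inner_dx` are introduced here for illustration and are not part of the original example.

```r
# Integrand from Example 13
g <- function(x, y) x^2 + y^2

# Inner integral over y for each fixed x. sapply() vectorizes over x,
# because integrate() passes a vector of evaluation points to its integrand.
inner_dy <- function(x) {
  sapply(x, function(xi) integrate(function(y) g(xi, y), 0, 1)$value)
}

# Inner integral over x for each fixed y
inner_dx <- function(y) {
  sapply(y, function(yi) integrate(function(x) g(x, yi), 0, 1)$value)
}

dy_then_dx <- integrate(inner_dy, 0, 1)$value  # integrate y first, then x
dx_then_dy <- integrate(inner_dx, 0, 1)$value  # integrate x first, then y

c(dy_then_dx = dy_then_dx, dx_then_dy = dx_then_dy, exact = 2 / 3)
```

Both numerical values should agree with \(\frac{2}{3}\) up to the quadrature tolerance.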

The solid whose volume equals this integral is shown in Figure 12.

[R code]

n_grid <- 41
x_seq <- seq(0, 1, length.out = n_grid)
y_seq <- seq(0, 1, length.out = n_grid)
z_mat <- outer(x_seq, y_seq, function(x, y) x^2 + y^2)

plotly::plot_ly(x = ~x_seq, y = ~y_seq, z = ~t(z_mat)) |>
  plotly::add_surface(showscale = FALSE) |>
  plotly::layout(scene = list(
    xaxis = list(title = "x"),
    yaxis = list(title = "y"),
    zaxis = list(title = "z", range = c(0, 2))
  ))

Figure 12: Surface \(z = x^2 + y^2\) over the unit square \([0, 1]^2\). The double integral \(\tfrac{2}{3}\) is the volume between this surface and the \(xy\)-plane, and equals \(\operatorname{E}\mathopen{}\left[X^2 + Y^2\right]\mathclose{}\).

Theorem 34 (Fubini–Tonelli theorem (measure-theoretic form)) Let \((\Omega_1, \mathcal F_1, \mu_1)\) and \((\Omega_2, \mathcal F_2, \mu_2)\) be σ-finite measure spaces, and let \(f : \Omega_1 \times \Omega_2 \to \mathbb{R}\) be measurable with respect to the product σ-algebra \(\mathcal F_1 \otimes \mathcal F_2\). If either

(a) \(f \ge 0\) almost everywhere with respect to \(\mu_1 \otimes \mu_2\) (Tonelli’s theorem), or
(b) \(\int_{\Omega_1 \times \Omega_2} \mathopen{}\left|f\right|\mathclose{}\,d(\mu_1 \otimes \mu_2) < \infty\) (Fubini’s theorem),

then both iterated integrals exist, agree with the double integral, and equal each other:

\[ \begin{aligned} \int_{\Omega_1 \times \Omega_2} f\,d(\mu_1 \otimes \mu_2) &= \int_{\Omega_1} \mathopen{}\left(\int_{\Omega_2} f(\omega_1, \omega_2)\,d\mu_2(\omega_2)\right)\mathclose{}\,d\mu_1(\omega_1)\\ &= \int_{\Omega_2} \mathopen{}\left(\int_{\Omega_1} f(\omega_1, \omega_2)\,d\mu_1(\omega_1)\right)\mathclose{}\,d\mu_2(\omega_2). \end{aligned} \]

(Billingsley 1995, Theorem 18.3; Gut 2013, Theorem 9.1, p. 65; Fubini 1907; Wikipedia contributors 2024)

Example 14 (Positive application of Theorem 34) Let \(X\) and \(Y\) be independent \(\operatorname{Exponential}(1)\) random variables, with joint density \(f(x, y) = e^{-(x+y)}\) for \(x, y \ge 0\). The integrals below are taken with respect to Lebesgue measure on \([0,\infty)\), which is \(\sigma\)-finite because \([0,\infty)\) is the countable union of the finite-length intervals \([0, n]\), \(n = 1, 2, \ldots\); the \(\sigma\)-finiteness hypothesis of Theorem 34 therefore holds. Since \(f(x,y) = e^{-(x+y)} \ge 0\), hypothesis (a) (Tonelli’s theorem, nonnegativity) is also satisfied.

By Theorem 34, both iterated integrals exist and agree:

\[ \begin{aligned} P(X \le 1,\, Y \le 1) &= \int_0^1\!\int_0^1 e^{-(x+y)}\,dy\,dx \\&= \int_0^1 e^{-x}\mathopen{}\left(\int_0^1 e^{-y}\,dy\right)\mathclose{}\,dx \\&= \int_0^1 e^{-x}(1 - e^{-1})\,dx \\&= (1 - e^{-1})\mathopen{}\left[-e^{-x}\right]\mathclose{}_0^1 \\&= (1 - e^{-1})^2. \end{aligned} \]

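This also equals \(\left(\int_0^1 e^{-x}\,dx\right)^2 = (1 - e^{-1})^2\), since \(f(x,y) = e^{-x} \cdot e^{-y}\) factors as the product of the marginal densities of the independent variables \(X\) and \(Y\). Integrating in the other order (\(dx\,dy\)) gives the same value by symmetry, so both iterated integrals agree, as Theorem 34 guarantees when hypothesis (a) holds.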
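As a hedged numerical check of this example, the probability can be computed three ways in base R: from the closed form, from the exponential CDF `pexp()`, and by iterated numerical integration of the joint density. The variable names are illustrative only.

```r
# Closed form obtained from the iterated integral
closed_form <- (1 - exp(-1))^2

# Independence: P(X <= 1, Y <= 1) = P(X <= 1) * P(Y <= 1)
via_cdf <- pexp(1, rate = 1)^2

# Iterated numerical integration of the joint density e^{-(x + y)}
inner <- function(x) {
  sapply(x, function(xi) integrate(function(y) exp(-(xi + y)), 0, 1)$value)
}
via_integrate <- integrate(inner, 0, 1)$value

c(closed_form = closed_form, via_cdf = via_cdf, via_integrate = via_integrate)
```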

Example 15 (When neither Fubini–Tonelli condition is satisfied) The same function \(f(x, y) = (x^2 - y^2)/(x^2 + y^2)^2\) from Example 12 illustrates a case where neither condition of Theorem 34 is satisfied.

Why Theorem 34’s hypothesis fails: \(\iint_R |f|\,dA = \infty\), which violates hypothesis (b). In polar coordinates \((r, \theta)\), the integrand satisfies \(|f(x, y)| = \mathopen{}\left|x^2 - y^2\right|\mathclose{}/(x^2 + y^2)^2 = \mathopen{}\left|\cos 2\theta\right|\mathclose{}/r^2\). For any \(0 < \epsilon \le 1\), the quarter-disk \(\{0 \le r \le \epsilon,\; 0 \le \theta \le \pi/2\}\) lies inside \(R = [0, 1]^2\), so

\[ \begin{aligned} \iint_R |f|\,dA &\ge \int_0^{\pi/2}\!\int_0^{\epsilon} \frac{\mathopen{}\left|\cos 2\theta\right|\mathclose{}}{r^2}\, r\,dr\,d\theta\\ &= \mathopen{}\left(\int_0^{\pi/2}\mathopen{}\left|\cos 2\theta\right|\mathclose{}\,d\theta\right)\mathclose{} \int_0^{\epsilon} \frac{dr}{r}\\ &= +\infty, \end{aligned} \]

since \(\int_0^{\epsilon} dr/r\) diverges. Therefore \(\iint_R |f|\,dA = \infty\), and hypothesis (b) of Theorem 34 is not satisfied. (Hypothesis (a) also fails: \(f\) takes both positive and negative values, so it is not nonnegative a.e.) The unequal iterated integrals from Example 12 are thus consistent with Theorem 34: the theorem simply does not apply.

(Wikipedia contributors 2024)
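To make the divergence in Example 15 explicit, one can remove a small quarter-disk of radius \(\delta\) around the origin and let \(\delta \to 0^+\). The symbol \(\delta\), with \(0 < \delta < \epsilon \le 1\), is introduced here for illustration only:

\[ \begin{aligned} \int_0^{\pi/2}\!\int_{\delta}^{\epsilon} \frac{\mathopen{}\left|\cos 2\theta\right|\mathclose{}}{r^2}\, r\,dr\,d\theta &= \mathopen{}\left(\int_0^{\pi/4}\cos 2\theta\,d\theta - \int_{\pi/4}^{\pi/2}\cos 2\theta\,d\theta\right)\mathclose{} \int_{\delta}^{\epsilon} \frac{dr}{r}\\ &= \mathopen{}\left(\frac{1}{2} + \frac{1}{2}\right)\mathclose{} \log\frac{\epsilon}{\delta}\\ &= \log\frac{\epsilon}{\delta} \longrightarrow +\infty \quad \text{as } \delta \to 0^+. \end{aligned} \]

The angular factor is finite and positive, so the divergence comes entirely from the radial integral \(\int dr/r\).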

5 Linear Algebra

5.1 Vectors

Definition 11 (Column vector) A column vector of length \(p\) is an ordered list of \(p\) numbers, written vertically:

\[ \tilde{x}= \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{p} \end{bmatrix} \]

Definition 12 (Transpose) The transpose of a column vector \(\tilde{x}\) is the row vector with the same sequence of entries, written horizontally:

\[ {\tilde{x}}^{\top} \equiv \tilde{x}' \equiv [x_1,\; x_2,\; \ldots,\; x_p] \]

Special vectors

Definition 13 (Zero vector) The zero vector \(\tilde{0}\) of length \(p\) has all entries equal to zero:

\[ \tilde{0}= \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \]

Definition 14 (Ones vector) The ones vector \(\tilde{1}\) of length \(p\) has all entries equal to one:

\[ \tilde{1} = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} \]

Definition 15 (Indicator vector / standard basis vector) The \(j\)-th indicator vector (or standard basis vector) \(\tilde{e}_j\) of length \(p\) has a \(1\) in position \(j\) and \(0\)s elsewhere:

\[ (\tilde{e}_j)_i = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases} \qquad \tilde{e}_j = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \leftarrow \text{position } j \]

Theorem 35 (Indicator vectors select entries) For any vector \(\tilde{x}\) of length \(p\) and any \(j \in \{1, \ldots, p\}\):

\[{\tilde{e}_j}^{\top}\tilde{x}= x_j\]

Proof. Writing the product componentwise:

\[ \begin{aligned} {\tilde{e}_j}^{\top}\tilde{x} &= \sum_{i=1}^{p} (\tilde{e}_j)_i\, x_i \\&= \sum_{i=1}^{p} \begin{cases} 1 \cdot x_i & \text{if } i = j \\ 0 \cdot x_i & \text{if } i \neq j \end{cases} \\&= x_j \end{aligned} \]

Definition 16 (Dot product/linear combination/inner product) For any two real-valued vectors \(\tilde{x}= (x_1, \ldots, x_n)\) and \(\tilde{y}= (y_1, \ldots, y_n)\), the dot-product, linear combination, or inner product of \(\tilde{x}\) and \(\tilde{y}\) is:

\[\tilde{x}\cdot \tilde{y}= \tilde{x}^{\top} \tilde{y}\stackrel{\text{def}}{=}\sum_{i=1}^nx_i y_i\]

Theorem 36 (Dot product is symmetric) The dot product is symmetric:

\[\tilde{x}\cdot \tilde{y}= \tilde{y}\cdot \tilde{x}\]

Proof. Apply:

Definition 16
symmetry of scalar multiplication
Definition 16 again
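Written out as a chain of equalities, the three steps of this proof are:

\[ \begin{aligned} \tilde{x}\cdot \tilde{y} &= \sum_{i=1}^n x_i y_i \\ &= \sum_{i=1}^n y_i x_i \\ &= \tilde{y}\cdot \tilde{x} \end{aligned} \]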

Example 16 (Dot product as matrix multiplication) The dot product of two column vectors \(\tilde{x}\) and \(\tilde{\beta}\) can be written as a matrix product of the row vector \({\tilde{x}}^{\top}\) with the column vector \(\tilde{\beta}\):

\[ \begin{aligned} \tilde{x}\cdot \tilde{\beta} &= {\tilde{x}}^{\top}\, \tilde{\beta} \\ &= [x_1,\; x_2,\; \ldots,\; x_p] \begin{bmatrix} \beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_{p} \end{bmatrix} \\ &= x_1\beta_1 + x_2\beta_2 + \cdots + x_p \beta_p \end{aligned} \]
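The definitions above can be tried out in base R. This is a small sketch with made-up numbers: the vectors `x`, `beta`, and `e2` are illustrative and do not come from the notes.

```r
x    <- c(3, 5, 7)
beta <- c(2, -1, 4)
e2   <- c(0, 1, 0)  # indicator vector for position j = 2

# Definition 16: sum of elementwise products
dot_sum <- sum(x * beta)

# Example 16: row vector times column vector (returns a 1 x 1 matrix)
dot_mat <- t(x) %*% beta

# Theorem 35: the indicator vector selects entry 2 of x
selected <- sum(e2 * x)

c(dot_sum = dot_sum, dot_mat = drop(dot_mat), selected = selected)
```

Here `drop()` only converts the \(1 \times 1\) matrix to a plain number so the three results can be shown side by side.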

Orthogonality

Definition 17 (Orthogonal vectors) Two vectors \(\tilde{x}\) and \(\tilde{y}\) of the same length are orthogonal (written \(\tilde{x}\perp \tilde{y}\)) if their dot product is zero:

\[\tilde{x}\perp \tilde{y}\iff {\tilde{x}}^{\top}\tilde{y}= 0\]

Definition 18 (Orthonormal vectors) A set of vectors \(\{\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_k\}\) is orthonormal if the vectors are mutually orthogonal and each has unit length:

\[{\tilde{x}_i}^{\top}\tilde{x}_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}\]
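One way to check orthonormality numerically, sketched here in base R with an illustrative pair of vectors, is to stack the vectors as columns of a matrix `Q` and compare \({Q}^{\top}Q\) with the identity matrix:

```r
# Columns of Q are the candidate vectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2)
Q <- cbind(c(1, 1), c(1, -1)) / sqrt(2)

# Entry (i, j) of crossprod(Q) = t(Q) %*% Q is the dot product of columns i and j
gram <- crossprod(Q)

# Orthonormal columns give the identity matrix, up to rounding error
is_orthonormal <- isTRUE(all.equal(gram, diag(2)))
is_orthonormal
```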

5.2 Matrices

Definition 19 (Matrix) A matrix of dimensions \(m \times n\) is a rectangular array of \(m \cdot n\) numbers, arranged in \(m\) rows and \(n\) columns:

\[ \mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} \]

Matrix transpose

Definition 20 (Matrix transpose) The transpose of an \(m \times n\) matrix \(\mathbf{A}\) is the \(n \times m\) matrix \({\mathbf{A}}^{\top}\) obtained by swapping the rows and columns of \(\mathbf{A}\):

\[({\mathbf{A}}^{\top})_{ij} = a_{ji}\]

Theorem 37 (Transpose of a sum) \[{(\mathbf{A} + \mathbf{B})}^{\top} = {\mathbf{A}}^{\top} + {\mathbf{B}}^{\top}\]

In particular, for column vectors \(\tilde{x}\) and \(\tilde{y}\):

\[{(\tilde{x}+ \tilde{y})}^{\top} = {\tilde{x}}^{\top} + {\tilde{y}}^{\top}\]

Theorem 38 (Transpose of a product) For compatible matrices \(\mathbf{A}\) and \(\mathbf{B}\):

\[{(\mathbf{A}\mathbf{B})}^{\top} = {\mathbf{B}}^{\top}\,{\mathbf{A}}^{\top}\]
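A quick numerical illustration of Theorems 37 and 38 in base R, using small integer-valued matrices chosen here only as an example (so the comparisons are exact):

```r
A <- matrix(1:6, nrow = 2)                    # 2 x 3
B <- matrix(c(1, 0, 2, -1, 3, 1), nrow = 3)   # 3 x 2
C <- matrix(6:1, nrow = 2)                    # 2 x 3, same shape as A

# Theorem 38: transposing a product reverses the order of the factors
lhs <- t(A %*% B)
rhs <- t(B) %*% t(A)
product_rule_holds <- all(lhs == rhs)

# Theorem 37: the transpose of a sum is the sum of the transposes
sum_rule_holds <- all(t(A + C) == t(A) + t(C))

c(product_rule_holds = product_rule_holds, sum_rule_holds = sum_rule_holds)
```

Note that `t(A) %*% t(B)` is not even defined for these shapes in general, which is one way to remember that the order must be reversed.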

Matrix addition

Definition 21 (Zero matrix) The \(m \times n\) zero matrix \(\mathbf{0}_{m \times n}\) (or \(\mathbf{0}\) when dimensions are clear from context) has all entries equal to zero:

\[ \mathbf{0}_{m \times n} = \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix} \]

Definition 22 (Matrix addition) Two matrices \(\mathbf{A}\) and \(\mathbf{B}\) of the same dimensions \(m \times n\) can be added element-wise:

\[(\mathbf{A} + \mathbf{B})_{ij} = a_{ij} + b_{ij}\]

Theorem 39 (Matrix addition is commutative) \[\mathbf{A} + \mathbf{B} = \mathbf{B} + \mathbf{A}\]

Theorem 40 (Matrix addition is associative) \[(\mathbf{A} + \mathbf{B}) + \mathbf{C} = \mathbf{A} + (\mathbf{B} + \mathbf{C})\]

Theorem 41 (Zero matrix is the additive identity) \[\mathbf{A} + \mathbf{0} = \mathbf{A}\]

Theorem 42 (Additive inverse) For any matrix \(\mathbf{A}\), the matrix \(-\mathbf{A}\) (defined by \((-\mathbf{A})_{ij} = -a_{ij}\)) satisfies:

\[\mathbf{A} + (-\mathbf{A}) = \mathbf{0}\]

Scalar multiplication

Definition 23 (Scalar multiplication) A matrix \(\mathbf{A}\) can be multiplied by a scalar \(c\):

\[(c\mathbf{A})_{ij} = c \cdot a_{ij}\]

Matrix multiplication

Definition 24 (Matrix multiplication) The product of an \(m \times k\) matrix \(\mathbf{A}\) and a \(k \times n\) matrix \(\mathbf{B}\) is the \(m \times n\) matrix \(\mathbf{C} = \mathbf{A}\mathbf{B}\) with entries:

\[c_{ij} = \sum_{s=1}^{k} a_{is}\, b_{sj}\]

Theorem 43 (Matrix multiplication is associative) \[(\mathbf{A}\mathbf{B})\mathbf{C} = \mathbf{A}(\mathbf{B}\mathbf{C})\]

Theorem 44 (Matrix multiplication is distributive over addition) \[\mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{A}\mathbf{B} + \mathbf{A}\mathbf{C}\]

\[(\mathbf{A} + \mathbf{B})\mathbf{C} = \mathbf{A}\mathbf{C} + \mathbf{B}\mathbf{C}\]

Matrix-vector multiplication

Definition 25 (Matrix-vector multiplication) The product of an \(m \times p\) matrix \(\mathbf{A}\) and a \(p \times 1\) column vector \(\tilde{x}\) is the \(m \times 1\) column vector \(\mathbf{A}\tilde{x}\) with entries:

\[(\mathbf{A}\tilde{x})_i = \sum_{j=1}^{p} a_{ij}\, x_j\]
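The entry-wise formula in Definition 25 can be compared with R's built-in `%*%` operator. The matrix and vector below are arbitrary illustrative values:

```r
A <- matrix(c(1, 4, 2, 5, 3, 6), nrow = 2)  # rows are (1, 2, 3) and (4, 5, 6)
x <- c(1, 0, -1)

# Definition 25, entry by entry: (A x)_i = sum_j a_ij * x_j
by_hand <- sapply(seq_len(nrow(A)), function(i) sum(A[i, ] * x))

# Built-in matrix product; drop() turns the 2 x 1 result into a plain vector
built_in <- drop(A %*% x)

rbind(by_hand = by_hand, built_in = built_in)
```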

5.3 Special Matrices

Definition 26 (Square matrix) A matrix is square if it has the same number of rows as columns. The number of rows (= columns) is the order of the matrix.

Definition 27 (Matrix power) For a square matrix \(\mathbf{A}\) of order \(p\) and a positive integer \(k\), the \(k\)-th power of \(\mathbf{A}\) is:

\[\mathbf{A}^k = \underbrace{\mathbf{A}\,\mathbf{A}\cdots\mathbf{A}}_{k \text{ copies}}\]

In particular, \(\mathbf{A}^2 = \mathbf{A}\mathbf{A}\).

Definition 28 (Identity matrix) The \(p \times p\) identity matrix \(\mathbf{I}_p\) (or \(\mathbf{I}\) when the size is clear from context) has ones on the main diagonal and zeros elsewhere:

\[ (\mathbf{I}_p)_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases} \qquad \mathbf{I}_p = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} \]

Theorem 45 (Identity matrix is a multiplicative identity) For any \(m \times p\) matrix \(\mathbf{A}\):

\[\mathbf{A}\,\mathbf{I}_p = \mathbf{A}\]

\[\mathbf{I}_m\,\mathbf{A} = \mathbf{A}\]

Definition 29 (Symmetric matrix) A square matrix \(\mathbf{A}\) is symmetric if \({\mathbf{A}}^{\top} = \mathbf{A}\), i.e., \(a_{ij} = a_{ji}\) for all \(i\) and \(j\).

Definition 30 (Diagonal matrix) A square matrix \(\mathbf{D}\) is a diagonal matrix if all off-diagonal entries are zero: \(d_{ij} = 0\) whenever \(i \neq j\):

\[ \mathbf{D} = \begin{bmatrix} d_1 & 0 & \cdots & 0 \\ 0 & d_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_p \end{bmatrix} \]

Definition 31 (Matrix inverse) For a square \(p \times p\) matrix \(\mathbf{A}\), the inverse \(\mathbf{A}^{-1}\) (if it exists) is the unique matrix satisfying:

\[\mathbf{A}\,\mathbf{A}^{-1} = \mathbf{A}^{-1}\,\mathbf{A} = \mathbf{I}_p\]

Theorem 46 (Inverse of a product) For invertible matrices \(\mathbf{A}\) and \(\mathbf{B}\):

\[(\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}\]
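As a sketch, Definition 31 and Theorem 46 can be checked numerically with base R's `solve()`, which returns the inverse of a square matrix when called with one argument. The two invertible matrices below are illustrative:

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)  # determinant 5, so invertible
B <- matrix(c(1, 2, 0, 1), nrow = 2)  # determinant 1, so invertible

# Theorem 46: the inverse of a product reverses the order of the factors
inv_of_product      <- solve(A %*% B)
product_of_inverses <- solve(B) %*% solve(A)

# Compare with a tolerance, since solve() works in floating point
inverse_rule_holds <- isTRUE(all.equal(inv_of_product, product_of_inverses))

# Definition 31: A times its inverse is the identity matrix
is_identity <- isTRUE(all.equal(A %*% solve(A), diag(2)))

c(inverse_rule_holds = inverse_rule_holds, is_identity = is_identity)
```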

Definition 32 (Idempotent matrix) A square matrix \(\mathbf{A}\) is idempotent if

\[\mathbf{A}^2 = \mathbf{A}\]

Definition 33 (Projection matrix) A square matrix \(\mathbf{P}\) is a projection matrix (also called an orthogonal projector) if it is both symmetric and idempotent:

\[{\mathbf{P}}^{\top} = \mathbf{P} \qquad \text{and} \qquad \mathbf{P}^2 = \mathbf{P}\]

Theorem 47 (Complement of a projection matrix) If \(\mathbf{P}\) is a projection matrix, then \(\mathbf{I} - \mathbf{P}\) is also a projection matrix.

Proof. We verify symmetry and idempotency.

Symmetry: \[{(\mathbf{I} - \mathbf{P})}^{\top} = {\mathbf{I}}^{\top} - {\mathbf{P}}^{\top} = \mathbf{I} - \mathbf{P}\]

Idempotency: \[\begin{aligned} (\mathbf{I} - \mathbf{P})^2 &= (\mathbf{I} - \mathbf{P})(\mathbf{I} - \mathbf{P}) \\ &= \mathbf{I} - \mathbf{P} - \mathbf{P} + \mathbf{P}^2 \\ &= \mathbf{I} - \mathbf{P} - \mathbf{P} + \mathbf{P} \\ &= \mathbf{I} - \mathbf{P} \end{aligned}\]

Theorem 48 (Hat matrix is a projection matrix) In a linear regression model with full-rank design matrix \(\mathbf{X}\), the hat matrix

\[\mathbf{H} = \mathbf{X}({\mathbf{X}}^{\top}\mathbf{X})^{-1}{\mathbf{X}}^{\top}\]

is a projection matrix.

Proof. We verify symmetry and idempotency.

Symmetry: \[\begin{aligned} {\mathbf{H}}^{\top} &= {\left(\mathbf{X}({\mathbf{X}}^{\top}\mathbf{X})^{-1}{\mathbf{X}}^{\top}\right)}^{\top} \\ &= {({\mathbf{X}}^{\top})}^{\top} \cdot {\left(({\mathbf{X}}^{\top}\mathbf{X})^{-1}\right)}^{\top} \cdot {\mathbf{X}}^{\top} \\ &= \mathbf{X}\cdot ({\mathbf{X}}^{\top}\mathbf{X})^{-1} \cdot {\mathbf{X}}^{\top} \\ &= \mathbf{H} \end{aligned}\]

where the third line uses \({({\mathbf{X}}^{\top})}^{\top} = \mathbf{X}\) and the fact that \({\mathbf{X}}^{\top}\mathbf{X}\) is symmetric, so its inverse is also symmetric (\({\left(({\mathbf{X}}^{\top}\mathbf{X})^{-1}\right)}^{\top} = ({\mathbf{X}}^{\top}\mathbf{X})^{-1}\)).

Idempotency: \[\begin{aligned} \mathbf{H}^2 &= \mathbf{X}({\mathbf{X}}^{\top}\mathbf{X})^{-1}{\mathbf{X}}^{\top} \cdot \mathbf{X}({\mathbf{X}}^{\top}\mathbf{X})^{-1}{\mathbf{X}}^{\top} \\ &= \mathbf{X}({\mathbf{X}}^{\top}\mathbf{X})^{-1}({\mathbf{X}}^{\top}\mathbf{X})({\mathbf{X}}^{\top}\mathbf{X})^{-1}{\mathbf{X}}^{\top} \\ &= \mathbf{X}({\mathbf{X}}^{\top}\mathbf{X})^{-1}{\mathbf{X}}^{\top} \\ &= \mathbf{H} \end{aligned}\]

Theorem 49 (Projection matrices produce orthogonal decompositions) If \(\mathbf{P}\) is a projection matrix and \(\tilde{v}\) is any vector of compatible dimension, then the two components of the decomposition

\[\tilde{v} = \underbrace{\mathbf{P}\tilde{v}}_{\text{projected}} + \underbrace{(\mathbf{I} - \mathbf{P})\tilde{v}}_{\text{residual}}\]

are orthogonal:

\[\mathbf{P}\tilde{v} \;\perp\; (\mathbf{I} - \mathbf{P})\tilde{v}\]

Proof. \[\begin{aligned} {(\mathbf{P}\tilde{v})}^{\top}\,(\mathbf{I} - \mathbf{P})\tilde{v} &= {\tilde{v}}^{\top}\,{\mathbf{P}}^{\top}\,(\mathbf{I} - \mathbf{P})\tilde{v} \\ &= {\tilde{v}}^{\top}\,\mathbf{P}\,(\mathbf{I} - \mathbf{P})\tilde{v} \\ &= {\tilde{v}}^{\top}\,(\mathbf{P} - \mathbf{P}^2)\tilde{v} \\ &= {\tilde{v}}^{\top}\,(\mathbf{P} - \mathbf{P})\tilde{v} \\ &= {\tilde{v}}^{\top}\,\mathbf{0}\,\tilde{v} \\ &= 0 \end{aligned}\]

where the second line uses symmetry (\({\mathbf{P}}^{\top} = \mathbf{P}\)) and the fourth line uses idempotency (\(\mathbf{P}^2 = \mathbf{P}\)).
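Theorems 47 to 49 can be illustrated numerically. The sketch below builds the hat matrix for a small hypothetical design matrix (an intercept column plus one predictor) and an arbitrary vector `v`; none of these numbers come from the notes.

```r
# Hypothetical full-rank design matrix: intercept column plus one predictor
X <- cbind(1, c(1, 2, 3, 4, 5))
v <- c(2, 1, 4, 3, 6)

H <- X %*% solve(t(X) %*% X) %*% t(X)  # hat matrix (Theorem 48)
M <- diag(nrow(X)) - H                 # complement I - H (Theorem 47)

# Symmetry and idempotency, compared with a floating-point tolerance
is_symmetric          <- isTRUE(all.equal(H, t(H)))
is_idempotent         <- isTRUE(all.equal(H %*% H, H))
complement_idempotent <- isTRUE(all.equal(M %*% M, M))

# Theorem 49: the projected and residual parts are orthogonal
projected     <- drop(H %*% v)
residual      <- drop(M %*% v)
inner_product <- sum(projected * residual)  # zero up to rounding error

c(is_symmetric = is_symmetric, is_idempotent = is_idempotent,
  complement_idempotent = complement_idempotent)
inner_product
```

In a regression context, `projected` plays the role of the fitted values and `residual` the role of the residuals, so this is the familiar fact that the two are orthogonal.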

5.4 Quadratic Forms

Definition 34 (Quadratic form) A quadratic form is a mathematical expression of the structure

\[{\tilde{x}}^{\top}\, \mathbf{S}\, \tilde{x}\]

where \(\tilde{x}\) is a \(p \times 1\) vector and \(\mathbf{S}\) is a \(p \times p\) matrix.

Theorem 50 (Symmetric part of a quadratic form) If \(\mathbf{S}\) is a square matrix, then

\[ {\tilde{x}}^{\top}\mathbf{S}\tilde{x} = {\tilde{x}}^{\top}\left(\frac{1}{2}(\mathbf{S}+{\mathbf{S}}^{\top})\right)\tilde{x}. \]

So the value of a quadratic form depends only on the symmetric part of \(\mathbf{S}\).
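A short justification of Theorem 50, sketched here because the theorem is stated without proof: the quadratic form is a scalar (a \(1 \times 1\) matrix), so it equals its own transpose, and Theorem 38 gives

\[ {\tilde{x}}^{\top}\mathbf{S}\tilde{x} = {\left({\tilde{x}}^{\top}\mathbf{S}\tilde{x}\right)}^{\top} = {\tilde{x}}^{\top}{\mathbf{S}}^{\top}\tilde{x}. \]

Averaging the two equal expressions and applying distributivity (Theorem 44):

\[ \begin{aligned} {\tilde{x}}^{\top}\mathbf{S}\tilde{x} &= \frac{1}{2}{\tilde{x}}^{\top}\mathbf{S}\tilde{x} + \frac{1}{2}{\tilde{x}}^{\top}{\mathbf{S}}^{\top}\tilde{x} \\ &= {\tilde{x}}^{\top}\left(\frac{1}{2}(\mathbf{S}+{\mathbf{S}}^{\top})\right)\tilde{x}. \end{aligned} \]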

5.5 Design Matrix

Definition 35 (Design matrix) In a regression model with \(n\) observations and \(p\) predictors, the design matrix (or model matrix) \(\mathbf{X}\) is the \(n \times p\) matrix whose \(i\)-th row is the covariate vector \({\tilde{x}_i}^{\top}\) for observation \(i\):

\[ \mathbf{X}= \begin{bmatrix} {\tilde{x}_1}^{\top} \\ {\tilde{x}_2}^{\top} \\ \vdots \\ {\tilde{x}_n}^{\top} \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} \]

6 Vector Calculus

(adapted from Fieller (2016), §7.2)

Let \(\tilde{x}\) and \(\tilde{\beta}\) be column vectors of length \(p\) (see Definition 11 and Definition 16).

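Definition 36 (Vector derivative) If \(f(\tilde{\beta})\) is a scalar-valued function that takes a vector \(\tilde{\beta}\) as input, such as \(f(\tilde{\beta}) = {\tilde{x}}^{\top}\tilde{\beta}\), then its derivative with respect to \(\tilde{\beta}\) is the column vector of partial derivatives: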

\[ \frac{\partial}{\partial \tilde{\beta}} f(\tilde{\beta}) = \begin{bmatrix} \frac{\partial}{\partial \beta_1}f(\tilde{\beta}) \\ \frac{\partial}{\partial \beta_2}f(\tilde{\beta}) \\ \vdots \\ \frac{\partial}{\partial \beta_p}f(\tilde{\beta}) \end{bmatrix} \]

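Definition 37 (Row-vector derivative) If \(f(\tilde{\beta})\) is a scalar-valued function that takes a vector \(\tilde{\beta}\) as input, such as \(f(\tilde{\beta}) = {\tilde{x}}^{\top}\tilde{\beta}\), then its derivative with respect to \({\tilde{\beta}}^{\top}\) is the row vector of partial derivatives: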

\[ \frac{\partial}{\partial \tilde{\beta}^{\top}} f(\tilde{\beta}) = \begin{bmatrix} \frac{\partial}{\partial \beta_1}f(\tilde{\beta}) & \frac{\partial}{\partial \beta_2}f(\tilde{\beta}) & \cdots & \frac{\partial}{\partial \beta_p}f(\tilde{\beta}) \end{bmatrix} \]

Theorem 51 (Row and column derivatives are transposes) \[\frac{\partial}{\partial \tilde{\beta}^{\top}} f(\tilde{\beta}) = \mathopen{}\left(\frac{\partial}{\partial \tilde{\beta}} f(\tilde{\beta})\right)\mathclose{}^{\top}\]

\[\frac{\partial}{\partial \tilde{\beta}} f(\tilde{\beta}) = \mathopen{}\left(\frac{\partial}{\partial \tilde{\beta}^{\top}} f(\tilde{\beta})\right)\mathclose{}^{\top}\]

Definition 38 (Constant) \(\tilde{x}\) is constant with respect to \(\tilde{\beta}\) if

\[ \underbrace{\frac{\partial}{\partial \tilde{\beta}} {\tilde{x}}^{\top}}_{p \times p} = \underbrace{\mathbf{0}}_{p \times p} \]

Example 17 (A constant vector) Let \(\tilde{\beta}= {(\beta_1, \beta_2)}^{\top}\) and \(\tilde{x}= {(3, 5)}^{\top}\), so \(x_1 = 3\) and \(x_2 = 5\) do not depend on \(\tilde{\beta}\). Expanding \(\frac{\partial}{\partial \tilde{\beta}} {\tilde{x}}^{\top}\) into its matrix of scalar partial derivatives (Definition 36, applied to each component of the row \({\tilde{x}}^{\top}\)) and evaluating each entry:

\[ \underbrace{\frac{\partial}{\partial \tilde{\beta}} {\tilde{x}}^{\top}}_{2 \times 2} = \frac{\partial}{\partial \tilde{\beta}} \begin{bmatrix}x_1 & x_2\end{bmatrix} = \begin{bmatrix} \frac{\partial}{\partial \beta_1} x_1 & \frac{\partial}{\partial \beta_1} x_2 \\ \frac{\partial}{\partial \beta_2} x_1 & \frac{\partial}{\partial \beta_2} x_2 \end{bmatrix} = \begin{bmatrix} \frac{\partial}{\partial \beta_1} 3 & \frac{\partial}{\partial \beta_1} 5 \\ \frac{\partial}{\partial \beta_2} 3 & \frac{\partial}{\partial \beta_2} 5 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix} = \underbrace{\mathbf{0}}_{2 \times 2} \]

Every entry is the derivative of a constant, so \(\frac{\partial}{\partial \tilde{\beta}} {\tilde{x}}^{\top} = \mathbf{0}\) and \(\tilde{x}\) is constant with respect to \(\tilde{\beta}\) (Definition 38).

Theorem 52 (Derivative of a dot product) If \(\tilde{x}\) is constant with respect to \(\tilde{\beta}\), then:

\[ \underbrace{\frac{\partial}{\partial \tilde{\beta}} (\tilde{x}\cdot \tilde{\beta})}_{p \times 1} = \underbrace{\frac{\partial}{\partial \tilde{\beta}} (\tilde{\beta}\cdot \tilde{x})}_{p \times 1} = \underbrace{\tilde{x}}_{p \times 1} \]

Proof. \[ \begin{aligned} \frac{\partial}{\partial \tilde{\beta}} (\tilde{x}\cdot \tilde{\beta}) &= \begin{bmatrix} \frac{\partial}{\partial \beta_1}(x_1\beta_1+x_2\beta_2 +...+x_p \beta_p ) \\ \frac{\partial}{\partial \beta_2}(x_1\beta_1+x_2\beta_2 +...+x_p \beta_p ) \\ \vdots \\ \frac{\partial}{\partial \beta_p}(x_1\beta_1+x_2\beta_2 +...+x_p \beta_p ) \end{bmatrix} \\ &= \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{p} \end{bmatrix} \\ &= \tilde{x} \end{aligned} \]

Example 18 (Derivative of a dot product) Let \(\tilde{x}= {(3, 5)}^{\top}\) (constant with respect to \(\tilde{\beta}\); see Example 17) and \(\tilde{\beta}= {(\beta_1, \beta_2)}^{\top}\). Then \(\tilde{x}\cdot \tilde{\beta}= 3\beta_1 + 5\beta_2\), and by Theorem 52:

\[ \underbrace{\frac{\partial}{\partial \tilde{\beta}}(\tilde{x}\cdot \tilde{\beta})}_{2 \times 1} = \underbrace{\tilde{x}}_{2 \times 1} = \begin{pmatrix} 3 \\ 5 \end{pmatrix} \]

Verifying entry-wise:

\[ \frac{\partial}{\partial \tilde{\beta}}(3\beta_1 + 5\beta_2) = \begin{pmatrix} \frac{\partial}{\partial \beta_1}(3\beta_1 + 5\beta_2) \\ \frac{\partial}{\partial \beta_2}(3\beta_1 + 5\beta_2) \end{pmatrix} = \begin{pmatrix} 3 \\ 5 \end{pmatrix} \]

Both methods agree. ✓

Theorem 53 (Product rule for dot-products) If \(\mathbf{a} = \mathbf{a}(\tilde{x})\) and \(\mathbf{b} = \mathbf{b}(\tilde{x})\) are differentiable \(p \times 1\) vector functions of \(\tilde{x}\), then:

\[ \begin{aligned} \frac{\partial}{\partial \underbrace{\tilde{x}}_{p \times 1}} \mathopen{}\left( \underbrace{\mathbf{a}}_{p \times 1} \cdot \underbrace{\mathbf{b}}_{p \times 1} \right)\mathclose{} &= \mathopen{}\left( \frac{\partial}{\partial \underbrace{\tilde{x}}_{p \times 1}} \underbrace{{\mathbf{a}}^{\top}}_{1 \times p} \right)\mathclose{} \underbrace{\mathbf{b}}_{p \times 1} + \mathopen{}\left( \frac{\partial}{\partial \underbrace{\tilde{x}}_{p \times 1}} \underbrace{{\mathbf{b}}^{\top}}_{1 \times p} \right)\mathclose{} \underbrace{\mathbf{a}}_{p \times 1} \end{aligned} \]

Proof. Entry-wise, for \(i = 1, \ldots, p\):

\[ \begin{aligned} \left[\frac{\partial}{\partial \tilde{x}} (\mathbf{a} \cdot \mathbf{b})\right]_i &= \frac{\partial}{\partial x_i} \sum_{k=1}^p a_k b_k \\ &= \sum_{k=1}^p \mathopen{}\left(b_k \frac{\partial}{\partial x_i} a_k + a_k \frac{\partial}{\partial x_i} b_k\right)\mathclose{} \\ &= \left[\mathopen{}\left(\frac{\partial}{\partial \tilde{x}} {\mathbf{a}}^{\top}\right)\mathclose{}\mathbf{b}\right]_i + \left[\mathopen{}\left(\frac{\partial}{\partial \tilde{x}} {\mathbf{b}}^{\top}\right)\mathclose{}\mathbf{a}\right]_i \end{aligned} \]

Example 19 (Example of the dot-product rule) Let \(\tilde{x}= {(\beta_1, \beta_2)}^{\top}\), \(\mathbf{a}(\tilde{x}) = {(\beta_1, \beta_1\beta_2)}^{\top}\), and \(\mathbf{b}(\tilde{x}) = {(\beta_2, \beta_1)}^{\top}\). Then:

\[ \mathbf{a} \cdot \mathbf{b} = \beta_1 \cdot \beta_2 + \beta_1\beta_2 \cdot \beta_1 = \beta_1\beta_2 + \beta_1^2\beta_2 \]

By direct calculation:

\[ \underbrace{\frac{\partial}{\partial \tilde{x}}(\mathbf{a} \cdot \mathbf{b})}_{2 \times 1} = \frac{\partial}{\partial \tilde{x}}(\beta_1\beta_2 + \beta_1^2\beta_2) = \begin{pmatrix} \beta_2 + 2\beta_1\beta_2 \\ \beta_1 + \beta_1^2 \end{pmatrix} \]

By the product rule (Theorem 53), using \(\underbrace{\frac{\partial}{\partial \tilde{x}}{\mathbf{a}}^{\top}}_{2 \times 2} = \begin{pmatrix}1 & \beta_2 \\ 0 & \beta_1\end{pmatrix}\) and \(\underbrace{\frac{\partial}{\partial \tilde{x}}{\mathbf{b}}^{\top}}_{2 \times 2} = \begin{pmatrix}0 & 1 \\ 1 & 0\end{pmatrix}\):

\[ \begin{aligned} \underbrace{\frac{\partial}{\partial \tilde{x}}(\mathbf{a} \cdot \mathbf{b})}_{2 \times 1} &= \underbrace{\begin{pmatrix}1 & \beta_2 \\ 0 & \beta_1\end{pmatrix}}_{2 \times 2} \underbrace{\begin{pmatrix}\beta_2 \\ \beta_1\end{pmatrix}}_{2 \times 1} + \underbrace{\begin{pmatrix}0 & 1 \\ 1 & 0\end{pmatrix}}_{2 \times 2} \underbrace{\begin{pmatrix}\beta_1 \\ \beta_1\beta_2\end{pmatrix}}_{2 \times 1} \\ &= \begin{pmatrix}\beta_2 + \beta_1\beta_2 \\ \beta_1^2\end{pmatrix} + \begin{pmatrix}\beta_1\beta_2 \\ \beta_1\end{pmatrix} \\ &= \begin{pmatrix}\beta_2 + 2\beta_1\beta_2 \\ \beta_1^2 + \beta_1\end{pmatrix} \end{aligned} \]

Both methods agree. ✓

Theorem 54 (Derivative of a linear map) If \(A\) is an \(m \times p\) matrix that is constant with respect to \(\tilde{\beta}\), then:

\[ \underbrace{\frac{\partial}{\partial \tilde{\beta}} (A\tilde{\beta})}_{p \times m} = \underbrace{{A}^{\top}}_{p \times m} \]

Proof. For entry \((i,j)\), where row \(i\) indexes the denominator \(\tilde{\beta}\) (see Definition 36) and column \(j\) indexes the numerator \(A\tilde{\beta}\):

\[ \begin{aligned} \left[\frac{\partial}{\partial \tilde{\beta}} (A\tilde{\beta})\right]_{ij} &= \frac{\partial}{\partial \beta_i} (A\tilde{\beta})_j \\ &= \frac{\partial}{\partial \beta_i} \sum_{k=1}^{p} a_{jk} \beta_k \\ &= a_{ji} \\ &= \left[{A}^{\top}\right]_{ij} \end{aligned} \]

Example 20 (Derivative of a linear map) Let \(A = \begin{pmatrix} 2 & 3 \end{pmatrix}\) (\(1 \times 2\)) and \(\tilde{\beta}= {(\beta_1, \beta_2)}^{\top}\). Then \(A\tilde{\beta}= 2\beta_1 + 3\beta_2\), and by Theorem 54:

\[ \underbrace{\frac{\partial}{\partial \tilde{\beta}}(A\tilde{\beta})}_{2 \times 1} = \underbrace{{A}^{\top}}_{2 \times 1} = \begin{pmatrix} 2 \\ 3 \end{pmatrix} \]

Theorem 55 (Vector-derivative of a matrix-vector product) If \(A\) is an \(m \times q\) matrix that is constant with respect to \(\tilde{\beta}\), and \(\tilde{v} = \tilde{v}(\tilde{\beta})\) is a \(q \times 1\) vector that depends on the \(p \times 1\) vector \(\tilde{\beta}\), then:

\[ \underbrace{\frac{\partial}{\partial \tilde{\beta}} (A\tilde{v})}_{p \times m} = \underbrace{\mathopen{}\left(\frac{\partial}{\partial \tilde{\beta}} \tilde{v}\right)\mathclose{}}_{p \times q} \underbrace{{A}^{\top}}_{q \times m} \]

Proof. For entry \((i,j)\), where row \(i\) indexes the denominator \(\tilde{\beta}\) and column \(j\) indexes the numerator \(A\tilde{v}\):

\[ \begin{aligned} \left[\frac{\partial}{\partial \tilde{\beta}} (A\tilde{v})\right]_{ij} &= \frac{\partial}{\partial \beta_i} (A\tilde{v})_j \\ &= \frac{\partial}{\partial \beta_i} \sum_{k=1}^{q} a_{jk} v_k \\ &= \sum_{k=1}^{q} a_{jk} \frac{\partial}{\partial \beta_i} v_k \\ &= \sum_{k=1}^{q} \left[\frac{\partial}{\partial \tilde{\beta}} \tilde{v}\right]_{ik} \left[{A}^{\top}\right]_{kj} \\ &= \left[\mathopen{}\left(\frac{\partial}{\partial \tilde{\beta}} \tilde{v}\right)\mathclose{} {A}^{\top}\right]_{ij} \end{aligned} \]

Example 21 (Vector-derivative of a matrix-vector product) Let \(A = \begin{pmatrix} 2 & 3 \end{pmatrix}\) (\(1 \times 2\), constant) and \(\tilde{v}(\tilde{\beta}) = {(\beta_1^2, \beta_2^2)}^{\top}\). Then \(A\tilde{v} = 2\beta_1^2 + 3\beta_2^2\). By Theorem 55:

\[ \begin{aligned} \underbrace{\frac{\partial}{\partial \tilde{\beta}}(A\tilde{v})}_{2 \times 1} &= \begin{pmatrix} 2\beta_1 & 0 \\ 0 & 2\beta_2 \end{pmatrix} \begin{pmatrix} 2 \\ 3 \end{pmatrix} \\ &= \begin{pmatrix} 4\beta_1 \\ 6\beta_2 \end{pmatrix} \end{aligned} \]

Corollary 6 (Derivative of a dot product, transpose-product form) If \(\tilde{x}\) is constant with respect to \(\tilde{\beta}\), then:

\[ \underbrace{\frac{\partial}{\partial \tilde{\beta}} (\underbrace{{\tilde{x}}^{\top}}_{1 \times p} \underbrace{\tilde{\beta}}_{p \times 1})}_{p \times 1} = \underbrace{\tilde{x}}_{p \times 1} \]

Proof. Using Theorem 52:

Since \({\tilde{x}}^{\top}\tilde{\beta}= \tilde{x}\cdot \tilde{\beta}\) (Definition 16), and \(\tilde{x}\) is constant with respect to \(\tilde{\beta}\):

\[ \frac{\partial}{\partial \tilde{\beta}}({\tilde{x}}^{\top}\tilde{\beta}) = \frac{\partial}{\partial \tilde{\beta}}(\tilde{x}\cdot \tilde{\beta}) = \tilde{x} \]

by Theorem 52.

Proof. Using Theorem 55:

Since \(\tilde{x}\) is constant with respect to \(\tilde{\beta}\), \(A = {\tilde{x}}^{\top}\) is a constant \(1 \times p\) matrix. Applying Theorem 55 with \(\tilde{v} = \tilde{\beta}\) (so \(\frac{\partial}{\partial \tilde{\beta}}\tilde{\beta}= \mathbf{I}\)):

\[ \begin{aligned} \frac{\partial}{\partial \tilde{\beta}}({\tilde{x}}^{\top}\tilde{\beta}) &= \mathopen{}\left(\frac{\partial}{\partial \tilde{\beta}}\tilde{\beta}\right)\mathclose{} {({\tilde{x}}^{\top})}^{\top} \\ &= \mathbf{I} \cdot \tilde{x}\\ &= \tilde{x} \end{aligned} \]

Example 22 (Derivative of a transpose product) Let \(\tilde{x}= {(3, 5)}^{\top}\) and \(\tilde{\beta}= {(\beta_1, \beta_2)}^{\top}\). Then \({\tilde{x}}^{\top}\tilde{\beta}= 3\beta_1 + 5\beta_2\), and by Corollary 6:

\[ \underbrace{\frac{\partial}{\partial \tilde{\beta}}\left(\underbrace{{\tilde{x}}^{\top}}_{1 \times 2}\underbrace{\tilde{\beta}}_{2 \times 1}\right)}_{2 \times 1} = \underbrace{\tilde{x}}_{2 \times 1} = \begin{pmatrix} 3 \\ 5 \end{pmatrix} \]

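Theorem 56 (Derivative of a quadratic form) For a quadratic form (Definition 34), if \(S\) is a symmetric \(p\times p\) matrix that is constant with respect to \(\tilde{\beta}\), then:

\[ \frac{\partial}{\partial \tilde{\beta}} {\tilde{\beta}}^{\top}S\tilde{\beta}= 2S\tilde{\beta} \]

If \(S\) is not symmetric, the derivative is \((S + {S}^{\top})\tilde{\beta}\) instead.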
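As an illustrative check (the matrix below is chosen for illustration and does not come from the notes), take the symmetric matrix \(S = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}\) and \(\tilde{\beta}= {(\beta_1, \beta_2)}^{\top}\). Expanding the quadratic form and differentiating with respect to each component gives:

\[ \begin{aligned} {\tilde{\beta}}^{\top}S\tilde{\beta} &= 2\beta_1^2 + 2\beta_1\beta_2 + 3\beta_2^2 \\ \frac{\partial}{\partial \tilde{\beta}} \mathopen{}\left({\tilde{\beta}}^{\top}S\tilde{\beta}\right)\mathclose{} &= \begin{pmatrix} 4\beta_1 + 2\beta_2 \\ 2\beta_1 + 6\beta_2 \end{pmatrix} \\ &= 2 \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} \\ &= 2S\tilde{\beta} \end{aligned} \]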

Corollary 7 (Derivative of a simple quadratic form) \[ \frac{\partial}{\partial \tilde{\beta}} {\tilde{\beta}}^{\top}\tilde{\beta}= 2\tilde{\beta} \]
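One way to see this, sketched componentwise for \(\tilde{\beta}\) with \(p\) entries: the product of \(\tilde{\beta}\) with itself is the sum of squares \(\sum_{k=1}^{p} \beta_k^2\), and only the \(k = i\) term depends on \(\beta_i\), so the \(i\)th entry of the derivative is

\[ \begin{aligned} \frac{\partial}{\partial \beta_i} \sum_{k=1}^{p} \beta_k^2 &= \frac{\partial}{\partial \beta_i} \beta_i^2 \\ &= 2\beta_i \end{aligned} \]

Equivalently, this is Theorem 56 with \(S = \mathbf{I}\).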

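Theorem 57 (Vector chain rule) If \(z\) is a scalar function of a vector \(\tilde{y}\), and \(\tilde{y}\) is a function of a vector \(\tilde{x}\), then: \[\frac{\partial z}{\partial \tilde{x}} = \frac{\partial \tilde{y}}{\partial \tilde{x}} \frac{\partial z}{\partial \tilde{y}}\]

or in Euler/Lagrange notation, with \(z = f(\tilde{y})\) and \(\tilde{y} = \tilde{g}(\tilde{x})\):

\[(f(\tilde{g}(\tilde{x})))' = \tilde{g}'(\tilde{x}) f'(\tilde{g}(\tilde{x}))\]

The order of the factors matters: if \(\tilde{x}\) has \(p\) entries and \(\tilde{y}\) has \(q\) entries, then \(\frac{\partial \tilde{y}}{\partial \tilde{x}}\) is \(p \times q\) and \(\frac{\partial z}{\partial \tilde{y}}\) is \(q \times 1\), so the product is \(p \times 1\).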

Corollary 8 (Vector chain rule for quadratic forms) \[\frac{\partial}{\partial \tilde{\beta}}{\mathopen{}\left(\tilde{\varepsilon}(\tilde{\beta})\cdot \tilde{\varepsilon}(\tilde{\beta})\right)\mathclose{}} = \mathopen{}\left(\frac{\partial}{\partial \tilde{\beta}}\tilde{\varepsilon}(\tilde{\beta})\right)\mathclose{} \mathopen{}\left(2 \tilde{\varepsilon}(\tilde{\beta})\right)\mathclose{}\]
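A common use of Corollary 8 is a sum of squared residuals. The sketch below is an illustration and not part of the original notes: it assumes a hypothetical residual function \(\tilde{\varepsilon}(\tilde{\beta}) = \tilde{y} - X\tilde{\beta}\), for which Theorem 55 gives \(\frac{\partial}{\partial \tilde{\beta}}\tilde{\varepsilon} = -{X}^{\top}\), and uses made-up values for `X`, `y`, and `beta`. It compares the chain-rule gradient with a central-difference approximation, using only plain Python.

```python
def residuals(beta, X, y):
    """eps(beta) = y - X beta, one residual per row of X (hypothetical example)."""
    return [y_i - sum(x_ij * b_j for x_ij, b_j in zip(row, beta))
            for row, y_i in zip(X, y)]


def sum_of_squares(beta, X, y):
    """eps(beta) . eps(beta), the scalar being differentiated."""
    return sum(e * e for e in residuals(beta, X, y))


def chain_rule_gradient(beta, X, y):
    """Corollary 8: (d eps / d beta) (2 eps), with d eps / d beta = -X^T (p x n)."""
    eps = residuals(beta, X, y)
    n, p = len(X), len(beta)
    # d_eps[j][i] = d eps_i / d beta_j = -X[i][j]
    d_eps = [[-X[i][j] for i in range(n)] for j in range(p)]
    return [sum(d_eps[j][i] * 2 * eps[i] for i in range(n)) for j in range(p)]


def numerical_gradient(f, beta, h=1e-6):
    """Central-difference approximation to the gradient of a scalar function f."""
    grad = []
    for i in range(len(beta)):
        up, down = list(beta), list(beta)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


# Made-up data for illustration only.
X = [[1, 0], [1, 1], [1, 2]]
y = [1, 3, 4]
beta = [0.5, 1.0]

chain = chain_rule_gradient(beta, X, y)
numeric = numerical_gradient(lambda b: sum_of_squares(b, X, y), beta)
print("chain rule:", chain)
print("numeric:   ", numeric)
```

For these made-up numbers the residuals are \({(0.5, 1.5, 1.5)}^{\top}\), so the chain-rule gradient is \({(-7, -9)}^{\top}\), and the finite-difference values should match it up to rounding.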

7 Additional resources

7.1 Calculus

Kaplan (2022)
Khuri (2003)
Banner (2007)
Larson and Edwards (2018)
Miller (2016)
http://www.youtube.com/watch?v=xYzQL0TUtBA
http://www.youtube.com/watch?v=Ps2SBo_WjoE

7.2 Linear Algebra and Vector Calculus

Fieller (2016)
Banerjee and Roy (2014)
Searle and Khuri (2017)

7.3 Numerical Analysis

Hua Zhou’s lecture notes for “UCLA Biostat 216 - Mathematical Methods for Biostatistics” (2023 Fall)

7.4 Real Analysis

Grinberg (2017)

References

Banerjee, Sudipto, and Anindya Roy. 2014. Linear Algebra and Matrix Analysis for Statistics. Vol. 181. CRC Press, Boca Raton. https://www.routledge.com/Linear-Algebra-and-Matrix-Analysis-for-Statistics/Banerjee-Roy/p/book/9781420095388.

Banner, Adrian D. 2007. The Calculus Lifesaver: All the Tools You Need to Excel at Calculus. A Princeton Lifesaver Study Guide. Princeton University Press. https://press.princeton.edu/books/paperback/9780691130880/the-calculus-lifesaver.

Billingsley, Patrick. 1995. Probability and Measure. 3rd ed. Wiley Series in Probability and Mathematical Statistics. Wiley.

Cheng, Eugenia. 2025. “Opinion | How Math Turned Me from a D.E.I. Skeptic to a Supporter.” The New York Times. https://www.nytimes.com/2025/09/05/opinion/math-dei.html.

Dobson, Annette J, and Adrian G Barnett. 2018. An Introduction to Generalized Linear Models. 4th ed. CRC Press. https://doi.org/10.1201/9781315182780.

Fieller, Nick. 2016. Basics of Matrix Algebra for Statistics with R. Chapman and Hall/CRC. https://doi.org/10.1201/9781315370200.

Fubini, Guido. 1907. “Sugli Integrali Multipli.” Rendiconti Della Reale Accademia Dei Lincei. Classe Di Scienze Fisiche, Matematiche e Naturali 16: 608–14.

Grinberg, Raffi. 2017. The Real Analysis Lifesaver: All the Tools You Need to Understand Proofs. 1st ed. Princeton Lifesaver Study Guides. Princeton University Press. https://press.princeton.edu/books/paperback/9780691172934/the-real-analysis-lifesaver.

Gut, Allan. 2013. Probability: A Graduate Course. 2nd ed. Springer Texts in Statistics. Springer. https://doi.org/10.1007/978-1-4614-4708-5.

Kaplan, Daniel. 2022. MOSAIC Calculus. www.mosaic-web.org.

Khuri, André I. 2003. Advanced Calculus with Applications in Statistics. John Wiley & Sons. https://www.wiley.com/en-us/Advanced+Calculus+with+Applications+in+Statistics%2C+2nd+Edition-p-9780471391043.

Kleinbaum, David G, and Mitchel Klein. 2012. Survival Analysis: A Self-Learning Text. 3rd ed. Springer. https://link.springer.com/book/10.1007/978-1-4419-6646-9.

Larson, Ron, and Bruce H. Edwards. 2018. Calculus. 11th ed. Cengage Learning. https://www.cengage.com/c/calculus-11e-larson/.

Miller, Steven J. 2016. The Probability Lifesaver: Calculus Review Problems. https://web.williams.edu/Mathematics/sjmiller/public_html/probabilitylifesaver/index.htm#:~:text=http%3A//web.williams.edu/Mathematics/sjmiller/public_html/probabilitylifesaver/supplementalchap_calcreview.pdf.

Rudin, Walter. 1976. Principles of Mathematical Analysis. 3rd ed. International Series in Pure and Applied Mathematics. McGraw-Hill.

Searle, Shayle R, and Andre I Khuri. 2017. Matrix Algebra Useful for Statistics. John Wiley & Sons.

Wikipedia contributors. 2024. Fubini’s Theorem — Wikipedia, the Free Encyclopedia. https://en.wikipedia.org/wiki/Fubini%27s_theorem.