Mathematics Prerequisites
1 Mathematics
Math is not just a way of calculating numerical answers; it is a way of thinking, using clear definitions for concepts and rigorous logic to organize our thoughts and back up our assertions.
Cheng (2025)
These lecture notes use:
- algebra
- precalculus
- univariate calculus
- linear algebra
- vector calculus
Some key results are listed here.
1.1 Elementary Algebra
Mastery of Elementary Algebra (a.k.a. “College Algebra”) is a prerequisite for calculus, which is a prerequisite for Epi 202 and Epi 203, which are prerequisites for this course (Epi 204). Nevertheless, each year, some Epi 204 students are still uncomfortable with algebraic manipulations of mathematical formulas. Therefore, I include this section as a quick reference.
1.1.1 Equalities
1.1.2 Inequalities
1.1.3 Sums
1.1.4 Products
1.1.5 Division
1.1.6 Sums and products together
1.1.7 Quotients
Additional reference for elementary algebra: https://en.wikipedia.org/wiki/Population_proportion#Mathematical_definition
1.2 Exponentials and Logarithms
2 Derivatives
Proof. Apply Theorem 27 and Theorem 23.
3 Linear Algebra
3.1 Vectors
Column vectors are the default convention in these notes and in most statistics textbooks. They are also called \(p \times 1\) matrices.
The transpose operation converts a column vector to a row vector, or more generally, swaps the rows and columns of a matrix (Definition 14).
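A minimal numpy sketch (the array values are arbitrary) illustrating both points:

```python
import numpy as np

x = np.array([[1.0], [2.0], [3.0]])  # a 3x1 column vector
print(x.shape)    # (3, 1)
print(x.T.shape)  # (1, 3): the transpose is a row vector

A = np.arange(6).reshape(2, 3)  # a 2x3 matrix
print(A.T.shape)  # (3, 2): rows and columns are swapped
```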
3.1.1 Special vectors
The zero vector is the additive identity for vector addition: \(\tilde{x}+ \tilde{0}= \tilde{x}\) for any vector \(\tilde{x}\) of the same length.
The dot product \({\tilde{1}}^{\top}\tilde{x}= \tilde{1} \cdot \tilde{x}= \sum_{i=1}^p x_i\) is the sum of all entries of \(\tilde{x}\).
Indicator vectors are also called unit vectors or standard basis vectors.
Proof. Writing the product componentwise:
\[ \begin{aligned} {\tilde{e}_j}^{\top}\tilde{x} &= \sum_{i=1}^{p} (\tilde{e}_j)_i\, x_i \\&= \sum_{i=1}^{p} \begin{cases} 1 \cdot x_i & \text{if } i = j \\ 0 \cdot x_i & \text{if } i \neq j \end{cases} \\&= x_j \end{aligned} \]
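Both identities are easy to check numerically; here is a minimal numpy sketch (the vector \(\tilde{x}\) is arbitrary, and numpy indexing is 0-based rather than 1-based):

```python
import numpy as np

p = 4
x = np.array([5.0, -2.0, 7.0, 1.0])

ones = np.ones(p)   # the ones vector
print(ones @ x)     # 11.0, equal to x.sum()

j = 2
e_j = np.zeros(p)
e_j[j] = 1.0        # indicator vector for index j
print(e_j @ x)      # 7.0, equal to x[j]
```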
“Linear combination” can also refer to a weighted sum of vectors; collecting those vectors as the columns of a matrix makes such a sum a matrix-vector product, as in the sketch below.
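A minimal numpy sketch (the vectors and weights are arbitrary):

```python
import numpy as np

v1 = np.array([1.0, 0.0, 2.0])
v2 = np.array([0.0, 3.0, 1.0])
w = np.array([2.0, -1.0])      # weights

V = np.column_stack([v1, v2])  # the vectors as columns of a 3x2 matrix
print(V @ w)                   # [ 2., -3.,  3.]
print(2.0 * v1 - 1.0 * v2)     # the same weighted sum, written out
```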
The dot product has a different generalization for two matrices; see Wikipedia for more.
Proof. Apply:
- Definition 10
- symmetry of scalar multiplication
- Definition 10 again
3.1.2 Orthogonality
Orthogonality generalizes the geometric notion of perpendicularity to arbitrary dimensions: two vectors are orthogonal when their dot product is zero.
The indicator vectors \(\tilde{e}_1, \tilde{e}_2, \ldots, \tilde{e}_p\) (Definition 9) form an orthonormal set.
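A minimal numpy sketch checking this claim: the matrix of all pairwise dot products (the Gram matrix) of an orthonormal set is the identity.

```python
import numpy as np

p = 3
E = np.eye(p)  # row i is the standard basis vector e_i

# Gram matrix of pairwise dot products: 1 on the diagonal, 0 elsewhere
print(E @ E.T)
print(np.allclose(E @ E.T, np.eye(p)))  # True
```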
3.2 Matrices
The entry in row \(i\) and column \(j\) is denoted \(a_{ij}\) or \((\mathbf{A})_{ij}\). A column vector of length \(p\) is a special case: a \(p \times 1\) matrix. A row vector of length \(p\) is a \(1 \times p\) matrix.
3.2.1 Matrix transpose
The order of the factors reverses when transposing a product: \({(\mathbf{A}\mathbf{B})}^{\top} = {\mathbf{B}}^{\top}{\mathbf{A}}^{\top}\).
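A quick numeric check of the reversal rule (a minimal numpy sketch with arbitrary rectangular matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))

# (AB)^T equals B^T A^T; note the reversed order of the factors
print(np.allclose((A @ B).T, B.T @ A.T))  # True
```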
3.2.2 Matrix addition
3.2.3 Scalar multiplication
3.2.4 Matrix multiplication
Matrix multiplication is only defined when the number of columns in \(\mathbf{A}\) equals the number of rows in \(\mathbf{B}\).
Matrix multiplication is not commutative in general: \(\mathbf{A}\mathbf{B} \neq \mathbf{B}\mathbf{A}\), even when both products are defined.
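A minimal numeric counterexample (the matrices are arbitrary):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
B = np.array([[1.0, 0.0],
              [1.0, 1.0]])

print(A @ B)  # [[2., 1.], [1., 1.]]
print(B @ A)  # [[1., 1.], [1., 2.]] -- a different matrix
```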
3.2.5 Matrix-vector multiplication
Matrix-vector multiplication is a generalization of the dot product. Each entry of the result is the dot product of the corresponding row of \(\mathbf{A}\) with the vector \(\tilde{x}\).
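A minimal numpy sketch of the row-by-row view (the matrix and vector are arbitrary):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
x = np.array([10.0, 1.0])

print(A @ x)                   # [12., 34., 56.]
print([row @ x for row in A])  # the same entries, one dot product per row
```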
3.3 Special Matrices
See also Definition 15 for the zero matrix.
Covariance matrices and information matrices are symmetric.
Diagonal matrices are denoted \(\mathbf{D} = \text{diag}(d_1, d_2, \ldots, d_p)\), where \(d_1, \ldots, d_p\) are the diagonal entries.
A matrix that has an inverse is called invertible or non-singular.
Proof. We verify symmetry and idempotency.
Symmetry: \[{(\mathbf{I} - \mathbf{P})}^{\top} = {\mathbf{I}}^{\top} - {\mathbf{P}}^{\top} = \mathbf{I} - \mathbf{P}\] using the linearity of the transpose and the symmetry of \(\mathbf{P}\) (\({\mathbf{P}}^{\top} = \mathbf{P}\)).
Idempotency: \[\begin{aligned} (\mathbf{I} - \mathbf{P})^2 &= (\mathbf{I} - \mathbf{P})(\mathbf{I} - \mathbf{P}) \\ &= \mathbf{I} - \mathbf{P} - \mathbf{P} + \mathbf{P}^2 \\ &= \mathbf{I} - \mathbf{P} - \mathbf{P} + \mathbf{P} \\ &= \mathbf{I} - \mathbf{P} \end{aligned}\]
Proof. We verify symmetry and idempotency.
Symmetry: \[\begin{aligned} {\mathbf{H}}^{\top} &= {\left(\mathbf{X}({\mathbf{X}}^{\top}\mathbf{X})^{-1}{\mathbf{X}}^{\top}\right)}^{\top} \\ &= {({\mathbf{X}}^{\top})}^{\top} \cdot {\left(({\mathbf{X}}^{\top}\mathbf{X})^{-1}\right)}^{\top} \cdot {\mathbf{X}}^{\top} \\ &= \mathbf{X}\cdot ({\mathbf{X}}^{\top}\mathbf{X})^{-1} \cdot {\mathbf{X}}^{\top} \\ &= \mathbf{H} \end{aligned}\]
where the second line applies the product-reversal rule for transposes (Section 3.2.1), and the third line uses \({({\mathbf{X}}^{\top})}^{\top} = \mathbf{X}\) and the fact that \({\mathbf{X}}^{\top}\mathbf{X}\) is symmetric, so its inverse is also symmetric (\({\left(({\mathbf{X}}^{\top}\mathbf{X})^{-1}\right)}^{\top} = ({\mathbf{X}}^{\top}\mathbf{X})^{-1}\)).
Idempotency: \[\begin{aligned} \mathbf{H}^2 &= \mathbf{X}({\mathbf{X}}^{\top}\mathbf{X})^{-1}{\mathbf{X}}^{\top} \cdot \mathbf{X}({\mathbf{X}}^{\top}\mathbf{X})^{-1}{\mathbf{X}}^{\top} \\ &= \mathbf{X}({\mathbf{X}}^{\top}\mathbf{X})^{-1}({\mathbf{X}}^{\top}\mathbf{X})({\mathbf{X}}^{\top}\mathbf{X})^{-1}{\mathbf{X}}^{\top} \\ &= \mathbf{X}({\mathbf{X}}^{\top}\mathbf{X})^{-1}{\mathbf{X}}^{\top} \\ &= \mathbf{H} \end{aligned}\]
The hat matrix appears in the formula for fitted values in linear regression: \(\hat{\tilde{y}} = \mathbf{X}\hat{\tilde{\beta}} = \mathbf{X}({\mathbf{X}}^{\top}\mathbf{X})^{-1}{\mathbf{X}}^{\top}\tilde{y}= \mathbf{H}\tilde{y}\). It “puts a hat” on \(\tilde{y}\) — hence the name.
Proof. \[\begin{aligned} {(\mathbf{P}\tilde{v})}^{\top}\,(\mathbf{I} - \mathbf{P})\tilde{v} &= {\tilde{v}}^{\top}\,{\mathbf{P}}^{\top}\,(\mathbf{I} - \mathbf{P})\tilde{v} \\ &= {\tilde{v}}^{\top}\,\mathbf{P}\,(\mathbf{I} - \mathbf{P})\tilde{v} \\ &= {\tilde{v}}^{\top}\,(\mathbf{P} - \mathbf{P}^2)\tilde{v} \\ &= {\tilde{v}}^{\top}\,(\mathbf{P} - \mathbf{P})\tilde{v} \\ &= {\tilde{v}}^{\top}\,\mathbf{0}\,\tilde{v} \\ &= 0 \end{aligned}\]
where the second line uses symmetry (\({\mathbf{P}}^{\top} = \mathbf{P}\)) and the fourth line uses idempotency (\(\mathbf{P}^2 = \mathbf{P}\)).
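These results are easy to verify numerically. Below is a minimal numpy sketch with a simulated design matrix and response, where the hat matrix \(\mathbf{H}\) plays the role of \(\mathbf{P}\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design matrix with intercept
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix

print(np.allclose(H, H.T))    # H is symmetric
print(np.allclose(H @ H, H))  # H is idempotent

M = np.eye(n) - H             # the complementary projection I - H
print(np.allclose(M, M.T) and np.allclose(M @ M, M))  # also symmetric and idempotent

fitted, resid = H @ y, M @ y
print(np.isclose(fitted @ resid, 0.0))  # fitted values are orthogonal to residuals
```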
3.4 Quadratic Forms
Quadratic forms are the matrix generalization of the scalar expression \(c x^2\). They occur frequently in statistics:
- The residual sum of squares in linear regression, \({(\tilde{y}- \mathbf{X}\hat{\tilde{\beta}})}^{\top}(\tilde{y}- \mathbf{X}\hat{\tilde{\beta}})\), is a quadratic form.
- The variance of a linear combination of estimates is a quadratic form: \(\text{Var}{\left({\tilde{x}}^{\top}\hat{\tilde{\beta}}\right)} = {\tilde{x}}^{\top}\,\text{Var}{\left(\hat{\tilde{\beta}}\right)}\,\tilde{x}\).
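A minimal numpy sketch (the matrix and vector are arbitrary) showing a quadratic form computed two ways:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])  # a symmetric 2x2 matrix
v = np.array([1.0, 2.0])

print(v @ A @ v)  # 18.0: the quadratic form v^T A v

# the same scalar, expanded as the double sum over the entries of A
print(sum(A[i, j] * v[i] * v[j] for i in range(2) for j in range(2)))
```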
3.5 Design Matrix
The product \(\mathbf{X}\tilde{\beta}\) collects all the linear predictors \({\tilde{x}_i}^{\top}\tilde{\beta}\) into a single \(n \times 1\) vector:
\[ \mathbf{X}\tilde{\beta}= \begin{bmatrix} {\tilde{x}_1}^{\top}\tilde{\beta}\\ \vdots \\ {\tilde{x}_n}^{\top}\tilde{\beta} \end{bmatrix} \]
The matrix \({\mathbf{X}}^{\top}\mathbf{X}\) is a \(p \times p\) symmetric matrix that appears in the OLS estimator \(\hat{\tilde{\beta}} = ({\mathbf{X}}^{\top}\mathbf{X})^{-1}{\mathbf{X}}^{\top}\tilde{y}\).
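A minimal numpy sketch of the OLS computation with simulated data (using `np.linalg.solve` on the normal equations \(({\mathbf{X}}^{\top}\mathbf{X})\hat{\tilde{\beta}} = {\mathbf{X}}^{\top}\tilde{y}\) rather than forming the inverse explicitly, which is numerically preferable):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # n x 2 design matrix
beta_true = np.array([2.0, -1.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# OLS: solve the normal equations (X^T X) beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # approximately [ 2., -1.]
```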
4 Vector Calculus
(adapted from Fieller (2016), §7.2)
This section covers derivatives of functions of vectors and matrices. Linear algebra prerequisites — including vectors, matrices, transpose, dot product, and quadratic forms — are covered in Section 3.
Let \(\tilde{x}\) and \(\tilde{\beta}\) be column vectors of length \(p\) (see Definition 5 and Definition 10).
Proof. \[ \begin{aligned} \frac{\partial}{\partial \tilde{\beta}} ({\tilde{x}}^{\top}\tilde{\beta}) &= \begin{bmatrix} \frac{\partial}{\partial \beta_1}(x_1\beta_1+x_2\beta_2 +\cdots+x_p \beta_p ) \\ \frac{\partial}{\partial \beta_2}(x_1\beta_1+x_2\beta_2 +\cdots+x_p \beta_p ) \\ \vdots \\ \frac{\partial}{\partial \beta_p}(x_1\beta_1+x_2\beta_2 +\cdots+x_p \beta_p ) \end{bmatrix} \\ &= \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{p} \end{bmatrix} \\ &= \tilde{x} \end{aligned} \]
This is like taking the derivative of \(cx^2\) with respect to \(x\) in non-vector calculus.
This is like taking the derivative of \(x^2\).
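The gradient identities behind these analogies, \(\frac{\partial}{\partial \tilde{\beta}}({\tilde{x}}^{\top}\tilde{\beta}) = \tilde{x}\) and (for symmetric \(\mathbf{A}\)) \(\frac{\partial}{\partial \tilde{\beta}}({\tilde{\beta}}^{\top}\mathbf{A}\tilde{\beta}) = 2\mathbf{A}\tilde{\beta}\), can be checked against finite differences; here is a minimal numpy sketch (the test point and matrix are simulated):

```python
import numpy as np

def num_grad(f, beta, h=1e-6):
    """Central-difference approximation to the gradient of f at beta."""
    g = np.zeros_like(beta)
    for i in range(beta.size):
        e = np.zeros_like(beta)
        e[i] = h
        g[i] = (f(beta + e) - f(beta - e)) / (2 * h)
    return g

rng = np.random.default_rng(0)
p = 3
x = rng.normal(size=p)
M = rng.normal(size=(p, p))
A = (M + M.T) / 2  # symmetrize so that the 2*A*beta formula applies
beta = rng.normal(size=p)

# gradient of x^T beta is x
print(np.allclose(num_grad(lambda b: x @ b, beta), x, atol=1e-4))
# gradient of beta^T A beta is 2 A beta (for symmetric A)
print(np.allclose(num_grad(lambda b: b @ A @ b, beta), 2 * A @ beta, atol=1e-4))
```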
See https://quickfem.com/finite-element-analysis/, specifically https://quickfem.com/wp-content/uploads/IFEM.AppF_.pdf
See also https://en.wikipedia.org/wiki/Gradient#Relationship_with_Fr%C3%A9chet_derivative
This chain rule is analogous to the univariate chain rule (Theorem 27), but now the order of the factors matters. The version presented here is for the gradient (a column vector); the total derivative (a row vector) is the transpose of the gradient.
5 Additional resources
5.1 Calculus
5.2 Linear Algebra and Vector Calculus
5.3 Numerical Analysis
5.4 Real Analysis
- Grinberg (2017)