2  Linear (Gaussian) Models




Configuring R

Functions from these packages will be used throughout this document:

Show R code
library(conflicted) # check for conflicting function definitions
# library(printr) # inserts help-file output into markdown output
library(rmarkdown) # Convert R Markdown documents into a variety of formats.
library(pander) # format tables for markdown
library(ggplot2) # graphics
library(ggeasy) # help with graphics
library(ggfortify) # help with graphics
library(dplyr) # manipulate data
library(tibble) # `tibble`s extend `data.frame`s
library(magrittr) # `%>%` and other additional piping tools
library(haven) # import Stata files
library(knitr) # format R output for markdown
library(tidyr) # Tools to help to create tidy data
library(plotly) # interactive graphics
library(dobson) # datasets from Dobson and Barnett 2018
library(parameters) # format model output tables for markdown
library(latex2exp) # use LaTeX in R code (for figures and tables)
library(fs) # filesystem path manipulations
library(survival) # survival analysis
library(survminer) # survival analysis graphics
library(KMsurv) # datasets from Klein and Moeschberger
library(webshot2) # convert interactive content to static for pdf
library(forcats) # functions for categorical variables ("factors")
library(stringr) # functions for dealing with strings
library(lubridate) # functions for dealing with dates and times

Here are some R settings I use in this document:

Show R code
rm(list = ls()) # delete any data that's already loaded into R

conflicts_prefer(dplyr::filter) # use the `filter()` function from dplyr by default
ggplot2::theme_set(
  ggplot2::theme_bw() + 
        # ggplot2::labs(col = "") +
    ggplot2::theme(
      legend.position = "bottom",
      text = ggplot2::element_text(size = 12, family = "serif")))

knitr::opts_chunk$set(message = FALSE)
options('digits' = 4)

panderOptions("big.mark", ",")
pander::panderOptions("table.emphasize.rownames", FALSE)
pander::panderOptions("table.split.table", Inf)
legend_text_size = 9

Note

This content is adapted from:

  • Dobson and Barnett (2018), Chapters 2-6
  • Dunn, Smyth, et al. (2018), Chapters 2-3
  • Vittinghoff et al. (2012), Chapter 4

There are numerous textbooks specifically for linear regression, including:

  • Kutner et al. (2005): used for UCLA Biostatistics MS level linear models class
  • Chatterjee and Hadi (2015): used for Stanford MS-level linear models class
  • Seber and Lee (2012): used for UCLA Biostatistics PhD level linear models class
  • Kleinbaum et al. (2014): same first author as Kleinbaum and Klein (2010) and Kleinbaum and Klein (2012)

2.1 Overview

2.1.1 Why this course includes linear regression

  • This course is about generalized linear models (for non-Gaussian outcomes)
  • UC Davis STA 108 (“Applied Statistical Methods: Regression Analysis”) is a prerequisite for this course, so everyone here should have some understanding of linear regression already.
  • We will review linear regression to:
  • make sure everyone is caught up
  • provide an epidemiological perspective on model interpretation

2.1.2 Chapter overview

  • Section 2.2: how to interpret linear regression models

  • Section 2.3: how to estimate linear regression models

  • Section 2.4: how to quantify uncertainty about our estimates

  • Section 2.8: how to tell if your model is insufficiently complex

2.2 Understanding Gaussian Linear Regression Models

2.2.1 Motivating example: birthweights and gestational age

Suppose we want to learn about the distributions of birthweights (outcome \(Y\)) for (human) babies born at different gestational ages (covariate \(A\)) and with different chromosomal sexes (covariate \(S\)) (Dobson and Barnett (2018) Example 2.2.2).

Show R code
library(dobson)
data("birthweight", package = "dobson")
birthweight |> knitr::kable()
Table 2.1: birthweight data (Dobson and Barnett (2018) Example 2.2.2)
| boys gestational age | boys weight | girls gestational age | girls weight |
|---|---|---|---|
| 40 | 2968 | 40 | 3317 |
| 38 | 2795 | 36 | 2729 |
| 40 | 3163 | 40 | 2935 |
| 35 | 2925 | 38 | 2754 |
| 36 | 2625 | 42 | 3210 |
| 37 | 2847 | 39 | 2817 |
| 41 | 3292 | 40 | 3126 |
| 40 | 3473 | 37 | 2539 |
| 37 | 2628 | 36 | 2412 |
| 38 | 3176 | 38 | 2991 |
| 40 | 3421 | 39 | 2875 |
| 38 | 2975 | 40 | 3231 |
Show R code
bw = 
  birthweight |> 
  pivot_longer(
    cols = everything(),
    names_to = c("sex", ".value"),
    names_sep = "s "
  ) |> 
  rename(age = `gestational age`) |> 
  mutate(
    sex = sex |> 
      case_match(
        "boy" ~ "male",
        "girl" ~ "female") |> 
      factor(levels = c("female", "male")))

bw
Table 2.2: birthweight data reshaped
Show R code
plot1 = bw |> 
  ggplot(aes(
    x = age, 
    y = weight,
    linetype = sex,
    shape = sex,
    col = sex))  +
  theme_bw() +
  xlab("Gestational age (weeks)") +
  ylab("Birthweight (grams)") +
  theme(legend.position = "bottom") +
  # expand_limits(y = 0, x = 0) +
  geom_point(alpha = .7)
print(plot1 + facet_wrap(~ sex))
Figure 2.1: birthweight data (Dobson and Barnett (2018) Example 2.2.2)

Data notation

Let’s define some notation to represent this data.

  • \(Y\): birthweight (measured in grams)

  • \(S\): chromosomal sex: “male” (XY) or “female” (XX)

  • \(M\): indicator variable for \(S\) = “male”1

  • \(M = 0\) if female (XX)

  • \(M = 1\) if male (XY)

  • \(F\): indicator variable for \(S\) = “female”2

  • \(F = 1\) if female (XX)

  • \(F = 0\) if male (XY)

  • \(A\): estimated gestational age at birth (measured in weeks).

Note

Female is the reference level for the categorical variable \(S\) (chromosomal sex) and corresponding indicator variable \(M\) . The choice of a reference level is arbitrary and does not limit what we can do with the resulting model; it only makes it more computationally convenient to make inferences about comparisons involving that reference group.
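For instance, here is a minimal sketch of how the reference level could be switched in R, using forcats::fct_relevel() (forcats is loaded above); the tibble name bw_male_ref is just for illustration:

Show R code
# a sketch: make "male" the reference level of `sex`
# (this changes which indicator appears in model output, not the fitted means)
bw_male_ref = bw |> 
  mutate(sex = forcats::fct_relevel(sex, "male"))
levels(bw_male_ref$sex) # the first level listed is the reference level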

2.2.2 Parallel lines regression

We don’t have enough data to model the distribution of birth weight separately for each combination of gestational age and sex, so let’s instead consider a (relatively) simple model for how that distribution varies with gestational age and sex:

\[Y|A=a,S=s\ \sim_{⫫}\ N(\mu(a,s), \sigma^2)\]

\[ \begin{aligned} \mu(a,s) &\stackrel{\text{def}}{=}\mathbb{E}\left[Y|A=a, S=s\right] \\ &= \beta_0 + \beta_A a+ \beta_M m \end{aligned} \tag{2.1}\]

Table 2.3 shows the parameter estimates from R. Figure 2.2 shows the estimated model, superimposed on the data.

Show R code
bw_lm1 = lm(
  formula = weight ~ sex + age, 
  data = bw)

bw_lm1 |> 
  parameters() |>
  print_md(
    include_reference = TRUE,
    # show_sigma = TRUE,
    select = "{estimate}")
Table 2.3: Estimate of Model 2.1 for birthweight data
| Parameter | Estimate |
|---|---|
| (Intercept) | -1773.32 |
| sex (female) | 0.00 |
| sex (male) | 163.04 |
| age | 120.89 |
Show R code
bw = 
  bw |> 
  mutate(`E[Y|X=x]` = fitted(bw_lm1)) |> 
  arrange(sex, age)

plot2 = 
  plot1 %+% bw +
  geom_line(aes(y = `E[Y|X=x]`))

print(plot2)
Figure 2.2: Parallel-slopes model of birthweight

Model assumptions and predictions

To learn what this model is assuming, let’s plug in a few values.

Exercise 2.1 According to this model, what’s the mean birthweight for a female born at 36 weeks?

Show R code
coef(bw_lm1)
#> (Intercept)     sexmale         age 
#>     -1773.3       163.0       120.9

Solution.  

Show R code
pred_female = coef(bw_lm1)["(Intercept)"] + coef(bw_lm1)["age"]*36
coef(bw_lm1)
#> (Intercept)     sexmale         age 
#>     -1773.3       163.0       120.9
# print(pred_female)
### built-in prediction: 
# predict(bw_lm1, newdata = tibble(sex = "female", age = 36))

\[ \begin{aligned} E[Y|M = 0, A = 36] &= \beta_0 + \beta_M \cdot 0 + \beta_A \cdot 36 \\ &= 2578.8739 \end{aligned} \]


Exercise 2.2 What’s the mean birthweight for a male born at 36 weeks?

Show R code
coef(bw_lm1)
#> (Intercept)     sexmale         age 
#>     -1773.3       163.0       120.9

Solution.  

Show R code
pred_male = 
  coef(bw_lm1)["(Intercept)"] + 
  coef(bw_lm1)["sexmale"] + 
  coef(bw_lm1)["age"]*36
coef(bw_lm1)
#> (Intercept)     sexmale         age 
#>     -1773.3       163.0       120.9

\[ \begin{aligned} E[Y|M = 1, A = 36] &= \beta_0 + \beta_M \cdot 1 + \beta_A \cdot 36 \\ &= 2741.9132 \end{aligned} \]


Exercise 2.3 What’s the difference in mean birthweights between males born at 36 weeks and females born at 36 weeks?

Show R code
coef(bw_lm1)
#> (Intercept)     sexmale         age 
#>     -1773.3       163.0       120.9

Solution. \[ \begin{aligned} & E[Y|M = 1, A = 36] - E[Y|M = 0, A = 36]\\ &= 2741.9132 - 2578.8739\\ &= 163.0393 \end{aligned} \]

Shortcut:

\[ \begin{aligned} & E[Y|M = 1, A = 36] - E[Y|M = 0, A = 36]\\ &= (\beta_0 + \beta_M \cdot 1+ \beta_A \cdot 36) - (\beta_0 + \beta_M \cdot 0+ \beta_A \cdot 36) \\ &= \beta_M \\ &= 163.0393 \end{aligned} \]

Note that age doesn’t show up in this difference: in other words, according to this model, the difference between females and males with the same gestational age is the same for every age.

That’s an assumption of the model; it’s built-in to the parametric structure, even before we plug in the estimated values of those parameters.

That’s why the lines are parallel.

2.2.3 Interactions

What if we don’t like that parallel lines assumption?

Then we need to allow an “interaction” between age \(A\) and sex \(S\):

\[ E[Y|A=a, S=s] = \beta_0 + \beta_A a+ \beta_M m + \beta_{AM} (a \cdot m) \tag{2.2}\]

Now, the slope of mean birthweight \(E[Y|A,S]\) with respect to gestational age \(A\) depends on the value of sex \(S\).

Show R code
bw_lm2 = lm(weight ~ sex + age + sex:age, data = bw)
bw_lm2 |> 
  parameters() |>
  print_md(
    include_reference = TRUE,
    # show_sigma = TRUE,
    select = "{estimate}")
Table 2.4: Birthweight model with interaction term
| Parameter | Estimate |
|---|---|
| (Intercept) | -2141.67 |
| sex (female) | 0.00 |
| sex (male) | 872.99 |
| age | 130.40 |
| sex (male) × age | -18.42 |
Show R code
bw = 
  bw |> 
  mutate(
    predlm2 = predict(bw_lm2)
  ) |> 
  arrange(sex, age)

plot1_interact = 
  plot1 %+% bw +
  geom_line(aes(y = predlm2))

print(plot1_interact)
Figure 2.3: Birthweight model with interaction term

Now we can see that the lines aren’t parallel.


Here’s another way we could rewrite this model (by collecting terms involving \(S\)):

\[ E[Y|A, M] = \beta_0 + \beta_M M+ (\beta_A + \beta_{AM} M) A \]

Note

If you want to understand a coefficient in a model with interactions, collect terms for the corresponding variable, and you will see what other variables are interacting with the variable you are interested in.

In this case, the variable \(S\) (through its indicator \(M\)) is interacting with \(A\), so the slope of \(Y\) with respect to \(A\) depends on the value of \(M\).

According to this model, there is no such thing as “the slope of birthweight with respect to age”. There are two slopes, one for each sex.3 We can only talk about “the slope of birthweight with respect to age among males” and “the slope of birthweight with respect to age among females”.

Then: each coefficient is the difference in means per unit change in its corresponding covariate, when the other variables collected with it are set to 0.


To learn what this model is assuming, let’s plug in a few values.

Exercise 2.4 According to this model, what’s the mean birthweight for a female born at 36 weeks?


Solution.  

Show R code
pred_female = coef(bw_lm2)["(Intercept)"] + coef(bw_lm2)["age"]*36

\[ E[Y|M = 0, A = 36] = \beta_0 + \beta_M \cdot 0 + \beta_A \cdot 36 + \beta_{AM} \cdot (0 \cdot 36) = 2552.7333 \]


Exercise 2.5 What’s the mean birthweight for a male born at 36 weeks?


Solution.  

Show R code
pred_male = 
  coef(bw_lm2)["(Intercept)"] + 
  coef(bw_lm2)["sexmale"] + 
  coef(bw_lm2)["age"]*36 + 
  coef(bw_lm2)["sexmale:age"] * 36

\[ \begin{aligned} E[Y|M = 1, A = 36] &= \beta_0 + \beta_M \cdot 1+ \beta_A \cdot 36 + \beta_{AM} \cdot 1 \cdot 36\\ &= 2762.7069 \end{aligned} \]


Exercise 2.6 What’s the difference in mean birthweights between males born at 36 weeks and females born at 36 weeks?


Solution. \[ \begin{aligned} & E[Y|M = 1, A = 36] - E[Y|M = 0, A = 36]\\ &= (\beta_0 + \beta_M \cdot 1+ \beta_A \cdot 36 + \beta_{AM} \cdot 1 \cdot 36)\\ &\ \ \ \ \ -(\beta_0 + \beta_M \cdot 0+ \beta_A \cdot 36 + \beta_{AM} \cdot 0 \cdot 36) \\ &= \beta_{M} + \beta_{AM}\cdot 36\\ &= 209.9736 \end{aligned} \]

Note that age now does show up in the difference: in other words, according to this model, the difference in mean birthweights between females and males with the same gestational age can vary by gestational age.

That’s how the lines in the graph ended up non-parallel.

2.2.4 Stratified regression

We could re-write the interaction model as a stratified model, with a slope and intercept for each sex:

\[ \mathbb{E}\left[Y|A=a, S=s\right] = \beta_M m + \beta_{AM} (a \cdot m) + \beta_F f + \beta_{AF} (a \cdot f) \tag{2.3}\]

Compare this stratified model with our interaction model, Equation 2.2:

\[ \mathbb{E}\left[Y|A=a, S=s\right] = \beta_0 + \beta_A a + \beta_M m + \beta_{AM} (a \cdot m) \]

In the stratified model, the intercept term \(\beta_0\) has been relabeled as \(\beta_F\); each sex now gets its own intercept (\(\beta_F\), \(\beta_M\)) and its own slope (\(\beta_{AF}\), \(\beta_{AM}\)), rather than a reference level plus differences from it.

Show R code
bw_lm2 = lm(weight ~ sex + age + sex:age, data = bw)
bw_lm2 |> 
  parameters() |>
  print_md(
    include_reference = TRUE,
    # show_sigma = TRUE,
    select = "{estimate}")
Table 2.5: Birthweight model with interaction term
| Parameter | Estimate |
|---|---|
| (Intercept) | -2141.67 |
| sex (female) | 0.00 |
| sex (male) | 872.99 |
| age | 130.40 |
| sex (male) × age | -18.42 |
Show R code
bw_lm_strat = 
  bw |> 
  lm(
    formula = weight ~ sex + sex:age - 1, 
    data = _)

bw_lm_strat |> 
  parameters() |>
  print_md(
    # show_sigma = TRUE,
    select = "{estimate}")
Table 2.6: Birthweight model - stratified betas
| Parameter | Estimate |
|---|---|
| sex (female) | -2141.67 |
| sex (male) | -1268.67 |
| sex (female) × age | 130.40 |
| sex (male) × age | 111.98 |
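The stratified estimates in Table 2.6 are recombinations of the interaction-model estimates in Table 2.5; here is a quick numerical check using the two fits above:

Show R code
# male intercept in the stratified model = intercept + sex (male) term from the interaction model
coef(bw_lm2)["(Intercept)"] + coef(bw_lm2)["sexmale"]
coef(bw_lm_strat)["sexmale"]

# male slope in the stratified model = age slope + interaction term
coef(bw_lm2)["age"] + coef(bw_lm2)["sexmale:age"]
coef(bw_lm_strat)["sexmale:age"]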

2.2.5 Curved-line regression

If we transform some of our covariates (\(X\)s) and plot the resulting model on the original covariate scale, we end up with curved regression lines:

Show R code
bw_lm3 = lm(weight ~ sex:log(age) - 1, data = bw)
library(palmerpenguins)

ggpenguins <- 
  palmerpenguins::penguins |> 
  dplyr::filter(species == "Adelie") |> 
  ggplot(
    aes(x = bill_length_mm , y = body_mass_g)) +
  geom_point() + 
  xlab("Bill length (mm)") + 
  ylab("Body mass (g)")

ggpenguins2 = ggpenguins +
  stat_smooth(
    method = "lm",
    formula = y ~ log(x),
    geom = "smooth") +
  xlab("Bill length (mm)") + 
  ylab("Body mass (g)")


ggpenguins2 |> print()
Figure 2.4: palmerpenguins model with bill_length entering on log scale
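We can do the same with the birthweight model bw_lm3 defined above; plotted against gestational age on its original scale, the fitted lines bend slightly. A sketch, reusing plot1 from earlier:

Show R code
# fitted curves from `bw_lm3` (log-age model) on the original age scale
bw_curved = bw |> mutate(fitted_lm3 = fitted(bw_lm3))

plot_curved = 
  plot1 %+% bw_curved +
  geom_line(aes(y = fitted_lm3))

print(plot_curved)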

2.3 Estimating Linear Models via Maximum Likelihood

2.3.1 Likelihood, log-likelihood, and score functions for linear regression

In EPI 203 and [intro-MLEs.qmd#sec-intro-MLEs], we learned how to fit outcome-only models of the form \(p(X=x|\theta)\) to iid data \(\mathbf x = (x_1,…,x_n)\) using maximum likelihood estimation.

Now, we apply the same procedure to linear regression models:

\[ \mathcal L(\tilde{y}|\mathbf{x},\beta, \sigma^2) = \prod_{i=1}^n (2\pi\sigma^2)^{-1/2} \text{exp}\left\{-\frac{1}{2\sigma^2}(y_i - \tilde{x_i}'\beta)^2\right\} \tag{2.4}\]

\[ \ell(\tilde{y}|\mathbf{x},\beta, \sigma^2) = -\frac{n}{2}\text{log}\left\{\sigma^2\right\} - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \tilde{x_i}' \beta)^2 \tag{2.5}\]

\[ \ell'_{\beta}(\tilde{y}|\mathbf{x},\beta, \sigma^2) = - \frac{1}{2\sigma^2}\frac{\partial}{\partial \beta} \left(\sum_{i=1}^n (y_i - \tilde{x_i}' \beta)^2\right) \tag{2.6}\]
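As a sanity check, we can evaluate this log-likelihood by hand for the interaction model bw_lm2 (adding back the \(-\frac{n}{2}\text{log}\left\{2\pi\right\}\) constant that Equation 2.5 drops, and using the maximum-likelihood variance estimate) and compare it with logLik():

Show R code
e = resid(bw_lm2)          # y_i - x_i' beta-hat
n = nobs(bw_lm2)
sigma2_ml = sum(e^2) / n   # ML estimate of sigma^2 (denominator n, not n - p)
-n/2 * log(2 * pi * sigma2_ml) - sum(e^2) / (2 * sigma2_ml)
logLik(bw_lm2) |> as.numeric()  # should match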

2.3.2 Some tools from vector calculus

A few tools from linear algebra will make this analysis go easier (see Fieller (2016), Section 7.2 for details).

\[ f_{\beta}(\mathbf x) = (f_{\beta}(x_1), f_{\beta}(x_2), ..., f_{\beta}(x_n))^\top \]

Let \(\mathbf x\) and \(\beta\) be vectors of length \(p\), or in other words, matrices of length \(p\times 1\):

\[ x = \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{p} \end{bmatrix} \\ \beta = \begin{bmatrix} \beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_{p} \end{bmatrix} \]

Then

\[ x' \equiv x^\top \equiv [x_1, x_2, ..., x_p] \]

and

\[ x'\beta = [x_1, x_2, ..., x_p] \begin{bmatrix} \beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_{p} \end{bmatrix} = x_1\beta_1+x_2\beta_2 +...+x_p \beta_p \]


If \(f(\beta)\) is a function that takes \(\beta\) as input and outputs a scalar, such as \(f(\beta) = x'\beta\), then:

\[ \frac{\partial}{\partial \beta} f(\beta)= \begin{bmatrix} \frac{\partial}{\partial \beta_1}f(\beta) \\ \frac{\partial}{\partial \beta_2}f(\beta) \\ \vdots \\ \frac{\partial}{\partial \beta_p}f(\beta) \end{bmatrix} \]


In particular, if \(f(\beta) = x'\beta\), then:

\[ \frac{\partial}{\partial \beta} x'\beta= \begin{bmatrix} \frac{\partial}{\partial \beta_1}(x_1\beta_1+x_2\beta_2 +...+x_p \beta_p ) \\ \frac{\partial}{\partial \beta_2}(x_1\beta_1+x_2\beta_2 +...+x_p \beta_p ) \\ \vdots \\ \frac{\partial}{\partial \beta_p}(x_1\beta_1+x_2\beta_2 +...+x_p \beta_p ) \end{bmatrix} = \begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{p} \end{bmatrix} = \mathbf x \]


In general:

\[ \frac{\partial}{\partial \beta} x'\beta = x \]

This looks a lot like non-vector calculus, except that you have to transpose the coefficient.


Similarly,

\[ \frac{\partial}{\partial \beta} \beta'\beta = 2\beta \]

This is like taking the derivative of \(x^2\).


And finally, if \(S\) is a \(p\times p\) matrix, then:

\[ \frac{\partial}{\partial \beta} \beta'S\beta = 2S\beta \]

Again, this is like taking the derivative of \(cx^2\) with respect to \(x\) in non-vector calculus.


Thus:

\[ \sum_{i=1}^n (y_i - f_\beta(x_i))^2 = (\mathbf y - X\beta)'(\mathbf y - X\beta) \]

\[ (\mathbf y - X\beta)' = (\mathbf y' - (X\beta)') = (\mathbf y' - \beta'X') \]


So

\[ \begin{aligned} (\mathbf y - X\beta)'(\mathbf y - X\beta) &= (\mathbf y' - \beta'X')(\mathbf y - X\beta)\\ &= y'y - \beta'X'y - y'X\beta +\beta'X'X\beta\\ &= y'y - 2y'X\beta +\beta'X'X\beta \end{aligned} \]

2.3.3 Analyzing the linear regression score function

\[ \begin{aligned} \frac{\partial}{\partial \beta}\left( \sum_{i=1}^n (y_i - x_i' \beta)^2\right) &= \frac{\partial}{\partial \beta}(\mathbf y - X\beta)'(\mathbf y - X\beta)\\ &= \frac{\partial}{\partial \beta} (y'y - 2y'X\beta +\beta'X'X\beta)\\ &= (- 2X'y +2X'X\beta) \end{aligned} \]


So if \(\ell'_{\beta}(\beta,\sigma^2) = 0\), then

\[ \begin{aligned} 0 &= (- 2X'y +2X'X\beta)\\ 2X'y &= 2X'X\beta\\ X'y &= X'X\beta\\ (X'X)^{-1}X'y &= \beta \end{aligned} \]


The second derivative matrix \(\ell_{\beta, \beta'} ''(\beta, \sigma^2;\mathbf X,\mathbf y)\) is negative definite at \(\beta = (X'X)^{-1}X'y\), so \(\hat \beta_{ML} = (X'X)^{-1}X'y\) is the MLE for \(\beta\).
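We can verify this formula numerically against R's fit of the interaction model (a quick check; model.matrix() extracts the design matrix \(X\)):

Show R code
X = model.matrix(bw_lm2)
y = model.response(model.frame(bw_lm2))
solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)^{-1} X'y
coef(bw_lm2)                       # should match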


Similarly (not shown):

\[ \hat\sigma^2_{ML} = \frac{1}{n} (Y-X\hat\beta)'(Y-X\hat\beta) \]

And

\[ \begin{aligned} \mathcal I_{\beta} &= E[-\ell_{\beta, \beta'} ''(Y|X,\beta, \sigma^2)]\\ &= \frac{1}{\sigma^2}X'X \end{aligned} \]


So:

\[ Var(\hat \beta) \approx (\mathcal I_{\beta})^{-1} = \sigma^2 (X'X)^{-1} \]

and

\[ \hat\beta \dot \sim N(\beta, \mathcal I_{\beta}^{-1}) \]

These are all results you have hopefully seen before.


In the Gaussian linear regression case, we also have exact results:

\[ \frac{\hat\beta_j - \beta_j}{\hat{\text{se}}\left(\hat\beta_j\right)} \ \sim \ t_{n-p} \]


In our model 2 above, the estimated variance-covariance matrix of \(\hat\beta\), \(\widehat{\text{Var}}(\hat \beta) = \hat\sigma^2(X'X)^{-1} = \hat{\mathcal{I}}(\beta)^{-1}\), is:

Show R code
bw_lm2 |> vcov()
#>             (Intercept)  sexmale      age sexmale:age
#> (Intercept)     1353968 -1353968 -34871.0     34871.0
#> sexmale        -1353968  2596387  34871.0    -67211.0
#> age              -34871    34871    899.9      -899.9
#> sexmale:age       34871   -67211   -899.9      1743.5
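We can reproduce this matrix directly as \(\hat\sigma^2(X'X)^{-1}\) (a quick check; sigma() returns \(\hat\sigma\)):

Show R code
X = model.matrix(bw_lm2)
sigma(bw_lm2)^2 * solve(t(X) %*% X)   # should match vcov(bw_lm2)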

If we take the square roots of the diagonals, we get the standard errors listed in the model output:

Show R code

bw_lm2 |> vcov() |> diag() |> sqrt()
#> (Intercept)     sexmale         age sexmale:age 
#>     1163.60     1611.33       30.00       41.76
Show R code
bw_lm2 |> parameters() |> print_md()
Table 2.7: Estimated model for birthweight data with interaction term
| Parameter | Coefficient | SE | 95% CI | t(20) | p |
|---|---|---|---|---|---|
| (Intercept) | -2141.67 | 1163.60 | (-4568.90, 285.56) | -1.84 | 0.081 |
| sex (male) | 872.99 | 1611.33 | (-2488.18, 4234.17) | 0.54 | 0.594 |
| age | 130.40 | 30.00 | (67.82, 192.98) | 4.35 | < .001 |
| sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |

So we can do confidence intervals, hypothesis tests, and p-values exactly as in the one-variable case we looked at previously.

2.3.4 Residual Standard Deviation

\(\hat \sigma\) represents an estimate of the Residual Standard Deviation parameter, \(\sigma\). We can extract \(\hat \sigma\) from the fitted model, using the sigma() function:

Show R code
sigma(bw_lm2)
#> [1] 180.6
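Note that sigma() uses the unbiased denominator \(n - p\), rather than the maximum-likelihood denominator \(n\); a quick check:

Show R code
n = nobs(bw_lm2)
p = length(coef(bw_lm2))
sqrt(sum(resid(bw_lm2)^2) / (n - p))   # matches sigma(bw_lm2)
sqrt(sum(resid(bw_lm2)^2) / n)         # the (smaller) ML estimate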

\(\sigma\) is NOT “Residual standard error”

In the summary.lm() output, this estimate is labeled as "Residual standard error":

Show R code
summary(bw_lm2)
#> 
#> Call:
#> lm(formula = weight ~ sex + age + sex:age, data = bw)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -246.7 -138.1  -39.1  176.6  274.3 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  -2141.7     1163.6   -1.84  0.08057 .  
#> sexmale        873.0     1611.3    0.54  0.59395    
#> age            130.4       30.0    4.35  0.00031 ***
#> sexmale:age    -18.4       41.8   -0.44  0.66389    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 181 on 20 degrees of freedom
#> Multiple R-squared:  0.643,  Adjusted R-squared:  0.59 
#> F-statistic:   12 on 3 and 20 DF,  p-value: 0.000101

However, this is a misnomer:

Show R code
library(printr) # captures ? documentation
?stats::sigma
sigma R Documentation

Extract Residual Standard Deviation 'Sigma'

Description

Extract the estimated standard deviation of the errors, the “residual standard deviation” (misnamed also “residual standard error”, e.g., in summary.lm()'s output, from a fitted model).

Many classical statistical models have a scale parameter, typically the standard deviation of a zero-mean normal (or Gaussian) random variable which is denoted as \sigma. sigma(.) extracts the estimated parameter from a fitted model, i.e., \hat\sigma.

Note

The misnomer “Residual standard error” has been part of too many R (and S) outputs to be easily changed there.

2.4 Inference about Gaussian Linear Regression Models

2.4.1 Motivating example: birthweight data

Research question: is there really an interaction between sex and age?

\(H_0: \beta_{AM} = 0\)

\(H_A: \beta_{AM} \neq 0\)

\(P(|\hat\beta_{AM}| > |-18.4172| \mid H_0)\) = ?

2.4.2 Wald tests and CIs

R can give you Wald tests for single coefficients and corresponding CIs:

Show R code

bw_lm2 |> 
  parameters() |>
  print_md(
    include_reference = TRUE)
| Parameter | Coefficient | SE | 95% CI | t(20) | p |
|---|---|---|---|---|---|
| (Intercept) | -2141.67 | 1163.60 | (-4568.90, 285.56) | -1.84 | 0.081 |
| sex (female) | 0.00 | | | | |
| sex (male) | 872.99 | 1611.33 | (-2488.18, 4234.17) | 0.54 | 0.594 |
| age | 130.40 | 30.00 | (67.82, 192.98) | 4.35 | < .001 |
| sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |

To understand what’s happening, let’s replicate these results by hand for the interaction term.

2.4.3 P-values

Show R code
bw_lm2 |> 
  parameters(keep = "sexmale:age") |>
  print_md(
    include_reference = TRUE)
| Parameter | Coefficient | SE | 95% CI | t(20) | p |
|---|---|---|---|---|---|
| sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |
Show R code
beta_hat = coef(summary(bw_lm2))["sexmale:age", "Estimate"]
se_hat = coef(summary(bw_lm2))["sexmale:age", "Std. Error"]
dfresid = bw_lm2$df.residual
t_stat = abs(beta_hat)/se_hat
pval_t = 
  pt(-t_stat, df = dfresid, lower.tail = TRUE) +
  pt(t_stat, df = dfresid, lower.tail = FALSE)

\[ \begin{aligned} &P\left( | \hat \beta_{AM} | > | -18.4172| \middle| H_0 \right) \\ &= \Pr \left( \left| \frac{\hat\beta_{AM}}{\hat{SE}(\hat\beta_{AM})} \right| > \left| \frac{-18.4172}{41.7558} \right| \middle| H_0 \right)\\ &= \Pr \left( \left| T_{20} \right| > 0.4411 | H_0 \right)\\ &= 0.6639 \end{aligned} \]

This matches the result in the table above.

2.4.4 Confidence intervals

Show R code
bw_lm2 |> 
  parameters(keep = "sexmale:age") |>
  print_md(
    include_reference = TRUE)
| Parameter | Coefficient | SE | 95% CI | t(20) | p |
|---|---|---|---|---|---|
| sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |
Show R code
q_t = qt(
  p = 0.975, 
  df = dfresid, 
  lower.tail = TRUE)

confint_radius_t = 
  se_hat * q_t

confint_t = beta_hat + c(-1,1) * confint_radius_t

print(confint_t)
#> [1] -105.52   68.68

This also matches.

2.4.5 Gaussian approximations

Here are the asymptotic (Gaussian approximation) equivalents:

2.4.6 P-values

Show R code
bw_lm2 |> 
  parameters(keep = "sexmale:age") |>
  print_md(
    include_reference = TRUE)
| Parameter | Coefficient | SE | 95% CI | t(20) | p |
|---|---|---|---|---|---|
| sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |
Show R code
pval_z = pnorm(abs(t_stat), lower = FALSE) * 2

print(pval_z)
#> [1] 0.6592

2.4.7 Confidence intervals

Show R code
bw_lm2 |> 
  parameters(keep = "sexmale:age") |>
  print_md(
    include_reference = TRUE)
| Parameter | Coefficient | SE | 95% CI | t(20) | p |
|---|---|---|---|---|---|
| sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |
Show R code
confint_radius_z = se_hat * qnorm(0.975, lower = TRUE)
confint_z = 
  beta_hat + c(-1,1) * confint_radius_z
print(confint_z)
#> [1] -100.26   63.42

2.4.8 Likelihood ratio statistics

Show R code

logLik(bw_lm2)
#> 'log Lik.' -156.6 (df=5)
logLik(bw_lm1)
#> 'log Lik.' -156.7 (df=4)

lLR = (logLik(bw_lm2) - logLik(bw_lm1)) |> as.numeric()
delta_df = (bw_lm1$df.residual - df.residual(bw_lm2))


x_max = 1

Show R code
d_lLR = function(x, df = delta_df) dchisq(x, df = df)

chisq_plot = 
  ggplot() + 
  geom_function(fun = d_lLR) +
  stat_function( fun = d_lLR, xlim = c(lLR, x_max), geom = "area", fill = "gray") +
  geom_segment(aes(x = lLR, xend = lLR, y = 0, yend = d_lLR(lLR)), col = "red") + 
  xlim(0.0001,x_max) + 
  ylim(0,4) + 
  ylab("p(X=x)") + 
  xlab("log(likelihood ratio) statistic [x]") +
  theme_classic()
chisq_plot |> print()
Figure 2.5: Chi-square distribution

Now we can get the p-value:

Show R code
pchisq(
  q = 2*lLR, 
  df = delta_df, 
  lower = FALSE) |> 
  print()
#> [1] 0.6298

In practice you don’t have to do this by hand; there are functions to do it for you:

Show R code

# built in
library(lmtest)
lrtest(bw_lm2, bw_lm1)
| #Df | LogLik | Df | Chisq | Pr(>Chisq) |
|---|---|---|---|---|
| 5 | -156.6 | NA | NA | NA |
| 4 | -156.7 | -1 | 0.2323 | 0.6298 |

2.5 Goodness of fit

2.5.1 AIC and BIC

When we use likelihood ratio tests, we are comparing how well different models fit the data.

Likelihood ratio tests require “nested” models: one must be a special case of the other.

If we have non-nested models, we can instead use the Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC):

  • AIC = \(-2 * \ell(\hat\theta) + 2 * p\)

  • BIC = \(-2 * \ell(\hat\theta) + p * \text{log}(n)\)

where \(\ell\) is the log-likelihood of the data evaluated using the parameter estimates \(\hat\theta\), \(p\) is the number of estimated parameters in the model (including \(\hat\sigma^2\)), and \(n\) is the number of observations.

You can calculate these criteria using the logLik() function, or use the built-in R functions:

AIC in R

Show R code

-2 * logLik(bw_lm2) |> as.numeric() + 
  2*(length(coef(bw_lm2))+1) # sigma counts as a parameter here
#> [1] 323.2

AIC(bw_lm2)
#> [1] 323.2

BIC in R

Show R code

-2 * logLik(bw_lm2) |> as.numeric() + 
  (length(coef(bw_lm2))+1) * log(nobs(bw_lm2))
#> [1] 329

BIC(bw_lm2)
#> [1] 329

Large values of AIC and BIC are worse than small values. There are no hypothesis tests or p-values associated with these criteria.
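For example, we can compare the parallel-slopes model and the interaction model by their AIC and BIC values (lower is better):

Show R code
AIC(bw_lm1, bw_lm2)
BIC(bw_lm1, bw_lm2)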

2.5.2 (Residual) Deviance

Let \(q\) be the number of distinct covariate combinations in a data set.

Show R code
bw.X.unique = 
  bw |> 
  count(sex, age)

n_unique.bw  = nrow(bw.X.unique)

For example, in the birthweight data, there are \(q = 12\) unique patterns (Table 2.8).

Show R code
bw.X.unique
Table 2.8: Unique covariate combinations in the birthweight data, with replicate counts
| sex | age | n |
|---|---|---|
| female | 36 | 2 |
| female | 37 | 1 |
| female | 38 | 2 |
| female | 39 | 2 |
| female | 40 | 4 |
| female | 42 | 1 |
| male | 35 | 1 |
| male | 36 | 1 |
| male | 37 | 2 |
| male | 38 | 3 |
| male | 40 | 4 |
| male | 41 | 1 |

Definition 2.1 (Replicates) If a given covariate pattern has more than one observation in a dataset, those observations are called replicates.


Example 2.1 (Replicates in the birthweight data) In the birthweight dataset, there are 2 replicates of the combination “female, age 36” (Table 2.8).


Exercise 2.7 (Replicates in the birthweight data) Which covariate pattern(s) in the birthweight data has the most replicates?


Solution 2.1 (Replicates in the birthweight data). Two covariate patterns are tied for most replicates: males at age 40 weeks and females at age 40 weeks. 40 weeks is the usual length for human pregnancy (Polin, Fox, and Abman (2011)), so this result makes sense.

Show R code
bw.X.unique |> dplyr::filter(n == max(n))
| sex | age | n |
|---|---|---|
| female | 40 | 4 |
| male | 40 | 4 |

Saturated models

The most complicated model we could fit would have one parameter (a mean) for each covariate pattern, plus a variance parameter:

Show R code
lm_max = 
  bw |> 
  mutate(age = factor(age)) |> 
  lm(
    formula = weight ~ sex:age - 1, 
    data = _)

lm_max |> 
  parameters() |> 
  print_md()
Table 2.9: Saturated model for the birthweight data
| Parameter | Coefficient | SE | 95% CI | t(12) | p |
|---|---|---|---|---|---|
| sex (male) × age35 | 2925.00 | 187.92 | (2515.55, 3334.45) | 15.56 | < .001 |
| sex (female) × age36 | 2570.50 | 132.88 | (2280.98, 2860.02) | 19.34 | < .001 |
| sex (male) × age36 | 2625.00 | 187.92 | (2215.55, 3034.45) | 13.97 | < .001 |
| sex (female) × age37 | 2539.00 | 187.92 | (2129.55, 2948.45) | 13.51 | < .001 |
| sex (male) × age37 | 2737.50 | 132.88 | (2447.98, 3027.02) | 20.60 | < .001 |
| sex (female) × age38 | 2872.50 | 132.88 | (2582.98, 3162.02) | 21.62 | < .001 |
| sex (male) × age38 | 2982.00 | 108.50 | (2745.60, 3218.40) | 27.48 | < .001 |
| sex (female) × age39 | 2846.00 | 132.88 | (2556.48, 3135.52) | 21.42 | < .001 |
| sex (female) × age40 | 3152.25 | 93.96 | (2947.52, 3356.98) | 33.55 | < .001 |
| sex (male) × age40 | 3256.25 | 93.96 | (3051.52, 3460.98) | 34.66 | < .001 |
| sex (male) × age41 | 3292.00 | 187.92 | (2882.55, 3701.45) | 17.52 | < .001 |
| sex (female) × age42 | 3210.00 | 187.92 | (2800.55, 3619.45) | 17.08 | < .001 |

We call this model the full, maximal, or saturated model for this dataset.

We can calculate the log-likelihood of this model as usual:

Show R code
logLik(lm_max)
#> 'log Lik.' -151.4 (df=13)

We can compare this model to our other models using chi-square tests, as usual:

Show R code
lrtest(lm_max, bw_lm2)
| #Df | LogLik | Df | Chisq | Pr(>Chisq) |
|---|---|---|---|---|
| 13 | -151.4 | NA | NA | NA |
| 5 | -156.6 | -8 | 10.36 | 0.241 |

The likelihood ratio statistic for this test is \[\lambda = 2 * (\ell_{\text{full}} - \ell) = 10.3554\] where:

  • \(\ell_{\text{full}}\) is the log-likelihood of the full (maximal) model: -151.4016
  • \(\ell\) is the log-likelihood of our comparison model (two slopes, two intercepts): -156.5793

This statistic is called the deviance or residual deviance for our two-slopes and two-intercepts model; it tells us how much the likelihood of that model deviates from the likelihood of the maximal model.

The corresponding p-value tells us whether we have enough evidence to detect that our two-slopes, two-intercepts model is a worse fit for the data than the maximal model; in other words, it tells us if there’s evidence that we missed any important patterns. (Remember, a nonsignificant p-value could mean that we didn’t miss anything and a more complicated model is unnecessary, or it could mean we just don’t have enough data to tell the difference between these models.)
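We can compute this residual deviance directly from the two log-likelihoods (a check against the lrtest() output above):

Show R code
as.numeric(2 * (logLik(lm_max) - logLik(bw_lm2)))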

2.5.3 Null Deviance

Similarly, the least complicated model we could fit would have only one mean parameter, an intercept:

\[\text E[Y|X=x] = \beta_0\] We can fit this model in R like so:

Show R code
lm0 = lm(weight ~ 1, data = bw)

lm0 |> parameters() |> print_md()
| Parameter | Coefficient | SE | 95% CI | t(23) | p |
|---|---|---|---|---|---|
| (Intercept) | 2967.67 | 57.58 | (2848.56, 3086.77) | 51.54 | < .001 |

This model also has a likelihood:

Show R code
logLik(lm0)
#> 'log Lik.' -169 (df=2)

And we can compare it to more complicated models using a likelihood ratio test:

Show R code

lrtest(bw_lm2, lm0)
| #Df | LogLik | Df | Chisq | Pr(>Chisq) |
|---|---|---|---|---|
| 5 | -156.6 | NA | NA | NA |
| 2 | -169.0 | -3 | 24.75 | 0 |

The likelihood ratio statistic for the test comparing the null model to the maximal model is \[\lambda = 2 * (\ell_{\text{full}} - \ell_{0}) = 35.1067\] where:

  • \(\ell_{\text{0}}\) is the log-likelihood of the null model: -168.955
  • \(\ell_{\text{full}}\) is the log-likelihood of the maximal model: -151.4016

In R, this test is:

Show R code
lrtest(lm_max, lm0)
| #Df | LogLik | Df | Chisq | Pr(>Chisq) |
|---|---|---|---|---|
| 13 | -151.4 | NA | NA | NA |
| 2 | -169.0 | -11 | 35.11 | 2e-04 |

This log-likelihood ratio statistic is called the null deviance. It tells us whether we have enough data to detect a difference between the null and full models.
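As before, we can compute the null deviance directly from the two log-likelihoods:

Show R code
as.numeric(2 * (logLik(lm_max) - logLik(lm0)))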

2.6 Rescaling

2.6.1 Rescale age

Show R code

bw = 
  bw |>
  mutate(
    `age - mean` = age - mean(age),
    `age - 36wks` = age - 36
  )

lm1c = lm(weight ~ sex + `age - 36wks`, data = bw)

lm2c = lm(weight ~ sex + `age - 36wks` + sex:`age - 36wks`, data = bw)

parameters(lm2c, ci_method = "wald") |> print_md()
| Parameter | Coefficient | SE | 95% CI | t(20) | p |
|---|---|---|---|---|---|
| (Intercept) | 2552.73 | 97.59 | (2349.16, 2756.30) | 26.16 | < .001 |
| sex (male) | 209.97 | 129.75 | (-60.68, 480.63) | 1.62 | 0.121 |
| age - 36wks | 130.40 | 30.00 | (67.82, 192.98) | 4.35 | < .001 |
| sex (male) × age - 36wks | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |

Compare with what we got without rescaling:

Show R code
parameters(bw_lm2, ci_method = "wald") |> print_md()
| Parameter | Coefficient | SE | 95% CI | t(20) | p |
|---|---|---|---|---|---|
| (Intercept) | -2141.67 | 1163.60 | (-4568.90, 285.56) | -1.84 | 0.081 |
| sex (male) | 872.99 | 1611.33 | (-2488.18, 4234.17) | 0.54 | 0.594 |
| age | 130.40 | 30.00 | (67.82, 192.98) | 4.35 | < .001 |
| sex (male) × age | -18.42 | 41.76 | (-105.52, 68.68) | -0.44 | 0.664 |
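The slope and interaction estimates are unchanged; only the intercept and the main sex term change, because they now describe the model at 36 weeks rather than at the unobservable age of 0 weeks. A quick check using the coefficients of the two fits:

Show R code
# intercept of the recentered model = fitted mean for a female at 36 weeks in the original model
coef(bw_lm2)["(Intercept)"] + coef(bw_lm2)["age"] * 36
coef(lm2c)["(Intercept)"]

# sex term in the recentered model = male - female difference at 36 weeks
coef(bw_lm2)["sexmale"] + coef(bw_lm2)["sexmale:age"] * 36
coef(lm2c)["sexmale"]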

2.7 Prediction

2.7.1 Prediction for linear models

Definition 2.2 (Predicted value) In a regression model \(\text{p}(y|x)\), the predicted value of \(y\) given \(x\) is the estimated mean of \(Y\) given \(X\):

\[\hat y \stackrel{\text{def}}{=}\hat{\text{E}}\left[Y|X=x\right]\]


For linear models, the predicted value can be straightforwardly calculated by multiplying each predictor value \(x_j\) by its corresponding coefficient \(\beta_j\) and adding up the results:

\[ \begin{aligned} \hat Y &= \hat E[Y|X=x] \\ &= x'\hat\beta \\ &= \hat\beta_0\cdot 1 + \hat\beta_1 x_1 + ... + \hat\beta_p x_p \end{aligned} \]


2.7.2 Example: prediction for the birthweight data

Show R code

X = c(1,1,40)
sum(X * coef(bw_lm1))
#> [1] 3225

R has built-in functions for prediction:

Show R code
x = tibble(age = 40, sex = "male")
bw_lm1 |> predict(newdata = x)
#>    1 
#> 3225

If you don’t provide newdata, R will use the covariate values from the original dataset:

Show R code
predict(bw_lm1)
#>    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
#> 3225 3062 2984 2579 3225 3062 2621 2821 2742 3304 2863 2942 3346 3062 3225 2700 
#>   17   18   19   20   21   22   23   24 
#> 2863 2579 2984 2821 3225 2942 2984 3062

These special predictions are called the fitted values of the dataset:

Definition 2.3 For a given dataset \((\tilde{Y}, \tilde{X})\) and corresponding fitted model \(\text{p}_{\hat \beta}(y|x)\), the fitted value of \(y_i\) is the predicted value of \(y\) when \(X=x_i\), using the estimated parameters \(\hat \beta\).

R has an extra function to get these values:

Show R code
fitted(bw_lm1)
#>    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
#> 3225 3062 2984 2579 3225 3062 2621 2821 2742 3304 2863 2942 3346 3062 3225 2700 
#>   17   18   19   20   21   22   23   24 
#> 2863 2579 2984 2821 3225 2942 2984 3062

2.7.3 Quantifying uncertainty in predictions

Show R code
bw_lm1 |> 
  predict(
    newdata = x,
    se.fit = TRUE)
#> $fit
#>    1 
#> 3225 
#> 
#> $se.fit
#> [1] 61.46
#> 
#> $df
#> [1] 21
#> 
#> $residual.scale
#> [1] 177.1

This is a list(); you can extract the elements with $ or magrittr::use_series():

Show R code
bw_lm1 |> 
  predict(
    newdata = x,
    se.fit = TRUE) |> 
  use_series(se.fit)
#> [1] 61.46

You can get confidence intervals for \(\mathbb{E}\left[Y|X=x\right]\):

Show R code
bw_lm1 |> predict(
  newdata = x,
  interval = "confidence")
| fit | lwr | upr |
|---|---|---|
| 3225 | 3098 | 3353 |
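This interval can be reproduced from the standard error computed above, using a t quantile with the residual degrees of freedom (a sketch):

Show R code
pred = predict(bw_lm1, newdata = x, se.fit = TRUE)
pred$fit + c(-1, 1) * qt(0.975, df = pred$df) * pred$se.fit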

You can also get prediction intervals for the value of an individual outcome \(Y\):

Show R code
bw_lm1 |> 
  predict(newdata = x, interval = "predict")
| fit | lwr | upr |
|---|---|---|
| 3225 | 2836 | 3615 |

If you request prediction intervals for the original data (that is, without supplying newdata), R will warn that “predictions on current data refer to future responses” (since you already know what happened to the current data, and thus don’t need to predict it).

See ?predict.lm for more.

2.8 Diagnostics

Tip

This section is adapted from Dobson and Barnett (2018, secs. 6.2–6.3) and Dunn, Smyth, et al. (2018) Chapter 3.

2.8.1 Assumptions in linear regression models

\[Y|\tilde{X}\ \sim_{⫫}\ N(\tilde{X}'\beta,\sigma^2)\]

  1. Normality: The distribution conditional on a given \(X\) value is normal

  2. Correct Functional Form: The conditional means have the structure

\[E[Y|\tilde{X} = \tilde{x}] = \tilde{x}'\beta\]

  3. Homoskedasticity: The variance \(\sigma^2\) is constant (with respect to \(\tilde{x}\))

  4. Independence: The observations are statistically independent

2.8.2 Direct visualization

The most direct way to examine the fit of a model is to compare it to the raw observed data.

Show R code
bw = 
  bw |> 
  mutate(
    predlm2 = predict(bw_lm2)
  ) |> 
  arrange(sex, age)

plot1_interact = 
  plot1 %+% bw +
  geom_line(aes(y = predlm2))

print(plot1_interact)
Figure 2.6: Birthweight model with interaction term

It’s not easy to assess these assumptions from this model. If there are multiple continuous covariates, it becomes even harder to visualize the raw data.

2.8.3 Residuals

Maybe we can transform the data and model in some way to make it easier to inspect.

Definition 2.4 (Residual noise) The residual noise in a probabilistic model \(p(Y)\) is the difference between an observed value \(y\) and its distributional mean:

\[\epsilon(y) \stackrel{\text{def}}{=}y - \mathbb{E}\left[Y\right] \tag{2.7}\]

We use the same notation for residual noise that we used for errors. \(\mathbb{E}\left[Y\right]\) can be viewed as an estimate of \(Y\), before \(y\) is observed. Conversely, each observation \(y\) can be viewed as an estimate of \(\mathbb{E}\left[Y\right]\) (albeit an imprecise one, individually, since \(n=1\)).

We can rearrange Equation 2.7 to view \(y\) as the sum of its mean plus the residual noise:

\[y = \mathbb{E}\left[Y\right] + \epsilon(y)\]


Theorem 2.1 (Residuals in Gaussian models) If \(Y\) has a Gaussian distribution, then \(\epsilon(Y)\) also has a Gaussian distribution, and vice versa.

Proof. Left to the reader.


Definition 2.5 (Residual errors of a fitted model value) The residual of a fitted value \(\hat y\) (shorthand: “residual”) is its error: \[ \begin{aligned} e(\hat y) &\stackrel{\text{def}}{=}\epsilon\left(\hat y\right) \\&= y - \hat y \end{aligned} \]

\(e(\hat y)\) can be seen as the maximum likelihood estimate of the residual noise:

\[ \begin{aligned} e(\hat y) &= y - \hat y \\ &= \hat\epsilon_{ML} \end{aligned} \]


General characteristics of residuals

Theorem 2.2 For unbiased estimators \(\hat\theta\):

\[\mathbb{E}\left[e(y)\right] = 0 \tag{2.8}\] \[\text{Var}\left(e(y)\right) \approx \sigma^2 \tag{2.9}\]

Proof.  

Equation 2.8:

\[ \begin{aligned} \mathbb{E}\left[e(y)\right] &= \mathbb{E}\left[y - \hat y\right] \\ &= \mathbb{E}\left[y\right] - \mathbb{E}\left[\hat y\right] \\ &= \mathbb{E}\left[y\right] - \mathbb{E}\left[y\right] \\ &= 0 \end{aligned} \]

Equation 2.9:

\[ \begin{aligned} \text{Var}\left(e(y)\right) &= \text{Var}\left(y - \hat y\right) \\ &= \text{Var}\left(y\right) + \text{Var}\left(\hat y\right) - 2 \text{Cov}\left(y, \hat y\right) \\ &{\dot{\approx}} \text{Var}\left(y\right) + 0 - 2 \cdot 0 \\ &= \text{Var}\left(y\right) \\ &= \sigma^2 \end{aligned} \]


Characteristics of residuals in Gaussian models

With enough data and a correct model, the residuals will be approximately Gaussian distributed, with variance \(\sigma^2\), which we can estimate using \(\hat\sigma^2\); that is:

\[ e_i \ \sim_{\text{iid}}\ N(0, \hat\sigma^2) \]


Example 2.2 (residuals in birthweight data) R provides a function for residuals:

Show R code
resid(bw_lm2)
#>       1       2       3       4       5       6       7       8       9      10 
#>  176.27 -140.73 -144.13  -59.53  177.47 -126.93  -68.93  242.67 -139.33   51.67 
#>      11      12      13      14      15      16      17      18      19      20 
#>  156.67 -125.13  274.28 -137.71  -27.69 -246.69 -191.67  189.33  -11.67 -242.64 
#>      21      22      23      24 
#>  -47.64  262.36  210.36  -30.62

Exercise 2.8 Check R’s output by computing the residuals directly.

Solution.  

Show R code
bw$weight - fitted(bw_lm2)
#>       1       2       3       4       5       6       7       8       9      10 
#>  176.27 -140.73 -144.13  -59.53  177.47 -126.93  -68.93  242.67 -139.33   51.67 
#>      11      12      13      14      15      16      17      18      19      20 
#>  156.67 -125.13  274.28 -137.71  -27.69 -246.69 -191.67  189.33  -11.67 -242.64 
#>      21      22      23      24 
#>  -47.64  262.36  210.36  -30.62

This matches R’s output!


Graph the residuals

Show R code
bw = bw |> 
  mutate(resids_intxn = 
           weight - fitted(bw_lm2))

plot_bw_resid =
  bw |> 
  ggplot(aes(
    x = age, 
    y = resids_intxn,
    linetype = sex,
    shape = sex,
    col = sex))  +
  theme_bw() +
  xlab("Gestational age (weeks)") +
  ylab("residuals (grams)") +
  theme(legend.position = "bottom") +
  # expand_limits(y = 0, x = 0) +
  geom_point(alpha = .7)
print(plot_bw_resid + facet_wrap(~ sex))
Figure 2.7: Residuals of interaction model for birthweight data

Definition 2.6 (Standardized residuals) \[r_i = \frac{e_i}{\widehat{SD}(e_i)}\]

Hence, with enough data and a correct model, the standardized residuals will be approximately standard Gaussian; that is,

\[ r_i \ \sim_{\text{iid}}\ N(0,1) \]

2.8.4 Marginal distributions of residuals

To look for problems with our model, we can check whether the residuals \(e_i\) and standardized residuals \(r_i\) look like they have the distributions that they are supposed to have, according to the model.


Standardized residuals in R

Show R code

rstandard(bw_lm2)
#>        1        2        3        4        5        6        7        8 
#>  1.15982 -0.92601 -0.87479 -0.34723  1.03507 -0.73473 -0.39901  1.43752 
#>        9       10       11       12       13       14       15       16 
#> -0.82539  0.30606  0.92807 -0.87616  1.91428 -0.86559 -0.16430 -1.46376 
#>       17       18       19       20       21       22       23       24 
#> -1.11016  1.09658 -0.06761 -1.46159 -0.28696  1.58040  1.26717 -0.19805
resid(bw_lm2)/sigma(bw_lm2)
#>        1        2        3        4        5        6        7        8 
#>  0.97593 -0.77920 -0.79802 -0.32962  0.98258 -0.70279 -0.38166  1.34357 
#>        9       10       11       12       13       14       15       16 
#> -0.77144  0.28606  0.86741 -0.69282  1.51858 -0.76244 -0.15331 -1.36584 
#>       17       18       19       20       21       22       23       24 
#> -1.06123  1.04825 -0.06463 -1.34341 -0.26376  1.45262  1.16471 -0.16954

These are not quite the same: rstandard() divides each residual by an estimate of that residual’s own standard deviation, which accounts for its leverage, rather than dividing every residual by the same \(\hat\sigma\). Let’s not worry about those details for now; the difference is pretty small in this case:


Show R code

rstandard_compare_plot = 
  tibble(
    x = resid(bw_lm2)/sigma(bw_lm2), 
    y = rstandard(bw_lm2)) |> 
  ggplot(aes(x = x, y = y)) +
  geom_point() + 
  theme_bw() +
  coord_equal() + 
  xlab("resid(bw_lm2)/sigma(bw_lm2)") +
  ylab("rstandard(bw_lm2)") +
  geom_abline(
    aes(
      intercept = 0,
      slope = 1, 
      col = "x=y")) +
  labs(colour="") +
  scale_colour_manual(values="red")

print(rstandard_compare_plot)


Let’s add these residuals to the tibble of our dataset:

Show R code

bw = 
  bw |> 
  mutate(
    fitted_lm2 = fitted(bw_lm2),
    
    resid_lm2 = resid(bw_lm2),
    # resid_lm2 = weight - fitted_lm2,
    
    std_resid_lm2 = rstandard(bw_lm2),
    # std_resid_lm2 = resid_lm2 / sigma(bw_lm2)
  )

bw |> 
  select(
    sex,
    age,
    weight,
    fitted_lm2,
    resid_lm2,
    std_resid_lm2
  )
| sex | age | weight | fitted_lm2 | resid_lm2 | std_resid_lm2 |
|---|---|---|---|---|---|
| female | 36 | 2729 | 2553 | 176.27 | 1.1598 |
| female | 36 | 2412 | 2553 | -140.73 | -0.9260 |
| female | 37 | 2539 | 2683 | -144.13 | -0.8748 |
| female | 38 | 2754 | 2814 | -59.53 | -0.3472 |
| female | 38 | 2991 | 2814 | 177.47 | 1.0351 |
| female | 39 | 2817 | 2944 | -126.93 | -0.7347 |
| female | 39 | 2875 | 2944 | -68.93 | -0.3990 |
| female | 40 | 3317 | 3074 | 242.67 | 1.4375 |
| female | 40 | 2935 | 3074 | -139.33 | -0.8254 |
| female | 40 | 3126 | 3074 | 51.67 | 0.3061 |
| female | 40 | 3231 | 3074 | 156.67 | 0.9281 |
| female | 42 | 3210 | 3335 | -125.13 | -0.8762 |
| male | 35 | 2925 | 2651 | 274.28 | 1.9143 |
| male | 36 | 2625 | 2763 | -137.71 | -0.8656 |
| male | 37 | 2847 | 2875 | -27.69 | -0.1643 |
| male | 37 | 2628 | 2875 | -246.69 | -1.4638 |
| male | 38 | 2795 | 2987 | -191.67 | -1.1102 |
| male | 38 | 3176 | 2987 | 189.33 | 1.0966 |
| male | 38 | 2975 | 2987 | -11.67 | -0.0676 |
| male | 40 | 2968 | 3211 | -242.64 | -1.4616 |
| male | 40 | 3163 | 3211 | -47.64 | -0.2870 |
| male | 40 | 3473 | 3211 | 262.36 | 1.5804 |
| male | 40 | 3421 | 3211 | 210.36 | 1.2672 |
| male | 41 | 3292 | 3323 | -30.62 | -0.1981 |

Now let’s build histograms:

Show R code
resid_marginal_hist = 
  bw |> 
  ggplot(aes(x = resid_lm2)) +
  geom_histogram()

print(resid_marginal_hist)
Figure 2.8: Marginal distribution of (nonstandardized) residuals

Hard to tell with this small amount of data, but I’m a bit concerned that the histogram doesn’t show a bell-curve shape.


Show R code
std_resid_marginal_hist = 
  bw |> 
  ggplot(aes(x = std_resid_lm2)) +
  geom_histogram()

print(std_resid_marginal_hist)
Figure 2.9: Marginal distribution of standardized residuals

This looks similar, although the scale of the x-axis got narrower, because we divided by \(\hat\sigma\) (roughly speaking).

Still hard to tell if the distribution is Gaussian.


2.8.5 QQ plot of standardized residuals

Another way to assess normality is the QQ plot of the standardized residuals versus normal quantiles:

Show R code

library(ggfortify) 
# needed to make ggplot2::autoplot() work for `lm` objects

qqplot_lm2_auto = 
  bw_lm2 |> 
  autoplot(
    which = 2, # options are 1:6; can do multiple at once
    ncol = 1) +
  theme_classic()

print(qqplot_lm2_auto)

If the Gaussian model were correct, these points should follow the dotted line.

Fig 2.4 panel (c) in Dobson and Barnett (2018) is a little different; they didn’t specify how they produced it, but other statistical analysis systems do things differently from R.

See also Dunn, Smyth, et al. (2018) §3.5.4.


QQ plot - how it’s built

Let’s construct it by hand:

Show R code

bw = bw |> 
  mutate(
    p = (rank(std_resid_lm2) - 1/2)/n(), # (rank - 1/2)/n plotting positions ("Hazen's method")
    expected_quantiles_lm2 = qnorm(p)
  )

qqplot_lm2 = 
  bw |> 
  ggplot(
    aes(
      x = expected_quantiles_lm2, 
      y = std_resid_lm2, 
      col = sex, 
      shape = sex)
  ) + 
  geom_point() +
  theme_classic() +
  theme(legend.position='none') + # removing the plot legend
  ggtitle("Normal Q-Q") +
  xlab("Theoretical Quantiles") + 
  ylab("Standardized residuals")

# find the expected line:

ps <- c(.25, .75)                  # reference probabilities
a <- quantile(rstandard(bw_lm2), ps)  # empirical quantiles
b <- qnorm(ps)                     # theoretical quantiles

qq_slope = diff(a)/diff(b)
qq_intcpt = a[1] - b[1] * qq_slope

qqplot_lm2 = 
  qqplot_lm2 +
  geom_abline(slope = qq_slope, intercept = qq_intcpt)

print(qqplot_lm2)


2.8.6 Conditional distributions of residuals

If our Gaussian linear regression model is correct, the residuals \(e_i\) and standardized residuals \(r_i\) should have:

  • an approximately Gaussian distribution, with:
  • a mean of 0
  • a constant variance

This should be true for every value of \(x\).


If we didn’t correctly guess the functional form of linear component of the mean, \[\text{E}[Y|X=x] = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p\]

Then the the residuals might have nonzero mean.

Regardless of whether we guessed the mean function correctly, ther the variance of the residuals might differ between values of \(x\).


Residuals versus fitted values

To look for these issues, we can plot the residuals \(e_i\) against the fitted values \(\hat y_i\) (Figure 2.10).

Show R code
autoplot(bw_lm2, which = 1, ncol = 1) |> print()
Figure 2.10: birthweight model (Equation 2.2): residuals versus fitted values

If the model is correct, the blue line should stay flat and close to 0, and the cloud of dots should have the same vertical spread regardless of the fitted value.

If not, we probably need to change the functional form of linear component of the mean, \[\text{E}[Y|X=x] = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p\]


Example: PLOS Medicine title length data

(Adapted from Dobson and Barnett (2018), §6.7.1)

Show R code
data(PLOS, package = "dobson")
library(ggplot2)
fig1 = 
  PLOS |> 
  ggplot(
    aes(x = authors,
        y = nchar)
  ) +
  geom_point() +
  theme(legend.position = "bottom") +
  labs(col = "") +
  guides(col=guide_legend(ncol=3))
fig1
Figure 2.11: Number of authors versus title length in PLOS Medicine articles

Linear fit
Show R code
lm_PLOS_linear = lm(
  formula = nchar ~ authors, 
  data = PLOS)
Show R code
fig2 = fig1 +
  geom_smooth(
    method = "lm", 
              fullrange = TRUE,
              aes(col = "lm(y ~ x)"))
fig2

library(ggfortify)
autoplot(lm_PLOS_linear, which = 1, ncol = 1)
Figure 2.12: Number of authors versus title length in PLOS Medicine, with linear model fit
(a) Data and fit
(b) Residuals vs fitted

Quadratic fit
Show R code
lm_PLOS_quad = lm(
  formula = nchar ~ authors + I(authors^2), 
  data = PLOS)
Show R code
fig3 = 
  fig2 + 
geom_smooth(
    method = "lm",
    fullrange = TRUE,
    formula = y ~ x + I(x ^ 2),
    aes(col = "lm(y ~ x + I(x^2))")
  )
fig3

autoplot(lm_PLOS_quad, which = 1, ncol = 1)
Figure 2.13: Number of authors versus title length in PLOS Medicine, with quadratic model fit
(a) Data and fit
(b) Residuals vs fitted

Linear versus quadratic fits
Show R code
library(ggfortify)
autoplot(lm_PLOS_linear, which = 1, ncol = 1)

autoplot(lm_PLOS_quad, which = 1, ncol = 1)
Figure 2.14: Residuals versus fitted plot for linear and quadratic fits to PLOS data
(a) Linear
(b) Quadratic

Cubic fit
Show R code
lm_PLOS_cub = lm(
  formula = nchar ~ authors + I(authors^2) + I(authors^3), 
  data = PLOS)
Show R code
fig4 = 
  fig3 + 
geom_smooth(
    method = "lm",
    fullrange = TRUE,
    formula = y ~ x + I(x ^ 2) + I(x ^ 3),
    aes(col = "lm(y ~ x + I(x^2) + I(x ^ 3))")
  )
fig4

autoplot(lm_PLOS_cub, which = 1, ncol = 1)
Figure 2.15: Number of authors versus title length in PLOS Medicine, with cubic model fit
(a) Data and fit
(b) Residuals vs fitted

Logarithmic fit
Show R code
lm_PLOS_log = lm(nchar ~ log(authors), data = PLOS)
Show R code
fig5 = fig4 + 
  geom_smooth(
    method = "lm",
    fullrange = TRUE,
    formula = y ~ log(x),
    aes(col = "lm(y ~ log(x))")
  )
fig5

autoplot(lm_PLOS_log, which = 1, ncol = 1)
Figure 2.16: logarithmic fit
(a) Data and fit
(b) Residuals vs fitted

Model selection
Show R code
anova(lm_PLOS_linear, lm_PLOS_quad)
Table 2.10: linear vs quadratic
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 876 | 947502 | NA | NA | NA | NA |
| 875 | 880950 | 1 | 66552 | 66.1 | 0 |
Show R code
anova(lm_PLOS_quad, lm_PLOS_cub)
Table 2.11: quadratic vs cubic
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 875 | 880950 | NA | NA | NA | NA |
| 874 | 865933 | 1 | 15018 | 15.16 | 1e-04 |

AIC/BIC
Show R code
AIC(lm_PLOS_quad)
#> [1] 8568
AIC(lm_PLOS_cub)
#> [1] 8555
Show R code
AIC(lm_PLOS_cub)
#> [1] 8555
AIC(lm_PLOS_log)
#> [1] 8544
Show R code
BIC(lm_PLOS_cub)
#> [1] 8578
BIC(lm_PLOS_log)
#> [1] 8558

Extrapolation is dangerous
Show R code
fig_all = fig5 +
  xlim(0, 60)
fig_all
Figure 2.17: Number of authors versus title length in PLOS Medicine

Scale-location plot

We can also plot the square roots of the absolute values of the standardized residuals against the fitted values (Figure 2.18).

Show R code
autoplot(bw_lm2, which = 3, ncol = 1) |> print()
Figure 2.18: Scale-location plot of birthweight data

Here, the blue line doesn’t need to be near 0, but it should be flat. If not, the residual variance \(\sigma^2\) might not be constant, and we might need to transform our outcome \(Y\) (or use a model that allows non-constant variance).
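For example, one common variance-stabilizing option is to model the log of the outcome; here is a sketch only (not a claim that the birthweight data need this):

Show R code
bw_lm2_log = lm(log(weight) ~ sex + age + sex:age, data = bw)
autoplot(bw_lm2_log, which = 3, ncol = 1) |> print()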


Residuals versus leverage

We can also plot our standardized residuals against “leverage”, which roughly speaking is a measure of how unusual each \(x_i\) value is. Very unusual \(x_i\) values can have extreme effects on the model fit, so we might want to remove those observations as outliers, particularly if they have large residuals.

Show R code
autoplot(bw_lm2, which = 5, ncol = 1) |> print()
birthweight model with interactions (Equation 2.2): residuals versus leverage

The blue line should be relatively flat and close to 0 here.
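The leverage values themselves can be extracted with hatvalues(); observations with leverage much larger than the average, \(p/n\), deserve a closer look:

Show R code
hatvalues(bw_lm2) |> sort(decreasing = TRUE) |> head()
length(coef(bw_lm2)) / nobs(bw_lm2)  # average leverage p/n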


2.8.7 Diagnostics constructed by hand

Show R code

bw = 
  bw |> 
  mutate(
    predlm2 = predict(bw_lm2),
    residlm2 = weight - predlm2,
    std_resid = residlm2 / sigma(bw_lm2),
    # std_resid_builtin = rstandard(bw_lm2), # uses leverage
    sqrt_abs_std_resid = std_resid |> abs() |> sqrt()
    
  )
Residuals vs fitted
Show R code

resid_vs_fit = bw |> 
  ggplot(
    aes(x = predlm2, y = residlm2, col = sex, shape = sex)
  ) + 
  geom_point() +
  theme_classic() +
  geom_hline(yintercept = 0)
Show R code
print(resid_vs_fit)

Standardized residuals vs fitted
Show R code
bw |> 
  ggplot(
    aes(x = predlm2, y = std_resid, col = sex, shape = sex)
  ) + 
  geom_point() +
  theme_classic() +
  geom_hline(yintercept = 0)

Standardized residuals vs gestational age
Show R code

bw |> 
  ggplot(
    aes(x = age, y = std_resid, col = sex, shape = sex)
  ) + 
  geom_point() +
  theme_classic() +
  geom_hline(yintercept = 0)

sqrt(abs(rstandard())) vs fitted

Compare with autoplot(bw_lm2, 3)

Show R code


bw |> 
  ggplot(
    aes(x = predlm2, y = sqrt_abs_std_resid, col = sex, shape = sex)
  ) + 
  geom_point() +
  theme_classic() +
  geom_hline(yintercept = 0)

2.9 Model selection

(adapted from Dobson and Barnett (2018) §6.3.3; for more information on prediction, see James et al. (2013) and Harrell (2015)).

If we have a lot of covariates in our dataset, we might want to choose a small subset to use in our model.

There are a few possible metrics to consider for choosing a “best” model.

2.9.1 Mean squared error

We might want to minimize the mean squared error, \(\text E[(y-\hat y)^2]\), for new observations that weren’t in our data set when we fit the model.

Unfortunately, \[\frac{1}{n}\sum_{i=1}^n (y_i-\hat y_i)^2\] gives a biased estimate of \(\text E[(y-\hat y)^2]\) for new data. If we want an unbiased estimate, we will have to be clever.


Cross-validation

Show R code
data("carbohydrate", package = "dobson")
library(cvTools)
full_model <- lm(carbohydrate ~ ., data = carbohydrate)
cv_full = 
  full_model |> cvFit(
    data = carbohydrate, K = 5, R = 10,
    y = carbohydrate$carbohydrate)

reduced_model = update(full_model, 
                       formula = ~ . - age)

cv_reduced = 
  reduced_model |> cvFit(
    data = carbohydrate, K = 5, R = 10,
    y = carbohydrate$carbohydrate)

Show R code
results_reduced = 
  tibble(
      model = "wgt+protein",
      errs = cv_reduced$reps[])
results_full = 
  tibble(model = "wgt+age+protein",
           errs = cv_full$reps[])

cv_results = 
  bind_rows(results_reduced, results_full)

cv_results |> 
  ggplot(aes(y = model, x = errs)) +
  geom_boxplot()


comparing metrics
Show R code

compare_results = tribble(
  ~ model, ~ cvRMSE, ~ r.squared, ~adj.r.squared, ~ trainRMSE, ~loglik,
  "full", cv_full$cv, summary(full_model)$r.squared,  summary(full_model)$adj.r.squared, sigma(full_model), logLik(full_model) |> as.numeric(),
  "reduced", cv_reduced$cv, summary(reduced_model)$r.squared, summary(reduced_model)$adj.r.squared, sigma(reduced_model), logLik(reduced_model) |> as.numeric())

compare_results
| model | cvRMSE | r.squared | adj.r.squared | trainRMSE | loglik |
|---|---|---|---|---|---|
| full | 6.807 | 0.4805 | 0.3831 | 5.956 | -61.84 |
| reduced | 6.398 | 0.4454 | 0.3802 | 5.971 | -62.49 |

Show R code
anova(full_model, reduced_model)
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 16 | 567.7 | NA | NA | NA | NA |
| 17 | 606.0 | -1 | -38.36 | 1.081 | 0.3139 |

stepwise regression

Show R code
library(olsrr)
olsrr::ols_step_both_aic(full_model)
#> 
#> 
#>                              Stepwise Summary                              
#> -------------------------------------------------------------------------
#> Step    Variable         AIC        SBC       SBIC       R2       Adj. R2 
#> -------------------------------------------------------------------------
#>  0      Base Model     140.773    142.764    83.068    0.00000    0.00000 
#>  1      protein (+)    137.950    140.937    80.438    0.21427    0.17061 
#>  2      weight (+)     132.981    136.964    77.191    0.44544    0.38020 
#> -------------------------------------------------------------------------
#> 
#> Final Model Output 
#> ------------------
#> 
#>                          Model Summary                          
#> ---------------------------------------------------------------
#> R                       0.667       RMSE                 5.505 
#> R-Squared               0.445       MSE                 35.648 
#> Adj. R-Squared          0.380       Coef. Var           15.879 
#> Pred R-Squared          0.236       AIC                132.981 
#> MAE                     4.593       SBC                136.964 
#> ---------------------------------------------------------------
#>  RMSE: Root Mean Square Error 
#>  MSE: Mean Square Error 
#>  MAE: Mean Absolute Error 
#>  AIC: Akaike Information Criteria 
#>  SBC: Schwarz Bayesian Criteria 
#> 
#>                                ANOVA                                
#> -------------------------------------------------------------------
#>                 Sum of                                             
#>                Squares        DF    Mean Square      F        Sig. 
#> -------------------------------------------------------------------
#> Regression     486.778         2        243.389    6.827    0.0067 
#> Residual       606.022        17         35.648                    
#> Total         1092.800        19                                   
#> -------------------------------------------------------------------
#> 
#>                                   Parameter Estimates                                    
#> ----------------------------------------------------------------------------------------
#>       model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
#> ----------------------------------------------------------------------------------------
#> (Intercept)    33.130        12.572                  2.635    0.017     6.607    59.654 
#>     protein     1.824         0.623        0.534     2.927    0.009     0.509     3.139 
#>      weight    -0.222         0.083       -0.486    -2.662    0.016    -0.397    -0.046 
#> ----------------------------------------------------------------------------------------
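Base R's step() function performs a similar AIC-based stepwise search; a minimal sketch (not part of the original analysis):

Show R code
# stepwise selection by AIC, starting from the full model and
# allowing both additions and deletions:
step(full_model, direction = "both", trace = FALSE)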

Lasso

Another approach is the lasso, which penalizes the size of the coefficients:

\[\hat{\beta}_{\text{lasso}} = \arg \min_{\beta}\left\{\ell(\beta) + \lambda \sum_{j=1}^p|\beta_j|\right\}\]

where \(\ell(\beta)\) is the unpenalized loss (for linear regression, the residual sum of squares or, equivalently, the negative log-likelihood up to constants) and \(\lambda \ge 0\) controls the strength of the penalty. For large enough \(\lambda\), some coefficients are shrunk exactly to zero, so the lasso performs variable selection.

Show R code
library(glmnet)
y = carbohydrate$carbohydrate
x = carbohydrate |> 
  select(age, weight, protein) |> 
  as.matrix()
fit = glmnet(x,y)

Show R code
autoplot(fit, xvar = 'lambda')
Figure 2.19: Lasso selection

Show R code
cvfit = cv.glmnet(x,y)
plot(cvfit)


Show R code
coef(cvfit, s = "lambda.1se")
#> 4 x 1 sparse Matrix of class "dgCMatrix"
#>                  s1
#> (Intercept) 34.1090
#> age          .     
#> weight      -0.1041
#> protein      0.9441

Here `lambda.1se` is the largest penalty \(\lambda\) whose cross-validated error is within one standard error of the minimum; at that penalty the coefficient for age is shrunk exactly to zero, so the lasso selects the weight + protein model.

2.10 Categorical covariates with more than two levels

2.10.1 Example: birthweight

In the birthweight example, the variable sex had only two observed values:

Show R code
unique(bw$sex)
#> [1] female male  
#> Levels: female male

If there are more than two observed values, we can’t just use a single variable with 0s and 1s.

2.10.2 Example: iris data

For example, Table 2.12 shows the (in)famous iris data (Anderson (1935)), and Table 2.13 provides summary statistics. The data include three species: “setosa”, “versicolor”, and “virginica”.

Show R code
head(iris)
Table 2.12: The iris data
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|--------------|-------------|--------------|-------------|---------|
| 5.1          | 3.5         | 1.4          | 0.2         | setosa  |
| 4.9          | 3.0         | 1.4          | 0.2         | setosa  |
| 4.7          | 3.2         | 1.3          | 0.2         | setosa  |
| 4.6          | 3.1         | 1.5          | 0.2         | setosa  |
| 5.0          | 3.6         | 1.4          | 0.2         | setosa  |
| 5.4          | 3.9         | 1.7          | 0.4         | setosa  |
Show R code
library(table1)
table1(
  x = ~ . | Species,
  data = iris,
  overall = FALSE
)
Table 2.13: Summary statistics for the iris data

|                                 | setosa (N=50)        | versicolor (N=50)  | virginica (N=50)   |
|---------------------------------|----------------------|--------------------|--------------------|
| Sepal.Length: Mean (SD)         | 5.01 (0.352)         | 5.94 (0.516)       | 6.59 (0.636)       |
| Sepal.Length: Median [Min, Max] | 5.00 [4.30, 5.80]    | 5.90 [4.90, 7.00]  | 6.50 [4.90, 7.90]  |
| Sepal.Width: Mean (SD)          | 3.43 (0.379)         | 2.77 (0.314)       | 2.97 (0.322)       |
| Sepal.Width: Median [Min, Max]  | 3.40 [2.30, 4.40]    | 2.80 [2.00, 3.40]  | 3.00 [2.20, 3.80]  |
| Petal.Length: Mean (SD)         | 1.46 (0.174)         | 4.26 (0.470)       | 5.55 (0.552)       |
| Petal.Length: Median [Min, Max] | 1.50 [1.00, 1.90]    | 4.35 [3.00, 5.10]  | 5.55 [4.50, 6.90]  |
| Petal.Width: Mean (SD)          | 0.246 (0.105)        | 1.33 (0.198)       | 2.03 (0.275)       |
| Petal.Width: Median [Min, Max]  | 0.200 [0.100, 0.600] | 1.30 [1.00, 1.80]  | 2.00 [1.40, 2.50]  |

If we want to model Sepal.Length by species, we could create a variable \(X\) that represents “setosa” as \(X=1\), “virginica” as \(X=2\), and “versicolor” as \(X=3\).

Show R code
data(iris) # this step is not always necessary, but ensures you're starting  
# from the original version of a dataset stored in a loaded package

iris = 
  iris |> 
  tibble() |>
  mutate(
    X = case_when(
      Species == "setosa" ~ 1,
      Species == "virginica" ~ 2,
      Species == "versicolor" ~ 3
    )
  )

iris |> 
  distinct(Species, X)
Table 2.14: iris data with numeric coding of species
| Species    | X |
|------------|---|
| setosa     | 1 |
| versicolor | 3 |
| virginica  | 2 |

Then we could fit a model like:

Show R code
iris_lm1 = lm(Sepal.Length ~ X, data = iris)
iris_lm1 |> parameters() |> print_md()
Table 2.15: Model of iris data with numeric coding of Species
| Parameter   | Coefficient | SE   | 95% CI       | t(148) | p      |
|-------------|-------------|------|--------------|--------|--------|
| (Intercept) | 4.91        | 0.16 | (4.60, 5.23) | 30.83  | < .001 |
| X           | 0.47        | 0.07 | (0.32, 0.61) | 6.30   | < .001 |

2.10.3 Let’s see how that model looks:

Show R code
iris_plot1 = iris |> 
  ggplot(
    aes(
      x = X, 
      y = Sepal.Length)
  ) +
  geom_point(alpha = .1) +
  geom_abline(
    intercept = coef(iris_lm1)[1], 
    slope = coef(iris_lm1)[2]) +
  theme_bw(base_size = 18)
print(iris_plot1)
Figure 2.20: Model of iris data with numeric coding of Species

We have forced the model to use a straight line for the three estimated means. Maybe not a good idea?
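A minimal sketch makes the problem concrete: the fitted line implies means of roughly \(4.91 + 0.47 X\) at \(X = 1, 2, 3\), which do not match the species means in Table 2.13:

Show R code
# means implied by the numeric-coding model at
# X = 1 (setosa), 2 (virginica), 3 (versicolor):
predict(iris_lm1, newdata = tibble(X = 1:3))

# compare with the sample means from Table 2.13: 5.01, 6.59, and 5.94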

2.10.4 Let’s see what R does with categorical variables by default:

Show R code
iris_lm2 = lm(Sepal.Length ~ Species, data = iris)
iris_lm2 |> parameters() |> print_md()
Table 2.16: Model of iris data with Species as a categorical variable
| Parameter            | Coefficient | SE   | 95% CI       | t(147) | p      |
|----------------------|-------------|------|--------------|--------|--------|
| (Intercept)          | 5.01        | 0.07 | (4.86, 5.15) | 68.76  | < .001 |
| Species (versicolor) | 0.93        | 0.10 | (0.73, 1.13) | 9.03   | < .001 |
| Species (virginica)  | 1.58        | 0.10 | (1.38, 1.79) | 15.37  | < .001 |

2.10.5 Re-parametrize with no intercept

If you don’t want the default and offset option, you can use “-1” like we’ve seen previously:

Show R code
iris.lm2b = lm(Sepal.Length ~ Species - 1, data = iris)
iris.lm2b |> parameters() |> print_md()
| Parameter            | Coefficient | SE   | 95% CI       | t(147) | p      |
|----------------------|-------------|------|--------------|--------|--------|
| Species (setosa)     | 5.01        | 0.07 | (4.86, 5.15) | 68.76  | < .001 |
| Species (versicolor) | 5.94        | 0.07 | (5.79, 6.08) | 81.54  | < .001 |
| Species (virginica)  | 6.59        | 0.07 | (6.44, 6.73) | 90.49  | < .001 |
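A quick check (a minimal sketch using dplyr, which is already loaded) confirms that these coefficients are just the species-specific sample means, and that the corner-point coefficients in Table 2.16 are differences from the setosa mean (5.01 + 0.93 = 5.94; 5.01 + 1.58 = 6.59):

Show R code
iris |> 
  group_by(Species) |> 
  summarize(mean_sepal_length = mean(Sepal.Length))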

2.10.6 Let’s see what these new models look like:

Show R code
iris_plot2 = 
  iris |> 
  mutate(
    predlm2 = predict(iris_lm2)) |> 
  arrange(X) |> 
  ggplot(aes(x = X, y = Sepal.Length)) +
  geom_point(alpha = .1) +
  geom_line(aes(y = predlm2), col = "red") +
  geom_abline(
    intercept = coef(iris_lm1)[1], 
    slope = coef(iris_lm1)[2]) + 
  theme_bw(base_size = 18)

print(iris_plot2)
Figure 2.21: Model of iris data with Species as a categorical variable (red line through the three species means), compared with the numeric-coding model (black line)

2.10.7 Let’s see how R did that:

Show R code
formula(iris_lm2)
#> Sepal.Length ~ Species
model.matrix(iris_lm2) |> as_tibble() |> unique()
| (Intercept) | Speciesversicolor | Speciesvirginica |
|-------------|-------------------|------------------|
| 1           | 0                 | 0                |
| 1           | 1                 | 0                |
| 1           | 0                 | 1                |

This is called a "corner point parametrization": the intercept estimates the mean outcome in the reference category (here, setosa), and each Species coefficient estimates the difference in mean outcome between that species and the reference category.

Show R code
formula(iris.lm2b)
#> Sepal.Length ~ Species - 1
model.matrix(iris.lm2b) |> as_tibble() |> unique()
| Speciessetosa | Speciesversicolor | Speciesvirginica |
|---------------|-------------------|------------------|
| 1             | 0                 | 0                |
| 0             | 1                 | 0                |
| 0             | 0                 | 1                |

This can be called a "group point" (or cell-means) parametrization: each coefficient directly estimates the mean outcome for the corresponding group.

There are more options; see Dobson and Barnett (2018) §6.4.1 and the codingMatrices package vignette (Venables (2023)).
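For example, here is a minimal sketch of one alternative, the sum-to-zero ("effect coding") parametrization from stats::contr.sum (this example is not in Dobson and Barnett):

Show R code
iris_lm2c = lm(
  Sepal.Length ~ Species,
  data = iris,
  contrasts = list(Species = contr.sum))

model.matrix(iris_lm2c) |> as_tibble() |> unique()

# with contr.sum, the intercept is the average of the three group means,
# and the two Species coefficients are the deviations of setosa and
# versicolor from that average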

2.11 Ordinal covariates

(c.f. Dobson and Barnett (2018) §2.4.4)


We can create ordinal variables in R using the ordered() function4.
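For instance (a toy sketch with made-up data):

Show R code
x = c("low", "high", "medium", "low")
x_ord = ordered(x, levels = c("low", "medium", "high"))
x_ord # prints with ordered levels: low < medium < high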

Example 2.3  

Show R code
url = paste0(
  "https://regression.ucsf.edu/sites/g/files/tkssra6706/",
  "f/wysiwyg/home/data/hersdata.dta")
library(haven)
hers = read_dta(url)
Show R code
hers |> head()
Table 2.17: HERS dataset
| HT | age | raceth | nonwhite | smoking | drinkany | exercise | physact | globrat | poorfair | medcond | htnmeds | statins | diabetes | dmpills | insulin | weight | BMI | waist | WHR | glucose | weight1 | BMI1 | waist1 | WHR1 | glucose1 | tchol | LDL | HDL | TG | tchol1 | LDL1 | HDL1 | TG1 | SBP | DBP | age10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70 | 2 | 1 | 0 | 0 | 0 | 5 | 3 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 73.8 | 23.69 | 96.0 | 0.932 | 84 | 73.6 | 23.63 | 93.0 | 0.912 | 94 | 189 | 122.4 | 52 | 73 | 201 | 137.6 | 48 | 77 | 138 | 78 | 7.0 |
| 0 | 62 | 2 | 1 | 0 | 0 | 0 | 1 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 70.9 | 28.62 | 93.0 | 0.964 | 111 | 73.4 | 28.89 | 95.0 | 0.964 | 78 | 307 | 241.6 | 44 | 107 | 216 | 150.6 | 48 | 87 | 118 | 70 | 6.2 |
| 1 | 69 | 1 | 0 | 0 | 0 | 0 | 3 | 3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 102.0 | 42.51 | 110.2 | 0.782 | 114 | 96.1 | 40.73 | 103.0 | 0.774 | 98 | 254 | 166.2 | 57 | 154 | 254 | 156.0 | 66 | 160 | 134 | 78 | 6.9 |
| 0 | 64 | 1 | 0 | 1 | 1 | 0 | 1 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 64.4 | 24.39 | 87.0 | 0.877 | 94 | 58.6 | 22.52 | 77.0 | 0.802 | 93 | 204 | 116.2 | 56 | 159 | 207 | 122.6 | 57 | 137 | 152 | 72 | 6.4 |
| 0 | 65 | 1 | 0 | 0 | 0 | 0 | 2 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 57.9 | 21.90 | 77.0 | 0.794 | 101 | 58.9 | 22.28 | 76.5 | 0.757 | 92 | 214 | 150.6 | 42 | 107 | 235 | 172.2 | 35 | 139 | 175 | 95 | 6.5 |
| 1 | 68 | 2 | 1 | 0 | 1 | 0 | 3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 60.9 | 29.05 | 96.0 | 1.000 | 116 | 57.7 | 27.52 | 86.0 | 0.910 | 115 | 212 | 137.8 | 52 | 111 | 202 | 126.6 | 53 | 112 | 174 | 98 | 6.8 |

Show R code

# C(contr = codingMatrices::contr.diff)
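As a hedged continuation of Example 2.3 (treating the 1-5 physact score as ordinal is an assumption about the codebook, and the commented-out model below is illustrative only):

Show R code
hers = hers |> 
  mutate(physact_ord = ordered(as.numeric(physact)))

# ordered factors get orthogonal polynomial contrasts (.L, .Q, .C, ...) by default:
contrasts(hers$physact_ord)

# the codingMatrices::contr.diff option mentioned above would instead
# parametrize successive differences between adjacent levels, for example:
# lm(glucose ~ C(physact_ord, contr = codingMatrices::contr.diff), data = hers)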

  1. \(M\) is implicitly a deterministic function of \(S\)↩︎

  2. \(F\) is implicitly a deterministic function of \(S\)↩︎

  3. using the definite article “the” would mean there is only one slope.↩︎

  4. or equivalently, factor(ordered = TRUE)↩︎