Functions from these packages will be used throughout this document:
[R code]
library(conflicted) # check for conflicting function definitions
# library(printr) # inserts help-file output into markdown output
library(rmarkdown) # convert R Markdown documents into a variety of formats
library(pander) # format tables for markdown
library(ggplot2) # graphics
library(ggfortify) # help with graphics
library(dplyr) # manipulate data
library(tibble) # `tibble`s extend `data.frame`s
library(magrittr) # `%>%` and other additional piping tools
library(haven) # import Stata files
library(knitr) # format R output for markdown
library(tidyr) # tools to help create tidy data
library(plotly) # interactive graphics
library(dobson) # datasets from Dobson and Barnett 2018
library(parameters) # format model output tables for markdown
library(latex2exp) # use LaTeX in R code (for figures and tables)
library(fs) # filesystem path manipulations
library(survival) # survival analysis
library(survminer) # survival analysis graphics
library(KMsurv) # datasets from Klein and Moeschberger
library(webshot2) # convert interactive content to static for pdf
library(forcats) # functions for categorical variables ("factors")
library(stringr) # functions for dealing with strings
library(lubridate) # functions for dealing with dates and times
Here are some R settings I use in this document:
[R code]
rm(list = ls()) # delete any data that's already loaded into R
conflicts_prefer(dplyr::filter) # use the `filter()` function from dplyr by default
ggplot2::theme_set(
  ggplot2::theme_bw() +
    # ggplot2::labs(col = "") +
    ggplot2::theme(
      legend.position = "bottom",
      text = ggplot2::element_text(size = 12, family = "serif")
    )
)
knitr::opts_chunk$set(message = FALSE)
options('digits' = 6)
panderOptions("big.mark", ",")
pander::panderOptions("table.emphasize.rownames", FALSE)
pander::panderOptions("table.split.table", Inf)
legend_text_size = 9
run_graphs = TRUE
This course is about generalized linear models (for non-Gaussian outcomes)
UC Davis STA 108 (“Applied Statistical Methods: Regression Analysis”) is a prerequisite for this course, so everyone here should have some understanding of linear regression already.
We will review linear regression to:
make sure everyone is caught up
provide an epidemiological perspective on model interpretation.
1.2 Chapter overview
Section 2: how to interpret linear regression models
Section 3: how to estimate linear regression models
Section 4: how to quantify uncertainty about our estimates
Section 6.3: how to tell if your model is insufficiently complex
2 Understanding Gaussian Linear Regression Models
2.1 Motivating example: birthweights and gestational age
Suppose we want to learn about the distributions of birthweights (outcome \(Y\)) for (human) babies born at different gestational ages (covariate \(A\)) and with different chromosomal sexes (covariate \(S\)) (Dobson and Barnett (2018), Example 2.2.2).
pred_female <- coef(bw_lm1)["(Intercept)"] + coef(bw_lm1)["age"] * 36

## or using built-in prediction:
pred_female_alt <- predict(bw_lm1, newdata = tibble(sex = "female", age = 36))
Exercise 10 What is the interpretation of \(\beta_{M}\) in Model 3?
Solution.
Mean birthweight among males with gestational age 0 weeks: \[
\begin{aligned}
\mu(1,0) &= \text{E}{\left[Y|M = 1,A = 0\right]}\\
&= \beta_0 + {\color{red}\beta_M} \cdot 1 + \beta_A \cdot 0 + \beta_{AM}\cdot 1 \cdot 0\\
&= \beta_0 + {\color{red}\beta_M}
\end{aligned}
\] Mean birthweight among females with gestational age 0 weeks: \[
\begin{aligned}
\mu(0,0) &= \text{E}{\left[Y|M = 0,A = 0\right]}\\
&= \beta_0 + {\color{red}\beta_M} \cdot 0 + \beta_A \cdot 0 + \beta_{AM}\cdot 0 \cdot 0\\
&= \beta_0
\end{aligned}
\]
\[
\begin{aligned}
\beta_{M} &= \mu(1,0) - \mu(0,0) \\
&= \text{E}{\left[Y|M = 1,A = 0\right]} - \text{E}{\left[Y|M = 0,A = 0\right]}
\end{aligned}
\]
\(\beta_M\) is the difference in mean birthweight between males and females at gestational age 0 weeks.
Exercise 11 What is the interpretation of \(\beta_{AM}\) in Model 3?
Let the coefficients of this model be \(\gamma\)s instead of \(\beta\)s.
What are the interpretations of the \(\gamma\)s? How do they relate to the \(\beta\)s in Model 3? Which have the same interpretation? Which are different, and how do they differ? What is the pattern?
Solution. Interpretation of \(\gamma_0\):
From Model 6, \(\gamma_0\) is the mean birthweight among females (\(M = 0\)) with \(A^* = 0\) (i.e., \(A = 32\) weeks): \[\gamma_0 = \text{E}{\left[Y|M = 0, A = 32\right]} = \beta_0 + 32\,\beta_A\]
Since shifting \(A\) by a constant does not change the slope, \(\gamma_{A^*} = \beta_A\): these two coefficients have the same value and interpretation.
Interpretation of \(\gamma_{A^*M}\):
From Model 6, \(\gamma_{A^*M}\) is the difference in slope with respect to \(A^*\) between males and females:
Since shifting \(A\) by a constant does not change slopes, \(\gamma_{A^*M} = \beta_{AM}\): these two coefficients have the same value and interpretation.
The pattern:
Slope coefficients (\(\gamma_{A^*}\) and \(\gamma_{A^*M}\)) are unchanged by shifting: they have the same values and interpretations as the corresponding \(\beta\)s.
Coefficients change only for terms that interact with the shifted variable \(A\). This includes the intercept (which can be viewed as the main effect of a variable that interacts with \(A\) via \(\beta_A\)) and the main effect of \(M\) (which interacts with \(A\) via \(\beta_{AM}\)). Shifting \(A\) by 32 weeks changes the reference point from \(A = 0\) to \(A = 32\), so these coefficients now represent quantities evaluated at \(A = 32\) weeks rather than at \(A = 0\) weeks.
Exercise 13 Using R, fit the recentered interaction model with \(A^* = A - 36\) weeks in place of \(A\) in Model 3. Compare the coefficient estimates with those from the original model. Which coefficients change, and which remain the same?
Solution.
[R code]
bw <- bw |>
  mutate(
    `age - mean` = age - mean(age),
    `age - 36wks` = age - 36
  )
lm1_c <- lm(weight ~ sex + `age - 36wks`, data = bw)
lm2_c <- lm(weight ~ sex + `age - 36wks` + sex:`age - 36wks`, data = bw)
parameters(lm2_c, ci_method = "wald") |> print_md()
If we want to model Sepal.Length by species, we could create a variable \(X\) that represents “setosa” as \(X=1\), “virginica” as \(X=2\), and “versicolor” as \(X=3\).
[R code]
data(iris) # this step is not always necessary, but ensures you're starting
# from the original version of a dataset stored in a loaded package
iris <- iris |>
  tibble() |>
  mutate(
    X = case_when(
      Species == "setosa" ~ 1,
      Species == "virginica" ~ 2,
      Species == "versicolor" ~ 3
    )
  )
iris |> distinct(Species, X)
Table 13: iris data with numeric coding of species
Then we could fit a model like:
[R code]
iris_lm1 <- lm(Sepal.Length ~ X, data = iris)
iris_lm1 |> parameters() |> print_md()
Table 14: Model of iris data with numeric coding of Species
The Hessian of the log-likelihood with respect to \(\beta\), \(\ell''_{\beta}(\beta, \sigma^2;\mathbf X,\tilde{y})\), is negative definite at \(\beta = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\tilde{y}\), so \(\hat \beta_{ML} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\tilde{y}\) is the MLE for \(\beta\).
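As a quick check of this claim, the score equation for \(\beta\) (a standard derivation, restated here for reference) leads to the same estimate:
\[
\ell'_{\beta}(\beta, \sigma^2; \mathbf{X}, \tilde{y}) = \frac{1}{\sigma^2}\mathbf{X}'(\tilde{y} - \mathbf{X}\beta) = 0
\quad\implies\quad
\mathbf{X}'\mathbf{X}\beta = \mathbf{X}'\tilde{y}
\quad\implies\quad
\hat\beta = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\tilde{y}
\]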
Table 21: Estimated model for birthweight data with interaction term
| Parameter        | Coefficient | SE      | 95% CI              | t(20) | p      |
|------------------|-------------|---------|---------------------|-------|--------|
| (Intercept)      | -2141.67    | 1163.60 | (-4568.90, 285.56)  | -1.84 | 0.081  |
| sex (male)       | 872.99      | 1611.33 | (-2488.18, 4234.17) | 0.54  | 0.594  |
| age              | 130.40      | 30.00   | (67.82, 192.98)     | 4.35  | < .001 |
| sex (male) × age | -18.42      | 41.76   | (-105.52, 68.68)    | -0.44 | 0.664  |
Wald tests and confidence intervals
For Gaussian linear regression, the ordinary least squares (OLS) estimates \(\hat\beta_k\) are exactly Gaussian when the error terms \(\epsilon_i\) are Gaussian, for any sample size. See also the table of Gaussian vs. MLE-based tests.
Wald test statistic
To test \(H_0: \beta_k = \beta_{k,0}\) (typically \(\beta_{k,0} = 0\)):
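The statistic takes the standard Wald form:
\[
t = \frac{\hat\beta_k - \beta_{k,0}}{\widehat{\text{SE}}{\left(\hat\beta_k\right)}}
\]
For Gaussian linear models, \(t\) has an exact \(t_{n-p}\) distribution under \(H_0\); for other GLMs it is approximately standard Gaussian in large samples.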
Exercise 14 Given a maximum likelihood estimate \(\hat{\tilde{\beta}}\) and a corresponding estimated covariance matrix \(\hat \Sigma\stackrel{\text{def}}{=}\widehat{\text{Cov}}(\hat{\tilde{\beta}})\), calculate a 95% confidence interval for the predicted mean \(\mu(\tilde{x}) = \text{E}{\left[Y|\tilde{X}=\tilde{x}\right]}\).
Combining Equation 13 and the Gauss-Markov theorem:
Theorem 1 (Estimated variance and standard error of predicted mean)\[
{\color{orange}\widehat{\text{Var}}{\left(\hat\mu(\tilde{x})\right)}} = {\color{orange}{\tilde{x}}^{\top}\hat{\Sigma}\tilde{x}}
\tag{14}\]
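Combining the standard error from Theorem 1 with the usual \(t\) quantile gives the 95% confidence interval asked for in Exercise 14 (stated here in its standard form):
\[
\hat\mu(\tilde{x}) \pm t_{n-p,\, 0.975} \cdot \sqrt{{\tilde{x}}^{\top}\hat{\Sigma}\tilde{x}}
\]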
In R, predict() with se.fit = TRUE computes the estimated mean \(\hat\mu(\tilde{x}) = \tilde{x}'\hat{\tilde{\beta}}\) and its estimated standard error for each covariate pattern:
Table 24: Predicted means and 95% CI using interval = 'confidence'
| fit     | lwr     | upr     | age | sex  |
|---------|---------|---------|-----|------|
| 2762.71 | 2584.34 | 2941.07 | 36  | male |
| 2986.67 | 2876.05 | 3097.29 | 38  | male |
| 3210.64 | 3062.23 | 3359.05 | 40  | male |
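The standard errors behind intervals like these come from the quadratic form in Theorem 1. Here is a minimal sketch verifying that `predict(..., se.fit = TRUE)` matches the formula, using the built-in `mtcars` data in place of the course's birthweight data (the model and covariate values below are illustrative only):

```r
# toy model on a built-in dataset (not the course's bw data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# pick a covariate pattern x-tilde: intercept, wt = 3, hp = 120
newdat <- data.frame(wt = 3, hp = 120)
x <- c(1, 3, 120)

# standard error from predict()
pr <- predict(fit, newdata = newdat, se.fit = TRUE)

# standard error from the quadratic form x' Sigma-hat x
Sigma_hat <- vcov(fit) # estimated covariance matrix of the coefficients
se_manual <- sqrt(as.numeric(t(x) %*% Sigma_hat %*% x))

all.equal(se_manual, as.numeric(pr$se.fit)) # the two agree
```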
4.4 Inference for differences in means
Exercise 16 Given a maximum likelihood estimate \(\hat{\tilde{\beta}}\) and a corresponding estimated covariance matrix \(\hat \Sigma\stackrel{\text{def}}{=}\widehat{\text{Cov}}(\hat{\tilde{\beta}})\), calculate a 95% confidence interval for the difference in means comparing covariate patterns \(\tilde{x}\) and \({\tilde{x}^*}\), \(\mu(\tilde{x}) - \mu({\tilde{x}^*})\).
Combining Equation 18 and the Gauss-Markov theorem:
Theorem 2 (Estimated variance and standard error of difference in means)\[
{\color{orange}\widehat{\text{Var}}{\left(\widehat{\mu(\tilde{x}) - \mu({\tilde{x}^*})}\right)}} = {\color{orange}{\Delta\tilde{x}}^{\top}\hat{\Sigma}(\Delta\tilde{x})}
\tag{19}\]
In R, we can compute the CI for the difference in means by computing the predicted means for both covariate patterns and applying the variance formula directly:
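Here is a sketch of that by-hand computation, using a toy model on the built-in `mtcars` data (the two covariate patterns below are made up for illustration):

```r
# toy model standing in for the course's birthweight models
fit <- lm(mpg ~ wt + hp, data = mtcars)

# two covariate patterns: x-tilde and x-tilde-star (intercept, wt, hp)
x1 <- c(1, 3, 120)
x0 <- c(1, 2, 100)
dx <- x1 - x0 # the contrast vector, Delta x-tilde

# estimated difference in means and its standard error
diff_hat <- sum(dx * coef(fit))
se_diff <- sqrt(as.numeric(t(dx) %*% vcov(fit) %*% dx))

# 95% Wald CI using the t distribution with n - p degrees of freedom
t_crit <- qt(0.975, df = df.residual(fit))
ci <- diff_hat + c(-1, 1) * t_crit * se_diff
ci
```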
In practice you don’t have to do this by hand; there are functions to do it for you:
[R code]
# built in
library(lmtest)
lrtest(bw_lm2, bw_lm1)
5 Prediction
5.1 Prediction for linear models
Definition 1 (Predicted value) In a regression model \(\text{p}(y|\tilde{x})\), the predicted value of \(y\) given \(\tilde{x}\) is the estimated mean of \(Y\) given \(\tilde{X}=\tilde{x}\):
\[\hat y \stackrel{\text{def}}{=}\hat{\text{E}}{\left[Y|\tilde{X}=\tilde{x}\right]}\]
For linear models, the predicted value can be straightforwardly calculated by multiplying each predictor value \(x_j\) by its corresponding coefficient \(\beta_j\) and adding up the results:
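In symbols, for a covariate vector \(\tilde{x} = (1, x_1, \ldots, x_p)\):
\[
\hat y = \tilde{x}^{\top}\hat\beta = \hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_p x_p
\]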
These special predictions are called the fitted values of the dataset:
Definition 2 For a given dataset \((\tilde{Y}, \mathbf{X})\) and corresponding fitted model \(\text{p}_{\hat \beta}(\tilde{y}|\mathbf{x})\), the fitted value of \(y_i\) is the predicted value of \(y\) when \(\tilde{X}=\tilde{x}_i\) using the estimate parameters \(\hat \beta\).
bw_lm2 |> predict(interval = "predict") |> head()
#> Warning in predict.lm(bw_lm2, interval = "predict"): predictions on current data refer to _future_ responses
#>       fit     lwr     upr
#> 1 2552.73 2124.50 2980.97
#> 2 2552.73 2124.50 2980.97
#> 3 2683.13 2275.99 3090.27
#> 4 2813.53 2418.60 3208.47
#> 5 2813.53 2418.60 3208.47
#> 6 2943.93 2551.48 3336.38
The warning from the last command is: “predictions on current data refer to future responses” (since you already know what happened to the current data, and thus don’t need to predict it).
where \(\ell(\hat\theta)\) is the log-likelihood evaluated at the maximum-likelihood estimates \(\hat\theta\), \(p\) is the number of estimated parameters (including \(\hat\sigma^2\) for Gaussian models), and \(n\) is the number of observations.
Definition 4 (Bayesian Information Criterion (BIC))\[\text{BIC} = -2 \ell(\hat\theta) + p \log(n)\]
where \(\ell(\hat\theta)\), \(p\), and \(n\) are defined as in Definition 3.
Conceptual basis
Each criterion has two components:
Fit term (\(-2\ell(\hat\theta)\)): measures lack of fit — lower is better. A model with more parameters will always achieve a higher (or equal) log-likelihood on the observed data.
Penalty term (\(2 p\) for AIC; \(p \log(n)\) for BIC): penalizes model complexity to guard against overfitting.
Together, they balance goodness of fit against parsimony.
AIC vs. BIC
Criterion
Penalty per parameter
Tends to select
AIC
\(2\)
larger models
BIC
\(\log(n)\)
smaller models
The BIC penalty exceeds the AIC penalty whenever \(\log(n) > 2\), i.e., when \(n > e^2 \approx 7.4\). In practice, BIC almost always penalizes additional parameters more heavily than AIC and therefore tends to select simpler (more parsimonious) models (Vittinghoff et al. 2012; Kleinbaum et al. 2014).
AIC in R
[R code]
-2 * logLik(bw_lm2) |> as.numeric() + 2 * (length(coef(bw_lm2)) + 1) # sigma counts as a parameter here
#> [1] 323.159
AIC(bw_lm2)
#> [1] 323.159
Lower values are better. There are no hypothesis tests or p-values associated with these criteria.
To compare models, calculate the criterion for each model and choose the model with the smallest value.
Differences of less than 2 units are generally considered negligible; differences greater than 10 are considered strong evidence in favor of the lower-criterion model.
BIC tends to favor simpler models than AIC, especially when the sample size is large.
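A minimal sketch of Definition 4, checking the by-hand BIC against R's built-in `BIC()` on a toy model (using `mtcars` in place of the course data):

```r
fit <- lm(mpg ~ wt, data = mtcars)

p <- length(coef(fit)) + 1 # +1: sigma^2 counts as an estimated parameter
n <- nobs(fit)

manual_bic <- -2 * as.numeric(logLik(fit)) + p * log(n)
all.equal(manual_bic, BIC(fit)) # TRUE
```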
(Residual) Deviance
Let \(q\) be the number of distinct covariate combinations in a data set.
For example, in the birthweight data, there are \(q = 12\) unique patterns (Table 26).
[R code]
bw_X_unique
Table 26: Unique covariate combinations in the birthweight data, with replicate counts
Definition 5 (Replicates) If a given covariate pattern has more than one observation in a dataset, those observations are called replicates.
Example 3 (Replicates in the birthweight data) In the birthweight dataset, there are 2 replicates of the combination “female, age 36” (Table 26).
Exercise 19 (Replicates in the birthweight data) Which covariate pattern(s) in the birthweight data has the most replicates?
Solution 6 (Replicates in the birthweight data). Two covariate patterns are tied for most replicates: males at age 40 weeks and females at age 40 weeks. 40 weeks is the usual length for human pregnancy (Polin et al. (2011)), so this result makes sense.
[R code]
bw_X_unique |> dplyr::filter(n == max(n))
Saturated models
The most complicated model we could fit would have one parameter (a mean) for each covariate pattern, plus a variance parameter:
We can calculate the log-likelihood of this model as usual:
[R code]
logLik(lm_max)
#> 'log Lik.' -151.402 (df=13)
We can compare this model to our other models using chi-square tests, as usual:
[R code]
lrtest(lm_max, bw_lm2)
The likelihood ratio statistic for this test is \[\lambda = 2 \times (\ell_{\text{full}} - \ell) = 10.355374\] where:
\(\ell_{\text{full}}\) is the log-likelihood of the full model: -151.401601
\(\ell\) is the log-likelihood of our comparison model (two slopes, two intercepts): -156.579288
This statistic is called the deviance or residual deviance for our two-slopes and two-intercepts model; it tells us how much the likelihood of that model deviates from the likelihood of the maximal model.
The corresponding p-value tells us whether we have enough evidence to detect that our two-slopes, two-intercepts model is a worse fit for the data than the maximal model; in other words, it tells us whether there’s evidence that we missed any important patterns. (Remember, a nonsignificant p-value could mean that we didn’t miss anything and a more complicated model is unnecessary, or it could mean we just don’t have enough data to tell the difference between these models.)
Null Deviance
Similarly, the least complicated model we could fit would have only one mean parameter, an intercept:
\[\text E[Y|X=x] = \beta_0\] We can fit this model in R like so:
[R code]
lm0 <- lm(weight ~ 1, data = bw)
lm0 |> parameters() |> print_md()
| Parameter   | Coefficient | SE    | 95% CI             | t(23) | p      |
|-------------|-------------|-------|--------------------|-------|--------|
| (Intercept) | 2967.67     | 57.58 | (2848.56, 3086.77) | 51.54 | < .001 |
[R code]
plot_PIs_and_CIs(bw, lm0)
Figure 15: Null model for birthweight data, with 95% confidence and prediction intervals.
This model also has a likelihood:
[R code]
logLik(lm0)
#> 'log Lik.' -168.955 (df=2)
And we can compare it to more complicated models using a likelihood ratio test:
[R code]
lrtest(bw_lm2, lm0)
The likelihood ratio statistic for the test comparing the null model to the maximal model is \[\lambda = 2 \times (\ell_{\text{full}} - \ell_{0}) = 35.106732\] where:
\(\ell_{\text{0}}\) is the log-likelihood of the null model: -168.954967
\(\ell_{\text{full}}\) is the log-likelihood of the maximal model: -151.401601
In R, this test is:
[R code]
lrtest(lm_max, lm0)
This log-likelihood ratio statistic is called the null deviance. It tells us whether we have enough data to detect a difference between the null and full models.
6.2 Gaussian Deviance vs. GLM Deviance
The residual deviance is a general concept that applies to all GLMs, including Gaussian linear regression. For any GLM, the residual deviance is: \[D \stackrel{\text{def}}{=} 2\left(\ell_{\text{sat}} - \ell(\hat\beta)\right)\]
where \(\ell_{\text{sat}}\) is the log-likelihood of the saturated model and \(\ell(\hat\beta)\) is the log-likelihood of the fitted model. However, this formula simplifies differently depending on the distribution family, and the resulting test statistics have different null distributions.
The saturated model has one free mean parameter per distinct covariate pattern. When there are no replicates (each covariate pattern appears exactly once), it sets \(\hat\mu_i = y_i\) for every observation, so its residual sum of squares is zero: \[\text{RSS}_{\text{sat}} = \sum_{i=1}^n (y_i - \hat\mu_i)^2 = 0\]
Because \(\sigma^2\) is unknown and must be estimated, comparing deviances between two nested Gaussian models uses an F-test, which is exact under Gaussian assumptions. Substituting \(D = \text{RSS}/\hat\sigma^2\), the \(\hat\sigma^2\) factors cancel: \[F = \frac{(\text{RSS}_1 - \text{RSS}_2)/(p_2 - p_1)}{\text{RSS}_2/(n - p_2)} \sim F(p_2 - p_1,\ n - p_2)\]
For Poisson and Binomial models, the dispersion parameter \(\phi = 1\) is fixed rather than estimated. Consequently, the difference in deviances between two nested models follows an approximate \(\chi^2\) distribution:
\[D_1 - D_2 \;\dot \sim\; \chi^2_{p_2 - p_1}\]
This asymptotic result replaces the exact F-test used for Gaussian models.
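A sketch of this chi-squared comparison for a Poisson model, using simulated data (the data and coefficient values below are made up for illustration):

```r
set.seed(1)
d <- data.frame(x = rep(0:1, each = 20))
d$y <- rpois(40, lambda = exp(0.5 + 0.8 * d$x))

# nested Poisson models (dispersion fixed at 1)
m1 <- glm(y ~ 1, family = poisson, data = d)
m2 <- glm(y ~ x, family = poisson, data = d)

# difference in residual deviances, compared to chi-squared with 1 df
dev_diff <- deviance(m1) - deviance(m2)
p_val <- pchisq(dev_diff, df = 1, lower.tail = FALSE)

# anova() performs the same test
anova(m1, m2, test = "Chisq")
```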
Saturated vs. fully parametrized models with replicates
When some covariate patterns appear more than once (i.e., when there are replicates), it is important to distinguish between two special models:
The saturated model has one free mean parameter per distinct covariate pattern (\(q\) parameters, where \(q \leq n\) is the number of unique patterns).
The fully parametrized model has one free mean parameter per observation (\(n\) parameters).
When there are no replicates (\(q = n\)), these two coincide. When there are replicates (\(q < n\)), the saturated model constrains all observations sharing a covariate pattern to have the same mean, but places no constraint on means across different patterns. See Kleinbaum and Klein (2010) for further discussion of this distinction.
Deviance is always measured relative to the saturated model, not the fully parametrized model.
Gaussian deviance with replicates
When covariate pattern \(k\) has \(n_k\) replicates with sample mean \(\bar{y}_k\), the saturated model fits \(\hat\mu_k = \bar{y}_k\) for each pattern \(k\), so its residual sum of squares equals the within-group (pure error) SS: \[\text{SS}_{\text{PE}} = \sum_{k=1}^{q} \sum_{i=1}^{n_k} (y_{ki} - \bar{y}_k)^2\]
This is nonzero whenever any covariate pattern has replicates with different response values. The Gaussian deviance relative to the saturated model is therefore: \[D = \frac{\text{RSS}_{\text{fit}} - \text{SS}_{\text{PE}}}{\hat\sigma^2}\]
R’s deviance() applied to an lm object returns the total fitted-model RSS, not the deviance relative to the saturated model. To compute the deviance relative to the saturated model, subtract the saturated model’s RSS: deviance(lm_fit) - deviance(lm_saturated).
For the birthweight data, we can verify this directly using bw_lm2 (the interaction model) and lm_max (the saturated model):
deviance(bw_lm2) # total RSS of fitted model
#> [1] 652425
deviance(lm_max) # within-group (pure error) SS
#> [1] 423783
deviance(bw_lm2) - deviance(lm_max) # lack-of-fit SS (deviance vs. saturated)
#> [1] 228642
GLM deviance with replicates
For a Binomial GLM fit to data grouped by covariate pattern (with \(y_k\) events and \(n_k\) observations for pattern \(k\)), the saturated model sets \(\hat\pi_k^{\text{sat}} = y_k / n_k\) for each pattern \(k\). R’s deviance() correctly computes \(2(\ell_{\text{sat}} - \ell(\hat\beta))\) using this grouping: \[D = 2\sum_{k=1}^{q}\left[y_k \log{\frac{y_k}{n_k\hat\pi_k}} + (n_k - y_k)\log{\frac{n_k - y_k}{n_k(1-\hat\pi_k)}}\right]\]
When binomial data is ungrouped (individual Bernoulli observations, \(n_i = 1\)), R uses the fully parametrized model as its reference — assigning \(\hat\pi_i = y_i \in \{0, 1\}\) to each individual observation. Since each predicted probability is exactly 0 or 1, every Bernoulli likelihood contribution equals 1, and hence \(\ell_{\text{fp}} = 0\). Thus R’s deviance() returns \(-2\ell(\hat\beta)\).
We can verify this directly: observations are ungrouped Bernoulli with repeated covariate patterns (\(x \in \{0, 1\}\) appears three times each).
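The described setup can be reconstructed as follows (a sketch; the particular 0/1 outcomes are made up, but any ungrouped Bernoulli data with each \(x\) value repeated three times illustrates the point):

```r
# ungrouped Bernoulli data: x in {0, 1}, three observations per pattern
x <- rep(c(0, 1), each = 3)
y <- c(0, 1, 1, 0, 0, 1)

# ungrouped fit: deviance() uses the fully parametrized reference model,
# so it returns -2 * logLik
m_u <- glm(y ~ x, family = binomial)
all.equal(deviance(m_u), -2 * as.numeric(logLik(m_u))) # TRUE

# grouped fit: deviance() uses the saturated (per-pattern) reference;
# with one free parameter per pattern here, the residual deviance is ~0
grouped <- data.frame(x = c(0, 1), succ = c(2, 1), fail = c(1, 2))
m_g <- glm(cbind(succ, fail) ~ x, family = binomial, data = grouped)
deviance(m_g) # approximately 0
```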
This reference model is the fully parametrized model (one parameter per observation), not the saturated model (one parameter per distinct covariate pattern). They coincide only when there are no repeated covariate patterns (\(q = n\)). When patterns do repeat (\(q < n\)), the saturated model sets \(\hat\pi_k = y_k/n_k\) per pattern, giving \(\ell_{\text{sat}} < 0\).
deviance() for ungrouped data cannot be used as a goodness-of-fit test against the \(\chi^2\) distribution when \(q < n\). The correct GOF statistic is \(2(\ell_{\text{sat}} - \ell(\hat\beta))\), but R’s deviance() for ungrouped data returns \(-2\ell(\hat\beta)\) (using \(\ell_{\text{fp}} = 0\)). These two quantities differ by \(-2\ell_{\text{sat}} > 0\) whenever \(q < n\). Even if each covariate pattern has many replicates (large \(n_k\)), R’s ungrouped deviance is the wrong statistic to compare against \(\chi^2(q - p)\). To obtain a valid GOF test when patterns repeat, fit the model using grouped data (one row per pattern with \(y_k\) and \(n_k\)), so that R uses the saturated model as its reference.
Definition 6 (Residual noise/deviation from the population mean) The residual noise in a probabilistic model \(p(Y)\), also known as the residual deviation of an observation from its population mean or residual for short, is the difference between an observed value \(y\) and its population mean: \[\varepsilon(y) \stackrel{\text{def}}{=} y - \text{E}\left[Y\right]\]
Theorem 3 (Residuals in Gaussian models) If \(Y\) has a Gaussian distribution, then \(\varepsilon(Y)\) also has a Gaussian distribution, and vice versa.
Proof. Left to the reader.
Definition 7 (Residuals of a fitted model value) The residual of a fitted value \(\hat y\) (shorthand: “residual”) is its error relative to the observed data: \[
\begin{aligned}
e(\hat y) &\stackrel{\text{def}}{=}\varepsilon{\left(\hat y\right)}
\\&= y - \hat y
\end{aligned}
\]
With enough data and a correct model, the residuals will be approximately Gaussian distributed, with variance \(\sigma^2\), which we can estimate using \(\hat\sigma^2\); that is: \[e_i \ \sim_{\text{iid}}\ N(0, \sigma^2)\]
Hence, with enough data and a correct model, the standardized residuals will be approximately standard Gaussian; that is,
\[
r_i \ \sim_{\text{iid}}\ N(0,1)
\]
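One common standardization (the one R's `rstandard()` uses) also adjusts for each observation's leverage, \(r_i = e_i / (\hat\sigma\sqrt{1 - h_{ii}})\). A sketch checking the by-hand formula against `rstandard()`, on the built-in `mtcars` data as a stand-in for the course data:

```r
fit <- lm(mpg ~ wt, data = mtcars)

e <- resid(fit)
h <- hatvalues(fit) # leverage of each observation
sigma_hat <- summary(fit)$sigma

r_manual <- e / (sigma_hat * sqrt(1 - h))
all.equal(unname(r_manual), unname(rstandard(fit))) # TRUE
```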
Marginal distributions of residuals
To look for problems with our model, we can check whether the residuals \(e_i\) and standardized residuals \(r_i\) look like they have the distributions that they are supposed to have, according to the model.
Figure 25: Marginal distribution of standardized residuals
QQ plot of standardized residuals
[R code]
library(ggfortify) # needed to make ggplot2::autoplot() work for `lm` objects
qqplot_lm2_auto <- bw_lm2 |>
  autoplot(
    which = 2, # options are 1:6; can do multiple at once
    ncol = 1
  ) +
  theme_classic()
print(qqplot_lm2_auto)
resid_vs_fit <- bw |>
  ggplot(
    aes(x = predlm2, y = residlm2, col = sex, shape = sex)
  ) +
  geom_point() +
  theme_classic() +
  geom_hline(yintercept = 0)
print(resid_vs_fit)
Standardized residuals vs fitted
[R code]
bw |>
  ggplot(
    aes(x = predlm2, y = std_resid, col = sex, shape = sex)
  ) +
  geom_point() +
  theme_classic() +
  geom_hline(yintercept = 0)
Standardized residuals vs gestational age
[R code]
bw |>
  ggplot(
    aes(x = age, y = std_resid, col = sex, shape = sex)
  ) +
  geom_point() +
  theme_classic() +
  geom_hline(yintercept = 0)
sqrt(abs(rstandard())) vs fitted
Compare with autoplot(bw_lm2, 3)
[R code]
bw |>
  ggplot(
    aes(x = predlm2, y = sqrt_abs_std_resid, col = sex, shape = sex)
  ) +
  geom_point() +
  theme_classic() +
  geom_hline(yintercept = 0)
6.4 Model selection
(adapted from Dobson and Barnett (2018) §6.3.3; for more information on prediction, see James et al. (2013) and Harrell (2015)).
Mean squared error
We might want to minimize the mean squared error, \(\text E[(y-\hat y)^2]\), for new observations that weren’t in our data set when we fit the model.
Unfortunately, \[\frac{1}{n}\sum_{i=1}^n (y_i-\hat y_i)^2\] gives a biased estimate of \(\text E[(y-\hat y)^2]\) for new data. If we want an unbiased estimate, we will have to be clever.
Cross-validation
[R code]
data("carbohydrate", package = "dobson")
library(cvTools)
full_model <- lm(carbohydrate ~ ., data = carbohydrate)
cv_full <- full_model |>
  cvFit(
    data = carbohydrate, K = 5, R = 10,
    y = carbohydrate$carbohydrate
  )

reduced_model <- full_model |> update(formula = ~ . - age)
cv_reduced <- reduced_model |>
  cvFit(
    data = carbohydrate, K = 5, R = 10,
    y = carbohydrate$carbohydrate
  )
Stepwise selection tends to select too many variables (overfitting)
P-values and confidence intervals are biased after selection
It ignores model uncertainty
Results can be unstable across different samples
Consider using cross-validation, penalized methods (like Lasso), or subject-matter knowledge instead. See Harrell (2015) and Heinze et al. (2018) for more discussion.
[R code]
library(olsrr)
olsrr::ols_step_both_aic(full_model)
#> 
#> 
#> Stepwise Summary 
#> -------------------------------------------------------------------------
#> Step    Variable        AIC        SBC       SBIC        R2      Adj. R2 
#> -------------------------------------------------------------------------
#>  0      Base Model    140.773    142.764    83.068    0.00000    0.00000 
#>  1      protein (+)   137.950    140.937    80.438    0.21427    0.17061 
#>  2      weight (+)    132.981    136.964    77.191    0.44544    0.38020 
#> -------------------------------------------------------------------------
#> 
#> Final Model Output 
#> ------------------
#> 
#> Model Summary 
#> ---------------------------------------------------------------
#> R                  0.667       RMSE          5.505 
#> R-Squared          0.445       MSE          30.301 
#> Adj. R-Squared     0.380       Coef. Var    15.879 
#> Pred R-Squared     0.236       AIC         132.981 
#> MAE                4.593       SBC         136.964 
#> ---------------------------------------------------------------
#> RMSE: Root Mean Square Error 
#> MSE: Mean Square Error 
#> MAE: Mean Absolute Error 
#> AIC: Akaike Information Criteria 
#> SBC: Schwarz Bayesian Criteria 
#> 
#> ANOVA 
#> -------------------------------------------------------------------
#>                  Sum of                                            
#>                 Squares    DF    Mean Square      F        Sig. 
#> -------------------------------------------------------------------
#> Regression      486.778     2        243.389    6.827     0.0067 
#> Residual        606.022    17         35.648                     
#> Total          1092.800    19                                    
#> -------------------------------------------------------------------
#> 
#> Parameter Estimates 
#> ----------------------------------------------------------------------------------------
#>       model      Beta    Std. Error    Std. Beta       t      Sig      lower     upper 
#> ----------------------------------------------------------------------------------------
#> (Intercept)    33.130        12.572                 2.635    0.017     6.607    59.654 
#>     protein     1.824         0.623        0.534    2.927    0.009     0.509     3.139 
#>      weight    -0.222         0.083       -0.486   -2.662    0.016    -0.397    -0.046 
#> ----------------------------------------------------------------------------------------
coef(cvfit, s = "lambda.1se")
#> 4 x 1 sparse Matrix of class "dgCMatrix"
#>              lambda.1se
#> (Intercept)  34.3091615
#> age           .        
#> weight       -0.0800142
#> protein       0.7640510
Akaike, Hirotugu. 1974. “A New Look at the Statistical Model Identification.” IEEE Transactions on Automatic Control 19 (6): 716–23. https://doi.org/10.1109/TAC.1974.1100705.
Anderson, Edgar. 1935. “The Irises of the Gaspe Peninsula.” Bulletin of American Iris Society 59: 2–5.
Harrell, Frank E. 2015. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. 2nd ed. Springer. https://doi.org/10.1007/978-3-319-19425-7.
Heinze, Georg, Christine Wallisch, and Daniela Dunkler. 2018. “Variable Selection – A Review and Recommendations for the Practicing Statistician.” Biometrical Journal 60 (3): 431–49. https://doi.org/10.1002/bimj.201700067.
Hogg, Robert V., Elliot A. Tanis, and Dale L. Zimmerman. 2015. Probability and Statistical Inference. 9th ed. Pearson.
Hulley, Stephen, Deborah Grady, Trudy Bush, et al. 1998. “Randomized Trial of Estrogen Plus Progestin for Secondary Prevention of Coronary Heart Disease in Postmenopausal Women.” JAMA: The Journal of the American Medical Association 280 (7): 605–13.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer. https://www.statlearning.com/.
Vittinghoff, Eric, David V. Glidden, Stephen C. Shiboski, and Charles E. McCulloch. 2012. Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models. 2nd ed. Springer. https://doi.org/10.1007/978-1-4614-1353-0.
Weisberg, Sanford. 2005. Applied Linear Regression. Vol. 528. John Wiley & Sons.