rm(list = ls()) # delete any data that's already loaded into R
library(conflicted)
conflicts_prefer(dplyr::filter) # use the `filter()` function from dplyr by default
ggplot2::theme_set(
  ggplot2::theme_bw() +
    # ggplot2::labs(col = "") +
    ggplot2::theme(
      legend.position = "bottom",
      text = ggplot2::element_text(size = 12, family = "serif")
    )
)
knitr::opts_chunk$set(message = FALSE)
options(digits = 6)
pander::panderOptions("big.mark", ",")
pander::panderOptions("table.emphasize.rownames", FALSE)
pander::panderOptions("table.split.table", Inf)
legend_text_size <- 9
run_graphs <- TRUE
1.1 Welcome
Welcome to Epidemiology 204: Quantitative Epidemiology III (Statistical Models).
Epi 204 is a course on regression modeling.
1.2 What you should already know
Warning
Epi 202, Epi 203, and Sta 108 are prerequisites for this course. If you haven’t passed one of these courses, talk to me ASAP.
1.2.1 Epi 202: probability models
Probability distributions
binomial
Poisson
Gaussian
exponential
Characteristics of probability distributions
Mean, median, mode, quantiles
Variance, standard deviation, overdispersion
Characteristics of samples
independence, dependence, covariance, correlation
ranks, order statistics
identical vs nonidentical distribution (homogeneity vs heterogeneity)
Laws of Large Numbers
Central Limit Theorem for the mean of an iid sample
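The Central Limit Theorem can be illustrated with a quick simulation (a minimal sketch using simulated data, not an example from the course): means of iid Exponential(1) samples are approximately Gaussian with mean 1 and standard deviation \(1/\sqrt{n}\).

```r
# CLT sketch: the distribution of sample means of an iid Exponential(1) sample
# is approximately Gaussian with mean 1 and standard deviation 1/sqrt(n)
set.seed(42)
n <- 50                                        # sample size per replicate
means <- replicate(5000, mean(rexp(n, rate = 1)))
c(mean_of_means = mean(means),
  sd_of_means = sd(means),
  theoretical_sd = 1 / sqrt(n))
```

The empirical standard deviation of the 5,000 sample means should be close to the theoretical value \(1/\sqrt{50} \approx 0.141\).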
1.2.2 Epi 203: inference for one or several homogeneous populations
the maximum likelihood inference framework:
likelihood functions
log-likelihood functions
score functions
estimating equations
information matrices
point estimates
standard errors
confidence intervals
hypothesis tests
p-values
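The pieces of the maximum-likelihood framework listed above fit together concretely; as a minimal sketch (with simulated data, not an example from the course), for an iid Poisson sample the score equation yields the sample mean as the MLE, and the Fisher information gives its standard error:

```r
# Maximum-likelihood inference for a Poisson rate, by hand:
# the score equation sum(x)/lambda - n = 0 gives lambda_hat = mean(x),
# and the Fisher information n/lambda gives the standard error
set.seed(1)
x <- rpois(200, lambda = 4)   # simulated iid Poisson(4) sample
n <- length(x)
lambda_hat <- mean(x)                         # point estimate (MLE)
se_hat <- sqrt(lambda_hat / n)                # standard error
ci <- lambda_hat + c(-1, 1) * qnorm(0.975) * se_hat  # Wald 95% CI
c(estimate = lambda_hat, se = se_hat, lower = ci[1], upper = ci[2])
```

A Wald hypothesis test of \(H_0\colon \lambda = \lambda_0\) then uses the statistic \((\hat\lambda - \lambda_0)/\widehat{\text{se}}\), compared to a standard normal distribution.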
Hypothesis tests for one, two, and >2 groups:
t-tests/ANOVA for Gaussian models
chi-square tests for binomial and Poisson models
nonparametric tests:
Wilcoxon signed-rank test for matched pairs
Mann–Whitney/Kruskal-Wallis rank sum test for \(\geq 2\) independent samples
Fisher’s exact test for contingency tables
Cochran–Mantel–Haenszel-Cox log-rank test
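Several of the tests above are available as one-line calls in base R; here is a minimal sketch applied to simulated data and a hypothetical 2×2 table (neither comes from the course notes):

```r
# Two-group comparisons from Epi 203, applied to simulated data
set.seed(7)
g1 <- rnorm(30, mean = 0)
g2 <- rnorm(30, mean = 1)
t_res <- t.test(g1, g2)        # t-test (Gaussian model)
w_res <- wilcox.test(g1, g2)   # Mann-Whitney rank sum test

# Fisher's exact test on a hypothetical 2x2 contingency table
tab <- matrix(c(20, 10, 5, 25), nrow = 2)
f_res <- fisher.test(tab)
c(t_p = t_res$p.value, wilcox_p = w_res$p.value, fisher_p = f_res$p.value)
```

Each call returns an `htest` object containing the test statistic, p-value, and (where applicable) a confidence interval.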
For all of the quantities above, and especially for confidence intervals and p-values, you should know both how to compute them and how to interpret them.
ggpenguins <- palmerpenguins::penguins |>
  ggplot(aes(
    x = bill_length_mm,
    y = body_mass_g,
    color = species
  )) +
  geom_point() +
  stat_smooth(
    method = "lm",
    formula = y ~ x,
    geom = "smooth"
  ) +
  xlab("Bill length (mm)") +
  ylab("Body mass (g)")
ggpenguins |> print()
Figure 1.8: Mortality rates of adult flour beetles after five hours’ exposure to gaseous carbon disulphide (Bliss 1935)
1.4.9 Logistic regression
Show R code
glm1 <- beetles |>
  glm(formula = cbind(died, survived) ~ dose, family = "binomial")
f <- function(x) {
  glm1 |> predict(newdata = data.frame(dose = x), type = "response")
}
plot4 <- plot3 +
  geom_function(fun = f, aes(col = "Logistic regression"))
print(plot4)
Figure 1.9: Mortality rates of adult flour beetles after five hours’ exposure to gaseous carbon disulphide (Bliss 1935)
1.5 Structure of regression models
Exercise 1.2 What is a regression model?
Definition 1.1 (Regression model) Regression models are conditional probability distribution models:
\[\text{P}(Y|\tilde{X})\]
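As a concrete instance (a standard formulation, not one introduced yet in these notes), simple linear regression specifies this conditional distribution as Gaussian, with mean linear in the predictor and constant variance:

```latex
% Simple linear regression written as a conditional probability model:
% Y given X = x is Gaussian with mean linear in x and constant variance
\[
Y \mid X = x \;\sim\; \mathrm{N}\!\left(\beta_0 + \beta_1 x,\ \sigma^2\right)
\]
```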
Exercise 1.3 What are some of the names used for the variables in a regression model \(\text{P}(Y|\tilde{X})\)?
Definition 1.2 (Outcome) The outcome variable in a regression model is the variable whose distribution is being described; in other words, the variable on the left-hand side of the “|” (“pipe”) symbol.
The outcome variable is also called the response variable, regressand, predicted variable, explained variable, experimental variable, output variable, dependent variable, endogenous variable, target, or label, and is typically denoted \(Y\).
Definition 1.3 (Predictors) The predictor variables in a regression model are the conditioning variables defining subpopulations among which the outcome distribution might vary.
Predictors are also called regressors, covariates, independent variables, explanatory variables, risk factors, exposure variables, input variables, exogenous variables, candidate variables (Dunn and Smyth 2018), carriers (Dunn and Smyth 2018), manipulated variables, or features, and are typically denoted \(\tilde{X}\).1
Table 1.1: Common pairings of terms for variables \(\tilde{X}\) and \(Y\) in regression models \(P(Y|\tilde{X})\)2
| \(\tilde{X}\) | \(Y\) | usual context |
|---|---|---|
| input | output | |
| independent | dependent | |
| predictor | predicted or response | |
| explanatory | explained | |
| exogenous | endogenous | econometrics |
| manipulated | measured | randomized controlled experiments |
| exposure | outcome | epidemiology |
| feature | label or target | machine learning |
Exercise 1.4 What is the general structure of a generalized linear model?
Solution 1.2. Generalized linear models have three components:
The outcome distribution family: \(\text{p}(Y|\mu(\tilde{x}))\)
The link function: \(g(\mu(\tilde{x})) = \eta(\tilde{x})\)
The linear component: \(\eta(\tilde{x}) = \tilde{x}\cdot \beta\)
The outcome distribution family (a.k.a. the random component of the model)
Gaussian (normal)
Binomial
Poisson
Exponential
Gamma
Negative binomial
The linear component (a.k.a. the linear predictor or linear functional form) describing how the covariates combine to define subpopulations: \(\eta(\tilde{x}) = \tilde{x} \cdot \beta = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p\)
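The three components map directly onto the arguments of R's `glm()` function; the following minimal sketch (with simulated data, not an example from the course) generates data from a logistic GLM and recovers the linear component's coefficients:

```r
# The three GLM components in glm():
#   family   = the outcome distribution family (random component)
#   link     = the link function g
#   formula  = the linear component eta = x * beta
set.seed(1)
x <- rnorm(100)
eta <- -0.5 + 1 * x                      # linear component
mu <- 1 / (1 + exp(-eta))                # inverse of the logit link
y <- rbinom(100, size = 1, prob = mu)    # binomial outcome distribution
fit <- glm(y ~ x, family = binomial(link = "logit"))
coef(fit)                                # estimates of beta_0 and beta_1
```

Swapping `family` and `link` (e.g., `poisson(link = "log")`) changes the outcome distribution and link function while leaving the linear component's structure the same.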
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. O’Reilly Media, Inc. https://r4ds.hadley.nz/.
The “~” (“tilde”) symbol in the notation \(\tilde{X}\) indicates that \(\tilde{X}\) is a vector. See the appendices for a table of notation used in these notes.↩︎
# Introduction

---

{{< include shared-config.qmd >}}

## Welcome

Welcome to Epidemiology 204: Quantitative Epidemiology III (Statistical Models).

Epi 204 is a course on **regression modeling**.

## What you should already know {.scrollable}

---

{{< include prereq-knowledge.qmd >}}

## What we will cover in this course

---

* Linear (Gaussian) regression models (review and more details)
* Regression models for non-Gaussian outcomes
  + binary
  + count
  + time to event
* Statistical analysis using R

We will start where Epi 203 left off: with linear regression models.

## Motivations for regression models

---

:::{#exr-why-regression}
Why do we need regression models?
:::

---

::: {#sol-why-regression .incremental}
* when there's not enough data to analyze every subgroup of interest individually
* especially when subgroups are defined using continuous predictors
:::

### Example: Adelie penguins

```{r}
#| label: fig-palmer-1
#| fig-cap: Palmer penguins
#| echo: false
library(ggplot2)
library(plotly)
library(dplyr)
ggpenguins <- palmerpenguins::penguins |>
  dplyr::filter(species == "Adelie") |>
  ggplot() +
  aes(x = bill_length_mm, y = body_mass_g) +
  geom_point() +
  xlab("Bill length (mm)") +
  ylab("Body mass (g)")
print(ggpenguins)
```

### Linear regression

```{r}
#| label: fig-palmer-2
#| fig-cap: Palmer penguins with linear regression fit
ggpenguins2 <- ggpenguins +
  stat_smooth(
    method = "lm",
    formula = y ~ x,
    geom = "smooth"
  )
ggpenguins2 |> print()
```

### Curved regression lines

```{r}
#| label: fig-palmer-3
#| fig-cap: Palmer penguins - curved regression lines
ggpenguins2 <- ggpenguins +
  stat_smooth(
    method = "lm",
    formula = y ~ log(x),
    geom = "smooth"
  ) +
  xlab("Bill length (mm)") +
  ylab("Body mass (g)")
ggpenguins2
```

### Multiple regression

```{r}
#| label: fig-palmer-4
#| fig-cap: "Palmer penguins - multiple groups"
ggpenguins <- palmerpenguins::penguins |>
  ggplot(aes(
    x = bill_length_mm,
    y = body_mass_g,
    color = species
  )) +
  geom_point() +
  stat_smooth(
    method = "lm",
    formula = y ~ x,
    geom = "smooth"
  ) +
  xlab("Bill length (mm)") +
  ylab("Body mass (g)")
ggpenguins |> print()
```

### Modeling non-Gaussian outcomes

::: {#fig-beetles_1a}

```{r}
library(glmx)
data(BeetleMortality)
beetles <- BeetleMortality |>
  mutate(
    pct = died / n,
    survived = n - died
  )
plot1 <- beetles |>
  ggplot(aes(x = dose, y = pct)) +
  geom_point(aes(size = n)) +
  xlab("Dose (log mg/L)") +
  ylab("Mortality rate (%)") +
  scale_y_continuous(labels = scales::percent) +
  # xlab(bquote(log[10]), bquote(CS[2])) +
  scale_size(range = c(1, 2))
print(plot1)
```

Mortality rates of adult flour beetles after five hours' exposure to gaseous carbon disulphide (Bliss 1935)

:::

### Why don't we use linear regression?

:::{#fig-beetles-plot1}

```{r}
beetles_long <- beetles |>
  reframe(
    .by = everything(),
    outcome = c(
      rep(1, times = died),
      rep(0, times = survived)
    )
  )
lm1 <- beetles_long |>
  lm(
    formula = outcome ~ dose,
    data = _
  )
range1 <- range(beetles$dose) + c(-.2, .2)
f_linear <- function(x) predict(lm1, newdata = data.frame(dose = x))
plot2 <- plot1 +
  geom_function(fun = f_linear, aes(col = "Straight line")) +
  labs(colour = "Model", size = "")
print(plot2)
```

Mortality rates of adult flour beetles after five hours' exposure to gaseous carbon disulphide (Bliss 1935)

:::

### Zoom out

:::{#fig-beetles2}

```{r}
#| echo: false
print(plot2 + expand_limits(x = c(1.6, 2)))
```

Mortality rates of adult flour beetles after five hours' exposure to gaseous carbon disulphide (Bliss 1935)

:::

### log transformation of dose?

:::{#fig-beetles3}

```{r}
lm2 <- beetles_long |>
  lm(formula = outcome ~ log(dose), data = _)
f_linearlog <- function(x) predict(lm2, newdata = data.frame(dose = x))
plot3 <- plot2 +
  expand_limits(x = c(1.6, 2)) +
  geom_function(fun = f_linearlog, aes(col = "Log-transform dose"))
print(plot3 + expand_limits(x = c(1.6, 2)))
```

Mortality rates of adult flour beetles after five hours' exposure to gaseous carbon disulphide (Bliss 1935)

:::

### Logistic regression

:::{#fig-beetles4b}

```{r}
glm1 <- beetles |>
  glm(formula = cbind(died, survived) ~ dose, family = "binomial")
f <- function(x) {
  glm1 |> predict(newdata = data.frame(dose = x), type = "response")
}
plot4 <- plot3 +
  geom_function(fun = f, aes(col = "Logistic regression"))
print(plot4)
```

Mortality rates of adult flour beetles after five hours' exposure to gaseous carbon disulphide (Bliss 1935)

:::

## Structure of regression models

---

{{< include _sec_regression_defs.qmd >}}

---

:::{#exr-glm-structure}
What is the general structure of a generalized linear model?
:::

---

:::{#sol-glm-structure}
{{< include _sec_glm_structure.qmd >}}
:::

---