1 Introduction

Published

Last modified: 2024-06-03: 13:28:16 (PM)

Configuring R

Functions from these packages will be used throughout this document:

Show R code

library(conflicted) # check for conflicting function definitions
# library(printr) # inserts help-file output into markdown output
library(rmarkdown) # Convert R Markdown documents into a variety of formats.
library(pander) # format tables for markdown
library(ggplot2) # graphics
library(ggeasy) # help with graphics
library(ggfortify) # help with graphics
library(dplyr) # manipulate data
library(tibble) # `tibble`s extend `data.frame`s
library(magrittr) # `%>%` and other additional piping tools
library(haven) # import Stata files
library(knitr) # format R output for markdown
library(tidyr) # Tools to help to create tidy data
library(plotly) # interactive graphics
library(dobson) # datasets from Dobson and Barnett 2018
library(parameters) # format model output tables for markdown
library(haven) # import Stata files
library(latex2exp) # use LaTeX in R code (for figures and tables)
library(fs) # filesystem path manipulations
library(survival) # survival analysis
library(survminer) # survival analysis graphics
library(KMsurv) # datasets from Klein and Moeschberger
library(parameters) # format model output tables for
library(webshot2) # convert interactive content to static for pdf
library(forcats) # functions for categorical variables ("factors")
library(stringr) # functions for dealing with strings
library(lubridate) # functions for dealing with dates and times

Here are some R settings I use in this document:

Show R code

rm(list = ls()) # delete any data that's already loaded into R

conflicts_prefer(dplyr::filter)
ggplot2::theme_set(
  ggplot2::theme_bw() + 
        # ggplot2::labs(col = "") +
    ggplot2::theme(
      legend.position = "bottom",
      text = ggplot2::element_text(size = 12, family = "serif")))

knitr::opts_chunk$set(message = FALSE)
options('digits' = 4)

panderOptions("big.mark", ",")
pander::panderOptions("table.emphasize.rownames", FALSE)
pander::panderOptions("table.split.table", Inf)
conflicts_prefer(dplyr::filter) # use the `filter()` function from dplyr() by default
legend_text_size = 9

1.1 Introduction to Epi 204

Welcome to Epidemiology 204: Quantitative Epidemiology III (Statistical Models).

In this course, we will start where Epi 203 left off: with linear regression models.

Note

Epi 203/STA 130B/STA 131B is a prerequisite for this course. If you haven’t passed one of these courses, please talk to me ASAP.

1.1.1 What you should already know

Epi 202: probability models for different data types

Probability distributions
- binomial
- Poisson
- Gaussian
- exponential
Characteristics of probability distributions
- Mean, median, mode, quantiles
- Variance, standard deviation, overdispersion
Characteristics of samples
- independence, dependence, covariance, correlation
- ranks, order statistics
- identical vs nonidentical distribution (homogeneity vs heterogeneity)
- Laws of Large Numbers
- Central Limit Theorem for the mean of an iid sample

Epi 203: inference for one or several homogenous populations

the maximum likelihood inference framework:
- likelihood functions
- log-likelihood functions
- score functions
- estimating equations
- information matrices
- point estimates
- standard errors
- confidence intervals
- hypothesis tests
- p-values
Hypothesis tests for one, two, and >2 groups:
- t-tests/ANOVA for Gaussian models
- chi-square tests for binomial and Poisson models
- nonparametric tests:
  - Wilcoxon signed-rank test for matched pairs
  - Mann–Whitney/Kruskal-Wallis rank sum test for ≥2 independent samples
  - Fisher’s exact test for contingency tables
  - Cochran–Mantel–Haenszel-Cox log-rank test
Some linear regression

For all of the quantities above, and especially for confidence intervals and p-values, you should know how both: - how to compute them - how to interpret them

Stat 108: linear regression models

building models for Gaussian outcomes
- multiple predictors
- interactions
regression diagnostics
fundamentals of R programming; e.g.:
- Wickham, Çetinkaya-Rundel, and Grolemund (2023)
- Dalgaard (2008)
RMarkdown or Quarto for formatting homework
- LaTeX for writing math in RMarkdown/Quarto

1.1.2 What we will cover in this course

Linear (Gaussian) regression models (review and more details)
Regression models for non-Gaussian outcomes
- binary
- count
- time to event
Statistical analysis using R

1.2 Regression models

Why do we need them?

continuous predictors
not enough data to analyze some subgroups individually

1.2.1 Example: Adelie penguins

1.2.2 Linear regression

Show R code

ggpenguins2 = 
  ggpenguins +
  stat_smooth(method = "lm",
              formula = y ~ x,
              geom = "smooth")

ggpenguins2 |> print()

Figure 1.2: Palmer penguins with linear regression fit

1.2.3 Curved regression lines

Show R code

ggpenguins2 = ggpenguins +
  stat_smooth(
    method = "lm",
    formula = y ~ log(x),
    geom = "smooth") +
  xlab("Bill length (mm)") + 
  ylab("Body mass (g)")
ggpenguins2

Figure 1.3: Palmer penguins - curved regression lines

1.2.4 Multiple regression

Show R code

ggpenguins =
  palmerpenguins::penguins |> 
  ggplot(
    aes(x = bill_length_mm , 
        y = body_mass_g,
        color = species
    )
  ) +
  geom_point() +
  stat_smooth(
    method = "lm",
    formula = y ~ x,
    geom = "smooth") +
  xlab("Bill length (mm)") + 
  ylab("Body mass (g)")
ggpenguins |> print()

Figure 1.4: Palmer penguins - multiple groups

1.2.5 Modeling non-Gaussian outcomes

Show R code

library(glmx)
data(BeetleMortality)
beetles = BeetleMortality |>
  mutate(
    pct = died/n,
    survived = n - died
  )

plot1 = 
  beetles |> 
  ggplot(aes(x = dose, y = pct)) +
  geom_point(aes(size = n)) +
  xlab("Dose (log mg/L)") +
  ylab("Mortality rate (%)") +
  scale_y_continuous(labels = scales::percent) +
  # xlab(bquote(log[10]), bquote(CS[2])) +
  scale_size(range = c(1,2))

print(plot1)

Figure 1.5: Mortality rates of adult flour beetles after five hours’ exposure to gaseous carbon disulphide (Bliss 1935)

1.2.6 Why don’t we use linear regression?

Show R code

beetles_long = 
  beetles  |> 
  reframe(.by = everything(),
          outcome = c(
            rep(1, times = died), 
            rep(0, times = survived))
  )

lm1 = 
  beetles_long |> 
  lm(
    formula = outcome ~ dose, 
    data = _)


range1 = range(beetles$dose) + c(-.2, .2)

f.linear = function(x) predict(lm1, newdata = data.frame(dose = x))

plot2 = 
  plot1 + 
  geom_function(fun = f.linear, aes(col = "Straight line")) +
  labs(colour="Model", size = "")
print(plot2)

Figure 1.6: Mortality rates of adult flour beetles after five hours’ exposure to gaseous carbon disulphide (Bliss 1935)

1.2.7 Zoom out

Figure 1.7: Mortality rates of adult flour beetles after five hours’ exposure to gaseous carbon disulphide (Bliss 1935)

1.2.8 log transformation of dose?

Show R code


lm2 = 
  beetles_long |> 
  lm(formula = outcome ~ log(dose), data = _)

f.linearlog = function(x) predict(lm2, newdata = data.frame(dose = x))

plot3 = plot2 + 
  expand_limits(x = c(1.6, 2)) +
  geom_function(fun = f.linearlog, aes(col = "Log-transform dose"))

print(plot3  + expand_limits(x = c(1.6, 2)))

Figure 1.8: Mortality rates of adult flour beetles after five hours’ exposure to gaseous carbon disulphide (Bliss 1935)

1.2.9 Logistic regression

Show R code

glm1 = beetles |> 
  glm(formula = cbind(died, survived) ~ dose, family = "binomial")

f = function(x) predict(glm1, newdata = data.frame(dose = x), type = "response")

plot4 = plot3 + geom_function(fun = f, aes(col = "Logistic regression"))
print(plot4)

Figure 1.9: Mortality rates of adult flour beetles after five hours’ exposure to gaseous carbon disulphide (Bliss 1935)

1.2.10 Three parts to regression models

What distribution does the outcome have for a specific sub-population defined by covariates? (outcome model)
How does the combination of covariates relate to the mean? (link function)
How do the covariates combine? (linear predictor/linear component) \[\eta \stackrel{\text{def}}{=}\tilde{x}^{\top} \tilde{\beta}= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...\]