Published

Last modified: 2024-12-13: 0:43:41 (AM)



Configuring R

Functions from these packages will be used throughout this document:

Show R code
library(conflicted) # check for conflicting function definitions
# library(printr) # inserts help-file output into markdown output
library(rmarkdown) # Convert R Markdown documents into a variety of formats.
library(pander) # format tables for markdown
library(ggplot2) # graphics
library(ggeasy) # help with graphics
library(ggfortify) # help with graphics
library(dplyr) # manipulate data
library(tibble) # `tibble`s extend `data.frame`s
library(magrittr) # `%>%` and other additional piping tools
library(haven) # import Stata files
library(knitr) # format R output for markdown
library(tidyr) # Tools to help to create tidy data
library(plotly) # interactive graphics
library(dobson) # datasets from Dobson and Barnett 2018
library(parameters) # format model output tables for markdown
library(haven) # import Stata files
library(latex2exp) # use LaTeX in R code (for figures and tables)
library(fs) # filesystem path manipulations
library(survival) # survival analysis
library(survminer) # survival analysis graphics
library(KMsurv) # datasets from Klein and Moeschberger
library(parameters) # format model output tables for
library(webshot2) # convert interactive content to static for pdf
library(forcats) # functions for categorical variables ("factors")
library(stringr) # functions for dealing with strings
library(lubridate) # functions for dealing with dates and times

Here are some R settings I use in this document:

Show R code
rm(list = ls()) # delete any data that's already loaded into R

conflicts_prefer(dplyr::filter)
ggplot2::theme_set(
  ggplot2::theme_bw() + 
        # ggplot2::labs(col = "") +
    ggplot2::theme(
      legend.position = "bottom",
      text = ggplot2::element_text(size = 12, family = "serif")))

knitr::opts_chunk$set(message = FALSE)
options('digits' = 4)

panderOptions("big.mark", ",")
pander::panderOptions("table.emphasize.rownames", FALSE)
pander::panderOptions("table.split.table", Inf)
conflicts_prefer(dplyr::filter) # use the `filter()` function from dplyr() by default
legend_text_size = 9

1.1 Introduction to Epi 204

Welcome to Epidemiology 204: Quantitative Epidemiology III (Statistical Models).

In this course, we will start where Epi 203 left off: with linear regression models.

Note

Epi 203/STA 130B/STA 131B is a prerequisite for this course. If you haven’t passed one of these courses, please talk to me ASAP.

1.1.1 What you should already know

Epi 202: probability models for different data types

  • Probability distributions
    • binomial
    • Poisson
    • Gaussian
    • exponential
  • Characteristics of probability distributions
    • Mean, median, mode, quantiles
    • Variance, standard deviation, overdispersion
  • Characteristics of samples
    • independence, dependence, covariance, correlation
    • ranks, order statistics
    • identical vs nonidentical distribution (homogeneity vs heterogeneity)
    • Laws of Large Numbers
    • Central Limit Theorem for the mean of an iid sample

Epi 203: inference for one or several homogenous populations

  • the maximum likelihood inference framework:
    • likelihood functions
    • log-likelihood functions
    • score functions
    • estimating equations
    • information matrices
    • point estimates
    • standard errors
    • confidence intervals
    • hypothesis tests
    • p-values
  • Hypothesis tests for one, two, and >2 groups:
    • t-tests/ANOVA for Gaussian models
    • chi-square tests for binomial and Poisson models
    • nonparametric tests:
      • Wilcoxon signed-rank test for matched pairs
      • Mann–Whitney/Kruskal-Wallis rank sum test for \(\geq 2\) independent samples
      • Fisher’s exact test for contingency tables
      • Cochran–Mantel–Haenszel-Cox log-rank test
  • Some linear regression

For all of the quantities above, and especially for confidence intervals and p-values, you should know how both: - how to compute them - how to interpret them

Stat 108: linear regression models

  • building models for Gaussian outcomes
    • multiple predictors
    • interactions
  • regression diagnostics
  • fundamentals of R programming; e.g.:
    • Wickham, Çetinkaya-Rundel, and Grolemund (2023)
    • Dalgaard (2008)
  • RMarkdown or Quarto for formatting homework
    • LaTeX for writing math in RMarkdown/Quarto

1.1.2 What we will cover in this course

  • Linear (Gaussian) regression models (review and more details)

  • Regression models for non-Gaussian outcomes

    • binary
    • count
    • time to event
  • Statistical analysis using R

1.2 Regression models

Why do we need them?

  • continuous predictors

  • not enough data to analyze some subgroups individually

1.2.1 Example: Adelie penguins

Figure 1.1: Palmer penguins

1.2.2 Linear regression

Show R code
ggpenguins2 = 
  ggpenguins +
  stat_smooth(method = "lm",
              formula = y ~ x,
              geom = "smooth")

ggpenguins2 |> print()
Figure 1.2: Palmer penguins with linear regression fit

1.2.3 Curved regression lines

Show R code
ggpenguins2 = ggpenguins +
  stat_smooth(
    method = "lm",
    formula = y ~ log(x),
    geom = "smooth") +
  xlab("Bill length (mm)") + 
  ylab("Body mass (g)")
ggpenguins2
Figure 1.3: Palmer penguins - curved regression lines

1.2.4 Multiple regression

Show R code
ggpenguins =
  palmerpenguins::penguins |> 
  ggplot(
    aes(x = bill_length_mm , 
        y = body_mass_g,
        color = species
    )
  ) +
  geom_point() +
  stat_smooth(
    method = "lm",
    formula = y ~ x,
    geom = "smooth") +
  xlab("Bill length (mm)") + 
  ylab("Body mass (g)")
ggpenguins |> print()
Figure 1.4: Palmer penguins - multiple groups

1.2.5 Modeling non-Gaussian outcomes

Show R code
library(glmx)
data(BeetleMortality)
beetles = BeetleMortality |>
  mutate(
    pct = died/n,
    survived = n - died
  )

plot1 = 
  beetles |> 
  ggplot(aes(x = dose, y = pct)) +
  geom_point(aes(size = n)) +
  xlab("Dose (log mg/L)") +
  ylab("Mortality rate (%)") +
  scale_y_continuous(labels = scales::percent) +
  # xlab(bquote(log[10]), bquote(CS[2])) +
  scale_size(range = c(1,2))

print(plot1)
Figure 1.5: Mortality rates of adult flour beetles after five hours’ exposure to gaseous carbon disulphide (Bliss 1935)

1.2.6 Why don’t we use linear regression?

Show R code
beetles_long = 
  beetles  |> 
  reframe(.by = everything(),
          outcome = c(
            rep(1, times = died), 
            rep(0, times = survived))
  )

lm1 = 
  beetles_long |> 
  lm(
    formula = outcome ~ dose, 
    data = _)


range1 = range(beetles$dose) + c(-.2, .2)

f.linear = function(x) predict(lm1, newdata = data.frame(dose = x))

plot2 = 
  plot1 + 
  geom_function(fun = f.linear, aes(col = "Straight line")) +
  labs(colour="Model", size = "")
print(plot2)
Figure 1.6: Mortality rates of adult flour beetles after five hours’ exposure to gaseous carbon disulphide (Bliss 1935)

1.2.7 Zoom out

Figure 1.7: Mortality rates of adult flour beetles after five hours’ exposure to gaseous carbon disulphide (Bliss 1935)

1.2.8 log transformation of dose?

Show R code

lm2 = 
  beetles_long |> 
  lm(formula = outcome ~ log(dose), data = _)

f.linearlog = function(x) predict(lm2, newdata = data.frame(dose = x))

plot3 = plot2 + 
  expand_limits(x = c(1.6, 2)) +
  geom_function(fun = f.linearlog, aes(col = "Log-transform dose"))

print(plot3  + expand_limits(x = c(1.6, 2)))
Figure 1.8: Mortality rates of adult flour beetles after five hours’ exposure to gaseous carbon disulphide (Bliss 1935)

1.2.9 Logistic regression

Show R code
glm1 = beetles |> 
  glm(formula = cbind(died, survived) ~ dose, family = "binomial")

f = function(x) predict(glm1, newdata = data.frame(dose = x), type = "response")

plot4 = plot3 + geom_function(fun = f, aes(col = "Logistic regression"))
print(plot4)
Figure 1.9: Mortality rates of adult flour beetles after five hours’ exposure to gaseous carbon disulphide (Bliss 1935)

1.2.10 Three parts to regression models

  • What distribution does the outcome have for a specific sub-population defined by covariates? (outcome model)

  • How does the combination of covariates relate to the mean? (link function)

  • How do the covariates combine? (linear predictor/linear component) \[\eta \stackrel{\text{def}}{=}\tilde{x}^{\top} \tilde{\beta}= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...\]