---
title: "Midterm 2 Review Session"
format:
  html: default
  revealjs:
    output-file: midterm-2-review-session-slides.html
  pdf:
    output-file: midterm-2-review-session-handout.pdf
  docx:
    output-file: midterm-2-review-session-handout.docx
---
{{< include shared-config.qmd >}}
# Introduction
This chapter walks through the **most common mistakes** that students made on
Midterm 2, which covered survival analysis and Cox proportional hazards
models. Each section states a mistake, explains *why* it is wrong, and works
through the correct approach.
The exam had two parts:

- **Part 1** (Kaplan-Meier and Nelson-Aalen estimators) used a small data set
  of survival times: `10, 14, 14+, 15+, 18, 21, 25` months (the `+` marks
  censored observations).
- **Part 2** (Cox proportional hazards regression) used a model fit to
  $n = 3{,}142$ men from the Western Collaborative Group Study (WCGS), relating
  time to incident coronary heart disease (CHD) to baseline cigarette smoking
  and other covariates.

::: {.callout-tip}
The mistakes below are organized so that the most consequential and most
frequent errors come first within each part. If you only have time to review a
few things, start at the top of each part.
:::
# Part 1: Kaplan-Meier and Nelson-Aalen
```{r}
#| label: setup-surv
#| message: false
#| warning: false
library(survival)
library(dplyr)
surv_data <- tibble(
  time = c(10, 14, 14, 15, 18, 21, 25),
  death = c(1, 1, 0, 0, 1, 1, 1),  # 1 = death observed, 0 = censored
  surv = Surv(time, death)
)
```
Here is the full table that everyone was trying to reproduce; we will refer
back to it throughout this part.
::: {#tbl-km-na}
```{r}
#| label: km-na-calc
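# Fit the Kaplan-Meier curve (intercept-only model: no covariates)
KM_est <- survfit(surv ~ 1, type = "kaplan-meier", data = surv_data)

surv_table <-
  KM_est |>
  # censored = TRUE keeps censoring-only times (here t = 15) as rows
  summary(censored = TRUE, data.frame = TRUE) |>
  as_tibble() |>
  mutate(
    hazard = n.event / n.risk,      # d_j / n_j
    nonhazard = 1 - hazard,
    cusum_hazard = cumsum(hazard),  # Nelson-Aalen cumulative hazard
    surv_KM = cumprod(nonhazard),   # Kaplan-Meier product
    surv_NA = exp(-cusum_hazard)    # Nelson-Aalen survival
  ) |>
  select(time, n.risk, n.event, hazard, surv_KM, cusum_hazard, surv_NA)

surv_table |> pander::pander()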
```
Kaplan-Meier and Nelson-Aalen calculations for the Part 1 data
:::
## The risk set shrinks by the number of subjects who *leave* {#sec-risk-set}
*Exam reference: **Exercise 1.1** (6 points) — compute the KM and Nelson-Aalen estimates of $S(t)$.*
::: {.callout-warning}
## Common mistake
Reducing the number at risk by **one** at every row, regardless of how many
subjects actually left the study. With this mistake, the number at risk at
$t = 15$ (after the death at $t = 14$ and the censoring at $t = 14^+$) was
computed as $5$ instead of $4$.
:::
::: {.solution}
The number at risk $n_j$ is the number of subjects still under observation
*just before* time $t_j$ — that is, everyone who has neither had the event nor
been censored yet.
Track the cohort of 7:
| time | what happens | number at risk *before* this time |
|:----:|:-------------|:---------------------------------:|
| 10 | 1 death | 7 |
| 14 | 1 death **and** 1 censored | 6 |
| 15 | 1 censored | 4 |
| 18 | 1 death | 3 |
| 21 | 1 death | 2 |
| 25 | 1 death | 1 |
At $t = 14$, one subject dies *and* a different subject is censored (the
`14+`), so **two** subjects leave the risk set; the count drops from 6 to 4,
not from 6 to 5. Every subject who has an event **or** is censored is removed
from all later risk sets.
:::
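
As a rough check on the table above, the risk-set counts can be recomputed
directly from the raw times. This is a minimal base-R sketch; the names
`obs_time`, `obs_death`, and `risk_check` are illustrative and not part of the
exam materials.

```{r}
#| label: risk-set-check
# Raw Part 1 data: observed times and event indicators (1 = death, 0 = censored)
obs_time <- c(10, 14, 14, 15, 18, 21, 25)
obs_death <- c(1, 1, 0, 0, 1, 1, 1)

distinct_times <- sort(unique(obs_time))

risk_check <- data.frame(
  time = distinct_times,
  # at risk just before t: everyone whose observed time is >= t
  n_risk = sapply(distinct_times, function(t) sum(obs_time >= t)),
  # deaths and censorings at exactly t; both leave the risk set afterwards
  n_death = sapply(distinct_times, function(t) sum(obs_time == t & obs_death == 1)),
  n_censored = sapply(distinct_times, function(t) sum(obs_time == t & obs_death == 0))
)
risk_check
```

Each row's `n_risk` equals the previous row's `n_risk` minus that previous
row's `n_death + n_censored`, which is the rule stated in the solution.
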
## Nelson-Aalen estimates the *cumulative hazard* first, then exponentiates {#sec-na-exp}
*Exam reference: **Exercise 1.1** (6 points) — compute the KM and Nelson-Aalen estimates of $S(t)$.*
::: {.callout-warning}
## Common mistake
Two versions of this error were common:
1. Computing a running sum of arbitrary quantities (e.g. summing survival
probabilities, or summing $1 - \hat\lambda_j$) instead of summing the
hazards $\hat\lambda_j = d_j / n_j$.
2. Computing the cumulative hazard correctly but then forgetting the final
step, $\hat S_{NA}(t) = \exp{-\hat\Lambda(t)}$, and reporting the
cumulative hazard itself (or the KM product) as the Nelson-Aalen survival
estimate.
:::
::: {.solution}
The Nelson-Aalen estimator builds the **cumulative hazard** by adding up the
instantaneous hazard contributions at each event time:
$$\hat\Lambda(t) = \sum_{j: t_j \le t} \frac{d_j}{n_j}$$
and then converts it to a survival estimate using
$\surv(t) = \exp{-\cumhaz(t)}$:
$$\hat S_{NA}(t) = \exp{-\hat\Lambda(t)}$$
For example, at $t = 10$:
$$\hat\Lambda(10) = \frac{1}{7} = 0.143,
\qquad
\hat S_{NA}(10) = \exp{-0.143} = 0.867$$
Contrast this with Kaplan-Meier, which multiplies conditional survival
probabilities:
$$\hat S_{KM}(t) = \prod_{j: t_j \le t}\left(1 - \frac{d_j}{n_j}\right),
\qquad
\hat S_{KM}(10) = 1 - \tfrac{1}{7} = 0.857$$
The two estimators are close but **not** identical, and
$\hat S_{NA}(t) \ge \hat S_{KM}(t)$ always. Reporting the same numbers for both,
or reporting $\hat\Lambda$ where $\hat S_{NA}$ was asked for, loses credit.
:::
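
The two estimators can be computed side by side from the death-time rows of
the table. This is a sketch in base R with illustrative variable names,
assuming the risk-set counts derived earlier (7, 6, 3, 2, 1 at the five death
times).

```{r}
#| label: km-vs-na-by-hand
event_time <- c(10, 14, 18, 21, 25)  # times with at least one death
n_risk <- c(7, 6, 3, 2, 1)           # at risk just before each death time
n_event <- c(1, 1, 1, 1, 1)          # deaths at each time

hazard <- n_event / n_risk           # d_j / n_j
cum_hazard <- cumsum(hazard)         # Nelson-Aalen: add the hazards

by_hand <- data.frame(
  time = event_time,
  hazard = hazard,
  cum_hazard = cum_hazard,
  surv_NA = exp(-cum_hazard),        # then exponentiate the negative sum
  surv_KM = cumprod(1 - hazard)      # Kaplan-Meier: multiply (1 - hazard)
)
round(by_hand, 3)
```

Every `surv_NA` value is at least as large as the matching `surv_KM` value,
because $e^{-x} \ge 1 - x$ for every $x$.
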
## Survival estimates are *step functions*: use the most recent event time {#sec-step-function}
*Exam reference: **Exercise 1.3** (1 point) — the KM and NA survival estimates for $t = 17$ months.*
::: {.callout-warning}
## Common mistake
For $\hat S(17)$, looking at the row for $t = 18$ (the next event *after* 17),
or interpolating between rows. Some answers gave only a verbal description
("it's flat there") without the numeric value.
:::
::: {.solution}
$\hat S(t)$ is a **right-continuous step function**: it only changes at observed
event times and stays flat in between. To evaluate it at $t = 17$, find the
**most recent event time at or before 17**, which is $t = 14$ (the subsequent
$t = 15^+$ is a censoring, not an event, so the curve does not step there). So:
$$\hat S_{KM}(17) = \hat S_{KM}(14) = 0.714,
\qquad
\hat S_{NA}(17) = \hat S_{NA}(14) = 0.734$$
Give the actual numbers, not just "it's flat."
:::
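
One way to make the step-function idea concrete is base R's `stepfun()`, which
by default builds a right-continuous step function from jump locations and
heights. A minimal sketch, with illustrative names and the KM and NA values
recomputed from the risk-set counts:

```{r}
#| label: step-function-lookup
event_time <- c(10, 14, 18, 21, 25)
hazard <- c(1, 1, 1, 1, 1) / c(7, 6, 3, 2, 1)

# y has one more value than x: the curve equals 1 before the first death
km_curve <- stepfun(event_time, c(1, cumprod(1 - hazard)))
na_curve <- stepfun(event_time, c(1, exp(-cumsum(hazard))))

# t = 17 falls in [14, 18), so both curves return their t = 14 values
round(c(KM = km_curve(17), NelsonAalen = na_curve(17)), 3)
```

Evaluating `km_curve(18)` instead returns the lower value, since the jump
happens at the death time itself.
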
## Median survival time is read off the curve, not modeled {#sec-median}
*Exam reference: **Exercise 1.4** (1 point) — the KM and NA estimates of median survival time.*
::: {.callout-warning}
## Common mistake
Estimating the median by fitting an exponential model
($\hat\lambda = \text{events} / \text{total follow-up}$, then
$\hat{\text{E}}[T] = 1/\hat\lambda$). That computes an exponential-model
**mean**, not the nonparametric **median**, and answers a different question.
A second error was reading the median off the wrong curve or the wrong row.
:::
::: {.solution}
The median survival time is the smallest $t$ at which the estimated survival
curve drops to (or below) $0.5$:
$$\hat t_{\text{median}} = \min\{t : \hat S(t) \le 0.5\}$$
Reading from @tbl-km-na:
- $\hat S_{KM}(t)$ first reaches $\le 0.5$ at $t = 18$ (where it drops to
$0.476$), so the KM median is **18 months**.
- $\hat S_{NA}(t)$ first reaches $\le 0.5$ at $t = 21$ (where it drops to
$0.319$), so the NA median is **21 months**.
We can confirm the KM median in R:
```{r}
#| label: median-check
quantile(KM_est, probs = 0.5)$quantile
```
No distributional assumption (exponential or otherwise) is used: this is a
nonparametric read-off of the estimated curve.
:::
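
The `quantile()` call above covers the Kaplan-Meier curve. The same read-off
rule can be applied by hand to both curves. A base-R sketch with illustrative
names:

```{r}
#| label: median-by-hand
event_time <- c(10, 14, 18, 21, 25)
hazard <- c(1, 1, 1, 1, 1) / c(7, 6, 3, 2, 1)
surv_km_hand <- cumprod(1 - hazard)
surv_na_hand <- exp(-cumsum(hazard))

# median = smallest death time at which the curve is at or below 0.5
medians_by_hand <- c(
  KM = min(event_time[surv_km_hand <= 0.5]),
  NelsonAalen = min(event_time[surv_na_hand <= 0.5])
)
medians_by_hand
```
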
## How censored times are used: denominator yes, numerator no {#sec-censoring-role}
*Exam reference: **Exercise 1.5** (2 points) — describe how the censored time(s) are utilized in the Kaplan-Meier estimate.*
::: {.callout-warning}
## Common mistake
Describing only the *symptom* ("the curve doesn't drop at a censoring time")
without the *mechanism*, or implying that censored subjects are simply dropped
from the analysis entirely.
:::
::: {.solution}
A censored subject contributes to the **risk-set denominators** ($n_j$) for
every event time up to and including their censoring time, because we know they
survived event-free until then. They **never** contribute to an event
**numerator** ($d_j$), because no event was observed for them.
In this example, subjects \#3 and \#4 (censored at $14^+$ and $15^+$) are
counted in the number at risk at $t = 10$ and $t = 14$, which is why those
denominators are 7 and 6. After their censoring times they drop out of all
later risk sets. This is exactly what lets Kaplan-Meier use partial information
from censored subjects instead of discarding them.
:::
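
To see what the partial information buys, compare Kaplan-Meier with a naive
analysis that discards the two censored subjects entirely. This is an
illustrative base-R sketch, not something the exam asked for; all names are
hypothetical.

```{r}
#| label: censoring-comparison
death_time <- c(10, 14, 18, 21, 25)

# Kaplan-Meier: censored subjects stay in the denominators until they leave
km_with_censoring <- cumprod(1 - 1 / c(7, 6, 3, 2, 1))

# Naive: pretend only the 5 subjects who died were ever in the study
km_dropping_censored <- cumprod(1 - 1 / c(5, 4, 3, 2, 1))

round(data.frame(time = death_time, km_with_censoring, km_dropping_censored), 3)
```

The naive curve is lower at every death time before the last one, because it
ignores the fact that subjects \#3 and \#4 were known to be event-free at 14
and 15 months.
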
# Part 2: Cox proportional hazards models {#sec-part2}
For reference, here is the fitted model from the exam. The primary exposure is
baseline smoking category (nonsmoker = reference, light smoker $L=1$, heavy
smoker $H=1$), adjusting for age in decades ($A$), behavior pattern ($P$),
overweight ($B$), and high cholesterol ($C$).
| Characteristic | $\log(\widehat{HR})$ | $\widehat{HR}$ | 95% CI for HR | $p$ |
|:---------------------|:--------------------:|:--------------:|:-------------:|:---:|
| nonsmoker (ref.) | – | – | – | – |
| light smoker ($L$) | 0.36 | 1.43 | (1.05, 1.95) | 0.023 |
| heavy smoker ($H$) | 0.78 | 2.18 | (1.63, 2.91) | < 0.001 |
| age per decade ($A$) | 0.66 | 1.94 | (1.56, 2.40) | < 0.001 |
| behavior type A ($P$)| 0.73 | 2.06 | (1.58, 2.70) | < 0.001 |
| overweight ($B$) | 0.30 | 1.34 | (1.05, 1.72) | 0.019 |
| high cholesterol ($C$)| 0.75 | 2.12 | (1.65, 2.71) | < 0.001 |
: Cox PH regression for the WCGS CHD data {#tbl-wcgs}
The exam also provided the estimated covariance matrix of the coefficient
estimates (on the $\log(HR)$ scale). The entries we need below are:
$$\V(\hat\beta_L) = 0.024840,
\quad
\V(\hat\beta_H) = 0.021905,
\quad
\Covt(\hat\beta_L, \hat\beta_H) = 0.010504$$
## Writing the model: name every assumption and show where it is used {#sec-write-model}
*Exam reference: **Exercise 2.1** (10 points) — write the mathematical form of the proportional hazards model corresponding to Table 1.*
::: {.callout-warning}
## Common mistake
Several patterns lost the most points here:
- Treating the question as "interpret the coefficients" and writing only the
linear predictor, omitting the **likelihood**, the **distribution
functions**, and the **assumptions**.
- Listing assumption *names* without **showing where they are used**, or
writing the math without naming the assumption.
- Adding a spurious **"baseline hazard is exponential"** assumption. The Cox
model leaves the baseline hazard $\lambda_0(t)$ **unspecified**; assuming a
parametric form is wrong.
- Omitting one or more of the standalone distribution functions (survival,
density, hazard, log-hazard, cumulative hazard).
:::
::: {.solution}
A full-credit answer connects the likelihood to the linear predictor through a
chain of named components and assumptions. Using
$\tilde X = (L, H, A, P, B, C)$:
**Joint likelihood of the data set:**
$$\Lik \eqdef \p(\tilde Y = \tilde y, \tilde D = \tilde d \mid \mathbf X = \mathbf x)$$
**Marginal likelihood contribution of observation $i$:**
$$\Lik_i \eqdef \p(Y_i = y_i, D_i = d_i \mid \tilde X_i = \tilde x_i)$$
***Independent-observations assumption*** (used to factor the likelihood):
$$\Lik = \prod_{i=1}^n \Lik_i$$
***Non-informative censoring assumption*** (used so each contribution reduces to
survival and hazard terms): $T_i \indpt C_i \mid \tilde X_i$, giving
$$\Lik_i \propto \left[\f(y_i \mid \tilde x_i)\right]^{d_i}
\left[\surv(y_i \mid \tilde x_i)\right]^{1 - d_i}
= \surv(y_i \mid \tilde x_i)\cdot\left[\haz(y_i \mid \tilde x_i)\right]^{d_i}$$
**Distribution functions** (define each one):
$$\surv(t \mid \tilde x) \eqdef \P(T > t \mid \tilde X = \tilde x) = \exp{-\cumhaz(t \mid \tilde x)}$$
$$\f(t \mid \tilde x) \eqdef \haz(t \mid \tilde x)\,\surv(t \mid \tilde x)$$
$$\haz(t \mid \tilde x) \eqdef \p(T = t \mid T \ge t, \tilde X = \tilde x) = \frac{\f(t \mid \tilde x)}{\surv(t \mid \tilde x)}$$
$$\cumhaz(t \mid \tilde x) \eqdef \int_0^t \haz(u \mid \tilde x)\,du = -\log\surv(t \mid \tilde x)$$
$$\loghaz(t \mid \tilde x) \eqdef \log\haz(t \mid \tilde x)$$
***Proportional-hazards assumption*** (used to split the hazard into a baseline
that depends only on time and a factor that depends only on covariates):
$$\haz(t \mid \tilde x) = \lambda_0(t)\cdot\theta(\tilde x)$$
where $\lambda_0(t)$ is the **unspecified** baseline hazard.
***Logarithmic-link assumption*** (used to make the covariate factor a function
of a linear predictor):
$$\loghaz(t \mid \tilde x) = \loghaz_0(t) + \Delta\loghaz(\tilde x),
\qquad
\theta(\tilde x) = \exp{\Delta\loghaz(\tilde x)}$$
***Linear functional-form assumption*** (used to write the covariate term as a
linear combination):
$$\Delta\loghaz(\tilde x) = \tilde x \cdot \tilde\beta
= \beta_L\, l + \beta_H\, h + \beta_A\, a + \beta_P\, p + \beta_B\, b + \beta_C\, c$$
Notice that the baseline hazard $\lambda_0(t)$ is carried along symbolically the
whole time — we never assume a shape for it.
:::
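
Chaining the proportional-hazards, logarithmic-link, and linear functional-form
assumptions gives the compact form of the model. As a sketch, the second line
substitutes the rounded estimates from @tbl-wcgs into the linear predictor; the
baseline hazard still has no assumed shape.

$$\haz(t \mid \tilde x) = \lambda_0(t)\cdot
\exp{\beta_L\, l + \beta_H\, h + \beta_A\, a + \beta_P\, p + \beta_B\, b + \beta_C\, c}$$

$$\widehat{\Delta\loghaz}(\tilde x)
= 0.36\, l + 0.78\, h + 0.66\, a + 0.73\, p + 0.30\, b + 0.75\, c$$
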
## Interpreting a hazard ratio: magnitude, reference, adjustment, *and* significance {#sec-interpret}
*Exam reference: **Exercise 2.2** (4 points) — summarize how baseline smoking category is associated with hazard of incident CHD.*
::: {.callout-warning}
## Common mistake
- Reporting the direction of the effect but omitting the **magnitude** ("X%
higher/lower hazard") or the **reference group**.
- Forgetting to say the estimate is **adjusted for / holding constant** the
other covariates.
- Omitting **statistical significance** (whether the 95% CI excludes 1, or
$p < 0.05$).
- Calling the hazard ratio a **risk ratio** or an **odds ratio**. It is a ratio of
*hazards*, not of risks or odds.
:::
::: {.solution}
A complete interpretation has four parts: magnitude, reference group,
adjustment, and significance. For the smoking effect in @tbl-wcgs:
> Adjusting for age, behavior pattern, overweight, and cholesterol, **light
> smokers** had an estimated hazard of incident CHD about **43% higher** than
> **nonsmokers** ($\widehat{HR} = 1.43$); this is statistically significant at
> the 0.05 level, since the 95% CI $(1.05, 1.95)$ excludes 1 ($p = 0.023$).
>
> **Heavy smokers** had an estimated hazard about **118% higher** (roughly
> double; $\widehat{HR} = 2.18$) than nonsmokers, also adjusting for the other
> covariates; this is highly significant ($p < 0.001$, CI $(1.63, 2.91)$).
Note "43% higher" comes from $1.43 - 1 = 0.43$, and "118% higher" from
$2.18 - 1 = 1.18$. Interpret each smoking level **separately** against the
reference, rather than lumping them together.
:::
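
The "percent higher" figures follow mechanically from the coefficients. A
short base-R sketch, using the rounded $\log(\widehat{HR})$ values from
@tbl-wcgs and illustrative variable names:

```{r}
#| label: hr-percent-change
log_hr <- c(light = 0.36, heavy = 0.78)  # coefficients vs. nonsmokers
hr <- exp(log_hr)                        # hazard ratios
pct_higher <- 100 * (hr - 1)             # percent increase in hazard

round(rbind(hr, pct_higher), 2)
```
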
## A hazard ratio for a non-unit change: exponentiate {#sec-hr-multiunit}
*Exam reference: **Exercise 2.3** (2 points) — the hazard ratio associated with a 7.5-year increase in age.*
::: {.callout-warning}
## Common mistake
For the hazard ratio associated with a **7.5-year** increase in age (with the
table reporting the HR per **decade**), the errors were:
- Multiplying: $1.94 \times 0.75 = 1.46$. **Wrong**: a multi-unit change is
linear on the $\log(HR)$ scale, not on the HR scale.
- Arithmetic slips that produced answers like 2.08.
:::
::: {.solution}
On the **log** scale, the effect is linear: a change of $c$ units adds
$c \cdot \beta$ to the log-hazard. So the hazard ratio for a $c$-unit change is
$$HR(c) = \exp{c\,\beta} = \left(\exp{\beta}\right)^c = HR^{\,c}$$
A 7.5-year increase is $c = 0.75$ decades, and the per-decade
$\widehat{HR} = 1.94$ (i.e. $\hat\beta_A = \log 1.94 = 0.66$):
$$\widehat{HR}(0.75) = 1.94^{\,0.75} = \exp{0.75 \times 0.66} = \exp{0.495} = 1.64$$
So a 7.5-year-older man has about **64% higher** estimated hazard of CHD, all
else equal. The operation is **exponentiation** ($HR^{0.75}$), not
multiplication.
:::
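
The right and wrong calculations can be compared directly. A minimal base-R
sketch with illustrative names, starting from the rounded per-decade
$\widehat{HR}$ in the table:

```{r}
#| label: hr-multiunit
hr_per_decade <- 1.94
c_decades <- 7.5 / 10                              # 7.5 years in decades

hr_correct <- hr_per_decade^c_decades              # exponentiate
hr_via_log <- exp(c_decades * log(hr_per_decade))  # same thing on the log scale
hr_wrong <- hr_per_decade * c_decades              # the multiplication mistake

c(correct = hr_correct, via_log = hr_via_log, wrong = hr_wrong)
```

The first two entries agree; the third is the multiplication mistake described
above.
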
## Comparing two coefficients: variance of a *difference* {#sec-wald}
*Exam reference: **Exercise 2.4** (2 points) — test whether the hazard of incident CHD differs between heavy and light smokers; compute the z-statistic and two-sided p-value.*
::: {.callout-warning}
## Common mistake
This was one of the **most consequential Part 2 errors**, because a wrong
standard error here also carries into the confidence interval in Exercise 2.5. To test
$H_0: \beta_H = \beta_L$, the standard error of $\hat\beta_H - \hat\beta_L$ was computed incorrectly in several ways:
- Using a **single covariance entry** $\Covt(\hat\beta_L, \hat\beta_H)$ as the
variance.
- **Adding** $2\,\Covt$ instead of **subtracting** it.
- Plugging in the **hazard ratios** ($2.18$, $1.43$) instead of the
**coefficients** ($0.78$, $0.36$) in the numerator.
- Guessing a value for $z$ because the variance-of-a-difference formula was not
on the formula sheet.
:::
::: {.solution}
Work on the $\log(HR)$ (coefficient) scale, where the estimates are
approximately normal. The point estimate of the difference is
$$\widehat{\Delta} = \hat\beta_H - \hat\beta_L = 0.78 - 0.36 = 0.42$$
The variance of a **difference** of two estimates uses **all three** relevant
entries of the covariance matrix:
$$\V(\hat\beta_H - \hat\beta_L)
= \V(\hat\beta_H) + \V(\hat\beta_L) - 2\,\Covt(\hat\beta_H, \hat\beta_L)$$
The $-2\,\Covt$ term is essential — its sign is **minus** for a difference.
Plugging in:
```{r}
#| label: wald-test
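# Variances and covariance of the two smoking coefficients (log-HR scale)
var_H <- 0.021905
var_L <- 0.024840
cov_HL <- 0.010504
# Point estimate of beta_H - beta_L; named to avoid masking base::diff()
est_diff <- 0.78 - 0.36
# Var(A - B) = Var(A) + Var(B) - 2 Cov(A, B)
var_diff <- var_H + var_L - 2 * cov_HL
se_diff <- sqrt(var_diff)
z <- est_diff / se_diff
# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))
c(diff = est_diff, se = se_diff, z = z, p_value = p_value) |> round(4)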
```
So $z = 0.42 / 0.160 = 2.62$, two-sided $p = 0.0088$. We **reject** $H_0$: the
hazard of CHD differs significantly between heavy and light smokers, holding
the other covariates constant.
::: {.callout-note}
Compare with the **naive** standard error that ignores the covariance,
$\sqrt{\V(\hat\beta_H) + \V(\hat\beta_L)} = 0.216$. Because $\hat\beta_H$ and
$\hat\beta_L$ are *positively* correlated, subtracting $2\,\Covt$ makes the
correct SE (0.160) **smaller**, which sharpens the test. Using a single
covariance cell as the variance gave a wildly wrong SE and an absurd $z$.
:::
:::
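
The variance-of-a-difference formula is a special case of the general rule
$\text{Var}(c^\top\hat\beta) = c^\top \Sigma\, c$ for a contrast vector $c$. The
base-R sketch below applies it to the $2 \times 2$ block of the covariance
matrix for the two smoking coefficients; the object names are illustrative,
and only the three entries quoted above are assumed.

```{r}
#| label: wald-contrast
# 2 x 2 block of the covariance matrix for (beta_L, beta_H)
vcov_LH <- matrix(
  c(0.024840, 0.010504,
    0.010504, 0.021905),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("L", "H"), c("L", "H"))
)
beta_LH <- c(L = 0.36, H = 0.78)
contrast <- c(L = -1, H = 1)  # picks out beta_H - beta_L

est_contrast <- sum(contrast * beta_LH)
var_contrast <- drop(t(contrast) %*% vcov_LH %*% contrast)
z_contrast <- est_contrast / sqrt(var_contrast)

c(est = est_contrast, se = sqrt(var_contrast), z = z_contrast)
```

The $-1$ and $+1$ in the contrast are what produce the minus sign on the
covariance term.
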
## A confidence interval for a hazard ratio: build it on the log scale, then exponentiate {#sec-ci-hr}
*Exam reference: **Exercise 2.5** (2 points) — a 95% confidence interval for the hazard ratio comparing heavy to light smokers.*
::: {.callout-warning}
## Common mistake
- Forming the interval directly on the HR scale,
$\widehat{HR} \pm 1.96 \cdot \text{SE}$ (e.g. $1.52 \pm 0.31$), instead of on
the log scale.
- Building the log-scale interval correctly but **never exponentiating**, then
reporting a log-scale interval as if it were the HR (sometimes even declaring
"significant" while the reported interval contained 1).
- Reusing the wrong SE from the Wald-test mistake above.
:::
::: {.solution}
A hazard ratio is positive and its sampling distribution is skewed, so the
symmetric normal approximation is applied to $\log(HR)$, and the endpoints are
**then** exponentiated. For the heavy-vs-light comparison
($\widehat\Delta = 0.42$, $\text{SE} = 0.160$):
$$\text{log-scale CI:}\quad 0.42 \pm 1.96\times 0.160 = (0.106,\ 0.734)$$
$$\text{HR CI:}\quad \left(e^{0.106},\ e^{0.734}\right) = (1.11,\ 2.08)$$
```{r}
#| label: ci-hr
est <- 0.42
se <- sqrt(0.021905 + 0.024840 - 2 * 0.010504)
exp(est + c(-1, 1) * 1.96 * se) |> round(2)
```
The point estimate is $\widehat{HR} = e^{0.42} = 1.52$, and the 95% CI
$(1.11, 2.08)$ **excludes 1**, consistent with the significant Wald test in
@sec-wald. Always exponentiate the endpoints, and check that your CI and your
hypothesis test agree.
:::
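
The agreement check between the CI and the test can be made explicit. With the
same critical value (1.96) on the log scale, the 95% CI for the HR excludes 1
exactly when $|z| > 1.96$. A short base-R sketch with illustrative names:

```{r}
#| label: ci-test-agreement
log_hr_HL <- 0.42
se_HL <- sqrt(0.021905 + 0.024840 - 2 * 0.010504)

ci_hr <- exp(log_hr_HL + c(-1, 1) * 1.96 * se_HL)  # exponentiated endpoints
ci_excludes_1 <- ci_hr[1] > 1 || ci_hr[2] < 1
test_rejects <- abs(log_hr_HL / se_HL) > 1.96

c(ci_excludes_1 = ci_excludes_1, test_rejects = test_rejects)
```

If the two flags ever disagree, the interval and the test were built from
different standard errors or on different scales.
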
# Summary of the most common errors
| # | Exam Q | Topic | The fix |
|:-:|:------:|:------|:--------|
| 1 | 1.1 | Risk set after censoring (@sec-risk-set) | Remove **everyone** who has an event *or* is censored from later risk sets. |
| 2 | 1.1 | Nelson-Aalen (@sec-na-exp) | Sum hazards $d_j/n_j$, then $\hat S_{NA} = \exp{-\hat\Lambda}$. |
| 3 | 1.3 | Reading $\hat S(t)$ (@sec-step-function) | Step function: use the most recent event time $\le t$; give the number. |
| 4 | 1.4 | Median survival (@sec-median) | First $t$ with $\hat S(t) \le 0.5$ — not a mean, not an exponential fit. |
| 5 | 1.5 | Role of censoring (@sec-censoring-role) | Denominator until censoring; never the event numerator. |
| 6 | 2.1 | Writing the model (@sec-write-model) | Name **and** apply every assumption; baseline hazard stays unspecified. |
| 7 | 2.2 | Interpreting an HR (@sec-interpret) | Magnitude + reference + adjustment + significance; HR $\ne$ risk/odds. |
| 8 | 2.3 | HR for a non-unit change (@sec-hr-multiunit) | $HR^{\,c} = \exp{c\beta}$ — exponentiate, don't multiply. |
| 9 | 2.4 | Comparing two coefficients (@sec-wald) | $\V(\hat\beta_H - \hat\beta_L) = \V_H + \V_L - 2\,\Covt$. |
| 10 | 2.5 | CI for an HR (@sec-ci-hr) | Build on the log scale, then exponentiate the endpoints. |