Georgia birthweight data

This article examines whether maternal age and interpregnancy interval are associated with infant birthweight in clustered Georgia birth records, illustrating the challenges of correlated outcomes in RMB2e Chapter 7.

1 Introduction

Birthweight is a key neonatal health indicator strongly predicted by gestational age, maternal weight, and parity-related factors. When multiple births are recorded per mother, infant outcomes within the same mother are correlated, which violates the independence assumption of ordinary linear regression. The Georgia births dataset (gababies) provides a clustered sample of 1,000 birth records from 200 mothers with five recorded deliveries each, enabling study of between- and within-mother predictors of birthweight (RMB2e Ch. 7).
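
To make the within-mother correlation concrete, the following is a minimal sketch on simulated data, not on gababies. The mean and the two standard deviations are illustrative assumptions; the point is only that a persistent mother-level effect induces correlation between births to the same mother.

```r
# Simulated data only: the values below are illustrative assumptions,
# not estimates from gababies.
set.seed(1)
n_mothers <- 2000
sd_mother <- 350   # assumed between-mother SD (grams)
sd_birth  <- 450   # assumed within-mother SD (grams)

# Each mother has a persistent effect shared by all of her births.
u <- rnorm(n_mothers, mean = 0, sd = sd_mother)
birth1 <- 3100 + u + rnorm(n_mothers, sd = sd_birth)
birth2 <- 3100 + u + rnorm(n_mothers, sd = sd_birth)

# Intraclass correlation implied by the assumed SDs.
icc_theory <- sd_mother^2 / (sd_mother^2 + sd_birth^2)

# Empirical correlation between two births to the same mother.
icc_empirical <- cor(birth1, birth2)

round(c(theory = icc_theory, empirical = icc_empirical), 2)
```

Under these assumptions the two quantities should agree closely, which is why a model that treats all births as independent misstates how much information the sample carries.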

Code
data(rmb_datasets, package = "rmb")
rmb_datasets$study_design[rmb_datasets$object == "gababies"]
#> [1] "Clustered repeated-birth sample from Georgia mothers used for birthweight analyses."

Are maternal age and interpregnancy interval associated with infant birthweight, after accounting for clustering of births within mothers?

1.1 Causal assumptions

Code
set.seed(42)
dag <- ggdag::dagify(
  bw ~ momage + interval + initwt,
  interval ~ momage,
  labels = c(
    bw = "Birthweight",
    momage = "Maternal age",
    interval = "Interpregnancy interval",
    initwt = "Initial maternal weight"
  ),
  exposure = "interval",
  outcome = "bw"
)
ggdag::ggdag(dag, use_labels = "label", text = FALSE) +
  ggdag::theme_dag_blank() +
  ggplot2::labs(title = "Georgia births: Causal DAG")

2 Methods

2.1 Study sample

Code
data(gababies, package = "rmb")
dat <- gababies
dim(dat)
#> [1] 1000   11
summary(haven::zap_labels(dat[c("bweight", "cinitage", "timesnc", "momage", "delwght")]))
#>     bweight        cinitage         timesnc           momage     
#>  Min.   : 340   Min.   :-5.545   Min.   :-8.000   Min.   :12.00  
#>  1st Qu.:2835   1st Qu.:-2.545   1st Qu.: 1.000   1st Qu.:18.00  
#>  Median :3175   Median :-0.545   Median : 4.000   Median :21.00  
#>  Mean   :3135   Mean   : 0.000   Mean   : 4.088   Mean   :21.63  
#>  3rd Qu.:3487   3rd Qu.: 0.455   3rd Qu.: 6.000   3rd Qu.:24.00  
#>  Max.   :5018   Max.   :14.455   Max.   :81.000   Max.   :99.00  
#>     delwght       
#>  Min.   :-1551.0  
#>  1st Qu.: -190.0  
#>  Median :  164.0  
#>  Mean   :  191.6  
#>  3rd Qu.:  482.0  
#>  Max.   : 2700.0

2.2 Statistical analysis

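A linear regression of birthweight on centered maternal age at first birth (cinitage), interpregnancy interval (timesnc), and current maternal age (momage) is fitted by OLS, ignoring clustering, as a baseline. In these data momage is an exact linear combination of the intercept, cinitage, and timesnc, so lm() cannot estimate a separate coefficient for it and reports NA; the fitted model is effectively bweight ~ cinitage + timesnc. The discussion notes the need for mixed-effects models to account for within-mother correlation (RMB2e Ch. 7).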
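
The way lm() handles a linearly dependent predictor can be sketched on simulated data. The variable names mirror gababies, but the values are made up, and the relationship momage = age at first birth + timesnc is an assumption chosen to produce an exact dependence; the source does not state how momage was constructed.

```r
# Simulated data only; names mirror gababies, values are made up.
set.seed(2)
n <- 200
initage <- round(runif(n, 15, 30))   # hypothetical age at first birth
timesnc <- round(runif(n, 0, 10))    # hypothetical time variable

sim <- data.frame(
  cinitage = initage - mean(initage),
  timesnc  = timesnc,
  momage   = initage + timesnc       # assumed exact linear dependence
)
sim$bweight <- 3000 + 25 * sim$cinitage + 15 * sim$timesnc +
  rnorm(n, sd = 500)

fit_sim <- lm(bweight ~ cinitage + timesnc + momage, data = sim)

coef(fit_sim)             # the momage coefficient is NA
alias(fit_sim)$Complete   # shows how momage depends on the other terms
```

Checking alias() before interpreting coefficients is one way to confirm which term was dropped and why, rather than reading the NA row as a missing result.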

Code
formula_main <- bweight ~ cinitage + timesnc + momage
formula_main
#> bweight ~ cinitage + timesnc + momage

3 Results

3.1 Descriptive statistics

Code
with(dat, tapply(bweight, birthord, summary))
#> $`1`
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>     815    2818    3051    3017    3349    4508 
#> 
#> $`2`
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1021    2835    3154    3111    3430    4678 
#> 
#> $`3`
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>     340    2828    3202    3147    3521    4960 
#> 
#> $`4`
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>     910    2892    3246    3194    3525    4780 
#> 
#> $`5`
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1210    2854    3218    3208    3548    5018
with(dat, c(n_mothers = length(unique(momid)), n_births = nrow(dat)))
#> n_mothers  n_births 
#>       200      1000
summary(haven::zap_labels(dat[c("bweight", "cinitage", "timesnc")]))
#>     bweight        cinitage         timesnc      
#>  Min.   : 340   Min.   :-5.545   Min.   :-8.000  
#>  1st Qu.:2835   1st Qu.:-2.545   1st Qu.: 1.000  
#>  Median :3175   Median :-0.545   Median : 4.000  
#>  Mean   :3135   Mean   : 0.000   Mean   : 4.088  
#>  3rd Qu.:3487   3rd Qu.: 0.455   3rd Qu.: 6.000  
#>  Max.   :5018   Max.   :14.455   Max.   :81.000

Code
ggplot2::ggplot(dat, ggplot2::aes(x = factor(birthord), y = bweight)) +
  ggplot2::geom_boxplot(fill = "grey85") +
  ggplot2::labs(
    title = "Georgia births: Birthweight by birth order",
    x = "Birth order",
    y = "Birthweight (grams)"
  ) +
  ggplot2::theme_minimal()

3.2 Model estimates

Code
fit <- stats::lm(formula_main, data = dat)
summary(fit)
#> 
#> Call:
#> stats::lm(formula = formula_main, data = dat)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -2990.00  -301.41    31.64   332.68  1721.94 
#> 
#> Coefficients: (1 not defined because of singularities)
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 3079.486     23.716 129.846  < 2e-16 ***
#> cinitage      26.082      5.608   4.650 3.76e-06 ***
#> timesnc       13.693      3.766   3.636 0.000291 ***
#> momage            NA         NA      NA       NA    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 570.5 on 997 degrees of freedom
#> Multiple R-squared:  0.03482,    Adjusted R-squared:  0.03288 
#> F-statistic: 17.98 on 2 and 997 DF,  p-value: 2.129e-08

3.3 Model diagnostics

Code
fit_data <- data.frame(
  fitted = stats::fitted(fit),
  residuals = stats::residuals(fit),
  std_residuals = stats::rstandard(fit)
)

ggplot2::ggplot(fit_data, ggplot2::aes(x = fitted, y = residuals)) +
  ggplot2::geom_point() +
  ggplot2::geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  ggplot2::geom_smooth(se = FALSE, color = "blue") +
  ggplot2::labs(
    title = "Residuals vs Fitted",
    x = "Fitted values",
    y = "Residuals"
  ) +
  ggplot2::theme_minimal()

ggplot2::ggplot(fit_data, ggplot2::aes(sample = std_residuals)) +
  ggplot2::stat_qq() +
  ggplot2::stat_qq_line(color = "red") +
  ggplot2::labs(
    title = "Normal Q-Q",
    x = "Theoretical Quantiles",
    y = "Standardized residuals"
  ) +
  ggplot2::theme_minimal()

3.4 Inference

Code
ci <- stats::confint(fit)
coefs <- summary(fit)$coefficients
ci_sub <- ci[rownames(coefs), , drop = FALSE]
data.frame(
  term = rownames(coefs),
  estimate = coefs[, "Estimate"],
  conf_low = ci_sub[, 1],
  conf_high = ci_sub[, 2],
  p_value = coefs[, "Pr(>|t|)"]
)
#>                    term   estimate    conf_low  conf_high      p_value
#> (Intercept) (Intercept) 3079.48641 3032.946433 3126.02639 0.000000e+00
#> cinitage       cinitage   26.08179   15.076064   37.08752 3.757601e-06
#> timesnc         timesnc   13.69315    6.303301   21.08299 2.908827e-04

4 Discussion

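The OLS estimates show the direction of the associations between maternal characteristics and birthweight, but the model-based standard errors are unreliable because births to the same mother are correlated (RMB2e Ch. 7); for a between-mother covariate such as cinitage they typically understate uncertainty. One standard remedy is a mixed-effects model with a random intercept for mother, which partitions the total variance into within-mother and between-mother components. The momage term was dropped as collinear (NA in the model output), so the fit reflects only cinitage and timesnc, which together explain little of the variation in birthweight (R-squared of about 0.035). The centered age at first birth (cinitage) carries the between-mother age effect, while timesnc captures the association between interpregnancy interval and birthweight.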
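
A random-intercept model of the kind described can be sketched as follows. This is an illustration on simulated data under stated assumptions, not a reanalysis of gababies: the effect sizes, variance components, and the evenly spaced timesnc values are all made up, and nlme::lme() is one of several possible fitting functions.

```r
library(nlme)  # recommended package distributed with R

# Simulated data only; names mirror gababies, all values are assumptions.
set.seed(3)
n_mothers <- 200
births_per_mother <- 5
initage <- rep(round(runif(n_mothers, 15, 30)), each = births_per_mother)
u <- rep(rnorm(n_mothers, sd = 350), each = births_per_mother)  # mother effect

sim <- data.frame(
  momid    = rep(seq_len(n_mothers), each = births_per_mother),
  cinitage = initage - mean(initage),
  timesnc  = rep(c(0, 2, 4, 6, 8), times = n_mothers)
)
sim$bweight <- 3000 + 25 * sim$cinitage + 15 * sim$timesnc + u +
  rnorm(nrow(sim), sd = 450)

# Random intercept for mother.
fit_ri <- lme(bweight ~ cinitage + timesnc, random = ~ 1 | momid, data = sim)

# Variance components and the implied intraclass correlation.
vc <- VarCorr(fit_ri)
var_between <- as.numeric(vc["(Intercept)", "Variance"])
var_within  <- as.numeric(vc["Residual", "Variance"])
icc <- var_between / (var_between + var_within)

# Naive OLS standard errors next to the random-intercept ones.
fit_ols <- lm(bweight ~ cinitage + timesnc, data = sim)
se_ols <- summary(fit_ols)$coefficients[, "Std. Error"]
se_ri  <- summary(fit_ri)$tTable[, "Std.Error"]
cbind(se_ols, se_ri)
```

In a balanced simulation like this one, the standard error for the between-mother covariate (cinitage) should be larger under the random-intercept model than under OLS, while the one for the within-mother covariate (timesnc) should be smaller. Ignoring clustering therefore does not bias all standard errors in the same direction.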

5 Source

  • UCSF Regression Methods companion data: https://regression.ucsf.edu/sites/g/files/tkssra16191/files/wysiwyg/home/data/gababies.dta
  • Book: Vittinghoff E, Glidden DV, Shiboski SC, McCulloch CE (2012). Regression Methods in Biostatistics (2nd edition).