Master’s Thesis

Investigating Determinants of Birth Weight Using Bayesian Tree-Based Nonparametric Modeling

Advisor: Dr. P. Richard Hahn, Department of Mathematical & Statistical Sciences, Arizona State University

Defense Date: May 22nd, 2025

Abstract

Low birth weight (LBW) remains a critical public-health indicator, strongly linked to higher neonatal mortality, developmental delays, and lifelong chronic disease. Using the 2021 U.S. Natality dataset (>3 million births), this thesis develops a Bayesian, tree-based, nonparametric framework that models the full birth weight distribution and quantifies LBW risk.


The raw dataset is condensed into 128 mutually exclusive maternal-infant classes, defined by seven dichotomous predictors, with the births in each class counted across 11 birth weight categories: ten quantile-based (10%) LBW categories, which add granularity in the LBW range, plus one aggregated normal-weight category. Classification and Regression Trees (CART) are grown using the marginal Dirichlet-Multinomial likelihood as the splitting criterion. This criterion is equipped to handle sparse observations, and the Dirichlet hyperparameters are informed by quantiles from the previous year's (2020) dataset to avoid "double dipping".


A 10,000-tree ensemble is grown using a two-tier parametric bootstrap, yielding highly stable predictions. Maternal race, smoking status, and marital status consistently drive the initial LBW risk stratification, identifying Black, smoking, unmarried mothers among the highest-risk subgroups. When the analysis is restricted to LBW births only, infant sex and maternal age supersede smoking and marital status as key discriminators, revealing finer biological gradients of risk. Ensemble predictions are well calibrated, and 95% bootstrap percentile intervals achieve nominal coverage.


The resulting framework combines the interpretability of decision trees with Bayesian uncertainty quantification, delivering actionable, clinically relevant insights for targeting maternal-health interventions among the most vulnerable subpopulations.


DM-CART Decision Tree

Figure 3: DM-CART decision tree structure for the 2021 dataset (Full Model).

Research Framework

My Master's thesis develops DM-CART (Dirichlet-Multinomial Classification and Regression Trees), a Bayesian nonparametric framework for analyzing birth weight distributions and predicting low birth weight (LBW) risk. The approach combines the interpretability of decision trees with the uncertainty quantification of Bayesian methods to provide actionable clinical insights for maternal health interventions.

Significance and Contribution

This research makes significant contributions to both methodological development and clinical applications:

  • Methodological: Extends traditional CART methodology with Dirichlet-Multinomial distributions for modeling count data with natural uncertainty quantification
  • Statistical: Demonstrates the effectiveness of two-tier parametric bootstrap for generating stable ensemble predictions with valid percentile intervals
  • Clinical: Provides actionable insights into maternal risk factors for low birth weight, aiding in early identification of at-risk pregnancies
  • Public Health: Offers targeted intervention strategies by identifying the most vulnerable demographic subgroups

Methodology

Data Sources and Preparation

The study utilized U.S. Natality datasets from the National Center for Health Statistics (NCHS):

  • 2021 dataset (>3 million births) for model development
  • 2020 dataset for informing Dirichlet priors and establishing quantile cutpoints

Key variables were transformed into seven binary predictors, creating 128 distinct maternal-infant profiles:

| Variable      | Source   | Description              | Encoding                                  |
|---------------|----------|--------------------------|-------------------------------------------|
| BOY           | sex      | Infant sex               | 1 = Male, 0 = Female                      |
| MARRIED       | dmar     | Maternal marital status  | 1 = Married, 0 = Not married              |
| BLACK         | mrace15  | Maternal race            | 1 = Black only, 0 = Other                 |
| OVER33        | mager    | Maternal age             | 1 = Over 33 years, 0 = 33 years or younger |
| HIGH SCHOOL   | meduc    | Maternal education       | 1 = High school graduate, 0 = Other       |
| FULL PRENATAL | precare5 | Prenatal care received   | 1 = Full prenatal care, 0 = Less than full care |
| SMOKER        | cig_0    | Smoking during pregnancy | 1 = Any smoking, 0 = No smoking           |
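
As a rough illustration of why seven binary indicators give 2^7 = 128 profiles, the sketch below enumerates every combination and assigns each one an integer ID. The `profile_id` column and its bit ordering are assumptions made for this example, and underscores replace the spaces in two variable names so that they are valid R names; this is not the thesis code.

```r
# Hypothetical sketch: enumerate all 2^7 = 128 binary profiles.
predictors <- c("BOY", "MARRIED", "BLACK", "OVER33",
                "HIGH_SCHOOL", "FULL_PRENATAL", "SMOKER")

# expand.grid builds every 0/1 combination of the seven indicators
profiles <- expand.grid(rep(list(0:1), length(predictors)))
names(profiles) <- predictors

# Assumed encoding: read the indicators as bits of an integer in 0..127
bit_weights <- 2^(seq_along(predictors) - 1)
profiles$profile_id <- as.vector(as.matrix(profiles[predictors]) %*% bit_weights)

nrow(profiles)  # 128
```

Each row of `profiles` would then carry a vector of counts over the birth weight categories.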

Birth weights were categorized into quantile-based categories to enable detailed analysis of the LBW region:

  • Type 1 Analysis: 10 quantile-based categories for low birth weight (≤2500g) plus 1 category for normal birth weight (>2500g)
  • Type 2 Analysis: 10 quantile-based categories focusing exclusively on low birth weight distributions
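
The Type 1 binning can be sketched roughly as follows, assuming the cutpoints are the deciles of the prior-year LBW weights. The data are simulated, and the names `bw_2020`, `bw_2021`, and `bin_weights` are hypothetical; the actual preprocessing may differ.

```r
# Hypothetical sketch: decile cutpoints from prior-year LBW weights,
# applied to current-year weights. Data are simulated, not NCHS data.
set.seed(1)
bw_2020 <- rnorm(5000, mean = 3300, sd = 550)  # grams (simulated)
bw_2021 <- rnorm(5000, mean = 3300, sd = 550)

lbw_cut <- 2500  # LBW threshold used in the text (<= 2500 g)

# Interior decile cutpoints estimated from prior-year LBW births only
lbw_2020 <- bw_2020[bw_2020 <= lbw_cut]
inner <- quantile(lbw_2020, probs = seq(0.1, 0.9, by = 0.1), names = FALSE)

# Type 1 breaks: 10 LBW bins plus one normal-weight bin (> 2500 g)
breaks <- c(-Inf, inner, lbw_cut, Inf)
bin_weights <- function(w) cut(w, breaks = breaks, labels = FALSE, right = TRUE)

# Counts of current-year births in each of the 11 categories
counts_2021 <- tabulate(bin_weights(bw_2021), nbins = 11)
```

Taking the cutpoints from the earlier year keeps the category definitions independent of the data being modeled. For a Type 2 style analysis, the same idea would apply with the normal-weight bin dropped.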

Model Development: DM-CART

The DM-CART algorithm extends traditional CART methodology by incorporating the Dirichlet-Multinomial distribution to model count data across birth weight categories:

log.dm.likelihood <- function(counts, alpha = 1) {
    # 'counts': an integer vector (n_1, ..., n_K)
    # 'alpha':  Dirichlet hyperparameter vector (a scalar is recycled to length K)

    N <- sum(counts)
    K <- length(counts)

    # Recycle a scalar alpha so that alpha_0 below sums over all K categories
    alpha <- rep_len(alpha, K)

    # sum of alpha over all categories
    alpha_0 <- sum(alpha)

    # Term 1: log Gamma(alpha_0) - log Gamma(N + alpha_0)
    term1 <- lgamma(alpha_0) - lgamma(N + alpha_0)

    # Term 2: sum over k of [log Gamma(n_k + alpha_k) - log Gamma(alpha_k)]
    term2 <- sum(lgamma(counts + alpha) - lgamma(alpha))

    # Total log-likelihood
    ll <- term1 + term2

    return(ll)
}
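
In symbols, with category counts n_k, total N = Σ n_k, and α_0 = Σ α_k, the function above evaluates the Dirichlet-Multinomial log marginal likelihood up to the multinomial coefficient, which does not depend on α and is omitted in the code:

```latex
\log p(n_1,\dots,n_K \mid \alpha)
  = \log\Gamma(\alpha_0) - \log\Gamma(N + \alpha_0)
  + \sum_{k=1}^{K}\bigl[\log\Gamma(n_k + \alpha_k) - \log\Gamma(\alpha_k)\bigr]
```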

Key components of the implementation include:

  • Custom rpart Method: Extended R's rpart package with specialized functions for initialization, evaluation, splitting, and prediction using the Dirichlet-Multinomial model
  • Splitting Criterion: Used the difference in marginal log-likelihood between splitting a node and leaving it unsplit
  • Informed Priors: Utilized 2020 data to establish quantile-based cutpoints and inform Dirichlet hyperparameters
  • Two-Tier Bootstrap: Implemented parametric bootstrap with 10,000 iterations for robust uncertainty quantification
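
The splitting criterion in the list above can be illustrated with a minimal sketch. The helper names `dm_loglik` and `split_gain`, the flat prior, and the toy counts are all assumptions for this example; the thesis's rpart split function is not reproduced here.

```r
# Hypothetical helper: DM log marginal likelihood for one node
dm_loglik <- function(counts, alpha) {
  lgamma(sum(alpha)) - lgamma(sum(counts) + sum(alpha)) +
    sum(lgamma(counts + alpha) - lgamma(alpha))
}

# Gain of a candidate split: children's log-likelihoods minus the parent's
split_gain <- function(left, right, alpha) {
  dm_loglik(left, alpha) + dm_loglik(right, alpha) -
    dm_loglik(left + right, alpha)
}

alpha <- rep(1, 3)  # flat prior, for illustration only

# Children with clearly different category distributions
gain_diff <- split_gain(c(40, 5, 5), c(5, 5, 40), alpha)

# Children with identical category distributions
gain_same <- split_gain(c(25, 5, 20), c(25, 5, 20), alpha)
```

In this toy case `gain_diff` is positive and `gain_same` is negative: the marginal likelihood rewards a split only when the children's distributions differ enough to justify the extra parameters.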

Key Technical Components

Tree Growing Process

The DM-CART algorithm follows these key steps:

  1. Initialize the tree with the complete dataset of 128 maternal-infant profiles
  2. Calculate the node-specific Dirichlet-Multinomial log-likelihood
  3. Evaluate potential splits based on likelihood ratio improvement
  4. Grow the tree to a specified depth (max depth = 8 in the primary analysis)
  5. Apply Dirichlet smoothing to the predicted probabilities
  6. Generate bootstrap ensembles for uncertainty quantification

dm.method <- list(
    init=myinit,      # Initialize the DM tree
    eval=myeval,      # Evaluate nodes with DM likelihood
    split=mysplit,    # DM likelihood ratio splitting
    pred=mypred,      # Prediction with Dirichlet smoothing
    method="dm"
)

# Control parameters
dm.control <- rpart.control(
    minsplit=2,    # Minimum observations for a split
    cp=0,          # Complexity parameter
    maxdepth=8,    # Maximum tree depth
    xval=0         # No cross-validation
)
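
One common form of Dirichlet smoothing (step 5) is the posterior mean of the category probabilities in a leaf. The sketch below assumes that form; the function name `dm_smooth` and the toy counts are hypothetical, and the thesis's `mypred` may differ.

```r
# Hypothetical sketch: Dirichlet-smoothed class probabilities for one leaf
dm_smooth <- function(counts, alpha) {
  alpha <- rep_len(alpha, length(counts))  # allow a scalar alpha
  (counts + alpha) / (sum(counts) + sum(alpha))
}

leaf_counts <- c(0, 3, 7)  # a sparse leaf with an empty category
p_hat <- dm_smooth(leaf_counts, alpha = 1)
p_hat                      # 1/13, 4/13, 8/13
```

The empty category receives a small positive probability instead of zero, which is what makes the estimates usable in sparse leaves.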

Key Findings

Risk Determinants

The research revealed several important insights about determinants of low birth weight risk:

  • Primary Risk Factors: Three variables consistently appeared at the top of the decision trees and had the highest variable importance scores:
    • Maternal race (mrace15): Black mothers showed significantly higher LBW risk
    • Smoking status (cig_0): Maternal smoking during pregnancy strongly increased LBW risk
    • Marital status (dmar): Unmarried mothers had elevated LBW risk
  • High-Risk Profile: The highest-risk maternal-infant profile identified was:
    • Black (BLACK=1), Unmarried (MARRIED=0), Smoker (SMOKER=1), Age ≤33 years (OVER33=0), Less than high school education (HIGH SCHOOL=0), Inadequate prenatal care (FULL PRENATAL=0), Female infant (BOY=0)
  • LBW-Only Analysis: When analyzing only LBW births (Type 2 analysis), different patterns emerged:
    • Infant sex (BOY) became a more important discriminator
    • Maternal age (OVER33) gained significance
    • Smoking and marital status decreased in relative importance
  • Risk Quantification: Bootstrap estimates showed little variability, so the 95% percentile intervals were narrow and allowed precise quantification of risk differences between subgroups

Statistical Validation

The DM-CART approach demonstrated excellent statistical properties:

  • Bootstrap Stability: Across 10,000 bootstrap samples, the key variables and tree structures showed high consistency
  • Percentile Interval Coverage: 95% percentile intervals achieved nominal coverage in validation tests
  • Prediction Calibration: Predicted probabilities aligned well with observed frequencies
  • Variable Importance Stability: Consistent variable rankings across different tree depths and bootstrap iterations
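
To show how a bootstrap percentile interval is obtained, here is a deliberately simplified sketch: a single-level parametric bootstrap for one leaf with made-up counts. It is not the two-tier scheme used in the thesis, and all names and numbers are assumptions.

```r
# Hypothetical sketch: single-level parametric bootstrap for one leaf.
# This is a simplification, not the two-tier scheme used in the thesis.
set.seed(42)
counts <- c(120, 380, 9500)  # made-up leaf counts: 2 LBW bins + normal
alpha  <- rep(1, 3)          # flat prior, for illustration only
p_hat  <- (counts + alpha) / sum(counts + alpha)

B <- 2000
# Each column of 'draws' is one simulated count vector of the same size
draws <- rmultinom(B, size = sum(counts), prob = p_hat)

# LBW risk in each replicate: smoothed share of the first two bins
lbw_boot <- (colSums(draws[1:2, , drop = FALSE]) + sum(alpha[1:2])) /
  (sum(counts) + sum(alpha))

# 95% percentile interval: the 2.5% and 97.5% quantiles of the replicates
ci <- quantile(lbw_boot, probs = c(0.025, 0.975), names = FALSE)
```

Coverage could then be assessed by repeating this on data simulated from known probabilities and recording how often the interval contains the true value.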

Impact and Applications

This framework provides multiple benefits for clinical practice and public health:

  • Clinical Decision Support: The DM-CART model can be used to screen mothers early in pregnancy, identifying those at highest risk for delivering low birth weight infants
  • Intervention Targeting: Public health resources can be more effectively directed toward the identified high-risk subpopulations
  • Risk Communication: The tree structure allows for intuitive explanation of risk factors to patients and healthcare providers
  • Methodology Transfer: The DM-CART framework can be adapted to other clinical prediction contexts where count data and uncertainty quantification are important
  • Health Disparities Research: The model highlights demographic factors contributing to birth weight disparities, informing policy interventions

Technical Implementation

The full implementation of DM-CART consists of several interrelated R scripts:

  • quantile.R: Establishes quantile-based cutpoints and creates informed priors from the 2020 dataset
  • data_rebin.R: Preprocesses the 2021 dataset, creates the 128 maternal-infant profiles, and bins birth weights according to established quantiles
  • dm-cart.R: Implements the core DM-CART algorithm, including:
    • Custom extension of rpart with Dirichlet-Multinomial likelihood functions
    • Two-tier parametric bootstrap for uncertainty quantification
    • Risk prediction and visualization
    • Variable importance analysis

For complete technical details and code implementation, visit the GitHub repository.

Committee

  • Advisor: Dr. P. Richard Hahn, Department of Mathematical & Statistical Sciences, Arizona State University
  • Committee Members:
    • Dr. Shuang Zhou, Department of Mathematical & Statistical Sciences, Arizona State University
    • Dr. Shiwei Lan, Department of Mathematical & Statistical Sciences, Arizona State University

Conclusion

The DM-CART framework represents a significant advancement in both methodological development and clinical application. By combining the interpretability of decision trees with the statistical rigor of Bayesian methods, this research delivers a powerful tool for maternal-infant risk stratification with potential for real-world clinical impact.

This work demonstrates my ability to bridge sophisticated statistical methodology with practical healthcare applications, developing innovative solutions to complex public health challenges.