Master’s Thesis
Investigating Determinants of Birth Weight Using Bayesian Tree-Based Nonparametric Modeling
Advisor: Dr. P. Richard Hahn, Department of Mathematical & Statistical Sciences, Arizona State University
Defense Date: May 22nd, 2025
Abstract
Low birth weight (LBW) remains a critical public-health indicator, linked strongly with higher neonatal mortality, developmental delays, and lifelong chronic diseases. Using the 2021 U.S. Natality dataset (> 3 million births), this thesis develops a Bayesian, tree-based, nonparametric framework that models the full birth weight distribution and quantifies LBW risk.
The raw dataset is condensed into 128 mutually exclusive classes defined by seven dichotomous maternal-infant predictors, with birth weight grouped into 11 categories: ten LBW categories cut at 10% quantiles plus one aggregated normal-weight category, which adds granularity within the LBW range. Classification and Regression Trees (CART) are grown using the marginal Dirichlet-Multinomial likelihood as the splitting criterion. This criterion is equipped to handle sparse observations, with the Dirichlet hyperparameters informed by quantiles from the previous year's (2020) dataset to avoid "double dipping".
Employing a two-tier parametric bootstrap resampling technique, a 10,000-tree ensemble is grown, yielding highly stable prediction estimates. Maternal race, smoking status, and marital status consistently drive the initial LBW risk stratification, identifying Black, smoking, unmarried mothers among the highest-risk subgroups. When the analysis is restricted to LBW births only, infant sex and maternal age supersede smoking and marital status as key discriminators, revealing finer biological gradients of risk. Ensemble predictions are well calibrated, and 95% bootstrap percentile intervals achieve nominal coverage.
The resulting framework combines the interpretability of decision trees with Bayesian uncertainty quantification, delivering actionable, clinically relevant insights for targeting maternal-health interventions among the most vulnerable subpopulations.
Figure 3: DM-CART decision tree structure for the 2021 dataset (Full Model).
Research Framework
My Master's thesis developed a novel Bayesian nonparametric framework called DM-CART (Dirichlet-Multinomial Classification and Regression Trees) to analyze birth weight distributions and predict low birth weight (LBW) risk. This innovative approach combines the interpretability of decision trees with the uncertainty quantification of Bayesian methods to provide actionable clinical insights for maternal health interventions.
Significance and Contribution
This research makes significant contributions to both methodological development and clinical applications:
- Methodological: Extends traditional CART methodology with Dirichlet-Multinomial distributions for modeling count data with natural uncertainty quantification
- Statistical: Demonstrates the effectiveness of two-tier parametric bootstrap for generating stable ensemble predictions with valid percentile intervals
- Clinical: Provides actionable insights into maternal risk factors for low birth weight, aiding in early identification of at-risk pregnancies
- Public Health: Offers targeted intervention strategies by identifying the most vulnerable demographic subgroups
Methodology
Data Sources and Preparation
The study utilized U.S. Natality datasets from the National Center for Health Statistics (NCHS):
- 2021 dataset (>3 million births) for model development
- 2020 dataset for informing Dirichlet priors and establishing quantile cutpoints
Key variables were transformed into seven binary predictors, creating 128 distinct maternal-infant profiles:
Variable | Source | Description | Encoding |
---|---|---|---|
BOY | sex | Infant sex | 1 = Male, 0 = Female |
MARRIED | dmar | Maternal marital status | 1 = Married, 0 = Not married |
BLACK | mrace15 | Maternal race | 1 = Black only, 0 = Other |
OVER33 | mager | Maternal age | 1 = Over 33 years, 0 = 33 years or younger |
HIGH SCHOOL | meduc | Maternal education | 1 = High school graduate, 0 = Other |
FULL PRENATAL | precare5 | Prenatal care received | 1 = Full prenatal care, 0 = Less than full care |
SMOKER | cig_0 | Smoking during pregnancy | 1 = Any smoking, 0 = No smoking |
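Seven binary predictors give 2^7 = 128 possible profiles. The sketch below enumerates them in R. It is illustrative only: the `profiles` data frame, the underscored column names, and the `profile_id` column are assumptions made for this example, and the recoding of the raw NCHS fields into 0/1 indicators is not shown.

```r
# Enumerate every combination of the seven binary predictors.
# Names follow the table above, with underscores added to the two
# multi-word predictors so they are syntactic R names (an assumption).
predictors <- c("BOY", "MARRIED", "BLACK", "OVER33",
                "HIGH_SCHOOL", "FULL_PRENATAL", "SMOKER")

profiles <- expand.grid(rep(list(0:1), length(predictors)))
names(profiles) <- predictors

# Hypothetical profile identifier: read the seven indicators as binary digits.
place_values <- 2^(seq_along(predictors) - 1)
profiles$profile_id <-
  as.vector(as.matrix(profiles[predictors]) %*% place_values) + 1

nrow(profiles)  # number of distinct maternal-infant profiles
```

Each birth record would then map to exactly one row of `profiles`, so the data can be stored as a 128-row table of category counts rather than millions of individual records.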
Birth weights were categorized into quantile-based categories to enable detailed analysis of the LBW region:
- Type 1 Analysis: 10 quantile-based categories for low birth weight (≤2500g) plus 1 category for normal birth weight (>2500g)
- Type 2 Analysis: 10 quantile-based categories focusing exclusively on low birth weight distributions
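The two binning schemes can be sketched in R as follows. This is a minimal illustration on synthetic weights, not the thesis's `quantile.R` or `data_rebin.R`: the variable names, the simulated normal distribution, and the choice of right-closed intervals are all assumptions.

```r
set.seed(1)

# Synthetic birth weights in grams (illustrative only, not NCHS data).
bw_2020 <- rnorm(50000, mean = 3300, sd = 550)
bw_2021 <- rnorm(50000, mean = 3300, sd = 550)

lbw_cutoff <- 2500

# Decile cutpoints estimated from the prior-year LBW births only.
lbw_2020 <- bw_2020[bw_2020 <= lbw_cutoff]
cuts <- quantile(lbw_2020, probs = seq(0.1, 0.9, by = 0.1), names = FALSE)

# Type 1: ten LBW categories plus one aggregated normal-weight category.
breaks_type1 <- c(-Inf, cuts, lbw_cutoff, Inf)
cat_type1 <- cut(bw_2021, breaks = breaks_type1, labels = FALSE)

# Type 2: restrict to LBW births and keep only the ten LBW categories.
is_lbw <- bw_2021 <= lbw_cutoff
cat_type2 <- cut(bw_2021[is_lbw], breaks = c(-Inf, cuts, lbw_cutoff),
                 labels = FALSE)

table(cat_type1)
table(cat_type2)
```

Taking the cutpoints from one year and applying them to the next keeps the category boundaries independent of the data being modeled.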
Model Development: DM-CART
The DM-CART algorithm extends traditional CART methodology by incorporating the Dirichlet-Multinomial distribution to model count data across birth weight categories:
log.dm.likelihood <- function(counts, alpha = 1) {
  # 'counts': an integer vector (n_1, ..., n_K)
  # 'alpha': Dirichlet hyperparameter vector (a scalar is recycled to length K)
  N <- sum(counts)
  K <- length(counts)
  # Expand a scalar alpha so that alpha_0 sums over all K categories
  if (length(alpha) == 1) alpha <- rep(alpha, K)
  stopifnot(length(alpha) == K)
  # Sum of alpha over all categories
  alpha_0 <- sum(alpha)
  # Term 1: log Gamma(alpha_0) - log Gamma(N + alpha_0)
  term1 <- lgamma(alpha_0) - lgamma(N + alpha_0)
  # Term 2: sum over k of [log Gamma(n_k + alpha_k) - log Gamma(alpha_k)]
  term2 <- sum(lgamma(counts + alpha) - lgamma(alpha))
  # Total marginal log-likelihood
  ll <- term1 + term2
  return(ll)
}
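For reference, the quantity this function computes can be written as below. The symbols mirror the code (`counts` as n_k, `alpha` as alpha_k, `N`, and `alpha_0`). No multinomial coefficient appears because this is the marginal likelihood of the ordered sequence of categorical outcomes, which matches the code.

```latex
\log p(n_1,\dots,n_K \mid \alpha)
  = \log\Gamma(\alpha_0) - \log\Gamma(N + \alpha_0)
  + \sum_{k=1}^{K} \bigl[ \log\Gamma(n_k + \alpha_k) - \log\Gamma(\alpha_k) \bigr],
\qquad N = \sum_{k=1}^{K} n_k, \quad \alpha_0 = \sum_{k=1}^{K} \alpha_k .
```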
Key components of the implementation include:
- Custom rpart Method: Extended R's rpart package with specialized functions for initialization, evaluation, splitting, and prediction using the Dirichlet-Multinomial model
- Splitting Criterion: Used marginal log-likelihood differences between splitting versus not splitting nodes
- Informed Priors: Utilized 2020 data to establish quantile-based cutpoints and inform Dirichlet hyperparameters
- Two-Tier Bootstrap: Implemented parametric bootstrap with 10,000 iterations for robust uncertainty quantification
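The splitting criterion can be illustrated with a small self-contained sketch. The helpers `dm_loglik` and `split_gain` are hypothetical names written for this example, and using the same `alpha` for parent and child nodes is an assumption; the thesis's `mysplit` function may differ in detail.

```r
# Dirichlet-Multinomial marginal log-likelihood for one node.
# dm_loglik is a hypothetical helper written for this sketch.
dm_loglik <- function(counts, alpha) {
  stopifnot(length(counts) == length(alpha))
  lgamma(sum(alpha)) - lgamma(sum(counts) + sum(alpha)) +
    sum(lgamma(counts + alpha) - lgamma(alpha))
}

# Gain from splitting a parent node into left and right children:
# positive values favor the split, negative values favor no split.
split_gain <- function(left, right, alpha) {
  dm_loglik(left, alpha) + dm_loglik(right, alpha) -
    dm_loglik(left + right, alpha)
}

alpha <- c(1, 1)
gain_separated <- split_gain(c(50, 0), c(0, 50), alpha)    # children differ
gain_identical <- split_gain(c(25, 25), c(25, 25), alpha)  # children agree
```

In this toy case the gain is positive when the children have different outcome distributions and negative when they are identical, so the marginal likelihood penalizes splits that explain nothing.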
Key Technical Components
Tree Growing Process
The DM-CART algorithm follows these key steps:
- Initialize the tree with the complete dataset of 128 maternal-infant profiles
- Calculate the node-specific Dirichlet-Multinomial log-likelihood
- Evaluate potential splits based on likelihood ratio improvement
- Grow the tree to a specified depth (max depth = 8 in the primary analysis)
- Apply Dirichlet smoothing to the predicted probabilities
- Generate bootstrap ensembles for uncertainty quantification
dm.method <- list(
  init = myinit,    # Initialize the DM tree
  eval = myeval,    # Evaluate nodes with DM likelihood
  split = mysplit,  # DM likelihood ratio splitting
  pred = mypred,    # Prediction with Dirichlet smoothing
  method = "dm"
)
# Control parameters
dm.control <- rpart.control(
  minsplit = 2,  # Minimum observations for a split
  cp = 0,        # Complexity parameter
  maxdepth = 8,  # Maximum tree depth
  xval = 0       # No cross-validation
)
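The Dirichlet smoothing step listed among the tree-growing steps can be sketched as a posterior-mean estimate, (n_k + alpha_k) / (N + alpha_0). Whether `mypred` uses exactly this form is an assumption; `dirichlet_smooth`, the counts, and the hyperparameters below are illustrative.

```r
# Posterior-mean (smoothed) category probabilities for a leaf node.
# dirichlet_smooth is a hypothetical helper, not the thesis's mypred.
dirichlet_smooth <- function(counts, alpha) {
  stopifnot(length(counts) == length(alpha), all(alpha > 0))
  (counts + alpha) / (sum(counts) + sum(alpha))
}

leaf_counts <- c(12, 0, 3, 5)  # synthetic counts with one empty category
alpha <- rep(0.5, 4)           # illustrative hyperparameters
p_hat <- dirichlet_smooth(leaf_counts, alpha)
p_hat
```

The empty category still receives a small positive probability, which is what makes the estimate usable for sparse leaves.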
Key Findings
Risk Determinants
The research revealed several important insights about determinants of low birth weight risk:
- Primary Risk Factors: Three variables consistently appeared at the top of the decision trees and had the highest variable importance scores:
- Maternal race (mrace15): Black mothers showed significantly higher LBW risk
- Smoking status (cig_0): Maternal smoking during pregnancy strongly increased LBW risk
- Marital status (dmar): Unmarried mothers had elevated LBW risk
- High-Risk Profile: The highest-risk maternal-infant profile identified was:
- Black (BLACK=1), Unmarried (MARRIED=0), Smoker (SMOKER=1), Age ≤33 years (OVER33=0), Less than high school education (HIGH SCHOOL=0), Inadequate prenatal care (FULL PRENATAL=0), Female infant (BOY=0)
- Patterns Within LBW Births: When the analysis was restricted to LBW births (Type 2 analysis), different patterns emerged:
- Infant sex (BOY) became a more important discriminator
- Maternal age (OVER33) gained significance
- Smoking and marital status decreased in relative importance
- Risk Quantification: Bootstrap estimates showed little variability, giving narrow 95% percentile intervals and allowing precise quantification of risk differences between subgroups
Statistical Validation
The DM-CART approach demonstrated excellent statistical properties:
- Bootstrap Stability: Across 10,000 bootstrap samples, the key variables and tree structures showed high consistency
- Percentile Interval Coverage: 95% percentile intervals achieved nominal coverage in validation tests
- Prediction Calibration: Predicted probabilities aligned well with observed frequencies
- Variable Importance Stability: Consistent variable rankings across different tree depths and bootstrap iterations
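A two-tier parametric bootstrap with percentile intervals might look like the sketch below. The definition of the two tiers here is an assumption made for illustration: profile totals are resampled first, then category counts within each profile. The counts are synthetic, and Dirichlet smoothing stands in for regrowing a tree on each replicate.

```r
set.seed(42)

# Synthetic counts: 4 profiles (rows) x 3 outcome categories (columns).
counts <- matrix(c(40, 10,  5,
                   30, 20, 10,
                   10, 10, 30,
                    5, 15, 60), nrow = 4, byrow = TRUE)
alpha <- rep(1, ncol(counts))
n_total <- sum(counts)

# Fitted quantities that drive the parametric resampling.
profile_prob <- rowSums(counts) / n_total
cat_prob <- (counts + rep(alpha, each = nrow(counts))) /
  (rowSums(counts) + sum(alpha))

B <- 500  # far fewer than the 10,000 replicates reported in the thesis
boot_p1 <- matrix(NA_real_, nrow = B, ncol = nrow(counts))
for (b in seq_len(B)) {
  # Tier 1 (assumed): resample how many births fall in each profile.
  n_profile <- as.vector(rmultinom(1, n_total, profile_prob))
  for (j in seq_len(nrow(counts))) {
    # Tier 2 (assumed): resample category counts within the profile.
    y <- as.vector(rmultinom(1, n_profile[j], cat_prob[j, ]))
    # Stand-in for refitting: smoothed probability of category 1.
    boot_p1[b, j] <- (y[1] + alpha[1]) / (sum(y) + sum(alpha))
  }
}

# 95% bootstrap percentile interval per profile (2 x 4 matrix).
ci <- apply(boot_p1, 2, quantile, probs = c(0.025, 0.975))
ci
```

Coverage could then be checked by repeating this procedure on data simulated from known probabilities and counting how often the intervals contain the truth.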
Impact and Applications
This framework provides multiple benefits for clinical practice and public health:
- Clinical Decision Support: The DM-CART model can be used to screen mothers early in pregnancy, identifying those at highest risk for delivering low birth weight infants
- Intervention Targeting: Public health resources can be more effectively directed toward the identified high-risk subpopulations
- Risk Communication: The tree structure allows for intuitive explanation of risk factors to patients and healthcare providers
- Methodology Transfer: The DM-CART framework can be adapted to other clinical prediction contexts where count data and uncertainty quantification are important
- Health Disparities Research: The model highlights demographic factors contributing to birth weight disparities, informing policy interventions
Technical Implementation
The full implementation of DM-CART consists of several interrelated R scripts:
- quantile.R: Establishes quantile-based cutpoints and creates informed priors from the 2020 dataset
- data_rebin.R: Preprocesses the 2021 dataset, creates the 128 maternal-infant profiles, and bins birth weights according to established quantiles
- dm-cart.R: Implements the core DM-CART algorithm, including:
- Custom extension of rpart with Dirichlet-Multinomial likelihood functions
- Two-tier parametric bootstrap for uncertainty quantification
- Risk prediction and visualization
- Variable importance analysis
For complete technical details and code implementation, visit the GitHub repository.
Committee
- Advisor: Dr. P. Richard Hahn, Department of Mathematical & Statistical Sciences, Arizona State University
- Committee Members:
- Dr. Shuang Zhou, Department of Mathematical & Statistical Sciences, Arizona State University
- Dr. Shiwei Lan, Department of Mathematical & Statistical Sciences, Arizona State University
Conclusion
The DM-CART framework represents a significant advancement in both methodological development and clinical application. By combining the interpretability of decision trees with the statistical rigor of Bayesian methods, this research delivers a powerful tool for maternal-infant risk stratification with potential for real-world clinical impact.
This work demonstrates my ability to bridge sophisticated statistical methodology with practical healthcare applications, developing innovative solutions to complex public health challenges.