Hack4Health · Byte 2 Beat 2026

Early Detection of Cardiovascular Disease with Interpretable ML

An ensemble machine learning system trained on NHANES 2017-2018, the CDC's nationally representative population health survey. It delivers calibrated CVD risk scores with a SHAP-based explanation for every prediction.

5,569

Adults Analyzed

0.800

Best AUROC

Clinical Features

5+1

Models in Ensemble

Dataset

NHANES 2017-2018 — CDC National Health and Nutrition Examination Survey

Why NHANES?

10x+

Larger than the standard UCI Heart Disease dataset: 9,254 NHANES participants, of whom 5,569 adults were analyzed, vs ~303 records

Data Modalities

Interview + Physical Exam + Laboratory + Questionnaire — all merged by participant ID
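The merge step above could be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: it assumes each modality is already loaded as a pandas DataFrame, that the participant ID column is `SEQN` (the respondent sequence number used in NHANES files), and that an inner join is wanted. The helper name `merge_by_participant` and the toy columns are hypothetical.

```python
from functools import reduce

import pandas as pd


def merge_by_participant(tables, id_col="SEQN"):
    """Inner-join a list of DataFrames on the participant ID column.

    The inner join is an assumption: participants missing from any
    one table are dropped from the merged result.
    """
    return reduce(
        lambda left, right: left.merge(right, on=id_col, how="inner"),
        tables,
    )


# Tiny made-up tables standing in for the four data modalities.
interview = pd.DataFrame({"SEQN": [1, 2, 3], "age": [45, 61, 33]})
exam = pd.DataFrame({"SEQN": [1, 2, 3], "systolic": [128, 142, 117]})
lab = pd.DataFrame({"SEQN": [1, 2], "total_chol": [210, 245]})
questionnaire = pd.DataFrame({"SEQN": [1, 2, 3], "smoker": [0, 1, 0]})

merged = merge_by_participant([interview, exam, lab, questionnaire])
print(merged)
```

Participant 3 has no laboratory row in this toy data, so the inner join drops that record. A left join would keep it with missing lab values instead.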

CVD Cases

631

Self-reported coronary heart disease, angina, heart attack, or stroke in adults aged 20 and older
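One way the outcome label described above might be derived is shown below. The column names are assumptions based on the NHANES medical conditions questionnaire (`MCQ160C` to `MCQ160F` for the four conditions, `RIDAGEYR` for age, with code 1 meaning "yes"), and treating refused or "don't know" answers as negative is also an assumption. The project's actual coding rules are not stated here.

```python
import pandas as pd

# Assumed NHANES questionnaire columns: coronary heart disease, angina,
# heart attack, stroke. A value of 1 is taken to mean "yes".
CVD_COLUMNS = ["MCQ160C", "MCQ160D", "MCQ160E", "MCQ160F"]


def add_cvd_label(df, age_col="RIDAGEYR", yes_code=1):
    """Keep adults aged 20+ and flag any self-reported CVD condition."""
    adults = df[df[age_col] >= 20].copy()
    # Any answer other than the "yes" code (including refused or
    # don't-know codes) counts as negative in this sketch.
    adults["cvd"] = adults[CVD_COLUMNS].eq(yes_code).any(axis=1).astype(int)
    return adults


# Made-up rows for illustration only.
raw = pd.DataFrame(
    {
        "RIDAGEYR": [19, 50, 70, 40],
        "MCQ160C": [1, 2, 2, 9],
        "MCQ160D": [2, 2, 2, 2],
        "MCQ160E": [2, 2, 1, 2],
        "MCQ160F": [2, 2, 2, 2],
    }
)

labeled = add_cvd_label(raw)
print(labeled[["RIDAGEYR", "cvd"]])
```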

Features Engineered

Including Pulse Pressure, Chol/HDL Ratio, and computed Framingham Risk Score
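Two of the engineered features named above have standard definitions and can be sketched directly: pulse pressure is systolic minus diastolic blood pressure, and the Chol/HDL ratio is total cholesterol divided by HDL cholesterol. The column names below are hypothetical placeholders. The Framingham Risk Score is left out of this sketch because its coefficients are not given here.

```python
import pandas as pd


def add_derived_features(df):
    """Add pulse pressure and the total-cholesterol-to-HDL ratio.

    Column names (systolic, diastolic, total_chol, hdl) are illustrative
    placeholders, not the project's actual variable names.
    """
    out = df.copy()
    out["pulse_pressure"] = out["systolic"] - out["diastolic"]
    out["chol_hdl_ratio"] = out["total_chol"] / out["hdl"]
    return out


# Made-up measurements (mmHg and mg/dL).
sample = pd.DataFrame(
    {
        "systolic": [120, 150],
        "diastolic": [80, 90],
        "total_chol": [200.0, 240.0],
        "hdl": [50.0, 40.0],
    }
)

features = add_derived_features(sample)
print(features[["pulse_pressure", "chol_hdl_ratio"]])
```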

Class Imbalance Fix

SMOTE

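Synthetic Minority Oversampling Technique (SMOTE) balances the training set to 50/50. It is applied to the training split only, so the held-out test set contains no synthetic samples and no data leaks into evaluation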
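The order of operations matters for avoiding leakage: split first, then oversample only the training portion. The sketch below illustrates that order with a simplified, NumPy-only SMOTE-style interpolation. It is not the project's implementation (a library such as imbalanced-learn would normally be used), and it assumes the minority class is labeled 1 and has at least two training samples. The function name and toy data are hypothetical.

```python
import numpy as np


def smote_oversample(X, y, k=5, seed=0):
    """Simplified SMOTE: create synthetic minority points by interpolating
    between a minority sample and one of its k nearest minority neighbours,
    until both classes have the same count. Assumes minority label is 1
    and that there are at least two minority samples."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    minority = X[y == 1]
    n_needed = int((y == 0).sum() - (y == 1).sum())
    if n_needed <= 0:
        return X, y
    # Pairwise distances among minority samples; ignore self-distance.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    k = min(k, len(minority) - 1)
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(minority), size=n_needed)
    picked = neighbours[base, rng.integers(0, k, size=n_needed)]
    gap = rng.random((n_needed, 1))
    synthetic = minority[base] + gap * (minority[picked] - minority[base])
    X_bal = np.vstack([X, synthetic])
    y_bal = np.concatenate([y, np.ones(n_needed, dtype=y.dtype)])
    return X_bal, y_bal


# Made-up imbalanced data: 20 positives, 180 negatives.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = np.array([1] * 20 + [0] * 180)

# 1) Split first (80/20), so the test set stays untouched.
idx = rng.permutation(len(y))
test_idx, train_idx = idx[:40], idx[40:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# 2) Oversample the training split only.
X_bal, y_bal = smote_oversample(X_train, y_train)
print((y_bal == 1).sum(), (y_bal == 0).sum(), len(y_test))
```

Oversampling before the split would let synthetic points derived from test-set neighbours appear in training, which inflates the reported metrics.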

Representativeness

Nationally representative through the CDC's complex survey sampling design, not limited to hospital patients

Model Performance

Five base classifiers plus a soft-voting ensemble, trained on SMOTE-balanced data and evaluated on a held-out 20% test set (n=1,114)

AUROC — Area Under ROC Curve

Random Forest

0.800

Logistic Regression

0.800

Ensemble

0.794

Gradient Boosting

0.786

XGBoost

0.782

LightGBM

0.781

Framingham Score

~0.73
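For readers unfamiliar with the metric, AUROC can be read as the probability that a randomly chosen positive case receives a higher risk score than a randomly chosen negative case. The sketch below computes it that way with NumPy on made-up scores. It is only an illustration of the definition, not the evaluation code behind the figures above.

```python
import numpy as np


def auroc(y_true, y_score):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half a correct pair."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))


# Made-up labels and risk scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

print(auroc(y_true, y_score))  # 3 of 4 pairs ranked correctly -> 0.75
```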

Model	AUROC	AUPRC	F1 Score	Brier Score	Notes
Random Forest	0.8002	0.3127	0.3418	0.1166	Best AUROC
Logistic Regression	0.7996	0.3029	0.3811	0.1780	Best F1
Ensemble (soft vote)	0.7937	0.3106	0.3234	0.1018	Second-lowest Brier
Gradient Boosting	0.7863	0.3040	0.2954	0.1036
XGBoost	0.7819	0.2855	0.2897	0.1021
LightGBM	0.7806	0.3103	0.3070	0.1002	Lowest Brier
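The soft-vote ensemble and the Brier score in the table can be illustrated with a short sketch. Soft voting averages the predicted probabilities of the base models, and the Brier score is the mean squared difference between predicted probability and the 0/1 outcome (lower is better). Equal model weights are an assumption here, and the probabilities are made up.

```python
import numpy as np


def soft_vote(prob_matrix):
    """Average positive-class probabilities across models (rows = models).
    Equal weighting is assumed for this sketch."""
    return np.mean(np.asarray(prob_matrix, dtype=float), axis=0)


def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))


# Made-up outcomes and per-model probabilities for four patients.
y_true = [0, 0, 1, 1]
model_probs = [
    [0.1, 0.4, 0.6, 0.9],  # hypothetical model A
    [0.3, 0.2, 0.8, 0.7],  # hypothetical model B
]

ensemble = soft_vote(model_probs)
score = brier_score(y_true, ensemble)
print(ensemble, score)
```

Averaging tends to pull extreme probabilities toward the middle, which is one reason an ensemble can score well on Brier without leading on AUROC.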