Hack4Health · Byte 2 Beat 2026

Early Detection of Cardiovascular Disease with Interpretable ML

An ensemble machine learning system trained on NHANES 2017-2018 — the CDC's nationally representative population health survey — delivering calibrated CVD risk scores with SHAP-powered explanations for every prediction.

5,569
Adults Analyzed
0.800
Best AUROC
19
Clinical Features
5+1
Models in Ensemble

Dataset

NHANES 2017-2018 — CDC National Health and Nutrition Examination Survey

Why NHANES?

~30x

Larger than the standard UCI Heart Disease dataset (9,254 participants vs. ~303)

Data Modalities

4

Interview + Physical Exam + Laboratory + Questionnaire — all merged by participant ID
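As a rough illustration of the merge step, the sketch below joins several components on a shared participant ID. NHANES public files identify each participant by the respondent sequence number SEQN. Every other column name and value here is hypothetical, and the inner join (keeping only participants present in every component) is an assumption, not something the page states.

```python
def merge_by_participant(*tables, key="SEQN"):
    """Inner-join several lists of row dicts on a shared participant ID."""
    # Index every table by participant ID.
    indexed = [{row[key]: row for row in table} for table in tables]
    # Keep only participants that appear in every component (assumed join type).
    common_ids = set(indexed[0]).intersection(*indexed[1:])
    merged = []
    for pid in sorted(common_ids):
        combined = {}
        for index in indexed:
            combined.update(index[pid])
        merged.append(combined)
    return merged


# Hypothetical toy rows; real NHANES files carry many more columns.
interview = [{"SEQN": 1, "age": 54}, {"SEQN": 2, "age": 37}]
exam = [{"SEQN": 1, "systolic": 138}, {"SEQN": 2, "systolic": 117}]
lab = [{"SEQN": 1, "total_chol": 212}]

rows = merge_by_participant(interview, exam, lab)
print(rows)
```

Participant 2 is dropped because no laboratory row exists for them; a left join would keep such rows with missing values instead.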

CVD Cases

631

Self-reported coronary heart disease, angina, heart attack, or stroke in adults 20+
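The outcome definition above could be coded along these lines. This is a sketch under stated assumptions: the variable names (MCQ160C to MCQ160F for the four conditions, RIDAGEYR for age) and the 1 = yes coding are assumed from the NHANES codebook and should be checked against it. Treating refused or missing answers as "no" is also an assumption.

```python
# Assumed NHANES variable names; verify against the survey codebook.
CVD_QUESTIONS = ("MCQ160C", "MCQ160D", "MCQ160E", "MCQ160F")
YES = 1  # assumed coding: 1 = yes, 2 = no


def cvd_label(row, min_age=20):
    """Return 1 or 0 for adults, or None when the participant is out of scope."""
    age = row.get("RIDAGEYR")
    if age is None or age < min_age:
        return None
    # Positive if any of the four self-reported conditions is answered "yes".
    return int(any(row.get(question) == YES for question in CVD_QUESTIONS))


# Hypothetical participants.
print(cvd_label({"RIDAGEYR": 61, "MCQ160C": 2, "MCQ160E": 1}))
print(cvd_label({"RIDAGEYR": 45, "MCQ160C": 2, "MCQ160D": 2}))
print(cvd_label({"RIDAGEYR": 17, "MCQ160F": 1}))
```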

Features Engineered

19

Including Pulse Pressure, Chol/HDL Ratio, and computed Framingham Risk Score
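Two of the engineered features have standard definitions: pulse pressure is systolic minus diastolic blood pressure, and the Chol/HDL ratio is total cholesterol divided by HDL cholesterol. A minimal sketch follows, with hypothetical argument names and example values. The Framingham Risk Score is left out because the page does not say which version was computed.

```python
def engineer_features(systolic, diastolic, total_chol, hdl):
    """Derive pulse pressure (mmHg) and the total-cholesterol-to-HDL ratio."""
    if hdl <= 0:
        raise ValueError("HDL cholesterol must be positive")
    return {
        "pulse_pressure": systolic - diastolic,
        "chol_hdl_ratio": total_chol / hdl,
    }


# Hypothetical patient: 138/82 mmHg, total cholesterol 212 mg/dL, HDL 53 mg/dL.
features = engineer_features(systolic=138, diastolic=82, total_chol=212, hdl=53)
print(features)
```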

Class Imbalance Fix

SMOTE

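The Synthetic Minority Over-sampling Technique (SMOTE) balances the training set to 50/50; oversampling only the training split keeps synthetic samples out of the test set and avoids data leakage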
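The sketch below illustrates the SMOTE idea in plain Python: each synthetic sample is interpolated between a minority point and one of its nearest minority neighbours. It is a simplified stand-in, not the project's code; in practice a library implementation such as imbalanced-learn's `SMOTE` would normally be used. The toy data, the split, and the function names are hypothetical. The point it shows is the order of operations: split first, then oversample the training portion only.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def balance_training_set(X_train, y_train, k=5, seed=0):
    """Oversample class 1 (assumed minority, at least two points) to a 50/50 mix."""
    minority = [x for x, y in zip(X_train, y_train) if y == 1]
    majority_count = sum(1 for y in y_train if y == 0)
    extra = smote(minority, majority_count - len(minority), k=k, seed=seed)
    return list(X_train) + extra, list(y_train) + [1] * len(extra)


# Hypothetical toy data: 50 points, one positive in every ten.
X = [(float(i), float(i % 7)) for i in range(50)]
y = [1 if i % 10 == 0 else 0 for i in range(50)]

# Split first, then oversample the training part only; the test part stays untouched.
X_train, y_train = X[:40], y[:40]
X_test, y_test = X[40:], y[40:]
X_bal, y_bal = balance_training_set(X_train, y_train, k=3)
print(len(X_bal), sum(y_bal))
```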

Representativeness

US

Nationally representative via CDC complex sampling — not limited to hospital patients


Model Performance

Six classifiers trained on SMOTE-balanced data and evaluated on a held-out 20% test set (n=1,114)
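The results table labels the ensemble as a soft vote, which means averaging the predicted probabilities of the base models rather than counting their class votes. A minimal sketch, assuming equal weights by default. The model names and probabilities are hypothetical, and the page does not state how the base models were actually weighted.

```python
def soft_vote(probabilities_by_model, weights=None):
    """Average per-model positive-class probabilities for each patient."""
    names = list(probabilities_by_model)
    if weights is None:
        weights = {name: 1.0 for name in names}  # assumed: equal weights
    total = sum(weights[name] for name in names)
    n_patients = len(probabilities_by_model[names[0]])
    return [
        sum(weights[name] * probabilities_by_model[name][i] for name in names) / total
        for i in range(n_patients)
    ]


# Hypothetical probabilities for two patients from three base models.
base_probs = {
    "random_forest": [0.2, 0.8],
    "logistic_regression": [0.4, 0.6],
    "xgboost": [0.3, 0.7],
}
ensemble_probs = soft_vote(base_probs)
print(ensemble_probs)
```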

AUROC — Area Under ROC Curve

Random Forest: 0.800
Logistic Regression: 0.800
Ensemble: 0.794
Gradient Boosting: 0.786
XGBoost: 0.782
LightGBM: 0.781
Framingham Score: ~0.73
Model                 AUROC   AUPRC   F1 Score  Brier Score  Notes
Random Forest         0.8002  0.3127  0.3418    0.1166       Best AUROC
Logistic Regression   0.7996  0.3029  0.3811    0.1780       Best F1
Ensemble (soft vote)  0.7937  0.3106  0.3234    0.1018       Second-lowest Brier
Gradient Boosting     0.7863  0.3040  0.2954    0.1036
XGBoost               0.7819  0.2855  0.2897    0.1021
LightGBM              0.7806  0.3103  0.3070    0.1002       Lowest Brier
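For readers unfamiliar with two of the metrics in the table, the sketch below computes them from scratch on a tiny hypothetical example. AUROC is the probability that a randomly chosen positive case is scored above a randomly chosen negative one, with ties counting half. The Brier score is the mean squared difference between predicted probability and outcome, so lower is better. This is an illustration of the definitions, not the evaluation code behind the table.

```python
def auroc(y_true, y_prob):
    """Fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [p for y, p in zip(y_true, y_prob) if y == 1]
    negatives = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = sum((p > q) + 0.5 * (p == q) for p in positives for q in negatives)
    return wins / (len(positives) * len(negatives))


def brier_score(y_true, y_prob):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical labels and predicted probabilities.
labels = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, probs), brier_score(labels, probs))
```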
ROC, PR, Confusion Matrix
ROC curves, Precision-Recall curves, and Ensemble confusion matrix
Calibration curves
Calibration curves — Ensemble tracks the diagonal closely (well-calibrated)
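A calibration curve of this kind can be sketched as follows: predictions are grouped into equal-width probability bins, and each bin's mean predicted probability is compared with its observed event rate. A well-calibrated model has the two roughly equal in every bin, which is what tracking the diagonal means. The bin count and the example data are hypothetical.

```python
def calibration_curve(y_true, y_prob, n_bins=10):
    """Return (mean predicted probability, observed rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((y, p))
    curve = []
    for members in bins:
        if members:
            mean_predicted = sum(p for _, p in members) / len(members)
            observed_rate = sum(y for y, _ in members) / len(members)
            curve.append((mean_predicted, observed_rate, len(members)))
    return curve


# Hypothetical outcomes and predicted probabilities.
outcomes = [0, 0, 1, 0, 1, 1]
predicted = [0.05, 0.15, 0.12, 0.55, 0.58, 0.95]
curve = calibration_curve(outcomes, predicted, n_bins=5)
for mean_predicted, observed_rate, count in curve:
    print(f"predicted {mean_predicted:.2f}  observed {observed_rate:.2f}  n={count}")
```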

Exploratory Analysis

Understanding the population before modeling

EDA Overview
CVD prevalence by age group and sex; age distribution by CVD status
Risk Factor Comparison
Risk factor prevalence — CVD patients show significantly higher rates across all four factors
Correlation Heatmap
Pairwise feature correlations: Age and Framingham Score show the highest correlation with the CVD outcome

SHAP Interpretability

Every prediction explained — globally and per-patient — using TreeExplainer on XGBoost

SHAP Summary
SHAP summary dot plot — each dot is one patient; red = high feature value, blue = low
SHAP Importance
Mean |SHAP| global importance — Age and Framingham Score dominate; WaistCirc ranks 3rd
SHAP Waterfall
Per-patient waterfall for a patient with a 63% predicted probability: Age (+0.33) and Framingham Score (+0.29) drive the prediction up
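Waterfall plots rely on the additivity of Shapley values: the per-feature contributions, added to a baseline prediction, sum to the model's output for that patient. The sketch below demonstrates this with a brute-force exact Shapley computation on a small hypothetical risk function. It is not TreeExplainer and not the project's XGBoost model; the function, the feature names, and the baseline are all made up for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one input; absent features take baseline values."""
    features = list(x)
    n = len(features)

    def value(subset):
        mixed = {f: (x[f] if f in subset else baseline[f]) for f in features}
        return predict(mixed)

    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                total += weight * (value(set(subset) | {f}) - value(set(subset)))
        phi[f] = total
    return phi


def toy_risk(p):
    """Hypothetical risk function with an age-by-waist interaction term."""
    age, waist = p["age"] - 40, p["waist"] - 90
    return 0.02 + 0.004 * age + 0.001 * waist + 0.00005 * age * waist


patient = {"age": 65, "waist": 110}
reference = {"age": 40, "waist": 90}
contributions = shapley_values(toy_risk, patient, reference)

# Additivity: baseline prediction plus contributions equals the patient's prediction.
reconstructed = toy_risk(reference) + sum(contributions.values())
print(contributions, reconstructed, toy_risk(patient))
```

Brute-force enumeration grows exponentially with the number of features, which is why tree-specific algorithms are used for real models.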

Risk Stratification

Patients are stratified into 4 clinical tiers based on ensemble probability

Low Risk
2.6%
Prob <10%  |  n=586 patients
Actual CVD rate: 2.6%, consistent with the low-risk label
Moderate Risk
13.0%
Prob 10-20%  |  n=161 patients
Actual CVD rate: 13.0% — annual monitoring
High Risk
20.5%
Prob 20-40%  |  n=195 patients
Actual CVD rate: 20.5% — prompt referral
Very High Risk
29.1%
Prob >40%  |  n=172 patients
Actual CVD rate: 29.1% — immediate intervention
Risk Stratification
Left: Framingham Score vs ML probability by risk tier. Right: patient distribution across tiers.
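The tier thresholds above translate into a small lookup, sketched here together with a per-tier summary of patient counts and observed CVD rates. The handling of probabilities that fall exactly on a boundary (each tier includes its lower bound and excludes its upper bound) is an assumption, and the example data are hypothetical.

```python
# Upper probability bound for each tier; anything at or above 0.40 is "Very High".
TIERS = [(0.10, "Low"), (0.20, "Moderate"), (0.40, "High")]


def risk_tier(prob):
    """Map an ensemble probability to one of the four risk tiers."""
    for upper, name in TIERS:
        if prob < upper:
            return name
    return "Very High"


def tier_summary(y_true, y_prob):
    """Count patients and compute the observed CVD rate in each tier."""
    counts = {}
    for y, p in zip(y_true, y_prob):
        tier = risk_tier(p)
        n, events = counts.get(tier, (0, 0))
        counts[tier] = (n + 1, events + y)
    return {
        tier: {"n": n, "cvd_rate": events / n}
        for tier, (n, events) in counts.items()
    }


# Hypothetical outcomes and ensemble probabilities.
outcomes = [0, 0, 1, 0, 1, 1]
ensemble_probs = [0.03, 0.08, 0.15, 0.25, 0.35, 0.60]
summary = tier_summary(outcomes, ensemble_probs)
print(summary)
```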

Patient Risk Calculator

Estimate CVD risk using the same features as our trained model
