Student: Duong Nguyen (dtn2@hood.edu)
Instructor: Dr. Sarangan Ravichandran
The goal of this project is to develop interpretable machine learning models for predicting diabetes status using health indicator data from the CDC Behavioral Risk Factor Surveillance System (BRFSS). The project also aims to identify the strongest risk factors associated with diabetes and evaluate how different preprocessing and imbalance-handling techniques affect predictive performance.
| Field | Details |
|---|---|
| Name | Diabetes Health Indicators Dataset |
| Source | Kaggle / CDC BRFSS 2015 |
| URL | https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset |
| Size | 253,680 observations with 21 feature variables |
| Target Variable | Diabetes_binary (0 = No Diabetes, 1 = Diabetes/Prediabetes) |
The dataset contains demographic, lifestyle, and clinical health indicators related to diabetes risk.
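As a starting point, the class balance of the target column can be checked with a few lines of pandas. This is a minimal sketch, not the notebook's code: the helper name `class_distribution` is hypothetical, and a tiny made-up frame stands in for the CSV so the snippet runs on its own.

```python
import pandas as pd

TARGET = "Diabetes_binary"  # target column name from the dataset table


def class_distribution(df: pd.DataFrame, target: str = TARGET) -> pd.DataFrame:
    """Return the count and share of each class in the target column."""
    counts = df[target].value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# In the repository the data would be loaded with something like:
# df = pd.read_csv("data/diabetes_binary_health_indicators_BRFSS2015.csv")
# The frame below is made-up illustration data, not real BRFSS records.
demo = pd.DataFrame({TARGET: [0, 0, 0, 0, 1], "BMI": [24, 31, 27, 22, 35]})
summary = class_distribution(demo)
print(summary)
```

On the real file, a strongly uneven `share` column is the signal that imbalance handling is worth evaluating.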
| Phase | Technique |
|---|---|
| EDA | Summary statistics and class distribution analysis |
| EDA | Histograms, boxplots, countplots, correlation heatmap |
| Data Cleaning | Duplicate removal |
| Data Cleaning | BMI outlier removal using IQR |
| Modeling | Logistic Regression |
| Modeling | Random Forest |
| Imbalance Handling | Class weighting |
| Imbalance Handling | SMOTE oversampling |
| Imbalance Handling | Random undersampling |
| Feature Selection | Reduced-feature logistic regression |
| Evaluation | Accuracy, Precision, Recall, F1-score |
| Evaluation | ROC-AUC, PR-AUC, Confusion Matrix |
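The two data-cleaning rows above can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's exact code: the function name is hypothetical, the fence multiplier `k=1.5` is the conventional default and is assumed here, and the demo values are made up.

```python
import pandas as pd


def drop_duplicates_and_bmi_outliers(df, column="BMI", k=1.5):
    """Drop exact duplicate rows, then drop rows whose `column` value
    falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    deduped = df.drop_duplicates()
    q1 = deduped[column].quantile(0.25)
    q3 = deduped[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = deduped[column].between(lower, upper)  # inclusive on both ends
    return deduped[mask].reset_index(drop=True)


# Made-up demo data: one duplicated row (BMI 24, HighBP 0) and one extreme BMI.
demo = pd.DataFrame({
    "BMI": [22, 24, 25, 26, 27, 28, 30, 95, 24],
    "HighBP": [0, 0, 1, 0, 1, 1, 1, 1, 0],
})
cleaned = drop_duplicates_and_bmi_outliers(demo)
print(len(demo), "->", len(cleaned))
```

Removing duplicates before computing the quartiles keeps repeated rows from skewing the fences.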
BIFX546_diabetes-indicators/
├── data/
│   ├── diabetes_012_health_indicators_BRFSS2015.csv
│   ├── diabetes_binary_5050split_health_indicators_BRFSS2015.csv
│   └── diabetes_binary_health_indicators_BRFSS2015.csv
├── notebooks/
│   └── EDA_diabetes_indicators.ipynb
├── results/
│   ├── age_distribution_diabetes_status.png
│   ├── bmi_distribution.png
│   ├── bmi_vs_diabetes_boxplot.png
│   ├── correlation_heatmap.png
│   ├── correlation_with_diabetes_barplot.png
│   └── diabetes_status_distribution.png
├── src/
│   └── .gitkeep
├── LICENSE
├── Project Proposal.docx
└── README.md
Open Google Colab
Click File → Open notebook → GitHub
Paste the repository URL:
https://github.com/dtn2/BIFX546_diabetes-indicators
Select EDA_diabetes_indicators.ipynb
Click Runtime → Run all
Required Python packages are installed within the notebook as needed.
# Clone repository
git clone https://github.com/dtn2/BIFX546_diabetes-indicators.git
# Move into repository folder
cd BIFX546_diabetes-indicators
# Launch Jupyter Notebook
jupyter notebook
Open the notebook:
notebooks/EDA_diabetes_indicators.ipynb
Install any missing Python packages directly in the notebook if needed.
| Model | Accuracy | Diabetes Recall | ROC-AUC |
|---|---|---|---|
| Logistic Regression | 73% | 0.78 | 0.827 |
| Random Forest | 86% | 0.16 | 0.799 |
| Logistic Regression + SMOTE | 72% | 0.74 | 0.802 |
| Logistic Regression + Downsampling | 72% | 0.75 | 0.804 |
| Reduced Feature Logistic Regression | 72% | 0.76 | 0.813 |
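A comparison of this kind can be produced with a loop of roughly the following shape. This is a hedged sketch, not the notebook's code: it runs on synthetic data from `make_classification` with an assumed 10% positive rate, so its numbers will not match the table, and the model settings shown are illustrative. PR-AUC is summarized here with `average_precision_score`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, average_precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the BRFSS data: 21 features, about 10% positives (assumed).
X, y = make_classification(n_samples=5000, n_features=21, n_informative=8,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    "logreg_unweighted": LogisticRegression(max_iter=1000),
    "logreg_balanced": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)               # hard labels at the 0.5 threshold
    proba = model.predict_proba(X_test)[:, 1]  # scores for the positive class
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "recall": recall_score(y_test, pred),  # recall for the positive class
        "roc_auc": roc_auc_score(y_test, proba),
        "pr_auc": average_precision_score(y_test, proba),
    }

for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Stratifying the split keeps the positive rate the same in the training and test sets, which matters when the minority class is small.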
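Logistic Regression was the most effective and most interpretable model for diabetes prediction on this imbalanced healthcare dataset. Random Forest reached higher overall accuracy (86% vs. 73%), but on imbalanced data accuracy is dominated by the majority class: Random Forest's recall for diabetes cases was only 0.16, whereas Logistic Regression reached 0.78 and also had the higher ROC-AUC (0.827 vs. 0.799). This makes Logistic Regression more clinically useful for screening purposes. The reduced-feature model stayed close to the full one (recall 0.76 vs. 0.78, ROC-AUC 0.813 vs. 0.827), which shows that a simplified model using only a few key health indicators can still predict diabetes risk effectively.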
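The reduced-feature idea can be illustrated with the sketch below. How the notebook actually chose its features is not stated here, so ranking by absolute standardized coefficient is an assumption, as are `top_k = 6`, the class weighting, and the synthetic data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 21 features, of which only a few are informative.
X, y = make_classification(n_samples=3000, n_features=21, n_informative=6,
                           n_redundant=0, weights=[0.85, 0.15], random_state=1)


def make_model():
    # Scaling first makes coefficient magnitudes comparable across features.
    return make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    )


full = make_model().fit(X, y)
coefs = full.named_steps["logisticregression"].coef_.ravel()

top_k = 6  # hypothetical number of features to keep
top_idx = np.argsort(np.abs(coefs))[::-1][:top_k]  # largest |coefficient| first

reduced = make_model().fit(X[:, top_idx], y)
print("kept feature indices:", sorted(top_idx.tolist()))
```

In practice the ranking and the final evaluation should use separate data splits, so that the reduced model's score is not inflated by selecting features on the test data.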
Core Python packages:
pandas
numpy
matplotlib
seaborn
scikit-learn
imbalanced-learn
jupyter
CDC Behavioral Risk Factor Surveillance System (BRFSS)
https://www.cdc.gov/brfss/
Kaggle Diabetes Health Indicators Dataset
https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset
BIFX-546 · Hood College · Spring 2026