BIFX546_diabetes-indicators

Diabetes Prediction Using Health Indicators

BIFX-546: Machine Learning for Bioinformatics, Spring 2026

Student:
Duong Nguyen (dtn2@hood.edu)

Instructor: Dr. Sarangan Ravichandran



🎯 Project Goal

The goal of this project is to develop interpretable machine learning models for predicting diabetes status using health indicator data from the CDC Behavioral Risk Factor Surveillance System (BRFSS). The project also aims to identify the strongest risk factors associated with diabetes and evaluate how different preprocessing and imbalance-handling techniques affect predictive performance.


📊 Dataset

| Field | Details |
| --- | --- |
| Name | Diabetes Health Indicators Dataset |
| Source | Kaggle / CDC BRFSS 2015 |
| URL | https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset |
| Size | 253,680 observations with 21 feature variables |
| Target Variable | Diabetes_binary (0 = No Diabetes, 1 = Diabetes/Prediabetes) |

The dataset contains demographic, lifestyle, and clinical health indicators related to diabetes risk.


🧠 Techniques Used

| Phase | Technique |
| --- | --- |
| EDA | Summary statistics and class distribution analysis |
| EDA | Histograms, boxplots, countplots, correlation heatmap |
| Data Cleaning | Duplicate removal |
| Data Cleaning | BMI outlier removal using IQR |
| Modeling | Logistic Regression |
| Modeling | Random Forest |
| Imbalance Handling | Class weighting |
| Imbalance Handling | SMOTE oversampling |
| Imbalance Handling | Random undersampling |
| Feature Selection | Reduced-feature logistic regression |
| Evaluation | Accuracy, Precision, Recall, F1-score |
| Evaluation | ROC-AUC, PR-AUC, Confusion Matrix |
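
The two data cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column name `BMI`, the conventional 1.5 × IQR fences, the order of the two steps, and the helper name `clean_bmi` are all assumptions, and the toy rows are made up.

```python
import pandas as pd


def clean_bmi(df: pd.DataFrame, column: str = "BMI", k: float = 1.5) -> pd.DataFrame:
    """Drop exact duplicate rows, then rows whose BMI lies outside the IQR fences."""
    deduped = df.drop_duplicates()
    q1 = deduped[column].quantile(0.25)
    q3 = deduped[column].quantile(0.75)
    iqr = q3 - q1
    # Assumed fences: the common Tukey rule with multiplier k = 1.5.
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Series.between is inclusive on both ends by default.
    return deduped[deduped[column].between(lower, upper)].reset_index(drop=True)


# Toy data for illustration only (not taken from the BRFSS file).
toy = pd.DataFrame(
    {
        "BMI": [22, 24, 25, 27, 28, 30, 95, 22],
        "Diabetes_binary": [0, 0, 0, 1, 0, 1, 1, 0],
    }
)
cleaned = clean_bmi(toy)
print(len(toy), "->", len(cleaned))
```

On the toy frame, one duplicate row and the single extreme BMI value are dropped. On the real data, the fences depend on the observed BMI quartiles.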

📁 Repository Structure

BIFX546_diabetes-indicators/
├── data/
│   ├── diabetes_012_health_indicators_BRFSS2015.csv
│   ├── diabetes_binary_5050split_health_indicators_BRFSS2015.csv
│   └── diabetes_binary_health_indicators_BRFSS2015.csv
├── notebooks/
│   └── EDA_diabetes_indicators.ipynb
├── results/
│   ├── age_distribution_diabetes_status.png
│   ├── bmi_distribution.png
│   ├── bmi_vs_diabetes_boxplot.png
│   ├── correlation_heatmap.png
│   ├── correlation_with_diabetes_barplot.png
│   └── diabetes_status_distribution.png
├── src/
│   └── .gitkeep
├── LICENSE
├── Project Proposal.docx
└── README.md

⚙️ How to Run

Option 1: Google Colab

  1. Open Google Colab

  2. Click File → Open notebook → GitHub

  3. Paste the repository URL:
     https://github.com/dtn2/BIFX546_diabetes-indicators

  4. Open the notebook:
     EDA_diabetes_indicators.ipynb

  5. Run all notebook cells:
     Runtime → Run all

Required Python packages are installed within the notebook as needed.

Option 2: Local Jupyter Notebook

# Clone repository
git clone https://github.com/dtn2/BIFX546_diabetes-indicators.git

# Move into repository folder
cd BIFX546_diabetes-indicators

# Launch Jupyter Notebook
jupyter notebook

Open the notebook:

notebooks/EDA_diabetes_indicators.ipynb

Install any missing Python packages directly in the notebook if needed.


📈 Key Results

| Model | Accuracy | Recall (diabetes class) | ROC-AUC |
| --- | --- | --- | --- |
| Logistic Regression | 73% | 0.78 | 0.827 |
| Random Forest | 86% | 0.16 | 0.799 |
| Logistic Regression + SMOTE | 72% | 0.74 | 0.802 |
| Logistic Regression + Downsampling | 72% | 0.75 | 0.804 |
| Reduced Feature Logistic Regression | 72% | 0.76 | 0.813 |
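
To make the "Downsampling" row more concrete, here is a library-free sketch of random undersampling. The dependency list suggests the notebook relies on imbalanced-learn (for example its `RandomUnderSampler`), so this is only an illustration of the idea: the function name `random_undersample`, the fixed seed, and the toy records are hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_undersample(rows, label_key="Diabetes_binary", seed=0):
    """Return a class-balanced subset by sampling every class down to the rarest class size."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)
    target_size = min(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        # Sample without replacement so no record is repeated.
        balanced.extend(rng.sample(members, target_size))
    rng.shuffle(balanced)  # avoid leaving the rows grouped by class
    return balanced


# Toy records for illustration only: 8 negatives and 2 positives.
toy_rows = [{"BMI": 20 + i, "Diabetes_binary": 0} for i in range(8)]
toy_rows += [{"BMI": 35 + i, "Diabetes_binary": 1} for i in range(2)]

balanced_rows = random_undersample(toy_rows)
print(Counter(row["Diabetes_binary"] for row in balanced_rows))
```

Resampling of this kind is normally applied to the training split only, so that test metrics still reflect the original class balance.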

Main Findings


📝 Conclusion

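Logistic Regression provided the most effective and interpretable model for diabetes prediction in this imbalanced healthcare dataset. Random Forest achieved higher overall accuracy (86% vs. 73%), but its recall for diabetes cases was only 0.16, compared with 0.78 for Logistic Regression. On imbalanced data, high accuracy can come mostly from correctly labeling the majority class, so the higher-recall Logistic Regression model is more clinically useful for screening. The reduced-feature Logistic Regression model performed nearly as well as the full model (recall 0.76 vs. 0.78, ROC-AUC 0.813 vs. 0.827), which shows that a simplified model using only a few key health indicators can still predict diabetes risk effectively.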
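
The contrast between accuracy and recall can be illustrated with a small worked example. The confusion-matrix counts below are made up for illustration and are not results from this project. The helper name `screening_metrics` is also hypothetical. In the notebook, the same quantities would more likely come from `sklearn.metrics`.

```python
def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical population: 1,000 people, 100 of whom have diabetes.
# A conservative model rarely predicts the positive class.
conservative = screening_metrics(tp=20, fp=10, fn=80, tn=890)
# A sensitive model flags many more people, including more false alarms.
sensitive = screening_metrics(tp=80, fp=250, fn=20, tn=650)

for name, metrics in [("conservative", conservative), ("sensitive", sensitive)]:
    print(name, {key: round(value, 2) for key, value in metrics.items()})
```

In this toy case the conservative model is more accurate overall yet misses most positive cases, which is the pattern described above for screening.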


📦 Dependencies

Core Python packages:

pandas
numpy
matplotlib
seaborn
scikit-learn
imbalanced-learn
jupyter
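
Because install names and import names differ for some of these packages, a quick check like the one below can show what is missing before running the notebook. It is a minimal sketch: the helper name `missing_packages` is hypothetical, and `jupyter` is left out on the assumption that the check is run from inside an already working notebook.

```python
from importlib.util import find_spec

# Install name (as used with pip) -> import name (as used in Python code).
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "scikit-learn": "sklearn",
    "imbalanced-learn": "imblearn",
}


def missing_packages(required=REQUIRED):
    """Return the install names whose import name cannot be found."""
    return [pip_name for pip_name, module in required.items() if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Missing packages; try: pip install " + " ".join(missing))
else:
    print("All core packages are importable.")
```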

📜 References

  1. CDC Behavioral Risk Factor Surveillance System (BRFSS)
    https://www.cdc.gov/brfss/

  2. Kaggle Diabetes Health Indicators Dataset
    https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset


BIFX-546 · Hood College · Spring 2026