AN IMPROVED DIABETES MELLITUS PREDICTION MODEL THROUGH ENSEMBLE LEARNING AND GINI INDEX-BASED FEATURE SELECTION

Authors

  • Rukkayya Yahaya Ibrahim, Computer Science Department, Ahmadu Bello University, Zaria
  • Sahabi A. Yusuf, Computer Science Department, Ahmadu Bello University, Zaria
  • Mohammed Abdullahi, Computer Science Department, Ahmadu Bello University, Zaria
  • Jeremiah Isuwa, Computer Science Department, Federal University Kashere, Gombe

Abstract

Diabetes Mellitus (DM) is a condition in which the body cannot regulate blood sugar because of improper insulin production or use, and it poses a significant global health burden. Traditional detection methods rely on clinical assessments and basic laboratory tests, but recent technological advances suggest that Machine Learning (ML) algorithms can predict DM more effectively and efficiently. However, current ML models face challenges such as redundant and irrelevant features and dataset imbalance, which can reduce accuracy and interpretability and ultimately affect patient outcomes. This paper addresses these challenges by developing an enhanced ML-based DM prediction model. The proposed model uses an ensemble soft voting classifier that integrates the Random Forest, Logistic Regression, and Naïve Bayes algorithms. Feature importance is determined by the Gini Index Random Forest (GI-RF) algorithm. In addition, three data imbalance handling techniques, namely random oversampling (ROS), random undersampling (RUS), and the synthetic minority oversampling technique (SMOTE), are employed to mitigate biased model development. First, the GI-RF algorithm identifies the 5 most informative features of the PIMA Indians Diabetes Dataset, which originally comprises 8 features. The dataset is then subjected to each of the three imbalance handling techniques, and the performance of each resulting model variation is compared. The results show that ROS outperforms RUS and SMOTE across multiple metrics, including accuracy, F1 score, recall, and AUC. A comparative analysis with existing studies shows that the proposed method improves on all metrics, with increases of 5% in accuracy, 8% in precision, 13% in F1 score, 18% in recall, and 4% in AUC. This demonstrates the proposed model's overall robustness and effectiveness in predictive modeling, contributing to more accurate diagnosis and treatment of DM.
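
The pipeline described in the abstract (Gini-importance feature selection, random oversampling, then a soft voting ensemble) can be illustrated with a minimal Python sketch using scikit-learn. It is not the authors' implementation. Several details are assumptions made only for illustration: the synthetic data stands in for the PIMA Indians Diabetes Dataset, `GaussianNB` is assumed as the Naïve Bayes variant, and the helper names (`top_k_gini_features`, `random_oversample`, `build_soft_voting_ensemble`), the hyperparameters, and the 80/20 split are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def top_k_gini_features(X, y, k=5, seed=0):
    """Rank features by Random Forest Gini importance; return the top-k column indices."""
    forest = RandomForestClassifier(n_estimators=200, criterion="gini", random_state=seed)
    forest.fit(X, y)
    return np.argsort(forest.feature_importances_)[::-1][:k]


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes are equal in size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    rng.shuffle(keep)
    return X[keep], y[keep]


def build_soft_voting_ensemble(seed=0):
    """Soft voting averages the class probabilities predicted by the three base models."""
    return VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=200, random_state=seed)),
            ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
            ("nb", GaussianNB()),  # assumed Naive Bayes variant
        ],
        voting="soft",
    )


# Synthetic stand-in for the real dataset: 8 features, imbalanced binary target.
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=5, weights=[0.65, 0.35], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 1: keep the 5 features with the highest Gini importance.
selected = top_k_gini_features(X_train, y_train, k=5)

# Step 2: balance the training split only, so the test split keeps its original distribution.
X_bal, y_bal = random_oversample(X_train[:, selected], y_train)

# Step 3: fit the ensemble and obtain averaged class probabilities.
model = build_soft_voting_ensemble().fit(X_bal, y_bal)
probabilities = model.predict_proba(X_test[:, selected])
```

In this sketch the resampling is applied after the train/test split, which is a design choice of the example rather than something the abstract specifies. Swapping `random_oversample` for an undersampling or SMOTE step would give the other two variations the abstract compares.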

Published

2025-01-06

Section

ARTICLES