Feature reduction using swarm optimization and random forest classifiers for early diabetes risk prediction.
Diabetes is a chronic metabolic disorder characterized by excessive blood sugar levels, which can severely damage other organs. Type 2 diabetes, the most common form, is a long-term metabolic disorder in which the body resists insulin or does not produce enough of it. Detecting diabetes early with fewer features reduces patient burden, and machine learning makes the process more time-efficient. This study proposes three machine learning approaches that achieve both high performance and effective feature reduction on the Early Stage Diabetes Risk Prediction dataset. Several prior studies have addressed this task, but they have struggled to reduce features efficiently while maintaining high accuracy, and they have not explained the models' behavior or misclassifications in detail. This research addresses these issues, achieving strong performance together with substantial feature reduction. Three swarm-based metaheuristic algorithms, Fox Optimizer (FOX), Honey Badger Algorithm (HBA), and Tuna Swarm Optimization (TSO), are each wrapped around a Random Forest (RF) classifier. SHAP, an explainable AI method, is used to present the models' behavior and feature importance, including individual predictions. Without cross-validation, FOX_RF, HBA_RF, and TSO_RF achieved accuracies of 99.36%, 99.36%, and 100%, respectively. TSO_RF achieved the highest mean 10-fold cross-validation accuracy of 98.14%, F-score of 98.47%, and Precision of 98.54% using only 14 of the 16 features. FOX_RF achieved the highest mean Precision of 98.43%. HBA_RF achieved the largest feature reduction, selecting 10 of the 16 features while maintaining moderate performance. SHAP confirmed that Polyuria, Polydipsia, and Gender are the most influential features for diabetes prediction. SHAP-based analysis of individual predictions revealed that even small changes in these features can influence the model's decisions.
This research analyzes the Early Stage Diabetes Risk Prediction dataset, which includes 520 individuals with 16 predictors and one target class. TSO_RF outperformed the other models, achieving accuracies of 100% without cross-validation and 98.14% with cross-validation.
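
The wrapper scheme described above, in which a search procedure proposes feature subsets and a Random Forest scores each one by cross-validation, can be illustrated with a minimal sketch. Several parts are assumptions made for illustration and are not taken from the study: a plain random search stands in for the FOX, HBA, and TSO optimizers; synthetic data from `make_classification` stands in for the real dataset (matching only its 520 x 16 shape); and the fitness function, with its hypothetical weight `alpha`, follows a common convention of trading cross-validated error against the fraction of features kept.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def fitness(mask, X, y, alpha=0.99, folds=5, seed=0):
    """Score a boolean feature mask; lower is better.

    Combines cross-validated error with the fraction of features kept.
    The weighting `alpha` is an illustrative assumption.
    """
    if not mask.any():
        return 1.0  # an empty subset cannot be evaluated; assign the worst score
    clf = RandomForestClassifier(n_estimators=30, random_state=seed)
    acc = cross_val_score(clf, X[:, mask], y, cv=folds, scoring="accuracy").mean()
    return alpha * (1.0 - acc) + (1.0 - alpha) * mask.mean()


def random_search(X, y, n_candidates=8, keep_prob=0.7, seed=0):
    """Stand-in for a swarm optimizer: sample random masks, keep the best one."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    best_mask = np.ones(n_features, dtype=bool)  # start from the full feature set
    best_score = fitness(best_mask, X, y)
    for _ in range(n_candidates):
        mask = rng.random(n_features) < keep_prob
        score = fitness(mask, X, y)
        if score < best_score:
            best_mask, best_score = mask, score
    return best_mask, best_score


if __name__ == "__main__":
    # Synthetic placeholder data with the same shape as the dataset (520 x 16).
    X, y = make_classification(
        n_samples=520, n_features=16, n_informative=6, random_state=0
    )
    mask, score = random_search(X, y)
    print("features kept:", int(mask.sum()), "of", mask.size)
    print("fitness:", round(float(score), 4))
```

A swarm-based optimizer would replace `random_search` with an update rule that moves a population of candidate masks toward better-scoring regions, while the `fitness` evaluation of each candidate would stay the same.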