Diabetes is one of the most common diseases worldwide, and its prevalence continues to rise. This increase is driven by nutritional and lifestyle factors as well as genetic factors, making diabetes a major public health problem. Early identification is therefore crucial, because it allows prompt treatment that can slow the progression of the disease.
The objective of this work is to propose an automatic diabetes prediction system based on four machine learning techniques: Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Decision Tree and Logistic Regression. Using risk factors specific to the Algerian environment, we constructed a new dataset of 823 patients, of whom 418 are diabetic and 405 are non-diabetic. To select the relevant features and identify the most informative risk factors, we combined several feature selection methods, such as ANalysis Of Variance (ANOVA) and Recursive Feature Elimination (RFE), and we also used the features proposed by the Pima Indian Diabetes Dataset (PIDD).
The results of this study provide valuable information on the comparative performance of the different machine learning models in predicting diabetes, as well as on the importance of the selected features.
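To make the ANOVA-based ranking of risk factors mentioned above more concrete, the following is a minimal sketch using only the Python standard library. It computes the one-way ANOVA F-statistic of each feature against the diabetic / non-diabetic label and sorts the features by that score. The feature names (`glucose`, `age`, `shoe_size`) and the eight toy records are hypothetical illustrations, not values from the dataset described here, and the sketch is not necessarily the procedure used in this work.

```python
from statistics import mean


def anova_f_score(values, labels):
    """One-way ANOVA F-statistic of a single feature against class labels."""
    # Group the feature values by class label.
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    n, k = len(values), len(groups)
    grand_mean = mean(values)

    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    # A feature with zero within-group variance separates the classes perfectly.
    return ms_between / ms_within if ms_within > 0 else float("inf")


def rank_features(rows, labels, names):
    """Return (feature name, F-score) pairs, most informative first."""
    scores = {
        name: anova_f_score([row[i] for row in rows], labels)
        for i, name in enumerate(names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data (NOT from the study): 4 diabetic (1), 4 non-diabetic (0).
names = ["glucose", "age", "shoe_size"]
rows = [
    [150, 55, 42],
    [160, 60, 39],
    [170, 50, 41],
    [155, 58, 40],
    [90, 35, 41],
    [95, 45, 40],
    [100, 30, 42],
    [85, 40, 39],
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

ranking = rank_features(rows, labels, names)
for name, score in ranking:
    print(f"{name}: F = {score:.2f}")
```

A higher F-score means the feature's class means differ more relative to the spread within each class, so a feature whose two class means coincide receives a score of zero. In practice, such a ranking would typically be cross-checked against a wrapper method such as RFE before fixing the final feature set.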
Key words: ANOVA, Diabetes, Feature extraction, Machine learning, Patients, Prediction, RFE.