Validation Report
Prediction Model
1. Introduction
This report evaluates several techniques for improving the performance of a diabetes prediction
model. The baseline model is a Support Vector Machine (SVM) classifier. Against it we compare
feature selection, advanced scaling methods, alternative algorithms, and optimization
techniques.
2. Baseline Model
Cross-validation: 5-fold
Metrics:
- Accuracy: 0.7650
- ROC-AUC: 0.8234
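The two metrics and the 5-fold protocol used throughout this report can be sketched without any dependencies, as below. This is an illustrative sketch only, not the code that produced the reported numbers: the function names are hypothetical, the folds are contiguous with no shuffling or stratification (the report does not state which variant was used), and the toy labels and scores are made up for the example.

```python
from typing import List, Sequence, Tuple


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs over contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


# Toy data, made up for illustration.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_preds = [1 if s >= 0.5 else 0 for s in toy_scores]

toy_auc = roc_auc(toy_labels, toy_scores)
toy_acc = accuracy(toy_labels, toy_preds)
folds = k_fold_indices(10, k=5)
```

In a cross-validated evaluation, each metric is computed on the test indices of every fold and the five values are averaged. Libraries such as scikit-learn provide tested equivalents of all three helpers.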
3. Feature Selection
Metrics:
- Accuracy: 0.7712
- ROC-AUC: 0.8301
3.2 Random Forest Feature Importance
Metrics:
- Accuracy: 0.7789
- ROC-AUC: 0.8378
Metrics:
- Accuracy: 0.7681
- ROC-AUC: 0.8267
5. Advanced Algorithms
Metrics:
- Accuracy: 0.7843
- ROC-AUC: 0.8456
5.2 XGBoost
Metrics:
- Accuracy: 0.7901
- ROC-AUC: 0.8534
Metrics:
- Accuracy: 0.7924
- ROC-AUC: 0.8567
6. Hyperparameter Tuning
Metrics:
- Accuracy: 0.7978
- ROC-AUC: 0.8623
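The report does not state which search strategy was used for tuning. As one possibility, an exhaustive grid search can be sketched as follows. The helper `grid_search`, the grid values, and the `toy_score` function are all hypothetical; in a real run the scoring function would be the cross-validated ROC-AUC of a model trained with the given parameters. The parameter names `max_depth` and `learning_rate` are used only as familiar examples of boosted-tree settings.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple


def grid_search(
    param_grid: Dict[str, List[float]],
    score_fn: Callable[[Dict[str, float]], float],
) -> Tuple[Dict[str, float], float]:
    """Score every combination in the grid and return the best one."""
    names = list(param_grid)
    best_params: Dict[str, float] = {}
    best_score = float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params: Dict[str, float]) -> float:
    """Hypothetical stand-in for a cross-validated ROC-AUC.

    It is constructed to peak at max_depth=4 and learning_rate=0.1.
    """
    depth_penalty = 0.01 * abs(params["max_depth"] - 4)
    rate_penalty = abs(params["learning_rate"] - 0.1)
    return 0.9 - depth_penalty - rate_penalty


# Hypothetical grid: 3 x 3 = 9 combinations are evaluated.
param_grid = {"max_depth": [2, 4, 6], "learning_rate": [0.01, 0.1, 0.3]}
best_params, best_score = grid_search(param_grid, toy_score)
```

The cost of this approach grows with the product of the grid sizes, which is why randomized or Bayesian search is often preferred for larger grids.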
Metrics:
- Accuracy: 0.7956
8. Feature Engineering
Metrics:
- Accuracy: 0.7934
- ROC-AUC: 0.8589
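The summary describes this step as polynomial feature engineering. A degree-2 expansion of a single feature row can be sketched as follows. The function name and the sample row are hypothetical, and the sketch omits the constant bias term that some libraries add.

```python
from itertools import combinations_with_replacement
from typing import List, Sequence


def polynomial_features(row: Sequence[float], degree: int = 2) -> List[float]:
    """Return all monomials of the inputs from degree 1 up to `degree`."""
    expanded = []
    for d in range(1, degree + 1):
        # Each combination of feature indices (with repetition) is one monomial.
        for combo in combinations_with_replacement(range(len(row)), d):
            term = 1.0
            for index in combo:
                term *= row[index]
            expanded.append(term)
    return expanded


# Hypothetical two-feature row [a, b] expands to [a, b, a*a, a*b, b*b].
example = polynomial_features([2.0, 3.0])
```

With n input features, a degree-2 expansion yields n + n(n+1)/2 columns. The number of inputs therefore grows quadratically while the number of samples stays fixed, which is consistent with the overfitting explanation offered in the summary.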
The baseline SVM model achieved an ROC-AUC of 0.8234. Feature selection using Random
Forest importance raised this to 0.8378, and switching to XGBoost raised it further to
0.8534. Hyperparameter tuning of XGBoost produced the best model, with an ROC-AUC of 0.8623.
SMOTE did not significantly improve results, which suggests that class imbalance is not a major
issue in this dataset. Polynomial feature engineering slightly decreased performance (ROC-AUC
0.8589 versus 0.8623), possibly due to overfitting.
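The comparison in this summary can be reproduced with a few lines of arithmetic over the ROC-AUC values reported above. The dictionary keys are descriptive labels chosen for this sketch; the scores are copied from the report.

```python
# ROC-AUC values as reported in this document.
reported_roc_auc = {
    "Baseline SVM": 0.8234,
    "Random Forest feature selection": 0.8378,
    "XGBoost": 0.8534,
    "Tuned XGBoost": 0.8623,
    "Feature engineering (Section 8)": 0.8589,
}

baseline = reported_roc_auc["Baseline SVM"]

# Absolute gain of each configuration over the baseline, rounded to 4 places.
gain_over_baseline = {
    name: round(score - baseline, 4)
    for name, score in reported_roc_auc.items()
    if name != "Baseline SVM"
}

best_model = max(reported_roc_auc, key=reported_roc_auc.get)

for name, gain in gain_over_baseline.items():
    print(f"{name}: +{gain:.4f}")
print(f"Best: {best_model}")
```

The largest gain over the baseline is 0.0389 in ROC-AUC, from the tuned XGBoost model.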
Recommendations:
1. Use the tuned XGBoost model as the final model for diabetes prediction.
2. Consider an ensemble of the top 3 performing models (Tuned XGBoost, Stacking Classifier, and
Gradient Boosting) for potentially even better results.
3. Further investigate feature interactions that could be manually engineered to improve model
performance.
4. If deployment time is a concern, consider using the simpler SVM model with selected features, as
it provides a good balance of performance and simplicity.
5. Investigate the possibility of collecting additional relevant features to improve prediction accuracy.
6. Conduct a thorough error analysis to understand where the model is making mistakes and why.
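The ensemble suggested in recommendation 2 could take several forms. One simple option is soft voting, which averages the positive-class probabilities predicted by each model. The sketch below assumes each model has already produced one probability per patient; the function names, the model keys, and the probability values are hypothetical and do not come from the experiments in this report.

```python
from typing import Dict, List, Optional


def soft_vote(
    probs_by_model: Dict[str, List[float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Weighted average of per-sample positive-class probabilities."""
    if not probs_by_model:
        raise ValueError("at least one model is required")
    if weights is None:
        weights = {name: 1.0 for name in probs_by_model}
    lengths = {len(p) for p in probs_by_model.values()}
    if len(lengths) != 1:
        raise ValueError("all models must score the same number of samples")
    total_weight = sum(weights[name] for name in probs_by_model)
    n_samples = lengths.pop()
    return [
        sum(weights[name] * probs[i] for name, probs in probs_by_model.items())
        / total_weight
        for i in range(n_samples)
    ]


def to_labels(probabilities: List[float], threshold: float = 0.5) -> List[int]:
    """Convert averaged probabilities into 0/1 predictions."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical probabilities for three patients from the three candidate models.
probs_by_model = {
    "tuned_xgboost": [0.9, 0.2, 0.6],
    "stacking": [0.8, 0.4, 0.5],
    "gradient_boosting": [0.7, 0.3, 0.7],
}

ensemble_probs = soft_vote(probs_by_model)
ensemble_labels = to_labels(ensemble_probs)
```

Whether such an ensemble actually beats the tuned XGBoost model would need to be checked with the same 5-fold protocol used for the other results.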