ML for Air Quality
Owing to the excessive growth of PM2.5 in atmospheric aerosol, lung cancer cases are increasing rapidly, and lung cancer is the most severe cancer type, with the highest mortality rate. In most cases, lung cancer shows few symptoms and is detected only at a late stage. Clinical records may therefore play a vital role in diagnosing the disease at a stage when suitable treatment is still possible. Detecting lung cancer requires a prediction method that is both accurate and reliable. In the era of digital clinical records, advances in computing algorithms, including machine learning techniques, offer an opportunity to ease this process. Various machine learning algorithms may be applied to real clinical data, but their predictive power is not yet well understood. This paper compares twelve candidate machine learning algorithms on clinical data comprising eleven symptoms of lung cancer and two major patient habits, with the aim of predicting positive cases accurately. The results are based on classification and heat map correlation. The K-Nearest Neighbor and Bernoulli Naive Bayes models are found to be the most effective methods for early lung cancer prediction.
Keywords Lung cancer, Machine learning, Classification, Prediction, Confusion matrix, Heat map
correlation
Respiratory disease has increased enormously over recent decades, which may be directly associated with human exposure to polluted air. The Sustainable Development Goals (SDGs) set out an aspiration of health and well-being for all¹; target 3.9 concerns reducing death and illness from air, water, and soil pollution². Lung cancer is one of the most lethal diseases caused by air pollution, with mortality rates increasing globally. This type of cancer usually begins in the lungs and may spread to other parts of the body; its causes include smoking, air pollution, and exposure to certain chemicals³. The prognosis for lung cancer varies depending on the type, stage, and overall health of the individual. The initial phases of lung cancer often produce no symptoms. When early symptoms do appear, they may include shortness of breath as well as unexpected symptoms such as back pain. Tumors can cause back pain by exerting pressure on the lungs or by spreading to the patient’s spinal cord and ribs⁴. Other early symptoms of lung cancer may include a persistent or worsening cough; coughing up phlegm or blood; chest pain that worsens with deep breathing, laughing, or coughing; hoarseness; wheezing; weakness and fatigue; reduced appetite and weight loss; and recurring respiratory infections such as pneumonia or bronchitis⁵. The initial manifestations of lung cancer may be subtle, yet early diagnosis is crucial for effective treatment options and outcomes.
However, detecting and diagnosing lung cancer at an early stage remains a great challenge for doctors and researchers. Advances in storing health records on digital platforms, together with data visualization, have improved pattern analysis⁶. Early prediction of disease from symptoms and textual information may enhance the diagnostic system. Alongside medical methods, soft computing techniques, such as applying machine learning algorithms to the main features of large, complex lung cancer datasets, may help a specialist find the disease early. The precision of detection, however, depends on the availability of data and on the selection of important measures, which in turn affects the adequacy of treatment decisions.
Diverse mathematical models have already been used for the detection and prevention of diseases to facilitate early treatment. However, if lung cancer is diagnosed three years after its onset, it becomes unpreventable, and the likelihood of survival is extremely poor⁷,⁸. Nevertheless, the disease can be treated when the earliest signs are present, before metastasis. Thus, if the cancer is found within the time frame in which it is curable, along with the various risk factors needed for further diagnosis, a suitable therapy can be provided to the patient and appropriate preventive measures can be implemented. Several computational methods have been used to detect or predict lung cancer, helping doctors determine the best treatment for patients and estimate their chances of survival after diagnosis. Researchers in the medical sciences have employed machine learning and soft computing approaches to diagnose several forms of cancer accurately in their early stages using classification methods. Furthermore, researchers have identified various cutting-edge methods for early-stage prognosis of cancer therapy outcomes⁹. However, it is crucial to determine an appropriate learning algorithm for detecting lung cancer and its correlation with the patient’s habits. This research conducts a comparative analysis of several machine learning algorithms on the characteristics related to lung cancer, focusing specifically on the symptoms exhibited by patients and on their habits.
¹Department of Computer Science and Engineering, Graphic Era (Deemed to be University), Dehradun, India. ²Department of Computer Science and Engineering, Indus University, Ahmedabad 382115, India. ³Department of Electronics and Computer Engineering, National Institute of Advanced Manufacturing Technology (NIAMT), Ranchi, India. *email: [email protected]
www.nature.com/scientificreports/
Table 2. List of patient’s habits and symptoms in lung cancer study dataset.
Figure 2. Positive case distribution age-wise over gender in the given dataset.
Figure 3. Positive and negative case distribution gender-wise over patient’s habits.
Figure 4. Distribution of positive and negative cases gender-wise over patient’s symptoms.
Figure 5. Correlation heat map for the attributes, considering alcohol consumption as the patient’s habit.
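The values behind a correlation heat map of this kind can be sketched as a pairwise Pearson correlation matrix. The following is a minimal illustration only: the column names and the six toy records are hypothetical and are not taken from the study dataset, and the source does not state which correlation coefficient was used, so Pearson is an assumption.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Return a nested dict with the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical toy records (1 = present, 0 = absent); NOT the study data.
toy = {
    "alcohol_consuming": [1, 1, 0, 0, 1, 0],
    "allergy":           [1, 1, 0, 0, 1, 1],
    "yellow_fingers":    [0, 1, 0, 1, 1, 0],
}
matrix = correlation_matrix(toy)

for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Each cell of the heat map corresponds to one entry of `matrix`; the diagonal is 1 by construction, and the matrix is symmetric.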
Different machine learning algorithms can now be applied to assess how well each performs in this problem domain. Based on the literature survey, we identified the following learning algorithms for lung cancer prediction: (1) Logistic regression, (2) Gaussian Naïve Bayes, (3) Bernoulli Naïve Bayes, (4) Support vector machine, (5) Random forest, (6) K-Nearest neighbor, (7) Extreme gradient boosting (XGB), (8) Extra tree, (9) AdaBoost (ADA), (10) Ensemble_1 with XGB and ADA, (11) Ensemble_2 with Voting Classifier, (12) Multilayer Perceptron (MLP).
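To make two of the listed algorithms concrete, the sketch below implements K-Nearest neighbor and Bernoulli Naïve Bayes from scratch for binary-coded features. It is not the authors' implementation: the source does not give hyperparameters or a distance metric, so `k=3`, the Hamming distance, and Laplace smoothing with `alpha=1.0` are assumptions, and the toy records and feature order are hypothetical.

```python
from collections import Counter
from math import log

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training rows closest to x (Hamming distance)."""
    dists = sorted(
        (sum(a != b for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class priors and P(feature = 1 | class) with Laplace smoothing."""
    model = {}
    for c in set(y):
        rows = [row for row, label in zip(X, y) if label == c]
        prior = len(rows) / len(X)
        probs = [(sum(col) + alpha) / (len(rows) + 2 * alpha) for col in zip(*rows)]
        model[c] = (prior, probs)
    return model

def bernoulli_nb_predict(model, x):
    """Return the class with the highest log-posterior for the binary vector x."""
    def log_posterior(c):
        prior, probs = model[c]
        return log(prior) + sum(
            log(p) if v else log(1 - p) for p, v in zip(probs, x)
        )
    return max(model, key=log_posterior)

# Hypothetical toy records: [smoking, coughing, wheezing]; label 1 = positive case.
train_X = [
    [1, 1, 1], [1, 1, 0], [0, 1, 1], [1, 0, 1],
    [0, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0],
]
train_y = [1, 1, 1, 1, 0, 0, 0, 0]

nb_model = fit_bernoulli_nb(train_X, train_y)
for patient in ([1, 1, 1], [0, 0, 0]):
    print(patient, knn_predict(train_X, train_y, patient),
          bernoulli_nb_predict(nb_model, patient))
```

In practice a library implementation would normally be preferred; the from-scratch version is shown only to expose what each model computes.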
Figure 6. A comparative study of learning algorithm through confusion matrix over lung cancer dataset.
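The quantities reported in a classification report follow directly from the four cells of a binary confusion matrix. The sketch below shows that arithmetic; the counts are invented for illustration and are not read from Figure 6 or from any table in this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts.

    Assumes at least one predicted positive and one actual positive,
    so that none of the denominators is zero.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts; NOT the study's results.
metrics = classification_metrics(tp=40, fp=5, fn=3, tn=12)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because positive cases usually dominate or are scarce in clinical data, precision and recall are worth reading alongside accuracy rather than relying on accuracy alone.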
Figure 7. A comparative study of learning algorithm through ROC/AUC over lung cancer dataset.
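The area under the ROC curve can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it that way by pairwise comparison, counting ties as one half; the labels and scores are hypothetical and the function name is illustrative, not part of the study's code.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores; NOT the study's results.
labels = [1, 1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.4f}")
```

The pairwise form is quadratic in the number of cases, which is acceptable for a dataset of a few hundred records; a rank-based formulation would scale better for larger data.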
Table 3. Classification report for LR classifiers. The accuracy of logistic regression is 87.5%.
Table 4. Classification report for Gaussian Naive Bayes classifiers. The accuracy of Gaussian Naive Bayes is
91.07%.
Table 5. Classification report for Bernoulli Naive Bayes classifier. The accuracy of Bernoulli Naive Bayes is 91.07%.
Table 6. Classification report for SVM classifier. The accuracy of Support Vector Machine is 85.71%.
Table 7. Classification report for Random Forest Classifiers. The accuracy of Random Forest Classifier is
85.71%.
Table 8. Classification report for K Nearest Neighbors Classifier. The accuracy of K Nearest Neighbors
Classifier is 92.86%.
Table 9. Classification report for Extreme Gradient Boosting Classifier. The accuracy of extreme gradient
boosting classifier is 89.29%.
Table 10. Classification report for Extra Tree Classifier. The accuracy of extra tree classifier is 89.29%.
Table 11. Classification report for Ada Boost Classifier. The accuracy of ada boost classifier is 89.29%.
Table 12. Classification report for Ensemble_1 with XGB and ADA Classifier. The accuracy of Ensemble_1
with XGB and ADA Classifier is 89.29%.
Table 13. Classification report for Ensemble_2 with Voting Classifier. The accuracy of Ensemble_2 with
Voting Classifier is 87.5%.
Table 14. Classification report for MLP Classifier. The accuracy of MLP Classifier is 89.29%.
Table 15. A comparison of the accuracy of different learning algorithms applied over lung cancer.
Conclusion
Lung cancer prediction is useful when the prediction system operates once symptoms are detected and also relates the prediction to the patient’s habits, so that the cancer can be identified while the risk is still low. The expert may then advise a suitable treatment option based on the individual’s cancer risk status. It is therefore important to be precise when predicting lung cancer in a patient. The raw data, comprising 310 instances, were processed to find the positive cases by gender, and the positive cases for each attribute were then compared by gender. In the preliminary analysis of the data, a correlation study over the alcohol consumption habit identified yellow fingers and allergy as the main symptoms. This study presented a comprehensive analysis of twelve candidate machine learning algorithms, of which the K-nearest neighbor and Bernoulli Naïve Bayes models (the latter performing equally to Gaussian Naïve Bayes) were found most suitable, with accuracies of 92.86% and 91.07%, respectively.
Data availability
The datasets generated and/or analysed during the current study are available at https://www.kaggle.com/datasets/sanjoli02/lung-cancer.
References
1. Organization, W. H. et al. A vision for primary health care in the 21st century: towards universal health coverage and the sustainable
development goals (World Health Organization, Tech. Rep., 2018).
2. Yue, H., He, C., Huang, Q., Yin, D. & Bryan, B. A. Stronger policy required to substantially reduce deaths from PM2.5 pollution
in China. Nat. Commun. 11(1), 1462 (2020).
3. Organization, W.H. National cancer control programmes: Policies and managerial guidelines. World Health Organization, (2002).
4. Hamann, H. A., Ver Hoeve, E. S., Carter-Harris, L., Studts, J. L. & Ostroff, J. S. Multilevel opportunities to address lung cancer
stigma across the cancer control continuum. J. Thoracic Oncol. 13(8), 1062–1075 (2018).
5. Valentine, T. R., Presley, C. J., Carbone, D. P., Shields, P. G. & Andersen, B. L. Illness perception profiles and psychological and
physical symptoms in newly diagnosed advanced non-small cell lung cancer. Health Psychol. 41(6), 379 (2022).
6. Maurya, S.P., Ohri, A., & Gaur, S. Relevance of spatio-temporal data visualization techniques in healthcare system. in Geospatial
Data Science in Healthcare for Society 5.0. Springer, 59–78 (2022).
7. Mithoowani, H. & Febbraro, M. Non-small-cell lung cancer in 2022: A review for general practitioners in oncology. Curr. Oncol.
29(3), 1828–1839 (2022).
8. Miller, K. D. et al. Cancer treatment and survivorship statistics, 2022. CA Cancer J. Clin. 72(5), 409–436 (2022).
9. Kourou, K., Exarchos, T. P., Exarchos, K. P., Karamouzis, M. V. & Fotiadis, D. I. Machine learning applications in cancer prognosis
and prediction. Comput. Struct. Biotechnol. J. 13, 8–17 (2015).
10. Yang, Y., Xu, L., Sun, L., Zhang, P. & Farid, S. S. Machine learning application in personalised lung cancer recurrence and surviv-
ability prediction. Comput. Struct. Biotechnol. J. 20, 1811–1820 (2022).
11. Pokkuluri, K.S., Usha Devi, N., & Mangalampalli, S. Dlcp: A robust deep learning with non-linear ca mechanism for lung cancer
prediction. in Innovations in Computer Science and Engineering: Proceedings of the Ninth ICICSE, 2021. Springer, 299–305 (2022).
12. Alsinglawi, B. et al. An explainable machine learning framework for lung cancer hospital length of stay prediction. Sci. Rep. 12(1),
607 (2022).
13. Venkatesh, S.P., & Raamesh, L. Predicting lung cancer survivability: A machine learning ensemble method on seer data, (2022).
14. Chauhan, A. et al. Detection of lung cancer using machine learning techniques based on routine blood indices. in 2020 IEEE
international conference for innovation in technology (INOCON). IEEE, 1–6. (2020)
15. Faisal, M. I., Bashir, S., Khan, Z. S., & Khan, F. H. An evaluation of machine learning classifiers and ensembles for early stage
prediction of lung cancer. in 3rd international conference on emerging trends in engineering, sciences and technology (ICEEST). IEEE
2018, 1–4 (2018).
16. Patra, R. Prediction of lung cancer using machine learning classifier. in Computing Science, Communication and Security: First
International Conference, COMS2. Gujarat, India, March 26–27, 2020, Revised Selected Papers 1. Springer 2020, 132–142 (2020).
17. Earnest, A., Tesema, G. A. & Stirling, R. G. Machine learning techniques to predict timeliness of care among lung cancer patients.
Healthcare. 11(20), 2756 (2023).
18. Chandran, U. et al. Machine learning and real-world data to predict lung cancer risk in routine care. Cancer Epidemiol. Biomark.
Prevent. 32(3), 337–343 (2023).
19. Qureshi, R. et al. Machine learning based personalized drug response prediction for lung cancer patients. Sci. Rep. 12(1), 18935
(2022).
20. Shmatko, A., Ghaffari Laleh, N., Gerstung, M. & Kather, J. N. Artificial intelligence in histopathology: Enhancing cancer research
and clinical oncology. Nat. Cancer 3(9), 1026–1038 (2022).
21. Nahm, F. S. Receiver operating characteristic curve: Overview and practical use for clinicians. Korean J. Anesthesiol. 75(1), 25–36
(2022).
22. Muschelli, J. III. Roc and auc with a binary predictor: A potentially misleading metric. J. Classification 37(3), 696–708 (2020).
23. Dritsas, E. & Trigka, M. Lung cancer risk prediction with machine learning models. Big Data Cognit. Comput. 6(4), 139 (2022).
Author contributions
Satya Prakash Maurya-Conceptualization, Methodology, Writing-original draft, Validation. Pushpendra Singh
Sisodiya-Writing-review & editing, Visualization, Supervision, Investigation, Resources, Project Management.
Rahul Mishra-Writing-review & editing, Visualization, Supervision, Investigation, Project Management. Devesh
Pratap Singh-Methodology, Writing-review & editing, Validation, Resources.
Competing interests
The authors declare no competing interests.
Additional information
Correspondence and requests for materials should be addressed to R.M.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International
License, which permits use, sharing, adaptation, distribution and reproduction in any medium or
format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the
Creative Commons licence, and indicate if changes were made. The images or other third party material in this
article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the
material. If material is not included in the article’s Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.