Proceedings of the Second International Conference on Intelligent Computing and Control Systems (ICICCS 2018)

IEEE Xplore Compliant Part Number: CFP18K74-ART; ISBN:978-1-5386-2842-3

A Big Data framework to analyze risk factors of diabetes outbreak in Indian population using a MapReduce algorithm

J. Ramsingh, V. Bhuvaneswari
Department of Computer Applications, Bharathiar University

Abstract

The increasing burden of chronic disease is hurting the economy and prosperity of the country through global risk, financial loss from increased expenditure, and loss of productivity, and it is likely to affect India's economic development adversely over the next couple of decades. Immediate measures must be taken to create awareness and thwart the epidemic among the Indian population. A unified Big Data analysis and evaluation framework is proposed to analyze awareness of the risk factors of diabetes among the young and middle-aged Indian population. In the first phase, data are acquired from heterogeneous sources in different formats (XML, log files, text documents, WhatsApp, emails) using Sqoop. The acquired data are converted from their different structures into a structured format using ETL and a text mining engine. A diabetic corpus is formed with reference to the food chart and a domain consultant, and it is stored in HDFS for further processing. The data analysis is performed as a MapReduce task using machine learning algorithms, and the results are visualized. The results show devastating effects on the middle-aged Indian population. High intake of refined carbohydrate foods and a significant reduction in physical activity have left many in the younger generation more prone to endemic diabetes. Rapid nutrition transition due to a westernized diet and lifestyle increases the rate of diabetes. More than half of young adolescents are more prone to diabetes. Extensive studies and clinical evidence show that type-2 diabetes is largely preventable through changes in lifestyle and food habits. To hold back the growing outbreak of diabetes, primary prevention must come through the promotion of a healthy diet, awareness of food nutrition value, and good physical activity as a global public policy priority.

Keywords: Big Data, Social Media, Diabetes, Corpus, Text Mining, MapReduce

I. INTRODUCTION

The widespread expansion and use of online social media has produced an enormous amount of data. Data on the internet (social media) is growing at express velocity, leading to many challenges such as processing the information and mining large data sets. The data generated through social media is both structured and unstructured, and traditional technologies tend to fail when processing such mixed information. Given this failure of traditional technologies, the Hadoop framework provides a platform for processing large data sets and extracting information from mixed data. Sentiment analysis of text data on the internet plays a vital role in extracting people's views towards a product, government schemes, and so on [4]. Manual interpretation of such a large amount of data is not feasible, so the need for automatic text categorization is apparent. Sentiment analysis draws on methods and techniques from many fields, such as Natural Language Processing, Machine Learning, and Text Mining [3].

Lack of physical activity and improper diet among young individuals is the biggest risk factor causing many chronic diseases around the world. Physical inactivity is alarming because it leads to major health problems (cardiovascular diseases, cancer, obesity, hypertension, diabetes) [3, 6, 14, 15]. Globally, chronic diseases, predominantly non-communicable diseases (NCDs) such as cardiovascular diseases, cancer and diabetes mellitus, account for the highest death toll of any cause. Deaths due to NCDs rose from 68% in 2012 to 82% in 2014 [8]. This rise in NCD deaths has placed an increasing burden on socioeconomic conditions in developing and many low-income countries. Diabetes Mellitus (DM), a non-communicable disease, has become a major health hazard in India, where around 61.3 million citizens are affected by this chronic disease [13]. The number is predicted to increase to around 103 million by 2030, making India the “Diabetes Capital” of the world. India has undergone a great change in epidemiology and economic status during the last decades. Globalization has resulted in increased movement of population and changes in food habits, dietary patterns, technology and race mixes. As per a 2007 statistics report, India ranks first in the list, with 49 million people affected by diabetes [2]. Awareness of diabetes among Indians is very poor, and many cases are not even diagnosed. Several studies on diabetes in India have shown a large increase in the share of people affected, from 8.2% to 18.6% in urban areas,

978-1-5386-2842-3/18/$31.00 ©2018 IEEE 1755



and from 2.4% to 9.2% in rural areas, within a time frame of just 16 years (1992-2008) [1, 7]. The high prevalence of diabetes strikes rural and urban India alike [2, 3]. A notable development in epidemiology is the recognition of 'behaviour' as a risk factor. Many diseases and their consequences are linked with human behaviour, beyond biomedical traits and past history [10].

Machine learning algorithms help categorize text data (news, tweets, reviews, blogs) as positive, negative or neutral based on the sentiment expressed [4, 18]. In this work, machine learning algorithms are used to find the association between diabetes incidence and the food habits (behavior) of individuals. This study focuses on the prevalence of risk factors among the general public in India. It also seeks to study the association with various socio-demographic characteristics related to awareness of physical activity, using social media data. The dataset was collected from social media such as Twitter and WhatsApp. Diagnostic analytics is carried out on a big data architecture to understand people's food preferences, the jobs preferred by the majority of the population, and people's physical activity. The paper is organized as follows: Section II gives an overview of the framework for analyzing food, job and physical activity preferences using a MapReduce algorithm in Python. Section III discusses the experimental results, followed by the conclusion in Section IV.

II. METHODS

This section presents a data analytics framework to analyze the population with the risk factors of Diabetes Mellitus using a MapReduce algorithm. The Big Data framework is designed using multi-node Hadoop with HDFS to store unstructured and structured data. Machine learning algorithms, written in Python in MapReduce form, are used to analyse the risk factors.

The DAEE (Data Analysis and Evaluation Engine) framework consists of four phases: Data Acquisition, Data Transformation, Data Analytics Engine and Visualization Dashboard. The framework fetches data from different sources and stores it in HDFS in a uniform structure. The data is then analyzed and the results are visualized.

A. Data Acquisition

Data potentially associated with the risk factors of diabetes is identified under four categories: Genetic Factors, Demographic Characteristics (Age, Gender), Behavioral Factors (Diet, Occupation, Physical Activity) and Clinical Factors. Key data related to these categories is collected from heterogeneous sources such as tweets, WhatsApp and blogs, together with interviews in which public institutions, civil societies and the local population were consulted on food habits. The data from the different sources is extracted and transported to HDFS, a centralized storage area, using Flume and the Streaming API [17]. Because the work is confined to India, the extracted data is filtered using the location parameter. A total of 9,000,000 instances were collected, of which 10% came from WhatsApp, 50% from Twitter, 30% from surveys (local population, civil societies) and 10% from interviews. The data set is in unstructured and semi-structured formats, which contributes the variety characteristic of Big Data.

Fig 1. DAEE Framework
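
The source shares and the location filter described in the Data Acquisition step can be illustrated with a short Python sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the record layout, the `location` field name and both function names are hypothetical, and only the total and the percentage shares come from the text.

```python
# Figures quoted in the Data Acquisition text.
TOTAL_INSTANCES = 9_000_000
SOURCE_SHARE_PERCENT = {"WhatsApp": 10, "Twitter": 50, "Survey": 30, "Interview": 10}


def instances_per_source(total, share_percent):
    """Convert percentage shares into absolute instance counts."""
    return {source: total * pct // 100 for source, pct in share_percent.items()}


def filter_by_location(records, country="India"):
    """Keep records whose 'location' field mentions the country.

    The 'location' key is a hypothetical field name, not the paper's schema.
    """
    wanted = country.lower()
    return [r for r in records if wanted in str(r.get("location", "")).lower()]


if __name__ == "__main__":
    print(instances_per_source(TOTAL_INSTANCES, SOURCE_SHARE_PERCENT))
    sample = [
        {"text": "rice again for dinner", "location": "Chennai, India"},
        {"text": "morning run done", "location": "London, UK"},
        {"text": "no location given"},
    ]
    print(filter_by_location(sample))
```

The substring match is deliberately naive (it would also accept "Indiana"); a real filter would more likely rely on structured geo metadata from the streaming source.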

B. Data Transformation and Dia-Corpus Creation

The data acquired from the different sources is stored in generic data exchange formats such as CSV (Comma-Separated Values) and JSON (JavaScript Object Notation). Pre-processing is carried out in the data transformation phase to remove inconsistencies, for example by handling missing values and mapping attributes to conventional names. The structured data is mapped through several joining attributes, and the unstructured data is transformed using text pre-processing techniques.

Python is used to transfer and pre-process the structured and unstructured data collected from the different sources. A common ETL process is used to consolidate and pre-process the structured data (.csv), while the unstructured data (JSON, text, etc.) is transferred and consolidated into a structured format using a text mining process [20]. Hashtags, symbols, commas, full stops, colons and emoticons in the unstructured data are removed during stop-word removal. The data is then tokenized and stemmed using an N-gram approach. The terms relevant to the study are filtered and stored along with the user information, and a complete catalogue of keywords related to the risk factors of Diabetes Mellitus was developed.

A corpus related to diet (the diet corpus) was developed along with the Glycemic Index (GI). The diet corpus was classified and indexed based on the good and bad aspects of food carbohydrates. The diabetic corpus was created in consultation with domain experts, nutritionists and diabetologists. To recognize the sentiments of the public, an emoticon corpus is used along with the Dia-corpus; it contains happy (“:-)”, “:)”, “=)”, “:” etc.) and sad (“:-(”, “:(”, “=(”, “;(” etc.) emoticons.

C. Data Analytics Engine

a) Opinion Mining using Machine Learning

Sentiment analysis, or opinion mining, is the field that analyzes people's attitudes, emotions, opinions and polarity based on the data available. Opinion is extracted at three levels: document level, sentence level and aspect level [5]. The proposed opinion mining of social media data is done at the sentence level and the document level using a machine learning algorithm.

The maximum entropy model is used to classify opinion at the sentence level and the document level. Text classification with maximum entropy uses word counts as features. The data is categorized based on the terms related to the risk factors of diabetes mellitus. The maximum-entropy weight w(w_i, d) of word w_i in document d is derived from f(w_i, d), the number of times the word occurs in the document, using the following equation.

w(w_i, d) = log(f(w_i, d))

The word frequency is calculated and the results are visualized. Table I presents the mapping and reducing phases, implemented in Python on a multi-node Hadoop architecture.

TABLE I. PSEUDO CODE [CLASSIFIER]

Algorithm: mapper.py
  Read input from STDIN (standard input)
  Import and load the required natural language processing library
  Iterate over the input to pre-process it:
    Split each line on whitespace
    Stem each input word using a stemmer
    Iterate over the stemmed words to clean them:
      Remove punctuation, numbers and stop words; convert case
  Write the results to STDOUT (standard output)
  The output here is the input for the Reduce step, i.e. the input for reducer.py

Algorithm: reducer.py
  Read input from STDIN, i.e. the output of mapper.py
  Initialize the variables required to count the current word
  Iterate over the input to cluster the words:
    Parse the input received from mapper.py
    Convert the count to an integer
    Calculate the word frequency count
  Print the result to STDOUT
  Store the result in HDFS

III. RESULTS

The proposed method is evaluated on the heterogeneous data set generated from social media such as Twitter and WhatsApp, and from surveys. Table II presents the number of instances collected from social media using Sqoop. The data related to the risk factors of diabetes is extracted using the streaming API provided.

TABLE II. DATA EXTRACTION

Stage                              Instances
Data extracted                     9000000
Data based on location             1500000
Data after removing duplication    1300000

The extracted data contains a lot of noise and irrelevant data. It is saved in JSON format and pre-processed using Python natural language processing (NLP). In the mapping phase the collected data is pre-processed, tokenized and stemmed, and the data is consolidated as key-value pairs. Finally, the pre-processed data is consolidated using the maximum entropy model, and the word frequencies are calculated and stored by class label. Figure 2 shows sample data after pre-processing.
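
The mapping and reducing phases described above can be sketched in plain Python. This is a minimal sketch under stated assumptions rather than the authors' code: the stop-word list is hypothetical, `simple_stem` is a crude stand-in for a real stemmer, cleaning is done before stemming for simplicity, and `run_local` only simulates the shuffle-and-sort that Hadoop performs between `mapper.py` and `reducer.py`. The reducer attaches the weight w = log(f) from the equation in the text.

```python
import math
import re
from itertools import groupby

# Hypothetical stop-word list; the paper does not publish the one it used.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "i", "for", "with"}


def simple_stem(word):
    """Crude suffix stripper standing in for a real stemmer (an assumption)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def mapper(lines):
    """Map phase: clean each line and emit 'term<TAB>1' records."""
    for line in lines:
        # Case conversion, then replace punctuation and digits with spaces.
        cleaned = re.sub(r"[^a-z\s]", " ", line.lower())
        for token in cleaned.split():
            if token not in STOP_WORDS:
                yield f"{simple_stem(token)}\t1"


def reducer(sorted_records):
    """Reduce phase: sum the counts per term and attach w = log(f)."""
    parsed = (record.split("\t") for record in sorted_records)
    for term, group in groupby(parsed, key=lambda pair: pair[0]):
        frequency = sum(int(count) for _, count in group)
        yield term, frequency, math.log(frequency)


def run_local(lines):
    """Simulate Hadoop's shuffle-and-sort between the two phases."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for row in run_local(["I love eating noodles and rice!",
                          "Rice, rice and more NOODLES 2 times"]):
        print(row)
```

One consequence of the log weighting is that a term seen only once receives weight log(1) = 0, so single mentions contribute nothing to the weighted totals.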


Fig 2: Pre-processed unstructured data

Table III presents a sample of the diet corpus created using the Glycemic Index based on carbohydrate. The food preferred by the people is compared with its Glycemic Index and the corpus is created.

TABLE III. DIA-CORPUS

Code      Food item
FFGL1     Prunes
FFGL3     Cherries
FFGL4     Grapefruit
FFGL5     Apricots
FFGL6     Strawberries
FFGL7     Figs
FFGL8     Apples
FFGL9     Pears
FFGL10    Plums
FFGL11    Peaches
FFGL12    Oranges
FFGL13    Grapes

The stemmed data is weighted against the dia and emoticon corpora using a bigram approach to find the opinion expressed in each item. The weighted data is stored as key-value pairs for future analysis. In the reducing phase the sorted data is classified based on the word frequency calculated using the entropy model; Figure 3 shows the word frequency.

Fig 4: Frequency of the food items preferred by the general public

Fig 4 depicts the food preferences of the people: many prefer highly refined foods that are rich in carbohydrates.

Fig 5 represents the GI of the food items preferred by the people, per 130 grams. Comparing Fig 4 and Fig 5 shows that people prefer foods that are high in GI, which results in high blood sugar levels.

Fig 5: GI of food items preferred by people
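
The corpus weighting and the comparison of food preference against GI discussed above can be sketched together as follows. Everything here is an assumption made for illustration: the `DIA_CORPUS` entries and their GI bands are placeholders rather than values from the paper's corpus, the function names are hypothetical, and only the emoticon sets are taken from the emoticon corpus listed in the text. Unigrams and bigrams are both looked up so that two-word food names can match.

```python
from collections import Counter

# Placeholder corpus: term (unigram or bigram) -> GI band. Illustrative only.
DIA_CORPUS = {
    "white rice": "high",
    "soft drink": "high",
    "millet": "moderate",
    "apple": "low",
}
HAPPY = {":-)", ":)", "=)"}
SAD = {":-(", ":(", "=(", ";("}


def score_message(text):
    """Return (GI bands matched, sentiment score) for one message."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    bands = [DIA_CORPUS[term] for term in tokens + bigrams if term in DIA_CORPUS]
    sentiment = sum(t in HAPPY for t in tokens) - sum(t in SAD for t in tokens)
    return bands, sentiment


def high_gi_share(messages):
    """Fraction of corpus matches that fall in the 'high' GI band."""
    counts = Counter(band for m in messages for band in score_message(m)[0])
    total = sum(counts.values())
    return counts["high"] / total if total else 0.0


if __name__ == "__main__":
    messages = ["white rice for lunch :)", "soft drink again :(",
                "had an apple", "millet porridge =)"]
    print([score_message(m) for m in messages])
    print(high_gi_share(messages))
```

The sketch skips stemming and punctuation handling, so it assumes the messages have already been through the pre-processing step.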

Fig 3: Maximum entropy of the food items

Fig 6: Physical activity based on occupation

The above results indicate that 70% of the young generation (ages 20-40) consume more


highly refined carbohydrate foods (packaged foods, noodles, ice cream and rice items), sugar-sweetened drinks (soda, cool drinks, processed fruit juice, etc.), food items with trans fat (cookies, bread, butter, etc.) and meats, while the age group above 40 consumes more wheat, soya and millets, which are moderate in GI and good carbohydrate foods.

Discussion

This work provides an insight into the risk factors of diabetes among the people of India. The prediction of diabetes risk factors is done using big data analytics. The incidence of diabetes among the people of India is very high compared with the people of the US [4]. Many factors contribute to making India the diabetic capital of the world. The first step towards the goal of reducing diabetes is to identify all diabetes patients and those at risk of developing diabetes. Evidence supporting the two important prevention methods has been clearly explored [9]. This work identifies the people with risk factors of diabetes; the analysis reveals that lack of physical activity and changes in lifestyle contribute to over 30% of diabetes in India. A change from a sedentary lifestyle to an active one with physical activity reduces blood glucose levels and contributes to the production of lipids in patients with evidence of diabetes [3]. The analysis clearly reveals the unhealthy eating behavior of the growing younger generation of India [19]. Around 67.9% of adolescents consume unhealthy foods with high GI. Unhealthy eating habits are one of the predominant reasons for diabetes in India [11, 16]. This is why unhealthy nutrition among adolescents needs to change: the dietary behavior acquired at this stage continues later in life. Recent evidence suggests that numerous small bouts of exercise combined with a healthy diet are more effective for long-term diabetes management in men and women aged 50 to 65 years [2]. Many longitudinal studies of behavior confirm this persistence through adolescence and into adulthood [12]. Fewer than 28% of people perform physical activity daily. Physical inactivity is another global concern, contributing to many non-communicable diseases that lead to mortality. Behavioral risk factors (physical inactivity and unhealthy eating habits) are among the main causes of the pre-diabetic stage [10]. The analysis found a devastating result: more than 80% of people are pre-diabetic and likely to be affected by diabetes at a very young age. With increased life expectancy, the difficulties due to diabetes lead to severe disability in old age [1].

Proposed Prevention Strategies for Reduction of Diabetes

To reduce the incidence of diabetes in India, strenuous efforts are needed from all parties, including policy makers, the government, manufacturers, consumers and other agencies in food production and safety. Firm laws should be implemented that focus on reducing the rising prevalence of diabetes in India. The strategies listed below for reducing unhealthy food intake would reduce not only diabetes but also other chronic diseases.

• Strict guidelines on calorie values should be formulated.
• Increase awareness among consumers of the risks of diabetes.
• Increase tax on products such as sugar-sweetened beverages, refined carbohydrate foods and western foods.
• The sale of unhealthy and junk foods must be banned within school campuses, and a healthy diet must be made available to children.
• Warning labels on unhealthy foods must be made mandatory (obesity, diabetes, heart disease, tooth decay).
• Ban advertisements for commercial foods on television (during prime time and children's programs).
• Loans and endowments could be made free for the traditional industries that manufacture healthy food and snacks.
• Decrease taxes on and prices of fruits, vegetables, nuts and other healthy foods.

IV. CONCLUSION

Diabetes mellitus is reaching a potentially epidemic extent in India. Diabetes is now widespread across all sectors of the general public within India. The morbidity and mortality owing to diabetes and its severe complications are huge and place a considerable healthcare burden on families as well as on society. Diabetes is associated with a range of complications and tends to strike at a relatively young age within the Indian subcontinent. India is experiencing a steady relocation of people from rural to urban areas, and the accompanying economic and lifestyle changes lead to diabetes. As the prevalence of diabetes continues to rise, it is important to identify the high-risk populations and to implement policies to delay or prevent diabetes onset. Despite the increase in diabetes, there is a scarcity of research investigating the exact prevalence of diabetes in India, owing to the diversity across the country. This research bridges the gap in regional-level research across all age groups. The


adolescents were targeted through social media, the younger generation through common gatherings, and the elderly through surveys, to help alleviate the potentially disastrous increase in diabetes predicted for the coming years. An experimental analysis is made using social media data collected from Twitter and WhatsApp, together with survey data, to analyze people's involvement in physical activity and the risk factors associated with diabetes. Devastating results are observed after analysing the data collected from the different sources. By associating the two results, it is found that around half the population prefers highly refined carbohydrates and food with high GI. Analysing people by profession also shows that people choose white-collar jobs rather than jobs that involve physical activity; in turn, they have excess blood sugar levels due to physical inactivity and consumption of food with high GI. Early diagnosis is needed to reduce the effect of diabetes on the health and the economy of the country.

REFERENCES

[1]. Amos AF, McCarty DJ, Zimmet P. The rising global burden of diabetes and its complications: estimate and projections to the year 2010. Diabetic Med. 1997;14(Suppl-5):S1–85.
[2]. Abby CK, William lH, Deborah RY, Roberta KO, Marcia lS. Longterm effects of varying intensities and formats of physical activity on participation rates, fitness, and lipoproteins in men and women aged 50 to 65 years. Circulation. 1995;91:2596–604.
[3]. Ardoyl DN, Artero EG, Ruiz JR, Labayen I, Sjostrom M, Castillo M, et al. Effects on adolescents’ lipid profile of a fitness-enhancing intervention in the school setting: the EDUFIT study. Nutr Hosp. 2013;28:119–26.
[4]. B. Pang and L. Lee, "Opinion Mining and Sentiment Analysis", Foundations and Trends in Information Retrieval, Vol. 2, Nos. 1–2 (2008) 1–135.
[5]. Bing Liu, Sentiment Analysis and Opinion Mining, Morgan and Claypool Publishers, May 2012. p. 18-19, 27-28, 44-45, 47, 90-101.
[6]. Evangelia M, Kouidi N, Koutlianos N, Deligiannis A. Effects of long-term exercise training on cardiac baroreflex sensitivity in patients with coronary artery disease: a randomized controlled trial. Clin Rehabil. 2010;0269215510380825.
[7]. Gupta A, Gupta R, Sarna M, Rastogi S, Gupta VP, Kothari K. Prevalence of diabetes, impaired fasting glucose and insulin resistance syndrome in an urban Indian population. Diabetes Res Clin Pract. 2003;61:69–76.
[8]. International Diabetes Federation. IDF diabetes atlas. 4th ed. Belgium: International Diabetes Federation; 2009.
[9]. International Diabetes Federation. IDF diabetes atlas. 4th ed. Belgium: International Diabetes Federation; 2013 (accessed on 15 Jan 2014).
[10]. Jessor R. Risk behavior in adolescence: a psychological framework for understanding and action. J Adolesc Health. 1991;12:597–605.
[11]. Jessor R. Risk behavior in adolescence: a psychological framework for understanding and action. J Adolesc Health. 1991;12:597–605.
[12]. Lien N, Lytle LA, Klepp KI. Stability in consumption of fruit, vegetables, and sugary foods in a cohort from age 14 to age 21. Prev Med. 2001;33:217–26.
[13]. Mohan V, Sandeep S, Deepa R, Shah B, Varghese C. Epidemiology of type 2 diabetes: Indian scenario. Indian J Med Res. 2007;125:217–30.
[14]. Nathan DM. Initial management of glycemia in type 2 diabetes mellitus. N Engl J Med. 2002;347:1342–9.
[15]. Nikam S, Nikam P, Joshi A, Viveki RG, Halappanavar B, Hungund B. Effect of regular physical exercise (among circus athletes) on lipid profile, lipid peroxidation and enzymatic antioxidants. Int J Biochem Res Rev. 2013;3(4):414–20.
[16]. Park K. Park’s textbook of preventive and social medicine. 18th ed. Jabalpur: Banarsidas Bhanot Publishers; 2005.
[17]. Ramsingh J, Bhuvaneshwari. Data Analytic on Diabetic awareness with Hadoop Streaming using Map Reduce in Python, IEEE, 10.1109/ICACA.2016.7887979.
[18]. Ran Kim, MSN, RN, et al. Development and Evaluation of an Obesity Ontology for Social Big Data Analysis. Healthc Inform Res. 2017 July;23(3):159-168. https://doi.org/10.4258/hir.2017.23.3.141. pISSN 2093-3681 • eISSN 2093-369X.
[19]. Tuomilehto J, Lindstrom J, Eriksson J, et al. Prevention of type 2 diabetes mellitus by changes in lifestyle among subjects with impaired glucose tolerance. N Engl J Med. 2001;344:1343–50.
[20]. Vinaitheerthan Renganathan. Text Mining in Biomedical Domain with Emphasis on Document Clustering. Healthc Inform Res. 2017 July;23(3):141-146. https://doi.org/10.4258/hir.2017.23.3.141. pISSN 2093-3681 • eISSN 2093-369X.
