Applied Machine Learning for Smart Data Analysis (ISBN 9780429440953, 9781138339798)
Edited by
Nilanjan Dey
Sanjeev Wagh
Parikshit N. Mahalle
Mohd. Shafi Pathan
MATLAB® and Simulink® are trademarks of The MathWorks, Inc. and are used with permission. The
MathWorks does not warrant the accuracy of the text or exercises in this book. This book’s use or
discussion of MATLAB® and Simulink® software or related products does not constitute endorsement or
sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB®
and Simulink® software.
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2019 by Taylor & Francis Group, LLC
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .vii
Editors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
List of Contributors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .xv
Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
Preface
Applied Machine Learning for Smart Data Analysis discusses varied emerging
and developing domains of computer technology. This book is divided
into four sections covering machine learning, data mining, Internet of
things, and information security.
Machine learning is a technique in which a system first understands
information from the available context and then makes decisions. Trans-
literation from Indian languages to English is a recurring need, not only
for converting out-of-vocabulary words during machine translation but
also for providing multilingual support for printing electricity bills,
telephone bills, municipal corporation tax bills, and so on, in English
as well as Hindi language. The origin of source-language named entities is taken as either Indo-Aryan-Hindi (IAH) or Indo-Aryan-Urdu (IAU). Two separate language models are built for IAH and IAU.
GIZA++ is used for word alignment.
Anxiety is a common problem among our generation. Anxiety is usually treated by visiting a therapist or by using medication to regulate hormones such as adrenaline. There are also some studies based on talking to a chatbot. Cognitive Behavioral Therapy (CBT) mainly focuses on the user's ability to accept behavior, clarify problems, and understand the reasoning behind setting goals. Anxiety can be reduced by detecting
the emotion and clarifying the problems. Work on natural conversation between humans and machines aims at providing general bot systems for members of particular organizations. It uses natural language processing along with pattern-matching techniques to provide an appropriate response to the end user for a requested query. Experimental analysis suggests that
topic-specific dialogue coupled with conversational knowledge yields
maximum dialogue sessions when compared with general conversational
dialogue.
We also discuss the need for a plagiarism detection system, namely Plagiasil. Some of our research highlights a technical scenario in which the knowledge base or source is predicted to be a local dataset, Internet resources, online or offline books, or research published by various publishers and industries. The architecture presented here focuses on designing an algorithm that is compliant with a dynamic environment of datasets. Thus, information extraction, prediction of its key aspects, and compression or transformation of information for storage and faster comparison are addressed in the research methodology.
Data summarization is an important data analysis technique. Sum-
marization is broadly classified into two types based on the metho-
dology: semantic and syntactic. Clustering algorithms like K-means
Machine Learning
1
Hindi and Urdu to English Named Entity
Statistical Machine Transliteration Using
Source Language Word Origin Context
CONTENTS
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Hindi Phonology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 IAH and IAU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.6 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.7 Experimental Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.7.1 Normalization or Pre-processing Phase . . . . . . . . . . . . . . . 10
1.7.2 Training Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.7.2.1 Language Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.7.2.2 Transliteration Model . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.7.2.3 Testing Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.8 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.9 Experimental Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.10 Conclusion and Future Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.1 Introduction
Out of vocabulary (OOV) words such as named entities (proper nouns, city
names, village names, location names, organization names, henceforth
referred as NE) are not a part of bilingual dictionaries. Hence, it is necessary
to transliterate them in target language by retaining their phonetic features.
Transliteration is a process of mapping alphabets or words of source language
script to target language script by retaining their phonetic properties [1]. The
choice of languages for our current proposal is the Hindi–English language
pair, in which the source language is Hindi, one of the official languages of the Republic of India, and the target is English, the world's business language.
Hindi uses the Devanagari script, which is highly phonetic in nature, while
English uses the Latin script, which is less phonetic when compared with
native Indian languages. In rest of the chapters, Hindi language and its
associated Devanagari script will be referred to as Hindi, whereas the English
language and Latin script will be referred to as English. According to the Official Languages Act of 1963, Hindi is an official language of India. After Chinese, English, and Spanish, Hindi is the world's fourth most widely spoken language. Hindi is an Indo-Aryan language of the Indo-European family, spoken by half a billion people across India, Africa, Asia, America, Oceania, and
Europe. According to a recent census, about 41.03 percent of the population in
India speaks Hindi. Urdu is an Indic language spoken by approximately 52
million people in the southern, central, northern and western Indian states of
Jammu and Kashmir, Telangana, Delhi, and Maharashtra. The Urdu literature
has in its collections a wide range of poems and ghazals, which are also used
in movie dialogs. A special case in the Urdu poetry is “shayari,” which is a
soulful way of expressing one’s thoughts and ideas.
1.3 Motivation
There is a need to work on transliteration generation problem if source
language contains NE of multiple origins but share a common script. The
2001 census of India reveals that the Indian population largely consists of
Hindus, following which the Muslims take the second place. In India, most
of the documents require the name and address of the holder either in
Hindi or English or both. Hindu and Muslim names, when written and read in Hindi, maintain their originality, but their transliterations differ in English. If the proper context in terms of origin is not known, there is a greater chance of incorrect transliteration, as mentioned in the above scenario. In India, NEs written in Hindi using the Devanagari script may have
different origins. For this work we have considered two origins: one is
Indo-Aryan-Hindi (henceforth IAH) and the other is Indo-Aryan-Urdu
(henceforth IAU). NE written in IAH follows the phonetic properties of
Devanagari script for transliteration generation, whereas IAU follows the
phonetic properties of Urdu. The Urdu language uses the Perso-Arabic script, and it is the national language of Pakistan. Urdu is also one of the official languages of India. NEs of both IAH and IAU origin use the Devanagari script, which is the major source of ambiguity. Consider as an
example the name written in Devanagari “तौफीक” would be transliterated
in English as “Taufik” by considering IAH context because it is written
without using nuqtas for /फ़/ and /क़/. The correct transliteration is
“Taufiq” according to IAU as /फ़/ and /क़/ are adopted to provide true
phonemes of Urdu consonants in Hindi.
In the past, various utility bills including electricity, telephones and
others in the state of Maharashtra were printed only in English, while
other bills such as the Municipal Corporation Tax were printed in Hindi.
In the recent years, these bills have names printed in Hindi as well as in
English. However, the accuracy of names transliterated from English to
Hindi and vice versa is not up to the mark in many cases. There are several reasons for this inappropriate transliteration. A few of them are multiple origins, and one-to-many and many-to-one mappings caused by the unequal number of alphabets in the source and target languages. Machine transliteration also loses accuracy because the English letter /a/ maps to two Devanagari letters, /अ/ and /आ/. A particularly stringent ambiguity is that /अ/ is a short vowel while /आ/ is a long vowel in Devanagari. This ambiguity affects the accuracy of transliteration generation, especially from Roman to Devanagari. The correct name of the
author of this chapter is /Manikrao//माणिकराव/, but it is printed as
/मणिकराव/ on the electricity bill as the earlier records were in English
and now have been transliterated in Hindi using machine learning. This
chapter proposes Hindi to English transliteration by considering the origin
of named entity in source language. In order to detect the origin, two
separate data sets and two language models are used for IAH and IAU.
Both the models are trained separately and the word alignment is obtained
using GIZA++. Testing is done by calculating the probability of given
named entity over both language models and higher probability between
the two outcomes is taken as origin. If IAH has higher probability over the
alignments using GIZA++ for the Hindi–Urdu pair. In [18], the authors have
proposed a mechanism to achieve the transliteration from Perso-Arabic to
Indic Script using hybrid-wordlist-generator (HWG). In [19], the authors have
presented the Beijing Jiaotong University Natural Language Processing (BJTU-NLP) system of transliteration using multiple features. In [20], the authors have
proposed a system using two baseline models for alignment between Thai
Orthography and Phonology. This chapter proposes an SMT-based Devana-
gari-to-Roman transliteration by separately handling two disjoint data sets for
two origins, two language models and separate training for each language.
The same mechanism was proposed by Mitesh M. Khapra for Indic and
western origin in 2009.
TABLE 1.1
Devanagari Script with its Phonetic Mapping for IAH and IAU
Vowels (with matras): /अ-a-no matra/, /आ-A-ा/, /ए-E-े/, /इ-i-ि/, /ऐ-ai-ै/, /ई-ee-ी/, /ओ-oo-ो/, /उ-u-ु/, /औ-au-ौ/, /ऊ-U-ू/, /अं-am-ं/, /ऋ-Ru-ृ/, /ॠ-RU-ॄ/, /अ:-aH-ः/
Consonants: /क-ka/, /ख-kha/, /ग-ga/, /घ-gha/, /ङ-nga/, /च-cha/, /छ-chha/, /ज-ja/, /झ-jha/, /ञ-nya/, /ट-Ta/, /ठ-Tha/, /ड-Da/, /ढ-dha/, /ण-Na/, /त-ta/, /थ-tha/, /द-da/, /ध-Dha/, /न-na/, /प-pa/, /फ-pha/, /ब-ba/, /भ-bha/, /म-ma/, /य-ya/, /र-ra/, /ल-la/, /व-va/, /श-sha/, /ष-Sha/, /स-sa/, /ह-ha/
Conjuncts: /क्ष-ksha/, /ज्ञ-dnya/, /श्र-shra/, /द्य-dya/, /त्र-tra/, /श्री-shree/, /ॐ-om/
IAU nuqta consonants: /ढ़-rha/, /ख़-khha/, /ग़-gaa/, /ज़-za/, /फ़-fa/, /ड़-Dxa/, /क़-qa/
provide true phonemes in Persian, Arabic, and Urdu. These seven consonants have a “dot” in their subscript, referred to as a nuqta, which indicates the difference in phonetics from the regular Devanagari consonants.
Muslim names are mostly written by using IAU origin consonants in
Hindi; however, the practice of not using the nuqta generates wrong transliterations. For example, the named entity “तौफिक” is transliterated as “Tauphik” by considering the IAH context, but the correct transliteration must be “Taufiq” according to IAU. If /तौफिक/ is correctly written as /तौफ़िक़/ using the IAU-origin consonants, it will then be transliterated to /Taufiq/. Nowadays the nuqta is rarely used while writing names of IAU origin in Hindi, which is a major obstacle to generating correct transliterations for Muslim names written in Hindi. This chapter mainly focuses on
this issue and a solution is provided by creating two separate data sets,
different language models and different trainings.
1.6 Methodology
The overall system architecture for Hindi to English named entity transli-
teration using correct origin in source language is shown in Figure 1.1. To
avoid confusion, two separate data sets are created for IAH and IAU for
Hindu and Muslim named entities. For IAH, syllabification of Hindu
names is done by using phonetic mappings for all consonants and vowels
except the seven additional IAU origin consonants /क़, ख़, ग़, ज़, ड़, ढ़, फ़/.
Few examples are:
कुमार गंधर्व [कु| मा| र] [गं| ध| र्व] [ku |ma | r] [gan | dha | rva]
भिमसेन जोशी [भि| म| से| न] [जो| शी] [bhi | m | se | n] [jo | shi]
गिरिजा देवी [गि | रि |जा] [दे | वी] [gi | ri | ja] [de | vi]
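As a minimal sketch (not the authors' implementation), the syllabification above can be approximated in Python by attaching dependent vowel signs, anusvara/visarga, and virama-joined consonants to the preceding consonant; the sign set and the joining rule are simplifying assumptions:

```python
# Dependent vowel signs (matras) plus anusvara, visarga, and chandrabindu
VOWEL_SIGNS = set("ािीुूृॄेैोौंःँ")
VIRAMA = "\u094d"  # the halant, which joins consonants into conjuncts

def syllabify(word):
    """Split a Devanagari named entity into written syllable units."""
    syllables = []
    attach_next = False  # True right after a virama: join the next consonant
    for ch in word:
        if syllables and (ch in VOWEL_SIGNS or ch == VIRAMA or attach_next):
            syllables[-1] += ch
        else:
            syllables.append(ch)
        attach_next = (ch == VIRAMA)
    return syllables

print(syllabify("कुमार"))   # ['कु', 'मा', 'र'], matching [कु| मा| र]
print(syllabify("गंधर्व"))  # ['गं', 'ध', 'र्व'], matching [गं| ध| र्व]
```

The virama rule is what keeps conjuncts such as /र्व/ in one unit, as in the /गंधर्व/ example above.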
FIGURE 1.1
Overall System Architecture.
fully disjoint from trained data sets. During the decoding process, the
named entity is given as an input to both language models in order to
calculate probabilities over the syllables. This gives two probabilities over
IAH and IAU language models and the higher probability between the
two decides on which trained model is to be used for decoding.
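The decision between the two trained models can be illustrated with a toy syllable-bigram language model; this is a hedged sketch only (the chapter builds its models with IRSTLM), and the add-one smoothing and vocabulary size are assumptions:

```python
import math
from collections import defaultdict

def train_bigram_lm(names):
    """Count syllable bigrams over names, with <s>/</s> boundary markers."""
    counts, context = defaultdict(int), defaultdict(int)
    for syllables in names:
        seq = ["<s>"] + syllables + ["</s>"]
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
            context[a] += 1
    return counts, context

def log_prob(model, syllables, vocab_size=1000):
    """Add-one smoothed bigram log10 probability of a syllabified name."""
    counts, context = model
    seq = ["<s>"] + syllables + ["</s>"]
    return sum(
        math.log10((counts[(a, b)] + 1) / (context[a] + vocab_size))
        for a, b in zip(seq, seq[1:])
    )

def detect_origin(iah_model, iau_model, syllables):
    """Pick the origin whose language model assigns the higher probability."""
    if log_prob(iah_model, syllables) >= log_prob(iau_model, syllables):
        return "IAH"
    return "IAU"
```

A name whose syllable patterns were seen in the IAH training data scores higher under the IAH model, and decoding then proceeds with the corresponding trained model.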
FIGURE 1.2
Broader Experimental Framework.
FIGURE 1.3
Overall System Architecture (testing phase with moses.ini, IRSTLM language models, and IAU/IAH context over monolingual training data).
FIGURE 1.4
Script for Building Language Model.
FIGURE 1.5
ARPA File.
model: <s>, </s>, and <unk>. The <s> denotes the beginning of a syllable sequence and the </s> denotes its end. This differs from our original study, in which there was no concept of beginnings and ends of syllable sequences and syllable boundaries were simply ignored. The <unk> special word means “unknown”.
Consider the example {माणिकराव}; the probabilities generated for माणिकराव in the ARPA file on the basis of n-grams are as given here.
A probability always lies between 0 and 1. In the above example, the 1-gram log probabilities are (–1.278754) for {मा}, (–1.102662) for <s>, and (–0.465840) for </s>. After converting these log values back to actual probabilities, the unigram probabilities over the full vocabulary must sum to 1. As 1-grams give unconditional probabilities, we need to add the beginning and end of the syllable sequence along with the probability value of {मा} in this case,
FIGURE 1.6
Script for generating ‘Phrase Table’ and ’moses.ini file’.
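Since ARPA files store base-10 log probabilities, the sample values quoted above convert back to probabilities as follows (a small illustrative sketch; the three entries shown are not the full vocabulary, so they alone do not sum to 1):

```python
# 1-gram log10 probabilities quoted from the sample ARPA file
log10_probs = {"मा": -1.278754, "<s>": -1.102662, "</s>": -0.465840}

# ARPA stores log10 values, so probability = 10 ** log10_probability
probs = {syllable: 10 ** lp for syllable, lp in log10_probs.items()}

# Every converted value lies strictly between 0 and 1; only the unigram
# probabilities of the *entire* vocabulary sum to 1.
print(probs["मा"])  # ≈ 0.0527
```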
FIGURE 1.7
Phrase Table.
FIGURE 1.8
Moses.ini File.
GIZA++ includes:
The latest version of Moses software embeds calls to GIZA++ and mkcls
software and hence there is no need to call them separately.
FIGURE 1.9
Sample input.
FIGURE 1.10
Sample output.
FIGURE 1.11
Sample mismatch.
FIGURE 1.12
Console sample output.
The default setting weights used are: LM: 0.5, TM: 0.2, 0.2, 0.2, 0.2,
Distortion Limit: 0.3.
Figures 1.9, 1.10, 1.11, and 1.12 show the sample input, sample output, sample mismatch, and sample output testing at the terminal, respectively.
1.8 Data
The training data sets for IAH and IAU are created from the authentic
resources of government websites. A few of them are the census website, voter-list websites, road atlas books, and telephone directories of the Government of India in English and Hindi. The IAH monolingual data set consists of 20000 NEs, and the IAU monolingual data set consists of 2200 unique named entities, each with only one candidate name in English. Test data comprised 2200 NEs for IAH and 400 NEs for IAU. Initial testing was done using separate test sets for IAH and IAU; final testing combined the IAH and IAU test sets into a single set of 2600 NEs.
TABLE 1.2
Results IAH to English
Training Set 20 K 20 K 20 K
Test Set 2.2 K 2.2 K 2.2 K
Exact Match Found 1521 1704 1865
Accuracy in % 69.13% 77.45% 84.77%
TABLE 1.3
Results IAU to English
TABLE 1.4
Results IAH, IAU to English
to English. Table 1.3 shows the results of IAU to English. Table 1.4 shows the results of IAH and IAU to English. Accuracy here is word accuracy (the complement of the word error rate), which measures the correctness of the transliteration candidate produced by a transliteration system. Accuracy = 1 indicates an exact match and Accuracy = 0 indicates no match for IAH and IAU. The following formula is used for calculating accuracy.
Accuracy = \frac{1}{N} \sum_{i=1}^{N} n_i, \qquad
n_i = \begin{cases} 1, & \text{if a correct match is found} \\ 0, & \text{if an incorrect match is found} \end{cases}
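The accuracy formula translates directly into code; a minimal sketch:

```python
def accuracy(system_outputs, references):
    """Accuracy = (1/N) * sum over i of n_i, where n_i = 1 when the
    system transliteration exactly matches the reference, else 0."""
    n = len(references)
    matches = sum(1 for out, ref in zip(system_outputs, references) if out == ref)
    return matches / n

print(accuracy(["Taufiq", "Kumar"], ["Taufiq", "Kumat"]))  # 0.5
```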
References
[1] Padariya Nilesh, Chinnakotla Manoj, Nagesh Ajay, Damani Om P. (2008),
“Evaluation of Hindi to English, Marathi to English and English to Hindi”, IIT
Mumbai CLIR at FIRE.
[2] Arbabi M, Fischthal S M, Cheng V C and Bart E (1994) “Algorithms for Arabic
name transliteration”, IBM Journal of Research and Development, pp. 183-194.
[3] Knight Kevin and Graehl Jonathan (1997) “Machine transliteration”, In Pro-
ceedings of the 35th annual meetings of the Association for Computational
Linguistics, pp. 128-135.
[4] K. Knight and J. Graehl, “Machine Transliteration”, Computational Linguistics, 24(4):599–612, Dec. 1998.
[5] Gao W, Wong K F and Lam W (2004), ‘Improving transliteration with precise
alignment of phoneme chunks and using contextual features”, In Information
Retrieval Technology, Asia Information Retrieval Symposium. Lecture Notes
in Computer Science, vol. 3411, Springer, Berlin, pp. 106–117
[6] Saha Sujan Kumar, Ghosh P S, Sarkar Sudeshna and Mitra Pabitra (2008), “Named entity recognition in Hindi using maximum entropy and transliteration”, Polibits (38), an open-access research journal on Computer Science and Computer Engineering published by the Centro de Innovación y Desarrollo Tecnológico en Cómputo of the Instituto Politécnico Nacional, Mexico, pp. 33-41.
[7] Dhore M L, Dixit S K and Dhore R M (2012) “Hindi and Marathi to English NE
Transliteration Tool using Phonology and Stress Analysis”, 24th International
Conference on Computational Linguistics Proceedings of COLING Demon-
stration Papers, at IIT Bombay, pp 111-118
[8] Dhore M L (2017) “Marathi - English Named Entities Forward Machine Transli-
teration using Linguistic and Metrical Approach”, Third International Confer-
ence on Computing, Communication, Control and Automation ‘ICCUBEA 2017ʹ
at PCCOE, Pune, 978-1-5386-4008-1/17©2017 IEEE
[9] Lee and J.-S. Chang (2003), “Acquisition of English-Chinese Transliterated Word Pairs from Parallel-Aligned Texts Using a Statistical Machine Transliteration Model”, In Proc. of HLT-NAACL Workshop on Data Driven MT and Beyond, pp. 96-103.
[10] Nasreen Abdul Jaleel and Leah S. Larkey (2003), “Statistical transliteration for
English-Arabic cross language information retrieval”. In Proceedings of the
12th international conference on information and knowledge management.
pp. 139–146.
[11] Li H, Zhang M and Su J (2004), “A joint source-channel model for machine
transliteration”, In Proceedings of ACL, pp.160-167.
[12] Ganesh S, Harsha S, Pingali P and Verma V (2008), “Statistical transliteration
for cross language information retrieval using HMM alignment and CRF”, In
Proceedings of the Workshop on CLIA, Addressing the Needs of Multilingual
Societies.
[13] M Khapra, P Bhattacharyya (2009), “Improving transliteration accuracy using
word-origin detection and lexicon lookup” Proceedings of the 2009 NEWS
[14] Najmeh M N (2011), “An Unsupervised Alignment Model for Sequence
Labeling: Application to Name Transliteration”, Proceedings of the 2011
Named Entities Workshop, IJCNLP 2011, pp 73–81, Chiang Mai, Thailand,
November 12, 2011.
[15] Chunyue Zhang et al (2012), “Syllable-based Machine Transliteration with
Extra Phrase Features”, ACL
[16] S P Kosbatwar and S K Pathan (2012), “Pattern Association for character
recognition by Back-Propagation algorithm using Neural Network
approach”, International Journal of Computer Science & Engineering Survey
(IJCSES), February 2012, Vol.3, No.1, 127-134
[17] M. G. Abbas Malik (2013), “Urdu Hindi Machine Transliteration using SMT”,
The 4th Workshop on South and Southeast Asian NLP (WSSANLP), Interna-
tional Joint Conference on Natural Language Processing, pp 43–57, Nagoya,
Japan, 14-18 October 2013.
[18] Gurpreet Singh Lehal (2014), “Sangam: A Perso-Arabic to Indic Script
Machine Transliteration Model”. Proc. of the 11th Intl. Conference on Natural
Language Processing, pp 232–239, Goa, India. December 2014. NLP Associa-
tion of India (NLPAI)
[19] Dandan Wang (2015), “A Hybrid Transliteration Model for Chinese/English
Named Entities”, BJTU-NLP Report for the 5th Named Entities Workshop,
Proceedings of the Fifth Named Entity Workshop, joint with 53rd ACL and
the 7th IJCNLP, pp 67–71, Beijing, China, July 26-31, 2015, Association for
Computational Linguistics
[20] Binh Minh Nguyen, Hoang Gia Ngo and Nancy F. Chen (2016), “Regulating
Orthography-Phonology Relationship for English to Thai Transliteration”,
NEWS 2016
2
Anti-Depression Psychotherapist Chatbot
for Exam and Study-Related Stress
CONTENTS
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Assumptions and Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2.1.1 Probabilistic Topic Models . . . . . . . . . . . . . . . . . . . . . 23
2.2.1.2 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.1.3 Social Emotion Detection. . . . . . . . . . . . . . . . . . . . . . . 24
2.3 Hybrid Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.1 Keyword Spotting Technique . . . . . . . . . . . . . . . . . . . . . . 26
2.3.1.1 Lexical Affinity Method. . . . . . . . . . . . . . . . . . . . . . . .26
2.3.1.2 Learning-Based Methods. . . . . . . . . . . . . . . . . . . . . . . 27
2.3.1.3 Lexical Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.1.3.1 Approaches Based On Keyword
Using Affect Lexicons . . . . . . . . . . . . . . . . 27
2.3.1.3.2 Linguistic Rules-based Approaches. . . . . 27
2.3.1.3.3 Approaches to Machine Learning . . . . . . 28
2.4 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .30
2.4.1 Text Cleaner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4.2 Text Analyzer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4.3 Response Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.5 Functional Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.5.1 System Feature1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.5.2 System Feature2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.6 External Interface Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.6.1 Hardware Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.6.2 Communication Interface. . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.7 Non-Functional Requirement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.7.1 Availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.7.2 Backup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.1 Introduction
Anxiety and depression are two major public health issues prevailing in India.
Around 5.6 million people in India suffer from either depression or anxiety,
and 2.4 million of them are aged between 16 and 24 years. Most of these cases
remain untreated because of the taboo associated with mental health care
among the Indian population and also due to the higher cost incurred in
treating these cases. This chapter aims at providing a solution to reducing
anxiety and mild depression among the teens owing to study and exam
pressure. Peer pressure and increased parental expectations in today’s compe-
titive world trigger anxiety [2] or mild depression among many teens, which
can be treated with therapeutic sessions. With the increased pace of modern life and changing environmental conditions, more and more people are becoming prone to depression. Anxiety is one of the precursors to depression.
Anxiety is defined as a feeling of worry, nervousness, or unease about some-
thing with an uncertain outcome. With the increase in social media exposure
and peer pressure, there are more cases of teenagers committing suicide because of insecurity and fear of separation. Anxiety and depression are topics that are still considered taboo in Indian society. People do not talk about these issues openly. In addition, access to healthcare and the cost associated with it are also big issues. Mental health is not taken seriously. Teenagers growing up in emotionally weak households have low self-esteem [10]
and are prone to develop symptoms of anxiety and depression from a very
young age. Not being treated at the right time leads to severe depression. As
we advance further in the 21st century, with changing environmental factors,
FIGURE 2.1
System architecture. The user's text is sent to the server, cleaned, and analyzed; the emotion is found and an appropriate response is generated; feedback checks for the decrease in depression.
FIGURE 2.2
Context storage using a tree (what, why, who, when).
feeling?” “Why are you feeling?” “Who is making you feel this way?”
“Since when do you feel this way?” These four “what, why, who, when”
questions form the basis of the situation. The text analyzer also analyzes
emotions in the given array from the text cleaner.
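A keyword-spotting analyzer of the kind outlined in Section 2.3.1 can be sketched as follows; the affect lexicon here is a hypothetical stand-in for the system's actual word lists:

```python
# Hypothetical affect lexicon mapping emotions to trigger keywords
EMOTION_KEYWORDS = {
    "anxiety": {"worried", "nervous", "scared", "exam", "pressure"},
    "sadness": {"sad", "alone", "hopeless", "cry"},
    "anger": {"angry", "hate", "unfair"},
}

def spot_emotion(text):
    """Return the emotion whose keywords overlap the text most, or 'neutral'."""
    tokens = set(text.lower().split())
    scores = {emo: len(tokens & kws) for emo, kws in EMOTION_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "neutral"

print(spot_emotion("I am so worried about my exam"))  # anxiety
```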
FIGURE 2.3
Neural network representation for output classification (emotion class and context map to a response class).
the given class. This response is sent back to the server and is given
back to the user.
The emotion classifier continuously analyses input and hence gives the
emotion feedback to the text generator to improve the type of response
given back to the user. This increases the accuracy of matching the user
context and thinking.
2.7.1 Availability
Chatting applications are ubiquitous nowadays; people are online most of the time on applications like WhatsApp, Hike, and others. A chatbot is an intelligent application with which a user can chat and get a reply as if from a human. The user may find time to chat at any hour of the day, so the system is made available 24×7.
2.7.2 Backup
In a real-time system, no internal failure should affect the user. To provide continuous service, the system should be regularly backed up, so that if anything happens to the database, the system can switch to the backup database and continue to provide service. When the original database becomes available again, the current state should be written back to it (updated frequently).
2.7.4 Performance
The response time is a critical measure for any application because people use
the application to make their world simple, better, and to save time. Hence the
response time should be as short as possible. The response time depends on many factors, such as the number of users connected to the server at any given instance. This can be handled by running background processes on many servers to serve the varying number of simultaneous users. The connectivity issue can be mitigated by minimizing data transfer, i.e., by preprocessing the string at the user end and passing only the required characters to the server side.
needed. Each user who is chatting with the chatbot about their problems is
assigned a separate thread.
A user who needs to chat with the bot needs an Android phone with the following minimum requirements:
1. 512 MB RAM
2. 512 MB disk space
3. Android KitKat (4.4.4) or later
4. An internet connection
The system consists of three layers:
1. Front layer
2. Business logic
3. Database
Front layer is designed in android, which has the login page and the
actual chatting application. The login page validates the user credentials
from the database and if found correct the actual chatting page is displayed
where the user can type messages and send them to the server. Before starting the actual chat, a quiz is taken to determine the polarities of all emotions; the emotion with the maximum polarity is considered the current emotion of the user.
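The quiz scoring described above can be sketched as follows; the mapping from answer choices to emotion polarities is hypothetical:

```python
# Hypothetical weights: each answer choice contributes polarity to emotions
WEIGHTS = {
    "a": {"anxiety": 2},
    "b": {"sadness": 1, "anxiety": 1},
    "c": {"anger": 2},
}

def quiz_emotion(answers, weights):
    """Accumulate per-emotion polarity over all quiz answers and return
    the emotion with the maximum total polarity."""
    polarity = {}
    for ans in answers:
        for emotion, w in weights.get(ans, {}).items():
            polarity[emotion] = polarity.get(emotion, 0) + w
    return max(polarity, key=polarity.get)

print(quiz_emotion(["a", "b", "c"], WEIGHTS))  # anxiety (total polarity 3)
```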
Business logic is the actual code which determines the actual emotion
and generates a response to be given back to the user. When the quiz is
taken, the answers are sent down to the server and the polarities of all the emotions are identified. Once the emotion is determined, the user starts
talking to the chatbot. All the messages the user types are sent down to the
server. The action and intent are determined for every problem that the
user states. Response is generated such that the intent of every action turns
from negative to positive.
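The idea of turning each action's intent from negative to positive can be sketched as follows; the antonym map and the response template are hypothetical placeholders for the actual generator:

```python
# Hypothetical map from negative intents to positive reframings
POSITIVE_INTENT = {"fail": "pass", "lose": "improve", "quit": "continue"}

def reframe(action, intent):
    """Generate a response that recasts a negative intent about an action
    as a positive one."""
    flipped = POSITIVE_INTENT.get(intent, intent)
    return f"What small step could help you {flipped} {action}?"

print(reframe("your exams", "fail"))  # What small step could help you pass your exams?
```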
Database is used to store the user’s credentials, emotion, actions and
intents. User’s credentials are stored during the signup process and checked
during login process. Emotions of the user are stored in the database. Actions
and the intents are stored during the chat process.
First, a server is made in Django (Python) that accepts the requests and messages; next, the Android app is designed, which contains the login, signup, and
the chat page. A database is created to save the credentials of the users, after
which a quiz is taken to determine the emotion. The algorithm to determine the
emotion based on the answers is made on the server side. The emotion is saved
in the database against the username. The response generator is then created.
From the problems keyed in by the user, action and intents are extracted and
saved in the database. Response is generated using the algorithm on the server
side, which tries to change the intent of every action. Once all the intents are
changed, the emotion polarity is altered and the work is done.
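As a rough illustration of the quiz-scoring step described above, the following Python sketch accumulates a polarity per emotion from the quiz answers and picks the maximum. The emotion labels, answer identifiers, and weights are invented for the example; the chapter does not specify them.

```python
# Hypothetical sketch: each quiz answer contributes weight to one or more
# emotions; the emotion with the maximum polarity is taken as the user's
# current emotion, as described in the text.

def score_quiz(answers, answer_weights):
    """Accumulate a polarity value per emotion from the quiz answers."""
    polarities = {}
    for answer in answers:
        for emotion, weight in answer_weights.get(answer, {}).items():
            polarities[emotion] = polarities.get(emotion, 0.0) + weight
    return polarities

def current_emotion(polarities):
    """The emotion with maximum polarity is the current emotion."""
    return max(polarities, key=polarities.get)

# Illustrative answer-to-emotion weights (not from the chapter).
weights = {
    "a1": {"sadness": 0.8, "anger": 0.2},
    "a2": {"joy": 0.6},
    "a3": {"sadness": 0.5, "fear": 0.3},
}
p = score_quiz(["a1", "a3"], weights)
print(current_emotion(p))  # sadness
```

The same dictionary of polarities would then be stored in the database against the username, as the walkthrough describes.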
2.9 Conclusion
This chapter discussed how machine learning and natural language
processing can be used to successfully apply CBT and generate human-like
responses through a chatbot.
3
Plagiasil
A Plagiarism Detector Based on MAS Scalable
Framework for Research Effort Evaluation by
Unsupervised Machine Learning – Hybrid
Plagiarism Model
CONTENTS
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2 Literature Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3 Research Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4.1 Proposed Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4.2 Discussion on Problem Statement. . . . . . . . . . . . . . . . . . . 54
3.4.3 Server Algorithm for Training Dataset Creation . . . . . . . . 54
3.4.4 Client Algorithm for Amount of Plagiarism . . . . . . . . . . . 55
3.4.5 Software Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.4.6 System Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5 Level 1 Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.6 Level 2 Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.7 Level 3 Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.8 Noise Reduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.9 Client-Side Query Processing Sequence . . . . . . . . . . . . . . . . . . . . . . . . 60
3.10 Cosine Score Estimation for Partial Mapping or Partly
Modified Sentence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.10.1 Software System Performance Evaluation and Results . . . 61
3.10.2 Plagiarism System Performance Enhancement . . . . . . . . . 62
3.10.3 Performance Parameters. . . . . . . . . . . . . . . . . . . . . . . . . . 63
42 Applied Machine Learning for Smart Data Analysis
3.1 Introduction
In today's progressive technological era, new concepts, methods,
algorithms, and terms are written every day, resulting in the growth
of the information tree. It is critical to evaluate the efforts of authors,
which can be achieved by means of plagiarism detection. Plagiarism is
the copying of someone's content and ideas without attribution [1].
In order to publish quality literature, it is important for publishers to check
for plagiarism and evaluate the efforts of the author. Academic institutions
face greater issues of plagiarism owing to various factors such as social
engineering and sharing of submissions. Plagiarism detectors thus play
a major role in the task of effort evaluation [1].
Information sources are identified by appropriate reference specifications;
if such references are absent for a particular statement, the statement is
implied to be the author's original effort. Some sources of information
are the internet, books, journals, and so on. An honest document is one
that provides attribution to the original authors [2].
Some of the key questions in plagiarism detection are as follows:
• How are articles retrieved, and how is key information identified,
accessed, and transformed?
• Which information classification strategy should be applied?
• How to choose machine learning-based algorithms that suit a spe-
cific task?
• How to identify algorithms used in text similarity, partial text
mapping, and similarity score estimation?
FIGURE 3.1
Client–Server module for plagiarism evaluation
FIGURE 3.2
Term-to-sentence and sentence-to-file mapping
Thus, two types of clusters are mapped by the mappers. Figure 3.2 depicts
term-to-sentence and sentence-to-file mapping.
Sentence mappers: Every file is a collection of sentences, whereas
each sentence is a collection of terms, stop words, and stemmed forms.
Supervised learning is applied to detect sentences based on rules of
grammar that define the syntax for extracting words from the files in
question. Pointing maps key/value pairs, where each key is assigned
a location vector (file name, file path, and line number) identifying
where the sentence exists in the dataset files.
Term mappers: Term mappers add accuracy to the system: each TermKey
is assigned a sentence key as its value; that is, a term exists in some
sentences, and those sentences are located in some file.
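The two mappers can be sketched as a pair of inverted indexes. The following Python sketch is illustrative only (the chapter's system is built on J2EE); the file contents and function names are invented, and the location vector is reduced to (file name, line number) for brevity.

```python
# Minimal sketch of the sentence and term mappers described above:
# each sentence maps to the locations where it occurs, and each term
# maps to the sentences that contain it.

def build_mappers(files):
    """files: dict mapping file name -> list of sentences."""
    sentence_map = {}   # sentence -> list of (file name, line number)
    term_map = {}       # term -> set of sentences containing it
    for name, sentences in files.items():
        for line_no, sentence in enumerate(sentences, start=1):
            sentence_map.setdefault(sentence, []).append((name, line_no))
            for term in sentence.lower().replace(".", "").split():
                term_map.setdefault(term, set()).add(sentence)
    return sentence_map, term_map

# Invented sample corpus.
files = {"a.txt": ["Plagiarism is copying content.",
                   "Detection needs similarity measures."]}
smap, tmap = build_mappers(files)
print(smap["Plagiarism is copying content."])  # [('a.txt', 1)]
```

Looking up a term in `term_map` and then following its sentences into `sentence_map` reproduces the term-to-sentence and sentence-to-file chain of Figure 3.2.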
The layered architecture accepts as input text (*.txt), PDF (*.pdf), and
Word (*.doc) files, or a directory containing text files. Layer 1 feeds this
input to the system for training-set creation or plagiarism evaluation.
Layer 2 performs sentence extraction, cleansing, and transformation,
transforming each term to a unique code. Layer 3 updates the training
dataset with file information, sentence links, and term links on admin
instruction. Layer 4 is the local dataset. Layer 5 is the plagiarism
detector module, working offline and online: an exact-match sentence
mapper and a partial-match sentence mapper operate over the local dataset
and online resources, together with a sentence-to-file mapper and
a similarity-measures and sequence-detector component.
FIGURE 3.3
Layered architecture
1. Technology constraints
2. Time constraints
3. Economic or financial constraints
4. Operational constraints
5. Environmental constraints
6. Alternative solutions
Functional requirements: These define the features that are required for the
system to fulfil user demands, some of which are the user interface, reports,
metadata information, and the database required. They add to the usability
and reliability of the software product.
Non-functional requirements: These define the limitations, validations, and
rule-based implementations. Non-functional requirements improve system
quality and efficiency, and act as supporting agents that guide the user
toward the expected objective. A few such constraints are login name and
password handling, response and request times, automatic backup and
recovery, authorization services, data-field validation, mitigation of
SQL-injection attacks, protection of data in transmission and storage, and
maintenance of service logs.
Product perspective: We believe that the unsupervised multi-layer
plagiarism system is an efficient system that adopts a hybrid approach
to improve accuracy and performance. The product will be essential
in the fields of literature publication and estimation of academic
authors' efforts. It has enduring scope in business and research as
well, as these fields too need measures to improve plagiarism detection.
Design and implementation constraints: The product is developed on the
J2EE platform in order to attain portability and web-based access. Even
though the product design is flexible, it is limited to desktop-based usage;
an Internet connection enhances the system accuracy. The current system
does not consider mobile-based services. It has been developed for
academic purposes and would in future need refinement for commercial
use. The system accuracy can be improved by adding rules. At present,
file access is for text-based assessment, which needs to be extended to
other formats such as images, audio, and charts.
fi is a set of files in the corpus, and file#i means the i-th file in the
corpus, where i = 1, 2, 3, …, n. In other words, file#i acts as a handler
to access the resources.
The training set of the system is scalable, and new contents are validated
as genuine, that is, plagiarism-free, contents. Datasets are dynamic in
nature and tend to grow with age; this dynamic updating needs a scalable
data-management process. Content-organization algorithms update the
entire structure as new resources are added to the corpus. Another issue
handled is the reading of multi-format documents: the input resources
accepted by the proposed system are text files, PDF files, Word documents,
and so on. Supporting files such as LaTeX, HTML, and other textual
content scales up the mining of resources and adds capability to the
training corpus.
where Sj is the set of sentences in a file, Sent#j means the j-th sentence
in a file, and n refers to the number of sentences in file#i, where
j = 1, 2, 3, …, n.
Classification table: each sentence is classified as a reference sentence,
a non-reference sentence, or a citation/reference.
where k = 1, 2, 3, …, n
Terms that occur as part of the sentence but whose presence or absence
does not affect the meaning of the sentence are considered insignificant.
Removal of such terms reduces the dataset size, which leads to fewer
comparisons and faster results, which are indeed the goals to achieve.
A supervised machine-learning approach is best suited to implement
a mapping of input to output pairs. The input Sent#k vector is further
decomposed into two classes, namely valuable words and non-valuable
words (irrelevant contents). For example, words such as "the", "for", "a",
and "an" are rated as non-valuable, as they do not strongly affect the
results. Classification algorithms play an effective role in identifying such
eliminable terms. Since these terms are of shorter lengths, they can be
held in RAM during processing. The algorithm is static in nature and must
be balanced based on the training set; the frequency of access of such
terms is very high, so search can be prioritized based on the hit count
of the content.
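The stop-word elimination step can be sketched as follows. The stop-word list here is a small illustrative subset, not the system's actual list, and the function is a Python illustration (the chapter's system is built on J2EE).

```python
# Sketch of the non-valuable-word elimination described above: short,
# high-frequency terms such as "the", "for", "a", "an" are removed,
# shrinking the dataset before comparison. The list is small enough to
# be held in RAM, as the text notes.

STOP_WORDS = {"the", "for", "a", "an", "is", "of", "and", "to"}

def remove_stop_words(sentence):
    """Keep only the valuable words of a sentence."""
    return [w for w in sentence.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("The system compares a sentence for plagiarism"))
# ['system', 'compares', 'sentence', 'plagiarism']
```

Every removed word is one fewer term to index and compare, which is exactly the reduction the paragraph argues for.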
The real skill is to develop a machine-learning strategy for unsupervised
contents by auto-classification of fine-grained contents. This classification
signifies extraction of complex associations and meta-information, semantic
mapping to existing terms based on system knowledge, and finally handling
uncertainty in data. Organizing such content in the form of a WordNet,
a Classification and Regression Tree (CART), or a lattice is normally
expected, but the limitation of RAM size hinders accommodating the
corpus. Since the training contents are scalable, such a corpus may lead
to content replication, nodes repeating in the graph, or nodes being visited
repeatedly during traversal. The system also demands semantic analysis
to show a more precise result.
Transformation of training set: The fine-grained contents are further
packed to multi-dimensional Data cubes. Each file is transformed to
such cubes. Figure 3.4 shows a sample architecture for cube organiza-
tion of decomposed data. This formulates a file node and serves as
a training set.
Grain packing to nodes, where each weighted word is given a unique
identifier, is reflected in the word template above. The word vector,
packed with unique identifiers, is a subset of each sentence; the sentence
vectors are packed further into the nodes of a file; and finally the file
vectors comprise the content of the corpus. File handlers are unique
identifiers used to access the resources easily.
FIGURE 3.4
Semantic words mapping to unique identifier
(iv) Result: cos(θ) = 1 means an exact match of the compared vectors;
any value near 1 implies a partial match.
(v) A threshold for partial match is set to 0.6; below it, the decision is
that the two vectors differ. This threshold can be set based on the
percentage of allowed similarity in contents.
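The cosine-score decision above can be sketched in Python with term-frequency vectors. This is only an illustration of the scoring rule (the chapter's system runs on J2EE), and the example sentences are invented.

```python
# Cosine similarity of two sentences as term-frequency vectors, with the
# decision rule from the text: cos = 1 is an exact match, values at or
# above the 0.6 offset are partial matches, anything lower is different.
import math

def cosine_score(s1, s2):
    """Cosine similarity of two sentences as term-frequency vectors."""
    def tf(s):
        counts = {}
        for w in s.lower().split():
            counts[w] = counts.get(w, 0) + 1
        return counts
    v1, v2 = tf(s1), tf(s2)
    dot = sum(v1[w] * v2.get(w, 0) for w in v1)
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def verdict(score, offset=0.6):
    """Map a cosine score to the decision described in the text."""
    if score == 1.0:
        return "exact match"
    return "partial match" if score >= offset else "different"

print(verdict(cosine_score("copy of the text", "copy of the text")))
# exact match
```

Changing one of four words ("copy of the book") yields a score of 0.75, which the 0.6 offset still classifies as a partial match.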
The end-user module for plagiarism evaluation: Its core activity is to
generate a content-originality report for the user-requested file.
(i) Identification of true-positive (TP) results: Sentences that reflect an
exact match, or reveal a similar meaning relating to the same concept,
are termed TP. In plagiarism evaluation, the sentences that map to TP
are evaluated as plagiarized.
(ii) Identification of true-negative (TN) results: Sentences that are unique
and do not map to any dataset contents are marked as true negatives.
Identifying errors in the sets of false-positive (FP) and false-negative (FN)
sentences is challenging, and it raises confusion in verifying whether
a sentence is replicated from some source or is the author's own effort.
If the percentage of FP and FN sentences is high, the accuracy of the
system is questionable, and the algorithm needs to be redeveloped or the
similarity measures re-evaluated.
Figure 3.5 distinguishes false positives (retrieved items that are
non-relevant) from false negatives (relevant items that are not retrieved)
with respect to the retrieved item set.
FIGURE 3.5
Performance measures
Clusters might be misclassified if they contain a set of FPs and FNs, and
such conditions give rise to confusion, as shown in Figure 3.5.
The accuracy of classification is given by
Accuracy = (True Positive + True Negative) / (P + N)
where P and N are the numbers of positive and negative instances of some
condition [15].
Sample calculation reflecting the results: We assume some data to
demonstrate the evaluation of precision and recall, as follows. Input
a document in question for plagiarism-score evaluation, containing 100
plagiarized sentences among 10,000 sentences. The issue is to predict
which ones are positive. The Plagiasil system retrieves 200 sentences
as probably plagiarized, against 100 actually positive cases. The results
are evaluated several times to predict the accuracy of the retrieved
sentences.
For instance, let expected results be 100 sentences.
TP = 60 sentences (found as plagiarized sentence, i.e. accurate detection)
FN = 40
TN = 9760
FP = 140
Therefore, recall will be 60 out of 100 = 60%.
Retrieved results: 200 sentences; therefore, precision will be 60 out of
200 = 30%.
Total statements to be compared = 10,000.
The accuracy of the system will be (9760 + 60) = 9820 out of 10,000 = 98.20%.
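The worked example can be checked in a few lines of Python, computing the three metrics directly from the TP/FN/TN/FP counts given above:

```python
# The sample calculation above, reproduced in code.
tp, fn, tn, fp = 60, 40, 9760, 140

recall = tp / (tp + fn)                       # 60 / 100   = 0.60
precision = tp / (tp + fp)                    # 60 / 200   = 0.30
accuracy = (tp + tn) / (tp + fn + tn + fp)    # 9820 / 10000 = 0.982

print(recall, precision, accuracy)  # 0.6 0.3 0.982
```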
The similarity scores, expressed as percentages for each group of 10
sentences, are plotted in Figure 3.6.
FIGURE 3.6
Plagiarism score of a document with 150 sentences with 20% similarity detection.
3.13 Conclusion
It is essential to evaluate the efforts of authors in academic and research
publications in order to analyze the quality of the work as a unique and
valuable information resource. This is achieved by means of a plagiarism
detector system that applies a reduction algorithm to a hybrid
machine-learner model evaluating text similarity on the web as well as in
the local dataset. Regarding future work on the plagiarism detection
system, text-based replication detection is only the initial step. Other
features that can be copied from literature are images, charts, templates,
and so on, and an upcoming research topic is the analysis of documents
translated into other languages.
References
[1] David J. C. MacKay, "Information Theory, Inference and Learning
Algorithms", Cambridge University Press, 2003, ISBN 0-521-64298-1.
[2] Kaveh Bakhtiyari, Hadi Salehi, Mohamed Amin Embi, et al. “Ethical and
Unethical Methods of Plagiarism Prevention in Academic Writing”,
Machine Learning in
Data Mining
4
Digital Image Processing Using Wavelets
Basic Principles and Application
Luminița Moraru
Faculty of Sciences and Environment, Modelling & Simulation Laboratory, Dunarea de Jos
University of Galati, Romania
Simona Moldovanu
Faculty of Sciences and Environment, Modelling & Simulation Laboratory, Dunarea de Jos
University of Galati, Romania
Department of Computer Science and Engineering, Electrical and Electronics Engineering,
Faculty of Control Systems, Computers, Dunarea de Jos University of Galati, Romania
Salam Khan
Department of Physics, Chemistry and Mathematics, Alabama A&M University, Normal
AL-35762, USA
Anjan Biswas
Department of Physics, Chemistry and Mathematics, Alabama A&M University, Normal
AL-35762, USA
Department of Mathematics, King Abdulaziz University, Jeddah 21589, Saudi Arabia
Department of Mathematics and Statistics, Tshwane University of Technology, Pretoria
0008, South Africa
CONTENTS
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2.1 Image Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3 Image Processing Using Wavelets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.4 Image-Quality Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.5 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.5.1 Decomposition Levels for 2D Haar Wavelet Transform . . . . 83
4.5.2 Noise Removal in Haar Wavelet Domain . . . . . . . . . . . . . . 84
4.5.3 Image Enhancement by Haar Wavelet . . . . . . . . . . . . . . . . . 86
4.5.4 Edge Detection by Haar Wavelet and Image Fusion . . . . . . 87
4.5.5 Image-Quality Assessment. . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.1 Introduction
Modern pathology practice involves labor- and time-intensive processes.
Currently, the advantages of digital image processing and analysis are
recognizable in all areas, including pathology.
Pathologists and associated scientific teams use, among other investigative
techniques, microscopic evaluation. Automatic imaging techniques provide
huge advantages for both image processing and image analysis. Digital
image processing enables enhanced and virtually noise-free images.
Microscopy digital image-processing techniques are used for both
enhancement and manipulation of raw images, to make the informational
content available. Microscopy digital image-analysis techniques interpret
the provided data by means of data-recognition techniques, image-quality
assessment, or classification. There are many image-processing algorithms
and analysis techniques, and this chapter intends to underline the
principles behind image manipulation.
Optical microscopy uses transmitted light for image recording. Generally,
microscopy volumes are strongly anisotropic, and acquired digital images
suffer from intensity inhomogeneity, uneven illumination and backgrounds,
noise, poor contrast, aberration artifacts, out-of-focus areas, color shifts,
and color-balance errors. These artifacts bias the results and are a cause
of inaccuracy and errors. Image processing comprises a variety of methods
to manipulate images and generate intensity changes useful for observation
and imaging. Methods that manipulate the common microscope
illumination modes are brightfield, phase contrast and/or differential
interference contrast, and oblique illumination. Other methods are specific
to the MATLAB® programming language, namely the Wavelet and Image
Processing Toolboxes for MATLAB R2014a.
The methods under consideration belong to unsupervised machine-learning
methods. We propose algorithms based on wavelet transforms to obtain
texture information, and a machine-learning approach to handle de-noising
operations, which become more adaptable when wavelet transforms are
integrated in the decision (Suraj et al., 2014; Moraru et al., 2017;
Rajinikanth et al., 2018). To enrich our study of edge detection, we include
two machine-learning techniques: image enhancement by wavelet
transforms and image fusion. This chapter presents the fundamental tools
of digital image processing based on wavelet applications. The principles
of wavelet representation, image enhancement (such as filtering for noise
reduction or removal), edge detection, edge enhancement, and image
fusion are discussed
and some applications are presented. In particular, edge-detection and
fusion techniques related to horizontal and vertical orientations are
highlighted.
A three-step framework is exploited: decomposition, recomposition, and
fusion. The image-quality assessment is discussed in terms of objective
metrics such as peak signal-to-noise ratio (PSNR) and mean square error
(MSE) and of a hybrid approach such as structural similarity (SSIM) index as
a method to establish the similarity between two images. For edge
detection, the latter method takes advantage of the sensitivity of the
human visual system.
In this study, the problems and procedures are presented on
two-dimensional (2D) real images, which are freely available in the
Olympus Microscopy Resource Center galleries (http://www.olympusmicro.
com/galleries/index.html). The same dataset of test images is used
throughout this work for comparison. Some examples of image processing
in the MATLAB programming language are presented.
4.2 Background
Phase contrast microscopy relies on a phase plate with neutral-density
material and/or optical wave retarders (Nikon, Olympus). There are
several types of phase contrast objectives: (1) dark low objectives
generate a dark image outline on a light gray background,
low objectives generate a dark image outline on a light gray background,
(2) dark low objectives produce better images in brightfield, (3) apodized dark
low phase diminishes the “halo” effects into images obtained in phase contrast
microscopy, (4) dark medium gets a dark image outline on a medium gray
background, and (5) negative phase contrast or bright medium generates
a bright image map on a medium gray background. Differential interference
contrast objectives are useful for unstained samples; those regions of the
sample where the optical distances increase will appear either brighter or
darker, whereas regions where the optical paths decrease will show the
inverse contrast (Rosenthal, 2009). Polarized light microscopy produces the
best images for samples, showing optically anisotropic character.
Typical images captured in optical microscopy are red-green-blue (RGB)
color images. A color image is defined as a digital array of pixels
containing color information. In MATLAB, an RGB image is stored as
"an m-by-n-by-3 data array that defines red, green, and blue color
components for each individual pixel." An RGB array can belong to the
double, uint8, or uint16 class. In an RGB array of class double, each
component spans between 0 and 1; a black pixel is described as (0, 0, 0)
and a white pixel as (1, 1, 1). In RGB space, the color of each pixel is
depicted using the third dimension of the data array: pixel (10, 5) in an
RGB color image is stored as RGB (10, 5, 1), RGB (10, 5, 2), and RGB
(10, 5, 3), respectively. The MATLAB display function is image(RGB).
The following functions convert between various image types: im2double
(converts an image to double precision), rgb2gray (converts an RGB
image or color map to gray scale), ind2rgb (converts an indexed image
to an RGB image), and rgb2ind (converts an RGB image to an indexed
image).
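For readers without MATLAB, the array layout and conversion described above can be mirrored in plain Python. This is an illustrative re-implementation, not MATLAB itself; the luminance weights are those documented for MATLAB's rgb2gray (0.2989 R + 0.5870 G + 0.1140 B).

```python
# An m-by-n-by-3 "double" image as nested Python lists, with rgb2gray's
# documented luminance weights applied per pixel.

def make_rgb(m, n, color=(0.0, 0.0, 0.0)):
    """An m-by-n-by-3 double-class image, initialised to one color."""
    return [[list(color) for _ in range(n)] for _ in range(m)]

def rgb2gray(rgb):
    """Weighted sum of the R, G, B components, as MATLAB's rgb2gray does."""
    w = (0.2989, 0.5870, 0.1140)
    return [[sum(c * wi for c, wi in zip(px, w)) for px in row] for row in rgb]

img = make_rgb(10, 5)            # all black: (0, 0, 0)
img[9][4] = [1.0, 1.0, 1.0]      # a white pixel (MATLAB index (10, 5))
gray = rgb2gray(img)
print(round(gray[9][4], 4))      # 0.9999: the three weights sum to 0.9999
```

Note the index shift: Python's `img[9][4]` addresses the pixel MATLAB would call (10, 5), since MATLAB arrays are 1-based.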
On the other hand, in a gray-scale digital image each pixel is depicted
by a scalar value, its intensity. The number of intensity values L depends
on the numerical type encoding the image (an 8-bit image has 2^8 = 256
gray-level intensity values).
The nonuniform background or illumination correction and primary noise
filtering were done through the microscope camera settings. After the raw
digital images have been acquired and have undergone certain
preprocessing corrections and rehabilitations provided by the microscope
software system, the digital content becomes accessible either by visual
analysis or by image-processing and analysis algorithms.
d_{i−1,k} = (1/√2) (c_{i,2k} − c_{i,2k−1}),  for k ∈ {1, 2, …, 2^{J−1}}   (1)

c_{i−1,k} = (1/√2) (c_{i,2k} + c_{i,2k−1}),  for k ∈ {1, 2, …, 2^{J−1}}   (2)
As the most used technique, DWT uses the Haar functions in image
coding, edge detection, and analysis and binary logic design (Davis et al.,
1999; Munoz et al., 2002; Zeng et al., 2003; Dettori & Semler, 2007; Liu
et al., 2009; Yang et al., 2010; AlZubi et al., 2011). For an input
constituted as a list of 2^n numbers, the Haar transform pairs up the
input values, storing the differences and passing the sums, thereby
separating the signal into two subsignals of half its length. This procedure
is then repeated recursively, pairing up the sums to provide the next
scale. The final outcome is 2^n − 1 differences and one final sum.
A simple example of multiscale analysis, consisting of the Haar transform
for n = 3, is provided in
Figure 4.1.
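The pairing-and-recursion procedure can be sketched in Python (the chapter's examples use MATLAB; this is a self-contained illustration). Indices are 0-based here, mirroring the scaled sums and differences of Eqs. (1) and (2), and the input length is assumed to be a power of two, as in the n = 3 (length-8) example.

```python
# One Haar step pairs up the input, storing scaled sums (the next
# coarser signal) and scaled differences (the wavelet coefficients);
# the multilevel transform recurses on the sums.
import math

def haar_step(c):
    """One level: c of length 2m -> (sums, differences), each length m."""
    s = 1.0 / math.sqrt(2.0)
    sums = [s * (c[2 * k] + c[2 * k + 1]) for k in range(len(c) // 2)]
    diffs = [s * (c[2 * k + 1] - c[2 * k]) for k in range(len(c) // 2)]
    return sums, diffs

def haar_multilevel(c):
    """Recurse on the sums; returns the final sum and all differences."""
    coeffs = []
    while len(c) > 1:
        c, d = haar_step(c)
        coeffs = d + coeffs       # 2^n - 1 differences in total
    return c[0], coeffs

final_sum, diffs = haar_multilevel([4, 6, 10, 12, 8, 6, 5, 5])
print(len(diffs))  # 7 differences for n = 3
```

Because each step is orthonormal, the sum of squares of the output coefficients equals that of the input, which is the energy conservation the 1/√2 factor provides.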
Thus, the Haar transform at 1-level maps the signal (c_{J,k}) into the
scaled sums and differences given by Eqs. (1) and (2). The factor 1/√2
ensures energy conservation. The Haar transform at multilevel is obtained
by applying the same step recursively to the sums.
FIGURE 4.1
A numerical example for 1-level, multilevel, and inverse transform of Haar transform for n = 3.
The algorithm presented in Figure 4.1 is equivalent to the Haar wavelet.
The d_{i,j} are the wavelet coefficients, and together with c_{0,0} they
determine the output of the transform and allow an alternative way of
storing the original sequence.
The DWT provides information in a more discriminating way, as it
segregates data into distinct frequency components; each frequency
component is analyzed using a resolution level adjusted to its scale.
The Haar father
wavelet is defined as (Vetterli & Kovacevic, 1995; Jahromi et al., 2003)
φ(t) = 1 for t ∈ [0, 1], and φ(t) = 0 otherwise   (3)
where φ denotes the father (scaling) function and ψ denotes the mother
wavelet function. The Haar father and mother functions are plotted in
Figure 4.2.
The decomposition of the original function f(x) can be depicted by
a linear combination of the mother wavelet:

f(x) = Σ_{j,k} d_{j,k} ψ_{j,k}(x)   (7)

It should be noted that the functions φ_{j,k}(x) = 2^{j/2} φ(2^j x − k) and
ψ_{j,k}(x) = 2^{j/2} ψ(2^j x − k) form a wavelet family if they are an
orthonormal basis of L^2(R). The constant 2^{j/2} is chosen so that the
scalar product ⟨ψ_{j,k}, ψ_{j,k}⟩ = 1, with ψ_{j,k} ∈ L^2(R).
FIGURE 4.2
The Haar wavelet functions: (a) Haar mother wavelet and (b) Haar father wavelet.
The Haar wavelet:
• is discontinuous,
• is localized in time; as consequence, it is not perfectly localized in
frequency, and
• has low computing requirements.
Large wavelet coefficients match the signal, and small coefficients contain
mostly noise.
Weaver et al. (1991) proposed the first approach for noise removal by using an
orthogonal decomposition of an image and a soft thresholding procedure in
the selection of the coefficients dj; k :
d_{j,k} = d_{j,k} − t_i,  if d_{j,k} ≥ t_i;
d_{j,k} = 0,  if |d_{j,k}| < t_i;
d_{j,k} = d_{j,k} + t_i,  if d_{j,k} ≤ −t_i.
t_i denotes the threshold, and its value depends on the noise at the i-th
scale; that is, it is sub-band adaptive. Generally, the coefficients
corresponding to noise are characterized by high frequency, and low-pass
filters represent the usual solution for noise removal. In thresholding
methods, the wavelet coefficients whose magnitudes are smaller than the
threshold are set to zero. In the next stage, the de-noised image is
reconstructed using the inverse wavelet transform. Threshold selection is
a sensitive process: a higher t_i allows the thresholding operation to
remove a significant amount of signal energy and cause over-smoothing,
whereas a small t_i leaves a significant amount of noise unsuppressed
(Chang et al., 2000; Portilla et al., 2003; Moldovanu & Moraru, 2010;
Bibicu et al., 2012; Moldovanu et al., 2016). On the other hand, edge
information is also carried by the high frequencies, so a balance should be
found between these two goals of image processing.
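The soft-thresholding rule can be written directly as a function; the threshold and coefficient values below are illustrative, and Python stands in for the chapter's MATLAB.

```python
# Soft thresholding of wavelet coefficients: magnitudes below the
# threshold t are set to zero, larger coefficients are shrunk toward
# zero by t, as in the piecewise rule above.

def soft_threshold(d, t):
    """Soft-threshold a single wavelet coefficient d at threshold t."""
    if d >= t:
        return d - t
    if d <= -t:
        return d + t
    return 0.0

coeffs = [5.0, 0.4, -2.5, -0.1, 1.2]
print([round(soft_threshold(d, 1.0), 2) for d in coeffs])
# [4.0, 0.0, -1.5, 0.0, 0.2]
```

The two large coefficients survive (shrunk by t) while the small ones, presumed to be noise, are zeroed, which is the trade-off the threshold choice controls.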
Image enhancement aims to accentuate image features or details that are
relevant for a specific task. It is useful where the contrast between different
structures in an image is small and a minor change between these structures
could be very informative. Wavelet-based enhancement methods use reversible
wavelet decomposition and the enhancement is performed by selective
modification of wavelet coefficients for highlighting subtle image details
(Sakellaropoulos et al., 2003; Zeng et al., 2004; Moraru et al., 2011; Kumar et al.,
2013; Patnaik & Zhong, 2014). It is followed by a reconstruction operation.
According to Misiti et al. (2007), the applications of wavelets play an
important role in edge detection and image fusion, mainly due to their local
character. Edges in images are mathematically described as local singularities
that are, in fact, discontinuities in gray-level values (Tang et al., 2000; Misiti
et al., 2007; Guan, 2008; Papari & Petkov, 2011; Xu et al., 2012). In fact, for
a digital image a½m; n, its edges correspond to singularities of a½m; n that, in
turn, are correlated to the local maxima of the wavelet transform modulus.
The discrete Haar wavelet has applications in locating specific lines or
edges in an image by searching all coefficients in the spectral space,
which allows all important edge directions in the image to be found.
Wavelet filters of large scale are capable of noise removal but amplify the
uncertainty of the location of edges. Wavelet filters of small scale detect
the exact location of edges, but incertitude remains between noise and
real edges. Mallat
(1989) demonstrated that large wavelet coefficients characterize the edges,
but the wavelet representation provides different edge maps according to
the orientation. These directional limitations of the wavelets can be
overcome by an edge-fusion operation (Pajares & de la Cruz, 2004; Akhtar et al., 2008;
Rubio-Guivernau et al., 2012; Xu et al., 2012; Suraj et al., 2014). The result of
image fusion is a more desirable image with higher information density that
is more suitable for further image-processing operations.
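The wavelet edge-detection idea above can be sketched in a few lines of numpy. This is an illustration only, not the chapter's MATLAB toolbox code: it uses the averaging/differencing form of the Haar step (the orthonormal variant scales by 1/√2 instead), and the test image is hypothetical.

```python
import numpy as np

def haar_step(x):
    """One 1D Haar analysis step along the last axis:
    pairwise average (low-pass) and pairwise difference (high-pass)."""
    even, odd = x[..., 0::2], x[..., 1::2]
    return (even + odd) / 2.0, (even - odd) / 2.0

def haar2d(img):
    """One level of the 2D Haar transform: filter the rows, then the columns.
    Returns the approximation (LL) plus horizontal (LH), vertical (HL),
    and diagonal (HH) detail sub-bands."""
    lo, hi = haar_step(img)                    # along each row
    ll, lh = haar_step(lo.swapaxes(0, 1))      # along each column of the low-pass
    hl, hh = haar_step(hi.swapaxes(0, 1))      # along each column of the high-pass
    return (ll.swapaxes(0, 1), lh.swapaxes(0, 1),
            hl.swapaxes(0, 1), hh.swapaxes(0, 1))

# A tiny image with a single vertical edge: only the vertical-detail band
# responds, while the smooth regions give zero detail coefficients.
img = np.zeros((4, 4))
img[:, 1:] = 1.0
ll, lh, hl, hh = haar2d(img)
```

This direction-selective behavior — the vertical-detail band lighting up along a vertical edge while the other detail bands stay silent — is exactly what the fusion step later exploits.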
To take advantage of both horizontal and vertical edge map images,
the Dempster–Shafer fusion is used (Shafer, 1990; Mena & Malpica, 2003;
Seo et al., 2011; Li & Wee, 2014; Moraru et al., 2018). This is a way to fuse
information from various levels of decomposition while taking into account
inaccuracy and uncertainty at the same time. The final goal is to reduce
the uncertainty by combining the evidence from different sources. To
apply the Dempster–Shafer theory of evidence, a mass function should be
provided. The belief function represents the total belief for a hypothesis
and can be integrated into a variational wavelet decomposition framework.
Let the hypothesis set Ω be composed of j-level of decomposition exclusive
subsets Ωi, so that Ω = {Ω1, Ω2, …, Ωn}. This universal set has a power set
℘(Ω), and A is an element of this set. The associated elementary mass
function m(A) specifies all confidences assigned to this proposition. The
mass function m: ℘(Ω) → [0, 1] satisfies the conditions

m(∅) = 0   and   ∑_{A ⊆ Ω} m(A) = 1
where ∅ is the empty set and m(A) is the belief that a certain element of Ω
belongs to the set A. The mass function allows defining lower and upper
bounds, which are called belief and plausibility. The
belief functions are particularized on the same frame of discernment and
are based on independent arguments. The plausibility functions indicate
how the available evidence fails to refute A (Scheuermann & Rosenhahn,
2010; Li & Wee, 2014). In the next step, the Dempster–Shafer theory
provides a rule of combination that aggregates two masses m1 and m2
resulted from different sources into one body of evidence. This rule of
combination is a conjunctive operation (AND), and Dempster's rule, or the
orthogonal product of the theory of evidence, is as follows:
m1 ⊕ m2(A) = ( ∑_{B ∩ C = A} m1(B) m2(C) ) / ( ∑_{B ∩ C ≠ ∅} m1(B) m2(C) ),   (A, B, C ∈ Ω)
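Dempster's rule above can be sketched in plain Python. The frame, the two mass functions, and the set names below are illustrative assumptions, not values from the chapter; `belief` and `plausibility` implement the lower and upper bounds mentioned earlier.

```python
from itertools import chain, combinations

def powerset(frame):
    s = list(frame)
    return [frozenset(c) for c in
            chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

def dempster_combine(m1, m2, frame):
    """Dempster's orthogonal sum: conjunctive (AND) combination of two mass
    functions, normalized by 1 minus the total conflict."""
    combined = {A: 0.0 for A in powerset(frame)}
    conflict = 0.0
    for B, pB in m1.items():
        for C, pC in m2.items():
            inter = B & C
            if inter:
                combined[inter] += pB * pC
            else:
                conflict += pB * pC          # product mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence")
    return {A: v / (1.0 - conflict) for A, v in combined.items() if v > 0}

def belief(m, A):
    """Total belief in A: mass of all subsets of A."""
    return sum(v for B, v in m.items() if B <= A)

def plausibility(m, A):
    """How far the evidence fails to refute A: mass of all sets meeting A."""
    return sum(v for B, v in m.items() if B & A)

# Two evidence sources over a tiny frame {h, v}, e.g. horizontal vs. vertical
# edge hypotheses (the numbers are illustrative only).
H, V, HV = frozenset({"h"}), frozenset({"v"}), frozenset({"h", "v"})
m1 = {H: 0.6, HV: 0.4}
m2 = {H: 0.5, V: 0.3, HV: 0.2}
m = dempster_combine(m1, m2, {"h", "v"})
```

Note that the normalization step redistributes the conflicting mass over the non-empty intersections, so the combined masses again sum to one.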
The fusion rule decides those detail coefficients that will be suppressed
and those that will be fused with the coefficients of another image. The most
informative are horizontal and vertical image decomposition structures (de
Zeeuw et al., 2012). The fusion technique involves three steps: (1) decom-
pose the images in the same wavelet basis to obtain horizontal, vertical, and
diagonal details; (2) recompose horizontal and vertical images; and (3) fuse
horizontal and vertical images at j-level. This technique can be used to
reconstruct a new image from processed versions of the original image or
to construct a new image by combining two different images. Wavelet
functions act as derivative operators at different levels. In this case, horizontal
and vertical details, at various levels of decomposition, are considered as
evidence and their outputs are taken as events for Dempster–Shafer fusion.
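As a minimal stand-in for the decision step, the sketch below fuses two detail maps with a maximum-magnitude rule; the chapter's actual decision is made through the Dempster–Shafer combination, so this is only a simplified illustration, and the coefficient values are hypothetical.

```python
import numpy as np

def fuse_details(d1, d2):
    """Pointwise fusion of two detail-coefficient maps: keep the coefficient
    of larger magnitude (a simple stand-in for the Dempster-Shafer decision)."""
    return np.where(np.abs(d1) >= np.abs(d2), d1, d2)

# Hypothetical horizontal and vertical detail maps at the same level
horizontal = np.array([[0.9, 0.0], [0.1, -0.2]])
vertical = np.array([[-0.3, 0.5], [0.0, 0.8]])
fused = fuse_details(horizontal, vertical)
# fused keeps the strongest response at each position: [[0.9, 0.5], [0.1, 0.8]]
```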
Below, certain image enhancement and edge detection algorithms are
briefly presented along with some code listings and examples. The Wavelets
and Image Processing Toolbox for MATLAB contains many basic functions
used in image-processing applications such as filtering for noise removal,
edge detection, and edge enhancement. Haar is a MATLAB library that
computes the Haar transform of data. For the purposes of this chapter, the
data are in the form of a 2D array and the transform is applied to the
columns and then the rows of a matrix.
The images of cells used in this chapter were downloaded from the digital
video gallery provided on the Olympus website. They were captured using
an Olympus fluorescence microscope, by fluorescent speckle microscopy,
to obtain high-resolution images of embryonic rat fibroblast cells (A7r5
line) that express a fusion of monomeric Kusabira Orange fluorescent protein
to human beta-actin. The test image has a size of 400 × 320 pixels.
For 8-bit images, PSNR = 10·log₁₀(255²/MSE) dB, where f(i, j) denotes the
original image and g(i, j) is the processed image. The images have N × M
size. PSNR compares the restoration results for the same image; a higher
PSNR denotes a better restoration method and indicates a higher-quality
image (Mohan et al., 2013; Moraru et al., 2015). Through this maximization
of the signal-to-noise ratio, the probabilities of missed detection and false
alarm in edge detection are minimized.
In the wavelet application domain, the main focus has been noise reduction
or noise removal in digital images. However, wavelet transforms show
some drawbacks such as oscillations, shift variance, or aliasing (Satheesh &
Prasad, 2011; Raj & Venkateswarlu, 2012; Thanki et al., 2018).
The MSE is an error metric that computes the cumulative squared error
between the processed and raw/original image (Mohan et al., 2013;
Moraru et al., 2015):
MSE = (1 / (N·M)) ∑_{i=0}^{N−1} ∑_{j=0}^{M−1} ( g(i, j) − f(i, j) )²
A lower MSE indicates smaller errors, but the metric strongly depends on
the image intensity scaling: the squared difference de-emphasizes small
differences between two pixels but heavily penalizes large ones. The wavelet
transform, as a method for noise reduction or removal, overcomes this
drawback because it has better directional and decomposition capabilities.
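Both metrics can be computed in a few lines. The Python below is an illustration only (the chapter's own experiments use MATLAB), with a hypothetical one-pixel perturbation standing in for a processed image.

```python
import numpy as np

def mse(f, g):
    """Mean squared error between original f and processed g."""
    f = f.astype(np.float64)
    g = g.astype(np.float64)
    return float(np.mean((g - f) ** 2))

def psnr(f, g, peak=255.0):
    """Peak signal-to-noise ratio in dB; higher means a better restoration."""
    e = mse(f, g)
    return float("inf") if e == 0 else 10.0 * np.log10(peak ** 2 / e)

f = np.full((4, 4), 100, dtype=np.uint8)     # "original" image
g = f.copy()
g[0, 0] = 110                                # one pixel off by 10
# mse(f, g) == 100/16 = 6.25; psnr(f, g) is about 40.17 dB
```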
To improve on the results provided by accepted methods such as PSNR and
MSE, which do not accurately reflect human visual perception, a hybrid
perceptual method called the SSIM index is used. This method measures the
similarity between two images (Wang et al., 2004; Moraru et al., 2015).
SSIM is a strong evaluation instrument in the case of image fusion because
it takes into account luminance and contrast terms (that describe the
nonstructural distortions) and the structural distortion that characterizes
the loss of linear correlation. For two spatial locations x and y in analyzed
images (i.e., two data vectors having nonnegative values expressing the
pixel values to be compared), the SSIM is defined as
SSIM(x, y) = [ (2·x̄·ȳ + C1) / (x̄² + ȳ² + C1) ]^α · [ (2·sx·sy + C2) / (sx² + sy² + C2) ]^β · [ (sxy + C3) / (sx·sy + C3) ]^γ
where x̄ and ȳ denote the mean values, sx and sy are the standard
deviations, and sxy is the covariance. α, β, γ > 0 are weights on the
luminance (l), contrast (c), and structure (s) terms of the fidelity measure.
C1, C2, and C3 are small positive constants that avoid division by a weak
denominator.
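A single-window Python sketch of this formula follows, using the common defaults C1 = (0.01·255)², C2 = (0.03·255)², and C3 = C2/2; these constants are conventional assumptions, not values given in the chapter, and practical implementations average the measure over local sliding windows.

```python
import numpy as np

def ssim(x, y, peak=255.0, K1=0.01, K2=0.03, alpha=1.0, beta=1.0, gamma=1.0):
    """Global-statistics SSIM with the luminance, contrast, and structure
    terms of the factorized formula."""
    x = x.astype(np.float64).ravel()
    y = y.astype(np.float64).ravel()
    C1, C2 = (K1 * peak) ** 2, (K2 * peak) ** 2
    C3 = C2 / 2.0                      # a common default choice
    mx, my = x.mean(), y.mean()
    sx, sy = x.std(), y.std()
    sxy = ((x - mx) * (y - my)).mean()
    l = (2 * mx * my + C1) / (mx ** 2 + my ** 2 + C1)   # luminance
    c = (2 * sx * sy + C2) / (sx ** 2 + sy ** 2 + C2)   # contrast
    s = (sxy + C3) / (sx * sy + C3)                     # structure
    return (l ** alpha) * (c ** beta) * (s ** gamma)

a = np.array([[10.0, 20.0], [30.0, 40.0]])
b = np.array([[12.0, 18.0], [33.0, 37.0]])
# ssim(a, a) is 1.0 (up to floating point); ssim(a, b) lies strictly in (0, 1)
```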
4.5 Applications
FIGURE 4.3
2-Level wavelet decomposing structure of embryonic rat fibroblast cells. The graph on the left
shows the organization of the approximation and details.
FIGURE 4.4
Approximation and details by directions for 1-level wavelet decomposing structure of test
image. The directions are A (horizontal), B (vertical), and C (diagonal).
FIGURE 4.5
Approximation and details by directions for 2-level wavelet decomposing structure of test
image. The directions are D (horizontal), E (vertical), and F (diagonal).
FIGURE 4.6
The approximation component providing information about the global properties of the
analyzed image.
H = 'haar';
l_2 = 2;
FIGURE 4.7
Noise removal by the Haar wavelet: (a) noisy raw image, (b) de-noised image, and (c)
residuals.
sorh = 's';                  % soft thresholding
t_S = [5 5; 5 5; 5 5];       % thresholds for the three detail orientations at each level
roundFLAG = true;
[c_s,sizes] = wavedec2(X,l_2,H);
[X,cfsDEN,dimCFS] = wdencmp('lvd',c_s,sizes,H,l_2,t_S,sorh);
if roundFLAG
    X = round(X);
end
X = uint8(X);                % restore the 8-bit image representation
FIGURE 4.8
Image enhancement by Haar wavelet: (a) enhanced image and (b) decomposition at level 1 of
de-noised image.
load rat;                            % loads the test image into X
PX = size(X);
[cA,cH,cV,cD] = dwt2(X,'haar');      % 1-level Haar decomposition
A = idwt2(cA,cH,cV,cD,'haar',PX);    % reconstruction at the original size
More accurate results are achieved when the analysis is made at the first
level. At superior analysis levels, blurring effects appear.
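The soft-thresholding rule selected by `sorh = 's'` in the de-noising listing above can be sketched independently of the toolbox; the coefficient values below are hypothetical.

```python
import numpy as np

def soft_threshold(coeffs, t):
    """Soft ('s') thresholding of detail coefficients: shrink every
    coefficient toward zero by t and zero out those below the threshold."""
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - t, 0.0)

d = np.array([-7.0, -2.0, 0.5, 3.0, 9.0])
shrunk = soft_threshold(d, 5.0)
# only the strong coefficients survive: [-2.0, 0.0, 0.0, 0.0, 4.0]
```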
FIGURE 4.9
Edge detection using the Haar wavelet: (a) horizontal edge responses and (b) vertical edge responses.
Code listing 3: Haar wavelet filters to get vertical and horizontal edges
J = imread('rat.jpg');
I = rgb2gray(J);
intImg = integralImage(I);
h_H = integralKernel([1 1 4 3; 1 4 4 3], [-1, 1]);  % horizontal edge filter
v_H = h_H.';                                        % transposed for vertical edges
h_Res = integralFilter(intImg, h_H);
v_Res = integralFilter(intImg, v_H);
imshow(h_Res, []);
imshow(v_Res, []);
FIGURE 4.10
Fused edge map image provided by Haar wavelet. Both horizontal and vertical edge map
details were recomposed following the rules: (a) horizontal details for 1- and 2-levels (HL1
and HL2) and their edge fusion (HL1 + HL2); (b) vertical details for 1- and 2-levels (VL1 and
VL2) and their edge fusion (VL1 + VL2); and (c) horizontal and vertical details at level 1
(HL1 + VL1), at level 2 (HL2 + VL2) and for the combination of both edge fusion for
horizontal and vertical direction [(HL1 + HL2) + (VL1 + VL2)].
PSNR and MSE were used to evaluate the efficacy of the noise removal. The same enhanced test
image was used in this analysis to reveal the effect of a particular choice of
wavelet bases in terms of PSNR and MSE values. The enhancement perfor-
mance is analyzed between the original and enhanced images (with noise
removal and enhancement operations) with the same size. Wavelet decom-
positions and reconstructions of 1- and 2-level were performed on a stack of
20 images. An example of de-noising operation is displayed in Figure 4.7.
The average value of PSNR computed between the original and enhanced
image with Haar wavelet of 1-level is 36.97 dB. Similarly, when the analysis
is performed between the original image and Haar wavelet of 2-level, the
average value of PSNR is 36.93 dB. The average values of MSE were
computed under the same experimental conditions: MSE = 13.25 for the
first case and MSE = 13.24 for the second. Comparing the results, almost
no difference is observed between 1- and 2-levels of decomposition for
either objective quality metric, so the de-noising capability is the same.
However, the perceptual quality of the enhanced images is significantly
better than that of the raw images, as noise has been notably reduced,
that is, higher PSNR and lower MSE. A partial first conclusion is that the Haar
TABLE 4.1
The average SSIM values of fused images based on three combination rules
Comparison pair                                        SSIM
(HL1 + VL1) vs. (HL2 + VL2)                            0.765
(HL1 + VL1) vs. [(HL1 + HL2) + (VL1 + VL2)]            0.878
(HL2 + VL2) vs. [(HL1 + HL2) + (VL1 + VL2)]            0.844
wavelet transform improves the objective quality of the raw images. How-
ever, these very good values of objective metrics do not guarantee avoiding
the loss of useful information and preserving the contents of the image. For
accuracy, the structural aspects are addressed and the quality of visual
information measured by the corresponding SSIM index values is quanti-
fied. SSIM captures helpful information on the fidelity with a reference
image and estimates the local improvement of image quality over the
image space. An SSIM value closer to 1 is evidence of better image
quality obtained as a result of image processing in each
individual sub-image. Three combinations of the level and directions of
decompositions for edge map images are considered. The test conditions
are arranged in pairs such that the first is the reference 1- and 2-level of
decomposition and the second is a combination between the fused levels of
decomposition. An example is provided in Figure 4.10 and the SSIM results
are presented in Table 4.1.
From this table, it can be observed that SSIM values have the maximum
average value when the image is reconstructed by using the fusion
between horizontal and vertical details at level 1. Moreover, the recon-
structed images by using the fusion between horizontal and vertical details
at level 2 have good quality. These results indicate that the level of
decomposition determines the quality of the enhanced images.
4.7 Conclusions
The Haar wavelet decomposition seems to be one of the most efficient tools to
sequentially process and represent an image. The pairs of pixels are averaged
to get a new image having different resolutions, but it is required to estimate
the loss of information that naturally occurs in this process. The Haar wavelet
transform delivers the approximation and detail coefficients at 1- and 2-level
of decomposition, working as a low-pass and a high-pass filter simultaneously.
The proposed unsupervised machine-learning methods performed
better at level 1 of wavelet decomposition and when horizontal and vertical
details are fused.
The purpose of this chapter was to analyze the Haar matrix-based method
as a tool for image processing (such as de-noising, enhancing, edge detection,
and edge preserving) and image analysis (such as quality assessment). One of
the major advantages of the technique described in this chapter is the multi-
resolution approach, which provides a context for a signal decomposition in
a succession of approximations of images with lower resolution, complemented
by correlative details that sharpen some aspects of the image. To
conclude, the main advantages of this approach are as follows: (1) it decreases
the manual work in the process of analyzing the huge amount of data
encompassed in microscopy images; (2) wavelets avoid preprocessing operations
such as filtering with dedicated tools, which constitutes a substantial
advantage; (3) it ensures a good signal-to-noise ratio as well as a good
detection rate of the useful information; (4) it allows edge detection at
different scales and for different orientations; and (5) it improves edge
detection by the wavelet representation through image fusion.
References
Aldroubi, A., & Unser M. (Eds.). (1996). Wavelets in Medicine and Biology. Boca
Raton, FL, USA: CRC Press.
Ali, S. T., Antoine, J-P., & Gazeau, J-P. (2010). Coherent states and wavelets,
a mathematical overview. In Graduate Textbooks in Contemporary Physics,
New York: Springer.
Ali, S.T., Antoine, J-P., & Gazeau, J-P. (2014). Coherent States, Wavelets and Their
Generalizations, 2nd edition. New York, USA: Springer.
Akhtar, P., Ali, T. J., & Bhatti, M. I. (2008). Edge detection and linking using
wavelet representation and image fusion. Ubiquitous Computing and Communi-
cation Journal, 3(3), pp. 6–11.
AlZubi, S., Islam, N., & Abbod, M. (2011). Multiresolution analysis using wavelet,
ridgelet, and curvelet transforms for medical image segmentation. International
Journal of Biomedical Imaging, 2011, Article ID 136034, p. 18.
Ashour, A. S., Beagum, S., Dey, N., Ashour, A. S., Pistolla, D. S., Nguyen, G. H.,
Le, D. N, & Shi, F.. (2018). Light microscopy image de-noising using optimized
LPA-ICI filter. Neural Computing and Applications 29(12), pp. 1517–1533.
Bhosale, B., Moraru, L., Ahmed, B. S., Riser, D., & Biswas, A. (2014). Multi-
resolution analysis of wavelet like soliton solution of KDV equation, Proceedings
of the Romanian Academy, Series A, 15(1), pp. 18–26.
Bibicu, D., Moldovanu, S., & Moraru, L. (2012). De-noising of ultrasound images
from cardiac cycle using complex wavelet transform with dual tree. Journal of
Engineering Studies and Research, 18(1), pp. 24–30.
Chang, S. G., Yu, B., & Vetterli, M. (2000). Spatially adaptive wavelet thresholding
with context modeling for image denoising. IEEE Transactions on Image Proces-
sing, 9(9), pp. 1522–1531.
Davidson, M. W., & Abramowitz, M. (2002). Optical microscopy. Encyclopedia of
Imaging Science and Technology, 2, pp. 1106–1140.
Davis, G., Strela, V., & Turcajova, R. (1999). Multi-wavelet construction via the
lifting scheme. In He, T. X. (Eds.), Wavelet Analysis and Multiresolution Methods.
Lecture Notes in Pure and Applied Mathematics, New York, USA: Marcel
Dekker Inc.
Dettori, L., & Semler, L. (2007). A comparison of wavelet, ridgelet, and
curvelet-based texture classification algorithms in computed tomography.
Computers in Biology and Medicine, 37(4), pp. 486–498.
Dey, N., Rajinikanth, V., Ashour, A. S., & Tavares, J. M. R. S. (2018). Social group
optimization supported segmentation and evaluation of skin melanoma
images. Symmetry, 10(2), p. 51.
de Zeeuw, P. M., Pauwels, E. J. E. M., & Han, J. (2012, February). Multimodality
and multiresolution image fusion. Paper presented at the meeting of the 7th
International Joint Conference, VISIGRAPP 2012, Rome, Italy.
Donoho, D. L. (1995). Denoising by soft-thresholding. IEEE Transactions on Informa-
tion Theory, 41(3), pp. 613–627.
Donoho, D. L., & Johnstone, I. M. (1994). Ideal spatial adaptation via wavelet
shrinkage. Biometrika, 81(3), pp. 425–455.
Eckley, I. A., Nason, G. P., & Treloar, R. L. (2010). Locally stationary wavelet fields
with application to the modelling and analysis of image texture. Journal of the
Royal Statistical Society: Series C (Applied Statistics), 59(4), pp. 595–616.
Evennett, P. J., & Hammond, C. (2005). Microscopy overview. In Encyclopedia of
Analytical Science, (pp. 32–41). Elsevier Ltd.
Gröchenig K., & Madych W. R. (1994). Multiresolution analysis, Haar bases and
self–similar tilings of Rn. IEEE Transactions on Information Theory, 38(2), pp.
556–568.
Guan, Y. P. (2008). Automatic extraction of lips based on multi-scale wavelet edge
detection. IET Computer Vision, 2(1), pp.23–33.
Jahromi, O. S., Francis, B. A., & Kwong, R. H. (2003). Algebraic theory of optimal
filterbanks. IEEE Transactions on Signal Processing, 51(2), pp. 442–457.
Kumar, H., Amutha, N. S., Ramesh Babu, D. R. (2013). Enhancement of mammo-
graphic images using morphology and wavelet transform. International Journal
of Computer Technology and Applications, 3, pp. 192–198.
Digital Image Processing Using Wavelets 93
Lanni, F., & Keller, E. (2000). Microscopy and microscope optical systems. In
Yuste, R., Lanni, F., & Konnerth, A. (Eds.), Imaging Neurons: A Laboratory
Manual, (pp. 1.1–1.72). New York, USA: Cold Spring Harbor Laboratory Press.
Li, X., & Wee, W. G. (2014). Retinal vessel detection and measurement for
computer-aided medical diagnosis. Journal of Digital Imaging, 27, pp. 120–132.
Liu, X., Zhao, J., & Wang, S. (2009). Nonlinear algorithm of image enhancement
based on wavelet transform. In Proceedings of the International Conference on
Information Engineering and Computer Science, (pp. 1–4). Wuhan, China: IEEE.
Lorenz, K. S., Salama, P., Dunn, K. W., & Delp, E. J. (2012). Digital correction of
motion artifacts in microscopy image sequences collected from living animals
using rigid and non-rigid registration. Journal of Microscopy, 245(2), pp. 148–160.
Mallat, S. (1989). A theory for multiresolution signal decomposition: the wavelet
representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11
(7), pp. 674–693.
MATLAB, Image Processing Toolbox User’s Guide, Retrieved October 30, 2015, from
http://www.mathworks.com/help/matlab/creating_plots/image-types.html
Mena, J. B., & Malpica, J. A. (2003). Color image segmentation using the
Dempster-Shafer theory of evidence for the fusion of texture. In The Interna-
tional Archives of the Photogrammetry, Remote Sensing and Spatial Information
Sciences (ISPRS Archives), vol. XXXVI, Part 3/W8, Munich, Germany.
Misiti, M., Misiti, Y., Oppenheim, G., & Poggi, J.-M. (Eds). (2007). Wavelets and their
Applications. London, UK: Wiley-ISTE.
Mohan, J., Krishnaveni, V., & Yanhui, G. (2013). MRI denoising using nonlocal
neutrosophic set approach of Wiener filtering. Biomedical Signal Processing and
Control, 8, pp. 779–791.
Moldovanu, S., & Moraru, L. (2010). De-noising kidney ultrasound analysis using
Haar wavelets. Journal of Science and Arts, 2(13), pp. 365–370.
Moldovanu, S., Moraru L., & Biswas A. (2016). Edge-based structural similarity
analysis in brain MR images. Journal of Medical Imaging and Health Informatics,
6(2), pp. 539–546.
Moraru, L., Moldovanu, S., Culea-Florescu, A. L., Bibicu, D., Ashour, A. S., &
Dey N. (2017). Texture analysis of parasitological liver fibrosis images. Micro-
scopy Research and Technique, 80(8), pp.862–869.
Moraru, L., Moldovanu, S., & Nicolae, M. C. (2011). De-noising ultrasound images
of colon tumors using Daubechies wavelet transform. In Conference Proceedings
of American Institute of Physics, vol. 1387, (pp. 294–299), Timisoara, Romania:
AIP Publishing.
Moraru, L., Moldovanu, S., & Obreja, C. D. (2015). A survey over image quality
analysis techniques for brain MR images. International Journal of Radiology, 2(1),
pp. 29–37
Moraru, L., Obreja, C. D., Dey N., & Ashour, A. S. (2018). Dempster-Shafer fusion
for effective retinal vessels’ diameter measurement. In Dey, N., Ashour, A. S.,
Shi, F., & Balas, V. E. (Eds.), Soft Computing in Medical Image Analysis. Academic
Press Elsevier B&T.
Munoz, A., Ertle, R., & Unser, M. (2002). Continuous wavelet transform with
arbitrary scales and O(N) complexity. Signal Processing, 82(5), pp. 749–757.
Nikon. Specialized Microscope Objectives (n.d.). Retrieved October 30, 2015, from
https://www.microscopyu.com/articles/optics/objectivespecial.html
Olympus. Microscopy resource center (n.d.). Retrieved October 30, 2015, from
http://www.olympusmicro.com/
Pajares, G., & de la Cruz, J. M. (2004). A wavelet-based image fusion tutorial.
Pattern Recognition, 37(9), pp. 1855–1872.
Papari, G., & Petkov, N. (2011). Edge and line oriented contour detection: State of
the art. Image and Vision Computing, 29(2–3), pp.79–103.
Patnaik, S., & Zhong, B., (Eds.). (2014). Soft Computing Techniques in Engineering
Application. Studies in Computational Intelligence (vol. 543). Switzerland: Springer
International.
Piella, G., & Heijmans, H. (2003). A new quality metric for image fusion. In
Proceedings of International Conference on Image Processing ICIP 2003, vol. 3, (pp.
173–176), Barcelona, Spain: IEEE
Portilla, J., Strela, V., Wainwright, M. J., & Simoncelli, E. P. (2003). Image denoising
using scale mixtures of Gaussians in the wavelet domain. IEEE Transactions on
Image Processing, 12(11), pp. 1338–1351.
Porwik, P., Lisowska, A. (2004a). The Haar-Wavelet transform in digital image
processing: its status and achievements. Machine Graphics and Vision, 13, pp.
79–98.
Porwik, P., & Lisowska, A. (2004b). The new graphics description of the Haar
wavelet transform. Lecture Notes in Computer Science vol. 3039, (pp. 1–8),
Springer-Verlag.
Raj, V. N. P., & Venkateswarlu, T. (2012). Denoising of medical images using dual
tree complex wavelet transform. Procedia Technology, 4, pp. 238–244.
Rajinikanth, V., Satapathy, S. C., Dey, N., & Vijayarajan, R.. (2018). DWT-PCA
image fusion technique to improve segmentation accuracy in brain tumor
analysis. In Microelectronics, Electromagnetics and Telecommunications, pp. 453–
462. Singapore: Springer.
Rohit, T., Borra, S., Dey, N., & Ashour, A. S.. (2018). Medical imaging and its
objective quality assessment: An introduction. In Classification in BioApps, pp. 3–
32. Cham: Springer.
Rosenthal, C. K. (2009). Light Microscopy: Contrast by interference. Nature Mile-
stones | Milestone 8.
Rubio-Guivernau, J. L., Gurchenkov, V., Luengo-Oroz, M. A., Duloquin, L.,
Bourgine, P., Santos, A., Peyrieras, N., & Ledesma-Carbayo, M. J. (2012).
Wavelet-based image fusion in multi-view three-dimensional microscopy.
Bioinformatics, 28(2), pp. 238–245.
Sakellaropoulos, P., Costaridou, L., & Panayiotakis, G. (2003). A wavelet-based
spatially adaptive method for mammographic contrast enhancement. Physics in
Medicine and Biology, 48(6), pp. 787–803.
Satheesh, S., & Prasad, K. (2011). Medical image denoising using adaptive thresh-
old based on contourlet transform. Advanced Computing: An International Jour-
nal, 2(2), pp. 52–58.
Scheuermann, B., & Rosenhahn, B. (2010). Feature quarrels: The Dempster-Shafer
evidence theory for image segmentation using a variational framework. In
Kimmel, R., Klette, R., & Sugimoto, A. (Eds.), Computer Vision – ACCV 2010,
(pp. 426–439). Lecture Notes in Computer Science, 6493.
Seo, S. T., Sivakumar, K., & Kwon, S. H. (2011). Dempster-Shafer’s evidence
theory-based edge detection. International Journal of Fuzzy Logic and Intelligent
Systems, 11(1), pp. 19–24.
CONTENTS
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.3 Terminologies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.4 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.5 Gap Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.6 System Architecture and Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.7 System Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.8 Algorithm Used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.9 Software and Hardware Specifications . . . . . . . . . . . . . . . . . . . . . . . . 109
5.10 Other Specifications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.11 Mathematical Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.12 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.1 Introduction
Considering the huge capital being invested in educational data mining,
putting it to good use can enable great developments. With this in mind,
educational data mining is applied in the proposed system to predict the
performance of students. In doing so, data-mining techniques can be used
to identify different perspectives and patterns that help interpret the
performance of students. This gives a new approach to
looking at and evaluating students according to their performance on
a whole new level. Data mining applied in the education field is referred
to as educational data mining. It is used to discover new methods and
gain knowledge from databases, and it is useful in decision-making. Here,
past and present student records are considered. Different classification
algorithms are used to predict the students' placement, and various
data-mining tools are used in the process. Knowledge of the points that
help in students' placements benefits both students and management:
management can take further steps that favor the placement of students.
Prediction is a tricky thing; the more variables used, the higher the
accuracy. Among widely used classifiers, decision trees are very popular.
They apply a set of rules to the dataset, from which the decision is made.
The list of tasks to be performed is as follows: data preparation, data
selection and transformation, implementation of the mining model, and
the prediction model.
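The split criterion at the heart of such decision trees can be sketched as an information-gain computation; C4.5 itself normalizes this into a gain ratio. The toy placement records and the field name below are hypothetical, not taken from the chapter.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Reduction in label entropy obtained by splitting the rows on attr;
    a tree builder picks the attribute that maximizes this at each node."""
    n = len(labels)
    groups = {}
    for row, lab in zip(rows, labels):
        groups.setdefault(row[attr], []).append(lab)
    return entropy(labels) - sum(
        len(part) / n * entropy(part) for part in groups.values())

# Toy placement records with a hypothetical field
rows = [{"cgpa": "high"}, {"cgpa": "high"}, {"cgpa": "low"}, {"cgpa": "low"}]
placed = ["yes", "yes", "no", "no"]
gain = information_gain(rows, placed, "cgpa")
# gain == 1.0: splitting on cgpa removes all uncertainty about placement
```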
The chapter is structured as follows: In Section 2, motivation behind the
proposed system is discussed. In Section 3, some widely used terminolo-
gies regarding the system are briefly explained. Section 4 reviews the
related work of the proposed system. In Section 5, gap analysis in the
form of how the proposed system is better than the existing systems’
architecture is discussed. In Section 6, system architecture of the proposed
system along with the flow of the entire process in the proposed system
with the help of sequence diagram is discussed. Section 7 describes the
overall scope of the proposed system and the important attributes used in
it. In Section 8, the algorithm used for the proposed system is explained.
Section 9 describes the minimum software and hardware requirements
for the proposed system. In Section 10, other specifications
are discussed in terms of advantages, disadvantages, and applications. In
Section 11, technologies used in the proposed system are briefly intro-
duced. Section 12 describes the mathematical aspects of the proposed
system in terms of mathematical model, set theory analysis, and Venn
diagram. In Section 13, improvements that can be done in the proposed
system are discussed and the chapter is concluded.
5.2 Motivation
Every year, around 1.5 million engineers graduate from colleges. But only
18.43% of them manage to get jobs. The reason behind their unemployment
need not necessarily be a lack of acquaintance with the concepts
in their field. Most of the time, students do not know where they are
actually lacking and they get frustrated. The motivation behind the pro-
posed system is to help students understand where they should focus
more and what other things they need to follow to get their dream job.
Probability Predictor Using Data-Mining Techniques 99
5.3 Terminologies
i. TPO (training and placements officer): The staff member under whom
all the placement work is done.
ii. GH (GUI handler): An entity to handle the GUI (graphical user
interface) considered here for explanation purpose.
iii. PB (profile builder): From the previous data, a particular pattern or
profile will be generated by PB with respect to work of algorithm.
iv. DT (dataset training): Algorithm will be trained on the previous data
to make prediction.
v. ML (machine learning): Ability of a computer to learn by itself
without being explicitly programed.
vi. WEKA (Waikato environment for knowledge analysis): A collection of
different machine-learning tools.
introducing bias was quite natural. This can be overcome using Haralick
features in backpropagation of NNs where an optimization function is
used to minimize maximum errors. This results in high accuracy with
great effective details [14]. The vectors are used as a driving function to
increase the accuracy of the results. It is most commonly done by an
optimization function which performs the major functionality of minimiz-
ing errors. The main disadvantage of using traditional learning algorithms
is that it gives the results prematurely. Therefore, accuracy is an issue here.
As ANN uses backpropagation, local optimizations are performed at the
learning stage itself, which is beneficial for the accuracy [15]. NNs are
becoming a new and advanced form of machine learning. The ANN is
regarded as one of the most adaptive and popular modeling tools. It can
work on a small amount of data and still provide accurate results. It is
likened to a biological neuron because the two work in a similar way. The
main advantage of using an ANN is that it is adaptive in its learning
stage itself and can handle similar functional relationships [16].
TABLE 5.1
Related Work versus Proposed System
Reference Gap analysis
[1] By using C4.5 decision tree algorithm, error rates will be reduced due to
inherent nature of the algorithm
[2] C4.5 decision tree algorithms are an advanced version of earlier decision tree
algorithms such as the ID3 algorithm
[3] C4.5 decision tree algorithm gives more accuracy
[4] C4.5 decision tree algorithm gives more precision rate
[5] C4.5 decision tree algorithm offers both higher accuracy and a higher precision
rate. As a result, it works more efficiently
K-means is a very basic classifier and its error rates are high [1]. This is
tackled using C4.5 algorithm. The number of attributes in consideration
also matters: they can be neither too many nor too few. If they are few, there
is scope for improvement in the accuracy of the results [2]. The Naïve Bayes algorithm
has its advantages, but it has less accuracy [3]. It also has low precision
rate [4] than C4.5 algorithm. NNs have more accuracy, but the precision
rate is low [5]. C4.5 algorithm tends to tackle all these issues.
FIGURE 5.1
Architecture of proposed system (brief system architecture).
details. The server will update the database. The student completes the registration process through the server and can update his or her profile; the server fetches the profile and updates the student database. The student can view company details such as company name, location, criteria, and package, and can check the placement criteria. The server fetches the previous placement records and the student's current record, and then applies the classification and prediction algorithm. Whether the student can be placed in these companies is determined accordingly. If the prediction is true, the server sends a notification to the student; if it is false, the server informs the student that the company's criteria do not match his or her profile. The sequence diagram is useful for understanding the interaction between entities in their order of execution. The flow of the system is given in Figure 5.2.
In Figure 5.3, the work of the three main entities in the proposed system is discussed individually. The admin is responsible for adding all the entities to the system, and by doing so grants access to individual entities. Students can view their statistics company-wise. The TPO has access to all placement details and can also view them graphically for a better understanding of the entire scenario. Any update to the system, or to the information about staff or students, has to be done via admin access, which gives the admin the most important role in this system. Moreover, students are notified via SMS whenever a new company is available for placement.
i. Data preparation:
To decide which dataset is to be used.
ii. Data transformation and selection:
Here, fields required for data mining are selected.
iii. Implementation of mining model:
Here, C4.5 classification algorithm is applied on the final dataset.
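The three steps above can be sketched as follows; the toy dataset, field names, and cutoff are illustrative assumptions, and a simple threshold split stands in for the full C4.5 model:

```python
# Step i: data preparation -- an illustrative toy placement dataset
# (field names and values are assumptions, not from the chapter).
records = [
    {"student_id": 1, "cgpa": 8.2, "apti_marks": 72, "hobbies": "chess", "placed": "yes"},
    {"student_id": 2, "cgpa": 6.1, "apti_marks": 41, "hobbies": "music", "placed": "no"},
    {"student_id": 3, "cgpa": 7.5, "apti_marks": 65, "hobbies": "sports", "placed": "yes"},
    {"student_id": 4, "cgpa": 5.9, "apti_marks": 38, "hobbies": "none", "placed": "no"},
]

# Step ii: data transformation and selection -- keep only fields used for mining.
fields = ["cgpa", "apti_marks"]
X = [[r[f] for f in fields] for r in records]
y = [r["placed"] for r in records]

# Step iii: apply the mining model. A full C4.5 implementation is beyond
# a sketch; a single threshold split stands in for the trained classifier.
def predict(row, cgpa_cutoff=7.0):
    return "yes" if row[0] >= cgpa_cutoff else "no"

print([predict(row) for row in X])  # -> ['yes', 'no', 'yes', 'no']
```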
(Figure 5.2 residue: the sequence of calls shown is doLogin, validateUser, addStudentAndStaff, updateDatabase, updatePlacementDetails, sendPlacementDetails, classify, predict, buildResult, showResult, and viewResult, each acknowledged with a success message.)
FIGURE 5.2
Sequence diagram for proposed system (flow of activities in the proposed system).
Here, the proposed system mainly differs from others in the efficiency of the algorithm and the scope of attributes considered. In addition, some corner cases are considered, such as a student's ability to handle pressure, which can be judged from his or her nonacademic performance as well. The order of attributes in which the decision tree is built is not important, but
(Figure 5.3 residue: after login and validation, placement records and company details are added and validated, the database is updated, and the classify/predict steps produce a yes/no outcome.)
FIGURE 5.3
Activity diagram for proposed system (dynamic control structure for the proposed system).
1. Tree = { }
2. if D is "pure" OR other stopping criteria are met then
3.   terminate
4. end if
5. for all attributes a ∈ D do
6.   compute the information-theoretic criteria if we split on a
7. end for
8. a_best = the best attribute according to the above-computed criteria
9. Tree = create a decision node that tests a_best in the root
10. D_v = induced sub-datasets of D based on a_best
Info(D) = Σ_{j=1}^{v} (|D_j| / |D|) Info(D_j)

SplitInfo(D) = − Σ_{j=1}^{v} (|D_j| / |D|) log2(|D_j| / |D|)
Step 8: The node having the maximum information gain is used for expansion.
Step 9: The procedure is repeated to construct the decision tree with all considered attributes.
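The information-theoretic criteria used to pick the best attribute (information gain divided by split information, the gain-ratio form used by C4.5) can be sketched as:

```python
import math
from collections import Counter

def entropy(labels):
    # Info(D): entropy of the class-label distribution.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(partitions, parent_labels):
    """partitions: label lists D_1..D_v produced by splitting on attribute a."""
    n = len(parent_labels)
    # Info_a(D): weighted entropy of the sub-datasets after the split.
    info_a = sum(len(d) / n * entropy(d) for d in partitions)
    gain = entropy(parent_labels) - info_a            # information gain
    # SplitInfo(D): entropy of the partition sizes themselves.
    split = -sum(len(d) / n * math.log2(len(d) / n) for d in partitions if d)
    return gain / split if split else 0.0

labels = ["yes", "yes", "no", "no"]
parts = [["yes", "yes"], ["no", "no"]]   # a perfect binary split
print(gain_ratio(parts, labels))          # -> 1.0
```

A perfect split of a balanced two-class set yields a gain ratio of 1.0; the attribute maximizing this criterion becomes the root node, matching Step 8.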
Serial number 1: Problem description (UML description: S holds the list of modules in the proposed system)
Let S be the module which contains the following submodules:
S1 = GUI handler (GH)
S2 = Dataset training (DT)
S3 = Profile builder (PB)
S4 = Graphical output (GO)
S5 = Machine learning (ML)
S6 = Company details (CD)
2 Activity
2.1 Activity I: Login process (UML description: if the given username/password are correct, allow the user to navigate to the home page; else, show a proper error message)
Let S1 be the set of parameters used for login:
S1 = {u/n, pwd}
where u/n is the username and pwd is the password.
Condition/Parameter and Operation/Function:
If login == valid login, then f1: Proceed(); else, throw an error.
2.2 Activity II: Upload data process (UML description: students upload their profile data, which helps predict the placement chances of that student in a particular company)
Let S2 be the set for dataset upload:
S2 = {student_id, marks, hobbies, achievements, apti-marks}
where student_id is the student ID, marks are the marks in different subjects, hobbies are the student's hobbies, achievements are achievements in different areas, and apti-marks are the aptitude marks.
Condition/Parameters and Operation/Function:
If the dataset is valid, then f1: doTraining() and f2: validateDataset(); else, f3: error() (throw an error).
2.3 Activity III: Show output (UML description: the system shows the placement prediction in graphical format and sends an SMS to the student and the TPO as well)
Let S3 be the set used for the graphical user interface.
5.12 Conclusion
The main objective of the proposed system is to develop a system which will
give precise and to-the-point prediction. Using the algorithm of C4.5, the
students are classified based on their performance attributes. Information
gain of each and every attribute is calculated and a decision tree is devel-
oped. Here, information gain is an important parameter. The attribute having
the maximum information gain will be the root node and will be expanded
and explored further. Then, the same procedure is followed for the sub-nodes. This process continues until a leaf node is obtained. The decision tree built may therefore differ from one individual to another, but all the attributes are considered and the end result format is the same for every end user regardless of their results. Thus, this domain-specific predictor will give accurate results with a larger number of records and attributes as well.
Future work consists of applying the proposed system in actual use. It can be adapted to different applications by making a few changes in parameters. The reports from all applications can be analyzed and used for further enhancements. The classification algorithms can be
References
[1] Devasia, Tismy, T. P. Vinushree, and Vinayak Hegde. “Prediction of students’
performance using Educational Data Mining”, Dept. of Computer Science,
Amrita Vishwa Vidyapeetham University, Mysuru campus, IEEE 2016.
[2] Pruthi, Karan, and Prateek Bhatia. “Application of Data Mining in predicting placement of students”, Dept. of Computer Science, Thapar University, IEEE 2015.
[3] Ashok, M. V., and Apoorva A. “Data Mining Approach for Predicting Student and Institution’s Placement Percentage”, Dept. of Computer Science, Bangalore, IEEE 2016.
[4] Mangasuli, Sheetal B., and Savita Bakare. “Prediction of Campus Placement Using Data Mining Algorithm Fuzzy logic and K nearest neighbour”, Department of Computer Science and Engineering, KLE Dr. M S Sheshgiri College of Engineering and Tech, Belgaum, India, IJARCCE 2016.
[5] Ramesh, V., P. Parkavi, and P. Yasodha. “Performance Analysis of Data Mining Techniques for Placement Chance Prediction”, IJSER 2011.
[6] Dwivedi, Tripti, and Diwakar Singh. “Analysing Educational Data through
EDM Process: A Survey”, Department of Computer Engineering, BUIT,
Bhopal, IJCA 2016.
[7] Zemmal, Nawel, Nabiha Azizi, Nilanjan Dey, and Mokhtar Sellami. “Adap-
tive semi supervised support vector machine semi supervised learning with
features cooperation for breast cancer classification.” Journal of Medical Ima-
ging and Health Informatics 6, no. 1 (2016): 53–62.
[8] Ahmed, Sk Saddam, Nilanjan Dey, Amira S. Ashour, Dimitra Sifaki-Pistolla,
Dana Bălas-Timar, Valentina E. Balas, and João Manuel RS Tavares. “Effect of
fuzzy partitioning in Crohn’s disease classification: a neuro-fuzzy-based
approach.” Medical & Biological Engineering & Computing 55, no. 1 (2017): 101–115.
[9] Cheriguene, Soraya, Nabiha Azizi, Nilanjan Dey, Amira S. Ashour, Corina
A. Mnerie, Teodora Olariu, and Fuqian Shi. “Classifier Ensemble Selection
Based on mRMR Algorithm and Diversity Measures: An Application of
Medical Data Classification.” In International Workshop Soft Computing Applica-
tions, pp. 375–384. Springer, Cham, 2016.
[10] Dey, Nilanjan, Amira S. Ashour, Sayan Chakraborty, Sourav Samanta,
Dimitra Sifaki-Pistolla, Ahmed S. Ashour, Dac-Nhuong Le, and Gia
Nhu Nguyen. “Healthy and unhealthy rat hippocampus cells classification:
a neural based automated system for Alzheimer disease classification.” Jour-
nal of Advanced Microscopy Research 11, no. 1 (2016): 1–10.
[11] Bhattacherjee, Aindrila, Sourav Roy, Sneha Paul, Payel Roy, Noreen Kausar,
and Nilanjan Dey. “Classification approach for breast cancer detection using
back propagation neural network: a study.” In Biomedical image analysis and
mining techniques for improved health outcomes, p. 210, 2015.
[12] Maji, Prasenjit, Souvik Chatterjee, Sayan Chakraborty, Noreen Kausar,
Sourav Samanta, and Nilanjan Dey. “Effect of Euler number as a feature in
Shilpa G. Kolte
Research Scholar, University of Mumbai
Jagdish W. Bakal
Professor, S S. Jondhale College of Engineering, Dombavali, India
CONTENTS
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.3 Technology for Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.3.1 Hadoop-distributed File System. . . . . . . . . . . . . . . . . . . . . 121
6.3.2 Data Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.3.3 Data Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.3.4 Data Association and Administration . . . . . . . . . . . . . . . . 123
6.4 Proposed Summarization Framework . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.4.1 Preprocessing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.4.2 New Method of Data Clustering . . . . . . . . . . . . . . . . . . . . 124
6.4.3 Modified Fuzzy C-means Clustering . . . . . . . . . . . . . . . . . 125
6.4.4 Data Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.4.5 Process of identifying similar terms . . . . . . . . . . . . . . . . . . 127
6.4.6 Data Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.5 Experiment evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.5.1 Performance Evaluation of Clustering . . . . . . . . . . . . . . . . 129
6.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.1 Introduction
Due to the huge amount of information produced by numerous sources, social networks, and cell phones, we get huge volumes of data known as Big Data. The gigantic growth in the size of data has recently been seen as a key factor of Big Data, which can be characterized by high volume, velocity, and variety; the computational infrastructure required to analyze it is correspondingly large. Big Data is the term for a collection of datasets so large and complex that it becomes hard to process them using conventional data-processing applications. Three Vs portray Big Data: volume (a lot of data), variety (incorporating distinct sorts of data), and velocity (new data being collected constantly) [1]. Data become big when their volume, velocity, or variety surpasses the capacity of systems to store, analyze, and process them. More recently, a broader understanding has emerged by adding two more Vs; hence, Big Data can be explained by five Vs, namely volume, velocity, variety, veracity, and value [2]. Big Data is not just about heaps of information; it is really a new idea that provides a chance to extract knowledge from existing data. Big Data has numerous uses, such as business, innovation, telecommunication, medicine, healthcare and services, bioinformatics (genetics), science, e-commerce, and the Internet (information search, social networks). Big Data can be gathered not only from computers but also from billions of mobile phones, social-networking sites, sensors installed in cars, transportation, and numerous other sources. Large data are simply being created at speeds greater than those at which they can be processed and analyzed. The new difficulty in data mining is that substantial volumes and distinct varieties must be considered. The basic techniques and tools for data preparation and analysis cannot manage such amounts of data, even when powerful personal computer (PC) clusters are used [3, 4]. To study Big Data, numerous data-mining and machine-learning techniques and technologies have been developed and improved. Thus, Big Data yields new data and storage mechanisms, which in turn require new strategies for investigation. While managing Big Data, data clustering becomes a major issue. Clustering methods have been applied to numerous critical problems [5], for instance, to find healthcare trends in patient documents, to find duplicate entries in address records, to discover new classes of stars in astronomical data, to partition data into clusters that are significant and valuable, and to group a large number of documents or web pages. To provide solutions to these applications, many clustering algorithms have been created. Recent clustering algorithms have a few limitations: most require scanning the dataset several times and are therefore not suitable for Big Data clustering. In many applications, extremely large datasets must be examined, which are far too costly to ever be processed by conventional clustering methods. Document summarization acts as an instrument to gain speed, by accepting the
Big Data Summarization Using Modified Fuzzy Clustering Algorithm 119
FIGURE 6.1
Hadoop framework
HDFS. A MapReduce job typically comprises three phases: map, copy, and reduce. The data are split into chunks of 64 MB (by default). In the map phase, a user-defined function acts on each chunk of data, producing intermediate key-value pairs, which are stored on local disks. One map process is invoked to process one chunk of data. In the copy phase, the intermediate key-value pairs are transferred to the location where the reduce procedure will operate on the intermediate data. In the reduce phase, a user-defined reduce function acts on the intermediate key-value pairs, combines them, and produces the output. One reduce process is invoked to process one range of keys.
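The three phases above can be sketched in memory with word count standing in for the job; the chunk contents are illustrative:

```python
from collections import defaultdict

def map_phase(chunk):
    # Map phase: one map task per chunk emits intermediate key-value pairs.
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    # Copy phase: intermediate pairs are grouped by key before reduction.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce phase: combine the values for each key into the final output.
    return {key: sum(values) for key, values in groups.items()}

chunks = ["big data big", "data summarization"]   # stand-ins for 64-MB splits
intermediate = [pair for c in chunks for pair in map_phase(c)]
print(reduce_phase(shuffle(intermediate)))
# -> {'big': 2, 'data': 2, 'summarization': 1}
```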
HBase [26] is an open-source, conveyed, and nonsocial database frame-
work actualized in Java. It keeps running over the layer of HDFS. It can
serve the info and yield for the MapReduce in a very much organized
structure. It is a Not Only Structure Query Language (NoSQL) database
that keeps running over Hadoop as a disseminated and adaptable Big Data
store. HBase enables you to question singular records and additionally
infer total investigative reports on a huge measure of data [27].
FIGURE 6.2
Framework of Big Data summarization
J(U, V) = Σ_{j=1}^{C} Σ_{i=1}^{N} (μ_ij)^m ‖x_i − v_j‖²   (1)
V = {v_1, v_2, …, v_C} are the cluster centers, which are initially calculated as follows. To determine the centroid of a cluster, each pattern is compared with every other pattern, and for each pattern the number of patterns within Euclidean distance α (a user-defined value) is counted. The pattern with the maximum count is then selected as the centroid of the cluster. If
‖R_i − R_j‖ ≤ α,   j = 1, 2, …, p,   (2)
then Di ¼ Di þ 1; where i ¼ 1; 2; . . . ; p:
If Dmax is the maximum value in the row vector D and Dind is the index of
maximum value, then
U = (μ_ij)_{N×C} is the fuzzy partition matrix, in which each member μ_ij indicates the degree of membership between the data vector x_i and cluster j. The values of the matrix U should satisfy the following conditions:
Σ_{j=1}^{C} μ_ij = 1,   ∀ i = 1, …, N   (4)
f(x, v, r) = 1 − f(·),   (5)
where
f(·) = 1 if r ≥ 1; 0 if r = 0; r^γ if 0 < r < 1 (γ is a sensitivity parameter),
and r = ‖x_i − v_j‖, r ≤ 1; if r > 1, then r^γ is set to 1.
To satisfy conditions (2) and (3), each attribute of every pattern is divided by the total sum of that pattern's attributes. The exponent m ∈ [1, ∞) is the weighting exponent which determines the fuzziness of the clusters. Minimization of the cost function J(U, V) is a nonlinear optimization problem, which can be solved with the following iterative algorithm:
v_j = ( Σ_{i=1}^{N} (μ_ij)^m x_i ) / ( Σ_{i=1}^{N} (μ_ij)^m ),   ∀ j = 1, …, C   (6)
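A minimal sketch of the clustering described above: the density-based centroid initialization from the text plus the standard FCM membership and center updates corresponding to conditions (4) and equation (6). The γ-based distance modification of equation (5) is omitted for brevity, and the data are synthetic:

```python
import numpy as np

def init_centroids(X, c, alpha=1.0):
    # Density-based initialization sketched in the text: for each pattern,
    # count the neighbours within Euclidean distance alpha (the user-defined
    # value); the densest patterns become the initial centroids.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    counts = (d <= alpha).sum(axis=1)
    return X[np.argsort(counts)[-c:]]

def fcm(X, c, m=2.0, alpha=1.0, iters=50):
    V = init_centroids(X, c, alpha)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-9
        U = 1.0 / (d ** (2 / (m - 1)))       # standard FCM membership update
        U /= U.sum(axis=1, keepdims=True)    # each row sums to 1, as in (4)
        V = (U.T ** m @ X) / (U.T ** m).sum(axis=1, keepdims=True)  # eq. (6)
    return U, V

# Two synthetic 2-D blobs (illustrative data, not from the chapter).
X = np.vstack([np.random.default_rng(1).normal(0, 0.3, (20, 2)),
               np.random.default_rng(2).normal(3, 0.3, (20, 2))])
U, V = fcm(X, c=2)
print(U.sum(axis=1).round(3))   # every row of U sums to 1
```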
P = N_rs / N_s

R = N_rs / N_r
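Assuming N_rs, N_s, and N_r denote the sentences common to both summaries, the sentences in the system summary, and the sentences in the reference summary respectively (a standard reading for summarization evaluation, not stated explicitly here), precision and recall can be computed as:

```python
def precision_recall(system_sentences, reference_sentences):
    # N_rs: sentences appearing in both the system and reference summaries.
    n_rs = len(set(system_sentences) & set(reference_sentences))
    p = n_rs / len(system_sentences)      # P = N_rs / N_s
    r = n_rs / len(reference_sentences)   # R = N_rs / N_r
    return p, r

# Illustrative sentence identifiers, not from the chapter's experiments.
print(precision_recall(["s1", "s2", "s3"], ["s1", "s3", "s4", "s5"]))
# prints approximately (0.667, 0.5)
```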
TABLE 6.1
Comparative analysis of clustering algorithms
K-means 100 63 52
K-medoids 100 67 64
Fuzzy C-means 100 74 73
Modified fuzzy C-means 100 89 87
TABLE 6.2
Comparison of proposed and existing system
6.6 Conclusion
This chapter presents a novel approach combining fuzzy clustering, a semantic approach, and data compression for Big Data summarization. The K-means algorithm is mostly used in traditional summarization; however, it performs poorly due to its several user-specified input parameters. Our proposed modified FCM can solve this problem to some extent. The outcomes of various simulations using the Iris dataset demonstrate that the proposed modified FCM performs better than K-means, K-medoids, and FCM clustering, which enhances the quality of the data summary. The proposed semantic approach and data compression techniques further improve the quality of data summarization. The precision, recall, and computational time of the proposed framework are superior to those of LSA and BDS.
References
[1] Schmidt, Data is exploding: the 3 V’s of. Business Computing World, 2012.
[2] Y. Zhai, Y.-S. Ong, and IW. Tsang, The Emerging “Big Dimensionality”. In
Proceedings of the 22nd International Conference on World Wide Web Compa-
nion, Computational Intelligence Magazine, IEEE, vol. 9, no. 3, pp. 14–26, 2014.
[3] V. Medvedev, G. Dzemyda, O. Kurasova, and V. Marcinkeviˇcius, “Efficient
data projection for visual analysis of large data sets using neural networks”,
Informatica, vol.22, no. 4, pp. 507–520, 2011.
[4] G. Dzemyda, O. Kurasova, and V. Medvedev, “Dimension reduction and data
visualization using neural networks”, in Maglogiannis, I., Karpouzis, K.,
Wallace, M., Soldatos, J., eds.: Emerging Artificial Intelligence Applications
[19] Y J Ko, H K An, and J Y Seo, “Pseudo-relevance feedback and statistical query
expansion for web snippet generation,” Information Processing Letters, vol.
109, pp. 18-22, 2008.
[20] Q Li, and Y P Chen, “Personalized text snippet extraction using statistical
language models,” Pattern Recognition, vol. 43, pp. 378-86, 2010.
[21] J H Lee, S Park, C M Ahn, and D H Kim, “Automatic Generic Document
Summarization Based on Non-negative Matrix Factorization,” Information
Processing and Management, vol. 45, pp. 20-34, Jan. 2009.
[22] N K Nagwani, “Summarizing Large Text Collection using Topic Modeling
and clustering based on MapReduce Framework”, Journal of Big Data,
Springer, 2014.
[23] Shilpa Kolte and J W Bakal, “Big Data Summarization Using Novel Cluster-
ing Algorithm And Semantic Feature Approach”, International Journal of
Rough Sets and Data Analysis, IGI-Global publication, USA, Vol. 4, Issue 3,
2017.
[24] Bhatt, Chintan, Nilanjan Dey, and Amira S. Ashour, eds. “Internet of things
and big data technologies for next generation healthcare.” (2017): 978-3.
[25] Tamane, Sharvari, Sharvari Tamane, Vijender Kumar Solanki, and
Nilanjan Dey. “Privacy and security policies in big data.” (2017).
[26] K V N Rajesh. Big Data Analytics: Applications and Benefits, 2009.
[27] Dey, Nilanjan, Amira S. Ashour, and Chintan Bhatt. “Internet of things driven
connected healthcare.” In Internet of things and big data technologies for next
generation healthcare, pp. 3-12. Springer, 2017.
[28] Kamal, Md Sarwar, Sazia Parvin, Amira S. Ashour, Fuqian Shi, and
Nilanjan Dey. “De-Bruijn graph with MapReduce framework towards meta-
genomic data classification.” International Journal of Information Technology
9, no. 1 (2017): 59-75.
[29] Dey, Nilanjan, Aboul Ella Hassanien, Chintan Bhatt, Amira Ashour, and
Suresh Chandra Satapathy, eds. Internet of Things and Big Data Analytics
Toward Next-Generation Intelligence. Springer, 2018.
[30] Kamal, Sarwar, Shamim Hasnat Ripon, Nilanjan Dey, Amira S. Ashour, and
V. Santhi. “A Map Reduce approach to diminish imbalance parameters for big
deoxyribonucleic acid dataset.” Computer methods and programs in biome-
dicine 131 (2016): 191-206.
[31] Kamal, S., N. Dey, A. S. Ashour, S. Ripon, V. E. Balas, and M. S. Kaysar. “Fb
Mapping: An automated system for monitoring Facebook data.” Neural Net-
work World 27, no. 1 (2017): 27.
[32] J. Baker, C. Bond, J. Corbett, J. Furman, Lloyd, and V. Yushprakh. Megastore:
providing scalable, highly available storage for interactive services. In Pro-
ceedings of Conference on Innovative Data Systems Research, 2011.
[33] https://www.tutorialspoint.com/zookeeper/zookeeper_tutorial.pdf
[34] J. Boulon, A. Konwinski, R. Qi, A. Rabkin, E. Yang, and M. Yang. Chukwa, A
large-scale monitoring system. In First Workshop on Cloud Computing and
its Applications (CCA ’08), Chicago, 2008.
[35] J. Bezdek. “Pattern Recognition with Fuzzy Objective Function Algorithms”.
Plenum Press, USA, 1981.
[36] J.C. Dunn. “A Fuzzy Relative of the ISODATA Process and its Use in Detect-
ing Compact, Well Separated Clusters”. Journal of Cybernetics, 3(3): 32-57,
1974.
[37] Rani, Jyotsna, Ram Kumar, F. Talukdar, and Nilanjan Dey. “The brain tumor
segmentation using fuzzy c-means technique: a study.” Recent advances in
applied thermal imaging for industrial applications, 40-61, 2017.
[38] Wang, Dan, Zairan Li, Nilanjan Dey, Amira S. Ashour, R. Simon Sherratt, and
Fuqian Shi. “Case-based reasoning for product style construction and fuzzy
analytic hierarchy process evaluation modeling using consumers linguistic
variables.” IEEE Access 5 (2017): 4900-4912.
[39] Ahmed, Sk Saddam, Nilanjan Dey, Amira S. Ashour, Dimitra Sifaki-Pistolla,
Dana Bălas-Timar, Valentina E. Balas, and João Manuel RS Tavares. “Effect of
fuzzy partitioning in Crohn’s disease classification: a neuro-fuzzy-based
approach.” Medical & biological engineering & computing 55, no. 1, 2017.
[40] Ngan, Tran Thi, Tran Manh Tuan, Nguyen Hai Minh, and Nilanjan Dey.
“Decision making based on fuzzy aggregation operators for medical diagno-
sis from dental X-ray images.” Journal of medical systems 40, no. 12, 2016.
[41] Wang, Dan, Ting He, Zairan Li, Luying Cao, Nilanjan Dey, Amira S. Ashour,
Valentina E. Balas et al. “Image feature-based affective retrieval employing
improved parameter and structure identification of adaptive neuro-fuzzy
inference system.” Neural Computing and Applications 29, no. 4, 2018.
[42] Tuan, Tran Manh, Hamido Fujita, Nilanjan Dey, Amira S. Ashour, Vo
Truong Nhu Ngoc, and Dinh-Toi Chu. “Dental diagnosis from X-Ray
images: An expert system based on fuzzy computing.” Biomedical Signal
Processing and Control pp. 64-73, 2018.
[43] Wang, Cunlei, Zairan Li, Nilanjan Dey, Amira Ashour, Simon Fong, R.
Simon Sherratt, Lijun Wu, and Fuqian Shi. “Histogram of oriented gradient
based plantar pressure image feature extraction and classification employing
fuzzy support vector machine.” Journal of Medical Imaging and Health
Informatics, 2017.
[44] Karaa, Wahiba, Ben Abdessalem, and Nilanjan Dey. Mining Multimedia
Documents, CRC Press, 2017.
[45] WordNet Java API.
[46] AI-Daoud, M. B., & Roberts, S. A. (1996). New methods for the initialization of
clusters. Pattern Recognition Letters, 17, 451–455.
[47] Khan, S. S., & Ahmad, A., (2004). Cluster center initialization algorithm for
K-means clustering. Pattern Recognition Letters, 25, 1293–1302.
[48] Saba, Luca, Nilanjan Dey, Amira S. Ashour, Sourav Samanta, Siddhartha, and
Jasjit S. Suri (2016) “Automated stratification of liver disease in ultrasound: an
online accurate feature classification paradigm.” Computer methods and
programs in biomedicine pp. 118-134.
[49] Y. Gong, X. Liu (2001) Generic Text Summarization Using Relevance Measure
and Latent Semantic Analysis. Proceedings of the 24th annual international
ACM SIGIR conference on Research and development in information retrie-
val, New Orleans, Louisiana, United States.
7
Topic-Specific Natural Language Chatbot
as General Advisor for College
CONTENTS
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
7.3 Literature Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.4 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
7.5 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.6 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.6.1 Android App . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.6.1.1 Technologies Used . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
7.6.1.2 Chat Activity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.6.1.3 Sign-Up Activity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.6.1.4 Sign-In Activity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.6.2 Dialogflow Agent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.6.2.1 Entities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.6.2.2 Intents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.6.2.3 Actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.6.2.4 Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.6.3 Web Server. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.6.3.1 Technologies Used . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.6.4 Other Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7.6.4.1 Advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7.6.4.2 Disadvantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7.6.4.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.7 Screenshots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
7.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
7.1 Introduction
A chatbot (also known as an interactive agent or artificial conversational
entity) is a computer program which conducts a conversation via auditory
or textual methods. Such programs are designed to replicate the role of
a human conversational partner. Naturally, chatbot applications extend into daily life, such as help-desk tools, automatic telephone-answering systems, and tools that aid education, business, and e-commerce [1, 2]. Online chatting is an emerging trend gaining popularity,
where a customer can chat with customer care representatives at any point
of time from anywhere [3]. We are looking at designing a system to make
natural conversations between human and machine that is specifically
aimed at providing a chatbot system for members of a particular organiza-
tion. Proposed system uses natural language processing to provide an
appropriate response to the end user for a requested query. The bot is
based on natural language; so, a user does not have to follow any specific
format. However, some queries might start with special symbols to be
used for special intents, for example, if the user wants Google search
results for a topic, then the query will be appended by a designated special
symbol like “#.”
The system will use natural language processing (using Dialogflow
formerly known as Api.ai) to understand the user-provided query. Being
a topic-specific system, user can query about any topic (here college)-
related events using June (chatbot). This eliminates the hassle of visiting
college for trivial queries. After analyzing the user’s query, system will
strive to generate the response that will satisfy the query. The system
administrator can update the existing model to handle more diverse
queries. Figure 7.1 describes various fundamental use cases available to the user or administrator. Viewing information is an important use case and the focus of our system.
By building an automated intelligent system to resolve the user's domain-related queries, the human resources previously used for resolving queries can be utilized for other purposes, which increases productivity. Moreover, the system's availability, response time, and ability to handle a number of clients simultaneously make it well suited to this task. After analyzing multiple methods for building chatbots, it was observed that most architectures depend on natural language understanding rather than pattern matching, which also helps in understanding more diverse queries. The system consists of a Controller, a combination of the knowledge base and the query-processing engine, and a View, a simple interface for system-user interaction that enables the user to learn the system quickly and use it efficiently.
Topic-Specific Natural Language Chatbot as General Advisor for College 137
(Figure 7.1 residue: the system's use cases are SignIn and SignUp, Chat for the Student, and Add/Update/Delete information for the Teacher and Admin.)
FIGURE 7.1
Use case diagram (major use cases explained).
provided in voice form as well as in text form. In Ref. [9], the authors proposed the idea of building a chatbot based on an ontological approach. It maps domain-related knowledge and information into a relational database, which is then used to provide the relevant information from which chatbot responses are generated. This approach is suitable for developing a chatbot for domain-specific requirements. The system architecture proposed in this work is influenced by the ontological approach.
FIGURE 7.2
System architecture (brief system architecture with major components).
FIGURE 7.3
DFD_level_1 diagram (context of the proposed system).
FIGURE 7.4
DFD_level_2 diagram (breakdown of main activities in proposed system).
FIGURE 7.5
DFD_level_3 diagram (detailed analysis of activities in proposed system).
7.5 Algorithm
if (para1 == dummy)
{
    switch (para7)
    {
        case 1:
            if (para2 == metaEntityValue && para3 == sampleEntityValue)
            {
                Construct query with received data & available data.
                Fetch result from database.
                Construct result.
                sendResponse()
            }
            else
            {
                Execute else branch.
            }
        .
        .
        .
    }
}
.
.
if (resultConflict)
{
    Wait for user selection.
    Accepted input: 1, first, second one, later one, last one.
    Parse the response for the selected option and display only that.
    (happens locally)
}
else
{
    Display response.append(data insights, if available);
    Wait for new request.
}
7.6 Implementation
This section describes the implementation of the chatbot. A Dialogflow
agent is created and trained to understand user queries, a web server is
implemented to generate responses, and an Android app is developed for
the user interface.
7.6.2.1 Entities
These are the things or objects that our chatbot agent needs in order to
understand user queries. For example, person name is an entity; using it,
the chatbot can understand that words like Sam and Ben are person
names. Similarly, "color" is also an entity, by which the chatbot can
understand that words like blue and red are colors.
7.6.2.2 Intents
Intents are used to understand the user's intention, that is, what informa-
tion the user wants to know. For example, if users are expected to ask
about a person, a person-details intent can be created.
The Dialogflow agent is trained using entities and intents. For example,
suppose the user asks, "get me the email of Ben." To understand what the
user wants, a personInfo entity can be created that identifies the word
"email" as personInfo, and a personName entity that identifies the word
"Ben" as personName. An intent getPersonDetails can then be created,
which the agent returns when it finds both personInfo and personName in
one query.
The agent is trained with the many ways in which a user can phrase
a query; for example, another way to ask the same question is "email of
Ben." With more examples, the agent identifies intents with greater
confidence.
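The entity-and-intent matching described above can be illustrated with a minimal sketch. This is purely illustrative and hard-coded, not how a trained Dialogflow agent works internally; the entity names and values follow the running example.

```python
# Illustrative matcher mirroring the personInfo/personName example.
# A real Dialogflow agent learns this mapping from training phrases.

ENTITIES = {
    "personInfo": {"email", "phone"},   # attribute words the agent knows
    "personName": {"Sam", "Ben"},       # known person names
}

def match_intent(query):
    """Return ('getPersonDetails', params) when both entities appear, else None."""
    words = query.replace("?", " ").split()
    params = {}
    for entity, values in ENTITIES.items():
        for w in words:
            if w in values:
                params[entity] = w
    if {"personInfo", "personName"} <= params.keys():
        return "getPersonDetails", params
    return None
```

Both "get me the email of Ben" and the terser "email of Ben" resolve to the same intent here, which is why supplying more training phrasings raises the agent's matching confidence.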
FIGURE 7.6
Activity diagram (dynamic flow control in the proposed system).
7.6.2.3 Actions
An action corresponds to the step the application takes when a specific
intent is triggered by a user's input. Actions can have parameters for
extracting information from user requests and appear in the following
format in a JSON response:
{"action": "action_name"}
{"parameter_name": "parameter_value"}
7.6.2.4 Parameters
Parameters are elements generally used to connect words in a user’s
response to entities. In JSON responses to a query, parameters are returned
in the following format:
{"parameter_name": "parameter_value"}
{
  "id": "cf49a7e9-ed58-4344-874c-07ab974bbb2c",
  "timestamp": "2017-12-15T14:52:48.558Z",
  "lang": "en",
  "result": {
    "source": "agent",
    "resolvedQuery": "What is pranav's Number ? ",
    "action": "lookForPhone",
    "actionIncomplete": false,
    "parameters": {
      "contactAttribute": [
        "phone"
      ],
      "teacher": "Pranav"
    },
    "contexts": [],
    "metadata": {
      "intentId": "735166a4-70da-4b2b-8d37-dc7f460cf771",
      "webhookUsed": "false",
      "webhookForSlotFillingUsed": "false",
      "intentName": "teacherContactDetails"
    },
    "fulfillment": {
      "speech": "Pranav 's phone number is",
      "messages": [
        {
          "type": 0,
          "speech": "Pranav 's phone number is"
        },
        {
          "type": 4,
          "payload": {
            "queryType": "search",
            "requiredAttribute": "phone"
          }
        }
      ]
    },
    "score": 0.9599999785423279
  },
  "status": {
    "code": 200,
    "errorType": "success",
    "webhookTimedOut": false
  },
  "sessionId": "12bb615c-8d1d-b176-976d-c49cc1b94183"
}
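A client consuming such a response can pull out the action, parameters, and fulfillment speech with plain JSON parsing. A minimal sketch follows; the field names match the response above, `sample` is a trimmed excerpt of it, and the 0.8 confidence threshold is an illustrative assumption.

```python
import json

# Minimal parser for the Dialogflow v1-style response shown above.
# Only fields present in that sample are accessed.

def parse_agent_response(raw):
    """Extract the pieces a client needs to build the final answer."""
    data = json.loads(raw)
    result = data["result"]
    return {
        "action": result["action"],                # e.g. "lookForPhone"
        "parameters": result["parameters"],        # entity values found
        "speech": result["fulfillment"]["speech"], # template answer text
        "confident": result["score"] > 0.8,        # illustrative threshold
    }

sample = '''{"result": {"action": "lookForPhone",
                        "parameters": {"contactAttribute": ["phone"], "teacher": "Pranav"},
                        "fulfillment": {"speech": "Pranav 's phone number is"},
                        "score": 0.96}}'''
parsed = parse_agent_response(sample)
```

The web server in this architecture would use `parsed["action"]` and `parsed["parameters"]` to query the knowledge base and complete the fulfillment speech.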
FIGURE 7.7
Sequence diagram (flow of control in proposed system).
7.6.4.1 Advantages
This system serves as a point of contact between the administrator and
a general user for most topic-related general queries. It reduces the
frequency of trivial queries reaching the administrator while maximizing
the number of users served with appropriate data, thereby improving the
productivity of both the administrator and the user.
7.6.4.2 Disadvantages
Frequent updating of data is required for the system to be useful; if the
data have not been updated, users may get wrong results.
7.6.4.3 Applications
Organizations where a large number of personnel are required solely for
handling user queries, for example:
Customer support
Taking orders
Product suggestions
7.7 Screenshots
The following screenshots show our actual implementation. Figure 7.8
depicts the interface where the user interacts with the system in natural
language, along with the simple form filled in during the sign-up activity,
and Figure 7.9 shows authentication by email and password in the sign-in
activity.
7.8 Conclusion
We have proposed a system that gives precise and to-the-point answers
to a user's queries rather than making the user search for answers in
a huge pool of data. The system has access to all topic-specific data and
answers some personal-information queries. For natural language proces-
sing, we use Dialogflow, formerly known as
FIGURE 7.8
Screenshot 1.
FIGURE 7.9
Screenshot 2.
by looking into other data sources such as Google search result links,
Wikipedia, and so on.
References
[1] Chai and J. Lin, “The role of natural language conversational interface in online
sales: a case study,” International Journal of Speech Technology, vol. 4, pp.
285–295, Nov. 2001.
[2] J. Chai, V. Horvath, N. Nicolov, N. Stys, K. Zadrozny and W. Melville,
“Natural Language Assistant: a dialogue system for online product
recommendation,” AI Magazine; ProQuest Science Journals Summer vol. 23,
no. 2, pp. 63–75, 2002.
[3] G.M. D'silva, S. Thakare, S. More and J. Kuriakose, "Real World Smart
Chatbot for Customer Care using a Software as a Service (SaaS) architecture,"
International Conference on I-SMAC (IoT in Social, Mobile, Analytics and
Cloud) (I-SMAC 2017).
[4] S. Ghose and J.J. Barua, “Toward the implementation of a topic specific
dialogue based natural language chatbot as an undergraduate advisor”,
CONTENTS
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
8.2 Literature Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
8.3 Materials and Proposed Methodologies . . . . . . . . . . . . . . . . . . . . . . . 158
8.3.1 Linear Regression (LR) . . . . . . . . . . . . . . . . . . . . . . . . . . 158
8.3.2 Regression Tree (RT) . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
8.3.3 Random Forest (RF) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
8.3.4 Dataset Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
8.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
8.5 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
8.1 Introduction
The IT industry is growing rapidly, and machine learning (ML) has been
playing a vital role in this journey. ML has become so popular that almost
all IT industries use it to retrieve hidden information from massive
amounts of data. It is thus necessary to turn these insights into applicable
models for various areas such as agriculture, transportation, banking, and
medicine. Chatterjee et al. (2018); Dey, Pal, and Das (2012); El-Sayed et al.
(2018); Fong et al. (n.d.); Hore et al. (2017); Kamal et al. (2018); Saba et al.
(2016); Singh et al. (2017); Tiwari (n.d.); Tiwari, Dao, and Nguyen (2017);
Tiwari, Kumar, and Kalitin (2017); Virmani et al. (2016) are some studies
on this aspect.
y'_i = f(x_i)

f(x_i) = \sum_{h=1}^{p} w_h x_{ih} + b
where f(.) is selected from a set of linear models F by minimizing the errors
and satisfying the constraint, and b is the intercept (the estimation of
y when x=0).
Implementation of Machine Learning in the Education Sector 159
f(x) = \arg\min_{f \in F} \sum_{i=1}^{n} (y'_i - y_i)^2

subject to

\sum_{h=1}^{p} |w_h| \le t
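This constrained least-squares formulation is the lasso; the penalized (Lagrangian) form of the same problem has a closed-form solution in one dimension, which is enough to illustrate how the constraint shrinks coefficients. The sketch below uses a tiny synthetic dataset (the chapter's student data is not reproduced) rather than the chapter's solver.

```python
# Hedged sketch: one-dimensional lasso via its closed-form soft-thresholding
# solution, showing how the sum-|w| <= t constraint shrinks the coefficient.

def ols_slope(xs, ys):
    """Ordinary least-squares slope for centered data (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def lasso_slope(xs, ys, lam):
    """Soft-threshold the OLS slope: sign(w) * max(|w| - lam/sum(x^2), 0)."""
    w = ols_slope(xs, ys)
    shrink = lam / sum(x * x for x in xs)
    return (1 if w > 0 else -1) * max(abs(w) - shrink, 0.0)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]   # centered synthetic predictor
ys = [-4.1, -1.9, 0.0, 2.1, 3.9]   # roughly y = 2x
```

With a small penalty the slope shrinks toward zero; with a large enough penalty it becomes exactly zero, which is the feature-selection behavior the constraint buys.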
Algorithm 1: Pseudo-code of RT
• Step 1: Fit a regression function of order two to the root node, minimizing
absolute deviation, recorded as ERRORroot;
• Step 2: Initially, consider the root node as the current node, with
ERRORcurrent = ERRORroot;
• Step 3: For every current node and for each input variable, solve
the optimization of the regression problem. The deviation is noted as
ERRORsplit;
samples and variables that are randomly chosen, instead of the whole
dataset as the RT does. These trees are fully grown (low bias, high
variance), and they are relatively uncorrelated owing to the random
generation scheme. RF may alleviate the overfitting problem to some
extent by reducing the variance (averaging over the trees).
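The bootstrap-and-average mechanics can be sketched in miniature with pure Python. One-split "stumps" stand in here for fully grown trees (a real RF grows deep trees and samples random feature subsets at each split), so this is an illustration of the variance-reduction idea, not the chapter's model.

```python
import random

# Toy RF sketch: each "tree" is a one-split regression stump fit on
# a bootstrap sample; predictions are averaged over the ensemble.

def fit_stump(points):
    """Best single-threshold split minimizing squared error on (x, y) pairs."""
    best = None
    for t in sorted({x for x, _ in points}):
        left = [y for x, y in points if x <= t]
        right = [y for x, y in points if x > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, ml, mr)
    if best is None:  # degenerate bootstrap sample: all x identical
        m = sum(y for _, y in points) / len(points)
        return lambda x: m
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def fit_forest(points, n_trees=50, seed=0):
    rng = random.Random(seed)
    stumps = [fit_stump([rng.choice(points) for _ in points])
              for _ in range(n_trees)]
    return lambda x: sum(s(x) for s in stumps) / len(stumps)

data = [(1, 1.0), (2, 1.1), (3, 0.9), (7, 5.0), (8, 5.2), (9, 4.8)]
predict = fit_forest(data)
```

Each stump alone is a high-variance predictor of this two-cluster data; averaging the bootstrap ensemble smooths the prediction, which is the same reason the chapter's 600-tree RF outperforms a single RT.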
8.4 Results
In this section, we conduct several experiments, show the results, and give
the corresponding explanations. First, we plot the mathematics grades on
the x-axis and the Spanish grades on the y-axis.
TABLE 8.1
Table of features
Feature Type Range
FIGURE 8.1
Scatterplot between mathematics grades and Spanish grades
FIGURE 8.2
Scatterplot between average grades and daily consumption of alcohol
FIGURE 8.3
Error scatterplot
FIGURE 8.4
Scatterplot between the actual obtained grades and predicted grades
In the above charts, the horizontal axis depicts the predicted scores or
grades, whereas the vertical axis depicts the real grades or scores. If
a model were precise in predicting actual grades, the predicted grades
would equal the real grades, and the scatterplot would align along the
45-degree (blue) line. As the normalized mean squared error and the error
plots demonstrate, neither model does better than average at forecasting
student average grades. Since the RT and LR models performed unsatis-
factorily, RF was applied next. The normalized mean squared error of the
implemented RF is 0.25, which is lower than that of RT and LR. For
validation purposes, the error plot of RF was plotted and compared with
the error plots of RT and LR, as shown in Figure 8.5.
Although the RF appears to systematically underpredict the grades of
poor-scoring students and overpredict those of high-scoring students,
overall RF is a much better predictor of average grades than either RT
or LR. Moreover, 10 × 5-fold cross-validation confirmed that the RF
FIGURE 8.5
Scatterplot between the actual grades and predicted grades
FIGURE 8.6
Scatterplot
with 600 trees is the best indicator of an average student’s scores or grades
compared to RT and LR.
If we remove the weekday or weekend alcohol-consumption variable, the
mean square error of predictions increases by about 13–15%, as can be
seen in Figure 8.6. Thus, these features are also important predictors of
students' average grades.
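The ablation test behind that figure can be sketched as follows: refit the model without one feature and compare mean squared errors. The data and the simple group-mean predictor below are synthetic stand-ins for the chapter's student data and RF model.

```python
# Hedged sketch of feature ablation: compare MSE with and without one feature.
# Rows are synthetic (study_hours, weekend_alcohol) -> grade examples.

def mse(pairs, predict):
    return sum((y - predict(x)) ** 2 for x, y in pairs) / len(pairs)

def group_mean_model(pairs, key):
    """Predict the mean grade of all rows sharing the same feature values."""
    groups = {}
    for x, y in pairs:
        groups.setdefault(key(x), []).append(y)
    means = {k: sum(v) / len(v) for k, v in groups.items()}
    overall = sum(y for _, y in pairs) / len(pairs)
    return lambda x: means.get(key(x), overall)

rows = [((4, 1), 15.0), ((4, 3), 11.0), ((2, 1), 12.0), ((2, 3), 8.0)]

full = group_mean_model(rows, key=lambda x: x)        # both features
ablated = group_mean_model(rows, key=lambda x: x[0])  # alcohol removed

increase = mse(rows, ablated) - mse(rows, full)
```

Because the synthetic grades depend on both features, dropping the alcohol column raises the error, mirroring the 13–15% MSE increase the chapter reports for its data.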
References
[1] Balas, V. E., & Shi, F. (2017). Indian sign language recognition using optimized
neural networks. In Information technology and intelligent transportation systems
(pp. 553–563). Springer.
[2] Jariwala, D. R., Desai, H., & Zaveri, S. H. (2016). Mining educational data to
analyze the performance of infrastructure facilities of Gujarat State. Interna-
tional Journal of Engineering and Computer Science, 5(7).
[3] Kamal, S., Dey, N., Nimmy, S. F., Ripon, S. H., Ali, N. Y., Ashour, A. S., … Shi, F.
(2018). Evolutionary framework for coding area selection from cancer data.
Neural Computing and Applications, 29(4), 1015–1037.
[4] Kanyongo, G. Y., Certo, J., & Launcelot, B. I. (2006). Using regression analysis
to establish the relationship between home environment and reading achieve-
ment: A case of Zimbabwe. International Education Journal, 7(4), 632–641.
Sachin P. Godse
Ph.D. Scholar, Department of Computer Engineering, Sinhgad Institutes, Smt. Kashibai
Navale College of Engineering, Savitribai Phule Pune University, Pune, India
CONTENTS
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
9.2 Computational Intelligence in VANET. . . . . . . . . . . . . . . . . . . . . . . . . 170
9.3 Existing Schemes in VANET . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
9.4 Proposed Message-Forwarding Scheme in VANET . . . . . . . . . . . . . . 174
9.4.1 Relevance/priority-based message-forwarding scheme . . . 175
9.5 Proposed Intelligent Navigation Scheme . . . . . . . . . . . . . . . . . . . . . . . 177
9.6 Navigation with Respective Conditions A and B . . . . . . . . . . . . . . . . 179
9.7 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
9.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
9.1 Introduction
Vehicular adhoc network (VANET) is the adhoc network that provides intelli-
gent transportation services. Transportation plays a crucial role in smart city
development. There are several aspects of VANET that have not been studied,
and hence it is an area that provides much scope for research. In VANET,
nodes communicate with each other to share either safety or non-safety
information. Safety information may contain various kinds of messages like
traffic jam conditions, accidents, bad weather, road conditions, collisions
caused by sudden braking of the vehicle ahead, and so on. Non-safety
information may include messages like driver details, documents, songs,
movies, and so on. In the event of a fire in the city, a call is placed to the
fire station, either automatically by an alarm system or manually by
someone calling for a fire brigade. As the fire brigade has to reach the
accident location as soon as possible, this is treated as an emergency event.
The emergency-event message is forwarded to all vehicles in the range of
the fire brigade, so that they can give way, creating an easy and fast
passage for the fire brigade. If the road to the site of the fire is blocked by
traffic, the intermediate vehicles can suggest an alternate path. The driver
can also request an alternate path from the server and, on being provided
with the options, select the best path and proceed to the accident location.
It is very important for a fire vehicle to reach the location of the fire on
time to avoid fatalities and property losses. Figure 9.1 shows a VANET
scenario and the types of communication associated with it [1][2].
FIGURE 9.1
Vehicular network scenarios.
Priority-Based Message-Forwarding Scheme 171
accident, the Chord algorithm is used to control the speed of the vehicle.
It measures the speed of the vehicle and alerts the driver when it crosses
the speed limit, to avoid accidents.
In [8], the author recommends a new approach for path guidance across
the Internet for the purpose of collecting useful information. This approach
also provides proper guidance to new users for reaching the target. It can
obtain real-time data of traffic conditions and congestion caused by unfore-
seen incidents besides providing proper guidance to the user. The scheme
also ensures that the data provided is valid to the context without affecting
the credentials of users. A third party takes care of the user’s privacy.
Processing delay is a parameter against which the scheme is assessed.
In [9], the authors have suggested ways of improving the security and
efficiency of routing. Of the numerous factors considered, beacon frame
size and reception counts are the primary ones; increasing the beacon size
improves all the parameters. The authors also propose a mathematical
model along with a proof of concept (POC).
In [10], the authors propose Bluetooth Low Energy (BLE) as an option for
VANET applications. They address the issues of employing Wi-Fi in
vehicle-to-vehicle (V2V) communication, considering that 90% of smart-
phones are Bluetooth enabled. The authors also elaborate on some VANET
applications with BLE and discuss how these applications can be trans-
formed into new ones. The results indicate that BLE communication has
minimal latency and works over an appreciable distance between two
moving objects. However, establishing trust between two vehicles is
crucial in VANETs.
In [11], the author provides a trust-based and categorized messaging
scheme. A hybrid approach was used that leverages role-based and experi-
ence-based trust. Along with the hybrid approach, data-mining concepts
are also used for better decision making. In addition, the authors also
provide an extensive literature review along with gap analysis.
In [12], the author elaborates on the parameters to be considered in V2V
communication. The parameters are studied considering security aspects. The
paper discusses the need of a strong algorithm for accepting or rejecting
messages. It also explains a model detecting confidence on security infra-
structure (CoS) of VANET. The paper also provides a number of packets from
both compromised and non-compromised nodes for analysis. Malicious
nodes can be identified by using certificate revocation mechanisms.
From the literature survey discussed above, we can conclude that some
schemes lack end-to-end connectivity, and packet retransmission is
required in many situations where the network density is low. In high-
density networks, rapid dissemination causes a large number of collisions.
Some schemes do not allow unregistered vehicles to take advantage of the
proposed mechanism, or give them lower priority. Transmitting a packet
may also require the time-consuming application of vehicle traffic-flow
theory. Many schemes consider low-density, high-density, and
FIGURE 9.2
Classification of messages in VANET.
FIGURE 9.3
Steps for assigning priority to message.
follows: 0, 1, 2, 3, 4, and so on, with 0 being the highest priority. Step 3
applies the priority-handling algorithm, as explained in the following
section.
TABLE 9.1
Message Priority

Sr. No.   Relevance/Priority Parameter    Message Type
1.        Event                           Sense
2.        Destination Vehicle Location    x, y
3.        Distance
4.        Vehicle Speed
Algorithm [13]
1. Generate event (message) by source node.
a) Sense event.
b) Assign priority to event.
c) If (destination node is in the range of source)
i. Source node broadcast message.
d) Else (message forwarded to destination through RSU)
i. RSU verifies message.
ii. RSU stores packet for time t.
iii. RSU applies the scheduling algorithm.
2. Forward message by receiver node
i. Check priority of received messages.
ii. Forward messages in sequence of their priority values, 0, 1, 2, 3, …, 8.
iii. Sort messages in ascending order of distance of nodes (i.e., smaller
distance first).
iv. Forward messages one by one.
3. Schedule vehicle and RSU
a) If (message waiting time < threshold time "t"), where message
waiting time = current time – sense time:
Forward message
b) Else
Remove packet from node database.
4. Forwarding—Vehicle
a) Forward message by the receiver node algorithm.
5. Validate message
a) Source node validation by the receiver node.
b) Message validation.
c) If (valid node and valid message)
Forward message.
d) Else
Drop message.
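The forwarding step above (order by priority value, break ties by sender distance, drop stale messages first) maps naturally onto a priority queue. A hedged sketch, with message fields and the freshness rule taken from the algorithm and the sample data invented for illustration:

```python
import heapq

# Sketch of the forwarding step: messages ordered by priority value
# (0 = highest) and, within a priority, by sender distance; messages
# waiting longer than threshold t are dropped before forwarding.

def forward_order(messages, current_time, threshold):
    """messages: list of dicts with priority, distance, sense_time, payload."""
    fresh = [m for m in messages if current_time - m["sense_time"] < threshold]
    heap = [(m["priority"], m["distance"], m["payload"]) for m in fresh]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

msgs = [
    {"priority": 3, "distance": 50, "sense_time": 0, "payload": "weather"},
    {"priority": 0, "distance": 80, "sense_time": 9, "payload": "accident far"},
    {"priority": 0, "distance": 20, "sense_time": 9, "payload": "accident near"},
    {"priority": 5, "distance": 10, "sense_time": 1, "payload": "song"},
]
order = forward_order(msgs, current_time=10, threshold=8)
```

Here the stale weather and song messages are dropped by the freshness check, and of the two tied priority-0 accident messages the nearer sender is served first, as step 2(iii) requires.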
FIGURE 9.4
System architecture for navigation.
Algorithms
The algorithm used has three different phases—the initializing phase, the
authentication phase, and the navigation phase. Table 9.2 describes the
different notations used in the algorithm.
A) Initializing Phase: In this phase, the vehicle and the RSU register with
the TA. After registration, the TA shares some identity parameters with
the vehicle and the RSU, which are included in the certificate.
TABLE 9.2
Notation Used in Algorithms
Notation Used Meaning
1) TA → (VID,VKpri)
2) TA assigns a license plate number to each vehicle.
TA → VLPN
3) TA generates and assigns certificates to each vehicle
CVi = <VID, VLPN, VKpub>
D) Navigation Schemes
1) {V1,V2,…….,Vn} → ReqN (RID,VkPosition,VkDest);
2) EnMsg = (Encrypt(RSUverification(Vk(msg)),VKpri));
3) RSU → send (EnMsg);
4) TA decrypts the message DeMsg(msg,VKpub);
5) Verify data using VKpub;
6) Check (VkDest,VkPosition,VID);
7) Apply rerouting for navigation.
1. If (A)
2. Stop (end the navigation process)
3. Else
4. If (B)
5. Check (alternate lane)
6. Else
7. Check (alternate path)
8. Generate event/forward emergency message received
9. Broadcast event
10. If (request for path)
11. TA forwards message to RSU
12. RSU checks path
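The rerouting branch in steps 1–7 above can be written out as a small function. What conditions A and B denote is defined in Section 9.6 and is not restated here, so they are passed in as opaque booleans, and the lane/path checks are hypothetical callbacks:

```python
# Sketch of the rerouting branch of the navigation scheme. Conditions A and B
# are taken as given (see Section 9.6); check_lane/check_path are stand-ins
# for the RSU's alternate-lane and alternate-path lookups.

def navigate_step(cond_a, cond_b, check_lane, check_path):
    """Return the action the pseudocode would take at this step."""
    if cond_a:
        return "stop"        # end the navigation process
    if cond_b:
        return check_lane()  # try an alternate lane first
    return check_path()      # otherwise look for an alternate path
```

This keeps the control flow of the pseudocode intact: navigation terminates on A, prefers a lane change under B, and falls back to full path recomputation otherwise.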
FIGURE 9.5
Times required to reach destination with and without priority.
TABLE 9.3
Time required for vehicles to reach the end point with and without
message relevance/priority

Vehicle   Time to reach destination        Time to reach destination
          without relevance/priority       with relevance/priority
1         313                              260
2         213                              155
3         435                              258
4         454                              280
5         125                               89
6         543                              387
7          56                               56
8         345                              295
9         165                              102
10        813                              546
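The overall benefit in Table 9.3 can be quantified with simple arithmetic; the two lists below transcribe the table's columns, and summing them shows roughly a 30% reduction in total travel time under the priority-based scheme.

```python
# Totals and relative travel-time reduction computed from Table 9.3.

without_priority = [313, 213, 435, 454, 125, 543, 56, 345, 165, 813]
with_priority    = [260, 155, 258, 280,  89, 387, 56, 295, 102, 546]

total_without = sum(without_priority)        # total time without the scheme
total_with = sum(with_priority)              # total time with the scheme
reduction = 1 - total_with / total_without   # fractional improvement
```

Note vehicle 7 is unchanged (56 in both columns), consistent with a vehicle that received no higher-priority traffic to defer to.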
9.8 Conclusion
The proposed scheme of priority-based message forwarding reduces the
time required for message forwarding by prioritizing the emergency
messages for broadcasting. It also considers the nearest receiver first for
sending messages. As priority assigning takes place at both levels—the
vehicle and the RSU—it is more efficient than other schemes. Efficient
and effective transportation can be achieved using an event-based path
finding navigation system. As the path is changed at run-time, emergency
situations are handled more effectively. Because of the priority-based
message-forwarding scheme, vehicles save the time otherwise spent
processing lower-priority messages, and they reach their destinations in
less time than without the scheme. Here, intelligent navigation was
achieved using vehicle position data, emergency-event data, and traffic
scenarios gathered by RSUs. As a time threshold is used to decide the
freshness of a message, older messages are discarded and the time spent
processing irrelevant messages is saved. Selecting the shortest path among
the available alternate paths is the future scope of this scheme, for which
we propose to use a distance factor or a traffic-situation analyzer along the
candidate path.
References
[1] Sabih ur Rehman, et al. “Vehicular Ad-Hoc Networks (VANETs)-An Over-
view and Challenges”. Journal of Wireless Networking and Communications
3, 3 (2013): 29–38.
[2] Sachin Godse, Parikshit Mahalle, “Time-Efficient and Attack-Resistant
Authentication Schemes in VANET” Proceedings of 2nd International Con-
ference, ICICC 2017 Springer, 579-589.
[3] Wahabou Abdou, Benoît Darties, Nader Mbarek, "Priority Levels Based
Multi-hop Broadcasting Method for Vehicular Ad hoc Networks". Annals of
Telecommunications 70, 7–8 (August 2015): 359–368.
[4] Chakkaphong Suthaputchakun et al., "Priority Based Inter-Vehicle Commu-
nication in Vehicular Ad-Hoc Networks using IEEE 802.11e", IEEE 2007,
2595–2599.
[5] Hao Wu, Richard Fujimoto, et al., "MDDV: A Mobility-Centric Data
Dissemination Algorithm for Vehicular Networks", VANET '04, October 1,
2004, 47–56.
[6] Chinmoy Ghorai et al., "A Novel Priority Based Exigent Data Diffusion
Approach for Urban VANets", ICDCN '17, January 4–7, 2017.
[7] A. Betsy Felicia et al., "Accident Avoidance and Privacy Preserving Naviga-
tion System in Vehicular Network", International Journal of Engineering
Science and Computing, March 2016.
[8] Chim, Tat Wing, et al. “VSPN: VANET-based secure and privacy-preserving
navigation” IEEE Transactions on Computers 63, 2 (2014): 510-524.
[9] Carpenter, Scott E. “Balancing Safety and Routing Efficiency with VANET
Beaconing Messages.” National Highway Traffic Safety Administration,
Fatality Analysis Reporting System (FARS) Encyclopedia, October 15, 2013.
[10] Frank, Raphaël, et al. “Bluetooth low energy: An alternative technology for
VANET applications”. Wireless on-Demand Network Systems and Services
(WONS), 2014 11th Annual Conference On. IEEE, 2014
[11] Monir, Merrihan, Ayman Abdel-Hamid, and Mohammed Abd El Aziz.
“A Categorized Trust-Based Message Reporting Scheme for VANETs”.
Advances in Security of Information and Communication Networks. Springer
Berlin; Heidelberg, 2013. 65-83.
[12] Rao, Ashwin, et al. “Secure V2V communication with certificate revocations”.
2007 Mobile Networking for Vehicular Environments. IEEE, 2007.
[13] Sachin P. Godse, Parikshit N. Mahalle, et al., "Rising Issues in VANET
Communication and Security: A State of Art Survey". (IJACSA) International
Journal of Advanced Computer Science and Applications, 8, 9 (2017), 245–252.
Section IV
Machine Learning in
Security
10
A Comparative Analysis and Discussion of
Email Spam Classification Methods Using
Machine Learning Techniques
Aakash Atul Alurkar, Sourabh Bharat Ranade, Shreeya Vijay Joshi, and
Siddhesh Sanjay Ranade
Department of Computer Engineering, Smt. Kashibai Navale College of Engineering,
Pune, India
Gitanjali R. Shinde
Centre for Communication, Media and Information Technologies, Aalborg University,
Copenhagen, Denmark
CONTENTS
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
10.2 Approaches of Email Spam Classification . . . . . . . . . . . . . . . . . . . . 188
10.2.1 Manual Sorting Method . . . . . . . . . . . . . . . . . . . . . . . . 188
10.2.2 Simple Keyword Classification. . . . . . . . . . . . . . . . . . . 188
10.2.3 Email Aggregation Using Data Science Approaches . . 189
10.3 Importance of Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
10.4 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
10.5 Literature Survey. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
10.6 Gap Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
10.7 Proposed System Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
10.7.1 Retrieving Emails. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
10.7.2 Sending Data for Preprocessing . . . . . . . . . . . . . . . . . . 195
10.7.3 Preprocessing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
10.7.4 Train/Test Split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
10.7.5 API Methods Used for Processing Email Dataset . . . . . 196
10.7.6 Creation of Model and Training. . . . . . . . . . . . . . . . . . 196
10.7.7 Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
10.8 Retrieval of Email Data for Spam Classification . . . . . . . . . . . . . . . 196
10.1 Introduction
Email communication is by far the easiest, most ubiquitous, and most
convenient medium humankind has used for communicating with people.
It is a useful, fast, and efficient way of sending various types of data
through web protocols anywhere in the world. Data sent by email includes
text messages; files such as .jpeg and .png images; audio and video files;
PDF and Word documents; and many other attachments. This method of
transmitting information, in use since the 1990s, makes the geography of
the sender and receiver irrelevant. It overcomes the shortcomings of
traditional communication methods such as postal mail by delivering
messages instantaneously, and it is often a better alternative to fax
because no paper is involved. The swift learning curve also helps. Owing
to its versatility and ease of use, email has pervaded the academic and
corporate worlds over the last thirty years. An estimated 205 billion emails
are sent every day [15], which shows the extent to which email has come
to dominate our lives. It can be argued that email, through instant
communication, played a part in the rapid development of the global
economy, especially in the early 2000s, by assisting communication among
people in different countries. This also means that, like any technology
that permeates our lives this deeply, it is liable to be misused, with both
malicious and non-malicious intentions and in innumerable ways. As most
people are aware, it is not possible for 200 billion emails to be sent
manually; most of them are written by scripts and/or automated bots and
are used for advertising. Such spam emails are used for ransomware,
advertising, fake purchase receipts, phishing, and traffic inflation. The
possibilities for misuse are vast, and every year individuals and companies
lose millions of dollars to such attacks. Apart from the damage due to
phishing and malware, spam emails cause a loss of around 20 million
dollars every year, partly because employees waste company time reading
and deleting them and partly because information is stolen from emails [1].
A Comparative Analysis and Discussion of Email 187
processing the dataset through the same network, the results get increas-
ingly closer to the ideal classification required.
We thus consider different algorithmic approaches for this problem.
Python libraries have various predefined machine learning algorithms that
can be used to classify large datasets, simply by importing the respective
libraries such as numpy, TensorFlow, and pandas [3]. The next few sections of
this chapter discuss the importance of machine learning in spam classification
along with the various approaches used and then delineate the working of the
algorithms proposed, comparing their pros and cons over various datasets. A
basic workflow for a sample proposed model has also been provided. Readers
can thus decide on the suitable algorithm for themselves, considering all the
parameters of their unique requirement or situation.
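The library-based approach described above can be illustrated with a minimal sketch using scikit-learn, one such Python library; the four example messages and their labels below are invented for illustration, not drawn from a real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented corpus: 1 = spam, 0 = ham
emails = [
    "win a free prize now claim your money",
    "cheap offer buy now limited discount",
    "meeting rescheduled to monday please confirm",
    "project report attached for your review",
]
labels = [1, 1, 0, 0]

# Bag-of-words features: each email becomes a vector of token counts
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(emails)

model = MultinomialNB()
model.fit(features, labels)

# Classify a new message by transforming it with the same vocabulary
test = vectorizer.transform(["claim your free prize money now"])
print(model.predict(test))  # expected to lean toward the spam class
```

The same fit/predict pattern applies unchanged when the toy corpus is replaced by a large labeled dataset loaded with pandas.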
10.4 Motivation
To reiterate the need for an effective classifier, it is useful to imagine a
world where spam classifiers do not exist. Around two hundred billion
messages would still be sent daily, and sorting out the important ones
would require users to manually open and delete each unwanted message
themselves. Just imagining such a situation can be stressful. This is because a
spam classifier has quietly become one of those features that we have
collectively begun relying on without even noticing its presence. Another
exercise would be to open any Inbox and go to the spam section. The
hundreds of spam emails, advertisements, offers, and so on in that section
make one realize how much clutter has been hidden from the user’s eyes.
Also consider, for example, a situation where a layperson receives an
email asking them to update their bank account details via a provided
link, which does not appear suspicious to the user because it appears to
have been sent by the user’s bank and looks authentic. The user then visits
the link and updates all details, unaware that she/he has been the most
recent victim of a phishing attack. Upon realizing this mistake, the user
contacts the required authorities (such as the cyber security cell of the
state) who may or may not succeed in remedying the situation and
identifying the perpetrators. Now imagine if, in the same situation, the
user’s email client employed a more accurate spam filter. The email from
the “bank” would be identified as inauthentic, based on features such as
the authentication of the email id, the number of HTML tags used, spam
keywords, and so on. Basically, it would go straight to spam and get
deleted a few days later. The user would never find out she/he received
this email, and hence there would be no damage.
The above use case illustrates just one of the ways in which spam
filtering improves our lives, by simply hiding otherwise dangerous emails
from our eyes. Obviously not everyone would fall for scams like these, but
millions of dollars of damage are incurred each year due to other less-
obvious schemes. A filter helps in these situations by providing a basic
level of security. One that uses machine learning will also be able to
identify such messages with greater accuracy.
Sproull and Kiesler [5] produced early works on the social and organizational
aspects of email. Mackay [6] noted that people used email in highly diverse
ways, and Whittaker and
Sidner further studied this aspect. They found that along with basic
communication, email was “overloaded” in the sense of being used for
many tasks: communication, reminders, contact management, task
management, and information storage. Mackay also noted that people
handling their email could be divided into one of two categories:
prioritizers or archivers. [7]
Sahami et al. reported good results for filters that used naive Bayes (NB)
classifiers [8]. Subsequently, many experiments showed similar outcomes,
confirming Sahami’s conclusion. Zhang et al. found similar results for
spam classification using machine learning algorithms. A significant
change in approach came when the importance of both the header and the
body for classifying mails was understood [9]. However, harder problems
started cropping up in spam classification with increased attacks on the
filtering algorithms themselves. One of four attacks identified by Wittel
turned out to be a tokenization attack, which works by manipulating text
characters, spaces, and HTML tags. Such attacks work against the
statistical nature of the classifiers
[10]. This was eventually overcome by Boykin by leveraging social networks
[11]. Gray and Haahr proposed a collaborative spam filtering method [12].
Goodman et al. summarized advances other than machine learning in spam
filtering [13]. Although not directly used for spam classification, Raje et al.
developed an algorithm for extraction of key phrases from documents using
statistical and linguistic analyses. Their method focused on three areas, out
of which two are relevant for spam classification on the body of the email.
In the first method, the authors maintained a list of important words in the
English language itself. This file was maintained as a list of words and their
respective multipliers ranging from 1 to 10. The multiplier is dependent on
how important the word is, and hence can be considered as the priority of
the word in the language. For example, words such as “firstly,” “secondly,”
and “therefore” are used to introduce new points or to conclude. Phrases
containing such words must be given the highest consideration and so
receive a multiplier from 8 to 10, whereas other important words such as
“state,” “consider,” or “analysis” are important but less so. The authors
also proposed
an idea of a dynamic list [14].
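The weighted-word scheme described above can be sketched as follows; the words and multiplier values here are invented stand-ins illustrating the idea, not the actual list maintained by Raje et al.

```python
# Hypothetical priority list: word -> multiplier from 1 to 10
# (invented values illustrating the scheme, not the authors' data)
PRIORITY = {
    "firstly": 9, "secondly": 9, "therefore": 10,
    "state": 5, "consider": 5, "analysis": 6,
}

def phrase_score(phrase: str) -> int:
    """Sum the multipliers of known priority words in a phrase;
    unknown words contribute a baseline weight of 1."""
    return sum(PRIORITY.get(word, 1) for word in phrase.lower().split())

print(phrase_score("therefore we consider the analysis"))  # 10 + 1 + 5 + 1 + 6 = 23
```

A dynamic list, as the authors proposed, would update these multipliers as new documents are processed rather than keeping them fixed.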
Email spam classification is the most researched issue among all the
problems regarding emails. In a survey by Awad and ELseuofi, most
studies on email classification were found to classify emails into either
spam or ham. Among the 98 articles, 49 are related to “spam email
classification.” Binary classifiers that classify emails into spam or ham
were developed in the studies. The second highest number of articles is
on the “multi-folder categorization of emails” (20 published articles), in
which researchers developed a multiclass classifier that categorizes emails
FIGURE 10.1
Proposed system architecture.
A model may fail to give an ideal output, despite being optimized thoroughly, if its input data are inconsistent. The email data
used for training and testing must have the same attributes such as time-
stamp, subject, title, body, whether the email is forwarded or not, and so
on. This architecture focuses on collecting email data for training and
testing in a common format, so that the scores of the various machine
learning techniques applied to a common dataset can be compared
meaningfully. Like any machine learning model, the system passes through
several processing steps, which are generally dependent on each other and
aim to optimize the model.
10.7.3 Preprocessing
Preprocessing makes the data suitable for the model, and its cost depends
on the type of data being processed. It is more time consuming for text
data, as sentence semantics, punctuation, and annotations must be
considered.
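A minimal text-preprocessing pass of the kind described can be sketched with only Python's standard library; real pipelines would typically also handle stop words, stemming, and sentence semantics.

```python
import string

def preprocess(text: str) -> list[str]:
    """Lowercase the text, strip punctuation, and split into tokens."""
    # Replace every punctuation character with a space, then split on whitespace
    table = str.maketrans({p: " " for p in string.punctuation})
    return text.lower().translate(table).split()

print(preprocess("FREE!!! Click here, claim your $1000 prize."))
```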
10.7.7 Verification
After training a particular model, it should be verified on the testing data;
its score should then be compared with the results of other machine
learning algorithms, and the parameters adjusted accordingly for the
optimal score. The testing and training splits can also be adjusted for
more favorable results.
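The verification step can be sketched with scikit-learn's train/test split and its uniform `score` interface, which makes comparing models straightforward; the dataset below is synthetic, standing in for featurized email data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a featurized email dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Train each candidate model and compare held-out accuracy
scores = {}
for name, model in [("naive_bayes", GaussianNB()),
                    ("knn", KNeighborsClassifier())]:
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

print(scores)
```

Keeping a fixed held-out test set, as here, is what makes the scores of different algorithms comparable.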
With Gmail having 1 billion active users and Hotmail having 400 million, these
providers contribute a huge chunk to email management. With the
services of these email providers used for enterprise solutions, these
companies have released their own email APIs to integrate into devices
for email management. The APIs allow the authenticated user to read
and view emails, along with modifying the labels and accessing these
emails from another email client. For accounts with custom domains
apart from the services provided by Microsoft Outlook, the JavaMail
API by Oracle provides a platform for third-party email management.
This API can be called from any Android or iOS device, and the
programmer can implement any custom functionality in the email client
without compromising on the security of the emails. The JavaMail API
provides a protocol- and platform-independent framework to send and
receive emails. The different protocols supported and used in JavaMail
API are SMTP, POP, and IMAP.
The JavaMail API can send and receive email without being tied to a single
specific protocol, because multiple protocols are implemented. The API can
also fetch emails from any email service provider, be it Google, Yahoo, or
Hotmail, which makes the retrieval approach versatile. The JavaMail API
uses service provider interfaces (SPIs), which provide intermediary services
to Java applications for dealing with the different protocols [18].
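JavaMail's protocol- and platform-independent message model has a rough analogue in Python's standard email package; the sketch below builds a MIME message and re-parses it locally, as a filter reading stored mail would. Actually sending over SMTP is omitted since it needs real server credentials, and all addresses and contents here are placeholders.

```python
from email.message import EmailMessage
from email.parser import Parser

# Build a message; headers and body are illustrative placeholders
msg = EmailMessage()
msg["From"] = "alice@example.com"
msg["To"] = "bob@example.com"
msg["Subject"] = "Quarterly report"
msg.set_content("Please find the summary attached.")

# Round-trip: serialize to wire format and parse it back,
# the way a classifier reading retrieved mail would
raw = msg.as_string()
parsed = Parser().parsestr(raw)
print(parsed["Subject"], "|", parsed.get_payload().strip())
```

The parsed headers and body are exactly the attributes (subject, sender, body, and so on) that the architecture above feeds into the classifier.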
FIGURE 10.2
Comparative analysis of machine learning algorithms: accuracy scores for KNN, multinomial naive Bayes, random forest, SVC with sigmoid kernel, SVC with polynomial kernel, and ANN.
Random forest is an ensemble extension of the decision tree model, in which
the programmer assigns the features [36]. Random forest scales well because
it avoids overfitting, a common problem in machine learning in which the
training phase learns the noise in the training dataset, negatively affecting
performance on unseen data. Random forest is highly useful for feature classification
and extraction.
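A brief scikit-learn sketch of random forest classification on a synthetic dataset follows; the trained ensemble also exposes per-feature importances, which is what makes it useful for the feature classification and extraction noted above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class dataset standing in for featurized emails
X, y = make_classification(n_samples=200, n_features=8,
                           n_informative=4, random_state=1)

# An ensemble of decision trees, each trained on a bootstrap sample;
# averaging their votes reduces overfitting relative to a single tree
forest = RandomForestClassifier(n_estimators=50, random_state=1)
forest.fit(X, y)

print(forest.score(X, y))                    # training accuracy
print(forest.feature_importances_.round(2))  # per-feature contribution
```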
TABLE 10.1
Comparison of algorithms by type, structure, advantages, and disadvantages.
FIGURE 10.3
An example of serving machine learning models using TensorFlow.
power, bandwidth, battery, and storage. With the rise of mobile users,
researchers started looking for a mobile-first machine learning solution. In
late 2017, machine learning models could be served on devices that use
low computing power, can work offline without the data embedded, and
can be small (up to 2MB) without the extreme power requirements of a
GPU. As Figure 10.3 indicates, TensorFlow Lite, an open-source solution
for on-device machine learning, uses many techniques to achieve low
latency, such as optimizing kernels for mobile apps, pre-fused activations,
and quantized kernels that allow smaller and faster (fixed-point math)
models. [37] It specifically provides interfaces for hardware acceleration
on devices such as Android, iOS, and Raspberry Pi.
Machine learning for mobile devices primarily uses a file format called
FlatBuffers, which represents hierarchical data while supporting complex
data structures (scalars, arrays, extensible arrays, balanced trees, heaps,
and so on) with complete backward compatibility, such that the data can
be accessed directly without being parsed or unpacked. This helps machine
learning operations work at an optimal level on a low-powered device
without consuming excessive resources.
Figure 10.3 shows the trained TensorFlow model, which is frozen by
converting its dynamic variables into static ones and embedded into an
Android, iOS, or Linux device, where it is accessed through the Java API.
The Java API is a wrapper around the C++ API, which invokes the
device’s kernels, including those that perform hardware acceleration.
A portable mobile-first solution can thus serve machine learning with
offline interaction, low latency, and stronger user-data privacy, since user
data need not leave the mobile device. This mobile solution, coupled with
serving models from servers with high computing capacity, provides an
open and flexible machine learning production and deployment environment
that promotes research, collaboration, and distribution of machine
learning.
10.11 Conclusion
As discussed above, although email spam might look like a harmless
distraction, it is actually a dangerous facade. Most spam emails might be
benign, but even the 1% that have been specially created with malicious
intent could prove severe; whaling, spear phishing, website forgery, and
clone phishing fall into this category. Classification of emails is
thus important for the security of not just the user, but also the email service
provider, as spam emails still constitute a large share of all emails sent daily,
consuming a large amount of server storage space and leaving limited
storage and bandwidth for the administrator. The algorithms mentioned
above provide us with different results depending on the dataset used, the
size of the dataset, and the unique advantages and disadvantages of each. In
any problem, it is important to compare the workings of every proposed
solution. The biggest reason for such a comparison is that a user can use any
of the above-mentioned algorithms, from artificial neural networks to
random forests, completely based on their requirements at the given time.
These comparisons show certain biases that might have been missed when
using a single approach, and they provide developers with a certain flex-
ibility to use any of them at their own will.
The easy availability of machine learning libraries has allowed people
from all over the world to implement these algorithms in various ways.
This has led to a huge increase in popularity not just for Python as a
programming language, but for machine learning and algorithms in
general. TensorFlow already has hundreds of thousands of users because it
is easily accessible and easy to learn. One potential application using the
above-mentioned machine learning algorithms is email classification.
Even after emails are classified as spam and ham, many users still face
the problem of email overload because of multiple accounts, frequency of
emails, and even job designation. One further scope that can be extended
from spam classification is prioritizing the ham emails between the tags of
“Priority” and “Other.” It is possible that important emails such as meet-
ing reminders, deadlines, delivery status, and so on might get overlooked
as the user’s inbox is constantly swamped with emails.
The future scope of this spam classification involves prioritization of the
already-classified ham emails under parameters such as domain, frequency
of emails previously sent by that user, Cc/Bcc, and specific keywords in
the body of the email, which would act as inputs to a neural network
model, thereby classifying each message as either high priority (if it
requires an urgent reply) or normal to low priority. This would provide a
better overall email management experience to the user. The model
currently trains the chosen classifier on a predefined and labeled
dataset. When it is implemented on a user’s
inbox, it classifies each incoming email as spam or non-spam automatically
and accordingly displays it only in the required section or folder. Not only
are users shielded from unnecessary emails, saving precious time, they are
also protected from potentially harmful ones, and thus the benefits are
twofold. As attacks and threats by emails become more versatile in their
methods, this elementary measure gives us a good starting point to
provide both a secure and productive environment to the user.
References
[1] Ramzan, Zulfikar (2010). “Phishing attacks and countermeasures”. In Stamp,
Mark & Stavroulakis, Peter eds. Handbook of Information and Communication
Security. Springer. ISBN 9783642041174
204 Applied Machine Learning for Smart Data Analysis
[2] Samuel, Arthur (1959). “Some Studies in Machine Learning Using the Game
of Checkers”. IBM Journal of Research and Development.
[3] Top 15 Python Libraries for Data Science in 2017 [https://medium.com/activewizards-machine-learning-company/top-15-python-libraries-for-data-science-in-in-2017-ab61b4f9b4a7]
[4] Kuldeep Yadav, Ponnurangam Kumaraguru, Atul Goyal, Ashish Gupta and
Vinayak Naik. SMS Assassin: Crowdsourcing Driven Mobile-based System
for SMS Spam Filtering Copyright 2011 ACM. 978-1-4503-0649-2 $10.00
[5] Lee Sproull and Sara Kiesler. (1986, Nov). Reducing social context cues:
Electronic mail in organizational communication. Management Science (32, 11)
[6] Mackay, W. Diversity in the use of electronic mail: A preliminary inquiry.
ACM Transactions on Office Information Systems 6, 4 (1988), 380–397
[7] Sahami, M., Dumais, S., Heckerman, D., & Horvitz, E. (1998, July). A Bayesian
approach to filtering junk e-mail. In Learning for Text Categorization: Papers
from the 1998 workshop (Vol. 62, pp. 98-105)
[8] Zhang, L., Zhu, J., & Yao, T. (2004). An evaluation of statistical spam filtering
techniques. ACM Transactions on Asian Language Information Processing
(TALIP), 3(4), 243–269.
[9] Boykin, P. O., & Roychowdhury, V. P. (2005). Leveraging social networks to
fight spam. Computer, 38(4), 61–68
[10] Wittel, G. L., & Wu, S. F. (2004, July). On Attacking Statistical Spam Filters. In
CEAS.
[11] Tretyakov, K. (2004, May). Machine learning techniques in spam filtering. In
Data Mining Problem-oriented Seminar, MTAT (Vol. 3, No. 177, pp. 60-79)
[12] Alan Gray Mads Haahr. Personalised, Collaborative Spam Filtering. Enter-
prise Ireland under grant no. CFTD/03/219
[13] Goodman, J., Cormack, G. V., & Heckerman, D. (2007). Spam and the ongoing
battle for the inbox. Communications of the ACM, 50(2),24–33
[14] Satyajeet Raje, Sanket Tulangekar, Rajshekhar Waghe, Rohit Pathak, Parikshit
Mahalle. Extraction of Key Phrases from Document using Statistical and
Linguistic analysis. Proceedings of 2009 4th International Conference on
Computer Science & Education. 978-1-4244-3521-0/09/$25.00 ©2009 IEEE
[15] Ghulam Mujtaba, Liyana Shuib, Ram Gopal Raj, Nahdia Majeed, Mohammed
Ali Al-Garadi. Email Classification Research Trends: Review and Open Issues.
2169-3536 (c) 2016 IEEE
[16] Anju Radhakrishnan et al. Email Classification Using Machine Learning
Algorithms. International Journal of Engineering and Technology (IJET) Vol
9 No 2 Apr-May 2017. DOI: 10.21817/ijet/2017/v9i1/170902310.
[17] Jim Keogh. J2EE: The Complete Reference, 2002
[18] Khorsi. “An overview of content-based spam filtering techniques”, Informatica,
2007
[19] W.A. Awad, S.M. ELseuofi. Machine Learning Methods For Spam E-Mail
Classification. International Journal of Computer Science & Information Tech-
nology (IJCSIT), Vol 3, No 1, Feb 2011
[20] How Random Forest Algorithm Works in Machine Learning https://
medium.com/@Synced/how-random-forest-algorithm-works-in-machine-
learning-3c0fe15b6674
[21] Wang, Cunlei, Zairan Li, Nilanjan Dey, Amira Ashour, Simon Fong, R. Simon
Sherratt, Lijun Wu, and Fuqian Shi. “Histogram of oriented gradient based
Sachin M. Kolekar
Department of Computer Engineering, Zeal College of Engineering and Research, Pune,
Maharashtra, India
CONTENTS
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
11.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
11.3 Proposed Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
11.4 Implementation Details. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
11.5 Steps of Smartphone Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
11.6 Relevant Mathematical Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
11.6.1 Set Theory Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
11.7 Apparent Study for Malware Uncovering in Android (MAMA) . . 215
11.7.1 Materials and Methods. . . . . . . . . . . . . . . . . . . . . . . . . . 216
11.7.2 Dataset Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
11.7.3 Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
11.7.4 Authorizations of the Manifest File . . . . . . . . . . . . . . . . 217
11.7.5 Other Aspects of the Manifest File: Uses-Featuretag. . . . 217
11.8 Results and Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
11.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
11.1 Introduction
The number of smartphone users has increased tremendously in the last
decade, and it continues to increase every day. This in turn attracts malware
developers to target smartphones for their malicious activities.
Nearly all of these intrusion detection systems are behavior-based; that is,
they do not rely on a record of malicious code signatures, as in
FIGURE 11.1
Malware tracking and detection system flowchart.
Malware can turn the microphone on or off without the user’s consent, and
recorded calls can be uploaded to a server.
1. Set PINs and passwords: Malware can easily access the applications
installed on a system without notifying the user. Setting a PIN and
password for applications lets the user authenticate safely, and
devices should always be secured with a password.
2. Do not modify your smartphone’s security settings: Android permits
rooting, which allows the system to be modified; users should not
root or jailbreak their devices, as doing so reduces device security.
3. Back up and secure your data: Much malware can delete data, so it is
always advisable to keep a backup. If the data is lost by any means, it
can then be restored at any time; data can also be stored in the cloud
and fetched whenever needed.
4. Only install apps from trusted sources: Many application stores exist
besides the official ones; for example, on Android there are many
Chinese app stores flooded with malicious applications. Users should
therefore download applications only from trusted sources, such as
the Play Store for Android and the App Store for iOS.
5. Set up security apps that enable remote locating and wiping: It is
advisable to install antivirus or anti-malware software; such
applications increase the security of the device.
6. Accept updates and patches to your smartphone’s software: Smartphone
vendors also work to reduce malware, so users should keep their
devices up to date and install new patches as soon as they are
released.
7. Be smart on open Wi-Fi networks: An open Wi-Fi network can monitor
the activity of the smartphones connected to it, so users should not
connect to open Wi-Fi for internet access. The Wi-Fi administrator
may also monitor the user’s authentication process, so one should
always be careful on public Wi-Fi.
8. Wipe the data from your old phone before you donate, resell, or recycle
it: When changing or selling a device, the user should wipe out the
original data before handing it over.
9. Report a stolen smartphone: When a device is lost, the user should
report it to the Federal Communications Commission and ask the
network provider to deactivate the SIM card. This notifies the major
wireless service providers that the phone has been lost or stolen and
allows remote “bricking” of the phone, so that it cannot be activated
on any wireless network without the owner’s authorization.
TABLE 11.1
Mathematical Model
1. C = {C1, C2, C3, …, Cn}: C gives the .apk files. As the size of the current inputs is variable, an SQLite DB is used.
2. P = {P1, P2, P3, …, Pn}: P gives the installed files. As the size of the previous inputs is variable, an SQLite DB is used.
3. M = {M1, M2, …, Mn}: M gives the permission set. ADT is used for the alert message.
4. O = {O1, O2, O3, …, On}: O gives the malware detection output.
(i) the permissions required by the application, declared under the
uses-permission tag, and (ii) the attributes declared under the uses-feature
tag in the Android manifest file. To obtain these features, we first extracted
the permissions used by each of the applications. For this, we employed
AAPT (the Android Asset Packaging Tool), which is included in the set of
tools provided by the Android SDK.
<uses-permission android:name="string"/>
Accordingly, numerous strings are used to declare the permissions of the
diverse Android applications, such as "android.permission.CAMERA" or
"android.permission.SEND_SMS". The number of permissions and their
frequency were also analyzed in a preliminary step to understand their
distribution within our dataset.
<uses-feature
android:name="string"
android:required=["true" | "false"] />
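Extracting the declared permissions from a manifest can be sketched with Python's standard XML parser; the manifest below is a minimal invented example, and a real APK's manifest would first need to be decoded with the packaging tool mentioned above, since it is stored in binary form.

```python
import xml.etree.ElementTree as ET

# Attribute names with the android: prefix resolve to this namespace
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

# Minimal invented manifest for illustration
manifest = """<manifest xmlns:android="http://schemas.android.com/apk/res/android">
  <uses-permission android:name="android.permission.CAMERA"/>
  <uses-permission android:name="android.permission.SEND_SMS"/>
  <uses-feature android:name="android.hardware.camera" android:required="true"/>
</manifest>"""

root = ET.fromstring(manifest)
# Collect the declared permission strings as features for classification
permissions = [el.get(ANDROID_NS + "name")
               for el in root.iter("uses-permission")]
print(permissions)
```

Each extracted string can then serve as a binary feature (present/absent) when building the permission-based detection model.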
FIGURE 11.2
Setting permissions.
FIGURE 11.3
Applying permissions.
FIGURE 11.4
Malware detection – by permission.
FIGURE 11.5
Package insertion.
Malware Prevention and Detection System for Smartphone 219
FIGURE 11.6
Insert the package name event.
FIGURE 11.7
Malware found according to the package name.
FIGURE 11.8
Malware detection – by signature.
both signed and unsigned package names. Figure 11.8 shows the check
using an unsigned package, which includes untrusted or non-registered
files.
Table 11.2 shows the results of the investigations, where reduced rule
times are observed when applying the “by permission” and “by package”
setup techniques. The average time observed for the two techniques is
equivalent for most applications. Additionally, since our approach relies
only on permissions and package setup, modifications to permissions do
not harm our system.
220 Applied Machine Learning for Smart Data Analysis
TABLE 11.2
Results and Discussion
CLASSIFICATION RULE SET/COMBINATION | Time by Rule (in seconds)
11.9 Conclusion
Here, we have described a new method for malware detection in
smartphones, which has been validated by installing different types of
applications. It is based on permission analysis and distinguishes malware
from benign programs. Furthermore, it is lightweight and fast, which makes
it particularly suitable for smartphone platform requirements. We have
examined different methods for improving malware tracking on
smartphones and have also presented a method of malware tracking
using “by signature” classification. Our future scope is to broaden the
analysis of benign programs in order to generate more efficient malware
detection methods based on autonomous learning and to detect malware
that has so far gone undetected.
References
[1] A.D. Schmidt et al., “Detecting Symbian OS Malware through Static Function
Call Analysis,” Proc. 4th Int’l Conf.Malicious and Unwanted Software (Malware
09), IEEE, 2009, pp. 15-22.
[2] Schmidt, A.D., Peters, F., Lamour, F., Scheel, C., Camtepe, S.A., Albayrak, S.:
Monitoring smartphones for anomaly detection. Mob. Netw. Appl. 14(1)
(2009), 92–106
[3] D. Barrera et al., “A Methodology for Empirical Analysis of Permission-Based
Security Models and Its Application to Android,” Proc. 17th ACM Conf.
Computer and Communications Security (CCS 10), ACM, 2010, pp. 73-84.
[22] W. Enck et al., “A Study of Android Application Security,” Proc. 20th Usenix
Security Symp., Usenix, 2011; http://static.usenix.org/events/sec11/tech/
full_papers/Enck.pdf.
[23] Yong Wang, Kevin Streff, and Sonell Raman, “Dakota State University Smart-
phone Security Challenges” 2012.
[24] La Polla, M., Martinelli, F., Sgandurra, D.: A survey on security for mobile
devices. Communications Surveys Tutorials, IEEE PP (99) (2012) 1-26.
[25] William Enck, Machigar Ongtang, and Patrick McDaniel “On Lightweight
Mobile Phone Application Certification” The Pennsylvania State University
2011.
[26] Abhijith Shastry, Murat Kantarcioglu, Yan Zhou, and Bhavani Thuraising-
ham, “Randomizing Smartphone Malware Profiles against Statistical Mining
Techniques” 2010
Index

A
AIML see artificial intelligence mark-up language (AIML)
Android app 37, 139, 143
Android malware applications 209, 216
anti-malware 214
anti-viruses 214
AOL 188
application package 209
artificial conversational entity 136
artificial intelligence mark-up language (AIML) 138

B
Big Data compact text document 119
biological microscopy 73
BLE see Bluetooth Low Energy (BLE)
Bluetooth Low Energy (BLE) 173
boolean similarity 45

C
C4.5 102–104, 108–109, 113
CART see Classification and Regression Tree (CART)
CBT see Cognitive Behavioral Therapy (CBT)
certification authority 177
Certo 158
chatbot 21, 23, 25, 27, 136–9, 141, 143–5, 147, 149–51
Cloud Firestore 143–4, 147
cluster 43, 45–7, 49–54, 58, 63–4, 118, 121, 124–7
Cognitive Behavioral Therapy (CBT) 23
CompuServe 188
conjuncts 4, 7, 28
consonants 4–5, 7–8
content retrieval 62, 196–7
content summarizer 119
context-aware conversations 137

D
data generalization 117, 126
data mining 43, 45–7, 52, 54, 58, 69, 97–8, 100, 104, 118, 157, 194, 213
data summarization 117, 119–21, 123–5, 127, 129–33
dataset training 99, 111–12
decision tree algorithm 99, 102–3, 108, 199
deep mining 45
DeepMind 189
Dempster-Shafer fusion 80–1
denial of service malware attack 211
Devanagari script 4–5, 7–8, 11–12
Dialogflow 136, 143–4, 147–8
discrete wavelet transform 74

E
Edge fusion operation 80
educational data mining 97, 100, 157
education system improvement 100
emotion topic model (ETM) 24
ETM see emotion topic model (ETM)
Euler number 101

F
finite-state transducers 6
FKM see fuzzy k-means (FKM)
fuzzy clustering algorithm 117, 119, 121, 123, 125, 127, 129
fuzzy k-means (FKM) 156

G
gray scale 74
GUI handler 99, 112

H
Haar father wavelet 76–7
HTML 55–7, 73, 191–2