Data Mining Techniques For Medical Data A Review PDF
All content following this page was uploaded by Dr. Subhash Chandra Pandey on 11 October 2017.
Birla Institute of Technology, Mesra, Ranchi (Allahabad Campus).

I. INTRODUCTION
Medical data refers to databases that store healthcare information, such as patient records. With the development of information technology, large amounts of such medical data are stored in electronic form, and these databases contain large volumes of data. Medical data is available from different sources, for example X-ray, computed tomography (CT) scans, magnetic resonance imaging (MRI), ultrasound, etc. Thus, the volume of data, and of the databases required to store the digitized data, has increased exponentially [1]. Further, raw medical data is usually huge and heterogeneous in nature, and it may be collected from different sources such as images, interviews with the patient, laboratory data, and the physician’s observations and evaluations [2]. Medical data are of various types: they can take the form of images, datasets, signals, wavelengths, etc. At present, owing to research and development in information gathering tools, a huge amount of data is available in electronic format. To store such a large amount of data, the sizes of databases also increase substantially [3].

Fig. 1: Role of data mining in knowledge discovery process

Medical data are available in hundreds of public and private databases, which has only been possible through novel database technologies and the Internet [4]. It has been estimated that the healthcare industry may generate terabytes of data every year [5]. Extracting useful information for quality healthcare is difficult yet important, and nowadays we have a lot of data available in our databases for this purpose. However, the knowledge that is extracted from it is nearly negligible. Thus, effective organization, analysis and interpretation of data are of paramount importance so that tangible extraction of knowledge becomes possible. In fact, different computational techniques are required to manage these large databases of medical data and to discover useful patterns and hidden knowledge in them [4]. Often, in the data mining process, we analyze very large observational datasets and subsequently extract useful hidden patterns for the purpose of data classification. Today, data mining has also begun to be applied to healthcare and medical data, because there is a dire need for efficient techniques for detecting unknown and valuable hidden information in medical data [6], so that complex interrelations among patients, their medical conditions, and treatments can be analyzed in a lucid manner [7]. The use of data mining in the healthcare and medical field is pervasive, with applications such as detection of fraud in health insurance, providing better medical solutions to patients at a lower cost, detection of diseases and their causes, and identification of efficient medical treatment methods. Indeed, data mining is a core process of a broader prospect known as the
knowledge discovery. The inter-relation between data mining and knowledge discovery is shown in Figure 1.

II. DATA MINING TASKS AND ITS USE IN HEALTHCARE
There are different data mining models, varying from one application domain to another. However, they can be broadly categorized into two groups, namely the predictive model and the descriptive model. Some important data mining tasks pertaining to the medical and healthcare domain are enumerated below.

Summarization
Association
Classification
Clustering
Trend analysis
Regression

(i) Summarization: In summarization, a set of data is abstracted into a smaller set that gives a general overview of the data. Thus, summarization is the abstraction or generalization of the data. Summarization can be carried out at many levels of abstraction, and the data can be viewed from different perspectives. For example, rather than looking at the details of each call, calls can be summarized into duration of calls, number of calls, and cost incurred during the calls. In the same way, calls can also be summarized on the basis of national or international calls. These combinations of different levels of abstraction reveal the various patterns and regularities present in the data [8].
(ii) Association: Association looks for togetherness or connection of objects in large databases. Such a connection is known as an association rule. An association reveals relationships existing among objects. Its main purpose is to find interesting correlations among the objects, i.e., the co-occurrence of one set of objects with another [9]. Association rules are usually used in marketing, commodity management, advertising, etc. From these association rules, the associations and patterns that exist among various attributes are extracted. Indeed, association-based data mining aims to find associations between attributes and then generate rules from those data sets [10]. For example, an association rule that “call waiting” is associated with “call display” says that if a customer is subscribed to the “call waiting” service, that customer is very likely to subscribe to the “call display” service as well.
(iii) Classification: Classification divides data sets into target classes. Classification techniques predict the target class for each data instance. For example, using classification techniques a patient can be classified as “high risk” or “low risk” on the basis of their disease patterns. In this approach the classes are known, and thus it is a kind of supervised learning. There are two kinds of classification task: binary and multilevel. In a classification task the dataset is divided into training and testing data sets. Further, the classifier is trained with the training data set and subsequently its correctness is tested on the test data set. The classification task of data mining is generally used in healthcare industries [6]. It is often used to predict the treatment cost of different diseases [11].
(iv) Clustering: There is a subtle difference between classification and clustering. Classification is a supervised learning method, whereas clustering is an unsupervised learning method. In classification the class labels are known, but in clustering they are not. In clustering, similar data are placed in the same cluster and dissimilar data are placed in other clusters [12]. Clustering needs very little or no prior information for partitioning the data. The drawback of clustering is that the clusters must first be identified before a new instance can be assigned to them [13].
(v) Trend analysis: A lot of time-dependent data can be observed in the literature. In different walks of life, data such as the sales of a company, the credit card transactions of a customer, and stock prices are all time series data. Such data can be viewed as objects with a ‘time’ attribute. It is interesting to find patterns and regularities in the data along the dimension of time. Trend analysis discovers these interesting patterns [9].
(vi) Regression: Regression is learning a function which maps a data item to a real-valued prediction variable [14]. Indeed, regression establishes a relationship between an unknown dependent variable to be estimated and known independent variables. Regression is a widely used technique for prediction.

A. Data Mining for Healthcare
In healthcare industries, dependence on data is increasing day by day [15]. In medical science, diagnosis of disease and treatment of patients are the most important tasks. In recent days, doctors’ handwritten notes have been converted to electronic records with the aim of reducing the cost incurred during treatment and improving the efficiency of treatment [16].
Data mining applications in healthcare can be further divided into the following categories:
a. Diagnosis and prediction of diseases – In healthcare industries, diagnosis and prognosis of diseases are very important [17], and they are among the most important purposes of using data mining for healthcare. The use of data mining for healthcare has helped doctors to improve the health services they provide [15]. One cannot afford to waste time and money by choosing an incorrect treatment for a patient, which can also harm the patient’s health [18].
b. Ranking of various hospitals – Data mining techniques are used to study the details of various hospitals in order to rank them [19]. Organizations rank hospitals on the basis of their capability to handle patients with serious illness, i.e., hospitals with a higher rank are more suitable for handling high-risk patients, as this is their highest priority, whereas this is not the case in lower-ranked hospitals because they do not even consider the risk factor.
c. Better treatment techniques – With the help of data mining
techniques, both the doctor and the patient can choose the best treatment option by comparing all the available treatment techniques. They can select the best treatment techniques in terms of both effectiveness and cost. Through data mining they can also find out the side effects of various treatments, which decreases the risk to patients [6].
d. Effective treatments – By comparing factors like causes, symptoms, side effects, and cost of treatments, data mining is used to analyze the effectiveness of treatments. For example, one can compare the results of treatments of different patients who were suffering from the same disease but were treated with different drugs. In this way, we can find which treatment is effective in terms of the patient’s health and cost [20].
e. Better quality services provided to patients – With the advancement in technology, we already have voluminous data stored in digitized form. Data mining, when applied to this huge medical data, can help us extract many interesting unknown patterns. With the help of these patterns we can improve the quality of services and care provided to patients. Data mining also helps in understanding patients’ needs and requirements so that they can be better treated [6]. Milley has also stated that data mining can help in analyzing specific patients’ needs in order to enhance the services provided by healthcare organizations [21].
f. Infection control in hospitals – Hospital infections affect millions of patients every year, and the number of drug-resistant infections is high [22]. Infection surveillance is done through data mining to identify irregular patterns in infection control data [15]. For infection control, these patterns are further studied by a knowledgeable person. Such a surveillance system, which uses data mining techniques to discover unknown patterns in infection control data, was implemented at the University of Alabama [23].
g. Identifying high risk patients – American Healthways helps hospitals with diabetes disease management services to improve the quality and reduce the cost of care for diabetic patients. To differentiate between high-risk and low-risk patients, American Healthways used a predictive modeling technique. Using this technique, high-risk patients who needed more attention regarding their health were identified by the healthcare providers [24].
h. Reduction in insurance fraud and abuse – Healthcare insurers construct models to identify unusual patterns of claims by patients, physicians, hospitals, etc. [25]. In 1998, the Texas Medicaid Fraud and Abuse Detection System saved millions of dollars by detecting fraud and abuse through data mining techniques [26].
i. Proper hospital resources management – Management of hospital resources is an important task in healthcare industries. Data mining constructs a model for managing hospital resources. Group Health Cooperative uses data mining and provides services to hospitals at a lower cost [27]. Blue Cross manages diseases efficiently, reducing cost and improving outcomes with the help of data mining [28].
j. Medical device industry – Without medical devices, the healthcare industry could not exist. Mobile communications and inexpensive wireless bio-sensors are the most important aspect of mobile healthcare applications, which provide a safe method for studying the vital signs of patients [29].

Ultimately, the success of data mining in healthcare depends entirely on the availability of clean and organized healthcare data. Thus, the healthcare industries must look into this factor as well, i.e., how to capture and store data so that it can be properly mined subsequently [30]. The applications of data mining techniques in healthcare, along with various data mining tasks, are shown diagrammatically in Figure 2.

Fig. 2: Data mining tasks and applications in healthcare
(Data mining tasks: Association, Classification, Clustering, Trend Analysis, Regression, Summarization. Data mining applications in health care: Hospital Resource Management, Fraud Detection, Identify High Risk Patients, Infection Control, Better Services, Effective Treatments, Better Treatment Techniques, Diagnosis of Disease, Medical Device Industry.)

III. KNOWLEDGE MANAGEMENT AND DATA MINING IN HEALTHCARE
Medical healthcare has recently been gaining increasing attention and popularity. Due to advances in technologies such as molecular and biomedical techniques, medical imaging, and electronic medical records of patients, a large amount of medical data is generated every day. From clinical practice to individual research, these medical data are being stored in hundreds of private as well as public databases after the digitization of medical information like patient records, lab reports, etc. Today, the rate of data accumulation is much faster than the rate of data extraction. Thus, this data needs to be well organized and stored in order to be useful. New information technology techniques are required to handle these large repositories of medical data and to extract useful patterns from them. Knowledge management and data mining have been adopted in various medical domains in recent years.
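As a concrete instance of extracting a useful pattern from stored records, the following minimal sketch computes the two standard measures of an association rule, support and confidence, for the “call waiting” and “call display” rule used as an example in Section II. The subscription records and the function names are invented for illustration and do not come from any cited study.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], items: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `items`."""
    wanted = frozenset(items)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(
    transactions: List[FrozenSet[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> float:
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    antecedent_set = frozenset(antecedent)
    base = support(transactions, antecedent_set)
    if base == 0.0:
        return 0.0
    both = support(transactions, antecedent_set | frozenset(consequent))
    return both / base


# Hypothetical subscription records (assumed data, for illustration only).
subscriptions = [
    frozenset({"call waiting", "call display"}),
    frozenset({"call waiting", "call display", "voicemail"}),
    frozenset({"call waiting"}),
    frozenset({"voicemail"}),
    frozenset({"call display"}),
]

rule_support = support(subscriptions, {"call waiting", "call display"})
rule_confidence = confidence(subscriptions, {"call waiting"}, {"call display"})
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

On this toy data, two of the five customers hold both services and three hold “call waiting”, so the rule has support 0.40 and confidence of about 0.67. A rule miner would normally report only rules whose support and confidence exceed chosen thresholds.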
In the 20th century, management, along with psychology and cognitive sciences, led to the evolution of knowledge management [31]. The term ‘knowledge management’ came into existence in the 1980s and the academic discipline was developed in 1995 [32]. Indeed, knowledge management is the managerial approach to collecting, managing, using, analyzing, sharing, and discovering knowledge in order to maximize performance [33]. There is no agreed definition of what constitutes knowledge, but it is something abstract and inferential and is needed to support hypothesis generation and decision making. Recent studies have shown that knowledge management has good effects on organizational and operational performance [34, 35]. A knowledge management model proposed in [36] gave substantial information regarding the healthcare industries and stated that knowledge management processes lead to better organizational learning and decision making, which in turn lead to better organizational performance. Knowledge management methodologies and techniques have been used to support storing, retrieving, sharing and management of data in order to make biomedical knowledge explicit. They have recently been used in both scientific and business domains. There are many goals and challenges for knowledge management in companies, because knowledge management could increase their performance, evaluate risks, help in developing partnerships, organize the management, and enhance their economic value [37]. Knowledge management has also been criticized, notably by T.D. Wilson [38]. However, knowledge management could withstand these criticisms, mainly because companies and organizations really need it.
Methods and techniques in knowledge management can be categorized into three sections: people and technology, requirements elicitation, and measurement of value. Today’s frameworks take human as well as technical perspectives into account. The human perspective concerns motivation and adoption. Employees are motivated to use knowledge management through financial or non-financial incentives, not only for the sake of technology but also because it affects the company. In [39], it is suggested that apart from giving incentives there should be a win-win system for both the employee and the company, and not a win-lose reward system. Another issue related to knowledge motivation was knowledge adoption, since people were not ready to use knowledge management. In [40], a model is proposed which discusses issues of knowledge adoption. Indeed, data mining is a core step of a broader prospect known as knowledge discovery, and it is used in different domains, e.g., to discover biological, drug and patient care knowledge. It is also used for statistical analysis of patterns. Data mining is perhaps a frequently used technique in medicine [27]. The basic objective of data mining is to analyze a set of raw data and to identify and extract novel and useful patterns [41]. Various data mining techniques such as neural networks, decision trees, fuzzy sets, support vector machines, Bayesian networks and genetic algorithms are used to discover knowledge and patterns that are not known to the system and the users [42, 33]. In biomedical data mining, patient data should not be ‘individually identifiable’, i.e., no record should give enough data about the patient for anyone to identify the patient [2].

Fig. 3: Process of feature selection.
(Flowchart elements: Feature Set, Searching, Subset, Evaluating the Subset, Selection Criteria with Yes/No branches, Subset Feature.)

IV. DATA MINING TECHNIQUES FOR HEALTHCARE
Data mining uses various techniques for mining medical data. In fact, data mining techniques are used for feature selection. Feature selection can be described as the process of selecting a minimum subset of features which are actually essential for classification. The feature set may be redundant, and this may decrease efficiency. Feature selection is an important problem in the field of medical diagnosis [43]. Feature subset generation is also known as data reduction, which is a step in data preprocessing [44]. Further, feature selection minimizes the number of features required for maximizing the accuracy of the model. It helps in reducing the space required by the feature set.
It also removes the redundancy and noise that might be present in the feature set and thus increases the efficiency of the data mining algorithm [45]. The objective of feature selection is to produce a cost-effective and efficient model [46].
Fig. 3 shows the complete process of feature selection. It mainly consists of four stages: subset formation, evaluation of the subset, a selection criterion which is used as the stopping criterion, and the final feature subset [44]. In the first step the feature set is searched after eliminating inconsistencies such as null values and any redundancies that are present. Then the process of subset generation starts. Subsequently, an attribute evaluator evaluates the generated subset [47]. The phase of subset generation and evaluation continues until the selection/stopping criteria are fulfilled. Only after that is the final feature subset selected.
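The four stages of feature selection shown in Fig. 3 (subset generation, subset evaluation, stopping criterion, final subset) can be sketched as a greedy forward search. This is only one possible instance of the process, not the method of any cited work: the evaluator (leave-one-out accuracy of a 1-nearest-neighbour classifier), the function names and the toy data are all assumptions chosen to keep the example self-contained.

```python
from typing import List, Sequence, Tuple


def loo_1nn_accuracy(
    X: Sequence[Sequence[float]], y: Sequence[int], subset: Sequence[int]
) -> float:
    """Subset evaluation: leave-one-out accuracy of a 1-nearest-neighbour
    classifier that only looks at the features listed in `subset`."""
    correct = 0
    for i, row in enumerate(X):
        best_dist, best_label = None, None
        for j, other in enumerate(X):
            if i == j:
                continue
            dist = sum((row[f] - other[f]) ** 2 for f in subset)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, y[j]
        correct += int(best_label == y[i])
    return correct / len(X)


def forward_select(
    X: Sequence[Sequence[float]], y: Sequence[int]
) -> Tuple[List[int], float]:
    """Greedy forward selection: grow the subset one feature at a time."""
    n_features = len(X[0])
    selected: List[int] = []
    best_score = 0.0
    while len(selected) < n_features:
        # Subset generation: extend the current subset by one unused feature.
        candidates = [selected + [f] for f in range(n_features) if f not in selected]
        # Subset evaluation: score every candidate subset.
        scored = [(loo_1nn_accuracy(X, y, c), c) for c in candidates]
        top_score, top_subset = max(scored, key=lambda pair: pair[0])
        # Stopping criterion: stop when no candidate improves the score.
        if top_score <= best_score:
            break
        selected, best_score = top_subset, top_score
    return selected, best_score


# Toy data (assumed): feature 0 separates the classes, feature 1 is noise,
# and feature 2 is a near-duplicate of feature 0 (redundant).
X = [
    [1.0, 5.0, 1.1],
    [1.2, 1.0, 1.0],
    [0.9, 3.0, 1.2],
    [3.0, 4.8, 3.1],
    [3.2, 1.2, 2.9],
    [2.9, 3.1, 3.0],
]
y = [0, 0, 0, 1, 1, 1]

selected, score = forward_select(X, y)
print(selected, score)
```

On this toy data the search keeps only feature 0: adding the noisy feature or the redundant one does not improve the evaluation score, so the stopping criterion ends the loop and the reduced subset is returned.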
A. Neural Networks
Neural networks were developed in the mid-20th century [48]. Neural networks are used in medicine as one of the most popular data modeling algorithms. Before the invention of decision trees and the Support Vector Machine, neural networks were regarded as the best classification algorithm [49]. The main objective of using neural networks is pattern recognition and classification [50]. The neural network system is modeled on the human brain. The human brain consists of millions of interconnected neurons. In a similar way, a neural network is an interconnection of artificial neurons in which each connection has an associated weight. Because the network is adaptive, adjusting the weights helps in minimizing the error [3]. These neurons work together in parallel to produce the output function. In the learning phase the network learns by adjusting the weights to predict the correct class label of the input. Neural networks have the added advantage that they can predict nonlinear relationships, unlike simple modeling methods [51]. Neural networks play an important role in the analysis of medical data. Applications of neural networks in this field include tissue classification, disease prediction and drug development. Prediction of heart diseases can be done with the help of a neural network [52]. A few architectures of neural networks are enumerated below:
i. Multi Layer Neural Network (MLNN): This type of neural network uses hidden layers, with the help of which it solves the classification problem for non-linear sets [53]. These hidden layers are usually interpreted as hyper-planes. This kind of neural network is used for classifying different categories of data.
ii. Polynomial Neural Network (PNN): Polynomial neural networks have neuron-like units, as multilayer perceptrons do, which produce multivariate polynomial mappings.

B. Decision Tree
A decision tree is a tree which has terminal and non-terminal nodes. Each non-terminal node represents a test or condition on a data item. Decision trees classify instances by sorting them down from the non-terminal to the terminal nodes [54]. Which branch is selected depends entirely on the outcome of the test. For example, consider a decision tree for medical readmission. With the help of this tree we can decide whether a patient needs readmission or not [3]. Decision trees basically create a visual representation of the various pros and cons and the potential values of each option [55]. Decision trees are commonly used for calculating conditional probabilities in operations research analysis [56]. The best alternatives can be chosen with the help of decision trees and, based on maximum information gain, the traversal from root to leaf node indicates a unique class separation [57]. In some other applications of data mining, such as marketing, the accuracy of a prediction could be all that is needed. It may not be important to know about the working of the model. For example, when a marketing professional wants to launch a marketing campaign, he would require the overall descriptions of customer segments. For these types of applications, the decision tree algorithm is very suitable [58].

C. Fuzzy Sets
Fuzzy sets and fuzzy logic are a widely used methodology in data mining for representing and processing uncertainty. They are among the best methods for dealing with imperfect and noisy data [51]. Fuzzy set theory was introduced by Zadeh [59], and it helps us in handling vague data. Fuzzy sets and fuzzy logic are needed to implement the proposed expert system. With the help of fuzzy logic we can calculate the degree of membership of any particular case in any cluster, and decisions can then be made on the basis of that value [60].

D. Support Vector Machine (SVM)
The concept of SVM was first proposed in [61-62]. It often provides highly accurate results in comparison to other algorithms. It is a classification technique and it works on the basis of statistical learning theory [62-63]. For various kernels, SVM has been used as a universal approximator [64]. The subset of the learning data that defines the support vector machine is called the set of support vectors. Absence of local minima is one of the main features of SVM. The SVM model is a representation of the training data, and with the help of support vectors one can extract a condensed data set [65]. SVM finds an optimal separating hyper-plane which maximizes the margin between the examples of two different classes. SVM was developed for binary classification problems, but it can easily be extended to multiclass problems. This is one of the most important reasons for SVM gaining popularity [66-67]. In a binary classification task, such as predicting ICU mortality, the hyper-plane is the division between the two outputs. To be useful for other tasks it can create single as well as multiple hyper-planes. There are two methods for implementing SVMs. The first method involves mathematical programming and the second method employs kernel functions. The main purpose of using hyper-planes is to maximize the separation between data points [3]. In noisy data, error is minimized by maximizing the margin between the examples of two different classes, and the hyper-plane is defined as the center line of the separating space. There are two types of SVMs. The first is the linear SVM, which separates the data points with the help of a linear decision boundary. It performs well on datasets that can easily be separated into two parts. But sometimes complex datasets are difficult to classify with a linear kernel, and for these the second kind of SVM is used, i.e., the non-linear SVM, which separates the datasets with the help of a non-linear decision boundary. It is a very powerful algorithm, as it can obtain maximal generalization when predicting the classification of data [45]. The SVM shows good accuracy in binary classification problems such as valve classification, heart beat classification, etc. [68-70].

E. Bayesian Networks
A Bayesian network is a specific type of network which represents knowledge about an uncertain domain. It belongs to the domain of probabilistic graphical models (GMs). In a Bayesian network, nodes represent the variables and edges represent probabilistic dependencies among those
variables [71-73]. A Bayesian network specifies two types of information for each variable [74].

F. Rough Set
The concept of rough set theory is similar to the concept of fuzzy set theory. The only difference is that in this theory the uncertainty is described as a boundary region of a set. Every subset that is defined through upper and lower approximations is called a rough set. This definition also needs mathematical concepts, since it is defined by topological operations known as approximations. Rough sets are usually combined with other methods such as classification and clustering [51].

G. Genetic Algorithm
The genetic algorithm is a search and optimization technique which is based on genetics and selection. Genetic algorithms are basically used with neural networks, where they act as a guide for the learning process of data mining algorithms rather than for finding patterns. They are also used in the form of association rules or some other formalism in data mining to formulate hypotheses about variables and the dependencies among them. The basic idea of a genetic algorithm, as stated in schemata theory, is that we can obtain a much better solution by combining the good parts of other solutions, in the way nature does by combining the DNA of living creatures [75]. In a genetic algorithm there is a population composed of many individuals which evolve under specific selection rules to a state where fitness is maximized [76]. Initially a population of rules is created at random, each rule representing a solution to the problem. Then pairs of rules, usually the strongest rules, are selected as parents, and these pairs are combined to produce offspring [77]. A genetic algorithm basically consists of three operators, namely selection, crossover and mutation. In selection, a suitable string is selected on the basis of fitness for breeding a new generation; crossover then combines these suitable strings to produce better offspring; mutation then alters a string locally so that genetic diversity is maintained from one generation of a population to another. In every generation the population is evaluated against the termination criteria of the algorithm; if the criteria are not satisfied, the three operators are applied again and the population is evaluated again.

V. MACHINE LEARNING METHODS IN HEALTHCARE
There is a plethora of research in the machine learning domain, and it is mostly application driven. Machine learning research is widely used in the healthcare domain. Machine learning methods are able to identify areas in which an increase in research would lead to advances. In conditions where algorithmic solutions are not present, where there is a lack of formal codes, or where knowledge about the application domain is poorly defined, machine learning methods come into existence. Machine learning includes many methods, but we can broadly classify them as symbolic and sub-symbolic, based on the nature of manipulation while learning [78]. In symbolic learning methods, such as decision trees, the knowledge required and the level of inference performed are different [79]. On the other hand, genetic algorithms [80] and artificial neural networks [81] are examples of sub-symbolic methods of classification.
In the healthcare domain, machine learning techniques and tools can help in diagnosis and prognosis of diseases, prediction of disease progression, or extraction of medical knowledge. Symbolic classification, like inductive learning, is used to add learning and knowledge management to expert systems [82]. Machine learning tools help us in handling a few characteristic features of the medical domain, such as missing values, random noise, or only a few patient records being available [83]. Sub-symbolic learning methods like neural networks help in improving decision making because they are able to handle these datasets [84]. A major application in medical diagnosis is the interpretation of medical images, which provides significant assistance [85]. Indeed, as the healthcare domain is becoming more and more reliant on computer systems, machine learning methods can substantially help physicians in many cases and enable diagnosis in real time.
Apart from making medical decisions, machine learning improves the efficiency and quality of medical decision making systems [86]. Issues like how well a medical expert can understand and use the results obtained from a system depend considerably on the machine learning methods used. Many researchers have worked on medical expert systems for ECG diagnosis, implementing machine learning techniques to improve the knowledge of the medical expert system.

VI. UNIQUENESS OF DATA MINING IN HEALTHCARE
In this section, we present the unique features of medical data mining in order to make expert systems dealing with healthcare more constraint free, specifically while mining large heterogeneous medical data, because medical data itself is very rewarding but difficult to mine in comparison to other datasets. Medical datasets are huge and contain a large amount of medical information. At the same time, medical data also possess distinct legal, ethical, and social constraints [2]. Precisely, there are four main points that should be discussed regarding the uniqueness of medical data.

i. Medical data is heterogeneous in nature: As we already know, raw medical data is voluminous and heterogeneous. It may be collected from various sources like images, physician’s observations, interviews with patients, and laboratory data. All these help in diagnosis and prognosis of diseases and
TABLE 1. ADVANTAGES AND DISADVANTAGES OF DIFFERENT TECHNIQUES USED IN HEALTHCARE

2. Decision Trees
Advantages:
1. It can handle all types of variables, variables with missing values as well and it is easy to interpret.
2. For constructing decision trees one does not need to know about the domain. Even it can handle numerical and categorical data.
3. It can process high dimension data easily and it minimizes ambiguity of complex decisions and assigns exact values to the outputs.
Disadvantages:
1. For numeric dataset, it generates complex decision trees.
2. It is an unstable classifier, i.e., performance of a classifier depends on the dataset.
3. It is restricted to one output attribute and generates categorical data.
4. Performance of decision trees is not affected by co-linearity and linear-separability problems.