
Chapter Five

Data Mining for Healthcare Analytics


Session objectives

By the end of this chapter, you will be able to:


 Define the basic concept of data mining
 Explain data mining process models
 Describe data mining standards and output protocols
 Identify common techniques used in mining healthcare data
 Describe frequent pattern mining techniques
 Describe the future of healthcare data mining
What Is Data Mining
• Data mining is the process of discovering patterns,
relationships, and insights from large datasets.
• Extracts data from databases to solve business problems,
turning raw data into useful information.
• Is the process of data selection, exploration, and model
building using vast data stores to uncover previously unknown
patterns
• It uses advanced algorithms and statistical models to discover
patterns and trends that go beyond basic analysis methods.
What Is Data Mining…cont’d
• Data mining is a relatively recently developed
methodology, coming into prominence in 1994.
• Aims to identify valid, novel, potentially useful, and
understandable correlations and patterns in data.
• Used to find new, accurate, and useful patterns in data for
individuals or organizations.
• Extracts patterns that cannot be discovered by traditional
data exploration because the relationships are too
complex or because there is too much data
What Is Data Mining…cont’d
• The aim of data mining is to extract information from a
dataset and convert it into an understandable structure for
future use.
• In healthcare, data mining is becoming increasingly
important for tasks such as improving patient care and
enabling fast, accurate diagnosis.
• It employs concepts from various subfields and involves
techniques at the intersection of artificial intelligence,
machine learning, statistics, and database systems.
What Is Data Mining…cont’d
• The main reason is that the huge amounts of data
generated by healthcare transactions are
• Too complex and voluminous to be processed and analyzed
by traditional methods.
• The main aim of data mining is to find hidden patterns and
relationships in data for making informed decisions or
predictions.
What Is Data Mining…cont’d

 There is vast potential for data mining applications in
healthcare, such as:
 The evaluation of treatment effectiveness
 Customer relationship management
 Proper diagnosis of patients
 Early detection of diseases
 Survivability of patients
 Prevention and management of diseases
What Is Data Mining…cont’d

• Health data requires analytical methodology for identifying
vital information that is used for decision making; this in
turn:
 Decreases costs by increasing efficiency
 Improves patient quality of life
 Saves the lives of more patients


Data Mining Standards and Output Protocols

• Accurate and reliable data is crucial for successful data
mining.
• In order to ensure ethical and responsible practices, there
are established standards and protocols for data mining.
• These standards and protocols help protect data privacy,
ensure accuracy, promote transparency, and guide the
appropriate use of data mining results.
• Data mining practitioners should follow ethical guidelines
and principles when conducting their analysis
Data Mining Standards and Output Protocols
• Process:
– The overall process by which data mining models are
produced, used and deployed
– E.g. a description of the business interpretation of the
output of a classification
• Models:
– A standard representation for data mining and statistical
models
– E.g. the parameters defining a classification tree
Data Mining Standards and Output Protocols
• Attributes:
– A standard representation for cleaning, transforming,
and aggregating attributes to provide the inputs for data
mining models
– E.g. the parameters defining how zip codes are mapped to
three-digit codes prior to their use as a categorical variable in
a classification tree
Data Mining Standards and Output Protocols
• Settings:
– A standard representation for specifying the settings
required to build models and to use the output of models in
other systems
– E.g. specifying the settings required to build a classification
tree and to use its output in another system
Data Mining Standards and Output Protocols
• Remote distributed data:
– Standards for viewing, analyzing, and mining remote and
distributed data
– e.g. standards for the format of the data and the name of the
training set used to build a classification tree
• Interfaces and APIs:
– Standard data mining APIs for Java and SQL
– e.g. a description of the API so that a classification tree
can be built on data in a SQL database
Data Mining Process
 Data mining process models are used to guide the
implementation of data mining on large amounts of
data.
• The three most popular data mining process models are:
– CRISP-DM (Cross-Industry Standard Process for Data
Mining)
– Knowledge Discovery Database (KDD) process model
– Sample, Explore, Modify, Model, Assess (SEMMA)
Data Mining Process
• CRISP-DM, which stands for Cross-Industry Standard
Process for Data Mining, is a widely accepted and
well-documented process model for data mining.
• It provides a structured approach to guide data mining
projects and
• Consists of six main phases: Business Understanding, Data
Understanding, Data Preparation, Modeling, Evaluation,
and Deployment.
Data Mining Process
• SEMMA (Sample, Explore, Modify, Model, Assess):
• SEMMA is a data mining process model developed by SAS
Institute.
• It consists of five main phases: Sample, Explore, Modify,
Model, and Assess.
• SEMMA places a strong emphasis on data exploration and
manipulation, allowing analysts to gain insights into the data.
• It provides a structured approach to data mining and is often
used in conjunction with SAS software.
Data Mining Process
Knowledge Discovery Database (KDD) process model
• The KDD process model refers to a comprehensive
framework for knowledge discovery from data,
specifically in the context of data mining and machine
learning.
• The KDD process model consists of various stages and
tasks, including data selection, preprocessing,
transformation, data mining, evaluation, and
interpretation of the results.
Data Mining Process
KDD model consists of nine steps
1. Understanding of the application domain: The first stage, in
which goals are defined from the customer's point of view and
• The relevant prior knowledge and the goals of the end
user of the discovered knowledge are learned
2. Creating target dataset:
• The aim of this stage is to create a representative dataset that
contains the necessary information for analysis in the problem
domain.
• This dataset will be used in later stages to uncover patterns,
relationships, and insights.
Data Mining Process
• Identifying available data, acquiring relevant data, and
integrating all the data into a single set.
• This process considers the important attributes that need to
be included for effective knowledge discovery.
3. Data cleaning and preprocessing:
• Strategies are developed for handling noisy and
inconsistent data
• It incorporates data cleaning, handling missing
values, and removal of noise or outliers.
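As a concrete illustration of this step, here is a minimal sketch, assuming pandas and a hypothetical blood_pressure column; it shows one possible way to impute missing values and drop implausible records, not a prescribed method.

```python
import pandas as pd
import numpy as np

# Hypothetical patient records with missing and noisy values
df = pd.DataFrame({
    "age": [34, 51, np.nan, 29, 62],
    "blood_pressure": [120, 135, 128, 420, np.nan],  # 420 is an outlier
})

# Handle missing values: impute numeric columns with the median
df = df.fillna(df.median(numeric_only=True))

# Remove implausible values using a simple domain rule
df_clean = df[df["blood_pressure"].between(60, 250)]
print(df_clean)
```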
Data Mining Process
4. Data transformation:
• The process of generating data suitable for data mining is
carried out.
• This stage involves identifying valuable attributes
through the use of techniques such as dimension
reduction and transformation, and discovering a
consistent representation of the data.
• Various methods for reducing and transforming data are
applied to the desired data.
Data Mining Process
5. Choose data mining task:
• Selecting the appropriate data mining task for the problem
domain, such as classification, clustering, or association rule mining.

6. Choosing the suitable data mining algorithm:
• One or more appropriate data mining algorithms are selected for
searching for different patterns in the data.
• A number of algorithms are available today for data mining, and
• Appropriate algorithms are selected based on how well they match
the overall criteria for data mining
Data Mining Process
7. Employing data mining algorithms:
• Applying the selected algorithm to the data to discover
patterns, trends, and relationships.
8. Interpreting patterns:
• Focuses on interpretation and evaluation of the mined
patterns
• Involves visualization of the data based on the extracted
models and determining their significance for the problem
domain.
Data Mining Process
9. Using discovered knowledge:
– In the final step, the discovered knowledge is
used for different purposes
– Incorporating the discovered knowledge into the
performance system, and documenting and reporting it
to the interested parties.
– This step may also include checking and resolving
potential conflicts with previously believed knowledge
Data Mining Process

• Selection: Obtain data from various sources.
• Preprocessing: Cleanse the data and fill in incomplete records.
• Transformation: Convert data from different sources into a
common format.
• Data Mining: Apply data mining techniques to obtain the
desired results.
• Interpretation/Evaluation: Present results to the user in a
meaningful manner using various visualization and GUI
strategies.
Common Techniques Used in Mining …cont’d
Types of Data Mining Tasks

• Predictive: the collected data is used to train a model for
making future predictions.
– Uses some variables to predict unknown or future values
of other variables.
– Includes regression and classification
• Descriptive: general properties of the database are
determined.
– Includes visualization, clustering, dimensionality
reduction, and association rule mining.
Common Techniques Used in Mining …cont’d
Type of Learning
1. Supervised learning is a data mining technique that
involves creating models using labeled data, where each
instance has a known output label.
 The aim is to develop a model that can accurately predict
the labels of new, unseen instances.
 Includes Regression and Classification
 Decision trees, random forests, support vector machines
(SVM), Naive Bayes, and neural networks are all
examples of supervised learning algorithms used in data
mining.
Common Techniques Used in Mining…cont’d
2. Unsupervised learning:
• is concerned with the identification of patterns, relationships, or
structures within the data without the need for specific target labels.
• The primary goal is to uncover the inherent structure of the data and
uncover intriguing patterns or clusters.
• Popular techniques used in unsupervised learning for data mining
include the following (a brief clustering sketch follows this list):
 K-means clustering,
 Hierarchical clustering,
 Association rule mining, and
 Anomaly detection.
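Here is that sketch: a minimal K-means example with scikit-learn on synthetic, unlabeled data. The feature pairing and cluster count are illustrative assumptions, not taken from the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic, unlabeled data: hypothetical [age, blood_pressure] pairs
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([40, 120], [5, 8], size=(50, 2)),  # one synthetic group
    rng.normal([65, 150], [5, 8], size=(50, 2)),  # another synthetic group
])

# Partition the data into 2 clusters; no class labels are used
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # discovered group centers
print(kmeans.labels_[:10])      # cluster assignment per instance
```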
Common Techniques Used in Mining…cont’d
3. Semi-supervised Learning: Semi-supervised learning in
data mining combines both labeled and unlabeled data to
improve the learning process.

4. Reinforcement Learning: Reinforcement learning is
more commonly associated with tasks involving agent-
environment interactions than with classical data mining.
Common tasks of data mining
Classification
– A supervised learning technique used in data mining and
machine learning to assign predefined class labels to instances
based on their feature values.
– The data is typically represented as a set of instances, where
each instance is described by a set of attributes or features.
– The class labels are known for the training instances
– Useful for disease diagnosis, patient risk stratification,
treatment outcome prediction, and detecting insurance
claim fraud.
Common Techniques Used in Mining…cont’d

• The process of classification is based on four
fundamental components:
– The class

– Predictor

– Training dataset

– Testing dataset
Common Techniques Used in Mining…cont’d

• Class:
– The dependent variable of the model
– Is a categorical variable representing the ‘label’ put on
the object after its classification
• Example
– Presence of myocardial infarction
– Customer loyalty
– Condition of a patient
Common Techniques Used in Mining…cont’d
• Predictor:
– The independent variable of the model
– Represented by the attributes of the data to be classified,
based on which the classification is made
• Examples
– Smoking
– Alcohol consumption
– Blood pressure
– Marital status, etc.
Common Techniques Used in Mining…cont’d
• Training dataset:
– The set of data containing values for the class and
predictors
– Used for ‘training’ the model to recognize the appropriate
class, based on available predictors
• Examples
– Groups of patients tested for heart attacks
– Groups of customers of a supermarket
– Databases containing images from telescope monitoring
Common Techniques Used in Mining…cont’d
• Testing dataset:
– is a distinct set of examples utilized to assess the
efficacy of the trained classification model.
– During the training phase, the model does not encounter
the testing dataset, which comprises instances with
known class labels.
– To evaluate the accuracy and effectiveness of the
classification model, its predictions on the testing
dataset are compared to the actual class labels.
Common Techniques Used in Mining…cont’d

Classification
Common Techniques Used in Mining…cont’d
• Decision tree is a flowchart-like tree structure where:
• Each internal node (non-leaf node) denotes a test on an
attribute
• Each branch represents an outcome of the test
• Each leaf node (terminal node) holds a class label
• The topmost node in a tree is the root node
• A decision tree is a hierarchical model whereby the local
region is identified in a sequence of recursive splits in a
smaller number of steps.
Common Techniques Used in Mining…cont’d

Figure 5.1: Pictorial representation of a decision tree classifier


Common Techniques Used in Mining…cont’d
• How are decision trees used for classification?
 Data preparation is the initial stage in classification
where the dataset is prepared.
– This involves choosing important features or attributes
and dealing with missing values or outliers.
– Categorical attributes may need to be converted into
numerical representations, depending on the decision
tree algorithm being utilized.
Common Techniques Used in Mining…cont’d

How are decision trees used for classification?
 Training Dataset: A labeled training dataset is needed
for building the decision tree model.
 Each instance in the dataset should have a class label
associated with it.
 The dataset consists of instances described by a set of
features and their corresponding class labels.
Common Techniques Used in Mining…cont’d
How are decision trees used for classification?
 Construct a decision tree:
• The best attribute or feature is selected at each node to
split the data.
• This is done using criteria such as information gain, gain
ratio, or the Gini index, to maximize the separation of
classes.
• The selected attribute creates child nodes or branches in
the tree, and
• this process continues recursively until a stopping
criterion is met, such as reaching a minimum number of
instances or maximum depth.
Common Techniques Used in Mining…cont’d
 How are decision trees used for classification?
 Classification:
• Once the tree is built, it can be used to classify new, unseen
instances.

• Starting at the root node, each instance traverses the tree by
following the appropriate branches based on the attribute
values.
• At each internal node, the instance is directed to the child
node that matches its attribute value.
• This process continues until a leaf node is reached, which
represents a class label.
• The class label of the leaf node is assigned to the instance as
its predicted class.
Common Techniques Used in Mining…cont’d
How are decision trees used for classification?
 Model Evaluation:
 The performance of the decision tree model is evaluated using
a separate testing dataset.
• Decision tree classifiers do not require domain
knowledge or parameter setting
• Decision tree induction algorithms have been used for
classification in many application areas such as medicine,
manufacturing and production, financial analysis,
astronomy, and molecular biology
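The workflow just described (prepare labeled data, hold out a testing dataset, construct the tree, classify unseen instances, evaluate) can be sketched with scikit-learn as follows; the dataset and parameter choices are illustrative assumptions, not part of the slides.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Labeled data: each instance has feature values and a known class label
X, y = load_breast_cancer(return_X_y=True)

# Hold out a testing dataset the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Construct the tree; max_depth acts as a simple pre-pruning criterion
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=42)
tree.fit(X_train, y_train)

# Classify unseen instances and compare predictions to the actual labels
y_pred = tree.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```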
Building a Decision Tree
• Two-step method:
1. Tree Construction: determine the best split to find
all the branches and the leaf nodes of the tree.
2. Tree Pruning (Optimization): identify and remove
branches that are not useful for classification
• Pre-Pruning
• Post Pruning
Building a Decision Tree
• Attribute selection measures (splitting rules):
– A heuristic for selecting the splitting criterion that best
separates a given data partition, D, of class-labeled
training tuples into individual classes.
– Determines how the tuples at a given node are to be split.
– The popular attribute selection measures are:
• Information gain
• Gain ratio
• Gini index
How to Determine the Best Split
• The goodness of a split is quantified by node impurity:
– Information gain (entropy): attributes are assumed to be
categorical
– Gain ratio: extension to information gain
• Overcomes information gain’s bias toward attributes
with many distinct values
– Gini index: attributes are assumed to be continuous
• Assumes there exist several possible split values for each
attribute
(A small computational sketch of these measures follows.)
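The minimal sketch below computes entropy, the Gini index, and the information gain of a candidate split from class counts; the 'sick'/'healthy' labels are hypothetical, and this is not a full tree-induction algorithm.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the weighted entropy of the child partitions."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

# Hypothetical node with 4 'sick' and 6 'healthy' instances, split in two
parent = ["sick"] * 4 + ["healthy"] * 6
split = [["sick"] * 3 + ["healthy"] * 1, ["sick"] * 1 + ["healthy"] * 5]
print("Entropy:", round(entropy(parent), 3))   # ~0.971
print("Gini:", round(gini(parent), 3))         # 0.48
print("Info gain:", round(information_gain(parent, split), 3))
```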
Stopping Criteria for Tree Induction

• Stop expanding a node when all the records belong to the
same class
• Stop expanding a node when all the records have similar
attribute values
Tree Pruning
• The node is not split further if the number of training
instances reaching a node is smaller than a certain
threshold.
– Pre-pruning: stopping tree construction early, before the
tree is fully built
– Post-pruning: finding and pruning unnecessary subtrees
to obtain a simpler tree
• Comparing pre-pruning and post-pruning:
– Pre-pruning is faster, but post-pruning leads to more
accurate trees.
Issues in Decision Trees
• Overfitting happens when the model learns the detail
and noise in the training data to the extent that it
negatively impacts performance on new data.
• A decision tree is said to overfit the training data if:
– It results in poor accuracy when classifying test samples
– It has too many branches that reflect anomalies
• Avoiding overfitting:
– Prune the tree: leaf nodes (sub-trees) are removed from
the tree as long as the pruned tree performs better on the
test data than the larger tree.
Issues in decision trees
• Underfitting refers to a model that can neither model the
training data nor generalize to new data.
• Underfitting occurs when the model is too simple; both
training and test errors are large
• Easy to detect given a good performance metric

Decision Trees Algorithms
• Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– ID3, C4.5
– CART
– CHAID (Chi-squared Automatic Interaction Detector)
– SLIQ, SPRINT, MARS
Decision Tree Classifier Advantages
• Inexpensive to construct
• Extremely fast at classifying unknown records
• Easy to interpret for small-sized trees: can be converted to
if-then rules that are easily understandable
• Accuracy is comparable to other classification techniques
for many simple data sets
Decision Tree Classifier Disadvantages

• Prone to overfitting.
• Require some kind of measurement as to how well
they are doing.
• Need to be careful with parameter tuning.
• Can create biased learned trees if some classes
dominate.
Model Evaluation
• Metrics for Performance Evaluation
– How to evaluate the performance of a model?
• Methods for Performance Evaluation
– How to obtain reliable estimates?
• Methods for Model Comparison
– How to compare the relative performance among
competing models?
Metrics for Performance Evaluation
• Focus on the predictive capability of a model
– Rather than the speed of classifying or building models,
scalability, etc.
• Confusion Matrix is a table that provides a comprehensive
summary of the performance of a classification model:

                    Predicted Positive     Predicted Negative
Actual Positive     True Positive (TP)     False Negative (FN)
Actual Negative     False Positive (FP)    True Negative (TN)
Metrics for Performance Evaluation…cont’d
From the above confusion matrix, the terms are described as:
True positives (TP): actual positives that were correctly
predicted as positive
True negatives (TN): actual negatives that were correctly
predicted as negative
False negatives (FN): actual positives that were incorrectly
predicted as negative
False positives (FP): actual negatives that were incorrectly
predicted as positive
Metrics for Performance Evaluation…cont’d
The confusion matrix can be used to calculate various performance
metrics in healthcare, including:
• Accuracy: the overall correctness of the model's predictions.
• It is calculated as (TP + TN) / (TP + TN + FP + FN).
• Accuracy provides an overall assessment of the model's
performance but may be influenced by class imbalance.

• Sensitivity (Recall or True Positive Rate):


• Measures the proportion of actual positive instances correctly
identified by the model.
• It is calculated as TP / (TP + FN).
• It is important for identifying cases with the condition and
minimizing false negatives.
Metrics for Performance Evaluation…cont’d

Limitation of Accuracy
• Consider a 2-class problem:
– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10
• If model predicts everything to be class 0, accuracy is
9990/10000 = 99.9 %
– Accuracy is misleading because model does not detect
any class 1 example
Metrics for Performance Evaluation…cont’d
• Specificity (True Negative Rate): Measures the proportion
of actual negative instances correctly identified by the
model.
• It is calculated as TN / (TN + FP).
• It is important for ruling out the condition and minimizing
false positives.

• Precision: Measures the proportion of positive predictions
that are actually correct.
• It is calculated as TP / (TP + FP).
• Precision focuses on minimizing false positives and is
valuable for tasks where false alarms or unnecessary
interventions should be minimized.
Metrics for Performance Evaluation…cont’d
• Recall: measures the proportion of true positive predictions (correctly
identified positive instances) out of the total number of actual positive
instances.
 It focuses on minimizing false negatives, which means capturing as many
positive instances as possible, even if it results in some false positives.
 The formula for recall is:
Recall = True Positives / (True Positives + False Negatives)
• F1 Score:
• The harmonic mean of precision and recall,
• Calculated as F1 = 2 * (Recall * Precision) / (Recall + Precision)
• The F1 score provides a balanced measure of the model's
performance by considering both precision and recall.
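Here is a minimal sketch that computes all of these metrics directly from confusion-matrix counts; the counts below are made up for illustration.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute standard metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # recall / true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, precision, f1

# Hypothetical counts from a disease-screening model
acc, sens, spec, prec, f1 = classification_metrics(tp=80, fn=20, fp=30, tn=870)
print(f"Accuracy={acc:.3f} Sensitivity={sens:.3f} "
      f"Specificity={spec:.3f} Precision={prec:.3f} F1={f1:.3f}")
```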
Methods for Performance Evaluation
• How to obtain a reliable estimate of performance?
• Performance of a model may depend on other factors
besides the learning algorithm:
– Class distribution
– Size of training and test sets
Methods of Estimation
• Holdout: Reserve 2/3 for training and 1/3 for testing
• Random sub-sampling: Repeated holdout
• Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k=n
• Stratified sampling: oversampling vs. under-sampling
• Bootstrap: sampling with replacement
(A brief holdout and cross-validation sketch follows.)
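The sketch below applies two of these estimation methods, a holdout split and 5-fold cross-validation, using scikit-learn; the dataset and fold count are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Holdout: reserve 1/3 of the data for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
print("Holdout accuracy:", clf.fit(X_tr, y_tr).score(X_te, y_te))

# k-fold cross-validation: train on k-1 partitions, test on the remaining one
scores = cross_val_score(clf, X, y, cv=5)
print("5-fold accuracies:", scores.round(3), "mean:", scores.mean().round(3))
```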
Association Rule Mining
• Association rule mining is a data mining technique used
to discover interesting relationships or associations
between items in large datasets.
• The association rule mining technique, which examines the
concurrent correlations between several variables in a
grouping, was first created by Agrawal and Srikant
• Applied in the healthcare domain to uncover meaningful
associations and patterns within large healthcare datasets.
Association Rule Mining
• To produce association rules, the focus is primarily on features
that imply the target features (Antecedent => Consequent),
• Which is a way to identify all the variables that contribute to
the dependent variable.
• These rules are often referred to as classification association
rules.
• For a given rule X => Y, where the feature sets X and Y are
mutually exclusive, Support, Confidence, and Lift are
defined as follows:
Association Rule Mining
• Support, confidence, and lift are key measures used in association
rule mining to evaluate the significance and strength of association
rules.
1. Support: It indicates how frequently an itemset occurs in the
dataset.
Support = (frequency of itemset) / (total number of transactions)
2. Confidence: It represents the strength of the implication from the
antecedent to the consequent.
Confidence = (frequency of antecedent and consequent) / (frequency of antecedent)

3. Lift: measures the strength of association between the antecedent
and the consequent in an association rule, taking into account the
expected probability of the consequent occurring independently of
the antecedent.
Lift = (support of antecedent and consequent) / ((support of antecedent) * (support of consequent))
• Example: Let's consider a hypothetical scenario in which we are
analyzing a dataset of medical records to discover associations
between different medical conditions.
• We want to determine the support, confidence, and lift measures for
the rule "Diabetes => Hypertension" based on the dataset.
Assume the following information:
• Total number of transactions (medical records) in the dataset: N =
1000
• Frequency of the presence of both Diabetes (X) and Hypertension
(Y): frequency(X, Y) = 200
• Frequency of the presence of Diabetes (X): frequency(X) = 400
• Frequency of the presence of Hypertension (Y): frequency(Y) = 600

• Using these values, calculate the support, confidence, and lift.
• Support: measures the proportion of transactions in the dataset that
contain both Diabetes and Hypertension.
Support = frequency(X, Y) / N
        = 200 / 1000
        = 0.2, so the support of "Diabetes => Hypertension" is 0.2 (20%).
• Confidence: measures the conditional probability of Hypertension
given Diabetes, or the proportion of transactions that contain both
Diabetes and Hypertension out of the transactions that contain Diabetes
Confidence = frequency(X, Y) / frequency(X)
           = 200 / 400
           = 0.5, so the confidence of "Diabetes => Hypertension" is 0.5 (50%).
• Lift: measures the strength of association between Diabetes and
Hypertension, relative to the probability of Hypertension expected
if Diabetes and Hypertension were independent.
Lift = support(X, Y) / (support(X) * support(Y))
     = (frequency(X, Y) * N) / (frequency(X) * frequency(Y))
     = (200 * 1000) / (400 * 600)
     ≈ 0.83
• Since the lift is slightly below 1, Diabetes and Hypertension co-occur
slightly less often than would be expected under independence.
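The sketch below recomputes all three measures from the same assumed counts, which makes the corrected lift calculation easy to verify.

```python
def rule_measures(freq_xy, freq_x, freq_y, n):
    """Support, confidence, and lift for a rule X => Y from raw counts."""
    support = freq_xy / n
    confidence = freq_xy / freq_x
    lift = support / ((freq_x / n) * (freq_y / n))
    return support, confidence, lift

# Counts from the Diabetes => Hypertension example above
s, c, l = rule_measures(freq_xy=200, freq_x=400, freq_y=600, n=1000)
print(f"support={s:.2f} confidence={c:.2f} lift={l:.2f}")  # 0.20 0.50 0.83
```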
What Is Frequent Pattern Analysis?
• Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a dataset
• First proposed by Agrawal et al.
• Applications: basket data analysis, cross-marketing, catalog design,
sales campaign analysis, Web log (click-stream) analysis, and DNA
sequence analysis.

• Motivation: Finding inherent regularities in data


– What products were often purchased together?— Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?

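To make "occurs frequently" concrete, here is a small, illustrative support counter over toy transactions; it is a brute-force sketch, not the Apriori algorithm.

```python
from itertools import combinations
from collections import Counter

# Toy transactions (market-basket style)
transactions = [
    {"beer", "diapers", "bread"},
    {"beer", "diapers"},
    {"bread", "milk"},
    {"beer", "diapers", "milk"},
]
min_support = 0.5  # an itemset must appear in at least half the transactions

# Count every 1- and 2-item itemset across all transactions
counts = Counter()
for t in transactions:
    for k in (1, 2):
        for itemset in combinations(sorted(t), k):
            counts[itemset] += 1

n = len(transactions)
frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
print(frequent)  # e.g. ('beer', 'diapers') has support 0.75
```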
Challenges of Implementing Healthcare Data Mining

• Healthcare data is frequently spread out across
different systems and organizations, making it difficult
to access and combine.
• This data can be found in electronic health records
(EHR), medical imaging systems, laboratory systems,
claims data, and other sources.
• Ensuring that the data is compatible, standardized, and
securely shared between these different sources can be
a complex task.
• Missing, corrupted, inconsistent, or non-standardized
data, such as pieces of information recorded in different
formats in different data sources
Challenges of Implementing…cont’d
• Lack of a standard clinical vocabulary is a serious
hindrance to data mining
• Massive amounts of patient data being shared during
the data mining process increase patients' concerns that
their personal information could fall into the wrong hands
• There may be ethical, legal and social issues, such as data
ownership and privacy issues, related to healthcare data
Challenges of Implementing…cont’d
• Protecting patient privacy and ensuring data security
throughout the data mining process is a significant challenge.
• The successful application of data mining requires knowledge
of the domain area as well as of data mining methodology and
tools
• Without a sufficient knowledge of data mining, the user may
not be aware of or be able to avoid the pitfalls of data mining
• Data mining requires intensive planning and technological
preparation work
Challenges of Implementing…cont’d
• Healthcare organizations developing data mining
applications must make a substantial investment of
resources, particularly time, effort and money
• Data mining projects can fail for a variety of reasons,
such as lack of management support, unrealistic user
expectations, poor project management, inadequate data
mining expertise, etc.
• Physicians and executives have to be convinced of the
usefulness of data mining
The Future of Healthcare Data Mining

• Data mining applications in healthcare can have
tremendous potential and usefulness
• The success of healthcare data mining hinges on
the availability of clean healthcare data
• It is crucial that the healthcare industry consider
how data can be better captured, stored, prepared,
and mined
The Future of Healthcare…cont’d
• Standardization of clinical vocabulary and the
sharing of data across organizations would enhance the
benefits of healthcare data mining applications
• Healthcare data are not limited to quantitative data; they
also include qualitative data such as physicians' notes and
clinical records
• The future of healthcare data mining holds immense
potential for transforming healthcare delivery,
improving patient outcomes, and advancing medical
knowledge.
