
Chapter Five

Data Mining for Healthcare Analytics


Session objectives

By the end of this chapter, you will be able to:


 Define the basic concept of data mining
 Explain data mining process models
 Describe data mining standards and output protocols
 Identify common techniques used in mining healthcare data
 Describe frequent pattern mining techniques
 Describe the future of healthcare data mining
What Is Data Mining
• Data mining is the process of discovering patterns,
relationships, and insights from large datasets.
• Extracts data from databases to solve business problems,
turning raw data into useful information.
• Is the process of data selection, exploration, and model
building using vast data stores to uncover previously unknown
patterns
• It uses advanced algorithms and statistical models to discover
patterns and trends that go beyond basic analysis methods.
What Is Data Mining…cont’d
• Data mining is a relatively recently developed
methodology, coming into prominence in 1994.
• Aims to identify valid, novel, potentially useful, and
understandable correlations and patterns in data.
• Used to find new, accurate, and useful patterns in data for
individuals or organizations.
• Extracts patterns that cannot be discovered by traditional
data exploration because the relationships are too
complex or because there is too much data
What Is Data Mining…cont’d
• The aim of data mining is to extract information from a
dataset and convert it into an understandable structure for
future use.
• In healthcare, data mining is becoming increasingly
important for tasks such as improving patient care and
enabling fast, accurate diagnosis.
• It employs concepts from various subfields and involves
techniques at the intersection of artificial intelligence,
machine learning, statistics, and database systems.
What Is Data Mining…cont’d
• The main reason is that the huge amounts of data
generated by healthcare transactions are
• Too complex and voluminous to be processed and analyzed
by traditional methods.
• The main aim of data mining is to find hidden patterns and
relationships in data for making informed decisions or
predictions.
What Is Data Mining…cont’d

 There is vast potential for data mining applications in
healthcare, such as:
 The evaluation of treatment effectiveness
 Customer relationship management
 Proper diagnosis of patients
 Early detection of diseases
 Survivability of patients
 Prevention and management of diseases
What Is Data Mining…cont’d

• Health data requires analytical methodology for identifying
vital information that is used for decision making; this in
turn:
 Decreases costs by increasing efficiency
 Improves patient quality of life
 Saves the lives of more patients


Data Mining Standards and Output Protocols

• Accurate and reliable data is crucial for successful data
mining.
• In order to ensure ethical and responsible practices, there
are established standards and protocols for data mining.
• These standards and protocols help protect data privacy,
ensure accuracy, promote transparency, and guide the
appropriate use of data mining results.
• Data mining practitioners should follow ethical guidelines
and principles when conducting their analysis
Data Mining Standards and Output Protocols
• Process:
– The overall process by which data mining models are
produced, used and deployed
– E.g. a description of the business interpretation of the
output of a classification
• Models:
– A standard representation for data mining and statistical
models
– E.g. the parameters defining a classification tree
Data Mining Standards and Output Protocols
• Attributes:
– A standard representation for cleaning, transforming,
and aggregating attributes to provide the inputs for data
mining models
– E.g. the parameters defining how zip codes are mapped to
three-digit codes prior to their use as a categorical variable in
a classification tree
Data Mining Standards and Output Protocols
• Settings:
– A standard representation for specifying the settings
required to build models and to use the output of models in
other systems
– E.g. specifying the settings required to build a classification
tree and to use its output in another system
Data Mining Standards and Output Protocols
• Remote distributed data:
– Standards for viewing, analyzing, and mining remote and
distributed data
– e.g. standards for the format of the data and the name of the
training set used to build a classification tree
• Interfaces and APIs:
– Standard data mining APIs for Java and SQL
– e.g. a description of the API so that a classification tree
can be built on data in a SQL database
Data Mining Process
 Data mining process models are used to guide the
implementation of data mining on large amounts of
data.
• The three most popular data mining process models are:
– CRISP-DM (Cross-Industry Standard Process for Data
Mining)
– Knowledge Discovery Database (KDD) process model
– Sample, Explore, Modify, Model, Assess (SEMMA)
Data Mining Process
• CRISP-DM, which stands for Cross-Industry Standard
Process for Data Mining, is a widely accepted and
well-documented process model for data mining.
• It provides a structured approach to guide data mining
projects and
• Consists of six main phases: Business Understanding, Data
Understanding, Data Preparation, Modeling, Evaluation,
and Deployment.
Data Mining Process
• SEMMA (Sample, Explore, Modify, Model, Assess):
• SEMMA is a data mining process model developed by SAS
Institute.
• It consists of five main phases: Sample, Explore, Modify,
Model, and Assess.
• SEMMA places a strong emphasis on data exploration and
manipulation, allowing analysts to gain insights into the data.
• It provides a structured approach to data mining and is often
used in conjunction with SAS software.
Data Mining Process
Knowledge Discovery Database (KDD) process model
• The KDD process model refers to a comprehensive
framework for knowledge discovery from data,
specifically in the context of data mining and machine
learning.
• The KDD process model consists of various stages and
tasks, including data selection, preprocessing,
transformation, data mining, evaluation, and
interpretation of the results.
Data Mining Process
KDD model consists of nine steps
1. Understanding of the application domain: The first stage, in
which goals are defined from the customer's point of view and
• The relevant prior knowledge and the goals of the end
user of the discovered knowledge are learned
2. Creating target dataset:
• The aim of this stage is to create a representative dataset that
contains the necessary information for analysis in the problem
domain.
• This dataset will be used in later stages to uncover patterns,
relationships, and insights.
Data Mining Process
• Identifying available data, acquiring relevant data, and
integrating all the data into a single set.
• This process considers the important attributes that need to
be included for effective knowledge discovery.
3. Data cleaning and preprocessing:
• Strategies are developed for handling noisy and
inconsistent data
• It incorporates data cleaning, handling missing
values, and removal of noise or outliers.
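As a concrete illustration of this step, here is a minimal sketch, assuming pandas and a hypothetical blood_pressure column; it shows one possible way to impute missing values and drop implausible records, not a prescribed method.

```python
import pandas as pd
import numpy as np

# Hypothetical patient records with missing and noisy values
df = pd.DataFrame({
    "age": [34, 51, np.nan, 29, 62],
    "blood_pressure": [120, 135, 128, 420, np.nan],  # 420 is an outlier
})

# Handle missing values: impute numeric columns with the median
df = df.fillna(df.median(numeric_only=True))

# Remove implausible values using a simple domain rule
df_clean = df[df["blood_pressure"].between(60, 250)]
print(df_clean)
```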
Data Mining Process
4. Data transformation:
• The process of generating data suitable for data mining is
carried out.
• This stage involves identifying valuable attributes
through the use of techniques such as dimension
reduction and transformation, and discovering a
consistent representation of the data.
• Various methods for reducing and transforming data are
applied to the desired data.
Data Mining Process
5. Choose data mining task:
• Selecting the appropriate data mining task for the problem
domain, such as classification, clustering, or association rule mining.

6. Choosing the suitable data mining algorithm:
• One or more appropriate data mining algorithms are selected for
searching for different patterns in the data.
• A number of algorithms are available today for data mining, and
• Appropriate algorithms are selected based on how well they match
the overall criteria for data mining
Data Mining Process
7. Employing data mining algorithms:
• Applying the selected algorithm to the data to discover
patterns, trends, and relationships.
8. Interpreting patterns:
• Focuses on interpretation and evaluation of the mined
patterns
• Involves visualization of the data based on the extracted
models and determining their significance for the problem
domain.
Data Mining Process
9. Using discovered knowledge:
– In the final step, the discovered knowledge is
used for different purposes
– Incorporating the discovered knowledge into the
performance system, and documenting and reporting it
to the interested parties.
– This step may also include checking and resolving
potential conflicts with previously believed knowledge
Data Mining Process

• Selection: Obtain data from various sources.
• Preprocessing: Cleanse the data and fill in incomplete records.
• Transformation: Convert data from different sources into a
common format.
• Data Mining: Apply data mining techniques to obtain the
desired results.
• Interpretation/Evaluation: Present results to the user in a
meaningful manner using various visualization and GUI
strategies.
Common Techniques Used in Mining …cont’d
Types of Data Mining Tasks

• Predictive: the collected data is used to train a model for
making future predictions.
– Uses some variables to predict unknown or future values
of other variables.
– Includes regression and classification
• Descriptive: general properties of the database are
determined.
– Includes visualization, clustering, dimensionality
reduction, and association rule mining.
Common Techniques Used in Mining …cont’d
Type of Learning
1. Supervised learning is a data mining technique that
involves creating models using labeled data, where each
instance has a known output label.
 The aim is to develop a model that can accurately predict
the labels of new, unseen instances.
 Includes Regression and Classification
 Decision trees, random forests, support vector machines
(SVM), Naive Bayes, and neural networks are all
examples of supervised learning algorithms used in data
mining.
Common Techniques Used in Mining…cont’d
2. Unsupervised learning:
• is concerned with the identification of patterns, relationships, or
structures within the data without the need for specific target labels.
• The primary goal is to uncover the inherent structure of the data and
uncover intriguing patterns or clusters.
• Popular techniques used in unsupervised learning for data mining
include the following (a brief clustering sketch follows this list):
 K-means clustering,
 Hierarchical clustering,
 Association rule mining, and
 Anomaly detection.
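Here is that sketch: a minimal K-means example with scikit-learn on synthetic, unlabeled data. The feature pairing and cluster count are illustrative assumptions, not taken from the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic, unlabeled data: hypothetical [age, blood_pressure] pairs
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([40, 120], [5, 8], size=(50, 2)),  # one synthetic group
    rng.normal([65, 150], [5, 8], size=(50, 2)),  # another synthetic group
])

# Partition the data into 2 clusters; no class labels are used
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # discovered group centers
print(kmeans.labels_[:10])      # cluster assignment per instance
```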
Common Techniques Used in Mining…cont’d
3. Semi-supervised Learning: Semi-supervised learning in
data mining combines both labeled and unlabeled data to
improve the learning process.

4. Reinforcement Learning: Reinforcement learning is
more commonly associated with tasks involving agent-
environment interactions than with classical data mining.
Common tasks of data mining
Classification
– A supervised learning technique used in data mining and
machine learning to assign predefined class labels to instances
based on their feature values.
– The data is typically represented as a set of instances, where
each instance is described by a set of attributes or features.
– The class labels are known for the training instances
– Useful for disease diagnosis, patient risk stratification,
treatment outcome prediction, and detecting insurance
claim fraud.
Common Techniques Used in Mining…cont’d

• The process of classification is based on four
fundamental components:
– The class

– Predictor

– Training dataset

– Testing dataset
Common Techniques Used in Mining…cont’d

• Class:
– The dependent variable of the model
– Is a categorical variable representing the ‘label’ put on
the object after its classification
• Example
– Presence of myocardial infarction
– Customer loyalty
– Condition of a patient
Common Techniques Used in Mining…cont’d
• Predictor:
– The independent variable of the model
– Represented by the attributes of the data to be classified,
based on which the classification is made
• Examples
– Smoking
– Alcohol consumption
– Blood pressure
– Marital status, etc.
Common Techniques Used in Mining…cont’d
• Training dataset:
– The set of data containing values for the class and
predictors
– Used for ‘training’ the model to recognize the appropriate
class, based on available predictors
• Examples
– Groups of patients tested for heart attacks
– Groups of customers of a supermarket
– Databases containing images from telescope monitoring
Common Techniques Used in Mining…cont’d
• Testing dataset:
– is a distinct set of examples utilized to assess the
efficacy of the trained classification model.
– During the training phase, the model does not encounter
the testing dataset, which comprises instances with
known class labels.
– To evaluate the accuracy and effectiveness of the
classification model, its predictions on the testing
dataset are compared to the actual class labels.
Common Techniques Used in Mining…cont’d

Classification
Common Techniques Used in Mining…cont’d
• Decision tree is a flowchart-like tree structure where:
• Each internal node (non-leaf node) denotes a test on an
attribute
• Each branch represents an outcome of the test
• Each leaf node (terminal node) holds a class label
• The topmost node in a tree is the root node
• A decision tree is a hierarchical model whereby the local
region is identified in a sequence of recursive splits in a
smaller number of steps.
Common Techniques Used in Mining…cont’d

Figure 5.1: Pictorial representation of a decision tree classifier


Common Techniques Used in Mining…cont’d
• How are decision trees used for classification?
 Data preparation is the initial stage in classification
where the dataset is prepared.
– This involves choosing important features or attributes
and dealing with missing values or outliers.
– Categorical attributes may need to be converted into
numerical representations, depending on the decision
tree algorithm being utilized.
Common Techniques Used in Mining…cont’d

How are decision trees used for classification?
 Training Dataset: A labeled training dataset is needed
for building the decision tree model.
 Each instance in the dataset should have a class label
associated with it.
 The dataset consists of instances described by a set of
features and their corresponding class labels.
Common Techniques Used in Mining…cont’d
How are decision trees used for classification?
 Construct a decision tree:
• The best attribute or feature is selected at each node to
split the data.
• This is done using criteria such as information gain, gain
ratio, or the Gini index, to maximize the separation of
classes.
• The selected attribute creates child nodes or branches in
the tree, and
• this process continues recursively until a stopping
criterion is met, such as reaching a minimum number of
instances or maximum depth.
Common Techniques Used in Mining…cont’d
 How are decision trees used for classification?
 Classification:
• Once the tree is built, it can be used to classify new, unseen
instances.

• Starting at the root node, each instance traverses the tree by
following the appropriate branches based on the attribute
values.
• At each internal node, the instance is directed to the child
node that matches its attribute value.
• This process continues until a leaf node is reached, which
represents a class label.
• The class label of the leaf node is assigned to the instance as
its predicted class.
Common Techniques Used in Mining…cont’d
How are decision trees used for classification?
 Model Evaluation:
 The performance of the decision tree model is evaluated using
a separate testing dataset.
• Decision tree classifiers do not require domain
knowledge or parameter setting
• Decision tree induction algorithms have been used for
classification in many application areas such as medicine,
manufacturing and production, financial analysis,
astronomy, and molecular biology
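The workflow just described (prepare labeled data, hold out a testing dataset, construct the tree, classify unseen instances, evaluate) can be sketched with scikit-learn as follows; the dataset and parameter choices are illustrative assumptions, not part of the slides.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Labeled data: each instance has feature values and a known class label
X, y = load_breast_cancer(return_X_y=True)

# Hold out a testing dataset the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Construct the tree; max_depth acts as a simple pre-pruning criterion
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=42)
tree.fit(X_train, y_train)

# Classify unseen instances and compare predictions to the actual labels
y_pred = tree.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```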
Building a Decision Tree
• Two-step method:
1. Tree Construction: determine the best split to find
all the branches and the leaf nodes of the tree.
2. Tree Pruning (Optimization): identify and remove
branches that are not useful for classification
• Pre-Pruning
• Post Pruning
Building a Decision Tree
• Attribute selection measures (splitting rules):
– A heuristic for selecting the splitting criterion that best
separates a given data partition, D, of class-labeled
training tuples into individual classes.
– Determines how the tuples at a given node are to be split.
– The popular attribute selection measures are:
• Information gain
• Gain ratio
• Gini index
How to Determine the Best Split
• The goodness of a split is quantified by node impurity:
– Information gain (entropy): attributes are assumed to be
categorical
– Gain ratio: extension to information gain
• Overcomes information gain’s bias toward attributes
with many distinct values
– Gini index: attributes are assumed to be continuous
• Assumes there exist several possible split values for each
attribute
(A small computational sketch of these measures follows.)
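The minimal sketch below computes entropy, the Gini index, and the information gain of a candidate split from class counts; the 'sick'/'healthy' labels are hypothetical, and this is not a full tree-induction algorithm.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the weighted entropy of the child partitions."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

# Hypothetical node with 4 'sick' and 6 'healthy' instances, split in two
parent = ["sick"] * 4 + ["healthy"] * 6
split = [["sick"] * 3 + ["healthy"] * 1, ["sick"] * 1 + ["healthy"] * 5]
print("Entropy:", round(entropy(parent), 3))   # ~0.971
print("Gini:", round(gini(parent), 3))         # 0.48
print("Info gain:", round(information_gain(parent, split), 3))
```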
Stopping Criteria for Tree Induction

• Stop expanding a node when all the records belong to the
same class
• Stop expanding a node when all the records have similar
attribute values
Tree Pruning
• The node is not split further if the number of training
instances reaching a node is smaller than a certain
threshold.
– Pre-pruning: stopping tree construction early, before the
tree is fully built
– Post-pruning: finding and pruning unnecessary subtrees
to obtain a simpler tree
• Comparing pre-pruning and post-pruning:
– Pre-pruning is faster, but post-pruning leads to more
accurate trees.
Issues in Decision Trees
• Overfitting happens when the model learns the detail
and noise in the training data to the extent that it
negatively impacts performance on new data.
• A decision tree is said to overfit the training data if:
– It results in poor accuracy when classifying test samples
– It has too many branches that reflect anomalies
• Avoiding overfitting:
– Prune the tree: leaf nodes (sub-trees) are removed from
the tree as long as the pruned tree performs better on the
test data than the larger tree.
Issues in decision trees
• Underfitting refers to a model that can neither model the
training data nor generalize to new data.
• Underfitting occurs when the model is too simple; both
training and test errors are large
• Easy to detect given a good performance metric

Decision Trees Algorithms
• Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– ID3, C4.5
– CART
– CHAID (Chi-squared Automatic Interaction Detector)
– SLIQ, SPRINT, MARS
Decision Tree Classifier Advantages
• Inexpensive to construct
• Extremely fast at classifying unknown records
• Easy to interpret for small-sized trees: can be converted to
if-then rules that are easily understandable
• Accuracy is comparable to other classification techniques
for many simple data sets
Decision Tree Classifier Disadvantages

• Prone to overfitting.
• Require some kind of measurement as to how well
they are doing.
• Need to be careful with parameter tuning.
• Can create biased learned trees if some classes
dominate.
Model Evaluation
• Metrics for Performance Evaluation
– How to evaluate the performance of a model?
• Methods for Performance Evaluation
– How to obtain reliable estimates?
• Methods for Model Comparison
– How to compare the relative performance among
competing models?
Metrics for Performance Evaluation
• Focus on the predictive capability of a model
– Rather than the speed of classifying or building models,
scalability, etc.
• Confusion Matrix is a table that provides a comprehensive
summary of the performance of a classification model:

                    Predicted Positive     Predicted Negative
Actual Positive     True Positive (TP)     False Negative (FN)
Actual Negative     False Positive (FP)    True Negative (TN)
Metrics for Performance Evaluation…cont’d
From the above confusion matrix, the terms are described as:
True positives (TP): actual positives that were correctly
predicted as positive
True negatives (TN): actual negatives that were correctly
predicted as negative
False negatives (FN): actual positives that were incorrectly
predicted as negative
False positives (FP): actual negatives that were incorrectly
predicted as positive
Metrics for Performance Evaluation…cont’d
The confusion matrix can be used to calculate various performance
metrics in healthcare, including:
• Accuracy: the overall correctness of the model's predictions.
• It is calculated as (TP + TN) / (TP + TN + FP + FN).
• Accuracy provides an overall assessment of the model's
performance but may be influenced by class imbalance.

• Sensitivity (Recall or True Positive Rate):


• Measures the proportion of actual positive instances correctly
identified by the model.
• It is calculated as TP / (TP + FN).
• It is important for identifying cases with the condition and
minimizing false negatives.
Metrics for Performance Evaluation…cont’d

Limitation of Accuracy
• Consider a 2-class problem:
– Number of Class 0 examples = 9990
– Number of Class 1 examples = 10
• If model predicts everything to be class 0, accuracy is
9990/10000 = 99.9 %
– Accuracy is misleading because model does not detect
any class 1 example
Metrics for Performance Evaluation…cont’d
• Specificity (True Negative Rate): Measures the proportion
of actual negative instances correctly identified by the
model.
• It is calculated as TN / (TN + FP).
• It is important for ruling out the condition and minimizing
false positives.

• Precision: Measures the proportion of positive predictions
that are actually correct.
• It is calculated as TP / (TP + FP).
• Precision focuses on minimizing false positives and is
valuable for tasks where false alarms or unnecessary
interventions should be minimized.
Metrics for Performance Evaluation…cont’d
• Recall: measures the proportion of true positive predictions (correctly
identified positive instances) out of the total number of actual positive
instances.
 It focuses on minimizing false negatives, which means capturing as many
positive instances as possible, even if it results in some false positives.
 The formula for recall is:
Recall = True Positives / (True Positives + False Negatives)
• F1 Score:
• The harmonic mean of precision and recall,
• Calculated as F1 = 2 * (Recall * Precision) / (Recall + Precision)
• The F1 score provides a balanced measure of the model's
performance by considering both precision and recall.
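Here is a minimal sketch that computes all of these metrics directly from confusion-matrix counts; the counts below are made up for illustration.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute standard metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # recall / true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, precision, f1

# Hypothetical counts from a disease-screening model
acc, sens, spec, prec, f1 = classification_metrics(tp=80, fn=20, fp=30, tn=870)
print(f"Accuracy={acc:.3f} Sensitivity={sens:.3f} "
      f"Specificity={spec:.3f} Precision={prec:.3f} F1={f1:.3f}")
```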
Methods for Performance Evaluation
• How to obtain a reliable estimate of performance?
• Performance of a model may depend on other factors
besides the learning algorithm:
– Class distribution
– Size of training and test sets
Methods of Estimation
• Holdout: Reserve 2/3 for training and 1/3 for testing
• Random sub-sampling: Repeated holdout
• Cross validation
– Partition data into k disjoint subsets
– k-fold: train on k-1 partitions, test on the remaining one
– Leave-one-out: k=n
• Stratified sampling: oversampling vs. under-sampling
• Bootstrap: sampling with replacement
(A brief holdout and cross-validation sketch follows.)
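The sketch below applies two of these estimation methods, a holdout split and 5-fold cross-validation, using scikit-learn; the dataset and fold count are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Holdout: reserve 1/3 of the data for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)
print("Holdout accuracy:", clf.fit(X_tr, y_tr).score(X_te, y_te))

# k-fold cross-validation: train on k-1 partitions, test on the remaining one
scores = cross_val_score(clf, X, y, cv=5)
print("5-fold accuracies:", scores.round(3), "mean:", scores.mean().round(3))
```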
Association Rule Mining
• Association rule mining is a data mining technique used
to discover interesting relationships or associations
between items in large datasets.
• The association rule mining technique, which examines the
concurrent correlations between several variables in a
grouping, was first created by Agrawal and Srikant
• Applied in the healthcare domain to uncover meaningful
associations and patterns within large healthcare datasets.
Association Rule Mining
• To produce association rules, the focus is primarily on features
that imply the target features (Antecedent => Consequent),
• Which is a way to identify all the variables that contribute to
the dependent variable.
• These rules are often referred to as classification association
rules.
• For a given rule X => Y, where the feature sets X and Y are
mutually exclusive, Support, Confidence, and Lift are
defined as follows:
Association Rule Mining
• Support, confidence, and lift are key measures used in association
rule mining to evaluate the significance and strength of association
rules.
1. Support: It indicates how frequently an itemset occurs in the
dataset.
Support = (frequency of itemset) / (total number of transactions)
2. Confidence: It represents the strength of the implication from the
antecedent to the consequent.
Confidence = (frequency of antecedent and consequent) / (frequency of antecedent)

3. Lift: measures the strength of association between the antecedent
and the consequent in an association rule, taking into account the
expected probability of the consequent occurring independently of
the antecedent.
Lift = (support of antecedent and consequent) / ((support of antecedent) * (support of consequent))
• Example: Let's consider a hypothetical scenario in which we are
analyzing a dataset of medical records to discover associations
between different medical conditions.
• We want to determine the support, confidence, and lift measures for
the rule "Diabetes => Hypertension" based on the dataset.
Assume the following information:
• Total number of transactions (medical records) in the dataset: N =
1000
• Frequency of the presence of both Diabetes (X) and Hypertension
(Y): frequency(X, Y) = 200
• Frequency of the presence of Diabetes (X): frequency(X) = 400
• Frequency of the presence of Hypertension (Y): frequency(Y) = 600

• Using these values, calculate the support, confidence, and lift.
• Support: measures the proportion of transactions in the dataset that
contain both Diabetes and Hypertension.
Support = frequency(X, Y) / N
        = 200 / 1000
        = 0.2, so the support of "Diabetes => Hypertension" is 0.2 (20%).
• Confidence: measures the conditional probability of Hypertension
given Diabetes, or the proportion of transactions that contain both
Diabetes and Hypertension out of the transactions that contain Diabetes
Confidence = frequency(X, Y) / frequency(X)
           = 200 / 400
           = 0.5, so the confidence of "Diabetes => Hypertension" is 0.5 (50%).
• Lift: measures the strength of association between Diabetes and
Hypertension, relative to the probability of Hypertension expected
if Diabetes and Hypertension were independent.
Lift = support(X, Y) / (support(X) * support(Y))
     = (frequency(X, Y) * N) / (frequency(X) * frequency(Y))
     = (200 * 1000) / (400 * 600)
     ≈ 0.83
• Since the lift is slightly below 1, Diabetes and Hypertension co-occur
slightly less often than would be expected under independence.
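The sketch below recomputes all three measures from the same assumed counts, which makes the corrected lift calculation easy to verify.

```python
def rule_measures(freq_xy, freq_x, freq_y, n):
    """Support, confidence, and lift for a rule X => Y from raw counts."""
    support = freq_xy / n
    confidence = freq_xy / freq_x
    lift = support / ((freq_x / n) * (freq_y / n))
    return support, confidence, lift

# Counts from the Diabetes => Hypertension example above
s, c, l = rule_measures(freq_xy=200, freq_x=400, freq_y=600, n=1000)
print(f"support={s:.2f} confidence={c:.2f} lift={l:.2f}")  # 0.20 0.50 0.83
```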
What Is Frequent Pattern Analysis?
• Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a dataset
• First proposed by Agrawal et al.
• Applications: basket data analysis, cross-marketing, catalog design,
sales campaign analysis, Web log (click-stream) analysis, and DNA
sequence analysis.

• Motivation: Finding inherent regularities in data


– What products were often purchased together?— Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?

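To make "occurs frequently" concrete, here is a small, illustrative support counter over toy transactions; it is a brute-force sketch, not the Apriori algorithm.

```python
from itertools import combinations
from collections import Counter

# Toy transactions (market-basket style)
transactions = [
    {"beer", "diapers", "bread"},
    {"beer", "diapers"},
    {"bread", "milk"},
    {"beer", "diapers", "milk"},
]
min_support = 0.5  # an itemset must appear in at least half the transactions

# Count every 1- and 2-item itemset across all transactions
counts = Counter()
for t in transactions:
    for k in (1, 2):
        for itemset in combinations(sorted(t), k):
            counts[itemset] += 1

n = len(transactions)
frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
print(frequent)  # e.g. ('beer', 'diapers') has support 0.75
```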
Challenges of Implementing Healthcare Data Mining

• Healthcare data is frequently spread out across
different systems and organizations, making it difficult
to access and combine.
• This data can be found in electronic health records
(EHR), medical imaging systems, laboratory systems,
claims data, and other sources.
• Ensuring that the data is compatible, standardized, and
securely shared between these different sources can be
a complex task.
• Missing, corrupted, inconsistent, or non-standardized
data, such as pieces of information recorded in different
formats in different data sources
Challenges of Implementing…cont’d
• Lack of a standard clinical vocabulary is a serious
hindrance to data mining
• Massive amounts of patient data being shared during
the data mining process increase patients' concerns that
their personal information could fall into the wrong hands
• There may be ethical, legal and social issues, such as data
ownership and privacy issues, related to healthcare data
Challenges of Implementing…cont’d
• Protecting patient privacy and ensuring data security
throughout the data mining process is a significant challenge.
• The successful application of data mining requires knowledge
of the domain area as well as of data mining methodology and
tools
• Without a sufficient knowledge of data mining, the user may
not be aware of or be able to avoid the pitfalls of data mining
• Data mining requires intensive planning and technological
preparation work
Challenges of Implementing…cont’d
• Healthcare organizations developing data mining
applications must make a substantial investment of
resources, particularly time, effort and money
• Data mining projects can fail for a variety of reasons,
such as lack of management support, unrealistic user
expectations, poor project management, inadequate data
mining expertise, etc.
• Physicians and executives have to be convinced of the
usefulness of data mining
The Future of Healthcare Data Mining

• Data mining applications in healthcare can have
tremendous potential and usefulness
• The success of healthcare data mining hinges on
the availability of clean healthcare data
• It is crucial that the healthcare industry consider
how data can be better captured, stored, prepared,
and mined
The Future of Healthcare…cont’d
• Standardization of clinical vocabulary and the
sharing of data across organizations would enhance the
benefits of healthcare data mining applications
• Healthcare data are not limited to quantitative data; they
also include qualitative data such as physicians' notes and
clinical records
• The future of healthcare data mining holds immense
potential for transforming healthcare delivery,
improving patient outcomes, and advancing medical
knowledge.
