CS-DM MODULE - 1
MALLA REDDY ENGINEERING COLLEGE (Autonomous)
Maisammaguda, Dhulapally, Secunderabad - 500100
2021-22 (MR-21) Onwards, B.Tech. VI Semester
DATA MINING
Code: B0536    L-T-P: 3-0-0    Credits: 3
Prerequisites: NIL
Course Objectives:
This course enables students to understand the stages in building a data warehouse, identify
the need for and importance of preprocessing techniques, implement similarity and
dissimilarity measures, analyze and evaluate the performance of algorithms for association
rules, and analyze classification and clustering algorithms.
MODULE I: Introduction and Mining Issues & Data [09 Periods]
Introduction – Data, Why Data Mining? What Is Data Mining? What Kinds of Data Can Be Mined? What
Kinds of Patterns Can Be Mined? Which Technologies Are Used? Which Kinds of Applications Are
Targeted?
Mining Issues and Data - Major Issues in Data Mining, Types of Data, Data Quality
MODULE II: Data, Data Pre-processing [9 Periods]
A: Data Pre-processing: Data Warehousing, Data Cleaning, Data Integration, Data Reduction, Data
Transformation, Aggregation, Sampling
B: Techniques: Dimensionality Reduction, Feature Subset Selection, Feature Creation, Data Discretization
and Binarization, Variable Transformation.
MODULE III: Data Similarity and Dissimilarity Classification [10 Periods]
A: Measuring Data Similarity and Dissimilarity - Similarity and Dissimilarity between simple attributes,
Dissimilarities and similarities between data objects, Examples of Proximity measures, Issues in Proximity
Calculation, Selection of the right proximity measure.
B: Classification - Basic Concepts, General Approach to solving a classification problem, Decision Tree
Induction: Working of Decision Tree, Building a Decision Tree.
MODULE IV: Classifier and Association Analysis [10 Periods]
Classifiers - Alternative Techniques, Bayes' Theorem, Naïve Bayesian Classification, Bayesian Belief
Networks. Association Analysis - Basic Concepts and Algorithms: Problem Definition, Frequent Item Set
Generation, Rule Generation, Compact Representation of Frequent Item Sets, FP-Growth Algorithm.
MODULE V: Cluster Analysis and DBSCAN [10 Periods]
Cluster Analysis - Basic Concepts and Algorithms: Overview: What Is Cluster Analysis? Different Types of
Clustering, Different Types of Clusters; K-means: The Basic K-means Algorithm, K-means Additional
Issues, Bisecting K-means, Strengths and Weaknesses; Agglomerative Hierarchical Clustering: Basic
Agglomerative Hierarchical Clustering Algorithm
DBSCAN - Traditional Density: Center-Based Approach, DBSCAN Algorithm, Strengths and Weaknesses.
TEXTBOOKS
1. Pang-Ning Tan, Michael Steinbach and Vipin Kumar, "Introduction to Data Mining", Pearson.
2. Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", 3/e, Elsevier.
REFERENCES
1. Hongbo Du, "Data Mining Techniques and Applications: An Introduction", Cengage Learning.
2. Vikram Pudi and P. Radha Krishna, "Data Mining", Oxford.
3. Mohammed J. Zaki and Wagner Meira, Jr., "Data Mining and Analysis: Fundamental Concepts and Algorithms", Cambridge.
4. Alex Berson and Stephen J. Smith, "Data Warehousing, Data Mining and OLAP", TMH.
E-RESOURCES
1. http://www-users.cs.umn.edu/~kumar/dmbook/index.php
http://myweb.sabanciuniv.edu/rdehkharghani/files/2016/02/The-Morgan-Kaufmann-Series-in-Data-Management-Systems-Jiawei-Han-Micheline-Kamber-Jian-Pei-Data-Mining.-Concepts-and-Techniques-3rd-Edition-Morgan-Kaufmann-2011.pdf
2. http://www.ijctee.org/files/Issuethree/IJCTEE_1111_20.pdf
3. http://www.ccsc.org/southcentral/E-Journal/2010/Papers/Yihao%20final%20paper%20CCSC%20for%20submission.pdf
4. https://gunjesh.wordpress.com/
UNIT-1
Introduction to Mining and Issues in Data Mining
Data mining is widely applied to credit ratings and plays an important role in card
transactions to detect and analyze fraud.
The evolution of database technology that led to data mining can be summarized as follows:
1960s − Data collection and database creation; IMS and network DBMS
1970s − Relational data model; relational DBMS implementation
1990s − Data mining, data warehouses, multimedia databases and web databases
2000s − Stream data management, data mining and its applications, web technology
Some people treat data mining as a synonym for Knowledge Discovery from Data (KDD),
while others view data mining as an essential step in the process of knowledge discovery.
The steps involved in the knowledge discovery process are as follows −
Data Cleaning − In this step, noise and inconsistent data are removed.
Data Integration − In this step, data from multiple sources are combined.
Data Selection − In this step, data relevant to the analysis task are retrieved from the database.
Data Transformation − In this step, data are transformed into forms appropriate for mining.
Data Mining − In this step, intelligent methods are applied to extract data patterns.
Pattern Evaluation − In this step, the truly interesting patterns representing knowledge are identified.
Knowledge Presentation − In this step, visualization and knowledge representation techniques are used to present the mined knowledge to users.
1. Flat Files
Flat files are data files in text or binary form with a structure that can be easily extracted
by data mining algorithms.
Data stored in flat files have no relationships or paths among themselves; for example, if a
relational database is stored in flat files, there will be no relations between the tables.
Flat files are represented by a data dictionary. Eg: CSV file.
Application: Used in data warehousing to store data, used in carrying data to and from a
server, etc.
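As a small illustration (a hedged sketch; the file name sample.csv and its columns are
invented, not part of these notes), a flat file in CSV form can be read with Python's
standard csv module:

    import csv

    # Read a flat file (CSV). Each row is an independent record; any
    # relationship between rows must be handled by the program itself.
    with open("sample.csv", newline="") as f:
        reader = csv.DictReader(f)    # the first row is treated as the header
        for row in reader:
            print(row)                # each row is a dict of column -> value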
2. Relational Databases
A relational database is a collection of data organized in tables with rows and columns.
The physical schema of a relational database defines the structure of its tables.
The logical schema of a relational database defines the relationships among its tables.
The standard query language for relational databases is SQL.
Application: Data mining, ROLAP model, etc.
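For illustration, the sketch below (assumptions: Python's built-in sqlite3 module and a
made-up customer table) shows SQL acting as the standard query language of a relational
database:

    import sqlite3

    # Create an in-memory relational database with one table (rows and columns).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
    conn.execute("INSERT INTO customer VALUES (1, 'Asha', 'Hyderabad')")
    conn.execute("INSERT INTO customer VALUES (2, 'Ravi', 'Secunderabad')")

    # A declarative SQL query retrieves the matching rows.
    for row in conn.execute("SELECT name, city FROM customer WHERE city = 'Secunderabad'"):
        print(row)                    # -> ('Ravi', 'Secunderabad')
    conn.close()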
3. Data Warehouse
A data warehouse is a collection of data integrated from multiple sources that supports
querying and decision making.
There are three types of data warehouse: Enterprise Data Warehouse, Data Mart and
Virtual Warehouse.
Two approaches can be used to update data in a data warehouse: the query-driven approach
and the update-driven approach.
Application: Business decision making, data mining, etc.
4. Transactional Databases
Transactional databases are collections of data organized by time stamps, dates, etc. to
represent transactions in databases.
This type of database has the capability to roll back or undo an operation when a
transaction is not completed or committed.
It is a highly flexible system where users can modify information without changing any
sensitive information.
It follows the ACID properties of DBMS.
Application: Banking, distributed systems, object databases, etc.
5. Multimedia Databases
Multimedia databases consist of audio, video, image and text media.
They can be stored on object-oriented databases.
They are used to store complex information in pre-specified formats.
Application: Digital libraries, video-on-demand, news-on-demand, musical databases, etc.
6. Spatial Databases
Spatial databases store geographical information.
They store data in the form of coordinates, topology, lines, polygons, etc.
Application: Maps, global positioning, etc.
7. Time-series Databases
Time-series databases contain data such as stock exchange data and user-logged activities.
They handle arrays of numbers indexed by time, date, etc.
They require real-time analysis.
Application: eXtremeDB, Graphite, InfluxDB, etc.
8. WWW
The World Wide Web (WWW) is a collection of documents and resources such as audio,
video and text, which are identified by Uniform Resource Locators (URLs), linked by
HTML pages, accessed through web browsers, and available over the Internet.
It is the most heterogeneous repository, as it collects data from multiple sources.
It is dynamic in nature, as the volume of data is continuously increasing and changing.
Application: Online shopping, job search, research, studying, etc.
Data Mining Functionalities
a) Descriptive Function
The descriptive function deals with the general properties of data in the database.
Here is the list of descriptive functions −
1. Class/Concept Description
2. Mining of Frequent Patterns
3. Mining of Associations
4. Mining of Correlations
5. Mining of Clusters
1. Class/Concept Description
Class/Concept refers to the data to be associated with classes or concepts. For example, in
a company, classes of items for sale include computers and printers, and concepts of
customers include big spenders and budget spenders. Such descriptions of a class or a
concept are called class/concept descriptions. These descriptions can be derived in the
following two ways −
Data Characterization − This refers to summarizing the data of the class under study. The
class under study is called the target class.
Data Discrimination − This refers to comparing the target class with one or more
contrasting (predefined) classes.
2. Mining of Frequent Patterns
Frequent patterns are those patterns that occur frequently in transactional data. Here is the
list of kinds of frequent patterns −
Frequent Item Set − It refers to a set of items that frequently appear together, for
example, milk and bread.
Frequent Subsequence − A sequence of patterns that occurs frequently, such as purchasing
a camera followed by a memory card.
Frequent Sub Structure − Substructure refers to different structural forms, such as graphs,
trees, or lattices, which may be combined with item-sets or subsequences.
3. Mining of Associations
Associations are used in retail sales to identify items that are frequently purchased
together. This refers to the process of uncovering relationships among data and
determining association rules.
For example, a retailer may generate an association rule showing that 70% of the time milk
is sold with bread, while only 30% of the time biscuits are sold with bread.
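The percentages in such a rule are confidences. As a rough sketch (the tiny transaction
list below is invented for illustration only), the support and confidence of the rule
bread -> milk can be computed directly:

    # Hypothetical market-basket transactions.
    transactions = [
        {"milk", "bread"},
        {"milk", "bread", "butter"},
        {"bread", "biscuits"},
        {"milk", "bread"},
        {"milk"},
    ]

    n = len(transactions)
    bread = sum(1 for t in transactions if "bread" in t)
    both = sum(1 for t in transactions if {"bread", "milk"} <= t)

    support = both / n          # fraction of all transactions containing both items
    confidence = both / bread   # fraction of bread transactions that also contain milk
    print(support, confidence)  # 0.6 and 0.75 for this toy data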
4. Mining of Correlations
It is a kind of additional analysis performed to uncover interesting statistical correlations
between associated attribute-value pairs or between two item sets, to determine whether
they have a positive, negative or no effect on each other.
5. Mining of Clusters
Cluster refers to a group of similar objects. Cluster analysis refers to forming groups of
objects that are very similar to each other but highly different from the objects in other
clusters.
Objects are clustered or grouped based on the principle of maximizing intra-cluster
similarity and minimizing inter-cluster similarity.
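As a minimal sketch of this principle (assuming scikit-learn is available; the
two-dimensional points are invented), k-means groups objects so that points within a
cluster are close to one another:

    import numpy as np
    from sklearn.cluster import KMeans

    # Toy 2-D points forming two visibly separate groups.
    X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
                  [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]])

    # K-means maximizes intra-cluster similarity by minimizing the
    # distance of each point to the centroid of its cluster.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)            # cluster id assigned to each point
    print(km.cluster_centers_)   # centroid of each cluster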
b) Classification and Prediction
Classification and prediction cover the following functions:
Classification (IF-THEN) Rules
Prediction
Decision Trees
Mathematical Formulae
Neural Networks
Outlier Analysis
Evolution Analysis
The functions involved in these processes are described as follows −
Classification − It predicts the class of objects whose class label is unknown. Its objective
is to find a derived model that describes and distinguishes data classes or concepts. The
derived model is based on the analysis of a set of training data, i.e., data objects whose
class labels are known.
Prediction − It is used to predict missing or unavailable numerical data values rather than
class labels. Regression Analysis is generally used for prediction. Prediction can also be used for
identification of distribution trends based on available data.
Decision Trees − A decision tree is a structure that includes a root node, branches, and leaf
nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test,
and each leaf node holds a class label.
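A short hedged sketch (assuming scikit-learn and its bundled iris data set, which are not
part of these notes) of a decision tree learned from labelled training data and used to
predict class labels:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    # Training data: objects whose class labels are already known.
    iris = load_iris()
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(iris.data, iris.target)

    # Each internal node tests an attribute; each leaf holds a class label.
    print(tree.predict(iris.data[:5]))   # predicted classes for the first 5 objects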
Mathematical Formulae – Data can be mined by using some mathematical formulas.
Neural Networks − Neural networks represent a brain metaphor for information processing.
These models are biologically inspired rather than an exact replica of how the brain actually
functions. Neural networks have been shown to be very promising systems in many forecasting
applications and business classification applications due to their ability to “learn” from the data.
Outlier Analysis − Outliers may be defined as data objects that do not comply with the
general behavior or model of the available data.
Evolution Analysis − Evolution analysis refers to the description and modeling of
regularities or trends for objects whose behavior changes over time.
Data Mining Task Primitives
Set of task-relevant data to be mined.
Kind of knowledge to be mined.
Background knowledge to be used in the discovery process.
Interestingness measures and thresholds for pattern evaluation.
Representation for visualizing the discovered patterns.
Technologies Used in Data Mining
1. Statistics:
Statistics uses mathematical analysis to express representations, models and summaries of
empirical data or real-world observations.
Statistical analysis involves a collection of methods applicable to large amounts of data in
order to draw conclusions and report trends.
2. Machine Learning
Arthur Samuel defined machine learning as a field of study that gives computers the ability
to learn without being explicitly programmed.
When new data is entered into the computer, machine learning algorithms allow the model
to grow or change accordingly.
In machine learning, an algorithm is constructed to predict new data from the available
database (predictive analysis).
It is closely related to computational statistics.
3. Information Retrieval
Information retrieval deals with uncertain representations of the semantics of objects (text,
images).
The technique searches for information in documents, which may be text, multimedia, or
residing on the Web. It has two main characteristics:
1. The searched data is unstructured.
2. Queries are formed by keywords that do not have complex structures.
The most widely used information retrieval approach is the probabilistic model.
Information retrieval combined with data mining techniques is used for finding any
relevant topic in a document or on the web.
Uses: A large amount of data, both text and multimedia, is available and streamed on the
web due to the fast growth of digitalization in the government sector, health care, and
many other fields. The search and analysis of this data raise many challenges, and hence
information retrieval becomes increasingly important.
For example: Finding relevant information in a large document.
4. Database Systems and Data Warehouses
Databases are used for recording data as well as for data warehousing.
Online Transaction Processing (OLTP) uses databases for day-to-day transaction purposes.
Data warehouses are used to store historical data, which helps to make strategic decisions
for business.
They are used for Online Analytical Processing (OLAP), which helps to analyze the data.
5. Pattern Recognition:
Pattern recognition is the automated recognition of patterns and regularities in data. Pattern
recognition is closely related to artificial intelligence and machine learning, together with
applications such as data mining and knowledge discovery in databases (KDD), and is often used
interchangeably with these terms.
6. Visualization:
It is the process of extracting and presenting data in a clear and understandable way,
without any extensive reading or writing, by displaying results in the form of pie charts,
bar graphs, statistical representations and other graphical forms.
7. Algorithms:
To perform data mining, we have to design efficient algorithms.
8. High Performance Computing:
High performance computing generally refers to the practice of aggregating computing
power in a way that delivers much higher performance than a typical desktop computer or
workstation, in order to solve large problems in science, engineering or business.
Applications of Data Mining
Data mining is widely used in the following areas:
Financial Data Analysis
Retail Industry
Telecommunication Industry
Biological Data Analysis
Other Scientific Applications
Intrusion Detection
Financial Data Analysis
The financial data in the banking and financial industry is generally reliable and of high
quality, which facilitates systematic data analysis and data mining. Examples include:
Loan payment prediction and customer credit policy analysis.
Detection of money laundering and other financial crimes.
Retail Industry
Data mining has great application in the retail industry because it collects large amounts of
data on sales, customer purchasing history, goods transportation, consumption and
services. It is natural that the quantity of data collected will continue to expand rapidly
because of the increasing ease, availability and popularity of the web.
Telecommunication Industry
Today the telecommunication industry is one of the fastest growing industries, providing
various services such as fax, pager, cellular phone, internet messenger, images, e-mail, web
data transmission, etc. Due to the development of new computer and communication
technologies, the telecommunication industry is rapidly expanding. This is why data
mining has become very important in helping to understand the business.
Biological Data Analysis
In recent times, we have seen tremendous growth in fields of biology such as genomics,
proteomics, functional genomics and biomedical research. Biological data mining is a very
important part of bioinformatics.
Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or
availability of network resources. In this world of connectivity, security has become a
major issue. The increased usage of the internet and the availability of tools and tricks for
intruding into and attacking networks have prompted intrusion detection to become a
critical component of network administration.
Major Issues in Data Mining
a) Mining Methodology:
Pattern evaluation − The patterns discovered should be interesting; patterns may be
regarded as uninteresting if they represent common knowledge or lack novelty.
b) User Interaction:
Interactive mining of knowledge at multiple levels of abstraction − The data mining
process needs to be interactive because interaction allows users to focus the search for
patterns, providing and refining data mining requests based on the returned results.
Incorporation of background knowledge − Background knowledge can be used to guide
the discovery process and to express the discovered patterns. It may be used to express
discovered patterns not only in concise terms but at multiple levels of abstraction.
Data mining query languages and ad hoc data mining − A data mining query language
that allows the user to describe ad hoc mining tasks should be integrated with a data
warehouse query language and optimized for efficient and flexible data mining.
Presentation and visualization of data mining results − Once patterns are discovered,
they need to be expressed in high-level languages and visual representations. These
representations should be easily understandable.
c) Efficiency and Scalability
There can be performance-related issues such as the following −
Efficiency and scalability of data mining algorithms − In order to effectively extract
information from the huge amounts of data in databases, data mining algorithms must be
efficient and scalable.
Parallel, distributed, and incremental mining algorithms − Factors such as the huge size
of databases, the wide distribution of data, and the complexity of data mining methods
motivate the development of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions, which are processed in parallel; the results from
the partitions are then merged. Incremental algorithms update existing mining results
without mining the data again from scratch.
d) Diverse Data Types Issues
Handling of relational and complex types of data − A database may contain complex
data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for
one system to mine all these kinds of data.
Types of Data:
Data Object:
An object is a real-world entity.
Attribute:
An attribute is a data field that represents a characteristic or feature of a data object. For a
customer object, attributes can be customer ID, address, etc. The attribute types can be
represented as follows −
1. Nominal Attributes – related to names: The values of a nominal attribute are names of
things or some kind of symbols. Values of nominal attributes represent some category or
state, which is why nominal attributes are also referred to as categorical attributes.
Example:
Attribute Values
Colors Black, Green, Brown, Red
2. Binary Attributes: Binary data has only two values/states. For example: yes or no,
affected or unaffected, true or false.
i) Symmetric: Both values are equally important (e.g., gender).
ii) Asymmetric: Both values are not equally important (e.g., result).
3. Ordinal Attributes: Ordinal attributes contain values that have a meaningful sequence or
ranking (order) between them, but the magnitude of the difference between values is not
known; the order of the values shows what is important but does not indicate how
important it is.
Example:
Attribute Values
Grade O, S, A, B, C, D, F
4. Numeric Attributes: A numeric attribute is quantitative, i.e., a measurable quantity
represented by integer or real values. Numeric attributes can be interval-scaled or
ratio-scaled.
5. Discrete Attributes: Discrete data have a finite or countably infinite set of values, which
may or may not be represented as integers. Example: zip codes, number of courses
registered.
6. Continuous Attributes: Continuous data have an infinite number of possible states.
Continuous data are of float type; there can be many values between 2 and 3.
Example:
Attribute Values
Height 5.4, 5.7, 6.2, etc.
Weight 50, 65, 70, 73, etc.
There are many types of data sets, and as the field of data mining develops and matures, a
greater variety of data sets becomes available for analysis. The types of data sets can be
grouped into three groups:
1. Record data,
2. Graph-based data,
3. Ordered data.
Record Data: Record data is usually stored either in flat files or in relational databases.
Relational databases are certainly more than a collection of records, but data mining often
does not use any of the additional information available in a relational database. Rather,
the database serves as a convenient place to find records.
Transaction data is a collection of sets of items, but it can be viewed as a set of records
whose fields are asymmetric attributes.
A data matrix is a variation of record data, but because it consists of numeric attributes,
standard matrix operations can be applied to transform and manipulate the data.
The Sparse Data Matrix: A sparse data matrix is a special case of a data matrix in which
the attributes are all of the same type and are asymmetric; i.e., only non-zero values are
important. Transaction data is an example of a sparse data matrix that has only 0-1
entries.
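A hedged sketch with SciPy (the 0-1 entries below are invented transaction data): a sparse
format stores only the non-zero entries, yet still supports standard matrix operations.

    import numpy as np
    from scipy.sparse import csr_matrix

    # Rows = transactions, columns = items; 1 means the item was purchased.
    dense = np.array([[1, 0, 0, 1],
                      [0, 0, 1, 0],
                      [1, 1, 0, 0]])

    sparse = csr_matrix(dense)    # compressed sparse row format
    print(sparse.nnz)             # number of non-zero (important) entries: 5
    print(sparse.sum(axis=0))     # per-item purchase counts via a matrix operation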
Data with Relationships among Objects: The relationships among objects frequently
convey important information. In such cases, the data is often represented as a graph. In
particular, the data objects are mapped to nodes of the graph, while the relationships
among objects are captured by the links between objects and link properties, such as
direction and weight.
Data with Objects That Are Graphs: If objects have structure, that is, the objects
contain sub objects that have relationships, then such objects are frequently represented
as graphs. For example, the structure of chemical compounds can be represented by a
graph, where the nodes are atoms and the links between nodes are chemical bonds.
Ordered Data: For some types of data, the attributes have relationships that involve order
in time or space. The different types of ordered data are described below.
Sequence Data: Sequence data consists of a data set that is a sequence of individual
entities, such as a sequence of words or letters. It is quite similar to sequential data, except
that there are no time stamps; instead, there are positions in an ordered sequence. For
example, the genetic information of plants and animals can be represented in the form of
sequences of nucleotides that are known as genes.
Time Series Data: Time series data is a special type of sequential data in which each record
is a time series, i.e., a series of measurements taken over time. For example, a financial
data set might contain objects that are time series of the daily prices of various stocks.
Spatial Data: Some objects have spatial attributes, such as positions or areas, as well as
other types of attributes. An example of spatial data is weather data (precipitation,
temperature, pressure) that is collected for a variety of geographical locations.
Data Quality:
The following sections discuss specific aspects of data quality. The focus is on
measurement and data collection issues, although some application-related issues are also
discussed.
Measurement and Data Collection Issues: We focus here on aspects of data quality that are
related to data measurement and collection.
Measurement and Data Collection Errors: The term measurement error refers to any
problem resulting from the measurement process. A common problem is that the value
recorded differs from the true value to some extent. For continuous attributes, the numerical
difference of the measured and true value is called the error.
The term data collection error refers to errors such as omitting data objects or attribute
values, or inappropriately including a data object. For example, a study of animals of a
certain species might include animals of a related species that are similar in appearance to
the species of interest. Both measurement errors and data collection errors can be either
systematic or random.
Noise and Artifacts: Noise is the random component of a measurement error. It may
involve the distortion of a value or the addition of spurious objects. A data artifact is a
deterministic data flaw caused by the techniques or conditions of measurement.
In statistics and experimental science, the quality of the measurement process and the
resulting data are measured by precision and bias.
Precision: The closeness of repeated measurements (of the same quantity) to one another.
Precision is often measured by the standard deviation of a set of values.
Bias: A systematic variation of measurements from the quantity being measured. Bias is
measured by taking the difference between the mean of the set of values and the known
value of the quantity being measured.
Accuracy: The closeness of measurements to the true value of the quantity being
measured. Accuracy depends on precision and bias. One important aspect of accuracy is
the use of significant digits: the goal is to use only as many digits to represent the result of
a measurement or calculation as are justified by the precision of the data.
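A small worked example (the repeated measurements and the true value 10.0 are invented)
computing precision and bias exactly as defined above:

    import numpy as np

    true_value = 10.0
    measurements = np.array([10.2, 9.9, 10.1, 10.0, 10.3])   # repeated measurements

    precision = measurements.std()              # spread of the repeated measurements
    bias = measurements.mean() - true_value     # systematic deviation from the truth
    print(round(precision, 3), round(bias, 3))  # 0.141 and 0.1 for this data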
Outliers: Outliers are either (1) data objects that, in some sense, have characteristics that
are different from most of the other data objects in the data set, or (2) values of an
attribute that are unusual with respect to the typical values for that attribute.
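One common, simple way to flag such values (a sketch of one possible method, on invented
data) is the z-score rule: values far from the mean, in units of standard deviation, are
treated as outliers.

    import numpy as np

    values = np.array([10.1, 9.8, 10.0, 10.2, 9.9, 25.0])   # 25.0 is unusual

    # Flag values more than 2 standard deviations away from the mean.
    z = (values - values.mean()) / values.std()
    print(values[np.abs(z) > 2])   # -> [25.]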
Missing Values: It is not unusual for an object to be missing one or more attribute values.
In some cases, the information was not collected; e.g., some people decline to give their
age or weight. There are several strategies (and variations on these strategies) for dealing
with missing data, each of which may be appropriate in certain circumstances.
Estimate Missing Values: Sometimes missing data can be reliably estimated. For example,
consider a time series that changes in a reasonably smooth fashion but has a few, widely
scattered missing values. In such cases, the missing values can be estimated (interpolated)
by using the remaining values.
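A hedged sketch using pandas (the series values are invented): scattered missing values in
a reasonably smooth series are estimated from their neighbours by linear interpolation.

    import numpy as np
    import pandas as pd

    # A smooth series with a few scattered missing values (NaN).
    s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0])

    # interpolate() fills each NaN from the surrounding known values.
    print(s.interpolate())   # the gaps become 3.0 and 6.0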
Ignore the Missing Value during Analysis: Many data mining approaches can be
modified to ignore missing values. For example, suppose that objects are being clustered
and the similarity between pairs of data objects needs to be calculated.
Inconsistent Values: Data can contain inconsistent values. Consider an address field where
both a zip code and a city are listed, but the specified zip code area is not contained in that
city. It may be that the individual entering this information transposed two digits, or
perhaps a digit was misread when the information was scanned from a handwritten form.
Duplicate Data: A data set may include data objects that are duplicates, or almost
duplicates, of one another. Many people receive duplicate mailings because they appear in
a database multiple times under slightly different names. To detect and eliminate such
duplicates, two main issues must be addressed. First, if there are two objects that actually
represent a single object, then the values of corresponding attributes may differ, and these
inconsistent values must be resolved. Second, care needs to be taken to avoid accidentally
combining data objects that are similar but not duplicates, such as two distinct people with
identical names.
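A small sketch with pandas (the names are made up) addressing the first issue: normalize
the name field so that two records for the same person become exact duplicates, then drop
one of them.

    import pandas as pd

    # The same person recorded twice under slightly different names.
    df = pd.DataFrame({"name": ["J. Smith", "j smith", "A. Rao"],
                       "city": ["Hyderabad", "Hyderabad", "Chennai"]})

    # Normalize the key fields, then drop exact duplicates of the key.
    df["key"] = df["name"].str.lower().str.replace(".", "", regex=False).str.strip()
    print(df.drop_duplicates(subset=["key", "city"]))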
Timeliness: Some data starts to age as soon as it has been collected. In particular, if the
data provides a snapshot of some ongoing phenomenon or process, such as the purchasing
behaviour of customers or web browsing patterns, then this snapshot represents reality for
only a limited time.
Relevance: The available data must contain the information necessary for the application.
Consider the task of building a model that predicts the accident rate for drivers. If
information about the age and gender of the driver is omitted, then it is likely that the
model will have limited accuracy unless this information is indirectly available through
other attributes.
Knowledge about the Data: Ideally, data sets are accompanied by documentation that
describes different aspects of the data; the quality of this documentation can either aid or
hinder the subsequent analysis. For example, if the documentation identifies several
attributes as being strongly related, these attributes are likely to provide highly redundant
information.