CS-DM MODULE-1

Malla Reddy Engineering College

(Autonomous)
Maisammaguda, Dhulapally,
Secunderabad. 500100

Department of Computer Science and Engineering (CS)

III B. Tech II Semester


Subject: Data Mining
Prepared by
Mr. G.Varahagiri

Code: B0536

Academic Year 2023-24


Regulations: MR 21

2021-22 Onwards
MALLA REDDY ENGINEERING COLLEGE (Autonomous)
B.Tech. (MR-21) VI Semester

Code: B0536    DATA MINING    L T P: 3 - -    Credits: 3

Prerequisites: NIL
Course Objectives:
This course enables students to understand the stages in building a data warehouse, identify the need for and importance of preprocessing techniques, implement similarity and dissimilarity measures, analyze and evaluate the performance of algorithms for association rules, and analyze classification and clustering algorithms.
MODULE I: Introduction and Mining Issues & Data [09 Periods]
Introduction – Data, Why Data Mining? What Is Data Mining? What Kinds of Data Can Be mined? What
Kinds of Patterns Can Be Mined? Which Technologies Are Used? Which Kinds of Applications Are
Targeted?
Mining Issues and Data - Major Issues in Data Mining, Types of Data, Data Quality
MODULE II: Data, Data Pre processing [9 Periods]
A: Data Pre-processing: Data Warehousing, Data Cleaning, Data Integration, Data Reduction, Data
Transformation, Aggregation, Sampling,
B: Techniques: Dimensionality Reduction, Feature Subset Selection, Feature Creation, Data Discretization
and Binarization, Variable transformation.
MODULE III: Data Similarity and Dissimilarity Classification [10 Periods]
A:Measuring Data Similarity and Dissimilarity - Similarity and Dissimilarity between simple attributes,
Dissimilarities and similarities between data objects, Examples of Proximity measures, Issues in Proximity
Calculation, Selection of right proximity measure.
B:Classification - Basic Concepts, General Approach to solving a classification problem, Decision Tree
Induction: Working of Decision Tree, building a decision tree.
MODULE IV: Classifier and Association Analysis [10 Periods]
Classifiers - Alternative Techniques, Bayes' Theorem, Naïve Bayesian Classification, Bayesian Belief
Networks. Association Analysis - Basic Concepts and Algorithms: Problem Definition, Frequent Item Set

generation, Rule generation, compact representation of frequent item sets, FP Growth Algorithm.
MODULE V: Cluster Analysis and DBSCAN [10 Periods]
Cluster Analysis - Basic Concepts and Algorithms: Overview: What Is Cluster Analysis? Different Types of
Clustering, Different Types of Clusters; K-means: The Basic K-means Algorithm, K-means Additional
Issues, Bisecting K-means, Strengths and Weaknesses; Agglomerative Hierarchical Clustering: Basic
Agglomerative Hierarchical Clustering Algorithm
DBSCAN - Traditional Density Center-Based Approach, DBSCAN Algorithm, Strengths and Weaknesses.

TEXTBOOKS
1. Pang-Ning Tan, Michael Steinbach and Vipin Kumar, “Introduction to Data Mining”, Pearson.
2. Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, 3/e, Elsevier.

REFERENCES
1. Hongbo Du, “Data Mining Techniques and Applications: An Introduction”, Cengage Learning.
2. Vikram Pudi and P. Radha Krishna, “Data Mining”, Oxford.
3. Mohammed J. Zaki and Wagner Meira, Jr., “Data Mining and Analysis: Fundamental Concepts and Algorithms”, Oxford.
4. Alex Berson and Stephen Smith, “Data Warehousing, Data Mining and OLAP”, TMH.

E –RESOURCES
1. http://www-users.cs.umn.edu/~kumar/dmbook/index.php
2. http://myweb.sabanciuniv.edu/rdehkharghani/files/2016/02/The-Morgan-Kaufmann-Series-in-Data-Management-Systems-Jiawei-Han-Micheline-Kamber-Jian-Pei-Data-Mining.-Concepts-and-Techniques-3rd-Edition-Morgan-Kaufmann-2011.pdf
3. http://www.ijctee.org/files/Issuethree/IJCTEE_1111_20.pdf
4. http://www.ccsc.org/southcentral/E-Journal/2010/Papers/Yihao%20final%20paper%20CCSC%20for%20submission.pdf
5. https://gunjesh.wordpress.com/

UNIT-1
Introduction to Mining and Issues in Data Mining

Why do we need Data Mining?


 The volume of information we must handle from business transactions, scientific data, sensor data, pictures, videos, etc. increases every day. So we need systems capable of extracting the essence of the available information and automatically generating reports, views, or summaries of data for better decision-making.
 Telecommunication networks: carry tens of petabytes of data.
 Medical and health industry: generates large amounts of data from medical records, patient monitoring, and medical images.
 Billions of web searches: supported by search engines, which process tens of petabytes of data daily.
 Communities and social media: increasingly important data sources, producing digital pictures, videos, and blogs.
 Web communities and social networks: people join a community because they care about the common interest that glues its members together. Everyone has their own social network (whether online or offline): friends, family, and acquaintances. An online social networking site simply makes our social networks visible to others who are not in our immediate network.
 Data mining as an evolution of information technology
 Data mining is a result of the natural evolution of information technology and is used to convert raw data into useful knowledge.
 It is a multidisciplinary approach that draws on machine learning, data visualization, statistics, database technology, pattern recognition, neural networks, information retrieval, and soft computing.
 Data mining is closely concerned with customer experience and user interfaces in sectors such as communication, marketing organizations, finance, and banking.
 It is widely used to find the causes of network attacks and for fraud detection.
 The banking and finance sectors use data mining to better understand market risks. It is widely applied to credit ratings, which play an important role in detecting and analyzing fraud in card transactions.
 1960s: Data collection and database creation; IMS and network DBMS
 1970s: Relational data model; relational DBMS implementation
 1980s: RDBMS, advanced data models, application-oriented DBMS (spatial, scientific, engineering)
 1990s: Data mining, data warehouses, multimedia databases and web databases
 2000s: Stream data management, data mining and its applications, web technology, global information systems

Why Data Mining is used in Business?


 Data mining is used in business to make better managerial decisions by:
 Automatic summarization of data
 Extracting essence of information stored.
 Discovering patterns in raw data.

What is Data Mining?


 Data Mining is defined as extracting information from huge sets of data. In other words, data mining is the procedure of mining knowledge from data. The information or knowledge so extracted can be used for applications such as:
 Market Analysis
 Fraud Detection
 Customer Retention
 Production Control
 Science Exploration

 Some treat data mining as a synonym for Knowledge Discovery from Data (KDD), while others view data mining as an essential step in the process of knowledge discovery. Here is the list of steps involved in the knowledge discovery process −

 Data Cleaning − In this step, noise and inconsistent data are removed.

 Data Integration − In this step, multiple data sources are combined.


 Data Selection − In this step, data relevant to the analysis task are retrieved from the
database.
 Data Transformation − In this step, data is transformed or consolidated into forms
appropriate for mining by performing summary or aggregation operations.
 Data Mining − In this step, intelligent methods are applied in order to extract data
patterns.
 Pattern Evaluation − In this step, data patterns are evaluated.
 Knowledge Presentation − In this step, knowledge is represented.
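The steps above can be sketched as a small pipeline. This is only an illustration: the function names and the toy transaction data are invented, not part of any standard library.

```python
# A minimal sketch of the KDD pipeline described above.
# All names and the toy data are illustrative assumptions.

def clean(records):
    """Data Cleaning: drop records with missing values."""
    return [r for r in records if None not in r.values()]

def integrate(*sources):
    """Data Integration: combine multiple data sources."""
    merged = []
    for src in sources:
        merged.extend(src)
    return merged

def select(records, fields):
    """Data Selection: keep only task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data Transformation: aggregate spend per customer."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0) + r["amount"]
    return totals

def mine(totals, threshold):
    """Data Mining: flag big spenders (a trivial 'pattern')."""
    return {c for c, t in totals.items() if t >= threshold}

store_a = [{"customer": "ann", "amount": 120, "item": "tv"},
           {"customer": "bob", "amount": None, "item": "pen"}]
store_b = [{"customer": "ann", "amount": 80, "item": "hdd"},
           {"customer": "cara", "amount": 30, "item": "cup"}]

data = select(clean(integrate(store_a, store_b)), ["customer", "amount"])
big_spenders = mine(transform(data), threshold=100)
print(big_spenders)  # {'ann'} -- the only customer over the threshold
```

Each stage feeds the next, mirroring the ordered steps listed above; pattern evaluation and knowledge presentation would follow the `mine` step.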

The data mining process usually involves:


 Data Cleaning
 Data integration from multiple sources
 Warehousing the data
 Data cube construction
 Data selection for data mining
 Data mining
 Presentation of mining results
 Pattern and knowledge to be used
Data Mining in Business intelligence:
 Data sources(paper, Files, web document, scientific experiments, data base systems)
 Data Preprocessing
 Data exploration
 Data mining
 Data presentation
 Decision making
What kinds of data can be mined?
 Flat Files
 Relational Databases
 Data Warehouse
 Transactional Databases
 Multimedia Databases
 Spatial Databases
 Time Series Databases
 World Wide Web(WWW)

1. Flat Files
 Flat files are data files in text or binary form with a structure that can be easily extracted by data mining algorithms.
 Data stored in flat files have no relationships or paths among themselves; if a relational database is stored in flat files, there will be no relations between the tables.
 Flat files are described by a data dictionary. Eg: CSV file.
 Application: used in data warehousing to store data, in carrying data to and from servers, etc.
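A minimal sketch of reading a flat file with Python's standard csv module; the file contents and column names here are invented for illustration.

```python
# Reading a flat file (CSV) with Python's standard csv module.
import csv
import io

# An in-memory stand-in for a flat file on disk; the rows are made up.
flat_file = io.StringIO("id,name,city\n1,Ann,Delhi\n2,Bob,Pune\n")

reader = csv.DictReader(flat_file)
rows = list(reader)
print(rows[0]["name"])  # Ann

# Note: the rows carry no relationships or foreign keys --
# any structure between records must be inferred by the miner.
```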
2. Relational Databases
 A Relational database is defined as the collection of data organized in tables with rows and
columns.
 Physical schema in Relational databases is a schema which defines the structure of tables.
 Logical schema in Relational databases is a schema which defines the relationship among
tables.
 Standard API of relational database is SQL.
 Application: Data Mining, ROLAP model, etc
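A minimal relational-database example using Python's built-in sqlite3 module; the table, columns, and data are invented for illustration.

```python
# A tiny relational database with SQL as the standard API.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("ann", 120.0), ("bob", 40.0), ("ann", 80.0)])

# Aggregate spend per customer via SQL.
cur.execute("SELECT customer, SUM(amount) FROM sales "
            "GROUP BY customer ORDER BY customer")
result = cur.fetchall()
print(result)  # [('ann', 200.0), ('bob', 40.0)]
```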
3. Data Warehouse
 A data warehouse is defined as a collection of data integrated from multiple sources that supports querying and decision making.

 There are three types of data warehouse: Enterprise data Warehouse, Data Mart and
Virtual Warehouse.
 Two approaches can be used to update data in Data Warehouse: Query- driven Approach
and Update-driven Approach.
 Application: Business decision making, Data mining, etc.

4. Transactional Databases
 Transactional databases are collections of data organized by time stamps, dates, etc. to represent transactions in databases.
 This type of database has the capability to roll back or undo its operation when a
transaction is not completed or committed.
 Highly flexible system where users can modify information without changing any sensitive
information.
 Follows ACID property of DBMS.
 Application: Banking, Distributed systems, Object databases, etc.

5. Multimedia Databases
 Multimedia databases consist of audio, video, image, and text media.
 They can be stored on Object-Oriented Databases.
 They are used to store complex information in pre-specified formats.
 Application: Digital libraries, video-on demand, news-on demand, musical database, etc.
6. Spatial Database
 Store geographical information.
 Stores data in the form of coordinates, topology, lines, polygons, etc.
 Application: Maps, Global positioning, etc.

7. Time-series Databases
 Time series databases contain stock exchange data and user logged activities.
 Handles array of numbers indexed by time, date, etc.
 It requires real-time analysis.
 Application: eXtreme DB, Graphite, Influx DB, etc.
8. WWW
 WWW refers to the World Wide Web, a collection of documents and resources such as audio, video, text, etc., which are identified by Uniform Resource Locators (URLs) through web browsers, linked by HTML pages, and accessible via the Internet.
 It is the most heterogeneous repository as it collects data from multiple resources.
 It is dynamic in nature as Volume of data is continuously increasing and changing.
 Application: Online shopping, Job search, Research, studying, etc

What kinds of Patterns can be mined?


 On the basis of the kind of data to be mined, there are two categories of functions involved
in Data Mining −
a) Descriptive
b) Classification and Prediction

a) Descriptive Function
 The descriptive function deals with the general properties of data in the database.
 Here is the list of descriptive functions −
1. Class/Concept Description
2. Mining of Frequent Patterns
3. Mining of Associations
4. Mining of Correlations
5. Mining of Clusters
1. Class/Concept Description
 Class/Concept refers to the data to be associated with the classes or concepts. For example,
in a company, the classes of items for sales include computer and printers, and concepts of
customers include big spenders and budget spenders. Such descriptions of a class or a concept are
called class/concept descriptions. These descriptions can be derived by the following two ways −
 Data Characterization − This refers to summarizing the data of the class under study, which is called the Target Class.
 Data Discrimination − It refers to the mapping or classification of a class with some
predefined group or class.
2. Mining of Frequent Patterns
 Frequent patterns are those patterns that occur frequently in transactional data. Here is the
list of kind of frequent patterns −

 Frequent Item Set − It refers to a set of items that frequently appear together, for example, milk and bread.
 Frequent Subsequence − A sequence of patterns that occurs frequently, such as purchasing a camera followed by a memory card.
 Frequent Sub Structure − Substructure refers to different structural forms, such as graphs,
trees, or lattices, which may be combined with item-sets or subsequences.
3. Mining of Association
 Associations are used in retail sales to identify items that are frequently purchased together. This refers to the process of uncovering relationships among data and determining association rules.
 For example, a retailer generates an association rule that shows that 70% of time milk is sold
with bread and only 30% of times biscuits are sold with bread.
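Percentages like the 70% above are rule confidences, which can be computed directly from transaction data. The sketch below uses invented toy baskets; `support` and `confidence` follow the standard definitions.

```python
# Support and confidence for association rules over toy market-basket
# data. The transactions are invented for illustration.

baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "biscuits"},
    {"milk", "bread"},
    {"bread"},
]

def support(itemset):
    """Fraction of baskets containing every item in the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """support(A and C) / support(A), i.e. P(C | A)."""
    return support(antecedent | consequent) / support(antecedent)

# Rule: bread -> milk
print(confidence({"bread"}, {"milk"}))  # 0.6: 3 of the 5 bread baskets have milk
```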
4. Mining of Correlations
 It is a kind of additional analysis performed to uncover interesting statistical correlations
between associated-attribute-value pairs or between two item sets to analyze that if they have
positive, negative or no effect on each other.
5. Mining of Clusters
 Cluster refers to a group of similar kind of objects. Cluster analysis refers to forming group
of objects that are very similar to each other but are highly different from the objects in other
clusters.
 The objects are clustered or grouped based on the principle of maximizing intra-cluster similarity and minimizing inter-cluster similarity.
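This principle drives algorithms such as k-means (covered in Module V). A tiny one-dimensional sketch with invented data and deliberately poor starting centers:

```python
# A tiny 1-D k-means (k=2): assign points to the nearest center, then
# recompute each center as its group's mean. Data is invented.

points = [1.0, 1.2, 0.8, 8.0, 8.5, 7.9]
centers = [0.0, 10.0]  # poor initial guesses on purpose

for _ in range(10):  # a few refinement passes
    groups = [[], []]
    for p in points:
        # boolean index: False (0) -> nearer centers[0], True (1) -> centers[1]
        groups[abs(p - centers[0]) > abs(p - centers[1])].append(p)
    centers = [sum(g) / len(g) for g in groups]

print(sorted(centers))  # roughly [1.0, 8.13]
```

The two clusters end up tight internally (high intra-cluster similarity) and far apart (low inter-cluster similarity).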

b) Classification and Prediction


 Classification is the process of finding a model that describes the data classes or concepts. The purpose is to be able to use this model to predict the class of objects whose class label is unknown. This derived model is based on the analysis of sets of training data. The derived model can be presented in the following forms −
 Classification (IF-THEN) Rules
 Prediction
 Decision Trees
 Mathematical Formulae
 Neural Networks
 Outlier Analysis
 Evolution Analysis
 The list of functions involved in these processes is as follows −
 Classification − It predicts the class of objects whose class label is unknown. Its objective is to find a derived model that describes and distinguishes data classes or concepts. The derived model is based on the analysis of a set of training data, i.e. data objects whose class labels are well known.
 Prediction − It is used to predict missing or unavailable numerical data values rather than
class labels. Regression Analysis is generally used for prediction. Prediction can also be used for
identification of distribution trends based on available data.

 Decision Trees − A decision tree is a structure that includes a root node, branches, and leaf
nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test,
and each leaf node holds a class label.
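Such a tree maps directly onto nested IF-THEN rules. The attribute names and class labels below are invented for illustration, not taken from any particular dataset.

```python
# A hand-built decision tree as nested IF-THEN rules.
# Internal nodes test attributes; leaves return a class label.

def classify(record):
    if record["age"] < 30:           # root node: test on 'age'
        if record["student"]:        # internal node: test on 'student'
            return "buys_computer"   # leaf
        return "no"                  # leaf
    return "buys_computer"           # leaf for age >= 30

print(classify({"age": 25, "student": True}))   # buys_computer
print(classify({"age": 25, "student": False}))  # no
```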

 Mathematical Formulae – Data can be mined by using some mathematical formulas.
 Neural Networks − Neural networks represent a brain metaphor for information processing.
These models are biologically inspired rather than an exact replica of how the brain actually
functions. Neural networks have been shown to be very promising systems in many forecasting
applications and business classification applications due to their ability to “learn” from the data.

 Outlier Analysis − Outliers may be defined as the data objects that do not comply with the
general behavior or model of the data available.
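One common sketch of outlier analysis flags values far from the mean; the data and the 2-standard-deviation threshold below are illustrative choices, not a universal rule.

```python
# Flag values more than 2 population standard deviations from the mean.
import statistics

values = [10, 12, 11, 9, 10, 11, 45]  # 45 does not fit the general model
mean = statistics.mean(values)
sd = statistics.pstdev(values)

outliers = [v for v in values if abs(v - mean) > 2 * sd]
print(outliers)  # [45]
```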

 Evolution Analysis − Evolution analysis refers to the description and modeling of regularities or trends for objects whose behavior changes over time.

Data Mining Task Primitives


 We can specify a data mining task in the form of a data mining query.
 This query is input to the system.
 A data mining query is defined in terms of data mining task primitives.
 Note −These primitives allow us to communicate in an interactive manner with the data
mining system. Here is the list of Data Mining Task Primitives −
 Set of task relevant data to be mined.

 Kind of knowledge to be mined.
 Background knowledge to be used in discovery process.
 Interestingness measures and thresholds for pattern evaluation.
 Representation for visualizing the discovered patterns.

Which Technologies are used in data mining?

1. Statistics:
 Statistics uses mathematical analysis to express representations, models, and summaries of empirical data or real-world observations.
 Statistical analysis involves a collection of methods, applicable to large amounts of data, used to identify and report trends.
2. Machine learning
 Arthur Samuel defined machine learning as a field of study that gives computers the ability to learn without being explicitly programmed.
 When the new data is entered in the computer, algorithms help the data to grow or
change due to machine learning.
 In machine learning, an algorithm is constructed to predict the data from the available
database (Predictive analysis).
 It is related to computational statistics.

The four types of machine learning are:


 Supervised learning
 It is based on classification.
 It is also called inductive learning. In this method, the desired outputs are included in the training dataset.
 Unsupervised learning
 Unsupervised learning is based on clustering. Clusters are formed on the basis of similarity
measures and desired outputs are not included in the training dataset.
 Semi-supervised learning
 Semi-supervised learning includes some desired outputs in the training dataset to generate the appropriate functions. This method avoids the need for a large number of labeled examples (i.e. desired outputs).
 Active learning
 Active learning is a powerful approach for analyzing data efficiently.
 The algorithm is designed so that it itself decides which examples it needs desired outputs for, and queries the user for them (the user plays an important role in this type).
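Supervised learning in miniature: a 1-nearest-neighbour classifier, where the training set includes the desired outputs (labels). The points and labels are invented for illustration.

```python
# 1-nearest-neighbour classification: label a new point with the label
# of its closest training example. Toy 2-D data.

train = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"),
         ((8.0, 8.0), "B"), ((7.5, 8.2), "B")]

def predict(x):
    """Return the label of the training point nearest to x."""
    def dist2(p):
        return (p[0] - x[0]) ** 2 + (p[1] - x[1]) ** 2
    return min(train, key=lambda pair: dist2(pair[0]))[1]

print(predict((1.1, 1.1)))  # A
print(predict((7.8, 7.9)))  # B
```

An unsupervised method, by contrast, would receive only the points and have to form the two groups itself.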

3. Information retrieval
 Information retrieval deals with uncertain representations of the semantics of objects (text, images).
 The technique searches for the information in the document, which may be in text,
multimedia, or residing on the Web. It has two main characteristics:
1. Searched data is unstructured
2. Queries are formed by keywords that don’t have complex structures. The most widely
used information retrieval approach is the probabilistic model. Information retrieval
combined with data mining techniques is used for finding out any relevant topic in the
document or web.

Uses: A large amount of data, both text and multimedia, is available and streamed on the web due to the fast growth of digitalization in the government sector, health care, and many other fields. Searching and analyzing this data raises many challenges, and hence information retrieval becomes increasingly important.
For example: finding relevant information in a large document.
4. Database systems and data warehouse
 Databases are used for the purpose of recording the data as well as data warehousing.
 Online Transactional Processing (OLTP) uses databases for day to day transaction purpose.

 Data warehouses are used to store historical data, which helps in taking strategic business decisions.
 A data warehouse is used for online analytical processing (OLAP), which helps to analyze the data.
5. Pattern Recognition:
 Pattern recognition is the automated recognition of patterns and regularities in data. Pattern
recognition is closely related to artificial intelligence and machine learning, together with
applications such as data mining and knowledge discovery in databases (KDD), and is often used
interchangeably with these terms.
6. Visualization:
 Visualization is the process of extracting and presenting data in a clear and understandable way, displaying the results as pie charts, bar graphs, statistical representations, and other graphical forms.
7. Algorithms:
 To perform data mining tasks we have to design efficient algorithms.
8. High Performance Computing:
High Performance Computing most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation, in order to solve large problems in science, engineering, or business.

Are all patterns interesting?


 Typically the answer is No − only a small fraction of the patterns potentially generated would actually be of interest to a given user.
 What makes patterns interesting?
 A pattern is interesting if it is (1) easily understood by humans, (2) valid on new or test data with some degree of certainty, (3) potentially useful, and (4) novel.
 A pattern is also interesting if it validates a hypothesis that the user sought to confirm.

Data Mining Applications


Here is the list of areas where data mining is widely used −
 Financial Data Analysis
 Retail Industry
 Telecommunication Industry
 Biological Data Analysis

 Other Scientific Applications
 Intrusion Detection
Financial Data Analysis
 The financial data in the banking and financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining, for example:
 Loan payment prediction and customer credit policy analysis.
 Detection of money laundering and other financial crimes.
Retail Industry
 Data mining has great application in the retail industry because it collects large amounts of data on sales, customer purchasing history, goods transportation, consumption, and services. It is natural that the quantity of data collected will continue to expand rapidly because of the increasing ease, availability, and popularity of the web.
Telecommunication Industry
 Today the telecommunication industry is one of the fastest-growing industries, providing various services such as fax, pager, cellular phone, internet messenger, images, e-mail, web data transmission, etc. Due to the development of new computer and communication technologies, the telecommunication industry is rapidly expanding. This is why data mining has become very important for helping to understand the business.
Biological Data Analysis
 In recent times, we have seen a tremendous growth in the field of biology such as genomics,
proteomics, functional Genomics and biomedical research. Biological data mining is a very
important part of Bioinformatics.

Other Scientific Applications


 The applications discussed above tend to handle relatively small and homogeneous data sets for which statistical techniques are appropriate. Huge amounts of data have also been collected from scientific domains such as geosciences, astronomy, etc.
 Large data sets are being generated by fast numerical simulations in various fields such as climate and ecosystem modeling, chemical engineering, fluid dynamics, etc.

Intrusion Detection
 Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of network resources. In this world of connectivity, security has become the major issue. The increased usage of the internet and the availability of tools and tricks for intruding and attacking networks have prompted intrusion detection to become a critical component of network administration.

Major Issues in data mining:


 Data mining is a dynamic and fast-expanding field with great strengths. The major issues can be divided into five groups:
 Mining Methodology
 User Interaction
 Efficiency and scalability
 Diverse Data Types Issues
 Data mining society
a).Mining Methodology:
 It refers to the following kinds of issues −
 Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.
 Mining knowledge in multidimensional space – when searching for knowledge in large
datasets, we can explore the data in multidimensional space.
 Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining data regularities. Without such methods, the accuracy of the discovered patterns will be poor.

 Pattern evaluation − The patterns discovered may be uninteresting if they represent common knowledge or lack novelty, so measures are needed to assess how interesting a pattern is.
b).User Interaction:
 Interactive mining of knowledge at multiple levels of abstraction − The data mining
process needs to be interactive because it allows users to focus the search for patterns, providing and
refining data mining requests based on the returned results.
 Incorporation of background knowledge − To guide discovery process and to express the
discovered patterns, the background knowledge can be used. Background knowledge may be used to
express the discovered patterns not only in concise terms but at multiple levels of abstraction.
 Data mining query languages and ad hoc data mining − A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
 Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
c).Efficiency and scalability
 There can be performance-related issues such as follows −
 Efficiency and scalability of data mining algorithms − In order to effectively extract the
information from huge amount of data in databases, data mining algorithm must be efficient and
scalable.
 Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions which are processed in parallel; the results from the partitions are then merged. Incremental algorithms incorporate database updates without mining the data again from scratch.
d) Diverse Data Types Issues
 Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.

 Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore mining knowledge from them adds challenges to data mining.
e) Data Mining and Society
 Social impacts of data mining – With data mining penetrating our everyday lives, it is
important to study the impact of data mining on society.
 Privacy-preserving data mining − Data mining can help scientific discovery, business management, economic recovery, and security protection, but it must be carried out while preserving the privacy of individuals.
 Invisible data mining − We cannot expect everyone in society to learn and master data mining techniques. More and more systems should have data mining functions built in, so that people can perform data mining or use data mining results simply by clicking a mouse, without any knowledge of data mining algorithms.

Types of Data :
Data Object:
 A data object is a real-world entity.
Attribute:
 An attribute is a data field that represents a characteristic or feature of a data object. For a customer object, attributes can be customer ID, address, etc. The attribute types can be represented as follows −
1. Nominal Attributes − related to names: The values of a nominal attribute are names of things or some kind of symbols. Values of nominal attributes represent some category or state, which is why nominal attributes are also referred to as categorical attributes.
Example:
Attribute Values
Colors Black, Green, Brown, Red

2. Binary Attributes: A binary attribute has only 2 values/states, for example yes or no, affected or unaffected, true or false.
i) Symmetric: both values are equally important (e.g. gender).
ii) Asymmetric: both values are not equally important (e.g. a test result).

3. Ordinal Attributes: Ordinal attributes contain values that have a meaningful sequence or ranking (order) between them, but the magnitude between values is not actually known; the order shows what is important but does not indicate how important it is.

Attribute Values
Grade O, S, A, B, C, D, F

4. Numeric: A numeric attribute is quantitative because it is a measurable quantity, represented in integer or real values. Numeric attributes are of 2 types:
 i) An interval-scaled attribute has values whose differences are interpretable, but it does not have a true reference point, or zero point. Data can be added and subtracted on an interval scale but cannot be multiplied or divided. Consider temperature in degrees Centigrade: if one day's temperature is twice that of another, we cannot say that one day is twice as hot as the other.
 ii) A ratio-scaled attribute is a numeric attribute with a fixed zero point. If a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. The values are ordered, we can compute the differences between values, and the mean, median, mode, quartile range, and five-number summary can be given.
5. Discrete: Discrete data have finite values; they can be numerical or categorical. These attributes have a finite or countably infinite set of values.
Example
Attribute Values
Profession Teacher, Businessman, Peon
ZIP Code 521157, 521301

6. Continuous: Continuous data have an infinite number of possible states and are typically of float type. There can be many values between 2 and 3.

Example:
Attribute Values
Height 5.4,5.7,6.2,etc.,
Weight 50,65,70,73,etc.,
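In practice these attribute types are often encoded as numbers before mining. A sketch using the Grade and Colors examples above; the numeric codes are arbitrary choices, not part of the data.

```python
# Encoding attribute types as numbers (the codes are illustrative).
# Ordinal codes preserve rank order; nominal codes are arbitrary labels.

ordinal_grade = {"F": 0, "D": 1, "C": 2, "B": 3, "A": 4, "S": 5, "O": 6}
nominal_color = {"Black": 0, "Green": 1, "Brown": 2, "Red": 3}

# Ordinal: order comparisons are meaningful...
print(ordinal_grade["A"] > ordinal_grade["C"])  # True
# ...but for nominal attributes only equality tests make sense:
# Green > Black would be an artifact of the arbitrary coding.
print(nominal_color["Green"] == nominal_color["Green"])  # True
```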

Types of Data Sets

There are many types of data sets, and as the field of data mining develops and matures, a greater variety of data sets becomes available for analysis. We can group the types of data sets into three groups:

1. Record data
2. Graph-based data
3. Ordered data

 Record Data: Record data is usually stored either in flat files or in relational databases. Relational databases are certainly more than a collection of records, but data mining often does not use any of the additional information available in a relational database. Rather, the database serves as a convenient place to find records.

 Transaction or Market Basket Data: Transaction data is a collection of sets of items, but it can be viewed as a set of records whose fields are asymmetric attributes.

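A minimal sketch (the items are invented) of representing market basket transactions as sets of items:

```python
# Each transaction is the set of items bought together in one basket.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
]

# Because the attributes are asymmetric, only presence (non-zero) matters:
# count how many baskets contain both "diapers" and "beer".
both = sum(1 for t in transactions if {"diapers", "beer"} <= t)
print(both)  # 2
```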
• The Data Matrix: A data matrix is a variation of record data, but because it consists only of numeric attributes, standard matrix operations can be applied to transform and manipulate the data.
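As a sketch (the rows are invented), a data matrix of numeric attributes supports standard matrix operations such as mean-centering each column:

```python
# A data matrix: rows are objects, columns are numeric attributes.
# Here: hypothetical (height_m, weight_kg) values for three people.
data = [
    [1.70, 65.0],
    [1.80, 80.0],
    [1.60, 55.0],
]

# Standard matrix operations apply; e.g., mean-centering each column.
n_rows, n_cols = len(data), len(data[0])
col_means = [sum(row[j] for row in data) / n_rows for j in range(n_cols)]
centered = [[row[j] - col_means[j] for j in range(n_cols)] for row in data]

print(col_means)
print(centered)
```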
• The Sparse Data Matrix: A sparse data matrix is a special case of a data matrix in which the attributes are of the same type and are asymmetric; i.e., only non-zero values are important. Transaction data is an example of a sparse data matrix that has only 0-1 entries.
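A sketch (items and values are illustrative) of storing such a 0-1 transaction matrix sparsely, recording only the positions of the non-zero entries:

```python
# Dense 0-1 matrix: rows = transactions, columns = items.
items = ["bread", "milk", "diapers", "beer"]
dense = [
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 1, 1],
]

# Sparse form: keep only the (row, col) positions of non-zero entries.
sparse = {(i, j) for i, row in enumerate(dense)
          for j, v in enumerate(row) if v != 0}

# The sparse set answers the same queries with far less storage
# when most entries are zero.
print(len(sparse))  # 8 non-zero entries out of 12 cells
```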
• Graph-Based Data: A graph can sometimes be a convenient and powerful representation for data. We consider two specific cases:
(1) the graph captures relationships among data objects, and
(2) the data objects themselves are represented as graphs.
• Data with Relationships among Objects: The relationships among objects frequently convey important information. In such cases, the data is often represented as a graph. In particular, the data objects are mapped to the nodes of the graph, while the relationships among objects are captured by the links between objects and by link properties such as direction and weight.
• Data with Objects That Are Graphs: If objects have structure, that is, if the objects contain sub-objects that have relationships, then such objects are frequently represented as graphs. For example, the structure of a chemical compound can be represented by a graph, where the nodes are atoms and the links between nodes are chemical bonds.
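As a small sketch (the molecule is chosen purely for illustration), a compound such as water can be stored as an adjacency list whose nodes are atoms and whose edges are bonds:

```python
# Water (H2O) as a graph: nodes are atoms, edges are chemical bonds.
atoms = {"O1": "O", "H1": "H", "H2": "H"}   # node id -> element
bonds = [("O1", "H1"), ("O1", "H2")]        # undirected edges

# Build an adjacency list from the edge list.
adjacency = {node: [] for node in atoms}
for a, b in bonds:
    adjacency[a].append(b)
    adjacency[b].append(a)

# The oxygen atom is bonded to both hydrogens.
print(sorted(adjacency["O1"]))  # ['H1', 'H2']
```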
• Ordered Data: For some types of data, the attributes have relationships that involve order in time or space. The different types of ordered data are:
• Sequence Data: Sequence data consists of a data set that is a sequence of individual entities, such as a sequence of words or letters. It is quite similar to sequential data, except that there are no time stamps; instead, there are positions in an ordered sequence. For example, the genetic information of plants and animals can be represented in the form of sequences of nucleotides that are known as genes.
• Time Series Data: Time series data is a special type of sequential data in which each record is a time series, i.e., a series of measurements taken over time. For example, a financial data set might contain objects that are time series of the daily prices of various stocks.
• Spatial Data: Some objects have spatial attributes, such as positions or areas, as well as other types of attributes. An example of spatial data is weather data (precipitation, temperature, pressure) collected for a variety of geographical locations.
Data Quality:

The following sections discuss specific aspects of data quality. The focus is on measurement and data collection issues, although some application-related issues are also discussed.
• Measurement and Data Collection Issues: We focus on aspects of data quality that are related to data measurement and collection.
Measurement and Data Collection Errors: The term measurement error refers to any
problem resulting from the measurement process. A common problem is that the value
recorded differs from the true value to some extent. For continuous attributes, the numerical
difference of the measured and true value is called the error.
The term data collection error refers to errors such as omitting data objects or attribute
values, or inappropriately including a data object. For example, a study of animals of a
certain species might include animals of a related species that are similar in appearance to
the species of interest. Both measurement errors and data collection errors can be either
systematic or random.
Noise and Artifacts: Noise is the random component of a measurement error. It may involve the distortion of a value or the addition of spurious objects. A data artifact, by contrast, is a data flaw caused by the techniques or conditions of measurement.
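A sketch (the signal and noise level are made up) of how random noise distorts repeated measurements of the same quantity:

```python
import random

random.seed(42)  # make the random noise reproducible for the example

# A clean "true" signal: the same quantity measured ten times.
true_value = 100.0
clean = [true_value] * 10

# Noise is the random component of measurement error.
noisy = [x + random.gauss(0, 2.0) for x in clean]

# Individual readings are distorted, but random noise tends to average out.
mean_reading = sum(noisy) / len(noisy)
print(round(mean_reading, 1))
```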
Precision, Bias, and Accuracy:

In statistics and experimental science, the quality of the measurement process and of the resulting data is measured by precision and bias.

Precision: The closeness of repeated measurements (of the same quantity) to one another. Precision is often measured by the standard deviation of a set of values.

Bias: A systematic variation of the measurements from the quantity being measured. Bias is measured by taking the difference between the mean of the set of values and the known (true) value of the quantity.

Accuracy: The closeness of measurements to the true value of the quantity being measured. Accuracy depends on both precision and bias. One important aspect of accuracy is the use of significant digits: the goal is to use only as many digits to represent the result of a measurement or calculation as are justified by the precision of the data.
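These definitions can be sketched directly (the sample readings are invented): precision as the standard deviation of repeated measurements, and bias as the difference between their mean and the true value.

```python
import statistics

true_mass = 1.000  # grams: the known quantity being measured

# Five repeated weighings of the same 1 g standard weight.
readings = [1.015, 1.012, 1.022, 1.018, 1.013]

mean = statistics.mean(readings)
precision = statistics.stdev(readings)   # spread of repeated measurements
bias = mean - true_mass                  # systematic offset from the truth

print(round(mean, 4), round(precision, 4), round(bias, 4))
```

Here the scale is precise (readings cluster within a few thousandths of a gram) but biased: it consistently reads about 0.016 g high.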
Outliers: Outliers are either (1) data objects that, in some sense, have characteristics different from most of the other data objects in the data set, or (2) values of an attribute that are unusual with respect to the typical values for that attribute.
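One common sketch of case (2) is flagging unusual attribute values by their z-score; the values and the threshold of 2 are arbitrary choices for illustration:

```python
import statistics

values = [10, 12, 11, 13, 12, 11, 95]  # 95 looks suspicious

mean = statistics.mean(values)
std = statistics.stdev(values)

# Flag values more than 2 standard deviations from the mean.
outliers = [v for v in values if abs(v - mean) / std > 2]
print(outliers)  # [95]
```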
Missing Values: It is not unusual for an object to be missing one or more attribute values. In some cases, the information was not collected; e.g., some people decline to give their age or weight. There are several strategies (and variations on these strategies) for dealing with missing data, each of which may be appropriate in certain circumstances.
Estimate Missing Values Sometimes missing data can be reliably estimated. For example,
consider a time series that changes in a reasonably smooth fashion, but has a few, widely
scattered missing values. In such cases, the missing values can be estimated (interpolated)
by using the remaining values.
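A minimal sketch (the series is invented) of estimating an isolated missing value by linear interpolation between its neighbours:

```python
# A smooth time series with one missing reading (None).
series = [20.0, 21.0, None, 23.0, 24.0]

# Linearly interpolate each missing value from its immediate neighbours.
# (Assumes isolated gaps: the two neighbouring values are known.)
filled = series[:]
for i, v in enumerate(filled):
    if v is None:
        filled[i] = (filled[i - 1] + filled[i + 1]) / 2

print(filled)  # [20.0, 21.0, 22.0, 23.0, 24.0]
```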
Ignore the Missing Value during Analysis: Many data mining approaches can be
modified to ignore missing values. For example, suppose that objects are being clustered
and the similarity between pairs of data objects needs to be calculated.
Inconsistent Values: Data can contain inconsistent values. Consider an address field where both a zip code and a city are listed, but the specified zip code area is not contained in that city. It may be that the individual entering this information transposed two digits, or perhaps a digit was misread when the information was scanned from a handwritten form.
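A sketch of such a consistency check; the zip-to-city mapping below is entirely hypothetical and stands in for a real reference table:

```python
# Hypothetical reference mapping from zip code to the city it belongs to.
ZIP_TO_CITY = {
    "521157": "CityA",
    "521301": "CityB",
}

def is_consistent(record: dict) -> bool:
    """Return True if the record's zip code matches its listed city."""
    return ZIP_TO_CITY.get(record["zip"]) == record["city"]

good = {"zip": "521157", "city": "CityA"}
bad = {"zip": "521301", "city": "CityA"}  # transposed or garbled entry

print(is_consistent(good), is_consistent(bad))  # True False
```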
Duplicate Data: A data set may include data objects that are duplicates, or almost duplicates, of one another. Many people receive duplicate mailings because they appear in a database multiple times under slightly different names. To detect and eliminate such duplicates, two main issues must be addressed. First, if there are two objects that actually represent a single object, then the values of corresponding attributes may differ, and these inconsistent values must be resolved. Second, care needs to be taken to avoid accidentally combining data objects that are similar, but not duplicates, such as two distinct people with identical names.
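A minimal sketch (the names are invented) of catching near-duplicate records by normalising names before comparison:

```python
def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and collapse extra spaces."""
    cleaned = "".join(c for c in name.lower() if c.isalnum() or c.isspace())
    return " ".join(cleaned.split())

mailing_list = ["John  Smith", "john smith.", "J. Smith", "Jane Smith"]

# Group original entries by their normalised form.
seen = {}
for name in mailing_list:
    seen.setdefault(normalize(name), []).append(name)

duplicates = {k: v for k, v in seen.items() if len(v) > 1}
print(duplicates)  # {'john smith': ['John  Smith', 'john smith.']}
```

Note that this only merges trivially different spellings; deciding whether "J. Smith" is the same person is exactly the harder second issue described above.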
• Issues Related to Applications: The general issues are:
• Timeliness: Some data starts to age as soon as it has been collected. In particular, if the data provides a snapshot of some ongoing phenomenon or process, such as the purchasing behaviour of customers or Web browsing patterns, then this snapshot represents reality for only a limited time.
• Relevance: The available data must contain the information necessary for the application. Consider the task of building a model that predicts the accident rate for drivers. If information about the age and gender of the driver is omitted, then it is likely that the model will have limited accuracy unless this information is indirectly available through other attributes.
• Knowledge about the Data: Ideally, data sets are accompanied by documentation that describes different aspects of the data; the quality of this documentation can either aid or hinder the subsequent analysis. For example, if the documentation identifies several attributes as being strongly related, these attributes are likely to provide highly redundant information.