
Unit 2

Introduction to Data Mining

Prepared By
Arjun Singh Saud, Asst. Prof. CDCSIT

Data Mining-CSIT 7th Prepared BY: Arjun Saud

Data Mining and KDD
• Simply stated, data mining refers to extracting or mining
knowledge from large amounts of data stored in databases, data
warehouses, or other information repositories.
• Many people treat data mining as a synonym for another
popularly used term, Knowledge Discovery from Data or KDD.
• Alternatively, others view data mining as simply an essential
step in the process of knowledge discovery from Data. KDD
consists of an iterative sequence of the following steps:

Figure: Stages of KDD

• Data Cleaning: Data cleaning is the process of removing noise
and inconsistent data from the databases. Its main purpose is to
improve the quality of the data by filling in missing values and
converting the data to a consistent format.
• Data Integration: In this stage multiple data sources may be
combined (i.e. integrated) to form a large database.
• Data Selection: Data which is required for data mining process can
be extracted from multiple and heterogeneous data sources such as
databases, files etc. Data selection is a process where the appropriate
data required for analysis is fetched from the databases.

• Data Transformation: In the transformation stage data
extracted from multiple data sources are converted into an
appropriate format for data mining process. Data reduction or
summarization is used to decrease the number of possible
values of data without affecting the integrity of data.
• Data Mining: It is the most essential step of KDD process where
intelligent methods are applied in order to extract hidden
patterns from data stored in databases.

• Pattern Evaluation: This step identifies the truly interesting
patterns representing knowledge on the basis of some
interestingness measures. Support and confidence are two
widely used interestingness measures. These patterns are
helpful for decision support systems.
• Knowledge Presentation: Knowledge representation and
visualization techniques are used to present the mined
knowledge to the user so that it will be easily understandable to
them.
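
The preprocessing stages above can be sketched in a few lines of Python. This is only a toy illustration: the two stores, their records, and the choice of mean imputation and min-max scaling are assumptions made for the example, not part of the KDD definition.

```python
# Toy records from two hypothetical sources; None marks a missing value.
store_a = [
    {"id": 1, "age": 20, "spend": 120.0},
    {"id": 2, "age": None, "spend": 80.0},
    {"id": 3, "age": 40, "spend": 200.0},
]
store_b = [
    {"id": 4, "age": 35, "spend": 300.0},
    {"id": 4, "age": 35, "spend": 300.0},  # exact duplicate record
]

def clean(rows):
    """Data cleaning: drop exact duplicates, fill missing ages with the mean age."""
    unique = []
    for row in rows:
        if row not in unique:
            unique.append(row)
    known = [r["age"] for r in unique if r["age"] is not None]
    mean_age = sum(known) / len(known)
    return [{**r, "age": mean_age if r["age"] is None else r["age"]} for r in unique]

def integrate(*sources):
    """Data integration: combine several cleaned sources into one data set."""
    return [row for source in sources for row in source]

def select(rows, fields):
    """Data selection: keep only the attributes relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in rows]

def transform(rows, field):
    """Data transformation: min-max scale one numeric attribute to [0, 1]."""
    lo = min(r[field] for r in rows)
    hi = max(r[field] for r in rows)
    return [{**r, field: (r[field] - lo) / (hi - lo)} for r in rows]

prepared = transform(
    select(integrate(clean(store_a), clean(store_b)), ["age", "spend"]), "spend"
)
for row in prepared:
    print(row)
```

The output of these stages is the data set that the mining step would then work on.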

Architecture of Data Mining System

• Data Sources or Repositories: This component represents
multiple data sources such as database, data warehouse, or any
other information repository. Data cleaning and data
integration techniques may be performed on the data.
• Database Server or Data Warehouse Server: The database or
data warehouse server is responsible for fetching the relevant
data, based on the user's data mining request.
• Knowledge Base: This is the domain knowledge that is used to
guide the search or to evaluate the interestingness of the
resulting patterns.

• Data Mining Engine: This is the core component of the data
mining system. It consists of a set of functional modules for
tasks such as association analysis, classification, clustering,
and evolution analysis.
• Pattern Evaluation Module: This component typically employs
interestingness measures and interacts with the data mining
modules so as to focus the search towards interesting patterns.

• Graphical User Interface: This module communicates between
users and the data mining system, allowing the user to interact
with the system by specifying a data mining query or task. This
component allows the user to browse database and data
warehouse schemas or data structures, evaluate mined patterns,
and visualize the patterns in different forms.

Data Mining Functionalities
• Data mining functionalities or the kinds of patterns that can be
discovered are described below.
• Concept/Class Description
• Association and Correlation
• Classification and Regression
• Clustering Analysis
• Outlier Analysis
• Evolution Analysis

Concept/Class Description:
• It can be useful to describe individual classes and concepts in
summarized, concise, and yet precise terms. Such descriptions
of a class or a concept are called class/concept descriptions.
These descriptions can be derived via data characterization or
data discrimination.
• Data characterization is a summarization of the general
characteristics or features of a target class of data. The data
corresponding to the user-specified class are typically collected
by a query.

• For example, to study the characteristics of software products
with sales that increased by 10% in the previous year, the data
related to such products can be collected by executing a query
on the sales database.
• There are several methods for effective data summarization and
characterization. Simple data summaries can be generated
based on statistical measures. The data cube-based OLAP roll-
up operation can also be used to perform user-controlled data
summarization along a specified dimension.

• Data discrimination is a comparison of the general features of the
target class data objects against the general features of objects from
one or multiple contrasting classes. The target and contrasting
classes can be specified by a user, and the corresponding data objects
can be retrieved through database queries.
• For example, a user may want to compare the general features of
software products with sales that increased by 10% last year against
those with sales that decreased by at least 30% during the same
period. The methods used for data discrimination are similar to
those used for data characterization.
Association and Correlation
• Frequent patterns are patterns that occur frequently in data.
Frequent patterns may include frequent itemsets, and
subsequences.
• A frequent itemset typically refers to a set of items that
frequently appear together in a transactional data set, such as
milk and bread.
• A pattern that customers tend to purchase first a PC, followed
by a digital camera, and then a memory card, is a frequent
subsequence. Mining frequent patterns leads to the discovery of
interesting associations and correlations within data.
• For example, the marketing manager of an electronics store would like
to determine which items are frequently purchased together within
the same transactions. A rule mined for this purpose could be
buys(X, “computer”) => buys(X, “software”) [support = 20%,
confidence = 50%]
where X is a variable representing a customer.
• A support of 20% means that 20% of all the transactions under analysis
show that computer and software are purchased together.

• A confidence of 50% means that if a customer buys a computer,
there is a 50% chance that he/she will buy software as well.
• Typically, association rules are discarded as uninteresting if
they do not satisfy both a minimum support threshold and a
minimum confidence threshold. Additional analysis can be
performed to uncover interesting statistical correlations
between associated attribute-value pairs.
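
To make the two measures concrete, the sketch below computes support and confidence for the rule computer => software. The ten transactions are invented for illustration and are chosen so that the numbers match the 20% and 50% quoted above; the function names are assumptions, not a standard API.

```python
# Ten hypothetical transactions, each a set of purchased items.
transactions = [
    {"computer", "software"},
    {"computer", "software"},
    {"computer", "printer"},
    {"computer"},
    {"milk", "bread"},
    {"bread"},
    {"software"},
    {"printer"},
    {"milk"},
    {"camera"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) = support(A and B) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

rule_support = support({"computer", "software"}, transactions)
rule_confidence = confidence({"computer"}, {"software"}, transactions)
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")

# A rule is kept only if it meets both thresholds (threshold values are assumed).
MIN_SUPPORT, MIN_CONFIDENCE = 0.10, 0.40
is_interesting = rule_support >= MIN_SUPPORT and rule_confidence >= MIN_CONFIDENCE
```

Here 2 of the 10 transactions contain both items (support 20%), and 2 of the 4 transactions containing a computer also contain software (confidence 50%).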

Classification and Regression
• Classification is the process of finding a model (or function) that
describes and distinguishes data classes or concepts.
• Mainly it is used to predict the class of objects whose class label is
unknown.
• The derived model is based on the analysis of a set of training data.
Data objects whose class labels are known form the training data.
• The derived model may be represented in various forms, such as
classification rules, decision trees, mathematical formulae, neural networks
etc.

• Whereas classification predicts categorical labels, regression
models continuous-valued functions. That is, regression is used
to predict missing or unavailable numerical data values rather than
class labels.
• Regression analysis is a statistical methodology that is most often
used for numeric prediction, although other methods exist as well.
• Classification and prediction may need to be preceded by relevance
analysis, which attempts to identify attributes that do not contribute
to the classification or prediction process. These attributes can then
be excluded.

• Suppose we want to classify a large set of items in the store based on three
kinds of responses to a sales campaign: good response, mild response,
and no response. We want to derive a model for each of these three
classes based on the descriptive features of the items, such as price,
brand, place made, type, and category. This type of problem can be
solved using classification.
• Suppose instead, that rather than predicting categorical response
labels for each store item, we would like to predict the amount of
revenue that each item will generate during an upcoming sale, based
on the previous sales data. This is an example of regression analysis.
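
The contrast between the two tasks can be sketched as follows. The training items, their features (price, discount), and the revenue figures are invented for illustration. A 1-nearest-neighbour rule and a least-squares line are used only because they are short; they are one possible choice among many classification and regression methods.

```python
import math

# Classification: hypothetical items described by (price, discount) with a known response.
labeled_items = [
    ((10.0, 0.30), "good"),
    ((12.0, 0.25), "good"),
    ((50.0, 0.10), "mild"),
    ((55.0, 0.05), "mild"),
    ((200.0, 0.00), "none"),
]

def classify(features):
    """1-nearest-neighbour: return the label of the closest training item."""
    nearest = min(labeled_items, key=lambda item: math.dist(item[0], features))
    return nearest[1]

# Regression: fit revenue = slope * past_sales + intercept by least squares.
def fit_line(xs, ys):
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x

past_sales = [1.0, 2.0, 3.0, 4.0]   # hypothetical units sold in earlier sales
revenue = [3.0, 5.0, 7.0, 9.0]      # hypothetical revenue for those sales
slope, intercept = fit_line(past_sales, revenue)

predicted_label = classify((11.0, 0.20))     # a categorical class label
predicted_revenue = slope * 5.0 + intercept  # a continuous numeric value
print(predicted_label, predicted_revenue)
```

The classifier returns one of the three response labels, while the regression model returns a number. Note that the distance here is computed on unscaled features, so price dominates; a real study would normally normalize the attributes first.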
Cluster Analysis
• Unlike classification and prediction, which analyze output-
labeled data objects, clustering analyzes data objects without
consulting a known output-label.
• Clustering can be used to generate such labels. The objects are
clustered or grouped based on the principle of maximizing the
intra-class similarity and minimizing the interclass similarity.
• That is, clusters of objects are formed so that objects within a
cluster have high similarity in comparison to one another, but
are very dissimilar to objects in other clusters.
• For example, cluster analysis can be performed on customer
data to identify homogeneous subpopulations of customers.
These clusters may represent individual target groups for
marketing.

Figure: Three Data Clusters
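
As a rough illustration of grouping unlabeled objects, the sketch below runs a minimal k-means loop on six invented two-dimensional customer points. k-means is only one of many clustering methods, and the starting centroids are supplied by hand here (an assumption made to keep the run deterministic).

```python
import math

def kmeans(points, centroids, iterations=10):
    """Minimal k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                # New centroid is the coordinate-wise mean of the cluster members.
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid unchanged
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return labels, centroids

# Hypothetical customers described by two numeric attributes; no class labels are given.
customers = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
labels, centroids = kmeans(customers, centroids=[(1.0, 1.0), (9.0, 9.0)])
print(labels)
```

The cluster numbers produced this way can then serve as generated labels for the customers.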

Outlier Analysis
• A database may contain data objects that do not comply with
the general behavior or model of the data. These data objects are
outliers.
• Most data mining methods discard outliers as noise or
exceptions. However, in some applications such as fraud
detection, the rare events can be more interesting than the more
regularly occurring ones.

• Outliers may be detected using statistical tests that assume a
distribution or probability model for the data, or using distance
measures where objects that are a substantial distance from any
other cluster are considered outliers.
• For example, outlier analysis may uncover fraudulent usage of
credit cards by detecting purchases of extremely large amounts
for a given account number in comparison to regular charges
incurred by the same account.
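
A simple way to flag such a charge is sketched below. It uses the median and the median absolute deviation (MAD) rather than a specific statistical test, because with only a handful of values a single huge charge inflates the mean and standard deviation. The charges and the cut-off factor k = 5 are assumptions for illustration.

```python
import statistics

def flag_outliers(values, k=5.0):
    """Flag values whose distance from the median exceeds k times the MAD."""
    center = statistics.median(values)
    mad = statistics.median(abs(v - center) for v in values)
    if mad == 0:
        # Degenerate case: more than half the values are identical.
        return [v for v in values if v != center]
    return [v for v in values if abs(v - center) > k * mad]

# Hypothetical charges on one credit card account.
charges = [20, 35, 25, 30, 40, 28, 32, 5000]
suspicious = flag_outliers(charges)
print(suspicious)
```

For these charges the median is 31 and the MAD is 5, so only the 5000 charge lies more than 25 away from the median.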

Evolution Analysis
• Data evolution analysis describes and models regularities or trends
for objects whose behavior changes over time.
• Distinct features of such an analysis include time-series data
analysis, sequence or pattern matching.
• For example, suppose you have the major stock market (time-series) data of
the last several years available from the Nepal Stock Exchange. A
data mining study of stock exchange may identify stock evolution
regularities for overall stocks and for the stocks of particular
companies. Such regularities may help predict future trends in stock
market prices, contributing to our decision making regarding stock
investments.
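
One of the simplest time-series tools is a moving average, which smooths short-term fluctuations so that a trend becomes visible. The closing prices below are invented, not real Nepal Stock Exchange data, and the window size of 3 is an arbitrary choice.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(series[i : i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Hypothetical daily closing prices of one stock.
prices = [10, 12, 11, 13, 15, 14, 16]
smoothed = moving_average(prices, window=3)
trend = "upward" if smoothed[-1] > smoothed[0] else "downward or flat"
print(smoothed, trend)
```

The raw prices move up and down from day to day, but the smoothed series rises steadily, which is the kind of regularity evolution analysis looks for.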
Data Objects and Attribute Types
• Data sets are made up of data objects. A data object represents
an entity.
• In a sales database, the objects may be customers, store items,
and sales; in a university database, the objects may be students,
professors, and courses.
• Data objects are typically described by attributes. Data objects
can also be referred to as samples, examples, instances, data points,
or objects.
• If the data objects are stored in a database, they are data tuples.
That is, the rows of a database correspond to the data objects,
and the columns correspond to the attributes.
What is an Attribute?
• An attribute is a data field, representing a characteristic or
feature of a data object. The nouns attribute, dimension, feature,
and variable are often used interchangeably in the literature.
• The term dimension is commonly used in data warehousing.
Machine learning literature tends to use the term feature, while
statisticians prefer the term variable. Data mining and database
professionals commonly use the term attribute.
• Attributes describing a customer object can include, for
example, customer ID, name, and address.
Types of Attributes
On the basis of the set of possible values, attributes can be divided
into the following types:
• Nominal Attributes
• Ordinal Attributes
• Interval-scaled Attributes
• Ratio-scaled Attributes

Nominal Attributes
• The values of a nominal attribute are symbols or names of
things. Each value represents some kind of category, code, or
state, and so nominal attributes are also referred to as
categorical. The values do not have any meaningful order.
• Examples of nominal attributes:
Hair_color: possible values are: {black, brown, red, grey, white}
Marital_status: possible values are:{Married, Single, Divorced,
Widowed}
Customer_ID: possible values are: Combination of numbers

• It is possible to represent such symbols with numbers. With
hair_color, for instance, we can assign a code of 0 for black, 1 for
brown, and so on.
• However, in such cases, the numbers are not intended to be used
quantitatively. That is, mathematical operations on values of
nominal attributes are not meaningful. It makes no sense to subtract
one customer ID number from another.
• A binary attribute is a nominal attribute with only two categories or
states: 0 or 1, where 0 typically means that the attribute is absent,
and 1 means that it is present.
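
The coding idea can be sketched as follows. The integer codes are arbitrary labels, so a common alternative (shown here as an illustration, not something the definition requires) is to expand a nominal attribute into one binary attribute per category.

```python
hair_colors = ["black", "brown", "red", "grey", "white"]

# Integer coding: 0 for black, 1 for brown, and so on. The numbers are only names.
code = {color: i for i, color in enumerate(hair_colors)}

def one_hot(color):
    """Represent a nominal value as binary attributes: 1 = present, 0 = absent."""
    return [1 if c == color else 0 for c in hair_colors]

print(code["brown"])   # the code assigned to brown
print(one_hot("red"))  # one binary attribute per hair colour

# Arithmetic on the codes runs without error but has no meaning:
# code["red"] - code["brown"] does not describe any difference between the colours.
```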

Ordinal Attributes
• An ordinal attribute is an attribute with possible values that
have a meaningful order or ranking among them, but the
magnitude between successive values is not known.
• Examples of ordinal attributes:
Grades: possible values are: {A+, A, A-, B+, B, B- and so on}
Height: possible values are:{Tall, Medium, Short}
• The values have a meaningful sequence (which corresponds to
increasing height ); however, we cannot tell from the values how
much bigger, say, a medium is than a short.
• Ordinal attributes may also be obtained from the discretization
of numeric quantities by splitting the value range into a finite
number of ordered categories.
• Note that nominal, and ordinal attributes are qualitative. That is,
they describe a feature of an object without giving an actual size
or quantity.
• We can compute the median and mode of ordinal attributes, but
not the mean.
• For nominal attributes, we can compute only the mode.
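
The sketch below discretizes a numeric height into the ordered categories Short, Medium, and Tall and then takes the median and mode of the resulting ordinal values. The cut-off points of 160 cm and 180 cm and the sample heights are assumptions chosen for the example.

```python
import statistics

LEVELS = ["Short", "Medium", "Tall"]  # listed in increasing order

def discretize(height_cm):
    """Map a numeric height to an ordered category (cut-offs are assumed)."""
    if height_cm < 160:
        return "Short"
    if height_cm < 180:
        return "Medium"
    return "Tall"

heights = [150, 165, 172, 175, 190]
labels = [discretize(h) for h in heights]

# The median only needs the ordering, so it is taken over the rank positions.
ranks = sorted(LEVELS.index(label) for label in labels)
median_label = LEVELS[statistics.median_low(ranks)]
mode_label = statistics.mode(labels)
print(labels, median_label, mode_label)
```

A mean of these labels is not computed, because the distance between Short and Medium is not defined.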

Interval-Scaled Attributes
• Interval-scaled attributes are numeric attributes. A numeric attribute
is quantitative; that is, it is a measurable quantity, represented in
integer or real values.
• The values of interval-scaled attributes have order and can be
positive, 0, or negative. Thus, in addition to providing a ranking of
values, such attributes allow us to compare and quantify the
difference between values.
• Because interval-scaled attributes are numeric, we can compute their
mean value, in addition to the median and mode measures of central
tendency.

• A temperature attribute is interval-scaled. Suppose that we have
the outdoor temperature value for a number of different days,
where each day is an object. By ordering the values, we obtain a
ranking of the objects with respect to temperature.
• In addition, we can quantify the difference between values. For
example, a temperature of 20°C is five degrees higher than a
temperature of 15°C.
• Calendar dates are another example. For instance, the years
2002 and 2010 are eight years apart.
• Temperatures in Celsius and Fahrenheit do not have a true
zero-point, that is, neither 0°C nor 0°F indicates “no
temperature.”
• Although we can compute the difference between temperature
values, we cannot talk of one temperature value as being a
multiple of another.
• Without a true zero, we cannot say, for instance, that 10°C is
twice as warm as 5°C. That is, we cannot speak of the values in
terms of ratios. Similarly, there is no true zero-point for
calendar dates.
Ratio-Scaled Attributes
• A ratio-scaled attribute is a numeric attribute with an inherent
zero-point.
• That is, if a measurement is ratio-scaled, we can speak of a
value as being a multiple (or ratio) of another value.
• In addition, the values are ordered, and we can also compute
the difference between values, as well as the mean, median, and
mode.
• Temperature in Kelvin, length, counts, elapsed time, etc. are
examples of ratio scaled attributes
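
A quick numeric check shows why ratios are meaningful only on a scale with a true zero. The same two temperatures are expressed in Celsius, Fahrenheit, and Kelvin using the standard conversion formulas; the variable names are illustrative.

```python
def celsius_to_kelvin(c):
    return c + 273.15

def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

warm_c, cool_c = 10.0, 5.0

# Interval scales: the ratio changes with the unit, so it carries no meaning.
ratio_celsius = warm_c / cool_c
ratio_fahrenheit = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cool_c)

# Ratio scale: Kelvin has a true zero, so this ratio is physically meaningful.
ratio_kelvin = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(ratio_celsius, round(ratio_fahrenheit, 3), round(ratio_kelvin, 3))
```

The Celsius ratio is 2 and the Fahrenheit ratio is about 1.22 for the same pair of temperatures, while on the Kelvin scale the warmer value is only about 1.8% larger.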

Discrete vs. Continuous Attributes
• A discrete attribute has a finite or countably infinite set of
values, which may or may not be represented as integers.
• The attributes hair_color, marital_status, gender, etc. are
examples of discrete attributes with a finite set of values.
Customer_ID is discrete with a countably infinite set of values.
• If an attribute is not discrete, it is continuous. Continuous
attributes take real values and are typically represented as
floating-point variables.
• Attributes such as temperature, height, and weight are examples
of continuous attributes.

Statistical Description of Data
• For data preprocessing to be successful, it is essential to have an
overall picture of your data.
• Basic statistical descriptions can be used to identify properties
of the data and highlight which data values should be treated as
noise or outliers.
• Basic statistical descriptions include Measure of Central Tendency
and Measure of Dispersion.

Measure of Central Tendency
• The most common and effective numeric measure of the
“center” of a set of data is the (arithmetic) mean.
• Let x1, x2, ..., xN be a set of N values or observations, such as for
some numeric attribute X, like salary. The mean of this set of
values is

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N}$$

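• Sometimes, each value xi in a set may be associated with a
weight wi for i = 1, ..., N. The weights reflect the significance,
importance, or occurrence frequency attached to their respective
values. In this case, we can compute the weighted arithmetic mean
(weighted average):

$$\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}$$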
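
As a small check of the weighted arithmetic mean, the sketch below treats the weights as occurrence frequencies. The salary values and head-counts are invented for illustration.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

salaries = [50, 60, 70]   # hypothetical salary levels (in thousands)
head_count = [1, 2, 1]    # how many employees earn each salary

print(weighted_mean(salaries, head_count))

# With equal weights the result reduces to the ordinary arithmetic mean.
plain_mean = weighted_mean(salaries, [1, 1, 1])
```

Here the result is (50 + 2 × 60 + 70) / 4 = 60.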

• Although the mean is a useful quantity for describing a data
set, it is not always the best way of measuring the center of the
data. A major problem with the mean is its sensitivity to
extreme (e.g., outlier) values. Even a small number of extreme
values can corrupt the mean.
• For example, the mean salary at a company may be
substantially pushed up by that of a few highly paid managers.

• For skewed (asymmetric) data, a better measure of the center of
data is the median, which is the middle value in a set of ordered
data values. It is the value that separates the higher half of a
data set from the lower half.
• In probability and statistics, the median generally applies to
numeric data; however, we may extend the concept to ordinal
data.
• Suppose that a given data set of N values for an attribute X is
sorted in increasing order. If N is odd, then the median is the
middle value of the ordered set.
• If N is even, then the median is not unique; it is the two
middlemost values and any value in between. If X is a numeric
attribute in this case, by convention, the median is taken as the
average of the two middlemost values.
• The mode is another measure of central tendency. The mode for
a set of data is the value that occurs most frequently in the set.
Therefore, it can be determined for qualitative and quantitative
attributes.
• It is possible for the greatest frequency to correspond to several
different values, which results in more than one mode.
• Data sets with one, two, or three modes are respectively called
unimodal, bimodal, and trimodal. In general, a data set with
two or more modes is multimodal. At the other extreme, if each
data value occurs only once, then there is no mode.
• The midrange can also be used to assess the central tendency of
a numeric data set. It is the average of the largest and smallest
values in the set.

Example
• Suppose we have the following values for salary in thousands:
50, 52, 52, 56, 60, 63, 70, 70, 110, 30, 36, 47. Calculate mean,
median, mode, and midrange for the above data.
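
One way to check the answers for this exercise is with Python's standard statistics module. The code below is a sketch of the calculation, not the only way to do it.

```python
import statistics

salaries = [50, 52, 52, 56, 60, 63, 70, 70, 110, 30, 36, 47]  # in thousands

mean = statistics.mean(salaries)         # sum of the values divided by N
median = statistics.median(salaries)     # N is even: average of the two middle values
modes = statistics.multimode(salaries)   # every value with the highest frequency
midrange = (min(salaries) + max(salaries)) / 2

print(mean, median, modes, midrange)
```

The values sum to 696, so the mean is 696 / 12 = 58. After sorting, the two middle values are 52 and 56, giving a median of 54. Both 52 and 70 occur twice, so the data are bimodal. The midrange is (30 + 110) / 2 = 70.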

Measure of Dispersion
• Let x1,x2, ……. ,xN be a set of observations for some numeric
attribute, X. The range of the set is the difference between the
largest and smallest values.
• Suppose that the data for attribute X are sorted in increasing
numeric order. Quantiles are points taken at regular intervals of
a data distribution, dividing it into essentially equal-size
consecutive sets.
• The 2-quantile is the data point dividing the lower and upper
halves of the data distribution. It corresponds to the median.
• The 4-quantiles are the three data points that split the data
distribution into four equal parts; each part represents one-
fourth of the data distribution. They are more commonly
referred to as quartiles.
• The 100-quantiles are more commonly referred to as
percentiles; they divide the data distribution into 100 equal-
sized consecutive sets.

• The quartiles give an indication of a distribution’s center,
spread, and shape. The first quartile, denoted by Q1, is the 25th
percentile. It cuts off the lowest 25% of the data.
• The third quartile, denoted by Q3, is the 75th percentile—it cuts
off the lowest 75% (or highest 25%) of the data.
• The second quartile is the 50th percentile. As the median, it
gives the center of the data distribution.

• The distance between the first and third quartiles is a simple
measure of spread that gives the range covered by the middle
half of the data. This distance is called the interquartile range
(IQR) and is defined as
IQR=Q3-Q1
• Variance and standard deviation are measures of data
dispersion. They indicate how spread out a data distribution is.

• A low standard deviation means that the data observations tend
to be very close to the mean, while a high standard deviation
indicates that the data are spread out over a large range of
values.
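• The variance of N observations, x1, x2, ..., xN, for a numeric
attribute X is

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2 = \left(\frac{1}{N}\sum_{i=1}^{N} x_i^2\right) - \bar{x}^2$$

where \bar{x} is the mean of the observations.
• The standard deviation, σ, of the observations is the square
root of the variance, σ².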

Example
• Suppose we have the following values for salary in thousands:
50, 52, 52, 56, 60, 63, 70, 70, 110, 30, 36, 47. Calculate the range,
4-quantiles, IQR, and variance for the above data.
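
The answers can be checked with the standard statistics module, as sketched below. Two assumptions are worth noting: quartile values depend on the interpolation convention (the "exclusive" method is used here, and other textbooks or tools may give slightly different Q1 and Q3), and the variance is computed by dividing by N rather than N - 1.

```python
import statistics

salaries = [50, 52, 52, 56, 60, 63, 70, 70, 110, 30, 36, 47]  # in thousands

value_range = max(salaries) - min(salaries)

# 4-quantiles (quartiles); the interpolation method is a choice, not a universal rule.
q1, q2, q3 = statistics.quantiles(salaries, n=4, method="exclusive")
iqr = q3 - q1

variance = statistics.pvariance(salaries)  # divides by N
std_dev = statistics.pstdev(salaries)      # square root of the variance

print(value_range, (q1, q2, q3), iqr, round(variance, 2), round(std_dev, 2))
```

With this convention the range is 80, the quartiles are 47.75, 54, and 68.25, and the IQR is 20.5. The sum of squared deviations from the mean of 58 is 4550, so the variance is 4550 / 12 ≈ 379.17 and the standard deviation is about 19.47.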

Applications of Data Mining
• Data mining can be applied in almost every field. Some of the
major applications of data mining are briefly discussed below.
• Business Intelligence: Data mining helps businesses perform
effective market analysis, compare customer feedback on
similar products, discover the strengths and weaknesses of their
competitors, retain highly valuable customers, and make smart
business decisions.

• Market Basket Analysis: Market basket analysis is a modelling
technique based upon a theory that if you buy a certain group
of items you are more likely to buy another group of items. This
technique may allow the retailer to understand the purchase
behaviour of a buyer. This information may help the retailer to
know the buyer’s needs and change the store’s layout
accordingly.
• Fraud Detection: Use historical data to build models of
fraudulent behavior and use data mining to help identify
similar instances. For example, detect suspicious money
transactions.

• Intrusion Detection: Data mining can help improve intrusion
detection by adding a level of focus to anomaly detection. It
helps an analyst distinguish unusual activity from common
everyday network activity.
• Customer Segmentation: Data mining helps group customers
into distinct segments so that offerings can be tailored to the
needs of each segment. The business can then make special
offers to each segment and enhance customer satisfaction.

• Bioinformatics: Applications of data mining to bioinformatics
include gene finding, protein function inference, disease
diagnosis, disease treatment optimization, etc.
• Web Search Engines: Web search engines are essentially very
large data mining applications. Various data mining techniques
are used in all aspects of search engines, ranging from crawling
and indexing to searching.

• Social Web and Networks: There are a growing number of
highly-popular user-centric applications such as blogs, wikis
and Web communities that generate a lot of structured and
semi-structured information. In these applications data mining
can be used to explain and predict the evolution of social
networks, personalized search for social interaction, user
behavior prediction etc.

Issues in Data Mining
• The major issues in data mining research can be partitioned into
five groups:
• Mining methodology
• User interaction
• Efficiency and scalability
• Diversity of data types, and
• Data mining and society

Mining Methodology
• Mining various and new kinds of knowledge: As there are
diverse applications, new mining tasks continue to emerge.
These tasks can use the same database in different ways and
require the development of new data mining techniques.
• Mining knowledge in multidimensional space: While
searching for knowledge in large datasets, we need to explore
multidimensional space. To find interesting patterns, various
combinations of dimensions need to be applied.

• Data mining—an interdisciplinary effort: The power of data
mining can be substantially enhanced by integrating methods
from multiple disciplines. For example, to mine data with
natural language text, it makes sense to fuse data mining
methods with methods of information retrieval and natural
language processing.
• Boosting the power of discovery in a networked environment:
Knowledge derived in one set of objects can be used to boost
the discovery of knowledge in a “related” or semantically
linked set of objects.

• Handling uncertainty, noise, or incompleteness of data: Errors
and noise may confuse the data mining process, leading to the
derivation of erroneous patterns. Therefore, techniques like data
cleaning, data preprocessing, outlier detection and removal
need to be integrated with the data mining process.
• Pattern evaluation and pattern- or constraint-guided mining:
Not all the patterns generated by data mining processes are
interesting. Therefore, techniques are needed to assess the
interestingness of discovered patterns.

User Interaction
• Interactive Mining: Interactive mining allows users to focus the
search for patterns from different angles. The data mining
process should be interactive because it is difficult to know
what can be discovered within a database.
• Incorporation of background knowledge: Background
knowledge, constraints, rules, and other information regarding
the domain under study should be incorporated into the
knowledge discovery process. Such knowledge can be used for
pattern evaluation as well as to guide the search toward
interesting patterns.
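The two uses of background knowledge mentioned above can be sketched as follows. Here the background knowledge is assumed to be a small concept hierarchy that maps items to categories, and the user constraint is assumed to be "the pattern must involve a given category". The hierarchy CATEGORY_OF, the helper names, and the candidate patterns are all hypothetical.

```python
# Hypothetical background knowledge: a concept hierarchy (item -> category).
CATEGORY_OF = {
    "milk": "dairy",
    "butter": "dairy",
    "bread": "bakery",
    "laptop": "electronics",
}


def generalize(transaction, hierarchy):
    """Replace each item with its category; unknown items are kept as-is."""
    return {hierarchy.get(item, item) for item in transaction}


def satisfies_constraint(pattern, required_category, hierarchy):
    """User constraint: at least one item must belong to the category."""
    return any(hierarchy.get(item) == required_category for item in pattern)


# Candidate patterns, as if produced by an earlier mining step.
candidate_patterns = [{"milk", "bread"}, {"laptop"}, {"butter", "milk"}]

# Guide the search: keep only patterns that involve dairy products.
dairy_patterns = [
    p for p in candidate_patterns
    if satisfies_constraint(p, "dairy", CATEGORY_OF)
]

print(dairy_patterns)
print(generalize({"milk", "butter", "bread"}, CATEGORY_OF))
```

Generalizing items to categories lets patterns be expressed at a higher level of abstraction, while the constraint removes candidates the user does not care about before any further evaluation.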

• Ad hoc data mining and data mining query languages: High-
level data mining query languages will give users the freedom
to define ad hoc data mining tasks. Optimization of the
processing of such flexible mining requests is another
promising area of study.
• Presentation and visualization of data mining results: The
knowledge discovered by mining the data should be easy for
humans to understand and use. The system should adopt
expressive knowledge representations and user-friendly
visualization techniques.


Issue in Data Mining
Efficiency and Scalability
• Efficiency and scalability of data mining algorithms: To
effectively extract information from a huge amount of data in
databases, data mining algorithms must be efficient and
scalable.
• Parallel, distributed, and incremental mining algorithms: The
huge size of many databases, the wide distribution of data, and
the computational complexity of some data mining methods are
factors motivating the development of parallel and distributed
data mining algorithms. Such algorithms divide the data into
partitions, which are processed in parallel, and then merge the
results from the partitions. Incremental algorithms incorporate
database updates without mining the entire data again from scratch.
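The partition-and-process idea can be sketched with a simple item-counting task. This is only an illustration under stated assumptions: threads from Python's standard library stand in for separate processors or machines, the mining step is reduced to counting items, and the function names and data are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, n_parts):
    """Divide the rows into n_parts interleaved slices."""
    return [rows[i::n_parts] for i in range(n_parts)]


def count_items(partition):
    """Local mining step: count item occurrences in one partition."""
    counts = Counter()
    for transaction in partition:
        counts.update(transaction)
    return counts


def parallel_item_counts(rows, n_parts=2):
    """Count items per partition in parallel, then merge the local results."""
    partitions = split_into_partitions(rows, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        local_counts = list(pool.map(count_items, partitions))
    total = Counter()
    for counts in local_counts:
        total.update(counts)
    return total


def apply_increment(total, new_rows):
    """Incremental step: fold new rows into existing counts, no rescan."""
    total.update(count_items(new_rows))
    return total


# Hypothetical transactions (illustrative data only).
transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]

totals = parallel_item_counts(transactions, n_parts=2)
print(totals)
print(apply_increment(totals, [["milk", "jam"]]))
```

Counting merges trivially because local counts simply add up. Many real mining tasks need a more careful merge step, which is part of what makes designing parallel and distributed algorithms difficult.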


Issue in Data Mining
Diversity of Data Types
• Handling of relational and complex types of data: There are many
kinds of data stored in databases and data warehouses. It is not
realistic to expect one system to mine all these kinds of data, so
different data mining systems should be constructed for different
kinds of data.
• Mining information from heterogeneous databases and global
information systems: Data is fetched from different data sources
connected through Local Area Networks (LAN) and Wide Area
Networks (WAN). Discovering knowledge from these different
sources of structured, semi-structured, or unstructured data is a
great challenge for data mining.


