Unit-2

Abhishek S. Rao
Introduction to Data Mining
Data mining refers to extracting or "mining" knowledge from large amounts of data.

Many people treat data mining as a synonym for another popularly used term, Knowledge Discovery from Data, or KDD. Alternatively, others view data mining as simply an essential step in the process of knowledge discovery. Knowledge discovery as a process consists of an iterative sequence of the following steps:

1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved)
4. Data transformation (where data are consolidated into forms appropriate for mining, e.g., by aggregation)
5. Data mining (where intelligent methods are applied to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge)
7. Knowledge presentation (where the mined knowledge is presented to the user)
Architecture of a typical data mining system
Challenges of Data Mining
Data Mining Tasks
The 6 CRISP-DM phases

1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
7 Crucial Business Benefits of Data Mining
1. Improved Decision-making
2. Enhanced Customer Understanding
3. Increased Sales and Revenue
4. Risk Mitigation and Fraud Detection
5. Competitive Advantage
6. Cost Reduction and Operational Efficiency
What kinds of data can be mined?
What kind of patterns can be mined in data mining?
Data Mining - Issues

Data mining is not an easy task: the algorithms used can become very complex, and data is not always available in one place, so it needs to be integrated from various heterogeneous data sources. These factors also give rise to a number of issues.
Data Preprocessing
Data preprocessing is an important step in the data mining process. It refers to cleaning, transforming, and integrating data to make it ready for analysis. The goal of data preprocessing is to improve the quality of the data and to make it more suitable for the specific data mining task.
Data Quality
Poor data quality negatively affects many data processing efforts. Data mining example: a classification model for detecting people who are loan risks is built using poor-quality data. The results may be that:
• some credit-worthy candidates are denied loans
• loans are given to individuals who are likely to default
What kinds of data quality problems occur? How can we detect problems with the data? What can we do about these problems?

Examples of data quality problems (a quick detection sketch in pandas follows this list):
• Noise and outliers
• Missing values
• Duplicate data
• Wrong data
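
A minimal pandas sketch of how two of these problems (missing values and duplicate data) can be detected, plus a simple rule-based check for wrong values; the DataFrame and its column names are hypothetical:

import numpy as np
import pandas as pd

# Hypothetical data: one duplicated row, several missing values,
# and one implausible age that a sanity rule can flag as wrong data.
df = pd.DataFrame({
    "customer": ["A", "B", "B", "C", "D"],
    "age":      [34, np.nan, np.nan, 29, 290],
    "spend":    [120.0, 85.5, 85.5, np.nan, 60.0],
})

print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # count of fully duplicated rows
print(df[df["age"] > 120])    # simple sanity rule for wrong values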
Missing Data Handling

Missing values have many causes: malfunctioning equipment, changes in experimental design, collation of different data sources, and so on. People may also decline to supply information, or the information may simply not be applicable (e.g., children do not have an annual income). Common handling strategies (a minimal pandas sketch follows this list):

• Discard records with missing values
• For ordinal or continuous data, replace missing values with the attribute mean
• Substitute a value from a similar instance
• Ignore missing values, i.e., just proceed and let the tools deal with them
• Treat missing values as equal (all share the same missing-value code)
• Treat missing values as unequal values

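A minimal pandas sketch of the first three strategies above; the DataFrame and its column names (age, income) are made up for illustration:

import numpy as np
import pandas as pd

# Hypothetical data with missing values (NaN is pandas' missing-value code).
df = pd.DataFrame({
    "age":    [23, 45, np.nan, 31, 52],
    "income": [48000, np.nan, 61000, np.nan, 75000],
})

# Strategy 1: discard records with missing values.
dropped = df.dropna()

# Strategy 2: replace missing values with the attribute (column) mean.
imputed = df.fillna(df.mean(numeric_only=True))

# Strategy 3: substitute a value from a similar instance, here approximated
# by sorting on a related attribute and forward-filling from the neighbor.
similar = df.sort_values("age").ffill()

print(dropped, imputed, similar, sep="\n\n")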

Major Tasks in Data Pre-processing

• Data Cleaning: This involves identifying and correcting errors or inconsistencies in the
data, such as missing values, outliers, and duplicates. Various techniques can be used for
data cleaning, such as imputation, removal, and transformation.
• Data Integration: This involves combining data from multiple sources to create a
unified dataset. Data integration can be challenging as it requires handling data with
different formats, structures, and semantics. Techniques such as record linkage and data
fusion can be used for data integration.
• Data Transformation: This involves converting the data into a suitable format for
analysis. Common techniques used in data transformation include normalization,
standardization, and discretization. Normalization is used to scale the data to a common
range, while standardization is used to transform the data to have zero mean and unit
variance. Discretization is used to convert continuous data into discrete categories.
• Data Reduction: This involves reducing the size of the dataset while preserving the
important information. Data reduction can be achieved through techniques such as feature
selection and feature extraction. Feature selection involves selecting a subset of relevant
features from the dataset, while feature extraction involves transforming the data into a
lower-dimensional space while preserving the important information.
• Data Discretization: This involves dividing continuous data into discrete categories or
intervals. Discretization is often used in data mining and machine learning algorithms that
require categorical data. Discretization can be achieved through techniques such as equal
width binning, equal frequency binning, and clustering.
• Data Normalization: This involves scaling the data to a common range, such as between 0 and 1 or between -1 and 1. Normalization is often used to handle data with different units and scales. Common normalization techniques include min-max normalization, z-score normalization, and decimal scaling; a minimal sketch of normalization and discretization follows this list.
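
A minimal pandas sketch of three of the techniques named above: min-max normalization, z-score standardization, and equal-width/equal-frequency discretization. The sample values are made up for illustration:

import pandas as pd

values = pd.Series([12.0, 15.5, 20.0, 22.5, 30.0, 45.0])

# Min-max normalization: scale the data to the range [0, 1].
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score standardization: zero mean, unit variance.
z_score = (values - values.mean()) / values.std()

# Equal-width binning: intervals of equal size.
equal_width = pd.cut(values, bins=3, labels=["low", "mid", "high"])

# Equal-frequency binning: roughly the same number of points per bin.
equal_freq = pd.qcut(values, q=3, labels=["low", "mid", "high"])

print(pd.DataFrame({"raw": values, "min_max": min_max, "z": z_score,
                    "width_bin": equal_width, "freq_bin": equal_freq}))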
Data Cleaning
[Figure: A 2-D plot of customer data with respect to customer locations in a city, showing three data clusters. Each cluster centroid is marked with a "+", representing the average point in space for that cluster. Outliers may be detected as values that fall outside of the cluster sets.]
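
A minimal scikit-learn sketch of the cluster-based outlier detection the figure describes: fit k-means, measure each point's distance to its nearest centroid, and flag the points farthest from every cluster. The synthetic data and the 97th-percentile cutoff are made-up choices for illustration:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic clusters of customer locations plus a few stray points.
clusters = [rng.normal(loc=c, scale=0.5, size=(50, 2))
            for c in ([0, 0], [5, 5], [0, 5])]
outliers = np.array([[9.0, 0.0], [-4.0, 8.0]])
X = np.vstack(clusters + [outliers])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# KMeans.transform returns each point's distance to every centroid;
# the minimum is the distance to the point's own cluster center.
dist_to_nearest = km.transform(X).min(axis=1)

# Flag points whose distance exceeds an (illustrative) percentile cutoff.
cutoff = np.percentile(dist_to_nearest, 97)
print("Outlier candidates:\n", X[dist_to_nearest > cutoff])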
Data Integration in Data Mining
Data Transformation
Cube Aggregation
Attribute subset selection
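
As described under Data Reduction above, attribute subset selection keeps only the features most relevant to the task. A minimal scikit-learn sketch, assuming a hypothetical labeled dataset and an illustrative choice of k=2:

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
# Four attributes; only the first two actually relate to the class label.
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Keep the k attributes with the strongest statistical relationship to y.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print("selected attribute indices:", selector.get_support(indices=True))

X_reduced = selector.transform(X)  # dataset with only the chosen attributes
print("reduced shape:", X_reduced.shape)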
Wavelet Transforms in Data Mining
• The discrete wavelet transform (DWT) is a signal processing technique that transforms linear signals. When the DWT is applied, the data vector X is transformed into a numerically different vector, X′, of wavelet coefficients; the two vectors X and X′ must be of the same length. When applying this technique to data reduction, we consider an n-dimensional data tuple X = (x1, x2, …, xn), where n is the number of attributes present in the relation of the data set.
• The wavelet-transformed data can be truncated, and this is what makes the technique useful for data reduction: if we store only a small fraction of the strongest wavelet coefficients, a compressed approximation of the original data is obtained. For example, all wavelet coefficients larger than some user-determined threshold can be retained, and the remaining coefficients are set to 0. The resulting representation is very sparse, so operations that can exploit sparsity are very fast if performed in wavelet space. The technique also removes noise without smoothing out the main features of the data, which makes it effective for data cleaning as well. Given the set of retained coefficients, an approximation of the original data can be reconstructed by applying the inverse of the DWT.
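
A minimal sketch of DWT-based data reduction using the PyWavelets library (pywt), assuming a Haar wavelet and an illustrative keep-the-strongest-10% rule; the signal is synthetic:

import numpy as np
import pywt

# Synthetic signal: a smooth trend plus a little noise.
x = (np.sin(np.linspace(0, 4 * np.pi, 256))
     + 0.1 * np.random.default_rng(1).normal(size=256))

# Forward DWT: decompose into multi-level wavelet coefficients.
coeffs = pywt.wavedec(x, "haar", level=4)

# Keep only the strongest ~10% of coefficients; zero out the rest.
flat = np.concatenate(coeffs)
threshold = np.quantile(np.abs(flat), 0.90)
truncated = [pywt.threshold(c, threshold, mode="hard") for c in coeffs]

# Inverse DWT: reconstruct a compressed approximation of the original data.
x_approx = pywt.waverec(truncated, "haar")
print("max reconstruction error:", np.max(np.abs(x - x_approx)))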
Numerosity Reduction in Data Mining
Discretization in data mining
Techniques of data discretization
Cluster Analysis
DATA WAREHOUSING AND ON-LINE ANALYTICAL PROCESSING

In today's rapidly changing corporate environment, organizations are turning to cloud-based technologies for convenient data collection, reporting, and analysis. This is where data warehousing comes in as a core component of business intelligence that enables businesses to enhance their performance. It is important to understand what a data warehouse is and why it is evolving in the global marketplace.
Key Characteristics of a Data Warehouse

• Subject-Oriented
• Integrated
• Non-Volatile
• Time-Variant
Differences between Operational Database Systems and Data Warehouses
The major task of online operational database systems is to perform
online transaction and query processing. These systems are called
online transaction processing (OLTP) systems. They cover most of the
day-to-day operations of an organization such as purchasing, inventory,
manufacturing, banking, payroll, registration, and accounting. Data
warehouse systems, on the other hand, serve users or knowledge
workers in the role of data analysis and decision making. Such systems
can organize and present data in various formats to accommodate the
diverse needs of different users. These systems are known as online
analytical processing (OLAP) systems.
Comparison of OLTP and OLAP Systems
Data Warehousing: A Multitiered Architecture
Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse
Extraction, Transformation, and Loading
Data Warehouse Modeling: Data Cube and OLAP
Data warehouses and OLAP tools are based on a multidimensional data
model. This model views data in the form of a data cube.
What is a Data Cube?
A data cube is created from a subset of attributes in the database. Specific attributes are chosen to be measure attributes, i.e., the attributes whose values are of interest. Other attributes are selected as dimensions or functional attributes. The measure attributes are aggregated according to the dimensions.
A data cube enables data to be modeled and viewed in multiple dimensions. A multidimensional data model is organized around a central theme, such as sales or transactions, and a fact table represents this theme. Facts are numerical measures; thus, the fact table contains measures (such as Rs_sold) and keys to each of the related dimension tables. Dimensions are the entities or perspectives that define the data cube, and facts are the quantities used for analyzing the relationships between dimensions. A minimal pandas sketch of this idea follows.
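
A minimal pandas sketch of a two-dimensional slice of a sales cube: a fact table whose measure (Rs_sold, as named above) is aggregated over two hypothetical dimensions, item and city; the data is made up:

import pandas as pd

# A tiny fact table: one row per sale, with dimension keys and a measure.
fact = pd.DataFrame({
    "item":    ["TV", "TV", "phone", "phone", "TV", "phone"],
    "city":    ["Mumbai", "Delhi", "Mumbai", "Delhi", "Mumbai", "Mumbai"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "Rs_sold": [52000, 48000, 21000, 23000, 55000, 25000],
})

# Aggregate the measure along two dimensions: a 2-D view of the cube.
cube_2d = fact.pivot_table(values="Rs_sold", index="item",
                           columns="city", aggfunc="sum")
print(cube_2d)

# Rolling up (aggregating away) the city dimension gives a 1-D view.
print(fact.groupby("item")["Rs_sold"].sum())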
Data Warehousing - Schemas
A schema is a logical description of the entire database. It includes the name and description of records of all record types, including all associated data items and aggregates. Much like a database, a data warehouse is also required to maintain a schema. A database uses the relational model, while a data warehouse uses the Star, Snowflake, or Fact Constellation schema. Here, we discuss the schemas used in a data warehouse.
Measures: Their Categorization and Computation in Data Mining
Typical OLAP Operations
Data Warehouse Design
Data Warehouse Usage for Information Processing
From Online Analytical Processing to Multidimensional Data Mining
Data Warehouse Implementation
Supervised Learning
Regression
Regression Vs Classification
Multiple Linear Regression
Multiple linear regression refers to a statistical technique that is used to predict the outcome of a variable based on the values of two or more other variables. It is sometimes known simply as multiple regression, and it is an extension of linear regression. The variable that we want to predict is known as the dependent variable, while the variables we use to predict it are known as independent or explanatory variables. A minimal sketch follows.
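
A minimal scikit-learn sketch of multiple linear regression with two hypothetical explanatory variables (area and age) predicting a dependent variable (price); the numbers are made up:

import numpy as np
from sklearn.linear_model import LinearRegression

# Two independent variables per row: [area_sqft, age_years].
X = np.array([[1200, 10], [1500, 5], [900, 20], [1800, 2], [1100, 15]])
# Dependent variable: price (in thousands of rupees).
y = np.array([3200, 4100, 2300, 5000, 2900])

model = LinearRegression().fit(X, y)

print("coefficients:", model.coef_)  # one coefficient per explanatory variable
print("intercept:", model.intercept_)
print("prediction for [1300 sqft, 8 years]:", model.predict([[1300, 8]]))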
