Intro to Data Warehousing
Data Warehousing vs Data Mining & Data
Preprocessing in Data Mining
Ch Anwar ul Hassan (Lecturer)
Department of Computer Science and Software
Engineering
Capital University of Sciences & Technology, Islamabad
Pakistan
anwarchaudary@gmail.com
Slide 2
• Data Science is an area
• A data warehouse is built to support management
functions whereas data mining is used to extract
useful information and patterns from data. Data
warehousing is the process of compiling information
into a data warehouse.
Difference between Data Warehousing and Data
Mining
Slide 3
• It is a technology that aggregates structured data from one or
more sources so that it can be compared and analyzed, rather
than used for transaction processing. A data warehouse is designed to
support the management decision-making process by providing a
platform for data cleaning, data integration and data
consolidation. A data warehouse contains subject-oriented,
integrated, time-variant and non-volatile data.
Data Warehousing
Slide 4
• It is the process of finding patterns and correlations within
large data sets to identify relationships between data. Data
mining tools allow a business organization to predict customer
behavior. Data mining tools are used to build risk models and
detect fraud. Data mining is used in market analysis and
management, fraud detection, corporate analysis and risk
management.
Data Mining
Slide 5
• Data preprocessing is a data mining technique which is used to
transform the raw data into a useful and efficient format.
Data Preprocessing in Data Mining
Slide 6
1. Data Cleaning:
Raw data can have many irrelevant and missing parts. Data cleaning handles
these problems; it involves handling missing data, noisy data, etc.
(a). Missing Data:
This situation arises when some values are missing from the dataset. It can be
handled in various ways. Some of them are:
Ignore the tuples:
This approach is suitable only when the dataset is quite large and
multiple values are missing within a tuple.
Fill the Missing values:
There are various ways to do this. You can fill the missing values
manually, with the attribute mean, or with the most probable value.
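The two ways of handling missing data described above can be sketched as follows. This is a minimal illustration, assuming each tuple is a Python dictionary and a missing value is stored as None; the records and the helper names drop_incomplete and fill_with_mean are hypothetical.

```python
from statistics import mean

def drop_incomplete(rows):
    """Ignore the tuples: keep only rows that have no missing (None) value."""
    return [row for row in rows if None not in row.values()]

def fill_with_mean(rows, attribute):
    """Fill missing values of one numeric attribute with the attribute mean."""
    observed = [row[attribute] for row in rows if row[attribute] is not None]
    attribute_mean = mean(observed)
    return [
        {**row, attribute: attribute_mean if row[attribute] is None else row[attribute]}
        for row in rows
    ]

# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 25, "income": 40000},
    {"age": None, "income": 52000},
    {"age": 35, "income": 61000},
]

complete_only = drop_incomplete(records)   # drops the second record
filled = fill_with_mean(records, "age")    # second record gets the mean age
```

Dropping tuples loses the income value of the second record, while filling keeps it at the cost of an estimated age, which is why dropping is only advisable on large datasets.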
Steps Involved in Data Preprocessing:
Slide 7
(b). Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be
generated by faulty data collection, data entry errors, etc. It can be handled in
the following ways:
Binning Method:
This method works on sorted data in order to smooth it. The data is
divided into segments (bins) of equal size, and each segment is handled
separately. All values in a segment can be replaced by the segment mean, or each
value can be replaced by the nearest segment boundary value.
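A small sketch of the binning method described above, assuming equal-size bins of three values each. The sample values and function names are illustrative only.

```python
def equal_size_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical sample values
bins = equal_size_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

With these values the bins are [4, 8, 15], [21, 21, 24] and [25, 28, 34]; smoothing by means gives 9, 22 and 29 for the three bins, and smoothing by boundaries turns the first bin into [4, 4, 15].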
Slide 8
Regression:
Here data is smoothed by fitting it to a regression function. The
regression may be linear (having one independent variable) or multiple
(having multiple independent variables).
Clustering:
This approach groups similar data into clusters. Values that fall outside the
clusters can be treated as outliers, although some outliers may go undetected.
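The regression smoothing described above can be illustrated with a simple linear fit. This is a sketch using ordinary least squares with one independent variable; the data points and the function names fit_line and smooth are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def smooth(xs, ys):
    """Replace each noisy y by the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

# Hypothetical noisy measurements that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)
smoothed = smooth(xs, ys)
```

For these points the fitted slope is about 1.97 and the intercept about 0.09, so the smoothed values lie on a single straight line through the noisy data.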
Slide 9
2. Data Transformation:
This step transforms the data into forms suitable for the mining
process. It involves the following:
Normalization:
It scales the data values into a specified range (-1.0 to 1.0 or
0.0 to 1.0).
Attribute Construction:
In this strategy, new attributes are constructed from the given set of attributes
to help the mining process.
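One common way to scale values into the ranges mentioned above is min-max normalization. The slide does not name a specific method, so the following is a sketch under that assumption; the income values are hypothetical.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min(values), max(values)] to [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        raise ValueError("cannot scale an attribute whose values are all equal")
    span = old_max - old_min
    return [new_min + (v - old_min) / span * (new_max - new_min) for v in values]

incomes = [12000, 35000, 58000, 98000]            # hypothetical attribute values
unit_range = min_max_scale(incomes)               # scaled to 0.0 .. 1.0
signed_range = min_max_scale(incomes, -1.0, 1.0)  # scaled to -1.0 .. 1.0
```

The smallest value always maps to the lower bound and the largest to the upper bound, so a single extreme value compresses all the others toward one end of the range.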
Slide 10
Discretization:
• This replaces the raw values of a numeric attribute with interval labels
or conceptual labels (an interval variable is a measurement variable).
Concept Hierarchy Generation:
• Here attributes are converted from a lower level to a higher level in the
hierarchy. For example, the attribute “city” can be converted to “country”.
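The city-to-country example above amounts to a lookup in a concept hierarchy. A minimal sketch, assuming the hierarchy is available as a dictionary; the mapping and the sample values are hypothetical.

```python
from collections import Counter

# Hypothetical one-level concept hierarchy: city -> country.
CITY_TO_COUNTRY = {
    "Islamabad": "Pakistan",
    "Lahore": "Pakistan",
    "Toronto": "Canada",
}

def generalize(cities, hierarchy):
    """Replace each low-level value by its parent concept in the hierarchy."""
    return [hierarchy.get(city, "Unknown") for city in cities]

cities = ["Islamabad", "Toronto", "Lahore", "Islamabad"]
countries = generalize(cities, CITY_TO_COUNTRY)
country_counts = Counter(countries)  # fewer distinct values after generalization
```

Three distinct cities collapse into two countries here, which is the kind of reduction in distinct values that makes patterns easier to mine at the higher level.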
Slide 11
3. Data Reduction:
Data mining is used to handle huge amounts of data, and analysis becomes
harder as the volume of data grows. To address this, we use data reduction
techniques, which aim to increase storage efficiency and reduce data storage
and analysis costs.
The various steps to data reduction are:
Attribute Subset Selection:
Highly relevant attributes should be used; the rest can be discarded. To
perform attribute selection, one can use the significance level and the p-value
of each attribute. An attribute whose p-value is greater than the significance
level can be discarded.
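The p-value rule above can be sketched as a simple filter. The p-values below are made up for illustration (in practice they would come from a statistical test of each attribute against the target), and the 0.05 significance level is an assumed default.

```python
def select_by_p_value(p_values, significance_level=0.05):
    """Split attributes into kept (p <= level) and discarded (p > level)."""
    kept = [name for name, p in p_values.items() if p <= significance_level]
    discarded = [name for name, p in p_values.items() if p > significance_level]
    return kept, discarded

# Hypothetical p-values, one per attribute.
p_values = {"age": 0.003, "income": 0.020, "shoe_size": 0.640, "zip_code": 0.310}

kept, discarded = select_by_p_value(p_values)
```

With these numbers, age and income are kept while shoe_size and zip_code are discarded.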
Slide 12
Data Cube Aggregation:
An aggregation operation is applied to the data to construct the data cube.
This technique summarizes data in a simpler form. For example,
imagine that the information you gathered for your analysis covers the years
2012 to 2014 and includes the revenue of your company every three months. If
you are interested in the annual sales rather than the quarterly figures, the
data can be summarized so that the result gives the total
sales per year instead of per quarter.
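The quarterly-to-annual example above can be sketched as a roll-up over the year. The sales figures are invented for illustration and the function name annual_totals is hypothetical.

```python
from collections import defaultdict

# Hypothetical quarterly sales as (year, quarter, amount) tuples.
quarterly_sales = [
    (2012, "Q1", 224), (2012, "Q2", 408), (2012, "Q3", 350), (2012, "Q4", 586),
    (2013, "Q1", 300), (2013, "Q2", 420), (2013, "Q3", 380), (2013, "Q4", 600),
    (2014, "Q1", 310), (2014, "Q2", 450), (2014, "Q3", 400), (2014, "Q4", 640),
]

def annual_totals(rows):
    """Aggregate quarterly amounts into one total per year."""
    totals = defaultdict(int)
    for year, _quarter, amount in rows:
        totals[year] += amount
    return dict(totals)

yearly_sales = annual_totals(quarterly_sales)
```

Twelve quarterly rows are reduced to three yearly rows, and nothing needed for an annual analysis is lost.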
Numerosity Reduction:
In this technique the actual data is replaced with a mathematical model
or a smaller representation of the data. With a parametric method, only the
model parameters need to be stored instead of the actual data. Non-parametric
methods include clustering, histograms and sampling.
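Of the non-parametric methods listed above, sampling is the simplest to show. The sketch below draws a simple random sample without replacement; the data set, the sample size and the seed are arbitrary choices for the example.

```python
import random

def sample_without_replacement(rows, sample_size, seed=None):
    """Return sample_size distinct rows chosen uniformly at random."""
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return rng.sample(rows, sample_size)

data = list(range(1, 1001))                              # hypothetical 1000 tuples
reduced = sample_without_replacement(data, 50, seed=42)  # 5% sample
```

The reduced set is one twentieth of the original size; whether it is representative enough depends on the data and on the analysis that follows.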
Slide 13
Dimensionality Reduction:
When some attributes are only weakly relevant, we keep just
the attributes required for our analysis. This reduces data size by eliminating
outdated or redundant features.
Step-wise Forward Selection –
The selection begins with an empty set of attributes. At each step, the best of
the remaining original attributes is added to the set, based on its relevance
(in statistics, this can be judged with a p-value).
Suppose the data set has the following attributes, a few of which
are redundant.
Slide 14
Step-wise Backward Selection –
This selection starts with the complete set of attributes in the original data
and, at each step, eliminates the worst remaining attribute in the set.
Suppose the data set has the following attributes, a few of which
are redundant.
Slide 15
Combination of Forward and Backward Selection –
At each step it selects the best attribute and removes the worst from among the
remaining attributes, saving time and making the process faster.
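The step-wise procedures above can be sketched as greedy searches over attribute subsets. The slides do not fix a scoring criterion, so this example assumes a caller-supplied score(subset) function; the attribute names A1 to A6, their relevance values and the toy score are hypothetical.

```python
def forward_selection(attributes, score):
    """Start empty; repeatedly add the attribute that most improves the score."""
    selected, remaining = [], list(attributes)
    best = score(selected)
    while remaining:
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best:
            break  # no remaining attribute improves the subset
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected

def backward_elimination(attributes, score):
    """Start with all attributes; repeatedly drop the worst one."""
    selected = list(attributes)
    best = score(selected)
    while selected:
        # The worst attribute is the one whose removal gives the highest score.
        worst = max(selected, key=lambda a: score([x for x in selected if x != a]))
        reduced = [x for x in selected if x != worst]
        reduced_score = score(reduced)
        if reduced_score < best:
            break  # removing anything else would hurt the subset
        selected, best = reduced, reduced_score
    return selected

# Hypothetical relevance of each attribute; small values mimic redundant attributes.
RELEVANCE = {"A1": 0.5, "A2": 0.02, "A3": 0.05, "A4": 0.4, "A5": 0.01, "A6": 0.3}

def toy_score(subset):
    """Toy criterion: total relevance minus a fixed cost of 0.1 per attribute."""
    return sum(RELEVANCE[a] for a in subset) - 0.1 * len(subset)

forward_result = forward_selection(RELEVANCE, toy_score)
backward_result = backward_elimination(RELEVANCE, toy_score)
```

With this toy score both searches end at A1, A4 and A6. On real data the two directions can disagree, which is one reason for combining them.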
Slide 16
Data Compression:
Data compression reduces the size of files using different encoding
mechanisms (such as Huffman encoding and run-length encoding). It can be divided
into two types based on the compression technique.
Lossless Compression –
Encoding techniques such as these allow a simple and modest reduction in data
size. Lossless compression uses algorithms that restore the precise original
data from the compressed data.
Lossy Compression –
Methods such as the discrete wavelet transform and principal component
analysis are examples of this compression. For example, the JPEG image format
uses lossy compression, yet the decompressed image conveys the same meaning as
the original. In lossy compression, the decompressed data may differ from the
original data but is still useful enough to retrieve information from.
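Run-length encoding, one of the lossless mechanisms named above, is short enough to show in full. This is a minimal sketch for strings; the sample string is arbitrary.

```python
def rle_encode(text):
    """Collapse each run of repeated characters into a (character, count) pair."""
    runs = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((ch, 1))              # start a new run
    return runs

def rle_decode(runs):
    """Rebuild the original string from (character, count) pairs."""
    return "".join(ch * count for ch, count in runs)

original = "AAAABBBCCDAA"
encoded = rle_encode(original)
restored = rle_decode(encoded)
```

Decoding returns exactly the original string, which is what makes the scheme lossless. The saving depends on how long the runs are; data with few repeats can even grow.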
Slide 17
Discretization & Concept Hierarchy Operation:
Data discretization techniques divide attributes of a continuous nature into
intervals. Many continuous values of an attribute are replaced by a small number
of interval labels, so mining results are presented in a concise and
easily understandable way.
Top-down discretization –
If the process first considers one or a few points (so-called breakpoints or
split points) to divide the whole range of attribute values, and then repeats
this on the resulting intervals until the end, it is known as top-down
discretization or splitting.
Bottom-up discretization –
If the process first considers all the continuous values as potential split
points and then discards some by merging neighboring values into intervals, it
is called bottom-up discretization or merging.
Slide 18
Concept Hierarchies:
A concept hierarchy reduces the data size by collecting low-level concepts (such
as 43 for age) and replacing them with high-level concepts (categorical values
such as middle age or senior).
For numeric data, the following techniques can be used:
Binning – the process of changing numerical variables into categorical
counterparts. The number of categorical counterparts depends on the number of bins
specified by the user.
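The age example above can be sketched as binning with user-specified cut points. The cut points 30 and 60 and the label "young" are assumptions for illustration; the slide only names middle age and senior.

```python
from bisect import bisect_right

AGE_EDGES = [30, 60]                             # hypothetical cut points
AGE_LABELS = ["young", "middle age", "senior"]   # one label per resulting bin

def to_concept(age):
    """Map a numeric age to its high-level category."""
    # bisect_right counts how many cut points are <= age, which is the bin index.
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]

ages = [25, 43, 72, 30]
concepts = [to_concept(age) for age in ages]
```

An age of 43 becomes "middle age", and a value equal to a cut point (30 here) falls into the bin that starts at that point.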
Slide 19
Histogram analysis – Like binning, a histogram partitions
the values of an attribute X into disjoint ranges called buckets. There are several
partitioning rules:
Equal Frequency partitioning: partitioning the values so that each bucket holds
roughly the same number of occurrences from the data set.
Equal Width partitioning: partitioning the values into ranges of a fixed width
determined by the number of bins, e.g. a set of values ranging from 0-20.
Clustering: grouping similar data together.
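The equal-width and equal-frequency rules above differ only in how the bucket boundaries are chosen. A short sketch with three buckets; the values and function names are made up for the example.

```python
def equal_width_bins(values, n_bins):
    """Split the value range into n_bins intervals of the same width."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are equal; the range has zero width")
    width = (high - low) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        index = min(int((v - low) / width), n_bins - 1)  # maximum goes in the last bin
        bins[index].append(v)
    return bins

def equal_frequency_bins(values, n_bins):
    """Split the sorted values into n_bins groups of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)  # spread any remainder
        bins.append(ordered[start:end])
        start = end
    return bins

values = [0, 4, 12, 16, 16, 18, 24, 26, 28]  # hypothetical attribute values
by_width = equal_width_bins(values, 3)
by_frequency = equal_frequency_bins(values, 3)
```

Equal width gives buckets with 2, 4 and 3 values here, while equal frequency gives three buckets of exactly 3 values each, at the cost of uneven interval widths.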

Intro to Data warehousing lecture 17
