A) Data Cleaning

This document discusses the key processes involved in data mining, including data preparation processes like cleaning, integration, selection and transformation, as well as the core data mining process and subsequent steps. It outlines the typical steps as: 1) Data preparation - which involves cleaning data, integrating data from multiple sources, selecting relevant data for analysis, and transforming data into suitable formats. 2) Data mining - applying algorithms and techniques to extract patterns from prepared data. 3) Pattern evaluation and knowledge representation - identifying truly interesting patterns, representing results in an understandable way. The document also discusses some challenges in data mining like dealing with noisy/incomplete real-world data, mining distributed data sources, handling complex data types,

Uploaded by

Aziz Ur Rehman


Introduction

The whole process of data mining cannot be completed in a single step. In other words, you cannot
extract the required information from large volumes of data in one pass. It is a more complex
process than it first appears and involves a number of processes: data cleaning, data
integration, data selection, data transformation, data mining, pattern evaluation and knowledge
representation. These are completed in the given order.

Types of Data Mining Processes

The data mining processes can be classified into two types: data preparation (or data
preprocessing) and data mining. The first four processes, namely data cleaning, data
integration, data selection and data transformation, are considered data preparation processes.
The last three processes, namely data mining, pattern evaluation and knowledge representation, are
integrated into one process called data mining.

a) Data Cleaning

Data cleaning is the process of detecting and correcting incomplete, noisy and inconsistent data,
which is what real-world data normally is. The data available in data sources might be missing attribute
values or data of interest. For example, suppose you want demographic data about customers but
the available data does not include attributes for their gender or age. Then the data
is incomplete. Sometimes the data contains errors or outliers. An example is an age
attribute with the value 200, which is obviously wrong.
The data could also be inconsistent. For example, the name of an employee might be stored
differently in different data tables or documents. If the data is not clean,
the data mining results will be neither reliable nor accurate.

Data cleaning involves a number of techniques, including filling in the missing values manually and
combined computer and human inspection. The output of the data cleaning process is adequately
cleaned data.
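
The cleaning step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the customer records, the cut-off of 120 for a plausible age, and the choice to fill gaps with the median of the plausible ages are all hypothetical and do not come from the text above, which mentions manual filling. Replaced values are also collected in a list so that a person can inspect them, in the spirit of combined computer and human inspection.

```python
from statistics import median

# Hypothetical customer records; None marks a missing value.
customers = [
    {"name": "Ann", "age": 34},
    {"name": "Bob", "age": None},   # incomplete: age is missing
    {"name": "Cid", "age": 200},    # noisy: implausible age
    {"name": "Dee", "age": 41},
]

MAX_PLAUSIBLE_AGE = 120  # assumed cut-off, not taken from the text


def clean_ages(records, max_age=MAX_PLAUSIBLE_AGE):
    """Return (cleaned copies of the records, names whose age was replaced)."""
    valid = [r["age"] for r in records
             if r["age"] is not None and 0 <= r["age"] <= max_age]
    fill = median(valid)  # fill value computed from plausible ages only
    cleaned, flagged = [], []
    for r in records:
        row = dict(r)  # copy, so the source records stay untouched
        if row["age"] is None or not 0 <= row["age"] <= max_age:
            flagged.append(row["name"])  # keep a list for human inspection
            row["age"] = fill
        cleaned.append(row)
    return cleaned, flagged


cleaned, flagged = clean_ages(customers)
print(flagged)                      # ['Bob', 'Cid']
print([r["age"] for r in cleaned])  # [34, 37.5, 37.5, 41]
```

Whether an automatic fill value is acceptable depends on the analysis; the flagged list is what would be handed to a person for review.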

b) Data Integration

Data integration is the process where data from different data sources is combined into one store. Data
lies in different formats in different locations. It could be stored in databases, text files,
spreadsheets, documents, data cubes, on the Internet and so on. Data integration is a complex and
tricky task because data from different sources normally does not match. Suppose a table A contains
an attribute named customer_id whereas another table B contains an attribute named number. It is
difficult to be sure whether both attributes refer to the same value or not. Metadata can be
used effectively to reduce errors in the data integration process. Another issue is data
redundancy. The same data might be available in different tables in the same database or even in
different data sources. Data integration tries to reduce redundancy as far as possible
without affecting the reliability of the data.
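
The customer_id and number case can be sketched as follows. The tables, the column mapping and the helper names are hypothetical; the mapping stands in for the metadata that tells us the two attributes mean the same thing. Rows sharing a key are merged into one record, which also removes the redundant duplicate.

```python
# Hypothetical tables from two sources; the key column is named differently.
table_a = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
]
table_b = [
    {"number": 2, "city": "Pune"},
    {"number": 3, "city": "Oslo"},
    {"number": 3, "city": "Oslo"},  # redundant duplicate row
]

# Metadata (assumed here) records that B's "number" means A's "customer_id".
COLUMN_MAP_B = {"number": "customer_id"}


def rename_columns(rows, column_map):
    """Rename columns so both tables share one schema."""
    return [{column_map.get(k, k): v for k, v in row.items()} for row in rows]


def integrate(rows_a, rows_b, key):
    """Merge rows that share a key; repeated rows collapse into one record."""
    merged = {}
    for row in rows_a + rows_b:
        merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]


integrated = integrate(table_a, rename_columns(table_b, COLUMN_MAP_B), "customer_id")
for row in integrated:
    print(row)
```

Without the mapping, the code would have no way of knowing that the two columns are the same key, which is exactly the difficulty described above.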

c) Data Selection

The data mining process requires large volumes of historical data for analysis. As a result, the data
repository with integrated data usually contains much more data than is actually required. From the available
data, the data of interest needs to be selected and stored. Data selection is the process where the data
relevant to the analysis is retrieved from the database.
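
A minimal sketch of selection, assuming a hypothetical repository of purchase records and an analysis that only needs recent purchases and two of the four attributes:

```python
# Hypothetical integrated repository of purchase records.
repository = [
    {"customer_id": 1, "year": 2019, "amount": 120.0, "store": "North"},
    {"customer_id": 2, "year": 2021, "amount": 250.0, "store": "South"},
    {"customer_id": 3, "year": 2022, "amount": 90.0, "store": "North"},
]


def select(rows, columns, predicate):
    """Keep only rows satisfying predicate, projected onto the given columns."""
    return [{c: row[c] for c in columns} for row in rows if predicate(row)]


# Assumed analysis scope: purchases from 2021 onwards, amount per customer only.
selected = select(repository, ["customer_id", "amount"], lambda r: r["year"] >= 2021)
print(selected)
```

In a database setting the same idea is a query that names the needed columns and filters the rows.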

d) Data Transformation

Data transformation is the process of transforming and consolidating the data into forms that
are suitable for mining. Data transformation normally involves normalization, aggregation,
generalization etc. For example, a data set available as "-5, 37, 100, 89, 78" can be transformed into
"-0.05, 0.37, 1.00, 0.89, 0.78" by dividing each value by 100. The data is now on a common scale and more
suitable for data mining. After data transformation, the available data is ready for data mining.
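
The example values are consistent with dividing each number by 100, which is also the largest absolute value in the set. The sketch below scales by the largest absolute value; the function name is hypothetical, and other normalization schemes (for example min-max scaling) would give different numbers.

```python
def scale_by_max_abs(values):
    """Divide every value by the largest absolute value in the list."""
    largest = max(abs(v) for v in values)
    if largest == 0:
        return [0.0 for _ in values]  # all-zero input stays zero
    return [v / largest for v in values]


data = [-5, 37, 100, 89, 78]
print(scale_by_max_abs(data))  # [-0.05, 0.37, 1.0, 0.89, 0.78]
```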

e) Data Mining

Data mining is the core process where a number of complex and intelligent methods are applied to
extract patterns from data. The data mining process includes a number of tasks such as association,
classification, prediction, clustering, time series analysis and so on.

f) Pattern Evaluation

Pattern evaluation identifies the truly interesting patterns representing knowledge, based on
different types of interestingness measures. A pattern is considered interesting if it is potentially
useful, easily understood by humans, valid on new data with some degree of certainty, or if it
confirms a hypothesis that someone wanted to test.

g) Knowledge Representation

The information mined from the data needs to be presented to the user in an appealing way. Different
knowledge representation and visualization techniques are applied to provide the output of data
mining to the users.

Data Mining Architecture

The major components of any data mining system are the data sources, database or data warehouse server, data
mining engine, pattern evaluation module, graphical user interface and knowledge base.

a) Data Sources

Databases, data warehouses, the World Wide Web (WWW), text files and other documents are the actual
sources of data. You need large volumes of historical data for data mining to be successful.
Organizations usually store data in databases or data warehouses. Data warehouses may contain
one or more databases, text files, spreadsheets or other kinds of information repositories.
Sometimes data resides only in plain text files or spreadsheets. The World Wide Web, or the Internet,
is another big source of data.

Different Processes

The data needs to be cleaned, integrated and selected before it is passed to the database or data
warehouse server. Because the data comes from different sources and in different formats, it cannot be used
directly for the data mining process: it might not be complete or reliable. So the
data first needs to be cleaned and integrated. In addition, more data than required will be collected from
the different data sources, and only the data of interest needs to be selected and passed to the server.
These processes are not simple. A number of techniques may be applied to the
data as part of cleaning, integration and selection.

b) Database or Data Warehouse Server

The database or data warehouse server contains the actual data that is ready to be processed.
Hence, the server is responsible for retrieving the relevant data based on the data mining request of
the user.

c) Data Mining Engine

The data mining engine is the core component of any data mining system. It consists of a number of
modules for performing data mining tasks including association, classification, characterization,
clustering, prediction, time-series analysis etc.

d) Pattern Evaluation Modules

The pattern evaluation module is mainly responsible for measuring the interestingness of each pattern
against a threshold value. It interacts with the data mining engine to focus the search on
interesting patterns.
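
As an illustration of threshold-based evaluation, the sketch below filters candidate association rules. It assumes support and confidence as the interestingness measures; the text does not name specific measures, and the transactions, candidate rules and threshold values are hypothetical.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent); 0.0 if the antecedent never occurs."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    return support(antecedent | consequent, transactions) / base


def interesting(rules, transactions, min_support, min_confidence):
    """Keep only rules that meet both thresholds."""
    kept = []
    for antecedent, consequent in rules:
        s = support(antecedent | consequent, transactions)
        c = confidence(antecedent, consequent, transactions)
        if s >= min_support and c >= min_confidence:
            kept.append((antecedent, consequent))
    return kept


candidate_rules = [
    ({"bread"}, {"milk"}),
    ({"butter"}, {"bread"}),
    ({"milk"}, {"butter"}),
]
result = interesting(candidate_rules, transactions, min_support=0.5, min_confidence=0.6)
print(result)
```

With these thresholds the rule milk -> butter is dropped because only one of the four transactions contains both items, while the other two rules are kept.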

e) Graphical User Interface

The graphical user interface module communicates between the user and the data mining system.
This module helps the user use the system easily and efficiently without knowing the real complexity
behind the process. When the user specifies a query or a task, this module interacts with the data
mining system and displays the result in an easily understandable manner.

f) Knowledge Base

The knowledge base is helpful in the whole data mining process. It might be useful for guiding the
search or evaluating the interestingness of the result patterns. The knowledge base might even
contain user beliefs and data from user experiences that can be useful in the process of data mining.
The data mining engine might get inputs from the knowledge base to make the result more accurate
and reliable. The pattern evaluation module interacts with the knowledge base on a regular basis to
get inputs and also to update it.

Challenges in Data Mining

The data mining process becomes successful when the challenges or issues are identified correctly
and sorted out properly.

Noisy and Incomplete Data

Data mining is the process of extracting information from large volumes of data. Real-world data
is heterogeneous, incomplete and noisy, and data collected in large quantities is often inaccurate or
unreliable. These problems can be caused by faulty measuring instruments or
by human error. Suppose a retail chain collects the email addresses of customers who spend more
than $200, and the billing staff enter the details into their system. A staff member might misspell
an address while entering it, which results in incorrect data. Some customers might not
be willing to disclose their email address, which results in incomplete data. Data can also be altered
by system or human errors. All of this produces noisy and incomplete data, which makes data
mining really challenging.
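
A small sketch of how such entries might be triaged. The pattern is a deliberately loose, assumed check (something@something.tld with no spaces), not a full address validator, and the sample entries are hypothetical. Note that a misspelled but well-formed address still passes, so a check like this finds missing and malformed entries only.

```python
import re

# Deliberately loose pattern: something@something.tld, no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def classify_email(value):
    """Label a raw entry as 'missing', 'malformed' or 'ok'."""
    if value is None or not value.strip():
        return "missing"
    if not EMAIL_PATTERN.match(value.strip()):
        return "malformed"
    return "ok"


entries = ["ann@example.com", "bob@example", "", None, "cid(at)example.com"]
labels = [classify_email(e) for e in entries]
print(labels)  # ['ok', 'malformed', 'missing', 'missing', 'malformed']
```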

Distributed Data

Real-world data is usually stored on different platforms in distributed computing environments. It could
be in databases, on individual systems, or even on the Internet. It is practically very difficult to bring all
the data into a centralized data repository, mainly for organizational and technical reasons. For
example, different regional offices may have their own servers to store their data, and it will
not be feasible to store all the data (millions of terabytes) from all the offices on a central server. So
data mining demands the development of tools and algorithms that enable mining of distributed data.
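
One common way to approach this is to compute a small summary where the data lives and to combine only the summaries centrally. The sketch below assumes hypothetical regional sales logs and uses item counts as the summary; it illustrates the idea and is not a description of any particular distributed mining tool.

```python
from collections import Counter

# Hypothetical sales logs held on separate regional servers.
regional_logs = {
    "north": ["tea", "milk", "tea"],
    "south": ["milk", "bread"],
    "east": ["tea", "bread"],
}


def local_summary(items):
    """Runs at a regional office: reduce raw records to item counts."""
    return Counter(items)


def merge_summaries(summaries):
    """Runs centrally: combine the small summaries, never the raw logs."""
    total = Counter()
    for summary in summaries:
        total.update(summary)  # Counter.update adds counts together
    return total


summaries = [local_summary(items) for items in regional_logs.values()]
global_counts = merge_summaries(summaries)
print(global_counts)
```

Only the per-office counts cross the network, so the amount of data moved depends on the number of distinct items rather than on the number of raw records.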

Complex Data

Real-world data is heterogeneous. It could be multimedia data including images, audio and
video, temporal data, spatial data, time series, natural language text and so on. It is
difficult to handle these different kinds of data and extract the required information. Most of the
time, new tools and methodologies have to be developed to extract relevant information.

Performance

The performance of a data mining system mainly depends on the efficiency of the algorithms and
techniques used. If the algorithms and techniques are not efficient enough, the
performance of the data mining process suffers.

Incorporation of Background Knowledge

If background knowledge can be incorporated, more reliable and accurate data mining solutions can
be found. Descriptive tasks can come up with more useful findings and predictive tasks can make
more accurate predictions. But collecting and incorporating background knowledge is a complex
process.

Data Visualization

Data visualization is a very important process in data mining because it is the main process that
displays the output to the user in a presentable manner. The information extracted should convey
exactly what it is intended to convey. But it is often difficult to represent
the information to the end user in a way that is both accurate and easy to understand. Because the input data and output
information are complex, effective data visualization techniques need to
be applied.

Data Privacy and Security

Data mining can lead to serious issues in terms of data security, privacy and governance. For
example, when a retailer analyzes purchase details, it reveals information about the buying habits
and preferences of customers without their permission.

Different Data Mining Tasks

There are a number of data mining tasks such as classification, prediction, time-series analysis,
association, clustering, summarization etc. All these tasks are either predictive data mining tasks or
descriptive data mining tasks. A data mining system can execute one or more of the above specified
tasks as part of data mining.
