A) Data Cleaning
Data mining cannot be completed in a single step; the required information cannot simply be pulled
out of large volumes of data. It is a more complex process than it may appear and involves a number
of stages: data cleaning, data integration, data selection, data transformation, data mining,
pattern evaluation and knowledge representation. These stages are to be completed in the given
order.
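The fixed ordering of the stages can be sketched as a small pipeline runner. This is only an illustration: `STAGE_ORDER`, `run_pipeline` and the idea of passing each stage as a callable are assumptions made for the example, not part of any particular data mining system.

```python
# Stage names taken from the text, in the order they must run.
STAGE_ORDER = [
    "data cleaning",
    "data integration",
    "data selection",
    "data transformation",
    "data mining",
    "pattern evaluation",
    "knowledge representation",
]


def run_pipeline(data, stages):
    """Apply every stage in STAGE_ORDER; `stages` maps a stage name to a callable."""
    missing = [name for name in STAGE_ORDER if name not in stages]
    if missing:
        raise ValueError(f"missing stages: {missing}")
    for name in STAGE_ORDER:
        data = stages[name](data)  # output of one stage feeds the next
    return data


# Toy stages (hypothetical): each one only records that it ran.
trace_stages = {name: (lambda log, n=name: log + [n]) for name in STAGE_ORDER}
trace = run_pipeline([], trace_stages)
```

Here `trace` ends up listing the stages in the order they were executed, regardless of the order in which they were registered in the dictionary.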
a) Data Cleaning
Data cleaning is the process of detecting and correcting problems in the data. Real-world data is
normally incomplete, noisy and inconsistent. The data available in data sources might lack attribute
values or the data of interest. For example, if you want demographic data about customers but the
available data does not include attributes for the gender or age of the customers, the data is
incomplete. The data might also contain errors or outliers. An example is an age attribute with the
value 200, which is obviously wrong.
The data could also be inconsistent. For example, the name of an employee might be stored
differently in different data tables or documents. If the data is not clean, the data mining results
will be neither reliable nor accurate.
Data cleaning involves a number of techniques, including filling in missing values manually and
combined computer and human inspection. The output of the data cleaning process is adequately
cleaned data.
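A minimal sketch of the "combined computer and human inspection" idea, using the age example above: the program flags missing or implausible ages for a person to review and, in the meantime, fills them with the median of the plausible values. The records, the threshold `MAX_PLAUSIBLE_AGE` and the choice of median filling are assumptions made for illustration; the text itself only mentions manual filling.

```python
from statistics import median

# Hypothetical customer records; None marks a missing attribute value.
customers = [
    {"name": "Asha", "age": 34},
    {"name": "Ben", "age": None},   # incomplete: age is missing
    {"name": "Chen", "age": 200},   # noisy: age is not plausible
    {"name": "Dana", "age": 41},
]

MAX_PLAUSIBLE_AGE = 120  # assumed cut-off, not taken from the text


def clean_ages(records, max_age=MAX_PLAUSIBLE_AGE):
    """Return (cleaned copies of the records, names flagged for human inspection)."""
    valid = [r["age"] for r in records
             if r["age"] is not None and 0 <= r["age"] <= max_age]
    fill_value = median(valid)
    cleaned, flagged = [], []
    for r in records:
        age = r["age"]
        if age is None or not 0 <= age <= max_age:
            flagged.append(r["name"])  # a person should look at this record
            age = fill_value           # provisional automatic fill
        cleaned.append({**r, "age": age})
    return cleaned, flagged


cleaned, flagged = clean_ages(customers)
```

The flagged names give the human reviewer a short list to check instead of the whole data set.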
b) Data Integration
Data integration is the process where data from different data sources is combined into one. Data
lies in different formats in different locations. It could be stored in databases, text files,
spreadsheets, documents, data cubes, the Internet and so on. Data integration is a complex and
tricky task because data from different sources normally does not match. Suppose table A contains
an attribute named customer_id whereas table B contains an attribute named number. It is difficult
to determine whether both attributes refer to the same thing. Metadata can be used effectively to
reduce errors in the data integration process. Another issue is data redundancy: the same data
might be available in different tables in the same database or even in different data sources. Data
integration tries to reduce redundancy as far as possible without affecting the reliability of the
data.
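The customer_id / number example can be sketched as follows. The metadata here is simply a hand-written mapping from each source's column names to a shared name; `METADATA`, the table contents and the `integrate` helper are hypothetical and only show the idea, assuming the metadata really does confirm that `number` in table B means the customer identifier.

```python
# Hypothetical metadata: maps each source's column names to shared names.
METADATA = {
    "table_a": {"customer_id": "customer_id", "name": "name"},
    "table_b": {"number": "customer_id", "full_name": "name"},
}

table_a = [{"customer_id": 1, "name": "Asha"}, {"customer_id": 2, "name": "Ben"}]
table_b = [{"number": 2, "full_name": "Ben"}, {"number": 3, "full_name": "Chen"}]


def integrate(sources, metadata):
    """Rename columns using the metadata and keep one record per customer_id."""
    merged = {}
    for source_name, rows in sources.items():
        mapping = metadata[source_name]
        for row in rows:
            record = {mapping[col]: value for col, value in row.items()}
            # The first record seen for an id wins; later copies are redundant.
            merged.setdefault(record["customer_id"], record)
    return [merged[key] for key in sorted(merged)]


integrated = integrate({"table_a": table_a, "table_b": table_b}, METADATA)
```

Customer 2 appears in both tables but only once in `integrated`, which is the redundancy reduction described above. A real system would also have to decide what to do when the two copies disagree.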
c) Data Selection
The data mining process requires large volumes of historical data for analysis, so the data
repository with integrated data usually contains much more data than is actually required. From the
available data, the data of interest needs to be selected and stored. Data selection is the process
where the data relevant to the analysis is retrieved from the database.
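Data selection can be pictured as keeping only the relevant rows and the relevant attributes. The sketch below assumes an in-memory list of records; the `repository` contents, the `select` helper and the chosen filter are made up for the example.

```python
# Hypothetical integrated repository with more data than the analysis needs.
repository = [
    {"customer_id": 1, "age": 34, "region": "north", "notes": "called twice"},
    {"customer_id": 2, "age": 29, "region": "south", "notes": "new account"},
    {"customer_id": 3, "age": 41, "region": "north", "notes": "prefers email"},
]


def select(records, attributes, predicate):
    """Keep rows for which predicate(row) is true, and only the named attributes."""
    return [{a: r[a] for a in attributes} for r in records if predicate(r)]


# Data of interest (assumed): id and age of customers in the north region.
selected = select(repository, ["customer_id", "age"],
                  lambda r: r["region"] == "north")
```

In a database setting the same step would normally be a query that names the needed columns and a filter condition.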
d) Data Transformation
Data transformation is the process of transforming and consolidating the data into forms that are
suitable for mining. It normally involves normalization, aggregation, generalization etc. For
example, a data set available as "-5, 37, 100, 89, 78" can be normalized by dividing each value by
100, giving "-0.05, 0.37, 1.00, 0.89, 0.78". The values now lie between -1 and 1, which makes the
data more suitable for data mining. After data transformation, the available data is ready for data
mining.
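The normalization example can be reproduced by dividing every value by the largest absolute value in the data set, which is 100 here. This is one plausible reading of the example rather than a statement of the method the author had in mind; the function name `scale_by_max_abs` is an assumption.

```python
def scale_by_max_abs(values):
    """Divide each value by the largest absolute value, mapping data into [-1, 1]."""
    largest = max(abs(v) for v in values)
    if largest == 0:
        return [0.0 for _ in values]  # all zeros: nothing to scale
    return [v / largest for v in values]


raw = [-5, 37, 100, 89, 78]
scaled = scale_by_max_abs(raw)
# scaled is [-0.05, 0.37, 1.0, 0.89, 0.78]
```

Other normalization schemes, such as min-max scaling to the range 0 to 1, would give different numbers for the same input.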
e) Data Mining
Data mining is the core process where a number of complex and intelligent methods are applied to
extract patterns from data. The data mining process includes a number of tasks such as association,
classification, prediction, clustering, time series analysis and so on.
f) Pattern Evaluation
Pattern evaluation identifies the truly interesting patterns representing knowledge, based on
different types of interestingness measures. A pattern is considered interesting if it is
potentially useful, easily understandable by humans, validates a hypothesis that someone wants to
confirm, or is valid on new data with some degree of certainty.
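As an illustration of an interestingness measure, the sketch below scores an association pattern of the form "customers who buy X also buy Y" by its support and confidence and keeps it only if both exceed a threshold. Support and confidence are common measures for association patterns, but the text does not say which measures are used; the transactions and the thresholds are assumptions.

```python
# Hypothetical purchase transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    return support(antecedent | consequent, transactions) / base


def is_interesting(antecedent, consequent, transactions,
                   min_support=0.5, min_confidence=0.6):
    """Assumed rule: a pattern is kept when both measures reach their thresholds."""
    return (support(antecedent | consequent, transactions) >= min_support
            and confidence(antecedent, consequent, transactions) >= min_confidence)


rule_support = support({"bread", "milk"}, transactions)          # 2 of 4 transactions
rule_confidence = confidence({"bread"}, {"milk"}, transactions)  # 2 of the 3 with bread
keep = is_interesting({"bread"}, {"milk"}, transactions)
```

Checking the same measures on data that was not used to find the pattern is one way to test whether it stays valid on new data.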
g) Knowledge Representation
The information mined from the data needs to be presented to the user in an appealing way. Different
knowledge representation and visualization techniques are applied to provide the output of data
mining to the users.
The major components of any data mining system are the data sources, data warehouse server, data
mining engine, pattern evaluation module, graphical user interface and knowledge base.
a) Data Sources
Databases, data warehouses, the World Wide Web (WWW), text files and other documents are the actual
sources of data. You need large volumes of historical data for data mining to be successful.
Organizations usually store data in databases or data warehouses. Data warehouses may contain
one or more databases, text files, spreadsheets or other kinds of information repositories.
Sometimes, data may reside even in plain text files or spreadsheets. The World Wide Web or the
Internet is another big source of data.
Different Processes
The data needs to be cleaned, integrated and selected before it is passed to the database or data
warehouse server. As the data comes from different sources and in different formats, it cannot be
used directly for data mining because it might not be complete and reliable. So the data first
needs to be cleaned and integrated. In addition, more data than required will be collected from the
different data sources, and only the data of interest needs to be selected and passed to the server.
These processes are not simple: a number of techniques may be applied to the data as part of
cleaning, integration and selection.
f) Knowledge Base
The knowledge base is helpful in the whole data mining process. It might be useful for guiding the
search or evaluating the interestingness of the result patterns. The knowledge base might even
contain user beliefs and data from user experiences that can be useful in the process of data mining.
The data mining engine might get inputs from the knowledge base to make the result more accurate
and reliable. The pattern evaluation module interacts with the knowledge base on a regular basis to
get inputs and also to update it.
Distributed Data
Real-world data is usually stored on different platforms in distributed computing environments. It
could be in databases, individual systems, or even on the Internet. It is practically very difficult
to bring all the data to a centralized data repository, mainly for organizational and technical
reasons. For example, different regional offices might have their own servers to store their data,
and it may not be feasible to store all the data (millions of terabytes) from all the offices on a
central server. So, data mining demands the development of tools and algorithms that enable mining
of distributed data.
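One simple way to picture mining distributed data is to let each office summarize its own data locally and send only the summary to a coordinator, which merges the summaries. The sketch below counts how often each item occurs; the office data and the function names are hypothetical, and real distributed mining algorithms are considerably more involved.

```python
from collections import Counter


def local_counts(transactions):
    """Run at each office: count the transactions in which each item occurs."""
    counts = Counter()
    for t in transactions:
        counts.update(set(t))  # count an item once per transaction
    return counts


def merge_counts(partials):
    """Run at the coordinator: add up the per-office summaries."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical data held on two regional servers.
office_north = [["bread", "milk"], ["bread"]]
office_south = [["milk"], ["bread", "milk"]]

global_counts = merge_counts([local_counts(office_north),
                              local_counts(office_south)])
```

Only the small `Counter` objects travel between sites; the raw transactions stay on the regional servers.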
Complex Data
Real-world data is heterogeneous: it could be multimedia data including images, audio and video,
complex data, temporal data, spatial data, time series, natural language text and so on. It is
difficult to handle these different kinds of data and extract the required information. Often, new
tools and methodologies have to be developed to extract relevant information.
Performance
The performance of a data mining system mainly depends on the efficiency of the algorithms and
techniques used. If the algorithms and techniques are not efficient enough, the performance of the
data mining process will suffer.
Data Visualization
Data visualization is a very important process in data mining because it is the main process that
displays the output in a presentable manner to the user. The information extracted should convey
exactly what it is intended to convey. But it is often difficult to represent the information in an
accurate and easy-to-understand way to the end user. Because the input data and output information
are complex, effective data visualization techniques need to be applied.
There are a number of data mining tasks such as classification, prediction, time-series analysis,
association, clustering, summarization etc. All these tasks are either predictive data mining tasks or
descriptive data mining tasks. A data mining system can execute one or more of the above specified
tasks as part of data mining.