Module 1 CertDA
Data mining is the process of identifying relationships, trends and patterns in large sets of data,
effectively turning raw data into useful information. Data mining approaches involve various
methods such as statistics, machine learning, and database systems.
The information obtained through the data mining process can then be further processed and used
to support decision-making.
CRISP-DM is a cross-industry process for data mining and is a process model designed to facilitate a
structured approach to data mining. It was first conceived in 1996, and in 1997 it became an official
European Union project under the ESPRIT funding initiative. The project was spear-headed by five
companies: Integral Solutions Ltd (ISL), Teradata, Daimler AG, NCR Corporation and OHRA, an
insurance company, and led to the first version of the methodology being published as a data mining
guide in 1999.
Recent research indicates that CRISP-DM is the most widely used data-mining process, largely because it addresses problems that previously hampered data mining projects. Much of its success and wide adoption stems from the fact that it is industry, tool and application neutral.
The process model is composed of six distinct but connected phases which represent the ideal
sequence of activities involved in the data mining process. In practice some of these activities may
be performed in a different order. Some of the paths between activities are two-way, indicating that
it will frequently be necessary to return to earlier steps depending on the outcome of a particular
activity.
Business understanding
Business understanding is the essential and mandatory first phase in any data mining or data
analytics project. It involves identifying and describing the fundamental aims of the project from a
business perspective. This may involve solving a key business problem or exploring a particular
business opportunity.
Typical business aims include:
• Establishing whether the business has been performing or under-performing, and in which areas
• Monitoring and controlling performance against targets or budgets
• Identifying areas where efficiency and effectiveness in business processes can be improved
• Understanding customer behaviour to identify trends, patterns and relationships
• Predicting sales volumes at given prices
• Detecting and preventing fraud more easily
• Using scarce resources most profitably
• Optimising sales or profits.
Having identified the aims of the project to address the business problem or opportunity, the next
step is to establish a set of project objectives and requirements. These are then used to inform the
development of a project plan. The plan will detail the steps to be performed over the course of the rest of the project.
Data understanding
The second phase of the CRISP-DM process involves obtaining and exploring the data identified as part of the previous phase, and has three separate steps, each resulting in the production of a report.
Data Acquisition
This step involves retrieving the data from their respective sources and producing a data acquisition report that lists the sources of data, along with their provenance and the tools or techniques used to acquire them. It should also document any issues which arose during the acquisition, along with the relevant solutions. This report will facilitate the replication of the data acquisition process if the project is repeated in the future.
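To make this concrete, the following is a minimal sketch in Python, using the pandas library, of how data might be acquired from two hypothetical sources, a CSV export and a SQLite database, while keeping a simple provenance log to support the acquisition report. The file names, table name and provenance notes are illustrative assumptions, not part of CRISP-DM itself.

    # Minimal data acquisition sketch; file names, the database table and
    # the provenance notes are hypothetical examples.
    import sqlite3
    import pandas as pd

    # Acquire a customer extract from a CSV export.
    customers = pd.read_csv("customers.csv")

    # Acquire purchase records from a SQLite database.
    with sqlite3.connect("sales.db") as conn:
        purchases = pd.read_sql_query("SELECT * FROM purchases", conn)

    # Keep a simple provenance log for the data acquisition report.
    acquisition_log = pd.DataFrame([
        {"dataset": "customers", "source": "customers.csv",
         "provenance": "CRM export", "rows": len(customers)},
        {"dataset": "purchases", "source": "sales.db",
         "provenance": "ERP purchases table", "rows": len(purchases)},
    ])
    print(acquisition_log)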
Data Description
The next step requires loading the data and performing a rudimentary examination of it to aid in the production of a data quality report. This report should describe the data that has been acquired.
It should detail the number of attributes and the type of data they contain. For quantitative data,
this should include descriptive statistics such as minimum and maximum values as well as their mean
and median, as well as other statistical measures. For qualitative data, the summary should include the number of distinct values, known as the cardinality of the data, and how many instances of each value exist. At this stage the task is simply to describe the raw data; for instance, if analysing a purchases ledger, you would produce counts of the number of transactions for each department and cost centre, and the minimum, mean and maximum for amounts. Relationships between variables are examined later, in the data exploration step (e.g. by calculating correlations). For both types of data, the report should also detail the number of missing or invalid values in each of the attributes.
If there are multiple sources of data, the report should state on which common attributes these
sources will be joined. Finally, the report should include a statement as to whether the data acquired
is complete and satisfies the requirements outlined during the business understanding phase.
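As an illustration, the sketch below (Python with pandas) produces the kind of per-attribute summary a data description report draws on: data types, missing values, descriptive statistics for quantitative attributes and cardinality for qualitative ones. The DataFrame df is assumed to hold the acquired data.

    # Sketch of a per-attribute summary for the data description report;
    # `df` is assumed to be a pandas DataFrame holding the acquired data.
    import pandas as pd

    def describe_dataset(df):
        rows = []
        for col in df.columns:
            info = {
                "attribute": col,
                "dtype": str(df[col].dtype),
                "missing": int(df[col].isna().sum()),
            }
            if pd.api.types.is_numeric_dtype(df[col]):
                # Quantitative attribute: minimum, maximum, mean and median.
                info.update(minimum=df[col].min(), maximum=df[col].max(),
                            mean=df[col].mean(), median=df[col].median())
            else:
                # Qualitative attribute: cardinality and most frequent value.
                info.update(cardinality=df[col].nunique(),
                            most_frequent=df[col].value_counts().idxmax())
            rows.append(info)
        return pd.DataFrame(rows)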
Data Exploration
This step builds on the data description and involves using statistical and visualisation techniques to
develop a deeper understanding of the data and their suitability for the analysis.
These exploratory data analysis techniques can help provide an indication of the likely outcome of the analysis and may uncover patterns in the data that are worth subjecting to further examination.
The results of the exploratory data analysis should be presented as part of a data exploration report
that should also detail any initial findings.
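A simple illustration of such techniques, assuming the dataset df from the previous step and a hypothetical numeric column named amount, might look like this:

    # Exploratory sketch: pairwise correlations and a distribution plot;
    # `df` and the `amount` column are assumed from earlier steps.
    import matplotlib.pyplot as plt

    # Correlations between numeric attributes hint at relationships worth
    # examining further during modelling.
    print(df.select_dtypes("number").corr().round(2))

    # A histogram shows how a quantitative attribute is distributed.
    df["amount"].plot.hist(bins=30, title="Distribution of amount")
    plt.xlabel("amount")
    plt.show()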
Data preparation
As with the data understanding phase, the data preparation phase is composed of multiple steps and is about ensuring that the correct data is used, in the correct form, so that the data analytics model works effectively.
Data Selection
Feature selection is the process of eliminating features or variables which exhibit little predictive value, or which are highly correlated with others, and retaining those that are the most relevant to the process of building analytical models (a simple selection sketch follows this list), such as:
• Multiple linear regression, where the correlation between multiple independent variables and the dependent variable is used to model the relationship between them
• Decision trees, which simulate human approaches to solving problems by dividing the set of predictors into smaller and smaller subsets and associating an outcome with each one
• Neural networks, a simplified simulation of multiple interconnected brain cells that can be configured to learn and recognise patterns.
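The sketch below illustrates one simple approach to feature selection: it drops features that show almost no variation and one of each pair of highly correlated features. The DataFrame X of numeric candidate features and the 0.9 correlation threshold are illustrative assumptions.

    # Feature selection sketch; `X` is assumed to be a numeric pandas
    # DataFrame of candidate features, and 0.9 is an illustrative threshold.
    def select_features(X, corr_threshold=0.9):
        # Drop features with (almost) no variation: little predictive value.
        X = X.loc[:, X.std() > 1e-9]

        # Drop one of each pair of highly correlated features, since they
        # carry largely redundant information.
        corr = X.corr().abs()
        to_drop = set()
        cols = list(corr.columns)
        for i, first in enumerate(cols):
            for second in cols[i + 1:]:
                if corr.loc[first, second] > corr_threshold:
                    to_drop.add(second)
        return X.drop(columns=sorted(to_drop))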
Sampling may be needed if the amount of data exceeds the capabilities of the tools or systems used
to build the model. This normally involves retaining a random selection of rows as a predetermined
percentage of the total number of rows. Often, surprisingly small samples can give reasonably reliable information about the wider population of data, as voter exit polls in local and national elections demonstrate.
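In pandas, for example, drawing such a sample is a one-line operation; the 10% fraction below is purely illustrative and df is assumed to be the full dataset.

    # Retain a random 10% of rows; the fixed random_state makes the sample
    # reproducible so the selection can be documented and repeated.
    sample = df.sample(frac=0.10, random_state=42)
    print(f"Sampled {len(sample)} of {len(df)} rows")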
Any decisions taken during this step should be documented, along with a description of the reasons
for eliminating non-significant variables or selecting samples of data from a wider population of such
data.
Data Cleaning
Data cleaning is the process of ensuring the data can be used effectively in the analytical model.
The next step is to process missing and erroneous data identified during the data understanding or
collection phase. Erroneous data, that is, values outside of reasonably expected ranges, are generally set to missing. Missing values in each feature are then replaced, either using simple rules of thumb, such as setting them equal to the mean or median of the data in that feature, or by building models that represent the patterns of missing data and using those models to "predict" the missing values.
Other data cleaning tasks include transforming dates into a common format and removing non-
alphanumeric characters from text.
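The sketch below illustrates these cleaning tasks on an assumed DataFrame df with hypothetical columns amount, order_date and description; the expected range for amount is also an assumption made for the example.

    # Data cleaning sketch; the column names and the accepted range for
    # `amount` are hypothetical.
    import numpy as np
    import pandas as pd

    # Treat values outside the reasonably expected range as missing.
    df.loc[(df["amount"] < 0) | (df["amount"] > 1_000_000), "amount"] = np.nan

    # Replace missing values using a simple rule of thumb (the median).
    df["amount"] = df["amount"].fillna(df["amount"].median())

    # Transform dates into a common format.
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

    # Remove non-alphanumeric characters from free text.
    df["description"] = df["description"].str.replace(r"[^0-9A-Za-z ]", "", regex=True)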
The activities undertaken, and decisions made during this step should be documented in a data
cleaning report.
Data Integration
Data mining algorithms expect a single source of data to be organised into rows and columns. If
multiple sources of data are to be used in the analysis, it is necessary to combine them. This involves
using common features in each dataset to join the datasets together. For example, a dataset of
customer details may be combined with records of their purchases. The resulting joined dataset will
have one row for each purchase containing attributes of the purchase combined with attributes
related to the customer.
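Using the customer and purchase example, a join on an assumed common key customer_id might look like this in pandas:

    # Join purchases to customer details on a common attribute; the
    # `customer_id` key and both DataFrames are assumed from earlier steps.
    combined = purchases.merge(customers, on="customer_id", how="left")
    # One row per purchase, now carrying the purchasing customer's attributes.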
Feature Engineering
This optional step involves creating new variables, or attributes derived from the features originally included, in order to improve the model's capability. It is frequently performed when the data analyst believes that the derived attribute or new feature is likely to make a positive contribution to the modelling process, particularly where it captures a complex relationship that the model is unlikely to infer by itself.
An example of a derived feature might be adding attributes such as the amount a customer spends on different products in a given time period, how soon they pay and how often they return goods, in order to assess the profitability of that customer more reliably than simply measuring the gross profit generated by the customer's sales values.
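Derived features of this kind can often be built with a simple aggregation. The sketch below assumes a purchases DataFrame with hypothetical columns customer_id, amount, days_to_pay and returned.

    # Derive customer-level features from transaction-level data; all
    # column names are hypothetical examples.
    customer_features = (
        purchases.groupby("customer_id").agg(
            total_spend=("amount", "sum"),            # spend in the period
            avg_days_to_pay=("days_to_pay", "mean"),  # how soon they pay
            return_rate=("returned", "mean"),         # how often goods come back
        ).reset_index()
    )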
Modelling
This key part of the data mining process involves creating generalised, concise representations of the
data. These are frequently mathematical in nature and are used later to generate predictions from
new, previously unseen data.
The first step in creating models is to choose the modelling techniques which are the most
appropriate, given both the nature of the analysis and of the data used. Many modelling methods
make assumptions about the nature of the data. For example, some methods can perform well in the presence of missing data, whereas others will fail to produce a valid model.
Before proceeding to build a data analytics model, you will need to determine how you are going to assess the quality of the model's predictive ability, in other words, how well the model will perform on data it has not yet seen. This involves keeping aside a subset of the data for this purpose and using it to evaluate how far the model's predictions of the dependent variable are from the actual values in the data.
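A minimal sketch of this hold-out approach, using scikit-learn and assuming a prepared feature table X and target y, is shown below; linear regression stands in for whichever modelling technique is actually chosen.

    # Hold out a quarter of the data to assess predictive quality on
    # examples the model has not seen; `X` and `y` are assumed inputs.
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42
    )

    model = LinearRegression().fit(X_train, y_train)

    # How far are the predictions from the actual values of the
    # dependent variable in the held-out data?
    predictions = model.predict(X_test)
    print("Mean absolute error:", mean_absolute_error(y_test, predictions))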
Deployment
During this final phase, the outcome of the evaluation will be used to establish a timetable and
strategy for the deployment of the data mining models, detailing the required steps and how they
should be implemented.
Data mining projects are rarely "set it and forget it" in nature. At this time, you will need to develop a
comprehensive plan for the monitoring of the deployed models as well as their future maintenance.
This should take the form of a detailed document. Once the project has been completed there
should be a final written report, re-stating and re-affirming the project objectives, identifying the
deliverables, providing a summary of the results and identifying any problems encountered and how
they were dealt with.
Depending on the requirements, the deployment phase can be as simple as generating a report and
presenting it to the sponsors or as complex as implementing a repeatable data mining process across
the enterprise. In many cases, it is the customer, not the data analyst, who carries out the
deployment steps. However, even if the analyst does carry out the deployment, it is important for
the customer to clearly understand which actions need to be carried out in order to actually make
use of the created models. This is where data visualisation is most important: as the data analyst hands over the findings from the modelling to the sponsor or the end user, those findings should be presented and communicated in a form which is easily understood.