
DATA MINING

Introduction
Data mining is one of the most useful techniques for helping entrepreneurs, researchers, and individuals extract valuable information from huge sets of data. Data mining is also called Knowledge Discovery in Databases (KDD). The knowledge discovery process includes data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation.

This note covers the main topics of data mining, such as applications, data mining vs. machine learning, data mining tools, social media data mining, data mining techniques, clustering in data mining, challenges in data mining, etc.

What is Data Mining?


Data Mining is the process of extracting patterns, trends, and useful information from huge sets of data so that a business can make data-driven decisions.
In other words, Data Mining is the process of investigating data from various perspectives to uncover hidden patterns and categorize them into useful information. This information is collected and assembled in particular areas such as data warehouses, where efficient analysis and data mining algorithms support decision-making and other data requirements, eventually cutting costs and generating revenue.

Data mining is the act of automatically searching large stores of information to find trends and patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms to segment data and evaluate the probability of future events.

Data Mining is a process used by organizations to extract specific data from huge databases
to solve business problems. It primarily turns raw data into useful information.
Data Mining is similar to Data Science: it is carried out by a person, in a specific situation, on a particular data set, with an objective. This process includes various types of services such as text mining, web mining, audio and video mining, pictorial data mining, and social media mining. It is done through software that may be simple or highly specialized. By outsourcing data mining, all the work can be done faster and with lower operating costs. Specialized firms can also use new technologies to collect data that is impossible to locate manually. There are tons of information available on various platforms, but very little knowledge is accessible. The biggest challenge is to analyze the data to extract important information that can be used to solve a problem or support company development.

There are many powerful instruments and techniques available to mine data and find
better insight from it.

Types of Data Mining


Data mining can be performed on the following types of data:

Relational Database
A relational database is a collection of multiple data sets formally organized into tables, records, and columns, from which data can be accessed in various ways without having to reorganize the database tables. Tables convey and share information, which facilitates data searchability, reporting, and organization.

Data warehouses
A data warehouse is the technology that collects data from various sources within the organization to provide meaningful business insights. The huge amount of data comes from multiple places, such as Marketing and Finance. The extracted data is utilized for analytical purposes and helps in decision-making for a business organization. The data warehouse is designed for the analysis of data rather than for transaction processing.

Data Repositories
A data repository generally refers to a destination for data storage. However, many IT professionals use the term more specifically to refer to a particular kind of setup within an IT structure.
For example, a group of databases, where an organization has kept various kinds of
information.

Object-Relational Database
A combination of an object-oriented database model and relational database model is
called an object-relational model. It supports Classes, Objects, Inheritance, etc.

One of the primary objectives of the Object-relational data model is to close the gap
between the Relational database and the object-oriented model practices frequently
utilized in many programming languages, for example, C++, Java, C#, and so on.

Transactional Database
A transactional database refers to a database management system (DBMS) that can undo a database transaction if it is not performed appropriately. Even though this was a unique capability not so long ago, today most relational database systems support transactional activities.

Advantages of Data Mining


1. The data mining technique enables organizations to obtain knowledge-based
information.
2. Data mining enables organizations to make lucrative modifications to
operations and production.
3. Compared with other statistical data applications, data mining is
cost-efficient.
4. Data mining helps the decision-making process of an organization.
5. It facilitates the automated discovery of hidden patterns as well as the
prediction of trends and behaviors.
6. It can be introduced into new systems as well as existing platforms.
7. It is a quick process that makes it easy for new users to analyze enormous
amounts of data in a short time.

Disadvantages of Data Mining


1. There is a probability that organizations may sell useful customer data to
other organizations for money. It has been reported, for example, that
American Express sold its customers' credit card purchase data to other
organizations.
2. Much data mining analytics software is difficult to operate and requires
advanced training to work with.
3. Different data mining instruments operate in distinct ways due to the
different algorithms used in their design. Therefore, selecting the right
data mining tool is a very challenging task.
4. Data mining techniques are not always precise, which may lead to severe
consequences in certain conditions.

Applications of Data Mining


Data mining is primarily used by organizations with intense consumer demands, such as retail, communication, financial, and marketing companies, to determine prices, consumer preferences, product positioning, and the impact on sales, customer satisfaction, and corporate profits. Data mining enables a retailer to use point-of-sale records of customer purchases to develop products and promotions that help the organization attract customers.

These are the following areas where data mining is widely used:

Data Mining in Healthcare


Data mining in healthcare has excellent potential to improve the health system. It uses data
and analytics for better insights and to identify best practices that will enhance health care
services and reduce costs. Analysts use data mining approaches such as Machine learning,
multi-dimensional database, Data visualization, soft computing, and statistics. Data Mining
can be used to forecast patients in each category. The procedures ensure that the patients
get intensive care at the right place and at the right time. Data mining also enables
healthcare insurers to recognize fraud and abuse.

Data Mining in Market Basket Analysis


Market basket analysis is a modeling method based on the hypothesis that if you buy a specific group of products, you are more likely to buy another group of products. This technique may enable the retailer to understand the purchase behavior of a buyer. The data may assist the retailer in understanding the requirements of the buyer and altering the store's layout accordingly. Analytical comparisons of results can also be made between different stores and between customers in different demographic groups.

Data mining in Education


Educational data mining is a newly emerging field concerned with developing techniques that extract knowledge from data generated in educational environments. Its objectives include predicting students' future learning behavior, studying the impact of educational support, and promoting learning science. An institution can use data mining to make precise decisions and to predict student results. With these results, the institution can concentrate on what to teach and how to teach it.

Data Mining in Manufacturing Engineering


Knowledge is the best asset possessed by a manufacturing company. Data mining tools can
be beneficial to find patterns in a complex manufacturing process. Data mining can be used
in system-level designing to obtain the relationships between product architecture,
product portfolio, and data needs of the customers. It can also be used to forecast the
product development period, cost, and expectations among the other tasks.

Data Mining in CRM (Customer Relationship Management)


Customer Relationship Management (CRM) is all about obtaining and holding Customers,
also enhancing customer loyalty and implementing customer-oriented strategies. To get a
decent relationship with the customer, a business organization needs to collect data and
analyze the data. With data mining technologies, the collected data can be used for
analytics.

Data Mining in Fraud detection


Billions of dollars are lost to fraud every year. Traditional methods of fraud detection are time-consuming and complex. Data mining provides meaningful patterns and turns data into information. An ideal fraud detection system should protect the data of all users. Supervised methods use a collection of sample records classified as fraudulent or non-fraudulent. A model is constructed from this data, and the technique is then used to identify whether a new record is fraudulent or not.

Data Mining in Lie Detection


Apprehending a criminal is not the hard part; bringing out the truth is the challenging task. Law enforcement may use data mining techniques to investigate offenses, monitor suspected terrorist communications, etc. This includes text mining, which seeks meaningful patterns in data that is usually unstructured text. Information collected from previous investigations is compared, and a model for lie detection is constructed.

Data Mining Financial Banking


The digitalization of the banking system generates an enormous amount of data with every new transaction. Data mining techniques can help bankers solve business problems in banking and finance by identifying trends, causalities, and correlations in business information and market costs that are not instantly evident to managers or executives, because the data volume is too large or is produced too rapidly for experts to screen. Managers may use these findings to better target, acquire, retain, segment, and maintain profitable customers.

Challenges of Implementation in Data mining


Although data mining is very powerful, it faces many challenges during its execution.
Various challenges could be related to performance, data, methods, and techniques, etc.
The process of data mining becomes effective when the challenges or problems are
correctly recognized and adequately resolved.
Incomplete and Noisy Data
Data mining is the process of extracting useful data from large volumes of data. Real-world data is heterogeneous, incomplete, and noisy. Data in huge quantities will usually be inaccurate or unreliable. These problems may occur due to faulty measuring instruments or human errors. Suppose a retail chain collects the phone numbers of customers who spend more than $500, and the accounting employees put the information into their system. A person may mistype a digit when entering the phone number, which results in incorrect data. Some customers may not be willing to disclose their phone numbers, which results in incomplete data. The data could also get changed due to human or system error. All these consequences (noisy and incomplete data) make data mining challenging.
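As a minimal illustration of these data quality problems, the sketch below separates clean, noisy, and incomplete phone-number entries. The records and the `audit` helper are entirely made up for illustration:

```python
import re

# Hypothetical customer records collected at point of sale;
# names and numbers are invented for illustration.
records = [
    {"name": "Alice", "phone": "555-123-4567"},
    {"name": "Bob",   "phone": "555-99-0000"},   # mistyped digits (noisy)
    {"name": "Cara",  "phone": None},            # customer declined (incomplete)
]

PHONE_RE = re.compile(r"^\d{3}-\d{3}-\d{4}$")

def audit(recs):
    """Split records into clean, noisy, and missing buckets."""
    clean, noisy, missing = [], [], []
    for r in recs:
        if r["phone"] is None:
            missing.append(r)
        elif PHONE_RE.match(r["phone"]):
            clean.append(r)
        else:
            noisy.append(r)
    return clean, noisy, missing

clean, noisy, missing = audit(records)
print(len(clean), len(noisy), len(missing))  # 1 1 1
```

In a real pipeline, the noisy and missing buckets would be corrected, re-collected, or excluded before mining begins.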

Data Distribution
Real-world data is usually stored on various platforms in a distributed computing environment. It might be in databases, individual systems, or even on the internet. In practice, it is quite a tough task to bring all the data into a centralized data repository, mainly due to organizational and technical concerns. For example, various regional offices may have their own servers to store their data, and it is not feasible to store all the data from all the offices on a central server. Therefore, data mining requires the development of tools and algorithms that allow the mining of distributed data.

Complex Data
Real-world data is heterogeneous, and it could be multimedia data, including audio and
video, images, complex data, spatial data, time series, and so on. Managing these various
types of data and extracting useful information is a tough task. Most of the time, new
technologies, new tools, and methodologies would have to be refined to obtain specific
information.

Performance
The data mining system's performance relies primarily on the efficiency of algorithms and
techniques used. If the designed algorithm and techniques are not up to the mark, then the
efficiency of the data mining process will be affected adversely.

Data Privacy and Security


Data mining can lead to serious issues in terms of data security, governance, and privacy. For example, if a retailer analyzes the details of purchased items, it reveals information about the buying habits and preferences of customers without their permission.

Data Visualization
In data mining, data visualization is a very important process because it is the primary method of showing the output to the user in a presentable way. The extracted data should convey the exact meaning of what it intends to express. But many times, representing the information to the end user in a precise and easy way is difficult. Since both the input data and the output information are complicated, very efficient and successful data visualization processes need to be implemented.

There are many more challenges in data mining in addition to the above-mentioned problems. More problems are revealed as the actual data mining process begins, and the success of data mining relies on overcoming all these difficulties.
Data Mining Techniques
Data mining includes the utilization of refined data analysis tools to find previously
unknown, valid patterns and relationships in huge data sets. These tools can incorporate
statistical models, machine learning techniques, and mathematical algorithms, such as
neural networks or decision trees. Thus, data mining incorporates analysis and prediction.

Drawing on various methods and technologies from the intersection of machine learning, database management, and statistics, professionals in data mining have devoted their careers to better understanding how to process huge amounts of data and draw conclusions from them. But what are the methods they use to make it happen?

In recent data mining projects, various major data mining techniques have been developed
and used, including association, classification, clustering, prediction, sequential patterns,
and regression.

1. Classification

This technique is used to obtain important and relevant information about data and metadata. It helps to classify data into different classes.
Data mining techniques can be classified by different criteria, as follows:

i. Classification of data mining frameworks as per the type of data sources
mined:
This classification is as per the type of data handled, for example, multimedia,
spatial data, text data, time-series data, World Wide Web data, and so on.
ii. Classification of data mining frameworks as per the database involved:
This classification is based on the data model involved, for example,
object-oriented database, transactional database, relational database, and so on.
iii. Classification of data mining frameworks as per the kind of knowledge
discovered:
This classification depends on the types of knowledge discovered or the data
mining functionalities, for example, discrimination, classification, clustering,
characterization, etc. Some frameworks are comprehensive and offer several
data mining functionalities together.
iv. Classification of data mining frameworks according to data mining techniques
used:
This classification is as per the data analysis approach utilized, such as neural
networks, machine learning, genetic algorithms, visualization, statistics,
data warehouse-oriented or database-oriented approaches, etc.
The classification can also take into account the level of user interaction involved in
the data mining procedure, such as query-driven systems, autonomous systems, or
interactive exploratory systems.
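The classification technique itself can be sketched with a toy example. The snippet below implements a minimal 1-nearest-neighbour classifier on invented data; it is an illustrative sketch, not any particular tool's API:

```python
# A minimal sketch of classification: a 1-nearest-neighbour classifier
# assigns each new point the class of its closest training example.
# The toy data below is invented for illustration.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def classify(point, training):
    """training: list of (features, label) pairs."""
    _, label = min(training, key=lambda t: euclidean(point, t[0]))
    return label

training = [((1.0, 1.0), "low"), ((1.2, 0.8), "low"),
            ((5.0, 5.0), "high"), ((4.8, 5.3), "high")]

print(classify((1.1, 0.9), training))  # low
print(classify((5.2, 4.9), training))  # high
```

Real systems would use richer models (decision trees, neural networks), but the core idea is the same: learn class boundaries from labelled examples, then assign labels to new data.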

2. Clustering
Clustering is a division of information into groups of connected objects. Describing the data by a few clusters inevitably loses certain fine details, but achieves simplification. Clustering models data by its clusters. From a historical point of view, data modeling via clustering is rooted in statistics, mathematics, and numerical analysis. From a machine learning point of view, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting framework represents a data concept. From a practical point of view, clustering plays an extraordinary role in data mining applications such as scientific data exploration, text mining, information retrieval, spatial database applications, CRM, web analysis, computational biology, medical diagnostics, and much more.

In other words, clustering analysis is a data mining technique for identifying similar data. This technique helps to recognize the differences and similarities between data items. Clustering is very similar to classification, but it involves grouping chunks of data together based on their similarities.
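A minimal clustering sketch, assuming one-dimensional toy data and a hand-rolled k-means loop (illustrative only; production code would use a library implementation):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """A minimal k-means sketch: alternate assignment and centroid update."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid.
            i = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[i].append(p)
        # Move each centroid to the mean of its assigned points.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious one-dimensional groups: values near 1 and values near 10.
data = [0.9, 1.1, 1.0, 9.8, 10.2, 10.0]
print(kmeans(data, 2))  # two centroids, near 1.0 and 10.0
```

The same alternation of "assign, then re-estimate" underlies most partitional clustering algorithms; only the distance measure and the data representation change.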

3. Regression
Regression analysis is the data mining process used to identify and analyze the relationship between variables in the presence of other factors. It is used to estimate the probability of a specific variable. Regression is primarily a form of planning and modeling. For example, we might use it to project certain costs, depending on other factors such as availability, consumer demand, and competition. Primarily, it gives the exact relationship between two or more variables in the given data set.
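As a small worked example, the sketch below fits a line y = a + b·x by ordinary least squares; the demand/cost figures are invented for illustration:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (a minimal sketch)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y over variance of x.
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# Hypothetical data: cost rises with consumer demand.
demand = [1, 2, 3, 4, 5]
cost   = [3, 5, 7, 9, 11]          # exactly cost = 1 + 2*demand
a, b = fit_line(demand, cost)
print(round(a, 2), round(b, 2))    # 1.0 2.0
```

Once fitted, the line can be used to project the cost for an unseen demand value, which is the "planning and modeling" use described above.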

4. Association Rules
This data mining technique helps to discover a link between two or more items. It finds a
hidden pattern in the data set.

Association rules are if-then statements that help to show the probability of interactions between data items within large data sets in different types of databases. Association rule mining has several applications and is commonly used to find sales correlations in transaction data or in medical data sets.

The way the algorithm works is that you have various data, for example, a list of grocery items that you have been buying for the last six months. It calculates the percentage of items being purchased together.

There are three major measurement techniques:

o Support:
This measures how often items A and B are purchased together, relative
to the entire dataset.
Support(A → B) = (Transactions containing both A and B) / (All transactions)
o Confidence:
This measures how often item B is purchased when item A is purchased
as well.
Confidence(A → B) = (Transactions containing both A and B) / (Transactions containing A)
o Lift:
This measures how much more often A and B occur together than expected
if they were independent: the confidence divided by the support of B.
Lift(A → B) = Confidence(A → B) / Support(B)
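These three measures can be computed directly from a list of transactions. The sketch below uses a hypothetical four-transaction grocery dataset:

```python
# Hypothetical grocery transactions, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(items):
    """Fraction of transactions containing every item in `items`."""
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(a, b):
    """How often b is bought given that a was bought."""
    return support(a | b) / support(a)

def lift(a, b):
    """Confidence relative to how often b is bought overall."""
    return confidence(a, b) / support(b)

A, B = {"bread"}, {"milk"}
print(support(A | B))     # 0.5  (bread and milk co-occur in 2 of 4 baskets)
print(confidence(A, B))   # ≈0.667
print(lift(A, B))         # ≈0.889 (below 1: slightly negatively associated)
```

A lift above 1 indicates the two items are bought together more often than chance would predict; a lift below 1, as here, indicates the opposite.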
5. Outlier Detection
This type of data mining technique relates to the observation of data items in the data set
that do not match an expected pattern or expected behavior. This technique may be used
in various domains like intrusion detection, fraud detection, etc. It is also known as Outlier
Analysis or Outlier Mining. An outlier is a data point that diverges too much from the rest
of the dataset. The majority of real-world datasets have outliers. Outlier detection
plays a significant role in the data mining field and is valuable in numerous
areas like network intrusion identification, credit or debit card fraud detection,
detecting outlying values in wireless sensor network data, etc.
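A simple outlier-detection sketch using z-scores; the data and the 2-standard-deviation threshold are illustrative assumptions, not a universal rule:

```python
def zscore_outliers(values, threshold=2.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) > threshold * std]

# One value (95) diverges sharply from the rest of the dataset.
data = [10, 12, 11, 13, 12, 11, 95]
print(zscore_outliers(data))  # [95]
```

More robust variants use the median and interquartile range instead of the mean and standard deviation, since extreme outliers distort both of the latter.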

6. Sequential Patterns

Sequential pattern mining is a data mining technique specialized for evaluating sequential
data to discover sequential patterns. It comprises finding interesting subsequences in a
set of sequences, where the interestingness of a sequence can be measured in terms of
different criteria like length, occurrence frequency, etc.

In other words, this data mining technique helps to discover or recognize similar
patterns in transaction data over time.
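A minimal sketch of sequential pattern support: the fraction of customer histories containing a given pattern as an order-preserving subsequence. The histories below are invented for illustration:

```python
def is_subsequence(pattern, sequence):
    """True if `pattern` occurs in `sequence` preserving order (gaps allowed)."""
    it = iter(sequence)
    # Each `in` consumes the iterator up to the match, enforcing order.
    return all(item in it for item in pattern)

def pattern_support(pattern, sequences):
    """Fraction of sequences that contain the pattern as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in sequences) / len(sequences)

# Hypothetical purchase histories, one list per customer over time.
histories = [
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["laptop", "mouse"],
]
print(pattern_support(["phone", "charger"], histories))  # ≈0.667
```

Full algorithms such as GSP or PrefixSpan search the space of candidate patterns efficiently, but each candidate is ultimately scored by a support measure like this one.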

7. Prediction
Prediction uses a combination of other data mining techniques such as trend analysis, clustering, classification, etc. It analyzes past events or instances in the right sequence to predict a future event.

Data Mining Implementation Process


Many different sectors are taking advantage of data mining to boost their business efficiency, including manufacturing, chemical, marketing, aerospace, etc. Therefore, the need for a standard data mining process grew. Data mining techniques must be reliable and repeatable by company individuals with little or no knowledge of the data mining context. As a result, the Cross-Industry Standard Process for Data Mining (CRISP-DM) was first introduced in the late 1990s, after going through many workshops and contributions from more than 300 organizations.

Data mining is described as a process of finding hidden, valuable information by evaluating the huge quantity of information stored in data warehouses, using multiple data mining techniques such as Artificial Intelligence (AI), machine learning, and statistics.

Let's examine the implementation process for data mining in detail:

The Cross-Industry Standard Process for Data Mining (CRISP-DM)


The Cross-Industry Standard Process for Data Mining (CRISP-DM) comprises six phases
designed as a cyclical process:

1. Business understanding
This phase focuses on understanding the project goals and requirements from a business point of view, then converting this knowledge into a data mining problem definition and a preliminary plan designed to accomplish the objectives.

Tasks:
o Determine business objectives
o Assess situation
o Determine data mining goals
o Produce a project plan

Determine Business Objectives


o Understand the project targets and prerequisites from a business point of view.

o Thoroughly understand what the customer wants to achieve.

o Reveal, at the start, the significant factors that can impact the outcome of the project.

Assess situation
o This requires a more detailed analysis of facts about all the resources, constraints,
assumptions, and other factors that ought to be considered.
Determine data mining goals
o A business goal states the objective in business terminology. For example: increase
catalog sales to existing customers.
o A data mining goal describes the project objectives in technical terms. For example:
predict how many items a customer will buy, given their demographic details (age, salary,
and city) and the price of the item over the past three years.

Produce a project plan:

o It states the intended plan to accomplish the business and data mining goals.
o The project plan should define the expected set of steps to be performed during the rest
of the project, including the initial selection of techniques and tools.

2. Data Understanding
Data understanding starts with an initial data collection and proceeds with activities
to get familiar with the data, to identify data quality issues, to find first insights into the
data, or to detect interesting subsets that suggest hypotheses about hidden information.

Tasks:
o Collect initial data
o Describe data
o Explore data
o Verify data quality

Collect initial data


o Acquire the data listed in the project resources.
o This includes data loading, if needed for data understanding.
o It may lead to initial data preparation steps.
o If various data sources are acquired, then integration is an additional issue, either here or
at the subsequent data preparation stage.

Describe data
o It examines the "gross" or "surface" characteristics of the information obtained.
o It reports on the outcomes.

Explore data
o Address data mining questions that can be resolved by querying,
visualizing, and reporting, including:
o the distribution of important attributes and the results of simple aggregations;
o relationships between small numbers of attributes;
o characteristics of important sub-populations, and simple statistical analyses.
o It may refine the data mining objectives.
o It may contribute to or refine the data description and quality reports.
o It may feed into the transformation and other necessary data preparation steps.

Verify data quality


o It examines the quality of the acquired data and addresses questions about its completeness and correctness.

3. Data Preparation
o It usually takes more than 90 percent of the project time.
o It covers all activities needed to build the final data set from the original raw data.
o Data preparation is likely to be performed several times and not in any prescribed
order.

Tasks
o Select data
o Clean data
o Construct data
o Integrate data
o Format data

Select data

o Decide which data is to be used for the analysis.

o Selection criteria include relevance to the data mining objectives, quality, and
technical constraints such as limits on data volume or data types.
o It covers both the selection of attributes (columns) and the selection of records (rows) in a table.

Clean data
o It may involve selecting clean subsets of the data, inserting suitable defaults,
or more ambitious techniques such as estimating missing data by modeling.

Construct data
o This comprises constructive data preparation operations, such as generating derived
attributes, creating entirely new records, or transforming values of existing attributes.

Integrate data
o Integrating data refers to the methods whereby data is combined from various tables
or records to create new records or values.

Format data
o Formatting data refers mainly to syntactic changes made to the data that do not
alter its meaning but may be required by the modeling tool.

4. Modeling
In modeling, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Some techniques have particular requirements on the form of the data; therefore, stepping back to the data preparation phase may be necessary.

Tasks
o Select modeling technique
o Generate test design
o Build model
o Assess model
Select modeling technique
o Select the actual modeling technique that is to be used, for example, a decision tree
or a neural network.
o If various techniques are applied, then this task is performed individually for each
technique.

Generate test design


o Generate a procedure or mechanism for testing the validity and quality of the model
before building it. For example, in classification, error rates are commonly used as
quality measures for data mining models. Therefore, the data set is typically separated
into a train set and a test set: the model is built on the train set and its quality is
assessed on the separate test set.
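The train/test procedure described above can be sketched as follows, using invented labelled rows and a deliberately trivial threshold model:

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle and split rows into train and test sets (a minimal sketch)."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

def error_rate(predictions, labels):
    """Fraction of predictions that disagree with the true labels."""
    wrong = sum(p != y for p, y in zip(predictions, labels))
    return wrong / len(labels)

# Hypothetical labelled rows: (feature, class).
rows = [(i, "high" if i >= 10 else "low") for i in range(20)]
train, test = train_test_split(rows)
# A trivial "model" standing in for one learned from the train set.
preds = ["high" if x >= 10 else "low" for x, _ in test]
print(error_rate(preds, [y for _, y in test]))  # 0.0
```

Because the model here reproduces the labelling rule exactly, its test error is zero; a real model's test error estimates how well it will generalize to unseen data.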

Build model
o To create one or more models, we need to run the modeling tool on the prepared
data set.

Assess model
o Interpret the models according to domain expertise, the data mining success
criteria, and the desired design.
o Assess the success of the modeling application from a technical viewpoint and
discover further methods.
o Contact business analysts and domain specialists afterwards to discuss the outcomes
of data mining in the business context.

5. Evaluation

o It evaluates the model and reviews the steps executed to build it, to ensure that
the business objectives are properly achieved.

o The main objective of the evaluation is to determine whether any significant business
issue has not been considered adequately.
o At the end of this phase, a decision on the use of the data mining results should be
reached.

Tasks

o Evaluate results

o Review process

o Determine next steps

Evaluate results

o It assesses the degree to which the model meets the organization's business
objectives.

o It tests the model on test applications in the actual implementation, when time and
budget constraints permit, and also assesses any other data mining results produced.

o It unveils additional difficulties, suggestions, or information for future directions.

Review process

o The review process performs a more detailed assessment of the data mining engagement
to determine whether any significant factor or task has somehow been
overlooked.

o It reviews quality assurance issues.

Determine next steps

o Decide how to proceed at this stage.

o Decide whether to complete the project and move on to deployment, to initiate
further iterations, or to set up new data mining projects. This includes an analysis
of the remaining resources and budget, which may influence the decisions.

6. Deployment

Deployment refers to how the outcomes of the data mining process are put to use.

Deploy data mining results by:

o scoring a database, utilizing the results as company guidelines, or interactive internet
scoring.
o The knowledge acquired needs to be organized and presented in a way that the client
can use. Depending on the requirements, the deployment phase may be as simple as
generating a report or as complex as implementing a repeatable data mining process
across the organization.

Tasks
o Plan deployment
o Plan monitoring and maintenance
o Produce final report
o Review project

Plan deployment:
o To deploy the data mining results into the business, take the evaluation results
and conclude a strategy for deployment.
o This includes documenting the procedure for later deployment.

Plan monitoring and maintenance


o It is important when the data mining results become part of the day-to-day business
and its environment.
o It helps to avoid unnecessarily long periods of misuse of data mining results.
o It needs a detailed analysis of the monitoring process.

Produce final report


o A final report can be drawn up by the project leader and his team.
o It may only be a summary of the project and its experience.
o It may be a final and comprehensive presentation of data mining.

Review project
o The project review evaluates what went right and what went wrong, and what
needs to be improved.
Data Mining Architecture

Introduction
Data mining is a significant method where previously unknown and potentially useful
information is extracted from the vast amount of data. The data mining process involves
several components, and these components constitute a data mining system architecture.

The significant components of data mining systems are a data source, data mining engine,
data warehouse server, the pattern evaluation module, graphical user interface, and
knowledge base.

Data Source
The actual sources of data are databases, data warehouses, the World Wide Web (WWW), text
files, and other documents. You need a huge amount of historical data for data mining to be
successful. Organizations typically store data in databases or data warehouses. Data
warehouses may comprise one or more databases, text files, spreadsheets, or other
repositories of data. Another primary source of data is the World Wide Web, or the internet.
Different Processes
Before passing the data to the database or data warehouse server, the data must be
cleaned, integrated, and selected. As the information comes from various sources and in
different formats, it can't be used directly for the data mining procedure because it may
not be complete and accurate. So, the data first needs to be cleaned and unified. More
information than needed will be collected from various data sources, and only the data of
interest has to be selected and passed to the server. These procedures are not as easy as
they sound: several methods may be performed on the data as part of selection,
integration, and cleaning.

Database or Data Warehouse Server


The database or data warehouse server consists of the original data that is ready to be
processed. Hence, the server is responsible for retrieving the relevant data according to
the user's data mining request.

Data Mining Engine


The data mining engine is a major component of any data mining system. It contains
several modules for performing data mining tasks, including association, characterization,
classification, clustering, prediction, and time-series analysis.

In other words, the data mining engine is the core of the data mining architecture. It
comprises the instruments and software used to obtain insights and knowledge from data
collected from various data sources and stored within the data warehouse.

Pattern Evaluation Module


The pattern evaluation module is primarily responsible for measuring how interesting a
discovered pattern is, using a threshold value. It collaborates with the data mining engine
to focus the search on interesting patterns.
This segment commonly employs interestingness measures that cooperate with the data
mining modules to steer the search towards interesting patterns. It might utilize an
interestingness threshold to filter out discovered patterns. Alternatively, the pattern
evaluation module might be integrated with the mining module, depending on the
implementation of the data mining techniques used. For efficient data mining, it is
strongly suggested to push the evaluation of pattern interestingness as deep as possible
into the mining procedure, so as to confine the search to only the interesting patterns.
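As an illustration, a minimal pattern evaluation step might look like the following sketch. The patterns, support values, and threshold below are invented for illustration, not taken from any particular system:

```python
# Sketch of a pattern evaluation module: keep only patterns whose
# interestingness measure (here, support) clears a user-supplied
# threshold. Patterns and values are illustrative assumptions.

def evaluate_patterns(patterns, threshold):
    """Return the patterns whose interestingness measure >= threshold."""
    return [p for p in patterns if p["support"] >= threshold]

mined = [
    {"pattern": ("bread", "butter"), "support": 0.40},
    {"pattern": ("bread", "milk"),   "support": 0.25},
    {"pattern": ("milk", "candles"), "support": 0.02},  # likely noise
]

interesting = evaluate_patterns(mined, threshold=0.10)
# Only the two frequent patterns survive the threshold filter.
```

In a real system this filter would sit inside the mining loop, as the text suggests, so that uninteresting candidates are pruned during the search rather than afterwards.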
Graphical User Interface
The graphical user interface (GUI) module communicates between the data mining system
and the user. This module helps the user to easily and efficiently use the system without
knowing the complexity of the process. This module cooperates with the data mining
system when the user specifies a query or a task and displays the results.

Knowledge Base
The knowledge base is helpful in the entire data mining process. It might be used to guide
the search or to evaluate the interestingness of the resulting patterns. The knowledge base
may even contain user views and data from user experiences that can help the data mining
process. The data mining engine may receive inputs from the knowledge base to make the
results more accurate and reliable. The pattern evaluation module regularly interacts with
the knowledge base to get inputs and to update it.

KDD- Knowledge Discovery in Databases


The term KDD stands for Knowledge Discovery in Databases. It refers to the broad
procedure of discovering knowledge in data and emphasizes the high-level applications of
specific Data Mining techniques. It is a field of interest to researchers in various fields,
including artificial intelligence, machine learning, pattern recognition, databases, statistics,
knowledge acquisition for expert systems, and data visualization.

The main objective of the KDD process is to extract information from data in the context of
large databases. It does this by using Data Mining algorithms to identify what is deemed
knowledge.

Knowledge Discovery in Databases can be viewed as an automated, exploratory
analysis and modeling of vast data repositories. KDD is the organized procedure of
recognizing valid, useful, and understandable patterns from huge and complex data sets.
Data Mining is the core of the KDD procedure, involving the algorithms that
investigate the data, develop the model, and find previously unknown patterns. The model
is then used to extract knowledge from the data, analyze it, and make predictions.

The availability and abundance of data today make knowledge discovery and Data Mining a
matter of considerable significance and need. Given the recent development of the field, it
isn't surprising that a wide variety of techniques is presently accessible to specialists and
experts.

The KDD Process


The knowledge discovery process is iterative and interactive, comprising nine steps. The
process is iterative at each stage, implying that moving back to previous steps might be
required. The process has many imaginative aspects in the sense that one cannot present
a single formula or make a complete scientific categorization of the correct decisions for
each step and application type. Thus, it is necessary to understand the process and the
different requirements and possibilities at each stage.

The process begins with determining the KDD objectives and ends with the
implementation of the discovered knowledge. At that point, the loop is closed, and Active
Data Mining starts. Subsequently, changes would need to be made in the application
domain, for example, offering various features to cell phone users in order to reduce churn.
This closes the loop: the impacts are then measured on the new data repositories, and the
KDD process is run again. The following is a concise description of the nine-step KDD
process, beginning with a managerial step:
1. Building up an understanding of the application domain
This is the initial preliminary step. It sets the scene for understanding what should be
done with the various decisions (transformation, algorithms, representation, etc.). The
individuals in charge of a KDD venture need to understand and characterize the
objectives of the end-user and the environment in which the knowledge discovery process
will occur (including relevant prior knowledge).

2. Choosing and creating a data set on which discovery will be performed


Once the objectives are defined, the data that will be utilized for the knowledge discovery
process should be determined. This incorporates discovering what data is accessible,
obtaining important data, and afterward integrating all the data for knowledge discovery
into one data set, including the attributes that will be considered for the process. This step
is important because Data Mining learns and discovers from the accessible data; this is the
evidence base for building the models. If some significant attributes are missing, the
entire study may be unsuccessful; from this respect, the more attributes considered, the
better. On the other hand, organizing, collecting, and operating advanced data repositories
is expensive, so there is a trade-off with the opportunity to best understand the
phenomena. This trade-off is one aspect where the interactive and iterative nature of KDD
comes into play: one begins with the best available data sets and later expands them,
observing the impact in terms of knowledge discovery and modeling.
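The selection and integration described above can be sketched as joining records from two hypothetical sources on a shared key and keeping only the attributes of interest. All source names, keys, and attributes here are assumptions for illustration:

```python
# Sketch of step 2: integrate two hypothetical data sources on a shared
# customer id, then select only the attributes relevant to the task.

crm_records = {101: {"age": 34, "city": "Dhaka"},
               102: {"age": 51, "city": "Khulna"}}
billing_records = {101: {"monthly_spend": 42.0},
                   102: {"monthly_spend": 17.5},
                   103: {"monthly_spend": 99.9}}  # no CRM match: dropped

def integrate(crm, billing, attributes=("age", "monthly_spend")):
    """Join on customer id, then keep only the selected attributes."""
    dataset = []
    for cid in crm.keys() & billing.keys():
        merged = {**crm[cid], **billing[cid], "id": cid}
        dataset.append({k: merged[k] for k in ("id",) + tuple(attributes)})
    return sorted(dataset, key=lambda r: r["id"])

data = integrate(crm_records, billing_records)
# Attributes not selected (e.g. "city") never reach the mining step.
```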

3. Preprocessing and cleansing


In this step, data reliability is improved. It incorporates data cleaning, for example,
handling missing values and removing noise or outliers. It might involve complex
statistical techniques or the use of a Data Mining algorithm in this context. For example,
when one suspects that a specific attribute is of lacking reliability or has much missing
data, this attribute could become the target of a supervised Data Mining algorithm: a
prediction model for the attribute is created, and the missing data can then be predicted.
The extent to which one pays attention at this level depends on numerous factors.
Regardless, studying these aspects is significant and often revealing in itself, with respect
to enterprise data frameworks.
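A minimal sketch of this preprocessing step, assuming median imputation for missing values and a median-absolute-deviation rule for outlier removal (the sample values and the cut-off factor are invented):

```python
# Sketch of step 3 (preprocessing and cleansing): fill missing values
# with the median, then drop outliers whose distance from the median
# exceeds k times the median absolute deviation (MAD).
import statistics

def clean(values, k=10.0):
    """Impute None with the median, then remove far-out values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    filled = [med if v is None else v for v in values]
    mad = statistics.median(abs(v - med) for v in filled)
    return [v for v in filled if abs(v - med) <= k * mad]

raw = [4.0, 5.0, None, 6.0, 5.0, 500.0]   # 500.0 is a suspected outlier
cleaned = clean(raw)
# The None is replaced by the median (5.0) and 500.0 is discarded.
```

A median-based rule is used here because, on tiny samples, a mean/standard-deviation rule is itself distorted by the outlier it is meant to catch.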

4. Data Transformation
In this stage, appropriate data for Data Mining is prepared and developed. Techniques
here incorporate dimension reduction (for example, feature selection and extraction, and
record sampling) as well as attribute transformation (for example, discretization of
numerical attributes and functional transformation). This step can be essential for the
success of the entire KDD project, and it is typically very project-specific. For example, in
medical assessments, the quotient of attributes may often be the most significant factor,
and not each one by itself. In business, we may need to consider effects beyond our
control as well as efforts and transient issues, for example, studying the effect of
advertising accumulation. However, even if we do not use the right transformation at the
start, we may obtain a surprising effect that hints at the transformation required in the
next iteration. Thus, the KDD process reflects upon itself and leads to an understanding
of the transformation required.
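Discretization of a numerical attribute, one of the transformations mentioned above, can be sketched as follows (the age cut points and labels are illustrative assumptions):

```python
# Sketch of step 4 (data transformation): discretize a numerical
# attribute into labeled bins, a common preparation for mining
# algorithms that expect categorical inputs.

def discretize(age, cutpoints=((30, "young"), (60, "middle-aged"))):
    """Map a numeric age to a categorical label using fixed cut points."""
    for bound, label in cutpoints:
        if age < bound:
            return label
    return "senior"

ages = [22, 45, 67, 30]
labels = [discretize(a) for a in ages]
# 30 is not < 30, so it falls into the "middle-aged" bin.
```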

5. Prediction and description


We are now prepared to decide which kind of Data Mining to use, for example,
classification, regression, or clustering. This mostly depends on the KDD objectives and
also on the previous steps. There are two major objectives in Data Mining: the first is
prediction, and the second is description. Prediction is usually referred to as supervised
Data Mining, while descriptive Data Mining incorporates the unsupervised and
visualization aspects of Data Mining. Most Data Mining techniques depend on inductive
learning, where a model is built explicitly or implicitly by generalizing from an adequate
number of training examples. The fundamental assumption of the inductive approach is
that the trained model applies to future cases. The technique also takes into account the
level of meta-learning for the specific set of accessible data.

6. Selecting the Data Mining algorithm


Having chosen the technique, we now decide on the strategy. This stage incorporates
choosing a particular method to be used for searching patterns, possibly involving
multiple inducers. For example, considering precision versus understandability, the
former is better with neural networks, while the latter is better with decision trees. For
each strategy of meta-learning, there are several possibilities for how it can be
accomplished. Meta-learning focuses on clarifying what causes a Data Mining algorithm to
be fruitful or not on a specific problem; thus, this methodology attempts to understand the
conditions under which a Data Mining algorithm is most suitable. Each algorithm has
parameters and strategies of learning, such as ten-fold cross-validation or another
division for training and testing.
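The ten-fold cross-validation mentioned above can be sketched as follows. The data here is a toy list of integers; a real split would carry feature vectors and labels:

```python
# Sketch of k-fold cross-validation: split the data into k folds and,
# for each fold, hold it out for testing while training on the rest.

def k_fold_splits(data, k=10):
    """Yield (train, test) pairs, each fold serving once as the test set."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(20))
splits = list(k_fold_splits(data, k=10))
# 10 splits; every example appears in exactly one test fold.
```

Real implementations usually shuffle the data first; the round-robin slicing here is kept deterministic so the sketch is easy to verify.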

7. Utilizing the Data Mining Algorithm

At last, the implementation of the Data Mining algorithm is reached. In this stage, we may
need to run the algorithm several times until a satisfying outcome is obtained, for
example, by tuning the algorithm's control parameters, such as the minimum number of
instances in a single leaf of a decision tree.

8. Evaluation
In this step, we assess and interpret the mined patterns and rules with respect to the
objectives defined in the first step. Here we consider the preprocessing steps in terms of
their impact on the Data Mining algorithm results, for example, adding a feature in step
4 and repeating from there. This step focuses on the comprehensibility and utility of the
induced model. The identified knowledge is also recorded for further use. The last step is
the usage of, and overall feedback on, the discovery results acquired by Data Mining.

9. Using the discovered knowledge

Now, we are prepared to incorporate the knowledge into another system for further
action. The knowledge becomes active in the sense that we may make changes to the
system and measure the effects. The success of this step determines the effectiveness of
the whole KDD process. There are numerous challenges in this step, such as losing the
"laboratory conditions" under which we have worked. For example, the knowledge was
discovered from a certain static snapshot (usually a set of data), but now the data becomes
dynamic: data structures may change, certain quantities may become unavailable, and the
data domain might be modified, such as an attribute having a value that was not expected
previously.
Data Mining vs Machine Learning

Data Mining relates to extracting information from a large quantity of data. Data mining is
a technique for discovering various kinds of patterns inherent in a data set that are
precise, new, and useful. Data Mining works as a subset of business analytics, similar to
experimental studies. Data Mining's origins are databases and statistics.

Machine learning includes algorithms that automatically improve through data-based
experience. Machine learning is a way to find new algorithms from experience. It includes
the study of algorithms that can automatically extract data. Machine learning utilizes data
mining techniques and other learning algorithms to construct models of what is
happening behind certain data so that it can predict future results.

Data Mining and Machine learning are areas that have influenced each other; although
they have much in common, they serve different ends.

Data Mining is performed on certain data sets by humans to find interesting patterns
between the items in the data set. Data Mining uses techniques created by machine
learning for predicting results, while machine learning is the capability of the computer
to learn from a mined data set.

Machine learning algorithms take the information that represents the relationship
between items in data sets and create models in order to predict future results. These
models are nothing more than the actions the machine will take to achieve a result.

What is Data Mining?

Data Mining is the method of extracting data or previously unknown data patterns from
huge sets of data. Hence, as the phrase suggests, we 'mine for specific data' from the large
data set. Data mining, also called the Knowledge Discovery Process, is a field of science
used to determine the properties of data sets. Gregory Piatetsky-Shapiro coined the term
"Knowledge Discovery in Databases" (KDD) in 1989. The term "data mining" appeared
in the database community around 1990. Huge sets of data collected from data warehouses
or complex data sets such as time series, spatial data, etc. are mined in order to extract
interesting correlations and patterns between the data items. The output of a data mining
algorithm is often used as input for Machine Learning algorithms.
What is Machine learning?

Machine learning is related to the development and design of a machine that can learn by
itself from a specified set of data to obtain a desirable result without being explicitly
coded. Hence Machine learning implies 'a machine which learns on its own'. Arthur
Samuel, an American pioneer in the area of computer gaming and artificial intelligence,
coined the term Machine learning in 1959. He said that it "gives computers the ability to
learn without being explicitly programmed."

Machine learning is a technique that creates complex algorithms for large data processing
and provides outcomes to its users. It utilizes complex programs that can learn through
experience and make predictions.

The algorithms enhance themselves through frequent input of training data. The aim of
machine learning is to understand data and build models from it that humans can
understand and use.

Machine learning algorithms are divided into two types:

1. Unsupervised Learning
2. Supervised Learning

1. Unsupervised Machine Learning:


Unsupervised learning does not depend on labeled training data sets to predict results;
instead, it utilizes direct techniques such as clustering and association. A training data set
is defined as input for which the output is known.
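Clustering, one of the direct techniques mentioned above, can be sketched with a tiny one-dimensional k-means. The initial centers are fixed here so the example is deterministic; real implementations usually initialize randomly:

```python
# Sketch of unsupervised learning: 1-D k-means groups points without
# any labels by alternating assignment and center-update steps.

def kmeans_1d(points, centers, iterations=10):
    """Alternate assignment and center update for 1-D k-means."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centers, clusters = kmeans_1d(points, centers=[0.0, 12.0])
# Converges to one cluster around 1.5 and one around 10.5.
```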

2. Supervised Machine Learning:


As the name implies, supervised learning refers to the presence of a supervisor acting as a
teacher. Supervised learning is a learning process in which we teach or train the machine
using well-labeled data, meaning that some data is already tagged with the correct
responses. After that, the machine is provided with new sets of data so that the
supervised learning algorithm analyzes the training data and gives an accurate result from
the labeled data.
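Supervised learning on labeled data can be sketched with a one-nearest-neighbour classifier. The tiny training set, features, and labels are invented for illustration:

```python
# Sketch of supervised learning: a 1-nearest-neighbour classifier
# trained on labeled examples and applied to a new, unlabeled query.

def predict(train, query):
    """Return the label of the training example closest to the query."""
    features, label = min(train,
                          key=lambda ex: sum((a - b) ** 2
                                             for a, b in zip(ex[0], query)))
    return label

train = [((1.0, 1.0), "spam"), ((1.2, 0.8), "spam"),
         ((5.0, 5.0), "ham"),  ((5.5, 4.5), "ham")]

label = predict(train, query=(4.8, 5.2))
# The query lies nearest to the "ham" examples.
```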
Major Difference between Data mining and Machine learning
1. Two components are used to introduce data mining techniques: the first is the database,
and the second is machine learning. The database provides data management techniques,
while machine learning provides methods for data analysis. To introduce machine
learning methods, however, algorithms are used.

2. Data Mining utilizes more data to obtain helpful information, and that specific data will
help to predict some future results. For example, a marketing company utilizes last year's
data to predict sales, but machine learning does not depend as much on data; it uses
algorithms. Many transportation companies such as OLA and UBER use machine learning
techniques to calculate the ETA (Estimated Time of Arrival) for rides.

3. Data mining is not capable of self-learning; it follows predefined guidelines and
provides the answer to a specific problem. Machine learning algorithms, in contrast, are
self-defined, can alter their rules according to the situation, and find and resolve the
solution to a specific problem in their own way.

4. The main and most important difference between data mining and machine learning is
that data mining can't work without the involvement of humans, whereas in machine
learning, human effort is involved only when the algorithm is defined; after that, it
concludes everything on its own. Once implemented, it can be used forever, which is not
possible in the case of data mining.

5. As machine learning is an automated process, the results produced by machine learning
will be more precise compared to data mining.

6. Data mining utilizes the database, data warehouse server, data mining engine, and
pattern assessment techniques to obtain useful information, whereas machine learning
utilizes neural networks, predictive models, and automated algorithms to make the
decisions.
Data Mining Vs Machine Learning

Origin
o Data Mining: Traditional databases with unstructured data.
o Machine Learning: An existing algorithm and data.

Meaning
o Data Mining: Extracting information from a huge amount of data.
o Machine Learning: Introducing new information from data as well as previous
experience.

History
o Data Mining: The term "knowledge discovery in databases" (KDD) was coined in 1989.
o Machine Learning: The first program, i.e., Samuel's checker-playing program, was
established in the 1950s.

Responsibility
o Data Mining: Used to obtain the rules from the existing data.
o Machine Learning: Teaches the computer how to learn and comprehend the rules.

Abstraction
o Data Mining: Abstracts patterns from the data warehouse.
o Machine Learning: Learns from training data read by the machine.

Applications
o Data Mining: Compared to machine learning, it can produce outcomes on a smaller
volume of data. It is also used in cluster analysis.
o Machine Learning: Needs a large amount of data to obtain accurate results. It has
various applications: web search, spam filtering, credit scoring, computer design, etc.

Nature
o Data Mining: Involves more human interference and is largely manual.
o Machine Learning: Automated; once designed and implemented, there is no need for
human effort.

Techniques involved
o Data Mining: More of a research activity, using techniques like machine learning.
o Machine Learning: A self-learned and trained system that does the task precisely.

Scope
o Data Mining: Applied in limited fields.
o Machine Learning: Can be used in a vast area.
