Data Mining Complete

UNIT-I

Data Warehouse
A Data Warehouse is a relational database management system (RDBMS) construct built to
meet requirements that transaction processing systems cannot, namely data analysis and
decision support. It can be loosely described as any centralized data repository which can
be queried for business benefit. It is a database that stores information oriented to satisfy
decision-making requests. It is a group of decision support technologies aimed at enabling
the knowledge worker (executive, manager, and analyst) to make better and faster
decisions. Data Warehousing therefore supplies architectures and tools for business
executives to systematically organize, understand, and use their information to make
strategic decisions.
A Data Warehouse environment contains an extraction, transformation, and loading (ETL)
solution, an online analytical processing (OLAP) engine, client analysis tools, and other
applications that handle the process of gathering information and delivering it to business
users.

What is a Data Warehouse?


A Data Warehouse (DW) is a relational database that is designed for query and analysis
rather than transaction processing. It includes historical data derived from transaction data
from single or multiple sources.

A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on
providing support for decision-makers for data modeling and analysis.

A Data Warehouse is a collection of data specific to the entire organization, not only to a
particular group of users.

It is not used for daily operations and transaction processing, but for making decisions.

A Data Warehouse can be viewed as a data system with the following attributes:

o It is a database designed for investigative tasks, using data from various applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of information.
o Its usage is read-intensive.
o It contains a few large tables.

"Data Warehouse is a subject-oriented, integrated, and time-variant store of information


in support of management's decisions."

Characteristics of Data Warehouse

Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers.
Therefore, data warehouses typically provide a concise and straightforward view of a
particular subject, such as customer, product, or sales, instead of the organization's
ongoing global operations. This is done by excluding data that are not useful for the
subject and including all data needed by the users to understand the subject.

Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat files,
and online transaction records. It requires performing data cleaning and integration
during data warehousing to ensure consistency in naming conventions, attributes types,
etc., among different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data
from 3 months, 6 months, 12 months, or even older periods from a data warehouse.
This contrasts with a transaction system, where often only the most current data is
kept.

Non-Volatile
The data warehouse is a physically separate data store, into which data is transformed from
the source operational RDBMS. Operational updates of data do not occur in the data
warehouse, i.e., update, insert, and delete operations are not performed. Data access usually
requires only two procedures: initial loading of data and read access to data. Therefore,
the DW does not require transaction processing, recovery, and concurrency capabilities,
which allows for a substantial speedup of data retrieval. Non-volatile means that, once
entered into the warehouse, data should not change.

Goals of Data Warehousing


o To support reporting as well as analysis.
o To maintain the organization's historical information.
o To be the foundation for decision making.

Need for Data Warehouse


Data Warehouse is needed for the following reasons:
1. Business users: Business users require a data warehouse to view summarized data
from the past. Since these people are non-technical, the data may be presented to
them in an elementary form.

2. Store historical data: A data warehouse is required to store time-variant data
from the past. This input is then used for various purposes.

3. Make strategic decisions: Some strategies may depend upon the data in
the data warehouse. So, the data warehouse contributes to making strategic decisions.

4. Data consistency and quality: By bringing data from different sources to a
common place, the user can effectively bring uniformity and consistency to the data.

5. High response time: A data warehouse has to be ready for somewhat unexpected
loads and types of queries, which demands a significant degree of flexibility and
quick response time.

Benefits of Data Warehouse


1. Understand business trends and make better forecasting decisions.

2. Data warehouses are designed to perform well with enormous amounts of data.

3. The structure of data warehouses is easier for end-users to navigate, understand, and query.

4. Queries that would be complex in many normalized databases can be easier to build and
maintain in data warehouses.

5. Data warehousing is an efficient method to manage demand for lots of information from
lots of users.

6. Data warehousing provides the capability to analyze a large amount of historical data.

What is OLAP (Online Analytical Processing)?


OLAP stands for On-Line Analytical Processing. OLAP is a category of software
technology that enables analysts, managers, and executives to gain insight into
information through fast, consistent, interactive access to a wide variety of possible views
of data that has been transformed from raw information to reflect the real dimensionality
of the enterprise as understood by the clients.

OLAP implements the multidimensional analysis of business information and supports the
capability for complex calculations, trend analysis, and sophisticated data modeling. It is
rapidly becoming the essential foundation for intelligent solutions including Business
Performance Management, Planning, Budgeting, Forecasting, Financial Reporting,
Analysis, Simulation Models, Knowledge Discovery, and Data Warehouse Reporting.

OLAP enables end-users to perform ad hoc analysis of data in multiple dimensions,
providing the insight and understanding they require for better decision making.

Who uses OLAP and Why?


OLAP applications are used by a variety of functions within an organization.

Finance and accounting:

o Budgeting
o Activity-based costing
o Financial performance analysis
o Financial modeling

Sales and marketing:

o Sales analysis and forecasting
o Market research analysis
o Promotion analysis
o Customer analysis
o Market and customer segmentation

Production:

o Production planning
o Defect analysis

OLAP cubes have two main purposes. The first is to provide business users with a data
model more intuitive to them than a tabular model. This model is called a Dimensional
Model.
The second purpose is to enable fast query response, which is usually difficult to achieve
using tabular models.

How OLAP Works?


Fundamentally, OLAP has a very simple concept. It pre-calculates most of the queries that
are typically very hard to execute over tabular databases, namely aggregation, joining, and
grouping. These queries are calculated during a process that is usually called 'building' or
'processing' the OLAP cube. This process typically happens overnight, so that by the time
end users get to work, the data will have been updated.
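As an illustration of this pre-calculation idea, here is a minimal sketch in Python using the pandas library; the fact table and its columns (location, item, quarter, sales) are invented for the example and are not taken from the text:

import pandas as pd

# Toy detail-level fact table; in practice this would hold millions of rows.
detail = pd.DataFrame({
    "location": ["Toronto", "Toronto", "Vancouver", "Vancouver"],
    "item":     ["Mobile",  "Modem",   "Mobile",    "Modem"],
    "quarter":  ["Q1",      "Q1",      "Q1",        "Q2"],
    "sales":    [1200,      450,       980,         300],
})

# 'Building' the cube: pre-compute the expensive grouping/aggregation once
# (for example in a nightly job), producing a small summary table.
cube = detail.groupby(["location", "item", "quarter"], as_index=False)["sales"].sum()

# At query time, users read the pre-computed summary instead of scanning the detail rows.
print(cube[cube["quarter"] == "Q1"])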

Difference between Database System and Data Warehouse
Database System: A database system is used for the traditional way of storing and
retrieving data. The major task of a database system is to perform query processing.
These systems are generally referred to as online transaction processing (OLTP) systems.
They are used for the day-to-day operations of an organization.
Data Warehouse: A data warehouse is the place where a huge amount of data is
stored. It is meant for users or knowledge workers in the role of data analysis and
decision making. These systems are supposed to organize and present data in
different formats and forms in order to serve the needs of a specific user for a specific
purpose. These systems are referred to as online analytical processing (OLAP) systems.

Database System                                Data Warehouse

It supports operational processes.             It supports analysis and performance reporting.
Captures and maintains the data.               Explores the data.
Holds current data.                            Holds multiple years of history.
Data is balanced within the scope of           Data must be integrated and balanced from
this one system.                               multiple systems.
Data is updated when a transaction occurs.     Data is updated by scheduled processes.
Data verification occurs when entry is done.   Data verification occurs after the fact.
100 MB to GB.                                  100 GB to TB.
ER based.                                      Star/Snowflake schema.
Application oriented.                          Subject oriented.
Primitive and highly detailed.                 Summarized and consolidated.
Flat relational.                               Multidimensional.

What is Multi-Dimensional Data Model?


A multidimensional model views data in the form of a data-cube. A data cube enables
data to be modeled and viewed in multiple dimensions. It is defined by dimensions and
facts.

The dimensions are the perspectives or entities concerning which an organization keeps
records. Each dimension has a table related to it, called a dimensional table, which
describes the dimension further. For example, a dimensional table for an item may contain
the attributes item_name, brand, and type.

A multidimensional data model is organized around a central theme, for example, sales.
This theme is represented by a fact table. Facts are numerical measures. The fact table
contains the names of the facts or measures of the related dimensional tables.
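To make the fact/dimension vocabulary concrete, here is a small, hypothetical star-schema sketch in Python (pandas); the table layouts and column names (item_id, item_name, brand, type, units_sold) are assumptions for illustration only:

import pandas as pd

# Dimension table describing the 'item' dimension.
item_dim = pd.DataFrame({
    "item_id":   [1, 2],
    "item_name": ["Phone X", "Router Y"],
    "brand":     ["Acme", "Beta"],
    "type":      ["Mobile", "Modem"],
})

# Fact table: foreign keys into the dimensions plus a numerical measure.
sales_fact = pd.DataFrame({
    "item_id":     [1, 1, 2],
    "time_id":     ["Q1", "Q2", "Q1"],
    "location_id": ["Toronto", "Toronto", "Vancouver"],
    "units_sold":  [120, 95, 60],
})

# Joining the fact table with a dimension table lets the measure be analyzed
# from that dimension's perspective (here, by item type).
report = sales_fact.merge(item_dim, on="item_id").groupby("type")["units_sold"].sum()
print(report)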

OLAP Operations in the Multidimensional Data Model
In the multidimensional model, the records are organized into various dimensions, and
each dimension includes multiple levels of abstraction described by concept hierarchies.
This organization provides users with the flexibility to view data from various perspectives.
A number of OLAP data cube operations exist to materialize these different views,
allowing interactive querying and analysis of the data at hand. Hence, OLAP supports a
user-friendly environment for interactive data analysis.

Consider the OLAP operations to be performed on multidimensional data. The figure shows
a data cube for the sales of a shop. The cube contains the dimensions location, time, and
item, where location is aggregated with respect to city values, time is aggregated with
respect to quarters, and item is aggregated with respect to item types.
Roll-Up
The roll-up operation (also known as drill-up or the aggregation operation) performs
aggregation on a data cube, either by climbing up a concept hierarchy or by dimension
reduction. Roll-up is like zooming out on the data cube. The figure shows the result of a
roll-up operation performed on the dimension location. The hierarchy for location is
defined as the order street < city < province or state < country. The roll-up operation
aggregates the data by ascending the location hierarchy from the level of city to the level
of country.

When a roll-up is performed by dimension reduction, one or more dimensions are removed
from the cube. For example, consider a sales data cube having two dimensions, location
and time. Roll-up may be performed by removing the time dimension, resulting in an
aggregation of the total sales by location, rather than by location and by time.
Temperature 64 65 68 69 70 71 72 75 80 81 83 85

Week1 1 0 1 0 1 0 0 0 0 0 1 0
Week2 0 0 0 1 0 0 1 2 0 1 0 0

The rollup operation groups the information by levels of temperature.

The following diagram illustrates how roll-up works.
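A minimal roll-up sketch in Python (pandas), climbing the location hierarchy from city to country; the cities, countries, and sales figures are illustrative assumptions:

import pandas as pd

city_sales = pd.DataFrame({
    "city":  ["Toronto", "Vancouver", "Chicago", "New York"],
    "sales": [1000, 800, 1200, 1500],
})
city_to_country = {"Toronto": "Canada", "Vancouver": "Canada",
                   "Chicago": "USA", "New York": "USA"}

# Roll-up: replace the city level with the coarser country level and re-aggregate.
rolled_up = (city_sales.assign(country=city_sales["city"].map(city_to_country))
                       .groupby("country", as_index=False)["sales"].sum())
print(rolled_up)   # one row per country instead of one row per city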

Drill-Down
The drill-down operation (also called roll-down) is the reverse of roll-up.
Drill-down is like zooming in on the data cube. It navigates from less detailed data to
more detailed data. Drill-down can be performed either by stepping down a concept
hierarchy for a dimension or by adding additional dimensions.

The figure shows a drill-down operation performed on the dimension time by stepping down
a concept hierarchy defined as day, month, quarter, and year. Drill-down occurs by
descending the time hierarchy from the level of quarter to the more detailed level of month.
Because a drill-down adds more detail to the given data, it can also be performed by
adding a new dimension to the cube. For example, a drill-down on the central cube of the
figure can occur by introducing an additional dimension, such as customer group.
Example

Drill-down adds more details to the given data


Temperature cool mild hot
Day 1 0 0 0

Day 2 0 0 0

Day 3 0 0 1

Day 4 0 1 0

Day 5 1 0 0

Day 6 0 0 0

Day 7 1 0 0

Day 8 0 0 0

Day 9 1 0 0

Day 10 0 1 0

Day 11 0 1 0

Day 12 0 1 0

Day 13 0 0 1

Day 14 0 0 0
The following diagram illustrates how Drill-down works.
Slice
A slice is a subset of the cube corresponding to a single value for one or more members
of a dimension. For example, a slice operation is executed when the user wants a
selection on one dimension of a three-dimensional cube, resulting in a two-dimensional
slice. So, the slice operation performs a selection on one dimension of the given cube, thus
resulting in a subcube.

For example, if we make the selection temperature = cool, we obtain the following cube:

Temperature cool

Day 1 0

Day 2 0

Day 3 0

Day 4 0

Day 5 1

Day 6 1

Day 7 1

Day 8 1

Day 9 1

Day 11 0

Day 12 0

Day 13 0

Day 14 0

The following diagram illustrates how slice works.

Here the slice operates on the dimension "time" using the criterion time = "Q1". It forms a
new sub-cube by selecting one dimension.
Dice
The dice operation describes a subcube by performing a selection on two or more
dimensions.

For example, applying the selection (time = day 3 OR time = day 4) AND (temperature =
cool OR temperature = hot) to the original cube, we get the following subcube (still
two-dimensional):
Temperature cool hot

Day 3 0 1

Day 4 0 0
Consider the following diagram, which shows the dice operations.

The dice operation on the cube based on the following selection criteria involves three
dimensions:

o (location = "Toronto" or "Vancouver")
o (time = "Q1" or "Q2")
o (item = "Mobile" or "Modem")

Pivot
The pivot operation is also called rotation. Pivot is a visualization operation that rotates
the data axes in view to provide an alternative presentation of the data. It may involve
swapping the rows and columns or moving one of the row dimensions into the column
dimensions.

Consider the following diagram, which shows the pivot operation.
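The slice, dice, and pivot operations can be sketched in Python (pandas) on a small cube stored as a flat table; the dimension values echo the examples above, and the sales numbers are made up:

import pandas as pd

cube = pd.DataFrame({
    "location": ["Toronto", "Toronto", "Vancouver", "Vancouver"],
    "time":     ["Q1", "Q2", "Q1", "Q2"],
    "item":     ["Mobile", "Modem", "Mobile", "Modem"],
    "sales":    [605, 825, 14, 31],
})

# Slice: fix a single value on one dimension (time = "Q1").
slice_q1 = cube[cube["time"] == "Q1"]

# Dice: select on two or more dimensions at once.
diced = cube[cube["location"].isin(["Toronto", "Vancouver"])
             & cube["time"].isin(["Q1", "Q2"])
             & cube["item"].isin(["Mobile", "Modem"])]

# Pivot: rotate the axes, e.g. locations as rows and quarters as columns.
pivoted = cube.pivot_table(index="location", columns="time",
                           values="sales", aggfunc="sum")
print(slice_q1, diced, pivoted, sep="\n\n")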

Other OLAP Operations


The drill-across operation executes queries involving more than one fact table. The
drill-through operation makes use of relational SQL facilities to drill through the bottom
level of a data cube down to its back-end relational tables.

Other OLAP operations may include ranking the top-N or bottom-N elements in lists, as
well as computing moving averages, growth rates, interest, internal rates of return,
depreciation, currency conversions, and statistical functions.

OLAP offers analytical modeling capabilities, including a calculation engine for deriving
ratios, variances, etc., and for computing measures across multiple dimensions.
It can generate summarization, aggregation, and hierarchies at each granularity level and
at every dimension intersection. OLAP also provides functional models for forecasting,
trend analysis, and statistical analysis. In this context, the OLAP engine is a powerful data
analysis tool.

Data Processing in Data Mining


Data processing is collecting raw data and translating it into usable information. The raw
data is collected, filtered, sorted, processed, analyzed, stored, and then presented in a
readable format. It is usually performed in a step-by-step process by a team of data
scientists and data engineers in an organization.

The data processing is carried out automatically or manually. Nowadays, most data is
processed automatically with the help of the computer, which is faster and gives accurate
results. Thus, data can be converted into different forms. It can be graphic as well as audio
ones. It depends on the software used as well as data processing methods.

After that, the data collected is processed and then translated into a desirable form as per
requirements, useful for performing tasks. The data is acquired from Excel files,
databases, text file data, and unorganized data such as audio clips, images, GPRS, and
video clips.

Data processing is crucial for organizations to create better business strategies and
increase their competitive edge. By converting the data into a readable format like graphs,
charts, and documents, employees throughout the organization can understand and use
the data.

The most commonly used tools for data processing are Storm, Hadoop, HPCC, Statwing,
Qubole, and CouchDB. The processing of data is a key step of the data mining process.
Raw data processing is a more complicated task. Moreover, the results can be misleading.
Therefore, it is better to process data before analysis. The processing of data largely
depends on the following things, such as:

o The volume of data that needs to be processed.
o The complexity of data processing operations.
o The capacity and inbuilt technology of the respective computer systems.
o Technical skills and time constraints.

Stages of Data Processing


The data processing consists of the following six stages.
1. Data Collection

The collection of raw data is the first step of the data processing cycle. The raw data
collected has a huge impact on the output produced. Hence, raw data should be gathered
from defined and accurate sources so that the subsequent findings are valid and usable.
Raw data can include monetary figures, website cookies, profit/loss statements of a
company, user behavior, etc.

2. Data Preparation

Data preparation or data cleaning is the process of sorting and filtering the raw data to
remove unnecessary and inaccurate data. Raw data is checked for errors, duplication,
miscalculations, or missing data and transformed into a suitable form for further analysis
and processing. This ensures that only the highest quality data is fed into the processing
unit.

3. Data Input

In this step, the raw data is converted into machine-readable form and fed into the
processing unit. This can be in the form of data entry through a keyboard, scanner, or any
other input source.

4. Data Processing

In this step, the raw data is subjected to various data processing methods using machine
learning and artificial intelligence algorithms to generate the desired output. This step may
vary slightly from process to process depending on the source of data being processed
(data lakes, online databases, connected devices, etc.) and the intended use of the output.

5. Data Interpretation or Output

The data is finally transmitted and displayed to the user in a readable form like graphs,
tables, vector files, audio, video, documents, etc. This output can be stored and further
processed in the next data processing cycle.

6. Data Storage

The last step of the data processing cycle is storage, where data and metadata are stored
for further use. This allows quick access and retrieval of information whenever needed.
Proper data storage is also necessary for compliance with data protection legislation such
as the GDPR.
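The six stages can be compressed into a short, hypothetical Python (pandas) sketch; the columns, values, and output file name are invented for illustration:

import pandas as pd

# 1-3. Collection, preparation, and input: gather raw records, drop duplicates
#      and incomplete rows, and load them into a machine-readable structure.
raw = pd.DataFrame({
    "customer": ["a", "a", "b", None],
    "amount":   [10.0, 10.0, 7.5, 3.0],
}).drop_duplicates().dropna()

# 4. Processing: aggregate the cleaned records.
summary = raw.groupby("customer")["amount"].sum()

# 5. Interpretation/output: present the result in a readable form.
print(summary)

# 6. Storage: keep the result for later use.
summary.to_csv("spend_per_customer.csv")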
Why Should We Use Data Processing?
In the modern era, most work relies on data, so large amounts of data are collected for
different purposes, such as academic and scientific research, institutional use, personal
and private use, commercial purposes, and much more. Processing this collected data is
essential so that it goes through all the above steps and gets sorted, stored, filtered,
presented in the required format, and analyzed.

The amount of time consumed and the intricacy of processing will depend on the required
results. In situations where large amounts of data are acquired, the necessity of processing
to obtain authentic results with the help of data processing in data mining and data
processing in data research is inevitable.

Methods of Data Processing


There are three main data processing methods, such as:

1. Manual Data Processing

Data is processed manually in this method. The entire procedure of data collection,
filtering, sorting, calculation, and other logical operations is carried out by hand, without
using any electronic device or automation software. It is a low-cost methodology and does
not need many tools. However, it produces more errors and requires high labor costs and
lots of time.

2. Mechanical Data Processing

Data is processed mechanically through the use of devices and machines. These can
include simple devices such as calculators, typewriters, printing press, etc. Simple data
processing operations can be achieved with this method. It has much fewer errors than
manual data processing, but the increase in data has made this method more complex
and difficult.

3. Electronic Data Processing

Data is processed with modern technologies using data processing software and
programs. The software gives a set of instructions to process the data and yield output.
This method is the most expensive but provides the fastest processing speeds with the
highest reliability and accuracy of output.
Types of Data Processing
There are different types of data processing based on the source of data and the steps
taken by the processing unit to generate an output. There is no one size fits all method
that can be used for processing raw data.

1. Batch Processing: In this type of data processing, data is collected and processed in
batches. It is used for large amounts of data. For example, the payroll system.

2. Single User Programming Processing: It is usually done by a single person for his
personal use. This technique is suitable even for small offices.

3. Multiple Programming Processing: This technique allows simultaneously storing and


executing more than one program in the Central Processing Unit (CPU). Data is broken
down into frames and processed using two or more CPUs within a single computer system.
It is also known as parallel processing. Further, the multiple programming techniques
increase the respective computer's overall working efficiency. A good example of multiple
programming processing is weather forecasting.

4. Real-time Processing: This technique facilitates the user to have direct contact with the
computer system. This technique eases data processing. This technique is also known as
the direct mode or the interactive mode technique and is developed exclusively to perform
one task. It is a sort of online processing, which always remains under execution. For
example, withdrawing money from ATM.

5. Online Processing: This technique facilitates the entry and execution of data directly; so, it
does not store or accumulate first and then process. The technique is developed to reduce
the data entry errors, as it validates data at various points and ensures that only corrected
data is entered. This technique is widely used for online applications. For example, barcode
scanning.

6. Time-sharing Processing: This is another form of online data processing that facilitates
several users to share the resources of an online computer system. This technique is
adopted when results are needed swiftly. Moreover, as the name suggests, this system is
time-based. Following are some of the major advantages of time-sharing processing, such
as:
o Several users can be served simultaneously.
o All the users have an almost equal amount of processing time.
o It is possible to interact with the running programs.
7. Distributed Processing: This is a specialized data processing technique in which various
computers (located remotely) remain interconnected with a single host computer, making
a network of computers. All these computer systems are interconnected through a
high-speed communication network. However, the central computer system maintains the
master database and monitors it accordingly. This facilitates communication between
computers.

Examples of Data Processing


Data processing occurs in our daily lives whether we may be aware of it or not. Here are
some real-life examples of data processing, such as:

o Stock trading software that converts millions of stock data points into a simple graph.
o An e-commerce company that uses the search history of customers to recommend similar
products.
o A digital marketing company that uses demographic data of people to strategize
location-specific campaigns.
o A self-driving car that uses real-time data from sensors to detect pedestrians and other
cars on the road.
Importance of Data Processing in Data Mining
In today's world, data has a significant bearing on researchers, institutions, commercial
organizations, and each individual user. Data is often imperfect, noisy, and incompatible,
and then it requires additional processing. After gathering, the question arises of how to
store, sort, filter, analyze and present data. Here data mining comes into play.

The complexity of this process is subject to the scope of data collection and the complexity
of the required results. Whether this process is time-consuming depends on steps, which
need to be made with the collected data and the type of output file desired to be received.
This issue becomes actual when the need for processing a big amount of data arises.
Therefore, data mining is widely used nowadays.

When data is gathered, there is a need to store it. The data can be stored in physical form
using paper-based documents, laptops and desktop computers, or other data storage
devices. With the rise and rapid development of such things as data mining and big data,
the process of data collection has become more complicated and time-consuming. It is
necessary to carry out many operations to conduct thorough data analysis.

At present, data is stored in a digital form for the most part. It allows processing data faster
and converting it into different formats. The user has the possibility to choose the most
suitable output.

Data Cleaning in Data Mining


Data cleaning is a crucial process in data mining. It plays an important part in the building
of a model. Data cleaning is a necessary step, but it is often neglected. Data quality is the
main issue in quality information management. Data quality problems can occur anywhere
in an information system. These problems are addressed by data cleaning.

Data cleaning is fixing or removing incorrect, corrupted, incorrectly formatted, duplicate,


or incomplete data within a dataset. If data is incorrect, outcomes and algorithms are
unreliable, even though they may look correct. When combining multiple data sources,
there are many opportunities for data to be duplicated or mislabeled.

Generally, data cleaning reduces errors and improves data quality. Correcting errors in
data and eliminating bad records can be a time-consuming and tedious process, but it
cannot be ignored. Data mining is a key technique for data cleaning. Data mining is a
technique for discovering interesting information in data. Data quality mining is a recent
approach applying data mining techniques to identify and recover data quality problems
in large databases. Data mining automatically extracts hidden and intrinsic information
from the collections of data. Data mining has various techniques that are suitable for data
cleaning.

Understanding and correcting the quality of your data is imperative in getting to an


accurate final analysis. The data needs to be prepared to discover crucial patterns. Data
mining is considered exploratory. Data cleaning in data mining allows the user to discover
inaccurate or incomplete data before the business analysis and insights.

In most cases, data cleaning in data mining can be a laborious process and typically
requires IT resources to help in the initial step of evaluating your data because data
cleaning before data mining is so time-consuming. But without proper data quality, your
final analysis will suffer inaccuracy, or you could potentially arrive at the wrong conclusion.
Steps of Data Cleaning
While the techniques used for data cleaning may vary according to the types of data your
company stores, you can follow these basic steps to cleaning your data, such as:

1. Remove duplicate or irrelevant observations

Remove unwanted observations from your dataset, including duplicate observations or


irrelevant observations. Duplicate observations will happen most often during data
collection. When you combine data sets from multiple places, scrape data, or receive data
from clients or multiple departments, there are opportunities to create duplicate data.
Deduplication is one of the largest areas to be considered in this process. Irrelevant
observations are when you notice observations that do not fit into the specific problem
you are trying to analyze.

For example, if you want to analyze data regarding millennial customers, but your dataset
includes older generations, you might remove those irrelevant observations. This can
make analysis more efficient, minimize distraction from your primary target, and create a
more manageable and performable dataset.

2. Fix structural errors

Structural errors are when you measure or transfer data and notice strange naming
conventions, typos, or incorrect capitalization. These inconsistencies can cause mislabeled
categories or classes. For example, you may find "N/A" and "Not Applicable" in any sheet,
but they should be analyzed in the same category.
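Steps 1 and 2 can be sketched in Python (pandas); the column names and the "millennial" filter are illustrative assumptions:

import pandas as pd

df = pd.DataFrame({
    "segment": ["millennial", "millennial", "boomer", "millennial"],
    "status":  ["N/A", "N/A", "Not Applicable", "active"],
})

df = df.drop_duplicates()                          # remove duplicate observations
df = df[df["segment"] == "millennial"]             # drop observations irrelevant to the analysis
df["status"] = df["status"].replace({"Not Applicable": "N/A"})   # unify mislabeled categories
print(df)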

3. Filter unwanted outliers

Often, there will be one-off observations where, at a glance, they do not appear to fit
within the data you are analyzing. If you have a legitimate reason to remove an outlier,
like improper data entry, doing so will help the performance of the data you are working
with.

However, sometimes, the appearance of an outlier will prove a theory you are working on.
And just because an outlier exists doesn't mean it is incorrect. This step is needed to
determine the validity of that number. If an outlier proves to be irrelevant for analysis or
is a mistake, consider removing it.
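One common way to flag such outliers (an assumed approach, not one prescribed by the text) is the interquartile-range rule, sketched here in Python (pandas):

import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 300])      # 300 looks like an improper entry
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers)   # review these before deciding whether to remove them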
4. Handle missing data

You can't ignore missing data because many algorithms will not accept missing values.
There are a couple of ways to deal with missing data. Neither is optimal, but both can be
considered, such as:

o You can drop observations with missing values, but this will drop or lose information, so
be careful before removing them.
o You can impute missing values based on other observations; again, there is an opportunity
to lose the integrity of the data because you may be operating from assumptions and not
actual observations.
o You might alter how the data is used to navigate null values effectively.
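Both options from the list above can be sketched in Python (pandas); the columns and the mean-imputation choice are illustrative assumptions:

import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 31], "income": [50, 62, None, 48]})

dropped = df.dropna()                               # option 1: drop observations with missing values
imputed = df.fillna(df.mean(numeric_only=True))     # option 2: impute from other observations
print(dropped, imputed, sep="\n\n")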

5. Validate and QA

At the end of the data cleaning process, you should be able to answer these questions as
a part of basic validation, such as:

o Does the data make sense?
o Does the data follow the appropriate rules for its field?
o Does it prove or disprove your working theory, or bring any insight to light?
o Can you find trends in the data to help you form your next theory?
o If not, is that because of a data quality issue?
Because of incorrect or noisy data, false conclusions can inform poor business strategy
and decision-making. False conclusions can lead to an embarrassing moment in a
reporting meeting when you realize your data doesn't stand up to study. Before you get
there, it is important to create a culture of quality data in your organization. To do this,
you should document the tools you might use to create this strategy.

Methods of Data Cleaning


There are many data cleaning methods through which the data should be run. The
methods are described below:
1. Ignore the tuples: This method is not very feasible, as it is only useful when a tuple has
several attributes with missing values.

2. Fill in the missing value: This approach is also not very effective or feasible. Moreover, it
can be time-consuming. In this approach, one has to fill in the missing values. This is
usually done manually, but it can also be done using the attribute mean or the most
probable value.

3. Binning method: This approach is very simple to understand. The smoothing of sorted
data is done using the values around it. The data is divided into several segments of equal
size, and then different methods are applied to each segment to complete the task.

4. Regression: The data is smoothed with the help of a regression function. The regression
can be linear or multiple. Linear regression has only one independent variable, and
multiple regression has more than one independent variable.

5. Clustering: This method mainly operates on groups. Clustering groups the data into
clusters, and similar values are arranged into a "group" or "cluster". Outliers are then
detected as values that fall outside the clusters.

Process of Data Cleaning


The following steps show the process of data cleaning in data mining.

1. Monitor the errors: Keep a record of where the most mistakes arise. This will make it
easier to identify and fix false or corrupt information, which is especially important when
integrating another data source with established management software.

2. Standardize the mining process: Standardize the point of insertion to help reduce the
chances of duplication.

3. Validate data accuracy: Analyze and invest in data tools to clean records in real time.
Such tools use artificial intelligence to better check data for correctness.

4. Scrub for duplicate data: Identify duplicates to save time when analyzing data.
Repeatedly handling the same data can be avoided by analyzing and investing in separate
data-cleaning tools that can analyze raw data in bulk and automate the operation.
5. Research the data: Before this activity, our data must be standardized, validated, and
scrubbed for duplicates. There are many third-party sources, and these approved and
authorized sources can capture information directly from our databases. They help us to
clean and compile the data to ensure completeness, accuracy, and reliability for business
decision-making.

6. Communicate with the team: Keeping the team in the loop will assist in developing and
strengthening the client base and sending more targeted data to prospective customers.

Usage of Data Cleaning in Data Mining


Here are the following usages of data cleaning in data mining, such as:

o Data Integration: Since it is difficult to ensure quality in low-quality data, data integration
has an important role in solving this problem. Data Integration is the process of combining
data from different data sets into a single one. This process uses data cleansing tools to
ensure that the embedded data set is standardized and formatted before moving to the
final destination.
o Data Migration: Data migration is the process of moving data from one system to
another, one format to another, or one application to another. While the data is on the
move, it is important to maintain its quality, security, and consistency, to ensure that the
resultant data has the correct format and structure without any deficiencies at the destination.
o Data Transformation: Before the data is uploaded to a destination, it needs to be
transformed. This is only possible through data cleaning, which considers the system criteria
of formatting, structuring, etc. Data transformation processes usually include using rules
and filters before further analysis. Data transformation is an integral part of most data
integration and data management processes. Data cleansing tools help to clean the data
using the built-in transformations of the systems.
o Data Debugging in ETL Processes: Data cleansing is crucial to preparing data during
extract, transform, and load (ETL) for reporting and analysis. Data cleansing ensures that
only high-quality data is used for decision-making and analysis.
For example, a retail company receives data from various sources, such as CRM or ERP
systems, containing misinformation or duplicate data. A good data debugging tool would
detect inconsistencies in the data and rectify them. The purged data will be converted to
a standard format and uploaded to a target database.
Characteristics of Data Cleaning
Data cleaning is mandatory to guarantee the accuracy, integrity, and security of business
data. Depending on their characteristics, data may vary in quality. Here are the main
points of data cleaning in data mining:

o Accuracy: All the data that make up a database within the business must be highly accurate.
One way to corroborate their accuracy is by comparing them with different sources. If the
source is not found or has errors, the stored information will have the same problems.
o Coherence: The data must be consistent with each other, so you can be sure that the
information about an individual or body is the same across the different forms of storage used.
o Validity: The stored data must comply with certain regulations or established restrictions.
Likewise, the information has to be verified to corroborate its authenticity.
o Uniformity: The data that make up a database must have the same units or values. This is an
essential aspect when carrying out the data cleansing process, since it does not increase the
complexity of the procedure.
o Data Verification: The process must be verified at all times, both for the appropriateness and
the effectiveness of the procedure. This verification is carried out through repeated iterations of
the study, design, and validation stages. Drawbacks often become evident only after the data
has gone through a certain number of changes.
o Clean Data Backflow: After eliminating quality problems, the already clean data should
replace the dirty data in the original source, so that legacy applications also obtain the benefits
and later data cleaning is not needed again.

Tools for Data Cleaning in Data Mining


Data Cleansing Tools can be very helpful if you are not confident of cleaning the data
yourself or have no time to clean up all your data sets. You might need to invest in those
tools, but it is worth the expenditure. There are many data cleaning tools in the market.
Here are some top-ranked data cleaning tools, such as:

1. OpenRefine

2. Trifacta Wrangler

3. Drake
4. Data Ladder

5. Data Cleaner

6. Cloudingo

7. Reifier

8. IBM Infosphere Quality Stage

9. TIBCO Clarity

10. Winpure

Benefits of Data Cleaning


Having clean data will ultimately increase overall productivity and allow for the highest
quality information in your decision-making. Here are some major benefits of data
cleaning in data mining, such as:

o Removal of errors when multiple sources of data are at play.
o Fewer errors make for happier clients and less-frustrated employees.
o Ability to map the different functions and what your data is intended to do.
o Monitoring errors and better reporting to see where errors are coming from, making it
easier to fix incorrect or corrupt data for future applications.
o Using tools for data cleaning makes for more efficient business practices and quicker
decision-making.

Data Transformation in Data Mining


Raw data is difficult to trace or understand. That's why it needs to be preprocessed before
retrieving any information from it. Data transformation is a technique used to convert the
raw data into a suitable format that efficiently eases data mining and retrieves strategic
information. Data transformation includes data cleaning techniques and a data reduction
technique to convert the data into the appropriate form.

Data transformation is an essential data preprocessing technique that must be performed


on the data before data mining to provide patterns that are easier to understand.

Data transformation changes the format, structure, or values of the data and converts
them into clean, usable data. Data may be transformed at two stages of the data pipeline
for data analytics projects. Organizations that use on-premises data warehouses generally
use an ETL (extract, transform, and load) process, in which data transformation is the
middle step. Today, most organizations use cloud-based data warehouses to scale
compute and storage resources with latency measured in seconds or minutes. The
scalability of the cloud platform lets organizations skip preload transformations and load
raw data into the data warehouse, then transform it at query time.

Data integration, data migration, data warehousing, and data wrangling may all involve data
transformation. Data transformation increases the efficiency of business and analytic
processes, and it enables businesses to make better data-driven decisions. During the data
transformation process, an analyst will determine the structure of the data. This could
mean that data transformation may be:

o Constructive: The data transformation process adds, copies, or replicates data.
o Destructive: The system deletes fields or records.
o Aesthetic: The transformation standardizes the data to meet requirements or parameters.
o Structural: The database is reorganized by renaming, moving, or combining columns.

Data Transformation Techniques


There are several data transformation techniques that can help structure and clean up the
data before analysis or storage in a data warehouse. Let's study all techniques used for
data transformation, some of which we have already studied in data reduction and data
cleaning.

1. Data Smoothing

Data smoothing is a process that is used to remove noise from the dataset using some
algorithms. It allows for highlighting important features present in the dataset. It helps in
predicting the patterns. When collecting data, it can be manipulated to eliminate or reduce
any variance or any other noise form.

The concept behind data smoothing is that it will be able to identify simple changes to
help predict different trends and patterns. This serves as a help to analysts or traders who
need to look at a lot of data which can often be difficult to digest for finding patterns that
they wouldn't see otherwise.

We have seen how the noise is removed from the data using the techniques such as
binning, regression, clustering.

o Binning: This method splits the sorted data into a number of bins and smooths the data
values in each bin using the neighborhood values around them (a short sketch follows this
list).
o Regression: This method identifies the relation between two dependent attributes so that if
we have one attribute, it can be used to predict the other attribute.
o Clustering: This method groups similar data values to form a cluster. The values that lie
outside a cluster are known as outliers.
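A minimal sketch of smoothing by bin means in Python (numpy); the values and the choice of four equal-size bins are illustrative:

import numpy as np

data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = np.array_split(data, 4)                      # four equal-size bins of sorted values
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)                                     # each value replaced by its bin's mean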

2. Attribute Construction

In the attribute construction method, new attributes are derived from the existing attributes
to construct a new data set that eases data mining. New attributes are created and applied
to assist the mining process from the given attributes. This simplifies the original data and
makes the mining more efficient.

For example, suppose we have a data set referring to measurements of different plots, i.e.,
we may have the height and width of each plot. So here, we can construct a new attribute
'area' from the attributes 'height' and 'width'. This also helps in understanding the relations
among the attributes in a data set.
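The plot example can be written as a one-line attribute construction in Python (pandas), with illustrative column names:

import pandas as pd

plots = pd.DataFrame({"height": [10, 20, 15], "width": [4, 5, 6]})
plots["area"] = plots["height"] * plots["width"]    # newly constructed attribute
print(plots)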

3. Data Aggregation

Data aggregation is the method of storing and presenting data in a summary format. The
data may be obtained from multiple data sources and integrated into a data analysis
description. This is a crucial step, since the accuracy of data analysis insights is highly
dependent on the quantity and quality of the data used.

Gathering accurate data of high quality and a large enough quantity is necessary to
produce relevant results. The collection of data is useful for everything from decisions
concerning financing or business strategy of the product, pricing, operations, and
marketing strategies.

For example, we have a data set of sales reports of an enterprise that has quarterly sales
of each year. We can aggregate the data to get the enterprise's annual sales report.
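The quarterly-to-annual example can be sketched in Python (pandas) with made-up figures:

import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [100, 120, 90, 150, 110, 130, 95, 160],
})
annual = quarterly.groupby("year", as_index=False)["sales"].sum()   # annual sales report
print(annual)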
4. Data Normalization

Normalizing the data refers to scaling the data values to a much smaller range such as
[-1, 1] or [0.0, 1.0]. There are different methods to normalize the data, as discussed below.

Consider a numeric attribute A with n observed values V1, V2, V3, ..., Vn.

o Min-max normalization: This method applies a linear transformation to the original data.
Let minA and maxA be the minimum and maximum values observed for attribute A, and let
Vi be the value of attribute A to be normalized. Min-max normalization maps Vi to V'i in a
new, smaller range [new_minA, new_maxA] using the formula:

   V'i = ((Vi - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA

For example, suppose $12,000 and $98,000 are the minimum and maximum values for the
attribute income, and [0.0, 1.0] is the range to which we have to map the value $73,600.
The value $73,600 is transformed using min-max normalization as follows:

   V'i = ((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0.0) + 0.0 = 0.716

o Z-score normalization: This method normalizes the value of attribute A using the mean
and standard deviation. The formula for Z-score normalization is:

   V'i = (Vi - Ā) / σA

Here Ā and σA are the mean and standard deviation of attribute A, respectively. For
example, if the mean and standard deviation of attribute A are $54,000 and $16,000, the
value $73,600 is normalized to (73,600 - 54,000) / 16,000 = 1.225.

o Decimal scaling: This method normalizes the value of attribute A by moving the decimal
point. The number of decimal places moved depends on the maximum absolute value of A.
The formula for decimal scaling is:

   V'i = Vi / 10^j

Here j is the smallest integer such that max(|V'i|) < 1. For example, the observed values
for attribute A range from -986 to 917, so the maximum absolute value for attribute A is
986. To normalize each value of attribute A using decimal scaling, we divide each value
by 1000, i.e., j = 3. So the value -986 is normalized to -0.986, and 917 is normalized to
0.917. Normalization parameters such as the mean, the standard deviation, and the
maximum absolute value must be preserved so that future data can be normalized
uniformly.
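The three normalization methods can be sketched in Python (numpy). The income figures reuse the worked examples above; note that the z-score line uses the sample's own mean and standard deviation rather than the fixed $54,000 and $16,000 of the text's example:

import numpy as np

v = np.array([12_000.0, 54_000.0, 73_600.0, 98_000.0])

minmax = (v - v.min()) / (v.max() - v.min())        # min-max normalization to [0.0, 1.0]
zscore = (v - v.mean()) / v.std()                   # z-score normalization
j = int(np.floor(np.log10(np.abs(v).max()))) + 1    # smallest j with max(|v|) / 10**j < 1
decimal_scaled = v / (10 ** j)                      # decimal scaling
print(minmax, zscore, decimal_scaled, sep="\n")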

5. Data Discretization

This is the process of converting continuous data into a set of data intervals. Continuous
attribute values are substituted by small interval labels. This makes the data easier to study
and analyze. If a data mining task handles a continuous attribute, its values can be replaced
by discrete interval labels, which improves the efficiency of the task.
This method is also called a data reduction mechanism, as it transforms a large data set
into a set of categorical data. Discretization also uses decision tree-based algorithms to
produce short, compact, and accurate results when using discrete values.

Data discretization can be classified in two ways: by whether class information is used
(supervised discretization) or not (unsupervised discretization), and by which direction the
process proceeds, i.e., a 'top-down splitting strategy' or a 'bottom-up merging strategy'.

For example, the values for the age attribute can be replaced by the interval labels such
as (0-10, 11-20…) or (kid, youth, adult, senior).
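The age example can be sketched in Python (pandas); the bin edges are illustrative assumptions:

import pandas as pd

ages = pd.Series([4, 15, 27, 45, 70])
labels = pd.cut(ages, bins=[0, 10, 20, 60, 120],
                labels=["kid", "youth", "adult", "senior"])
print(labels)   # each continuous age replaced by an interval label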

6. Data Generalization

It converts low-level data attributes to high-level data attributes using concept hierarchy.
This conversion from a lower level to a higher conceptual level is useful to get a clearer
picture of the data. Data generalization can be divided into two approaches:

• Data cube process (OLAP) approach.

• Attribute-oriented induction (AOI) approach.

For example, age data can be in the form of (20, 30) in a dataset. It is transformed into a
higher conceptual level into a categorical value (young, old).
Data Transformation Process
The entire process for transforming data is known as ETL (Extract, Transform, and Load).
Through the ETL process, analysts can convert data to its desired format. Here are the
steps involved in the data transformation process:

1. Data Discovery: During the first stage, analysts work to understand and identify data in its
source format. To do this, they will use data profiling tools. This step helps analysts decide
what they need to do to get data into its desired format.

2. Data Mapping: During this phase, analysts perform data mapping to determine how
individual fields are modified, mapped, filtered, joined, and aggregated. Data mapping is
essential to many data processes, and one misstep can lead to incorrect analysis and ripple
through your entire organization.

3. Data Extraction: During this phase, analysts extract the data from its original source. These
may include structured sources such as databases or streaming sources such as customer
log files from web applications.

4. Code Generation and Execution: Once the data has been extracted, analysts need to
create a code to complete the transformation. Often, analysts generate codes with the help
of data transformation platforms or tools.

5. Review: After transforming the data, analysts need to check it to ensure everything has
been formatted correctly.

6. Sending: The final step involves sending the data to its target destination. The target might
be a data warehouse or a database that handles both structured and unstructured data.

Advantages of Data Transformation


Transforming data can help businesses in a variety of ways. Here are some of the essential
advantages of data transformation, such as:
o Better Organization: Transformed data is easier for both humans and computers to use.
o Improved Data Quality: There are many risks and costs associated with bad data. Data
transformation can help your organization eliminate quality issues such as missing values
and other inconsistencies.
o Faster Queries: You can quickly and easily retrieve transformed data because it is stored
and standardized in a source location.
o Better Data Management: Businesses are constantly generating data from more and more
sources. If there are inconsistencies in the metadata, it can be challenging to organize and
understand it. Data transformation refines your metadata, so it's easier to organize and
understand.
o More Use Out of Data: While businesses may be collecting data constantly, a lot of that
data sits around unanalyzed. Transformation makes it easier to get the most out of your
data by standardizing it and making it more usable.

Disadvantages of Data Transformation


While data transformation comes with a lot of benefits, still there are some challenges to
transforming data effectively, such as:

o Data transformation can be expensive. The cost is dependent on the specific infrastructure,
software, and tools used to process data. Expenses may include licensing, computing
resources, and hiring necessary personnel.
o Data transformation processes can be resource-intensive. Performing transformations in
an on-premises data warehouse after loading, or transforming data before feeding it into
applications, can create a computational burden that slows down other operations. If you
use a cloud-based data warehouse, you can do the transformations after loading because
the platform can scale up to meet demand.
o Lack of expertise and carelessness can introduce problems during transformation. Data
analysts without appropriate subject matter expertise are less likely to notice incorrect data
because they are less familiar with the range of accurate and permissible values.
o Enterprises can perform transformations that don't suit their needs. A business might
change information to a specific format for one application only to then revert the
information to its prior format for a different application.

Ways of Data Transformation


There are several different ways to transform data, such as:
o Scripting: Data transformation through scripting involves using Python or SQL to write
code to extract and transform data. Python and SQL are scripting languages that allow you
to automate certain tasks in a program. They also allow you to extract information from
data sets. Scripting languages require less code than traditional programming languages;
therefore, they are less labor-intensive.
o On-Premises ETL Tools: ETL tools take the work required to script the data transformation
and automate the process. On-premises ETL tools are hosted on company servers. While
these tools can help save you time, using them often requires extensive expertise and
significant infrastructure costs.
o Cloud-Based ETL Tools: As the name suggests, cloud-based ETL tools are hosted in the
cloud. These tools are often the easiest for non-technical users to utilize. They allow you to
collect data from any cloud source and load it into your data warehouse. With cloud-based
ETL tools, you can decide how often you want to pull data from your source, and you can
monitor your usage.

Data Reduction in Data Mining


Data mining is applied to selected data in a large database. When data analysis and mining
are done on a huge amount of data, it takes a very long time to process, making it
impractical and infeasible.

Data reduction techniques ensure the integrity of data while reducing the data. Data
reduction is a process that reduces the volume of original data and represents it in a much
smaller volume. Data reduction techniques are used to obtain a reduced representation
of the dataset that is much smaller in volume by maintaining the integrity of the original
data. By reducing the data, the efficiency of the data mining process is improved, which
produces the same analytical results.

Data reduction does not affect the result obtained from data mining. That means the result
obtained from data mining before and after data reduction is the same or almost the same.

Data reduction aims to represent the data more compactly. When the data size is smaller,
it is simpler to apply sophisticated and computationally expensive algorithms. The reduction
of the data may be in terms of the number of rows (records) or the number of columns
(dimensions).

Techniques of Data Reduction


Here are the following techniques or methods of data reduction in data mining, such as:

1. Dimensionality Reduction

Whenever we encounter weakly important data, we keep only the attributes required for
our analysis. Dimensionality reduction eliminates attributes from the data set under
consideration, thereby reducing the volume of the original data. It reduces data size as it
eliminates outdated or redundant features. Here are three methods of dimensionality
reduction.

i. Wavelet Transform: In the wavelet transform, a data vector A is transformed into a numerically different data vector A' such that both the A and A' vectors are of the same length. It is useful for data reduction because the data obtained from the wavelet transform can be truncated: the compressed data is obtained by retaining only the smallest fragment of the strongest wavelet coefficients. Wavelet transform can be applied to data cubes, sparse data, or skewed data.

ii. Principal Component Analysis: Suppose we have a data set to be analyzed that has tuples with n attributes. Principal component analysis identifies k orthogonal vectors (principal components), each of dimension n, that can best be used to represent the data set, where k ≤ n. A minimal code sketch of this method follows this list.
In this way, the original data can be cast on a much smaller space, and dimensionality reduction can be achieved. Principal component analysis can be applied to sparse and skewed data.

iii. Attribute Subset Selection: A large data set has many attributes, some of which are irrelevant to data mining and some of which are redundant. Attribute subset selection reduces the data volume and dimensionality by eliminating such redundant and irrelevant attributes.
Attribute subset selection ensures that we get a good subset of the original attributes even after eliminating the unwanted ones, so that the resulting probability distribution of the data is as close as possible to the original distribution obtained using all the attributes.
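As a hedged illustration of principal component analysis (method ii above), the sketch below uses scikit-learn to project a small synthetic data set with n = 4 attributes onto k = 2 principal components; the numbers and the choice of k are assumptions made only for this example.

import numpy as np
from sklearn.decomposition import PCA

# A tiny synthetic data set: 6 tuples, each with n = 4 attributes
X = np.array([
    [2.5, 2.4, 0.5, 1.0],
    [0.5, 0.7, 1.1, 0.9],
    [2.2, 2.9, 0.4, 1.2],
    [1.9, 2.2, 0.6, 1.1],
    [3.1, 3.0, 0.3, 1.4],
    [2.3, 2.7, 0.5, 1.0],
])

# Keep k = 2 components: the data is cast onto a much smaller space
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (6, 2) instead of (6, 4)
print(pca.explained_variance_ratio_)  # how much variance each component keeps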
2. Numerosity Reduction

The numerosity reduction reduces the original data volume and represents it in a much
smaller form. This technique includes two types parametric and non-parametric
numerosity reduction.

i. Parametric: Parametric numerosity reduction incorporates storing only data parameters


instead of the original data. One method of parametric numerosity reduction is the
regression and log-linear method.
o Regression and Log-Linear: Linear regression models a relationship between the
two attributes by modeling a linear equation to the data set. Suppose we need to
model a linear function between two attributes. y = wx +b
Here, y is the response attribute, and x is the predictor attribute. If we discuss in
terms of data mining, attribute x and attribute y are the numeric database
attributes, whereas w and b are regression coefficients.
Multiple linear regressions let the response variable y model linear function
between two or more predictor variables.
Log-linear model discovers the relation between two or more discrete attributes in
the database. Suppose we have a set of tuples presented in n-dimensional space.
Then the log-linear model is used to study the probability of each tuple in a
multidimensional space.
Regression and log-linear methods can be used for sparse data and skewed data.
ii. Non-Parametric: A non-parametric numerosity reduction technique does not assume any model. The non-parametric technique results in a more uniform reduction, irrespective of data size, but it may not achieve as high a volume of data reduction as the parametric technique. Common non-parametric data reduction techniques are the histogram, clustering, sampling, data cube aggregation, and data compression.
o Histogram: A histogram is a graph that represents a frequency distribution, which describes how often a value appears in the data. A histogram uses the binning method to represent an attribute's data distribution. It uses disjoint subsets, which we call bins or buckets.
A histogram can represent dense, sparse, uniform, or skewed data. Instead of only one attribute, the histogram can be implemented for multiple attributes. It can effectively represent up to five attributes.
o Clustering: Clustering techniques group similar objects from the data so that the objects in a cluster are similar to each other but dissimilar to the objects in other clusters.
How similar the objects inside a cluster are can be calculated using a distance function: the more similar the objects in a cluster are, the closer they appear in the cluster.
The quality of a cluster depends on the diameter of the cluster, i.e., the maximum distance between any two objects in the cluster.
The cluster representations replace the original data. This technique is more effective if the data can be divided into distinct clusters.
o Sampling: One of the methods used for data reduction is sampling, as it can reduce
the large data set into a much smaller data sample. Below we will discuss the
different methods in which we can sample a large data set D containing N tuples:

a. Simple random sample without replacement (SRSWOR) of size s: In this method, s tuples are drawn from the N tuples of data set D (s < N). The probability of drawing any tuple from the data set D is 1/N, which means all tuples have an equal probability of being sampled.

b. Simple random sample with replacement (SRSWR) of size s: It is similar


to the SRSWOR, but the tuple is drawn from data set D, is recorded, and
then replaced into the data set D so that it can be drawn again.

c. Cluster sample: The tuples in data set D are clustered into M mutually
disjoint subsets. The data reduction can be applied by implementing
SRSWOR on these clusters. A simple random sample of size s could be
generated from these clusters where s<M.

d. Stratified sample: The large data set D is partitioned into mutually disjoint
sets called 'strata'. A simple random sample is taken from each stratum to
get stratified data. This method is effective for skewed data.
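The sampling schemes above can be sketched with NumPy; the data set size, sample size, and strata below are assumptions chosen only to keep the example small.

import numpy as np

rng = np.random.default_rng(42)
D = np.arange(1, 101)          # a data set D of N = 100 tuple identifiers
s = 10                         # desired sample size

# SRSWOR: each tuple can be drawn at most once
srswor = rng.choice(D, size=s, replace=False)

# SRSWR: a drawn tuple is recorded and replaced, so it can be drawn again
srswr = rng.choice(D, size=s, replace=True)

# Stratified sample: partition D into strata and sample from each stratum
strata = {"young": D[:40], "mature": D[40:80], "old": D[80:]}
stratified = np.concatenate(
    [rng.choice(group, size=3, replace=False) for group in strata.values()]
)

print(srswor, srswr, stratified, sep="\n")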
3. Data Cube Aggregation

This technique is used to aggregate data in a simpler form. Data Cube Aggregation is a
multidimensional aggregation that uses aggregation at various levels of a data cube to
represent the original data set, thus achieving data reduction.
For example, suppose you have the data of All Electronics sales per quarter for the year
2018 to the year 2022. If you want to get the annual sale per year, you just have to
aggregate the sales per quarter for each year. In this way, aggregation provides you with
the required data, which is much smaller in size, and thereby we achieve data reduction
even without losing any data.

The data cube aggregation is a multidimensional aggregation that eases multidimensional analysis. The data cube provides precomputed and summarized data, which gives the data mining process fast access to the aggregates it needs.
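A minimal sketch of the quarterly-to-annual aggregation described above, using pandas; the All Electronics sales figures are invented purely for illustration.

import pandas as pd

# Quarterly sales for the All Electronics example (figures are made up)
quarterly = pd.DataFrame({
    "year":    [2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [120, 150, 130, 170, 140, 160, 155, 180],
})

# Aggregate to a coarser level of the cube: one row per year
annual = quarterly.groupby("year", as_index=False)["sales"].sum()
print(annual)   # the reduced representation: 2 rows instead of 8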

4. Data Compression

Data compression employs modification, encoding, or conversion of the structure of data in a way that consumes less space. Data compression involves building a compact representation of information by removing redundancy and representing data in binary form. Compression from which the original data can be restored exactly is called lossless compression. In contrast, compression from which the original form cannot be fully restored is called lossy compression. Dimensionality and numerosity reduction methods are also used for data compression.

This technique reduces the size of the files using different encoding mechanisms, such as
Huffman Encoding and run-length Encoding. We can divide it into two types based on
their compression techniques.

i. Lossless Compression: Encoding techniques (Run Length Encoding) allow a simple and
minimal data size reduction. Lossless data compression uses algorithms to restore the
precise original data from the compressed data.

ii. Lossy Compression: In lossy data compression, the decompressed data may differ from the original data but is still useful enough to retrieve information from. For example, the JPEG image format is a lossy compression, but we can find the meaning equivalent to the original image. Methods such as the discrete wavelet transform and PCA (principal component analysis) are examples of this type of compression.
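Run-length encoding, mentioned above as a lossless technique, can be sketched in a few lines of plain Python; the input string is just an example.

def run_length_encode(data):
    """Encode a sequence as (symbol, run length) pairs - lossless."""
    encoded = []
    for symbol in data:
        if encoded and encoded[-1][0] == symbol:
            encoded[-1][1] += 1          # extend the current run
        else:
            encoded.append([symbol, 1])  # start a new run
    return encoded

def run_length_decode(encoded):
    """Restore the exact original sequence from the (symbol, count) pairs."""
    return "".join(symbol * count for symbol, count in encoded)

original = "AAAABBBCCDAA"
packed = run_length_encode(original)
print(packed)                                 # [['A', 4], ['B', 3], ...]
print(run_length_decode(packed) == original)  # True: no information lost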
5. Discretization Operation

The data discretization technique is used to divide attributes of a continuous nature into data with intervals. We replace many values of a continuous attribute with labels of small intervals. This means that mining results are shown in a concise and easily understandable way.

i. Top-down discretization: If you first consider one or a couple of points (so-called breakpoints or split points) to divide the whole range of attribute values and repeat this method on the resulting intervals up to the end, then the process is known as top-down discretization, also known as splitting.

ii. Bottom-up discretization: If you first consider all the constant values as split points and then discard some of them by merging neighborhood values into intervals, that process is called bottom-up discretization, also known as merging.

Benefits of Data Reduction


The main benefit of data reduction is simple: the more data you can fit into a terabyte of
disk space, the less capacity you will need to purchase. Here are some benefits of data
reduction, such as:

o Data reduction can save energy.
o Data reduction can reduce your physical storage costs.
o Data reduction can decrease your data center footprint.

Data reduction greatly increases the efficiency of a storage system and directly impacts
your total spending on capacity.

Discretization in data mining


Data discretization refers to a method of converting a huge number of data values into
smaller ones so that the evaluation and management of data become easy. In other words,
data discretization is a method of converting attributes values of continuous data into a
finite set of intervals with minimum data loss. There are two forms of data discretization: the first is supervised discretization, and the second is unsupervised discretization. Supervised discretization refers to a method in which the class data is used. Unsupervised discretization refers to a method that depends only upon the way the operation proceeds, i.e., whether it uses a top-down splitting strategy or a bottom-up merging strategy.
Now, we can understand this concept with the help of an example.

Suppose we have an attribute of Age with the given values

Table before Discretization

Attribute: Age
Values: 1, 5, 4, 9, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 77, 78

Table after Discretization

Age 1, 5, 4, 9, 7            Child
Age 11, 14, 17, 13, 18, 19   Young
Age 31, 33, 36, 42, 44, 46   Mature
Age 70, 74, 77, 78           Old

Another example is analytics, where we gather the static data of website visitors. For example, all visitors who visit the site with an IP address from India are shown under the country level India.
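The Age example above can be reproduced with pandas; the bin edges below are assumptions chosen only to match the four labels in the table.

import pandas as pd

ages = [1, 5, 4, 9, 7, 11, 14, 17, 13, 18, 19,
        31, 33, 36, 42, 44, 46, 70, 74, 77, 78]

# Discretize the continuous Age attribute into four labelled intervals
labels = pd.cut(ages,
                bins=[0, 10, 30, 60, 100],
                labels=["Child", "Young", "Mature", "Old"])

print(pd.Series(labels).value_counts(sort=False))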

Some Famous techniques of data discretization


Histogram analysis

A histogram refers to a plot used to represent the underlying frequency distribution of a continuous data set. The histogram assists data inspection by showing the data distribution, for example, outliers, skewness, or whether the data follows a normal distribution.

Binning

Binning refers to a data smoothing technique that helps to group a huge number of continuous values into a smaller number of bins. This technique can also be used for data discretization and the development of concept hierarchies.

Cluster Analysis

Cluster analysis is a form of data discretization. A clustering algorithm is executed by dividing the values of a numeric attribute x into clusters, and each cluster is then treated as one interval of x.

Data discretization using decision tree analysis

In decision tree analysis, a top-down splitting technique is used for discretization. It is done through a supervised procedure. To discretize a numeric attribute, you first select the split point that gives the least entropy and then apply it recursively: the recursive process divides the values into various discretized disjoint intervals, from top to bottom, using the same splitting criterion.
Data discretization using correlation analysis

When discretizing data by correlation analysis, the best neighboring intervals are identified, and then large intervals are combined step by step to form the final set of intervals. It is a supervised procedure.

Data discretization and concept hierarchy generation


The term hierarchy represents an organizational structure or mapping in which items are ranked according to their levels of importance. In other words, we can say that a concept hierarchy refers to a sequence of mappings from a set of low-level (specific) concepts to high-level (more general) concepts. For example, in computer science, there are different types of hierarchical systems: a document placed in a folder in Windows, at a specific place in the tree structure, is a good example of a computer hierarchical tree model. There are two types of hierarchy mapping: the first is top-down mapping, and the second is bottom-up mapping.

Let's understand this concept hierarchy for the dimension location with the help of an
example.

A particular city can map with the belonging country. For example, New Delhi can be
mapped to India, and India can be mapped to Asia.

Top-down mapping

Top-down mapping generally starts with the top with some general information and ends
with the bottom to the specialized information.

Bottom-up mapping

Bottom-up mapping generally starts with the bottom with some specialized information
and ends with the top to the generalized information.
Data discretization and binarization in data mining
Data discretization is a method of converting attributes values of continuous data into a
finite set of intervals with minimum data loss. In contrast, data binarization is used to
transform the continuous and discrete attributes into binary attributes.

Why is Discretization important?


As we know, continuous data poses a mathematical problem with an infinite number of degrees of freedom. For many purposes, data scientists therefore need to implement discretization. It is also used to improve the signal-to-noise ratio.

UNIT-II
What is Data Mining?
The process of extracting information from huge sets of data to identify patterns, trends, and useful data that allow the business to make data-driven decisions is called Data Mining.

In other words, we can say that Data Mining is the process of investigating hidden patterns of information from various perspectives and categorizing it into useful data. This data is collected and assembled in particular areas such as data warehouses, analyzed efficiently with data mining algorithms, and used to support decision making and other data requirements, eventually cutting costs and generating revenue.

Data mining is the act of automatically searching for large stores of information to find
trends and patterns that go beyond simple analysis procedures. Data mining utilizes
complex mathematical algorithms for data segments and evaluates the probability of
future events. Data Mining is also called Knowledge Discovery of Data (KDD).

Data Mining is a process used by organizations to extract specific data from huge
databases to solve business problems. It primarily turns raw data into useful information.

Data Mining is similar to Data Science carried out by a person, in a specific situation, on a
particular data set, with an objective. This process includes various types of services such
as text mining, web mining, audio and video mining, pictorial data mining, and social
media mining. It is done through software that is simple or highly specific. By outsourcing
data mining, all the work can be done faster with low operation costs. Specialized firms
can also use new technologies to collect data that is impossible to locate manually. There
are tons of information available on various platforms, but very little knowledge is
accessible. The biggest challenge is to analyze the data to extract important information
that can be used to solve a problem or for company development. There are many
powerful instruments and techniques available to mine data and find better insight from
it.

Types of Data Mining


Data mining can be performed on the following types of data:

Relational Database: A relational database is a collection of multiple data sets formally


organized by tables, records, and columns from which data can be accessed in various
ways without having to reorganize the database tables. Tables convey and share
information, which facilitates data searchability, reporting, and organization.

Data warehouses: A Data Warehouse is the technology that collects the data from various
sources within the organization to provide meaningful business insights. The huge amount
of data comes from multiple places such as Marketing and Finance. The extracted data is
utilized for analytical purposes and helps in decision- making for a business organization.
The data warehouse is designed for the analysis of data rather than transaction processing.

Data Repositories: The Data Repository generally refers to a destination for data storage.
However, many IT professionals utilize the term more clearly to refer to a specific kind of
setup within an IT structure. For example, a group of databases, where an organization has
kept various kinds of information.

Object-Relational Database: A combination of an object-oriented database model and


relational database model is called an object-relational model. It supports Classes, Objects,
Inheritance, etc. One of the primary objectives of the Object-relational data model is to
close the gap between the Relational database and the object-oriented model practices
frequently utilized in many programming languages, for example, C++, Java, C#, and so
on.

Transactional Database: A transactional database refers to a database management


system (DBMS) that has the potential to undo a database transaction if it is not performed
appropriately. Even though this was a unique capability a very long while back, today, most
of the relational database systems support transactional database activities.

Advantages of Data Mining


o The Data Mining technique enables organizations to obtain knowledge-based data.
o Data mining enables organizations to make lucrative modifications in operation and
production.
o Compared with other statistical data applications, data mining is cost-efficient.
o Data Mining helps the decision-making process of an organization.
o It Facilitates the automated discovery of hidden patterns as well as the prediction
of trends and behaviors.
o It can be induced in the new system as well as the existing platforms.
o It is a quick process that makes it easy for new users to analyze enormous amounts
of data in a short time.

Disadvantages of Data Mining


o There is a probability that the organizations may sell useful data of customers to
other organizations for money. As per the report, American Express has sold credit
card purchases of their customers to other organizations.
o Many data mining analytics software packages are difficult to operate and need advanced training to work on.
o Different data mining instruments operate in distinct ways due to the different
algorithms used in their design. Therefore, the selection of the right data mining
tools is a very challenging task.
o The data mining techniques are not 100% precise, so they may lead to severe consequences in certain conditions.

Data Mining Applications


Data Mining is primarily used by organizations with intense consumer demands, such as retail, communication, financial, and marketing companies, to determine price, consumer preferences, product positioning, and the impact on sales, customer satisfaction, and corporate profits.
Data mining enables a retailer to use point-of-sale records of customer purchases to
develop products and promotions that help the organization to attract the customer.

These are the following areas where data mining is widely used:

Data Mining in Healthcare: Data mining in healthcare has excellent potential to improve
the health system. It uses data and analytics for better insights and to identify best
practices that will enhance health care services and reduce costs. Analysts use data mining
approaches such as Machine learning, Multi-dimensional database, Data visualization, Soft
computing, and statistics. Data Mining can be used to forecast patients in each category.
The procedures ensure that the patients get intensive care at the right place and at the
right time. Data mining also enables healthcare insurers to recognize fraud and abuse.

Data Mining in Market Basket Analysis: Market basket analysis is a modeling method
based on a hypothesis. If you buy a specific group of products, then you are more likely
to buy another group of products. This technique may enable the retailer to understand
the purchase behavior of a buyer. This data may assist the retailer in understanding the
requirements of the buyer and altering the store's layout accordingly. Using a different
analytical comparison of results between various stores, between customers in different
demographic groups can be done.

Data mining in Education: Education data mining is a newly emerging field, concerned
with developing techniques that explore knowledge from the data generated from
educational Environments. EDM objectives are recognized as affirming student's future
learning behavior, studying the impact of educational support, and promoting learning
science. An organization can use data mining to make precise decisions and also to predict
the results of the student. With the results, the institution can concentrate on what to teach
and how to teach.

Data Mining in Manufacturing Engineering: Knowledge is the best asset possessed by a


manufacturing company. Data mining tools can be beneficial to find patterns in a complex
manufacturing process. Data mining can be used in system-level designing to obtain the
relationships between product architecture, product portfolio, and data needs of the
customers. It can also be used to forecast the product development period, cost, and
expectations among the other tasks.

Data Mining in CRM (Customer Relationship Management): Customer Relationship


Management (CRM) is all about obtaining and holding Customers, also enhancing
customer loyalty and implementing customer-oriented strategies. To get a decent
relationship with the customer, a business organization needs to collect data and analyze
the data. With data mining technologies, the collected data can be used for analytics.

Data Mining in Fraud detection: Billions of dollars are lost to the action of frauds.
Traditional methods of fraud detection are a little bit time consuming and sophisticated.
Data mining helps by providing meaningful patterns and turning data into information. An ideal
fraud detection system should protect the data of all the users. Supervised methods
consist of a collection of sample records, and these records are classified as fraudulent or
non-fraudulent. A model is constructed using this data, and the technique is made to
identify whether the document is fraudulent or not.

Data Mining in Lie Detection: Apprehending a criminal is not a big deal, but bringing out
the truth from him is a very challenging task. Law enforcement may use data mining
techniques to investigate offenses, monitor suspected terrorist communications, etc. This
technique includes text mining also, and it seeks meaningful patterns in data, which is
usually unstructured text. The information collected from the previous investigations is
compared, and a model for lie detection is constructed.

Data Mining Financial Banking: The Digitalization of the banking system is supposed to
generate an enormous amount of data with every new transaction. The data mining
technique can help bankers by solving business-related problems in banking and finance
by identifying trends, causalities, and correlations in business information and market costs
that are not instantly evident to managers or executives because the data volume is too
large or are produced too rapidly on the screen by experts. The manager may find these
data for better targeting, acquiring, retaining, segmenting, and maintain a profitable
customer.

Challenges of Implementation in Data mining


Although data mining is very powerful, it faces many challenges during its execution.
Various challenges could be related to performance, data, methods, and techniques, etc.
The process of data mining becomes effective when the challenges or problems are
correctly recognized and adequately resolved.
Incomplete and noisy data: The process of extracting useful data from large volumes of
data is data mining. The data in the real-world is heterogeneous, incomplete, and noisy.
Data in huge quantities will usually be inaccurate or unreliable. These problems may occur
due to data measuring instrument or because of human errors. Suppose a retail chain
collects phone numbers of customers who spend more than $ 500, and the accounting
employees put the information into their system. The person may make a digit mistake
when entering the phone number, which results in incorrect data. Even some customers
may not be willing to disclose their phone numbers, which results in incomplete data. The
data could get changed due to human or system error. All these consequences (noisy and
incomplete data) make data mining challenging.

Data Distribution: Real-worlds data is usually stored on various platforms in a distributed


computing environment. It might be in a database, individual systems, or even on the
internet. Practically, it is quite a tough task to bring all the data into a centralized data
repository mainly due to organizational and technical concerns. For example, various
regional offices may have their servers to store their data. It is not feasible to store, all the
data from all the offices on a central server. Therefore, data mining requires the
development of tools and algorithms that allow the mining of distributed data.

Complex Data: Real-world data is heterogeneous, and it could be multimedia data,


including audio and video, images, complex data, spatial data, time series, and so on.
Managing these various types of data and extracting useful information is a tough task.
Most of the time, new technologies, new tools, and methodologies would have to be
refined to obtain specific information.

Performance: The data mining system's performance relies primarily on the efficiency of
algorithms and techniques used. If the designed algorithm and techniques are not up to
the mark, then the efficiency of the data mining process will be affected adversely.

Data Privacy and Security: Data mining usually leads to serious issues in terms of data
security, governance, and privacy. For example, if a retailer analyzes the details of the
purchased items, then it reveals data about buying habits and preferences of the
customers without their permission.
Data Visualization: In data mining, data visualization is a very important process because
it is the primary method that shows the output to the user in a presentable way. The
extracted data should convey the exact meaning of what it intends to express. But many
times, representing the information to the end-user in a precise and easy way is difficult.
Because the input data and the output information are complicated, very efficient and successful data visualization processes need to be implemented.

KDD vs Data Mining


KDD (Knowledge Discovery in Databases) is a field of computer science, which includes
the tools and theories to help humans in extracting useful and previously unknown
information (i.e., knowledge) from large collections of digitized data. KDD consists of
several steps, and Data Mining is one of them. Data Mining is the application of a specific
algorithm to extract patterns from data. Nonetheless, KDD and Data Mining are used
interchangeably.

What is KDD?
KDD is a computer science field specializing in extracting previously unknown and
interesting information from raw data. KDD is the whole process of trying to make sense
of data by developing appropriate methods or techniques. This process deals with mapping low-level data into other forms that are more compact, abstract, and useful. This is
achieved by creating short reports, modeling the process of generating data, and
developing predictive models that can predict future cases.

Due to the exponential growth of data, especially in areas such as business, KDD has
become a very important process to convert this large wealth of data into business
intelligence, as manual extraction of patterns has become seemingly impossible in the
past few decades.

Difference between KDD and Data Mining


Although the two terms KDD and Data Mining are heavily used interchangeably, they refer
to two related yet slightly different concepts.

KDD is the overall process of extracting knowledge from data, while Data Mining is a step
inside the KDD process, which deals with identifying patterns in data.

And Data Mining is only the application of a specific algorithm based on the overall goal
of the KDD process.
KDD is an iterative process where evaluation measures can be enhanced, mining can be
refined, and new data can be integrated and transformed to get different and more
appropriate results.

Difference Between Data Mining and Database


Database:
o A database is an organized collection of data. Most of the time, these raw data are stored in very large databases.
o A database may contain different levels of abstraction in its architecture. Typically, the three levels, namely external, conceptual, and internal, make up the database architecture.

Data mining:
o Data mining is analyzing data from different perspectives to discover useful knowledge. It deals with extracting useful and previously unknown information from raw data.
o The data mining process relies on the data compiled in the data warehousing phase in order to detect meaningful patterns.
Data Mining Techniques
Data mining includes the utilization of refined data analysis tools to find previously
unknown, valid patterns and relationships in huge data sets. These tools can incorporate
statistical models, machine learning techniques, and mathematical algorithms, such as
neural networks or decision trees. Thus, data mining incorporates analysis and
prediction.
Depending on various methods and technologies from the intersection of machine
learning, database management, and statistics, professionals in data mining have
devoted their careers to better understanding how to process and make conclusions
from the huge amount of data, but what are the methods they use to make it happen? In
recent data mining projects, various major data mining techniques have been developed
and used, including association, classification, clustering, prediction, sequential patterns,
and regression.

1. Classification:
This technique is used to obtain important and relevant information about data and
metadata. This data mining technique helps to classify data in different classes.
Data mining techniques can be classified by different criteria, as follows:
i. Classification of data mining frameworks as per the type of data sources mined: This classification is as per the type of data handled, for example, multimedia, spatial data, text data, time-series data, World Wide Web, and so on.
ii. Classification of data mining frameworks as per the database involved: This classification is based on the data model involved, for example, object-oriented database, transactional database, relational database, and so on.
iii. Classification of data mining frameworks as per the kind of knowledge discovered: This classification depends on the types of knowledge discovered or the data mining functionalities, for example, discrimination, classification, clustering, characterization, etc. Some frameworks tend to be extensive frameworks offering a few data mining functionalities together.
iv. Classification of data mining frameworks according to data mining
techniques used: This classification is as per the data analysis approach utilized,
such as neural networks, machine learning, genetic algorithms, visualization,
statistics, data warehouse-oriented or database-oriented, etc.
The classification can also take into account, the level of user interaction involved
in the data mining procedure, such as query-driven systems, autonomous
systems, or interactive exploratory systems.
2. Clustering:
Clustering is a division of information into groups of connected objects. Describing the data by a few clusters mainly loses certain fine details but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in statistics, mathematics, and numerical analysis. From a
machine learning point of view, clusters relate to hidden patterns, the search for clusters
is unsupervised learning, and the subsequent framework represents a data concept.
From a practical point of view, clustering plays an extraordinary job in data mining
applications. For example, scientific data exploration, text mining, information retrieval,
spatial database applications, CRM, Web analysis, computational biology, medical
diagnostics, and much more.
In other words, we can say that Clustering analysis is a data mining technique to identify
similar data. This technique helps to recognize the differences and similarities between
the data. Clustering is very similar to the classification, but it involves grouping chunks of
data together based on their similarities.
3. Regression:
Regression analysis is the data mining process used to identify and analyze the relationship between variables in the presence of other factors. It is used to define the probability of a specific variable. Regression is primarily a form of planning and modeling. For example, we might use it to project certain costs, depending on other factors such as availability, consumer demand, and competition. Primarily, it gives the exact relationship between two or more variables in the given data set.
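A minimal sketch of the regression technique using scikit-learn's linear regression; the house-price-style numbers are invented purely for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

# Predictor x (e.g., total area) and response y (e.g., price); values invented
X = np.array([[50], [65], [80], [100], [120]])   # one predictor attribute
y = np.array([110, 140, 165, 205, 245])

model = LinearRegression().fit(X, y)

# The fitted line y = w*x + b
print("w =", model.coef_[0], "b =", model.intercept_)
print("prediction for x = 90:", model.predict([[90]])[0])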
4. Association Rules:
This data mining technique helps to discover a link between two or more items. It finds a
hidden pattern in the data set.
Association rules are if-then statements that support to show the probability of
interactions between data items within large data sets in different types of databases.
Association rule mining has several applications and is commonly used to help sales
correlations in data or medical data sets.
The way the algorithm works is that you have various data, For example, a list of grocery
items that you have been buying for the last six months. It calculates a percentage of
items being purchased together.
These are the three major measurement techniques:
o Lift: This measurement technique measures the strength of the confidence relative to how often item B is purchased on its own. Lift(A -> B) = Confidence(A -> B) / Support(B).
o Support: This measurement technique measures how often multiple items are purchased together and compares it to the overall dataset. Support(A, B) = (transactions with Item A and Item B) / (entire dataset).
o Confidence: This measurement technique measures how often item B is purchased when item A is purchased as well. Confidence(A -> B) = (transactions with Item A and Item B) / (transactions with Item A).
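The three measurements can be computed directly from a list of transactions; the grocery baskets below are invented, and the choice of A = milk, B = bread is arbitrary.

# Invented grocery transactions; each set is one basket
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]
n = len(transactions)

A, B = "milk", "bread"
count_A  = sum(A in t for t in transactions)
count_B  = sum(B in t for t in transactions)
count_AB = sum(A in t and B in t for t in transactions)

support    = count_AB / n               # (Item A + Item B) / (entire dataset)
confidence = count_AB / count_A         # (Item A + Item B) / (Item A)
lift       = confidence / (count_B / n) # confidence / support of B

print(support, confidence, lift)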
5. Outlier detection:
This type of data mining technique relates to the observation of data items in the data
set, which do not match an expected pattern or expected behavior. This technique may
be used in various domains like intrusion detection, fraud detection, etc. It is also known as Outlier Analysis or Outlier Mining. An outlier is a data point that diverges too much
from the rest of the dataset. The majority of the real-world datasets have an outlier.
Outlier detection plays a significant role in the data mining field. Outlier detection is
valuable in numerous fields like network interruption identification, credit or debit card
fraud detection, detecting outlying in wireless sensor network data, etc.
6. Sequential Patterns:
The sequential pattern is a data mining technique specialized for evaluating sequential
data to discover sequential patterns. It comprises of finding interesting subsequences in
a set of sequences, where the stake of a sequence can be measured in terms of different
criteria like length, occurrence frequency, etc.
In other words, this technique of data mining helps to discover or recognize similar
patterns in transaction data over some time.
7. Prediction:
Prediction used a combination of other data mining techniques such as trends,
clustering, classification, etc. It analyzes past events or instances in the right sequence to
predict a future event.

Classification and Predication in Data Mining


There are two forms of data analysis that can be used to extract models describing
important classes or predict future data trends. These two forms are as follows:

1. Classification

2. Prediction

We use classification and prediction to extract a model, representing the data classes to
predict future data trends. Classification predicts the categorical labels of data with the
prediction models. This analysis provides us with the best understanding of the data at a
large scale.

Classification models predict categorical class labels, and prediction models predict
continuous-valued functions. For example, we can build a classification model to
categorize bank loan applications as either safe or risky or a prediction model to predict
the expenditures in dollars of potential customers on computer equipment given their
income and occupation.

What is Classification?
Classification is to identify the category or the class label of a new observation. First, a set
of data is used as training data. The set of input data and the corresponding outputs are
given to the algorithm. So, the training data set includes the input data and their
associated class labels. Using the training dataset, the algorithm derives a model or the
classifier. The derived model can be a decision tree, mathematical formula, or a neural
network. In classification, when unlabeled data is given to the model, it should find the
class to which it belongs. The new data provided to the model is the test data set.
Classification is the process of classifying a record. One simple example of classification is
to check whether it is raining or not. The answer can either be yes or no. So, there is a
particular number of choices. Sometimes there can be more than two classes to classify.
That is called multiclass classification.

The bank needs to analyze whether giving a loan to a particular customer is risky or not.
For example, based on observable data for multiple loan borrowers, a classification model
may be established that forecasts credit risk. The data could track job records,
homeownership or leasing, years of residency, number, type of deposits, historical credit
ranking, etc. The goal would be credit ranking, the predictors would be the other
characteristics, and the data would represent a case for each consumer. In this example, a
model is constructed to find the categorical label. The labels are risky or safe.

How does Classification Works?


The functioning of classification with the assistance of the bank loan application has been mentioned above. There are two stages in the data classification system: the first is classifier or model creation, and the second is applying the classifier for classification.

1. Developing the Classifier or model creation: This level is the learning stage or the
learning process. The classification algorithms construct the classifier in this stage.
A classifier is constructed from a training set composed of the records of databases
and their corresponding class names. Each category that makes up the training set
is referred to as a category or class. We may also refer to these records as samples,
objects, or data points.

2. Applying classifier for classification: The classifier is used for classification at this
level. The test data are used here to estimate the accuracy of the classification
algorithm. If the consistency is deemed sufficient, the classification rules can be
expanded to cover new data records. It includes:
o Sentiment Analysis: Sentiment analysis is highly helpful in social media
monitoring. We can use it to extract social media insights. We can build
sentiment analysis models to read and analyze misspelled words with
advanced machine learning algorithms. The accurate trained models provide
consistently accurate outcomes and result in a fraction of the time.
o Document Classification: We can use document classification to organize
the documents into sections according to the content. Document
classification refers to text classification; we can classify the words in the
entire document. And with the help of machine learning classification
algorithms, we can execute it automatically.
o Image Classification: Image classification is used for the trained categories
of an image. These could be the caption of the image, a statistical value, a
theme. You can tag images to train your model for relevant categories by
applying supervised learning algorithms.
o Machine Learning Classification: It uses the statistically demonstrable
algorithm rules to execute analytical tasks that would take humans hundreds
of more hours to perform.
3. Data Classification Process: The data classification process can be categorized into five steps:
o Create the goals, strategy, workflows, and architecture of data classification.
o Classify the confidential details that we store.
o Apply marks by labelling the data.
o Use the results to improve protection and compliance.
o Data is complex and changes continuously, so classification is a continuous method.

What is Data Classification Lifecycle?


The data classification life cycle produces an excellent structure for controlling the flow of
data to an enterprise. Businesses need to account for data security and compliance at each
level. With the help of data classification, we can perform it at every stage, from origin to
deletion. The data life-cycle has the following stages, such as:
1. Origin: It produces sensitive data in various formats, including emails, Excel, Word, Google documents, social media, and websites.

2. Role-based practice: Role-based security restrictions are applied to all sensitive data by tagging it based on in-house protection policies and compliance rules.

3. Storage: Here, we have the obtained data, including access controls and encryption.

4. Sharing: Data is continually distributed among agents, consumers, and coworkers


from various devices and platforms.

5. Archive: Here, data is eventually archived within an industry's storage systems.

6. Publication: Through the publication of data, it can reach customers. They can then
view and download in the form of dashboards.

What is Prediction?
Another process of data analysis is prediction. It is used to find a numerical output. Same
as in classification, the training dataset contains the inputs and corresponding numerical
output values. The algorithm derives the model or a predictor according to the training
dataset. The model should find a numerical output when the new data is given. Unlike in
classification, this method does not have a class label. The model predicts a
continuous-valued function or ordered value.

Regression is generally used for prediction. Predicting the value of a house depending on
the facts such as the number of rooms, the total area, etc., is an example for prediction.

For example, suppose the marketing manager needs to predict how much a particular
customer will spend at his company during a sale. We are concerned with forecasting a numerical value in this case; therefore, this data processing activity is an example of numeric prediction. In this case, a model or a predictor will be developed that forecasts a continuous or ordered value function.

Classification and Prediction Issues


The major issue is preparing the data for Classification and Prediction. Preparing the data
involves the following activities, such as:
1. Data Cleaning: Data cleaning involves removing the noise and treatment of missing
values. The noise is removed by applying smoothing techniques, and the problem
of missing values is solved by replacing a missing value with the most commonly
occurring value for that attribute.

2. Relevance Analysis: The database may also have irrelevant attributes. Correlation
analysis is used to know whether any two given attributes are related.
3. Data Transformation and reduction: The data can be transformed by any of the following methods.
o Normalization: The data is transformed using normalization. Normalization involves scaling all values for a given attribute to make them fall within a small specified range. Normalization is used when neural networks or methods involving measurements are used in the learning step.
o Generalization: The data can also be transformed by generalizing it to a higher-level concept. For this purpose, we can use the concept hierarchies.
NOTE: Data can also be reduced by some other methods such as wavelet transformation, binning, histogram analysis, and clustering.
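A small sketch of the normalization step above using min-max scaling in plain Python; the attribute values and the target range [0, 1] are assumptions for illustration.

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale all values of an attribute into a small specified range."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    return [new_min + (v - old_min) * (new_max - new_min) / span for v in values]

incomes = [12000, 35000, 58000, 73600, 98000]
print(min_max_normalize(incomes))   # every value now lies in [0, 1]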

Comparison of Classification and Prediction Methods


Here are the criteria for comparing the methods of Classification and Prediction, such as:

o Accuracy: The accuracy of the classifier can be referred to as the ability of the
classifier to predict the class label correctly, and the accuracy of the predictor can
be referred to as how well a given predictor can estimate the unknown value.
o Speed: The speed of the method depends on the computational cost of generating and using the classifier or predictor.
o Robustness: Robustness is the ability to make correct predictions or classifications. In the context of data mining, robustness is the ability of the classifier or predictor to make correct predictions even from noisy or incomplete incoming data.
o Scalability: Scalability refers to an increase or decrease in the performance of the
classifier or predictor based on the given data.
o Interpretability: Interpretability is how readily we can understand the reasoning
behind predictions or classification made by the predictor or classifier.

Difference between Classification and Prediction


The decision tree, applied to existing data, is a classification model. We can get a class
prediction by applying it to new data for which the class is unknown. The assumption is
that the new data comes from a distribution similar to the data we used to construct our
decision tree. In many instances, this is a correct assumption, so we can use the decision
tree to build a predictive model. Classification or prediction is the process of finding a
model that describes the classes or concepts of information. The purpose is to predict the
class of objects whose class label is unknown using this model. Below are some major
differences between classification and prediction.

Classification: Classification is the process of identifying which category a new observation belongs to, based on a training data set containing observations whose category membership is known.
Prediction: Prediction is the process of identifying the missing or unavailable numerical data for a new observation.

Classification: In classification, the accuracy depends on finding the class label correctly.
Prediction: In prediction, the accuracy depends on how well a given predictor can guess the value of a predicted attribute for new data.

Classification: In classification, the model can be known as the classifier.
Prediction: In prediction, the model can be known as the predictor.

Classification: A model or the classifier is constructed to find the categorical labels.
Prediction: A model or a predictor will be constructed that predicts a continuous-valued function or ordered value.

Classification: For example, the grouping of patients based on their medical records can be considered a classification.
Prediction: For example, we can think of prediction as predicting the correct treatment for a particular disease for a person.

Difference between Parametric and Non- Parametric


Methods
Parametric Methods: The basic idea behind the parametric method is that there is a fixed set of parameters used to determine a probability model, which is used in Machine Learning as well. Parametric methods are those methods for which we know a priori that the population is normal, or, if not, that we can easily approximate it using a normal distribution, which is possible by invoking the Central Limit Theorem. The parameters for using the normal distribution are as follows:
• Mean
• Standard Deviation
Eventually, whether a method is classified as parametric completely depends on the presumptions that are made about the population. There are many parametric methods available; some of them are:
• The confidence interval for a population mean, with known standard deviation.
• The confidence interval for a population mean, with unknown standard deviation.
• The confidence interval for a population variance.
• The confidence interval for the difference of two means, with unknown standard deviation.
Nonparametric Methods: The basic idea behind the non-parametric method is that there is no need to make any assumption about parameters for the given population or the population we are studying. In fact, the methods do not depend on the population. Here there is no fixed set of parameters, and there is also no distribution (normal distribution, etc.) of any kind available for use. This is also the reason that nonparametric methods are referred to as distribution-free methods. Nowadays non-parametric methods are gaining popularity and influence for a few reasons:
• The main reason is that there is no need to satisfy strict assumptions while using non-parametric methods.
• The second important reason is that we do not need to make more and more assumptions about the given population on which we are working.
• Most of the nonparametric methods available are very easy to apply and to understand, i.e., the complexity is very low.
There are many nonparametric methods available today, but some of them are as follows:
• Spearman correlation test
• Sign test for population means
• U-test for two independent means

Difference between Parametric and Non-Parametric Methods are as follows:


Parametric Methods: use a fixed number of parameters to build the model.
Non-Parametric Methods: use a flexible number of parameters to build the model.

Parametric Methods: parametric analysis is used to test group means.
Non-Parametric Methods: non-parametric analysis is used to test medians.

Parametric Methods: applicable only for variables.
Non-Parametric Methods: applicable for both variables and attributes.

Parametric Methods: always make strong assumptions about the data.
Non-Parametric Methods: generally make fewer assumptions about the data.

Parametric Methods: require less data than non-parametric methods.
Non-Parametric Methods: require much more data than parametric methods.

Parametric Methods: assume a normal distribution.
Non-Parametric Methods: there is no assumed distribution.

Parametric Methods: handle interval data or ratio data.
Non-Parametric Methods: handle nominal or ordinal data.

Parametric Methods: the results or outputs generated can be easily affected by outliers.
Non-Parametric Methods: the results or outputs generated cannot be seriously affected by outliers.

Parametric Methods: can perform well in many situations, but performance is at its peak when the spread of each group is different.
Non-Parametric Methods: can perform well in many situations, but performance is at its peak when the spread of each group is the same.

Parametric Methods: have more statistical power than non-parametric methods.
Non-Parametric Methods: have less statistical power than parametric methods.

Parametric Methods: as far as computation is considered, these methods are computationally faster than non-parametric methods.
Non-Parametric Methods: as far as computation is considered, these methods are computationally slower than parametric methods.

Parametric Methods: examples are Logistic Regression, Naïve Bayes Model, etc.
Non-Parametric Methods: examples are KNN, Decision Tree Model, etc.
Data Mining Algorithms
Data Mining Algorithms are a particular category of algorithms useful for analyzing data
and developing data models to identify meaningful patterns. These are part of machine
learning algorithms. These algorithms are implemented through various programming
like R language, Python, and data mining tools to derive the optimized data models.
Some of the popular data mining algorithms are C4.5 for decision trees, k-means for cluster data analysis, the Naive Bayes algorithm, Support Vector Machine algorithms, and the Apriori algorithm for frequent itemset and association rule mining. These algorithms are part of data analytics implementation for business. They are based upon statistical and mathematical formulas that are applied to the data set.

1. C4.5 Algorithm
Some constructs are used by classifiers which are tools in data mining. These systems
take inputs from a collection of cases where each case belongs to one of the small
numbers of classes and are described by its values for a fixed set of attributes. The
output classifier can accurately predict the level to which it belongs. It uses decision trees
where the first initial tree is acquired by using a divide and conquer algorithm.
Suppose S is the set of training cases; the tree for S is either a leaf labelled with the most frequent class in S, or a test based on a single attribute with two or more outcomes is chosen and made the root, with one branch for each outcome of the test. The partitions correspond to subsets S1, S2, etc., which are the cases with each outcome. C4.5 allows multiple outcomes. C4.5 has also introduced an alternative representation of decision trees, which consists of a list of rules, where these rules are grouped for each class. To classify a case, the first class whose conditions are satisfied is named as the predicted one. If the case satisfies no rule, then it is assigned a default class. The C4.5 rulesets are formed from the initial decision tree. C4.5 enhances scalability by multi-threading.
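C4.5 itself is not shipped with scikit-learn, but the closely related CART decision tree below gives a hedged, minimal illustration of a divide-and-conquer tree classifier with entropy-based splits; the toy loan data is invented.

from sklearn.tree import DecisionTreeClassifier, export_text

# Invented training cases: [income (thousands), years at current job]
X = [[25, 1], [40, 3], [60, 7], [80, 10], [30, 2], [90, 12]]
y = ["risky", "risky", "safe", "safe", "risky", "safe"]

# Entropy-based splitting is the same criterion family that C4.5 relies on
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)

print(export_text(tree, feature_names=["income", "years_employed"]))
print(tree.predict([[55, 5]]))   # classify a new, unlabeled case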
2. The k-means Algorithm
This algorithm is a simple method of partitioning a given data set into a user-specified number of clusters, k. The algorithm works on d-dimensional vectors, D = {xi | i = 1, ..., N}, where xi is the i-th data point. The initial cluster seeds are obtained by sampling the data at random, by setting them as the solution of clustering a small subset of the data, or by perturbing the global mean of the data k times. This algorithm can be paired with another algorithm to describe non-convex clusters. It creates k groups from the given set of objects. It explores the entire data set with its cluster analysis. It is simple and faster than other algorithms when it is used with other algorithms. This algorithm is an unsupervised method: apart from requiring the number of clusters to be specified, it keeps learning without any label information. It observes the groups and learns.
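A minimal k-means sketch with scikit-learn; the 2-dimensional points and the choice of k = 2 are assumptions made only for the example.

import numpy as np
from sklearn.cluster import KMeans

# Six 2-dimensional data points forming two rough groups (values invented)
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)                      # cluster assignment for each point
print(kmeans.cluster_centers_)             # the k group means
print(kmeans.predict([[0, 0], [12, 3]]))   # assign new points to clusters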
3. Naive Bayes Algorithm
This algorithm is based on Bayes theorem. This algorithm is mainly used when the
dimensionality of inputs is high. This classifier can easily calculate the next possible
output. New raw data can be added during the runtime, and it provides a better
probabilistic classifier. Using a set of objects, each of which belongs to a known class and has a known vector of variables, the aim is to construct a rule that allows future objects to be assigned to a class given only their vectors of variables. This is one of the most convenient algorithms as it is easy to
construct and does not have any complicated parameter estimation schemas. It can be
easily applied to massive data sets as well. It does not need any elaborate iterative
parameter estimation schemes, and hence unskilled users can understand why the
classifications are made.
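A hedged sketch of a Gaussian Naive Bayes classifier with scikit-learn; the two numeric features and the labels are invented for illustration.

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Invented examples: [daily calories / 100, hours of exercise per week]
X = np.array([[18, 6], [20, 5], [35, 1], [38, 0], [22, 4], [40, 1]])
y = np.array(["low_risk", "low_risk", "high_risk", "high_risk",
              "low_risk", "high_risk"])

model = GaussianNB().fit(X, y)

# Probabilistic output for a new, unseen individual
print(model.predict([[30, 2]]))
print(model.predict_proba([[30, 2]]))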
4. Support Vector Machines Algorithm
If a user wants robust and accurate methods, then Support Vector machines algorithm
must be tried. SVMs are mainly used for learning classification, regression or ranking
function. It is formed based on structural risk minimization and statistical learning theory.
The decision boundaries must be identified, which is known as a hyperplane. It helps in
the optimal separation of classes. The main job of SVM is to identify the hyperplane maximizing the margin between the two classes. The margin is defined as the amount of space between the two classes. A hyperplane function is like the equation for a line, y = mx + b. SVM can be
extended to perform numerical calculations as well. SVM makes use of kernel so that it
operates well in higher dimensions. This is a supervised algorithm, and the data set is
used first to let SVM know about all the classes. Once this is done then, SVM can be
capable of classifying this new data.
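A short scikit-learn sketch of a support vector machine with a kernel; the points and the RBF kernel choice are assumptions made only for the example.

from sklearn.svm import SVC

# Two classes of 2-dimensional points (invented values)
X = [[0, 0], [1, 1], [1, 0], [4, 4], [5, 5], [4, 5]]
y = [0, 0, 0, 1, 1, 1]

# The kernel lets the SVM operate in a higher-dimensional feature space
clf = SVC(kernel="rbf", C=1.0).fit(X, y)

print(clf.support_vectors_)              # the points that define the margin
print(clf.predict([[2, 2], [4.5, 4.5]])) # classify new points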
5. The Apriori Algorithm
The Apriori algorithm is widely used to find the frequent itemsets from a transaction data
set and to derive association rules. Finding frequent itemsets is not easy because of the
combinatorial explosion of candidate itemsets. Once the frequent itemsets are obtained, it
is straightforward to generate association rules whose confidence is larger than or equal to
a specified minimum confidence. Apriori is an algorithm which finds the frequent itemsets
by making use of candidate generation. It assumes that the items within an itemset are
sorted in lexicographic order. After the introduction of Apriori, data mining research was
significantly boosted. It is simple and easy to implement. The basic approach of this
algorithm is as below (a minimal sketch follows the list):

• Join: The whole database is scanned to find the frequent 1-itemsets.
• Prune: An itemset must satisfy the minimum support to move to the next round,
  where the candidate 2-itemsets are generated.
• Repeat: The join and prune steps are repeated for each itemset level until the
  pre-defined size is reached or no further frequent itemsets are found.
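The following is a minimal, self-contained Python sketch of the join/prune loop described above;
the toy transactions and the min_support value are invented purely for illustration:

# A plain-Python sketch of the Apriori join/prune idea on toy transactions.
transactions = [{"bread", "milk"},
                {"bread", "butter", "milk"},
                {"bread", "butter"},
                {"milk", "butter"}]
min_support = 0.5  # an itemset must appear in at least half of the transactions

def support(itemset):
    # Fraction of transactions that contain every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

# Level 1 (join): scan the whole database for the frequent 1-itemsets.
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

# Higher levels: join the frequent (k-1)-itemsets, then prune by minimum support.
k = 2
while frequent[-1]:
    candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k}
    frequent.append({c for c in candidates if support(c) >= min_support})
    k += 1

for level, sets in enumerate(frequent, start=1):
    print("frequent %d-itemsets:" % level, [sorted(s) for s in sets])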
Conclusion
Besides the five algorithms used most prominently, many others help in mining data and
learning from it. Data mining integrates different techniques, including machine learning,
statistics, pattern recognition, artificial intelligence and database systems. All of these help
in analyzing large sets of data and performing other data analysis tasks. Hence they are
among the most useful and reliable analytics algorithms.

Data Mining Bayesian Classifiers


In numerous applications, the connection between the attribute set and the class variable
is non-deterministic. In other words, the class label of a test record cannot be predicted
with certainty even though its attribute set is the same as that of some of the training
examples. These circumstances may emerge due to noisy data or the presence of certain
confounding factors that influence classification but are not included in the analysis.
For example, consider the task of predicting whether an individual is at risk of liver illness
based on the individual's eating habits and workout efficiency. Although most people who
eat healthily and exercise consistently have a lower probability of liver disease, they may
still develop it due to other factors, for example the consumption of high-calorie street
food or alcohol abuse. Determining whether an individual's eating routine is healthy or the
workout efficiency is sufficient is also subject to interpretation, which in turn may introduce
uncertainties into the learning problem.
Bayesian classification uses Bayes' theorem to predict the probability of an event.
Bayesian classifiers are statistical classifiers built on Bayesian probability. The theorem
expresses how a degree of belief, expressed as a probability, should be updated in the light
of new evidence.

Bayes' theorem is named after Thomas Bayes, who first used conditional probability to
provide an algorithm that uses evidence to calculate limits on an unknown parameter.

Bayes' theorem is expressed mathematically by the following equation:

P(X/Y) = [ P(Y/X) · P(X) ] / P(Y)

where X and Y are events and P(Y) ≠ 0.

P(X/Y) is a conditional probability: the probability of event X occurring given that Y is true.
P(Y/X) is a conditional probability: the probability of event Y occurring given that X is true.

P(X) and P(Y) are the probabilities of observing X and Y independently of each other; each
is known as a marginal probability.
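As a concrete check of the formula, here is a small numerical example; the disease and test
figures are invented solely for illustration:

# Worked Bayes' theorem example with made-up numbers.
# X = "person has the disease", Y = "test is positive".
p_x = 0.01              # prior P(X): 1% of the population has the disease
p_y_given_x = 0.95      # P(Y/X): the test is positive when the disease is present
p_y_given_not_x = 0.05  # P(Y/not X): false-positive rate

# Marginal probability of a positive test, P(Y), by the law of total probability.
p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)

# Posterior P(X/Y) from Bayes' theorem.
p_x_given_y = p_y_given_x * p_x / p_y
print(round(p_x_given_y, 3))  # about 0.161: the degree of belief in X rises from 1% to ~16%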

Bayesian interpretation:

In the Bayesian interpretation, probability measures a "degree of belief." Bayes' theorem
connects the degree of belief in a hypothesis before and after accounting for evidence.
For example, let us consider a coin. If we toss a fair coin, we get either heads or tails, and
the chance of either outcome is 50%. If the coin is flipped a number of times and the
outcomes are observed, the degree of belief that the coin is fair may rise, fall, or remain
the same depending on the outcomes.

For proposition X and evidence Y,

o P(X), the prior, is the initial degree of belief in X.
o P(X/Y), the posterior, is the degree of belief after having accounted for Y.
o The quotient P(Y/X) / P(Y) represents the support Y provides for X.


Bayes' theorem can be derived from the definition of conditional probability:

P(X/Y) = P(X ∩ Y) / P(Y)   and   P(Y/X) = P(X ∩ Y) / P(X),

where P(X ∩ Y) is the joint probability of both X and Y being true. Eliminating the joint
probability from these two equations gives Bayes' theorem.

Bayesian network:

A Bayesian network is a Probabilistic Graphical Model (PGM) that is used to compute
uncertainties using the concept of probability. Also known as belief networks, Bayesian
networks model uncertainties using Directed Acyclic Graphs (DAGs).

A Directed Acyclic Graph is used to show a Bayesian Network, and like some other
statistical graph, a DAG consists of a set of nodes and links, where the links signify the
connection between the nodes.

The nodes here represent random variables, and the edges define the relationship
between these variables.

A DAG models the uncertainty of an event taking place based on the Conditional
Probability Distribution (CPD) of each random variable. A Conditional Probability Table
(CPT) is used to represent the CPD of each variable in the network.

Classification of Errors
Errors are classified into two types: systemic (determinate) and random (indeterminate) errors.
Systemic (Determinate) errors:
Errors which can be avoided, or whose magnitude can be determined, are called systemic errors.
They are determinable and can presumably be either avoided or corrected. Systemic errors are
further classified as:

• Operational and personal error


• Instrumental error
• Errors of method
• Additive or proportional error

Operational and personal error:


Errors for which the individual analyst is responsible and which are not connected with the method
or procedure are called personal errors, e.g. the inability to judge a colour change.

Errors that occur during an operation are called operational errors, e.g. during the transfer of
solutions, effervescence, incomplete drying, under-weighing or over-weighing of precipitates, and
insufficient cooling of precipitates. These errors are physical in nature and occur when sound
analytical technique is not followed.

Instrumental and Reagent errors:


Errors that occur due to a faulty instrument or a reagent containing impurities, e.g. un-calibrated
weights, un-calibrated burettes, pipettes and measuring flasks.

Errors of Method:
Errors that occur due to the method are difficult to correct. In gravimetric analysis, errors occur
due to insolubility of precipitates, co-precipitation, post-precipitation, decomposition, and volatilization.

In titrimetric analysis errors occur due to failure of reaction, side reaction, reaction of substance
other than the constituent being determined, difference between observed end point and the
stoichiometric equivalence point of a reaction.

Additive or proportional errors:


An additive error does not depend on the amount of constituent present in the determination,
e.g. the loss in weight of a crucible in which a precipitate is ignited.

A proportional error depends on the amount of the constituent, e.g. impurities in a standard compound.

Random Errors:
These errors occur accidentally or randomly, so they are called indeterminate, accidental or random
errors. The analyst has no control over them. They follow a random distribution, so the mathematical
laws of probability can be applied.

UNIT-III
Association Rules in Data Mining
Introduction to Association Rules in Data Mining

Association rule learning is a basic rule-based machine learning technique used for
discovering interesting relations between variables in large databases. It is intended to
identify strong rules discovered in data using measures of interestingness. It has a variety
of applications and is widely used to discover sales correlations in transactional data or in
medical data sets. In this topic, we are going to learn about Association Rules in Data
Mining.

Association rules are typically required to satisfy a user-specified minimum support and a
user-specified minimum confidence at the same time.

The generation of association rules is usually divided into two separate steps:

• A minimum support threshold is applied to find all the frequent itemsets in the
  database.
• A minimum confidence constraint is then applied to these frequent itemsets in order
  to form rules.

While the second step is straightforward, the first step needs much more attention.

Working of Association Rules in Data Mining

Association rule mining involves the use of machine learning models to analyze data for
patterns in a database. It identifies the if/then associations, which are known as
association rules.

An association rule has two parts:

• An antecedent (if)

• A consequent (then)

An antecedent is an item found within the data. A consequent is an item found in
combination with the antecedent.

Association rules are created by thoroughly analyzing data and looking for frequent
if/then patterns. Then, based on the following parameters, the important relationships are
discovered (a short computation sketch follows the definitions below):

• Support

• Confidence

• Lift

Support indicates how frequently the if/then relationship appears in the data.

Confidence tells how often these relationships are found to be true.

Lift is used to compare the observed confidence with the expected confidence.
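Here is a short Python computation of these three measures for one rule, {bread} -> {butter};
the toy transactions are made up for illustration only:

# Support, confidence and lift of the rule {bread} -> {butter} on toy transactions.
transactions = [{"bread", "milk"},
                {"bread", "butter"},
                {"bread", "butter", "milk"},
                {"milk"}]
n = len(transactions)

antecedent, consequent = {"bread"}, {"butter"}

support_a = sum(antecedent <= t for t in transactions) / n                    # P(bread)
support_c = sum(consequent <= t for t in transactions) / n                    # P(butter)
support_rule = sum((antecedent | consequent) <= t for t in transactions) / n  # P(bread and butter)

confidence = support_rule / support_a   # how often the if/then relationship is found to be true
lift = confidence / support_c           # observed confidence compared to the expected confidence

print(support_rule, confidence, lift)   # approximately 0.5, 0.667 and 1.333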

Algorithms of Association Rules in Data Mining

There are many algorithms proposed for generating association rules. Some of the
algorithms are mentioned below:

• Apriori algorithm

• Eclat algorithm

• FP-growth algorithm
1. Apriori algorithm

Apriori is an algorithm for frequent itemset mining and association rule learning over
relational databases. It proceeds by identifying the frequent individual items in the data
and extending them to larger and larger itemsets as long as those itemsets appear
sufficiently often in the data.

The frequent itemsets determined by Apriori can also be used to derive association rules
that highlight trends in the data. It uses a breadth-first search strategy to count the
support of itemsets and a candidate generation function that exploits the downward
closure property of support.

2. Eclat algorithm

Eclat stands for Equivalence CLAss Transformation. It is a depth-first search algorithm
based on set intersection. It is suitable for both sequential and parallel execution with
locality-enhancing properties. It is an algorithm for frequent pattern mining based on a
depth-first traversal of the itemset lattice.

• It is rather a DFS traversal of the prefix tree than of the lattice.

• The branch and bound technique is used for pruning.

The basic idea is to use transaction-id set intersections to compute the support value of a
candidate, and to avoid the generation of subsets that do not exist in the prefix tree.

3. FP-growth algorithm

FP-growth stands for frequent pattern growth. It is an improvement over the Apriori
algorithm. The FP-growth algorithm is used for finding frequent itemsets in a transaction
database without candidate generation.

It was mainly designed to compress the database into a structure that provides the
frequent sets, and then it divides the compressed data into sets of conditional databases.

Each conditional database is associated with one frequent item, and data mining is then
applied to each of these databases.

The data source is compressed using a data structure called an FP-tree.

This algorithm works in two steps:

• Construction of the FP-tree

• Extraction of the frequent itemsets

Types of Association Rules

There are several types of association rule mining. They are mentioned below:

• Multi-relational association rules

• Generalized association rules

• Quantitative association rules

• Interval information association rules

Uses of Association Rules

• Market basket analysis: data is collected using barcode scanners in most supermarkets
  and analyzed for products that are frequently bought together.

• Medical diagnosis: association rules can be helpful for assisting physicians in
  treating patients.

• Census data: this data may be used to plan efficient public services as well as
  businesses.

Correlation Analysis in Data Mining


Correlation analysis is a statistical method used to measure the strength of the linear
relationship between two variables and compute their association. Correlation analysis
calculates the level of change in one variable due to the change in the other. A high
correlation points to a strong relationship between the two variables, while a low
correlation means that the variables are weakly related.

Researchers use correlation analysis to analyze quantitative data collected through


research methods like surveys and live polls for market research. They try to identify
relationships, patterns, significant connections, and trends between two variables or
datasets. There is a positive correlation between two variables when an increase in one
variable leads to an increase in the other. On the other hand, a negative correlation means
that when one variable increases, the other decreases and vice-versa.

Correlation is a bivariate analysis that measures the strength of association between two
variables and the direction of the relationship. In terms of the strength of the relationship,
the correlation coefficient's value varies between +1 and -1. A value of ± 1 indicates a
perfect degree of association between the two variables.

As the correlation coefficient value goes towards 0, the relationship between the two
variables will be weaker. The coefficient sign indicates the direction of the relationship; a
+ sign indicates a positive relationship, and a - sign indicates a negative relationship.
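As a quick, hedged illustration of computing such coefficients (SciPy is assumed to be available,
and the two small samples are invented):

# Pearson and Spearman correlation on two tiny, made-up samples.
from scipy.stats import pearsonr, spearmanr

x = [2, 4, 6, 8, 10]
y = [1, 3, 7, 9, 15]          # y tends to increase with x

r, p_value = pearsonr(x, y)   # strength of the linear relationship, between -1 and +1
rho, _ = spearmanr(x, y)      # rank-based; more robust against outliers/anomalies

print(round(r, 3), round(rho, 3))  # both close to +1 here: a strong positive correlation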

Why Correlation Analysis is Important


Correlation analysis can reveal meaningful relationships between different metrics or
groups of metrics. Information about those connections can provide new insights and
reveal interdependencies, even if the metrics come from different parts of the business.

Suppose there is a strong correlation between two variables or metrics, and one of them
is being observed acting in a particular way. In that case, you can conclude that the other
one is also being affected similarly. This helps group related metrics together to reduce
the need for individual data processing.

Types of Correlation Analysis in Data Mining


Usually, in statistics, we measure four types of correlations: Pearson correlation, Kendall
rank correlation, Spearman correlation, and the Point-Biserial correlation.

1. Pearson r correlation

2. Kendall rank correlation

3. Spearman rank correlation

4. Point-Biserial correlation

Interpreting Results
Typically, the best way to gain a generalized but more immediate interpretation of the
results of a set of data is to visualize it on a scatter graph such as these:

1. Positive Correlation: Any score from +0.5 to +1 indicates a very strong positive
   correlation, which means that both variables increase together. In this case the data
   points trend upwards, indicating the positive correlation. The line of best fit, or trend
   line, is placed so as to best represent the data on the graph.

2. Negative Correlation: Any score from -0.5 to -1 indicates a strong negative


correlation, which means that as one variable increases, the other decreases
proportionally. The line of best fit can be seen here to indicate the negative
correlation. In these cases, it will slope downwards from the point of origin.

3. No Correlation: Very simply, a score of 0 indicates no correlation, or relationship,


between the two variables. This fact will stand true for all, no matter which formula
is used. The more data inputted into the formula, the more accurate the result will
be. The larger the sample size, the more accurate the result.

Outliers or anomalies must be accounted for in both correlation coefficients. Using a


scatter graph is the easiest way of identifying any anomalies that may have occurred.
Running the correlation analysis twice (with and without anomalies) is a great way to
assess the strength of the influence of the anomalies on the analysis. Spearman's Rank
coefficient may be used if anomalies are present instead of Pearson's Coefficient, as this
formula is extremely robust against anomalies due to the ranking system used.

Benefits of Correlation Analysis
Here are the following benefits of correlation analysis, such as:

1. Reduce Time to Detection: In anomaly detection, working with many metrics and
surfacing correlated anomalous metrics helps draw relationships that reduce time to
detection (TTD) and support shortened time to remediation (TTR). As data-driven
decision-making has become the norm, early and robust detection of anomalies is critical
in every industry domain, as delayed detection adversely impacts customer experience
and revenue.
2. Reduce Alert Fatigue: Another important benefit of correlation analysis in anomaly
detection is reducing alert fatigue by filtering irrelevant anomalies (based on the
correlation) and grouping correlated anomalies into a single alert. Alert storms and false
positives are significant challenges organizations face - getting hundreds, even thousands
of separate alerts from multiple systems when many of them stem from the same incident.

3. Reduce Costs: Correlation analysis helps significantly reduce the costs associated
with the time spent investigating meaningless or duplicative alerts. In addition, the time
saved can be spent on more strategic initiatives that add value to the organization.

Clustering in Data Mining


Clustering is an unsupervised machine learning technique that groups data points into
clusters so that similar objects belong to the same group.

Clustering helps to split data into several subsets. Each of these subsets contains data
similar to each other, and these subsets are called clusters. For example, once the data
from a customer base is divided into clusters, we can make an informed decision about
which customers are best suited for a particular product.

Clustering, falling under the category of unsupervised machine learning, is one of the
problems that machine learning algorithms solve.

Clustering only utilizes input data, to determine patterns, anomalies, or similarities in its
input data.

A good clustering algorithm aims to obtain clusters whose:

o The intra-cluster similarities are high, It implies that the data present inside the
cluster is similar to one another.
o The inter-cluster similarity is low, and it means each cluster holds data that is not
similar to other data.

What is a Cluster?
o A cluster is a subset of similar objects
o A subset of objects such that the distance between any of the two objects in the
cluster is less than the distance between any object in the cluster and any object
that is not located inside it.
o A connected region of a multidimensional space with a comparatively high density
of objects.

What is clustering in Data Mining?


o Clustering is the method of converting a group of abstract objects into classes of
similar objects.
o Clustering is a method of partitioning a set of data or objects into a set of significant
subclasses called clusters.
o It helps users to understand the structure or natural grouping in a data set, and it is
  used either as a stand-alone instrument to get a better insight into data distribution
  or as a pre-processing step for other algorithms.

Important points:
o Data objects of a cluster can be considered as one group.
o We first partition the information set into groups while doing cluster analysis. It is
based on data similarities and then assigns the levels to the groups.
o The main advantage of clustering over classification is that it is adaptable to
  modifications, and it helps single out important characteristics that differentiate
  between distinct groups.

Applications of cluster analysis in data mining:


o In many applications, clustering analysis is widely used, such as data analysis, market
research, pattern recognition, and image processing.
o It assists marketers in finding different groups in their client base, and based on the
  purchasing patterns, they can characterize their customer groups.
o It helps in categorizing documents on the internet for information discovery.
o Clustering is also used in tracking applications such as detection of credit card fraud.
o As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data to analyze the characteristics of each cluster.
o In terms of biology, It can be used to determine plant and animal taxonomies,
categorization of genes with the same functionalities and gain insight into structure
inherent to populations.
o It helps in the identification of areas of similar land that are used in an earth
observation database and the identification of house groups in a city according to
house type, value, and geographical location.
Why is clustering used in data mining?
Clustering analysis has been an evolving problem in data mining due to its variety of
applications. The advent of various data clustering tools in the last few years and their
extensive use in a broad range of applications, including image processing,
computational biology, mobile communication, medicine, and economics, has
contributed to the popularity of these algorithms. The main issue with data clustering
algorithms is that they cannot be standardized: an advanced algorithm may give the best
results with one type of data set but may fail or perform poorly with other kinds of data
sets. Although many efforts have been made to design algorithms that perform well in all
situations, no significant breakthrough has been achieved so far. Many clustering tools
have been proposed, but each algorithm has its own advantages and disadvantages and
cannot work in all real situations. The main requirements that a clustering algorithm
should satisfy are:

1. Scalability

2. Interpretability

3. Discovery of clusters with attribute shape

4. Ability to deal with different types of attributes

5. Ability to deal with noisy data

6. High dimensionality

Major Issues In Clustering


• The cluster membership may change over time due to dynamic shifts in data.
• Handling outliers is difficult in cluster analysis.
• Clustering struggles with high-dimensional datasets.
• Multiple correct answers for the same problem.
• Evaluating a solution’s correctness is problematic.
• Clustering computation can become complex and expensive.
• Assumption of equal feature variance in distance measures.
• Struggle with missing data (columns and points)
Partitioning Method in Data Mining
Partitioning Method: This clustering method classifies the information into multiple
groups based on the characteristics and similarity of the data. It is up to the data analysts
to specify the number of clusters that has to be generated for the clustering method. In
the partitioning method, given a database D that contains N objects, the method
constructs K user-specified partitions of the data, in which each partition represents a
cluster and a particular region. There are many algorithms that come under the
partitioning method; some of the popular ones are K-Means, PAM (K-Medoids), and the
CLARA algorithm (Clustering Large Applications). Below, we look at the working of the
K-Means algorithm in detail.

K-Means (a centroid-based technique): The K-means algorithm takes the input
parameter K from the user and partitions the dataset containing N objects into K clusters
so that the resulting similarity among the data objects inside a group (intra-cluster) is high
while the similarity of data objects with objects outside the cluster is low (inter-cluster).
The similarity of a cluster is determined with respect to the mean value of the cluster. It is
a type of squared-error algorithm. At the start, K objects are randomly chosen from the
dataset, each representing a cluster mean (centre). Each of the remaining data objects is
assigned to the nearest cluster based on its distance from the cluster mean. The new mean
of each cluster is then recalculated with the added data objects.

Algorithm: K-means

Input:

K: The number of clusters in which the dataset has to be divided

D: A dataset containing N number of objects

Output:

A dataset of K clusters

Method:

1. Randomly assign K objects from the dataset (D) as cluster centres (C)

2. (Re)assign each object to the cluster to which it is most similar, based on the mean
   values of the clusters.

3. Update cluster means, i.e., recalculate the mean of each cluster with the updated
   values.

4. Repeat Step 2 until no change occurs (a short implementation sketch follows).
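The sketch below implements the four steps with NumPy on toy data; the data, the random seed
and k = 2 are assumptions for illustration, and the rare empty-cluster case is not handled:

# A minimal NumPy sketch of the K-means steps above (toy data, k = 2).
import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])
k = 2
rng = np.random.default_rng(0)

# Step 1: randomly pick k objects from the dataset as the initial cluster centres.
centres = X[rng.choice(len(X), size=k, replace=False)]

while True:
    # Step 2: assign each object to the nearest centre (Euclidean distance).
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    labels = dists.argmin(axis=1)

    # Step 3: update the cluster means with the newly assigned objects.
    new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])

    # Step 4: repeat until no change occurs.
    if np.allclose(new_centres, centres):
        break
    centres = new_centres

print(labels)
print(centres)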

K-Medoids clustering
K-Medoids and K-Means are two types of clustering mechanisms in partition clustering.
Clustering is the process of breaking down an abstract group of data points/objects into
classes of similar objects, such that all the objects in one cluster have similar traits; a group
of n objects is broken down into k clusters based on their similarities.

Two statisticians, Leonard Kaufman and Peter J. Rousseeuw, came up with this method.
This section explains what K-Medoids does, its applications, and the difference between
K-Means and K-Medoids.

K-medoids is an unsupervised method with unlabelled data to be clustered. It is an


improvised version of the K-Means algorithm mainly designed to deal with outlier data
sensitivity. Compared to other partitioning algorithms, the algorithm is simple, fast, and
easy to implement.

The partitioning will be carried on such that:

1.Each cluster must have at least one object

2.An object must belong to only one cluster.

Medoid: A Medoid is a point in the cluster from which the sum of distances to other data
points is minimal. (or)

A Medoid is a point in the cluster from which dissimilarities with all the other points in the
clusters are minimal.

Instead of centroids as reference points in K-Means algorithms, the K-Medoids algorithm


takes a Medoid as a reference point.

There are three types of algorithms for K-Medoids clustering:

1. PAM (Partitioning Around Medoids)

2. CLARA (Clustering Large Applications)

3. CLARANS (Clustering Large Applications based upon RANdomized Search)

Algorithm:

Given the value of k and unlabelled data:

1. Choose k number of random points from the data and assign these k points to k
number of clusters. These are the initial medoids.

2. For all the remaining data points, calculate the distance from each medoid and
assign it to the cluster with the nearest medoid.

3. Calculate the total cost (Sum of all the distances from all the data points to the
medoids)

4. Select a random non-medoid point as the new medoid, swap it with the previous
   medoid, and repeat steps 2 and 3.

5. If the total cost of the new medoid is less than that of the previous medoid, make
the new medoid permanent and repeat step 4.

6. If the total cost of the new medoid is greater than the cost of the previous medoid,
undo the swap and repeat step 4.

7. The repetitions continue until no change is encountered with the new medoids, i.e.,
   the medoids no longer move and the data points keep the same classification. A brief
   sketch of this idea follows.
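Below is a simplified, hedged sketch of K-medoids in NumPy. Note that it uses an alternating
assign/update scheme rather than the full PAM swap procedure listed above, the Manhattan
distance, and made-up data with fixed initial medoids:

# A simplified K-medoids sketch (alternating assignment and medoid update).
import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 1.1],
              [8.0, 8.0], [8.5, 8.5], [25.0, 80.0]])   # the last point is an outlier
k = 2

# Pairwise Manhattan distances between all points.
D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)

medoids = np.array([0, 3])   # indices of the initial medoids (fixed here for brevity)
while True:
    labels = D[:, medoids].argmin(axis=1)       # assign every point to its nearest medoid
    new_medoids = medoids.copy()
    for j in range(k):                          # within each cluster, pick as medoid the point
        members = np.where(labels == j)[0]      # with the smallest total distance (cost)
        costs = D[np.ix_(members, members)].sum(axis=1)
        new_medoids[j] = members[costs.argmin()]
    if np.array_equal(new_medoids, medoids):    # stop when the medoids no longer change
        break
    medoids = new_medoids

print(medoids)   # medoids remain actual data points, so the outlier barely affects them
print(labels)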

Advantages of using K-Medoids:


1. Deals with noise and outlier data effectively

2. Easily implementable and simple to understand

3. Faster compared to other partitioning algorithms

Disadvantages:
1. Not suitable for Clustering arbitrarily shaped groups of data points.

2. As the initial medoids are chosen randomly, the results might vary based on the
choice in different runs.

K-Means and K-Medoids:


Common to both K-Means and K-Medoids:

• Both methods are types of partition clustering.
• Both are unsupervised iterative algorithms.
• Both have to deal with unlabelled data.
• Both group n objects into k clusters based on similar traits, where k is pre-defined.
• Inputs: unlabelled data and the value of k.

Differences:

• Metric of similarity: K-Means uses the Euclidean distance, K-Medoids uses the Manhattan distance.
• K-Means clusters based on the distance from centroids; K-Medoids clusters based on the distance
  from medoids.
• A centroid can be a data point or some other point in the cluster; a medoid is always a data point
  in the cluster.
• K-Means can't cope with outlier data; K-Medoids can manage outlier data too.
• In K-Means, outlier sensitivity can sometimes turn out to be useful; K-Medoids has a tendency to
  ignore meaningful clusters in outlier data.

Hierarchical Clustering in Data Mining


A Hierarchical clustering method works via grouping data into a tree of clusters.
Hierarchical clustering begins by treating every data point as a separate cluster. Then, it
repeatedly executes the subsequent steps:
1. Identify the two clusters which are closest together, and

2. Merge the two most similar clusters. These steps are repeated until all the clusters
   are merged together.

In hierarchical clustering, the aim is to produce a hierarchical series of nested clusters. A
diagram called a dendrogram (a tree-like diagram that records the sequences of merges
or splits) graphically represents this hierarchy; it is an inverted tree that describes the
order in which points are merged (bottom-up view) or clusters are broken up (top-down
view).

Hierarchical clustering is a method of cluster analysis in data mining that creates a


hierarchical representation of the clusters in a dataset. The method starts by treating each
data point as a separate cluster and then iteratively combines the closest clusters until a
stopping criterion is reached. The result of hierarchical clustering is a tree-like structure,
called a dendrogram, which illustrates the hierarchical relationships among the clusters.

Hierarchical clustering has a number of advantages over other clustering methods,


including:

1. The ability to handle non-convex clusters and clusters of different sizes and
   densities.

2. The ability to handle missing data and noisy data.

3. The ability to reveal the hierarchical structure of the data, which can be useful for
   understanding the relationships among the clusters.

However, it also has some drawbacks, such as:

1. The need for a criterion to stop the clustering process and determine the final
   number of clusters.

2. The computational cost and memory requirements of the method can be high,
   especially for large datasets.

3. The results can be sensitive to the initial conditions, linkage criterion, and distance
   metric used.

In summary, hierarchical clustering is a method of data mining that groups similar data
points into clusters by creating a hierarchical structure of the clusters. The method can
handle different types of data and reveal the relationships among the clusters; however,
it can have a high computational cost, and the results can be sensitive to the conditions
mentioned above.

1. Agglomerative: Initially consider every data point as an individual cluster and at
   every step merge the nearest pair of clusters (it is a bottom-up method). At first,
   every data point is considered an individual entity or cluster. At every iteration, the
   clusters merge with other clusters until one cluster is formed.

The algorithm for Agglomerative Hierarchical Clustering is:


• Calculate the similarity of one cluster with all the other clusters (calculate proximity
matrix)

• Consider every data point as an individual cluster

• Merge the clusters which are highly similar or close to each other.

• Recalculate the proximity matrix for each cluster

• Repeat Steps 3 and 4 until only a single cluster remains.

Let’s see the graphical representation of this algorithm using a dendrogram.

Note: This is just a demonstration of how the actual algorithm works; no calculation has
been performed below, and all the proximities among the clusters are assumed.

Let’s say we have six data points A, B, C, D, E, and F.

Figure – Agglomerative Hierarchical clustering

• Step-1: Consider each alphabet as a single cluster and calculate the distance of
one cluster from all the other clusters.

• Step-2: In the second step comparable clusters are merged together to form a
single cluster. Let’s say cluster (B) and cluster (C) are very similar to each other
therefore we merge them in the second step similarly to cluster (D) and (E) and at
last, we get the clusters [(A), (BC), (DE), (F)]

• Step-3: We recalculate the proximity according to the algorithm and merge the
two nearest clusters([(DE), (F)]) together to form new clusters as [(A), (BC), (DEF)]

• Step-4: Repeating the same process; The clusters DEF and BC are comparable and
merged together to form a new cluster. We’re now left with clusters [(A), (BCDEF)].

• Step-5: At last, the two remaining clusters are merged together to form a single
  cluster [(ABCDEF)]. (A SciPy sketch of this procedure follows.)
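A minimal SciPy sketch of this bottom-up procedure is given below; the 2-D coordinates assigned
to the points A to F are invented only so that the merges can be computed:

# Agglomerative hierarchical clustering of six points with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical coordinates for the points A, B, C, D, E, F.
points = np.array([[0.0, 0.0],    # A
                   [4.0, 4.0],    # B
                   [4.2, 3.9],    # C
                   [9.0, 9.0],    # D
                   [9.1, 8.8],    # E
                   [8.0, 9.5]])   # F

# Bottom-up merging; 'single' linkage merges the two closest clusters at each step.
Z = linkage(points, method="single")
print(Z)   # each row records one merge; this is the data behind a dendrogram

# Cut the tree into 3 clusters; the labels show which cluster each point belongs to.
print(fcluster(Z, t=3, criterion="maxclust"))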
2. Divisive:

We can say that Divisive Hierarchical clustering is precisely the opposite of Agglomerative
Hierarchical clustering. In Divisive Hierarchical clustering, we take into account all of the
data points as a single cluster and in every iteration, we separate the data points from the
clusters which aren’t comparable. In the end, we are left with N clusters.

Figure – Divisive Hierarchical clustering

Non Hierarchical Clustering


Non-hierarchical clustering involves the formation of new clusters by merging or splitting
the clusters. It does not follow a tree-like structure like hierarchical clustering. This
technique groups the data in order to maximize or minimize some evaluation criterion.
K-means clustering is an effective way of non-hierarchical clustering. In this method the
partitions are made such that the non-overlapping groups have no hierarchical
relationships between themselves.

Difference between Hierarchical Clustering and Non


Hierarchical Clustering:
1. Hierarchical clustering involves creating clusters in a predefined order from top to
   bottom, whereas non-hierarchical clustering involves the formation of new clusters by
   merging or splitting the clusters instead of following a hierarchical order.

2. Hierarchical clustering is considered less reliable, whereas non-hierarchical clustering
   is comparatively more reliable.

3. Hierarchical clustering is considered slower, whereas non-hierarchical clustering is
   comparatively faster.

4. It is very problematic to apply hierarchical clustering when the data has a high level
   of error, whereas non-hierarchical clustering can work better even when errors are
   present.

5. Hierarchical clustering is comparatively easier to read and understand, whereas the
   clusters produced by non-hierarchical clustering are more difficult to read and
   understand.

6. Hierarchical clustering is relatively less stable, whereas non-hierarchical clustering is
   a relatively stable technique.

UNIT-IV
Decision Tree Induction
Decision Tree is a supervised learning method used in data mining for classification and
regression tasks. It is a tree that helps us in decision-making. The decision tree creates
classification or regression models in the form of a tree structure. It separates a data set
into smaller subsets, and at the same time the decision tree is steadily developed. The
final tree is a tree with decision nodes and leaf nodes. A decision node has at least two
branches. The leaf nodes show a classification or decision; we cannot split leaf nodes any
further. The uppermost decision node in a tree, which corresponds to the best predictor,
is called the root node. Decision trees can deal with both categorical and numerical data.

Key factors:
Entropy:

Entropy refers to a common way to measure impurity. In the decision tree, it measures the
randomness or impurity in data sets.
Information Gain:

Information Gain refers to the decline in entropy after the dataset is split. It is also called
Entropy Reduction. Building a decision tree is all about discovering attributes that return
the highest information gain.

In short, a decision tree is just like a flow chart diagram with the terminal nodes showing
decisions. Starting with the dataset, we can measure the entropy to find a way to segment
the set until the data belongs to the same class.
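The two measures can be computed directly; the class counts below are made up to show the
arithmetic:

# Entropy and information gain of a candidate split (illustrative counts).
from math import log2

def entropy(counts):
    # Entropy of a class distribution given as a list of class counts.
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Parent node: 9 examples of one class and 5 of the other.
parent = [9, 5]

# A candidate attribute splits the node into two subsets.
left, right = [6, 1], [3, 4]

weighted_child_entropy = (sum(left) / sum(parent)) * entropy(left) \
                       + (sum(right) / sum(parent)) * entropy(right)

info_gain = entropy(parent) - weighted_child_entropy   # the decline in entropy after the split
print(round(entropy(parent), 3), round(info_gain, 3))  # about 0.940 and 0.151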

Why are decision trees useful?


It enables us to analyze the possible consequences of a decision thoroughly.

It provides us a framework to measure the values of outcomes and the probability of


accomplishing them.

It helps us to make the best decisions based on existing data and best speculations.

In other words, we can say that a decision tree is a hierarchical tree structure that can be
used to split an extensive collection of records into smaller sets of the class by
implementing a sequence of simple decision rules. A decision tree model comprises a set
of rules for portioning a huge heterogeneous population into smaller, more
homogeneous, or mutually exclusive classes. The attributes of the classes can be any
variables from nominal, ordinal, binary, and quantitative values, in contrast, the classes
must be a qualitative type, such as categorical or ordinal or binary. In brief, the given data
of attributes together with its class, a decision tree creates a set of rules that can be used
to identify the class. One rule is implemented after another, resulting in a hierarchy of
segments within a segment. The hierarchy is known as the tree, and each segment is called
a node. With each progressive division, the members from the subsequent sets become
more and more similar to each other. Hence, the algorithm used to build a decision tree
is referred to as recursive partitioning. The algorithm is known as CART (Classification and
Regression Trees)

Consider the given example of a factory where:

Expanding the factory costs $3 million; the probability of a good economy is 0.6 (60%),
which leads to $8 million profit, and the probability of a bad economy is 0.4 (40%), which
leads to $6 million profit.

Not expanding the factory costs $0; the probability of a good economy is 0.6 (60%), which
leads to $4 million profit, and the probability of a bad economy is 0.4 (40%), which leads
to $2 million profit.

The management teams need to take a data-driven decision to expand or not based on
the given data.

Net Expand = (0.6 * 8 + 0.4 * 6) - 3 = $4.2M

Net Not Expand = (0.6 * 4 + 0.4 * 2) - 0 = $3M

Since $4.2M > $3M, the factory should be expanded.

Decision tree Algorithm:


The decision tree algorithm may appear long, but it is quite simple. The basics of the
algorithm technique are as follows:

The algorithm is based on three parameters: D, attribute_list, and


Attribute _selection_method.

Generally, we refer to D as a data partition.

Initially, D is the entire set of training tuples and their related class levels (input training
data).

The parameter attribute_list is a set of attributes defining the tuples.


Attribute_selection_method specifies a heuristic process for choosing the attribute that
"best" discriminates the given tuples according to class.

Attribute_selection_method process applies an attribute selection measure.
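For illustration, a minimal, hedged sketch of inducing such a tree with scikit-learn follows; the
tiny numeric training partition D and the attribute encoding are invented:

# Inducing a decision tree from a small training partition D with scikit-learn.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy training tuples with two attributes and their class labels.
D_attributes = np.array([[0, 1], [0, 0], [1, 1], [2, 0], [2, 1], [1, 0]])
D_classes = np.array([1, 0, 1, 0, 1, 0])

# criterion="entropy" makes information gain the attribute selection measure.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(D_attributes, D_classes)

print(tree.predict(np.array([[0, 1]])))   # classify a new tuple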

Advantages of using decision trees:


A decision tree does not need scaling of information.

Missing values in the data also do not influence the process of building a decision tree to
any considerable extent.

A decision tree model is automatic and simple to explain to the technical team as well as
stakeholders.

Compared to other algorithms, decision trees need less exertion for data preparation
during pre-processing.

A decision tree does not require a standardization of data.

Pruning decision trees


Pruning means to change the model by deleting the child nodes of a branch node. The
pruned node is regarded as a leaf node. Leaf nodes cannot be pruned.

A decision tree consists of a root node, several branch nodes, and several leaf nodes.

• The root node represents the top of the tree. It does not have a parent node, however,
it has different child nodes.
• Branch nodes are in the middle of the tree. A branch node has a parent node and
several child nodes.

• Leaf nodes represent the bottom of the tree. A leaf node has a parent node. It does
not have child nodes.

The color of the pruned nodes is a shade brighter than the color of unpruned nodes, and the
decision next to the pruned nodes is represented in italics. In contrast to collapsing nodes to
hide them from the view, pruning actually changes the model.

You can manually prune the nodes of the tree by selecting the check box in
the Pruned column. When the node is pruned, the lower levels of the node are collapsed. If
you expand a collapsed node by clicking on the node icon, the collapsed nodes are displayed.
You can specify the prune level. The prune level determines that all nodes with a level
smaller than the specified prune level are unpruned, and all nodes with a level equal or
greater than the specified prune level are pruned. For example, if you specify a prune level
of 3, all nodes with level 1 and 2 are unpruned, and all nodes with level 3 or greater are
pruned.

The computed prune level is the original prune state of the tree classification model. This
means that some of the branch nodes might be pruned by the Tree Classification mining
function, or none of the branch nodes might be pruned at all. Resetting to the computed
prune level removes the manual pruning that you might ever have done to the tree
classification model.

Data Mining - Rule Based Classification


IF-THEN Rules
Rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule
in the following from −
IF condition THEN conclusion

Let us consider a rule R1,

R1: IF age = youth AND student = yes


THEN buy_computer = yes

Points to remember −
• The IF part of the rule is called rule antecedent or precondition.
• The THEN part of the rule is called rule consequent.
• The antecedent part, the condition, consists of one or more attribute tests, and these
  tests are logically ANDed.
• The consequent part consists of the class prediction.

Note − We can also write rule R1 as follows −

R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)
If the condition holds true for a given tuple, then the antecedent is satisfied.

Rule Extraction
Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a decision
tree.
Points to remember −
To extract a rule from a decision tree −
• One rule is created for each path from the root to the leaf node.
• To form a rule antecedent, each splitting criterion is logically ANDed.
• The leaf node holds the class prediction, forming the rule consequent. (A small sketch
  of extracting such rules from a trained tree is shown below.)
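The sketch below shows one way to list such IF-THEN paths from a trained tree using
scikit-learn's export_text helper; the tiny age/student-style data set is invented for illustration:

# Extracting IF-THEN style rules from a decision tree with export_text.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy tuples: [age_code, student_flag] -> buys_computer (0 = no, 1 = yes).
X = np.array([[0, 1], [0, 0], [1, 1], [2, 0], [2, 1], [1, 0]])
y = np.array([1, 0, 1, 0, 1, 0])

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Each printed root-to-leaf path corresponds to one rule: the ANDed split
# conditions form the antecedent, and the leaf class forms the consequent.
print(export_text(tree, feature_names=["age", "student"]))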

Rule Induction Using Sequential Covering Algorithm


Sequential Covering Algorithm can be used to extract IF-THEN rules from the training data. We do
not need to generate a decision tree first. In this algorithm, each rule for a given class covers
many of the tuples of that class.
Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general strategy,
the rules are learned one at a time. Each time a rule is learned, the tuples covered by the rule are
removed, and the process continues for the rest of the tuples.
Note − Decision tree induction, by contrast, can be considered as learning a set of rules
simultaneously, because the path to each leaf in a decision tree corresponds to a rule.
The following is the sequential learning algorithm where rules are learned for one class at a time.
When learning a rule for a class Ci, we want the rule to cover all the tuples from class Ci only and
no tuple from any other class.

Algorithm: Sequential Covering

Input:
D, a data set of class-labeled tuples,
Att_vals, the set of all attributes and their possible values.

Output: A Set of IF-THEN rules.


Method:
Rule_set = { };   // initial set of rules learned is empty
for each class c do
    repeat
        Rule = Learn_One_Rule(D, Att_vals, c);
        remove tuples covered by Rule from D;
        Rule_set = Rule_set + Rule;   // add a new rule to the rule set
    until termination condition;
end for
return Rule_set;


Rule Pruning
A rule is pruned due to the following reasons −
• The Assessment of quality is made on the original set of training data. The rule may
perform well on training data but less well on subsequent data. That's why the rule
pruning is required.
• The rule is pruned by removing a conjunct. The rule R is pruned if the pruned version of
  R has greater quality than R itself, as assessed on an independent set of tuples.
FOIL is one of the simple and effective methods for rule pruning. For a given rule R,

FOIL_Prune(R) = (pos - neg) / (pos + neg)

where pos and neg are the number of positive and negative tuples covered by R, respectively.
Note − This value will increase with the accuracy of R on the pruning set. Hence, if the FOIL_Prune
value is higher for the pruned version of R, then we prune R.
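A tiny numeric sketch of this pruning test follows; the pos/neg counts are made up for
illustration:

# FOIL_Prune for a rule R and for its pruned version (illustrative counts).
def foil_prune(pos, neg):
    # (pos - neg) / (pos + neg) over the tuples covered by the rule.
    return (pos - neg) / (pos + neg)

original = foil_prune(pos=90, neg=30)   # rule R evaluated on an independent pruning set
pruned = foil_prune(pos=85, neg=15)     # R with one conjunct removed

print(original, pruned)                 # 0.5 versus 0.7
if pruned > original:
    print("FOIL_Prune is higher for the pruned version, so we prune R")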

Creating a Decision Tree


Some of the decision tree algorithms include Hunt's Algorithm, ID3, C4.5, and CART.

Example of Creating a Decision Tree


(Example is taken from Data Mining: Concepts and Techniques by Han and Kamber)

#1) Learning Step: The training data is fed into the system to be analyzed by a classification
algorithm. In this example, the class label is the attribute i.e. “loan decision”. The model built from
this training data is represented in the form of decision rules.
#2) Classification: The test dataset is fed to the model to check the accuracy of the classification
rules. If the model gives acceptable results, then it is applied to a new dataset with unknown class
variables.
Decision Tree Induction Algorithm

UNIT-V
Data Mining Software
Introduction to Data Mining Software
Data mining is a process of analyzing data, identifying patterns, and converting
unstructured data into structured data ( data organized in rows and columns) to use it for
business-related decision making. It is a process to extract extensive unstructured data from
various databases. Data mining is an interdisciplinary science that has mathematics and
computer science algorithms used by a machine. Data Mining Software helps the user to
analyze data from different databases and detect patterns. Data mining tools’ primary aim
is to find, extract, and refine data and then distribute the information.
Features of Data Mining Software
Below are the different features of Data Mining Software:
• Easy to use: Data mining software has easy to use Graphical User Interface (GUI)
to help the user analyze data efficiently.
• Pre-processing: Data pre-processing is a necessary step. It includes data cleaning,
data transformation, data normalization, and data integration.
• Scalable processing: Data mining software permits scalable processing, i.e., the
software is scalable on the size of the data and users.
• High Performance: Data mining software increases the performance capabilities
and creates an environment that generates results quickly.
• Anomaly Detection: They help to identify unusual data that might have errors or
need further investigation.
• Association Rule Learning: Data mining software uses association rule learning,
  which identifies the relationships between variables.
• Clustering: It is a process of grouping the data that are similar in some way or
other.
• Classification: It is the process of generalizing the known structure and then
applying it to new data.
• Regression: It is the task of estimating the relationships between datasets or data.

• Data Summarization: Data mining tools are capable of compressing or


summarizing the data into an informative representation. This software provides
interactive data preparation tools.
Different Data Mining Software
Below are some of the top data mining software:
1. Orange Data Mining
It is an open-source data analysis and visualization tool. In this, data mining is done through
Python scripting and visual programming. In addition, it contains features for data analytics
and components for machine learning and text mining.
2. R Software Environment
R is a free software environment for graphics and statistical computing. It can run on
various UNIX platforms, MacOS and Windows. It is a suite of software facilities for
calculation, graphical display, and data manipulation.
3. Weka Data Mining
It is a collection of algorithms of machine learning to perform data mining tasks. The
algorithms can be called using Java code, or they can be directly applied to the dataset.
It is written in Java and contains features like machine learning, preprocessing, data
mining, clustering, regression, classification, visualization, and attribute selection.
4. SpagoBI Business Intelligence
It is an open-source business intelligence suite. It offers advanced data visualization
features, an extensive range of analytical functions, and a functional semantic layer. The
various modules of the SpagoBI suite are SpagoBI Studio, SpagoBI SDK, SpagoBI Server,
and SpagoBI Meta.
5. Anaconda
It is an open data science platform. It is a high-performance distribution of R and
Python. It includes R, Scala, and Python for data mining, stats, deep learning, simulation
and optimization, Natural language processing, and image analysis.
6. Shogun
It is an open-source, free toolbox. It has various data structures and algorithms for
machine learning problems. Its primary focus is on kernel machines like support vector
machines. It allows the user to combine algorithm classes, multiple data representations,
and general-purpose tools easily. It allows the full implementation of Hidden Markov
Models.
7. DataMelt
It is software for statistics, numeric computation, scientific visualization, and analysis of
big data. It is a computational platform. It can use different programming languages on
various operating systems.
8. Natural Language Toolkit
It is a platform for implementing python programs to work with human language data. It
has easy to use interface. It provides resources such as WordNet and has a suite of text
processing libraries and a discussion forum. It is useful for students, engineers,
researchers, linguists, and industry users.
9. Apache Mahout
Its main aim is to create an environment for building scalable machine learning
applications quickly. It contains various algorithms for Apache Spark, Scala, and Apache
Flink. It is implemented on Apache Hadoop and uses MapReduce Paradigm.
10. GNU Octave
It represents a high-level language built for numerical computations. It works on a
command-line interface and allows users to solve linear and nonlinear problems
numerically using a language compatible with Matlab. It offers features like visualization
tools. It runs on Windows, macOS, GNU/Linux, and BSD.
11. RapidMiner Starter Edition:
It provides an integrated environment for machine learning, data preparation, text mining, and
deep learning. It is used for commercial and business applications, research, training,
education, and rapid prototyping. It supports data preparation, model visualization, and
optimization.
12. GraphLab Create
It is a machine learning platform to create a predictive application that includes data
cleaning, training the model and developing features. These applications provide
predictions for use cases of fraud detection, sentiment analysis, and churn prediction.
13. Lavastorm Analytics Engine
It is a visual data discovery solution that permits the rapid integration of diverse data and
continuously detects outliers and anomalies. It offers a self-service capability for business
users. It provides features to transform, acquire, and combine data without pre-planning
and scripting.
14. Scikit-learn
It is an open-source machine learning library for the Python programming language. It provides
different classification, clustering and regression algorithms, including random forests,
k-means, and support vector machines. It is built to work with Python libraries like
NumPy and SciPy.

Text Data Mining


Text data mining can be described as the process of extracting essential data from standard
language text. All the data that we generate via text messages, documents, emails, files are
written in common language text. Text mining is primarily used to draw useful insights or
patterns from such data.

The text mining market has experienced exponential growth and adoption over the last few
years and is also expected to gain significant growth and adoption in the coming future. One
of the primary reasons behind the adoption of text mining is higher competition in the
business market, with many organizations seeking value-added solutions to compete with other
organizations. With increasing competition in business and changing customer perspectives,
organizations are making huge investments to find solutions that are capable of analyzing
customer and competitor data to improve competitiveness. The primary sources of data are
e-commerce websites, social media platforms, published articles, surveys, and many more.
The larger part of the generated data is unstructured, which makes it challenging and
expensive for organizations to analyze with the help of people alone. This challenge, together
with the exponential growth in data generation, has led to the growth of analytical tools that
are not only able to handle large volumes of text data but also help in decision-making.
Text mining software empowers a user to draw useful information from a huge set of
available data sources.

Text mining can be used to extract structured information from unstructured text data
such as:
Named Entity Recognition (NER): Identifying and classifying named entities such as
people, organizations, and locations in text data.
Sentiment Analysis: Identifying and extracting the sentiment (e.g. positive, negative,
neutral) of text data.
Text Summarization: Creating a condensed version of a text document that captures the
main points.
Topic Modeling: Identifying the main topics present in a document or collection of
documents.
Text Classification: Assigning predefined categories to text data. A minimal classification
sketch follows below.
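The sketch below shows a minimal, hedged text-classification pipeline; scikit-learn and the tiny
labelled corpus are assumptions made only for illustration:

# Classifying short texts into predefined categories with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["great product, works perfectly",
        "terrible quality, broke after a day",
        "really happy with this purchase",
        "awful support and slow delivery"]
labels = ["positive", "negative", "positive", "negative"]

# TF-IDF turns the unstructured text into numeric features; naive Bayes classifies them.
model = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(docs, labels)

print(model.predict(["slow delivery and broken packaging"]))   # most likely 'negative'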
Issues in Text Mining
Numerous issues happen during the text mining process:
1. The efficiency and effectiveness of decision-making.

2. Uncertainty can arise at an intermediate stage of text mining. In the preprocessing
   stage, different rules and guidelines are defined to normalize the text, which makes the
   text mining process efficient. Prior to applying pattern analysis on a document, the
   unstructured data needs to be converted into an intermediate structured form.
3. Sometimes original message or meaning can be changed due to alteration.
4. Another issue in text mining is many algorithms and techniques support multi-
language text. It may create ambiguity in text meaning. This problem can lead to false-
positive results.
5. The use of synonyms, polysemy, and antonyms in the document text creates issues
   for text mining tools that treat them in the same way. It is difficult to categorize such
   kinds of text/words.

ADVANTAGES OR DISADVANTAGES:
Advantages of Text Mining:
1. Large Amounts of Data: Text mining allows organizations to extract insights from
large amounts of unstructured text data. This can include customer feedback,
social media posts, and news articles.
2. Variety of Applications: Text mining has a wide range of applications, including
sentiment analysis, named entity recognition, and topic modeling. This makes it a
versatile tool for organizations to gain insights from unstructured text data.
3. Improved Decision Making: Text mining can be used to extract insights from
unstructured text data, which can be used to make data-driven decisions.
4. Cost-effective: Text mining can be a cost-effective way to extract insights from
unstructured text data, as it eliminates the need for manual data entry.
Disadvantages of Text Mining:
1. Complexity: Text mining can be a complex process that requires advanced skills in
natural language processing and machine learning.
2. Quality of Data: The quality of text data can vary, which can affect the accuracy of
the insights extracted from text mining.
3. High computational cost: Text mining requires high computational resources, and
it may be difficult for smaller organizations to afford the technology.
4. Limited to text data: Text mining is limited to extracting insights from unstructured
text data and cannot be used with other data types.

Parser
A parser is the compiler component that is used to break the stream of tokens coming from the
lexical analysis phase into smaller elements.

A parser takes input in the form of a sequence of tokens and produces output in the form of a
parse tree.

Parsing is of two types: top down parsing and bottom up parsing.


Top down parsing
o Top down parsing is also known as recursive parsing or predictive parsing.
o Top down parsing is used to construct a parse tree for an input string.
o In top down parsing, the parsing starts from the start symbol and transforms it into the
  input symbol.

Parse Tree representation of input string "acdb" is as follows:

Bottom up parsing
o Bottom up parsing is also known as shift-reduce parsing.
o Bottom up parsing is used to construct a parse tree for an input string.
o In bottom up parsing, the parsing starts with the input symbol and constructs the parse
  tree up to the start symbol by tracing out the rightmost derivations of the string in reverse.

Soft Parse (%)

A soft parse is recorded when the Oracle Server checks the shared pool for a SQL statement
and finds a version of the statement that it can reuse.

This metric represents the percentage of parse requests where the cursor was already in the
cursor cache compared to the number of total parses. This ratio provides an indication as to
how often the application is parsing statements that already reside in the cache as compared
to hard parses of statements that are not in the cache.

This test checks the percentage of soft parse requests to total parse requests. If the value is
less than or equal to the threshold values specified by the threshold arguments, and the
number of occurrences exceeds the value specified in the "Number of Occurrences" parameter,
then a warning or critical alert is generated.
Hard Parses (per second)

This metric represents the number of hard parses per second during this sample period. A hard
parse occurs when a SQL statement has to be loaded into the shared pool. In this case, the
Oracle Server has to allocate memory in the shared pool and parse the statement.

Each time a particular SQL cursor is parsed, this count will increase by one. There are certain
operations that will cause a SQL cursor to be parsed. Parsing a SQL statement breaks it down
into atomic steps, which the optimizer will evaluate when generating an execution plan for the
cursor.

This test checks the number of parses of statements that were not already in the cache. If the
value is greater than or equal to the threshold values specified by the threshold arguments,
and the number of occurrences exceeds the value specified in the "Number of Occurrences"
parameter, then a warning or critical alert is generated.
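
As a simple worked illustration of how the soft parse percentage can be derived from raw parse
counters, here is a small Python sketch; the counter values and the threshold are made up, and
this is only a sketch of the arithmetic, not Oracle's own reporting code:

def soft_parse_pct(total_parses, hard_parses):
    """Percentage of parse requests satisfied from the cursor cache (soft parses)."""
    if total_parses == 0:
        return 100.0
    return (total_parses - hard_parses) / total_parses * 100.0

# Hypothetical sample-period counters, for illustration only
total, hard = 10_000, 250
pct = soft_parse_pct(total, hard)
print(f"soft parse %: {pct:.1f}")   # soft parse %: 97.5

# Mimic the alerting rule described above: warn if the ratio drops
# below an assumed threshold value.
WARNING_THRESHOLD = 90.0
if pct <= WARNING_THRESHOLD:
    print("warning: excessive hard parsing")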

What is Web Mining?


Web mining can broadly be seen as the application of adapted data mining techniques to the
web, whereas data mining is defined as the application of algorithms to discover patterns in
mostly structured data within a knowledge discovery process. Web mining has the distinctive
property of dealing with a variety of data types. The web has multiple aspects that yield
different approaches for the mining process: web pages consist of text, web pages are linked
via hyperlinks, and user activity can be monitored via web server logs. These three features
lead to the differentiation between three areas: web content mining, web structure mining,
and web usage mining.

There are three types of web mining:

1. Web Content Mining:

Web content mining can be used to extract useful data, information, knowledge from the web
page content. In web content mining, each web page is considered as an individual document.
The individual can take advantage of the semi-structured nature of web pages, as HTML provides
information that concerns not only the layout but also logical structure. The primary task of
content mining is data extraction, where structured data is extracted from unstructured websites.
The objective is to facilitate data aggregation over various web sites by using the extracted
structured data. Web content mining can also be utilized to distinguish topics on the web. For
example, when a user searches for a specific topic on a search engine, content mining helps
generate the list of suggestions shown to the user.
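
A minimal sketch of the data-extraction step described above, using Python's standard
html.parser module on a made-up page fragment to pull out the visible text of a page so it
can be treated as an individual document:

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the visible text of a web page, ignoring script/style content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

# A made-up page fragment, standing in for a fetched web document
html = "<html><body><h1>Laptops</h1><p>Best deals on laptops.</p><script>x=1</script></body></html>"
extractor = TextExtractor()
extractor.feed(html)
print(" ".join(extractor.parts))   # Laptops Best deals on laptops.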

2. Web Structure Mining:

Web structure mining can be used to discover the link structure of hyperlinks, i.e., to identify
how data and web pages are connected through direct links or a larger link network. In web
structure mining, an individual considers the web as a directed graph, with the web pages being
the vertices that are connected by hyperlinks. The most important application in this regard is
the Google search engine, which estimates the ranking of its results primarily with the PageRank
algorithm. It characterizes a page as highly relevant when it is frequently linked to by other
highly relevant pages. Structure and content mining methodologies are usually combined. For
example, web structure mining can help organizations analyze the link network between two
commercial sites.
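
Since PageRank is the flagship example of web structure mining, the following is a highly
simplified PageRank iteration in Python over a tiny made-up link graph; the damping factor and
iteration count are arbitrary illustrative choices, not Google's actual implementation:

# Tiny made-up web graph: page -> pages it links to
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Very simplified PageRank: repeatedly redistribute rank along outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outlinks in links.items():
            share = rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")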

3. Web Usage Mining:

Web usage mining is used to extract useful data, information, and knowledge from weblog
records, and it assists in recognizing user access patterns for web pages. In mining the usage
of web resources, one considers the records of requests made by visitors to a website, which
are often collected as web server logs. While the content and structure of the collection of web
pages follow the intentions of the authors of the pages, the individual requests demonstrate how
the consumers see these pages. Web usage mining may therefore disclose relationships that were
not intended by the creator of the pages.

Some of the methods to identify and analyze the web usage patterns are given below:

I. Session and visitor analysis:

The analysis of preprocessed data can be accomplished through session analysis, which
incorporates visitor records, days, times, sessions, and so on. This data can be utilized to
analyze the visitors' behavior.

A report is created after this analysis that contains the details of frequently visited web
pages, common entry points, and common exit points.
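
A minimal sketch (with made-up log records and an assumed 30-minute inactivity timeout) of how
session analysis might group preprocessed web server log entries into visitor sessions:

from datetime import datetime, timedelta

# Made-up preprocessed log records: (visitor_id, timestamp, page)
log = [
    ("u1", datetime(2023, 5, 1, 10, 0), "/home"),
    ("u1", datetime(2023, 5, 1, 10, 5), "/products"),
    ("u1", datetime(2023, 5, 1, 11, 30), "/home"),     # new session (gap > 30 min)
    ("u2", datetime(2023, 5, 1, 10, 2), "/home"),
]

def sessionize(records, timeout=timedelta(minutes=30)):
    """Group page requests into sessions per visitor using an inactivity timeout."""
    sessions = {}
    last_seen = {}
    for visitor, ts, page in sorted(records, key=lambda r: (r[0], r[1])):
        if visitor not in sessions or ts - last_seen[visitor] > timeout:
            sessions.setdefault(visitor, []).append([])   # start a new session
        sessions[visitor][-1].append(page)
        last_seen[visitor] = ts
    return sessions

print(sessionize(log))
# {'u1': [['/home', '/products'], ['/home']], 'u2': [['/home']]}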

II. OLAP (Online Analytical Processing):

OLAP performs a multidimensional analysis of complex data.

OLAP can be performed on various parts of log-related data over a specific period.
OLAP tools can be used to derive important business intelligence metrics.

Challenges in Web Mining:


The web presents great challenges for resource and knowledge discovery, based on the
following observations:

o The complexity of web pages: Web pages do not have a unifying structure. They are
far more complex than traditional text documents. There are enormous numbers of
documents in the digital library of the web, and these libraries are not organized
in any particular order.

o The web is a dynamic data source: The data on the internet is quickly updated. For
example, news, climate, shopping, financial news, sports, and so on.

o Diversity of client networks: The client network on the web is quickly expanding. These
clients have different interests, backgrounds, and usage purposes. There are over a
hundred million workstations connected to the internet, and the number is still
increasing tremendously.

o Relevancy of data: A specific person is generally interested in only a small portion of
the web, while the rest of the web contains data that is unfamiliar to the user and may
lead to unwanted results.

o The web is too broad: The size of the web is tremendous and rapidly increasing. It
appears that the web is too huge for data warehousing and data mining.

Application of Web Mining:


Web mining has extensive applications because of the wide variety of uses of the web. A list
of some applications of web mining is given below.

o Marketing and conversion tools.
o Data analysis of website and application performance.
o Audience behavior analysis.
o Advertising and campaign performance analysis.
o Testing and analysis of a site.
WEB MINING PROCESS FOR KNOWLEDGE DISCOVERY

Web mining is the application of data mining techniques to extract knowledge from web data,
including web documents, hyperlinks between documents, and usage logs of web sites.

Web mining comprises four different steps:

• Resource identification, in which the resources needed for information extraction are
identified.

• Pre-processing, in which relevant information is selected from the identified information
sources. This step is directly related to information extraction techniques.

• Generalization, in which automatic pattern discovery is performed across several web
documents. This step uses data mining techniques as well as clustering and
classification trees.

• Analysis, in which the discovered patterns are validated and interpreted.

The primary objective of a web mining process is to discover interesting patterns and rules from
data collected within the web space. In order to apply generic data mining techniques and
algorithms to web data, these data must be transformed into a suitable form. The idea is to
connect specific research domains such as Information Retrieval, Information Extraction, Text
Mining, and so on, and to put them together in an innovative workflow process that defines
several phases and steps; moreover, these phases can share common activities, facilitating
reuse and standardization.

Generally, web mining is the application of data mining algorithms and techniques to large web
data repositories. Web usage mining refers to the automatic discovery and analysis of
generalized patterns that describe user navigation paths (e.g., click streams), collected or
generated as a result of user interactions with a website. Constraint-based data mining
algorithms have been applied in web usage mining, and software tools have been developed for
it. One of the most common algorithms applied in web usage mining is the Apriori algorithm,
and web user navigation patterns can be represented by the association rules it produces.
Sequence mining can also be used to mine web user navigation patterns. Such rules hold
forward-looking information about the sequence of requested pages (e.g., if a user visits page
A and then page C, the user will next visit page D). Based on this, user activity can be
characterized and a prediction of the next page can be calculated. Sequence mining algorithms
inherit much from association mining algorithms for pattern discovery.
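
As a rough, simplified sketch of the idea behind mining navigation patterns from sessions (not
a full Apriori or sequence mining implementation), the following Python snippet counts how
often one page follows another in made-up session data and uses the counts to predict the most
likely next page:

from collections import Counter

# Made-up visitor sessions (ordered page requests)
sessions = [
    ["A", "C", "D"],
    ["A", "C", "D"],
    ["A", "B"],
    ["B", "C", "D"],
]

# Count ordered page pairs (page -> next page), i.e. length-2 sequential patterns
transitions = Counter()
for session in sessions:
    for current, nxt in zip(session, session[1:]):
        transitions[(current, nxt)] += 1

def predict_next(page):
    """Predict the most frequently observed next page after `page`."""
    candidates = {nxt: c for (cur, nxt), c in transitions.items() if cur == page}
    return max(candidates, key=candidates.get) if candidates else None

print(transitions)          # e.g. ('C', 'D') occurs 3 times
print(predict_next("C"))    # D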
