DATA MINING IN BUSINESS INTELLIGENCE

UNIT -1

Business intelligence

Business intelligence combines business analytics, data mining, data visualization, data tools and infrastructure, and best practices to help organizations make more data-driven decisions.

DATA MINING

Data mining can be seen as the precursor to business intelligence. Upon collection,
data is often raw and unstructured, making it challenging to draw conclusions. Data
mining decodes these complex datasets, and delivers a cleaner version for the
business intelligence team to derive insights.

Stages of data mining

The Process Is More Important Than the Tool

STATISTICA Data Miner divides the modeling screen into four general phases of
data mining: (1) data acquisition; (2) data cleaning, preparation, and
transformation; (3) data analysis, modeling, classification, and forecasting; and
(4) reports.

Importance of data mining

Data mining tools include powerful statistical, mathematical, and analytics capabilities whose primary purpose is to sift through large sets of data to identify trends, patterns, and relationships to support informed decision-making and planning.

Techniques of data mining

There are numerous crucial data mining techniques to consider when entering the data
field, but some of the most prevalent methods include clustering, data cleaning,
association, data warehousing, machine learning, data visualization,
classification, neural networks, and prediction.

What are the eight data mining methods?

Eight Essential Data Mining Techniques for Your Business

 Correlation analysis.
 Classification.
 Outlier detection.
 Clustering.
 Sequential patterning.
 Data visualization.
 Neural networking.
 Computational advertising.

TEXT MINING

Text mining, also known as text data mining, is the process of transforming
unstructured text into a structured format to identify meaningful patterns and
new insights.

Text mining is an automatic process that uses natural language processing to extract valuable insights from unstructured text. By transforming data into information that machines can understand, text mining automates the process of classifying texts by sentiment, topic, and intent.
Text mining/analysis activities or tasks

 Document classification.
 Information retrieval (e.g., search engines).
 Corpora comparison (e.g., political speeches).
 Entity recognition/extraction (e.g., geoparsing).
 Data visualization.

Why do we use text mining?

Widely used in knowledge-driven organizations, text mining is the process of examining


large collections of documents to discover new information or help answer specific
research questions. Text mining identifies facts, relationships and assertions that would
otherwise remain buried in the mass of textual big data.

Equipped with Natural Language Processing (NLP), text mining tools are used to analyze all
types of text, from survey responses and emails to tweets and product reviews, helping
businesses gain insights and make data-based decisions.

Data scientists analyze text using advanced data science techniques. The data extracted from text can reveal customer sentiment toward a subject or unearth other insights. Text analytics (also called text mining) and natural language processing (NLP) technology can be applied in several ways for this purpose.

What Is Text Mining?

We have already defined what text mining is. For academic purposes, let's define it again. It is a multi-disciplinary field based on information retrieval, data mining, machine learning, statistics, and computational linguistics. Unlike data stored in databases, text is unstructured, ambiguous, and challenging to process. Text mining applies techniques such as summarization, classification, and clustering to extract knowledge from natural language text, which is stored in semi-structured and unstructured formats.

Text mining techniques are routinely used in areas such as search engines, customer relationship management systems, email filtering, product suggestion analysis, fraud detection, and social media analytics for opinion mining, feature extraction, sentiment, predictive, and trend analysis.

In general, text mining uses four different methods:

1. Term-Based Method

In this method, a document is analyzed based on the terms it contains. A term may have some value or meaning in a context, and each term is associated with a value known as its weight. This method, however, has two problems: 1. polysemy (a term having many possible meanings) and 2. synonymy (multiple words having the same meaning).
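
As an illustration of term weighting (a minimal sketch only, not part of the original notes, assuming scikit-learn is installed), the snippet below computes TF-IDF weights for a tiny document collection:

    # Minimal term-weighting sketch using TF-IDF (assumes scikit-learn is installed).
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "data mining extracts patterns from data",
        "text mining analyzes unstructured text",
        "business intelligence uses data mining",
    ]

    vectorizer = TfidfVectorizer()
    weights = vectorizer.fit_transform(docs)   # rows = documents, columns = terms

    # Print each term with its weight in the first document.
    for term, idx in sorted(vectorizer.vocabulary_.items()):
        w = weights[0, idx]
        if w > 0:
            print(f"{term}: {w:.3f}")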

2. Phrase-Based Method

As the name indicates, this method analyzes a document based on phrases, which carry more information than single terms because they are collections of semantic terms. Phrases are more descriptive and less ambiguous than single terms. This method is not without problems, however; its performance can vary for three reasons:

1. Phrases have inferior statistical properties compared to terms.

2. Phrases occur with low frequency.

3. There are redundant and noisy phrases.

3. Concept-Based Method

In the concept-based method, terms are analyzed at the sentence or document level. Rather than analyzing a single term in isolation, this model evaluates the significance of a term within the sentence or document in which it appears. The model contains three components:

1. Examining the semantic structure of sentences.

2. Building a conceptual ontological graph to describe the semantic structures.

3. Extracting top concepts based on the first two components to build feature vectors using the standard vector space model.

4. Pattern Taxonomy Method

In the pattern-based model, a document is analyzed based on patterns, i.e., relations between terms that form a taxonomy (a tree-like structure). The pattern-based approach can improve the accuracy of the system for evaluating term weights because discovered patterns are more specific than whole documents.

Patterns can be discovered using data mining techniques such as closed pattern mining, sequential pattern mining, frequent itemset mining, and association rule mining. The pattern-based technique uses two processes, pattern deploying (PDM) and pattern evolving, to refine the patterns discovered in text documents.
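
As a rough illustration of the pattern-discovery idea (a hypothetical sketch using only the Python standard library, not a specific algorithm from the notes), the snippet below counts frequent term pairs across documents, which is the simplest form of frequent itemset mining:

    # Toy frequent-pair mining over tokenized documents (standard library only).
    from itertools import combinations
    from collections import Counter

    docs = [
        {"loan", "approval", "credit"},
        {"loan", "credit", "income"},
        {"credit", "income", "fraud"},
    ]

    min_support = 2                      # a pair must occur in at least 2 documents
    pair_counts = Counter()
    for terms in docs:
        for pair in combinations(sorted(terms), 2):
            pair_counts[pair] += 1

    frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
    print(frequent_pairs)                # e.g. {('credit', 'loan'): 2, ('credit', 'income'): 2}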

Every text mining process follows these steps:

 Collecting information: textual data in semi-structured or unstructured formats is gathered from various sources.

 Conversion into structured data: pre-processing cleans the collected data.

 Pattern identification: the text mining techniques discussed later are applied to extract meaningful information.

 Pattern analysis: the extracted patterns are analyzed to derive knowledge and meaning.

 Advanced analysis: finally, the resulting knowledge can be used for further analysis.


There are several text mining tasks performed while analyzing the text. They are:

 Clustering

 Factor analysis

 Text classification

 Text purification

 Text summarization

 Distributed storage and retrieval

 Find similar documents

 Find an association between terms

 Find commonly occurring terms.

Popular Text Mining Techniques

1. Information Extraction (IE)

Information extraction (IE) is a technique for automatically extracting definite, structured information from unstructured or semi-structured text using natural language processing. It is used to extract entities from text, such as names of persons, organizations, and locations, as well as the relationships between entities, their attributes, and events.

The extracted information is well-organized (structured) and stored in a database for further use. IE extracts specific attributes and entities from the document and establishes their relationships. The relevance of the extracted results is checked and evaluated using the measures of precision and recall.
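
For instance, entity extraction of the kind described above can be sketched with an off-the-shelf NLP library. The example below is not part of the original notes; it assumes spaCy and its small English model are installed, and the input sentence is purely illustrative:

    # Minimal named-entity extraction sketch (assumes: pip install spacy
    # and python -m spacy download en_core_web_sm).
    import spacy

    nlp = spacy.load("en_core_web_sm")
    text = "Acme Corp opened a new office in Chennai, led by Priya Raman."

    doc = nlp(text)
    for ent in doc.ents:
        # ent.label_ is the entity type, e.g. ORG, GPE (location), PERSON
        print(ent.text, "->", ent.label_)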

2. Information Retrieval (IR)

Information retrieval (IR) refers to finding and collecting relevant information from a variety of resources, usually documented in an unstructured format. It is a set of methods for expressing users' information needs as queries, which are used to fetch documents from a collection or database. IR helps to extract relevant and associated patterns for a given set of words or phrases.

3. Text Categorization

This technique involves assigning predefined categories to free-text documents. The purpose of text classification (text categorization) is to increase the detection of information that can lead to better decisions. For example, news stories are typically organized by subject categories (topics) or geographical codes, and academic papers are often classified by technical domain and subdomain.

At the same time, patient reports in healthcare organizations are often indexed from multiple aspects, using taxonomies of disease categories, types of surgical procedures, insurance reimbursement codes, and so on. Another widespread application of text categorization is spam filtering, where email messages are classified into the two categories of spam and non-spam.
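
A minimal text-categorization sketch for the spam-filtering example is shown below. It is not part of the original notes; it assumes scikit-learn is installed, and the tiny training set is purely illustrative:

    # Toy spam/non-spam classifier: TF-IDF features + Naive Bayes (assumes scikit-learn).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_texts = [
        "win a free prize now", "cheap loans click here",       # spam
        "meeting moved to monday", "please review the report",  # non-spam
    ]
    train_labels = ["spam", "spam", "ham", "ham"]

    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)

    print(model.predict(["claim your free prize", "see you at the meeting"]))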

4. Document Clustering

This technique is used to find groups of documents with similar content. It makes use of
descriptors and descriptor extraction that are essentially sets of words that describe the
contents within the cluster. It is an unsupervised process responsible for classifying objects
into groups called clusters, which consist of several documents. Dividing similar text into the
same cluster forms the basis of this method.

Any labels associated with objects are obtained solely from the data. The advantage of this
technique is that it ensures that no document is missed from search results since documents
can emerge in numerous subtopics. For example, if clustering is performed on a collection of news articles, it can make sure that similar documents are kept closer to each other or lie in the same cluster.
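
A minimal document-clustering sketch along these lines (an illustration only, assuming scikit-learn is installed; the documents are hypothetical) groups TF-IDF vectors with k-means:

    # Minimal document-clustering sketch: TF-IDF vectors + k-means (assumes scikit-learn).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = [
        "stock markets rallied on strong earnings",
        "central bank raises interest rates",
        "local team wins the championship final",
        "star striker signs a new contract",
    ]

    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    for doc, label in zip(docs, labels):
        print(label, doc)   # documents with similar content should share a cluster id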

5. Text Visualization

Text visualization is a technique that represents large amounts of textual information in a visual map layout, which provides enhanced browsing capabilities along with simple searching. In text mining, visualization methods can improve and simplify the discovery of relevant information.

Text flags are used to indicate the category of individual documents or groups of documents, and colors are used to show density. Visual text mining places large textual sources in an appropriate visual hierarchy, which helps the user interact with the documents by scaling and zooming.

Is text mining the same as NLP?

NLP and text mining differ in the goals for which they are used. NLP is used to understand human language by analyzing text, speech, or grammatical syntax. Text mining is used to extract information from unstructured and structured content, and it focuses on structure rather than the meaning of the content.

WEB MINING

Web mining is the application of data mining techniques to automatically discover and extract information from Web documents and services. The main purpose of web mining is to discover useful information from the World Wide Web and its usage patterns.

What is Web Mining?

Web mining is the use of data mining techniques to extract knowledge from web data.

● Web data includes :

○ web documents

○ hyperlinks between documents

○ usage logs of web sites

● The WWW is a huge, widely distributed, global information service centre and therefore constitutes a rich source for data mining.

Data Mining vs Web Mining

● Data Mining: the process of identifying significant patterns in data that lead to better outcomes.

● Web Mining: the process of performing data mining on the web, i.e., extracting web documents and discovering patterns from them.

WEB CONTENT MINING

● Mining, extraction and integration of useful data, information and knowledge from Web page content.

● Web content mining is related to, but different from, data mining and text mining.

● Web data are mainly semi-structured and/or unstructured, while data mining deals primarily with structured data.
WEB STRUCTURE MINING

● Web structure mining is the process of discovering structure information from the web.

● The structure of a typical web graph consists of Web pages as nodes and hyperlinks as edges connecting related pages.
WEB USAGE MINING

● Web usage mining: automatic discovery of patterns in clickstreams and associated data collected or generated as a result of user interactions with one or more Web sites.

● Goal: analyze the behavioral patterns and profiles of users interacting with a Web site.

● The discovered patterns are usually represented as collections of pages, objects, or resources that are frequently accessed by groups of users with common interests.

● Data in Web Usage Mining:

a. Web server logs

b. Site contents

c. Data about the visitors, gathered from external channels
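
A toy sketch of working with such usage data (not from the original notes; the clickstream records are hypothetical and use only the Python standard library) counts page popularity and per-user sessions:

    # Toy web-usage-mining sketch: summarize page visits from server-log-like records.
    from collections import Counter, defaultdict

    # Hypothetical clickstream records: (user/IP, requested page)
    log = [
        ("10.0.0.1", "/home"), ("10.0.0.1", "/products"), ("10.0.0.2", "/home"),
        ("10.0.0.1", "/cart"), ("10.0.0.2", "/products"), ("10.0.0.2", "/products"),
    ]

    page_hits = Counter(page for _, page in log)          # overall page popularity
    sessions = defaultdict(list)                          # pages visited per user
    for user, page in log:
        sessions[user].append(page)

    print(page_hits.most_common(3))
    print(dict(sessions))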

SPATIAL DATA MINING WITH EXAMPLES

Spatial data mining is societally important, with applications in public health, public safety, climate science, etc. For example, in epidemiology, spatial data mining helps to find areas with a high concentration of disease incidents in order to manage disease outbreaks.

Spatial data mining is inexorably linked to developments in Geographical Information Systems (GIS). Such systems store spatially referenced data and allow the user to extract information on contiguous regions and investigate spatial patterns.

What are spatial data types?

Spatial data are of two types according to the storing technique, namely, raster data
and vector data.

Geographic Information Systems (GIS) have various industrial applications, and technological advancements have significantly enhanced GIS data, specifically how it can be used and what can be achieved as a result.

Clustering is the most widely used technique for spatial data mining. Related techniques include density-based spatial clustering of applications with noise (DBSCAN), varied-density-based spatial clustering of applications with noise, and partitioning around medoids, typically applied on top of a database management system.
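
A minimal spatial-clustering sketch in this spirit (an illustration only, assuming scikit-learn and NumPy are installed; the coordinates are hypothetical) uses DBSCAN to find dense areas of incidents:

    # Minimal spatial-clustering sketch with DBSCAN (assumes scikit-learn and NumPy).
    import numpy as np
    from sklearn.cluster import DBSCAN

    # Hypothetical (x, y) locations of disease incidents: two dense areas plus one outlier.
    points = np.array([
        [1.0, 1.1], [1.2, 0.9], [0.9, 1.0],     # hotspot A
        [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],     # hotspot B
        [9.0, 0.5],                             # isolated incident (noise)
    ])

    labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)
    print(labels)   # e.g. [0 0 0 1 1 1 -1]; -1 marks noise points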

What are the characteristics of spatial data?

Spatial data refers to the shape, size and location of the feature. Non-spatial data refers to other attributes associated with the feature, such as name, length, area, volume, population, soil type, etc.

Why spatial data is important?

Spatial data can help us make better predictions about human behaviour and
understand what variables may influence an individual's choices. By performing spatial
analysis on our communities, we can ensure that neighbourhoods are accessible and usable
by everyone.

Geographic Information Systems are powerful decision-making tools for any business or industry, since they allow the analysis of environmental, demographic, and topographic data. Data intelligence compiled from GIS applications helps companies, various industries, and consumers make informed decisions.

 1. Mapping
 2. Telecom and Network Services
 3. Accident Analysis and Hot Spot Analysis
 4. Urban planning
 5. Transportation Planning
 6. Environmental Impact Analysis
 7. Agricultural Applications
 8. Disaster Management and Mitigation
 9. Navigation
 10. Flood damage estimation
 11. Natural Resources Management
 12. Banking
 13. Taxation
 14. Surveying
 15. Geology
 16. Assets Management and Maintenance
 17. Planning and Community Development
 18. Dairy Industry
 19. Irrigation Water Management
 20. Pest Control and Management

PROCESS MINING

Process mining is a method similar to data mining, used to analyze and monitor business
processes. The software helps organizations to capture data from enterprise transactions and
provides important insights on how business processes are performing.

Data mining analyzes static information. In other words: data that is available at the
time of analysis. Process mining on the other hand looks at how the data was actually
created. Process mining techniques also allow users to generate processes dynamically based
on the most recent data.

There are three main classes of process mining techniques: process discovery, conformance
checking, and process enhancement.

Process mining enables business leaders to gain a holistic view of their processes, spot
inefficiencies and identify improvement opportunities, including automation.

Some examples of such activities are receiving an order, submitting a piece of documentation, approving a loan, entering information into a health record, etc. Process mining software transforms the digital records into event logs.
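
To make the idea of an event log concrete, the toy sketch below (an illustration only, using the Python standard library; the activities are hypothetical) derives simple directly-follows relations, the raw material of process discovery:

    # Toy process-discovery sketch: derive directly-follows relations from an event log.
    from collections import Counter

    # Hypothetical event log: one list of activities per case (e.g. per loan application).
    event_log = [
        ["receive order", "check credit", "approve loan", "notify customer"],
        ["receive order", "check credit", "reject loan"],
        ["receive order", "approve loan", "notify customer"],
    ]

    follows = Counter()
    for trace in event_log:
        for a, b in zip(trace, trace[1:]):
            follows[(a, b)] += 1

    # The counts sketch a simple process map: which step follows which, and how often.
    for (a, b), count in follows.most_common():
        print(f"{a} -> {b}: {count}")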

Comparing data mining and process mining

Data mining and process mining share a number of commonalities, but they are different.

Both data mining and process mining fall under the umbrella of business intelligence. Both
use algorithms to understand big data and may also use machine learning. Both can help
businesses improve performance.

However, the two areas are distinct. Process mining is more concerned with how information
is generated and how that fits into a process as a whole, whereas data mining relies on data
that's available. Data mining is more concerned with the what -- that is, the patterns
themselves -- while process mining seeks to answer the why. As part of that, process mining
is concerned with exceptions and the story those exceptions help to tell about the holistic
answer, while data mining discards exceptions, as outliers can prevent finding the dominant
patterns.

DATA WAREHOUSING

Data warehousing is a method of organizing and compiling data into one database, whereas data mining deals with fetching important data from databases. Data mining attempts to uncover meaningful patterns by relying on the data compiled in the data warehouse.

What is data warehousing explain?

A data warehouse is a type of data management system that is designed to enable and
support business intelligence (BI) activities, especially analytics. Data warehouses are
solely intended to perform queries and analysis and often contain large amounts of historical
data.

What is data warehouse with example?

Data Warehousing integrates data and information collected from various sources into
one comprehensive database. For example, a data warehouse might combine customer
information from an organization's point-of-sale systems, its mailing lists, website, and
comment cards.

Types of Data Warehouse

Three main types of Data Warehouses (DWH) are:

1. Enterprise Data Warehouse (EDW):

An Enterprise Data Warehouse (EDW) is a centralized warehouse. It provides decision support services across the enterprise, offers a unified approach for organizing and representing data, and provides the ability to classify data according to subject and give access according to those divisions.

2. Operational Data Store:

An Operational Data Store (ODS) is a data store that is required when neither a data warehouse nor OLTP systems support an organization's reporting needs. An ODS is refreshed in real time, so it is widely preferred for routine activities such as storing employee records.

3. Data Mart:

A data mart is a subset of the data warehouse. It is specially designed for a particular line of business, such as sales or finance. In an independent data mart, data can be collected directly from source systems.

Who needs Data warehouse?

DWH (Data warehouse) is needed for all types of users like:

 Decision makers who rely on massive amounts of data.
 Users who use customized, complex processes to obtain information from multiple data sources.
 People who want a simple technology to access the data.
 People who want a systematic approach for making decisions.
 Users who need fast performance on huge amounts of data for reports, grids, or charts.
 Users who want to discover ‘hidden patterns’ in data flows and groupings; a data warehouse is the first step.

What Is a Data Warehouse Used For?

Here, are most common sectors where Data warehouse is used:

Airline:

In the airline industry, it is used for operational purposes such as crew assignment, analysis of route profitability, frequent flyer program promotions, etc.

Banking:

It is widely used in the banking sector to manage the resources available on the desk effectively. Some banks also use it for market research and for performance analysis of products and operations.

Healthcare:

The healthcare sector also uses data warehouses to strategize and predict outcomes, generate patients' treatment reports, and share data with tie-in insurance companies, medical aid services, etc.

Public sector:

In the public sector, the data warehouse is used for intelligence gathering. It helps government agencies maintain and analyze tax records and health policy records for every individual.

Investment and Insurance sector:

In this sector, the warehouses are primarily used to analyze data patterns, customer trends,
and to track market movements.

Retail chain:

In retail chains, the data warehouse is widely used for distribution and marketing. It also helps to track items, customer buying patterns, and promotions, and it is used for determining pricing policy.

Telecommunication:

A data warehouse is used in this sector for product promotions, sales decisions and to make
distribution decisions.

Hospitality Industry:

This Industry utilizes warehouse services to design as well as estimate their advertising and
promotion campaigns where they want to target clients based on their feedback and travel
patterns.

DATA MART

A data mart is a simple form of data warehouse focused on a single subject or line of
business. With a data mart, teams can access data and gain insights faster, because they don't
have to spend time searching within a more complex data warehouse or manually
aggregating data from different sources.

Three basic types of data marts are dependent, independent, and hybrid. The
categorization is based primarily on the data source that feeds the data mart. Dependent data
marts draw data from a central data warehouse that has already been created.

What is ETL data mart?

Extract, transform, and load (ETL) is a process for integrating and transferring information from various data sources into a single physical database. Data marts use ETL to retrieve information from external sources when it does not come from a data warehouse.
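
A minimal ETL sketch in this spirit is shown below. It is an illustration only, assuming pandas is installed; the raw records, column names, and the "sales_mart.db" / "fact_sales" names are hypothetical:

    # Minimal ETL sketch for populating a data mart (assumes pandas; uses SQLite via sqlite3).
    import sqlite3
    import pandas as pd

    # Extract: hypothetical raw sales records from a source system.
    raw = pd.DataFrame({
        "order_id": [1, 2, 3],
        "amount":   ["100.5", "250.0", None],
        "region":   ["south", "South", "NORTH"],
    })

    # Transform: drop incomplete rows, fix types, standardize values.
    clean = raw.dropna(subset=["amount"]).copy()
    clean["amount"] = clean["amount"].astype(float)
    clean["region"] = clean["region"].str.title()

    # Load: write the cleaned table into the (hypothetical) sales data mart.
    with sqlite3.connect("sales_mart.db") as conn:
        clean.to_sql("fact_sales", conn, if_exists="replace", index=False)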

Steps in Implementing a Datamart

Implementing a Data Mart is a rewarding but complex procedure. Here are the detailed steps
to implement a Data Mart:

Designing

Designing is the first phase of Data Mart implementation. It covers all the tasks from initiating the request for a data mart to gathering information about the requirements. Finally, we create the logical and physical Data Mart design.

The design step involves the following tasks:

 Gathering the business & technical requirements and Identifying data sources.
 Selecting the appropriate subset of data.
 Designing the logical and physical structure of the data mart.

Data can be partitioned based on the following criteria:

 Date
 Business or Functional Unit
 Geography
 Any combination of above

Data can be partitioned at the application or DBMS level. However, it is recommended to partition at the application level, as this allows different data models each year as the business environment changes.

Constructing

This is the second phase of implementation. It involves creating the physical database and the
logical structures.

This step involves the following tasks:

 Implementing the physical database designed in the earlier phase. For instance, database schema objects such as tables, indexes, and views are created.

What Products and Technologies Do You Need?

You need a relational database management system to construct a data mart. RDBMS have
several features that are required for the success of a Data Mart.

 Storage management: An RDBMS stores and manages the data to create, add, and
delete data.
 Fast data access: With a SQL query you can easily access data based on certain
conditions/filters.

 Data protection: The RDBMS also offers a way to recover from system failures such as power failures, and it allows restoring data from backups in case a disk fails.
 Multiuser support: The data management system offers concurrent access, the ability for multiple users to access and modify data without interfering with or overwriting changes made by another user.
 Security: The RDBMS also provides a way to regulate users' access to objects and to certain types of operations.

Populating:

In the third phase, data is populated into the data mart.

The populating step involves the following tasks:

 Mapping source data to target data
 Extraction of source data
 Cleaning and transformation operations on the data
 Loading data into the data mart
 Creating and storing metadata

What Products and Technologies Do You Need?

You accomplish these population tasks using an ETL (Extract Transform Load) Tool. This
tool allows you to look at the data sources, perform source-to-target mapping, extract the
data, transform, cleanse it, and load it back into the data mart.

In the process, the tool also creates some metadata relating to things like where the data came
from, how recent it is, what type of changes were made to the data, and what level of
summarization was done.

Accessing

Accessing is the fourth step; it involves putting the data to use: querying the data, creating reports and charts, and publishing them. End users submit queries to the database and display the results of those queries.

The accessing step needs to perform the following tasks:

 Set up a meta layer that translates database structures and object names into business terms, so that non-technical users can access the Data mart easily.
 Set up and maintain database structures.
 Set up API and interfaces if required

What Products and Technologies Do You Need?

You can access the data mart using the command line or GUI. GUI is preferred as it can
easily generate graphs and is user-friendly compared to the command line.

Managing

This is the last step of the Data Mart implementation process. It covers management tasks such as:

 Ongoing user access management.

 System optimization and fine-tuning to achieve enhanced performance.
 Adding and managing fresh data in the data mart.
 Planning recovery scenarios and ensuring system availability in case the system fails.

UNIT – 2

Data mining refers to extracting or mining knowledge from large amounts of data. The term is actually a misnomer; the process would more appropriately be named knowledge mining, which emphasizes mining knowledge from large amounts of data. It is the computational process of discovering patterns in large data sets using methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. It is also defined as the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from a huge amount of data. Data mining is a rapidly growing field concerned with developing techniques that help managers and decision-makers make intelligent use of huge data repositories.

Alternative names for Data Mining :


1. Knowledge discovery (mining) in databases (KDD)
2. Knowledge extraction
3. Data/pattern analysis
4. Data archaeology
5. Data dredging
6. Information harvesting
7. Business intelligence

KDD, SEMMA AND CRISP-DM DESCRIPTION

The term knowledge discovery in databases, or KDD for short, was coined in 1989 to refer to the broad process of finding knowledge in data, and to emphasize the “high-level” application of particular data mining methods (Fayyad et al, 1996). Fayyad considers DM to be one of the phases of the KDD process, and considers that the data mining phase concerns mainly the means by which patterns are extracted and enumerated from data. Here the concern is with the overall KDD process, which is described below.

SEMMA was developed by the SAS Institute. CRISP-DM was developed through the efforts of a consortium initially composed of DaimlerChrysler, SPSS and NCR; both are described below. Although SEMMA and CRISP-DM are usually referred to as methodologies, they are referred to here as processes, in the sense that they consist of a particular course of action intended to achieve a result.

The KDD process

The KDD process, as presented in (Fayyad et al, 1996), is the process of using DM methods to extract what is deemed knowledge according to the specification of measures and thresholds, using a database along with any required pre-processing, sub-sampling, and transformation of that database. Five stages are considered, presented in figure 1:
1. Selection – This stage consists of creating a target data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.
2. Pre-processing – This stage consists of cleaning and pre-processing the target data in order to obtain consistent data.
3. Transformation – This stage consists of transforming the data using dimensionality reduction or transformation methods.
4. Data Mining – This stage consists of searching for patterns of interest in a particular representational form, depending on the data mining objective (usually prediction).
5. Interpretation/Evaluation – This stage consists of interpreting and evaluating the mined patterns.

Figure 1. The five stages of KDD

The KDD process is interactive and iterative, involving numerous steps with many decisions made by the user (Brachman, Anand, 1996).
Additionally, the KDD process must be preceded by developing an understanding of the application domain, the relevant prior knowledge and the goals of the end-user. It must also be followed by knowledge consolidation, incorporating the discovered knowledge into the system (Fayyad et al, 1996).

The SEMMA process

The SEMMA process was developed by the SAS Institute. The acronym SEMMA stands for Sample, Explore, Modify, Model, Assess, and refers to the process of conducting a data mining project. The SAS Institute considers a cycle with five stages for the process:
1. Sample – This stage consists of sampling the data by extracting a portion of a large data set big enough to contain the significant information, yet small enough to manipulate quickly. This stage is pointed out as being optional.
2. Explore – This stage consists of exploring the data by searching for unanticipated trends and anomalies in order to gain understanding and ideas.
3. Modify – This stage consists of modifying the data by creating, selecting, and transforming the variables to focus the model selection process.
4. Model – This stage consists of modeling the data by allowing the software to search automatically for a combination of data that reliably predicts a desired outcome.
5. Assess – This stage consists of assessing the data by evaluating the usefulness and reliability of the findings from the data mining process and estimating how well it performs.
Although the SEMMA process is independent of the chosen DM tool, it is linked to the SAS Enterprise Miner software and is intended to guide the user in implementing DM applications.
SEMMA offers an easy-to-understand process, allowing organized and adequate development and maintenance of DM projects.

The CRISP-DM process

The CRISP-DM process was developed through the efforts of a consortium initially composed of DaimlerChrysler, SPSS and NCR. CRISP-DM stands for CRoss-Industry Standard Process for Data Mining. It consists of a cycle that comprises six stages (figure 2):
1. Business understanding – This initial phase focuses on understanding the project
objectives and requirements from a business perspective, then converting this
knowledge into a data mining problem definition and a preliminary plan
designed to achieve the objectives.
2. Data understanding – The data understanding phase starts with an initial data
collection and proceeds with activities in order to get familiar with the data, to
identify data quality problems, to discover first insights into the data or to detect
interesting subsets to form hypotheses for hidden information.
3. Data preparation – The data preparation phase covers all activities to construct
the final dataset from the initial raw data.
4. Modeling – In this phase, various modeling techniques are selected and applied
and their parameters are calibrated to optimal values.
5. Evaluation – At this stage the model (or models) obtained are more thoroughly
evaluated and the steps executed to construct the model are reviewed to be
certain it properly achieves the business objectives.
6. Deployment – Creation of the model is generally not the end of the project. Even
if the purpose of the model is to increase knowledge of the data, the knowledge
gained will need to be organized and presented in a way that the customer can
use it.

The sequence of the six stages is not rigid, as is schematized in figure 2. CRISP-DM is extremely complete and well documented. All its stages are duly organized, structured and defined, allowing a project to be easily understood or revised (Santos & Azevedo, 2005). Although the CRISP-DM process is independent of the chosen DM tool, it is linked to the SPSS Clementine software.

A COMPARATIVE STUDY

At first glance, a comparison of the KDD and SEMMA stages suggests that they are equivalent:
 Sample can be identified with Selection,
 Explore can be identified with Pre-processing,
 Modify can be identified with Transformation,
 Model can be identified with Data Mining,
 Assess can be identified with Interpretation/Evaluation.
Examining them more thoroughly, we may affirm that the five stages of the SEMMA process can be seen as a practical implementation of the five stages of the KDD process, since SEMMA is directly linked to the SAS Enterprise Miner software.
Comparing the KDD stages with the CRISP-DM stages is not as straightforward as in the SEMMA case. Nevertheless, we can first observe that the CRISP-DM methodology incorporates the steps that, as referred to above, must precede and follow the KDD process, that is to say:
 The Business Understanding phase can be identified with the development of an understanding of the application domain, the relevant prior knowledge and the goals of the end-user.
 The Deployment phase can be identified with knowledge consolidation, by incorporating this knowledge into the system.
Concerning the remaining stages, we can say that:
 The Data Understanding phase can be identified as the combination of Selection and Pre-processing.

 The Data Preparation phase can be identified with Transformation.

 The Modeling phase can be identified with Data Mining.

 The Evaluation phase can be identified with Interpretation/Evaluation.

Table 1 presents a summary of this correspondence.

CONCLUSIONS AND FUTURE WORK

Considering the presented analysis, we conclude that SEMMA and CRISP-DM can be viewed as implementations of the KDD process described by (Fayyad et al, 1996). At first sight, one might conclude that CRISP-DM is more complete than SEMMA. However, analyzing it more deeply, we can integrate the development of an understanding of the application domain, the relevant prior knowledge and the goals of the end-user into the Sample stage of SEMMA, because the data cannot be sampled unless there is a true understanding of all these aspects. With respect to consolidation by incorporating the discovered knowledge into the system, we can assume that it is present, because it is ultimately the reason for doing the project. This leads to the conclusion that standards have been achieved concerning the overall process: SEMMA and CRISP-DM do guide people in how DM can be applied in practice in real systems.
In the future, we intend to analyze other aspects related to DM standards, namely SQL-based languages for DM as well as XML-based languages for DM. As a complement, we intend to investigate the existence of other standards for DM.

DOMAIN SPECIFIC ANALYTICS

Domain analytics refers to the collective set of analytics that are applied across all
industry verticals and business processes. Data and analytics leaders should optimize their
domain-specific capabilities for achieving success in their industry.

Types of Data Mining

 Predictive Data Mining Analysis.


 Descriptive Data Mining Analysis.

domain-specific data

The domain-specific data schemas contain the final specializations of entities. Entities defined in this layer are self-contained and cannot be referenced by any other layer. The domain-specific layer organizes definitions according to industry discipline.

PREDICTIVE PERFORMANCE

Predictive Models Performance Evaluation is Important

The choice of metrics influences how the performance of a predictive model is measured and compared. But metrics can also be deceiving. If we are not using metrics that correctly measure how accurately the model predicts our problem, we might be fooled into thinking that we have built a robust model. Let's look at an example to understand why that can be a problem and how predictive analytics can cope with it.

Take, for example, the prediction of a rare disease that occurs in 1% of the population. If we use a metric that only tells us how often the model makes the correct prediction, we might end up with 98% or 99% accuracy, because the model will be right 99% of the time simply by predicting that a person does not have the disease. That, however, is not the point of the model.

Instead, we might want to use a metric that evaluates only the true positives and the false
negatives, and determines how good the model is at prediction of the case of the disease.
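
The toy sketch below (not part of the original notes; it assumes scikit-learn is installed and uses made-up labels) shows how accuracy can look excellent while recall exposes the failure on the rare class:

    # Why accuracy misleads on rare events: a toy 1%-disease example (assumes scikit-learn).
    from sklearn.metrics import accuracy_score, recall_score, precision_score

    # 1000 hypothetical patients, 10 with the disease (label 1).
    y_true = [1] * 10 + [0] * 990
    y_pred = [0] * 1000            # a model that always predicts "no disease"

    print("accuracy :", accuracy_score(y_true, y_pred))   # 0.99, looks impressive
    print("recall   :", recall_score(y_true, y_pred))     # 0.0, misses every sick patient
    print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0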

Proper evaluation of predictive performance is also important because we want our model to show the same predictive behaviour across many different data sets. In other words, the results need to be comparable, measurable and reproducible, which are important factors in heavily regulated industries such as insurance and the healthcare sector.

Let’s now dive into prediction performance, the most commonly used metrics, their use
cases, and their limitations.

How to Evaluate Model Performance and What Metrics to Choose

All problems a predictive model can solve fall into one of two categories: classification problems or regression problems. Depending on which category your business challenge falls into, you will need different metrics to evaluate your model.

That is why it is important to first determine what overall business goal or business problem needs to be solved. That will be the starting point for your data science team to choose the metrics and, ultimately, to determine what a good model is.

Classification Problems
A classification problem is about predicting what category something falls into. An example
of a classification problem is analyzing medical data to determine if a patient is in a high risk
group for a certain disease or not.

Metrics that can be used to evaluate a classification model:

 Percent correct classification (PCC): measures overall accuracy. Every error has the same weight.
 Confusion matrix: also measures accuracy, but distinguishes between types of errors, i.e., false positives, false negatives and correct predictions.

Both of these metrics are good to use when every data entry needs to be scored. For example,
if every customer who visits a website needs to be shown customized content based on their
browsing behavior, every visitor will need to be categorized.

If, however, you only need to act upon results connected to a subset of your data – for
example, if you aim to identify high churn clients to interact with, or, as in the earlier
example, predict a rare disease – you might want to use the following metrics:

 Area Under the ROC Curve (AUC – ROC): one of the most widely used evaluation metrics. It is popular because it rewards models that rank positive predictions higher than negative ones, and the ROC curve is independent of changes in the proportion of responders.

 Lift and Gain charts: both charts measure the effectiveness of a model by calculating the ratio between the results obtained with and without the predictive model. In other words, these metrics examine whether using the predictive model has any positive effect.

Regression Problems
A regression problem is about predicting a quantity. A simple example of a regression
problem is prediction of the selling price of a real estate property based on its attributes
(location, square meters available, condition, etc.).

To evaluate how good your regression model is, you can use the following metrics:

 R-squared: indicates the proportion of the variation in the target variable that the model explains. R-squared does not take into consideration any biases that might be present in the data; therefore, a good model might have a low R-squared value, and a model that does not fit the data might have a high R-squared value.
 Average error: the average of the numerical differences between the predicted and the actual values.
 Mean Square Error (MSE): the average of the squared differences between predicted and actual values; it gives extra weight to large errors.
 Median error: the median of the differences between the predicted and the actual values.
 Average absolute error: similar to the average error, but it uses the absolute value of each difference so that positive and negative errors do not cancel out.
 Median absolute error: the median of the absolute differences between prediction and actual observation; because it is a median, large outliers have little effect on the final evaluation of the model.

MSE and RMSE

Mean Squared Error (MSE) and Root Mean Square Error (RMSE) are error measures based on the following error concept, where x_i represents the i-th actual value of a time series and m_i is the value forecasted by the model for the same position in the series:

e_i = x_i − m_i

Since positive and negative errors tend to cancel each other out, we take the squares of these differences and average them. The result is the Mean Square Error, calculated over the N differences between actual and predicted values:

MSE = (1/N) · Σ e_i² = (1/N) · Σ (x_i − m_i)²

One of the MSE’s limitations is that the unit of measurement of the error is the square of the unit of measurement of the data (the MSE expresses the error in square meters if the data are measured in meters). To convert the error back to the data's unit of measurement, we take the square root of the MSE and obtain the Root Mean Square Error:

RMSE = √MSE = √( (1/N) · Σ (x_i − m_i)² )

MAE

MSE and RMSE are constructed in such a way that they give greater weight to large errors than to small errors (because the errors are squared). To give equal weight to large and small errors, we can instead average the absolute values of the errors and obtain the Mean Absolute Error:

MAE = (1/N) · Σ |x_i − m_i|

The MAE does not give larger errors a higher weight, but, when used as a loss function in a machine learning model, it can cause convergence issues during the learning phase.
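
A short sketch of these three measures (an illustration only, assuming NumPy is installed; the actual and forecasted values are made up):

    # Computing MSE, RMSE and MAE from the error e_i = x_i - m_i (assumes NumPy).
    import numpy as np

    x = np.array([10.0, 12.0, 15.0, 20.0])   # actual values
    m = np.array([11.0, 11.5, 16.0, 18.0])   # forecasted values

    errors = x - m
    mse  = np.mean(errors ** 2)              # squares remove sign, penalize large errors
    rmse = np.sqrt(mse)                      # back in the data's unit of measurement
    mae  = np.mean(np.abs(errors))           # equal weight to large and small errors

    print(mse, rmse, mae)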

Precision, Recall and F1 Score

MAE, MSE and RMSE are widely used measures. However, relying entirely on these measures may not be appropriate in some cases. For example, if the forecast is made to support trading decisions, e.g., whether or not to buy a stock on a particular day, we are less interested in predicting the exact value for the next day than in predicting whether the change in value will be positive or negative; that is, we are interested in predicting the direction of change. In this case, we can use a confusion matrix.

In a binary confusion matrix we can have the following four cases:

1. True positive (T+): actual positive, predicted positive;

2. True negative (T-): actual negative, predicted negative;

3. False positive (F+): actual negative, predicted positive;

4. False negative (F-): actual positive, predicted negative.

We can count the number of cases belonging to each of these categories and arrange them in a 2 x 2 table, with one row per actual class and one column per predicted class, so that the cells hold T+, F-, F+ and T-. From these counts, precision is T+ / (T+ + F+), recall is T+ / (T+ + F-), and the F1 score is the harmonic mean of the two: F1 = 2 · (precision · recall) / (precision + recall).
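
The sketch below (an illustration only, assuming scikit-learn is installed; the up/down labels are made up) computes the confusion matrix and the three derived scores:

    # Confusion matrix and the derived precision, recall and F1 score (assumes scikit-learn).
    from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

    y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = value goes up, 0 = value goes down
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("T+", tp, "T-", tn, "F+", fp, "F-", fn)

    print("precision:", precision_score(y_true, y_pred))   # T+ / (T+ + F+)
    print("recall   :", recall_score(y_true, y_pred))      # T+ / (T+ + F-)
    print("F1 score :", f1_score(y_true, y_pred))          # harmonic mean of the two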

 Mean absolute percentage error (MAPE)

Expresses accuracy as a percentage of the error. Because this number is a percentage, it can
be easier to understand than the other statistics.

 Mean absolute deviation (MAD)

Expresses accuracy in the same units as the data, which helps conceptualize the amount of
error. Outliers have less of an effect on MAD than on MSD.

CONFUSION MATRIX

A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. The confusion matrix itself is relatively simple to understand, but the related terminology can be confusing.

ROC CURVE

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:

 True Positive Rate

 False Positive Rate

True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:

TPR = TP / (TP + FN)

False Positive Rate (FPR) is defined as follows:

FPR = FP / (FP + TN)

An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both false positives and true positives. A typical ROC curve bows up toward the top-left corner of the plot.

To compute the points in an ROC curve, we could evaluate a logistic regression model many
times with different classification thresholds, but this would be inefficient. Fortunately,
there's an efficient, sorting-based algorithm that can provide this information for us, called
AUC.

AUC: Area Under the ROC Curve

AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the ROC curve (think integral calculus) from (0,0) to (1,1).

AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example. In other words, if the examples are arranged from left to right in ascending order of the model's predictions, AUC represents the probability that a randomly chosen positive example is positioned to the right of a randomly chosen negative example.

AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC
of 0.0; one whose predictions are 100% correct has an AUC of 1.0.

AUC is desirable for the following two reasons:

 AUC is scale-invariant. It measures how well predictions are ranked, rather than their
absolute values.

 AUC is classification-threshold-invariant. It measures the quality of the model's predictions irrespective of what classification threshold is chosen.

However, both these reasons come with caveats, which may limit the usefulness of AUC in
certain use cases:

 Scale invariance is not always desirable. For example, sometimes we really do need well
calibrated probability outputs, and AUC won’t tell us about that.
 Classification-threshold invariance is not always desirable. In cases where there are wide
disparities in the cost of false negatives vs. false positives, it may be critical to minimize one
type of classification error. For example, when doing email spam detection, you likely want
to prioritize minimizing false positives (even if that results in a significant increase of false
negatives). AUC isn't a useful metric for this type of optimization.
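
A short ROC/AUC sketch (an illustration only, assuming scikit-learn is installed; the labels and scores are made up) shows how the curve points and the aggregate AUC are obtained from predicted scores:

    # ROC/AUC sketch from predicted probabilities (assumes scikit-learn).
    from sklearn.metrics import roc_curve, roc_auc_score

    y_true   = [0, 0, 1, 1, 0, 1, 0, 1]
    y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]   # model's positive-class scores

    fpr, tpr, thresholds = roc_curve(y_true, y_scores)   # TPR vs. FPR at each threshold
    auc = roc_auc_score(y_true, y_scores)

    for f, t, th in zip(fpr, tpr, thresholds):
        print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
    print("AUC =", auc)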

Why do we perform k-fold cross-validation?

Cross-validation is usually used in machine learning for improving model prediction when
we don't have enough data to apply other more efficient methods like the 3-way split (train,
validation and test) or using a holdout dataset.

Cross validation is a model evaluation method that is better than residuals. The problem with
residual evaluations is that they do not give an indication of how well the learner will do
when it is asked to make new predictions for data it has not already seen. One way to
overcome this problem is to not use the entire data set when training a learner. Some of the
data is removed before training begins. Then when training is done, the data that was
removed can be used to test the performance of the learned model on "new" data. This is the
basic idea for a whole class of model evaluation methods called cross validation.

The holdout method is the simplest kind of cross validation. The data set is separated into
two sets, called the training set and the testing set. The function approximator fits a function
using the training set only. Then the function approximator is asked to predict the output
values for the data in the testing set (it has never seen these output values before). The errors
it makes are accumulated as before to give the mean absolute test set error, which is used to
evaluate the model. The advantage of this method is that it is usually preferable to the
residual method and takes no longer to compute. However, its evaluation can have a high
variance. The evaluation may depend heavily on which data points end up in the training set
and which end up in the test set, and thus the evaluation may be significantly different
depending on how the division is made.

K-fold cross validation is one way to improve over the holdout method. The data set is
divided into k subsets, and the holdout method is repeated k times. Each time, one of
the k subsets is used as the test set and the other k-1 subsets are put together to form a
training set. Then the average error across all k trials is computed. The advantage of this
method is that it matters less how the data gets divided. Every data point gets to be in a test
set exactly once, and gets to be in a training set k-1 times. The variance of the resulting
estimate is reduced as k is increased. The disadvantage of this method is that the training
algorithm has to be rerun from scratch k times, which means it takes k times as much
computation to make an evaluation. A variant of this method is to randomly divide the data
into a test and training set k different times. The advantage of doing this is that you can
independently choose how large each test set is and how many trials you average over.
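
A minimal k-fold cross-validation sketch (an illustration only, assuming scikit-learn is installed; the iris dataset and logistic regression are just convenient stand-ins) averages the score over k train/test splits:

    # K-fold cross-validation: average score over k splits (assumes scikit-learn).
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000)

    scores = cross_val_score(model, X, y, cv=5)   # 5 folds: each point is tested exactly once
    print(scores)          # one accuracy value per fold
    print(scores.mean())   # the cross-validated estimate of performance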

Leave-one-out cross validation is K-fold cross validation taken to its logical extreme, with
K equal to N, the number of data points in the set. That means that N separate times, the
function approximator is trained on all the data except for one point and a prediction is made
for that point. As before the average error is computed and used to evaluate the model. The
evaluation given by leave-one-out cross validation error (LOO-XVE) is good, but at first pass
it seems very expensive to compute. Fortunately, locally weighted learners can make LOO
predictions just as easily as they make regular predictions. That means computing the LOO-
XVE takes no more time than computing the residual error and it is a much better way to
evaluate models. We will see shortly that Vizier relies heavily on LOO-XVE to choose its
metacodes.

LOOCV (Leave-One-Out Cross-Validation) is a type of cross-validation in which each observation in turn is used as the validation set and the remaining (N-1) observations are used as the training set. The model is fitted and used to predict the value of the single held-out observation, and this is repeated N times so that every observation serves once as the validation set. It is a special case of K-fold cross-validation in which the number of folds equals the number of observations (K = N). The method helps to reduce bias and randomness, aims at reducing the mean squared error rate, and helps prevent overfitting. LOOCV is easy to perform in R programming.

Mathematical Expression
LOOCV involves one fold per observation, i.e. each observation by itself plays the role of the validation set, while the other (N-1) observations play the role of the training set. For least-squares linear (or polynomial) regression there is a convenient shortcut: the LOOCV error can be computed from a single fit on the complete dataset, so the cost is the same as fitting one model and the N separate refits can be avoided. The resulting quantity is the LOOCV estimate of the MSE (mean squared error).

Advantages of LOOCV are as follows:


 There is no randomness in how observations are assigned to the training and validation sets: every observation is used for both training and validation, so the estimate has no variability from the split and gives the same result no matter how many times it is run.
 It has less bias than the validation-set method because each training set contains n-1 observations, i.e. almost the entire data set. As a result, the test error is over-estimated far less than with the validation-set method.
Disadvantage of LOOCV is as follows:
 Training the model N times leads to expensive computation time if the dataset is
large.
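
To make the LOOCV procedure concrete, here is a short sketch; scikit-learn and a small synthetic dataset are assumed purely for illustration.

# Leave-one-out cross-validation: one observation per fold (K = N).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_regression(n_samples=50, n_features=3, noise=5.0, random_state=0)

scores = cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")

# The LOOCV estimate of the MSE is the average of the N single-point squared errors.
print("LOOCV MSE estimate:", -scores.mean())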

Random subsampling, which is also known as Monte Carlo cross-validation, as multiple holdout or as repeated evaluation set, is based on randomly splitting the data into subsets, whereby the size of the subsets is defined by the user. The random partitioning of the data can be repeated arbitrarily often.
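
Repeated random splitting of this kind can be sketched with scikit-learn's ShuffleSplit (the split size and number of repetitions below are arbitrary, illustrative choices):

# Monte Carlo (repeated random sub-sampling) cross-validation.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# The user chooses both the test-set size and how often the random split is repeated.
splitter = ShuffleSplit(n_splits=10, test_size=0.25, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=splitter,
                         scoring="neg_mean_absolute_error")
print("Average MAE over 10 random splits:", -scores.mean())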

• Bootstrapping

Bootstrap is a powerful statistical tool used to quantify the uncertainty of a given model or estimator. The real power of the bootstrap is that it can be applied to a wide range of models where the variability is hard to derive analytically or is not output automatically.
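
A minimal bootstrap sketch, assuming plain NumPy and a made-up sample, estimating the uncertainty of a sample mean:

# Bootstrap: resample with replacement many times and study the spread of the statistic.
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=200)      # hypothetical observed data

boot_means = []
for _ in range(5000):
    resample = rng.choice(sample, size=sample.size, replace=True)
    boot_means.append(resample.mean())

boot_means = np.array(boot_means)
print("Bootstrap standard error of the mean:", boot_means.std(ddof=1))
print("95% bootstrap confidence interval:", np.percentile(boot_means, [2.5, 97.5]))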

• Challenges:
Machine learning algorithms tend to produce unsatisfactory classifiers when they are trained on unbalanced datasets.
For example, consider a movie review dataset:

• Total observations: 100, positive reviews: 90, negative reviews: 10, so the minority (negative) class rate is only 10%.

• The main problem here is how to obtain a balanced dataset; one simple option is sketched below.
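
One simple remedy (an illustration only, not the only option) is to up-sample the minority class, sketched here with pandas and scikit-learn's resample helper on hypothetical review data:

# Up-sampling the minority class of a hypothetical 90/10 movie-review dataset.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "review_id": range(100),
    "label": ["positive"] * 90 + ["negative"] * 10,
})

majority = df[df.label == "positive"]
minority = df[df.label == "negative"]

minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_upsampled])
print(balanced.label.value_counts())   # 90 positive, 90 negative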

UNIT - 3

Data visualization is the graphical representation of information and data in a pictorial or graphical format (for example, charts, graphs, and maps). Data visualization tools provide an accessible way to see and understand trends, patterns, and outliers in data. Data visualization tools and technologies are essential for analyzing massive amounts of information and making data-driven decisions. The concept of using pictures to understand data has been used for centuries. General types of data visualization are charts, tables, graphs, maps, and dashboards.

Categories of Data Visualization

Data visualization is very important to market research, where both numerical and categorical data can be visualized; this increases the impact of insights and also helps in reducing the risk of analysis paralysis. Accordingly, data visualization is usually categorized into two broad groups: visualization of numerical data and visualization of categorical data.

Advantages of Data Visualization

1. Better Comparison: In business it often happens that we need to compare the performance of two components or two situations. The conventional approach is to go through the massive data of both situations and then analyse it, which clearly takes a great deal of time.
2. A Superior Method: Data visualization tackles this difficulty by putting the information for both perspectives into pictorial form, which gives a much better understanding of the situations. For instance, Google Trends helps us understand data related to top searches or queries in pictorial or graphical form.
3. Simple Sharing of Data: With visualization, organizations gain a new channel of communication. Instead of sharing cumbersome raw data, sharing the visual form draws people in and conveys information that is far easier to absorb.
4. Sales Analysis: With the help of data visualization, a salesperson can easily understand the sales chart of the products. With visualization tools such as heat maps, they can understand the causes that are pushing the sales numbers up as well as the reasons that are dragging them down. Data visualization also helps in understanding the trends and other variables, such as the types of customers interested in buying, repeat customers, the impact of geography, and so forth.
5. Finding Relations Between Events: A business is influenced by a lot of factors, and finding the relationships between these factors or events helps decision-makers understand the issues related to their business. For example, e-commerce is not a new thing today, and every year during festive seasons such as Christmas or Thanksgiving the charts of online businesses go up. So if an online company does an average of $1 million of business in a particular quarter and the business rises in the next, it can quickly identify the events that correspond to the rise.
6. Exploring Opportunities and Trends: With the huge amount of data available, business leaders can explore the depth of the data with regard to the trends and opportunities around them. Using data visualization, experts can find patterns in the behaviour of their customers, paving the way to explore trends and opportunities for the business.
Now the most important question arises:
Why is Data Visualization So Important?
Let’s take an example. Suppose you compile a data visualization of the company’s profits
from 2010 to 2020 and create a line chart. It would be very easy to see the line going
constantly up with a drop in just 2018. So you can observe in a second that the company
has had continuous profits in all the years except a loss in 2018. It would not be that easy
to get this information so fast from a data table. This is just one demonstration of the
usefulness of data visualization. Let’s see some more reasons why data visualization is so
important.
1. Data Visualization Discovers the Trends in Data
The most important thing that data visualization does is discover the trends in data. After
all, it is much easier to observe data trends when all the data is laid out in front of you in a
visual form as compared to data in a table. For example, the screenshot below on Tableau
demonstrates the sum of sales made by each customer in descending order. However, the
color red denotes loss while grey denotes profits. So it is very easy to observe from this
visualization that even though some customers may have huge sales, they are still at a
loss. This would be very difficult to observe from a table.

2. Data Visualization Provides a Perspective on the Data


Data Visualization provides a perspective on data by showing its meaning in the larger
scheme of things. It demonstrates how particular data references stand with respect to the
overall data picture. In the data visualization below, the data between sales and profit
provides a data perspective with respect to these two measures. It also demonstrates that
there are very few sales above 12K and higher sales do not necessarily mean a higher
profit.

3. Data Visualization Puts the Data into the Correct Context


It is very difficult to understand the context of the data without data visualization. Since
context provides the whole circumstances of the data, it is very difficult to grasp by just
reading numbers in a table. In the below data visualization on Tableau, a TreeMap is used
to demonstrate the number of sales in each region of the United States. It is very easy to
understand from this data visualization that California has the largest number of sales out
of the total number since the rectangle for California is the largest. But this information is
not easy to understand outside of context without data visualization.

4. Data Visualization Saves Time


It is definitely faster to gather insights from the data using data visualization rather than just studying a table of numbers. In the screenshot below on Tableau, it is very easy to identify the states that have suffered a net loss rather than a profit. This is because all the cells with a loss are colored red using a heat map, so it is obvious which states have suffered a loss.
Compare this to a normal table where you would need to check each cell to see if it has a
negative value to determine a loss. Obviously, data visualization saves a lot of time in this
situation!

5. Data Visualization Tells a Data Story


Data visualization is also a medium to tell a data story to the viewers. The visualization
can be used to present the data facts in an easy-to-understand form while telling a story
and leading the viewers to an inevitable conclusion. This data story, like any other type of
story, should have a good beginning, a basic plot, and an ending that it leads towards.
For example, if a data analyst has to craft a data visualization for company executives
detailing the profits on various products, then the data story can start with the profits and
losses of various products and move on to recommendations on how to tackle the losses.

Top Data Visualization Tools

The following are the 10 best Data Visualization Tools


1. Tableau
2. Looker
3. Zoho Analytics
4. Sisense
5. IBM Cognos Analytics
6. Qlik Sense
7. Domo
8. Microsoft Power BI
9. Klipfolio
10. SAP Analytics Cloud

A time series is a sequence where a metric is recorded over regular time intervals.

Depending on the frequency, a time series can be yearly (e.g. annual budget), quarterly (e.g. expenses), monthly (e.g. air traffic), weekly (e.g. sales quantity), daily (e.g. weather), hourly (e.g. stock prices), minute-wise (e.g. inbound calls in a call centre) or even second-wise (e.g. web traffic).

Not just in manufacturing, the techniques and concepts behind time series forecasting
are applicable in any business.

Now forecasting a time series can be broadly divided into two types.

If you use only the previous values of the time series to predict its future values, it
is called Univariate Time Series Forecasting.

And if you use predictors other than the series itself to forecast it, it is called Multivariate Time Series Forecasting.

Uses of Time Series

 The most important use of studying time series is that it helps us to predict the future
behaviour of the variable based on past experience

 It is helpful for business planning as it helps in comparing the actual current performance
with the expected one

 From time series, we get to study the past behaviour of the phenomenon or the variable
under consideration

 We can compare the changes in the values of different variables at different times or
places, etc.

Components for Time Series Analysis

The various reasons or the forces which affect the values of an observation in a time series are
the components of a time series. The four categories of the components of time series are

 Trend

 Seasonal Variations

 Cyclic Variations

 Random or Irregular movements

Trend

The trend shows the general tendency of the data to increase or decrease during a long period of
time. A trend is a smooth, general, long-term, average tendency. It is not always necessary that
the increase or decrease is in the same direction throughout the given period of time.

It is observable that the tendencies may increase, decrease or are stable in different sections of
time. But the overall trend must be upward, downward or stable. Population, agricultural production, items manufactured, the number of births and deaths, and the number of industries, factories, schools or colleges are some examples showing some kind of tendency of movement.

Linear and Non-Linear Trend

If we plot the time series values on a graph against time t, the pattern of the data clustering shows the type of trend. If the data cluster more or less around a straight line, then the trend is linear; otherwise it is non-linear (curvilinear).

Periodic Fluctuations

There are some components in a time series which tend to repeat themselves over a certain
period of time. They act in a regular spasmodic manner.

Seasonal Variations

These are the rhythmic forces which operate in a regular and periodic manner over a span of less
than a year. They have the same or almost the same pattern during a period of 12 months. This
variation will be present in a time series if the data are recorded hourly, daily, weekly, quarterly,
or monthly.

These variations come into play either because of the natural forces or man-made conventions.
The various seasons or climatic conditions play an important role in seasonal variations. Such as
production of crops depends on seasons, the sale of umbrella and raincoats in the rainy season,
and the sale of electric fans and A.C. shoots up in summer seasons.

The effect of man-made conventions such as some festivals, customs, habits, fashions, and some
occasions like marriage is easily noticeable. They recur themselves year after year. An upswing
in a season should not be taken as an indicator of better business conditions.

Cyclic Variations

The variations in a time series which operate themselves over a span of more than one year are
the cyclic variations. This oscillatory movement has a period of oscillation of more than a year.
One complete period is a cycle. This cyclic movement is sometimes called the ‘Business Cycle’.

It is a four-phase cycle comprising the phases of prosperity, recession, depression, and recovery. The cyclic variations may be regular, but they are not periodic. The upswings and the

downswings in business depend upon the joint nature of the economic forces and the interaction
between them.

Random or Irregular Movements

There is another factor which causes variation in the variable under study. These are not regular variations; they are purely random or irregular. Such fluctuations are unforeseen, uncontrollable, unpredictable, and erratic. Examples of such forces are earthquakes, wars, floods, famines, and other disasters.
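
These components can also be separated programmatically. The sketch below assumes the statsmodels library and a synthetic monthly series, purely for illustration of an additive decomposition:

# Decompose a monthly series into trend, seasonal and irregular (residual) parts.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2015-01-01", periods=96, freq="MS")        # 8 years of monthly data
trend = np.linspace(100, 180, 96)                               # long-term upward trend
seasonal = 10 * np.sin(2 * np.pi * idx.month / 12)              # yearly seasonal pattern
noise = np.random.default_rng(0).normal(0, 3, 96)               # irregular movements
series = pd.Series(trend + seasonal + noise, index=idx)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())      # estimated trend component
print(result.seasonal.head(12))          # estimated seasonal component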

ARIMA, short for ‘Auto Regressive Integrated Moving Average’, is a forecasting


algorithm based on the idea that the information in the past values of the time
series can alone be used to predict the future values.

ARIMA, short for ‘Auto Regressive Integrated Moving Average’ is actually a class of
models that ‘explains’ a given time series based on its own past values, that is, its own
lags and the lagged forecast errors, so that equation can be used to forecast future
values.

Any ‘non-seasonal’ time series that exhibits patterns and is not a random white noise
can be modeled with ARIMA models.

An ARIMA model is characterized by 3 terms: p, d, q where,

p is the order of the AR term

q is the order of the MA term

d is the number of differencing required to make the time series stationary

If a time series has seasonal patterns, then you need to add seasonal terms, and it becomes SARIMA, short for 'Seasonal ARIMA'.
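
A bare-bones ARIMA sketch, assuming statsmodels and a synthetic series; the order (1, 1, 1) is just an illustrative choice of p, d and q:

# Fit a non-seasonal ARIMA(p, d, q) model and forecast a few steps ahead.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = pd.Series(np.cumsum(rng.normal(0.5, 1.0, 120)))   # synthetic trending series

model = ARIMA(series, order=(1, 1, 1))    # p = 1 (AR), d = 1 (differencing), q = 1 (MA)
fitted = model.fit()
print(fitted.forecast(steps=6))           # forecast the next 6 periods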

Holt's two-parameter model, also known as linear exponential smoothing, is a popular


smoothing model for forecasting data with trend. Holt's model has three separate
equations that work together to generate a final forecast.

Additive Holt-Winters has an equivalent ARIMA representation, so it is sometimes described as a special case of ARIMA.

Holt-Winters is a model of time series behavior. Forecasting always requires a model, and
Holt-Winters is a way to model three aspects of the time series:

a typical value (average),

a slope (trend) over time,

and a cyclical repeating pattern (seasonality).
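
As an illustration of those three aspects, the sketch below fits a Holt-Winters (triple exponential smoothing) model with statsmodels on a synthetic monthly series; the library choice and the data are assumptions made for the example:

# Holt-Winters exponential smoothing: level (average), trend and seasonality.
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

idx = pd.date_range("2018-01-01", periods=60, freq="MS")
data = (np.linspace(200, 260, 60)                          # trend
        + 15 * np.sin(2 * np.pi * idx.month / 12)          # seasonality
        + np.random.default_rng(1).normal(0, 4, 60))       # noise
series = pd.Series(data, index=idx)

model = ExponentialSmoothing(series, trend="add", seasonal="add",
                             seasonal_periods=12).fit()
print(model.forecast(12))   # forecast one year ahead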

Time series anomaly detection is a complicated problem with plenty of practical methods.
It’s easy to get lost in all of the topics it encompasses. Learning them is certainly an issue,
but implementing them is often more complicated. A key element of anomaly detection is
forecasting—taking what you know about a time series, either based on a model or its
history, and making decisions about values that arrive later.

Multivariate autoregressive models. Given a univariate time series, its consecutive measurements contain information about the process that generated it, and this underlying order can be described by modeling the current value of the variable as a weighted linear sum of its previous values. Multivariate autoregressive models extend the same idea to several variables at once.

Multivariate Time Series (MTS)

A Multivariate time series has more than one time-dependent variable. Each variable depends
not only on its past values but also has some dependency on other variables. This dependency
is used for forecasting future values. Sounds complicated? Let me explain.

Consider the above example. Now suppose our dataset includes perspiration percent, dew
point, wind speed, cloud cover percentage, etc. along with the temperature value for the past
two years. In this case, there are multiple variables to be considered to optimally predict
temperature. A series like this would fall under the category of multivariate time series.
Below is an illustration of this:

Now that we understand what a multivariate time series looks like, let us understand how can
we use it to build a forecast.

2. Dealing with a Multivariate Time Series – VAR

In this section, I will introduce you to one of the most commonly used methods for
multivariate time series forecasting – Vector Auto Regression (VAR).

In a VAR model, each variable is a linear function of the past values of itself and the past
values of all the other variables. To explain this in a better manner, I’m going to use a simple
visual example:

We have two variables, y1 and y2. We need to forecast the value of these two variables at
time t, from the given data for past n values. For simplicity, I have considered the lag value
to be 1.
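
With a lag of 1, each variable at time t is modelled as a linear function of the previous values of both variables. Written out (these are the equations referred to as (1) and (2) below), the model takes the form:

y1(t) = a1 + w11*y1(t-1) + w12*y2(t-1) + e1(t)    ...(1)
y2(t) = a2 + w21*y1(t-1) + w22*y2(t-1) + e2(t)    ...(2)

where a1 and a2 are constants, the w terms are the coefficients to be estimated, and e1(t), e2(t) are error terms.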

Why Do We Need VAR?

Recall the temperature forecasting example we saw earlier. An argument can be made for it to
be treated as a multiple univariate series. We can solve it using simple univariate forecasting
methods like AR. Since the aim is to predict the temperature, we can simply remove the other
variables (except temperature) and fit a model on the remaining univariate series.

Another simple idea is to forecast values for each series individually using the techniques we
already know. This would make the work extremely straightforward! Then why should you
learn another forecasting technique? Isn’t this topic complicated enough already?

From the above equations (1) and (2), it is clear that each variable is using the past values of
every variable to make the predictions. Unlike AR, VAR is able to understand and use the
relationship between several variables. This is useful for describing the dynamic behavior
of the data and also provides better forecasting results. Additionally, implementing VAR is
as simple as using any other univariate technique (a short sketch follows below).
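
A short sketch of fitting a VAR model, assuming statsmodels and two synthetic, related series (the names, numbers and lag choice are illustrative only; in practice the series should be checked for stationarity first):

# Fit a VAR model on two related series and forecast a few steps ahead.
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)
n = 200
y1 = np.cumsum(rng.normal(0, 1, n))
y2 = 0.6 * y1 + np.cumsum(rng.normal(0, 1, n))      # y2 partly depends on y1
df = pd.DataFrame({"y1": y1, "y2": y2})

model = VAR(df)
results = model.fit(maxlags=1)                      # lag value of 1, as in the example above
print(results.params)                               # estimated coefficients of equations (1) and (2)
print(results.forecast(df.values[-1:], steps=5))    # forecast the next 5 time steps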

UNIT – 4

Decision Tree is the most powerful and popular tool for classification and
prediction. A Decision tree is a flowchart-like tree structure, where each internal
node denotes a test on an attribute, each branch represents an outcome of the
test, and each leaf node (terminal node) holds a class label.

A decision tree for the concept PlayTennis.

Construction of Decision Tree: A tree can be “learned” by splitting the source


set into subsets based on an attribute value test. This process is repeated on
each derived subset in a recursive manner called recursive partitioning. The
recursion is completed when the subset at a node all has the same value of the
target variable, or when splitting no longer adds value to the predictions. The
construction of a decision tree classifier does not require any domain knowledge
or parameter setting, and therefore is appropriate for exploratory knowledge
discovery. Decision trees can handle high-dimensional data. In general, a decision tree classifier has good accuracy. Decision tree induction is a typical inductive approach to learning knowledge for classification.

Decision Tree Representation: Decision trees classify instances by sorting them


down the tree from the root to some leaf node, which provides the classification of
the instance. An instance is classified by starting at the root node of the tree,
testing the attribute specified by this node, then moving down the tree branch
corresponding to the value of the attribute as shown in the above figure. This
process is then repeated for the subtree rooted at the new node.
The decision tree in above figure classifies a particular morning according to

whether it is suitable for playing tennis and returns the classification associated
with the particular leaf (in this case, Yes or No).

For example, the instance


(Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong )

would be sorted down the leftmost branch of this decision tree and would
therefore be classified as a negative instance.
In other words, we can say that the decision tree represents a disjunction of
conjunctions of constraints on the attribute values of instances.
(Outlook = Sunny ^ Humidity = Normal) v (Outlook = Overcast) v (Outlook = Rain ^
Wind = Weak)

Gini Index:

Gini Index is a score that evaluates how accurate a split is among the classified groups. The Gini index takes values between 0 and 1: a value of 0 means all observations in a group belong to one class, while values near the maximum indicate that the elements are spread evenly across the classes (for a two-class problem the maximum is 0.5). We therefore want the Gini index to be as low as possible. Gini Index is the evaluation metric we shall use to evaluate our Decision Tree model; a small sketch of the calculation is given below.
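
The standard formula is Gini = 1 minus the sum of the squared class proportions; a small sketch of the calculation in plain Python (the labels are made up for illustration):

# Gini index of a group of class labels: 1 - sum(p_i^2) over the class proportions.
from collections import Counter

def gini_index(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_index(["yes"] * 10))               # 0.0 -> a pure node
print(gini_index(["yes"] * 5 + ["no"] * 5))   # 0.5 -> maximally mixed for two classes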

Strengths and Weaknesses of the Decision Tree approach

The strengths of decision tree methods are:


 Decision trees are able to generate understandable rules.
 Decision trees perform classification without requiring much computation.
 Decision trees are able to handle both continuous and categorical variables.
 Decision trees provide a clear indication of which fields are most important for
prediction or classification.

The weaknesses of decision tree methods:


 Decision trees are less appropriate for estimation tasks where the goal is to predict the
value of a continuous attribute.
 Decision trees are prone to errors in classification problems with many classes and a
relatively small number of training examples.
 Decision tree can be computationally expensive to train. The process of growing a
decision tree is computationally expensive. At each node, each candidate splitting
field must be sorted before its best split can be found. In some algorithms,
combinations of fields are used and a search must be made for optimal combining
weights. Pruning algorithms can also be expensive since many candidate sub-trees
must be formed and compared.

K-Nearest Neighbours is one of the most basic yet essential classification algorithms in
Machine Learning. It belongs to the supervised learning domain and finds intense
application in pattern recognition, data mining and intrusion detection.
It is widely applicable in real-life scenarios since it is non-parametric, meaning it does not make any underlying assumptions about the distribution of the data (as opposed to other algorithms such as GMM, which assume a Gaussian distribution of the given data).
We are given some prior data (also called training data), which classifies coordinates into
groups identified by an attribute.

As an example, consider the following table of data points containing two features:

Now, given another set of data points (also called testing data), allocate these points a
group by analyzing the training set. Note that the unclassified points are marked as
‘White’.

Intuition
If we plot these points on a graph, we may be able to locate some clusters or groups. Now,
given an unclassified point, we can assign it to a group by observing what group its
nearest neighbours belong to. This means a point close to a cluster of points classified as
‘Red’ has a higher probability of getting classified as ‘Red’.
Intuitively, we can see that the first point (2.5, 7) should be classified as ‘Green’ and the
second point (5.5, 4.5) should be classified as ‘Red’.

Algorithm
Let m be the number of training data samples. Let p be an unknown point.

1. Store the training samples in an array of data points arr[]. This means each element of this array represents a tuple (x, y).
2. For i = 0 to m: calculate the Euclidean distance d(arr[i], p).
3. Make a set S of the K smallest distances obtained. Each of these distances corresponds to an already classified data point.
4. Return the majority label among S.

K can be kept as an odd number so that we can calculate a clear majority in the case where
only two groups are possible (e.g. Red/Blue). With increasing K, we get smoother, more
defined boundaries across different classifications. Also, the accuracy of the above
classifier increases as we increase the number of data points in the training set.
Example Program
Assume 0 and 1 as the two classifiers (groups)
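
The program itself is not reproduced in these notes, so the following is only a rough from-scratch sketch with made-up training points for groups 0 and 1 (the point p, the value of K, and the coordinates are all illustrative assumptions):

# From-scratch KNN: classify an unknown point into group 0 or group 1.
import math
from collections import Counter

def knn_classify(points, p, k=3):
    # points: dict mapping group label -> list of (x, y) training points.
    distances = []
    for group, members in points.items():
        for x, y in members:
            distances.append((math.dist((x, y), p), group))   # Euclidean distance
    k_nearest = sorted(distances)[:k]                         # K smallest distances
    return Counter(group for _, group in k_nearest).most_common(1)[0][0]

points = {0: [(1, 12), (2, 5), (3, 6), (3, 10), (3.5, 8), (2, 11), (2, 9), (1, 7)],
          1: [(5, 3), (3, 2), (1.5, 9), (7, 2), (6, 1), (3.8, 1), (5.6, 4), (4, 2)]}

p = (2.5, 7)   # the unknown (testing) point
print("The unknown point is classified into group:", knn_classify(points, p, k=3))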

What is logistic regression?
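
Logistic regression models the log odds of the outcome as a linear function of the predictors. In its simplest form (one predictor), the equation can be written as:

logit(pi) = ln(pi / (1 - pi)) = beta_0 + beta_1 * x

where pi is the probability that the outcome equals 1.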


In this logistic regression equation, logit(pi) is the dependent or response variable and x is the
independent variable. The beta parameter, or coefficient, in this model is commonly
estimated via maximum likelihood estimation (MLE). This method tests different values of
beta through multiple iterations to optimize for the best fit of log odds. All of these iterations
produce the log likelihood function, and logistic regression seeks to maximize this function
to find the best parameter estimate. Once the optimal coefficient (or coefficients if there is
more than one independent variable) is found, the conditional probabilities for each
observation can be calculated, logged, and summed together to yield a predicted probability.
For binary classification, a probability less than .5 will predict 0 while a probability greater than .5 will predict 1. After the model has been computed, it’s best practice to evaluate how well the model predicts the dependent variable, which is called goodness of fit. The
Hosmer–Lemeshow test is a popular method to assess model fit.
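
A minimal sketch of fitting and using a binary logistic regression, assuming scikit-learn and a synthetic dataset (not part of the original notes):

# Fit a binary logistic regression and turn predicted probabilities into class labels.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

probs = clf.predict_proba(X_test)[:, 1]     # predicted probability of class 1
preds = (probs >= 0.5).astype(int)          # probability >= .5 -> predict 1, else 0
print("Accuracy:", accuracy_score(y_test, preds))
print("Coefficients (log-odds scale):", clf.coef_)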

Interpreting logistic regression

Log odds can be difficult to make sense of within a logistic regression data analysis. As a
result, exponentiating the beta estimates is common to transform the results into an odds ratio
(OR), easing the interpretation of results. The OR represents the odds that an outcome will
occur given a particular event, compared to the odds of the outcome occurring in the absence
of that event. If the OR is greater than 1, then the event is associated with a higher odds of
generating a specific outcome. Conversely, if the OR is less than 1, then the event is
associated with a lower odds of that outcome occurring. Based on the equation from above,

the interpretation of an odds ratio can be denoted as the following: the odds of a success
changes by exp(cB_1) times for every c-unit increase in x. To use an example, let’s say that
we were to estimate the odds of survival on the Titanic given that the person was male, and
the odds ratio for males was .0810. We’d interpret the odds ratio as the odds of survival of
males decreased by a factor of .0810 when compared to females, holding all other variables
constant.

Linear regression vs logistic regression

Both linear and logistic regression are among the most popular models within data science,
and open-source tools, like Python and R, make the computation for them quick and easy.

Linear regression models are used to identify the relationship between a continuous
dependent variable and one or more independent variables. When there is only one
independent variable and one dependent variable, it is known as simple linear regression, but
as the number of independent variables increases, it is referred to as multiple linear
regression. For each type of linear regression, it seeks to plot a line of best fit through a set of
data points, which is typically calculated using the least squares method.

Similar to linear regression, logistic regression is also used to estimate the relationship
between a dependent variable and one or more independent variables, but it is used to make a
prediction about a categorical variable versus a continuous one. A categorical variable can be
true or false, yes or no, 1 or 0, et cetera. The unit of measure also differs from linear
regression as it produces a probability, but the logit function transforms the S-curve into
a straight line.

While both models are used in regression analysis to make predictions about future
outcomes, linear regression is typically easier to understand. Linear regression also does not
require as large of a sample size as logistic regression needs an adequate sample to represent
values across all the response categories. Without a larger, representative sample, the model
may not have sufficient statistical power to detect a significant effect.

Types of logistic regression

There are three types of logistic regression models, which are defined based on categorical
response.

 Binary logistic regression: In this approach, the response or dependent variable is


dichotomous in nature—i.e. it has only two possible outcomes (e.g. 0 or 1). Some
popular examples of its use include predicting if an e-mail is spam or not spam or if a
tumor is malignant or not malignant. Within logistic regression, this is the most
commonly used approach, and more generally, it is one of the most common
classifiers for binary classification.
 Multinomial logistic regression: In this type of logistic regression model, the
dependent variable has three or more possible outcomes; however, these values have
no specified order. For example, movie studios want to predict what genre of film a
moviegoer is likely to see to market films more effectively. A multinomial logistic
regression model can help the studio to determine the strength of influence a person's
age, gender, and dating status may have on the type of film that they prefer. The
studio can then orient an advertising campaign of a specific movie toward a group of
people likely to go see it.
 Ordinal logistic regression: This type of logistic regression model is leveraged when
the response variable has three or more possible outcomes, but in this case, these
values do have a defined order. Examples of ordinal responses include grading scales
from A to F or rating scales from 1 to 5.

Logistic regression and machine learning

Within machine learning, logistic regression belongs to the family of supervised machine
learning models. It is also considered a discriminative model, which means that it attempts to
distinguish between classes (or categories). Unlike a generative algorithm, such as naïve
bayes, it cannot, as the name implies, generate information, such as an image, of the class
that it is trying to predict (e.g. a picture of a cat).

Previously, we mentioned how logistic regression maximizes the log likelihood function to
determine the beta coefficients of the model. This changes slightly under the context of
machine learning. Within machine learning, the negative log likelihood is used as the loss function, and gradient descent is applied to find its minimum (which is equivalent to maximizing the log likelihood). This is just another way to arrive at the same estimates discussed above.

Logistic regression can also be prone to overfitting, particularly when there is a high number
of predictor variables within the model. Regularization is typically used to penalize large coefficients when the model suffers from high dimensionality.

Scikit-learn (link resides outside IBM) provides valuable documentation to learn more about
the logistic regression machine learning model.

Use cases of logistic regression

Logistic regression is commonly used for prediction and classification problems. Some of
these use cases include:

 Fraud detection: Logistic regression models can help teams identify data anomalies,
which are predictive of fraud. Certain behaviors or characteristics may have a higher
association with fraudulent activities, which is particularly helpful to banking and
other financial institutions in protecting their clients. SaaS-based companies have also
started to adopt these practices to eliminate fake user accounts from their datasets
when conducting data analysis around business performance.
 Disease prediction: In medicine, this analytics approach can be used to predict the
likelihood of disease or illness for a given population. Healthcare organizations can
set up preventative care for individuals that show higher propensity for specific
illnesses.
 Churn prediction: Specific behaviors may be indicative of churn in different
functions of an organization. For example, human resources and management teams
may want to know if there are high performers within the company who are at risk of
leaving the organization; this type of insight can prompt conversations to understand
problem areas within the company, such as culture or compensation. Alternatively,
the sales organization may want to learn which of their clients are at risk of taking
their business elsewhere. This can prompt teams to set up a retention strategy to avoid
lost revenue.

Advantages and disadvantages of logistic regression

The main advantage of logistic regression is that it is much easier to set up and train than
other machine learning and AI applications.

Another advantage is that it is one of the most efficient algorithms when the different
outcomes or distinctions represented by the data are linearly separable. This means that you
can draw a straight line separating the results of a logistic regression calculation.

One of the biggest attractions of logistic regression for statisticians is that it can help reveal
the interrelationships between different variables and their impact on outcomes. This could

quickly determine when two variables are positively or negatively correlated, for example a finding that more studying tends to be correlated with higher test outcomes. But
it is important to note that other techniques like causal AI are required to make the leap from
correlation to causation.

Discriminant function analysis (DFA) is a statistical procedure that classifies unknown


individuals and the probability of their classification into a certain group.

Why do we do discriminant analysis?


It enables the researcher to examine whether significant differences exist among the groups,
in terms of the predictor variables. It also evaluates the accuracy of the classification.
Discriminant analysis is described by the number of categories that is possessed by the
dependent variable.

The two types of Discriminant Analysis:

Linear Discriminant Analysis and

Quadratic Discriminant Analysis.

Linear Discriminant Analysis or Normal Discriminant Analysis or Discriminant


Function Analysis is a dimensionality reduction technique that is commonly used for
supervised classification problems. It is used for modelling differences in groups i.e.
separating two or more classes. It is used to project the features in higher dimension space
into a lower dimension space.

Two criteria are used by LDA to create a new axis:

1. Maximize the distance between means of the two classes.


2. Minimize the variation within each class.

Quadratic discriminant analysis is quite similar to linear discriminant analysis, except that we relax the assumption that the covariance matrices of all the classes are equal. Therefore, a separate covariance matrix has to be estimated for each class.
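
A brief sketch comparing the two, assuming scikit-learn and synthetic data (all names and numbers below are illustrative choices):

# LDA assumes one shared covariance matrix; QDA estimates one per class.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

lda = LinearDiscriminantAnalysis(n_components=2).fit(X_train, y_train)
qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)

print("LDA test accuracy:", lda.score(X_test, y_test))
print("QDA test accuracy:", qda.score(X_test, y_test))
print("LDA projection to 2 dimensions:", lda.transform(X_test).shape)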

Introduction to Clustering

It is basically a type of unsupervised learning method. An unsupervised learning method


is a method in which we draw references from datasets consisting of input data without
labeled responses. Generally, it is used as a process to find meaningful structure,
explanatory underlying processes, generative features, and groupings inherent in a set of
examples.
Clustering is the task of dividing the population or data points into a number of groups
such that data points in the same groups are more similar to other data points in the same
group and dissimilar to the data points in other groups. It is basically a collection of
objects on the basis of similarity and dissimilarity between them.
For ex– The data points in the graph below clustered together can be classified into one
single group. We can distinguish the clusters, and we can identify that there are 3 clusters
in the below picture.

It is not necessary for clusters to be spherical. For example:

DBSCAN: Density-based Spatial Clustering of Applications with Noise


These data points are clustered by using the basic concept that the data point lies within
the given constraint from the cluster center. Various distance methods and techniques are
used for the calculation of the outliers.

Why Clustering?
Clustering is very much important as it determines the intrinsic grouping among the
unlabelled data present. There are no universal criteria for a good clustering; it depends on the user and on which criteria satisfy their need. For instance, we could be
interested in finding representatives for homogeneous groups (data reduction), in finding
“natural clusters” and describe their unknown properties (“natural” data types), in finding
useful and suitable groupings (“useful” data classes) or in finding unusual data objects

Downloaded by JAYAPRAKASH A ([email protected])


lOMoARcPSD|27298668

(outlier detection). A clustering algorithm must make some assumptions about what constitutes the similarity of points, and each assumption leads to different but equally valid clusters.

Clustering Methods :

 Density-Based Methods: These methods consider the clusters as the dense region
having some similarities and differences from the lower dense region of the space.
These methods have good accuracy and the ability to merge two clusters.
Example DBSCAN (Density-Based Spatial Clustering of Applications with
Noise), OPTICS (Ordering Points to Identify Clustering Structure), etc.
 Hierarchical Based Methods: The clusters formed in this method form a tree-type
structure based on the hierarchy. New clusters are formed using the previously
formed one. It is divided into two categories:
 Agglomerative (bottom-up approach)
 Divisive (top-down approach)
examples CURE (Clustering Using Representatives), BIRCH (Balanced Iterative
Reducing Clustering and using Hierarchies), etc.

 Partitioning Methods: These methods partition the objects into k clusters and each
partition forms one cluster. This method is used to optimize an objective criterion
similarity function such as when the distance is a major parameter example K-means,
CLARANS (Clustering Large Applications based upon Randomized Search), etc.
 Grid-based Methods: In this method, the data space is formulated into a finite
number of cells that form a grid-like structure. All the clustering operations done on
these grids are fast and independent of the number of data objects example STING
(Statistical Information Grid), wave cluster, CLIQUE (CLustering In Quest), etc.

Clustering Algorithms :
K-means clustering algorithm – It is the simplest unsupervised learning algorithm that solves the clustering problem. The K-means algorithm partitions n observations into k clusters, where each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
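
A short sketch of K-means with scikit-learn on synthetic points (k = 3 and the generated blobs are illustrative assumptions):

# Partition synthetic 2-D points into k = 3 clusters with K-means.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("Cluster centers:\n", kmeans.cluster_centers_)
print("Labels of the first 10 points:", kmeans.labels_[:10])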

Applications of Clustering in different fields


 Marketing: It can be used to characterize & discover customer segments for
marketing purposes.
 Biology: It can be used for classification among different species of plants and
animals.
 Libraries: It is used in clustering different books on the basis of topics and
information.
 Insurance: It is used to acknowledge the customers, their policies and identifying the
frauds.
City Planning: It is used to make groups of houses and to study their values based on
their geographical locations and other factors present.

There are two types of clustering: hard clustering and soft clustering. The type is defined by how a data point is assigned to a cluster: either each point belongs to exactly one distinct cluster, or a likelihood (probability) of the point belonging to each nearby cluster is used.

Hard clustering
In hard clustering each data point either belongs to a cluster completely or
does not belong to the cluster at all.

Soft clustering
In soft clustering, instead of putting each data point into a single cluster, a probability or likelihood of the data point belonging to each cluster is considered. An observation can therefore belong to more than one cluster to a certain degree, i.e. with a higher likelihood for some clusters than for others.

Market basket analysis


Market basket analysis is a data mining technique used by retailers to increase sales by
better understanding customer purchasing patterns. It involves analyzing large data sets,
such as purchase history, to reveal product groupings, as well as products that are likely to be
purchased together.

The adoption of market basket analysis was aided by the advent of electronic point-of-sale (POS)
systems. Compared to handwritten records kept by store owners, the digital records generated by
POS systems made it easier for applications to process and analyze large volumes of purchase
data.

Implementation of market basket analysis requires a background in statistics and data science
and some algorithmic computer programming skills. For those without the needed technical
skills, commercial, off-the-shelf tools exist.

One example is the Shopping Basket Analysis tool in Microsoft Excel, which analyzes
transaction data contained in a spreadsheet and performs market basket analysis. A transaction
ID must relate to the items to be analyzed. The Shopping Basket Analysis tool then creates two
worksheets:

o The Shopping Basket Item Groups worksheet, which lists items that are frequently
purchased together,

o And the Shopping Basket Rules worksheet shows how items are related (For example,
purchasers of Product A are likely to buy Product B).

How does Market Basket Analysis Work?

Market Basket Analysis is modelled on Association rule mining, i.e., the IF {}, THEN {}
construct. For example, IF a customer buys bread, THEN he is likely to buy butter as well.

Association rules are usually represented as: {Bread} -> {Butter}

Some terminologies to familiarize yourself with Market Basket Analysis are:

o Antecedent: Items or 'item sets' found within the data are antecedents. In simpler words,
it's the IF component, written on the left-hand side. In the above example, bread is the
antecedent.

o Consequent: A consequent is an item or set of items found in combination with the


antecedent. It's the THEN component, written on the right-hand side. In the above
example, butter is the consequent.

Types of Market Basket Analysis

Market Basket Analysis techniques can be categorized based on how the available data is
utilized. The main types of market basket analysis in data mining are the following:

1. Descriptive market basket analysis: This type only derives insights from past data and
is the most frequently used approach. The analysis here does not make any predictions
but rates the association between products using statistical techniques. For those familiar
with the basics of Data Analysis, this type of modelling is known as unsupervised
learning.

2. Predictive market basket analysis: This type uses supervised learning models like
classification and regression. It essentially aims to mimic the market to analyze what
causes what to happen. Essentially, it considers items purchased in a sequence to
determine cross-selling. For example, buying an extended warranty is more likely to

follow the purchase of an iPhone. While it isn't as widely used as a descriptive MBA, it is
still a very valuable tool for marketers.

3. Differential market basket analysis: This type of analysis is beneficial for competitor
analysis. It compares purchase history between stores, between seasons, between two
time periods, between different days of the week, etc., to find interesting patterns in
consumer behaviour. For example, it can help determine why some users prefer to
purchase the same product at the same price on Amazon vs Flipkart. The answer can be
that the Amazon reseller has more warehouses and can deliver faster, or maybe
something more profound like user experience.

Algorithms associated with Market Basket Analysis

In market basket analysis, association rules are used to predict the likelihood of products being
purchased together. Association rules count the frequency of items that occur together, seeking
to find associations that occur far more often than expected.

Algorithms that use association rules include AIS, SETM and Apriori. The Apriori algorithm is
commonly cited by data scientists in research articles about market basket analysis. It identifies
frequent items in the database and then evaluates their frequency as the datasets are expanded to
larger sizes.

R's arules package is an open-source toolkit for association rule mining in the R programming language. It supports the Apriori algorithm, and related packages extend it with other mining algorithms, including arulesNBMiner, opusminer, RKEEL and RSarules.

With the help of the Apriori algorithm, we can further classify and simplify the item sets that consumers frequently buy. Three measures are central to the APRIORI ALGORITHM:

o SUPPORT

o CONFIDENCE

o LIFT

For example, suppose 5,000 transactions have been made through a popular e-commerce website, and we want to calculate the support, confidence, and lift for two products, say a pen and a notebook. Out of the 5,000 transactions, suppose 500 contain only the pen, 700 contain only the notebook, and 1,000 contain both. A worked calculation is shown below.
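
Using those hypothetical numbers for the rule {pen} -> {notebook}, the three measures work out roughly as follows (a sketch of the arithmetic under the interpretation above):

o Support = transactions containing both / total transactions = 1000 / 5000 = 0.20

o Confidence = support of both / support of pen = 1000 / (500 + 1000) ≈ 0.67, i.e. about 67% of pen buyers also buy a notebook

o Lift = confidence / support of notebook = 0.67 / ((700 + 1000) / 5000) = 0.67 / 0.34 ≈ 1.96

A lift above 1 suggests that the two products are bought together more often than would be expected by chance.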

Examples of Market Basket Analysis

The following examples explore Market Basket Analysis by market segment:

o Retail: The most well-known MBA case study is Amazon.com. Whenever you view a
product on Amazon, the product page automatically recommends, "Items bought together
frequently." It is perhaps the simplest and most clean example of an MBA's cross-selling
techniques.
Apart from e-commerce formats, MBA is also widely applicable to the in-store retail segment. Grocery stores pay meticulous attention to product placement and shelf optimization. For example, you are almost always likely to find shampoo and
conditioner placed very close to each other at the grocery store. Walmart's infamous beer
and diapers association anecdote is also an example of Market Basket Analysis.

o Telecom: With the ever-increasing competition in the telecom sector, companies are
paying close attention to customers' services. For example, telecom companies have now started to bundle TV and Internet packages along with other discounted online services to reduce
churn.

o IBFS: Tracing credit card history is a hugely advantageous MBA opportunity for IBFS
organizations. For example, Citibank frequently employs sales personnel at large malls to
lure potential customers with attractive discounts on the go. They also associate with apps
like Swiggy and Zomato to show customers many offers they can avail of via purchasing
through credit cards. IBFS organizations also use basket analysis to determine fraudulent
claims.

o Medicine: Basket analysis is used to determine comorbid conditions and symptom


analysis in the medical field. It can also help identify which genes or traits are hereditary
and which are associated with local environmental effects.

Benefits of Market Basket Analysis

The market basket analysis data mining technique has the following benefits, such as:

o Increasing market share: Once a company hits peak growth, it becomes challenging to
determine new ways of increasing market share. Market Basket Analysis can be used to
put together demographic and gentrification data to determine the location of new stores
or geo-targeted ads.

Downloaded by JAYAPRAKASH A ([email protected])


lOMoARcPSD|27298668

o Behaviour analysis: Understanding customer behaviour patterns is a primal stone in the


foundations of marketing. MBA can be used anywhere from a simple catalogue design to
UI/UX.

o Optimization of in-store operations: MBA is not only helpful in determining what goes
on the shelves but also behind the store. Geographical patterns play a key role in
determining the popularity or strength of certain products, and therefore, MBA has been
increasingly used to optimize inventory for each store or warehouse.

o Campaigns and promotions: MBA is used not only to determine which products are bought together but also which products form the keystones of a product line.

o Recommendations: OTT platforms like Netflix and Amazon Prime benefit from MBA
by understanding what kind of movies people tend to watch frequently.
