Data Mining in Business Intelligence
UNIT -1
Business intelligence
DATA MINING
Data mining can be seen as the precursor to business intelligence. Upon collection,
data is often raw and unstructured, making it challenging to draw conclusions. Data
mining decodes these complex datasets, and delivers a cleaner version for the
business intelligence team to derive insights.
STATISTICA Data Miner divides the modeling screen into four general phases of
data mining: (1) data acquisition; (2) data cleaning, preparation, and
transformation; (3) data analysis, modeling, classification, and forecasting; and
(4) reports.
There are numerous crucial data mining techniques to consider when entering the data
field, but some of the most prevalent methods include clustering, data cleaning,
association, data warehousing, machine learning, data visualization,
classification, neural networks, and prediction.
Correlation analysis.
Classification.
Outlier detection.
Clustering.
Sequential patterning.
Data visualization.
Neural networking.
Computational advertising
TEXT MINING
Text mining, also known as text data mining, is the process of transforming
unstructured text into a structured format to identify meaningful patterns and
new insights.
Equipped with Natural Language Processing (NLP), text mining tools are used to analyze all
types of text, from survey responses and emails to tweets and product reviews, helping
businesses gain insights and make data-based decisions.
Data scientists analyze text using advanced data science techniques. The data from the text
reveals customer sentiments toward subjects or unearths other insights, typically through text
analytics (also called text mining) and natural language processing (NLP) technology.
We have already defined what text mining is. For academic purposes, let's define it more
formally. It is a multi-disciplinary field based on information retrieval, data mining, machine
learning, statistics, and computational linguistics. Unlike data stored in databases, text is
unstructured, ambiguous, and challenging to process. Text mining applies techniques like
summarization, classification, and clustering to extract knowledge from natural language text,
which is stored in semi-structured and unstructured formats.
Text mining techniques are continuously used in areas like search engines, customer
relationship management systems, email filtering, product suggestion analysis, fraud detection,
and social media analytics for opinion mining, feature extraction, sentiment analysis, predictive
analysis, and trend analysis.
1. Term-Based Method
In this method, a document is analyzed based on the terms it contains. A term may have some
value or meaning in a context, and each term is associated with a value known as its weight.
This method, however, has two problems: 1. polysemy (a term having many possible
meanings), and 2. synonymy (multiple words having the same meaning).
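A common way to assign such weights is TF-IDF. The sketch below, assuming scikit-learn and two made-up documents, illustrates term weighting (note that weighting alone does not resolve polysemy or synonymy):

# Minimal term-weighting sketch, assuming scikit-learn and hypothetical documents.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "data mining extracts patterns from large data sets",
    "text mining turns unstructured text into structured data",
]

vectorizer = TfidfVectorizer()           # each term gets a TF-IDF weight
matrix = vectorizer.fit_transform(docs)  # rows = documents, columns = terms

# Show the weight assigned to every term of the first document
for term, index in vectorizer.vocabulary_.items():
    weight = matrix[0, index]
    if weight > 0:
        print(f"{term}: {weight:.3f}")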
2. Phrase-Based Method
As the name indicates, this method analyzes a document based on phrases, which carry more
information than a single term because they are a collection of semantic terms. Phrases are
more descriptive and less ambiguous than single terms. This method is not without problems,
however: its performance could vary due to three reasons:
3. Concept-Based Method
In the concept-based method, the terms are predicted or interpreted at a sentence or document
level. Rather than analyzing single terms in isolation, this model tries to analyze a term at the
document or sentence level by finding a significant, aptly matching term. This model contains
three components:
3. Extracting top concepts based on the first two components to build feature vectors
using the standard vector space model.
4. Pattern-Based Method
In the pattern-based model, a document is analyzed based on patterns, i.e., relations between
terms that form a taxonomy (a tree-like structure). The pattern-based approach can improve the
accuracy of the system for evaluating term weights because discovered patterns are more
specific than whole documents.
Patterns can be discovered by using data mining techniques like closed pattern mining,
sequential pattern mining, frequent itemset mining, and association rule mining. The pattern-
based technique uses two processes: pattern deploying (PDM) and pattern evolving. This
technique refines the discovered patterns in text documents.
Collecting information: The textual data is gathered from various sources and is in a semi-
structured or unstructured format.
Conversion into structured data: Pre-processing involves cleaning the data that is collected.
Pattern identification: The various techniques used in text mining (discussed below) are
applied to identify patterns.
Pattern analysis: The data obtained is analyzed to extract knowledge and meaning out of it.
Advanced analysis: Finally, the required knowledge is obtained and can then be used for
decision-making.
Clustering
Factor analysis
Text classification
Text purification
Text summarization
1. Information Extraction
The extracted information is well-organized (structured) and stored in a database for further
use. IE extracts specific attributes and entities from the document and establishes their
relationship. The process used to check and evaluate the relevance of results is called
‘Precision and Recall.’
2. Information Retrieval
Information retrieval (IR) refers to finding and collecting relevant information from a variety
of resources, usually documented in an unstructured format. It is a set of methods or
approaches for methodically developing information needs of the users in the form of queries
that are used to fetch a document from a collection of databases. IR helps to extract relevant
and associated patterns according to a given set of words or phrases.
3. Text Categorization
Text categorization assigns documents to predefined categories. For example, patient reports
in healthcare organizations are often indexed from multiple aspects, using taxonomies of
disease categories, types of surgical procedures, insurance reimbursement codes, and so on.
Another widespread application of text categorization is spam filtering, where email messages
are classified into the two categories of spam and non-spam.
4. Document Clustering
This technique is used to find groups of documents with similar content. It makes use of
descriptors and descriptor extraction that are essentially sets of words that describe the
contents within the cluster. It is an unsupervised process responsible for classifying objects
into groups called clusters, which consist of several documents. Dividing similar text into the
same cluster forms the basis of this method.
Any labels associated with objects are obtained solely from the data. The advantage of this
technique is that it ensures that no document is missed from search results since documents
can emerge in numerous subtopics. For example, if clustering is performed on a collection of
news articles, it can make sure that similar documents are kept closer to each other or lie in
the same cluster.
5. Text Visualization
Text Visualization is a technique that represents large textual information into a visual map
layout, which provides enhanced browsing capabilities along with simple searching. In text
mining, visualization methods can improve and simplify the discovery of relevant
information.
Text flags are used to show the document category to represent individual documents or
groups of documents, and colors are used to show density. Visual text mining puts large
textual sources in an appropriate visual hierarchy, which helps the user to interact with the
document by scaling and zooming.
NLP and text mining differ in the goal for which they are used. NLP is used to
understand human language by analyzing text, speech, or grammatical syntax. Text mining is
used to extract information from unstructured and structured content. Text mining focuses on
structure rather than the meaning of content.
WEB MINING
Web mining is the use of data mining techniques to extract knowledge from web data, including
web documents, hyperlinks between documents, and usage logs of web sites.
● The WWW is a huge, widely distributed, global information service centre and, therefore,
constitutes a rich source for data mining.
● Data Mining: It is the concept of identifying significant patterns from data that give a
better outcome.
● Web Mining: It is the process of performing data mining on the web, i.e., extracting web
documents and discovering patterns from them.
● Web content mining is related to but different from data mining and text mining.
WEB USAGE MINING
● Goal: analyze the behavioral patterns and profiles of users interacting with a
Web site, using data such as the site's usage logs and the site contents.
SPATIAL DATA MINING
Spatial data mining is societally important, having applications in public health, public
safety, climate science, etc. For example, in epidemiology, spatial data mining helps
to find areas with a high concentration of disease incidents to manage disease
outbreaks.
Spatial data are of two types according to the storing technique, namely, raster data
and vector data.
Spatial data can help us make better predictions about human behaviour and
understand what variables may influence an individual's choices. By performing
spatial analysis on our communities, we can ensure that neighbourhoods are
accessible and usable by everyone.
Clustering is the most widely used technique for spatial data mining. Commonly used
approaches include density-based spatial clustering of applications with noise (DBSCAN),
its varied-density variant (VDBSCAN), and partitioning around medoids (PAM), typically
applied on top of a database management system.
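As an illustration, here is a minimal sketch of density-based clustering of spatial points, assuming scikit-learn and hypothetical (x, y) coordinates of disease incidents:

# Minimal DBSCAN sketch; coordinates and parameter values are assumptions.
import numpy as np
from sklearn.cluster import DBSCAN

coords = np.array([
    [77.59, 12.97], [77.60, 12.98], [77.61, 12.97],   # dense pocket of incidents
    [80.27, 13.08], [80.28, 13.09],                   # second pocket
    [76.00, 10.00],                                   # isolated point -> noise
])

# eps: neighbourhood radius, min_samples: points needed to form a dense region
labels = DBSCAN(eps=0.05, min_samples=2).fit_predict(coords)
print(labels)   # cluster ids per point; -1 marks noise/outliers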
Spatial data refers to the shape, size and location of the feature. Non- spatial data refers to
other attributes associated with the feature such as name, length, area, volume, population,
soil type, etc ..
Geographic Information Systems (GIS) are powerful decision-making tools for any business or
industry, since they allow the analysis of environmental, demographic, and topographic data.
Data intelligence compiled from GIS applications helps companies, various industries, and
consumers make informed decisions.
1. Mapping
2. Telecom and Network Services
3. Accident Analysis and Hot Spot Analysis
4. Urban Planning
5. Transportation Planning
6. Environmental Impact Analysis
7. Agricultural Applications
8. Disaster Management and Mitigation
9. Navigation
10. Flood Damage Estimation
11. Natural Resources Management
12. Banking
13. Taxation
14. Surveying
15. Geology
16. Assets Management and Maintenance
17. Planning and Community Development
18. Dairy Industry
19. Irrigation Water Management
20. Pest Control and Management
PROCESS MINING
Process mining is a method similar to data mining, used to analyze and monitor business
processes. The software helps organizations to capture data from enterprise transactions and
provides important insights on how business processes are performing.
Data mining analyzes static information. In other words: data that is available at the
time of analysis. Process mining on the other hand looks at how the data was actually
created. Process mining techniques also allow users to generate processes dynamically based
on the most recent data.
There are three main classes of process mining techniques: process discovery, conformance
checking, and process enhancement.
Process mining enables business leaders to gain a holistic view of their processes, spot
inefficiencies and identify improvement opportunities, including automation.
Data mining and process mining share a number of commonalities, but they are different.
Both data mining and process mining fall under the umbrella of business intelligence. Both
use algorithms to understand big data and may also use machine learning. Both can help
businesses improve performance.
However, the two areas are distinct. Process mining is more concerned with how information
is generated and how that fits into a process as a whole, whereas data mining relies on data
that's available. Data mining is more concerned with the what -- that is, the patterns
themselves -- while process mining seeks to answer the why. As part of that, process mining
is concerned with exceptions and the story those exceptions help to tell about the holistic
answer, while data mining discards exceptions, as outliers can prevent finding the dominant
patterns.
DATA WAREHOUSING
Data warehousing is a method of organizing and compiling data into one database,
whereas data mining deals with fetching important data from databases. Data mining
attempts to depict meaningful patterns through a dependency on the data that is compiled in
the data warehouse
A data warehouse is a type of data management system that is designed to enable and
support business intelligence (BI) activities, especially analytics. Data warehouses are
solely intended to perform queries and analysis and often contain large amounts of historical
data.
Data Warehousing integrates data and information collected from various sources into
one comprehensive database. For example, a data warehouse might combine customer
information from an organization's point-of-sale systems, its mailing lists, website, and
comment cards.
2. Operational Data Store (ODS):
An Operational Data Store, also called an ODS, is a data store required when neither the data
warehouse nor OLTP systems support an organization's reporting needs. In an ODS, the data
is refreshed in near real time. Hence, it is widely preferred for routine activities like storing
records of employees.
3. Data Mart:
A data mart is a subset of the data warehouse. It is specially designed for a particular line of
business, such as sales or finance. In an independent data mart, data can be collected directly
from the sources.
Airline:
In the airline system, it is used for operational purposes like crew assignment, analysis of route
profitability, frequent flyer program promotions, etc.
Banking:
It is widely used in the banking sector to manage the resources available on desk effectively.
A few banks also use it for market research and for performance analysis of products and
operations.
Healthcare:
The healthcare sector also uses data warehouses to strategize and predict outcomes, generate
patient treatment reports, and share data with tie-in insurance companies, medical aid services,
etc.
Public sector:
In the public sector, the data warehouse is used for intelligence gathering. It helps government
agencies to maintain and analyze tax records and health policy records for every individual.
Investment and insurance sector:
In this sector, the warehouses are primarily used to analyze data patterns and customer trends,
and to track market movements.
Retail chain:
In retail chains, the data warehouse is widely used for distribution and marketing. It also helps
to track items and customer buying patterns, manage promotions, and determine pricing
policy.
Telecommunication:
A data warehouse is used in this sector for product promotions, sales decisions and to make
distribution decisions.
Hospitality Industry:
This Industry utilizes warehouse services to design as well as estimate their advertising and
promotion campaigns where they want to target clients based on their feedback and travel
patterns.
DATA MART
A data mart is a simple form of data warehouse focused on a single subject or line of
business. With a data mart, teams can access data and gain insights faster, because they don't
have to spend time searching within a more complex data warehouse or manually
aggregating data from different sources.
Three basic types of data marts are dependent, independent, and hybrid. The
categorization is based primarily on the data source that feeds the data mart. Dependent data
marts draw data from a central data warehouse that has already been created.
Extract, transform, and load (ETL) is a process for integrating and transferring
information from various data sources into a single physical database. Data marts use
ETL to retrieve information from external sources when it does not come from a data
warehouse
Implementing a Data Mart is a rewarding but complex procedure. Here are the detailed steps
to implement a Data Mart:
Designing
Designing is the first phase of Data Mart implementation. It covers all the tasks between
initiating the request for a data mart to gathering information about the requirements. Finally,
we create the logical and physical Data Mart design.
Gathering the business & technical requirements and Identifying data sources.
Selecting the appropriate subset of data.
Designing the logical and physical structure of the data mart.
The data subset may be selected based on:
Date
Business or Functional Unit
Geography
Any combination of the above
Constructing
This is the second phase of implementation. It involves creating the physical database and the
logical structures.
Implementing the physical database designed in the earlier phase. For instance,
database schema objects like table, indexes, views, etc. are created.
You need a relational database management system to construct a data mart. RDBMS have
several features that are required for the success of a Data Mart.
Storage management: An RDBMS stores and manages the data to create, add, and
delete data.
Fast data access: With a SQL query you can easily access data based on certain
conditions/filters.
Data protection: The RDBMS system also offers a way to recover from system
failures such as power failures. It also allows restoring data from backups in case
the disk fails.
Multiuser support: The data management system offers concurrent access, the
ability for multiple users to access and modify data without interfering or overwriting
changes made by another user.
Security: The RDMS system also provides a way to regulate access by users to
objects and certain types of operations.
Populating:
You accomplish these population tasks using an ETL (Extract Transform Load) Tool. This
tool allows you to look at the data sources, perform source-to-target mapping, extract the
data, transform, cleanse it, and load it back into the data mart.
In the process, the tool also creates some metadata relating to things like where the data came
from, how recent it is, what type of changes were made to the data, and what level of
summarization was done.
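As an illustration only, here is a minimal ETL sketch using pandas and sqlite3; the source file, column names, and mart table are assumptions, not part of any specific tool:

# Minimal ETL sketch for populating a data mart (all names are hypothetical).
import sqlite3
import pandas as pd

# Extract: read raw records from a source system export
raw = pd.read_csv("pos_export.csv")            # hypothetical point-of-sale export

# Transform: cleanse and map source columns to the mart's schema
sales = (raw.rename(columns={"amt": "sale_amount", "dt": "sale_date"})
            .dropna(subset=["sale_amount"]))
sales["sale_date"] = pd.to_datetime(sales["sale_date"])

# Load: write the prepared subset into the sales data mart
con = sqlite3.connect("sales_mart.db")
sales.to_sql("fact_sales", con, if_exists="replace", index=False)
con.close()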
Accessing
Accessing is the fourth step, which involves putting the data to use: querying the data, creating
reports and charts, and publishing them. End-users submit queries to the database and display
the results of the queries.
Set up a meta layer that translates database structures and objects names into business
terms. This helps non-technical users to access the Data mart easily.
Set up and maintain database structures.
Set up API and interfaces if required
You can access the data mart using the command line or GUI. GUI is preferred as it can
easily generate graphs and is user-friendly compared to the command line.
Managing
This is the last step of the Data Mart implementation process. It covers ongoing management
tasks such as user access management, system optimization and tuning, adding and managing
fresh data, and planning recovery scenarios.
UNIT – 2
Data Mining refers to extracting or mining knowledge from large amounts of data. The
term is actually a misnomer; data mining would more appropriately have been named
knowledge mining, which emphasizes mining knowledge from large amounts of data. It is
the computational process of discovering patterns in large data sets involving methods at the
intersection of artificial intelligence, machine learning, statistics, and database systems.
The overall goal of data mining process is to extract information from a data set and
transform it into an understandable structure for further use. It is also defined as extraction
of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or
knowledge from a huge amount of data. Data mining is a rapidly growing field that is
concerned with developing techniques to assist managers and decision-makers to make
intelligent use of a huge amount of repositories.
The term knowledge discovery in databases or KDD, for short, was coined in 1989 to
refer to the broad process of finding knowledge in data, and to emphasize the “high-level”
application of particular data mining methods (Fayyad et al, 1996). Fayyad considers DM
as one of the phases of the KDD process and considers that the data mining phase concerns,
mainly, the means by which the patterns are extracted and enumerated from data. The overall
KDD process is described below.
SEMMA was developed by the SAS Institute. CRISP-DM was developed through the efforts
of a consortium initially composed of DaimlerChrysler, SPSS and NCR. Both are described
below. Although SEMMA and CRISP-DM are usually referred to as methodologies, here they
are referred to as processes, in the sense that they consist of a particular course of action
intended to achieve a result.
The KDD process, as presented in (Fayyad et al., 1996), is the process of using DM
methods to extract what is deemed knowledge according to the specification of measures
and thresholds, using a database along with any required preprocessing, sub-sampling, and
transformation of the database. Five stages are considered:
1. Selection – This stage consists of creating a target data set, or focusing on a
subset of variables or data samples, on which discovery is to be performed.
2. Preprocessing – This stage consists of cleaning and preprocessing the target data
in order to obtain consistent data.
3. Transformation – This stage consists of transforming the data using
dimensionality reduction or transformation methods.
4. Data Mining – This stage consists of searching for patterns of interest
in a particular representational form, depending on the data mining objective
(usually, prediction).
5. Interpretation/Evaluation – This stage consists of interpreting and evaluating the
mined patterns.
The KDD process is interactive and iterative, involving numerous steps with many
decisions being made by the user. (Brachman, Anand, 1996).
Additionally, the KDD process must be preceded by the development of an
understanding of the application domain, the relevant prior knowledge and the goals of the
end-user. It also must be continued by the knowledge consolidation by incorporating this
knowledge into the system (Fayyad et al, 1996).
The SEMMA process was developed by the SAS Institute. The acronym SEMMA
stands for Sample, Explore, Modify, Model, Assess, and refers to the process of
conducting a data mining project. The SAS Institute considers a cycle with 5 stages for the
process:
1. Sample – This stage consists of sampling the data by extracting a portion of a
large data set big enough to contain the significant information, yet small
enough to manipulate quickly. This stage is pointed out as being optional.
2. Explore – This stage consists of exploring the data by searching for
unanticipated trends and anomalies in order to gain understanding and ideas.
3. Modify – This stage consists of modifying the data by creating,
selecting, and transforming the variables to focus the model selection process.
4. Model – This stage consists of modeling the data by allowing the software to
search automatically for a combination of data that reliably predicts a desired
outcome.
5. Assess – This stage consists of assessing the data by evaluating the usefulness
and reliability of the findings from the data mining process and estimating how
well it performs.
Although the SEMMA process is independent of the chosen DM tool, it is linked to
the SAS Enterprise Miner software and intends to guide the user in the implementation
of DM applications.
SEMMA offers an easy to understand process, allowing an organized and adequate
development and maintenance of DM projects.
The CRISP-DM process was developed through the effort of a consortium initially
composed of DaimlerChrysler, SPSS and NCR. CRISP-DM stands for CRoss-
Industry Standard Process for Data Mining. It consists of a cycle that comprises six stages:
1. Business understanding – This initial phase focuses on understanding the project
objectives and requirements from a business perspective, then converting this
knowledge into a data mining problem definition and a preliminary plan
designed to achieve the objectives.
2. Data understanding – The data understanding phase starts with an initial data
collection and proceeds with activities in order to get familiar with the data, to
identify data quality problems, to discover first insights into the data or to detect
interesting subsets to form hypotheses for hidden information.
3. Data preparation – The data preparation phase covers all activities to construct
the final dataset from the initial raw data.
4. Modeling – In this phase, various modeling techniques are selected and applied
and their parameters are calibrated to optimal values.
5. Evaluation – At this stage the model (or models) obtained are more thoroughly
evaluated and the steps executed to construct the model are reviewed to be
certain it properly achieves the business objectives.
6. Deployment – Creation of the model is generally not the end of the project. Even
if the purpose of the model is to increase knowledge of the data, the knowledge
gained will need to be organized and presented in a way that the customer can
use it.
The sequence of the six stages is not rigid. CRISP-DM is extremely complete and
documented. All its stages are duly organized, structured and defined, allowing a project
to be easily understood or revised (Santos & Azevedo, 2005). Although the CRISP-DM
process is independent of the chosen DM tool, it is linked to the SPSS Clementine software.
A COMPARATIVE STUDY
By doing a comparison of the KDD and SEMMA stages we would, on a first approach,
affirm that they are equivalent:
Sample can be identified with Selection,
Explore can be identified with Pre processing
Modify can be identified with Transformation
Model can be identified with Data Mining
Assess can be identified with Interpretation/Evaluation.
Examining it thoroughly, we may affirm that the five stages of the SEMMA process can
be seen as a practical implementation of the five stages of the KDD process, since it is
directly linked to the SAS Enterprise Miner software.
Comparing the KDD stages with the CRISP-DM stages is not as straightforward as in
the SEMMA situation. Nevertheless, we can first of all observe that the CRISP-DM
methodology incorporates the steps that, as referred above, must precede and follow the
KDD process that is to say:
The Business Understanding phase can be identified with the development of an
understanding of the application domain, the relevant prior knowledge and the goals of the
end-user, while the Deployment phase can be identified with the consolidation of the
discovered knowledge by incorporating it into the system.
Considering the presented analysis we conclude that SEMMA and CRISP-DM can be
viewed as an implementation of the KDD process described by (Fayyad et al, 1996). At
first sight, we can get to the conclusion that CRISP-DM is more complete than SEMMA.
However, analyzing it more deeply, we can integrate the development of an understanding of
the application domain, the relevant prior knowledge and the goals of the end-user into the
Sample stage of SEMMA, because the data cannot be sampled unless there exists a true
understanding of all the presented aspects. With respect to the consolidation by
incorporating this knowledge into the system, we can assume that it is present, because it is
truly the reason for doing it. This leads to the fact that standards have been achieved,
concerning the overall process: SEMMA and CRISP-DM do guide people to know how
DM can be applied in practice in real systems.
In the future we intend to analyze other aspects related to DM standards, namely SQL-
based languages for DM, as well as XML-based languages for DM. As a complement, we
intend to investigate the existence of other standards for DM.
Domain analytics refers to the collective set of analytics that are applied across all
industry verticals and business processes. Data and analytics leaders should optimize their
domain-specific capabilities for achieving success in their industry.
Domain-Specific Data
The domain specific data schemas contain final specializations of entities as shown
highlighted in blue. Entities defined in this layer are self-contained and cannot be referenced
by any other layer. The domain specific layer organizes definitions according to industry
discipline.
PREDICTIVE PERFORMANCE
Take, for example, prediction of a rare disease that occurs in 1% of the population. If we use
a metric that only tells us how good the model is at making the correct prediction, we might
end up with 98% or 99% accuracy, because the model will be right 99% of the time simply by
predicting that the person does not have the disease. That is, however, not the point of the
model.
Instead, we might want to use a metric that evaluates only the true positives and the false
negatives, and determines how good the model is at prediction of the case of the disease.
Proper predictive performance models evaluation is also important because we want our
model to have the same predictive evaluation across many different data sets. In other words,
the results need to be comparable, measurable and reproducible, which are important factors
for many industries with heavy regulations, such as insurance and the healthcare sector.
Let’s now dive into prediction performance, the most commonly used metrics, their use
cases, and their limitations.
All problems a performance evaluation model can solve fall into one of two categories: a
classification problem or a regression problem. Depending on what category your business
challenge falls into, you will need to use different metrics to evaluate your model.
That is why it is important to first determine what overall business goal or business problem
needs to be solved. That will be the starting point for your data science team to choose the
metrics, and ultimately determine what a good model is.
Classification Problems
A classification problem is about predicting what category something falls into. An example
of a classification problem is analyzing medical data to determine if a patient is in a high risk
group for a certain disease or not.
Percent correct classification (PCC): measures overall accuracy. Every error has the same
weight.
Confusion matrix: also measures accuracy but distinguishes between error types, i.e., false
positives, false negatives and correct predictions.
Both of these metrics are good to use when every data entry needs to be scored. For example,
if every customer who visits a website needs to be shown customized content based on their
browsing behavior, every visitor will need to be categorized.
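A minimal sketch of computing both metrics, assuming scikit-learn and hypothetical true/predicted labels:

# PCC (overall accuracy) and confusion matrix on made-up labels.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = high-risk patient, 0 = not high-risk
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("PCC:", accuracy_score(y_true, y_pred))   # every error weighted equally
print(confusion_matrix(y_true, y_pred))          # rows: actual, columns: predicted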
If, however, you only need to act upon results connected to a subset of your data – for
example, if you aim to identify high churn clients to interact with, or, as in the earlier
example, predict a rare disease – you might want to use the following metrics:
Area Under the ROC Curve (AUC-ROC): one of the most widely used evaluation metrics. It is
popular because it rewards ranking positive predictions higher than negative ones. Also, the
ROC curve is independent of changes in the proportion of responders.
Lift and Gain charts: both charts measure the effectiveness of a model by calculating the ratio
between the results obtained with and without the performance evaluation model. In other
words, these metrics examine if using predictive models has any positive effects or not.
Regression Problems
A regression problem is about predicting a quantity. A simple example of a regression
problem is prediction of the selling price of a real estate property based on its attributes
(location, square meters available, condition, etc.).
To evaluate how good your regression model is, you can use the following metrics:
R-squared: indicates the proportion of the variance in the dependent variable that the model
explains. R-squared does not take into consideration any biases that might be present in the
data. Therefore, a good model might have a low R-squared value, and a model that does not
fit the data might have a high R-squared value.
Average error: the numerical difference between the predicted value and the actual value.
Mean Square Error (MSE): good to use if you have a lot of outliers in the data.
Median error: the median of the differences between the predicted and the actual values.
Average absolute error: similar to the average error, only you use the absolute value of the
difference to balance out the outliers in the data.
Median absolute error: represents the median of the absolute differences between prediction
and actual observation. Because the median is used, large outliers have little effect on the
final evaluation of the model.
Mean Squared Error (MSE) and Root Mean Square Error (RMSE) are error measures based
on the following error (e_i) concept (where x_i represents the i-th actual value of a time series
and m_i is the value that was forecasted, for the same position in the series, by the model):
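e_i = x_i - m_i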
Since positive and negative errors tend to cancel each other out, we take the squares of these
differences and take the average of all these squares. As a result, we get the Mean Square
Error (calculated on N differences between actual and predicted values):
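MSE = \frac{1}{N}\sum_{i=1}^{N} e_i^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - m_i)^2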
One of the MSE’s limitations is that the unit of measurement for the error is the square of the
unit of measurement for the data (the MSE calculates the error in square meters if the data is
measured in meters). To convert the error unit of measurement to the data unit of
measurement, we take the root of the MSE and then get the Root Mean Square Error:
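RMSE = \sqrt{MSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - m_i)^2}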
MAE
MSE and RMSE are constructed in such a way that they give greater weight to large errors
than to small errors (because the errors are squared). To give equal weight to large and
small errors, we can use the average of the absolute values of the errors and get the Mean
Absolute Error:
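MAE = \frac{1}{N}\sum_{i=1}^{N}\left|x_i - m_i\right|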
The MAE does not give larger errors a higher weight, but, when used as a loss function in a
machine learning model, it can cause convergence issues during the learning phase.
MAE, MSE and RMSE are widely used measures. However, relying entirely on these
measures may not be appropriate in some cases. For example, if the forecast is made to make
trading decisions, eg. whether or not to buy a stock on a particular day, we are more interested
in predicting well not so much the exact value of the next day, but whether the difference in
value of the next day will be positive or negative. That is, we are interested in knowing how to
predict the direction of change. In this case, we can use a confusion matrix.
We can count the number of cases belonging to each of the categories and represent them in a
table like the one below.
MAPE (Mean Absolute Percentage Error): expresses accuracy as a percentage of the error.
Because this number is a percentage, it can be easier to understand than the other statistics.
MAD (Mean Absolute Deviation): expresses accuracy in the same units as the data, which
helps conceptualize the amount of error. Outliers have less of an effect on MAD than on MSD
(Mean Squared Deviation).
CONFUSION MATRIX
ROC CURVE
True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:
TPR = TP / (TP + FN)
False Positive Rate (FPR) is defined as:
FPR = FP / (FP + TN)
An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the
classification threshold classifies more items as positive, thus increasing both False Positives
and True Positives. The following figure shows a typical ROC curve.
To compute the points in an ROC curve, we could evaluate a logistic regression model many
times with different classification thresholds, but this would be inefficient. Fortunately,
there's an efficient, sorting-based algorithm that can provide this information for us, called
AUC.
AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-
dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to
(1,1).
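As a rough sketch (assuming scikit-learn and made-up labels and scores), the ROC points and the AUC can be computed as follows:

# Compute ROC points and AUC for hypothetical classifier scores.
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.55, 0.7]   # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # TPR/FPR at each threshold
print("AUC:", roc_auc_score(y_true, y_score))        # area under that curve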
AUC represents the probability that the model ranks a random positive example higher than a
random negative example.
AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC
of 0.0; one whose predictions are 100% correct has an AUC of 1.0.
AUC is scale-invariant. It measures how well predictions are ranked, rather than their
absolute values.
However, both these reasons come with caveats, which may limit the usefulness of AUC in
certain use cases:
Scale invariance is not always desirable. For example, sometimes we really do need well
calibrated probability outputs, and AUC won’t tell us about that.
Classification-threshold invariance is not always desirable. In cases where there are wide
disparities in the cost of false negatives vs. false positives, it may be critical to minimize one
type of classification error. For example, when doing email spam detection, you likely want
to prioritize minimizing false positives (even if that results in a significant increase of false
negatives). AUC isn't a useful metric for this type of optimization.
Cross-validation is usually used in machine learning for improving model prediction when
we don't have enough data to apply other more efficient methods like the 3-way split (train,
validation and test) or using a holdout dataset.
Cross validation is a model evaluation method that is better than residuals. The problem with
residual evaluations is that they do not give an indication of how well the learner will do
when it is asked to make new predictions for data it has not already seen. One way to
overcome this problem is to not use the entire data set when training a learner. Some of the
data is removed before training begins. Then when training is done, the data that was
removed can be used to test the performance of the learned model on ``new'' data. This is the
basic idea for a whole class of model evaluation methods called cross validation.
The holdout method is the simplest kind of cross validation. The data set is separated into
two sets, called the training set and the testing set. The function approximator fits a function
using the training set only. Then the function approximator is asked to predict the output
values for the data in the testing set (it has never seen these output values before). The errors
it makes are accumulated as before to give the mean absolute test set error, which is used to
evaluate the model. The advantage of this method is that it is usually preferable to the
residual method and takes no longer to compute. However, its evaluation can have a high
variance. The evaluation may depend heavily on which data points end up in the training set
and which end up in the test set, and thus the evaluation may be significantly different
depending on how the division is made.
K-fold cross validation is one way to improve over the holdout method. The data set is
divided into k subsets, and the holdout method is repeated k times. Each time, one of
the k subsets is used as the test set and the other k-1 subsets are put together to form a
training set. Then the average error across all k trials is computed. The advantage of this
method is that it matters less how the data gets divided. Every data point gets to be in a test
set exactly once, and gets to be in a training set k-1 times. The variance of the resulting
estimate is reduced as k is increased. The disadvantage of this method is that the training
algorithm has to be rerun from scratch k times, which means it takes k times as much
computation to make an evaluation. A variant of this method is to randomly divide the data
into a test and training set k different times. The advantage of doing this is that you can
independently choose how large each test set is and how many trials you average over.
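A minimal sketch of the holdout and k-fold methods, assuming scikit-learn and its bundled iris data purely for illustration:

# Holdout split and 5-fold cross-validation on a small example dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Holdout: a single train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print("holdout score:", model.fit(X_train, y_train).score(X_test, y_test))

# K-fold: every point is tested exactly once across k=5 folds
scores = cross_val_score(model, X, y, cv=5)
print("5-fold mean score:", scores.mean())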
Leave-one-out cross validation is K-fold cross validation taken to its logical extreme, with
K equal to N, the number of data points in the set. That means that N separate times, the
function approximator is trained on all the data except for one point and a prediction is made
for that point. As before the average error is computed and used to evaluate the model. The
evaluation given by leave-one-out cross validation error (LOO-XVE) is good, but at first pass
it seems very expensive to compute. Fortunately, locally weighted learners can make LOO
predictions just as easily as they make regular predictions. That means computing the LOO-
XVE takes no more time than computing the residual error and it is a much better way to
evaluate models. We will see shortly that Vizier relies heavily on LOO-XVE to choose its
metacodes.
Mathematical Expression
LOOCV involves one fold per observation, i.e., each observation by itself plays the role of
the validation set, while the remaining (N-1) observations play the role of the training set.
With least-squares linear or polynomial regression, the cost of LOOCV is the same as that of
a single model fit, so refitting the model N times can be avoided: the MSE (mean squared
error) is calculated from a single fit on the complete dataset.
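For least-squares regression this shortcut can be written as follows, where \hat{y}_i is the fitted value and h_i the leverage of observation i from a single fit on the full data set:

CV_{(N)} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2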
Random sub-sampling, which is also known as Monte Carlo cross-validation, as multiple
holdout, or as a repeated evaluation set, is based on randomly splitting the data into subsets,
whereby the size of the subsets is defined by the user. The random partitioning of the data
can be repeated arbitrarily often.
• Bootstrapping
The bootstrap is a powerful statistical tool used to quantify the uncertainty of a given model.
The real power of the bootstrap, however, is that it can be applied to a wide range of models
where the variability is hard to obtain or is not output automatically.
• Challenges:
Machine learning algorithms tend to produce unsatisfactory classifiers when trained on
imbalanced datasets.
For example, movie review datasets.
UNIT - 3
Data visualization is very critical to market research where both numerical and categorical
data can be visualized, which helps in an increase in the impact of insights and also helps
in reducing the risk of analysis paralysis. So, data visualization is categorized into the
following categories:
1. Better Comparison: In business, we frequently need to compare the performance of two
components or two situations. The conventional approach is to go through the massive data
of both situations and then analyze it, which clearly takes a great deal of time.
2. A Superior Method: Data visualization tackles this difficulty by placing the information for
both perspectives into pictorial form. This gives a much better understanding of the
situations. For instance, Google Trends helps us understand data related to top searches or
queries in pictorial or graphical form.
3. Simple Sharing of Data: With visualization, organizations gain a new channel of
communication. Rather than sharing cumbersome raw data, sharing the visual information
engages the audience and conveys the information in a form that is easier to absorb.
4. Sales Analysis: With the help of data visualization, a salesperson can easily understand the
sales chart of products. With visualization tools like heat maps, they can understand the
causes that are pushing the sales numbers up as well as the reasons that are dragging them
down. Data visualization also helps in understanding trends and other variables like the types
of customers interested in buying, repeat customers, the impact of geography, and so forth.
5. Discovering Relations Between Events: A business is influenced by many factors. Finding
the relationships between these factors or events helps decision-makers understand the issues
related to their business. For example, the e-commerce market is nothing new today; every
year during festive seasons, such as Christmas or Thanksgiving, the charts of online
businesses go up. So, if an online company doing an average of $1 million of business in a
particular quarter sees sales rise in the next, it can quickly identify the events that correspond
to that rise.
6. Exploring Opportunities and Trends: With the huge amounts of data present, business
leaders can explore the depth of the data with regard to the trends and opportunities around
them. Using data visualization, experts can find patterns in the behaviour of their customers,
thereby paving the way to explore trends and opportunities for the business.
Now the most important question arises. Why is Data Visualization So Important?
Why is Data Visualization Important?
Let’s take an example. Suppose you compile a data visualization of the company’s profits
from 2010 to 2020 and create a line chart. It would be very easy to see the line going
constantly up with a drop in just 2018. So you can observe in a second that the company
has had continuous profits in all the years except a loss in 2018. It would not be that easy
to get this information so fast from a data table. This is just one demonstration of the
usefulness of data visualization. Let’s see some more reasons why data visualization is so
important.
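As an aside, a minimal sketch of the profit line chart described above, using matplotlib and made-up yearly figures (the 2018 dip mirrors the example), might look like this:

# Line chart of hypothetical company profits, 2010-2020.
import matplotlib.pyplot as plt

years   = list(range(2010, 2021))
profits = [1.2, 1.5, 1.8, 2.1, 2.4, 2.9, 3.3, 3.8, -0.6, 4.1, 4.7]  # made-up values, loss in 2018

plt.plot(years, profits, marker="o")
plt.axhline(0, color="grey", linewidth=0.8)   # zero line makes the 2018 loss obvious
plt.xlabel("Year")
plt.ylabel("Profit (in millions)")
plt.title("Company profit, 2010-2020")
plt.show()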
1. Data Visualization Discovers the Trends in Data
The most important thing that data visualization does is discover the trends in data. After
all, it is much easier to observe data trends when all the data is laid out in front of you in a
visual form as compared to data in a table. For example, a Tableau view showing the sum of
sales made by each customer in descending order, where the color red denotes loss and grey
denotes profit, makes it very easy to observe that even though some customers may have huge
sales, they are still at a loss. This would be very difficult to observe from a table.
A time series is a sequence where a metric is recorded over regular time intervals.
Depending on the frequency, a time series can be yearly (ex: annual budget),
quarterly (ex: expenses), monthly (ex: air traffic), weekly (ex: sales qty), daily (ex:
weather), hourly (ex: stock price), minute-wise (ex: inbound calls in a call centre), or
even second-wise (ex: web traffic).
Not just in manufacturing, the techniques and concepts behind time series forecasting
are applicable in any business.
Forecasting a time series can be broadly divided into two types.
If you use only the previous values of the time series to predict its future values, it
is called Univariate Time Series Forecasting.
And if you use predictors other than the series to forecast it is called Multi Variate
Time Series Forecasting.
The most important use of studying time series is that it helps us to predict the future
behaviour of the variable based on past experience
It is helpful for business planning as it helps in comparing the actual current performance
with the expected one
From time series, we get to study the past behaviour of the phenomenon or the variable
under consideration
We can compare the changes in the values of different variables at different times or
places, etc.
The various reasons or forces which affect the values of an observation in a time series are
the components of a time series. The four categories of the components of time series are:
Trend
Seasonal Variations
Cyclic Variations
Random or Irregular Variations
Trend
The trend shows the general tendency of the data to increase or decrease during a long period of
time. A trend is a smooth, general, long-term, average tendency. It is not always necessary that
the increase or decrease is in the same direction throughout the given period of time.
It is observable that the tendencies may increase, decrease or are stable in different sections of
time. But the overall trend must be upward, downward or stable. The population, agricultural
production, items manufactured, number of births and deaths, number of industry or any factory,
number of schools or colleges are some of its example showing some kind of tendencies of
movement.
If we plot the time series values on a graph against time t, the pattern of the data clustering
shows the type of trend. If the set of data clusters more or less around a straight line, then the
trend is linear; otherwise it is non-linear (curvilinear).
Periodic Fluctuations
There are some components in a time series which tend to repeat themselves over a certain
period of time. They act in a regular spasmodic manner.
Seasonal Variations
These are the rhythmic forces which operate in a regular and periodic manner over a span of less
than a year. They have the same or almost the same pattern during a period of 12 months. This
variation will be present in a time series if the data are recorded hourly, daily, weekly, quarterly,
or monthly.
These variations come into play either because of the natural forces or man-made conventions.
The various seasons or climatic conditions play an important role in seasonal variations. Such as
production of crops depends on seasons, the sale of umbrella and raincoats in the rainy season,
and the sale of electric fans and A.C. shoots up in summer seasons.
The effect of man-made conventions such as some festivals, customs, habits, fashions, and some
occasions like marriage is easily noticeable. They recur themselves year after year. An upswing
in a season should not be taken as an indicator of better business conditions.
Cyclic Variations
The variations in a time series which operate themselves over a span of more than one year are
the cyclic variations. This oscillatory movement has a period of oscillation of more than a year.
One complete period is a cycle. This cyclic movement is sometimes called the 'Business Cycle'.
The upswings and downswings in business depend upon the joint nature of the economic forces
and the interaction between them.
Random or Irregular Variations
There is another factor which causes the variation in the variable under study. They are not
regular variations and are purely random or irregular. These fluctuations are unforeseen,
uncontrollable, unpredictable, and are erratic. These forces are earthquakes, wars, flood,
famines, and any other disasters.
ARIMA, short for ‘Auto Regressive Integrated Moving Average’ is actually a class of
models that ‘explains’ a given time series based on its own past values, that is, its own
lags and the lagged forecast errors, so that equation can be used to forecast future
values.
Any ‘non-seasonal’ time series that exhibits patterns and is not a random white noise
can be modeled with ARIMA models.
If a time series, has seasonal patterns, then you need to add seasonal terms and it
becomes SARIMA, short for ‘Seasonal ARIMA’. More on that once we finish ARIMA.
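As an illustration, here is a minimal sketch of fitting a non-seasonal ARIMA, assuming statsmodels and a short, made-up monthly series:

# Fit ARIMA(p, d, q) on a hypothetical monthly series and forecast 3 steps.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

series = pd.Series(
    [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118],  # made-up values
    index=pd.date_range("2022-01-01", periods=12, freq="MS"),
)

model = ARIMA(series, order=(1, 1, 1))   # (p, d, q): AR lags, differencing, MA lags
fitted = model.fit()
print(fitted.forecast(steps=3))           # forecast the next three months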
Holt-Winters is a model of time series behavior. Forecasting always requires a model, and
Holt-Winters is a way to model three aspects of the time series: a typical value (the average
or level), a slope (trend) over time, and a cyclical repeating pattern (seasonality).
Time series anomaly detection is a complicated problem with plenty of practical methods.
It’s easy to get lost in all of the topics it encompasses. Learning them is certainly an issue,
but implementing them is often more complicated. A key element of anomaly detection is
forecasting—taking what you know about a time series, either based on a model or its
history, and making decisions about values that arrive later.
A Multivariate time series has more than one time-dependent variable. Each variable depends
not only on its past values but also has some dependency on other variables. This dependency
is used for forecasting future values. Sounds complicated? Let me explain.
Consider the above example. Now suppose our dataset includes perspiration percent, dew
point, wind speed, cloud cover percentage, etc. along with the temperature value for the past
two years. In this case, there are multiple variables to be considered to optimally predict
temperature. A series like this would fall under the category of multivariate time series.
Below is an illustration of this:
Now that we understand what a multivariate time series looks like, let us understand how can
we use it to build a forecast.
In this section, I will introduce you to one of the most commonly used methods for
multivariate time series forecasting – Vector Auto Regression (VAR).
In a VAR model, each variable is a linear function of the past values of itself and the past
values of all the other variables. To explain this in a better manner, I’m going to use a simple
visual example:
We have two variables, y1 and y2. We need to forecast the value of these two variables at
time t, from the given data for past n values. For simplicity, I have considered the lag value
to be 1.
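With a lag of 1, the two equations referred to below take the standard VAR(1) form, where the a's and w's are coefficients to be estimated and e_{1,t}, e_{2,t} are error terms:

y_{1,t} = a_1 + w_{11}\,y_{1,t-1} + w_{12}\,y_{2,t-1} + e_{1,t}   (1)
y_{2,t} = a_2 + w_{21}\,y_{1,t-1} + w_{22}\,y_{2,t-1} + e_{2,t}   (2)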
Recall the temperature forecasting example we saw earlier. An argument can be made for it to
be treated as a multiple univariate series. We can solve it using simple univariate forecasting
methods like AR. Since the aim is to predict the temperature, we can simply remove the other
variables (except temperature) and fit a model on the remaining univariate series.
Another simple idea is to forecast values for each series individually using the techniques we
already know. This would make the work extremely straightforward! Then why should you
learn another forecasting technique? Isn’t this topic complicated enough already?
From the above equations (1) and (2), it is clear that each variable is using the past values of
every variable to make the predictions. Unlike AR, VAR is able to understand and use the
relationship between several variables. This is useful for describing the dynamic behavior
of the data and also provides better forecasting results. Additionally, implementing VAR is
as simple as using any other univariate technique (which you will see in the last section).
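For instance, a minimal sketch of VAR in practice, assuming statsmodels and two made-up interdependent series:

# Fit a VAR(1) on two hypothetical series and forecast both five steps ahead.
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(0)
y1 = np.cumsum(rng.normal(size=100))
y2 = 0.5 * y1 + rng.normal(size=100)          # y2 depends partly on y1
data = pd.DataFrame({"y1": y1, "y2": y2})

model = VAR(data)
results = model.fit(1)                         # lag order 1, as in the example above
print(results.forecast(data.values[-1:], steps=5))   # forecasts for both series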
UNIT – 4
Decision Tree is the most powerful and popular tool for classification and
prediction. A Decision tree is a flowchart-like tree structure, where each internal
node denotes a test on an attribute, each branch represents an outcome of the
test, and each leaf node (terminal node) holds a class label.
A classic example is a tree that classifies a Saturday morning according to whether it is
suitable for playing tennis: an instance is sorted down the tree from the root to a leaf and
receives the classification associated with that particular leaf (in this case Yes or No). For
example, an instance with Outlook = Sunny and Humidity = High would be sorted down the
leftmost branch of this decision tree and would therefore be classified as a negative instance.
In other words, we can say that the decision tree represents a disjunction of
conjunctions of constraints on the attribute values of instances.
(Outlook = Sunny ^ Humidity = Normal) v (Outlook = Overcast) v (Outlook = Rain ^
Wind = Weak)
Gini Index:
Gini Index is a score that evaluates how accurate a split is among the classified
groups. Gini index evaluates a score in the range between 0 and 1, where 0 is
when all observations belong to one class, and 1 is a random distribution of the
elements within classes. In this case, we want to have a Gini index score as low
as possible. The Gini index is the evaluation metric we shall use to evaluate our
Decision Tree Model.
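For a node whose observations fall into classes i = 1, ..., C with proportions p_i, the Gini index is computed as:

Gini = 1 - \sum_{i=1}^{C} p_i^2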
K-Nearest Neighbours is one of the most basic yet essential classification algorithms in
Machine Learning. It belongs to the supervised learning domain and finds intense
application in pattern recognition, data mining and intrusion detection.
It is widely applicable in real-life scenarios since it is non-parametric, meaning it does
not make any underlying assumptions about the distribution of data (as opposed to other
algorithms such as GMM, which assume a Gaussian distribution of the given data).
We are given some prior data (also called training data), which classifies coordinates into
groups identified by an attribute.
As an example, consider the following table of data points containing two features:
Now, given another set of data points (also called testing data), allocate these points a
group by analyzing the training set. Note that the unclassified points are marked as
‘White’.
Intuition
If we plot these points on a graph, we may be able to locate some clusters or groups. Now,
given an unclassified point, we can assign it to a group by observing what group its
nearest neighbours belong to. This means a point close to a cluster of points classified as
‘Red’ has a higher probability of getting classified as ‘Red’.
Intuitively, a test point such as (2.5, 7) that falls near the 'Green' points should be classified
as 'Green', and one such as (5.5, 4.5) that falls near the 'Red' points should be classified as 'Red'.
Algorithm
Let m be the number of training data samples. Let p be an unknown point.
1. Store the training samples in an array of data points arr[], so that each element of the
array represents a tuple (x, y).
2. For i = 0 to m-1: calculate the Euclidean distance d(arr[i], p).
3. Make a set S of the K smallest distances obtained. Each of these distances corresponds to
an already classified data point.
4. Return the majority label among S.
K is usually kept odd so that a clear majority can be found when only two groups are
possible (e.g. Red/Green). With increasing K, we get smoother, more defined boundaries
across different classifications. Also, the accuracy of the above classifier generally increases
as we increase the number of data points in the training set.
Example Program
Assume 0 and 1 as the two classes (groups).
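A minimal sketch of such a program, following the algorithm above; the training points and the value of K are chosen purely for illustration:

import math
from collections import Counter

def classify_knn(train, p, k=3):
    # Assign point p the majority label among its k nearest training points.
    # Steps 1-2: compute the Euclidean distance from p to every training sample.
    distances = [(math.dist(x, p), label) for x, label in train]
    # Step 3: keep the k smallest distances.
    nearest = sorted(distances)[:k]
    # Step 4: return the majority label among them.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Illustrative training data: group 0 clustered near the origin, group 1 further out.
train = [((1, 1), 0), ((2, 1), 0), ((1, 2), 0),
         ((6, 6), 1), ((7, 5), 1), ((6, 7), 1)]

print(classify_knn(train, (2, 2), k=3))  # expected: 0
print(classify_knn(train, (6, 5), k=3))  # expected: 1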
Log odds can be difficult to make sense of within a logistic regression data analysis. As a
result, exponentiating the beta estimates is common to transform the results into an odds ratio
(OR), easing the interpretation of results. The OR represents the odds that an outcome will
occur given a particular event, compared to the odds of the outcome occurring in the absence
of that event. If the OR is greater than 1, then the event is associated with a higher odds of
generating a specific outcome. Conversely, if the OR is less than 1, then the event is
associated with lower odds of that outcome occurring. Based on the equation above, the
interpretation of an odds ratio can be stated as follows: the odds of success change by a
factor of exp(c × B1) for every c-unit increase in x. As an example, suppose we estimate the
odds of survival on the Titanic given that the person was male, and the odds ratio for males
is 0.0810. We would interpret this as: the odds of survival for males are lower by a factor of
0.0810 compared to females, holding all other variables constant.
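In code this transformation is a single exponentiation; a minimal sketch in which the coefficient value is an assumption chosen so that it reproduces the 0.0810 odds ratio quoted above:

import math

beta_male = -2.513            # assumed log-odds coefficient for the 'male' indicator
odds_ratio = math.exp(beta_male)
print(round(odds_ratio, 3))   # ~0.081: males' odds of survival are ~0.081x those of females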
Both linear and logistic regression are among the most popular models within data science,
and open-source tools, like Python and R, make the computation for them quick and easy.
Linear regression models are used to identify the relationship between a continuous
dependent variable and one or more independent variables. When there is only one
independent variable and one dependent variable, it is known as simple linear regression, but
as the number of independent variables increases, it is referred to as multiple linear
regression. In either case, the model seeks to fit a line of best fit through the set of data
points, which is typically calculated using the least squares method.
Similar to linear regression, logistic regression is also used to estimate the relationship
between a dependent variable and one or more independent variables, but it is used to make a
prediction about a categorical variable versus a continuous one. A categorical variable can be
true or false, yes or no, 1 or 0, et cetera. The unit of measure also differs from linear
regression: logistic regression produces a probability, and the logit function maps the
S-shaped probability curve onto a straight line in the log-odds scale.
While both models are used in regression analysis to make predictions about future
outcomes, linear regression is typically easier to understand. Linear regression also does not
require as large a sample size as logistic regression, which needs an adequate sample to
represent values across all the response categories. Without a larger, representative sample,
the model may not have sufficient statistical power to detect a significant effect.
There are three types of logistic regression models, which are defined based on the
categorical response: binary, multinomial, and ordinal logistic regression.
Within machine learning, logistic regression belongs to the family of supervised machine
learning models. It is also considered a discriminative model, which means that it attempts to
distinguish between classes (or categories). Unlike a generative algorithm, such as naïve
Bayes, it cannot, as the name implies, generate information such as an image of the class
that it is trying to predict (e.g. a picture of a cat).
Previously, we mentioned how logistic regression maximizes the log likelihood function to
determine the beta coefficients of the model. This changes slightly in the context of machine
learning. Within machine learning, the negative log likelihood is used as the loss function,
and gradient descent is used to find its minimum (equivalently, the maximum of the log
likelihood). This is just another way to arrive at the same estimations discussed above.
Logistic regression can also be prone to overfitting, particularly when there is a high number
of predictor variables in the model. Regularization is typically used to penalize large
coefficients when the model suffers from high dimensionality.
Scikit-learn provides valuable documentation to learn more about the logistic regression
machine learning model.
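As a minimal sketch of what this looks like in practice (the synthetic study-hours data and the default L2 penalty are assumptions for illustration, not part of the text above):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data purely for illustration:
# one feature (hours studied) and a 0/1 outcome (passed or not).
rng = np.random.default_rng(0)
hours = rng.uniform(0, 10, size=200).reshape(-1, 1)
passed = (hours.ravel() + rng.normal(0, 2, size=200) > 5).astype(int)

# L2 regularization is on by default (C controls its strength), which helps
# with the overfitting issue mentioned above.
model = LogisticRegression(C=1.0).fit(hours, passed)

print(model.coef_, model.intercept_)   # log-odds coefficients
print(np.exp(model.coef_))             # odds ratio per extra hour of study
print(model.predict_proba([[7.0]]))    # predicted class probabilities for 7 hours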
Logistic regression is commonly used for prediction and classification problems. Some of
these use cases include:
Fraud detection: Logistic regression models can help teams identify data anomalies,
which are predictive of fraud. Certain behaviors or characteristics may have a higher
association with fraudulent activities, which is particularly helpful to banking and
other financial institutions in protecting their clients. SaaS-based companies have also
started to adopt these practices to eliminate fake user accounts from their datasets
when conducting data analysis around business performance.
Disease prediction: In medicine, this analytics approach can be used to predict the
likelihood of disease or illness for a given population. Healthcare organizations can
set up preventative care for individuals that show higher propensity for specific
illnesses.
Churn prediction: Specific behaviors may be indicative of churn in different
functions of an organization. For example, human resources and management teams
may want to know if there are high performers within the company who are at risk of
leaving the organization; this type of insight can prompt conversations to understand
problem areas within the company, such as culture or compensation. Alternatively,
the sales organization may want to learn which of their clients are at risk of taking
their business elsewhere. This can prompt teams to set up a retention strategy to avoid
lost revenue.
The main advantage of logistic regression is that it is much easier to set up and train than
other machine learning and AI applications.
Another advantage is that it is one of the most efficient algorithms when the different
outcomes or distinctions represented by the data are linearly separable. This means that you
can draw a straight line separating the results of a logistic regression calculation.
One of the biggest attractions of logistic regression for statisticians is that it can help reveal
the interrelationships between different variables and their impact on outcomes. This could
quickly determine when two variables are positively or negatively correlated, such as the
finding cited above that more studying tends to be correlated with higher test outcomes. But
it is important to note that other techniques like causal AI are required to make the leap from
correlation to causation.
Introduction to Clustering
Why Clustering?
Clustering is important because it determines the intrinsic grouping among the unlabelled
data present. There is no single criterion for a good clustering; it depends on the user and on
the criteria that satisfy their need. For instance, we could be interested in finding
representatives for homogeneous groups (data reduction), in finding “natural clusters” and
describing their unknown properties (“natural” data types), in finding useful and suitable
groupings (“useful” data classes), or in finding unusual data objects (outlier detection). A
clustering algorithm must make some assumptions about what constitutes the similarity of
points, and different assumptions yield different, equally valid clusterings.
Clustering Methods :
Density-Based Methods: These methods treat clusters as dense regions of the data space
separated by regions of lower density. They have good accuracy and the ability to find
arbitrarily shaped clusters and to merge two clusters (see the sketch after this list).
Examples: DBSCAN (Density-Based Spatial Clustering of Applications with Noise),
OPTICS (Ordering Points To Identify the Clustering Structure), etc.
Hierarchical Based Methods: The clusters formed by these methods build a tree-like
structure based on the hierarchy, with new clusters formed from the previously formed
ones. They are divided into two categories:
Agglomerative (bottom-up approach)
Divisive (top-down approach)
Examples: CURE (Clustering Using Representatives), BIRCH (Balanced Iterative
Reducing and Clustering using Hierarchies), etc.
Partitioning Methods: These methods partition the objects into k clusters, with each
partition forming one cluster, and optimize an objective criterion (a similarity function),
typically one in which distance is the major parameter. Examples: K-means, CLARANS
(Clustering Large Applications based upon Randomized Search), etc.
Grid-based Methods: In these methods, the data space is divided into a finite number of
cells that form a grid-like structure. All clustering operations are performed on this grid,
which makes them fast and largely independent of the number of data objects. Examples:
STING (Statistical Information Grid), WaveCluster, CLIQUE (Clustering In QUEst), etc.
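As a concrete illustration of two of these families (density-based and hierarchical), here is a minimal sketch assuming scikit-learn; the toy data and parameter values are purely illustrative:

from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.datasets import make_blobs

# Toy data: 3 blobs in 2-D, purely for illustration.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Density-based: clusters are dense regions; points in sparse regions get label -1 (noise).
db_labels = DBSCAN(eps=0.9, min_samples=5).fit_predict(X)

# Hierarchical (agglomerative, bottom-up): repeatedly merges the closest clusters.
hc_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

print(set(db_labels), set(hc_labels))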
Clustering Algorithms :
K-means clustering algorithm – It is the simplest unsupervised learning algorithm that
solves the clustering problem. The K-means algorithm partitions n observations into k
clusters, where each observation belongs to the cluster with the nearest mean, which serves
as a prototype of the cluster.
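A minimal sketch of this partitioning, assuming scikit-learn (the synthetic blobs and k = 4 are illustrative choices):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Partition the observations into k = 4 clusters; each cluster's mean acts as its prototype.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

print(kmeans.cluster_centers_)   # the 4 prototype means
print(kmeans.labels_[:10])       # hard assignment of the first 10 points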
There are two types of clustering: hard clustering and soft clustering. The
distinction lies in whether each data point is assigned to exactly one cluster,
or whether we instead work with the likelihood or probability of each data
point belonging to the various clusters.
Hard clustering
In hard clustering, each data point either belongs to a cluster completely or
does not belong to it at all.
Soft clustering
In soft clustering, instead of putting each data point into exactly one cluster,
a probability or likelihood of the data point belonging to each cluster is
computed. An observation can therefore belong to more than one cluster to a
certain degree, i.e. with a greater or smaller likelihood of membership.
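K-means performs hard clustering; a Gaussian mixture model is one common way to obtain soft cluster memberships. A minimal sketch, assuming scikit-learn (the toy one-dimensional data are illustrative):

import numpy as np
from sklearn.mixture import GaussianMixture

# Toy 1-D data with two overlapping groups, purely for illustration.
X = np.array([[1.0], [1.2], [0.8], [5.0], [5.3], [4.7], [3.0]])

# predict_proba returns, for each point, its probability of belonging to each cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict_proba(X).round(2))   # soft memberships (rows sum to 1)
print(gmm.predict(X))                  # hardened labels, for comparison with K-means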
Market basket analysis is a data mining technique used by retailers to increase sales by better
understanding customer purchasing patterns. It involves analyzing large data sets, such as
purchase history, to reveal product groupings and products that are likely to be purchased
together.
The adoption of market basket analysis was aided by the advent of electronic point-of-sale (POS)
systems. Compared to handwritten records kept by store owners, the digital records generated by
POS systems made it easier for applications to process and analyze large volumes of purchase
data.
Implementation of market basket analysis requires a background in statistics and data science
and some algorithmic computer programming skills. For those without the needed technical
skills, commercial, off-the-shelf tools exist.
One example is the Shopping Basket Analysis tool in Microsoft Excel, which analyzes
transaction data contained in a spreadsheet and performs market basket analysis. A transaction
ID must relate to the items to be analyzed. The Shopping Basket Analysis tool then creates two
worksheets:
o The Shopping Basket Item Groups worksheet, which lists items that are frequently
purchased together.
o The Shopping Basket Rules worksheet, which shows how items are related (for example,
purchasers of Product A are likely to buy Product B).
Market Basket Analysis is modelled on Association rule mining, i.e., the IF {}, THEN {}
construct. For example, IF a customer buys bread, THEN he is likely to buy butter as well.
o Antecedent: Items or 'item sets' found within the data are antecedents. In simpler words,
it is the IF component, written on the left-hand side. In the above example, bread is the
antecedent.
o Consequent: The item (or item set) found in combination with the antecedent; it is the
THEN component, written on the right-hand side. In the above example, butter is the
consequent.
Market Basket Analysis techniques can be categorized based on how the available data is
utilized. The main types of market basket analysis in data mining are the following:
1. Descriptive market basket analysis: This type only derives insights from past data and
is the most frequently used approach. The analysis here does not make any predictions
but rates the association between products using statistical techniques. For those familiar
with the basics of Data Analysis, this type of modelling is known as unsupervised
learning.
2. Predictive market basket analysis: This type uses supervised learning models like
classification and regression. It essentially aims to mimic the market to analyze what
causes what to happen. Essentially, it considers items purchased in a sequence to
determine cross-selling. For example, buying an extended warranty is more likely to
follow the purchase of an iPhone. While it isn't as widely used as a descriptive MBA, it is
still a very valuable tool for marketers.
3. Differential market basket analysis: This type of analysis is beneficial for competitor
analysis. It compares purchase history between stores, between seasons, between two
time periods, between different days of the week, etc., to find interesting patterns in
consumer behaviour. For example, it can help determine why some users prefer to
purchase the same product at the same price on Amazon vs Flipkart. The answer can be
that the Amazon reseller has more warehouses and can deliver faster, or maybe
something more profound like user experience.
In market basket analysis, association rules are used to predict the likelihood of products being
purchased together. Association rules count the frequency of items that occur together, seeking
to find associations that occur far more often than expected.
Algorithms that use association rules include AIS, SETM and Apriori. The Apriori algorithm is
commonly cited by data scientists in research articles about market basket analysis. It identifies
the items that occur frequently in the database and then extends them to larger and larger item
sets, keeping only those that still appear sufficiently often.
R's arules package is an open-source toolkit for association rule mining using the R
programming language. It supports the Apriori algorithm, and companion packages such as
arulesNBMiner, opusminer, RKEEL and RSarules add further mining algorithms.
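For readers working in Python rather than R, a comparable minimal sketch using the mlxtend library; the tiny transaction list and thresholds are assumptions for illustration:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Illustrative transactions: each inner list is one customer's basket.
transactions = [["bread", "butter", "milk"],
                ["bread", "butter"],
                ["bread", "jam"],
                ["butter", "milk"],
                ["bread", "butter", "jam"]]

# One-hot encode the baskets into a boolean item matrix.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Frequent item sets (support >= 40%) and the rules derived from them.
frequent = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="lift", min_threshold=1.0)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])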
With the help of the Apriori Algorithm, we can further classify and simplify the item sets that the
consumer frequently buys. There are three key measures in the Apriori algorithm:
o Support
o Confidence
o Lift
For example, suppose 5,000 transactions have been made through a popular e-commerce
website, and we want to calculate the support, confidence, and lift for two products, say a pen
and a notebook. Out of the 5,000 transactions, 1,500 contain a pen, 1,700 contain a notebook,
and 1,000 contain both.
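With these (illustrative) counts, the three measures can be computed directly; a minimal sketch of the arithmetic:

total = 5000        # all transactions
pen = 1500          # transactions containing a pen
notebook = 1700     # transactions containing a notebook
both = 1000         # transactions containing both

support = both / total                  # 0.20: 20% of baskets contain both items
confidence = both / pen                 # ~0.67: of pen buyers, about 67% also buy a notebook
lift = confidence / (notebook / total)  # ~1.96: > 1, so the two items occur together
                                        # more often than if they were independent
print(support, round(confidence, 2), round(lift, 2))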
The following examples explore Market Basket Analysis by market segment:
o Retail: The most well-known MBA case study is Amazon.com. Whenever you view a
product on Amazon, the product page automatically recommends "Items bought together
frequently." It is perhaps the simplest and cleanest example of MBA's cross-selling
techniques.
Apart from e-commerce formats, MBA is also widely applicable to the in-store retail
segment. Grocery stores pay meticulous attention to product placement and shelf
optimization. For example, you are almost always likely to find shampoo and
conditioner placed very close to each other at the grocery store. Walmart's well-known
beer-and-diapers association anecdote is also an example of Market Basket Analysis.
o Telecom: With the ever-increasing competition in the telecom sector, companies are
paying close attention to the services their customers use. For example, telecom operators
have now started to bundle TV and Internet packages, along with other discounted online
services, to reduce churn.
o IBFS: Tracing credit card history is a hugely advantageous MBA opportunity for IBFS
organizations. For example, Citibank frequently employs sales personnel at large malls to
lure potential customers with attractive discounts on the go. They also associate with apps
like Swiggy and Zomato to show customers many offers they can avail of via purchasing
through credit cards. IBFS organizations also use basket analysis to determine fraudulent
claims.
The market basket analysis data mining technique offers the following benefits:
o Increasing market share: Once a company hits peak growth, it becomes challenging to
determine new ways of increasing market share. Market Basket Analysis can be used to
put together demographic and gentrification data to determine the location of new stores
or geo-targeted ads.
o Optimization of in-store operations: MBA is not only helpful in determining what goes
on the shelves but also behind the store. Geographical patterns play a key role in
determining the popularity or strength of certain products, and therefore, MBA has been
increasingly used to optimize inventory for each store or warehouse.
o Campaigns and promotions: MBA is used not only to determine which products go
together but also to identify which products form the keystones of a product line.
o Recommendations: OTT platforms like Netflix and Amazon Prime benefit from MBA
by understanding what kind of movies people tend to watch frequently.