
DSC100

Fundamentals of Data Science

PARTICIPANT HANDBOOK
INSTRUCTOR-LED TRAINING
Course Version: 10
Course Duration: 3 Day(s)
e-book Duration: 5 Hours 40 Minutes
Material Number: 50156437
SAP Copyrights, Trademarks and
Disclaimers

© 2021 SAP SE or an SAP affiliate company. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or for any purpose without the
express permission of SAP SE or an SAP affiliate company.
SAP and other SAP products and services mentioned herein as well as their respective logos are
trademarks or registered trademarks of SAP SE (or an SAP affiliate company) in Germany and other
countries. Please see http://global12.sap.com/corporate-en/legal/copyright/index.epx for additional
trademark information and notices.
Some software products marketed by SAP SE and its distributors contain proprietary software
components of other software vendors.
National product specifications may vary.
These materials may have been machine translated and may contain grammatical errors or
inaccuracies.
These materials are provided by SAP SE or an SAP affiliate company for informational purposes only,
without representation or warranty of any kind, and SAP SE or its affiliated companies shall not be liable
for errors or omissions with respect to the materials. The only warranties for SAP SE or SAP affiliate
company products and services are those that are set forth in the express warranty statements
accompanying such products and services, if any. Nothing herein should be construed as constituting an
additional warranty.
In particular, SAP SE or its affiliated companies have no obligation to pursue any course of business
outlined in this document or any related presentation, or to develop or release any functionality
mentioned therein. This document, or any related presentation, and SAP SE’s or its affiliated companies’
strategy and possible future developments, products, and/or platform directions and functionality are
all subject to change and may be changed by SAP SE or its affiliated companies at any time for any
reason without notice. The information in this document is not a commitment, promise, or legal
obligation to deliver any material, code, or functionality. All forward-looking statements are subject to
various risks and uncertainties that could cause actual results to differ materially from expectations.
Readers are cautioned not to place undue reliance on these forward-looking statements, which speak
only as of their dates, and they should not be relied upon in making purchasing decisions.



Typographic Conventions

American English is the standard used in this handbook.


The following typographic conventions are also used.

This information is displayed in the instructor’s presentation

Demonstration

Procedure

Warning or Caution

Hint

Related or Additional Information

Facilitated Discussion

User interface control Example text

Window title Example text



Contents

vi Course Overview

1 Unit 1: Data Science Overview

2 Lesson: Understanding Data Science

12 Unit 2: Business Understanding Phase

13 Lesson: Understanding the Business Phase

28 Unit 3: Data Understanding Phase

29 Lesson: Understanding the Data Phase

30 Unit 4: Data Preparation Phase

31 Lesson: Understanding Data Preparation

67 Unit 5: Modeling Phase

68 Lesson: Understand the parts of the modeling phase

199 Unit 6: Evaluation Phase

200 Lesson: Understanding the Evaluation Phase

206 Unit 7: Deployment and Maintenance Phase

207 Lesson: Deployment and Maintenance Phase


218 Lesson: End-to-end Scenario

219 Unit 8: Conclusion

220 Lesson: SAP Data Science Applications



Course Overview

TARGET AUDIENCE
This course is intended for the following audiences:
● Data Manager
● Application Consultant
● Development Consultant
● Technology Consultant
● Data Consultant
● Data Scientist
● Developer



UNIT 1 Data Science Overview

Lesson 1
Understanding Data Science 2

UNIT OBJECTIVES

● Understand the basic principles of data science



Unit 1
Lesson 1
Understanding Data Science

LESSON OBJECTIVES
After completing this lesson, you will be able to:
● Understand the basic principles of data science

Introduction to Data Science


Data Science Overview
Data science is an interdisciplinary field. The following figure introduces you to its basic
concepts.

Figure 1: Data Science

Data science is a multidisciplinary field. It encompasses tools, methods, and systems, including statistics and data analytics, that are applied to large volumes of data with the purpose of deriving insights to support decision-making. Data science may include the collection and usage of data for the following:
● A superior understanding of business operations.
● The provision of an accurate, up-to-date evaluation of business performance.
● Using predictive analytics to transform an organization from being reactive to proactive in
the context of its business decision-making.
● Improving customer service by using data to build a more coherent knowledge base that
offers a greater understanding of customer needs.

Data science is often represented by images like the following figure.


Figure 2: What is data science?

For more information on data science, see the following:


● https://www.simplilearn.com/data-science-vs-data-analytics-vs-machine-learning-article
● http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Data Science as a Multidisciplinary Field


The following figure examines the scientific approach of data science and machine learning
(ML).

Figure 3: Machine Learning or Data Science?

For more information on the multidisciplinary aspect of data science, see the following:
● https://www.quora.com/What-is-the-difference-between-a-data-scientist-and-a-
machine-learning-engineer

The Limitations of Data Science


The relatively recent growth of data science was stimulated by the availability of big data and
cheap computing power. Small data sets, poor quality data, inconsistent data, and incorrect


data are problematic for data scientists and can waste time, producing an analysis that is
meaningless or misleading.
This course introduces you to the basic techniques of data science. However, you must
always remember that data science relies on reliable data.

Intelligent Enterprise
Artificial intelligence (AI) and ML are core enablers of the intelligent technologies that support the SAP Intelligent Enterprise Framework.

Figure 4: SAP Delivers the Intelligent Enterprise

Augmented Analytics
Gartner defines augmented analytics in the following way: "Augmented analytics is the use of
enabling technologies such as machine learning and AI to assist with data preparation, insight
generation and insight explanation to augment how people explore and analyze data in
analytics and BI platforms. It also augments the expert and citizen data scientist by
automating many aspects of data science, machine learning, and AI model development,
management and deployment."
SAP Analytics Cloud provides you with augmented analytics capabilities in its Smart Predict module. You will be introduced to these tools when you complete the exercises in this course.
For more information on augmented analytics, see the following:
● https://www.gartner.com/en/information-technology/glossary/augmented-analytics

Summary
Data science is an interdisciplinary field within the broad areas of mathematics, statistics,
operations research, information science, and computer science. Data science focuses on the
processes and systems that enable the extraction of knowledge or insights from data. ML and
AI are both parts of data science. AI and ML are core enablers of the intelligent technologies
that support the SAP Intelligent Enterprise Framework.


Data Science Methodologies


Methodologies Overview
There are a wide range of data science project methodologies, which have been developed
over many years. Some of the more popular are the following:

● IBM SPSS (Statistical Package for the Social Sciences) 5 As: Assess, Access, Analyze, Act, Automate
● KDD (Knowledge Discovery in Databases) process
● SAS's SEMMA: Sample, Explore, Modify, Model, Assess
● Cross Industry Standard Process for Data Mining (CRISP-DM)

In addition to the data science software and algorithms they use, many organizations that
have data science teams develop their own data science methodology so that their data
science processes closely align with their business and decision-making processes.

The Necessity of Project Methodologies for Data Science


Data science processes must be reliable and repeatable by people who possess relatively little
experience in the area of data science. Therefore, a project methodology is vital because it
provides the following:
● A framework for recording experience
● A number of methods that allow projects to be replicated
● A means of assistance for project planning and project management
● A "comfort factor" for new adopters of these methodologies
● A reduced dependency on "stars"

Figure 5: Why should there be a project methodology?

Ultimately, the methodology must support the effective integration of data science into your
organization.

CRISP-DM
The most popular project methodology in data science is CRISP-DM. CRISP-DM was an
initiative launched in 1996 and led by five companies: SPSS, Teradata, Daimler AG, NCR


Corporation, and OHRA, an insurance company. Over 300 organizations contributed to the
process model.
CRISP-DM aims to provide a data-centric project methodology that is the following:
● Non-proprietary
● Application and industry neutral
● Tool neutral
● Focused on business issues as well as technical analysis

Polls conducted by KDnuggets in 2002, 2004, 2007, and 2014 show that CRISP-DM was the leading methodology used by industry data miners. The methodology has the following structure:
● The CRISP-DM methodology is a hierarchical process model.
● At the top level, the process is divided into six different generic phases, ranging from
business understanding to the deployment of project results.
● The next level elaborates each of these phases, comprising several generic tasks; at this
level, the description is generic enough to cover all data science scenarios.
● The third level specializes these tasks for specific situations - for example, the generic task
can be cleaning data, and the specialized task can be cleaning of numeric or categorical
values.
● The fourth level is the process, that is, the recording of actions, decisions, and results of an
actual execution of a DM project.

The six generic phases are represented in the following figure:

Figure 6: Six Generic Phases of CRISP-DM

Each of the six generic phases is important; they are best summarized in the following way:


Business understanding
During this phase, you confirm the project objectives and requirements from the
perspective of your business. Define the data science approach that answers specific
business objectives.
Data understanding
During this phase, you commence initial data collection and familiarization. You also
identify data quality problems.
Data preparation
During this phase, you select data tables, records, and attributes. You also undertake any
data transformation and cleaning that is required.
Modeling
During this phase, you select modeling techniques. You also calibrate the model
parameters and begin model building.
Evaluation
During this phase, you confirm that the business objectives have been achieved.
Deployment
During this phase, you deploy models and “productionize,” if required. You also develop
and implement a repeatable process that enables your organization to monitor and
maintain each model’s performance.

The sequence of the phases is not strict, and movement back and forth between different
phases is always required. The arrows in the process figure, Six Generic Phases of CRISP-DM,
indicate the most important and frequent dependencies between phases.
The outer circle in the figure, Six Generic Phases of CRISP-DM, symbolizes the cyclic nature
of any data science project. The process continues after a solution has been deployed.
The lessons learned during the process can trigger new, often more focused business
questions, and subsequent data science processes that benefit from the experiences of the
previous ones. This is illustrated by the figure, Six Generic Phases of CRISP-DM.

CRISP-DM: Phase 1, Business Understanding


The objective of this phase focuses on understanding the project objectives and requirements
from a business perspective. You convert this knowledge into a data science problem
definition and a preliminary plan designed to achieve these objectives.

Figure 7: CRISP-DM: Phase 1, Business Understanding


CRISP-DM: Phase 2, Data Understanding


This phase begins with initial data collection and proceeds with activities that help you become familiar with the data, identify data quality problems, discover first insights into the data, and detect interesting subsets that can form hypotheses about hidden information.

Figure 8: CRISP-DM: Phase 2, Data Understanding

CRISP-DM: Phase 3, Data Preparation


The objective of this phase is to cover all activities to construct the final data set from the
initial raw data. Data preparation tasks are likely to be performed multiple times and not in
any prescribed order. Tasks include table, record, and attribute selection in addition to
transformation and cleaning of data for the chosen algorithms.

Figure 9: CRISP-DM - Phase 3: Data Preparation

CRISP-DM: Phase 4, Modeling


The objective of this phase is to select various modeling techniques and ensure that they, and their
parameters, are applied and calibrated to optimal values. Some techniques have specific
requirements for the form of data. Therefore, stepping back to the data preparation phase is
often necessary.


Figure 10: CRISP-DM - Phase 4: Modeling

Phase 5: Evaluation
The objective of this phase is to thoroughly evaluate the model and review the model
construction to be certain it properly achieves the business objectives. A key objective is to
determine if there is some important business issue that has not been sufficiently considered.
At the end of this phase, a decision on the use of these data science results should be reached.

Figure 11: Phase 5: Evaluation

Phase 6: Deployment
The objective of this phase is the acquirement of knowledge that you gain and need to be
organized and presented in a way that allows the organization to use it. However, depending
on the requirements, the deployment phase can be as simple as generating a report or as
complex as implementing a repeatable data mining process across the enterprise.


Figure 12: Phase 6: Deployment

CRISP-DM: Monitoring Phase


Most models' predictive performance degrades over time because the data to which the model is applied changes. Data distributions can change as customer characteristics change, competitors launch campaigns, and the general business environment changes.
The models must be updated when this happens. A monitoring phase can be added to the
CRISP-DM methodology that specifically focuses on this very important aspect of any data
science project.

Figure 13: CRISP-DM: Monitoring Phase
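To make the monitoring idea concrete, the following is a minimal sketch in Python. The accuracy figures and tolerance are hypothetical values chosen for illustration; they are not part of the course material, and a real monitoring job would recompute the current accuracy from recently scored and observed data.

# Hypothetical monitoring check: compare the accuracy measured on the most
# recent scored-and-observed data against the accuracy measured at deployment.
accuracy_at_deployment = 0.82   # assumed baseline from the evaluation phase
current_accuracy = 0.74         # assumed value measured on recent data
tolerance = 0.05                # acceptable degradation before retraining

if accuracy_at_deployment - current_accuracy > tolerance:
    print("Model performance has degraded - schedule retraining.")
else:
    print("Model performance is within tolerance.")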

Summary
This lesson has introduced you to the most popular project methodology for data science,
CRISP-DM. There are six key phases, and each phase includes a number of tasks and outputs.
It is very important for you to follow a project methodology when you are working on a data
science project, so that you understand the order of the phases and each of the tasks you
must consider. Different data science projects have different requirements, which means you
could use CRISP-DM as a template to ensure you have considered all of the different aspects
specific to your project, and modify it, as required.


LESSON SUMMARY
You should now be able to:
● Understand the basic principles of data science



UNIT 2 Business Understanding
Phase

Lesson 1
Understanding the Business Phase 13

UNIT OBJECTIVES

● Explain the business understanding phase



Unit 2
Lesson 1
Understanding the Business Phase

LESSON OBJECTIVES
After completing this lesson, you will be able to:
● Explain the business understanding phase

CRISP-DM Phase
This phase focuses on understanding the project objectives and designing a plan to achieve
the objectives.

Figure 14: CRISP-DM: Phase 1, Business Understanding

Tasks
The first objective is to thoroughly understand, from a business perspective, what the client wants to accomplish. The organization usually has many competing objectives and constraints, which must be properly balanced. The analyst's goal is to uncover, at the beginning, the important factors that can influence the outcome of the project.

Outputs
Outputs are broadly divided into the following categories:
● Background
● Business objectives
● Deployment option


● Business success criteria

Output Categories
This is an overview of the various categories of output.

Outputs: Background
Record the information that is known about the organization's business situation at the
beginning of the project.
Outputs: Business Objectives
Describe the customer's primary objective from a business perspective.
In addition to the primary business objective, there are generally other related business
questions that the organization would like to address.
For example, the primary business goal for a financial services business might be to keep
current customers by predicting when they are prone to move to a competitor.
Examples of related business questions are as follows: "How does the primary channel a
bank customer uses (for example, ATM, branch visit, internet) affect whether they stay
or go?" or "Will lower ATM fees significantly reduce the number of high-value customers
who leave?"
Outputs: Deployment Option
Agree with the customer how they want to deploy the analysis when it is completed - for example, do they want the analysis available in a standalone app, or embedded in an existing business application?

Outputs: Business Success Criteria


Describe the criteria for a successful or useful outcome to the project from the business
point of view.
This might be quite specific and able to be measured objectively, such as a reduction of
customer churn to a certain level, or general and subjective such as "give useful insights
into the relationships."
In the latter case it should be indicated who makes the subjective judgment.

Task: Assess situation


In the previous task, the objective refers to quickly arriving at the crux of the situation. You
continue by fleshing out the details in the following way:
This task involves more detailed fact-finding about all of the resources, constraints,
assumptions, and other factors that must be considered to determine the data science goal
and project plan.

Outputs: Inventory of Resources


List the resources available to the project, including the following:


● Personnel - business experts, data experts, technical support, data science personnel
● Data - fixed extracts, access to live warehoused, or operational data
● Computing resources - hardware platforms
● Software - data science tools, other relevant software

List all requirements of the project, including the schedule for completion, comprehensibility and quality of results, security, and legal issues. As part of this output, make sure that you are allowed to use the data.
List the assumptions made by the project, which are the following:

● These can be assumptions about the data, which can be checked during the analysis process. However, they can also include assumptions about the business on which the project rests that cannot be checked.
● It is particularly important to list these if they form conditions on the validity of the results.

List the constraints on the project. These may be constraints on the availability of resources,
but may also include technological constraints such as the size of data that it is practical to
use for modeling.

Outputs: Risks and contingencies


List the risks or events that might occur to delay the project or cause it to fail. List the
corresponding contingency plans. Identify what action is taken if the risks happen.

Outputs: Terminology
Compile a glossary of terminology relevant to the project. This can include two
components:
A glossary of relevant business terminology, which forms part of the business
understanding available to the project. Constructing this glossary is a useful "knowledge
elicitation" and education exercise.
A glossary of data science terminology, illustrated with examples relevant to the business
problem in question.
Outputs: Costs and Benefits
Construct a cost-benefit analysis for the project, which compares the costs of the project
with the potential benefit to the business if it is successful.
The comparison should be as specific as possible, for example using monetary measures
in a commercial situation.

Tasks: Business Goals and Data Science Goals


We distinguish business objectives from data science objectives. This task determines the data science goal.
● A business goal states objectives in business terminology.
● A data science goal states objectives in technical terms.

Task: Determine data science goals


Determine the business goal. An example of a business goal might be the following:
"Increase catalog sales to existing customers."


A data science goal might be the following: "Predict how many widgets a customer will
buy, given their purchases over the past three years, demographic information (age,
salary, city, and so on) and the price of the item."

Output: Data science goals


Describe the intended outputs of the project that enable the achievement of the business objectives.
Output: Data science success criteria
Define the criteria for a successful outcome to the project in technical terms, for example, a certain level of predictive accuracy, or a propensity-to-purchase profile with a given degree of "lift."
As with any business success criteria, it is often necessary to describe these in subjective
terms, in which case the person or persons making the subjective judgment should be
identified.

Achieve Data Science Goals


Task: We must describe the intended plan for achieving each data science goal and thereby
achieving every business goal. The plan must specify the anticipated set of steps you perform
during the rest of the project including an initial selection of tools and techniques.
Output: The project plan must detail the project stages, duration, resources, and so on. It
must also include an initial assessment of tools and techniques.

Output: Project plan


List the stages to be executed in the project, together with duration, resources, inputs,
outputs and dependencies.
Where possible make explicit the large-scale iterations in the data science process, for
example repetitions of the modeling and evaluation phases.
As part of the project plan, it is also important to analyze dependencies between time
schedule and risks.
The project plan is a dynamic document in the sense that at the end of each phase a
review of progress and achievements is necessary and an update of the project plan is
recommended. Specific review points for these reviews are part of the project plan.

Output: Initial assessment of tools and techniques


At the end of the first phase, the project also performs an initial assessment of tools and
techniques. For example, you select a data science algorithm that supports the available
data and required output.
It is important to assess tools and techniques early in the process since this selection
possibly influences the entire project.

Summary
This lesson introduced you to the details of the tasks required in the Business Understanding phase of the CRISP-DM project methodology. These tasks are outlined in the following way:

● Determine business objectives (and agree deployment options)


● Assess situation


● Determine data science goals


● Produce project plan

Defining Project Success Criteria


Business Success Criteria
It is helpful to describe the criteria for a successful or useful outcome to the project from the
business point of view. This can be quite specific and able to be measured objectively, such as
reduction of customer churn to a certain level, or general and subjective, such as giving useful
insights into these relationships.

Data Science Criteria


It is also helpful to define the criteria for a successful outcome to the project in technical
terms, for example, a certain level of predictive accuracy or a propensity to purchase profile
with a given degree of “lift.”
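To illustrate the idea of lift, the following minimal Python sketch compares the response rate among the highest-scored customers with the overall response rate. The scores and outcomes are invented for the sketch and are not taken from the course data.

# Lift: how much better the top-scored customers respond compared with the
# overall response rate of the whole group.
scores   = [0.9, 0.8, 0.75, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]  # model scores
response = [1,   1,   0,    1,   0,   0,   1,   0,   0,   0]    # actual outcomes

overall_rate = sum(response) / len(response)                  # 4 of 10 -> 0.4
ranked = sorted(zip(scores, response), reverse=True)          # highest scores first
top = [outcome for _, outcome in ranked[: len(ranked) // 5]]  # top 20% of scores
top_rate = sum(top) / len(top)                                # 2 of 2 -> 1.0
print("Lift in the top 20%:", top_rate / overall_rate)        # 2.5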

Industry Surveys
Industry surveys indicate the standard methods of assessing data science project success. They show the following:
● In both of the surveys in the figure, meeting business goals and model accuracy or performance are the two most important factors.
● On the left-hand side, 57% of responders answered the question "How do you measure success for a predictive analytics project?" with "meeting business goals", and 56% with "model accuracy". Lift is also an important factor. You will learn how to calculate lift in more detail later in this course.
● On the right-hand side, in its Third Annual Data Miner Survey, Rexer Analytics, a CRM consulting firm based in Winchester, Massachusetts, USA, asked the BI community: "How do you evaluate project success in Data Mining?" Out of 14 different criteria, 58% ranked "Model Performance (Lift, R2, and so on)" as the primary factor.

Figure 15: Industry surveys


Descriptive or Predictive Models


The data science success criteria will differ depending on whether the models are “predictive”
or “descriptive” type models and the type of algorithm chosen.
Descriptive models can be described in the following ways:
● Descriptive analysis describes or summarizes raw data and makes it more interpretable. It
describes the past – i.e. any point of time that an event occurred, whether it was one
minute ago or one year ago.
● Descriptive analytics are useful because they allow you to learn from past behaviors and
understand how these might influence future outcomes.
● Common examples of descriptive analytics are reports that provide historical insights regarding a company's production, financials, operations, sales, inventory, and customers.
● Descriptive analytical models include cluster models, association rules, and network
analysis. You will learn more about these later in this course.

Predictive models can be described in the following ways:


● Predictive analysis predicts what might happen in the future – providing estimates about
the likelihood of a future outcome.
● One common application is the use of predictive analytics to produce a credit score. These
scores are used by financial services to determine the probability of customers making
future credit payments on time.
● Typical business uses include understanding how sales might close at the end of the year,
predicting what items customers will purchase together, or forecasting inventory levels
based upon a myriad of variables.
● Predictive analytical models include classification models, regression models, and neural
network models. You will learn more about these later in this course.

Algorithm Types
The data science success criteria will differ depending on the type of algorithm chosen.

Figure 16: Algorithm Type Selection


Consider the figure in the context of the following points:


● The data science success criteria will also differ depending on the type of algorithm
chosen. Different algorithms can use different accuracy metrics. You will learn more about
these algorithms and metrics later in this course.
● The business question helps to determine the most likely algorithm type to use.
● You can choose algorithms to analyze trends in data and use this information for
forecasting.
● You can identify the main influencers and relationships that could be driving customers to
switch to another supplier. This is called churn analysis. You can also analyze why certain
customers are more likely to respond to a campaign offer and buy specific products.
● Some algorithms group observations or customers together, which means that all of the
customers in a group have similar characteristics. These are called cluster algorithms.
● You can use association type algorithms to understand which products to recommend to
customers in a cross or up-sell marketing campaign, or to analyze the relationship
between certain variables in a data set.
● And you can identify unusual values in a data set by using anomaly detection algorithms.

Figure 17: The Wide Range of Algorithms you can Choose

There is a wide range of algorithms to choose from, depending on the type of question asked by the business, the output that is required, and the data that are available:
● For Association rules, or basket analysis, you can use algorithms that analyze the
combinations of products purchased together in a basket or over time. One of the
common algorithms used is called Apriori and you will learn more about this later in the
course.
● For clustering, you can use algorithms that group similar observations together. You are introduced to a number of these algorithms later in the course, including a commonly used one called K-Means.
● For classification analysis, where you are classifying observations into groups, you can use
decision trees or neural networks.
● You can use outlier analysis to identify which observations have unusually high or low
values and to identify anomalies.


● Regression algorithms enable you to forecast the values of continuous type variables, such
as customer spend in the next 12 months.
● Time-series analysis enables you to forecast future KPI values and to control stock and inventory levels.

Each of these algorithms is explored in more detail during this course.
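As a minimal illustration of one of these algorithm families, the following Python sketch groups similar observations with K-Means. The customer values are invented, and scikit-learn is assumed to be available; the course exercises themselves use SAP tools rather than this library.

# Group customers by two hypothetical features: annual spend and visit count.
import numpy as np
from sklearn.cluster import KMeans

customers = np.array([
    [200, 2], [220, 3], [250, 2],      # low spend, few visits
    [900, 20], [950, 22], [880, 18],   # high spend, many visits
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)            # cluster assignment for each customer
print(kmeans.cluster_centers_)   # the "typical" customer in each cluster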

The Business Question


The following figure outlines the basic business questions that need to be answered at the beginning of a data science project.

Figure 18: Business Question that Needs to be Answered

Each of these broad categories of algorithm can answer different types of business question.
Consider the following points in light of this fact and the preceding figure:
● For classification, you can answer the who and when type questions, such as: which customers will buy a product, and when will they most likely make the purchase? You can also answer questions such as: which machine will fail, and when will it need preventative maintenance? Is that transaction fraudulent?
● For regression, you can answer the what type questions. What will be the spend of each customer in the next 12 months? How many customers will churn next year?
● For clustering and segmentation you are grouping together similar observations. This
enables you to communicate to customers with similar needs and requirements who are
grouped together in a cluster, or develop specific products or services for customers in a
segment.
● Forecasting allows you to estimate a KPI on a regular time interval. So for example, you
can forecast revenue per month for the next 12 months, accounting for trends,
seasonalities and other external factors.
● Link analysis is used mainly in telecommunications to create communities of customers
who are calling one another, or in retail analysis to analyze the links between customers
and the products they have purchased to support product recommendations.
● And association rules and recommendations are used for basket analysis and also to
produce product recommendations for customers.


Model Accuracy and Robustness


The “accuracy” and “robustness” of a model are two major factors that determine the quality
of the prediction, which reflects how successful the model is overall.
To determine accuracy, you must take the following into account:
● Accuracy is often the starting point for analyzing the quality of a predictive model, as well
as an obvious criterion for prediction.
● Accuracy measures the ratio of correct predictions to the total number of cases evaluated.
● There are a wide variety of metrics and methods to measure accuracy, such as lift charts
and decile tables, which measure the performance of the model against random guessing,
or what the results would be if you didn’t use any model. These will be discussed in more
detail later in this course.
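For example, accuracy can be computed directly as the ratio of correct predictions to the total number of cases evaluated. The following minimal Python sketch uses invented actual and predicted labels:

# Accuracy = correct predictions / total cases evaluated.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(f"Accuracy: {accuracy:.2f}")   # 6 correct out of 8 -> 0.75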

To determine robustness, you must take the following into account:


● The robustness of a predictive model refers to how well a model works on alternative data.
This can be hold-out data or new data to which the model is applied.
● The predictive performance of a model must not deteriorate substantially when it is
applied to data that were not used in model training.
● Robustness enables you to assess how confident you are in the prediction.

Training and Testing: Data Cutting Strategies


Training and testing data cutting strategies are important for the following reasons:
● Central to developing predictive models, and to assessing whether they are successful, is a train-and-test regime.
● Data is partitioned into training and test subsets. There are a variety of cutting strategies, for example, random, sequential, or periodic.
● You build a model on the training subset (called the estimation subset) and evaluate its performance on the test subset (a hold-out sample called the validation subset).
● Simple two- and three-way data partitioning is shown in the following figure, and a minimal illustrative cut is sketched after it.

Figure 19: Training and testing: Data Cutting Strategies
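The following is a minimal sketch, in Python, of a simple random three-way cut. The DataFrame contents, column names, and the 60/20/20 split ratios are illustrative assumptions rather than values prescribed by the course.

# Randomly cut a data set into estimation (training), validation, and test subsets.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
data = pd.DataFrame({
    "age":    rng.integers(18, 70, size=100),
    "spend":  rng.normal(500, 150, size=100).round(2),
    "target": rng.integers(0, 2, size=100),
})

shuffled = data.sample(frac=1.0, random_state=42).reset_index(drop=True)
n = len(shuffled)
estimation = shuffled.iloc[: int(0.6 * n)]               # used to build the model
validation = shuffled.iloc[int(0.6 * n): int(0.8 * n)]   # hold-out for tuning
test       = shuffled.iloc[int(0.8 * n):]                # final performance check
print(len(estimation), len(validation), len(test))       # 60 20 20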


When a predictive model has been built on the estimation sub-sample, its performance is tested on the validation and test sub-samples. You expect to find the following:
● You expect that the model will have similar performance on the estimation, validation and
test sub-sets.
● The more similar the performance is of the model on the sub-sets, the more robust the
model is overall.

However, an even more rigorous test is to check how well the models performs on totally new
data that was not used in the model training.
For example, if the model is to be used in a marketing campaign to identify which customers
are most likely to respond to a discount offer, often the model's performance is also tested to
analyze how well it would have performed on historical campaign data.
Frequently, a model is also tested on a new campaign to see how well it performs in a real
environment. Appropriate control groups are defined, so the response to the modeled group
can be compared to the response using other methods.
There are extensions to and variations on the train-and-test theme. For example, a random
splitting of a sample into training and test sub-sets could be fortuitous, especially when
working with small data sets, so you could conduct statistical experiments by executing a
number of random splits and averaging performance indices from the resulting test sets.

Summary
This lesson highlights the importance of the following:

● At an early stage in a project, it is important to clearly define business and data science
project success criteria.
● The data science success criteria differs depending on whether the models are predictive-
or descriptive-type models and the type of algorithm chosen.
● The business question that you are analyzing in your data science project helps to
determine the most likely algorithms to use.
● There is a wide range of algorithms to choose from, depending on the type of question
asked by the business, the output that is required and the data that are available.
● The accuracy and robustness of the model are two major factors to determine the quality
of the prediction, which reflects the success of the model.
● A train-and-test regime is central to developing predictive models and assessing if they are
successful.

Circular Economy
In a traditional linear economy, we take resources, make products, and dispose of them when we finish (take > make > use > dispose).
A circular economy (CE) is an alternative to a linear economy. It aims to close the loop, so that
waste is reused, re-purposed, or recycled in a way that retains as much value as possible.
Basically, resources are kept in use for as long as possible, in order to extract the maximum
value from them, and later they are recovered and regenerated at the end of service life.


CE is a major topic for industry as modern consumers insist that organizations consider how
to achieve sustainability, and develop strategies for narrowing, slowing and closing material
and energy flows as a means for addressing structural waste.
In addition to creating new opportunities for growth, a more circular economy allows for the
following:

● A reduction in waste and pollution


● A drive for greater resource productivity
● A means to position organizations to better address emerging resource security/scarcity
issues in the future
● A means to help reduce the environmental impacts of production and consumption

For more information on the circular economy, see the following:


● https://www.ellenmacarthurfoundation.org/circular-economy/concept
● https://en.wikipedia.org/wiki/Circular_economy
● https://towardsdatascience.com/amsterdams-environmental-saviour-the-circular-
economy-c83200222e61

Overview
A circular economy seeks to rebuild capital, whether this is financial, manufactured, human, social, or natural. This ensures enhanced flows of goods and services. The following figure shows what a CE attempts to achieve and illustrates the continuous flow of technical and biological materials through the "value circle" of such a system.

Figure 20: Overview of the Circular Economy

For more information, see the following:


● https://www.ellenmacarthurfoundation.org/circular-economy/concept/infographic


How Data Science Supports CE


As the idea of a circular economy becomes increasingly mainstream, how can businesses
stay relevant? One way is to use data science, artificial intelligence, and new technological
interventions to create changes across an organization’s supply chain.
Organizations must reconfigure and blend existing value creation mechanisms with new
innovative digital strategies, becoming data-driven, where decision-makers base their actions
on data and insights generated from analytics rather than instinct. Consequently, data science has a key role in helping organizations achieve their CE goals.
For more information, see the following:
● https://www.ellenmacarthurfoundation.org/explore/artificial-intelligence-and-the-
circular-economy

Circular Components and Products Aided by Data Science


The CE puts a strong focus on the design of components, products, and materials. Data
science and AI can enhance and accelerate the development of new products, components,
and materials that are fit for a CE through data driven design processes that allow for rapid
prototyping and testing.
Employing data science and AI can produce better designs faster, due to the speed with which an algorithm can analyze large amounts of data and identify initial designs or potential design adjustments. A designer can therefore review, improve, and approve adjustments based on the data. Data science gives designers more informed insight and the ability to reduce complexity and identify which designs fit the CE.

Circular Business Models that use Data Science


By combining real-time and historical data from products and users, data science can help
increase product circulation and asset utilization through pricing and demand prediction,
predictive maintenance, and smart inventory management.
Data science can be used to do the following:
● Create dynamic pricing models where food prices are reduced as the food approaches its expiry date, thus reducing food waste.
● Connect people with the things they want - for example, using matching algorithms to connect people to second-hand products.

Data science can also be used to provide predictive maintenance programs that do the following:

● Extend the life cycle of machines
● Increase utilization by reducing unplanned downtime and increasing equipment effectiveness
● Improve insight and transparency into an asset's condition and usage history

Circular Infrastructure Optimized


Data science and AI can help build and improve the “reverse logistics” infrastructure required
to ‘close the loop’ on products and materials by improving the processes to sort and
disassemble products, re-manufacture components, and recycle materials.
For example, data science, AI and robotics can be used to do the following:


● Automate the assessment of the condition of used products and recommend if they can
be reused, resold, repaired or recycled to maximize value preservation.
● Automate the dis-assembly of used products by using visual recognition to assess and
adjust the dis-assembly equipment settings based on the condition of the product and its
position on the dis-assembly line.
● Sort mixed material streams using visual recognition techniques and robotics.

CE in Data Science Initiatives


An enhanced process model adds an additional phase called data validation, and integrates
“analytic profiles” as a core element of the CRISP-DM process.
The data validation phase helps to ensure a complete understanding of whether the prepared
data is a valid representation of the original problem.
An analytic profile is a collection of knowledge, mainly used in the business and data
understanding phases, that lists the best practices for a particular analytics use case or
problem.
These suggested enhancements enable an organization to consolidate their analytics
knowledge base. These improvements also allow the organization to learn from, and reuse,
their own experience more easily. In addition, the organization can use the experience of
others to easily accelerate the analytics development process.
Using these enhancements, organizations can structure their resources to align their data
science and analytics capability with their overall CE business strategies.
For more information, see the following:
● "Exploring the Relationship Between Data Science and Circular Economy: An Enhanced CRISP-DM Process Model," Eivind Kristoffersen, et al., 2019. DOI: 10.1007/978-3-030-29374-1_15

Figure 21: How to Incorporate CE into Data Science Initiatives

Data Validation
Consider the following aspects of data validation:


● In CRISP-DM, there is no validation between the data preparation phase and the modeling
phase against the specific business domain. A complete understanding of whether the
data which is prepared is a valid representation of the original problem is not guaranteed.
● As such, this can result in sub-optimal solutions that miss the mark on the intended
capturing of business value.
● Therefore, data validation must be done by the re-involvement of domain experts to
validate that a proper understanding of the data and business problem has been reached,
and include data preparation methods tailored for the given analytic profile.
● The data validation phase can result in a re-iteration of the data understanding and/or the
data preparation phase(s), as indicated by a single arrow back in the figure, How to
Incorporate CE into Data Science Initiatives.

Analytic Profile
An analytic profile is an abstract collection of knowledge, mainly used in the business and
data understanding phases, which lists the best practices for a particular analytics use case,
or problem. The profile must include the following:

● Use case description defining the business goal


● Domain specific insights important for the use case
● Data sources relevant for the use case
● Key Performance Indicators (KPIs) or metrics for assessing the analytics implementation
performance
● Analytics models and tools with proven conformity for the given problem
● Short descriptions of previous implementations with lessons learned


Summary
This lesson covered the following:

● A number of the concepts for a CE and an explanation of how data science can support the
delivery of CE strategies
● The data science and AI systems that can be used to deliver many of the essential CE
concepts, from predictive analytics - such as setting the optimal service and repair


schedule for durable equipment, to dynamic pricing and matching for the effective
functioning of digital marketplaces for secondhand goods and by-product material
streams.
● A look to future trends, which show that data science and AI could be integral to the
redesign of whole systems, which create a circular society that works in accordance with
these principles over the long-term.

LESSON SUMMARY
You should now be able to:
● Explain the business understanding phase



UNIT 3 Data Understanding
Phase

Lesson 1
Understanding the Data Phase 29

UNIT OBJECTIVES

● Explain the data understanding phase



Unit 3
Lesson 1
Understanding the Data Phase

LESSON OBJECTIVES
After completing this lesson, you will be able to:
● Explain the data understanding phase

LESSON SUMMARY
You should now be able to:
● Explain the data understanding phase



UNIT 4 Data Preparation Phase

Lesson 1
Understanding Data Preparation 31

UNIT OBJECTIVES

● Prepare data



Unit 4
Lesson 1
Understanding Data Preparation

LESSON OBJECTIVES
After completing this lesson, you will be able to:
● Prepare data

CRISP-DM Data Preparation


Data Preparation Overview
The following figure includes a chart that gives you an insight into time allocation for data
preparation - in short, what data scientists spend most of their time doing on projects.

Figure 22: Time Allocation for Data Preparation

The figure and chart therein clarify the following points:

● Data preparation is also referred to as data wrangling, data munging, and data janitor work.
● It covers everything from list verification to removing commas and debugging databases. Messy data is by far the most time-consuming aspect of the typical data scientist's workflow.
● An article in The New York Times reported that data scientists spend from 50% to 80% of
their time mired in the more mundane task of collecting and preparing unruly digital data
before it can be explored for useful nuggets.
● The chart shows that 3 out of every 5 data scientists spend the most time during their
working day cleaning and organizing data - while only 9% spend most of their time mining
the data and building models.

CRISP-DM: Phase 3, Data Preparation


The following figure provides you with an overview of phase 3 of CRISP-DM, Data Preparation.


Figure 23: CRISP-DM - Phase 3: Data Preparation

Data Set Outputs


There are two separate outputs in this phase that are not related to a specific task.
The first is the data set:
● The data set, or data sets, produced by the data preparation phase are used for modeling or for the major analytical work of the project.

The second is the data set description:

● Here, you describe the data set, or data sets, that are used for the modeling or the major analytical work of the project.

Phase 3.1: Data Selection

Figure 24: Data Selection


Phase 3.2: Clean Data

Figure 25: Clean Data

Phase 3.3: Construct Data

Figure 26: Construct Data

Phase 3.4: Integrate Data

Figure 27: Integrate Data


Phase 3.5: Format Data

Figure 28: Format Data

Summary
This lesson covers the third phase of the CRISP-DM process, that is, data preparation. The
five tasks in this phase are as follows:

● Select data
● Clean data
● Construct data
● Integrate data
● Format data

The important outputs from this phase are the analytical data set that you use later for the data analysis, and its description.
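As a minimal illustration of the five tasks, a sketch in Python might look like the following. The DataFrames, column names, and the reference year used to derive age are hypothetical, and pandas is assumed to be available.

# Select, clean, construct, integrate, and format hypothetical customer data.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "birth_year":  [1980, None, 1975],
    "city":        ["Berlin", "Paris", "berlin"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount":      [100.0, 50.0, 75.0, 20.0],
})

# Select: keep only the attributes needed for the analysis.
selected = customers[["customer_id", "birth_year", "city"]]

# Clean: fill missing values and harmonize categorical values.
cleaned = selected.assign(
    birth_year=selected["birth_year"].fillna(selected["birth_year"].median()),
    city=selected["city"].str.title(),
)

# Construct: derive a new attribute (age at an assumed reference year).
constructed = cleaned.assign(age=2021 - cleaned["birth_year"])

# Integrate: merge aggregated order data onto the customer records.
spend = orders.groupby("customer_id", as_index=False)["amount"].sum()
integrated = constructed.merge(spend, on="customer_id", how="left")

# Format: rename columns as required by the chosen algorithm or tool.
analytical_data_set = integrated.rename(columns={"amount": "total_spend"})
print(analytical_data_set)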

Predictive Modeling
Overview
To prepare data correctly in the data preparation phase of CRISP-DM, you must have a knowledge of predictive modeling and of the correct formatting of data that is required. Predictive modeling is covered in more depth in the following sections of the course, but this is a basic introduction. Predictive modeling covers the following:

● Predictive modeling encompasses a variety of statistical techniques from modeling, machine learning, and data science that analyze current and historical facts to make predictions about future, or otherwise unknown, events.
● The output of a predictive model is a score (or probability) of the targeted event occurring in the specified time frame in the future.
● Although the unknown event of interest is most often in the future, predictive analytics can be applied to any type of unknown, whether it is in the past, present, or future. For example, identifying suspects after a crime has been committed, or detecting credit card fraud as it occurs.

Figure 29: Predictive Modeling

Descriptive analytics uses data aggregation and data visualization to provide insight into the past and answers the question: What has happened?
Descriptive statistics are useful for company reports giving total stock in inventory, average dollars spent per customer, and year-over-year change in sales.
Predictive analytics uses statistical models and forecasting techniques to understand the future and answers the question: What could happen? Predictive analytics is used for the following tasks:
● Predictive analytics combines the historical data found in Enterprise Resource Planning
(ERP), Customer Relationship Management (CRM), Human Resources (HR), and Point-of-
Sale (POS) systems to identify patterns in the data and apply statistical models and
algorithms to capture relationships between various data sets. Companies use predictive
analytics any time they want to look into the future.
● Prescriptive analytics uses optimization and simulation algorithms to advise on possible outcomes and answers the question: What should we do?
● Prescriptive analytics predicts not only what can happen in the future, but also why it
happens by providing recommendations regarding actions that take advantage of these
predictions. Prescriptive analytics utilizes a combination of techniques and tools, such as
business rules, algorithms, optimization, machine learning, and mathematical modeling
processes.

Predictive Analytics: Use Cases


Many businesses use predictive analytics to mine and collect data.


Figure 30: Use Cases

Build and Apply Phases


There are two phases to a predictive modeling process:

Figure 31: Build and Apply Phases

The first is the model build, or training, phase, which is detailed in the following manner:
● Predictive models are built or "trained" on historic data with a known outcome.
● The input variables are called "explanatory" or "independent" variables.
● For model building, the "target" or "dependent" variable is known. It can be coded, so if the
model is to predict the probability of response for a marketing campaign, the responders
can be coded as "1"s and the non-responders as "0"s, or, for example, as "yes" and "no."
● The model is trained to differentiate between the characteristics of the customers who are
1s and 0s.

The second is the model apply phase, or the applying phase. It is detailed in the following
manner:
● Once the model has been built, it is applied onto new, more recent data, which has an
unknown outcome (because the outcome is in the future).
● The model calculates the score or probability of the target category occurring; in our
example, the probability of a customer responding to the marketing campaign.


Building the Model: Training Phase


The following figure represents an example of building the model, the training phase.

Figure 32: Training Phase

The training model in this figure is best understood in the following way:
● This example represents the model training of a "churn" model. You are trying to predict if
a customer is going to switch to another supplier.
● You train the model (in the training phase) using historical data where you know if
customers churned or not - you have a known target.
● The target variable flags churners (yes) and non-churners (no). This type of model, with a
binary target, is called a "classification" model.
● This is a simple representation where you only have two explanatory characteristics - age
and city. In a real predictive model you might have hundreds or even thousands of these
characteristics.
● The predictive model identifies the difference in the characteristics of a churner and a non-
churner. This can be represented as a mathematical equation, or in a scorecard format.
The algorithm calculates the Weight values in the predictive model equation that give the
most accurate estimate of the target.

Using the Model: Applying Phase


You can now use the model; this is part of what is called the applying phase.


Figure 33: Applying Phase

The applying phase is best understood in the following context:


● When you apply the model onto new data, you do not know the target because it is in the
future.
● You are trying to calculate the probability of customers churning in the future - this is the
score, or the probability output of the model.
● The higher the score for each customer in the new data, the more likely they are to churn
(target = yes), and the lower the score, the more likely they are to remain (target = no).
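
As a simple illustration of the two phases, the following Python sketch trains a churn classifier on
historical data with a known target and then scores new data. It is a generic example using
scikit-learn with made-up AGE, CITY, and CHURN columns; it is not the specific tooling used in this course.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Training phase: historical data with a known target (1 = churner, 0 = non-churner)
    train = pd.DataFrame({
        "AGE":   [25, 47, 33, 52, 29, 61],
        "CITY":  ["Berlin", "Munich", "Berlin", "Hamburg", "Munich", "Berlin"],
        "CHURN": [1, 0, 1, 0, 1, 0],
    })
    X_train = pd.get_dummies(train[["AGE", "CITY"]])   # encode the categorical explanatory variable
    y_train = train["CHURN"]
    model = LogisticRegression().fit(X_train, y_train)

    # Applying phase: new, more recent data with an unknown outcome
    new = pd.DataFrame({"AGE": [38, 58], "CITY": ["Berlin", "Hamburg"]})
    X_new = pd.get_dummies(new).reindex(columns=X_train.columns, fill_value=0)

    # Score = probability of the target category (churn) occurring
    print(model.predict_proba(X_new)[:, 1])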

Data Movement Through Time: Training Phase


Consider the movement of data over time in your predictive model.

Figure 34: Training Phase

Consider the following aspects of a predictive model:


● When you train a predictive model, the model "architecture" is important.
● Remember, in a predictive model you are using historical data to predict what is going to
happen in the future.
The historical data time frame must reflect data prior to the target time frame.


● You can use a reference date to separate the historical data time period from the target time
period.
● In this example, the target data time frame (April) occurs after the historical data time
frame (January to March). The model is trained to identify patterns in the data in the past
to predict the target in the subsequent, or later, months.

Data Movement Through Time: Applying Phase


Consider the movement of data through time in the applying phase.

Figure 35: Moving Data Through Time - Applying Phase

Consider the following aspects of the applying phase as you examine the example in the
figure:
● In this example, you are applying the model on the latest 3 months of data: April to June.
The model calculates the probability of churn in the future: for July.
● Every time you use the model, the apply data has to be updated to the most recent time
frame.
● The same data set is required at different points of time - for example, the 1st of every
month, so that models can be applied to generate updated scores on a recurring basis.
Depending on the business requirements, models need to be applied each month, week,
day, minute, or second.
● When the data set time frame changes, the reference date changes. Therefore, any
derived variables, such as a customer's age, need to be updated relative to the new
reference date. For example, you would calculate the age as the number of days between
the reference date and the customer's date of birth.
● Similarly, you can calculate each person's tenure as a customer of the business. This is
the number of days between the reference date and the date the customer made their first
purchase or joined a loyalty scheme.
● In addition, any transactional data in the previous months needs to be updated relative to
the moving reference date.
● For example, you might want to calculate the number of transactions in the month prior to
our reference date. If the reference date moves forward, then the number of transactions
needs to be recalculated for the new month that has become the prior month to the
reference date.
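
The recalculation of derived variables against a moving reference date can be sketched as follows in
Python. The column names and dates are hypothetical; only the pattern matters.

    import pandas as pd

    customers = pd.DataFrame({
        "CUSTOMER_ID":    [1, 2],
        "DATE_OF_BIRTH":  pd.to_datetime(["1985-03-12", "1990-11-30"]),
        "FIRST_PURCHASE": pd.to_datetime(["2018-06-01", "2020-01-15"]),
    })

    reference_date = pd.Timestamp("2021-07-01")   # moves forward each time the model is applied

    # Derived variables expressed as the number of days before the reference date
    customers["AGE_DAYS"]    = (reference_date - customers["DATE_OF_BIRTH"]).dt.days
    customers["TENURE_DAYS"] = (reference_date - customers["FIRST_PURCHASE"]).dt.days
    print(customers)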

Latency Period
We must now consider the latency period.

Figure 36: Latency Period: Models

The duration of each of these periods depends on the use case and the business - this can be
days, weeks, months, and so on. Examine the following figure to explore this in further detail:

Figure 37: Why is Latency Needed?

Model Fitting
The following figure represents model fitting:


Figure 38: Model Fitting

Consider the following aspects of model fitting:


● The concepts of model "over-fitting", "under-fitting," and the concept of model
"robustness" are very important when you build predictive models.
● Look at the 3 models on the slide. They are all trained on the same known data,
represented by the red data points, but with very different accuracy when compared to
both the known training data and the new unseen data, represented by the green data
points.
● When you build a predictive model, care must be taken not to "over-fit" the model. This is
shown by the example on the top left. An over-fitted model is very accurate when tested on
the model training data, but is highly inaccurate when applied onto new data. This means
that the model does not "generalize" and it is not "robust." Our goal is to build "robust"
models that generalize.
● In over-fitting, the model describes random error or noise instead of the underlying
relationship. Over-fitting occurs when a model is excessively complex. A model that is
over-fitted displays a poor predictive performance, as it over-reacts to minor fluctuations
in the training data.
● The possibility of over-fitting exists because the criterion used for training the model is not
the same as the criterion used to judge the efficacy of a model. In particular, a model is
typically trained by maximizing its performance on the training data. However, its efficacy
is determined not by its performance on the training data but by its ability to perform well
on unseen data. Over-fitting occurs when a model begins to "memorize" training data
rather than "learning" to generalize from trends.
● In order to avoid over-fitting, it is necessary to use additional techniques (for example,
cross-validation, regularization, early stopping, or pruning of decision trees), that can
indicate when further training does not result in better generalization. These techniques
can either penalize overly complex models, or test the model's ability to generalize by
evaluating its performance on a set of data not used for training, which is assumed to
approximate the typical unseen data that a model encounters when it is used.
● In the examples on the slide, you see that the over-fitted model has no or low error on the
training data, but a high error on the test data when the model is applied. Conversely, in the
example on the top right, you can see an under-fitted model. In this instance, there is a
large error on the training data as well as the test data. However, because the model is just
as inefficient on both data sets, with an equivalent large error, the model is robust. When
you are building a model, you are aiming for the best compromise between under- and over-
fitting, as shown in the Model Fitting figure. In this instance, you have low training and test
errors, and these errors are equivalent, so the model is robust.

Hold-Out Sample
The model build phase is split into different samples.

Figure 39: Hold-Out Sample

● The hold-out is a sample of observations withheld from the model training. We can use it to
test the model's ability to make accurate predictions based on its ability to predict the
outcomes of the data in the hold-out sample, and to confirm the model's robustness by
comparing the distribution of the predictions for the training versus hold-out samples.
● Data is partitioned and split into a training sub-set to train the models, and a validation
sub-set, which is the hold-out, to test the model's performance.
● For classification and regression models, the data could be split with, for example, 75% of
the data randomly partitioned into the training sub-set and 25% randomly partitioned into the
validation sub-set.
● Note that for time-series models, the historical data is split sequentially, not randomly. For
example, 75% could go into the training sub-set and 25% into the validation sub-set. This
sequential split is necessary because of the inherent continuous time-based nature of a
time-series model.
● You learn more about data partitioning later in this training.
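
As an illustration only, a random 75/25 partition can be produced with scikit-learn as shown below; the
data is synthetic. For a time-series model you would instead slice the time-ordered rows sequentially
rather than calling a random splitter.

    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))        # explanatory variables
    y = rng.integers(0, 2, size=100)     # known binary target

    # 75% training sub-set, 25% validation (hold-out) sub-set, partitioned at random
    X_train, X_holdout, y_train, y_holdout = train_test_split(
        X, y, test_size=0.25, random_state=42)
    print(len(X_train), len(X_holdout))  # 75 25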

Summary
This lesson covered the following elements regarding the data preparation phase of CRISP-
DM:


Figure 40: Summary

Data Manipulation
Overview
This is an introduction to data manipulation.

Figure 41: Introduction to Data Manipulation

Data manipulation is part of CRISP-DM Data Preparation, Phase 3. It involves the following
three tasks:
● Phase 3.3: Constructing Data: This task includes constructive data preparation operations,
such as the production of derived attributes, entire new records, or transformed values for
existing attributes.
● Phase 3.4: Integrating Data: These are methods whereby information is combined from
multiple tables or records to create new records or values.
● Phase 3.5: Formatting Data: Formatting transformations refer to the primarily syntactic
modifications made to the data that do not change its meaning, but might be required by
the modeling tool.

Defining an Entity
We must first identify an entity.


Figure 42: What is an entity?

The following information helps you to define an entity:


● To define an entity, you must take into account the fact that the entity makes business
sense, that it can be characterized in terms of attributes, and that it can be associated with
predictive metrics in relation to the tasks you want to perform.
● Defining an entity is not a minor challenge. Entities may be used in many projects and cannot
be changed without an impact analysis on all the deployed processes using this entity.
● For example, you have to determine, together with everyone involved in the project, if the
entity for a project is the 'account', the 'customer', or the association between the account
and the customer, describing the role of the person with the account. This can sometimes
be difficult to agree on.

The following examples illustrate these points on defining an entity:
● A predictive model that is designed to predict if a customer of a utility company is going to
respond to an up-sell offer. Here, the customer, identified by a CustomerID, is the entity.
● A churn model, designed to predict if a postpaid telecom customer will extend their
subscription when their 12-month contract expires, or if they will switch to a competitor.
Here, the entity can be the CustomerID or the AccountID, depending on the appropriate
level of analysis.

Analytical Record Creation


The next step is to create an analytical record.


Figure 43: Analytical Record

This figure provides you with a 360-degree view of each entity, collecting all of the static and
dynamic data together, which can be used to define the entity.
● The "static" data does not change very much over time, such as gender, address, work,
class, and so on.
● The "dynamic" data does change frequently, such as a customer's age, every time a new
data time-frame is chosen for analysis.
● These data sets are the explanatory variables in the analysis.
● These steps require data manipulation, that is, merging tables, aggregating data, creating
new data transformations, and derived variables, and so on.

The Analytical Record is an overall view of the entities; if this entity is a customer, it is
sometimes called a "Customer Analytic Record", or a "360-degree view of the customer," or
even "Customer DNA." This view characterizes entities by a large number of attributes (the
more, the better), which can be extracted from the database or even computed from events
that occurred for each of them.
The list of all attributes corresponds to what is called an "Analytical Record" of the entity
"disposition." This analytical record can be decomposed into a number of domains, such as
the following:

● Demographic
● Geo-demographic
● Complaints history
● Contacts history
● Products history
● Loan history
● Purchase history (coming from the transaction events)
● Segments
● Model scores


Feature Engineering
Features are attributes shared by all entities.

Figure 44: Feature engineering

Common Feature Engineering Techniques: Part 1


The following figure explores the imputation of missing values and the handling of outliers.

Figure 45: Imputation of Missing Values and Handle Outliers

Let us examine outliers. Consider the following:


● Many machine learning models, like linear and logistic regression, are easily impacted by
the outliers in the training data.
● Models like AdaBoost increase the weights of misclassified points on every iteration and
therefore might put high weights on these outliers as they generally tend to be
misclassified.

Common Feature Engineering Techniques: Part 2


There are other common feature engineering techniques, such as binning and logarithm
transformation.


Figure 46: Binning and Logarithm Transformation

Let us examine two common approaches to binning numerical data, which are as follows:


● Equal Frequency Binning, in which each bin contains approximately the same number of observations.
● Equal Width Binning, in which each bin has the same width, with the bin boundaries defined
as follows: [min + w], [min + 2w], ..., [min + nw], where w = (max - min) / (number of bins).
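
Both schemes can be sketched with pandas, where pd.cut produces equal-width bins and pd.qcut
produces (approximately) equal-frequency bins. The AGE values below are invented for illustration.

    import pandas as pd

    age = pd.Series([18, 21, 25, 26, 30, 34, 36, 40, 45])

    equal_width = pd.cut(age, bins=3)   # width w = (max - min) / 3
    equal_freq  = pd.qcut(age, q=3)     # each bin holds roughly the same number of observations

    print(equal_width.value_counts().sort_index())
    print(equal_freq.value_counts().sort_index())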

Common Feature Engineering Techniques: Part 3


There are other common feature engineering techniques, such as one-hot encoding
categorical variables and grouping transformations.

Figure 47: One-hot Encoding Categorical Variables and Grouping Transformations

Common Feature Engineering Techniques: Part 4


There are other common feature engineering techniques, such as feature splitting and scaling
numerical data challenges.


Figure 48: Feature Splitting and Scaling Numerical Data Challenges

Handling Missing Values


Missing values must be handled in your data.

Figure 49: How to Handle Missing Values

Consider the following points about handling missing values:


● There are many approaches to estimating missing values. For numeric data, the mean or
median value may be used, or an estimate can be made through interpolation between values.
● Missing values can lead to a useful investigation that explains why the values are missing.
● Like outliers, they must not be ignored.
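
A minimal pandas sketch of mean, median, and interpolation-based imputation, using made-up values:

    import pandas as pd

    s = pd.Series([12.0, None, 15.0, 14.0, None, 18.0])

    mean_filled   = s.fillna(s.mean())     # replace missing values with the mean
    median_filled = s.fillna(s.median())   # replace missing values with the median
    interpolated  = s.interpolate()        # estimate missing values between their neighbors

    print(pd.DataFrame({"mean": mean_filled, "median": median_filled, "interp": interpolated}))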

Handling Outliers
An outlier is an observation that lies an abnormal distance from other values in a random
sample of the data.


Figure 50: Handling outliers

Consider the following aspects regarding outliers in your data:

● Determining whether a value is an outlier or a data error can be difficult.


● If outliers exist in a data set they can significantly affect the analysis, therefore the IDA
must include a search for outliers.
● However, outliers must not automatically be omitted from the analysis, as they can be
genuine observations in the data, and they can indicate that a robust model solution is not
possible.
● Outliers can be searched for visually, and by using various algorithms, which we look at in
more detail in this course.
● Popular plots for outlier detection are Scatter Plots and Box Plots although, as data
volumes increase, the data visualization used to identify outliers becomes more difficult.

Merging Tables
Another important data manipulation process is the merging of input data tables.

Figure 51: Examples of Merged Tables

Follow the example, which is outlined as follows:


● The merging of input data tables is important.


● When you are preparing the data, it is contained in multiple tables that need to be
assembled in the format required by the machine-learning algorithm you are using.
● The previous figure shows you an example of merging tables that utilizes what is called a
"left outer join."
● The A_NUMBER_FACT table gives the unique line number associated to each account.
● This is merged with the CUSTOMER_ID_LOOKUP table so that the CUSTOMER_ID is
associated with the unique line number. The merge key is A_NUMBER in table 1 to
A_NUMBER in table 2.
● The CUSTOMER table can be merged: CUSTOMER_ID in table 2 to CUSTOMER_ID in table
3.
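
The left outer joins described above look roughly as follows in pandas. The tables are tiny, made-up
stand-ins for the A_NUMBER_FACT, CUSTOMER_ID_LOOKUP, and CUSTOMER tables.

    import pandas as pd

    a_number_fact      = pd.DataFrame({"A_NUMBER": ["0711-1", "0711-2"], "LINE_NO": [101, 102]})
    customer_id_lookup = pd.DataFrame({"A_NUMBER": ["0711-1", "0711-2"], "CUSTOMER_ID": [9001, 9002]})
    customer           = pd.DataFrame({"CUSTOMER_ID": [9001, 9002], "CITY": ["Berlin", "Munich"]})

    # Table 1 left-joined to table 2 on A_NUMBER, then to table 3 on CUSTOMER_ID
    merged = (a_number_fact
              .merge(customer_id_lookup, on="A_NUMBER", how="left")
              .merge(customer, on="CUSTOMER_ID", how="left"))
    print(merged)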

Aggregating Data
Data aggregation is an important facet of data processing.

Figure 52: Aggregating data

The following figure provides you with an example of data aggregation.

Figure 53: Aggregating Data

Consider the following points in relation to the example in the figure:


● This is an example of an aggregation: Telco Call Detail Record (CDR) Table.


● Each call between each customer, called the A_Number, and the person they are calling,
called the B_Number, has the duration and date time of the call.
● The aggregation then computes the count of calls made by each A_Number in each month
- January, February, and so on.
● A similar aggregation computes the total duration of all voice calls made per month for
each A_Number.
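
A comparable aggregation of a call detail record table can be sketched in pandas as follows; the
columns and values are invented.

    import pandas as pd

    cdr = pd.DataFrame({
        "A_NUMBER": ["A1", "A1", "A1", "A2", "A2"],
        "MONTH":    ["Jan", "Jan", "Feb", "Jan", "Feb"],
        "DURATION": [120, 45, 300, 60, 90],
    })

    # Count of calls and total duration per customer (A_NUMBER) per month
    agg = cdr.groupby(["A_NUMBER", "MONTH"]).agg(
        CALL_COUNT=("DURATION", "count"),
        TOTAL_DURATION=("DURATION", "sum"),
    )
    print(agg)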

Aggregating Categorical Data


You can aggregate data using a pivot table.

Figure 54: Aggregating categorical data

Aggregation Functions
Consider the range of aggregation functions available to you.

Figure 55: Aggregation functions

Summary
This lesson covers important basic elements of data manipulation and Feature Engineering
techniques, which are outlined in the following way:


● In Feature Engineering, new features are created to extract more information from existing
features.
● You also learn about "merging" and "aggregating" data.
● Data aggregation is the process where raw data is gathered and expressed in a summary
form for statistical analysis. For example, raw data can be aggregated over a given time
period to provide statistical data such as average, minimum, maximum, sum, and count.

Key Data Types


Overview

Figure 56: Introduction

Quantitative or Qualitative
The following figure demonstrates the differences between the following:

● Quantitative or numerical data


● Qualitative or categorical data

Figure 57: Quantitative or Qualitative Data

Quantitative or numerical data is defined in the following way:


● Data are numbers and can be quantified
● Data can be classified as either discrete or continuous
● Discrete data is based on counts. Only a finite number of values is possible, and the values
cannot be subdivided meaningfully. For example, the number of parts damaged in a
shipment or the number of students in a class. It is typically things counted in whole
numbers.
● Continuous data is information that can be measured on a continuum or scale. Continuous
data can have almost any numeric value and can be meaningfully subdivided into finer and
finer increments, depending upon the precision of the measurement system.
● Data can be counted or measured, and summarized using mathematical operations such
as addition or subtraction.

Qualitative or categorical data is defined in the following way:


● Data are not numbers or, if they are numbers, they cannot be quantified
● Data items can be placed into distinct categories based on some attribute or characteristic
● Data can only be summarized by frequency count (or mode). No other mathematical
operators can be applied

Scales of Measurement
Scales of measurement are important to the definition and categorization of variables or
numbers.

Figure 58: Different Scales of Measurement

There are four issues to consider when measuring variables:


● Can the items be placed in separate categories? Such items are nominal or ordinal.
● Can we rank or order the items from lowest to highest? These items are ordinal.
● Can we say how much one item is more in value than the other item? This refers to an
interval.
● Can we say how many times one item is more in value than the other? This refers to a ratio.

Nominal Scale
The following figure outlines the issue of nominal scale:


Figure 59: Nominal Data

Ordinal Scale
The following figure outlines the issue of ordinal scale:

Figure 60: Ordinal Data

● With ordinal variables, the order matters, but not the difference between the values. For
example, patients are asked to express the amount of pain they are feeling on a scale of 1
to 10. A score of 7 means more pain than a score of 5, and that is more than a score of 3.
However, the difference between the 7 and 5 might not be the same as the difference
between the 5 and 3.
● The values simply express an order.


Interval Scale

Figure 61: Interval Data

● Numerical data, as its name suggests, involves features that are only composed of
numbers, such as integers or floating-point values.
● You can establish the numerical interval difference between two items, but you can not
calculate how many times one item is more or less in value than the other item.
● The arbitrary starting point can be confusing at first. For example, year and temperature
do not have a natural zero value. The year 0 is arbitrary and it is not sensible to say that the
year 2000 is twice as old as the year 1000. Similarly, zero degrees Centigrade does not
represent the complete absence of temperature (the absence of any molecular kinetic
energy). In reality, the label "zero" is applied to its temperature for quite accidental
reasons connected to the freezing point of water, and so it does not make sense to say that
20 degrees Centigrade is twice as hot as 10 degrees Centigrade. However, zero on the
Kelvin scale is absolute zero.
● Since an interval scale has no true zero point, it does not make sense to compute ratios.

Ratio Scale
The following figure outlines the key point about data in relation to the ratio scale:

Figure 62: Data on the Ratio Scale


Summary
This lesson covers important data types and describes the following:

Figure 63: Data Types and Measurements

Encoding Data
Overview of Data Encoding
Data encoding is a set of processes that prepares the data you are using and transforms it
into a "mineable" source of information. It is an essential part of the data preparation process.
Consider the following three points in relation to variable types, encoding strategies, and
explanatory variables:

● Essentially, there are three types of variable: nominal, ordinal, and continuous (numeric).
● Different encoding strategies are deployed depending on the variable type.
● Encoding each explanatory variable can take a large amount of time, but this step must not
be ignored.

Algorithms and Multiple Data Types


Algorithms work with data types in the following ways:

Figure 64: Algorithms and Data Types


Categorical Variables: Ordinal Encoding


Categories are assigned values in the following way:

Figure 65: Ordinal Encoding

Note the following about ordinal values:


● One must note that ordinal encoding might not be suitable for ordinal variables. This is
because encoding creates a numerical representation where the difference in the
magnitude between the encoded categories can have a significant impact.
● An example of this point is as follows: the ordinal categorical variable, speed, can be
encoded as {'Low': 1, 'Medium': 2, 'High': 3}. The relation 'low' < 'medium' < 'high' appears to
make sense. However, labels such as 100, 2000, and 30,000 in place of 1, 2, and 3 would
also preserve the same relationship as 'low', 'medium,' and 'high'.
● The difference between these labels can potentially affect the model. Therefore, one-hot
encoding can be preferable for this example.

Categorical Encoding: One-hot Encoding


Consider the following in relation to one-hot encoding:


Figure 66: One-Hot Encoding

Categorical Variables: Dummy Variable Encoding


Consider the following in relation to dummy variable encoding:

Figure 67: Dummy Variable Encoding

This encoding type is necessary for some algorithms to function correctly, for example, linear
regression and some other regression types.
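
The three encoding strategies can be illustrated in pandas as follows, using the speed variable from the
earlier example. This is a generic sketch, not the encoding functions of any specific SAP tool.

    import pandas as pd

    df = pd.DataFrame({"speed": ["Low", "Medium", "High", "Medium"]})

    # Ordinal encoding: map each category to an integer
    df["speed_ordinal"] = df["speed"].map({"Low": 1, "Medium": 2, "High": 3})

    # One-hot encoding: one indicator column per category
    one_hot = pd.get_dummies(df["speed"], prefix="speed")

    # Dummy encoding: drop one category, as required by, for example, linear regression
    dummy = pd.get_dummies(df["speed"], prefix="speed", drop_first=True)

    print(df.join(one_hot), dummy, sep="\n\n")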

Continuous Variables: Data Binning


Consider the following in relation to data binning:


Figure 68: Data Binning

For more information about data binning, see the following:


● https://en.wikipedia.org/wiki/Data_binning
● https://www.geeksforgeeks.org/binning-in-data-mining/

Continuous Variables Binning by Variable


In the following figure, you see raw data with no binning. In the middle of the figure, you see
age bins that are created as follows: 18-25, 26-35, and 36-45. This clearly simplifies the data
compared to the raw data.
On the right of the figure, you see another example of binning, but this time the bins have a
different width or grouping, which are as follows: 18-28, 29-39, and 40-45. Again, this is a
simplification compared to the raw data, but there is a higher degree of discrimination, which
is represented by the difference in the height of the bars, that is, between the bins.

Figure 69: Continuous Variable Binning using the Variable: AGE

Therefore, we can state the following about numeric values in these models:


Figure 70: Continuous Variables within Data Binning

Advantages of Binning Continuous Variables


There are many advantages to binning continuous variables. Binning continuous variables
helps you to improve the following:

● Model performance
- Captures non-linear behavior of continuous variables
- Minimizes the impact of outliers
- Removes "noise" from large numbers of distinct values
● Model explainability
- Grouped values are easier to display and understand
● Model build speed
- Predictive algorithms build faster as the number of distinct values decreases

Summary
This lesson covers the following:
● The ML algorithms that can work directly with categorical data. Remember, some other ML
algorithms cannot work with this data.
● The ML algorithms that cannot operate on categorical data directly require all input
variables and output variables to be numeric.
● The 3 common encoding strategies for categorical variables, which are Ordinal encoding,
One-hot encoding, and Dummy encoding.
● Continuous numerical-type variables can be binned, or discretized.

Data Selection
Overview
Consider the following in relation to feature selection:


Traditional Approaches to Variable Selection


Traditional approaches to variable selection include the following:

Figure 71: Traditional Approaches to Variable Selection

Variable Selection Process: Backward Elimination


Backward elimination is an important variable selection procedure.


Figure 72: Backward Elimination

Example of Backward Elimination


Backward elimination is a valuable method of variable selection.

Figure 73: Backward Elimination: Example

Forward Selection
Forward selection is the opposite of backward elimination.

Figure 74: Forward Selection: Characteristics


Example of Forward Selection


The following figure provides you with an example of forward selection.

Figure 75: Forward Selection: Example

Stepwise Regression
Stepwise regression combines backward and forward selection in the following manner:

Figure 76: Stepwise Regression

The widespread but often incorrect usage of this process, together with the availability of more
modern approaches (discussed next) and the option of using expert judgment to identify relevant
variables, has led to calls to avoid stepwise model selection entirely.

Other Approaches to Variable Selection


There are other approaches to variable selection. These methods are the following - filter,
wrapper, and embedded:


Figure 77: Other Approaches: Filter, Wrapper, and Embedded Method

Filter
Filter feature selection is an approach to variable selection.

Figure 78: Filter Feature Selection

Some examples of filter methods include the Chi-squared test, information gain, and
correlation coefficient scores.

Wrapper
Wrapper feature selection is an approach to variable selection.


Figure 79: Wrapper Selection Methods

The two main disadvantages of the wrapper methods are the following:
● The increased over-fitting risk when the number of observations is insufficient.
● The significant computation time when the number of variables is large.
An example of the wrapper method is the recursive feature elimination algorithm, sketched below.
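
As an illustration of a wrapper method, the following scikit-learn sketch runs recursive feature
elimination with a logistic regression estimator on synthetic data; it is a generic example rather than
a tool prescribed by this course.

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

    # Repeatedly fit the model and drop the weakest feature until 4 features remain
    selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)
    print(selector.support_)   # boolean mask of the selected features
    print(selector.ranking_)   # rank 1 = selected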

Embedded
Embedded feature selection is an approach to variable selection.

Figure 80: Embedded Selection Methods

Examples of regularization algorithms are the LASSO (least absolute shrinkage and selection
operator) and Ridge regression.

Summary
This lesson covers the following in the area of data selection:
● Variable or "feature" selection.
● Feature selection is the process of selecting a subset of relevant explanatory variables or
predictors for use in data science model construction.
● This is also known as variable selection, attribute selection, or variable subset selection.
● Modern approaches to feature selection use subset selection that evaluates a subset of
features as a group for suitability to include in a predictive model.
● The most popular form of feature selection is stepwise regression. This is an algorithm that
adds the best feature (or deletes the worst feature) in a series of iterative steps.
● Selection algorithms can be grouped into Filters, Wrappers, and Embedded methods.

LESSON SUMMARY
You should now be able to:
● Prepare data



UNIT 5 Modeling Phase

Lesson 1
Understand the parts of the modeling phase 68

UNIT OBJECTIVES

● Understand modeling phase



Unit 5
Lesson 1
Understand the parts of the modeling phase

LESSON OBJECTIVES
After completing this lesson, you will be able to:
● Understand modeling phase

CRISP-DM Modeling Overview


CRISP-DM: Phase 4, Modeling
We begin with CRISP-DM: Phase 4, Modeling.

Figure 81: CRISP-DM - Phase 4: Modeling

Phase 4.1: Select Modeling Technique


Consider the tasks and outputs for select modeling techniques.


Figure 82: Select Modeling Technique

Phase 4.2: Generate Test Design


Consider the tasks and outputs for the generation of test design.

Figure 83: Generate Test Design

Phase 4.3: Build Model


Consider the following task and outputs for the build model phase.

Figure 84: Build Model


Phase 4.4: Assess Model


Consider the following task and outputs for the assess model phase.

Figure 85: Assess Model

It is important to recall the following for this phase:


● This task can be confused with the subsequent Evaluation Phase. During phase 4.4, assess
model, the data scientist judges the success of the application's data modeling from a
technical perspective.
● During the evaluation phase, the data scientist confirms the results from a business
perspective and evaluates how well the model can work in a production, real-world
environment.

Summary
This lesson covers the following tasks:
● Select modeling technique
● Generate test design
● Build model
● Assess model

Anomaly Detection
Anomaly Detection Overview
An anomaly is something unexpected from what is typical or normal, or a deviation in the data.


Figure 86: Beginnings of Anomaly Detection

In statistics, an outlier is an observation that is numerically distant from the rest of the data.
We can see anomalies and outliers in the following context:

● Outliers can occur because of errors and might need to be removed from the data set or
corrected.
● They can occur naturally and, therefore, must be treated carefully. The outlier can be the
most interesting thing in the data set.
● Some statistics or algorithms can be heavily biased by outliers - for example, the simple
mean, correlation, and linear regression. In contrast, the trimmed mean and median are not
so significantly affected.
● Outliers can be detected visually - for example, using Scatter Plots and Box Plots.

Influence of Outliers
The following two graphs show the same data, but the graph on the right has an outlier
added to it.

Figure 87: Influence of Outliers

Assess the two graphs in light of the following:


● Later in the course, you learn more about regression, but the linear regression algorithm
fits the best model by computing the square of the errors between the data points and the
trend line.


● To minimize the square of the errors, the regression model tries to keep the line closer to
the data point at the right extreme of the plot, and this gives this outlier data point a
disproportionate influence on the slope of the line.
● If this outlier value is removed, the model is totally different, as you can see in the graph on
the left, and which you can judge by comparing the equation of the line in the two graphs.

Anomaly Types
There are different types of anomalies, which include point and contextual anomalies.

Figure 88: Point Anomalies

Consider the following in relation to point anomalies:


● One type of anomaly is called a point anomaly.
● A point anomaly is a single instance of data that is too far off from the rest of the data set.
● An example of this might be detecting credit card fraud based on the "amount spent"
compared to other transactions made by that customer.
● This simple two-dimensional graph shows some different types of point anomaly you
might encounter.
● The two anomalies x1 and x2 are easily identifiable.
● x4 could also be an anomaly with respect to its direct neighborhood.
● x3 seems to be one of the normal instances in the data.

Contextual anomalies
Another type of anomaly is called a contextual anomaly.


Figure 89: Features of Contextual Anomalies

Contextual anomalies show the following features:


● In this type, the abnormality is context specific.
● This type of anomaly is common with time-series data.
● In the histogram plot of the average monthly temperatures in Germany from 2001 to 2010,
you can see one obvious anomaly: one very hot month in the bin 22-23°C.

Identify a Contextual Anomaly


The following two graphs contain examples of contextual anomalies.

Figure 90: Types of Contextual Anomalies

Note the following in each graph:


● In the first graph on the left, the average temperature is plotted against the date.
● You can not only see the previously mentioned extreme outliers, but also two very mild
winters in 2006 and 2007, which you could not observe previously.
● In the second graph on the right, the month is also added as a numerical attribute.
● You can now detect a few more outliers besides the extreme temperature values. The
month of October in 2003 turns out to be an anomaly with only 5.9°C on average, which is
only half of the average temperature in that month.

Common Approaches to Anomaly Detection


You can use model-based techniques and proximity-based techniques to detect anomalies.


Figure 91: Model- and Proximity-Based Techniques

Statistical Methods: Understanding Quartiles


To understand one of the most popular tests for outliers, you need to understand quartiles.

● Quartiles are referred to as measures of position because they give the relative ranked
position of a specific value in a distribution.
● To create quartiles, you simply order the data by value and split it into 4 equal parts.
● The second quartile, Q2, is the median of the data.

An example of quartiles are provided in the following figure:

Figure 92: Quartiles

Quartiles: Example
The following figure shows you an example of quartiles, which show the following patterns:


Figure 93: Statistical methods - Understanding quartiles

● This data shows you the number of volunteer hours performed by 15 students in a single
year.
● These values are ranked, with the lowest value on the left and the highest value on the
right.

Detecting Outliers
We must detect outliers using a box plot, as illustrated in the following figure.

Figure 94: Detecting outliers using a box plot

In the figure, note the way the box plot describes the data distribution.

Interquartile range (IQR)


IQR is the difference between the highest and lowest values for the middle 50% of the ranked
data. This refers also to the spread of the middle 50% of the data.

Figure 95: Interquartile range (IQR)

In the example from the figure, IQR = Q3 - Q1. Take note of the following:
● In the example, half of the values lie between 24 and 46. Therefore, IQR is (46 -24) = 22.


● This also refers to the difference, or distance, between the bottom 25% and upper 25% of
the ranked data.

Identify an Outlier and Extreme Value


The following figure contains data that shows you how to identify outliers and extreme values.

Figure 96: How to Identify Outlier and Extreme Value

The following is important with regard to this sample data:


● Upper and lower fences cordon off outliers from the bulk of the data.
● A point beyond an inner fence on either side is considered a mild outlier.
● A point beyond an outer fence is considered an extreme outlier.

Outlier and Extreme Value


The following figure contains sample data from which you can identify usual values, outliers,
and extreme values.

Figure 97: Outlier and Extreme Value

Take note of the following in this example:


● The values that are contained within the inner fences are the USUAL values. For the usual
values, the lowest value is 12 and the highest value is 48.
● The values that lie between the inner and outer fences are called OUTLIERS or SUSPECT
OUTLIERS, and are denoted by a symbol at 86.
● The values that lie outside the outer fences are called EXTREME or HIGHLY SUSPECT
OUTLIER values, and are denoted by a symbol at 128.

We must now consider suspect outliers and highly suspect outliers, which are defined as
follows:

● Suspect outliers lie between QL - 1.5(IQR) and QL - 3(IQR), or between QU + 1.5(IQR) and
QU + 3(IQR).
● Highly suspect (extreme) outliers lie below QL - 3(IQR) or above QU + 3(IQR).
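
The fence calculation can be reproduced numerically as in the sketch below. The values are invented,
but chosen so that the quartiles match the example above (Q1 = 24, Q3 = 46); note that different tools
compute quartiles slightly differently.

    import numpy as np

    values = np.array([12, 15, 18, 24, 26, 30, 33, 35, 40, 46, 48, 86, 128])

    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1                                   # 46 - 24 = 22

    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)        # beyond these: suspect outlier
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)        # beyond these: extreme outlier

    suspect = values[((values < inner[0]) | (values > inner[1])) &
                     (values >= outer[0]) & (values <= outer[1])]
    extreme = values[(values < outer[0]) | (values > outer[1])]
    print(iqr, suspect, extreme)                    # 22.0 [86] [128]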

Identify k-Nearest Neighbor (k-NN)


The following figure contains an example of data from which you can identify outliers using k-NN.

Figure 98: Sample k-NN

From this figure, we can identify the following:


● This algorithm is based on the concept of a local density, where locality is given by the
nearest neighbors whose distance is used to estimate the density.
● By comparing the local density of an object to the local densities of its neighbors, you can
identify regions of similar density, and points that have a substantially lower density than
their neighbors.
● These are considered to be outliers.
● The local density is estimated by the typical distance at which a point can be "reached"
from its neighbors.

Working Example of k-NN


The following figure contains a working example of identifying outliers using k-NN.


Figure 99: Working Example of k-NN

In this working example, you define and calculate the following:

● You define the number of neighbors to be considered in the analysis. Here it is K = 3.


● You also define the number of outliers to be detected. Here it is N = 2.
● Calculate all the inter-object distances. In this example, we use the straight-line (Euclidean)
distance, which is simply the difference between the numerical values in the column.
● Select the three (K = 3) nearest neighbors and calculate the average of their distances.
● Highlight the two (N = 2) largest average distances; these are the outliers.
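
The same steps can be sketched with scikit-learn's nearest-neighbor utilities; the one-dimensional
values below are made up for illustration and are not the course data set.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    values = np.array([[10.0], [12.0], [11.0], [13.0], [50.0], [12.5], [95.0], [11.5]])
    K, N = 3, 2                                  # neighbors to consider, outliers to report

    # K + 1 because each point is returned as its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=K + 1).fit(values)
    distances, _ = nn.kneighbors(values)

    avg_dist = distances[:, 1:].mean(axis=1)     # average distance to the K nearest neighbors
    outlier_idx = np.argsort(avg_dist)[-N:]      # the N points with the largest average distance
    print(values[outlier_idx].ravel())           # the two isolated values: 50 and 95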

Anomaly Detection
The following figure contains an example of anomaly detection using the SAP HANA Predictive
Analysis Library (PAL).

Figure 100: Anomaly Detection Algorithm in SAP HANA PAL


You learn more about k-means later in this course. However, take note of the following in
relation to this example:
● This algorithm can be used for detecting outliers.
● It detects patterns in a given data set that do not conform to the norm.

Summary
This lesson covers the following:

Figure 101: Summary

Association Analysis
Association Analysis Overview
Association analysis is a key part of analyzing the day-to-day data that businesses collect.

Figure 102: Association Analysis

Association analysis enables you to identify items that have an affinity for each other. It is
frequently used to analyze transactional data, which is called market baskets, to identify
items that often appear together in transactions.
We use association analysis in our everyday lives.
The following are application examples of association analysis:

- Shopping carts and supermarket shoppers


- Analysis of any product purchases - not just in shops, but also online
- Analysis of telecom service purchases, that is, voice and data packages
- The basket can be a household rather than only an individual


- If customer identification takes place, through a link such as a loyalty scheme, the
purchases over time and the sequence of product purchases can be analyzed
- Identification of fraudulent medical insurance claims - consider cases where common
rules are broken
- Differential analysis, which compares the results between different stores, between
customers in different demographic groups, between different days of the week,
different seasons of the year, and so on

Rules for Association Analysis


There are rules for association analysis, which are outlined as follows:

Figure 103: Rules

Example: Association Analysis


This figure provides you with a sample of association analysis, which shows the following:.

Figure 104: Worked Example of Association Analysis

Support
Support, a key measure in association analysis, is defined in the following figure:


Figure 105: Definition of Support

Support: Rules and Calculations


Support has rules and these help you to calculate this factor using your data.

Figure 106: Support

Consider the following in relation to this example of Support:


● Going back to our transactional example, you can see how the Support is calculated.
● Please notice that Support is "bi-directional". For example, the rules "If Products 3 and 2,
then 4", "If 2 and 4, then 3", and "If 4 and 3, then 2" all have the same Rule Support.
● The Support of the Rule X=>Y is "symmetric," that is, Support (X->Y) = Support (Y->X).

Confidence
Confidence is an important facet of association analysis.


Figure 107: Confidence: Measures, Transactions, Drawbacks

Example: Confidence
The following figure contains a working example of confidence.

Figure 108: Confidence

Lift
Now that you have examined Support and Confidence, you must explore Lift.

Figure 109: Lift: Definition and Calculation


Lift is the number of transactions containing the antecedent and consequent, divided by the
number of transactions with the antecedent only, all of which are divided by the fraction of
transactions containing the consequent.

Example: Lift
The following is a working example of Lift:

Figure 110: Example: Lift

Take note of the following in relation to the example:


● Any rule with a lift of less than 1 does not indicate a real cross-selling opportunity, no
matter how high its support and confidence, because it actually offers less ability to
predict a purchase than random chance. If some rule has a lift of 1, it implies that the
probability of the occurrence of the antecedent and that of the consequent are
independent of each other. If two events are independent of each other, no rule can be
drawn involving those two events. If the lift is >1, that informs us that the degree to which
those two occurrences are dependent on one another, and makes these rules potentially
useful for predicting the consequent in future data sets
● It is important to note that lift is not calculated by dividing the confidence of the rule by the
support of the rule. It is calculated by dividing the confidence of the rule by the support of
the result.
● Lift = confidence (A and B) / support (B)
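
Support, confidence, and lift can be computed directly from a list of transactions, as in this small
sketch; the baskets are invented.

    transactions = [
        {"bread", "butter"},
        {"bread", "butter", "milk"},
        {"bread", "milk"},
        {"butter", "milk"},
        {"bread", "butter", "jam"},
    ]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions containing every item in the itemset
        return sum(itemset <= t for t in transactions) / n

    antecedent, consequent = {"bread"}, {"butter"}
    rule_support = support(antecedent | consequent)
    confidence = rule_support / support(antecedent)
    lift = confidence / support(consequent)
    print(rule_support, confidence, lift)   # 0.6 0.75 0.9375 (lift < 1: no real cross-sell signal)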

Lift of the Rule


The Lift of the Rule X->Y is Symmetric, that is, Lift (X->Y) = Lift (Y->X). For more information
to understand why, see the appendix to this lesson.


Figure 111: Example of the Rule

How can a retailer use the Lift of the Rule? Possible recommendations for a rule X=>Y Rule
(where X and Y are 2 separate products and have high support, high confidence, and high
positive lift >1) are as follows:

● Put X and Y closer in the store


● Package X with Y
● Package X and Y with an item that sells poorly
● Give a discount on only one of X and Y
● Increase the price of X and lower the price of Y, or vice versa
● Advertise only one of X and Y, that is, do not advertise X and Y together
● For example, if X was a toy and Y a form of sweet, an offer of sweets in the form of toy X
could also be a good option

Summary
This lesson covers the strengths and weaknesses of association analysis, which are outlined
in the following manner:


Figure 112: Summary

Figure 113: Appendix: Why Lift is Symmetric

Cluster Analysis
Cluster Analysis Overview
Cluster analysis revolves around the division of data points into groups. The following figure
gives you an overview of cluster analysis:

Figure 114: Cluster Analysis

Cluster analysis displays the following features:


● Cluster analysis allows you to group a set of objects in such a way that objects in the same
group (called a cluster) are more similar (homogeneous in some sense or another) to each
other, but are very dissimilar to objects not belonging to that group (heterogeneous).
● Grouping similar customers and products is a fundamental marketing activity. It is used,
prominently, in market segmentation because companies can not connect with all their
customers and, consequently, they have to divide markets into groups of consumers,
customers, or clients (called segments) with similar needs and wants.
● Organizations can target each of these segments with specific offers that are relevant and
have the tone and content that is most likely to appeal to the customers within the
segment.
● Cluster analysis has many uses: it can be used to identify consumer segments, or
competitive sets of products, or for geo-demographic or behavioral groupings of
customers, and so on.
● Clustering techniques fall into a group of "undirected" (or "unsupervised") data science
tools, where there is no pre-existing, labeled "target" variable. The goal of undirected
analysis is to discover structure in the data as a whole. There is no target variable being
predicted.

There are a wide variety of applications of cluster analysis, which are as follows:
● Analysis to find groups of similar customers
● Segmenting the market and determining target markets
● Product positioning
● Selecting test markets

Similarity
Similarity is outlined in the following terms:

Figure 115: Characteristics of Similarity

Features
Features are characterized in the following figure:


Figure 116: Characteristics of Features

Clustering
Clustering is characterized in the following figure:

Figure 117: Characteristics of Clustering

Identification of New Data Items


New data items are identified by the means outlined in the following figure:

Figure 118: Identify New Data Items

Method of Cluster Analysis


The method of cluster analysis is outlined in the following figure:


Figure 119: Cluster Analysis Method: ABC Analysis

Consider the following in relation to this example and use of cluster analysis:
● You can use cluster analysis to find the top 10% of customers based on their spending, or
the top 20% of products based on their contribution to overall profit.
● The data is first sorted in descending numeric order and then partitioned into the first A%,
the second B% and the final C%.
● The A cluster can be considered the most important, or gold segment, while the B cluster
can be considered the next most important, or the silver segment. The C cluster is the
least important or, bronze segment.
● Here is an example, where A=25%, B=30% and C=45%.

Cluster Analysis Method: The K-Means Cluster Analysis Algorithm


K-Means is the best-known algorithm for cluster analysis.

Figure 120: K-Means Cluster Analysis Algorithm

The first phase of understanding this algorithm is as follows:


● The user has to specify the number of clusters required (the k-value).


● The diagram in the slide explains the process.


● Step 1 is to initialize the values of the means for each cluster. There are a variety of ways to
establish the initial values of these center points, called centroids. For example, you can
randomly choose k observations from the data set and use these as the initial means.
● Then the algorithm proceeds alternating between steps 2 and 3:

The second phase is the assignment step: you must assign each observation to the cluster with
the closest centroid point (this can be measured using the Euclidean distance).
The third phase is the update step, which unfolds as follows: you must calculate the new
means, which become the centroids of the observations in the new clusters.
You must repeat this process. By doing this, you discover the following:
● K-Means clustering works by constantly trying to find a centroid with closely held
observation data points.
● The algorithm has "converged" when the assignments of the data points no longer change.
This indicates that every observation has been assigned to the cluster with its closest
centroid.
● The standard algorithm aims at minimizing what is called the within-cluster sum of
squares (WCSS).
● The WCSS is the sum of the squares of the distances of each data point in all clusters to
their respective centroids.
● This is exactly equivalent to assigning observations by using the smallest Euclidean
distance.
● Therefore, all of the observations in a cluster must be as homogeneous, or as similar, as
possible, and the clusters are thus heterogeneous or dissimilar.

Choosing the distance measure is very important. The distance between two clusters can
be computed based on the length of the straight line drawn from one cluster to another.
This is commonly referred to as the Euclidean distance. However, many other distance
metrics have also been developed.
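
The K-Means procedure is available in most libraries. The scikit-learn sketch below clusters synthetic
two-dimensional points into k = 3 clusters and reports the WCSS (exposed as inertia_); it is a generic
illustration only, not the SAP tooling used in this course.

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # synthetic 2-D observations

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    print(kmeans.cluster_centers_)   # final centroid positions
    print(kmeans.inertia_)           # within-cluster sum of squares (WCSS)
    print(kmeans.labels_[:10])       # cluster membership of the first 10 observations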

Choosing K
You can choose K using different models.


Figure 121: Choosing K

The choice of the number of clusters must also be based on business operational constraints
- the number of clusters is always limited by the organization's capacity to use them. For
example, if you are clustering customers into groups to create differential marketing
campaigns, having 20+ clusters does not make business sense because it would be very
difficult to develop 20+ different marketing initiatives.
It is important that the final cluster model is interpretable by the business. Clustering is only
useful if it can be understood by the business and explained in simple words.

Example of K-Means Cluster Analysis


The following is an example of K-Means cluster analysis.

Figure 122: Example: K-Means Cluster Analysis Algorithm

In the figure, you can extract the following insights:


● You can see the iterations stepping through the K-Means process, recomputing the
position of the centroids.


● In the first picture, Recompute Centroids, you see that k initial "means" (in this case k=3)
are randomly generated within the data domain, and that k clusters are created by
associating every observation with the nearest mean.
● In the second picture, Reassign Membership, the position of the centroids is recomputed.
The centroid in each of the k clusters becomes the new mean.
● The third picture, Final Solution, shows that the membership of each cluster is reassigned,
which means that each observation is associated with its nearest centroid.
● This process continues, with successively smaller steps as the convergence to the final
solution is achieved, where there is no more movement in the position of the centroids.

Cluster Analysis Method: Hierarchical Clustering


Another approach to cluster analysis is hierarchical clustering.

Figure 123: Hierarchical Clustering

Consider the figure in the context of the following insights:


● Hierarchical clustering starts by treating each observation as a separate cluster. It
repeatedly executes the following two steps: (1) identify the two clusters that are closest
together, and (2) merge the two most similar clusters.
● This iterative process continues until all the clusters are merged together.
● In order to decide which clusters must be combined, a measure of dissimilarity between
the sets of observations is required. In most methods of hierarchical clustering, this is
achieved by using an appropriate metric, which is a measure of distance between pairs of
observations (for example, Euclidean straight line distance), and linkage criterion that
determines the distance between the sets of observations as a function of the pairwise
distances between the observations.

Hierarchical Clustering Dendrogram


The main output of hierarchical clustering is a dendrogram, aspects of which are outlined in
the following figure.


Figure 124: Hierarchical clustering dendrogram

This figure illustrates the following:


● The main output of hierarchical clustering is a dendrogram that shows the hierarchical
relationship between the clusters.
● The main use of a dendrogram is to work out the best way to allocate objects to clusters.
● The dendrogram on the right shows the hierarchical clustering of six observations in
the scatter plot to the left.
● The merging steps in the clustering follow the process described previously: 1. E and F, 2. A
and B, 3. EF and D, 4. EFD and C.
● The key to interpreting a dendrogram is to focus on the height at which any two objects are
joined together. In this example, the heights reflect the distance between the clusters, and
the dendrogram shows us that the biggest difference is between the cluster of A and B and
that of C, D, E, and F.
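A dendrogram like the one described above can be produced with the open-source SciPy library. The following is a hedged sketch only: the coordinates for the six observations A to F are invented assumptions chosen to mimic the idea of the figure, not its actual data, and the choice of average linkage with a Euclidean metric is likewise illustrative.

# Hierarchical clustering and dendrogram sketch using SciPy (illustrative only).
# The coordinates for observations A-F are assumptions, not the figure's data.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

labels = ["A", "B", "C", "D", "E", "F"]
points = np.array([
    [1.0, 1.0],   # A
    [1.5, 1.2],   # B
    [5.0, 4.0],   # C
    [4.0, 2.0],   # D
    [3.5, 1.5],   # E
    [3.6, 1.6],   # F
])

# Euclidean distance metric with average linkage between sets of observations
merge_history = linkage(points, method="average", metric="euclidean")

dendrogram(merge_history, labels=labels)
plt.ylabel("Distance at which clusters are merged")
plt.show()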

Segmentation
A good segmentation produces segments that are homogeneous within each segment and
heterogeneous between segments.

Figure 125: What Makes a Good Form of Segmentation?

Summary: Strengths and Weaknesses of Cluster Analysis


The strengths and weaknesses of cluster analysis are as follows:


Figure 126: Strengths and Weaknesses

We must put these strengths and weaknesses in the following context. Most clustering
approaches are undirected, which means there is no target. The goal of undirected analysis is
to discover structure in the data as a whole.
There are other important issues in cluster analysis, which can be described in the following
ways.
You must calculate the distance measure:

● Most clustering techniques use the Euclidean distance formula, which is the square root of
the sum of the squares of the distances along each attribute axis, for the distance measure.
● Before the clustering can take place, categorical variables must be encoded and scaled.
Depending on these transformations, the categorical variables can dominate clustering
results or they can be completely ignored.

You must choose the right number of clusters.


● If the number of clusters k in the K-Means method is not chosen to match the natural
structure of the data, the results are not as good. The proper way to alleviate this is to
experiment with different values for k.
● In principle, the best k value exhibits the smallest intra-cluster distances and largest inter-
cluster distances.
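One common way of experimenting with different values of k is to compare the within-cluster sum of squares for each candidate k (the "elbow" approach). The following is a minimal sketch using scikit-learn; the placeholder data set is an assumption.

# Sketch of experimenting with different values of k (the "elbow" approach).
# The data set below is a placeholder assumption.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(300, 2))  # placeholder data set

for k in range(2, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    # inertia_ is the total within-cluster (intra-cluster) sum of squared distances;
    # look for the k beyond which this stops improving sharply.
    print(f"k={k}  within-cluster sum of squares={model.inertia_:.1f}")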

You must interpret the clusters when you discover them. There are different ways to utilize
clustering results:
● You can use cluster membership as a label for a separate classification problem.
● You can use other data science techniques - for example, decision trees - to find
descriptions of clusters.
● You can visualize clusters by utilizing 2D and 3D scatter graphs, or some other
visualization technique.
● You can examine the differences in attribute values among different clusters, one attribute
at a time.

You must identify application issues:


● You use clustering techniques when you expect natural groupings in the data. Clusters
must represent groups of items (products, events, customers) that have a lot in common.
● Creating clusters prior to the application of some other technique (classification models,
decision trees, neural networks) can reduce the complexity of the problem by dividing the
data space.
● These space partitions can be modeled separately and these two-step procedures can
occasionally exhibit improved results when compared to the analysis or modeling without
using clustering. This is referred to as segmented modeling.

Classification and Regression
Regression Analysis Overview
You can use regression analysis for modeling and analyzing numerical data.

Figure 127: Regression Analysis

You can utilize numerous types of regression models. This choice frequently depends on the
kind of data that you possess for the target variable. Take note of the following in relation to
the example provided:
● This is a simple linear regression. The target is a continuous variable.
● The target variable in the regression equation is modeled as a function of the explanatory
variables, a constant term, and an error term.
● The error term is treated as a random variable. It represents unexplained variation in the
target variable.
● You can see that the equation of the straight line is Y = a+bx.
● In this simple equation, b is a regression coefficient. Regression coefficients are estimates
of unknown parameters and describe the relationship between an explanatory variable and
the target. In linear regression, coefficients are the values that multiply the explanatory
values. Suppose you have the following regression equation: y = 5+2x. In this equation, +2
is the coefficient, x is the explanatory variable, and +5 is the constant.


● The sign of each coefficient indicates the direction of the relationship between the
explanatory variable and the target variable.
● A positive sign indicates that as the explanatory variable increases, the target variable also
increases.
● A negative sign indicates that as the explanatory variable increases, the target variable
decreases.
● The coefficient value represents the mean change in the target given a one-unit change in
the explanatory variable. For example, if a coefficient is +2, the mean response value
increases by 2 for every one-unit change in the explanatory variable.

Example: Simple Linear Regression Analysis


The following figure provides you with a simple bi-variate linear regression, Y= a+ b*x.

Figure 128: Simple Linear Regression Analysis

Consider the following points in context of the example provided:


● The example shows a simple bi-variate linear regression, Y= a+ b*x. This means that there
are two variables in the data set.
● The properties of observations are termed explanatory variables (or independent
variables, regressors, or features); these are the x variables in the equation.
● The values being predicted are known as targets, dependent variables or outcomes. This is
the Y variable in the equation. In this example, the Y variable is a continuous variable and
therefore we use a linear regression.
● In the input data, every value of the explanatory variable x is associated with a value of the
target variable Y.
● Consequently, in this simple two variable data set, there is only one explanatory variable,
and one target.
● In more complex examples, "multiple linear regression" models the relationship between
two or more explanatory variables and a target variable by fitting a linear equation to the
observed data.
● Multiple linear regression has more than one explanatory variable. The equation is Y = a +
b1 * x1 + b2 * x2 + … In the equation, x1, x2, and so on, are the explanatory variables and b1,
b2, and so on, are the regression coefficients.


● In the terminology of predictive analysis, regression is considered an instance of
supervised learning, that is, learning where a training set of correct outcome values is
available - which is the target variable.

Uses for Linear Regression


The following figure illustrates the multiple uses of linear regression.

Figure 129: How to Utilize Linear Regression

Most applications of linear regression fall into one of two broad categories, which are outlined
in the following way:
● Prediction or forecasting:
- Linear regression can be used to fit a predictive model to an observed data set of values
of the target and explanatory variables.
- When the model has been developed, and additional values of the explanatory variables
are collected without an accompanying target value, the model can be used to predict
the target values.
● Explaining the variation in the target variable that can be attributed to variation in the
explanatory variables:
- Linear regression analysis can be applied to quantify the strength of the relationship
between the target and the explanatory variables.
- The model can also be used to determine whether some explanatory variables have no
linear relationship with the target.

Example of Linear Regression


The following provides you with an example of linear regression.


Figure 130: Example: Linear Regression

Consider the following in the context of this example:


● During the two-week period assessed, a shop that sells ice cream measures the
temperature each day and the sales value of the ice cream sold.
● This is plotted on a scatter plot, as shown in the figure.
● You can see that there are higher sales when the temperature increases. Therefore, there
is a correlation between sales and temperature.

Least Squares
The following figure provides you with an example of least squares; it uses the information
from the previous example.

Figure 131: Least squares

Linear regression models are often fitted using the least squares approach - although they
can also be fitted in other ways. Let us consider the following in the context of this example:
● The best fit, from a least-squares perspective, minimizes the sum of squared residuals,
where a residual is the difference between an observed value and the fitted value provided
by a model - that is, the error.


● In the ice cream example, we can calculate the residual and this is shown in the table.
● Ordinary least squares (OLS) is a type of linear least squares method for estimating the
unknown parameters in a linear regression model.
● OLS chooses the parameters of a linear function of a set of explanatory variables by the
principle of least squares: minimizing the sum of the squares of the differences between
the observed target or dependent variable (the values of the variable being predicted) and
those predicted by the linear function.
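The least-squares fit and the residuals can be computed directly. The following is a small sketch using the open-source NumPy library; the temperature and sales values are invented for illustration and are not the ice-cream data shown in the figure.

# Least-squares sketch with NumPy (illustrative; the numbers are invented,
# not the ice-cream data shown in the figure).
import numpy as np

temperature = np.array([14, 16, 18, 20, 22, 24, 26, 28])          # x
sales       = np.array([215, 325, 332, 406, 522, 580, 610, 725])  # Y

# np.polyfit returns [b, a] for the best-fit straight line Y = a + b*x
b, a = np.polyfit(temperature, sales, deg=1)
fitted = a + b * temperature
residuals = sales - fitted          # observed value minus fitted value

print(f"Fitted line: Y = {a:.1f} + {b:.1f}*x")
print("Sum of squared residuals:", np.sum(residuals ** 2).round(1))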

Forecast Future Values


Linear regression can be used to forecast future values.

Figure 132: Future Values

Least Squares and Outliers


You can use least squares to assess residuals.

Figure 133: Least Squares and Outliers


Influence of Outliers
The linear regression algorithm fits the best model by minimizing the sum of the squared
errors between the data points and the trend line.

Figure 134: Linear Regression: With and Without Outlier

Consider the following points in the context of this example:


● To minimize the square of the errors, the regression model tries to keep the line closer to
the data point at the right extreme of the plot, and this gives this data point a
disproportionate influence on the slope of the line.
● If this outlier value is removed, the model equation is totally different, which you can see by
comparing the equation of the line in both diagrams.
● In addition, outliers have a high influence on the model performance metrics because they
cannot be predicted accurately.

Bi-Variate Regression Variations


There are a wide variety of other regression variations. Three are contained in the following
figure.

Figure 135: Examples of Bi-Variate Regression Variations

Polynomial Regression
Polynomial regression is another form of regression analysis.


Figure 136: Polynomial Regression

Overview of Classification
We begin with an introduction to classification analysis.

Figure 137: Classification

You have looked at the regression example where the target is continuous, but what happens
if the target is categorical? The following points become apparent:
● A categorical variable has values that you can put into a countable number of distinct
groups based on a characteristic.
● In predictive analysis, you frequently come across applications where you want to predict
a binary variable (0 or 1) or a categorical variable (yes or no). Such a target variable is also
referred to as a dichotomous variable - something which is divided into two parts or
classifications. This is called classification analysis. The problem can be extended to
predicting more than two integer values or categories.


● There are many use cases for this type of classification analysis using regression
techniques, covering scenarios where the focus is on the relationship between a target
variable and one or more explanatory variables.
● Classification analysis can use regression techniques to identify the category to which a
new observation belongs, based on a training set of data containing observations whose
category membership is known. In the examples in the figure, the target has 2 categories:
churners/non-churners, responders/non-responders, apples/pears. There are also other
use cases where the classification target has more than 2 categories.
● Retention analysis or churn analysis, is a major application area for predictive analysis
where the objective is to distinguish between customers who have switched to a new
supplier (for example, for Telco or utility services) and those who have not switched (and
have therefore been "retained" as a customer). The objective is to try and build a model to
describe the attributes of those customers who have been retained, in contrast to those
who have left or churned, and therefore develop strategies to maximize the retention of
customers. The target or dependent variable is usually a flag, for example, a binary or
Boolean variable, that is, Yes / No or 1 / 0. The explanatory variables describe the
attributes of each customer.
● The class of models in these cases is referred to as a classification model, as you want to
classify observations, and in the more general sense data records, into classes.
● The algorithms commonly used for classification analysis are decision trees, regression,
and neural networks. There are other approaches as well, and some are presented in this
training course.
● In the terminology of predictive analysis, classification is considered an instance of
supervised learning, that is, learning where a training set of correctly identified
observations is available - this is the target variable. This means that when you train a
churn model, you need some historic data to establish whether a customer has churned or
not. When you have built and trained the model, you can apply it to other data where you
do not have the target value, and predict the target - to answer questions such as: "Will
these customers churn or not?"

The use cases for classification analysis are the largest group within predictive analysis.
Examples of these are as follows:
● Churn analysis to predict the probability that a customer may leave/stay
● Success or failure of a medical treatment, dependent on dosage, patient's age, sex, weight,
and severity of condition
● High or low cholesterol level, dependent on sex, age, whether a person smokes or not, and
so on
● Vote for or against a political party, dependent on age, gender, education level, region,
ethnicity, and so on
● Yes or No, or Agree or Disagree to responses to questionnaire items in a survey.

Overview of Logistic Regression


The following figure introduces you to logistic regression.


Figure 138: Logistic Regression

You can extract the following information from this graph:


● The plot shows a model of the relationship between a continuous predictor (x) and the
probability of an event or outcome (y).
● Probabilities range strictly from 0 to 1. Therefore, the outcome (y) can have a value of 1,
with probability p, or a value of 0, with a probability of (1-p).
● The linear model clearly does not fit this relationship between x and the probability.
● In order to model this relationship directly, you must use a nonlinear function. The plot
displays one such function. The S-shape function is known as sigmoid.
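The S-shaped function referred to in the last bullet can be written down directly. The following minimal sketch shows the logistic (sigmoid) function, which maps any real value to a probability strictly between 0 and 1; the input values are arbitrary.

# The logistic (sigmoid) function: maps any real number into the range (0, 1).
import numpy as np

def sigmoid(z):
    """Return 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

for z in (-4, -1, 0, 1, 4):
    print(f"z = {z:+d}  ->  probability = {sigmoid(z):.3f}")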

Example of Logistic Regression


The following example is a subset of the Motor Trend Car Road Tests (MTCARS) data set.

Figure 139: Logistic Regression: Example of Predictive Data

Example of Logistic Regression: Explanatory and Target Variables


The following figure provides you with an example of explanatory and target variables.


Figure 140: Example: Logistic Regression

Weaknesses of Logistic Regression


There are certain weaknesses to logistic regression.

Figure 141: Logistic Regression: A Poor Fit for Certain Models

Linear regression is not always suitable for certain reasons, which are as follows:


Figure 142: Logistic Regression: Unsuitable for Certain Reasons

Logistic Regression Sigmoid Curve


The following figure describes a well-known logistic curve.

Figure 143: Logistic Regression Sigmoid Curve

The Logistic Function


The following figure details the logistic function.


Figure 144: Logistic Regression Sigmoid Curve: The Logistic Function

Logistic Regression Equation


The logistic regression equation can be described in the following way:

Figure 145: The Equation

Over-fitting
The following diagrams illustrate different aspects of over-fitting.

Figure 146: Over-fitting

From these diagrams, you can extract the following points:


● Model over-fitting occurs when a model provides an excellent fit to the data that it is
trained on, but when the model is applied onto new data its performance or accuracy is
very poor.
● Over-fitting generally occurs when a model is excessively complex, that is, when it has too
many explanatory variables relative to the number of observations.


● The standard approach to avoiding over-fitting is to split the data into a train data set to
train or build the model, and a test data set to test the model on unseen or hold-out data.
● Other techniques include cross-validation where multiple models are run on samples of
the data and the models compared.
● These concepts are explored in more detail as we progress through this course.
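The train/test split mentioned above can be sketched in a few lines. The following example uses the open-source scikit-learn library; the generated data set and the choice of a decision tree as the deliberately complex model are assumptions made only to illustrate the symptom of over-fitting.

# Train/test split sketch to detect over-fitting (scikit-learn; placeholder data).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# A deliberately complex model is likely to over-fit the training data
model = DecisionTreeClassifier(max_depth=None, random_state=1).fit(X_train, y_train)

print("Accuracy on training data:", round(model.score(X_train, y_train), 3))
print("Accuracy on unseen test data:", round(model.score(X_test, y_test), 3))
# A large gap between the two accuracies is a symptom of over-fitting.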

Leaker Variables
The following figure outlines leaker variables.

Figure 147: Leaker Variables: Input Variables and Prediction

Data Leakage
You can utilize certain methods to prevent data leakage.

Figure 148: Leaker Variables: Explanatory Variables

Consider the following points in the context of the example in the figure:
● To avoid using leaker variables, you must know your data, inspect the data with care, and
use your common sense.
● Leaker variables generally possess a high statistical correlation to the target, so the
model's predictive power is suspiciously high. Therefore, search the Influencer
Contributions report for a variable that has an unusually high influence, meaning that it is
highly correlated with your target.


● Remember that if you build a model and it is extremely accurate, you might have a leakage
problem.

Summary
This lesson covers the following:

Figure 149: Summary

Demo Regression in SAC Smart Predict

Figure 150: Overview of Regression Modeling in SAC Smart Predict

Data
The following example provides you with sample data.

Figure 151: Data

Access Data
The following figure shows you how to access the data.


Figure 152: Access Data

Predictive Scenario Selection


The following example shows you how to select a predictive scenario.

Figure 153: Select Predictive Scenario

New Predictive Scenario


The following example shows you how to create a predictive scenario.

Figure 154: Create New Predictive Scenario

Data Selection Process


The following figure shows you how to select data.


Figure 155: Select Data

Column Details
The following figure shows you how to edit the details in a column.

Figure 156: Edit Column Details

Target and Train Selection


The following figure shows you how to choose the target variable and train the model.


Figure 157: Choose Target and Train

Overview Report
The following figure introduces you to the overview report, which provides the following
measures:

Figure 158: Overview Report

Overview Report: Target Statistics


Use target statistics to yield descriptive statistics from the target variable.


Figure 159: Target Statistics

Overview Report: Influencer Contributions


Use the overview report to explore and assess influencer contributions.

Figure 160: Influencer Contributions

Influencer Contributions Report


Use the influencer contributions report to check influencer variables.

Figure 161: Influencer Contributions Report


Grouped Category Influence Report


Use the Grouped Category Influence Report to show groupings of categories of an influencer
variable.

Figure 162: Grouped Category Influence Report

Grouped Category Statistics Report


Use the Grouped Category Statistics Report to show the details of how grouped categories
influence the target variable.

Figure 163: Grouped Category Statistics Report

Summary
This lesson gives you a demonstration of regression modeling in SAC Smart Predict.

Demo Classification in SAP Analytics Cloud


Demonstration Overview
This demonstration shows you how to build a classification model in SAC, examine the output
of the model, and apply it.


Check the Data


The following figure shows you how to check the data in SAC.

Figure 164: Data Check in SAC

Predictive Scenario Selection


The following figure shows you how to select a predictive scenario in SAC.

Figure 165: Select a predictive scenario

New Predictive Scenario Creation


The following figure shows you how to create a new predictive scenario.

Figure 166: New Predictive Scenario

Classification Model Settings


The following figure shows you how to configure the classification model settings.


Figure 167: Classify Classification Model Settings

Edit Column Details in Settings


The following figure shows you how to edit the column details.

Figure 168: Settings: Edit Column Details

Correct Column Details in the System


The following figure shows you how to correct the column details in the system.


Figure 169: Correct Column Details

Target Variable Selection


The following figure shows you how to select the target variable.

Figure 170: Select the Target Variable

Exclude Variables in the System


The following figure shows you how to exclude variables in the system.


Figure 171: Exclude Variables

Overview Report: Model Accuracy


You can check the accuracy of your model using the overview report.

Figure 172: Overview Report in SAC

Overview Report: Influencer Contributions and Detected Target Graph


Use the overview report to check influencer contributions and the detected target graph. The
following figure gives you an example of such investigations.


Figure 173: Influencer Contributions and Detected Target Graph

Influencer Contributions Report


The following figures show you how to examine influencer contributions and the grouped
category influence.

Figure 174: Influencer Contributions

Grouped Category Statistics


The following figure shows you how to scrutinize an influencer contributions report for
grouped category statistics.

Figure 175: Influencer contributions report


Confusion Matrix
The confusion matrix is a useful guide to the performance of an algorithm.

Figure 176: Example: Confusion Matrix

Confusion Matrix Metrics


The following figure describes the metrics that are derived from the confusion matrix.

Figure 177: Confusion Matrix Metrics

Profit Simulation Report


You can use the profit simulation report to visualize profit and gain greater insight.

Figure 178: Example: Profit Simulation Report


Performance Curves Report


The following figures provide you with examples of performance curves, which you can
generate using SAC.

Figure 179: Example: Performance Curves Report

Sensitivity and Lorenz Curves


The following figures provide you with examples of sensitivity and Lorenz curves in SAC.

Figure 180: Performance Curves Report: Sensitivity and Lorenz Curves

Density Curves
Consider the function of density curves, which you can generate using SAC.

Figure 181: Performance Curves Report


Apply the Model


The following figure shows you how to apply the predictive model.

Figure 182: Apply the Predictive Model

Model Output Data


The following figure shows you how to generate model output data.

Figure 183: Model Output Data

Summary
These demonstrations show you how to build a classification model in SAC, examine the
output of the model, and apply it.

Classification with Decision Trees
Overview of Classification Analysis with Decision Trees
Decision trees are popular machine learning tools.


Figure 184: Classification Analysis with Decision Trees

Consider the following points about decision trees in the context of the figure:
● To find solutions, a decision tree makes sequential, hierarchical decisions about the
outcome variable based on the predictor data.
● The model provides a series of "if this occurs, then this occurs" conditions that produce a
specific result from the input data.

Example: Play Golf?


The following figure provides you with an example of a decision tree applied to a question. It
illustrates how rules are generated.

Figure 185: Play Golf?

Decision trees are characterized by the following:


● A decision tree is a graphical representation of the relationships between a target variable
(output) and a set of explanatory variables (inputs).


● Usually the model is represented in the form of a tree-shaped structure that represents a
set of decisions, which is very easy to understand.
● The tree can be binary or multi-branching, depending on the algorithm utilized to segment
the data.
● Each node represents a test of a decision and the rules that are generated can easily be
expressed, which means that the records falling into a particular category can be retrieved.
● In this example related to golf, you classify the weather conditions that would indicate
whether or not to play golf. There are two categorical predictors: Outlook and Windy, and
two numeric predictors, Temperature and Humidity.
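A tree for this kind of weather data can be grown with a few lines of code. The following is a hedged sketch using the open-source scikit-learn library, whose DecisionTreeClassifier implements a CART-style algorithm rather than the CHAID approach discussed later; the numeric values and one-hot encoding are illustrative assumptions, with the class counts per Outlook following the classic "play golf" example.

# Decision tree sketch using scikit-learn's CART-style DecisionTreeClassifier.
# Note: this is not the CHAID algorithm described later; it only illustrates
# how rules can be learned from the data. The numeric values are illustrative
# assumptions; the class counts per Outlook follow the classic play-golf example.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Outlook":     ["sunny", "sunny", "sunny", "sunny", "sunny",
                    "overcast", "overcast", "overcast", "overcast",
                    "rain", "rain", "rain", "rain", "rain"],
    "Windy":       [False, True, False, False, True,
                    False, True, False, True,
                    False, False, True, False, True],
    "Temperature": [85, 80, 72, 69, 75, 83, 64, 81, 72, 70, 68, 65, 75, 71],
    "Humidity":    [85, 90, 95, 70, 70, 78, 65, 75, 90, 96, 80, 70, 80, 91],
    "Play":        ["no", "no", "no", "yes", "yes",
                    "yes", "yes", "yes", "yes",
                    "yes", "yes", "no", "yes", "no"],
})

# One-hot encode the categorical predictor so that the tree can use it
X = pd.get_dummies(data[["Outlook", "Windy", "Temperature", "Humidity"]])
y = data["Play"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))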

Example: Mail Shot Campaign


The following figure provides you with an example of the mail shot campaign.

Figure 186: Mail Shot Campaign

In this example, you are given information about the income of the customers, whether they
are new or existing customers, whether they are young or old, their marital status, and their
gender.

Titanic
The following example uses the sinking of the Titanic to demonstrate the uses of a decision
tree.


Figure 187: Titanic

Consider the figure in the context of the following points:


● The RMS Titanic, a luxury steamship, sank in the early hours of April 15, 1912, off the coast
of Newfoundland in the North Atlantic after sideswiping an iceberg during its maiden
voyage. Of over 2,200 passengers and crew on board, more than 1,500 lost their lives in
the disaster.
● This decision tree represents recursive splits in the data to determine the "decisions"
required to reach the "leaf" nodes. The goal is to create a model that predicts the value of a
target variable based on several input variables.
● In this example, the target variable is as follows: Did the passenger survive or not, and the
input variables are gender, age, and the number of children. These decisions generate
rules.

Decision Tree Terminology


The following figure outlines the basic components of decision tree terminology:

Figure 188: Decision Tree: Terms and Definitions


Play Golf: Chi-squared Automatic Interaction Detector (CHAID) Example


The following figure outlines a CHAID example in relation to the previous example on the
question: play golf?

Figure 189: CHAID example

CHAID analysis demonstrates the following:


● CHAID analysis builds a predictive model to help determine how variables best merge to
explain the outcome in the given dependent variable.
● In this example, the target is a binary categorical variable - do you play golf or not?
● There are two categorical predictors (Outlook and Windy), and two numeric predictors
(Temperature and Humidity).

Define Bins in CHAID Analysis


The following figure shows you how to define bins in CHAID analysis.

Figure 190: Define Bins

Consider the following definitions of variables and values:


● When a variable has many values, the CHAID algorithm can produce many branches in the
tree and, consequently, more complex rules.
● For the variables Temperature and Humidity, suppose you define "bins" or "groups" in the
following way:
- Temperature: 1 (<=70), 2 (>70 and <=80), 3 (>80)
- Humidity: 1 (<=75), 2 (>75)

CHAID Example for Play Golf


The following figure outlines a CHAID example for the previous question, that is, play golf?

Figure 191: CHAID Example

The example clarifies the following points:


● CHAID creates all possible cross tabulations for each categorical predictor until the best
outcome is achieved and no further splitting can be performed.
● To build the decision tree, you recursively split the data to maximize the diversity at each
split.
● Which independent variable makes the best "splitter"? Which one does the best job of
separating the records into groups?
● One measure of diversity is the Chi-Squared Test - hence CHAID, which stands for Chi-
Squared Automatic Interaction Detection.

We turn to the following, Chi-Squared = (Observed - Expected)^2 / Expected:


● The tables are called contingency tables. To find the expected frequencies, you assume
independence of the rows and columns. To get the expected frequency corresponding to
the 2 at top left (sunny/play), you look at row total (5) and column total (9), multiply them,
and divide them by the overall total (14).
● (5*9)/14 = 3.21, which is the expected result.
● Each table is represented with m rows and n columns. The number of degrees of freedom
is calculated for an m-by-n table as (m-1)(n-1), so in this case (3-1)(2-1) = 2*1 = 2.
● Therefore, the chi-squared test is the sum of the squares of the standardized difference
between the expected and observed frequencies.


● The test measures the probability that an apparent association is due to chance or not.
● In this example, the expected value for Sunny/Play is only equal to the observed value of 2,
if the two events are really independent from each other.
● If there is a large chi-squared test statistic, you can say it is most likely wrong to assume
that the playing decision is independent from outlook.
● Take note that the final p-value of 0.1698 is found using look-up tables or the reference
page given at the bottom of the following figure. On that reference page, you can enter the
observed values in the table and it calculates all the values for you, including p (0.1698).
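Instead of a look-up table, the same contingency-table calculation can be checked with the open-source SciPy library. The sketch below uses the observed counts implied by the text (sunny/play = 2, row total 5, column total 9, overall 14); the counts for the overcast and rain rows are assumed from the classic play-golf data set.

# Chi-squared test on the Outlook vs. Play contingency table using SciPy.
# Observed counts follow the classic play-golf data set (14 records).
from scipy.stats import chi2_contingency

observed = [
    [2, 3],  # sunny:    play, don't play  (row total 5)
    [4, 0],  # overcast: play, don't play  (row total 4)
    [3, 2],  # rain:     play, don't play  (row total 5)
]

chi2, p_value, dof, expected = chi2_contingency(observed)

print("Expected frequency for sunny/play:", round(expected[0][0], 2))  # (5*9)/14 = 3.21
print("Degrees of freedom:", dof)                                      # (3-1)*(2-1) = 2
print("Chi-squared statistic:", round(chi2, 3))
print("p-value:", round(p_value, 4))                                   # approx. 0.1698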

CHAID Example Applied to Play Golf


The following is an example of the CHAID example being applied to the question: play golf?

Figure 192: Play Golf: CHAID example

The best initial splitter is - Outlook.

Further Exploration of Play Golf: CHAID Example


We must delve deeper into the play golf example.
For the next split you examine the Outlook Sunny node and repeat the process. The best split
is Humidity.


Figure 193: Play Golf: CHAID example

Play Golf CHAID Example: Split on Outlook Sunny Node


The following figure illustrates the split on the Outlook Sunny node.

Figure 194: Play Golf: CHAID example

This analysis continues for each successive split. Take note of the following points:
● The next split you examine is Outlook Rain, and the best split is Windy.
● The tree is complete as each leaf node contains a single outcome.

Other Classification Tree Algorithms: C4.5 Algorithm


There are other classification tree algorithms, such as the C4.5 algorithm.


Figure 195: C4.5 Algorithm

C4.5 Algorithm: Play Golf?


We can use the C4.5 Algorithm for the question: play golf?

Figure 196: C4.5 Algorithm: Play Golf?

Other Classification Tree Algorithms: Classification and Regression Tree (CART)


There are other classification tree algorithms, such as the CART algorithm.


Figure 197: Other Classification Tree Algorithms - CART

CART is very similar to the C4.5 algorithm, but has the following major differences:
● Rather than building trees that could have multiple branches, CART builds binary trees,
which only have two branches from each node.
● CART uses the Gini Impurity as the criterion to split a node, not what is called Information
Gain.
● CART supports numerical target variables, creating a regression tree that predicts
continuous values.

Random Forests
Random forests are another method for classification and regression.

Figure 198: Random Forests

Confusion Matrix: Actual and Predicted Values


The following figure provides you with an example of a confusion matrix.


Figure 199: Example: Confusion Matrix

In this simple golfing example, actual and predicted are 100% consistent, but of course in
practice it is never like that. Consider these additional elements:
● There are always misclassifications - that is, errors.
● A confusion matrix is used to analyze the performance of the algorithm.
● The confusion matrix is a table that shows the performance of a classification algorithm by
comparing the predicted value of the target variable with its actual value.
● Each column of the matrix represents the observations in a predicted class.
● Each row of the matrix represents the observations in an actual class.
● The name stems from the fact that it makes it easy to see if the system is confusing two
classes, that is, commonly mislabeling one as another.
● The confusion matrix is examined in more detail later in the course.
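A confusion matrix of this kind can be computed directly. The following is a minimal sketch using the open-source scikit-learn library; the actual and predicted labels are invented for illustration.

# Confusion matrix sketch using scikit-learn (illustrative labels).
from sklearn.metrics import confusion_matrix, accuracy_score

actual    = ["play", "play", "no", "play", "no", "no", "play", "no"]
predicted = ["play", "no",   "no", "play", "no", "play", "play", "no"]

# Rows = actual classes, columns = predicted classes
matrix = confusion_matrix(actual, predicted, labels=["play", "no"])
print(matrix)
print("Accuracy:", accuracy_score(actual, predicted))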

Summary
This lesson covers the strengths and weaknesses of decision trees for classification.
The strengths are as follows:

● The tree-type output is very visual and easy to understand.


● They are able to produce "understandable" rules.
● They can perform classification without requiring much computation.
● They can handle both continuous and categorical input variables.
● They provide a clear indication of a variable's importance.

The weaknesses are as follows:

● They are clearly sensitive to the "first split".


● Some decision tree algorithms require binary target variables.
● They can be computationally expensive.


● They generally only examine a single field at a time.


● They are prone to over-fitting.

Understanding Classification Analysis with KNN, NN, and SVM


Introduction: k-Nearest Neighbor Algorithm (k-NN)
You can use the k-NN algorithm to predict or classify objects. The following figure is an
example of this in practice.

Figure 200: The k-Nearest Neighbor Algorithm

The k-NN algorithm has the following characteristics:


● The k-NN algorithm assumes that similar objects exist in close proximity.
● The Euclidean distance is often used as the distance metric.
● k-NN is amongst the simplest of all machine learning algorithms: an object is classified by
a majority vote of its neighbors, with the object being assigned to the class most common
amongst its k nearest neighbors (k is a positive integer, typically small). If k = 1, the object
is simply assigned to the class of its nearest neighbor. The neighbors are taken from a set
of objects for which the actual classification is known.
● k-NN can be used to solve both classification and regression problems.
● k is a user-defined constant.
● To select the k that's right for your data, you run the algorithm several times with different
values of k and choose the k that reduces the number of errors.

Example: k-NN
The following figure provides you with an example of the k-NN algorithm.


Figure 201: Example of k-NN

This example demonstrates the following:


● These diagrams show how a new case would be classified using two different values of k.
● When k = 3, the new case is placed in category 1 because a majority of the nearest 3
neighbors (in the circle) belong to category 1.
● However, when k = 7, the new case is placed in category 0 because a majority of the 7
nearest neighbors (in the circle) belong to category 0.
● Generally, as you decrease the value of k to 1, the predictions become less stable.
● Conversely, as you increase the value of k, the predictions become more stable due to
majority voting or averaging, and are therefore more likely to make more accurate
predictions (up to a certain point).
● Eventually, as you increase k further, errors increase and it is at this point you know you
have increased the value of k too far.
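Experimenting with different values of k, as described above, is straightforward in code. The following hedged sketch uses the open-source scikit-learn library; the synthetic two-category data set and the candidate k values are assumptions made only for illustration.

# k-NN sketch: classify hold-out cases with different values of k (scikit-learn).
# The two-category data set is synthetic and purely illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=7)

# Odd values of k avoid tied votes in a two-class problem
for k in (1, 3, 7, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k:2d}  accuracy on hold-out data = {knn.score(X_test, y_test):.3f}")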

Strengths and Weaknesses of the k-NN Algorithm


The k-NN algorithm shows the following strengths and weaknesses.
The major strength of the k-NN algorithm is as follows:
● It is beautifully simple and logical.

The weaknesses of the k-NN algorithm, and some related practical considerations, are as follows:


● It can be driven by the choice of k, which might be a poor choice.
● Generally, larger values of k reduce the effect of noise on the classification, but make
boundaries between classes less distinct.
● The accuracy of the algorithm can be severely degraded by the presence of noisy or
irrelevant features, or if the feature scales are not consistent with their importance.
● In binary (two class) classification problems, it is helpful to choose k as an odd number, as
this avoids tied votes.
● It is important to review the sensitivity of the solution to different values of k.

Introduction to Neural Networks (NN)


The following figure provides you with a basic overview of NN.


Figure 202: NN Overview

An NN has the following characteristics:


● An NN, or Artificial Neural Network (ANN), which is sometimes called a multilayer
perceptron, is basically a simplified model of the way the human brain processes
information. It works by simulating a large number of interconnected simple processing
units that resemble abstract versions of neurons.
● An artificial NN is an interconnected group of nodes, inspired by a simplification of neurons
in a brain.
● In the figure, each circular node represents an artificial neuron and an arrow represents a
connection from the output of one artificial neuron to the input of another.
● The processing units are arranged in layers.

There are generally three parts in a neural network:


● An input layer, with units representing the input fields
● One or more hidden layers
● An output layer, with a unit or units representing the output field or fields

These units are characterized by the following:


● The units are connected with varying connection strengths or weights.
● Input data are presented to the first layer, and values are propagated from each neuron to
every neuron in the next layer. In this simple example, there are only 2 input variables x1
and x2.
● Eventually, a result is delivered from the output layer.

Nodes
The following figure shows you what actions take place at a node.


Figure 203: What Happens at the Node?

The following processes take place at each node:


● At each node, the input-weight products are summed and then the sum is passed through
the node's non-linear activation function, to determine whether and to what extent that
signal can progress further through the network.
● If the signal passes through, the neuron has been activated.
● The non-linear element transforms the sum of the weighted inputs into an output, which is
frequently between 0 and 1, or -1 and +1.
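The computation at a single node can be written out in a few lines. The following minimal sketch shows a weighted sum of two inputs passed through a sigmoid activation; the input values, weights, and bias are arbitrary assumptions.

# What happens at a single node: weighted sum of inputs passed through a
# non-linear activation function (here a sigmoid). Values are arbitrary.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

inputs  = np.array([0.8, 0.2])      # x1, x2
weights = np.array([0.4, -0.7])     # connection weights
bias    = 0.1

weighted_sum = np.dot(inputs, weights) + bias
output = sigmoid(weighted_sum)      # value between 0 and 1

print("Weighted sum:", round(weighted_sum, 3))
print("Node output (activation):", round(output, 3))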

Neural Network Process: Part 1


The following figure introduces you to part 1 of the neural network process.

Figure 204: Neural Network Process: Part 1

The figure illustrates the following:


● To teach the neural network you need a training data set.
● In this example, the training data set consists of input signals (x1 and x2 ) assigned with a
corresponding target (desired output) y.
● The network training is an iterative process. In each iteration, weight coefficients for each
node are modified using new data from the training data set.
● The network learns by examining individual records, generating a prediction for each
record, and making adjustments to the weights whenever it makes an incorrect prediction.


● This process is repeated many times, and the network continues to improve its predictions
until one or more of the stopping criteria are met.
● Initially, all weights are random, and the answers that come out of the net are probably
nonsensical.
● The network learns through training. Examples for which the output is known are
repeatedly presented to the network, and the answers it gives are compared to the known
outcomes. Information from this comparison is passed back through the network,
gradually changing the weights.
● As training progresses, the network becomes increasingly accurate in replicating the
known outcomes. When it is trained, the network can be applied to future cases where the
outcome is unknown.

Neural Network Process: Part 2


The following figure introduces you to part 2 of the neural network process.

Figure 205: Neural Network Process: Part 2

Neural Network Process: Part 3


The following figure introduces you to part 3 of the neural network process.

Figure 206: Neural Network Process: Part 3

The figure illustrates the following points:


● In the next step of the algorithm, the output signal of the network y is compared with the
desired output value (the target), which is found in the training data set.
● The difference is the error signal of the output layer neuron.
● The back propagation algorithm propagates the error signal, computed in each single
teaching step, back to all the neurons.
● Basically, back propagation is a method to adjust the connection weights to compensate
for each error found during learning. The error amount is effectively divided among the
connections.

Neural Network Process: Part 4


The following figure introduces you to part 4 of the neural network process.

Figure 207: Neural Network Process: Part 4

Neural Network Process: Part 5


The following figure introduces you to part 5 of the neural network process.

Figure 208: Neural Network Process: Part 5


The figure illustrates the following:


● The error signal for each neuron is computed, and the weight coefficients of each neuron
input node might be modified.
● In the formulas df(e)/de represents the derivative of the neuron activation function.

Neural Network Process: Part 6


The following figure introduces you to part 6 of the neural network process.

Figure 209: Neural Network Process: Part 6

Neural Network: Summary


Neural networks have strengths and weaknesses.
The strengths are as follows:
● They can handle a wide range of problems
● They can produce good results even in complex non-linear domains
● They can handle both categorical and continuous variables

The weaknesses are as follows:


● Black box - hard to explain results
● NNs need large amounts of data
● NNs are computationally expensive
● Potential to over-fit
● No hard and fast rule to determine the best network structure

A more comprehensive list of the weakness with an NN is as follows:


● The best-known disadvantage of neural networks is their "black box" nature. Put simply,
you do not know how or why your NN came up with a certain output. For example, when
you put an image of a dog into a neural network and it predicts it to be a car, it is very hard
to understand what causes it to arrive at this prediction. By comparison, algorithms, like
decision trees, are very interpretable. This is important because in some domains,
interpretability is critical. This is why a lot of financial institutions do not use neural
networks to predict the creditworthiness of a person; they need to explain to their
customers why they did not get a loan, otherwise the person can feel unfairly treated.
● Neural networks usually require much more data than traditional machine learning
algorithms, as in at least thousands, if not millions of labeled samples. In many cases,
these amounts of data are not available and many machine learning problems can be
solved well with less data if you use other algorithms.
● The amount of computational power needed for a neural network depends heavily on the
size of your data, but also on the depth and complexity of your network. Usually, neural
networks are more computationally expensive than traditional algorithms.
● An NN shows the potential to over-fit.
● There is no specific rule for determining the structure of artificial neural networks. The
appropriate network structure is achieved through experience and trial and
error. Architectures have to be fine-tuned to achieve the best performance. There are
many design decisions that have to be made, from the number of layers to the number of
nodes in each layer to the activation functions.

Introduction to Support Vector Machines (SVMs)


The following figure introduces you to SVMs.

Figure 210: Uses of SVMs

This example shows the following about SVMs:


● In the classification example in the figure, the training examples are known to belong to
one of two categories, black or white dots.
● The SVM training algorithm builds a model that assigns new examples into one category or
the other. Data points falling on either side of a hyperplane can be attributed to different
classes.
● An SVM model is a representation of the data as points in space, mapped so that the
examples of the separate categories are divided by a clear gap (the hyperplane) that is as
wide as possible. This gives you the maximum distance between data points of both
classes. Maximizing the distance means that you can classify new data points with more
confidence.
● You can see possible hyperplanes in the diagram H1, H2, and H3 separating the black and
white dots.
● This line is the decision boundary: anything that falls to one side of it, we classify as black,
and anything that falls to the other side as white.
● But, what exactly is the best hyperplane? For an SVM, it is the one that maximizes the
margins from both types of dot. In other words: the hyperplane whose distance to the
nearest element of each dot is the largest.

Support vectors are the data points that lie closest to the decision surface (the hyperplane)
and show the following features:
● These are the data points most difficult to classify.
● They have a direct bearing on the optimum location of the decision surface.
● If they are removed, support vectors are the elements of the training set that can change
the position of the dividing hyperplane.
● Support vectors are the critical elements of the training set.
● Compared to newer algorithms like neural networks, SVMs have two main advantages:
higher speed and better performance with a limited number of samples (in the thousands).
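An SVM model of this kind can be fitted with very little code. The following hedged sketch uses the open-source scikit-learn library and compares a linear kernel with a non-linear (RBF) kernel; the synthetic data set is an assumption chosen only to illustrate the idea of a non-linear decision boundary.

# SVM sketch with scikit-learn: linear versus non-linear (RBF) decision boundary.
# The data set is synthetic; it only illustrates the idea, not the figure's data.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

for kernel in ("linear", "rbf"):
    svm = SVC(kernel=kernel).fit(X_train, y_train)
    print(f"{kernel:6s} kernel  accuracy = {svm.score(X_test, y_test):.3f}")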

SVMs versus other Classification Techniques


Let's compare SVMs with other classification techniques.

Figure 211: SVMs versus Logistic Regression and Decision Trees

Comparing SVMs to logistic regression and decision trees illustrates the following:
● If you are a farmer and you need to set up a fence to protect your cows from packs of
wolves, where do you build your fence?
● One way you could do it would be to build a classifier based on the position of the cows and
wolves in your pasture.
● In this example, you see that the SVM does a great job at separating your cows from the
packs of wolves because it uses a non-linear classifier.


● You can see that both the logistic regression and decision tree models only make use of
straight lines.

SVM: Summary
The strengths and weaknesses of SVMs are as follows:

Figure 212: Summary

Time-Series Analysis
Time-Series Analysis Overview
The following figure introduces you to time-series analysis.

Figure 213: Time Series Analysis Introduction

Time-series analysis is characterized by the following:


● A time series is a sequence of data points, called a signal, measured typically at successive
time instants spaced at uniform time intervals.
● Examples of time series are sales data, daily temperatures, and monthly KPIs.


● Time-series analysis comprises methods for analyzing time series data in order to extract
meaningful patterns in the data.
● Time-series forecasting is the use of a model to predict future values based on previously
observed signal values.

Time-series analysis use cases are as follows:


● Sales forecasting
● Run rate analysis
● Stock market analysis
● Yield projections
● Process and quality control
● Inventory control

Forecast Horizon
The following figure provides you with an overview of a forecast horizon.

Figure 214: Forecast Horizon

Stationary, Trend, and Seasonality


The basic patterns found in a signal are outlined in the following figure.


Figure 215: Stationary, Trend, and Seasonality

Naïve Forecasting
The following figure outlines naïve forecasting.

Figure 216: Naïve Forecasting

The very simplest forecasting methods are called:


● Naive 1 forecasting – where tomorrow’s value equals today’s
● Naive 2 forecasting – where tomorrow’s value equals the average of today’s + yesterday’s

These simple approaches can be used as a basis for comparing other algorithms to see if they
are significantly better.
Another approach is Moving Averages. Let's take the popular 7-day moving average that is
used to monitor hospital admissions as an example. A 7-day moving average is calculated by
taking the number of COVID hospital admissions for the last 7 days and adding them together.
The result from the addition calculation is then divided by the number of periods, in this case,
7.
A moving average requires that you specify a window size that defines the number of raw
observations used to calculate the moving average value. In our example that is 7 days.
The “moving” part in the moving average refers to the fact that the window slides along the
time series to calculate the average values in the new series. In our example, the most recent
daily admission number is added and the oldest, 7 periods previously, is dropped.


Moving averages can smooth time series data, revealing underlying trends. Smoothing is the
process of removing random variations that appear as coarseness in a plot of raw time series
data.
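The 7-day moving average described above can be reproduced with the open-source pandas library. The following is a minimal sketch; the daily admission counts and dates are invented assumptions.

# 7-day moving average sketch with pandas (the admission counts are invented).
import pandas as pd

admissions = pd.Series(
    [30, 28, 35, 40, 38, 42, 45, 50, 48, 52, 55, 53, 58, 60],
    index=pd.date_range("2021-01-01", periods=14, freq="D"),
)

# window=7 defines the number of raw observations used for each average value;
# the window slides forward one day at a time, dropping the oldest observation.
smoothed = admissions.rolling(window=7).mean()
print(smoothed)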

Note:
For more info on naïve forecasting, see: https://otexts.com/fpp2/simple-
methods.html https://en.wikipedia.org/wiki/Forecasting

Single Exponential Smoothing


The following figure provides you with an explanation of single exponential smoothing.

Figure 217: Single Exponential Smoothing

Consider the figure in the context of the following points:


● Large values of alpha mean that the model pays attention mainly to the most recent past
observations, whereas smaller values mean more of the history is taken into account when
making a prediction.
● Usual values for α are from 0.1 to 0.7.
● The modeler tests a number of alpha values using trial and error, or using algorithms
designed to identify the best alpha value using a computer program, and choose the value
that minimizes the mean squared error of the forecasts.
● The preceding equation can be shown to be equivalent to the following: Ft+1 = α Xt + (1- α)
Ft
● Therefore, the computation becomes very easy, but, using this formula, you have to start
the process with the first forecast Ft and that is where different starting methods can lead
to different fits and forecasts.
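The recursive formula Ft+1 = α Xt + (1- α) Ft can be implemented directly, which also shows how the choice of starting value matters. The following sketch is illustrative only: the signal values are invented, alpha = 0.3 is an arbitrary choice, and initializing the first forecast to the first observation is one common starting method, not the only one.

# Single exponential smoothing sketch: F(t+1) = alpha * X(t) + (1 - alpha) * F(t).
# The signal values are invented; the first forecast is initialized to the first
# observation, which is one common (but not the only) starting method.
signal = [39, 44, 40, 45, 38, 43, 39]
alpha = 0.3

forecast = signal[0]                 # starting value for the forecast
for t, x in enumerate(signal, start=1):
    print(f"period {t}: actual = {x}, forecast = {forecast:.2f}")
    forecast = alpha * x + (1 - alpha) * forecast

print(f"forecast for the next period: {forecast:.2f}")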

Single Exponential Smoothing: Worked Example 1


The following figure provides you with a working example of single exponential smoothing.


Figure 218: Single Exponential Smoothing: Worked Example 1

Single Exponential Smoothing: Worked Example 2


The following figure provides you with another working example of single exponential
smoothing.

Figure 219: Single Exponential Smoothing: Worked Example 2

Single Exponential Smoothing: Effect of Different Alphas


The following figure provides you with an example of single exponential smoothing, the effect
of different alphas.


Figure 220: Single Exponential Smoothing: Effect of Different Alphas

Single Exponential Smoothing: Business Example


The following figure provides you with a business example of single exponential smoothing.

Figure 221: Single Exponential Smoothing: Business Example

The figure shows you the following:


● It gives you an example of an exponential smoothing model used to analyze energy
consumption in Hungary.
● The analysis shows the actual consumption in blue, the exponential smoothing model, and
the alpha level that is used.

Double Exponential Smoothing: Holt's Two-Parameter Model


The following figure describes double exponential smoothing.


Figure 222: Double Exponential Smoothing

The figure illustrates the following points:


● In addition to the alpha parameter for controlling smoothing factor for the level, an
additional smoothing factor is added to control the decay of the influence of the change in
trend. This is called beta (b).
● The method supports trends that change in different ways: an additive and a multiplicative,
depending on whether the trend is linear or exponential respectively.
● Double Exponential Smoothing with an additive trend is classically referred to as Holt's
linear trend model, named for the developer of the method Charles Holt.
● Holt's Two-Parameter Model is a combination of the stationary element and trend
element.
● The values for alpha and beta can be obtained using non-linear optimization techniques.
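Holt's two-parameter model with an additive trend can be implemented directly from its level and trend update equations. The following sketch is illustrative only: the signal, the alpha and beta values, and the simple initialization of level and trend are assumptions, not the result of the non-linear optimization mentioned above.

# Holt's two-parameter (double) exponential smoothing with an additive trend.
# Level:  l(t) = alpha * x(t) + (1 - alpha) * (l(t-1) + b(t-1))
# Trend:  b(t) = beta * (l(t) - l(t-1)) + (1 - beta) * b(t-1)
# Forecast h steps ahead: l(t) + h * b(t).  Data and parameters are illustrative.
signal = [10, 12, 13, 15, 16, 18, 20, 21]
alpha, beta = 0.5, 0.3

level = signal[0]                    # initial level (an assumption)
trend = signal[1] - signal[0]        # initial trend estimate (an assumption)

for x in signal[2:]:
    previous_level = level
    level = alpha * x + (1 - alpha) * (level + trend)
    trend = beta * (level - previous_level) + (1 - beta) * trend

for h in (1, 2, 3):
    print(f"forecast {h} step(s) ahead: {level + h * trend:.2f}")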

An Example of Double Exponential Smoothing


The following figure provides you with an example of double exponential smoothing.

Figure 223: Example: Double Exponential Smoothing

Double Exponential Smoothing: Different Alpha and Beta Values


The following figure provides you with an example of double exponential smoothing using
different alpha and beta values.


Figure 224: Double Exponential Smoothing: Different Alpha and Beta Values

Triple Exponential Smoothing


The following figure provides you with an example of triple exponential smoothing.

Figure 225: Triple Exponential Smoothing

The figure demonstrates the following points:


● Triple Exponential Smoothing is an extension of Exponential Smoothing that explicitly adds
support for seasonality to the univariate time series.
● This method is sometimes called Holt-Winters' Exponential Smoothing, named for two
contributors to the method: Charles Holt and Peter Winters.
● In addition to the alpha and beta smoothing factors, a new parameter called gamma (γ) is
added to control the smoothing of the seasonal component.
● As with the trend, the seasonality may be modeled as either an additive or multiplicative
process for a linear or exponential change in the seasonality.
- The Holt-Winters' Three-Parameter Model is a combination of the stationary, trend and
seasonality elements.
- The values for alpha, beta and gamma can be obtained via non-linear optimization
techniques.

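As a hedged illustration, the following sketch fits an additive Holt-Winters model with the statsmodels library, assuming that library is available. The synthetic monthly series and the seasonal period of 12 are assumptions made purely for the example; fit() chooses alpha, beta, and gamma by numerical optimization, as described above.

import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(0)
months = np.arange(48)
# Synthetic monthly series: level + linear trend + yearly seasonality + noise
series = 100 + 0.5 * months + 10 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 2, 48)

# Additive trend and additive seasonality (Holt-Winters' three-parameter model)
model = ExponentialSmoothing(series, trend="add", seasonal="add", seasonal_periods=12)
fit = model.fit()
print(fit.forecast(6))   # forecasts for the next six periods
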
Example of Triple Exponential Smoothing


The following figure provides you with an example of triple exponential smoothing.


Figure 226: Triple Exponential Smoothing: Example

Working Example: Triple Exponential Smoothing


The following figure provides you with a working example of triple exponential smoothing.

Figure 227: Triple Exponential Smoothing: Working Example

Autoregressive Integrated Moving Average (ARIMA)


The following figure provides you with an overview of ARIMA.


Figure 228: Overview of ARIMA

ARIMA is another time series forecasting approach. It stands for Autoregressive Integrated
Moving Average models and is used for the following:
● Its main application is in short-term forecasting, and it requires at least around 40 historical
data points. It works best when the data exhibits a stable or consistent pattern over time with a
minimum number of outliers. ARIMA is usually superior to exponential smoothing
techniques when the series is reasonably long and the correlation between past
observations is stable. If the data is short or highly volatile, a smoothing method can
perform better, and if you do not have roughly 40 data points, you should consider a method
other than ARIMA.
● The AR part (for Autoregressive) of ARIMA indicates that the evolving signal is regressed
on its own lagged values (that is, the previous values). The MA part (for Moving Average)
indicates that the regression error is actually a linear combination of error terms. The I (for
"integrated") indicates that the data values have been replaced with the difference
between their values and the previous values (and this differencing process may have been
performed more than once). The purpose of each of these features is to make the model fit
the data as well as possible.
● The first step in applying the ARIMA methodology is to check for stationarity. Stationarity
implies that the series remains at a fairly constant level over time. If a trend exists, as in
most economic or business applications, your data is NOT stationary. The data should also
show a constant variance in its fluctuations over time. This is easily seen with a series that
is heavily seasonal and growing at a faster rate. In this case, the ups and downs in the
seasonality will become more dramatic over time. Without these stationarity conditions
being met, many of the calculations associated with the ARIMA process cannot be
computed. Therefore, you need to transform the data so that it is stationary.
● If a graphical plot of the data indicates non-stationarity, you should "difference" the series.
Differencing is an excellent way of transforming a non-stationary series to a stationary one.
This is done by subtracting the observation in the current period from the previous one. If
this transformation is done only once to a series, you say that the data has been "first
differenced". This process essentially eliminates the trend if your series is growing at a
fairly constant rate. If it is growing at an increasing rate, you can apply the same procedure
and difference the data again. Your data would then be "second differenced".


● "Autocorrelations" are numerical values that indicate how a data series is related to itself
over time. More precisely, it measures how strongly data values at a specified number of
periods apart are correlated to each other over time. The number of periods apart is
usually called the lag. For example, an autocorrelation at lag 1 measures how values 1
period apart are correlated to one another throughout the series. An autocorrelation at lag
2 measures how the data two periods apart are correlated throughout the series.
Autocorrelations can range from +1 to -1. A value close to +1 indicates a high positive
correlation while a value close to -1 implies a high negative correlation. These measures
are most often evaluated through graphical plots called "correlograms". A correlogram
plots the autocorrelation values for a given series at different lags. This is referred to as
the autocorrelation function and is very important in the ARIMA method.
● After a time series has been stationarized by differencing, the next step in fitting an ARIMA
model is to determine whether AR or MA terms are needed to correct any autocorrelation
that remains in the differenced series.
● Of course, with software packages, you could just try some different combinations of
terms and see what works best. But there is a more systematic way to do this: by looking
at the plots of the autocorrelation function (ACF) and partial autocorrelation (PACF) of the
differenced series, you can tentatively identify the numbers of AR and/or MA terms that
are needed.
● The main problem is deciding which ARIMA specification to use - that is, how many
AR and/or MA parameters to include. This is what much of the Box-Jenkins [1976] approach
was devoted to. That method depends on the graphical and numerical evaluation of the
sample autocorrelation and partial autocorrelation functions.

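The following is a hedged Python sketch of the steps just described: difference a non-stationary series, then fit an ARIMA model using the statsmodels library (assuming it is available). The synthetic random-walk series and the order (1, 1, 1) are illustrative assumptions; in practice p, d, and q would be chosen from ACF and PACF plots of the differenced series.

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
# Synthetic non-stationary series: a random walk with drift (illustrative only)
series = np.cumsum(0.5 + rng.normal(0, 1, 120))

# First differencing transforms the series towards stationarity; ARIMA does this
# internally when d = 1, but it is shown explicitly here for clarity.
differenced = np.diff(series)
print("variance before/after differencing:",
      round(float(series.var()), 2), round(float(differenced.var()), 2))

# Fit an ARIMA(p=1, d=1, q=1) model; the order is an assumption for illustration.
fit = ARIMA(series, order=(1, 1, 1)).fit()
print(fit.forecast(steps=5))   # forecasts for the next five periods
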
Accuracy Measures
Mean Absolute Percentage Error (MAPE) is a popular measure used to forecast time-series
errors.

Figure 229: Accuracy Measures

While MAPE is one of the most popular measures for forecasting error, there are many
studies on shortcomings and misleading results from MAPE. These are outlined in the
following way:
● It cannot be used if there are zero actual values, which sometimes happens, for example, in
demand data, because there would be a division by zero.


● For forecasts that are too low, the percentage error cannot exceed 100%, but for forecasts
that are too high there is no upper limit to the percentage error.
● Moreover, MAPE puts a heavier penalty on negative errors, where At < Ft, than on positive
errors.

To overcome these issues with MAPE, there are some other measures proposed in literature:
● Mean Absolute Scaled Error (MASE)
● Symmetric Mean Absolute Percentage Error (sMAPE)
● Also, as an alternative, each actual value (At) of the series in the original formula can be
replaced by the average of all actual values (Āt) of that series. This alternative is still being
used for measuring the performance of models that forecast spot electricity prices.

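The following minimal sketch computes MAPE together with the two alternatives mentioned above. The actual and forecast values are hypothetical, and note that published definitions of sMAPE vary slightly.

import numpy as np

def mape(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    # Undefined when any actual value is zero (division by zero)
    return 100 * np.mean(np.abs(actual - forecast) / np.abs(actual))

def smape(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return 100 * np.mean(2 * np.abs(forecast - actual) / (np.abs(actual) + np.abs(forecast)))

def mase(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    # Scale by the in-sample mean absolute error of the naive one-step-lag forecast
    naive_mae = np.mean(np.abs(np.diff(actual)))
    return np.mean(np.abs(actual - forecast)) / naive_mae

actuals   = [100, 110, 105, 120, 130]   # hypothetical actual values
forecasts = [ 98, 112, 100, 125, 128]   # hypothetical forecasts
print(round(mape(actuals, forecasts), 2),
      round(smape(actuals, forecasts), 2),
      round(mase(actuals, forecasts), 3))
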
Summary
This lesson covers the following:

Figure 230: Summary

Forecasting in SAC
Time-Series Components in SAC
The following figure introduces you to the idea of the signal in a time-series forecast.

Figure 231: Time-Series Components in SAC

Trend
Trends are defined in the following ways.


Figure 232: Trend

Seasonality and Cycles


Seasonal components arise from systematic, calendar-related influences, which are outlined
in the following figure.

Figure 233: Seasonality and Cycles

Detecting the Cycles


The following figure shows you the means by which SAC evaluates two types of cycles.


Figure 234: Detecting the Cycles

The figure illustrates the following points:


● Cycles are also evaluated for potential influencer variables.
● All of these are automatically detected and tested in SAC Smart Predict.
● Note that it is necessary to have enough historical data to detect cycles and particularly
cycles over a long period or long seasonality.

Detecting the Fluctuation


The following figure describes the fluctuation, a pattern of trends and cycles.

Figure 235: Fluctuation Pattern

AR is outlined in the following way:


● AR(1) is the first-order process, meaning that the current value is estimated based on the
immediately preceding value (a lag of 1 period). For example, the value this month is
affected by the value last month.
● An AR(2) process has the current value based on the previous two values (a lag of 2
periods), and so on.


Fluctuations in SAC
The following figure provides you with an example of how a fluctuation is detected.

Figure 236: Fluctuations in SAC

Influencer Variables
Potential Influencer Variables are included in your model and can be used for the following, as
shown in the figure.

Figure 237: Influencer Variables

During the analysis of the trend and cycle components, there are constraints for potential
influencer variables:
● The future values must be known, at least for the expected horizon.
● Influencer variables with ordinal, continuous, and nominal types are used in the detection
of trends.
● But only influencer variables with ordinal and continuous types are used in the detection of
cycles.

Selecting the Final Model: MAPE


The following figure shows you what you can do with SAC Smart Predict.


Figure 238: Final Model Selection in SAC Smart Predict

Horizon-Wide MAPE
Building your model means you need to specify a horizon.

Figure 239: Horizon-Wide MAPE

Demonstration Data
The following figure provides you with demonstration data.

Figure 240: Demonstration Data

Time Series: Select Predictive Scenario


The following figure shows you how to create a predictive scenario for time series.


Figure 241: Select Predictive Scenario

The Creation of a New Predictive Scenario


The following figure shows you how to create a new predictive scenario.

Figure 242: Create the New Predictive Scenario

Edit the Column Details


The following figure shows you how to edit the column details.

Figure 243: Edit Column Details


Define a Predictive Goal


The following figure shows you how to define a predictive goal.

Figure 244: Define Goals

Predictive Model Training Parameters


The following figure shows you how to enter the predictive model training parameters.

Figure 245: Enter Predictive Model Training Parameters

Examine Model Reports


The following figure shows you what the model reports display to you when the model is
trained.


Figure 246: Model Reports

Examine Forecasts
The following figure shows you what you can learn from forecasts.

Figure 247: Forecasts

Outliers
The following figure provides you with an example of Signal Outliers table.

Figure 248: Example of Outliers

Examine the Signal Decomposition Report


The following figure outlines the elements of the signal decomposition report.


Figure 249: Signal Decomposition Report

Example: Signal Decomposition Report


The following figure provides you with an example of a signal decomposition report.

Figure 250: Signal Decomposition Report: Quadratic Trend

Segmented Models
The following figure gives you an example of a segmented model.

Figure 251: Example of Segmented Model


Entity
The following figure provides you with an example of an entity.

Figure 252: Entity

Overview Report: Horizon-Wide MAPE


The following figure shows you the horizon-wide MAPE for each segment.

Figure 253: Overview Report

What the Forecast Report Shows


The following figure shows you what the forecast report illustrates.


Figure 254: Forecast Report

Signal Analysis Report


The following figure shows you the signal decomposition information for each segment.

Figure 255: Signal Analysis Report

Summary
This lesson covers the following:


Figure 256: Summary

Ensemble Methods
Bootstrapping Overview
This is an introduction to a type of resampling called bootstrapping.

Figure 257: Introduction to Bootstrapping

Before describing some of the basic ensemble methods, it is useful to understand the concept
called bootstrapping.
The figure shows a simple example that includes only one repeat of the procedure. It can be repeated
many more times to give a sample of calculated statistics, as in the sketch below.

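A minimal sketch of the idea, repeated 1,000 times rather than once: draw samples with replacement from a small hypothetical data set and collect the statistic (here the mean) from each resample.

import random
import statistics

random.seed(42)
data = [3.1, 2.7, 4.0, 3.6, 2.9, 3.3, 4.2, 3.8, 3.0, 3.5]   # hypothetical sample

bootstrap_means = []
for _ in range(1000):
    # Draw a bootstrap sample: the same size as the original, sampled WITH replacement
    resample = random.choices(data, k=len(data))
    bootstrap_means.append(statistics.mean(resample))

# The spread of the bootstrap statistics estimates the uncertainty of the sample mean
print(round(statistics.mean(bootstrap_means), 3), round(statistics.stdev(bootstrap_means), 3))
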
Ensemble Methods
Ensemble methods are based around the hypothesis that an aggregated decision from
multiple models can be superior to a decision from a single model.


Figure 258: Introduction to Ensemble Methods

Ensemble methods use the following:


● Ensemble methods use multiple learning algorithms to obtain a predictive performance
that exceeds the performance of any of the constituent learning algorithms.
● The ensemble result of the models is also generally judged to be more robust.
● In general, numeric target variables are averaged across multiple executions of an
algorithm, while categorical target variables have a voting system, usually run an odd
number of times to avoid draws.

Voting and Averaging-Based Ensemble Methods


Voting and averaging are simple ensemble methods.

Figure 259: Voting and Averaging-Based Ensemble Methods

Voting and averaging methods are described in the following manner:


● Voting is used for classification and averaging is used for regression.
● In both averaging and voting, the first step is to create multiple classification or regression
models using a number of training data sets. Each base model can be created using
different splits of the same training data set and the same algorithm, or using the same
data set with different algorithms, or any other method.

Majority voting is described in the following way:


● Every model makes a prediction (votes) for each test instance and the final output
prediction is the one that receives more than half of the votes. If none of the predictions
get more than half of the votes, you can say that the ensemble method could not make a
stable prediction for this instance.
● Although majority voting is the most widely used technique, you can also take the most-voted
prediction (even if it receives less than half of the votes) as the final prediction. In some
articles, this method is called plurality voting.

Weighted voting is described in the following way:


● Unlike majority voting, where each model has the same rights, you can increase the
importance of one or more models.
● In weighted voting, you count the prediction of the better models multiple times. Finding a
reasonable set of weights is up to you.

Weighted averaging is described in the following way:


● Weighted averaging is a slightly modified version of simple averaging, where the prediction
of each model is multiplied by the weight and their average is calculated.

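The following sketch illustrates majority voting, weighted voting, and weighted averaging on hypothetical predictions from three models; the model names, labels, and weights are assumptions for illustration only.

from collections import Counter

# Hypothetical class predictions from three models for one test instance
votes = ["churn", "no-churn", "churn"]
print(Counter(votes).most_common(1)[0][0])            # majority (plurality) vote -> "churn"

# Weighted voting: the prediction of the better model counts more heavily
weights = {"model_a": 3.0, "model_b": 1.0, "model_c": 1.0}   # weights are an assumption
labels  = {"model_a": "churn", "model_b": "no-churn", "model_c": "no-churn"}
weighted_votes = Counter()
for model, weight in weights.items():
    weighted_votes[labels[model]] += weight
print(weighted_votes.most_common(1)[0][0])            # "churn" wins 3.0 to 2.0

# Weighted averaging for a numeric (regression) target
predictions = {"model_a": 105.0, "model_b": 98.0, "model_c": 101.0}
weighted_average = sum(weights[m] * predictions[m] for m in weights) / sum(weights.values())
print(round(weighted_average, 2))                     # (3*105 + 98 + 101) / 5 = 102.8
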
Bagging
The following figure provides you with an overview of bagging.

Figure 260: Bagging

Bootstrap aggregating can be described in the following way:


● Bootstrap aggregating, often abbreviated as bagging, involves having each model in the
ensemble vote with equal weight.
● In order to promote model variance, bagging trains each model in the ensemble using a
randomly drawn bootstrap subset of the training set.
● An example of this is as follows: the random forest algorithm combines random decision
trees with bagging to achieve very high classification accuracy.


● The models use voting for classification or averaging for regression. This is where
"aggregating" comes from in the name "bootstrap aggregating."
● Each model has the same weight as all the others.
● In many cases, bagging methods constitute a very simple way to improve a model, without
making it necessary to adapt the underlying base algorithm.
● Bagging methods come in many forms, but mostly differ from each other by the way they
draw random subsets of the training set.

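As a hedged illustration, the following sketch uses the scikit-learn BaggingClassifier (assuming scikit-learn is available) on a synthetic data set; the data set and the number of estimators are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data (illustrative only)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 50 base models (decision trees by default) is trained on a bootstrap
# sample of the training data; their equal-weight vote is the ensemble prediction.
bagger = BaggingClassifier(n_estimators=50, random_state=0)
bagger.fit(X_train, y_train)
print(round(bagger.score(X_test, y_test), 3))
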
Random Forests
Random forests are a set of powerful, fully automated, machine learning techniques.

Figure 261: A Representation of Random Forests

Random forests are one of the most powerful, fully automated, machine learning techniques.
With almost no data preparation or modeling expertise, analysts can obtain surprisingly
effective models. Random forests is an essential component in the modern data scientist's
toolkit.
Random forests are outlined in the following way:
● They are a popular and fast ensemble learning method for classification or regression
scenarios.
● They run a series of classification or regression models over random (bootstrap samples)
from the data.
● They combine and fit those results by voting (classification) or averaging (regression).
● This approach results in robust and high-prediction quality models.

Basically, a random forest consists of multiple random decision trees. Two types of
randomness are built into the trees:

1. First, each tree is built on a random sample from the original data.

2. Second, at each tree node, a subset of features is randomly selected to generate the
best split.

There are advantages to random forests, which are as follows:


● Applicable to both regression and classification problems


● Computationally simple and quick to fit, even for large problems


● No formal distributional assumptions (non-parametric)
● Automatic variable selection
● Handles missing values

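A hedged scikit-learn sketch of a random forest on synthetic data follows; the data set and hyperparameters are illustrative assumptions, not a prescription.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Each tree is grown on a bootstrap sample and considers a random subset of the
# features at every split; the trees' votes are combined into the final prediction.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=1)
forest.fit(X_train, y_train)
print(round(forest.score(X_test, y_test), 3))
print(forest.feature_importances_[:5])   # a built-in view of variable importance
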
Boosting
Boosting algorithms create a sequence of models. The following figure illustrates the core
principles of boosting.

Figure 262: Boosting

Boosting is another ensemble technique. This involves incrementally building an ensemble by
training each new model instance to emphasize the training instances that previous models
incorrectly classified. Boosting is also characterized by the following:
● Boosting refers to a group of algorithms that utilize weighted averages to make weaker
models into stronger ones.
● Unlike bagging that has each model run independently and subsequently aggregates the
outputs at the end without preference to any model, boosting is all about model teamwork.
Each model that runs dictates the features on which the next model focuses.
● In a number of cases, boosting has been shown to yield better accuracy than bagging, but
it also tends to be more likely to over-fit the training data.
● Boosting is a two-step approach, where you first use subsets of the original data to
produce a series of averagely performing models and "boost" their performance by
combining them together using, for example, majority voting.
● Unlike bagging, in classical boosting the subset creation is not random but depends on
the performance of the previous models: every new subset contains the elements that
were (or were likely to be) misclassified by previous models.
● The training data set in the diagram is passed to Model 1. A yellow background indicates
that the model predicted a hyphen, and a blue background indicates that it predicted a plus.
Model 1 incorrectly predicts two hyphens and one plus. These are highlighted with a circle.
The weights of these incorrectly predicted data points are increased and the data is sent to the
next model - Model 2. Model 2 correctly predicts the two hyphens that Model 1 misclassified.
However, Model 2 also makes a number of other errors. This process continues until there is a
combined final classifier that predicts all the data points correctly.

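The reweighting scheme described above is essentially the AdaBoost algorithm. The following hedged sketch uses scikit-learn's AdaBoostClassifier (assuming the library is available) on a synthetic data set; the data and parameters are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

# Models are built sequentially; instances misclassified by earlier models receive
# larger weights so that later models concentrate on them, and the final prediction
# is a weighted combination of all the models.
booster = AdaBoostClassifier(n_estimators=100, random_state=2)
booster.fit(X_train, y_train)
print(round(booster.score(X_test, y_test), 3))
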
Stacking and Machine Learning Models


The following figure outlines "stacking" and multiple machine learning models.

Figure 263: "Stacking" Multiple Machine Learning Models

Stacking, also known as stacked generalization, is an ensemble method where the models are
combined using another machine learning algorithm. Stacking raises certain questions and is
characterized by the following points:
● If you develop multiple machine learning models, how do you choose which model to use?
● Stacking uses another machine learning model that learns when to use each model in the
ensemble.
● In stacking, unlike bagging, the models are typically different (for example, they are not all
decision trees) and fit on the same data set (for example, instead of samples of the training
data set).
● In stacking, unlike boosting, you use a single model to learn how to best combine the
predictions from the contributing models (for example, instead of a sequence of models
that correct the predictions of prior models).
● The picture shows the typical architecture of a stacking model with two or more base
models, often referred to as level-0 models, and a meta-model that combines the
predictions of the base models, referred to as a level-1 model.
● Level 0 models are frequently diverse and make very different assumptions about how to
solve the predictive modeling task, such as linear models, decision trees, support vector
machines, neural networks, and so on.
● The meta-model is frequently simple, such as linear regression for regression tasks
(predicting a numeric value) and logistic regression for classification tasks (predicting a
class label).

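A hedged scikit-learn sketch of the architecture described above follows, with two diverse level-0 models and a logistic regression meta-model; the choice of base models, the synthetic data, and the parameters are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

# Level-0 (base) models make different assumptions about the problem; the level-1
# meta-model (logistic regression) learns how to combine their predictions.
stack = StackingClassifier(
    estimators=[("forest", RandomForestClassifier(n_estimators=100, random_state=3)),
                ("svm", SVC(probability=True, random_state=3))],
    final_estimator=LogisticRegression(),
)
stack.fit(X_train, y_train)
print(round(stack.score(X_test, y_test), 3))
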
Summary
This lesson covers the following:


Figure 264: Summary

Simulation Optimization
Introduction: Monte Carlo Simulation
The Monte Carlo simulation is a mathematical technique that allows you to account for risk in
quantitative analysis and decision making. It uses repeated random sampling to obtain the
distribution of an unknown probabilistic entity.
A Monte Carlo simulation provides the decision-maker with a range of possible outcomes and
the probabilities that occur for any choice of action. It shows the extreme possibilities, that is,
the outcomes of going for broke and for the most conservative, as well as the middle of the
road options.
The history of this method is as follows:
● The modern version was invented in the late 1940s by Stanislaw Ulam while he was
working on nuclear weapons projects at the Los Alamos National Laboratory. It was
developed by John von Neumann, who identified a way to create pseudo-random
numbers. It was named after Monte Carlo, the Monaco resort town renowned for its
casinos.
● The technique is used in finance, project management, energy, manufacturing,
engineering, research and development, insurance, oil and gas, transportation, and the
environment.
● Applications and examples of this simulation are as follows:

- Operations Research studies - queuing and service levels, manufacturing, distribution, and
so on
- Computational physics, physical chemistry, and complex quantum chromodynamics
- Engineering for quantitative probabilistic analysis in process design
- Computational biology
- Computer graphics
- Applied statistics
- Finance and business


Simple Example of the Monte Carlo Simulation


The following figure provides you with a simple example of the Monte Carlo simulation.

Figure 265: Monte Carlo Simulation: Simple Example

The Monte Carlo simulation works in the following manner:


● The Monte Carlo simulation builds models of possible results by substituting a range of
values - a probability distribution - for any factor that has inherent uncertainty.
● It calculates results over and over, each time using a different set of random values from
the probability functions.
● Depending on the number of uncertainties and the ranges specified for them, a Monte
Carlo simulation could involve thousands of recalculations before it is complete.
● Monte Carlo simulation produces distributions of possible outcome values.
● By using probability distributions, variables can have different probabilities of different
outcomes occurring. Probability distributions are a much more realistic way of describing
uncertainty in variables.

Monte Carlo Simulation: Finance Example


The following figure contains a finance example pertaining to the Monte Carlo simulation.

Figure 266: Monte Carlo Simulation: Finance Example


This example, which relates to business and finance, illustrates the following:
● Projected estimates are given for the product revenue, product costs, overheads, and
capital investment for each year of the analysis, from which the cash flow can be
calculated.
● The cash flows are summed for each year and discounted for future values. In other words,
the net present value of the cash flow is derived as a single value measuring the benefit of
the investment.
● The projected estimates are single-point estimates of each data point, and the analysis
provides a single-point value of the project Net Present Value (NPV). You can use sensitivity
analysis to explore the investment model.

This is referred to as deterministic modeling, which is in contrast to probabilistic modeling,
where we examine the probability of outcomes. Deterministic modeling enables you to
explore the impact of the model assumptions using the following: sensitivity analysis, step
sensitivity, impact analysis, and goal seeking or backwards iteration.
Sensitivity analysis explores the following:
● What if revenue is 10% less
● What if costs are 10% more

Step sensitivity explores the following:


● What if capital investment is 10% to 100% higher, in steps of 10%

Impact analysis explores the following:


● What is the impact of an X% change to all the input variables, on a target variable - for
example, NPV.

Goal seeking, or backwards iteration, explores the following:


● What do sales have to amount to in order to yield an NPV of £10,000?

However, these tests do not answer the question: what are the chances, that is, the
probability, that the NPV will be zero or less, or over £10,000. This is where we can use
probabilistic modeling - the Monte Carlo simulation.

The Monte Carlo Simulation in Action


The following figure shows you how the Monte Carlo simulation works.


Figure 267: How the Monte Carlo Simulation Works

Take note of the following in relation to this simple example of the Monte Carlo simulation:
● This is a simple example in which there are 2 distributions: one for the product revenue,
which is a normal distribution in this example, and another for the product costs, a
rectangular distribution. This data has been collected over a period of time and the
distributions show the range of values that have been observed over this time period.
● It is important to notice that a sample is taken from each distribution.
● Using this sample, you can calculate the product margin, overhead, capital investment,
and NPV.
● This process is repeated multiple times, taking further samples from the revenue and
costs distributions, and calculating a range of values, the probability distribution, for the
product margin, overhead, capital investment and NPV.

Monte Carlo Simulation: Sampling and Calculations


You can run the simulation multiple times to calculate different values. The following example
shows you how to accomplish this task.

Figure 268: Derive the Range of Possible Values of the NPV


Run the sampling and calculations multiple times, quite possibly hundreds or thousands of
times (this is the number of trials or simulations). This allows you to derive the following:
● Multiple values of the NPV
● Probability distribution of the NPV, which can be analyzed to estimate Pr(NPV >0), or
Pr(NPV >10,000)

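The following minimal sketch mirrors the example above: revenue is sampled from a normal distribution, costs from a rectangular (uniform) distribution, and the NPV distribution is built up over many trials. All figures, the discount rate, the horizon, and the capital investment are illustrative assumptions, not values from the course example.

import numpy as np

rng = np.random.default_rng(7)
n_trials, years, discount_rate = 10_000, 5, 0.08   # assumptions
capital_investment = 180_000                        # year-0 outlay (assumption)

npv = np.zeros(n_trials)
for year in range(1, years + 1):
    revenue = rng.normal(loc=120_000, scale=15_000, size=n_trials)   # normal distribution
    costs = rng.uniform(low=60_000, high=80_000, size=n_trials)      # rectangular (uniform)
    npv += (revenue - costs) / (1 + discount_rate) ** year           # discounted cash flow
npv -= capital_investment

print("mean NPV:", round(float(npv.mean())))
print("Pr(NPV > 0):", round(float((npv > 0).mean()), 3))
print("Pr(NPV > 10,000):", round(float((npv > 10_000).mean()), 3))
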
Summary: Uses of the Monte Carlo Simulation


In finance and business, you can use the Monte Carlo simulation to explore the probability of
outcomes.
In addition, it offers a probabilistic approach, by contrast with the deterministic approach to
model building.
For more information, see the following:
● https://en.wikipedia.org/wiki/Monte_Carlo_method
● https://www.palisade.com/risk/monte_carlo_simulation.asp

Optimization: Linear Programming (LP)


This is an introduction to linear programming. LP is a technique you can use for optimization,
which is explained in the following figure.

Figure 269: Optimization: Introduction

You can use LP to achieve the best outcome, such as maximum profit or lowest cost, in a
mathematical model whose requirements are represented by linear relationships. Consider
the example of LP in the context of the following points:
● A linear programming algorithm finds a point in the space defined by the constraints where
the objective function has the smallest, or largest, value.
● In this example, there is an objective function to maximize profits (Z), represented by
Z=300X + 500Y, where X and Y are two different types of product. It therefore finds the
largest value for the objective: profit. However, there are also a number of constraints on
the possible values of X and Y.


● You can use the LP method to consider the objective function in conjunction with the
constraints because it enables you to identify the optimal solution. In this example, that is
shown by the red circle.

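As a hedged illustration of this example, the following sketch maximizes Z = 300X + 500Y with scipy.optimize.linprog (assuming SciPy is available). Because linprog minimizes, the objective is negated, and the constraints shown are hypothetical stand-ins for those in the figure.

from scipy.optimize import linprog

# Maximize Z = 300X + 500Y  <=>  minimize -Z, because linprog always minimizes
objective = [-300, -500]

# Hypothetical constraints standing in for those shown in the course figure:
#   1X        <= 4    (capacity for product X)
#         2Y  <= 12   (capacity for product Y)
#   3X + 2Y   <= 18   (shared machine hours)
A_ub = [[1, 0], [0, 2], [3, 2]]
b_ub = [4, 12, 18]

result = linprog(objective, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
x_qty, y_qty = result.x
print("X =", round(x_qty, 2), " Y =", round(y_qty, 2), " max profit =", round(-result.fun, 2))
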
These are a core class of algorithms in Operations Research and are used for the following
purposes:
● Product formulation - minimize the cost subject to ingredient requirements
● Inventory management - minimize inventory costs subject to space and demand
● Allocation of resources - minimize transport costs subject to requirements
● Capital investment planning - maximize return on investment subject to investment
constraints
● Menu planning - minimize cost subject to meal requirements
● It has proved useful in modeling diverse types of problems in planning, routing, scheduling,
assignment, and design

Optimization: Sample Problem


The following figure provides you with an example of the optimization problem.

Figure 270: Example of the Optimization Problem

Optimization: LP Problem Formulation


The following figures provide you with an example of the LP problem formulation in the area of
optimization.


Figure 271: Example of Optimization: LP Problem Formulation 1

Figure 272: Optimization: LP Problem Formulation 2

Optimization: Other Examples


There are a wide range of applications for LP. A number of the most popular are in chemical
blending and optimizing shipping schedules.
The following are example of these applications:
● Concerning the chemical blending problem, a company produces two types of unleaded
petrol, regular and premium. The regular petrol sells at £12 per barrel and the premium petrol
sells at £14 per barrel. However, there are constraints such as maximum demand,
minimum deliveries, minimum octane rating, and so on. The company must calculate what
quantities of the oils must be blended in order to maximize profit, and LP can
help with this assessment.
● In the area of shipping, LP is used to determine shipping schedules based on demand,
shipping costs, and shipping constraints (size or quantity of pallets or containers) to
determine a minimum cost shipping schedule for satisfying demand, minimizing costs, and
maximizing profit.


Simulation Optimization in Action


The following figure uses optimization in SAP Transportation Resource Planning as an
example of simulation optimization.

Figure 273: Example: Simulation Optimization

SAP Transportation Resource Planning (TRP) is designed to supply the right equipment at the
right time and right location, with the minimum cost to fulfill customer demand - a simple
summary of a very complex problem.

Optimization: Summary
This lesson covers the following:
● LP as a mathematical modeling technique, in which a linear function is maximized or
minimized subject to various constraints.
● The uses of this technique, which you can apply to support decision making in business
planning and in industrial engineering.

Common Model Performance Metrics


Introduction to Model Performance Metrics
Evaluating a model is a core part of building an effective model. Use model performance
metrics to accomplish this task.


Figure 274: Introduction to Model Performance Metrics

The particular business question that the model is designed to analyze helps you to determine
if a model can be classed as successful or not. Often the business question is associated with
a specific business success criterion that is converted to a data science success criterion
during Phase 1 of the CRISP-DM process: business understanding.
Potential uses of such a model are as follows:
● If the purpose of the model is to provide highly accurate predictions or decisions that are
used by the business, measures of accuracy are utilized.
● If the interpretation of the business is what is of most interest, accuracy measures are not
as important. Instead, subjective measures of what can provide maximum insight in the
future might be more desirable.
● Many projects use a combination of both accuracy and interpretation. Therefore, the most
accurate model is not selected if a less accurate, but more transparent, model with nearly
the same accuracy, is available.
● In their Third Annual Data Miner Survey, Rexer Analytics, an analytic and CRM consulting
firm based in Winchester Massachusetts, USA, asked analytic professionals: How do you
evaluate project success in Data Mining? The answer to this question is represented in the
figure.
● Out of 14 different criteria, 58% ranked model performance (that is, Lift, R2, and so on) as
the primary factor.
● For these responders, the most important component is the generation of very precise
forecasts.

It is very important to remember the following in relation to model performance:


● The Rexer Analytics survey was addressed to professionals in the analytics field. No one in a
business environment has ever received a raise, bonus, or promotion based on lift, R2, or any
other analytic metric.
● One of the greatest threats to successful model development is building very good models
that answer the wrong question.
● Experienced data science practitioners readily recognize that a decision strategy that
maximizes response rate might not be the same decision strategy that maximizes net
profit.


● There is simply no rational reason to expect that a model, which has been developed to
maximize R2, or Lift, can also maximize the business performance metrics that are of
interest to an organization.

Evaluate a Classification Model


The previous discussion invariably leads to the question: how do you evaluate a classification
model? The following figure provides you with insight into the criteria used to evaluate
the success or failure of a classification model.

Figure 275: Success Criteria for Classification Models

The figure shows the difference between accuracy and precision. Accuracy denotes how
close a measured value is to the actual (true) value. Precision denotes how closely the
measured values are to each other - for example, if you weigh a given substance 10 times, and
get 5.1 kg each time, you know that this measurement is very precise.
Classification models often have a binary nominal target, meaning there are two outcome
classes - churn or no-churn, fraud or no-fraud, and so on - and these are coded as 1 and 0.
The following performance metrics are frequently used to assess classification model
success:
● Confusion matrices summarize the different kinds of errors, called Type I and Type II
errors
● Lift, Area Under the Curve (AUC) metrics
● SAP has developed its own metrics, called Predictive Power and Prediction Confidence (you
examine these in more detail later in this unit)

Bias
Bias and precision refer to factors that impact the accuracy of your model.


Figure 276: Bias and Precision

Analysts frequently refer to bias, which can be understood in the following terms:
● A forecast bias occurs when there are consistent differences between actual outcomes
and forecasts of those quantities, that is, forecasts can have a general tendency to be too
high or too low. A normal property of a good forecast is that it is not biased.
● Bias is a measure of how far the expected value of the estimate is from the true value of
the parameter being estimated.
● Precision is a measure of how similar the multiple estimates are to each other, not how
close they are to the true value.
● Basically, bias refers to the tendency of measures to systematically shift in one direction
from the true value and as such are often called systematic errors as opposed to random
errors.
● Forecast bias is different from forecast error (accuracy) in that a forecast can have any level
of error but be completely unbiased. For example, if a forecast is 10% higher than the
actual values half the time and 10% lower than the actual values the other half of the time,
it has no bias. However, if it is, on average, 10% higher than the actual value, the forecast
has both a 10% error and a 10% bias.

Lift and Gain Charts


You might have seen Lift or Gain charts before, but how is the following one constructed?


Figure 277: Lift and Gains Charts

Common Model Performance


The following figure is a simple illustration in which there are 18 customers. Consider the
figure and examine the data following it.

In this figure, we see the following:


● There are 18 customers.
● 6 customers are targets (churners, responders, fraudsters, and so on) => 6/18 = 33%.
● 12 customers are non-targets (non-churners/non-responders/and so on).
● The red line represents the random model.
● If you randomly select 33% of the whole base of 18 customers (on the x-axis you are
selecting from left to right, from the origin on the left to 33% on the x-axis), you detect
33% of the total number of targets (moving from the 33% on the x-axis, up to the red
random line and then across to the y-axis).


● If you select 50% on the x-axis, you find 50% of the targets on the y-axis.
● This is simply a random selection from the data and it represents what can happen if you
do not use a predictive model, but simply try and choose the targets randomly.

Gains Chart
The following figure is an example of a gains chart.

Figure 278: Gains Chart: Perfect Model

This figure illustrates the following:


● This is the perfect model, where you can perfectly identify all of the target customers with
100% accuracy.
● This means that all of the target red customers are on the far left of the x-axis (you are
selecting from left to right, therefore you select the targets first and no non-targets), while
all of the non-targets in green follow on after the targets on the x-axis.
● The perfect model line, shown in green in this example, shows the following: if you select
33% of the whole base of 18 customers (on the x-axis), you detect 100% of the total
number of targets (shown on the y-axis).

Gains Chart: Model


When you have built a classification model, you expect that the customers with the highest
scores are those most likely to be the target population.


Figure 279: Model

Therefore, if you rank order the customers based on the model score and select those with
the highest score (going from left to right on the x-axis, those with the highest scores are on
the left), these are the customers with the highest probability of being the targets.
In the model (shown by the yellow line), if you select 33% of the whole of the base of 18
customers (on the x-axis going from left to right), based on descending score, you detect
66% of the total number of targets (shown on the y-axis). This is much better than a random
model.
This is a graphical indication of the predictive power of the model, compared to random and
the perfect model.
You can also see the misclassifications. These are the red targets that have lower scores and
are situated further on the right side of the x-axis, and the green non-targets that have high
scores, and are shown more to the left of the x-axis.

Lift Chart
The following figure is an example of a lift chart. The lift is a comparison of the difference
between the random selection and classification model.


Figure 280: Lift Chart

It is usually shown with the random line in a horizontal orientation, (see the previous figure),
where Lift = 1.
However, you can see that in this example, at 33% of the population on the x-axis, you get
33% of the targets randomly and 66% of the targets using the model. The lift is
simply the ratio of the two.
In this example, if you select 33% of the population (on the x-axis), you identify 2x the number
of targets using the model compared to a random selection.

Predictive Power and Prediction Confidence


The following graph shows you how SAP calculates the predictive power and prediction
confidence metrics that are used in the SAP Analytics Cloud Smart Predict solution.

Figure 281: SAP Metrics: Predictive Power and Prediction Confidence

Consider the following in relation to the graph:


● When you build a classification model, the build data are split randomly into two
sub-samples. One sub-sample is used to train the model; this is called the Estimation
sub-sample. The other sub-sample is used to test the model to ensure it is accurate and robust;
this is called the Validation sub-sample.
● In this example, you are training the model on the Estimation and testing the model's
performance on both the Estimation and Validation sub-samples.
● To maintain high confidence in the robustness of the model, you expect that the
performance on both of these sub-samples is very similar, and that the performance
curves overlap one another.
● The predictive power metric ranges from 0 to 100%. There is no minimum cut-off, as the
predictive power is dependent on the predictiveness of the data you are using, but
obviously higher values are better than low values.

A model with a predictive power of 79% is capable of explaining 79% of the
information contained in the target variable using the explanatory variables in the
data set analyzed.
The following is also shown in relation to the predictive power of the model:
● "100%" is a hypothetical perfect model (the green line) that explains 100% of the
target variable. In practice, such a predictive power generally indicates that one or
more of the explanatory variables is 100% correlated with the target variable, that is, there is a
leaking variable that must be excluded from the model.
● "0%" is a purely random model (on the red line).
● The prediction confidence is the robustness indicator. It indicates the capacity of the
model to achieve the same performance when it is applied to a new data set exhibiting the
same characteristics as the training data set. It is estimated from the difference in
the performance of the model on the Estimation and Validation sub-samples, measured by
the area between the two performance curves (shown as B in the graph). The smaller the
value of B, the more robust the model is, because the performance on the Estimation and
Validation sub-samples is very similar. So that the metric ranges from 0 to 100%, the area
B is expressed as a ratio of the total area A + B + C.

Prediction confidence also ranges from 0 to 100%. However, it has a minimum cut-off
threshold of 95% that must be achieved. For example, a model with a prediction confidence:
● Equal to or greater than "95%" is very robust. It has a high capacity for generalization.
● Less than "95%" must be considered with caution. Applying the model to a new data set
incurs the risk of generating unreliable results.

Confusion Matrix
The following figure provides you with an example of a confusion matrix. A confusion matrix is
a table that is often used to describe the performance of a classification model on a set of test
data for which the actual true values are known.


Figure 282: Confusion Matrix

Consider the following case. You are using a classification model to predict whether a bank
customer will default on a loan repayment. A missed payment is coded as class = 1, and no
missed payment is coded as class = 0. The confusion matrix shows the
following measures:

True Positive:
Interpretation: You predicted positive and it is true.
You predicted that the customer will miss a payment and they did miss a payment.

True Negative:
Interpretation: You predicted negative and it is true.
You predicted that the customer will not miss a payment and they did not miss a payment.

False Positive: (Type 1 Error)


Interpretation: You predicted positive and it is false.
You predicted that the customer will miss a payment and they did not miss a payment.

False Negative: (Type 2 Error)


Interpretation: You predicted negative and it is false.
You predicted that the customer will not miss a payment, but they did miss a payment.
Depending on the application of the classification model, errors of these two types are of
more or less importance:

● Marketing response: You would like the number of FPs (non-responding customers) to be
as small as possible to optimize mailing costs. You might have to accept a high proportion
of FNs (responders classified as non-responders) because there are a large number of
customers to be contacted.
● Medical screening: You would like the proportion of FNs (patients with the disease who are
not screened) to be as small as possible (ideally zero). Therefore, you are willing to have a
higher proportion of FPs (patients screened unnecessarily).
● You can "tune" the classification model to minimize the type of error that is most relevant
for your specific scenario.

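The following minimal sketch counts the four cells of a confusion matrix from hypothetical actual and predicted classes (1 = missed payment) and derives accuracy, precision, and recall from them.

# Hypothetical actual and predicted classes (1 = missed payment, 0 = paid on time)
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)   # true positives
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)   # true negatives
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)   # Type 1 errors
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)   # Type 2 errors

accuracy  = (tp + tn) / len(actual)
precision = tp / (tp + fp)   # how many of the predicted positives are real positives
recall    = tp / (tp + fn)   # sensitivity: how many of the real positives are found
print(tp, tn, fp, fn, round(accuracy, 2), round(precision, 2), round(recall, 2))
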

Confusion Matrix: Costs and Benefits

Figure 283: Costs and Benefits

You can use a confusion matrix to assess the costs and benefits of particular actions:
● You can use return on investment, or profit information, with a confusion matrix if there is
a fixed or variable cost associated with the treatment of a customer or transaction, and a
fixed or variable return or benefit if the customer responds favorably.
● For example, if you are building a customer acquisition model, the cost
is typically a fixed cost associated with mailing or contacting the individual. The return is
the estimated value of acquiring a new customer.
● However, if the model predicts a No, but an acquisition would actually have occurred -
Actual = Yes (a false negative), there is an opportunity lost (a negative benefit value).
● For fraud detection, there is a cost associated with investigating the invoice or claim, and a
gain associated with the successful recovery of the fraudulent amount.
● For some business scenarios, it is not immediately apparent how to associate costs and
benefits to the confusion matrix, but there are a number of obvious associations that can
be made in marketing and sales when the classification model is predicting if a customer
will buy a product or not, and what the benefit is to the company if the customer does buy
the product.

Confusion Matrix: Commonly Used Metrics


The following figure outlines the commonly used metrics in a confusion matrix.


Figure 284: Commonly Used Metrics

Receiver Operating Characteristic (ROC) Curve


The following figure is an example of an ROC Curve.

Figure 285: An ROC Curve that measures Performance

When you plot the True Positive rate versus the False Positive rate, you have the ROC Curve,
which shows the following:
● Sensitivity, which appears on the Y axis, is the proportion of actual targets that are
CORRECTLY identified (true positives out of all actual positives).
● [1 - Specificity], which appears on the X axis, is the proportion of actual non-targets that are
INCORRECTLY assigned to the target class (false positives out of all actual negatives).
● The term Specificity, as opposed to [1 - Specificity], is the proportion of actual non-targets
that are CORRECTLY assigned to the non-target class - the true negatives.
● The closer the curve follows the left-hand border and, subsequently, the top border of the
ROC space, the more accurate the test.
● The closer the curve comes to the 45-degree diagonal random line, the less accurate the
test.

The Area Under the Curve (AUC) is a measure of test accuracy:


● The AUC statistic is a measure of model performance or predictive power calculated as the
area under the ROC curve.
● A rough guide for interpreting the AUC is the traditional academic point system: 0.9-1 = excellent (A);
0.8-0.9 = good (B); 0.7-0.8 = fair (C); 0.6-0.7 = poor (D); < 0.6 = fail (F).

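A hedged scikit-learn sketch follows: a simple classifier is trained on synthetic data, its predicted probabilities are used to build the ROC curve, and the AUC is computed. The data set and the choice of logistic regression are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]          # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)    # the points of the ROC curve
print("AUC:", round(roc_auc_score(y_test, scores), 3))
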
ROC Curve: Sensitivity and Specificity


The following figure provides you with an example of an ROC Curve that demonstrates the
trade-off between sensitivity and specificity.

Figure 286: Sensitivity and Specificity

In the chart, at 40% false positives - that is, with 40% of the actual non-targets incorrectly
assigned to the target class:
● A random selection (with no predictive model) would classify 40% of the positive targets
correctly as True Positive.
● A perfect predictive model would classify 100% of the positive targets as True Positive.
● The predictive model created by Smart Predict in SAP Analytics Cloud (the validation
curve) would classify 96% of the positive targets as True Positive.

Success Criteria for Regression Models


The following figure provides you with an overview of how to establish success criteria using
regression analysis.


Figure 287: Success Criteria for Regression Models

Regression models help you with the following:


● Regression models have a continuous target so there are a range of different metrics to
those you have already seen for classification models.
● The error of a model is first computed (the actual observed value minus the predicted or
fitted estimate). This is called the residual.
● The values are summed over all the records in the data.
● Average errors indicate whether the models are biased toward positive or negative errors.
● Average absolute errors indicate the magnitude of the errors (whether they are positive or
negative).
● Often the entire range of predicted values is examined by considering scatter plots of
actual versus predicted values or actual versus residuals (errors).

Performance Metrics for Regression Models


The following performance metrics are frequently used to assess the success of a regression
model (with a continuous target):
● Mean absolute error refers to the mean of the absolute values of the differences between
predictions and actual results - this is called City block distance or Manhattan distance.
● Root mean squared error (RMSE) refers to the square root of the mean of the squared
errors - the Euclidean distance.
● Maximum error refers to the maximum absolute difference between predicted and actual
values, which is called the Chebyshev distance.
● Error mean refers to the mean of the difference between predictions and actual values.
● Error standard deviation refers to the dispersion of errors around the actual result.
● R² (coefficient of determination) refers to the ratio between the variability of the prediction
and the variability of the data.

All of these indicators must be as low as possible, except R², which must be high (its
maximum is 1).

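The following minimal sketch computes these indicators with numpy for hypothetical actual and predicted values.

import numpy as np

actual    = np.array([10.0, 12.5, 9.0, 14.0, 11.0, 13.5])   # hypothetical actual values
predicted = np.array([ 9.5, 13.0, 9.8, 13.2, 11.4, 12.9])   # hypothetical predictions

errors = actual - predicted                    # residuals
mae        = np.mean(np.abs(errors))           # mean absolute error (Manhattan distance)
rmse       = np.sqrt(np.mean(errors ** 2))     # root mean squared error (Euclidean distance)
max_error  = np.max(np.abs(errors))            # maximum error (Chebyshev distance)
error_mean = np.mean(errors)                   # bias towards positive or negative errors
error_std  = np.std(errors)                    # dispersion of the errors
r_squared  = 1 - np.sum(errors ** 2) / np.sum((actual - actual.mean()) ** 2)

print(round(mae, 3), round(rmse, 3), round(max_error, 3),
      round(error_mean, 3), round(error_std, 3), round(r_squared, 3))
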

Summary
This lesson covers the following:
● What metric you must use. In general, this decision must be the one that most closely
matches the business objectives defined at the beginning of the project during the
Business Understanding Phase.
● An example of the previous point is as follows: if the objective indicates that the model is
going to be used to select one-quarter of the population for treatment, the model gain or
lift at the 25% depth is appropriate. However, if all customers are due to be treated,
computing AUC might be appropriate.
● If the objective is to make selections subject to a maximum false error rate, an ROC curve is
appropriate - otherwise, use the confusion matrix.
● It is vital to remember that the metric you use for model selection is of critical importance
because the model selected based on one metric might not be such a good model for a
different metric.
● During the Business Understanding Phase, it is vital that you understand the intent of the
model and match the metric that best fits that intent.

Test Predictive Models


Overview of Gain Charts, Lift Charts, and Decile Tables
You can use gain and lift charts, in addition to decile tables, to compare the performance of
different types of classification algorithm in order to select the best one.

Figure 288: Gains Charts, Lift Charts, and Decile Tables

You can also use these charts and tables to test the performance of models on different time
stamped populations, that is, test how the model performs in different time frames.

Example of Direct Mailing


The following figure provides you with an example of direct mailing.


Figure 289: Direct Mailing 1

Figure 290: Direct Mailing 2

The example in the figure reveals the following:


● The historical campaign data (with the known response information) is randomly split into
an estimation and a validation holdout sub-sample.
● The predictive model is trained on the estimation sub-sample data.
● The model is used to score the validation holdout sample. Those customers with the
higher scores are the ones who are most likely to respond.
● The validation holdout sample is subsequently rank ordered by score and split into 10
sections. These sections are called deciles. The top 10% of scores is decile 1, the next 10
percent is decile 2, and so on.
● As a result of the fact that these are all historical leads, you can also use the decile table to
report on the actual number and proportion of sales for each decile (the number of positive
responses).
● If the model is working well, the leads in the top deciles possess a much higher response
rate than the leads in the lower deciles.

Decile Table: Creation


There are many ways to create a decile table. The following is one suggestion.

1. Score the validation sample or file using the classification model under consideration.
Every individual receives a model score, that is, the Probability_Estimation.

2. Rank the scored file, in descending order, by Probability_Estimation.


3. Divide the ranked and scored file into ten equal groups, called deciles, from top (1), 2, 3, 4,
5, 6, 7, 8, 9, and bottom (10). The "top" decile consists of the best 10% of individuals most
likely to respond; decile 2 consists of the next 10% of individuals most likely to respond,
and so on for the remaining deciles. The deciles separate and order the individuals on an
ordinal scale that ranges from most to least likely to respond.

The decile table yields the following results:


● Customers Contacted per Decile is the number of individuals in each decile: 10% of the total
size of the file.
● Positive Responses per Decile is the actual (not predicted) number of responses in each
decile. The model identifies 600 actual responders in the top decile, 400 in decile 2, and so on
for the remaining deciles.
● Cumulative Responses per Decile is the cumulative sum of the positive responses, added up
as you go from decile to decile, deeper into the file.
● Decile Response Rate is the actual response rate for each decile group: Positive Responses
per Decile divided by Customers Contacted per Decile. For the top decile, the response rate is
60.0% (= 600/1000); for the second decile, it is 40.0% (= 400/1000); and similarly for the
remaining deciles.
● Cumulative Response Rate is calculated for a given depth-of-file, that is, the response rate for
the individuals in the cumulative deciles. For example, the cumulative response rate for the top
decile (10% depth-of-file) is 60.0% (= 600/1000). For the top two deciles (20% depth-of-file), it
is 50.0% (= [600+400]/[1000+1000]), and similarly for the remaining deciles.
● Cumulative Gain compares the cumulative percentage of responders captured with the
cumulative percentage of customers contacted in the marketing campaign across the deciles.
It describes the "gain" from targeting a given percentage of the total number of customers
using the highest modeled probabilities of responding, rather than targeting them at random.
It is calculated by dividing the Cumulative Responses per Decile, at a given depth-of-file, by the
total number of responders across the whole dataset (in the example, 2,000).
● Cumulative Lift, for a given depth-of-file, is the Cumulative Response Rate divided by the
overall response rate of the file (in this example, 20.0%), multiplied by 100. It measures how
much better you can do with the model than without it. For the top decile, the Cumulative Lift
is 100 * (60.0%/20.0%) = 300, which means that when you mail to the top 10% of the file using
the model, you can expect 3 times the number of responders found by randomly mailing 10%
of the file. The Cumulative Lift of 250 for the top two deciles means that when you mail to the
top 20% of the file using the model, you can expect 2.5 times the number of responders found
by mailing 20% of the file without a model.
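Continuing the pandas sketch above, the decile table metrics just described (responses per decile, response rates, cumulative gain, and cumulative lift) can be derived with a few aggregations. This is an illustrative calculation only, not the output of any specific SAP tool, and the simulated numbers will not match the worked example above.

# Build the decile table from the scored sample created in the earlier sketch.
decile_table = (
    scored.groupby("decile", observed=True)
    .agg(contacted=("responded", "size"), positive=("responded", "sum"))
    .sort_index()
)

decile_table["cumulative_positive"] = decile_table["positive"].cumsum()
decile_table["response_rate"] = decile_table["positive"] / decile_table["contacted"]
decile_table["cumulative_response_rate"] = (
    decile_table["cumulative_positive"] / decile_table["contacted"].cumsum()
)

total_responders = decile_table["positive"].sum()
overall_rate = total_responders / decile_table["contacted"].sum()

# Cumulative gain: share of all responders captured down to each decile.
decile_table["cumulative_gain"] = decile_table["cumulative_positive"] / total_responders

# Cumulative lift: cumulative response rate relative to the overall rate, times 100.
decile_table["cumulative_lift"] = (
    100 * decile_table["cumulative_response_rate"] / overall_rate
)

print(decile_table)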

Example of Direct Mailing: Cumulative Gains Chart


The following figure provides you with an example of a Cumulative Gains Chart.


Figure 291: Cumulative Gains Chart

The Cumulative Gains Chart shows the following:


● You can use the predictive model to calculate the scores or probabilities of a response and
rank all of the customers based on this score. The customers with the highest scores are
those most likely to respond.
● The y-axis shows the percentage of positive responses. This is a percentage of the total
possible positive responses (overall response = 2000).
● The x-axis shows the percentage of total customers contacted, which is a fraction of the
10,000 total customers.
● The random line shows that if X% of customers are contacted, X% of the total positive
responses are received. This is a totally random response.
● The gains curve uses the predictions of the response model: for each percentage of
customers contacted, you calculate the percentage of positive responses captured and plot
these points to create the cumulative gains curve.
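A cumulative gains curve of this kind can be plotted directly from the decile_table computed in the earlier sketch; the following matplotlib snippet is an assumed illustration, with the random line drawn as the diagonal.

import matplotlib.pyplot as plt

# X axis: cumulative share of customers contacted; Y axis: share of responders captured.
pct_contacted = decile_table["contacted"].cumsum() / decile_table["contacted"].sum()
pct_responses = decile_table["cumulative_gain"]

plt.plot([0] + list(pct_contacted), [0] + list(pct_responses), marker="o", label="Model")
plt.plot([0, 1], [0, 1], linestyle="--", label="Random")
plt.xlabel("% of customers contacted")
plt.ylabel("% of positive responses")
plt.title("Cumulative Gains Chart")
plt.legend()
plt.show()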

Figure 292: Direct Mailing Example - Cumulative Gains Chart

The gains chart and decile table show the following:


● The gains chart and decile table show that if the top 40% of customers with the highest
model scores are contacted, 79% of all the possible responders (1,580 of the 2,000 total) are
identified. Without the model, contacting customers at random, you would have to contact
79% of all customers to find 79% of the responders. Therefore, you can see the massive
benefit of using the model.
● In addition, if you compare the gains chart of the model scored on the estimation sub-
sample to the gains chart of the model scored on the validation sub-sample, you expect to
see the same proportion of responders in each decile. If the proportions are comparable,
the model is robust. If there are discrepancies, the model might be overtrained.

Direct Mailing Example: Lift Chart


The following figure provides you with an example of how a lift chart can be used in relation to
contacting customers.

Figure 293: Lift Chart: Customer Contact

The figure illustrates the following:


● The points on the lift curve are calculated by determining the ratio between the result
predicted by the model and the result using no model. Using no model is represented by a
random line where Lift = 1.
● An example of this is as follows: if you contact 10% of customers, using no model, you get
10% of responders. However, using the given model, you get 30% of responders
(600/2000 = 30%). The y-value of the lift curve where the percentage of customers
contacted = 10% is, therefore, 3 times the random value of 1.
● Consequently, you can see that using the model is substantially better than using a
random selection.

Summary
In this unit, you have been introduced to Decile Analysis and how you can use it to analyze the
power of the classification models you build.

Improve Model Performance


Introduction to Improving Model Performance
This lesson introduces you to some of the common techniques that you can use to improve
the performance of the predictive models that you build:

● Add more data


● Improve the data quality – for example, remove missing values and outliers


● Select the best features to best explain your target


● Engineer more features
● Use multiple algorithms
● Tune the model’s parameters
● Use ensemble methods
● Use cross-validation techniques

Add More Data


By using more data, you can often produce better and more accurate models. However, the
data must be clean and domain-specific. You must also take note of the following:
● On occasion, adding more data is not an option - for example, you cannot increase the size
of the training data in a data science competition.
● However, if you are working on a company project, you should ask for more data, if
possible.
● Predictive power can be increased by adding more relevant domain specific variables,
which include composite variables to capture variable interactions. A composite variable is
a variable created by combining two or more individual variables. You can use these values
to measure multidimensional concepts that are not easily observed.

Improve Data Quality: Missing Values and Outliers


The unwanted presence of missing and outlier values in the training data generally reduces
the accuracy of a model or leads to a biased model with inaccurate predictions. You must also
consider the following factors as you endeavor to improve the quality of your data:
● In a multivariate model, missing values and outliers cause the behavior and relationship
with other variables to be analyzed incorrectly.
● Therefore, it is important to treat missing and outlier values before you build a model.

You must also attend to missing values by doing the following:


● In case of continuous variables, you can impute the missing values with mean, median, or
mode.
● For categorical variables, you can treat missing variable values as a separate class. You
can also build a model to predict or “impute” the missing values. KNN imputation offers a
great option to deal with missing values.

You must also attend to outlier values by doing the following:


● You can delete the observations, perform data transformations, “bin” the data in order to
capture the outlier values in a bin, impute the value (same as missing values), or you can
treat outlier values separately.
● You can use data binning, a data pre-processing technique that reduces the effects of
minor observation errors. The original data values that fall into a given small interval, a bin,
are replaced by a value representative of that interval, frequently the central value. An
example of this is to bin values for "age" into categories such as 20-39, 40-59, and 60-79.
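The treatments listed above can be sketched in a few lines of pandas; the column names, values, and bin edges below are assumptions chosen only to illustrate median imputation, a separate "missing" class, and binning that captures an outlier.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 41, np.nan, 63, 38, 120],          # 120 looks like an outlier
    "segment": ["A", None, "B", "A", None, "C"],   # categorical with missing values
})

# Continuous variable: impute missing values with the median.
df["age"] = df["age"].fillna(df["age"].median())

# Categorical variable: treat missing values as a separate class.
df["segment"] = df["segment"].fillna("MISSING")

# Binning: the outlier is captured in the top bin instead of distorting the scale.
df["age_band"] = pd.cut(
    df["age"],
    bins=[0, 20, 40, 60, 80, np.inf],
    labels=["<20", "20-39", "40-59", "60-79", "80+"],
)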


Feature Engineering
During the process known as feature engineering, new features are created to extract more
information from existing features. These new features can have an improved ability to
explain the variance in the training data and improve the accuracy of the model.
Feature engineering is highly influenced by business understanding. The feature engineering
process can be divided into the following two steps: feature transformation and feature
creation.
Feature transformation is outlined as follows:

● An example of feature transformation is when data normalization is used to change the
scale of a variable from its original scale to a scale between zero and one. You can use this
type of data preparation to transform the scales of continuous explanatory variables so
that they all have the same scale and a fair comparison can be made between them - for
example, if we are comparing kilometers, meters, and centimeters.
● A number of machine learning algorithms work better with normally distributed data.
Therefore, we must remove skews in variable(s) that use data transformations such as log,
square root, inverse, and so on.

Feature creation is explained as follows:

● Feature creation is the creation of new variables, such as ratios and percentages, to
enhance the prediction. A deep understanding of the business environment, and
knowledge of the types of features that are usually predictive, helps to drive the creation of
these new variables.
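A minimal Python sketch of both steps, assuming hypothetical columns such as income, spend, and visits, might look as follows; the scaling choices and ratios are illustrative, not prescribed.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [28_000, 54_000, 310_000, 61_000],
    "spend": [1_200, 2_400, 9_800, 1_900],
    "visits": [3, 8, 2, 5],
})

# Feature transformation: rescale income to [0, 1] so variables share a common scale.
df["income_scaled"] = (df["income"] - df["income"].min()) / (
    df["income"].max() - df["income"].min()
)

# Feature transformation: a log transform reduces the skew of a long-tailed variable.
df["income_log"] = np.log1p(df["income"])

# Feature creation: ratios and rates can carry more predictive signal than raw values.
df["spend_to_income_ratio"] = df["spend"] / df["income"]
df["spend_per_visit"] = df["spend"] / df["visits"]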

For more information about feature engineering, see the following:


● https://en.wikipedia.org/wiki/Feature_engineering
● https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-
features-and-how-to-get-good-at-it/
● https://towardsdatascience.com/feature-engineering-for-machine-
learning-3a5e293a5114

Feature Selection
Feature selection is the process of finding the subset of explanatory (independent) variables
that best explains the target variable. Filter, wrapper, and embedded techniques have been
described earlier in this course.
There are a number of other different ways to accomplish this task:

● You can use domain knowledge to select the feature(s) that might have a greater impact on
the target variable, based on your business experience in the domain.
● You can use simple data visualizations to help the business understand the relationships
between variables.
● You can use statistical parameters, as there is a range of statistical metrics you can utilize
- for example, p-values and variable contributions to the model.
● You can use Principal Component Analysis (PCA), a type of dimensionality reduction
technique. PCA is a statistical procedure that allows you to summarize the information
content in large data tables by means of a smaller set of "summary indices" that can be more
easily visualized and analyzed.

Using Multiple Algorithms


On occasion, you can use an approach called “segmented modeling” where classification or
regression models are developed for each segment of a cluster model.
Instead of training one model, first a cluster model is developed to create segments of the
data with common characteristics (for example, using K-Means). Afterward, separate
classification or regression models are developed for each segment.
The goal is for the combined accuracy of the separate segmented models to be greater than
the accuracy of the single overall model.
The disadvantage of this approach is the increase in complexity. Multiple models must be
trained. When the models are scored (that is, during the Model Deployment Phase), a
segmentation needs to be run first, and the regression/classification models need to be
deployed for each appropriate segment. Therefore, there is an increase in the complexity of
the model deployment process and you need to maintain multiple models. This increase in
complexity needs to be weighed against the possible improvements in model performance.

Algorithm Tuning
Machine learning algorithms are driven by parameters, for example, the number of leaves of a
classification tree, the number of hidden layers in a neural network, or the number of clusters
in a K-Means clustering.
These parameters influence the outcome of the learning process. Finding the optimum value
for each parameter to improve the accuracy of the model is often called "hyperparameter
tuning".
To tune these parameters, you must first have a good understanding of their meaning and
their individual impact on the model.
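As an illustration (using scikit-learn's grid search, which is an assumption of this example rather than a tool prescribed by the course), candidate parameter values are evaluated with cross-validation and the best-scoring combination is kept; the parameter grids below are arbitrary.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)

# Candidate values for two tree parameters; these grids are illustrative only.
param_grid = {
    "max_depth": [3, 5, 8],
    "min_samples_leaf": [10, 50, 100],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="roc_auc"
)
search.fit(X, y)

print(search.best_params_, search.best_score_)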

Ensemble methods
The following figure contains a graphic representation of ensemble methods.

Figure 294: Ensemble methods

Ensemble methods are characterized by the following:


● In general, numeric target variables are averaged across multiple executions of an
algorithm, while categorical target variables have a voting system, usually run an odd
number of times to avoid draws.


● Bootstrap aggregating, often abbreviated as bagging, involves having each model in the
ensemble vote with equal weight. In order to promote model variance, bagging trains each
model in the ensemble using a randomly drawn subset of the training set - for example, the
random forest algorithm combines random decision trees with bagging to achieve very
high classification accuracy.
● Boosting involves incrementally building an ensemble by training each new model instance
to emphasize the training instances that previous models incorrectly classified. In a
number of cases, boosting is shown to yield better accuracy than bagging, but it also tends
to be more likely to over-fit the training data.
● You were introduced to these techniques earlier in this training.
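For example, the bagging and boosting ideas above can be sketched with scikit-learn implementations (an assumed tool choice, not one prescribed by the course) and compared on the same simulated data.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)

# Bagging: the random forest averages many trees, each trained on a random
# bootstrap sample of the training rows (with random feature subsets per split).
bagged = RandomForestClassifier(n_estimators=200, random_state=0)

# Boosting: trees are added sequentially, each one focusing on the cases that
# the ensemble built so far handles poorly.
boosted = GradientBoostingClassifier(random_state=0)

for name, model in [("bagging (random forest)", bagged), ("boosting", boosted)]:
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(name, scores.mean())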

Cross-Validation
Cross-validation is a popular model validation technique that is illustrated in the following
figure.

Figure 295: Example of Cross-Validation

This method is useful and involves the following:


● Improving model accuracy is not the only goal of improving model performance. On
occasion, the model might be overtrained, so the apparent accuracy is misleadingly high.
● One round of cross-validation involves partitioning a sample of data into complementary
subsets, performing the analysis on one subset (called the training or estimation set), and
validating the analysis on the other subset (called the validation set or testing set). This is
referred to as the hold-out method. One drawback is that the evaluation can depend
heavily on which data points end up in the training set and which end up in the test set, and
thus the evaluation can be significantly different depending on how the division is made.
● Another approach is the K-fold cross validation, shown in the figure, which is one way to
improve on the hold-out sample method. The data set is divided into k subsets (fold 1, fold
2, and so on), and the holdout method is repeated k times. Each time, one of the k subsets
is used as the test set and the other k-1 subsets are put together to form a training set.
Afterward, the average error across all k trials is computed. The advantage of this method
is that it matters less how the data gets divided. Every data point gets to be in a test set
exactly once, and gets to be in a training set k-1 times. The variance of the resulting
estimate is reduced as k is increased. The disadvantage of this method is that the training
algorithm has to be rerun from scratch k times, which means it takes k times as much
computation to make an evaluation.
● There are also other variants.
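A minimal sketch of the k-fold procedure described above, with k = 5 and scikit-learn assumed as the implementation, is shown below on simulated data.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
errors = []

for train_idx, test_idx in kfold.split(X):
    # Each fold serves as the test set exactly once; the other k-1 folds train the model.
    model = LogisticRegression(max_iter=1_000).fit(X[train_idx], y[train_idx])
    errors.append(1 - accuracy_score(y[test_idx], model.predict(X[test_idx])))

# The average error across all k trials is the cross-validated estimate.
print(np.mean(errors))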


To learn more about cross-validation, see the following:


● https://en.wikipedia.org/wiki/Cross-validation_(statistics)#:~:text=Cross%2Dvalidation
%2C%20sometimes%20called%20rotation,to%20an%20independent%20data%20set
● https://towardsdatascience.com/why-and-how-to-cross-validate-a-model-
d6424b45261f

Summary
This lesson introduces you to some of the techniques that are commonly used to improve the
performance of predictive models. These techniques are outlined as follows:

● Add more data


● Improve the data quality – for example, remove missing values and outliers
● Engineer more features
● Select the best features to best explain your target
● Use multiple algorithms
● Tune the model’s parameters
● Use ensemble methods
● Use cross-validation techniques

Of course, one of the most common ways of improving the accuracy of a forecast is to use a
different algorithm. For example, if you are using a decision tree, try a neural network or
logistic regression to see if these other algorithms improve the performance of the model.

LESSON SUMMARY
You should now be able to:
● Understand modeling phase



UNIT 6 Evaluation Phase

Lesson 1
Understanding the Evaluation Phase 200

UNIT OBJECTIVES

● Explain Evaluation Phase



Unit 6
Lesson 1
Understanding the Evaluation Phase

LESSON OBJECTIVES
After completing this lesson, you will be able to:
● Explain Evaluation Phase

CRISP-DM Evaluation Phase


CRISP-DM: Phase 5, Evaluation
The following figure provides you with an introduction to the CRISP-DM: Phase 5, the
Evaluation phase.

Figure 296: CRISP-DM - Phase 5: Evaluation

Phase 5.1: Evaluate Results


The tasks for this phase are as follows:
● Assess the degree to which the model meets the business objectives.
● Test the model(s) on test applications if time and budget constraints permit.

The outputs for this phase are as follows:


● Assessment of data science project results with respect to business success criteria
● Approved model

Previous evaluation steps dealt with factors such as the accuracy and generality of the model.
However, this phase assesses other areas:


● This phase assesses the degree to which the model meets the business objectives. It also
seeks to determine whether there is a business reason that explains why this model is
deficient.
● You can perform an evaluation in another way. For example, you can test the model(s) on
test applications in the real application, if time and budget constraints permit.
● Furthermore, an evaluation also assesses the other data science results you have
generated.
● Data science project results cover the models that are necessarily related to the original
business objectives, as well as all other findings. These other findings are not necessarily
related to the original business objectives, but they might unveil additional challenges,
information, or hints for future directions.

The outputs for this phase are as follows:


● The assessment of data science results with respect to business success criteria.
● Summarize the assessment results in terms of business success criteria, including a final
statement of whether the project already meets the initial business objectives.

The output for approved models are as follows:


● When the model assessment is complete, with respect to business success criteria, the
generated models that meet the selected criteria become approved models.

Phase 5.2: Review Process


The tasks for this phase are as follows:

● Conduct a more thorough review of the data science engagement to determine if there is
any important factor or task that has somehow been overlooked.
● Identify any quality assurance issues.

The outputs for this phase are as follows:

● Conduct a review of the process.


● Summarize the process review and highlight activities that you have missed and/or must
be repeated.

At this point, the following results become clearer:


- At this point, the model (hopefully) appears to be satisfactory and to satisfy the needs
of the business.
- It is appropriate to do a more thorough review of the data science engagement in order
to determine if there is any important factor or task that has somehow been
overlooked.
- This review also covers quality assurance issues, for example, did you correctly build
the model? Did you only use attributes that you are allowed to use and that are
available for future analysis?

Phase 5.3: Determine Next Steps


The task for this phase is as follows:


● Assess how to proceed with the project.

The outputs for the phase are as follows:


● Create a list of possible actions.
● List the potential further actions along with the reasons for and against each option.
● List the decision.
● Describe the decision on how to proceed.

Take the following actions:


- According to the assessment results and the process review, decide on how the project
must proceed at this stage.
- Decide on whether to finish this project and move on to deployment, if appropriate, or
whether to initiate further iterations or set up new data science projects.
- This task includes an analysis of the remaining resources and budget, which influences
the decisions.

Summary
This lesson introduces you to the Evaluation phase of the CRISP-DM process. It covers the
following 3 tasks:

● Evaluate the results


● Review the process
● Determine the next steps

Evaluate Model Performance


Introduction: Evaluate the Performance of a Model
To evaluate the performance of a model, you need to do the following:

● Assess the degree to which the model meets the business objectives.
● Determine if there is any reason why this model is deficient.
● Test the model(s) on test applications in the real application, if time and budget
constraints permit.

Frequently, there is confusion between the Model Assessment task of Phase 4 (Modeling) and
the Evaluation phase of CRISP-DM. The following points clarify this scenario:

● In the Model Assessment task (that is, Task 4 of the Modeling phase of CRISP-DM), you
check that you are satisfied with the performance of the model from a data science
perspective by analyzing the confusion matrix, all of the model performance metrics and
charts, and confirm that the model's explanatory variables and their categories make
"business" sense. You confirm that the model meets the data science success criteria.
● When you have checked every component in the Model Assessment task, you move to the
Evaluation phase. During the Evaluation phase of CRISP-DM, you need to evaluate the
model to assess the degree to which it meets the business objectives, determine if there is
a business reason why this model is deficient, and test the model(s) on test applications in
the real application on new, updated data, if time and budget constraints permit. You
confirm that the model meets the business success criteria.

Challenges
When your model is finally deployed and operationalized, you might find that its behavior is
different to your expectation. Therefore, you need to try and evaluate the model in order to
understand how its performance can deviate from what is expected.
The deviation could be the consequence of interactions with other models and systems, the
variability of the real world, or even of adversaries changing the behavior that your data
reflected when you initially trained the model. Examples of this are as follows:
● If you are building a fraud model, the fraudsters are capable of adapting their methods to
evade detection by your model.
● If you are creating marketing campaigns, your competitors are always looking for ways to
change the sales environment to their own advantage.
● There has been a global pandemic that changed the way people interact and buy products.
● In certain scenarios, you might need to deploy and use the model, and capture the results
over a period of a few months before you can make a final judgment that the model is
working as expected. An example of this is if you are running a sales response campaign
based on a predictive model that indicates the probability that a customer might purchase
a product or not. In this scenario, you need to wait a few months to ensure you capture all
of the customers responding to the campaign.

Decile Analysis
One way to compare actual to predicted performance is to use decile analysis, which you
looked at earlier in this course. When the model is being evaluated on new data, you compare
the model-build decile distribution to the decile distribution observed on the new data. You
also take the following actions:

● When you train the model, the decile analysis of the scored output is represented in 10
bins, each with defined minimum and maximum scores or estimates (the bin boundaries),
with 10% of the samples in each bin.
● When you evaluate the model on new data, you use the same predefined minimum and
maximum scores or estimates you found in the training phase, and you analyze the
percentage of samples in each bin. If the model is working as you expect, you observe
approximately 10% in each evaluation bin.
● If there is a significant shift in the data distribution, with more than or less than 10% in the
predefined bins, this indicates that there is a shift in the underlying population and the
model is no longer working as you would expect.
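A sketch of this check in Python, assuming that the bin boundaries found at training time have been stored: score the new apply data, count the share of records that falls into each predefined bin, and compare it with the expected 10%. The boundary values and simulated score distribution below are hypothetical.

import numpy as np
import pandas as pd

# Bin boundaries saved when the model was trained (hypothetical values).
train_bin_edges = np.array(
    [0.0, 0.05, 0.09, 0.14, 0.20, 0.27, 0.35, 0.45, 0.58, 0.74, 1.0]
)

# Scores produced by applying the model to new, up-to-date data (simulated here).
rng = np.random.default_rng(1)
new_scores = rng.beta(2, 5, size=700)

# Share of the new population that falls into each predefined training-time bin.
observed = pd.Series(pd.cut(new_scores, bins=train_bin_edges, include_lowest=True))
observed_pct = observed.value_counts(normalize=True).sort_index()

# If the population is stable, each bin holds roughly 10% of the new data;
# large deviations signal a shift in the underlying population.
print((observed_pct - 0.10).round(3))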

Example of the Classification Model


The following figure provides you with an example of how to build the classification model.


Figure 297: Classification model example

In this classification model example, you want to know if customers will buy your new product
"P". To discover this, you must understand the following:
● In the following example, you train your predictive model using an input data set containing
past observations for 1,000 customers.
● When you build the model, the decile analysis splits the data into 10 bins, and you can
calculate the average probability of purchasing product "P" per bin.
● When you apply the model and evaluate it on up-to-date real-world data, let us say that
your input data set contains observations on 700 customers. You might find there is a
change in the percentage of customers in each bin.
● You can see the number of customers in each bin is no longer 10%.

By monitoring the population structure, you can accomplish the following:


● Dividing the data set into bins means that each bin must contain approximately 10% of the
observations.
● However, if this changes, it indicates that your population is changing.
● For example, advertising on social media sites might attract more young customers than
other age groups. This does not mean that the predictive model is no longer efficient, but it
can be an alert to check its performance with more recent past data than the data used to
train the model.

Example of the Regression Model


The following figure provides you with an example of a regression model, which allows you to
predict numeric values for the next quarter.


Figure 298: Sample Regression Model

The example shows you the following:


● In the apply data set, there are 14% of customers in the top bin, which is clearly more than
the 10% of customers you would expect when looking at the build data set.
● Again, this indicates that there is a change in the underlying population that needs to be
carefully assessed and monitored because the model is not performing as you initially
expected.

Summary
During the Evaluation phase of CRISP-DM, you do the following:

● Evaluate the model to assess the degree to which it meets the business objectives.
● Determine if there is some business reason why this model is deficient.
● Test the model(s) on test applications in the real application on new, updated data - if time
and budget constraints permit.
● You confirm if the model meets the business success criteria. One way to do this is to use
decile analysis and compare the deciles when you trained the model to those when you use
the model on more up-to-date data.

LESSON SUMMARY
You should now be able to:
● Explain Evaluation Phase



UNIT 7 Deployment and
Maintenance Phase

Lesson 1
Deployment and Maintenance Phase 207

Lesson 2
End-to-end Scenario 218

UNIT OBJECTIVES

● Deploy and maintain models


● Complete a challenge



Unit 7
Lesson 1
Deployment and Maintenance Phase

LESSON OBJECTIVES
After completing this lesson, you will be able to:
● Deploy and maintain models

CRISP-DM Deployment Phase


CRISP-DM: Phase 6, Deployment
The following figure introduces you to Phase 6 of CRISP-DM, the Deployment phase. This
phase is concerned with the organization and presentation of knowledge to the customer so
that they can use this information.

Figure 299: Phase 6: Deployment

Phase 6.1: Plan Deployment


The tasks for this phase are as follows:

● In order to deploy the data mining result(s) into the business, this task takes the
evaluation results and develops a strategy for the deployment of this information.
● If a general procedure is identified to create the relevant model(s), this procedure is
documented here for later deployment.

The outputs for this phase are as follows:

● The development of the deployment plan.


● Summarize a deployment strategy, including the necessary steps and how to perform
them.

Phase 6.2: Plan Monitoring and Maintenance


The tasks for this phase are as follows:

● Monitoring and maintenance are critical issues if the data mining results become part of
the day-to-day business and its environment.
● A careful preparation of a maintenance strategy helps to avoid unnecessarily long periods
of incorrect usage of data mining results.
● This plan takes the specific type of deployment into account.

The outputs for this phase are as follows:

● Understanding and developing the monitoring and maintenance plan.


● Summarize monitoring and maintenance strategy, including the necessary steps and how
to perform them.

Phase 6.3: Produce Final Report


The tasks for this phase are as follows:

● At the end of the project, the project leader and their team write up a final report.
● Depending on the deployment plan, this report might only be a summary of the project and
its experiences - if they have not already been documented as an ongoing activity, or it can
be a final and comprehensive presentation of the data mining result(s).

The outputs for this phase are as follows:

● Generating the final report. The final report is the final written report of the data mining
engagement.
● Creating the final presentation. There is frequently a meeting at the conclusion of the
project where the results are verbally presented to the customer.

Phase 6.4: Review Project


The task for this project is as follows:

● To assess what has gone right and what has gone wrong over the course of the project,
and what has been done well and what needs to be improved.

The outputs for this project are as follows:

● Produce the experience documentation.


● Summarize important experiences made during the project.
● For example, one might look at pitfalls such as misleading approaches, or hints for
selecting the best-suited data science techniques in similar situations. This could be part of
the documentation.


● In ideal projects, experience documentation also covers any reports that are written by
individual project members over the course of the project phases and their tasks.

CRISP-DM: Update
The following figure examines the extent to which a model's predictive performance degrades
over time.

Figure 300: CRISP-DM - Update

Summary
This lesson covers the sixth and final phase of the CRISP-DM process, that is, the deployment
phase. There are 4 tasks in this phase:
● Plan deployment
● Plan monitoring and maintenance
● Production of the final report
● Review of the project

Introduction to Deployment Options


Deployment Options Overview
When you have a model, what do you do next?
The analysis and models that you have developed need to be deployed into a production
environment. You must do this in order for the models to be used in the everyday decision-
making process of the business.
You might want to build a standalone program that can make ad hoc predictions.
You might want to incorporate your model into your existing software.
There can be many challenges in model deployment as many organizations do not have an
integrated infrastructure. Challenges can include complications with new data collection. In
addition, there might be a requirement to deploy new software to fully leverage the analysis,
such as a campaign management system to deploy the results of propensity models.


There are a number of options that you need to consider. These must be fully discussed and
agreed during the Business Understanding phase. Examples of these options are as follows:

● Model scoring: The model is scored on the new apply data. This refers to the score or
probability that is provided to the business and used to underpin actions, decisions, and
the development of business strategy.
● The model is often represented as a mathematical equation that can be expressed as
code - for example, SQL (specific to a database), R, SAS code, C, Java, HTML, and so on. The
code is created and later deployed in a database, or onto the apply data, which could be held
in (for example) a text file. The output from this operation is the scores, probabilities, deciles,
and so on that are required. When the model is applied and scored, the output is provided to
the business and used to trigger actions and decisions.
● A model integration with SAP BusinessObjects BI Reporting: In this scenario, the model is
integrated with the BI reports, and the scores or probabilities are visualized - for example,
score distributions per geography.
● Model integrated with the application: This means the model is integrated with other
applications and used in day-to-day business decision making - for example, in a call
center.
● Cloud: In the cloud, models are trained on predictive analytics software - in the cloud, or on
the premises. These models can be deployed onto a database in the cloud.
● Real-time scoring: On occasion, you might need to deploy models for real-time scoring.

Batch or Real-time Scoring


The classic example of batch scoring is a traditional marketing-campaign model: you send
out a mass postal mailing or email only to those individuals who have a high score.
These “offline” predictions are usually performed in retention campaigns for (inactive)
customers with high propensity to churn, in promotion campaigns, and so on.
In batch scoring, you collect the data points in your data lake or data warehouse, and the
predictions are produced for all the data points at once, in a scheduled batch prediction job.
"Real-time" or “on-line” models are used to serve real-time predictions, based on online
requests from the operational systems or apps. In online recommendations, you need the
current context of the customer who is using your application, along with historical
information, to make the prediction. This could include information such as date/time, page
views, items viewed, items in the basket, items removed from the basket, customer device
type, and device location.
The classic, "real-time" scoring situation is an incident of credit card fraud. A transaction is
run, and you need near-instant feedback. For example, you might get a text message from
your credit card provider asking you if a particular transaction is legitimate or not.
Examples of batch scoring include the following:
● Demand forecasting where you estimate the demand for products by store on a daily basis
for stock and delivery optimization.
● Customer segmentation analysis where you identify customer segments, emerging
segments, and customers who are migrating across segments each month or quarter.

In the real-time scenario, the model usually receives a single data point from a caller, and it
provides a prediction for this data point in real time. Other examples include the following:


● Predictive maintenance, that is, predicting whether a particular machine part is likely to fail
in the next few minutes, given the sensor's real-time data.
● Estimating how long a food delivery takes based on the average delivery time in an area
over the course of the past 30 minutes. This estimate considers the longevity and
ingredients, and real-time traffic information.

The SAP HANA database system can support real-time decisioning. One option is to deploy
models on "data streams". for more information about real-time and batch model scoring, see
the following:
● https://cloud.google.com/solutions/machine-learning/minimizing-predictive-serving-
latency-in-machine-learning
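The contrast can be sketched generically in Python, independent of any specific SAP deployment option: a scheduled batch job scores a whole file at once, while an online function returns a prediction for a single incoming record. The model, file paths, and feature vector below are hypothetical placeholders.

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# A trained model stands in for whatever model was approved for deployment.
X, y = make_classification(n_samples=1_000, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1_000).fit(X, y)

def batch_score(input_path: str, output_path: str) -> None:
    """Scheduled batch job: score every row of a file and write the results."""
    data = pd.read_csv(input_path)  # hypothetical file containing the feature columns
    data["score"] = model.predict_proba(data.values)[:, 1]
    data.to_csv(output_path, index=False)

def score_one(record: list) -> float:
    """Online scoring: return a prediction for a single incoming data point."""
    return float(model.predict_proba([record])[0, 1])

# Example online call with a single hypothetical feature vector.
print(score_one([0.1, -1.2, 0.4, 2.0, -0.3]))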

Event Stream Processing (ESP)


Real-time event stream processing is divided into three steps - capture, act, and stream:

1. Capture the data arriving continuously from devices and applications.

2. Act on new information as soon as it arrives. This information includes alerts, notifications,
and data that precipitates immediate responses to changing conditions.

3. Stream information to live operational dashboards.

These processes are highly scalable - they can process hundreds of thousands, or even
millions, of events per second.
There are a range of predictive models that can be deployed on SAP HANA smart data
access/integration/quality, from the automated Application Function Library (AFL) system
and from SAP HANA Predictive Analysis Library (PAL) - for example, incremental
classification and clustering algorithms.

- SAP HANA smart data access/integration/quality provides both supervised and
unsupervised machine learning algorithms. Both types of algorithms are specifically
optimized to deal with streaming data. These algorithms can respond in real-time
without the onerous long-term storage of massive amounts of previously seen data.
- The scoring equation of a trained model can be exported in Continuous Computation
Language (CCL) for use in SAP HANA smart data access/integration/quality.

Smart Data Analytics Capabilities


The following table shows you what you can accomplish with smart data analytics in SAP
HANA.


Table 1: Stream Capture and Immediate Response

Stream Capture:
● Capture data arriving as individual events, at potentially high speeds, by using SAP HANA to
do the following:
- Monitor hundreds of thousands or millions of events per second
- Utilize micro-batching and parallel processing to optimize load speeds
● Capture events that are published from streaming sources.
● Filter, transform, or enrich the data on the way in by using SAP HANA to do the following:
- Capture only the data you want, in the form you need it
● Prioritize data by using SAP HANA to do the following:
- Capture high-value data in SAP HANA and direct other data into Hadoop

Immediate Response:
● Monitor incoming event streams by using SAP HANA to do the following:
- Watch for trends or patterns
- Monitor correlations
- Detect missing events
- Continuously update and monitor aggregate statistics
● Generate alerts and notifications.
● Initiate an immediate response.

ESP is a sophisticated application that allows you to do the following:


● This type of ESP applies continuous queries to streams of events. The process can collect
events in windows, which allows you to examine new events and compare them to past
events. This can be confusing at first because it starts to sound like a database; however, it
is important to understand that ESP is not a database.
● By taking advantage of ESP, you can define continuous queries. These are not queries in
the traditional sense of being run against a database: they are defined in advance and are
event-driven, updating continuously as new information arrives. In
addition, while ESP can hold data temporarily in windows, it does not provide you with the
means to store data permanently.
● While there is some overlap in the types of analytics you can apply to the data, probably
the most basic distinction is as follows. ESP is event-driven; it publishes output in response
to the arrival of new event information. By contrast, a database operates on-demand; it
delivers results in response to queries that are run against it by a user, by an application, or
on a scheduled basis.
● There is also the aspect of streaming output. When you want to see what data has changed
in a database, you need to query it. To monitor a database for changes, you need to
regularly poll it. ESP on the other hand, produces streaming output: it tells you when
something changes.


Complex Event Processing


The following figure provides you with an overview of how ESP works.

Figure 301: Complex Event Processing: Extracting Insight from Events

How does ESP work? The following points clarify the means by which you can use ESP to
achieve your goals:
● ESP lets you define continuous queries that are applied to one or more streams of
incoming data to produce streams of output events.
● This is what sets ESP apart from simple event processing. Using ESP gives you the ability
to examine incoming events in the context of other events or other data to understand
what is happening.
● In many cases, a single event might not contain much information or be very interesting by
itself, but when combined with other events, you might be able to observe a trend or
pattern that is very meaningful.
● Take the following example: you want to monitor the temperature of equipment to ensure
it does not overheat. You have access to real-time sensor data that tells you the
temperature of the equipment. However, that data alone might not be very useful. An
instance of isolated data is as follows: if the equipment temperature is 90 degrees, but the
equipment is located outside, and the outside air temperature is 85 degrees. This might be
a normal operating temperature for the equipment. However, if the equipment is 90
degrees when the air temperature is only 30 degrees, that could indicate that there is an
imminent problem.
● In this example, an individual data point does not tell you much; to spot a trend that
indicates a potential problem, you want to analyze changes over the course of the last hour.
Moreover, you might want to compute a moving average and compare it to historical norms
for similar equipment.

Summary
This lesson introduces you to a number of the deployment options for your data science
model and output.
In addition, you learn about the following:

● The difference between "batch scoring" or "offline predictions," as opposed to "real-time


scoring" or "online predictions"
● Real-time ESP


Monitor and Maintain Models


Introduction to Monitoring and Maintenance
The predictive models that are deployed into the business environment must be regularly
monitored for acceptable performance. You monitor the following areas to maintain the
effectiveness of predictive models:

● Your analysis and models use data based on past experiences and events to identify future
risks and opportunities. However, conditions and environments are constantly changing
(for example, new products are launched, competitors reduce the prices, and so on) and
this needs to be reflected in the models. If this does not happen, model performances
decay over time.
● Models that exhibit degraded performances produce scores that are incorrect and must
be replaced. If this does not happen, they put decision-makers and their decisions at
severe risk of inaccuracy and unreliability, which can ultimately effect bottom-line
profitability and cause damage to the organization.
● Organizations must develop built-in processes to systematically detect performance
reduction in all deployed models to identify those that are obsolete and to replace them
with new ones.
● A systematic model management life cycle methodology is of paramount importance.

For more information about deploying predictive models, see the following:
● https://machinelearningmastery.com/deploy-machine-learning-model-to-production/
● https://www.kdnuggets.com/2019/06/approaches-deploying-machine-learning-
production.html

Monitoring and Maintenance Approach


Approaches to monitoring and maintenance are as follows:

● Random spot checks are not the best approach to measure the effectiveness of a model
and performance levels.
● Instead, each model must be thoroughly evaluated and measured for accuracy, predictive
power, stability over time, and other appropriate metrics that are defined by the objectives
of each model.
● Ideally, the performance of each model must be evaluated every time it is applied or used.
For models that you use infrequently, each evaluation is also an opportunity to collect fresh
data and rebuild models whose degradation would otherwise be missed.
● Factors to consider are the speed of change in your particular business environment, the
age of the model, the data availability for monitoring and for rebuilding the models, and the
potential risks of using under-performing models.
● In addition, there are business constraints and limited resources.

Update Frequency
Factors that influence how frequently you must update your models are as follows:


● Business environment change


● Frequency of model usage
● The age of the model
● Data readiness

You must turn to examine business environment change:

● Certain business environments are subject to constant change while other environments
undergo very minimal change.
● Clearly, the more frequent the changes, the more often the models are affected and run
the risk of being outdated.
● For example, prepaid telecommunications predictive models need to be updated
frequently because of the rapid business environment changes, new product and service
offers, the release of new types of handset and competition.

The following are examples of changes to the business environment:

● When "internal" or "external" events occur, models must be monitored more closely.
● What is categorized as an internal event could include changes to the product,
underwriting, distribution channels, and processing procedures, or the launch of new
products and services.
● What is categorized as an external event could include changes in legislation, regulation,
customer behavior, and competitor activity (launching campaigns, releasing new
products, and relocating to new geographical areas), or a global pandemic.

You must also factor in the frequency with which you use the model for the following reasons:

● If you use the model on a daily basis, and evaluation tests are carried out each time it is
used, then it is easy to identify when its performance starts to degrade.
● However, if a model is used quarterly, for example, there might have been changes in the
business environment that have been missed. In addition, opportunities to collect fresh data
and rebuild the models over the past three months might have been missed. To avoid these
problems, implement a systematic checking process.

You must also factor in the age of the model for the following reasons:

● If a model is new, it is monitored to ensure its performance matches what is expected


compared to the training environment to build confidence in the results.
● However, as a model matures, it starts to degrade.
● Again this is a reflection on the business environment. For example, in the financial
services sector, many credit risk type models might be used for a number of years
because the activity it is forecasting is relatively stable over time. However, in other fast-
paced environments, such as telecommunications, models have to be refreshed monthly
or on a quarterly basis.

You must also factor in data readiness for the following reasons:


● The availability of data can affect the monitoring of models as well as the redevelopment of
models. In both cases, appropriate data needs to be available.
● There are many factors that affect data readiness. Data occasionally needs to be collected
to reflect rare events; if data is sourced from a third-party agency, there might be delays;
and, of course, to develop predictive models you need to wait for the response data to be
collected. An example of the response data you need is the set of responses to a marketing
campaign, which must be collected before the model can be properly trained.

Concept and Data Drift


The term, concept drift, means that the statistical properties of the target variable, which the
model is trying to predict, change over time in unforeseen ways. This causes problems for
your model because the predictions become less accurate as time passes.
The change to the data can take any form. There can be some temporal consistency to the
change, for example when the data collected within a specific time period shows the same
relationship, and this relationship itself changes smoothly over time.
However, there are occasions when you might detect other types of change:

● A gradual change over time


● A recurring or cyclical change
● A sudden or abrupt change

There are domains where predictions are ordered by time. Examples include time-series
forecasting, such as retail forecasting, and predictions on streaming data (for example, in
predictive maintenance models), where the problem of concept drift is more likely.
Therefore, in such scenarios, you explicitly check for these issues and address them.
While concept drift is about the target variable, data drift describes a change in the
properties of the explanatory variables. In this case, it is not the definition of a customer that
changes, but the values of the features that describe them.
For more information on data drift, see the following:
● https://machinelearningmastery.com/gentle-introduction-concept-drift-machine-
learning/
● https://en.wikipedia.org/wiki/Concept_drift
● https://www.explorium.ai/blog/understanding-and-handling-data-and-concept-drift/
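As one hedged illustration of a data drift check, you can compare the distribution of an explanatory variable at training time with its distribution in recent apply data, for example with a two-sample Kolmogorov-Smirnov test; the threshold, variable, and simulated values below are assumptions, not a standard.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Values of one explanatory variable at training time versus in recent apply data.
train_feature = rng.normal(loc=50, scale=10, size=5_000)
recent_feature = rng.normal(loc=55, scale=12, size=1_000)  # shifted distribution

statistic, p_value = ks_2samp(train_feature, recent_feature)

# A small p-value suggests the two samples come from different distributions,
# which can be treated as a data drift warning for this feature.
if p_value < 0.01:
    print(f"Possible data drift detected (KS statistic = {statistic:.3f})")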

Monitoring and Maintenance Process


Models that are not running at their optimum level need to be refreshed (recalibrated) with
data from the current environment, or completely rebuilt.
Refreshing refers to the following:
● In the case of a model "refresh," you assume the explanatory variables in the original
model are still valid.
● In this scenario, you gather and analyze the new data. Using the new data, you re-fit the
models to adjust the weights that are assigned to each existing explanatory variable and
test the outcomes to ensure that they are in line with the current environment.

Rebuilding refers to the following:


● In the case of a model rebuild, you totally rebuild the model. This means that the model can
use different explanatory variables if they are available. In certain cases, you start from the
beginning of the CRISP-DM process and reconfirm the current business understanding an
the original assumptions of the model.
● When any of the assumptions are updated or redefined, the model build process starts
from the beginning, including data collection, model building and evaluation.

CRISP-DM Update
The following figure points to the key things you must take note of during the CRISP-DM
update.

Figure 302: Monitoring Phase: CRISP-DM Update

Summary
This lesson covers the extent to which the predictive models that you deploy must be
monitored for acceptable performance because model performance diminishes over time.
Each model must be evaluated and measured for accuracy, predictive power, stability over
time, and other appropriate metrics that are defined by the objectives of each model.
Factors that influence the time period in which you must update your models include the
following:

● Business environment change


● Frequency of model usage
● The age of the model
● Data readiness

This lesson also covers the ideas behind concept drift and data drift.

LESSON SUMMARY
You should now be able to:
● Deploy and maintain models



Unit 7
Lesson 2
End-to-end Scenario

LESSON OBJECTIVES
After completing this lesson, you will be able to:
● Complete a challenge

LESSON SUMMARY
You should now be able to:
● Complete a challenge



UNIT 8 Conclusion

Lesson 1
SAP Data Science Applications 220

UNIT OBJECTIVES

● Explain data science applications



Unit 8
Lesson 1
SAP Data Science Applications

LESSON OBJECTIVES
After completing this lesson, you will be able to:
● Explain data science applications

Data Science Applications


Introduction
SAP offers you a wide array of data science tools.

Figure 303: Introduction

AI/ML Solutions at SAP

Figure 304: Overview of AI/ML Solutions at SAP


For more information on SAP and ML, see the following:


● https://developers.sap.com/topics/machine-learning.html#details/
cjma26n0zd51s0932uoh0h452

SAP Analytics Cloud (SAC)


The following figure introduces you to SAC.

Figure 305: SAP Analytics Cloud

You can use SAC to do the following:


● Build interactive visualizations and stories.
● Create predictive models and integrate them into business intelligence and planning
workflows - no data science experience is required for this task.
● Discover the key influencers behind business-critical KPIs and run powerful simulations
to see how the KPIs can change based on changing contributory factors.
● Uncover contributing factors to your data points using natural language queries.
● Quickly develop a clear understanding of even the most complex aspects of your data.

Data Science Applications


The following figure provides you with an overview of data science applications in SAC.


Figure 306: Data Science Applications

Smart Assist
The following figure provides you with an overview of Smart Assist in SAC.

Figure 307: Smart Assist in SAC

SAC Search to Insight


The following figure provides you with an overview of the Search to Insight function in SAC.

Figure 308: SAC Search to Insight


Example: Search to Insight


The following figure shows you an operational example of Search to Insight.

Figure 309: Use Search to Insight

Smart Insights in SAC


The following figure shows you what operations you can perform using Smart Insights in SAC.

Figure 310: What Smart Insights Allow You to Perform?

Further Information on Smart Insights


The following figure provides you with more information on Smart Insights in SAC.

Figure 311: SAC Smart Insights


Time-Series Forecasting
The following figure shows you what you can do with time-series forecasting in SAC.

Figure 312: SAC Time-Series Forecasting

SAC Time-Series Forecasting: Expense Transactions Over Time


The following figure provides you with an example of forecasting expense transactions over
time using Time-Series Forecasting in SAC.

Figure 313: Expense Transactions Over Time
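SAC generates such forecasts automatically, with no coding required. Purely as a hedged
illustration of the underlying idea, the following sketch produces a comparable forecast in
Python with the open-source statsmodels package. This is not the algorithm that SAC uses,
and the monthly figures are invented sample data.

# Illustrative only: a simple trend-based forecast of monthly expense totals
# using open-source statsmodels. Not the SAC algorithm; the data is made up.
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

expenses = pd.Series(
    [210, 225, 198, 240, 260, 231, 245, 270, 255, 280, 300, 290],
    index=pd.date_range("2020-01-01", periods=12, freq="MS"),
)

fit = ExponentialSmoothing(expenses, trend="add").fit()
print(fit.forecast(3))  # predicted expense totals for the next three months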

R Visualizations in SAC
The following figure provides you with an overview of R Visualizations in SAC.

© Copyright. All rights reserved. 224


Lesson: SAP Data Science Applications

Figure 314: R Visualizations

Smart Features and R Visualizations in SAC


The following figure introduces you to interactive R Visualizations in SAC.

Figure 315: R Visualizations in SAC

SAC Smart Grouping


The following figure introduces you to Smart Grouping in SAC.

Figure 316: Smart Grouping

© Copyright. All rights reserved. 225


Unit 8: Conclusion

Example of Smart Grouping


The following figure contains an example of Smart Grouping in SAC.

Figure 317: SAC Smart Grouping

SAC Smart Discovery


The following figure introduces you to Smart Discovery in SAC.

Figure 318: SAC Smart Discovery

Example of Smart Discovery


The following figure provides you with a working example of Smart Discovery in SAC.

© Copyright. All rights reserved. 226


Lesson: SAP Data Science Applications

Figure 319: Example: SAC Smart Discovery

Smart Predict
The following figure introduces you to Smart Predict in SAC.

Figure 320: SAC Smart Predict

Operations in SAC Smart Predict


The following figure shows you the functions you can perform using Smart Predict in SAC.

© Copyright. All rights reserved. 227


Unit 8: Conclusion

Figure 321: SAC Smart Predict

SAC Predictive Planning


Predictive planning allows you to accomplish many things in SAC, as shown in the following
figure.

Figure 322: SAC Predictive Planning

SAP HANA: Advanced Analytical Processing-In-Database and EML Capabilities


The following figure gives you an overview of external machine learning (EML) in SAP HANA.

© Copyright. All rights reserved. 228


Lesson: SAP Data Science Applications

Figure 323: SAP HANA: Advanced Analytical Processing-In-Database and EML Capabilities

SAP HANA Predictive Analysis Library (PAL)


The following figure provides you with an overview of the various functions in SAP HANA PAL.

Figure 324: SAP HANA PAL

SAP HANA PAL


The following figure provides you with a breakdown of the functions in SAP HANA PAL.

© Copyright. All rights reserved. 229


Unit 8: Conclusion

Figure 325: SAP HANA PAL
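Besides being callable as SQL procedures, the PAL algorithms can be consumed from Python
through the hana_ml client library, so that model training and scoring stay in-database. The
following is a minimal sketch only: the host, credentials, table, and column names are
hypothetical, and class and parameter names can vary between hana_ml releases.

# Minimal sketch: calling a PAL algorithm from Python with the hana_ml client.
# All connection details, tables, and columns below are placeholders.
from hana_ml import dataframe
from hana_ml.algorithms.pal.trees import RDTClassifier

# Connect to SAP HANA; hana_ml DataFrames reference in-database tables
conn = dataframe.ConnectionContext(address="hana-host", port=39015,
                                   user="ML_USER", password="***")
train_df = conn.table("CHURN_TRAIN")  # hypothetical training table
new_df = conn.table("CHURN_NEW")      # hypothetical scoring table

# Train a PAL random decision trees (random forest) classifier in-database
rdt = RDTClassifier(n_estimators=100)
rdt.fit(data=train_df, key="CUSTOMER_ID", label="CHURNED")

# Score new records in-database and pull only the results to the client
result = rdt.predict(data=new_df, key="CUSTOMER_ID")
print(result.head(10).collect())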

SAP HANA Automated Predictive Library (APL)


SAP HANA APL allows you to automate a number of data mining capabilities.

Figure 326: SAP HANA APL

Functions in SAP HANA APL

Figure 327: Functions
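The APL can also be called from Python through the hana_ml client (the hana_ml.algorithms.apl
module). The sketch below is hedged in the same way as the PAL example: all names are
placeholders, the APL must be installed on the SAP HANA system, and constructor signatures
differ between hana_ml releases (older releases expect an explicit connection argument).

# Minimal sketch: automated classification with the APL through hana_ml.
# All connection details, tables, and columns below are placeholders.
from hana_ml import dataframe
from hana_ml.algorithms.apl.gradient_boosting_classification import (
    GradientBoostingBinaryClassifier,
)

conn = dataframe.ConnectionContext(address="hana-host", port=39015,
                                   user="ML_USER", password="***")
train_df = conn.table("CHURN_TRAIN")  # hypothetical training table

# The APL automates encoding, variable selection, and model tuning
model = GradientBoostingBinaryClassifier()
model.fit(train_df, key="CUSTOMER_ID", label="CHURNED")

# Inspect the automatically generated variable importance, then score new data
print(model.get_feature_importances())
predictions = model.predict(conn.table("CHURN_NEW"))
print(predictions.head(10).collect())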

SAP HANA: EML Integration Leverage Open-Source ML


External machine learning in SAP HANA covers integrations using many programming
languages.

© Copyright. All rights reserved. 230


Lesson: SAP Data Science Applications

Figure 328: External ML Integration Leverage Open-Source ML with SAP HANA

SAP HANA Multimodal Example Scenario: PAL, Spatial, and TensorFlow, 1


The following figure provides you with an example of a multimodal scenario in SAP HANA.

Figure 329: SAP HANA Multimodal Example Scenario: PAL, Spatial, and TensorFlow

© Copyright. All rights reserved. 231


Unit 8: Conclusion

SAP HANA Multimodal Example Scenario: PAL, Spatial, and TensorFlow, 2

Figure 330: SAP HANA Multimodal Example Scenario: PAL, Spatial, and TensorFlow, 2

SAP Data Intelligence Overview


The following figure provides you with an overview of SAP Data Intelligence.

Figure 331: SAP Data Intelligence

© Copyright. All rights reserved. 232


Lesson: SAP Data Science Applications

SAP Data Intelligence: EML Capabilities

Figure 332: SAP Data Intelligence: EML Capabilities

You can use SAP Data Intelligence to do the following:


● Explore, preview, and profile data assets.
● Prepare data directly from the Data Intelligence Data Explorer, without any technical
scripting skills.
● Discover, refine, govern, and orchestrate any type, variety, and volume of data.
● Develop data analysis and predictive models using notebooks, code, and data.
● Integrate Jupyter Notebooks into a Pipeline Modeler application, so that Jupyter
Notebooks and Python can be used directly in the operationalization of predictive model
pipelines (see the sketch after this list).
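As an illustration of the last point, a custom Python operator inside a Data Intelligence pipeline
is essentially a small script wired to input and output ports. The sketch below assumes a
Python3 operator with ports named "input" and "output"; the api object is injected by the
Pipeline Modeler runtime, and the scoring logic is only a placeholder.

# Minimal sketch of a custom Python operator script in an SAP Data Intelligence
# pipeline. The api object is provided by the runtime (it is not imported), and
# the port names must match the ports configured on the operator.
import json

def on_input(msg):
    # msg.body carries the incoming payload
    records = json.loads(msg.body)
    # Placeholder scoring step; in practice this would apply a trained model
    scored = [dict(record, score=0.5) for record in records]
    api.send("output", json.dumps(scored))

# React to every message that arrives on the "input" port
api.set_port_callback("input", on_input)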

Intelligent Processing in SAP Data Intelligence


The following figure introduces you to intelligent processing in SAP Data Intelligence.

Figure 333: Intelligent Processing in SAP Data Intelligence

© Copyright. All rights reserved. 233


Unit 8: Conclusion

Data Science Applications


The following figure provides you with an overview of the machine learning scenario manager
in SAP Data Intelligence.

Figure 334: Machine Learning Scenario Manager

Data Science Applications: Jupyter Lab Environment


The following figure provides you with an overview of the Jupyter Lab Environment in SAP
Data Intelligence.

Figure 335: Jupyter Lab Environment

© Copyright. All rights reserved. 234


Lesson: SAP Data Science Applications

ML Operationalization in SAP Data Intelligence

Figure 336: ML Operationalization in SAP Data Intelligence

Embedded AI in Standard SAP Applications


The following figure shows you the extent to which SAP embeds AI in standard SAP
applications and exposes these capabilities as enterprise-specific AI solutions and services to
extend business processes.

Figure 337: SAP AI Foundation (AIF)

SAP AI Foundation (AIF) can be summarized as follows:
● AIF is used internally in SAP applications and is currently not available to customers.
● AIF enables SAP to embed intelligent data science ML/AI applications into SAP solutions,
and supports the creation of the Intelligent Enterprise.

© Copyright. All rights reserved. 235


Unit 8: Conclusion

SAP AIF: Vision

Figure 338: Data Science Applications

Intelligent Scenario Lifecycle Management (ISLM) embeds HANA PAL and HANA APL into the
SAP S/4HANA business applications without the need for coding.

AI/ML embedded in SAP Applications

Figure 339: AI/ML embedded in SAP Applications

SAP Delivers the Intelligent Enterprise


AI and ML are core enablers of intelligent technologies.

© Copyright. All rights reserved. 236


Lesson: SAP Data Science Applications

Figure 340: SAP Delivers the Intelligent Enterprise

Summary

Figure 341: Summary

Data Science References


Introduction

Figure 342: Introduction

© Copyright. All rights reserved. 237


Unit 8: Conclusion

Microsoft (MS) Excel

Figure 343: MS Excel

Microsoft Excel has the following characteristics:


● Microsoft Excel uses a grid of cells, arranged in numbered rows and letter-named columns,
to organize data manipulations such as arithmetic operations.
● It provides a range of functions for statistical, engineering, and financial analysis.
● It can display data as line graphs, histograms, and charts, and offers only limited three-
dimensional graphical display.

Figure 344: R

R is a programming language with the following characteristics:


● The R language is widely used among statisticians and data miners for developing
statistical software.

© Copyright. All rights reserved. 238


Lesson: SAP Data Science Applications

● The basic installation includes all of the commonly used statistical techniques, such as
univariate analysis, categorical data analysis, hypothesis testing, generalized linear
modeling, multivariate analysis, and time-series analysis.
● There are also powerful facilities to produce statistical graphics.
● The base software is supplemented by over 5,000 add-on packages developed by R users.
These packages cover a broad range of statistical techniques.
● R is command driven, so it takes longer to master than point-and-click software. However,
it has greater flexibility.

Python

Figure 345: Python

Python is a concise, easy-to-read programming language with the following characteristics:
● Python was developed with an emphasis on code readability, and its syntax allows
programmers to express concepts in fewer lines of code.
● Python is concise and easy to read, and it can be used for everything from web
development to software development and scientific applications.
● Python's standard library is very extensive and contains built-in modules that provide
access to system functionality that would otherwise be inaccessible to Python
programmers, as well as modules written in Python that provide standardized solutions for
many problems that occur in everyday programming (see the short example after this list).
● Thousands of external libraries can be added.
● Python 2.0 was released in 2000 and Python 3.0 was released in 2008. Python 3 was a major
revision of the language that is not completely backward compatible, and much Python 2
code does not run unmodified on Python 3.
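As a small illustration of the standard library mentioned in the list above, the built-in statistics
module covers basic descriptive statistics without any external packages. The numbers below
are arbitrary sample values.

# Descriptive statistics with Python's built-in statistics module (Python 3).
import statistics

monthly_revenue = [12.4, 15.1, 9.8, 11.2, 14.7, 13.3]

print(statistics.mean(monthly_revenue))    # arithmetic mean
print(statistics.median(monthly_revenue))  # median
print(statistics.stdev(monthly_revenue))   # sample standard deviation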

© Copyright. All rights reserved. 239


Unit 8: Conclusion

Orange

Figure 346: Orange

References for Statistics

Figure 347: References for Statistics

References for Data Science

Figure 348: References for Data Science

© Copyright. All rights reserved. 240


Lesson: SAP Data Science Applications

References for Data Science

Figure 349: References for Data Science

References for Data Science and R

Figure 350: References for Data Science and R

References for R

Figure 351: References for R

© Copyright. All rights reserved. 241


Unit 8: Conclusion

References for Python

Figure 352: References for Python

References for Web Resources

Figure 353: References for Web resources

Data Sets

Figure 354: Data Sets

Summary
This unit introduces you to some of the popular tools for statistical analysis and data science,
including the following:

© Copyright. All rights reserved. 242


Lesson: SAP Data Science Applications

● MS Excel, which is commonly used for relatively small data sets. It is very accessible,
provides lots of easy-to-use functionality, and has good graphical features.
● Orange, which is a free, component-based visual programming software package for data
visualization, machine learning, data mining, and data analysis. It is ideal for beginners,
data analysts, and data scientists who like visual programming.
● R and Python, which require you to learn a programming language but provide a huge
amount of functionality and extremely powerful graphical capability. If you are going to
take your interest in data science further, you should consider learning one or both of
these programming languages.

Congratulations! You have now completed DSC100 Fundamentals of Data Science.

LESSON SUMMARY
You should now be able to:
● Explain data science applications

© Copyright. All rights reserved. 243

You might also like