DSC100 Data Science Fundamentals by SAP
PARTICIPANT HANDBOOK
INSTRUCTOR-LED TRAINING
Course Version: 10
Course Duration: 3 Day(s)
e-book Duration: 5 Hours 40 Minutes
Material Number: 50156437
SAP Copyrights, Trademarks and
Disclaimers
No part of this publication may be reproduced or transmitted in any form or for any purpose without the
express permission of SAP SE or an SAP affiliate company.
SAP and other SAP products and services mentioned herein as well as their respective logos are
trademarks or registered trademarks of SAP SE (or an SAP affiliate company) in Germany and other
countries. Please see http://global12.sap.com/corporate-en/legal/copyright/index.epx for additional
trademark information and notices.
Some software products marketed by SAP SE and its distributors contain proprietary software
components of other software vendors.
National product specifications may vary.
These materials may have been machine translated and may contain grammatical errors or
inaccuracies.
These materials are provided by SAP SE or an SAP affiliate company for informational purposes only,
without representation or warranty of any kind, and SAP SE or its affiliated companies shall not be liable
for errors or omissions with respect to the materials. The only warranties for SAP SE or SAP affiliate
company products and services are those that are set forth in the express warranty statements
accompanying such products and services, if any. Nothing herein should be construed as constituting an
additional warranty.
In particular, SAP SE or its affiliated companies have no obligation to pursue any course of business
outlined in this document or any related presentation, or to develop or release any functionality
mentioned therein. This document, or any related presentation, and SAP SE’s or its affiliated companies’
strategy and possible future developments, products, and/or platform directions and functionality are
all subject to change and may be changed by SAP SE or its affiliated companies at any time for any
reason without notice. The information in this document is not a commitment, promise, or legal
obligation to deliver any material, code, or functionality. All forward-looking statements are subject to
various risks and uncertainties that could cause actual results to differ materially from expectations.
Readers are cautioned not to place undue reliance on these forward-looking statements, which speak
only as of their dates, and they should not be relied upon in making purchasing decisions.
Course Overview
TARGET AUDIENCE
This course is intended for the following audiences:
● Data Manager
● Application Consultant
● Development Consultant
● Technology Consultant
● Data Consultant
● Data Scientist
● Developer
Lesson 1
Understanding Data Science
LESSON OBJECTIVES
After completing this lesson, you will be able to:
● Understand the basic principles of data science
For more information on the multidisciplinary aspect of data science, see the following:
● https://www.quora.com/What-is-the-difference-between-a-data-scientist-and-a-machine-learning-engineer
Unreliable or poor-quality data are problematic for data scientists and can waste time, producing an analysis that is meaningless or misleading.
This course introduces you to the basic techniques of data science. However, you must
always remember that data science relies on reliable data.
Intelligent Enterprise
Artificial intelligence (AI) and machine learning (ML) are core enablers of the intelligent technologies that support the SAP Intelligent Enterprise Framework.
Augmented Analytics
Gartner defines augmented analytics in the following way: "Augmented analytics is the use of
enabling technologies such as machine learning and AI to assist with data preparation, insight
generation and insight explanation to augment how people explore and analyze data in
analytics and BI platforms. It also augments the expert and citizen data scientist by
automating many aspects of data science, machine learning, and AI model development,
management and deployment."
SAP Analytics Cloud provides you with augmented analytics capabilities in its Smart Predict module. You will be introduced to these tools when you complete the exercises in
this course.
For more information on augmented analytics, see the following:
● https://www.gartner.com/en/information-technology/glossary/augmented-analytics
Summary
Data science is an interdisciplinary field within the broad areas of mathematics, statistics,
operations research, information science, and computer science. Data science focuses on the
processes and systems that enable the extraction of knowledge or insights from data. ML and
AI are both parts of data science. AI and ML are core enablers of the intelligent technologies
that support the SAP Intelligent Enterprise Framework.
● IBM SPSS (Statistical Package for the Social Sciences) 5 As (Assess, Access, Analyze, Act, Automate)
● KDD (Knowledge Discovery in Databases) process
● SAS SEMMA (Sample, Explore, Modify, Model, Assess)
● Cross Industry Standard Process for Data Mining (CRISP-DM)
In addition to the data science software and algorithms they use, many organizations that
have data science teams develop their own data science methodology so that their data
science processes closely align with their business and decision-making processes.
Ultimately, the methodology must support the effective integration of data science into your
organization.
CRISP-DM
The most popular project methodology in data science is CRISP-DM. CRISP-DM was an
initiative launched in 1996 and led by five companies: SPSS, Teradata, Daimler AG, NCR
Corporation, and OHRA, an insurance company. Over 300 organizations contributed to the
process model.
CRISP-DM aims to provide a data-centric project methodology that is:
● Non-proprietary
● Application and industry neutral
● Tool neutral
● Focused on business issues as well as technical analysis
Polls conducted by KDnuggets in 2002, 2004, 2007, and 2014 show that CRISP-DM was the leading methodology used by industry data miners. The methodology itself is structured as follows:
● The CRISP-DM methodology is a hierarchical process model.
● At the top level, the process is divided into six different generic phases, ranging from
business understanding to the deployment of project results.
● The next level elaborates each of these phases, comprising several generic tasks; at this
level, the description is generic enough to cover all data science scenarios.
● The third level specializes these tasks for specific situations - for example, the generic task
can be cleaning data, and the specialized task can be cleaning of numeric or categorical
values.
● The fourth level is the process, that is, the recording of actions, decisions, and results of an
actual execution of a DM project.
Each of the six generic phases is important; they are best summarized in the following way:
Business understanding
During this phase, you confirm the project objectives and requirements from the
perspective of your business. Define the data science approach that answers specific
business objectives.
Data understanding
During this phase, you commence initial data collection and familiarization. You also
identify data quality problems.
Data preparation
During this phase, you select data tables, records, and attributes. You also undertake any
data transformation and cleaning that is required.
Modeling
During this phase, you select modeling techniques. You also calibrate the model
parameters and begin model building.
Evaluation
During this phase, you confirm that the business objectives have been achieved.
Deployment
During this phase, you deploy models and "productionize" them, if required. You also develop and implement a repeatable process that enables your organization to monitor and maintain each model's performance.
The sequence of the phases is not strict, and movement back and forth between different
phases is always required. The arrows in the process figure, Six Generic Phases of CRISP-DM,
indicate the most important and frequent dependencies between phases.
The outer circle in the figure, Six Generic Phases of CRISP-DM, symbolizes the cyclic nature
of any data science project. The process continues after a solution has been deployed.
The lessons learned during the process can trigger new, often more focused business
questions, and subsequent data science processes that benefit from the experiences of the
previous ones. This is illustrated by the figure, Six Generic Phases of CRISP-DM.
Phase 5: Evaluation
The objective of this phase is to thoroughly evaluate the model and review the model
construction to be certain it properly achieves the business objectives. A key objective is to
determine if there is some important business issue that has not been sufficiently considered.
At the end of this phase, a decision on the use of the data science results should be reached.
Phase 6: Deployment
The objective of this phase is to organize and present the knowledge gained in a way that allows the organization to use it. However, depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.
Summary
This lesson has introduced you to the most popular project methodology for data science,
CRISP-DM. There are six key phases, and each phase includes a number of tasks and outputs.
It is very important for you to follow a project methodology when you are working on a data
science project, so that you understand the order of the phases and each of the tasks you
must consider. Different data science projects have different requirements, which means you
could use CRISP-DM as a template to ensure you have considered all of the different aspects
specific to your project, and modify it, as required.
LESSON SUMMARY
You should now be able to:
● Understand the basic principles of data science
Lesson 1
Understanding the Business Phase
LESSON OBJECTIVES
After completing this lesson, you will be able to:
● Explain the business understanding phase
CRISP-DM Phase
This phase focuses on understanding the project objectives and designing a plan to achieve
the objectives.
Tasks
The first objective is to thoroughly understand, from a business perspective, what the client wants to accomplish. The organization usually has many competing objectives and constraints, which must be properly balanced. The analyst's goal is to uncover, at the beginning of the project, the important factors that can influence its outcome.
Outputs
Outputs are broadly divided into the following categories:
● Background
● Business objectives
● Deployment option
Output Categories
This is an overview of the various categories of output.
Outputs: Background
Record the information that is known about the organization's business situation at the
beginning of the project.
Outputs: Business Objectives
Describe the customer's primary objective from a business perspective.
In addition to the primary business objective, there are generally other related business
questions that the organization would like to address.
For example, the primary business goal for a financial services business might be to keep
current customers by predicting when they are prone to move to a competitor.
Examples of related business questions are as follows: "How does the primary channel a
bank customer uses (for example, ATM, branch visit, internet) affect whether they stay
or go?" or "Will lower ATM fees significantly reduce the number of high-value customers
who leave?"
Outputs: Deployment Option
Agree with the customer as to how they want to deploy the analysis when it is completed - for example, do they want to have the analysis available in a stand-alone app, or embedded in an existing business application?
Outputs: Inventory of Resources
List the resources available to the project:
● Personnel - business experts, data experts, technical support, data science personnel
● Data - fixed extracts, access to live warehoused or operational data
● Computing resources - hardware platforms
● Software - data science tools, other relevant software
Outputs: Requirements, Assumptions, and Constraints
List all requirements of the project, including the schedule for completion, the comprehensibility and quality of the results, and security and legal issues. As part of this output, make sure that you are allowed to use the data.
List the assumptions made by the project:
● These can be assumptions about the data, which can be checked during the analysis process. However, they can also include assumptions about the business, which cannot be checked, and on which the project rests.
● It is particularly important to list these assumptions if they form conditions on the validity of the results.
List the constraints on the project. These may be constraints on the availability of resources,
but may also include technological constraints such as the size of data that it is practical to
use for modeling.
Outputs: Terminology
Compile a glossary of terminology relevant to the project. This can include two components:
● A glossary of relevant business terminology, which forms part of the business understanding available to the project. Constructing this glossary is a useful "knowledge elicitation" and education exercise.
● A glossary of data science terminology, illustrated with examples relevant to the business problem in question.
Outputs: Costs and Benefits
Construct a cost-benefit analysis for the project, which compares the costs of the project
with the potential benefit to the business if it is successful.
The comparison should be as specific as possible, for example using monetary measures
in a commercial situation.
A data science goal might be the following: "Predict how many widgets a customer will
buy, given their purchases over the past three years, demographic information (age,
salary, city, and so on) and the price of the item."
Summary
This lesson introduced you to the details of the tasks required in the Business Understanding phase of the CRISP-DM project methodology and the outputs that each task produces.
Industry Surveys
Industry surveys indicate the standard methods of assessing data science project success.
Two such surveys show the following:
● In both surveys, meeting business goals and model accuracy or performance are the two most important factors.
● In the first survey, 57% of respondents answered the question, "How do you measure success for a predictive analytics project?" with "meeting business goals", and 56% with "model accuracy". Lift is also an important factor; you will learn how to calculate lift in more detail later in this course.
● In the second survey, the Third Annual Data Miner Survey conducted by Rexer Analytics, a CRM consulting firm based in Winchester, Massachusetts, USA, the BI community was asked: "How do you evaluate project success in Data Mining?" Out of 14 different criteria, 58% ranked "Model Performance (Lift, R2, and so on)" as the primary factor.
Algorithm Types
The data science success criteria differ depending on the type of algorithm chosen. There is a wide range of algorithms to choose from, depending on the type of question asked by the business, the output that is required, and the data that are available:
● For association rules, or basket analysis, you can use algorithms that analyze the combinations of products purchased together in a basket or over time. One of the common algorithms used is called Apriori, and you will learn more about it later in the course.
● For clustering, you can use algorithms that group similar observations together. You are introduced to a number of these algorithms later in the course, including the commonly used K-Means algorithm.
● For classification analysis, where you are classifying observations into groups, you can use
decision trees or neural networks.
● You can use outlier analysis to identify which observations have unusually high or low
values and to identify anomalies.
● Regression algorithms enable you to forecast the values of continuous type variables, such
as customer spend in the next 12 months.
● Time-series analysis enables you to forecast future KPI values, and to control stock and inventory levels.
Each of these algorithms is explored in more detail during this course.
Each of these broad categories of algorithm can answer different types of business question.
Consider the following points in light of this fact and the preceding figure:
● For classification, you can answer the who and when type questions, such as: which customers will buy a product, and when are they most likely to make the purchase? You can also answer questions such as: which machine will fail, and when will it need preventative maintenance? Is that transaction fraudulent?
● For regression, you can answer the what type questions: What will be the spend of each customer in the next 12 months? How many customers will churn next year?
● For clustering and segmentation you are grouping together similar observations. This
enables you to communicate to customers with similar needs and requirements who are
grouped together in a cluster, or develop specific products or services for customers in a
segment.
● Forecasting allows you to estimate a KPI on a regular time interval. So for example, you
can forecast revenue per month for the next 12 months, accounting for trends,
seasonalities and other external factors.
● Link analysis is used mainly in telecommunications to create communities of customers
who are calling one another, or in retail analysis to analyze the links between customers
and the products they have purchased to support product recommendations.
● And association rules and recommendations are used for basket analysis and also to
produce product recommendations for customers.
When a predictive model has been built on the estimation sub-sample, its performance is tested on the validation and test sub-samples. You expect to find the following:
● You expect that the model will have similar performance on the estimation, validation and
test sub-sets.
● The more similar the performance is of the model on the sub-sets, the more robust the
model is overall.
However, an even more rigorous test is to check how well the model performs on totally new data that was not used in the model training.
For example, if the model is to be used in a marketing campaign to identify which customers
are most likely to respond to a discount offer, often the model's performance is also tested to
analyze how well it would have performed on historical campaign data.
Frequently, a model is also tested on a new campaign to see how well it performs in a real
environment. Appropriate control groups are defined, so the response to the modeled group
can be compared to the response using other methods.
There are extensions to and variations on the train-and-test theme. For example, a random
splitting of a sample into training and test sub-sets could be fortuitous, especially when
working with small data sets, so you could conduct statistical experiments by executing a
number of random splits and averaging performance indices from the resulting test sets.
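As a minimal sketch of this repeated-splitting idea (not taken from the course materials), the following Python code uses hypothetical column names and assumes pandas and scikit-learn are available; it averages accuracy over several random train/test splits:

```python
# Minimal sketch of repeated random train/test splits (hypothetical data set).
# Column names and the input file are illustrative only.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("customers.csv")          # hypothetical file
X = df[["age", "income", "tenure_days"]]   # hypothetical explanatory variables
y = df["responded"]                        # hypothetical binary target (1 = responded)

scores = []
for seed in range(10):                     # 10 different random splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    model = DecisionTreeClassifier(max_depth=4, random_state=seed)
    model.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, model.predict(X_test)))

print("mean accuracy over 10 splits:", sum(scores) / len(scores))
```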
Summary
This lesson highlights the importance of the following:
● At an early stage in a project, it is important to clearly define business and data science
project success criteria.
● The data science success criteria differ depending on whether the models are predictive or descriptive models, and on the type of algorithm chosen.
● The business question that you are analyzing in your data science project helps to
determine the most likely algorithms to use.
● There is a wide range of algorithms to choose from, depending on the type of question
asked by the business, the output that is required and the data that are available.
● The accuracy and robustness of the model are two major factors to determine the quality
of the prediction, which reflects the success of the model.
● A train-and-test regime is central to developing predictive models and assessing if they are
successful.
Circular Economy
In a traditional linear economy, we take resources, make products, and dispose of them when we are finished (take > make > use > dispose).
A circular economy (CE) is an alternative to a linear economy. It aims to close the loop, so that
waste is reused, re-purposed, or recycled in a way that retains as much value as possible.
Basically, resources are kept in use for as long as possible, in order to extract the maximum
value from them, and later they are recovered and regenerated at the end of service life.
CE is a major topic for industry as modern consumers insist that organizations consider how
to achieve sustainability, and develop strategies for narrowing, slowing and closing material
and energy flows as a means for addressing structural waste.
In addition to creating new opportunities for growth, a more circular economy offers a range of further benefits.
Overview
A circular economy seeks to rebuild capital, whether this is financial, manufactured, human,
social or natural. This ensures enhanced flows of goods and services. The following figure
illustrates the continuous flow of technical and biological materials through the “value circle”
of a circular economy.
The following figure shows you what a CE attempts to achieve. It also outlines what is referred
to as the value circle in such a system.
● Automate the assessment of the condition of used products and recommend if they can
be reused, resold, repaired or recycled to maximize value preservation.
● Automate the disassembly of used products by using visual recognition to assess and adjust the disassembly equipment settings based on the condition of the product and its position on the disassembly line.
● Sort mixed material streams using visual recognition techniques and robotics.
Data Validation
Consider the following aspects of data validation:
● In CRISP-DM, there is no validation between the data preparation phase and the modeling
phase against the specific business domain. A complete understanding of whether the
data which is prepared is a valid representation of the original problem is not guaranteed.
● As such, this can result in sub-optimal solutions that miss the mark on the intended
capturing of business value.
● Therefore, data validation must be done by the re-involvement of domain experts to
validate that a proper understanding of the data and business problem has been reached,
and include data preparation methods tailored for the given analytic profile.
● The data validation phase can result in a re-iteration of the data understanding and/or the
data preparation phase(s), as indicated by a single arrow back in the figure, How to
Incorporate CE into Data Science Initiatives.
Analytic Profile
An analytic profile is an abstract collection of knowledge, mainly used in the business and data understanding phases, that lists the best practices for a particular analytics use case, or problem. The profile must include the following:
● Use case description defining the business goal
● Domain specific insights important for the use case
● Data sources relevant for the use case
● KPIs or metrics for assessing the analytics implementation performance
● Analytics models and tools with proven conformity for the given problem
● Short descriptions of previous implementations with lessons learned
Summary
This lesson covered the following:
● A number of the concepts for a CE and an explanation of how data science can support the
delivery of CE strategies
● The data science and AI systems that can be used to deliver many of the essential CE
concepts, from predictive analytics - such as setting the optimal service and repair
schedule for durable equipment, to dynamic pricing and matching for the effective
functioning of digital marketplaces for secondhand goods and by-product material
streams.
● A look at future trends, which show that data science and AI could be integral to the redesign of whole systems, creating a circular society that works in accordance with these principles over the long term.
LESSON SUMMARY
You should now be able to:
● Explain the business understanding phase
Lesson 1
Understanding the Data Phase
LESSON OBJECTIVES
After completing this lesson, you will be able to:
● Explain the data understanding phase
LESSON SUMMARY
You should now be able to:
● Explain the data understanding phase
Lesson 1
Understanding Data Preparation
UNIT OBJECTIVES
● Prepare data
LESSON OBJECTIVES
After completing this lesson, you will be able to:
● Prepare data
● Data preparation is also referred to as data wrangling, data munging, and data janitor work.
● It covers everything from list verification to removing commas and debugging databases. Messy data is by far the most time-consuming aspect of the typical data scientist's work flow.
● An article in The New York Times reported that data scientists spend from 50% to 80% of their time mired in the more mundane task of collecting and preparing unruly digital data before it can be explored for useful nuggets.
● Surveys show that 3 out of every 5 data scientists spend most of their working day cleaning and organizing data, while only 9% spend most of their time mining the data and building models.
Summary
This lesson covers the third phase of the CRISP-DM process, that is, data preparation. The
five tasks in this phase are as follows:
● Select data
● Clean data
● Construct data
● Integrate data
● Format data
The important outputs from this phase are the analytical data set you use later for data
analysis and data description.
Predictive Modeling
Overview
To prepare data correctly in the Data Preparation phase of CRISP-DM, you must have a knowledge of predictive modeling and of the data formatting that it requires. Predictive modeling is covered in more depth in the following sections of the course, but this is a basic introduction. Predictive modeling covers the following:
Analytics can be applied to a wide range of problems - for example, identifying suspects after a crime has been committed, or detecting credit card fraud as it occurs.
Descriptive analytics uses data aggregation and data visualization to provide insight into the past and answers the question: What has happened?
Descriptive statistics are useful for company reports giving total stock in inventory, average dollars spent per customer, and year-over-year change in sales.
Predictive analytics uses statistical models and forecasting techniques to understand the future and answers the question: What could happen? Predictive analytics is used for the following tasks:
● Predictive analytics combines the historical data found in Enterprise Resource Planning
(ERP), Customer Relationship Management (CRM), Human Resources (HR), and Point-of-
Sale (POS) systems to identify patterns in the data and apply statistical models and
algorithms to capture relationships between various data sets. Companies use predictive
analytics any time they want to look into the future.
● Prescriptive analytics uses optimization and simulation algorithms to advise on possible outcomes and answers the question: What should we do?
● Prescriptive analytics predicts not only what can happen in the future, but also why it
happens by providing recommendations regarding actions that take advantage of these
predictions. Prescriptive analytics utilizes a combination of techniques and tools, such as
business rules, algorithms, optimization, machine learning, and mathematical modeling
processes.
The first phase is the model build, or training, phase, which is detailed in the following manner:
● Predictive models are built or "trained" on historic data with a known outcome.
● The input variables are called "explanatory" or "independent" variables.
● For model building, the "target" or "dependent" variable is known. It can be coded, so if the
model is to predict the probability of response for a marketing campaign, the responders
can be coded as "1"s and the non-responders as "0"s, or, for example, as "yes" and "no."
● The model is trained to differentiate between the characteristics of the customers who are
1s and 0s.
The second phase is the model apply phase, which is detailed in the following manner:
● Once the model has been built, it is applied onto new, more recent data, which has an
unknown outcome (because the outcome is in the future).
● The model calculates the score or probability of the target category occurring; in our
example, the probability of a customer responding to the marketing campaign.
The training model in this figure is best understood in the following way:
● This example represents the model training of a "churn" model. You are trying to predict if
a customer is going to switch to another supplier.
● You train the model (in the training phase) using historical data where you know if
customers churned or not - you have a known target.
● The target variable flags churners (yes) and non-churners (no). This type of model, with a
binary target, is called a "classification" model.
● This is a simple representation where you only have two explanatory characteristics - age
and city. In a real predictive model you might have hundreds or even thousands of these
characteristics.
● The predictive model identifies the difference in the characteristics of a churner and a non-
churner. This can be represented as a mathematical equation, or in a scorecard format.
The algorithm calculates the Weight values in the predictive model equation that give the
most accurate estimate of the target.
● You can use a reference date to split the historical data date period and the target date
period.
● In this example, the target data time frame (April) occurs after the historical data time
frame (January to March). The model is trained to identify patterns in the data in the past
to predict the target in the subsequent, or later, months.
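The following is a minimal sketch, not the course's own exercise, of the training and apply phases described above: a tiny, hypothetical churn data set with only age and city as explanatory variables is used to train a classifier, which is then applied to new customers to score their churn probability (pandas and scikit-learn assumed):

```python
# Minimal sketch of a churn classification model: train on history, apply to new data.
# DataFrames, values, and column names are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Training data (historical months) with a known yes/no churn target.
train = pd.DataFrame({
    "age":   [25, 43, 37, 58, 31, 49],
    "city":  ["Berlin", "Munich", "Berlin", "Hamburg", "Munich", "Berlin"],
    "churn": ["yes", "no", "no", "yes", "no", "yes"],
})

# Encode the explanatory variables and the target numerically.
X_train = pd.get_dummies(train[["age", "city"]], columns=["city"])
y_train = (train["churn"] == "yes").astype(int)

model = LogisticRegression().fit(X_train, y_train)

# Apply data (most recent months) with an unknown outcome; the model scores churn probability.
apply_df = pd.DataFrame({"age": [29, 52], "city": ["Berlin", "Hamburg"]})
X_apply = pd.get_dummies(apply_df, columns=["city"]).reindex(columns=X_train.columns, fill_value=0)
print(model.predict_proba(X_apply)[:, 1])   # probability of churn for each customer
```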
Consider the following aspects of the applying phase as you examine the example in the
figure:
● In this example, you are applying the model on the latest 3 months of data: April to June.
The model calculates the probability of churn in the future: for July.
● Every time you use the model, the apply data has to be updated to the most recent time
frame.
● The same data set is required at different points of time - for example, the 1st of every
month, so that models can be applied to generate updated scores on a recurring basis.
Depending on the business requirements, models need to be applied each month, week,
day, minute, or second.
● When the data set time frame changes, the reference date changes. Therefore, any derived variables, such as a customer's age, need to be updated relative to the new reference date. For example, you would calculate the age as the number of days between the reference date and the customer's date of birth.
● Alternatively, you can calculate each person's tenure as a customer of the business. This is
the days difference between the reference date and the date the customer made their first
purchase or joined a loyalty scheme.
● In addition, any transactional data in the previous months needs to be updated relative to
the moving reference date.
● For example, you might want to calculate the number of transactions in the month prior to
our reference date, and if the reference date moves forward then the number of
transactions needs to be recalculated for the new month that has become the prior month
to the reference date.
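A minimal sketch of recalculating such derived variables relative to a moving reference date is shown below; the table names, columns, and dates are hypothetical, and pandas is assumed to be available:

```python
# Minimal sketch: recompute derived variables relative to a moving reference date.
# Column names, dates, and values are hypothetical.
import pandas as pd

reference_date = pd.Timestamp("2024-07-01")

customers = pd.DataFrame({
    "customer_id":    [1, 2],
    "date_of_birth":  pd.to_datetime(["1985-03-12", "1992-11-05"]),
    "first_purchase": pd.to_datetime(["2019-06-01", "2022-01-15"]),
})

transactions = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "tx_date": pd.to_datetime(["2024-06-03", "2024-06-21", "2024-05-30"]),
})

# Age and tenure in days, relative to the reference date.
customers["age_days"] = (reference_date - customers["date_of_birth"]).dt.days
customers["tenure_days"] = (reference_date - customers["first_purchase"]).dt.days

# Number of transactions in the month prior to the reference date.
prior_month = transactions[
    (transactions["tx_date"] >= reference_date - pd.DateOffset(months=1))
    & (transactions["tx_date"] < reference_date)
]
counts = prior_month.groupby("customer_id").size().reset_index(name="tx_prior_month")
customers = customers.merge(counts, on="customer_id", how="left")
customers["tx_prior_month"] = customers["tx_prior_month"].fillna(0)
print(customers)
```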
Latency Period
We must now consider the latency period.
The duration of each of these periods depends on the use case and the business - it can be days, weeks, months, and so on. Examine the following figure to explore this in further detail:
Model Fitting
The following figure represents model fitting:
If the model performs equally poorly on both data sets, with an equivalent large error, the model is robust but under-fitted. When you are building a model, you are aiming for the best compromise between under- and over-fitting, as shown in the figure, Model Fitting. In this instance, you have low training and test errors, and these errors are equivalent, so the model is robust.
Hold-Out Sample
The model build phase is split into different samples.
● The hold-out is a sample of observations withheld from the model training. We can use it to
test the model's ability to make accurate predictions based on its ability to predict the
outcomes of the data in the hold-out sample, and to confirm the model's robustness by
comparing the distribution of the predictions for the training versus hold-out samples.
● Data is partitioned and split into a training sub-set to train the models, and a validation
sub-set, which is the hold-out, to test the model's performance.
● For classification and regression models, the data could be split with 75% of the data being
randomly partitioned into the training sub-set and 25% randomly partitioned into the
validation sub-set for example.
● Note that for time-series models, the historical data is split sequentially, not randomly. For example, the first 75% could go into the training sub-set and the final 25% into the validation sub-set. This sequential split is necessary because of the inherent continuous time-based nature of a time-series model.
● You learn more about data partitioning later in this training.
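As a minimal sketch of the two partitioning strategies just described (hypothetical data; pandas and scikit-learn assumed), the following code performs a random 75/25 split for a classification data set and a sequential 75/25 split for a time series:

```python
# Minimal sketch: random hold-out split vs. sequential split for time series.
# Data and column names are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(100), "y": [i % 2 for i in range(100)]})

# Classification/regression: random 75% / 25% partition.
train, validation = train_test_split(df, test_size=0.25, random_state=42)

# Time series: sequential split - the first 75% trains, the final 25% validates.
ts = pd.DataFrame({"month": pd.date_range("2016-01-01", periods=100, freq="MS"),
                   "kpi": range(100)})
cut = int(len(ts) * 0.75)
ts_train, ts_validation = ts.iloc[:cut], ts.iloc[cut:]
print(len(train), len(validation), len(ts_train), len(ts_validation))
```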
Summary
This lesson covered the elements of predictive modeling that are relevant to the data preparation phase of CRISP-DM: the training and apply phases, reference dates, model fitting, and hold-out samples.
Data Manipulation
Overview
This is an introduction to data manipulation.
Data manipulation is part of CRISP-DM Data Preparation, Phase 3. It involves the following three tasks:
● Phase 3.3, Constructing Data: This task includes constructive data preparation operations such as the production of derived attributes, entire new records, or transformed values for existing attributes.
● Phase 3.4, Integrating Data: These are methods whereby information is combined from multiple tables or records to create new records or values.
● Phase 3.5, Formatting Data: Formatting transformations refer to the primarily syntactic modifications made to the data that do not change its meaning, but might be required by the modeling tool.
Defining an Entity
We must first identify an entity.
The following examples illustrate how an entity can be defined:
● A predictive model that is designed to predict if a customer of a utility company is going to
respond to an up-sell offer. The customer has a CustomerID as the entity.
● A churn model, designed to predict if a postpaid telecom customer will extend their
subscription when their 12 month contract expires, or if they will switch to a competitor.
They can be assigned the CustomerID or AccountID as entities, depending on the
appropriate level of analysis.
This figure provides you with a 360-degree view of each entity, collecting all of the static and
dynamic data together, which can be used to define the entity.
● The "static" data does not change very much over time, such as gender, address, work,
class, and so on.
● The "dynamic" data does change frequently, such as a customer's age, every time a new
data time-frame is chosen for analysis.
● These data sets are the explanatory variables in the analysis.
● These steps require data manipulation, that is, merging tables, aggregating data, creating
new data transformations, and derived variables, and so on.
The Analytical Record is an overall view of the entity; if the entity is a customer, it is sometimes called a "Customer Analytic Record," a "360-degree view of the customer," or even "Customer DNA." This view characterizes entities by a large number of attributes (the more, the better), which can be extracted from the database or even computed from events that occurred for each of them.
The list of all attributes corresponds to what is called an "Analytical Record" of the entity
"disposition." This analytical record can be decomposed into a number of domains, such as
the following:
● Demographic
● Geo-demographic
● Complaints history
● Contacts history
● Products history
● Loan history
● Purchase history (coming from the transaction events)
● Segments
● Model scores
Feature Engineering
Features are attributes shared by all entities.
Handling Outliers
An outlier is an observation that lies an abnormal distance from other values in a random
sample of the data.
Merging Tables
Another important data manipulation process is the merging of input data tables.
● When you are preparing the data, it is contained in multiple tables that need to be
assembled in the format required by the machine-learning algorithm you are using.
● The previous figure shows you an example of merging tables that utilizes what is called a "left outer join."
● The A_NUMBER_FACT table gives the unique line number associated to each account.
● This is merged with the CUSTOMER_ID_LOOKUP table so that the CUSTOMER_ID is
associated with the unique line number. The merge key is A_NUMBER in table 1 to
A_NUMBER in table 2.
● The CUSTOMER table can be merged: CUSTOMER_ID in table 2 to CUSTOMER_ID in table
3.
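The following is a minimal pandas sketch of the left outer joins described above; only the key column names come from the example, and the remaining columns and values are hypothetical:

```python
# Minimal sketch of the left outer joins described above, using pandas.
# Table contents are hypothetical; only the key columns come from the example.
import pandas as pd

a_number_fact = pd.DataFrame({"A_NUMBER": ["0711-1", "0711-2"], "CALL_MINUTES": [120, 45]})
customer_id_lookup = pd.DataFrame({"A_NUMBER": ["0711-1", "0711-2"], "CUSTOMER_ID": [101, 102]})
customer = pd.DataFrame({"CUSTOMER_ID": [101, 102], "CITY": ["Berlin", "Munich"]})

# Merge key: A_NUMBER in table 1 to A_NUMBER in table 2 (left outer join).
merged = a_number_fact.merge(customer_id_lookup, on="A_NUMBER", how="left")

# Then CUSTOMER_ID in table 2 to CUSTOMER_ID in table 3.
merged = merged.merge(customer, on="CUSTOMER_ID", how="left")
print(merged)
```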
Aggregating Data
Data aggregation is an important facet of data processing.
Aggregation Functions
Consider the range of aggregation functions available to you.
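As a minimal sketch (hypothetical transaction data, pandas assumed), the following code applies several common aggregation functions per customer:

```python
# Minimal sketch of common aggregation functions applied per customer (hypothetical data).
import pandas as pd

tx = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "amount": [20.0, 35.5, 12.0, 80.0, 15.0],
})

summary = tx.groupby("customer_id")["amount"].agg(["count", "sum", "mean", "min", "max"])
print(summary)
```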
Summary
This lesson covers important basic elements of data manipulation and Feature Engineering
techniques, which are outlined in the following way:
● In Feature Engineering, new features are created to extract more information from existing
features.
● You also learn about "merging" and "aggregating" data.
● Data aggregation is the process where raw data is gathered and expressed in a summary form for statistical analysis. For example, raw data can be aggregated over a given time period to provide statistics such as average, minimum, maximum, sum, and count.
Quantitative or Qualitative
The following figure demonstrates the differences between quantitative and qualitative data:
Scales of Measurement
Scales of measurement are important for the definition and categorization of variables or numbers.
Nominal Scale
The following figure outlines the issue of nominal scale:
Ordinal Scale
The following figure outlines the issue of ordinal scale:
● With ordinal variables, the order matters, but not the difference between the values. For
example, patients are asked to express the amount of pain they are feeling on a scale of 1
to 10. A score of 7 means more pain than a score of 5, and that is more than a score of 3.
However, the difference between the 7 and 5 might not be the same as the difference
between the 5 and 3.
● The values simply express an order.
Interval Scale
● Numerical data, as its name suggests, involves features that are only composed of
numbers, such as integers or floating-point values.
● You can establish the numerical interval difference between two items, but you can not
calculate how many times one item is more or less in value than the other item.
● The arbitrary starting point can be confusing at first. For example, year and temperature
do not have a natural zero value. The year 0 is arbitrary and it is not sensible to say that the
year 2000 is twice as old as the year 1000. Similarly, zero degrees Centigrade does not
represent the complete absence of temperature (the absence of any molecular kinetic
energy). In reality, the label "zero" is applied to its temperature for quite accidental
reasons connected to the freezing point of water, and so it does not make sense to say that
20 degrees Centigrade is twice as hot as 10 degrees Centigrade. However, zero on the
Kelvin scale is absolute zero.
● Since an interval scale has no true zero point, it does not make sense to compute ratios.
Ratio Scale
The following figure outlines the key point about data in relation to the ratio scale:
Summary
This lesson covers the important data types and scales of measurement: nominal, ordinal, interval, and ratio.
Encoding Data
Overview of Data Encoding
Data encoding is a set of processes that prepares the data you are using and transforms it
into a "mineable" source of information. It is an essential part of the data preparation process.
Consider the following three issues in relation to variables, encoding strategies, and explanatory variables:
● Essentially, there are three types of variable: nominal, ordinal, and continuous (numeric).
● Different encoding strategies are deployed depending on the variable type.
● Encoding each explanatory variable can take a large amount of time, but this step must not be skipped.
This encoding type is necessary for some algorithms to function correctly, for example, linear
regression and some other regression types.
Therefore, we can state the following about grouping, or binning, numeric values in these models:
● Model performance:
- Captures non-linear behavior of continuous variables
- Minimizes the impact of outliers
- Removes "noise" from large numbers of distinct values
● Model explainability:
- Grouped values are easier to display and understand
● Model build speed:
- Predictive algorithms build faster as the number of distinct values decreases
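A minimal sketch of binning a continuous variable, using hypothetical values and assuming pandas is available, is shown below; it illustrates both fixed-width and equal-frequency (quantile) bins:

```python
# Minimal sketch of binning (discretizing) a continuous variable; data is hypothetical.
import pandas as pd

age = pd.Series([18, 23, 35, 41, 52, 67, 74])

# Fixed bins with explicit edges...
age_band = pd.cut(age, bins=[0, 30, 50, 70, 120],
                  labels=["<=30", "31-50", "51-70", "70+"])

# ...or equal-frequency bins (quartiles), which also limit the influence of outliers.
age_quartile = pd.qcut(age, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
print(pd.DataFrame({"age": age, "band": age_band, "quartile": age_quartile}))
```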
Summary
This lesson covers the following:
● Some ML algorithms can work with categorical data directly. Remember, other ML algorithms cannot work with this data.
● ML algorithms that cannot operate on categorical data directly require all input variables and output variables to be numeric.
● The three common encoding strategies for categorical variables are ordinal encoding, one-hot encoding, and dummy encoding.
● Continuous numerical-type variables can be binned, or discretized.
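The following is a minimal sketch of the three encoding strategies listed above, using a hypothetical categorical variable and an assumed category ordering; pandas is assumed to be available:

```python
# Minimal sketch of the three encoding strategies for a categorical variable.
# Data and the category ordering are hypothetical.
import pandas as pd

df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Ordinal encoding: map categories to ordered integers.
order = {"small": 0, "medium": 1, "large": 2}
df["size_ordinal"] = df["size"].map(order)

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["size"], prefix="size")

# Dummy encoding: drop one category to avoid redundancy (k-1 columns).
dummy = pd.get_dummies(df["size"], prefix="size", drop_first=True)

print(df.join(one_hot), dummy, sep="\n\n")
```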
Data Selection
Overview
Consider the following in relation to feature selection:
Forward Selection
Forward selection is the opposite of backward selection.
Stepwise Regression
Stepwise regression combines backward and forward selection in the following manner:
The widespread but incorrect usage of this process, the availability of more modern approaches (discussed next), and the option of using expert judgment to identify relevant variables have led to calls to avoid stepwise model selection entirely.
Filter
Filter feature selection is an approach to variable selection.
Some examples of filter methods include the Chi squared test, information gain and
correlation coefficient scores.
Wrapper
Wrapper feature selection is an approach to variable selection.
The two main disadvantages of the wrapper methods are the following:
● The increased risk of over-fitting when the number of observations is insufficient.
● The significant computation time when the number of variables is large.
An example of the wrapper method is the recursive feature elimination algorithm.
Embedded
Embedded feature selection is an approach to variable selection.
Examples of regularization algorithms are the LASSO (least absolute shrinkage and selection
operator) and Ridge regression.
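As a minimal sketch of embedded selection through regularization (synthetic data; scikit-learn and NumPy assumed), the following code fits a LASSO model and inspects which coefficients are shrunk toward zero:

```python
# Minimal sketch of embedded feature selection with LASSO regularization.
# Synthetic data; only two of the five candidate features actually matter.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                                         # five candidate features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)   # only two matter

X_scaled = StandardScaler().fit_transform(X)   # scaling matters for regularization
model = Lasso(alpha=0.1).fit(X_scaled, y)

# Coefficients shrunk to (near) zero indicate features the model effectively discards.
print(model.coef_)
```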
Summary
This lesson covers the following in the area of data selection:
● Variable, or "feature," selection, including forward, backward, stepwise, filter, wrapper, and embedded methods.
LESSON SUMMARY
You should now be able to:
● Prepare data
Lesson 1
Understanding the Parts of the Modeling Phase
LESSON OBJECTIVES
After completing this lesson, you will be able to:
● Understand the modeling phase
Summary
This lesson covers the following tasks:
● Select modeling technique
● Generate test design
● Build model
● Assess model
Anomaly Detection
Anomaly Detection Overview
An anomaly is something that deviates from what is typical or normal in the data. In statistics, an outlier is an observation that is numerically distant from the rest of the data.
We can see anomalies and outliers in the following context:
● Outliers can occur because of errors and might need to be removed from the data set or
corrected.
● They can occur naturally and, therefore, must be treated carefully. The outlier can be the
most interesting thing in the data set.
● Some statistics or algorithms can be heavily biased by outliers - for example, the simple
mean, correlation, and linear regression. In contrast the trimmed mean and median are not
so significantly affected.
● Outliers can be detected visually - for example, using Scatter Plots and Box Plots.
Influence of Outliers
The following two graphs show the same data, but the graph on the right has an outlier added to it.
● To minimize the square of the errors, the regression model tries to keep the line closer to
the data point at the right extreme of the plot, and this gives this outlier data point a
disproportionate influence on the slope of the line.
● If this outlier value is removed, the model is totally different, as you can see in the graph on
the left, and which you can judge by comparing the equation of the line in the two graphs.
Anomaly Types
There are different types of anomalies, which include point and contextual anomalies.
Contextual anomalies
Another type of anomaly is called a contextual anomaly.
● Quartiles are referred to as measures of position because they give the relative ranked
position of a specific value in a distribution.
● To create quartiles, you simply order the data by value and split it into 4 equal parts.
● The second quartile, Q2, is the median of the data.
Quartiles: Example
The following figure shows you an example of quartiles, which show the following patterns:
● This data shows you the number of volunteer hours performed by 15 students in a single
year.
● These values are ranked, with the lowest value on the left and the highest value on the
right.
Detecting Outliers
We can detect outliers using a box plot, as illustrated in the following figure.
In the figure, note the way the box plot describes the data distribution.
In the example from the figure, IQR = Q3 - Q1. Take note of the following:
● In the example, half of the values lie between 24 and 46. Therefore, the IQR is (46 - 24) = 22.
● This also refers to the difference, or distance, between the bottom 25% and the upper 25% of the ranked data.
● The values that lie outside the outer fences are called extreme, or highly suspect, outlier values and are denoted by a symbol (in this example, at 128).
We must also distinguish between suspect outliers, which lie between the inner and outer fences, and highly suspect outliers, which lie beyond the outer fences.
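A minimal sketch of this fence-based outlier check, using hypothetical volunteer-hours values rather than the exact figures from the figure, and assuming NumPy is available, is shown below; note that NumPy's default percentile interpolation may differ slightly from the textbook quartile method:

```python
# Minimal sketch of detecting outliers with quartiles and IQR fences (hypothetical data).
import numpy as np

hours = np.array([10, 18, 22, 24, 26, 30, 33, 36, 40, 44, 46, 50, 55, 60, 128])

q1, q3 = np.percentile(hours, [25, 75])
iqr = q3 - q1

inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # inner fences
outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr   # outer fences

suspect = hours[((hours < inner_low) & (hours >= outer_low)) |
                ((hours > inner_high) & (hours <= outer_high))]
highly_suspect = hours[(hours < outer_low) | (hours > outer_high)]
print("IQR:", iqr, "suspect:", suspect, "highly suspect:", highly_suspect)
```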
Anomaly Detection
The following figure contains an example of anomaly detection using the SAP HANA Predictive Analysis Library (PAL).
You learn more about k-means later in this course. However, take note of the following in
relation to this example:
● This algorithm can be used for detecting outliers.
● It detects patterns in a given data set that do not conform to the norm.
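The following sketch illustrates the same idea generically; it does not use the SAP HANA PAL API, but instead applies scikit-learn's k-means to synthetic data and flags points that lie unusually far from their cluster centroid:

```python
# Minimal sketch of the idea only (not the SAP HANA PAL API): flag points that lie
# unusually far from their cluster centroid after k-means clustering. Data is synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
               rng.normal(loc=8.0, scale=1.0, size=(100, 2)),
               np.array([[20.0, 20.0]])])          # one obvious anomaly

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
distances = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

threshold = distances.mean() + 3 * distances.std()  # simple distance-based cut-off
print("anomalies:", X[distances > threshold])
```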
Summary
This lesson covered anomaly and outlier detection: the influence of outliers, quartiles and box plots, and the use of clustering algorithms, such as k-means, to detect anomalies.
Association Analysis
Association Analysis Overview
Association analysis is a key part of analyzing the day-to-day data that businesses collect.
Association analysis enables you to identify items that have an affinity for each other. It is
frequently used to analyze transactional data, which is called market baskets, to identify
items that often appear together in transactions.
We use association analysis in our everyday lives.
The following are application examples of association analysis:
- If customer identification takes place, through a link such as a loyalty scheme, the
purchases over time and the sequence of product purchases can be analyzed
- Identification of fraudulent medical insurance claims - consider cases where common
rules are broken
- Differential analysis, which compares the results between different stores, between
customers in different demographic groups, between different days of the week,
different seasons of the year, and so on
Support
Support is the fraction of transactions that contain a given item or item set. The following figure can support your understanding of this concept:
Confidence
Confidence is an important facet of association analysis.
Example: Confidence
The following figure contains a working example of confidence.
Lift
Now that you have examined Support and Confidence, you must explore Lift.
Lift is the number of transactions containing the antecedent and consequent, divided by the
number of transactions with the antecedent only, all of which are divided by the fraction of
transactions containing the consequent.
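As a minimal sketch of these three measures (hypothetical baskets, plain Python), the following code computes support, confidence, and lift for a single rule X => Y:

```python
# Minimal sketch: support, confidence, and lift for a rule X => Y (hypothetical baskets).
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
n = len(transactions)
x, y = "bread", "butter"    # antecedent and consequent

support_xy = sum(1 for t in transactions if x in t and y in t) / n
support_x = sum(1 for t in transactions if x in t) / n
support_y = sum(1 for t in transactions if y in t) / n

confidence = support_xy / support_x
lift = confidence / support_y
print(f"support={support_xy:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```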
Example: Lift
The following is a working example of Lift:
How can a retailer use the lift of a rule? There are several possible recommendations for a rule X => Y, where X and Y are two separate products and the rule has high support, high confidence, and a high positive lift (greater than 1).
Summary
This lesson covers the strengths and weaknesses of association analysis, which are outlined
in the following manner:
Cluster Analysis
Cluster Analysis Overview
Cluster analysis revolves around the division of data points into groups. The following figure
gives you an overview of cluster analysis:
● Cluster analysis allows you to group a set of objects in such a way that objects in the same
group (called a cluster) are more similar (homogeneous in some sense or another) to each
other, but are very dissimilar to objects not belonging to that group (heterogeneous).
● Grouping similar customers and products is a fundamental marketing activity. It is used,
prominently, in market segmentation because companies can not connect with all their
customers and, consequently, they have to divide markets into groups of consumers,
customers, or clients (called segments) with similar needs and wants.
● Organizations can target each of these segments with specific offers that are relevant and
have the tone and content that is most likely to appeal to the customers within the
segment.
● Cluster analysis has many uses: it can be used to identify consumer segments, or
competitive sets of products, or for geo-demographic or behavioral groupings of
customers, and so on.
● Clustering techniques fall into a group of "undirected" (or "unsupervised") data science
tools, where there is no pre-existing, labeled "target" variable. The goal of undirected
analysis is to discover structure in the data as a whole. There is no target variable being
predicted.
There are a wide variety of applications of cluster analysis, which are as follows:
● Analysis to find groups of similar customers
● Segmenting the market and determining target markets
● Product positioning
● Selecting test markets
Similarity
Similarity is outlined in the following terms:
Features
Features are characterized in the following figure:
Clustering
Clustering is characterized in the following figure:
Consider the following in relation to this example and use of cluster analysis:
● You can use cluster analysis to find the top 10% of customers based on their spending, or
the top 20% of products based on their contribution to overall profit.
● The data is first sorted in descending numeric order and then partitioned into the first A%,
the second B% and the final C%.
● The A cluster can be considered the most important, or gold segment, while the B cluster
can be considered the next most important, or the silver segment. The C cluster is the
least important or, bronze segment.
● Here is an example, where A=25%, B=30% and C=45%.
The second phase of understanding this sample assignment is as follows: you must assign
each observation to the cluster with the closest centroid point (this can be measured using
the Euclidean distance).
The third phase of understanding this sample assignment is an update step and unfolds as
follows: you must calculate the new means to be the centroids of the observations in the new
clusters.
You must repeat this process. By doing this, you discover the following:
● K-Means clustering works by constantly trying to find a centroid with closely held
observation data points.
● The algorithm has "converged" when the assignments of the data points no longer change.
This indicates that every observation has been assigned to the cluster with its closest
centroid.
● The standard algorithm aims at minimizing what is called the within-cluster sum of
squares (WCSS).
● The WCSS is the sum of the squares of the distances of each data point in all clusters to
their respective centroids.
● This is exactly equivalent to assigning observations by using the smallest Euclidean
distance.
● Therefore, all of the observations in a cluster must be as homogeneous, or as similar, as
possible, and the clusters are thus heterogeneous or dissimilar.
Choosing the distance measure is very important. The distance between two clusters can be computed based on the length of the straight line drawn from one cluster to another. This is commonly referred to as the Euclidean distance. However, many other distance metrics have also been developed.
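A minimal sketch of k-means and its WCSS objective, using synthetic data and assuming scikit-learn and NumPy are available, is shown below; comparing the WCSS for several values of k is also one common heuristic for choosing the number of clusters:

```python
# Minimal sketch: k-means with Euclidean distance, reporting the WCSS (inertia) for
# several values of k. Data is synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, scale=0.7, size=(50, 2)) for c in (0.0, 5.0, 10.0)])

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squares (WCSS) that k-means minimizes.
    print(k, round(km.inertia_, 1))
```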
Choosing K
You can choose K using different models.
The choice of the number of clusters must also be based on business operational constraints
- the number of clusters is always limited by the organization's capacity to use them. For
example, if you are clustering customers into groups to create differential marketing
campaigns, having 20+ clusters does not make business sense because it would be very
difficult to develop 20+ different marketing initiatives.
It is important that the final cluster model is interpretable by the business. Clustering is only
useful if it can be understood by the business and explained in simple words.
● In the first picture, Recompute Centroids, you see that k initial "means" (in this case k=3)
are randomly generated within the data domain, and that k clusters are created by
associating every observation with the nearest mean.
● In the second picture, Reassign Membership, the position of the centroids is recomputed.
The centroid in each of the k clusters becomes the new mean.
● The third picture, Final Solution, shows that the membership of each cluster is reassigned,
which means that each observation is associated with its nearest centroid.
● This process continues, with successively smaller steps as the convergence to the final
solution is achieved, where there is no more movement in the position of the centroids.
Segmentation
Homogeneous and heterogeneous segmentation must be considered.
We must put these strengths and weaknesses in the following context. Most clustering
approaches are undirected, which means there is no target. The goal of undirected analysis is
to discover structure in the data as a whole.
There are other important issues in cluster analysis, which can be described in the following
ways.
You must calculate the distance measure:
● Most clustering techniques use the Euclidean distance formula, which is the square root of
the sum of the squares of distances along each attribute axes, for the distance measure.
● Before the clustering can take place, categorical variables must be encoded and scaled.
Depending on these transformations, the categorical variables can dominate clustering
results or they can be completely ignored.
You must interpret the clusters when you discover them. There are different ways to utilize clustering results:
● You can use cluster membership as a label for a separate classification problem.
● You can use other data science techniques - for example, decision trees - to find descriptions of clusters.
● You can visualize clusters by utilizing 2D and 3D scatter graphs, or some other visualization technique.
● You can examine the differences in attribute values among different clusters, one attribute at a time.
● You use clustering techniques when you expect natural groupings in the data. Clusters
must represent groups of items (products, events, customers) that have a lot in common.
● Creating clusters prior to the application of some other technique (classification models,
decision trees, neural networks) can reduce the complexity of the problem by dividing the
data space.
● These space partitions can be modeled separately and these two-step procedures can
occasionally exhibit improved results when compared to the analysis or modeling without
using clustering. This is referred to as segmented modeling.
Classification and Regression
Regression Analysis Overview
You can use regression analysis for modeling and analyzing numerical data.
You can utilize numerous types of regression models. This choice frequently depends on the
kind of data that you possess for the target variable. Take note of the following in relation to
the example provided:
● This is a simple linear regression. The target is a continuous variable.
● The target variable in the regression equation is modeled as a function of the explanatory
variables, a constant term, and an error term.
● The error term is treated as a random variable. It represents unexplained variation in the
target variable.
● You can see that the equation of the straight line is y = a + bx.
● In this simple equation, b is a regression coefficient. Regression coefficients are estimates
of unknown parameters and describe the relationship between an explanatory variable and
the target. In linear regression, coefficients are the values that multiply the explanatory
values. Suppose you have the following regression equation: y = 5+2x. In this equation, +2
is the coefficient, x is the explanatory variable, and +5 is the constant.
● The sign of each coefficient indicates the direction of the relationship between the
explanatory variable and the target variable.
● A positive sign indicates that as the explanatory variable increases, the target variable also
increases.
● A negative sign indicates that as the explanatory variable increases, the target variable
decreases.
● The coefficient value represents the mean change in the target given a one-unit change in
the explanatory variable. For example, if a coefficient is +2, the mean response value
increases by 2 for every one-unit change in the explanatory variable.
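To make the constant and coefficient concrete, the short sketch below fits a simple linear regression to synthetic data generated from the relationship y = 5 + 2x plus random noise; the data set and the noise level are assumptions for illustration only.

# Fit a simple linear regression to assumed synthetic data built around y = 5 + 2x.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100).reshape(-1, 1)      # explanatory variable
y = 5 + 2 * x.ravel() + rng.normal(0, 1, size=100)   # target = constant + coefficient*x + error

model = LinearRegression().fit(x, y)
print("Estimated constant (a):", round(model.intercept_, 2))   # close to 5
print("Estimated coefficient (b):", round(model.coef_[0], 2))  # close to 2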
Most applications of linear regression fall into one of two broad categories. These categories
are outlined in the following way:
● Prediction or forecasting:
- Linear regression can be used to fit a predictive model to an observed data set of values of
the target and explanatory variables.
- When you have developed the model, and additional values of the explanatory variables
are collected without an accompanying target value, the model can be used to make a
prediction of the target values.
● Explaining variation in the target variable that can be attributed to variation in the
explanatory variables:
- Linear regression analysis can be applied to quantify the strength of the relationship
between the target and the explanatory variables.
- The model can also be used to determine if some explanatory variables have no linear
relationship with the target.
Least Squares
The following figure provides you with an example of least squares; it uses the information
from the previous sample.
Linear regression models are often fitted using the least squares approach - although they
can also be fitted in other ways. Let us consider the following in the context of this example:
● The best fit, from a least-squares perspective, minimizes the sum of squared residuals,
where a residual is the difference between an observed value and the fitted value provided
by a model - that is, the error.
● In the ice cream example, we can calculate the residual and this is shown in the table.
● Ordinary least squares (OLS) is a type of linear least squares method for estimating the
unknown parameters in a linear regression model.
● OLS chooses the parameters of a linear function of a set of explanatory variables by the
principle of least squares: minimizing the sum of the squares of the differences between
the observed target or dependent variable (the values of the variable being predicted) and
those predicted by the linear function.
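To make the least-squares idea concrete, the sketch below computes the residuals and the sum of squared residuals for a fitted line. The small data set is an assumption and does not reproduce the ice cream example from the figure; only the calculation steps matter.

# Compute residuals and the sum of squared residuals (SSR) for an assumed data set.
import numpy as np

x = np.array([14.0, 16.0, 11.0, 15.0, 18.0, 22.0])        # assumed explanatory values
y = np.array([215.0, 325.0, 185.0, 332.0, 406.0, 522.0])  # assumed observed target values

# Ordinary least squares fit of y = a + b*x via numpy's polyfit.
b, a = np.polyfit(x, y, deg=1)
fitted = a + b * x
residuals = y - fitted            # observed minus fitted (the errors)
ssr = np.sum(residuals ** 2)      # the quantity that OLS minimizes

print("a =", round(a, 2), "b =", round(b, 2))
print("Residuals:", np.round(residuals, 1))
print("Sum of squared residuals:", round(ssr, 1))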
Influence of Outliers
The linear regression algorithm fits the best model by minimizing the sum of the squared errors
between the data points and the trend line. Because the errors are squared, outliers that lie far
from the trend line can have a disproportionately large influence on the fitted model.
Polynomial Regression
Polynomial regression is another form of regression analysis.
Overview of Classification
We begin with an introduction to classification analysis.
You have looked at the regression example where the target is continuous, but what happens
if the target is categorical? The following points become apparent:
● A categorical variable has values that you can put into a countable number of distinct
groups based on a characteristic.
● In predictive analysis, you frequently come across applications where you want to predict
a binary variable (0 or 1) or a categorical variable (yes or no). Such a target variable is also
referred to as a dichotomous variable - something which is divided into two parts or
classifications. This is called classification analysis. The problem can be extended to
predicting more than two integer values or categories.
● There are many use cases for this type of classification analysis using regression
techniques, covering scenarios where the focus is on the relationship between a target
variable and one or more explanatory variables.
● Classification analysis can use regression techniques to identify the category to which a
new observation belongs, based on a training set of data containing observations whose
category membership is known. In these examples in the slide the target has 2 categories:
churners/non-churners, responders/non-responders, apples/pears. There are also other
use cases where the classification target has more than 2 categories.
● Retention analysis, or churn analysis, is a major application area for predictive analysis
where the objective is to distinguish between customers who have switched to a new
supplier (for example, for Telco or utility services) and those who have not switched (and
have therefore been "retained" as a customer). The objective is to try and build a model to
describe the attributes of those customers who have been retained, in contrast to those
who have left or churned, and therefore develop strategies to maximize the retention of
customers. The target or dependent variable is usually a flag, for example, a binary or
Boolean variable, that is, Yes / No or 1 / 0. The explanatory variables describe the
attributes of each customer.
● The class of models in these cases is referred to as a classification model, as you want to
classify observations, and in the more general sense data records, into classes.
● The algorithms commonly used for classification analysis are decision trees, regression,
and neural networks. There are other approaches as well, and some are presented in this
training course.
● In the terminology of predictive analysis, classification is considered an instance of
supervised learning, that is, learning where a training set of correctly-identified
observations is available - this is the target variable. This means that when you train a
churn model, you need some historic data to establish if a customer has churned or not.
When you have built and trained the model, you can apply it onto other data where you
have not got the target value, and predict the target - to answer questions such as: "Will
these customers churn or not?"
The use cases for classification analysis are the largest group within predictive analysis.
Examples of these are as follows:
● Churn analysis to predict the probability that a customer may leave/stay
● Success or failure of a medical treatment, dependent on dosage, patient's age, sex, weight,
and severity of condition
● High or low cholesterol level, dependent on sex, age, whether a person smokes or not, and
so on
● Vote for or against a political party, dependent on age, gender, education level, region,
ethnicity, and so on
● Yes or No, or Agree or Disagree to responses to questionnaire items in a survey.
Linear regression is not always suitable for certain reasons, which are as follows:
Over-fitting
The following diagrams illustrate different aspects of over-fitting.
● The standard approach to avoiding over-fitting is to split the data into a train data set to
train or build the model, and a test data set to test the model on unseen or hold-out data.
● Other techniques include cross-validation where multiple models are run on samples of
the data and the models compared.
● These concepts are explored in more detail as we progress through this course.
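A minimal sketch of the train/test split approach described above, using assumed synthetic data; the gap between training accuracy and test accuracy is the symptom of over-fitting.

# Hold-out evaluation sketch: train on one subset, test on unseen data (assumed synthetic data).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
print("Accuracy on training data:", round(model.score(X_train, y_train), 3))  # often near 1.0
print("Accuracy on unseen test data:", round(model.score(X_test, y_test), 3)) # the honest estimate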
Leaker Variables
The following figure outlines leaker variables.
Data Leakage
You can utilize certain methods to prevent data leakage.
Consider the following points in the context of the example in the figure:
● To avoid using leaker variables, you must know your data, inspect the data with care, and
use your common sense.
● Leaker variables generally possess a high statistical correlation to the target, so the
model's predictive power is suspiciously high. Therefore, search the Influencer
Contributions report for a variable that has an unusually high influence, meaning it is highly
correlated to your target.
● Remember that if you build a model and it is extremely accurate, you might have a leakage
problem.
Summary
This lesson covers the following:
Data
The following example provides you with sample data.
Access Data
The following figure provides you with sample access data.
Column Details
The following figure shows you how to edit the details in a column.
Overview Report
The following figure introduces you to the overview report. This report measures the
following:
Summary
This lesson gives you a demonstration of regression modeling in SAC Smart Predict.
Confusion Matrix
The confusion matrix is a useful guide to the performance of an algorithm.
Density Curves
Consider the function of density curves, which you can generate using SAC.
Summary
These demonstrations show you how to build a classification model in SAC, examine the
output of the model, and apply it.
Classification Decision
Overview of Classification Analysis with Decision Trees
Decision trees are popular machine learning tools.
Consider the following points about decision trees in the context of the figure:
● To find solutions, a decision tree makes sequential, hierarchical decisions about the
outcome variable based on the predictor data.
● The model provides a series of "if this occurs, then this occurs" conditions that produce a
specific result from the input data.
● Usually the model is represented in the form of a tree-shaped structure that represents a
set of decisions, which is very easy to understand.
● The tree can be binary or multi-branching, depending on the algorithm utilized to segment
the data.
● Each node represents a test of a decision and the rules that are generated can easily be
expressed, which means that the records falling into a particular category can be retrieved.
● In this example related to golf, you classify the weather conditions that would indicate
whether or not to play golf. There are two categorical predictors: Outlook and Windy, and
two numeric predictors, Temperature and Humidity.
In this example, you are given information about the income of the customers, if they are new
or existing customers, if they are young or old, and their marriage status and gender.
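Returning to the golf example, a decision tree classifier can be fitted in a few lines of Python. The 14 observations below are an assumed version of the classic weather/golf data set; the categorical predictor Outlook is one-hot encoded before training.

# Decision tree sketch on an assumed version of the classic golf/weather data set.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Outlook":     ["Sunny","Sunny","Overcast","Rain","Rain","Rain","Overcast",
                    "Sunny","Sunny","Rain","Sunny","Overcast","Overcast","Rain"],
    "Temperature": [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71],
    "Humidity":    [85, 90, 86, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 91],
    "Windy":       [0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1],
    "Play":        ["No","No","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","Yes","Yes","No"],
})

# One-hot encode the categorical predictor Outlook; Windy is already binary.
X = pd.get_dummies(data.drop(columns="Play"), columns=["Outlook"])
y = data["Play"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # the "if ... then ..." rules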
Titanic
The following example uses the sinking of the Titanic to demonstrate the uses of a decision
tree.
● When a variable has many values then in the CHAID algorithm it can lead to many
branches in the tree and consequently more complex rules.
● For the values, Temperature and Humidity, suppose you define "bins" or "groups" in the
following way:
● The test measures the probability that an apparent association is due to chance or not.
● In this example, the expected value for Sunny/Play is only equal to the observed value of 2,
if the two events are really independent from each other.
● If there is a large chi-squared test statistic, you can say it is most likely wrong to assume
that the playing decision is independent from outlook.
● Take note that the final value of the Chi-Test of 0.1698 is found in look-up tables. Take a
look at the bottom reference given in the following figure. In this reference page, you can
enter the observed values in the table. It calculates all the values for you, including p
(0.1698).
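The chi-squared calculation for Outlook versus Play can also be reproduced with scipy, assuming the observed counts of the classic golf data set (Sunny: 2 play / 3 do not, Overcast: 4 / 0, Rain: 3 / 2). With these assumed counts the p-value comes out at roughly 0.17, in line with the 0.1698 quoted above.

# Chi-squared test of independence between Outlook and the Play decision.
# The observed counts below are an assumption based on the classic golf data set.
from scipy.stats import chi2_contingency

observed = [
    [2, 3],  # Sunny:    2 play, 3 do not play
    [4, 0],  # Overcast: 4 play, 0 do not play
    [3, 2],  # Rain:     3 play, 2 do not play
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print("Chi-squared statistic:", round(chi2, 3))
print("Degrees of freedom:", dof)
print("p-value:", round(p_value, 4))   # approximately 0.17
print("Expected counts under independence:\n", expected.round(2))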
This analysis continues for each successive split. Take note of the following points:
● The next split you examine is Outlook Rain and the best split is Windy.
● The tree is complete as each leaf node contains a single outcome.
CART is very similar to the C4.5 algorithm, but has the following major differences:
● Rather than building trees that could have multiple branches, CART builds binary trees,
which only have two branches from each node.
● CART uses the Gini Impurity as the criterion to split a node, not what is called Information
Gain.
● CART supports numerical target variables, creating a regression tree that predicts
continuous values.
Random Forests
Random forests are another method for classification and regression.
In this simple golfing example, actual and predicted are 100% consistent, but of course in
practice it is never like that. Consider these additional elements:
● There are always misclassifications - that is, errors.
● A confusion matrix is used to analyze the performance of the algorithm.
● The confusion matrix is a table that shows the performance of a classification algorithm by
comparing the predicted value of the target variable with its actual value.
● Each column of the matrix represents the observations in a predicted class.
● Each row of the matrix represents the observations in an actual class.
● The name stems from the fact that it makes it easy to see if the system is confusing two
classes, that is, commonly mislabeling one as another.
● The confusion matrix is examined in more detail later in the course.
Summary
This lesson covers the strengths and weaknesses of classification decision trees.
The strengths are as follows:
Example: k-NN
The following figure provides you with an example of the k-NN algorithm.
Nodes
The following figure shows you what actions take place at a node.
● This process is repeated many times, and the network continues to improve its predictions
until one or more of the stopping criteria are met.
● Initially, all weights are random, and the answers that come out of the net are probably
nonsensical.
● The network learns through training. Examples for which the output is known are
repeatedly presented to the network, and the answers it gives are compared to the known
outcomes. Information from this comparison is passed back through the network,
gradually changing the weights.
● As training progresses, the network becomes increasingly accurate in replicating the
known outcomes. When it is trained, the network can be applied to future cases where the
outcome is unknown.
● In the next step of the algorithm, the output signal of the network y is compared with the
desired output value (the target), which is found in the training data set.
● The difference is the error signal of the output layer neuron.
● The back propagation algorithm propagates the error signal, computed in each single
teaching step, back to all the neurons.
● Basically, back propagation is a method to adjust the connection weights to compensate
for each error found during learning. The error amount is effectively divided among the
connections.
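The weight-adjustment loop described above can be illustrated with a very small network trained by back propagation. The sketch below is a bare-bones example on the XOR problem written with plain NumPy; the 2-4-1 architecture, learning rate, and iteration count are assumptions chosen for readability rather than performance.

# Minimal back propagation sketch: a 2-4-1 network learning XOR (all settings are assumptions).
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Random initial weights: the first outputs of the network are essentially nonsensical.
W1 = rng.normal(size=(2, 4))   # input layer -> hidden layer
W2 = rng.normal(size=(4, 1))   # hidden layer -> output layer
lr = 0.5                       # learning rate

for step in range(10000):
    # Forward pass: compute the network output.
    hidden = sigmoid(X @ W1)
    output = sigmoid(hidden @ W2)

    # Error signal at the output layer (desired value minus network output).
    error = y - output

    # Back propagation: push the error signal back and adjust the connection weights.
    delta_out = error * output * (1 - output)
    delta_hidden = (delta_out @ W2.T) * hidden * (1 - hidden)
    W2 += lr * hidden.T @ delta_out
    W1 += lr * X.T @ delta_hidden

print("Predictions after training:", output.ravel().round(2))  # typically close to [0, 1, 1, 0]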
● Unlike neural networks, traditional machine learning algorithms such as decision trees are
very interpretable. This is important because in some domains, interpretability is critical.
This is why a lot of financial institutions do not use neural networks to predict the
creditworthiness of a person; they need to explain to their customers why they did not get a
loan, otherwise the person can feel unfairly treated.
● Neural networks usually require much more data than traditional machine learning
algorithms, as in at least thousands, if not millions of labeled samples. In many cases,
these amounts of data are not available and many machine learning problems can be
solved well with less data if you use other algorithms.
● The amount of computational power needed for a neural network depends heavily on the
size of your data, but also on the depth and complexity of your network. Usually, neural
networks are more computationally expensive than traditional algorithms.
● An NN shows the potential to over-fit.
● There is no specific rule for determining the structure of artificial neural networks. The
appropriate network structure is achieved through experience and trial and
error. Architectures have to be fine-tuned to achieve the best performance. There are
many design decisions that have to be made, from the number of layers to the number of
nodes in each layer to the activation functions.
● An SVM looks for the hyperplane that maximizes the distance (the margin) between the two
classes. Maximizing the distance means that you can classify new data points with more
confidence.
● You can see possible hyperplanes in the diagram H1, H2, and H3 separating the black and
white dots.
● This line is the decision boundary: anything that falls to one side of it, we classify as black,
and anything that falls to the other side as white.
● But, what exactly is the best hyperplane? For an SVM, it is the one that maximizes the
margins from both types of dot. In other words: the hyperplane whose distance to the
nearest element of each dot is the largest.
Support vectors are the data points that lie closest to the decision surface (the hyperplane)
and show the following features:
● These are the data points most difficult to classify.
● They have a direct bearing on the optimum location of the decision surface.
● Support vectors are the elements of the training set that, if removed, would change the
position of the dividing hyperplane.
● Support vectors are the critical elements of the training set.
● Compared to newer algorithms like neural networks, SVMs have two main advantages:
higher speed and better performance with a limited number of samples (in the thousands).
Comparing SVMs to logistic regression and decision trees illustrates the following:
● If you are a farmer and you need to set up a fence to protect your cows from packs of
wolves, where do you build your fence?
● One way you could do it would be to build a classifier based on the position of the cows and
wolves in your pasture.
● In this example, you see that the SVM does a great job at separating your cows from the
packs of wolves because it can use a non-linear decision boundary.
● You can see that both the logistic regression and decision tree models only make use of
straight lines.
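The cows-and-wolves intuition can be sketched with scikit-learn by comparing a linear model with an SVM that uses a non-linear (RBF) kernel on data that is not linearly separable; the data set and parameter values below are assumptions.

# Compare a linear classifier with a non-linear (RBF-kernel) SVM on assumed circular data.
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Two classes arranged in concentric circles: a straight line cannot separate them.
X, y = make_circles(n_samples=400, factor=0.4, noise=0.1, random_state=0)

linear_model = LogisticRegression().fit(X, y)
svm_rbf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print("Logistic regression accuracy:", round(linear_model.score(X, y), 2))  # near 0.5
print("RBF-kernel SVM accuracy:", round(svm_rbf.score(X, y), 2))            # near 1.0
print("Number of support vectors:", svm_rbf.n_support_.sum())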
SVM: Summary
The strengths and weaknesses of SVMs are as follows:
Time-Series Analysis
Time-Series Analysis Overview
The following figure introduces you to time-series analysis.
● Time-series analysis comprises methods for analyzing time series data in order to extract
meaningful patterns in the data.
● Time-series forecasting is the use of a model to predict future values based on previously
observed signal values.
Forecast Horizon
The following figure provides you with an overview of a forecast horizon.
Naïve Forecasting
The following figure outlines naïve forecasting.
These simple approaches can be used as a basis for comparing other algorithms to see if they
are significantly better.
Another approach is Moving Averages. Let's take the popular 7-day moving average that is
used to monitor hospital admissions as an example. A 7-day moving average is calculated by
taking the number of COVID hospital admissions for the last 7 days and adding them together.
The result from the addition calculation is then divided by the number of periods, in this case,
7.
A moving average requires that you specify a window size that defines the number of raw
observations used to calculate the moving average value. In our example that is 7 days.
The “moving” part in the moving average refers to the fact that the window slides along the
time series to calculate the average values in the new series. In our example, the most recent
daily admission number is added and the oldest, 7 periods previously, is dropped.
Moving averages can smooth time series data, revealing underlying trends. Smoothing is the
process of removing random variations that appear as coarseness in a plot of raw time series
data.
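The 7-day moving average can be computed directly with pandas. The daily admission figures below are invented for illustration; only the rolling-window mechanics matter.

# 7-day moving average of assumed daily hospital admission counts.
import pandas as pd

admissions = pd.Series(
    [120, 135, 150, 160, 140, 90, 80, 125, 140, 155, 170, 150, 95, 85],
    index=pd.date_range("2021-01-01", periods=14, freq="D"),
)

# The window "slides" along the series: each value is the mean of the last 7 observations.
moving_avg = admissions.rolling(window=7).mean()
print(moving_avg.round(1))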
Note:
For more info on naïve forecasting, see: https://otexts.com/fpp2/simple-
methods.html https://en.wikipedia.org/wiki/Forecasting
Figure 224: Double Exponential Smoothing: Different Alpha and Beta Values
ARIMA is another time series forecasting approach. It stands for Autoregressive Integrated
Moving Average models and is used for the following:
● Its main application is in the area of short term forecasting requiring at least 40 historical
data points. It works best when data exhibits a stable or consistent pattern over time with a
minimum amount of outliers. ARIMA is usually superior to exponential smoothing
techniques when the data is reasonably long and the correlation between past
observations is stable. If the data is short or highly volatile, some smoothing method can
perform better. If you do not have at least 38 data points, you must consider some other
method than ARIMA.
● The AR part (for Autoregressive) of ARIMA indicates that the evolving signal is regressed
on its own lagged values (that is, the previous values). The MA part (for Moving Average)
indicates that the regression error is actually a linear combination of error terms. The I (for
"integrated") indicates that the data values have been replaced with the difference
between their values and the previous values (and this differencing process may have been
performed more than once). The purpose of each of these features is to make the model fit
the data as well as possible.
● The first step in applying the ARIMA methodology is to check for stationarity. Stationarity
implies that the series remains at a fairly constant level over time. If a trend exists, as in
most economic or business applications, your data is NOT stationary. The data should also
show a constant variance in its fluctuations over time. This is easily seen with a series that
is heavily seasonal and growing at a faster rate. In this case, the ups and downs in the
seasonality will become more dramatic over time. Without these stationarity conditions
being met, many of the calculations associated with the ARIMA process cannot be
computed. Therefore, you need to transform the data so that it is stationary.
● If a graphical plot of the data indicates non-stationarity, you should "difference" the series.
Differencing is an excellent way of transforming a non-stationary series to a stationary one.
This is done by subtracting the observation in the current period from the previous one. If
this transformation is done only once to a series, you say that the data has been "first
differenced". This process essentially eliminates the trend if your series is growing at a
fairly constant rate. If it is growing at an increasing rate, you can apply the same procedure
and difference the data again. Your data would then be "second differenced".
● "Autocorrelations" are numerical values that indicate how a data series is related to itself
over time. More precisely, it measures how strongly data values at a specified number of
periods apart are correlated to each other over time. The number of periods apart is
usually called the lag. For example, an autocorrelation at lag 1 measures how values 1
period apart are correlated to one another throughout the series. An autocorrelation at lag
2 measures how the data two periods apart are correlated throughout the series.
Autocorrelations can range from +1 to -1. A value close to +1 indicates a high positive
correlation while a value close to -1 implies a high negative correlation. These measures
are most often evaluated through graphical plots called "correlograms". A correlogram
plots the autocorrelation values for a given series at different lags. This is referred to as
the autocorrelation function and is very important in the ARIMA method.
● After a time series has been stationarized by differencing, the next step in fitting an ARIMA
model is to determine whether AR or MA terms are needed to correct any autocorrelation
that remains in the differenced series.
● Of course, with software packages, you could just try some different combinations of
terms and see what works best. But there is a more systematic way to do this: by looking
at the plots of the autocorrelation function (ACF) and partial autocorrelation (PACF) of the
differenced series, you can tentatively identify the numbers of AR and/or MA terms that
are needed.
● The main problem is trying to decide which ARIMA specification to use - that is, how many
AR and/or MA parameters to include. This is what much of the approach called Box-
Jenkins [1976] was devoted to in the past. That method depends on the graphical and
numerical evaluation of the sample autocorrelation and partial autocorrelation functions.
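A minimal ARIMA sketch with statsmodels, assuming a synthetic monthly series; the order (p, d, q) = (1, 1, 1) is an assumption and would normally be chosen from the ACF and PACF plots discussed above.

# ARIMA sketch on an assumed synthetic trend series (order chosen for illustration only).
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
# Synthetic non-stationary series: a linear trend plus noise, so d=1 (first differencing) is used.
values = 100 + 2.5 * np.arange(60) + rng.normal(0, 5, size=60)
series = pd.Series(values, index=pd.date_range("2018-01-01", periods=60, freq="MS"))

model = ARIMA(series, order=(1, 1, 1))   # p=1 AR term, d=1 differencing, q=1 MA term
fitted = model.fit()
forecast = fitted.forecast(steps=6)      # 6-month forecast horizon
print(forecast.round(1))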
Accuracy Measures
Mean Absolute Percentage Error (MAPE) is a popular measure used to forecast time-series
errors.
While MAPE is one of the most popular measures for forecasting error, there are many
studies on shortcomings and misleading results from MAPE. These are outlined in the
following way:
● It cannot be used if there are zero values, which sometimes happens, for example, in
demand data, because there would be a division by zero.
● For forecasts that are too low the percentage error cannot exceed 100%, but for forecasts
which are too high there is no upper limit to the percentage error.
● Moreover, MAPE puts a heavier penalty on negative errors (where the actual value At is
less than the forecast Ft) than on positive errors.
To overcome these issues with MAPE, there are some other measures proposed in literature:
● Mean Absolute Scaled Error (MASE)
● Symmetric Mean Absolute Percentage Error (sMAPE)
● Also, as an alternative, each actual value (At) of the series in the original formula can be
replaced by the average of all actual values (Āt) of that series. This alternative is still being
used for measuring the performance of models that forecast spot electricity prices.
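The formulas behind these measures are straightforward to implement. A minimal sketch, assuming small example arrays of actual and forecast values:

# MAPE and sMAPE sketches on assumed actual and forecast values.
import numpy as np

actual = np.array([100.0, 120.0, 80.0, 90.0])
forecast = np.array([110.0, 100.0, 90.0, 85.0])

# MAPE = mean of |(At - Ft) / At| * 100; undefined if any actual value is zero.
mape = np.mean(np.abs((actual - forecast) / actual)) * 100

# sMAPE divides by the average of |At| and |Ft|, which bounds the penalty symmetrically.
smape = np.mean(np.abs(forecast - actual) / ((np.abs(actual) + np.abs(forecast)) / 2)) * 100

print("MAPE: %.1f%%" % mape)
print("sMAPE: %.1f%%" % smape)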
Summary
This lesson covers the following:
Forecasting in SAC
Time-Series Components in SAC
The following figure introduces you to the idea of the signal in a time-series forecast.
Trend
Trends are defined in the following ways.
Fluctuations in SAC
The following figure provides you with an example of how a fluctuation is detected.
Influencer Variables
Potential Influencer Variables are included in your model and can be used for the following, as
shown in the figure.
During the analysis of the trend and cycle components, there are constraints for potential
influencer variables:
● The future values must be known, at least for the expected horizon.
● Influencer variables with ordinal, continuous, and nominal types are used in the detection
of trends.
● But only influencer variables with ordinal and continuous types are used in the detection of
cycles.
Horizon-Wide MAPE
Building your model means you need to specify a horizon.
Demonstration Data
The following figure provides you with demonstration data.
Examine Forecasts
The following figure shows you what you can learn from forecasts.
Outliers
The following figure provides you with an example of Signal Outliers table.
Segmented Models
The following figure gives you an example of a segmented model.
Entity
The following figure provides you with an example of an entity.
Summary
This lesson covers the following:
Ensemble Methods
Bootstrapping Overview
This is an introduction to a type of resampling called bootstrapping.
Before describing some of the basic ensemble methods, it is useful to understand the concept
called bootstrapping.
This is a simple example that includes only one repeat of the procedure. It can be repeated
many more times to give a sample of calculated statistics.
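The resampling procedure can be repeated programmatically. A minimal sketch, assuming a small sample and the mean as the statistic of interest:

# Bootstrapping sketch: resample with replacement many times and collect the statistic.
import numpy as np

rng = np.random.default_rng(7)
sample = np.array([2.1, 3.5, 4.0, 4.4, 5.2, 6.3, 7.1, 8.0])   # assumed original sample

boot_means = []
for _ in range(1000):                                          # 1,000 bootstrap repeats
    resample = rng.choice(sample, size=len(sample), replace=True)
    boot_means.append(resample.mean())

boot_means = np.array(boot_means)
print("Bootstrap estimate of the mean:", round(boot_means.mean(), 2))
print("95% interval:", np.percentile(boot_means, [2.5, 97.5]).round(2))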
Ensemble Methods
Ensemble methods are based around the hypothesis that an aggregated decision from
multiple models can be superior to a decision from a single model.
● Every model makes a prediction (votes) for each test instance and the final output
prediction is the one that receives more than half of the votes. If none of the predictions
get more than half of the votes, you can say that the ensemble method could not make a
stable prediction for this instance.
● Although this is a widely used technique, you can try the most voted prediction (even if
that is less than half of the votes) as the final prediction. In some articles, you sometimes
see this method being called plurality voting.
Bagging
The following figure provides you with an overview of bagging.
● The models use voting for classification or averaging for regression. This is where
"aggregating" comes from in the name "bootstrap aggregating."
● Each model has the same weight as all the others.
● In many cases, bagging methods constitute a very simple way to improve a model, without
making it necessary to adapt the underlying base algorithm.
● Bagging methods come in many forms, but mostly differ from each other by the way they
draw random subsets of the training set.
Random Forests
Random forests are a set of powerful, fully automated, machine learning techniques.
Random forests are one of the most powerful, fully automated, machine learning techniques.
With almost no data preparation or modeling expertise, analysts can obtain surprisingly
effective models. Random forests are an essential component in the modern data scientist's
toolkit.
Random forests are outlined in the following way:
● They are a popular and fast ensemble learning method for classification or regression
scenarios.
● They run a series of classification or regression models over random (bootstrap samples)
from the data.
● They combine and fit those results by voting (classification) or averaging (regression).
● This approach results in robust and high-prediction quality models.
Basically, a random forest consists of multiple random decision trees. Two types of
randomness are built into the trees:
1. First, each tree is built on a random sample from the original data.
2. Second, at each tree node, a subset of features are randomly selected to generate the
best split.
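A minimal random forest sketch with scikit-learn, assuming synthetic data: each tree is grown on a bootstrap sample and considers a random subset of features at each split, exactly the two types of randomness listed above.

# Random forest sketch: bootstrap samples plus random feature subsets at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,      # number of random decision trees in the forest
    max_features="sqrt",   # random subset of features considered at each split
    bootstrap=True,        # each tree is trained on a bootstrap sample of the data
    random_state=0,
).fit(X_train, y_train)

print("Test accuracy:", round(forest.score(X_test, y_test), 3))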
Boosting
Boosting algorithms create a sequence of models. The following figure illustrates the core
principles of boosting.
perform. However, Model 2 also makes a number of other errors. This process continues
and there is a combined final classifier that predicts all the data points correctly.
Stacking, also known as stacked generalization, is an ensemble method where the models are
combined using another machine learning algorithm. Stacking raises certain questions and is
characterized by the following points:
● If you develop multiple machine learning models, how do you choose which model to use?
● Stacking uses another machine learning model that learns when to use each model in the
ensemble.
● In stacking, unlike bagging, the models are typically different (for example, they are not all
decision trees) and fit on the same data set (for example, instead of samples of the training
data set).
● In stacking, unlike boosting, you use a single model to learn how to best combine the
predictions from the contributing models (for example, instead of a sequence of models
that correct the predictions of prior models).
● The picture shows the typical architecture of a stacking model with two or more base
models, often referred to as level-0 models, and a meta-model that combines the
predictions of the base models, referred to as a level-1 model.
● Level 0 models are frequently diverse and make very different assumptions about how to
solve the predictive modeling task, such as linear models, decision trees, support vector
machines, neural networks, and so on.
● The meta-model is frequently simple, such as linear regression for regression tasks
(predicting a numeric value) and logistic regression for classification tasks (predicting a
class label).
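A minimal stacking sketch with scikit-learn, assuming synthetic data: diverse level-0 base models are combined by a simple logistic regression meta-model, as described above.

# Stacking sketch: diverse base (level-0) models combined by a logistic regression meta-model.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),   # the level-1 meta-model
).fit(X_train, y_train)

print("Stacked model test accuracy:", round(stack.score(X_test, y_test), 3))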
Summary
This lesson covers the following:
Simulation Optimization
Introduction: Monte Carlo Simulation
The Monte Carlo simulation is a mathematical technique that allows you to account for risk in
quantitative analysis and decision making. It uses repeated random sampling to obtain the
distribution of an unknown probabilistic entity.
A Monte Carlo simulation provides the decision-maker with a range of possible outcomes and
the probabilities that occur for any choice of action. It shows the extreme possibilities, that is,
the outcomes of going for broke and for the most conservative, as well as the middle of the
road options.
The history of this method is as follows:
● The modern version was invented in the late 1940s by Stanislaw Ulam while he was
working on nuclear weapons projects at the Los Alamos National Laboratory. It was
developed by John von Neumann, who identified a way to create pseudo-random
numbers. It was named after Monte Carlo, the Monaco resort town renowned for its
casinos.
● The technique is used in finance, project management, energy, manufacturing,
engineering, research and development, insurance, oil and gas, transportation, and the
environment.
● Applications and examples of this simulation are as follows:
- Operations Research studies - queuing and service levels, manufacturing, distribution, and
so on
- Computational physics, physical chemistry, and complex quantum chromodynamics
- Engineering for quantitative probabilistic analysis in process design
- Computational biology
- Computer graphics
- Applied statistics
- Finance and business
This example, which relates to business and finance, illustrates the following:
● Projected estimates are given for the product revenue, product costs, overheads, and
capital investment for each year of the analysis, from which the cash flow can be
calculated.
● The cash flows are summed for each year and discounted for future values. In other words,
the net present value of the cash flow is derived as a single value measuring the benefit of
the investment.
● The projected estimates are single-point estimates of each data point and the analysis
provides a single-point value of the project Net Present Value (NPV). You can use sensitivity
analysis to explore the investment model.
However, these tests do not answer the question: what are the chances, that is, the
probability, that the NPV will be zero or less, or over £10,000? This is where you can use
probabilistic modeling - the Monte Carlo simulation.
Take note of the following in relation to this simple example of the Monte Carlo simulation:
● This is a simple example in which there are 2 distributions: one for the product revenue,
which is a normal distribution in this example, and another for the product costs, a
rectangular distribution. This data has been collected over a period of time and the
distributions show the range of values that have been observed over this time period.
● It is important to notice that a sample is taken from each distribution.
● Using this sample, you can calculate the product margin, overhead, capital investment,
and NPV.
● This process is repeated multiple times, taking further samples from the revenue and
costs distributions, and calculating a range of values, the probability distribution, for the
product margin, overhead, capital investment and NPV.
Run the sampling and calculations multiple times, quite possibly hundreds or thousands of
times (this is the number of trials or simulations). This allows you to derive the following:
● Multiple values of the NPV
● Probability distribution of the NPV, which can be analyzed to estimate Pr(NPV >0), or
Pr(NPV >10,000)
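The repeated-sampling loop can be sketched as follows. The revenue and cost distributions, discount rate, and investment figure are all invented assumptions; only the mechanics of the simulation matter.

# Monte Carlo sketch: sample revenue and cost distributions many times, derive the NPV distribution.
import numpy as np

rng = np.random.default_rng(42)
n_trials = 10_000
years = 5
discount_rate = 0.08
investment = 50_000          # assumed up-front capital investment

npv = np.zeros(n_trials)
for t in range(n_trials):
    revenue = rng.normal(30_000, 5_000, size=years)     # assumed normal distribution
    costs = rng.uniform(12_000, 18_000, size=years)     # assumed rectangular (uniform) distribution
    cash_flow = revenue - costs
    discounted = cash_flow / (1 + discount_rate) ** np.arange(1, years + 1)
    npv[t] = discounted.sum() - investment

print("Mean NPV:", round(npv.mean(), 0))
print("Pr(NPV > 0):", round((npv > 0).mean(), 3))
print("Pr(NPV > 10,000):", round((npv > 10_000).mean(), 3))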
You can use LP to achieve the best outcome, such as maximum profit or lowest cost, in a
mathematical model whose requirements are represented by linear relationships. Consider
the example of LP in the context of the following points:
● A linear programming algorithm finds a point in the space defined by the constraints where
the objective function has the smallest, or largest, value.
● In this example, there is an objective function to maximize profits (Z), represented by
Z=300X + 500Y, where X and Y are two different types of product. It therefore finds the
largest value for the objective: profit. However, there are also a number of constraints on
the possible values of X and Y.
● You can use the LP method to consider the objective function in conjunction with the
constraints because it enables you to identify the optimal solution. In this example, that is
shown by the red circle.
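The objective Z = 300X + 500Y can be optimized with scipy's linear programming routine. The constraint values below are invented for illustration, since the actual constraints appear only in the figure; linprog minimizes, so the profit coefficients are negated to maximize.

# Linear programming sketch for maximizing Z = 300X + 500Y under assumed constraints.
from scipy.optimize import linprog

# linprog minimizes, so negate the profit coefficients to maximize.
c = [-300, -500]

# Assumed resource constraints of the form a1*X + a2*Y <= b.
A_ub = [[1, 0],     # capacity limit on product X
        [0, 2],     # capacity limit on product Y
        [3, 2]]     # shared resource used by both products
b_ub = [4, 12, 18]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)], method="highs")
print("Optimal X, Y:", result.x.round(2))
print("Maximum profit Z:", round(-result.fun, 2))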
These are a core class of algorithms in Operations Research and are used for the following
purposes:
● Product formulation - minimize the cost subject to ingredient requirements
● Inventory management - minimize inventory costs subject to space and demand
● Allocation of resources - minimize transport costs subject to requirements
● Capital investment planning - maximize return on investment subject to investment
constraints
● Menu planning - minimize cost subject to meal requirements
● It has proved useful in modeling diverse types of problems in planning, routing, scheduling,
assignment, and design
SAP Transportation Resource Planning (TRP) is designed to supply the right equipment at the
right time and right location, with the minimum cost to fulfill customer demand - a simple
summary of a very complex problem.
Optimization: Summary
This lesson covers the following:
● LP as a mathematical modeling technique, in which a linear function is maximized or
minimized when it is subjected to various constraints.
● The uses of this technique, which you can use to support decision making in business
planning and in industrial engineering.
The particular business question that the model is designed to analyze helps you to determine
if a model can be classed as successful or not. Often the business question is associated with
a specific business success criterion that is converted to a data science success criterion
during Phase 1 of the CRISP-DM process: business understanding.
Potential uses of such a model are as follows:
● If the purpose of the model is to provide highly accurate predictions or decisions that are
used by the business, measures of accuracy are utilized.
● If the interpretation of the business is what is of most interest, accuracy measures are not
as important. Instead, subjective measures of what can provide maximum insight in the
future might be more desirable.
● Many projects use a combination of both accuracy and interpretation. Therefore, the most
accurate model is not selected if a less accurate, but more transparent, model with nearly
the same accuracy, is available.
● In their Third Annual Data Miner Survey, Rexer Analytics, an analytic and CRM consulting
firm based in Winchester Massachusetts, USA, asked analytic professionals: How do you
evaluate project success in Data Mining? The answer to this question is represented in the
figure.
● Out of 14 different criteria, 58% ranked model performance (that is, Lift, R2, and so on) as
the primary factor.
● For these responders, the most important component is the generation of very precise
forecasts.
● There is simply no rational reason to expect that a model, which has been developed to
maximize R2, or Lift, can also maximize the business performance metrics that are of
interest to an organization.
The figure shows the difference between accuracy and precision. Accuracy denotes how
closely a measured value is to the actual (true) value. Precision denotes how closely the
measured values are to each other - for example, if you weigh a given substance 10 times, and
get 5.1 kg each time, you know that this measurement is very precise.
Classification models often have a binary nominal target, meaning there are two outcome
classes - churn or no-churn, fraud or no-fraud, and so on - and these are coded as 1 and 0.
The following performance metrics are frequently used to assess classification model
success:
● Confusion matrices summarize the different kinds of errors, called Type I and Type II
errors
● Lift, Area Under the Curve (AUC) metrics
● SAP has developed its own metrics, called Predictive Power and Prediction Confidence (you
examine these in more detail later in this unit)
Bias
Bias and precision refer to factors that impact the accuracy of your model.
Analysts frequently refer to bias, which can be understood in the following terms:
● A forecast bias occurs when there are consistent differences between actual outcomes
and forecasts of those quantities, that is, forecasts can have a general tendency to be too
high or too low. A normal property of a good forecast is that it is not biased.
● Bias is a measure of how far the expected value of the estimate is from the true value of
the parameter being estimated.
● Precision is a measure of how similar the multiple estimates are to each other, not how
close they are to the true value.
● Basically, bias refers to the tendency of measures to systematically shift in one direction
from the true value and as such are often called systematic errors as opposed to random
errors.
● Forecast bias is different to forecast error (accuracy) in that a forecast can have any level
of error but be completely unbiased. For example, if a forecast is 10% higher than the
actual values half the time and 10% lower than the actual values the other half of the time,
it has no bias. However, if it is, on average, 10% higher than the actual value, the forecast
has both a 10% error and a 10% bias.
● If you select 50% on the x-axis, you find 50% of the targets on the y-axis.
● This is simply a random selection from the data and it represents what can happen if you
do not use a predictive model, but simply try and choose the targets randomly.
Gains Chart
The following figure is an example of a gains chart.
Therefore, if you rank order the customers based on the model score, and select those with
the highest score (going from left to right on the x-axis, those with the highest scores are on
the left) these have the highest probability of being the targets.
In the model (shown by the yellow line), if you select 33% of the whole of the base of 18
customers (on the x-axis going from left to right), based on descending score, you detect
66% of the total number of targets (shown on the y-axis). This is much better than a random
model.
This is a graphical indication of the predictive power of the model, compared to random and
the perfect model.
You can also see the misclassifications. These are the red targets that have lower scores and
are situated further on the right side of the x-axis, and the green non-targets that have high
scores, and are shown more to the left of the x-axis.
Lift Chart
The following figure is an example of a lift chart. The lift is a comparison of the difference
between the random selection and classification model.
It is usually shown with the random line in a horizontal orientation, (see the previous figure),
where Lift = 1.
However, you can see that in this example, at 33% of the population on the x-axis, you get
33% of the targets randomly and 66% of the targets using the model. Therefore, the lift is
simply the difference.
In this example, if you select 33% of the population (on the x-axis), you identify 2x the number
of targets using the model compared to a random selection.
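Cumulative gains and lift can be derived directly from the model scores. A minimal sketch with pandas, assuming a small synthetic scored data set rather than the 18-customer example above:

# Gains and lift sketch: rank by score, then compare cumulative targets found with random selection.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
scores = rng.uniform(0, 1, size=300)                           # assumed model scores
targets = (rng.uniform(0, 1, size=300) < scores).astype(int)   # targets more likely at high scores

df = pd.DataFrame({"score": scores, "target": targets}).sort_values("score", ascending=False)
df["cum_pct_population"] = np.arange(1, len(df) + 1) / len(df)
df["cum_pct_targets"] = df["target"].cumsum() / df["target"].sum()   # the gains curve
df["lift"] = df["cum_pct_targets"] / df["cum_pct_population"]        # lift versus random

# Gains and lift at one-third of the population, selected by descending score.
row = df.iloc[len(df) // 3]
print("Targets found in top third:", round(row["cum_pct_targets"], 2))
print("Lift at one third of the population:", round(row["lift"], 2))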
● The data set is split into two sub-samples. One sub-sample, the Estimation sub-sample, is
used to train the model. The other sub-sample is used to test the model to ensure it is
accurate and robust; this is called the Validation sub-sample.
● In this example, you are training the model on the Estimation and testing the model's
performance on both the Estimation and Validation sub-samples.
● To maintain high confidence in the robustness of the model, you expect that the
performance on both of these sub-samples is very similar, and that the performance
curves overlap one another.
● The predictive power metric ranges from 0 to 100%. There is no minimum cut-off, as the
predictive power is dependent on the predictiveness of the data you are using, but
obviously higher values are better than low values.
A model with a predictive power of 79% is capable of explaining 79% of the
information contained in the target variable using the explanatory variables contained in the
data set analyzed.
The following is also shown in relation to the predictive power of the model:
● "100%" is an hypothetical perfect model (on the green line), that explains 100% of the
target variable. In practice, such a predictive power would generally indicate that one or
more of the explanatory variables is 100% correlated with the target variable and there is a
leaking variable that must be excluded from the model.
● "0%" is a purely random model (on the red line).
● The prediction confidence is the robustness indicator that indicates the capacity of the
model to achieve the same performance when it is applied to a new data set exhibiting the
same characteristics as the training data set. It is estimated by comparing the difference in
the performance of the model on the Estimation and Validation sub-samples, calculated by
the area between the two performance curves (shown as B in the graph). The smaller the
value of B, the more robust the model is as the performance on the Estimation and
Validation sub-samples is very similar. For the metric to range from 0 to 100%, the area of
B is calculated and expressed as a ratio of the total area of A, B, and C.
Prediction confidence also ranges from 0 to 100%. However, it has a minimum cut-off
threshold of 95% that must be achieved. For example, a model with a prediction confidence:
● Equal to or greater than "95%" is very robust. It has a high capacity for generalization.
● Less than "95%" must be considered with caution. Applying the model to a new data set
incurs the risk of generating unreliable results.
Confusion Matrix
The following figure provides you with an example of a confusion matrix. A confusion matrix is
a table that is often used to describe the performance of a classification model on a set of test
data for which the actual true values are known.
Consider the following case. You are using a classification model to predict if a bank customer
will default on a loan repayment. If they miss a payment, the model predicted class = 1. If they
do not miss a payment, the model predicted class = 0. The confusion matrix shows the
following measures:
True Positive:
Interpretation: You predicted positive and it is true.
You predicted that the customer will miss a payment and they did miss a payment.
True Negative:
Interpretation: You predicted negative and it is true.
You predicted that the customer will not miss a payment and they did not miss a payment.
● Marketing response: You would like the number of FPs (non-responding customers) to be
as small as possible to optimize mailing costs. You might have to accept a high proportion
of FNs (responders classified as non-responders) as there are a large number of
customers to be contacted.
● Medical screening: You would like the proportion of FNs (patients with the disease who are
not screened) to be as small as possible (ideally zero). Therefore, you are willing to have a
higher proportion of FPs (patients screened unnecessarily).
● You can "tune" the classification model to minimize the type of error that is most relevant
for your specific scenario.
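The four cells can be computed with scikit-learn. A minimal sketch on assumed actual and predicted labels for the loan-default example (1 = missed payment):

# Confusion matrix sketch for assumed actual vs. predicted loan-default labels.
from sklearn.metrics import confusion_matrix

actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# With labels=[0, 1], ravel() returns TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=[0, 1]).ravel()
print("True Positives:", tp)    # predicted a missed payment, and the customer did miss one
print("True Negatives:", tn)    # predicted no missed payment, and none was missed
print("False Positives:", fp)   # predicted a missed payment, but none was missed
print("False Negatives:", fn)   # predicted no missed payment, but one was missed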
You can use a confusion matrix to assess the costs and benefits of particular actions:
● You can use return on investment, or profit information, with a confusion matrix if there is
a fixed or variable cost associated with the treatment of a customer or transaction, and a
fixed or variable return or benefit if the customer responds favorably.
● For example, if you are building a customer acquisition model, the cost
is typically a fixed cost associated with mailing or contacting the individual. The return is
the estimated value of acquiring a new customer.
● However, if the model predicts a No, but an acquisition would actually have occurred -
Actual = Yes (a false negative), there is an opportunity lost (a negative benefit value).
● For fraud detection, there is a cost associated with investigating the invoice or claim, and a
gain associated with the successful recovery of the fraudulent amount.
● For some business scenarios, it is not immediately apparent how to associate costs and
benefits to the confusion matrix, but there are a number of obvious associations that can
be made in marketing and sales when the classification model is predicting if a customer
will buy a product or not, and what the benefit is to the company if the customer does buy
the product.
When you plot the True Positive rate versus the False Positive rate, you have the ROC Curve,
which shows the following:
● Sensitivity, which appears on the Y axis, is the proportion of CORRECTLY identified targets
(true positives) found, out of all actual positives.
● [1 - Specificity], which appears on the X axis, is the proportion of INCORRECT assignments
to the target class (false positives), out of all actual negatives.
● The term, Specificity, as opposed to [1 - Specificity], is the proportion of CORRECT
assignments to the non-target class - the true negatives.
● The closer the curve follows the left-hand border and, subsequently, the top border of the
ROC space, the more accurate the test.
● The closer the curve comes to the 45-degree diagonal random line, the less accurate the
test.
● The AUC statistic is a measure of model performance or predictive power calculated as the
area under the ROC curve.
● A rough guide for classifying ROC is the traditional point system: 0.9-1 = excellent (A);
0.8-0.9 = good (B); 0.7-0.8 = fair (C); 0.6 - 0.7 = poor (D); < 0.6 = fail (F)
In the chart, at a false positive rate of 40% - that is, with 40% of the actual negatives
incorrectly assigned to the target class:
● A random selection (with no predictive model) would classify 40% of the positive targets
correctly as True Positive.
● A perfect predictive model would classify 100% of the positive targets as True Positive.
● The predictive model created by Smart Predict in SAP Analytics Cloud (the validation
curve) would classify 96% of the positive targets as True Positive.
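The ROC curve and AUC statistic can be obtained from the predicted probabilities. A minimal sketch, assuming a synthetic classification problem:

# ROC curve and AUC sketch on an assumed synthetic classification problem.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=5)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probabilities = model.predict_proba(X_test)[:, 1]        # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, probabilities)  # [1 - Specificity] vs. Sensitivity
auc = roc_auc_score(y_test, probabilities)               # area under the ROC curve
print("AUC:", round(auc, 3))   # 0.9-1.0 would be rated "excellent" on the rough guide above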
All of these indicators must be as low as possible, except R², which must be high (its
maximum is 1).
Summary
This lesson covers the following:
● What metric you must use. In general, this decision must be the one that most closely
matches the business objectives defined at the beginning of the project during the
Business Understanding Phase.
● An example of the previous point is as follows: if the objective indicates that the model is
going to be used to select one-quarter of the population for treatment, the model gain or
lift at the 25% depth is appropriate. However, if all customers are due to be treated,
computing AUC might be appropriate.
● If the objective is to make a selection subject to a maximum false positive rate, a ROC
curve is appropriate - otherwise, use the confusion matrix.
● It is vital to remember that the metric you use for model selection is of critical importance
because the model selected based on one metric might not be such a good model for a
different metric.
● During the Business Understanding Phase, it is vital that you understand the intent of the
model and match the metric that best fits that intent.
You can also use these charts and tables to test the performance of models on different time
stamped populations, that is, test how the model performs in different time frames.
1. Score the validation sample or file using the classification model under consideration.
Every individual receives a model score, that is, the Probability_Estimation.
2. Rank the scored file in descending order of the Probability_Estimation.
3. Divide the ranked and scored file into ten equal groups, called deciles, from top(1), 2, 3, 4,
5, 6, 7, 8, 9, and bottom(10). The "top" decile consists of the 10% of individuals most likely
to respond; decile 2 consists of the next 10% of individuals most likely to respond, and so
on for the remaining deciles. The deciles separate and order the individuals on an ordinal
scale that ranges from most to least likely to respond.
of all the responders if you contacted 79% of them all. Therefore, you can see the massive
benefit of using the model.
● In addition, if you compare the gains chart of the model scored on the estimation sub-
sample to the gains chart of the model scored on the validation sub-sample, you expect to
see the same proportion of responders in each decile. If the proportions are comparable,
the model is robust. If there are discrepancies, the model might be over trained.
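Decile analysis can be reproduced with pandas by ranking on the model score and cutting the ranked file into ten equal groups; the scored data set below is an assumption for illustration.

# Decile analysis sketch: rank by score, cut into 10 equal groups, inspect response rate per decile.
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
scores = rng.uniform(0, 1, size=1000)                            # assumed Probability_Estimation
responded = (rng.uniform(0, 1, size=1000) < scores).astype(int)  # assumed observed responses

df = pd.DataFrame({"score": scores, "responded": responded})
# Decile 1 = the 10% of individuals with the highest scores, decile 10 = the lowest.
df["decile"] = pd.qcut(df["score"].rank(method="first", ascending=False), 10, labels=range(1, 11))

summary = df.groupby("decile", observed=True)["responded"].agg(["count", "sum", "mean"])
summary.columns = ["individuals", "responders", "response_rate"]
print(summary.round(3))   # response rate should fall steadily from decile 1 to decile 10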
Summary
In this unit, you have been introduced to Decile Analysis and how you can use it to analyze the
power of the classification models you build.
Feature Engineering
During the process known as feature engineering, new features are created to extract more
information from existing features. These new features can have an improved ability to
explain the variance in the training data and improve the accuracy of the model.
Feature engineering is highly influenced by business understanding. The feature engineering
process can be divided into the following two steps: feature transformation and feature
creation.
Feature transformation and feature creation are outlined as follows:
● Feature creation is the creation of new variables, such as ratios and percentages, to
enhance the prediction. A deep understanding of the business environment, and
knowledge of the types of features that are usually predictive, helps to drive the creation of
these new variables.
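A minimal feature-creation sketch with pandas; the column names and derived ratios are invented examples of the kind of business-driven variables described above.

# Feature creation sketch: derive ratio features from assumed raw columns.
import pandas as pd

customers = pd.DataFrame({
    "total_spend":   [1200.0, 300.0, 560.0, 80.0],
    "num_orders":    [12, 2, 8, 1],
    "support_calls": [1, 4, 0, 2],
})

# New engineered features that may explain more variance than the raw columns alone.
customers["avg_order_value"] = customers["total_spend"] / customers["num_orders"]
customers["calls_per_order"] = customers["support_calls"] / customers["num_orders"]
print(customers)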
Feature Selection
Feature selection is the process of finding out the subset of explanatory variables that best
explain the relationship of independent variables with the target variable. Filter, wrapper, and
embedded techniques have been described earlier in this course.
There are a number of other different ways to accomplish this task:
● You can use domain knowledge to select feature(s) that might possess a higher impact on
the target variable based on your domain business experience.
● You can use visualization to take advantage of simple data visualizations that help your
business to understand the relationship between variables.
● You can use statistical parameters as there are a range of statistical metrics you can utilize
- for example, p-values and variable contributions to the model.
● You can use Principal Component Analysis (PCA). This is a type of dimensionality
reduction technique. PCA is a statistical procedure that allows you to summarize the
information content in large data tables by means of a smaller set of “summary indices”
that can be more easily visualized and analyzed.
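A minimal PCA sketch with scikit-learn, assuming synthetic data; the components act as the "summary indices" mentioned above.

# PCA sketch: summarize many correlated columns with a smaller set of components.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=500, n_features=15, n_informative=5, random_state=2)

X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to scale, so standardize first
pca = PCA(n_components=5).fit(X_scaled)

print("Variance explained by each component:", pca.explained_variance_ratio_.round(3))
print("Total variance explained:", round(pca.explained_variance_ratio_.sum(), 3))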
Algorithm Tuning
Machine learning algorithms are driven by parameters, for example, the number of leaves of a
classification tree, the number of hidden layers in a neural network, or the number of clusters
in a K-Means clustering.
These parameters influence the outcome of the learning process. This is often called
“hyperparameter tuning” and it finds the optimum value for each parameter to improve the
accuracy of the model.
To tune these parameters, you must first have a good understanding of their meaning and
their individual impact on the model.
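A minimal hyperparameter-tuning sketch using a grid search over assumed parameter values for a decision tree; the parameter grid is an illustration only.

# Hyperparameter tuning sketch: grid search over assumed decision tree parameters.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=4)

param_grid = {
    "max_depth": [3, 5, 10, None],      # depth of the tree
    "min_samples_leaf": [1, 5, 20],     # minimum observations per leaf
}

search = GridSearchCV(DecisionTreeClassifier(random_state=4), param_grid, cv=5).fit(X, y)
print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))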
Ensemble methods
The following figure contains a graphic representation of ensemble methods.
● Bootstrap aggregating, often abbreviated as bagging, involves having each model in the
ensemble vote with equal weight. In order to promote model variance, bagging trains each
model in the ensemble using a randomly drawn subset of the training set - for example, the
random forest algorithm combines random decision trees with bagging to achieve very
high classification accuracy.
● Boosting involves incrementally building an ensemble by training each new model instance
to emphasize the training instances that previous models incorrectly classified. In a
number of cases, boosting is shown to yield better accuracy than bagging, but it also tends
to be more likely to over-fit the training data.
● You were introduced to these techniques earlier in this training.
Cross-Validation
Cross-validation is a popular model validation technique that is illustrated in the following
figure.
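A minimal k-fold cross-validation sketch with scikit-learn, assuming synthetic data and five folds:

# Cross-validation sketch: evaluate a model on 5 different train/validation splits.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=6)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Accuracy per fold:", scores.round(3))
print("Mean accuracy:", round(scores.mean(), 3))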
Summary
This lesson introduces you to some of the techniques that are commonly used to improve the
performance of predictive models. These techniques are outlined in the following manner:
Of course, one of the most common ways of improving the accuracy of a forecast is to use a
different algorithm. For example, if you are using a decision tree, try a neural network or
logistic regression to see if these other algorithms improve the performance of the model.
LESSON SUMMARY
You should now be able to:
● Understand modeling phase
Lesson 1
Understanding the Evaluation Phase 200
UNIT OBJECTIVES
LESSON OBJECTIVES
After completing this lesson, you will be able to:
● Explain Evaluation Phase
Previous evaluation steps dealt with factors such as the accuracy and generality of the model.
However, this phase assesses other areas:
● This phase assesses the degree to which the model meets the business objectives. It also
seeks to determine whether there is a business reason that explains why this model is
deficient.
● You can perform an evaluation in another way. For example, you can test the model(s) on
test applications in the real application, if time and budget constraints permit.
● Furthermore, an evaluation also assesses the other data science results you have
generated.
● Data science project results cover models that are necessarily related to the original
business objectives, as well as all other findings. These other findings are not necessarily
related to the original business objectives, but might unveil additional challenges,
information, or hints for future directions.
● Conduct a more thorough review of the data science engagement to determine if there is
any important factor or task that has somehow been overlooked.
● Identify any quality assurance issues.
Summary
This lesson introduces you to the Evaluation phase of the CRISP-DM process. It covers the
following 3 tasks:
● Assess the degree to which the model meets the business objectives.
● Determine if there is any reason why this model is deficient.
● Test the model(s) on test applications in the real application, if time and budget
constraints permit.
Frequently, there is confusion between the Model Assessment task in Phase 4 (the Modeling
phase) and the Evaluation phase of CRISP-DM. The following points clarify the difference:
● In the Model Assessment task (that is, Task 4 of the Modeling phase of CRISP-DM), you
check that you are satisfied with the performance of the model from a data science
perspective by analyzing the confusion matrix and all of the model performance metrics and
charts, and you confirm that the model's explanatory variables and their categories make
"business" sense. You confirm that the model meets the data science success criteria.
● When you have checked every component in the Model Assessment task, you move to the
Evaluation phase. During the Evaluation phase of CRISP-DM, you need to evaluate the
model to assess the degree to which it meets the business objectives, determine if there is
a business reason why this model is deficient and test the model(s) on test applications in
the real application on new, updated data - if time and budget constraints permit. You
confirm if the model meets the business success criteria.
Challenges
When your model is finally deployed and operationalized, you might find that its behavior is
different from your expectations. Therefore, you need to evaluate the model in order to
understand how its performance can deviate from what is expected.
The deviation could be the consequence of interactions with other models and systems, the
variability of the real world, or even adversaries that change the behavior that your data
reflected when you initially trained the model. Examples of this are as follows:
● If you are building a fraud model, the fraudsters are capable of adapting their methods to
evade detection by your model.
● If you are creating marketing campaigns, your competitors are always looking for ways to
change the sales environment to their own advantage.
● A global pandemic can change the way people interact and buy products.
● In certain scenarios, you might need to deploy and use the model, and capture the results
over a period of a few months before you can make a final judgment that the model is
working as expected. An example of this is if you are running a sales response campaign
based on a predictive model that indicates the probability that a customer might purchase
a product or not. In this scenario, you need to wait a few months to ensure you capture all
of the customers responding to the campaign.
Decile Analysis
One way to compare actual to predicted performance is to use decile analysis, which you
looked at earlier in this course. When the model is evaluated on new data, you compare the
decile distribution from the model build to the decile distribution observed on the new data.
You also take the following actions:
● When you train the model, the decile analysis of the scored output is represented in 10
bins, each with defined minimum and maximum scores or estimates (the bin boundaries),
with 10% of the samples in each bin.
● When you evaluate the model on new data, you use the same predefined minimum and
maximum scores or estimates you found in the training phase, and you analyze the
percentage of samples in each bin. If the model is working as you expect, you observe
approximately 10% in each evaluation bin.
● If there is a significant shift in the data distribution, with more than or less than 10% in the
predefined bins, this indicates that there is a shift in the underlying population and the
model is no longer working as you would expect.
In this classification model example, you want to know if customers will buy your new product
"P". To discover this, you must understand the following:
● In the following example, you train your predictive model using an input data set containing
past observations for 1,000 customers.
● When you build the model, the decile analysis splits the data into 10 bins and you can
calculate the average probability per bin of purchasing product "P".
● When you apply the model and evaluate it on up-to-date real-world data, let us say that
your input data set contains observations on 700 customers. You might find there is a
change in the percentage of customers in each bin.
● You can see that the percentage of customers in each bin is no longer 10%. A simple sketch
of this comparison follows.
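A minimal Python sketch of this decile comparison is shown below. The scores are randomly generated for illustration, and the bin boundaries are assumed to come from the training (model build) phase, as described above.

import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
train_scores = rng.beta(2, 5, size=1000)    # e.g. scores for 1,000 customers at model build
new_scores = rng.beta(2.5, 4, size=700)     # e.g. scores for 700 customers at evaluation time

# Decile boundaries defined on the training scores (10% of samples per bin at build time).
bins = np.quantile(train_scores, np.linspace(0, 1, 11))
bins[0], bins[-1] = -np.inf, np.inf         # catch values outside the training score range

new_pct = (pd.cut(pd.Series(new_scores), bins=bins)
             .value_counts(normalize=True, sort=False) * 100)
print(new_pct.round(1))   # bins far from 10% suggest a shift in the underlying population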
Summary
During the Evaluation phase of CRISP-DM, you do the following:
● Evaluate the model to assess the degree to which it meets the business objectives.
● Determine if there is some business reason why this model is deficient.
● Test the model(s) on test applications in the real application on new, updated data - if time
and budget constraints permit.
● Confirm whether the model meets the business success criteria. One way to do this is to use
decile analysis and compare the deciles from when you trained the model to those from when
you use the model on more up-to-date data.
LESSON SUMMARY
You should now be able to:
● Explain Evaluation Phase
Lesson 1
Deployment and Maintenance Phase 207
Lesson 2
End-to-end Scenario 218
UNIT OBJECTIVES
LESSON OBJECTIVES
After completing this lesson, you will be able to:
● Deploy and maintain models
● In order to deploy the data mining result(s) into the business, this task takes the
evaluation results and develops a strategy for the deployment of this information.
● If a general procedure is identified to create the relevant model(s), this procedure is
documented for later deployment.
● Summarize a deployment strategy, including the necessary steps and how to perform
them.
● Monitoring and maintenance are critical issues if the data mining results become part of
the day-to-day business and its environment.
● A careful preparation of a maintenance strategy helps to avoid unnecessarily long periods
of incorrect usage of data mining results.
● This plan takes the specific type of deployment into account.
● At the end of the project, the project leader and their team write up a final report.
● Depending on the deployment plan, this report might only be a summary of the project and
its experiences - if they have not already been documented as an ongoing activity, or it can
be a final and comprehensive presentation of the data mining result(s).
● Generating the final report, which is the final written deliverable of the data mining
engagement.
● Creating the final presentation. There is frequently a meeting at the conclusion of the
project where the results are verbally presented to the customer.
● Reviewing the project to assess what has gone right and what has gone wrong, what has
been done well, and what needs to be improved.
● In ideal projects, experience documentation also covers any reports that are written by
individual project members over the course of the project phases and their tasks.
CRISP-DM: Update
The following figure examines the extent to which a model's predictive performance degrades
over time.
Summary
This lesson covers the sixth and final phase of the CRISP-DM process, that is, the deployment
phase. There are 4 tasks in this phase:
● Plan deployment
● Plan monitoring and maintenance
● Production of the final report
● Review of the project
There are a number of options that you need to consider. These must be fully discussed and
agreed during the Business Understanding phase. Examples of these options are as follows:
● Model scoring: The model is scored on the new apply data. This refers to the score or
probability that is provided to the business and used to underpin actions, decisions, and
the development of business strategy.
● The model is often represented as a mathematical equation that can be expressed as
code - for example, SQL (specific to a database), R, SAS, C, Java, HTML, and so on. The code
is created and then deployed in a database, or run against apply data that could be held in
(for example) a text file. The output from this operation is the scores, probabilities, deciles,
and so on that are required. When the model is applied and scored, the output is provided to
the business and used to trigger actions and decisions. A simple sketch of this idea appears
after this list of options.
● Model integration with SAP BusinessObjects BI Reporting: In this scenario, the model is
integrated with the BI reports, and the scores or probabilities are visualized - for example,
score distributions per geography.
● Model integrated with the application: This means the model is integrated with other
applications and used in day-to-day business decision making - for example, in a call
center.
● Cloud: Models are trained using predictive analytics software, either in the cloud or on
premises. These models can be deployed onto a database in the cloud.
● Real-time scoring: On occasion, you might need to deploy models for real-time scoring.
In the real-time scenario, the model usually receives a single data point from a caller, and it
provides a prediction for this data point in real time. Other examples include the following:
● Predictive maintenance, that is, predicting whether a particular machine part is likely to fail
in the next few minutes, given the sensor's real-time data.
● Estimating how long a food delivery takes based on the average delivery time in an area
over the course of the past 30 minutes. This estimate also considers the ingredients of the
order and real-time traffic information.
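The "model as code" bullet above, and the real-time examples just listed, both come down to applying a scoring equation to a data record. The following Python sketch is a hedged illustration of that idea for a hypothetical logistic regression: the intercept, coefficients, and field names are all invented, standing in for values exported from a trained model.

import math

# Hypothetical coefficients exported from a trained logistic regression model.
INTERCEPT = -2.1
WEIGHTS = {"age": 0.03, "num_purchases": 0.45, "days_since_last_visit": -0.02}

def score(customer: dict) -> float:
    """Return the purchase probability for one customer record."""
    z = INTERCEPT + sum(WEIGHTS[k] * customer[k] for k in WEIGHTS)
    return 1 / (1 + math.exp(-z))   # logistic (sigmoid) function turns the score into a probability

# Scoring a single incoming data point, as in a real-time scenario.
print(round(score({"age": 35, "num_purchases": 4, "days_since_last_visit": 10}), 3))

The same equation could equally be expressed as SQL and deployed in a database, which is the batch-scoring variant described above.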
The SAP HANA database system can support real-time decisioning. One option is to deploy
models on "data streams". for more information about real-time and batch model scoring, see
the following:
● https://cloud.google.com/solutions/machine-learning/minimizing-predictive-serving-
latency-in-machine-learning
● Act on new information as soon as it arrives. This information includes alerts, notifications,
and data that precipitates immediate responses to changing conditions.
These processes are highly scalable - they can process hundreds of thousands, or even
millions, of events per second.
There is a range of predictive models that can be deployed on SAP HANA smart data
access/integration/quality, from the automated Application Function Library (AFL) system
and from SAP HANA Predictive Analysis Library (PAL) - for example, incremental
classification and clustering algorithms.
How does event stream processing (ESP) work? The following points clarify the means by which you can use ESP to
achieve your goals:
● ESP lets you define continuous queries that are applied to one or more streams of
incoming data to produce streams of output events.
● This is what sets ESP apart from simple event processing. Using ESP gives you the ability
to examine incoming events in the context of other events or other data to understand
what is happening.
● In many cases, a single event might not contain much information or be very interesting by
itself, but when combined with other events, you might be able to observe a trend or
pattern that is very meaningful.
● Take the following example: you want to monitor the temperature of equipment to ensure
it does not overheat. You have access to real-time sensor data that tells you the
temperature of the equipment. However, that data alone might not be very useful. For
example, if the equipment temperature is 90 degrees, but the equipment is located outside
and the outside air temperature is 85 degrees, this might be a normal operating temperature
for the equipment. However, if the equipment is 90 degrees when the air temperature is only
30 degrees, that could indicate an imminent problem.
● In this example, an individual data point does not tell you much, so you want to analyze
changes over the course of the last hour to see whether there is a trend that indicates a
potential problem. Moreover, you might want to compute a moving average and compare it
to historical norms for similar equipment. A simple sketch of this idea follows.
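The following Python sketch illustrates the temperature-monitoring idea just described: it computes a moving average of the difference between equipment temperature and air temperature and flags readings that exceed an assumed historical norm. The readings, window size, and 20-degree threshold are all invented for the example.

import pandas as pd

readings = pd.DataFrame({
    "equipment_temp": [88, 89, 90, 91, 95, 99, 104, 110],
    "air_temp":       [85, 85, 84, 60, 45, 32, 30, 30],
})

readings["temp_delta"] = readings["equipment_temp"] - readings["air_temp"]
readings["delta_ma"] = readings["temp_delta"].rolling(window=3).mean()   # moving average

# Alert when the moving average of the difference exceeds an assumed historical norm.
alerts = readings[readings["delta_ma"] > 20]
print(alerts)

In a production stream, the same logic would run continuously on incoming events rather than on a static table, which is exactly the kind of continuous query that ESP supports.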
Summary
This lesson introduces you to a number of the deployment options for your data science
model and output.
In addition, you learn about the following:
● Your analysis and models use data based on past experiences and events to identify future
risks and opportunities. However, conditions and environments are constantly changing
(for example, new products are launched, competitors reduce the prices, and so on) and
this needs to be reflected in the models. If this does not happen, model performances
decay over time.
● Models that exhibit degraded performances produce scores that are incorrect and must
be replaced. If this does not happen, they put decision-makers and their decisions at
severe risk of inaccuracy and unreliability, which can ultimately affect bottom-line
profitability and cause damage to the organization.
● Organizations must develop built-in processes to systematically detect performance
reduction in all deployed models to identify those that are obsolete and to replace them
with new ones.
● A systematic model management life cycle methodology is of paramount importance.
For more information about deploying predictive models, see the following:
● https://machinelearningmastery.com/deploy-machine-learning-model-to-production/
● https://www.kdnuggets.com/2019/06/approaches-deploying-machine-learning-
production.html
● Random spot checks are not the best approach to measure the effectiveness of a model
and performance levels.
● Instead, each model must be thoroughly evaluated and measured for accuracy, predictive
power, stability over time, and other appropriate metrics that are defined by the objectives
of each model.
● Ideally, the performance of each model must be evaluated every time it is applied or used.
However, for those models that you use infrequently, evaluating their performance gives you
the opportunity to collect fresh data and rebuild models that would otherwise have been
missed.
● Factors to consider are the speed of change in your particular business environment, the
age of the model, the data availability for monitoring and for rebuilding the models, and the
potential risks of using under-performing models.
● In addition, there are business constraints and limited resources.
Update Frequency
Factors that influence how frequently you must update your models are as follows:
● Certain business environments are subject to constant change while other environments
undergo very minimal change.
● Clearly, the more frequent the changes, the more often the models are affected and run
the risk of being outdated.
● For example, prepaid telecommunications predictive models need to be updated
frequently because of the rapid business environment changes, new product and service
offers, the release of new types of handset and competition.
● When "internal" or "external" events occur, models must be monitored more closely.
● What is categorized as an internal event could include changes to the product,
underwriting, distribution channels, and processing procedures, or the launch of new
products and services.
● What is categorized as an external event could include changes in legislation, regulation,
customer behavior, and competitor activity (launching campaigns, releasing new
products, and relocating to new geographical areas), or a global pandemic.
You must also factor in the frequency with which you use the model for the following reasons:
● If you use the model on a daily basis, and evaluation tests are carried out each time it is
used, then it is easy to identify when its performance starts to degrade.
● However, if a model is used quarterly, for example, changes in the business environment
might have been missed, along with opportunities to collect fresh data and rebuild the
models over the past three months. To avoid these problems, implement a systematic
checking process.
You must also factor in the age of the model.
You must also factor in data readiness for the following reasons:
● The availability of data can affect the monitoring of models as well as the redevelopment of
models. In both cases, appropriate data needs to be available.
● There are many factors that affect data readiness. Data occasionally needs to be collected
to reflect rare events; if data is sourced from a third-party agency, there might be delays;
and, to develop predictive models, you need to wait for the response data to be collected.
An example of the response data you need is the set of responses to a marketing campaign.
You need this data in order for the model to be properly trained.
There are domains where predictions are ordered by time. Examples are time-series
forecasting, such as retail forecasting, and predictions on streaming data (for example, in
predictive maintenance models), where the problem of concept drift is more likely.
Therefore, in such scenarios, you must explicitly check for these issues and address them.
While concept drift is about the target variable, data drift describes a change in the
properties of the explanatory variables. In this case, it is not the definition of a customer that
changes, but the values of the features that define them.
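One simple, hedged way to check for data drift (not prescribed by CRISP-DM or by any SAP tool) is to compare the training-time and scoring-time distributions of an explanatory variable, for example with a two-sample Kolmogorov-Smirnov test. In the Python sketch below, the data are synthetic and the 0.05 threshold is only a common convention, not a rule.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=50, scale=10, size=5000)   # feature values at model build
live_feature = rng.normal(loc=55, scale=12, size=5000)    # same feature on new apply data

stat, p_value = ks_2samp(train_feature, live_feature)     # compare the two distributions
if p_value < 0.05:
    print(f"Possible data drift detected (KS statistic={stat:.3f}, p={p_value:.4f})")
else:
    print("No significant drift detected")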
For more information on data drift, see the following:
● https://machinelearningmastery.com/gentle-introduction-concept-drift-machine-
learning/
● https://en.wikipedia.org/wiki/Concept_drift
● https://www.explorium.ai/blog/understanding-and-handling-data-and-concept-drift/
● In the case of a model rebuild, you totally rebuild the model. This means that the model can
use different explanatory variables if they are available. In certain cases, you start from the
beginning of the CRISP-DM process and reconfirm the current business understanding and
the original assumptions of the model.
● When any of the assumptions are updated or redefined, the model build process starts
from the beginning, including data collection, model building and evaluation.
CRISP-DM Update
The following figure points to the key things you must take note of during the CRISP-DM
update.
Summary
This lesson covers the extent to which the predictive models that you deploy must be
monitored for acceptable performance because model performance diminishes over time.
Each model must be evaluated and measured for accuracy, predictive power, stability over
time, and other appropriate metrics that are defined by the objectives of each model.
Factors that influence how frequently you must update your models include the speed of
change in your business environment, the age of the model, data readiness, and how often
you use the model.
This lesson also covers the ideas behind concept drift and data drift.
LESSON SUMMARY
You should now be able to:
● Deploy and maintain models
LESSON OBJECTIVES
After completing this lesson, you will be able to:
● Complete a challenge
LESSON SUMMARY
You should now be able to:
● Complete a challenge
Lesson 1
SAP Data Science Applications 220
UNIT OBJECTIVES
LESSON OBJECTIVES
After completing this lesson, you will be able to:
● Explain data science applications
Smart Assist
The following figure provides you with an overview of Smart Assist in SAC.
Time-Series Forecasting
The following figure shows you what you can do with time-series forecasting in SAC.
R Visualizations in SAC
The following figure provides you with an overview of R Visualizations in SAC.
Smart Predict
You can perform many operations with Smart Predict in SAC.
Figure 323: SAP HANA: Advanced Analytical Processing-In-Database and EML Capabilities
Figure 329: SAP HANA Multimodal Example Scenario: PAL, Spatial, and TensorFlow
Figure 330: SAP HANA Multimodal Example Scenario: PAL, Spatial, and TensorFlow, 2
The key points about AIF are as follows:
● SAP AI Foundation (AIF) is used internally in SAP applications but currently not available
for customers.
● AIF enables SAP to embed intelligent data science ML/AI applications into SAP solutions,
and supports the creation of the Intelligent Enterprise.
Intelligent Scenario Lifecycle Management (ISLM) embeds HANA PAL and HANA APL into the
SAP S/4HANA business applications without the need for coding.
Summary
Figure 344: R
● The basic installation includes all of the commonly used statistical techniques such as
univariate analysis, categorical data analysis, hypothesis testing, generalised linear
modeling, multivariate analysis and time series analysis.
● There are also powerful facilities to produce statistical graphics.
● The base software is supplemented by over 5000 add-on packages developed by R users.
These packages cover a broad range of statistical techniques.
● R is command driven, so it takes longer to master than point-and-click software. However,
it has greater flexibility.
Python
Python is a concise, easy-to-read language. The key points about Python are as follows:
● Python was designed with an emphasis on code readability, and its syntax allows
programmers to express concepts in fewer lines of code.
● Python is concise and easy to read, and it can be used for everything from web
development to software development and scientific applications.
● Python's standard library is very extensive and contains built-in modules that provide
access to system functionality that would otherwise be inaccessible to Python
programmers, as well as modules written in Python that provide standardized solutions for
many problems that occur in everyday programming.
● There are 1000s of external libraries that can be added.
● Python 2.0 was released in 2000 and Python 3.0 was released in 2008. Python 3.0 was a
major revision of the language that is not completely backward-compatible, and much
Python 2 code does not run unmodified on Python 3.
Orange
References for R
Data Sets
Summary
This unit introduces you to some of the popular tools for statistical analysis. It also introduces
you to the following tools:
● MS Excel is commonly used for relatively small data sets. It is very accessible, provides
lots of easy-to-use functionality, and has good graphical features.
● Orange is a free, component-based visual programming software package for data
visualization, machine learning, data mining, and data analysis. It is ideal for beginners,
data analysts, and data scientists who like visual programming.
● R and Python require you to learn a programming language, but provide a huge amount of
functionality and extremely powerful graphical capability. If you are going to take your
interest in data science further, you should consider learning one or both of these
programming languages.
● Congratulations! You have now completed DSC100 Data Science Fundamentals.
LESSON SUMMARY
You should now be able to:
● Explain data science applications