DATA ANALYTICS Notes UNIT4
(Professional Elective - I)
Subject Code: CS513PE
UNIT 4
NOTES MATERIAL
OBJECT SEGMENTATION
TIME SERIES METHODS
Faculty:
KAILASH SINHA
DEPARTMENT OF CSE
SHREE DATTHA INSTITUTE OF ENGINEERING
AND SCIENCE
DATA ANALYTICS UNIT-4
UNIT - IV
Object Segmentation & Time Series Methods
Syllabus:
Object Segmentation: Regression Vs Segmentation – Supervised and Unsupervised
Learning, Tree Building – Regression, Classification, Overfitting, Pruning and Complexity,
Multiple Decision Trees etc.
Time Series Methods: Arima, Measures of Forecast Accuracy, STL approach, Extract
features from generated model as Height, Average Energy etc and analyze for
prediction
Topics:
Object Segmentation:
Supervised and Unsupervised Learning
Segmentation & Regression Vs Segmentation
Regression, Classification, Overfitting,
Decision Tree Building
Pruning and Complexity
Multiple Decision Trees etc.
Unit-4 Objectives:
1. To explore segmentation and regression vs segmentation
2. To learn regression, classification and overfitting
3. To explore decision tree building, multiple decision trees, etc.
4. To learn ARIMA and measures of forecast accuracy
5. To understand the STL approach
Unit-4 Outcomes:
After completion of this course, students will be able to:
1. Describe segmentation and regression vs segmentation
2. Demonstrate regression, classification and overfitting
3. Analyze decision tree building, multiple decision trees, etc.
4. Explore ARIMA and measures of forecast accuracy
5. Describe the STL approach
KS SDES – C S E 2|P a g e
The main differences between Supervised and Unsupervised learning are given below:
1. Feedback: A supervised learning model takes direct feedback to check whether it is
predicting the correct output. An unsupervised learning model does not take any feedback.
2. Supervision: Supervised learning needs supervision to train the model. Unsupervised
learning does not need any supervision to train the model.
3. Data: Supervised learning can be used for those cases where we know the inputs as well
as the corresponding outputs. Unsupervised learning can be used for those cases where we
have only input data and no corresponding output data.
4. Closeness to true AI: Supervised learning is not close to true Artificial Intelligence,
because the model must first be trained on labelled data and only then can it predict the
correct output. Unsupervised learning is closer to true Artificial Intelligence, as it
learns in much the same way as a child learns daily routine things from experience.
Segmentation
Segmentation refers to the act of segmenting data according to your company’s
needs in order to refine your analyses based on a defined context. It is a
technique of splitting customers into separate groups depending on their
attributes or behavior.
The purpose of segmentation is to better understand your customers (visitors)
and to obtain actionable data in order to improve your website or mobile app. In
concrete terms, a segment enables you to filter your analyses based on certain
elements (single or combined).
Segmentation can be done on elements related to a visit,
as well as on elements related to multiple visits during a
studied period.
Steps:
Define purpose – Already mentioned in the statement above
Identify critical parameters – Some of the variables which come to mind are
skill, motivation, vintage, department, education etc. Let us say that, based on past
experience, we know that skill and motivation are the most important parameters.
Also, for the sake of simplicity, we select just 2 variables. Taking additional variables
will increase the complexity, but can be done if it adds value.
Granularity – Let us say we are able to classify both skill and motivation into
High and Low using various techniques.
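The granularity step above can be sketched in a few lines of Python. This is only an illustration: the employee records, the scores and the cut-off of 50 are assumptions made up for the example, not values from these notes.

```python
# Hypothetical employee records; names, scores and the cut-off are illustrative only.
employees = [
    {"name": "A", "skill": 80, "motivation": 75},
    {"name": "B", "skill": 85, "motivation": 30},
    {"name": "C", "skill": 40, "motivation": 90},
    {"name": "D", "skill": 35, "motivation": 20},
]

def level(score, cutoff=50):
    """Map a numeric score to the coarse granularity 'High' or 'Low'."""
    return "High" if score >= cutoff else "Low"

def segment(record):
    """A segment is the pair (skill level, motivation level)."""
    return (level(record["skill"]), level(record["motivation"]))

# Group employee names by segment.
segments = {}
for e in employees:
    segments.setdefault(segment(e), []).append(e["name"])

for key, names in sorted(segments.items()):
    print(key, names)
```

With two variables at two levels each, at most four segments can appear; adding a third variable would double that, which is the complexity cost mentioned above.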
There are two broad sets of methodologies for segmentation:
Objective (supervised) segmentation
Non-Objective (unsupervised) segmentation
Objective Segmentation
Segmentation to identify the type of customers who would respond to a
particular offer.
Segmentation to identify high spenders among customers who will use the
e-commerce channel for festive shopping.
Segmentation to identify customers who will default on their credit obligation
for a loan or credit card.
Non-Objective Segmentation
https://www.yieldify.com/blog/types-of-market-segmentation/
Segmentation of the customer base to understand the specific profiles which exist
within the customer base so that multiple marketing actions can be personalized
for each segment
Segmentation of geographies on the basis of affluence and lifestyle of people
living in each geography so that sales and distribution strategies can be
formulated accordingly.
Hence, it is critical that the segments created by an objective segmentation
methodology differ with respect to the stated objective (e.g. response to an offer).
However, in the case of a non-objective methodology, the segments differ with
respect to the “generic profile” of the observations belonging to each segment, but not
with regard to any specific outcome of interest.
The most common techniques for building non-objective segmentation are cluster
analysis, K nearest neighbor techniques etc.
Regression Vs Segmentation
Regression analysis focuses on finding a relationship between a dependent
variable and one or more independent variables.
Predicts the value of a dependent variable based on the value of at least one
independent variable.
Explains the impact of changes in an independent variable on the dependent
variable.
We use linear or logistic regression techniques to develop accurate models
for predicting an outcome of interest.
Often, we create separate models for separate segments.
Segmentation methods such as CHAID or CRT are used to judge their effectiveness.
Creating a separate model for each segment may be time-consuming and not
worth the effort. However, creating a separate model for each segment may also provide
higher predictive power.
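A small sketch of this trade-off, assuming two hypothetical segments whose relationship between x and y differs. The data and the helper names (fit_line, sse) are invented for illustration; the fit itself is ordinary least squares.

```python
def fit_line(points):
    """Ordinary least squares for y = a + b*x on a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def sse(points, a, b):
    """Sum of squared errors of the line y = a + b*x on the given points."""
    return sum((y - (a + b * x)) ** 2 for x, y in points)

# Hypothetical data: the two segments follow different relationships.
data = {
    "segment_1": [(1, 2), (2, 4), (3, 6), (4, 8)],    # y = 2x
    "segment_2": [(1, 10), (2, 9), (3, 8), (4, 7)],   # y = 11 - x
}

# One pooled model for all observations.
pooled = [p for pts in data.values() for p in pts]
pooled_sse = sse(pooled, *fit_line(pooled))

# One model per segment.
segmented_sse = sum(sse(pts, *fit_line(pts)) for pts in data.values())

print("pooled SSE:", pooled_sse)
print("segmented SSE:", segmented_sse)
```

Here the per-segment models fit far better than the pooled model, at the cost of building and maintaining two models instead of one.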
Decision Tree is a supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems.
Decision Trees usually mimic human thinking ability while making a decision,
so it is easy to understand.
A decision tree simply asks a question and, based on the answer (Yes/No), it
further splits the tree into subtrees.
It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
It is a tree-structured classifier, where internal nodes represent the features of a
dataset, branches represent the decision rules and each leaf node represents the
outcome.
In a Decision tree, there are two types of nodes: the Decision Node and the Leaf Node.
Decision nodes are used to make any decision and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any
further branches.
Basic Decision Tree Learning Algorithm:
Now that we know what a Decision Tree is, we’ll see how it works internally.
There are many algorithms that construct Decision Trees, but one of
the best known is the ID3 algorithm. ID3 stands for Iterative Dichotomiser 3.
Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: A node that is split into sub-nodes is called the parent node of those
sub-nodes, and the sub-nodes are called its child nodes. The root node is the parent of the
nodes directly below it.
Decision Tree Representation:
Each non-leaf node is connected to a test that splits its set of possible
answers into subsets corresponding to different test results.
Each branch carries a particular test result's subset to another node.
Each node is connected to a set of possible answers.
The general structure of a decision tree is as follows: each node in the tree specifies a
test of some attribute of the instance, and each branch descending from that node
corresponds to one of the possible values for this attribute.
An instance is classified by starting at the root node of the decision tree, testing
the attribute specified by this node, then moving down the tree branch
corresponding to the value of the attribute. This process is then repeated at the
node on this branch and so on until a leaf node is reached.
After a decision tree learns classification rules, it can also be re-represented as a set of
if-then rules in order to improve readability.
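The classification walk and the if-then re-representation described above can be sketched as follows. The tree, its attribute names and its values are hypothetical; a nested dictionary is just one convenient way to hold a tree in Python.

```python
# Hypothetical tree: internal nodes test an attribute; leaves are class labels.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "humidity",
                  "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rain": {"attribute": "wind",
                 "branches": {"strong": "no", "weak": "yes"}},
    },
}

def classify(node, instance):
    """Start at the root, follow the branch matching each tested attribute,
    and stop when a leaf (a plain label) is reached."""
    while isinstance(node, dict):
        value = instance[node["attribute"]]
        node = node["branches"][value]
    return node

def to_rules(node, conditions=()):
    """Turn every root-to-leaf path into one IF ... THEN ... rule."""
    if not isinstance(node, dict):
        lhs = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {lhs} THEN {node}"]
    rules = []
    for value, child in node["branches"].items():
        rules += to_rules(child, conditions + ((node["attribute"], value),))
    return rules

print(classify(tree, {"outlook": "sunny", "humidity": "normal", "wind": "weak"}))
for rule in to_rules(tree):
    print(rule)
```

Each leaf produces exactly one rule, so the rule set has as many rules as the tree has leaves.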
How does the Decision Tree algorithm Work?
The decision of making strategic splits heavily affects a tree’s accuracy. The decision
criteria are different for classification and regression trees.
Decision trees use multiple algorithms to decide to split a node into two or more
sub-nodes. The creation of sub-nodes increases the homogeneity of resultant sub-nodes. In
other words, we can say that the purity of the node increases with respect to the target
variable. The decision tree splits the nodes on all available variables and then selects the
split which results in the most homogeneous sub-nodes.
Tree Building: Decision tree learning is the construction of a decision tree from class-
labeled training tuples. A decision tree is a flow-chart-like structure, where each
internal (non-leaf) node denotes a test on an attribute, each branch represents the
outcome of a test, and each leaf (or terminal) node holds a class label. The topmost node
in a tree is the root node. There are many specific decision-tree algorithms. Notable ones
include the following.
ID3 → (Iterative Dichotomiser 3)
C4.5 → (successor of ID3)
CART → (Classification And Regression Tree)
CHAID → (Chi-square Automatic Interaction Detection): Performs multi-level splits when
computing classification trees
MARS → (multivariate adaptive regression splines): Extends decision trees to handle
numerical data better
Conditional Inference Trees → Statistics-based approach that uses non-parametric tests
as splitting criteria, corrected for multiple testing to avoid over fitting.
The ID3 algorithm builds decision trees using a top-down greedy search approach
through the space of possible branches with no backtracking. A greedy algorithm, as the
name suggests, always makes the choice that seems to be the best at that moment.
In a decision tree, to predict the class of a given record, the algorithm starts
from the root node of the tree. It compares the value of the root attribute with
the corresponding attribute of the record (real dataset) and, based on the comparison,
follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other
sub-nodes and moves further. It continues the process until it reaches a leaf node of the
tree. The complete process can be better understood using the below algorithm:
Step-1: Begin the tree with the root node, say S, which contains the complete
dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure
(ASM).
Step-3: Divide S into subsets that contain the possible values of the best
attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset
created in Step-3.
Step-6: Continue this process until a stage is reached where you cannot
further classify the nodes; call the final node a leaf node.
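The steps above can be sketched as a short recursive function. This is a minimal illustration rather than a full ID3 implementation: it assumes categorical attributes only, uses information gain as the attribute selection measure, and runs on a made-up six-row dataset whose column names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(rows, target):
    """Entropy (in bits) of the target column over the given rows."""
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(rows, attribute, target):
    """Entropy before the split minus the weighted entropy after it."""
    total = len(rows)
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r for r in rows if r[attribute] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder

def id3(rows, attributes, target):
    labels = [r[target] for r in rows]
    # Step-6: stop when the node is pure or no attribute is left (majority label).
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step-2: pick the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    # Step-4: create the decision node for that attribute.
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    # Step-3 and Step-5: split on each value and recurse on the subset.
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        node["branches"][value] = id3(subset, remaining, target)
    return node

# Step-1: the root starts with the complete (hypothetical) dataset.
rows = [
    {"outlook": "sunny", "wind": "weak", "play": "no"},
    {"outlook": "sunny", "wind": "strong", "play": "no"},
    {"outlook": "overcast", "wind": "weak", "play": "yes"},
    {"outlook": "overcast", "wind": "strong", "play": "yes"},
    {"outlook": "rain", "wind": "weak", "play": "yes"},
    {"outlook": "rain", "wind": "strong", "play": "no"},
]
tree = id3(rows, ["outlook", "wind"], "play")
print(tree)
```

On this data the root splits on outlook, because that split leaves only the rain branch impure; the rain branch is then split on wind.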
Entropy:
Entropy is a measure of the randomness in the information being
processed. The higher the entropy, the harder it is to draw any
conclusions from that information. Flipping a coin is an example of
an action that provides information that is random.
From the graph, it is quite evident that the entropy H(X) is zero when the probability is
either 0 or 1. The entropy is maximum when the probability is 0.5, because this represents
perfect randomness in the data and there is no chance of perfectly determining the
outcome.
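The shape of that curve can be checked numerically. The sketch below assumes a two-outcome event (such as a coin flip) with probability p of one outcome, for which H(X) = -p log2(p) - (1 - p) log2(1 - p); the function name is chosen for this example.

```python
from math import log2

def binary_entropy(p):
    """Entropy H(X), in bits, of a two-outcome event with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # by convention, 0 * log2(0) is taken as 0
    return -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p = {p:.1f}  H = {binary_entropy(p):.3f}")
```

A fair coin (p = 0.5) gives the maximum of 1 bit, while a certain outcome (p = 0 or p = 1) gives 0.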
Information Gain
Information gain or IG is a statistical property that
measures how well a given attribute separates the training
examples according to their target classification.
Constructing a decision tree is all about finding an attribute
that returns the highest information gain and the smallest
entropy.
ID3 follows the rule: a branch with an entropy of zero is a leaf node, and a branch with
entropy more than zero needs further splitting.
In order to derive the Hypothesis space, we compute the Entropy and Information Gain
of Class and attributes. For them we use the following statistics formulae:
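InformationGain(Attribute):
$$I(p_i, n_i) = -\frac{p_i}{p_i + n_i}\log_2\frac{p_i}{p_i + n_i} - \frac{n_i}{p_i + n_i}\log_2\frac{n_i}{p_i + n_i}$$
Entropy(Attribute):
$$\mathrm{Entropy}(\mathrm{Attribute}) = \sum_i \frac{p_i + n_i}{P + N}\, I(p_i, n_i)$$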
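A worked sketch of these formulae follows. It assumes p_i and n_i are the counts of positive and negative examples for the i-th value of the attribute, P and N are the totals, and the gain is taken as I(P, N) minus Entropy(Attribute), the usual ID3 definition. The counts are hypothetical and the function names are chosen for this example.

```python
from math import log2

def info(p, n):
    """I(p, n): entropy in bits of a node with p positive and n negative examples."""
    total = p + n
    return -sum((c / total) * log2(c / total) for c in (p, n) if c > 0)

def attribute_entropy(branches):
    """Weighted average of I(p_i, n_i) over the (p_i, n_i) pairs of an attribute."""
    P = sum(p for p, _ in branches)
    N = sum(n for _, n in branches)
    return sum((p + n) / (P + N) * info(p, n) for p, n in branches)

def gain(branches):
    """Entropy of the class minus the entropy of the attribute."""
    P = sum(p for p, _ in branches)
    N = sum(n for _, n in branches)
    return info(P, N) - attribute_entropy(branches)

# Hypothetical attribute with three values, given as (p_i, n_i) per value.
branches = [(2, 3), (4, 0), (3, 2)]

print(f"I(P, N)            = {info(9, 5):.3f}")
print(f"Entropy(Attribute) = {attribute_entropy(branches):.3f}")
print(f"Gain               = {gain(branches):.3f}")
```

The middle value (4 positive, 0 negative) contributes nothing to the attribute entropy because that branch is already pure.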
Data set:
Requires little data preparation. Other techniques often require data normalization,
the creation of dummy variables and the removal of blank values.
Able to handle both numerical and categorical data. Other techniques are usually
specialized in analysing datasets that have only one type of variable. (For example,
relation rules can be used only with nominal variables while neural networks can be
used only with numerical variables.)
Uses a white box model. If a given situation is observable in a model the
explanation for the condition is easily explained by Boolean logic. (An example of a
black box model is an artificial neural network since the explanation for the results is
difficult to understand.)
Possible to validate a model using statistical tests. That makes it possible to
account for the reliability of the model.
Robust: Performs well with large datasets. Large amounts of data can be analyzed
using standard computing resources in reasonable time.
Tools used to make Decision Tree:
Many data mining software packages provide implementations of one or more decision
tree algorithms. Several examples include:
SAS Enterprise Miner
Matlab
R (an open source software environment for statistical computing which includes
several CART implementations such as the rpart, party and randomForest packages)
Weka (a free and open-source data mining suite, contains many decision tree
algorithms)
Orange (a free data mining software suite, which includes the tree module orngTree)
KNIME
Microsoft SQL Server
Scikit-learn (a free and open-source machine learning library for the Python
programming language).
Salford Systems CART (which licensed the proprietary code of the original CART
authors)
IBM SPSS Modeler
RapidMiner
Classification Trees:
A classification tree is an algorithm where
the target variable is fixed or categorical. The
algorithm is then used to identify the “class”
within which a target variable would most
likely fall.
An example of a classification-type
problem would be determining who will or
will not subscribe to a digital platform; or
who will or will not graduate from high
school.
These are examples of simple binary classifications where the categorical
dependent variable can assume only one of two, mutually exclusive values.
Regression Trees
A regression tree refers to an algorithm
where the target variable is continuous
and the algorithm is used to predict
its value.
As an example of a regression type
problem, you may want to predict the
selling prices of a residential house, which
is a continuous dependent variable.
This will depend on both continuous factors
like square footage as well as categorical
factors.
A node in which all cases have the same value for the dependent variable is a
homogeneous node that requires no further splitting because it is "pure." For
categorical (nominal, ordinal) dependent variables the common measure of impurity is
Gini, which is based on squared probabilities of membership for each category. Splits
are found that maximize the homogeneity of child nodes with respect to the value of the
dependent variable.
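As a small illustration of an impurity measure "based on squared probabilities", the sketch below computes Gini impurity as 1 minus the sum of squared class proportions, and the size-weighted impurity of a two-way split. The labels are made up for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def split_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)

print(gini(["yes", "yes", "yes"]))                 # pure node
print(gini(["yes", "no"]))                         # evenly mixed node
print(split_gini(["yes", "yes"], ["no", "no"]))    # split into two pure children
```

A split that sends each class to its own child drives the weighted impurity to zero, which is the homogeneity the tree is trying to maximize.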
One of the questions that arises in a decision tree algorithm is the optimal size of the
final tree. A tree that is too large risks overfitting the training data and poorly
generalizing to new samples. A small tree might not capture important structural
information about the sample space. However, it is hard to tell when a tree algorithm
should stop because it is impossible to tell if the addition of a single extra node will
dramatically decrease error. This problem is known as the horizon effect. A common
strategy is to grow the tree until each node contains a small number of instances, then
use pruning to remove nodes that do not provide additional information.
Figure: Before and After pruning
Pruning should reduce the size of a learning tree without reducing predictive accuracy
as measured by a cross-validation set. There are many techniques for tree pruning that
differ in the measurement that is used to optimize performance.
Pruning Techniques:
Pruning processes can be divided into two types: Pre-Pruning & Post-Pruning
Pre-pruning procedures prevent a complete induction of the training set by
applying a stop() criterion in the induction algorithm (e.g. max. tree depth or
information gain (Attr) > minGain). They are considered to be more efficient because
they do not induce an entire set, but rather trees remain small from the start.
Post-Pruning (or just pruning) is the most common way of simplifying trees. Here,
nodes and subtrees are replaced with leaves to reduce complexity.
The procedures are differentiated on the basis of their approach in the tree: Top-down
approach & Bottom-Up approach
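One concrete post-pruning scheme is reduced-error pruning, sketched below as a bottom-up pass. This is a specific technique chosen for illustration, not the only one the text covers. The sketch assumes trees stored as nested dictionaries in which every internal node also records its majority training class under a "majority" key; the tree and the validation rows are hypothetical.

```python
def predict(node, instance):
    """Walk the tree; fall back to the node's majority class for unseen values."""
    while isinstance(node, dict):
        node = node["branches"].get(instance[node["attribute"]], node["majority"])
    return node

def prune(node, validation, target):
    """Bottom-up: prune the children first, then replace this subtree with a
    leaf if the leaf is at least as accurate on the validation rows reaching it."""
    if not isinstance(node, dict) or not validation:
        return node
    pruned = {"attribute": node["attribute"], "majority": node["majority"],
              "branches": {}}
    for value, child in node["branches"].items():
        subset = [r for r in validation if r[node["attribute"]] == value]
        pruned["branches"][value] = prune(child, subset, target)
    leaf = node["majority"]
    leaf_correct = sum(r[target] == leaf for r in validation)
    tree_correct = sum(predict(pruned, r) == r[target] for r in validation)
    return leaf if leaf_correct >= tree_correct else pruned

# Hypothetical tree; the wind split under "rain" fits noise in the training data.
tree = {
    "attribute": "outlook", "majority": "yes",
    "branches": {
        "sunny": "no",
        "rain": {"attribute": "wind", "majority": "yes",
                 "branches": {"strong": "no", "weak": "yes"}},
    },
}
validation = [
    {"outlook": "rain", "wind": "strong", "play": "yes"},
    {"outlook": "rain", "wind": "weak", "play": "yes"},
    {"outlook": "sunny", "wind": "weak", "play": "no"},
]
pruned_tree = prune(tree, validation, "play")
print(pruned_tree)
```

The wind subtree is replaced by the leaf "yes" because the leaf classifies both validation rows that reach it correctly, while the subtree gets only one right. The root split is kept because it still improves validation accuracy.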
CHAID:
CHAID stands for CHI-squared Automatic Interaction Detector. Its predecessor is AID
(Automatic Interaction Detection): Morgan and Sonquist (1963) proposed this simple
method for fitting trees to predict a quantitative variable.
Each predictor is tested for splitting as follows: sort all the n cases on the
predictor and examine all n-1 ways to split the cluster in two. For each possible
split, compute the within-cluster sum of squares about the mean of the cluster on
the dependent variable.
Choose the best of the n-1 splits to represent the predictor’s contribution. Now do
this for every other predictor. For the actual split, choose the predictor and its cut
point which yields the smallest overall within-cluster sum of squares. Categorical
predictors require a different approach. Since categories are unordered, all possible
splits between categories must be considered. For deciding on one split of k
categories into two groups, this means that 2^(k-1) - 1 possible splits must be considered.
Once a split is found, its suitability is measured on the same within-cluster sum of
squares as for a quantitative predictor.
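The search over the n-1 splits of a quantitative predictor can be sketched as below. The data are hypothetical, the function names are chosen for this example, and the sketch assumes the predictor values are distinct so that every cut point falls strictly between two cases.

```python
def within_ss(values):
    """Sum of squares of the values about their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_numeric_split(x, y):
    """Sort the cases on x, try all n-1 cut points and return
    (cut point, total within-cluster sum of squares) for the best one."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        ss = within_ss(left) + within_ss(right)
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2   # midpoint between neighbours
        if best is None or ss < best[1]:
            best = (cut, ss)
    return best

# Hypothetical predictor x and quantitative dependent variable y.
x = [1, 2, 3, 10, 11, 12]
y = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
print(best_numeric_split(x, y))
```

To choose the actual split, the same search would be run for every predictor and the predictor with the smallest overall sum of squares would be selected.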
The "interaction" in the name has to do with conditional discrepancies. In the analysis of variance,
interaction means that a trend within one level of a variable is not parallel to a trend
within another level of the same variable. In the ANOVA model, interaction is
represented by cross-products between predictors.
In the tree model, it is represented by branches from the same nodes which have
different splitting predictors further down the tree. Regression trees parallel
regression/ANOVA modeling, in which the dependent variable is quantitative.
Classification trees parallel discriminant analysis and algebraic classification
methods. Kass (1980) proposed a modification to AID called CHAID for categorized
dependent and independent variables. His algorithm incorporated a sequential
merge and split procedure based on a chi-square test statistic.
Kass’s algorithm is like sequential cross-tabulation. For each predictor:
1) cross tabulate the m categories of the predictor with the k categories of the
dependent variable.
2) find the pair of categories of the predictor whose 2xk sub-table is least
significantly different on a chi-square test and merge these two
categories.
3) if the chi-square test statistic is not “significant” according to a preset
critical value, repeat this merging process for the selected predictor until
no non-significant chi-square is found for a sub-table, and pick the
predictor variable whose chi-square is largest and split the sample into l
subsets, where l is the number of categories resulting from the merging
process on that predictor.
4) Continue splitting, as with AID, until no “significant” chi-squares result. The
CHAID algorithm saves some computer time, but it is not guaranteed to find
the splits which predict best at a given step.
Only by searching all possible category subsets can we do that. CHAID is also
limited to categorical predictors, so it cannot be used for quantitative or mixed
categorical quantitative models.
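Step 2 of Kass's procedure (finding the pair of predictor categories whose 2 x k sub-table is least different) can be sketched with the Pearson chi-square statistic. The counts are hypothetical and the function names are chosen for this example. The significance test against a critical value is left out, since it needs the chi-square distribution, which the Python standard library does not provide.

```python
from itertools import combinations

def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

def least_different_pair(counts):
    """counts maps each predictor category to its counts over the k categories
    of the dependent variable; return the pair with the smallest chi-square."""
    return min(combinations(sorted(counts), 2),
               key=lambda pair: chi_square([counts[pair[0]], counts[pair[1]]]))

# Hypothetical cross-tabulation: 3 predictor categories x 2 outcome categories.
counts = {"A": [30, 10], "B": [28, 12], "C": [5, 35]}
print(least_different_pair(counts))
```

Categories A and B have nearly the same outcome distribution, so they are the candidates for merging; C clearly differs from both.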
From the three graphs shown above, one can clearly see that the line in the leftmost
figure does not cover all the data points, so we can say that the model is
under-fitted. In this case, the model has failed to generalize the pattern to the
new dataset, leading to poor performance on testing. An under-fitted model is
easily recognised because it gives very high errors on both training and testing data.
This happens when the dataset is not clean and contains noise, the model has high bias,
or the size of the training data is not enough.
When it comes to overfitting, as shown in the rightmost graph, the model covers all
the data points correctly, and you might think this is a perfect fit. But actually,
it is not a good fit! Because the model learns too many details from the dataset, it
also learns the noise. This negatively affects performance on new data: not every
detail that the model has learned during training also applies to the new data
points, which gives poor performance on the testing or validation dataset. This is
because the model has trained itself in a very complex manner and has high variance.
The best-fit model is shown by the middle graph, where both training and testing
(validation) loss are minimum, or, put another way, training and testing accuracy
are near each other and high in value.
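The three situations can be reproduced on a toy problem. The sketch below is illustrative only: the data follow y = 2x plus a fixed alternating "noise" of +2 or -2 (so the run is repeatable), and the three models are stand-ins chosen for this example. They are a constant mean (high bias), a straight line, and a model that simply memorises the training points (high variance).

```python
def mse(model, xs, ys):
    """Mean squared error of a model (a function of x) on the given data."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_mean(xs, ys):
    """Under-fitting stand-in: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_linear(xs, ys):
    """Least-squares straight line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return lambda x: mean_y + slope * (x - mean_x)

def fit_memorise(xs, ys):
    """Over-fitting stand-in: return the y of the nearest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

xs = list(range(8))
train_y = [2 * x + (2 if x % 2 == 0 else -2) for x in xs]   # noise: +2, -2, ...
test_y = [2 * x + (-2 if x % 2 == 0 else 2) for x in xs]    # noise: -2, +2, ...

results = {}
for name, fit in [("mean", fit_mean), ("linear", fit_linear),
                  ("memorise", fit_memorise)]:
    model = fit(xs, train_y)
    results[name] = (mse(model, xs, train_y), mse(model, xs, test_y))

for name, (train_err, test_err) in results.items():
    print(f"{name:9s} train MSE = {train_err:6.2f}   test MSE = {test_err:6.2f}")
```

The constant model has a high error on both sets, the memorising model has zero training error but a much higher test error, and the straight line has similar, low errors on both.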
o p: the number of lag observations included in the model, also
known as the lag order.
o d: the number of times the raw observations are differenced, also
called the degree of differencing.
o q: the size of the moving average window, also called the order of
the moving average.
Univariate stationary processes (ARMA)
A covariance stationary process is an ARMA(p, q) process of autoregressive order p and
moving average order q if it can be written as
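In one common sign convention (an assumption here, since textbooks differ on the signs of the moving average terms), the ARMA(p, q) form is the following, where c is a constant, the phi_i are the autoregressive coefficients, the theta_j are the moving average coefficients and epsilon_t is white noise:

```latex
y_t = c + \sum_{i=1}^{p} \phi_i \, y_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j}
```

Setting q = 0 gives a pure autoregressive AR(p) model, and setting p = 0 gives a pure moving average MA(q) model.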
The acronym ARIMA stands for Auto-Regressive Integrated Moving Average. Lags of
the stationarized series in the forecasting equation are called "autoregressive" terms,
lags of the forecast errors are called "moving average" terms, and a time series
which needs to be differenced to be made stationary is said to be an "integrated"
version of a stationary series. Random-walk and random-trend models, autoregressive
models, and exponential smoothing models are all special cases of ARIMA models.
The forecasting equation is constructed as follows. First, let y denote the dth
difference of Y, which means:
If d=0: y_t = Y_t
If d=1: y_t = Y_t - Y_{t-1}
Note that the second difference of Y (the d=2 case) is not the difference from 2 periods
ago. Rather, it is the first-difference-of-the-first difference, which is the discrete
analog of a second derivative, i.e., the local acceleration of the series rather than its
local trend.
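The distinction can be checked with a few lines of Python. The series is hypothetical (a quadratic trend) and the function name is chosen for this example.

```python
def difference(series, d=1):
    """Apply first differencing d times: y_t = Y_t - Y_{t-1}."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

Y = [1, 4, 9, 16, 25]   # hypothetical series with a quadratic trend

first = difference(Y, 1)     # local trend
second = difference(Y, 2)    # first difference of the first difference
lag_two = [b - a for a, b in zip(Y, Y[2:])]   # difference from 2 periods ago

print("d=1:", first)
print("d=2:", second)
print("Y_t - Y_{t-2}:", lag_two)
```

The second difference is constant for a quadratic series, which is the discrete analog of a constant second derivative; the lag-2 difference is a different quantity altogether.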
We measure forecast accuracy by 2 methods:
1. Mean Forecast Error (MFE)
For n time periods where we have actual demand and forecast values, with forecast
error e_i = actual_i - forecast_i:
$$\mathrm{MFE} = \frac{1}{n}\sum_{i=1}^{n} e_i$$
Ideal value = 0; MFE > 0, model tends to under-forecast; MFE < 0, model tends to
over-forecast.
2. Mean Absolute Deviation (MAD)
For n time periods where we have actual demand and forecast values:
$$\mathrm{MAD} = \frac{1}{n}\sum_{i=1}^{n} |e_i|$$
While MFE is a measure of forecast model bias, MAD indicates the absolute size of
the errors.
Uses of Forecast error:
Forecast model bias
Absolute size of the forecast errors
Compare alternative forecasting models
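A short sketch of both measures, taking the forecast error as actual minus forecast so that a positive MFE means under-forecasting. The demand and forecast figures are hypothetical.

```python
def forecast_errors(actual, forecast):
    """Error per period, taken as actual minus forecast."""
    return [a - f for a, f in zip(actual, forecast)]

def mfe(actual, forecast):
    """Mean Forecast Error: average signed error (a measure of bias)."""
    e = forecast_errors(actual, forecast)
    return sum(e) / len(e)

def mad(actual, forecast):
    """Mean Absolute Deviation: average absolute error (a measure of size)."""
    e = forecast_errors(actual, forecast)
    return sum(abs(x) for x in e) / len(e)

actual = [100, 110, 120, 130]     # hypothetical demand
forecast = [90, 115, 110, 125]    # hypothetical forecasts

print("errors:", forecast_errors(actual, forecast))
print("MFE:", mfe(actual, forecast))
print("MAD:", mad(actual, forecast))
```

Here MFE is positive, so this forecast tends to under-forecast, and MAD is larger than MFE because positive and negative errors no longer cancel.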
ETL Approach:
Extract, Transform and Load (ETL) refers to a process in database usage and especially
in data warehousing that:
Extracts data from homogeneous or heterogeneous data sources
Transforms the data for storing it in proper format or structure for querying and
analysis purpose
Loads it into the final target (database, more specifically, operational data store,
data mart, or data warehouse)
Usually all three phases execute in parallel. Since data extraction takes time, a
transformation process runs while the data is still being pulled, processing the
data already received and preparing it for loading. As soon as some data is ready to
be loaded into the target, the data loading kicks off without waiting for the
completion of the previous phases.
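The overlap between phases can be sketched with Python generators, where each record flows through transform and load as soon as it has been extracted. This shows pipelining in a single thread rather than true parallel execution, and the source rows, field names and the list standing in for the warehouse are all assumptions made for the example.

```python
def extract(rows):
    """Yield records one at a time (in practice: from a file, API or database)."""
    for row in rows:
        yield row

def transform(records):
    """Clean each record as it arrives, without waiting for extraction to finish."""
    for rec in records:
        yield {"name": rec["name"].strip().title(), "amount": float(rec["amount"])}

def load(records, target):
    """Append each transformed record to the target (stand-in for an INSERT)."""
    for rec in records:
        target.append(rec)
    return target

source = [{"name": " alice ", "amount": "10.5"}, {"name": "BOB", "amount": "3"}]
warehouse = load(transform(extract(source)), [])
print(warehouse)
```

Because generators are lazy, the first record is loaded before the second one is even extracted, which mirrors the "no waiting for the previous phase" behaviour described above.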
ETL systems commonly integrate data from multiple applications (systems), typically
developed and supported by different vendors or hosted on separate computer
hardware. The disparate systems containing the original data are frequently managed
and operated by different employees. For example, a cost accounting system may
combine data from payroll, sales, and purchasing.
Commercially available ETL tools include:
Anatella
Alteryx
CampaignRunner
ESF Database Migration Toolkit
Informatica PowerCenter
Talend
IBM InfoSphere DataStage
Ab Initio
Oracle Data Integrator (ODI)
Oracle Warehouse Builder (OWB)
Microsoft SQL Server Integration Services (SSIS)
Tomahawk Business Integrator by Novasoft Technologies.
Pentaho Data Integration (or Kettle), an open-source data integration framework
Stambia
Clean: The cleaning step is one of the most important as it ensures the
quality of the data in the data warehouse. Cleaning should perform basic data
unification rules, such as:
Making identifiers unique (sex categories Male/Female/Unknown, M/F/null,
Man/Woman/Not Available are translated to standard Male/Female/Unknown)
Convert null values into a standardized Not Available/Not Provided value
Convert phone numbers, ZIP codes to a standardized form
Validate address fields, convert them into proper naming, e.g. Street/St/St./Str./Str
Validate address fields against each other (State/Country, City/State, City/ZIP
code, City/Street).
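A few of these unification rules can be sketched as small functions. The mapping table, the placeholder text and the digits-only phone rule are assumptions made for illustration; a real rule set would be driven by the warehouse's own standards.

```python
# Hypothetical mapping of source spellings to the standard categories.
SEX_MAP = {"male": "Male", "m": "Male", "man": "Male",
           "female": "Female", "f": "Female", "woman": "Female"}

def clean_sex(value):
    """Translate any source spelling to Male/Female/Unknown."""
    if value is None:
        return "Unknown"
    return SEX_MAP.get(str(value).strip().lower(), "Unknown")

def clean_null(value, placeholder="Not Available"):
    """Replace missing or blank values with a standard placeholder."""
    return placeholder if value is None or str(value).strip() == "" else value

def clean_phone(value):
    """Keep digits only; a fuller rule would also check length and country code."""
    return "".join(ch for ch in str(value) if ch.isdigit())

print(clean_sex("M"), clean_sex("Woman"), clean_sex(None))
print(clean_null(""), clean_null("Hyderabad"))
print(clean_phone("(040) 123-4567"))
```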
Transform:
The transform step applies a set of rules to transform the data from the
source to the target.
This includes converting any measured data to the same dimension (i.e.
conformed dimension) using the same units so that they can later be joined.
The transformation step also requires joining data from several sources,
generating aggregates, generating surrogate keys, sorting, deriving new
calculated values, and applying advanced validation rules.
Load:
During the load step, it is necessary to ensure that the load is performed correctly
and with as few resources as possible. The target of the Load process is often a
database.
In order to make the load process efficient, it is helpful to disable any constraints
and indexes before the load and enable them back only after the load
completes. The referential integrity needs to be maintained by ETL tool to ensure
consistency.
Staging:
It should be possible to restart, at least, some of the phases independently from the
others. For example, if the transformation step fails, it should not be necessary to restart
the Extract step. We can ensure this by implementing proper staging. Staging means that
the data is simply dumped to the location (called the Staging Area) so that it can then
be read by the next processing phase. The staging area is also used during the ETL
process to store intermediate results of processing, which is an acceptable use.
However, the staging area should be accessed by the load ETL process only. It should
never be available to anyone else, particularly not to end users, as it is not
intended for data presentation to the end-user and may contain incomplete or
in-the-middle-of-processing data.
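The restart behaviour that staging makes possible can be sketched as follows. This is an illustration under simple assumptions: each phase dumps its output as a JSON file in a temporary staging directory, and a phase is skipped on restart if its staged file already exists. The function and file names are invented for the example.

```python
import json
import os
import tempfile

def run_phase(name, func, staging_dir, upstream=None):
    """Run a phase unless its staged output already exists; return the data."""
    path = os.path.join(staging_dir, f"{name}.json")
    if os.path.exists(path):                 # staged by an earlier run: reuse it
        with open(path) as fh:
            return json.load(fh)
    result = func(upstream) if upstream is not None else func()
    with open(path, "w") as fh:              # dump the result to the staging area
        json.dump(result, fh)
    return result

calls = []   # records which phases actually did work

def extract():
    calls.append("extract")
    return [1, 2, 3]

def transform(rows):
    calls.append("transform")
    return [r * 10 for r in rows]

staging = tempfile.mkdtemp()
rows = run_phase("extract", extract, staging)

# Suppose the transform step failed here. On restart, the extract output is
# read back from the staging area instead of being extracted again.
rows_again = run_phase("extract", extract, staging)
out = run_phase("transform", transform, staging, rows_again)

print(calls)
print(out)
```

The extract function runs only once even though the phase is requested twice, which is the independence between phases that staging is meant to provide.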
Decision Tree Example
for
DATA ANALYTICS
(UNIT 4)
DATA ANALYTICS Decision Tree Learning
In order to derive the Hypothesis space, we compute the Entropy and Information Gain of Class and
attributes. For them we use the following statistics formulae:
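InformationGain(Attribute):
$$I(p_i, n_i) = -\frac{p_i}{p_i + n_i}\log_2\frac{p_i}{p_i + n_i} - \frac{n_i}{p_i + n_i}\log_2\frac{n_i}{p_i + n_i}$$
Entropy of an Attribute is:
$$\mathrm{Entropy}(\mathrm{Attribute}) = \sum_i \frac{p_i + n_i}{P + N}\, I(p_i, n_i)$$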
Data set: