Data Analytics (Unit 2)

Exploratory Data Analysis (EDA) is a data analysis approach that utilizes visual techniques to uncover trends, patterns, and validate assumptions through statistical summaries and graphical representations. EDA is essential for identifying errors, understanding data relationships, and ensuring valid results for business outcomes, and it employs various techniques such as clustering, regression, and visualization methods. The document outlines different types of EDA, including univariate and multivariate analyses, and discusses various regression techniques used for modeling relationships between variables.


Exploratory Data Analysis (EDA)

• Exploratory Data Analysis (EDA) is an approach to analyzing data using
visual techniques. It is used to discover trends and patterns, or to check
assumptions, with the help of statistical summaries and graphical
representations. EDA also covers checking the assumptions required for model
fitting and hypothesis testing, handling missing values, and transforming
variables as needed.
• EDA builds a robust understanding of the data and of the issues associated with
either the data or the process that produced it. It is a systematic approach to
getting the story of the data.
Why EDA?
• The main purpose of EDA is to help look at data before making any assumptions.
• It can help identify obvious errors, as well as better understand patterns within the
data, detect outliers or anomalous events, find interesting relations among the
variables.
• Data scientists can use exploratory analysis to ensure the results they produce are
valid and applicable to any desired business outcomes and goals.
• EDA also helps stakeholders by confirming they are asking the right questions.
EDA can help answer questions about standard deviations, categorical variables,
and confidence intervals.
• Once EDA is complete and insights are drawn, its features can then be used for
more sophisticated data analysis or modeling, including machine learning.
EDA Techniques
Specific statistical functions and techniques you can perform with EDA tools include:
• Clustering and dimension reduction techniques, which help create graphical
displays of high-dimensional data containing many variables.
• Univariate visualization of each field in the raw dataset, with summary statistics.
• Bivariate visualizations and summary statistics that allow you to assess the
relationship between each variable in the dataset and the target variable you’re
looking at.
• Multivariate visualizations, for mapping and understanding interactions between
different fields in the data.
• K-means Clustering is a clustering method in unsupervised learning where data
points are assigned into K groups, i.e. the number of clusters, based on the
distance from each group’s centroid. The data points closest to a particular
centroid will be clustered under the same category. K-means Clustering is
commonly used in market segmentation, pattern recognition, and image
compression.
• Predictive models, such as linear regression, use statistics and data to predict
outcomes.
Commonly used tools for performing EDA include R and Python; a short Python sketch is given below.
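A minimal Python sketch of the techniques listed above, assuming pandas, seaborn, matplotlib, and scikit-learn are available; the file name data.csv and the columns age and income are hypothetical placeholders used only for illustration.

```python
# Hedged EDA sketch: summary statistics, univariate/bivariate plots, K-means.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

df = pd.read_csv("data.csv")                 # hypothetical file

# Univariate non-graphical EDA: summary statistics for every column.
print(df.describe(include="all"))

# Univariate graphical EDA: histogram and box plot of one variable.
df["age"].plot(kind="hist", bins=20, title="Age distribution")
plt.show()
df.boxplot(column="age")
plt.show()

# Bivariate graphical EDA: scatter plot of two numerical variables.
sns.scatterplot(data=df, x="age", y="income")
plt.show()

# K-means clustering of the two numeric columns into K = 3 groups.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(df[["age", "income"]])
print(df["cluster"].value_counts())
```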
EDA Types
EDA is of four types:
• Univariate non-graphical. This is the simplest form of data analysis, where the data
being analyzed consists of just one variable. Since it’s a single variable, it doesn’t
deal with causes or relationships. The main purpose of univariate analysis is to
describe the data and find patterns that exist within it.
• Univariate graphical. Non-graphical methods don’t provide a full picture of the
data. Graphical methods are therefore required. Common types of univariate
graphics include:
– Stem-and-leaf plots, which show all data values and the shape of the distribution.
– Histograms, a bar plot in which each bar represents the frequency (count) or proportion
(count/total count) of cases for a range of values.
– Box plots, which graphically depict the five-number summary of minimum, first quartile,
median, third quartile, and maximum.
EDA Types
• Multivariate nongraphical: Multivariate data arises from more than one variable.
Multivariate non-graphical EDA techniques generally show the relationship
between two or more variables of the data through cross-tabulation or statistics.
• Multivariate graphical: Multivariate data uses graphics to display relationships
between two or more sets of data. The most used graphic is a grouped bar plot or
bar chart with each group representing one level of one of the variables and each
bar within a group representing the levels of the other variable. Other common
types of multivariate graphics include:
– Scatter plot, which is used to plot data points on a horizontal and a vertical axis to show how
much one variable is affected by another.
– Multivariate chart, which is a graphical representation of the relationships between factors
and a response.
– Run chart, which is a line graph of data plotted over time.
– Bubble chart, which is a data visualization that displays multiple circles (bubbles) in a two-
dimensional plot.
– Heat map, which is a graphical representation of data where values are depicted by color.
• Univariate analysis: This type of data consists of only one variable. The analysis of univariate data is thus
the simplest form of analysis since the information deals with only one quantity that changes. It does not
deal with causes or relationships and the main purpose of the analysis is to describe the data and find
patterns that exist within it.
• Bi-Variate analysis: This type of data involves two different variables. The analysis of this type of data
deals with causes and relationships, and the analysis is done to find out the relationship between the two
variables.
• Multi-Variate analysis: When the data involves three or more variables, it is categorized under
multivariate.
Univariate Analysis
• Patterns in this analysis can be described in the following ways: measures of central tendency (mean, median, mode), measures of dispersion (range, variance, standard deviation, quartiles), and frequency distributions presented through tables and charts such as histograms, bar charts, and box plots.
Bivariate Analysis
Bivariate Analysis of two Numerical
Variables (Numerical-Numerical)
• Scatter Plot: A scatter plot represents individual pieces of data using dots. These plots make
it easier to see if two variables are related to each other. The resulting pattern indicates the
type (linear or non-linear) and strength of the relationship between two variables.
• Linear Correlation: Linear Correlation represents the strength of a linear relationship
between two numerical variables. If there is no correlation between the two variables, there
is no tendency for one quantity to change along with the values of the second. Here, r measures the
strength of a linear relationship and is always between -1 and 1 where -1 denotes perfect
negative linear correlation and +1 denotes perfect positive linear correlation and zero
denotes no linear correlation.
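A small sketch of the numerical-numerical case: Pearson's r plus a scatter plot; the x and y values are made up, and pearsonr comes from SciPy.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

r, p_value = pearsonr(x, y)          # r is always between -1 and +1
print(f"Pearson r = {r:.3f}, p-value = {p_value:.4f}")

plt.scatter(x, y)                    # visual check of the relationship
plt.xlabel("x"); plt.ylabel("y"); plt.title("Scatter plot")
plt.show()
```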

Scatter Plot
Bivariate Analysis of two categorical
Variables (Categorical-Categorical)
• Chi-square Test: The chi-square test is used for determining the association between
categorical variables. It is calculated based on the difference between expected frequencies
and the observed frequencies in one or more categories of the frequency table.
• A probability of zero indicates a complete dependency between two categorical variables
and a probability of one indicates that two categorical variables are completely independent.
• The test statistic is χ²c = Σ (Oi − Ei)² / Ei, where the subscript c indicates the degrees of
freedom, O indicates the observed value, and E indicates the expected value.
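A hedged sketch of the chi-square test of independence using SciPy's chi2_contingency; the 2x2 table of observed counts is invented for illustration.

```python
from scipy.stats import chi2_contingency

observed = [[30, 10],     # e.g. category A vs. category B counts (illustrative)
            [20, 40]]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.3f}, degrees of freedom = {dof}, p-value = {p:.4f}")
print("expected frequencies:\n", expected)
```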
Bivariate Analysis of one numerical and one
categorical variable (Numerical-Categorical)
• Z-test and t-test: Z-tests and t-tests are used to determine whether the difference between a sample mean
and the population mean (or between two group means) is statistically significant.
• If the probability of Z is small, the difference between the two averages is more significant.
• If the sample size is large enough, then we use a Z-test, and for a small sample size, we use a
T-test.
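A minimal two-sample t-test sketch with SciPy; the two score lists stand in for a numerical variable split by a two-level categorical variable and are made up. For large samples, a z-test (e.g. statsmodels' ztest) could be used instead.

```python
from scipy.stats import ttest_ind

group_a = [82, 75, 91, 68, 77, 85, 80]
group_b = [70, 65, 74, 60, 72, 68, 71]

t_stat, p_value = ttest_ind(group_a, group_b)   # two-sample t-test
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
```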
Bivariate Analysis of one numerical and one
categorical variable (Numerical-Categorical)
• ANALYSIS OF VARIANCE (ANOVA): The ANOVA test is used to determine whether there is a
statistically significant difference among the averages of more than two groups. This analysis is
appropriate for comparing the averages of a numerical variable across more than two categories of a
categorical variable.
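A one-way ANOVA sketch using SciPy's f_oneway, comparing a numerical variable across three made-up categories (low, medium, high).

```python
from scipy.stats import f_oneway

low    = [12, 15, 14, 10, 13]
medium = [18, 20, 19, 17, 21]
high   = [25, 27, 24, 26, 28]

f_stat, p_value = f_oneway(low, medium, high)   # compares the three group means
print(f"F = {f_stat:.3f}, p-value = {p_value:.4f}")
```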
Multivariate Analysis
• Cluster Analysis: Cluster Analysis classifies different objects into clusters in such a way that the similarity between two objects
from the same group is maximal, while the similarity between objects from different groups is minimal. It is used when the rows and columns of the
data table represent the same units and the measure represents a distance or similarity.

• Principal Component Analysis (PCA): Principal Component Analysis (or PCA) is used for reducing the dimensionality of a
data table with a large number of interrelated measures. Here, the original variables are converted into a new set of
variables, which are known as the "Principal Components". PCA is used for datasets that
show multicollinearity. Although least squares estimates are unbiased in that case, their variances are so large that the
estimates can lie far from the actual values. PCA (as used in principal component regression) adds some bias but reduces the standard error of the regression model.
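A hedged PCA sketch with scikit-learn; the small matrix X is invented, and its columns are standardized first because PCA is sensitive to scale.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.array([[2.5, 2.4, 1.0],
              [0.5, 0.7, 0.2],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 1.3]])

X_scaled = StandardScaler().fit_transform(X)   # put all columns on the same scale
pca = PCA(n_components=2)                      # keep two principal components
components = pca.fit_transform(X_scaled)
print("explained variance ratio:", pca.explained_variance_ratio_)
print(components)
```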
Regression Analysis
• In statistical modeling, regression analysis is a set of statistical processes for estimating the
relationships between a dependent variable (often called the 'outcome' or 'response'
variable, or a 'label' in machine learning parlance) and one or more independent
variables (often called 'predictors', 'covariates', 'explanatory variables' or 'features').
• Regression analysis is widely used for prediction and forecasting, where its use has
substantial overlap with the field of machine learning.
• Regression analysis can be used to infer causal relationships between the independent and
dependent variables. Importantly, regressions by themselves only reveal relationships
between a dependent variable and a collection of independent variables in a fixed dataset.

• Why?
• In statistical analysis, regression is used to identify the associations between variables
occurring in some data. It can show both the magnitude of such an association and also
determine its statistical significance (i.e., whether or not the association is likely due to
chance). Regression is a powerful tool for statistical inference and has also been used to try
to predict future outcomes based on past observations.
Regression Analysis
Regression models involve the following components:
• The unknown parameters, often denoted as a scalar or vector β.
• The independent variables, which are observed in data and are often denoted as a
vector {Xi} (where i denotes a row of data).
• The dependent variable, which is observed in data and is often denoted using the scalar {Yi}.
• The error terms, which are not directly observed in data and are often denoted using the
scalar {ei}.
• A regression model can generally be represented as: Yi = f(Xi, β) + ei
Linear Regression
• The most extensively used modeling technique is linear regression. The relationship between
a dependent variable and a single independent variable is described using a basic linear
regression methodology. A Simple Linear Regression model reveals a linear or slanted straight
line relation, thus the name.
• The simple linear model is expressed using the following equation: Y = a + bX + ϵ
– Where:
– Y – variable that is dependent
– X – Independent (explanatory) variable
– a – Intercept
– b – Slope
– ϵ – Residual (error)
• The dependent variable needs to be continuous/real, which is the most crucial component of
Simple Linear Regression. On the other hand, the independent variable can be evaluated
using either continuous or categorical values.
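A minimal sketch of fitting Y = a + bX with scikit-learn on made-up x and y values; the fitted intercept and slope are then read back and used for a prediction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1, 2, 3, 4, 5, 6]).reshape(-1, 1)   # independent variable
y = np.array([2.2, 4.1, 5.9, 8.2, 9.9, 12.1])     # dependent variable

model = LinearRegression().fit(x, y)
print("intercept a =", model.intercept_)
print("slope b     =", model.coef_[0])
print("prediction for x = 7:", model.predict([[7]])[0])
```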

Assumptions
• Homogeneity of variance (homoscedasticity): the size of the error in our
prediction doesn’t change significantly across the values of the independent
variable.
• Independence of observations: the observations in the dataset were collected
using statistically valid methods, and there are no hidden relationships among
variables.
– In multiple linear regression, it is possible that some of the independent variables are actually
correlated with one another, so it is important to check these before developing the regression
model. If two independent variables are too highly correlated then only one of them should be used
in the regression model.
• Normality: The data follows a normal distribution.
• Linearity: the line of best fit through the data points is a straight line, rather than a
curve or some sort of grouping factor.
Multiple Regression
• Multiple linear regression (MLR), often known as multiple regression, is a statistical process that
uses multiple explanatory factors to predict the outcome of a response variable.
• MLR is a method of representing the linear relationship between explanatory (independent) and
response (dependent) variables.
• The mathematical representation of multiple linear regression is: y = B0 + B1X1 + B2X2 + … + BnXn + ϵ
– Where, y = the dependent variable’s predicted value
– B0 = the y-intercept
– B1X1 = B1 is the regression coefficient of the first independent variable X1 (i.e. the effect that increasing the value
of X1 has on the predicted y value)
– … = Repeat for as many independent variables as you're testing.
– BnXn = the last independent variable's regression coefficient
– ϵ = model error (i.e. how much flexibility is there in our y estimate)

• Multiple linear regression uses the same criteria as simple linear regression. Because of the larger
number of independent variables, multiple linear regression adds one extra requirement for the model: non-collinearity.
• Non-collinearity means there is no strong correlation between the independent variables. It would be hard to
determine the true relationships between the dependent and independent variables if the independent
variables were strongly correlated.
• MLR holds same assumptions as linear regression.
Multiple Regression
Multiple Regression is used to find:
• How strong the relationship is between two or more independent variables and one dependent variable (e.g. how
rainfall, temperature, and amount of fertilizer added affect crop growth).
• The value of the dependent variable at a certain value of the independent variables (e.g. the expected yield of a
crop at certain levels of rainfall, temperature, and fertilizer addition).
Non-Linear Regression
• Nonlinear regression is a form of regression analysis in which data is fitted to a model and then
expressed as a mathematical (nonlinear) function.
• Simple linear regression connects two variables (X and Y) in a straight line (y = mx + b),
whereas nonlinear regression connects two variables (X and Y) in a nonlinear (curved)
relationship.
• The goal of the model is to minimise the sum of squares as much as possible. The sum of
squares is a statistic that tracks how much Y observations differ from the nonlinear (curved)
function that was used to anticipate Y.
• In the same way that linear regression modelling aims to graphically trace a specific response
from a set of factors, nonlinear regression modelling aims to do the same.
• Because the function is generated by a series of approximations (iterations) that may be
dependent on trial-and-error, nonlinear models are more complex to develop than linear
models.
Logistic Regression
• Logistic Regression: When the dependent variable is discrete, the logistic regression
technique is applicable.
• In other words, this technique is used to compute the probability of mutually exclusive
occurrences such as pass/fail, true/false, 0/1, and so forth.
• Thus, the target variable can take on only one of two values, and a sigmoid curve represents
its connection to the independent variable, and probability has a value between 0 and 1.
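A small logistic regression sketch with scikit-learn; the study-hours values and pass/fail labels are invented, and the fitted model returns a probability between 0 and 1 via the sigmoid.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]).reshape(-1, 1)
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])   # binary (pass/fail) target

clf = LogisticRegression().fit(hours, passed)
# Probability of the "pass" class for 2.75 hours of study (sigmoid output).
print("P(pass | 2.75 hours) =", clf.predict_proba([[2.75]])[0, 1])
```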
Polynomial Regression
• Polynomial Regression: The technique of polynomial regression analysis is used to
represent a non-linear relationship between dependent and independent
variables. It is a variant of the multiple linear regression model, except that the
best fit line is curved rather than straight.
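A short sketch of degree-2 polynomial regression built as multiple linear regression on the expanded features [x, x²]; the roughly quadratic data is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

x = np.array([1, 2, 3, 4, 5, 6, 7]).reshape(-1, 1)
y = np.array([1.2, 4.1, 9.3, 15.8, 25.2, 36.1, 48.9])   # roughly quadratic

# Expand x into [x, x^2], then fit an ordinary linear regression on top.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print("prediction for x = 8:", model.predict([[8]])[0])
```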
Ridge Regression
• Ridge Regression: The ridge regression technique is applied when data exhibits multicollinearity,
that is, when the independent variables are highly correlated. While least
squares estimates are unbiased under multicollinearity, their variances are large
enough to cause the observed values to diverge from the actual values. Ridge regression
reduces these standard errors by adding a small amount of bias to the regression estimates.
• The lambda (λ) variable in the ridge regression equation resolves the multicollinearity
problem. Y = XB + e
– Where Y is the dependent variable, X represents the independent variables, B is the vector of regression coefficients
to be estimated, and e represents the errors or residuals.
• Once we add the lambda penalty to this equation, variance that is not accounted for by
the general model is taken into consideration.
Lasso Regression
• Lasso Regression: As with ridge regression, the lasso (Least Absolute
Shrinkage and Selection Operator) technique penalizes the absolute
magnitude of the regression coefficient. Additionally, the lasso regression
technique employs variable selection, which leads to the shrinkage of
coefficient values to absolute zero.
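A combined sketch of ridge and lasso with scikit-learn on synthetic data containing two highly correlated columns; here alpha plays the role of the lambda penalty described in the text, and the specific alpha values are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=50)        # columns 0 and 1 are highly correlated
y = 3 * X[:, 0] + 2 * X[:, 2] + rng.normal(size=50)   # columns 3 and 4 are irrelevant

ridge = Ridge(alpha=1.0).fit(X, y)    # shrinks coefficients, keeps all of them
lasso = Lasso(alpha=0.1).fit(X, y)    # can shrink some coefficients exactly to zero
print("ridge coefficients:", np.round(ridge.coef_, 3))
print("lasso coefficients:", np.round(lasso.coef_, 3))
```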
Quantile Regression
• Quantile Regression: The quantile regression approach is a
subset of the linear regression technique. It is employed when
the linear regression requirements are not met or when the
data contains outliers. In statistics and econometrics, quantile
regression is used.
Applications
• Forecasting:
– The most common use of regression analysis in business is for forecasting future opportunities and
threats. Demand analysis, for example, forecasts the quantity of goods a customer is likely to
buy. When it comes to business, though, demand is not the only dependent variable. Regression
analysis can anticipate significantly more than just direct revenue.
– For example, we may predict the highest bid for an advertisement by forecasting the number of
consumers who would pass in front of a specific billboard.
– Insurance firms depend extensively on regression analysis to forecast policyholder creditworthiness
and the number of claims that might be filed in a particular time period.
• CAPM:
– The Capital Asset Pricing Model (CAPM), which establishes the link between an asset's projected
return and the related market risk premium, relies on the linear regression model.
– It is also frequently used in financial analysis by financial analysts to anticipate corporate returns and
operational performance.
• Comparing with competition:
– It may be used to compare a company's financial performance to that of a certain counterpart.
– It may also be used to determine the relationship between two firms' stock prices (this can be
extended to find correlation between 2 competing companies, 2 companies operating in an
unrelated industry etc).
– It can assist the firm in determining which aspects are influencing their sales in contrast to the
comparative firm. These techniques can assist small enterprises in achieving rapid success in a short
amount of time.
Applications
• Identifying problems:
– Regression is useful not just for providing factual evidence for management choices, but also
for detecting judgment mistakes.
– A retail store manager, for example, may assume that extending shopping hours will
significantly boost sales.
– However, regression analysis might suggest that the increase in income isn't enough to cover the increase in
operational cost as a result of longer working hours (such as additional employee labour
charges).
– As a result, this research may give quantitative backing for choices and help managers avoid
making mistakes based on their intuitions.
• Reliable source
– Many businesses and their top executives are now adopting regression analysis (and
other types of statistical analysis) to make better business decisions and reduce guesswork
and gut instinct.
– Regression enables firms to take a scientific approach to management. Both small and large
enterprises are frequently bombarded with an excessive amount of data.
– Managers may use regression analysis to filter through data and choose the relevant factors
to make the best decisions possible.
Advantages
• Regression models are easy to understand as they are built upon basic statistical principles,
such as correlation and least-square error.
• The output of regression models is an algebraic equation that is easy to understand and use
to predict.
• The strength (or the goodness of fit) of the regression model is measured in terms of the
correlation coefficients, and other related statistical parameters that are well understood.
• The predictive power of regression models matches with other predictive models and
sometimes performs better than the competitive models.
• Regression models can include all the variables that one wants to include in the model.
• Regression modeling tools are pervasive. Almost all data mining and statistical packages include
regression tools. MS Excel spreadsheets can also provide simple regression modeling capabilities.
Disadvantages
• Regression models cannot work properly if the input data has errors (that is poor quality
data). If the data preprocessing is not performed well to remove missing values or redundant
data or outliers or imbalanced data distribution, the validity of the regression model suffers.
• Regression models are susceptible to collinear problems (that is there exists a strong linear
correlation between the independent variables). If the independent variables are strongly
correlated, then they will eat into each other’s predictive power and the regression
coefficients will lose their ruggedness.
• As the number of variables increases the reliability of the regression models decreases. The
regression models work better if you have a small number of variables.
• Regression models do not automatically take care of nonlinearity. The user needs to imagine
the kind of additional terms that might be needed to be added to the regression model to
improve its fit.
• Regression models work with datasets containing numeric values and not with categorical
variables. There are ways to deal with categorical variables though by creating multiple new
variables with a yes/no value.
Time Series Analysis
• Time series analysis is a specific way of analyzing a sequence of data points
collected over an interval of time.
• In time series analysis, analysts record data points at consistent intervals over
a set period of time rather than just recording the data points intermittently or
randomly.
• This type of analysis is not merely the act of collecting data over time.
• In other words, time is a crucial variable because it shows how the data
adjusts over the course of the data points as well as the final results.
• It provides an additional source of information and a set order of
dependencies between the data.
• Time series analysis typically requires a large number of data points to ensure
consistency and reliability.
• An extensive data set ensures you have a representative sample size and that
analysis can cut through noisy data. It also ensures that any trends or patterns
discovered are not outliers and can account for seasonal variance.
Time Series Analysis
Why use Time Series Analysis? The reasons for doing time series analysis are as follows:
• Features: Time series analysis can be used to track features like trend, seasonality, and
variability.
• Forecasting: Time series analysis can aid in the prediction of stock prices. It is used if
you would like to know if the price will rise or fall and how much it will rise or fall.
• Inferences: You can predict the value and draw inferences from data using Time series
analysis.

Applications:
• Financial Analysis − It includes sales forecasting, inventory analysis, stock market
analysis, price estimation.
• Weather Analysis − It includes temperature estimation, climate change, seasonal shift
recognition, weather forecasting.
• Network Data Analysis − It includes network usage prediction, anomaly or intrusion
detection, predictive maintenance.
• Healthcare Analysis − It includes census prediction, insurance benefits prediction,
patient monitoring.
Components of Time Series Analysis
• Trend: The trend is the smooth, general, long-term, average tendency of the data over the whole
timeline, without any fixed interval; it may be a positive (upward), negative (downward), or null trend.
It is not always necessary that the increase or decrease is in the same direction throughout the given
period of time.
• Seasonality: Regular or fixed-interval variations that repeat within the dataset over a continuous
timeline, often looking like a bell curve or saw-tooth pattern. They have the same or almost the same
pattern over a period of 12 months, and this variation will be present in a time series if the data are
recorded hourly, daily, weekly, monthly, or quarterly.
• Cyclical: Variations with no fixed interval and with uncertainty in movement and pattern. This
oscillatory movement has a period of oscillation of more than a year; one complete period is a cycle.
This cyclic movement is sometimes called the 'Business Cycle'. It is a four-phase cycle comprising the
phases of prosperity, recession, depression, and recovery. The cyclic variation may or may not be
regular and is not strictly periodic.
• Irregularity: Unexpected situations/events/scenarios and spikes over a short time span. These
fluctuations are unforeseen, uncontrollable, unpredictable, and erratic. Examples of such forces are
earthquakes, wars, floods, famines, and other disasters.
Mathematical Model for Time Series
Analysis
• Mathematically, a time series is given as: yt = f (t)
• Here, yt is the value of the variable under study at time t. If the variable is observed at the
time periods t1, t2, t3, … , tn, then the time series is
t: t1, t2, t3, … , tn
yt: yt1, yt2, yt3, …, ytn
or, t: t1, t2, t3, … , tn
yt: y1, y2, y3, … , yn
• Additive Model for Time Series Analysis: If yt is the time series value at time t. Tt, St, Ct, and Rt are the
trend value, seasonal, cyclic and random fluctuations at time t respectively. According to the Additive
Model, a time series can be expressed as : yt = Tt + St + Ct + Rt.
This model assumes that all four components of the time series act independently of each other.
• Multiplicative Model for Time Series Analysis: The multiplicative model assumes that the various
components in a time series operate proportionately to each other. According to this model:
yt = Tt × St × Ct × Rt
• Mixed models: Different assumptions lead to different combinations of additive and multiplicative models,
such as: yt = Tt + St + Ct × Rt.
The time series analysis can also be done using the model
yt = Tt + St × Ct × Rt or yt = Tt × Ct + St × Rt etc.
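As a hedged illustration of the additive model, the sketch below decomposes a synthetic monthly series with statsmodels' seasonal_decompose, which separates trend, seasonal, and residual parts (the cyclic component is not split out by this particular function); the series itself is made up.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2020-01-01", periods=48, freq="MS")   # 4 years of monthly data
trend = np.linspace(100, 160, 48)                          # Tt: upward trend
season = 10 * np.sin(2 * np.pi * np.arange(48) / 12)       # St: 12-month seasonality
noise = np.random.default_rng(1).normal(0, 2, 48)          # Rt: random fluctuations
series = pd.Series(trend + season + noise, index=idx)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())
print(result.seasonal.head(12))      # the repeating 12-month seasonal pattern
```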
Naïve Bayesian Classifier
Pros:
• It is simple and easy to implement
• It doesn’t require as much training data
• It handles both continuous and discrete data
• It is highly scalable with the number of predictors and data points
• It is fast and can be used to make real-time predictions
• It is not sensitive to irrelevant features
• The assumption that all features are independent makes naive bayes algorithm very
fast compared to complicated algorithms. In some cases, speed is preferred over higher
accuracy.
• It works well with high-dimensional data such as text classification, email spam
detection.
Cons:
• The assumption that all features are independent is not usually the case in real life so it
makes naive bayes algorithm less accurate than complicated algorithms. Speed comes
at a cost.
Naïve Bayesian Classifier
• Naïve Bayes Classifier is one of the simplest and most effective classification algorithms, which
helps in building fast machine learning models that can make quick predictions. It is a
probabilistic classifier, which means it predicts on the basis of the probability of an object.
• Assumption: The fundamental Naive Bayes assumption is that each feature makes an
Independent & Equal contribution to the outcome.
• Naive Bayes is called naive because it assumes that each input variable is independent. This
is a strong assumption and unrealistic for real data; however, the technique is very effective
on a large range of complex problems.
• Applications:
– As this algorithm is fast and efficient, you can use it to make real-time predictions.
– This algorithm is popular for multi-class predictions. You can find the probability of multiple target classes easily by
using this algorithm.
– Email services (like Gmail) use this algorithm to figure out whether an email is a spam or not. This algorithm is
excellent for spam filtering.
– Its assumption of feature independence, and its effectiveness in solving multi-class problems, makes it perfect for
performing Sentiment Analysis. Sentiment Analysis refers to the identification of positive or negative sentiments of a
target group (customers, audience, etc.)
– Collaborative Filtering and the Naive Bayes algorithm work together to build recommendation systems. These systems
use data mining and machine learning to predict if the user would like a particular resource or not.
Naïve Bayesian Classifier
• The dataset is divided into two parts, namely, feature matrix and the response vector.
• Feature matrix contains all the vectors(rows) of dataset in which each vector consists of
the value of dependent features. In above dataset, features are ‘Outlook’,
‘Temperature’, ‘Humidity’ and ‘Windy’.
• Response vector contains the value of class variable(prediction or output) for each row
of feature matrix. In above dataset, the class variable name is ‘Play golf’.
Bayes’ Theorem
• Bayes’ Theorem: Bayes’ Theorem finds the probability of an event occurring given the
probability of another event that has already occurred. Bayes’ theorem is stated
mathematically as the following equation:

P(A|B) = [P(B|A) · P(A)] / P(B)

• where A and B are events and P(B) ≠ 0.


• Basically, we are trying to find probability of event A, given the event B is true. Event B is also
termed as evidence.
• P(A) is the prior probability of A, i.e. the probability of the event before the evidence is seen.
The evidence is an attribute value of an unknown instance (here, it is event B).
• P(A|B) is the posterior probability of A, i.e. the probability of the event after the evidence is
seen (a conditional probability).
• In the previously mentioned dataset, we can apply Bayes’ theorem in the following way:

P(y|X) = [P(X|y) · P(y)] / P(X)

• where y is the class variable and X is a dependent feature vector (of size n): X = (x1, x2, …, xn)
Naïve Bayesian Classification
• Now, applying the naive assumption to Bayes’ theorem, which is independence among
the features, we split the evidence into independent parts.
• If any two events A and B are independent, then P(A, B) = P(A) · P(B).
• Hence, we reach the result:

P(y | x1, …, xn) = [P(x1|y) · P(x2|y) · … · P(xn|y) · P(y)] / [P(x1) · P(x2) · … · P(xn)]

which can be expressed as:

P(y | x1, …, xn) = [P(y) · Π P(xi|y)] / [Π P(xi)]

• Now, as the denominator remains constant for a given input, we can remove that term:

P(y | x1, …, xn) ∝ P(y) · Π P(xi|y)

• Now, we need to create a classifier model. For this, we find the probability of a given set of
inputs for all possible values of the class variable y and pick the output with maximum
probability. This can be expressed mathematically as:

y = argmax over y of P(y) · Π P(xi|y)
Types of Naïve Bayes Models
Types of Naive Bayes models :
• Gaussian: It is used in classification and it assumes that features follow a normal
distribution.
Types of Naïve Bayes Models
• Multinomial: It is used for discrete counts. For example, in a text classification problem, we go
one step further than Bernoulli trials: instead of only recording whether a word occurs in the
document, we count how often the word occurs in the document; you can think of it as the number
of times outcome number xi is observed over the n trials.

• Bernoulli: The binomial model is useful if your feature vectors are binary (i.e. zeros and
ones). One application would be text classification with ‘bag of words’ model where the 1s &
0s are “word occurs in the document” and “word does not occur in the document”
respectively.
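A brief, hedged sketch of the three variants using scikit-learn; the tiny matrices (X_cont, X_counts, X_binary) and labels are invented purely to show which model matches which kind of feature.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])

X_cont = np.array([[5.1, 3.5], [4.9, 3.0], [6.3, 3.3], [6.5, 3.0]])   # real-valued features
X_counts = np.array([[2, 0, 1], [3, 1, 0], [0, 4, 2], [1, 3, 3]])     # word counts
X_binary = (X_counts > 0).astype(int)                                  # word present / absent

print(GaussianNB().fit(X_cont, y).predict([[6.0, 3.1]]))     # Gaussian: continuous features
print(MultinomialNB().fit(X_counts, y).predict([[0, 2, 2]])) # Multinomial: discrete counts
print(BernoulliNB().fit(X_binary, y).predict([[0, 1, 1]]))   # Bernoulli: binary features
```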
Bayesian Belief Network
• A Bayesian network is a probabilistic graphical model which represents a set of variables and
their conditional dependencies using a directed acyclic graph. It is also called a Bayes
network, belief network, decision network, or Bayesian model.
• Bayesian networks are probabilistic, because these networks are built from a probability
distribution, and also use probability theory for prediction and anomaly detection.
• Real world applications are probabilistic in nature, and to represent the relationship between
multiple events, we need a Bayesian network. It can also be used in various tasks
including prediction, anomaly detection, diagnostics, automated insight, reasoning, time
series prediction, and decision making under uncertainty.
• Bayesian Network can be used for building models from data and experts opinions, and it
consists of two parts:
– Directed Acyclic Graph
– Table of conditional probabilities.
Bayesian Belief Network
• The generalized form of Bayesian network that represents and solve decision
problems under uncertain knowledge is known as an Influence diagram.
• A Bayesian network graph is made up of nodes and arcs (directed links), where:
• Each node corresponds to a random variable, and a variable can
be continuous or discrete.
• Arcs or directed arrows represent the causal relationships or conditional
probabilities between random variables. These directed links or arrows connect
pairs of nodes in the graph.
These links represent that one node directly influences the other node; if there
is no directed link between two nodes, they are independent of each other.
– In the above diagram, A, B, C, and D are random variables represented by the nodes of the
network graph.
– If we are considering node B, which is connected with node A by a directed arrow, then node
A is called the parent of Node B.
– Node C is independent of node A.
• The Bayesian network graph does not contain any cyclic graph. Hence, it is known
as a directed acyclic graph or DAG.
Bayesian Belief Network
• The Bayesian network has mainly two components:
– Causal Component
– Actual numbers
• Each node in the Bayesian network has a conditional probability
distribution P(Xi | Parents(Xi)), which determines the effect of the parents on that node.
• Bayesian network is based on Joint probability distribution and conditional probability.
So let's first understand the joint probability distribution:
• If we have variables x1, x2, x3, …, xn, then the probabilities of the different combinations
of x1, x2, x3, …, xn are known as the joint probability distribution.
• P[x1, x2, x3, …, xn] can be written in the following way in terms of the joint
probability distribution:
= P[x1| x2, x3,....., xn]P[x2, x3,....., xn]
= P[x1| x2, x3,....., xn]P[x2|x3,....., xn]....P[xn-1|xn]P[xn].
• In general, for each variable Xi we can write the equation as:
P(Xi | Xi-1, …, X1) = P(Xi | Parents(Xi)), so that P(x1, …, xn) = Π P(xi | Parents(Xi))
Example
• Harry installed a new burglar alarm at his home to detect burglary. The alarm responds reliably
to a burglary but also responds to minor earthquakes. Harry has two
neighbours, David and Sophia, who have taken responsibility to inform Harry at work when
they hear the alarm. David always calls Harry when he hears the alarm, but sometimes he gets
confused with the phone ringing and calls at that time too. On the other hand, Sophia likes to
listen to loud music, so she sometimes misses the alarm. Here we would like to
compute the probability of the burglary alarm.
Example
• The Bayesian network for the above problem is given below.
• The network structure shows that burglary and earthquake are the parent nodes of the alarm node and
directly affect the probability of the alarm going off, while David's and Sophia's calls depend on the alarm
probability.
• The network represents the assumptions that David and Sophia do not directly perceive the burglary, do
not notice the minor earthquake, and do not confer with each other before calling.
• The conditional distributions for each node are given as a conditional probabilities table, or CPT.
• Each row in the CPT must sum to 1 because all the entries in the table represent an exhaustive
set of cases for the variable.
• In a CPT, a Boolean variable with k Boolean parents contains 2^k probabilities. Hence, if there are two
parents, the CPT will contain 4 probability values.
• List of all events occurring in this network:
– Burglary (B)
– Earthquake(E)
– Alarm(A)
– David Calls(D)
– Sophia calls(S)
Example
• Problem: Calculate the probability that the alarm has sounded, but neither a burglary
nor an earthquake has occurred, and both David and Sophia have called Harry.
From the formula of joint distribution, we can write the problem statement in the form of
probability distribution:
P(S, D, A, ¬B, ¬E) = P (S|A) *P (D|A)*P (A|¬B ^ ¬E) *P (¬B) *P (¬E).
= 0.75* 0.91* 0.001* 0.998*0.999
= 0.00068045.
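The same arithmetic can be written out directly; the sketch below simply multiplies the five CPT values quoted in the example.

```python
p_s_given_a = 0.75              # P(Sophia calls | Alarm)
p_d_given_a = 0.91              # P(David calls | Alarm)
p_a_given_not_b_not_e = 0.001   # P(Alarm | no Burglary, no Earthquake)
p_not_b = 0.998                 # P(no Burglary)
p_not_e = 0.999                 # P(no Earthquake)

joint = p_s_given_a * p_d_given_a * p_a_given_not_b_not_e * p_not_b * p_not_e
print(round(joint, 8))          # approximately 0.00068045
```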
K- Nearest Neighbour
• K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
• The K-NN algorithm assumes similarity between the new case/data and the available cases
and puts the new case into the category that is most similar to the available categories.
• The K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can easily be classified into a well-suited
category by using the K-NN algorithm.
• K-NN algorithm can be used for Regression as well as for Classification but mostly it is
used for the Classification problems.
• K-NN is a non-parametric algorithm, which means it does not make any assumption on
underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set
immediately instead it stores the dataset and at the time of classification, it performs
an action on the dataset.
• The KNN algorithm at the training phase just stores the dataset, and when it gets new data,
it classifies that data into the category that is most similar to the new data.
How?
K- Nearest Neighbour
Steps of KNN
– Step-1: Select the number K of the neighbors
– Step-2: Calculate the Euclidean distance of K number of neighbors
– Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
– Step-4: Among these k neighbors, count the number of the data points in each category.
– Step-5: Assign the new data points to that category for which the number of the neighbor is
maximum.
– Step-6: Our model is ready.
• There is no particular way to determine the best value for "K", so we need to try some values to find the
best out of them.
– A very low value of K, such as K=1 or K=2, can be noisy and expose the model to the effects of
outliers.
– Larger values of K are generally good, but too large a value can cause difficulties, such as pulling
in points from other categories and increasing computation.
• Pros:
– It is simple to implement.
– It is robust to the noisy training data
– It can be more effective if the training data is large.
• Cons:
– We always need to determine the value of K, which may sometimes be complex.
– The computation cost is high because of calculating the distance between the data points for all the
training samples.
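A minimal scikit-learn sketch of the steps above; the six labelled points and the query point [3, 3] are made up for illustration, and Euclidean distance with K = 3 is used.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [2, 2], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])           # two categories

knn = KNeighborsClassifier(n_neighbors=3)  # Step 1: choose K = 3
knn.fit(X, y)                              # "training" just stores the data (lazy learner)
print(knn.predict([[3, 3]]))               # assigned to the majority class of its 3 neighbours
```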
Rule Induction
• It is the extraction of useful if-then rules from data based on statistical significance.
• Process of learning, from cases or instances, if-then rule relationships that consist of an
antecedent (if-part, defining the preconditions or coverage of the rule) and a consequent
(then-part, stating a classification, prediction, or other expression of a property that holds for
cases defined in the antecedent)
IF condition THEN conclusion
• The IF part of the rule is called rule antecedent or precondition.
• The THEN part of the rule is called rule consequent.
• The antecedent part (the condition) consists of one or more attribute tests, and these tests are
logically ANDed.
• The consequent part consists of class prediction.
• If the condition holds true for a given tuple, then the antecedent is satisfied.

Consider a rule R1,


R1: IF age = youth AND student = yes THEN buys_computer = yes

We can also write rule R1 as follows −


R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)
Rule Induction
• To extract a rule from a decision tree −
– One rule is created for each path from the root to the leaf node.
– To form a rule antecedent, each splitting criterion is logically ANDed.
– The leaf node holds the class prediction, forming the rule consequent.

• Some major rule induction paradigms are:


– Association rule learning algorithms (e.g., Agrawal)
– Decision rule algorithms (e.g., Quinlan 1987)
– Hypothesis testing algorithms (e.g., RULEX)
– Horn clause induction
– Version spaces
– Rough set rules
– Inductive Logic Programming
Decision Tree
• A decision tree is a flowchart-like structure in which each internal node represents
a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each
branch represents the outcome of the test, and each leaf node represents a class
label (decision taken after computing all attributes).
• The paths from root to leaf represent classification rules.
• Decision tree is a type of supervised learning algorithm (having a predefined target
variable) that is mostly used in classification problems.
• It works for both categorical and continuous input and output variables.
• In this technique, we split the population or sample into two or more
homogeneous sets (or sub-populations) based on most significant splitter /
differentiator in input variables.
Decision Tree
Terminologies used with Decision trees:
– Root Node: It represents entire population or sample and this further gets divided into two or more
homogeneous sets.
– Splitting: It is a process of dividing a node into two or more sub-nodes.
– Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
– Leaf/ Terminal Node: Nodes that do not split are called leaf or terminal nodes.
– Pruning: When we remove sub-nodes of a decision node, this process is called pruning. You can say it is
the opposite of the splitting process.
– Branch / Sub-Tree: A sub section of entire tree is called branch or sub-tree.
– Parent and Child Node: A node, which is divided into sub-nodes is called parent node of sub-nodes
where as sub-nodes are the child of parent node.
Splitting Criterion
• Entropy is a measure of the degree of disorder/impurity or uncertainty in a node. For a node whose
classes occur in proportions pi, Entropy = − Σ pi log2(pi).
• Information gain is a measure of how much information a feature provides about a class:
Information Gain = Entropy(parent) − weighted average Entropy(children). Information gain helps to
determine the order of attributes in the nodes of a decision tree.

• Gini impurity (L. Breiman et al. 1984) is a measure of non-homogeneity. The Gini index,
also called the Gini coefficient or Gini impurity, computes the probability of a specific
instance being wrongly classified when chosen randomly: Gini = 1 − Σ pi².
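The sketch below computes these measures for a made-up binary-label node and one candidate split; the helper functions entropy and gini and the example arrays are illustrative, not part of any particular library.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))            # -sum(p_i * log2(p_i))

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)               # 1 - sum(p_i^2)

parent = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
left, right = parent[:4], parent[4:]          # one candidate split of the node

weighted_child_entropy = (len(left) / len(parent)) * entropy(left) \
                       + (len(right) / len(parent)) * entropy(right)
info_gain = entropy(parent) - weighted_child_entropy
print(f"entropy(parent) = {entropy(parent):.3f}, gini(parent) = {gini(parent):.3f}")
print(f"information gain of the split = {info_gain:.3f}")
```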
Splitting Criterion
• Gain Ratio (or Uncertainty Coefficient) is used to normalize the information
gain of an attribute against how much entropy that attribute has:
Gain Ratio = Information Gain / Split Info.
• Split Info measures the entropy of the split itself:
Split Info = − Σ (|Dj| / |D|) log2(|Dj| / |D|), where Dj is the j-th partition produced by the split.
• Chi-square is another method of splitting nodes in a decision tree for
datasets having categorical target values. It can make two or more
splits. It works on the statistical significance of differences between
the parent node and the child nodes.
Pros & Cons
Advantages of decision trees:
• Easy to Understand: Decision tree output is very easy to understand even for people from non-analytical
background. It does not require any statistical knowledge to read and interpret them. Its graphical representation
is very intuitive and users can easily relate their hypothesis.
• Useful in Data exploration: Decision tree is one of the fastest ways to identify the most significant variables and the
relation between two or more variables. With the help of decision trees, we can also create new variables/features that
have better power to predict the target variable.
• Less data cleaning required: It requires less data cleaning compared to some other modeling techniques. It is fairly
robust to outliers and missing values.
• Data type is not a constraint: It can handle both numerical and categorical variables.
• Non Parametric Method: Decision tree is considered to be a non-parametric method. This means that decision
trees have no assumptions about the space distribution and the classifier structure.

Disadvantages of decision trees:


• They are unstable, meaning that a small change in the data can lead to a large change in the structure of the
optimal decision tree.
• They are often relatively inaccurate. Many other predictors perform better with similar data. This can be remedied
by replacing a single decision tree with a random forest of decision trees, but a random forest is not as easy to
interpret as a single decision tree.
• For data including categorical variables with different numbers of levels, information gain in decision trees is
biased in favor of those attributes with more levels.
• Calculations can get very complex, particularly if many values are uncertain and/or if many outcomes are linked.
Neural Network
• A neural network (NN), in the case of artificial neurons called artificial neural network
(ANN) or simulated neural network (SNN), is an interconnected group of natural or
artificial neurons that uses a mathematical or computational model for information
processing based on a connectionist approach to computation.
• An ANN is an adaptive system that changes its structure based on external or internal
information that flows through the network.
• In more practical terms neural networks are non-linear statistical data modeling or
decision making tools.
• They can be used to model complex relationships between inputs and outputs or to find
patterns in data. Using algorithms, they can recognize hidden patterns and correlations
in raw data, cluster and classify it, and – over time – continuously learn and improve.
• Applications:
– Chatbots
– Natural language processing, translation and language generation
– Stock market prediction
– Delivery driver route planning and optimization
– Drug discovery and development and so on
How Neural Network work?
• Initially, the dataset should be fed into the input layer which will then flow to the hidden
layer.
• The connections which exist between the two layers randomly assign weights to the input.
• A bias is added to each input. Bias is a constant which is used in the model to fit best for the
given data.
• The weighted sum of all the inputs, plus the bias, is sent to a function that decides the
activation status of a neuron. This function is called the activation function.
• The nodes that are required to fire for feature extraction are decided based on the output
value of the activation function.
• The final output of the network is then compared to the required labelled data of our dataset
to calculate the final cost error. The cost error tells us how 'bad' our network is,
so we want this error to be as small as possible.
• The weights are adjusted through backpropagation, which reduces the error. This
backpropagation process can be considered the central mechanism by which neural networks
learn. It fine-tunes the weights of the deep neural network in order to reduce the
cost value.
How Neural Network work?
Backpropagation
• Backpropagation is a widely used algorithm for training feedforward neural
networks.
• It computes the gradient of the loss function with respect to the network weights
and is very efficient, rather than naively directly computing the gradient with
respect to each individual weight.
• This efficiency makes it possible to use gradient methods to train multi-layer
networks and update weights to minimize loss; variants such as gradient descent
or stochastic gradient descent are often used.
Working:
• Neural networks use supervised learning to generate output vectors from the input
vectors that the network operates on. The network compares the generated output to the
desired output and computes an error if the result does not match the desired output
vector. It then adjusts the weights according to this error to get the desired output.
Backpropagation Algorithm
• Step 1: Inputs X, arrive through the preconnected path.
• Step 2: The input is modeled using true weights W. Weights are usually chosen
randomly.
• Step 3: Calculate the output of each neuron from the input layer to the hidden
layer to the output layer.
• Step 4: Calculate the error in the outputs
Backpropagation Error= Actual Output – Desired Output
• Step 5: From the output layer, go back to the hidden layer to adjust the weights to
reduce the error.
• Step 6: Repeat the process until the desired output is achieved.
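Below is a very small NumPy sketch of these steps, not the exact algorithm of any particular library: one hidden layer, sigmoid activations, and a made-up XOR-style dataset; the layer sizes, learning rate (0.5) and iteration count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # Step 1: inputs arrive
y = np.array([[0], [1], [1], [0]], dtype=float)               # desired outputs

# Step 2: weights (and biases) are initialised randomly.
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(10000):                          # Step 6: repeat until the error is small
    h = sigmoid(X @ W1 + b1)                    # Step 3: forward pass, hidden layer
    out = sigmoid(h @ W2 + b2)                  #         forward pass, output layer
    error = out - y                             # Step 4: error = actual - desired output
    # Step 5: propagate the error backwards and adjust weights and biases.
    grad_out = error * out * (1 - out)
    grad_h = (grad_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * (h.T @ grad_out);  b2 -= 0.5 * grad_out.sum(axis=0, keepdims=True)
    W1 -= 0.5 * (X.T @ grad_h);    b1 -= 0.5 * grad_h.sum(axis=0, keepdims=True)

print(np.round(out, 2))   # the outputs should move toward [[0], [1], [1], [0]]
```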
Backpropagation
Backpropagation is “backpropagation of errors” and is very useful for training
neural networks. It’s fast, easy to implement, and simple. Backpropagation
does not require any parameters to be set, except the number of inputs.
Backpropagation is a flexible method because no prior knowledge of the
network is required.

Types of Backpropagation
• Static backpropagation: Static backpropagation is a network designed to
map static inputs for static outputs. These types of networks are capable
of solving static classification problems such as OCR (Optical Character
Recognition).
• Recurrent backpropagation: Recurrent backpropagation is another type of
network, used for fixed-point learning. Activation in recurrent
backpropagation is feed-forward until a fixed value is reached. Static
backpropagation provides an instant mapping, while recurrent
backpropagation does not provide an instant mapping.
Types of Neural Networks
Neural networks can also be described by the number of hidden nodes the model has or in terms
of how many inputs and outputs each node has. Variations on the classic neural network design
allow various forms of forward and backward propagation of information among tiers.
• Feed-forward neural networks: one of the simplest variants of neural networks. They pass
information in one direction, through various input nodes, until it makes it to the output
node. The network may or may not have hidden node layers, making their functioning more
interpretable. It can process data containing large amounts of noise. This type of ANN
computational model is used in technologies such as facial recognition and computer vision.
Types of Neural Networks
• Recurrent neural networks: more complex. They save the output of processing
nodes and feed the result back into the model. This is how the model is said to
learn to predict the outcome of a layer. Each node in the RNN model acts as a
memory cell, continuing the computation and implementation of operations. This
neural network starts with the same front propagation as a feed-forward network,
but then goes on to remember all processed information in order to reuse it in the
future. If the network's prediction is incorrect, then the system self-learns and
continues working towards the correct prediction during backpropagation. This
type of ANN is frequently used in text-to-speech conversions.
Types of Neural Networks
• Convolutional neural networks: one of the most popular models used today. This
neural network computational model uses a variation of multilayer perceptron and
contains one or more convolutional layers that can be either entirely connected or
pooled. These convolutional layers create feature maps that record a region of
image which is ultimately broken into rectangles and sent out for nonlinear. The
CNN model is particularly popular in the realm of image recognition; it has been
used in many of the most advanced applications of AI, including facial recognition,
text digitization and natural language processing. Other uses include paraphrase
detection, signal processing and image classification.
Types of Neural Networks
• Deconvolutional neural networks: utilize a reversed CNN model process. They aim
to find lost features or signals that may have originally been considered
unimportant to the CNN system's task. This network model can be used in image
synthesis and analysis.
Types of Neural Networks
• Modular neural networks: contain multiple neural networks working separately
from one another. The networks do not communicate or interfere with each
other's activities during the computation process. Consequently, complex or big
computational processes can be performed more efficiently.
Advantages
Advantages of neural networks include:
• Parallel processing abilities: mean the network can perform more than one job at a time.
• Information is stored on an entire network, not just a database.
• The ability to learn and model nonlinear, complex relationships helps model the real-life
relationships between input and output.
• Fault tolerance means the corruption of one or more cells of the NN will not stop the generation of
output.
• Gradual corruption means the network will slowly degrade over time, instead of a problem
destroying the network instantly.
• The ability to produce output even with incomplete knowledge, with the loss of performance depending
on how important the missing information is.
• No restrictions are placed on the input variables, such as how they should be distributed.
• Machine learning means the NN can learn from events and make decisions based on the
observations.
• The ability to learn hidden relationships in the data without commanding any fixed relationship
means an NN can better model highly volatile data and non-constant variance.
• The ability to generalize and infer unseen relationships on unseen data means NNs can predict the
output of unseen data.
Disadvantages
The disadvantages of NNs include:
• The lack of rules for determining the proper network structure means the
appropriate artificial neural network architecture can only be found through trial
and error and experience.
• The requirement of processors with parallel processing abilities makes neural
networks hardware-dependent.
• The network works with numerical information, therefore all problems must be
translated into numerical values before they can be presented to the NN.
• The lack of explanation behind probing solutions is one of the biggest
disadvantages in NNs. The inability to explain the why or how behind the solution
generates a lack of trust in the network.
ANN Learning

• Parameter Learning: It involves changing and updating the connecting weights in the NN.
• Structure Learning: It focuses on changing the structure or architecture of neural
network.
Competitive Learning
• Competitive learning is a form of unsupervised learning in artificial neural
networks, in which nodes compete for the right to respond to a subset of the input
data.
• In competitive learning, the neural network consists of single layer of output
neurons.
• All output neurons are connected to the input neurons.
• In this form of learning, all the output neurons compete against each other for the right to be
fired (activated).
Competitive Learning
There are three basic elements to a competitive learning rule:
• A set of neurons that are all the same except for some randomly distributed
synaptic weights, and which therefore respond differently to a given set of input
patterns
• A limit imposed on the "strength" of each neuron
• A mechanism that permits the neurons to compete for the right to respond to a
given subset of inputs, such that only one output neuron (or only one neuron per
group), is active (i.e. "on") at a time. The neuron that wins the competition is
called a "winner-take-all" neuron.
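A minimal sketch of this winner-take-all rule is given below, assuming 2-dimensional inputs, three output neurons and a fixed learning rate (all illustrative choices): for each input, only the winning neuron's weight vector is moved toward that input.

import numpy as np

# Competitive (winner-take-all) learning sketch on unlabeled data.
rng = np.random.default_rng(0)
X = rng.random((100, 2))          # unlabeled input patterns (assumed)
W = rng.random((3, 2))            # one weight vector per output neuron
lr = 0.1                          # learning rate (assumed)

for x in X:
    distances = np.linalg.norm(W - x, axis=1)   # each output neuron "competes"
    winner = np.argmin(distances)               # only the closest neuron is fired
    W[winner] += lr * (x - W[winner])           # only the winner's weights move toward x

print("learned prototypes:\n", W)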
Support Vector Machines
• Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems.
However, primarily, it is used for Classification problems in Machine Learning.
• The goal of the SVM algorithm is to create the best line or decision boundary that
can segregate n-dimensional space into classes so that we can easily put the new
data point in the correct category in the future. This best decision boundary is
called a hyperplane.
• SVM chooses the extreme points/vectors that help in creating the hyperplane.
These extreme cases are called support vectors, and hence the algorithm is termed
a Support Vector Machine.
Types of SVM
• Linear SVM : Linear SVM is used for data that are linearly separable i.e. for a
dataset that can be categorized into two categories by utilizing a single straight
line. Such data points are termed as linearly separable data, and the classifier
used is described as a Linear SVM classifier.

• Non-linear SVM: Non-Linear SVM is used for data that are non-linearly separable
data i.e. a straight line cannot be used to classify the dataset. For this, we use
something known as a kernel trick that sets data points in a higher dimension
where they can be separated using planes or other mathematical functions. Such
data points are termed as non-linear data, and the classifier used is termed as a
Non-linear SVM classifier.
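A brief illustration of both cases using scikit-learn's SVC is sketched below; the synthetic blob and circle datasets are assumptions chosen only for demonstration.

# Linear vs non-linear SVM classifiers with scikit-learn.
from sklearn.datasets import make_blobs, make_circles
from sklearn.svm import SVC

# Linearly separable data -> linear SVM
X_lin, y_lin = make_blobs(n_samples=200, centers=2, random_state=0)
linear_clf = SVC(kernel="linear").fit(X_lin, y_lin)

# Non-linearly separable data (concentric circles) -> RBF kernel (kernel trick)
X_circ, y_circ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
rbf_clf = SVC(kernel="rbf").fit(X_circ, y_circ)

print("linear SVM accuracy:", linear_clf.score(X_lin, y_lin))
print("RBF SVM accuracy:", rbf_clf.score(X_circ, y_circ))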
Linear SVM
• Select the hyper-plane which separates the two classes better.
• We do this by maximizing the distance between the closest data point and the
hyper-plane.
• The greater the distance, the better the hyperplane, and the better the classification
results.
• It can be seen in the figure that the hyperplane selected has the maximum
distance from the nearest point from each of those classes.
• The two dotted lines that run parallel to the hyperplane through the nearest points
of each class mark the margin boundaries; those nearest points are the support vectors of the hyperplane.
• Now, the distance of separation between the supporting vectors and the
hyperplane is called a margin.
• And the purpose of the SVM algorithm is to maximize this margin. The optimal
hyperplane is the hyperplane with maximum margin.
Hyperplane
• The hyperplane is defined by finding the optimal values of w (the weights) and b (the
intercept). These optimal values are found by minimizing the cost function.
• Once the algorithm collects these optimal values, the SVM model or the line
function f(x) efficiently classifies the two classes.
• The optimal hyperplane has equation w.x+b = 0. The left support vector has
equation w.x+b= -1 and the right support vector has w.x+b=1.
• Using the formula for the distance d between two parallel lines Ax + By + c1 = 0 and
Ax + By + c2 = 0, d = |c1 – c2|/√(A² + B²), we obtain the distance
between the two support vectors as 2/||w||.
• The cost function for SVM trades off a wide margin against misclassification; a standard form is written below.
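As a reference, a standard soft-margin form of this cost function (hinge loss plus a regularization term, with C as the trade-off parameter) can be written in LaTeX as:

% Soft-margin SVM objective: margin term + misclassification penalty
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \max\bigl(0,\ 1 - y_i\,(w \cdot x_i + b)\bigr)

Minimizing the first term maximizes the margin 2/||w||, while the second term penalizes points that fall on the wrong side of their support vector line. This is a standard formulation, not necessarily the exact expression shown on the original slide.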
Non Linear SVM
• The SVM kernel function takes in low dimensional input space and converts it to a
higher-dimensional space.
• It converts the not separable problem to a separable problem. It performs complex
data transformations based on the labels or outputs that define them.
• To separate non linearly separable data points, we have to add an extra dimension.
For linear data, two dimensions have been used, that is, x and y. For these data
points, we add a third dimension, say z. For the example below let z=x² +y².
• This z function or the added dimensionality transforms the sample space and the
above image will become as the following:
Non Linear SVM
• The above data points can be separated using a straight line function that
is either parallel to the x axis or is inclined at an angle.
• Different types of kernel functions are present — linear, nonlinear,
polynomial, radial basis function (RBF), and sigmoid.
• Consider the figure below: in 3D the boundary looks like a plane parallel to the x-axis. If
we convert it back to 2D space by setting z = 1, the boundary becomes the circle x² + y² = 1.
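The numeric sketch below illustrates this added-dimension idea on two synthetic concentric circles (the radii and sample size are assumptions): in (x, y) the classes are not linearly separable, but after adding z = x² + y² a plane separates them.

import numpy as np

# Points on two concentric circles: class 0 at radius 1, class 1 at radius 3.
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 100)

inner = np.c_[1.0 * np.cos(theta), 1.0 * np.sin(theta)]   # class 0
outer = np.c_[3.0 * np.cos(theta), 3.0 * np.sin(theta)]   # class 1

X = np.vstack([inner, outer])
z = X[:, 0] ** 2 + X[:, 1] ** 2    # the extra dimension z = x^2 + y^2

# In (x, y, z) space the two classes are separated by any plane z = c with 1 < c < 9.
print("inner-class z range:", z[:100].min(), z[:100].max())
print("outer-class z range:", z[100:].min(), z[100:].max())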
Kernel
• Kernel Function is a method used to take data as input and transform it into the
required form of processing data.
• “Kernel” is used due to a set of mathematical functions used in Support Vector
Machine providing the window to manipulate the data.
• It generally transforms the training data so that a non-linear decision surface
becomes a linear decision boundary in a higher-dimensional space.
• Basically, it returns the inner product between two points in a suitable feature
space.
• The standard kernel equation is given below.
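A widely used way to write the kernel equation, with φ denoting the (possibly implicit) feature mapping, is shown in LaTeX below, together with the radial basis function (RBF) kernel as a concrete example; the symbols φ and γ are notation assumed here rather than taken from the slide.

% Kernel as an inner product in the mapped feature space
K(\bar{x}, \bar{y}) = \varphi(\bar{x}) \cdot \varphi(\bar{y})

% Example: radial basis function (RBF) kernel
K_{\mathrm{RBF}}(\bar{x}, \bar{y}) = \exp\bigl(-\gamma\,\lVert \bar{x} - \bar{y} \rVert^{2}\bigr)

The kernel trick lets the SVM evaluate this inner product directly, without ever computing φ(x̄) explicitly.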
Kernel
SVM Applications
• Face detection – SVM classify parts of the image as a face and non-face and create a square
boundary around the face.
• Text and hypertext categorization – SVMs allow Text and hypertext categorization for both
inductive and transductive models. They use training data to classify documents into different
categories. It categorizes on the basis of the score generated and then compares with the
threshold value.
• Classification of images – Use of SVMs provides better search accuracy for image
classification. It provides better accuracy in comparison to the traditional query-based
searching techniques.
• Bioinformatics – It includes protein classification and cancer classification. We use SVM for
identifying the classification of genes, patients on the basis of genes and other biological
problems.
• Protein fold and remote homology detection – Apply SVM algorithms for protein remote
homology detection.
• Handwriting recognition – SVMs are widely used to recognize handwritten characters.
• Generalized predictive control(GPC) – Use SVM based GPC to control chaotic dynamics with
useful parameters.
SVM Pros & Cons
• Pros:
– It works really well with a clear margin of separation
– It is effective in high dimensional spaces.
– It is effective in cases where the number of dimensions is greater than
the number of samples.
– It uses a subset of training points in the decision function (called
support vectors), so it is also memory efficient.
• Cons:
– It doesn’t perform well when we have large data set because the
required training time is higher
– It also doesn’t perform very well, when the data set has more noise
i.e. target classes are overlapping
– SVM doesn’t directly provide probability estimates; these are
calculated using an expensive five-fold cross-validation, as implemented
in the SVC class of the Python scikit-learn library.
Principal Component Analysis
• Principal Components Analysis (PCA) is a well-known unsupervised dimensionality reduction technique.
• It constructs relevant features/variables through linear (linear PCA) or non-linear (kernel
PCA) combinations of the original variables (features).
• The construction of relevant features is achieved by linearly transforming correlated variables into a
smaller number of uncorrelated variables. This is done by projecting (dot product) the original data into
the reduced PCA space using the eigenvectors of the covariance/correlation matrix aka the principal
components (PCs).
• The resulting projected data are essentially linear combinations of the original data capturing most of the
variance in the data.
• PCA is an orthogonal transformation of the data into a series of uncorrelated data living in the reduced
PCA space such that the first component explains the most variance in the data with each subsequent
component explaining less.
Principal Component Analysis
• The PCA technique is particularly useful in processing data where multicollinearity exists between
the features/variables.
• PCA can be used when the dimensions of the input features are high (e.g. a lot of variables).
• PCA can be also used for denoising and data compression.
• Principal Component Analysis (PCA) extracts the most important information. This in turn
leads to compression, since the less important information is discarded. With fewer data
points to consider, it becomes simpler to describe and analyze the dataset.
• PCA can be seen as a trade-off between faster computation and less memory consumption
versus information loss. It is considered one of the most useful tools for data analysis.
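A short scikit-learn sketch of PCA as a dimensionality reduction step is shown below; the Iris dataset and the choice of two components are illustrative assumptions.

# PCA for dimensionality reduction with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = load_iris().data                       # 4 correlated numerical features
X_std = StandardScaler().fit_transform(X)  # standardize before PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)           # project onto the first 2 principal components

print("explained variance ratio:", pca.explained_variance_ratio_)
print("reduced shape:", X_pca.shape)       # (150, 2)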
Principal Component Analysis
Advantages & Disadvantages
Advantages
• Removes Correlated Features: After implementing the PCA on your dataset, all the Principal Components
are independent of one another. There is no correlation among them.
• Improves Algorithm Performance: With so many features, the performance of your algorithm will
drastically degrade. PCA is a very common way to speed up your Machine Learning algorithm by getting
rid of correlated variables which don’t contribute in any decision making.
• Reduces Overfitting: Overfitting mainly occurs when there are too many variables in the dataset. So, PCA
helps in overcoming the overfitting issue by reducing the number of features.
• Improves Visualization: It is very hard to visualize and understand the data in high dimensions. PCA
transforms a high dimensional data to low dimensional data (2 dimension) so that it can be visualized
easily.
Disadvantages
• Independent variables become less interpretable: After implementing PCA on the dataset, your original
features will turn into Principal Components. Principal Components are the linear combination of your
original features. Principal Components are not as readable and interpretable as original features.
• Data standardization is a must before PCA: all features must be numerical (categorical features have to be
converted first) and should be scaled to comparable ranges before PCA can be applied.
• Information Loss: Although Principal Components try to cover maximum variance among the features in a
dataset, if we don’t select the number of Principal Components with care, it may miss some information as
compared to the original list of features.
Fuzzy Logic
• Fuzzy logic is an approach to computing based on "degrees of truth" rather than
the usual "true or false" (1 or 0) Boolean logic on which the modern computer is
based.
• The idea of fuzzy logic was first advanced by Lotfi Zadeh of the University of
California at Berkeley in the 1960s.
• The term fuzzy refers to things that are not clear or are vague. In the real world
we often encounter situations in which we cannot determine whether a state
is true or false; in such cases fuzzy logic provides very valuable flexibility for reasoning.
Fuzzy Logic Architecture
• RULE BASE: It contains the set of rules and the IF-THEN conditions provided by the
experts to govern the decision-making system, on the basis of linguistic information.
Recent developments in fuzzy theory offer several effective methods for the design and
tuning of fuzzy controllers. Most of these developments reduce the number of fuzzy
rules.
• FUZZIFICATION: It is used to convert inputs i.e. crisp numbers into fuzzy sets. Crisp
inputs are basically the exact inputs measured by sensors and passed into the control
system for processing, such as temperature, pressure, rpm’s, etc.
• INFERENCE ENGINE: It determines the matching degree of the current fuzzy input with
respect to each rule and decides which rules are to be fired according to the input field.
Next, the fired rules are combined to form the control actions.
• DEFUZZIFICATION: It is used to convert the fuzzy sets obtained by the inference engine
into a crisp value. There are several defuzzification methods available and the best-
suited one is used with a specific expert system to reduce the error.
Fuzzy Logic
Membership function
• A graph that defines how each point in the input space is
mapped to a membership value between 0 and 1. The input space is
often referred to as the universe of discourse or universal set
(u), which contains all the possible elements of concern in
each particular application.
• There are largely three types of fuzzifiers:
– Singleton fuzzifier
– Gaussian fuzzifier
– Trapezoidal or triangular fuzzifier
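A minimal sketch of two common membership functions follows; the breakpoints and the sampled universe of discourse are illustrative assumptions.

import numpy as np

def triangular(x, a, b, c):
    # Membership rises linearly from a to a peak at b and falls back to zero at c
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def gaussian(x, mean, sigma):
    # Bell-shaped membership centered at mean with spread sigma
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2)

u = np.linspace(0, 10, 5)                 # universe of discourse, sampled at a few points
print("triangular:", triangular(u, 2, 5, 8))
print("gaussian  :", gaussian(u, 5, 1.5))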
Fuzzy Set
• Fuzzy sets theory is an extension of classical set theory. Elements have varying degree of membership.
• A logic based on two truth values -True and False is sometimes insufficient when describing human
reasoning.
• Fuzzy Logic uses the whole interval between 0 (false) and 1 (true) to describe human reasoning.
• A Fuzzy Set is any set that allows its members to have different degrees of membership, given by a membership
function taking values in the interval [0,1]. Fuzzy sets can be considered an extension of classical sets, with a
classical (crisp) set being the special case in which memberships are only 0 or 1.
• Classical set contains elements that satisfy precise properties of membership while fuzzy set contains
elements that satisfy imprecise properties of membership.
Fuzzy Operation
• Fuzzy operation involves use of fuzzy sets and membership functions. Each fuzzy set is a
representation of a linguistic variable that defines the possible state of output.
Membership function is the function of a generic value in a fuzzy set, such that both the
generic value and the fuzzy set belong to a universal set.
• The degrees of membership of that generic value in the fuzzy set determines the
output, based on the principle of IF-THEN. The memberships are assigned based on the
assumption of outputs with the help of inputs and rate of change of inputs. A
membership function is basically a graphical representation of the fuzzy set.
• Consider a value ‘x’ such that x ∈ X and a fuzzy set A, which is a
subset of X. The membership of ‘x’ in the subset A is given as fA(x), with values in the
interval [0,1]. Note that fA(x) denotes the membership value of x in A.
Example
• A simple fuzzy control system to control operation of a washing machine such that the fuzzy
system controls the washing process, water intake, wash time and spin speed.
• The input parameters here are the volume of clothes, degree of dirt and type of dirt. While
the volume of clothes would determine the water intake, the degree of dirt in turn would be
determined by the transparency of water and the type of dirt is determined by the time at
which the water color remains unchanged.
• Step 1: The first step would involve defining linguistic variables and terms. For the inputs, the
linguistic variables are as given below
– Type of Dirt: {Greasy, Medium, Not Greasy }
– Quality of Dirt: {Large, Medium, Small }
– For output, the linguistic variables are as given below
– Wash Time: {Short, Very Short, Long, Medium, Very Long}
• Step 2: The second step involves construction of membership functions.
Example
• Step 3: The third step involves developing a set of rules for the knowledge base. Given below are the set of
rules using IF-THEN logic
– IF quality of dirt is Small AND Type of dirt is Greasy, THEN Wash Time is Long.
– IF quality of dirt is Medium AND Type of dirt is Greasy, THEN Wash Time is Long.
– IF quality of dirt is Large and Type of dirt is Greasy, THEN Wash Time is Very Long.
– IF quality of dirt is Small AND Type of dirt is Medium, THEN Wash Time is Medium.
– IF quality of dirt is Medium AND Type of dirt is Medium, THEN Wash Time is Medium.
– IF quality of dirt is Large and Type of dirt is Medium, THEN Wash Time is Medium.
– IF quality of dirt is Small AND Type of dirt is Non-Greasy, THEN Wash Time is Very Short.
– IF quality of dirt is Medium AND Type of dirt is Non-Greasy, THEN Wash Time is Medium.
– IF quality of dirt is Large AND Type of dirt is Non-Greasy, THEN Wash Time is Long.
• Step 4: The fuzzifier initially converts the crisp sensor inputs into these linguistic variables; the inference
engine then applies the above rules, performing fuzzy set operations (like MIN and MAX) to determine the output
fuzzy sets. Based upon the output fuzzy sets, the output membership function is developed.
• Step 5: The final step is the defuzzification step where the Defuzzifier uses the output membership
functions to determine the output washing time.
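A compact numerical sketch of this five-step controller is given below: triangular membership functions, MIN for rule firing, MAX for aggregation and centroid defuzzification. All breakpoints and the 0-100 input scales are illustrative assumptions rather than values taken from the example.

import numpy as np

def tri(x, a, b, c):
    # Triangular membership rising from a to a peak at b and falling to c
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

# Universe of discourse for wash time (minutes) and its output fuzzy sets
t = np.linspace(0, 60, 601)
wash = {
    "very_short": tri(t, 0, 5, 15),
    "short":      tri(t, 5, 15, 25),
    "medium":     tri(t, 15, 30, 45),
    "long":       tri(t, 35, 45, 55),
    "very_long":  tri(t, 45, 55, 61),
}

def fuzzify(dirt_amount, grease_level):
    # Fuzzification: crisp sensor readings (assumed 0-100) -> membership degrees
    dirt = {"small":  tri(dirt_amount, -1, 0, 50),
            "medium": tri(dirt_amount, 0, 50, 100),
            "large":  tri(dirt_amount, 50, 100, 101)}
    grease = {"not_greasy": tri(grease_level, -1, 0, 50),
              "medium":     tri(grease_level, 0, 50, 100),
              "greasy":     tri(grease_level, 50, 100, 101)}
    return dirt, grease

# Rule base: (quality of dirt, type of dirt) -> wash time, mirroring Step 3
RULES = [
    ("small", "greasy", "long"),           ("medium", "greasy", "long"),
    ("large", "greasy", "very_long"),      ("small", "medium", "medium"),
    ("medium", "medium", "medium"),        ("large", "medium", "medium"),
    ("small", "not_greasy", "very_short"), ("medium", "not_greasy", "medium"),
    ("large", "not_greasy", "long"),
]

def wash_time(dirt_amount, grease_level):
    dirt, grease = fuzzify(dirt_amount, grease_level)
    aggregated = np.zeros_like(t)
    for q, g, out in RULES:
        firing = min(dirt[q], grease[g])                        # MIN of antecedents
        aggregated = np.maximum(aggregated,
                                np.minimum(firing, wash[out]))  # clip and MAX-aggregate
    return np.sum(t * aggregated) / np.sum(aggregated)          # centroid defuzzification

print("wash time (minutes):", round(wash_time(dirt_amount=70, grease_level=80), 1))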
Fuzzy Decision Trees
• Fuzzy decision trees represent classification knowledge in a way closer to
human thinking and are more robust in tolerating imprecise,
conflicting, and missing information.
• Similar to crisp decision trees.
• Allow more flexibility.
Fuzzy Decision Trees
Applications
• It is used in the aerospace field for altitude control of spacecraft and satellites.
• It has been used in the automotive system for speed control, traffic control.
• It is used for decision-making support systems and personal evaluation in the large
company business.
• It has application in the chemical industry for controlling the pH, drying, chemical
distillation process.
• Fuzzy logic is used in Natural language processing and various intensive
applications in Artificial Intelligence.
• Fuzzy logic is extensively used in modern control systems such as expert systems.
• Fuzzy Logic is used with Neural Networks as it mimics how a person would make
decisions, only much faster. It is done by Aggregation of data and changing it into
more meaningful data by forming partial truths as Fuzzy sets.
Advantages & Disadvantages
Stochastic Search
• Stochastic optimization or search refers to the use of randomness in the objective function
or in the optimization algorithm.
• Randomness in the objective function means that the evaluation of candidate solutions
involves some uncertainty or noise and algorithms must be chosen that can make progress in
the search in the presence of this noise.
• Randomness in the algorithm is used as a strategy, e.g. stochastic or probabilistic decisions. It
is used as an alternative to deterministic decisions in an effort to improve the likelihood of
locating the global optima or a better local optima.
• Challenging optimization algorithms, such as high-dimensional nonlinear objective problems,
may contain multiple local optima in which deterministic optimization algorithms may get
stuck.
• Stochastic optimization algorithms provide an alternative approach that permits less optimal
local decisions to be made within the search procedure that may increase the probability of
the procedure locating the global optima of the objective function.
Stochastic Search
• There are many stochastic optimization algorithms.
• Some examples of stochastic optimization algorithms include:
– Iterated Local Search
– Stochastic Hill Climbing
– Stochastic Gradient Descent
– Tabu Search
– Greedy Randomized Adaptive Search Procedure
• Some examples of stochastic optimization algorithms that are inspired by
biological or physical processes include:
– Simulated Annealing
– Evolution Strategies
– Genetic Algorithm
– Differential Evolution
– Particle Swarm Optimization
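As one concrete example from the list above, the sketch below implements stochastic hill climbing on a simple one-dimensional objective; the objective function, step size and iteration budget are illustrative assumptions.

import random

def objective(x):
    # Single global optimum at x = 3 with value 9
    return -(x - 3.0) ** 2 + 9.0

def stochastic_hill_climb(start=0.0, step=0.5, iters=1000, seed=42):
    random.seed(seed)
    best_x, best_f = start, objective(start)
    for _ in range(iters):
        candidate = best_x + random.uniform(-step, step)   # random (stochastic) move
        f = objective(candidate)
        if f >= best_f:                                    # keep only improvements
            best_x, best_f = candidate, f
    return best_x, best_f

print(stochastic_hill_climb())             # approximately (3.0, 9.0)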
Genetic Algorithm
• Genetic algorithms (GA) are adaptive search algorithms, adaptive in terms of the
number and types of parameters you provide.
• The algorithm identifies the best (optimal) solution among several candidate solutions, and
its design is based on natural genetics and selection.
• Genetic algorithm emulates the principles of natural evolution, i.e. survival of the
fittest.
• Natural evolution propagates the genetic material in the fittest individuals from
one generation to the next.
• The genetic algorithm applies the same technique in data mining – it iteratively
performs the selection, crossover, mutation, and encoding process to evolve the
successive generation of models.
Genetic Algorithm Phases
Initial population
• Being the first phase of the algorithm, it includes a set of individuals where each individual is a solution to the
concerned problem. We characterize each individual by the set of parameters that we refer to as genes.
Calculate Fitness
• A fitness function is implemented to compute the fitness of each individual in the population. The function
provides a fitness score to each individual in the population. The fitness score is the probability of the individual
selection in the reproduction process.
Selection
• The selection process selects the individuals with the highest fitness score and is allowed to pass on their genes to
the next generation.
Crossover
• It is a core phase of the genetic algorithm. Now the algorithm chooses a crossover point within the parents’ genes
chosen for mating.
• Offspring are generated by the chosen parents exchanging their genes with each other up to the
crossover point. These newly created offspring are then added to the population.
Mutation
• The mutation phase inserts random genes into the generated offspring to maintain the population’s diversity. It is
done by flipping random genes in new offspring.
Termination
• The iteration of the algorithm stops when it produces offspring that is not different from the previous generation.
It is said to have produced a set of solutions for the problem at this stage.
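The phases above are illustrated by the compact sketch below, which maximizes the number of 1-bits in a bit string; the population size, rates, chromosome length and the fixed generation budget used as a simple termination criterion are all illustrative assumptions.

import random

LENGTH, POP_SIZE, GENERATIONS, MUT_RATE = 20, 30, 50, 0.02
random.seed(0)

def fitness(ind):                      # fitness = number of 1s in the chromosome
    return sum(ind)

def select(pop):                       # tournament selection of one parent
    return max(random.sample(pop, 3), key=fitness)

def crossover(p1, p2):                 # exchange genes at a single crossover point
    point = random.randint(1, LENGTH - 1)
    return p1[:point] + p2[point:]

def mutate(ind):                       # flip random genes to maintain diversity
    return [1 - g if random.random() < MUT_RATE else g for g in ind]

# Initial population of random bit strings
population = [[random.randint(0, 1) for _ in range(LENGTH)] for _ in range(POP_SIZE)]

for gen in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP_SIZE)]

best = max(population, key=fitness)
print("best individual:", best, "fitness:", fitness(best))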
Advantages & Disadvantages
Advantages
• Easy to understand as it is based on the concept of natural evolution.
• Classifies an optimal solution from a set of solutions.
• GA uses payoff (fitness) information instead of derivatives to yield an
optimal solution.
• GA backs multi-objective optimization.
• GA is an adaptive search algorithm.
• GA also operates in a noisy environment.
Disadvantages
• An improper implementation may lead to a solution that is not optimal.
• Implementing fitness function iteratively may lead to computational
challenges.
• GA is time-consuming as it deals with a lot of computation.
Applications
Applications of GA
• GA is used in implementing many applications; let's discuss a few of
them.
• Economics: In the field of economics GA is used to implement
certain models that conduct competitive analysis, decision making,
and effective scheduling.
• Aircraft Design: GA is used to provide the parameters that must be
modified and upgraded in order to get a better design.
• DNA Analysis: GA is used to establish DNA structure using
spectrometric information.
• Transport: GA is used to develop a transport plan that is time and
cost-efficient.
• Data mining: GA is applied to large data sets to determine the
optimal solution to the concerned problem.