EDA Unit-3
EDA Techniques
After you have collected a set of data, how do you do an exploratory data
analysis? What techniques do you employ? What do the various techniques
focus on? What conclusions can you expect to reach?
The sample plots and output in this section were generated with the Dataplot
software program. Other general purpose statistical data analysis programs
can generate most of the plots, intervals, and tests discussed here, or macros
can be written to achieve the same result.
Some common questions that exploratory data analysis is used to answer include:
● What is a percentile?
● What is the best function for relating a response variable to a set of factor variables?
This sample autocorrelation plot shows that the time series is not
random, but rather has a high degree of autocorrelation between
adjacent and near-adjacent observations.
Note: some sources may use the following formula for the autocovariance function (dividing by N − h rather than N):

c_h = (1/(N − h)) Σ_{t=1}^{N−h} (Y_t − Ȳ)(Y_{t+h} − Ȳ)
1. If the autocorrelation plot is being used to test for randomness (i.e., there is no time dependence in the data), the following formula for the confidence bands is recommended:

± z_{1−α/2} / √N

where N is the sample size, z_{1−α/2} is the 1 − α/2 quantile of the standard normal distribution, and α is the significance level. In this case, the confidence bands have fixed width that depends on the sample size. This is the formula that was used to generate the confidence bands in the above plot.
2. Autocorrelation plots are also used in the model identification stage for fitting ARIMA models. In this case, a moving average model is assumed for the data, and the following confidence bands should be generated:

± z_{1−α/2} √((1 + 2 Σ_{i=1}^{k−1} r_i²) / N)

where k is the lag, r_i is the sample autocorrelation at lag i, N is the sample size, z_{1−α/2} is the 1 − α/2 quantile of the standard normal distribution, and α is the significance level. In this case, the confidence bands increase as the lag increases.
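As a rough illustrative sketch (not taken from the source text), both kinds of bands can be computed directly. Here `r` is assumed to hold the sample autocorrelations r_1 … r_K; note that the exact upper limit of the Bartlett-style sum varies slightly between references.

```python
import numpy as np
from scipy.stats import norm

def acf_confidence_bands(r, n, alpha=0.05):
    """Half-widths of the two 100*(1 - alpha)% confidence bands described above.

    r : sample autocorrelations r_1 ... r_K (lag 0 excluded)
    n : sample size N
    """
    z = norm.ppf(1 - alpha / 2)
    # 1. Fixed-width band used when testing for randomness.
    fixed = z / np.sqrt(n)
    # 2. Bartlett-style band used for ARIMA (moving average) model identification;
    #    at lag k it uses 1 + 2*(r_1^2 + ... + r_{k-1}^2), so it widens with the lag.
    cumulative = np.concatenate(([0.0], np.cumsum(np.square(r[:-1]))))
    widening = z * np.sqrt((1 + 2 * cumulative) / n)
    return fixed, widening
```

At lag 1 the two bands coincide; for higher lags the second band widens as more squared autocorrelations accumulate.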
The autocorrelation plot also helps answer questions such as:

8. Is the model Y = constant + error valid and sufficient?
9. Is the formula s_Ȳ = s/√N (the formula for determining the standard deviation of the sample mean) valid?

For univariate data, the default model is

Y = constant + error

If the data are not random, this model is incorrect and invalid, and the estimates for the parameters (such as the constant) become nonsensical and invalid.
In short, if the analyst does not check for randomness, then the
validity of many of the statistical conclusions becomes suspect.
The autocorrelation plot is an excellent way of checking for such
randomness.
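As a quick sketch (assuming the observations are in a NumPy array called `data`; the simulated values below are placeholders), statsmodels can draw the sample autocorrelation plot together with its confidence bands:

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# Hypothetical measurements; replace with your own series.
rng = np.random.default_rng(42)
data = rng.normal(size=200)

# Lags that stay inside the bands are consistent with randomness;
# many lags outside them suggest autocorrelation in the data.
plot_acf(data, lags=40, alpha=0.05)
plt.show()
```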
Typical example patterns include: random data (white noise), weak (moderate) autocorrelation, strong autocorrelation (an autoregressive model), and a sinusoidal model.
A few lags slightly outside the 95% and 99% confidence limits do
not necessarily indicate non-randomness. For a 95% confidence
interval, we might expect about one out of twenty lags to be
statistically significant due to random fluctuations.
The plot starts with a moderately high autocorrelation at lag 1 (approximately 0.75) that gradually decreases. The
decreasing autocorrelation is generally linear, but with significant noise. Such a pattern is the autocorrelation plot
signature of "moderate autocorrelation", which in turn provides moderate predictability if modeled properly.
The plot starts with a high autocorrelation at lag 1 (only slightly less than
1) that slowly declines. It continues decreasing until it becomes negative
and starts showing an increasing negative autocorrelation. The decreasing
autocorrelation is generally linear with little noise. Such a pattern is the
autocorrelation plot signature of "strong autocorrelation", which in turn
provides high predictability if modeled properly.
The next step would be to estimate the parameters for the autoregressive model:

Y_i = A_0 + A_1·Y_{i−1} + E_i

Such estimation can be performed by using least squares linear regression.
The randomness assumption for least squares fitting applies to the residuals of the model. That is, even though the original data exhibit non-randomness, the residuals after fitting Y_i against Y_{i−1} should be random.
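A minimal sketch of this step, assuming the series is in a NumPy array `y` (the simulated data below is only for illustration):

```python
import numpy as np

# Hypothetical example series with strong lag-1 autocorrelation (an AR(1) process).
rng = np.random.default_rng(0)
e = rng.normal(size=500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 5.0 + 0.8 * y[t - 1] + e[t]

# Estimate Y_i = A0 + A1 * Y_{i-1} + E_i by ordinary least squares.
X = np.column_stack([np.ones(len(y) - 1), y[:-1]])   # constant + lagged value
a0, a1 = np.linalg.lstsq(X, y[1:], rcond=None)[0]

# The residuals of this fit, not the raw data, should look random.
residuals = y[1:] - (a0 + a1 * y[:-1])
lag1 = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
print(f"A0 = {a0:.3f}, A1 = {a1:.3f}, lag-1 residual autocorrelation = {lag1:.3f}")
```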
With the development of Big Data, many preliminary data-preparation phases have appeared. A whole terminology has developed around them, and it becomes difficult for non-specialists to detect the nuances between the different terms. Among these preliminary phases, Data Preparation and Data Exploration occupy a large place: they are the way in which raw data is integrated and processed in Business Intelligence (BI) software.
BI covers all the means by which data is collected and modelled to assist in decision-making. Ultimately, BI provides an overview of an activity.
Data Preparation
Data Preparation is the very first phase of a business intelligence project. It is the phase of transforming raw data into useful information that will later be used for decision-making. Data sources are merged and filtered; finally, they are aggregated and additional values are calculated from the raw data.
Data Preparation is essentially the phase that precedes the analysis, and a graphical user interface that makes the preparation approachable is preferable. Data Preparation is mainly used for the analysis of business data: it involves the collection, cleaning, and consolidation of data into a file that can then be used for the analysis.
This phase is of course essential for filtering unstructured and disordered data. Data
Preparation also makes it possible to connect data from different sources, all in real time.
Another important advantage of Data Preparation is that it allows you to manage the data
collected from a file and to obtain a quick report of this data.
The various data preparation procedures include data collection, which is the initial
process for any organization or business. It is at this stage that data is collected from a
variety of sources. These sources can really be of any type.
The next step is data discovery. It is then important to understand the data collected in
order to classify it into different sets. As the data is often very large, filtering the data can
be very time consuming.
It is then equally important to clean and validate the data (data cleansing) in order to
remove and discard anything that is not useful for later steps when decision-making is
required. Unnecessary or aberrant data should be removed at this stage, and appropriate models should be used to refine the data set. Sensitive data should be protected, for example by locking access to it.
Once the data has been cleansed, it must go through the test team who will perform all
necessary checks. The next step is to define the format of the value entries in order to
make the set accessible and understandable to decision-makers. Once all these procedures
have been carried out, the data remains to be stored. The analysis tools can then be
implemented.
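A hedged sketch of such a preparation pipeline in pandas; the file names and column names (orders.csv, customers.csv, customer_id, amount, ...) are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical raw sources.
orders = pd.read_csv("orders.csv")        # e.g. order_id, customer_id, amount, date
customers = pd.read_csv("customers.csv")  # e.g. customer_id, region

# Merge and filter the sources.
df = orders.merge(customers, on="customer_id", how="left")
df = df[df["amount"] > 0]                  # discard aberrant rows
df["date"] = pd.to_datetime(df["date"])    # enforce a consistent value format
df = df.drop_duplicates()

# Aggregate and compute additional values from the raw data.
summary = (df.groupby("region")
             .agg(total_amount=("amount", "sum"),
                  avg_amount=("amount", "mean"),
                  n_orders=("order_id", "count")))
summary.to_csv("prepared_data.csv")
```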
Data Preparation has many advantages. Among other things, it allows possible errors to be corrected quickly. The quality of the data is improved, allowing for a more efficient and faster analysis.
Data Exploration
Data Exploration is the stage following the preparation phase. The prepared data is analysed to answer the questions that arose during data preparation. The data is explored interactively and reorganized so that it is presented in an understandable way and can be used by decision-makers. It is therefore a matter of exploring data that has not yet been transformed.
Exploration is necessary for decision-makers, who thereby obtain information on data that
was previously difficult to perceive. Data mining is in fact the first step in data analysis. It
is from this phase that it becomes possible to plan appropriate decisions for the
organization or company. This involves identifying and summarizing the main
characteristics of a set of data.
A team of experienced analysts is needed to handle visual analysis tools and statistical
management software. Sometimes it is necessary to use both manual and automated
tools.
Data can be explored manually or automatically. Automated methods are, of course, popular
because of their accuracy and speed. Data visualization tools are particularly effective. Manual
data mining allows you to filter and explore data in files such as Excel. Scripting is also used to
analyze raw data.
Among the techniques used for Data Exploration is univariate analysis, the simplest technique, since only one variable is present in the data. The variables are analyzed one by one. The analysis here depends on the type of variable, which can be categorical or continuous as the case may be.
Bivariate analysis involves the analysis of two variables; the empirical relationship between them is calculated. An analysis that includes more than two variables is called multivariate analysis. There is also principal component analysis, based on the conversion of correlated variables into a smaller number of uncorrelated variables.
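For instance, a hedged sketch on a hypothetical prepared dataset (the file and column names are illustrative, not from the source):

```python
import pandas as pd
from sklearn.decomposition import PCA

df = pd.read_csv("prepared_data.csv")   # hypothetical prepared file

# Univariate analysis: one variable at a time.
print(df["amount"].describe())          # continuous variable
print(df["region"].value_counts())      # categorical variable

# Bivariate analysis: empirical relationship between two variables.
print(df[["amount", "quantity"]].corr())

# Principal component analysis: convert correlated numeric variables
# into a smaller number of uncorrelated components.
numeric = df.select_dtypes("number").dropna()
print(PCA(n_components=2).fit(numeric).explained_variance_ratio_)
```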
After the exploration comes the discovery of the data: an inspection of trends and events used to create visualizations to present to the business managers concerned. Several tools exist to facilitate data exploration and visualisation; Tableau and Power BI are frequently used.
The quality of the input during the exploration process determines the quality of the output. It is therefore important to provide consistent, well-prepared input so that the output remains reliable.
In order for Data Exploration to lead to the construction of a valid predictive model, it is
necessary to proceed in stages. First, it is important to identify the variables. The input and
output variables must first be identified. Next, the type of data and the category of variables must
be identified.
The next step can be either univariate or bivariate analysis. Then the specialists proceed with the
processing of missing values and the treatment of outliers. After the variable transformation, the
creation of variables is the last step.
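A small sketch of the first of these stages (identifying variable types and quantifying missing values); the file name is a placeholder:

```python
import pandas as pd

df = pd.read_csv("raw_data.csv")   # hypothetical input

# Identify the category of each variable.
categorical = df.select_dtypes(include=["object", "category"]).columns.tolist()
continuous = df.select_dtypes(include=["number"]).columns.tolist()
print("categorical:", categorical)
print("continuous:", continuous)

# Share of missing values per column, to plan their treatment.
print(df.isna().mean().sort_values(ascending=False))
```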
Characterization
Characterization is a big data methodology that is used for generating descriptive parameters that
effectively describe the characteristics and behavior of a particular data item. This is then used in
unsupervised learning algorithms in order to find patterns, clusters and trends without incorporating class
labels that may have biases. It has its uses in cluster analysis and even deep learning.
Some benefits of characterization:
● Can generate useful metrics for tracking and measuring events and anomalies in data sets
● Creates small footprint representations of essential information
● Quickly accomplishes data-to-information conversion, which brings the industry closer to the full
data-to-information-to-knowledge transformation
● Is useful for indexing and tagging specific objects, events and other features in a data collection
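As one possible illustration (not prescribed by the source), per-item descriptive parameters can be derived from an event log and fed to an unsupervised algorithm; the column names and the choice of k-means are assumptions:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical event log: one row per observed event for each item.
events = pd.read_csv("events.csv")   # e.g. item_id, timestamp, value

# Characterization: reduce each item's raw events to a small footprint of
# descriptive parameters.
profile = (events.groupby("item_id")["value"]
                 .agg(["mean", "std", "min", "max", "count"]))

# The unlabeled profiles can then be clustered to find patterns and trends.
labels = KMeans(n_clusters=3, n_init=10).fit_predict(profile.fillna(0.0))
print(profile.assign(cluster=labels).head())
```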
Outlier – an outlier is an unusual observation with respect to either its x-value or its y-value. An x-outlier widens the scope of the regression, which usually makes it less accurate.
Leverage – a leverage point is a data point whose x-value (independent variable) is unusual while its y-value follows the predicted regression line; such a point may look acceptable precisely because it sits on the predicted regression line.
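The distinction can be checked numerically. A hedged sketch using statsmodels on simulated data; the thresholds 2p/n and |t| > 3 are common rules of thumb, not taken from the source:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical regression data.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=100)

model = sm.OLS(y, sm.add_constant(x)).fit()
influence = model.get_influence()

leverage = influence.hat_matrix_diag                 # high values flag unusual x (leverage)
stud_resid = influence.resid_studentized_external    # large |values| flag unusual y (outliers)

print("possible leverage points:", np.where(leverage > 2 * 2 / len(x))[0])
print("possible outliers:", np.where(np.abs(stud_resid) > 3)[0])
```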
2 Types of Duplicate Features
Two things distinguish top data scientists from others in most cases: Feature Creation and Feature Selection, i.e., creating features that capture deeper/hidden insights about the business or customer, and then making the right choices about which features to use in your model.
1. Duplicate Values: when two features have the same set of values.
2. Duplicate Index: when the values of two features are different, but they occur at the same index (the two features vary together one-to-one).
As shown in the example, all the ‘Camry’ cars are from 2018 and all the ‘Corolla’ cars are from 2019. There is nothing insightful for your machine learning model to learn from these features in training. I can also do integer encoding for the Car Model to replace Camry with 2018 and Corolla with 2019; then it is the same as case 1 above, Duplicate Values. You are better off dropping one of these two features.
Keeping both features of such a pair can cause problems:
— In the case of linear models, it introduces multicollinearity.
— If you are using tree-based models, it won't matter unless you are looking at feature importance.
— In the case of distance-based models, it will make that feature count more in the distance.
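A sketch of how both kinds of duplicates can be detected in pandas; the toy DataFrame mirrors the Camry/Corolla example, and the helper `one_to_one` is a hypothetical name:

```python
import pandas as pd

df = pd.DataFrame({
    "model":  ["Camry", "Camry", "Corolla", "Corolla"],
    "year":   [2018, 2018, 2019, 2019],          # duplicate index of "model"
    "price":  [24000, 23500, 21000, 21500],
    "price2": [24000, 23500, 21000, 21500],      # duplicate values of "price"
})

# 1. Duplicate values: columns whose values are exactly identical.
dup_value_cols = df.T[df.T.duplicated()].index.tolist()

# 2. Duplicate index: two columns whose values differ but map one-to-one.
def one_to_one(a, b):
    return (df.groupby(a)[b].nunique().max() == 1 and
            df.groupby(b)[a].nunique().max() == 1)

print("duplicate values:", dup_value_cols)                         # ['price2']
print("model/year duplicate index:", one_to_one("model", "year"))  # True
```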
Noisy Data:
Noisy data are data that are corrupted, distorted, or have a low signal-to-noise ratio. Improper procedures (or improperly-documented procedures) to subtract out the noise in data can lead to a false sense of accuracy and, ultimately, to false conclusions.
Missing Data:
Missing data is defined as the values or data that are not stored (or not present) for some variables in the given dataset.
The reasons for the missing data in a dataset affect the approach used to handle it.
Introduction
The problem of missing values is quite common in many real-life datasets. Missing values can bias the results of machine learning models and/or reduce the accuracy of the model. This section describes what missing data is, how it is represented, and the different reasons for missing data. Along with the different categories of missing data, it also details different ways of handling missing values.

Missing Completely At Random (MCAR)
In MCAR, the probability of data being missing is the same for all the observations.
In this case, there is no relationship between the missing data and any other values observed
or unobserved (the data which is not recorded) within the given dataset.
That is, missing values are completely independent of other data. There is no pattern.
In the case of MCAR, the data could be missing due to human error, some system/equipment
failure, loss of sample, or some unsatisfactory technicalities while recording the values.
For example, suppose in a library there are some overdue books. Some values of overdue books in the computer system are missing. The reason might be a human error, such as the librarian forgetting to type in the values. So, the missing values of overdue books are not related to any other variable.

MCAR should not be assumed, as it is a rare case. The advantage of such data is that the statistical analysis remains unbiased.

Missing At Random (MAR)

In MAR, the reason for the missing values can be explained by variables on which you have complete information, as there is some relationship between the missing data and the observed data.
In this case, the data is not missing for all the observations. It is missing only within
sub-samples of the data and there is some pattern in the missing values.
For example, if you check the survey data, you may find that all the people have answered
their ‘Gender’ but ‘Age’ values are mostly missing for people who have answered their
‘Gender’ as ‘female’. (The reason being most of the females don’t want to reveal their age.)
So, the probability of data being missing depends only on the observed data.
In this case, the variables ‘Gender’ and ‘Age’ are related; the reason for missing values of the ‘Age’ variable can be explained by the ‘Gender’ variable, but you cannot predict the missing value itself.
Suppose a poll is taken for overdue books of a library. Gender and the number of overdue
books are asked in the poll. Assume that most of the females answer the poll and men are
less likely to answer. So why the data is missing can be explained by another factor that is
gender.
In this case, the statistical analysis might result in bias.
Getting an unbiased estimate of the parameters can be done only by modeling the missing
data.
Missing Not At Random (MNAR)

If there is some structure/pattern in the missing data and the other observed data cannot explain it, the data is considered Missing Not At Random (MNAR). In other words, if the missing data does not fall under MCAR or MAR, it can be categorized as MNAR.
It can happen due to the reluctance of people to provide the required information; for instance, a specific group of people may not answer some questions in a survey.
For example, suppose the name and the number of overdue books are asked in the poll for a
library. So most of the people having no overdue books are likely to answer the poll. People
having more overdue books are less likely to answer the poll.
So in this case, the missing value of the number of overdue books depends on the number of overdue books itself: the people with more overdue books are the ones who do not answer.
As another example, people having less income may refuse to share that information in a survey.
In the case of MNAR as well the statistical analysis might result in bias.
Why is it important to handle missing values?
● Many machine learning algorithms fail if the dataset contains missing values. However, algorithms like k-nearest neighbors and Naive Bayes can support data with missing values.
● You may end up building a biased machine learning model, which will lead to incorrect results if the missing values are not handled properly.
It is important to understand the reasons behind the missing values, as this is crucial for choosing the strategy for handling them.
Deleting the Missing Values

One straightforward strategy is to delete the missing values. If the missing value is of the type Missing Not At Random (MNAR), then it should not be deleted. The disadvantage of this method is that one might end up deleting some useful data from the dataset.
There are two ways one can delete the missing values:
1. Deleting the entire row: if a row has many missing values, you can choose to drop the entire row. However, if every row has some value missing, you might end up deleting the whole dataset.
2. Deleting the entire column: if a certain column has many missing values, you can choose to drop the entire column.
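Both deletion strategies are straightforward in pandas; a minimal sketch on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, np.nan],
                   "income": [50000, 60000, np.nan, 55000]})

rows_dropped = df.dropna()          # drop every row that has any missing value
cols_dropped = df.dropna(axis=1)    # drop every column that has any missing value
print(rows_dropped, cols_dropped, sep="\n")
```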
Imputing the Missing Values

Replacing with an arbitrary value: if you can make an educated guess about the missing value, you can replace it with some arbitrary value.

Replacing with the mean: the mean can be used for numerical features, but if there are outliers then the mean will not be appropriate. In such cases, outliers need to be treated first.

Replacing with the mode: the mode is the most frequently occurring value. It is used in the case of categorical features.

Replacing with the median: the median is the middlemost value. It's better to use the median value for imputation in the case of outliers.
In some cases, imputing the values with the previous value instead of mean, mode or median
is more appropriate. This is called forward fill. It is mostly used with time series data.
In backward fill, the missing value is imputed using the next value.
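A short sketch of these imputation options in pandas; the column names are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":   [25, np.nan, 40, 35, np.nan],
    "city":  ["Pune", "Delhi", np.nan, "Delhi", "Delhi"],
    "sales": [100, np.nan, 120, np.nan, 150],   # a time-ordered series
})

df["age"] = df["age"].fillna(df["age"].median())      # median: robust to outliers
df["city"] = df["city"].fillna(df["city"].mode()[0])  # mode: categorical feature
df["sales_ffill"] = df["sales"].ffill()               # forward fill: previous value
df["sales_bfill"] = df["sales"].bfill()               # backward fill: next value
print(df)
```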