
UNIT-3 EDA

EDA Techniques

After you have collected a set of data, how do you do an exploratory data
analysis? What techniques do you employ? What do the various techniques
focus on? What conclusions can you expect to reach?

This section provides answers to these kinds of questions via a gallery of


EDA techniques and a detailed description of each technique. The techniques
are divided into graphical and quantitative techniques. For exploratory data
analysis, the emphasis is primarily on the graphical techniques.

EDA emphasizes graphical techniques while classical techniques emphasize


quantitative techniques. In practice, an analyst typically uses a mixture of
graphical and quantitative techniques. In this section, we have divided the
descriptions into graphical and quantitative techniques. This is for
organizational clarity and is not meant to discourage the use of both graphical
and quantitative techniques when analyzing data.

The sample plots and output in this section were generated with the Dataplot
software program. Other general-purpose statistical data analysis programs
can generate most of the plots, intervals, and tests discussed here, or macros
can be written to achieve the same result.

Some common questions that exploratory data analysis is used to answer are:

1. What is a typical value?

2. What is the uncertainty for a typical value?


3. What is a good distributional fit for a set of numbers?

4. What is a percentile?

5. Does an engineering modification have an effect?

6. Does a factor have an effect?

7. What are the most important factors?

8. Are measurements coming from different laboratories equivalent?

9. What is the best function for relating a response variable to a set of factor
variables?

10. What are the best settings for factors?

11. Can we separate signal from noise in time dependent data?

12. Can we extract any structure from multivariate data?

13. Does the data have outliers?

Graphical Techniques
Autocorrelation Plot
Autocorrelation plots are a commonly-used tool for checking randomness in a data set. This randomness is
ascertained by computing autocorrelations for data values at varying time lags. If random, such autocorrelations
should be near zero for any and all time-lag separations. If non-random, then one or more of the autocorrelations
will be significantly non-zero.

This sample autocorrelation plot shows that the time series is not
random, but rather has a high degree of autocorrelation between
adjacent and near-adjacent observations.

Autocorrelation plots are formed by:

Vertical axis: Autocorrelation coefficient

    R_h = C_h / C_0

where C_h is the autocovariance function

    C_h = (1/N) * Σ_{t=1..N-h} (Y_t − Ȳ)(Y_{t+h} − Ȳ)

and C_0 is the variance function

    C_0 = (1/N) * Σ_{t=1..N} (Y_t − Ȳ)²

Note--R_h is between -1 and +1.

Note--Some sources may use the following formula for the autocovariance
function, which divides by N − h rather than N:

    C_h = (1/(N−h)) * Σ_{t=1..N-h} (Y_t − Ȳ)(Y_{t+h} − Ȳ)

Horizontal axis: Time lag h (h = 1, 2, 3, ...)


The above plot also contains several horizontal reference lines. The middle
line is at zero. The other four lines are the 95% and 99% confidence bands.
Note that there are two distinct formulas for generating the confidence
bands.

1.
If the autocorrelation plot is being used to test for randomness (i.e.,
there is no time dependence in the data), the following formula is
recommended:

    ± z_{1−α/2} / √N

where N is the sample size, z is the cumulative distribution function of the standard
normal distribution, and α is the significance level. In this case, the confidence bands
have fixed width that depends on the sample size. This is the formula that was used
to generate the confidence bands in the above plot.

2.
Autocorrelation plots are also used in the model identification stage for fitting ARIMA models. In this
case, a moving average model is assumed for the data and the following confidence bands should be
generated:

    ± z_{1−α/2} * √( (1/N) * (1 + 2 * Σ_{i=1..k−1} y_i²) )

where k is the lag, y_i is the sample autocorrelation at lag i, N is the sample size, z is the cumulative
distribution function of the standard normal distribution, and α is the
significance level. In this case, the confidence bands increase as the
lag increases.
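As a concrete illustration (not part of the original handbook text), the following minimal Python sketch computes the sample autocorrelations R_h = C_h / C_0 defined above and draws the fixed-width 95% and 99% bands from the randomness-test formula. The white-noise series, lag range, and use of numpy/matplotlib/scipy are assumptions made purely for demonstration.

# Minimal sketch: compute sample autocorrelations R_h = C_h / C_0 and draw
# the fixed-width confidence bands used for the randomness test.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def autocorrelation(y, max_lag):
    """Return R_1 ... R_max_lag using the C_h / C_0 definition above."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    y_bar = y.mean()
    c0 = np.sum((y - y_bar) ** 2) / n                          # variance function C_0
    r = []
    for h in range(1, max_lag + 1):
        ch = np.sum((y[:n - h] - y_bar) * (y[h:] - y_bar)) / n  # autocovariance C_h
        r.append(ch / c0)
    return np.array(r)

# Example with simulated white noise (replace with your own series).
rng = np.random.default_rng(0)
y = rng.normal(size=200)                                        # white noise -> R_h near 0
lags = np.arange(1, 41)
r = autocorrelation(y, max_lag=40)

n = len(y)
band95 = norm.ppf(0.975) / np.sqrt(n)                           # z_{1-alpha/2} / sqrt(N)
band99 = norm.ppf(0.995) / np.sqrt(n)

plt.vlines(lags, 0, r)
plt.axhline(0)
for b in (band95, band99):
    plt.axhline(b, linestyle="--")
    plt.axhline(-b, linestyle="--")
plt.xlabel("Lag h")
plt.ylabel("Autocorrelation R_h")
plt.show()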

The autocorrelation plot can provide answers to the following questions:

1. Are the data random?
2. Is an observation related to an adjacent observation?
3. Is an observation related to an observation twice-removed? (etc.)
4. Is the observed time series white noise?
5. Is the observed time series sinusoidal?
6. Is the observed time series autoregressive?
7. What is an appropriate model for the observed time series?
8. Is the model Y = constant + error valid and sufficient?
9. Is the formula s_Ȳ = s / √N valid?

Randomness (along with fixed model, fixed variation, and fixed
distribution) is one of the four assumptions that typically underlie
all measurement processes. The randomness assumption is
critically important for the following three reasons:

1. Most standard statistical tests depend on randomness. The
validity of the test conclusions is directly linked to the validity of
the randomness assumption.

2. Many commonly used statistical formulae depend on the randomness assumption, the most common
formula being the formula for determining the standard deviation of the sample mean:

    s_Ȳ = s / √N

where s is the standard deviation of the data and N is the sample size.
Although heavily used, the results from using this
formula are of no value unless the randomness
assumption holds.

3. For univariate data, the default model is

    Y = constant + error

If the data are not random, this model is incorrect and invalid,
and the estimates for the parameters (such as the constant)
become nonsensical and invalid.

In short, if the analyst does not check for randomness, then the
validity of many of the statistical conclusions becomes suspect.
The autocorrelation plot is an excellent way of checking for such
randomness.

Examples of the autocorrelation plot for several common


situations are given in the following pages.

1. Random (= White Noise)

2. Weak autocorrelation

3. Strong autocorrelation and autoregressive model

4. Sinusoidal model

Autocorrelation Plot: Random Data


The following is a sample autocorrelation plot.

We can make the following conclusions from this plot.

1. There are no significant autocorrelations.

2. The data are random.

Note that, with the exception of lag 0, which is always 1 by
definition, almost all of the autocorrelations fall within the 95%
confidence limits. In addition, there is no apparent pattern (such as
the first twenty-five being positive and the second twenty-five being
negative). This absence of a pattern is what we expect to see if the
data are in fact random.

A few lags slightly outside the 95% and 99% confidence limits do
not necessarily indicate non-randomness. For a 95% confidence
interval, we might expect about one out of twenty lags to be
statistically significant due to random fluctuations.

There is no ability to infer from a current value Yi what the
next value Yi+1 will be. Such non-association is the
essence of randomness. In short, adjacent observations do not
"correlate", so we call this the "no autocorrelation" case.

Autocorrelation Plot: Moderate Autocorrelation

The following is a sample autocorrelation plot.

We can make the following conclusions from this plot.

1. The data come from an underlying autoregressive model with moderate positive autocorrelation.

The plot starts with a moderately high autocorrelation at lag 1 (approximately 0.75) that gradually decreases. The
decreasing autocorrelation is generally linear, but with significant noise. Such a pattern is the autocorrelation plot
signature of "moderate autocorrelation", which in turn provides moderate predictability if modeled properly.

The next step would be to estimate the parameters for the autoregressive model:

    Y_i = A_0 + A_1 * Y_{i−1} + E_i

Such estimation can be performed by using least squares linear regression.

The randomness assumption for least squares fitting applies to the residuals of the model. That is, even though the
original data exhibit non-randomness, the residuals after fitting Y_i against Y_{i−1} should result in random residuals.

The residual standard deviation for this autoregressive model will be
much smaller than the residual standard deviation for the default model

    Y_i = A_0 + E_i
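As an illustration (not from the original text), the following minimal Python sketch assumes an AR(1) model: it simulates a series with moderate positive autocorrelation, fits Y_i against Y_{i−1} by least squares, and compares the residual standard deviation with that of the default constant-plus-error model. The simulated data and the choice of phi = 0.75 are assumptions for demonstration only.

# Minimal sketch: fit the autoregressive model Y_i = A_0 + A_1 * Y_{i-1} + E_i
# by least squares and compare residual standard deviations.
import numpy as np

rng = np.random.default_rng(1)

# Simulated AR(1) series with moderate positive autocorrelation (phi = 0.75).
n, phi = 500, 0.75
y = np.zeros(n)
for i in range(1, n):
    y[i] = phi * y[i - 1] + rng.normal()

# Least squares fit of Y_i on Y_{i-1}.
a1, a0 = np.polyfit(y[:-1], y[1:], deg=1)
residuals_ar = y[1:] - (a0 + a1 * y[:-1])

# Default model: Y = constant + error (constant = sample mean).
residuals_default = y - y.mean()

print("AR(1) residual std:        ", residuals_ar.std(ddof=2))
print("Default model residual std:", residuals_default.std(ddof=1))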
Autocorrelation Plot: Strong Autocorrelation and Autoregressive Model

The following is a sample autocorrelation plot.

We can make the following conclusions from the above plot.

1. The data come from an underlying autoregressive model with strong positive autocorrelation.

The plot starts with a high autocorrelation at lag 1 (only slightly less than
1) that slowly declines. It continues decreasing until it becomes negative
and starts showing an increasing negative autocorrelation. The decreasing
autocorrelation is generally linear with little noise. Such a pattern is the
autocorrelation plot signature of "strong autocorrelation", which in turn
provides high predictability if modeled properly.

The next step would be to estimate the parameters for the autoregressive model:

    Y_i = A_0 + A_1 * Y_{i−1} + E_i

Such estimation can be performed by using least squares linear regression.

The randomness assumption for least squares fitting applies to the residuals of the model. That is, even though the
original data exhibit non-randomness, the residuals after fitting Y_i against Y_{i−1} should result in random residuals.

The residual standard deviation for this autoregressive model will be
much smaller than the residual standard deviation for the default model

    Y_i = A_0 + E_i

Autocorrelation Plot: Sinusoidal Model

The following is a sample autocorrelation plot.

We can make the following conclusions from the above plot.

1. The data come from an underlying sinusoidal model.

The plot exhibits an alternating sequence of positive and negative spikes. These spikes are not decaying to zero. Such a
pattern is the autocorrelation plot signature of a sinusoidal model.
What is the difference between Data Preparation and Data
Exploration?

The emergence of big data

With the development of Big Data, many phases of preliminary data preparation have emerged.
A whole terminology has developed, and it becomes difficult for
non-specialists to detect the nuances between the different terms. Among the preliminary
phases, Data Preparation and Data Exploration occupy a large place. These phases describe the way in
which raw data is integrated and processed in BI software.

BI is Business Intelligence, which, as the name suggests, is used by business managers


and those who are generally referred to as decision-makers.

BI represents all the means by which data is collected and modelled to assist in
decision-making. Ultimately, BI provides an overview of an activity.

Data Preparation

Data Preparation is the very first phase of a business intelligence project. It is the phase of
transforming raw data into useful information that will later be used for
decision-making. Data sources are merged and filtered. They are finally aggregated, and
the raw data are subject to the calculation of additional values.

Data Preparation is mainly the phase that precedes the analysis. A graphical user
interface that makes the preparation workflow usable is generally preferred. Data Preparation is
mainly used for the analysis of business data. This involves the collection, cleaning, and
consolidation of data, all of which takes place in a file that can then be used for the analysis.

This phase is of course essential for filtering unstructured and disordered data. Data
Preparation also makes it possible to connect data from different sources, all in real time.
Another important advantage of Data Preparation is that it allows you to manage the data
collected from a file and to obtain a quick report of this data.

The various data preparation procedures include data collection, which is the initial
process for any organization or business. It is at this stage that data is collected from a
variety of sources. These sources can really be of any type.

The next step is data discovery. It is then important to understand the data collected in
order to classify it into different sets. As the data is often very large, filtering the data can
be very time consuming.

It is then equally important to clean and validate the data (data cleansing) in order to
remove and discard anything that is not useful for later steps when decision-making is
required. Unnecessary or aberrant data should be removed at this stage. Appropriate
models should be used to refine the data set. A lock should be used to protect sensitive
data.

Once the data has been cleansed, it must go through the test team who will perform all
necessary checks. The next step is to define the format of the value entries in order to
make the set accessible and understandable to decision-makers. Once all these procedures
have been carried out, the data remains to be stored. The analysis tools can then be
implemented.

Data Preparation has many advantages. Among other things, it allows a quick response to
correct possible errors. The quality of the data is improved, allowing for a more efficient
and faster analysis.
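For illustration, a minimal pandas sketch of the preparation steps described above follows. The file names and column names (orders.csv, customers.csv, customer_id, amount, region) are hypothetical placeholders, not part of the original text.

# Minimal sketch of the preparation steps described above, using pandas.
import pandas as pd

# Collection: read raw data from two different sources (hypothetical files).
orders = pd.read_csv("orders.csv")        # e.g. order_id, customer_id, amount
customers = pd.read_csv("customers.csv")  # e.g. customer_id, region

# Merge the sources on a common key.
df = orders.merge(customers, on="customer_id", how="left")

# Cleansing: drop exact duplicates and obviously aberrant values.
df = df.drop_duplicates()
df = df[df["amount"] > 0]

# Aggregation: compute additional values from the raw data.
summary = df.groupby("region")["amount"].agg(["count", "sum", "mean"])
print(summary)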

Data Exploration

Data Exploration is the stage following the preparation phase. The prepared data is then
analysed to answer the questions arising from the data preparation. The data is explored
interactively and reorganized so that it is presented in an understandable way and can be
used by decision-makers. It is therefore a matter of exploring data that has not yet been transformed.
Exploration is necessary for decision-makers, who thereby obtain information on data that
was previously difficult to perceive. Data mining is in fact the first step in data analysis. It
is from this phase that it becomes possible to plan appropriate decisions for the
organization or company. This involves identifying and summarizing the main
characteristics of a set of data.

A team of experienced analysts is needed to handle visual analysis tools and statistical
management software. Sometimes it is necessary to use both manual and automated
tools.

Data can be explored manually or automatically. Automated methods are, of course, popular
because of their accuracy and speed. Data visualization tools are particularly effective. Manual
data mining allows you to filter and explore data in files such as Excel. Scripting is also used to
analyze raw data.

Among the techniques used for Data Exploration is univariate analysis, which is the simplest
technique, since only one variable is present in the data. The data is analyzed one by one. The
analysis here depends on the type of variables, which can be categorical or continuous as the
case may be.

Bivariate analysis involves the analysis of two variables; the empirical relationship between
them is calculated. An analysis that involves more than two variables is called
multivariate analysis. There is also principal component analysis, which is based on the conversion of
correlated variables into a smaller number of uncorrelated variables.
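A minimal Python sketch of these three techniques is shown below. The simulated data and column names (age, income, spend) are assumptions made purely for illustration; it is a sketch of the ideas above, not a complete exploration workflow.

# Minimal sketch: univariate summaries, a bivariate correlation, and PCA.
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=200),
    "income": rng.normal(50_000, 12_000, size=200),
    "spend": rng.normal(2_000, 600, size=200),
})

# Univariate analysis: one variable at a time.
print(df["income"].describe())

# Bivariate analysis: empirical relationship between two variables.
print(df["income"].corr(df["spend"]))

# Principal component analysis: correlated variables -> fewer uncorrelated ones.
pca = PCA(n_components=2)
components = pca.fit_transform((df - df.mean()) / df.std())
print(pca.explained_variance_ratio_)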

After the exploration comes data discovery. This is an inspection of trends and events
used to create visualizations to present to the relevant business managers. Several tools exist to facilitate
data exploration and visualisation; Tableau and Power BI are frequently used.

The quality of the input during the exploration process will determine the quality of the output. It
is therefore important to provide consistent, well-prepared input so that the output remains reliable.

In order for Data Exploration to lead to the construction of a valid predictive model, it is
necessary to proceed in stages. First, it is important to identify the variables. The input and
output variables must first be identified. Next, the type of data and the category of variables must
be identified.

The next step can be either univariate or bivariate analysis. Then the specialists proceed with the
processing of missing values and the treatment of outliers. After the variable transformation, the
creation of variables is the last step.

Characterization
Characterization is a big data methodology that is used for generating descriptive parameters that
effectively describe the characteristics and behavior of a particular data item. This is then used in
unsupervised learning algorithms in order to find patterns, clusters and trends without incorporating class
labels that may have biases. It has its uses in cluster analysis and even deep learning.
Some benefits of characterization:

● Can generate useful metrics for tracking and measuring events and anomalies in data sets
● Creates small footprint representations of essential information
● Quickly accomplishes data-to-information conversion, which brings the industry closer to the full
data-to-information-to-knowledge transformation
● Is useful for indexing and tagging specific objects, events and other features in a data collection

Outlier – an outlier is an unusual observation with respect to either its x-value or its y-value. An
x-outlier makes the scope of the regression too broad, which is usually considered less accurate.

Leverage – a leverage point is a data point whose x-value (independent variable) is unusual. Its y-value
follows the predicted regression line, so a leverage point may look acceptable because it sits on the
predicted regression line.
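As a hedged illustration (not from the original text), the sketch below computes leverage values, the diagonal of the hat matrix, for a simple linear regression and flags the unusually large x-value. The simulated data and the 2p/n cutoff are assumptions; the cutoff is just one common rule of thumb.

# Minimal sketch: flag high-leverage points (unusual x-values) in simple
# linear regression using the diagonal of the hat matrix.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=50)
x[0] = 6.0                               # one unusually large x-value (a leverage point)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=50)

X = np.column_stack([np.ones_like(x), x])   # design matrix [1, x]
hat = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
leverage = np.diag(hat)

# A common rule of thumb flags points with leverage > 2 * p / n.
p, n = X.shape[1], X.shape[0]
print(np.where(leverage > 2 * p / n)[0])    # index 0 should be flagged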
2 Types of Duplicate Features
Two things distinguish top data scientists from others in most cases: Feature
Creation and Feature Selection. i.e., creating features that capture deeper/hidden
insights about the business or customer and then making the right choices about
which features to choose for your model.

1. Duplicate Values: When two features have the same set of values

2. Duplicate Index: When the value of two features are different, but they

occur at the same index

1. Duplicate Values (same value for each record)

For example, if the year of sale for a car is the same as the manufacture year, these two
features essentially say the same thing. Your machine learning model won't learn anything
insightful by keeping both of these features in training. You are better off dropping one of the features.

2. Duplicate Index (values of the two features are different but they occur at the same index)

For example, suppose all the 'Camry' cars are from 2018 and all the 'Corolla' cars are from 2019.
There is nothing insightful for your machine learning model to learn from keeping both features in
training. You could also integer-encode the Car Model column, replacing Camry with 2018 and
Corolla with 2019; then it is the same as case 1 (Duplicate Values) above. You are better off
dropping one of these two features.

Keeping duplicate features in your dataset introduces the problem of multicollinearity.

● In the case of linear models, the weight will be split between the two features, which makes the
coefficients difficult to interpret.

● If you are using tree-based models, it won't matter unless you are looking at feature importance.

● In the case of distance-based models, it will make that feature count more in the distance.
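A minimal pandas sketch of how one might detect both kinds of duplicate features follows. The toy DataFrame and column names mirror the car example above and are purely illustrative.

# Minimal sketch: detect the two kinds of duplicate features in a DataFrame.
import pandas as pd

df = pd.DataFrame({
    "sale_year":        [2018, 2018, 2019, 2019],
    "manufacture_year": [2018, 2018, 2019, 2019],                  # case 1: same values
    "car_model":        ["Camry", "Camry", "Corolla", "Corolla"],  # case 2: same index pattern
})

# Case 1 - duplicate values: columns that are exact copies of each other.
print(df.T.duplicated())        # True for the second of each identical pair

# Case 2 - duplicate index: values differ but vary together one-to-one.
# Integer-encode the categorical column and look for perfectly correlated pairs.
encoded = df.copy()
encoded["car_model"] = encoded["car_model"].astype("category").cat.codes
print(encoded.corr(numeric_only=True))   # correlation of 1.0 flags candidates to drop

# Drop one feature from each duplicate pair before training.
df = df.drop(columns=["manufacture_year"])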

Noisy Data:

Noisy data are data that are corrupted or distorted, or that have a low signal-to-noise ratio. Improper
procedures (or improperly documented procedures) for subtracting out the noise in data can lead to a
false sense of accuracy or to false conclusions.

Data = true signal + noise
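As a small illustrative sketch of this decomposition (assuming a sine wave as the true signal and Gaussian noise, neither of which is specified in the original text), noisy data can be simulated and partially recovered with a simple rolling mean.

# Minimal sketch of the "data = true signal + noise" decomposition.
import numpy as np
import pandas as pd

t = np.linspace(0, 4 * np.pi, 400)
signal = np.sin(t)                                   # true signal
noise = np.random.default_rng(4).normal(scale=0.5, size=t.size)
data = signal + noise                                # observed (noisy) data

# Signal-to-noise ratio of the observed series (variance ratio).
print("SNR:", signal.var() / noise.var())

# A rolling mean is one simple (and imperfect) way to estimate the signal.
smoothed = pd.Series(data).rolling(window=15, center=True).mean()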

Missing Data:

Missing data is defined as the values or data that is not stored (or not present) for some

variable/s in the given dataset.

Why Is Data Missing From The Dataset


There can be multiple reasons why certain values are missing from the data.

Reasons for the missing data from the dataset affect the approach of handling missing data.

So it’s necessary to understand why the data could be missing.

Some of the reasons are listed below:

● Past data might get corrupted due to improper maintenance.


● Observations are not recorded for certain fields for various reasons; for example, there might be a
failure in recording the values due to human error.

● The user has not provided the values intentionally.

Introduction

The problem of missing values is quite common in many real-life datasets. Missing values can
bias the results of machine learning models and/or reduce the accuracy of the model. This section
describes what missing data is, how it is represented, and the different reasons for the
missing data. Along with the different categories of missing data, it also details different
ways of handling missing values, with examples.

Types Of Missing Value


Formally the missing values are categorized as follows:
Missing Completely At Random (MCAR)

In MCAR, the probability of data being missing is the same for all the observations.

In this case, there is no relationship between the missing data and any other values observed

or unobserved (the data which is not recorded) within the given dataset.

That is, missing values are completely independent of other data. There is no pattern.

In the case of MCAR, the data could be missing due to human error, some system/equipment

failure, loss of sample, or some unsatisfactory technicalities while recording the values.

For Example, suppose in a library there are some overdue books. Some values of overdue

books in the computer system are missing. The reason might be a human error like the
librarian forgot to type in the values. So, the missing values of overdue books are not related

to any other variable/data in the system.

MCAR should not be assumed by default, as it is a rare case. The advantage of such data is that the statistical
analysis remains unbiased.

Missing At Random (MAR)


Missing at random (MAR) means that the reason for missing values can be explained by

variables on which you have complete information as there is some relationship between the

missing data and other values/data.

In this case, the data is not missing for all the observations. It is missing only within

sub-samples of the data and there is some pattern in the missing values.

For example, if you check the survey data, you may find that all the people have answered

their ‘Gender’ but ‘Age’ values are mostly missing for people who have answered their

‘Gender’ as ‘female’. (The reason being most of the females don’t want to reveal their age.)

So, the probability of data being missing depends only on the observed data.

In this case, the variables ‘Gender’ and ‘Age’ are related and the reason for missing values of

the ‘Age’ variable can be explained by the ‘Gender’ variable but you can not predict the

missing value itself.

Suppose a poll is taken for overdue books of a library. Gender and the number of overdue

books are asked in the poll. Assume that most of the females answer the poll and men are

less likely to answer. So why the data is missing can be explained by another factor that is

gender.
In this case, the statistical analysis might result in bias.

Getting an unbiased estimate of the parameters can be done only by modeling the missing

data.

Missing Not At Random (MNAR)


Missing values depend on the unobserved data.

If there is some structure/pattern in missing data and other observed data can not explain it,

then it is Missing Not At Random (MNAR).

If the missing data does not fall under the MCAR or MAR then it can be categorized as

MNAR.

It can happen due to the reluctance of people in providing the required information. A

specific group of people may not answer some questions in a survey.

For example, suppose the name and the number of overdue books are asked in the poll for a

library. So most of the people having no overdue books are likely to answer the poll. People

having more overdue books are less likely to answer the poll.

So in this case, the missing value of the number of overdue books depends on the people who

have more books overdue.

Another example, people having less income may refuse to share that information in a

survey.

In the case of MNAR as well the statistical analysis might result in bias.

Why Do We Need To Care About Handling Missing Value?


It is important to handle the missing values appropriately.

● Many machine learning algorithms fail if the dataset contains missing values.
However, algorithms like K-nearest neighbors and Naive Bayes can support data with missing
values.

● You may end up building a biased machine learning model which will lead to

incorrect results if the missing values are not handled properly.

● Missing data can lead to a lack of precision in the statistical analysis.
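Before choosing a handling strategy, it helps to inspect how much data is missing and whether missingness is related to another column (a hint of MAR rather than MCAR). The sketch below is illustrative only; the gender/age columns follow the survey example above.

# Minimal sketch: inspect missingness and its relationship to another column.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female", "female"],
    "age":    [np.nan, 34, np.nan, 41, 29, np.nan],
})

# Count and proportion of missing values per column.
print(df.isna().sum())
print(df.isna().mean())

# Compare the missingness rate of 'age' across 'gender' groups.
# A large difference suggests the values are not missing completely at random.
print(df["age"].isna().groupby(df["gender"]).mean())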


How To Handle The Missing Data


Analyze each column with missing values carefully to understand the reasons behind

the missing values as it is crucial to find out the strategy for handling the missing

values.

There are 2 primary ways of handling missing values:

1. Deleting the Missing values

2. Imputing the Missing Values

Deleting the Missing value


Generally, this approach is not recommended. It is one of the quick and dirty techniques one

can use to deal with missing values.

If the missing value is of the type Missing Not At Random (MNAR), then it should not be

deleted.

If the missing value is of type Missing At Random (MAR) or Missing Completely At

Random (MCAR) then it can be deleted.

The disadvantage of this method is one might end up deleting some useful data from the

dataset.
There are 2 ways one can delete the missing values:

Deleting the entire row

If a row has many missing values then you can choose to drop the entire row.

If every row has some (column) value missing then you might end up deleting the whole

data.

Deleting the entire column

If a certain column has many missing values then you can choose to drop the entire column.
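A minimal pandas sketch of both deletion strategies follows. The toy DataFrame and the threshold of two non-missing values per column are illustrative assumptions only.

# Minimal sketch of both deletion strategies with pandas (illustrative data).
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, 2.0, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

# Deleting rows: drop every row that contains a missing value. In this toy
# example every row has at least one missing value, so the result is empty --
# exactly the caveat mentioned above.
rows_dropped = df.dropna(axis=0)

# Deleting columns: keep only columns with at least 2 non-missing values.
cols_dropped = df.dropna(axis=1, thresh=2)

print(rows_dropped)
print(cols_dropped)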


Imputing the Missing Value


There are different ways of replacing the missing values.

Replacing With Arbitrary Value

If you can make an educated guess about the missing value then you can replace it with some

arbitrary value

Replacing With Mean


This is the most common method of imputing missing values of numeric columns. If there

are outliers then the mean will not be appropriate. In such cases, outliers need to be treated

first.

Replacing With Mode

Mode is the most frequently occurring value. It is used in the case of categorical features.

Replacing With Median

Median is the middlemost value. It’s better to use the median value for imputation in the case

of outliers.

Replacing with previous value – Forward fill

In some cases, imputing the values with the previous value instead of mean, mode or median

is more appropriate. This is called forward fill. It is mostly used in time series data

Replacing with next value – Backward fill

In backward fill, the missing value is imputed using the next value.
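A minimal pandas sketch of these imputation options follows. The toy columns (price, color) and the outlier value are hypothetical and serve only to illustrate each method.

# Minimal sketch of the imputation options described above, using pandas.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "price": [10.0, np.nan, 14.0, 200.0, np.nan],      # numeric, with an outlier
    "color": ["red", np.nan, "blue", "red", np.nan],   # categorical
})

# Arbitrary value
df["price_arbitrary"] = df["price"].fillna(-1)

# Mean (sensitive to the 200.0 outlier) and median (more robust here)
df["price_mean"] = df["price"].fillna(df["price"].mean())
df["price_median"] = df["price"].fillna(df["price"].median())

# Mode for the categorical feature
df["color_mode"] = df["color"].fillna(df["color"].mode()[0])

# Forward fill and backward fill (commonly used for time series)
df["price_ffill"] = df["price"].ffill()
df["price_bfill"] = df["price"].bfill()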
