Exploratory Data Analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_boston

# load the Boston housing dataset into a DataFrame
boston = load_boston()
x = boston.data
y = boston.target
boston_df = pd.DataFrame(boston.data)
boston_df.columns = boston.feature_names
boston_df.head()
# box plot of the DIS column (weighted distance to employment centres)
sns.boxplot(x=boston_df['DIS'])

The plot above shows three points between 10 and 12; these are outliers, as they are not included in the box of the other observations, i.e. nowhere near the quartiles.
Scatter plot-
A scatter plot is a type of plot or mathematical diagram
using Cartesian coordinates to display values for typically
two variables for a set of data. The data are displayed as
a collection of points, each having the value of one
variable determining the position on the horizontal axis
and the value of the other variable determining the
position on the vertical axis.
fig, ax = plt.subplots(figsize=(16,8))
ax.scatter(boston_df['INDUS'], boston_df['TAX'])
plt.show()
Looking at the plot above, we can see that most of the data points lie towards the bottom-left, but there are points far from the rest of the population, such as those in the top-right corner.
Let's consider a threshold of 3; the exact cut-off is mostly a business requirement, but here we take it to be 3.
threshold = 3
print(z[55][1])
3.375038763517309
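The array z used above is not defined in this excerpt; a minimal sketch of how it can be computed with scipy, assuming the boston_df DataFrame from earlier (the names z and threshold match the snippet above):
from scipy import stats

# absolute Z-score of every value in every column
z = np.abs(stats.zscore(boston_df))

# row and column indices of values whose Z-score exceeds the threshold
print(np.where(z > threshold))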
Z-Score
We can remove or filter the outliers and get clean data. This can be done with just one line of code, since we have already calculated the Z-score.
IQR score -
Outliers can also be filtered with the interquartile range: any row containing a value below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR is dropped.
boston_df_out = boston_df_o1[~((boston_df_o1 < (Q1 - 1.5 * IQR)) | (boston_df_o1 > (Q3 + 1.5 * IQR))).any(axis=1)]
boston_df_out.shape
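The variables boston_df_o1, Q1, Q3 and IQR are not defined in the excerpt above; a minimal sketch of both filtering approaches, assuming the boston_df DataFrame and the z array from the Z-score step (boston_df_o1 is taken here to be just a copy of boston_df):
# Z-score filtering: keep only rows where every column's |Z| is below the threshold
boston_df_o = boston_df[(z < threshold).all(axis=1)]
print(boston_df_o.shape)

# IQR filtering: compute the quartiles and the interquartile range per column
boston_df_o1 = boston_df.copy()
Q1 = boston_df_o1.quantile(0.25)
Q3 = boston_df_o1.quantile(0.75)
IQR = Q3 - Q1

# drop rows containing any value outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
boston_df_out = boston_df_o1[~((boston_df_o1 < (Q1 - 1.5 * IQR)) | (boston_df_o1 > (Q3 + 1.5 * IQR))).any(axis=1)]
print(boston_df_out.shape)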
Code:
Now, let's see the ratio of data points above the upper limit and the extreme upper limit, i.e. the outliers.
total = float(data.shape[0])
print('Total borrowers: {}'.format(data.annual_inc.shape[0] / total))
print('Borrowers that earn > 178k: {}'.format(data[data.annual_inc > 178000].shape[0] / total))
print('Borrowers that earn > 256k: {}'.format(data[data.annual_inc > 256000].shape[0] / total))
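The 178k and 256k cut-offs used above are not derived in this excerpt; one common way to obtain such limits is the IQR fence rule, sketched here under that assumption for the annual_inc column:
# quartiles of the income column
q1 = data.annual_inc.quantile(0.25)
q3 = data.annual_inc.quantile(0.75)
iqr = q3 - q1

# upper limit and extreme upper limit according to the IQR fence rule
upper_limit = q3 + 1.5 * iqr
extreme_upper_limit = q3 + 3 * iqr
print(upper_limit, extreme_upper_limit)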
2. Trimming:
In this method, we discard the outliers completely. That is,
eliminate the data points that are considered as outliers.
In situations where you won’t be removing a large number
of values from the dataset, trimming is a good and fast
approach.
index = data[(data['annual_inc'] >= 256000)].index
data.drop(index, inplace=True)
Discretization methods
Equal width binning
Equal frequency binning
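A minimal sketch of both binning methods with pandas, applied to the annual_inc column of the data DataFrame used above (the choice of 5 bins is arbitrary, for illustration):
# equal width binning: 5 intervals of equal range
data['inc_equal_width'] = pd.cut(data['annual_inc'], bins=5)

# equal frequency binning: 5 quantile-based bins with roughly equal counts
data['inc_equal_freq'] = pd.qcut(data['annual_inc'], q=5)

print(data['inc_equal_width'].value_counts())
print(data['inc_equal_freq'].value_counts())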
Detecting Missing values:
Depending on the data source, missing data are identified differently. Pandas always identifies missing values as NaN.
Reference: Data cleaning with python and pandas – Detecting missing values: https://towardsdatascience.com/data-cleaning-with-python-and-pandas-detecting-missing-values-3e9c6ebcf78b
Example:
# Detecting numbers: OWN_OCCUPIED should contain Y/N values, so any entry
# that parses as an integer is treated as missing and replaced with NaN
cnt = 0
for row in df['OWN_OCCUPIED']:
    try:
        int(row)
        df.loc[cnt, 'OWN_OCCUPIED'] = np.nan
    except ValueError:
        pass
    cnt += 1
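Once invalid entries have been converted to NaN, pandas can report them directly; a short sketch, assuming the same df DataFrame:
# count missing values per column
print(df.isnull().sum())

# total number of missing values in the DataFrame
print(df.isnull().sum().sum())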
Data transformation
Data transformation predominantly deals with normalizing (also known as scaling) the data, handling skewness, and aggregating attributes.
Normalization
Normalization or scaling refers to bringing all the columns into the same range. We will discuss the two most common normalization techniques.
1. Min-Max
2. Z score
Min-Max normalization:
It is a simple way of rescaling the values in a column to a fixed range, typically [0, 1]. Here is the formula:
x_scaled = (x - min(x)) / (max(x) - min(x))
Z score normalization:
Each value is replaced by its Z score, which centres the column at 0 with unit variance:
z = (x - mean(x)) / std(x)
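A minimal sketch of both techniques applied to a single column, using boston_df from earlier as a stand-in; scikit-learn's MinMaxScaler and StandardScaler implement these same formulas:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Min-Max: rescale the DIS column to the [0, 1] range
dis_minmax = MinMaxScaler().fit_transform(boston_df[['DIS']])

# Z score: centre the DIS column at 0 with unit variance
dis_zscore = StandardScaler().fit_transform(boston_df[['DIS']])

print(dis_minmax.min(), dis_minmax.max())    # 0.0 1.0
print(dis_zscore.mean(), dis_zscore.std())   # approximately 0.0 and 1.0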
Skewness of data:
According to Wikipedia, "In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean."
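In practice, skewness can be inspected directly with pandas; a small sketch, assuming the boston_df DataFrame from earlier:
# skewness of every numeric column; values far from 0 indicate asymmetry
print(boston_df.skew())

# visualize the distribution of one skewed column
sns.distplot(boston_df['DIS'])
plt.show()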
Logarithm transformation: applying log(x) (or log(1 + x) when zeros are present) compresses large values and reduces right (positive) skew.
Square transformation: applying x^2 spreads out larger values and can reduce left (negative) skew.
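A minimal sketch of both transformations, applied to the DIS column of boston_df purely for illustration (np.log1p is used so zero values do not break the logarithm):
# logarithm transformation
boston_df['DIS_log'] = np.log1p(boston_df['DIS'])

# square transformation
boston_df['DIS_sq'] = np.square(boston_df['DIS'])

# compare the skewness before and after each transformation
print(boston_df[['DIS', 'DIS_log', 'DIS_sq']].skew())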
Feature Scaling:
When should you perform feature scaling and
mean normalization on the given data? What are
the advantages of these techniques?
A few advantages of normalizing the data are discussed below. However, there are a few algorithms, such as Logistic Regression and Decision Trees, that are not affected by the scaling of the input data.
Let me answer this from a general ML perspective, not only neural networks. When you collect data and extract features, the data is often collected on different scales. For example, the age of employees in a company may be between 21-70 years, the size of the house they live in may be 500-5000 sq. feet, and their salaries may range from $30,000-$80,000. In this situation, if you use a simple Euclidean metric, the age feature will not play any role because it is several orders of magnitude smaller than the other features. However, it may contain some important information that is useful for the task. Here, you may want to normalize the features independently to the same scale, say [0, 1], so that they contribute equally when computing the distance. However, normalization may also result in a loss of information, so you need to be sure about this aspect as well. Most of the time, it helps when the objective function you are optimizing computes some sort of distance or squared distance.
The difference is that, in scaling, you're changing the range of your data, while in normalization you're changing the shape of the distribution of your data. Let's talk a little more in depth about each of these options.
Scaling
This means that you're transforming your data so that it fits within a specific scale, like 0-100 or 0-1. You want to scale data when you're using methods based on measures of how far apart data points are, like support vector machines (SVM) or k-nearest neighbors (KNN). With these algorithms, a change of "1" in any numeric feature is given the same importance.
For example, you might be looking at the prices of some products in both Yen
and US Dollars. One US Dollar is worth about 100 Yen, but if you don't scale your prices, methods like SVM or KNN will consider a difference in price of 1 Yen as
important as a difference of 1 US Dollar! This clearly doesn't fit with our intuitions
of the world. With currency, you can convert between currencies. But what about
if you're looking at something like height and weight? It's not entirely clear how
many pounds should equal one inch (or how many kilograms should equal one
meter).
By scaling your variables, you can help compare different variables on equal
footing.
Scaling Example:
# the minmax_scaling helper comes from the mlxtend package
from mlxtend.preprocessing import minmax_scaling

# generate 1000 data points randomly drawn from an exponential distribution
original_data = np.random.exponential(size=1000)

# min-max scale the data between 0 and 1
scaled_data = minmax_scaling(original_data, columns=[0])

# plot both together to compare
fig, ax = plt.subplots(1, 2)
sns.distplot(original_data, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(scaled_data, ax=ax[1])
ax[1].set_title("Scaled data")
Normalization
Scaling just changes the range of your data. Normalization is a more radical
transformation. The point of normalization is to change your observations so that
they can be described as a normal distribution.
In general, you'll only want to normalize your data if you're going to be using a
machine learning or statistics technique that assumes your data is normally
distributed. Some examples of these include t-tests, ANOVAs, linear regression,
linear discriminant analysis (LDA) and Gaussian naive Bayes. (Pro tip: any method
with "Gaussian" in the name probably assumes normality.)
Normalization example:
# Box-Cox comes from scipy's stats module
from scipy import stats

# normalize the exponential data with boxcox
normalized_data = stats.boxcox(original_data)

# plot both together to compare
fig, ax = plt.subplots(1, 2)
sns.distplot(original_data, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(normalized_data[0], ax=ax[1])
ax[1].set_title("Normalized data")
Feature Scaling or Standardization: It is a step of data pre-processing which is applied to the independent variables or features of the data. It basically helps to normalise the data within a particular range. Sometimes, it also helps in speeding up the calculations in an algorithm.
Package Used:
sklearn.preprocessing
Import:
from sklearn.preprocessing import StandardScaler
Formula used in the backend:
Standardisation replaces the values by their Z scores: z = (x - mean) / standard deviation
Mostly the fit method is used for feature scaling: it computes the per-column mean and standard deviation, which transform then applies to the data.
Other methods (penalized regressions, for example) work better when the data is centred and scaled.
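A minimal usage sketch of StandardScaler, assuming a train/test split of the Boston feature matrix x and target y loaded at the start of this section; the scaler is fitted on the training data only, and the learned statistics are reused for the test data:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit learns mean/std, then transforms
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.mean(axis=0).round(3))  # approximately 0 for every column
print(X_train_scaled.std(axis=0).round(3))   # approximately 1 for every column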
Regression:
In this part, you will understand and learn how to implement the following
Machine Learning Regression models:
1. Linear Regression
2. Linear regression: Linear regression involves using data to calculate a line that best fits that data, and then using that line to predict scores on one variable from another. Prediction is simply the process of estimating scores of the outcome (or dependent) variable based on the scores of the predictor (or independent) variable. To generate the regression line, we look for a line of best fit: a line which can better explain the relationship between the independent and dependent variable(s) is said to be the best-fit line. The difference between the observed value and the predicted value gives the error. Linear regression gives an equation of the following form:
Y = m0 + m1X1 + m2X2 + m3X3 + ... + mnXn
where Y is the dependent variable and the X's are the independent variables. The right-hand side of this equation is also known as the hypothesis function, H(X).
The purpose of the line of best fit is that the predicted values should be as close as possible to the actual or observed values. This means the main objective in determining the line of best fit is to "minimize" the difference between predicted values and observed values. These differences are called "errors" or "residuals". There are 3 ways to calculate the "error":
1. Sum of all errors: ∑(Y - h(X)) (this may result in the cancellation of positive and negative errors, so it is not a correct metric to use)
2. Sum of the absolute values of all errors: ∑|Y - h(X)|
3. Sum of the squares of all errors: ∑(Y - h(X))²
The line of best fit for 1 feature can be represented as Y = bX + c, where Y is the score or outcome variable we are trying to predict, b is the regression coefficient or slope, and c is the Y intercept or regression constant. This is linear regression with 1 variable.
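A minimal sketch of fitting a linear regression with scikit-learn, reusing the scaled Boston training data from the StandardScaler sketch above (any numeric feature matrix and target would work the same way):
from sklearn.linear_model import LinearRegression

# fit a linear regression on the scaled training data
lin_reg = LinearRegression()
lin_reg.fit(X_train_scaled, y_train)

# the learned coefficients (m1..mn) and intercept (m0)
print(lin_reg.coef_)
print(lin_reg.intercept_)

# predict on the held-out test data
y_pred = lin_reg.predict(X_test_scaled)
print(y_pred[:5])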
4. Sum of Squared Errors: Squaring the difference between the actual value and the predicted value "penalizes" larger errors more. Hence, minimizing the sum of squared errors improves the quality of the regression line. This method of fitting the line so that there is minimal difference between the observations and the line is called the method of least squares. The baseline model refers to the line which predicts each value as the average of the data points. SSE, or Sum of Squared Errors, is the total of all squares of the errors. It is a measure of the quality of the regression line. SSE is sensitive to the number of input data points. SST is the Total Sum of Squares: it is the SSE of the baseline model.
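A tiny numpy sketch of SSE and SST, using the test-set predictions from the linear regression sketch above; the ratio 1 - SSE/SST is the R-squared metric discussed below:
# sum of squared errors of the regression model
sse = np.sum((y_test - y_pred) ** 2)

# sum of squared errors of the baseline model that always predicts the mean
sst = np.sum((y_test - y_test.mean()) ** 2)

print(sse, sst, 1 - sse / sst)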
5. Regression Metrics
Mean Absolute Error: One way to measure error is to use the absolute error, i.e. how far each prediction is from the true value. The mean absolute error takes the total absolute error over all examples and averages it by the number of data points. By adding up the absolute values of the errors, we prevent errors above and below the true values from cancelling out, and obtain an overall error metric for evaluating the model.
Mean Squared Error: Mean squared error is the most common metric to measure model performance. In contrast with absolute error, the residual error (the difference between the predicted and the true value) is squared. Some benefits of squaring the residual error are that the error terms are all positive, larger errors are emphasized over smaller errors, and the result is differentiable. Being differentiable allows us to use calculus to find minimum or maximum values, often resulting in better computational efficiency.
R-Squared: It is also called the coefficient of determination. The values for R2 range from 0
to 1, and it determines how much of the total variation in Y is explained by the
variation in X. A model with an R2 of 0 is no better than a model that always
predicts the mean of the target variable, whereas a model with an R2 of 1
perfectly predicts the target variable. Any value between 0 and 1 indicates what
percentage of the target variable, using this model, can be explained by the
features. A model can be given a negative R2 as well, which indicates that the
model is arbitrarily worse than one that always predicts the mean of the target
variable.
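A minimal sketch of computing these three metrics with scikit-learn, reusing the test-set predictions from the linear regression sketch above:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print('MAE:', mean_absolute_error(y_test, y_pred))
print('MSE:', mean_squared_error(y_test, y_pred))
print('R2 :', r2_score(y_test, y_pred))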
Cost Function
Lack of Multicollinearity
Multicollinearity-
Overfitting:
Multicollinearity:
Solution:
Correlation analysis:
The degree of association is measured by the correlation coefficient "r", which varies from -1 to +1.
pearsoncorr = SuicideRate.corr(method='pearson')
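A minimal sketch of computing and visualizing a Pearson correlation matrix; SuicideRate above is a DataFrame from the original notes, so boston_df from earlier is used here as a stand-in:
# pairwise Pearson correlation coefficients between all numeric columns
pearsoncorr = boston_df.corr(method='pearson')

# a heatmap makes strongly correlated pairs easy to spot
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(pearsoncorr, annot=True, fmt='.2f', cmap='RdBu_r', ax=ax)
plt.show()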
Polynomial regression:
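Polynomial regression fits a linear model on polynomial features of the inputs; a minimal sketch with scikit-learn, using hypothetical one-dimensional data x1, y1 for illustration:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# hypothetical one-dimensional data with a quadratic relationship
x1 = np.linspace(0, 10, 50).reshape(-1, 1)
y1 = 2 + 3 * x1.ravel() + 0.5 * x1.ravel() ** 2

# expand x1 into [1, x, x^2] and fit an ordinary linear regression on it
poly = PolynomialFeatures(degree=2)
x1_poly = poly.fit_transform(x1)
poly_reg = LinearRegression().fit(x1_poly, y1)

print(poly_reg.coef_, poly_reg.intercept_)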