Wayspire AI Course

This document provides an overview of key techniques for exploratory data analysis (EDA) and data preprocessing, including: 1. Visualizing data distributions and relationships using Matplotlib and Seaborn libraries in Python. 2. Detecting and handling missing values and encoding categorical data. Common encoding techniques include one-hot encoding, label encoding, and ordinal encoding. 3. Standardizing data using min-max scaling and z-score normalization to center and normalize features.

Uploaded by

rhittum1802

Plots:
- Two plotting libraries are used:
1. Matplotlib
2. Seaborn

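For five point summary (minimum, Q1, median, Q3, maximum):

import matplotlib.pyplot as plt
# Jupyter/IPython magic that shows plots inside the notebook; omit it in a plain .py script
%matplotlib inline
df.describe()  # reports min, 25%, 50%, 75% and max for each numeric column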

For outliers (points drawn beyond the whiskers of a box plot):
import seaborn as sns
sns.boxplot(x="Age", data=df)
# or, one box per category:
sns.boxplot(x="Gender", y="Age", data=df)
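
The box plot is built from the five point summary, and seaborn by default marks points lying more than 1.5 x IQR outside the quartiles as outliers. A minimal sketch of the same calculation in pandas, assuming a small made-up Age column (the data and variable names are hypothetical, not from the course):

```python
import pandas as pd

# Hypothetical sample data; 95 is an obvious outlier
df = pd.DataFrame({"Age": [22, 25, 27, 29, 31, 33, 35, 38, 40, 95]})

# Five point summary: minimum, Q1, median, Q3, maximum
summary = df["Age"].quantile([0, 0.25, 0.5, 0.75, 1.0])
q1, q3 = summary.loc[0.25], summary.loc[0.75]
iqr = q3 - q1

# 1.5 * IQR rule: anything outside [lower, upper] is flagged
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["Age"] < lower) | (df["Age"] > upper)]

print(summary)
print(outliers)
```

Only the row with Age 95 falls outside the fences here, which is the point a box plot of this column would draw as a separate marker.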

Count plot (using seaborn):

sns.countplot(x="Product", hue="MaritalStatus", data=df)

Pairplot:
sns.pairplot(df)

To check the shape of the distribution (using seaborn):
# sns.distplot is deprecated since seaborn 0.11; histplot is its replacement
sns.histplot(df["Age"])

(or, to overlay a smooth density curve on the histogram)

sns.histplot(df["Age"], kde=True)

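For correlation:
# numeric_only=True (pandas 1.5+) skips text columns such as Gender
corr = df.corr(numeric_only=True)
print(corr)
Heatmap:
sns.heatmap(corr, annot=True)  # annot=True writes each coefficient in its cell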
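
To see what the correlation matrix contains before plotting it, here is a small sketch on made-up data. The column values are hypothetical, and numeric_only is assumed to be available (pandas 1.5 or later):

```python
import pandas as pd

# Hypothetical data: Income rises exactly linearly with Age,
# Usage falls exactly linearly with Age
df = pd.DataFrame({
    "Age": [20, 30, 40, 50],
    "Income": [25, 35, 45, 55],
    "Usage": [8, 6, 4, 2],
    "Gender": ["M", "F", "F", "M"],  # non-numeric, left out of the matrix
})

# Pearson correlation is the default method
corr = df.corr(numeric_only=True)
print(corr)
```

A coefficient near +1 means the two columns rise together, near -1 means one falls as the other rises, and near 0 means no linear relationship.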

EDA (Exploratory Data Analysis)


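1. Missing values
i. Standard missing values
Values which pandas detects automatically, e.g. NaN, None, or empty cells in a CSV file.
ii. Non-Standard missing values
Placeholders which pandas cannot detect on its own, e.g. "?", "--" or "unknown"; these have to be converted to NaN first.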
2. Encoding:
i. Dealing with categorical data.
ii. Encodes the categorical data and helps in regional and
iii. Encoding Techniques:
a. N-1 Dummy encoding
b. One-Hot encoding
c. Label encoding
d. Ordinal encoding
e. Frequency encoding
f. Target encoding
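
For the missing values item in the outline above, the difference between the two kinds can be sketched as follows. The DataFrame and the placeholder strings "?" and "--" are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: np.nan and None are standard missing values;
# "?" and "--" are placeholders that pandas treats as ordinary strings
df = pd.DataFrame({
    "Age": [25, np.nan, 31, 40],
    "Gender": ["M", "?", None, "F"],
    "Product": ["A", "B", "--", "A"],
})

# Counts only the standard missing values
print(df.isnull().sum())

# Convert the non-standard placeholders to NaN, then count again
cleaned = df.replace(["?", "--"], np.nan)
print(cleaned.isnull().sum())
```

When the data comes from a file, passing the same placeholders through the na_values argument of pd.read_csv does the conversion at load time.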

I) N-1 Dummy encoding:

Ex: If a column has 3 different categories we use "3-1 encoding", i.e. 2 columns will be encoded. The category left out (Veg here) is represented by 0 in every encoded column.
Category      Dairy product   Fruits product
1. Veg        0               0
2. Fruits     0               1
3. Dairy      1               0
If there are 10 categories then the no. of encoded columns would be 9.
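
One way to produce this N-1 encoding is pandas get_dummies with drop_first=True. This is a sketch, not necessarily the method used in the course; the column name Category is hypothetical, and the category order is set by hand so that Veg is the dropped baseline as in the table:

```python
import pandas as pd

df = pd.DataFrame({"Category": ["Veg", "Fruits", "Dairy"]})

# List Veg first so that it becomes the dropped (all-zero) category
df["Category"] = pd.Categorical(
    df["Category"], categories=["Veg", "Fruits", "Dairy"]
)

# drop_first=True keeps n-1 columns; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(
    df["Category"], prefix="Category", drop_first=True, dtype=int
)
print(dummies)
```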
II) One-Hot encoding:
In this, for 'n' categories the no. of encoded columns would be 'n'.
Categories   Veg-Category   Fruit-Category   Dairy-Category
Veg          1              0                0
Fruits       0              1                0
Dairy        0              0                1
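
A minimal sketch of one-hot encoding with pandas get_dummies, assuming the same hypothetical Category column. Note that pandas orders the new columns alphabetically, so they appear as Dairy, Fruits, Veg rather than in the order of the table:

```python
import pandas as pd

df = pd.DataFrame({"Category": ["Veg", "Fruits", "Dairy"]})

# One column per category; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(df["Category"], dtype=int)
print(one_hot)
```

Every row contains exactly one 1, which is where the name "one-hot" comes from.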

III) Label encoding:

In this the labels of a categorical variable are sorted in alphabetical order and numbered from 0 to (n-1). The numbers do not reflect any ranking of the categories.
Ex: Performance of a car
Performance   Encoded Perf.
Average       0
Bad           1
Good          2
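
Following the alphabetical rule stated above, label encoding can be sketched with pandas category codes, which sort the labels before numbering them (scikit-learn's LabelEncoder sorts its classes the same way). The Performance values are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"Performance": ["Bad", "Good", "Average", "Good"]})

# Categories are sorted alphabetically: Average, Bad, Good
as_category = df["Performance"].astype("category")
df["Perf_encoded"] = as_category.cat.codes

print(list(as_category.cat.categories))
print(df)
```

With alphabetical sorting Average gets 0, Bad gets 1 and Good gets 2, so the codes carry no information about which performance is better.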

IV) Ordinal encoding:

The categories are numbered according to their natural order (rank) rather than alphabetically. Encoded values range from 0 to (n-1).
Ex: Bad = 0, Average = 1, Good = 2
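
A minimal sketch of the example above, assuming the order Bad < Average < Good is supplied by hand (the column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"Performance": ["Good", "Bad", "Average", "Good"]})

# The ranking must be given explicitly; pandas cannot infer it from the text
order = ["Bad", "Average", "Good"]
df["Perf_encoded"] = pd.Categorical(
    df["Performance"], categories=order, ordered=True
).codes

print(df)
```

Here the codes follow the ranking, so a larger number really does mean better performance.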
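Standardization and Z transform scaling

- Min-Max scaling and Z Standardization (Scaling)

i. Z-score normalization (Standardisation)
Formula: z = (x - µ)/σ, where µ is the mean and σ is the standard deviation of the feature.
ii. After the Z transform the new mean is always 0 and the new standard deviation is always 1.
iii. Min-Max scaling (Normalisation) rescales the values to the range 0 to 1:
X norm = (X - X min)/(X max - X min)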
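
Both formulas can be checked on a few numbers. This is a sketch on made-up ages; σ is assumed to be the population standard deviation (ddof=0), whereas pandas uses the sample standard deviation (ddof=1) by default:

```python
import pandas as pd

ages = pd.Series([20, 30, 40, 50, 60], name="Age")  # hypothetical values

# Z-score: subtract the mean, divide by the (population) standard deviation
z = (ages - ages.mean()) / ages.std(ddof=0)

# Min-max: subtract the minimum, divide by the range
x_norm = (ages - ages.min()) / (ages.max() - ages.min())

print(z.round(3).tolist())
print(x_norm.tolist())
```

The z values have mean 0 and standard deviation 1, while the min-max values run from 0 for the smallest age to 1 for the largest.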

- Transformation
i.
