
DATA MINING USING PYTHON LAB (R20-IV Sem)

Cycle-2
Aim: Demonstrate the following data preprocessing tasks using Python libraries.
a) Dealing with categorical data.
b) Scaling the features.
c) Splitting dataset into Training and Testing Sets

Solution:
a) Dealing with categorical data.
● Categorical Data
○ Categorical data is a type of data that is used to group information with similar
characteristics.
○ Numerical data is a type of data that expresses information in the form of
numbers.
○ Example of categorical data: gender
● Encoding Categorical Data
○ Most machine learning algorithms cannot handle categorical variables unless we
convert them to numerical values.
○ The performance of many algorithms even varies based on how the categorical
variables are encoded.
● Categorical variables can be divided into two categories:
○ Nominal: no particular order
○ Ordinal: there is some order between values
Nominal data: This type of categorical data consists of named categories without any
numerical value or inherent order. For example, in an organization, the names of the
different departments, such as the research and development department, the human
resources department, and the accounts and billing department.

Above we can see some examples of nominal data.



Ordinal data: This type of categorical data has an inherent order or scale. For
example, in a list of patients, the level of sugar present in the body of a person
can be divided into high, medium and low classes.

● Different encoding techniques for dealing with categorical data


○ Label (or) Ordinal Encoding
○ One-hot Encoding

(i) Label encoding


In label encoding, we replace each categorical value with a numeric value between 0
and the number of classes minus 1. If the categorical variable contains 5 distinct classes,
we use the values (0, 1, 2, 3 and 4).
Ex: Let us take the dataset salary.csv and load it using the read_csv() function.
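
A minimal loading sketch is given below; the file name salary.csv comes from the text, while the exact set of columns in the file is an assumption (the 'Country' attribute used in the next step is named later in this lab).

import pandas as pd  # pandas for reading the CSV file

# Load the dataset into a DataFrame (file name taken from the text above)
df = pd.read_csv("salary.csv")
print(df.head())  # inspect the first few rows and the column names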

Output:

Now we will encode the values of the categorical attribute 'Country' using the label
encoding technique.

Input:
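
A possible label-encoding step is sketched below, assuming the DataFrame df loaded above and the 'Country' column named in the text.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Replace each country name with an integer between 0 and (number of classes - 1)
df["Country"] = le.fit_transform(df["Country"])

print(df.head())
print(le.classes_)  # original category names, in the order of their integer codes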

Sample Output:

(ii) One hot encoding


One-Hot Encoding is another popular technique for treating categorical variables. It simply
creates additional features based on the number of unique values in the categorical feature. Every
unique value in the category will be added as a feature.
In this encoding technique, each category is represented as a one-hot vector.

Input:
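
A possible one-hot encoding step is sketched below, again assuming the df and 'Country' column from above; pandas.get_dummies() is one common way to do this (scikit-learn's OneHotEncoder is another).

import pandas as pd

# Every unique value of 'Country' becomes its own 0/1 indicator column
encoded = pd.get_dummies(df, columns=["Country"])
print(encoded.head())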

Output:

b) Scaling the features


Feature scaling is a technique for bringing the values of all the independent features of our
dataset onto the same scale. Feature scaling helps algorithms perform calculations quickly,
and it is an important stage of data preprocessing.
If we do not scale the features, the machine learning model gives higher weightage to features
with larger values and lower weightage to features with smaller values, and training the model
also takes more time.

Many machine learning algorithms that use Euclidean distance as a metric to calculate
similarities will fail to give reasonable recognition to the feature with the smaller scale
(for example, the number of bedrooms in a housing dataset), which in a real case can turn
out to be an important feature.
There are several ways to do feature scaling.

Types of Feature Scaling


1. Normalization
Normalization is a scaling technique in which the values are rescaled into the range 0 to 1.

To normalize our data, we need to import MinMaxScaler from the scikit-learn library and
apply it to our dataset. After applying the MinMaxScaler, the minimum value of each feature
will be zero and the maximum value will be one.
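
A quick illustration on toy data (not from the lab dataset): min-max normalization rescales each value as (x - min) / (max - min).

import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[10.0], [20.0], [30.0], [50.0]])  # toy values
scaler = MinMaxScaler()
print(scaler.fit_transform(x))  # [[0.], [0.25], [0.5], [1.]]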

2. Standardization
Standardization is another scaling technique in which the values are rescaled so that the mean
becomes equal to zero and the standard deviation equal to one.

To standardize our data, we need to import StandardScaler from the scikit-learn library and
apply it to our dataset.
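
A quick illustration on toy data: standardization computes the z-score z = (x - mean) / std for each value.

import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[10.0], [20.0], [30.0], [40.0]])  # toy values
scaler = StandardScaler()
z = scaler.fit_transform(x)
print(z.mean(), z.std())  # approximately 0.0 and 1.0 after scaling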
We'll be working with the Ames Housing Dataset, which contains 79 features regarding houses
sold in Ames, Iowa.
Let's import the data and take a look at some of the features we'll be using:
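
A possible loading step is sketched below; the file name 'AmesHousing.csv' is an assumption and should be adjusted to the local copy, while the column names come from the discussion that follows.

import pandas as pd

# Load the Ames Housing data (assumed file name) and inspect the features used below
df = pd.read_csv("AmesHousing.csv")
print(df[["Gr Liv Area", "Overall Qual", "SalePrice"]].describe())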

Output:

From the output, there's a clear strong positive correlation between


(a) the "Gr Liv Area" feature and the "SalePrice" feature - with only a couple of outliers.
(b) the "Overall Qual" feature and the "SalePrice" feature.
The "Gr Liv Area" spans up to ~5000 (measured in square feet), while the "Overall Qual"
feature spans up to 10 (discrete categories of quality). If we were to plot these two on the same
axes, we wouldn't be able to tell much about the "Overall Qual" feature:
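
A possible plot is sketched below using matplotlib, assuming the df loaded above.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 4))
# Both features on the same x-axis: "Overall Qual" is dwarfed by "Gr Liv Area"
ax.scatter(df["Gr Liv Area"], df["SalePrice"], alpha=0.4, label="Gr Liv Area")
ax.scatter(df["Overall Qual"], df["SalePrice"], alpha=0.4, label="Overall Qual")
ax.set_xlabel("Feature value")
ax.set_ylabel("SalePrice")
ax.legend()
plt.show()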

Output:

1. Standardization
The StandardScaler class is used to transform the data by standardizing it. Let's import it
and scale the data via its fit_transform() method:
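
A possible standardization step is sketched below, assuming the df loaded above.

from sklearn.preprocessing import StandardScaler
import pandas as pd

scaler = StandardScaler()
scaled = scaler.fit_transform(df[["Gr Liv Area", "Overall Qual"]])
scaled_df = pd.DataFrame(scaled, columns=["Gr Liv Area", "Overall Qual"])
print(scaled_df.describe())  # each column now has mean ~0 and std ~1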

Output:

2. MinMaxScaler
To normalize features, we use the MinMaxScaler class. It works in much the same way as
StandardScaler, but uses a fundamentally different approach to scaling the data: the values
are normalized into the range [0, 1].
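
A possible normalization step is sketched below, again assuming the df loaded above.

from sklearn.preprocessing import MinMaxScaler
import pandas as pd

scaler = MinMaxScaler()
normalized = scaler.fit_transform(df[["Gr Liv Area", "Overall Qual"]])
normalized_df = pd.DataFrame(normalized, columns=["Gr Liv Area", "Overall Qual"])
print(normalized_df.min())  # 0.0 for each column
print(normalized_df.max())  # 1.0 for each column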

Output:

c) Splitting dataset into Training and Testing Sets


What Is the Train Test Split Procedure?
Train test split is a model validation procedure that allows you to simulate how a model would
perform on new/unseen data. Here is how the procedure works:

1. Arrange the Data


Make sure your data is arranged into a format acceptable for train test split. In scikit-learn, this
consists of separating your full data set into “Features” and “Target.”
2. Split the Data
Split the data set into two pieces: a training set and a testing set. This consists of randomly
sampling, without replacement, about 75 percent of the rows (you can vary this proportion) and
putting them into your training set; the remaining 25 percent goes into your test set. Both
"Features" and "Target" are split in the same way, producing "X_train," "X_test," "y_train,"
and "y_test" for a particular train test split.
3. Train the Model
Train the model on the training set ("X_train" and "y_train").
4. Test the Model
Test the model on the testing set ("X_test" and "y_test") and evaluate its performance.

Example:
Download kc_house_data.csv
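
A possible split is sketched below; the target column name 'price' is an assumption for kc_house_data.csv, and the remaining columns are used as features.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("kc_house_data.csv")

X = df.drop(columns=["price"])  # features (assumed target column name: 'price')
y = df["price"]                 # target

# 75 percent of the rows go to the training set, 25 percent to the testing set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)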

Output:

Output:

Output:

Output:

Output:
