
DATA MINING USING PYTHON LAB (R20-IV Sem)

Cycle-2
Aim: Demonstrate the following data preprocessing tasks using Python libraries.
a) Dealing with categorical data.
b) Scaling the features.
c) Splitting dataset into Training and Testing Sets

Solution:
a) Dealing with categorical data.
● Categorical Data
○ Categorical data is a type of data that is used to group information with similar
characteristics.
○ Numerical data is a type of data that expresses information in the form of
numbers.
○ Example of categorical data: gender
● Encoding Categorical Data
○ Most machine learning algorithms cannot handle categorical variables unless we
convert them to numerical values.
○ The performance of many algorithms even varies based on how the categorical
variables are encoded.
● Categorical variables can be divided into two categories:
○ Nominal: no particular order
○ Ordinal: there is some order between values
Nominal data: This type of categorical data consists of named categories without any
numerical value or inherent order. For example, in an organization, the names of the
different departments, such as the research and development department, the human
resources department, and the accounts and billing department.

Above we can see some examples of nominal data.



Ordinal data: This type of categorical data has an inherent order or scale. For
example, in a list of patients, the level of sugar present in the body of a person
can be divided into high, medium and low classes.

● Different encoding techniques for dealing with categorical data


○ Label (or) Ordinal Encoding
○ One-hot Encoding

(i) Label encoding


In label encoding, we replace each categorical value with a numeric value between 0
and the number of classes minus 1. If the categorical variable contains 5 distinct classes,
we use the values (0, 1, 2, 3 and 4).
Ex: Let us take the dataset salary.csv and load it using the read_csv() function.
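
A minimal loading sketch is given below; the file name salary.csv comes from the text, while the exact set of columns in the file is an assumption (the 'Country' attribute used in the next step is named later in this lab).

import pandas as pd  # pandas for reading the CSV file

# Load the dataset into a DataFrame (file name taken from the text above)
df = pd.read_csv("salary.csv")
print(df.head())  # inspect the first few rows and the column names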

Output:

Now we will encode the values of the categorical attribute 'Country' using the label
encoding technique.

Input:
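
A possible label-encoding step is sketched below, assuming the DataFrame df loaded above and the 'Country' column named in the text.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Replace each country name with an integer between 0 and (number of classes - 1)
df["Country"] = le.fit_transform(df["Country"])

print(df.head())
print(le.classes_)  # original category names, in the order of their integer codes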

Sample Output:

(ii) One hot encoding


One-Hot Encoding is another popular technique for treating categorical variables. It simply
creates additional features based on the number of unique values in the categorical feature. Every
unique value in the category will be added as a feature.
In this encoding technique, each category is represented as a one-hot vector.

Input:
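
A possible one-hot encoding step is sketched below, again assuming the df and 'Country' column from above; pandas.get_dummies() is one common way to do this (scikit-learn's OneHotEncoder is another).

import pandas as pd

# Every unique value of 'Country' becomes its own 0/1 indicator column
encoded = pd.get_dummies(df, columns=["Country"])
print(encoded.head())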

Output:

b) Scaling the features


Feature scaling is a technique for bringing the values of all the independent features of our
dataset onto the same scale. Feature scaling helps algorithms perform calculations quickly,
and it is an important stage of data preprocessing.
If we do not scale the features, the machine learning model gives higher weightage to features
with larger values and lower weightage to features with smaller values, and training the model
also takes more time.

Many machine learning algorithms that use Euclidean distance as a metric to calculate
similarities will fail to give reasonable recognition to the feature with the smaller scale
(for example, the number of bedrooms in a housing dataset), which in a real case can turn
out to be an important feature.
There are several ways to do feature scaling.

Types of Feature Scaling


1. Normalization
Normalization is a scaling technique in which the values are rescaled into the range 0 to 1.

To normalize our data, we need to import MinMaxScaler from the scikit-learn library and
apply it to our dataset. After applying the MinMaxScaler, the minimum value of each feature
will be zero and the maximum value will be one.
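
A quick illustration on toy data (not from the lab dataset): min-max normalization rescales each value as (x - min) / (max - min).

import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[10.0], [20.0], [30.0], [50.0]])  # toy values
scaler = MinMaxScaler()
print(scaler.fit_transform(x))  # [[0.], [0.25], [0.5], [1.]]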

2. Standardization
Standardization is another scaling technique in which the values are rescaled so that the mean
becomes equal to zero and the standard deviation equal to one.

To standardize our data, we need to import StandardScaler from the scikit-learn library and
apply it to our dataset.
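
A quick illustration on toy data: standardization computes the z-score z = (x - mean) / std for each value.

import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[10.0], [20.0], [30.0], [40.0]])  # toy values
scaler = StandardScaler()
z = scaler.fit_transform(x)
print(z.mean(), z.std())  # approximately 0.0 and 1.0 after scaling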
We'll be working with the Ames Housing Dataset, which contains 79 features regarding houses
sold in Ames, Iowa.
Let's import the data and take a look at some of the features we'll be using:
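
A possible loading step is sketched below; the file name 'AmesHousing.csv' is an assumption and should be adjusted to the local copy, while the column names come from the discussion that follows.

import pandas as pd

# Load the Ames Housing data (assumed file name) and inspect the features used below
df = pd.read_csv("AmesHousing.csv")
print(df[["Gr Liv Area", "Overall Qual", "SalePrice"]].describe())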

Output:

From the output, there's a clear strong positive correlation between


(a) the "Gr Liv Area" feature and the "SalePrice" feature - with only a couple of outliers.
(b) the "Overall Qual" feature and the "SalePrice" feature.
The "Gr Liv Area" spans up to ~5000 (measured in square feet), while the "Overall Qual"
feature spans up to 10 (discrete categories of quality). If we were to plot these two on the same
axes, we wouldn't be able to tell much about the "Overall Qual" feature:
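
A possible plot is sketched below using matplotlib, assuming the df loaded above.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 4))
# Both features on the same x-axis: "Overall Qual" is dwarfed by "Gr Liv Area"
ax.scatter(df["Gr Liv Area"], df["SalePrice"], alpha=0.4, label="Gr Liv Area")
ax.scatter(df["Overall Qual"], df["SalePrice"], alpha=0.4, label="Overall Qual")
ax.set_xlabel("Feature value")
ax.set_ylabel("SalePrice")
ax.legend()
plt.show()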

Output:

1. Standardization
The StandardScaler class is used to transform the data by standardizing it. Let's import it
and scale the data via its fit_transform() method:
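
A possible standardization step is sketched below, assuming the df loaded above.

from sklearn.preprocessing import StandardScaler
import pandas as pd

scaler = StandardScaler()
scaled = scaler.fit_transform(df[["Gr Liv Area", "Overall Qual"]])
scaled_df = pd.DataFrame(scaled, columns=["Gr Liv Area", "Overall Qual"])
print(scaled_df.describe())  # each column now has mean ~0 and std ~1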

Output:

2. MinMaxScaler
To normalize features, we use the MinMaxScaler class. It works in much the same way as
StandardScaler, but uses a fundamentally different approach to scaling the data: the values
are normalized into the range [0, 1].
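
A possible normalization step is sketched below, again assuming the df loaded above.

from sklearn.preprocessing import MinMaxScaler
import pandas as pd

scaler = MinMaxScaler()
normalized = scaler.fit_transform(df[["Gr Liv Area", "Overall Qual"]])
normalized_df = pd.DataFrame(normalized, columns=["Gr Liv Area", "Overall Qual"])
print(normalized_df.min())  # 0.0 for each column
print(normalized_df.max())  # 1.0 for each column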

Output:

c) Splitting dataset into Training and Testing Sets


What Is the Train Test Split Procedure?
Train test split is a model validation procedure that allows you to simulate how a model would
perform on new/unseen data. Here is how the procedure works:

1. Arrange the Data


Make sure your data is arranged into a format acceptable for train test split. In scikit-learn, this
consists of separating your full data set into “Features” and “Target.”
2. Split the Data
Split the data set into two pieces: a training set and a testing set. This consists of randomly
sampling, without replacement, about 75 percent of the rows (you can vary this proportion) and
putting them into your training set; the remaining 25 percent goes into your test set. Both
"Features" and "Target" are split in the same way, producing "X_train," "X_test," "y_train,"
and "y_test" for a particular train test split.
3. Train the Model
Train the model on the training set ("X_train" and "y_train").
4. Test the Model
Test the model on the testing set ("X_test" and "y_test") and evaluate its performance.

Example:
Download kc_house_data.csv
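
A possible split is sketched below; the target column name 'price' is an assumption for kc_house_data.csv, and the remaining columns are used as features.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("kc_house_data.csv")

X = df.drop(columns=["price"])  # features (assumed target column name: 'price')
y = df["price"]                 # target

# 75 percent of the rows go to the training set, 25 percent to the testing set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)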

Output:

Output:

Output:

Output:

Output:
