DM Lab Cycle 2 1
Cycle-2
Aim: Demonstrate the following data preprocessing tasks using Python
libraries.
a) Dealing with categorical data.
b) Scaling the features.
c) Splitting dataset into Training and Testing Sets
Solution:
a) Dealing with categorical data.
● Categorical Data
○ Categorical data is a type of data that is used to group information with similar
characteristics.
○ Numerical data is a type of data that expresses information in the form of
numbers.
○ Example of categorical data: gender
● Encoding Categorical Data
○ Most machine learning algorithms cannot handle categorical variables unless we
convert them to numerical values.
○ The performance of many algorithms also varies depending on how the categorical
variables are encoded.
● Categorical variables can be divided into two categories:
○ Nominal: no particular order
○ Ordinal: there is some order between values
Nominal data: This type of categorical data consists of names or labels that have no
numerical value and no inherent order. An example is the names of the departments in
an organization, such as the research and development department, the human resources
department, and the accounts and billing department.
Ordinal data: This type of categorical data consists of values that follow an order or
scale. For example, in a list of patients, the blood sugar level of each person can be
classified as low, medium, or high.
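The distinction between nominal and ordinal data affects how each should be encoded. The following is a minimal sketch, assuming a small hypothetical table (the column names `department` and `sugar_level` and all values are invented for illustration). It maps the ordinal column to integers that preserve the order, and one-hot encodes the nominal column so that no artificial order is implied.

```python
import pandas as pd

# Hypothetical toy data: 'department' is nominal, 'sugar_level' is ordinal.
df = pd.DataFrame({
    "department": ["research", "hr", "accounts", "hr"],
    "sugar_level": ["low", "high", "medium", "low"],
})

# Ordinal: map each class to an integer that preserves low < medium < high.
order = {"low": 0, "medium": 1, "high": 2}
df["sugar_level_encoded"] = df["sugar_level"].map(order)

# Nominal: one-hot encode, producing one indicator column per department.
one_hot = pd.get_dummies(df["department"], prefix="dept")

encoded = pd.concat([df, one_hot], axis=1)
print(encoded)
```

The explicit dictionary is used here because the order of the classes is a property of the data that the code cannot infer on its own.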
Now we will encode the values of the categorical attribute ‘Country’ using the label
encoding technique.
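A minimal sketch of label encoding is given here, assuming a small hypothetical sample for the ‘Country’ attribute (the lab's actual dataset is not reproduced). It uses `LabelEncoder` from scikit-learn, which assigns an integer to each distinct value in sorted order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values; the original dataset is not shown in this manual.
data = pd.DataFrame({"Country": ["France", "Spain", "Germany", "Spain", "France"]})

encoder = LabelEncoder()
# fit_transform() learns the distinct classes and returns their integer codes.
data["Country_encoded"] = encoder.fit_transform(data["Country"])

print(data)
print(list(encoder.classes_))  # classes are stored in sorted order
```

Note that the integer codes imply an order (France < Germany < Spain) that the countries do not really have, so for nominal attributes one-hot encoding is often preferred when the values are used as input features.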
DATA MINING USING PYTHON LAB (R20-IV Sem) Page |3
b) Scaling the features.
Many machine learning algorithms that use Euclidean distance to measure similarity give
too little weight to features with a small numeric range. In this case that feature is the
number of bedrooms, which in practice can turn out to be an important predictor.
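The effect can be illustrated with a short sketch, assuming three hypothetical houses described by area and number of bedrooms (all values are invented). Before scaling, the distance is dominated by the area column; after rescaling each column to the range [0, 1], both features contribute comparably.

```python
import numpy as np

# Hypothetical houses described as [area in square feet, number of bedrooms].
house_a = np.array([2000.0, 2.0])
house_b = np.array([2010.0, 5.0])   # similar area, very different bedroom count
house_c = np.array([2100.0, 2.0])   # same bedroom count, somewhat larger area

def euclidean(p, q):
    """Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((p - q) ** 2)))

# Unscaled: the bedroom difference of 3 barely registers next to the area.
d_ab = euclidean(house_a, house_b)
d_ac = euclidean(house_a, house_c)
print("unscaled:", d_ab, d_ac)

# Rescale each column to [0, 1] and measure the distances again.
houses = np.vstack([house_a, house_b, house_c])
col_min = houses.min(axis=0)
col_max = houses.max(axis=0)
scaled = (houses - col_min) / (col_max - col_min)

d_ab_scaled = euclidean(scaled[0], scaled[1])
d_ac_scaled = euclidean(scaled[0], scaled[2])
print("scaled:  ", d_ab_scaled, d_ac_scaled)
```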
There are several ways to do feature scaling.
1. Normalization
To normalize our data, we need to import MinMaxScaler from the scikit-learn library and
apply it to our dataset. After applying MinMaxScaler, the minimum value of each feature
will be zero and the maximum value will be one.
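A minimal sketch of normalization with `MinMaxScaler` follows, assuming a small hypothetical feature matrix (the values are invented). Each column is transformed as (x - min) / (max - min), so its smallest value becomes 0 and its largest becomes 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix: columns are [area, bedrooms].
X = np.array([[1500.0, 2.0],
              [2000.0, 3.0],
              [3500.0, 5.0]])

scaler = MinMaxScaler()               # default feature_range is (0, 1)
X_scaled = scaler.fit_transform(X)    # learn per-column min/max, then rescale

print(X_scaled)
print(X_scaled.min(axis=0), X_scaled.max(axis=0))
```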
2. Standardization
Standardization is another scaling technique, in which each feature is rescaled so that
its mean is zero and its standard deviation is one.
To standardize our data, we need to import StandardScaler from the scikit-learn library
and apply it to our dataset.
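A minimal sketch of standardization with `StandardScaler` follows, again assuming a small hypothetical feature matrix. Each column is transformed as (x - mean) / standard deviation, so the result has a mean of approximately zero and a standard deviation of approximately one per column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: columns are [area, bedrooms].
X = np.array([[1500.0, 2.0],
              [2000.0, 3.0],
              [3500.0, 5.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)   # learn per-column mean/std, then rescale

print(X_std)
print(X_std.mean(axis=0))  # approximately 0 for every column
print(X_std.std(axis=0))   # approximately 1 for every column
```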
We'll be working with the Ames Housing Dataset, which contains 79 features describing
houses sold in Ames, Iowa.
Let's import the data and take a look at some of the features we'll be using:
1. Standardization
The StandardScaler class is used to transform the data by standardizing it. Let's import it
and scale the data via its fit_transform() method:
2. MinMaxScaler
To normalize features, we use the MinMaxScaler class. It works in much the same way as
StandardScaler, but uses a fundamentally different approach to scaling the data: the values
are normalized to the range [0, 1].
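The two scalers can be compared side by side on a pandas DataFrame. This is a sketch under stated assumptions: the column names `living_area` and `overall_quality` and their values are hypothetical stand-ins, not taken from the Ames data. Because fit_transform() returns a NumPy array, the result is wrapped back into a DataFrame to keep the column names.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical values standing in for two housing features.
df = pd.DataFrame({
    "living_area": [850.0, 1200.0, 1710.0, 2400.0],
    "overall_quality": [4.0, 5.0, 7.0, 9.0],
})

# fit_transform() returns a NumPy array, so restore the column names.
standardized = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)
normalized = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

print(standardized.mean())                  # approximately 0 per column
print(normalized.min(), normalized.max())   # 0 and 1 per column
```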
Example:
Download kc_house_data.csv