Exp 01-B Feature Selection and Extraction
URK21CS1124
Aim: The main function of data preprocessing is to extract the data sources related to the monitoring target based on the mining requirements, check the legality of the data, and generate the core data for the next stage of analysis.
Description:
1. Data Collection:
Gather raw data from various sources, such as databases, files, APIs, or external datasets.
2. Data Cleaning:
Handle missing values: Decide whether to remove instances with missing values or impute
them with techniques like mean, median, or more sophisticated methods. Remove duplicates:
Eliminate identical records to avoid redundancy in the dataset. Handle outliers: Identify and
deal with outliers that might adversely affect model training.
3. Data Exploration and Visualization:
Explore the dataset through statistical analysis and visualizations to gain insights into the
distribution, relationships, and patterns within the data.
4. Feature Selection:
Identify and select relevant features that contribute the most to the prediction task. Remove
irrelevant or redundant features to simplify the model and reduce computational requirements.
5. Feature Scaling:
Standardize or normalize numerical features to bring them to a common scale. This helps
prevent features with larger magnitudes from dominating the learning process.
Data preprocessing is an iterative process, and the choice of techniques depends on the specific
characteristics of the dataset and the requirements of the machine learning task at hand.
Effective data preprocessing contributes significantly to building robust and accurate machine
learning models.
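The five steps above can be sketched end-to-end on a small synthetic pandas DataFrame (the column names and values below are invented for illustration, not taken from the experiment's dataset):

```python
import pandas as pd

# Tiny synthetic dataset with one missing Price and one exact duplicate row
df = pd.DataFrame({
    "Price": [7469, 1528, 4633, None, 4633],
    "Age":   [23, 45, 31, 27, 31],
})

# Data cleaning: impute the missing Price with the column mean, then drop duplicates
df["Price"] = df["Price"].fillna(df["Price"].mean())
df = df.drop_duplicates()

# Feature scaling: min-max normalization maps Price onto [0, 1]
p_min, p_max = df["Price"].min(), df["Price"].max()
df["Price_scaled"] = (df["Price"] - p_min) / (p_max - p_min)

print(df)
```

After cleaning, the frame has four rows and the scaled column spans exactly [0, 1], which prevents the large Price magnitudes from dominating the small Age values.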
URK21CS1124
Branch 0.784314
City 0.000000
Customer 0.000000
Gender 0.000000
Product line 3.529412
Unit_price 0.000000
Quantity 0.000000
Tax 0.000000
Total 0.000000
Payment 0.000000
cogs 0.000000
Rating 0.000000
Age 0.000000
Quarterly_Tax 0.784314
Price 0.000000
dtype: float64
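The Series above reports the per-column missing-value measure; the code cell that produced it is not present in this export, but the standard computation can be sketched on a synthetic frame (column names invented for the example):

```python
import pandas as pd
import numpy as np

# Synthetic frame: one missing value in each column, out of four rows
df = pd.DataFrame({
    "Branch": ["A", None, "C", "A"],
    "Price":  [7469, 1528, np.nan, 5822],
})

# Percentage of missing values per column: count of NaNs over total rows, times 100
missing_pct = (df.isna().sum() / len(df)) * 100
print(missing_pct)
```

With one NaN in each of the four-row columns, both percentages come out to 25.0.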
[5]: #3. Replace missing values with the mean for numerical columns, if the % of
#   missing values is less than 10%. (Use temporary data frame & inplace=True)
print("URK21CS1124")
missing_values = df.isna().sum()
total = len(df)
percentage = (missing_values / total) * 100  # percentage of missing values per column
temp = df.copy()
for i in df.columns:
    if df[i].dtype != "object" and percentage[i] < 10:
        c_mean = df[i].mean()
        temp[i].fillna(c_mean, inplace=True)
df.update(temp)
df.head(10)
URK21CS1124
Branch City Customer Gender Product line Unit_price \
0 A Yangon Member Female Health and beauty 74.69
1 C Naypyitaw Normal Female Electronic accessories 15.28
2 A Yangon Normal Male Home and lifestyle 46.33
3 A Yangon Member Male Health and beauty 58.22
4 A Yangon Normal Male Sports and travel 86.31
5 C Naypyitaw Normal Male Electronic accessories 85.39
6 A Yangon Member Female NaN 68.84
7 C Naypyitaw Normal Female NaN 73.56
8 A Yangon Member Female NaN 36.26
9 B Mandalay Member Female NaN 54.84
Quarterly_Tax Price
0 210.0 7469
1 210.0 1528
2 124.0 4633
3 210.0 5822
4 210.0 8631
5 210.0 8539
6 210.0 6884
7 210.0 7356
8 100.0 3626
9 185.0 5484
[6]: #4. Perform the interpolation using nearest method to estimate the missing
#   values for the numerical column, if the % of missing values is less than 10%.
print("URK21CS1124")
missing_values2 = df.isna().sum()
total2 = len(df)
percentage2 = (missing_values2 / total2) * 100
temp2 = df.copy()
for i in df.columns:
    if df[i].dtype != "object" and percentage2[i] < 10:
        temp2[i].interpolate(method="nearest", inplace=True)
temp2.head(10)
URK21CS1124
[6]: Branch City Customer Gender Product line Unit_price \
0 A Yangon Member Female Health and beauty 74.69
1 C Naypyitaw Normal Female Electronic accessories 15.28
2 A Yangon Normal Male Home and lifestyle 46.33
3 A Yangon Member Male Health and beauty 58.22
4 A Yangon Normal Male Sports and travel 86.31
5 C Naypyitaw Normal Male Electronic accessories 85.39
6 A Yangon Member Female NaN 68.84
7 C Naypyitaw Normal Female NaN 73.56
8 A Yangon Member Female NaN 36.26
9 B Mandalay Member Female NaN 54.84
Quarterly_Tax Price
0 210.0 7469
1 210.0 1528
2 124.0 4633
3 210.0 5822
4 210.0 8631
5 210.0 8539
6 210.0 6884
7 210.0 7356
8 100.0 3626
9 185.0 5484
[7]: #5. Perform the mode imputation for a categorical column, if the % of missing
#   values is less than 10%. (Use temporary data frame & inplace=True)
print("URK21CS1124")
missing_values3 = df.isna().sum()
total3 = len(df)
percentage3 = (missing_values3 / total3) * 100
temp3 = df.copy()
for i in df.columns:
    if df[i].dtype == "object" and percentage3[i] < 10:
        c_mode = df[i].mode().iloc[0]
        temp3[i].fillna(c_mode, inplace=True)
df.update(temp3)
df.head(10)
URK21CS1124
Quarterly_Tax Price
0 210.0 7469
1 210.0 1528
2 124.0 4633
3 210.0 5822
4 210.0 8631
5 210.0 8539
6 210.0 6884
7 210.0 7356
8 100.0 3626
9 185.0 5484
[8]: #6. Drop the columns with more than 10% missing values and display the size
#   (Use temporary data frame & inplace=True)
print("URK21CS1124")
missing_values4 = df.isna().sum()
total4 = len(df)
percentage4 = (missing_values4 / total4) * 100
temp4 = df.copy()
for i in df.columns:
    if percentage4[i] > 10:
        temp4.drop(i, axis=1, inplace=True)
print(temp4.head())
print(temp4.shape)
URK21CS1124
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]
(51, 0)
[9]: #7. Drop the rows with outlier Z-score value > 3 for "Quantity" and
#   display the size. (Use temporary data frame & inplace=True)
print("URK21CS1124")
import numpy as np
temp5 = df.copy()
zscore = np.abs((temp5["Quantity"] - temp5["Quantity"].mean()) / temp5["Quantity"].std())
temp5.drop(temp5[zscore > 3].index, inplace=True)
print(temp5["Quantity"].head(10))
print(temp5.shape)
URK21CS1124
0 7
1 5
2 7
3 8
5 7
6 6
7 10
8 2
9 3
10 4
Name: Quantity, dtype: int64
(47, 15)
[10]: #8. Find the % of duplicate rows with all columns having same value.
print("URK21CS1124")
duplicated = df.duplicated().sum()
total_num = len(df)
perc = (duplicated / total_num) * 100
perc
URK21CS1124
[10]: 3.9215686274509802
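The value above corresponds to 2 fully duplicated rows out of the 51 in the dataset (2/51 × 100 ≈ 3.92%). A minimal illustration of how `duplicated()` counts such rows, on a synthetic frame:

```python
import pandas as pd

# Synthetic frame: row 2 is an exact copy of row 0
df = pd.DataFrame({"a": [1, 2, 1, 3], "b": ["x", "y", "x", "z"]})

# duplicated() flags later occurrences of fully identical rows as True
dup_count = df.duplicated().sum()   # one duplicate row
perc = (dup_count / len(df)) * 100  # share of duplicate rows, as a percentage
print(dup_count, perc)
```

Here one of four rows is a repeat, giving 25.0%.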
[11]: #9. Find the % of duplicate rows based on some specific columns
#   [Customer, Product line, Age, Gender] having same value. Drop the duplicates and
#   display the size.
print("URK21CS1124")
temp6 = df.copy()
specific_column = ['Customer', 'Product line', 'Age', 'Gender']
dup = temp6.duplicated(subset=specific_column)
num_dup = dup.sum()
total_num = len(temp6)
percen = (num_dup / total_num) * 100
temp6.drop_duplicates(subset=specific_column, inplace=True)
print(temp6.head())
print(temp6.shape)
URK21CS1124
Branch City Customer Gender Product line Unit_price \
0 A Yangon Member Female Health and beauty 74.69
1 C Naypyitaw Normal Female Electronic accessories 15.28
2 A Yangon Normal Male Home and lifestyle 46.33
3 A Yangon Member Male Health and beauty 58.22
4 A Yangon Normal Male Sports and travel 86.31
Quarterly_Tax Price
0 210.0 7469
1 210.0 1528
2 124.0 4633
3 210.0 5822
4 210.0 8631
(40, 15)
[12]: #10. Perform the min-max normalization for the numerical feature Age
#    using Python code and analyze the values in a scatter plot.
print("URK21CS1124")
import matplotlib.pyplot as plt
age_min = df["Age"].min()
age_max = df["Age"].max()
df["normalized"] = (df["Age"] - age_min) / (age_max - age_min)
plt.scatter(df.index, df["normalized"])
plt.xlabel("index")
plt.ylabel("normalized")
URK21CS1124
[13]: #11. Perform the Z-score normalization for the numerical feature Age
#    using Python code and analyze the values in a scatter plot.
print("URK21CS1124")
mean = df["Age"].mean()
std = df["Age"].std()
df["normalized_age"] = (df["Age"] - mean) / std
plt.scatter(df.index, df["normalized_age"])
plt.xlabel("index")
plt.ylabel("normalized_age")
URK21CS1124
[14]: #12. Perform the label encoding for the categorical feature Payment
#    using Python code
print("URK21CS1124")
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
label_encoder.fit(df["Payment"])
df["payment_encoded"] = label_encoder.transform(df["Payment"])
df[["Payment", "payment_encoded"]]
URK21CS1124
Payment payment_encoded
9 Credit card 1
10 Ewallet 2
11 Cash 0
12 Ewallet 2
13 Ewallet 2
14 Cash 0
15 Cash 0
16 Credit card 1
17 Credit card 1
18 Credit card 1
19 Ewallet 2
20 Ewallet 2
21 Ewallet 2
22 Credit card 1
23 Ewallet 2
24 Ewallet 2
25 Credit card 1
26 Cash 0
27 Credit card 1
28 Cash 0
29 Cash 0
30 Credit card 1
31 Cash 0
32 Cash 0
33 Credit card 1
34 Ewallet 2
35 Ewallet 2
36 Ewallet 2
37 Ewallet 2
38 Ewallet 2
39 Cash 0
40 Ewallet 2
41 Cash 0
42 Cash 0
43 Cash 0
44 Cash 0
45 Cash 0
46 Credit card 1
47 Ewallet 2
48 Credit card 1
49 Credit card 1
50 Ewallet 2
[15]: #13. Perform the one-hot encoding for the categorical feature Payment
#    using Python code
print("URK21CS1124")
dummie = pd.get_dummies(df["Payment"])
dummie
URK21CS1124
Cash Credit card Ewallet
42 1 0 0
43 1 0 0
44 1 0 0
45 1 0 0
46 0 1 0
47 0 0 1
48 0 1 0
49 0 1 0
50 0 0 1
Result: The given dataset was analysed using data pre-processing techniques and the output was verified successfully.