Exp 01-B Feature Selection and Extraction

The document discusses the process of data preprocessing which involves collecting raw data from various sources, cleaning the data by handling missing values, outliers, and duplicates, exploring and visualizing the data to gain insights, selecting relevant features, and scaling features. The key steps are data collection, cleaning, exploration and visualization, feature selection, and feature scaling. Effective data preprocessing contributes significantly to building accurate machine learning models.

Exp 01-B Feature Selection and Extraction

January 12, 2024

URK21CS1124
Aim: To perform data preprocessing: extract the data relevant to the task from its sources, check the validity of the data, and prepare clean core data for the analysis that follows.
Description:
1. Data Collection:
Gather raw data from various sources, such as databases, files, APIs, or external datasets.
2. Data Cleaning:
Handle missing values: Decide whether to remove instances with missing values or impute
them with techniques like mean, median, or more sophisticated methods. Remove duplicates:
Eliminate identical records to avoid redundancy in the dataset. Handle outliers: Identify and
deal with outliers that might adversely affect model training.
3. Data Exploration and Visualization:
Explore the dataset through statistical analysis and visualizations to gain insights into the
distribution, relationships, and patterns within the data.
4. Feature Selection:
Identify and select relevant features that contribute the most to the prediction task. Remove
irrelevant or redundant features to simplify the model and reduce computational requirements.
5. Feature Scaling:
Standardize or normalize numerical features to bring them to a common scale. This helps
prevent features with larger magnitudes from dominating the learning process.
Data preprocessing is an iterative process, and the choice of techniques depends on the specific
characteristics of the dataset and the requirements of the machine learning task at hand.
Effective data preprocessing contributes significantly to building robust and accurate machine
learning models.
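The five steps above can be sketched end to end on a small synthetic frame (the column names `price` and `fuel` here are illustrative, not from the lab dataset):

```python
import pandas as pd

# Illustrative frame: one numeric column with a gap, one categorical column.
df = pd.DataFrame({
    "price": [10.0, None, 30.0, 40.0],
    "fuel": ["petrol", "diesel", None, "petrol"],
})

# 2. Cleaning: mean-impute the numeric column, mode-impute the categorical one.
df["price"] = df["price"].fillna(df["price"].mean())
df["fuel"] = df["fuel"].fillna(df["fuel"].mode().iloc[0])

# 5. Scaling: min-max normalize the numeric column to [0, 1].
df["price_scaled"] = (df["price"] - df["price"].min()) / (
    df["price"].max() - df["price"].min())

print(df["price_scaled"].tolist())  # [0.0, ~0.556, ~0.667, 1.0]
```

Exploration (step 3) and feature selection (step 4) are dataset-specific and are exercised in the cells that follow.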

[2]: import pandas as pd

[3]: #1. Read the data

print("URK21CS1124")
df=pd.read_csv("Stores1b.csv")

URK21CS1124

[4]: #2. Calculate the % of missing values in the columns

print("URK21CS1124")
missing_values=df.isna().sum()
total=len(df)
percentage=(missing_values/total)*100
print(percentage)

URK21CS1124
Branch 7.843137
City 0.000000
Customer 0.000000
Gender 0.000000
Product line 35.294118
Unit_price 0.000000
Quantity 0.000000
Tax 0.000000
Total 0.000000
Payment 0.000000
cogs 0.000000
Rating 0.000000
Age 0.000000
Quarterly_Tax 7.843137
Price 0.000000
dtype: float64

[5]: #3. Replace missing value with mean for the numerical column, if the % of
# missing value is less than 10%. (Use temporary data frame & inplace=True)

print("URK21CS1124")
missing_values=df.isna().sum()
total=len(df)
percentage=(missing_values/total)*100
temp=df.copy()
for i in df.columns:
    if df[i].dtype!="object" and percentage[i]<10:
        c_mean=df[i].mean()
        temp[i].fillna(c_mean,inplace=True)
df.update(temp)
df.head(10)

URK21CS1124

[5]: Branch City Customer Gender Product line Unit_price \


0 A Yangon Member Female Health and beauty 74.69
1 C Naypyitaw Normal Female Electronic accessories 15.28
2 A Yangon Normal Male Home and lifestyle 46.33
3 A Yangon Member Male Health and beauty 58.22

4 A Yangon Normal Male Sports and travel 86.31
5 C Naypyitaw Normal Male Electronic accessories 85.39
6 A Yangon Member Female NaN 68.84
7 C Naypyitaw Normal Female NaN 73.56
8 A Yangon Member Female NaN 36.26
9 B Mandalay Member Female NaN 54.84

Quantity Tax Total Payment cogs Rating Age \


0 7 26.1415 548.9715 Ewallet 522.83 9.1 23
1 5 3.8200 80.2200 Cash 76.40 9.6 23
2 7 16.2155 340.5255 Credit card 324.31 7.4 24
3 8 23.2880 489.0480 Ewallet 465.76 8.4 26
4 60 30.2085 634.3785 Ewallet 604.17 5.3 30
5 7 29.8865 627.6165 Ewallet 597.73 4.1 32
6 6 20.6520 433.6920 Ewallet 413.04 5.8 27
7 10 36.7800 772.3800 Ewallet 735.60 8.0 30
8 2 3.6260 76.1460 Credit card 72.52 7.2 27
9 3 8.2260 172.7460 Credit card 164.52 5.9 23

Quarterly_Tax Price
0 210.0 7469
1 210.0 1528
2 124.0 4633
3 210.0 5822
4 210.0 8631
5 210.0 8539
6 210.0 6884
7 210.0 7356
8 100.0 3626
9 185.0 5484

[6]: #4. Perform the interpolation using nearest method to estimate the missing
# values for the numerical column, if the % of missing value is less than 10%.
# (Use temporary dataframe)

print("URK21CS1124")
missing_values2=df.isna().sum()
total2=len(df)
percentage2=(missing_values2/total2)*100
temp2=df.copy()
for i in df.columns:
    if df[i].dtype!="object" and percentage2[i]<10:
        temp2[i].interpolate(method="nearest",inplace=True)
temp2.head(10)

URK21CS1124

[6]: Branch City Customer Gender Product line Unit_price \
0 A Yangon Member Female Health and beauty 74.69
1 C Naypyitaw Normal Female Electronic accessories 15.28
2 A Yangon Normal Male Home and lifestyle 46.33
3 A Yangon Member Male Health and beauty 58.22
4 A Yangon Normal Male Sports and travel 86.31
5 C Naypyitaw Normal Male Electronic accessories 85.39
6 A Yangon Member Female NaN 68.84
7 C Naypyitaw Normal Female NaN 73.56
8 A Yangon Member Female NaN 36.26
9 B Mandalay Member Female NaN 54.84

Quantity Tax Total Payment cogs Rating Age \


0 7 26.1415 548.9715 Ewallet 522.83 9.1 23
1 5 3.8200 80.2200 Cash 76.40 9.6 23
2 7 16.2155 340.5255 Credit card 324.31 7.4 24
3 8 23.2880 489.0480 Ewallet 465.76 8.4 26
4 60 30.2085 634.3785 Ewallet 604.17 5.3 30
5 7 29.8865 627.6165 Ewallet 597.73 4.1 32
6 6 20.6520 433.6920 Ewallet 413.04 5.8 27
7 10 36.7800 772.3800 Ewallet 735.60 8.0 30
8 2 3.6260 76.1460 Credit card 72.52 7.2 27
9 3 8.2260 172.7460 Credit card 164.52 5.9 23

Quarterly_Tax Price
0 210.0 7469
1 210.0 1528
2 124.0 4633
3 210.0 5822
4 210.0 8631
5 210.0 8539
6 210.0 6884
7 210.0 7356
8 100.0 3626
9 185.0 5484

[7]: #5. Perform the mode imputation for a categorical data, if the % of missing
# value is less than 10%. (Use temporary data frame & inplace=True)

print("URK21CS1124")
missing_values3=df.isna().sum()
total3=len(df)
percentage3=(missing_values3/total3)*100
temp3=df.copy()
for i in df.columns:
    if df[i].dtype=="object" and percentage3[i]<10:
        c_column=df[i].mode().iloc[0]
        temp3[i].fillna(c_column,inplace=True)
df.update(temp3)
df.head(10)

URK21CS1124

[7]: Branch City Customer Gender Product line Unit_price \


0 A Yangon Member Female Health and beauty 74.69
1 C Naypyitaw Normal Female Electronic accessories 15.28
2 A Yangon Normal Male Home and lifestyle 46.33
3 A Yangon Member Male Health and beauty 58.22
4 A Yangon Normal Male Sports and travel 86.31
5 C Naypyitaw Normal Male Electronic accessories 85.39
6 A Yangon Member Female NaN 68.84
7 C Naypyitaw Normal Female NaN 73.56
8 A Yangon Member Female NaN 36.26
9 B Mandalay Member Female NaN 54.84

Quantity Tax Total Payment cogs Rating Age \


0 7 26.1415 548.9715 Ewallet 522.83 9.1 23
1 5 3.8200 80.2200 Cash 76.40 9.6 23
2 7 16.2155 340.5255 Credit card 324.31 7.4 24
3 8 23.2880 489.0480 Ewallet 465.76 8.4 26
4 60 30.2085 634.3785 Ewallet 604.17 5.3 30
5 7 29.8865 627.6165 Ewallet 597.73 4.1 32
6 6 20.6520 433.6920 Ewallet 413.04 5.8 27
7 10 36.7800 772.3800 Ewallet 735.60 8.0 30
8 2 3.6260 76.1460 Credit card 72.52 7.2 27
9 3 8.2260 172.7460 Credit card 164.52 5.9 23

Quarterly_Tax Price
0 210.0 7469
1 210.0 1528
2 124.0 4633
3 210.0 5822
4 210.0 8631
5 210.0 8539
6 210.0 6884
7 210.0 7356
8 100.0 3626
9 185.0 5484

[8]: #6. Drop the columns with more than 10% missing values and display the size
# (Use temporary data frame & inplace=True)

print("URK21CS1124")
missing_values4=df.isna().sum()
total4=len(df)
percentage4=(missing_values4/total4)*100
temp4=df.copy()
for i in df.columns:
    if percentage4[i]>10:
        temp4.drop(i,axis=1,inplace=True)
print(temp4.shape)

URK21CS1124
(51, 14)

[9]: #7. Drop the rows with outlier Z-score value > 3 for "Guarantee_Period" and
# display the size. (Use temporary data frame & inplace=True)
# (the corresponding numeric column in this dataset is Quantity)

print("URK21CS1124")
import numpy as np
temp5=df.copy()
zscore=np.abs((temp5["Quantity"]-temp5["Quantity"].mean())/temp5["Quantity"].std())
temp5.drop(temp5[zscore>3].index,inplace=True)
print(temp5["Quantity"].head(10))
print(temp5.shape)

URK21CS1124
0 7
1 5
2 7
3 8
5 7
6 6
7 10
8 2
9 3
10 4
Name: Quantity, dtype: int64
(47, 15)
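The same Z-score filter can be expressed more compactly with a boolean mask instead of `drop` (a sketch on a synthetic column, not the lab data):

```python
import numpy as np
import pandas as pd

# Synthetic column: many typical values plus one extreme outlier.
s = pd.Series([5, 6, 7] * 8 + [100])

# Sample standard deviation (ddof=1), matching pandas .std() in the cell above.
z = np.abs((s - s.mean()) / s.std())

# Keep only the rows within 3 standard deviations of the mean.
filtered = s[z <= 3]
print(len(filtered))  # 24 (the value 100 is dropped)
```

Note that with a sample of only a handful of rows a Z-score rarely exceeds 3 even for obvious outliers, so the threshold is meaningful only on reasonably sized columns.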

[10]: #8. Find the % of duplicate rows with all columns having same value.
print("URK21CS1124")
duplicated=df.duplicated().sum()
total_num=len(df)
perc=(duplicated/total_num)*100
perc

URK21CS1124

[10]: 3.9215686274509802

[11]: #9. Find the % of duplicate rows based on some specific columns
# [Price, Age, Mfg_Month, Fuel_Type] having same value. Drop the duplicates and
# display the size. (Use temporary data frame & inplace=True)

print("URK21CS1124")
temp6=df.copy()
specific_column=['Customer', 'Product line', 'Age', 'Gender']
dup=temp6.duplicated(subset=specific_column)
num_dup=dup.sum()
total_num=len(temp6)
percen=(num_dup/total_num)*100
temp6.drop_duplicates(subset=specific_column,inplace=True)
print(temp6.head())
print(temp6.shape)

URK21CS1124
Branch City Customer Gender Product line Unit_price \
0 A Yangon Member Female Health and beauty 74.69
1 C Naypyitaw Normal Female Electronic accessories 15.28
2 A Yangon Normal Male Home and lifestyle 46.33
3 A Yangon Member Male Health and beauty 58.22
4 A Yangon Normal Male Sports and travel 86.31

Quantity Tax Total Payment cogs Rating Age \


0 7 26.1415 548.9715 Ewallet 522.83 9.1 23
1 5 3.8200 80.2200 Cash 76.40 9.6 23
2 7 16.2155 340.5255 Credit card 324.31 7.4 24
3 8 23.2880 489.0480 Ewallet 465.76 8.4 26
4 60 30.2085 634.3785 Ewallet 604.17 5.3 30

Quarterly_Tax Price
0 210.0 7469
1 210.0 1528
2 124.0 4633
3 210.0 5822
4 210.0 8631
(40, 15)

[12]: #10. Perform the min-max normalization for a numerical feature Quarterly_Tax
# using Python code and analyze the values in scatter plot.

print("URK21CS1124")
# (the code below applies min-max normalization to the Age column)
age_min=df["Age"].min()
age_max=df["Age"].max()
df["normalized"]=(df["Age"]-age_min)/(age_max-age_min)
import matplotlib.pyplot as plt
plt.scatter(df.index,df["normalized"])
plt.xlabel("index")
plt.ylabel("normalized")

URK21CS1124

[12]: Text(0, 0.5, 'normalized')

[13]: #11. Perform the Z-score normalization for a numerical feature, Quarterly_Tax
# using Python code and analyze the values in scatter plot.

print("URK21CS1124")
mean=df["Age"].mean()
std=df["Age"].std()
df["normalized_age"]=(df["Age"]-mean)/std
plt.scatter(df.index,df["normalized_age"])
plt.xlabel("index")
plt.ylabel("normalized")

URK21CS1124

[13]: Text(0, 0.5, 'normalized')
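Both normalizations above have scikit-learn equivalents (scikit-learn is already used for LabelEncoder below); a sketch on a small synthetic age column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Ages as a column vector, since sklearn scalers expect 2-D input.
x = np.array([[23.0], [24.0], [26.0], [30.0], [32.0]])

minmax = MinMaxScaler().fit_transform(x)    # maps values to [0, 1]
zscore = StandardScaler().fit_transform(x)  # zero mean, unit variance

print(minmax.ravel())
print(zscore.ravel())
```

One difference from the hand-rolled version: `StandardScaler` uses the population standard deviation (ddof=0), while pandas `.std()` defaults to the sample standard deviation (ddof=1), so the scaled values differ slightly.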

[14]: #12. Perform the label encoding for a categorical feature 'Fuel_Type' using
# Python code

print("URK21CS1124")
from sklearn.preprocessing import LabelEncoder
label_encoder=LabelEncoder()
label_encoder.fit(df["Payment"])
df["payment_encoded"]=label_encoder.transform(df["Payment"])
df[["Payment","payment_encoded"]]

URK21CS1124

[14]: Payment payment_encoded


0 Ewallet 2
1 Cash 0
2 Credit card 1
3 Ewallet 2
4 Ewallet 2
5 Ewallet 2
6 Ewallet 2
7 Ewallet 2
8 Credit card 1

9 Credit card 1
10 Ewallet 2
11 Cash 0
12 Ewallet 2
13 Ewallet 2
14 Cash 0
15 Cash 0
16 Credit card 1
17 Credit card 1
18 Credit card 1
19 Ewallet 2
20 Ewallet 2
21 Ewallet 2
22 Credit card 1
23 Ewallet 2
24 Ewallet 2
25 Credit card 1
26 Cash 0
27 Credit card 1
28 Cash 0
29 Cash 0
30 Credit card 1
31 Cash 0
32 Cash 0
33 Credit card 1
34 Ewallet 2
35 Ewallet 2
36 Ewallet 2
37 Ewallet 2
38 Ewallet 2
39 Cash 0
40 Ewallet 2
41 Cash 0
42 Cash 0
43 Cash 0
44 Cash 0
45 Cash 0
46 Credit card 1
47 Ewallet 2
48 Credit card 1
49 Credit card 1
50 Ewallet 2

[15]: #13. Perform the one-hot encoding for a categorical feature 'Fuel_Type' using
# Python code

print("URK21CS1124")
dummie=pd.get_dummies(df["Payment"])
dummie

URK21CS1124

[15]: Cash Credit card Ewallet


0 0 0 1
1 1 0 0
2 0 1 0
3 0 0 1
4 0 0 1
5 0 0 1
6 0 0 1
7 0 0 1
8 0 1 0
9 0 1 0
10 0 0 1
11 1 0 0
12 0 0 1
13 0 0 1
14 1 0 0
15 1 0 0
16 0 1 0
17 0 1 0
18 0 1 0
19 0 0 1
20 0 0 1
21 0 0 1
22 0 1 0
23 0 0 1
24 0 0 1
25 0 1 0
26 1 0 0
27 0 1 0
28 1 0 0
29 1 0 0
30 0 1 0
31 1 0 0
32 1 0 0
33 0 1 0
34 0 0 1
35 0 0 1
36 0 0 1
37 0 0 1
38 0 0 1
39 1 0 0
40 0 0 1
41 1 0 0

42 1 0 0
43 1 0 0
44 1 0 0
45 1 0 0
46 0 1 0
47 0 0 1
48 0 1 0
49 0 1 0
50 0 0 1
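The 0/1 columns above reflect the pandas version used for this run; newer pandas releases return boolean dummy columns by default, and passing `dtype=int` reproduces the 0/1 form (a small sketch on a synthetic payment column):

```python
import pandas as pd

# Synthetic payment column; dtype=int forces 0/1 instead of True/False dummies.
s = pd.Series(["Cash", "Ewallet", "Credit card", "Cash"])
d = pd.get_dummies(s, dtype=int)

print(d.columns.tolist())   # ['Cash', 'Credit card', 'Ewallet']
print(d["Cash"].tolist())   # [1, 0, 0, 1]
```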

Result: The given dataset is analysed using data pre-processing techniques and the output is verified successfully.
