Pre-Processing techniques.ipynb - Colab

The document outlines various data pre-processing techniques essential for preparing raw data for analysis or machine learning, including attribute selection, handling missing values, discretization, and outlier elimination. It provides code examples using Python libraries such as pandas, numpy, and sklearn to demonstrate these techniques on sample datasets. The document emphasizes improving dataset quality through these methods to enhance analysis outcomes.

Uploaded by

mgiri63021
Copyright
© All Rights Reserved

2/13/25, 10:11 AM Pre-Processing techniques.ipynb - Colab

Data Pre-Processing


Data pre-processing transforms raw data into a clean, structured format suitable for analysis or machine learning models.
It includes techniques such as handling missing values, removing duplicates, normalizing data, encoding categorical variables, and
eliminating outliers, all aimed at improving the dataset's quality. The cells below demonstrate four techniques:

Attribute Selection: Selecting the most relevant features.

Handling Missing Values: Filling or removing missing data.

Discretization: Converting continuous data into categorical bins.

Elimination of Outliers: Removing extreme values.
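
Removing duplicates, normalizing data, and encoding categorical variables are mentioned above but are not among the four techniques listed. A minimal sketch of those three steps follows, assuming a hypothetical toy table (the column names and values are illustrative and not taken from the notebook) and using pandas `drop_duplicates`, scikit-learn's `MinMaxScaler`, and pandas `get_dummies` as one possible choice for each step:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: one exact duplicate row and one categorical column
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 160.0, 170.0, 190.0],
    "city": ["Pune", "Delhi", "Delhi", "Pune", "Mumbai"],
})

# Removing duplicates: drop rows that repeat an earlier row exactly
df = df.drop_duplicates().reset_index(drop=True)

# Normalizing: rescale the numeric column to the range [0, 1]
scaler = MinMaxScaler()
df["height_scaled"] = scaler.fit_transform(df[["height_cm"]]).ravel()

# Encoding categorical variables: one indicator column per category
df_encoded = pd.get_dummies(df, columns=["city"])

print(df_encoded)
```

Min-max scaling is only one way to normalize; standardization to zero mean and unit variance is a common alternative when the data contains extreme values.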

import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer
from scipy import stats

# Sample dataset
data = {
    'A': [10, 20, 30, np.nan, 50, 60, 70, 800],  # Outlier at 800, missing value at index 3
    'B': [5, 15, np.nan, 25, 35, 45, 55, 65],    # Missing value at index 2
    'C': [1, 2, 3, 4, 5, 6, 7, 8],               # Continuous data
    'Target': [0, 1, 0, 1, 0, 1, 0, 1]           # Target variable (classification)
}

df = pd.DataFrame(data)
print("Original Dataset:\n", df)

# 1. Attribute Selection
X = df.drop(columns=['Target'])  # Features
y = df['Target']
selector = SelectKBest(score_func=f_classif, k=2)  # Keep the 2 features with the highest ANOVA F-score
# Missing values are filled with column means for scoring only; df itself is unchanged
X_new = selector.fit_transform(X.fillna(X.mean()), y)
selected_features = X.columns[selector.get_support()]
print("\nSelected Features:", selected_features)

# 2. Handling Missing Values
imputer = SimpleImputer(strategy='mean')  # Replace NaN with column mean
# The imputer returns a float array, so Target becomes 0.0/1.0 in df_imputed
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print("\nDataset after handling missing values:\n", df_imputed)

# 3. Discretization
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')  # 3 equal-width bins
df_imputed['C_binned'] = discretizer.fit_transform(df_imputed[['C']]).ravel()
print("\nDataset after discretization:\n", df_imputed[['C', 'C_binned']])

# 4. Elimination of Outliers (Z-score method)
# Keep rows whose absolute z-score is below 3 in every non-target column.
# Note: with only 8 rows no |z| can reach 3 (the maximum is sqrt(8 - 1), about 2.65),
# so the row with A = 800 is not removed here (see the output below).
z_scores = np.abs(stats.zscore(df_imputed.drop(columns=['Target'])))
df_no_outliers = df_imputed[(z_scores < 3).all(axis=1)]
print("\nDataset after outlier removal:\n", df_no_outliers)

Original Dataset:
       A     B  C  Target
0   10.0   5.0  1       0
1   20.0  15.0  2       1
2   30.0   NaN  3       0
3    NaN  25.0  4       1
4   50.0  35.0  5       0
5   60.0  45.0  6       1
6   70.0  55.0  7       0
7  800.0  65.0  8       1

Selected Features: Index(['A', 'C'], dtype='object')

Dataset after handling missing values:
            A     B    C  Target
0   10.000000   5.0  1.0     0.0
1   20.000000  15.0  2.0     1.0
2   30.000000  35.0  3.0     0.0
3  148.571429  25.0  4.0     1.0
4   50.000000  35.0  5.0     0.0
5   60.000000  45.0  6.0     1.0
6   70.000000  55.0  7.0     0.0
7  800.000000  65.0  8.0     1.0

Dataset after discretization:
     C  C_binned
0  1.0       0.0
1  2.0       0.0
2  3.0       0.0
3  4.0       1.0
4  5.0       1.0
5  6.0       2.0
6  7.0       2.0
7  8.0       2.0

Dataset after outlier removal:
            A     B    C  Target  C_binned
0   10.000000   5.0  1.0     0.0       0.0
1   20.000000  15.0  2.0     1.0       0.0
2   30.000000  35.0  3.0     0.0       0.0
3  148.571429  25.0  4.0     1.0       1.0
4   50.000000  35.0  5.0     0.0       1.0
5   60.000000  45.0  6.0     1.0       2.0
6   70.000000  55.0  7.0     0.0       2.0
7  800.000000  65.0  8.0     1.0       2.0
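
In the output above, the row with A = 800 survives the z-score filter. `scipy.stats.zscore` uses the population standard deviation by default, and under that definition no point among n samples can have an absolute z-score above sqrt(n - 1), which is about 2.65 for n = 8, so a threshold of 3 can never trigger on this dataset. One common alternative for small samples is the interquartile-range (IQR) rule. The sketch below is not part of the notebook; it assumes the conventional 1.5 x IQR fences and rebuilds the imputed column A by hand:

```python
import numpy as np
import pandas as pd

# Column A of the sample dataset after mean imputation
# (1040 / 7 is the imputed mean, 148.571429, shown in the output above)
a = pd.Series([10.0, 20.0, 30.0, 1040 / 7, 50.0, 60.0, 70.0, 800.0], name="A")

# Largest |z| a single point can reach among n samples (population std, ddof=0)
n = len(a)
max_possible_z = np.sqrt(n - 1)
print("Maximum possible |z| for n = 8:", round(max_possible_z, 3))

# IQR rule: keep values inside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
# The 1.5 multiplier is a conventional choice, assumed here for illustration
q1, q3 = a.quantile(0.25), a.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

a_no_outliers = a[a.between(lower, upper)]
print("Fences:", round(lower, 3), round(upper, 3))
print(a_no_outliers)
```

With pandas' default linear interpolation the fences come out at roughly -65.7 and 182.9, so the imputed value 148.57 is kept and 800 is dropped.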

import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer
from scipy import stats

# Load Iris dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

# Introduce a missing value for demonstration (row 2, 'sepal width (cm)')
df.iloc[2, 1] = np.nan

print("Original Dataset:\n", df.head())

# 1. Attribute Selection
X = df.drop(columns=['target'])  # Features
y = df['target']
selector = SelectKBest(score_func=f_classif, k=2)  # Keep the 2 features with the highest ANOVA F-score
X_new = selector.fit_transform(X.fillna(X.mean()), y)
selected_features = X.columns[selector.get_support()]
print("\nSelected Features:", selected_features)

# 2. Handling Missing Values
imputer = SimpleImputer(strategy='mean')  # Replace NaN with column mean
# Impute the feature columns only, then re-attach the integer target
df_imputed = pd.DataFrame(imputer.fit_transform(df.iloc[:, :-1]), columns=df.columns[:-1])
df_imputed['target'] = df['target']
print("\nDataset after handling missing values:\n", df_imputed.head())

# 3. Discretization
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')  # 3 equal-width bins
df_imputed['sepal length (cm)_binned'] = discretizer.fit_transform(df_imputed[['sepal length (cm)']]).ravel()
print("\nDataset after discretization:\n", df_imputed[['sepal length (cm)', 'sepal length (cm)_binned']].head())

# 4. Elimination of Outliers (Z-score method)
# Keep rows whose absolute z-score is below 3 in every non-target column
z_scores = np.abs(stats.zscore(df_imputed.drop(columns=['target'])))
df_no_outliers = df_imputed[(z_scores < 3).all(axis=1)]
print("\nDataset after outlier removal:\n", df_no_outliers.head())

Original Dataset:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               NaN                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

   target
0       0
1       0
2       0
3       0
4       0

Selected Features: Index(['petal length (cm)', 'petal width (cm)'], dtype='object')

Dataset after handling missing values:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1          3.500000                1.4               0.2
1                4.9          3.000000                1.4               0.2
2                4.7          3.056376                1.3               0.2
3                4.6          3.100000                1.5               0.2
4                5.0          3.600000                1.4               0.2

   target
0       0
1       0
2       0
3       0
4       0

Dataset after discretization:
   sepal length (cm)  sepal length (cm)_binned
0                5.1                       0.0
1                4.9                       0.0
2                4.7                       0.0
3                4.6                       0.0
4                5.0                       0.0

Dataset after outlier removal:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1          3.500000                1.4               0.2
1                4.9          3.000000                1.4               0.2
2                4.7          3.056376                1.3               0.2
3                4.6          3.100000                1.5               0.2
4                5.0          3.600000                1.4               0.2

   target  sepal length (cm)_binned
0       0                       0.0
1       0                       0.0
2       0                       0.0
3       0                       0.0
4       0                       0.0
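
To see why the two petal measurements were selected and where the bin boundaries fall, the fitted objects can be inspected. A minimal sketch, assuming the unmodified Iris data (without the injected NaN) and using the `scores_` and `bin_edges_` attributes that scikit-learn sets after fitting:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import KBinsDiscretizer

# Unmodified Iris data (assumption: no missing value injected here)
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# ANOVA F-score per feature; SelectKBest keeps the k highest-scoring ones
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print("F-scores, highest first:\n", scores)

# Bin edges learned by the uniform (equal-width) strategy for sepal length
discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
discretizer.fit(X[["sepal length (cm)"]])
edges = discretizer.bin_edges_[0]
print("Bin edges:", np.round(edges, 2))
```

With three equal-width bins over sepal length (which ranges from 4.3 to 7.9 cm), the edges fall at 4.3, 5.5, 6.7 and 7.9, which is why the first five rows, all below 5.5, land in bin 0.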
