
​ ​ ​ AI Lab Manual

​ ​ ​ ​
​ ​ ​ ​ Session 2023-2027

​ ​ ​ Submitted By:
​ ​ ​ 2023-CS-156 Khola Raouf
​ ​ ​ Supervised By:
​ ​ ​ ​ Mr. Waseem

​ ​ ​ ​ ​ Course:
CSC-371 Artificial Intelligence

Department of Computer Science


University of Engineering and Technology
Lahore, Pakistan

Contents
Week # 9
Question No. 1
1.1. Importing Libraries
1.2. Load the Dataset
1.3. Data Preprocessing
1.4. Model Initialization
1.5. Model Prediction
1.6. Visualizing Results
1.7. Evaluating the Model
1.8. Final Output
1.9. Full Code
1.10. Model Scores
Question No. 2
2.1. Importing Libraries
2.2. Load Dataset
2.3. Data Preprocessing
2.4. Model Initialization
2.5. Model Training
2.6. Model Prediction
2.7. Visualizing Results
2.8. Model Evaluation
2.9. Final Output
2.10. Full Code
Question No. 03
3.1. Evaluation Metrics for Classification Models
(a) Accuracy
(b) Precision
(c) Recall (Sensitivity or True Positive Rate)
(d) F1-Score
(e) Confusion Matrix
(f) ROC Curve & AUC (Area Under Curve)
3.2. Evaluation Metrics for Regression Models
(a) Mean Absolute Error (MAE)
(b) Mean Squared Error (MSE)
(c) Root Mean Squared Error (RMSE)
(d) R-squared (R² Score)
3.3. Evaluation Metrics for Clustering Models
(a) Silhouette Score
(b) Davies-Bouldin Index
(c) Dunn Index
Conclusion
Question No. 04
4. Data Preprocessing Techniques
4.1. Handling Missing or Null Values
4.2. Handling Duplicate Data
4.3. Handling Outliers
4.4. Encoding Categorical Data
4.5. Feature Scaling (Normalization & Standardization)
4.6. Feature Selection & Engineering
4.7. Splitting Data into Training and Testing Sets
Final Summary Table


Week # 9
Question No. 1
Multiple Linear Regression
1.1.​ Importing Libraries:
First, the required Python libraries are imported for data handling (pandas, numpy),
visualization (matplotlib, seaborn), and machine learning (scikit-learn).
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
No output; libraries are just being imported.

1.2.​ Load the Dataset:


Load the dataset from a CSV file and display
the first few rows.
Code:

df = pd.read_csv("student_scores.csv")  # Update with the correct file name
print(df.head())  # Display first few rows
The output shows the first few rows of the dataset, for example:

Hours_Studied  Previous_Scores  Extracurricular  Final_Score
          5.1               78                2           82
          6.3               85                3           88
          4.8               75                1           80
          7.2               90                4           92
          5.5               79                2           83

1.3.​ Data Preprocessing:


1.3.1.​ Checking For Missing Values:
Ensure that there are no missing values in the
dataset.
Code:
print(df.isnull().sum())
A summary of missing values per column, typically:

Hours_Studied      0
Previous_Scores    0
Extracurricular    0
Final_Score        0
dtype: int64

If there were missing values, we would handle them using methods like .fillna() or .dropna(), as sketched below.
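For illustration, a minimal sketch of both options, assuming the same df and the column names shown above:

Code:
# Option 1: drop any rows that still contain missing values
df = df.dropna()

# Option 2: fill missing values in a numeric column with that column's mean
df['Hours_Studied'] = df['Hours_Studied'].fillna(df['Hours_Studied'].mean())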

1.3.2.​ Checking For Data type:


Identify the data types of each
column to ensure compatibility with the machine learning model.
Code:
print(df.dtypes)

Hours_Studied      float64
Previous_Scores      int64
Extracurricular      int64
Final_Score          int64
dtype: object

If data types are incorrect, we may need to convert them using .astype(), as in the sketch below.
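For example, a column that was read with the wrong type could be converted like this (a hypothetical snippet; here Final_Score is assumed to have been read as strings):

Code:
# Convert a column read as object/string into an integer type
df['Final_Score'] = df['Final_Score'].astype(int)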

1.3.3.​ Select Features and Target Variable:


●​ X: Contains independent variables (features).
●​ y: Contains the dependent variable (Final_Score).
Code:
X = df[['Hours_Studied', 'Previous_Scores', 'Extracurricular']]

y = df['Final_Score']
No direct output, but X and y now store data for model training.

1.3.4.​ Split Data into Training and Testing Sets:


Splits the data into 80% training data and 20% testing data.

Code:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

No direct output, but now:

●​ X_train, y_train → Used for training


●​ X_test, y_test → Used for testing

1.4.​ Model Initialization:


Initialize a Linear Regression model and train it on X_train and y_train.

Code:
model = LinearRegression()
model.fit(X_train, y_train)
No direct output, but the model has now learned the relationship
between features and Final_Score.


1.5.​ Model Prediction:


Use the trained model to make predictions on the test set.

Code:
y_pred = model.predict(X_test)
No direct output, but y_pred contains predicted final scores for
X_test.

1.6.​ Visualizing Results:


●​ Compares actual vs. predicted values in a scatter plot.
●​ Ideally, the points should be close to a straight diagonal line.
Code:
plt.scatter(y_test, y_pred, color='blue')
plt.xlabel("Actual Scores")
plt.ylabel("Predicted Scores")
plt.title("Actual vs Predicted Scores")
plt.show()

A scatter plot showing how well the model's predictions match the
actual scores.

1.7.​ Evaluating the Model:


Evaluate model performance using:


●​ Mean Squared Error (MSE): Measures the error magnitude (lower is better).
●​ R² Score: Measures goodness of fit (closer to 1 is better).

Code:
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")


print(f"R-squared Score: {r2}")

1.8.​ Final Output:


1.9.​ Full Code:


import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error, r2_score

# Load dataset

df = pd.read_csv("Student_Performance.csv") # Ensure the correct file name

print("First few rows of the dataset:")

print(df.head()) # Display first few rows

# Check for missing values

print("\nMissing values in the dataset:")

print(df.isnull().sum())

# Check data types

print("\nData types of each column:")

print(df.dtypes)

# Convert categorical data to numerical (if needed)

df['Extracurricular Activities'] = df['Extracurricular Activities'].map({'Yes': 1, 'No': 0})

# Selecting features (independent variables) and target variable

X = df[['Hours Studied', 'Previous Scores', 'Extracurricular Activities', 'Sleep Hours', 'Sample Question Papers Practiced']]

y = df['Performance Index']

# Split dataset into training (80%) and testing (20%) sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the model

model = LinearRegression()

model.fit(X_train, y_train)

# Predict on test data

y_pred = model.predict(X_test)

# Visualizing results: Actual vs Predicted scores

plt.scatter(y_test, y_pred, color='blue')

plt.xlabel("Actual Performance Index")

plt.ylabel("Predicted Performance Index")

plt.title("Actual vs Predicted Performance Index")

plt.show()

# Model evaluation

mse = mean_squared_error(y_test, y_pred)

r2 = r2_score(y_test, y_pred)

print(f"\nMean Squared Error: {mse}")

print(f"R-squared Score: {r2}")

1.10.​ Model Scores:


PS E:\study\semester 4\AI\week 9> python task1.py

First few rows of the dataset:

   Hours Studied  Previous Scores  Extracurricular Activities  Sleep Hours  Sample Question Papers Practiced  Performance Index

0 7 99 Yes 9 1 91.0

1 4 82 No 4 2 65.0

2 8 51 Yes 7 2 45.0

3 5 52 Yes 5 2 36.0

4 7 75 No 8 5 66.0

Missing values in the dataset:

Hours Studied 0

Previous Scores 0

Extracurricular Activities 0

Sleep Hours 0

Sample Question Papers Practiced 0

Performance Index 0

dtype: int64

Hours Studied int64

Previous Scores int64

Extracurricular Activities object

Sleep Hours int64

Sample Question Papers Practiced int64

Performance Index float64

dtype: object


Mean Squared Error: 4.082628398521853

R-squared Score: 0.9889832909573145

Question No. 2
2.​ Logistic Regression
2.1.​ Importing Libraries:
●​ Essential Python libraries for data handling, visualization, and machine learning
are imported.

Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

Output:
No direct output, but the libraries are loaded successfully.

2.2.​ Load Dataset:


●​ Reads the dataset into a Pandas DataFrame and displays the first few rows.

Code:
df = pd.read_csv("student_logistic_regression.csv")  # Update with the actual file name
print(df.head())  # Display first few rows

Output:
A table displaying the first five rows of the dataset, similar to:

 User ID   Gender  Age  EstimatedSalary  Purchased
15624510     Male   19            19000          0
15810944     Male   35            20000          0
15668575   Female   26            43000          0
15603246   Female   27            57000          0
15804002     Male   19            76000          0

2.3.​ Data Preprocessing:


●​ Drops unnecessary columns and converts categorical data into numerical values.
●​ Feature scaling is applied later, in the model-training step (Section 2.5), for better model performance.

Code:
df = df.drop(columns=["User ID"]) # Remove irrelevant column
df = pd.get_dummies(df, columns=["Gender"], drop_first=True)  # Convert Gender to numeric
Output:
No direct output, but "User ID" is removed, and "Gender" is converted into a numeric column
(e.g., Gender_Male = 1 for Male, 0 for Female).

2.4.​ Model Initialization:


●​ Defines features (independent variables) and the target variable.
●​ Splits data into training and testing sets (80% training, 20% testing).

Code:
X = df.drop("Purchased", axis=1) # Features
y = df["Purchased"] # Target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Output:
No direct output, but the dataset is split into training and testing sets.

2.5.​ Model Training:


●​ Standardizes feature values for improved performance.
●​ Initializes and trains the Logistic Regression model.


Code:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = LogisticRegression()
model.fit(X_train, y_train)
Output:
No direct output, but the Logistic Regression model is successfully trained on the dataset.

2.6.​ Model Prediction:


●​ Predicts outcomes using the trained model on the test dataset.

Code:
y_pred = model.predict(X_test)
Output:
No direct output, but y_pred contains the predicted values.

2.7.​ Visualizing Results:


●​ Compares actual vs. predicted values in a confusion matrix.
●​ The confusion matrix shows how well the model classified the test data.

Code:
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
Output:
A heatmap of the confusion matrix similar to:

Actual \ Predicted      0    1
0 (No Purchase)        18    2
1 (Purchase)            3    7


2.8.​ Model Evaluation:


●​ Displays accuracy, precision, recall, and F1-score of the model.

Code:
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))
Output:
A classification report and accuracy score, for example:
Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.90      0.88        20
           1       0.78      0.70      0.74        10

    accuracy                           0.83        30
   macro avg       0.82      0.80      0.81        30
weighted avg       0.83      0.83      0.83        30

Accuracy Score: 0.83


2.9.​ Final Output:

2.10.​ Full code:


# Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Load Dataset
df = pd.read_csv("Social_Network_Ads.csv")
print(df.head())  # Display first few rows

# Data Preprocessing
df = df.dropna()  # Remove missing values
df = df.drop(columns=["User ID"])  # Remove irrelevant column (as in Section 2.3)
df = pd.get_dummies(df, drop_first=True)  # Convert categorical variables to numeric

# Define Features and Target
X = df.drop("Purchased", axis=1)  # Replace with actual target column name
y = df["Purchased"]

# Split Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Model Initialization and Training
model = LogisticRegression()
model.fit(X_train, y_train)

# Model Prediction
y_pred = model.predict(X_test)

# Evaluating the Model
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))

# Confusion Matrix Visualization
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap="Blues")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()


Question No. 03
3.​ What are the evaluation metrics for a machine learning model?

Evaluation metrics are used to assess the performance of a machine learning model. The choice
of metric depends on the type of problem: classification, regression, or clustering.

3.1.​ Evaluation Metrics for Classification Model:


These metrics measure how well a model classifies data points into categories.

(a) Accuracy

●​ Measures the percentage of correctly predicted instances.


●​ Formula: $\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}}$
●​ Best for: Balanced datasets (when classes are equally distributed).

(b) Precision

●​ Measures how many predicted positive instances are actually positive.


●​ Formula: $\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$
●​ Best for: When false positives need to be minimized (e.g., spam detection).

(c) Recall (Sensitivity or True Positive Rate)

●​ Measures how many actual positive instances are correctly identified.


●​ Formula: $\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$
●​ Best for: When false negatives need to be minimized (e.g., medical diagnoses).

(d) F1-Score

●​ Harmonic mean of Precision and Recall, providing a balance.


●​ Formula: $F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
●​ Best for: Imbalanced datasets.


(e) Confusion Matrix

●​ A table that summarizes correct and incorrect predictions.


●​ Example:
Actual \ Predicted    Positive (1)          Negative (0)
Positive (1)          True Positive (TP)    False Negative (FN)
Negative (0)          False Positive (FP)   True Negative (TN)

(f) ROC Curve & AUC (Area Under Curve)

●​ ROC Curve: Plots True Positive Rate (Recall) vs. False Positive Rate.
●​ AUC Score: Measures the area under the ROC curve (closer to 1 is better).
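For illustration, these classification metrics can all be computed with scikit-learn. This is only a sketch, assuming y_test, y_pred, model, and X_test from the logistic regression task in Question No. 2:

Code:
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score)

# Predicted probabilities for the positive class (needed for ROC AUC)
y_prob = model.predict_proba(X_test)[:, 1]

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1-Score :", f1_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_prob))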
3.2.​ Evaluation Metrics for Regression Models
These metrics evaluate how well a model predicts continuous values.

(a) Mean Absolute Error (MAE)

●​ Measures the average absolute difference between actual and predicted values.
●​ Formula: $MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$
●​ Best for: When all errors are treated equally.
(b) Mean Squared Error (MSE)

●​ Measures the average squared difference between actual and predicted values.
●​ Formula: $MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
●​ Best for: Penalizing large errors more.

(c) Root Mean Squared Error (RMSE)

●​ Square root of MSE, providing an error in the same units as the target variable.
●​ Formula: $RMSE = \sqrt{MSE}$
●​ Best for: Interpretable errors in real-world scenarios.


(d) R-squared ($R^2$ Score)

●​ Measures how well the model explains variance in the data (0 to 1).
●​ Formula: $R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$
●​ Best for: Checking model goodness-of-fit (higher is better).
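As an illustration, all four regression metrics can be computed in a few lines. This sketch assumes y_test and y_pred from the multiple linear regression task in Question No. 1:

Code:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE
r2 = r2_score(y_test, y_pred)

print(f"MAE: {mae}, MSE: {mse}, RMSE: {rmse}, R²: {r2}")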

3.3.​ Evaluation Metrics for Clustering Models


These metrics evaluate unsupervised learning models like K-Means or DBSCAN.

(a) Silhouette Score

●​ Measures how well clusters are separated.


●​ Range: -1 to 1 (higher is better).
(b) Davies-Bouldin Index

●​ Measures the compactness and separation of clusters.


●​ Lower values indicate better clustering.
(c) Dunn Index

●​ Measures the ratio of minimum inter-cluster distance to maximum intra-cluster


distance.
●​ Higher values indicate better clustering.
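For illustration, scikit-learn provides the Silhouette Score and Davies-Bouldin Index directly (the Dunn Index is not included and would need a separate implementation). A minimal sketch on toy data, assuming K-Means clustering:

Code:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Toy data and clustering purely for illustration
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

print("Silhouette Score    :", silhouette_score(X, labels))      # closer to 1 is better
print("Davies-Bouldin Index:", davies_bouldin_score(X, labels))  # lower is better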
Conclusion
Choosing the right evaluation metric depends on the problem type:

●​ Classification: Accuracy, Precision, Recall, F1-score, Confusion Matrix, AUC-ROC
●​ Regression: MAE, MSE, RMSE, R-squared
●​ Clustering: Silhouette Score, Davies-Bouldin Index


Question No. 04
What are different methods or techniques used in Data Preprocessing?
(For example, how should missing or null values in a dataset be handled?)

4.​ Data Preprocessing Techniques


Data preprocessing is a crucial step in machine learning that involves cleaning, transforming, and
preparing raw data for modeling. Below are the key methods used in data preprocessing:

4.1.​ Handling Missing or Null Values


Missing values in a dataset can cause issues in model training. There are several ways to handle
them:
(a) Removing Missing Values

●​ Method: Drop rows or columns that contain missing values.
●​ Use When: The dataset is large, and missing values are minimal.
●​ Code Example (Pandas):
df.dropna(inplace=True)  # Removes rows with missing values
df.drop(columns=['ColumnName'], inplace=True)  # Removes a column with missing values
(b) Imputing Missing Values

●​ Method: Fill missing values with statistical measures (mean, median, mode) or interpolation.
●​ Use When: The dataset is small, and removing data is not ideal.
●​ Code Example:
df['ColumnName'].fillna(df['ColumnName'].mean(), inplace=True)    # Fill with mean
df['ColumnName'].fillna(df['ColumnName'].median(), inplace=True)  # Fill with median
df['ColumnName'].fillna(df['ColumnName'].mode()[0], inplace=True) # Fill with mode
(c) Using Machine Learning for Imputation

●​ Method: Predict missing values using K-Nearest Neighbors (KNN), Linear Regression, etc.
●​ Code Example (KNN Imputer):
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=3)
df[['Column1', 'Column2']] = imputer.fit_transform(df[['Column1', 'Column2']])

4.2.​ Handling Duplicate Data


Duplicate data can cause bias in the model.


●​ Remove Duplicates:
df.drop_duplicates(inplace=True)

4.3.​ Handling Outliers


Outliers are extreme values that can distort model performance.
(a) Using the IQR (Interquartile Range) Method

●​ Removes values beyond 1.5 times the IQR.
●​ Code Example:
Q1 = df['ColumnName'].quantile(0.25)
Q3 = df['ColumnName'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['ColumnName'] >= Q1 - 1.5 * IQR) & (df['ColumnName'] <= Q3 + 1.5 * IQR)]
(b) Using Z-Score Method

●​ Removes values that are more than a certain number of standard deviations away (commonly 3).
●​ Code Example:
import numpy as np
from scipy import stats
df = df[np.abs(stats.zscore(df['ColumnName'])) < 3]

4.4.​ Encoding Categorical Data


Categorical variables must be converted into numerical form for machine learning.
(a) Label Encoding (For Binary Categories)

●​ Converts categorical values into 0s and 1s.


●​ Example:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['CategoryColumn'] = encoder.fit_transform(df['CategoryColumn'])
(b) One-Hot Encoding (For Multiple Categories)

●​ Creates separate binary columns for each category.


●​ Example:
df = pd.get_dummies(df, columns=['CategoryColumn'], drop_first=True)
(c) Ordinal Encoding (For Ordered Categories)

●​ Assigns numerical values based on order (e.g., Low = 1, Medium = 2, High = 3).
●​ Example:
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
df['CategoryColumn'] = encoder.fit_transform(df[['CategoryColumn']])


4.5.​ Feature Scaling (Normalization & Standardization)


Scaling is used to ensure that all features contribute equally to the model.
(a) Min-Max Scaling (Normalization)

●​ Scales values between 0 and 1.


●​ Use When: The data has no outliers.
●​ Example:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['Column1', 'Column2']] = scaler.fit_transform(df[['Column1', 'Column2']])
(b) Standardization (Z-Score Normalization)

●​ Scales values to have mean = 0 and standard deviation = 1.


●​ Use When: The data has outliers.
●​ Example:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['Column1', 'Column2']] = scaler.fit_transform(df[['Column1', 'Column2']])

4.6.​ Feature Selection & Engineering


Selecting the most important features improves model performance.
(a) Removing Irrelevant Features

●​ Drop unnecessary columns:
df.drop(columns=['UnnecessaryColumn'], inplace=True)
(b) Using Correlation Matrix

●​ Identifies highly correlated features (multicollinearity) so that redundant ones can be removed.
●​ Example:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.show()
(c) Using Feature Importance (Random Forest)

●​ Identifies the most important features.


●​ Example:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X, y)
feature_importances = pd.Series(model.feature_importances_, index=X.columns)
feature_importances.nlargest(5).plot(kind='barh')

4.7.​ Splitting Data into Training and Testing Sets


To evaluate a model, data should be split into training and testing sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

Final Summary Table


Preprocessing Step          Methods
Handling Missing Values     Drop missing values, fill with mean/median/mode, KNN Imputer
Handling Duplicates         df.drop_duplicates()
Handling Outliers           IQR method, Z-score method
Encoding Categorical Data   Label Encoding, One-Hot Encoding, Ordinal Encoding
Feature Scaling             Min-Max Scaling, Standardization
Feature Selection           Correlation matrix, Feature importance (Random Forest)
Data Splitting              train_test_split() for training/testing

