2. ML Lab Record

The document outlines the structure and requirements for the Machine Learning Laboratory course at Sri Eshwar College of Engineering. It includes a bonafide certificate for students, a table of contents for experiments, and detailed descriptions of various machine learning concepts and coding exercises. The document emphasizes the importance of feature engineering, data preprocessing, and model selection in enhancing the performance of machine learning models.


SRI ESHWAR COLLEGE OF ENGINEERING
KINATHUKADAVU, COIMBATORE – 641 202

DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

R19AM311 - MACHINE LEARNING LABORATORY

BONAFIDE CERTIFICATE

Certified that this is the bonafide record of work done by

Name: Mr. /Ms. ...…………………………………………………………………………………….

Register No: ……………………………………………………………………………... of 3rd Year

B. Tech. – Artificial Intelligence and Data Science in the R19AM311 - MACHINE LEARNING

LABORATORY during the 5th Semester of the academic year 2024 – 2025 (Odd Semester).

Signature of Faculty In-Charge Head of the Department

Submitted for the end semester practical examinations held on ……………………….

Internal Examiner External Examiner


Table of Contents

S. No. | Date | Name of the Experiment | Page Number | Marks (50) | Signature of the Faculty Member

Average Marks:

Signature of the Faculty
Experiment 1
Problem Statement
Set up Python environment with libraries like NumPy, Pandas, and Matplotlib. Introduce Jupyter Notebooks for
interactive coding. Perform basic operations and data manipulations using NumPy and Pandas. Visualize data
distributions and relationships with Matplotlib

Terminology


NumPy: A Python library used for working with arrays, offering powerful mathematical functions.
Pandas: A Python library used for data manipulation and analysis. It provides data structures like Series (1D)
and DataFrames (2D).
Matplotlib: A plotting library used for data visualization.
Jupyter Notebook: An interactive environment for writing and executing code, where you can combine code,
visualizations, and narrative.
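
To make the Series/DataFrame distinction above concrete, a minimal sketch (the values and labels here are arbitrary):

import pandas as pd

# A Series is a labelled one-dimensional array
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b'])    # 20

# A DataFrame is a two-dimensional table; each column is itself a Series
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})
print(df['x'])   # the 'x' column as a Series
print(df.shape)  # (3, 2)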

Theory


NumPy is primarily used for array-based computing, offering efficient multi-dimensional arrays (ndarray).
Pandas allows easier manipulation of tabular data through its Series and DataFrame structures.
Matplotlib is essential for plotting various types of graphs for data visualization.
Jupyter Notebooks provide an easy-to-use interface that makes interactive coding, plotting, and analysis
accessible.

Code


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

#1
array = np.array([1, 2, 3, 4, 5])
squared_array = np.square(array)
mean_value = np.mean(array)
print("Original Array:", array)
print("Squared Array:", squared_array)
print("Mean Value:", mean_value)

#2
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [24, 27, 22, 32],
'Score': [85, 88, 92, 79]}

df = pd.DataFrame(data)
print(df)

mean_age = df['Age'].mean()
print("Mean Age:", mean_age)
filtered_data = df[df['Score'] > 85]
print(filtered_data)

#3
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.figure(figsize=(8, 6))
plt.plot(x, y, label='Sine Wave')
plt.title('Sine Function')
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.legend()
plt.show()

Input


#1
array = np.array([1, 2, 3, 4, 5])

#2
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [24, 27, 22, 32],
'Score': [85, 88, 92, 79]}
df = pd.DataFrame(data)

#3
x = np.linspace(0, 10, 100)
y = np.sin(x)

Output


#1
Original Array: [1 2 3 4 5]
Squared Array: [ 1 4 9 16 25]
Mean Value: 3.0

#2
Name Age Score
0 Alice 24 85
1 Bob 27 88
2 Charlie 22 92
3 David 32 79
Mean Age: 26.25
Name Age Score
1 Bob 27 88
2 Charlie 22 92
Conclusion
Using the above code, the program executed successfully and the output was verified.

Experiment 2
Problem Statement
Handle missing data using imputation techniques. Remove outliers and understand their impact on
models. Standardize or normalize numerical features. Encode categorical variables using techniques
like one-hot encoding.

Terminology


Imputation: Filling missing data with substituted values (e.g., mean, median).
Outliers: Data points that are significantly different from others and can distort analysis.
Standardization/Normalization: Transforming features to a common scale. Standardization centers data around the mean with unit standard deviation, while normalization scales values to a range like [0, 1].
One-Hot Encoding: Converting categorical variables into binary vectors for use in models.

Theory


Handling Missing Data
 Missing data can lead to bias and reduce the efficiency of a model.
 Imputation techniques include filling missing values with the mean, median, mode, or more advanced
methods like K-Nearest Neighbors.
Removing Outliers
 Outliers can skew your dataset and affect model performance. Identifying and removing them helps
improve model robustness.
 Common methods include using z-scores or the IQR method (Interquartile Range).
Standardization and Normalization
 Scaling features ensures that no feature dominates due to its scale.
 Standardization: Scales the data to have a mean of 0 and standard deviation of 1.
 Normalization: Scales the data to a [0, 1] range, often useful for algorithms like KNN or neural networks.
Encoding Categorical Variables
 Machine learning algorithms require numerical inputs. Encoding techniques like One-Hot Encoding
convert categorical variables into binary (0/1) columns.
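
A minimal sketch tying these four steps together on a toy column (assuming scikit-learn is installed; the 3-standard-deviation cut-off is a common convention, not a rule):

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

toy = pd.DataFrame({'income': [25000, 32000, np.nan, 41000, 38000]})

# Imputation: replace the missing value with the column mean
toy['income'] = toy['income'].fillna(toy['income'].mean())

# Outlier removal via z-scores: keep rows within 3 standard deviations
z = (toy['income'] - toy['income'].mean()) / toy['income'].std()
toy = toy[z.abs() < 3]

# Standardization (mean 0, std 1) vs normalization (range [0, 1])
standardized = StandardScaler().fit_transform(toy[['income']])
normalized = MinMaxScaler().fit_transform(toy[['income']])

# One-hot encoding of a categorical column
area = pd.get_dummies(pd.Series(['Urban', 'Rural', 'Urban']), prefix='area')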

Code


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import missingno as mso
import seaborn as sns
import warnings
import os
from scipy import stats

df = pd.read_csv("loan.csv")
df.head()

df.describe()

numerical_df = df.select_dtypes(include=['float64', 'int64'])


plt.figure(figsize=(8, 6))
corr_matrix = numerical_df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='inferno', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()

pd.crosstab(df.Gender,df.Married).plot(kind="bar", stacked=True, figsize=(5,5), color=['#f64f59','#12c2e9'])


plt.title('Gender vs Married')
plt.xlabel('Gender')
plt.ylabel('Frequency')
plt.xticks(rotation=0)
plt.show()

pd.crosstab(df.Self_Employed, df.Credit_History).plot(kind="bar", stacked=True, figsize=(5,5),
                                                      color=['#544a7d','#ffd452'])
plt.title('Self Employed vs Credit History')
plt.xlabel('Self Employed')
plt.ylabel('Frequency')
plt.legend(["Bad Credit", "Good Credit"])
plt.xticks(rotation=0)
plt.show()

pd.crosstab(df.Property_Area, df.Loan_Status).plot(kind="bar", stacked=True, figsize=(5,5),
                                                   color=['#333333','#dd1818'])
plt.title('Property Area vs Loan Status')
plt.xlabel('Property Area')
plt.ylabel('Frequency')
plt.xticks(rotation=0)

df.plot(x='ApplicantIncome', y='CoapplicantIncome', style='o')
plt.title('Applicant Income - Co Applicant Income')
plt.xlabel('ApplicantIncome')
plt.ylabel('CoapplicantIncome')
print('Pearson correlation:', df['ApplicantIncome'].corr(df['CoapplicantIncome']))
print('T Test and P value: \n', stats.ttest_ind(df['ApplicantIncome'], df['CoapplicantIncome']))

df.isnull().sum()
df['LoanAmount'].fillna(df['LoanAmount'].mean(),inplace=True)
df = df.drop(['Loan_ID'], axis = 1)

df['Gender'].fillna(df['Gender'].mode()[0],inplace=True)
df['Married'].fillna(df['Married'].mode()[0],inplace=True)
df['Dependents'].fillna(df['Dependents'].mode()[0],inplace=True)
df['Self_Employed'].fillna(df['Self_Employed'].mode()[0],inplace=True)
df['Credit_History'].fillna(df['Credit_History'].mode()[0],inplace=True)
df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mode()[0],inplace=True)

df.isnull().sum()
numerical_df = df.select_dtypes(include=['float64', 'int64'])
Q1 = numerical_df.quantile(0.25)
Q3 = numerical_df.quantile(0.75)
IQR = Q3 - Q1
df_cleaned = numerical_df[~((numerical_df < (Q1 - 1.5 * IQR)) | (numerical_df > (Q3 + 1.5 *
IQR))).any(axis=1)]
print(df_cleaned)

df.ApplicantIncome = np.sqrt(df.ApplicantIncome)
df.CoapplicantIncome = np.sqrt(df.CoapplicantIncome)
df.LoanAmount = np.sqrt(df.LoanAmount)

sns.set(style="darkgrid")
fig, axs = plt.subplots(2, 2, figsize=(10, 8))
sns.histplot(data=df, x="ApplicantIncome", kde=True, ax=axs[0, 0], color='green')
sns.histplot(data=df, x="CoapplicantIncome", kde=True, ax=axs[0, 1], color='skyblue')
sns.histplot(data=df, x="LoanAmount", kde=True, ax=axs[1, 0], color='orange');

Input


"loantrain.csv": https://drive.google.com/file/d/1L6BAmRcOXvpTd8fTitueEKY3oJCKl2eN/view?usp=sharing
“loantest.csv”: https://drive.google.com/file/d/1JGzJHmDeei1U-Xk1d2kyqQu1NtRSmGJd/view?usp=sharing

Output



Conclusion


Thus, the study has been completed successfully.

Experiment 3
Problem Statement
In many machine learning applications, the performance of a model is heavily influenced by the quality of the
input features. Proper feature engineering can significantly enhance the predictive power of a model.
Additionally, choosing the right model for the task is crucial for achieving optimal results. In this activity,
students will experiment with various feature engineering techniques and compare the performance of different
machine learning models using cross-validation. The goal is to optimize both the feature set and the model
selection to improve the accuracy of predictions.

Terminology


Feature Engineering: The process of transforming raw data into features that better represent the underlying
problem, thereby improving model accuracy.
 Feature Scaling: Techniques like Min-Max Scaling or Standardization used to normalize the range of
features.
 Feature Encoding: Transforming categorical variables into numerical form (e.g., One-Hot Encoding,
Label Encoding).
 Feature Selection: The process of selecting a subset of relevant features for model building.
Model Selection: The process of selecting the machine learning model that best fits the problem based on
performance metrics.
 Decision Tree: A non-parametric supervised learning algorithm used for classification and regression
tasks.
 Support Vector Machine (SVM): A powerful supervised learning algorithm used for both
classification and regression tasks by finding the hyperplane that best separates the data points.
Cross-Validation: A statistical method used to estimate the performance of machine learning models by
partitioning data into training and test subsets and iterating through them (e.g., K-Fold Cross-Validation).
 Overfitting: When a model performs well on training data but poorly on new, unseen data.
 Underfitting: When a model is too simple to capture the underlying data patterns, resulting in poor
performance both on training and new data.

Theory


Feature Engineering:
 Importance of Feature Engineering: Well-engineered features help machine learning models
understand patterns in the data better, leading to improved accuracy and generalization.
Common methods include transforming categorical features into numerical values, scaling
numerical data, handling missing values, and creating new features from existing data.
 Common Techniques:
o Normalization/Standardization: Ensuring that features with different scales (e.g., income
in dollars, age in years) don’t disproportionately affect the model.
o Encoding Categorical Data: Converting categories (e.g., ‘Male’, ‘Female’) into
numerical values for machine learning algorithms.
o Polynomial Features: Creating new features that represent non-linear relationships
between variables.
o Dimensionality Reduction: Techniques like PCA (Principal Component Analysis)
reduce the number of features while retaining important information.
Model Selection:
 Understanding Model Trade-offs: Different models (e.g., Decision Trees, Support Vector
Machines, etc.) have unique strengths and weaknesses. A decision tree might be easy to interpret
but prone to overfitting, while an SVM could be highly accurate but harder to interpret.

 Criteria for Model Selection:


o Bias-Variance Tradeoff: High-bias models tend to underfit the data, while high-variance
models are prone to overfitting. The goal is to balance these to achieve optimal
performance.
o Cross-Validation: Cross-validation helps in assessing how well a model generalizes to
unseen data. By splitting the dataset into multiple folds, each fold is used once as a test
set, while the rest are used for training. This reduces bias and variance in performance
estimation.
o Hyperparameter Tuning: Parameters that govern the model (e.g., depth of a decision tree,
kernel type in an SVM) need to be optimized using methods like grid search or random
search to improve performance.
Performance Metrics:
 Accuracy: The fraction of correctly predicted instances.
 Precision, Recall, F1-Score: Metrics that are especially useful for imbalanced datasets.
 ROC-AUC Score: A metric for evaluating the performance of classification models.
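
Hyperparameter tuning and cross-validation combine naturally in scikit-learn's GridSearchCV. A hedged sketch (the parameter grid is illustrative; X_train and y_train are assumed to come from the train/test split in the code below):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 10], 'kernel': ['rbf', 'linear']}
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)                  # 5-fold CV for every (C, kernel) pair
print(grid.best_params_, grid.best_score_)  # best combination and its mean CV accuracy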

Code


# Assumes df is the preprocessed loan DataFrame from Experiment 2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

df = pd.get_dummies(df)
df = df.drop(['Gender_Female', 'Married_No', 'Education_Not Graduate',
'Self_Employed_No', 'Loan_Status_N'], axis = 1)
new = {'Gender_Male': 'Gender', 'Married_Yes': 'Married',
'Education_Graduate': 'Education', 'Self_Employed_Yes': 'Self_Employed',
'Loan_Status_Y': 'Loan_Status'}
df.rename(columns=new, inplace=True)
X = df.drop(["Loan_Status"], axis=1)
y = df["Loan_Status"]

X, y = SMOTE().fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

#Logistic Regression :
LRclassifier = LogisticRegression(solver='saga', max_iter=500, random_state=1)
LRclassifier.fit(X_train, y_train)
y_pred = LRclassifier.predict(X_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

LRAcc = accuracy_score(y_test, y_pred)
print('LR accuracy: {:.2f}%'.format(LRAcc*100))
LRcv_scores = cross_val_score(LRclassifier, X_train, y_train, cv=5)
print("Logistic Regression CV Scores:", LRcv_scores)
#SVC :
SVCclassifier = SVC(kernel='rbf', max_iter=500)
SVCclassifier.fit(X_train, y_train)
y_pred = SVCclassifier.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

SVCAcc = accuracy_score(y_test, y_pred)
print('SVC accuracy: {:.2f}%'.format(SVCAcc*100))
SVCcv_scores = cross_val_score(SVCclassifier, X_train, y_train, cv=5)
print("SVC CV Scores:", SVCcv_scores)

# Decision Tree :
scoreListDT = []
for i in range(2, 21):
    DTclassifier = DecisionTreeClassifier(max_leaf_nodes=i)
    DTclassifier.fit(X_train, y_train)
    scoreListDT.append(DTclassifier.score(X_test, y_test))
DTcv_scores = cross_val_score(DTclassifier, X_train, y_train, cv=5)

plt.plot(range(2,21), scoreListDT)
plt.xticks(np.arange(2,21,1))
plt.xlabel("Leaf")
plt.ylabel("Score")
plt.show()
DTAcc = max(scoreListDT)
print("Decision Tree Accuracy: {:.2f}%".format(DTAcc*100))
print("Decision Tree CV Scores:", DTcv_scores)

# Random Forest :
scoreListRF = []
for i in range(2, 25):
    RFclassifier = RandomForestClassifier(n_estimators=1000, random_state=1, max_leaf_nodes=i)
    RFclassifier.fit(X_train, y_train)
    scoreListRF.append(RFclassifier.score(X_test, y_test))
RFcv_scores = cross_val_score(RFclassifier, X_train, y_train, cv=5)

plt.plot(range(2,25), scoreListRF)
plt.xticks(np.arange(2,25,1))
plt.xlabel("RF Value")
plt.ylabel("Score")
plt.show()
RFAcc = max(scoreListRF)
print("Random Forest Accuracy: {:.2f}%".format(RFAcc*100))
print("Random Forest CV Scores:", RFcv_scores)

results = {
    'Model Name': ['Logistic Regression', 'SVC', 'Decision Tree', 'Random Forest'],
    'Test Accuracy (%)': [LRAcc*100, SVCAcc*100, DTAcc*100, RFAcc*100]
}

df1 = pd.DataFrame(results)
print(df1)

Input


Loan.csv: https://drive.google.com/file/d/1JGzJHmDeei1U-Xk1d2kyqQu1NtRSmGJd/view?usp=sharing

Output



Pearson correlation: -0.11660458122889966

Conclusion


Thus, the study has been completed successfully.

Experiment 4
Problem Statement
To use Linear Regression to predict house price.

Terminology


Linear Regression: A method to model the linear relationship between two variables by fitting a straight line.
Independent Variable (X): The input variable (or feature) used to predict the output.
Dependent Variable (Y): The output or target variable that we are trying to predict.
Slope (m): The rate at which Y changes with respect to X.
Intercept (b): The value of Y when X=0, where the line crosses the Y-axis.
Prediction: The output generated by the regression model for given input values.
Residual/Error: The difference between the actual and predicted values of Y.

Theory


Linear regression is a supervised learning algorithm used for predicting a continuous target variable based on
one or more input features. It models the relationship between the independent variable X and the dependent
variable Y by fitting a linear equation to the observed data. The goal is to find the line of best fit that minimizes
the error between predicted and actual values. The general form of the linear regression equation is:
Y = β0 + β1X1 + β2X2 + ⋯ + βnXn + ε
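
For a single feature, the least-squares estimates have a closed form: β1 = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and β0 = ȳ − β1·x̄. A minimal from-scratch sketch on toy data (the numbers are made up for illustration):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
y_pred = intercept + slope * x   # predictions from the fitted line
print(slope, intercept)          # roughly 1.94 and 0.3 for this data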

Code


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
data = pd.read_csv(url)

data.head()

X = data.drop("medv", axis=1)
y = data["medv"]

numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

preprocessor = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='mean'), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ])

pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('scaler', StandardScaler())])

X_preprocessed = pipeline.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-Squared: {r2}")

plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, color='blue')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linestyle='--')
plt.title('Actual vs Predicted Home Prices')
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.show()

Input


https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv

Output

Mean Squared Error: 24.291119474973513
R-Squared: 0.668759493535632

Conclusion


This experiment demonstrated how to implement a simple linear regression model from scratch and how to
use scikit-learn for the same task. The results from both approaches were compared, and visualizations were
created to show the regression line along with the data points. The manual implementation provided insight
into the workings of the algorithm, while scikit-learn simplified the process of fitting the model.

Experiment 5
Problem Statement
The task is to implement a classification model using the decision tree algorithm, exploring its efficiency and accuracy. The goal is to apply the algorithm to a dataset and evaluate the performance based on accuracy, precision, recall, and F1 score.

Terminology


Decision Tree: A supervised learning model that classifies samples by recursively splitting the data on feature thresholds, forming a tree of decision rules.
Accuracy: The ratio of correct predictions to the total predictions.
Precision: The ratio of true positive predictions to the sum of true positives and false positives.
Recall: The ratio of true positive predictions to the sum of true positives and false negatives.
F1 Score: The harmonic mean of precision and recall.

Theory


A decision tree classifier predicts labels by asking a sequence of threshold questions about the features, one per internal node, until a leaf is reached. Splits are chosen to maximize class purity, typically measured by Gini impurity or entropy; hyperparameters such as max_depth, min_samples_split, and min_samples_leaf control tree growth and guard against overfitting. Model quality is then judged with the metrics defined above: accuracy, precision, recall, and F1 score. A worked impurity example follows below.
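
A toy calculation of Gini impurity (illustrative only, not part of the lab code): for a node holding 8 positive and 2 negative samples, Gini = 1 − (0.8² + 0.2²) = 0.32, whereas a pure node scores 0.

def gini(p_pos, p_neg):
    # Gini impurity for a two-class node with the given class proportions
    return 1 - (p_pos ** 2 + p_neg ** 2)

print(gini(0.8, 0.2))  # 0.32 -> mixed node
print(gini(1.0, 0.0))  # 0.0  -> pure node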

Code


import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns

X, y = make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0,
                           n_clusters_per_class=1, random_state=42)

print(X.shape)
print(y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

param_grid = {
'max_depth': [3, 5, 10, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'criterion': ['gini', 'entropy']
}
clf = DecisionTreeClassifier()
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

print("Best Parameters: ", grid_search.best_params_)

best_clf = grid_search.best_estimator_
y_pred = best_clf.predict(X_test)

print("Accuracy after tuning:", accuracy_score(y_test, y_pred))


print("Classification Report:\n", classification_report(y_test, y_pred))

conf_matrix = confusion_matrix(y_test, y_pred)


sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", cbar=False)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()

def plot_decision_boundaries(X, y, model):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.RdYlBu)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', marker='o', s=50, cmap=plt.cm.RdYlBu)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title("Decision Tree Classifier - Decision Boundaries")
    plt.show()

plot_decision_boundaries(X_train, y_train, best_clf)

Input


Built-in synthetic dataset generated with make_classification (no external file required).

Output


(100, 2)
(100,)
Best Parameters: {'criterion': 'gini', 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2}
Accuracy after tuning: 1.0
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 15
1 1.00 1.00 1.00 15
accuracy 1.00 30
macro avg 1.00 1.00 1.00 30
weighted avg 1.00 1.00 1.00 30

Conclusion


The decision tree classifier performs efficiently for the given task, with acceptable levels of accuracy, precision, recall, and F1 score. Further optimization can be achieved by tweaking the hyperparameters of the model.

Experiment 6
Problem Statement
Develop a Convolutional Neural Network (CNN) model to classify images into predefined categories. The
project aims to achieve high accuracy in image classification through the implementation and optimization of a
CNN model.

Terminology


CNN: Convolutional Neural Network. Conv2D: convolutional layer; MaxPooling2D: pooling layer; Flatten and Dense: fully connected layers.
Rectified Linear Unit (ReLU): max(0, x)
Sigmoid: σ(x) = 1 / (1 + e^(-x))

Theory


CNN: Convolutional Neural Networks are a class of deep neural networks specifically designed for analyzing visual data. They have revolutionized the field of computer vision by achieving state-of-the-art results in tasks such as image classification, object detection, and image segmentation.
Convolution Operation: The core building block of a CNN is the convolutional layer, which applies
a set of learnable filters (also called kernels) to the input image. Each filter convolves across the
image to produce a feature map.
ReLU Activation: After each convolution operation, an activation function (typically ReLU,
Rectified Linear Unit) is applied to introduce non-linearity.
Pooling: Pooling layers reduce the spatial dimensions (width and height) of the feature maps, which
helps in reducing the number of parameters, computational complexity, and overfitting. The most
common pooling operations are Max Pooling (which takes the maximum value) and Average Pooling
(which takes the average value) within a window.
Dropout: Dropout is a regularization technique used to prevent overfitting. It randomly drops (sets to
zero) a fraction of the input units during training.
cv2: OpenCV (Open Source Computer Vision Library) is an open-source computer vision and
machine learning software library. It contains more than 2500 optimized algorithms that are useful
for a variety of tasks including image processing, computer vision, and machine learning.
Sigmoid: The Sigmoid function is a type of activation function used in neural networks. The output
of the Sigmoid function is in the range (0, 1), making it useful for binary classification tasks where
the output needs to be interpreted as a probability.
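
For the architecture built below (150x150x3 input, three 3x3 'valid' convolutions each followed by 2x2 max pooling), the spatial size shrinks 150 -> 148 -> 74 -> 72 -> 36 -> 34 -> 17, so 17x17x128 = 36992 values reach the Flatten layer. A small sketch of that arithmetic (assuming default 'valid' padding and stride 1, as in the Keras code below):

def conv_out(n, k=3):   # a k x k 'valid' convolution shrinks each side by k - 1
    return n - (k - 1)

def pool_out(n, p=2):   # p x p max pooling divides each side by p
    return n // p

n = 150
for _ in range(3):      # three Conv2D(3x3) + MaxPooling2D(2x2) stages
    n = pool_out(conv_out(n))
print(n, n * n * 128)   # 17 and 36992 values entering Flatten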

Code


import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import warnings
warnings.filterwarnings('ignore')

train_datagen = ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_directory(
r'train',
target_size=(150, 150),
batch_size=32,
class_mode='binary'
)
validation_datagen = ImageDataGenerator(rescale=1./255)

validation_generator = validation_datagen.flow_from_directory(
r'test',
target_size=(150, 150),
batch_size=32,
class_mode='binary'
)

# Class 0: cat
# Class 1: dog
model = Sequential([
Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)),
MaxPooling2D((2, 2)),
Conv2D(64, (3, 3), activation='relu'),
MaxPooling2D((2, 2)),
Conv2D(128, (3, 3), activation='relu'),
MaxPooling2D((2, 2)),
Flatten(),
Dense(512, activation='relu'),
Dense(1, activation='sigmoid') # Use 'softmax' for multiple classes
])

model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])

history = model.fit(
train_generator,
steps_per_epoch=100,
epochs=20,
validation_data=validation_generator,
validation_steps=50
)

model.save('my_model.h5')

class_indices = train_generator.class_indices
class_labels = {v: k for k, v in class_indices.items()}

print("Class Labels:")
for index, label in class_labels.items():
    print(f"Class {index}: {label}")

from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing import image
import numpy as np
model = load_model('my_model.h5')
class_labels = {0: 'cat', 1: 'dog'}

def prepare_image(img_path):
    img = image.load_img(img_path, target_size=(150, 150))
    img_array = image.img_to_array(img)
    img_array = np.expand_dims(img_array, axis=0)
    img_array /= 255.0
    return img_array

def predict_image(img_path):
    img_array = prepare_image(img_path)
    predictions = model.predict(img_array)
    return predictions

img_path = "cat_image.jpg"
predictions = predict_image(img_path)

if model.output_shape[1] == 1:
    predicted_class_index = 1 if predictions[0] > 0.5 else 0
else:
    predicted_class_index = np.argmax(predictions[0])

predicted_class_label = class_labels.get(predicted_class_index, 'Unknown Class')


print(f"Prediction: {predicted_class_label}")

Input

Image dataset organized into train/ and test/ directories, one sub-folder per class (cats and dogs), as expected by flow_from_directory, plus a sample image such as cat_image.jpg for prediction.

Output

Prediction: cat

Conclusion


Thus the Convolutional Neural Network model has been developed to classify images into predefined
categories.

Experiment 7
Problem Statement
In today’s digital age, organizations receive massive amounts of customer feedback in the form of
reviews, social media comments, and survey responses. Analyzing the sentiment behind this text data
can help businesses understand customer satisfaction and improve their services. In this task, students
will preprocess a dataset of customer reviews, apply sentiment analysis techniques, and build a model
to classify the reviews as positive, negative, or neutral. The goal is to accurately classify the sentiment
of the reviews and evaluate the model’s performance.

Terminology


Text Preprocessing: Steps taken to clean and transform text data into a format suitable for analysis.
o Tokenization: Splitting text into smaller units like words or phrases.
o Stop Words: Common words (e.g., "the", "is", "in") that are often removed because they do not
carry meaningful information.
o Stemming: Reducing words to their base or root form (e.g., "running" becomes "run").
o Lemmatization: Similar to stemming but more accurate, it reduces words to their meaningful base
form (e.g., "better" becomes "good").
Feature Extraction: Converting textual data into numerical features for machine learning models.
o Bag-of-Words (BoW): A representation of text that counts the frequency of words in a document,
ignoring grammar and word order.
o TF-IDF (Term Frequency-Inverse Document Frequency): A technique that assigns a weight to
words based on their frequency in a document compared to their frequency in the entire corpus.
o Word Embeddings: Dense vector representations of words that capture semantic relationships
(e.g., Word2Vec, GloVe).
Model Training: The process of using algorithms to build a predictive model from the data.
o Naive Bayes: A probabilistic classifier based on applying Bayes’ theorem with strong (naive)
independence assumptions between features.
o Support Vector Machine (SVM): A classification technique that finds the optimal hyperplane
separating the data into different sentiment categories.
Evaluation Metrics: Methods used to assess the performance of the sentiment analysis model.
o Accuracy: The ratio of correctly predicted instances to the total instances.
o Precision, Recall, and F1-Score: Metrics that provide insight into the model’s performance,
especially for imbalanced datasets.
o Confusion Matrix: A table that describes the performance of a classification model by showing
true positives, false positives, true negatives, and false negatives.

Theory


Sentiment Analysis: The process of identifying and categorizing opinions expressed in text, especially in order
to determine whether the writer’s attitude is positive, negative, or neutral. Sentiment analysis aims to classify
the underlying sentiment of a piece of text, such as a customer review or a social media post. This is important
for businesses to gauge public opinion and improve services.

Text Preprocessing: Preprocessing is a crucial step in sentiment analysis to reduce noise and standardize the
text for further analysis. Common steps include:
 Lowercasing: Converting all words to lowercase to avoid treating words like "Happy" and "happy"
as different tokens.
 Removing Stop Words: Eliminating frequently occurring but unimportant words.
 Stemming and Lemmatization: Converting words to their base form to reduce dimensionality and
improve the model's generalization.
 Tokenization: Breaking text into individual words or tokens is essential for applying models like
Bag-of-Words or Word Embeddings.

Feature Extraction:
 Bag-of-Words: This is one of the simplest ways to represent text data, where each document is
represented as a vector of word counts or occurrences. Though easy to implement, BoW ignores word
order and context.
 TF-IDF: This method improves on BoW by not just counting word occurrences but weighing them by
how important they are in the document relative to the entire corpus. This helps in diminishing the
impact of common but uninformative words.
 Word Embeddings: Unlike BoW, Word Embeddings capture semantic relationships between words,
meaning that words with similar meanings will have similar vector representations. This is particularly
useful in capturing nuances in sentiment.

Model Selection:
 Naive Bayes: Naive Bayes is often used for text classification because it’s simple and efficient,
particularly in high-dimensional spaces such as text data. Despite its simplicity, it often works well for
sentiment analysis tasks.
 Support Vector Machine (SVM): SVM is another popular model for text classification tasks due to its
ability to handle high-dimensional spaces and its robustness in separating different classes.

Model Evaluation:
 Cross-Validation: This technique is commonly used to validate the model's performance on unseen
data. By splitting the dataset into k folds, each fold is used as the validation set while the remaining
data is used for training.
 Metrics: Metrics like accuracy, precision, recall, and F1-score give insight into how well the model
performs, particularly when there’s class imbalance. Confusion matrices can also provide a clearer
understanding of the model's behavior in classifying sentiment correctly.
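
The code below uses Bag-of-Words (CountVectorizer) and Word2Vec features; TF-IDF, described above, slots in the same way. A hedged sketch (X_train and X_test as produced by the split in the code below; max_features is an optional cap, not required):

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=5000)    # limit vocabulary to the 5000 strongest terms
X_train_tfidf = tfidf.fit_transform(X_train)  # learn vocabulary and IDF weights on training text only
X_test_tfidf = tfidf.transform(X_test)        # reuse them on the test text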

Code


import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
import re
!kaggle datasets download -d chaudharyanshul/airline-reviews
!unzip "airline-reviews"

df = pd.read_csv('BA_AirlineReviews.csv')
print(df.head())

stop_words = set(stopwords.words('english'))
def preprocess_text(text):
    tokens = word_tokenize(text.lower())
    tokens = [word for word in tokens if word not in stop_words and word.isalnum()]
    return ' '.join(tokens)

df['ProcessedReview'] = df['ReviewBody'].apply(preprocess_text)
print(df[['ReviewBody', 'ProcessedReview']].head())

def map_sentiment(rating):
    if rating >= 4:
        return 'positive'
    elif rating == 3:
        return 'neutral'
    else:
        return 'negative'

df['sentiment'] = df['OverallRating'].apply(map_sentiment)

def preprocess_text(text):
    text = re.sub(r'\W', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = text.lower()
    tokens = text.split()
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

df['cleaned_reviews'] = df['ReviewBody'].apply(preprocess_text)

X = df['cleaned_reviews']
y = df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

model = LogisticRegression()
model.fit(X_train_bow, y_train)

y_pred = model.predict(X_test_bow)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Classification Report:\n', classification_report(y_test, y_pred))

from gensim.models import Word2Vec
import numpy as np
from sklearn.preprocessing import StandardScaler

sentences = [review.split() for review in X_train]
model_w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

def get_vector(review):
    words = review.split()
    vectors = [model_w2v.wv[word] for word in words if word in model_w2v.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(100)

X_train_w2v = np.array([get_vector(review) for review in X_train])
X_test_w2v = np.array([get_vector(review) for review in X_test])

scaler = StandardScaler()
X_train_w2v = scaler.fit_transform(X_train_w2v)
X_test_w2v = scaler.transform(X_test_w2v)

model = LogisticRegression(max_iter=200)
model.fit(X_train_w2v, y_train)

y_pred = model.predict(X_test_w2v)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Classification Report:\n', classification_report(y_test, y_pred))

Input


!kaggle datasets download -d chaudharyanshul/airline-reviews

Output


Unnamed: 0 OverallRating ReviewHeader \
0 0 1.0 "Service level far worse then Ryanair"
1 1 3.0 "do not upgrade members based on status"
2 2 8.0 "Flight was smooth and quick"
3 3 1.0 "Absolutely hopeless airline"
4 4 1.0 "Customer Service is non existent"
Name Datetime VerifiedReview \
0 L Keele 19th November 2023 True
1 Austin Jones 19th November 2023 True
2 M A Collie 16th November 2023 False
3 Nigel Dean 16th November 2023 True
4 Gaylynne Simpson 14th November 2023 False

ReviewBody TypeOfTraveller \
0 4 Hours before takeoff we received a Mail stat... Couple Leisure
1 I recently had a delay on British Airways from... Business
2 Boarded on time, but it took ages to get to th... Couple Leisure
3 5 days before the flight, we were advised by B... Couple Leisure
4 We traveled to Lisbon for our dream vacation, ... Couple Leisure

SeatType Route DateFlown SeatComfort \


0 Economy Class London to Stuttgart November 2023 1.0
1 Economy Class Brussels to London November 2023 2.0
2 Business Class London Heathrow to Dublin November 2023 3.0
3 Economy Class London to Dublin December 2022 3.0
4 Economy Class London to Lisbon November 2023 1.0

CabinStaffService GroundService ValueForMoney Recommended Aircraft \


0 1.0 1.0 1.0 no NaN
1 3.0 1.0 2.0 no A320
2 3.0 4.0 3.0 yes A320
3 3.0 1.0 1.0 no NaN
4 1.0 1.0 1.0 no NaN

Food&Beverages InflightEntertainment Wifi&Connectivity


0 NaN NaN NaN
1 1.0 2.0 2.0
2 4.0 NaN NaN
3 NaN NaN NaN
4 1.0 1.0 1.0
ReviewBody \
0 4 Hours before takeoff we received a Mail stat...
1 I recently had a delay on British Airways from...
2 Boarded on time, but it took ages to get to th...
3 5 days before the flight, we were advised by B...

4 We traveled to Lisbon for our dream vacation, ...

ProcessedReview
0 4 hours takeoff received mail stating cryptic ...
1 recently delay british airways bru lhr due sta...
2 boarded time took ages get runway due congesti...
3 5 days flight advised ba cancelled asked us re...
4 traveled lisbon dream vacation cruise portugal...
Accuracy: 0.7278617710583153
Classification Report:
              precision    recall  f1-score   support

    negative       0.67      0.75      0.71       311
     neutral       0.26      0.14      0.18       120
    positive       0.82      0.86      0.84       495

    accuracy                           0.73       926
   macro avg       0.59      0.58      0.58       926
weighted avg       0.70      0.73      0.71       926

Accuracy: 0.7451403887688985
Classification Report:
              precision    recall  f1-score   support

    negative       0.66      0.83      0.74       311
     neutral       0.40      0.03      0.06       120
    positive       0.81      0.86      0.84       495

    accuracy                           0.75       926
   macro avg       0.63      0.58      0.55       926
weighted avg       0.71      0.75      0.70       926

Conclusion


Hence, the given Python code accurately classifies the sentiment of the reviews, and the model's performance has been evaluated.

Experiment 8
Problem Statement
In many real-world applications, machine learning models must be deployed to serve predictions to
users in real-time. Deploying models requires converting them into a format suitable for use in
production, setting up an environment (either cloud-based or using containers), and providing a user-
friendly interface for interaction. In this project, students will explore how to deploy a trained machine
learning model to the cloud or via containers, create a basic web interface, and allow users to interact
with the model through the interface.

Terminology


Model Deployment: The process of integrating a trained machine learning model into a production
environment where it can serve predictions to users.
Cloud Services: Platforms that provide infrastructure and tools to deploy applications. Common examples
include:
 AWS (Amazon Web Services): Offers services like SageMaker for ML model deployment.
 Google Cloud: Provides AI Platform for deploying models.
 Microsoft Azure: Azure ML offers tools for deploying models.
API (Application Programming Interface): A set of rules that allows different software entities to
communicate. In model deployment, an API allows external applications to interact with the model and send
data to it for predictions.
Web Interface: A front-end interface (usually created with HTML, CSS, and JavaScript) that allows users to
interact with a machine learning model in a simple and intuitive way.

Theory


Pickle/Joblib: Common Python libraries used to serialize and save machine learning models.
AWS SageMaker, Google Cloud AI Platform, and Azure ML offer streamlined workflows for deploying
models as APIs, where users can send data to the cloud-hosted model and receive predictions.
Flask: A lightweight Python web framework that can be used to build APIs and simple web applications.
HTML/CSS/JavaScript: Used to create the front-end interface that users interact with.

Code


# Python Notebook:
import numpy as np
import pandas as pd

df = pd.read_csv('students_placement.csv')
df.shape
X = df.drop(columns=['placed'])
y = df['placed']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

scaler = StandardScaler()
X_train_trf = scaler.fit_transform(X_train)
X_test_trf = scaler.transform(X_test)

accuracy_score(y_test, LogisticRegression().fit(X_train_trf, y_train).predict(X_test_trf))

from sklearn.ensemble import RandomForestClassifier

accuracy_score(y_test, RandomForestClassifier().fit(X_train, y_train).predict(X_test))

from sklearn.svm import SVC

accuracy_score(y_test, SVC(kernel='rbf').fit(X_train, y_train).predict(X_test))

svc = SVC(kernel='rbf')
svc.fit(X_train,y_train)

rf = RandomForestClassifier()
rf.fit(X_train,y_train)

import pickle
pickle.dump(svc,open('model.pkl','wb'))

# Python:
from flask import Flask, render_template, request
import pickle
import numpy as np

model = pickle.load(open('model.pkl', 'rb'))
app = Flask(__name__)

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict_placement():
    cgpa = float(request.form.get('cgpa'))
    iq = int(request.form.get('iq'))
    profile_score = int(request.form.get('profile_score'))

    # prediction
    result = model.predict(np.array([cgpa, iq, profile_score]).reshape(1, 3))
    if result[0] == 1:
        result = 'placed'
    else:
        result = 'not placed'
    return render_template('index.html', result=result)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)

# HTML template (templates/index.html):
<!DOCTYPE html>
<html lang="en">

<head>
<meta charset="UTF-8">
<title>Student Placement Predictor</title>

</head>

<body>
<h1>Student Placement Predictor</h1>
{% if result %}
<p>{{ result }}</p>
{% endif %}

<form method="post" action="/predict">


<label>CGPA</label><br>
<input type="text" name="cgpa"><br><br>
<label>IQ</label><br>
<input type="text" name="iq"><br><br>
<label>Profile Score</label><br>
<input type="text" name="profile_score"><br><br>
<input type="submit" value="Predict"><br><br>
</form>

</body>

</html>

Input


Get ready with the pickle file.
1. Log in to AWS -> Create Instance -> name the instance -> choose Ubuntu -> create a key pair (.ppk) -> create -> copy the public DNS.
2. Open WinSCP -> username: ubuntu -> authenticate via the key: security -> select the .ppk file and open -> drag and drop the project files from Windows to Ubuntu -> open PuTTY.
3. In the PuTTY terminal, run:
   i. sudo apt install python3-pip
   ii. sudo apt-get update && sudo apt-get install python3-pip
   iii. pip install <required modules>
   iv. python app.py or python3 app.py
   v. screen -R deploy python3 app.py
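
Once app.py is running, the /predict endpoint can be exercised without opening the HTML form. A hedged sketch using the requests library (hypothetical host and values; the field names match the form above):

import requests

# Replace localhost with the instance's public DNS when testing the EC2 deployment
resp = requests.post('http://localhost:8080/predict',
                     data={'cgpa': 8.2, 'iq': 110, 'profile_score': 7})
print(resp.status_code)  # 200 on success; the body is the rendered index.html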

Output

Conclusion


Model deployment transforms machine learning models from development to real-world use. By converting,
deploying, and providing user-friendly access, this process ensures models are scalable, accessible, and
impactful in practical applications.

Experiment 9
Problem Statement
With the rise of online media, misinformation and fake news have become a serious concern, leading to
misinformed public opinions and societal harm. In this project, students will develop a machine learning model
to classify news articles as either real or fake. The task involves preprocessing text data, building a model, and
evaluating its performance. The ultimate goal is to create a reliable system that can help users distinguish
between trustworthy information and misleading or fake news articles.

Terminology


Text Preprocessing: Preparing text data for machine learning models by cleaning and transforming it.
 Tokenization: Breaking down text into individual words or tokens.
 Stop Words: Words that are common and typically do not carry significant meaning (e.g., “the”, “is”,
“in”).
 Stemming and Lemmatization: Reducing words to their base or root form (e.g., “running” to “run” or
“better” to “good”).
Feature Extraction: Converting text into a numerical format suitable for machine learning algorithms.
 Bag-of-Words (BoW): A method where a document is represented by a vector of word counts.
 TF-IDF (Term Frequency-Inverse Document Frequency): A feature extraction method that assigns
weights to words based on their importance in a document relative to the entire corpus.
 Word Embeddings: Dense vector representations of words that capture their meanings and relationships
in a continuous vector space (e.g., Word2Vec, GloVe).
Model Training: Using machine learning algorithms to learn patterns in the data and classify it.
 Logistic Regression: A simple, interpretable model used for binary classification tasks.
 Random Forest: An ensemble learning method based on decision trees, often used for text classification.
 Neural Networks: A deep learning model that is powerful for complex patterns in data, often used in
NLP tasks.
Model Evaluation: Assessing the performance of a model using various metrics.
 Accuracy: The percentage of correctly classified instances out of the total instances.
 Precision, Recall, F1-Score: Metrics that help evaluate how well the model is performing, especially for
imbalanced datasets.
 Confusion Matrix: A table showing the performance of a classification model by indicating true
positives, false positives, true negatives, and false negatives
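
The code below prints classification reports; the confusion matrix described above can be added in one line. A hedged sketch (y_test and pred_lr as defined in the code below):

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, pred_lr)  # rows = actual class, columns = predicted, in label order [0, 1]
print(cm)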

Theory


Fake News: False or misleading information presented as news with the intent to deceive readers.
Natural Language Processing (NLP): A field of artificial intelligence that focuses on the interaction between
computers and humans using natural language.

Code


import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import string
import re
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

data_fake=pd.read_csv('Fake.csv')
data_true=pd.read_csv('True.csv')

data_fake.head()

data_fake["class"]=0
data_true['class']=1

data_fake.shape, data_true.shape

data_fake_manual_testing = data_fake.tail(10)
for i in range(23480, 23470, -1):
    data_fake.drop([i], axis=0, inplace=True)

data_true_manual_testing = data_true.tail(10)
for i in range(21416, 21406, -1):
    data_true.drop([i], axis=0, inplace=True)

data_fake_manual_testing['class']=0
data_true_manual_testing['class']=1

data_merge=pd.concat([data_fake, data_true], axis = 0)


data_merge.head(10)

data_merge.columns

data=data_merge.drop(['title','subject','date'], axis = 1)
data.isnull().sum()

data = data.sample(frac = 1)

data.reset_index(inplace = True)
data.drop(['index'], axis = 1, inplace = True)

def wordopt(text):
    text = text.lower()
    text = re.sub('\[.*?\]', '', text)
    text = re.sub("\\W", " ", text)
    text = re.sub('https?://\S+|www\.\S+', '', text)
    text = re.sub('<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub('\w*\d\w*', '', text)
    return text

data['text'] = data['text'].apply(wordopt)

x = data['text']
y = data['class']

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.25)

from sklearn.feature_extraction.text import TfidfVectorizer

vectorization = TfidfVectorizer()
xv_train = vectorization.fit_transform(x_train)
xv_test = vectorization.transform(x_test)

from sklearn.linear_model import LogisticRegression


LR = LogisticRegression()
LR.fit(xv_train, y_train)
pred_lr = LR.predict(xv_test)
LR.score(xv_test, y_test)
print (classification_report(y_test, pred_lr))

from sklearn.tree import DecisionTreeClassifier


DT = DecisionTreeClassifier()
DT.fit(xv_train, y_train)
pred_dt = DT.predict(xv_test)
DT.score(xv_test, y_test)
print(classification_report(y_test, pred_dt))

from sklearn.ensemble import GradientBoostingClassifier


GB = GradientBoostingClassifier(random_state = 0)
GB.fit(xv_train, y_train)
pred_gb = GB.predict(xv_test)
GB.score(xv_test, y_test)
print(classification_report(y_test, pred_gb))

from sklearn.ensemble import RandomForestClassifier


RF = RandomForestClassifier(random_state = 0)
RF.fit(xv_train, y_train)
pred_rf = RF.predict(xv_test)
RF.score(xv_test, y_test)
print (classification_report(y_test, pred_rf))

def output_lable(n):
    if n == 0:
        return "Fake News"
    elif n == 1:
        return "Not A Fake News"

def manual_testing(news):
    testing_news = {"text": [news]}
    new_def_test = pd.DataFrame(testing_news)
    new_def_test['text'] = new_def_test["text"].apply(wordopt)
    new_x_test = new_def_test["text"]
    new_xv_test = vectorization.transform(new_x_test)
    pred_LR = LR.predict(new_xv_test)
    pred_DT = DT.predict(new_xv_test)
    pred_GB = GB.predict(new_xv_test)
    pred_RF = RF.predict(new_xv_test)
    return print("\n\nLR Prediction: {} \nDT Prediction: {} \nGBC Prediction: {} \nRFC Prediction: {}".format(
        output_lable(pred_LR[0]), output_lable(pred_DT[0]), output_lable(pred_GB[0]), output_lable(pred_RF[0])))

news = str(input())
manual_testing(news)

news = str(input())
manual_testing(news)

Input


True.csv: https://drive.google.com/file/d/1m_XJOELOPHDLJNFMSJ9mGzqc-U6qEhQ3/view?usp=sharing
Fake.csv: https://drive.google.com/file/d/1Zrss7nMQhpxMeJWVdAKCtaseO1GTddTd/view?usp=sharing

Output


((23471, 5), (21407, 5))

              precision    recall  f1-score   support

           0       0.99      0.98      0.99      5884
           1       0.98      0.99      0.99      5336

    accuracy                           0.99     11220
   macro avg       0.99      0.99      0.99     11220
weighted avg       0.99      0.99      0.99     11220

              precision    recall  f1-score   support

           0       0.99      0.98      0.99      5884
           1       0.98      0.99      0.99      5336

    accuracy                           0.99     11220
   macro avg       0.99      0.99      0.99     11220
weighted avg       0.99      0.99      0.99     11220

              precision    recall  f1-score   support

           0       0.99      0.98      0.99      5884
           1       0.98      0.99      0.99      5336

    accuracy                           0.99     11220
   macro avg       0.99      0.99      0.99     11220
weighted avg       0.99      0.99      0.99     11220

              precision    recall  f1-score   support

           0       1.00      0.99      1.00      5884
           1       0.99      1.00      1.00      5336

    accuracy                           1.00     11220
   macro avg       1.00      1.00      1.00     11220
weighted avg       1.00      1.00      1.00     11220

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      5884
           1       0.99      0.99      0.99      5336

    accuracy                           0.99     11220
   macro avg       0.99      0.99      0.99     11220
weighted avg       0.99      0.99      0.99     11220

LR Prediction: Fake News
DT Prediction: Fake News
GBC Prediction: Fake News
RFC Prediction: Fake News

LR Prediction: Not A Fake News
DT Prediction: Fake News
GBC Prediction: Fake News
RFC Prediction: Not A Fake News

Conclusion


Fake news detection using machine learning helps combat misinformation by analyzing and classifying news
articles as reliable or misleading. Through feature engineering and model selection, we can build accurate
models that assist users in identifying fake news, promoting informed decision-making and reducing the spread
of false information.
