2. ML Lab Record
BONAFIDE CERTIFICATE
B. Tech. – Artificial Intelligence and Data Science in the R19AM311 - MACHINE LEARNING
LABORATORY during the 5th Semester of the academic year 2024 – 2025 (Odd Semester).
S. No.   Date   Name of the Experiment   Page Number   Marks (50)   Signature of the Faculty Member
10
Average Marks:
Signature of the Faculty
Experiment 1
Problem Statement Terminology Theory Code Input Output Conclusion
Set up a Python environment with libraries such as NumPy, Pandas, and Matplotlib. Introduce Jupyter Notebooks for
interactive coding. Perform basic operations and data manipulations using NumPy and Pandas. Visualize data
distributions and relationships with Matplotlib.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#1
array = np.array([1, 2, 3, 4, 5])
squared_array = np.square(array)
mean_value = np.mean(array)
print("Original Array:", array)
print("Squared Array:", squared_array)
print("Mean Value:", mean_value)
#2
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [24, 27, 22, 32],
'Score': [85, 88, 92, 79]}
df = pd.DataFrame(data)
print(df)
mean_age = df['Age'].mean()
print("Mean Age:", mean_age)
filtered_data = df[df['Score'] > 85]
print(filtered_data)
#3
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.figure(figsize=(8, 6))
plt.plot(x, y, label='Sine Wave')
plt.title('Sine Function')
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.legend()
plt.show()
#2
      Name  Age  Score
0    Alice   24     85
1      Bob   27     88
2  Charlie   22     92
3    David   32     79
Mean Age: 26.25
      Name  Age  Score
1      Bob   27     88
2  Charlie   22     92
The program executed successfully with the above code, and the output was verified.
Experiment 2
Handle missing data using imputation techniques. Remove outliers and understand their impact on
models. Standardize or normalize numerical features. Encode categorical variables using techniques
like one-hot encoding.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("loan.csv")
df.head()
df.describe()
df.isnull().sum()

# Impute the continuous feature with its mean.
# Assigning the result back avoids the chained inplace fillna, which newer pandas versions warn about.
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())
# Drop the identifier column.
df = df.drop(['Loan_ID'], axis=1)
# Impute categorical and discrete features with their mode.
for col in ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Credit_History', 'Loan_Amount_Term']:
    df[col] = df[col].fillna(df[col].mode()[0])
df.isnull().sum()

# Drop every row that has an outlier in any numerical column (1.5 * IQR rule).
numerical_df = df.select_dtypes(include=['float64', 'int64'])
Q1 = numerical_df.quantile(0.25)
Q3 = numerical_df.quantile(0.75)
IQR = Q3 - Q1
outlier_mask = ((numerical_df < (Q1 - 1.5 * IQR)) | (numerical_df > (Q3 + 1.5 * IQR))).any(axis=1)
df_cleaned = numerical_df[~outlier_mask]
print(df_cleaned)

# Square-root transform to reduce the right skew of the income and loan amount features.
df.ApplicantIncome = np.sqrt(df.ApplicantIncome)
df.CoapplicantIncome = np.sqrt(df.CoapplicantIncome)
df.LoanAmount = np.sqrt(df.LoanAmount)

sns.set(style="darkgrid")
fig, axs = plt.subplots(2, 2, figsize=(10, 8))
sns.histplot(data=df, x="ApplicantIncome", kde=True, ax=axs[0, 0], color='green')
sns.histplot(data=df, x="CoapplicantIncome", kde=True, ax=axs[0, 1], color='skyblue')
sns.histplot(data=df, x="LoanAmount", kde=True, ax=axs[1, 0], color='orange')
plt.show()
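The problem statement also lists feature scaling and one-hot encoding. The following is a minimal sketch of what those transformations compute, written in plain Python so the arithmetic is visible. The sample values and the category names are made up for illustration and are not taken from loan.csv. In the lab pipeline, scikit-learn's StandardScaler and MinMaxScaler and pandas' get_dummies would be the usual tools.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score scaling: subtract the mean, divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(labels):
    """Return the sorted category list and one 0/1 row per input label."""
    categories = sorted(set(labels))
    rows = [[1 if label == c else 0 for c in categories] for label in labels]
    return categories, rows

# Illustrative values only (hypothetical, not from loan.csv)
loan_amounts = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z_scores = standardize(loan_amounts)
scaled = min_max_normalize(loan_amounts)
categories, encoded = one_hot(["Urban", "Rural", "Semiurban", "Urban"])

print(z_scores)            # mean 5 and standard deviation 2, so 9.0 maps to 2.0
print(scaled)              # 2.0 maps to 0.0 and 9.0 maps to 1.0
print(categories, encoded)
```

Standardization suits models that assume roughly centred features, while min-max scaling keeps values in a fixed range; both are sensitive to outliers, which is one reason the outlier step usually comes first.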
Experiment 3
In many machine learning applications, the performance of a model is heavily influenced by the quality of the
input features. Proper feature engineering can significantly enhance the predictive power of a model.
Additionally, choosing the right model for the task is crucial for achieving optimal results. In this activity,
students will experiment with various feature engineering techniques and compare the performance of different
machine learning models using cross-validation. The goal is to optimize both the feature set and the model
selection to improve the accuracy of predictions.
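import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# X (features) and y (target) come from the preprocessing step, which is not part of this listing.
# Note: SMOTE is applied before the split here, so synthetic samples can leak into the test set.
X, y = SMOTE().fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Logistic Regression
LRclassifier = LogisticRegression(solver='saga', max_iter=500, random_state=1)
LRclassifier.fit(X_train, y_train)
y_pred = LRclassifier.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
LRAcc = accuracy_score(y_test, y_pred)

# SVC (this step is missing from the record; an RBF kernel with default settings is assumed)
SVCclassifier = SVC(kernel='rbf')
SVCclassifier.fit(X_train, y_train)
SVCAcc = SVCclassifier.score(X_test, y_test)

# Decision Tree: sweep max_leaf_nodes and keep the best test accuracy
scoreListDT = []
for i in range(2, 21):
    DTclassifier = DecisionTreeClassifier(max_leaf_nodes=i)
    DTclassifier.fit(X_train, y_train)
    scoreListDT.append(DTclassifier.score(X_test, y_test))
# 5-fold CV scores for the last tree fitted (max_leaf_nodes=20)
DTcv_scores = cross_val_score(DTclassifier, X_train, y_train, cv=5)
plt.plot(range(2, 21), scoreListDT)
plt.xticks(np.arange(2, 21, 1))
plt.xlabel("Leaf")
plt.ylabel("Score")
plt.show()
DTAcc = max(scoreListDT)
print("Decision Tree Accuracy: {:.2f}%".format(DTAcc * 100))
print("Decision Tree CV Scores:", DTcv_scores)

# Random Forest: same sweep over max_leaf_nodes
scoreListRF = []
for i in range(2, 25):
    RFclassifier = RandomForestClassifier(n_estimators=1000, random_state=1, max_leaf_nodes=i)
    RFclassifier.fit(X_train, y_train)
    scoreListRF.append(RFclassifier.score(X_test, y_test))
# 5-fold CV scores for the last forest fitted (max_leaf_nodes=24)
RFcv_scores = cross_val_score(RFclassifier, X_train, y_train, cv=5)
plt.plot(range(2, 25), scoreListRF)
plt.xticks(np.arange(2, 25, 1))
plt.xlabel("RF Value")
plt.ylabel("Score")
plt.show()
RFAcc = max(scoreListRF)
print("Random Forest Accuracy: {:.2f}%".format(RFAcc * 100))
print("Random Forest CV Scores:", RFcv_scores)

# Summary table: test accuracy per model, in percent
results = {
    'Model Name': ['Logistic Regression', 'SVC', 'Decision Tree', 'Random Forest'],
    'Test Accuracy (%)': [LRAcc * 100, SVCAcc * 100, DTAcc * 100, RFAcc * 100]
}
df1 = pd.DataFrame(results)
print(df1)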
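Cross-validation, as named in the problem statement, compares models on equal footing by scoring each one on every fold in turn. The sketch below shows the mechanics in plain Python. The toy dataset and the two stand-in models (`fit_majority`, `fit_threshold`) are hypothetical and only illustrate the procedure; scikit-learn's cross_val_score performs the same loop for real estimators.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; sample i is assigned to fold i % k."""
    for fold in range(k):
        test_idx = [i for i in range(n_samples) if i % k == fold]
        train_idx = [i for i in range(n_samples) if i % k != fold]
        yield train_idx, test_idx

def fit_majority(X, y):
    """Baseline: always predict the most frequent training label."""
    majority = max(set(y), key=y.count)
    return lambda x: majority

def fit_threshold(X, y):
    """Predict 1 when x exceeds the midpoint between the two class means."""
    mean0 = sum(x for x, t in zip(X, y) if t == 0) / y.count(0)
    mean1 = sum(x for x, t in zip(X, y) if t == 1) / y.count(1)
    cut = (mean0 + mean1) / 2
    return lambda x: 1 if x > cut else 0

def cross_val_accuracy(fit, X, y, k):
    """Train on k-1 folds, score on the held-out fold, return the per-fold accuracies."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(model(X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores

# Toy one-feature dataset (hypothetical): the label is 1 when x > 8
X = list(range(1, 13))
y = [1 if x > 8 else 0 for x in X]

for name, fit in [("majority baseline", fit_majority), ("threshold rule", fit_threshold)]:
    scores = cross_val_accuracy(fit, X, y, k=3)
    print(name, scores, sum(scores) / len(scores))
```

Comparing the mean of the fold scores, rather than the best score on a single test split, gives a selection criterion that does not depend on one particular split.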
Experiment 4
Use linear regression to predict house prices.
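import pandas as pd
import matplotlib.pyplot as plt
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
data = pd.read_csv(url)
data.head()
X = data.drop("medv", axis=1)
y = data["medv"]
# The transformer list is cut off in the record; scaling every column is an assumption.
preprocessor = ColumnTransformer(
    transformers=[('num', StandardScaler(), X.columns.tolist())]
)
# The pipeline definition and the train/test split are also missing; both are assumed here.
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])
X_preprocessed = pipeline.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)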
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, color='blue')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linestyle='--')
plt.title('Actual vs Predicted Home Prices')
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.show()
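The listing ends with the scatter plot and reports no numeric score. As a hedged illustration of what the fit and its evaluation involve, the sketch below fits a one-feature least-squares line by hand and computes the mean squared error and R². The five data points are made up for the example. For the multi-feature model above, `mean_squared_error` and `r2_score` from sklearn.metrics would serve the same purpose.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x

def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score(y_true, y_pred):
    """1 minus the ratio of residual variation to total variation."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical data, not taken from the housing dataset
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = fit_line(xs, ys)
preds = [slope * x + intercept for x in xs]
print("slope:", slope, "intercept:", intercept)
print("MSE:", mse(ys, preds), "R^2:", r2_score(ys, preds))
```

An R² of 1 corresponds to every point lying on the red dashed diagonal in an actual-versus-predicted plot.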
Experiment 5
The task is to implement a decision tree classifier for image classification and to explore its efficiency and
accuracy. The goal is to apply the algorithm to a dataset and evaluate its performance using accuracy,
precision, recall, and F1 score.
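from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# X (image features) and y (labels) are loaded earlier; that step is not part of this record.
print(X.shape)
print(y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# 4 * 3 * 3 * 2 = 72 parameter combinations, each scored with 5-fold CV (360 fits in total).
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}
clf = DecisionTreeClassifier()
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)
# Evaluate the best parameter combination on the held-out test set.
best_clf = grid_search.best_estimator_
y_pred = best_clf.predict(X_test)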
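The listing stops at `y_pred`, while the problem statement asks for accuracy, precision, recall, and F1 score. A minimal sketch of how those four metrics follow from the confusion counts is given below. The labels are a made-up binary example; for a multi-class image dataset the same counts are taken per class and then averaged, which is what sklearn.metrics.classification_report reports.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from the four confusion counts."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical labels: 3 true positives, 2 false positives, 1 false negative, 2 true negatives
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision and recall diverge here (0.600 against 0.750) because the classifier produces more false positives than false negatives; the F1 score is their harmonic mean.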
Experiment 6
Develop a Convolutional Neural Network (CNN) model to classify images into predefined categories. The
project aims to achieve high accuracy in image classification through the implementation and optimization of a
CNN model.
import numpy as np
from tensorflow.keras.layers import Conv2D, Dense, Flatten, MaxPooling2D
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing import image
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rescale pixel values from [0, 255] to [0, 1].
train_datagen = ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_directory(
    r'train',
    target_size=(150, 150),
    batch_size=32,
    class_mode='binary'
)
validation_datagen = ImageDataGenerator(rescale=1./255)
validation_generator = validation_datagen.flow_from_directory(
    r'test',
    target_size=(150, 150),
    batch_size=32,
    class_mode='binary'
)
# Class 0: cat
# Class 1: dog

# Three convolution + pooling stages followed by a dense classifier head.
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Conv2D(128, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(512, activation='relu'),
    Dense(1, activation='sigmoid')  # Use 'softmax' for multiple classes
])
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
history = model.fit(
    train_generator,
    steps_per_epoch=100,
    epochs=20,
    validation_data=validation_generator,
    validation_steps=50
)
model.save('my_model.h5')
# Invert the generator's mapping so that a class index gives its label.
class_indices = train_generator.class_indices
class_labels = {v: k for k, v in class_indices.items()}
print("Class Labels:")
for index, label in class_labels.items():
    print(f"Class {index}: {label}")

def prepare_image(img_path):
    img = image.load_img(img_path, target_size=(150, 150))
    img_array = image.img_to_array(img)
    img_array = np.expand_dims(img_array, axis=0)
    img_array /= 255.0  # same rescaling as the training generator
    return img_array

def predict_image(img_path):
    img_array = prepare_image(img_path)
    predictions = model.predict(img_array)
    return predictions

img_path = "cat_image.jpg"
predictions = predict_image(img_path)
if model.output_shape[1] == 1:
    # Single sigmoid unit: threshold the probability of class 1.
    predicted_class_index = 1 if predictions[0][0] > 0.5 else 0
else:
    predicted_class_index = np.argmax(predictions[0])
print("Prediction :", class_labels[predicted_class_index])
Prediction : cats
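The feature-map sizes and parameter counts implied by the Sequential model in this experiment can be checked by hand. The sketch below does the arithmetic in plain Python, assuming the Keras defaults the listing relies on: 'valid' padding and stride 1 for Conv2D, and a pooling stride equal to the pool size. The helper names are illustrative and not part of Keras.

```python
def conv2d_out(size, kernel=3, stride=1, padding=0):
    """Output width/height of a convolution ('valid' padding when padding=0)."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, pool=2):
    """Output width/height of max pooling with stride equal to the pool size."""
    return size // pool

def conv_params(in_channels, out_channels, kernel=3):
    """Weights plus one bias per output filter."""
    return (kernel * kernel * in_channels + 1) * out_channels

def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return (n_in + 1) * n_out

size, channels, total_params = 150, 3, 0
for filters in (32, 64, 128):
    total_params += conv_params(channels, filters)
    size = pool_out(conv2d_out(size))
    channels = filters
    print(f"after conv({filters}) + pool: {size} x {size} x {channels}")

flat = size * size * channels
total_params += dense_params(flat, 512) + dense_params(512, 1)
print("flattened features:", flat)        # 17 * 17 * 128 = 36992
print("total parameters:", total_params)  # 19034177
```

Almost all of the parameters sit in the first Dense layer (36993 * 512), which is why the flattened feature-map size matters for model size far more than the number of convolution filters.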
Experiment 7
In today’s digital age, organizations receive massive amounts of customer feedback in the form of
reviews, social media comments, and survey responses. Analyzing the sentiment behind this text data
can help businesses understand customer satisfaction and improve their services. In this task, students
will preprocess a dataset of customer reviews, apply sentiment analysis techniques, and build a model
to classify the reviews as positive, negative, or neutral. The goal is to accurately classify the sentiment
of the reviews and evaluate the model’s performance.
Text Preprocessing: Preprocessing is a crucial step in sentiment analysis to reduce noise and standardize the
text for further analysis. Common steps include:
Lowercasing: Converting all words to lowercase to avoid treating words like "Happy" and "happy"
as different tokens.
Removing Stop Words: Eliminating frequently occurring but unimportant words.
Stemming and Lemmatization: Converting words to their base form to reduce dimensionality and
improve the model's generalization.
Tokenization: Breaking text into individual words or tokens is essential for applying models like
Bag-of-Words or Word Embeddings.
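The preprocessing steps listed above can be sketched in a few lines of plain Python. The stop-word list is a small hypothetical stand-in for a full list such as NLTK's, and `crude_stem` is a deliberately simple suffix stripper used only to illustrate the idea; it is not the Porter stemmer or a lemmatizer.

```python
import re

# Small illustrative stop-word list (hypothetical; real lists are much longer)
STOP_WORDS = {"a", "an", "and", "the", "to", "was", "were", "is", "of", "in"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def crude_stem(word):
    """Strip one common suffix if at least three characters remain."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, drop stop words, then reduce each token to a rough base form."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]

print(preprocess("The flights were delayed and the staff was Happy to help"))
# ['flight', 'delay', 'staff', 'happy', 'help']
```

Lowercasing maps "Happy" and "happy" to the same token, and stemming maps "flights" and "delayed" to "flight" and "delay", which shrinks the vocabulary the model has to learn.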
Feature Extraction:
Bag-of-Words: This is one of the simplest ways to represent text data, where each document is
represented as a vector of word counts or occurrences. Though easy to implement, BoW ignores word
order and context.
TF-IDF: This method improves on BoW by not just counting word occurrences but weighing them by
how important they are in the document relative to the entire corpus. This helps in diminishing the
impact of common but uninformative words.
Word Embeddings: Unlike BoW, Word Embeddings capture semantic relationships between words,
meaning that words with similar meanings will have similar vector representations. This is particularly
useful in capturing nuances in sentiment.
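The difference between Bag-of-Words counts and TF-IDF weights can be seen on a three-document toy corpus. The sketch below uses the textbook definitions, tf = count / document length and idf = log(N / df). scikit-learn's TfidfVectorizer uses a smoothed idf and normalizes each row by default, so its numbers differ, although the ranking effect is the same. The documents are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical mini-corpus
docs = ["good flight good crew", "bad flight", "good food"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def bag_of_words(tokens):
    """Vector of raw word counts, one entry per vocabulary word."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def tf_idf(tokens):
    """Term frequency weighted by the log inverse document frequency."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    weights = []
    for w in vocab:
        tf = counts[w] / len(tokens)
        df = sum(1 for doc in tokenized if w in doc)
        weights.append(tf * math.log(n_docs / df))
    return weights

bow = [bag_of_words(t) for t in tokenized]
tfidf = [tf_idf(t) for t in tokenized]

print(vocab)   # ['bad', 'crew', 'flight', 'food', 'good']
print(bow[0])  # [0, 1, 1, 0, 2]
print([round(w, 3) for w in tfidf[0]])
```

In the first document "good" occurs twice and "crew" once, yet "crew" receives the larger TF-IDF weight because it appears in only one document while "good" appears in two.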
Model Selection:
Naive Bayes: Naive Bayes is often used for text classification because it’s simple and efficient,
particularly in high-dimensional spaces such as text data. Despite its simplicity, it often works well for
sentiment analysis tasks.
Support Vector Machine (SVM): SVM is another popular model for text classification tasks due to its
ability to handle high-dimensional spaces and its robustness in separating different classes.
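As an illustration of why Naive Bayes is cheap to train on text, the sketch below implements a multinomial Naive Bayes classifier with add-one (Laplace) smoothing in plain Python. Training is a single pass that counts words per class. The four training reviews are invented for the example, and the function names are illustrative; scikit-learn's MultinomialNB is the usual choice in practice.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocab

def predict_nb(model, tokens):
    """Return the class with the highest log prior plus summed log likelihoods."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        total = sum(word_counts[label].values())
        score = math.log(class_docs[label] / n_docs)
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training reviews, already tokenized
train_docs = [["good", "great", "flight"], ["great", "crew"],
              ["bad", "delay"], ["bad", "rude", "crew"]]
train_labels = ["positive", "positive", "negative", "negative"]

model = train_nb(train_docs, train_labels)
print(predict_nb(model, ["great", "flight"]))  # positive
print(predict_nb(model, ["rude", "delay"]))    # negative
```

The smoothing term keeps a word that was never seen with a given class from driving that class's probability to zero.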
Model Evaluation:
Cross-Validation: This technique is commonly used to validate the model's performance on unseen
data. By splitting the dataset into k folds, each fold is used as the validation set while the remaining
data is used for training.
Metrics: Metrics like accuracy, precision, recall, and F1-score give insight into how well the model
performs, particularly when there’s class imbalance. Confusion matrices can also provide a clearer
understanding of the model's behavior in classifying sentiment correctly.
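The effect of class imbalance on these metrics is easy to reproduce. In the sketch below, a hypothetical classifier that favours the majority class reaches 70% accuracy while missing every neutral review; the confusion matrix and per-class recall expose what the accuracy figure hides. The labels are made up for illustration.

```python
LABELS = ["negative", "neutral", "positive"]

def confusion_matrix(y_true, y_pred):
    """Rows are true classes, columns are predicted classes, in LABELS order."""
    index = {label: i for i, label in enumerate(LABELS)}
    matrix = [[0] * len(LABELS) for _ in LABELS]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

# Hypothetical imbalanced test set: 6 positive, 2 negative, 2 neutral reviews
y_true = ["positive"] * 6 + ["negative"] * 2 + ["neutral"] * 2
y_pred = ["positive"] * 6 + ["negative", "positive"] + ["positive"] * 2

matrix = confusion_matrix(y_true, y_pred)
accuracy = sum(matrix[i][i] for i in range(len(LABELS))) / len(y_true)
recalls = per_class_recall(matrix)
macro_recall = sum(recalls) / len(recalls)

for label, row in zip(LABELS, matrix):
    print(f"{label:>8}: {row}")
print("accuracy:", accuracy)          # 0.7
print("recall per class:", recalls)   # [0.5, 0.0, 1.0]
print("macro recall:", macro_recall)  # 0.5
```

Macro-averaged metrics weight every class equally, so they drop sharply when a minority class is ignored, whereas accuracy is dominated by the majority class.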
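import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# The NLTK stop-word list and tokenizer data must be downloaded first (nltk.download).
df = pd.read_csv('BA_AirlineReviews.csv')
print(df.head())
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    tokens = word_tokenize(text.lower())
    tokens = [word for word in tokens if word not in stop_words and word.isalnum()]
    return ' '.join(tokens)

df['ProcessedReview'] = df['ReviewBody'].apply(preprocess_text)
print(df[['ReviewBody', 'ProcessedReview']].head())

# Map the numeric rating to a three-class sentiment label.
def map_sentiment(rating):
    if rating >= 4:
        return 'positive'
    elif rating == 3:
        return 'neutral'
    else:
        return 'negative'

df['sentiment'] = df['OverallRating'].apply(map_sentiment)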
# Regex-based variant of the cleaner; it replaces the NLTK version for the modelling steps.
def preprocess_text(text):
    text = re.sub(r'\W', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = text.lower()
    tokens = text.split()
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

df['cleaned_reviews'] = df['ReviewBody'].apply(preprocess_text)
X = df['cleaned_reviews']
y = df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Bag-of-Words features: fit the vocabulary on the training split only.
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)
model = LogisticRegression()
model.fit(X_train_bow, y_train)
y_pred = model.predict(X_test_bow)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Classification Report:\n', classification_report(y_test, y_pred))
# X_train_w2v and X_test_w2v are the Word2Vec feature matrices; the step that builds them is not included in this record.
scaler = StandardScaler()
X_train_w2v = scaler.fit_transform(X_train_w2v)
X_test_w2v = scaler.transform(X_test_w2v)
model = LogisticRegression(max_iter=200)
model.fit(X_train_w2v, y_train)
y_pred = model.predict(X_test_w2v)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Classification Report:\n', classification_report(y_test, y_pred))
ReviewBody TypeOfTraveller \
0 4 Hours before takeoff we received a Mail stat... Couple Leisure
1 I recently had a delay on British Airways from... Business
2 Boarded on time, but it took ages to get to th... Couple Leisure
3 5 days before the flight, we were advised by B... Couple Leisure
4 We traveled to Lisbon for our dream vacation, ... Couple Leisure
ProcessedReview
0 4 hours takeoff received mail stating cryptic ...
1 recently delay british airways bru lhr due sta...
2 boarded time took ages get runway due congesti...
3 5 days flight advised ba cancelled asked us re...
4 traveled lisbon dream vacation cruise portugal...
Accuracy: 0.7278617710583153
Classification Report:
precision recall f1-score support
Accuracy: 0.7451403887688985
Classification Report:
precision recall f1-score support
Experiment 8
In many real-world applications, machine learning models must be deployed to serve predictions to
users in real-time. Deploying models requires converting them into a format suitable for use in
production, setting up an environment (either cloud-based or using containers), and providing a user-
friendly interface for interaction. In this project, students will explore how to deploy a trained machine
learning model to the cloud or via containers, create a basic web interface, and allow users to interact
with the model through the interface.
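import pickle
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

df = pd.read_csv('students_placement.csv')
df.shape
X = df.drop(columns=['placed'])
y = df['placed']
# The split is missing from the record; an 80/20 split is assumed here.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_trf = scaler.fit_transform(X_train)
X_test_trf = scaler.transform(X_test)
accuracy_score(y_test, LogisticRegression().fit(X_train_trf, y_train).predict(X_test_trf))
# The SVC and the random forest are trained on the unscaled features, so the deployed model expects raw inputs.
svc = SVC(kernel='rbf')
svc.fit(X_train, y_train)
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
# Serialize the SVC for the web application.
with open('model.pkl', 'wb') as f:
    pickle.dump(svc, f)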
# Python (Flask application):
from flask import Flask, render_template, request
import pickle
import numpy as np

app = Flask(__name__)
# Load the model saved by the training script.
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict_placement():
    cgpa = float(request.form.get('cgpa'))
    iq = int(request.form.get('iq'))
    profile_score = int(request.form.get('profile_score'))
    # prediction
    result = model.predict(np.array([cgpa, iq, profile_score]).reshape(1, 3))
    if result[0] == 1:
        result = 'placed'
    else:
        result = 'not placed'
    return render_template('index.html', result=result)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
# HTML template (templates/index.html):
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Student Placement Predictor</title>
</head>
<body>
<h1>Student Placement Predictor</h1>
{% if result %}
<p>{{ result }}</p>
{% endif %}
</body>
</html>
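The deployment path in this experiment has two steps that are easy to get wrong: the model must survive a pickle round trip unchanged, and the form fields arrive as strings that may be missing or malformed. The sketch below illustrates both with the standard library only. The `params` dictionary is a hypothetical stand-in for a trained model, and `parse_form` and `predict` are illustrative helpers, not part of Flask or scikit-learn.

```python
import pickle

def predict(params, features):
    """Linear decision rule used as a stand-in for the trained classifier."""
    score = sum(w * x for w, x in zip(params["weights"], features)) + params["bias"]
    return "placed" if score > 0 else "not placed"

def parse_form(form):
    """Convert raw form strings to numbers; raise ValueError on missing or malformed fields."""
    try:
        return [float(form["cgpa"]), int(form["iq"]), int(form["profile_score"])]
    except KeyError as exc:
        raise ValueError(f"missing field: {exc.args[0]}") from exc

# Stand-in parameters (hypothetical values, not a trained model)
params = {"weights": [1.0, 0.02, 0.03], "bias": -11.0}

blob = pickle.dumps(params)    # the bytes that pickle.dump would write to model.pkl
restored = pickle.loads(blob)  # what the web application loads at start-up

features = parse_form({"cgpa": "8.5", "iq": "110", "profile_score": "70"})
print(predict(restored, features))  # placed

try:
    parse_form({"cgpa": "8.5"})
except ValueError as err:
    print("rejected:", err)
```

In a route handler, catching the ValueError and re-rendering the page with an error message avoids a server error when a field is left empty.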
Experiment 9
With the rise of online media, misinformation and fake news have become a serious concern, leading to
misinformed public opinions and societal harm. In this project, students will develop a machine learning model
to classify news articles as either real or fake. The task involves preprocessing text data, building a model, and
evaluating its performance. The ultimate goal is to create a reliable system that can help users distinguish
between trustworthy information and misleading or fake news articles.
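import re
import string
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data_fake = pd.read_csv('Fake.csv')
data_true = pd.read_csv('True.csv')
data_fake.head()
# Label the two sources: 0 = fake, 1 = real.
data_fake["class"] = 0
data_true['class'] = 1
data_fake.shape, data_true.shape
# Hold out the last 10 rows of each file for manual testing.
data_fake_manual_testing = data_fake.tail(10).copy()
data_fake = data_fake.iloc[:-10]
data_true_manual_testing = data_true.tail(10).copy()
data_true = data_true.iloc[:-10]
# data_merge is not defined in the record; concatenating the two labelled frames is assumed.
data_merge = pd.concat([data_fake, data_true], axis=0)
data_merge.columns
data = data_merge.drop(['title', 'subject', 'date'], axis=1)
data.isnull().sum()
# Shuffle the rows and rebuild the index.
data = data.sample(frac=1).reset_index(drop=True)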
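def wordopt(text):
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    # URLs and HTML tags are removed before the \W substitution, which would otherwise break them apart.
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)  # the replacement must be a str, not bytes
    text = re.sub(r'\W', ' ', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

data['text'] = data['text'].apply(wordopt)
x = data['text']
y = data['class']
# The split is missing from the record; a 75/25 split is assumed.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
vectorization = TfidfVectorizer()
xv_train = vectorization.fit_transform(x_train)
xv_test = vectorization.transform(x_test)
# The training cells are missing from the record; the models are inferred from the variable
# names used later, with default hyperparameters assumed.
LR = LogisticRegression().fit(xv_train, y_train)
DT = DecisionTreeClassifier().fit(xv_train, y_train)
GB = GradientBoostingClassifier().fit(xv_train, y_train)
RF = RandomForestClassifier().fit(xv_train, y_train)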
def output_lable(n):
    if n == 0:
        return "Fake News"
    elif n == 1:
        return "Not A Fake News"

def manual_testing(news):
    # Apply the same cleaning and vectorization as for the training data.
    testing_news = {"text": [news]}
    new_def_test = pd.DataFrame(testing_news)
    new_def_test['text'] = new_def_test["text"].apply(wordopt)
    new_x_test = new_def_test["text"]
    new_xv_test = vectorization.transform(new_x_test)
    pred_LR = LR.predict(new_xv_test)
    pred_DT = DT.predict(new_xv_test)
    pred_GB = GB.predict(new_xv_test)
    pred_RF = RF.predict(new_xv_test)
    # Report each model's verdict for the article.
    print("LR: {}\nDT: {}\nGB: {}\nRF: {}".format(
        output_lable(pred_LR[0]), output_lable(pred_DT[0]),
        output_lable(pred_GB[0]), output_lable(pred_RF[0])))

news = str(input())
manual_testing(news)
news = str(input())
manual_testing(news)
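The order of the substitutions in a text cleaner such as `wordopt` matters. If the `\W` replacement runs before the URL and HTML patterns, the characters `://`, `<` and `>` are already gone by the time those patterns are applied, so fragments of links and tags leak into the features. The sketch below applies the same six patterns in two orders to a made-up sentence; the `STEPS` table and the `clean` helper are illustrative names.

```python
import re
import string

# The six substitutions, keyed by a short name: (pattern, replacement)
STEPS = {
    "brackets": (r"\[.*?\]", ""),
    "non_word": (r"\W", " "),
    "urls": (r"https?://\S+|www\.\S+", ""),
    "html": (r"<.*?>+", ""),
    "punct": (r"[%s]" % re.escape(string.punctuation), ""),
    "digits": (r"\w*\d\w*", ""),
}

def clean(text, order):
    """Lowercase the text, apply the substitutions in the given order, collapse whitespace."""
    text = text.lower()
    for name in order:
        pattern, replacement = STEPS[name]
        text = re.sub(pattern, replacement, text)
    return " ".join(text.split())

NON_WORD_FIRST = ["brackets", "non_word", "urls", "html", "punct", "digits"]
URLS_FIRST = ["brackets", "urls", "html", "non_word", "punct", "digits"]

# Hypothetical input
sample = "BREAKING [video]: Visit https://example.com/a1 <b>now</b>! 3rd time in 2024."

print(clean(sample, NON_WORD_FIRST))  # breaking visit https example com b now b time in
print(clean(sample, URLS_FIRST))      # breaking visit now time in
```

With the URL and HTML patterns applied first, the link and the tag names disappear entirely instead of contributing stray tokens such as "https" and "b" to the TF-IDF vocabulary.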