Introduction To Python and Computer Programming 1704298503
Introduction To Python and Computer Programming 1704298503
AYUSHI SINGH
INDEX
Ass. PROLEM PAGES REMARKS
No. STATEMENT
1 Implement k-nearest
neighbours classification
using python
2 Extract the data from
database using python
3 The probability that it is
Friday and that a student
is absent is 3 %. Since
there are 5 school days in a
week, the probability that
it is Friday is 20 %. What
is the probability that a
student is absent given that
today is Friday? Apply
Baye’s rule in python to
get the result
4 Predict Canada's per
capita income in year 2020.
5 Employee retention
dataset
6 Predict a classification for
a case where VAR1=0.906
and VAR2=0.606
7 Predict if a person would
buy Insurance or not using
Logistic Regression
8 Implement linear
regression using python.
9 Implement Naïve Bayes
theorem to classify the
English text.
10 Use wine dataset from
sklearn.datasets to classify
wines into 3 categories.
11 Heart disease dataset.
12 Write a python program to
import and export data
using Pandas library
functions.
13 Using Python implement
Dimensionality reduction
using Principle Component
Analysis (PCA) method.
14 Using Python implement
Simple and Multiple
Linear Regression Models
15 Using Python develop
Logistic Regression Model
for a given dataset.
16 Using Python develop
Decision Tree
Classification model for a
given dataset and use it to
classify a new sample.
Assignment 1
Ass. 1: Implement k-nearest neighbours classification using python
# Create a KNN classifier with k=3 (you can change this value as ne
eded)
knn_classifier = KNeighborsClassifier(n_neighbors=3)
Accuracy: 1.0
Assignment 2
Ass. 2: Extract the data from database using python
In [1]: import sqlite3
# Create a table
cursor.execute('''
CREATE TABLE IF NOT EXISTS users (
id INTEGER PRIMARY KEY,
name TEXT,
age INTEGER
)
''')
import sqlite3
Assignment 3
Ass. 3: The probability that it is Friday and that a student is absent is 3 %. Since there are 5 school days in a
week, the probability that it is Friday is 20 %. What is the probability that a student is absent given that
today is Friday? Apply Baye’s rule in python to get the result
In [ ]: #Question 3. Probability of friday and student is absent..
p_A_given_B = 0.03
p_B =0.2
#p_not_A= 1-p_A
result= (p_A_given_B/p_B)
print("Answer is : ", result)
Answer is : 0.15
Assignment 4
Ass 4: Predict Canada's per capita income in year 2020. There is a data folder here on Kaggle, download
that and you will find canada_per_capita_income.csv file. Using this build a regression model and predict the
per capita income for Canadian citizens in year 2020. Link for csv file:
https://www.kaggle.com/datasets/gurdit559/canada-per-capita-income-single-variable-data-se
(https://www.kaggle.com/datasets/gurdit559/canada-per-capita-income-single-variable-data-se)
model = LinearRegression()
model.fit(X_train,Y_train)
y_pred=model.predict(X_test)
prediction_2020=model.predict([[2020]])
print(f'income:{prediction_2020[0]}')
year income
0 1970 3399.299037
1 1971 3768.297935
2 1972 4251.175484
3 1973 4804.463248
4 1974 5576.514583
income:[40993.56532482]
/usr/local/lib/python3.10/dist-packages/sklearn/base.py:439: UserWa
rning: X does not have valid feature names, but LinearRegression wa
s fitted with feature names
warnings.warn(
Assignment 5
Ass 5: Download employee retention dataset from here: https://www.kaggle.com/giripujar/hr analytics
(https://www.kaggle.com/giripujar/hr analytics).
1. Now do some exploratory data analysis to figure out which variables have direct and clear impact on
employee retention (i.e., whether they leave the company or continue to work)
2. Plot bar charts showing impact of employee salaries on retention
3. Plot bar charts showing correlation between department and employee retention
4. Now build logistic regression model using variables that were narrowed down in step 1
5. Measure the accuracy of the model
In [ ]: # Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy of the model: {accuracy}')
promotion_last_5years
count 14999.000000
mean 0.021268
std 0.144281
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 satisfaction_level 14999 non-null float64
1 last_evaluation 14999 non-null float64
2 number_project 14999 non-null int64
3 average_montly_hours 14999 non-null int64
4 time_spend_company 14999 non-null int64
5 Work_accident 14999 non-null int64
6 left 14999 non-null int64
7 promotion_last_5years 14999 non-null int64
8 Department 14999 non-null object
9 salary 14999 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB
None
<ipython-input-8-2e65f0a87cc2>:19: FutureWarning: The default value
of numeric_only in DataFrame.corr is deprecated. In a future versio
n, it will default to False. Select only valid columns or specify t
he value of numeric_only to silence this warning.
correlation_matrix = data.corr()
Accuracy of the model: 0.5043333333333333
Confusion Matrix:
[[ 0 123 130]
[ 0 988 486]
[ 0 748 525]]
/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logis
tic.py:458: ConvergenceWarning: lbfgs failed to converge (status=
1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Assignment 6
Given the following data, which specify classifications for nine combinations of VAR1 and VAR2 predict a
classification for a case where VAR1=0.906 and VAR2=0.606, using the result of kmeans clustering with 3
means (i.e., 3 centroids) VAR1 VAR2 CLASS 1.713 1.586 0 0.180 1.786 1 0.353 1.240 1 0.940 1.566 0 1.486
0.759 1 1.266 1.106 0 1.540 0.419 1 0.459 1.799 1 0.773 0.186 1
In [ ]: #Ass 6
from sklearn.cluster import KMeans
import numpy as np
data = np.array([
[1.713, 1.586, 0],
[0.180, 1.786, 1],
[0.353, 1.240, 1],
[0.940, 1.566, 0],
[1.486, 0.759, 1],
[1.266, 1.106, 0],
[1.540, 0.419, 1],
[0.459, 1.799, 1],
[0.773, 0.186, 1]
])
X = data[:, :2]
new_case = np.array([[0.906, 0.606]])
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(X)
predicted_cluster = kmeans.predict(new_case)
predicted_class = int(data[predicted_cluster, 2])
/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:
870: FutureWarning: The default value of `n_init` will change from
10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppre
ss the warning
warnings.warn(
Assignment 7
Ass. 7: Predict if a person would buy Insurance or not using Logistic Regression the insurance-data.csv file
existing on the link given below. https://www.kaggle.com/datasets/adepvenugopal/insurance-data
(https://www.kaggle.com/datasets/adepvenugopal/insurance-data)
In [ ]: #Question 7.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
data = pd.read_csv("/content/insurance_data.csv",sep=",")
data
print(data.head())
x=data[['age']]
y=data[['bought_insurance']]
X_train,X_test,Y_train,Y_test = train_test_split(x,y,test_size=0.3,
random_state=42)
model = LogisticRegression()
model.fit(X_train,Y_train)
y_pred=model.predict(X_test)
predict_yes_no=model.predict([[30]])
print(f'Yes_Not:{predict_yes_no[0]}')
age bought_insurance
0 22 0
1 25 0
2 47 1
3 52 0
4 46 1
Yes_Not:0
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.p
y:1143: DataConversionWarning: A column-vector y was passed when a
1d array was expected. Please change the shape of y to (n_samples,
), for example using ravel().
y = column_or_1d(y, warn=True)
/usr/local/lib/python3.10/dist-packages/sklearn/base.py:439: UserWa
rning: X does not have valid feature names, but LogisticRegression
was fitted with feature names
warnings.warn(
Assignment 8
Ass. 8: Implement linear regression using python.
In [ ]: # Importing the necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Generate some sample data
np.random.seed(0)
X = np.random.rand(100, 1) * 10
y = 2 * X + 1 + np.random.randn(100, 1) * 2 # Adding some random no
ise
Assignment 9
Ass. 9: Implement Naïve Bayes theorem to classify the English text.
In [ ]: from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
Accuracy: 1.00
Classification Report:
precision recall f1-score support
accuracy 1.00 1
macro avg 1.00 1.00 1.00 1
weighted avg 1.00 1.00 1.00 1
Assignment. 10
Ass. 10: Use wine dataset from sklearn.datasets to classify wines into 3 categories. Load the dataset and
split it into test and train. After that train the model using Gaussian and Multinominal classifier and post
which model performs better. Use the trained model to perform some predictions on test data
In [ ]: from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
accuracy 1.00 36
macro avg 1.00 1.00 1.00 36
weighted avg 1.00 1.00 1.00 36
accuracy 0.89 36
macro avg 0.88 0.85 0.86 36
weighted avg 0.89 0.89 0.88 36
Assignment. 11
Ass. 11: Download heart disease dataset heart.csv and do following, (credits of dataset:
https://www.kaggle.com/fedesoriano/heart-failure-prediction (https://www.kaggle.com/fedesoriano/heart-
failure-prediction))
model_xgb.fit(X_train, y_train)
print(metrics)
importances_rf = model_rf.feature_importances_
features = X_train.columns
importances_rf_dict = dict(zip(features, importances_rf))
sorted_importances_rf = sorted(importances_rf_dict.items(), key=lam
bda x: x[1], reverse=True)
HeartDisease 0 1
FastingBS
0 0.519886 0.480114
1 0.205607 0.794393
{'accuracy': 0.907608695652174, 'precision': 0.9107142857142857, 'r
ecall': 0.9357798165137615, 'f1': 0.9230769230769231}
Feature importances from Random Forest:
ST_Slope: 0.21900350843265542
Cholesterol: 0.11614969583831752
ChestPainType: 0.11473519381230739
MaxHR: 0.11081418028186905
Oldpeak: 0.10948878239475818
ExerciseAngina: 0.0940729487344156
Age: 0.08022767477386052
RestingBP: 0.07858468508468472
RestingECG: 0.028125742820119818
Sex: 0.027884311091323537
FastingBS: 0.020913276735688314
Assignment. 12
Ass. 12:Write a python program to import and export data using Pandas library functions.
In [ ]: import pandas as pd
data = {'Name': ['shiva', 'parvati', 'ganesh'],
'Age': [18, 14, 12],
'City': ['rohtak', 'karnal', 'sonipat'],
'Father name ':['shiv kumar','hawa singh ', 'ram ji'],
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
print("\nImported DataFrame from CSV:")
print(df)
Original DataFrame:
Name Age City Father name
0 shiva 18 rohtak shiv kumar
1 parvati 14 karnal hawa singh
2 ganesh 12 sonipat ram ji
Assignment. 14
Ass. 14: Using Python implement Simple and Multiple Linear Regression Models
In [ ]: # Question 14(Part A). Write a Python program to implement Simple L
inear Regression.
import numpy as np
import matplotlib.pyplot as plt
# Sample data
X = np.array([1, 2, 3, 4, 5, 6])
Y = np.array([2, 3, 5, 4, 6, 6])
# Calculate the slope (m) and the y-intercept (b) using the least s
quares method
numerator = np.sum((X - mean_X) * (Y - mean_Y))
denominator = np.sum((X - mean_X) ** 2)
m = numerator / denominator
b = mean_Y - m * mean_X
np.random.seed(0)
X1=np.random.rand(100,1)*10
X2=np.random.rand(100,1)*10
y1=2 * X1 + 1 + np.random.randn(100,1)*2
y2=2 * X2 + 1 + np.random.randn(100,1)*2
X1_train,X1_test,y1_train,y1_test = train_test_split(X1,y1,test_siz
e=0.2,random_state=42)
X2_train,X2_test,y2_train,y2_test = train_test_split(X2,y2,test_siz
e=0.2,random_state=42)
model1 = LinearRegression()
model2 = LinearRegression()
model1.fit(X1_train,y1_train)
model2.fit(X2_train,y2_train)
y_pred1=model1.predict(X1_test)
y_pred2=model2.predict(X2_test)
coefficient1 = model1.coef_
coefficient2 = model2.coef_
intercept1 = model1.intercept_
intercept2 = model2.intercept_
mse1 = mean_squared_error(y1_test,y_pred1)
mse2 = mean_squared_error(y2_test,y_pred2)
r1 = r2_score(y1_test,y_pred1)
r2 = r2_score(y1_test,y_pred2)
print("Coefficients: ",coefficient1)
print("Coefficients: ",coefficient2)
print("Intercept: ",intercept1)
print("Intercept: ",intercept2)
print("Mean_Squared_Error: ",mse1)
print("Mean_Squared_Error: ",mse2)
print("R-squared Score: ",r1)
print("R-squared Score: ",r2)
plt.scatter(X1_test,y1_test,color="blue")
plt.scatter(X2_test,y2_test,color="red")
plt.plot(X1_test,y_pred1,color="blue",linewidth=2)
plt.plot(X2_test,y_pred2,color="red",linewidth=2)
plt.xlabel("X")
plt.ylabel("y")
plt.title("Multiple Linear Regression")
plt.show()
Coefficients: [[1.88199746]]
Coefficients: [[2.00317266]]
Intercept: [1.26395959]
Intercept: [0.68918563]
Mean_Squared_Error: 2.9573753913252188
Mean_Squared_Error: 5.15919190916186
R-squared Score: 0.8663947329351374
R-squared Score: -1.5452856283756056
Assignment. 15
Ass. 15: Using Python develop Logistic Regression Model for a given dataset.
In [ ]: # Question 15. Logistic Regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
np.random.seed(0)
X = np.random.rand(100, 2) * 10
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
if X_train.shape[1] == 2:
X_min, X_max = X[:,0].min() - 1,X[:, 0].max() + 1
y_min, y_max = X[:,1].min() - 1,X[:, 1].max() + 1
XX, yy = np.meshgrid(np.arange(X_min, X_max, 0.1), np.arange(y_mi
n, y_max, 0.2))
Z = model.predict(np.c_[XX.ravel(), yy.ravel()])
Z = Z.reshape(XX.shape)
plt.contourf(XX, yy, Z, alpha=0.4)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, marker='o', lab
el='Actual')
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Logistic Regression Decision Boundary")
plt.legend()
plt.show()
Accuracy: 0.9
Cofusion Matrix:
[[ 8 2]
[ 0 10]]
Assignment. 16
Ass. 16: Using Python develop Decision Tree Classification model for a given dataset and use it to classify a
new sample.
In [ ]: # Question 16. Write a python program to implement Decision tree us
ing sklearn and its parameter tuning
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score