Introduction To Python and Computer Programming

BY

AYUSHI SINGH
INDEX

Ass. No.  Problem Statement
1. Implement k-nearest neighbours classification using Python
2. Extract the data from a database using Python
3. The probability that it is Friday and that a student is absent is 3%. Since there are 5 school days in a week, the probability that it is Friday is 20%. What is the probability that a student is absent given that today is Friday? Apply Bayes' rule in Python to get the result
4. Predict Canada's per capita income in the year 2020
5. Employee retention dataset
6. Predict a classification for a case where VAR1=0.906 and VAR2=0.606
7. Predict if a person would buy insurance or not using logistic regression
8. Implement linear regression using Python
9. Implement the Naïve Bayes theorem to classify English text
10. Use the wine dataset from sklearn.datasets to classify wines into 3 categories
11. Heart disease dataset
12. Write a Python program to import and export data using Pandas library functions
13. Using Python, implement dimensionality reduction using the Principal Component Analysis (PCA) method
14. Using Python, implement simple and multiple linear regression models
15. Using Python, develop a logistic regression model for a given dataset
16. Using Python, develop a decision tree classification model for a given dataset and use it to classify a new sample
Assignment 1
Ass. 1: Implement k-nearest neighbours classification using Python

In [ ]: # Question 1. Write a Python program to implement KNN using sklearn.
# Use the non-deprecated matplotlib-inline API for figure formats
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats('pdf', 'svg')

# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a KNN classifier with k=3 (you can change this value as needed)
knn_classifier = KNeighborsClassifier(n_neighbors=3)

# Fit the model to the training data
knn_classifier.fit(X_train, y_train)

# Make predictions on the test set
y_pred = knn_classifier.predict(X_test)

# Calculate and print the accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 1.0
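The value k=3 above is fixed by hand; a quick way to sanity-check that choice is to score several values of k on the same split. A minimal sketch, reusing X_train, X_test, y_train and y_test from the cell above (the range of k values is illustrative, not part of the original assignment):

In [ ]: # Hypothetical follow-up: sanity-check the choice of k on the same split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

for k in range(1, 11):  # illustrative range of k values
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    print(f"k={k}: accuracy={accuracy_score(y_test, clf.predict(X_test)):.3f}")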

Assignment 2
Ass. 2: Extract the data from a database using Python
In [1]: import sqlite3

# Connect to a database (this will create a new file named 'example.db' if it doesn't exist)
connection = sqlite3.connect('example.db')

# Create a cursor object to execute SQL queries
cursor = connection.cursor()

# Create a table
cursor.execute('''
CREATE TABLE IF NOT EXISTS users (
    id INTEGER PRIMARY KEY,
    name TEXT,
    age INTEGER
)
''')

# Insert some data
cursor.execute("INSERT INTO users (name, age) VALUES (?, ?)", ('John Doe', 25))
cursor.execute("INSERT INTO users (name, age) VALUES (?, ?)", ('Jane Smith', 30))

# Commit the changes and close the connection
connection.commit()
connection.close()

import sqlite3

# Connect to the database
connection = sqlite3.connect('example.db')
cursor = connection.cursor()

# Execute a SELECT query
cursor.execute("SELECT * FROM users")

# Fetch all the rows
rows = cursor.fetchall()

# Display the data
for row in rows:
    print(row)

# Close the connection
connection.close()

(1, 'John Doe', 25)
(2, 'Jane Smith', 30)
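For analysis work it is often more convenient to pull query results straight into a DataFrame instead of looping over cursor rows. A minimal sketch, assuming the example.db file created above:

In [ ]: # Hypothetical alternative: load the query result directly into pandas
import sqlite3
import pandas as pd

connection = sqlite3.connect('example.db')
df = pd.read_sql("SELECT * FROM users", connection)  # one call replaces cursor/fetchall
connection.close()
print(df)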

Assignment 3
Ass. 3: The probability that it is Friday and that a student is absent is 3%. Since there are 5 school days in a week, the probability that it is Friday is 20%. What is the probability that a student is absent given that today is Friday? Apply Bayes' rule in Python to get the result.
In [ ]: # Question 3. Probability that it is Friday and that a student is absent.
p_friday_and_absent = 0.03   # P(Friday and absent)
p_friday = 0.2               # P(Friday)

# Conditional probability: P(absent | Friday) = P(Friday and absent) / P(Friday)
result = p_friday_and_absent / p_friday
print("Answer is : ", result)

Answer is :  0.15
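The cell above applies the definition of conditional probability directly. The same answer falls out of Bayes' rule in its usual form, P(A|B) = P(B|A) * P(A) / P(B). A minimal sketch; the factorisation of the 0.03 joint probability into P(Friday|absent) and P(absent) below uses illustrative values only:

In [ ]: # Hypothetical helper: Bayes' rule in its usual form
def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B)"""
    return p_b_given_a * p_a / p_b

# P(Friday and absent) = P(Friday | absent) * P(absent) = 0.03, so any factorisation
# of 0.03 gives the same result, e.g. P(absent) = 0.05 and P(Friday | absent) = 0.6
# (illustrative values only):
print(bayes(p_b_given_a=0.6, p_a=0.05, p_b=0.2))   # 0.15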

Assignment 4
Ass 4: Predict Canada's per capita income in the year 2020. There is a data folder on Kaggle; download it and you will find the canada_per_capita_income.csv file. Using this, build a regression model and predict the per capita income of Canadian citizens in the year 2020. Link for the csv file:
https://www.kaggle.com/datasets/gurdit559/canada-per-capita-income-single-variable-data-se

In [ ]: # Question 4. Canada's per capita income
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

data = pd.read_csv("/content/Canada_per_capita_income (1).csv", sep=",")
print(data.head())

x = data[['year']]
y = data[['income']]
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.3, random_state=42)

model = LinearRegression()
model.fit(X_train, Y_train)
y_pred = model.predict(X_test)

# Predict for 2020; a DataFrame keeps the feature names the model was fitted with
prediction_2020 = model.predict(pd.DataFrame({'year': [2020]}))
print(f'income:{prediction_2020[0]}')

   year       income
0  1970  3399.299037
1  1971  3768.297935
2  1972  4251.175484
3  1973  4804.463248
4  1974  5576.514583
income:[40993.56532482]
Assignment 5

Ass 5: Download the employee retention dataset from here: https://www.kaggle.com/giripujar/hr-analytics

1. Do some exploratory data analysis to figure out which variables have a direct and clear impact on employee retention (i.e., whether employees leave the company or continue to work)
2. Plot bar charts showing the impact of employee salaries on retention
3. Plot bar charts showing the correlation between department and employee retention
4. Build a logistic regression model using the variables that were narrowed down in step 1
5. Measure the accuracy of the model
In [ ]: # Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the dataset
data = pd.read_csv('/content/HR_comma_sep.csv')

# 1. Exploratory Data Analysis (EDA)
# Explore the dataset to identify variables impacting retention
# Use methods like describe(), info(), and correlation analysis
print(data.describe())
print(data.info())
correlation_matrix = data.corr(numeric_only=True)  # text columns are not yet encoded
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()

# 2. Impact of salaries on retention
sns.countplot(x='salary', hue='left', data=data)
plt.title('Impact of Salaries on Retention')
plt.show()

# 3. Correlation between department and retention
sns.countplot(x='Department', hue='left', data=data)
plt.title('Correlation between Department and Retention')
plt.xticks(rotation=45)
plt.show()

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
data['salary'] = le.fit_transform(data['salary'])
data['Department'] = le.fit_transform(data['Department'])

# 4. Build Logistic Regression Model; 'left' is the retention target
X = data.drop('left', axis=1)
y = data['left']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the logistic regression model (max_iter raised so lbfgs converges)
model = LogisticRegression(max_iter=1000)

# Fit the model to the training data
model.fit(X_train, y_train)

# 5. Measure the accuracy of the model
# Predict the values using the test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy of the model: {accuracy}')

# Optionally, print the confusion matrix as well
conf_matrix = confusion_matrix(y_test, y_pred)
print(f'Confusion Matrix:\n{conf_matrix}')

       satisfaction_level  last_evaluation  number_project
count        14999.000000     14999.000000    14999.000000
mean             0.612834         0.716102        3.803054
std              0.248631         0.171169        1.232592
min              0.090000         0.360000        2.000000
25%              0.440000         0.560000        3.000000
50%              0.640000         0.720000        4.000000
75%              0.820000         0.870000        5.000000
max              1.000000         1.000000        7.000000

       average_montly_hours  time_spend_company  Work_accident          left
count          14999.000000        14999.000000   14999.000000  14999.000000
mean             201.050337            3.498233       0.144610      0.238083
std               49.943099            1.460136       0.351719      0.425924
min               96.000000            2.000000       0.000000      0.000000
25%              156.000000            3.000000       0.000000      0.000000
50%              200.000000            3.000000       0.000000      0.000000
75%              245.000000            4.000000       0.000000      0.000000
max              310.000000           10.000000       1.000000      1.000000

       promotion_last_5years
count           14999.000000
mean                0.021268
std                 0.144281
min                 0.000000
25%                 0.000000
50%                 0.000000
75%                 0.000000
max                 1.000000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 satisfaction_level 14999 non-null float64
1 last_evaluation 14999 non-null float64
2 number_project 14999 non-null int64
3 average_montly_hours 14999 non-null int64
4 time_spend_company 14999 non-null int64
5 Work_accident 14999 non-null int64
6 left 14999 non-null int64
7 promotion_last_5years 14999 non-null int64
8 Department 14999 non-null object
9 salary 14999 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB
None
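Label-encoding 'salary' and 'Department' imposes an arbitrary numeric ordering on nominal categories; one-hot encoding is the more common choice. A minimal sketch of the alternative preprocessing, assuming the same HR_comma_sep.csv file:

In [ ]: # Hypothetical alternative: one-hot encode the text columns instead of label encoding
import pandas as pd

data = pd.read_csv('/content/HR_comma_sep.csv')
encoded = pd.get_dummies(data, columns=['Department', 'salary'], drop_first=True)
X = encoded.drop('left', axis=1)
y = encoded['left']
print(X.columns.tolist())  # one indicator column per category level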

Assignment 6
Given the following data, which specify classifications for nine combinations of VAR1 and VAR2, predict a classification for a case where VAR1=0.906 and VAR2=0.606, using the result of k-means clustering with 3 means (i.e., 3 centroids):

VAR1   VAR2   CLASS
1.713  1.586  0
0.180  1.786  1
0.353  1.240  1
0.940  1.566  0
1.486  0.759  1
1.266  1.106  0
1.540  0.419  1
0.459  1.799  1
0.773  0.186  1
In [ ]: # Ass 6
from sklearn.cluster import KMeans
import numpy as np

data = np.array([
    [1.713, 1.586, 0],
    [0.180, 1.786, 1],
    [0.353, 1.240, 1],
    [0.940, 1.566, 0],
    [1.486, 0.759, 1],
    [1.266, 1.106, 0],
    [1.540, 0.419, 1],
    [0.459, 1.799, 1],
    [0.773, 0.186, 1]
])

X = data[:, :2]
new_case = np.array([[0.906, 0.606]])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)
predicted_cluster = kmeans.predict(new_case)[0]

# Map the cluster to a class via the majority class of the points assigned to it
classes_in_cluster = data[kmeans.labels_ == predicted_cluster, 2].astype(int)
predicted_class = int(np.bincount(classes_in_cluster).argmax())

print(f"The predicted class for VAR1=0.906 and VAR2=0.606 is: {predicted_class}")

The predicted class for VAR1=0.906 and VAR2=0.606 is: 1

Assignment 7
Ass. 7: Predict if a person would buy insurance or not using logistic regression on the insurance_data.csv file available at the link given below.
https://www.kaggle.com/datasets/adepvenugopal/insurance-data
In [ ]: # Question 7.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

data = pd.read_csv("/content/insurance_data.csv", sep=",")
print(data.head())

x = data[['age']]
y = data['bought_insurance']  # 1-D target avoids the column-vector warning
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.3, random_state=42)

model = LogisticRegression()
model.fit(X_train, Y_train)
y_pred = model.predict(X_test)

# Predict for a 30-year-old; a DataFrame keeps the feature names consistent
predict_yes_no = model.predict(pd.DataFrame({'age': [30]}))
print(f'Yes_Not:{predict_yes_no[0]}')

   age  bought_insurance
0   22                 0
1   25                 0
2   47                 1
3   52                 0
4   46                 1
Yes_Not:0
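Logistic regression also exposes class probabilities, which are often more informative than the hard 0/1 label. A minimal sketch, reusing the model fitted above (the age values are illustrative):

In [ ]: # Hypothetical follow-up: probability of buying insurance at several ages
for age in [25, 40, 55]:
    p = model.predict_proba(pd.DataFrame({'age': [age]}))[0][1]  # P(class 1)
    print(f"age {age}: P(buys insurance) = {p:.2f}")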

Assignment 8
Ass. 8: Implement linear regression using Python.
In [ ]: # Importing the necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Generate some sample data
np.random.seed(0)
X = np.random.rand(100, 1) * 10
y = 2 * X + 1 + np.random.randn(100, 1) * 2  # Adding some random noise

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression model
model = LinearRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate the coefficients and intercept
coefficients = model.coef_
intercept = model.intercept_

# Calculate the mean squared error and R-squared score
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print the results
print("Coefficients:", coefficients)
print("Intercept:", intercept)
print("Mean Squared Error:", mse)
print("R-squared Score:", r2)

# Plot the regression line
plt.scatter(X_test, y_test, color='blue')
plt.plot(X_test, y_pred, color='red', linewidth=2)
plt.xlabel("X")
plt.ylabel("y")
plt.title("Simple Linear Regression")
plt.show()

Coefficients: [[1.99610364]]
Intercept: [1.41268038]
Mean Squared Error: 3.671012987885715
R-squared Score: 0.8453207776609701

Assignment 9
Ass. 9: Implement the Naïve Bayes theorem to classify English text.
In [ ]: from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Sample data for demonstration
# You should replace this with your own dataset
texts = ["This is a positive sentence.", "Negative sentiment here.",
         "Another positive example.", "Negative vibes."]

# Labels for the corresponding texts
labels = ["positive", "negative", "positive", "negative"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

# Create a CountVectorizer to convert text data into a bag-of-words representation
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Create and train a Multinomial Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_vec, y_train)

# Make predictions on the test set
y_pred = nb_classifier.predict(X_test_vec)

# Evaluate the classifier
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Display classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 1.00
Classification Report:
              precision    recall  f1-score   support

    negative       1.00      1.00      1.00         1

    accuracy                           1.00         1
   macro avg       1.00      1.00      1.00         1
weighted avg       1.00      1.00      1.00         1
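With a single test sentence the report above says little; the trained vectorizer/classifier pair can also label unseen sentences directly. A minimal sketch, reusing vectorizer and nb_classifier from the cell above (the sentences are illustrative):

In [ ]: # Hypothetical follow-up: classify new, unseen sentences
new_texts = ["What a positive day.", "Negative mood again."]
new_vec = vectorizer.transform(new_texts)  # reuse the fitted vocabulary
print(nb_classifier.predict(new_vec))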

Assignment. 10
Ass. 10: Use the wine dataset from sklearn.datasets to classify wines into 3 categories. Load the dataset and split it into test and train sets. Then train the model using Gaussian and Multinomial classifiers and report which model performs better. Use the trained model to perform some predictions on test data.
In [ ]: from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Gaussian Naive Bayes classifier
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Make predictions on the test set
y_pred_gaussian = gnb.predict(X_test)

# Evaluate Gaussian Naive Bayes classifier
accuracy_gaussian = accuracy_score(y_test, y_pred_gaussian)
print("Gaussian Naive Bayes Classifier:")
print(f"Accuracy: {accuracy_gaussian:.2f}")
print("Classification Report:")
print(classification_report(y_test, y_pred_gaussian))

# Train Multinomial Naive Bayes classifier
mnb = MultinomialNB()
mnb.fit(X_train, y_train)

# Make predictions on the test set
y_pred_multinomial = mnb.predict(X_test)

# Evaluate Multinomial Naive Bayes classifier
accuracy_multinomial = accuracy_score(y_test, y_pred_multinomial)
print("\nMultinomial Naive Bayes Classifier:")
print(f"Accuracy: {accuracy_multinomial:.2f}")
print("Classification Report:")
print(classification_report(y_test, y_pred_multinomial))

Gaussian Naive Bayes Classifier:
Accuracy: 1.00
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        14
           1       1.00      1.00      1.00        14
           2       1.00      1.00      1.00         8

    accuracy                           1.00        36
   macro avg       1.00      1.00      1.00        36
weighted avg       1.00      1.00      1.00        36

Multinomial Naive Bayes Classifier:
Accuracy: 0.89
Classification Report:
              precision    recall  f1-score   support

           0       0.88      1.00      0.93        14
           1       0.93      0.93      0.93        14
           2       0.83      0.62      0.71         8

    accuracy                           0.89        36
   macro avg       0.88      0.85      0.86        36
weighted avg       0.89      0.89      0.88        36
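On this split the Gaussian model performs better, which is plausible since the wine features are continuous measurements rather than counts. A minimal sketch of comparing the two scores programmatically and showing a few test-set predictions, reusing the objects fitted above:

In [ ]: # Hypothetical follow-up: pick the better model and show sample predictions
best = gnb if accuracy_gaussian >= accuracy_multinomial else mnb
print("Better model:", type(best).__name__)
print("Predicted:", best.predict(X_test[:5]))
print("Actual:   ", y_test[:5])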

Assignment. 11
Ass. 11: Download the heart disease dataset heart.csv and do the following (credits of dataset: https://www.kaggle.com/fedesoriano/heart-failure-prediction):

1. Load the heart disease dataset into a pandas dataframe
2. Convert text columns to numbers using label encoding and one-hot encoding
3. Apply scaling
4. Build a classification model using various methods (SVM, logistic regression, random forest) and check which model gives you the best accuracy
5. Now use PCA to reduce dimensions, retrain your model and see what impact it has on your model in terms of accuracy. Keep in mind that doing PCA often reduces accuracy, but computation is much lighter, and that's the trade-off you need to consider while building models in real life (a PCA sketch follows the results below)
In [ ]: import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('/content/heart.csv')
data.head()
rows = data.shape[0]
cols = data.shape[1]
print(f'Rows : {rows}\nColumns : {cols}')
data.info()
data.describe()
data['HeartDisease'].value_counts()

continuous_columns = ['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak']

# Plotting box plots for each continuous column
for column in continuous_columns:
    plt.figure(figsize=(10, 5))
    sns.boxplot(x=data[column])
    plt.title(f'Box plot of {column}')
    plt.show()

gender_heart = pd.crosstab(data['Sex'], data['HeartDisease'], normalize='index')
print(gender_heart)
gender_heart.plot(kind='bar')
plt.title('HeartDisease by Sex')
plt.show()

chest_heart = pd.crosstab(data['ChestPainType'], data['HeartDisease'], normalize='index')
print(chest_heart)
chest_heart.plot(kind='bar')
plt.title('HeartDisease by ChestPainType')
plt.show()

rest_heart = pd.crosstab(data['RestingECG'], data['HeartDisease'], normalize='index')
print(rest_heart)
rest_heart.plot(kind='bar')
plt.title('HeartDisease by RestingECG')
plt.show()

angina_heart = pd.crosstab(data['ExerciseAngina'], data['HeartDisease'], normalize='index')
print(angina_heart)
angina_heart.plot(kind='bar')
plt.title('HeartDisease by ExerciseAngina')
plt.show()

st_heart = pd.crosstab(data['ST_Slope'], data['HeartDisease'], normalize='index')
print(st_heart)
st_heart.plot(kind='bar')
plt.title('HeartDisease by ST_Slope')
plt.show()

fasting_heart = pd.crosstab(data['FastingBS'], data['HeartDisease'], normalize='index')
print(fasting_heart)
fasting_heart.plot(kind='bar')
plt.title('HeartDisease by FastingBS')
plt.show()

for column in continuous_columns:
    plt.figure(figsize=(10, 5))  # Set the figure size as desired
    plt.plot(data['HeartDisease'], data[column], linestyle='none', marker='o')
    plt.title(f'Plot of {column}')
    plt.show()

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
data['Sex'] = le.fit_transform(data['Sex'])
data['ChestPainType'] = le.fit_transform(data['ChestPainType'])
data['RestingECG'] = le.fit_transform(data['RestingECG'])
data['ExerciseAngina'] = le.fit_transform(data['ExerciseAngina'])
data['ST_Slope'] = le.fit_transform(data['ST_Slope'])

# Max scaling: divide every column by its maximum
data = data / data.max()

from sklearn.model_selection import train_test_split

X = data.drop('HeartDisease', axis=1)
y = data['HeartDisease']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=data['FastingBS']
)

from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Adjusting the parameters of the Random Forest
model_rf = RandomForestClassifier(n_estimators=50,  # number of estimators
                                  random_state=42)
model_rf.fit(X_train, y_train)

# Adjusting parameters for XGBoost
model_xgb = XGBClassifier(n_estimators=100,
                          max_depth=3,           # limiting depth of trees
                          learning_rate=0.1,     # regularization via learning rate
                          subsample=0.8,         # subsample of rows to prevent overfitting
                          colsample_bytree=0.7,  # subsample of features for each tree
                          eval_metric='logloss',
                          random_state=42)
model_xgb.fit(X_train, y_train)

# Stack the base models with a logistic regression meta-learner
model_stack = StackingClassifier(estimators=[('rf', model_rf), ('xgb', model_xgb)],
                                 final_estimator=LogisticRegression(),
                                 stack_method='auto',
                                 n_jobs=-1)
model_stack.fit(X_train, y_train)

# Evaluate the stacked model
y_pred = model_stack.predict(X_test)
metrics = {
    'accuracy': accuracy_score(y_test, y_pred),
    'precision': precision_score(y_test, y_pred),
    'recall': recall_score(y_test, y_pred),
    'f1': f1_score(y_test, y_pred),
}
print(metrics)

# Get feature importances from Random Forest
importances_rf = model_rf.feature_importances_
features = X_train.columns
importances_rf_dict = dict(zip(features, importances_rf))
sorted_importances_rf = sorted(importances_rf_dict.items(), key=lambda x: x[1], reverse=True)

print("Feature importances from Random Forest:")
for feature, importance in sorted_importances_rf:
    print(f"{feature}: {importance}")

# Get feature importances from XGBoost
importances_xgb = model_xgb.feature_importances_
importances_xgb_dict = dict(zip(features, importances_xgb))
sorted_importances_xgb = sorted(importances_xgb_dict.items(), key=lambda x: x[1], reverse=True)

print("\nFeature importances from XGBoost:")
for feature, importance in sorted_importances_xgb:
    print(f"{feature}: {importance}")
Rows : 918
Columns : 12
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 918 non-null int64
1 Sex 918 non-null object
2 ChestPainType 918 non-null object
3 RestingBP 918 non-null int64
4 Cholesterol 918 non-null int64
5 FastingBS 918 non-null int64
6 RestingECG 918 non-null object
7 MaxHR 918 non-null int64
8 ExerciseAngina 918 non-null object
9 Oldpeak 918 non-null float64
10 ST_Slope 918 non-null object
11 HeartDisease 918 non-null int64
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB
HeartDisease 0 1
Sex
F 0.740933 0.259067
M 0.368276 0.631724
HeartDisease 0 1
ChestPainType
ASY 0.209677 0.790323
ATA 0.861272 0.138728
NAP 0.645320 0.354680
TA 0.565217 0.434783
HeartDisease 0 1
RestingECG
LVH 0.436170 0.563830
Normal 0.483696 0.516304
ST 0.342697 0.657303
HeartDisease 0 1
ExerciseAngina
N 0.648995 0.351005
Y 0.148248 0.851752
HeartDisease 0 1
ST_Slope
Down 0.222222 0.777778
Flat 0.171739 0.828261
Up 0.802532 0.197468

HeartDisease 0 1
FastingBS
0 0.519886 0.480114
1 0.205607 0.794393
{'accuracy': 0.907608695652174, 'precision': 0.9107142857142857, 'recall': 0.9357798165137615, 'f1': 0.9230769230769231}
Feature importances from Random Forest:
ST_Slope: 0.21900350843265542
Cholesterol: 0.11614969583831752
ChestPainType: 0.11473519381230739
MaxHR: 0.11081418028186905
Oldpeak: 0.10948878239475818
ExerciseAngina: 0.0940729487344156
Age: 0.08022767477386052
RestingBP: 0.07858468508468472
RestingECG: 0.028125742820119818
Sex: 0.027884311091323537
FastingBS: 0.020913276735688314

Feature importances from XGBoost:
ST_Slope: 0.31917333602905273
ChestPainType: 0.16173650324344635
ExerciseAngina: 0.14603181183338165
FastingBS: 0.06605455279350281
Sex: 0.05703895911574364
Oldpeak: 0.05463773384690285
Cholesterol: 0.04760991781949997
MaxHR: 0.044129274785518646
RestingECG: 0.04399832338094711
Age: 0.030759083107113838
RestingBP: 0.02883044071495533
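Step 5 of the assignment (PCA) is not exercised by the cell above. A minimal sketch of the idea, reusing X_train, X_test, y_train and y_test from above; the component count and the logistic-regression stand-in for the stacked model are illustrative assumptions, not part of the original notebook:

In [ ]: # Hypothetical step 5: reduce dimensions with PCA, then retrain and compare
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

pca = PCA(n_components=5)                  # illustrative choice of components
X_train_pca = pca.fit_transform(X_train)   # fit the projection on training data only
X_test_pca = pca.transform(X_test)

model_pca = LogisticRegression(max_iter=1000)
model_pca.fit(X_train_pca, y_train)
print("Accuracy after PCA:", accuracy_score(y_test, model_pca.predict(X_test_pca)))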

Assignment. 12
Ass. 12: Write a Python program to import and export data using Pandas library functions.

In [ ]: import pandas as pd

data = {'Name': ['shiva', 'parvati', 'ganesh'],
        'Age': [18, 14, 12],
        'City': ['rohtak', 'karnal', 'sonipat'],
        'Father name ': ['shiv kumar', 'hawa singh ', 'ram ji'],
        }

df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Export the DataFrame to a CSV file, then import it back
df.to_csv('people.csv', index=False)
df_imported = pd.read_csv('people.csv')

print("\nImported DataFrame from CSV:")
print(df_imported)

Original DataFrame:
      Name  Age     City Father name
0    shiva   18   rohtak  shiv kumar
1  parvati   14   karnal  hawa singh
2   ganesh   12  sonipat      ram ji

Imported DataFrame from CSV:
      Name  Age     City Father name
0    shiva   18   rohtak  shiv kumar
1  parvati   14   karnal  hawa singh
2   ganesh   12  sonipat      ram ji
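pandas offers matching reader/writer pairs for other formats as well. A minimal sketch, reusing df from the cell above (file names are illustrative):

In [ ]: # Hypothetical follow-up: export and import in JSON and Excel formats
df.to_json('people.json')
print(pd.read_json('people.json'))

# Excel I/O requires the openpyxl package to be installed
df.to_excel('people.xlsx', index=False)
print(pd.read_excel('people.xlsx'))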
Assignment. 13
Ass. 13: Using Python implement dimensionality reduction using the Principal Component Analysis (PCA) method.
In [ ]: import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Load sample data (for demonstration)
data = load_iris()
X = data.data    # Features
y = data.target  # Target

# Initialize PCA and specify the number of components (dimensions)
pca = PCA(n_components=2)

# Fit the PCA model to the data
X_pca = pca.fit_transform(X)

# Percentage of variance explained by each of the selected components
explained_variance_ratio = pca.explained_variance_ratio_
print("Explained variance ratio:", explained_variance_ratio)

# Plotting the transformed data
plt.figure(figsize=(8, 6))
for i, target_name in enumerate(data.target_names):
    plt.scatter(
        X_pca[y == i, 0],
        X_pca[y == i, 1],
        label=target_name
    )

plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of IRIS dataset')
plt.legend()
plt.show()

Explained variance ratio: [0.92461872 0.05306648]
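The two components above together explain roughly 97.8% of the variance. A common way to choose the number of components automatically is to ask PCA for a target fraction of explained variance. A minimal sketch, reusing the iris features X from the cell above (the 0.95 threshold is illustrative):

In [ ]: # Hypothetical follow-up: let PCA pick the number of components
pca_auto = PCA(n_components=0.95)  # keep enough components for 95% of the variance
X_reduced = pca_auto.fit_transform(X)
print("Components kept:", pca_auto.n_components_)
print("Cumulative variance:", pca_auto.explained_variance_ratio_.sum())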

Assignment. 14
Ass. 14: Using Python implement Simple and Multiple Linear Regression Models
In [ ]: # Question 14 (Part A). Write a Python program to implement Simple Linear Regression.
import numpy as np
import matplotlib.pyplot as plt

# Sample data
X = np.array([1, 2, 3, 4, 5, 6])
Y = np.array([2, 3, 5, 4, 6, 6])

# Calculate the mean of X and Y
mean_X = np.mean(X)
mean_Y = np.mean(Y)

# Calculate the total number of data points
n = len(X)

# Calculate the slope (m) and the y-intercept (b) using the least squares method
numerator = np.sum((X - mean_X) * (Y - mean_Y))
denominator = np.sum((X - mean_X) ** 2)
m = numerator / denominator
b = mean_Y - m * mean_X

# Print the slope and y-intercept
print("Slope (m):", m)
print("Y-Intercept (b):", b)

# Predict the values of Y based on the linear regression model
Y_pred = m * X + b

# Plot the original data and the regression line
plt.scatter(X, Y, label='Original Data', color='blue')
plt.plot(X, Y_pred, label='Regression Line', color='red')
plt.legend()
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Simple Linear Regression')
plt.show()

Slope (m): 0.8
Y-Intercept (b): 1.5333333333333328
In [ ]: # Multiple Linear Regression (note: this cell fits two separate single-feature
# models side by side; a true multiple-regression sketch follows below).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

np.random.seed(0)
X1 = np.random.rand(100, 1) * 10
X2 = np.random.rand(100, 1) * 10
y1 = 2 * X1 + 1 + np.random.randn(100, 1) * 2
y2 = 2 * X2 + 1 + np.random.randn(100, 1) * 2

X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.2, random_state=42)
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.2, random_state=42)

model1 = LinearRegression()
model2 = LinearRegression()

model1.fit(X1_train, y1_train)
model2.fit(X2_train, y2_train)

y_pred1 = model1.predict(X1_test)
y_pred2 = model2.predict(X2_test)

coefficient1 = model1.coef_
coefficient2 = model2.coef_
intercept1 = model1.intercept_
intercept2 = model2.intercept_

mse1 = mean_squared_error(y1_test, y_pred1)
mse2 = mean_squared_error(y2_test, y_pred2)
r1 = r2_score(y1_test, y_pred1)
r2 = r2_score(y2_test, y_pred2)  # score model2 against its own test targets

print("Coefficients: ", coefficient1)
print("Coefficients: ", coefficient2)
print("Intercept: ", intercept1)
print("Intercept: ", intercept2)
print("Mean_Squared_Error: ", mse1)
print("Mean_Squared_Error: ", mse2)
print("R-squared Score: ", r1)
print("R-squared Score: ", r2)

plt.scatter(X1_test, y1_test, color="blue")
plt.scatter(X2_test, y2_test, color="red")
plt.plot(X1_test, y_pred1, color="blue", linewidth=2)
plt.plot(X2_test, y_pred2, color="red", linewidth=2)
plt.xlabel("X")
plt.ylabel("y")
plt.title("Multiple Linear Regression")
plt.show()

Coefficients:  [[1.88199746]]
Coefficients:  [[2.00317266]]
Intercept:  [1.26395959]
Intercept:  [0.68918563]
Mean_Squared_Error:  2.9573753913252188
Mean_Squared_Error:  5.15919190916186
R-squared Score:  0.8663947329351374
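As noted in the cell above, fitting two models on one feature each is not multiple regression. A genuine multiple linear regression puts both features into a single model; a minimal sketch on the same kind of synthetic setup (the combined target formula is an assumption for illustration):

In [ ]: # Hypothetical sketch: one regression with two features
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

np.random.seed(0)
X = np.random.rand(100, 2) * 10                                # two features per sample
y = 2 * X[:, 0] + 3 * X[:, 1] + 1 + np.random.randn(100) * 2   # illustrative target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)

print("Coefficients:", model.coef_)    # one coefficient per feature
print("Intercept:", model.intercept_)
print("R-squared:", r2_score(y_test, model.predict(X_test)))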

Assignment. 15
Ass. 15: Using Python develop Logistic Regression Model for a given dataset.
In [ ]: # Question 15. Logistic Regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

np.random.seed(0)
X = np.random.rand(100, 2) * 10

# Label is 1 when the two features sum to more than 10
y = (X[:, 0] + X[:, 1] > 10).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print("Accuracy: ", accuracy)
print("Confusion Matrix: ")
print(conf_matrix)

# Plot the decision boundary over a grid of feature values
if X_train.shape[1] == 2:
    X_min, X_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    XX, yy = np.meshgrid(np.arange(X_min, X_max, 0.1), np.arange(y_min, y_max, 0.2))

    Z = model.predict(np.c_[XX.ravel(), yy.ravel()])
    Z = Z.reshape(XX.shape)
    plt.contourf(XX, yy, Z, alpha=0.4)
    plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, marker='o', label='Actual')
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.title("Logistic Regression Decision Boundary")
    plt.legend()
    plt.show()

Accuracy:  0.9
Confusion Matrix:
[[ 8  2]
 [ 0 10]]

Assignment. 16
Ass. 16: Using Python develop Decision Tree Classification model for a given dataset and use it to classify a
new sample.
In [ ]: # Question 16. Write a Python program to implement a Decision Tree using sklearn and tune its parameters.
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Decision Tree Classifier
dt_classifier = DecisionTreeClassifier()

# Define the hyperparameters to tune
param_grid = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create a GridSearchCV object
grid_search = GridSearchCV(dt_classifier, param_grid, cv=5)

# Fit the model to the training data
grid_search.fit(X_train, y_train)

# Print the best parameters found by GridSearchCV
print("Best Parameters:", grid_search.best_params_)

# Make predictions on the test set
y_pred = grid_search.predict(X_test)

# Calculate and print the accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Best Parameters: {'criterion': 'entropy', 'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 2, 'splitter': 'random'}
Accuracy: 0.9666666666666667
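The assignment also asks to classify a new sample; the tuned model can do so directly. A minimal sketch, reusing grid_search and iris from the cell above (the measurements form an illustrative iris-like sample, in cm):

In [ ]: # Hypothetical follow-up: classify a single new sample with the tuned tree
import numpy as np

new_sample = np.array([[5.1, 3.5, 1.4, 0.2]])  # sepal/petal lengths and widths
predicted = grid_search.predict(new_sample)
print("Predicted class:", iris.target_names[predicted[0]])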
