
Practical Work No. 3

in the course "Information Technologies of Smart Systems"

on the topic "Cardiology Clinic"


Completed by:

student of group KN-305

Довбуш Павло
In [1]: !pip install numpy
!pip install matplotlib
!pip install pandas
!pip install seaborn
!pip install tabulate

Defaulting to user installation because normal site-packages is not writeable


Requirement already satisfied: numpy in c:\users\олеся\appdata\roaming\python\python310\site-packages (1.23.5)
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: matplotlib in c:\users\олеся\appdata\roaming\python\python310\site-packages (3.7.1)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib) (1.0.7)
Requirement already satisfied: cycler>=0.10 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib) (4.39.0)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib) (1.4.4)
Requirement already satisfied: numpy>=1.20 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib) (1.23.5)
Requirement already satisfied: packaging>=20.0 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib) (23.0)
Requirement already satisfied: pillow>=6.2.0 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib) (9.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib) (2.8.2)
Requirement already satisfied: six>=1.5 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from python-dateutil>=2.7->matplotlib) (1.16.0)
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: pandas in c:\users\олеся\appdata\roaming\python\python310\site-packages (1.5.3)
Requirement already satisfied: python-dateutil>=2.8.1 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from pandas) (2022.7.1)
Requirement already satisfied: numpy>=1.21.0 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from pandas) (1.23.5)
Requirement already satisfied: six>=1.5 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: seaborn in c:\users\олеся\appdata\roaming\python\python310\site-packages (0.12.2)
Requirement already satisfied: numpy!=1.24.0,>=1.17 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from seaborn) (1.23.5)
Requirement already satisfied: pandas>=0.25 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from seaborn) (1.5.3)
Requirement already satisfied: matplotlib!=3.6.1,>=3.1 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from seaborn) (3.7.1)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (1.0.7)
Requirement already satisfied: cycler>=0.10 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (0.11.0)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (4.39.0)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (1.4.4)
Requirement already satisfied: packaging>=20.0 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (23.0)
Requirement already satisfied: pillow>=6.2.0 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (9.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (3.0.9)
Requirement already satisfied: python-dateutil>=2.7 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from matplotlib!=3.6.1,>=3.1->seaborn) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from pandas>=0.25->seaborn) (2022.7.1)
Requirement already satisfied: six>=1.5 in c:\users\олеся\appdata\roaming\python\python310\site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.1->seaborn) (1.16.0)
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: tabulate in c:\users\олеся\appdata\roaming\python\python310\site-packages (0.9.0)

In [2]: import os.path

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import scipy.stats as stats

In [3]: import warnings

warnings.simplefilter('ignore')

In [4]: pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

Read the dataset


In [5]: print(os.path.exists("dataset_3.csv"))

True

In [6]: ds = pd.read_csv("dataset_3.csv")
ds.head()

Out[6]: Unnamed: 0  Age  Sex  ChestPainType  RestingBP  Cholesterol  FastingBS  RestingECG  MaxHR  ExerciseAngina  Oldpeak  ST_Slope  HeartDisease
0  0  40.0  M  ATA  140.0  289.0  0.0  Normal  172.0  N  0.0  Up  0.0
1  1  49.0  F  NAP  NaN  180.0  NaN  Normal  156.0  N  1.0  Flat  1.0
2  2  37.0  M  ATA  130.0  283.0  0.0  ST  NaN  N  0.0  Up  0.0
3  3  48.0  F  ASY  138.0  214.0  0.0  Normal  108.0  Y  1.5  Flat  1.0
4  4  54.0  M  NAP  150.0  195.0  0.0  Normal  122.0  N  0.0  Up  0.0

In [7]: print('columns count - ', len(ds.columns), '\n')
print('columns: ', list(ds.columns))

columns count - 13

columns: ['Unnamed: 0', 'Age', 'Sex', 'ChestPainType', 'RestingBP', 'Cholesterol', 'FastingBS', 'RestingECG', 'MaxHR', 'ExerciseAngina', 'Oldpeak', 'ST_Slope', 'HeartDisease']

Missing data imputation


In [8]: ds.shape

Out[8]: (918, 13)

In [9]: ds.dtypes

Out[9]: Unnamed: 0 int64
Age float64
Sex object
ChestPainType object
RestingBP float64
Cholesterol float64
FastingBS float64
RestingECG object
MaxHR float64
ExerciseAngina object
Oldpeak float64
ST_Slope object
HeartDisease float64
dtype: object

In [10]: for col in ds.columns:
    if ds[col].isnull().values.any():
        print("Missing data in ", col, ds[col].isnull().sum())

Missing data in Age 45
Missing data in Sex 18
Missing data in ChestPainType 18
Missing data in RestingBP 36
Missing data in Cholesterol 82
Missing data in FastingBS 45
Missing data in RestingECG 27
Missing data in MaxHR 91
Missing data in ExerciseAngina 9
Missing data in Oldpeak 73
Missing data in ST_Slope 91
Missing data in HeartDisease 64
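
The same check can be written as a one-liner; a sketch of an equivalent, more idiomatic variant:

# Equivalent (hypothetical) variant: per-column counts of missing values, non-zero only
missing = ds.isnull().sum()
print(missing[missing > 0])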

In [11]: def impute_na(df, variable, value):
    return df[variable].fillna(value)

In [12]: Age_median = ds['Age'].median()
RestingBP_median = ds['RestingBP'].median()
Cholesterol_median = ds['Cholesterol'].median()
FastingBS_median = ds['FastingBS'].median()
MaxHR_median = ds['MaxHR'].median()
Oldpeak_median = ds['Oldpeak'].median()
# .mode() returns a Series; take the first element to get a scalar value
Sex_mode = ds['Sex'].mode()[0]
ChestPainType_mode = ds['ChestPainType'].mode()[0]
RestingECG_mode = ds['RestingECG'].mode()[0]
ExerciseAngina_mode = ds['ExerciseAngina'].mode()[0]
ST_Slope_mode = ds['ST_Slope'].mode()[0]
HeartDisease_median = ds['HeartDisease'].median()

In [13]: # Numeric columns: replace missing values with the column median
ds['Age'] = impute_na(ds, 'Age', Age_median)
ds['RestingBP'] = impute_na(ds, 'RestingBP', RestingBP_median)
ds['Cholesterol'] = impute_na(ds, 'Cholesterol', Cholesterol_median)
ds['FastingBS'] = impute_na(ds, 'FastingBS', FastingBS_median)
ds['MaxHR'] = impute_na(ds, 'MaxHR', MaxHR_median)
ds['Oldpeak'] = impute_na(ds, 'Oldpeak', Oldpeak_median)
ds['HeartDisease'] = impute_na(ds, 'HeartDisease', HeartDisease_median)

# Categorical columns: replace missing values with the most frequent category (mode)
ds['Sex'] = impute_na(ds, 'Sex', Sex_mode)
ds['ChestPainType'] = impute_na(ds, 'ChestPainType', ChestPainType_mode)
ds['RestingECG'] = impute_na(ds, 'RestingECG', RestingECG_mode)
ds['ExerciseAngina'] = impute_na(ds, 'ExerciseAngina', ExerciseAngina_mode)
ds['ST_Slope'] = impute_na(ds, 'ST_Slope', ST_Slope_mode)

# Forward-fill as a safety net; with the scalar modes above nothing should remain unfilled
ds['Sex'].fillna(method='ffill', inplace=True)
ds['ChestPainType'].fillna(method='ffill', inplace=True)
ds['RestingECG'].fillna(method='ffill', inplace=True)
ds['ExerciseAngina'].fillna(method='ffill', inplace=True)
ds['ST_Slope'].fillna(method='ffill', inplace=True)

In [14]: for col in ds.columns:
    if ds[col].isnull().values.any():
        print("Missing data in ", col, ds[col].isnull().sum())
# (prints nothing: all missing values have been imputed)

Categorical encoding
In [15]: ds.nunique()

Out[15]: Unnamed: 0 918
Age 50
Sex 2
ChestPainType 4
RestingBP 66
Cholesterol 217
FastingBS 2
RestingECG 3
MaxHR 118
ExerciseAngina 2
Oldpeak 51
ST_Slope 3
HeartDisease 2
dtype: int64

In [16]: ds['Sex'].unique()

Out[16]: array(['M', 'F'], dtype=object)

In [17]: ds['ChestPainType'].unique()

Out[17]: array(['ATA', 'NAP', 'ASY', 'TA'], dtype=object)

In [18]: ds['RestingECG'].unique()

Out[18]: array(['Normal', 'ST', 'LVH'], dtype=object)

In [19]: ds['ExerciseAngina'].unique()

Out[19]: array(['N', 'Y'], dtype=object)

In [20]: ds['ST_Slope'].unique()

Out[20]: array(['Up', 'Flat', 'Down'], dtype=object)

In [21]: from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [22]: ds['Sex'] = le.fit_transform(ds['Sex'])
ds['ChestPainType'] = le.fit_transform(ds['ChestPainType'])
ds['RestingECG'] = le.fit_transform(ds['RestingECG'])
ds['ExerciseAngina'] = le.fit_transform(ds['ExerciseAngina'])
ds['ST_Slope'] = le.fit_transform(ds['ST_Slope'])
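
LabelEncoder assigns integer codes in alphabetical order of the category labels. A quick sanity check (a sketch; after the cell above, le still holds the fit for the last column, ST_Slope):

# Hypothetical check: recover the label-to-code mapping of the last fitted column
print(dict(zip(le.classes_, le.transform(le.classes_))))
# expected: {'Down': 0, 'Flat': 1, 'Up': 2}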

In [23]: ds.head(10)

Out[23]: Unnamed: 0  Age  Sex  ChestPainType  RestingBP  Cholesterol  FastingBS  RestingECG  MaxHR  ExerciseAngina  Oldpeak  ST_Slope  HeartDisease
0  0  40.0  1  1  140.0  289.0  0.0  1  172.0  0  0.0  2  0.0
1  1  49.0  0  2  130.0  180.0  0.0  1  156.0  0  1.0  1  1.0
2  2  37.0  1  1  130.0  283.0  0.0  2  138.0  0  0.0  2  0.0
3  3  48.0  0  0  138.0  214.0  0.0  1  108.0  1  1.5  1  1.0
4  4  54.0  1  2  150.0  195.0  0.0  1  122.0  0  0.0  2  0.0
5  5  39.0  1  2  120.0  339.0  0.0  1  138.0  0  0.0  2  0.0
6  6  45.0  0  1  130.0  237.0  0.0  1  170.0  0  0.0  2  0.0
7  7  54.0  1  1  110.0  208.0  0.0  1  142.0  0  0.0  2  0.0
8  8  37.0  1  0  140.0  207.0  0.0  1  130.0  1  1.5  1  1.0
9  9  48.0  0  1  120.0  284.0  0.0  1  120.0  1  0.0  2  1.0

In [24]: def diagnostic_plots(df, variable):
    # function takes a dataframe (df) and
    # the variable of interest as arguments

    # define figure size
    plt.figure(figsize=(16, 4))

    # histogram
    plt.subplot(1, 3, 1)
    sns.histplot(df[variable], bins=30)
    plt.title('Histogram')

    # Q-Q plot
    plt.subplot(1, 3, 2)
    stats.probplot(df[variable], dist="norm", plot=plt)
    plt.ylabel('Variable quantiles')

    # boxplot
    plt.subplot(1, 3, 3)
    sns.boxplot(y=df[variable])
    plt.title('Boxplot')

    plt.show()

In [25]: diagnostic_plots(ds, 'Age')

In [26]: diagnostic_plots(ds, 'RestingBP')

In [27]: diagnostic_plots(ds, 'Cholesterol')

In [28]: diagnostic_plots(ds, 'MaxHR')

In [29]: diagnostic_plots(ds, 'FastingBS')

In [30]: diagnostic_plots(ds, 'Oldpeak')

Data Scaling
In [31]: from sklearn.preprocessing import MinMaxScaler, StandardScaler
mms = MinMaxScaler() # Normalization
ss = StandardScaler() # Standardization

ds['Oldpeak'] = mms.fit_transform(ds[['Oldpeak']])
ds['Age'] = ss.fit_transform(ds[['Age']])
ds['RestingBP'] = ss.fit_transform(ds[['RestingBP']])
ds['Cholesterol'] = ss.fit_transform(ds[['Cholesterol']])
ds['MaxHR'] = ss.fit_transform(ds[['MaxHR']])
ds.head()

Out[31]: Unnamed: 0  Age  Sex  ChestPainType  RestingBP  Cholesterol  FastingBS  RestingECG  MaxHR  ExerciseAngina  Oldpeak  ST_Slope  HeartDisease
0  0  -1.473387  1  1  0.427330  0.846142  0.0  1  1.443735  0  0.295455  2  0.0
1  1  -0.496724  0  2  -0.127534  -0.202998  0.0  1  0.780688  0  0.409091  1  1.0
2  2  -1.798941  1  1  -0.127534  0.788391  0.0  2  0.034759  0  0.295455  2  0.0
3  3  -0.605242  0  0  0.316357  0.124257  0.0  1  -1.208455  1  0.465909  1  1.0
4  4  0.045866  1  2  0.982193  -0.058621  0.0  1  -0.628288  0  0.295455  2  0.0

A machine learning model does not understand the units of the feature values: it treats each input as a plain number without any sense of its real-world meaning. This is why the data need to be scaled.

There are two options for scaling the data: 1) normalization and 2) standardization. Since most algorithms assume that the data follow a normal (Gaussian) distribution, normalization is applied to features whose values do not follow a normal distribution, while standardization is applied to normally distributed features whose values are very large or very small compared to the other features.

Normalization: the Oldpeak feature was normalized because it showed a right-skewed distribution. Standardization: Age, RestingBP, Cholesterol and MaxHR were standardized because these features are approximately normally distributed.
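
For reference, a minimal sketch of what the two scalers compute (toy values, not taken from the dataset):

import numpy as np

x = np.array([0.0, 1.0, 1.5, 4.0, 6.2])  # toy feature values

# Normalization (MinMaxScaler): x' = (x - min) / (max - min), mapped into [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (StandardScaler): z = (x - mean) / std, zero mean and unit variance
x_std = (x - x.mean()) / x.std()

print(x_norm)
print(x_std)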

Modeling
In [32]: from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from lightgbm import LGBMClassifier
from sklearn.neural_network import MLPClassifier

In [33]: X = ds.drop(['HeartDisease'], axis=1)
y = ds['HeartDisease']
X

Out[33]: Unnamed: 0  Age  Sex  ChestPainType  RestingBP  Cholesterol  FastingBS  RestingECG  MaxHR  ExerciseAngina  Oldpeak  ST_Slope
0  0  -1.473387  1  1  0.427330  0.846142  0.0  1  1.443735  0  0.295455  2
1  1  -0.496724  0  2  -0.127534  -0.202998  0.0  1  0.780688  0  0.409091  1
2  2  -1.798941  1  1  -0.127534  0.788391  0.0  2  0.034759  0  0.295455  2
3  3  -0.605242  0  0  0.316357  0.124257  0.0  1  -1.208455  1  0.465909  1
4  4  0.045866  1  2  0.982193  -0.058621  0.0  1  -0.628288  0  0.295455  2
...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...
913  913  -0.930797  1  3  -1.237261  0.210883  0.0  1  -0.213883  0  0.431818  1
914  914  1.565119  1  0  0.649275  -0.077871  1.0  1  0.159081  0  0.681818  1
915  915  0.371420  1  0  -0.127534  -0.674630  0.0  1  -0.918371  1  0.431818  1
916  916  0.371420  0  1  -0.127534  0.336010  0.0  0  1.526616  0  0.352273  1
917  917  -1.690423  1  2  0.316357  -0.251124  0.0  1  1.485176  0  0.295455  2

918 rows × 12 columns
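
Note that Unnamed: 0 (the row index carried over from the CSV, 918 unique values) is kept as a feature here; a variant, not what was run in this notebook, would drop it as well:

# Variant (hypothetical): exclude the CSV row index from the feature matrix
X = ds.drop(['Unnamed: 0', 'HeartDisease'], axis=1)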

In [34]: X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.2, random_state=21)
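
If preserving the class ratio of HeartDisease in both subsets is desired, the split could be stratified (an alternative sketch, not the call used above):

# Alternative (hypothetical): stratified 80/20 split on the target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=21, stratify=y
)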

In [35]: models = [
{
"name": "Logistic Regression",
"estimator": LogisticRegression(),
"hyperparameters": {
"penalty": ["l2"],
"C": [0.01, 0.1, 1, 10],
"max_iter": [500]
}
},
{
"name": "Gradient Boosting",
"estimator": GradientBoostingClassifier(),
"hyperparameters": {
"n_estimators": [100],
"learning_rate": [0.1],
"max_depth": [3]
}
},
{
"name": "Random Forest",
"estimator": RandomForestClassifier(),
"hyperparameters": {
"n_estimators": [100, 200, 300],
"max_depth": [3, 5, 10],
"min_samples_split": [2, 5, 10],
"min_samples_leaf": [1, 2, 4]
}
},
{
"name": "Decision Tree",
"estimator": DecisionTreeClassifier(),
"hyperparameters": {
"criterion": ["gini", "entropy"],
"max_depth": [3, 5, 10],
"min_samples_split": [2, 5, 10],
"min_samples_leaf": [1, 2, 4]
}
},
{
"name": "K-Nearest Neighbors",
"estimator": KNeighborsClassifier(),
"hyperparameters": {
"n_neighbors": [3, 5, 7],
"weights": ["uniform", "distance"],
"algorithm": ["auto", "ball_tree", "kd_tree", "brute"]
}
},
{
"name": "Naive Bayes",
"estimator": GaussianNB(),
"hyperparameters": {
"var_smoothing": [1e-9, 1e-10, 1e-11, 1e-12]
}
},
{
"name": "AdaBoost",
"estimator": AdaBoostClassifier(),
"hyperparameters": {
"n_estimators": [50, 100, 200],
"learning_rate": [0.01, 0.1, 1],
"algorithm": ["SAMME", "SAMME.R"]
}
}
]

Choose the best parameters for each model


In [36]: accuracies = []
train_accuracies = []
best_models = {}

for model in models:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        print(f"Training {model['name']}...")
        grid_search = GridSearchCV(
            estimator=model['estimator'],
            param_grid=model['hyperparameters'],
            scoring='accuracy',
            cv=10
        )
        grid_search.fit(X_train, y_train)

    # evaluate the model's performance
    best_model = grid_search.best_estimator_

    # calculate training accuracy
    y_train_pred = best_model.predict(X_train)
    train_accuracy = accuracy_score(y_train, y_train_pred)
    train_accuracies.append((model['name'], train_accuracy))

    # calculate testing accuracy
    y_test_pred = best_model.predict(X_test)
    test_accuracy = accuracy_score(y_test, y_test_pred)
    accuracies.append((model['name'], test_accuracy))

    best_models[model['name']] = best_model

    print(f"Best parameters for {model['name']}:{grid_search.best_params_}")
    print("\033[1m--------------------------------------------------------\033[0m")
    print(f"Training accuracy for {model['name']}:{train_accuracy}")
    print("\033[1m--------------------------------------------------------\033[0m")
    print(f"Testing accuracy for {model['name']}:{test_accuracy}")
    print("\033[1m--------------------------------------------------------\033[0m")

Training Logistic Regression...
Best parameters for Logistic Regression:{'C': 10, 'max_iter': 500, 'penalty': 'l2'}
--------------------------------------------------------
Training accuracy for Logistic Regression:0.8174386920980926
--------------------------------------------------------
Testing accuracy for Logistic Regression:0.8315217391304348
--------------------------------------------------------
Training Gradient Boosting...
Best parameters for Gradient Boosting:{'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}
--------------------------------------------------------
Training accuracy for Gradient Boosting:0.9223433242506812
--------------------------------------------------------
Testing accuracy for Gradient Boosting:0.8478260869565217
--------------------------------------------------------
Training Random Forest...
Best parameters for Random Forest:{'max_depth': 5, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 100}
--------------------------------------------------------
Training accuracy for Random Forest:0.8746594005449592
--------------------------------------------------------
Testing accuracy for Random Forest:0.8532608695652174
--------------------------------------------------------
Training Decision Tree...
Best parameters for Decision Tree:{'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 2}
--------------------------------------------------------
Training accuracy for Decision Tree:0.8787465940054496
--------------------------------------------------------
Testing accuracy for Decision Tree:0.7989130434782609
--------------------------------------------------------
Training K-Nearest Neighbors...
Best parameters for K-Nearest Neighbors:{'algorithm': 'auto', 'n_neighbors': 7, 'weights': 'distance'}
--------------------------------------------------------
Training accuracy for K-Nearest Neighbors:1.0
--------------------------------------------------------
Testing accuracy for K-Nearest Neighbors:0.6739130434782609
--------------------------------------------------------
Training Naive Bayes...
Best parameters for Naive Bayes:{'var_smoothing': 1e-09}
--------------------------------------------------------
Training accuracy for Naive Bayes:0.8174386920980926
--------------------------------------------------------
Testing accuracy for Naive Bayes:0.842391304347826
--------------------------------------------------------
Training AdaBoost...
Best parameters for AdaBoost:{'algorithm': 'SAMME.R', 'learning_rate': 0.1, 'n_estimators': 50}
--------------------------------------------------------
Training accuracy for AdaBoost:0.8365122615803815
--------------------------------------------------------
Testing accuracy for AdaBoost:0.8532608695652174
--------------------------------------------------------
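
tabulate was installed at the top of the notebook but never used; a sketch (not executed above) of summarizing the collected scores with it:

from tabulate import tabulate

# train_accuracies and accuracies share the same model order (built in the same loop)
summary = [(name, tr, te) for (name, tr), (_, te) in zip(train_accuracies, accuracies)]
print(tabulate(summary, headers=['Model', 'Train accuracy', 'Test accuracy'], floatfmt='.4f'))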

Create the models with the best hyperparameters


In [37]: # create the Logistic Regression model with the best hyperparameters
log_reg_model = LogisticRegression(
    C=10,
    max_iter=500,
    penalty='l2'
)

# create the Random Forest model with the best hyperparameters
rf_model = RandomForestClassifier(
    max_depth=5,
    min_samples_leaf=2,
    min_samples_split=2,
    n_estimators=100
)

# create the Gradient Boosting model with the best hyperparameters
gb_model = GradientBoostingClassifier(
    learning_rate=0.1,
    max_depth=3,
    n_estimators=100
)

# create the Decision Tree model with the best hyperparameters
dt_model = DecisionTreeClassifier(
    criterion='gini',
    max_depth=5,
    min_samples_leaf=1,
    min_samples_split=2
)

# create the K-Nearest Neighbors model with the best hyperparameters
knn_model = KNeighborsClassifier(
    algorithm='auto',
    n_neighbors=7,
    weights='distance'
)

# create the Naive Bayes model with the best hyperparameters
nb_model = GaussianNB(
    var_smoothing=1e-09
)

# create the AdaBoost model with the best hyperparameters
ab_model = AdaBoostClassifier(
    algorithm='SAMME.R',
    learning_rate=0.1,
    n_estimators=50
)

Train models
In [38]: # Train Logistic Regression model
log_reg_model.fit(X_train, y_train)

# Train Random Forest model
rf_model.fit(X_train, y_train)

# Train Gradient Boosting model
gb_model.fit(X_train, y_train)

# Train Decision Tree model
dt_model.fit(X_train, y_train)

# Train K-Nearest Neighbors model
knn_model.fit(X_train, y_train)

# Train Naive Bayes model
nb_model.fit(X_train, y_train)

# Train AdaBoost model
ab_model.fit(X_train, y_train)

Out[38]: AdaBoostClassifier(learning_rate=0.1)
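
Note that In [36] already keeps the fitted best estimator of every model in best_models, so the manual re-creation and re-training above could be replaced by a short loop (a sketch over the same split):

# Alternative: reuse the estimators selected and refit by the grid search
for name, model in best_models.items():
    print(name, accuracy_score(y_test, model.predict(X_test)))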

Features importance
In [39]: # fit the Gradient Boosting model
gb_model.fit(X_train, y_train)

# get feature importances
importances = gb_model.feature_importances_

# get feature names
feature_names = X.columns

# sort feature importances in descending order
indices = np.argsort(importances)[::-1]

# plot feature importances
plt.figure(figsize=(10, 5))
plt.title("Feature Importances")
plt.bar(range(len(indices)), importances[indices])
plt.xticks(range(len(indices)), feature_names[indices], rotation='vertical')
plt.show()
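
To read the ranking without the chart, the same importances can be printed as a sorted list (a sketch using the variables defined above):

# Pair feature names with their importances and sort in descending order
ranked = sorted(zip(feature_names, importances), key=lambda t: t[1], reverse=True)
for name, imp in ranked:
    print(f"{name:15s} {imp:.4f}")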
In [ ]:
