0% found this document useful (0 votes)

21 views

CSC 240 HW 4

Uploaded by

minulo

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

21 views

CSC 240 HW 4

Uploaded by

minulo

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

You are on page 1/ 17

Comparison of Naïve Bayes and Decision Tree Classifier on Wine

Dataset
Naïve Bayes and Decision Tree Classifier Analysis
Jonathan, J.W, Wang
University of Rochester, [email protected]

Using both the Naïve Bayes and Decision Tree Classifiers, this assignment attempts to compare the
classification results on the Wine Dataset. The first step ensure that all data points were present in the data
files. Then, both classifiers were implemented using library functions. From there, certain conclusions about
the dataset could be made and are presented in this document.

Keywords and Phrases: Data Mining, Naïve Bayes, Decision Tree

Reference:
Forina, M. et al, PARVUS, 1991. Wine Data Set, https://archive.ics.uci.edu/ml/datasets/wine
Prashant, P.B, Banerjee, 2020. Naive Bayes Classifier in Python
Vipul, V.G, Gandhi, 2020. A Guide to Decision Trees for Beginners

1 INTRODUCTION
The goal of this assignment is to compare the ROC-AUC score and k-array cross validation
scores of both the Naïve Bayes and Decision Tree Classifiers in the Wine Dataset. The data
taken from UCI is first checked for missing values. This step is necessary because all data points
need to have an appropriate value in order to run the algorithms properly. The .data file
downloaded is first converted to a CSV file for ease of use and then checked for null values. For
this data set specifically, no null values were present. As a result, no further steps were needed
to clean the data. This step is done for both classification techniques. After both techniques are
used on the data, the ROC-AUC score and k-array cross validation scores are calculated with a
few variations (varying test sizes and folds for k-cross validation) during the testing phase of
this assignment to gain a better perspective on the results. The results from these tests are
then gathered to compare the scores and reach conclusions.

2 TESTING FOR EMPTY DATA POINTS

The data pulled from UCI is in the form of a .data file and the information about the file is
contained in the accompanying .names file. From the .names file, we can see that the leftmost
column in the .data file represents the class. The class of each row is 1, 2, or 3 and each
different number is a different cultivator in the same region of Italy. The other columns show the
quantities of 13 constituents found in each respective wine.
For ease of use, the .data file is converted into a CSV file in Microsoft Excel. In order to
check whether there are null values in the CSV file, a simple python code can be run in Jupyter
Notebook.

import pandas as pd
import csv

data = 'wine.csv'
df = pd.read_csv("wine.csv", header = None)

col_names = ['Class', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total
phenols', 'Flavanoids', 'Nonflavanoid phenol',
'Proanthocyanins', 'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']
df.columns = col_names

# check if there are any empty values

df.isnull().sum()

X = df.drop(['Class'], axis=1)

y = df['Class']

The code above takes in the wine.csv file and names each column with the respective
chemical constituents or class. Afterwards, the number of null values is counted for each
column. For this specific dataset, there is not a single empty value. So, no extra steps need to
be done to clean the data. The class column is dropped and set as a separate value. This way,
the actual class of the data does not affect the classification results.

3 IMPLEMNTING THE CLASSIFICATION

Both Naïve Bayes and Decision Trees are methods to classify a given dataset. For this
assignment, the accuracy of the classification techniques are compared using ROC-AUC and k-
array cross validation scores. The ROC-AUC score tells us how efficient a model is. The closer
the ROC-AUC score is to 0.5, the closer the model is to random guessing, and the closer the
ROC-AUC score is to 1, the better the model. Meanwhile, k-array cross validation is a machine
learning method used to determine whether a model is overfitting (a situation where a model
can almost perfectly predict on training data but fails on testing data). The closer this score is to
1, the better the accuracy of the model.

2
For the Naïve Bayes method, the data is first split into training and testing sets. The features
are also scaled first (0.3 is used as the test size in this example).

# split X and y into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

cols = X_train.columns

scaler = RobustScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)

X_train = pd.DataFrame(X_train, columns=[cols])

X_test = pd.DataFrame(X_test, columns=[cols])

Next, a Gaussian Naïve Bayes classifier is trained on the training set and the results are
predicted.
# instantiate the model
gnb = GaussianNB()

# fit the model

gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)

To ensure the model is behaving properly, the accuracy score and the training set accuracy
can be checked. In addition, by comparing whether the training set score and testing set score
are similar, overfitting/underfitting can be tested.
print('Model accuracy score: {0:0.4f}'. format(accuracy_score(y_test, y_pred)))

y_pred_train = gnb.predict(X_train)

print('Training-set accuracy score: {0:0.4f}'. format(accuracy_score(y_train, y_pred_train)))

# print the scores on training and test set

print('Training set score: {:.4f}'.format(gnb.score(X_train, y_train)))

3
print('Test set score: {:.4f}'.format(gnb.score(X_test, y_test)))

K-fold cross validation is then applied, and the average is computed (in this example, 10-folds
are used)
# Applying 10-Fold Cross Validation
scores = cross_val_score(gnb, X_train, y_train, cv = 10, scoring='accuracy')

print('Cross-validation scores:{}'.format(scores))

# compute Average cross-validation score

print('Average cross-validation score: {:.4f}'.format(scores.mean()))

Lastly, the ROC-AUC score is computed. Usually, ROC-AUC does not work for multiclass
comparison where the base class is compared to more than one other class. This posed a small
problem during the assignment. As seen in the textbook, a simple solution to this is to modify
the code so that the algorithm can be extended to multiclass classification. In this case, the
ROC-AUC score is modified to a OVR (One versus Rest) scheme. There is a different number of
instances associated with each class label (as can be seen from the wine.names file given by
UCI where there are 59 instances of class 1, 71 instances of class 2, and 48 instances of class
3). As a result, macro averaging is done in order to treat all classes equally when calculating the
ROC-AUC score.
y_score_gnb = gnb.predict_proba(X_test)

macro_roc_auc_ovr = roc_auc_score(
y_test,
y_score_gnb,
multi_class="ovr",
average="macro",
)

print(f"Macro-averaged One-vs-Rest ROC AUC score:\n{macro_roc_auc_ovr:.2f}")

The Decision Tree method is then implemented next. There are a lot of similarities between
the actual process of determining the ROC-AUC score and k-array cross validation score. The
first step is also to split the data into training and testing sets (once again 0.3 is used as the
testing size for this example). However, unlike the Naïve Bayes method, there is no need to
scale the data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

4
A decision tree classifier is trained on the training set.
tree_clf = DecisionTreeClassifier(max_depth=2, random_state = 36)
tree_clf.fit(X_train, y_train)

From here, the cross validation process and the ROC-AUC score calculation is done just like
the Naïve Bayes method (once again, 10 folds are used in the k-fold cross validation for the sake
of this example).
# Applying 10-Fold Cross Validation
scores = cross_val_score(tree_clf, X_train, y_train, cv = 10, scoring='accuracy')

print('Cross-validation scores:{}'.format(scores))

# compute Average cross-validation score

print('Average cross-validation score: {:.4f}'.format(scores.mean()))

y_score_tree = tree_clf.predict_proba(X_test)
macro_roc_auc_ovr = roc_auc_score(
y_test,
y_score_tree,
multi_class="ovr",
average="macro",
)

print(f"Macro-averaged One-vs-Rest ROC AUC score:\n{macro_roc_auc_ovr:.2f}")

4 TESTING VARIOUS TEST SIZES AND FOLDS IN K-CROSS VALIDATION

Now that the implementations are finished, the test.py file runs the various tests with
different test sizes and different number of folds in k-cross validation. An example of one of the
tests in the test.py file is the following:
print('0.3 test size with 10-fold cross validation')
naiveBayes(0.3, 10)
This inputs 0.3 as the test size and 10 as the number of folds for cross validation. To test
different values, test.py just runs the same function with different respective values. The same
exact values are then used for the decision tree method. This way, both classifications can be
compared with various criteria.

5 RESULTS
For the first test where the test size is 0.3 and there are 10 folds:

5
naiveBayes(0.3, 10)
The result for the naïve bayes is:

0.3 test size with 10-fold cross validation

Model accuracy score: 0.9444
Training-set accuracy score: 0.9839
Training set score: 0.9839
Test set score: 0.9444
Cross-validation scores:[0.92307692 1. 1. 1. 1. 1.
1. 1. 0.83333333 1. ]
Average cross-validation score: 0.9756
Macro-averaged One-vs-Rest ROC AUC score:
1.00

decisionTree(0.3, 10)
The result for the decision tree is:

0.3 test size with 10-fold cross validation

Cross-validation scores:[0.84615385 0.92307692 0.84615385 1. 0.91666667 0.83333333
0.83333333 0.83333333 0.75 0.91666667]
Average cross-validation score: 0.8699
Macro-averaged One-vs-Rest ROC AUC score:
0.93

We can see that both the average cross-validation score and the macro-averaged OVR ROC-
AUC score is higher for the Naïve Bayes method.

From here, it was necessary to test whether this stayed the same for different test sizes and
different number of folds. First, only changing the number of folds to 5 folds produced these
results:
naiveBayes(0.3, 5)
The result for the naïve bayes is:

0.3 test size with 5-fold cross validation

Model accuracy score: 0.9444
Training-set accuracy score: 0.9839
Training set score: 0.9839
Test set score: 0.9444
Cross-validation scores:[0.96 1. 1. 1. 0.91666667]
Average cross-validation score: 0.9753
Macro-averaged One-vs-Rest ROC AUC score:
1.00

decisionTree(0.3, 5)
The result for the decision tree is:

0.3 test size with 5-fold cross validation

6
Cross-validation scores:[0.8 0.92 0.88 0.88 0.83333333]
Average cross-validation score: 0.8627
Macro-averaged One-vs-Rest ROC AUC score:
0.93

When only changing the test size down to 0.1:

naiveBayes(0.1, 10)
The result for the naïve bayes is:

0.1 test size with 10-fold cross validation

Model accuracy score: 1.0000
Training-set accuracy score: 0.9812
Training set score: 0.9812
Test set score: 1.0000
Cross-validation scores:[0.875 1. 0.9375 0.9375 1. 1. 1. 1. 0.9375 0.9375]
Average cross-validation score: 0.9625
Macro-averaged One-vs-Rest ROC AUC score:
1.00

decisionTree(0.1, 10)
The result for the decision tree is:

0.1 test size with 10-fold cross validation

Cross-validation scores:[0.8125 0.9375 0.8125 0.6875 0.9375 0.875 0.8125 0.9375 0.8125 0.875 ]
Average cross-validation score: 0.8500
Macro-averaged One-vs-Rest ROC AUC score:
0.92

When changing both the test size down to 0.1 and the number of folds to 5:
naiveBayes(0.1, 5)
The result for the naïve bayes is:

0.1 test size with 5-fold cross validation

Model accuracy score: 1.0000
Training-set accuracy score: 0.9812
Training set score: 0.9812
Test set score: 1.0000
Cross-validation scores:[0.9375 0.9375 1. 1. 0.9375]
Average cross-validation score: 0.9625
Macro-averaged One-vs-Rest ROC AUC score:
1.00

decisionTree(0.1, 5)
The result for the decision tree is:

0.1 test size with 5-fold cross validation

Cross-validation scores:[0.875 0.84375 0.9375 0.90625 0.875 ]
Average cross-validation score: 0.8875
Macro-averaged One-vs-Rest ROC AUC score:

7
0.92

For every single case, no matter the test size or the number of folds in k-fold cross validation,
the Naïve Bayes performed better. The average cross-validation score was higher, and the ROC-
AUC score was closer to 1. From these results, we can conclude that the Naïve Bayes method is
better for classification. Given the size of the data set though, considering that it is relatively
small, Naïve Bayes would logically perform better since Decision Trees would require more data
to reach the same accuracy levels. However, that being said, the Decision Tree method still has
very solid scores for both cross-validation and ROC-AUC. Decision Trees are by no means a bad
method of classification. Based on the textbook, Decision Trees are more flexible and easy to
use, but require tree pruning to really prevent overfitting. Both methods are excellent methods
of classification for various different data sets and for this assignment, Naïve Bayes
outperforms.

REFERENCES
[1] Forina, M. et al, PARVUS, 1991. Wine Data Set, https://archive.ics.uci.edu/ml/datasets/wine
[2] Prashant, P.B, Banerjee, 2020. Naive Bayes Classifier in Python
[3] Vipul, V.G, Gandhi, 2020. A Guide to Decision Trees for Beginners

A APPENDICES

A.1 Code for naiveBayes.py that implements the Naïve Bayes Classification
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
import csv

data = 'wine.csv'
df = pd.read_csv("wine.csv", header = None)

8
# check if there are any empty values
# df.isnull().sum()

X = df.drop(['Class'], axis=1)

y = df['Class']

# split X and y into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

cols = X_train.columns

scaler = RobustScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)

X_train = pd.DataFrame(X_train, columns=[cols])

X_test = pd.DataFrame(X_test, columns=[cols])

# train a Gaussian Naive Bayes classifier on the training set

# instantiate the model
gnb = GaussianNB()

# fit the model

gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)

print('Model accuracy score: {0:0.4f}'. format(accuracy_score(y_test, y_pred)))

y_pred_train = gnb.predict(X_train)

print('Training-set accuracy score: {0:0.4f}'. format(accuracy_score(y_train, y_pred_train)))

# print the scores on training and test set

print('Training set score: {:.4f}'.format(gnb.score(X_train, y_train)))

9
print('Test set score: {:.4f}'.format(gnb.score(X_test, y_test)))

# Applying 10-Fold Cross Validation

scores = cross_val_score(gnb, X_train, y_train, cv = 10, scoring='accuracy')

print('Cross-validation scores:{}'.format(scores))

# compute Average cross-validation score

print('Average cross-validation score: {:.4f}'.format(scores.mean()))

y_score_gnb = gnb.predict_proba(X_test)

macro_roc_auc_ovr = roc_auc_score(
y_test,
y_score_gnb,
multi_class="ovr",
average="macro",
)

print(f"Macro-averaged One-vs-Rest ROC AUC score:\n{macro_roc_auc_ovr:.2f}")

A.2 Code for decisionTree.py that implements the Decision Tree Classification
import pandas as pd
import numpy as np
import csv
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

data = 'wine.csv'
df = pd.read_csv("wine.csv", header = None)

10
col_names = ['Class', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total
phenols', 'Flavanoids', 'Nonflavanoid phenol',
'Proanthocyanins', 'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']
df.columns = col_names

# check if there are any empty values

# df.isnull().sum()

X = df.drop(['Class'], axis=1)

y = df['Class']

# split X and y into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

tree_clf = DecisionTreeClassifier(max_depth=2, random_state = 36)

tree_clf.fit(X_train, y_train)

# Applying 10-Fold Cross Validation

scores = cross_val_score(tree_clf, X_train, y_train, cv = 10, scoring='accuracy')

print('Cross-validation scores:{}'.format(scores))

# compute Average cross-validation score

print('Average cross-validation score: {:.4f}'.format(scores.mean()))

y_score_tree = tree_clf.predict_proba(X_test)
macro_roc_auc_ovr = roc_auc_score(
y_test,
y_score_tree,
multi_class="ovr",
average="macro",
)

print(f"Macro-averaged One-vs-Rest ROC AUC score:\n{macro_roc_auc_ovr:.2f}")

11
A.3 Code for test.py that has the values used for testing various test sizes and folds
for k-fold cross validation
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import csv

def naiveBayes(test_size, k_fold):

data = 'wine.csv'
df = pd.read_csv("wine.csv", header = None)

# check if there are any empty values

# df.isnull().sum()

X = df.drop(['Class'], axis=1)

y = df['Class']

# split X and y into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_size, random_state = 0)

cols = X_train.columns

scaler = RobustScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)

12
X_train = pd.DataFrame(X_train, columns=[cols])
X_test = pd.DataFrame(X_test, columns=[cols])

# train a Gaussian Naive Bayes classifier on the training set

# instantiate the model
gnb = GaussianNB()

# fit the model

gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)

print('Model accuracy score: {0:0.4f}'. format(accuracy_score(y_test, y_pred)))

y_pred_train = gnb.predict(X_train)

print('Training-set accuracy score: {0:0.4f}'. format(accuracy_score(y_train, y_pred_train)))

# print the scores on training and test set

print('Training set score: {:.4f}'.format(gnb.score(X_train, y_train)))

print('Test set score: {:.4f}'.format(gnb.score(X_test, y_test)))

# Applying k-Fold Cross Validation

scores = cross_val_score(gnb, X_train, y_train, cv = k_fold, scoring='accuracy')

print('Cross-validation scores:{}'.format(scores))

# compute Average cross-validation score

print('Average cross-validation score: {:.4f}'.format(scores.mean()))

y_score_gnb = gnb.predict_proba(X_test)

macro_roc_auc_ovr = roc_auc_score(

13
y_test,
y_score_gnb,
multi_class="ovr",
average="macro",
)

print(f"Macro-averaged One-vs-Rest ROC AUC score:\n{macro_roc_auc_ovr:.2f}")

def decisionTree(test_size, k_fold):

data = 'wine.csv'
df = pd.read_csv("wine.csv", header = None)

# check if there are any empty values

# df.isnull().sum()

X = df.drop(['Class'], axis=1)

y = df['Class']

# split X and y into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_size, random_state = 0)

tree_clf = DecisionTreeClassifier(max_depth=2, random_state = 36)

tree_clf.fit(X_train, y_train)

# Applying k-Fold Cross Validation

scores = cross_val_score(tree_clf, X_train, y_train, cv = k_fold, scoring='accuracy')

print('Cross-validation scores:{}'.format(scores))

# compute Average cross-validation score

14
print('Average cross-validation score: {:.4f}'.format(scores.mean()))

y_score_tree = tree_clf.predict_proba(X_test)
macro_roc_auc_ovr = roc_auc_score(
y_test,
y_score_tree,
multi_class="ovr",
average="macro",
)

print(f"Macro-averaged One-vs-Rest ROC AUC score:\n{macro_roc_auc_ovr:.2f}")

print('Naive Bayes Tests:')

print()
print('0.3 test size with 10-fold cross validation')
naiveBayes(0.3, 10)
print()
print('0.3 test size with 5-fold cross validation')
naiveBayes(0.3, 5)
print()
print('0.1 test size with 10-fold cross validation')
naiveBayes(0.1, 10)
print()
print('0.1 test size with 5-fold cross validation')
naiveBayes(0.1, 5)

print()
print()

print('Decision Tree Tests:')

print()
print('0.3 test size with 10-fold cross validation')
decisionTree(0.3, 10)
print()
print('0.3 test size with 5-fold cross validation')
decisionTree(0.3, 5)
print()
print('0.1 test size with 10-fold cross validation')
decisionTree(0.1, 10)
print()

15
print('0.1 test size with 5-fold cross validation')
decisionTree(0.1, 5)
print()

16
A.4 Image of the terminal results when running test.py

Manual Taller Outlook (Dierre) 125 Efi (Idioma Ingles)
89% (9)
Manual Taller Outlook (Dierre) 125 Efi (Idioma Ingles)
173 pages
Machine Learning (16CIC73) Project Report Template
33% (3)
Machine Learning (16CIC73) Project Report Template
12 pages
University of Mauritius: Assignment On Supervised & Unsupervised Machine Learning Algorithms
No ratings yet
University of Mauritius: Assignment On Supervised & Unsupervised Machine Learning Algorithms
71 pages
3 Classification
No ratings yet
3 Classification
16 pages
ML2
No ratings yet
ML2
7 pages
Program 5
No ratings yet
Program 5
3 pages
Mlp Slides Merged
No ratings yet
Mlp Slides Merged
480 pages
(REPORT) LAB - 2 - Decision - Tree
No ratings yet
(REPORT) LAB - 2 - Decision - Tree
17 pages
Mini Project Report
No ratings yet
Mini Project Report
12 pages
Module 4 - Classification (1)
No ratings yet
Module 4 - Classification (1)
10 pages
devesh
No ratings yet
devesh
11 pages
Wine Quality Prediction Using Machine Learning
No ratings yet
Wine Quality Prediction Using Machine Learning
10 pages
ML Lab1 pgm
No ratings yet
ML Lab1 pgm
4 pages
ML Python Exercises UOM BDS Classification
No ratings yet
ML Python Exercises UOM BDS Classification
18 pages
Tutorial 6
No ratings yet
Tutorial 6
8 pages
vertopal.com_project
No ratings yet
vertopal.com_project
16 pages
22MID0187_ML_LAB-5
No ratings yet
22MID0187_ML_LAB-5
13 pages
Assignment 3
No ratings yet
Assignment 3
8 pages
# Tommy Trojan # ITP 449 Fall 2021 # Final Project # Q1
No ratings yet
# Tommy Trojan # ITP 449 Fall 2021 # Final Project # Q1
6 pages
Week 5
No ratings yet
Week 5
72 pages
Business Analytics 1 Ca 2
No ratings yet
Business Analytics 1 Ca 2
26 pages
Lab2
No ratings yet
Lab2
17 pages
Big Data Projecct
No ratings yet
Big Data Projecct
12 pages
DSBDA_10
No ratings yet
DSBDA_10
5 pages
3_answers
No ratings yet
3_answers
19 pages
my_project_1_AI
No ratings yet
my_project_1_AI
3 pages
Exam PA Knowledge Based Outline
No ratings yet
Exam PA Knowledge Based Outline
22 pages
m1
No ratings yet
m1
10 pages
Decision Tree Copy
No ratings yet
Decision Tree Copy
44 pages
Classification With Decision Trees: Instructor: Qiang Yang
100% (1)
Classification With Decision Trees: Instructor: Qiang Yang
62 pages
Introduction To Decision Tree: Gini Index
No ratings yet
Introduction To Decision Tree: Gini Index
15 pages
Probability Estimation in Random Forests
No ratings yet
Probability Estimation in Random Forests
35 pages
Week 7 Laboratory Activity
No ratings yet
Week 7 Laboratory Activity
12 pages
Machine Learning Model Evaluation
No ratings yet
Machine Learning Model Evaluation
11 pages
Unit3 7 Issues
No ratings yet
Unit3 7 Issues
24 pages
Bachelor thesis-G.H. Van de Water-S2297213
No ratings yet
Bachelor thesis-G.H. Van de Water-S2297213
48 pages
ML Lab Session2 DecisionTrees
No ratings yet
ML Lab Session2 DecisionTrees
32 pages
Lecture 3b - Evaluation
No ratings yet
Lecture 3b - Evaluation
37 pages
Types of Pruning Techniques
No ratings yet
Types of Pruning Techniques
10 pages
Data Mining Unit 2
No ratings yet
Data Mining Unit 2
40 pages
Classification With Decision Trees I: Instructor: Qiang Yang
No ratings yet
Classification With Decision Trees I: Instructor: Qiang Yang
29 pages
ML
No ratings yet
ML
11 pages
11th Classfication Analysis - Colab (1)
No ratings yet
11th Classfication Analysis - Colab (1)
6 pages
Module 4 - Supervised Learning - First ML Model
No ratings yet
Module 4 - Supervised Learning - First ML Model
23 pages
04 - Model Selection
No ratings yet
04 - Model Selection
62 pages
L05 - Advance Analytical Theory and Methods - Classification
No ratings yet
L05 - Advance Analytical Theory and Methods - Classification
34 pages
23BCE7092_ML_Lab_Assignment[1]
No ratings yet
23BCE7092_ML_Lab_Assignment[1]
14 pages
23ucc542_ml9
No ratings yet
23ucc542_ml9
6 pages
Week 6 - 7 - Classification
No ratings yet
Week 6 - 7 - Classification
67 pages
Naive Bayes Classifier: Fundamentals and Applications
From Everand
Naive Bayes Classifier: Fundamentals and Applications
Fouad Sabry
No ratings yet
Decision Tree
No ratings yet
Decision Tree
33 pages
Data Mining Classification Algorithms: Credits: Padhraic Smyth
No ratings yet
Data Mining Classification Algorithms: Credits: Padhraic Smyth
54 pages
Unit 4
No ratings yet
Unit 4
19 pages
TTDS Lecture 4
No ratings yet
TTDS Lecture 4
31 pages
DWDM 4
No ratings yet
DWDM 4
58 pages
K Nearest Neighbor Algorithm: Fundamentals and Applications
From Everand
K Nearest Neighbor Algorithm: Fundamentals and Applications
Fouad Sabry
No ratings yet
Unit 4
No ratings yet
Unit 4
20 pages
Lab
No ratings yet
Lab
5 pages
20210913115613D3708 - Session 05-08 Decision Tree Classification
No ratings yet
20210913115613D3708 - Session 05-08 Decision Tree Classification
37 pages
DM UNIT-3
No ratings yet
DM UNIT-3
23 pages
Random Sample Consensus: Robust Estimation in Computer Vision
From Everand
Random Sample Consensus: Robust Estimation in Computer Vision
Fouad Sabry
No ratings yet
Chapter10_ANOVA - Student(1)
No ratings yet
Chapter10_ANOVA - Student(1)
38 pages
Lecture20
No ratings yet
Lecture20
81 pages
Lecture15
No ratings yet
Lecture15
81 pages
Lecture18
No ratings yet
Lecture18
70 pages
Lecture16
No ratings yet
Lecture16
123 pages
Lecture19
No ratings yet
Lecture19
71 pages
Lecture6
No ratings yet
Lecture6
127 pages
Lecture4
No ratings yet
Lecture4
154 pages
Lecture2
No ratings yet
Lecture2
96 pages
Week 6 Components of Machine Tool
No ratings yet
Week 6 Components of Machine Tool
31 pages
355073814 Column Chromatography Separation of Lycopene and β carotene
No ratings yet
355073814 Column Chromatography Separation of Lycopene and β carotene
7 pages
Analysis of Data Transmission Performance Over A GSM Cellular Network
No ratings yet
Analysis of Data Transmission Performance Over A GSM Cellular Network
10 pages
PBL FAQs
No ratings yet
PBL FAQs
22 pages
Trigreview
No ratings yet
Trigreview
21 pages
Quantifiers 1
No ratings yet
Quantifiers 1
3 pages
Six Months Training in Java
No ratings yet
Six Months Training in Java
8 pages
Le Monde Grothendieck 7 May
No ratings yet
Le Monde Grothendieck 7 May
15 pages
CVFD2D Slides 2up
No ratings yet
CVFD2D Slides 2up
17 pages
Physics 2 (With Answers)
No ratings yet
Physics 2 (With Answers)
5 pages
EEB443 Test2 With Solutions - 2022
No ratings yet
EEB443 Test2 With Solutions - 2022
12 pages
Flexible J&T 2.0 (Fixed)
No ratings yet
Flexible J&T 2.0 (Fixed)
10 pages
DS Pole Sheets Hot-Rolled
No ratings yet
DS Pole Sheets Hot-Rolled
3 pages
1394 Sampling Jitter PDF
No ratings yet
1394 Sampling Jitter PDF
30 pages
Crack Width
67% (3)
Crack Width
4 pages
Rn8302B User Manual: Three-Phase Anti-Tampering Multifuction Electircal Energy Meter IC
No ratings yet
Rn8302B User Manual: Three-Phase Anti-Tampering Multifuction Electircal Energy Meter IC
96 pages
Python File
No ratings yet
Python File
10 pages
Advanced Building Technology Lecture 5 Tall Building Systems Part 1
No ratings yet
Advanced Building Technology Lecture 5 Tall Building Systems Part 1
36 pages
Sony str-k760p
No ratings yet
Sony str-k760p
44 pages
Transistor: History Importance Simplified Operation
No ratings yet
Transistor: History Importance Simplified Operation
18 pages
VSAT Agilis Transreceivers 20w To 60wpdf
No ratings yet
VSAT Agilis Transreceivers 20w To 60wpdf
2 pages
Copy of Momentum & Impulse Introduction Practice Problems
No ratings yet
Copy of Momentum & Impulse Introduction Practice Problems
3 pages
OMB SGP-WB-1 - ING - 0 Ukrseni Dipol
No ratings yet
OMB SGP-WB-1 - ING - 0 Ukrseni Dipol
2 pages
Lab 1-Flow Measurement
No ratings yet
Lab 1-Flow Measurement
4 pages
J2EE Notes
100% (1)
J2EE Notes
119 pages
Pressure
No ratings yet
Pressure
2 pages
Job # 4. Base Line Measurement Using Jaderin's Method
100% (1)
Job # 4. Base Line Measurement Using Jaderin's Method
6 pages
CSC204 Data Structure Practice Questions
No ratings yet
CSC204 Data Structure Practice Questions
38 pages
Thermodynamic Analysis of Combined Cycle Power Plant
No ratings yet
Thermodynamic Analysis of Combined Cycle Power Plant
11 pages

CSC 240 HW 4

Uploaded by

CSC 240 HW 4

Uploaded by

Comparison of Naïve Bayes and Decision Tree Classifier on Wine

Keywords and Phrases: Data Mining, Naïve Bayes, Decision Tree

2 TESTING FOR EMPTY DATA POINTS

# check if there are any empty values

3 IMPLEMNTING THE CLASSIFICATION

# split X and y into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

X_train = pd.DataFrame(X_train, columns=[cols])

# fit the model

print('Training-set accuracy score: {0:0.4f}'. format(accuracy_score(y_train, y_pred_train)))

# print the scores on training and test set

# compute Average cross-validation score

print(f"Macro-averaged One-vs-Rest ROC AUC score:\n{macro_roc_auc_ovr:.2f}")

# compute Average cross-validation score

print('Average cross-validation score: {:.4f}'.format(scores.mean()))

print(f"Macro-averaged One-vs-Rest ROC AUC score:\n{macro_roc_auc_ovr:.2f}")

4 TESTING VARIOUS TEST SIZES AND FOLDS IN K-CROSS VALIDATION

0.3 test size with 10-fold cross validation

0.3 test size with 10-fold cross validation

0.3 test size with 5-fold cross validation

0.3 test size with 5-fold cross validation

When only changing the test size down to 0.1:

0.1 test size with 10-fold cross validation

0.1 test size with 10-fold cross validation

0.1 test size with 5-fold cross validation

0.1 test size with 5-fold cross validation

# split X and y into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

X_train = pd.DataFrame(X_train, columns=[cols])

# train a Gaussian Naive Bayes classifier on the training set

# fit the model

print('Model accuracy score: {0:0.4f}'. format(accuracy_score(y_test, y_pred)))

print('Training-set accuracy score: {0:0.4f}'. format(accuracy_score(y_train, y_pred_train)))

# print the scores on training and test set

# Applying 10-Fold Cross Validation

# compute Average cross-validation score

print(f"Macro-averaged One-vs-Rest ROC AUC score:\n{macro_roc_auc_ovr:.2f}")

# check if there are any empty values

# split X and y into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

tree_clf = DecisionTreeClassifier(max_depth=2, random_state = 36)

# Applying 10-Fold Cross Validation

# compute Average cross-validation score

print('Average cross-validation score: {:.4f}'.format(scores.mean()))

print(f"Macro-averaged One-vs-Rest ROC AUC score:\n{macro_roc_auc_ovr:.2f}")

def naiveBayes(test_size, k_fold):

# check if there are any empty values

# split X and y into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_size, random_state = 0)

# train a Gaussian Naive Bayes classifier on the training set

# fit the model

print('Model accuracy score: {0:0.4f}'. format(accuracy_score(y_test, y_pred)))

print('Training-set accuracy score: {0:0.4f}'. format(accuracy_score(y_train, y_pred_train)))

# print the scores on training and test set

print('Test set score: {:.4f}'.format(gnb.score(X_test, y_test)))

# Applying k-Fold Cross Validation

# compute Average cross-validation score

print(f"Macro-averaged One-vs-Rest ROC AUC score:\n{macro_roc_auc_ovr:.2f}")

def decisionTree(test_size, k_fold):

# check if there are any empty values

# split X and y into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_size, random_state = 0)

tree_clf = DecisionTreeClassifier(max_depth=2, random_state = 36)

# Applying k-Fold Cross Validation

# compute Average cross-validation score

print(f"Macro-averaged One-vs-Rest ROC AUC score:\n{macro_roc_auc_ovr:.2f}")

print('Naive Bayes Tests:')

print('Decision Tree Tests:')

You might also like