
CSC 240 HW 4


Comparison of Naïve Bayes and Decision Tree Classifier on Wine Dataset
Naïve Bayes and Decision Tree Classifier Analysis
Jonathan, J.W, Wang
University of Rochester, [email protected]

Using both the Naïve Bayes and Decision Tree Classifiers, this assignment compares the
classification results on the Wine Dataset. The first step was to ensure that no data points were missing
from the data files. Then, both classifiers were implemented using library functions. From there, certain
conclusions about the dataset could be made and are presented in this document.

Keywords and Phrases: Data Mining, Naïve Bayes, Decision Tree


1 INTRODUCTION
The goal of this assignment is to compare the ROC-AUC scores and k-fold cross-validation
scores of the Naïve Bayes and Decision Tree Classifiers on the Wine Dataset. The data
taken from UCI is first checked for missing values. This step is necessary because every data
point needs a valid value for the algorithms to run properly. The .data file
downloaded is first converted to a CSV file for ease of use and then checked for null values. For
this data set specifically, no null values were present. As a result, no further steps were needed
to clean the data. This check is done for both classification techniques. After both techniques are
applied to the data, the ROC-AUC score and k-fold cross-validation scores are calculated with a
few variations (varying test sizes and numbers of folds) during the testing phase of
this assignment to gain a better perspective on the results. The results from these tests are
then gathered to compare the scores and reach conclusions.

2 TESTING FOR EMPTY DATA POINTS


The data pulled from UCI is in the form of a .data file and the information about the file is
contained in the accompanying .names file. From the .names file, we can see that the leftmost
column in the .data file represents the class. The class of each row is 1, 2, or 3, and each
number corresponds to a different cultivar grown in the same region of Italy. The other columns
give the quantities of 13 constituents found in each respective wine.
For ease of use, the .data file is converted into a CSV file in Microsoft Excel. To
check whether there are null values in the CSV file, a short Python script can be run in Jupyter
Notebook.

import pandas as pd

data = 'wine.csv'
df = pd.read_csv(data, header=None)

col_names = ['Class', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium',
             'Total phenols', 'Flavanoids', 'Nonflavanoid phenol', 'Proanthocyanins',
             'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']
df.columns = col_names

# check if there are any empty values (prints the null count per column)
print(df.isnull().sum())

# separate the features (X) from the class label (y)
X = df.drop(['Class'], axis=1)
y = df['Class']

The code above reads the wine.csv file and names each column after the respective
chemical constituent or the class. Afterwards, the number of null values is counted for each
column. For this specific dataset, there is not a single empty value, so no extra steps are
needed to clean the data. The class column is then dropped from the feature table X and stored
separately as the target y. This way, the actual class label is not used as an input feature
during classification.
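The per-column null count can also be illustrated without pandas. The following is a minimal sketch using only the standard library; the rows are made-up values in the same comma-separated layout (not taken from the wine file), one field is deliberately left empty, and `count_empty_fields` is a hypothetical helper name. pandas would read such an empty field as a null value.

```python
import csv
import io

# Hypothetical rows in the same comma-separated layout as the wine file
# (class label first). The values are made up and one field is left empty.
SAMPLE = """1,14.2,1.7,2.4
2,12.4,,2.3
3,13.1,3.2,2.6
"""

def count_empty_fields(text):
    """Return a list with the number of empty fields in each column."""
    rows = list(csv.reader(io.StringIO(text)))
    counts = [0] * len(rows[0])
    for row in rows:
        for i, field in enumerate(row):
            if field.strip() == "":
                counts[i] += 1
    return counts

empty_counts = count_empty_fields(SAMPLE)
print(empty_counts)  # [0, 0, 1, 0]
```

A result of all zeros corresponds to the situation described above, where no cleaning is needed.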

3 IMPLEMENTING THE CLASSIFICATION


Both Naïve Bayes and Decision Trees are methods for classifying a given dataset. For this
assignment, the performance of the two techniques is compared using ROC-AUC and k-fold
cross-validation scores. The ROC-AUC score measures how well a model ranks instances of one
class above instances of another. The closer the ROC-AUC score is to 0.5, the closer the model
is to random guessing, and the closer the ROC-AUC score is to 1, the better the model.
Meanwhile, k-fold cross-validation is a method for estimating how well a model generalizes to
unseen data, which helps reveal overfitting (a situation where a model can almost perfectly
predict on training data but fails on testing data). The training data is split into k folds, and
each fold is used once for validation while the model is trained on the remaining folds. Since
accuracy is used as the scoring metric here, the closer the average score is to 1, the better the
model.
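To make the 0.5 and 1 reference points concrete, the sketch below computes a binary ROC-AUC in plain Python as the fraction of (positive, negative) pairs that the scores rank correctly. The function name `binary_auc` and the toy labels and scores are assumptions made for illustration; this is not the library routine used in the assignment.

```python
def binary_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))

labels = [0, 0, 1, 1]  # made-up binary labels

# Every positive is scored above every negative.
perfect = binary_auc(labels, [0.1, 0.2, 0.8, 0.9])
# A constant score carries no ranking information.
uninformative = binary_auc(labels, [0.5, 0.5, 0.5, 0.5])

print(perfect, uninformative)  # 1.0 0.5
```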

For the Naïve Bayes method, the data is first split into training and testing sets (0.3 is used
as the test size in this example). The features are then scaled with RobustScaler, which is fit
on the training set only and then applied to both sets.

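from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

cols = X_train.columns

# fit the scaler on the training set only, then apply it to both sets
scaler = RobustScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# restore the column names (pass cols directly; [cols] would create a MultiIndex)
X_train = pd.DataFrame(X_train, columns=cols)
X_test = pd.DataFrame(X_test, columns=cols)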
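The scaling step above can be sketched in plain Python. By default, RobustScaler centers each feature on its median and divides by its interquartile range. The helper names `fit_robust` and `transform_robust` and the column values are assumptions for illustration, and the sketch only mirrors the idea for a single column. It also shows why the statistics come from the training column alone: the test column is transformed with the training median and range.

```python
import statistics

def fit_robust(values):
    """Return (median, interquartile range) of a training column."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return statistics.median(values), q3 - q1

def transform_robust(values, median, iqr):
    """Center by the training median and divide by the training IQR."""
    return [(v - median) / iqr for v in values]

train_col = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up training values
test_col = [3.0, 7.0]                  # made-up test values

median, iqr = fit_robust(train_col)
train_scaled = transform_robust(train_col, median, iqr)
test_scaled = transform_robust(test_col, median, iqr)

print(train_scaled)  # [-1.0, -0.5, 0.0, 0.5, 1.0]
print(test_scaled)   # [0.0, 2.0]
```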

Next, a Gaussian Naïve Bayes classifier is trained on the training set and the results are
predicted.
from sklearn.naive_bayes import GaussianNB

# instantiate the model
gnb = GaussianNB()

# fit the model
gnb.fit(X_train, y_train)

# predict the class of each test instance
y_pred = gnb.predict(X_test)
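For intuition about what fitting and predicting involve, here is a minimal plain-Python sketch of the Gaussian Naïve Bayes idea: estimate a prior and a per-feature mean and variance for each class, then pick the class with the largest log prior plus summed log likelihoods. The names `fit_gnb` and `predict_gnb` and the toy data are assumptions, and the sketch omits the variance smoothing that the library applies, so it is an illustration rather than a replacement for GaussianNB.

```python
import math
from collections import defaultdict

def fit_gnb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [
            sum((v - m) ** 2 for v in col) / n
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gnb(model, row):
    """Return the class with the largest log prior + summed log likelihoods."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up two-feature data with two well-separated classes.
X_toy = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]]
y_toy = [1, 1, 1, 2, 2, 2]

toy_model = fit_gnb(X_toy, y_toy)
print(predict_gnb(toy_model, [1.1, 2.1]))  # 1
print(predict_gnb(toy_model, [5.1, 5.9]))  # 2
```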

To ensure the model is behaving properly, the test-set accuracy and the training-set accuracy
can be checked. If the two scores are similar, the model is unlikely to be overfitting. A
training score far above the test score would indicate overfitting, and low scores on both
would indicate underfitting.
from sklearn.metrics import accuracy_score

print('Model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred)))

y_pred_train = gnb.predict(X_train)

print('Training-set accuracy score: {0:0.4f}'.format(accuracy_score(y_train, y_pred_train)))

# print the scores on training and test set
# (gnb.score reports the same accuracy values as accuracy_score above)
print('Training set score: {:.4f}'.format(gnb.score(X_train, y_train)))
print('Test set score: {:.4f}'.format(gnb.score(X_test, y_test)))

K-fold cross validation is then applied to the training set, and the average score is computed
(in this example, 10 folds are used).
from sklearn.model_selection import cross_val_score

# Applying 10-Fold Cross Validation
scores = cross_val_score(gnb, X_train, y_train, cv=10, scoring='accuracy')

print('Cross-validation scores:{}'.format(scores))

# compute Average cross-validation score
print('Average cross-validation score: {:.4f}'.format(scores.mean()))
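The fold mechanics can be sketched in plain Python. The generator below (a hypothetical helper named `k_fold_indices`) splits the sample indices into k contiguous folds and uses each fold once for validation. It is a simplified, non-stratified illustration: when cross_val_score is given an integer cv and a classifier, it uses stratified folds, so the real fold membership differs. The size of 124 training rows is an assumption derived from a 0.3 test split of the 178 instances.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds.

    The first n_samples % k folds get one extra sample, so fold sizes
    differ by at most one.
    """
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

# 124 training rows is an assumption (178 instances with a 0.3 test split).
fold_sizes = [len(val) for _, val in k_fold_indices(124, 10)]
print(fold_sizes)  # [13, 13, 13, 13, 12, 12, 12, 12, 12, 12]
```

With validation folds of only 12 or 13 samples, a single misclassification moves a fold's accuracy by roughly 0.08, which is why per-fold scores such as 12/13 ≈ 0.9231 or 10/12 ≈ 0.8333 can sit next to several perfect folds.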

Lastly, the ROC-AUC score is computed. ROC-AUC is defined for binary classification, so it
does not apply directly to a problem with three classes. This posed a small
problem during the assignment. As seen in the textbook, a simple solution is to extend the
metric to multiclass classification with a One-versus-Rest (OvR) scheme, in which each class in
turn is treated as the positive class and all remaining classes as the negative class. The
classes have different numbers of instances (the wine.names file given by UCI lists 59
instances of class 1, 71 instances of class 2, and 48 instances of class 3). As a result, macro
averaging is used so that every class contributes equally to the final ROC-AUC score.
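from sklearn.metrics import roc_auc_score

# class probabilities for each test instance (one column per class)
y_score_gnb = gnb.predict_proba(X_test)

macro_roc_auc_ovr = roc_auc_score(
    y_test,
    y_score_gnb,
    multi_class="ovr",
    average="macro",
)

print(f"Macro-averaged One-vs-Rest ROC AUC score:\n{macro_roc_auc_ovr:.2f}")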
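The One-versus-Rest scheme with macro averaging can be sketched in plain Python: one binary AUC is computed per class from that class's probability column, and the per-class values are averaged with equal weight. The helper names and the toy probabilities below are assumptions made for illustration and are not results from the wine data.

```python
def binary_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return correct / (len(pos) * len(neg))

def macro_ovr_auc(y_true, proba, classes):
    """Return (macro average, per-class AUCs) under a One-versus-Rest scheme."""
    aucs = []
    for col, cls in enumerate(classes):
        # Treat the current class as positive and every other class as negative.
        binary_labels = [1 if y == cls else 0 for y in y_true]
        class_scores = [row[col] for row in proba]
        aucs.append(binary_auc(binary_labels, class_scores))
    return sum(aucs) / len(aucs), aucs

# Made-up labels and predicted probabilities (columns: class 1, 2, 3).
y_toy = [1, 1, 2, 2, 3, 3]
proba_toy = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],  # a class 2 sample with a low class 2 probability
    [0.1, 0.2, 0.7],
    [0.1, 0.4, 0.5],
]

macro_auc, per_class = macro_ovr_auc(y_toy, proba_toy, classes=[1, 2, 3])
print(per_class)  # [1.0, 0.8125, 1.0]
print(macro_auc)  # 0.9375
```

Because the average is unweighted, the weaker class 2 result pulls the macro score down by the same amount regardless of how many instances that class has.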

The Decision Tree method is implemented next. The process of determining the ROC-AUC score
and the k-fold cross-validation score is very similar to the one used for Naïve Bayes. The
first step is again to split the data into training and testing sets (once again 0.3 is used as the
testing size for this example). However, unlike the Naïve Bayes method, there is no need to
scale the data, because a decision tree splits on one feature threshold at a time and the
ordering of values within a feature does not change under rescaling.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

A decision tree classifier is trained on the training set.
from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=36)
tree_clf.fit(X_train, y_train)
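To illustrate how a tree chooses a split and why scaling is unnecessary, the sketch below searches one feature for the threshold with the lowest weighted Gini impurity (Gini is the default criterion of DecisionTreeClassifier). The function names `gini` and `best_threshold` and the data are assumptions for illustration; the library's tree builder is considerably more elaborate.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best split 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

feature = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # made-up feature values
labels = [1, 1, 1, 2, 2, 2]

threshold, impurity = best_threshold(feature, labels)
print(threshold, impurity)  # 6.5 0.0

# Rescaling the feature moves the threshold but leaves the partition,
# and therefore the impurity, unchanged.
scaled = [v / 100 for v in feature]
scaled_threshold, scaled_impurity = best_threshold(scaled, labels)
```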

From here, the cross validation process and the ROC-AUC score calculation are done just as for
the Naïve Bayes method (once again, 10 folds are used in the k-fold cross validation for the sake
of this example).
# Applying 10-Fold Cross Validation
scores = cross_val_score(tree_clf, X_train, y_train, cv=10, scoring='accuracy')

print('Cross-validation scores:{}'.format(scores))

# compute Average cross-validation score
print('Average cross-validation score: {:.4f}'.format(scores.mean()))

# class probabilities for each test instance (one column per class)
y_score_tree = tree_clf.predict_proba(X_test)
macro_roc_auc_ovr = roc_auc_score(
    y_test,
    y_score_tree,
    multi_class="ovr",
    average="macro",
)

print(f"Macro-averaged One-vs-Rest ROC AUC score:\n{macro_roc_auc_ovr:.2f}")

4 TESTING VARIOUS TEST SIZES AND FOLDS IN K-FOLD CROSS VALIDATION


Now that the implementations are finished, the test.py file runs the tests with
different test sizes and different numbers of folds in k-fold cross validation. An example of one
of the tests in the test.py file is the following:
print('0.3 test size with 10-fold cross validation')
naiveBayes(0.3, 10)
This passes 0.3 as the test size and 10 as the number of folds for cross validation. To test
different values, test.py runs the same function with different arguments. The exact same
values are then used for the decision tree method. This way, both classifiers can be
compared under identical settings.
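The repeated calls could also be driven by a loop over the grid of settings. The sketch below is one possible arrangement, assuming functions with the same (test_size, k_fold) signature as naiveBayes and decisionTree; the `run_grid` helper and the stub functions are hypothetical stand-ins so that the example runs on its own.

```python
from itertools import product

def run_grid(classifiers, test_sizes, fold_counts):
    """Call every classifier function once per (test_size, k_fold) pair."""
    calls = []
    for name, func in classifiers.items():
        for test_size, k_fold in product(test_sizes, fold_counts):
            print(f"{name}: {test_size} test size with {k_fold}-fold cross validation")
            func(test_size, k_fold)
            calls.append((name, test_size, k_fold))
    return calls

# Stand-ins for the naiveBayes and decisionTree functions in test.py.
def naive_bayes_stub(test_size, k_fold):
    pass

def decision_tree_stub(test_size, k_fold):
    pass

calls = run_grid(
    {"naiveBayes": naive_bayes_stub, "decisionTree": decision_tree_stub},
    test_sizes=[0.3, 0.1],
    fold_counts=[10, 5],
)
print(len(calls))  # 8
```

The loop visits the same four settings per classifier, in the same order, as the explicit calls in test.py.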

5 RESULTS
For the first test where the test size is 0.3 and there are 10 folds:

naiveBayes(0.3, 10)
The result for the naïve bayes is:

0.3 test size with 10-fold cross validation


Model accuracy score: 0.9444
Training-set accuracy score: 0.9839
Training set score: 0.9839
Test set score: 0.9444
Cross-validation scores:[0.92307692 1. 1. 1. 1. 1.
1. 1. 0.83333333 1. ]
Average cross-validation score: 0.9756
Macro-averaged One-vs-Rest ROC AUC score:
1.00

decisionTree(0.3, 10)
The result for the decision tree is:

0.3 test size with 10-fold cross validation


Cross-validation scores:[0.84615385 0.92307692 0.84615385 1. 0.91666667 0.83333333
0.83333333 0.83333333 0.75 0.91666667]
Average cross-validation score: 0.8699
Macro-averaged One-vs-Rest ROC AUC score:
0.93

We can see that both the average cross-validation score and the macro-averaged OvR ROC-
AUC score are higher for the Naïve Bayes method.

From here, it was necessary to test whether this held for different test sizes and
different numbers of folds. First, changing only the number of folds to 5 produced these
results:
naiveBayes(0.3, 5)
The result for the naïve bayes is:

0.3 test size with 5-fold cross validation


Model accuracy score: 0.9444
Training-set accuracy score: 0.9839
Training set score: 0.9839
Test set score: 0.9444
Cross-validation scores:[0.96 1. 1. 1. 0.91666667]
Average cross-validation score: 0.9753
Macro-averaged One-vs-Rest ROC AUC score:
1.00

decisionTree(0.3, 5)
The result for the decision tree is:

0.3 test size with 5-fold cross validation

Cross-validation scores:[0.8 0.92 0.88 0.88 0.83333333]
Average cross-validation score: 0.8627
Macro-averaged One-vs-Rest ROC AUC score:
0.93

When only changing the test size down to 0.1:


naiveBayes(0.1, 10)
The result for the naïve bayes is:

0.1 test size with 10-fold cross validation


Model accuracy score: 1.0000
Training-set accuracy score: 0.9812
Training set score: 0.9812
Test set score: 1.0000
Cross-validation scores:[0.875 1. 0.9375 0.9375 1. 1. 1. 1. 0.9375 0.9375]
Average cross-validation score: 0.9625
Macro-averaged One-vs-Rest ROC AUC score:
1.00

decisionTree(0.1, 10)
The result for the decision tree is:

0.1 test size with 10-fold cross validation


Cross-validation scores:[0.8125 0.9375 0.8125 0.6875 0.9375 0.875 0.8125 0.9375 0.8125 0.875 ]
Average cross-validation score: 0.8500
Macro-averaged One-vs-Rest ROC AUC score:
0.92

When changing both the test size down to 0.1 and the number of folds to 5:
naiveBayes(0.1, 5)
The result for the naïve bayes is:

0.1 test size with 5-fold cross validation


Model accuracy score: 1.0000
Training-set accuracy score: 0.9812
Training set score: 0.9812
Test set score: 1.0000
Cross-validation scores:[0.9375 0.9375 1. 1. 0.9375]
Average cross-validation score: 0.9625
Macro-averaged One-vs-Rest ROC AUC score:
1.00

decisionTree(0.1, 5)
The result for the decision tree is:

0.1 test size with 5-fold cross validation


Cross-validation scores:[0.875 0.84375 0.9375 0.90625 0.875 ]
Average cross-validation score: 0.8875
Macro-averaged One-vs-Rest ROC AUC score:

0.92

In every case, regardless of the test size or the number of folds in k-fold cross validation,
Naïve Bayes performed better. Its average cross-validation score was higher, and its ROC-
AUC score was closer to 1. From these results, we can conclude that the Naïve Bayes method is
the better classifier for this dataset. This is plausible given that the data set is relatively
small (178 instances), since Decision Trees tend to require more data to reach the same
accuracy levels. Note also that the tree here was limited to a depth of 2 (max_depth=2), which
restricts how closely it can fit the data. That being said, the Decision Tree method still has
solid scores for both cross-validation and ROC-AUC, and Decision Trees are by no means a bad
method of classification. Based on the textbook, Decision Trees are more flexible and easy to
use, but require tree pruning to prevent overfitting. Both methods are useful for
classification on a variety of data sets, and for this assignment, Naïve Bayes outperforms the
Decision Tree.
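The comparison can be summarized by tabulating the averages reported above. The short script below only re-enters the printed values from the four test runs and computes the gap between the two classifiers; the dictionary layout and variable names are assumptions chosen for illustration.

```python
# Average cross-validation accuracy and macro OvR ROC-AUC as reported in the
# results above, keyed by (test_size, folds).
reported = {
    (0.3, 10): {"nb_cv": 0.9756, "dt_cv": 0.8699, "nb_auc": 1.00, "dt_auc": 0.93},
    (0.3, 5): {"nb_cv": 0.9753, "dt_cv": 0.8627, "nb_auc": 1.00, "dt_auc": 0.93},
    (0.1, 10): {"nb_cv": 0.9625, "dt_cv": 0.8500, "nb_auc": 1.00, "dt_auc": 0.92},
    (0.1, 5): {"nb_cv": 0.9625, "dt_cv": 0.8875, "nb_auc": 1.00, "dt_auc": 0.92},
}

# Gap in average cross-validation accuracy (Naive Bayes minus Decision Tree).
cv_gaps = {cfg: round(r["nb_cv"] - r["dt_cv"], 4) for cfg, r in reported.items()}

# True only if Naive Bayes leads on both metrics in every configuration.
nb_wins_everywhere = all(
    r["nb_cv"] > r["dt_cv"] and r["nb_auc"] > r["dt_auc"] for r in reported.values()
)

print("test_size  folds  NB CV   DT CV   gap")
for (test_size, folds), r in reported.items():
    gap = cv_gaps[(test_size, folds)]
    print(f"{test_size:<9}  {folds:<5}  {r['nb_cv']:.4f}  {r['dt_cv']:.4f}  {gap:.4f}")
print("Naive Bayes ahead in every configuration:", nb_wins_everywhere)
```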

REFERENCES
[1] Forina, M. et al, PARVUS, 1991. Wine Data Set, https://archive.ics.uci.edu/ml/datasets/wine
[2] Prashant, P.B, Banerjee, 2020. Naive Bayes Classifier in Python
[3] Vipul, V.G, Gandhi, 2020. A Guide to Decision Trees for Beginners

A APPENDICES

A.1 Code for naiveBayes.py that implements the Naïve Bayes Classification
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import RobustScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, roc_auc_score

data = 'wine.csv'
df = pd.read_csv(data, header=None)

col_names = ['Class', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium',
             'Total phenols', 'Flavanoids', 'Nonflavanoid phenol', 'Proanthocyanins',
             'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']
df.columns = col_names

# check if there are any empty values
# print(df.isnull().sum())

# separate the features (X) from the class label (y)
X = df.drop(['Class'], axis=1)
y = df['Class']

# split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

cols = X_train.columns

# fit the scaler on the training set only, then apply it to both sets
scaler = RobustScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# restore the column names (pass cols directly; [cols] would create a MultiIndex)
X_train = pd.DataFrame(X_train, columns=cols)
X_test = pd.DataFrame(X_test, columns=cols)

# train a Gaussian Naive Bayes classifier on the training set

# instantiate the model
gnb = GaussianNB()

# fit the model
gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)

print('Model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred)))

y_pred_train = gnb.predict(X_train)

print('Training-set accuracy score: {0:0.4f}'.format(accuracy_score(y_train, y_pred_train)))

# print the scores on training and test set
print('Training set score: {:.4f}'.format(gnb.score(X_train, y_train)))
print('Test set score: {:.4f}'.format(gnb.score(X_test, y_test)))

# Applying 10-Fold Cross Validation
scores = cross_val_score(gnb, X_train, y_train, cv=10, scoring='accuracy')

print('Cross-validation scores:{}'.format(scores))

# compute Average cross-validation score
print('Average cross-validation score: {:.4f}'.format(scores.mean()))

# class probabilities for each test instance (one column per class)
y_score_gnb = gnb.predict_proba(X_test)

macro_roc_auc_ovr = roc_auc_score(
    y_test,
    y_score_gnb,
    multi_class="ovr",
    average="macro",
)

print(f"Macro-averaged One-vs-Rest ROC AUC score:\n{macro_roc_auc_ovr:.2f}")

A.2 Code for decisionTree.py that implements the Decision Tree Classification
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_auc_score

data = 'wine.csv'
df = pd.read_csv(data, header=None)

col_names = ['Class', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium',
             'Total phenols', 'Flavanoids', 'Nonflavanoid phenol', 'Proanthocyanins',
             'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']
df.columns = col_names

# check if there are any empty values
# print(df.isnull().sum())

# separate the features (X) from the class label (y)
X = df.drop(['Class'], axis=1)
y = df['Class']

# split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# train a depth-limited decision tree on the training set
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=36)
tree_clf.fit(X_train, y_train)

# Applying 10-Fold Cross Validation
scores = cross_val_score(tree_clf, X_train, y_train, cv=10, scoring='accuracy')

print('Cross-validation scores:{}'.format(scores))

# compute Average cross-validation score
print('Average cross-validation score: {:.4f}'.format(scores.mean()))

# class probabilities for each test instance (one column per class)
y_score_tree = tree_clf.predict_proba(X_test)
macro_roc_auc_ovr = roc_auc_score(
    y_test,
    y_score_tree,
    multi_class="ovr",
    average="macro",
)

print(f"Macro-averaged One-vs-Rest ROC AUC score:\n{macro_roc_auc_ovr:.2f}")

A.3 Code for test.py that has the values used for testing various test sizes and folds for k-fold cross validation
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import RobustScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.tree import DecisionTreeClassifier


def naiveBayes(test_size, k_fold):
    data = 'wine.csv'
    df = pd.read_csv(data, header=None)

    col_names = ['Class', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium',
                 'Total phenols', 'Flavanoids', 'Nonflavanoid phenol', 'Proanthocyanins',
                 'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']
    df.columns = col_names

    # check if there are any empty values
    # print(df.isnull().sum())

    # separate the features (X) from the class label (y)
    X = df.drop(['Class'], axis=1)
    y = df['Class']

    # split X and y into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0)

    cols = X_train.columns

    # fit the scaler on the training set only, then apply it to both sets
    scaler = RobustScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # restore the column names (pass cols directly; [cols] would create a MultiIndex)
    X_train = pd.DataFrame(X_train, columns=cols)
    X_test = pd.DataFrame(X_test, columns=cols)

    # train a Gaussian Naive Bayes classifier on the training set
    gnb = GaussianNB()
    gnb.fit(X_train, y_train)

    y_pred = gnb.predict(X_test)

    print('Model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred)))

    y_pred_train = gnb.predict(X_train)

    print('Training-set accuracy score: {0:0.4f}'.format(accuracy_score(y_train, y_pred_train)))

    # print the scores on training and test set
    print('Training set score: {:.4f}'.format(gnb.score(X_train, y_train)))
    print('Test set score: {:.4f}'.format(gnb.score(X_test, y_test)))

    # Applying k-Fold Cross Validation
    scores = cross_val_score(gnb, X_train, y_train, cv=k_fold, scoring='accuracy')

    print('Cross-validation scores:{}'.format(scores))

    # compute Average cross-validation score
    print('Average cross-validation score: {:.4f}'.format(scores.mean()))

    # class probabilities for each test instance (one column per class)
    y_score_gnb = gnb.predict_proba(X_test)

    macro_roc_auc_ovr = roc_auc_score(
        y_test,
        y_score_gnb,
        multi_class="ovr",
        average="macro",
    )

    print(f"Macro-averaged One-vs-Rest ROC AUC score:\n{macro_roc_auc_ovr:.2f}")


def decisionTree(test_size, k_fold):
    data = 'wine.csv'
    df = pd.read_csv(data, header=None)

    col_names = ['Class', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium',
                 'Total phenols', 'Flavanoids', 'Nonflavanoid phenol', 'Proanthocyanins',
                 'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']
    df.columns = col_names

    # check if there are any empty values
    # print(df.isnull().sum())

    # separate the features (X) from the class label (y)
    X = df.drop(['Class'], axis=1)
    y = df['Class']

    # split X and y into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0)

    # train a depth-limited decision tree on the training set
    tree_clf = DecisionTreeClassifier(max_depth=2, random_state=36)
    tree_clf.fit(X_train, y_train)

    # Applying k-Fold Cross Validation
    scores = cross_val_score(tree_clf, X_train, y_train, cv=k_fold, scoring='accuracy')

    print('Cross-validation scores:{}'.format(scores))

    # compute Average cross-validation score
    print('Average cross-validation score: {:.4f}'.format(scores.mean()))

    # class probabilities for each test instance (one column per class)
    y_score_tree = tree_clf.predict_proba(X_test)
    macro_roc_auc_ovr = roc_auc_score(
        y_test,
        y_score_tree,
        multi_class="ovr",
        average="macro",
    )

    print(f"Macro-averaged One-vs-Rest ROC AUC score:\n{macro_roc_auc_ovr:.2f}")


print('Naive Bayes Tests:')
print()
print('0.3 test size with 10-fold cross validation')
naiveBayes(0.3, 10)
print()
print('0.3 test size with 5-fold cross validation')
naiveBayes(0.3, 5)
print()
print('0.1 test size with 10-fold cross validation')
naiveBayes(0.1, 10)
print()
print('0.1 test size with 5-fold cross validation')
naiveBayes(0.1, 5)

print()
print()

print('Decision Tree Tests:')
print()
print('0.3 test size with 10-fold cross validation')
decisionTree(0.3, 10)
print()
print('0.3 test size with 5-fold cross validation')
decisionTree(0.3, 5)
print()
print('0.1 test size with 10-fold cross validation')
decisionTree(0.1, 10)
print()
print('0.1 test size with 5-fold cross validation')
decisionTree(0.1, 5)
print()

A.4 Image of the terminal results when running test.py
