CSC 240 HW 4
Naïve Bayes and Decision Tree Classifier Analysis
Jonathan, J.W, Wang
University of Rochester, [email protected]
Using both the Naïve Bayes and Decision Tree classifiers, this assignment compares classification
results on the Wine Dataset. The first step was to ensure that all data points were present in the data
files. Both classifiers were then implemented using library functions, and from their results, conclusions
about the dataset were drawn and are presented in this document.
1 INTRODUCTION
The goal of this assignment is to compare the ROC-AUC scores and k-fold cross-validation
scores of the Naïve Bayes and Decision Tree classifiers on the Wine Dataset. The data
taken from UCI is first checked for missing values, since every data point needs an appropriate
value for the algorithms to run properly. The downloaded .data file is first converted to a CSV
file for ease of use and then checked for null values. For this dataset specifically, no null values
were present, so no further steps were needed to clean the data. This preparation step is shared
by both classification techniques. After both techniques are applied to the data, the ROC-AUC
score and k-fold cross-validation scores are calculated with a few variations (varying test sizes
and numbers of folds) during the testing phase of this assignment to gain a better perspective on
the results. The results from these tests are then gathered to compare the scores and reach conclusions.
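As a minimal sketch of that conversion step (assuming the raw UCI file is saved as wine.data next to the script; the exact conversion code used for the assignment is not part of the submitted files, since the raw file is already comma-separated with no header row):

import pandas as pd

# read the raw UCI file (comma-separated, no header row) and write it out as wine.csv
raw = pd.read_csv('wine.data', header=None)
raw.to_csv('wine.csv', index=False, header=False)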
import pandas as pd
import csv

data = 'wine.csv'
df = pd.read_csv(data, header=None)
col_names = ['Class', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium',
             'Total phenols', 'Flavanoids', 'Nonflavanoid phenol', 'Proanthocyanins',
             'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']
df.columns = col_names
X = df.drop(['Class'], axis=1)
y = df['Class']
The code above reads the wine.csv file and names each column after the respective chemical
constituent (or the class label). Afterwards, the number of null values is counted for each
column. For this specific dataset, there is not a single empty value, so no extra cleaning steps
are needed. The class column is dropped from the feature matrix and kept as a separate target
variable, so that the actual class of each sample does not leak into the features used for classification.
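The null-value check itself is a one-liner; it also appears, commented out, in the code in Appendix A.1:

# count missing values per column; every count is 0 for this dataset
print(df.isnull().sum())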
For the Naïve Bayes method, the data is first split into training and testing sets (0.3 is used as
the test size in this example), and the features are then scaled.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
cols = X_train.columns
scaler = RobustScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Next, a Gaussian Naïve Bayes classifier is trained on the training set and predictions are made
on the test set.

# instantiate and fit the model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# predict on the test set
y_pred = gnb.predict(X_test)
To ensure the model is behaving properly, the accuracy on the test set and on the training set
can be checked. Comparing whether the training set score and the test set score are similar also
serves as a check for overfitting or underfitting.
print('Model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred)))
y_pred_train = gnb.predict(X_train)
print('Training set score: {0:0.4f}'.format(accuracy_score(y_train, y_pred_train)))
print('Test set score: {:.4f}'.format(gnb.score(X_test, y_test)))
K-fold cross-validation is then applied and the average score is computed (in this example, 10
folds are used).

# Applying 10-Fold Cross Validation
scores = cross_val_score(gnb, X_train, y_train, cv=10, scoring='accuracy')
print('Cross-validation scores:{}'.format(scores))
print('Average cross-validation score: {:.4f}'.format(scores.mean()))
Lastly, the ROC-AUC score is computed. The ROC-AUC score is defined for binary problems, so it
does not directly apply to a multiclass setting where each class must be compared against more
than one other class. This posed a small problem during the assignment. As seen in the textbook,
a simple solution is to extend the metric to multiclass classification; here the ROC-AUC score is
computed with an OVR (One-versus-Rest) scheme. Because there is a different number of instances
associated with each class label (the wine.names file provided by UCI lists 59 instances of class 1,
71 of class 2, and 48 of class 3), macro averaging is used so that all classes are treated equally
when the ROC-AUC score is calculated.
y_score_gnb = gnb.predict_proba(X_test)
macro_roc_auc_ovr = roc_auc_score(
    y_test,
    y_score_gnb,
    multi_class="ovr",
    average="macro",
)
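roc_auc_score handles the one-versus-rest binarization internally, but the following sketch (not part of the submitted code) illustrates what the macro-averaged OVR score computes, assuming the class labels are 1, 2, and 3 as in the Wine dataset:

import numpy as np
from sklearn.preprocessing import label_binarize

# one binary AUC per class (class i vs. the rest), then an unweighted mean
classes = [1, 2, 3]
y_test_bin = label_binarize(y_test, classes=classes)
per_class_auc = [
    roc_auc_score(y_test_bin[:, i], y_score_gnb[:, i])
    for i in range(len(classes))
]
print('Per-class OVR AUC:', per_class_auc)
print('Macro average:', np.mean(per_class_auc))  # should match macro_roc_auc_ovr above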
The Decision Tree method is implemented next. The process of determining the ROC-AUC score
and the k-fold cross-validation score is largely the same. The first step is again to split the data
into training and testing sets (once again, 0.3 is used as the test size for this example). However,
unlike the Naïve Bayes method, there is no need to scale the data, since tree splits on individual
features are not affected by the scale of those features.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
A decision tree classifier is trained on the training set.
tree_clf = DecisionTreeClassifier(max_depth=2, random_state = 36)
tree_clf.fit(X_train, y_train)
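As an optional aside, since export_graphviz is already imported in decisionTree.py (Appendix A.2), the fitted tree can be written out for inspection; the output file name below is illustrative:

from sklearn.tree import export_graphviz

# dump the fitted tree to a Graphviz .dot file (file name chosen for illustration)
export_graphviz(
    tree_clf,
    out_file='wine_tree.dot',
    feature_names=X.columns,
    class_names=['1', '2', '3'],  # Wine dataset class labels
    filled=True,
)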
From here, the cross-validation process and the ROC-AUC score calculation are done just as in
the Naïve Bayes method (once again, 10 folds are used in the k-fold cross-validation for the sake
of this example).
# Applying 10-Fold Cross Validation
scores = cross_val_score(tree_clf, X_train, y_train, cv=10, scoring='accuracy')
print('Cross-validation scores:{}'.format(scores))
print('Average cross-validation score: {:.4f}'.format(scores.mean()))

y_score_tree = tree_clf.predict_proba(X_test)
macro_roc_auc_ovr = roc_auc_score(
    y_test,
    y_score_tree,
    multi_class="ovr",
    average="macro",
)
5 RESULTS
For the first test, where the test size is 0.3 and there are 10 folds:

naiveBayes(0.3, 10)
The result for the Naïve Bayes is:
[Terminal output screenshot]

decisionTree(0.3, 10)
The result for the Decision Tree is:
[Terminal output screenshot]
We can see that both the average cross-validation score and the macro-averaged OVR ROC-AUC
score are higher for the Naïve Bayes method.
From here, it was necessary to test whether this held for different test sizes and different
numbers of folds. First, changing only the number of folds to 5 produced these results:

naiveBayes(0.3, 5)
The result for the Naïve Bayes is:
[Terminal output screenshot]

decisionTree(0.3, 5)
The result for the Decision Tree is:
Cross-validation scores:[0.8 0.92 0.88 0.88 0.83333333]
Average cross-validation score: 0.8627
Macro-averaged One-vs-Rest ROC AUC score: 0.93

Next, for a test size of 0.1 with 10 folds:

decisionTree(0.1, 10)
The result for the Decision Tree is:
[Terminal output screenshot]
When changing both the test size down to 0.1 and the number of folds to 5:

naiveBayes(0.1, 5)
The result for the Naïve Bayes is:
[Terminal output screenshot]

decisionTree(0.1, 5)
The result for the Decision Tree is:
[Terminal output screenshot; the recovered score from it is 0.92]
In every case, regardless of the test size or the number of folds used in k-fold cross-validation,
Naïve Bayes performed better: its average cross-validation score was higher and its ROC-AUC
score was closer to 1. From these results, we can conclude that Naïve Bayes is the better classifier
for this dataset. Given that the dataset is relatively small, this is plausible, since Decision Trees
generally require more data to reach the same accuracy levels. That said, the Decision Tree method
still has very solid scores for both cross-validation and ROC-AUC, and Decision Trees are by no
means a bad method of classification. Based on the textbook, Decision Trees are flexible and easy
to use, but require tree pruning to prevent overfitting. Both methods are excellent choices for many
different datasets, but for this assignment, Naïve Bayes comes out ahead.
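As a brief illustration of the pruning idea mentioned above (not part of the submitted code), scikit-learn supports cost-complexity pruning through the ccp_alpha parameter; the alpha value below is chosen arbitrarily for the sketch.

# sketch: cost-complexity pruning of a decision tree (illustrative alpha value)
path = tree_clf.cost_complexity_pruning_path(X_train, y_train)
print('Candidate alphas:', path.ccp_alphas)

pruned_clf = DecisionTreeClassifier(random_state=36, ccp_alpha=0.02)  # alpha picked arbitrarily
pruned_clf.fit(X_train, y_train)
print('Pruned tree depth:', pruned_clf.get_depth())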
REFERENCES
[1] Forina, M., et al. 1991. Wine Data Set. PARVUS. https://archive.ics.uci.edu/ml/datasets/wine
[2] Banerjee, P. 2020. Naive Bayes Classifier in Python.
[3] Gandhi, V. 2020. A Guide to Decision Trees for Beginners.
A APPENDICES
A.1 Code for naiveBayes.py that implements the Naïve Bayes Classification
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
import csv

data = 'wine.csv'
df = pd.read_csv(data, header=None)
col_names = ['Class', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium',
             'Total phenols', 'Flavanoids', 'Nonflavanoid phenol', 'Proanthocyanins',
             'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']
df.columns = col_names

# check if there are any empty values
# df.isnull().sum()

X = df.drop(['Class'], axis=1)
y = df['Class']

# split into training and testing sets (0.3 test size)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# scale the features
cols = X_train.columns
scaler = RobustScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# train the Gaussian Naive Bayes classifier and predict
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
y_pred_train = gnb.predict(X_train)

print('Model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred)))
print('Training set score: {0:0.4f}'.format(accuracy_score(y_train, y_pred_train)))
print('Test set score: {:.4f}'.format(gnb.score(X_test, y_test)))

# 10-fold cross validation
scores = cross_val_score(gnb, X_train, y_train, cv=10, scoring='accuracy')
print('Cross-validation scores:{}'.format(scores))
print('Average cross-validation score: {:.4f}'.format(scores.mean()))

# macro-averaged One-vs-Rest ROC AUC
y_score_gnb = gnb.predict_proba(X_test)
macro_roc_auc_ovr = roc_auc_score(
    y_test,
    y_score_gnb,
    multi_class="ovr",
    average="macro",
)
print('Macro-averaged One-vs-Rest ROC AUC score: {:.4f}'.format(macro_roc_auc_ovr))
A.2 Code for decisionTree.py that implements the Decision Tree Classification
import pandas as pd
import numpy as np
import csv
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

data = 'wine.csv'
df = pd.read_csv(data, header=None)
col_names = ['Class', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium',
             'Total phenols', 'Flavanoids', 'Nonflavanoid phenol', 'Proanthocyanins',
             'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']
df.columns = col_names

X = df.drop(['Class'], axis=1)
y = df['Class']

# split into training and testing sets (0.3 test size); no scaling needed for a tree
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# train the decision tree classifier
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=36)
tree_clf.fit(X_train, y_train)

# 10-fold cross validation
scores = cross_val_score(tree_clf, X_train, y_train, cv=10, scoring='accuracy')
print('Cross-validation scores:{}'.format(scores))
print('Average cross-validation score: {:.4f}'.format(scores.mean()))

# macro-averaged One-vs-Rest ROC AUC
y_score_tree = tree_clf.predict_proba(X_test)
macro_roc_auc_ovr = roc_auc_score(
    y_test,
    y_score_tree,
    multi_class="ovr",
    average="macro",
)
print('Macro-averaged One-vs-Rest ROC AUC score: {:.4f}'.format(macro_roc_auc_ovr))
A.3 Code for test.py that has the values used for testing various test sizes and folds for k-fold cross validation
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import csv

col_names = ['Class', 'Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium',
             'Total phenols', 'Flavanoids', 'Nonflavanoid phenol', 'Proanthocyanins',
             'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']


def naiveBayes(test_size, folds):
    # load the data and separate features from the class label
    df = pd.read_csv('wine.csv', header=None)
    df.columns = col_names
    X = df.drop(['Class'], axis=1)
    y = df['Class']

    # split and scale the features
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=0)
    cols = X_train.columns
    scaler = RobustScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    X_train = pd.DataFrame(X_train, columns=[cols])
    X_test = pd.DataFrame(X_test, columns=[cols])

    # train the Gaussian Naive Bayes classifier
    gnb = GaussianNB()
    gnb.fit(X_train, y_train)
    y_pred = gnb.predict(X_test)
    y_pred_train = gnb.predict(X_train)

    # k-fold cross validation
    scores = cross_val_score(gnb, X_train, y_train, cv=folds, scoring='accuracy')
    print('Cross-validation scores:{}'.format(scores))
    print('Average cross-validation score: {:.4f}'.format(scores.mean()))

    # macro-averaged One-vs-Rest ROC AUC
    y_score_gnb = gnb.predict_proba(X_test)
    macro_roc_auc_ovr = roc_auc_score(
        y_test,
        y_score_gnb,
        multi_class="ovr",
        average="macro",
    )
    print('Macro-averaged One-vs-Rest ROC AUC score: {:.4f}'.format(macro_roc_auc_ovr))


def decisionTree(test_size, folds):
    # load the data and separate features from the class label
    df = pd.read_csv('wine.csv', header=None)
    df.columns = col_names
    X = df.drop(['Class'], axis=1)
    y = df['Class']

    # split into training and testing sets (no scaling needed for a tree)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=0)

    # train the decision tree classifier
    tree_clf = DecisionTreeClassifier(max_depth=2, random_state=36)
    tree_clf.fit(X_train, y_train)

    # k-fold cross validation
    scores = cross_val_score(tree_clf, X_train, y_train, cv=folds, scoring='accuracy')
    print('Cross-validation scores:{}'.format(scores))
    print('Average cross-validation score: {:.4f}'.format(scores.mean()))

    # macro-averaged One-vs-Rest ROC AUC
    y_score_tree = tree_clf.predict_proba(X_test)
    macro_roc_auc_ovr = roc_auc_score(
        y_test,
        y_score_tree,
        multi_class="ovr",
        average="macro",
    )
    print('Macro-averaged One-vs-Rest ROC AUC score: {:.4f}'.format(macro_roc_auc_ovr))


# 0.3 test size with 10-fold cross validation
naiveBayes(0.3, 10)
decisionTree(0.3, 10)
print()

# 0.3 test size with 5-fold cross validation
naiveBayes(0.3, 5)
decisionTree(0.3, 5)
print()

# 0.1 test size with 10-fold cross validation
decisionTree(0.1, 10)
print()

print('0.1 test size with 5-fold cross validation')
naiveBayes(0.1, 5)
decisionTree(0.1, 5)
A.4 Image of the terminal results when running test.py