Report Practical PR
1 Introduction
In this paper the MNIST dataset of handwritten digits is explored and analyzed. Various feature extraction techniques are employed to analyze the dataset and the features it contains, and new features are extracted from the data. By testing these self-extracted features, insight is gained into what information is learned by a model. Subsequently, various machine learning algorithms (support vector machines, multinomial logit models and neural networks) are trained on the original pixel values. By using the different models with different parameters, knowledge about how the models function is gained. All of this is implemented in Python using the machine learning library sklearn; when function names are referred to in the rest of the paper, they come from this library.
In the following sections, the methods used for data evaluation, feature extraction, and model training and evaluation are detailed. The results of the evaluations are then analyzed, followed by a conclusion.
2 Data Analysis
The dataset is provided as arrays of pixel values per digit. The following notation is used throughout the report:
• Row i - The i-th row of the dataset array, beginning from the top of the array.
• Column j - The j-th column of the dataset array, beginning from the left side of the
array.
• Pixel(i, j) - The pixel value in row i and column j of the dataset array.
The number of instances per digit contained in the dataset, and their distribution, can be seen in the table below.
This shows that the majority class is the digit 1. If a classifier were to predict this majority class for every instance, 11.15% of the total dataset would be classified correctly.
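As an illustration, the class counts and this baseline can be computed directly from the labels. The sketch below assumes a Kaggle-style layout of the dataset (a CSV with a label column followed by 784 pixel columns per 28 x 28 image); the file name train.csv is a placeholder.

    import numpy as np
    import pandas as pd

    # Load the dataset; "train.csv" is a placeholder, assumed to hold a
    # "label" column followed by 784 pixel columns (28x28 images).
    data = pd.read_csv("train.csv")
    labels = data["label"].to_numpy()
    pixels = data.drop(columns="label").to_numpy()  # shape (n_samples, 784)

    # Majority-class baseline: predict the most frequent digit everywhere.
    digits, counts = np.unique(labels, return_counts=True)
    print(digits[counts.argmax()], counts.max() / counts.sum())  # ~0.1115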
Some pixels have a zero value for all digits in the dataset. These pixels are essentially useless, since they convey no information that can be used to distinguish between different labels. The pixels in question are presented in Figure 1 below.
Figure 1: The yellow pixels are zero for all 42000 digits in the dataset, and thus convey no
useful information for classification.
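These dead pixels can be located with a single comparison; pixels is the array from the sketch above.

    # A pixel conveys no information if its maximum over all images is zero.
    dead_pixels = np.where(pixels.max(axis=0) == 0)[0]
    print(len(dead_pixels), "pixels are zero for every digit in the dataset")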
3 Experimental Setup
3.1 Feature Extraction Experiments
For these experiments, two new features will be extracted from the raw pixel data. The first is a so-called ink feature, which specifies how much "ink" a digit uses. The second feature is called half ink. It is similar to ink, but instead of using the whole digit it uses only half of the digit.
More detail about these features and the experiments is laid out in the following part of the report.
The ink feature is essentially a measure of how 'large' digits are relative to one another. However, it completely discards the shape of a digit. The half-ink feature may discriminate better between digits, as it relies on the asymmetry of the digits; this way the actual shape of a digit hypothetically has a bigger impact on the amount of "ink" used. Some digits have more "ink" in the bottom half of the matrix, like the 6, whilst others have more in the top half, like the 7, even though overall they use a relatively even amount of "ink". By dividing the matrix in half horizontally and taking only the top half, these imbalances can be captured, thus discriminating between the digits.
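A minimal sketch of both features, assuming the 28 x 28 images are stored row-major as 784 values, so that the top half of an image corresponds to the first 14 x 28 = 392 pixel values; pixels is the array from the earlier sketch.

    # Ink: total pixel intensity of each digit image.
    ink = pixels.sum(axis=1)

    # Half ink: total intensity of the top half only (first 14 of 28 rows).
    half_ink = pixels[:, :14 * 28].sum(axis=1)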
To perform the experiment, the same approach was taken as with the ink feature. For each digit class, the mean and standard deviation were calculated. The data was scaled and then reshaped into the correct shape to feed to the model. A multinomial logit model was then fitted in the same way as for the ink feature, i.e. using sklearn's LogisticRegression with default parameters.
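A sketch of this procedure, reusing half_ink and labels from the earlier sketches; the scaling and the default LogisticRegression mirror the description above.

    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import scale

    # Per-class mean and standard deviation of the feature.
    for d in range(10):
        print(d, half_ink[labels == d].mean(), half_ink[labels == d].std())

    # Scale the feature and reshape it into an (n_samples, 1) design matrix.
    X = scale(half_ink.astype(float)).reshape(-1, 1)

    # Multinomial logit with sklearn's default parameters.
    clf = LogisticRegression()
    clf.fit(X, labels)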
• tol - The tolerance criterion the model takes into account for stopping. This tells the model to stop searching for a minimum or maximum once the specified tolerance is achieved. This hyperparameter is important to tune, since if it is too large the algorithm will stop before it converges. This is why the hyperparameter is experimented upon with small values.
• max_iter - The maximum number of iterations for the solvers to converge. As the number of iterations increases, the precision with which the logistic regression fits the data grows, which in turn can cause overfitting. By experimenting with different maximum iteration values, an optimal value that does not cause overfitting may be found. The values experimented upon range from 1 to 10000 in powers of 10, so that a clear separation between the different iteration values may be obtained.
• solver - A solver tries to find the parameter weights that minimize a cost function. Since the LASSO (L1) penalty is used in this experiment, the only solvers that support this penalty are saga and liblinear.
The values that produce the smallest classification error were searched for using GridSearchCV. Here, all possible combinations of the given parameter values are tested to see which combination achieves the smallest error. This method uses cross-validation for training and evaluation. The possible hyperparameter values fed into the grid search were:
The best combination of hyperparameters observed for the logit classifier through the grid search was C = 0.1, max_iter = 1000, solver = 'saga' and tol = 1e-05.
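A sketch of this search is shown below. The exact grid used for the logit model is not retained in this section, so the listed values are assumptions based on the surrounding text (max_iter in powers of 10 from 1 to 10000, small tolerances, and the saga and liblinear solvers); X_train and y_train stand for the training split of the pixel data.

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    # Assumed grid; every combination is evaluated with cross-validation.
    param_grid = {
        "C": [0.001, 0.01, 0.1, 1, 10, 100],
        "tol": [1e-6, 1e-5, 1e-4, 1e-3],
        "max_iter": [1, 10, 100, 1000, 10000],
        "solver": ["saga", "liblinear"],
    }
    search = GridSearchCV(LogisticRegression(penalty="l1"), param_grid, cv=5)
    search.fit(X_train, y_train)
    print(search.best_params_, search.best_score_)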
Table 2: The parameters and values searched in the support vector machines classifier

tol       0.000001  0.00001  0.0001  0.001  0.01   0.1
C         0.001     0.01     0.1     1      10     100    1000
max_iter  1         10       100     1000   10000  -1
kernel    linear    poly     rbf     sigmoid
The best set of hyperparameters observed for the SVM classifier through the grid search was C = 100, kernel = 'poly', max_iter = 1000 and tol = 0.1.
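The corresponding search for the SVM, using the values from Table 2, might look as follows; again a sketch, with X_train and y_train assumed as before.

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    param_grid = {
        "tol": [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
        "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
        "max_iter": [1, 10, 100, 1000, 10000, -1],  # -1 means no iteration limit
        "kernel": ["linear", "poly", "rbf", "sigmoid"],
    }
    search = GridSearchCV(SVC(), param_grid, cv=5)
    search.fit(X_train, y_train)
    print(search.best_params_)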
• hidden_layer_sizes - The number of neurons in the hidden layer. More neurons increase the complexity of the model. This parameter is tuned so that the model is complex enough to converge, but not overly complex.
• activation - The activation function used decides when a neuron should be activated or not. Different activation functions give different thresholds.
• solver - A solver tries to find the parameter weights that minimize a cost function.
• learning_rate - How much the model is changed in response to the estimated error. The value of this parameter is very important to ensure the model does not overfit.
• tol - The tolerance criterion the model takes into account for stopping. This tells the model to stop searching for a minimum or maximum once the specified tolerance is achieved.
• max_iter - The maximum number of iterations for the solvers to converge.
• alpha - Similar to the complexity hyperparameter C, it dictates how strongly ridge (L2) regularization is applied.
The values tested for the hyperparameters are in the table below:
Table 3: The parameters and values searched in the feed-forward neural network

tol                 0.000001  0.00001   0.0001  0.001    0.01     0.1
alpha               0.1       0.01      0.001   0.0001   0.00001  0.000001
max_iter            100       500       1000    10000    100000
hidden_layer_sizes  10        100       200     500      1000
activation          identity  logistic  tanh    relu
solver              lbfgs     sgd       adam
learning_rate       constant  invscaling  adaptive
The best set of hyperparameters found in the grid search was random_state = 0, hidden_layer_sizes = 500, activation = 'relu', learning_rate = 'constant', solver = 'adam', alpha = 0.1, max_iter = 100 and tol = 0.0001. This combination of hyperparameter values gives an estimated accuracy of 94%.
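A sketch of the corresponding search over the values in Table 3. Note that this grid contains tens of thousands of combinations, so evaluating it exhaustively is computationally expensive.

    from sklearn.model_selection import GridSearchCV
    from sklearn.neural_network import MLPClassifier

    param_grid = {
        "tol": [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
        "alpha": [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6],
        "max_iter": [100, 500, 1000, 10000, 100000],
        "hidden_layer_sizes": [(10,), (100,), (200,), (500,), (1000,)],
        "activation": ["identity", "logistic", "tanh", "relu"],
        "solver": ["lbfgs", "sgd", "adam"],
        "learning_rate": ["constant", "invscaling", "adaptive"],
    }
    search = GridSearchCV(MLPClassifier(random_state=0), param_grid, cv=5)
    search.fit(X_train, y_train)
    print(search.best_params_)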
3.2.4 Accuracy Analysis
When the logit, SVM and feed-forward multilayer perceptron models are trained, they will also be compared with each other to determine whether one accuracy is significantly better or worse than the others. This will be done using a statistical method.
Specifically, the statistical method used was McNemar's test, since it translates well to the accuracy and error rates of machine learning models. First, the contingency table for each pair of models will be calculated. For the three classifiers C1, C2 and C3, each contingency table indicates:
• The number of correct classifications for classifier C1 that are wrong for classifier C2
• The number of correct classifications for classifier C2 that are wrong for classifier C1
After each contingency table is calculated, McNemar's test will be performed for each pair, and the p-value and statistic for each pair will be recorded. The hypothesis H0 under which the test will be performed is that the models' accuracies differ significantly between the pairs.
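sklearn itself does not provide McNemar's test; the sketch below uses the implementation from statsmodels as one plausible choice. The helper name mcnemar_pair is hypothetical; y is the vector of true labels and pred_a, pred_b are the predictions of two classifiers on the same test set.

    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    def mcnemar_pair(y, pred_a, pred_b):
        a_right = pred_a == y
        b_right = pred_b == y
        # 2x2 contingency table; the off-diagonal cells count the instances
        # one classifier got right and the other got wrong.
        table = np.array([
            [np.sum(a_right & b_right), np.sum(a_right & ~b_right)],
            [np.sum(~a_right & b_right), np.sum(~a_right & ~b_right)],
        ])
        result = mcnemar(table, exact=False)
        return result.statistic, result.pvalue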
4 Results
From the statistical analysis of the ink feature (the per-class means and standard deviations), the following results can be observed for the dataset:
• The class with the least ink used is the class with the digit 1.
• The class with the most ink used is the class with the digit 0.
• Classes with digits 2, 3 and 6 are very close regarding the amount of ink used, as well
as classes 4 and 9.
From the above it can be inferred that the easiest pair of classes to distinguish will be 1 and 0, since their difference in ink is the greatest. By the same logic, classes 2, 3 and 6, as well as classes 4 and 9, will be difficult to separate, since their ink values are so close. This can also be seen in the results from the classifier itself. The confusion matrix for the logit classifier trained on the ink feature can be seen in the table below.
Table 5: The logit classifier confusion matrix for the ink feature (rows: true labels, columns: predicted labels)
0 1 2 3 4 5 6 7 8 9
0 2420 83 322 805 0 0 0 384 0 118
1 10 3823 5 101 0 0 0 722 0 23
2 1496 280 326 1039 0 0 0 874 0 162
3 1247 408 334 1037 0 0 0 1141 0 184
4 441 829 195 886 0 0 0 1496 0 225
5 728 671 197 846 0 0 0 1190 0 163
6 1057 450 296 982 0 0 0 1145 0 207
7 325 1190 149 819 0 0 0 1700 0 218
8 1431 192 342 1047 0 0 0 879 0 172
9 484 763 196 870 0 0 0 1651 0 224
The table shows that the logit classifier performs poorly on the dataset. The accuracy calculated from this matrix is approximately 23%. The digits 4, 5, 6 and 8 were never predicted. This is likely caused by their ink values being very similar to those of other digits. For instance, the ink values of 4 and 9 are very close; however, as seen in the data analysis, the 9 has more training instances. The classifier therefore picks the majority class 9 every time, and the 4 is never predicted.
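For reference, the accuracy quoted above follows directly from the confusion matrix, since the correct predictions lie on its diagonal. In the minimal sketch below, y_true and y_pred stand for the true labels and the classifier's predictions.

    import numpy as np
    from sklearn.metrics import confusion_matrix

    # Rows are true labels, columns are predicted labels.
    cm = confusion_matrix(y_true, y_pred)
    accuracy = np.trace(cm) / cm.sum()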
Table 6: Half ink mean and standard deviations per label
Label Half Ink Mean Half Ink Standard Deviation
0 16605.25 4148.89
1 5950.64 2524.94
2 13389.62 3718.41
3 11980.13 3606.81
4 10695.53 3354.00
5 12207.75 3590.74
6 13303.94 3876.17
7 9456.96 3155.64
8 14163.36 3917.75
9 10300.92 3203.87
The limited discriminative power of the half-ink feature is also observed when the classifier is evaluated. The obtained confusion matrix can be found in Table 7.
From Table 7 it can be seen that this feature does not provide any substantial difference in results, with the accuracy improving only slightly over the ink feature, from 22.3% to 23.8%. The digits 4, 5 and 6 are still never predicted by the model. Most notably, while the classifier using the ink feature never predicted a digit to be 8, using the half-ink feature it does predict the digit 8.
With both features combined, the classifier now correctly classifies some instances of the digit 6. However, it now misclassifies multiple instances of the digit 0. Whereas the same classifier using only the ink feature or only the half-ink feature would never predict any digit to be 4, 5 or 6, using both features it does predict these digits occasionally. The accuracy of the model shows a slight improvement, which would be far greater if the classifier did not misclassify the previously correct digit instances. The model accuracy is 27.32% with both features.
Table 8: The confusion matrix for both the ink and the half-ink features (rows: true labels, columns: predicted labels)
0 1 2 3 4 5 6 7 8 9
0 2207 59 426 160 161 154 757 27 176 5
1 3 3719 12 53 185 196 75 427 9 5
2 765 237 657 974 357 151 412 410 153 61
3 344 399 473 1581 279 97 181 875 65 57
4 224 719 201 665 281 329 677 874 62 40
5 580 504 339 231 398 527 860 216 113 27
6 923 364 335 291 328 378 1130 247 108 33
7 135 1116 172 690 363 299 348 1190 51 37
8 1121 141 490 482 297 252 871 238 132 39
9 164 710 206 913 351 194 297 1245 58 50
This increase in accuracy can be attributed to the use of better and more features: each instance is now described by 196 features. These features are more discriminative, as they convey more information about the digit.
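The construction of these 196 features is not retained in this section; since 196 = 14 x 14, one plausible (purely illustrative) construction is to sum each non-overlapping 2 x 2 block of the 28 x 28 image.

    # Hypothetical sketch: 2x2 block sums reduce each 28x28 image to a
    # 14x14 grid, i.e. 196 features per instance.
    images = pixels.reshape(-1, 28, 28)
    features = images.reshape(-1, 14, 2, 14, 2).sum(axis=(2, 4)).reshape(-1, 196)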
From the confusion matrix in Table 10, certain things stand out. Focusing on the mistakes the classifier still made, the main misclassification is that it confused the digit 8 with 3; in other words, it predicted 8 while the true label was 3. Other notable errors occurred when the SVM predicted 9 while the digit was a 7, predicted 4 while the digit was a 9, and again predicted 8 while the true label was 2.
Table 11: The confusion matrix for the feed-forward neural network (rows: true labels, columns: predicted labels)
0 1 2 3 4 5 6 7 8 9
0 3583 1 5 7 5 11 39 10 13 5
1 0 4045 18 15 6 7 5 6 15 4
2 29 53 3370 29 41 7 27 53 45 7
3 11 14 68 3501 5 81 9 35 75 45
4 6 12 20 5 3380 2 21 8 10 85
5 23 15 9 55 24 3077 48 11 63 30
6 27 7 17 1 27 21 3511 1 29 1
7 11 38 42 8 41 6 5 3632 9 95
8 14 59 23 53 23 37 19 11 3279 29
9 21 19 3 47 67 16 0 67 27 3448
This model achieved the highest accuracy of all models tested. However, for the digit pairs that were hard to distinguish using the ink or half-ink feature, like 4 and 9, it can still be seen that this model makes more mistakes than it does for other digits.
Table 14: The contingency table for pair 3: Logit and SVM
After calculating the three tables, McNemar’s test returns the following statistic and
p-value for each pair:
Table 15: The statistic and p-values for each pair of classifiers

                     Statistic   P-Value
Pair 1 - NN/SVM      732         0.00001
Pair 2 - NN/Logit    600         0.00003
Pair 3 - Logit/SVM   660         0.00002
From the above results it is clear that each p-value is lower than α = 0.05, and so the null hypothesis is rejected. Therefore the accuracies do not differ significantly.
5 Conclusion
The extracted ink and half-ink features were not very descriptive of the data. Since the "ink" values for some digits were very similar, the logit model had trouble distinguishing the digits and did not predict some digits at all. When a model was fitted with both the ink and half-ink features, a slight improvement in accuracy was observed, and the model now made predictions for all of the digits. Whilst the model did improve overall, the improvement was not large, as for some of the digits, like 0, the accuracy went down.
For the experiments using the pixel values as features, three models were tested: a regularized multinomial logit model, an SVM and a feed-forward neural network. For all models, the parameter values were tuned using grid search. The final accuracy scores were as follows:
The feed-forward neural network had the highest accuracy in predicting the digits in the test set. However, when the differences in accuracy were tested for statistical significance, none was found.