
International Journal of Electronics Communication and Computer Technology (IJECCT)
Volume 1, Issue 1 | September 2011
ISSN: 2249-7838 | www.ijecct.org

A Classifier Ensemble of Binary Classifier Ensembles
Hamid Parvin
Computer Engineering,
Islamic Azad University,
Nourabad Mamasani Branch, Nourabad, Iran
[email protected]
Sajad Parvin
Computer Engineering
Islamic Azad University,
Nourabad Mamasani Branch, Nourabad, Iran
[email protected]


Abstract- This paper proposes an innovative combinational algorithm to improve performance in multiclass classification domains. Since a more accurate classifier yields better classification performance, researchers in the computing community have generally tried to improve the accuracy of individual classifiers. However, obtaining the single most accurate classifier is not always the best route to the best classification quality: an alternative is to use many inaccurate or weak classifiers, each specialized for a sub-space of the problem space, and to use their consensus vote as the final classifier. This paper therefore proposes a heuristic classifier ensemble to improve the performance of classification learning. It deals especially with multiclass problems, whose aim is to learn the boundary of each class against many other classes. Based on this concept, classifiers are divided into two categories: pairwise classifiers and multiclass classifiers. The aim of a pairwise classifier is to separate one class from another. Because pairwise classifiers are trained only to discriminate between two classes, their decision boundaries are simpler and more effective than those of multiclass classifiers. The main idea behind the proposed method is to focus classifiers on the error-prone regions of the problem space and to use the pairwise classification concept instead of the multiclass one. Although using pairwise classification instead of multiclass classification is not new, we propose a new pairwise classifier ensemble of much lower order. In this paper, the most confused classes are first determined and then several ensembles of classifiers are created for them. The classifiers of each of these ensembles work jointly using weighted majority voting, and the results of these ensembles are combined in a weighted manner to decide the final vote. Finally, the outputs of these ensembles are heuristically aggregated. The proposed framework is evaluated on a very large scale Persian handwritten digit dataset, and the experimental results show the effectiveness of the algorithm.
Keywords-Genetic Algorithm; Optical Character
Recognition; Pairwise Classifier; Multiclass Classification
I. INTRODUCTION
Recognition systems have found applications in almost all fields. However, most classification algorithms achieve good performance on specific problems and lack robustness across other problems. Combining multiple classifiers can be considered a general solution for pattern recognition problems. It has been shown that a combination of classifiers usually operates better than a single classifier provided that its components are independent or produce diverse outputs. It has also been shown that the necessary diversity of an ensemble can be achieved by manipulating the dataset features. Parvin et al. have proposed some methods for creating this diversity [12]-[13].
In practice, there are problems for which a single classifier cannot deliver satisfactory performance [7]-[9]. In such situations, employing an ensemble of classifiers instead of a single classifier can lead to better learning [6]. Although obtaining a more accurate classifier is often the target, there is an alternative route: one can use many inaccurate or weak classifiers, each specialized for a few data items in the problem space, and employ their consensus vote as the classification. This can lead to better performance due to the reinforcement of the classifier in error-prone regions of the problem space.
Based on the concept of multiclass problem, classifiers are
divided into two different categories: pairwise classifiers and
multiclass classifiers. While the aim of multiclass problems is
to learn the boundaries of each class from many other classes,
the aim of a pairwise classifier is to separate one class from
another one. Because pairwise classifiers are just trained to
learn the boundary between two classes, decision boundaries
produced by them are simpler and more effective than those
produced by multiclass classifiers.
Pairwise discrimination between classes has been suggested in [16]-[18]. In this model there are c*(c-1)/2 possible pairwise classifications, one for each pair of classes. The class label for an input x is inferred from the similarity between the code words and the outputs of the classifiers. The code word for class q contains don't-care symbols to denote the classifiers that are not concerned with this class label. This method is impractical for a large c, as the number of classifiers becomes prohibitive.
In general, it holds that combining diverse classifiers, each of which performs better than random, results in better classification performance [2], [6], [10]. Diversity is always considered a very important concept in classifier ensemble methodology and is regarded as the most effective factor in the success of an ensemble. The diversity of an ensemble refers to the amount of disagreement among the outputs of its components (classifiers) when deciding on a given sample. Consider an example dataset with two classes. The diversity of an ensemble of two classifiers is the probability that they produce two dissimilar results for an arbitrary input sample; the diversity of an ensemble of three classifiers is the probability that one of them produces a result dissimilar from the other two for an arbitrary input sample. It is worth mentioning that the diversity can converge to 0.5 and 0.66 in ensembles of two and three classifiers respectively. Although reaching a more diverse ensemble of classifiers is generally helpful, it is harmful in the limit. This is the well-known dilemma in the classifier ensemble field: an ensemble of accurate yet diverse classifiers is the best. In other words, the more diverse the classifiers, the better the ensemble, provided that the classifiers remain better than random.
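The disagreement rate sketched below is one common way to quantify this notion of diversity for a pair of classifiers; the snippet is only an illustration (the prediction vectors are hypothetical placeholders) and is not taken from the paper.

```python
import numpy as np

def pairwise_disagreement(preds_a, preds_b):
    """Fraction of samples on which two classifiers disagree (a simple diversity measure)."""
    preds_a, preds_b = np.asarray(preds_a), np.asarray(preds_b)
    return float(np.mean(preds_a != preds_b))

# Hypothetical predictions of two classifiers on the same six validation samples.
p1 = [0, 1, 1, 0, 1, 0]
p2 = [0, 1, 0, 0, 1, 1]
print(pairwise_disagreement(p1, p2))  # 0.333..., they disagree on 2 of the 6 samples
```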
An Artificial Neural Network (ANN) is a model that is configured to produce a desired set of outputs given an arbitrary set of inputs. An ANN is generally composed of two basic elements: (a) neurons and (b) connections; indeed, an ANN is a set of neurons with connections between them. From another perspective an ANN has two distinct aspects: (a) topology and (b) learning. The topology of an ANN concerns the existence or nonexistence of connections, while learning determines the strengths of the existing connections. One of the most representative ANNs is the MultiLayer Perceptron (MLP). Various methods exist for setting the strengths of the connections in an MLP. One way is to set the weights explicitly, using a priori knowledge. Another way is to 'train' the MLP by feeding it teaching patterns and letting it change its weights according to some learning rule. In this paper the MLP is used as one of the base classifiers.
The Decision Tree (DT) is considered one of the most versatile classifiers in the machine learning field. It is also considered an unstable classifier, meaning that it can converge to different solutions in successive trainings on the same dataset with the same initialization. A DT uses a tree-like graph or model of decisions, and its knowledge representation is appropriate for experts to understand what it does [11]. Its intrinsic instability can be employed as a source of the diversity that is needed in a classifier ensemble. The ensemble of a number of DTs is a well-known algorithm called Random Forest (RF), which is considered one of the most powerful ensemble algorithms. The RF algorithm was first developed by Breiman [1].
In a previous work, Parvin et al. dealt only with reducing the size of a classifier ensemble [9]. They showed that one can reduce the size of an ensemble of pairwise classifiers, and proposed a method for reducing the ensemble size in a meaningful manner. Inspired by their method, we propose a framework in which a set of classifier ensembles is produced whose size order is not prohibitive. Indeed, we propose an ensemble of binary classifier ensembles that has order c, where c is the number of classes. This paper thus proposes a framework for developing combinational classifiers. In this new paradigm, a multiclass classifier together with a few ensembles of pairwise classifiers forms a classifier ensemble. Finally, to produce the consensus vote, the different votes (or outputs) are gathered and a heuristic classifier ensemble algorithm is employed to aggregate them. We focus on Persian handwritten digit recognition (PHDR), especially on the Hoda dataset [4]. Although there is notable work on PHDR, it is not rational to compare the reported results with each other, because there was no standard dataset in the PHDR field until 2006 [4]. The contribution is therefore compared only with those works that use the same dataset used in this paper, i.e. the Hoda dataset.
II. ARTIFICIAL NEURAL NETWORK
A first wave of interest in ANN (also known as
'connectionist models' or 'parallel distributed processing')
emerged after the introduction of simplified neurons by
McCulloch and Pitts in 1943. These neurons were presented as
models of biological neurons and as conceptual components
for circuits that could perform computational tasks. Each unit
of an ANN performs a relatively simple job: receive input
from neighbors or external sources and use this to compute an
output signal which is propagated to other units. Apart from
this processing, a second task is the adjustment of the weights.
The system is inherently parallel in the sense that many units
can carry out their computations at the same time. Within
neural systems it is useful to distinguish three types of units:
input units (indicated by an index i) which receive data from
outside the ANN, output units (indicated by an index o) which
send data out of the ANN, and hidden units (indicated by an
index h) whose input and output signals remain within the
ANN. During operation, units can be updated either
synchronously or asynchronously. With synchronous
updating, all units update their activation simultaneously; with
asynchronous updating, each unit has a (usually fixed)
probability of updating its activation at a time t, and usually
only one unit will be able to do this at a time. In some cases
the latter model has some advantages.
An ANN has to be configured such that the application of
a set of inputs produces the desired set of outputs. Various
methods to set the strengths of the connections exist. One way
is to set the weights explicitly, using a priori knowledge.
Another way is to 'train' the ANN by feeding it teaching
patterns and letting it change its weights according to some
learning rule. For example, the weights may be updated according to the gradient of an error function. For further study the reader may refer to an ANN textbook such as Haykin's book on the theory of ANNs [3].
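As an illustration of the 'training by teaching patterns' idea, the following sketch fits a small MLP whose weights are adjusted by a gradient-based learning rule; the use of scikit-learn, the layer sizes and the synthetic data are assumptions made only for this example.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic teaching patterns (illustrative only): 2-D inputs with 2 classes.
rng = np.random.RandomState(0)
X = rng.randn(300, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# The connection strengths are not set explicitly; they are learned from the
# teaching patterns by gradient descent on an error function (backpropagation).
mlp = MLPClassifier(hidden_layer_sizes=(10, 5), max_iter=1000, random_state=0)
mlp.fit(X, y)
print(mlp.predict(X[:5]))
```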
III. DECISION TREE LEARNING
A DT, as a machine learning tool, uses a tree-like graph or model to decide about a specific goal. DT learning is a data mining technique which creates a model to predict the value of the goal or class based on input variables. Interior nodes represent the input variables and the leaves represent the target value. A DT is learned by splitting the source set into subsets based on attribute values. The learning process is repeated for each subset by recursive partitioning. This process continues until all remaining instances in a subset have the same value for the goal, or until there is no improvement in entropy. Entropy is a measure of the uncertainty associated with a random variable.

Figure 1. An exemplary raw data
Data comes in records of the form (x, Y) = (x_1, x_2, x_3, ..., x_n, Y). The dependent variable Y is the target variable that we are trying to understand, classify or generalize. The vector x is composed of the input variables x_1, x_2, x_3, etc., that are used for that task. To clarify what DT learning is, consider Figure 1. Figure 1 has three attributes, Refund, Marital Status and Taxable Income, and our goal is the cheat status: we should recognize whether someone cheats with the help of these three attributes. To perform the learning process, the attributes are split into subsets. Figure 2 shows the resulting tree: first the source set is split by Refund, and then by MarSt and TaxInc.
To extract rules from a decision tree, each path is read from the root (whose tests form the rule antecedent) down to a leaf (whose label forms the consequent). For example, consider Figure 2: rules such as the following can be derived, and they can be used in the same way as rules obtained from Association Rule Mining.

Figure 2. The process tendency for Figure 1
- Refund = Yes → Cheat = No
- TaxInc < 80, MarSt = (Single or Divorced), Refund = No → Cheat = No
- TaxInc > 80, MarSt = (Single or Divorced), Refund = No → Cheat = Yes
- Refund = No, MarSt = Married → Cheat = No
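A toy reconstruction of this rule extraction is sketched below using scikit-learn; the records only mimic the Figure 1 table and the feature encoding is an assumption, so the exact tree may differ from Figure 2.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy records in the spirit of Figure 1: [Refund(1=Yes), Married(1=Yes), TaxInc in k$] -> Cheat
X = np.array([[1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 0, 95],
              [0, 1, 60],  [1, 0, 220], [0, 0, 85], [0, 1, 75],  [0, 0, 90]])
y = np.array(['No', 'No', 'No', 'No', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes'])

tree = DecisionTreeClassifier(criterion='entropy', random_state=0).fit(X, y)

# Each root-to-leaf path yields one rule, e.g. "Refund=No, Married=Yes -> Cheat=No".
print(export_text(tree, feature_names=['Refund', 'Married', 'TaxInc']))
```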
IV. K-NEAREST NEIGHBOR ALGORITHM
k-nearest neighbor algorithm (k-NN) is a method for
classifying objects based on closest training examples in the
feature space. k-NN is a type of instance-based learning, or
lazy learning where the function is only approximated locally
and all computation is deferred until classification. The k-
nearest neighbor algorithm is amongst the simplest of all
machine learning algorithms: an object is classified by a
majority vote of its neighbors, with the object being assigned to
the class most common amongst its k nearest neighbors (k is a
positive integer, typically small). If k = 1, then the object is
simply assigned to the class of its nearest neighbor.
The k-NN classifier is obviously a stable classifier. A stable classifier is one that converges to an identical model regardless of its training initialization; that is, two consecutive trainings of the k-NN algorithm with the same value of k result in two classifiers with the same performance. This does not hold for the MLP and DT classifiers. We use 3-NN as a base classifier in this paper. As will be seen, it follows that using a k-NN classifier in an ensemble is not a good option.
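The stability claim can be checked directly: two consecutive trainings of a 3-NN model on the same data give identical predictions, since k-NN involves no random initialization. The snippet below is an illustrative check, not an experiment from the paper.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(1)
X, y = rng.randn(200, 5), rng.randint(0, 3, 200)

# k-NN has no random initialization, so both runs produce the same decision rule.
a = KNeighborsClassifier(n_neighbors=3).fit(X, y).predict(X)
b = KNeighborsClassifier(n_neighbors=3).fit(X, y).predict(X)
print(np.array_equal(a, b))  # True: the 3-NN classifier is stable
```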
V. PROPOSED ALGORITHM
The main idea behind the proposed method is to use a
number of pairwise classifiers to reinforce the main classifier
in the error-prone regions of the problem space. Figure 3
depicts the training phase of the proposed method
schematically.
In the proposed algorithm, a multiclass classifier is first
trained. Its duty is to produce a confusion matrix over the
validation set. Note that this classifier is trained over the total
train set. At next step, the pair-classes which are mostly
confused with each other and are also mostly error-prone are
detected. After that, a number of pairwise classifiers are
employed to reinforce the drawbacks of the main classifier in
those error-prone regions. A simple heuristic is used to
aggregate their outputs.
In the first step, a multiclass classifier is trained on all the train data. Then, using the results of this classifier on the validation data, the confusion matrix is obtained. This matrix contains important information about the behavior of the classifier in different localities of the dataset; the close, Error-Prone Pair-Classes (EPPC) can be detected using it. Indeed, the confusion matrix determines the between-class error distributions. Assume that this matrix is denoted by a. Item a_{i,j} of this matrix determines how many instances of class c_j have been misclassified as class c_i.
Figure 4 shows the confusion matrix obtained from the base multiclass classifier. As can be seen, digit 5 (equivalently class 6) is incorrectly recognized as digit 0 (class 1) fifteen times, and digit 0 is incorrectly recognized as digit 5 fourteen times. This means that a total of 29 misclassifications have occurred between these two digits (classes). According to this matrix, the most erroneous pair-classes are, in order, (2, 3), (0, 5), (3, 4), (1, 4), (6, 9) and so on. Assume that the i-th most error-prone pair-class is denoted by EPPC_i, so EPPC_1 is (2, 3). Also assume that the number of selected EPPC is denoted by k.
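One possible reading of this step is to rank unordered class pairs by their symmetric confusion counts a_{i,j} + a_{j,i} and keep the k largest; the sketch below follows that reading with a hypothetical 4-class confusion matrix.

```python
import numpy as np

def select_eppc(a, k):
    """Return the k most error-prone pair-classes, ranked by a[i, j] + a[j, i]."""
    c = a.shape[0]
    pairs = [((i, j), a[i, j] + a[j, i]) for i in range(c) for j in range(i + 1, c)]
    pairs.sort(key=lambda item: item[1], reverse=True)
    return [pair for pair, _ in pairs[:k]]

# Hypothetical confusion matrix: a[i, j] counts class-j instances labeled as class i.
a = np.array([[90, 2, 1, 7],
              [3, 88, 6, 3],
              [1, 8, 85, 6],
              [6, 2, 8, 84]])
print(select_eppc(a, 2))  # [(1, 2), (2, 3)] for this matrix
```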


Figure 3. The first training phase of the proposed method

After determining the most erroneous pair-classes, or EPPCs, a set of m binary classifiers is trained for each of them so that, jointly, as an ensemble of binary classifiers, they reinforce the main multiclass classifier in the region of that EPPC. It is therefore necessary to train k ensembles of m binary classifiers. Assume that the ensemble which reinforces the main multiclass classifier in the region of EPPC_i is denoted by PWC_i. Each binary classifier contained in PWC_i is trained over a bag of train data, as in RF. Each bag contains only b percent of the train data, selected at random. It is worth mentioning that the pairwise classifiers participating in PWC_i are trained only on those instances which belong to EPPC_i. Assume that the j-th binary classifier of PWC_i is denoted by PWC_{i,j}. Because there are m classifiers in each PWC_i and there are k EPPCs, there are k*m binary classifiers in total. For example, in Figure 4 the pair (2, 3) can be considered an erroneous pair-class, so an ensemble is trained for that EPPC using those items of the train data that belong to class 2 or class 3. As mentioned before, this method is flexible, so we can add an arbitrary number of PWC_i to the primary base classifier, and it is expected that the proposed framework outperforms the primary base classifier. It is also worth noting that the accuracies of the PWC_{i,j} can easily be approximated using the train set: because PWC_{i,j} is trained on only b percent of the train instances whose labels belong to EPPC_i, provided that b is a small rate, the accuracy of PWC_{i,j} on all train instances with labels in EPPC_i can be taken as its approximate accuracy. Assume that this approximated accuracy of PWC_{i,j} is denoted by P_{i,j}.
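A sketch of this training step is given below; it is one way to implement the description above (the choice of decision trees as the pairwise base learners and the exact bagging scheme are assumptions).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_pairwise_ensembles(X, y, eppcs, m=9, b=0.2, seed=0):
    """For each error-prone pair-class, train m binary classifiers on random b-fraction bags."""
    rng = np.random.RandomState(seed)
    ensembles = []
    for (ca, cb) in eppcs:
        mask = np.isin(y, [ca, cb])            # keep only instances of the two confused classes
        Xp, yp = X[mask], y[mask]
        members, acc = [], []
        for _ in range(m):
            idx = rng.choice(len(Xp), size=max(2, int(b * len(Xp))), replace=False)
            clf = DecisionTreeClassifier(random_state=seed).fit(Xp[idx], yp[idx])
            members.append(clf)
            acc.append(clf.score(Xp, yp))      # P_{i,j}: accuracy estimated on all pair data
        ensembles.append({'classes': (ca, cb), 'classifiers': members, 'P': np.array(acc)})
    return ensembles
```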

Figure 4. The confusion matrix obtained from the base multiclass classifier
It is important to note that each PWC_i acts as a binary classifier. As mentioned, each PWC_i contains m binary classifiers with an accuracy vector P_i, so each of these binary ensembles can take a decision with the weighted sum algorithm illustrated in [5]. Thus we can combine their results according to the weights computed by (1).
w_{i,j} = \log\left( \frac{p_{i,j}}{1 - p_{i,j}} \right)    (1)


[Figure 3 diagram: the training set feeds the multiclass classifier; its confusion matrix on the validation set drives the selection of the error-prone pair-classes (EPP); for each selected EPP, m data bags of b% of the pair's train data are drawn and the binary classifiers PWC_{i,1}, ..., PWC_{i,m} are trained on them, PWC_{i,j} being the j-th classifier of the i-th pairwise ensemble and P_{i,j} its accuracy.]
where p_{i,j} is the accuracy of the j-th classifier in the i-th binary ensemble and w_{i,j} is the corresponding weight. It has been proved that the weights obtained according to (1) are theoretically optimal. Now the two outputs of each PWC_i are computed as in (2).

PWC_i(h \mid x) = \sum_{j=1}^{m} w_{i,j} \, PWC_{i,j}(h \mid x), \qquad h \in EPPC_i    (2)

where x is a test instance.
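Equations (1) and (2) amount to a log-odds weighted vote inside each binary ensemble. The sketch below shows this computation under the assumption that each member PWC_{i,j} returns a confidence for each of the two classes of EPPC_i.

```python
import numpy as np

def binary_ensemble_output(member_probs, member_acc):
    """Combine m binary members by the weights of eq. (1) and the sum of eq. (2).

    member_probs: shape (m, 2), row j = PWC_{i,j}'s confidences for the two classes of EPPC_i.
    member_acc:   shape (m,), the approximated accuracies P_{i,j}.
    """
    p = np.clip(np.asarray(member_acc, dtype=float), 1e-6, 1 - 1e-6)
    w = np.log(p / (1.0 - p))                                   # eq. (1)
    return (w[:, None] * np.asarray(member_probs)).sum(axis=0)  # eq. (2)

probs = [[0.9, 0.1], [0.7, 0.3], [0.6, 0.4]]
print(binary_ensemble_output(probs, [0.95, 0.80, 0.75]))        # two confidences for EPPC_i
```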


Figure 5. The heuristic test phase of the proposed method

The last step of the proposed
framework is to combine the results of the main multiclass classifier and those of the PWC_i. Note that there are 2*k outputs from the binary ensembles plus the c outputs of the main multiclass classifier, so the problem is to map a (2*k+c)-dimensional intermediate space to a c-dimensional space in which each dimension corresponds to a class. The results of all these classifiers are fed as inputs to the aggregators; output i of the aggregation is the final joint output for class i. Here, the aggregation is done using a special heuristic method, illustrated in Figure 5. As Figure 5 shows, after producing the intermediate space, the outputs of the i-th ensemble of binary classifiers are multiplied by a number q_i. This q_i is equal to the sum of the main multiclass classifier's confidences for the two classes belonging to EPPC_i. Assume that the result of the multiplication of q_i by the outputs of PWC_i is denoted by MPWC_i. Note that MPWC_i is a vector of two confidences: the confidences of the framework for the two classes belonging to EPPC_i.
After calculating the MPWC_i, the maximum value among all of them is selected. If the framework's confidence for this most confident class is satisfactory for a test instance, it is taken as the final decision of the framework; otherwise the main multiclass classifier decides for the instance. In other words, the final decision is taken by (3).

Decision(x) = \begin{cases} MaxDecision(x) & \text{if } \max_{h \in EPPC_{sc}} MPWC_{sc}(h \mid x) \geq thr \\ \arg\max_{h \in \{1, \dots, c\}} MCC(h \mid x) & \text{otherwise} \end{cases}    (3)
where MCC(h|x) is the confidence of the main multiclass classifier for class h given a test instance x, MPWC_sc(h|x) is the confidence of the sc-th ensemble of binary classifiers for class h given x, and MaxDecision is calculated according to (4).
[Figure 5 diagram: a test instance is fed to the multiclass classifier and to all PWC_{i,j}; each binary ensemble PWC_i combines its members with the weights w_{i,j} = log(p_{i,j}/(1-p_{i,j})) and is scaled to MPWC_i; if the absolute value of the maximum MPWC confidence exceeds the decision threshold thr, the Max block decides, otherwise the multiclass classifier decides.]
MaxDecision(x) = \arg\max_{h \in EPPC_{sc}} MPWC_{sc}(h \mid x)    (4)
where sc is computed as (5).
sc(x) = \arg\max_{i} \left( \max_{h \in EPPC_i} MPWC_i(h \mid x) \right)    (5)
Because the main classifier is reinforced by several ensembles in the erroneous regions, it is expected that the accuracy of this method surpasses that of a simple MLP or an unweighted ensemble. Figure 3 together with Figure 5 constitutes the structure of the ensemble framework.
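Read together, equations (3)-(5) give the following test-phase procedure; the sketch below is our interpretation of the figure and equations, with the confidence vectors assumed as inputs.

```python
import numpy as np

def aggregate(mcc_conf, pwc_outputs, eppcs, thr):
    """Heuristic aggregation of eqs. (3)-(5).

    mcc_conf:    length-c confidences of the main multiclass classifier for one test instance.
    pwc_outputs: for each EPPC_i, the two confidences PWC_i(h|x) from eq. (2).
    eppcs:       list of (class_a, class_b) pairs, the selected error-prone pair-classes.
    thr:         decision threshold.
    """
    mpwc = []
    for (a, b), out in zip(eppcs, pwc_outputs):
        q_i = mcc_conf[a] + mcc_conf[b]              # sum of the main classifier's confidences
        mpwc.append(q_i * np.asarray(out))           # MPWC_i
    sc = int(np.argmax([m.max() for m in mpwc]))     # eq. (5): the most confident binary ensemble
    if mpwc[sc].max() >= thr:                        # eq. (3): confident enough?
        return eppcs[sc][int(np.argmax(mpwc[sc]))]   # eq. (4): class chosen by that ensemble
    return int(np.argmax(mcc_conf))                  # otherwise the multiclass classifier decides
```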
VI. WHY PROPOSED METHOD WORKS
As presumed in the paper, the aim is to add as many pairwise classifiers as needed to compensate a predefined portion of the error, PDER*EF(MCL, DValidation), where PDER is a predefined error rate and EF(MCL, DValidation) is the error frequency of the multiclass classifier MCL over the validation data DValidation. Assume we add |eps| pairwise ensembles to the main multiclass classifier, as in the equation below.

\sum_{i=1}^{|eps|} \left( p(w = EPPC_i.x \mid y = EPPC_i.y, x) + p(w = EPPC_i.y \mid y = EPPC_i.x, x) \right) = PDER \cdot EF(MCL, DValidation, DTrain)    (6)
Now assume that a data instance x which really belongs to class q is to be classified by the proposed algorithm; its error rate can be obtained by (12). First assume p^{p}_{max} is the probability that the proposed classifier ensemble takes its decision by one of its binary ensembles able to distinguish the two classes q and p. Also assume p^{pr}_{max} is the probability that the proposed classifier ensemble takes its decision by one of its binary ensembles that distinguishes the two classes p and r, neither of which is q. They are obtained by (8) and (7), respectively.

p^{pr}_{\max} = p\big(EPPC = (p, r) \mid x \in q\big) = \big(MCC(p \mid x) + MCC(r \mid x)\big) \cdot \max\big(PWC(p \mid x), PWC(r \mid x)\big)    (7)

p^{p}_{\max} = p\big(EPPC = (p, q) \mid x \in q\big) = \big(MCC(p \mid x) + MCC(q \mid x)\big) \cdot \max\big(PWC(p \mid x), PWC(q \mid x)\big)    (8)
We can assume (9) without loss of generality.


\max\big(PWC(p \mid x \in q), PWC(r \mid x \in q)\big) \approx \varepsilon \ll \max\big(PWC(p \mid x \in q), PWC(q \mid x \in q)\big), \qquad r \neq q    (9)
where ε is a fixed small value, and then we have:

p^{pr}_{\max} \approx \big(MCC(p \mid x) + MCC(r \mid x)\big)\,\varepsilon \approx (b_{p,q} + b_{r,q})\,\varepsilon    (10)

p^{p}_{\max} = \big(MCC(p \mid x) + MCC(q \mid x)\big) \approx (b_{p,q} + b_{q,q})    (11)
As inferred from the algorithm under the same condition, its error can be formulated as follows.

error(x \mid w = q) = \sum_{EPPC = (p,q)} p^{p}_{\max} \big(1 - p_{pair}(p \mid x)\big) + \sum_{EPPC = (p,r)} p^{pr}_{\max} + \big(1 - p^{p}_{\max}\big)\big(1 - p^{pr}_{\max}\big)\big(1 - b_{q,q}\big)    (12)
where p_pair is the probability that a binary classifier takes the correct decision and b_{j,q} is defined as follows.

b_{j,q} = \frac{confusion_{j,q}}{\sum_{i=1}^{c} confusion_{i,q}}    (13)
So we can reformulate (12) as follows.

error(x \mid w = q) \approx \sum_{EPPC = (p,q)} p^{p}_{\max} \big(1 - p_{pair}(p \mid x)\big) + \sum_{EPPC = (p,r)} (b_{p,q} + b_{r,q})\,\varepsilon + \big(1 - p^{p}_{\max}\big)\big(1 - p^{pr}_{\max}\big)\big(1 - b_{q,q}\big)    (14)
Note that in (14), if p^{pr}_{max} and p^{p}_{max} are zero for an exemplary input, the classification error remains equal to that of the main multiclass classifier. If they are not zero for an exemplary input, the misclassification rate is still reduced because of the reduction in the last term of (14). Although the first terms increase the error in (14), if we assume that the binary classifiers are more accurate than the multiclass classifier, the increase is outweighed by that decrease.
VII. EXPERIMENTAL RESULTS
This section evaluates the results of applying the proposed
framework on a Persian handwritten digit dataset named Hoda
[4]. This dataset contains 102,364 instances of digits 0-9.
The dataset is divided into three parts: train, evaluation and test sets. The train set contains 60,000 instances, and the evaluation and test sets contain 20,000 and 22,364 instances respectively. From each instance, 106 features are extracted, as described in [4]. Some instances of this dataset are depicted in Figure 6.
In this paper, MLP, 3-NN and DT are used as the base primary classifiers. We use an MLP with 2 hidden layers containing respectively 10 and 5 neurons as the base multiclass classifier, and the confusion matrix is obtained from its output. The DT's splitting measure is taken as the Gini measure. The classifier parameters are kept fixed throughout the experiments. It is important to note that all classifiers in the algorithm are kept of the same type: all classifiers are MLPs in the first experiments, and then the same experiments are repeated substituting all MLPs with DTs.
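For reference, base learners with the stated settings could be instantiated as below; the library choice (scikit-learn) and any parameters not stated in the text are assumptions.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Base classifiers as described: an MLP with two hidden layers of 10 and 5 neurons,
# a decision tree split on the Gini measure, and a 3-nearest-neighbour classifier.
mlp = MLPClassifier(hidden_layer_sizes=(10, 5), max_iter=500, random_state=0)
dt = DecisionTreeClassifier(criterion='gini', random_state=0)
knn = KNeighborsClassifier(n_neighbors=3)
```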
The parameter k is set to 11, so the number of added pairwise ensembles of binary classifiers equals 11 in the experiments. The parameter m is set to 9, so the number of binary classifiers per EPPC equals 9. This means that 99 binary classifiers are trained for the pair-classes with considerable error rates. Assuming that the error count of each pair-class is available, the most erroneous pair-classes are chosen simply by sorting the error counts of the pair-classes and selecting an arbitrary number of them; this number can be determined by trial and error, and is set to 11 in the experiments.
As mentioned, 9*11 = 99 pairwise classifiers are added to the main multiclass classifier. Since the parameter b is set to 20, each of these classifiers is trained on only b percent of the corresponding train data, i.e. on 20 percent of the train instances of the corresponding two classes. The cardinality of this set is calculated by (15).
Car = 2 \cdot \frac{|train|}{c} \cdot b = 2 \cdot \frac{60000}{10} \cdot 0.2 = 2400    (15)
It means that each binary classifier is trained on 2400
datapoints with 2 class labels. Table 1 shows the experimental results comparatively. As can be inferred, the framework outperforms the previous works and the simple classifiers when the decision tree is employed as the base classifier.


Figure 6. Some instances of Persian OCR data set, with different qualities
It is inferred from Table 1 that the proposed framework significantly improves the classification precision, especially when employing DT as the base classifier. A look at Table 1 shows that using DT as the base classifier in the ensemble almost always produces a better performing classification, which may be due to the inherent instability of DT: because a DT is an unstable classifier, it is well suited to serve as a base classifier in an ensemble. A stable classifier is one that converges to an identical model regardless of its training initialization, so that two consecutive trainings of the classifier with identical settings result in two classifiers with the same performance; this does not hold for the MLP and DT classifiers. Although the MLP is not a stable classifier, it is more stable than the DT. So it is also expected that using the DT as base classifier has the most impact in improving the recognition ratio.
As another point, the reader can infer that the framework outperforms the Unweighted Full Ensemble, Unweighted Static Classifier Selection and Weighted Static Classifier Selection methods explained in [14]. This can be a consequence of employing binary classifiers instead of multiclass classifiers.
It is inferred from Table 1 that the proposed framework significantly improves the classification precision, especially when employing DT or MLP as the base classifier. It is also obvious that using the DT as base classifier has the most impact in improving the recognition ratio, which may be due to its inherent instability. As expected, using a stable classifier like k-NN in an ensemble is not a good option, while unstable classifiers like DT and MLP are better options.
VIII. CONCLUSION
Although a more accurate classifier leads to better performance, there is another option: using many inaccurate classifiers, each specialized for a few data items in the problem space, and using their consensus vote as the classifier. This paper therefore proposes a heuristic classifier ensemble to improve the performance of learning in multiclass problems. The main idea behind the proposed method is to focus classifiers on the erroneous regions of the problem, so as to improve the performance of the multiclass classification system. We also propose a framework in which a set of classifier ensembles is produced whose size order is not prohibitive; that is, we propose a new pairwise classifier ensemble with a much lower order than using all possible pairwise classifiers. Indeed, the paper proposes an ensemble of binary classifier ensembles that has order c, where c is the number of classes. First an arbitrary number of binary classifier ensembles is added to the main classifier; then the results of all these binary classifier ensembles are given to a heuristic-based ensemble, and the results of the binary ensembles are combined to decide the final vote in a weighted manner. The proposed framework is evaluated on a very large scale Persian handwritten digit dataset, and the experimental results show the effectiveness of the algorithm. Usage of the confusion matrix makes the proposed method a flexible one. The number of all possible pairwise classifiers is c*(c-1)/2, which is O(c^2); using this method, without giving up considerable accuracy, we decrease the number of added binary ensembles to a constant k, i.e. O(1) with respect to c. This feature makes the proposed method applicable to problems with a large number of classes. The experiments show the effectiveness of this method, and we reached very good results in Persian handwritten digit recognition, which is a very large dataset.
TABLE I. THE ACCURACIES OF DIFFERENT SETTINGS OF THE PROPOSED FRAMEWORK

                                                  Base Classifier
Method                                            DT       MLP      3-NN
A simple multiclass classifier                    95.57    95.7     96.66
Method proposed in [8]                            -        98.89    -
Method proposed in [7]                            -        98.27    -
Method proposed in [15]                           97.20    96.70    96.86
Unweighted Full Ensemble in [14]                  98.22    98.11    -
Unweighted Static Classifier Selection in [14]    98.13    98.15    -
Weighted Static Classifier Selection in [14]      98.34    98.21    -
Proposed Method                                   99.01    98.46    96.89

It is concluded that using a stable classifier like k-NN in an
ensemble is not a good option and unstable classifiers like DT
and MLP are better options.
REFERENCES

[1] L. Breiman, "Bagging Predictors," Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[2] S. Gunter, and H. Bunke, "Creation of classifier ensembles for
handwritten word recognition using feature selection algorithms,"
IWFHR 2002 on January 15, 2002.
[3] S. Haykin, Neural Networks, a comprehensive foundation, second
edition, Prentice Hall International, 1999.
[4] H. Khosravi, and E. Kabir, "Introducing a very large dataset of
handwritten Farsi digits and a study on the variety of handwriting
styles," Pattern Recognition Letters, vol 28 issue 10 pp.1133-1141,
2007.
[5] L.I. Kuncheva, Combining Pattern Classifiers, Methods and Algorithms,
New York: Wiley, 2005.
[6] B. Minaei-Bidgoli, and W.F. Punch, "Using Genetic Algorithms for
Data Mining Optimization in an Educational Web-based System,"
GECCO, 2003.
[7] H. Parvin, H. Alizadeh and B. Minaei-Bidgoli, "A New Approach to
Improve the Vote-Based Classifier Selection," International Conference
on Networked Computing and advanced Information Management,
2008.
[8] H. Parvin, H. Alizadeh, M. Fathi, B. Minaei-Bidgoli, "Improved Face
Detection Using Spatial Histogram Features," Int. Conf. on Image
Processing, Computer Vision, and Pattern Recognition, pp. 381-386,
2008.
[9] H. Parvin, H. Alizadeh, B. Minaei-Bidgoli, M. Analoui, "An Scalable
Method for Improving the Performance of Classifiers in Multiclass
Applications by Pairwise Classifiers and GA," International Conference
on Networked Computing and advanced Information Management, pp.
137-142, 2008.
[10] A. Saberi, M. Vahidi, B. Minaei-Bidgoli, "Learn to Detect Phishing
Scams Using Learning and Ensemble Methods," IEEE/WIC/ACM
International Conference on Intelligent Agent Technology, Workshops,
pp. 311-314, 2007.
[11] T. Yang, "Computational Verb Decision Trees," International Journal of
Computational Cognition, pp. 3446, 2006.
[12] H. Parvin, H. Alizadeh, B. Minaei-Bidgoli, "Using Clustering for
Generating Diversity in Classifier Ensemble," JDCTA Vol. 3, no. 1,
pp.51-57, 2009.
[13] H. Parvin, H. Alizadeh, B. Minaei-Bidgoli, "A New Method for
Constructing Classifier Ensembles," JDCTA Vol. 3, no. 2, pp.62-66,
2009.
[14] H. Parvin, H. Alizadeh, "Classifier Ensemble Based Class Weighting,"
American Journal of Scientific Research, pp.84-90, 2011.
[15] H. Parvin, H. Alizadeh, M. Moshki, B. Minaei-Bidgoli, N. Mozayani,
"Divide & Conquer Classification and Optimization by Genetic
Algorithm," International Conference on Convergence and hybrid
Information Technology, pp.858-863, 2008.
[16] F. Masulli, and G. Valentini, "Comparing decomposition methods for
classification," In Proc. International Conference on Knowledge-Based
Intelligent Engineering Systems and Applied Technologies, pp. 788-792,
2000.
[17] F. Cutzu, "Polychotomous classification with pairwise classifiers: A new
voting principle," In Proc. International Workshop on Multiple Classifier
Systems, Lecture Notes in Computer Science, pp. 115-124, 2003.
[18] A. Jozwik, and G. Vernazza "Recognition of leucocytes by a parallel k-
nn classifier,". In Proc. International Conference on Computer-Aided
Medical Diagnosis, pp. 138-153, 1987.
