Abstract
This paper presents an efficient method for recognizing printed Tamil characters by exploiting the inter-class relationships between them. This is accomplished using Multiclass Hierarchical Support Vector Machines [Crammer et al., 2001; Weston et al., 1998], a new variant of the Multiclass Support Vector Machine which constructs a hyperplane that separates each class of data from the other classes. 126 unique characters in the Tamil language have been identified. Many inter-class dependencies, based on shape, were found among them. This enabled the characters to be organized into hierarchies, thereby enhancing the character recognition process. The system was trained using features extracted from the binary character sub-images of sample documents using Hu's moment-invariant feature extraction method [Hu, 1962; Jain et al., 1996]. The system yielded promising results in comparison with other classification algorithms such as KNN, the Bayesian classifier, and decision trees. An accuracy of 96.85% was obtained in the experiments using the Multiclass Hierarchical SVM.
Introduction
We will now see how the multiclass formulation and its interpretation differ from the classical binary SVM. First, class labels are vectors instead of the +1 and -1 of the binary SVM. Thus class labels in the binary SVM belong to a one-dimensional subspace, whereas for the multiclass SVM class labels belong to a multi-dimensional subspace. Second, the W that defines the separating hyperplane in the binary SVM is a vector; in the multiclass case, W is a matrix. We can view the job of W in the two-class SVM as mapping the data/feature vector into a one-dimensional subspace. In the multiclass SVM, the natural extension is then mapping the data/feature space into a vector label space whose defining bases are vectors. In other words, multiclass learning may be viewed as vector-labeled learning or vector-valued learning.
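The contrast above can be made concrete with a small sketch (not the paper's implementation; the sizes, random weights and one-hot label vectors are illustrative assumptions): in the binary SVM, w is a vector and w·φ(x) lands in a one-dimensional label space, while here W is a matrix mapping φ(x) into a T-dimensional label space, and the predicted class is the label vector with the largest inner product against Wφ(x).

```python
import numpy as np

# Toy sizes: 3 classes, 4-dimensional feature space.
T, d = 3, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(T, d))      # W is a matrix, not a vector
x = rng.normal(size=d)           # a feature vector phi(x)

# One label vector per class; here the simple one-hot embedding.
Y = np.eye(T)

# <y_t, W phi(x)> for every class t, then pick the largest.
scores = Y @ (W @ x)
predicted = int(np.argmax(scores))
print(predicted)
```

The same picture carries over unchanged when φ(x) lives in a kernel-induced feature space: only inner products with W's expansion are ever needed.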
Assume we have a sample S of pairs {(y_i, x_i) : y_i ∈ H_y, x_i ∈ H_x, i = 1, . . . , m} independently and identically generated by an unknown multivariate distribution P. The Support Vector Machine with vector output is then formulated as
93
AND 2007
$$\min_{W,b,\xi}\ \tfrac{1}{2}\,\mathrm{tr}(W^\top W) + C\,e^\top \xi$$

subject to

$$\{W \mid W : H_{\phi(x)} \to H_y,\ W \text{ is a linear operator}\},\qquad \{b \mid b \in H_y\},\ \text{bias vector},$$
$$\langle y_i, W\phi(x_i) + b\rangle \ge q_i - p_i\,\xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,m. \qquad (1)$$

The main point to be noted in the above formulation (1) is the constraint equations. The values $q_i$ and $p_i$ can be chosen from the set $\{1,\ \langle y_i, \phi(x_i)\rangle,\ \|y_i\|\,\|\phi(x_i)\|\}$ depending on the particular task. To understand the geometry of the problem better, we first let $q_i$ and $p_i$ be 1,

$$\langle y_i, W\phi(x_i)\rangle \ge 1 - \xi_i,\quad i = 1,\dots,m;$$

then the magnitude of the error measured by the slack variables $\xi_i$ is the same independently of the norms of the label and feature vectors. The bias term $b$ can be set to zero, because it has been shown in [Kecman et al., 2005] that polynomial and RBF kernels do not require the bias term.

Introducing Lagrange multipliers $\alpha_i$ corresponding to the margin constraints, the Karush-Kuhn-Tucker theory lets us express the linear operator $W$ through the tensor products of the output and the feature vectors, $W = \sum_{i=1}^m \alpha_i\, y_i\, \phi(x_i)^\top$, which leads to the dual problem

$$\min_\alpha\ \tfrac{1}{2}\sum_{i,j=1}^m \alpha_i\,\alpha_j\,\langle y_i, y_j\rangle\,\langle \phi(x_i), \phi(x_j)\rangle - \sum_{i=1}^m \alpha_i$$
$$\text{subject to}\quad \sum_{i=1}^m (y_i)_t\,\alpha_i = 0,\ t = 1,\dots,\dim(H_y),\qquad C \ge \alpha_i \ge 0,\ i = 1,\dots,m, \qquad (2)$$

where the equality constraints disappear when $b = 0$. Here $K_{ij} = \langle \phi(x_i), \phi(x_j)\rangle$ and $K^y_{ij} = \langle y_i, y_j\rangle$ stand for the elements of the kernel matrices for the feature vectors and for the label vectors respectively. Hence, the vector labels are kernelized as well. The synthesized kernel is the element-wise product of the input and the output kernels, $\hat{K} = K^y \mathbin{.\!*} K$, an operation that preserves positive semi-definiteness.

A least-squares variant of (1) uses squared slacks and equality constraints:

$$\min_{W,\xi}\ \tfrac{1}{2}\,\mathrm{tr}(W^\top W) + \tfrac{C}{2}\,\xi^\top\xi$$
$$\text{subject to}\quad \langle y_i, W\phi(x_i)\rangle = 1 - \xi_i,\quad \{\xi_i \mid \xi_i \in \mathbb{R}\},\quad i = 1,\dots,m. \qquad (3)$$

Its Lagrangian is

$$L = \tfrac{1}{2}\,\mathrm{tr}(W^\top W) + \tfrac{C}{2}\sum_{i=1}^m \xi_i^2 - \sum_{i=1}^m \alpha_i\bigl(\langle y_i, W\phi(x_i)\rangle - 1 + \xi_i\bigr). \qquad (4)$$

Setting the derivatives of $L$ to zero gives

$$\frac{\partial L}{\partial W} = W - \sum_{i=1}^m \alpha_i\, y_i\, \phi(x_i)^\top = 0 \ \Rightarrow\ W = \sum_{i=1}^m \alpha_i\, y_i\, \phi(x_i)^\top, \qquad (5)$$

$$\frac{\partial L}{\partial \xi_i} = C\,\xi_i - \alpha_i = 0 \ \Rightarrow\ \xi_i = \frac{\alpha_i}{C}. \qquad (6)$$

Substituting $W = \sum_{i=1}^m \alpha_i\, y_i\, \phi(x_i)^\top$ and $\xi = \alpha/C$ into the constraints of (3), we obtain the linear system

$$\Bigl(\hat{K} + \tfrac{1}{C}I\Bigr)\alpha = e,\quad \text{i.e.}\quad Q\alpha = e\ \text{with}\ Q = \hat{K} + \tfrac{1}{C}I,\ \text{hence}\ \alpha = Q^{-1}e. \qquad (7)$$

Multiclass classification

Given the solution $\alpha$, a new input $x$ is assigned to the category

$$t(x) = \arg\max_{t=1,\dots,T}\ \sum_{i=1}^m \alpha_i\, K^y(y_t, y_i)\, K(x_i, x), \qquad (8)$$

where $y_t$ denotes the label vector of category $t$. For a flat multiclass problem the label vectors can be the indicator embedding

$$(y_i)_t = \begin{cases} 1 & \text{if } i \text{ belongs to category } t,\ t = 1,\dots,T,\\ 0 & \text{otherwise,} \end{cases} \qquad (9)$$

or the centered embedding

$$(y_i)_t = \begin{cases} \tfrac{1}{T} & \text{if } i \text{ belongs to } t,\ t = 1,\dots,T,\\ -\tfrac{1}{T(T-1)} & \text{otherwise.} \end{cases} \qquad (10)$$

To encode a hierarchy over the classes, the label embedding assigns a weight to every node $v$ on the path $p(i)$ from leaf $i$ to the root:

$$(y_i)_v = \begin{cases} r & \text{if } v \text{ is the root},\\ s\,q^{l(v)} & \text{if } v \in p(i),\\ 0 & \text{otherwise,} \end{cases} \qquad (11)$$

where $l(v)$ is the distance of node $v$ from the leaf $i$, and $r$, $q$, $s$ are the parameters of the embedding. The parameter $q$ can express the diminishing weight of the nodes closer to the root. If $q = 0$, assuming $0^0 = 1$, then the intermediate nodes and the root are discarded, and we obtain the flat multiclass classification problem. The value of $r$ can be 0, but some experiments show it may help to improve the classification performance. This method was successfully applied to the WIPO-alpha patent dataset and the Reuters Corpus, and was reported to give good results on them.
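The training and decision steps described above reduce to solving one linear system and evaluating one argmax. The following is a hedged sketch under stated assumptions: toy data, a linear kernel in place of the paper's RBF kernel, and one-hot label vectors.

```python
import numpy as np

def train(X, Y, C=10.0):
    """Solve (K^y .* K + I/C) alpha = e.

    X: (m, d) feature rows; Y: (m, T) label vectors; returns alpha (m,).
    """
    K = X @ X.T                       # feature kernel K_ij = <x_i, x_j>
    Ky = Y @ Y.T                      # label kernel  K^y_ij = <y_i, y_j>
    Q = Ky * K + np.eye(len(X)) / C   # element-wise product plus I/C
    return np.linalg.solve(Q, np.ones(len(X)))

def predict(x, X, Y, categories, alpha):
    """Assign x to argmax_t sum_i alpha_i <y_t, y_i> K(x_i, x)."""
    k = X @ x                         # K(x_i, x) for each training point
    scores = categories @ (Y.T @ (alpha * k))
    return int(np.argmax(scores))

# Toy 3-class problem with one-hot label vectors.
X = np.array([[2.0, 0.0], [2.1, 0.0], [0.0, 2.0],
              [0.0, 2.2], [-2.0, -2.0], [-2.1, -1.9]])
Y = np.eye(3)[[0, 0, 1, 1, 2, 2]]
alpha = train(X, Y)
print(predict(np.array([2.0, 0.1]), X, Y, np.eye(3), alpha))  # -> 0
```

With one-hot labels the synthesized kernel is block-diagonal across classes, so the single m-by-m solve effectively trains all classes at once, which is the "one-class complexity" appeal of the formulation.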
Experiments
Tamil is a South Indian Language mainly spoken in southern parts of India, Sri Lanka, Malaysia and Singapore.
Tamil character set contains 12 vowels, 18 consonants and
totally 247 alphabets.126 unique commonly occurring characters in shape have been identified. Hierarchy is built based
on the 126 characters as explained in the section 5.4. Thus
the classification was to be done into 126 classes. The following flow chart Figure 1 describes the steps involved in
preparing the features extracted for training the system and
for classification thereby.
The seven Hu moment invariants, computed from the scale-normalized central moments $\eta_{pq}$, are:

$$\phi_1 = \eta_{20} + \eta_{02}$$
$$\phi_2 = (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2$$
$$\phi_3 = (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2$$
$$\phi_4 = (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2$$
$$\phi_5 = (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\bigl[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\bigr] + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\bigl[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\bigr]$$
$$\phi_6 = (\eta_{20} - \eta_{02})\bigl[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\bigr] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03})$$
$$\phi_7 = (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\bigl[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\bigr] - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})\bigl[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\bigr]$$
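A minimal sketch of how these features are computed from a binary character sub-image (the toy rectangle "glyph" is an assumption for illustration): raw pixel coordinates give the centroid, central moments give translation invariance, and normalization by the area gives scale invariance. Only $\phi_1$ and $\phi_2$ are shown; the remaining five follow the same pattern from the equations above.

```python
import numpy as np

def hu_first_two(img):
    """img: 2-D binary array (1 = ink); returns (phi1, phi2)."""
    ys, xs = np.nonzero(img)
    m00 = float(len(xs))                  # zeroth moment (ink area)
    xbar, ybar = xs.mean(), ys.mean()     # centroid

    def eta(p, q):                        # scale-normalized central moment
        mu = (((xs - xbar) ** p) * ((ys - ybar) ** q)).sum()
        return mu / m00 ** (1 + (p + q) / 2.0)

    phi1 = eta(2, 0) + eta(0, 2)
    phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return phi1, phi2

# Central moments make the features translation-invariant: the same glyph
# shifted elsewhere on the page yields identical values.
a = np.zeros((20, 20)); a[5:9, 5:12] = 1
b = np.zeros((20, 20)); b[10:14, 3:10] = 1   # same rectangle, shifted
print(hu_first_two(a) == hu_first_two(b))    # -> True
```

This invariance is exactly why Hu's moments suit character recognition: the feature vector of a glyph does not depend on where in the document it was segmented from.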
5.1 Preprocessing
Documents were digitized and stored as gray scale Bitmap
images. Binarization was performed based on a threshold
value applied on the image. Since the scanned images were
noise free to a considerable extent, a noise reduction technique was not required to be performed on the image.
5.2 Segmentation
Segmentation was performed in two phases. (i) Line segmentation wherein each line in the document was segmented using horizontal profile. (ii) Character segmentation
wherein each character in the line was segmented using 8connected component analysis [Haralick et al., 1992]. The
two phase approach was adopted based on a comparative
study where this approach yielded a better result.
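The binarization and line-segmentation steps above can be sketched as follows (the threshold value of 128 and the toy "page" are assumptions; the paper does not state its threshold): rows of the binarized image are summed into a horizontal projection profile, and runs of non-empty rows become text lines.

```python
import numpy as np

def binarize(gray, threshold=128):
    """Global-threshold binarization: 1 = ink, 0 = background."""
    return (gray < threshold).astype(np.uint8)

def segment_lines(binary):
    """Return (start, end) row ranges of text lines from the horizontal profile."""
    profile = binary.sum(axis=1)          # ink pixels per row
    lines, start = [], None
    for r, count in enumerate(profile):
        if count > 0 and start is None:
            start = r                     # a line begins
        elif count == 0 and start is not None:
            lines.append((start, r))      # a line ends
            start = None
    if start is not None:
        lines.append((start, len(profile)))
    return lines

# Toy page: two "lines" of ink separated by blank rows.
page = np.full((12, 8), 255, dtype=np.uint8)
page[2:4, 1:7] = 0
page[7:9, 1:7] = 0
print(segment_lines(binarize(page)))      # -> [(2, 4), (7, 9)]
```

Character segmentation within each line would then proceed on the line strips via 8-connected component labeling, as the text describes.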
5.5 Training
Training data set was generated by labeling the features
extracted from the test character image, with the corresponding class. A training dataset for a particular class, on
average, contains 20 sample training data.
Results
Multiclass Hierarchical SVM turned out to be a very efficient method in process of classification. The accuracy of
the algorithm depended on two parameter settings (RBF
Kernel parameter and regularization parameter C). The
( x x ) 2
95
Accuracy
90
85
80
75
70
-5
-3
-5
-5
-20
-10
0.5
0.5
0.5
0.5
0.9
96.85
96.23
96.86
Multilayer
Perceptron
91.8
95.45
93.43
KNN
89.40
90.05
89.90
Nave Bayes
84.5
88.90
88.20
Decision
Trees
91.0
92.84
93.23
Enhancements
The system can be enhanced by including a module of
Language heuristics wherein a character not leading to
a particular class could be classified based on predictions made using the general language grammar and
rules.
Another enhancement possible is compiling a language dictionary wherein in a situation of ambiguity,
classification could be performed based on language
semantics
Apart from the 126 characters identified here, there
exist some ancient Tamil characters that are not used
commonly. Such characters can also be taken into
consideration for a broader aspect.
The system can also be extended to other oriental languages. We are planning to work on some Indian languages like Sanskrit, Hindi, and Malayalam etc.,
which exhibit the property of similarity between characters in shapes and can be organized into hierarchies.
100
-1
Multiclass
Hierarchical
SVM
1 2
. The system had to
form of RBF kernel used is e
be fine tuned on these values in order to obtain a better accuracy. Figure 4 shows the improving performance/accuracy
of the system with the changing values of the parameters.
-0.5
Sigma and C
97
AND 2007
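The RBF kernel whose parameters were tuned above can be sketched as a short function (the candidate $\sigma$ values and the toy points are illustrative assumptions, not the paper's grid):

```python
import numpy as np

def rbf_kernel(X, sigma):
    """Kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])

# Candidate sigmas one would grid-search jointly with C: small sigma makes
# the kernel local (off-diagonal entries shrink toward 0), large sigma makes
# all points look similar (entries approach 1).
for sigma in (0.5, 1.0, 2.0):
    K = rbf_kernel(X, sigma)

print(rbf_kernel(X, 1.0)[0, 1])   # exp(-0.5), about 0.6065
```

Because accuracy varies smoothly with both $\sigma$ and $C$ (as Figure 4 shows), a coarse grid search followed by refinement around the best cell is a common way to perform this tuning.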
Conclusion
References
[Cristianini et al., 2000] Nello Cristianini and John Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 1st edition, 2000.
[Nagy, 1992] G. Nagy, On the Frontiers of OCR, Proceedings of the IEEE, vol. 40, no. 8, pp. 1093-1100, July 1992.
[Shawe-Taylor et al., 2005] Sandor Szedmak and John Shawe-Taylor, Multiclass Learning at One-class Complexity, Technical Report, ISIS Group, Electronics and Computer Science, 2005.
[Szedmak et al., 2005] Sandor Szedmak and John Shawe-Taylor, Learning Hierarchies at Two-class Complexity, Kernel Methods and Structured Domains, NIPS 2005.
[Vapnik, 1995] V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.