
Handwritten Tamil Recognition using a Convolutional Neural Network

Prashanth Vijayaraghavan, MIT Media Lab
Misha Sra, MIT Media Lab

Abstract

We use convolutional neural networks (ConvNets) to classify characters in Tamil, a south Indian language, into 35 different classes. ConvNets are biologically inspired neural networks. Unlike other vision learning approaches where features are hand designed, ConvNets can automatically learn a unique set of features in a hierarchical manner. We augment the ConvNetJS library for learning features by using stochastic pooling, probabilistic weighted pooling, and local contrast normalization to establish a new state-of-the-art of 94.4% accuracy on the IWFHR-10 dataset. Furthermore, we describe the different pooling and normalization methods we implemented and show how well they work in our experiments.

1. Introduction

Tamil is one of the oldest languages in the world, with several million speakers in the southern Indian state of Tamil Nadu and in Sri Lanka. Tamil is written in a non-Latin script and has 156 characters including 12 vowels and 23 consonants (see Figure 1). Compared to Latin character recognition, isolated Tamil character recognition is a much harder problem because of the larger category set and potential confusion due to similarity between handwritten characters. Previous approaches to classifying Tamil characters have used a variety of hand-crafted features [15] and template matching [16]. In contrast, ConvNets learn features from pixels all the way to the classifier [14]. The superiority of learned features over hand-designed ones was demonstrated by [12] and also shown by others who obtained the best performance in traffic sign classification using ConvNets [14, 3]. The deep learning network structure implicitly extracts relevant features by restricting the weights of one layer to a local receptive field in the previous layer. By reducing the spatial resolution of the feature map, a certain degree of shift and distortion invariance is achieved [5]. The number of free parameters decreases significantly because the same set of weights is used for all features in a feature map [9]. The ability of ConvNets to exploit spatially local correlation has helped in achieving extremely high performance on the popular MNIST handwritten digits dataset [10]. ConvNets were also able to achieve a 97.6% facial expression recognition rate on 5,600 still images of more than 10 individuals [11].

Figure 1: 32 × 32 cropped samples from the classification task of the IWFHR-10 Tamil character dataset. The dataset has samples for 156 handwritten character classes.

Handwritten character recognition can be online or offline. In this context, online recognition involves conversion of digital pen-tip movements into a list of coordinates used as input for the classification system, whereas offline recognition uses images of characters as input. Some of the earlier works apply shallow learning with hand-designed features on both online and offline datasets. Examples of hand-designed features include pixel densities over regions of the image, character curvature, dimensions, and number of horizontal and vertical lines. Shanthi et al. [15] use pixel densities over different zones of the image as features for an SVM classifier. Their system achieved a recognition rate of 82.04% on a handwritten Tamil character database. Sureshkumar et al. [16] use a neural network based algorithm where features like the number of horizontal and vertical arcs and the width and height of each character are extracted during pre-processing. These features are then passed to an SVM, a Self Organizing Map, an RCS, a Fuzzy Neural Network, and a Radial Basis Network. They achieve an accuracy of 97% on test data, but their approach is not invariant to deformations or different writing styles, as their algorithms are highly dependent on the form of the character. Unfortunately, they provide little to no detail on their dataset. Ramakrishnan et al. [13] derive global features from the discrete Fourier transform (DFT), discrete cosine transform (DCT), and wavelet transform to capture overall information about the data and feed them into an SVM with a radial basis function (RBF) kernel. They obtain 95% accuracy on an online test set. Though there has been a lot of research in Tamil handwriting recognition, most of it has been with online datasets [1, 8] or with online and offline hybrid classifiers, and there is limited research with offline datasets. To the best of our knowledge, there have been no previous attempts with ConvNets on our particular dataset. We employ the traditional ConvNet architecture augmented with different pooling methods and local contrast normalization. This work is implemented with the open source ConvNetJS library¹.

¹ http://cs.stanford.edu/people/karpathy/convnetjs/

2. The Dataset

We train on the offline IWFHR-10 Tamil character dataset from the HP Labs India website². The dataset contains approximately 500 samples for each of the 156 Tamil characters, written by native Tamil writers. The characters are made available for download as TIFF files. We resize the original unequally sized rectangular images into 32 × 32 square images and save them as JPG files. The resized JPG images are exported and saved as rows in a large CSV file, where the first column of each row holds the image class. This is done in MATLAB. A simple Python script is used to shuffle this large CSV file and split it into two smaller CSV files, one for the training set and another for the test set, containing approximately 60K and 10K images respectively. We read both CSV files into the ConvNetJS library by implementing a CSV parser using Papaparse³.

² http://lipitk.sourceforge.net/datasets/tamilchardata.htm
³ http://papaparse.com/

3. Architecture

The input to the convolutional neural network is a 32 × 32 image passed through a stack of different kinds of layers as follows: n×32×32 - 16C5×5 - P2×2 - L3×3 - 32C5×5 - P2×2 - 32C5×5 - P2×2 - 35N. This represents a net with n input images of size 32 × 32, a convolutional layer with 16 maps and filters of size 5 × 5, a max-pooling layer over non-overlapping regions of size 2 × 2, a local response normalization layer (L), a convolutional layer with 32 maps and filters of size 5 × 5, a max-pooling layer over non-overlapping regions of size 2 × 2, a third convolutional layer and max-pooling layer of the same sizes, and a fully connected output layer with 35 neurons, one neuron per class (see Figure 2). We use a non-linear hyperbolic tangent activation function, where the output f is a function of input x such that f(x) = tanh(x), for the convolutional layers, a linear activation function for the max-pooling layers, and a softmax activation function for the output layer.

Figure 2: An input image (or a feature map) is passed through a non-linear filterbank, followed by tanh activation, local contrast normalization and spatial pooling/sub-sampling. a) First convolutional layer with 16 filters of size 5 × 5. b) Max-pooling layer of size 2 × 2 with stride 2. c) Local response normalization layer with alpha = 0.1 and beta = 0.75. d) Second convolutional layer with 32 filters of size 5 × 5. e) Max-pooling of size 2 × 2 with stride 2. f) Third convolutional layer with 32 filters of size 5 × 5. g) Max-pooling of size 2 × 2 with stride 2. h) 35-way softmax classification layer. We train using stochastic gradient descent with an adaptive learning rate.
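As a rough check on the layer stack described above, the following sketch walks the spatial size of the feature maps through each layer. It is not taken from the authors' code. The names `LAYERS` and `trace_shapes` are hypothetical, and the convolutions are assumed to be zero-padded by 2 pixels so that a 5 × 5 filter preserves the spatial size. The text does not state the padding; without it, a 32 × 32 input would shrink to 1 × 1 before the last 2 × 2 pooling layer.

```python
# Layer stack from the architecture notation: (kind, maps, filter/window size).
# Padding of 2 on the convolutions is an assumption, not stated in the text.
LAYERS = [
    ("conv", 16, 5),
    ("pool", None, 2),
    ("lrn", None, 3),
    ("conv", 32, 5),
    ("pool", None, 2),
    ("conv", 32, 5),
    ("pool", None, 2),
]


def trace_shapes(size, depth, layers, pad=2):
    """Return (kind, depth, height, width) of the feature maps after each layer."""
    shapes = []
    for kind, maps, k in layers:
        if kind == "conv":
            # Stride-1 convolution: N + 2*pad - m + 1 (N - m + 1 when pad = 0).
            size = size + 2 * pad - k + 1
            depth = maps
        elif kind == "pool":
            # Non-overlapping K x K pooling reduces N to N / K.
            size = size // k
        # "lrn" keeps both the spatial size and the depth unchanged.
        shapes.append((kind, depth, size, size))
    return shapes


if __name__ == "__main__":
    for kind, d, h, w in trace_shapes(32, 1, LAYERS):
        print(f"{kind:>4}: {d} x {h} x {w}")
```

Under these assumptions the last pooling layer outputs 32 maps of size 4 × 4, so the 35-way softmax layer would see 512 inputs.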
We train using stochastic gradient descent with an adaptive learning rate [17].

The input layer has N × N neurons corresponding to the size of the input image. If we use an m × m filter ω, the output of the convolutional layer will be of size (N − m + 1) × (N − m + 1). The input x^l_{ij} to any neuron at a particular point in time is the sum of the weighted contributions from the neurons in the previous layer such that:

    x^{l}_{ij} = \sum_{p=0}^{m-1} \sum_{q=0}^{m-1} \omega_{pq} \, y^{(l-1)}_{(i+p)(j+q)}    (1)

A non-linear function σ is then applied element-wise to each feature map and the resulting activations are passed to the pooling layer.

    y^{l}_{ij} = \sigma(x^{l}_{ij})    (2)

The pooling layer (i.e. subsampling) outputs local averages of the feature maps in the previous convolutional layer to produce pooled feature maps (of smaller size) as output. Max-pooling layers take a K × K region as input and output a single value based on the maximum value in that region. If the input layer is N × N, the output from the pooling layers is of size N/K × N/K since all the K × K blocks are reduced to a single value. Because pooling is an average over a local region, the output feature maps are less sensitive to precise locations of features in the image than the first layer of feature maps. Beyond the conventional deterministic forms of pooling like average and max, Zeiler et al. [18] introduce stochastic pooling and probabilistic weighting, which we implement in the ConvNetJS library for our experiments.

3.0.1 Stochastic Pooling

In stochastic pooling, a sample activation from the multinomial distribution of activations from each pooling region is selected. More precisely, the probabilities p for every region R_j (total regions n_r) are calculated after normalizing the activations within the region (see Figure 3).

    p_i = \frac{x_i}{\sum_{k \in R_j} x_k}    (3)

Sampling the multinomial distribution based on p to pick a location t within the pooling region is simply:

    s_j = x_t, \quad t \sim P(p_1, \ldots, p_{n_r})    (4)

Figure 3: Example illustrating stochastic pooling. a) Resulting activations within a 3 × 3 pooling region. b) Probabilities based on the activations. c) Sampled activation.

3.0.2 Probabilistic Weighting

Probabilities are computed as in stochastic pooling, but the activations in each region are weighted by the probability p_i and summed. It can be called probabilistic weighted averaging as it is a variation of the standard average pooling. Using stochastic pooling at test time degrades performance, whereas probabilistic weighting can boost performance when applied at test time.

    s_j = \sum_{i \in R_j} p_i x_i    (5)

The next pair of convolutional and subsampling layers work in the same manner. The convolutional layer takes in the output from the pooling layer as input and extracts features that are increasingly invariant to local changes in the input images. The second convolutional layer has 32 feature maps, which increases the feature space but reduces the spatial resolution. The last layer is the classification layer.

Figure 4: Output from the LRN layer for one character, rotated for display.

3.0.3 Local response normalization

ReLUs have the desirable property that they do not require input normalization to prevent them from saturating. If at least some training examples produce a positive input to a ReLU, learning will happen in that neuron [7]. Even though we got better results using tanh() over ReLU, we still found that the following local normalization schemes aided generalization (see Figure 4).

This layer computes the function:
    f(u_f^{x,y}) = \frac{u_f^{x,y}}{1 + \frac{\alpha}{N} \, \mathrm{region}_{xy}}    (6)

where u_f^{x,y} is the activity of a unit in map f at position x, y prior to normalization, S is the image size, and N is the size of the region to use for normalization. The output dimensionality of this layer is always equal to the input dimensionality.

    \mathrm{region}_{xy} = \sum_{x'=\max(0,\,x-N/2)}^{\min(S,\,x-N/2+N)} \; \sum_{y'=\max(0,\,y-N/2)}^{\min(S,\,y-N/2+N)} \left( u_f^{x',y'} \right)^2    (7)

This layer is useful when using neurons with unbounded activations (e.g. rectified linear neurons) as it permits the detection of high-frequency features with a big neuron response, while damping responses that are uniformly large in a local neighborhood. It is a type of regularizer that encourages "competition" for big activities among nearby groups of neurons⁴.

⁴ https://code.google.com/p/cuda-convnet/wiki/LayerParams

3.0.4 Local contrast normalization

Local contrast normalization (LCN) can be either subtractive or divisive. Subtractive LCN subtracts from every value in the feature a Gaussian-weighted average of its neighbors. Divisive LCN divides every value in a layer by the standard deviation of its neighbors over space and over all feature maps. The mean and variance of an image around a local neighborhood are made consistent (see Figure 5), which is useful for correcting non-uniform illumination or shading artifacts⁵.

⁵ http://bigwww.epfl.ch/sage/soft/localnormalization/

Figure 5: Local contrast normalization

We implement this layer to compute the LCN function:

    f(u_f^{x,y}) = \frac{u_f^{x,y}}{1 + \frac{\alpha}{N} \, \mathrm{region}_{xy}}    (8)

where m_f^{x',y'} is the mean of all u_f^{x,y} in the 2D neighborhood defined by the summation bounds below.

    \mathrm{region}_{xy} = \sum_{x'=\max(0,\,x-N/2)}^{\min(S,\,x-N/2+N)} \; \sum_{y'=\max(0,\,y-N/2)}^{\min(S,\,y-N/2+N)} \left( u_f^{x',y'} - m_f^{x',y'} \right)^2    (9)

This layer is similar to the response normalization layer except we compute the variance of activities in each neighborhood, rather than just the sum of squares (correlation).

Figure 6: The first two images are perfectly classified with high confidence scores despite existence of similar characters shown as second best predictions for each image.

4. Experiments

ConvNets have a large set of hyper-parameters and finding the perfect configuration for each dataset is a challenge. We explored different configurations of the network and attempted to optimize the parameters based on the validation set accuracy. For our recognition task, we focus on 35 Tamil characters that include all vowels and consonants, with a dataset of 18,535 images.

4.1. Data Preparation

The IWFHR-10 classification dataset contains randomly sized images. After resizing all images to 32 × 32 samples, we split the dataset into three subsets: a train set, a validation set, and a test set.
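The three-way split described above can be sketched in a few lines of Python. This is a minimal illustration rather than the script used for the experiments. The function name `split_dataset`, the 80/10/10 proportions, and the fixed seed are assumptions, since the text does not give the subset sizes. The rows are shuffled first because the order in which the images were sampled is unknown.

```python
import random


def split_dataset(rows, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle rows and split them into train, validation and test lists.

    The fractions and the seed are illustrative assumptions, not values
    reported in the text.
    """
    rows = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)  # reproducible shuffle
    n_val = int(len(rows) * val_frac)
    n_test = int(len(rows) * test_frac)
    val = rows[:n_val]
    test = rows[n_val:n_val + n_test]
    train = rows[n_val + n_test:]
    return train, val, test


if __name__ == "__main__":
    # Each row stands in for one CSV record: class label first, then pixels.
    data = [(label, [0.0] * 4) for label in range(100)]
    train, val, test = split_dataset(data)
    print(len(train), len(val), len(test))
```

Fixing the seed keeps the three subsets identical across runs, which matters when hyperparameters are tuned against the validation set.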
Since we are given no information about how these images were sampled, we shuffle our datasets before use. We initially experimented with rectangular images of size 35 × 28, selected to maintain the aspect ratio of most of the sample images in the original dataset. However, the results were not promising due to limited training examples. A neural network model for pattern recognition should make predictions that are invariant to variations of the same class of patterns [4]. This is usually done by training the neural network using a dataset with enough variations of the patterns. However, with limited data, the neural network training can over-fit and hurt the classification robustness. One way to deal with this problem is data augmentation, where the training set is artificially augmented by adding samples with transformations that preserve the class labels. Data augmentation is widely used in image recognition tasks where transformations have led to significant improvements in recognition accuracy [4]. To increase the size of our dataset, we first created complementary images and normalized them to values between 0 and 1, and then applied rotational deformations to improve spatial invariance, resulting in a dataset of 70,524 images.

Figure 7: Pairs of similar looking characters in the Tamil dataset.

Figure 8: The classification function calculates a negative log likelihood loss and back propagates L1-loss.

4.2. Character Recognition

Our initial experiments gave us a 93% training accuracy while the test accuracy was 89.3%. We explored techniques to prevent overfitting (see Tables 1 and 2) by applying different regularization methods. In a convolutional neural network there may be many different settings of weights that can model the training set well, especially with limited labeled training data. Each of these weight vectors will make different predictions on test data and most likely not do as well as it did on training data because the feature detectors have been tuned to work well together on the training set but not on the test set [6]. Dropout is a regularization technique where on each presentation of each training case, feature detectors are deleted with probability p and the remaining weights are trained by backpropagation [2].

For improving generalization, we implement a data augmentation function for our network, where we initially input a 35 × 28 image and crop a random 31 × 24 window from that image before training on it. Similarly, to do prediction, 4 random crops are sampled and the probabilities across all crops are averaged to produce final predictions⁶. For square input images of size 28 × 28, the ConvNetJS library already has a data augmentation implementation as described above.

⁶ http://cs.stanford.edu/people/karpathy/convnetjs/

We implemented and experimented with the following generalization techniques:

• Stochastic Pooling during training and testing
• Probabilistic Weighting during training and testing
• Stochastic Pooling during training + Probabilistic Weighting during testing
• Dropout in a convolutional layer
• Dropout in a fully connected layer
• Data augmentation function

Applying these techniques did not provide a big boost in performance but we did see marginal increases in comparison with configurations without them. With the implementation of local contrast normalization, the test accuracy improved to 94.1%. However, the local response normalization outperformed all other configurations and we were able to achieve a high test accuracy of 94.4% with a training accuracy of 99%. The effect of local contrast normalization
can be seen in Figure 2 with the approximate whitening of the image. Experiments using Dropout in a convolutional layer as well as in a fully connected layer led to a surprising drop in accuracy, and we conjecture that this was due to a small total number of activations, possibly leading to loss of important activations.

The convolutional neural network architecture was trained with different sets of configurations for rectangular and square images. The hyperparameters were optimized based on a validation set of images and results are shown in Tables 1 and 2.

Pooling             Activation  Classifier  Train Acc.  Test Acc.  Others
Max                 Tanh        Softmax     91%         87.39%     NA
Max                 ReLU        Softmax     69%         48.2%      NA
Max                 Tanh        SVM         84%         80.4%      NA
Stochastic          Tanh        Softmax     86%         87.72%     NA
Stochastic          Tanh        Softmax     84%         57.92%     FC (dropout: 0.1)
Stochastic          Tanh        SVM         80%         65.99%     NA
Stochastic+Prob Wt  Tanh        Softmax     84%         86.5%      NA

Table 1: Experiments with 35 × 28 images. FC: fully connected layer. Dropout: drop activations with probability 0.1.

Pooling             Activation  Classifier  Train Acc.  Test Acc.  Others
Max                 Tanh        Softmax     99%         94.4%      LRN
Max                 Tanh        Softmax     95%         94.1%      LCN
Max                 Tanh        Softmax     97%         93.5%      NA
Max                 Tanh        Softmax     93%         89.3%      FC (neurons: 1024)
Max                 ReLU        Softmax     97%         93.2%      NA
Max                 Tanh        SVM         97%         90.3%      NA
Max                 Tanh        Softmax     82%         73%        FC (dropout: 0.2)
Stochastic          Tanh        Softmax     94%         93.3%      NA
Stochastic          ReLU        Softmax     95%         92.5%      NA
Stochastic          Tanh        SVM         90%         89.6%      NA
Stochastic+Prob Wt  ReLU        Softmax     94%         92.6%      NA

Table 2: Experiments with 32 × 32 images. FC: fully connected layer. Dropout: drop activations with probability 0.2.

Figure 8 shows the drop in the classification loss as the number of training examples increases. Classification loss came down from approximately 3.39 to 0.08 as the training progressed. Exploiting the ability of convolutional neural networks to learn invariances to scale, rotation, and translation, the characters were recognized with high accuracy. Figure 6 shows the images and the top three predictions of our classification. The first two images are perfectly classified with high confidence scores though they have very similar looking characters as the second best predictions for each image. In the third image, our ConvNet fails to correctly classify the character but the confidence score is really close to the second best guess. We would like to note that it is difficult even for a human being to distinguish between these two characters as they are very similar to each other.

5. Acknowledgements

We would like to thank Professors William Freeman and Antonio Torralba for giving us the opportunity to learn about ConvNets through course 6.869. We are also thankful to the course TAs Carl Vondrick and Andrew Owens for their help and support. Lastly, we would like to thank HP Labs India for making the dataset available for research.

6. Discussion

Our results show that a convolutional neural network is capable of achieving record breaking results on the Tamil dataset using purely supervised learning. It is notable that our network's performance degrades if a single convolutional layer is removed and does not improve much with the addition of another convolutional + pooling layer pair. To simplify our experiments, we did not use any unsupervised pre-training even though we expect that it will help. We explored auto encoders but since our system is running in the browser, we had trouble saving large amounts of data to file from the browser to be used as input for the ConvNet.
References

[1] K. Aparna, V. Subramanian, M. Kasirajan, G. V. Prakash, V. Chakravarthy, and S. Madhvanath. Online handwriting recognition for tamil. In Frontiers in Handwriting Recognition, 2004. IWFHR-9 2004. Ninth International Workshop on, pages 438–443. IEEE, 2004.
[2] P. Baldi and P. J. Sadowski. Understanding dropout. In Advances in Neural Information Processing Systems, pages 2814–2822, 2013.
[3] D. Ciresan, U. Meier, J. Masci, and J. Schmidhuber. A committee of neural networks for traffic sign classification. In Neural Networks (IJCNN), The 2011 International Joint Conference on, pages 1918–1921. IEEE, 2011.
[4] X. Cui, V. Goel, and B. Kingsbury. Data augmentation for deep neural network acoustic modeling. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 5582–5586. IEEE, 2014.
[5] S. Haykin. Self-organizing maps. Neural networks-A comprehensive foundation, 2nd edition, Prentice-Hall, 1999.
[6] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[8] R. Kunwar and A. Ramakrishnan. Online handwriting recognition of tamil script using fractal geometry. In Document Analysis and Recognition (ICDAR), 2011 International Conference on, pages 1389–1393. IEEE, 2011.
[9] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[10] Y. LeCun and C. Cortes. The mnist database of handwritten digits.
[11] M. Matsugu, K. Mori, Y. Mitari, and Y. Kaneda. Subject independent facial expression recognition with robust face detection using a convolutional neural network. Neural Networks, 16(5):555–559, 2003.
[12] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 4, 2011.
[13] A. G. Ramakrishnan and K. B. Urala. Global and local features for recognition of online handwritten numerals and tamil characters. In Proceedings of the 4th International Workshop on Multilingual OCR, MOCR '13, pages 16:1–16:5, New York, NY, USA, 2013. ACM.
[14] P. Sermanet and Y. LeCun. Traffic sign recognition with multi-scale convolutional networks. In Neural Networks (IJCNN), The 2011 International Joint Conference on, pages 2809–2813. IEEE, 2011.
[15] N. Shanthi and K. Duraiswamy. A novel svm-based handwritten tamil character recognition system. Pattern Analysis and Applications, 13(2):173–180, 2010.
[16] C. Sureshkumar and T. Ravichandran. Handwritten tamil character recognition and conversion using neural network. Int J Comput Sci Eng, 2(7):2261–67, 2010.
[17] M. D. Zeiler. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
[18] M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. arXiv preprint arXiv:1301.3557, 2013.
