A Hybrid Model for End-to-End Online Handwriting Recognition
Abstract—Automatic recognition of online handwritten words in a generic mode has significant application potential. However, this recognition task is challenging for unconstrained handwriting data. The challenge is more serious for Indic scripts like Devanagari or Bangla due to the inherent cursiveness of their characters, the large sizes of their alphabets, the existence of several groups of similarly shaped characters, etc. On the other hand, with the recent development of powerful machine learning tools, major research initiatives in this area of pattern recognition have been observed. Feature extraction and classification are the two major modules of such a recognizer. Deep architectures of convolutional neural network (CNN) models have been found to be efficient in extracting useful features from the raw signal. On the other hand, a recurrent neural network (RNN) along with connectionist temporal classification (CTC) has been shown to be able to label unsegmented sequence data. In the present article, we propose a hybrid layered architecture consisting of three networks, CNN, RNN and CTC, for recognition of online handwriting without the use of any specific lexicon. In this study, we have also observed that feeding hand-crafted features to the CNN at the first level of the proposed model provides better performance than feeding the raw signal to the CNN. We have simulated the proposed model on two large databases of Devanagari and Bangla online unconstrained handwritten words. The recognition accuracies provided by the proposed model are encouraging.

I. INTRODUCTION

From the perspective of automatic recognition, handwriting data are often categorized into offline and online formats. An offline handwriting sample is stored in the form of a two-dimensional image, while an online handwriting sample is stored as a temporal sequence of two-dimensional coordinate points determining the trajectory of pen tip movement, along with some additional information such as pen status ('up' or 'down'). Automatic recognition or interpretation of both these types of handwriting data has its respective challenges. Since the beginning, the study of handwriting data has attracted the attention of researchers in the area of pattern recognition. However, automatic recognition of unconstrained cursive handwriting has always met with serious challenges. In such handwriting, information about the boundaries of individual characters is not readily available, because when a writer writes in an unconstrained way, the lifting of the pen depends upon his/her idiosyncrasy instead of the end of a character. Thus, the task of recognizing cursive handwriting is far more challenging [1] than recognizing isolated handwritten characters. In this work, we have studied recognition of unconstrained online handwriting of Devanagari and Bangla, the two most popular Indian scripts. This type of handwriting data is captured by touch screen devices, pen tablets, etc. Such devices store the coordinates of points on the writing surface along the path of movement of a finger tip or stylus as a temporal sequence. The part of such a sequence between a pair of successive 'pen down' and 'pen up' events is often termed a stroke. A piece of online handwritten data is composed of one or more such strokes. An example of such online handwriting data is shown in Fig. 1.

Figure 1. A piece of online handwritten Hindi text written in Devanagari script. Circles show the positions of captured coordinates on the writing surface. Different colors mark different strokes.

A. Devanagari Script

Devanagari is one of the most widely used scripts in the southern part of Asia. It is a descendant of the old Brahmi script, and its early use was found around 1000 CE. The Devanagari script is used to write several languages such as Sanskrit, Hindi, Nepali, Marathi, Kashmiri, etc. Devanagari is an alpha-syllabary (also known as an Abugida) [2], in which a consonant and vowel composition is often written as a single unit. Also, two or more of its basic consonant characters can combine to form a compound character. Due to these characteristics of the script, the size of its alphabet is large, consisting of many compound characters. Fig. 2 shows modified forms of a basic consonant of Devanagari when it is attached to different basic vowel characters. On the other hand, the first two rows of Fig. 3 show the formation of Devanagari compound characters from combinations of two basic consonant characters.
Figure 4. (a) Handwritten samples of Bangla words, (b) Bangla basic characters and vowel modifiers used to form the words.

II. RELATED WORKS

Several studies of online handwriting recognition are found in the existing literature. A few recent studies include [4], [5]. The majority of existing online handwriting recognition studies of the Indic scripts considered isolated characters as the input [2], [6], [7], [8]. However, a

A. CNN

CNN has now been established as an efficient deep neural network architecture [14], [15]. It is a layered architecture with two distinct parts. The tasks of the layers in the first part of this architecture are convolution and sub-sampling operations. They have neurons arranged along width, height and depth dimensions. The neurons in such a layer have feed-forward connections only with the neurons in a small region of the immediately neighbouring layer. The second or last part of a CNN architecture is a fully connected multilayer perceptron consisting of one or more hidden layers. Its last layer is the final output layer of the CNN, which provides a single vector of class scores corresponding to the input image. The well-known backpropagation algorithm or one of its variations is used to obtain the connection weights of a CNN. A typical CNN architecture is shown in Fig. 5.
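The convolution and sub-sampling operations described above can be sketched in a few lines of NumPy. This is a toy 1-D illustration, not the paper's implementation; the input signal and kernel values are made up:

```python
import numpy as np

def conv1d_valid(x, kernel):
    """'Valid' 1-D convolution (really cross-correlation, as used in CNNs)."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def maxpool1d(x, size, stride):
    """Sub-sampling: keep only the maximum of each window."""
    return np.array([x[i:i + size].max() for i in range(0, len(x) - size + 1, stride)])

x = np.array([0.0, 1.0, 3.0, 2.0, 5.0, 4.0, 1.0, 0.0])
feat = conv1d_valid(x, np.array([1.0, 0.0, -1.0]))  # responds to local slope
pooled = maxpool1d(feat, size=2, stride=2)          # pooling halves the length
```

A real CNN layer learns many such kernels in parallel and stacks several convolution/pooling pairs, as in Table I of this paper.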
Authorized licensed use limited to: ANNA UNIVERSITY. Downloaded on October 24,2024 at 06:26:31 UTC from IEEE Xplore. Restrictions apply.
input sequence, an RNN makes use of information about the previous elements of the sequence, stored in some form of memory in its hidden units. In the literature, RNNs have been successfully used in speech processing [17], natural language processing [18], online handwriting recognition [1], etc.
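The recurrence below shows, in a minimal NumPy sketch with made-up dimensions and random weights, how a plain RNN cell carries information about previous elements forward through its hidden state; the LSTM cells used later in this paper add gating on top of this idea:

```python
import numpy as np

rng = np.random.default_rng(0)
k, m = 3, 4                        # input and hidden sizes (arbitrary here)
W_x = rng.normal(size=(m, k)) * 0.1
W_h = rng.normal(size=(m, m)) * 0.1
b = np.zeros(m)

def rnn_forward(seq):
    """Return hidden states h_1..h_T; h_t depends on x_t and on h_{t-1}."""
    h = np.zeros(m)
    states = []
    for x_t in seq:
        # memory of the past enters through the W_h @ h term
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

seq = rng.normal(size=(5, k))      # a length-5 input sequence
H = rnn_forward(seq)               # one bounded hidden vector per timestep
```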
known limitations [22]. A few hybrid approaches involving HMM and some neural network such as an RNN have also been studied in the literature [23] for this purpose. Such hybrid approaches can only partially solve the problem – they can neither recover from all the drawbacks of HMM nor entirely exploit the potential of the RNN for sequence labelling tasks [22]. Another alternative is the use of a connectionist temporal classification (CTC) output layer [22], which has been used earlier in speech and handwriting recognition tasks to get rid of the problem of pre-segmentation.

A CTC network has a softmax output layer with |L| + 1 units, where |L| is the size of the alphabet L. At any timestep, the probabilities of observing the various labels are computed at the first |L| units of the output layer, and the remaining unit stores the probability of 'no label' or 'blank'. For an input sequence a = (a_1, a_2, ..., a_N) of length N, a recurrent neural network R with k inputs, m outputs and weight vector W is a continuous map R_W : (R^k)^N → (R^m)^N. Let b = R_W(a) be the sequence of outputs of the network R, and let b^t_n be the activation of output unit n ∈ {1, 2, ..., |L| + 1} at time t. Thus, b^t_n is the probability of observing label n at timestep t. This defines a distribution over the set L'^N of sequences of length N over the alphabet L' = L ∪ {blank}:

    p(η | a) = ∏_{t=1}^{N} b^t_{η_t},   ∀ η ∈ L'^N                  (1)

Figure 7. (a) A handwritten Devanagari (Hindi) word, (b) Different segments of the word and the corresponding labels – segments (i), ..., (vi) have the labels 1, ..., 6 respectively.

D. The Proposed Model

A CNN architecture is used at level 1 of the proposed model. It computes the feature representation of the input. Level 2 of the model consists of an RNN. It takes care of labelling the input sequence using the features obtained at level 1. Finally, a CTC output layer is employed at level 3 for transcription of the input handwritten data without any explicit segmentation scheme. At level 1, the spatial information in the input sequence data is exploited to produce a set of efficient features through supervised learning, while the RNN at the next level makes use of the temporal information in the online handwriting data. Fig. 8 shows the architecture of this model, while Table I presents the detailed configuration of this architecture.
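The per-path probability of equation (1) is just a product of per-timestep softmax outputs; a small NumPy sketch, with a made-up two-label alphabet plus blank and invented output probabilities, makes the computation concrete:

```python
import numpy as np

# Softmax outputs b[t][n] for N = 3 timesteps over L' = {0: 'a', 1: 'b', 2: blank}.
# Each row sums to 1; the numbers are invented for illustration.
b = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.1, 0.8],
              [0.2, 0.6, 0.2]])

def path_probability(b, eta):
    """p(eta | a) = prod over t of b[t][eta_t]  --  equation (1)."""
    return float(np.prod([b[t, n] for t, n in enumerate(eta)]))

p = path_probability(b, [0, 2, 1])   # the path 'a', blank, 'b'
```

To score a final labelling rather than one path, CTC sums equation (1) over all paths that collapse to that labelling; in practice this is done efficiently with the forward-backward algorithm of [22], not by enumeration.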
width or the number of columns, and features is the number of channels of the input data. The dimension of the output data of the CNN module is 1 × timesteps × F, where timesteps is the reduced number of timesteps after the last max-pooling operation (layer 6) and F is the number of feature maps generated by the last convolution operation (layer 5). This output data is rendered in the form timesteps × F at the reshape layer of level 2. The entire data is then passed through the BLSTM layers before the output is fed to the CTC layer of the final level.
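Under the layer configuration of Table I, the reduction of timesteps through the three convolution/max-pooling pairs can be checked with a few lines of arithmetic. The paper does not state the padding scheme, so 'valid' (no padding) is assumed here, and the input length 200 is hypothetical:

```python
def conv_out(t, k=5, s=1):
    """Output length of a 'valid' convolution with kernel k and shift s."""
    return (t - k) // s + 1

def pool_out(t, k=5, s=2):
    """Output length of max-pooling with kernel k and shift s."""
    return (t - k) // s + 1

t = 200                            # hypothetical number of input timesteps
for _ in range(3):                 # layers 1-6: three conv + max-pool pairs
    t = pool_out(conv_out(t))
# t is now the reduced 'timesteps' fed (after the reshape layer) to the BLSTMs
```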
An implementation of this model is available on GitHub.
IV. WORKFLOW

Before feeding the input handwritten data to the proposed model described in Section III-D, it is subjected to a preprocessing stage followed by feature extraction. Preprocessing operations include size normalization (normalized height = 100), re-sampling and translation, as described in [13].
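Of these preprocessing steps, translation and height normalization are straightforward; the sketch below is our own illustration (the exact procedure is given in [13]) and scales a sequence of (x, y) points so that its bounding-box height becomes 100:

```python
import numpy as np

def normalize(points, target_height=100.0):
    """Translate to the origin and scale so the bounding-box height is target_height."""
    pts = np.asarray(points, dtype=float)
    pts -= pts.min(axis=0)                # translate: min x and min y become 0
    height = pts[:, 1].max()
    if height > 0:
        pts *= target_height / height     # uniform scaling preserves aspect ratio
    return pts

raw = [(10.0, 40.0), (30.0, 90.0), (20.0, 240.0)]   # invented sample points
norm = normalize(raw)
```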
We have computed two different sets of features, Feat1 and Feat2, for simulations of the proposed approach. Feat1 consists of only three quantities, r, sin θ and cos θ, where (r, θ) are the polar coordinates of the points of the preprocessed samples and each such point defines a 'timestep'. On the other hand, the feature set Feat2 has 16 components, the details of which are given below. In this latter case, each segment of 5 consecutive points P_{i−2}, P_{i−1}, P_i, P_{i+1}, P_{i+2} forms a 'timestep', and these segments are chosen at a step size of 3 along the pen trajectory. Feat2 representing a segment consists of the following measures:
• sin α and cos α, where α is the smaller of the two adjacent angles between the lines OP_{i−1} and OP_{i+1}, O being the origin of the coordinate system,
• sin β and cos β, where β is the smaller of the two adjacent angles between the lines OP_{i−2} and OP_{i+2},
• vicinity aspect [1],
• velocity before resampling [1],
• y-coordinate after normalization [1],
• average squared distance [1],
• Fourier transforms of dx_k and dy_k (k = 1, 2, 3, 4), where dx_k and dy_k are respectively the signed differences in x and y values of successive points on the segment.

Thus, Feat2 consists of 8 quantities computed in the spatial domain and another 8 quantities computed in the frequency domain.

Figure 8. Proposed Hybrid Architecture

Table I
MODEL CONFIGURATION

Layer no.  Layer type   Filters / Nodes  Specifications
1          Convolution  16               Kernel size=5, Shift=1
2          Max-pooling  NA               Kernel size=5, Shift=2
3          Convolution  32               Kernel size=5, Shift=1
4          Max-pooling  NA               Kernel size=5, Shift=2
5          Convolution  32               Kernel size=5, Shift=1
6          Max-pooling  NA               Kernel size=5, Shift=2
7          Reshape      NA               Squeeze dimension
8          BLSTM        64               NA
9          BLSTM        128              NA
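As an illustration of the angle components of Feat2 listed above, sin α and cos α can be computed as follows. This is a sketch under our reading of the definition (the angle at the origin O between the directions of OP_{i−1} and OP_{i+1}); the vicinity, velocity, and Fourier components follow [1] and are omitted here:

```python
import numpy as np

def angle_features(p_prev, p_next):
    """sin and cos of the smaller angle at the origin O between OP_prev and OP_next."""
    u = np.asarray(p_prev, dtype=float)
    v = np.asarray(p_next, dtype=float)
    cos_a = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    cos_a = float(np.clip(cos_a, -1.0, 1.0))
    sin_a = float(np.sqrt(1.0 - cos_a ** 2))   # smaller angle, so sin is non-negative
    return sin_a, cos_a

# hypothetical P_{i-1} and P_{i+1} of a segment, here in perpendicular directions
sin_a, cos_a = angle_features((1.0, 0.0), (0.0, 2.0))
```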
Table II
DATABASE DETAILS

Database     Training samples  Test samples  No. of Characters  Lexicon Size
Devanagari   41831             9584          79                 1959
Bangla       61728             15277         57                 681

Table III
RECOGNITION PERFORMANCE ON TEST SETS OF TWO DATABASES

Database     Accuracy (%) with Feat1   Accuracy (%) with Feat2
Devanagari   59.57                     82.39
Bangla       68.21                     84.47

of the total number of insertion, deletion and substitution errors with respect to the total number of Unicodes in the word. The results shown in Table III prove that the use of higher-level handcrafted features is more efficient than the use of the coordinate values of the signal as the features. In future, we plan to perform further extensive studies on the selection of features for the proposed recognition scheme.

ACKNOWLEDGMENT

C-DAC, Pune, India has provided the Hindi word database.

REFERENCES

[1] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Trans. on Patt. Anal. and Mach. Intell., 31(5):855–868, 2009.

[2] H. Swethalakshmi, A. Jayaraman, V. S. Chakravarthy, and C. C. Sekhar. Online handwritten character recognition of Devanagari and Telugu characters using support vector machines. In 10th Int. Workshop on Frontiers in Handwriting Recog., 2006.

[3] S. Bhattacharya, D. S. Maitra, U. Bhattacharya, and S. K. Parui. An end-to-end system for Bangla online handwriting recognition. In 15th Int. Conf. on Frontiers in Handwriting Recognition, pages 373–378. IEEE, 2016.

[4] T. Van Phan and M. Nakagawa. Combination of global and local contexts for text/non-text classification in heterogeneous online handwritten documents. Pattern Recognition, 51(C):112–124, March 2016.

[5] A. Delaye and C. Liu. Contextual text/non-text stroke classification in online handwritten notes with conditional random fields. Pattern Recognition, 47(3):959–968, 2014.

[6] S. D. Connell, R. M. K. Sinha, and A. K. Jain. Recognition of unconstrained online Devanagari characters. In 15th Int. Conf. on Pattern Recognition, volume 2, pages 368–371, 2000.

[7] S. K. Parui, K. Guin, U. Bhattacharya, and B. B. Chaudhuri. Online handwritten Bangla character recognition using HMM. In 19th Int. Conf. on Pattern Recognition, pages 1–4. IEEE, 2008.

[8] U. Bhattacharya, B. K. Gupta, and S. K. Parui. Direction code based features for recognition of online handwritten characters of Bangla. In 9th Int. Conf. on Document Analysis and Recognition, volume 1, pages 58–62. IEEE, 2007.

[9] U. Bhattacharya, A. Nigam, Y. S. Rawat, and S. K. Parui. An analytic scheme for online handwritten Bangla cursive word recognition. In Proc. of the 11th ICFHR, pages 320–325, 2008.

[10] G. A. Fink, S. Vajda, U. Bhattacharya, S. K. Parui, and B. B. Chaudhuri. Online Bangla word recognition using sub-stroke level features and hidden Markov models. In Int. Conf. on Frontiers in Handwriting Recog., pages 393–398. IEEE, 2010.

[11] A. Bharath and S. Madhvanath. HMM-based lexicon-driven and lexicon-free word recognition for online handwritten Indic scripts. IEEE Trans. on Patt. Anal. and Mach. Intell., 34(4):670–682, 2012.

[12] O. Samanta, U. Bhattacharya, and S. K. Parui. Smoothing of HMM parameters for efficient recognition of online handwriting. Pattern Recognition, 47(11):3614–3629, November 2014.

[13] B. Chakraborty, P. S. Mukherjee, and U. Bhattacharya. Bangla online handwriting recognition using recurrent neural network architecture. In 10th Indian Conf. on Computer Vision, Graphics and Image Processing, page 63. ACM, 2016.

[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Info. Proc. Syst., pages 1097–1105, 2012.

[15] Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time series. In M. A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 255–258. MIT Press, Cambridge, MA, USA, 1995.

[16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[17] A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In 31st Int. Conf. on Machine Learning, volume 32, pages 1764–1772, 2014.

[18] W. Yin, K. Kann, M. Yu, and H. Schütze. Comparative study of CNN and RNN for natural language processing. CoRR, abs/1702.01923, 2017.

[19] S. Hochreiter, Y. Bengio, and P. Frasconi. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In J. Kolen and S. Kremer, editors, A Field Guide to Dynamical Recurrent Networks. IEEE Press, 2001.

[20] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[21] A. Graves. Supervised Sequence Labelling with Recurrent Neural Networks, volume 385 of Studies in Computational Intelligence. Springer, 2012.

[22] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In 23rd Int. Conf. on Machine Learning, pages 369–376. ACM, 2006.

[23] Y. Bengio, Y. LeCun, C. Nohl, and C. Burges. LeRec: A NN/HMM hybrid for on-line handwriting recognition. Neural Computation, 7(6), 1995.