A Compact Deep Learning Model For Khmer Handwritten Text Recognition
Corresponding Author:
Norliza Mohd Noor
Department of Engineering, Razak Faculty of Technology and Informatics
Universiti Teknologi Malaysia Kuala Lumpur Campus
Jalan Sultan Yahya Petra, 54100 Kuala Lumpur, Malaysia
Email: [email protected]
1. INTRODUCTION
Khmer is the official language of Cambodia, spoken by about 16 million people. It has an
alphasyllabary (abugida) writing system: words are composed of syllables, most of which consist of a base
consonant and additional signs for vowels. The modern Khmer alphabet consists of 33 consonants.
There is great demand for a recognition system that reflects the specifics of Khmer writing, due to the
constant accumulation of documents in spheres such as government, healthcare, finance, and education. Until the
early 2000s, most records in the government and private sectors in Cambodia were kept as handwritten
documents and hand-filled forms. One has to browse manually through the entire mass of paper to reach any
of these records. The bulk of such tasks is extremely laborious to carry out on a daily basis, even with the help of a
systematic archiving system. An effective deep learning [1] application for digitizing handwritten text
is particularly important for promoting the development of public and private services. Such an application
also needs to be inexpensive and applicable in developing economies.
In contrast to other common writing systems, there is very little research on
Khmer text recognition. Most of the effort has been made only within the past decade [2]-[8]. Sok and
Taing [3] and Srun and Vishnyakov [4], [5], [7] studied recognition of printed Khmer text. Ye et al. [8]
developed an online recognition method for printed text in the Khmer, Bangla, and Myanmar alphabets. The
amount of work in the field, as well as the nature of the data collected for the relevant experiments, describes the
current state of the art for Khmer handwritten text recognition (HTR). Most of the data used in past
experiments was printed (machine-generated) text, which greatly impedes the development of an accurate
application.
As an extension of previous experiments [9], [10], the current work implements convolutional neural
networks (CNNs) for Khmer HTR. A novel, compact model, 2+1CNN, is proposed and used alongside models from
the literature (LeNet-5 [1], AlexNet [11], visual geometry group 16 (VGG16), VGG19 [12], and ResNet [24]).
2+1CNN is designed for binary classification, while the existing models were optimized accordingly because of the
one-against-all tactic adopted throughout the work.
To increase overall performance, an independent network was trained and evaluated for each class.
While training each network, one particular class was taken as "positive" and all others as "negative". That
is, given a set of classes $C = \{c_1, c_2, \dots, c_k\}$, the samples of class $c_j$ were isolated and all samples of the other
classes $c_1, \dots, c_{j-1}, c_{j+1}, \dots, c_k$ were considered as "not $c_j$" (or $c_j'$). Training a CNN
with this setting yielded a classifier model $F_j(\cdot)$. The output of the training process was the combination of all
trained classifiers:

$$F(\cdot) = \{F_1(\cdot), F_2(\cdot), \dots, F_k(\cdot)\} \tag{1}$$
Intuitively, the final model was designed to iterate the question "Are you of class $c_j$?" instead of
asking directly "What class are you?" This work aims to design a compact model for a Khmer HTR system.
The lack of appropriate datasets contributes to the difficulty of the task; only the datasets collected in the
preliminary experiments [9], [10] were used.
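The one-against-all setup can be summarized with a short sketch. The helper names below are illustrative assumptions rather than the authors' actual training code, and the trained classifiers are treated as callables that return a score for the "positive" class:

```python
# Hedged sketch of the one-against-all tactic: one binary classifier F_j per
# class, combined at prediction time by taking the most confident answer.
import numpy as np

def binarize_labels(labels, positive_class):
    """Relabel for training F_j: 1 for samples of class c_j, 0 for c_j'."""
    return (np.asarray(labels) == positive_class).astype(np.int32)

def combined_predict(classifiers, sample):
    """Ask each trained F_j 'Are you of class c_j?' and return the index of
    the class whose network answers with the highest score."""
    scores = [f_j(sample) for f_j in classifiers]
    return int(np.argmax(scores))
```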
2. RELATED WORK
2.1. Recognition of Khmer handwriting
Meng and Morariu [2] described how to combine a feedforward artificial neural network (ANN) with
a self-organizing map (SOM) to design a recognition system for printed Khmer characters. Sok and Taing [3]
described their experiment with support vector machines (SVM) on printed Khmer characters; accuracy as a
function of font size and CPU load were presented as an efficiency assessment. The authors also listed the scarce
work done towards Khmer optical character recognition (OCR) to emphasize the lack of research on the Khmer
language. Backpropagation was used by Srun [4] to train a classifier to recognize Khmer characters; for these
experiments, Srun sampled printed text, and preprocessing consisted of resizing images to standard dimensions.
Thumwarin et al. [6] implemented finite impulse response (FIR) filtering to extract features from handwritten
Khmer characters and passed the results to a Euclidean-distance classifier. That work relies on temporal
information, which is impossible to collect from a scanned image of a manuscript; another problem is that the
method requires extra hardware for collecting the temporal information. Further work by Srun and
Vishnyakov [7] covered the implementation of classifiers in TESSERACT and further improvement of the
recognition quality of scanned characters. The earliest mention of Khmer HTR in a computerized setting dates
to 2008, in the work by Ye et al. [8], which proposed a recognition system for scripts such as Myanmar, Khmer,
and Bangla. Their research data was collected by drawing characters with a mouse, which is also a drawback of
that work. Unlike many previous attempts, the data used in the current work reflects the nature of common
handwriting, which makes the resultant models more realistic. The Khmer datasets acquired in previous attempts
are compared in Table 1.
2.2. Convolutional neural networks
CNN architectures differ in the number and arrangement of layers such as convolution, activation,
and pooling. CNNs also differ from each other in the method and objective of training, e.g., prediction, object
discovery, and segmentation.
According to LeCun [1], [16], the CNN is a variation of the multilayer perceptron that requires minimal
preprocessing. The connectivity pattern between neurons in a CNN is inspired by the biological
processes of the animal visual cortex, where each cortical neuron responds to signals from only a restricted
area of the visual field (the receptive field). Matsugu et al. [17] described how the receptive fields connected to
different neurons partially overlap; this leads to the entire visual field being covered and, therefore, to
smooth vision.
Figure 1 shows an example of a three-dimensional neuron arrangement in a convolutional neural
network. Every layer takes a three-channel image, where each pixel has a separate value for the red, green, and
blue components; the image is split to form output in the form of a 3D matrix of neurons. The data used in this
study was preprocessed into grayscale images.
The convolution operation is performed on the input data; this step models the response of an
individual biological neuron to visual input. The activation step applies a transformation to the output of each
neuron by using activation functions. The rectified linear unit (ReLU) is an example of a commonly used
activation function: it passes the output of a neuron through unchanged if it is positive and maps it to zero if it is
negative.
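In symbols, the rectifier applied to each neuron output x is:

$$\mathrm{ReLU}(x) = \max(0, x) = \begin{cases} x, & x > 0 \\ 0, & x \le 0 \end{cases}$$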
The output of the activation step can be further transformed by applying a pooling step. Pooling
reduces the dimensionality of the feature map by condensing the output of small regions of neurons into a
single output. This helps to simplify the subsequent layers and reduces the number of parameters that the
model needs to learn. CNN layers are configured by these three concepts. A CNN can have tens or hundreds
of hidden layers, each of which learns to detect different features of an image. Every hidden layer increases the
complexity of the learned image features: for example, the first hidden layer learns to detect edges, while the
last layer learns to detect more complex shapes.
In a CNN, inputs from a small local receptive field (LRF) are connected to a single neuron of the hidden
layer. The LRF is translated across the image to create a feature map from the input layer for use in the hidden
layers; convolutions are used to implement this process efficiently [19]. A convolution operation is applied
to the input of each layer, mimicking the reaction of neurons to visual input. The CNN architecture
also includes pooling layers, which group the outputs of one layer into a single neuron in the next
layer [11], [20]. The clusters of neurons are designed as square batches of size $n \times n$,
where $n = 2, 3, 4, \dots$
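As a concrete illustration of how a pooling batch condenses a region of neurons into one output, the minimal NumPy sketch below applies 2×2 max pooling in the simplest non-overlapping case (n = 2 with stride 2); the input values are arbitrary:

```python
# Minimal illustration of 2x2 max pooling: each square batch of neurons
# is condensed into a single output (non-overlapping case, stride 2).
import numpy as np

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 0],
              [7, 2, 9, 8],
              [3, 1, 4, 6]])

# Reshape the 4x4 map into 2x2 blocks and keep the maximum of each block.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[6 4]
               #  [7 9]]
```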
In some cases, pooling batches need to be moved beyond the boundaries of a sample image, which
may cause ambiguity in the training process as well as computational and programmatic complexity.
Extending the image by several rows and columns of pixels to match the size of the pooling batches (padding)
helps to overcome this problem. The values used for the extra pixels may be chosen differently: the average
over the spectrum of pixel values (average padding) or zeros (zero padding). Denoting the filter size as $F$, the
input size as $W$, the resulting image size as $R$, the padding size as $P$, and the stride size as $S$, the size of
the sample after each pooling layer is given by (2), which applies to each of the two dimensions separately:
$$R = \begin{cases} \dfrac{W + 2P - F}{S}, & \text{if } S \mid (W + 2P - F) \\[6pt] \left\lfloor \dfrac{W + 2P - F}{S} \right\rfloor + 1, & \text{otherwise} \end{cases} \tag{2}$$
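For a quick sanity check, (2) can be evaluated with a few lines of code; the helper below is a sketch that implements the piecewise rule exactly as stated:

```python
# Output size after one convolution/pooling layer, per the piecewise rule (2):
# R = (W + 2P - F) / S when S divides (W + 2P - F), otherwise floor(...) + 1.
def output_size(W, F, P=0, S=1):
    span = W + 2 * P - F
    return span // S if span % S == 0 else span // S + 1

# Example: a 224-pixel input with a 5x5 filter, no padding, and stride 1
print(output_size(224, F=5, P=0, S=1))  # 219
```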
3. RESEARCH METHOD
Current experiments were based on the same data set and most preprocessing steps [9], [10]. Later,
the potential to highly increase the recognition rate of neural networks was explored [21]. Figure 2 shows the
development of the Khmer HTR framework. Data collection and preliminary experiments were completed in
our previous work [9], [10]. In preliminary experiments, the number of features was reduced by 90% using
three independent methods: correlation-based feature selection (CORR), two-dimensional Fourier transform
(FT2D, and Gabor filters (GF). The result of each method was classified with an artificial neural network
(ANN). The original data, without feature space transformation, was classified for comparison of
performance. Gabor Filters yielded the highest improvement in recognition. Such a fact suggested that filters
may play an important role in feature extraction. The current study is based on convolutional models, which
rely on a wider variety of filters. In the course of current work, Models LeNet-5, AlexNet, VGG16, VGG19,
ResNet50 have been modified for binary classification.
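For context on the Gabor filters mentioned above, the sketch below shows one plausible way to extract Gabor-filter features with OpenCV; the kernel parameters (size, sigma, wavelength, and orientations) are illustrative assumptions, not the values used in [10]:

```python
# Hedged sketch: Gabor-filter feature extraction for a grayscale character
# image. All kernel parameters here are assumed for illustration only.
import cv2
import numpy as np

def gabor_features(gray, thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Filter the image with a small bank of oriented Gabor kernels and
    stack the responses as a feature volume."""
    responses = []
    for theta in thetas:
        kernel = cv2.getGaborKernel(ksize=(21, 21), sigma=4.0, theta=theta,
                                    lambd=10.0, gamma=0.5, psi=0)
        responses.append(cv2.filter2D(gray, cv2.CV_32F, kernel))
    return np.stack(responses, axis=-1)
```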
To prevent overfitting, 50% of the nodes in the fully connected layer are dropped out at random.
ReLU is used as the activation function because it is simple to differentiate and behaves similarly to other
common activation functions. The hyper-parameters used in 2+1CNN are as follows (a code sketch follows the
list):
− Input images are pre-processed and resized to 224×224.
− First convolutional layer with ReLU as the activation function, 5×5 filters, and stride size 1.
− First pooling layer with 2×2 filters and stride size 1.
− Second convolutional layer with ReLU activation, 5×5 filters, and stride size 2.
− Second pooling layer with 2×2 filters and stride size 1.
− The dropout stage randomly erases 50% of the perceptrons to reduce overfitting.
− The fully connected layer is made of 463 perceptrons with the ReLU activation function. This number
is the average of the number of features and the number of samples, i.e., (number of features + number of
samples) / 2.
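The following Keras sketch assembles the layers listed above. The filter counts (32 and 64), the optimizer, and the loss are assumptions made for illustration; the list above does not specify them:

```python
# A minimal sketch of 2+1CNN following the listed hyper-parameters.
# Assumed (not stated in the list): 32/64 filters, Adam, binary cross-entropy.
from tensorflow import keras
from tensorflow.keras import layers

def build_2plus1cnn(input_shape=(224, 224, 1)):
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, kernel_size=5, strides=1, activation="relu"),
        layers.MaxPooling2D(pool_size=2, strides=1),
        layers.Conv2D(64, kernel_size=5, strides=2, activation="relu"),
        layers.MaxPooling2D(pool_size=2, strides=1),
        layers.Flatten(),
        layers.Dropout(0.5),                    # erase 50% of the perceptrons
        layers.Dense(463, activation="relu"),   # fully connected layer
        layers.Dense(1, activation="sigmoid"),  # binary output: c_j vs. c_j'
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```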
Table 2 illustrates the structure of 2+1CNN. The values of R, W, P, and F were obtained per (2). Filter
sizes were chosen to minimize the number of computations required during model training. Figure 3 visualizes a
sample as it traverses each layer of 2+1CNN; the represented layers are input, convolution, pooling, convolution,
and pooling. All other models used in this work (LeNet, AlexNet, VGG16, VGG19, and ResNet) were also
modified so that the number of output classes was reduced to two. This modification implements binary
classification under the adopted one-against-all tactic. While 2+1CNN was built from the ground up, transfer
learning was used to retrain the state-of-the-art models on Khmer samples. Because of the limited processing
power available and the large amount of data, training of all classifiers was limited to 500 iterations.
4. RESULTS AND DISCUSSION
Table 3 compares the hardware used in previous experiments to that of the current work. The overall
comparison of the models is given in Table 4. Table 5 compares the current work against previous attempts; it
highlights the progress in the field of handwritten text recognition for abugida writing systems, including Khmer.
In previous attempts, data was collected either by scanning printed text or by drawing with a computer mouse,
which makes it difficult to represent common handwriting. The results of the current HTR task were achieved
on a hardware system of lesser specifications.
5. CONCLUSION
This work aimed to develop a compact and effective model for offline recognition of Khmer
handwritten characters. In general, recognition rates came out at 93-98%. The 2+1CNN model was built from
the ground up and performed at over 94%, which is at the same level as other, more sophisticated models.
The results also help to close the research gap in the field since, at the time of the experiments, Khmer
HTR had not yet been approached with deep learning. The main contribution is a compact Khmer HTR
model (2+1CNN) with low computational requirements, which is based on open-source software and does
not require any proprietary packages. These aspects ease its implementation, therefore allowing swift
digitization of document corpora in rural and developing areas. The developed models may be applied in an
end-user OCR application targeted at the general public, as well as used as the back-end of more sophisticated
applications aiming to digitize documents. Further work may include recognition based on information about
the layout of documents, forms, and tables.
ACKNOWLEDGEMENTS
This work was partially funded by Universiti Teknologi Malaysia and the Ministry of Higher
Education Malaysia.
REFERENCES
[1] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition,"
in Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998, doi: 10.1109/5.726791.
[2] H. Meng and D. Morariu, "Khmer character recognition using artificial neural network," in Signal and Information
Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific, 2014, pp. 1-8, doi:
10.1109/APSIPA.2014.7041824.
[3] P. Sok and N. Taing, "Support Vector Machine (SVM) based classifier for Khmer Printed Character-set
Recognition," in Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014
Asia-Pacific, 2014, pp. 1-9, doi: 10.1109/APSIPA.2014.7041823.
[4] S. Srun, "Applying Backpropagation for Khmer Printing Character Recognition," Proceedings of Japan-Cambodia
Joint Symposium on Information Systems and Communication Technology 2011, Phnom Penh, 2011, pp. 135-136.
[5] S. Srun and U. Vishnyakov, "An Approach for Quality Enhancement of the Text Recognition," Intellectual CAD,
vol. 4, 2009.
[6] P. Thumwarin, S. Khem, K. Janchitraponvej, and T. Matsuura, "On-line writer dependent character recognition for
Khmer based on FIR system characterizing handwriting motion," 2008 SICE Annual Conference, 2008, pp. 73-78,
doi: 10.1109/SICE.2008.4654625.
[7] S. Srun, "Applying Tesseract for Khmer Optical Character Recognition," in ASEAN-UEC Symposium, 2015.
Y. K. Thu, O. Phavy, and Y. Urano, "Positional gesture for advanced smart terminals: Simple gesture text input for
syllabic scripts like Myanmar, Khmer and Bangla," in 2008 First ITU-T Kaleidoscope Academic Conference -
Innovations in NGN: Future Network and Services, 2008, pp. 77-84, doi: 10.1109/KINGN.2008.4542252.
[9] B. Annanurov and N. M. Noor, "Handwritten Khmer text recognition," in 2016 IEEE International WIE
Conference on Electrical and Computer Engineering (WIECON-ECE), 2016, pp. 176-179, doi: 10.1109/WIECON-
ECE.2016.8009112.
[10] B. Annanurov and N. M. Noor, "Feature selection for Khmer handwritten text recognition," in 2017 IEEE
Conference of Russian Young Researchers in Electrical and Electronic Engineering (EIConRus), 2017, pp. 626-
630, doi: 10.1109/EIConRus.2017.7910634.
[11] A. Krizhevsky, I. Sutskever, and G.E. Hinton, "ImageNet classification with deep convolutional neural networks,"
in Advances in Neural Information Processing Systems, 2012.
[12] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition,"
arXiv:1409.1556, 2014.
[13] Z. Y. He, “A New Feature Fusion Method for Handwritten Character Recognition Based on 3D
Accelerometer,” Applied Mechanics and Materials, vol. 44-47, pp. 1583–1587, 2010, doi:
10.4028/www.scientific.net/AMM.44-47.1583.
[14] V. Kruy and W. Kameyama, “Preliminary Experiment on Khmer OCR,” 8th International Conference of Frontiers
of Information Technology, 2010.
[15] S. Kheang, K. Katsurada, Y. Iribe, and T. Nitta, “Solving the Phoneme Conflict in Grapheme-to-Phoneme
Conversion Using a Two-Stage Neural Network-Based Approach,” IEICE Transactions on Information and
Systems, vol. E97.D, no. 4, pp. 901–910, 2014, doi: 10.1587/transinf.E97.D.901.
[16] Y. LeCun, "Deep learning & convolutional networks," in 2015 IEEE Hot Chips 27 Symposium (HCS), 2015, pp. 1-
95, doi: 10.1109/HOTCHIPS.2015.7477328.
[17] M. Matsugu, K. Mori, Y. Mitari, and Y. Kaneda, "Subject independent facial expression recognition with robust
face detection using a convolutional neural network," Neural Networks, vol. 16, no. 5-6, pp. 555-559, 2003, doi:
10.1016/S0893-6080(03)00115-1.
[18] A. Karpathy, "Connecting images and natural language," Ph.D. thesis, Dept. of Computer Science, Stanford
University, 2016.
[19] K. Gregor and Y. LeCun, "Emergence of Complex-Like Cells in a Temporal Product Network with Local
Receptive Fields," arXiv:1006.0448, 2010.
[20] D. C. Cireşan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber, "Flexible, high performance
convolutional neural networks for image classification," in IJCAI International Joint Conference on Artificial
Intelligence, 2011, doi: 10.5591/978-1-57735-516-8/IJCAI11-210.
[21] B. Annanurov and N.M. Noor, "Khmer handwritten text recognition with convolution neural networks," ARPN
Journal of Engineering and Applied Sciences, vol. 13, no. 22, pp. 8828-8833, 2018.
[22] R. Venkatesan and M. J. Er, “A novel progressive learning technique for multi-class classification,”
Neurocomputing, vol. 207, pp. 310–321, 2016, doi: 10.1016/j.neucom.2016.05.006.
[23] A. Krizhevsky, "Convolutional deep belief networks on cifar-10," in Unpublished manuscript, U.o. Toronto, Editor.
2010, Available: https://www.cs.toronto.edu/~kriz/conv-cifar10-aug2010.pdf.
[24] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in 2016 IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778, doi: 10.1109/CVPR.2016.90.
[25] J. Sueiras, V. Ruiz, A. Sanchez, and J. F. Velez, “Offline continuous handwriting recognition using sequence to
sequence neural networks,” Neurocomputing, vol. 289, pp. 119–128, 2018, doi: 10.1016/j.neucom.2018.02.008.
[26] M. Al Rabbani Alif, S. Ahmed, and M. A. Hasan, "Isolated Bangla handwritten character recognition with
convolutional neural network," in 2017 20th International Conference of Computer and Information Technology
(ICCIT), 2017, pp. 1-6, doi: 10.1109/ICCITECHN.2017.8281823.
[27] R. Zhang, Q. Wang, and Y. Lu, "Combination of ResNet and Center Loss Based Metric Learning for Handwritten
Chinese Character Recognition," in 2017 14th IAPR International Conference on Document Analysis and
Recognition (ICDAR), 2017, pp. 25-29, doi: 10.1109/ICDAR.2017.324.
[28] K. R. Ayyalasomayajula, F. Malmberg, and A. Brun, “PDNet: Semantic segmentation integrated with a primal-dual
network for document binarization,” Pattern Recognition Letters, vol. 121, pp. 52–60, 2019, doi:
10.1016/j.patrec.2018.05.011.
[29] D. Soselia, M. Tsintsadze, L. Shugliashvili, I. Koberidze, S. Amashukeli, and S. Jijavadze, “On Georgian
Handwritten Character Recognition,” IFAC-PapersOnLine, vol. 51, no. 30, pp. 161–165, 2018, doi:
10.1016/j.ifacol.2018.11.279.
[30] R. Sabzi et al., "Recognizing Persian handwritten words using deep convolutional networks," in 2017 Artificial
Intelligence and Signal Processing Conference (AISP), 2017, pp. 85-90, doi: 10.1109/AISP.2017.8324114.
BIOGRAPHIES OF AUTHORS
Dr. Bayram Annanurov completed his Ph.D. at the Universiti Teknologi Malaysia in 2016.
His main research area is Deep Learning. He is currently teaching programming and
optimization at Paragon International University in Phnom Penh, Cambodia.
Dr. Norliza Mohd Noor is a professor at Razak Faculty of Technology and Informatics,
Universiti Teknologi Malaysia, Kuala Lumpur Campus. Her research areas are image analysis
and machine learning.