Sakuldeep Singh et al/Indic Hand Written Script Identification Using Ensemble learning Soft Voting Classifier and Easy OCR
Abstract –
Despite decades of research on offline Indic script recognition, handwritten characters and numerals
remain challenging to read, owing to the close visual resemblance between characters and the pervasive
structural similarity of the Indic scripts. Machine learning-based techniques have achieved results on
handwritten Indic script identification that are comparable to those on other computer vision tasks,
even though the problem is still fairly recent. However, developing a handcrafted machine learning
model that is efficient for various Indian languages requires considerable trial and error and in-depth
knowledge of the problem. The search was streamlined using an evolving meta-heuristics approach, and
combining machine learning with EasyOcr improved both text extraction and language recognition.
Focusing on Hindi, Malayalam, Kannada, and Tamil, we propose different models, including an ensemble
learning Voting Classifier, a Multi-Layer Perceptron, and a Support Vector Machine, to detect the
languages present in images using the EasyOcr library. The models reach accuracies of 98.6% and 89.9%,
with a 100% detection and text extraction rate for Hindi, Kannada, Malayalam, and Tamil.
Keywords— Machine Learning, Voting Classifier, Ensemble learning, Ada Boost, Multi-Layer
Perceptron, Script Identification, Easy OCR.
DOI Number:10.48047/NQ.2022.20.21.NQ99127 Neuroquantology 2022; 20(21):1211-1221
what language the text was found in. Text detection is therefore an essential stage in the process of identifying scripts [5].
suggest two end-to-end (e2e) techniques for training the models simultaneously: multi-channel segmentation (MCS) and multi-channel mask (MCM). On the ICDAR MLT 2017 and MLe2e datasets, the results show that MCS outperforms current approaches, with recall values of 54.34% and 81.13%, respectively. MCM performs similarly to other state-of-the-art techniques [15].
Feurer et al. (2019) use a mixed-code dataset comprising Roman Urdu, Hindi, Saraiki, Bengali, and English to address the problem of mixed-script recognition. Several RNN variants with word vectorization are used to train the language recognition model. Optimal designs for the LSTM, Bidirectional LSTM, Gated Recurrent Unit (GRU), and Bidirectional Gated Recurrent Unit (Bi-GRU) architectures were developed through experimental research. By combining learnt word-class features with GloVe embeddings, the experiment achieved its highest accuracy, 90.17, with Bi-GRU. The study also addresses problems that arise in multilingual settings, such as phonetic typing, generative spelling, and the transliteration of Roman words into English characters [16].
Aniket et al. (2019) describe a proposed application that uses image processing and machine learning to identify and recognise Gujarati handwriting, and draw attention to the substantial machinery this process requires. The task is difficult because Gujarati contains curved characters and a wide variety of handwriting styles. The article discusses the entire process of character detection and identification, including image acquisition, preprocessing, segmentation, classification or recognition, and post-processing. It also highlights crucial elements such as designing a neural network appropriate for the difficult task of Gujarati handwritten script recognition, training and testing that model, and tuning numerous hyper-parameters to achieve the best accuracy. The article is aimed at enthusiasts and researchers interested in creating algorithms for Gujarati script recognition, and its goal is to clarify and illustrate the unique qualities of the Gujarati script [17].
Gomez et al. (2017) address script identification in scene text images. Modern CNN classifiers cannot account for the highly variable aspect ratios of scene text instances, which makes them difficult to apply. Instead of resizing input images to a fixed aspect ratio, as is typically done with holistic CNN classifiers, the authors propose a patch-based classification framework that preserves the discriminative parts of the input image that are characteristic of its class. They also propose a novel approach, based on ensembles of conjoined networks, for estimating the relative weights of the multiple stroke-part representations within the patch-based classification framework. On two existing script identification datasets, experiments with this learning technique show state-of-the-art results. In addition, the authors provide a new open benchmark dataset for evaluating end-to-end reading algorithms on multilingual scene text. Experiments with this dataset demonstrate an end-to-end system that combines the script identification technique with a previously published text detector and a commercially available OCR engine, and emphasise the crucial role that script identification plays in the system [18].
III. PROPOSED METHODOLOGY
eISSN1303-5150 www.neuroquantology.com
Neuroquantology | December 2022 | Volume 20 | Issue 21 | Page 1211-1221 | Doi:10.48047/NQ.2022.20.21.NQ99127
[Flowchart blocks: Label Encoding; Bag of Words; Ensemble Learning; Performance Evaluation; EasyOcr; Hand-written Text Extraction from Images and Convert into Strings; Detect and Classify Languages and Class]
Fig. 2 Proposed Flowchart.
This work discusses a machine learning pipeline for the visual recognition and categorization of handwritten Indian-language text or script, which uses both the handwritten text and image features. The design works on text data from four Indian languages (Hindi, Kannada, Tamil, and Malayalam). After the data has been gathered and cleaned, a label encoder converts the categorical features to numerical values, a count vectorizer is applied, and machine learning algorithms are trained: AdaBoost, SVM, and MLP [19], together with a voting classifier that combines four algorithms through ensemble learning [24]. After these steps, the effectiveness of the trained models is assessed on text data. The machine learning-based EasyOcr library is used to recognize the script from an image: it extracts the text from images, and the text is then sent to the trained machine learning models for detection and classification.
3.1.1 Data Collection
Because no single data set covered the different languages, data were gathered from various sources to create a new data set, so that models could be developed to recognize and categorize the intended languages. We gathered data for four Indian languages from public sources: the Kaggle, UCI, and Data World websites. The four languages chosen were Malayalam, Tamil, Kannada, and Hindi. Malayalam has 591 text samples, Tamil has 464, Kannada has 366, and Hindi has 62, of various lengths. This data can be loaded into a pandas data frame for study.
3.1.2 Pre-processing
The first preprocessing step was to find null and NaN values. Next, duplicate values in the text data were removed, and finally a clean text column was created, using Python's regular expression library to strip symbols, numbers, punctuation, keywords [25][26], links, and whitespace from the text. The cleaned text was then tokenized and count-vectorized, and a label encoder was applied [20][21]. Data were gathered from a variety of sources, with only Indian languages selected. The language labels must be transformed with the label encoder [27].
3.1.3 Data Splitting
The data were split in a 90:10 ratio: 90% is used for training and 10% for evaluation. Partitioning the data before it is passed to a machine learning (ML) model [31] helps guard against overfitting, which occurs when a model fits the training data so well that it cannot reliably fit new data [30].
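The image stage of the methodology, in which EasyOcr extracts text from an image and the text is passed to a trained classifier, might look like the sketch below. It assumes the `easyocr` package is installed. The helper names (`make_reader`, `extract_text`, `detect_language`), the image path, and the language codes in the usage note are hypothetical. The codes that EasyOCR supports, and which of them can be loaded together, should be checked against its documentation.

```python
def make_reader(lang_codes):
    """Build an EasyOCR reader for the given language codes.

    The import is done lazily so the helpers below can be exercised
    without the library (for example with a stub reader).
    """
    import easyocr  # third-party package: pip install easyocr

    return easyocr.Reader(lang_codes)


def extract_text(reader, image_path):
    """Return all text fragments recognised in the image as one string."""
    # detail=0 asks EasyOCR for plain strings instead of (box, text, score) tuples.
    fragments = reader.readtext(image_path, detail=0)
    return " ".join(fragments)


def detect_language(reader, classifier, label_encoder, image_path):
    """Extract text from an image and classify its language.

    `classifier` is any fitted model with predict() that accepts raw strings
    (for example a vectorizer + classifier pipeline); `label_encoder` maps the
    predicted class ids back to language names.
    """
    text = extract_text(reader, image_path)
    predicted = classifier.predict([text])
    return text, label_encoder.inverse_transform(predicted)[0]
```

A call could then take the form `detect_language(make_reader(["hi", "en"]), model, encoder, "sample.png")`, where `model` and `encoder` are a previously fitted text pipeline and label encoder, and `sample.png` is a placeholder path.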
the following four factors: TP, FP, TN, and FN, which are summarised below.
True Negative (TN): An outcome in which the model correctly predicts the absence of the target class.
True Positive (TP): An outcome in which the model correctly predicts the presence of the target class.
False Positive (FP): An outcome in which the model incorrectly predicts that the positive class is present.
False Negative (FN): An outcome in which the model incorrectly predicts the negative class, that is, it misses a positive instance.
1) Accuracy
Accuracy is the proportion of correctly classified data samples to the total number of data samples. It is obtained by dividing the number of correctly recognized samples, TP + TN (the main diagonal of the confusion matrix, CM), by the total number of samples.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)$$
2) Precision
Precision compares the true positives (TP) with all instances predicted as positive (TP + FP); it is obtained by dividing TP by the sum of both components.
$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (2)$$
3) Recall or Sensitivity
Recall is the proportion of all actual positive instances that the model identifies correctly. The denominator is therefore the total number of positive instances in the dataset (TP + FN). Put another way, it indicates how many positive instances the model missed relative to those it found.
$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (3)$$
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} = \mathrm{Recall} \qquad (4)$$
4) F-score
The F-score is the harmonic mean of precision and recall, so it takes both FN and FP into account. It is calculated as:
$$F\text{-}\mathrm{score} = \frac{2\,(\mathrm{Precision} \times \mathrm{Recall})}{\mathrm{Precision} + \mathrm{Recall}} \qquad (5)$$
5) Specificity
Specificity is the ratio of correctly identified negative instances to all negative instances in the dataset; the denominator is the total number of negative instances (TN + FP). It is analogous to recall, except that only the negative instances are considered, for example, how many healthy individuals are correctly identified as not having cancer. It is a measure of how well the classes are separated.
$$\mathrm{Specificity} = \frac{TN}{TN + FP} \qquad (6)$$
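The metrics in this section can be computed directly from the four counts. The sketch below uses the standard definitions of accuracy, precision, recall (sensitivity), F-score, and specificity. The function name, the zero-denominator convention (returning 0.0), and the example counts are assumptions made for illustration and are not results from the paper.

```python
def _ratio(numerator, denominator):
    """Safe division: return 0.0 when the denominator is zero (a convention of this sketch)."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Compute the evaluation metrics from TP, FP, TN, and FN counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)  # identical to sensitivity
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "sensitivity": recall,
        "f_score": _ratio(2 * precision * recall, precision + recall),
        "specificity": _ratio(tn, tn + fp),
    }


# Hypothetical counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=45, fp=5, tn=40, fn=10)
```

For these counts the accuracy is (45 + 40) / 100 = 0.85 and the precision is 45 / 50 = 0.9.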
Table 1. Performance Evaluation of Machine Learning Models
Table 1 shows the performance evaluation of the machine learning models. The Multi-Layer Perceptron achieved the best values for accuracy, F-score, precision, and recall, with a score of 0.986, compared with the Support Vector Machine and AdaBoost.
Figs. 4 to 6 show the confusion matrices for the machine learning models, which report accuracy together with predicted and actual values. AdaBoost and the Support Vector Machine are both less accurate than the Multi-Layer Perceptron, which has the highest accuracy at 0.987.
Table 2 shows that the Multi-Layer Perceptron, Naive Bayes, Logistic Regression, and the ensemble-learning Voting Classifier perform well. In terms of accuracy, precision, recall, test score, and F1 score (0.9865), the MLP, Naive Bayes, and the ensemble are the best.
Fig. 7 depicts the performance evaluation of the models, where accuracy is represented by blue, precision by red, recall by green, the F-score by blue, the test score by purple, specificity by orange, and sensitivity by navy blue. The greatest accuracy, precision, recall, and F-measure come from the Multi-Layer Perceptron, Naive Bayes, and ensemble models.
Fig. 8 displays the confusion matrix of the ensemble-learning Voting Classifier, along with its accuracy, predicted values, and actual values.
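For a four-language problem the confusion matrix is 4 x 4, and the per-class TP, FP, FN, and TN counts can be read off its diagonal, column sums, and row sums. The sketch below shows one way to do this with scikit-learn and NumPy. The label lists are hypothetical and are not the values shown in the paper's figures.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def per_class_counts(y_true, y_pred, labels):
    """Return the confusion matrix and per-class TP, FP, FN, TN arrays.

    In scikit-learn's convention, rows are actual classes and columns are
    predicted classes.
    """
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    tp = np.diag(cm)               # correctly predicted samples of each class
    fp = cm.sum(axis=0) - tp       # predicted as the class, but actually another
    fn = cm.sum(axis=1) - tp       # actually the class, but predicted as another
    tn = cm.sum() - (tp + fp + fn)  # everything else
    return cm, tp, fp, fn, tn


# Hypothetical labels, for illustration only.
LANGS = ["Hindi", "Kannada", "Malayalam", "Tamil"]
actual = ["Hindi", "Hindi", "Kannada", "Kannada",
          "Malayalam", "Malayalam", "Tamil", "Tamil"]
predicted = ["Hindi", "Kannada", "Kannada", "Kannada",
             "Malayalam", "Malayalam", "Tamil", "Malayalam"]

cm, tp, fp, fn, tn = per_class_counts(actual, predicted, LANGS)
overall_accuracy = np.trace(cm) / cm.sum()  # main diagonal over all samples
```

Per-class specificity then follows as `tn / (tn + fp)`, and per-class recall as `tp / (tp + fn)`.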
Fig. 9 shows the predicted languages and the extracted scripts for Hindi, Malayalam, and Kannada.
V. CONCLUSION
Despite decades of research on offline Indic script recognition, it is still difficult to recognize handwritten characters and numerals. This is due to the close visual resemblance between characters and the widespread structural similarity of the Indic scripts. Machine learning-based techniques have attained state-of-the-art results in handwritten Indic script identification, comparable to those on other computer vision tasks, even though the problem is still reasonably recent. However, creating a handcrafted machine learning model that is effective for different Indian languages takes a lot of trial and error and in-depth familiarity with the problem. The search was streamlined by using an evolving meta-heuristics strategy, and a solution was found. Focusing primarily on Hindi, Malayalam, Kannada, and Tamil, we proposed different models, including an ensemble-learning Voting Classifier, a Multi-Layer Perceptron, and a Support Vector Machine, to detect the languages present in images using the EasyOcr library, reaching accuracies of 98.6% and 89.9% with a 100% detection and text extraction rate. This contrasts with previous studies, which prioritized other particular languages rather than the Hindi, Malayalam, Kannada, and Tamil languages considered here.
References
[1] N. Saqib, K. F. Haque, V. P. Yanambaka,
[17] S. Aniket, R. Atharva, C. Prabha, D. Rupali, and P. Shubham, “Handwritten Gujarati script recognition with image processing and deep learning,” 2019 Int. Conf. Nascent Technol. Eng. ICNTE 2019 - Proc., no. Icnte, pp. 1–4, 2019, doi: 10.1109/ICNTE44896.2019.8946074.
[18] L. Gomez, A. Nicolaou, and D. Karatzas, “Improving patch-based scene text script identification with ensembles of conjoined networks,” Pattern Recognit., vol. 67, pp. 85–96, 2017, doi: 10.1016/j.patcog.2017.01.032.
[19] A. Bhat, V. Yadav, V. Dargan, and Yash, “Sign Language to Text Conversion using Deep Learning,” 2022 3rd Int. Conf. Emerg. Technol. INCET 2022, pp. 4036–4044, 2022, doi: 10.1109/INCET54531.2022.9824885.
[20] M. Das, M. Panda, and S. Dash, “Enhancing the Power of CNN Using Data Augmentation Techniques for Odia Handwritten Character Recognition,” Adv. Multimed., vol. 2022, 2022, doi: 10.1155/2022/6180701.
[21] A. AYVACI ERDOĞAN and A. E. TÜMER, “Deep Learning Method for Handwriting Recognition,” MANAS J. Eng., vol. 9, no. 1, pp. 85–92, 2021, doi: 10.51354/mjen.852312.
[22] A. Asokan and S. N Unnithan, “Offline Recognition of Malayalam and Kannada Handwritten Documents Using Deep Learning,” Int. J. Comput. Commun. Informatics, vol. 3, no. 2, pp. 12–24, 2021, doi: 10.34256/ijcci2122.
[23] B. Jose and K. P. Pushpalatha, “Intelligent Handwritten Character Recognition For Malayalam Scripts Using Deep Learning Approach,” IOP Conf. Ser. Mater. Sci. Eng., vol. 1085, no. 1, p. 012022, 2021, doi: 10.1088/1757-899x/1085/1/012022.
[24] D. C. Cireşan, U. Meier, L. M. Gambardella, and J. Schmidhuber, “Convolutional neural network committees for handwritten character classification,” Proc. Int. Conf. Doc. Anal. Recognition, ICDAR, vol. 10, pp. 1135–1139, 2011, doi: 10.1109/ICDAR.2011.229.
[25] A. Mirza and I. Siddiqi, “Recognition of cursive video text using a deep learning framework,” IET Image Process., vol. 14, no. 14, pp. 3444–3455, 2020, doi: 10.1049/iet-ipr.2019.1070.
[26] S. Aqab and M. U. Tariq, “Handwriting recognition using artificial intelligence neural network and image processing,” Int. J. Adv. Comput. Sci. Appl., vol. 11, no. 7, pp. 137–146, 2020, doi: 10.14569/IJACSA.2020.0110719.
[27] P. Thangamariappan and D. J. Miraclin Joyce Pamila, “Handwritten Recognition By Using Machine Learning Approach,” Int. J. Eng. Appl. Sci. Technol., vol. 04, no. 11, pp. 564–567, 2020, doi: 10.33564/ijeast.2020.v04i11.099.
[28] P. Sujatha and D. Lalitha Bhaskari, “Telugu and hindi script recognition using deep learning techniques,” Int. J. Innov. Technol. Explor. Eng., vol. 8, no. 11, pp. 1758–1764, 2019, doi: 10.35940/ijitee.K1755.0981119.
[29] C. Science and K. Dutta, “Handwritten Word Recognition for Indic & Latin scripts using Deep CNN-RNN Hybrid Networks,” no. March, 2019.
[30] S. Susan and J. Malhotra, “Recognising devanagari script by deep structure learning of image quadrants,” DESIDOC J. Libr. Inf. Technol., vol. 40, no. 5, pp. 268–271, 2020, doi: 10.14429/djlit.40.5.16336.
[31] S. Ali, Z. Shaukat, M. Azeem, Z. Sakhawat, T. Mahmood, and K. ur Rehman, “An efficient and improved scheme for handwritten digit recognition based on convolutional neural network,” SN Appl. Sci., vol. 1, no. 9, pp. 1–9, 2019, doi: 10.1007/s42452-019-1161-5.