
Hindi Script Refinement & Improved OCR
BS-2342 Satyaprakash Sahdev Jaiswar
BS-2344 Saurabh Bhikan Nankar
BS-2308 Aman Singh
BS-2353 Subhadip Roy
BS-2317 Aryan Sahu
BS-2335 Pallav Sahay
May 12, 2024

Figure 1:
1 Abstract
Hindi, one of the most widely spoken languages globally, presents unique challenges for
Optical Character Recognition (OCR) due to its complex script. This project addresses
the issue by proposing a new set of 60 characters optimized for ease of writing and me-
chanical differentiation, enhancing OCR efficiency. Inspired by simple shapes, limited
complexity, and consistent distinctive features, these characters aim to streamline both
manual writing and automated recognition processes. Leveraging the K-nearest neighbors
(KNN) classification algorithm, a model is trained on a dataset comprising handwritten
characters contributed by the project team. Preprocessing and segmentation utilize es-
tablished algorithms within the Python OpenCV module. The resulting OCR applica-
tion demonstrates improved accuracy and usability for Hindi text recognition, marking
a significant advancement in bridging the gap between manual and digital Hindi writing
systems.

2 Introduction
The Hindi language, one of the most widely spoken languages globally, poses unique chal-
lenges for optical character recognition (OCR) due to its complex script. While Hindi
is rich in cultural heritage and widely used in literature, administration, education, and
everyday communication, its intricacies present obstacles for mechanical interpretation.

In this project, we address the challenges of OCR for Hindi by proposing a novel set
of characters designed to enhance ease of writing and mechanical distinguishability. Rec-
ognizing the significance of simplicity in character design, we meticulously crafted a set
of 60 symbols tailored to streamline both writing and OCR processes for Hindi.

Our approach aimed to bridge the gap between traditional script complexity and mod-
ern computational requirements. By introducing characters that are intuitive to write and
easily discernible by machines, we sought to revolutionize the OCR landscape for Hindi
language applications.

The culmination of our efforts resulted in the development of an OCR application
specifically tailored for the newly designed Hindi characters. This application promises to
significantly enhance the efficiency and accuracy of Hindi text recognition, opening doors
to a myriad of applications across various domains.

Through this project, we aim to contribute to the advancement of Hindi language
technology and facilitate greater accessibility and usability of Hindi content in the digital
age.

PART 1: Character Design


The Devanagari script, used for writing Hindi among other languages, presents challenges
in both writing and recognition due to its intricate characters and writing rules.
Traditionally, writing Hindi involves sketching a guiding line followed by the characters,
often resulting in curvy and complex shapes that require refinement.

To address these challenges, our design considerations for creating a new set of Hindi
characters were as follows:
1. Simple Shapes: We aimed to utilize basic geometric shapes such as lines, circles,
squares, and triangles, as well as their combinations. This simplification minimizes
the complexity of characters, making them easier to distinguish in scans and write
by hand.
2. Distinctive Features: Each symbol was designed with distinctive features to
ensure easy differentiation from others. Variations in shape, size, orientation, or
strokes were incorporated to enhance readability and recognition.
3. Consistency: Consistency in design across the alphabet was crucial. We main-
tained uniform stroke width and proportions to ensure that each symbol is unique
and recognizable, aiding both writing and scanning processes.
4. Limited Complexity: Keeping the overall complexity of the alphabet low was
essential for ease of learning and memorization. This approach also contributes to
improved accuracy in both writing and scanning.
5. Legibility: We prioritized legibility by avoiding symbols that are too similar or
easily confused. Clear differentiation between characters is vital for accurate writing
and scanning.
6. Test and Refine: The alphabet underwent rigorous testing with various hand-
writing styles and scanning methods to validate its efficacy. Feedback and usability
testing guided continuous refinement to meet desired criteria.
7. Size Consistency: Maintaining a consistent size for all symbols was crucial to
prevent confusion during interpretation and ensure uniformity across the alphabet.

Figure 2: Prototype Characters for Hindi Language

After initial prototyping and refinement, we arrived at our final set of characters.
Some unsuitable characters were discarded during the iterative design process, and the
finalized characters represent a significant improvement in both ease of writing and
mechanical differentiation.
Next, we proceed to create a training dataset and develop an OCR application tailored
to these newly designed Hindi characters.

Figure 3: Final Characters

Part 2: Training OCR Model


After finalizing the characters, we created a training dataset by having many people
write each character multiple times, capturing a range of handwriting styles and variations.
This dataset serves as the foundation for training our OCR model. By exposing the model
to a wide range of handwriting variations, we aim to improve its ability to recognize
characters in real-world scenarios. In the next phase, we use this dataset to train our
OCR model, focusing on preprocessing, feature selection, and algorithm optimization to
improve accuracy in recognizing handwritten text.
Note: We utilized the Python OpenCV module to execute various preprocessing and
segmentation tasks within the OCR model. This choice was made due to its robust
functionality and efficiency in handling image processing operations.

I: Preprocessing[1] Function
After finalizing the design of the characters, we proceeded with the development of the
OCR application. The first step in this process involved creating a preprocessing function.
This function is crucial as it refines the input image to contain only the necessary features
required for OCR. The preprocessing function consists of the following steps:

1. Resize: Resizing the image ensures uniformity in processing and reduces computa-
tional complexity.
2. Adjust Brightness and Contrast: Enhancing brightness and contrast improves
image quality, aiding in text legibility.
3. De-skew: Correcting the skew of the image aligns the text horizontally, facilitating
character recognition.

4. Thresholding: The image is first converted to greyscale, which removes color
information while preserving essential text features. It is then converted to a binary
(black-and-white) image, which separates the text from the background and
simplifies further processing. This step uses established image thresholding
techniques such as Otsu thresholding [2].

5. Noise Removal: Removing noise improves OCR accuracy by eliminating irrelevant
details.

6. Erosion and Dilation [3]: Morphological operations shrink or expand text regions
to improve text connectivity and remove small artifacts.
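
The thresholding step (item 4 above) can be illustrated with a small sketch of Otsu's method [2]. This is a plain-Python illustration rather than the project's code: with OpenCV the same result is normally obtained from `cv2.threshold` with the `cv2.THRESH_OTSU` flag. The function names `otsu_threshold` and `binarize` are hypothetical, and the image is assumed to be a 2-D list of grey levels in the range 0 to 255.

```python
def otsu_threshold(pixels):
    """Return the grey level t (0-255) that maximizes between-class variance.

    Pixels <= t form the dark class and pixels > t the bright class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of grey levels in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance is undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(image, t):
    """Map grey levels to 0/1, with 1 marking dark (ink) pixels."""
    return [[1 if p <= t else 0 for p in row] for row in image]


# Toy 3x3 image (assumed values): dark ink on a bright background.
image = [[200, 210, 200],
         [20, 30, 205],
         [210, 25, 200]]
t = otsu_threshold([p for row in image for p in row])
binary = binarize(image, t)
```

Marking ink as 1 is a choice made here so that later steps can treat non-zero pixels as text.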

These preprocessing steps collectively prepare the image for OCR, enhancing text
visibility and reducing background noise to improve character recognition accuracy.
Next, we proceed to implement the OCR algorithm and integrate it into the application
workflow.
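
Erosion and dilation (item 6 of the preprocessing steps) can be sketched on a binary image as follows. This is an illustrative plain-Python version with a fixed 3x3 structuring element; OpenCV provides `cv2.erode` and `cv2.dilate` for the same purpose. The helper names are hypothetical, and treating pixels outside the image as background is an assumption of this sketch.

```python
def _window(img, r, c):
    """Yield the 3x3 neighbourhood of (r, c); pixels outside the image count as 0."""
    rows, cols = len(img), len(img[0])
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            yield img[rr][cc] if 0 <= rr < rows and 0 <= cc < cols else 0


def erode(img):
    """Keep a pixel only if its whole 3x3 neighbourhood is ink (shrinks regions)."""
    return [[1 if all(_window(img, r, c)) else 0
             for c in range(len(img[0]))] for r in range(len(img))]


def dilate(img):
    """Set a pixel if any pixel in its 3x3 neighbourhood is ink (grows regions)."""
    return [[1 if any(_window(img, r, c)) else 0
             for c in range(len(img[0]))] for r in range(len(img))]


def opening(img):
    """Erosion followed by dilation: removes specks smaller than the 3x3 element."""
    return dilate(erode(img))
```

Applying `opening` removes isolated specks while restoring the size of larger strokes; applying dilation before erosion (a closing) instead bridges small gaps within a stroke.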

II: Segmentation Function


Following preprocessing, the next step in the OCR pipeline is segmentation, which involves
identifying individual characters in the image. This is achieved using a segmentation
function that utilizes contour detection.
Contours are essentially outlines or boundaries of objects in an image. In the context
of OCR, contours represent the boundaries of individual characters. They are calcu-
lated using algorithms such as the Suzuki algorithm [4] or the Moore-Neighbor Tracing
algorithm.
Here’s how contours are calculated and how they help in finding characters in the
binarized image:

1. Contour Detection: Contours, represented as continuous lines or curves, define
the complete boundary of an object. Contour detection algorithms analyze the
binarized image to identify connected regions of similar intensity. These regions
correspond to the boundaries of characters in the image.

2. Bounding Boxes: Once contours are detected, bounding boxes are drawn around
them to encapsulate individual characters. These bounding boxes serve as regions
of interest for character extraction.

3. Character Extraction: The segmented characters are extracted from the original
image based on the bounding boxes. Each bounding box represents a separate
character that can be isolated for further processing.

By leveraging contour detection and bounding box techniques, the segmentation func-
tion effectively identifies and isolates individual characters in the binarized image. This
step is crucial for creating the training dataset required for training the OCR model.
Next, we proceed to extract characters from the segmented regions and prepare the
training data for the OCR model.
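
The three segmentation steps can be sketched as follows. Note that this sketch swaps in a simpler technique: it groups ink pixels into connected components with a breadth-first flood fill instead of tracing contours with the Suzuki algorithm. With OpenCV, `cv2.findContours` followed by `cv2.boundingRect` would play the same role. The function names are hypothetical, and the input is assumed to be a 2-D list in which 1 marks ink.

```python
from collections import deque


def bounding_boxes(binary):
    """Return (x, y, width, height) for each 8-connected ink region, sorted left to right."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Flood-fill one region, tracking its extreme coordinates.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, bottom, left, right = r, r, c, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((left, top, right - left + 1, bottom - top + 1))
    return sorted(boxes)


def crop(binary, box):
    """Extract the sub-image covered by one bounding box."""
    x, y, w, h = box
    return [row[x:x + w] for row in binary[y:y + h]]
```

One limitation of any component-based approach is that a character drawn with disconnected strokes yields several boxes, which then need to be merged before classification.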

III: Training using KNN
In the training phase of the OCR pipeline, the K-nearest neighbors (KNN) algorithm is
employed for classification. KNN is a simple and intuitive machine learning algorithm
used for classification tasks. It operates on the principle of proximity, where an unknown
sample is classified based on the majority class of its K nearest neighbors in the feature
space.

SMOTE (Synthetic Minority Over-sampling Technique):


SMOTE [5] is a technique used to address class imbalance in the training dataset. It syn-
thesizes new minority class samples by interpolating between existing samples, thereby
balancing the class distribution. This is essential for improving the performance of the
classifier, especially in scenarios where one class is significantly underrepresented.
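
The interpolation at the heart of SMOTE can be sketched in a few lines. This is a simplified illustration with hypothetical function names, not the project's code; in practice a library implementation such as the `SMOTE` class in imbalanced-learn would typically be used. Feature vectors are assumed to be lists of floats, and at least two minority samples are required.

```python
import math
import random


def smote_sample(minority, k=2, rng=random):
    """Create one synthetic sample on the segment between a minority
    sample and one of its k nearest minority neighbours."""
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment, in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]


def oversample(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority samples with a fixed seed."""
    rng = random.Random(seed)
    return [smote_sample(minority, k, rng) for _ in range(n_new)]
```

Because every synthetic point lies between two real samples of the same class, the new samples stay inside the region already occupied by that class.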

Histogram of Oriented Gradients (HOG):


The Histogram of Oriented Gradients (HOG) [6] is a feature descriptor used in computer
vision to extract shape information from images. It operates by calculating the
distribution of gradient orientations in localized regions of an image. Gradients
represent intensity changes across neighboring pixels, computed using gradient filters
along the x and y directions. The image is divided into small cells, each covering a
specific region. Within each cell, a histogram of gradient orientations is computed by
quantizing orientations into discrete bins and accumulating the gradient magnitude of
each pixel in the bin of its orientation. Cells are usually grouped into larger,
overlapping blocks within which the histograms are normalized. Finally, the normalized
histograms are concatenated to form a feature vector representing the image. [7] This
vector encodes the distribution of gradient orientations across the image, providing a
compact representation of shape and texture characteristics.
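
The per-cell histogram computation can be sketched as follows. This is a deliberately simplified illustration: it normalizes each cell on its own instead of using overlapping blocks, and it assigns each pixel to a single bin without interpolation. The function name `hog_cells`, the cell size of 4 pixels, and the 9 bins are assumptions of the sketch, not values taken from the project.

```python
import math


def hog_cells(image, cell=4, bins=9):
    """Return concatenated, L2-normalized orientation histograms, one per cell.

    image is a 2-D list of grey levels; orientations are unsigned (0-180 degrees).
    """
    rows, cols = len(image), len(image[0])
    bin_width = 180.0 / bins
    features = []
    for top in range(0, rows - cell + 1, cell):
        for left in range(0, cols - cell + 1, cell):
            hist = [0.0] * bins
            for r in range(top, top + cell):
                for c in range(left, left + cell):
                    # Central differences, clamped at the image border.
                    gx = image[r][min(c + 1, cols - 1)] - image[r][max(c - 1, 0)]
                    gy = image[min(r + 1, rows - 1)][c] - image[max(r - 1, 0)][c]
                    magnitude = math.hypot(gx, gy)
                    angle = math.degrees(math.atan2(gy, gx)) % 180.0
                    # Each pixel votes with its gradient magnitude.
                    hist[int(angle // bin_width) % bins] += magnitude
            norm = math.sqrt(sum(v * v for v in hist)) or 1.0
            features.extend(v / norm for v in hist)
    return features
```

For an 8x8 image with these settings the result has 2 x 2 cells x 9 bins = 36 values, and a purely vertical edge places all of a cell's weight in the first bin.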

Image Space:
In the context of OCR, the image space refers to the feature space where images are
represented as vectors of features. Each feature represents a characteristic of the image
that aids in classification.

Using KNN to Classify Images:
K-Nearest Neighbors (KNN) [8] is a simple yet effective algorithm used for both classifi-
cation and regression tasks. In the context of image classification, KNN operates by first
extracting relevant features from images, such as the Histogram of Oriented Gradients
(HOG), which captures the distribution of gradient orientations in an image.

Once the HOG features are extracted, each image is represented as a vector in a high-
dimensional feature space. During the classification phase, KNN calculates the distance
between the test image and all training images in this feature space. It then assigns the
test image to the majority class of its K nearest neighbors, where K is a predefined pa-
rameter.

In simpler terms, KNN classifies an unknown image by comparing it to the labeled
images in its training set. The algorithm finds the K images in the training set that are
closest to the unknown image based on some distance metric (e.g., Euclidean distance),
and then assigns the unknown image to the class that is most common among its K
nearest neighbors.

This approach is intuitive and often effective, especially when dealing with relatively
small datasets or when the decision boundaries between classes are well-defined. However,
it can be computationally expensive, especially as the size of the training set grows, since
it requires calculating distances between the test image and all training images. Addi-
tionally, the performance of KNN can be sensitive to the choice of distance metric and
the value of K.
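
The classification rule described above fits in a few lines. This is a minimal sketch using Euclidean distance and a brute-force search; the function name `knn_predict`, the two-dimensional vectors, and the labels are made up for illustration, whereas real inputs would be HOG feature vectors.

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=3):
    """Return the majority label among the k training vectors closest to query."""
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated classes.
vectors = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
labels = ["ka", "ka", "ka", "kha", "kha", "kha"]
prediction = knn_predict(vectors, labels, [0.5, 0.5])
```

The sort over the whole training set makes the cost of each prediction grow with the number of training images, which is the computational drawback noted above.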

Joblib File:
Joblib is a library in Python used for serialization and deserialization of Python objects.
It is particularly useful for saving machine learning models to disk, allowing for easy re-
trieval and reuse. The trained KNN model is dumped to a joblib file for later use in
predicting characters from new images.

Next, we proceed to implement the trained KNN model in the OCR application for
character recognition.

PART 3: OCR Application


After training the OCR model and saving it to a joblib file, we developed a simple Python
program for character recognition using the 'model.joblib' file. The application takes an
input image and generates two output images:

1. Output Image with Predicted Characters: This image replaces the characters in
the original input image with the predicted characters obtained from the OCR model.

2. Output Image with Bounding Boxes: This image overlays bounding boxes around
each identified character in the input image and provides the respective Hindi character
below each box in text format.

The OCR application utilizes the previously developed functions and codes for pre-
processing, segmentation, and character recognition.
Below are the results of evaluating the OCR model using the confusion matrix and clas-
sification report:

Confusion Matrix:
Classification Report:
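
Both summaries are derived from pairs of true and predicted labels. The sketch below shows one way to compute them; the function names are hypothetical and the labels in the example are made up for illustration, so they are not the project's results.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """matrix[i][j] counts samples of true class labels[i] predicted as labels[j]."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def precision_recall(matrix, index):
    """Precision and recall for the class at position index."""
    true_positive = matrix[index][index]
    predicted = sum(row[index] for row in matrix)   # column total
    actual = sum(matrix[index])                     # row total
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    return precision, recall


# Made-up example labels, for illustration only.
labels = ["ka", "kha"]
y_true = ["ka", "ka", "ka", "kha", "kha"]
y_pred = ["ka", "ka", "kha", "kha", "kha"]
matrix = confusion_matrix(y_true, y_pred, labels)
```

Off-diagonal entries show which pairs of characters the classifier confuses, which is useful feedback for the character design itself.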

3 Results
Now, we provide some sample outputs of the program for reference:
These outputs show the OCR application recognizing characters from input images with
moderate accuracy.
Next, we plan to further enhance the OCR application by incorporating additional
features and improving its usability.

Preprocessing Note
Depending on the input image, it may be necessary to adjust various parameters, as well
as the sequence or selection of sub-functions in the preprocessing phase, to obtain an
optimized preprocessed image for character recognition. Additionally, since we have
designed characters corresponding to the letters of the Hindi alphabet, we encounter the
challenge of dealing with Hindi matras (dependent vowel signs). To address this, we have
employed the concept of "varn vichhed" (वर्ण-विच्छेद), the decomposition of a word into its
constituent letters. For instance:

1. Combining अ + +आ=अ ा.

2. Combining ग + उ + ल + आ + ब + अ = गुलाब.

3. Combining च + आँ + द + अ + न + ई = चाँदनी.
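
Going in the other direction, a sequence of recognized letters can be recombined into a Devanagari word by turning each vowel that follows a consonant into its matra. The sketch below is an assumption about how this could be done, not the project's code: the function name `compose` and the vowel table are illustrative, the table covers only the common vowels, and inserting a halant between two consecutive consonants is a choice made here to form conjuncts.

```python
# Independent vowel -> dependent sign (matra). The vowel "अ" maps to an empty
# string because it is inherent in every consonant.
MATRA = {
    "अ": "", "आ": "ा", "इ": "ि", "ई": "ी", "उ": "ु", "ऊ": "ू",
    "ए": "े", "ऐ": "ै", "ओ": "ो", "औ": "ौ",
}
HALANT = "\u094d"  # suppresses the inherent vowel of the preceding consonant


def compose(units):
    """Join a varn-vichhed sequence such as ["ग", "उ", "ल", "आ", "ब", "अ"] into a word."""
    out = []
    after_consonant = False
    for unit in units:
        head, tail = unit[0], unit[1:]  # tail keeps marks such as chandrabindu
        if head in MATRA:
            # A vowel after a consonant becomes a matra; otherwise it stays independent.
            out.append((MATRA[head] if after_consonant else head) + tail)
            after_consonant = False
        else:
            if after_consonant:
                out.append(HALANT)  # assumed: consonant + consonant forms a conjunct
            out.append(unit)
            after_consonant = True
    return "".join(out)
```

For the second example above, `compose(["ग", "उ", "ल", "आ", "ब", "अ"])` yields गुलाब.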

4 Conclusions
In this project, we embarked on a journey to enhance the Optical Character Recog-
nition (OCR) capabilities for the Hindi language. Recognizing the challenges posed
by the intricate Devanagari script, we began by redesigning a set of characters
optimized for ease of writing and mechanical differentiation. Through careful con-
sideration of design principles such as simplicity, distinctiveness, and consistency,
we developed a new set of characters that streamline both manual writing and au-
tomated recognition processes.

With our characters finalized, we proceeded to create a comprehensive training
dataset by capturing diverse handwriting styles and variations for each character.
This dataset served as the foundation for training our OCR model, aiming to achieve
robustness and accuracy in recognizing handwritten Hindi text. Leveraging prepro-
cessing techniques, feature selection, and algorithm optimization, we trained our
model to accurately classify and recognize characters across various contexts and
scenarios.

Throughout the project, our focus remained on bridging the gap between manual
and digital Hindi writing systems. By developing an OCR model tailored to our
newly designed characters, we aimed to enable seamless and accurate recognition of
handwritten Hindi text, thereby facilitating digital transformation and accessibility
in Hindi language processing.

In conclusion, our project represents a step forward in enhancing OCR capabilities
for the Hindi language, contributing to advancements in digital literacy and
accessibility for Hindi speakers worldwide. As we continue to refine and optimize our OCR
model, we envision a future where handwritten Hindi text can be effortlessly and
accurately digitized, opening up new opportunities for communication, education,
and information dissemination.

Further Improvements
While our current OCR model shows promising results, there are several avenues
for further improvement and enhancement:

(a) Improving Accuracy: While training models for character recognition, it’s
essential to consider the limitations of creating a dataset with limited variation.
When the training dataset contains only a small number of images for each
character, the model may struggle to generalize effectively to unseen data,
leading to lower accuracy and suboptimal predictions.
To address this limitation, it’s beneficial to increase the variation in the training
dataset. By including more than 1000 images for each character and ensuring
a diverse range of samples, we can provide the model with a richer set of
features to learn from. This increased variation enables the model, particularly
K-Nearest Neighbors (KNN), to better capture the underlying patterns in the
data, resulting in higher accuracy and more reliable predictions.
Therefore, to achieve higher accuracy and improve the quality of predictions in
character recognition tasks, it’s advisable to augment the training dataset with
a significant number of images for each character, ensuring sufficient variation
and diversity.
(b) Exploration of Different Classification Algorithms: While the K-nearest
neighbors (KNN) algorithm was chosen for its simplicity and effectiveness,
exploring other classification algorithms such as Support Vector Machines
(SVM), Random Forests, or Deep Learning models like Convolutional Neural
Networks (CNNs) could lead to improved accuracy and performance. CNNs,
in particular, have shown remarkable success in image classification tasks due
to their ability to learn hierarchical features directly from pixel data.
(c) Incorporation of Line and Word Detection: By implementing techniques
[9] such as vertical and horizontal histogram projections, we can enhance our
OCR system to detect lines and words within the text. This would allow for
more accurate segmentation and recognition of individual words, improving
overall accuracy and readability of the extracted text.
(d) Language Formatting and Spell Checking: Integration of language for-
matting and spell checking functionalities would enable the OCR system to
output words and sentences in the correct language format. Additionally, in-
corporating a dictionary to identify and correct misspelled words or words
incorrectly identified by the OCR would further enhance the accuracy and
usability of the system.
(e) Utilization of Designed Characters for Faster and Easier Writing:
The newly designed characters aim to make Hindi writing faster, easier, and
more distinguishable. By incorporating these characters into our OCR system,
we can streamline the process of Hindi text input and make it more accessible
to a wider audience. Additionally, we could explore creating a simpler cursive
style for writing characters, facilitating quicker handwriting.
(f) Extension to Different Languages and Fonts: Finally, the OCR system
can be extended to support different languages and fonts, allowing for broader
applicability and usability. By adapting the system to recognize characters
from various languages and fonts, we can cater to diverse user needs and facil-
itate text recognition in multilingual environments.
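
The projection technique mentioned in item (c) can be sketched as follows. Summing ink pixels along each row gives a profile whose non-zero runs are text lines; the same idea applied to columns within a line gives characters, and merging runs separated by small gaps gives words. The function names and the `min_gap` parameter are hypothetical, and the input is assumed to be a binary image in which 1 marks ink.

```python
def runs(profile):
    """Return (start, end) pairs, end exclusive, of consecutive non-zero entries."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def find_lines(binary):
    """Text lines are runs of rows that contain ink (horizontal projection)."""
    return runs([sum(row) for row in binary])


def find_columns(binary, line):
    """Character spans within one line (vertical projection over that line's rows)."""
    top, bottom = line
    width = len(binary[0])
    return runs([sum(binary[r][c] for r in range(top, bottom)) for c in range(width)])


def merge_close(spans, min_gap):
    """Merge spans separated by fewer than min_gap blank columns into words."""
    merged = []
    for start, end in spans:
        if merged and start - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged
```

A suitable `min_gap` depends on the writing size and would need to be tuned, for example relative to the median character width.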

Incorporating these improvements would not only enhance the performance and
accuracy of our OCR system but also extend its usability and applicability to a
wider range of scenarios and languages. By continuously refining and evolving the
system, we can pave the way for advancements in text recognition technology and
contribute to the accessibility and digital transformation of linguistic diversity.
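
As an illustration of the dictionary-based correction suggested in item (d), the sketch below replaces a recognized word with its closest dictionary entry using Python's standard `difflib` module. The function name `correct`, the similarity cutoff of 0.6, and the word list are assumptions; a real system would need a full Hindi lexicon.

```python
import difflib


def correct(word, dictionary, cutoff=0.6):
    """Return word if it is in the dictionary, else the closest entry, else word unchanged."""
    if word in dictionary:
        return word
    matches = difflib.get_close_matches(word, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


# Hypothetical word list; "गुलाव" stands for an OCR output with one wrong letter.
words = ["गुलाब", "चाँदनी", "किताब"]
fixed = correct("गुलाव", words)
```

The comparison works on Unicode code points, so a wrong matra and a wrong consonant both count as single-character differences.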

5 References

[1] GitHub : OCR Python Textbook (pre-processing).
[2] Wikipedia : Otsu’s Method for thresholding.
[3] Towards Data Science : Introduction to Image Processing with Python (dilation
and erosion).
[4] Medium : Suzuki’s Contour Tracing Algorithm.
[5] Medium : Synthetic Minority Over-sampling Technique (SMOTE).
[6] Medium : Introduction to Histogram of Oriented Gradients (HOG).
[7] Medium : OCR with Machine Learning.
[8] Towards Data Science : Image Classification with K-Nearest Neighbors.
[9] Towards Data Science : Segmentation in OCR.
