
Data Entry Through OCR - A Case Study of Digitizing Examination Marks from Paper Marksheets

This document discusses the challenges of digitizing handwritten examination marks in Bangladesh, highlighting the inefficiencies and errors associated with manual transcription. It proposes the development of an OCR-based system specifically designed to recognize handwritten Bangla numerals, utilizing CNN models for improved accuracy. The methodology includes evaluating existing OCR tools, training on a comprehensive dataset, and implementing image processing techniques to enhance digit recognition performance.

Uploaded by

voccubd

Data Entry Through OCR - A Case Study of Digitizing Examination Marks from Paper Marksheets

Background

In government institutions across Bangladesh, particularly during recruitment
processes, handwritten documentation remains a standard practice for recording
marks. Each recruitment cycle involves viva voce examinations, for which
thousands, and sometimes lakhs, of candidates apply and thousands appear before
interview boards. These boards typically consist of multiple evaluators, each of
whom assigns marks using handwritten Bangla numerals on paper-based mark
sheets. These handwritten scores are then manually transcribed into digital
formats for record-keeping and further evaluation. This manual process not only
consumes substantial administrative time and effort but also introduces a
significant risk of human error, particularly under high workloads and tight
deadlines.

Optical Character Recognition (OCR) technologies have matured for printed and
Latin-script text (Smith, 2007), but the recognition of handwritten Bangla
numerals remains a relatively under-explored and challenging domain. Widely used
OCR engines such as Tesseract perform well on printed text but have not
demonstrated high accuracy on handwritten Bangla numerals. Specialized CNN
models, by contrast, can achieve high accuracy in recognizing single handwritten
Bangla digits, even in noisy images. Given that viva marks are typically one- or
two-digit numbers, there is scope for developing a CNN-based OCR system
specifically optimized for recognizing two-digit handwritten Bangla numerals
from scanned interview mark sheets. Such an automated digitization process would
minimize transcription errors and reduce the administrative burden in
recruitment workflows, with the potential to significantly improve efficiency
and accuracy in government recruitment and other examination procedures.

Word count: 254


Objective

The principal objective of this project is to create an OCR-based system capable
of extracting data from images of handwritten mark sheets. The specific objectives
are:

● To evaluate the efficacy of open-source OCR tools in extracting marks from
Bangla handwritten marksheets.

● To read multi-digit Bangla numbers using image processing and single-digit
Bangla number detector models.

● To compare different methods for drawing bounding boxes around each digit
of a multi-digit number.

Word count: 69

Expected Outcome

The expected outcomes of this project are:

● A functional OCR-based system specifically designed for recognizing
handwritten Bangla digits from scanned mark sheets.

● Accurate recognition of multi-digit Bangla numbers, especially two-digit
marks, using image segmentation and digit classification models.

● A comparative evaluation of multiple bounding box strategies to improve
digit segmentation accuracy.

● High performance and reliability when tested on real-world handwritten
mark sheet samples, reflecting practical usage scenarios.

Word count: 67

Methodology

● Extracting the image slices that contain the numbers to be read.
● Validating existing OCR systems on these cropped number images.
● Preliminary results have shown that these systems do not perform adequately.

Our research addresses the challenge of recognizing handwritten Bangla digits
from examination papers through a structured approach. The methodology
comprises several sequential steps:

First, we evaluated existing OCR systems (Tesseract, EasyOCR) on cropped
number images from sample examination papers. Preliminary results showed
inadequate performance, primarily due to the high variability in individual
handwriting styles and the absence of robust, annotated datasets for multi-digit
handwritten Bangla numerals.
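
One way to score an OCR engine on the cropped number images is exact-match
accuracy after normalizing Bangla digits to ASCII. The sketch below is an
assumption about how such a comparison could be set up, not the evaluation
actually used in this study; the helper names and the sample strings are
hypothetical, and the OCR outputs are made up for illustration.

```python
# Hypothetical helpers for scoring OCR output; not taken from the original study.
# Bangla digits occupy U+09E6 (zero) through U+09EF (nine).
BANGLA_TO_ASCII = {0x09E6 + d: str(d) for d in range(10)}


def normalize_mark(raw_text):
    """Map Bangla digits to ASCII and drop every non-digit character."""
    ascii_text = raw_text.translate(BANGLA_TO_ASCII)
    return "".join(ch for ch in ascii_text if ch in "0123456789")


def exact_match_accuracy(predictions, ground_truth):
    """Fraction of crops whose normalized OCR output equals the true mark."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have equal length")
    if not ground_truth:
        return 0.0
    hits = sum(
        normalize_mark(pred) == normalize_mark(truth)
        for pred, truth in zip(predictions, ground_truth)
    )
    return hits / len(ground_truth)


# Made-up OCR outputs for four crops, for illustration only.
ocr_output = ["৪২", " ৭ ", "l8", ""]
true_marks = ["৪২", "৭", "১৮", "৩৫"]
accuracy = exact_match_accuracy(ocr_output, true_marks)
```

Exact match is a deliberately strict criterion: a two-digit mark with one
misread digit counts as a full error, which mirrors how a transcription mistake
would affect a candidate's recorded score.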

Next, we selected the NumtaDB dataset because it covers a wide range of
handwriting styles, with 70,000+ annotated samples from diverse demographics,
which makes it better suited to robust model training than alternatives such as
CMATERdb. NumtaDB contains only single-digit samples, however, while our target
application requires recognizing multi-digit (predominantly two-digit) numbers
from examination papers.

To bridge this gap, we developed a two-phase approach: first training a robust
single-digit classifier, then implementing a segmentation pipeline to handle
multi-digit numbers. In the segmentation pipeline, each multi-digit image is
blurred and binarized, contours are detected, and a bounding box is drawn around
each contour. The boxes are then sorted left to right so that the digits are
read in the correct order.

We chose a Convolutional Neural Network (CNN) architecture because it excels
at automatically learning spatial features directly from pixel data without manual
feature engineering. Our CNN model consists of convolutional layers, max-pooling
layers, and dense layers, and is trained on the preprocessed images.
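
To make the roles of the three layer types concrete, the following sketch runs
a forward pass through one convolutional layer, one max-pooling layer, and one
dense layer in plain Python. It is an illustration only: the layer sizes, the
8x8 input, and the random untrained weights are assumptions, and the actual
model's depth and hyperparameters are not specified in this document. A real
implementation would normally use a deep learning framework and learn the
weights from NumtaDB.

```python
import math
import random


def conv2d(image, kernel):
    """Valid (no padding) 2-D cross-correlation of one channel with one kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[y + j][x + i] * kernel[j][i] for j in range(kh) for i in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def relu(fmap):
    """Element-wise rectifier: negative activations become zero."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    return [
        [
            max(fmap[y + j][x + i] for j in range(size) for i in range(size))
            for x in range(0, len(fmap[0]) - size + 1, size)
        ]
        for y in range(0, len(fmap) - size + 1, size)
    ]


def dense(vector, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output unit."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, biases)]


def softmax(logits):
    """Convert logits to probabilities; subtracting the max avoids overflow."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(image, kernels, weights, biases):
    """conv -> ReLU -> max pool per kernel, then flatten -> dense -> softmax."""
    features = []
    for kernel in kernels:
        pooled = max_pool(relu(conv2d(image, kernel)))
        features.extend(v for row in pooled for v in row)
    return softmax(dense(features, weights, biases))


# Toy sizes (assumed): 8x8 input, four 3x3 kernels, ten output classes (digits 0-9).
rng = random.Random(0)
kernels = [[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)] for _ in range(4)]
image = [[rng.random() for _ in range(8)] for _ in range(8)]
# 8x8 -> conv 3x3 -> 6x6 -> pool 2x2 -> 3x3; four kernels give 36 features.
weights = [[rng.uniform(-0.1, 0.1) for _ in range(36)] for _ in range(10)]
biases = [0.0] * 10
probs = forward(image, kernels, weights, biases)
predicted_digit = max(range(10), key=probs.__getitem__)
```

Because the weights are random, the predicted class here is meaningless; the
point is only the data flow from pixels to a ten-way probability vector.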

Preliminary results are positive, with the model showing significant improvement
over generic OCR solutions when tested on examination papers with variable
writing styles and potentially touching digits.

Word count: 239
