Data Extraction From Images Through OCR-IJRASET
https://doi.org/10.22214/ijraset.2021.37377
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.429
Volume 9 Issue VIII Aug 2021- Available at www.ijraset.com
Abstract: Maintaining paper documents in our daily lives is tiresome and inefficient: it consumes a lot of time, and the
documents are difficult to organize and keep track of. This project addresses these problems with Optical Character Recognition
(OCR) technology running on the Tesseract OCR Engine. The project specifically aims to increase data accessibility and usability
and to improve customer experience by decreasing the time spent processing, saving, and maintaining user data. Another objective
is to reduce the human error that is common in manual handling of data records; the software used in the solution applies certain
techniques to minimize these errors. OCR extracts text and characters from an image, which helps us maintain our records and data
digitally and securely. We use the Tesseract OCR Engine, which has high accuracy rates for clean images, and have implemented a
web version of OCR that runs on Tesseract.js together with other JavaScript frameworks. The resulting application successfully
extracts text and characters from the provided image, and for high-resolution images the observed accuracy is above 90%. This
web-based application is useful for small businesses because they do not have to install any extra software: a file only needs to
be uploaded through an online interface, which can be accessed remotely. It will also help students save notes and documents
online, making their important documents easily accessible on the web. The whole process is time and memory efficient.
Keywords: OCR, Tesseract API, HTML, CSS, ReactJS, Node, Express, Heroku, Accuracy, Image processing, Scalability, Accessibility
I. INTRODUCTION
OCR stands for optical character recognition. This technology lets you extract text, whether handwritten or printed, from an image.
In technical terms, OCR is the process of electronically converting scanned images of handwritten, typed or printed text into
machine-encoded text. Accuracy rates can be measured by different approaches. OCR is considered challenging, and the technology
is still improving. The OCR process works in three steps:
A. The computer acquires an image of the document and submits it as input to the OCR engine. This step often includes image
pre-processing, in which noise in the image is suppressed and the relevant image features are enhanced.
B. The OCR engine is instructed to recognize certain shapes and matches portions of the image to these shapes. It relies on
feature extraction, which keeps only selected features when the input data is large and ignores the rest of the information.
C. The OCR analysis takes the image in digital format and converts it into machine-recognisable text. It then applies a technique
called post-processing, an error correction step that improves the quality of the OCR results and can identify not only
words but also serial numbers and codes.
OCR solves real-world problems because the software can be combined with a broad range of technologies. Possible use cases for OCR
software include identification processes, marketing campaigns, payment processes, reverse image search, Google Translate,
Captcha and many more.
The project aims to build a feasible Minimum Viable Product that makes this technology accessible to students like us, and to
explore the widespread application of OCR in different areas of our lives through a web-based application.
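The pre-processing idea from the first step can be illustrated with a minimal sketch. It assumes the image is available as flat RGBA pixel data (the layout a browser canvas provides) and uses a fixed threshold of 128; the function names `toGrayscale` and `binarize` are hypothetical and are not taken from the project. Tesseract applies its own thresholding internally, so this only shows the general technique, not what the engine does.

```javascript
// Convert flat RGBA pixel data (4 bytes per pixel) to one gray value per pixel.
// The weights are the standard ITU-R BT.601 luma coefficients.
function toGrayscale(rgba) {
  const gray = new Uint8ClampedArray(rgba.length / 4);
  for (let i = 0; i < gray.length; i++) {
    const r = rgba[4 * i];
    const g = rgba[4 * i + 1];
    const b = rgba[4 * i + 2];
    gray[i] = 0.299 * r + 0.587 * g + 0.114 * b; // alpha channel is ignored
  }
  return gray;
}

// Binarize: pixels at or above the threshold become white (255), others black (0).
// A fixed threshold is an assumption for illustration; adaptive methods are common.
function binarize(gray, threshold = 128) {
  return gray.map((v) => (v >= threshold ? 255 : 0));
}

// Example: four pixels (black, white, light gray, dark gray).
const pixels = [0, 0, 0, 255, 255, 255, 255, 255, 200, 200, 200, 255, 50, 50, 50, 255];
const blackAndWhite = binarize(toGrayscale(pixels));
```

A fixed threshold works for clean, evenly lit scans; unevenly lit photographs usually need a threshold chosen per image or per region.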
1) HTML, CSS: HTML is what your browser understands. When we browse a webpage, we see HTML, which is comparable to a skeleton:
it gives a web page its structure and form. CSS (Cascading Style Sheets) modifies the look of HTML content. CSS is like
skin and texture: it gives color, width, height, padding, margin, and background to an HTML element. The main job of CSS is
to give “STYLE” to the HTML element.
2) JavaScript: JavaScript is a scripting language that is mostly used to create interactive web pages, and there is a lot you can
accomplish on a website with its help. JavaScript code runs directly in the browser window without a separate compilation
step, so no time is spent compiling, which makes the edit-and-run cycle quicker than with compiled Java code.
3) React Web: Developers can use React to build large web applications that update data without reloading the page. React's
main goal is to be fast, scalable, and simple to use. React.js is also easy to pick up: its component-based approach,
well-defined lifecycle, and use of plain JavaScript make it simple to learn, to build professional web (and mobile)
applications with, and to maintain.
4) Node.js: Node.js is an open-source, cross-platform JavaScript runtime environment. Built on Chrome's V8 JavaScript engine, it
lets developers run JavaScript on the server, outside of the browser. With Node.js, you can develop the backend of an app by
combining an Express server with MongoDB. It is event-driven and uses non-blocking I/O, making it well suited to building
web applications that are lightweight, efficient, and fast. These properties have made Node.js popular with developers.
5) Tesseract OCR Engine Library: Tesseract is an open-source text recognition (OCR) engine, available under the Apache 2.0
license. Programs can call it through an API to extract characters from images. It supports a wide variety of languages.
6) JavaScript: JavaScript is a scripting language that helps control events on the web dynamically and interact with web
components. It is an event-driven language that supports imperative, functional and object-oriented programming, which
covers pretty much everything you need to work on the web.
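To show how the Tesseract library in item 5 can be called from JavaScript, here is a minimal sketch built on the `recognize` function exported by the Tesseract.js package. The helper name `extractText`, the optional injected `recognize` argument, and the example file name are assumptions made for illustration; the paper does not show its actual code.

```javascript
// Hypothetical helper: runs OCR on an image and returns the text and confidence.
// `recognize` may be injected (e.g. for testing); by default the function
// exported by the tesseract.js package is loaded lazily.
async function extractText(image, lang = 'eng', recognize) {
  const run = recognize || require('tesseract.js').recognize;
  // Tesseract.js resolves to an object whose `data` field holds the result.
  const { data } = await run(image, lang);
  return { text: data.text.trim(), confidence: data.confidence };
}

// Usage (assumes tesseract.js is installed and ./note.png exists):
// extractText('./note.png').then((result) => console.log(result.text));
```

Passing the recognizer in as an argument keeps the helper independent of the OCR engine, so it can be exercised without downloading language data.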
V. PROPOSED SOLUTION
This project aims to solve these real-world problems by implementing an OCR system that converts a digitally captured image into
machine-readable text. This will help these sectors optimize specific workflows and extend the longevity of user records by
enabling digital storage and compression of physical documents.
The proposed solution uses the Tesseract Optical Character Recognition Engine through a JavaScript port of the Tesseract API. The
setup is supported by a Node.js server, and the user interface is implemented as a React web app. For the scope of this project,
we limit it to some specific use cases and to mainstream languages such as ENG (US) and ENG (INDIA) to match the timeline.
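The server side of this architecture could look roughly like the following sketch, in which an upload handler passes the received image to the OCR engine and returns the text as JSON. The handler factory `makeOcrHandler`, the route `/api/ocr`, the form field name `image`, and the use of the multer package for uploads are all assumptions for illustration; the paper only states that a Node.js server, a React front end, and a JavaScript port of Tesseract are used.

```javascript
// Hypothetical Express-style handler factory. `recognize` is expected to behave
// like the recognize() function of tesseract.js: (image, lang) => Promise<{ data }>.
function makeOcrHandler(recognize, lang = 'eng') {
  return async function ocrHandler(req, res) {
    // With multer's memory storage, the uploaded file is exposed as req.file.buffer.
    if (!req.file) {
      return res.status(400).json({ error: 'No image uploaded' });
    }
    try {
      const { data } = await recognize(req.file.buffer, lang);
      return res.json({ text: data.text });
    } catch (err) {
      return res.status(500).json({ error: 'OCR failed' });
    }
  };
}

// Possible wiring (assumes express, multer and tesseract.js are installed):
// const express = require('express');
// const multer = require('multer');
// const upload = multer({ storage: multer.memoryStorage() });
// const app = express();
// app.post('/api/ocr', upload.single('image'),
//   makeOcrHandler(require('tesseract.js').recognize));
// app.listen(3000);
```

The React front end would then send the chosen file to this route as multipart form data and display the returned text.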