Optical Character Recognition - Project Report
________________________
Kushagra Chadha
________________________
Amit Kumar
___________________________________
Professor Muneesh Meena, Major Advisor
___________________________________
Abstract
Our project aimed to understand, develop and improve the open Optical Character
Recognition (OCR) software pyOCR to better handle some of the more complex recognition issues,
such as unique language alphabets and special characters such as mathematical symbols. We
developed pyOCR to work with any language by creating support for UTF-8 character encoding.
The stages of an OCR system are: uploading a scanned image from the computer;
segmentation, in which the text zones are extracted from the image; recognition of the text;
and finally post-processing, in which the output of the previous stage goes
through an error detection and correction phase. This report also describes the user interface
provided with the OCR, with the help of which a user can very easily add or modify the
1.1 Introduction
We are moving toward a more digitized world. Computer and PDA screens are replacing
traditional books and newspapers. In addition, the large amount of paper archives that require
maintenance, as paper decays over time, has led to the idea of digitizing them instead of simply
scanning them. This requires recognition software that is, ideally, capable of reading
as well as humans can. Such OCR software is also needed for reading bank checks and postal
addresses; automating these two tasks can save many hours of human work.
These two major trends led OCR software to be developed and licensed to OCR
contractors. There is one notable exception to this: the open source OCR software pyOCR.
pyOCR was created by us on April 10, 2016 with the goal of providing an open source
OCR system capable of performing multiple digitization functions. The applications of this
software range from general desktop use and simple document conversion to the digitization of
historical archives.
The idea of OCR technology has been around for a long time and even predates electronic
computers.
Figure 1: Statistical Machine Design by Paul W. Handel
This is an image of the original OCR design proposed by Paul W. Handel in 1931. He applied for
a patent for a device “in which successive comparisons are made between a character and an
image.” In practice, one would shine a light through a filter and, if the light matched up
with the correct character of the filter, enough light would come back through the filter
and trigger an acceptance mechanism for the corresponding character. This was the first
documented vision of this type of technology. The world has come a long way since this
prototype.
1.2.1 Template-Matching Method
In 1956, Kelner and Glauberman used magnetic shift registers to project two-dimensional
information onto one dimension. The purpose was to reduce complexity and make the
information. A printed input character on paper is scanned by a photodetector through a slit. The
reflected light on the input paper allows the photodetector to segment the character by
calculating the proportion of the black portion within the slit. This proportion value is sent to a
register which converts the analog values to digital values. These samples would then be
matched to a template by taking the total sum of the differences between each sampled value and
the corresponding template value. While this machine was not commercialized, it gives us
important insight into the dimensionality of characters. In essence, characters are
two-dimensional, and if we want to reduce the dimension to one, we must change the shape of the
information accordingly.
Figure 2: Illustration of 2-D reduction to 1-D by a slit. (a) An input numeral “4” and a slit
scanned from left to right. (b) Black area projected onto axis, the scanning direction of the slit.
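The matching rule described above, taking the total sum of the differences between each sampled value and the corresponding template value, can be sketched in a few lines of Python. The projection profiles below are made up for illustration, not taken from real scanner output:

```python
# Sketch of 1-D template matching: each character is reduced to a
# column-projection vector (the fraction of black pixels seen through
# the slit at each horizontal position), and the best template is the
# one with the smallest total sum of differences.
TEMPLATES = {
    "1": [0.1, 0.9, 0.1, 0.1],
    "4": [0.5, 0.5, 0.9, 0.3],
    "7": [0.9, 0.2, 0.2, 0.2],
}

def match_character(sample):
    """Return the template label with the smallest sum of differences."""
    def total_difference(label):
        return sum(abs(s - t) for s, t in zip(sample, TEMPLATES[label]))
    return min(TEMPLATES, key=total_difference)

print(match_character([0.55, 0.45, 0.85, 0.35]))  # closest to "4"
```

The decision rule is exactly the one the machine implemented in hardware: accumulate absolute differences per template and accept the minimum.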
1.2.2 Peephole Method
This is the simplest logical template matching method. Pixels from different zones of the
binarized character are matched to template characters. An example would be the letter A,
where a pixel would be selected from the white hole in the center, the black section of the stem,
and so on. Each template character would have its own mapping of these zones that could be matched with
the character that needs to be recognized. The peephole method was first executed with a
machine produced by Solartron Electronics Group Ltd., which was used on numbers printed from
a cash register. It could read 120 characters per second, which was quite fast for its time.
It is very difficult to create a template for handwritten characters. The variations would be too
large to have an accurate or functional template. This is where the structure analysis method
came into play. This method analyzes the character as a structure that can be broken down into
parts. The features of these parts and the relationship between them are then observed to
determine the correct character. The issue with this method is how to choose these features and
the relationships between them. If the peephole method is extended to the structure analysis
method, peepholes can be viewed on a larger scale: instead of single pixels, we can now look at a
slit or ‘stroke’ of pixels.
This technique was first proposed in 1954 with William S. Rohland’s “Character Sensing
System” patent using a single vertical scan. The features of the slits are the number of black
regions present in each slit. This is called the cross counting technique.
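The cross-counting feature can be sketched in a few lines of Python. The glyph below is a made-up binarized “4”, not real scanner output:

```python
# Cross-counting sketch: scan a binarized character with slits and count
# the number of distinct black runs crossed by each slit.
# 1 = black, 0 = white; the glyph is a crude "4".
GLYPH = [
    [1, 0, 1],
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 1],
    [0, 0, 1],
]

def cross_counts(glyph):
    """Number of black runs in each horizontal slit (row)."""
    counts = []
    for row in glyph:
        runs, prev = 0, 0
        for pixel in row:
            if pixel == 1 and prev == 0:
                runs += 1
            prev = pixel
        counts.append(runs)
    return counts

print(cross_counts(GLYPH))  # [2, 2, 1, 1, 1]
```

The resulting count vector is the feature used for matching: a “4” crosses two strokes near the top and one near the bottom.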
OCR results are mainly attributed to the OCR recognizer software, but there are other factors that
can have a considerable impact on the results. The simplest of these factors is the quality of the
scan. The table below summarizes these factors and provides recommendations for each.
Image optimization:
  What influences OCR: The OCR software analyzes the stroke edge of each character.
  Improvement: Optimize the image for OCR so that character edges are smoothed, rounded and
  sharpened and contrast is increased prior to OCR, so that text boundaries can be identified.

Quality of source:
  What influences OCR: A marked, mouldy or faded source, or characters that are not in sharp
  focus or are skewed on the page, negatively affects identification of characters.
  Improvement: Obtain the best source possible.

Pattern image database in OCR software:
  What influences OCR: The OCR software matches character edges to pattern images and makes a
  decision on what the character is.
  Improvement: Select good OCR software.

Algorithms and built-in dictionaries in OCR software:
  What influences OCR: The OCR software matches whole words to a dictionary and makes decisions
  on confidence.
  Improvement: Select good OCR software.

Train OCR engine:
  What influences OCR: Depends on how much time you have available to train the OCR engine.
  Improvement: Purchase OCR software that has this ability. At present it is questionable
  whether training is viable for large-scale historic newspaper projects.

Table 1: Potential methods of improving OCR accuracy.
This is a method that was developed with the goal of finding a linear representation of
nongaussian data so that the components are statistically independent. Data is nongaussian if it
does not follow a normal distribution. The cocktail party problem is a great example of the
need for a way to analyze mixed data. In this problem, there are two signal sources (two
people speaking at the same time) and two sensors (microphones) collecting the data. We
would like to be able to take the mixed data of the two speakers collected from these two
microphones and somehow separate the data back to their original signals. Each microphone
will have a different representation of the mixed signal because they will be located in
different positions in the room. If we represent these mixed recorded signals as x1(t) and
x2(t), we can express them as linear combinations of the source signals s1(t) and s2(t):

x1(t) = a11 s1(t) + a12 s2(t)
x2(t) = a21 s1(t) + a22 s2(t)

where a11, a12, a21 and a22 are parameters that depend on the distances of the microphones
from the speakers. This gives us the nongaussian data we need to properly analyze these
signals in an ICA framework.
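The mixing step can be sketched numerically. The mixing coefficients below are invented for illustration; in reality they would depend on the room geometry:

```python
# Sketch of the cocktail-party mixing model: two sources s1, s2 are
# combined by a mixing matrix A into the observed microphone signals.
A = [[0.8, 0.3],   # microphone 1: closer to speaker 1
     [0.2, 0.7]]   # microphone 2: closer to speaker 2

def mix(s1, s2):
    """Return the two observed (mixed) signals x1, x2."""
    x1 = [A[0][0] * a + A[0][1] * b for a, b in zip(s1, s2)]
    x2 = [A[1][0] * a + A[1][1] * b for a, b in zip(s1, s2)]
    return x1, x2

x1, x2 = mix([1.0, 0.0], [0.0, 1.0])
print(x1, x2)  # [0.8, 0.3] [0.2, 0.7]
```

ICA is the inverse problem: given only x1 and x2, estimate both A and the original sources.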
In order to properly execute Independent Component Analysis the data must go through
some initial standardization along with one fundamental condition: nongaussianity. To show why
Gaussian variables make ICA impossible, we assume we have an orthogonal mixing matrix and
our sources are all gaussian. Then x1 and x2 are gaussian, uncorrelated, and of unit variance,
and their joint density is:

p(x1, x2) = (1 / 2π) exp(−(x1² + x2²) / 2)

The density of this distribution is completely symmetric and does not contain any
relevant information about the directions of the columns of the mixing matrix. Because there is no
relevant information, we have no way to make estimates about this data. We thus need a
measure of nongaussianity.
Kurtosis is the older method of measuring nongaussianity and can be defined for a random
variable y as:

kurt(y) = E{y⁴} − 3 (E{y²})²

For a variable of unit variance this simplifies to E{y⁴} − 3, a normalized version of the
fourth moment E{y⁴}. Kurtosis is usually either positive or negative for nongaussian
random variables, while it is zero for a Gaussian random variable. For this reason we
generally take the absolute value or the square of kurtosis as a measure of nongaussianity.
Kurtosis has been commonly used in ICA because of its simple formulation and its
low computational cost. The computational cost is in fact reduced when using the fourth moment
of the data as an estimate of its kurtosis. This is due to the following linearity properties:

kurt(x1 + x2) = kurt(x1) + kurt(x2)
kurt(αx) = α⁴ kurt(x)

where x1 and x2 are independent random variables and α is a scalar.
Although kurtosis proved to be very handy for multiple applications, it did have one major
weakness: its sensitivity to outliers. This means that when using sample data in which the
distribution is either random or has some errors, kurtosis can fail at determining its gaussianity.
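Both the definition and the outlier sensitivity can be sketched with sample estimates. The sample sizes and the outlier value are arbitrary:

```python
import random

# Sample-kurtosis sketch: kurt(y) = E{y^4} - 3 (E{y^2})^2, zero for a
# Gaussian variable. A uniform distribution is sub-gaussian (negative
# kurtosis), but a single large outlier can swamp the estimate.
def kurtosis(samples):
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]
    m2 = sum(c ** 2 for c in centered) / n
    m4 = sum(c ** 4 for c in centered) / n
    return m4 - 3 * m2 ** 2

random.seed(0)
uniform = [random.uniform(-1, 1) for _ in range(100000)]
print(kurtosis(uniform) < 0)            # True: uniform is sub-gaussian
print(kurtosis(uniform + [30.0]) > 0)   # True: one outlier flips the sign
```

One contaminated sample out of one hundred thousand is enough to reverse the conclusion, which is exactly the weakness described above.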
A second measure of nongaussianity is negentropy, which is based on entropy, a fundamental
concept of information theory. Entropy describes the amount of information that can be taken out
of the observation of a given variable. A large entropy value means the data is random and
unpredictable.
In a similar manner, the entropy of a continuous random variable y with density f(y) can be
expressed as:

H(y) = −∫ f(y) log f(y) dy

Information theory establishes that, out of all random variables of equal variance, the Gaussian
variable has the highest entropy value, which can also be attributed to the fact that the
Gaussian distribution is the most random of all distributions. This result shows that we can
obtain a measure of nongaussianity through negentropy, defined as:

J(y) = H(ygauss) − H(y)

where ygauss is a Gaussian random variable that has the same covariance matrix as the variable y.
Negentropy is zero if and only if y has a Gaussian distribution, thus the higher its measure the
less Gaussian the variable is. Unlike kurtosis, negentropy is computationally expensive. A
solution to this problem is to find simpler approximations of its measure. The classical
approximation uses higher-order moments:

J(y) ≈ (1/12) E{y³}² + (1/48) kurt(y)²

with the assumption that y has zero mean and unit variance. A more robust family of
approximations can be written as follows:

J(y) ≈ ∑ ki [E{Gi(y)} − E{Gi(v)}]²

where the ki are some positive constants, v is a standardized Gaussian variable and the Gi are
some nonquadratic functions. A common use of this approximation is to take only one
nonquadratic function G, usually:

G(u) = (1/a) log cosh(a u), with 1 ≤ a ≤ 2

We then obtain approximations that provide computational simplicity comparable to that of
kurtosis.
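A rough numerical sketch of the single-function approximation, taking G(u) = log cosh(u) and estimating the Gaussian reference term E{G(v)} by sampling (the sample sizes are arbitrary):

```python
import math, random

# Negentropy approximation sketch: J(y) ~ [E{G(y)} - E{G(v)}]^2 with
# G(u) = log cosh(u), where v is a standard Gaussian variable whose
# expectation is estimated here by Monte Carlo sampling.
def G(u):
    return math.log(math.cosh(u))

def negentropy(samples, gauss_ref):
    mean_y = sum(G(s) for s in samples) / len(samples)
    mean_v = sum(G(s) for s in gauss_ref) / len(gauss_ref)
    return (mean_y - mean_v) ** 2

random.seed(1)
gauss = [random.gauss(0, 1) for _ in range(50000)]
# A unit-variance uniform variable: half-width sqrt(3) gives variance 1.
uniform = [random.uniform(-math.sqrt(3), math.sqrt(3)) for _ in range(50000)]
print(negentropy(gauss, gauss) == 0.0)  # True by construction
print(negentropy(uniform, gauss) > 0)   # nongaussian data scores positive
```

Because the measure is a squared difference, it is zero for Gaussian data and positive otherwise, matching the behaviour of the exact negentropy.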
To give a brief explanation of why gaussianity is strictly not allowed, we can say that it
makes the data completely symmetric, and thus the mixing matrix will not provide any
directional information.
As mentioned above, data preprocessing is crucial in that it makes the ICA estimation simpler
and better conditioned. Many preprocessing techniques can be applied, such as “centering”,
which makes x a zero-mean variable, and “whitening”, a linear transformation of the
observed vector x so that its components become uncorrelated and their variances equal unity;
the transformed vector is then said to be white.
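The two preprocessing steps can be sketched for a single component. This sketch assumes the components are already uncorrelated, so whitening reduces to per-component rescaling; in general a rotation (eigendecomposition of the covariance matrix) is needed first:

```python
# Preprocessing sketch: centering makes a component zero-mean, whitening
# then scales it to unit variance. Assumes components are already
# uncorrelated, which is a simplification of full whitening.
def center(values):
    m = sum(values) / len(values)
    return [v - m for v in values]

def whiten(values):
    centered = center(values)
    var = sum(v * v for v in centered) / len(centered)
    return [v / var ** 0.5 for v in centered]

white = whiten([2.0, 4.0, 6.0, 8.0])
mean = sum(white) / len(white)
var = sum(v * v for v in white) / len(white)
print(round(mean, 10), round(var, 10))  # 0.0 1.0
```

After this step the data satisfies the zero-mean, unit-variance assumptions used by the negentropy approximations above.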
Two main approaches preceded energy-based models: the Density Modeling approach and the
Filtering approach. Density Modeling is based on causal generative models,
whereas the Filtering approach uses information maximization techniques. Energy-based models
emerged as a unification of these methods because they use Density Modeling techniques along
with filtering. This is a powerful tool, as it eliminates the need for proper normalization of the
model: there is “a mapping from an observation vector to a feature vector, and the feature
vector determines a global energy”. The probability density function of x is then expressed in
terms of this energy E(x) as:

p(x) = exp(−E(x)) / Z, where Z = ∫ exp(−E(x)) dx
Finite State Machines are used in many areas of computational linguistics because of their
convenience and efficiency. They do a great job at describing the important local phenomena
encountered in empirical language study. They tend to give a good compact representation of
these phenomena.
For computational linguistics, we are mainly concerned with time and space efficiency.
We achieve time efficiency through the use of a deterministic machine, whose running time is
usually linearly dependent on the size of the input. This fact alone
allows us to consider it optimal for time efficiency. We are able to achieve space efficiency with
an extension of the idea of deterministic automata: a machine that is deterministic in both its
input and its output. This quality is very useful and supports very efficient programs.
The use of finite state automata has contributed a lot to the development of speech recognition
and of natural language processing. Such an automaton performs a state transition depending on
the input it receives until it reaches one of the final states: the output state.
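The transition behaviour just described can be sketched as a small deterministic automaton; the toy language (strings of “ab” repeated) is an arbitrary example:

```python
# Minimal deterministic finite automaton sketch: transitions are keyed
# by (state, symbol); the machine consumes the input one symbol at a
# time and accepts if it halts in a final state. Runtime is linear in
# the input length, which is the time-efficiency point made above.
TRANSITIONS = {("start", "a"): "saw_a", ("saw_a", "b"): "start"}
FINAL_STATES = {"start"}

def accepts(text, state="start"):
    for symbol in text:
        state = TRANSITIONS.get((state, symbol))
        if state is None:
            return False          # no transition defined: reject
    return state in FINAL_STATES

print(accepts("abab"), accepts("aba"))  # True False
```

Because each symbol triggers exactly one dictionary lookup, the machine does a single linear pass over the input with no backtracking.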
These transducers keep all the functionality of a simple FSM (finite state machine) but
add a weight to each transition. In speech recognition for example this weight is the probability
for each state transition. In addition, in these transducers the input or output label of a transducer
transition can be null. Such a null means that no symbol needs to be consumed or output during
the transition. These null labels are needed to create variable length input and output strings.
They also provide a good way of delaying the output via an inner loop for example.
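A weighted transducer with a null input label can be sketched as follows. The states, labels and probabilities are invented for illustration:

```python
# Weighted finite-state transducer sketch: each arc carries an input
# label, an output label and a weight (here a probability). A null ("")
# input label is traversed without consuming a symbol, like the epsilon
# transitions described above.
# Arcs: (state, input, output, weight, next_state)
ARCS = [
    ("q0", "h", "H", 0.9, "q1"),
    ("q1", "i", "I", 0.8, "q2"),
    ("q1", "",  "!", 0.1, "q2"),   # null input: emits without reading
]

def transduce(state, symbols, prob=1.0, out=""):
    """Best (probability, output) over paths consuming `symbols` to q2."""
    best = (prob, out) if not symbols and state == "q2" else (0.0, "")
    for s, i, o, w, nxt in ARCS:
        if s != state:
            continue
        if i == "":                          # epsilon: consume nothing
            cand = transduce(nxt, symbols, prob * w, out + o)
        elif symbols and symbols[0] == i:    # match one input symbol
            cand = transduce(nxt, symbols[1:], prob * w, out + o)
        else:
            continue
        best = max(best, cand)
    return best

prob, out = transduce("q0", "hi")
print(round(prob, 2), out)  # 0.72 HI
```

Note how the null arc lets the input "h" alone still produce an output ("H!"), which is how transducers handle input and output strings of different lengths.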
Initial approaches to language modeling used affix dictionaries to represent natural languages.
This method came in handy to represent languages like English by having a list of the most
common words along with possible affixes. However, when trying to represent more languages,
it was quickly clear that such an approach fails with agglutinative languages. In these
languages, new word forms are created by gluing sequences of morphemes together, including
suffixes and even other nouns, unlike the English language, in which we generally add suffixes
to obtain other word forms, like the suffix –ly for adverbs. Hungarian falls under the
agglutinative languages, for
which we needed to create a dictionary and a language model in FST (finite state transducer)
format. The representation of such a language can be done by “having the last node of the
portion of the FST, which encodes a given suffix, contain outgoing arcs to the first states of
portions of the FST which encode other suffixes”. The advantage of this technique is that when
applied to all the possible affixes, it will then have a solid representation of the
agglutination process.
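The combinatorial effect of linking suffix portions together can be sketched as a generator of word forms. The stems and suffixes below are hypothetical ASCII-fied Hungarian-like strings, and the depth cap keeps the set finite; a real FST would share the suffix sub-graphs instead of re-expanding them:

```python
# Suffix-linking sketch: the node ending one suffix links to the nodes
# beginning the others, so chains of affixes are represented compactly.
# Stems/suffixes are invented for illustration only.
STEMS = ["haz"]           # hypothetical stem
SUFFIXES = ["ak", "ban"]  # hypothetical plural and locative suffixes

def word_forms(max_suffixes=2):
    """All stem + suffix-chain combinations up to a given chain length."""
    forms = set()
    def extend(word, depth):
        forms.add(word)
        if depth < max_suffixes:
            for suffix in SUFFIXES:
                extend(word + suffix, depth + 1)
    for stem in STEMS:
        extend(stem, 0)
    return forms

print(sorted(word_forms()))  # 7 forms from 1 stem and 2 suffixes
```

Even this tiny inventory yields seven surface forms; with dozens of productive suffixes the word list becomes far too large to enumerate, which is why the FST representation matters.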
1.6 File Formats

There are many different file formatting options available for character recognition software. We
primarily dealt with PNG files because it was the only usable format in pyOCR but we were
faced with some challenges during image conversion. Image quality has a huge impact on the
effectiveness of any OCR software and when trying to change between formats, one has to be
aware of lossy vs. lossless compression. These were the formats we ran into during this project:
1.6.1 TIFF
TIFF is the Tagged Image File Format and can be used as a single- or multi-image file format
(multiple pages in the same file). The TIFF format is very desirable because its most common
compression schemes are all lossless. This means that these types of compression can reduce the
file size (the file can later be returned to its original size) without losing any quality.
1.6.2 PDF
PDF stands for Portable Document Format, an open standard originally created by Adobe. While
the ability of a PDF to contain both text and images is very useful for some applications, for
OCR it is an unnecessarily robust quality that only adds to the file size. A TIFF is much more
desirable as OCR input.
1.6.3 PNG
Portable Network Graphics is a lossless data format and the one that is used by
pyOCR. PNG is a single-image, open, color image format and was created to replace the GIF
format.
1.6.4 JPEG
The acronym ‘JPEG’ comes from the committee that created the file format, the Joint Photographic
Experts Group. This is a lossy image format, but it can be scaled to trade off between storage
size and image quality. Lossy compression is not ideal for OCR software, but JPEG can be used as
long as the image is compressed as little as possible.
Chapter 2: SIP and PyQT
Introduction
SIP is a tool for automatically generating Python bindings for C and C++ libraries. SIP was
originally developed in 1998 for PyQt - the Python bindings for the Qt GUI toolkit - but is
suitable for generating bindings for any C or C++ library.
This version of SIP generates bindings for Python v2.3 or later, including Python v3.
There are many other similar tools available. One of the original such tools is SWIG and, in fact,
SIP is so called because it started out as a small SWIG. Unlike SWIG, SIP is specifically
designed for bringing together Python and C/C++ and goes to great lengths to make the
integration as tight as possible.
The homepage for SIP is http://www.riverbankcomputing.com/software/sip. Here you will
always find the latest stable version and the latest version of this documentation.
SIP can also be downloaded from the Mercurial repository at
http://www.riverbankcomputing.com/hg/sip.
2.1 License
SIP is licensed under similar terms as Python itself. SIP is also licensed under the GPL (both v2
and v3). It is your choice as to which license you use. If you choose the GPL then any bindings
you create must be distributed under the terms of the GPL.
2.2 Features
SIP, and the bindings it produces, have the following features:
• bindings are fast to load and minimise memory consumption especially when only a
small sub-set of a large library is being used
• automatic conversion between standard Python and C/C++ data types
• overloading of functions and methods with different argument signatures
• support for Python’s keyword argument syntax
• support for both explicitly specified and automatically generated docstrings
• access to a C++ class’s protected methods
• the ability to define a Python class that is a sub-class of a C++ class, including abstract
C++ classes
• Python sub-classes can implement the __dtor__() method which will be called from the
C++ class’s virtual destructor
• support for ordinary C++ functions, class methods, static class methods, virtual class
methods and abstract class methods
• the ability to re-implement C++ virtual and abstract methods in Python
• support for global and class variables
• support for global and class operators
• support for C++ namespaces
• support for C++ templates
• support for C++ exceptions and wrapping them as Python exceptions
• the automatic generation of complementary rich comparison slots
• support for deprecation warnings
• the ability to define mappings between C++ classes and similar Python data types that are
automatically invoked
• the ability to automatically exploit any available run time type information to ensure that
the class of a Python instance object matches the class of the corresponding C++ instance
• the ability to change the type and meta-type of the Python object used to wrap a C/C++
data type
• full support of the Python global interpreter lock, including the ability to specify that a
C++ function or method may block, therefore allowing the lock to be released and other
Python threads to run
• support for consolidated modules where the generated wrapper code for a number of
related modules may be included in a single, possibly private, module
• support for the concept of ownership of a C++ instance (i.e. what part of the code is
responsible for calling the instance’s destructor) and how the ownership may change
during the execution of an application
• the ability to generate bindings for a C++ class library that itself is built on another C++
class library which also has had bindings generated so that the different bindings integrate
and share code properly
• a sophisticated versioning system that allows the full lifetime of a C++ class library,
including any platform specific or optional features, to be described in a single set of
specification files
• support for the automatic generation of PEP 484 type hint stub files
• the ability to include documentation in the specification files which can be extracted and
subsequently processed by external tools
• the ability to include copyright notices and licensing information in the specification files
that is automatically included in all generated source code
• a build system, written in Python, that you can extend to configure, compile and install
your own bindings without worrying about platform specific issues
• support for building your extensions using distutils
• SIP, and the bindings it produces, run under UNIX, Linux, Windows, MacOS/X,
Android and iOS.
2.5 Qt Support
SIP has specific support for the creation of bindings based on Digia’s Qt toolkit.
The SIP code generator understands the signal/slot type safe callback mechanism that Qt uses to
connect objects together. This allows applications to define new Python signals, and allows any
Python callable object to be used as a slot.
SIP itself does not require Qt to be installed.
2.6 Installation
2.6.1 Downloading
You can get the latest release of the SIP source code
from http://www.riverbankcomputing.com/software/sip/download.
SIP is also included with all of the major Linux distributions. However, it may be a version or
two out of date.
2.6.2 Configuring
After unpacking the source package (either a .tar.gz or a .zip file depending on your platform)
you should then check for any README files that relate to your platform.
Next you need to configure SIP by executing the configure.py script. For example:
python configure.py
This assumes that the Python interpreter is on your path. Something like the following may be
appropriate on Windows:
c:\python35\python configure.py
If you have multiple versions of Python installed then make sure you use the interpreter for
which you wish SIP to generate bindings.
The full set of command line options is:
--version
Display the SIP version number.
-h, --help
Display a help message.
--arch <ARCH>
Binaries for the MacOS/X architecture <ARCH> will be built. This option should be given
once for each architecture to be built. Specifying more than one architecture will cause a
universal binary to be created.
-b <DIR>, --bindir <DIR>
The SIP code generator will be installed in the directory <DIR>.
--configuration <FILE>
New in version 4.16.
<FILE> contains the configuration of the SIP build to be used instead of dynamically
introspecting the system and is typically used when cross-compiling. See Configuring with
Configuration Files.
-d <DIR>, --destdir <DIR>
The sip module will be installed in the directory <DIR>.
--deployment-target <VERSION>
New in version 4.12.1.
Each generated Makefile will set the MACOSX_DEPLOYMENT_TARGET environment
variable to <VERSION>. In order to work around bugs in some versions of Python, this
should be used instead of setting the environment variable in the shell.
-e <DIR>, --incdir <DIR>
The SIP header file will be installed in the directory <DIR>.
-k, --static
The sip module will be built as a static library. This is useful when building the sip module
as a Python builtin.
-n, --universal
The SIP code generator and module will be built as universal binaries under MacOS/X. If
the --arch option has not been specified then the universal binary will include
the i386 and ppc architectures.
--no-pyi
New in version 4.18.
This disables the installation of the sip.pyi type hints stub file.
--no-tools
New in version 4.16.
The SIP code generator and sipconfig module will not be installed.
-p <PLATFORM>, --platform <PLATFORM>
Explicitly specify the platform/compiler to be used by the build system, otherwise a
platform specific default will be used. The --show-platforms option will display all the
supported platform/compilers.
--pyi-dir <DIR>
New in version 4.18.
<DIR> is the name of the directory where the sip.pyi type hints stub file is installed. By
default this is the directory where the sip module is installed.
-s <SDK>, --sdk <SDK>
If the --universal option was given then this specifies the name of the SDK directory. If a
path is not given then it is assumed to be a sub-directory of
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs
or /Developer/SDKs.
-u, --debug
The sip module will be built with debugging symbols.
-v <DIR>, --sipdir <DIR>
By default .sip files will be installed in the directory <DIR>.
--show-platforms
The list of all supported platform/compilers will be displayed.
--show-build-macros
The list of all available build macros will be displayed.
--sip-module <NAME>
The sip module will be created with the name <NAME> rather than the
default sip. <NAME> may be of the form package.sub-package.module. See Building a
Private Copy of the sip Module for how to use this to create a private copy of
the sip module.
--sysroot <DIR>
New in version 4.16.
<DIR> is the name of an optional directory that replaces sys.prefix in the names of other
directories (specifically those specifying where the various SIP components will be
installed and where the Python include directories can be found). It is typically used when
cross-compiling or when building a static version of SIP. See Configuring with
Configuration Files.
--target-py-version <VERSION>
New in version 4.16.
<VERSION> is the major and minor version (e.g. 3.4) of the version of Python being
targeted. By default the version of Python being used to run the configure.py script is
used. It is typically used when cross-compiling. See Configuring with Configuration Files.
--use-qmake
New in version 4.16.
Normally the configure.py script uses SIP’s own build system to create the Makefiles for
the code generator and module. This option causes project files (.pro files) used by
Qt’s qmake program to be generated instead. qmake should then be run to generate the
Makefiles. This is particularly useful when cross-compiling.
The configure.py script takes many other options that allow the build system to be finely tuned.
These are of the form name=value or name+=value. The --show-build-macros option will display
each supported name, although not all are applicable to all platforms.
The name=value form means that value will replace the existing value of name.
The name+=value form means that value will be appended to the existing value of name.
For example, the following will disable support for C++ exceptions (and so reduce the size of
module binaries) when used with GCC:
python configure.py CXXFLAGS+=-fno-exceptions
A pure Python module called sipconfig.py is generated by configure.py. This defines
each name and its corresponding value. Looking at it will give you a good idea of how the build
system uses the different options. It is covered in detail in The Build System.
2.6.3 Building
The next step is to build SIP by running your platform’s make command. For example:
make
The final step is to install SIP by running the following command:
make install
(Depending on your system you may require root or administrator privileges.)
This will install the various SIP components.
class Word {
    const char *the_word;

public:
    Word(const char *w);
};
%Module word

class Word {

%TypeHeaderCode
#include <word.h>
%End

public:
    Word(const char *w);
};
We then run SIP to generate the C++ code:

sip -c . word.sip
However, that still leaves us with the task of compiling the generated code and linking it against
all the necessary libraries. It’s much easier to use the SIP build system to do the whole thing.
Using the SIP build system is simply a matter of writing a small Python script. In this simple
example we will assume that the word library we are wrapping and its header file are installed in
standard system locations and will be found by the compiler and linker without having to specify
any additional flags. In a more realistic example your Python script may take command line
options, or search a set of directories to deal with different configurations and installations.
This is the simplest script (conventionally called configure.py):
import os
import sipconfig

# The name of the SIP build file generated by SIP and used by the build
# system.
build_file = "word.sbf"

# Get the SIP configuration information and run SIP to generate the code.
config = sipconfig.Configuration()
os.system(" ".join([config.sip_bin, "-c", ".", "-b", build_file, "word.sip"]))

# Create the Makefile.
makefile = sipconfig.SIPModuleMakefile(config, build_file)

# Add the library we are wrapping. The name doesn't include any platform
# specific prefixes or extensions (e.g. the "lib" prefix on UNIX, or the
# ".dll" extension on Windows).
makefile.extra_libs = ["word"]

# Generate the Makefile itself.
makefile.generate()
[1] All SIP directives start with a % as the first non-whitespace character of a line.
[2] SIP includes many code directives like this. They differ in where the supplied code is placed by
SIP in the generated code.
#include <qlabel.h>
#include <qwidget.h>
#include <qstring.h>

class Hello : public QLabel {
public:
    Hello(QWidget *parent = 0);

private:
    // Prevent instances from being copied.
    Hello(const Hello &);
    Hello &operator=(const Hello &);
};

#if !defined(Q_OS_WIN)
void setDefault(const QString &def);
#endif
The corresponding SIP specification file would then look something like this:
%Module hello

%Import QtGui/QtGuimod.sip

%If (Qt_4_2_0 -)

class Hello : QLabel {

%TypeHeaderCode
#include <hello.h>
%End

public:
    Hello(QWidget *parent /TransferThis/ = 0);

private:
    Hello(const Hello &);
};

%If (!WS_WIN)
void setDefault(const QString &def);
%End

%End
Again we look at the differences, but we’ll skip those that we’ve looked at in previous examples.
• The %Import directive has been added to specify that we are extending the class
hierarchy defined in the file QtGui/QtGuimod.sip. This file is part of PyQt4. The build
system will take care of finding the file’s exact location.
• The %If directive has been added to specify that everything [4] up to the
matching %End directive only applies to Qt v4.2 and later. Qt_4_2_0 is a tag defined
in QtCoremod.sip [5] using the %Timeline directive. %Timeline is used to define a tag for
each version of a library’s API you are wrapping allowing you to maintain all the
different versions in a single SIP specification. The build system provides support
to configure.py scripts for working out the correct tags to use according to which version
of the library is actually installed.
• The TransferThis annotation has been added to the constructor’s argument. It specifies
that if the argument is not 0 (i.e. the Hello instance being constructed has a parent) then
ownership of the instance is transferred from Python to C++. It is needed because Qt
maintains objects (i.e. instances derived from the QObject class) in a hierarchy. When an
object is destroyed all of its children are also automatically destroyed. It is important,
therefore, that the Python garbage collector doesn’t also try and destroy them. This is
covered in more detail in Ownership of Objects. SIP provides many other annotations that
can be applied to arguments, functions and classes. Multiple annotations are separated by
commas. Annotations may have values.
• The = operator has been removed. This operator is not supported by SIP.
• The %If directive has been added to specify that everything up to the
matching %End directive does not apply to Windows. WS_WIN is another tag defined by
PyQt4, this time using the %Platforms directive. Tags defined by the %Platforms directive
are mutually exclusive, i.e. only one may be valid at a time [6].
One question you might have at this point is why bother to define the private copy constructor
when it can never be called from Python? The answer is to prevent the automatic generation of a
public copy constructor.
We now look at the configure.py script. This is a little different to the script in the previous
examples for two related reasons.
Firstly, PyQt4 includes a pure Python module called pyqtconfig that extends the SIP build system
for modules, like our example, that build on top of PyQt4. It deals with the details of which
version of Qt is being used (i.e. it determines what the correct tags are) and where it is installed.
This is called a module’s configuration module.
Secondly, we generate a configuration module (called helloconfig) for our own hello module.
There is no need to do this, but if there is a chance that somebody else might want to extend your
C++ library then it would make life easier for them.
Now we have two scripts. First the configure.py script:

import os
import sipconfig
from PyQt4 import pyqtconfig

# The name of the SIP build file generated by SIP and used by the build
# system.
build_file = "hello.sbf"

# Get the PyQt configuration information.
config = pyqtconfig.Configuration()

# Get the extra SIP flags needed by the imported qt module.
pyqt_sip_flags = config.pyqt_sip_flags

# Run SIP to generate the code. Note that we tell SIP where to find the qt
# module's specification files using the -I flag.
os.system(" ".join([config.sip_bin, "-c", ".", "-b", build_file, "-I",
    config.pyqt_sip_dir, pyqt_sip_flags, "hello.sip"]))

# We are going to install the SIP specification file for this module and
# its configuration module.
installs = []
installs.append(["hello.sip", os.path.join(config.default_sip_dir, "hello")])
installs.append(["helloconfig.py", config.default_mod_dir])

# Create the Makefile. The QtGuiModuleMakefile class provided by the
# pyqtconfig module takes care of the extra compiler and linker flags
# needed by the Qt library.
makefile = pyqtconfig.QtGuiModuleMakefile(
    configuration=config, build_file=build_file, installs=installs)

# Add the library we are wrapping. The name doesn't include any platform
# specific prefixes or extensions (e.g. the "lib" prefix on UNIX, or the
# ".dll" extension on Windows).
makefile.extra_libs = ["hello"]

# Generate the Makefile itself.
makefile.generate()

# Now we create the configuration module by merging installation specific
# values with the template file.
content = {
    # Publish the set of SIP flags needed by this module. As these are the
    # same flags needed by the qt module we could leave it out, but this
    # allows us to change the flags at a later date without breaking
    # scripts that import the configuration module.
    "hello_sip_flags": pyqt_sip_flags
}
sipconfig.create_config_module("helloconfig.py", "helloconfig.py.in", content)
Then the helloconfig.py.in template it is generated from:

import pyqtconfig

# These are installation specific values created when Hello was configured.
# The following line will be replaced when this template is used to create
# the final configuration module.
# @SIP_CONFIGURATION@

class Configuration(pyqtconfig.Configuration):
    """The class that represents Hello configuration values.
    """
    def __init__(self, sub_cfg=None):
        """Initialise an instance of the class.

        sub_cfg is the list of sub-class configurations. It should be None
        when called normally.
        """
        if sub_cfg:
            cfg = sub_cfg
        else:
            cfg = []
        cfg.append(_pkg_config)
        pyqtconfig.Configuration.__init__(self, cfg)

class HelloModuleMakefile(pyqtconfig.QtGuiModuleMakefile):
    """The Makefile class for modules that %Import hello.
    """
    def finalise(self):
        """Finalise the macros.
        """
        # Make sure our C++ library is linked.
        self.extra_libs.append("hello")
        # Let the super-class do what it needs to.
        pyqtconfig.QtGuiModuleMakefile.finalise(self)
[5] Actually in versions.sip. PyQt4 uses the %Include directive to split the SIP specification for Qt
across a large number of separate .sip files.
[6] Tags can also be defined by the %Feature directive. These tags are not mutually exclusive, i.e.
any number may be valid at a time.
The SIP reference also shows how handwritten %MethodCode can map a Python tuple onto a
C++ array:

class Klass
{
public:
    // The Python signature is a tuple, but the underlying C++ signature
    // is a 2 element array.
    Klass(SIP_PYTUPLE) [(int *)];
%MethodCode
        int iarr[2];

        // Convert the two-element tuple and call the C++ constructor
        // using the SIP generated derived class.
        if (sipParseArgs(&sipArgsParsed, a0, "ii", &iarr[0], &iarr[1]))
            sipCpp = new sipKlass(iarr);
%End
};
3.2 Dependencies:
PIL is required to work with images in memory
(http://www.pythonware.com/products/pil/). PyTesser has been tested with Python 2.4 on
Windows XP.
3.3 Installation:
PyTesser has no installation functionality in this release. Extract pytesser.zip into the directory
containing your other scripts. The necessary files are listed in File Dependencies below.
3.4 Usage:
from pytesser import *
im = Image.open('phototest.tif')
text = image_to_string(im)
print text

Running this on the sample image prints:

This is a lot of 12 point text to test the ocr code and see if it works on
all types of file format. The quick brown dog jumped over the lazy fox. The
quick brown dog jumped over the lazy fox. The quick brown dog jumped over
the lazy fox. The quick brown dog jumped over the lazy fox.
The format attribute identifies the source of an image. If the image was not read from a file, it is
set to None. The size attribute is a 2-tuple containing width and height (in pixels). The mode
attribute defines the number and names of the bands in the image, and also the pixel type and
depth. Common modes are “L” (luminance) for greyscale images, “RGB” for true colour images,
and “CMYK” for pre-press images.
If the file cannot be opened, an IOError exception is raised.
Once you have an instance of the Image class, you can use the methods defined by this class to
process and manipulate the image. For example, let’s display the image we just loaded:
>>> im.show()
(The standard version of show is not very efficient, since it saves the image to a temporary file
and calls the xv utility to display the image. If you don’t have xv installed, it won’t even work.
When it does work though, it is very handy for debugging and tests.)
The following sections provide an overview of the different functions provided in this library.
It is important to note that the library doesn’t decode or load the raster data unless it really has to.
When you open a file, the file header is read to determine the file format and extract things like
mode, size, and other properties required to decode the file, but the rest of the file is not
processed until later.
This means that opening an image file is a fast operation, which is independent of the file size
and compression type. Here’s a simple script to quickly identify a set of image files:
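A minimal sketch of such an identification script, using Pillow's from PIL import Image form (the identify helper name is ours):

```python
# Print format, size and mode for every image file named on the command
# line; files that cannot be opened as images are silently skipped.
import sys
from PIL import Image

def identify(paths):
    """Return an identification line for each path that opens as an image."""
    results = []
    for infile in paths:
        try:
            im = Image.open(infile)
            results.append("%s %s %dx%d %s"
                           % (infile, im.format, im.size[0], im.size[1], im.mode))
        except IOError:
            pass
    return results

if __name__ == "__main__":
    for line in identify(sys.argv[1:]):
        print(line)
```

Because only the header is read, this runs quickly even over large files.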
A region is extracted from an image with the crop method, passing a 4-tuple of coordinates
(left, upper, right, lower). The Python Imaging Library uses a coordinate system with (0, 0) in
the upper left corner. Also note that coordinates refer to positions between the pixels, so a
region cropped with the box (100, 100, 400, 400) is exactly 300x300 pixels.
The region could now be processed in a certain manner and pasted back.
Processing a subrectangle, and pasting it back

box = (100, 100, 400, 400)
region = im.crop(box)
region = region.transpose(Image.ROTATE_180)
im.paste(region, box)
When pasting regions back, the size of the region must match the given region exactly. In
addition, the region cannot extend outside the image. However, the modes of the original image
and the region do not need to match. If they don’t, the region is automatically converted before
being pasted (see the section on Colour Transforms below for details).
Here's an additional example:
Rolling an image

def roll(image, delta):
    "Roll an image sideways"
    xsize, ysize = image.size
    delta = delta % xsize
    if delta == 0: return image
    part1 = image.crop((0, 0, delta, ysize))
    part2 = image.crop((delta, 0, xsize, ysize))
    image.paste(part2, (0, 0, xsize - delta, ysize))
    image.paste(part1, (xsize - delta, 0, xsize, ysize))
    return image
For more advanced tricks, the paste method can also take a transparency mask as an optional
argument. In this mask, the value 255 indicates that the pasted image is opaque in that position
(that is, the pasted image should be used as is). The value 0 means that the pasted image is
completely transparent. Values in-between indicate different levels of transparency.
The Python Imaging Library also allows you to work with the individual bands of a multi-band
image, such as an RGB image. The split method creates a set of new images, each containing one
band from the original multi-band image. The merge function takes a mode and a tuple of
images, and combines them into a new image. The following sample swaps the three bands of an
RGB image:
Splitting and merging bands
r, g, b = im.split()
im = Image.merge("RGB", (b, g, r))
Note that for a single-band image, split returns the image itself. To work with individual colour
bands, you may want to convert the image to “RGB” first.
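A one-pixel check (assuming Pillow is installed) makes the channel swap easy to see:

```python
from PIL import Image

# A single pixel with distinct channel values.
im = Image.new("RGB", (1, 1), (10, 20, 30))

# Split into bands, then merge them back in reversed order.
r, g, b = im.split()
swapped = Image.merge("RGB", (b, g, r))

print(swapped.getpixel((0, 0)))  # (30, 20, 10)
```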
Geometrical Transforms
The Image class contains methods to resize and rotate an image. The former takes a tuple giving
the new size, the latter the angle in degrees counter-clockwise.
Simple geometry transforms

out = im.resize((128, 128))
out = im.rotate(45)  # degrees counter-clockwise
To rotate the image in 90 degree steps, you can either use the rotate method or the transpose
method. The latter can also be used to flip an image around its horizontal or vertical axis.
Transposing an image
out = im.transpose(Image.FLIP_LEFT_RIGHT)
out = im.transpose(Image.FLIP_TOP_BOTTOM)
out = im.transpose(Image.ROTATE_90)
out = im.transpose(Image.ROTATE_180)
out = im.transpose(Image.ROTATE_270)
Colour Transforms
The Python Imaging Library allows you to convert images between different pixel
representations using the convert function.
Converting between modes
im = Image.open("lena.ppm").convert("L")
The library supports transformations between each supported mode and the “L” and “RGB”
modes. To convert between other modes, you may have to use an intermediate image (typically
an “RGB” image).
Image Enhancement
The Python Imaging Library provides a number of methods and modules that can be used to
enhance images.
Filters
The ImageFilter module contains a number of pre-defined enhancement filters that can be used
with the filter method.
Applying filters
import ImageFilter
out = im.filter(ImageFilter.DETAIL)
Point Operations
The point method can be used to translate the pixel values of an image (e.g. image contrast
manipulation). In most cases, a function object expecting one argument can be passed to this
method. Each pixel is processed according to that function:
Applying point transforms

# multiply each pixel by 1.2
out = im.point(lambda i: i * 1.2)
Using the above technique, you can quickly apply any simple expression to an image. You can
also combine the point and paste methods to selectively modify an image:
Processing individual bands

# split the image into individual bands
source = im.split()
R, G, B = 0, 1, 2
# select regions where red is less than 100
mask = source[R].point(lambda i: i < 100 and 255)
# process the green band
out = source[G].point(lambda i: i * 0.7)
# paste the processed band back, but only where red was < 100
source[G].paste(out, None, mask)
# build a new multiband image
im = Image.merge(im.mode, source)
Python only evaluates the portion of a logical expression as is necessary to determine the
outcome, and returns the last value examined as the result of the expression. So if the expression
above is false (0), Python does not look at the second operand, and thus returns 0. Otherwise, it
returns 255.
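This short-circuit trick can be checked on plain integers, using i < 100 and 255 as the per-pixel expression:

```python
# The per-pixel mask expression described above.
expr = lambda i: i < 100 and 255

below, above = expr(50), expr(150)
print(below)   # 255: the left operand is true, so 255 is returned
# For 150 the comparison is false; Python returns False, which compares
# equal to 0, so the mask pixel is treated as fully transparent.
assert above == 0
```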
Enhancement
For more advanced image enhancement, you can use the classes in the ImageEnhance module.
Once created from an image, an enhancement object can be used to quickly try out different
settings.
You can adjust contrast, brightness, colour balance and sharpness in this way.
Enhancing images
import ImageEnhance
enh = ImageEnhance.Contrast(im)
enh.enhance(1.3).show("30% more contrast")
Image Sequences
The Python Imaging Library contains some basic support for image sequences (also called
animation formats). Supported sequence formats include FLI/FLC, GIF, and a few experimental
formats. TIFF files can also contain more than one frame.
When you open a sequence file, PIL automatically loads the first frame in the sequence. You can
use the seek and tell methods to move between different frames:
Reading sequences
import Image
im = Image.open("animation.gif")
im.seek(1) # skip to the second frame
try:
while 1:
im.seek(im.tell()+1)
# do something to im
except EOFError:
pass # end of sequence
As seen in this example, you’ll get an EOFError exception when the sequence ends.
Note that most drivers in the current version of the library only allow you to seek to the next
frame (as in the above example). To rewind the file, you may have to reopen it.
The following iterator class lets you use the for-statement to loop over the sequence:
A sequence iterator class
class ImageSequence:
def __init__(self, im):
self.im = im
def __getitem__(self, ix):
try:
if ix:
self.im.seek(ix)
return self.im
except EOFError:
raise IndexError # end of sequence
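The class above can be exercised without PIL by substituting a minimal stand-in for the image object (the FakeAnimation class below is ours, purely for illustration):

```python
class FakeAnimation:
    """A stand-in for a PIL image: seek() fails past the last frame."""
    def __init__(self, nframes):
        self.nframes = nframes
        self.frame = 0
    def seek(self, ix):
        if ix >= self.nframes:
            raise EOFError
        self.frame = ix

class ImageSequence:
    def __init__(self, im):
        self.im = im
    def __getitem__(self, ix):
        try:
            if ix:
                self.im.seek(ix)
            return self.im
        except EOFError:
            raise IndexError  # end of sequence

# The for-statement calls __getitem__ with 0, 1, 2, ... and stops
# when IndexError is raised.
frames = [im.frame for im in ImageSequence(FakeAnimation(3))]
print(frames)  # [0, 1, 2]
```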
# GUI implementation (PyQt4). The imports and the class header below do
# not appear in the original listing; they are the minimal assumptions
# needed to make the code self-contained.
import sys
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from pytesser import *

class filedialogdemo(QWidget):
    def __init__(self, parent=None):
        super(filedialogdemo, self).__init__(parent)
        layout = QVBoxLayout()
        # size of window
        self.resize(250, 300)
        layout.addStretch(1)
        self.btn = QPushButton("Click here to select Picture")
        self.btn.resize(self.btn.sizeHint())
        #self.btn.move(550, 100)
        self.btn.setFixedWidth(300)
        self.btn.clicked.connect(self.getfile)
        # center aligns the window
        self.move(QApplication.desktop().screen().rect().center() - self.rect().center())
        self.path = ''
        # The picture preview label and the conversion button; their
        # creation is missing from the original listing, so the caption
        # and the signal connection are assumptions.
        self.le = QLabel()
        self.btn1 = QPushButton("Convert to Text")
        self.btn1.clicked.connect(lambda: self.ocr(self.path))
        layout.addWidget(self.le)
        layout.addWidget(self.btn)
        layout.addWidget(self.btn1)
        self.contents = QTextEdit()
        self.contents.setMaximumHeight(600)
        layout.addWidget(self.contents)
        self.setLayout(layout)
        self.setWindowTitle("Optical Character Recognition")

    # File picker implementation
    def getfile(self):
        fname = QFileDialog.getOpenFileName(self, 'Open file',
            '/home/kumaramit1996/Pictures/', "Images (*.png *.jpg *.gif)")
        pixmap = QPixmap(fname)
        pixmap = pixmap.scaledToWidth(300)
        self.le.setPixmap(pixmap)
        self.move(QApplication.desktop().screen().rect().center() - self.rect().center())
        self.path = str(fname)

    # Text file picker implementation
    def getfiles(self):
        dlg = QFileDialog()
        dlg.setFileMode(QFileDialog.AnyFile)
        dlg.setFilter('Text files (*.txt)')
        filenames = QStringList()
        if dlg.exec_():
            filenames = dlg.selectedFiles()
            f = open(filenames[0], 'r')
            with f:
                data = f.read()
                self.contents.setText(data)

    # OCR conversion
    def ocr(self, path):
        if path != '':
            # image_to_string(Image.open(path)) would also work on an
            # in-memory image; here PyTesser reads the file directly, with
            # graceful_errors making Tesseract failures non-fatal.
            text = image_file_to_string(path, graceful_errors=True)
            self.contents.setText(text)

# Main implementation
def main():
    app = QApplication(sys.argv)
    ex = filedialogdemo()
    ex.show()
    sys.exit(app.exec_())

# Calling main
if __name__ == '__main__':
    main()
Chapter 5: Live Example
After successfully creating our English character and language models, we assessed the accuracy
of the pyOCR software. We were able to successfully recognize English special characters and
increase the overall accuracy. We used a character-based approach to assess the accuracy and
increase the rate of correct recognition by 8%. The original accuracy with the English character
model was 66% on a sample of 1700 characters and we increased this to 74.5% with our
character model. We manually calculated the accuracy because the ground truth data for this text
did not exist in digital form. From our tests, we have concluded that our character and language
model yield significantly better results than the default English models.
The goal of pyOCR is to provide an accessible, flexible, and simple tool to perform optical
character recognition. In its current state it is not the most user-friendly utility and still has
many kinks to work out. This is understandable because it is in an alpha stage of development,
and it will require more attention before an official release. The actual theory behind character
recognition is in place in the software. pyOCR does an excellent job of preprocessing and
segmenting images and allows for many fine adjustments to fulfill a variety of user needs. It is
now a matter of reorganizing and optimizing the code to create a user-friendly experience.
With time, we believe pyOCR will be one of the leading names in optical character recognition
software.
As we expect to extend the current version of pyOCR to the Devanagari script, we are getting
familiar with the kinds of challenges presented by accented characters and are learning to deal
with them successfully. We thus anticipate a future extension of pyOCR to most languages
based on the Latin script.
For languages with different alphabets, such as Chinese and Arabic, we believe a future
project could adapt pyOCR to vertical and right-to-left character recognition, since at the
language model level we defined Unicode as the standard encoding. This is consistent with the
need to represent most written languages in a single encoding for further extensions to other
languages. The training portion will then be the key to both the correct representation and
recognition of these scripts.
As mentioned in section 2.3.2, the pyOCR software is run through multiple commands that
represent each step of the recognition process starting from the preprocessing and segmentation
and ending with the use of character and language models. We believe it would be very handy
to streamline these commands under a single command. This can save a lot of time during
future revisions of the software, since extensive testing requires running it many times.
Such a command can take in flags for the different operations within the digitization pipeline,
and when omitted they will have default values for ease of use.
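Such a wrapper command could be sketched with argparse; the stage names, flags, and defaults below are illustrative assumptions, not pyOCR's actual interface:

```python
import argparse

def build_parser():
    # Hypothetical single-command interface for the whole pipeline.
    p = argparse.ArgumentParser(prog="pyocr",
        description="run preprocessing, segmentation, recognition "
                    "and post-processing in one step")
    p.add_argument("image", help="scanned image to recognise")
    p.add_argument("--segment", default="lines", choices=["lines", "words"],
                   help="segmentation granularity (default: lines)")
    p.add_argument("--model", default="english",
                   help="character/language model to apply")
    p.add_argument("--no-postprocess", action="store_true",
                   help="skip the error detection and correction stage")
    return p

# Every stage has a default, so the simplest invocation is just the image;
# here we also override one flag.
args = build_parser().parse_args(["page.png", "--model", "devanagari"])
print(args.image, args.model, args.segment)  # page.png devanagari lines
```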
References
[1] Hyvärinen, Aapo, and Erkki Oja. "Independent Component Analysis: Algorithms and
Applications." Neural Networks 13.4-5 (2000): 411-430. Print.
[2] Mori, Shunji, Ching Y. Suen, and Kazuhiko Yamamoto. Historical Review of OCR Research
and Development. Tech. no. 0018-9219. Vol. 80. IEEE, 1992. Print. Proceedings of the IEEE.
[3] Holley, Rose. "How Good Can It Get? Analysing and Improving OCR Accuracy in Large
Scale Historic Newspaper Digitisation Programs." D-Lib Magazine. Web. 28 Mar. 2012.
<http://www.dlib.org/dlib/march09/holley/03holley.html>.
[4] Breuel, Thomas M. The pyOCR Open Source OCR System. Tech. DFKI and U.
Kaiserslautern. Print.
[5] Handel, Paul W. Statistical Machine. General Electric Company, assignee. Patent 1,915,993.
1933. Print.
[6] Smith, Ray. "Tesseract OCR Engine." Lecture. Google Code. Google Inc., 2007. Web.
[7] Teh, Yee Whye, Simon Osindero, and Geoffrey E. Hinton. "Energy-Based Models for Sparse
Overcomplete Representations." Journal of Machine Learning Research 4 (2003). Print.
[8] Mohri, Mehryar, Fernando Pereira, and Michael Riley. "Weighted Finite-State Transducers in
Speech Recognition." Web. <http://www.cs.nyu.edu/~mohri/pub/asr2000.ps>.
Daciuk, Jan. PhD thesis. Web.
<http://www.eti.pg.gda.pl/katedry/kiw/pracownicy/Jan.Daciuk/personal/thesis/node12.html>.
Greenfield, Kara, and Sarah Judd. "Open Source Natural Language Processing." Worcester
Polytechnic Institute. Web. <project-042810-055257/>.