Script Recognition-A Review: Debashis Ghosh, Tulika Dube, and Adamane P. Shivaprasad
Abstract—A variety of different scripts are used in writing languages throughout the world. In a multiscript, multilingual environment, it
is essential to know the script used in writing a document before an appropriate character recognition and document analysis algorithm
can be chosen. In view of this, several methods for automatic script identification have been developed so far. They mainly belong to
two broad categories—structure-based and visual-appearance-based techniques. This survey report gives an overview of the different
script identification methodologies under each of these categories. Methods for script identification in online data and video-texts are
also presented. It is noted that the research in this field is relatively thin and still more research is to be done, particularly in the case of
handwritten documents.
Index Terms—Document analysis, optical character recognition, script identification, multiscript document.
1 INTRODUCTION
Fig. 1. Stages of document processing in a multiscript environment.
Fig. 2. Examples of multiscript document images: (a) a government report in China containing a mix of Chinese and English words, (b) a medical report in Arabic containing words in English that do not have an exact Arabic equivalent, (c) a portion of an official application form in India containing different script lines typeset in Hindi and English.
digital libraries when dealing with a multiscript environment. Text area detection refers to either segmenting out text blocks from other nontextual regions, like halftones, images, line drawings, etc., in a document image, or extracting text printed against textured backgrounds and/or embedded in images within a document. To do this, the system takes advantage of script-specific distinctive characteristics of text which make it stand out from other nontextual parts in the document. Text extraction is also required in images and videos for content-based browsing. One powerful index for image/video retrieval is the text appearing in them. Efficient indexing and retrieval of digital image/video in an international scenario therefore requires text extraction followed by script identification and then character recognition. Similarly, text found in documents can be used for their annotation, indexing, sorting, and retrieval. Thus, script identification plays an important role in building a digital library containing documents written in different scripts.
In short, automatic script identification is crucial to meet the growing demand for electronic processing of volumes of documents written in different scripts. This is important for business transactions across Europe and the Orient, and has great significance in a country like India, which has many official state languages and scripts. Due to this, there has been a growing interest in multiscript OCR technology during recent years. A brief survey on methods for script recognition was reported earlier in [7], with emphasis on script identification in Indian multiscript documents but little insight into the script recognition methods for non-Indian scripts. A review of script identification research for Indian documents is also available in [8]. A report on the key technologies in multilingual OCR and their application in building a multilingual digital library can also be found in [9].
In this paper, we present a comprehensive survey of different script recognition techniques developed mainly for identification of certain major scripts of the world, viz., Chinese, Japanese, Korean, Arabic, Hebrew, Latin, Cyrillic, and the Brahmic family of Indian scripts. To begin with, in Section 2, we give a brief description of different script types, highlighting their main discriminating features. Methods for script recognition in document images are described in Section 3, giving comparative analysis among them. Section 4 discusses several methods for script recognition in the realm of pen computing. As said before, script identification in video text is also important. However, not much research has been done on this topic. The only work that we have found on this is outlined in Section 5. Section 6 raises issues related to performance evaluation of multiscript OCR systems. Finally, we state our concluding remarks in Section 7, including some insights on the recent trends and future scope of work in this field.
2 WRITING SYSTEMS AND SCRIPTS OF THE WORLD
In the context of script recognition, it may be worth studying the characteristics of various writing systems and the structural properties of the characters used in certain major scripts of the world. In Fig. 3, we draw a tree diagram showing different classes of writing systems. As said in [10], [11] and depicted in the tree diagram, there are six prominent writing systems. Major scripts that follow each of these writing systems are also shown in the tree diagram and are described below.
2.1 Logographic System
A logogram, also called an ideogram, refers to a symbol that graphically represents a complete word. Accordingly, the number of characters in a script for an ideographic writing system generally runs into thousands. This makes recognition of logographic characters a difficult but interesting problem.
An example of logographic script is Han, which is mainly associated with Chinese. Japanese and Korean writings also include Han modified as Kanji and Hanja, respectively. Han characters are generally composed of multiple short strokes, giving them a complex and dense look, distinctly different from other Western and Asian scripts. Accordingly, character optical density and certain other visual appearance-based features have been utilized by many researchers in distinguishing Han from other scripts. Another interesting property of Han is its directionality: words in a textline are written either from left to right or from top to bottom.
2.2 Syllabic System
In a syllabic system, every written symbol represents a phonetic sound or syllable, as used in Japanese. The symbols representing the Japanese syllables are known as Kanas, which are of two types: Hiragana and Katakana. As indicated in Fig. 3, Japanese script uses a mix of logographic Kanji and syllabic Kanas. Hence, it is visually similar to Chinese, but less dense due to the presence of simpler Kanas in between the logograms.
2.3 Alphabetic System
An alphabet is a set of characters representing phonemes of a spoken language. Examples of scripts following this system
2144 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 32, NO. 12, DECEMBER 2010
Fig. 3. Tree diagram showing broad classification of prominent writing systems and scripts of the present world.
are Greek, Latin, Cyrillic, and Armenian. The Latin script, also called Roman script, is used by many languages throughout the world with varying degrees of modifications from one language to another. It is used for writing many European languages like English, Italian, French, German, Portuguese, Spanish, etc., and has been adopted in many Amerindian and Austronesian languages, including the modern Malay, Vietnamese, and Indonesian languages. Fig. 4 shows a few such variants of the Latin script. Compared to other scripts, classical Latin characters are simple in structure, mainly composed of a few lines and arcs. The other major script under the alphabetic system is Cyrillic. This script is used by some languages of Eastern Europe, Asia, and Slavic regions that include Bulgarian, Russian, Macedonian, Ukrainian, Mongolian, etc. The basic properties of this script are somewhat similar to that of Latin except that it uses a different alphabet set. Some characters in the Cyrillic alphabet are also borrowed from Latin and Greek, modified with cedillas, crosshatches, or diacritical marks. This induces recognition ambiguity among Cyrillic, Latin, and Greek.
2.4 Abjads
The Abjad system of writing is similar to the alphabetic system, but has symbols for consonantal sounds only. Unlike most other scripts in the world, Abjads are written from right to left within a textline. This unique feature is particularly useful for identifying Abjad-based scripts in pen computing.
Two important scripts under this category are Arabic and Hebrew. A typical Arabic character is formed of a long main stroke along with one to three dots. The characters in a word are generally conjoined, giving an overall cursive appearance to the written text. This provides an important clue for the recognition of Arabic script. The same applies to some other scripts of Arabic origin, such as Farsi (Persian), Urdu, Sindhi, Jawi, etc. On the other hand, character strokes in Hebrew are more uniform in length and the letters in a word are generally discrete.
2.5 Abugidas
Abugida is another alphabetic-like writing system used by the Brahmic family of scripts that originated from the ancient Indian Brahmi script and includes nearly all of the scripts of India and southeast Asia. In Fig. 5, we draw a tree diagram to illustrate the evolution of major Brahmic scripts in India and southeast Asia. The northern group of Brahmic scripts (e.g., Devnagari, Bengali, Manipuri, Gurumukhi, Gujrati, and Oriya) bears a strong resemblance to the original Brahmi script. On the other hand, scripts in south India (Tamil, Telugu, Kannada, and Malayalam) as well as in southeast Asia (e.g., Thai, Lao, Burmese, Javanese, and Balinese) are derived from Brahmi through many changes and so look quite different from the northern group. One important characteristic of Devnagari, Bengali, Gurumukhi, and Manipuri is that the characters in a word are generally written together without spaces so that the top bar is unbroken. This results in the formation of a headline, called shirorekha, at the top of each word. Accordingly, these scripts can be separated from other script types by detecting the presence of a large number of horizontal lines in the textual portions of a document.
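The headline cue described above can be illustrated with a minimal Python sketch. It is not taken from any of the surveyed methods: the function name headline_row, the fill threshold of 0.8, and the toy word images are all assumptions made for this example. The sketch computes the fraction of ink pixels in each row of a binary word image and reports a row in the upper half that is almost completely filled, which is how an unbroken top bar would appear.

```python
import numpy as np

def headline_row(word, min_fill=0.8):
    """Return the index of a headline-like row in the top half of a binary
    word image (1 = ink), or None if no row is filled enough.

    min_fill is a hypothetical threshold chosen for this illustration.
    """
    word = np.asarray(word, dtype=bool)
    height, width = word.shape
    # Fraction of ink pixels in each row (normalized horizontal projection).
    fill = word.sum(axis=1) / width
    # A shirorekha-like bar is expected near the top of the word.
    top = fill[: max(1, height // 2)]
    row = int(np.argmax(top))
    return row if top[row] >= min_fill else None

# Toy 8x12 "word" with an unbroken top bar and three vertical stems.
barred = np.zeros((8, 12), dtype=int)
barred[1, :] = 1
barred[1:7, [2, 6, 10]] = 1

# The same stems without the bar, loosely mimicking discrete letters.
plain = np.zeros((8, 12), dtype=int)
plain[1:7, [2, 6, 10]] = 1

print(headline_row(barred))
print(headline_row(plain))
```

On real scans the threshold would have to tolerate skew, broken bars, and noise, so a single fixed value such as the one assumed here is unlikely to be sufficient by itself.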
Fig. 5. The Brahmic family of scripts used in India and southeast Asia.
Fig. 9. Chaudhury and Sheth’s three methods of script identification.
available to bring out the characteristics of the script. They offer good performance when used for script identification at the page level, but may not retain their performance when applied on a smaller block of text. In multiscript documents, it is necessary to identify and separate different script regions like paragraph, textline, word, or even character in the document page. This is particularly important in a country like India that hosts a variety of scripts like Devnagari, Bengali, Tamil, Telugu, Kannada, Malayalam, Gujrati, Gurumukhi, Oriya, Manipuri, Urdu, Sindhi, and Latin. In view of this, several multiscript OCR systems involving more than one Indian script in a single unit have been developed [8]. Multiscript OCR systems that perform script recognition at the paragraph level are now described.
Fig. 9 shows three different strategies developed by Chaudhury and Sheth [23] to recognize the script of a text block in a printed document. In the first technique, the script of the text block is described in terms of the Fourier coefficients of the horizontal projection profile. Subsequent classification is based on Euclidean distance in the eigenspace. The other two schemes are based on features derived from connected components in text blocks: one using the means and standard deviations of the outputs for a six-channel Gabor filter and the other using distribution of the width-to-height ratio of the connected components present in the document. Classification in both of these cases is accomplished using Mahalanobis distance. The average recognition rate obtained with these methods, when tested on Latin, Devnagari, Telugu, and Malayalam scripts, was approximately 85, 95, and 89 percent, respectively.
In [24], a neural network-based architecture was developed for identification of printed Latin, Devnagari, and Kannada scripts. It consists of a feature extractor followed by a modular neural network, as shown in Fig. 10. In the feature extraction stage, a feature vector corresponding to pixel distributions along specified directions is obtained via morphological operations. The modular neural network structure consists of three independently trained feed-forward neural networks, one for each of the three scripts under consideration. The input is assigned to the script class whose network produces the maximum output. It was seen that such a system can classify English and Kannada with 100 percent accuracy, while the rate is slightly lower (97 percent) in recognizing Devnagari.
Script recognition using feed-forward neural network was also performed in [25]. The network is trained to classify an input printed text block into Han or Latin directly, without performing any feature extraction. The network consists of four layers with 49 nodes in the input layer, 15 and 20 nodes in the hidden layers, and two nodes in the output layer that correspond to the two script classes. The nodes in the input layer are fed with pixel values in a block of size 7 × 7 pixels. A number of sample blocks are randomly extracted from the input text block, and the script of the text block is then determined by a simple majority vote among the sampling blocks. Experiments on a number of mixed-type document images showed the effectiveness of the proposed system, yielding 92.3 and 95 percent accuracy in determining the Chinese and English texts, respectively.
A method for Arabic and Latin text block differentiation in both printed and handwritten scripts was proposed in [26]. This method is based on morphological analysis at the text block level and geometrical analysis at textline and connected component levels. Experimental evaluation of the method was carried out on two different data sets containing 400 and 335 text blocks, and the results obtained were quite promising.
In an attempt to build automatic letter sorting machines for Bangladesh post offices, an algorithm for Bengali/English script identification was developed recently [27]. The method is designed for application to both machine-printed and handwritten address blocks on envelope images. The two scripts under consideration are recognized on the basis of the aggregate distance of the pixels in the topmost and the bottommost profiles of the connected components: an English text image has these two distance measures almost equal, whereas their difference in a Bengali text image is quite large. It was observed in the experiments that the accuracy of this script identification method is quite high for printed text (98 and 100 percent for English and Bengali, respectively) and, for handwritten text, the proposed approach can achieve a satisfactory accuracy of about 95 percent.
3.1.3 Textlinewise Script Identification
The earliest work we have found on textlinewise script identification in Indian documents was reported by Pal and Chaudhuri in [28]. The method uses projection profile, statistical and topological features, and stroke features for decision-tree-based classification of printed Latin, Urdu, Devnagari, and Bengali script lines. Later, they proposed an automatic system for the identification of Latin, Chinese, Arabic, Devnagari, and Bengali textlines in printed documents [29]. As depicted in Fig. 11, the headline (“shirorekha”) information is used first to separate Devnagari and Bengali script lines from Latin, Chinese, and Arabic script
TABLE 1
Script Recognition Methods
one may use it for the purpose. In view of this, a comparative analysis between different methods and script features is desirable.
One important structural feature for script recognition used by Spitz and some others is the character optical density. This is the measure of character pixels inside a character bounding box, which is distinctly very high in scripts using complex ideographic characters. Structurally simple Arabic characters, on the other hand, are low in density. All other scripts across Europe and Asia show more or less the same medium character density. Therefore, while this feature may be good in separating out Han, on one hand, and Arabic, on the other, it does not help much in bringing out the difference between moderately complex scripts like Latin, Cyrillic, Brahmic scripts, etc. The second discriminating feature that Spitz used is the location of upward concavities in characters. An upward concavity is formed when a run of character pixels spans the gap between two white runs just above it. As a result, upward concavities in a character are observed at points where two or more character strokes join. Accordingly, ideograms composed of multiple strokes show many more upward concavities per character compared to that in other scripts. As observed by Spitz [77], there are usually at most two or three upward concavities in a single Latin character while Han characters have many more upward concavities per character that are evenly distributed along the vertical axis. However, we observe that most other scripts also show two or three upward concavities, the same as in the Latin script. So, upward concavity is good for separating Han from
GHOSH ET AL.: SCRIPT RECOGNITION—A REVIEW 2155
others but not good for discrimination among non-Han scripts, except perhaps for Cyrillic, which contains a few more upward concavities compared to other non-Han scripts. Another problem with these two features is that they highly depend on document quality. Broken character segments may result in detection of false upward concavity, while noise contributes to optical density measure. Non-Han documents tend to be misclassified as Han-based Oriental ones if the document quality is poor, because many characters are either broken or noisy. In order to cope with such situations, features like character height distribution, character bounding box profiles, horizontal projections, and several other statistical features were proposed in [16], [17], [18]. These features do not depend on the document quality and resolution but on the overall size of the connected components. However, these features are not invariant to character size and font and offer high performance only in separating distinctly different Oriental scripts from other non-Han scripts.
Several different structural features, like character geometry, occurrence of certain stroke structures and structural primitives, stroke orientations, measure of cavity regions, side profiles, etc., that directly relate to the character shape have also been used for script characterization. However, while some features show marked difference between two scripts, measures of other features may be the same between that script pair. For example, while Devnagari and Gujrati can be easily identified using “shirorekha” and water reservoir-based features, character aspect ratio and character moments do not show much difference. This is because many Gujrati letters are exactly the same as their Devnagari counterparts with the headline (“shirorekha”) removed. Again, there are features that are optimal in one script pair but not in another pair. For example, the presence of “shirorekha” may be a good feature for discriminating Latin and Devnagari, but not at all useful in separating Devnagari and Bengali. Therefore, in order to separate out a script from all other scripts, one may need to check a large pool of structural features before any decision can be taken. This may result in the curse of dimensionality. So, a better option may be to do the classification using different sets of features at different levels of hierarchy, as proposed in some of the works above. Another option is to learn the script characteristics in a neural network, as in [25], without bothering about the features to be used for classification. However, a larger network with a greater number of hidden units may be necessary for reliable recognition as more and more script classes are included.
Compared to the above, Hochberg et al.’s method is more versatile. The method is based on discovering frequent characters/symbols in every script class and storing them in the system database for matching during classification. Therefore, in principle, the method can identify any number of scripts of varied nature and font as long as they are included in the training set. It is possible to apply the method in a common framework to scripts containing discrete and connected characters, alphabetic and nonalphabetic scripts, and so on, as demonstrated in [19], [34]. However, it is not difficult to realize that the classification error due to ambiguity will increase if the system includes script classes that use similar looking characters or even share many common characters. Therefore, Hochberg’s method may not be suitable in a multiscript country like India, where most scripts have the same line of origin. Nevertheless, it offers invariance to font size and computational simplicity. This is because textual symbols are size-normalized and the algorithm uses simple binary shape matching without any feature value calculation.
Another important feature proposed by Wood et al. and used by many researchers is the horizontal projection. This gives a measure of the spatial spread of the characters in a script that provides an important clue to script identification. Some scripts can be identified by detecting the peaks in the projection profile, e.g., Arabic scripts having a strong baseline show a peak at the bottom of the profile while Brahmic scripts with “shirorekha” show a peak at the top, and so on. However, this feature also is not good for separating scripts of similar nature and structure. For example, Devnagari, Bengali, and Gurumukhi will show the same peak in the profile due to “shirorekha”; Arabic, Urdu, and Farsi have the same lower peak. Hence, this feature has not been used alone but mostly in combination with other structural features.
A better approach to script identification is via texture feature extraction using a multichannel Gabor filter that provides a model for the human vision system. This means that the Gabor filter offers a powerful tool to extract visual attributes from a document. This has motivated many researchers to employ the Gabor filter for script determination. Since a texture feature gives the general appearance of a script, it can be derived from any script class of any nature. Accordingly, this feature may be considered a universal one. The discriminating power of a multichannel Gabor filter can be varied by having more channels with different radial frequencies and closely spaced orientation angles. Thus, this system is flexible compared to all other methods and can be effectively used in discriminating scripts that are quite close in appearance. The main criticism with this approach is that it cannot be applied with confidence to small text regions as in wordwise script recognition. Also, Gabor filters are not capable of handling variations in script size and font, interline spacings, etc.
Table 1 also lists recognition rates, as reported in the literature. Since the experiments were conducted independently using different data sets, however, they do not reflect the comparative performance of these methods. To have a proper measure of their relative script separation power, these methods need to be applied on a common data set.
Script recognition performance of some of the above-mentioned features, when applied to a common data set, is given in Table 2. The data set contains printed documents typeset in 10 different scripts, including six scripts used in India. In the absence of any standard database, we created our own database by collecting document samples from books and magazines. Some documents were also available from the World Wide Web, which we printed using a laser printer. All of the documents were scanned in black-and-white mode at 300 dpi and then rescaled to have a standard textline height in all documents while maintaining the
TABLE 2
Script Recognition Results (in Percentage)
character aspect ratio. Script recognition was performed at the text block level. Homogeneous text blocks of size 256 × 256 pixels were extracted from document pages in such a way that page margins and nontextual parts were excluded. A total of 120 text blocks were generated per script, each block containing 10 to 12 textlines. The print quality of the documents, and hence, the quality of the document images was reasonably good, containing very little noise.
We observe that the optical density feature is capable of identifying Chinese and Korean and also Arabic and Urdu to some extent. For other script classes, the recognition rate was well below the acceptable level. This is because the optical density feature is not good enough to discriminate among scripts of similar complexity. The same argument holds for other script features. The Gabor filter method shows relatively better discriminating power in comparison. We noticed that the classification error was mainly due to the misclassification between script pairs like Arabic and Urdu, Chinese and Korean, Devnagari and Bengali, and Devnagari and Gujrati. These pairs of script classes have characters of the same nature and complexity, and even share some common characters. This leads to ambiguity, and hence, the classification error. So, on the whole, we may say that every proposed script identification method and script feature works well only when applied within a small set of script classes. Classification accuracy falls significantly when more scripts of similar nature and origin are included.
As observed in Table 1, almost all works on script recognition are targeted toward machine-printed documents. They have not been tested for script recognition in handwritten documents. In view of the large amount of handwritten documents that need to be processed electronically nowadays, script identification in handwritten documents turns out to be an important research issue. Unfortunately, the script features proposed for printed documents may not always be effective in the case of handwritten documents. Variations in writing style, character size, and interline and interword spacings make the recognition process difficult and unreliable when these techniques are applied to handwritten documents. Variation in writing across a document can be taken care of by using certain statistical features, as proposed in [20]. The textual symbol-based method can also be used but with certain modifications: some shape descriptor features can be derived from the text symbols and the prototypes can be generated through clustering. We demonstrated this approach in an earlier paper [35]. Also, a script class may be represented by multiple models to account for variation in writing from one person to another.
Based on our discussion above, we see that script features are extracted either from a list of connected components like textline, word, and character in a document or from a patch of text that may be a complete paragraph, a text block cropped from the input document, or even the whole document page. Script identification methods that use segmentwise analysis of character structure may hence be regarded as a local approach. On the other hand, visual appearance-based methods that are designed to identify script by analyzing the overall look of a text block may be regarded as a global approach.
As discussed before, many different structural features and methods for script characterization have been proposed over the years. In each of these methods, the features were chosen keeping in view only those script types that were considered therein. Therefore, while these features have been proven to be efficient for script identification within a given set of scripts, they may not be good in separating a wider variety of script classes. Again, structural features cannot effectively discriminate between scripts having similar character shapes, which otherwise may be distinguished by their visual appearances. Another disadvantage with structure-based methods is that they require complex preprocessing involving connected component extraction. Also, extraction of structural features is highly susceptible to noise and poor-quality document images. The presence of noise or significant image degradation adversely affects the location and segmentation of these features, making them difficult or sometimes impossible to extract.
In short, the choice of features in the local approach to script classification depends on the script classes to be identified. Further, the success of classification in this approach depends on the performance of the preprocessing stage, which includes denoising and extraction of connected components. Ironically, document segmentation and extraction of connected components sometimes require the script type to be known a priori. For example, an algorithm that is good for segmenting ideograms in Han may not be equally effective in segmenting alphabetic characters in the Latin script. This presents a paradox in that, for determining the script type, it is necessary to know the script type beforehand. In contrast, text block extraction in visual appearance-based global approaches is simpler and can be employed irrespective of the document’s script. Since here it is not necessary to extract individual script components, such methods are better suited to degraded and noisy documents. Also, global features are more general in nature and can be applied to a broader range of script classes. They have practical importance in script-based retrieval systems because they are relatively fast and reduce the cost of document handling. Thus, visual appearance-based methods prove to be better than structure-based script identification methods in many ways, as listed in
followed by an almost circular curve at the end. These fuzzy rules aid in decision making
during classification.

5 SCRIPT RECOGNITION IN VIDEO TEXT
Script identification is important not only for document analysis but also for text
recognition in images and videos. Text recognition in images and videos is important in the
context of image/video indexing and retrieval. The process includes several preprocessing
steps like text detection, text localization, text segmentation, and binarization before an
OCR algorithm may be applied. As with documents in a multiscript environment, image/video
text recognition in an international environment also requires script identification in order
to apply a suitable algorithm for text extraction and recognition. In view of this, an
approach for discriminating between Latin and Han scripts was developed in [83]. The proposed
approach proceeds as follows: First, the text present in an image or video frame is localized
and size normalized. Then, a set of low-level features is extracted from the edges detected
inside the text region. This includes mean and standard deviation of edge pixels, edge pixel
density, energy of edge pixels, horizontal projection, and Cartesian moments of the edge
pixels. Finally, based on the extracted features, the decision about the type of the script
is made using a KNN classifier. Experimental results have demonstrated the efficiency of the
proposed method, which identified Latin and Han scripts at accuracy rates of 85.5 and
89 percent, respectively.

6 ISSUES IN MULTISCRIPT OCR SYSTEM EVALUATION
In connection with research in script recognition, it is useful and important to develop
benchmarks and methodologies that may be employed to evaluate the performance of multiscript
OCR systems. Some aspects of this problem have been reported in [84], and are discussed below.

The OCR evaluation approaches are broadly classified into two categories: black box
evaluation and white box evaluation. In black box evaluation, only the input and output are
visible to the evaluator. In a white box evaluation procedure, outputs of different modules
comprising the system may be accessed and the total system is evaluated stage by stage.
Nevertheless, the primary issues related to both types of evaluation are recognition accuracy
and processing speed. The parameters that can be varied for the purpose of evaluation are
content, font size and style, print and paper quality, scanning resolution, and the amount of
noise and degradation in the document images.

Needless to say, the overall performance of a multiscript OCR greatly depends on the
performance of the script recognition algorithm used in the system. As with any OCR system,
the efficiency of a script recognizer is mainly assessed on the basis of accuracy and speed.
Another important performance criterion is the minimum size of the document necessary for the
script recognizer to perform reliably. This is to measure how the recognizer performs with
varying document size.

In a multiscript system, another issue of consideration is the writing system adopted by a
script, script complexity, and the size of the character set. Since some scripts are simple
in nature and some are quite complex, a relative comparison of performance across scripts is
a difficult task. For example, Latin is generally simpler in structure and is based on an
alphabetic system. A script identifier that is good at recognizing Latin scripts may not be
so in the case of complex nonalphabetic scripts like Arabic, Han, and Devnagari. Therefore,
in order to evaluate various systems, a standard set of data should be used so that the
evaluation is unbiased. However, it is generally difficult to find document data sets in
different languages/scripts that are similar in content and layout. To address this problem,
Kanungo et al. introduced the Bible as a data set for evaluating multilingual and multiscript
OCR performance [85]. Bible translations are closely parallel in structure, relevant with
respect to modern day language, widely available, and inexpensive. These properties make the
Bible attractive for controlling document content while varying language and script. The
document layout can also be controlled by using synthetically generated page image data.
Other holy books whose translations have similar properties, like the Quran and the Bhagavad
Gita, have also been suggested by some researchers.

One major concern with most of the reported works in script recognition is the lack of any
comparative analysis of the results. Experimental results given for every proposed method
have not been compared with other benchmark works in the field. Moreover, the data sets used
in experiments are all different. This is mainly due to the lack of a standard database for
script recognition research. Consequently, it is hard to assess the results reported in the
literature. Hence, a standard evaluation testbed containing documents written in only one
script type as well as multiscript documents with a mix of different scripts within a
document is necessary. One important consideration in selecting the data set for a script
class is that it should reflect the global probability of occurrence of the characters in
texts written in that particular script. Another concern arises for languages that constantly
undergo spelling modifications and graphemic changes over the years. As a result, if an old
document is chosen as the corpus, then it may not be suitable for evaluating a modern OCR
system. On the other hand, a database of modern documents may not be useful if the goal of
the OCR is to process historic documents. This suggests that the data set should include all
different forms of the same language that evolved with time, with full coverage of the script
alphabet of different languages, and it should be large enough to reflect the statistical
occurrence probability of the characters.

7 CONCLUSION
This paper presents a comprehensive survey on the developments in script recognition
technology, which is an important issue in OCR research in our multilingual multiscript
world. Researchers have attempted to characterize different scripts either by extracting
their structural features or by deriving some visual attributes. Accordingly, many different
script features have been proposed over the years for script identification at different
levels within a document: pagewise, paragraphwise, textlinewise, wordwise, and even
characterwise. Textlinewise and wordwise script
identifications are particularly important for use in a multiscript document. However,
compared to the large arsenal of literature available in the field of document analysis and
optical character recognition, the volume of work on script identification is relatively
thin. The main reason is that most research in the area of OCR has been directed at solving
issues within the scope of the country where the research is conducted. Since most countries
in the world use only one language/script, OCR research in these countries need not bother
determining the script in which a document is written. For instance, the US postal department
invested heavily in developing a system for automatic reading of postal addresses, but under
the assumption that all letters originating or arriving in the US will carry addresses
written in English only. Script recognition is important only in an international environment
or in a country that uses more than one script.

Nonetheless, with recent economic globalization and increased business transactions across
the globe, there has been increased awareness of automatic script recognition among the OCR
community. That is why the majority of the reported works date only from the last decade.
However, it is noted that most of these script recognition methods have been tested on
machine-printed documents only, and their performance on handwritten documents is not known.
In view of this, it would not be wrong to say that script recognition in handwritten
documents is still in its early stage of research. Since the present thrust in OCR research
is in handwritten document analysis, parallel research on script identification in
handwritten documents is in demand. Also, not many of these script recognition techniques
have addressed font variation within a script class. Hence, we can conclude that script
recognition technology still has a way to go, especially for handwritten document analysis.
Therefore, there is an urgent need to work on script recognition of handwritten documents and
on developing font-independent script recognizers.

As is evident from our analysis, development in script recognition technology lacks a
generalized approach to the problem that can handle all different types of scripts under a
common framework. While a particular script feature proves to be efficient within a set of
scripts, it may not be useful in other scripts. To some extent, texture features can be used
universally but cannot be applied reliably at word and character levels within a document.

Finally, we need to create a standard data set for research in this field. This is necessary
to evaluate different script recognition methodologies under the same conditions. The
creation of standard data resources will undoubtedly provide a much needed resource to
researchers working in this field.

REFERENCES
[1] C.Y. Suen, M. Berthod, and S. Mori, “Automatic Recognition of Handprinted Characters—The State of the Art,” Proc. IEEE, vol. 68, no. 4, pp. 469-487, Apr. 1980.
[2] J. Mantas, “An Overview of Character Recognition Methodologies,” Pattern Recognition, vol. 19, no. 6, pp. 425-430, 1986.
[3] V.K. Govindan and A.P. Shivaprasad, “Character Recognition—A Review,” Pattern Recognition, vol. 23, no. 7, pp. 671-683, 1990.
[4] S. Mori, C.Y. Suen, and K. Yamamoto, “Historical Review of OCR Research and Development,” Proc. IEEE, vol. 80, no. 7, pp. 1029-1058, July 1992.
[5] H. Bunke and P.S.P. Wang, Handbook of Character Recognition and Document Image Analysis. World Scientific Publishing, 1997.
[6] N. Nagy, “Twenty Years of Document Image Analysis in PAMI,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 38-62, Jan. 2000.
[7] U. Pal, “Automatic Script Identification: A Survey,” J. Vivek, vol. 16, no. 3, pp. 26-35, 2006.
[8] U. Pal and B.B. Chaudhuri, “Indian Script Character Recognition: A Survey,” Pattern Recognition, vol. 37, no. 9, pp. 1887-1899, Sept. 2004.
[9] L. Peng, C. Liu, X. Ding, and H. Wang, “Multilingual Document Recognition Research and Its Application in China,” Proc. Int’l Conf. Document Image Analysis for Libraries, pp. 126-132, Apr. 2006.
[10] A. Nakanishi, Writing Systems of the World: Alphabets, Syllabaries, Pictograms. Charles E. Tuttle Co., 1980.
[11] F. Coulmas, The Blackwell Encyclopedia of Writing Systems. Blackwell Publishers, 1996.
[12] C. Ronse and P.A. Devijver, Connected Components in Binary Images: The Detection Problem. John Wiley & Sons, 1984.
[13] A.L. Spitz, “Multilingual Document Recognition,” Proc. Int’l Conf. Electronic Publishing, Document Manipulation, and Typography, pp. 193-206, Sept. 1990.
[14] A.L. Spitz and M. Ozaki, “Palace: A Multilingual Document Recognition System,” Proc. IAPR Workshop Document Analysis Systems, pp. 16-37, Oct. 1994.
[15] A.L. Spitz, “Determination of the Script and Language Content of Document Images,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 3, pp. 235-245, Mar. 1997.
[16] D.-S. Lee, C.R. Nohl, and H.S. Baird, “Language Identification in Complex, Unoriented, and Degraded Document Images,” Proc. IAPR Workshop Document Analysis Systems, pp. 76-98, Oct. 1996.
[17] B. Waked, S. Bergler, C.Y. Suen, and S. Khoury, “Skew Detection, Page Segmentation and Script Classification of Printed Document Images,” Proc. IEEE Int’l Conf. Systems, Man, and Cybernetics, vol. 5, pp. 4470-4475, Oct. 1998.
[18] L. Lam, J. Ding, and C.Y. Suen, “Differentiating between Oriental and European Scripts by Statistical Features,” Int’l J. Pattern Recognition and Artificial Intelligence, vol. 12, no. 1, pp. 63-79, Feb. 1998.
[19] J. Hochberg, P. Kelly, T. Thomas, and L. Kerns, “Automatic Script Identification from Document Images Using Cluster-Based Templates,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 19, no. 2, pp. 176-181, Feb. 1997.
[20] J. Hochberg, K. Bowers, M. Cannon, and P. Kelly, “Script and Language Identification for Handwritten Document Images,” Int’l J. Document Analysis and Recognition, vol. 2, nos. 2/3, pp. 45-52, Dec. 1999.
[21] Y. Tho and Y.Y. Tang, “Discrimination of Oriental and Euramerican Scripts Using Fractal Feature,” Proc. Int’l Conf. Document Analysis and Recognition, pp. 1115-1119, Sept. 2001.
[22] B.V. Dhandra, P. Nagabhushan, M. Hangarge, R. Hegadi, and V.S. Malemath, “Script Identification Based on Morphological Reconstruction in Document Images,” Proc. IEEE Int’l Conf. Pattern Recognition, vol. 2, pp. 950-953, Aug. 2006.
[23] S. Chaudhury and R. Sheth, “Trainable Script Identification Strategies for Indian Languages,” Proc. Int’l Conf. Document Analysis and Recognition, pp. 657-660, Sept. 1999.
[24] S.B. Patil and N.V. Subbareddy, “Neural Network Based System for Script Identification in Indian Documents,” Sadhana, vol. 27, no. 1, pp. 83-97, Feb. 2002.
[25] Z. Chi, Q. Wang, and W.-C. Siu, “Hierarchical Content Classification and Script Determination for Automatic Document Image Processing,” Pattern Recognition, vol. 36, no. 11, pp. 2483-2500, Nov. 2003.
[26] S. Kanoun, A. Ennaji, Y. Lecourtier, and A.M. Alimi, “Script and Nature Differentiation for Arabic and Latin Text Images,” Proc. Int’l Workshop Frontiers in Handwriting Recognition, pp. 309-313, Aug. 2002.
[27] L. Zhou, Y. Lu, and C.L. Tan, “Bangla/English Script Identification Based on Analysis of Connected Component Profiles,” Proc. Int’l Workshop Document Analysis Systems, pp. 243-254, Feb. 2006.
[28] U. Pal and B.B. Chaudhuri, “Script Line Separation from Indian Multi-Script Documents,” Proc. Int’l Conf. Document Analysis and Recognition, pp. 406-409, Sept. 1999.
[29] U. Pal and B.B. Chaudhuri, “Identification of Different Script Lines from Multi-Script Documents,” Image and Vision Computing, vol. 20, nos. 13/14, pp. 945-954, Dec. 2002.
2160 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 32, NO. 12, DECEMBER 2010
[30] U. Pal, S. Sinha, and B.B. Chaudhuri, “Multi-Script Line Identification from Indian Documents,” Proc. Int’l Conf. Document Analysis and Recognition, pp. 880-884, Aug. 2003.
[31] A. Elgammal and M.A. Ismail, “Techniques for Language Identification for Hybrid Arabic-English Document Images,” Proc. Int’l Conf. Document Analysis and Recognition, pp. 1100-1104, Sept. 2001.
[32] C.S. Cumbee, Method of Identifying Script of Line of Text, US Patent 7020338, Mar. 2006.
[33] S.-W. Lee and J.-S. Kim, “Multi-Lingual, Multi-Font, Multi-Size Large-Set Character Recognition Using Self-Organizing Neural Network,” Proc. Int’l Conf. Document Analysis and Recognition, vol. 1, pp. 28-33, Aug. 1995.
[34] J. Hochberg, M. Cannon, P. Kelly, and J. White, “Page Segmentation Using Script Identification Vectors: A First Look,” Proc. Symp. Document Image Understanding Technology, pp. 258-264, Apr./May 1997.
[35] D. Ghosh and A.P. Shivaprasad, “Handwritten Script Identification Using Possibilistic Approach for Cluster Analysis,” J. Indian Inst. of Science, vol. 80, pp. 215-224, May/June 2000.
[36] V. Ablavsky and M.R. Stevens, “Automatic Feature Selection with Applications to Script Identification of Degraded Documents,” Proc. Int’l Conf. Document Analysis and Recognition, pp. 750-754, Aug. 2003.
[37] R. Krishnapuram and J.M. Keller, “A Possibilistic Approach to Clustering,” IEEE Trans. Fuzzy Systems, vol. 1, no. 2, pp. 98-110, May 1993.
[38] D. Ghosh and A.P. Shivaprasad, “An Analytic Approach for Generation of Artificial Handprinted Character Database from Given Generative Models,” Pattern Recognition, vol. 32, no. 6, pp. 907-920, June 1999.
[39] D.W. Muir and T. Thomas, Automatic Language Identification by Stroke Geometry Analysis, US Patent 6064767, May 2000.
[40] Y.-H. Liu, C.-C. Lin, and F. Chang, “Language Identification of Character Images Using Machine Learning Techniques,” Proc. Int’l Conf. Document Analysis and Recognition, vol. 2, pp. 630-634, Aug./Sept. 2005.
[41] I. Moalla, A. Elbaati, A.M. Alimi, and A. Benhamadou, “Extraction of Arabic Text from Multilingual Documents,” Proc. IEEE Int’l Conf. Systems, Man, and Cybernetics, http://ieeexplore.ieee.org/iel5/8325/26298/01173266.pdf?arnumber=1173266, Oct. 2002.
[42] I. Moalla, A.M. Alimi, and A. Benhamadou, “Extraction of Arabic Words from Multilingual Documents,” Proc. Conf. Artificial Intelligence and Soft Computing, http://www.actapress.com/PDFViewer.aspx?paperId=18567, Sept. 2004.
[43] C.L. Tan, P.Y. Leong, and S. He, “Language Identification in Multi-Lingual Documents,” Proc. Int’l Symp. Intelligent Multimedia and Distance Education, pp. 59-64, Aug. 1999.
[44] S. Lu, C.L. Tan, and W. Huang, “Language Identification in Degraded and Distorted Document Images,” Proc. Int’l Workshop Document Analysis Systems, pp. 232-242, Feb. 2006.
[45] C.V. Jawahar, M.N.S.S.K. Pavan Kumar, and S.S. Ravi Kiran, “A Bilingual OCR for Hindi-Telugu Documents and Its Applications,” Proc. Int’l Conf. Document Analysis and Recognition, pp. 408-412, Aug. 2003.
[46] S. Sinha, U. Pal, and B.B. Chaudhuri, “Word-Wise Script Identification from Indian Documents,” Proc. IAPR Int’l Workshop Document Analysis Systems, pp. 310-321, Sept. 2004.
[47] S. Chanda, S. Sinha, and U. Pal, “Word-Wise English Devnagari and Oriya Script Identification,” Speech and Language Systems for Human Communication, R.M.K. Sinha and V.N. Shukla, eds., pp. 244-248, Tata McGraw-Hill, 2004.
[48] S. Chanda and U. Pal, “English, Devnagari and Urdu Text Identification,” Proc. Int’l Conf. Cognition and Recognition, pp. 538-545, Dec. 2005.
[49] S. Chanda, R.K. Roy, and U. Pal, “English and Tamil Text Identification,” Proc. Nat’l Conf. Recent Trends in Information Systems, pp. 184-187, July 2006.
[50] M.C. Padma and P. Nagabhushan, “Identification and Separation of Text Words of Kannada, Hindi and English Languages through Discriminating Features,” Proc. Nat’l Conf. Document Analysis and Recognition, pp. 252-260, July 2003.
[51] R. Kumar, V. Chaitanya, and C.V. Jawahar, “A Novel Approach to Script Separation,” Proc. Int’l Conf. Advances in Pattern Recognition, pp. 289-292, Dec. 2003.
[52] K. Roy, U. Pal, and B.B. Chaudhuri, “Address Block Location and Pin Code Recognition for Indian Postal Automation,” Proc. Workshop Computer Vision, Graphics, and Image Processing, pp. 5-9, Feb. 2004.
[53] K. Roy, S. Vajda, U. Pal, B.B. Chaudhuri, and A. Belaid, “A System for Indian Postal Automation,” Proc. Int’l Conf. Document Analysis and Recognition, vol. 2, pp. 1060-1064, Aug./Sept. 2005.
[54] K. Roy, D. Pal, and U. Pal, “Pin-Code Extraction and Recognition for Indian Postal Automation,” Proc. Nat’l Conf. Recent Trends in Information Systems, pp. 192-195, July 2006.
[55] K. Roy and U. Pal, “Word-Wise Hand-Written Script Separation for Indian Postal Automation,” Proc. Int’l Workshop Frontiers in Handwriting Recognition, pp. 521-526, Oct. 2006.
[56] K. Roy, U. Pal, and B.B. Chaudhuri, “Neural Network Based Word-Wise Handwritten Script Identification System for Indian Postal Automation,” Proc. Int’l Conf. Intelligent Sensing and Information Processing, pp. 240-245, Jan. 2005.
[57] S.L. Wood, X. Yao, K. Krishnamurthi, and L. Dang, “Language Identification for Printed Text Independent of Segmentation,” Proc. Int’l Conf. Image Processing, vol. 3, pp. 428-431, Oct. 1995.
[58] T.N. Tan, “Rotation Invariant Texture Features and Their Use in Automatic Script Identification,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 7, pp. 751-756, July 1998.
[59] L. O’Gorman and R. Kasturi, Document Image Analysis. IEEE CS Press, 1995.
[60] G.S. Peake and T.N. Tan, “Script and Language Identification from Document Images,” Proc. Asian Conf. Computer Vision, pp. 97-104, Jan. 1998.
[61] R.M. Haralick, K. Shanmugam, and I. Dinstein, “Textural Features for Image Classification,” IEEE Trans. Systems, Man, and Cybernetics, vol. 3, no. 6, pp. 610-621, Nov. 1973.
[62] W.M. Pan, C.Y. Suen, and T.D. Bui, “Script Identification Using Steerable Gabor Filters,” Proc. Int’l Conf. Document Analysis and Recognition, vol. 2, pp. 883-887, Aug./Sept. 2005.
[63] V. Singhal, N. Navin, and D. Ghosh, “Script-Based Classification of Hand-Written Text Documents in a Multilingual Environment,” Proc. Int’l Workshop Research Issues in Data Eng.—Multi-Lingual Information Management, pp. 47-54, Mar. 2003.
[64] J. Cheng, X. Ping, G. Zhou, and Y. Yang, “Script Identification of Document Image Analysis,” Proc. Int’l Conf. Innovative Computing, Information, and Control, vol. 3, pp. 178-181, Aug./Sept. 2006.
[65] A.K. Jain and Y. Zhong, “Page Segmentation Using Texture Analysis,” Pattern Recognition, vol. 29, no. 5, pp. 743-770, May 1996.
[66] A. Busch, W.W. Boles, and S. Sridharan, “Texture for Script Identification,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 11, pp. 1720-1732, Nov. 2005.
[67] A. Busch, “Multi-Font Script Identification Using Texture-Based Features,” Proc. Int’l Conf. Image Analysis and Recognition, pp. 844-852, Sept. 2006.
[68] G.D. Joshi, S. Garg, and J. Sivaswamy, “Script Identification from Indian Documents,” Proc. IAPR Int’l Workshop Document Analysis Systems, pp. 255-267, Feb. 2006.
[69] W. Chan and G.G. Coghill, “Text Analysis Using Local Energy,” Pattern Recognition, vol. 34, no. 12, pp. 2523-2532, Dec. 2001.
[70] H. Ma and D. Doermann, “Gabor Filter Based Multi-Class Classifier for Scanned Document Images,” Proc. Int’l Conf. Document Analysis and Recognition, pp. 968-972, Aug. 2003.
[71] S. Jaeger, H. Ma, and D. Doermann, “Identifying Script on Word-Level with Informational Confidence,” Proc. Int’l Conf. Document Analysis and Recognition, vol. 1, pp. 416-420, Aug./Sept. 2005.
[72] D. Dhanya, A.G. Ramkrishnan, and P.B. Pati, “Script Identification in Printed Bilingual Documents,” Sadhana, vol. 27, no. 1, pp. 73-82, Feb. 2002.
[73] D. Dhanya and A.G. Ramkrishnan, “Script Identification in Printed Bilingual Documents,” Proc. IAPR Int’l Workshop Document Analysis Systems, pp. 13-24, Aug. 2002.
[74] D. Dhanya and A.G. Ramkrishnan, “Optimal Feature Extraction for Bilingual OCR,” Proc. IAPR Int’l Workshop Document Analysis Systems, pp. 25-36, Aug. 2002.
[75] P.B. Pati, S. Sabari Raju, N. Pati, and A.G. Ramakrishnan, “Gabor Filters for Document Analysis in Indian Bilingual Documents,” Proc. Int’l Conf. Intelligent Sensing and Information Processing, pp. 123-126, Jan. 2004.
[76] P.B. Pati and A.G. Ramakrishnan, “HVS Inspired System for Script Identification in Indian Multi-Script Documents,” Proc. Int’l Workshop Document Analysis Systems, pp. 380-389, Feb. 2006.
[77] A.L. Spitz, “Script and Language Determination from Document Images,” Proc. Ann. Symp. Document Analysis and Information Retrieval, pp. 229-235, Apr. 1994.
[78] J.J. Lee, B.K. Sin, and J.H. Kim, “On-Line Mixed Character Recognition Using an HMM Network,” Proc. KISS Ann. Conf., vol. 20, no. 2, pp. 317-320, Oct. 1993.
[79] J.J. Lee, J.H. Kim, and M. Nakajima, “A Hierarchical HMM Network-Based Approach for On-Line Recognition of Multi-Lingual Cursive Handwritings,” IEICE Trans. Information and Systems, vol. E81-D, no. 8, pp. 881-888, Aug. 1998.
[80] A.M. Namboodiri and A.K. Jain, “Online Script Recognition,” Proc. Int’l Conf. Pattern Recognition, vol. 3, pp. 736-739, Aug. 2002.
[81] A.M. Namboodiri and A.K. Jain, “Online Handwritten Script Recognition,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 1, pp. 124-130, Jan. 2004.
[82] A. Malaviya and L. Peters, “Fuzzy Handwriting Description Language: FOHDEL,” Pattern Recognition, vol. 33, no. 1, pp. 119-131, Jan. 2000.
[83] J. Gllavata and B. Freisleben, “Script Recognition in Images with Complex Backgrounds,” Proc. IEEE Int’l Symp. Signal Processing and Information Technology, pp. 589-594, Dec. 2005.
[84] B.B. Chaudhuri, “On Multi-Script OCR System Evaluation,” Proc. Int’l Workshop Performance Evaluation Issues in Multi-Lingual OCR, http://www.kanungo.com/workshop/abstracts/chaudhuri.html, Sept. 1999.
[85] T. Kanungo, P. Resnik, S. Mao, D.-W. Kim, and Q. Zheng, “The Bible and Multilingual Optical Character Recognition,” Comm. ACM, vol. 48, no. 6, pp. 124-130, June 2005.

Tulika Dube received the BTech degree in electronics and communication engineering from the
Indian Institute of Technology Guwahati in 2006. Soon after her graduation, she joined the
Indian Division of British Telecom at Bangalore, and later moved to Ibibo Web Pvt. Ltd.,
Gurgaon, India, as a software engineer. Between 2007 and 2009, she worked as a senior
software engineer with Infovedics Software Pvt. Ltd., Noida, India. She received a search
developer certification from FAST University, Norway, in 2007. She is currently working
toward the management degree at the Indian Institute of Management, Ahmedabad.

Adamane P. Shivaprasad received the BE, ME, and PhD degrees in electrical communications
engineering from the Indian Institute of Science, Bangalore, in 1965, 1967, and 1972,
respectively. He is currently a guest professor in the Department of Electronics and
Communication Engineering, Sambhram Institute of Technology, Bangalore, India. He was a
member of the academic staff of the Department of Electrical Communication Engineering,
Indian Institute of Science, Bangalore, from 1967 until he retired as a professor in 2006.
His research interests include design of micropower VLSI circuits, intelligent
instrumentation, communication systems, and pattern recognition.