
2142 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 32, NO. 12, DECEMBER 2010

Script Recognition—A Review


Debashis Ghosh, Tulika Dube, and Adamane P. Shivaprasad

Abstract—A variety of different scripts are used in writing languages throughout the world. In a multiscript, multilingual environment, it
is essential to know the script used in writing a document before an appropriate character recognition and document analysis algorithm
can be chosen. In view of this, several methods for automatic script identification have been developed so far. They mainly belong to
two broad categories—structure-based and visual-appearance-based techniques. This survey report gives an overview of the different
script identification methodologies under each of these categories. Methods for script identification in online data and video-texts are
also presented. It is noted that the research in this field is relatively thin and still more research is to be done, particularly in the case of
handwritten documents.

Index Terms—Document analysis, optical character recognition, script identification, multiscript document.

1 INTRODUCTION

One interesting and challenging field of research in pattern recognition is Optical Character Recognition (OCR). Optical character recognition is the process in which a paper document is optically scanned and then converted into computer-processable electronic format by recognizing and associating a symbolic identity with every individual character in the document.

With the increasing demand for creating a paperless world, many OCR algorithms have been developed over the years [1], [2], [3], [4], [5], [6]. However, most OCR systems are script specific in the sense that they can read characters written in one particular script only. Script is defined as the graphic form of the writing system used to write statements expressible in language. This means that a script class refers to a particular style of writing and the set of characters used in it. Languages throughout the world are typeset in many different scripts. A script may be used by only one language or may be shared by many languages, sometimes with slight variations from one language to another. For example, Devnagari is used for writing a number of Indian languages like Sanskrit, Hindi, Konkani, Marathi, etc.; English, French, German, and some other European languages use different variants of the Latin alphabet; and so on. Some languages even use different scripts at different points of time and space. One good example of this is Malay, which uses the Latin alphabet nowadays, replacing the previously used Jawi. Another example is Sanskrit, which is mainly written in Devnagari in India but is also written in Sinhala script in Sri Lanka.

Therefore, in this multilingual and multiscript world, OCR systems need to be capable of recognizing characters irrespective of the script in which they are written. In general, recognition of different script characters in a single OCR module is difficult. This is because the features necessary for character recognition depend on the structural properties, style, and nature of writing, which generally differ from one script to another. For example, features used for recognition of English alphabets are, in general, not good for recognizing Chinese logograms. Another option for handling documents in a multiscript environment is to use a bank of OCRs corresponding to all different scripts expected to be seen. The characters in an input document can then be recognized reliably by selecting the appropriate OCR system from the OCR bank. Nevertheless, this will require knowing a priori the script in which the input document is written. Unfortunately, this information may not be readily available. At the same time, manual identification of the documents’ scripts may be tedious and time-consuming. Therefore, automatic script recognition techniques are necessary to identify the script in the input document and then redirect it to the appropriate character recognition module, as illustrated in Fig. 1.

A script recognizer is also useful in reading multiscript documents in which different paragraphs, text blocks, textlines, or words in a page are written in different scripts. Fig. 2 shows several examples of multiscript documents. Analysis of such documents works in two stages—identification and separation of different script regions in the document, followed by reading of each individual script region using the corresponding OCR system.

Script identification also serves as an essential precursor for recognizing the language in which a document is written. This is necessary for further processing of the document, such as routing, indexing, or translation. For scripts used by only one language, script identification itself accomplishes language identification. For scripts shared by many languages, script recognition acts as the first level of classification, followed by language identification within the script.

Script recognition also helps in text area identification, video indexing and retrieval, and document sorting in

. D. Ghosh is with the Department of Electronics and Computer Engineering, Indian Institute of Technology Roorkee, Roorkee, Uttarakhand 247 667, India. E-mail: [email protected].
. T. Dube is with the Indian Institute of Management Ahmedabad, Dorm 2, Room 31, Vastrapur, Ahmedabad, Gujrat 380 015, India. E-mail: [email protected].
. A.P. Shivaprasad is with the Department of Electronics and Communication Engineering, Sambhram Institute of Technology, 915/33, 7th Cross, 13th Main, Mathikere, Bangalore, Karnataka 560 054, India. E-mail: [email protected].

Manuscript received 6 July 2007; revised 16 Apr. 2008; accepted 22 July 2009; published online 21 Jan. 2010.
Recommended for acceptance by D. Lopresti.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPAMI-2007-07-0408.
Digital Object Identifier no. 10.1109/TPAMI.2010.30.

0162-8828/10/$26.00 © 2010 IEEE. Published by the IEEE Computer Society.
Fig. 1. Stages of document processing in a multiscript environment.

Fig. 2. Examples of multiscript document images: (a) a government report in China containing a mix of Chinese and English words, (b) a medical report in Arabic containing words in English that do not have an exact Arabic equivalent, (c) a portion of an official application form in India containing different script lines typeset in Hindi and English.

digital libraries when dealing with a multiscript environment. Text area detection refers to either segmenting out text blocks from other nontextual regions, like halftones, images, line drawings, etc., in a document image, or extracting text printed against textured backgrounds and/or embedded in images within a document. To do this, the system takes advantage of script-specific distinctive characteristics of text which make it stand out from other nontextual parts in the document. Text extraction is also required in images and videos for content-based browsing. One powerful index for image/video retrieval is the text appearing in them. Efficient indexing and retrieval of digital image/video in an international scenario therefore requires text extraction followed by script identification and then character recognition. Similarly, text found in documents can be used for their annotation, indexing, sorting, and retrieval. Thus, script identification plays an important role in building a digital library containing documents written in different scripts.

In short, automatic script identification is crucial to meet the growing demand for electronic processing of volumes of documents written in different scripts. This is important for business transactions across Europe and the Orient, and has great significance in a country like India, which has many official state languages and scripts. Due to this, there has been a growing interest in multiscript OCR technology during recent years. A brief survey on methods for script recognition was reported earlier in [7], with emphasis on script identification in Indian multiscript documents but little insight into the script recognition methods for non-Indian scripts. A review of script identification research for Indian documents is also available in [8]. A report on the key technologies in multilingual OCR and their application in building a multilingual digital library can also be found in [9].

In this paper, we present a comprehensive survey of different script recognition techniques developed mainly for identification of certain major scripts of the world, viz., Chinese, Japanese, Korean, Arabic, Hebrew, Latin, Cyrillic, and the Brahmic family of Indian scripts. To begin with, in Section 2, we give a brief description of different script types, highlighting their main discriminating features. Methods for script recognition in document images are described in Section 3, giving comparative analysis among them. Section 4 discusses several methods for script recognition in the realm of pen computing. As said before, script identification in video text is also important. However, not much research has been done on this topic. The only work that we have found on this is outlined in Section 5. Section 6 raises issues related to performance evaluation of multiscript OCR systems. Finally, we state our concluding remarks in Section 7, including some insights on the recent trends and future scope of work in this field.

2 WRITING SYSTEMS AND SCRIPTS OF THE WORLD

In the context of script recognition, it may be worth studying the characteristics of various writing systems and the structural properties of the characters used in certain major scripts of the world. In Fig. 3, we draw a tree diagram showing different classes of writing systems. As said in [10], [11] and depicted in the tree diagram, there are six prominent writing systems. Major scripts that follow each of these writing systems are also shown in the tree diagram and are described below.

2.1 Logographic System

A logogram, also called an ideogram, refers to a symbol that graphically represents a complete word. Accordingly, the number of characters in a script for an ideographic writing system generally runs into thousands. This makes recognition of logographic characters a difficult but interesting problem.

An example of logographic script is Han, which is mainly associated with Chinese. Japanese and Korean writings also include Han modified as Kanji and Hanja, respectively. Han characters are generally composed of multiple short strokes, giving them a complex and dense look, distinctly different from other Western and Asian scripts. Accordingly, character optical density and certain other visual appearance-based features have been utilized by many researchers in distinguishing Han from other scripts. Another interesting property of Han is its directionality—words in a textline are written either from left to right or from top to bottom.

2.2 Syllabic System

In a syllabic system, every written symbol represents a phonetic sound or syllable, as used in Japanese. The symbols representing the Japanese syllables are known as Kanas, which are of two types—Hirakana and Katakana. As indicated in Fig. 3, Japanese script uses a mix of logographic Kanji and syllabic Kanas. Hence, it is visually similar to Chinese, but less dense due to the presence of simpler Kanas in between the logograms.

2.3 Alphabetic System

An alphabet is a set of characters representing phonemes of a spoken language. Examples of scripts following this system

Fig. 3. Tree diagram showing broad classification of prominent writing systems and scripts of the present world.

are Greek, Latin, Cyrillic, and Armenian. The Latin script, also called Roman script, is used by many languages throughout the world with varying degrees of modifications from one language to another. It is used for writing many European languages like English, Italian, French, German, Portuguese, Spanish, etc., and has been adopted in many Amerindian and Austronesian languages, including the modern Malay, Vietnamese, and Indonesian languages. Fig. 4 shows a few such variants of the Latin script. Compared to other scripts, classical Latin characters are simple in structure, mainly composed of a few lines and arcs. The other major script under the alphabetic system is Cyrillic. This script is used by some languages of Eastern Europe, Asia, and Slavic regions that include Bulgarian, Russian, Macedonian, Ukrainian, Mongolian, etc. The basic properties of this script are somewhat similar to that of Latin except that it uses a different alphabet set. Some characters in the Cyrillic alphabet are also borrowed from Latin and Greek, modified with cedillas, crosshatches, or diacritical marks. This induces recognition ambiguity among Cyrillic, Latin, and Greek.

Fig. 4. Examples of some languages using the Latin alphabet with different modifications. (a) English. (b) German. (c) Vietnamese.

2.4 Abjads

The Abjad system of writing is similar to the alphabetic system, but has symbols for consonantal sounds only. Unlike most other scripts in the world, Abjads are written from right to left within a textline. This unique feature is particularly useful for identifying Abjad-based scripts in pen computing.

Two important scripts under this category are Arabic and Hebrew. A typical Arabic character is formed of a long main stroke along with one to three dots. The characters in a word are generally conjoined, giving an overall cursive appearance to the written text. This provides an important clue for the recognition of Arabic script. The same applies to some other scripts of Arabic origin, such as Farsi (Persian), Urdu, Sindhi, Jawi, etc. On the other hand, character strokes in Hebrew are more uniform in length and the letters in a word are generally discrete.

2.5 Abugidas

Abugida is another alphabetic-like writing system used by the Brahmic family of scripts that originated from the ancient Indian Brahmi script and includes nearly all of the scripts of India and southeast Asia. In Fig. 5, we draw a tree diagram to illustrate the evolution of major Brahmic scripts in India and southeast Asia. The northern group of Brahmic scripts (e.g., Devnagari, Bengali, Manipuri, Gurumukhi, Gujrati, and Oriya) bears a strong resemblance to the original Brahmi script. On the other hand, scripts in south India (Tamil, Telugu, Kannada, and Malayalam) as well as in southeast Asia (e.g., Thai, Lao, Burmese, Javanese, and Balinese) are derived from Brahmi through many changes and so look quite different from the northern group. One important characteristic of Devnagari, Bengali, Gurumukhi, and Manipuri is that the characters in a word are generally written together without spaces so that the top bar is unbroken. This results in the formation of a headline, called shirorekha, at the top of each word. Accordingly, these scripts can be separated from other script types by detecting the presence of a large number of horizontal lines in the textual portions of a document.

2.6 Featural System

The last significant form of writing system is the featural system in which the symbols or characters represent the features that make up the phonemes. One prominent script of this sort is the Korean Hangul. As indicated in Fig. 3, the Korean script is formed by mixing logographic Hanja with featural Hangul. However, modern Korean contains more of Hangul than Hanja. Consequently, the Korean script is relatively less complex and less dense compared to the Chinese and Japanese, containing more circles and ellipses.
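The shirorekha cue described under Abugidas above is straightforward to operationalize: in a binarized word image, the headline appears as a row near the top whose ink run covers nearly the full word width. The following sketch is our illustration only, not code from any of the surveyed systems; the 0.8 coverage threshold and the restriction to the upper half of the image are assumptions chosen for the example.

```python
def has_headline(bitmap, coverage=0.8):
    """Heuristic shirorekha test: True if some row in the upper half of a
    binarized word image (1 = ink) covers at least `coverage` of the image
    width. Threshold and search region are illustrative assumptions."""
    height, width = len(bitmap), len(bitmap[0])
    for row in bitmap[: max(1, height // 2)]:  # the headline sits near the top
        if sum(row) >= coverage * width:
            return True
    return False

# Toy word images: a Devnagari-like word has an unbroken top bar,
# a Latin-like word does not.
devnagari_like = [
    [1, 1, 1, 1, 1, 1, 1, 1],  # solid top bar (shirorekha)
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 1, 0, 1, 0],
]
latin_like = [
    [0, 1, 0, 0, 1, 1, 0, 0],
    [1, 0, 1, 0, 1, 0, 1, 0],
    [1, 1, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1, 1, 0, 0],
]
print(has_headline(devnagari_like))  # True
print(has_headline(latin_like))      # False
```

A textline most of whose words pass this test would be flagged as Devnagari- or Bengali-like before any finer classification.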

Fig. 5. The Brahmic family of scripts used in India and southeast Asia.

3 SCRIPT RECOGNITION METHODOLOGIES

Script identification relies on the fact that each script has unique spatial distribution and visual attributes that make it possible to distinguish it from other scripts. So, the basic task involved in script recognition is to devise a technique to discover these features from a given document and then classify the document’s script accordingly. Based on the nature of approach and features used, these methods may be divided into two broad categories—structure-based and visual appearance-based methods. Script recognition techniques in each of these two categories may be further classified on the basis of the level at which they are applied inside a document image, viz., pagewise, paragraphwise, textlinewise, and wordwise. The application mode of a method depends on the minimum size of the text from which the features proposed in the method can be extracted reliably. Various algorithms under each of these categories are summarized below.

3.1 Structure-Based Script Recognition

In general, script classes differ from each other in their stroke structure and connections and the writing styles associated with the character sets they use. One approach to script recognition may be to extract connected components (continuous runs of pixels) in a document [12] and then analyze their shapes and structures so as to reveal the intrinsic morphological characteristics of the script used in the document. In machine-printed Latin, Greek, Han, etc., every individual character or part of a character is a connected component. On the other hand, in cursive handwritten documents, the characters in a word or part of a word can touch each other to form one single connected component. Likewise, in scripts like Devnagari, Bengali, Arabic, etc., a word or a part of a word forms a connected component. Script identification methods that are based on extraction and analysis of connected components fall under the category of structure-based methods.

3.1.1 Pagewise Script Identification Methods

A script identification method that relies on the spatial relationship of character structures was developed by Spitz for differentiating Han and Latin scripts in machine-printed documents. In his first work on this topic [13], he used character optical density for classifying individual textlines in a document as being either English or Japanese. In another paper, Spitz used the vertical distribution of upward concavities in characters for discriminating Han from Latin with 100 percent success in continuous production use [14]. Later, he developed a two-stage classifier in [15] by combining these two features. In the first stage, Latin is separated from Han-based scripts by comparing the variances of their upward concavity distributions. Further classification within the Han-based scripts is performed by analyzing the distribution of optical density in the text image. The system also has provisions for language identification within documents using the Latin alphabet by observing the most frequently occurring character shape codes. A schematic diagram showing the flow of information in the process is given in Fig. 6.
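As a rough illustration of the two statistics underlying Spitz’s classifier, the sketch below computes optical density (ink pixels per bounding-box cell) and locates upward concavities (two black runs on one scanline joined by a single black run on the scanline below) for a toy bitmap. The helper names and the toy glyph are ours, not Spitz’s; his actual system aggregates such measurements over a whole page and compares the variances of the concavity distributions.

```python
def runs(row):
    """Return (start, end) index pairs of consecutive 1-runs in a row."""
    out, start = [], None
    for i, v in enumerate(row + [0]):  # sentinel 0 closes a trailing run
        if v and start is None:
            start = i
        elif not v and start is not None:
            out.append((start, i - 1))
            start = None
    return out

def optical_density(bitmap):
    """Ink pixels per cell of the bounding box (the density feature)."""
    h, w = len(bitmap), len(bitmap[0])
    return sum(map(sum, bitmap)) / (h * w)

def upward_concavity_rows(bitmap):
    """Rows at which an upward concavity occurs: two black runs on one
    scanline spanned by a single black run on the scanline below."""
    found = []
    for y in range(len(bitmap) - 1):
        upper, lower = runs(bitmap[y]), runs(bitmap[y + 1])
        for a, b in zip(upper, upper[1:]):  # adjacent pairs of upper runs
            if any(s <= a[0] and e >= b[1] for s, e in lower):
                found.append(y)
    return found

# A "u"-shaped toy glyph: two vertical strokes closed at the bottom.
u_glyph = [
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
print(round(optical_density(u_glyph), 3))  # 0.667
print(upward_concavity_rows(u_glyph))      # [1]
```

On a real page, the vertical positions returned by `upward_concavity_rows` would be normalized to the textline height and pooled across characters before comparing their variance against script-specific profiles.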

Fig. 6. Spitz’s method of script identification.

Fig. 7. Hochberg et al.’s method of script identification in printed documents.

The above works by Spitz were extended by Lee et al. [16] and Waked et al. [17] by incorporating some additional features. In [16], the script of a printed document is identified via textlinewise script recognition, followed by a majority vote of the already decided textline classification results. The features used are character height distribution and the top and bottom profiles of character bounding boxes, in addition to the upward concavity distribution and optical density features. Experimental results showed that these features can separate Han-based (Chinese and Japanese) documents from Latin-based (English, French, German, Italian, and Spanish) documents in 98.16 percent of cases. In [17], Waked et al. used bounding box size distribution, character density distribution, and horizontal projections for classifying printed documents written in Han, Latin, Cyrillic, and Arabic. These statistical features are more robust compared to the structural features proposed by Spitz and Lee et al. However, Waked et al. achieved an accuracy rate of only 91 percent when tested on documents of varying kinds, diverse formats, and qualities. This drop in recognition accuracy is mainly due to the misclassification between Latin and Cyrillic scripts, which are similar looking under this measure. Also, some test documents of extremely poor quality account for this degradation in performance.

Script identification in machine-printed documents using statistical features has also been explored by Lam et al. [18]. In a first level of classification, documents are classified as Latin, Chinese, Japanese, or Korean using horizontal projection profiles, height distributions of connected components, and enclosing structure of connected components. Non-Latin documents that cannot be recognized in this stage are classified in a second level of recognition using structural features like character complexity, presence of circles, ellipses, and vertical strokes. In the process, more than 95 percent correct recognition was achieved.

The fact that every script class is composed of some “textual symbols” of unique characteristic shapes had been exploited by Hochberg et al. in identifying the script of a printed document [19]. First, textual symbols obtained from documents of a known script are resized and clustered to generate template symbols for that script class, as depicted in Fig. 7. Textual symbols include character fragments, discrete characters, adjoined characters, and even whole words. During classification, textual symbols extracted from the input document are compared to the template symbols using Hamming distance and then scored against every script class on the basis of their distances from the best match template symbols in that script class. The script class with the best average score is chosen as the script of the document. Hochberg et al. tested their method on as many as 13 scripts, viz., Arabic, Armenian, Burmese, Chinese, Cyrillic, Devanagari, Ethiopic, Greek, Hebrew, Japanese, Korean, Latin, and Thai, and obtained 96 percent accuracy.

Hochberg et al. [20] proposed a feature-based approach for script identification in handwritten documents and achieved 88 percent accuracy in distinguishing Arabic, Chinese, Cyrillic, Devnagari, Japanese, and Latin. In their method, a handwritten document is characterized in terms of the mean, standard deviation, and skew of five features, which are relative vertical centroid, relative horizontal centroid, number of holes, sphericity, and aspect ratio, of the connected components in a document page. A set of Fisher Linear Discriminants (FLDs), one FLD for every pair of script classes, is used for classification. The document is finally assigned to the script class to which it is classified most often. A schematic diagram showing different stages of the system is given in Fig. 8.

A novel approach to script identification using fractal features was proposed in [21] and had been utilized for discriminating printed Chinese, Japanese, and Devnagari scripts. Fractal features are obtained by computing fractal signatures for the patterns extracted from a document image. The fractal signature is determined by the area of the surface onto which a gray-level function corresponding to the document image is mapped.

A method for script identification in printed document images based on morphological reconstruction was proposed in [22]. In this method, morphological erosion and opening by reconstruction is carried out on the document image in horizontal, vertical, right, and left diagonal directions using line structuring elements. The average pixel distributions in the resulting images give the measures of horizontal, vertical, 45, and 135 degree slanted lines present in the document page. Finally, script identification is carried out using nearest neighbor classification. The method showed robustness with respect to noise, font sizes, and styles, and an average classification accuracy of 97 percent was achieved when applied for classification of four script classes, viz., Latin, Devnagari, Urdu, and Kannada.

Fig. 8. Hochberg et al.’s method of script identification in handwritten documents.

3.1.2 Script Identification at Paragraph and Text Block Level

The script identification methods discussed above require large blocks of input text so that sufficient information is

Fig. 10. Neural network-based architecture for script identification


proposed by Patil and Reddy.

Fig. 9. Chaudhury and Sheth’s three methods of script identification. without performing any feature extraction. The network
consists of four layers with 49 nodes in the input layer,
available to bring out the characteristics of the script. They 15 and 20 nodes in the hidden layers, and two nodes in the
offer good performance when used for script identification at output layer that correspond to the two script classes. The
the page level, but may not retain their performance when nodes in the input layer are fed with pixel values in a block of
applied on a smaller block of text. In multiscript documents, size 7  7 pixels. A number of sample blocks are randomly
it is necessary to identify and separate different script extracted from the input text block, and the script of the text
regions like paragraph, textline, word, or even character in block is then determined by a simple majority vote among
the document page. This is particularly important in a the sampling blocks. Experiments on a number of mixed-
country like India that hosts a variety of scripts like type document images showed the effectiveness of the
Devnagari, Bengali, Tamil, Telugu, Kannada, Malayalam, proposed system, yielding 92.3 and 95 percent accuracy in
Gujrati, Gurumukhi, Oriya, Manipuri, Urdu, Sindhi, and determining the Chinese and English texts, respectively.
Latin. In view of this, several multiscript OCR systems A method for Arabic and Latin text block differentiation
involving more than one Indian script in a single unit have in both printed and handwritten scripts was proposed in
been developed [8]. Multiscript OCR systems that perform [26]. This method is based on morphological analysis at the
script recognition at the paragraph level are now described. text block level and geometrical analysis at textline and
Fig. 9 shows three different strategies developed by connected component levels. Experimental evaluation of
Chaudhury and Sheth [23] to recognize the script of a text the method was carried out on two different data sets
block in a printed document. In the first technique, the containing 400 and 335 text blocks, and the results obtained
script of the text block is described in terms of the Fourier were quite promising.
coefficients of the horizontal projection profile. Subsequent In an attempt to build automatic letter sorting machines
classification is based on euclidean distance in the eigen- for Bangladesh post offices, an algorithm for Bengali/
space. The other two schemes are based on features derived English script identification was developed recently [27].
from connected components in text blocks—one using the The method is designed for application to both machine-
means and standard deviations of the outputs for a six- printed and handwritten address blocks on envelope
channel Gabor filter and the other using distribution of the images. The two scripts under consideration are recognized
width-to-height ratio of the connected components present on the basis of the aggregate distance of the pixels in the
in the document. Classification in both of these cases is topmost and the bottommost profiles of the connected
accomplished using Mahalanobis distance. The average recognition rate obtained with these methods, when tested on Latin, Devnagari, Telugu, and Malayalam scripts, was approximately 85, 95, and 89 percent, respectively.

In [24], a neural network-based architecture was developed for identification of printed Latin, Devnagari, and Kannada scripts. It consists of a feature extractor followed by a modular neural network, as shown in Fig. 10. In the feature extraction stage, a feature vector corresponding to pixel distributions along specified directions is obtained via morphological operations. The modular neural network structure consists of three independently trained feed-forward neural networks, one for each of the three scripts under consideration. The input is assigned to the script class of the network which produces maximum output. It was seen that such a system can classify English and Kannada with 100 percent accuracy, while the rate is slightly lower (97 percent) in recognizing Devnagari.

Script recognition using a feed-forward neural network was also performed in [25]. The network is trained to classify an input printed text block into Han or Latin directly.

components—an English text image has these two distance measures almost equal, whereas their difference in a Bengali text image is quite large. It was observed in the experiments that the accuracy of this script identification method is quite high for printed text (98 and 100 percent for English and Bengali, respectively) and, for handwritten text, the proposed approach can achieve a satisfactory accuracy of about 95 percent.

3.1.3 Textlinewise Script Identification

The earliest work we have found on textlinewise script identification in Indian documents was reported by Pal and Chaudhuri in [28]. The method uses projection profile, statistical and topological features, and stroke features for decision-tree-based classification of printed Latin, Urdu, Devnagari, and Bengali script lines. Later, they proposed an automatic system for the identification of Latin, Chinese, Arabic, Devnagari, and Bengali textlines in printed documents [29]. As depicted in Fig. 11, the headline ("shirorekha") information is used first to separate Devnagari and Bengali script lines from Latin, Chinese, and Arabic script lines.
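The headline test that drives this first split—Devnagari and Bengali characters hang from a continuous horizontal bar—amounts to looking for a near-full-width ink row near the top of the textline. A minimal sketch, with an illustrative coverage threshold and search zone rather than the actual parameters of [29]:

```python
import numpy as np

# Hedged sketch of a headline ("shirorekha") test of the kind used in [29]
# to split Devnagari/Bengali lines from Latin, Chinese, and Arabic: look
# for a row, in the upper part of a binarized textline, whose ink covers
# most of the line width. The 0.75 coverage threshold and the "upper half"
# zone are illustrative choices, not the published parameters.

def has_headline(line_img, coverage=0.75):
    """line_img: 2D 0/1 array (1 = ink). True if a near-full-width
    ink row exists in the upper half of the textline."""
    upper = line_img[: line_img.shape[0] // 2]
    row_ink = upper.sum(axis=1)
    return bool((row_ink >= coverage * line_img.shape[1]).any())
```

A textline passing this test would be routed to the Devnagari/Bengali branch of the cascade; the remaining branches then apply the stroke and run-based tests described below.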
Fig. 11. Pal and Chaudhuri's method for script line separation from multiscript documents in India.

Fig. 12. Elgammal and Ismail's technique for script identification in Arabic-English documents.

Next, Bengali script lines are distinguished from Devnagari by observing the presence of certain script-specific principal strokes. Similarly, Chinese textlines are identified by checking the existence of characters with four or more vertical runs. Finally, Latin (English) textlines are separated from Arabic using statistical as well as water reservoir-based features. Statistical features include the distribution of lowermost points in the characters—the lowermost points of characters in a printed English textline lie only along the baseline and the bottom line, while those in Arabic are more randomly distributed. Water reservoir-based features give a measure of the cavity regions in a character. Based on all of these structural characteristics, the identification rates obtained were, respectively, 97.32, 98.65, 97.53, 96.05, and 97.12 percent for Latin, Chinese, Arabic, Devnagari, and Bengali scripts, with an overall accuracy of 97.33 percent.

A more generalized scheme for script line identification in printed multiscript documents that can classify as many as 12 Indian scripts, viz., Devnagari, Bengali, Latin, Gujrati, Kannada, Kashmiri, Malayalam, Oriya, Gurumukhi, Tamil, Telugu, and Urdu, is available in [30]. Features chosen in the proposed method are headlines, horizontal projection profile, water reservoir-based features, left and right profiles, and a feature based on jump discontinuity, which refers to the maximum horizontal distance between two consecutive border pixels in a character pattern. Experimental results show an average script line identification accuracy of 97.52 percent.

A method for discriminating Arabic text and English text using connected component analysis was proposed by Elgammal and Ismail in [31]. They tested their method on several machine-printed documents containing a mix of these two languages and achieved a recognition rate as high as 99.7 percent. Features used for distinguishing Arabic from Latin are the number of peaks and the moments in the horizontal projection profile, and the distribution of run-lengths over the location-length space. The horizontal projection profile of an Arabic textline generally has a single peak, while that of an English textline has two major peaks. Thus, Arabic script can be distinguished from Latin by detecting the number of peaks in the horizontal projection profile. The other features they used for discriminating Arabic and Latin scripts are the third and fourth central moments of the horizontal projection profiles. The third moment measures the skew, while the fourth moment measures the kurtosis that describes how flat the profile is. It is seen that the horizontal projection profile for English is more symmetric and flat compared to that of the Arabic. Therefore, the moments in the case of English text are generally smaller than those of the Arabic text. Script classification using these features is done in a two-layer feed-forward network. The basic steps of processing in this method are illustrated in Fig. 12. The algorithm was also applied for script identification at the word level and a recognition rate of 96.8 percent was achieved.

Script identification using character component n-grams was recently patented by Cumbee [32]. First, character segments extracted from training documents of a known script are clustered using K-means clustering and then replaced by their corresponding cluster identification number. Thus, every line of text is converted into a sequence of numbers. This sequence of numbers is then analyzed to determine all the n-grams present in it and a weight corresponding to the frequency of occurrence is defined for each n-gram. During recognition, n-grams are generated in a similar fashion by comparing character segments in the input textline to the K-means cluster centroids of a known script. These are then compared to the n-grams present in the training documents of that script. The input is subsequently scored against that script class by adding the weights of the best-match n-grams. The script of the input textline is determined to be the script against which it scores the highest.

3.1.4 Script Identification at Word/Character Level

Compared to the paragraph and textline-level identifications, script recognition at the word level in a multiscript document is generally more difficult. This is because the information available from only a few characters in a word may not be sufficient for the purpose. This has motivated many researchers to take up this challenging problem in script identification. Some have even attempted to do script identification at the character level. However, script recognition at the character level is generally not required in practice. This is because the script usually changes only from one word to the next and not from one character to another within a word.

In one of the earliest works on script identification at the character level, Lee and Kim tried to solve the problem using self-organizing networks [33]. The network is able to determine the script of every individual character in a machine-printed multiscript document and classify them into four groups—Latin, Chinese, Korean, and mixed. Characters in the mixed group which cannot be classified in the network with full confidence are classified in the next level of fine classification using learning vector quantization. In order to evaluate the performance of the proposed scheme, experiments with 3,367,200 characters were carried out and a recognition rate of over 98.27 percent was obtained.
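The two-level strategy in [33]—accept the coarse classifier's decision when it is confident, and defer ambiguous "mixed" characters to a finer classifier—follows a generic cascade pattern that can be sketched as below. The stand-in classifiers and the 0.8 confidence threshold are illustrative, not Lee and Kim's trained networks:

```python
# Sketch of the coarse-to-fine cascade in [33]: a coarse classifier labels
# each character, and only low-confidence ("mixed") characters are passed
# to a finer second-stage classifier (learning vector quantization in the
# published system). The toy classifiers below are placeholders.

def cascade_classify(char_features, coarse, fine, threshold=0.8):
    """coarse/fine: callables returning a (label, confidence) pair."""
    label, conf = coarse(char_features)
    if conf >= threshold:
        return label
    # Ambiguous at the coarse level: defer to fine classification.
    label, _ = fine(char_features)
    return label

# Toy stand-ins for the two stages.
coarse = lambda v: ("Latin", 0.95) if v[0] > 0.5 else ("mixed", 0.30)
fine = lambda v: ("Korean", 1.0)

print(cascade_classify([0.9], coarse, fine))  # -> Latin
print(cascade_classify([0.1], coarse, fine))  # -> Korean
```

The design saves the cost of the fine classifier on the (typically large) fraction of characters that the coarse stage already separates well.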
Fig. 13. Ghosh and Shivaprasad's method of script identification for handwritten characters/words using pattern spectrum and possibilistic measure.

An extension of Hochberg et al.'s work in [19] includes separation of different script regions in a machine-printed multiscript document [34]. In this work, every textual symbol (character, word, or part of a word) in a document is matched to a set of template symbols, as in [19], and is classified to the script class of the best matching template symbol. It was observed that the method offers good separation in all cases except in the case of visually similar scripts, such as Latin/Cyrillic and Latin/Greek. The best separation was observed in visually distinct script pairs like Latin/Arabic, Latin/Japanese, and Latin/Korean.

Methods that employ clustering for generating script-specific prototype symbols, much like the procedure by Hochberg et al., were proposed in [35], [36]. In both of these methods, classification algorithms are not based on direct shape matching, as in Hochberg's method, but use matching of shape description features of connected components and/or characters. The shape description features used in [35] are the pattern spectrum coefficients of every individual character in a string of isolated handwritten characters. During training, prototype symbols for each script class are obtained via possibilistic clustering [37]. In the recognition phase, the algorithm calculates the degree to which every character in a string belongs to each of the script classes using the possibilistic measure defined in [37]. The character string is classified to that script class for which the accumulated possibilistic measure is maximum. The basic structure of the proposed system is shown in Fig. 13. The method was tested on several artificially generated [38] strings of handwritten numeric characters in four different scripts, viz., Arabic, Devnagari, Bengali, and Kannada, and a recognition rate as high as 96 percent was achieved.

Ablavasky and Stevens reported a similar work [36], but for machine-printed documents. The algorithm processes a stream of connected components and assigns a script label when enough evidence has been accumulated to make the decision. The method uses geometric properties like Cartesian moments and compactness for shape description. The likelihood of every input textual symbol belonging to each of the script classes is calculated using K-nearest neighbor (KNN) classification. This approach was shown to be quite efficient, yielding 97 percent success rate in discriminating similar looking Latin and Cyrillic scripts.

In another structural approach to script identification, stroke geometry has been utilized for script characterization and identification [39]. Another new approach for identifying the script type of character images in printed documents was proposed in [40]. Individual character images in a document are classified either by applying prototype classification or by using support vector machine. Both of the methods were implemented successfully in classifying characters into Latin, Chinese, and Japanese.

Extraction of Arabic words from among a mix of printed Arabic-English words has gained attention in recent times [41], [42]. The method proposed in [41] is based on recognition of Arabic characters or character segments in a word. First, a database containing templates of Arabic character segments is generated through training. A word is supposed to be Arabic if the percentage of matching character segments in the word exceeds a user-defined value. Otherwise, the word is considered to be written in English (Latin). Experimental results showed 100 percent recognition accuracy on 30 text blocks containing a total of 478 words. The method in [42] is also based on recognition of Arabic characters in the document, but via feature matching. Features used are morphological and statistical features such as overlapping and inclusion of bounding boxes, horizontal bar, low diacritics, height and width variation of connected components, etc. Recognition accuracy achieved with this method was 98 percent.

Wordwise script identification using character shape codes was proposed by Tan et al. [43] and Lu et al. [44]. In [43], shape codes generated using basic document features like elongation of bounding boxes of character cells and the position of upward concavities are used to identify Latin, Han, and Tamil in printed document images. The method in [44] captures word shapes on the basis of local extremum points and horizontal intersections. For each script under consideration, a word shape template is first constructed based on a word shape coding scheme. Identification is then accomplished using Hamming distance between the word shape code of a query image and the previously constructed templates. Experimental tests demonstrated 99 percent recognition accuracy in discriminating eight Latin-based scripts/languages.

As noted before, multiscript document processing is important in a multiscript country such as India. Consequently, script recognition at the word level involving Indian scripts is an important topic of research for the OCR community. Indian scripts are, in general, of two types—one that has headlines ("shirorekha") on top of the characters (e.g., Devnagari, Bengali, and Gurumukhi) and the other that does not carry headlines (e.g., Gujrati, Tamil, Telugu, Malayalam, and Kannada). Based on this, a bilingual OCR for printed documents was developed in [45] that identifies Devnagari and Telugu scripts by observing the presence and absence of shirorekha. The classification result is further supported with context information; if the previous word is Devnagari (or Telugu), the next word is also in Devnagari (Telugu) unless a strong clue suggests otherwise. The proposed method was tested extensively on several Hindi-Telugu documents with recognition accuracies that vary in the range from 92.3 to 99.86 percent.

The script line identification techniques in [29], [30] were modified in [46], [47] for script word separation in printed Indian multiscript documents by including some new features, in addition to the features considered earlier. The features used are headline feature, distribution of vertical strokes, water reservoir-based features, shift below headline, left and right profiles, deviation feature, loop, tick feature, and left inclination feature. Tick feature refers to the distinct "tick"-like structure, called telakattu, present at the top of many Telugu characters. This helps in separating Telugu script from other scripts.
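Decision rules of this kind test for script-specific structures in a fixed order and branch on their presence. A toy sketch with two placeholder detectors (the actual classifiers in [46], [47] combine many such features in a tree):

```python
# Sketch of a wordwise feature cascade in the spirit of [46], [47]: test
# for script-specific structures (headline, Telugu tick/telakattu, ...) in
# a fixed order and branch on their presence. The detector callables are
# placeholders for the image-based tests described in the text.

def classify_word(word_img, has_headline, has_tick):
    """Toy two-feature decision rule; real systems use many more features."""
    if has_headline(word_img):
        return "Devnagari/Bengali group"   # headline ("shirorekha") scripts
    if has_tick(word_img):
        return "Telugu"                    # telakattu present
    return "other"
```

Each branch of the real tree would be refined further (e.g., Bengali versus Devnagari by principal strokes), with the word assigned at a leaf.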
Fig. 14. Examples of Telugu characters having the tick feature.

Fig. 14 shows a few Telugu characters having this feature. The overall accuracy in script word separation using this proposed set of features was about 97.92 percent when applied to five script pairs, viz., Devnagari/Bengali, Bengali/Latin, Malayalam/Latin, Gujrati/Latin, and Telugu/Latin. Finally, based on this script word separation algorithm, systems for recognizing English, Devnagari, and Urdu [48], and English and Tamil [49] have been developed in recent years. In this context, a script word discrimination system proposed by Padma and Nagabhushan [50] also deserves mentioning. The system uses several discriminating structural features for identification and separation of Latin, Hindi, and Kannada words in Indian multiscript documents in a manner similar to the above.

The basic system for blockwise script identification in [24] was modified further so as to accomplish script recognition at the word level. The modified system architecture consists of a preprocessor that separates out individual words in a machine-printed document, followed by a modified feature extractor and a probabilistic neural network classifier. The probabilistic network is a two-layered structure composed of a radial basis layer followed by a competitive layer. Experiments yielding 98.89 percent classification accuracy demonstrate the effectiveness of such a script classification system.

A neural network structure employing script recognition at the character level in printed documents was presented in [51]. Script separation at the word level can also be achieved by combining the outputs of the character-level classification using the Viterbi algorithm. The algorithm was tested on five scripts commonly used in India, namely, Latin, Devnagari, Bengali, Telugu, and Malayalam, and an average recognition accuracy of 97 percent was achieved.

MLP neural networks have also been employed for script identification in Indian postal automation systems developed by Roy et al. [52], [53], [54], [55], [56]. In India, people generally tend to write addresses either in English only or English mixed with the local language/script. This calls for script identification at the word and character levels. In their earliest work [52], they developed a method for locating the address block and extracting postal code from the address. In [53], [54], a two-stage neural network-based general classifier is used for the recognition of postal code digits written in Arabic or Bengali numerals. Since there exist shape similarities between some Arabic and Bengali numerals, the final assignment of script class is done in a second stage using majority voting. It was noted that the accuracy of the classifier was 98.42 percent in printed and about 89 percent in handwritten postcodes. Methods for wordwise script recognition in postal addresses using features like the water reservoir concept, headline ("shirorekha"), etc., in a tree classifier were proposed in [55]. Based on this, a two-stage MLP network was constructed in [56] that accomplishes wordwise script recognition in Indian postal addresses at more than 96 percent accuracy.

3.2 Appearance-Based Script Recognition

Script types generally differ from each other by the shape of individual characters and the way they are grouped into words, words into sentences, etc. This gives different scripts distinctively different visual appearances. Therefore, one natural way of identifying the script in which a document is written may be on the basis of its visual appearance as seen at a glance by a casual observer, without really analyzing the character patterns in the document. Accordingly, several features that describe the visual appearance of a script region have been proposed and used for script identification by many researchers, as described below.

3.2.1 Pagewise Script Identification Methods

One early attempt to characterize the script of a document without actually analyzing the structure of its constituent connected components was made by Wood et al. [57]. They proposed using vertical and horizontal projection profiles of document images for determining scripts in machine-generated documents. They argued that the projection profiles of document images are sufficient to characterize different scripts. For example, Roman script shows dominant peaks at the top and bottom of the horizontal projection profile, while Cyrillic script has a dominant midline and Arabic script has a strong baseline. On the other hand, the Korean characters usually have a peak on the left of the vertical projection profile. However, the authors did not suggest how these projection profiles can be analyzed automatically for script determination without any user intervention. Also, they did not present any recognition results to substantiate their argument.

Since visual appearance is often related to texture, a block of text corresponding to each script class forms a distinct texture pattern. Thus, the problem of script identification essentially boils down to a texture analysis problem and one may employ any available texture classification algorithm to perform the task. In accordance with this, Tan developed Gabor function-based texture analysis for machine-printed script identification that yielded an accuracy as high as 96.7 percent in discriminating printed Chinese, Latin, Greek, Russian, Persian, and Malayalam script documents [58]. In the first step of this method, a uniform text block on which texture analysis can be performed is produced from the input document image via the method given in [59]. Texture features are then extracted from the text block using a 16-channel Gabor filter with channels at a fixed radial frequency of 16 cycles/sec and at 16 equally spaced orientations. The average response of every channel provides a characteristic measure for the script that is robust to noise but rotation dependent. In order to achieve invariance to rotation, Fourier coefficients for this set of 16 channel outputs are calculated. During classification, a feature vector generated from the input text block is compared to the class-representative feature vectors using a weighted (variance normalized) euclidean distance measure, as depicted in Fig. 15. A representative feature vector for a script class is obtained by computing the mean feature vector obtained from a large set of training documents written in that script.
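The rotation-invariance step exploits the fact that rotating the page circularly shifts the 16 orientation-channel responses, and the DFT magnitude is invariant to circular shifts. A minimal numpy sketch of this feature pipeline, under the assumption that the Gabor filtering itself has already been done (`responses` stands in for the 16 channel output images):

```python
import numpy as np

# Sketch of rotation-invariant Gabor features in the spirit of [58]: the
# mean response of each of 16 equally spaced orientation channels is
# computed; a page rotation circularly shifts this 16-vector, so the
# magnitude of its DFT is invariant to such shifts. Classification
# compares feature vectors to class means with a variance-normalized
# euclidean distance. The filter bank itself is omitted here.

def channel_means(responses):
    """Mean absolute response per channel; responses: (16, H, W) array."""
    return np.abs(responses).mean(axis=(1, 2))

def rotation_invariant(features16):
    """|DFT| of the 16 channel means: invariant to circular shifts."""
    return np.abs(np.fft.fft(features16))

def classify(feature, class_means, class_vars):
    """Nearest class under variance-normalized euclidean distance."""
    best, best_d = None, np.inf
    for script in class_means:
        d = np.sum((feature - class_means[script]) ** 2 / class_vars[script])
        if d < best_d:
            best, best_d = script, d
    return best
```

As a quick check of the invariance claim, `rotation_invariant(v)` equals `rotation_invariant(np.roll(v, k))` for any shift `k`, since a circular shift only changes the phase of the DFT coefficients.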
Fig. 15. Tan's script identification system using Gabor function-based rotation-invariant features.

One drawback with the above method is that the text blocks extracted from the input documents do not necessarily have uniform character spacing. In view of this, Peake and Tan extended this work in [60], where they used some simple preprocessings to obtain uniform text blocks from the input printed document. These include textline location, outsized textline removal, spacing normalization, and padding. Documents are also skew compensated so that it is not necessary to generate rotation-invariant features. For the purpose of feature extraction, gray-level co-occurrence matrices (GLCMs) and a multichannel Gabor filter are used in independent experiments. GLCMs represent pairwise joint statistics of the pixels in an image and have long been used as a means for characterizing texture [61]. In Gabor filter-based feature extraction, a 16-channel filter with four frequencies at four orientations is used. These two approaches for texture feature extraction were applied to machine-printed documents written in seven different scripts (adding Korean to the six scripts used earlier in [58]). Script identification was then performed using KNN classification. It was seen that the GLCM approach yields only 77.14 percent accuracy at best while the Gabor filter approach yields an accuracy rate as high as 95.71 percent.

One problem encountered in Gabor filter-related applications is the high computational cost due to the frequent image filtering. In order to reduce the cost of computation, script identification in machine-printed documents using steerable Gabor filters was proposed in [62]. The method offers twofold advantages. First, the steerability property of the Gabor filter is exploited to reduce the high computational cost. Second, the Gabor filter bank is appropriately designed so that the extracted rotation-invariant features can discriminate scripts containing characters that are similar in shape and even share many characters. In this paper, a 98.5 percent recognition rate was achieved in discriminating the Chinese, Japanese, Korean, and Latin scripts, while the number of image filtering operations was significantly reduced by 40 percent.

Although the above Gabor function-based script recognition schemes have shown good performance, their application is limited to machine-printed documents only. Variations in writing style, character size, and interline and interword spacings make the recognition process difficult and unreliable when these techniques are applied directly on handwritten documents. Therefore, it is necessary to preprocess the document images prior to the application of the Gabor filter so as to compensate for the different variations present. This has been addressed in the texture-based script identification scheme proposed in [63]. In the preprocessing stage, the algorithm employs denoising, thinning, pruning, m-connectivity, and text size normalization in sequence. Texture features are then extracted using a multichannel Gabor filter. Finally, different scripts are classified using fuzzy classification. In this proposed system, an overall accuracy of 91.6 percent was achieved in classifying handwritten documents written in four different scripts, namely, Latin, Devnagari, Bengali, and Telugu.

Another visual attribute that has been used in many image processing applications is histogram statistics, which reflects spatial distribution of gray levels in an image. In a recent work [64], Cheng et al. proposed using normalized histogram statistics for the purpose of script identification in documents typeset in Latin, Chinese, Cyrillic, or Japanese. In this work, every line of text in an input document is divided into three zones—ascender zone between top line and x-line, x-zone between x-line and baseline, and descender zone between baseline and bottom line. Then, a horizontal projection is obtained for each textline that gives the zonewise distribution of character pixels in a textline. It is observed that Latin and Cyrillic characters mainly distribute in the x-zone with two significant peaks located on the x-line and baseline. The baseline peak is higher than the x-line peak in Latin, while they are almost equal in Cyrillic. The Chinese characters, on the other hand, have a relatively random distribution, without any peak in the profile. The Japanese characters also have the same random distribution, but the average height of the profile is significantly lower. Thus, it is possible to separate out every script from other scripts by analyzing the distribution of character pixels in different zones inside a document.

3.2.2 Script Identification at Paragraph and Text Block Level

The use of texture features in script identification was considered by Jain and Zhong for discriminating printed Chinese and English documents [65]. This paper in fact proposed a texture-based language-free page segmentation algorithm which automatically extracts text, halftone, and line-drawing regions from input gray-scale document images. An extension of this page segmentation procedure provides for further segmentation of the text regions into different script regions. First, a set of optimal texture discrimination masks is created through neural network training. Next, texture features are obtained by convolving the trained masks with the input image. These features are then used for classification.

The use of other texture features for script classification, other than GLCM and Gabor energy features, has been explored by Busch et al. [66]. The features that they used are wavelet energy features, wavelet log mean deviation features, wavelet co-occurrence signatures, wavelet log co-occurrence features, and wavelet scale co-occurrence signatures. They tested these features on a database containing eight different script types—Latin, Han, Japanese, Greek, Cyrillic, Hebrew, Devnagari, and Farsi. In their experiments, machine-printed document images of size 64 × 64 pixels were first binarized, skew corrected, and text block normalized, in line with the work done by Peake and Tan in [60]. In order to reduce the dimensionality of the feature vectors while improving classification accuracy, the Fisher linear discriminant analysis technique is applied. Classification is performed using a Gaussian Mixture Model (GMM) classifier, which models each script class as a combination of Gaussian distributions.
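Such a mixture-model decision—score a feature vector under each script's Gaussian mixture and pick the best—can be sketched as below, with one-dimensional features and hand-picked mixture parameters purely for illustration (the published system fits the parameters from data, as described next):

```python
import math

# Sketch of GMM-based script scoring in the spirit of [66]: each script
# class is modeled as a weighted mixture of Gaussians over the texture
# feature space, and a feature vector is assigned to the class with the
# highest mixture likelihood. The 1D features and hand-picked mixture
# parameters below are illustrative only.

def gaussian(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture_likelihood(x, components):
    """components: list of (weight, mean, variance) triples."""
    return sum(w * gaussian(x, m, v) for w, m, v in components)

def best_script(x, class_models):
    """Assign x to the script class with the highest mixture likelihood."""
    return max(class_models, key=lambda c: mixture_likelihood(x, class_models[c]))

models = {
    "Latin": [(0.6, 0.0, 1.0), (0.4, 2.0, 1.0)],
    "Han":   [(0.5, 5.0, 1.0), (0.5, 7.0, 1.0)],
}
print(best_script(1.0, models))  # -> Latin
```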
Fig. 16. Classification hierarchy in Joshi et al.'s script identification scheme.

The GMM classifier is trained using a version of the expectation maximization (EM) algorithm. In order to create a more stable and global script model, a maximum a posteriori (MAP) adaptation-based method was also proposed. It was seen that the wavelet log co-occurrence outperforms all other texture features for script classification (only 1 percent classification error) while GLCM features yielded the worst overall performance (9.1 percent classification error). This indicates that pixel relationships at small distances are insufficient to characterize the script of a document image appropriately.

However, a single model per script class is useful only when every script is written using only one font or using only visually similar fonts. On the contrary, there typically exist a large number of fonts, often of widely varying appearance, within a given script. Because of such variations, it is unlikely that a model trained on one set of fonts will correctly identify an image of a previously unseen font of the same script. For example, classification error increases from 1 and 9.1 percent in [66] to 15.9 and 13.2 percent in cases of wavelet log co-occurrence and GLCM features, respectively. In view of this, Busch proposed characterizing multiple fonts within a single script more adequately by using multiple models per script class [67]. This is done by partitioning each script class into 10 subclasses, each subclass corresponding to one font included within that script class. This is followed by linear discriminant analysis and classification using the modified MAP-GMM classifier as above. Such a classification system provides significant improvement when compared to the results obtained using a single model—classification error reduces to 2.1 and 12.5 percent for the above two cases, respectively.

Script identification in Indian printed documents using oriented local energy features was performed in [68]. Local energy is defined as the sum of squared responses of a pair of conjugate symmetric Gabor filters. In an earlier work, Chan and Coghill [69] derived a set of descriptors from oriented local energy and demonstrated their utility in script classification. In line with human perception, the features chosen are energy distribution, the ratio of energies for two nonadjacent channels, and the horizontal projection profile. The distribution of energy across differently oriented channels of a Gabor filter differs from one script to another. While this feature captures the global differences among scripts, a closer analysis of the energy distribution may be necessary to reveal finer differences between similar looking scripts. This is provided by the ratios between energies at the output of nonadjacent channel pairs. Finally, there are certain scripts which are distinguishable only by the stroke structures used in the upper part of the words. For example, Devnagari and Gurumukhi differ in the shape of the matra present above the headline ("shirorekha"). Horizontal projection is used to discover this information. One major advantage with these features is that it is not necessary to perform analysis at multiple frequencies but at only one optimal frequency. This helps in reducing the computational cost. Again, filter response can be enhanced by increasing filter bandwidth at this optimal frequency. Accordingly, the filters employed in [68] are log-Gabor filters designed for one empirically determined optimal frequency and at eight equispaced orientations. For an input text block of size 100 × 100 pixels, the aforementioned features are calculated and then classified into different script classes using a KNN classifier. The scheme was tested on 10 different scripts commonly used in India and an overall classification accuracy of 97.11 percent was achieved. The scripts used included Devnagari, Bengali, Tamil, Kannada, Malayalam, Gurumukhi, Oriya, Gujrati, Urdu, and Latin. Fig. 16 illustrates how these 10 different Indian scripts are classified using these features in two levels of hierarchy.

3.2.3 Script Identification at Word/Character Level

While all of the texture-based script identification methods described above work on a document page or a text block, script identification at the word level was successfully implemented in [70], [71], [72], [73], [74], [75], [76]. In the works by Ma et al. [70], [71], Gabor filter analysis is applied to each word in a bilingual document to extract features characterizing the script in which that particular word is written. Subsequently, a 2-class classifier system is used to discriminate the two different scripts contained in the input document. Different classifier architectures based on SVM, KNN, weighted euclidean distance, and GMM are considered. A classifier system consisting of a single classifier may consist of any of the above four architectures, while a multiple classifier system is built by combining two or more of them. In a multiple classifier system, the classification scores from each of the different component classifiers are combined using the sum-rule to arrive at the final decision. In their papers, Ma et al. considered bilingual documents containing combinations of one Latin-based language (mainly English) and one non-Latin language (e.g., Arabic, Chinese, Hindi, or Korean). It was observed that while the performance for English-Hindi documents was quite good (97.51 percent recognition rate using KNN classifier), script identification in English-Arabic documents had the lowest performance (90.93 percent using SVM classifier). Moreover, it was established that a multiple classifier system can consistently outperform the single classifier systems (98.08 and 92.66 percent in case of English-Hindi and English-Arabic documents, respectively, using a combination of KNN and SVM classifiers).

A visual-appearance-based approach has also been applied to identify and separate script words in Indian multiscript documents. In [72], [73], two different approaches to script identification at the word level in printed bilingual (Latin and Tamil) documents are presented. The first method structures words into three distinct spatial zones and utilizes the information about the spatial spread of the words in these zones.
GHOSH ET AL.: SCRIPT RECOGNITION—A REVIEW 2153

Fig. 17. Dhanya et al.'s two approaches to script identification in Tamil-English documents.

the directional energy distribution of words using Gabor filters with suitable frequencies and orientations. The algorithms are based on the following observations:

1. The spatial spread of Roman characters mostly covers the middle and upper zones; only a few lowercase characters spread to the lower zone.
2. The Roman alphabet contains more vertical and slanted strokes.
3. In Tamil, the characters mostly spread to the upper and lower zones.
4. There is a dominance of horizontal and vertical strokes in Tamil.
5. The aspect ratio of Tamil characters is generally more than that of the Roman characters.

These suggest that the features that may play a major role in discriminating Roman and Tamil script words are the spatial spread of the words and the direction of orientation of the structural elements of the characters in the words. The spatial feature is obtained by calculating zonal pixel concentration, while the directional features are available as responses of Gabor filters. The extracted features are classified using SVM, Nearest Neighbor, or KNN classifiers. A block schematic diagram of the system is presented in Fig. 17. It was observed that the directional features possess better discriminating capabilities than the spatial features, yielding as high as 96 percent accuracy with an SVM classifier. This may be attributed to the fact that Gabor filters can take into account the general nature of scripts better.

Dhanya and Ramkrishnan also attempted to recognize and separate out different script characters in printed Tamil-Roman documents using zonal occupancy information along with some structural features [74]. For this, they proposed a hierarchical scheme for extracting features from characters and classifying them accordingly. Based on the zonal occupancy of characters, the scheme divides the combined alphabet set into four groups—characters that occupy all three zones (Group 1), characters that occupy the middle and lower zones (Group 2), characters that occupy the middle and upper zones (Group 3), and characters that occupy the middle zone only (Group 4). Groups 3 and 4 are further divided on the basis of the presence or absence of a loop structure in the character. This is followed by feature extraction, feature transformation, and finally, nearest neighbor classification. Features that may be extracted from a character are geometric moments, DCT coefficients, or DWT coefficients. Feature space transformation is required for dimension reduction while enhancing class discrimination. Three methods are proposed for the purpose—PCA, FLD, or maximization of divergence. The whole process is explained pictorially in Fig. 18. The proposed scheme yielded recognition accuracies of 94 percent and above when tested on 20 document samples, each containing a minimum of 300 characters.

Fig. 18. Stages of character classification in a printed Tamil-Latin document.

In [75], a Gabor function-based multichannel directional filtering approach is used for both text area separation and script identification at the word level. It may be assumed that the text regions in a document are predominantly high-frequency regions. Hence, a filter-bank approach may be useful in discriminating text regions from nontext regions. The script classification system using a Gabor filter with four radial frequencies and four orientations showed a high degree of classification accuracy (minimum 96.02 percent and maximum 99.56 percent) when applied to bilingual documents containing Hindi, Tamil, or Oriya along with English words. In an extended version of this work [76], the method was applied to documents containing three scripts and five scripts. In this filter bank approach to script recognition, the Gabor filter bank uses three different radial frequencies and six different angles of orientation. For decision making, two different classifiers are considered—a linear discriminant classifier and the commonly used nearest neighbor classifier. It was observed in several experiments that both of the classifiers perform well with Gabor feature vectors, although in some cases, the nearest neighbor classifier performs marginally better—the average accuracy obtained in the case of triscript documents was 98.4 and 98.7 percent with the linear discriminant and nearest neighbor classifiers, respectively. The highest recognition accuracy obtained was 99.7 percent using the nearest neighbor classifier in a biclass problem, while the lowest attained recognition rate was 97.3 percent.

3.3 Comparative Analysis
Table 1 summarizes some of the benchmark work in script recognition. Various script features used by different researchers are listed in this table. However, the results they reported, although quite encouraging on most occasions, were obtained using only a selected number of script classes in their experiments. This leaves open the question of how these script features will perform when applied to scripts other than those considered in their works. Therefore, it is important to investigate the discriminative power of each script identification feature proposed in the literature before
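As an illustration of the Gabor channel-energy features used by the word-level methods above, the sketch below computes the normalized energy distribution across oriented channels and the ratios between nonadjacent (90-degree-apart) channel pairs. This is only a hedged NumPy sketch, not the implementation of [68] or [75]; the kernel size, radial frequency, and four orientations are assumed values that a real system would tune per script.

```python
import numpy as np

def gabor_kernel(freq, theta, sigma=4.0, size=21):
    """Real (even) Gabor kernel: cosine carrier at spatial frequency
    `freq` (cycles/pixel) along orientation `theta`, under a Gaussian."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    gauss = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    return gauss * np.cos(2 * np.pi * freq * xr)

def channel_energies(img, freq=0.25, n_orient=4):
    """Normalized energy per oriented Gabor channel, plus the ratios
    between energies of nonadjacent channel pairs (cf. the features
    discussed above)."""
    img = img - img.mean()  # remove DC so channels compare structure, not brightness
    energies = []
    for k in range(n_orient):
        kern = gabor_kernel(freq, k * np.pi / n_orient)
        # frequency-domain convolution with the kernel zero-padded to img size
        resp = np.fft.irfft2(np.fft.rfft2(img) * np.fft.rfft2(kern, img.shape),
                             img.shape)
        energies.append(float(np.sum(resp ** 2)))
    total = sum(energies) or 1.0
    energies = [e / total for e in energies]
    ratios = [energies[k] / (energies[(k + 2) % n_orient] + 1e-12)
              for k in range(n_orient)]
    return energies, ratios
```

For a text block, the energy vector summarizes the dominant stroke orientations: a pattern of vertical stripes, for instance, concentrates its energy in the channel whose carrier varies horizontally (theta = 0 here), which is the kind of asymmetry these features exploit between scripts.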
2154 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 32, NO. 12, DECEMBER 2010

TABLE 1
Script Recognition Methods

one may use it for the purpose. In view of this, a comparative analysis between different methods and script features is desirable.

One important structural feature for script recognition used by Spitz and some others is the character optical density. This is the measure of character pixels inside a character bounding box, which is distinctly very high in scripts using complex ideographic characters. Structurally simple Arabic characters, on the other hand, are low in density. All other scripts across Europe and Asia show more or less the same medium character density. Therefore, while this feature may be good in separating out Han, on one hand, and Arabic, on the other, it does not help much in bringing out the difference between moderately complex scripts like Latin, Cyrillic, Brahmic scripts, etc. The second discriminating feature that Spitz used is the location of upward concavities in characters. An upward concavity is formed when a run of character pixels spans the white gap between two runs of character pixels on the scanline just above it. As a result, upward concavities in a character are observed at points where two or more character strokes join. Accordingly, ideograms composed of multiple strokes show many more upward concavities per character compared to other scripts. As observed by Spitz [77], there are usually at most two or three upward concavities in a single Latin character, while Han characters have many more upward concavities per character that are evenly distributed along the vertical axis. However, we observe that most other scripts also show two or three upward concavities, the same as in the Latin script. So, upward concavity is good for separating Han from others but not good for discrimination among non-Han scripts, except perhaps for Cyrillic, which contains a few more upward concavities compared to other non-Han scripts. Another problem with these two features is that they highly depend on document quality. Broken character segments may result in detection of false upward concavities, while noise contributes to the optical density measure. Non-Han documents tend to be misclassified as Han-based Oriental ones if the document quality is poor, because many characters are either broken or noisy. In order to cope with such situations, features like character height distribution, character bounding box profiles, horizontal projections, and several other statistical features were proposed in [16], [17], [18]. These features do not depend on the document quality and resolution but on the overall size of the connected components. However, these features are not invariant to character size and font and offer high performance only in separating distinctly different Oriental scripts from other non-Han scripts.

Several different structural features, like character geometry, occurrence of certain stroke structures and structural primitives, stroke orientations, measure of cavity regions, side profiles, etc., that directly relate to the character shape have also been used for script characterization. However, while some features show marked difference between two scripts, measures of other features may be the same between that script pair. For example, while Devnagari and Gujrati can be easily identified using "shirorekha" and water reservoir-based features, character aspect ratio and character moments do not show much difference. This is because many Gujrati letters are exactly the same as their Devnagari counterparts with the headline ("shirorekha") removed. Again, there are features that are optimal in one script pair but not in another pair. For example, the presence of "shirorekha" may be a good feature for discriminating Latin and Devnagari, but not at all useful in separating Devnagari and Bengali. Therefore, in order to separate out a script from all other scripts, one may need to check a large pool of structural features before any decision can be taken. This may result in the curse of dimensionality. So, a better option may be to do the classification using different sets of features at different levels of hierarchy, as proposed in some of the works above. Another option is to learn the script characteristics in a neural network, as in [25], without bothering about the features to be used for classification. However, a larger network with a greater number of hidden units may be necessary for reliable recognition as more and more script classes are included.

Compared to the above, Hochberg et al.'s method is more versatile. The method is based on discovering frequent characters/symbols in every script class and storing them in the system database for matching during classification. Therefore, in principle, the method can identify any number of scripts of varied nature and font as long as they are included in the training set. It is possible to apply the method in a common framework to scripts containing discrete and connected characters, alphabetic and nonalphabetic scripts, and so on, as demonstrated in [19], [34]. However, it is not difficult to realize that the classification error due to ambiguity will increase if the system includes script classes that use similar looking characters or even share many common characters. Therefore, Hochberg's method may not be suitable in a multiscript country like India, where most scripts have the same line of origin. Nevertheless, it offers invariance to font size and computational simplicity. This is because textual symbols are size-normalized and the algorithm uses simple binary shape matching without any feature value calculation.

Another important feature proposed by Wood et al. and used by many researchers is the horizontal projection. This gives a measure of the spatial spread of the characters in a script that provides an important clue to script identification. Some scripts can be identified by detecting the peaks in the projection profile, e.g., Arabic scripts having a strong baseline show a peak at the bottom of the profile while Brahmic scripts with "shirorekha" show a peak at the top, and so on. However, this feature also is not good for separating scripts of similar nature and structure. For example, Devnagari, Bengali, and Gurumukhi will show the same peak in the profile due to "shirorekha"; Arabic, Urdu, and Farsi have the same lower peak. Hence, this feature has not been used alone but mostly in combination with other structural features.

A better approach to script identification is via texture feature extraction using a multichannel Gabor filter that provides a model for the human vision system. This means that the Gabor filter offers a powerful tool to extract visual attributes from a document. This has motivated many researchers to employ Gabor filters for script determination. Since the texture feature gives the general appearance of a script, it can be derived from any script class of any nature. Accordingly, this feature may be considered a universal one. The discriminating power of a multichannel Gabor filter can be varied by having more channels with different radial frequencies and closely spaced orientation angles. Thus, this system is flexible compared to all other methods and can be effectively used in discriminating scripts that are quite close in appearance. The main criticism with this approach is that it cannot be applied with confidence to small text regions as in wordwise script recognition. Also, Gabor filters are not capable of handling variations in script size and font, interline spacings, etc.

Table 1 also lists recognition rates, as reported in the literature. Since the experiments were conducted independently using different data sets, however, they do not reflect the comparative performance of these methods. To have a proper measure of their relative script separation power, these methods need to be applied on a common data set. Script recognition performance of some of the above-mentioned features, when applied to a common data set, is given in Table 2. The data set contains printed documents typeset in 10 different scripts, including six scripts used in India. In the absence of any standard database, we created our own database by collecting document samples from books and magazines. Some documents were also available from the World Wide Web, which we printed using a laser printer. All of the documents were scanned in black-and-white mode at 300 dpi and then rescaled to have a standard textline height in all documents while maintaining the
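The two Spitz-style measures compared above can be made concrete with a small sketch. This is an illustrative reimplementation from the definitions in the text, not Spitz's code; the input is assumed to be a binarized character array with 1 = ink.

```python
import numpy as np

def optical_density(char_img):
    """Fraction of ink pixels inside the character's bounding box
    (distinctly high for dense ideographic characters, low for Arabic)."""
    ys, xs = np.nonzero(char_img)
    box = char_img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    return float(box.mean())

def _ink_runs(row):
    """Half-open (start, end) column spans of the ink runs in one row."""
    padded = np.concatenate(([0], row, [0]))
    edges = np.diff(padded)
    return list(zip(np.where(edges == 1)[0], np.where(edges == -1)[0]))

def upward_concavities(char_img):
    """Count points where an ink run spans the white gap between ink
    runs on the scanline just above it: roughly one per stroke
    junction, so high for multi-stroke Han ideograms."""
    count = 0
    for r in range(1, char_img.shape[0]):
        for start, end in _ink_runs(char_img[r]):
            runs_above = _ink_runs(char_img[r - 1][start:end])
            if len(runs_above) >= 2:
                count += len(runs_above) - 1  # one per bridged gap
    return count
```

On a tiny U-shaped glyph (two strokes joined at the bottom) this counts exactly one upward concavity, while the mirror-image n-shape counts none, which matches the intuition that the feature fires where strokes join from below.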

TABLE 2
Script Recognition Results (in Percentage)

character aspect ratio. Script recognition was performed at the text block level. Homogeneous text blocks of size 256 × 256 pixels were extracted from document pages in such a way that page margins and nontextual parts were excluded. A total of 120 text blocks were generated per script, each block containing 10 to 12 textlines. The print quality of the documents, and hence, the quality of the document images was reasonably good, containing very little noise.

We observe that the optical density feature is capable of identifying Chinese and Korean and also Arabic and Urdu to some extent. For other script classes, the recognition rate was well below the acceptable level. This is because the optical density feature is not good enough to discriminate among scripts of similar complexity. The same argument holds for other script features. The Gabor filter method shows relatively better discriminating power in comparison. We noticed that the classification error was mainly due to the misclassification between script pairs like Arabic and Urdu, Chinese and Korean, Devnagari and Bengali, and Devnagari and Gujrati. These pairs of script classes have characters of the same nature and complexity, and even share some common characters. This leads to ambiguity, and hence, the classification error. So, on the whole, we may say that every proposed script identification method and script feature works well only when applied within a small set of script classes. Classification accuracy falls significantly when more scripts of similar nature and origin are included.

As observed in Table 1, almost all works on script recognition are targeted toward machine-printed documents. They have not been tested for script recognition in handwritten documents. In view of the large amount of handwritten documents that need to be processed electronically nowadays, script identification in handwritten documents turns out to be an important research issue. Unfortunately, the script features proposed for printed documents may not always be effective in the case of handwritten documents. Variations in writing style, character size, and interline and interword spacings make the recognition process difficult and unreliable when these techniques are applied to handwritten documents. Variation in writing across a document can be taken care of by using certain statistical features, as proposed in [20]. The textual symbol-based method can also be used but with certain modifications—some shape descriptor features can be derived from the text symbols and the prototypes can be generated through clustering. We demonstrated this approach in an earlier paper [35]. Also, a script class may be represented by multiple models to account for variation in writing from one person to another.

Based on our discussion above, we see that script features are extracted either from a list of connected components like textline, word, and character in a document or from a patch of text that may be a complete paragraph, a text block cropped from the input document, or even the whole document page. Script identification methods that use segmentwise analysis of character structure may hence be regarded as a local approach. On the other hand, visual appearance-based methods that are designed to identify script by analyzing the overall look of a text block may be regarded as a global approach.

As discussed before, many different structural features and methods for script characterization have been proposed over the years. In each of these methods, the features were chosen keeping in view only those script types that were considered therein. Therefore, while these features have been proven to be efficient for script identification within a given set of scripts, they may not be good in separating a wider variety of script classes. Again, structural features cannot effectively discriminate between scripts having similar character shapes, which otherwise may be distinguished by their visual appearances. Another disadvantage with structure-based methods is that they require complex preprocessing involving connected component extraction. Also, extraction of structural features is highly susceptible to noise and poor-quality document images. The presence of noise or significant image degradation adversely affects the location and segmentation of these features, making them difficult or sometimes impossible to extract.

In short, the choice of features in the local approach to script classification depends on the script classes to be identified. Further, the success of classification in this approach depends on the performance of the preprocessing stage, which includes denoising and extraction of connected components. Ironically, document segmentation and extraction of connected components sometimes require the script type to be known a priori. For example, an algorithm that is good for segmenting ideograms in Han may not be equally effective in segmenting alphabetic characters in the Latin script. This presents a paradox in that, for determining the script type, it is necessary to know the script type beforehand. In contrast, text block extraction in visual appearance-based global approaches is simpler and can be employed irrespective of the document's script. Since here it is not necessary to extract individual script components, such methods are better suited to degraded and noisy documents. Also, global features are more general in nature and can be applied to a broader range of script classes. They have practical importance in script-based retrieval systems because they are relatively fast and reduce the cost of document handling. Thus, visual appearance-based methods prove to be better than structure-based script identification methods in many ways, as listed in
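The projection-profile cue described earlier (a top peak for "shirorekha" scripts, a bottom peak for strong-baseline scripts such as Arabic) reduces to a few lines. Dividing the line height into three equal bands, as below, is an assumed simplification of the actual peak analysis:

```python
import numpy as np

def projection_peak_zone(textline_img):
    """Locate the dominant peak of the horizontal projection profile of a
    binary textline image (1 = ink). A 'top' peak suggests a headline
    ('shirorekha') script such as Devnagari; a 'bottom' peak suggests a
    strong-baseline script such as Arabic."""
    profile = textline_img.sum(axis=1)      # ink count per row
    peak_row = int(np.argmax(profile))
    third = peak_row * 3 // len(profile)    # 0 = top band, 2 = bottom band
    return ('top', 'middle', 'bottom')[min(third, 2)]
```

As the text notes, this cue alone cannot separate scripts sharing the same peak (e.g., Devnagari vs. Bengali), so in practice it is combined with other features.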

TABLE 3
Local versus Global Approaches for Script Identification

Table 3. However, the local approach is useful in applications involving textlinewise, wordwise, and even characterwise script identification, which otherwise are generally not possible through the global approach. Since local methods extract features from elemental structures present in a document, in principle they can be applied at all levels within the document. Nonetheless, some structure-based methods demand a minimum size of the text to arrive at some conclusive decision. For example, Spitz's two-stage script classifier [15] requires at least two lines of text in the first level of classification and at least six lines in the second stage. Likewise, at least 50 textual symbols need to be verified for acceptable classification in [19]. The same applies to methods in which the script class decision is based on statistics taken across the input document. We also note that methods developed for pagewise script identification can also be used for script recognition in a paragraph or a text block as long as the document size is big enough to provide necessary information.

4 ONLINE SCRIPT RECOGNITION
The script identification techniques described earlier are for offline script recognition and are, in general, not applicable to online data. With the advancement of pen computing technology in the last few decades, many online document analysis systems have been developed in which it is necessary to interpret the written text as it is input by analyzing the spatial and temporal nature of the movement of the pen. Therefore, as in the case of OCR systems for offline data, an online character recognizer in a multiscript environment must be preceded by an online script recognition system. Unfortunately, in comparison to offline script recognition, not much effort has been dedicated toward the development of online script recognition techniques. As of today, only a few methods are available for online script recognition, as described below.

One of the earliest works on online script recognition was reported in [78] by Lee et al. Later, they extended their work in [79]. Their method is based on the construction of a unified recognizer for the entire set of characters incorporated from more than one script, and an approach using an HMM network is proposed for recognizing sequences of words in multiple languages. Viewing handwritten script as an alternating sequence of words and interword ligatures, a hierarchical HMM is constructed by interconnecting HMMs for ligatures and words in multiple languages. These component HMMs are, in turn, modeled by a network of interconnected character and ligature models. Thus, the basic characters of a language, the language network, and intermixed use of language are modeled with hierarchical relations. Given such a construction, recognition corresponds to finding the optimal path in the network using the Viterbi algorithm. This approach can be used for recognizing freely handwritten text in more than one language and can be applied to any combination of phonetic writing systems. Results of word recognition tests showed that Hangul words can be recognized with about 92 percent accuracy while English words can be recognized correctly only 84 percent of the time. It was also observed that by combining multiple languages, recognition accuracy drops negligibly but speed is slowed substantially. Therefore, a more powerful search method and machine are needed to use this technique in practice.

The basic principle behind online character recognition is to capture the temporal sequence of strokes. A stroke is defined as the locus of the tip of the pen from pen-down to the next pen-up position. For script recognition, therefore, it may be useful to check the writing style associated with each script class. For example, Arabic and Hebrew scripts are written from right to left, Devnagari script is characterized by the presence of "shirorekha," a Han character is composed of several short strokes, and so on. An online system can capture such information and be used for script identification. In [80], Namboodiri and Jain proposed nine measures that may be used to quantify the characteristic writing style of every script. They are:

1. horizontal interstroke direction defining the direction of writing within a textline,
2. average stroke length,
3. "shirorekha" strength,
4. "shirorekha" confidence,
5. stroke density,
6. aspect ratio,
7. reverse direction defined as the distance by which the pen moves in the direction opposite to the normal writing direction,
8. average horizontal stroke direction, and
9. average vertical stroke direction.

Their proposed classification system, based on the above spatial and temporal features of the strokes, attained classification accuracies between 86.5 and 95 percent in different experimental tests. Later, they added two more features in [81], viz., vertical interstroke direction and variance of stroke length, and achieved around 0.6 percent improvement in the classification accuracy.

A unified syntactic approach to online script recognition was presented in [82] and was applied for classifying Latin, Devnagari, and Kanji scripts by analyzing their characteristic properties that include fuzzy linguistic descriptors to describe the character features. The fuzzy pattern description language Fuzzy Online Handwriting Description Language (FOHDEL) is used to store fuzzy feature values for every character of a script class in the form of fuzzy rules. For example, the character "b" in the Roman alphabet may be described as consisting of two fuzzy linguistic terms—very straight vertical line at the beginning

followed by an almost circular curve at the end. These fuzzy rules aid in decision making during classification.

5 SCRIPT RECOGNITION IN VIDEO TEXT
Script identification is not only important for document analysis but also for text recognition in images and videos. Text recognition in images and videos is important in the context of image/video indexing and retrieval. The process includes several preprocessing steps like text detection, text localization, text segmentation, and binarization before an OCR algorithm may be applied. As with documents in a multiscript environment, image/video text recognition in an international environment also requires script identification in order to apply a suitable algorithm for text extraction and recognition. In view of this, an approach for discriminating between Latin and Han script was developed in [83]. The proposed approach proceeds as follows: First, the text present in an image or video frame is localized and normalized. Then, a set of low-level features is extracted from the edges detected inside the text region. This includes the mean and standard deviation of edge pixels, edge pixel density, energy of edge pixels, horizontal projection, and Cartesian moments of the edge pixels. Finally, based on the extracted features, the decision about the type of the script is made using a KNN classifier. Experimental results have demonstrated the efficiency of the proposed method by identifying Latin and Han scripts accurately at the rate of 85.5 and 89 percent, respectively.

6 ISSUES IN MULTISCRIPT OCR SYSTEM EVALUATION
In connection with research in script recognition, it is useful and important to develop benchmarks and methodologies that may be employed to evaluate the performance of multiscript OCR systems. Some aspects of this problem have been reported in [84], and are discussed below.

The OCR evaluation approaches are broadly classified into two categories: black box evaluation and white box evaluation. In black box evaluation, only the input and output are visible to the evaluator. In a white box evaluation procedure, outputs of different modules comprising the system may be accessed and the total system is evaluated stage by stage. Nevertheless, the primary issues related to both types of evaluation are recognition accuracy and processing speed. The parameters that can be varied for the purpose of evaluation are content, font size and style, print and paper quality, scanning resolution, and the amount of noise and degradation in the document images.

Needless to say, the overall performance of a multiscript OCR greatly depends on the performance of the script recognition algorithm used in the system. As with any OCR system, the efficiency of a script recognizer is mainly assessed on the basis of accuracy and speed. Another important performance criterion is the minimum size of the document necessary for the script recognizer to perform reliably. This is to measure how the recognizer performs with varying document size.

In a multiscript system, another issue of consideration is the writing system adopted by a script, script complexity, and the size of the character set. Since some scripts are simple in nature and some are quite complex, a relative comparison of performance across scripts is a difficult task. For example, Latin is generally simpler in structure and is based on an alphabetic system. A script identifier that is good in recognizing Latin scripts may not be so in the case of complex nonalphabetic scripts like Arabic, Han, and Devnagari. Therefore, in order to evaluate various systems, a standard set of data should be used so that the evaluation process is unbiased. However, it is generally difficult to find document data sets in different languages/scripts that are similar in content and layout. To address this problem, Kanungo et al. introduced the Bible as a data set for evaluating multilingual and multiscript OCR performance [85]. Bible translations are closely parallel in structure, relevant with respect to modern day language, widely available, and inexpensive. These make the Bible attractive for controlling document content while varying language, size, and script. The document layout can also be controlled by using synthetically generated page image data. Other holy books whose translations have similar properties, like the Quran and the Bhagavad Gita, have also been suggested by some researchers.

One major concern with most of the reported works in script recognition is the lack of any comparative analysis of the results. Experimental results given for every proposed method have not been compared with other benchmark works in the field. Moreover, the data sets used in experiments are all different. This is mainly due to the lack of availability of a standard database for script recognition research. Consequently, it is hard to assess the results reported in the literature. Hence, a standard evaluation testbed containing documents written in only one script type as well as multiscript documents with a mix of different scripts within a document is necessary. One important consideration in selecting the data set for a script class is that it should reflect the global probability of occurrence of the characters in texts written in that particular script. Another problem of concern is for languages that constantly undergo spelling modifications and graphemic changes over the years. As a result, if an old document is chosen as the corpus, then it may not be suitable for evaluating a modern OCR system. On the other hand, a database of modern documents may not be useful if the goal of the OCR is to process historic documents. This suggests that the data set should include all different forms of the same language that evolved with time, with full coverage of the script alphabet of different languages, and it should be large enough to reflect the statistical occurrence probability of the characters.

7 CONCLUSION
This paper presents a comprehensive survey on the developments in script recognition technology, which is an important issue in OCR research in our multilingual multiscript world. Researchers have attempted to characterize different scripts either by extracting their structural features or by deriving some visual attributes. Accordingly, many different script features have been proposed over the years for script identification at different levels within a document—pagewise, paragraphwise, textlinewise, wordwise, and even characterwise. Textlinewise and wordwise script
GHOSH ET AL.: SCRIPT RECOGNITION—A REVIEW 2159

identifications are particularly important for use in a multiscript document. However, compared to the large body of literature available in the fields of document analysis and optical character recognition, the volume of work on script identification is relatively thin. The main reason is that most research in the area of OCR has been directed at solving issues within the scope of the country where the research is conducted. Since most countries in the world use only one language/script, OCR research in these countries need not bother with determining the script in which a document is written. For instance, the US postal department invested heavily in developing systems for automatic reading of postal addresses, but under the assumption that all letters originating or arriving in the US would carry addresses written in English only. Script recognition is important only in an international environment or in a country that uses more than one script.
Nonetheless, with recent economic globalization and increased business transactions across the globe, there has been increased awareness of automatic script recognition in the OCR community. That is why the majority of the reported works are dated within the last decade. However, it is noted that most of these script recognition methods have been tested on machine-printed documents only, and their performance on handwritten documents is not known. In view of this, it would not be wrong to say that script recognition in handwritten documents is still in its early stage of research. Since the present thrust in OCR research is in handwritten document analysis, parallel research on script identification in handwritten documents is in demand. Also, not many of these script recognition techniques have addressed font variation within a script class. Hence, we can conclude that script recognition technology still has a way to go, especially for handwritten document analysis. Therefore, there is an urgent need to work on script recognition of handwritten documents and to develop font-independent script recognizers.
As is evident from our analysis, script recognition technology lacks a generalized approach that can handle all the different types of scripts under a common framework. While a particular script feature may prove efficient within one set of scripts, it may not be useful for others. To some extent, texture features can be used universally, but they cannot be applied reliably at the word and character levels within a document.
Finally, we need to create a standard data set for research in this field. This is necessary to evaluate different script recognition methodologies under the same conditions. The creation of standard data resources will undoubtedly provide a much needed resource to researchers working in this field.
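The black box evaluation discussed above can be made concrete with a small sketch: accuracy is measured purely from the system's output against a ground-truth transcription. The edit-distance-based character accuracy below is one common formulation, not the metric of any particular system surveyed here, and the sample strings are invented.

```python
# Black-box evaluation sketch: only the recognizer's input and output are
# visible, so accuracy is measured by comparing the output text against a
# ground-truth transcription. Character accuracy is defined here as
# 1 - edit_distance / len(ground_truth), a common OCR-style metric.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def character_accuracy(ground_truth: str, ocr_output: str) -> float:
    if not ground_truth:
        return 1.0 if not ocr_output else 0.0
    return 1.0 - edit_distance(ground_truth, ocr_output) / len(ground_truth)

# Example with a deliberately corrupted output (hypothetical data):
print(round(character_accuracy("script recognition", "scr1pt recogn1tion"), 3))  # prints 0.889
```

Because the metric never inspects the recognizer's internals, the same harness applies unchanged to any system under test; a white box evaluation would instead score each module (segmentation, feature extraction, classification) separately.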
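On the standard data set question, one mechanical check of a candidate corpus is whether it covers a script's alphabet and with what empirical character frequencies. The sketch below is illustrative only; the toy alphabet and corpus stand in for a real script inventory and document collection.

```python
# Sketch of a corpus-adequacy check for an evaluation data set: it reports
# (a) which characters of a script's alphabet are missing from the corpus and
# (b) the empirical occurrence probability of each character, so that very
# rare characters can be flagged. Alphabet and corpus here are toy examples.
from collections import Counter

def corpus_statistics(corpus: str, alphabet: set):
    counts = Counter(ch for ch in corpus if ch in alphabet)
    total = sum(counts.values())
    missing = alphabet - set(counts)
    probs = {ch: counts[ch] / total for ch in counts} if total else {}
    return missing, probs

alphabet = set("abcdefghijklmnopqrstuvwxyz")
corpus = "the quick brown fox jumps over the lazy dog"
missing, probs = corpus_statistics(corpus, alphabet)
print(sorted(missing))       # [] -- a pangram, so no letter is missing
print(round(probs["o"], 3))  # 0.114 -- 'o' occurs 4 times among 35 letters
```

A real check would aggregate such statistics over many documents and compare the empirical probabilities against reference frequencies for the language, flagging characters whose coverage is too sparse for reliable evaluation.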


Tulika Dube received the BTech degree in electronics and communication engineering from the Indian Institute of Technology Guwahati in 2006. Soon after her graduation, she joined the Indian Division of British Telecom at Bangalore, and later moved to Ibibo Web Pvt. Ltd., Gurgaon, India, as a software engineer. Between 2007 and 2009, she worked as a senior software engineer with Infovedics Software Pvt. Ltd., Noida, India. She received a search developer certification from FAST University, Norway, in 2007. She is currently working toward the management degree at the Indian Institute of Management, Ahmedabad.

Adamane P. Shivaprasad received the BE, ME, and PhD degrees in electrical communications engineering from the Indian Institute of Science, Bangalore, in 1965, 1967, and 1972, respectively. He is currently a guest professor in the Department of Electronics and Communication Engineering, Sambhram Institute of Technology, Bangalore, India. He was a member of the academic staff of the Department of Electrical Communication Engineering, Indian Institute of Science, Bangalore, from 1967 until he retired as a professor in 2006. His research interests include the design of micropower VLSI circuits, intelligent instrumentation, communication systems, and pattern recognition.

Debashis Ghosh received the BE degree in electronics and communication engineering from M.R. Engineering College, Jaipur, India, in 1993, and the MS and PhD degrees in electrical communication engineering from the Indian Institute of Science, Bangalore, in 1996 and 2000, respectively. He is currently an associate professor in the Department of Electronics and Computer Engineering, Indian Institute of Technology Roorkee. From April 1999 to November 1999, he was a DAAD research fellow at the University of Kaiserslautern, Germany. In November 1999, he joined the Indian Institute of Technology Guwahati as an assistant professor of electronics and communication engineering. He spent the 2003-2004 academic year as a visiting faculty member in the Department of Electrical and Computer Engineering at the National University of Singapore. Between 2006 and 2008, he was a senior lecturer with the Faculty of Engineering and Technology, Multimedia University, Malaysia. His teaching and research interests include image/video processing, computer vision, and pattern recognition.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
