
Document Image Segmentation using Discriminative Learning over Connected Components

Syed Saqib Bukhari
Technical University of Kaiserslautern, Kaiserslautern, Germany
[email protected]

Mayce Ibrahim Ali Al Azawi
Technical University of Kaiserslautern, Kaiserslautern, Germany
[email protected]

Faisal Shafait
German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany
[email protected]

Thomas M. Breuel
Technical University of Kaiserslautern, Kaiserslautern, Germany
[email protected]

ABSTRACT

Segmentation of a document image into text and non-text regions is an important preprocessing step for a variety of document image analysis tasks, such as improving OCR and document compression. Most state-of-the-art document image segmentation approaches perform segmentation using pixel-based or zone (block) based classification. Pixel-based classification approaches are time consuming, whereas block-based methods depend heavily on the accuracy of the block segmentation step. In contrast to these approaches, our segmentation approach introduces connected component based classification and therefore does not require a block segmentation beforehand. We train a self-tunable multi-layer perceptron (MLP) classifier to distinguish between text and non-text connected components, using shape and context information as a feature vector. Experimental results demonstrate the effectiveness of the proposed algorithm. We have evaluated our method on a subset of the UW-III dataset, the ICDAR 2009 page segmentation competition test images and a circuit diagrams dataset, and compared its results with the state-of-the-art leptonica page segmentation algorithm (http://code.google.com/p/leptonica/).

1. INTRODUCTION

Document image segmentation is the problem of classifying the contents of a document image into a set of text and non-text classes. The non-text class consists of the following categories: halftones, drawings, maths, logos, tables, etc. Document image segmentation is one of the most important preprocessing steps before feeding the specific contents to an optical character recognition (OCR) system; otherwise the OCR engine produces a lot of garbage characters originating from non-text components, as shown in Figure 1.

Document image segmentation approaches in the literature can generally be classified into two groups: (i) block or zone based classification and (ii) pixel based classification. Block based segmentation approaches apply page segmentation [11] on the document image and then classify the obtained blocks into a set of predetermined classes [5]. On the other hand, pixel based approaches attempt to classify individual pixels [8, 7] according to predefined classes.

Several block classification algorithms have been proposed over the years. For a more detailed overview of related work in the field of document block classification, please refer to Okun [9] and Wang [12]. Okun et al. [9] proposed an approach for document block classification based on connected components and run-length statistics. Wang et al. [12] presented a block classification system that describes each block with a 25-dimensional feature vector and uses an optimized decision tree classifier to assign each block to one of several target classes. The most recent and detailed block classification approach was introduced by Keysers et al. [5], who showed that a document block classification system can be constructed using a run-length histogram feature vector alone. That work covers several classes of blocks (math, logo, text, table, drawing, halftone, ruling and speckles). In general, approaches that classify blocks depend heavily on the result of the page segmentation into blocks: the blocks may be segmented incorrectly, leading to misclassification.

Moll et al. [8, 7] classify individual pixels instead of regions, to avoid the constraint of a limited set of region shapes. Their approach is applied to handwritten, machine printed and photographed document images. Pixel based classification approaches are slow with respect to execution time. The approach by Won [13] focuses on a combination of a block based algorithm and a pixel based algorithm to segment a document image into text and image areas.

(a) Input page segment
(b) OCR result

Figure 1: The OCR result of an incorrectly segmented zone containing both images and text. The OCR system generates many garbage symbols from the non-text parts of the input page segment.

Together with block based and pixel based image segmentation approaches, there is another state-of-the-art text and halftone segmentation approach, reported by Bloomberg et al. [2], which is based on multi-resolution morphological operations. This approach comprises three steps: 1) first, a seed image is generated by sub-sampling the input image such that the resulting seed image mainly contains halftone pixels; 2) then, a mask image is produced using morphological operations such that halftone seed pixels are sufficiently connected to the other pixels covering the halftone regions; 3) in the last step, a binary filling operation transforms the seed image, with the help of the mask image, into the final halftone mask image. An open-source implementation of this algorithm is available in the leptonica library developed by Dan Bloomberg. The method produces promising results for halftone objects but is unable to recognize thin halftones and drawing-like objects as non-text.
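For illustration, the seed/mask/fill idea can be sketched with standard morphological operations. The following Python snippet is only a rough approximation using scikit-image; the sub-sampling factor and the two density thresholds are illustrative placeholders, not the values used in leptonica.

import numpy as np
from skimage.measure import block_reduce
from skimage.morphology import binary_closing, reconstruction
from skimage.transform import resize

def halftone_mask(binary_img, factor=8):
    # binary_img: 2D boolean array, True = foreground (ink) pixel
    img = binary_img.astype(float)
    density = block_reduce(img, (factor, factor), np.mean)
    # 1) seed: only very dense blocks (halftone-like) survive the high threshold
    seed = density > 0.8
    # 2) mask: a permissive threshold plus closing keeps whole halftone regions connected
    mask = binary_closing(density > 0.1, np.ones((3, 3), bool))
    seed = np.logical_and(seed, mask)            # reconstruction requires seed <= mask
    # 3) binary fill: grow the seed inside the mask (morphological reconstruction)
    filled = reconstruction(seed.astype(float), mask.astype(float), method='dilation') > 0
    # bring the low-resolution halftone mask back to the input image size
    return resize(filled.astype(float), binary_img.shape, order=0) > 0.5

Thin drawings rarely survive the dense-seed threshold in such a scheme, which mirrors the limitation noted above.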

In this paper our aim is to perform text and non-text classification based on connected components, instead of pixels or blocks. For this purpose, we use simple and easy-to-compute feature vectors. For training we use a multi-layer perceptron (MLP) classifier, which has already been used in different document image preprocessing tasks [6], such as binarization [4] and deskewing [10]. Classifier tuning is considered one of the hard problems with respect to the optimization of parameters; in order to avoid this problem we use a self-tunable MLP classifier [3]. Our method is independent of block segmentation and is equally applicable to different categories of non-text objects, provided they are included in the training data. The ease of implementation and the accuracy of our approach are presented in the algorithm description and experimental evaluation sections, respectively.

The rest of this paper is organized as follows. In Section 2 we describe our document image segmentation method. Section 3 deals with the experimental results. Section 4 presents the discussion and conclusions.

2. DOCUMENT IMAGE SEGMENTATION ALGORITHM

Here we describe our document image segmentation algorithm, which segments a document image into text and non-text regions. Our main goal is to classify each connected component as either a text or a non-text component. In Section 2.1 we describe the feature extraction process. In Section 2.2 we discuss the training of the extracted features using a self-tunable multi-layer perceptron (MLP) classifier.

2.1 Feature Extraction

Instead of extracting complex features from a connected component, the raw shape of the connected component itself is an important distinguishing feature for classifying structured text and random or irregular non-text components, as shown in Figure 2. Together with the shape of a connected component, its surrounding area can also play an important role for text and non-text classification, again because text surroundings are structured while non-text surroundings are not. Figure 2 shows the neighborhood surrounding areas for text and non-text regions. We refer to a connected component together with its surrounding neighborhood as its context. Based on this hypothesis, our feature vector for a connected component is composed of shape and context information. A detailed description of the feature vector is given below.

Figure 2: Sample image from the ICDAR 2009 page segmentation competition. This image shows the structured shapes of text components and the random shapes of non-text components.

• shape of connected component: In document images, most text components are smaller than non-text components, so size information can play an important role in classifying text and non-text components. However, size information alone is not enough to separate big text components from small non-text components; therefore, together with size information we need some other features as well. As already mentioned, the shapes of non-text connected components are irregular, random, and vary a lot from one image to another, whereas the shapes of text components are uniformly structured in document images. These structured and random shapes of text and non-text components, respectively, can be learned by the MLP classifier. For generating the feature vector, each connected component is rescaled to a 40 × 40 pixel window. This rescaling performs only downscaling: a connected component is downscaled if either its length or its height is greater than 40 pixels; otherwise it is fit into the center of the 40 × 40 window. The advantage of this type of rescaling is that it distinguishes the shapes of small components from large components, so it can produce different feature vectors for the same symbol at different sizes, for example a small and a big font 'a'. Our target is not to classify individual characters but to classify text and non-text components; therefore, in our case this rescaling works better than plain rescaling because it incorporates implicit size information of the text and non-text components. Rescaled text and non-text connected components are shown in Figure 3(a) and Figure 3(b), respectively. Together with the raw rescaled connected component, our shape-based feature vector also contains the four size-based features listed below, so altogether the size of the shape-based feature vector is 1604.

1. normalized length (length of a component divided by the length of the input image).

2. normalized height (height of a component divided by the height of the input image).

3. aspect ratio of a component (length divided by height).

4. number of foreground pixels in the rescaled area divided by the total rescaled area.

• surrounding context of connected component: Text components are usually aligned horizontally in document images, which results in a structured surrounding area for a text component as compared to the unstructured surrounding area of a non-text component. Therefore, the surrounding context of a connected component can play an important role in classifying text and non-text components. Each connected component together with its surrounding context area is rescaled to a 40 × 40 window to generate the context-based feature vector. The surrounding context area is not fixed for all connected components; it is a function of the component's length (l) and height (h): for each connected component, an area of dimensions 5 × l by 2 × h (chosen empirically) is used, keeping the connected component at the center before rescaling. Rescaled text and non-text context components are shown in Figure 3(c) and Figure 3(d), respectively. The size of the context-based feature vector is 1600.

In this way, the size of the complete feature vector is 3204, consisting of the raw rescaled shape (dimension 1600), the raw rescaled context (dimension 1600) and the four size-based features.
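The feature computation described above can be summarized by the following sketch (Python with scipy and scikit-image; the function and variable names are ours, and for brevity the context window simply takes the raw page pixels around the component without masking out neighbouring components).

import numpy as np
from scipy.ndimage import label, find_objects
from skimage.transform import resize

def shape_window(comp, size=40):
    # downscale only: larger components are shrunk, smaller ones are pasted
    # into the center so that implicit size information is preserved
    h, w = comp.shape
    if h > size or w > size:
        s = size / max(h, w)
        comp = resize(comp.astype(float), (max(1, int(h * s)), max(1, int(w * s))), order=0)
        h, w = comp.shape
    out = np.zeros((size, size))
    y, x = (size - h) // 2, (size - w) // 2
    out[y:y + h, x:x + w] = comp
    return out

def component_features(page, sl, size=40):
    # page: binarized page as a 0/1 float array; sl: slice pair of one component
    H, W = page.shape
    comp = page[sl]
    h, w = comp.shape
    shape_win = shape_window(comp, size)                               # 1600 shape values
    cy, cx = (sl[0].start + sl[0].stop) // 2, (sl[1].start + sl[1].stop) // 2
    y0, y1 = max(0, cy - h), min(H, cy + h)                            # context height: 2 * h
    x0, x1 = max(0, cx - (5 * w) // 2), min(W, cx + (5 * w) // 2)      # context width: 5 * l
    context_win = resize(page[y0:y1, x0:x1], (size, size), order=0)    # 1600 context values
    size_feats = [w / W, h / H, w / h, shape_win.sum() / (size * size)]
    return np.concatenate([shape_win.ravel(), context_win.ravel(), size_feats])  # 3204 values

# usage sketch:
# labels, n = label(page)
# vectors = [component_features(page, sl) for sl in find_objects(labels)]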
2.2 Classification

In general, classifier tuning is a hard problem with respect to the optimization of sensitive parameters, for example the learning rate of an MLP classifier, 'C' and gamma of an SVM classifier, the confidence of a decision tree classifier, the maximum depth and number of attributes of a random forest classifier, 'k' of a k-nearest-neighbor classifier, etc. We use an MLP classifier for text and non-text classification. The performance of an MLP classifier is sensitive to the chosen parameter values, and the optimal values depend on the dataset. The parameter optimization problem can be solved by using grid search during classifier training, but grid search is a slow process. Therefore, in order to overcome this problem we use AutoMLP [3], a self-tuning classifier that can automatically adjust its learning parameters. In AutoMLP we train a population of MLP classifiers in parallel. For these MLP classifiers, the learning parameters are selected from a parameter space sampled according to some probability distribution function.

All of these MLPs are trained for a few epochs, and then the better-performing half of the classifiers is selected for the next generation. AutoMLP performs internal validation on a portion of the training data.
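AutoMLP itself is described in [3]; purely as an illustration of the population-and-halving idea, the sketch below uses scikit-learn's MLPClassifier with randomly sampled learning rates and hidden-layer sizes. The library choice, parameter ranges and schedule are ours, not the authors'.

import numpy as np
from sklearn.neural_network import MLPClassifier

def automlp_like(X, y, population=8, generations=3, epochs_per_gen=5, val_frac=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n_val = int(len(X) * val_frac)                     # internal validation split
    X_val, y_val, X_tr, y_tr = X[:n_val], y[:n_val], X[n_val:], y[n_val:]
    classes = np.unique(y)
    # sample a population of MLPs with random learning rates and hidden sizes
    nets = [MLPClassifier(hidden_layer_sizes=(int(rng.integers(20, 200)),),
                          learning_rate_init=float(10 ** rng.uniform(-4, -1)),
                          solver='sgd')
            for _ in range(population)]
    for _ in range(generations):
        for net in nets:
            for _ in range(epochs_per_gen):            # train each net for a few epochs
                net.partial_fit(X_tr, y_tr, classes=classes)
        # keep the better half, judged on the held-out validation split
        nets.sort(key=lambda m: m.score(X_val, y_val), reverse=True)
        nets = nets[:max(1, len(nets) // 2)]
    return nets[0]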

Figure 3: Text and non-text connected component shape and context features. (a) and (b) show rescaled (no upscaling: either downscaled or fit into the center to preserve size) connected component shape features; (c) and (d) show rescaled connected component context features.

Feature vectors for training the AutoMLP classifier have been extracted from the UW-III dataset. The UW-III dataset contains zone-level ground truth for text, halftone, ruling, drawing and logo. From this zone-level ground-truth information, the text and the non-text (halftone, drawing and logo) regions are extracted from the document images. Non-text regions were small in number, so their number was increased up to four times by rotating each non-text region into four different orientations. Around 0.7 million text samples and 0.1 million non-text samples are used for training the AutoMLP classifier.

For testing and evaluation, the feature vector for each connected component of a test document image is extracted in the same way as described in Section 2.1. Then a class label is assigned to each connected component based on the classification probabilities of text and non-text.

In order to improve the segmentation results, a nearest-neighbor analysis using the class probabilities is performed to refine the class label of each connected component. For this purpose, a region of 70 × 70 (chosen empirically) is selected from the document image, keeping the targeted connected component at the center. The probabilities of the connected components within the selected region have already been computed during classification. The previously assigned class labels of the connected components are then updated using the average text and non-text probabilities of the connected components within the selected region. Some segmentation results are shown in Figure 4.
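A minimal sketch of this refinement step is given below, assuming each connected component is represented by its center coordinates and the text probability returned by the classifier; the window size of 70 corresponds to the empirically chosen region mentioned above.

import numpy as np

def refine_labels(centers, text_probs, window=70):
    # centers: (n, 2) array of (x, y) component centers; text_probs: (n,) text probabilities
    centers = np.asarray(centers, dtype=float)
    text_probs = np.asarray(text_probs, dtype=float)
    half = window / 2.0
    refined = []
    for cx, cy in centers:
        # neighbours whose centers fall inside the window x window box around this component
        inside = (np.abs(centers[:, 0] - cx) <= half) & (np.abs(centers[:, 1] - cy) <= half)
        refined.append('text' if text_probs[inside].mean() >= 0.5 else 'non-text')
    return refined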
3. EXPERIMENTAL RESULTS

We have evaluated our document image segmentation approach using the UW-III dataset, the ICDAR 2009 page segmentation competition test dataset [1] and our private circuit diagrams dataset. The main reason for using different datasets is to check the accuracy of our approach on different types of images that have not been used in training, as well as to cover a variety of text and non-text components. For example, the majority of the document images in the UW-III dataset have a Manhattan layout, whereas the ICDAR 2009 dataset also contains documents with non-Manhattan layouts. All non-text components except halftones have been removed from the UW-III and ICDAR 2009 test datasets. In contrast, the circuit diagrams dataset is mainly composed of text and drawing components and contains no other types of non-text components. In total, 95 documents have been selected from the UW-III dataset, the ICDAR 2009 dataset contains 8 test images, and our circuit diagrams dataset is composed of 10 images.

For each dataset, pixel-level ground truth has been generated from the zone-level ground truth information, so that each pixel in a ground-truth image carries either a text or a non-text label. The following metrics have been used for the performance evaluation of the document image segmentation methods:

1. non-text classified as non-text: percentage of the intersection of non-text pixels in the segmented and the ground truth image, with respect to the total number of non-text pixels in the ground truth image.

2. non-text classified as text: percentage of the intersection of text pixels in the segmented image and non-text pixels in the ground truth image, with respect to the total number of non-text pixels in the ground truth image.

3. text classified as text: percentage of the intersection of text pixels in the segmented and the ground truth image, with respect to the total number of text pixels in the ground truth image.

4. text classified as non-text: percentage of the intersection of non-text pixels in the segmented image and text pixels in the ground truth image, with respect to the total number of text pixels in the ground truth image.

5. segmentation accuracy: average of the text-classified-as-text accuracy and the non-text-classified-as-non-text accuracy.
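These metrics can be computed directly from the label images. The sketch below assumes boolean images in which True marks text pixels, a foreground mask selecting the pixels that carry a label at all, and ground truth containing both classes.

import numpy as np

def evaluate(seg_text, gt_text, foreground):
    # seg_text / gt_text: True where a pixel is labelled text; foreground: labelled (ink) pixels
    seg_text, gt_text = seg_text & foreground, gt_text & foreground
    seg_nontext, gt_nontext = ~seg_text & foreground, ~gt_text & foreground
    n_text, n_nontext = gt_text.sum(), gt_nontext.sum()
    m = {
        'non-text classified as non-text': 100.0 * (seg_nontext & gt_nontext).sum() / n_nontext,
        'non-text classified as text':     100.0 * (seg_text & gt_nontext).sum() / n_nontext,
        'text classified as text':         100.0 * (seg_text & gt_text).sum() / n_text,
        'text classified as non-text':     100.0 * (seg_nontext & gt_text).sum() / n_text,
    }
    m['segmentation accuracy'] = (m['text classified as text'] +
                                  m['non-text classified as non-text']) / 2.0
    return m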
Based on the metrics defined above, we have compared our approach with leptonica's page segmentation algorithm. The leptonica algorithm is exclusively designed for segmenting text and halftone components. Performance comparison results of our method and the leptonica method on the UW-III and ICDAR 2009 test datasets, which contain only text and halftone components, are shown in Table 1. Boxplots of the text and halftone accuracies of both methods on the combined UW-III and ICDAR 2009 test datasets are shown in Figure 5. Our algorithm has also been evaluated on the circuit diagrams dataset in order to show its potential as compared to other text and halftone based segmentation approaches such as leptonica; these results are shown in Table 2.

186
(a) image with text and halftone only (UW-III); (b) leptonica method; (c) our method.
(d) image with text and halftone only (ICDAR 2009); (e) leptonica method; (f) our method.
(g) image with text and halftone only (Circuit Diagram); (h) leptonica method; (i) our method.

Figure 4: Document image segmentation results of our and leptonica methods in non-text mask format.

Table 1: Performance evaluation of our and leptonica page segmentation algorithms on the UW-III dataset (95 document images), the ICDAR 2009 page segmentation competition test dataset (8 document images) and the combined UW-III and ICDAR 2009 datasets (103 document images).

                                      UW-III                    ICDAR-2009                 Combined
                                our approach  leptonica   our approach  leptonica   our approach  leptonica
non-text classified as non-text    98.91%       95.36%       96.70%       84.91%       98.79%       94.77%
non-text classified as text         1.09%        4.64%        3.30%       15.09%        1.21%        5.23%
text classified as text            95.93%       99.79%       93.31%       99.87%       95.72%       99.79%
text classified as non-text         4.07%        0.21%        6.69%        0.13%        4.28%        0.21%
segmentation accuracy              97.42%       97.57%       95.01%       92.39%       97.25%       97.28%

Table 2: Performance evaluation results of our and leptonica page segmentation algorithms on the circuit diagrams dataset (10 document images). Note: the leptonica method is designed for text and halftone segmentation; it is evaluated on the circuit diagrams dataset to show that, unlike our method, text and halftone based segmentation methods usually cannot be applied directly to the segmentation of other types of non-text components.

                                our approach  leptonica
non-text classified as non-text    89.79%          0%
non-text classified as text        10.21%        100%
text classified as text            89.29%        100%
text classified as non-text        10.72%          0%
segmentation accuracy              89.54%         50%

4. DISCUSSION

We have described and experimentally evaluated a new method for segmenting document images into text and non-text regions based on discriminative learning over connected components. We have used the self-tuning MLP classifier (AutoMLP) [3], which automatically optimizes its learning parameters. Our method is independent of the zone segmentation preprocessing step that zone-based classification approaches usually require. We have evaluated our algorithm on the UW-III dataset, the ICDAR 2009 page segmentation competition test dataset and the circuit diagrams dataset, and compared its results with the state-of-the-art leptonica page segmentation method [2]. In general, both the text and the non-text components are equally important in document image analysis operations: for example, OCR exclusively requires text components, whereas document image compression or symbol recognition approaches exclusively require non-text components. The performance evaluation results of our method and the leptonica method are presented in Table 1, Table 2 and Figure 5. It is obvious from the results that the leptonica method has better text classification accuracy than non-text classification accuracy.

The leptonica method misclassifies small non-text components as text components, as shown in Figure 4(b) and Figure 4(e). On the other hand, our method gives equal importance to both the text and the non-text components during classification; unlike the leptonica method, it can also discriminate between small non-text and text components, as shown in Figure 4(c) and Figure 4(f). The leptonica method is designed for text and halftone segmentation and is not specifically designed for segmenting drawing objects; therefore, it is unable to recognize the drawings in the circuit diagrams dataset, as shown in Table 2 and Figure 4(h). Together with halftone segmentation, our method also has the potential of segmenting drawing components (for example circuit diagrams), as shown in Table 2 and Figure 4(i). The segmentation results of our method can be improved by increasing the number of training samples and/or by applying some post-processing operations.

5. ACKNOWLEDGMENTS

This work was partially funded by the BMBF (German Federal Ministry of Education and Research), project PaREn (01 IW 07001).

(a) box plot of halftone classification accuracy; (b) box plot of text classification accuracy.

Figure 5: Box plots of our and leptonica page segmentation algorithms on 103 images of the UW-III and ICDAR-2009 page segmentation competition test datasets. Average classification accuracies are shown on top of the boxplots.

6. REFERENCES

[1] A. Antonacopoulos, D. Bridson, C. Papadopoulos, and S. Pletschacher. ICDAR 2009 page segmentation competition. In Proc. Int. Conf. Document Analysis and Recognition (ICDAR 2009), pages 1370–1374, Barcelona, Spain, 2009.
[2] D. S. Bloomberg and F. R. Chen. Extraction of text-related features for condensing image documents. In SPIE Conf. 2660, Document Recognition III, pages 72–88, San Jose, CA, 1996.
[3] T. M. Breuel and F. Shafait. AutoMLP: Simple, effective, fully automated learning rate and size adjustment. In The Learning Workshop, Snowbird, Utah, 2010.
[4] Z. Chi and K. W. Wong. A two-stage binarization approach for document images. In Proc. Int. Symp. Intelligent Multimedia, Video and Speech Processing (ISIMP'01), pages 275–278, 2001.
[5] D. Keysers, F. Shafait, and T. M. Breuel. Document image zone classification - a simple high-performance approach. In Proc. 2nd Int. Conf. Computer Vision Theory and Applications, pages 44–51, Barcelona, Spain, Mar. 2007.
[6] S. Marinai, M. Gori, and G. Soda. Artificial neural networks for document analysis and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(1), Jan. 2005.
[7] M. A. Moll and H. S. Baird. Segmentation-based retrieval of document images from diverse collections. In Document Recognition and Retrieval XV, Proc. of the SPIE, volume 6815, pages 68150L–68150L, 2008.
[8] M. A. Moll, H. S. Baird, and C. An. Truthing for pixel-accurate segmentation. In Document Analysis Systems, the Eighth IAPR Int. Workshop, pages 379–385, Sep. 2008.
[9] O. Okun, D. Doermann, and M. Pietikainen. Page segmentation and zone classification: the state of the art. Technical Report LAM-TR-036, CAR-TR-927, CS-TR-4079, University of Maryland, College Park, Nov. 1999.
[10] N. Rondel and G. Breuel. Cooperation of multilayer perceptrons for the estimation of skew angle in text document images. In Proc. Int. Conf. Document Analysis and Recognition (ICDAR'95), pages 1141–1144, 1995.
[11] F. Shafait, D. Keysers, and T. M. Breuel. Performance evaluation and benchmarking of six page segmentation algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(6):941–954, Jun. 2008.
[12] Y. Wang, I. Phillips, and R. Haralick. Document zone content classification and its performance evaluation. Pattern Recognition, 39:57–73, 2006.
[13] C. S. Won. Image extraction in digital documents. Journal of Electronic Imaging, 17:033016, 2008.

