(a) Input page segment
[OCR output of the page segment: mostly garbage symbols interspersed with fragments of the original figure caption]
Figure 1: The OCR result of an incorrectly segmented zone containing both images and text. The OCR system generates many garbage symbols from the non-text parts of the input page segment.
halftone segmentation approach reported by Bloomberg et al. [2], which is based on multi-resolution morphological operations. The approach comprises three steps: 1) in the first step, a seed image is generated by sub-sampling the input image such that the resulting seed image mainly contains halftone pixels; 2) a mask image is then produced using morphological operations such that, together with all image pixels, there is sufficient connectivity between the halftone seed pixels and the other pixels covering halftone regions; 3) in the last step, a binary filling operation transforms the seed image, with the help of the mask image, into the final halftone mask image. The open-source version of this algorithm is available in the leptonica library developed by Dan Bloomberg. This method produces promising results for halftone objects, but it is unable to recognize thin halftones and drawing-like objects as non-text.
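The three steps can be illustrated with a short sketch. This is not the leptonica implementation; it is a minimal seed/mask/fill pipeline using scipy.ndimage, with block size, seed threshold, and structuring-element size chosen purely for illustration.

```python
import numpy as np
from scipy import ndimage

def halftone_mask(binary, block=8, seed_density=0.5, close_size=5):
    """Sketch of a seed/mask/fill halftone detector (illustrative parameters).

    binary: 2-D numpy array with 1 for ink (foreground) and 0 for background.
    """
    h, w = binary.shape
    # 1) Seed: sub-sample by blocks; only dense blocks survive, so the seed
    #    mainly covers halftone regions and drops sparse text strokes.
    hb, wb = h // block, w // block
    density = binary[:hb * block, :wb * block].reshape(hb, block, wb, block).mean(axis=(1, 3))
    seed = (density > seed_density).repeat(block, axis=0).repeat(block, axis=1)
    seed = np.pad(seed, ((0, h - seed.shape[0]), (0, w - seed.shape[1])))

    # 2) Mask: morphological closing of the full-resolution image, so halftone
    #    pixels form large connected regions that the seed can grow into.
    mask = ndimage.binary_closing(binary.astype(bool),
                                  structure=np.ones((close_size, close_size), dtype=bool))

    # 3) Fill: dilate the seed until it stops changing, constrained by the mask
    #    (morphological reconstruction), to obtain the final halftone mask.
    seed &= mask  # reconstruction assumes the seed lies inside the mask
    return ndimage.binary_dilation(seed, mask=mask, iterations=0)
```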
In this paper, our aim is to perform text and non-text classification based on connected components instead of pixels or blocks. For this purpose, we use simple and easy-to-compute feature vectors. For training, we use a multi-layer perceptron (MLP) classifier, which has already been used in different document image pre-processing tasks [6] such as binarization [4] and deskewing [10]. Classifier tuning is considered one of the hard problems in parameter optimization; to get rid of this problem, we use a self-tunable MLP classifier [3]. Our method is independent of block segmentation and is equally applicable to different categories of non-text objects, provided they are included in the training data. The ease of implementation and the accuracy of our approach can be assessed in the algorithm description and experimental evaluation sections, respectively.

The rest of this paper is organized as follows. In Section 2 we describe our document image segmentation method, Section 3 deals with the experimental results, and Section 4 concludes the paper.

2. DOCUMENT IMAGE SEGMENTATION ALGORITHM

Here we describe our document image segmentation algorithm, which segments a document image into text and non-text regions. Our main target is to classify each connected component as either a text or a non-text component. In Section 2.1 we describe the feature extraction process, and in Section 2.2 we discuss the training on the extracted features using the self-tunable multi-layer perceptron (MLP) classifier.
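Before the individual steps are described, the sketch below outlines the overall per-component classification loop. It is only an illustration of the idea: scipy.ndimage is used to extract connected components, scikit-learn's MLPClassifier stands in for the self-tunable AutoMLP classifier used in this work, and feature_fn is a placeholder for the feature extraction of Section 2.1.

```python
import numpy as np
from scipy import ndimage
from sklearn.neural_network import MLPClassifier  # stand-in for AutoMLP

def classify_components(binary_page, feature_fn, clf):
    """Assign a text / non-text label to every connected component.

    binary_page: 2-D numpy array, 1 = ink.
    feature_fn:  maps a cropped binary component to a 1-D feature vector.
    clf:         trained classifier exposing predict_proba, e.g. MLPClassifier.
    """
    labeled, n_components = ndimage.label(binary_page)
    boxes = ndimage.find_objects(labeled)
    features = []
    for i, box in enumerate(boxes, start=1):
        component = (labeled[box] == i)          # mask of this component only
        features.append(feature_fn(component))
    # Assumption: the classifier's second column is the probability of "text".
    p_text = clf.predict_proba(np.vstack(features))[:, 1]
    labels = np.where(p_text > 0.5, "text", "non-text")
    return labels, p_text

# Usage sketch: clf = MLPClassifier().fit(train_X, train_y) on the training
# feature vectors, then classify_components(page, extract_features, clf).
```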
2.1 Feature Extraction

Instead of extracting complex features from a connected component, the raw shape of the connected component itself is an important distinguishing feature for classifying structured text and random or irregular non-text components, as shown in Figure 2. Together with the shape of a connected component, the surrounding area of the component can also play an important role in text and non-text classification, again because text surroundings are structured while non-text surroundings are not. Figure 2 shows the neighborhood surrounding areas of text and non-text regions. We refer to a connected component together with its surrounding neighborhood as its context. Based on the above hypothesis, our feature vector for a connected component is composed of shape and context information. A detailed description of the feature vector is given below.
connected components are shown in Figure 3(a) and Figure 3(b), respectively. Together with the raw rescaled connected component, our shape-based feature vector is also composed of four other size-based features, mentioned below. Altogether, the size of our shape-based feature vector is 1604.
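A minimal sketch of how such a vector could be assembled is given below. It assumes the component is rescaled to a 40 × 40 grid (40 × 40 = 1600 values, which together with four size-based features gives 1604) and uses normalized height, normalized width, aspect ratio, and foreground density as hypothetical stand-ins for the four size-based features; both the grid size and the choice of size features are assumptions made for illustration.

```python
import numpy as np

RESCALE = 40  # assumption: 40 x 40 = 1600 raw-shape values + 4 size features = 1604

def shape_features(component, page_height):
    """component: 2-D binary numpy array cropped to its bounding box."""
    h, w = component.shape
    # Raw shape: nearest-neighbour rescaling of the component to RESCALE x RESCALE.
    rows = (np.arange(RESCALE) * h) // RESCALE
    cols = (np.arange(RESCALE) * w) // RESCALE
    raw = component[np.ix_(rows, cols)].astype(np.float32).ravel()   # 1600 values
    # Four size-based features (hypothetical choices, see the note above).
    size = np.array([h / page_height,        # normalized height
                     w / page_height,        # normalized width
                     w / h,                  # aspect ratio
                     component.mean()],      # foreground (ink) density
                    dtype=np.float32)
    return np.concatenate([raw, size])       # 1604-dimensional feature vector
```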
Feature vectors for training the AutoMLP classifier have been extracted from the UW-III dataset, which contains zone-level ground truth for text, halftone, ruling, drawing and logo zones. From this zone-level ground-truth information, the text and non-text (halftone, drawing and logo) regions are extracted from the document images. Non-text regions were small in number, so their number was increased by up to four times by rotating each non-text region into four different orientations. Around 0.7 million text samples and 0.1 million non-text samples are used for training the AutoMLP classifier.
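For illustration, the rotation-based augmentation described above can be sketched as follows; the use of the four axis-aligned orientations (0°, 90°, 180°, 270°) is an assumption, since the text only states that four different orientations are used.

```python
import numpy as np

def augment_non_text(region):
    """Return a non-text region in four orientations (0, 90, 180, 270 degrees).

    region: 2-D binary numpy array of the non-text region. The axis-aligned
    rotations are an assumption; the text only says four orientations are used.
    """
    return [np.rot90(region, k) for k in range(4)]
```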
For testing and evaluation purposes, the feature vector for each connected component of a test document image is extracted in the same way as described in Section 2.1. A class label is then assigned to each connected component based on its classification probabilities for text and non-text.

To improve the segmentation results, a nearest-neighbor analysis using the class probabilities is performed to refine the class label of each connected component. For this purpose, a region of 70 × 70 (chosen empirically) is selected from the document image with the targeted connected component at its center. The probabilities of the connected components within the selected region have already been computed during classification, and the already assigned class label of each connected component is updated using the average text and non-text probabilities of the connected components within the selected region. Some segmentation results are shown in Figure 4.
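A minimal sketch of this refinement step is given below. It assumes each component is summarized by its centroid and the classifier's text probability, and that an averaged text probability above 0.5 yields a text label; both simplifications are illustrative rather than taken from the method description.

```python
import numpy as np

def refine_labels(components, window=70):
    """Neighborhood-based label refinement (sketch).

    components: list of dicts with keys
        'cx', 'cy'  - centroid of the connected component,
        'p_text'    - classifier probability of the text class.
    Returns a refined 'text' / 'non-text' label per component.
    """
    half = window / 2.0
    centers = np.array([(c['cx'], c['cy']) for c in components], dtype=float)
    p_text = np.array([c['p_text'] for c in components], dtype=float)
    refined = []
    for cx, cy in centers:
        # Components whose centroids fall inside the window centered on (cx, cy).
        inside = (np.abs(centers[:, 0] - cx) <= half) & \
                 (np.abs(centers[:, 1] - cy) <= half)
        # Average the text probability over the neighborhood (the component
        # itself is always included) and relabel accordingly.
        refined.append("text" if p_text[inside].mean() > 0.5 else "non-text")
    return refined
```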
3. EXPERIMENTAL RESULTS

We have evaluated our document image segmentation approach using the UW-III dataset, the ICDAR 2009 page segmentation competition test dataset [1], and our private circuit diagrams dataset. The main reason for using different datasets is to check the accuracy of our approach on different types of images that have not been used in training, as well as to cover a variety of text and non-text components. For example, the majority of the document images in the UW-III dataset have a Manhattan layout, whereas the ICDAR 2009 dataset also contains documents with non-Manhattan layouts. All non-text components except halftones have been removed from the UW-III and ICDAR 2009 test datasets. In contrast, the circuit diagrams dataset is mainly composed of text and drawing components and contains no other types of non-text components. A total of 95 documents have been selected from the UW-III dataset, the ICDAR 2009 dataset contains 8 test images, and our circuit diagrams dataset is composed of 10 images.

Segmentation quality is measured with the following pixel-level metrics (a sketch of how they can be computed from pixel label maps follows the list):

1. non-text classified as non-text: percentage of the intersection of non-text pixels in both the segmented and the ground truth image with respect to the total number of non-text pixels in the ground truth image.

2. non-text classified as text: percentage of the intersection of text pixels in the segmented image and non-text pixels in the ground truth image with respect to the total number of non-text pixels in the ground truth image.

3. text classified as text: percentage of the intersection of text pixels in both the segmented and the ground truth image with respect to the total number of text pixels in the ground truth image.

4. text classified as non-text: percentage of the intersection of non-text pixels in the segmented image and text pixels in the ground truth image with respect to the total number of text pixels in the ground truth image.

5. segmentation accuracy: average of the text classified as text accuracy and the non-text classified as non-text accuracy.
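The sketch below computes these metrics directly from pixel label maps. It assumes the segmentation result and the ground truth are integer arrays of the same size in which 0 marks background, 1 marks text pixels, and 2 marks non-text pixels, and it omits guards for images that contain no text or no non-text pixels.

```python
import numpy as np

TEXT, NON_TEXT = 1, 2   # assumed pixel labels; 0 = background

def segmentation_metrics(seg, gt):
    """Pixel-level metrics (in percent) following the definitions above."""
    gt_text,  gt_non  = (gt == TEXT),  (gt == NON_TEXT)
    seg_text, seg_non = (seg == TEXT), (seg == NON_TEXT)
    n_text, n_non = gt_text.sum(), gt_non.sum()

    non_as_non   = 100.0 * (seg_non  & gt_non ).sum() / n_non    # metric 1
    non_as_text  = 100.0 * (seg_text & gt_non ).sum() / n_non    # metric 2
    text_as_text = 100.0 * (seg_text & gt_text).sum() / n_text   # metric 3
    text_as_non  = 100.0 * (seg_non  & gt_text).sum() / n_text   # metric 4
    accuracy = (text_as_text + non_as_non) / 2.0                 # metric 5
    return {"non-text as non-text": non_as_non,
            "non-text as text": non_as_text,
            "text as text": text_as_text,
            "text as non-text": text_as_non,
            "segmentation accuracy": accuracy}
```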
Based on the metrics defined above, we have compared our approach with leptonica's page segmentation algorithm, which is designed exclusively for segmenting text and halftone components. The performance comparison of our method and the leptonica method on the UW-III and ICDAR 2009 test datasets, which contain only text and halftone components, is shown in Table 1. The boxplots of text and halftone accuracy of both methods on the combined UW-III and ICDAR 2009 test datasets are shown in Figure 5. Our algorithm has also been evaluated on the circuit diagrams dataset in order to show its potential compared to text and halftone based segmentation approaches such as leptonica; the results are shown in Table 2.

4. DISCUSSION
(a) Image with text and halftone only (UW-III), (b) leptonica method, (c) our method.
(d) Image with text and halftone only (ICDAR 2009), (e) leptonica method, (f) our method.
(g) Image with text and halftone only (Circuit Diagram), (h) leptonica method, (i) our method.
Figure 4: Document image segmentation results of our and leptonica methods in non-text mask format.
Table 1: Performance evaluation of our and leptonica page segmentation algorithms on the UW-III dataset (95 document images), the ICDAR 2009 page segmentation competition test dataset (8 document images), and the combined UW-III and ICDAR 2009 datasets (103 document images).
(a) Box plot of halftone classification accuracy, (b) box plot of text classification accuracy.
Figure 5: Box plots of our and leptonica page segmentation algorithms on 103 images of the combined UW-III and ICDAR 2009 page segmentation competition test datasets. Average classification accuracies are shown on top of the boxplots.