An Intelligent and Unified Text and Non-Text Object Extraction From PDF Using Support Vector Machine

This document discusses a method for extracting both text and non-text objects such as images, graphs and tables from PDF documents using support vector machine (SVM) classifiers. It involves segmenting the PDF into separate text and non-text object layers independently, then merging the results. The method uses a bottom-up approach to extract text lines and a top-down approach to split diagrams generated by Kruskal's algorithm. SVM techniques are used to classify text and non-text objects in each segmented section. The method was tested on various PDF documents and provided sample input and output.

July-August 2020

ISSN: 0193-4120 Page No. 3538 - 3546

An intelligent and unified text and non-text object extraction from PDF using Support Vector Machine
Dr.V.Anandkumar,
Professor, Department of Information Technology, Sri Krishna College of Engineering and Technology,
Coimbatore. [email protected]
Mr.A.Vijay,
Assistant Professor, Department of Business Administration and Information Systems, Arbaminch
University, Sawla campus, Ethiopia. [email protected]
Ms.G.Divya,
Assistant Professor, Department of Computer Science and Engineering, Saveetha School of Engineering,
Chennai. [email protected]
Mr.V.Arulkumar,
Assistant Professor, Department of Computer Science, Sri Krishna College of Engineering and Technology,
Coimbatore. [email protected]

Article Info
Volume 83
Page Number: 3538 - 3546
Publication Issue: July-August 2020

Article History
Article Received: 25 April 2020
Revised: 29 May 2020
Accepted: 20 June 2020
Publication: 10 August 2020

Abstract:
Today e-books play an important role in all fields as a way to learn new things on personal computers, laptops or mobile phones. There are several e-book formats. The most used format is PDF because it preserves the original format of the document. Segmentation is used to reuse content, but existing systems segment documents only as textual content and do not take into account non-textual elements such as graphics, tables and images. In this study, layout analysis is performed by extracting text objects and non-text objects from the PDF document and segmenting the objects separately using Support Vector Machine (SVM) classifiers. Finally, the text objects and non-text objects are output separately. The method uses a bottom-up approach to extract lines of text and a top-down approach to split the tree generated by Kruskal's algorithm into subgraphs, using the Euclidean distance between adjacent vertices. Text and non-text objects are classified using SVM techniques. For each segmented text and non-text object, features of different dimensions are extracted for labeling purposes. Different e-book PDF documents are tested, and some sample input and output PDF documents are shown in the experimental results.

Keywords: E-book, PDF object, Support vector machine, Graph based Image Segmentation.

Published by: The Mattingley Publishing Co., Inc.

I. INTRODUCTION

An e-book is an electronic version of a traditional print book that can be read on a personal computer or on an e-book reader. E-books are available in various formats such as MOBI, AZW, AZW1, AZW4, EPUB, and PDF. The most widely used e-book format is PDF because it maintains the original formatting and security when documents are transferred; no one can change the content of the document. A PDF document may contain text objects and image objects. Text objects contain only text data. Image objects include graphs, tables, lists, and images. Document segmentation plays an important role in e-books because it allows the content of the document to be reused. It is a method of subdividing the document into text regions and image regions, and it leads to layout analysis.

Document segmentation can be divided into text segmentation and image segmentation. Existing work segments only the text content of a PDF document. However, it is more important to segment the images together with the text. Text segmentation is a precursor to text retrieval, auto synthesis, information retrieval, language modeling, and natural language processing. In written texts, text segmentation is the process of identifying boundaries between words, phrases or other important units of language such as sentences and arguments. Such segmentation helps people read text and, above all, gives computers the basic units they need for automated processing. Line extraction is a preprocessing step for handwriting recognition and document structure extraction, and image segmentation is a mid-level processing technique. The main purpose of the segmentation process is to get more information about the area of interest in an image.

Most PDF documents contain both text objects and images. The proposed research therefore takes into account the segmentation of both text and image components in the PDF e-book format. This overcomes the limitation of existing segmentation with respect to tables, images, graphics, etc. in the PDF document. In this system both the text layer and the image layer are taken into consideration for segmentation. Each layer segments its data independently. Finally, the results of the text and image layers are merged for the final segmentation. Text segmentation and image segmentation are used so that the content can be reused.

II. LITERATURE REVIEW

Neha Gupta et al. introduced a text extraction concept based on image segmentation. The text contained in images carries critical and useful information. Text extraction from images has been used in a large variety of applications such as vehicle license plate detection, document retrieval, mobile robot navigation, and object identification. In this system, text information is retrieved from complex input images using the Discrete Wavelet Transform (DWT). For color images, a preprocessing step is required before the text edges can be extracted. The edge map is formed from the resulting edges. Morphological operations are applied to the processed edge map to improve performance, and then thresholding is applied to the image.

Chandranath Adak et al. introduced a new method for unsupervised text extraction from G-Maps. Text extraction is a method of separating text from a non-text background. Because the approach is unsupervised, no prior knowledge of or training on the textual and non-textual parts is required. Fuzzy C-means clustering or the Prewitt method is used for image segmentation and edge detection. The limitation of this system is that it is not fully automatic: a threshold must be chosen, and the selection of the better result depends on the human eye.

Q. Yuan et al. introduced a new text extraction technique based on edge information. The scheme uses edge information to extract textual blocks from gray scale document images. Its main objective is to find textual regions in heavily noise-infected newspaper images and separate them from graphical regions. The algorithm traces the feature points in different entities and then groups the edge points of textual regions. With this method, accurate page decomposition can be obtained with efficient computation and reduced memory size by working with line segments.

Thai V. Hoang et al. introduced a new text extraction method based on sparse representation. The input document image, which includes both text and graphics, is processed to produce two
output images, one containing the text and the other the graphics. Graphical document images containing text and graphic components are considered as two-dimensional signals. The proposed algorithm relies on a sparse representation framework with appropriately chosen discriminative overcomplete dictionaries, each of which gives a sparse representation of one type of signal and a non-sparse representation of the other. Separation of the text and graphic components is obtained by promoting sparse representation of the input images in these dictionaries. The proposed approach overcomes the problem of touching between text and graphics.

S. Ranjini et al. implemented a new method of extracting and recognizing text from a digital English comic image using the median filter. In this work, blob extraction functions are used, and Japanese text is extracted vertically from manga comic images. Optical Character Recognition (OCR) then removes several text restrictions and converts the Japanese manga text into other languages in the conventional way, so that manga can be read on the internet.

III. EXISTING METHODOLOGY

In the existing work, text segmentation groups text into visually homogeneous blocks. The text components of the PDF document are separated from the image components such as images, tables and graphs. Here, line segmentation assumes a horizontal reading order. The method involves three modules: text information retrieval, the merging of words into text lines, and the grouping of text lines into text blocks.

A. Text Information Retrieval

A PDWordFinder extracts words from a PDF file and enumerates the words on a single page or on all pages in a document. Visual attributes such as font family, font size, color and bounding box are retrieved. The bounding boxes can be formed based on the font size of each word in a paragraph or line. In this module the text information is retrieved.

B. The Merging of Words into Text Lines

In this module the extracted text is grouped into line segments. Initially the words or quads are sorted top-down or bottom-up based on the center position of their bounding boxes. Then words are merged horizontally if the vertical distance between them is less than a threshold. For each line segment, the font size and the vertical center of the bounding box are taken as attributes and computed by weighted averaging. In this way logical text line segmentation can be achieved.

C. The Grouping of Text Lines into Text Blocks

This module merges text line segments into homogeneous text blocks. The pitfall of Bloechle's method can be overcome by decoupling line space and font size and carefully detecting block boundaries during region growing.

The existing method does not consider objects such as titles, tables, lists and maps. Graphic recognition and its integration with text segmentation are not considered. In many cases, graphic components such as lines and colored backgrounds are used to separate text. Detecting graphic components and integrating them with text segmentation would greatly improve performance. Lists and tables are not considered in this text segmentation. Text belonging to map regions often has various orientations and excess character spacing. These are the most challenging cases for text segmentation.

IV. PROPOSED METHODOLOGY

In the proposed work the segmentation of both text and image components in the e-book PDF format is considered. This overcomes the restriction of existing segmentation with respect to tables, images and graphs in the PDF document. In this system both the text layer and the image layer are taken into consideration for segmentation. Each layer segments its data independently. Finally, the results of the text and image layers are merged for the final segmentation. Text segmentation and image segmentation are used so that the content can be reused. This work is composed of text segmentation and image segmentation.
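The word-to-line merging described in Section III-B can be illustrated with a small Python sketch. Words are sorted top-down by the vertical center of their bounding boxes, and a word joins the current line when its vertical center lies within a threshold of the line's center; the line's vertical center and font size are kept as weighted averages. The `Word` and `Line` types, the function name `merge_words_into_lines`, the use of word width as the averaging weight, and the threshold value are assumptions made for illustration only; the paper does not specify them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Word:
    text: str
    x0: float          # left edge of the bounding box
    y0: float          # top edge (y grows downward in this sketch)
    x1: float          # right edge
    y1: float          # bottom edge
    font_size: float

    @property
    def v_center(self) -> float:
        return (self.y0 + self.y1) / 2.0

    @property
    def width(self) -> float:
        return self.x1 - self.x0


@dataclass
class Line:
    words: List[Word] = field(default_factory=list)
    v_center: float = 0.0
    font_size: float = 0.0
    weight: float = 0.0   # sum of the word widths merged so far

    def add(self, word: Word) -> None:
        # Width-weighted running average of the line attributes.
        # Using the width as the weight is an assumption of this sketch.
        w = max(word.width, 1e-9)
        total = self.weight + w
        self.v_center = (self.v_center * self.weight + word.v_center * w) / total
        self.font_size = (self.font_size * self.weight + word.font_size * w) / total
        self.weight = total
        self.words.append(word)


def merge_words_into_lines(words: List[Word], v_threshold: float) -> List[Line]:
    """Group words into lines, top-down, by vertical-center proximity."""
    lines: List[Line] = []
    for word in sorted(words, key=lambda w: (w.v_center, w.x0)):
        if lines and abs(word.v_center - lines[-1].v_center) < v_threshold:
            lines[-1].add(word)
        else:
            line = Line()
            line.add(word)
            lines.append(line)
    for line in lines:
        line.words.sort(key=lambda w: w.x0)   # restore left-to-right order
    return lines
```

Calling `merge_words_into_lines(words, v_threshold=5.0)` on the words of one page returns the lines in top-down order, each with its words sorted left to right. A real implementation would take the boxes from the word finder rather than from hand-built `Word` objects.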

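The graph step named in the abstract and in Steps 7 to 9 of the Text Segmentation Algorithm can be illustrated with a small Python sketch. It treats the centers of word bounding boxes as vertices, weights every pair of vertices by Euclidean distance, builds a minimum spanning tree with Kruskal's algorithm, and then cuts the tree edges whose weight exceeds the mean by more than n times the variance. The function names, the use of box centers as vertices, the complete graph, the population variance and the value of n are all assumptions of this sketch, not details given in the paper. Kruskal's algorithm needs the edges in ascending order, so that is the order used here when the tree is built.

```python
import math
from itertools import combinations
from statistics import mean, pvariance
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]
Edge = Tuple[float, int, int]  # (weight, vertex u, vertex v)


class DisjointSet:
    """Union-find structure used by Kruskal's algorithm."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> bool:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False          # a and b are already connected
        self.parent[ra] = rb
        return True


def kruskal_mst(points: Sequence[Point]) -> List[Edge]:
    """Minimum spanning tree over the complete Euclidean graph."""
    edges = sorted(
        (math.dist(points[u], points[v]), u, v)
        for u, v in combinations(range(len(points)), 2)
    )
    dsu = DisjointSet(len(points))
    return [e for e in edges if dsu.union(e[1], e[2])]


def split_tree(points: Sequence[Point], n_factor: float) -> List[List[int]]:
    """Cut tree edges with w(e) - mean > n_factor * variance; return clusters."""
    mst = kruskal_mst(points)
    if not mst:
        return [[i] for i in range(len(points))]
    weights = [w for w, _, _ in mst]
    mu = mean(weights)
    theta = n_factor * pvariance(weights)
    dsu = DisjointSet(len(points))
    for w, u, v in mst:
        if w - mu <= theta:       # keep only the edges that are not cut
            dsu.union(u, v)
    groups: Dict[int, List[int]] = {}
    for i in range(len(points)):
        groups.setdefault(dsu.find(i), []).append(i)
    return sorted(groups.values())
```

Because the variance scales with the square of the distance unit while the mean scales linearly, the effect of a given n depends on the coordinate scale; in practice the factor would have to be tuned on sample pages.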
Text segmentation segments the text content of the e-book PDF documents, and image segmentation segments the image objects. Hence, by considering both text and image objects in the PDF document, the accuracy, precision, recall and F-measure of the segmented documents will be increased.

A. Graph based Text Segmentation

In the PDF document the words and quads are accessed through the Word Finder and their visual attributes are retrieved. Then PDWordGetNthQuad is used to get the bounding boxes of the quads. The bounding box may vary from one word or quad to another, and from one line to another. Additionally, the vertical center lines are computed from the bounding boxes. The words or quads are then merged into text lines by picking a quad that has not yet been assigned a line identity to begin a new line segment. The line is then extended by adding qualified quads to the left and right of the line. When no qualified quad can be added to the line, a new line is started, until all quads have been assigned a line identity. This merging criterion is similar to Bloechle's. If the horizontal distance between two words is smaller than a threshold value, those words are merged horizontally, and the vertical distance between the words is not considered. Here the font size, vertical center and width of the quad are used as attributes to form the line segment.

After obtaining the text line segments, we build homogeneous text blocks, avoiding the pitfall by decoupling line space and font size. The relative difference between two line spaces is defined as [expression missing in source], where the line space is the distance between vertical center lines, and the block boundary is found by comparing the relative line space difference with a threshold value. For example, if line i is a block boundary it must satisfy the condition [expression missing in source] > threshold. Similarly, the relative difference between font sizes is calculated to find the block boundary, using the condition [expression missing in source] > threshold, where the font size is the average of the font sizes within line i. The block boundary can thus be determined using line space and font size and also the type of block boundary. This is set out as an algorithm in the following section.

Text Segmentation Algorithm
Input: PDF document
Output: Text content in text pad
Step 1: Access the words and quads in the PDF document.
Step 2: Check the document in the horizontal reading direction.
Step 3: Calculate the geometric center point and form a block.
Step 4: For boundary detection, assign [expression missing in source].
Step 5: Define the boundary for each line and increase the [symbol missing in source] by 1.
Step 6: Merge the lines using a queue of lines: [expression missing in source].
Step 7: Using Kruskal's algorithm, define the edge weights. Sort the edge weights in descending order and calculate their mean and variance: Mean = [expression missing in source]; Variance = [expression missing in source].
Step 8: Set the threshold value $\theta = n \cdot \text{Variance}$.
Step 9: Remove the edges $e$ for which $w(e) - \text{Mean} > \theta$.
Step 10: Calculate the angle distribution for segmentation: Angle distribution = [expression missing in source].
Step 11: Calculate the line spacing and word spacing. If word spacing > interline spacing, merge according to the width of the block.

B. Graph based Image Segmentation

In the image segmentation process a digital image is partitioned into a number of segments. The image may contain tables, lists, and graphs. By this segmentation process those contents are partitioned
separately and saved in the required location. It is a hybrid method. In this system both the text and image layers are taken into consideration for segmentation. Each layer segments its data independently. Finally, the results of the text and image layers are merged for the final segmentation.

In existing layout analysis the image objects are not considered in the segmentation; considering both textual and image objects is the main goal of this research work. It is assumed that image objects are spatially far away from text blocks. The cluster properties of the Delaunay tessellation neighborhood system are then used to reject non-textual objects. For layout segmentation only the clusters in the text region are considered. Hence it plays an important role in the reflowable reconstruction of the PDF document structure. There are two ways to identify graphic segments in PDF document pages. One works directly on the PDF path and image stream elements: it extracts the geometric features of their bounding boxes and groups the elements into the desired physical segments. The bounding box is guaranteed to include the graphic elements, but the smallest bounding box may also enclose white background that is invisible to users. Such issues make the graphic segmentation inaccurate. Additionally, when the path and image elements that make up a holistic graphic composite are vast in number, the grouping process slows down. The other option is to use well-researched image based segmentation methods. In this research work image objects are processed as a separate layer using traditional image analysis methods. Component analysis is performed from the visual perspective. Local text features describe the spatial closeness of graphic objects. A merging process is required to detect graphic composites holistically. Thresholds for connected component grouping are set based on the inter text line spacing. For graphics embedded in or surrounded by text elements, integration is handled in post-processing.

Graph based Image Segmentation Algorithm:
Input: PDF document
Output: Image objects
Step 1: Analyze the components from the visual perspective.
Step 2: Obtain the geometric features from the component analysis.
Step 3: Define the interline spacing and set the threshold. If interline spacing < threshold, merge the component objects.

C. SVM Classification

In this module the objects are classified. The SVM classifier uses training data and test data to classify the data. The output of layout analysis is the bounding boxes of text line composite objects for the text layer and of graphic composite objects for the image layer. From the analysis result of the text or image layer, a feature vector is extracted for each composite object. Because the features come from different layers, the character features of graphic components are set to zero; this is the main difference between the text and image features. For both text and image segments, all the segmented sub-images are saved, and image features describing the texture spectrum are extracted. Here SVM is used to classify text and image objects. A two-class SVM classifier has been used to detect both isolated and embedded mathematical expressions in PDF documents with an accuracy of 90%. In this work, a larger set of class labels is considered. The preceding layout analysis is used for segment extraction, where the document is segmented into physical class labels such as footer text, body text, page number text, graphic text, and header. A multi-class SVM classifier is used in the labeling task to assess the discriminative ability of the presented features.

V. EXPERIMENTAL RESULTS

In our research work we used 50 e-book PDF documents to evaluate the performance of the graph based approach in terms of accuracy, precision, recall, time measure and F-measure. In bottom up
growing region approach the text content of the PDF document is segmented, while in the graph based approach the image objects are segmented using a hybrid method.

A. Sample input and output for graph based approach

Fig 1. Input PDF document

In Fig 1, a PDF document is taken as input to evaluate text and image segmentation.

Fig 2. a) Text Segmentation b) Image Segmentation

The image objects in the input PDF document, such as graphs, images, pictures and tables, are segmented using the graph based approach, as shown in Fig 2(b). The text contents are segmented using the Kruskal's algorithm based approach, as shown in Fig 2(a). Fig 2 shows the text segmentation of the first page; the text of the whole document is segmented in the same way.

For performance evaluation, the proposed graph based text segmentation is compared with OCD document processing, XY cut segmentation and the bottom up growing approach. For image segmentation, the proposed graph based approach is compared with segmentation using connected components, Eigen vector based segmentation and the bottom up nearest neighbor application.

A. Accuracy

Accuracy is defined as the proportion of true positives and true negatives among the total number of results obtained. Accuracy is evaluated as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives. Fig. 3 shows that graph based segmentation achieves higher accuracy than the existing approaches.

Fig 3. Accuracy of the graph based approach and existing approaches for text segmentation and image segmentation

B. Precision

The precision value is evaluated from the relevant information in the true positive and false positive predictions.
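The evaluation measures used in this section can all be derived from the four confusion counts. The following Python sketch assumes the standard definitions: accuracy as defined in words above, precision = TP / (TP + FP), recall = TP / (TP + FN), and F-measure as the harmonic mean of precision and recall. The `Confusion` type, the function names and the guard that returns 0.0 for an empty denominator are hypothetical choices of this sketch and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def _ratio(numerator: float, denominator: float) -> float:
    # Return 0.0 instead of raising when there is nothing to divide by.
    return numerator / denominator if denominator else 0.0


def accuracy(c: Confusion) -> float:
    return _ratio(c.tp + c.tn, c.tp + c.tn + c.fp + c.fn)


def precision(c: Confusion) -> float:
    return _ratio(c.tp, c.tp + c.fp)


def recall(c: Confusion) -> float:
    return _ratio(c.tp, c.tp + c.fn)


def f_measure(c: Confusion) -> float:
    p, r = precision(c), recall(c)
    return _ratio(2.0 * p * r, p + r)
```

The counts would come from comparing the segmented objects with manually labeled ground truth for each test document; multiplying `accuracy` by 100 gives a percentage.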

Fig. 4 shows that graph based segmentation achieves higher precision than the existing approaches.

Fig 4. Precision of the graph based approach and existing approaches for text segmentation and image segmentation

C. Recall

The recall value is evaluated from the retrieved information in the true positive and false negative predictions. Fig 5 shows that graph based segmentation achieves higher recall than the existing approaches.

Fig 5. Recall of the graph based approach and existing approaches for text segmentation and image segmentation

D. F-measure

F-measure is calculated from the precision and recall values. It is calculated as:

$$F\text{-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Fig 6 shows that graph based segmentation achieves a higher F-measure than the existing approaches.

Fig 6. Comparison of F-measure between the graph based and bottom up region growing approaches

The values of precision, recall, accuracy, time measure and F-measure are tabulated in Table I.

TABLE I: COMPARISON TABLE

Method                                     Accuracy  Precision  Recall  F-Measure
Text segmentation
OCD Document Processing                    77.5      0.8        0.8     0.8
XY cut segmentation                        81.8      0.8        0.8     0.8
Bottom up region growing approach          88.5      0.9        0.9     0.9
Graph based approach                       94.5      0.9        0.9     0.9
Image segmentation
Segmentation using connected components    73.9      0.8        0.7     0.8
Eigen vector based image segmentation      80.6      0.8        0.8     0.8
Bottom up nearest neighbor app             91.8      0.9        0.9     0.9
Graph based approach                       96.6      1.0        1.0     1.0

The experimental results show that the proposed graph based approach segments the e-book PDF format more effectively than the existing segmentation approaches.

VI. CONCLUSION

In this work, the e-book PDF format is segmented considering both text objects and image objects. The text layer and image layer are processed separately, and finally the objects are classified by an SVM classifier. Experiments conducted on various e-book PDF documents show that the proposed graph-based approach is better than the existing bottom up region growing approach in terms of accuracy, precision, recall, F-measure and time measure.

REFERENCES
[1] Gupta, N., & Banga, V. K. (2012, April). Image Segmentation for Text Extraction. In 2nd International Conference on Electrical, Electronics and Civil Engineering (ICEECE'2012) (pp. 182-185).
[2] Adak, C. (2013, August). Unsupervised text extraction from G-maps. In Human Computer Interactions (ICHCI), 2013 International Conference on (pp. 1-4). IEEE.
[3] Gautam, A. (2013). Segmentation of Text From Image Document. International Journal of Computer Science and Information Technologies, 4(3), 538-540.
[4] Hassan, T. (2010). User-guided information extraction from print-oriented documents.
[5] O'Gorman, L. (1993). The document spectrum for page layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(11), 1162-1173.
[6] Yuan, Q., & Tan, C. L. (2001). Text extraction from gray scale document images using edge information. In Document Analysis and Recognition, 2001. Proceedings. Sixth International Conference on (pp. 302-306). IEEE.
[7] Lienhart, R., & Wernicke, A. (2002). Localizing and segmenting text in images and videos. IEEE Transactions on Circuits and Systems for Video Technology, 12(4), 256-268.
[8] Ranjini, S., & Sundaresan, M. (2013). Extraction and Recognition of Text From Digital English Comic Image Using Median Filter. International Journal on Computer Science and Engineering, 5(4), 238.
[9] Tehsin, S., Masood, A., & Kausar, S. (2014). Survey of Region-Based Text Extraction Techniques for Efficient Indexing of Image/Video Retrieval. International Journal of Image, Graphics and Signal Processing, 6(12), 53.
[10] Arulkumar V. "An Intelligent Technique for Uniquely Recognising Face and Finger Image Using Learning Vector Quantisation (LVQ)-based Template Key Generation." International Journal of Biomedical Engineering and Technology 26, no. 3/4 (February 2, 2018): 237-49. doi:10.1504/IJBET.2018.089951
[11] Hoang, T. V., & Tabbone, S. (2010, June). Text extraction from graphical document images using sparse representation. In Proceedings of the 9th IAPR International Workshop on Document Analysis Systems (pp. 143-150). ACM.
[12] V Arulkumar, Charlyn Puspha Latha, Daniel Jr Dasig, "Concept of Implementing Big Data In Smart City: Applications, Services, Data Security In Accordance With Internet of Things and AI." International Journal of Recent Technology and Engineering 8, no. 3 (September 2019): 237-49. ISSN 2277-3878
[13] Arulkumar, C. V., and P. Vivekanandan. "Multi-feature based automatic face identification on kernel eigen spaces (KES) under unstable lighting conditions." Advanced Computing and Communication Systems, 2015 International Conference on. IEEE, 2015
[14] Kumari, S., & Vijay, R. (2012). Effect of symlet filter order on denoising of still images. Advanced Computing, 3(1), 137.
[15] Wu, L., Shivakumara, P., Lu, T., & Tan, C. L. (2015). A New Technique for Multi-Oriented Scene Text Line Detection and Tracking in Video. IEEE Transactions on Multimedia, 17(8), 1137-1152.
[16] Mehta, A., Parihar, A. S., & Mehta, N. (2015, September). Supervised classification of dermoscopic images using optimized fuzzy clustering based Multi-Layer Feed-forward Neural Network. In Computer, Communication and Control (IC4), 2015 International Conference on (pp. 1-6). IEEE.
[17] Green, R., & Oliver, C. (2013, November). Layout analysis of book pages. In 2013 28th International Conference on Image and Vision Computing New Zealand (IVCNZ 2013) (pp. 118-123). IEEE.
