An Intelligent and Unified Text and Non-Text Object Extraction From PDF Using Support Vector Machine
contain only the text data. Image objects include graphs, tables, lists, and images. Document segmentation plays an important role in e-books, where it is used to reuse the content of the document. It is a method of subdividing the document into text regions and image regions, and it leads to layout analysis.
Document segmentation can be divided into text segmentation and image segmentation. Existing work segments only the text content of a PDF document. However, it is important to segment the images along with the text. Text segmentation is a precursor to text retrieval, auto synthesis, information retrieval, language modeling, and natural language processing. In written texts, text segmentation is the process of identifying boundaries between words, phrases or other meaningful units of language such as sentences and arguments. Such segmentation helps people read text and gives computers the basic units they need for automated processing. Line extraction is a preprocessing step for handwriting recognition and document structure extraction, and image segmentation is a mid-level processing technique. The main purpose of the segmentation process is to obtain more information about the area of interest in an image.
Most PDF documents contain both text objects and images. This work therefore considers both text and image objects when segmenting PDF documents in the e-book format. This overcomes the limitation of segmentation in tables, images, graphics, etc. in the PDF document. In this system both the text layer and the image layer are taken into consideration for segmentation. Each layer segments its data independently. Finally the results of the text and image layers are merged together for the final segmentation. Text segmentation and image segmentation are both performed so that the content can be reused.
II. LITERATURE REVIEW
Neha Gupta et al. introduced a text extraction concept based on image segmentation. The text contained in images carries critical and useful information. Text extraction from images has been used in a large variety of applications such as vehicle license plate detection, document retrieval, mobile robot navigation, and object identification. In this system, text information is retrieved from complex input images by using the Discrete Wavelet Transform (DWT). For a color image, a preprocessing step is required to extract the text edges. The edge map is formed from the resulting edges. Morphological operations are applied to the processed edge map to improve performance, and then thresholding is applied to the image.
Chandranath Adak et al. introduced a new method for unsupervised text extraction from G-Maps. Text extraction is a method of extracting passages of text from a non-text background. Because the approach is unsupervised, no prior knowledge or training on the textual and non-textual parts is required. The fuzzy C-means clustering technique and the Prewitt method are used for image segmentation and edge detection. The limitation of this system is that it is not fully automatic, because of the threshold, and the selection of the better result depends on the human eye.
Q. Yuan et al. introduced a new text extraction technique based on edge information. The scheme uses edge information to extract textual blocks from gray scale document images. Its main objective is to detect textual regions in heavily noise-infected newspaper images and separate them from graphical regions. The algorithm traces the feature points in different entities and then groups the edge points of textual regions. With this method, accurate page decomposition can be obtained with efficient computation and reduced memory size by working with line segments.
Thai V. Hoang et al. introduced a new text extraction method based on sparse representation. The input document image, which includes both text and graphics, is processed to produce two
Published by: The Mattingley Publishing Co., Inc. 3539
July-August 2020
ISSN: 0193-4120 Page No. 3538 - 3546
output images: one containing the text and the other containing the graphics. Graphical document images containing text and graphic components are considered as two-dimensional signals. The proposed algorithm relies on a sparse representation framework with appropriately chosen discriminative overcomplete dictionaries. Each dictionary gives a sparse representation over one type of signal and a non-sparse representation over the other. Separation of text and graphic components is obtained by promoting a sparse representation of the input images in these dictionaries. The proposed approach overcomes the problem of touching between text and graphics.
S. Ranjini et al. implemented a new method of extracting and recognizing text from an English digital comic image using the median filter. In this work, blob extraction functions are used, and Japanese text is extracted vertically from manga comic images. Optical Character Recognition (OCR) then converts the extracted Japanese manga text into other languages, so that manga can be read on the internet.
III. EXISTING METHODOLOGY
The existing work groups text into visually homogeneous blocks. From the PDF document, the text components are separated from the image components such as images, tables and graphs. Here, line segmentation is considered over a horizontal reading order. The method involves three modules: text information retrieval, the merging of words into text lines, and the grouping of text lines into text blocks.
A. Text Information Retrieval
A PDWordFinder extracts words from a PDF file, and enumerates the words on a single page or on all pages in a document. The visual attributes such as font family, font size, color and bounding box are retrieved. The bounding boxes can be formed based on the font size of each word in a paragraph or line. In this module text information is retrieved.
B. The Merging of Words into Text Lines
In this module the extracted text is grouped into line segments. Initially the words or quads are sorted in top-down or bottom-up order, based on the center position of their bounding boxes. Then words are merged horizontally if the vertical distance between two words is less than a threshold. For each line segment, the font size and the vertical center of the bounding box are taken as attributes and are computed by weighted averaging. In this way logical text line segmentation is achieved.
C. The Grouping of Text Lines into Text Blocks
This module merges text line segments into homogeneous text blocks. The problem arising in Bloechle's method can be overcome by decoupling line space and font size and by carefully detecting block boundaries during region growing.
The existing module does not consider objects such as titles, tables, lists and maps. Graphic recognition and its integration with text segmentation are not considered. In many cases, graphic components such as lines and colored backgrounds are used to separate text, so detecting graphic components and integrating them with text segmentation would greatly improve performance. Lists and tables are not considered in this text segmentation. Text belonging to map regions often has various orientations and excess character spacing. These are the most challenging cases for text segmentation.
IV. PROPOSED METHODOLOGY
The proposed work considers the segmentation of both text and image components in the e-book PDF format. This overcomes the restriction of segmentation in tables, images and graphs in the PDF document. In this system both the text layer and the image layer are taken into consideration for segmentation. Each layer segments its data independently. Finally the results of the text and image layers are merged together for the final segmentation. Text segmentation and image segmentation are both performed so that the content can be reused. This work is composed of text segmentation and image segmentation. In text
segmentation, the text content of the e-book PDF documents is segmented, and in image segmentation the image objects are segmented. Hence, by considering both text and image objects in the PDF document, the accuracy, precision, recall and F-measure of the segmented documents are increased.
A. Graph based Text Segmentation
In the PDF document the words and quads are accessed through the Word Finder and their visual attributes are retrieved. Then PDWordGetNthQuad is used to get the bounding boxes of the quads. The bounding boxes may vary from one word or quad to another, and also from one line to another. Additionally, the vertical center lines are computed from the bounding boxes. The words or quads are then merged into text lines by picking a quad that has not yet been assigned a line identity to begin a new line segment. The line is then extended by adding qualified quads on both the left and the right of the line. When no qualified quad can be added to the line, a new line is started, until all quads have been assigned a line identity. This merging criterion is similar to Bloechle's. If the horizontal distance between two words is smaller than a threshold value, those words are merged horizontally, and the vertical distance between the words is not considered. The font size, vertical center and width of the quad are used as attributes to form the line segment. After obtaining the text line segments we build homogeneous text blocks, avoiding the pitfall by decoupling line space and font size. The line space is the distance between vertical center lines, and a block boundary is found by comparing the relative difference between two line spaces with a threshold value. For example, if line i is a block boundary, its relative line space difference must be greater than the threshold. Similarly, the relative difference between font sizes is also used to find block boundaries, with the condition that the relative font size difference is greater than a threshold, where the font size of line i is the average of the font sizes within the line. The block boundary can be measured using line space and font size and also the type of block boundary. This can be explained as an algorithm in the following section.
Text Segmentation Algorithm
Input: PDF document
Output: Text content in text pad
Step 1: Access words and quads in PDF document
Step 2: Check the document in horizontal reading direction
Step 3: Calculate geometric center point and form block
Step 4: For boundary detection assign [...]
Step 5: Define boundary for each line and increase the [...] by 1.
Step 6: Merge the lines using queue lines. [...]
Step 7: Using Kruskal define the edge weight. Sort the edge weight in descending order and calculate mean and variance value
[...]; Variance = [...]
Step 8: Set the threshold value θ = n*variance
Step 9: Remove the edges e for which w(e) - Mean > θ
Step 10: Calculate angle distribution for segmentation
Angle distribution = [...]
Step 11: Calculate line spacing and word spacing
word spacing > interline spacing
Merge according to width of block.
B. Graph based Image Segmentation
In the image segmentation process a digital image is partitioned into a number of segments. The image may contain tables, lists, and graphs. By this segmentation process those contents are partitioned
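Steps 7 to 9 of the text segmentation algorithm can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the nodes are assumed to be the geometric center points of text lines, the edge weight is assumed to be the Euclidean distance between centers, the variance is taken as the population variance of the spanning-tree edge weights, and the names kruskal_mst and segment as well as the multiplier n are hypothetical.

```python
from itertools import combinations
from math import dist
from statistics import mean, pvariance


def find(parent, i):
    # Union-find root lookup with path halving.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def kruskal_mst(points):
    # Kruskal: scan candidate edges by ascending weight and keep
    # every edge that joins two different trees.
    edges = sorted((dist(points[a], points[b]), a, b)
                   for a, b in combinations(range(len(points)), 2))
    parent = list(range(len(points)))
    tree = []
    for w, a, b in edges:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb
            tree.append((w, a, b))
    return tree


def segment(points, n=1.0):
    # Assumed reading of Steps 7-9: compute mean and variance of the
    # tree edge weights, drop edges with w - mean > n * variance, and
    # return the remaining connected components as index groups.
    tree = kruskal_mst(points)
    weights = [w for w, _, _ in tree]
    if not weights:
        return [[i] for i in range(len(points))]
    mu = mean(weights)
    theta = n * pvariance(weights)
    parent = list(range(len(points)))
    for w, a, b in tree:
        if w - mu <= theta:  # edge survives the threshold test
            parent[find(parent, a)] = find(parent, b)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(parent, i), []).append(i)
    return sorted(groups.values())


centers = [(0, 0), (0, 1), (0, 2), (0, 12), (0, 13)]
blocks = segment(centers, n=0.25)
```

For these five hypothetical center points and n = 0.25, the single long edge between the third and fourth point exceeds the threshold and is removed, so the points fall into two groups. Because the threshold is a multiple of the variance, a suitable n depends on the scale of the coordinates.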
growing region approach it segments the text content in the PDF document, while in the graph based approach the image objects are segmented by using a hybrid method.
A. Sample input and output for graph based approach
Fig 1. Input PDF document
In Fig 1 above, a PDF document is taken as input to evaluate text and image segmentation.
document processing; XY cut segmentation and bottom up growing approach. Then for image segmentation the proposed graph based approach is compared with segmentation using connected components, Eigen vector and bottom up nearest neighbor application.
A. Accuracy
Accuracy is defined as the proportion of true positives and true negatives among the total number of results obtained. Accuracy is evaluated as,
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives.
Fig. 3 shows graph based segmentation for text, which shows higher accuracy than the existing approaches.
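The evaluation measures used in this work can be computed from the four outcome counts. The following Python sketch is illustrative only: the function name evaluation_metrics and the sample counts are hypothetical, and precision, recall and F-measure use their standard definitions, which the text does not spell out.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F-measure from outcome counts."""
    total = tp + tn + fp + fn
    # Accuracy: share of correct results (TP and TN) among all results.
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: share of predicted positives that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure: harmonic mean of precision and recall.
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts, not taken from the experiments in this paper.
scores = evaluation_metrics(tp=8, tn=1, fp=1, fn=0)
```

The zero-denominator guards return 0.0 instead of raising an error when a document yields no positive predictions or no positive ground truth.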
Graph based approach   96.6   1.0   1.0   1.0
The experimental results show that the proposed graph based approach segments the e-book PDF format more effectively than the existing segmentation approaches.
VI. CONCLUSION
In this work, the e-book PDF format is segmented considering both text objects and image objects. The text layer and the image layer are processed separately, and finally the objects are classified by an SVM classifier. Experiments were conducted on various e-book PDF documents to show that the proposed graph-based approach is better than the existing bottom up region approach in terms of accuracy, precision, recall, F-measure and time.
REFERENCES
[1] Gupta, N., & Banga, V. K. (2012, April). Image Segmentation for Text Extraction. In 2nd International Conference on Electrical, Electronics and Civil Engineering (ICEECE'2012) (pp. 182-185).
[2] Adak, C. (2013, August). Unsupervised text extraction from G-maps. In Human Computer Interactions (ICHCI), 2013 International Conference on (pp. 1-4). IEEE.
[3] Gautam, A. (2013). Segmentation of Text From Image Document. International Journal of Computer Science and Information Technologies, 4(3), 538-540.
[4] Hassan, T. (2010). User-guided information extraction from print-oriented documents.
[5] O'Gorman, L. (1993). The document spectrum for page layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(11), 1162-1173.
[6] Yuan, Q., & Tan, C. L. (2001). Text extraction from gray scale document images using edge information. In Document Analysis and Recognition, 2001. Proceedings. Sixth International Conference on (pp. 302-306). IEEE.
[7] Lienhart, R., & Wernicke, A. (2002). Localizing and segmenting text in images and videos. IEEE Transactions on circuits and systems for video technology, 12(4), 256-268.
[8] Ranjini, S., & Sundaresan, M. (2013). Extraction and Recognition of Text From Digital English Comic Image Using Median Filter. International Journal on Computer Science and Engineering, 5(4), 238.
[9] Tehsin, S., Masood, A., & Kausar, S. (2014). Survey of Region-Based Text Extraction Techniques for Efficient Indexing of Image/Video Retrieval. International Journal of Image, Graphics and Signal Processing, 6(12), 53.
[10] Arulkumar V. "An Intelligent Technique for Uniquely Recognising Face and Finger Image Using Learning Vector Quantisation (LVQ)-based Template Key Generation." International Journal of Biomedical Engineering and Technology 26, no. 3/4 (February 2, 2018): 237-49. doi:10.1504/IJBET.2018.089951
[11] Hoang, T. V., & Tabbone, S. (2010, June). Text extraction from graphical document images using sparse representation. In Proceedings of the 9th IAPR International Workshop on Document Analysis Systems (pp. 143-150). ACM.
[12] V Arulkumar, Charlyn Puspha Latha, Daniel Jr Dasig, "Concept of Implementing Big Data In Smart City: Applications, Services, Data Security In Accordance With Internet of Things and AI" International Journal of Recent Technology and Engineering 8, no. 3 (September 2019): 237-49. 2277-3878
[13] Arulkumar, C. V., and P. Vivekanandan. "Multi-feature based automatic face identification on kernel eigen spaces (KES) under unstable lighting conditions." Advanced Computing and Communication Systems, 2015 International Conference on. IEEE, 2015