Extracting Body Text from Academic PDF Documents for Text Mining

Changfeng Yu, Cheng Zhang and Jie Wang


Department of Computer Science, University of Massachusetts, Lowell, MA, U.S.A.
{changfeng yu, cheng zhang}@student.uml.edu, [email protected]

Keywords: body-text extraction, HTML replication of PDF, line sweeping, backward traversal

Abstract: Accurate extraction of body text from PDF-formatted academic documents is essential in text-mining applications for deeper semantic understandings. The objective is to extract complete sentences in the body text into a txt file with the original sentence flow and paragraph boundaries. Existing tools for extracting text from PDF documents would often mix body and nonbody texts. We devise and implement a system called PDFBoT to detect multiple-column layouts using a line-sweeping technique, remove nonbody text using computed text features and syntactic tagging in backward traversal, and align the remaining text back to sentences and paragraphs. We show that PDFBoT is highly accurate with average F1 scores of, respectively, 0.99 on extracting sentences, 0.96 on extracting paragraphs, and 0.98 on removing text on tables, figures, and charts over a corpus of PDF documents randomly selected from arXiv.org across multiple academic disciplines.

arXiv:2010.12647v1 [cs.IR] 23 Oct 2020

1 INTRODUCTION

It is desirable for text-mining applications to extract complete sentences and correct boundaries of paragraphs from the body text of a PDF document into a txt file without hard breaks inside each paragraph. Layered reading (http://dooyeed.com) and extractive summarization, for example, are such applications. Layered reading allows the reader to read the most important layer of sentences first based on sentence rankings, then the layer of next most important sentences interleaved with the previous layers of sentences in the original order of the document, and to continue in this fashion until the entire document is read.

By “body text” (BT for short) we mean the main text of an article, excluding “nonbody text” (NBT for short) such as headings, footings, sidings (i.e., text on side margins), tables, figures, charts, captions, titles, authors, affiliations, and math expressions in the display mode, among other things.

Most existing tools for extracting text from PDF documents, including pdftotext (FooLabs, 2014) and PDFBox (Apache, 2017), extract a mixture of both BT and NBT texts. Identifying BT text from such mixtures of texts is challenging, if not impossible. Other tools extract texts according to rhetorical categories, such as LA-PDFText (Burns, 2013), or logical text blocks, such as Icecite (Korzen, 2017), which only provide a suboptimal solution to our applications.

Extracting BT text from PDF documents of arbitrary layouts is challenging, due to the utmost flexibility of PDF typesetting. Instead, we focus on BT extraction from single-column and multiple-column research papers, reports, and case studies. We do so by working with the location, font size, and font style of each character, and the locations and sizes of other objects. While a PDF file provides such information, we find it easier to work with HTML replications produced by an existing tool named pdf2htmlEX (Wang, 2014), which have almost the same look and feel as the original PDF document and provide the necessary formatting information via HTML tags, classes, and ids in the underlying DOM tree.

We devise a system named PDFBoT (PDF to Body Text) that, using pdf2htmlEX as a black box, incorporates certain text formatting features produced by it to identify NBT text. We use a line-sweeping method to detect multi-column layouts and the area for printing the BT text. We also develop multiple tests to identify NBT text inside the BT-text area and use a backward traversal method to deploy these tests. In addition, we use POS (Part-of-Speech) tagging to help identify NBT text that is harder to distinguish.

The rest of the paper is organized as follows: Section 2 is related work on text extraction from PDF. Section 3 describes HTML replications via pdf2htmlEX, and Section 4 presents the architecture of PDFBoT and its features. Section 5 is evaluation results with F1 scores and running time, and Section 6 is conclusions and final remarks.
2 RELATED WORK

Existing tools, such as pdftotext (FooLabs, 2014) and PDFBox (Apache, 2017), the two most widely used tools for extracting text from PDF, and a number of other tools such as pdftohtml (Kruk, 2013), pdftoxml (Dejean and Giguet, 2016), pdf2xml (Tiedemann, 2016), ParsCit (Kan, 2016), PDFMiner (Shinyama, 2016), pdfXtk (Hassan, 2013), pdf-extract (Ward, 2015), pdfx (Constantin et al., 2011), PDFExtract (Berg, 2011), and Grobid (Lopez, 2017), extract BT text and NBT text together without a clear distinction. PDFBox can extract text in two-column layouts; some other tools extract text line by line across columns.

Using heuristics is a common approach. For example, the Java PDF library was used to obtain a bounding box for each word, compute the distance between neighboring words, connect them based on a set of rules to form larger text blocks, place them into rhetorical categories, and connect these categories following the order of the underlying document (Ramakrishnan et al., 2012). However, this method fails to align broken sentences and to identify text on formulas, tables, or figures. Another approach uses an intermediate HTML representation generated by pdftohtml (Yildiz et al., 2005). Text blocks may also be created by grouping characters based on their relative positions (Shigarov et al., 2016) while extracting the tables in PDF. These two methods focus only on extracting tables.

Other methods include rule-based and machine-learning models. For example, text may be placed into predefined logical text blocks based on a set of rules on the distance, positions, and fonts of characters, words, and text lines (Bast and Korzen, 2017). However, these rules also connect text on tables or figures as BT text. A Conditional Random Field (CRF) model is trained (Luong et al., 2011; Romary and Lopez, 2015) to extract texts according to a predefined rhetorical category, such as title, abstract, and other sections in the input document. However, this model fails to determine paragraph boundaries or align broken sentences, among other things.

CiteSeerX (Giles, 2006), a search engine, extracts metadata from indexed articles in scientific documents for searching purposes, but it does not focus on the accuracy of extracting body text. PDFfigures (Clark and Divvala, 2015) chunks the text, table, and figure content into blocks, then classifies these blocks into captions, body text, and part-of-figure text. Recent studies have shifted attention to extracting certain types of text, including titles (Yang et al., 2019) (but not text on tables or figures), and math expressions in the display mode and the inline mode (Mali et al., 2020; Pfahler et al., 2019; Wang et al., 2018; Phong et al., 2020).

In summary, previous methods, while meeting with certain success, still fall short of the accuracy required by text-mining applications that rely on clean extractions of complete sentences and correct boundaries of paragraphs in BT text.

3 HTML REPLICATION OF PDF

HTML technologies have been used to replicate PDF layouts to facilitate online publishing. A PDF document can be represented as a sequence of pages, with each page being a DOM tree of objects with sufficient information for an HTML viewer to display the content (Wang and Liu, 2013). The text extracted from PDF by pdf2htmlEX (Wang, 2014) is translated into HTML text elements that are placed into the same positions as they are displayed by PDF.

Let $F$ denote a PDF document and $f$ the HTML file produced by pdf2htmlEX on $F$. The DOM tree for $f$, denoted by $T_f$, is divided into four levels: document, page, text line, and text block (TBK for short).

(1) Document structure. $T_f$ starts with the following tag as the root: <div id="page-container">, and each of its children is the root of a subtree for a page, listed in sequence, with an id indicating its page number and a class name indicating the width and height of a page. For example, a child node with <div id="pf7" class="pf w0 h0" data-page-no="7"> is the root of the subtree for Page 7, where w0 and h0 are the width and height of the page (specifying the printable area) with the origin at the lower-left corner of the page.

(2) Page structure. Each page starts with a page node, followed by object nodes with contents to be printed. Each object occupies a rectangular area (a bounding box) specified on a coordinate system of pixels. The text of the document is divided into TBKs as leaf nodes. Each TBK is represented by a <div> tag with corresponding attributes, and the text in a TBK is either all BT text or all NBT text. Each object is identified by coordinates $(x, y)$ at the lower-left corner of the bounding box relative to the coordinates of its parent node. In what follows, these coordinates are referred to as the starting point of the underlying object. In addition to the starting point, a non-textual object is specified by a width and a height, and a TBK is specified with a height without a width, where the width is implied by the enclosed text, font size and style, and word spacing. The parent of each object may either be the origin, a node for a figure or a table, or a node due to some (probably invisible) formatting code. Thus, the height of a page’s DOM tree could be
greater than 3. Figure 1 is a schematic of page structure.

Figure 1: Schematic of the page structure. The red square is a figure with a TBK1 and subscript TBK2 inside the square, where $(x_0, y_0)$, $(x_3, y_3)$, $(x_4, y_4)$ are absolute coordinates, and $(x_i, y_i)$ ($1 \le i \le 2$) are relative to $(x_0, y_0)$. Thus, the corresponding absolute locations, denoted by $(x'_i, y'_i)$, are $x'_i = x_0 + x_i$ and $y'_i = y_0 + y_i$ for $i = 1, 2$.

(3) Line structure. Each horizontal text line is made up of one or more TBKs, and no horizontal TBK contains text across multiple lines. But TBKs in the NBT text could span across multiple lines; such TBKs are either vertical or diagonal, specified by webkit-transform rotations, which rotate the text box around its center. For example, the background texts “Unpublished working draft” and “Not for distribution” on certain documents are two diagonal TBKs on top of BT text.

(4) Text-block structure. Each TBK is specified by exactly eleven classes of features, where each feature class consists of one or more features, including starting point $(x, y)$ relative to the starting point of its parent, height, font size, font style, font color, and word spacing. Enclosed in a TBK are text and additional spacing between words. A TBK ends either at the end of a line or at the beginning of a subscript, a superscript, or a citation.

4 PDFBoT

PDFBoT consists of five major components: Preprocessing, Multi-Column Detection, Text Features, Deep Removal, and BT Alignment & POS-based Removal. Figure 2 depicts the architecture and data flow diagram of PDFBoT.

4.1 Preprocessing

(1) Address resolution. On each page in the DOM tree $T_f$, each object occupies a rectangular area, specified by the starting point relative to the starting point of its parent node, and some other formatting features. The Preprocessing component calculates the absolute starting point of each object by a breadth-first search of the DOM tree. The starting points of the objects at the first level are already absolute. For each object at the second level or below, let $(x, y)$ be its relative starting point and $(x'_p, y'_p)$ the absolute starting point of its parent. Then its absolute starting point is determined by $(x', y') = (x + x'_p, y + y'_p)$. In what follows, when we mention a starting point of a TBK, we will mean its absolute starting point, unless otherwise stated.

Figure 2: PDFBoT architecture and data flow diagram

(2) Font-size statistics. This module computes the frequency of each font size (over the total number of characters) by traversing each TBK to obtain its font size and the number of characters in the text it encloses. The font size with the highest frequency, denoted by BASE_FS, is the font size for BT.

(3) Shallow removal. This module removes all non-textual objects (images and lines) and all TBKs with font size beyond the interval

$I_f = (\mathrm{BASE\_FS} - \Delta_2,\ \mathrm{BASE\_FS} + \Delta_2)$,

where $\Delta_2$ is a threshold value (e.g., $\Delta_2 = 3$), or with a rotated display, which can be checked by its webkit-transform matrix. Headings, sidings, and footings tend to have smaller font sizes than $\mathrm{BASE\_FS} - \Delta_2$ (except page numbers) and so they are removed by this module.

Remark. The abstract may have a slightly smaller font size than BASE_FS (such as 3 pt smaller as in this paper). Setting an appropriate value of $\Delta_2$ can resolve this problem. We may also deal with the abstract separately, regardless of its font size, using the keyword “Abstract” and the keyword “Introduction” to extract the abstract.
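The three preprocessing steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Node` class and its fields are hypothetical stand-ins for the objects in a page's DOM tree, and rotation is reduced to a Boolean flag rather than read from a webkit-transform matrix.

```python
from collections import Counter, deque
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for one object in a page's DOM tree."""
    x: float                      # starting point, relative to the parent
    y: float
    text: Optional[str] = None    # None marks a non-textual object
    font_size: float = 0.0
    rotated: bool = False         # assumed to be precomputed from the transform
    children: List["Node"] = field(default_factory=list)
    abs_x: float = 0.0            # filled in by resolve_addresses
    abs_y: float = 0.0


def resolve_addresses(top_level: List[Node]) -> List[Node]:
    """Breadth-first search applying (x', y') = (x + x'_p, y + y'_p)."""
    visited = []
    # First-level objects are already absolute, so their parent offset is (0, 0).
    queue = deque((node, 0.0, 0.0) for node in top_level)
    while queue:
        node, parent_x, parent_y = queue.popleft()
        node.abs_x, node.abs_y = node.x + parent_x, node.y + parent_y
        visited.append(node)
        queue.extend((child, node.abs_x, node.abs_y) for child in node.children)
    return visited


def base_font_size(nodes: List[Node]) -> float:
    """Font size that covers the largest number of characters (BASE_FS)."""
    counts = Counter()
    for node in nodes:
        if node.text is not None:
            counts[node.font_size] += len(node.text)
    return counts.most_common(1)[0][0]


def shallow_removal(nodes: List[Node], base_fs: float, delta2: float = 3.0) -> List[Node]:
    """Keep only unrotated TBKs whose font size lies inside the open interval I_f."""
    return [
        node for node in nodes
        if node.text is not None
        and not node.rotated
        and base_fs - delta2 < node.font_size < base_fs + delta2
    ]
```

With a figure node at (100, 200) that contains a TBK at relative position (5, 10), `resolve_addresses` assigns that TBK the absolute starting point (105, 210), matching the rule stated for Figure 1.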
4.2 Multi-Column Detection

Most lines on a given column are aligned flush left, except that the first line in a paragraph may be indented. Start a vertical line sweep on each page from the left edge to the right-hand edge one pixel at a time. Let $n_p(i)$ denote the number of x-coordinates in the starting points of TBKs that are equal to $i$ on page $p$, where $i$ starts from 0 and ends at $W$ one pixel at a time, and $W$ is the width of the printable area of the page (typically just the width of the page). Note that a TBK does not have coordinates at the right-hand side.

A line is aligned flush left to a column if the x-coordinate of the starting point of the leftmost TBK in the line is equal to the x-coordinate of the left boundary of the said column. It is reasonable to assume that (1) the left boundary of a corresponding column is at the same x-coordinate on all pages and (2) over one-half of the lines in any column across all pages are aligned flush left on each page. We also assume the following: Let $j$ be the left boundary of a column. If $i$ is not the left boundary of a column, then $\sum_p n_p(i)$ (summing up $n_p(i)$ over all pages) is substantially smaller than $\sum_p n_p(j)$.

Proposition 4.1. A document has $k$ columns ($k \ge 1$) iff the function $\sum_p n_p(i)$ has exactly $k$ peaks with about the same values, and the $i$-th x-coordinate that registers a peak is the left boundary of the $i$-th column.

Remarks. (1) Columns may begin at different x-coordinates for pages that are even or odd numbered. Just treat the even-numbered (and the odd-numbered) pages as one document, and then Proposition 4.1 applies to each of them respectively. (2) A two-column layout may have a one-column layout inserted, such as a one-column abstract in a two-column academic paper. This can be detected by checking the locations of TBKs. If most of them do not match the x-coordinate for the second column, then the underlying portion of the text is a single column. Single-column text is processed in the same way as the left-column text. (3) A more sophisticated method is to use a shorter vertical line segment to cover a sufficient number of lines for sweeping each time, and move this line segment as a vertical sliding window.

4.3 Text Features

(1) Line-spacing statistics. This module lines up TBKs according to their starting points to form lines in sequence. Let $(x_1, y_1)$ and $(x_2, y_2)$ be the starting points of two text blocks $B_1$ and $B_2$, respectively. Then $B_1$ and $B_2$ are on the same line iff $|y_1 - y_2| \le \Delta_1$ for a small fixed value of $\Delta_1$ (e.g., $\Delta_1 = 5$). The purpose of allowing a small variation is to make typesetting more flexible to adjust and beautify the overall layout. Suppose that they are on the same line; then $B_1$ is at the left-hand side of $B_2$ iff $x_1 < x_2$. If they are not on the same line, then $B_1$ is on a line above that of $B_2$ iff $y_1 - y_2 > \Delta_1$. This gives rise to a Page-Line-TBK tree structure of depth 2, where the Page node has Lines as children, and each Line node has one or more TBKs as children.

The module then computes the gap between every two consecutive lines in each column and obtains the frequency for each gap. The most common gap is the line spacing in the body text, denoted by BASE_LS.

(2) Char-TBK density. This module computes, for each line $L$, the number of non-whitespace characters over the number of TBKs contained in $L$. Denote by $\#\mathrm{Char}_L$ and $\#\mathrm{TBK}_L$, respectively, the number of non-whitespace characters and the number of TBKs contained in $L$. Define the density $D_L = \#\mathrm{Char}_L / \#\mathrm{TBK}_L$. Let BASE_CBD denote the average Char-TBK density for the entire document.

4.4 Deep Removal

This module removes NBT text with font sizes within the range of $I_f$. It is reasonable to assume the following features on a PDF document adhering to conventional formatting styles: (1) Math expressions in the display mode, text on tables, text of figures, text on charts, authors, and affiliations are indented by at least a pixel from the left boundary of the underlying column. (2) Every sentence ends with a punctuation. If a sentence ends with a math expression in the display mode, then the last line of the math expression must end with a punctuation. (3) The first line of text following a standalone title is aligned flush left.

(1) Remove sidings. The BT area on each page is a rectangular area within which the BT text is printed. Depending on how the majority of the BT text is displayed, the underlying document is of either single column or multiple columns. A column for printing the BT text is referred to as a major column. A column on a side margin (such as the line numbers on some documents) is referred to as a minor column, where TBKs are in red boxes. It is reasonable to assume that the width of a major column cannot be smaller than a certain value $\Gamma_1$ (e.g., $\Gamma_1 = 1.5$ inch $= 144$ pixels). It is reasonable to assume that side margins are symmetrical. Namely, in the printable area, the width of the left margin is the same as that of the right-hand margin. Without loss of generality, assume that the width of a side margin is less than $\Gamma_1$. Most documents have either one major column or two major columns. For a magazine layout, three major columns may also be used. For example, the
layout of this submission is of two columns.

Proposition 4.2. Let $k$ be the number of columns (as detected by line sweeping as in Proposition 4.1). Let $w_m$ denote the width of a side margin. Initially, set $w_m \leftarrow x_b$, the x-coordinate of the left boundary of the first column. If $k > 1$, let $x'_b$ be the x-coordinate of the left boundary of the second column. If $x'_b - x_b < \Gamma_1$, then set $w_m \leftarrow x'_b$. The BT area is from $w_m$ to $W - w_m$, where $W$ is the width of the printable area of the page.

Note that if $k > 1$ and $x'_b - x_b < \Gamma_1$, then the first column is not a major column. Any TBK whose starting point has an x-coordinate less than $w_m$ is on the left margin, and any TBK whose starting point has an x-coordinate greater than $W - w_m$ is on the right-hand margin. For example, this method removes line numbers. On a different formatting we have encountered, such as on the LaTeX template for submitting drafts to a journal by the IOS Press, a line number is a TBK with a starting x-coordinate in the left margin, where the text enclosed is a pair of the same number with a long whitespace inserted in between that crosses over the entire BT text from left to right. This pair of numbers will also be removed because its starting point is in the left margin.

(2) Remove references. The simplest way to detect references is to search for a line that consists of only one word “References” that is either on the first line of a column or has a line spacing larger than BASE_LS. Remove everything after it (this may remove appendices, which for our purpose is acceptable). A more sophisticated method is to use the following line-sweeping method to detect the area of references (by detecting nested columns within a major column, as in Proposition 4.2): Start from one pixel after the left boundary of a major column, and sweep the column from left to right with a vertical line on the entire paper. Suppose that a local peak occurs with the same x-coordinate on consecutive lines, and that on each of these lines the span from the left boundary of the column to this x-coordinate is either empty or a numbering TBK, that is, a TBK that contains a number. Then any line that has this property is a reference. To improve detection accuracy, we may also use a named-entity tagger (Peters et al., 2017) to determine if the text right after a numbering TBK is tagged as person(s).

(3) Remove special lines. Let $x_c$ be the x-coordinate of the left boundary of the column that line $L$ belongs to, and $x_L$ be the x-coordinate in the starting point of the leftmost TBK in $L$. If $x_L - x_c > \Gamma_2$ for a fixed value of $\Gamma_2$ larger than normal indentation (e.g., $\Gamma_2 = 50$; normal indentation for a paragraph is 48 pixels or less), then remove $L$. This module removes most of the math expressions in the display mode, certain author names and affiliations, as well as text on figures with the same font size as the BT text, for in this case the leftmost TBK would have a large indentation due to the space taken by the y-axis and a vertical title.

If line $L$ contains a TBK that includes a whitespace greater than a certain threshold $\Gamma_3$ (e.g., $\Gamma_3 = 50$), specified by a <span> tag, then remove $L$. It is evident that such a TBK is NBT.

(4) Remove lines by backward scans and NBT tests. The following tests are used in certain combinations to determine NBT text lines.

(a) Line-Spacing Test. An NBT line typically has larger line spacing (gap) from the immediate line above and from the immediate line below. A line $L$ passes the line-spacing test if the gap from $L$ to the immediate line above (if it exists) and the immediate line below (if it exists) is either too large or too small; namely, it is beyond an interval

$I_g = (\mathrm{BASE\_LS} - \Gamma_4,\ \mathrm{BASE\_LS} + \Gamma_4)$

for a certain threshold $\Gamma_4$ (e.g., $\Gamma_4 = \Delta_2 = 3$).

(b) Char-TBK Density Test. A line $L$ in a math expression in the display mode typically consists of a larger number of short TBKs because of the presence of subscripts and superscripts, where each word or symbol would be by itself a TBK. Thus, the char-TBK density $D_L$ would be much smaller than BASE_CBD, the average Char-TBK density. $L$ passes this test if $D_L < \Gamma_5 \cdot \mathrm{BASE\_CBD}$ for a threshold value of $\Gamma_5$ (e.g., $\Gamma_5 = 10$).

(c) Punctuation Test. A line $L$ passes the punctuation test if the rightmost TBK in $L$ does not end with a punctuation.

(d) Indentation Test. A line $L$ passes the indentation test if the x-coordinate in the starting point of its leftmost TBK is greater than that of the left boundary of the underlying column.

NBT-Tests-based Removal Algorithm. On a given document, scan text from the line preceding the list of references and move backward one page at a time to the first line on the first page. On each page, scan from the bottom line in the rightmost column and move up one line at a time. Once it reaches the top line, scan from the bottom line in the column on the left and move up one line at a time. When the top line on the leftmost column is reached, move backward to the preceding page and repeat. Let $P$ be a Boolean variable. Initially, set $P \leftarrow 0$. Scan text lines in the aforementioned order of traversal. In general, if a line is kept, then set $P$ to 0. If a line is removed, then set $P$ to 1, unless otherwise stated.

In particular, do the following during scanning: (1) If $L$ passes both the indentation test and the char-TBK density test, then remove $L$ and set $P \leftarrow 1$.
(2) If $P = 0$ and $L$ passes the line-spacing test and the punctuation test, then remove $L$ and set $P \leftarrow 0$. (3) Otherwise, keep $L$ and set $P \leftarrow 0$.

Rule 1 removes page numbers, authors and affiliations, text on tables, text on figures, and text on charts that pass both the indentation test and the char-TBK density test. This rule also removes math expressions in the display mode. It does not remove the last line of a paragraph because such a line fails the indentation test. It does not remove a single-sentence paragraph as long as it is not too short and does not contain multiple TBKs, for it would defy the small char-TBK density test. It does not remove a text line that contains an inline math expression as long as it is not the first line in an indented paragraph.

Rule 2 removes standalone one-line and two-line titles that do not end with a punctuation in each line, for the following reason: By assumption, the first text line below a standalone title is aligned flush left and so it will not be removed, which means that $P = 0$ (see Rule 3). Likewise, this rule also removes captions without punctuation at the end of each line, if the successor line is not removed, which implies that $P = 0$ (see Rule 3). This rule does not remove the last line in a math expression in the display mode, for this line must end with a punctuation by assumption, which means that $P = 1$. This ensures that the line preceding the displayed math expression that doesn’t end with a punctuation is not removed by this rule.

4.5 BT Alignment & Syntactic Removal

After Deep Removal, PDFBoT aligns BT lines to restore sentences and paragraphs without hard breaks. Recall that lines are formed according to columns. For each page, BT Alignment starts from the first line in the leftmost column and proceeds one line at a time, removing hard breaks within a paragraph until the last line in the current column. Then it moves to the next column (if there is any) and repeats the same procedure until the last line in the last column. In addition to removing hard breaks within a paragraph, it also needs to take special care of hyphens at the end of a line and boundaries of paragraphs. Removing hyphens at the end of lines is the easiest way. While this might break a hyphenated word into two words, doing so has a minor impact on our task while having a much larger benefit of restoring a word. We may also use a dictionary to determine if a hyphen at the end of a line belongs to a hyphenated word and keep it if it does.

If a line $L$ meets either of the following two conditions, then it is the first line of a paragraph: (1) The gap between $L$ and the immediate line above is greater than $\mathrm{BASE\_LS} + \Gamma_4$. (2) The x-coordinate of the leftmost TBK in $L$ is larger than that of the leftmost TBK in the line immediately above. The rest is text extraction from each TBK in the order of line locations. Denote by $f'$ the txt file from this process.

While Shallow Removal and Deep Removal can remove most of the NBT-text lines, captions that end with punctuation could still remain in BT text. To remove all captions, we use the line-spacing rule to group lines in a caption in $f'$ into a paragraph. In this paragraph, the first keyword would be one of the following: “Table”, “Figure”, “Fig.”, followed by a string of digits and a dot. If the third word in the first line of such a paragraph is not a verb, then this paragraph is deemed to be a caption. We use an existing tool (Toutanova et al., 2003) to obtain part-of-speech (POS) tags for each such paragraph, and remove it accordingly. Let BT.txt be the output.

Let $T_1(F)$ and $T_2(f')$ denote, respectively, the time complexities of pdf2htmlEX on PDF file $F$ and POS tagging on paragraphs starting with “Table”, “Figure”, or “Fig.” in $f'$.

Proposition 4.3. PDFBoT runs in $T_1(F) + T_2(f') + O(np)$ time on an input PDF document $F$, where $n$ is the number of pixels in the printable area of a page and $p$ is the number of pages contained in $f$ generated by pdf2htmlEX.

4.6 Display Sentences in Colors

As an optional component of PDFBoT, sentences may be colored in the original layout of the HTML replicate by adding appropriate color tags in $f$. Let $B = \langle C_i \rangle_{i=1}^{n}$ represent the string of character objects of the BT text, where $C_i = (c_i, b_i, t_i)$ with $c_i$ being the $t_i$-th character in the text contained in the $b_i$-th TBK. Let $S = \langle l_j \rangle_{j=1}^{m}$ be the sentence to be highlighted, where $l_j$ is the $j$-th character in $S$. Use a string-matching algorithm to find $\ell$ such that $\langle C_\ell, \ldots, C_{\ell+|S|-1} \rangle = S$. Let start_point $= C_\ell$ and end_point $= C_{\ell+|S|-1}$.

To color $S$ with a chosen color, change the corresponding elements in $f$ as follows: If start_point and end_point are in the same TBK, wrap all the characters between start_point and end_point in a new tag with an appropriate color attribute. Otherwise, for the start_point block, wrap all the characters after start_point in the block in a new tag with a color attribute; for the end_point block, wrap all the characters before end_point in the block in a new tag with the same color attribute; and wrap all the TBKs between the start_point block and the end_point block with a new tag with the same color attribute.
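The sentence-coloring step can be sketched in Python as follows. This is an illustrative sketch under simplifying assumptions, not the authors' code: TBKs are modeled as plain strings whose concatenation contains the sentence verbatim (the effects of hyphen and hard-break removal are ignored), Python's built-in `str.find` stands in for the unspecified string-matching algorithm, and the function names and the `<span style="color:...">` markup are hypothetical choices.

```python
from html import escape
from typing import List, Optional, Tuple


def locate_sentence(tbks: List[str], sentence: str) -> Optional[List[Tuple[int, int, int]]]:
    """Return (tbk_index, start, end) character ranges that cover `sentence`."""
    if not sentence:
        return None
    # Character objects C_i = (c_i, b_i, t_i): character, TBK index, offset in TBK.
    chars = [(c, b, t) for b, text in enumerate(tbks) for t, c in enumerate(text)]
    flat = "".join(c for c, _, _ in chars)
    ell = flat.find(sentence)                 # stand-in string-matching step
    if ell < 0:
        return None
    _, b_start, t_start = chars[ell]                       # start_point
    _, b_end, t_end = chars[ell + len(sentence) - 1]       # end_point
    ranges = []
    for b in range(b_start, b_end + 1):
        lo = t_start if b == b_start else 0
        hi = t_end + 1 if b == b_end else len(tbks[b])
        ranges.append((b, lo, hi))
    return ranges


def colorize(tbks: List[str], ranges: List[Tuple[int, int, int]], color: str = "red") -> List[str]:
    """Return the HTML text of each TBK with the located ranges wrapped in a colored span."""
    out = [escape(text) for text in tbks]
    for b, lo, hi in ranges:
        text = tbks[b]
        out[b] = (
            escape(text[:lo])
            + f'<span style="color:{color}">{escape(text[lo:hi])}</span>'
            + escape(text[hi:])
        )
    return out
```

For TBKs that lie strictly between the start and end blocks, the sketch wraps the whole enclosed text rather than the TBK element itself, which is a simplification of the wrapping described above.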
5 EVALUATION

We evaluate the accuracy of PDFBoT on the following tasks for a given document: (1) Extracting complete sentences in the BT text. (2) Getting correct boundaries of paragraphs. (3) Removing text on tables and figures. To do so, we first need to determine an evaluation dataset. To the best of our knowledge, no existing benchmarks are appropriate for evaluating PDFBoT. Bast and Korzen (Bast and Korzen, 2017) presented a dataset of PDF articles collected from arXiv.org, where they worked out a method to generate texts from the underlying TeX or LaTeX files as the ground-truth txt files for evaluating extraction. However, this dataset does not meet our needs for the following reasons: (1) Most of the txt files do not contain Abstracts of the underlying PDF documents, and Abstracts are an important part of the BT text. (2) Some txt files contain authors and affiliations, and some don’t, resulting in an inconsistency for evaluation. (3) The txt files treat the text after a math expression in the display mode as a new paragraph when it should not be.

We construct a dataset by selecting independently at random from arXiv.org 100 two-column PDF articles in the disciplines of biology, computer science, finance, physics, and mathematics with the following statistics on document sizes: (1) the average number of pages in an article: 8.28; (2) the median number of pages: 8; (3) the maximum number of pages: 17; (4) the minimum number of pages: 4; (5) the standard deviation: 2.94. We manually compare the extracted text with the text in the original academic PDF documents under three categories: sentences, paragraphs, and text on tables and figures.

Possible outcomes for sentence and paragraph extractions are (1) correct, (2) erroneous, and (3) missing, where “correct” means that the sentences (paragraphs) extracted are BT text as they should be; “erroneous” for sentences means that either the sentence extracted is BT text but with an error, referred to as incomplete, or it should not be extracted at all, referred to as extra, while “erroneous” for paragraphs means that the paragraph extracted is BT text but should not be a paragraph; and “missing” means that a sentence (paragraph) should be extracted but isn’t. Correct extraction is true positive (tp), erroneous extraction is false positive (fp), and extraction that is missing is false negative (fn).

Table 1 shows the statistics on extractions of sentences and paragraphs, where Total means the total number of true sentences and paragraphs, respectively, in the original articles.

Table 1: Statistics on extractions of sentences and paragraphs, where “Incpl” means incomplete.

             Total    Correct (tp)   Erroneous (fp)             Missing (fn)
Sentences    19,564   19,158         341 (Incpl), 205 (Extra)   30
Paragraphs   4,596    4,580          370                        19

On removing text on tables, figures, and charts, possible outcomes are (1) removed and (2) remained, where removed means that the text is correctly removed as it should be and remained means that the text that should be removed remains. Removed is true positive and remained is false negative. Since every text on a table or a figure/chart should be removed, there is no false positive. There are 9,469 TBKs on tables, figures, and charts in the corpus with 8,986 TBKs correctly removed and 483 TBKs remained.

Table 2 shows the statistics of precision, recall, and F1 score, which are computed individually and then rounded to the second decimal place, unless otherwise stated to avoid writing 1.00 due to rounding.

Table 2: Sentence statistics of precision, recall, and F1 score.

                                Avg     Med    Max   Min    Std
Sentences
  Precision                     0.97    0.98   1     0.92   0.02
  Recall                        0.999   1      1     0.95   0.01
  F1 score                      0.99    0.99   1     0.96   0.01
Paragraphs
  Precision                     0.93    0.93   1     0.70   0.05
  Recall                        0.99    1      1     0.81   0.03
  F1 score                      0.96    0.96   1     0.83   0.03
Text on tables/figures/charts
  Precision                     0.93    0.93   1     0.70   0.05
  Recall                        0.99    1      1     0.81   0.03
  F1 score                      0.96    0.96   1     0.83   0.03

We note that in certain styles, paragraphs are not indented, but separated by an obvious line of whitespace. In this case, a text line that is not a new paragraph and comes after a math expression in display mode could be mistakenly considered as a new paragraph.

Table 3 shows the running times incurred, respectively, by pdf2htmlEX and PDFBoT after pdf2htmlEX generates a txt file on a 2015 commonplace laptop MacBook Pro with a 2.7 GHz Dual-Core Intel Core i5 CPU and 8 GB RAM, where MAX represents the maximum running time in seconds processing a document in this dataset, MIN the minimum running time,
Avg the average running time, Med the median running time, and Std the standard deviation.

Table 3: Running time statistics (in seconds).

              Avg    Med    Max    Min    Std
pdf2htmlEX    3.00   1.90   13.8   0.80   2.47
PDFBoT        10.3   6.80   106    2.60   12.5

We note that the running time depends on how complex the content of the underlying document is. It takes substantially longer to process a document that contains significantly more math expressions or tables. Six documents each take longer than 25 seconds for PDFBoT to run. Checking these documents, we found that they contain a large number of math expressions, tables, or supplemental materials after the references. The one extreme outlier that runs 106 seconds on PDFBoT but only 9.42 seconds on pdf2htmlEX is a 10-page PDF document. The likely reason is the complex features used by pdf2htmlEX to describe the document. While generating the HTML file would not be too costly, analyzing the CSS3 files to extract features for this particular document has taken more time, which needs to be investigated further. Overall, PDFBoT incurs 10.3 seconds on average.

6 CONCLUSIONS AND FINAL REMARKS

PDFBoT uses certain formatting features, text-block statistics, syntactic features, the line-sweeping method, and the backward traversal method to achieve accurate extraction. PDFBoT is available for public access at http://dooyeed.com:10080/pdfbot.

While the majority of academic PDF documents satisfy the assumptions listed in the paper, this is not always the case, and so some of the extraction mechanisms could fail. To further improve the accuracy of detecting NBT text, particularly on a document that violates some of the assumptions, we may explore deeper features in CSS3 files in addition to those we have used. For example, it would be useful to investigate how to compute the width of a TBK. Neural-network classifiers such as CNN models may also be explored to identify certain types of NBT text residing in the BT text area.

REFERENCES

Bast, H. and Korzen, C. (2017). A benchmark and evaluation for text extraction from PDF. In ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 1–10.

Clark, C. and Divvala, S. (2015). Looking beyond text: Extracting figures, tables, and captions from computer science paper.

Giles, C. L. (2006). The future of citeseer: Citeseerx. In Fürnkranz, J., Scheffer, T., and Spiliopoulou, M., editors, Knowledge Discovery in Databases: PKDD 2006, pages 2–2, Berlin, Heidelberg. Springer Berlin Heidelberg.

Luong, M.-T., Nguyen, T. D., and Kan, M.-Y. (2011). Logical structure recovery in scholarly articles with rich document features. International Journal of Digital Library Systems (IJDLS), pages 1–23.

Mali, P., Kukkadapu, P., Mahdavi, M., and Zanibbi, R. (2020). ScanSSD: scanning single shot detector for mathematical formulas in PDF document images.

Peters, M. E., Ammar, W., Bhagavatula, C., and Power, R. (2017). Semi-supervised sequence tagging with bidirectional language models. arXiv preprint arXiv:1705.00108.

Pfahler, L., Schill, J., and Mori, K. (2019). The search for equations-learning to identify similarities between mathematical expressions.

Phong, B. H., Hoang, T. M., and Le, T. (2020). A hybrid method for mathematical expression detection in scientific document images. IEEE Access, 8:83663–83684.

Ramakrishnan, C., Patnia, A., Hovy, E., and Burns, G. (2012). Layout-aware text extraction from full-text PDF of scientific articles. Source Code for Biology and Medicine.

Romary, L. and Lopez, P. (2015). Grobid – information extraction from scientific publications. ERCIM News, 100. hal-01673305.

Shigarov, A., Mikhailov, A., and Altaev, A. (2016). Configurable table structure recognition in untagged pdf documents.

Toutanova, K., Klein, D., Manning, C. D., and Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 conference of the North American chapter of the association for computational linguistics on human language technology-volume 1, pages 173–180. Association for Computational Linguistics.

Wang, L. and Liu, W. (2013). Online publishing via pdf2htmlEX. TUGboat, 34:313–324.

Wang, L., Wang, Y., Cai, D., Zhang, D., and Liu, X. (2018). Translating math word problem to expression tree. pages 1064–1069.

Yang, H., Aguirre, C. A., Torre, M. F. D. L., Christensen, D., Bobadilla, L., Davich, E., Roth, J., Luo, L., Theis, Y., Lam, A., Han, T. Y.-J., Buttler, D., and Hsu, W. H. (2019). Pipelines for procedural information extraction from scientific literature: towards recipes using machine learning and data science. pages 41–46.

Yildiz, B., Kaiser, K., and Miksch, S. (2005). pdf2table: A method to extract table information from pdf files. pages 1773–1785.
