Extracting Body Text From Academic PDF Documents For Text Mining
Keywords: body-text extraction, HTML replication of PDF, line sweeping, backward traversal
arXiv:2010.12647v1 [cs.IR] 23 Oct 2020
Abstract: Accurate extraction of body text from PDF-formatted academic documents is essential in text-mining
applications for deeper semantic understanding. The objective is to extract complete sentences in the body text
into a txt file with the original sentence flow and paragraph boundaries. Existing tools for extracting text from
PDF documents often mix body and nonbody text. We devise and implement a system called PDFBoT
to detect multiple-column layouts using a line-sweeping technique, remove nonbody text using computed text
features and syntactic tagging in backward traversal, and align the remaining text back to sentences and
paragraphs. We show that PDFBoT is highly accurate, with average F1 scores of 0.99 on extracting
sentences, 0.96 on extracting paragraphs, and 0.98 on removing text in tables, figures, and charts, over a corpus
of PDF documents randomly selected from arXiv.org across multiple academic disciplines.
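
To make the line-sweeping idea concrete, the following is a minimal sketch of how a multiple-column layout might be detected. It is not PDFBoT's implementation: the input format (one horizontal extent per text line), the function names `find_column_gaps` and `count_columns`, and the `step` and `min_gap` parameters are all assumptions made for illustration. A vertical line is swept from the left edge of the text area to the right edge, and any sufficiently wide interval that no text line crosses is treated as a gap between columns.

```python
from typing import List, Tuple

# Hypothetical input format: (left, right) horizontal extent of one text line.
Box = Tuple[float, float]


def find_column_gaps(boxes: List[Box], step: float = 1.0,
                     min_gap: float = 10.0) -> List[Tuple[float, float]]:
    """Sweep a vertical line left to right across the text area and return
    the x-intervals, at least min_gap wide, that no text line crosses."""
    if not boxes:
        return []
    left = min(b[0] for b in boxes)
    right = max(b[1] for b in boxes)
    gaps: List[Tuple[float, float]] = []
    gap_start = None
    x = left
    while x <= right:
        crossed = any(lo <= x <= hi for lo, hi in boxes)
        if not crossed and gap_start is None:
            gap_start = x  # the sweep line just entered empty space
        elif crossed and gap_start is not None:
            if x - gap_start >= min_gap:
                gaps.append((gap_start, x))  # wide enough to separate columns
            gap_start = None
        x += step
    return gaps


def count_columns(boxes: List[Box]) -> int:
    """Each detected gap separates two adjacent columns."""
    return len(find_column_gaps(boxes)) + 1


if __name__ == "__main__":
    # Two text lines per column on a hypothetical 600-unit-wide page.
    two_column = [(50.0, 290.0), (310.0, 550.0), (50.0, 285.0), (310.0, 540.0)]
    print(find_column_gaps(two_column))
    print(count_columns(two_column))
```

A full-width element such as a title would cross every gap and hide the columns in this sketch, so a practical version would presumably sweep over vertical regions of the page separately rather than over the whole page at once.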