Doc2Graph: A Task Agnostic Document Understanding Framework Based on Graph Neural Networks
1 Dipartimento di Ingegneria dell’Informazione (DINFO),
Università degli studi di Firenze, Italy
{andrea.gemelli, enrico.civitelli, simone.marinai}@unifi.it
2 Computer Vision Center & Computer Science Department,
Universitat Autònoma de Barcelona, Spain
{sbiswas, josep}@cvc.uab.es
1 Introduction
Document Intelligence deals with the ability to read, understand and interpret
documents. Document understanding can be backed by graph representations,
which robustly represent objects and relations. Graph reasoning for document
parsing involves manipulating structured representations of semantically mean-
ingful document objects (titles, tables, figures) and relations, using compositional
rules. Customarily, graphs have been selected as an adequate framework for lever-
aging structural information from documents, due to their inherent representa-
tional power to codify the object components (or semantic entities) and their
pairwise relationships. In this context, recently graph neural networks (GNNs)
have emerged as a powerful tool to tackle the problems of Key Information Ex-
traction (KIE) [6,35], Document Layout Analysis (DLA) which includes well-
studied sub-tasks like table detection [25,26], table structure recognition [20,34]
and table extraction [9], Visual Question Answering (VQA) [18,17], synthetic
document generation [4] and so on.
Simultaneously, the common state-of-the-art practice in the document under-
standing community is to utilize the power of huge pre-trained vision-language
models [1,32,33] that learn whether the visual, textual and layout cues of the
document are correlated. Despite achieving superior performance on most docu-
ment understanding tasks, large-scale document pre-training comes with a high
computational cost both in terms of memory and training time. We present a
solution that does not rely on huge vision-language model pre-training modules,
but rather recognizes the semantic text entities and their relationships from
documents by exploiting graphs. The solution has been evaluated on two challenging
benchmarks, forms [15] and invoices [10], with a very small amount of labeled
training data.
Inspired by some prior works [8,25,26], we introduce Doc2Graph, a novel
task-agnostic framework to exploit graph-based representations for document
understanding. The proposed model is validated in three different challenges,
namely KIE in form understanding, invoice layout analysis and table detection.
A graph representation module is proposed to organize the document objects.
The graph nodes represent words or semantic entities, while edges represent the
pairwise relationships between them. Finding the optimal set of edges to create the
graph is anything but trivial: in the literature, heuristics are usually applied, e.g. using
a visibility graph [25]. In this work, we do not make any a priori assumption on the
connectivity: rather we attempt to build a fully connected graph representation
over documents and let the network learn by itself what is relevant.
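To make the idea concrete, a minimal Python sketch of this fully connected construction is given below; it is not the released Doc2Graph code, and the entity dictionary layout is only an assumption.

```python
# Hypothetical sketch: build a fully connected directed graph over detected
# document entities, deferring the decision of which edges matter to the GNN.
from itertools import permutations

def build_fully_connected_graph(entities):
    """entities: list of dicts such as {'box': (x0, y0, x1, y1), 'text': '...'}.
    Returns node ids and every ordered pair of distinct nodes as candidate edges."""
    nodes = list(range(len(entities)))
    edges = list(permutations(nodes, 2))  # n * (n - 1) directed candidate edges
    return nodes, edges
```

For a page with n entities this yields n(n-1) candidate edges, which the edge classifier later prunes by assigning the 'none' label.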
The primary contributions of this work can be summarized as
follows:
2 Related Work
Document understanding has been studied extensively in the last few years, ow-
ing to the advent of deep learning, and has been reformulated in a recent survey
by Borchmann et al. [5]. The tasks range from KIE, performed for understanding
forms [15], receipts [14] and invoices [10], to multimodal comprehension of both
visual and textual cues in a document for classification [32,33]. It also includes
the DLA task where recent works focus on building an end-to-end framework for
both detection and classification of page regions [3,2]. Table detection [25,26],
structure recognition [20,24] and extraction [9,30] in DLA gathered some special
attention in recent years due to the high variability of layouts, which makes them
both necessary and challenging to solve. In addition, question
answering [21,29] has emerged as an extension of the KIE task principle, where
a natural language question replaces a property name. Current state-of-the-art
approaches [1,13,22,32,33] on these document understanding tasks have utilized
the power of large pre-trained language models, relying on language more than
the visual and geometrical information in a document and also end up using
hundreds of millions of parameters in the process. Moreover, most of these mod-
els are trained with a huge transformer pipeline, which requires an immense
amount of data during pre-training. In this regard, Davis et al. [7] and Sarkar
et al. [28] proposed language-agnostic models. In [7] they focused on the entity
relationship detection problem in forms [15] using a simple CNN as a text line
detector and then detecting key-value relationship pairs using a heuristic based
on each relationship candidate score generated from the model. Sarkar et al. [28]
rather focused on extracting the form structure by reformulating the problem
as a semantic segmentation (pixel labeling) task. They used a U-Net based ar-
chitectural pipeline, predicting all levels of the document hierarchy in parallel,
making it quite efficient.
GNNs for document understanding were first introduced mainly for key DLA
sub-tasks such as table detection [25] and table structure recognition [23].
The key idea behind their introduction was to exploit the powerful geometrical
characteristics of a document while preserving the privacy of
confidential textual content (especially for administrative documents) during
training, making the model language-independent and more structure-reliant, as
proposed in [25] for the detection of tables in invoices. Carbonell et al. [6] used graph
convolutional networks (GCNs) to solve the entity (word) grouping, labeling and
entity linking tasks for form analysis. They used bounding box information and
word embeddings as the principal node features, without including any visual
features, and used k-nearest neighbours (KNNs) to encode the
edge information. The FUDGE [8] framework was then developed for form un-
derstanding as an extension of [7] to greatly improve the state-of-the-art on both
the semantic entity labeling and entity linking tasks by proposing relationship
pairs using the same detection CNN as in [7]. Then a graph convolutional net-
work (GCN) was deployed, with visual features plugged in from the CNN, so that
semantic labels for the text entities were predicted jointly with the key-value
relationship pairs, as they are quite related tasks.
3 Method
In this section, we present the proposed approach. First, we describe the pre-
processing step that converts document images into graphs. Then, we describe
the GNN model designed to tackle different kinds of tasks.
3.3 Architecture
Each node feature vector passes through our proposed architecture (Fig. 1, vi-
sualization of GNN layer inspired by “A Gentle Introduction to GNNs”): the
connectivity defines the neighborhood for the message passing, while the learnable
weight matrices are shared across all nodes and edges, respectively. We make
use of four different components:
– Input Projector: this module applies as many fully connected (FC) layers as
there are different modalities in use, to project each of their representations
into inner spaces of the same dimension; e.g., we found it not very informative
to combine low-dimensional geometrical features with high-dimensional
visual ones as they are;
– GNN Layer: we make use of a slightly different version of GraphSAGE [11].
Using a fully connected graph, we redefine the aggregation strategy (eq. 2);
– Node Predictor: this is an FC layer that maps the representation of each node
into the number of target classes;
– Edge Predictor: this is a two-layer FC classifier that assigns a label to each edge.
To do so, we propose a novel aggregation on edges (eq. 3); a sketch of these
components follows.
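As a rough illustration of how these components could be wired together, a PyTorch sketch is given below; module names, hidden sizes and the use of plain nn.Linear layers are our own assumptions, not the exact released implementation.

```python
# Hypothetical PyTorch sketch of the Doc2Graph components described above.
import torch
import torch.nn as nn

class InputProjector(nn.Module):
    """One FC layer per modality, projecting each modality into a space of equal size."""
    def __init__(self, modality_dims, hidden):
        super().__init__()
        self.projs = nn.ModuleList([nn.Linear(d, hidden) for d in modality_dims])

    def forward(self, modality_feats):  # list of [num_nodes, d_m] tensors
        return torch.cat([p(f) for p, f in zip(self.projs, modality_feats)], dim=-1)

class NodePredictor(nn.Module):
    """A single FC layer mapping node embeddings to the target classes."""
    def __init__(self, in_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(in_dim, num_classes)

    def forward(self, h_nodes):
        return self.fc(h_nodes)

class EdgePredictor(nn.Module):
    """Two FC layers assigning a label to each edge representation (eq. 3)."""
    def __init__(self, edge_dim, hidden, num_classes):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(edge_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_classes))

    def forward(self, h_edges):
        return self.mlp(h_edges)
```

The GNN layer itself follows the modified GraphSAGE aggregation sketched after eq. (2) below.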
The standard GraphSAGE aggregation [11] computes, for each node i,

h_{N(i)}^{l+1} = \mathrm{aggregate}\left(\left\{h_{j}^{l},\ \forall j \in N(i)\right\}\right) \qquad (1)

where aggregate can be any permutation invariant operation, e.g. sum or mean.
Usually, in other domains, the graph structure is naturally given by the data
itself but, as already stated, in documents this can be challenging (sec. 3.1).
Then, given a document, we redefine the above equation as:

h_{N(i)}^{l+1} = \frac{c}{|\Upsilon(i)|} \sum_{j \in \Upsilon(i)} h_{j}^{l} \qquad (2)
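A dense, loop-based sketch of this aggregation is shown below; the scaling constant c and the way Υ(i) is materialized as an index list are assumptions about how the formula could be realized in code.

```python
# Hypothetical sketch of the modified GraphSAGE neighborhood aggregation (eq. 2).
import torch

def aggregate_neighbors(h, neighborhoods, c=1.0):
    """h: [N, D] node embeddings at layer l; neighborhoods[i] plays the role of Υ(i)."""
    out = torch.zeros_like(h)
    for i, idx in enumerate(neighborhoods):
        if idx:  # scale the summed neighbor embeddings by c / |Υ(i)|
            out[i] = (c / len(idx)) * h[idx].sum(dim=0)
    return out
```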
Edge Predictor We consider each edge as a triplet (src, e, dst): e is the edge
connecting the source (src) and destination (dst) nodes. The edge representation
h_e fed into the two-layer FC classifier is defined as:

h_e = h_{src} \,\|\, h_{dst} \,\|\, cls_{src} \,\|\, cls_{dst} \,\|\, e_{polar} \qquad (3)

where h_{src} and h_{dst} are the node embeddings output by the last GNN layer,
cls_{src} and cls_{dst} are the softmax of the output logits of the previous node pre-
dictor layer, e_{polar} are the polar coordinates described in sec. 3.2 and ∥ is the
concatenation operator. These choices have been made because: (i) relative posi-
tioning on edges is stronger compared to absolute positioning on nodes: the local
property introduced by means of polar coordinates can be extended to different
data, e.g. documents of different sizes or orientations; (ii) if the considered task
also comprises the classification of nodes, their classes may help in the classifica-
tion of edges, e.g. in forms it should not be possible to find an answer connected to
another answer.
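A small sketch of how this edge representation could be assembled is given below; deriving the polar coordinates from box centers and the exact concatenation order follow eq. (3) but are otherwise assumptions.

```python
# Hypothetical sketch: build the edge representation of eq. 3 for one (src, dst) pair.
import math
import torch

def edge_representation(h_src, h_dst, cls_src, cls_dst, box_src, box_dst):
    """h_*: last-GNN-layer embeddings; cls_*: softmaxed node-predictor logits;
    box_*: (x0, y0, x1, y1) bounding boxes used for the relative polar coordinates."""
    cx_s, cy_s = (box_src[0] + box_src[2]) / 2, (box_src[1] + box_src[3]) / 2
    cx_d, cy_d = (box_dst[0] + box_dst[2]) / 2, (box_dst[1] + box_dst[3]) / 2
    dx, dy = cx_d - cx_s, cy_d - cy_s
    e_polar = torch.tensor([math.hypot(dx, dy), math.atan2(dy, dx)])  # (distance, angle)
    return torch.cat([h_src, h_dst, cls_src, cls_dst, e_polar])
```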
Given the task, graphs can be either undirected or directed: the former are repre-
sented with two directed edges between each pair of connected nodes, the latter with
one. In the first case, the order does not matter and so the above formula can be
redefined as:
4.2 FUNSD
Dataset The dataset [15] comprises 199 real, fully annotated, scanned forms.
The documents are a subset of the larger RVL-CDIP [12] dataset, a
collection of 400,000 grayscale images of various documents. The authors de-
fine the Form Understanding (FoUn) challenge as three different tasks: word
grouping, semantic entity labeling and entity linking. A recent work [31] found
some inconsistency in the original labeling, which impeded its applicability to
the key-value extraction problem. In this work, we are using the revised version
of FUNSD.
Entity Detection Our focus is on the GNN performance but, for comparison
purposes, we used a YOLOv5 small [16] to detect entities (pretrained on COCO
[19]). In [15] the word grouping task is evaluated using the ARI metric: since we
are not using words, we evaluated the entity detection with F1 score using two
different IoU thresholds (Tab. 2).

Fig. 2: Image taken from [31]: the document on the right is the revised version of
the document on the left, where some answers (green) are mislabeled as questions
(blue) and some questions (blue) are mislabeled as headers (yellow).

Fig. 3: Blue boxes are FUNSD ground-truth entities, green boxes are the correctly
detected ones (with IoU > 0.25/0.50), while red boxes are false positives.

For the semantic entity labeling and entity
linking tasks we use IoU > 0.50, as done in [8]: we did not perform any optimiza-
tion on the detector model, which introduces a high drop rate for both entities
and links. We create the graphs on top of YOLO detections, linking the ground
truth accordingly (Fig. 3): false positive entities (red boxes) are labeled as class
’other’, while false negative entities cause some key-value pairs to be lost (red
links). The new connections created as a consequence of wrong detections are
considered false positives and labeled as ‘none’.
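The sketch below illustrates one way the detections could be aligned with the ground truth when building the training graph (false-positive entities labeled as 'other'); the IoU threshold and the matching loop are our assumptions, not the exact procedure.

```python
# Hypothetical sketch: label YOLO detections against ground-truth entities via IoU.
def iou(a, b):
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def label_detections(det_boxes, gt_entities, thr=0.5):
    labels = []
    for det in det_boxes:
        best = max(gt_entities, key=lambda g: iou(det, g['box']), default=None)
        if best is not None and iou(det, best['box']) > thr:
            labels.append(best['label'])  # matched detection inherits the GT class
        else:
            labels.append('other')        # false positive entity
    return labels
```

Edges introduced by such false positives would analogously receive the 'none' label.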
Numerical results We trained our architecture (sec. 3.3) with a 10-fold cross
validation.

Table 3: Results on FUNSD. The results are shown for both the semantic entity
labeling and entity linking tasks with their corresponding metrics (F1, ↑; # Params
×10^6, ↓).

Method              GNN   Semantic Entity Labeling   Entity Linking   # Params ×10^6 (↓)
BROS [13]            ✗    0.8121                     0.6696           138
LayoutLM [33,13]     ✗    0.7895                     0.4281           343

Since we found high variance in the results, we report both the mean
and the variance over the 10 best models chosen over their respective validation
sets. The objective function in use (L) is based on both the node (L_n) and edge (L_e)
classification tasks: L = L_n + L_e. In Tab. 3 we report the performance of our
model Doc2Graph compared to other language models [13,33] and graph-based
techniques [6,8]. The number of parameters (# Params) refers to the trainable
Doc2Graph pipeline (which includes the U-Net and YOLO backbones); for the
spaCy word-embedding details, refer to their documentation. Using YOLO, our
network outperforms [6] for semantic entity labeling and matches their model on
entity linking, using just 13.5 million parameters. We could not do better than
FUDGE, which still outperforms our scores: their backbone is trained for both
tasks along with the GCN (which adds only minor improvements). The gap, especially
on entity linking, is mainly due to the low contributions given by our visual
features (Tab. 1) and the detector in use (Tab. 3). We also report the results
of our model initialized with ground truth (GT) entities, to show how it would
perform in the best case scenario. Entity linking remains a harder task compared
to semantic entity labeling and only complex language models seem to be able
to solve it. Moreover, for the sake of completeness, we highlight that, with good
entity representations, our model outperforms all the considered architectures
for the Semantic Entity Labeling task. Finally, we want to further stress that
the main contribution of a graph-based method is to yield a simpler and more
lightweight solution.
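For reference, the joint objective L = L_n + L_e described above can be optimized with a standard training step; the sketch below assumes a model returning node and edge logits and an existing optimizer, which are not specified in the paper.

```python
# Hypothetical training step for the joint node/edge objective L = L_n + L_e.
import torch.nn.functional as F

def training_step(model, graph, node_targets, edge_targets, optimizer):
    node_logits, edge_logits = model(graph)
    loss = F.cross_entropy(node_logits, node_targets) + \
           F.cross_entropy(edge_logits, edge_targets)  # L = L_n + L_e
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```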
Qualitative results In Fig. 6 we show the qualitative results. The two docu-
ments are duplicated to better visualize the two tasks. For layout analysis, the
colors of the larger boxes indicate the true label that the words inside should have
(the colors reflect the classes shown in Fig. 4).

Accuracy (↑)
Method             Max     Mean
Riba et al. [25]   62.30   -
Doc2Graph + OCR    69.80   67.80 ± 1.1

Metrics (↑)
Method             Threshold   Precision       Recall          F1
Riba et al. [25]   0.1         0.2520          0.3960          0.3080
Riba et al. [25]   0.5         0.1520          0.3650          0.2150
Doc2Graph + OCR    0.5         0.3786 ± 0.07   0.3723 ± 0.07   0.3754 ± 0.07

For the table detection we use a simple
heuristic: we take the enclosing rectangle (green) of the nodes connected by ‘ta-
ble’ edges, then we evaluate the IoU with target regions (orange). This heuristic
is effective but, being simple, error-prone: if a false positive is found outside the
table regions it can lead to a poor detection result, e.g. a bounding box that also
includes a 'sender item' or 'receiver item' entity. In addition, as can be inferred from
Figs. 6(a) and 6(b), 'total' regions may be left out. In the future, we will
refine this behaviour by both boosting the node classification task and including
’total’ as a table region for the training of edges.
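A minimal sketch of this enclosing-rectangle heuristic follows; the helper names and box format are illustrative assumptions rather than the authors' exact code.

```python
# Hypothetical sketch of the table-detection heuristic: enclose all nodes joined
# by edges classified as 'table' in a single rectangle.
def enclosing_rectangle(boxes):
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def detect_table_region(node_boxes, edges, edge_labels):
    table_nodes = {i for (s, d), lab in zip(edges, edge_labels)
                   if lab == 'table' for i in (s, d)}
    if not table_nodes:
        return None  # no 'table' edges predicted on this page
    return enclosing_rectangle([node_boxes[i] for i in table_nodes])
```

The returned rectangle is then compared to the target table region via IoU, as described above.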
5 Conclusion
In this work, we have presented a task-agnostic document understanding frame-
work based on a Graph Neural Network. We propose a general representation
of documents as graphs, exploiting full connectivity between document objects
and letting the network automatically learn meaningful pairwise relationships.
Node and edge aggregation functions are defined by taking into account the rela-
tive positioning of document objects. We evaluated our model on two challenging
benchmarks for three different tasks: entity linking on forms, layout analysis on
invoices and table detection. Our preliminary results show that our model can
achieve promising results, keeping the network dimensionality considerably low.
In future work, we will extend our framework to other documents and tasks,
to investigate more deeply the generalization properties of the GNN. We would like to
explore more extensively the contribution of different source features and how
to combine them in more meaningful and learnable ways.
Acknowledgment
This work has been partially supported by the Spanish projects MIRANDA
RTI2018-095645-B-C21 and GRAIL PID2021-126808OB-I00, the CERCA Pro-
gram / Generalitat de Catalunya, the FCT-19-15244, and PhD Scholarship from
AGAUR (2021FIB-10010).
Fig. 5: Entity Linking on FUNSD. We make use of directed edges: green and
red dots indicate source and destination nodes, respectively.
References
1. Appalaraju, S., Jasani, B., Kota, B.U., Xie, Y., Manmatha, R.: DocFormer: End-to-end transformer for document understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 993–1003 (2021)
2. Biswas, S., Banerjee, A., Lladós, J., Pal, U.: DocSegTr: An instance-level end-to-end document image segmentation transformer. arXiv preprint arXiv:2201.11438 (2022)
3. Biswas, S., Riba, P., Lladós, J., Pal, U.: Beyond document object detection: instance-level segmentation of complex layouts. International Journal on Document Analysis and Recognition (IJDAR) 24(3), 269–281 (2021)
4. Biswas, S., Riba, P., Lladós, J., Pal, U.: Graph-based deep generative modelling for document layout generation. In: International Conference on Document Analysis and Recognition. pp. 525–537. Springer (2021)
5. Borchmann, L., Pietruszka, M., Stanislawek, T., Jurkiewicz, D., Turski, M., Szyndler, K., Graliński, F.: DUE: End-to-end document understanding benchmark. In: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2) (2021)
6. Carbonell, M., Riba, P., Villegas, M., Fornés, A., Lladós, J.: Named entity recognition and relation extraction with graph neural networks in semi structured documents. In: 2020 25th International Conference on Pattern Recognition (ICPR). pp. 9622–9627. IEEE (2021)
7. Davis, B., Morse, B., Cohen, S., Price, B., Tensmeyer, C.: Deep visual template-free form parsing. In: 2019 International Conference on Document Analysis and Recognition (ICDAR). pp. 134–141. IEEE (2019)
8. Davis, B., Morse, B., Price, B., Tensmeyer, C., Wiginton, C.: Visual FUDGE: Form understanding via dynamic graph editing. In: International Conference on Document Analysis and Recognition. pp. 416–431. Springer (2021)
9. Gemelli, A., Vivoli, E., Marinai, S.: Graph neural networks and representation embedding for table extraction in PDF documents. In: accepted for publication at ICPR22 (2022)
10. Goldmann, L.: Layout analysis groundtruth for the RVL-CDIP dataset (Sep 2019). https://doi.org/10.5281/zenodo.3257319
11. Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large graphs. Advances in Neural Information Processing Systems 30 (2017)
12. Harley, A.W., Ufkes, A., Derpanis, K.G.: Evaluation of deep convolutional nets for document image classification and retrieval. In: International Conference on Document Analysis and Recognition (ICDAR) (2015)
13. Hong, T., Kim, D., Ji, M., Hwang, W., Nam, D., Park, S.: BROS: A pre-trained language model for understanding texts in document (2020)
14. Huang, Z., Chen, K., He, J., Bai, X., Karatzas, D., Lu, S., Jawahar, C.: ICDAR2019 competition on scanned receipt OCR and information extraction. In: 2019 International Conference on Document Analysis and Recognition (ICDAR). pp. 1516–1520. IEEE (2019)
15. Jaume, G., Ekenel, H.K., Thiran, J.P.: FUNSD: A dataset for form understanding in noisy scanned documents. In: 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). vol. 2, pp. 1–6. IEEE (2019)
16. Jocher, G., Chaurasia, A., Stoken, A., Borovec, J., NanoCode012, Kwon, Y., TaoXie, Fang, J., imyhxy, Michael, K.: ultralytics/yolov5: v6.1 - TensorRT, TensorFlow Edge TPU and OpenVINO export and inference. Zenodo, Feb 22 (2022)
17. Li, X., Wu, B., Song, J., Gao, L., Zeng, P., Gan, C.: Text-instance graph: Exploring the relational semantics for text-based visual question answering. Pattern Recognition 124, 108455 (2022)
18. Liang, Y., Wang, X., Duan, X., Zhu, W.: Multi-modal contextual graph neural network for text visual question answering. In: 2020 25th International Conference on Pattern Recognition (ICPR). pp. 3491–3498. IEEE (2021)
19. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: European Conference on Computer Vision. pp. 740–755. Springer (2014)
20. Liu, H., Li, X., Liu, B., Jiang, D., Liu, Y., Ren, B.: Neural collaborative graph machines for table structure recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4533–4542 (2022)
21. Mathew, M., Karatzas, D., Jawahar, C.: DocVQA: A dataset for VQA on document images. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 2200–2209 (2021)
22. Powalski, R., Borchmann, L., Jurkiewicz, D., Dwojak, T., Pietruszka, M., Palka, G.: Going full-TILT boogie on document understanding with text-image-layout transformer. In: International Conference on Document Analysis and Recognition. pp. 732–747. Springer (2021)
23. Qasim, S.R., Mahmood, H., Shafait, F.: Rethinking table recognition using graph neural networks. In: 2019 International Conference on Document Analysis and Recognition (ICDAR). pp. 142–147. IEEE (2019)
24. Raja, S., Mondal, A., Jawahar, C.: Table structure recognition using top-down and bottom-up cues. In: European Conference on Computer Vision. pp. 70–86. Springer (2020)
25. Riba, P., Dutta, A., Goldmann, L., Fornés, A., Ramos, O., Lladós, J.: Table detection in invoice documents by graph neural networks. In: 2019 International Conference on Document Analysis and Recognition (ICDAR). pp. 122–127. IEEE (2019)
26. Riba, P., Goldmann, L., Terrades, O.R., Rusticus, D., Fornés, A., Lladós, J.: Table detection in business document images by message passing networks. Pattern Recognition 127, 108641 (2022)
27. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. CoRR abs/1505.04597 (2015), http://arxiv.org/abs/1505.04597
28. Sarkar, M., Aggarwal, M., Jain, A., Gupta, H., Krishnamurthy, B.: Document structure extraction using prior based high resolution hierarchical semantic segmentation. In: European Conference on Computer Vision. pp. 649–666. Springer (2020)
29. Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., Rohrbach, M.: Towards VQA models that can read. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8317–8326 (2019)
30. Smock, B., Pesala, R., Abraham, R.: PubTables-1M: Towards comprehensive table extraction from unstructured documents. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4634–4642 (2022)
31. Vu, H.M., Nguyen, D.T.N.: Revising FUNSD dataset for key-value detection in document images. arXiv preprint arXiv:2010.05322 (2020)
32. Xu, Y., Xu, Y., Lv, T., Cui, L., Wei, F., Wang, G., Lu, Y., Florencio, D., Zhang, C., Che, W., et al.: LayoutLMv2: Multi-modal pre-training for visually-rich document understanding. arXiv preprint arXiv:2012.14740 (2020)
33. Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: LayoutLM: Pre-training of text and layout for document image understanding. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. pp. 1192–1200 (2020)
34. Xue, W., Yu, B., Wang, W., Tao, D., Li, Q.: TGRNet: A table graph reconstruction network for table structure recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 1295–1304 (2021)
35. Yu, W., Lu, N., Qi, X., Gong, P., Xiao, R.: PICK: Processing key information extraction from documents using improved graph learning-convolutional networks. In: 2020 25th International Conference on Pattern Recognition (ICPR). pp. 4363–4370. IEEE (2021)