ICDAR2021-Information Extraction from Invoices
1 Introduction
identifiers and types, amounts, dates and so on. This automatic processing of
invoices has been formalized by Poulain d'Andecy et al. [18] and requires some
specific features:
– handling the variability of layouts,
– minimizing the end-user effort,
– training and quickly adapting to new languages and new contexts.
Even if a formal definition has been proposed in the literature, current ap-
proaches rely on heuristics which describe spatial relationships between the data
to be extracted and a list of trigger words through the use of dictionaries. Design-
ing heuristics-based models is time-consuming since it requires human expertise.
Furthermore, such models are dependent on the language and the templates they
have been trained for, which requires annotating a large number of documents
and labeling every piece of data from which to extract information. Thus, even
once a system has reached a good level of performance, it is very tedious to
integrate new languages and new templates. Recent data-driven approaches
(CloudScan [22], Sypht [9]) instead consider information extraction as a
classification problem: based on its context, each word is assigned to a specific
class. Their results are very competitive, but these systems require huge volumes
of data to perform well.
The problem of extracting some specific entities from textual documents has
been studied by the natural language processing field and is known as Named
Entity Recognition (NER). NER is a subtask of information extraction that
aims to find and mark up real-world entities in unstructured text and then to
categorize them into a set of predefined classes. Most NER tagsets define three
classes to categorize named entities: persons, locations and organizations [16].
Taking advantage of the development of neural-based methods, the performance
of NER systems has kept increasing since 2011 [3].
In this paper, we propose to adapt and compare two deep learning approaches
to extract key fields from invoices, which comprise regular invoices, receipts,
purchase orders, delivery forms, account statements and payslips. Both methods
are language-independent and process invoices regardless of their templates. In
the first approach, we formulate the hypothesis that key fields related to docu-
ment analysis can extend name entity categories and be extracted in the same
way. It is based on annotated texts where all the target fields are extracted and
labeled into a set of predefined classes. The system is based on the context in
which the word appears in the document and additional features that encode
the spatial position of the word in the documents.
The second approach converts each word of the invoice into a vector of fea-
tures which will be semantically categorized in a second stage. To reduce the
annotation effort, the corpus comes from a real-life industrial workflow and the
annotation is semi-supervised. The corpus has been tagged with an existing in-
formation extraction system and manually checked by experts. Compared to the
state of the art, our experiments show that by adequately selecting part of the
data, we can train competitive systems by using a significantly smaller amount
of training data compared to the ones used in state-of-the-art approaches.
This paper first introduces prior works on information extraction from in-
voices (Section 2). We then describe the data used to assess our work (Section 3).
Our two approaches based on named entity recognition and document analysis
are detailed in Section 4. The experiments are described in Section 5 to com-
pare our models to state-of-the-art methods, before discussions and future work
(Section 6).
2 Related work
Recent works proposed deep learning based approaches to solve the extrac-
tion problem. Palm et al. [17] presented CloudScan, a commercial solution
by Tradeshift. They train a recurrent neural network (RNN) model over 300k
invoices to recognize eight key fields. This system requires no configuration and
does not rely on models. Instead, it considers tokens and encodes a set of con-
textual features of different kinds: textual, spatial, related to the format, etc.
They decided to apply a left-to-right reading order, even though invoices are
often laid out both vertically and horizontally. They compare CloudScan to
alternative extraction systems and claim an absolute accuracy gain of 20% across
the compared fields. Other works have since been inspired by CloudScan.
Lohani et al. [12] built a system based on graph convolutional networks
to extract 27 entities of interest from invoices. Their system learns structural
and semantic features for each entity and then uses surrounding information
to extract them. The evaluation of the system showed good performance, with an
overall F1-score of 0.93. In the same context, Zhao et al. [26] proposed the
CUTIE (Convolutional Universal Text Information Extractor) model.
It is based on spatial information to extract key text fields. CUTIE converts the
documents to gridded texts using positional mappings and uses a convolutional
neural network (CNN). The proposed model concatenates the CNN network with
a word embedding layer in order to simultaneously look into spatial and contex-
tual information. This method allowed them to reach state-of-the-art results on
key information extraction.
In this work, similarly to [11,13], we combine sequence labeling methods.
However, we add neural network layers to our systems so as to encode engineered
textual and spatial features. One of these methods is based on named entity
recognition using the Transformer architecture [24] and BERT [7], which, to our
knowledge, have not been reported in previous research on the specific task
of processing administrative documents. The second method is based on word
classification and, unlike previous works, requires neither pre- nor
post-processing to achieve satisfactory results.
3 Dataset
Datasets from business documents are usually not publicly available due to pri-
vacy issues. Previous systems such as CloudScan [22] and Sypht [9] use their
own proprietary datasets. Along the same lines, this work is based on a private
industrial dataset composed of French and English invoices coming from a real
document workflow provided by customers. The dataset covers 10 types of in-
voices (orders, quotations, invoice notes, statements...) with diverse templates.
The dataset has been annotated in a semi-automatic way. A first list of
fields was extracted from the system currently in use, and finally checked and
completed by an expert. The main advantage of this process is its ability to
get a large volume of documents. However, even if most of the returned values
have been checked, we must account for a certain amount of noise, which is
expert-dependent. In other words, some fields may be missed by the system or by
the expert.
Fig. 1: Part of an invoice from our dataset. The blue text is the label of the field
(blue boxes)
Table 1 provides statistics on the databases. Each one was split into 70% for
training and 30% for validation.
4 Methodology
As we mentioned in the introduction, we define and compare two different meth-
ods on information extraction that are generic and language-independent: a
NER-based method and a word classification-based one (henceforth, we respec-
tively denote them NER-based and class-based). To the best of our knowledge,
no research study has adapted NER systems to invoices so far. The NER-based
method evaluates the ability of mainstream NLP approaches to extract fields from
invoices by fine-tuning BERT on this specific task; BERT can capture the context
from a large neighborhood of words. The class-based method adapts the features
of CloudScan [17] to fit our constraints and adds some extra features that are
extracted automatically, with no preprocessing step or dictionary lookup. Thus,
our methods can easily be adapted to any type of administrative document. These
features significantly reduce the processing cost of the class-based method on
the one hand and allow BERT to deal with challenges related to semi-structured
documents on the other hand. Both systems assign a
class to a sequence of words. The classes are the key fields to be extracted. We
assign the class "undefined" to each word which does not correspond to a key
field. Both can achieve good performance with a small volume of training data
compared to the state of the art. Each word of the sequence to be labeled is
enriched with a set of features that encode contextual and spatial information of
the words. We therefore extract such features prior to data labeling. The same
features are used for both methods.
The first contribution of this paper relies on the fine-tuning of BERT [7] to
extract relevant information from invoices. The reason for using the BERT model
is not only that it is easy to fine-tune, but also that it has proved to be one
of the best-performing technologies on multiple tasks [4,19]. Beyond the major
impact of BERT, we aim in this paper to evaluate the ability of this model to
deal with structured texts such as administrative documents.
BERT consists of stacked Transformer encoders. Each encoder takes as input
the output of the previous encoder (except the first which takes as input the
embeddings of the input words). According to the task, the output of BERT is a
probability distribution that allows predicting the most probable output element.
In order to obtain the best possible performance, we adapted this architecture to
use both BERT word embeddings and our proposed features. At the input layer
of the first encoder, we concatenate the word embedding vector with a fixed-size
vector for features in order to combine word-level information with contextual
information. The size of the word embedding vector is 572 (numerical values),
to which we first concatenate another embedding vector that corresponds to the
average embedding of the contextual features. The obtained vector of size 1,144
is then concatenated with a vector containing the logical features (Boolean) and
the spatial features (numerical). The final vector size is 1,155.
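The dimensions above can be checked with a plain-Python sketch; only the concatenation logic matters here, not the tensor library:

```python
def build_input_vector(word_emb, ctx_feature_embs, logical_spatial):
    """Concatenate the word embedding (572), the element-wise average of the
    contextual feature embeddings (572, giving 1,144), and the 11
    logical/spatial features (giving the final 1,155)."""
    assert len(word_emb) == 572 and len(logical_spatial) == 11
    n = len(ctx_feature_embs)
    # element-wise average of the contextual feature embeddings
    ctx_avg = [sum(vec[i] for vec in ctx_feature_embs) / n for i in range(572)]
    return list(word_emb) + ctx_avg + list(logical_spatial)

vec = build_input_vector([0.0] * 572, [[1.0] * 572, [3.0] * 572], [1.0] * 11)
assert len(vec) == 1155  # 572 + 572 + 11
```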
As an embedding model, we rely on the large version of the pre-trained
CamemBERT [14] model. For tokenization, we use CamemBERT's built-in tokenizer,
which splits text on white space before applying Byte Pair Encoding (BPE) in the
WordPiece style [25]. BPE can split words into character n-grams representing
recognized sub-word units, which makes it possible to manage words with OCR
errors; for instance, 'in4voicem' becomes 'in', '##4', '##voi', '##ce', '##m'.
This word fragmentation can usually handle out-of-vocabulary words and those
with OCR errors, and still generates one vector per word.
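The splitting behavior can be illustrated with a greedy longest-match sub-word splitter in the WordPiece style. This is a simplified sketch over a toy vocabulary; the actual CamemBERT tokenizer uses a vocabulary of pieces learned from data:

```python
def wordpiece_split(word, vocab):
    """Greedy longest-match sub-word splitting in the WordPiece style:
    continuation pieces carry a '##' prefix; unmatched characters become [UNK]."""
    pieces, i = [], 0
    while i < len(word):
        j = len(word)
        while j > i:
            cand = word[i:j] if i == 0 else "##" + word[i:j]
            if cand in vocab:
                pieces.append(cand)
                i = j
                break
            j -= 1
        else:  # no piece matched at position i
            pieces.append("[UNK]")
            i += 1
    return pieces

# Toy vocabulary (real tokenizers learn tens of thousands of pieces from data).
vocab = {"in", "##4", "##voi", "##ce", "##m"}
print(wordpiece_split("in4voicem", vocab))  # ['in', '##4', '##voi', '##ce', '##m']
```

An OCR-corrupted token thus still decomposes into known pieces instead of being dropped as out-of-vocabulary.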
At this point, the feature-level representation vector is concatenated with the
word embedding vector to feed the BERT model. The output vectors of BERT
are then used as inputs to the CRF top layer to jointly decode the best label
sequence. To alleviate OCR errors, we add a stack of two transformer blocks
(cf. Fig 2) as recommended in [1], which should contribute to a more enriched
representation of words and sub-words from long-range contexts.
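The CRF layer's joint decoding can be sketched as a standard Viterbi search over per-step emission scores (here standing in for BERT's outputs) and label-to-label transition scores; the toy values below are illustrative, not the trained model's:

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label sequence.
    emissions: per-step scores, shape (n_steps, n_labels)
    transitions: transitions[p][l] = score of moving from label p to label l"""
    n_labels = len(emissions[0])
    score = list(emissions[0])   # best score of any path ending in each label
    back = []                    # back-pointers for path recovery
    for emit in emissions[1:]:
        new_score, ptrs = [], []
        for l in range(n_labels):
            p = max(range(n_labels), key=lambda q: score[q] + transitions[q][l])
            ptrs.append(p)
            new_score.append(score[p] + transitions[p][l] + emit[l])
        back.append(ptrs)
        score = new_score
    last = max(range(n_labels), key=lambda l: score[l])
    path = [last]
    for ptrs in reversed(back):
        last = ptrs[last]
        path.append(last)
    return path[::-1]
```

Joint decoding lets a strong transition score veto a locally attractive but inconsistent label, which is the usual motivation for placing a CRF on top of token-level scores.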
The system converts the input sequences of words into sequences of fixed-size
vectors (x1, x2, ..., xn), i.e. the word-embedding part is concatenated with the
feature embedding, and returns another sequence of vectors (h1, h2, ..., hn) that
represents named entity labels at every step of the input. In our context, we
aim to assign a sequence of labels for a given sequence of words. Each word
gets a pre-defined label (e.g. DOCNBR for document number, DATE for dates,
AMT for amounts ...) or O for words that are not to be extracted. According to
this system, the example sentence "Invoice No. 12345 from 2014/10/31 for an
amount of 525 euros." should be labeled as follows: "DOCTYPE O DOCNBR
O DATE O O O O AMT CURRENCY".
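Concretely, the word/label alignment for this example looks as follows (a toy sketch: extract_fields is a hypothetical helper, and a real invoice can carry several values per class):

```python
# One label per word; 'O' marks words that are not to be extracted.
words  = ["Invoice", "No.", "12345", "from", "2014/10/31",
          "for", "an", "amount", "of", "525", "euros."]
labels = ["DOCTYPE", "O", "DOCNBR", "O", "DATE",
          "O", "O", "O", "O", "AMT", "CURRENCY"]

def extract_fields(words, labels):
    """Collect the labeled words, dropping the 'O' (undefined) tokens."""
    return {lab: w for w, lab in zip(words, labels) if lab != "O"}

print(extract_fields(words, labels))
```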
We ran this tool over texts extracted from invoices using an OCR engine.
The OCR-generated XML files contain lines and words grouped in blocks, with
extracted text aligned in the same way as in regular documents (from top to bot-
tom and from left to right). As OCR segmentation can lead to many errors with
the presence of tables or difficult structures, we only kept words from OCR and
(All the examples/images used in this paper are fake for confidentiality reasons.)
rebuilt the lines based on word centroid coordinates. The left context therefore
allows defining key fields (cf. Figure 5). However, invoices, like all
administrative documents, may contain particular structures that are aligned
vertically. In tables, for example, the context defining target fields can
appear only in the headers. For this reason, we define sequences comprising the
whole text of the document, which ensures that the context and the field to be
extracted appear in the same sequence.
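The line-rebuilding step can be sketched as follows; the y_tol threshold and the running-mean grouping are our illustrative assumptions, not the exact implementation:

```python
def rebuild_lines(words, y_tol=6.0):
    """Group OCR words into lines by centroid y-coordinate, then order each
    line left-to-right. `words` are (text, cx, cy) centroid tuples; `y_tol`
    is the maximum vertical distance (assumed, in pixels) for two words to
    share a line."""
    lines = []  # each line: {"cy": running mean centroid y, "words": [(cx, text)]}
    for text, cx, cy in sorted(words, key=lambda w: w[2]):
        if lines and abs(cy - lines[-1]["cy"]) <= y_tol:
            line = lines[-1]
            line["words"].append((cx, text))
            n = len(line["words"])
            line["cy"] += (cy - line["cy"]) / n  # running mean keeps sloped lines merged
        else:
            lines.append({"cy": cy, "words": [(cx, text)]})
    return [" ".join(t for _, t in sorted(l["words"])) for l in lines]

sample = [("Total", 10, 100), ("525", 200, 102), ("Invoice", 10, 20), ("No.", 60, 21)]
print(rebuild_lines(sample))  # ['Invoice No.', 'Total 525']
```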
In parallel to NER experiments, and as our end goal is classification rather than
sequence labeling, we decided to compare our work to more classical classification
approaches from the document analysis community. Our aim is to predict the
class of each word within its own context based on our proposed feature vector
(textual, spatial, structural, logical). The output class is the category of the
item, which can be either one of the key fields or the undefined tag. Our work
is similar to the CloudScan approach [17], currently the state-of-the-art
approach for invoice field extraction, in that our classification is mainly
based on features. However, unlike CloudScan, our system is language-independent
and requires neither resources nor human actions as pre- and post-processing.
Indeed, they build resources such as a lexicon of towns to define a pre-tagging.
The latter is represented by Boolean features that check whether the word in
processing corresponds to a town or a zip code. In the same way, they extract
dates and amounts. In this work, our system is resourceless and avoids any pre-
tagging. We define a pattern feature to learn and predict dates and amounts. In
this way, we do not need to update lists nor detect language. We also added new
Boolean features (cf. Section 4.1) to define titles and mathematical assertions
(i.e., isTitle, isTerm, isProduct).
In order to accelerate the process, we proposed a strategy to reduce the
volume of data injected to train our models. To this end, we kept the N-grams
associated with one of the ground-truth fields and reduced the volume of
undefined samples.
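A sketch of this selection step, under the assumption that the reduction samples down the "undefined" n-grams (the keep ratio here is hypothetical, not from the paper):

```python
import random

def reduce_training_data(ngrams, labels, keep_undefined=0.1, seed=0):
    """Keep every n-gram tied to a ground-truth field; keep only a random
    fraction of the 'undefined' ones (keep_undefined is an assumed ratio)."""
    rng = random.Random(seed)
    return [(x, y) for x, y in zip(ngrams, labels)
            if y != "undefined" or rng.random() < keep_undefined]
```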
5 Results
In order to evaluate our methods, we used two traditional metrics from the
information retrieval field: precision and recall. While precision is the rate of
predicted key fields correctly extracted and classified by the system, recall is the
rate of fields present in the reference that are found and correctly classified by
the system. The industrial context involves particular attention to the precision
measure because false positives are more problematic to customers than missing
answers. We therefore aim to reach a very high precision with a reasonable recall.
We report our results on 8 fields: the docType and docNbr respectively
define the type (i.e. regular invoices, credit notes, orders, account statements,
delivery notes, quotes, etc.) and the number of the invoice. The docDate and
dueDate are respectively the date on which the invoice is issued and the due
date by which the invoice amount should be paid. We additionally extract the
net amount netAmt, the tax amount taxAmt and the total amount totAmt
as well as the currency. Table 2 shows the results of the first experiment,
conducted using the NER-based model and the class-based system.
These first results show that the class-based system outperforms the NER-based
one on most fields. Except for amounts, the NER-based system has good precision
on all fields, while the class-based system manages to find all the fields with
high precision and recall.

                      Recall                 Precision
Fields    Support  NER-based class-based  NER-based class-based
docType     3856     0.79      0.84         0.97      0.97
docNbr      3841     0.66      0.87         0.74      0.86
docDate     3897     0.78      0.73         0.94      0.95
dueDate     2603     0.73      0.78         0.94      0.92
netAmt      3441     0.47      0.72         0.58      0.78
taxAmt      3259     0.47      0.70         0.65      0.86
totAmt      3889     0.45      0.85         0.59      0.89
currency    2177     0.79      0.97         0.83      0.83
Table 2: First results using the NER-based model and the class-based system
over the database-20k. "Support" stands for the number of occurrences of each
field class in the test set. Best results are in bold.

4 https://uber.github.io/ludwig/

Despite the high performance of the NER-based system on standard NER tasks, it
showed some limits on invoices, which we explain by the much larger ratio of
undefined words to named entities in invoices.
Having many undefined tokens tends to disturb the system especially when the
fields are redundant in the documents (e.g., amounts), unlike fields that appear
once in the document, for which results are quite good. One particularity of the
amount fields is that they often appear in a summary table which is a non-linear
block that contains many irrelevant words.
In order to improve the results, we firstly visualized the weights of the features
in the attention layer at the last encoder of the NER-based neural network (cf.
Figure 4). These weights indicate relevant features for each target field.
Figure 4 indicates the weights of the best performing epoch in the NER-
based model. We can notice from the figure that many features have weights
close to zero (with ocean blue) for all the fields. Features such as the position
of the word in the document, block and line are unused by the final model and
considered irrelevant. Furthermore, it is clear that the relative positions of
the word in the document page (rightMargin, bottomMargin) are highly weighted in
the predictions of all the fields. For the amount fields, the logical features
as well as the relative margin positions of the word on the left and on the top
are also relevant; this is shown by white or light blue colors. We therefore conducted
a second experiment, keeping only the most relevant features. We trained new
models without considering the right and the bottom relative positions of the
word and its positions in the document, line and block.
Table 3 shows better results for practically all the target fields. Except for
the recall of docDate with the class-based system, which is considerably
degraded, all the other results either improve or keep good performance. Even if
the results are improved using relevant features, the NER-based system
nevertheless showed some limits in predicting amounts. This is not totally
unexpected, given that NER systems are mainly adapted to extracting information
from unstructured texts, while amounts are usually indicated at the end of
tables with different layouts.
Fig. 4: Weights of features used by the NER-based method. Features: position of
the word in the document, line and block (table, paragraph or frame) posDoc,
posLine, posBlock; relative position of the word compared to its neighbours
leftMargeRel, topMargeRel, rightMargeRel, bottomMargeRel; relative
position of the word in the document page rightMargin, bottomMargin;
Boolean features for titles and mathematical operations isTitle, isFactor, is-
Product, isSum, isTerm.
                      Recall                 Precision
Fields    Support  NER-based class-based  NER-based class-based
docType     3856     0.81*     0.85*        0.98*     0.97
docNbr      3841     0.67*     0.86         0.74      0.86
docDate     3897     0.78      0.33         0.95*     0.92
dueDate     2603     0.74*     0.70         0.93      0.91
netAmt      3441     0.47      0.78*        0.58      0.82*
taxAmt      3259     0.49*     0.78*        0.66*     0.87*
totAmt      3889     0.49*     0.87*        0.61*     0.89
currency    2177     0.82*     0.96         0.83      0.83
Table 3: Results of the NER-based model and the class-based system over the
database-20k using relevant features. Best results are given in bold. * denotes
better results compared to Table 2 (i.e. without feature selection).
                          Recall                            Precision
Fields    CloudScan [17] NER-based class-based  CloudScan [17] NER-based class-based
docType        –           0.79      0.90            –           0.99      0.97
docNbr        0.84         0.69      0.89           0.88         0.85      0.89
docDate       0.77         0.78      0.94           0.89         0.96      0.96
dueDate        –           0.74      0.90            –           0.96      0.93
netAmt        0.93         0.47      0.81           0.95         0.62      0.86
taxAmt        0.94         0.49      0.79           0.94         0.70      0.90
totAmt        0.92         0.44      0.87           0.94         0.63      0.95
currency      0.78         0.76      0.98           0.99         0.90      0.84
Table 4: Results of the NER-based model and the class-based system over the
database-100k using relevant features. Best results are given in bold.
The results in Table 4 are quite promising given the small volume of data used
in these experiments. For some fields (e.g. the document number), they can even
be compared to our baseline. Unsurprisingly, the results are clearly improved
for both recall and precision.
All in all, the 100k sample is much smaller than the corpus used to demonstrate
the CloudScan system (more than 300k invoices), and even with only 20k training
samples the performance is already very honorable.
6 Conclusion
Acknowledgements
This work is supported by the Region Nouvelle Aquitaine under the grant num-
ber 2019-1R50120 (CRASD project) and AAPR2020-2019-8496610 (CRASD2
project), the European Union’s Horizon 2020 research and innovation program
under grant 770299 (NewsEye) and by the LabCom IDEAS under the grant
number ANR-18-LCV3-0008.
References
1. Boroş, E., Hamdi, A., Pontes, E.L., Cabrera-Diego, L.A., Moreno, J.G., Sidere, N.,
Doucet, A.: Alleviating digitization errors in named entity recognition for historical
documents. In: Proceedings of the 24th Conference on Computational Natural
Language Learning. pp. 431–441 (2020)
2. Chiu, J.P., Nichols, E.: Named entity recognition with bidirectional LSTM-CNNs.
arXiv preprint arXiv:1511.08308 (2015)
3. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.:
Natural language processing (almost) from scratch. Journal of Machine Learning
Research 12(Aug), 2493–2537 (2011)
4. Conneau, A., Lample, G.: Cross-lingual language model pretraining. In: Wal-
lach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Gar-
nett, R. (eds.) Advances in Neural Information Processing Systems 32, pp.
7059–7069. Curran Associates, Inc. (2019), http://papers.nips.cc/paper/
8928-cross-lingual-language-model-pretraining.pdf
5. Dengel, A.R., Klein, B.: smartFIX: A requirements-driven system for document
analysis and understanding. In: International Workshop on Document Analysis
Systems. pp. 433–444. Springer (2002)
6. Dernoncourt, F., Lee, J.Y., Szolovits, P.: Neuroner: an easy-to-use pro-
gram for named-entity recognition based on neural networks. arXiv preprint
arXiv:1705.05487 (2017)
7. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirec-
tional transformers for language understanding. arXiv preprint arXiv:1810.04805
(2018)
8. Grishman, R., Sundheim, B.M.: Message understanding conference-6: A brief his-
tory. In: COLING 1996 Volume 1: The 16th International Conference on Compu-
tational Linguistics (1996)
9. Holt, X., Chisholm, A.: Extracting structured data from invoices. In: Proceedings
of the Australasian Language Technology Association Workshop 2018. pp. 53–59.
Dunedin, New Zealand (Dec 2018)
10. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014)
11. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural
architectures for named entity recognition. arXiv preprint arXiv:1603.01360 (2016)
12. Lohani, D., Belaïd, A., Belaïd, Y.: An invoice reading system using a graph convo-
lutional network. In: Asian Conference on Computer Vision. pp. 144–158. Springer
(2018)
13. Ma, X., Hovy, E.: End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF.
arXiv preprint arXiv:1603.01354 (2016)
14. Martin, L., Muller, B., Suárez, P.J.O., Dupont, Y., Romary, L., de la Clergerie,
É.V., Seddah, D., Sagot, B.: CamemBERT: a tasty French language model. arXiv
preprint arXiv:1911.03894 (2019)
15. Molino, P., Dudin, Y., Miryala, S.S.: Ludwig: a type-based declarative deep learn-
ing toolbox (2019)
16. Nadeau, D., Sekine, S.: A survey of named entity recognition and classification.
Lingvisticae Investigationes 30(1), 3–26 (2007)
17. Palm, R.B., Winther, O., Laws, F.: CloudScan - A configuration-free invoice anal-
ysis system using recurrent neural networks. CoRR abs/1708.07403 (2017),
http://arxiv.org/abs/1708.07403
18. Poulain d’Andecy, V., Hartmann, E., Rusinol, M.: Field extraction by hybrid incre-
mental and a-priori structural templates. In: 13th IAPR International Workshop
on Document Analysis Systems, DAS 2018, Vienna, Austria, April 24-27, 2018.
pp. 251–256 (04 2018). https://doi.org/10.1109/DAS.2018.29
19. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language un-
derstanding by generative pre-training (2018)
20. Reimers, N., Eckle-Kohler, J., Schnober, C., Kim, J., Gurevych, I.: Germeval-2014:
Nested named entity recognition with neural networks (2014)
21. Rusiñol, M., Benkhelfallah, T., D’Andecy, V.P.: Field extraction from administra-
tive documents by incremental structural templates. In: ICDAR. pp. 1100–1104.
IEEE Computer Society (2013), http://dblp.uni-trier.de/db/conf/icdar/
icdar2013.html#RusinolBD13
22. Sage, C., Aussem, A., Elghazel, H., Eglin, V., Espinas, J.: Recurrent neu-
ral network approach for table field extraction in business documents. In:
2019 International Conference on Document Analysis and Recognition, IC-
DAR 2019, Sydney, Australia, September 20-25, 2019. pp. 1308–1313 (09 2019).
https://doi.org/10.1109/ICDAR.2019.00211
23. Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: Language-
independent named entity recognition. arXiv preprint cs/0306050 (2003)
24. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,
L., Polosukhin, I.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)
25. Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun,
M., Cao, Y., Gao, Q., Macherey, K., et al.: Google’s neural machine translation
system: Bridging the gap between human and machine translation. arXiv preprint
arXiv:1609.08144 (2016)
26. Zhao, X., Niu, E., Wu, Z., Wang, X.: CUTIE: Learning to understand docu-
ments with convolutional universal text information extractor. arXiv preprint
arXiv:1903.12363 (2019)