MARKO LAZIC
Abstract
The ability to automatically read, recognize, and extract different information
from unstructured text is of key importance to many areas. Most research in
this area has been focused on scanned invoices. This thesis investigates the
feasibility of using natural language processing to extract information from
receipt text. Three different machine learning models, BiLSTM, GCN, and
BERT, were trained to extract a total of 7 different data points from a dataset
consisting of 790 receipts. In addition, a simple rule-based model is built to
serve as a baseline. These four models were then compared on how well they perform on the different data points. The best performing machine learning model was BERT, with an overall F1 score of 0.455. The second best machine learning model was BiLSTM, with an F1 score of 0.278, while GCN had an F1 score of 0.167. These F1 scores are heavily affected by the low performance on the
product list which was observed with all three models. BERT showed promis-
ing results on vendor name, date, tax rate, price, and currency. However, a
simple rule-based method was able to outperform the BERT model on all data
points except vendor name and tax rate. Receipt images from the dataset were often blurred, rotated, and crumpled, which introduced a high OCR error rate. This error then propagated through all of the steps and was most likely the main reason why the machine learning models, especially BERT, were not able to perform better.
It is concluded that there is potential in using natural language processing for
the problem of information extraction. However, further research is needed if
it is going to outperform the rule-based models.
Sammanfattning
The ability to automatically read, recognize, and extract information from unstructured text is of crucial importance to many areas. The majority of the research done in this area has focused on scanned invoices. This thesis investigates whether natural language processing can be used to extract information from receipt text. Three different machine learning models, BiLSTM, GCN, and BERT, were trained to extract a total of 7 different data points from a dataset consisting of 790 receipts. In addition, a simple rule-based model was built as a reference. These four models were then compared on how well they performed on the different data points. The best performing machine learning model was BERT with an F1 score of 0.455. The second best model was BiLSTM with an F1 score of 0.278, while GCN had an F1 score of 0.167. These results are strongly affected by the low performance on the product list observed with all three models. BERT showed promising results on vendor name, date, tax rate, price, and currency. However, the rule-based model had better results on all data points except vendor name and tax rate. Receipt images in the dataset are often blurry, rotated, and crumpled, which results in a high error rate in the OCR engine. This error then propagated through all steps and was probably the main reason why the machine learning models, especially BERT, were not able to perform better. In summary, it can be concluded that using natural language processing to extract information from receipt text has potential. However, further research is needed if it is to be used instead of rule-based models.
Contents
1 Introduction
  1.1 Problem definition
  1.2 Scope
  1.3 Research Questions
2 Background
  2.1 Natural Language Processing
    2.1.1 Recurrent Neural Network
    2.1.2 Long Short Term Memory
    2.1.3 Sequence To Sequence Models
    2.1.4 Attention Mechanisms
    2.1.5 Transformer
    2.1.6 BERT
  2.2 Graph Convolution Network
  2.3 F1 score
  2.4 Related Work
    2.4.1 Template structure based models
    2.4.2 NER based models
    2.4.3 Hybrid models
3 Methods
  3.1 Dataset
  3.2 Data pre-processing
  3.3 Rule-based
  3.4 NER data creator
    3.4.1 Synthetic data
  3.5 Oracle model
  3.6 LSTM model
  3.7 BERT model
4 Results
  4.1 Oracle
  4.2 Rule Based
  4.3 LSTM
  4.4 BERT
  4.5 GCN
  4.6 Result comparison
    4.6.1 Vendor
    4.6.2 Date
    4.6.3 Address
    4.6.4 Tax rate
    4.6.5 Price
    4.6.6 Currency
    4.6.7 Products
5 Discussion
  5.1 Problems
  5.2 Machine learning models
  5.3 Rule Based
  5.4 Societal and Ethical Aspects
  5.5 Sustainability
Appendices
A
Chapter 1
Introduction
1.2 Scope
Figure 1.1 shows the overview of this thesis, from the raw receipt image to the final data extraction. The yellow color indicates the parts that are included in the thesis and the blue color indicates the parts that are the main area of focus. Scanning the receipts, binarizing the images, and reading the text are parts that are necessary for the final results, but are not part of this research. The parts that are included, but less important, are filtering the result from the optical character recognition (OCR) engine and the rule-based model (steps 3 and 4d in figure 1.1). The rule-based model is implemented to serve as a baseline for comparison. The main focus of the thesis is the comparison of three different machine learning methods for data extraction from receipt text.
Figure 1.1: Overview of the thesis. Yellow represents parts included in the
thesis and blue represents the main area of focus.
1.3 Research Questions
• How well do BiLSTM, BERT, and GCN perform on data extraction from receipt text in terms of precision, recall, and F1 score?
• Which of these methods is best suited for the extraction of different data
fields?
Chapter 2
Background
the hidden state h_t for the input x_t is calculated by formula 2.1. Each hidden state h_t is calculated based on the current input and the signal propagated from the beginning of the sequence. RNNs use the backpropagation learning rule, but when the error signal propagates back in time it tends to blow up or vanish [6].
h_t = A(h_{t-1}, x_t)    (2.1)
forget the important features of the first part of the input as it processes the
whole sequence.
2.1.5 Transformer
The Transformer is a neural network architecture presented in Vaswani et al. [13] that is based solely on the idea of attention mechanisms. Instead of using recurrent cells whose value depends on the previous calculation, the whole sequence is fed to the model at the same time. This enables parallel computation, which is not available for regular RNNs. In the encoder phase, each element in the input sequence gets a corresponding attention vector, which is calculated by the following formula:
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V    (2.2)
8 CHAPTER 2. BACKGROUND
Vectors Q, K and V (Query, Key, and Value) are calculated using the corresponding weight matrices W_Q, W_K, and W_V, which are learnable parts of the encoder. The dimension of the queries and keys is denoted d_k. The attention vector for a single element in the input sequence contains information about which parts of the input are most relevant when looking at this particular element. This particular attention is called Scaled Dot-Product Attention.
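To make equation 2.2 concrete, a minimal NumPy sketch of scaled dot-product attention is shown below. The matrices Q, K, and V are assumed to have already been produced by multiplying the input embeddings with the learned matrices W_Q, W_K, and W_V; the sizes used are arbitrary example values.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal sketch of equation 2.2: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]                       # dimension of queries and keys
    scores = Q @ K.T / np.sqrt(d_k)         # similarity between every query and every key
    # numerically stable softmax over the key dimension
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                      # weighted sum of the value vectors

# Toy usage: a sequence of 5 tokens with d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 8)
```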
However, the work in Vaswani et al. [13] shows that it is beneficial to perform multiple attention operations with different W_Q, W_K, and W_V matrices. This type of attention is called Multi-Head Attention. The output of the attention layer is then fed to a regular feed-forward layer, which gives the final encoder output.
The decoder part consists of two different attention layers. Firstly, there is a Masked Multi-Head Self-Attention layer. Since the whole sequence is available to the decoder during the training phase, the model has to mask part of the input so that it cannot look at future words. Otherwise, the model could end up only learning to copy the decoder input. This is then followed by another Multi-Head Attention layer, which gets its Query vector from the previous decoder layer while the Key and Value are fetched from the encoder output. Similar to the encoder, this is followed by a feed-forward layer. The model of this encoder-decoder architecture is shown in figure 2.5.
Positional Encoding
Since the model does not contain any recurrence or convolution, the information about the relative and absolute position of each element in the sequence has to be incorporated in some other way. This is accomplished by adding a positional encoding to the input embeddings of both the encoder and decoder. Positional encodings have the same dimension as the input embeddings and are calculated by the following formula:
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)    (2.3)
where pos is the position and i is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000 · 2π [13]. This allows the model to extrapolate to sequence lengths longer than those encountered during training.
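As an illustration, the sinusoidal encoding of equation 2.3 (together with the cosine used for the odd dimensions in [13]) can be computed as in the sketch below; the maximum length and d_model are arbitrary example values, not the configuration used in this thesis.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings: sin for even dimensions, cos for odd ones."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)    # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # equation 2.3
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions use cosine [13]
    return pe

print(positional_encoding(max_len=50, d_model=16).shape)  # (50, 16)
```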
2.1.6 BERT
Bidirectional Encoder Representations from Transformers (BERT) is a lan-
guage representation model proposed by Devlin et al. [14]. BERT is designed
to pre-train deep bidirectional representations from unlabeled text by jointly
conditioning on both left and right context in all layers. It is based on the trans-
former architecture presented in Vaswani et al. [13]. Figure 2.6 shows a simple model of the BERT architecture, consisting of a number of bidirectionally connected transformer layers. BERT is pre-trained on BooksCorpus (800M words)
and English Wikipedia (2.5B words) on two different unsupervised tasks. In
the first task, 15% of all word tokens in each sequence are masked at random
and the network is then trained to predict them. For the second task, the model
is trained to understand the relationship between two sentences. During the training phase, the model is fed two sentences, A and B, where 50% of the time B is the actual sentence that follows A. The model is then trained to decide whether sentence B is a logical successor of sentence A.
By utilizing the power of transfer learning, BERT can be fine-tuned to obtain state-of-the-art results on 11 different NLP tasks [14]. Figure 2.6 shows how the model can be adapted to perform the NER task.
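As a rough illustration of how a pre-trained BERT model can be adapted to token classification (the NER setup sketched in figure 2.6), the snippet below uses the Hugging Face transformers library. The model name and the label set are assumptions made for this example and are not the exact configuration used in this thesis.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Example label set (assumed for illustration); 'O' marks tokens outside any field.
labels = ["O", "VENDOR", "DATE", "ADDRESS", "PRICE", "TAXRATE", "CURRENCY",
          "PROD_NAME", "PROD_PRICE", "PROD_AMOUNT"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels))

text = "ICA Supermarket 2019-05-04 Total 123.50 SEK"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # (1, seq_len, num_labels)
pred = logits.argmax(dim=-1)[0]
print([labels[i] for i in pred])             # one label per input token (classifier head untrained here)
```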
2.2 Graph Convolution Network
In a Graph Convolution Network (GCN), each layer can be written as a function of the node feature matrix H^l of the previous layer and the adjacency matrix A:
H^{l+1} = f(H^l, A)    (2.4)
The function f(H^l, A) can then be described as:
f(H^l, A) = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^l W^l\right)    (2.5)
with \hat{A} = A + I, where I is the identity matrix, \hat{D} is the diagonal node degree matrix of \hat{A}, \sigma is a non-linear activation function, and W^l is a weight matrix [15].
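A minimal NumPy sketch of this propagation rule (equation 2.5) is given below; the graph, feature, and weight sizes are arbitrary illustrations, and ReLU is used as the non-linearity.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One GCN layer: sigma(D^-1/2 (A+I) D^-1/2 H W), with ReLU as the non-linearity."""
    A_hat = A + np.eye(A.shape[0])                       # add self-loops: A_hat = A + I
    D_hat = A_hat.sum(axis=1)                            # node degrees of A_hat
    D_inv_sqrt = np.diag(1.0 / np.sqrt(D_hat))           # D_hat^(-1/2) as a diagonal matrix
    return np.maximum(0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

# Toy graph with 4 nodes, 3 input features, 2 output features
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.random.randn(4, 3)
W = np.random.randn(3, 2)
print(gcn_layer(H, A, W).shape)  # (4, 2)
```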
2.3 F1 score
To properly evaluate methods some kind of measure is needed to determine
how well they perform with respect to some given ground truth. A measure
that is commonly used in literature and will also be used here is the F1 score.
It is a measure of a test's accuracy used in the statistical analysis of binary
classification. The F1 score is the harmonic mean of precision p and recall r.
Precision p is the number of correct positive results divided by the number of
all positive results returned by the classifier. Recall r is the number of correct
positive results divided by the number of all relevant samples [16]. Figure 2.7
shows a visual representation of precision and recall. Equation 2.6 shows the
mathematical definition of F1 score.
F_1 = 2 \cdot \frac{p \cdot r}{p + r}    (2.6)
The F1 score can be extended to evaluate results for multi-class classification as well. This can be done either by taking the arithmetic mean of each class's F1 score (macro averaging) or by treating all classes as one and summing their true positives, false positives, and false negatives to calculate precision, recall, and F1 score (micro averaging).
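The difference between micro and macro averaging can be made concrete with a small sketch; the per-class counts below are invented purely for illustration.

```python
def f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Hypothetical per-class counts: (true positives, false positives, false negatives)
counts = {"date": (80, 5, 10), "price": (70, 20, 15), "vendor": (30, 25, 40)}

macro = sum(f1(*c) for c in counts.values()) / len(counts)   # mean of per-class F1 scores
micro = f1(*(sum(col) for col in zip(*counts.values())))     # pool all counts, then compute F1 once
print(f"macro F1 = {macro:.3f}, micro F1 = {micro:.3f}")
```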
the receipts may contain some private information that can not be shared due
to GDPR regulations.
The literature on information extraction from receipts and invoices can
roughly be divided into three different kinds of approaches. The first one is a
template-based approach that analyses the geometrical dependencies between
different parts of receipts. This approach tries to utilize the fact that the dif-
ferent key features usually occupy different specific parts of the receipt. An example of this is that the vendor name is usually placed at the top of the receipt while the total price is at the bottom. The second one is based on Named Entity Recognition (NER). This approach tries to understand the semantics of the receipt text and in that way classify each of its characters or words separately. The third approach is a hybrid solution, a mixture of the first two.
Figure 2.10: An example of a graph for a sample invoice used by Lohani et al. [24].
Gal et al. [25] present a very interesting method that tries to solve the problem of receipt field recognition by taking into account both the semantic and geometrical aspects of the receipt text. Firstly, they train their own model in an unsupervised manner to perform a Char2Vec embedding, a technique widely used to map characters to vectors. As input to their model, they create, for each word in the receipt, word pairs consisting of that word and its 4 nearest neighbors in 4 different directions. Then, for each word in the receipt, they encode the output of Char2Vec into a 32-bit long vector and color the bounding box of that word with the result of that vector. Figure 2.11 shows the resulting image after the steps described above. The heatmap is then used as input to a U-net that tries to learn how to classify each pixel of the colored part of the image. This method is then applied to a dataset consisting of 5094 manually labeled receipts that is extended to a total of 22670 images by various synthetic image transformations. The final result is only evaluated on the total price field, and the best resulting accuracy was 87.86% with the best Dice score of 0.6367.
Figure 2.11: Example of the heatmap obtained from the Char2Vec embedding (left) and its mapping onto the real image (right) [25].
Chapter 3
Methods
In this chapter, the methods used in this thesis are presented. Firstly, the datasets and pre-processing techniques are presented. Next, the rule-based model and the data labeling algorithm are described. Lastly, the three different models and the experiments performed are detailed.
3.1 Dataset
The dataset consists of 790 receipt images taken by a smartphone camera. Some examples of the receipts can be seen in figure 3.1. The majority of receipts are from Sweden; however, there is a small number of outliers that are not. Each image has a corresponding list of ground-truth labels which includes: vendor name, date, address, total price, tax rate, currency, and a list of products with their corresponding names, prices, and amounts. However, not all images contain all labels. From the total number of receipts, 90 are selected at random and put in the test set.
After removing the unwanted OCR results, all text boxes are grouped by whether they appear on the same line, and all the lines are then sorted from top to bottom. The final result after step 3 is thus the line-by-line text of the receipt. Figure 3.2 shows an example of the text sorted by lines.
Figure 3.3: Receipt images before (left) and after (right) the adaptive
thresholding has been applied
3.3 Rule-based
A simple rule-based method for data extraction is created to serve as a baseline for comparison. This is accomplished by having a set of rules for each specific data field. Each of these rules is applied to the result from step 3 in figure 1.1. All of the rule sets are implemented using behavior trees, where each data class is represented by a single tree.
Vendor name
The text from the text box at the top is taken, unless it contains the Swedish word for receipt, 'kvitto', or the text length is less than 2, in which case the next one is taken instead.
Date
A set of the most common date formats is kept, and the receipt text is then searched for these formats using regex.
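A minimal sketch of such a rule is shown below; the regular expressions cover only a few example date formats and are not the exact rule set used in the thesis.

```python
import re

# Example date formats only; the real rule set keeps a larger collection.
DATE_PATTERNS = [
    r"\b\d{4}-\d{2}-\d{2}\b",      # 2019-05-04
    r"\b\d{2}/\d{2}/\d{4}\b",      # 04/05/2019
    r"\b\d{2}\.\d{2}\.\d{4}\b",    # 04.05.2019
]

def extract_date(receipt_text: str):
    """Return the first substring matching any known date format, or None."""
    for pattern in DATE_PATTERNS:
        match = re.search(pattern, receipt_text)
        if match:
            return match.group()
    return None

print(extract_date("ICA Kvantum Kvitto 2019-05-04 Total 123.50 SEK"))  # 2019-05-04
```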
Address
Similar to the date, a set of the most common address formats is kept, and the receipt text is then searched for these formats using regex.
Currency
A set of the most common currency acronyms and abbreviations is kept, and the receipt text is then searched using regex.
Tax rate
Firstly, the receipt text is searched for numbers with a percent sign (%) at the end. Then, for each result of the first search, another search on the same line or the line above is performed, looking for keywords like 'moms' or 'vat'. If these keywords are found, the number is considered to represent a tax rate. If no match is found, the same process is performed for all numbers, not only the ones with a percent sign at the end.
Total price
The total price is extracted by 3 different rules applied in priority order. Firstly, the receipt text is searched for a number coming directly after a currency sign. If no match is found, the text is searched for numbers coming after the keywords 'total', 'belopp', 'summa', 'betala', 'tot', 'kontokort', 'amount', and 'net'. If no match is found, the text is searched for all numbers in decimal format and the largest of them is considered to be the total price.
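The priority order of these three rules could be sketched as follows; the regular expressions and currency signs are simplified illustrations of the description above, not the exact rules used.

```python
import re

KEYWORDS = ["total", "belopp", "summa", "betala", "tot", "kontokort", "amount", "net"]

def extract_total_price(text: str):
    """Apply the three price rules in priority order (simplified sketch)."""
    # 1. A number coming directly after a currency sign.
    m = re.search(r"(?:kr|sek|\$|€)\s*(\d+[.,]\d{2})", text, re.IGNORECASE)
    if m:
        return m.group(1)
    # 2. A number coming after one of the total-price keywords.
    pattern = "|".join(KEYWORDS)
    m = re.search(rf"(?:{pattern})\D{{0,10}}(\d+[.,]\d{{2}})", text, re.IGNORECASE)
    if m:
        return m.group(1)
    # 3. Otherwise take the largest number in decimal format.
    numbers = re.findall(r"\d+[.,]\d{2}", text)
    return max(numbers, key=lambda n: float(n.replace(",", "."))) if numbers else None

print(extract_total_price("Moms 25% 24.70 Summa 123.50 Kontokort 123.50"))  # 123.50
```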
Product list
Each product is represented by three different classes: product name, product price, and product amount. The product list is extracted by matching consecutive lines of text against the formats for price, amount, and name.
3.4.1 Synthetic data
Vendor name
The vendor name is replaced by a random one from a set of vendor names
created from all real receipts.
Address
The address is replaced by a new one created from a random combination of
a street name, street number, postal code, and city, all of which are contained
in a small database of Swedish addresses.
Date
The date is replaced by a new one with a random date format taken from a
small database of date formats. All synthetically generated dates are between the first of January 2010 and the first of January 2020.
Total price
The total price is replaced by a random one from a normal distribution with a
mean of 100 and a standard deviation of 30.
Product name
Each product name is replaced with a random one from a set of already exist-
ing product names.
Currency, tax rate, product price, product amount, and text that does not belong to any known class are not replaced. Both datasets are created in the same way, with synthetic1000 having 1000 synthetic receipts in addition to the real ones while synthetic10000 has 10000.
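A simplified sketch of how such synthetic replacements could be generated is shown below; the vendor names, address parts, and date formats are placeholder examples, not the databases actually used, while the total price follows the normal distribution described above.

```python
import random
from datetime import date, timedelta

# Placeholder databases; the real ones are built from the receipts and a Swedish address list.
VENDORS = ["ICA Kvantum", "Coop Konsum", "Pressbyrån"]
STREETS = ["Storgatan", "Kungsgatan"]
CITIES = ["Stockholm", "Uppsala"]
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y"]

def synthetic_fields():
    start, end = date(2010, 1, 1), date(2020, 1, 1)
    random_date = start + timedelta(days=random.randrange((end - start).days))
    return {
        "vendor": random.choice(VENDORS),
        "address": f"{random.choice(STREETS)} {random.randint(1, 99)}, "
                   f"{random.randint(10000, 99999)} {random.choice(CITIES)}",
        "date": random_date.strftime(random.choice(DATE_FORMATS)),
        "total_price": round(random.gauss(100, 30), 2),   # normal distribution, mean 100, std 30
    }

print(synthetic_fields())
```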
Table 3.1: Most common addresses and vendor names from the dataset
set and the model is trained for 30 epochs with a batch size of 16. The model with the best loss on the validation set is saved as the final model. This is represented by step 6a in figure 1.1. To form the final result, each consecutive word that belongs to the same class, with the exception of class 'O', is concatenated. In case several groups of the same class are created for a single receipt, the final result is decided by a majority vote. If all of the groups are different, the one that comes first is taken.
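The grouping and majority-vote step described above could look roughly like the following sketch; the token and label names are invented for illustration.

```python
from collections import Counter
from itertools import groupby

def extract_fields(tokens, labels):
    """Concatenate consecutive tokens of the same class (except 'O') into groups,
    then pick one value per class by majority vote; ties are broken by first occurrence."""
    groups = []
    for label, run in groupby(zip(tokens, labels), key=lambda tl: tl[1]):
        if label != "O":
            groups.append((label, " ".join(tok for tok, _ in run)))

    result = {}
    for label, _ in groups:
        if label in result:
            continue
        values = [v for lab, v in groups if lab == label]
        counts = Counter(values)
        best = max(counts.values())
        # majority vote over all groups of this class; the first group wins a tie
        result[label] = next(v for v in values if counts[v] == best)
    return result

tokens = ["ICA", "Kvantum", "2019-05-04", "Total", "123.50", "ICA"]
labels = ["VENDOR", "VENDOR", "DATE", "O", "PRICE", "VENDOR"]
print(extract_fields(tokens, labels))
# {'VENDOR': 'ICA Kvantum', 'DATE': '2019-05-04', 'PRICE': '123.50'}
```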
Feature Creation
Each node (word) is represented by a 312-dimensional feature vector. The features are divided into three categories: boolean, numeric, and textual.
The boolean features output an 8-dimensional vector and are calculated as follows:
(i) isDate: a parser that checks whether the word can be a date.
(ii) isKnownCity: checks whether the word matches a known city in a small database.
strings differ. Tax rates come both with and without the % sign, so only the numerical parts of the tax rate strings are compared. However, for the vendor, currency, and address a simple string comparison is used. For a product to be counted as correctly extracted, all three parts (name, price, and amount) have to match.
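The field comparison described above can be sketched as follows; the Levenshtein implementation and the tolerance value are illustrative, and the tax-rate comparison strips the % sign as described.

```python
import re

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def field_correct(field, predicted, truth, tolerance=0):
    if field == "tax_rate":
        # compare only the numerical part, since the % sign may or may not be present
        num = lambda s: re.sub(r"[^\d.,]", "", s)
        return num(predicted) == num(truth)
    if field in ("vendor", "address") and tolerance:
        return levenshtein(predicted, truth) <= tolerance
    return predicted == truth       # currency and the rest use exact string comparison

print(field_correct("tax_rate", "25%", "25"))                     # True
print(field_correct("vendor", "ICA Kvanturn", "ICA Kvantum", 3))  # True
```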
Chapter 4
Results
In this chapter, results from the rule-based, oracle, and three different machine learning models are presented. At the beginning of the chapter, the results for each model in the form of F1 scores are presented. After that, an overview of the results for each separate data field is given. The chapter concludes with a comparison between all the models.
4.1 Oracle
Table 4.1 shows the result of the oracle model. Even though the model classifies each token correctly, the final result is not 1. This is due to two things. Firstly, the OCR engine is not perfect, which results in errors in the text scanning, and secondly, the algorithm for creating the NER data is only a heuristic and some data can be lost in transition. Date, price, tax rate, and currency have an F1 score above, or near, 0.9. The vendor has an F1 score of 0.791, while the scores for address and products are significantly lower. Table 4.2 gives some perspective on these three data fields. Instead of comparing the strings as they are, a small tolerance in terms of Levenshtein distance is allowed. The F1 score for the vendor increased from 0.791 to 0.985 when the tolerance in Levenshtein distance is increased to 3. Under the same conditions, the F1 score for the address increased from 0.574 to 0.869. For the products, the tolerance is introduced only for the product name and it resulted in an increase from 0.404 to 0.515. The micro average F1 score for the oracle model is 0.660, and it gives an estimate of how big an impact the OCR errors and the errors in NER data creation have on the final results. From the 90 receipts in the test set, only 9 had all the data fields correctly extracted. If the product list is omitted, this number rises to 40. This model also gives a
good baseline for comparison with the other models, since it shows how good a model that predicts all tokens correctly can be.
Table 4.2: The development of the precision, recall and F1 score for the
oracle model when the Levenshtein distance tolerance is introduced.
0.627. Vendor and address both have an F1 score slightly under 0.5. The worst performance is for the products, which have an F1 score of 0.127. The overall performance of the rule-based method is 0.515 in micro average and 0.710 in macro average. From the 90 receipts in the test set, only 4 have all the data fields extracted correctly, and that number rises to 27 if the product list is omitted. Table 4.4 shows, however, how the scores change if the tolerance for correct answers is increased for the vendor, address, and products fields. The F1 score for the vendor is increased from 0.455 to 0.693 when the Levenshtein distance tolerance is 3, and under the same conditions the score for the address is increased from 0.427 to 0.565. The F1 score for products is almost doubled, from 0.127 to 0.228, when the tolerance for the product name is increased. However, the product price and amount still have to be a perfect match for a product to be considered correct.
Table 4.4: The development of the precision, recall and F1 score for the rule
based model when the Levenshtein distance tolerance is introduced.
4.3 LSTM
Figure 4.1 shows the training and validation loss of the LSTM model on the three different datasets. The validation loss on real data reaches its minimum around epoch 750 and then grows as the model overfits to the training data. A similar trend is seen with both synthetic1000 and synthetic10000, where the loss reaches its minimum around epoch 850 and 1400, respectively. Even though the validation loss is more stable with these two datasets, overfitting is still an issue. However, all three datasets achieve a very similar minimum loss, a little over 0.6, with synthetic1000 being the lowest. Table 4.5 shows the result of the LSTM model on these three datasets. The result matches the loss plot, showing that all three models have quite similar micro average scores, with synthetic1000 being the best with an F1 score of 0.278. All three models show promising results on the date, tax rate, price, and currency. On the other hand, the results for vendor, address, and products all have an F1 score lower than 0.1. None of the models managed to correctly extract all data fields from a single receipt.
Table 4.6 shows how the scores change for the model trained on synthetic1000 when the Levenshtein tolerance is introduced. The F1 score for the vendor increased linearly from 0.078 to 0.170 and the same score for products increased from 0.037 to 0.088. The change for the address was not as large, increasing
from 0.060 to 0.075 at the first step and then staying unchanged.
Figure 4.1: Train and validation loss of the LSTM model on three different
datasets.
Table 4.6: The development of the precision, recall and F1 score for the
LSTM model when the Levenshtein distance tolerance is introduced
4.4 BERT
Figure 4.2 shows the train and validation loss for the three datasets. Real data reaches its minimum loss of 0.4 on the validation set around epoch 6, and then the loss only gets bigger. Synthetic1000 reaches its minimum loss of 0.2 around epoch 11 and stays stable around that value for the rest of the training. The validation loss for the synthetic10000 dataset reaches its minimum around epoch 5 and stays stable after that, being only slightly higher than the training loss.
Table 4.7 shows that synthetic10000 achieves good results on the test set as well, having a micro average F1 score of 0.961. This score represents the result of the BERT model on the token classification task. Table 4.8 shows the final result for the BERT model after the data has been extracted. It shows that synthetic10000 has the best final result as well, with a micro average score of 0.455. Synthetic10000 gave the best F1 scores for all fields except for the address, where it shares the best F1 score of 0.344 with the real-data model. The address is, with the exception of products, the data field with the lowest F1 score. The best scores are achieved on the tax rate, date, price, and currency, with F1 scores of 0.885, 0.842, 0.818, and 0.885, respectively. The data field with the lowest score is the products, with an F1 score of 0.027. Even though BERT was able to achieve high scores when classifying the product name, price, and amount tokens, the final F1 score after the data is extracted is very low. Synthetic1000 was able to extract all data fields correctly from only 1 receipt.
If the product list is omitted from the data fields, this number rises to 23. Table 4.9 shows how the scores change for the synthetic10000 model when the Levenshtein distance tolerance is added when comparing the results. The F1 score for the vendor data field increased from 0.667 to 0.845 when the distance tolerance is 3. Under the same conditions, the F1 score for the address increased from 0.356 to 0.661. The scores for products almost doubled, but are still very low, with the best F1 score of 0.044. As explained in section 3.9, for a product to be counted as correctly extracted, all three components (price, name, and amount) have to match. If only product names are compared, the F1 score for products increases to 0.502.
Figure 4.2: Train and validation loss of the BERT model on three different
datasets.
Table 4.9: The development of the precision, recall and F1 score for the
BERT model when the Levenshtein distance tolerance is introduced.
4.5 GCN
Figure 4.3 shows the plot of the train and validation loss on real data. The minimum loss on the validation set is reached around epoch 100 and it stays stable after that. Table 4.10 shows the final result of the GCN model. The best F1 score is achieved on currency, 0.682, and the worst on price, 0.022. The model has high precision on the date, tax rate, price, and currency, but it has a lot of false negatives, which is reflected in the recall score. The micro average F1 score of the model is 0.167 and the macro average is 0.305. Table 4.11 shows how the scores for the vendor, address, and products change when the Levenshtein distance tolerance is introduced. The F1 score for the vendor name is increased from 0.222 to 0.245 and the F1 score for products is increased from 0.067 to 0.089, while the scores for the address stayed unchanged.
Figure 4.3: Train and validation loss of GCN model on real data.
Table 4.11: The development of the precision, recall and F1 score for the
GCN model when the Levenshtein distance tolerance is introduced.
4.6 Result comparison
4.6.1 Vendor
The best F1 score on the vendor name data field is attained by the BERT model. The F1 score achieved represents 86% of the oracle model's F1 score, which is the highest score this model could have achieved. It was superior to all other models.
4.6.2 Date
The rule-based model achieved the best F1 score for the date. However, both LSTM and BERT have comparable results. The best machine learning model and the second-best overall, BERT, reached 92% of the oracle model's F1 score. The GCN model proved to be inferior to the other models on this task.
4.6.3 Address
The best F1 score for the address is obtained by the rule-based model. The second-best result, not far behind, is achieved by the BERT model. The BERT model managed, however, to reach only 58% of the oracle score. Both LSTM and GCN have significantly lower scores.
4.6.5 Price
The rule-based model was best at extracting the price as well. The BERT model was second best, with a score that represents 91% of the oracle model's. The GCN model was inferior, with a very low score.
4.6.6 Currency
The best F1 score on currency is achieved by the rule-based model. LSTM and BERT have very similar scores, which represent around 98% of the oracle model's. The least successful model was GCN.
4.6.7 Products
The product list is the field with the lowest score for all models except GCN. The best F1 score is achieved by the rule-based model. The second best is GCN; however, that score is very low. Both LSTM and BERT scored less than 10% of the oracle model, with LSTM being slightly better.
Table 4.12: Comparison of the F1 scores of final results for all models.
Chapter 5
Discussion
In this chapter, the analysis and discussion of the results are presented.
5.1 Problems
One of the main problems that affects all models, including the rule-based one, is the errors produced by the OCR engine. Since all of the final extraction results are compared exactly to the ground truth, errors in the OCR can make this comparison appear false even if the extraction is done correctly. The Levenshtein distance tolerance is included in the result analysis in order to see how much of the error depends on this. This problem is even greater for the machine learning models, as the OCR error propagates through all of the steps 2-6 in figure 1.1. Erroneously read words can lead to even bigger errors in the data creation algorithm 1 in steps 4a, 4b, and 4c. This can potentially lead to some tokens being falsely labeled during this process, which leads to information loss. This is confirmed by the results of the oracle model. The micro average of the oracle model is only 0.660 even though this model correctly predicts all of the tokens. Adding the Levenshtein distance tolerance increases its F1 score for the vendor name up to 0.985. Doing the same for the address, the F1 score is increased up to 0.869. Neither of these numbers is 1 despite the Levenshtein distance tolerance, which means that additional information is being lost during the token creation and labeling.
Address
The address is the data field that, together with the product list, gave the worst results, which can partly be attributed to the nature of the data field itself. Table A.1 shows the differences in length and number of words for the different data fields. The average length of the address in the test dataset is around 31 characters and the average number of words is 4.86. This means that the LSTM has to classify 31 characters correctly on average, compared to 3 for the currency or 9.65 for the date. The margin of error is much greater, both for the OCR engine making a mistake while reading a word and for the model misclassifying some part of the address. This can also be seen in the oracle model's score, which is only 0.574 in F1 despite knowing all the correct classes. The same applies to BERT and GCN, which both have to correctly classify 4.86 words on average. If one of these words is missed, the result is classified as incorrect even if the rest of the words are correctly extracted.
Products
The product list gave the worst results across all models. The main reason behind this is that a single product actually consists of 3 different fields: product name, product price, and product amount. All three of these fields have to match in order for a single product to be counted as correctly extracted. The average number of words in a product name is 2.51 and the average length is 15.52 characters. This, together with the product price and product amount, makes the margin of error much bigger compared to the other data fields. As explained in section 5.1, even a small error in the OCR gets bigger as it propagates through steps 3, 4, 5, and 6 in figure 1.1. The oracle model achieves an F1 score of only 0.404. Even when 3 steps of Levenshtein distance for the name are allowed, the F1 score reaches only 0.515. Another challenge for the products is the final data extraction at step 6 from the result given by the models. At this step, the product name, product price, and product amount that belong to the same product must be grouped together. If some of these are missed, the whole product list might be corrupted. This can be seen in table A.7: the prices for the products given by LSTM are all shifted by one and thus all incorrect. The best performing model here is GCN with an F1 score of 0.067. The best performing model overall, BERT, performed worst here with a score of 0.027.
BERT was the model that gave the best score among the machine learning
methods. It showed that having the pre-trained knowledge about the language
is more important than knowing the underlying geometrical positions, as GCN does. The experiment with different amounts of synthetic data showed that the model is very good at generalizing when more data is provided. Figure 4.2 shows that BERT tends to overfit when the dataset is small. Increasing the size of the data reduced the validation loss by a huge margin. The synthetic10000 dataset was able to achieve a validation loss nearly as low as the training loss. Table 4.7 shows the true potential of the BERT model. The micro average F1 score of BERT on the token classification task is 0.961, much greater than the final micro average F1 score of 0.455. This shows that a lot of the error is introduced by the OCR and the labeling algorithm, rather than by the BERT model itself.
The GCN model performed much worse than in the work by Lohani et al. [24]. There are two main reasons for this. Firstly, they used a dataset of invoices, which tend to be much more structured and less diverse than receipts. Secondly, they used 28 different classes compared to only 10 used in this thesis. This means that a much greater percentage of the nodes is marked with class 'O' compared to the work by Lohani et al. [24]. Table A.2 shows the probability of a randomly chosen node with a given label being connected to nodes with other labels. It shows that the graph is very sparse in the sense that most of the nodes are labeled with the label 'O'. The probability of a date node being connected to something other than 'O' is less than 10%, which means that GCN is not able to utilize its advantage of knowing the neighboring nodes, but rather has to focus on the node value. Similarly low probabilities are noticed for the tax rate, currency, and total price. The only data fields where the probability of a connection with the label 'O' is low are product name, price, and amount. There exist more connections between these three fields, so GCN is able to utilize its propagation rule much better. The product list is the only field where GCN managed to gain an advantage over the other two models.
The LSTM model proved to be much better on the numeric data fields compared to the textual ones, which is quite reasonable considering the complexity of the model. The experiment with the different synthetic datasets showed that the size of the data does not have as big an impact as it does on BERT. The model that performed the best is the one trained on synthetic1000, but the difference is not that significant, meaning that the problem is not overfitting but rather that the underlying function is too complex to be learned by this model.
the ideal world, this would mean that the human resources that are now spent on this could be used elsewhere. However, some companies might decide that these human resources are not needed anymore, and it could lead to a loss of jobs. This work can also be extended to create a tool for helping a person track their personal finances by, for example, taking a photo of each receipt and saving the extracted information in digital form. This would mean that the information would become easily accessible to the individual and could have a positive impact on their finances. However, if this would mean using a third-party app, then there is a huge potential risk for exploitation. By taking photos of receipts, one is giving out all the information about one's shopping habits. In addition to that, based on the information that can be extracted from receipts, one could possibly infer other habits, as well as the home and work address and possibly even the income level. All this information is highly valued today.
5.5 Sustainability
Work by Sarenmalm [29] investigates the environmental impact of the usage of paper receipts in Sweden. It showed that receipts are often coated with chemicals hazardous to humans and the environment, that they are non-recyclable due to their high contamination risk, and that around 60000 trees are used annually to create the 1.5 billion paper receipts that are issued in Sweden. The same work also assesses a possible substitute for paper receipts in the form of digital ones. As mentioned in the previous section, the work in this thesis can be extended for usage on an individual level. If such a tool were to become popular, it could potentially slow down or prevent the transition toward digital receipts, which would leave a big environmental footprint.
Chapter 6
Conclusions and Future Work
6.1 Conclusions
In this thesis, three different machine learning algorithms were compared on the task of information extraction from receipt text. The values that were extracted are vendor name, date, address, total price, tax rate, currency, and the product list. The experiment showed that the best performing machine learning model is BERT, achieving 0.455 in micro average F1 score, followed by BiLSTM with a micro average of 0.278. The model with the lowest micro average was GCN, with a score of 0.167. The BERT model had the best F1 score among the machine learning models on all individual fields except the product list, where GCN was better. It was able to outperform the rule-based model on vendor name and tax rate extraction. However, the simple rule-based model was better on all other data fields and had a better micro average F1 score of 0.515 as well. Several reasons for this were identified. One of the main reasons is the OCR error that propagates through the rest of the steps and causes imperfections in the data labeling heuristic. Another reason is that most of the data fields in receipts have a well-defined format and the simple rules were sufficient to extract them. Although the BERT model did not manage to beat the rule-based model, it showed great potential and would be interesting to look at in future work.
Bibliography
[1] Abigail J Sellen and Richard HR Harper. The myth of the paperless office.
MIT press, 2003.
[2] Bertin Klein, Stevan Agne, and Andreas Dengel. Results of a study on
invoice-reading systems in Germany. In International workshop on doc-
ument analysis systems, pages 451–462. Springer, 2004.
[3] Gobinda G Chowdhury. Natural language processing. Annual review of
information science and technology, 37(1):51–89, 2003.
[4] Archana Goyal, Vishal Gupta, and Manish Kumar. Recent named entity
recognition and classification techniques: a systematic review. Computer
Science Review, 29:21–43, 2018.
[5] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning
representations by back-propagating errors. Nature, 323(6088):533–536,
1986.
[6] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.
Neural computation, 9(8):1735–1780, 1997.
[7] Understanding LSTM Networks – colah's blog. https://colah.github.io/posts/2015-08-Understanding-LSTMs/. (Accessed on 05/04/2020).
[8] Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao,
and Bo Xu. Attention-based bidirectional long short-term memory net-
works for relation classification. In Proceedings of the 54th annual meet-
ing of the association for computational linguistics (volume 2: Short pa-
pers), pages 207–212, 2016.
[9] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence
learning with neural networks. In Advances in neural information pro-
cessing systems, pages 3104–3112, 2014.
[10] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bah-
danau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning
phrase representations using RNN encoder-decoder for statistical machine
translation. arXiv preprint arXiv:1406.1078, 2014.
[12] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural ma-
chine translation by jointly learning to align and translate. arXiv preprint
arXiv:1409.0473, 2014.
[13] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention
is all you need. In Advances in neural information processing systems,
pages 5998–6008, 2017.
[14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
BERT: Pre-training of deep bidirectional transformers for language under-
standing. arXiv preprint arXiv:1810.04805, 2018.
[23] Rasmus Berg Palm, Ole Winther, and Florian Laws. Cloudscan-a
configuration-free invoice analysis system using recurrent neural net-
works. In 2017 14th IAPR International Conference on Document Anal-
ysis and Recognition (ICDAR), volume 1, pages 406–413. IEEE, 2017.
[25] Rinon Gal, Nimrod Morag, and Roy Shilkrot. Visual-linguistic methods
for receipt field recognition. In C. V. Jawahar, Hongdong Li, Greg Mori,
and Konrad Schindler, editors, Computer Vision – ACCV 2018, pages
542–557, Cham, 2019. Springer International Publishing. ISBN 978-3-
030-20890-5.
[29] Isabel Sarenmalm. Would you like your receipt?: Sustainability perspec-
tives of consumer paper receipts, 2016.
Appendix A
In this appendix, a sample of the data and the data extraction results from the different models are presented.
Figure A.3: Data extraction results for bottom left receipt in figure A.1.
Figure A.4: Data extraction results for top right receipt in figure A.1.
Figure A.5: Data extraction results for bottom right receipt in figure A.1.
Figure A.6: Data extraction results for top left receipt in figure A.1.
Figure A.7: Products list results for top left receipt in figure A.1.
Figure A.9: An example of the receipt from the dataset with an ambiguous
total price.
Table A.1: Average length and average word number for different ground
truth classes in the test dataset.