This is the slide deck for the paper “OCR-free Document Understanding Transformer,”
OCR-free Document Understanding Transformer
Geewook Kim1*, Teakgyu Hong4, Moonbin Yim2, Jeongyeon Nam1, Jinyoung Park5, Jinyeong Yim6, Wonseok Hwang7, Sangdoo Yun3, Dongyoon Han3, Seunghyun Park1
* [email protected]
1NAVER CLOVA 2NAVER Search 3NAVER AI Lab 4Upstage 5Tmax 6Google 7LBox
Slide
which will be presented at ECCV 2022.
OCR-free Document Understanding Transformer
Geewook Kim1*, Teakgyu Hong4, Moonbin Yim2, Jeongyeon Nam1, Jinyoung Park5, Jinyeong Yim6, Wonseok Hwang7, Sangdoo Yun3, Dongyoon Han3, Seunghyun Park1
* [email protected]
1NAVER CLOVA 2NAVER Search 3NAVER AI Lab 4Upstage 5Tmax 6Google 7LBox
Slide
Agenda
This is the table of contents. Let’s start with the Introduction.
Slide
Visual Document Understanding (VDU)
VDU aims to extract useful information from a document image. For example,
VDU Model
Useful Information
Slide
Example 1: Document Classification
A document classifier aims to extract category information from the image.
VDU Model
{ "class": "receipt" }
Slide
Example 2: Document Parsing
For another example, a document parser aims to extract data in a structured format,
such as JSON or XML, that contains the full information of the document.
VDU Model
{ "menu": [
{
"nm": "3002-Kyoto Choco Mochi",
"unitprice": "14.000",
"cnt": "x2",
"price": "28.000"
}, … }
Slide
Conventional VDU Model
Here, we show a representative pipeline of visual document parsing.
Input
Output
{ "items": [
{
"name": "3002-Kyoto Choco Mochi",
"count": 2,
"priceInfo": {
"unitPrice": 14000,
"price": 28000
}
}, {
"name": "1001 - Choco Bun",
"count": 1,
"priceInfo": {
"unitPrice": 22000
"price": 22000
}
}, ...
],
"total": [ {
"menuqty_cnt": 4,
"total_price": 50000
}
]
}
≈
Slide
Conventional VDU Model
Most conventional VDU methods share a similar pipeline.
…
…
Input
Output
{ "items": [
{
"name": "3002-Kyoto Choco Mochi",
"count": 2,
"priceInfo": {
"unitPrice": 14000,
"price": 28000
}
}, {
"name": "1001 - Choco Bun",
"count": 1,
"priceInfo": {
"unitPrice": 22000
"price": 22000
}
}, ...
],
"total": [ {
"menuqty_cnt": 4,
"total_price": 50000
}
]
}
≈
Slide
Conventional VDU Model
First, a text detector finds all text boxes.
Detection!
Slide
Conventional VDU Model
And then, a text recognizer reads the texts in the extracted boxes.
Detection! Recognition!
{ "words": [ {
"id": 1,
"bbox":[[360,2048],...,[355,2127]],
"text": "3002-Kyoto"
}, {
"id": 2,
"bbox":[[801,2074],...,[801,2139]],
"text": "Choco"
}, {
"id": 3,
"bbox":[[1035,2074],...,[1035,2147]],
"text": "Mochi"
}, {
"id": 4,
"bbox":[[761,2172],...,[761,2253]],
"text": "14.000"
}, …, {
"id": 22,
"bbox":[[1573,3030],...,[1571,3126]],
"text": "50.000"
}
]
}
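For intuition, here is a minimal sketch of producing such word boxes with one public OCR engine (Tesseract via pytesseract, as an illustrative stand-in; the image path is a placeholder):

import pytesseract
from PIL import Image

# Illustrative sketch: run an off-the-shelf OCR engine (Tesseract) to get
# word-level boxes and texts like the output shown above.
data = pytesseract.image_to_data(
    Image.open("receipt.png"),  # placeholder path
    output_type=pytesseract.Output.DICT,
)
words = [
    {"bbox": (data["left"][i], data["top"][i],
              data["width"][i], data["height"][i]),
     "text": data["text"][i]}
    for i in range(len(data["text"]))
    if data["text"][i].strip()  # drop empty detections
]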
Slide
Conventional VDU Model
These two parts together are called Optical Character Recognition (OCR).
Detection! Recognition!
{ "words": [ {
"id": 1,
"bbox":[[360,2048],...,[355,2127]],
"text": "3002-Kyoto"
}, {
"id": 2,
"bbox":[[801,2074],...,[801,2139]],
"text": "Choco"
}, {
"id": 3,
"bbox":[[1035,2074],...,[1035,2147]],
"text": "Mochi"
}, {
"id": 4,
"bbox":[[761,2172],...,[761,2253]],
"text": "14.000"
}, …, {
"id": 22,
"bbox":[[1573,3030],...,[1571,3126]],
"text": "50.000"
}
]
}
OCR
Slide
Conventional VDU Model
Finally, the OCR results are fed to a subsequent module to extract the full information of the document.
{ "items": [
{
"name": "3002-Kyoto Choco Mochi",
"count": 2,
"priceInfo": {
"unitPrice": 14000,
"price": 28000
}
}, {
"name": "1001 - Choco Bun",
"count": 1,
"priceInfo": {
"unitPrice": 22000
"price": 22000
}
}, ...
],
"total": [ {
"menuqty_cnt": 4,
"total_price": 50000
}
]
}
≈
{ "words": [ {
"id": 1,
"bbox":[[360,2048],...,[355,2127]],
"text": "3002-Kyoto"
}, {
"id": 2,
"bbox":[[801,2074],...,[801,2139]],
"text": "Choco"
}, {
"id": 3,
"bbox":[[1035,2074],...,[1035,2147]],
"text": "Mochi"
}, {
"id": 4,
"bbox":[[761,2172],...,[761,2253]],
"text": "14.000"
}, …, {
"id": 22,
"bbox":[[1573,3030],...,[1571,3126]],
"text": "50.000"
}
]
}
Detection! Recognition! Parsing!
OCR
Slide
Conventional VDU Model:
Details of the Parsing Stage
For example, in most methods, BIO-tags are predicted by a backbone.
… 3002-Kyoto Choco Mochi 14, 000 …
B-name I-name I-name B-price I-price
Transformer Backbone (BERT, LayoutLM, …)
(Off-the-shelf) OCR Engine
Slide
Conventional VDU Model:
Details of the Parsing Stage
Then, the tag sequence is converted into a final data format (e.g., JSON).
… 3002-Kyoto Choco Mochi 14, 000 …
B-name I-name I-name B-price I-price
Transformer Backbone (BERT, LayoutLM, …)
(Off-the-shelf) OCR Engine
{ "menu": [
{
"nm": "3002-Kyoto Choco Mochi",
"unitprice": "14.000",
"cnt": "x2",
"price": "28.000"
}, … }
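As a rough sketch of this conversion, one might group the predicted BIO tags into fields as below (the tag names and grouping rules here are illustrative assumptions, not the exact post-processing of any particular method):

# Minimal sketch: group (token, BIO-tag) pairs into labeled fields.
def bio_to_fields(tokens, tags):
    fields = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                 # a new field begins
            fields.append({"field": tag[2:], "value": token})
        elif tag.startswith("I-") and fields and fields[-1]["field"] == tag[2:]:
            fields[-1]["value"] += " " + token   # continue the current field
    return fields

tokens = ["3002-Kyoto", "Choco", "Mochi", "14,000"]
tags = ["B-name", "I-name", "I-name", "B-unitprice"]
print(bio_to_fields(tokens, tags))
# [{'field': 'name', 'value': '3002-Kyoto Choco Mochi'},
#  {'field': 'unitprice', 'value': '14,000'}]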
Slide
Conventional VDU Model: Overview
Overall, the conventional VDU methods can be summarized as shown in the figure.
Input Image
(Off-the-shelf) OCR Engine
Backbone
(BERT-like)
BIO-Tags / Answer Token Span / etc
Output
Slide
Conventional VDU Model: Overview
For each task, the backbone predicts a desired set of tokens/tags.
Input Image
(Off-the-shelf) OCR Engine
Backbone
(BERT-like)
Output
BIO-Tags / Answer Token Span / etc
Slide
Conventional VDU Model: Overview
For example, in order to conduct VQA, a span is predicted over the OCR-ed text tokens.
Input Image
(Off-the-shelf) OCR Engine
“3002-Kyoto Choco Mochi”
… 3002-Kyoto Choco Mochi 14, …
START END
Transformer Backbone (BERT, LayoutLM, …)
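For intuition, a minimal PyTorch sketch of such span prediction (the random tensor stands in for a BERT-like encoder output; the shapes and linear span head are generic assumptions about extractive QA, not the exact baseline):

import torch
import torch.nn as nn

hidden = torch.randn(1, 12, 768)                 # stand-in encoder output for 12 OCR tokens
span_head = nn.Linear(768, 2)                    # start/end logits per token
start_logits, end_logits = span_head(hidden).split(1, dim=-1)
start = int(start_logits.squeeze(-1).argmax())   # predicted START index
end = int(end_logits.squeeze(-1).argmax())       # predicted END index
# answer = " ".join(ocr_tokens[start : end + 1])  # e.g. "3002-Kyoto Choco Mochi"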
Slide
Conventional VDU Model: Overview
Although such OCR-based approaches have shown promising performance,
they have several problems induced by OCR.
Input Image
(Off-the-shelf) OCR Engine
Backbone
(BERT-like)
BIO-Tags / Answer Token Span / etc
Output
Slide
Conventional VDU Model: Overview
First, OCR increases computational costs.
Input Image
(Off-the-shelf) OCR Engine
Backbone
(BERT-like)
BIO-Tags / Answer Token Span / etc
Output
Slide
Conventional VDU Model: Overview
Second, OCR makes it hard to handle various languages and document types.
Input Image
(Off-the-shelf) OCR Engine
Backbone
(BERT-like)
BIO-Tags / Answer Token Span / etc
Output
Slide
Conventional VDU Model: Overview
Lastly, OCR errors propagate to the subsequent process.
Input Image
(Off-the-shelf) OCR Engine
Backbone
(BERT-like)
BIO-Tags / Answer Token Span / etc
Output
Slide
Proposal: OCR-free Approach
To address the issues, we introduce a novel OCR-free VDU model, Donut 🍩.
Donut 🍩 (End-to-end Model)
Token Sequence
Output
Input Image
Slide
AS-IS vs. TO-BE
Without OCR, Donut directly processes the input image and produces an output that contains the desired types of information.
Donut 🍩 (End-to-end Model)
Token Sequence
Output
Input Image
Input Image
(Off-the-shelf) OCR Engine
Backbone
(BERT-like)
BIO-Tags / Answer Token Span / etc
Output
Slide
Overview
This is the overview of Donut.
<vqa><question>what is the price
of choco mochi?</question><answer>
Converted JSON
transformer encoder
Input Image and Prompt
transformer decoder
Donut 🍩
<classification>
<parsing>
<class>receipt</class>
</classification>
14,000</answer></vqa>
<item><name>3002-Kyoto Choco Mochi</name>・・・ </parsing>
{ "items": [{"name": "3002-Kyoto Choco Mochi",
"count": 2,
"unitprice": 14000, …}], … }
Output Sequence
{ "class":"receipt" }
{ "question": "what is the price of choco mochi?",
"answer": "14,000" }
Slide
Overview
The visual encoder maps the input image into a set of embeddings.
<vqa><question>what is the price
of choco mochi?</question><answer>
Converted JSON
transformer encoder
Input Image and Prompt
transformer decoder
Donut 🍩
<classification>
<parsing>
<class>receipt</class>
</classification>
14,000</answer></vqa>
<item><name>3002-Kyoto Choco Mochi</name>・・・ </parsing>
{ "items": [{"name": "3002-Kyoto Choco Mochi",
"count": 2,
"unitprice": 14000, …}], … }
Output Sequence
{ "class":"receipt" }
{ "question": "what is the price of choco mochi?",
"answer": "14,000" }
Slide
Overview
The textual decoder processes the image embeddings and prompt tokens.
<vqa><question>what is the price
of choco mochi?</question><answer>
Converted JSON
transformer encoder
Input Image and Prompt
transformer decoder
Donut 🍩
<classification>
<parsing>
<class>receipt</class>
</classification>
14,000</answer></vqa>
<item><name>3002-Kyoto Choco Mochi</name>・・・ </parsing>
{ "items": [{"name": "3002-Kyoto Choco Mochi",
"count": 2,
"unitprice": 14000, …}], … }
Output Sequence
{ "class":"receipt" }
{ "question": "what is the price of choco mochi?",
"answer": "14,000" }
Slide
Overview
Then, the decoder outputs token sequences,
<vqa><question>what is the price
of choco mochi?</question><answer>
Converted JSON
transformer encoder
Input Image and Prompt
transformer decoder
Donut 🍩
<classification>
<parsing>
<class>receipt</class>
</classification>
14,000</answer></vqa>
<item><name>3002-Kyoto Choco Mochi</name>・・・ </parsing>
{ "items": [{"name": "3002-Kyoto Choco Mochi",
"count": 2,
"unitprice": 14000, …}], … }
Output Sequence
{ "class":"receipt" }
{ "question": "what is the price of choco mochi?",
"answer": "14,000" }
Slide
Overview
that can be converted into a desired data format, such as JSON.
<vqa><question>what is the price
of choco mochi?</question><answer>
Converted JSON
transformer encoder
Input Image and Prompt
transformer decoder
Donut 🍩
<classification>
<parsing>
<class>receipt</class>
</classification>
14,000</answer></vqa>
<item><name>3002-Kyoto Choco Mochi</name>・・・ </parsing>
{ "items": [{"name": "3002-Kyoto Choco Mochi",
"count": 2,
"unitprice": 14000, …}], … }
Output Sequence
{ "class":"receipt" }
{ "question": "what is the price of choco mochi?",
"answer": "14,000" }
Slide
Overview: Model Architecture
Swin Transformer and BART are used as the encoder and decoder, respectively. More details can also be found in the manuscript.
<vqa><question>what is the price
of choco mochi?</question><answer>
Converted JSON
transformer encoder
Input Image and Prompt
transformer decoder
Donut 🍩
<classification>
<parsing>
<class>receipt</class>
</classification>
14,000</answer></vqa>
<item><name>3002-Kyoto Choco Mochi</name>・・・ </parsing>
{ "items": [{"name": "3002-Kyoto Choco Mochi",
"count": 2,
"unitprice": 14000, …}], … }
Output Sequence
{ "class":"receipt" }
{ "question": "what is the price of choco mochi?",
"answer": "14,000" }
Slide
Pre-training Task
To train Donut, we propose a simple pre-training task.
Donut 🍩 (End-to-end Model)
In terms of what that can teach, in what from …
Slide
Pre-training Task
The objective is to read all the texts from top-left to bottom-right. This task can be interpreted as a pseudo-OCR task.
Donut 🍩 (End-to-end Model)
In terms of what that can teach, in what from …
Slide
Training Strategy: Teacher-forcing Scheme
Following the original Transformer, the model is trained with a teacher-forcing scheme. More details can also be found in the manuscript.
Model
…
…
…
This can be interpreted as token classification at each step.
Decoder input: in, terms, of, what, …
Prediction target (shifted): terms, of, what, …
Minimize Cross Entropy
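A minimal PyTorch sketch of this objective (the token ids, shapes, and random logits are illustrative stand-ins):

import torch
import torch.nn.functional as F

vocab_size = 50000
target = torch.tensor([[11, 22, 33, 44]])       # e.g. ids of "in terms of what"
decoder_input = target[:, :-1]                  # in, terms, of (fed with teacher forcing)
labels = target[:, 1:]                          # terms, of, what (per-step classes)
logits = torch.randn(1, 3, vocab_size)          # stand-in for model(decoder_input)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), labels.reshape(-1))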
Slide
SynthDoG 🐶: Synthetic Document Generator
For the pre-training of Donut, we also present a data generator, SynthDoG 🐶.
Slide
SynthDoG 🐶: Synthetic Document Generator
SynthDoG alleviates the dependency on large-scale real document images and enables extension to a multilingual setting.
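As a heavily simplified conceptual sketch of this kind of generation (the real SynthDoG also samples layouts, fonts, backgrounds, and photometric noise):

from PIL import Image, ImageDraw, ImageFont

# Render text onto a background and keep the string as the reading target.
image = Image.new("RGB", (640, 480), "white")
draw = ImageDraw.Draw(image)
text = "3002-Kyoto Choco Mochi 14,000"          # placeholder content
draw.text((40, 40), text, fill="black", font=ImageFont.load_default())
image.save("synthetic_doc.png")                 # training image
ground_truth = text                             # paired ground-truth text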
Slide
Pre-training Task
After the model learns “how to read”,
Donut 🍩 (End-to-end Model)
In terms of what that can teach, in what from …
Slide
Model Fine-tuning and Inference Overview
in fine-tuning, we teach the model “how to understand”.
Slide
Model Fine-tuning and Inference Overview
The prediction target is set to a desired downstream token sequence, including some special tokens.
Slide
Model Fine-tuning and Inference Overview
At inference, the token predicted at the last step is fed back as input to the next step.
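A greedy-decoding sketch of this loop (the model call and prompt ids are placeholders, not the exact API):

import torch

def greedy_decode(model, image_embeds, prompt_ids, eos_id, max_len=128):
    ids = list(prompt_ids)                      # e.g. ids of "<parsing>"
    for _ in range(max_len):
        # `model` stands in for the decoder (cross-attending to the image).
        logits = model(image_embeds, torch.tensor([ids]))  # (1, len, vocab)
        next_id = int(logits[0, -1].argmax())   # most probable next token
        ids.append(next_id)                     # fed back at the next step
        if next_id == eos_id:
            break
    return ids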
Slide
Model Fine-tuning and Inference Overview
The predicted sequence is then converted into JSON format.
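A simplified sketch of this conversion (flat fields only; the actual post-processing also handles nested groups and lists):

import json
import re

def tokens_to_json(seq):
    # Map each "<key>value</key>" span in the output sequence to a dict entry.
    return {key: value for key, value in re.findall(r"<(\w+)>(.*?)</\1>", seq)}

seq = "<name>3002-Kyoto Choco Mochi</name><cnt>x2</cnt><price>28,000</price>"
print(json.dumps(tokens_to_json(seq)))
# {"name": "3002-Kyoto Choco Mochi", "cnt": "x2", "price": "28,000"}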
Slide
Experiments: Samples of Downstream Datasets
We conduct three VDU applications on six different datasets. Some samples are shown.
Slide
Experiments: Document Classification
To see whether the model can distinguish between different types of documents, we test a classification task.
Slide
Experiments: Document Classification
These are the results on the RVL-CDIP dataset. Donut achieves state-of-the-art scores with reasonable speed and efficiency.
Slide
Experiments: Document Parsing
Next, to see whether the model fully understands complex layouts and contexts, we test document parsing tasks.
Slide
Experiments: Document Parsing
For all domains, Donut showed the best accuracies with significantly faster inference speed.
Slide
Experiments: Document VQA
To further validate the capacity of the model, we test a document VQA task.
Q: What is the Extension Number as per the voucher?
A: (910) 741–0673
Slide
Experiments: Document VQA
For VQA, Donut showed a high score on handwritten documents, which are known to be challenging.
Slide
Analysis: VQA on Handwritten Documents
As can be seen, OCR errors impose an upper bound on the performance of the conventional baselines.
Slide
Analysis: VQA on Handwritten Documents
On the other hand, Donut is robust to handwritten documents.
Slide
Analysis
Next, we show some main results of our analysis on Donut.
Slide
Analysis: Pre-training Strategies
We tested several pre-training tasks. We found that the proposed task is the simplest yet most effective approach.
Slide
Analysis: Pre-training Strategies
Other tasks that impose general knowledge of images and texts on the model show little gain in the fine-tuning tasks.
Slide
Analysis: Pre-training Strategies
For the text reading task, synthetic images were enough for the CORD task.
Slide
Analysis: Pre-training Strategies
But in the DocVQA task, it was important to see real images.
Slide
Analysis: Image Backbones
Next, we study popular image classification backbones.
Slide
Analysis: Image Backbones
Overall, EfficientNetV2 and Swin Transformer outperform the others.
Slide
Analysis: Image Backbones
We chose Swin Transformer due to its high scalability and performance.
Slide
Analysis: Input Resolution
We observed that Donut's performance improves rapidly as we increase the input size.
Slide
Analysis: Input Resolution
This is clearer on DocVQA, where the images are larger and contain many tiny texts.
Slide
Analysis: OCR Engines
Next, we study the effects of OCR engines on the traditional baselines.
We test four widely used public engines.
Slide
Analysis: OCR Engines
We observed that the scores heavily rely on the OCR engine.
Slide
Analysis: OCR Engines
This shows why we need an OCR-free method.
Slide
Conclusions
So far, we have introduced our new OCR-free method, Donut.
More experiments and analysis can be found in the manuscript.
Donut 🍩 (End-to-end Model)
Token Sequence
Output
Input Image
Input Image
(Off-the-shelf) OCR Engine
Backbone
(BERT-like)
BIO-Tags / Answer Token Span / etc
Output
Slide
Conclusions
The proposed method, Donut, directly maps an input document image �into a desired structured output.
Donut 🍩 (End-to-end Model)
Token Sequence
Output
Input Image
Slide
Conclusions
Unlike conventional methods, Donut does not depend on OCR and can easily be trained in an end-to-end fashion.
Donut 🍩 (End-to-end Model)
Token Sequence
Output
Input Image
Slide
Conclusions
We also propose a synthetic document image generator, SynthDoG.
Slide
Conclusions
Through extensive experiments and analyses, we show the high performance and cost-effectiveness of Donut.
Slide
Conclusions
We believe our work can easily be extended to other domains/tasks regarding document understanding.
Slide
If you have any questions, please feel free to contact me :) Thank you!
Thank you!
Contact: [email protected]
Slide