1 of 74

This is the title slide for the paper “OCR-free Document Understanding Transformer”,

OCR-free Document Understanding Transformer

Geewook Kim1*, Teakgyu Hong4, Moonbin Yim2, Jeongyeon Nam1, Jinyoung Park5, Jinyeong Yim6, Wonseok Hwang7, Sangdoo Yun3, Dongyoon Han3, Seunghyun Park1*
[email protected]
1NAVER CLOVA  2NAVER Search  3NAVER AI Lab  4Upstage  5Tmax  6Google  7LBox

Slide

2 of 74

which will be presented at ECCV 2022.

OCR-free Document Understanding Transformer

Geewook Kim1*, Teakgyu Hong4, Moonbin Yim2, Jeongyeon Nam1, Jinyoung Park5, Jinyeong Yim6, Wonseok Hwang7, Sangdoo Yun3, Dongyoon Han3, Seunghyun Park1*
[email protected]
1NAVER CLOVA  2NAVER Search  3NAVER AI Lab  4Upstage  5Tmax  6Google  7LBox

Slide

3 of 74

Agenda

  1. Introduction: Background and Motivation
  2. Proposal: Document Understanding Transformer (Donut 🍩)
  3. Experiments and Analyses
  4. Conclusions

This is the table of contents. Let’s start from the Introduction.

Slide

4 of 74

Visual Document Understanding (VDU)

VDU aims to extract useful information from the document image. For example,

VDU Model

Useful Information

Slide

5 of 74

Example 1: Document Classification

A document classifier aims to extract category information from the image.

VDU Model

{ "class": "receipt" }

Slide

6 of 74

Example 2: Document Parsing

As another example, a document parser aims to extract data in a structured format,

such as JSON or XML, that contains the full information of the document.

VDU Model

{ "menu": [

{

"nm": "3002-Kyoto Choco Mochi",

"unitprice": "14.000",

"cnt": "x2",

"price": "28.000"

}, … }

Slide

7 of 74

Conventional VDU Model

Here, we show a representative pipeline of visual document parsing.

Input

Output

{ "items": [

{

"name": "3002-Kyoto Choco Mochi",

"count": 2,

"priceInfo": {

"unitPrice": 14000,

"price": 28000

}

}, {

"name": "1001 - Choco Bun",

"count": 1,

"priceInfo": {

"unitPrice": 22000

"price": 22000

}

}, ...

],

"total": [ {

"menuqty_cnt": 4,

"total_price": 50000

}

]

}

Slide

8 of 74

Conventional VDU Model

Most conventional VDU methods share a similar pipeline.

Input

Output

{ "items": [

{

"name": "3002-Kyoto Choco Mochi",

"count": 2,

"priceInfo": {

"unitPrice": 14000,

"price": 28000

}

}, {

"name": "1001 - Choco Bun",

"count": 1,

"priceInfo": {

"unitPrice": 22000

"price": 22000

}

}, ...

],

"total": [ {

"menuqty_cnt": 4,

"total_price": 50000

}

]

}

Slide

9 of 74

Conventional VDU Model

First, a text detector finds all text boxes.

Detection!

Slide

10 of 74

Conventional VDU Model

And then, a text recognizer reads all texts in the extracted boxes.

Detection! Recognition!

{ "words": [ {

"id": 1,

"bbox":[[360,2048],...,[355,2127]],

"text": "3002-Kyoto"

}, {

"id": 2,

"bbox":[[801,2074],...,[801,2139]],

"text": "Choco"

}, {

"id": 3,

"bbox":[[1035,2074],...,[1035,2147]],

"text": "Mochi"

}, {

"id": 4,

"bbox":[[761,2172],...,[761,2253]],

"text": "14.000"

}, …, {

"id": 22,

"bbox":[[1573,3030],...,[1571,3126]],

"text": "50.000"

}

]

}

Slide

11 of 74

Conventional VDU Model

These two parts are also called Optical Character Recognition (OCR).

Detection! Recognition!

{ "words": [ {

"id": 1,

"bbox":[[360,2048],...,[355,2127]],

"text": "3002-Kyoto"

}, {

"id": 2,

"bbox":[[801,2074],...,[801,2139]],

"text": "Choco"

}, {

"id": 3,

"bbox":[[1035,2074],...,[1035,2147]],

"text": "Mochi"

}, {

"id": 4,

"bbox":[[761,2172],...,[761,2253]],

"text": "14.000"

}, …, {

"id": 22,

"bbox":[[1573,3030],...,[1571,3126]],

"text": "50.000"

}

]

}

OCR

Slide

12 of 74

Conventional VDU Model

Finally, the OCR results are fed to a subsequent module to extract the full information of the document.

{ "items": [

{

"name": "3002-Kyoto Choco Mochi",

"count": 2,

"priceInfo": {

"unitPrice": 14000,

"price": 28000

}

}, {

"name": "1001 - Choco Bun",

"count": 1,

"priceInfo": {

"unitPrice": 22000

"price": 22000

}

}, ...

],

"total": [ {

"menuqty_cnt": 4,

"total_price": 50000

}

]

}

{ "words": [ {

"id": 1,

"bbox":[[360,2048],...,[355,2127]],

"text": "3002-Kyoto"

}, {

"id": 2,

"bbox":[[801,2074],...,[801,2139]],

"text": "Choco"

}, {

"id": 3,

"bbox":[[1035,2074],...,[1035,2147]],

"text": "Mochi"

}, {

"id": 4,

"bbox":[[761,2172],...,[761,2253]],

"text": "14.000"

}, …, {

"id": 22,

"bbox":[[1573,3030],...,[1571,3126]],

"text": "50.000"

}

]

}

Detection! Recognition! Parsing!

OCR

Slide

13 of 74

Conventional VDU Model:

Details of the Parsing Stage

For example, in most methods, BIO-tags are predicted by a backbone.

3002-Kyoto Choco Mochi 14, 000

B-name I-name I-name B-price I-price

Transformer Backbone (BERT, LayoutLM, …)

(Off-the-shelf) OCR Engine

Slide

14 of 74

Conventional VDU Model:

Details of the Parsing Stage

Then, the tag sequence is converted into a final data format (e.g., JSON).

3002-Kyoto Choco Mochi 14, 000

B-name I-name I-name B-price I-price

Transformer Backbone (BERT, LayoutLM, …)

(Off-the-shelf) OCR Engine

{ "menu": [

{

"nm": "3002-Kyoto Choco Mochi",

"unitprice": "14.000",

"cnt": "x2",

"price": "28.000"

}, … }
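To make this conversion step concrete, here is a minimal Python sketch of how BIO-tagged OCR tokens could be grouped into fields. The tag names and grouping rules are illustrative, not the authors' actual post-processing code.

def bio_tags_to_fields(tokens, tags):
    # Collect consecutive B-/I- tagged tokens into {field: text} entries.
    fields = []
    current_field, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):  # a new field begins
            if current_field is not None:
                fields.append({current_field: " ".join(current_tokens)})
            current_field, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_field == tag[2:]:
            current_tokens.append(token)  # continue the current field
        else:  # an "O" tag or a mismatched tag closes the current field
            if current_field is not None:
                fields.append({current_field: " ".join(current_tokens)})
            current_field, current_tokens = None, []
    if current_field is not None:
        fields.append({current_field: " ".join(current_tokens)})
    return fields

tokens = ["3002-Kyoto", "Choco", "Mochi", "14,", "000"]
tags = ["B-name", "I-name", "I-name", "B-price", "I-price"]
print(bio_tags_to_fields(tokens, tags))
# [{'name': '3002-Kyoto Choco Mochi'}, {'price': '14, 000'}]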

Slide

15 of 74

Conventional VDU Model: Overview

Overall, the conventional VDU methods can be summarized as shown in the figure.

Input Image

(Off-the-shelf) OCR Engine

Backbone

(BERT-like)

BIO-Tags / Answer Token Span / etc

Output

Slide

16 of 74

Conventional VDU Model: Overview

For each task, the backbone predicts a desired set of tokens/tags.

Input Image

(Off-the-shelf) OCR Engine

Backbone

(BERT-like)

Output

BIO-Tags / Answer Token Span / etc

Slide

17 of 74

Conventional VDU Model: Overview

For example, in order to conduct VQA, a span is predicted over the OCR-ed text tokens.

Input Image

(Off-the-shelf) OCR Engine

“3002-Kyoto Choco Mochi”

… 3002-Kyoto Choco Mochi 14, …

START END

Transformer Backbone (BERT, LayoutLM, …)
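A minimal sketch of this span-prediction step, assuming the backbone has already produced one start score and one end score per OCR token (the token list and scores below are illustrative):

def extract_answer_span(ocr_tokens, start_scores, end_scores):
    # Pick the highest-scoring start token, then the highest-scoring end token at or after it.
    start = max(range(len(ocr_tokens)), key=lambda i: start_scores[i])
    end = max(range(start, len(ocr_tokens)), key=lambda i: end_scores[i])
    return " ".join(ocr_tokens[start:end + 1])

ocr_tokens = ["...", "3002-Kyoto", "Choco", "Mochi", "14,000"]
start_scores = [0.0, 0.9, 0.1, 0.0, 0.0]
end_scores = [0.0, 0.0, 0.1, 0.8, 0.1]
print(extract_answer_span(ocr_tokens, start_scores, end_scores))
# "3002-Kyoto Choco Mochi"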

Slide

18 of 74

Conventional VDU Model: Overview

Although such OCR-based approaches have shown promising performance,

they have several problems induced by OCR.

Input Image

(Off-the-shelf) OCR Engine

Backbone

(BERT-like)

BIO-Tags / Answer Token Span / etc

Output

Slide

19 of 74

Conventional VDU Model: Overview

Although such OCR-based approaches have shown promising performance,

they have several problems induced by OCR.

Input Image

(Off-the-shelf) OCR Engine

Backbone

(BERT-like)

BIO-Tags / Answer Token Span / etc

Output

Slide

20 of 74

Conventional VDU Model: Overview

First, OCR increases computational costs.

Input Image

(Off-the-shelf) OCR Engine

Backbone

(BERT-like)

BIO-Tags / Answer Token Span / etc

Output

  • high computational costs

Slide

21 of 74

Conventional VDU Model: Overview

Second, OCR makes it hard to handle various languages/types of documents.

Input Image

(Off-the-shelf) OCR Engine

Backbone

(BERT-like)

BIO-Tags / Answer Token Span / etc

Output

  • high computational costs
  • inflexibility of OCR across languages or document types

Slide

22 of 74

Conventional VDU Model: Overview

Lastly, OCR errors propagate to the subsequent processes.

Input Image

(Off-the-shelf) OCR Engine

Backbone

(BERT-like)

BIO-Tags / Answer Token Span / etc

Output

  • high computational costs
  • inflexibility of OCR across languages or document types
  • OCR error propagation

Slide

23 of 74

Proposal: OCR-free Approach

To address the issues, we introduce a novel OCR-free VDU model, Donut 🍩.

Donut 🍩 (End-to-end Model)

Token Sequence

Output

Input Image

Slide

24 of 74

AS-IS vs. TO-BE

Without OCR, Donut directly processes the input image and produces an output that contains the desired types of information.

Donut 🍩 (End-to-end Model)

Token Sequence

Output

Input Image

Input Image

(Off-the-shelf) OCR Engine

Backbone

(BERT-like)

BIO-Tags / Answer Token Span / etc

Output

Slide

25 of 74

Overview

This is the overview of Donut.

<vqa><question>what is the price

of choco mochi?</question><answer>

Converted JSON

transformer encoder

Input Image and Prompt

transformer decoder

Donut 🍩

<classification>

<parsing>

<class>receipt</class>

</classification>

14,000</answer></vqa>

<item><name>3002-Kyoto Choco Mochi</name>・・・ </parsing>

{ "items": [{"name": "3002-Kyoto Choco Mochi",

"count": 2,

"unitprice": 14000, …}], … }

Output Sequence

{ "class":"receipt" }

{ "question": "what is the price of choco mochi?",

"answer": "14,000" }

Slide

26 of 74

Overview

The visual encoder maps the input image into a set of embeddings.

<vqa><question>what is the price

of choco mochi?</question><answer>

Converted JSON

transformer encoder

Input Image and Prompt

transformer decoder

Donut 🍩

<classification>

<parsing>

<class>receipt</class>

</classification>

14,000</answer></vqa>

<item><name>3002-Kyoto Choco Mochi</name>・・・ </parsing>

{ "items": [{"name": "3002-Kyoto Choco Mochi",

"count": 2,

"unitprice": 14000, …}], … }

Output Sequence

{ "class":"receipt" }

{ "question": "what is the price of choco mochi?",

"answer": "14,000" }

Slide

27 of 74

Overview

The textual decoder processes the image embeddings and prompt tokens.

<vqa><question>what is the price

of choco mochi?</question><answer>

Converted JSON

transformer encoder

Input Image and Prompt

transformer decoder

Donut 🍩

<classification>

<parsing>

<class>receipt</class>

</classification>

14,000</answer></vqa>

<item><name>3002-Kyoto Choco Mochi</name>・・・ </parsing>

{ "items": [{"name": "3002-Kyoto Choco Mochi",

"count": 2,

"unitprice": 14000, …}], … }

Output Sequence

{ "class":"receipt" }

{ "question": "what is the price of choco mochi?",

"answer": "14,000" }

Slide

28 of 74

Overview

Then, the decoder outputs token sequences,

<vqa><question>what is the price

of choco mochi?</question><answer>

Converted JSON

transformer encoder

Input Image and Prompt

transformer decoder

Donut 🍩

<classification>

<parsing>

<class>receipt</class>

</classification>

14,000</answer></vqa>

<item><name>3002-Kyoto Choco Mochi</name>・・・ </parsing>

{ "items": [{"name": "3002-Kyoto Choco Mochi",

"count": 2,

"unitprice": 14000, …}], … }

Output Sequence

{ "class":"receipt" }

{ "question": "what is the price of choco mochi?",

"answer": "14,000" }

Slide

29 of 74

Overview

that can be converted into a desired data format, such as JSON.

<vqa><question>what is the price

of choco mochi?</question><answer>

Converted JSON

transformer encoder

Input Image and Prompt

transformer decoder

Donut 🍩

<classification>

<parsing>

<class>receipt</class>

</classification>

14,000</answer></vqa>

<item><name>3002-Kyoto Choco Mochi</name>・・・ </parsing>

{ "items": [{"name": "3002-Kyoto Choco Mochi",

"count": 2,

"unitprice": 14000, …}], … }

Output Sequence

{ "class":"receipt" }

{ "question": "what is the price of choco mochi?",

"answer": "14,000" }

Slide

30 of 74

Overview: Model Architecture

Swin Transformer and BART are used as the encoder and the decoder, respectively. More details can be found in the manuscript.

<vqa><question>what is the price

of choco mochi?</question><answer>

Converted JSON

transformer encoder

Input Image and Prompt

transformer decoder

Donut 🍩

<classification>

<parsing>

<class>receipt</class>

</classification>

14,000</answer></vqa>

<item><name>3002-Kyoto Choco Mochi</name>・・・ </parsing>

{ "items": [{"name": "3002-Kyoto Choco Mochi",

"count": 2,

"unitprice": 14000, …}], … }

Output Sequence

{ "class":"receipt" }

{ "question": "what is the price of choco mochi?",

"answer": "14,000" }

Slide

31 of 74

Pre-training Task

To train Donut, we propose a simple pre-training task.

Donut 🍩 (End-to-end Model)

In terms of what that can teach, in what form …

Slide

32 of 74

Pre-training Task

The objective is to read all texts from the top-left to the bottom-right. This task can be interpreted as a pseudo-OCR task.

Donut 🍩 (End-to-end Model)

In terms of what that can teach, in what form …
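A minimal sketch of how such a reading target could be built from word-level annotations, using a simplified top-left to bottom-right ordering rule (field names and coordinates are illustrative):

def build_reading_target(words):
    # words: list of {"text": str, "box": (x1, y1, x2, y2)} annotations.
    # Sort by the top y coordinate, then by the left x coordinate.
    ordered = sorted(words, key=lambda w: (w["box"][1], w["box"][0]))
    return " ".join(w["text"] for w in ordered)

words = [
    {"text": "Choco", "box": (801, 2074, 960, 2139)},
    {"text": "3002-Kyoto", "box": (360, 2048, 700, 2127)},
    {"text": "Mochi", "box": (1035, 2074, 1200, 2147)},
]
print(build_reading_target(words))
# "3002-Kyoto Choco Mochi"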

Slide

33 of 74

Training Strategy: Teacher-forcing Scheme

Following the original Transformer, the model is trained with a teacher-forcing scheme. More details can be found in the manuscript.

Model

This can be interpreted as

a token classification at each step.

in

terms

of

what

terms

of

what

Minimize Cross Entropy

Slide

34 of 74

Training Strategy: Teacher-forcing Scheme

Following the original Transformer, the model is trained with a teacher-forcing scheme. More details can be found in the manuscript.

Model

This can be interpreted as

a token classification at each step.

in

terms

of

what

terms

of

what

Minimize Cross Entropy
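A minimal sketch of one teacher-forced training step (PyTorch; the model and token ids are placeholders): the decoder receives the ground-truth previous tokens as input and is trained to predict the next token at every position with cross-entropy.

import torch.nn.functional as F

def teacher_forcing_step(model, image, target_ids, optimizer):
    decoder_input = target_ids[:, :-1]    # ground-truth tokens shifted right
    labels = target_ids[:, 1:]            # next-token targets at every position
    logits = model(image, decoder_input)  # (batch, seq_len, vocab_size)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()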

Slide

35 of 74

SynthDoG 🐶: Synthetic Document Generator

For the pre-training of Donut, we also present a data generator, SynthDoG 🐶.

Slide

36 of 74

SynthDoG 🐶: Synthetic Document Generator

SynthDoG alleviates the dependency on large-scale real document images and enables the extension to a multilingual setting.
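As a greatly simplified sketch of the idea behind such a generator (SynthDoG itself also samples layouts, fonts, and geometric/photometric noise; the paths and corpus here are placeholders):

import random
from PIL import Image, ImageDraw, ImageFont

def render_synthetic_document(background_path, corpus_lines, out_path):
    # Paste sampled corpus lines onto a background image and keep the pasted
    # text as the reading target paired with the generated image.
    page = Image.open(background_path).convert("RGB")
    draw = ImageDraw.Draw(page)
    font = ImageFont.load_default()
    y, target = 20, []
    for line in random.sample(corpus_lines, k=min(10, len(corpus_lines))):
        draw.text((20, y), line, fill=(0, 0, 0), font=font)
        target.append(line)
        y += 18
    page.save(out_path)
    return " ".join(target)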

Slide

37 of 74

Pre-training Task

After the model learns “how to read”,

Donut 🍩 (End-to-end Model)

In terms of what that can teach, in what form …

Slide

38 of 74

Model Fine-tuning and Inference Overview

in the fine-tuning stage, we teach the model “how to understand”.

Slide

39 of 74

Model Fine-tuning and Inference Overview

The prediction target is set to a desired downstream token sequence, including some special tokens.
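A minimal sketch of how a ground-truth JSON annotation could be serialized into such a target sequence; the exact special-token format below is illustrative:

def json_to_token_sequence(obj):
    # Serialize a (possibly nested) annotation into a flat sequence of tokens,
    # delimiting each field with opening/closing special tokens.
    if isinstance(obj, dict):
        return "".join(f"<s_{k}>{json_to_token_sequence(v)}</s_{k}>" for k, v in obj.items())
    if isinstance(obj, list):
        return "<sep/>".join(json_to_token_sequence(v) for v in obj)
    return str(obj)

annotation = {"menu": [{"nm": "3002-Kyoto Choco Mochi", "cnt": "x2", "price": "28.000"}]}
print(json_to_token_sequence(annotation))
# <s_menu><s_nm>3002-Kyoto Choco Mochi</s_nm><s_cnt>x2</s_cnt><s_price>28.000</s_price></s_menu>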

Slide

40 of 74

Model Fine-tuning and Inference Overview

At inference, the predicted token from the last step is fed to the next.
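A minimal sketch of this greedy autoregressive loop (the model call and token ids are placeholders; a real implementation would typically rely on a library generation utility):

import torch

@torch.no_grad()
def greedy_decode(model, image, prompt_ids, eos_id, max_len=512):
    ids = list(prompt_ids)
    for _ in range(max_len):
        logits = model(image, torch.tensor([ids]))  # (1, len(ids), vocab_size)
        next_id = int(logits[0, -1].argmax())       # prediction from the last step
        ids.append(next_id)                         # fed back in at the next step
        if next_id == eos_id:
            break
    return ids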

Slide

41 of 74

Model Fine-tuning and Inference Overview

The predicted sequence is converted into a JSON format.

Slide

42 of 74

Experiments: Samples of Downstream Datasets

We conduct three VDU applications on six different datasets. Some samples are shown.

Slide

43 of 74

Experiments: Document Classification

To see whether the model can distinguish between different types of documents, we test a classification task.

Slide

44 of 74

Experiments: Document Classification

These are the results on the RVL-CDIP dataset. Donut achieves state-of-the-art scores with reasonable speed and efficiency.

Slide

45 of 74

Experiments: Document Classification

These are the results on the RVL-CDIP dataset. Donut achieves state-of-the-art scores with reasonable speed and efficiency.

Slide

46 of 74

Experiments: Document Parsing

Next, to see whether the model fully understands complex layouts and contexts, we test document parsing tasks.

Slide

47 of 74

Experiments: Document Parsing

For all domains, Donut showed the best accuracy with a significantly faster inference speed.

Slide

48 of 74

Experiments: Document Parsing

For all domains, Donut showed the best accuracy with a significantly faster inference speed.

Slide

49 of 74

Experiments: Document VQA

To further validate the capacity of the model, we test a document VQA task.

Q: What is the Extension Number as per the voucher?
A: (910) 741–0673

Slide

50 of 74

Experiments: Document VQA

For VQA, Donut showed a high score on handwritten documents, which are known to be challenging.

Slide

51 of 74

Experiments: Document VQA

For VQA, Donut showed a high score on handwritten documents, which are known to be challenging.

Slide

52 of 74

Analysis: VQA on Handwritten Documents

As can be seen, OCR errors impose an upper bound on the performance of the conventional baselines.

Slide

53 of 74

Analysis: VQA on Handwritten Documents

On the other hand, Donut appears robust to handwritten documents.

Slide

54 of 74

Analysis

Next, we show some main results of our analysis on Donut.

Slide

55 of 74

Analysis: Pre-training Strategies

We tested several pre-training tasks. We found that the proposed task is the simplest yet most effective approach.

Slide

56 of 74

Analysis: Pre-training Strategies

Other tasks that impose general knowledge of images and texts on the model show little gain in the fine-tuning tasks.

Slide

57 of 74

Analysis: Pre-training Strategies

For the text reading task, synthetic images were enough for the CORD task.

Slide

58 of 74

Analysis: Pre-training Strategies

But, in the DocVQA task, it was important to see the real images.

Slide

59 of 74

Analysis: Image Backbones

Next, we study popular image classification backbones.

Slide

60 of 74

Analysis: Image Backbones

Overall, EfficientNetV2 and Swin Transformer outperform the others.

Slide

61 of 74

Analysis: Image Backbones

We chose Swin Transformer due to its high scalability and performance.

Slide

62 of 74

Analysis: Input Resolution

We observed that Donut's performance improves rapidly as we increase the input size.

Slide

63 of 74

Analysis: Input Resolution

This becomes clearer on DocVQA, where the images are larger and contain many tiny texts.

Slide

64 of 74

Analysis: OCR Engines

Next, we study the effects of OCR engines on the traditional baselines.

We test four widely-used public engines.

Slide

65 of 74

Analysis: OCR Engines

We observed that the scores heavily rely on the OCR engine.

Slide

66 of 74

Analysis: OCR Engines

This shows why we need an OCR-free method.

Slide

67 of 74

Conclusions

So far, we have introduced our new OCR-free method, Donut.

More experiments and analysis can be found in the manuscript.

Donut 🍩 (End-to-end Model)

Token Sequence

Output

Input Image

Input Image

(Off-the-shelf) OCR Engine

Backbone

(BERT-like)

BIO-Tags / Answer Token Span / etc

Output

Slide

68 of 74

Conclusions

So far, we have introduced our new OCR-free method, Donut.

More experiments and analysis can be found in the manuscript.

Donut 🍩 (End-to-end Model)

Token Sequence

Output

Input Image

Input Image

(Off-the-shelf) OCR Engine

Backbone

(BERT-like)

BIO-Tags / Answer Token Span / etc

Output

Slide

69 of 74

Conclusions

  • The proposed method, Donut, directly maps an input document image into a desired structured output.

The proposed method, Donut, directly maps an input document image into a desired structured output.

Donut 🍩 (End-to-end Model)

Token Sequence

Output

Input Image

Slide

70 of 74

Conclusions

  • The proposed method, Donut, directly maps an input document image into a desired structured output.
  • Unlike conventional methods, Donut does not depend on OCR and can easily be trained in an end-to-end fashion.

Unlike conventional methods, Donut does not depend on OCR and can easily be trained in an end-to-end fashion.

Donut 🍩 (End-to-end Model)

Token Sequence

Output

Input Image

Slide

71 of 74

Conclusions

  • The proposed method, Donut, directly maps an input document image into a desired structured output.
  • Unlike conventional methods, Donut does not depend on OCR and can easily be trained in an end-to-end fashion.
  • We also propose a synthetic document image generator, SynthDoG.

We also propose a synthetic document image generator, SynthDoG.

Slide

72 of 74

Conclusions

  • The proposed method, Donut, directly maps an input document image into a desired structured output.
  • Unlike conventional methods, Donut does not depend on OCR and can easily be trained in an end-to-end fashion.
  • We also propose a synthetic document image generator, SynthDoG.
  • Through extensive experiments and analyses, we show the high performance and cost-effectiveness of Donut.

Through extensive experiments and analyses, we show the high performance and cost-effectiveness of Donut.

Slide

73 of 74

Conclusions

  • The proposed method, Donut, directly maps an input document image into a desired structured output.
  • Unlike conventional methods, Donut does not depend on OCR and can easily be trained in an end-to-end fashion.
  • We also propose a synthetic document image generator, SynthDoG.
  • Through extensive experiments and analyses, we show the high performance and cost-effectiveness of Donut.
  • We believe our work can easily be extended to other domains/tasks regarding document understanding.

We believe our work can easily be extended to other domains/tasks regarding document understanding.

Slide

74 of 74

If you have any questions, please feel free to contact me :) Thank you!

Thank you!

Slide