

Multimodal Visual-Semantic Representations Learning for Scene Text
Recognition
XINJIAN GAO∗ , University of Science and Technology of China, Hefei, China
YE PANG, Ping An Technology Co., Ltd, Beijing, China
YUYU LIU, Ping An Technology Co., Ltd, Beijing, China
MAOKUN HAN, Ping An Technology Co., Ltd, Beijing, China
JUN YU2 , University of Science and Technology of China, Hefei, China
WEI WANG2 , Ping An Technology Co., Ltd, Beijing, China
YUANXU CHEN, Ping An Technology Co., Ltd, Beijing, China
Scene Text Recognition (STR), a critical step in OCR systems, has attracted much attention in computer vision. Recent
research on modeling textual semantics with a Language Model (LM) has witnessed remarkable progress. However, the LM only
optimizes the joint probability of the estimated characters generated from the Vision Model (VM) in a single language
modality, ignoring the visual-semantic relations across different modalities. Thus, LM-based methods can hardly generalize
well to some challenging conditions in which the text has weak or multiple semantics, arbitrary shape, etc. To mitigate
the above issue, in this paper we propose the Multimodal Visual-Semantic Representations Learning for Text Recognition
Network (MVSTRN) to reason about and combine multimodal visual-semantic information for accurate Scene Text Recognition.
Specifically, our MVSTRN builds a bridge between vision and language through its unified architecture and has the ability
to reason about visual semantics by guiding the network to reconstruct the original image from the latent text representation,
breaking the structural gap between vision and language. Finally, a tailored Multimodal Fusion (MMF) module is introduced
to combine the multimodal visual and textual semantics from the VM and LM to make the final predictions. Extensive experiments
demonstrate that our MVSTRN achieves state-of-the-art performance on several benchmarks.
Additional Key Words and Phrases: Scene Text Recognition, Vision Transformer, Self-supervised Learning

1 INTRODUCTION
Scene Text Recognition (STR), recognizing text from natural scene images, has attracted much attention in
computer vision. Various text images exist in real-life scenes, such as road signs, posters, billboards
and license plates. Text recognition mainly reads text from cropped images produced by text detection
methods [15, 29, 30]. In recent years, although many efforts have been made to achieve accurate and fast text
recognition, it is still challenging to recognize text in the wild due to variations of text in colour, font style,
and complex backgrounds.
∗ This work was done while Xinjian Gao was an intern at Ping An Technology Co., Ltd.
2 Corresponding author

Authors’ addresses: Xinjian Gao, University of Science and Technology of China, Hefei, China, [email protected]; Ye Pang, Ping An
Technology Co., Ltd, Beijing, China; Yuyu Liu, Ping An Technology Co., Ltd, Beijing, China; MaoKun Han, Ping An Technology Co., Ltd,
Beijing, China; Jun Yu, University of Science and Technology of China, Hefei, China, [email protected]; Wei Wang, Ping An Technology
Co., Ltd, Beijing, China, [email protected]; Yuanxu Chen, Ping An Technology Co., Ltd, Beijing, China.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page.
Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy
otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from
[email protected].
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM 1551-6857/2024/2-ART
https://doi.org/10.1145/3646551


Fig. 1. Illustrations of the three different networks. From top to bottom: previous methods with CNN and RNN-like modules,
VisionLAN, and our proposed MVSTRN.

Some early works [21, 45, 49] regard the text recognition task as single-character classification based on the visual
features of the divided characters. They define the characters as meaningless symbols, leaving out the rich semantic
information in the text. In order to leverage textual semantics, a series of recent methods [13, 26, 37, 56] have begun
to focus on modeling textual semantics with an LM and have achieved remarkable progress. For example,
Yu et al. [56] propose a novel network with a global semantic reasoning block that can model semantics from
the estimated visual features. Following [56], Fang et al. [13] propose a multimodal text recognizer with a VM to
produce visual features and a pretrained LM to correct the visual predictions with its rich semantics. Though the LM
contributes to recognition accuracy with its powerful linguistic modeling ability, these LM-based multimodal
methods are inclined to correct the predictions of the VM with semantics but ignore the balance between vision
and semantics. Hence, when faced with challenging text that has weak or multiple semantics, arbitrary shapes,
etc., these methods may produce unsatisfactory results, as shown in Figure 2. In our work, we present a
multimodal visual-semantic text recognizer to address the above issue.
For the vision model, we replace the typical hybrid structures based on CNN and RNN-like modules with a fully
Transformer-based network to directly reason about and fuse the visual-semantic relations. Thanks to the powerful
self-attention mechanism, Transformer and Vision Transformer architectures have great advantages in
vision, language and multimodal tasks. In hybrid networks based on CNN and RNN-like modules, it is
hard to generalize and combine multimodal information because of the structural gap between CNNs and RNN-like
modules. Therefore, a structurally unified network can effectively reduce the information loss caused by the
transformation between different network structures. Our VM consists of a ViT backbone and a position enhanced
parallel attention module designed to inject position information and decode characters, achieving better generalization
of visual-semantic information compared with traditional hybrid structures.


Fig. 2. Illustrations of recognition results where the reproduced ABINet fails but our MVSTRN succeeds. For each example,
the top text in red indicates the false prediction and the bottom text indicates the correct prediction of our MVSTRN.

For vision model pretraining, we adopt the masked image modeling (MIM) strategy to pretrain the vision model,
giving our network much scalability. Masked image modeling was first used in regular vision tasks such as
object detection, object segmentation, etc. Compared with regular vision tasks, the training images in the OCR
task carry not only visual information but also semantic information, which makes it difficult for masked image
strategies to reconstruct the original text images from the masked images. Specifically, we split the input image
into a sequence of patches and randomly mask these patches with a set ratio, as shown in Figure 1. In this way,
the visual backbone learns to concentrate not only on the image text content but also on the semantic relations
between characters. Similarly, VisionLAN [48] utilizes a Masked Language-aware Module (MLM) that masks visual clues
at specific positions of the attention maps to give the vision model the ability to perceive language. However,
the attention-level mask of VisionLAN needs an additional branch to predict the specific position according to
the input index, where the model may predict a false mask position and decrease the performance of the network.
Fortunately, the issues analyzed in VisionLAN [48] can be well addressed by our MIM strategy.

For multimodal information fusion, we present a Multimodal Fusion (MMF) module implemented on a cross-attention
block that combines the 2D visual-semantic features with the 1D textual-semantic features in the final
stage of the network, aiming to achieve a balance between vision and language. The mechanism of MMF is
similar to the parallel attention that rearranges the 2D visual-semantic features into the corresponding semantic
positions. Different from previous methods that utilize a simple gated mechanism [54, 56] to fuse the aligned
1D features, our MMF operates in a multimodal way and is proved to boost recognition accuracy by
extensive experiments.
The major contributions of this paper can be summarized as threefold.

(1) We propose a novel multimodal visual-semantic text recognizer composed entirely of Transformers and
show that the proposed network is more capable of modeling and fusing visual-semantic multimodal information,
breaking the structural gap between vision and language.

(2) We are the first to apply a masked image modeling strategy to the OCR task and make our model reason about
multimodal visual-semantic representations via the reconstruction task in pretraining. With the help of the ViT
architecture that crops the input images into multiple patches, we implement masking on both visual and semantic
information, in contrast to BERT, which masks semantics, and MAE, which masks vision.

(3) An effective position enhanced parallel attention decoder and a multimodal fusion module are proposed to fuse
and balance the multimodal information. Extensive experiments are conducted to evaluate the effectiveness and
robustness of our proposed MVSTRN, which achieves new state-of-the-art performance on both popular regular
and irregular text recognition benchmarks.

2 RELATED WORK
2.1 Vision-language Multimodal Text Recognizer
Extracting and balancing vision-language information is a key step in many related multimodal tasks [16,
27, 33, 55]. It is also a key step in OCR systems. In this section, we mainly introduce the development of
existing Scene Text Recognition methods, which can be generally divided into visual-clue-driven text recognizers
and multimodal visual-language-clue-driven recognizers.

Visual-clue-driven methods [7, 14, 19, 28, 31, 41, 42, 44, 51, 59, 63] generally regard text recognition as a visual
classification task and mainly aim to achieve better recognition performance by extracting richer visual
features. These methods typically utilize a CNN to extract visual features and an RNN to model implicit semantic
relations among characters, and finally use CTC to maximize the probability of all the paths that may reach the
ground truth. For example, CRNN [41] proposes a CTC-based recognition method, where the 2D visual features
are flattened into a sequence and then fed into an RNN to achieve sequence modeling. Atienza et al. [2] build a
compute- and parameter-efficient single-stage vision transformer to achieve efficient text recognition. Other
than the aforementioned CTC-based methods, some segmentation-based methods [31, 43] predict characters from
segmented visual features. In general, these visual-feature-driven methods perform poorly on irregular
text, such as severely obscured or blurred text, owing to their limited semantic relation modeling.

Multimodal vision-language text recognizers [1, 5, 13, 25, 26, 37, 50, 56–58], as the mainstream among the
recognition algorithms, have made remarkable progress in recent years. Pioneering works [25, 26, 37, 57] typically
reason about textual semantics with auto-regressive encoder-decoder frameworks, which predict a single character
at each time step by one-way semantic context delivery. However, one-way semantic modeling may accumulate
misrecognition errors during left-to-right decoding and is inefficient due to serial prediction. To this end, Yu et
al. [56] propose SRN (Semantic Reasoning Network), which includes a parallel attention module aligning visual
features across all time steps for parallel decoding and a global semantic reasoning module that models semantic relations
in a novel manner of multi-way parallel transmission. Based on [56], Fang et al. [13] propose a multimodal text
recognizer containing a VM and an LM to achieve more accurate text recognition. By introducing rich semantics into
the post-processing of the visually estimated features, these multimodal methods achieve huge success.

2.2 Masked Image Modeling


As BERT [11] and GPT [6, 38, 39] achieve great success in NLP tasks, more and more masked image encoding
methods [4, 8, 18, 62] that learn representations from images corrupted by masking have been applied to
computer vision tasks. In particular, most of the above methods are implemented on the Vision Transformer [12]
(ViT). The Vision Transformer [12] divides the 2D input image into a 1D sequence of patches as the input of
Transformers, breaking the architectural gap between the CV and NLP fields. Motivated by [12] and [11], Bao et al. [4]
propose a masked image modeling method called BEiT to pretrain vision Transformers in a self-supervised manner.
Similarly, He et al. [18] present an asymmetric encoder-decoder architecture that makes pixel-level predictions from
high-ratio masked images and witnesses remarkable success in computer vision tasks.

Although the above MIM methods have witnessed great success in classification, object detection and segmentation
tasks, there are few works adopting a masked strategy to pretrain a VM for the text recognition task.


Correspondingly, VisionLAN [48] first attempts to make its VM extract visual-semantic representations through
the proposed language-aware module (MLM). The MLM is leveraged to enhance semantic modeling by masking
the visual features at specific positions of the attention maps, giving the VM the ability to perceive linguistic
knowledge. Though the masking strategy of VisionLAN [48] is complicated and unintuitive, VisionLAN [48], for
the first time, proves the feasibility that a visual masked pretraining strategy can work well in the STR task.

Fig. 3. Overview of our proposed MVSTRN. Our MVSTRN is trained in two stages: the vision model pretraining
stage (left) and the finetuning stage (right).

3 PROPOSED METHOD
3.1 Overview
The overview of our MVSTRN is shown in Figure 3; it consists of three major parts: the VM, the LM and the
multimodal fusion module. For the VM, we drop the convolutional backbone networks used to extract visual features
and utilize a Vision Transformer to extract visual-semantic features. Besides, the MIM strategy is adopted to pretrain
the VM, giving it the ability to reason about visual semantics. For the alignment module, the Position Enhanced
Parallel Attention is developed in the VM to enhance the positional clues that make a great difference to the alignment
of visual features with time steps. For the LM, we follow the implementation of ABINet [13], which is bidirectional and
iterative with four Transformer layers. Finally, a Multimodal Fusion module takes the 1D semantic and the 2D
visual-semantic features as inputs, combining the visual-semantic features with the textual-semantic features in the
last prediction stage.
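To make the data flow of Figure 3 concrete, the following is a minimal, hypothetical composition sketch in PyTorch; the module interfaces, names, and shapes are our own assumptions for illustration, not the authors' released implementation.

```python
import torch.nn as nn

class MVSTRNSketch(nn.Module):
    """Hypothetical composition of the pipeline in Figure 3 (not the authors' code)."""
    def __init__(self, vit_backbone, pepa, language_model, fusion, dim=512, num_classes=37):
        super().__init__()
        self.vit_backbone = vit_backbone      # 12-layer ViT: image -> (B, N, D) patch features
        self.pepa = pepa                      # aligns N patch features to T character positions
        self.language_model = language_model  # ABINet-style bidirectional, iterative LM
        self.fusion = fusion                  # cross-attention multimodal fusion (MMF)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, images):
        visual = self.vit_backbone(images)        # 2D visual-semantic features
        aligned, vm_preds = self.pepa(visual)     # character-aligned features and Y_V
        semantic = self.language_model(vm_preds)  # 1D textual-semantic features
        fused = self.fusion(semantic, visual)     # reinject 2D spatial context
        return vm_preds, self.classifier(fused)   # VM predictions and fused predictions
```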

3.2 Vision Model


Transformer is conquering more and more computer vision tasks with its excellent ability to capture global
semantic information and its scalability to various pretraining strategies. To leverage the above merits of
Transformer, we adopt a structurally unified network with 12 ViT layers as our VM to excavate deep
visual-semantic features.


Given an input image X ∈ R^{H×W×C}, we first reshape it into flattened patches X_p ∈ R^{N×(P^2·C)},
where (H, W) is the resolution of the input image, C is the number of channels, (P, P) is the size of each split
patch, and N = HW/P^2 is the length of the flattened patch sequence. We then map the flattened patches to D
dimensions with a linear projection so as to match the dimension of the Transformer. In the original ViT, a class
token is used for object category prediction, while it is dropped in our method (more details can be found in
Section 4.4.2). Besides, the standard sine-cosine 1D position embedding PE ∈ R^{N×D}, which is added to each
patch embedding, is defined as:

    PE_(pos, 2i)   = sin(pos / 10000^{2i/d_model})
    PE_(pos, 2i+1) = cos(pos / 10000^{2i/d_model})                      (1)

The visual-semantic features F ∈ R^{N×D} output from each layer of the ViT are calculated as:

    F_i = MLP(LN(MSA(LN(X_p + PE))))                                    (2)

where i indexes the layers of the Vision Transformer, and MSA, MLP and LN are the Multi-head Self-Attention
layer, the Multilayer Perceptron and Layer Normalization, respectively.
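As an illustration of the patch embedding, the sine-cosine position embedding of Eq. (1), and the Transformer block of Eq. (2), the following PyTorch sketch uses the sizes stated later in Section 4.2 (64×256 input, patch size 8, D = 512, MLP dimension 2048). Unlike Eq. (2) as written, the block includes the standard residual connections, and all names are illustrative assumptions rather than the authors' code.

```python
import math
import torch
import torch.nn as nn

def sincos_pos_embed(n_tokens: int, dim: int) -> torch.Tensor:
    """Fixed 1D sine-cosine position embedding of Eq. (1), shape (n_tokens, dim)."""
    pos = torch.arange(n_tokens, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    pe = torch.zeros(n_tokens, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class ViTBlock(nn.Module):
    """One pre-norm Transformer layer (Eq. 2 plus the usual residual connections)."""
    def __init__(self, dim=512, heads=8, mlp_dim=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # MSA
        return x + self.mlp(self.norm2(x))                 # MLP

# Flatten an image batch into patch tokens and run one block.
B, C, H, W, P, D = 2, 3, 64, 256, 8, 512
images = torch.randn(B, C, H, W)
patches = images.unfold(2, P, P).unfold(3, P, P)                       # (B, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)  # (B, N, P^2*C), N = 256
tokens = nn.Linear(C * P * P, D)(patches) + sincos_pos_embed(patches.shape[1], D)
features = ViTBlock()(tokens)                                          # (B, 256, 512)
```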

3.3 Vision Model Pretraining


There exist various methods [4, 9, 18] to pretrain a ViT model. For example, following the mask-and-predict
strategy, He et al. [18] propose a novel method to pre-learn latent representations in a self-supervised
way. MAE [18] enables the ViT encoder to encode latent representations that are helpful for specific downstream tasks
by masking a large portion of the input patches and guiding the model to predict the original normalized pixels of the
masked patches. Motivated by [18], we develop an asymmetric encoder-decoder network, as shown in Figure 3,
to improve the VM's learning of deep visual-semantic representations.

During vision model pretraining, given the input image patches X_p processed by a patch embedding and a 1D
sine-cosine positional embedding PE, the encoder randomly selects the masked image patches X_p^m with a preset
mask ratio and then receives the unmasked patches X_p^u as inputs. Considering that the goal of the self-supervised
training is to obtain a better Transformer encoder as the ViT backbone of the VM, we utilize a lightweight decoder
that contains fewer Transformer layers than the encoder to perform the reconstruction task. During pixel
prediction, the lightweight decoder concatenates the encoded features from the encoder with the masked
patches according to the positional index, and another 1D sine-cosine positional embedding is added to the
concatenated features. Finally, the decoder directly predicts the normalized pixels X̂_p of the masked patches from
the positionally embedded concatenated features:

    X̂_p = D([E(X_p^u + PE); X_p^m] + PE)                               (3)

where E(·) and D(·) indicate the encoder and decoder respectively, PE is the positional embedding, and [· ; ·] represents
concatenation according to the positional index.
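The following sketch illustrates one MAE-style masking and reconstruction step corresponding to Eq. (3), assuming the patch tokens already carry their positional embedding; the encoder, decoder and mask token are placeholders and the function names are our own assumptions.

```python
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.7):
    """Randomly keep a subset of patch tokens; returns visible tokens, mask, unshuffle index."""
    B, N, D = tokens.shape
    n_keep = int(N * (1.0 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)
    shuffle = noise.argsort(dim=1)                 # random permutation per sample
    restore = shuffle.argsort(dim=1)               # inverse permutation
    keep = shuffle[:, :n_keep]
    visible = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N, device=tokens.device)  # 1 = masked, 0 = visible
    mask.scatter_(1, keep, 0.0)
    return visible, mask, restore

def mim_step(tokens, encoder, decoder, mask_token, dec_pos, mask_ratio=0.7):
    """One forward pass of Eq. (3): tokens are patch embeddings with PE already added;
    mask_token has shape (1, 1, D); dec_pos is the decoder's (N, D) position embedding."""
    B, N, D = tokens.shape
    visible, mask, restore = random_masking(tokens, mask_ratio)
    latent = encoder(visible)                                              # E(X_p^u + PE)
    pad = mask_token.expand(B, N - latent.shape[1], D)                     # placeholders for X_p^m
    full = torch.cat([latent, pad], dim=1)
    full = torch.gather(full, 1, restore.unsqueeze(-1).expand(-1, -1, D))  # restore patch order
    pred = decoder(full + dec_pos)                                         # predicted normalized pixels
    return pred, mask
```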

3.4 Position Enhanced Parallel Attention

Given the extracted visual-semantic features, the alignment module plays the role of aligning the 2D visual-semantic
features with 1D positional embeddings to generate the visual-semantic aligned features from which characters are
predicted. Correspondingly, [13, 56] propose the Parallel Attention mechanism with different implementations to
align all visual features with each character position. In particular, ABINet [13] utilizes a mini U-Net to enhance
the visual features and dot-product attention to achieve alignment between visual and positional features.


Fig. 4. Illustration of the PEPA. PEPA consists of the position enhanced self-attention and a cross-attention decoder,
transforming the positional embeddings and 2D feature maps into the aligned features.

Unlike previous methods [13, 56], we first utilize self-attention blocks to enhance the sequential positional
embeddings E_pos ∈ R^{T×D}, which injects richer position clues into the query vectors used in the subsequent
parallel attention:

    E_pos = S(G([p_0, p_1, ..., p_{T-1}]))                              (4)

where S(·) denotes the self-attention blocks, G(·) is the embedding function, and p_h (h ∈ [0, ..., T−1]) is the sequential
position index. Aiming to achieve more accurate alignment, we replace the attention mechanism proposed in [13] with a
Transformer block, in which the visual features F ∈ R^{N×D} serve as the key and value vectors, and the enhanced
positional embeddings E_pos serve as the query vectors. Let the aligned features generated from the parallel
attention be denoted as F̃ ∈ R^{T×D}. By applying a linear layer and a softmax function, the estimated character
sequence Y_V ∈ R^{T×S} can be predicted. The above process can be formalized as follows:

    F̃ = softmax((E_pos W_q)(F W_k)^T / √D) F W_v
    Y_V = softmax(F̃ W)                                                  (5)

where W_q, W_k, W_v ∈ R^{D×D} and W ∈ R^{D×S} are linear weights, and (·)^T denotes the transpose. Note that the query
vectors are equipped with rich position clues and the parallel attention is implemented on the more effective
Transformer blocks, so it can arrange the visual-semantic features into the corresponding positions more
accurately.
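A minimal PyTorch sketch of PEPA under the settings of Section 4.2 (D = 512, 8 heads, T = 26 character positions, S = 37 classes, three self-attention layers with MLP dimension 256) might look as follows; class and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PEPASketch(nn.Module):
    def __init__(self, dim=512, heads=8, max_len=26, num_classes=37, n_self_layers=3):
        super().__init__()
        self.pos_embed = nn.Embedding(max_len, dim)                             # G(.) in Eq. (4)
        layer = nn.TransformerEncoderLayer(dim, heads, dim_feedforward=256, batch_first=True)
        self.self_enhance = nn.TransformerEncoder(layer, n_self_layers)         # S(.) in Eq. (4)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # Eq. (5)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, visual_feats):            # visual_feats: (B, N, D) from the ViT backbone
        B = visual_feats.shape[0]
        idx = torch.arange(self.pos_embed.num_embeddings, device=visual_feats.device)
        queries = self.self_enhance(self.pos_embed(idx).unsqueeze(0).expand(B, -1, -1))
        aligned, _ = self.cross_attn(queries, visual_feats, visual_feats)       # (B, T, D)
        return aligned, self.classifier(aligned).softmax(dim=-1)                # aligned feats, Y_V
```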

3.5 Multimodal Fusion Module


In previous methods [13, 56], the fusion module is typically implemented with a gated mechanism that fuses the
aligned visual-semantic features from the VM, F̃ ∈ R^{T×D}, and the semantic features from the LM, F_L = T(Y_V) ∈ R^{T×D},
where T(·) denotes the LM. However, the visual-semantic features aligned by the parallel attention no longer
carry spatial context, since they exist as sequence features, which weakens the expression of spatial
information in the visual-semantic features.

To reintroduce the visual-semantic features into the final prediction, the MMF rearranges the 2D spatial
context information in the visual-semantic features into the semantic positions in a way similar to the parallel attention
mechanism, as shown in Figure 4. The cross-attention of the Transformer, serving as the connecting module between
the encoder and decoder, can exchange information between different modalities. We thus utilize a Transformer block
with a cross-attention mechanism as the final fusion module. Given the generated 1D semantic features F_s ∈ R^{T×D}
as query vectors and the visual-semantic features F ∈ R^{N×D} as key and value vectors, the cross-attention module
combines these two features and generates the multimodal fusion features F_f ∈ R^{T×D}.
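A minimal sketch of such a cross-attention fusion block is given below, with the LM's 1D semantic features as queries and the 2D visual-semantic patch features as keys and values; the exact block layout (residuals, norms, MLP size) is an assumption.

```python
import torch.nn as nn

class MMFSketch(nn.Module):
    def __init__(self, dim=512, heads=8, mlp_dim=1024):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, sem_feats, visual_feats):
        # sem_feats: (B, T, D) from the LM; visual_feats: (B, N, D) from the ViT backbone.
        fused, _ = self.cross_attn(sem_feats, visual_feats, visual_feats)  # reinject 2D context
        x = self.norm1(sem_feats + fused)
        return self.norm2(x + self.mlp(x))                                 # F_f: (B, T, D)
```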

3.6 Training Objective


Vision Model Pretraining. We denote the collection of normalized masked patches as Y = {y_0, ..., y_K}
and the corresponding predicted patches as X = {x_0, ..., x_K}. For each decoding, we apply the mean squared
error over the masked patches to compute the reconstruction loss:

    L = (1/K) Σ_i ||x_i − y_i||^2                                       (6)

where K is the number of masked patches, and x_i and y_i are the predicted patch and masked patch at index i,
respectively. With this objective, the network can be trained end-to-end without any
annotations.
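For illustration, the masked-patch reconstruction loss of Eq. (6) can be computed as follows, assuming per-patch pixel normalization of the target as in MAE-style training and a binary mask marking the masked patches.

```python
import torch

def mim_loss(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Eq. (6): mean squared error over masked patches only.
    pred/target: (B, N, P^2*C) pixel values; mask: (B, N) with 1 for masked patches."""
    mean = target.mean(dim=-1, keepdim=True)
    std = target.std(dim=-1, keepdim=True)
    target = (target - mean) / (std + 1e-6)            # per-patch normalization (assumption)
    per_patch = ((pred - target) ** 2).mean(dim=-1)    # ||x_i - y_i||^2 per patch
    return (per_patch * mask).sum() / mask.sum()       # average over the masked patches
```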
Recognition Network Training. During recognition network training, given the pretrained VM and LM,
we finetune MVSTRN with multi-task objectives:

    L_rec = L_v + (1/M) Σ_i (L_l^(i) + L_f^(i))                         (7)

where L_rec denotes the total cross-entropy loss, M is the number of iterations, and L_v, L_l^(i) and L_f^(i) are the
standard cross-entropy losses defined in (8) for the VM, LM and MMF, respectively.

    L = − Σ_{t=1}^{T} ln p(y_t | x_t; θ)                                (8)

where y_t and x_t are the ground-truth and predicted text sequences from the VM, LM and MMF respectively,
T is the length of the ground-truth text sequence, and θ denotes the training parameters.
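A sketch of the multi-task objective of Eqs. (7)–(8) is shown below, assuming the VM, LM and MMF outputs are raw logits of shape (B, T, S) and that the LM/MMF terms are collected over M refinement iterations; the tensor layout is an assumption.

```python
import torch.nn.functional as F

def recognition_loss(vm_logits, lm_logits_list, mmf_logits_list, targets):
    """Eqs. (7)-(8): cross entropy over VM, LM and MMF predictions.
    vm_logits: (B, T, S); *_list: M tensors of shape (B, T, S); targets: (B, T) class ids."""
    ce = lambda logits: F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    loss = ce(vm_logits)                                      # L_v
    m = len(lm_logits_list)
    for lm_logits, mmf_logits in zip(lm_logits_list, mmf_logits_list):
        loss = loss + (ce(lm_logits) + ce(mmf_logits)) / m    # (L_l^(i) + L_f^(i)) / M
    return loss
```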

4 EXPERIMENTS
4.1 Datasets
Following the previous methods[13, 37, 56, 57], we utilize two synthetic datasets SynthText (ST)[17] and
MjSynth[20] (MJ) as the training set that contains 16M (9M in MJ, 7M in ST) text instances in total. For fair
comparison in the evaluation, we adopt the evaluation protocol proposed in [10]. The testing set includes six public
benchmarks that can be divided into regular and irregular categories. It is more challenging to recognize text
from irregular datasets than regular datasets. The regular datasets include IIIT5K-Words (IIIT5K)[34], Street View
Text (SVT)[46], ICDAR2013 (IC13)[24] with 3000, 647 and 857 text images, respectively. ICDAR2015 (IC15)[23],
Street View Text-Perspective (SVTP)[35] and CUTE80[40] (CUTE) are irregular datasets containing 1811, 645 and
288 irregular text images, respectively.

4.2 Implementation Details


All experiments are conducted on a server with 4 NVIDIA A100 GPUs. The VM of our standard MVSTRN in
this paper is implemented on a ViT encoder containing 12 Transformer layers, of which the MLP dimension
is 2048. The self-attention in PEPA contains three Transformer layers with an MLP dimension of 256, and the


Method  Year  Training Data  IIIT5K  SVT  IC13  IC15  SVTP  CUTE
(IIIT5K, SVT and IC13 are regular-text benchmarks; IC15, SVTP and CUTE are irregular-text benchmarks.)
CAN [50] 2019 MJ 80.5 83.4 90.5 - - -
CombBest [3] 2019 MJ + ST 87.9 87.5 93.6 77.6 79.2 74.0
ESIR [58] 2019 MJ + ST 93.3 90.2 91.3 76.9 79.6 83.3
SEED [37] 2020 MJ + ST 93.8 89.6 92.8 80.0 81.4 83.6
DAN [47] 2020 MJ + ST 94.3 89.2 93.9 74.5 80.0 84.4
AutoSTR [61] 2020 MJ + ST 94.7 90.9 94.2 81.8 81.7 -
Yang et al. [53] 2020 MJ + ST 94.7 88.9 93.2 79.5 80.9 85.4
SATRN [25] 2020 MJ + ST 92.8 91.3 94.1 79.0 86.5 87.8
RobustScanner [57] 2020 MJ + ST 95.3 88.1 94.8 77.1 79.5 90.3
SRN [56] 2020 MJ + ST 94.8 91.5 95.5 82.7 85.1 87.8
SPIN [60] 2021 MJ + ST 95.2 90.9 94.8 82.8 83.2 87.5
PIMNet [36] 2021 MJ + ST 95.2 91.2 95.7 83.7 86.0 88.5
Bhunia et al. [5] 2021 MJ + ST 95.2 92.2 95.2 83.5 84.3 84.1
VisionLAN [48] 2021 MJ + ST 95.8 91.7 95.7 83.7 86.0 88.5
PREN2D [52] 2021 MJ + ST 95.6 94.0 96.4 83.0 87.6 91.7
ABINet [13] 2021 MJ + ST 96.2 93.5 97.4 86.0 89.3 89.2
MVSTRN - MJ + ST 96.9 95.8 97.1 87.1 91.6 89.6
Table 1. Comparison with previous state-of-the-art methods in Scene Text Recognition accuracy (%) on six public benchmark
test datasets. 'MJ' and 'ST' indicate Synth90K and SynthText. Regular and irregular text in the table indicate two kinds of
text images according to the degree of blur, curvature, etc.

cross-attention in both the PEPA and MMF contains one Transformer layer with an MLP dimension of 1024. For all
experiments, the number of attention heads and the embedding dimension are set to 8 and 512, respectively.
The input images are directly resized to 64 × 256. The patch size is set to 8; thus the flattened input includes
256 patch tokens (without a class token). The 1D sine-cosine positional embedding is added to each input patch in
both the pretraining and fine-tuning stages.
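A quick check of the token count stated above (an illustrative calculation, not part of the original text):

```python
# Token count for a 64x256 input with 8x8 patches.
H, W, P = 64, 256, 8
n_tokens = (H // P) * (W // P)   # 8 * 32
print(n_tokens)                  # 256 patch tokens, no class token
```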
Vision Model Pretraining. For masked image modeling pretraining, the model, containing a 12-layer ViT
encoder and an 8-layer Transformer decoder, is trained for 40 epochs only on the SynthText dataset to predict the
normalized pixels, where we use the AdamW optimizer with a batch size of 1600. No data augmentation is applied.
The initial learning rate is set to 1.2e-4 with a cosine learning-rate scaling rule. The objective of predicting the
original rather than the normalized pixels is only used for the visualization of reconstruction samples.
Recognition Model Finetuning. We first finetune our VM from the self-supervised pretrained ViT encoder
for 7 epochs with a batch size of 480, where the initial learning rate decreases from 1e-4 to 1e-5 after 4 epochs
and then to 5e-6 in the last epoch. For the LM, we adopt the pretrained LM with the same iteration
number provided by ABINet [13]. Finally, we finetune our MVSTRN end-to-end on the MJ and ST datasets with the
aforementioned pretrained VM and LM for 7 epochs, where we adopt the same learning rate schedule as for the VM
and a batch size of 420. The Adam optimizer is adopted for both the VM and MVSTRN training.
Besides, some random image augmentations [32, 56] are applied, such as rotation, perspective distortion,
motion blur and Gaussian noise. The maximum length T of the output sequence is set to 26, and 37 characters,
including 10 digits, 26 letters and 1 padding symbol, can be predicted.
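For illustration, a possible encoding of the 37-symbol charset and fixed-length labels is sketched below; the character ordering and padding convention are assumptions.

```python
import string

# Hypothetical 37-symbol charset: 1 padding symbol + 10 digits + 26 letters.
PAD = "<pad>"
CHARSET = [PAD] + list(string.digits) + list(string.ascii_lowercase)
assert len(CHARSET) == 37
CHAR2IDX = {c: i for i, c in enumerate(CHARSET)}

def encode(text: str, max_len: int = 26):
    """Map a transcription to a fixed-length sequence of class indices."""
    ids = [CHAR2IDX[c] for c in text.lower()[:max_len]]
    return ids + [CHAR2IDX[PAD]] * (max_len - len(ids))

print(encode("hello"))  # 5 character ids followed by 21 padding ids
```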


4.3 Comparison with State-of-the-art Methods


Table 1 lists existing STR methods and their recognition performance on six public datasets. Note that we
only consider unconstrained text recognition methods without any lexicons, which is more challenging
but more practical. Besides, to make a fair comparison, all methods displayed are trained on the MJ and ST
datasets without any extra data.

As illustrated in Table 1, our MVSTRN achieves state-of-the-art performance on four benchmark datasets
(IIIT5K, SVT, IC15 and SVTP) and the second-best performance on the IC13 dataset. Compared with the previous
state-of-the-art method ABINet, our MVSTRN outperforms it with impressive margins on five datasets.
Specifically, accuracy gains of 0.7%, 2.3%, 1.1%, 2.3% and 0.4% are obtained on the regular and irregular
datasets IIIT5K, SVT, IC15, SVTP and CUTE, respectively. According to the experimental results, MVSTRN gives an
impressive performance on the more challenging datasets SVT, IC15 and SVTP, where the text is blurred and
severely curved compared to the images in IIIT5K and IC13. MVSTRN lags behind the best method on the CUTE
dataset by 2.1%. The possible reasons are: 1) the CUTE dataset contains only 288 images, where the accuracy
fluctuates greatly, and we choose the model with the best average performance across the six benchmarks;
2) the CUTE dataset contains more severely curved text, which also brings challenges to recognition.

Table 2. Ablation study of the Vision Model. VM-S, VM-L and VM-H indicate three VM versions of our MVSTRN
with 8, 12 and 16 Transformer layers, respectively. HybridNet, HybridNet-L and HybridNet-H indicate three versions of the
hybrid network with 3, 10 and 16 Transformer layers, respectively.

Vision Model Layers IIIT SVT IC13 IC15 SVTP CUTE


HybridNet 3 94.7 91.1 94.9 84.0 85.1 86.5
HybridNet-L 10 94.6 91.1 94.4 84.2 85.0 86.5
HybridNet-H 16 94.9 92.2 95.4 84.2 85.6 87.5
VM-S 8 94.1 92.4 93.6 81.9 83.0 83.3
VM-L 12 95.4 92.2 95.6 85.0 87.3 86.5
VM-H 16 95.4 93.1 96.2 85.3 87.5 87.5

4.4 Ablation Studies


4.4.1 Vision Model. Table 2 reports the performance on six benchmarks with different VMs. Specifically,
HybridNet indicates the CNN- and Transformer-based network reproduced from [13], which contains five
residual blocks and three Transformer layers in total. To further demonstrate the effectiveness of the structurally
unified network and to perform a strictly fair comparison, we build HybridNet-L and HybridNet-H, larger
versions of HybridNet containing 10 and 16 Transformer layers, respectively. VM-S, VM-L and
VM-H are three VM versions of our MVSTRN with 8, 12 and 16 Transformer layers, respectively. For a fair
comparison, all the above networks are trained end-to-end from scratch.
As illustrated in Table 2, our VM-S lags behind the HybridNet on the IC13, IC15, SVTP and CUTE datasets.
Thus, the lighter fully Transformer-based network cannot obtain a more competitive performance than the
lighter hybrid network. However, when extended to a larger network with more Transformer layers, our VM-L
outperforms HybridNet-L by margins of 0.8%, 1.1%, 1.2%, 0.8% and 2.3% on the IIIT, SVT, IC13, IC15 and SVTP
benchmarks. In contrast, HybridNet-L, containing more Transformer layers, fails to give a satisfactory
improvement over HybridNet and even lags behind it on the IIIT, IC13 and SVTP datasets. The VM of MVSTRN
still shows a stable improvement in accuracy when extended to a larger network with 16 Transformer layers.
Besides, VM-H also gives a better performance than HybridNet-H, which has the same number of Transformer
layers but more parameters. From the experimental results, we can conclude that our MVSTRN shows a more
stable performance as the network goes deeper and outperforms the hybrid network.

4.4.2 The Class Token of the VM. In this paper, we conduct experiments to explore the impact of the class token on
recognition accuracy. Three kinds of VMs whose class tokens have different training objectives are designed
as follows. 1) The class token is retained but predicts nothing. 2) The class token is trained to predict the start flag.
Inspired by ViTSTR [2], which utilizes a ViT backbone and a CTC decoder to recognize text, the class token
is trained to predict the start flag, which does not represent any specific character but only marks the beginning of
the predicted sequence. 3) The class token is trained to predict the length of the predicted text. Inspired by [22], aiming
to introduce additional supervised information to boost recognition performance, we use the MSE loss to
optimize the distance between the text length predicted by the class token and the length of the ground truth. Besides,
the VM without a class token is set up as the control group. Experimental results are shown in Table 3.

The experimental results demonstrate that the class token is not necessary in our network, as
shown in Table 3. The class tokens trained to predict the text length or nothing lead to a decrease in
recognition accuracy. The class token with the start-flag training objective achieves nearly the same performance
as the method without a class token. Considering the computational cost of adding the class token,
we finally drop it in our network.

Table 3. Recognition performance on benchmarks for different class tokens with various training objectives. 'Nothing' means
the VM whose class token predicts nothing. 'Start Flag' means the VM whose class token is used to predict the start flag.
'Text Length' indicates that the class token is used to predict the length of the text sequence.

Class Token  Nothing  Start Flag  Text Length  Regular  Irregular
✓ ✓ ✗ ✗ 94.8 84.3
✓ ✗ ✓ ✗ 95.0 85.1
✓ ✗ ✗ ✓ 94.4 84.1
✗ ✗ ✗ ✗ 95.1 85.1
(Columns 2–4 indicate the training objective of the class token.)

4.4.3 Position Enhanced Parallel Attention. PEPA in MVSTRN is responsible for aligning all the visual features
into the corresponding positions, which is also the core process of estimating the visual text classification. Previous
methods [13, 56] typically improve the alignment module by proposing an effective attention mechanism [56] or
by strengthening the visual features used as key vectors [13]. In MVSTRN, PEPA stacks several self-attention blocks
to enhance the position clues and boost recognition accuracy.

Table 4 compares the recognition performance of various methods on both the regular and irregular benchmarks,
where Dot-Product and PVAM indicate two attention mechanisms. As can be seen from Table 4, our PEPA
achieves the best performance on both the regular and irregular datasets. Compared with the regular datasets,
PEPA obtains a larger improvement in average accuracy on the irregular datasets. Besides, PEPA performs
better than PVAM, with accuracy gains of 0.2% and 0.3% on the regular and irregular datasets, respectively.
The above results verify the effectiveness of our PEPA.


Table 4. Recognition performance on benchmarks of different methods with various parallel attention mechanisms.

Method  Regular  Irregular  Average

Dot-Product 94.0 83.9 90.5
PVAM [56] 94.9 84.8 91.0
PEPA 95.1 85.1 91.3

4.4.4 Multimodal Fusion. Most previous methods [13, 56, 57] utilize gated dynamic fusion to fuse the
estimated visual and semantic results at the semantic level. Note that the estimated visual features are calculated by
the parallel attention module, where the modality of the output features depends on the query vectors. Thus, the
estimated visual features should in essence be classified as the semantic modality, lacking 2D spatial context
information. Our MMF is proposed to address this issue.

Table 5. Ablation study of the final fusion module.

Fusion Module Regular Irregular Average


Add 96.4 87.6 93.1
Concat 96.3 87.6 93.0
Dynamic fusion 96.5 88.2 93.3
MMF 96.8 88.4 93.6

MMF is a simple but effective module that reintroduces the 2D spatial context into the final predictions. Table 5
compares the performance of MMF on the regular and irregular datasets with other fusion modules in different
implementations. From the results in Table 5, we can see that MMF achieves the best recognition performance on
both the regular and irregular datasets and outperforms the second-best dynamic fusion by 0.3% and 0.2%. By
introducing deep 2D visual-semantic clues into the predictions, MMF brings a clear improvement on irregular
text recognition, which needs more spatial representations to describe the 2D locations.
4.4.5 Vision Model Pretraining. Masked Image Modeling (MIM) has proved to be effective for many downstream
tasks in computer vision due to its powerful ability to learn deep visual representations. In the STR task, it is more
challenging for the model to reconstruct the text image considering the complex visual and semantic relations. To
verify the effectiveness of the pretraining strategy, we build two kinds of networks with the same VM containing
12 Transformer layers, as shown in Table 6. PA indicates the VM of our MVSTRN using the parallel decoding
method of PEPA, and AR denotes the network using our VM as the encoder and auto-regressive Transformer
decoders composed of 12 Transformer layers.

We first train the PA and AR networks from scratch on the MJ and ST datasets; we then finetune these two networks
on MJ and ST from the model self-supervised pretrained for 40 epochs on ST. As the experimental results in Table 6
show, accuracy gains are obtained by both the PA and AR networks when finetuned from a pretrained
model. More concretely, the finetuned PA improves more on irregular text than on regular text, while the AR
model improves accuracy on both regular and irregular text recognition.

To examine the difference in the training process of the various networks, we draw the line chart shown in
Figure 5, in which the average accuracy of the model varies with the number of training epochs. The diagram
suggests that fine-tuning from a pretrained model boosts the recognition accuracy at every epoch in


Fig. 5. The trend of the average accuracy during the training process. 'Finetune' and 'Scratch' denote the two training
strategies of the networks, and 'P' and 'A' indicate the parallel and auto-regressive decoders.

Table 6. Recognition performance of the different networks trained with (without) the MIM strategy. AR indicates the
auto-regressive decoding method, and PA means the parallel decoding method.

Decoder  Training  IIIT  SVT  IC13  IC15  SVTP  CUTE

PA Scratch 95.4 92.2 95.6 85.0 87.3 86.5
PA Finetune 95.4 93.8 95.4 85.0 87.8 87.5
AR Scratch 94.4 92.6 96.0 82.9 87.4 88.9
AR Finetune 95.7 92.4 96.6 83.4 87.8 89.6

the training process. Besides, the chart shows that AR reaches its best average accuracy in only 3 epochs compared
with training from scratch, which implies that pretraining is more helpful for AR than for PA in converging the
network, owing to its encoder-decoder architecture being similar to the network used in pretraining. Furthermore,
AR benefits more from the pretrained model than PA, according to the peak of the line chart.

4.4.6 Analysis of Different Mask Ratios. The random masking ratio affects the quality of the reconstructed images
and the recognition accuracy of the finetuned model. In order to find a satisfactory mask ratio, we conduct numerous
experiments to finetune the VM of MVSTRN with different mask ratios. Table 7 compares the finetuning results under
various mask ratios. Analysing the experimental results in Table 7, we conclude that a mask ratio that is too low or
too high may not help to improve recognition accuracy. A 70% random mask ratio appears suitable for the STR task
and yields the best recognition accuracy, although it only outperforms the network under a 65% mask ratio by a
small gap.


Mask Ratio Regular Irregular Average


50% 94.9 85.3 91.0
65% 94.7 85.6 91.3
70% 95.2 85.2 91.4
75% 94.6 84.5 90.8
Table 7. Ablation study of the self-supervised pretraining network with different random mask ratios.

Fig. 6. Illustrations of two kinds of good reconstruction samples. (a) Reconstruction samples of common text images
from the benchmark test datasets. (b) Reconstruction samples from the OST dataset.

4.4.7 Analysis of the Reconstruction Samples. Our proposed MIM strategy is motivated to learn deep visual-semantic
representations from the latent text representations by reconstructing randomly masked input images.
A direct way to evaluate the effect of the pretraining is to observe the quality of the reconstructed images.
Figure 6 presents some successful reconstruction samples under a 75% mask ratio. The left part of Figure 6 shows
the visualization of the reconstruction of images selected from the CUTE and SVTP datasets. From the
figure, it is obvious that the model can reconstruct the original image from the masked image even with a 75%
mask ratio. Considering that the model may sample from the pixels of the original images according to their pixel
distribution, the right part of Figure 6 exhibits more challenging images selected from the Occlusion Scene
Text (OST) dataset collected by [48], in which character-level visual clues in the text images are randomly occluded.
From the samples in Figure 6 (b), we can see that the model can reconstruct the complete visual-semantic text
from text images with missing semantics. We further compare the performance on the OST dataset with VisionLAN, as
shown in Table 8. Our VM obtains better performance on the more challenging OST-Heavy subset. The above results
demonstrate the effectiveness of our pretrained VM.

Figure 7 shows the visualization of bad reconstruction cases under a more challenging 85% mask ratio.
As we can see from the masked images, images masked under an 85% mask ratio retain only limited text
information. Therefore, it is almost impossible for the model to reconstruct these extreme cases completely. From
the yellow boxes on the reconstruction samples, we can conclude that the model tries to predict the completely
masked characters according to the learned knowledge when the visual text information is minimal. In the three
given cases, the characters 'il', 'R' and 'B' are predicted as 'n', 'P' and 'S', which means the model fails to reason the


Fig. 7. Visualization of bad reconstruction samples. Characters in the yellow boxes indicate false reconstructions.

Methods OST-Heavy
VisionLAN[48] 50.3
VM-L 53.5
MVSTRN 63.5
Table 8. Recognition performance on the OST heavy dataset compared with VisionLAN.

complete semantics without enough visual information for the corresponding characters. In other words, under the
high mask ratio of 85%, it is challenging for the VM to reason about textual semantics.

5 CONCLUSION
In this paper, we first reveal a deficiency of LM-based methods: the LM fails to capture the 2D spatial
context of visual semantics. To address this issue, this paper proposes a structurally unified network
(MVSTRN) to compensate for the visual semantics of LM-based networks. Numerous experiments show that
our MVSTRN achieves a more stable performance than the HybridNet with similar parameters when extended
to a larger network. Furthermore, we introduce the masked image modeling strategy into the STR task to
pretrain our VM, giving it the ability to learn visual-semantic representations by reconstructing the semantics of
the masked images. Finally, we utilize the proposed Multimodal Fusion module to fuse the multimodal
information. Experimental results confirm that our MVSTRN achieves new state-of-the-art performance on
most benchmarks.

6 ACKNOWLEDGMENTS
This work was supported by the Natural Science Foundation of China (62276242), National Aviation Science
Foundation (2022Z071078001), CAAI-Huawei MindSpore Open Fund (CAAIXSJLJJ-2021-016B, CAAIXSJLJJ-2022-
001A), Anhui Province Key Research and Development Program (202104a05020007), Dreams Foundation of
Jianghuai Advance Technology Center (2023-ZM01Z001), USTC-IAT Application Sci. & Tech. Achievement
Cultivation Program (JL06521001Y), Sci. & Tech. Innovation Special Zone (20-163-14-LZ-001-004-01).

REFERENCES
[1] Aviad Aberdam, Roy Ganz, Shai Mazor, and Ron Litman. 2022. Multimodal semi-supervised learning for text recognition. arXiv preprint
arXiv:2205.03873 (2022).


[2] Rowel Atienza. 2021. Vision transformer for fast and efficient scene text recognition. In International Conference on Document Analysis
and Recognition. Springer, 319–334.
[3] Jeonghun Baek, Geewook Kim, Junyeop Lee, Sungrae Park, Dongyoon Han, Sangdoo Yun, Seong Joon Oh, and Hwalsuk Lee. 2019.
What is wrong with scene text recognition model comparisons? dataset and model analysis. In Proceedings of the IEEE/CVF International
Conference on Computer Vision. 4715–4723.
[4] Hangbo Bao, Li Dong, and Furu Wei. 2021. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254 (2021).
[5] Ayan Kumar Bhunia, Aneeshan Sain, Amandeep Kumar, Shuvozit Ghose, Pinaki Nath Chowdhury, and Yi-Zhe Song. 2021. Joint visual
semantic reasoning: Multi-stage decoder for text recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
14940–14949.
[6] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam,
Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33
(2020), 1877–1901.
[7] Linlin Chao, Jingdong Chen, and Wei Chu. 2020. Variational connectionist temporal classification. In European Conference on Computer
Vision. Springer, 460–476.
[8] Xinlei Chen, Saining Xie, and Kaiming He. 2021. An empirical study of training self-supervised vision transformers. In Proceedings of
the IEEE/CVF International Conference on Computer Vision. 9640–9649.
[9] Xinlei Chen, Saining Xie, and Kaiming He. 2021. An empirical study of training self-supervised vision transformers. In Proceedings of
the IEEE/CVF International Conference on Computer Vision. 9640–9649.
[10] Zhanzhan Cheng, Fan Bai, Yunlu Xu, Gang Zheng, Shiliang Pu, and Shuigeng Zhou. 2017. Focusing attention: Towards accurate text
recognition in natural images. In Proceedings of the IEEE international conference on computer vision. 5076–5084.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for
language understanding. arXiv preprint arXiv:1810.04805 (2018).
[12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani,
Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at
scale. arXiv preprint arXiv:2010.11929 (2020).
[13] Shancheng Fang, Hongtao Xie, Yuxin Wang, Zhendong Mao, and Yongdong Zhang. 2021. Read like humans: autonomous, bidirectional
and iterative language modeling for scene text recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition. 7098–7107.
[14] Wei Feng, Wenhao He, Fei Yin, Xu-Yao Zhang, and Cheng-Lin Liu. 2019. Textdragon: An end-to-end framework for arbitrary shaped
text spotting. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 9076–9085.
[15] Zilong Fu, Hongtao Xie, Shancheng Fang, Yuxin Wang, MengTing Xing, and Yongdong Zhang. 2022. Learning Pixel Affinity Pyramid
for Arbitrary-Shaped Text Detection. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) (2022).
[16] Chuang Gan, Yandong Li, Haoxiang Li, Chen Sun, and Boqing Gong. 2017. Vqs: Linking segmentations to questions and answers for
supervised attention in vqa and question-focused semantic segmentation. In Proceedings of the IEEE international conference on computer
vision. 1811–1820.
[17] Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. 2016. Synthetic data for text localisation in natural images. In CVPR. 2315–2324.
[18] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. 2021. Masked autoencoders are scalable vision
learners. arXiv preprint arXiv:2111.06377 (2021).
[19] Pan He, Weilin Huang, Yu Qiao, Chen Change Loy, and Xiaoou Tang. 2016. Reading scene text in deep convolutional sequences. In
Thirtieth AAAI conference on artificial intelligence.
[20] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and et al. 2014. Synthetic data and artificial neural networks for natural scene text
recognition. arXiv preprint arXiv:1406.2227 (2014).
[21] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2016. Reading text in the wild with convolutional neural
networks. International journal of computer vision 116, 1 (2016), 1–20.
[22] Hui Jiang, Yunlu Xu, Zhanzhan Cheng, Shiliang Pu, Yi Niu, Wenqi Ren, Fei Wu, and Wenming Tan. 2021. Reciprocal feature learning via
explicit and implicit tasks in scene text recognition. In Document Analysis and Recognition – ICDAR 2021: 16th International Conference,
Lausanne, Switzerland, September 5–10, 2021, Proceedings, Part I. Springer, 287–303.
[23] Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, and et al. 2015. ICDAR 2015 competition on robust reading. In ICDAR.
IEEE, 1156–1160.
[24] Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, and et al. 2013. ICDAR 2013 robust reading competition. In ICDAR. IEEE, 1484–1493.
[25] Junyeop Lee, Sungrae Park, Jeonghun Baek, Seong Joon Oh, Seonghyeon Kim, and Hwalsuk Lee. 2020. On recognizing texts of arbitrary
shapes with 2D self-attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 546–547.
[26] Hui Li, Peng Wang, Chunhua Shen, and Guyu Zhang. 2019. Show, attend and read: A simple and strong baseline for irregular text
recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 8610–8617.


[27] Xiangpeng Li, Bo Wu, Jingkuan Song, Lianli Gao, Pengpeng Zeng, and Chuang Gan. 2022. Text-instance graph: exploring the relational
semantics for text-based visual question answering. Pattern Recognition 124 (2022), 108455.
[28] Minghui Liao, Jian Zhang, Zhaoyi Wan, Fengming Xie, Jiajun Liang, Pengyuan Lyu, Cong Yao, and Xiang Bai. 2019. Scene text recognition
from two-dimensional perspective. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 8714–8721.
[29] Zhandong Liu, Wengang Zhou, and Houqiang Li. 2019. AB-LSTM: Attention-based bidirectional LSTM model for scene text detection.
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 15, 4 (2019), 1–23.
[30] Zhandong Liu, Wengang Zhou, and Houqiang Li. 2021. MFECN: Multi-level Feature Enhanced Cumulative Network for Scene Text
Detection. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 17, 3 (2021), 1–22.
[31] Pengyuan Lyu, Minghui Liao, Cong Yao, Wenhao Wu, and Xiang Bai. 2018. Mask textspotter: An end-to-end trainable neural network
for spotting text with arbitrary shapes. In Proceedings of the European Conference on Computer Vision (ECCV). 67–83.
[32] Pengyuan Lyu, Minghui Liao, Cong Yao, Wenhao Wu, and Xiang Bai. 2018. Mask textspotter: An end-to-end trainable neural network
for spotting text with arbitrary shapes. In Proceedings of the European Conference on Computer Vision (ECCV). 67–83.
[33] Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B Tenenbaum, and Jiajun Wu. 2019. The neuro-symbolic concept learner: Interpreting
scenes, words, and sentences from natural supervision. arXiv preprint arXiv:1904.12584 (2019).
[34] Anand Mishra, Karteek Alahari, and CV Jawahar. 2012. Scene text recognition using higher order language priors. In BMVC. BMVA.
[35] Trung Quy Phan, Palaiahnakote Shivakumara, Shangxuan Tian, and et al. 2013. Recognizing text with perspective distortion in natural
scenes. In ICCV. 569–576.
[36] Zhi Qiao, Yu Zhou, Jin Wei, Wang, and et al. 2021. PIMNet: a parallel, iterative and mimicking network for scene text recognition. In
MM. ACM, 2046–2055.
[37] Zhi Qiao, Yu Zhou, Dongbao Yang, and et al. 2020. Seed: Semantics enhanced encoder-decoder framework for scene text recognition. In
CVPR. 13528–13537.
[38] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
(2018).
[39] Alec Radford, Jefrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised
multitask learners. OpenAI blog 1, 8 (2019), 9.
[40] Anhar Risnumawan, Palaiahankote Shivakumara, Chan, and et al. 2014. A robust arbitrary text detection system for natural scene
images. Expert Systems with Applications 41, 18 (2014), 8027–8048.
[41] Baoguang Shi, Xiang Bai, and Cong Yao. 2016. An end-to-end trainable neural network for image-based sequence recognition and its
application to scene text recognition. TPAMI 39, 11 (2016), 2298–2304.
[42] Baoguang Shi, Mingkun Yang, Xinggang Wang, Pengyuan Lyu, Cong Yao, and Xiang Bai. 2018. Aster: An attentional scene text
recognizer with flexible rectification. IEEE transactions on pattern analysis and machine intelligence 41, 9 (2018), 2035–2048.
[43] Zhaoyi Wan, Minghang He, Haoran Chen, Xiang Bai, and Cong Yao. 2020. Textscanner: Reading characters in order for robust scene
text recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 12120–12127.
[44] Jianfeng Wang and Xiaolin Hu. 2017. Gated recurrent convolution neural network for ocr. Advances in Neural Information Processing
Systems 30 (2017).
[45] Kai Wang, Boris Babenko, and Serge Belongie. 2011. End-to-end scene text recognition. In 2011 International conference on computer
vision. IEEE, 1457–1464.
[46] Kai Wang, Boris Babenko, and Serge Belongie. 2011. End-to-end scene text recognition. In ICCV. IEEE, 1457–1464.
[47] Tianwei Wang, Yuanzhi Zhu, Lianwen Jin, Canjie Luo, Xiaoxue Chen, Yaqiang Wu, Qianying Wang, and Mingxiang Cai. 2020. Decoupled
attention network for text recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 12216–12224.
[48] Yuxin Wang, Hongtao Xie, Shancheng Fang, Jing Wang, Shenggao Zhu, and Yongdong Zhang. 2021. From two to one: A new scene
text recognizer with visual language modeling network. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
14194–14203.
[49] Yuxin Wang, Hongtao Xie, Zheng-Jun Zha, Mengting Xing, Zilong Fu, and Yongdong Zhang. 2020. Contournet: Taking a further
step toward accurate arbitrary-shaped scene text detection. In proceedings of the IEEE/CVF conference on computer vision and pattern
recognition. 11753–11762.
[50] Hongtao Xie, Shancheng Fang, Zheng-Jun Zha, Yating Yang, Yan Li, and Yongdong Zhang. 2019. Convolutional attention networks for
scene text recognition. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 15, 1s (2019), 1–17.
[51] Zecheng Xie, Yaoxiong Huang, Yuanzhi Zhu, Lianwen Jin, Yuliang Liu, and Lele Xie. 2019. Aggregation cross-entropy for sequence
recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6538–6547.
[52] Ruijie Yan, Liangrui Peng, Shanyu Xiao, and Gang Yao. 2021. Primitive representation learning for scene text recognition. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 284–293.
[53] L. Yang, P. Wang, H. Li, Z. Li, and Y. Zhang. 2020. A Holistic Representation Guided Attention Network for Scene Text Recognition.
Neurocomputing 414, 8 (2020).


[54] Xiao Yang, Dafang He, Zihan Zhou, Daniel Kifer, and C Lee Giles. 2017. Learning to read irregular text with attention mechanisms.. In
IJCAI, Vol. 1. 3.
[55] Kexin Yi, Jiajun Wu, Chuang Gan, Antonio Torralba, Pushmeet Kohli, and Josh Tenenbaum. 2018. Neural-symbolic vqa: Disentangling
reasoning from vision and language understanding. Advances in neural information processing systems 31 (2018).
[56] Deli Yu, Xuan Li, Chengquan Zhang, Tao Liu, Junyu Han, Jingtuo Liu, and Errui Ding. 2020. Towards accurate scene text recognition
with semantic reasoning networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12113–12122.
[57] Xiaoyu Yue, Zhanghui Kuang, Chenhao Lin, Hongbin Sun, and Wayne Zhang. 2020. Robustscanner: Dynamically enhancing positional
clues for robust text recognition. In European Conference on Computer Vision. Springer, 135–151.
[58] Fangneng Zhan and Shijian Lu. 2019. Esir: End-to-end scene text recognition via iterative image rectification. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2059–2068.
[59] Chuhan Zhang, Ankush Gupta, and Andrew Zisserman. 2020. Adaptive text recognition through visual matching. In European Conference
on Computer Vision. Springer, 51–67.
[60] Chengwei Zhang, Yunlu Xu, Zhanzhan Cheng, Shiliang Pu, Yi Niu, Fei Wu, and Futai Zou. 2020. Spin: Structure-preserving inner offset
network for scene text recognition. arXiv preprint arXiv:2005.13117 (2020).
[61] Hui Zhang, Quanming Yao, Mingkun Yang, Yongchao Xu, and Xiang Bai. 2020. AutoSTR: efficient backbone search for scene text
recognition. In European Conference on Computer Vision. Springer, 751–767.
[62] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. 2021. ibot: Image bert pre-training with online
tokenizer. arXiv preprint arXiv:2111.07832 (2021).
[63] Biao Zhu, Hongxin Zhang, Wei Chen, Feng Xia, and Ross Maciejewski. 2015. ShotVis: Smartphone-based visualization of OCR information
from images. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 12, 1s (2015), 1–17.
