

A hybrid method for mathematical expression detection in scientific document images
BUI HAI PHONG 1,3, THANG MANH HOANG 2, AND THI-LAN LE 2,1
1 MICA International Research Institute, Hanoi University of Science and Technology, Hanoi, Vietnam (e-mail: [email protected])
2 School of Electronics and Telecommunications, Hanoi University of Science and Technology, Hanoi, Vietnam (e-mail: [email protected])
3 Faculty of Information Technology, Hanoi Architectural University, Hanoi, Vietnam

This work was supported by the Domestic Master/ PhD Scholarship Programme of Vingroup Innovation Foundation.

ABSTRACT Mathematical expressions have been widely used in scientific documents. In order to analyze
the documents, automatic detection of mathematical expressions is a crucial step. The paper presents a
unified system for the detection of mathematical expressions including both inline and isolated expressions
in scientific document images that usually consist of heterogeneous components (e.g., figures, tables, text
and expressions). In the system, a hybrid method of two stages is proposed for the effective detection of
mathematical expressions. First, the layout analysis of entire document images is introduced to improve the
accuracy of text line and word segmentation. Then, both isolated and inline expressions in document images
are detected. Both hand-crafted and deep learning features are extensively investigated and combined to
improve the detection accuracy. Furthermore, a generic performance metric is applied to evaluate the system
comprehensively. The proposed method has been evaluated on two public benchmark datasets (Marmot
and GTDB). The obtained accuracies of isolated and inline expressions in the Marmot dataset are 91.18%
and 81.35% while those in the GTDB dataset are 89.51% and 80.20%, respectively. The performance
comparison is carried out with the conventional methods to show the outstanding effectiveness of the
proposed system. Moreover, extensive experiments have been performed in order to point out the effect
of document image resolution and post processing techniques on mathematical expression detection.

INDEX TERMS Mathematical expression detection, Document analysis, Machine learning, Neural
Network, Fusion Technique.

I. INTRODUCTION
Mathematical expressions are widely used in scientific documents, and a huge number of scientific documents have been produced over the years. Therefore, the demand for document digitization for research and study purposes has continuously increased [1]. Detection of mathematical expressions in documents is considered an essential step in document information retrieval systems. The detection typically consists of three main steps: page segmentation, classification of mathematical expressions and normal text, and post-processing. In scientific documents, mathematical expressions are classified into two categories, i.e., isolated (displayed) and inline (embedded) expressions. Isolated expressions are displayed on separate lines, while inline expressions are mixed with other components of the document page, e.g., text and figures. Figure 1 illustrates some examples of isolated and inline expressions, marked in red and blue, respectively. The detection of mathematical expressions has recently received much research attention [1].

In the literature, the accuracy of the detection of isolated expressions has been gradually improved. However, the detection of inline expressions still suffers from low accuracy [2]. There are many challenges in the detection of inline expressions, including the variety of mathematical symbols and the complex layout of mathematical structures. In practice, inline expressions may consist of subscripts and superscripts associated with mathematical symbols or variables. As shown in Figure 1, inline expressions consist of mathematical operators (e.g., Σ, ∫, β, +, -, *, /), functions (log, sin, cos) and variables (i, j). The accuracy of the detection of inline expressions can also be affected by punctuation marks and noise. Most existing detection methods have utilized heuristic rules or machine learning approaches with hand-crafted feature extraction.


FIGURE 1: Examples of the isolated and inline expressions in a sample document page that are marked in red and blue
bounding boxes, respectively.

Those methods can be efficient in specific cases; however, they are not robust to various document layouts. In addition, private datasets [2] have been used for testing, and the precision and recall metrics have been employed for the performance evaluation. These metrics are popular but cannot fully reflect the quality of the detection. In reality, mathematical expressions can be detected completely or only partially correctly; in some cases, the expressions cannot be detected at all, or other components of the document are identified as expressions. These accuracy flaws have caused many difficulties in the development of mathematical expression detection systems.

The paper presents an extension of the work reported in [3]; compared with the previous work, there are three main improvements:

(1) Page segmentation is a prerequisite step for the detection of mathematical expressions. Accurate segmentation of text lines and words allows high detection accuracy to be obtained. The challenges of page segmentation for mathematical expression detection are not only the complicated layout of documents but also the variation in the sizes and styles of the expressions. In order to overcome these obstacles, the work first over-segments the text lines and words from the input document images. Then, the over-segmented text lines and words are merged by using heuristic thresholds on the white space of the document background. In the paper, an analysis is carried out to evaluate the impact of the page segmentation results on the accuracy of mathematical expression detection.

(2) A hybrid method that combines both hand-crafted and deep learning features is proposed to improve the accuracy of the detection of mathematical expressions. In this work, Fast Fourier Transform (FFT) magnitude and phase are used as features for isolated expression and normal text line classification, while the parameters of the Gaussian distribution of peaks and valleys of both the vertical and horizontal projection profiles of word images are used for inline expression and textual word classification. As a Convolutional Neural Network (CNN) allows the rich visual features of images to be captured, transfer learning techniques are applied to two pre-trained CNN models, AlexNet [5] and ResNet-18 [6], for mathematical expression and text line classification.

(3) A generic performance metric and public datasets are used to evaluate the system clearly. The proposed system is tested on two benchmark datasets that have clear ground-truth information of mathematical expressions in order to obtain an in-depth evaluation of its effectiveness. It is worth mentioning that most detection methods have been evaluated on private datasets that are unavailable for research [2].

The rest of the paper is organized as follows. Section II overviews significant related works. Section III presents the architecture of the proposed system in detail. In Section IV, experimental results are shown and discussed. Finally, Section V gives the conclusion and future work.

II. RELATED WORK
This section reviews the works significantly related to the detection of isolated and inline expressions in image-based and PDF formats.

A. DOCUMENT LAYOUT ANALYSIS
Page segmentation aims to decompose a document image into homogeneous regions through several steps. First, image pre-processing (noise removal and skew correction) is performed. Then, each component (e.g., text, figure, or table) is separated based on its structural layout. Traditional document layout analysis techniques can be divided into four types: top-down, bottom-up, multi-scale resolution and hybrid methods [7]. Top-down methods split the page image into smaller components [8], [9]: a page is split into blocks, blocks are split into text lines and text lines are split into words. In general, top-down methods are useful for the segmentation of rectangular layouts; however, they are not very effective for documents with complex structures. Bottom-up methods analyze and merge local pixels in order to form larger components such as characters, words, text lines and paragraphs [10], [11]. Compared with top-down methods, bottom-up methods show higher page segmentation performance, but they have high computational complexity. Multi-scale resolution methods analyze the page structure based on features extracted at different resolution levels of the document image [12], [13]. Then, the features are used for text and non-text classification. Finally, text regions are split into text lines by using a set of rules on the number and intensity of pixels. The difficulty of these methods is the estimation of the distance parameters between components in a document page. Hybrid methods combine bottom-up and top-down techniques and are effective for the segmentation of documents with complex structures [7], [14]. Connected components and delimiters (white space, tab stops) in a document page are extracted, filtered and analyzed. After that, various heuristic strategies are applied to reduce page segmentation errors. For the purpose of mathematical expression detection, the text regions in the body of the document are the focus of analysis. A text region is segmented into text lines, which are the basic units for displayed expression detection. Words segmented from a text line are the basic units for inline expression detection. For literature documents, there is not much variation among text lines, so text line segmentation usually achieves high accuracy [15], [16]. In contrast, there is variation in the height of and distance between text lines in scientific documents that contain mathematical notation, which causes many errors in text line segmentation. One typical segmentation error is that a large mathematical expression is split into many lines. Therefore, additional techniques (e.g., rule-based and learning-based methods) are integrated to improve the accuracy of text line segmentation [17], [18]. The basic idea of these techniques is that all text lines are first split, and then consecutive text lines are merged to form the entire expression if they belong to components of the same mathematical expression. The text lines are merged if the vertical distance between them is smaller than a predefined threshold. Similarly, consecutive words are merged to form the entire expression if they belong to the same expression; the words are merged if the horizontal distance between them is smaller than a predefined threshold. In recent years, deep learning approaches have been utilized for page segmentation. The advantage of these approaches is that the page segmentation task is performed without prior knowledge of the document structure. The work in [19] proposed a simple CNN with one layer to perform page segmentation; the input of the CNN is a gray-scale document image. The work in [20] employed a DNN based on ResNet-50 [6] to segment historical document pages.

B. MATHEMATICAL EXPRESSION DETECTION
1) Mathematical expression detection in document images
Mathematical expression detection has been studied for more than twenty years [21]. In traditional detection approaches, page segmentation is normally performed to obtain text lines and words. Then, hand-crafted feature extraction is designed to discriminate the mathematical expressions from text. The difference between the approaches lies in the way the features are extracted and in the classifiers used. In the early research on mathematical expression detection [22], all text lines and words in a document page are scanned in order to get primitive tokens. After that, each token is determined to belong to an inline expression or not by checking predefined expression forms. The accuracy of detection is not reported in this research. The research in [23] concluded that it is difficult to detect all inline expressions without using character recognition results.

The method reported in [24] employs the results of two commercial optical character recognition (OCR) systems to extract inline formulas. First, existing OCR systems are applied to obtain the content of the document images. Then, sentences containing inline expressions are determined by computing word n-grams.

FIGURE 2: Overall description of the proposed system for mathematical expression detection. The detection of isolated, inline
and ground-truth expressions are marked in blue, black and red, respectively.

For each sentence, several features of each word are extracted to determine whether the word is part of an inline expression. The features of words mentioned in the work are:
(1) The probability of a sentence containing inline expressions.
(2) The confidence of the OCR systems while recognizing words.
(3) The type style (italic, bold) of words.
(4) The space between the characters of words.
(5) The variation of the position of characters in words.
If some consecutive words in a sentence are determined to be inline expressions, these words can be grouped to obtain an inline expression. It is obvious that the above features highly depend on the results of existing OCR systems.

The method in [25] aims at detecting both isolated and inline expressions in document images. The method first applies a low-cost text line segmentation technique [26] to heterogeneous document images. Then, features of each text line are extracted. After the feature extraction, a Support Vector Machine (SVM) classifier is used to determine whether the text line is an isolated expression. Non-isolated text lines are segmented into words and features of the words are extracted to check whether each word belongs to an inline expression. The extracted word features in the method are described as follows:
(1) The density of black pixels in the word image.
(2) The proportion of the height of the word to that of the whole document.
(3) The fluctuation of the "centroid" of the characters in the word.
The word features are effective for the detection of special symbols but not accurate for the detection of inline expressions. The precision and recall of the detection of inline expressions are reported at 80% and 48%, respectively.

In recent years, DNNs have shown outstanding performance in mathematical expression recognition and detection tasks [27]-[29]. The work in [27] takes advantage of CNNs for the detection of isolated and inline expressions in document images. A CNN architecture based on U-net [30] is used for detecting mathematical expressions. The document images are divided into blocks, and the annotated information of the characters in the blocks is used for training the CNN. The purpose of the method is to obtain the connected components of expressions, and post-processing is performed to obtain accurate expressions. For the CNN, training on different datasets can improve the detection accuracy. Moreover, the accuracy of the detection depends on the size of the image blocks used in training the CNN. The precision and recall achieved by the method for the detection of mathematical expressions are 95.2% and 91%, respectively. As stated in the paper, mathematical symbols are detected with high accuracy; however, the layout analysis of symbols has not been implemented to construct complete expressions. The italic and bold type styles of words can cause errors in the detection of inline expressions.

2) Mathematical expression detection in native PDF documents
In recent years, several studies [21], [31], [32] have focused on the detection of mathematical expressions in PDF documents. For PDF documents, metadata information of textual words such as font, size and style can be extracted precisely. Therefore, the detection of mathematical expressions in PDF documents is more accurate than in image-based documents. The method reported in [31] extracts inline expressions in PDF documents with the use of natural language processing. After the word extraction process, word features and a conditional random field (CRF) are used for inline expression detection. The achieved detection accuracy is 88.95% on PDF files from the ACL Anthology dataset [31], but many errors in the detection of variables are reported in the research.

The research in [33] attempts to detect mathematical expressions in PDF documents by taking advantage of CNNs. The framework for the detection consists of two steps. In the first step, candidate regions for mathematical expressions are generated.

(a) Example of a document page.

(b) The horizontal projection profile of the sample page.

(c) The text line segmentation of the sample page.

FIGURE 3: Example of the text line segmentation in a sample document image. The input sample page (a), the horizontal
projection profile of the page image (b) and the text line segmentation of the page (c). The x-axis represents the sum of black
pixels of each row in page image and y-axis represents the rows of the image.


FIGURE 4: The word segmentation of the text line image (a) based on the estimation of the vertical projection profile (b)
and the results (c). The x-axis represents the columns of text line image and y-axis represents the sum of black pixels of each
column.

For the generation of candidate regions, metadata information including the position and fonts of characters is extracted from the PDF files. In the second step, the features of the candidate regions are extracted in order to obtain the entire mathematical expressions. In this step, two deep networks are combined to automatically extract features of the candidate regions. The first network is a CNN and the second one is a Recurrent Neural Network (RNN). The CNN is employed to extract visual features of the images and the RNN is utilized to extract sequential information of the characters. After that, the features are combined to improve the accuracy of expression detection. A large dataset (12,000 document pages containing more than 22,000 mathematical expressions) is manually prepared for training the deep networks.

As mentioned above, a number of works have been proposed for isolated and inline expression detection. However, the performance of inline expression detection still needs to be improved. In our work, a hybrid method for mathematical expression detection is proposed that focuses on improving the accuracy of the detection of inline expressions. We combine hand-crafted and deep learning features in order to obtain more accurate detection of mathematical expressions in various document layouts.

III. SYSTEM ARCHITECTURE
The proposed system is illustrated in Figure 2. It takes a binary document image as input and outputs an image with the position information of the detected mathematical expressions. Like other document analysis and expression detection methods, the input of the proposed method is a non-skewed document image. For skewed and curved images, deskew [34] and dewarping [35] algorithms must be applied. In our work, the image dewarping algorithm in [35] is adopted to correct the distortions of the input documents. By considering information from both text and non-text regions, the dewarping algorithm is designed for a wide range of document layouts and can handle camera-captured and scanned document images. After the pre-processing, the document is analyzed to obtain text lines for isolated expression detection. Text lines that are not isolated expressions are segmented into words for inline expression detection. After the segmentation, the combination of hand-crafted and deep learning features is applied in the isolated and inline expression detection modules. Finally, post-processing is performed in order to obtain accurate position information of the mathematical expressions in the document images.

A. PAGE SEGMENTATION
The projection profile [36] of a document image is applied for the page segmentation. The estimation of the projection profile of images is performed recursively to analyze the structure of the documents [36]. The horizontal and vertical projection profiles of an image are the horizontal and vertical distributions of black pixels. Thus, the technique is useful for the analysis of scanned documents. To obtain text regions, text and non-text regions can be classified based on the following layout features:
(1) The width and height of regions: the height of tables and figures is normally larger than that of text lines, while the width of non-text components is smaller than that of text lines. Thresholds can be used for confirming the text and non-text components; in this work, the threshold on the height of a text line is set from 50 to 400 pixels and that on the width is set from 200 to 4000 pixels.

(2) The number of connected components in regions: text lines normally consist of more connected components than non-text elements. In our work, a heuristic threshold of 5 connected components is chosen for the filter. In other words, a text line typically consists of more than 5 connected components.
In fact, this heuristic filtering is used to remove non-text components (e.g., small noise, tables, figures). After the text and non-text classification, text regions are segmented into text lines. The text lines are segmented by using a threshold on the vertical distance between them; the threshold is set to 20 pixels in this case. An example of the text line segmentation of a document page is shown in Figure 3. The input sample page, the horizontal projection profile of the page and the results of the text line segmentation are illustrated in Figures 3(a), 3(b) and 3(c), respectively. In Figure 3(b), the x-axis and the y-axis represent the sum of black pixels and the rows of the page image, respectively.
Segmented text lines are fed into the isolated expression detection module in order to identify isolated expressions. Then, text lines that are not determined to be isolated expressions are segmented into words. The words are segmented by using a threshold on the horizontal distance between them; the threshold is set to 10 pixels in this case. The segmented words are fed into the inline expression detection module in order to identify inline expressions. An example of the word segmentation of a text line is shown in Figure 4. The word segmentation of the text line, the vertical projection profile of the text line and the results of the word segmentation are illustrated in Figures 4(a), 4(b) and 4(c), respectively. In Figure 4(b), the x-axis represents the columns and the y-axis represents the sum of black pixels of the text line image.
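For illustration only, the following sketch (not part of the original system; the function name, the NumPy-based implementation and the synthetic example are assumptions, while the 20-pixel merging gap mirrors the value quoted above) shows how text-line bands can be obtained from the horizontal projection profile of a binarized page:

```python
import numpy as np

def segment_rows(binary_page: np.ndarray, min_gap: int = 20):
    """Split a binarized page (1 = black pixel) into text-line row bands.

    A row belongs to a text line when its horizontal projection (sum of
    black pixels) is non-zero; a new band starts when the white gap between
    inked rows exceeds `min_gap` pixels (20 px, as in the text above).
    """
    profile = binary_page.sum(axis=1)          # horizontal projection profile
    rows = np.where(profile > 0)[0]            # rows containing ink
    if rows.size == 0:
        return []

    bands = []
    start = prev = rows[0]
    for r in rows[1:]:
        if r - prev > min_gap:                 # large white gap -> new text line
            bands.append((start, prev))
            start = r
        prev = r
    bands.append((start, prev))
    return bands                               # list of (top_row, bottom_row)

# Example: three synthetic "text lines" on a 300x200 page
page = np.zeros((300, 200), dtype=np.uint8)
page[40:60, 10:190] = 1
page[100:120, 10:190] = 1
page[200:230, 10:120] = 1
print(segment_rows(page))                      # [(40, 59), (100, 119), (200, 229)]
```

The same logic, applied to the vertical projection profile of a text line with a 10-pixel gap, yields the word segmentation described above.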
B. MATHEMATICAL EXPRESSION DETECTION
1) Expression detection by using hand-crafted feature extraction
The flowchart of the isolated and inline expression classification is described in Figure 5. In the hand-crafted feature extraction approach, powerful feature extraction and classifiers are applied to improve the accuracy of the classification of both isolated and inline expressions. Text line images are transformed from the spatial to the frequency domain by using the FFT [37] in order to classify isolated expressions and normal text lines. The dominant features of mathematical symbols are emphasized by the transformation. Both the FFT magnitude and phase values are used as features of the text line images; they allow isolated expressions to be clearly discriminated from normal text lines. In fact, the white space between the characters of isolated expressions is larger than that of normal text lines, and the density of black pixels in isolated expressions is lower than that of normal text lines.
In order to improve the accuracy of the classification of inline expressions and textual words, feature extraction based on the projection profiles of word images is applied [38]. First, both the vertical and horizontal projection profiles of the word images are calculated. Then, the parameters of the Gaussian distribution of the peaks and valleys of both the vertical and horizontal projection profiles of the word images are used as features. Peaks and valleys are the local maxima and minima of the projection profiles of the word images. The features of the vertical and horizontal projection profiles of the word images are extracted as follows:
(1) The number of peaks in the vertical and horizontal projection profiles.
(2) The mean (average) of the values of the peaks in the vertical and horizontal projection profiles.
(3) The standard deviation of the values of the peaks in the vertical and horizontal projection profiles.
(4) The number of valleys in the vertical and horizontal projection profiles.
(5) The mean (average) of the values of the valleys in the vertical and horizontal projection profiles.
(6) The standard deviation of the values of the valleys in the vertical and horizontal projection profiles.
By using this feature extraction, two-dimensional layout properties of inline expressions are extracted, which can improve the accuracy of the classification.
After the feature extraction, a Random Forest (RF) classifier is used for the classification. Compared with other machine learning classifiers such as the Support Vector Machine (SVM) and k-Nearest Neighbor (k-NN), the RF shows better classification results [38]. The RF demonstrates its effectiveness in the classification task because it aggregates a large number of classification results from decision trees [39]. For training the RF for the classification of isolated expressions, labels of two classes (isolated expression and text line) are prepared manually, and the extracted FFT-based features and labels are used to train the classifier. Similarly, for training the RF for the classification of inline expressions, labels of two classes (inline expression and word) are also prepared manually; then, the extracted projection-profile-based features and labels are used to train the classifier. The number of training cycles is set to 100 and the adaptive logistic regression algorithm [40] is used for training the Random Forest. Finally, the trained model is used for the classification of the testing data.
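A rough illustration of this hand-crafted pipeline is sketched below. It is not the authors' implementation: the feature functions, the synthetic training data and the plain scikit-learn Random Forest are assumptions (the paper trains the forest with adaptive logistic regression [40]); only the general idea of FFT features, projection-profile peak/valley statistics and a 100-tree forest follows the description above.

```python
import numpy as np
from scipy.signal import find_peaks
from sklearn.ensemble import RandomForestClassifier

def fft_features(line_img: np.ndarray) -> np.ndarray:
    """FFT magnitude and phase of a (resized) text-line image, flattened."""
    spectrum = np.fft.fft2(line_img.astype(float))
    return np.concatenate([np.abs(spectrum).ravel(), np.angle(spectrum).ravel()])

def profile_features(word_img: np.ndarray) -> np.ndarray:
    """Count / mean / std of peaks and valleys of both projection profiles."""
    feats = []
    for axis in (0, 1):                          # vertical and horizontal profiles
        profile = word_img.sum(axis=axis).astype(float)
        peaks, _ = find_peaks(profile)
        valleys, _ = find_peaks(-profile)
        for idx in (peaks, valleys):
            vals = profile[idx] if idx.size else np.zeros(1)
            feats += [idx.size, vals.mean(), vals.std()]
    return np.array(feats)                       # 12 values (6 per profile)

# X: feature matrix, y: manual labels (0 = word, 1 = inline expression)
X = np.vstack([profile_features(np.random.randint(0, 2, (40, 120))) for _ in range(20)])
y = np.array([0, 1] * 10)
clf = RandomForestClassifier(n_estimators=100).fit(X, y)   # 100 trees, cf. 100 cycles
print(clf.predict(X[:2]))
```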
2) Expression detection by using Convolutional Neural Network
To improve the accuracy of the detection of both isolated and inline expressions, transfer learning of AlexNet [5] and ResNet-18 [6], which are popular neural networks, is employed. Compared with AlexNet, the architecture of ResNet-18 consists of deeper layers, and ResNet-18 normally shows better results in classification tasks [41]. For AlexNet, each input image is pre-processed and resized to 227×227×3. The network consists of 25 layers with 5 convolutional layers and 3 fully connected layers. The architecture and layer parameters of AlexNet are provided in Table 1. For ResNet-18, the input image is pre-processed and resized to 224×224×3. The network consists of 72 layers corresponding to 18 blocks. The architecture and layer parameters of ResNet-18 are provided in Table 2.

FIGURE 5: The flowchart of the isolated and inline expression detection by using hand-crafted feature extraction

FIGURE 6: The flowchart of the isolated and inline expression classification by using the transfer learning of CNNs

For the isolated and inline expression detection modules, the input images of the CNNs are text lines and words, respectively. Figure 6 illustrates the flowchart of the transfer learning of the CNNs for the isolated and inline expression detection modules. The dominant features are automatically extracted by the network without any domain-specific knowledge; for AlexNet and ResNet-18, 4096 and 512 visual features are extracted from each input image, respectively. Then, the classification is performed by the softmax layer of the network. The learning rate and the number of epochs are set to 0.001 and 20, respectively. The stochastic gradient descent with momentum (SGDM) algorithm [42], [43], with the momentum set to 0.9, is used for training the CNNs.
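A minimal transfer-learning sketch in PyTorch is given below. It is not the authors' code and the data loading is omitted, but the hyper-parameters follow the values quoted above (two output classes, learning rate 0.001, 20 epochs, SGD with momentum 0.9, 512-dimensional ResNet-18 features).

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained ResNet-18 and replace the final layer
# with a 2-class head (expression vs. normal text line / word).
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)        # 512 -> 2

criterion = nn.CrossEntropyLoss()                     # softmax + negative log-likelihood
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

def train(loader, epochs: int = 20):
    """Fine-tune on (image, label) batches; images resized to 224x224x3."""
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

# 512-dimensional visual features can be read from the penultimate layer:
backbone = nn.Sequential(*list(model.children())[:-1])        # ends at global average pool
features = backbone(torch.randn(1, 3, 224, 224)).flatten(1)   # shape [1, 512]
print(features.shape)
```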

FIGURE 7: The flowchart of the combination of hand-crafted and deep learning features in the classification of isolated and
inline expression.

3) Expression detection by combining the hand-crafted and deep learning features
In order to leverage the advantages of both the hand-crafted features and the CNN models, the decision results obtained from these features are combined using a score-based fusion technique [44], [45]. Concretely, in this work, the confidence scores obtained from the hand-designed features with the RF and from the CNN features with softmax are combined using the product and average operators. Let p1 and p2 be the predicted scores of the hand-designed features with the RF and of the fine-tuned CNN with softmax, respectively. The final prediction p is determined as follows:

p = F(p1, p2) (1)

where F is the combination operator. The flowchart of the combination is described in Figure 7. The obtained score is used to classify expressions and text.
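The fusion of Eq. (1) can be sketched as follows; this is an illustrative snippet in which the variable names and the two-class probability format are assumptions, while the product and average operators follow the description above.

```python
import numpy as np

def fuse_scores(p_rf: np.ndarray, p_cnn: np.ndarray, op: str = "product") -> int:
    """Combine two per-class confidence vectors and return the winning class.

    p_rf  : class probabilities from the Random Forest (hand-crafted features)
    p_cnn : softmax probabilities from the fine-tuned CNN
    op    : "product" or "average" combination operator F in Eq. (1)
    """
    if op == "product":
        p = p_rf * p_cnn
    elif op == "average":
        p = (p_rf + p_cnn) / 2.0
    else:
        raise ValueError("unknown operator")
    return int(np.argmax(p))            # e.g. 0 = normal text/word, 1 = expression

# Example: the RF is unsure while the CNN is confident the input is an expression
print(fuse_scores(np.array([0.55, 0.45]), np.array([0.10, 0.90])))   # -> 1
```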

TABLE 1: AlexNet architecture and layer parameters


Layer id Layer Name Layer type Layer parameters
1 imageInputLayer Input Image 227×227×3
2 conv1 Convolution 55×55×96
3 relu1 ReLU 55×55×96
4 norm1 Cross Channel Normalization 55×55×96
5 pool1 Max Pooling 27×27×96
6 conv2 Grouped Convolution 27×27×256
7 relu2 ReLU 27×27×256
8 norm2 Cross Channel Normalization 27×27×256
9 pool2 Max Pooling 13×13×256
10 conv3 Convolution 13×13×384
11 relu3 ReLU 13×13×384
12 conv4 Grouped Convolution 13×13×384
13 relu4 ReLU 13×13×384
14 conv5 Grouped Convolution 13×13×256
15 relu5 ReLU 13×13×256
16 pool5 Max Pooling 6×6×256
17 fc6 Fully Connected Layer 1×1×4096
18 relu6 ReLU 1×1×4096
19 drop6 Dropout 1×1×4096
20 fc7 Fully Connected Layer 1×1×4096
21 relu7 ReLU 1×1×4096
22 drop7 Dropout 1×1×4096
23 fc8 Fully Connected Layer 1×1×2
24 prob Softmax 1×1×2
25 output Classification output Output result

TABLE 2: ResNet-18 architecture and layer parameters


Layer Name Output Size Layer parameters
conv1            112×112×64   7×7, 64, stride 2
conv2_x          56×56×64     3×3 max pool, stride 2; [3×3, 64; 3×3, 64] ×2
conv3_x          28×28×128    [3×3, 128; 3×3, 128] ×2
conv4_x          14×14×256    [3×3, 256; 3×3, 256] ×2
conv5_x          7×7×512      [3×3, 512; 3×3, 512] ×2
average pool     1×1×512      7×7 average pool
fully connected  2            512×2 fully connected
softmax          2            Classification results

TABLE 3: Statistics of the Marmot and GTDB datasets


Datasets                                 GTDB                                    Marmot
                                         GTDB-1 (Training)  GTDB-2 (Testing)    Training  Testing
Number of pages                          569                236                 330       70
Number of isolated expressions           4218               2488                1322      253
Number of inline expressions             22178              9397                6951      956
Number of text fonts                     30                                     18
Average number of expressions per page   47.55                                  23.70

C. POST-PROCESSING
In the detection of mathematical expressions, it is not rare that large isolated expressions are split into several text lines [21]. Several strategies have been proposed to overcome this issue [21], [46]; these strategies rely on character recognition results to determine the conditions for merging successive text lines into an expression. In our work, a heuristic threshold on the white space between successive text lines is carefully considered for the post-processing. Two successive text lines are merged if the vertical distance between them is smaller than a threshold (100 pixels in this work).
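A possible implementation of this merging rule is sketched below; the bounding-box tuple format and helper name are assumptions, while the 100-pixel threshold is the value stated above.

```python
def merge_split_expressions(lines, h_threshold: int = 100):
    """Merge vertically adjacent text lines that were both classified as
    isolated expressions, following the |y_i - y_j| <= H rule (H = 100 px here).

    Each line is (x, y, w, h, is_expression); returns the merged boxes.
    """
    merged = []
    for box in sorted(lines, key=lambda b: b[1]):          # top-to-bottom order
        x, y, w, h, is_expr = box
        if merged and is_expr and merged[-1][4] and abs(y - merged[-1][1]) <= h_threshold:
            px, py, pw, ph, _ = merged[-1]
            merged[-1] = (min(px, x), py,
                          max(px + pw, x + w) - min(px, x),
                          (y + h) - py, True)              # union of the two boxes
        else:
            merged.append(box)
    return merged

# Two halves of one displayed equation, 60 px apart vertically, plus a text line
boxes = [(50, 200, 400, 40, True), (50, 260, 380, 40, True), (50, 520, 400, 30, False)]
print(merge_split_expressions(boxes))
```

An analogous rule with a horizontal threshold merges successive words into an inline expression, as described below.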

By using this threshold, the text lines can be merged efficiently to obtain entire expressions without using any additional character recognition modules.
Let line_i and line_j be successive text lines that are both classified as isolated expressions. The text lines are considered for merging into an entire isolated expression if the following condition is satisfied:

|y_i - y_j| ≤ H (2)

where y_i and y_j are the y-coordinates of the top-left corners of the text lines and H is the predefined threshold. The threshold is set by observing the average height of the text lines in the whole document.
Similarly, two successive words are merged to form an entire inline expression if the horizontal distance between the words is smaller than a threshold (20 pixels in this work).
Examples of the post-processing are demonstrated in Figure 8. The expression is split into two text lines in Figure 8(a) and the text lines are merged to form the entire expression in Figure 8(b).

IV. EXPERIMENTAL RESULTS
A. DATASET
In this section, the two public datasets used for the performance evaluation of mathematical expression detection are described.
The first one is the Marmot public dataset [2]. It consists of 400 non-skewed scientific document pages with 1575 isolated and 7907 inline expressions. The resolution of each page image is around 500 dpi. In the dataset, 18 text fonts are used in the documents: ArialMT, Courier, Helvetica, NimbusRomNo9L, Lasy9, TimesNewRomanPSMT, TimesNewRomanPS-ItalicMT, TimesNewRomanPS-BoldMT, CMMI10, MSBM10, CMR7, CMSY10, CMBX12, CMEX10, CMTI10, SymboIMT, Universal-GreekwithMathPi, GillSans-BoldCondensed. The text size varies from 4px to 22px. The training and testing splits are described in Table 3. The number of isolated and inline expressions per page is shown in Figures 9(a) and 9(b), respectively; in these figures, the y-axes represent the number of expressions and the x-axes represent the pages in the dataset. Each document page contains an average of 4 and a maximum of 20 isolated expressions, and an average of 20 and a maximum of 90 inline expressions. For each page, the position information of the top-left and bottom-right corners of each expression is stored in XML files, whose structure is described in Figure 9(c). The precise bounding boxes of the expressions are represented by hexadecimal numbers that consist of 16 characters. The symbols of the mathematical expressions are also annotated in the XML files. The ground truth was created by a semi-automatic tool and is publicly available for research on mathematical expression detection.
The second one is the GTDB public dataset [27]. It has recently been used for the performance evaluation of research such as [48]. The dataset consists of diverse font and mathematical symbol styles. In the dataset, 30 text fonts are used in the documents: TimesNewRoman, Arial, CourierNew, AGaramondPro-Regular, HiddenHorzOCR, Helvetica, CMR6, CMBX10, CMR8, CMCSC10, CMMI8, CMR10, CMMI10, CMSY7, CMEX10, CMMI7, CMR7, CMSY10, MSBM10, CMTI8, CMSY6, CMR5, CMTI10, CMSY8, CMMI6, CMMI5, MSBM7, CMSY5, CMBX8, CMTT8. The text size varies from 4px to 28px. The dataset consists of scientific articles in PDF format. For copyright reasons, the dataset does not directly provide the PDF articles; however, links to the articles are provided. In order to obtain document images, the PDF articles in the dataset are converted at 600 dpi. Compared with the Marmot dataset, the GTDB dataset is larger and more challenging for expression detection due to the complexity of the document layouts and the diversity of the scientific articles. The GTDB-1 subset is used for training and the GTDB-2 subset is used for testing. The statistics of the datasets are described in Table 3. The datasets provide ground-truth bounding boxes for both character and mathematical expression regions; the ground-truth bounding boxes are stored in CSV files. In our work, the information of the expression regions is used for the performance evaluation.

B. EVALUATION METRIC
In order to obtain an in-depth analysis of the proposed system, the Intersection over Union (IoU) metric is used in our work. The metric, also known as the Jaccard index, is widely used to evaluate the performance of object detection systems [47]. The metric is the ratio of the overlap and the union of the detected and ground-truth regions. In our work, the detected and ground-truth expressions are represented by the coordinates of the top-left corner and the size of the bounding boxes of the expressions. The IoU is calculated as follows:

IoU = area(Bp ∩ Bgt) / area(Bp ∪ Bgt) (3)

where Bp ∩ Bgt and Bp ∪ Bgt denote the intersection and union of the predicted and ground-truth bounding boxes of the expressions.
The IoU value lies in the closed interval [0; 1] and a larger value indicates a better detection result. Based on a threshold on the metric, the detection results are divided into four categories as follows:
1) Correct: the IoU value of the detected and ground-truth regions is in the closed interval [0.5; 1].
2) Partial: the IoU value of the detected and ground-truth regions is in the interval (0; 0.5).
3) Missed: the mathematical expression cannot be detected by the proposed system.
4) False: other components of the document page are detected as mathematical expressions.
By using this evaluation metric, the quality of the detection is clearly reflected.
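For reference, a small helper matching Eq. (3) and the outcome categories can be written as follows; the (x, y, w, h) box format and function names are illustrative assumptions, not part of the paper.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x, y, w, h) boxes, as in Eq. (3)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def categorize(detected, ground_truth):
    """Correct if IoU in [0.5, 1], Partial if in (0, 0.5), Missed otherwise."""
    value = iou(detected, ground_truth)
    if value >= 0.5:
        return "Correct"
    return "Partial" if value > 0 else "Missed"

print(iou((0, 0, 100, 40), (10, 0, 100, 40)))        # 3600/4400 ~ 0.818
print(categorize((0, 0, 100, 40), (10, 0, 100, 40)))  # Correct
```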

(a) Before the post-processing

(b) After the post-processing

FIGURE 8: Example of the post-processing of a mathematical expression that is split into two text lines. The detected and
ground-truth expressions are marked in blue and red, respectively. (a) before and (b) after the post-processing.

TABLE 4: Performance comparison between the proposed and existing methods of isolated expression detection on the Marmot
dataset (highest scores are in bold)
Method                                        Isolated expression can be detected        Error in the detection
                                              Correct    Partial    Total                Missed    False    Total
Method in [25] 26.87% 44.89% 71.76% 9.89% 18.35% 28.24%
Our method
using FFT and RF 31.02% 42.32% 73.34% 9.04% 17.62% 26.66%
using transfer learning of AlexNet 47.22% 41.44% 88.66% 2.78% 8.56% 11.34%
using transfer learning of ResNet-18 50.89% 39.27% 90.16% 3.55% 6.29% 9.84%
combining features with average operator 51.34% 39.45% 90.79% 3.55% 5.66% 9.21%
combining features with product operator 51.34% 39.84% 91.18% 3.14% 5.68% 8.82%

TABLE 5: Performance comparison between the proposed and existing methods of inline expression detection on Marmot
dataset (highest scores are in bold)
Method                                        Inline expression can be detected          Error in the detection
                                              Correct    Partial    Total                Missed    False    Total
Method in [25] 1.74% 28.87% 30.61% 9.93% 59.46% 69.39%
Our method
using projection profile and RF 11.05% 41.40% 52.45% 8.36% 39.19% 47.55%
using transfer learning of AlexNet 21.54% 56.25% 77.79% 7.60% 14.61% 22.21%
using transfer learning of ResNet-18 22.68% 57.06% 79.74% 5.59% 14.67% 20.26%
combining features with average operation 22.79% 57.96% 79.85% 5.79% 14.36% 20.15%
combining features with product operation 22.90% 58.45% 81.35% 5.40% 13.25% 18.65%

TABLE 6: Performance comparison between the proposed and existing methods of isolated expression detection on the GTDB
dataset (highest scores are in bold)
Method                                        Isolated expression can be detected        Error in the detection
                                              Correct    Partial    Total                Missed    False    Total
Method in [25] 26.22% 44.87% 71.09% 9.91% 19.00% 28.91%
Our method
using FFT and RF 30.86% 42.12% 72.98% 9.25% 17.77% 27.02%
using transfer learning of AlexNet 47.05% 41.16% 88.21% 3.78% 8.01% 11.79%
using transfer learning of ResNet-18 50.29% 38.67% 88.96% 3.85% 7.19% 11.04%
combining features with average operator 50.34% 39.15% 89.49% 3.57% 6.94% 10.51%
combining features with product operator 50.37% 39.14% 89.51% 3.16% 7.33% 10.49%

C. PERFORMANCE EVALUATION
1) Performance evaluation of the detection of isolated and inline expressions on different public datasets
The performance comparison between the proposed and conventional methods for isolated and inline expression detection on the Marmot dataset is shown in Tables 4 and 5, respectively. The proposed system outperforms the conventional method due to its effective document analysis strategies and novel classification techniques. In particular, the transfer learning of CNNs obtains the highest detection accuracy because the CNNs extract more visual features from the images than the other methods.

(a) Occurrence of isolated expression in Marmot dataset

(b) Occurrence of inline expression in Marmot dataset

(c) The structure of XML file in dataset


FIGURE 9: The occurrence of isolated (a) and inline (b) expressions in document pages in the Marmot dataset. The x-axis represents the page ID in the dataset and the y-axis represents the number of expressions. The structure of the XML file storing the ground-truth information of an expression (c).

The method in [25] focuses on extracting features of the bounding boxes of characters in word images. That method is not effective for the detection of inline expressions because there is not much variation in the visual appearance of inline expressions. The method using FFT and projection profiles of images obtains higher accuracy than the method in [25] because it can extract two-dimensional layout features of mathematical expressions.

TABLE 7: Performance comparison between the proposed and existing methods of inline expression detection on GTDB
dataset (highest scores are in bold)
Method                                        Inline expression can be detected          Error in the detection
                                              Correct    Partial    Total                Missed    False    Total
Method in [25] 1.56% 28.67% 30.23% 9.97% 59.80% 69.77%
Our method
using projection profile and RF 10.48% 41.36% 51.84% 8.26% 39.90% 48.16%
using transfer learning of AlexNet 20.46% 55.24% 75.70% 7.86% 16.44% 24.30%
using transfer learning of ResNet-18 22.16% 56.34% 78.50% 6.29% 15.21% 21.50%
combining features with average operation 22.69% 56.65% 79.34% 5.68% 14.98% 20.66%
combining features with product operation 22.76% 57.44% 80.20% 5.46% 14.34% 19.80%

It is clearly shown in Table 5 that the accuracy of the detection of inline expressions is much improved by using the transfer learning of CNNs. The performance of the method using the transfer learning of ResNet-18 is slightly higher than that of AlexNet. This outperformance is obtained because the deeper architecture of ResNet-18 allows visual features to be extracted better than with AlexNet. The combination of the RF and ResNet-18 obtains the highest performance in isolated expression detection because the predicted scores of the two models are aggregated for the final classification and misclassification is reduced.
For inline expressions, the percentage of partial detections is much higher than that of the correct category. Thus, the percentage of inline expression detections in various ranges of IoU values within the partial category is evaluated in order to obtain a further analysis of the detection performance. The percentage of inline expression detections in the different IoU ranges is shown in Figure 10(b). The percentage fluctuates only slightly across the five ranges ((0; 0.1], (0.1; 0.2], (0.2; 0.3], (0.3; 0.4], (0.4; 0.5)) of the partial detection category, and the percentage in the lowest range (0; 0.1] is not much higher than that of the other ranges. The results show that the proposed method can detect inline expressions in difficult cases (e.g., expressions consisting of small mathematical symbols). The percentage of isolated expression detections in the different IoU ranges is shown in Figure 10(a). The figure shows that the percentage of isolated expression detections in the highest range (0.4; 0.5) is much higher than that of the other ranges, which demonstrates the effectiveness of the proposed method for isolated expression detection.
The performance comparison between the proposed and conventional methods for isolated and inline expression detection on the GTDB dataset is shown in Tables 6 and 7, respectively. Compared to the Marmot dataset, the detection of mathematical expressions in the GTDB dataset is more challenging. In the GTDB dataset, the distance between consecutive text lines and words is narrower than in the Marmot dataset, and there is more variation in type styles (font and size of characters). Therefore, the detection performance on the GTDB dataset is lower than on the Marmot dataset.
The performance comparison between the proposed method and state-of-the-art methods on the GTDB dataset is shown in Table 8. For the GTDB dataset, the method of Samsung R&D based on graph theory [48] has shown the highest performance. However, it is worth noting that the method in [48] exploits the character-level information provided in the GTDB dataset for the detection of mathematical expressions; this information is not available in other datasets such as Marmot. As our method relies only on the appearance of mathematical expressions in the document images, it is general and can be applied to different datasets. In comparison with the similar method, the proposed method shows better performance than the Michiking system [48] because the employment of CNNs extracts features more efficiently than traditional rule-based and machine learning techniques.

2) Evaluation of the impact of image resolution on mathematical expression detection
For traditional methods based on OCR techniques, input document images are typically rendered at high resolution (around 600 dpi) [27] to prevent recognition errors. Our method detects mathematical expressions without using any OCR modules; thus, the detection can be performed on low-resolution document images. In order to evaluate the impact of the resolution on our method, the performance evaluation has been carried out on document images at various resolutions: the document images are rendered at 500, 300 and 150 dpi. The performance evaluation is shown in Figure 11. As shown in the figure, the percentage of missed and false detections increases slightly for document images at 300 dpi, whereas it increases noticeably for document images at 150 dpi. The results show that our proposed method can process document images rendered at 300 dpi or more. For document images rendered at low resolution, the error rate increases during the page segmentation; therefore, the overall error rate of expression detection increases significantly at low resolution. Figures 12(a) and 12(b) illustrate the detection of inline expressions in a sample page at 500 and 150 dpi, respectively.

3) Visualization of extracted features of images using the transfer learning of the CNN model
In order to demonstrate the effectiveness of the feature extraction of the fine-tuned CNNs, the distribution of the extracted features of the testing images of isolated expressions and normal text lines is visualized.

(a) Percentage of detected isolated expressions in partial detection category

(b) Percentage of detected inline expressions in partial detection category

FIGURE 10: Percentage of detected isolated and inline expressions in partial detection category. The partial detection category
is divided into five equal sub-ranges based on IoU values.

TABLE 8: Performance comparison of the proposed and the state of the art methods on the GTDB dataset
Method | Expression detection results with IoU ≥ 0.5 | Expression detection results with IoU ≥ 0.75
Method of Samsung based on graph theory [48] | 94.36% | 94.17%
Michiking system [48] | 36.87% | 19.10%
Proposed method | 50.17% | 43.19%


[Bar chart: percentages of correct, partial, missed and false detections of isolated expressions for document images rendered at 500, 300 and 150 dpi.]

(a) Performance evaluation of isolated expression detection in document images at various resolutions

[Bar chart: percentages of correct, partial, missed and false detections of inline expressions for document images rendered at 500, 300 and 150 dpi.]

(b) Performance evaluation of inline expression detection in document images at various resolutions

FIGURE 11: Performance evaluation of the detection of isolated (a) and inline (b) expressions in document images at various resolutions. The testing images in the Marmot dataset are rendered at 500, 300 and 150 dpi. The detection performance is denoted by blue, orange and gray for document images at 500, 300 and 150 dpi, respectively.

TABLE 9: The average time of the detection of expressions in a document page in the Marmot dataset by different methods
(Bold value indicates the smallest detection time)
Methods | Isolated detection (s) | Inline detection (s)
Method in [25] | 9.56 | 24.8
Our method using FFT, projection profile and RF | 2.6 | 13.04
Our method using AlexNet | 14.34 | 49.6
Our method using ResNet-18 | 22.17 | 56.5
Our method combining features | 22.30 | 56.6
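As a rough indication of why the first variant of our method in Table 9 is the fastest, the sketch below shows the kind of hand-crafted pipeline it refers to: peak and valley statistics of the projection profiles of a binarized image are fed to a random forest classifier. This is a simplified Python illustration under stated assumptions (SciPy and scikit-learn, binary images as NumPy arrays); the authors' implementation is in MATLAB and also uses FFT features for isolated expressions.

```python
# Simplified sketch of a projection-profile + random forest pipeline
# (illustrative only; not the authors' MATLAB implementation).
import numpy as np
from scipy.signal import find_peaks
from sklearn.ensemble import RandomForestClassifier

def projection_profile_features(binary_img):
    """binary_img: 2-D NumPy array with 1 for ink pixels and 0 for background."""
    feats = []
    for profile in (binary_img.sum(axis=1), binary_img.sum(axis=0)):
        peaks, _ = find_peaks(profile)      # local maxima of the profile
        valleys, _ = find_peaks(-profile)   # local minima of the profile
        feats += [len(peaks), len(valleys),
                  float(profile.max()), float(profile.mean()), float(profile.std())]
    return np.array(feats)

# With labelled training images (X: stacked feature vectors, y: 0 = text,
# 1 = expression), classification would then look like:
# clf = RandomForestClassifier(n_estimators=100).fit(X, y)
# label = clf.predict(projection_profile_features(new_image).reshape(1, -1))
```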

3) Visualization of extracted features of images using the transfer learning of CNN model
In order to demonstrate the effectiveness of the feature extraction of the fine-tuned CNNs, the distribution of the extracted features of testing images of isolated expressions and normal text lines is visualized. The features of isolated expressions and normal text lines extracted by the ResNet-18 are illustrated in blue and red in Figure 13, respectively. For the ResNet-18, the visual features are automatically extracted at the pool5 layer at the end of the network. A dimensionality reduction technique is then used to visualize the learned features of text line and isolated expression images. In this case, the 512 features extracted from each testing image of isolated expressions and text lines are reduced to 2 features for visualization by using the t-distributed Stochastic Neighbor Embedding (t-SNE) technique [49].

(a) Example of the detection of inline expressions in a sample page at 500 dpi.

(b) Example of the detection of inline expressions in a sample page at 150 dpi.

FIGURE 12: Examples of the detection of inline expressions in a sample page at 500 (a) and 150 (b) dpi. The inline expressions detected by the proposed system and the ground-truth expressions are marked in black and red, respectively.

The t-SNE technique has demonstrated better performance than other dimensionality reduction techniques (e.g., classical scaling [50] and principal component analysis [51]) on various datasets. Eight hundred images of each class (isolated expressions and normal text lines) are used in the visualization. The images are normalized to the size of 224x224x3 as required by the ResNet-18. The technique aims to preserve the similarities between points when reducing from the high-dimensional to the low-dimensional space. Figure 13 clearly shows that most of the testing images are separated into two classes. Various distance metrics can be used with the t-SNE technique. In our work, two popular distance metrics, the Mahalanobis and the Cosine, are employed. The visualization of the extracted features of text lines and isolated expressions using the t-SNE dimensionality reduction with the Mahalanobis and Cosine distance metrics is shown in Figure 13(a) and 13(b), respectively. The t-SNE technique with the Cosine distance metric visualizes the separation between the two classes better than the Mahalanobis distance metric does.

4) Evaluation of the impact of the post-processing on the detection of mathematical expressions
The outcome of the post-processing in the detection of mathematical expressions is clearly shown in Figure 14. The post-processing improves the accuracy of the detection of both isolated and inline expressions. The percentage of partial detection of isolated expressions is increased by 8.81% and the false detection is decreased by 9.84%. In particular, the post-processing brings a considerable improvement for inline expression detection: the percentage of partial detection of inline expressions is increased by 21.80% and the false detection is decreased by 24.59%. The percentage of expressions in the missed detection category is not affected by the post-processing because the post-processing aims at merging detected components of expressions. Actually, a large number of expressions in scientific documents consist of multiple words. Therefore, the post-processing is necessary to improve the accuracy of the detection of inline expressions.

5) Time efficiency
In order to compare the performance of the methods, the execution time of the testing phase of each method is evaluated. The methods are implemented in the Matlab R2019a environment on a computer with 6 GB RAM and a 2.67 GHz Core i3 processor. The average execution time of the detection of inline expressions in a document page in the Marmot dataset is shown in Table 9. As shown in the table, the methods using the transfer learning of CNNs run more slowly than the hand-crafted feature extraction methods.

(a) The visualization of feature extraction using dimensional reduction with the Mahalanobis distance metric.


(b) The visualization of feature extraction using dimensional reduction with the Cosine distance metric.

FIGURE 13: The feature distribution of isolated expression and text line images. The features of isolated expressions and normal text lines in the Marmot dataset extracted by the ResNet-18 are illustrated in red and blue, respectively. The visualization uses the t-SNE dimensionality reduction with the Mahalanobis (a) and Cosine (b) distance metrics.
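The visualization in Figure 13 can be reproduced approximately with the short Python sketch below. It is a sketch under stated assumptions (scikit-learn's TSNE, matplotlib, and hypothetical .npy files holding the 512-dimensional pool5 features and the class labels); the authors' implementation is in MATLAB.

```python
# Approximate reproduction of the t-SNE visualization (illustrative sketch).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.load("pool5_features.npy")  # hypothetical file, shape (N, 512)
labels = np.load("labels.npy")            # hypothetical file, 0 = text line, 1 = isolated expression

# Reduce the 512-D descriptors to 2-D; the cosine metric gave the clearer
# class separation in the experiments reported above.
embedded = TSNE(n_components=2, metric="cosine", init="random",
                random_state=0).fit_transform(features)

for cls, name in [(0, "text line"), (1, "isolated expression")]:
    pts = embedded[labels == cls]
    plt.scatter(pts[:, 0], pts[:, 1], s=6, label=name)
plt.legend()
plt.show()
```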

Particularly, for inline expression detection, the classification of expressions and textual words using CNNs requires more time than the hand-crafted feature extraction methods.


[Bar chart: percentages of correct, partial, missed and false detections of isolated expressions without and with the post-processing.]

(a) Performance comparison of isolated expression detection before and after post-processing

[Bar chart: percentages of correct, partial, missed and false detections of inline expressions without and with the post-processing.]

(b) Performance comparison of inline expression detection before and after post-processing

FIGURE 14: Performance comparison of the proposed method before (in blue) and after (in orange) the post-processing in the
detection of isolated (a) and inline (b) expressions.
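The merging behaviour of the post-processing evaluated in Figure 14 can be sketched as follows. This is a simplified Python illustration, not the exact procedure of the proposed system: it assumes word-level boxes on a single text line and an assumed horizontal gap threshold.

```python
# Simplified sketch of merging neighbouring expression boxes on one text line.
def merge_expression_boxes(boxes, max_gap=15):
    """boxes: list of (x1, y1, x2, y2); max_gap (pixels) is an assumed value."""
    merged = []
    for box in sorted(boxes, key=lambda b: b[0]):
        if merged and box[0] - merged[-1][2] <= max_gap:
            # Close enough to the previous box: fuse them into one expression box.
            last = merged[-1]
            merged[-1] = (last[0], min(last[1], box[1]),
                          max(last[2], box[2]), max(last[3], box[3]))
        else:
            merged.append(box)
    return merged
```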

The main reason for the time-consuming execution of the CNNs is that the CNNs extract more features than the hand-crafted feature extraction methods do. The method using the transfer learning of the AlexNet shows a slightly better execution time than that of the ResNet-18. In fact, the ResNet-18 consists of more layers than the AlexNet, so its feature extraction takes more time. Among the machine learning methods, the proposed method based on the FFT and projection profile for feature extraction and the RF classifier shows the best results. The proposed method achieves the most effective training and testing times because it focuses on extracting the distribution of peak and valley values of the projection profiles of an image instead of processing the whole image. Meanwhile, the method reported in [25] is slow in comparison with the proposed method because it extracts all bounding boxes of the characters of inline expressions. The combination of hand-crafted and deep learning features yields the highest detection accuracy, and its running time is similar to that of the transfer learning of CNNs.

D. ERROR ANALYSIS AND DISCUSSION
For examples of isolated and inline expression detection, the results on document page images in the Marmot and GTDB datasets are shown in Figure 15 and Figure 16, respectively. The detection result in the Marmot dataset is more accurate than that of the GTDB one.

(a) Examples of the isolated and inline expression detection in one-column page.

(b) Examples of the isolated and inline expression detection in two-column page.

FIGURE 15: Examples of the isolated and inline expression detection in one-column (a) and two-column (b) pages in the Marmot dataset. The detected isolated expressions, detected inline expressions and ground-truth expressions are marked in blue, black and red, respectively.

It is clearly shown in the figures that isolated expressions are detected with high accuracy. However, some errors are encountered in the detection of inline expressions. The errors can be classified into two classes:
1) The ambiguity in the detection of some numbers and variables encountered in real contexts. Number symbols and single characters can be used in both mathematical expressions and narrative texts. This factor can cause errors in the detection. An example of the false detection of inline expressions is shown in Figure 17(a). In this case, normal texts containing mathematical symbols are detected as inline expressions.
2) Small mathematical symbols cannot be detected in some cases because of the noise generated in the page segmentation. Concretely, during the word segmentation and merging, small symbols may be omitted. An example of the missed detection of an inline expression is shown in Figure 17(b).

FIGURE 16: Examples of the expression detection in a sample page in the GTDB dataset. The detected and ground-truth expressions are marked in blue and red, respectively.

(a) Example of the false detection of inline expressions.

(b) Example of the missed detection of inline expressions.

FIGURE 17: Examples of the false (a) and missed (b) detection of inline expressions. The inline expressions detected by the proposed system and the ground-truth expressions are marked in black and red, respectively.

In this case, the small variable r cannot be detected.

V. CONCLUSION AND FUTURE WORKS
We have presented a unified system that detects both isolated and inline mathematical expressions in document images. The improvements in the page segmentation and in the classification of mathematical expressions and texts are combined to improve the performance of the overall detection system. The combination of hand-crafted and deep learning features is proposed to improve the detection performance. In the hand-crafted feature extraction method, the feature extraction based on the FFT and the RF classifier is applied for the isolated expression detection, and the feature extraction based on the projection profile and the RF classifier is applied for the inline expression detection. The transfer learning of CNNs, including the AlexNet and the ResNet-18, has been efficiently employed in the detection of both isolated and inline expressions. The performance of the overall system is evaluated on two public datasets, the Marmot and the GTDB. Generic performance metrics based on IoU are applied to evaluate the system thoroughly. The obtained results show that the detection performance of the proposed system is significantly improved compared with the conventional methods.
In the future, the performance of the system can be further improved by applying various strategies. Different deep neural networks can be combined to improve the accuracy of both the page segmentation and the expression detection. Context information can be integrated to improve the accuracy of the inline expression detection.

REFERENCES
[1] R.Zanibbi and D.Blostein, "Recognition and retrieval of mathematical expressions," International Journal on Document Analysis and Recognition, vol. 15, iss. 4, pp. 331–357, December 2012.
[2] X.Lin et al., "Performance Evaluation of Mathematical Formula Identification," International Workshop on Document Analysis Systems, Gold Coast, QLD, Australia, May 07, 2012.
[3] Bui Hai Phong, Thang Manh Hoang, Thi-Lan Le, "A unified system for mathematical expression detection in scientific document images," in Proc. Korea-Vietnam International Joint Workshop on Communications and Information Sciences, Hanoi, Vietnam, 2019, pp. 14-16.
[4] C.Tan et al., "A Survey on Deep Transfer Learning," in Proc. Int. Conf. on Artificial Neural Networks, Rhodes, Greece, October 4–7, 2018.
[5] A.Krizhevsky, I.Sutskever and G.Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Proc. Int. Conf. on Neural Information Processing Systems, Lake Tahoe, Nevada, December 03-06, 2012, pp. 1097-1105.
[6] Kaiming He et al., "Deep Residual Learning for Image Recognition," in Proc. IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, June 27-30, 2016, pp. 770-778.
[7] Tuan Anh Tran, In Seop Na and Soo Hyung Kim, "Page segmentation using minimum homogeneity algorithm and adaptive mathematical morphology," International Journal on Document Analysis and Recognition, vol. 19, iss. 3, September 2016, pp. 191–209.
[8] F.Wahl et al., "Block segmentation and text extraction in mixed text/image documents," Computer Graphics and Image Processing, vol. 20, no. 4, 1982, pp. 375-390.
[9] D.Wang and S.Srihari, "Classification of newspaper image blocks using texture analysis," Computer Graphics and Image Processing, vol. 47, no. 3, 1989, pp. 327-352.
[10] L.Caponetti et al., "Document page segmentation using neuro-fuzzy approach," Applied Soft Computing, vol. 8, no. 1, 2008, pp. 118-126.
[11] M.Agrawal and D.Doermann, "Voronoi++: A Dynamic Page Segmentation approach based on Voronoi and Docstrum features," Int. Conf. on Document Analysis and Recognition, Barcelona, Spain, July 26-29, 2009.
[12] H.Cheng and C.A.Bouman, "Multiscale Bayesian segmentation using a trainable context model," IEEE Transactions on Image Processing, vol. 10, no. 4, 2001, pp. 511-525.
[13] Z.Shi and V.Govindaraju, "Multi-scale Techniques for Document Page Segmentation," Int. Conf. on Document Analysis and Recognition, Seoul, South Korea, 29 August - 1 September, 2005.
[14] D.T.Ha et al., "An Adaptive Over-Split and Merge Algorithm for Page Segmentation," Pattern Recognition Letters, vol. 80, September 2016, pp. 137-143.
[15] T.M.Breuel, "The OCRopus open source OCR system," in Proc. Document Recognition and Retrieval XV, San Jose, CA, USA, January 29-31, 2008.
[16] R.Smith, "An Overview of the Tesseract OCR Engine," in Int. Conf. on Document Analysis and Recognition, vol. 2, September 2007.
[17] M.N.Anoop and K.J.Anil, "Document Structure and Layout Analysis," Digital Document Processing, Springer, London, 2007, pp. 29-48.
[18] X.Lin et al., "A Text Line Detection Method for Mathematical Formula Recognition," Int. Conf. on Document Analysis and Recognition, Washington, DC, USA, Aug 25-28, 2013.
[19] K.Chen et al., "Convolutional Neural Networks for Page Segmentation of Historical Document Images," in Int. Conf. on Document Analysis and Recognition, Kyoto, Japan, November 9-15, 2017.
[20] S.Oliveira, B.Seguin and F.Kaplan, "dhSegment: A Generic Deep-Learning Approach for Document Segmentation," in Int. Conf. on Frontiers in Handwriting Recognition, Niagara Falls, NY, USA, August 5-8, 2018, pp. 7-12.
[21] X.Lin et al., "Mathematical formula identification and performance evaluation in PDF documents," International Journal on Document Analysis and Recognition, vol. 17, iss. 3, pp. 239–255, September 2014.
[22] H.Lee and J.Wang, "Design of a mathematical expression understanding system," Pattern Recognition Letters, vol. 18, no. 3, March 1997, pp. 289-298.
[23] J.Jin, X.Han and Q.Wang, "Mathematical formulas detection," Int. Conf. on Document Analysis and Recognition, Edinburgh, Scotland, 2003, pp. 1138-1141.
[24] U.Garain, "Identification of Mathematical Expressions in Document Images," Int. Conf. on Document Analysis and Recognition, Barcelona, Spain, July 26-29, 2009.
[25] W.Chu and F.Liu, "Mathematical Formula Detection in Heterogeneous Document Images," in Proc. Conf. on Technologies and Applications of Artificial Intelligence, Taipei, Taiwan, December 6-8, 2013.
[26] Katherine L. Bouman et al., "A Low Complexity Sign Detection and Text Localization Method for Mobile Applications," IEEE Transactions on Multimedia, vol. 13, no. 5, 2011, pp. 922–934.
[27] W.Ohyama, M.Suzuki and S.Uchida, "Detecting Mathematical Expressions in Scientific Document Images Using a U-Net Trained on a Diverse Dataset," IEEE Access, vol. 7, 2019, pp. 144030-144042.
[28] Bui Hai Phong, Thang Manh Hoang, Thi-Lan Le, "Mathematical variable detection based on Convolutional Neural Network and Support Vector Machine," Int. Conf. on Multimedia Analysis and Pattern Recognition, Ho Chi Minh City, Vietnam, May 9-10, 2019, pp. 1-5.
[29] W. He, Y. Luo, F. Yin, H. Hu, J. Han, E. Ding, and C.-L. Liu, "Context-aware mathematical expression recognition: An end-to-end framework and a benchmark," Int. Conf. on Pattern Recognition (ICPR), Cancun, Mexico, December 4-8, 2016, pp. 3246–3251.
[30] O.Ronneberger, P.Fischer and T.Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," Int. Conf. on Medical Image Computing and Computer-Assisted Intervention – MICCAI, 2015, pp. 234-241.
[31] K.Iwatsuki, T.Sagara, T.Hara and A.Aizawa, "Detecting In-line mathematical expressions in Scientific documents," in Proc. ACM Symposium on Document Engineering, Valletta, Malta, September 04-07, 2017, pp. 141-144.
[32] Bui Hai Phong, Thang Manh Hoang, Thi-Lan Le, "Mathematical variable detection in PDF scientific documents," in Int. Conf. on Intelligent Information and Database Systems, Yogyakarta, Indonesia, April 8–11, 2019.
[33] L.Gao et al., "A Deep Learning-based Formula Detection Method for PDF Documents," Int. Conf. on Document Analysis and Recognition, Kyoto, Japan, November 9-15, 2017, pp. 553-558.
[34] A.Papandreou and B.Gatos, "A Novel Skew Detection Technique Based on Vertical Projections," Int. Conf. on Document Analysis and Recognition, 2011, pp. 384-388.
[35] K.Taeho et al., "Robust Document Image Dewarping Using Text-Line and Line Segments," Int. Conf. on Document Analysis and Recognition, 2017, pp. 865-870.
[36] T.Chang et al., "Physical Structure Segmentation with Projection Profile for Mathematic Formulae and Graphics in Academic Paper Images," Int. Conf. on Document Analysis and Recognition, Parana, Brazil, September 23-26, 2007.
[37] Bui Hai Phong, Thang Manh Hoang, Thi-Lan Le, "A new method for displayed mathematical expression detection based on FFT and SVM," in Proc. The NAFOSTED Conference on Information and Computer Science, Hanoi, Vietnam, November 24-25, 2017, pp. 90-96.
[38] Bui Hai Phong, Thang Manh Hoang, Thi-Lan Le, "Mathematical Variable Detection in Scientific Document Images," International Journal of Computational Vision and Robotics, to be published.
[39] Phan Thanh Noi and Martin Kappas, "Comparison of Random Forest, k-Nearest Neighbor, and Support Vector Machine Classifiers for Land Cover Classification Using Sentinel-2 Imagery," Sensors, December 22, 2017.
[40] J.Friedman et al., "Additive logistic regression: A statistical view of boosting," Annals of Statistics, vol. 28, no. 2, 2000, pp. 337–407.
[41] P.Napoletano, F.Piccoli, R.Schettini, "Anomaly Detection in Nanofibrous Materials by CNN-Based Self-Similarity," Sensors, vol. 18(1), January 12, 2018.
[42] D.Kingma and J.Ba, "Adam: A method for stochastic optimization," arXiv:1412.6980, December 22, 2014.
[43] K.Murphy, "Machine Learning: A Probabilistic Perspective," The MIT Press, Cambridge, Massachusetts, First edition, 2012.
[44] S.Lee, M.Zare and H.Muller, "Late Fusion of Deep Learning and Handcrafted Visual Features for Biomedical Image Modality Classification," IET Image Processing, vol. 13, iss. 2, 2019, pp. 382-391.
[45] A.Herrera and H.Müller, "Fusion Techniques in Biomedical Information Retrieval," Fusion in Computer Vision, Springer International Publishing, 2014, pp. 209-228.
[46] Z.Liu and R.Smith, "A Simple Equation Region Detector for Printed Document Images in Tesseract," Int. Conf. on Document Analysis and Recognition, Washington, DC, USA, August 25-28, 2013, pp. 245-249.
[47] M.Everingham et al., "The Pascal Visual Object Classes (VOC) Challenge," International Journal of Computer Vision, vol. 88, iss. 2, 2010, pp. 303–338.
[48] M. Mahdavi et al., "ICDAR 2019 CROHME + TFD: Competition on Recognition of Handwritten Mathematical Expressions and Typeset Formula Detection," in Proc. Int. Conf. on Document Analysis and Recognition, Sydney, Australia, September 20-25, 2019.

[49] L. van der Maaten and G. Hinton, "Visualizing Data using t-SNE," Journal of Machine Learning Research, vol. 9, 2008, pp. 2579–2605.
[50] W.S. Torgerson, "Multidimensional scaling: I. Theory and method," Psychometrika, vol. 17, December 1952, pp. 401–419.
[51] C.K.I. Williams, "On a connection between Kernel PCA and metric multidimensional scaling," Machine Learning, vol. 46, iss. 1-3, 2002, pp. 11–19.

BUI HAI PHONG graduated from the School of Information and Communication Technology, Hanoi University of Science and Technology, Vietnam, in 2010. He obtained his M.S. degree in Information Technology from Hanoi University of Science and Technology in 2012. He is currently a Ph.D. student at Hanoi University of Science and Technology. His research interests include computer vision, pattern recognition and machine learning.

THANG MANH HOANG received the B.Eng. degree in Electronics and Telecommunications from Hanoi University of Science and Technology, Vietnam, and the M.Sc. degree from Hanoi University of Science and Technology. In 2007, he was awarded the Ph.D. degree in Electronics and Telecommunications from Nagaoka University of Technology, Japan. He is currently a lecturer at the School of Electronics and Telecommunications, Hanoi University of Science and Technology, Vietnam. His research interests include non-linearity and its applications in electronics and communication, such as cryptography, modulation, oscillation, complex networks, chaos synchronization and recognition.

THI-LAN LE graduated in Information Technology from Hanoi University of Science and Technology (HUST), Vietnam. She obtained her M.S. degree in Signal Processing and Communication from HUST, Vietnam. In 2009, she received her Ph.D. degree at INRIA Sophia Antipolis, France, in video retrieval. She is currently a lecturer/researcher at the Computer Vision Department, HUST, Vietnam. Her research interests include computer vision, content-based indexing and retrieval, video understanding and human-robot interaction.