Recognition and Retrieval of Mathematical Expressions
Abstract Document recognition and retrieval technologies complement one another, providing improved access to increasingly large document collections. While
recognition and retrieval of textual information is fairly
mature, with widespread availability of Optical Character Recognition (OCR) and text-based search engines,
recognition and retrieval of graphics such as images, figures, tables, diagrams, and mathematical expressions
are in comparatively early stages of research. This paper surveys the state of the art in recognition and retrieval of mathematical expressions, organized around
four key problems in math retrieval (query construction, normalization, indexing, and relevance feedback),
and four key problems in math recognition (detecting
expressions, detecting and classifying symbols, analyzing symbol layout, and constructing a representation
of meaning). Of special interest is the machine learning problem of jointly optimizing the component algorithms in a math recognition system, and developing
effective indexing, retrieval and relevance feedback algorithms for math retrieval. Another important open
problem is developing user interfaces that seamlessly
integrate recognition and retrieval. Activity in these
important research areas is increasing, in part because
math notation provides an excellent domain for studying problems common to many document and graphics
recognition and retrieval applications, and also because
mature applications will likely provide substantial benefits for education, research, and mathematical literacy.
R. Zanibbi
Department of Computer Science, Rochester Institute of Technology, 102 Lomb Memorial Drive, Rochester, NY, USA 14623-5608.
E-mail: [email protected]
D. Blostein
School of Computing, Queen's University, Kingston, Ontario,
Canada, K7L 3N6. E-mail: [email protected]
1 Introduction
In practice, the problem of retrieving math notation
is closely tied to the problem of recognizing math notation. For example, a college student may want to search
textbooks and course notes to find math notation that
has similar structure or semantics to a given expression.
Or, a researcher may wish to find technical papers that
use or define a given function. In both of these examples, recognition of math notation is needed in order
to support the retrieval of math notation: the system
must be able to recognize math expressions that the
user provides as a query, and the system must be able
to recognize math expressions in the target documents
that are the subject of search. Retrieval of math notation has received increasing research attention in the
past decade (see Section 3), while math recognition has
been a subject of research for over forty years (see Section 4). To our knowledge, we provide the first survey of
mathematical information retrieval; in surveying math
recognition, we focus on research that has appeared in
the decade since the survey of Chan and Yeung [28].
The math domain provides an excellent vehicle for
studying pattern recognition and retrieval problems,
and for studying methods of integrating pattern recognition algorithms to improve performance. The four
central pattern recognition problems, namely segmentation, classification, parsing, and machine learning (i.e. optimizing recognition model parameters), all come into
play when recognizing mathematics. The math domain
Fig. 1 Math Entry Systems. FFES is pen-based, XPRESS supports mouse and keyboard entry, and InftyEditor/InftyReader supports OCR, pen, mouse and keyboard entry.
Fig. 3 Key Recognition Problems: Expression Detection, Symbol Extraction or Symbol Recognition, Layout Analysis, and Mathematical Content Interpretation. Shown at left are the possible input formats, including vector-based document encodings such as PDF
files, pen/finger strokes, and document images. The form of input and output for each problem is shown. Many systems perform
recognition in the order shown, but not all. For example, some systems combine Layout Analysis and Mathematical Content Interpretation, producing an operator tree directly using the expected locations of operator/relation arguments [29, 31]. Post-processing stages
used to apply language model constraints (e.g. n-grams) and other refinements are not shown (see Section 4.5).
[14]). In raster image data and pen strokes, detecting symbol location and identity is challenging.
There are hundreds of alphanumeric and mathematical symbols used, many so similar in appearance
that some use of context is necessary for disambiguation (e.g. O, o, 0 [103]).
3. Layout Analysis (Section 4.3). Analysis of the spatial relationships between symbols is challenging.
Spatial structure is often represented using a tree,
which we term a symbol layout tree (Figure 4a).
Symbol layout trees represent information similar to
LaTeX math expressions; they indicate which groups
of horizontally adjacent symbols share a baseline
(writing line), along with subscript, superscript, above,
below, and containment relationships. Symbols may
be merged into tokens, in order to simplify later processing (e.g. function names and numeric constants).
http://www.ai.mit.edu/projects/natural-log/
http://www.cs.rit.edu/rlaz/ffes/
http://www.xthink.com/
http://www.inftyreader.org
"
$%
#
#
!
!
face, and views results through the Result Interface. Indexing, Normalization and Matching are three system
processes used to process the document collection and
query, and find matches for the query in the collection.
Math recognition can be applied both to the query
(e.g. to recognize a stylus-drawn expression, as in Figures 1 and 2) and to the searchable documents (e.g.
to recognize math expressions in document images or
PDF files). Prior to indexing, document images can be
annotated with region types (e.g. text, table, figure,
image, math), character information, and recognized
structure and semantics for detected math expressions.
Existing math retrieval systems lack the ability to recognize stylus-drawn queries. Instead, template editors are provided to assist in generating query strings; an example is the MathWebSearch prototype (Figure 5a).
The following four key problems arise in the retrieval
of math notation, as illustrated in Figure 6.
http://www.latexsearch.com/
http://search.mathweb.org/index.xhtml
http://dlmf.nist.gov/
http://functions.wolfram.com/
http://www.wolframalpha.com
Fig. 7 Ambiguous Mathematical Expressions. (a) Which division is performed first? (b) Is a superscripted? (c) What is the scope of the summation? (d) Is this symbol a 9 or a q? The perceived answer depends on context (from [103]). (e) What do s, t and represent?
Table 1 Spatial Relationships in Mathematical Notation. Relationships shown are defined for standard symbol layout tree encodings (e.g. LaTeX, Presentation MathML), and used in most recognition systems (as far back as Anderson's [5]). Note that for many expressions shown, mathematical content cannot be determined unambiguously.

Relation                  | Expression                                  | Math. Interpretation
Adjacent (at right)       | xy                                          | Multiply x by y
Superscript               | x^3                                         | x · x · x
Subscript                 | x_1                                         | Element 1 of list x
Superscript and subscript | x_1^2                                       | x_1 · x_1
                          | ^nC_k                                       | n choose k
                          | \int p(x|\omega_i) dx                       | (ambiguous)
Above                     | \bar{x}                                     | not x
Above and below           | \frac{x}{y}                                 | x divided by y
                          | \sum_{i=1}^{n}                              | Add 1, 2, ..., n-1, n
Contains                  | \sqrt{x^2 y^2}, \sqrt{xy}                   | (ambiguous)
Nested                    | x! = { 1, if x = 0;  x((x-1)!), if x > 0 }  | Inductive function def.
Grid                      | [ x 0; 0 y ]                                | 2 × 2 diagonal matrix
Fig. 8 Encodings of the expression (a + b)^2.

(a) Expression Image: (a + b)^2

(b) Presentation MathML:
<msup>
  <mfenced>
    <mi>a</mi>
    <mo>+</mo>
    <mi>b</mi>
  </mfenced>
  <mn>2</mn>
</msup>

(c) LaTeX (Symbol Layout Tree): (a+b)^2

(d) Content MathML (Operator Tree):
<apply>
  <power/>
  <apply>
    <plus/>
    <ci>a</ci>
    <ci>b</ci>
  </apply>
  <cn>2</cn>
</apply>
However, some expressions are not intended for evaluation. For example, consider the integral shown in Table 1. The vector space is continuous, and thus this integral cannot be computed directly. Doing so would also not be of interest, as this expression is commonly used in a constraint requiring that it evaluate to 1.0.
Table 2 Information Needs for Mathematical Information Retrieval, from Kohlhase and Kohlhase [75], and Zhao et al. [171]

- Specific/similar formulae
- Form/appearance (given by a symbol layout tree)
- Mathematical Content (given by an operator tree)
- Name
- Theorems, proofs, and counter-examples
- Examples and visualizations (e.g. graphs/charts)
- Problem and solution sets (e.g. for instruction)
- Algorithms
- Applications (e.g. for the Fourier transform)
- Answer mathematical questions/conjectures
- People (by math content in publications)
- Determine novelty/sequence of mathematical discoveries
http://www.ams.org/mr-database
http://www.zentralblatt-math.org/zmath/en/
The MSC is quite detailed; the 2010 revision is 47 pages long.
Consider n choose k, which may be written as \binom{n}{k}, ^nC_k, C^n_k, or C^k_n [78]. In terms of expression semantics,
the variability is even more severe: consider the number
of expressions that evaluate to 0. It is not clear when or
to what extent transformation and simplification should
be used to recover such equivalences.
Below is a short list of query and document normalizations that have been applied in MIR systems.

- Thesaurus: adding synonyms for symbols to a query (e.g. adding equivalent function names [102]).
- Canonical orderings: fixing the order for spatial relationships such as subscripts and superscripts in symbol layout trees (e.g. expressed in LaTeX [102]), and defining a fixed ordering for children of associative and commutative operations in operator trees, such as for sums [109, 129]; a small code sketch of this idea follows the list.
- Enumerating variables: variables may be enumerated (ignoring symbol identities) to permit unification of query variables with variables in archived expressions [109].
- Replacing symbols with their types: allows matching symbol types around an operator, rather than specific symbols [67]. It also allows for a sub-expression to be matched to an individual symbol of a given type.
- Simplification: produce smaller representations with less variation. For example, one may eliminate <apply> tags (see Figure 8) from Content MathML [160], or use Computer Algebra Systems to simplify expressions symbolically [41, 102].
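As noted in the list above, canonical ordering can be illustrated with a minimal sketch that sorts the arguments of commutative operators in an operator tree. The nested-tuple tree encoding and the set of commutative operators below are assumptions made only for this example, not the format of any cited system.

# Minimal sketch: canonical ordering of commutative/associative operator
# arguments in an operator tree, represented here as nested tuples
# ('op', arg1, arg2, ...) with strings for leaf symbols. The tuple encoding
# and the operator list are illustrative assumptions.

COMMUTATIVE = {'plus', 'times', 'eq'}

def canonicalize(tree):
    """Return an equivalent tree with commutative arguments in a fixed order."""
    if isinstance(tree, str):          # leaf symbol (variable or constant)
        return tree
    op, *args = tree
    args = [canonicalize(a) for a in args]
    if op in COMMUTATIVE:
        # Sort by a stable string key so that e.g. plus(y, x) == plus(x, y).
        args = sorted(args, key=repr)
    return (op, *args)

# x + y*z and z*y + x normalize to the same tree:
t1 = ('plus', 'x', ('times', 'y', 'z'))
t2 = ('plus', ('times', 'z', 'y'), 'x')
assert canonicalize(t1) == canonicalize(t2)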
.png files). LaTeXML was used in creating the NIST Digital Library of Mathematical Functions (DLMF) (see Figure 5c). In contrast, Springer's LaTeX search (Figure 5b) represents documents using the LaTeX sources provided directly by the authors of academic papers and books. These encodings allow expression data to be represented explicitly, in a suitable form for indexing and retrieval prior to archiving a document collection.
Unfortunately, many documents do not represent
mathematical information explicitly. Examples include
document images such as .tiff or .png files, and vectorbased representations such as .pdf files [13, 14]. This
makes it necessary to recover mathematical information
using pattern recognition techniques, and then annotate documents with recognition results prior to indexing. Pattern recognition has been used to identify math
symbols and structure in raw document images [8, 101]
and .pdf files [14, 71]. Another use of pattern recognition is to segment documents into region types such as
theorem, proof, and section heading [171]; these region
types can then be used in queries.
A German and Japanese project led by Michler developed a prototype for annotating documents in digital mathematics libraries in the early 2000s [100, 101]. Document images were recognized using commercial OCR software (ABBYY FineReader), mathematical expressions were segmented and converted into LaTeX using techniques developed by Okamoto et al. [8], and paper references were linked to online reviews from Zentralblatt für Mathematik and Mathematical Reviews. References were detected using regular-expression matching in OCR results. Archived documents were stored using the DjVu format, which represents document pages in three layers: 1. image, 2. OCR and math recognition results, including associated page coordinates, and 3. links to reviews for cited papers, with the associated page coordinates for the citations [101]. DjVu viewers allowed OCR/math recognition results to be seen in-place while viewing a document image, and allowed reviews of references to be consulted simply by selecting a reference (e.g. using a mouse click).
During indexing, documents are converted to the
representation used in the document index. In the early
stages of indexing, documents are filtered (e.g. to select
expressions and/or index terms) and normalized in the
same fashion as queries.
3.3.1 Vector-Space Models
In vector-space models, documents are represented by
vectors in Rn , where each dimension corresponds to an
index term [62, 95, 125]. Index terms normally exclude
stop words (very high frequency terms such as "the" that carry little information) as well as highly infrequent terms, whose inclusion would have little effect on retrieval performance, while increasing the dimensionality of the vector space. Salton and McGill discuss index
term selection, the use of synonyms for low frequency
terms, and the construction of term phrases for high
frequency terms (Ch. 3 of [125]). Documents are represented by the weighted number of occurrences of each
index term (the term frequencies). Commonly, term frequencies are weighted using some variation of inverse
document frequency, to emphasize terms that appear in
fewer documents in the collection, and thereby likely to
be more informative [62, 125]:

u_i = freq(i, u) · log( N / docfreq(i) )

where freq(i, u) is the number of occurrences of term i in document u, N is the number of documents in the collection, and docfreq(i) is the number of documents containing term i.
The similarity between a document vector u and a query vector v is then commonly measured with the cosine of the angle between them, sim(u, v) = (u · v) / (||u|| ||v||). This is simply the inner product of the document vectors divided by the product of their magnitudes. If term vectors are first normalized (length 1.0), then the denominator need not be computed. sim(u, v) has a value of 1 when the vectors coincide (0°), and 0 when the vectors are orthogonal (90°).
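For concreteness, a small sketch of tf-idf weighting and cosine similarity over token lists is given below; the toy documents and tokenization are invented for the example.

# Minimal tf-idf weighting and cosine similarity over toy token lists.
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    N = len(docs)
    doc_freq = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: f * math.log(N / doc_freq[t]) for t, f in tf.items()})
    return vectors

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [['x', '=', '1'], ['x', '^', '2', '+', 'y'], ['y', '=', '2']]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))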
For large document collections, the document index must be pre-structured to reduce the number of
comparisons made for a query. A common approach
uses clustering, and then compares a query vector with
the centroid of each child cluster at a node (Ch. 6.4
of [125]). The cluster tree is traversed top-down until individual documents are reached, pruning paths in
which similarity is less than a threshold value. This
greatly reduces retrieval time, but carries the risk that
the document(s) most similar to the query will not be
located (see [40] pp. 185-186). Smeulders et al. identify
three methods for hierarchically decomposing a document index in image retrieval [132]: partitioning the
feature space, partitioning the data, or distance-based
indexing relative to examples. Spatial data structures
used by these three decomposition approaches, respectively, include k-d trees, R-trees, and M-trees [126].
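The pruning idea can be sketched with a flat, two-level version of such an index: documents are grouped under centroids, and a query is scored only against documents in its most similar clusters. The dict-based vectors, helper names and parameters below are illustrative simplifications of the cluster trees discussed in [125].

# Two-level sketch of cluster-based index pruning. Vectors are {term: weight}.
import math

def cosine_sim(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def assign_to_clusters(doc_vecs, centroids):
    """Group document ids under their most similar centroid."""
    clusters = {i: [] for i in range(len(centroids))}
    for d, v in enumerate(doc_vecs):
        best = max(range(len(centroids)), key=lambda i: cosine_sim(v, centroids[i]))
        clusters[best].append(d)
    return clusters

def pruned_search(query, centroids, clusters, doc_vecs, n_clusters=2):
    """Score only documents in the n_clusters most query-similar clusters."""
    order = sorted(range(len(centroids)),
                   key=lambda i: cosine_sim(query, centroids[i]), reverse=True)
    candidates = [d for i in order[:n_clusters] for d in clusters[i]]
    return sorted(candidates, key=lambda d: cosine_sim(query, doc_vecs[d]),
                  reverse=True)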
A number of MIR systems implement vector-space
models using the popular Lucene [60] indexing and retrieval library (http://lucene.apache.org), both for indexing entire documents that
include expressions [91, 102], and for indexing individual expressions in LaTeX documents [168]. In these approaches, mathematical symbols are treated as terms, and the expressions are linearized (flattened) before conventional text-based indexing is performed. For example, consider the LaTeX expression for x_{t-2} = 1, which is x_{t-2} = 1. Below we show the symbol layout tree for the LaTeX expression, along with the linearization produced by Miller and Youssef [102]:
Fig. 9 A Substitution Tree (adapted from Kohlhase and Sucan [78]). The tree represents all indexed expressions using
paths of substitutions. Substitution variables are represented by
boxed numbers. Five expressions are represented at the leaves
of the tree: exp(f (z, a, z)), sqrt(f (z, y, a)), sqrt(f (1, k, a)),
sqrt(f (1, z, n)) and .
may search for exact matches, instances, generalizations, and variant substitutions. An example of instance-based matching using Figure 9 is that the query sqrt(X)
returns the three expressions at the leaves of the tree
that contain an outermost sqrt(). An example of matching with generalizations is to ignore specific symbol
identities. In matching with variant substitutions, we
match expressions that are equivalent up to variable
renaming.
Substitution tree retrieval was applied to MIR by
Kohlhase and Sucan [78]. To simplify matching subexpressions, Kohlhase and Sucan add all sub-expressions
in the document collection to the substitution tree along
with their parent expression. They claim that this leads
to a manageable increase in the index size, because
many sub-expressions are shared by the larger expressions, and each sub-expression appears only once in the
substitution tree. To facilitate rapid retrieval, all substitution tree nodes contain references to matched expressions in the document collection.
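The following simplified sketch illustrates the underlying idea of indexing every sub-expression of archived operator trees so that queries can match at any level; hashing on a structural key with variables wildcarded is an illustrative stand-in for substitution-tree matching, not the algorithm of [78].

# Simplified sketch of sub-expression indexing for query-by-expression.
# Operator trees are nested tuples ('op', arg1, ...), and every sub-expression
# is stored under a skeleton key in which variables become a wildcard, so a
# query such as sqrt(X) matches archived expressions up to variable renaming.
from collections import defaultdict

def is_var(node):
    return isinstance(node, str) and node.isalpha() and len(node) == 1

def skeleton(tree):
    if is_var(tree):
        return '?'                      # wildcard for any variable
    if isinstance(tree, str):
        return tree                     # constants / named symbols
    op, *args = tree
    return (op, *[skeleton(a) for a in args])

def index_expression(index, expr_id, tree):
    """Store the tree and all of its sub-expressions under their skeletons."""
    index[skeleton(tree)].add(expr_id)
    if not isinstance(tree, str):
        for arg in tree[1:]:
            index_expression(index, expr_id, arg)

index = defaultdict(set)
index_expression(index, 'e1', ('sqrt', ('f', 'z', 'y', 'a')))
index_expression(index, 'e2', ('exp', ('f', 'z', 'a', 'z')))
print(index[('sqrt', ('f', '?', '?', '?'))])   # -> {'e1'}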
Earlier, a related method was used by Einwohner
and Fateman for searching through integral tables, given
an integrand expressed as an operator tree in Lisp (e.g.
(expt (log (cos x)) 1/2)) [41]. Expressions from the
integral tables were indexed using hash tables: after
normalization of the Lisp expressions, the head (first
atom) of each list in the lisp expression is used as the
key for storing the associated sub-expression (sub-tree)
in the table. Retrieval was performed by recursively
looking up each lead atom (key); if the first key returns a non-empty set of expressions, the current key is
expanded to include the next key, and the intersection
of the previous returned and current lists of matches is
taken. This differs from the substitution trees in that
operator trees are matched using a depth-first traversal
of the query operator tree rather than based on com-
post-processing were used successfully to recover symbol layout trees from expression images by Okamoto
et al. [111, 153]. Retrieval is performed using (standard) XY-tree structure, and dynamic time warping of
query and candidate image columns similar to the wordspotting technique of Rath and Manmatha [119, 120].
A related approach was developed for visual matching of LaTeX-generated expression images [168]. Connected components in the query image are matched with those in archived images using visual similarity, again based on features similar to Rath and Manmatha's. The matching process also measures similarity in layout between
pairs of connected components.
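For illustration, a minimal dynamic time warping routine of the kind used to align sequences of column features is sketched below; the feature extraction itself (ink or projection profiles, for example) is omitted, and the function name and inputs are assumptions made for the example.

# Minimal dynamic time warping (DTW) over two sequences of feature vectors,
# e.g. per-column features from a query expression image and a candidate.
import math

def dtw(a, b):
    """Return the DTW alignment cost between feature sequences a and b."""
    def dist(x, y):
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))
    INF = float('inf')
    n, m = len(a), len(b)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = dist(a[i - 1], b[j - 1]) + min(
                cost[i - 1][j],      # skip a column of a
                cost[i][j - 1],      # skip a column of b
                cost[i - 1][j - 1])  # match the two columns
    return cost[n][m]

print(dtw([[1, 0], [2, 1], [3, 0]], [[1, 0], [3, 0]]))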
3.4 Query Reformulation and Relevance Feedback
After query submission the retrieved documents are
presented to the user through an interface. In order to
support reformulation of queries, one interface is normally used both for constructing queries and evaluating
results, as seen in Figure 5. If a user's information need
is satisfied by a retrieval result or if the user becomes
frustrated, he or she will stop searching. Otherwise the
user may craft a new query or may refine the existing
query, for example by filtering retrieved documents by
source or publication year (Figure 5b). New queries may
also be created automatically, in response to relevance
feedback.
Users provide relevance feedback by indicating whether
returned documents are relevant or irrelevant to their
information need. These positive and negative examples
can be used to automatically produce a new query. Relevance feedback is provided through the result interface,
using a selection mechanism such as check boxes, or
clicking on relevant/irrelevant objects. For interesting
examples from image retrieval, see [123].
For vector-space models, a new query may be produced by averaging and re-weighting the vector elements that define the feature space: increase the weights
for features present in positive examples, and decrease
the weights for features in negative examples. A concise explanation of relevance feedback operations using re-weighting is given by Salton and McGill [125]
Chs. 4.2.B, 4.3.B and 6.5. Machine-learning methods
have also been investigated. Discriminative methods estimate classification boundaries for relevant and irrelevant documents, whereas generative methods estimate
probability distributions [35, 172].
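A minimal sketch of the re-weighting idea, in the style of the classic Rocchio update, is shown below; the weighting constants and the dict-based vectors are illustrative choices rather than the formulation of any particular MIR system.

# Rocchio-style query re-weighting from relevance feedback: term weights from
# relevant documents are added to the query vector, and weights from
# non-relevant documents are subtracted. Vectors are dicts {term: weight};
# alpha, beta and gamma are illustrative values.

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    terms = set(query)
    for d in relevant + nonrelevant:
        terms |= set(d)
    new_query = {}
    for t in terms:
        pos = sum(d.get(t, 0.0) for d in relevant) / max(len(relevant), 1)
        neg = sum(d.get(t, 0.0) for d in nonrelevant) / max(len(nonrelevant), 1)
        w = alpha * query.get(t, 0.0) + beta * pos - gamma * neg
        if w > 0:
            new_query[t] = w     # negative weights are usually dropped
    return new_query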
Ideally, relevance feedback algorithms learn optimal transformations of the feature space using user-provided relevance indications [172]. Optimality is defined by the user's information need, which may change
that reduce the need for manual identification of relevant documents or document regions, and perhaps creating a labeled test set similar to those developed for
TREC. For MIR in general, relevance pertains to both
text and expressions, making this a very time-intensive
task, one that is sensitive to the expertise of the intended users. Once a reasonable method for defining or
approximating relevance is determined, existing information retrieval metrics are likely sufficient.
4 Recognition of Mathematical Notation
Pattern recognition methods for mathematical notation
may be used in a variety of contexts. Firstly, in Mathematical Information Retrieval, math recognition can be
used to interpret user queries and to annotate document
collections. An important open problem is to develop
robust MIR methods that make effective use of recognition results even when recognition errors are present.
Secondly, math recognition is used to support the insertion of expressions into documents; for example, entry
of LaTeX expressions using images, pen, keyboard and
mouse is illustrated in Figure 1. Thirdly, math recognition is used to recover layout and operator trees from
images, handwritten strokes, or vector-based encodings
(e.g. .pdf files). Finally, math recognition is used to integrate pen-based math entry into CAS systems (see
Figure 2); in the future, expression images might also
be used as input. This requires recognition of mathematical content, with the resulting operator tree used
to support evaluation and manipulation of the expression.
Research on the recognition of math notation began
in the 1960s [5, 6, 31, 98], and a number of surveys are
available [19, 28, 52, 146]. In this paper we do not attempt to summarize the entire history as provided in
these surveys, but rather provide an updated account
of the state of the art, with an emphasis on advances
since the well-known survey by Chan and Yeung [28]
written a decade ago.
Many factors make the recognition of mathematical notation difficult. There may be noisy input in the
case of images and strokes, and ambiguities arise even
for noise-free input (see Figure 7). Math notation contains many small symbols (dots and diacritical marks)
which can be difficult to distinguish from noise. Symbol segmentation can be difficult, particularly in handwritten mathematical notation. Symbol recognition is
challenging due to the large character set (Roman letters, Greek letters, operator symbols) with a variety
of typefaces (normal, bold, italic), and a range of font
sizes (subscripts, superscripts, limit expressions). Several common symbols have ambiguity in their role; for
Fig. 10 User Interface for Evaluating Image-Based Query-by-Expression using Handwritten Queries [161]. Each returned region
is ranked on a 1-5 scale, with 1 indicating no match, 3 indicating roughly half the query is matched, and 5 indicating the query is
contained completely within a returned region.
example, a dot can represent a decimal point, a multiplication operator, a diacritical mark, or noise. Also,
spatial relationships are difficult to identify; for example, it is difficult to distinguish between configurations
that represent horizontal adjacency and those that represent superscripts or subscripts. The lack of redundancy in mathematical notation means that relatively
little information is available for resolving ambiguities.
As shown in Figure 3, we identify four key problems
that every math recognition system must address.
1. Expression detection
2. Symbol extraction or symbol recognition
3. Layout analysis
4. Mathematical content interpretation
For vector graphics, work has begun on methods for extracting symbols and recognizing manually segmented
expressions, but not on methods for automatic detection. Currently vector file formats such as PDF do not
demarcate math regions. This is an important direction
for future work, particularly for Mathematical Information Retrieval applications.
For pen-based applications, expressions are often segmented using gestures [85, 144]. For example, the
gesture is used in the E-chalk system to indicate the
end of an expression, and request its evaluation (see
Figure 2(b)). Typically, a gesture gives a partial or approximate indication of the extent of an expression. Additional clustering or region growing methods can be
applied, based on the properties of recognized symbols.
Matrix elements can be detected using similar methods [89, 147].
In images, expressions are normally found using properties of connected components. Before discussing these
methods, we distinguish between displayed expressions
that are offset from text paragraphs and expressions
that are embedded in text lines (Figure 11). Displayed
expressions are easier to detect than embedded expressions, because text lines and displayed expressions tend
to differ significantly in attributes such as height, separation, character sizes and symbol layout [52, 66].
Kacem et al. detect displayed expressions in images
based on simple visual and layout features of adjacent
connected components [66]. Embedded expressions are
found by coarsely classifying connected components.
Regions are grown around components that are identified as operators. The region growing is based on the
Fig. 11 Expression Detection and Layout Analysis. At left, the document image contains a mix of expressions that are displayed
(vertically offset) and expressions that are embedded in text lines (from [66]). Top right: a detected baseline (red) and minimum spanning tree used to associate non-baseline symbols with symbols on the baseline [144]. Bottom right: a virtual link network, in
which a minimum spanning tree is constructed that minimizes costs based on symbol identity and spatial relationships [42].
Accuracies for online recognition of handwritten mathematical symbols have also been reported at rates of
over 95%. In recent years there have been a number
of methods based on Hidden Markov Models (HMMs
[117]) that extend early work by Winkler [158] and Kosmala and Rigoll [80]. The general trend is that HMMs were first used to perform simultaneous segmentation and recognition over a time series of pen strokes, while later stages of processing, particularly layout and content analysis, are now being incorporated into training and recognition. An open challenge
is to adapt these methods to better handle late additions to symbols, e.g. when a dot is added to the
top of an i after a large expression has been entered.
Developments in HMM-based recognition methods are
discussed further in Section 4.6.
Another group of successful methods employ features that approximate handwritten strokes via linear
combinations of basis vectors or parametric curves. Various techniques for this have been used, including Principal Components Analysis [99] and polynomial basis
functions [32, 54, 55]. These features allow recognition
to be performed effectively within a small feature space
(e.g. using the first fifteen principal components [99]),
while allowing regeneration of the original data up to a
chosen level of fidelity, making the interpretation of the
features simple.
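As an illustration of such functional-approximation features, the sketch below fits low-order polynomials to the x(t) and y(t) coordinates of a pen stroke and uses the coefficients as features. The cited systems use particular orthogonal bases and normalizations, so plain polynomial fitting is only a stand-in for the general idea.

# Stroke features from functional approximation: fit low-order polynomials to
# the coordinate functions of a pen stroke and use the coefficients as features.
import numpy as np

def stroke_features(points, degree=4):
    """points: list of (x, y) pen samples (at least degree+1 of them)."""
    pts = np.asarray(points, dtype=float)
    t = np.linspace(0.0, 1.0, len(pts))            # normalized arc parameter
    cx = np.polyfit(t, pts[:, 0], degree)
    cy = np.polyfit(t, pts[:, 1], degree)
    return np.concatenate([cx, cy])                # 2*(degree+1) features

def reconstruct(features, degree=4, n=50):
    """Regenerate an approximation of the stroke from its feature vector."""
    cx, cy = features[:degree + 1], features[degree + 1:]
    t = np.linspace(0.0, 1.0, n)
    return np.stack([np.polyval(cx, t), np.polyval(cy, t)], axis=1)

pts = [(0, 0), (1, 2), (2, 3), (3, 3), (4, 2), (5, 0)]
print(stroke_features(pts).shape)                  # (10,)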
Grammar-based methods commonly represent symbol locations by geometric objects such as bounding
boxes or convex hulls. The placement of symbol centroids reflects the presence of ascenders (h) and descenders (y). Predicates and actions associated with grammar productions make use of the bounding boxes and
centroids to determine spatial relationships. It should
be noted that grammars are a very general formalism,
and variations of layout analysis techniques seen in the
previous section have been employed within the production rules of grammars designed to recover the operator tree of an expression. Examples included syntactic recognition using operator-driven decomposition [5],
and baseline extraction [14]. A key issue is the geometric
model used to partition the input and define primitives.
For example, using unrestricted subsets of image pixels
as primitives is far too computationally intensive. Instead, primitive regions are represented using geometric objects such as axis-aligned rectangles, along with
constraints on allowable orderings and adjacencies between regions. Liang et al. provide a helpful overview,
including examples from math recognition [90]. Different parsing algorithms explore the space of legal expressions in different orders, some more efficiently than
others.
Stochastic context-free grammars allow uncertainty
in symbol recognition, layout and/or content to be accommodated, by returning the maximum-likelihood derivation for the input image [34] or symbols [103]. These
methods are discussed further in Section 4.6. Some more
recent parsing methods that model uncertainty include
fuzzy-logic based parsing [44,53], and A*-penalty-based
search [122].
As discussed previously, usage of notation differs significantly in different dialects of mathematical notation,
and so the space of operator trees and corresponding
grammar productions need to be adapted for different
mathematical domains of discourse. The notion of devising one grammar to cover all of mathematical notation seems quite impractical, though defining grammars
with some utility for a specific domain (e.g. matrix algebra) is possible.
Methods that permit recognition to be defined at
the level of a grammar are very appealing: provided that suitable implementations of the underlying pattern recognition methods are available, a language definition may be sufficient for recognizing a dialect of mathematical notation,
including layout and mathematical content. However, it
has been observed that the tight coupling between the
assumed recognition model and grammar formalism can
make it difficult to adapt syntactic pattern recognition
methods. One compromise is to use a modular organization similar to a compiler, where recognized sym-
4.5 Post-processing: Constraining Outputs

Pattern-recognition systems commonly use post-processing to correct preliminary recognition results. Many post-processing operations apply contextual constraints to results for individual objects and relationships identified largely in isolation of one another [149]. In document recognition, perhaps the most well-known example of post-processing is the use of dictionaries and n-grams to refine preliminary OCR results obtained for individual characters [107, 115].

Ten years ago, the last IJDAR survey on math recognition [28] identified post-processing as an important direction for future research. Indeed, significant advances for post-processing of math recognition have been made in the last ten years. Several methods are similar to dictionary and n-gram methods used for OCR. Others incorporate syntactic constraints on two-dimensional symbol layout or expression syntax; these methods work with symbol layout trees and operator trees respectively.

4.5.1 Statistical Analysis of Math Notation

Statistical information about math notation is useful in post-processing. The frequency estimates described below have been used to re-rank and constrain preliminary symbol recognition results for handwritten math entry [134]. In addition, they have been used to categorize mathematical documents by Math Subject Classification categories [155]; so far, this appears to be the only paper published on this interesting problem. Also, recognition systems can use information about symbol frequencies and expression frequencies as prior probability estimates.

So and Watt [138] conducted an empirical study of over 19,000 papers stored in the ArXiv e-Print Archive. This archive at http://arxiv.org provides electronic versions and LaTeX source of papers from scientific, mathematical and computing disciplines. So and Watt's study determined the frequencies for expression usage in different mathematical domains, as identified by the Mathematical Subject Classification described in Section 3.1. Documents were categorized using the top-level Mathematical Subject Classification provided by the ArXiv. Analyses were made at the symbol layout level after converting the available LaTeX to Presentation MathML.

The statistics produced by So and Watt make a distinction between identifier symbols and operator symbols. In both cases, but especially for operator symbols, plotting symbols by decreasing frequency shows an exponential decrease in frequency with rank; this is similar to the Zipf distribution [173] seen for word frequencies. Similarly, expressions become significantly less frequent as they become larger and more structurally complex. Interestingly, the number of distinct expressions increases with expression size and complexity.

In a later study, Watt focused on engineering mathematics, analyzing the LaTeX sources for three engineering mathematics textbooks [155]. In this study, all symbols were analyzed together, producing another Zipf distribution. N-grams (for n ∈ {2, 3, 4, 5}) were produced by traversing the symbol layout tree in writing order. The leaves of the tree, which store the symbols, provide the starting point. The traversal collects layout information to provide context: there is information about the spatial relationship between the n-gram symbols and symbols on neighboring baselines (e.g. fractions, super/subscript, containment by square root).

4.5.2 Heuristic Rules and Contextual Constraints

Heuristic rules and manually constructed language models are receiving use in post-processing. Chan and Yeung [29] describe an error-correcting parsing technique for converting handwritten symbols into operator trees, adding heuristic rules to re-segment characters recognized with low confidence, to insert epsilon (empty) symbols to recover from parse errors (e.g. after detecting unbalanced parentheses), and to replace symbol identities to make them consistent with the expression grammar (e.g. replacing 1 by / in y1x, and + by t in +an). Garain and Chaudhuri make use of a simple LaTeX grammar to constrain handwritten symbol recognition alternatives [50], while Kanahori et al. present work in analyzing the mathematical content (operator tree) for matrices in order to revise symbol layout analysis [68]. A more recent technique by Fujiyoshi et al. [47, 48], similar to that of Chan and Yeung, defines a grammar for valid symbol layout trees and then parses initial recognition results in order to identify invalid structures. During parsing, syntax errors are visualized so that users may identify the specific symbols associated with parse errors (e.g. unbalanced fence symbols).

Contextual constraints can also be incorporated into the recognition process itself. For example, Kim et al. [73] modify the penalty metric used in an A* search for constructing symbol layout trees for handwritten expressions [122]. The penalty metric considers measures of consistency of symbol size, style, and repetition, along with symbol n-grams and repeated subscripting.
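As a concrete illustration of the frequency-based constraints discussed in this section, the following sketch re-ranks candidate symbol labels using bigram counts gathered from a toy corpus of linearized symbol strings. The corpus, candidate scores and interpolation weight are invented for the example; real systems estimate such statistics from large collections such as the ArXiv data described above.

# Illustrative n-gram post-processing: symbol bigram counts from a toy corpus
# re-rank the label alternatives proposed for one symbol, given its
# already-recognized left neighbor.
from collections import Counter

corpus = [list('x^2+y^2'), list('x_1+x_2'), list('2^n'), list('y=x^2')]
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
unigrams = Counter(sym for s in corpus for sym in s)

def bigram_prob(prev, sym, smooth=1.0):
    vocab = len(unigrams)
    return (bigrams[(prev, sym)] + smooth) / (unigrams[prev] + smooth * vocab)

def rerank(prev_symbol, candidates, weight=0.5):
    """candidates: list of (label, recognizer_score in [0, 1])."""
    scored = [(lbl, (1 - weight) * s + weight * bigram_prob(prev_symbol, lbl))
              for lbl, s in candidates]
    return sorted(scored, key=lambda p: p[1], reverse=True)

# After '^', a classifier that slightly prefers 'z' over '2' can be overruled:
print(rerank('^', [('z', 0.55), ('2', 0.45)]))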
4.6 Integration of Recognition Modules
Integration of recognition modules has been an important new area of development in the last ten years.
Most approaches involve some form of dynamic programming. The earliest work in this area is Chous influential paper describing the use of stochastic contextfree string grammars for analysis of typeset images of
mathematical notation [34]. This approach combines
segmentation, recognition, and layout analysis, and is
highly tolerant of bit-flip noise. Subsequent work includes extensions by Hull [65], and extension to a more
general HMM-based model for document image decoding [79].
Stochastic context-free grammars associate a probability with each derivation rule; the derivation rules associated with each nonterminal have probabilities that
sum to one. The probability of a derivation is computed
as the product of the probabilities of all rule applications used to derive the input string. Rule probabilities
can be estimated by the author of the grammar, or they
can be derived from a training corpus using the Inside-Outside algorithm [34]. To facilitate the use of parsing through dynamic programming, stochastic context-free grammars are often represented in Chomsky Normal Form: all rules are of the form A → BC or A → t.
A modified form of the Cocke-Younger-Kasami (CYK)
parsing algorithm uses dynamic programming to produce the maximum likelihood parse in O(n³) time, where
n is the number of input tokens.
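The Viterbi-style CYK computation can be sketched for a one-dimensional token string as follows. The toy grammar, its probabilities, and the restriction to string (rather than image-region) concatenation are simplifying assumptions, so this illustrates only the dynamic program, not Chou's two-dimensional parser.

# Minimal Viterbi-CYK sketch for a stochastic CFG in Chomsky Normal Form,
# applied to a 1-D token string. The toy grammar is illustrative only.

def cyk(tokens, lexical, binary, start='EXPR'):
    """lexical: {(A, t): p}; binary: {(A, B, C): p}. Returns best parse prob."""
    n = len(tokens)
    best = [[{} for _ in range(n + 1)] for _ in range(n + 1)]  # best[i][j][A]
    for i, t in enumerate(tokens):                 # spans of length 1
        for (A, term), p in lexical.items():
            if term == t:
                best[i][i + 1][A] = max(best[i][i + 1].get(A, 0.0), p)
    for length in range(2, n + 1):                 # longer spans
        for i in range(0, n - length + 1):
            j = i + length
            for k in range(i + 1, j):              # split point
                for (A, B, C), p in binary.items():
                    pb = best[i][k].get(B, 0.0)
                    pc = best[k][j].get(C, 0.0)
                    if pb and pc and p * pb * pc > best[i][j].get(A, 0.0):
                        best[i][j][A] = p * pb * pc
    return best[0][n].get(start, 0.0)

lexical = {('VAR', 'x'): 1.0, ('NUM', '2'): 1.0, ('PLUS', '+'): 1.0,
           ('EXPR', 'x'): 0.2, ('EXPR', '2'): 0.2}
binary = {('EXPR', 'EXPR', 'SUM'): 0.4,   # EXPR -> EXPR SUM
          ('SUM', 'PLUS', 'EXPR'): 1.0,   # SUM  -> PLUS EXPR
          ('EXPR', 'VAR', 'NUM'): 0.2}    # EXPR -> VAR NUM
print(cyk(['x', '+', '2'], lexical, binary))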
In Chou's paper [34], the expression grammar is
augmented to include symbols representing horizontal
and vertical concatenation of adjacent regions in the
input image. In a lexical stage that precedes parsing, a template-based character recognizer is applied
to the entire input region, identifying a set of candidate symbols based on the Hamming distance between
input regions and a set of templates. This produces a
set of candidate symbols with associated probabilities.
More recently Yamamoto et al. [159] used a stochastic context-free grammar for online handwritten expressions, which introduces rules to model the likelihood of
written strokes along with rules incorporating probabilities for the expected relative positions of symbols (the
authors term these hidden writing areas).
There are many unexplored possibilities for using
stochastic context free grammars for math recognition.
For example, a variety of segmentation and classification methods might be employed within a framework of
stochastic context free grammars. Also, various heuristics could be used to prune or modify rules that are inferred from training data. It is true that sequential implementations of stochastic context free grammars are
computationally intensive, but both probability-estimation
algorithms and parsers may be parallelized [34]. Many
opportunities for parallelization exist in modern CPUs
with multiple cores and Graphical Processing Units.
The related technique of Hidden Markov Models
(automata that recognize probabilistic regular languages)
has been used to integrate segmentation and classification of handwritten symbols [80, 158] (analogous to
speech recognition [117]). For stochastic regular languages, the CYK algorithm reduces to the Viterbi algorithm, which may be used to determine the maximum likelihood path (parse) through a Hidden Markov
Model [34]. Hidden Markov Models form the core of a
general model of document image decoding, in which
the document-generation process is explicitly modeled
as part of the recognition system [79].
More recently, dynamic programming methods have
been used to let later stages of processing constrain
earlier ones in an optimization framework. For example, Toyozumi et al. address segmentation of handwritten symbols drawn online [152]. They produce improvements on the order of 5-7% over a feature-based elastic matching method by using simple, local grammatical rules to consider neighboring strokes and possible
under-segmentation of vertical operators such as fractions, square roots and summations. Shi, Li and Soong
go further, using a dynamic programming framework
to optimize symbol segmentation and recognition [130].
Their system considers a sequence of strokes from online
handwritten input. The space of all possible partitions
of the stroke sequence into symbols (containing at most
L strokes per symbol) is searched to find an optimal
partition through dynamic programming. The criterion
function that is used to evaluate a given stroke partition
uses two components: (1) a bigram model for symbol
adjacencies along particular spatial relationships, and
(2) the probability of the sequence of spatial relationships observed between symbols. As a post-processing
step, a trigram symbol sequence model is evaluated for
re-ranking alternatives. On a test set of over 2,500 expressions, a symbol accuracy of 96.6% is reported. An
extension employing graph-based discriminative training is reported by Shi and Soong [131], with similar
results. A method integrating complete symbol layout
trees into the dynamic programming is described in
Awal et al. [11].
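A minimal sketch of the kind of dynamic program used for stroke partitioning is given below. The scoring function stands in for a symbol classifier combined with spatial-relationship terms (as in [130]); the function names, the stroke representation and the toy score are assumptions made for illustration.

# Dynamic-programming symbol segmentation over a sequence of pen strokes:
# every partition into groups of at most max_strokes strokes is considered
# implicitly, and the partition maximizing the summed log-scores is returned.

def segment(strokes, score, max_strokes=4):
    """Return (best_log_score, groups) where groups partition `strokes`."""
    n = len(strokes)
    best = [float('-inf')] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_strokes), j):
            cand = best[i] + score(strokes[i:j])
            if cand > best[j]:
                best[j], back[j] = cand, i
    groups, j = [], n
    while j > 0:
        groups.append(strokes[back[j]:j])
        j = back[j]
    return best[n], list(reversed(groups))

# Toy score: prefer two-stroke groups (as if '+' or '=' were likely symbols).
toy = lambda group: 0.0 if len(group) == 2 else -1.0
print(segment(['s1', 's2', 's3', 's4'], toy))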
5 Conclusion
Recognition and retrieval of mathematical notation are
challenging, interrelated research areas of great practical importance. In math retrieval, the key problems
15 www.inftyproject.org/en/database.html
16 www.science.uva.nl/research/dlia/datasets/uwash3.html
17 www.scg.uwaterloo.ca/mathbrush/corpus
18 http://yann.lecun.com/exdb/mnist
19 http://graphics.cs.brown.edu/research/pcc/symbolRecognitionDataset.zip
20 http://www.isical.ac.in/crohme2011/
References
1. M. Adeel, H.S. Cheung, and H.S. Khiyal. Math go! Prototype of a content based mathematical formula search engine. J. Theoretical and Applied Information Technology, 4(10):1002-1012, 2008.
2. A.V. Aho, B.W. Kernighan, and P.J. Weinberger. The AWK Programming Language. Addison-Wesley, New York, 1988.
3. M. Altamimi and A.S. Youssef. An extensive math query language. In ISCA Int'l Conf. Software Engineering and Data Engineering, pages 57-63, Las Vegas, USA, 2007.
21. D. Blostein, E. Lank, and R. Zanibbi. Treatment of diagrams in document image analysis. In Proc. Int'l Conf. on Theory and Application of Diagrams, pages 330-344, London, UK, 2000. Springer.
22. P. Borlund. User-centered evaluation of information retrieval systems. In Information Retrieval: Searching in the 21st Century, pages 21-37. Wiley, 2009.
23. A. Bunt, M. Terry, and E. Lank. Friend or foe? Examining CAS use in mathematics research. In Proc. Int'l Conf. Human Factors in Computing Systems, pages 229-238, New York, 2009.
24. F. Cajori. A History of Mathematical Notations (2 vols.). Open Court Publishing Company, Chicago, Illinois, 1929.
25. J. Carette and W.M. Farmer. A review of mathematical knowledge management. In Proc. Mathematical Knowledge Management, volume 5625 of LNAI, pages 233-246. Springer, 2009.
26. D.O. Case. Looking for Information: A Survey of Research on Information Seeking, Needs, and Behavior. Academic Press, 2002.
27. R.G. Casey and E. Lecolinet. A survey of methods and strategies in character segmentation. IEEE Trans. Pattern Analysis and Machine Intelligence, 18(7):690-706, 1996.
28. K.-F. Chan and D.-Y. Yeung. Mathematical expression recognition: A survey. Int'l J. Document Analysis and Recognition, 3:3-15, 2000.
29. K.-F. Chan and D.-Y. Yeung. Error detection, error correction and performance evaluation in on-line mathematical expression recognition. Pattern Recognition, 34(8):1671-1684, 2001.
30. K.-F. Chan and D.-Y. Yeung. Pencalc: A novel application of on-line mathematical expression recognition technology. In Proc. Int'l Conf. Document Analysis and Recognition, pages 774-778, Seattle, USA, 2001.
31. S.-K. Chang. A method for the structural analysis of two-dimensional mathematical expressions. Information Sciences, 2:253-272, 1970.
32. B.W. Char and S.M. Watt. Representing and characterizing handwritten mathematical symbols through succinct functional approximation. In Proc. Int'l Conf. Document Analysis and Recognition, pages 1198-1202, Curitiba, Brazil, 2007.
33. T.W. Chaundy, P.R. Barrett, and Charles Batey. The Printing of Mathematics. Oxford University Press, London, 1957.
34. P.A. Chou. Recognition of equations using a two-dimensional stochastic context-free grammar. In Proc. Visual Communications and Image Processing IV, volume 1199 of Proc. SPIE, pages 852-863, 1989.
35. R. Datta, D. Joshi, J. Li, and J.Z. Wang. Image retrieval: Ideas, influences, and trends of the new age. ACM Computing Surveys, 40(2):1-60, 2008.
36. J.H. Davenport and M. Kohlhase. Unifying math ontologies: A tale of two standards. In Intelligent Computer Mathematics, volume 5625 of LNAI, pages 263-278. Springer, 2009.
37. M. Dewar. Openmath: An overview. ACM SIGSAM Bulletin, 34:25, 2000.
38. D. Doermann. The indexing and retrieval of document images: A survey. J. Computer Vision and Image Understanding, 70:287-298, 1998.
39. D.M. Drake and H.S. Baird. Distinguishing mathematics notation from english text using computational geometry. In Proc. Int'l Conf. Document Analysis and Recognition, pages 1270-1274, Seoul, Korea, 2005.
40. R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. Springer-Verlag, New York, 2nd edition, 2001.
60. E. Hatcher and O. Gospodnetić. Lucene in Action. Manning, 2nd edition, 2010.
61. M.A. Hearst. Search User Interfaces. Cambridge University Press, 1st edition, 2009.
62. D. Hiemstra. Information retrieval models. In Information Retrieval: Searching in the 21st Century, pages 1-17. Wiley, 2009.
63. N.J. Higham. Handbook of Writing for the Mathematical Sciences. Society for Industrial and Applied Mathematics, Philadelphia, 1993.
64. J. Hu, R.S. Kashi, D. Lopresti, and G.T. Wilfong. Evaluating the performance of table processing algorithms. Int'l J. Document Analysis and Recognition, 4(3):140-153, 2002.
65. J.F. Hull. Recognition of mathematics using a two-dimensional trainable context-free grammar. Master's thesis, MIT, Cambridge, MA, 1996.
66. A. Kacem, A. Belaid, and M. Ben Ahmed. Automatic extraction of printed mathematical formulas using fuzzy logic and propagation of context. Int'l J. Document Analysis and Recognition, 4:97-108, 2001.
67. S. Kamali and F. Tompa. Improving mathematics retrieval. In Proc. Digital Mathematics Libraries, pages 37-48, Grand Bend, Canada, 2009.
68. T. Kanahori, A.P. Sexton, V. Sorge, and M. Suzuki. Capturing abstract matrices from paper. In J. M. Borwein and W. M. Farmer, editors, Proc. Mathematical Knowledge Management, volume 4108 of LNAI, pages 124-138. Springer, 2006.
69. T. Kanahori and M. Suzuki. A recognition method of matrices by using variable block pattern elements generating rectangular areas. In Graphics Recognition: Algorithms and Applications, volume 2390 of LNCS, pages 320-329. Springer, 2002.
70. T. Kanahori and M. Suzuki. Detection of matrices and segmentation of matrix elements in scanned images of scientific documents. In Proc. Int'l Conf. Document Analysis and Recognition, pages 433-437, Edinburgh, 2003.
71. T. Kanahori and M. Suzuki. Refinement of digitized documents through recognition of mathematical formulae. In Proc. Int'l Work. on Document Image Analysis for Libraries, pages 27-28, Lyon, France, 2006.
72. T. Kanungo, R.M. Haralick, H.S. Baird, W. Stuetzle, and D. Madigan. A statistical, nonparametric methodology for document degradation model validation. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(11):1209-1223, 2000.
73. K. Kim, T.-H. Rhee, J.S. Lee, and J.H. Kim. Utilizing consistency context for handwritten mathematical expression recognition. In Proc. Int'l Conf. Document Analysis and Recognition, pages 1051-1055, Barcelona, Spain, 2009.
74. Donald E. Knuth. TeX and METAFONT - New Directions in Typesetting. Digital Press, Bedford, MA, 1979.
75. A. Kohlhase and M. Kohlhase. Re-examining the MKM value proposition: From math web search to math web research. In Proc. Symp. Towards Mechanized Mathematical Assistants, volume 4573 of LNCS, pages 313-326, Springer, 2007.
76. M. Kohlhase. OMDoc: An Open Markup Format for Mathematical Documents, volume 4180 of LNAI. Springer, 2006.
77. M. Kohlhase, S. Anca, C. Jucovschi, A.G. Palomo, and I. Sucan. MathWebSearch 0.4: A semantic search engine for mathematics. (unpublished manuscript, http://kwarc.info/kohlhase/publications.html), 2008.
78. M. Kohlhase and I. Sucan. A search engine for mathematical formulae. In Proc. Artificial Intelligence and Symbolic Computation, volume 4120 of LNAI, pages 241-253. Springer, 2006.
97. K. Marriott, B. Meyer, and K.D. Wittenburg. A survey of visual language specification and recognition. In Visual Language Theory, pages 5-85. Springer, 1998.
98. W.A. Martin. Computer input/output of mathematical expressions. In Proc. Symp. on Symbolic and Algebraic Manipulation, pages 78-89, Los Angeles, USA, 1971.
99. N. Matsakis. Recognition of handwritten mathematical expressions. Master's thesis, MIT, Cambridge, MA, 1999.
100. G. O. Michler. Report on the retrodigitization project Archiv der Mathematik. Archiv der Mathematik, 77:116-128, 2001.
101. G.O. Michler. How to build a prototype for a distributed digital mathematics archive library. Annals of Mathematics and Artificial Intelligence, 38:137-164, 2003.
102. B.R. Miller and A.S. Youssef. Technical aspects of the digital library of mathematical functions. Annals of Mathematics and Artificial Intelligence, 38:121-136, 2003.
103. E.G. Miller and P.A. Viola. Ambiguity and constraint in mathematical expression recognition. In Proc. 15th National Conf. on Artificial Intelligence, pages 784-791, Madison, Wisconsin, 1998.
104. R. Miner and R. Munavalli. An approach to mathematical search through query formulation and data normalization. In Towards Mechanized Mathematical Assistants, volume 4573 of LNAI, pages 342-355. Springer, 2007.
105. Y. Miyazaki and Y. Iguchi. Development of information-retrieval tool for MathML-based math expressions. In Proc. Int'l Conf. Computers in Education, pages 419-426, Taipei, Taiwan, 2008.
106. R. Munavalli and R. Miner. Mathfind: a math-aware search engine. In Proc. Int'l Conf. Information Retrieval, pages 735-735, New York, 2006.
107. G. Nagy. Twenty years of document image analysis in PAMI. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(1):38-62, 2000.
108. G. Nagy and S. Seth. Hierarchical representation of optically scanned documents. In Proc. Int'l Conf. Pattern Recognition, pages 347-349, Montréal, Canada, 1984.
109. I. Normann and M. Kohlhase. Extended formula normalization for -retrieval and sharing of mathematical knowledge. In Proc. Towards Mechanized Mathematical Assistants, volume 4573 of LNAI, pages 356-370. Springer, 2007.
110. M. Okamoto and K.T. Imait. Performance evaluation of a robust method for mathematical expression recognition. In Proc. Int'l Conf. Document Analysis and Recognition, pages 121-128, Seattle, USA, 2001.
111. M. Okamoto and B. Miao. Recognition of mathematical expressions by using the layout structures of symbols. In Proc. Int'l Conf. Document Analysis and Recognition, volume 1, pages 242-250, Saint-Malo, France, 1991.
112. M. Okamoto and A. Miyazawa. An experimental implementation of a document recognition system for papers containing mathematical expressions. In Structured Document Image Analysis, pages 36-53. Springer, 1992.
113. M. Panic. Math handwriting recognition in Windows 7 and its benefits. In Intelligent Computer Mathematics, volume 5625 of LNCS, pages 29-30. Springer, 2009.
114. I. Phillips. Methodologies for using UW databases for OCR and image understanding systems. In Proc. Document Recognition V, volume 3305 of SPIE Proceedings, pages 112-127, San Jose, 1998.
115. R. Plamondon and S.N. Srihari. On-line and off-line handwriting recognition: A comprehensive survey. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(1):63-84, 2000.
116. M. Pollanen, T. Wisniewski, and X. Yu. Xpress: A novice interface for the real-time communication of mathematical
117.
118.
119.
120.
121.
122.
123.
124.
125.
126.
127.
128.
129.
130.
131.
132.
133.
134.
135.
136. S. Smithies. Equation entry and editing via handwriting and gesture recognition. Behavior & Information Technology, 20(1):53-67, 2001.
137. S. Smithies, K. Novins, and J. Arvo. A handwriting-based equation editor. In Proc. Graphics Interface, pages 84-91, Kingston, Canada, 1999.
138. C.M. So and S.M. Watt. Determining empirical characteristics of mathematical expression use. In Proc. Mathematical Knowledge Management, volume 3863 of LNCS, pages 361-375. Springer, 2005.
139. C.M. So and S.M. Watt. On the conversion between content MathML and OpenMath. In Proc. Conf. Communicating Mathematics in the Digital Era, pages 169-182, Aveiro, Portugal, 2006.
140. M. Suzuki, T. Kanahori, N. Ohtake, and K. Yamaguchi. An integrated OCR software for mathematical documents and its output with accessibility. In Proc. Int'l Conf. Computers Helping People with Special Needs, volume 3119 of LNCS, pages 648-655. Springer, 2004.
141. M. Suzuki, F. Tamari, R. Fukuda, S. Uchida, and T. Kanahori. INFTY: An integrated OCR system for mathematical documents. In Proc. Document Engineering, pages 95-104, Grenoble, France, 2003.
142. M. Suzuki, S. Uchida, and A. Nomura. A ground-truthed mathematical character and symbol image database. In Proc. Int'l Conf. Document Analysis and Recognition, volume 2, pages 675-679, Seoul, Korea, 2005.
143. Y. Takiguchi, M. Okada, and Y. Miyake. A fundamental study of output translation from layout recognition and semantic understanding system for mathematical formulae. In Proc. Int'l Conf. Document Analysis and Recognition, pages 745-749, Seoul, Korea, 2005.
144. E. Tapia and R. Rojas. Recognition of on-line handwritten mathematical formulas in the e-chalk system. In Proc. Int'l Conf. Document Analysis and Recognition, pages 980-984, Edinburgh, 2003.
145. E. Tapia and R. Rojas. Recognition of on-line handwritten mathematical expressions using a minimum spanning tree construction and symbol dominance. In Graphics Recognition: Recent Advances and Perspectives, volume 3088 of LNCS, pages 329-340. Springer, 2004.
146. E. Tapia and R. Rojas. A survey on recognition of on-line handwritten mathematical notation. Technical Report B07-01, Free University of Berlin, 2007.
147. D. Tausky, G. Labahn, E. Lank, and M. Marzouk. Managing ambiguity in mathematical matrices. In Proc. Eurographics Work. Sketch-Based Interfaces and Modeling, pages 115-122, Riverside, CA, 2007.
148. The OpenMath Society. http://www.openmath.org/.
149. G.T. Toussaint. The use of context in pattern recognition. Pattern Recognition, 10:189-204, 1978.
150. K. Toyozumi, T. Suzuki, J. Mori, and Y. Suenaga. A system for real-time recognition of handwritten mathematical formulas. In Proc. Int'l Conf. Document Analysis and Recognition, pages 1059-1063, Seattle, USA, 2001.
151. K. Toyozumi, S. Takahiro, K. Mori, and Y. Suenaga. An on-line handwritten mathematical equation recognition system that can process matrix expressions by referring to the relative positions of matrix elements. Systems and Computers in Japan, 37(14):87-96, 2006.
152. K. Toyozumi, N. Yamada, K. Mase, T. Kitasaka, K. Mori, Y. Suenaga, and T. Takahashi. A study of symbol segmentation method for handwritten mathematical formula recognition using mathematical structure information. In Proc. Int'l Conf. Pattern Recognition, volume 2, pages 630-633, Cambridge, UK, 2004.
171.
172.
173.
174.