
Are NLP Models really able to Solve Simple Math Word Problems?

Arkil Patel    Satwik Bhattamishra    Navin Goyal
Microsoft Research India
[email protected], {t-satbh,navingo}@microsoft.com

arXiv:2103.07191v2 [cs.CL] 15 Apr 2021

Abstract

The problem of designing NLP solvers for math word problems (MWP) has seen sustained research activity and steady gains in the test accuracy. Since existing solvers achieve high performance on the benchmark datasets for elementary level MWPs containing one-unknown arithmetic word problems, such problems are often considered "solved" with the bulk of research attention moving to more complex MWPs. In this paper, we restrict our attention to English MWPs taught in grades four and lower. We provide strong evidence that the existing MWP solvers rely on shallow heuristics to achieve high performance on the benchmark datasets. To this end, we show that MWP solvers that do not have access to the question asked in the MWP can still solve a large fraction of MWPs. Similarly, models that treat MWPs as bag-of-words can also achieve surprisingly high accuracy. Further, we introduce a challenge dataset, SVAMP, created by applying carefully chosen variations over examples sampled from existing datasets. The best accuracy achieved by state-of-the-art models is substantially lower on SVAMP, thus showing that much remains to be done even for the simplest of the MWPs.

PROBLEM:
Text: Jack had 8 pens and Mary had 5 pens. Jack gave 3 pens to Mary. How many pens does Jack have now?
Equation: 8 - 3 = 5

QUESTION SENSITIVITY VARIATION:
Text: Jack had 8 pens and Mary had 5 pens. Jack gave 3 pens to Mary. How many pens does Mary have now?
Equation: 5 + 3 = 8

REASONING ABILITY VARIATION:
Text: Jack had 8 pens and Mary had 5 pens. Mary gave 3 pens to Jack. How many pens does Jack have now?
Equation: 8 + 3 = 11

STRUCTURAL INVARIANCE VARIATION:
Text: Jack gave 3 pens to Mary. If Jack had 8 pens and Mary had 5 pens initially, how many pens does Jack have now?
Equation: 8 - 3 = 5

Table 1: Example of a Math Word Problem along with the types of variations that we make to create SVAMP.

1 Introduction

A Math Word Problem (MWP) consists of a short natural language narrative describing a state of the world and poses a question about some unknown quantities (see Table 1 for some examples). MWPs are taught in primary and higher schools. The MWP task is a type of semantic parsing task where, given an MWP, the goal is to generate an expression (more generally, equations), which can then be evaluated to get the answer. The task is challenging because a machine needs to extract relevant information from natural language text as well as perform mathematical reasoning to solve it. The complexity of MWPs can be measured along multiple axes, e.g., reasoning and linguistic complexity and world and domain knowledge. A combined complexity measure is the grade level of an MWP, which is the grade in which similar MWPs are taught. Over the past few decades many approaches have been developed to solve MWPs, with significant activity in the last decade (Zhang et al., 2020).

MWPs come in many varieties. Among the simplest are the one-unknown arithmetic word problems where the output is a mathematical expression involving numbers and one or more arithmetic operators (+, −, ∗, /). Problems in Tables 1 and 6 are of this type. More complex MWPs may have systems of equations as output or involve other operators or may involve more advanced topics and specialized knowledge. Recently, researchers have started focusing on solving such MWPs, e.g. multiple-unknown linear word problems (Huang et al., 2016a), geometry (Sachan and Xing, 2017) and probability (Amini et al., 2019), believing that existing work can handle one-unknown arithmetic MWPs well (Qin et al., 2020). In this paper, we question the capabilities of the state-of-the-art
(SOTA) methods to robustly solve even the simplest of MWPs, suggesting that the above belief is not well-founded.

In this paper, we provide concrete evidence to show that existing methods use shallow heuristics to solve a majority of word problems in the benchmark datasets. We find that existing models are able to achieve reasonably high accuracy on MWPs from which the question text has been removed, leaving only the narrative describing the state of the world. This indicates that the models can rely on superficial patterns present in the narrative of the MWP and achieve high accuracy without even looking at the question. In addition, we show that a model without word-order information (i.e., the model treats the MWP as a bag-of-words) can also solve the majority of MWPs in benchmark datasets.

The presence of these issues in existing benchmarks makes them unreliable for measuring the performance of models. Hence, we create a challenge set called SVAMP (Simple Variations on Arithmetic Math word Problems; pronounced "swamp") of one-unknown arithmetic word problems with grade level up to 4 by applying simple variations over word problems in an existing dataset (see Table 1 for some examples). SVAMP further highlights the brittle nature of existing models when trained on these benchmark datasets. On evaluating SOTA models on SVAMP, we find that they are not even able to solve half the problems in the dataset. This failure of SOTA models on SVAMP points to the extent to which they rely on simple heuristics in the training data to make their predictions.

Below, we summarize the two broad contributions of our paper.

• We show that the majority of problems in benchmark datasets can be solved by shallow heuristics lacking word-order information or lacking question text.

• We create a challenge set called SVAMP [1] for more robust evaluation of methods developed to solve elementary level math word problems.

[1] The dataset and code are available at: https://github.com/arkilpatel/SVAMP

2 Related Work

Math Word Problems. A wide variety of methods and datasets have been proposed to solve MWPs; e.g. statistical machine learning (Roy and Roth, 2018), semantic parsing (Huang et al., 2017) and most recently deep learning (Wang et al., 2017; Xie and Sun, 2019; Zhang et al., 2020); see (Zhang et al., 2020) for an extensive survey. Many papers have pointed out various deficiencies with previous datasets and proposed new ones to address them. Koncel-Kedziorski et al. (2016) curated the MAWPS dataset from previous datasets, which along with Math23k (Wang et al., 2017) has been used as a benchmark in recent works. Recently, ASDiv (Miao et al., 2020) has been proposed to provide more diverse problems with annotations for equation, problem type and grade level. HMWP (Qin et al., 2020) is another newly proposed dataset of Chinese MWPs that includes examples with multiple unknown variables that require non-linear equations to solve.

Identifying artifacts in datasets has been done for the Natural Language Inference (NLI) task by McCoy et al. (2019), Poliak et al. (2018), and Gururangan et al. (2018). Rosenman et al. (2020) identified shallow heuristics in a Relation Extraction dataset. Cai et al. (2017) showed that biases prevalent in the ROC stories cloze task allowed models to yield state-of-the-art results when trained only on the endings. To the best of our knowledge, this kind of analysis has not been done on any Math Word Problem dataset.

Challenge Sets for NLP tasks have been proposed most notably for NLI and machine translation (Belinkov and Glass, 2019; Nie et al., 2020; Ribeiro et al., 2020). Gardner et al. (2020) suggested creating contrast sets by manually perturbing test instances in small yet meaningful ways that change the gold label. We believe that we are the first to introduce a challenge set targeted specifically for robust evaluation of Math Word Problems.

3 Background

3.1 Problem Formulation

We denote a Math Word Problem P by a sequence of n tokens P = (w1, ..., wn) where each token wi can be either a word from a natural language or a numerical value. The word problem P can be broken down into body B = (w1, ..., wk) and question Q = (wk+1, ..., wn). The goal is to map P to a valid mathematical expression E_P composed of numbers from P and mathematical operators from the set {+, −, /, ∗} (e.g. 3 + 5 − 4). The metric used to evaluate models on the MWP task is Execution Accuracy, which is obtained by comparing the predicted answer (calculated by evaluating E_P) with the annotated answer. In this work, we focus only on one-unknown arithmetic word problems.

3.2 Datasets and Methods

Many of the existing datasets are not suitable for our analysis as either they are in Chinese, e.g. Math23k (Wang et al., 2017) and HMWP (Qin et al., 2020), or have harder problem types, e.g. Dolphin18K (Huang et al., 2016b). We consider the widely used benchmark MAWPS (Koncel-Kedziorski et al., 2016), composed of 2373 MWPs, and the arithmetic subset of ASDiv (Miao et al., 2020) called ASDiv-A, which has 1218 MWPs mostly up to grade level 4 (MAWPS does not have grade level information). Both MAWPS and ASDiv-A are evaluated on 5-fold cross-validation based on pre-assigned splits.

We consider three models in our experiments:
(a) Seq2Seq consists of a Bidirectional LSTM Encoder to encode the input sequence and an LSTM decoder with attention (Luong et al., 2015) to generate the equation.
(b) GTS (Xie and Sun, 2019) uses an LSTM Encoder to encode the input sequence and a tree-based Decoder to generate the equation.
(c) Graph2Tree (Zhang et al., 2020) combines a Graph-based Encoder with a Tree-based Decoder.

Model                                MAWPS  ASDiv-A
Seq2Seq (S)                          79.7   55.5
Seq2Seq (R)                          86.7   76.9
GTS (S) (Xie and Sun, 2019)          82.6   71.4
GTS (R)                              88.5   81.2
Graph2Tree (S) (Zhang et al., 2020)  83.7   77.4
Graph2Tree (R)                       88.7   82.2
Majority Template Baseline [2]       17.7   21.2

Table 2: 5-fold cross-validation accuracies (↑) of baseline models on datasets. (R) means that the model is provided with RoBERTa pretrained embeddings while (S) means that the model is trained from scratch.

[2] The Majority Template Baseline is the accuracy when the model always predicts the most frequent Equation Template. Equation Templates are explained in Section 5.2.

The performance of these models on both datasets is shown in Table 2. We either provide RoBERTa (Liu et al., 2019) pre-trained embeddings to the models or train them from scratch. Graph2Tree (Zhang et al., 2020) with RoBERTa embeddings achieves the state-of-the-art for both datasets. Note that our implementations achieve a higher score than the previously reported highest score of 78% on ASDiv-A (Miao et al., 2020) and 83.7% on MAWPS (Zhang et al., 2020). The implementation details are provided in Section B in the Appendix.

4 Deficiencies in existing datasets

Here we describe the experiments that show that there are important deficiencies in MAWPS and ASDiv-A.

4.1 Evaluation on Question-removed MWPs

As mentioned in Section 3.1, each MWP consists of a body B, which provides a short narrative on a state of the world, and a question Q, which inquires about an unknown quantity about the state of the world. For each fold in the provided 5-fold split in MAWPS and ASDiv-A, we keep the train set unchanged while we remove the questions Q from the problems in the test set. Hence, each problem in the test set consists of only the body B without any question Q. We evaluate all three models with RoBERTa embeddings on these datasets. The results are provided in Table 3.

Model       MAWPS  ASDiv-A
Seq2Seq     77.4   58.7
GTS         76.2   60.7
Graph2Tree  77.7   64.4

Table 3: 5-fold cross-validation accuracies (↑) of baseline models on Question-removed datasets.

The best performing model is able to achieve a 5-fold cross-validation accuracy of 64.4% on ASDiv-A and 77.7% on MAWPS. Loosely translated, this means that nearly 64% of the problems in ASDiv-A and 78% of the problems in MAWPS can be correctly answered without even looking at the question. This suggests the presence of patterns in the bodies of MWPs in these datasets that have a direct correlation with the output equation.

Some recent works have also demonstrated similar evidence of bias in NLI datasets (Gururangan et al., 2018; Poliak et al., 2018). They observed that NLI models were able to predict the correct label for a large fraction of the standard NLI datasets based on only the hypothesis of the input and without the premise. Our results on question-removed math word problems resemble their observations on NLI datasets and similarly indicate the presence of artifacts that help statistical
models predict the correct answer without complete information. Note that even though the two methods appear similar, there is an important distinction. In Gururangan et al. (2018), the model is trained and tested on hypothesis-only examples and hence, the model is forced to find artifacts in the hypothesis during training. On the other hand, our setting is more natural since the model is trained in the standard way on examples with both the body and the question. Thus, the model is not explicitly forced to learn based on the body during training, and our results not only show the presence of artifacts in the datasets but also suggest that the SOTA models exploit them.

Following Gururangan et al. (2018), we attempt to understand the extent to which SOTA models rely on the presence of simple heuristics in the body to predict correctly. We partition the test set into two subsets for each model: problems that the model predicted correctly without the question are labeled Easy, and the problems that the model could not answer correctly without the question are labeled Hard. Table 4 shows the performance of the models on their respective Hard and Easy sets. Note that their performance on the full set is already provided in Table 2. It can be seen clearly that although the models correctly answer many Hard problems, the bulk of their success is due to the Easy problems. This shows that the ability of SOTA methods to robustly solve word problems is overestimated and that they rely on simple heuristics in the body of the problems to make predictions.

            MAWPS         ASDiv-A
Model       Easy   Hard   Easy   Hard
Seq2Seq     86.8   86.7   91.3   56.1
GTS         92.6   71.7   91.6   65.3
Graph2Tree  93.4   71.0   92.8   63.3

Table 4: Results of baseline models on the Easy and Hard test sets.

4.2 Performance of a constrained model

We construct a simple model based on the Seq2Seq architecture by removing the LSTM Encoder and replacing it with a Feed-Forward Network that maps the input embeddings to their hidden representations. The LSTM Decoder is provided with the average of these hidden representations as its initial hidden state. During decoding, an attention mechanism (Luong et al., 2015) assigns weights to individual hidden representations of the input tokens. We use either RoBERTa embeddings (non-contextual; taken directly from the Embedding Matrix) or train the model from scratch. Clearly, this model does not have access to word-order information.

Model                   MAWPS  ASDiv-A
FFN + LSTM Decoder (S)  75.1   46.3
FFN + LSTM Decoder (R)  77.9   51.2

Table 5: 5-fold cross-validation accuracies (↑) of the constrained model on the datasets. (R) denotes that the model is provided with non-contextual RoBERTa pretrained embeddings while (S) denotes that the model is trained from scratch.

Table 5 shows the performance of this model on MAWPS and ASDiv-A. The constrained model with non-contextual RoBERTa embeddings is able to achieve a cross-validation accuracy of 51.2 on ASDiv-A and an astounding 77.9 on MAWPS. It is surprising to see that a model having no word-order information can solve a majority of word problems in these datasets. These results indicate that it is possible to get a good score on these datasets by simply associating the occurrence of specific words in the problems with their corresponding equations. We illustrate this more clearly in the next section.

4.3 Analyzing the attention weights

To get a better understanding of how the constrained model is able to perform so well, we analyze the attention weights that it assigns to the hidden representations of the input tokens. As shown by Wiegreffe and Pinter (2019), analyzing the attention weights of our constrained model is a reliable way to explain its prediction since each hidden representation consists of information about only that token, as opposed to the case of an RNN where each hidden representation may have information about the context, i.e. its neighboring tokens.

We train the constrained model (with RoBERTa embeddings) on the full ASDiv-A dataset and observe the attention weights it assigns to the words of the input problems. We found that the model usually attends to a single word to make its prediction, irrespective of the context. Table 6 shows some representative examples. In the first example, the model assigns an attention weight of 1 to the representation of the word 'every' and predicts the correct equation. However, when we make a subtle change to this problem such that the corresponding equation changes, the model keeps on attending over the word 'every' and predicts the same equation, which is now incorrect. Similar observations can be made for the other two examples. Table 26 in the Appendix has more such examples. These examples represent only a few types of spurious correlations that we could find, but there could be other types of correlations that might have been missed.

Input Problem | Predicted Equation | Answer
John delivered 3 letters at every house. If he delivered for 8 houses, how many letters did John deliver? | 3 * 8 | 24 ✓
John delivered 3 letters at every house. He delivered 24 letters in all. How many houses did John visit to deliver letters? | 3 * 24 | 72 ✗
Sam made 8 dollars mowing lawns over the Summer. He charged 2 bucks for each lawn. How many lawns did he mow? | 8 / 2 | 4 ✓
Sam mowed 4 lawns over the Summer. If he charged 2 bucks for each lawn, how much did he earn? | 4 / 2 | 2 ✗
10 apples were in the box. 6 are red and the rest are green. how many green apples are in the box? | 10 - 6 | 4 ✓
10 apples were in the box. Each apple is either red or green. 6 apples are red. how many green apples are in the box? | 10 / 6 | 1.67 ✗

Table 6: Attention paid to specific words by the constrained model.

Note that we do not claim that every model trained on these datasets relies on the occurrence of specific words in the input problem for prediction the way our constrained model does. We are only asserting that it is possible to achieve a good score on these datasets even with such a brittle model, which clearly makes these datasets unreliable for robustly measuring model performance.

5 SVAMP

The efficacy of existing models on benchmark datasets has led to a shift in the focus of researchers towards more difficult MWPs. We claim that this efficacy on benchmarks is misleading and that SOTA MWP solvers are unable to solve even elementary level one-unknown MWPs. To this end, we create a challenge set named SVAMP containing simple one-unknown arithmetic word problems of grade level up to 4. The examples in SVAMP test a model across different aspects of solving word problems. For instance, a model needs to be sensitive to questions and possess certain reasoning abilities to correctly solve the examples in our challenge set. SVAMP is similar to existing datasets of the same level in terms of scope and difficulty for humans, but is less susceptible to being solved by models relying on superficial patterns.

Our work differs from adversarial data collection methods such as Adversarial NLI (Nie et al., 2020) in that these methods create examples depending on the failure of a particular model, while we create examples without referring to any specific model. Inspired by the notion of Normative evaluation (Linzen, 2020), our goal is to create a dataset of simple problems that any system designed to solve MWPs should be expected to solve. We create new problems by applying certain variations to existing problems, similar to the work of Ribeiro et al. (2020). However, unlike their work, our variations do not check for linguistic capabilities. Rather, the choice of our variations is motivated by the experiments in Section 4 as well as certain simple capabilities that any MWP solver must possess.

5.1 Creating SVAMP

We create SVAMP by applying certain types of variations to a set of seed examples sampled from the recently proposed ASDiv-A dataset. We select seed examples from ASDiv-A since it appears to be of higher quality and harder than the MAWPS dataset: we perform a simple experiment to test the coverage of each dataset by training a model on one dataset and testing it on the other. For instance, when we train a Graph2Tree model on ASDiv-A, it achieves 82% accuracy on MAWPS. However, when trained on MAWPS and tested on ASDiv-A, the model achieves only 73% accuracy. Also recall Table 2, where most models performed better on MAWPS. Moreover, ASDiv has problems annotated according to types and grade levels, which are useful for us.

Group           Examples in ASDiv-A  Selected Seed Examples
Addition        278                  28
Subtraction     362                  33
Multiplication  188                  19
Division        176                  20
Total           1004                 100

Table 7: Distribution of selected seed examples across types.

To select a subset of seed examples that sufficiently represents the different types of problems in the ASDiv-A dataset, we first divide the examples into groups according to their annotated types. We discard types such as 'TVQ-Change', 'TVQ-Initial', 'Ceil-Division' and 'Floor-Division' that have less than 20 examples each. We also do not consider the 'Difference' type since it requires the use of an additional modulus operator. For ease of creation, we discard the few examples that are more than 40 words long. To control the complexity of the resulting variations, we only consider those problems as seed examples that can be solved by an expression with a single operator. Then, within each group, we cluster examples using K-Means over RoBERTa sentence embeddings of each example. From each cluster, the example closest to the cluster centroid is selected as a seed example. We selected a total of 100 seed examples in this manner. The distribution of seed examples according to different types of problems can be seen in Table 7.

5.1.1 Variations

The variations that we make to each seed example can be broadly classified into three categories based on desirable properties of an ideal model: Question Sensitivity, Reasoning Ability and Structural Invariance. Examples of each type of variation are provided in Table 8.

CATEGORY: Question Sensitivity
Variation type: Same Object, Different Structure
  Original: Allan brought two balloons and Jake brought four balloons to the park. How many balloons did Allan and Jake have in the park?
  Variation: Allan brought two balloons and Jake brought four balloons to the park. How many more balloons did Jake have than Allan in the park?
Variation type: Different Object, Same Structure
  Original: In a school, there are 542 girls and 387 boys. 290 more boys joined the school. How many pupils are in the school?
  Variation: In a school, there are 542 girls and 387 boys. 290 more boys joined the school. How many boys are in the school?
Variation type: Different Object, Different Structure
  Original: He then went to see the oranges being harvested. He found out that they harvest 83 sacks per day and that each sack contains 12 oranges. How many sacks of oranges will they have after 6 days of harvest?
  Variation: He then went to see the oranges being harvested. He found out that they harvest 83 sacks per day and that each sack contains 12 oranges. How many oranges do they harvest per day?

CATEGORY: Reasoning Ability
Variation type: Add relevant information
  Original: Every day, Ryan spends 4 hours on learning English and 3 hours on learning Chinese. How many hours does he spend on learning English and Chinese in all?
  Variation: Every day, Ryan spends 4 hours on learning English and 3 hours on learning Chinese. If he learns for 3 days, how many hours does he spend on learning English and Chinese in all?
Variation type: Change information
  Original: Jack had 142 pencils. Jack gave 31 pencils to Dorothy. How many pencils does Jack have now?
  Variation: Dorothy had 142 pencils. Jack gave 31 pencils to Dorothy. How many pencils does Dorothy have now?
Variation type: Invert operation
  Original: He also made some juice from fresh oranges. If he used 2 oranges per glass of juice and he made 6 glasses of juice, how many oranges did he use?
  Variation: He also made some juice from fresh oranges. If he used 2 oranges per glass of juice and he used up 12 oranges, how many glasses of juice did he make?

CATEGORY: Structural Invariance
Variation type: Change order of objects
  Original: John has 8 marbles and 3 stones. How many more marbles than stones does he have?
  Variation: John has 3 stones and 8 marbles. How many more marbles than stones does he have?
Variation type: Change order of phrases
  Original: Matthew had 27 crackers. If Matthew gave equal numbers of crackers to his 9 friends, how many crackers did each person eat?
  Variation: Matthew gave equal numbers of crackers to his 9 friends. If Matthew had a total of 27 crackers initially, how many crackers did each person eat?
Variation type: Add irrelevant information
  Original: Jack had 142 pencils. Jack gave 31 pencils to Dorothy. How many pencils does Jack have now?
  Variation: Jack had 142 pencils. Dorothy had 50 pencils. Jack gave 31 pencils to Dorothy. How many pencils does Jack have now?

Table 8: Types of Variations with examples. 'Original:' denotes the base example from which the variation is created; 'Variation:' denotes a manually created variation.

1. Question Sensitivity. Variations in this category check if the model's answer depends on the question. In these variations, we change the question in the seed example while keeping the body the same. The possible variations are as follows:
(a) Same Object, Different Structure: The principal object (i.e. the object whose quantity is unknown) in the question is kept the same while the structure of the question is changed.
(b) Different Object, Same Structure: The principal object in the question is changed while the structure of the question remains fixed.
(c) Different Object, Different Structure: Both the principal object in the question and the structure of the question are changed.

2. Reasoning Ability. Variations here check whether a model has the ability to correctly determine a change in reasoning arising from subtle changes in the problem text. The different possible variations are as follows:
(a) Add relevant information: Extra relevant information is added to the example that affects the output equation.
(b) Change information: The information provided in the example is changed.
(c) Invert operation: The previously unknown quantity is now provided as information and the question instead asks about a previously known quantity which is now unknown.

3. Structural Invariance. Variations in this category check whether a model remains invariant to superficial structural changes that do not alter the answer or the reasoning required to solve the example. The different possible variations are as follows:
(a) Add irrelevant information: Extra irrelevant information is added to the problem text that is not required to solve the example.
(b) Change order of objects: The order of objects appearing in the example is changed.
(c) Change order of phrases: The order of number-containing phrases appearing in the example is changed.

5.1.2 Protocol for creating variations

Since creating variations requires a high level of familiarity with the task, the construction of SVAMP is done in-house by the authors and colleagues, hereafter called the workers. The 100 seed examples (as shown in Table 7) are distributed among the workers.

For each seed example, the worker needs to create new variations by applying the variation types discussed in Section 5.1.1. Importantly, a combination of different variations over the seed example can also be applied. For each new example created, the worker needs to annotate it with the equation as well as the type of variation(s) used to create it. More details about the creation protocol can be found in Appendix C.

We created a total of 1098 examples. However, since ASDiv-A does not have examples with equations of more than two operators, we discarded 98 examples from our set whose equations consist of more than two operators. This is to ensure that our challenge set does not have any unfairly difficult examples. The final set of 1000 examples was provided to an external volunteer unfamiliar with the task to check the grammatical and logical correctness of each example.

5.2 Dataset Properties

Our challenge set SVAMP consists of one-unknown arithmetic word problems which can be solved by expressions requiring no more than two operators. Table 9 shows some statistics of our dataset and of ASDiv-A and MAWPS. The Equation Template for each example is obtained by converting the corresponding equation into prefix form and masking out all numbers with a meta symbol. Observe that the number of distinct Equation Templates and the Average Number of Operators are similar for SVAMP and ASDiv-A and are considerably smaller than for MAWPS. This indicates that SVAMP does not contain unfairly difficult MWPs in terms of the arithmetic expression expected to be produced by a model.

Dataset  # Problems  # Equation Templates  Avg # Ops  CLD
MAWPS    2373        39                    1.78       0.26
ASDiv-A  1218        19                    1.23       0.50
SVAMP    1000        26                    1.24       0.22

Table 9: Statistics of our dataset compared with MAWPS and ASDiv-A.

Previous works, including those introducing MAWPS and ASDiv, have tried to capture the notion of diversity in MWP datasets. Miao et al. (2020) introduced a metric called Corpus Lexicon Diversity (CLD) to measure lexical diversity. Their contention was that higher lexical diversity is correlated with the quality of a dataset. As can be seen from Table 9, SVAMP has a much lower CLD than ASDiv-A. SVAMP is also less diverse in terms of problem types compared to ASDiv-A. Despite this, we will show in the next section that SVAMP is in fact more challenging than ASDiv-A for current models. Thus, we believe that lexical diversity is not a reliable way to measure the quality of MWP datasets. Rather, quality could depend on other factors such as the diversity in MWP structure, which precludes models from exploiting shallow heuristics.

5.3 Experiments on SVAMP

We train the three considered models on a combination of MAWPS and ASDiv-A and test them on SVAMP. The scores of all three models with and without RoBERTa embeddings for various subsets of SVAMP can be seen in Table 10.
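As a concrete illustration of the Equation Template notion used in Section 5.2 (and in the Majority Template Baseline), the following sketch converts an infix equation into its prefix-form template with numbers masked. The function name and the "NUM" meta symbol are our own illustrative choices, not taken from the released SVAMP code:

```python
import ast

def equation_template(equation: str) -> str:
    """Convert an infix equation like '8 - 3' into a prefix-form
    template with all numbers masked, e.g. '- NUM NUM'."""
    def to_prefix(node):
        if isinstance(node, ast.BinOp):
            op = {ast.Add: "+", ast.Sub: "-",
                  ast.Mult: "*", ast.Div: "/"}[type(node.op)]
            return [op] + to_prefix(node.left) + to_prefix(node.right)
        if isinstance(node, ast.Constant):  # a number: mask it
            return ["NUM"]
        raise ValueError("unsupported expression")
    tree = ast.parse(equation, mode="eval").body
    return " ".join(to_prefix(tree))
```

Under this normalization, "8 - 3" and "100 - 42" both map to the template "- NUM NUM", which is how template frequencies would be counted for a majority baseline.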
              Seq2Seq       GTS           Graph2Tree
              S      R      S      R      S      R
Full Set     24.2   40.3   30.8   41.0   36.5   43.8
One-Op       25.4   42.6   31.7   44.6   42.9   51.9
Two-Op       20.3   33.1   27.9   29.7   16.1   17.8
ADD          28.5   41.9   35.8   36.3   24.9   36.8
SUB          22.3   35.1   26.7   36.9   41.3   41.3
MUL          17.9   38.7   29.2   38.7   27.4   35.8
DIV          29.3   56.3   39.5   61.1   40.7   65.3

Table 10: Results of models on the SVAMP challenge set. S indicates that the model is trained from scratch. R indicates that the model was trained with RoBERTa embeddings. The first row shows the results for the full dataset. The next two rows show the results for subsets of SVAMP composed of examples that have equations with one operator and two operators respectively. The last four rows show the results for subsets of SVAMP composed of examples of type Addition, Subtraction, Multiplication and Division respectively.

Model                         SVAMP
FFN + LSTM Decoder (S)        17.5
FFN + LSTM Decoder (R)        18.3
Majority Template Baseline    11.7

Table 12: Accuracies (↑) of the constrained model on SVAMP. (R) denotes that the model is provided with non-contextual RoBERTa pretrained embeddings while (S) denotes that the model is trained from scratch.

The best performing Graph2Tree model is only able to achieve an accuracy of 43.8% on SVAMP. This indicates that the problems in SVAMP are indeed more challenging for the models than the problems in ASDiv-A and MAWPS, despite being of the same scope and type and less diverse. Table 27 in the Appendix lists some simple examples from SVAMP on which the best performing model fails. These results lend further support to our claim that existing models cannot robustly solve elementary level word problems.

We also explored using SVAMP for training by combining it with ASDiv-A and MAWPS. We performed 5-fold cross-validation over SVAMP where the model was trained on a combination of the three datasets and tested on unseen examples from SVAMP. To create the folds, we first divide the seed examples into five sets, with each type of example distributed nearly equally among the sets. A fold is obtained by combining all the examples in SVAMP that were created using the seed examples in a set. In this way, we get five different folds from the five sets. We found that the best model achieved about 65% accuracy. This indicates that even with additional training data, existing models are still not close to the performance that was estimated based on prior benchmark datasets.
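A minimal sketch of this seed-level fold construction is given below. Because every variation of a seed travels into the same fold as its seed, no variation of a test-time seed can leak into training. The field names (`seed_id`, `category`) are illustrative, not the released data format.

```python
import random
from collections import defaultdict

def make_folds(examples, n_folds=5, seed=0):
    """examples: dicts carrying the id of the seed they were created from."""
    by_seed = defaultdict(list)
    for ex in examples:
        by_seed[ex["seed_id"]].append(ex)

    # Group seed ids by type so each kind of example is spread
    # nearly equally among the sets.
    by_cat = defaultdict(list)
    for sid, exs in by_seed.items():
        by_cat[exs[0]["category"]].append(sid)

    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    idx = 0
    for sids in by_cat.values():
        rng.shuffle(sids)
        for sid in sids:
            # All variations of a seed go together into one fold.
            folds[idx % n_folds].extend(by_seed[sid])
            idx += 1
    return folds
```

The design choice worth noting is that the split is over seeds, not over individual examples: a random example-level split would place near-duplicate variations of the same seed in both train and test.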
Next, we remove the questions from the examples in SVAMP and evaluate them using the three models with RoBERTa embeddings trained on the combined MAWPS and ASDiv-A. The scores can be seen in Table 11. The accuracy drops by half when compared to ASDiv-A, and by more than half compared to MAWPS, suggesting that the problems in SVAMP are more sensitive to the information present in the question.

Model         SVAMP w/o ques    ASDiv-A w/o ques
Seq2Seq            29.2               58.7
GTS                28.6               60.7
Graph2Tree         30.8               64.4

Table 11: Accuracies (↑) of models on SVAMP without questions. The 5-fold CV accuracy scores for ASDiv-A without questions are restated for easier comparison.

We also evaluate the performance of the constrained model on SVAMP when trained on MAWPS and ASDiv-A. The best model achieves only 18.3% accuracy (see Table 12), which is marginally better than the majority template baseline. This shows that the problems in SVAMP are less vulnerable to being solved by models using simple patterns, and that a model needs contextual information in order to solve them.

To check the influence of different categories of variations in SVAMP, for each category, we measure the difference between the accuracy of the best model on the full dataset and its accuracy on a subset containing no example created from that category of variations. The results are shown in Table 13. Both the Question Sensitivity and Structural Invariance categories of variations show an increase in accuracy when their examples are removed, thereby indicating that they make SVAMP more challenging. The decrease in accuracy for the Reasoning Ability category can be attributed in large part to the Invert Operation variation. This is not surprising because most of the examples created from Invert Operation are almost indistinguishable from examples in ASDiv-A, which the model has seen during training. The scores for each individual variation are provided in Table 14.

Removed Category         # Removed Examples    Change in Accuracy (∆)
Question Sensitivity            462                   +13.7
Reasoning Ability               649                    -3.3
Structural Invariance           467                    +4.5

Table 13: Change in accuracies when categories are removed. The Change in Accuracy ∆ = Acc(Full − Cat) − Acc(Full), where Acc(Full) is the accuracy on the full set and Acc(Full − Cat) is the accuracy on the set of examples left after removing all examples which were created using Category Cat either by itself, or in use with other categories.

Removed Variation           # Removed Examples    Change in Accuracy (∆)
Same Obj, Diff Struct              325                    +7.3
Diff Obj, Same Struct               69                    +1.5
Diff Obj, Diff Struct               74                    +1.3
Add Rel Info                       264                    +5.5
Change Info                        149                    +3.2
Invert Operation                   255                   -10.2
Change order of Obj                107                    +2.3
Change order of Phrases            152                    -3.3
Add Irrel Info                     281                    +6.9

Table 14: Change in accuracies when variations are removed. The Change in Accuracy ∆ = Acc(Full − Var) − Acc(Full), where Acc(Full) is the accuracy on the full set and Acc(Full − Var) is the accuracy on the set of examples left after removing all examples which were created using Variation Var either by itself, or in use with other variations.

We also check the break-up of performance of the best performing Graph2Tree model according to the number of numbers present in the text of the input problem. We trained the model on both ASDiv-A and MAWPS, tested it on SVAMP, and compared those results against the 5-fold cross-validation setting of ASDiv-A. The scores are provided in Table 15. While the model can solve many problems consisting of only two numbers in the input text (even in our challenge set), it performs very badly on problems having more than two numbers. This shows that current methods are incapable of properly associating numbers with their context. Also, the gap between the performance on ASDiv-A and SVAMP is high, indicating that the examples in SVAMP are more difficult for these models to solve than the examples in ASDiv-A, even when considering the structurally same type of word problems.

Dataset     2 nums    3 nums    4 nums
ASDiv-A      93.3      59.0      47.5
SVAMP        78.3      25.4      25.4

Table 15: Accuracy break-up according to the number of numbers in the input problem. 2 nums refers to the subset of problems which have only 2 numbers in the problem text. Similarly, 3 nums and 4 nums are subsets that contain 3 and 4 different numbers in the problem text respectively.

6 Final Remarks

Going back to the original question, are existing NLP models able to solve elementary math word problems? This paper gives a negative answer. We have empirically shown that the benchmark English MWP datasets suffer from artifacts that make them unreliable for gauging the performance of MWP solvers: we demonstrated that the majority of problems in the existing datasets can be solved by simple heuristics, even without word-order information or the question text.

The performance of the existing models on our proposed challenge dataset also highlights their limitations in solving simple elementary level word problems. We hope that our challenge set SVAMP, containing elementary level MWPs, will enable more robust evaluation of methods. We believe that methods proposed in the future that make genuine advances in solving the task, rather than relying on simple heuristics, will perform well on SVAMP despite being trained on other datasets such as ASDiv-A and MAWPS.

In recent years, the focus of the community has shifted towards solving more difficult MWPs such as non-linear equations and word problems with multiple unknown variables. We demonstrated that the capability of existing models to solve simple one-unknown arithmetic word problems is overestimated. We believe that developing more robust methods for solving elementary MWPs remains a significant open problem.

Acknowledgements

We thank the anonymous reviewers for their constructive comments. We would also like to thank our colleagues at Microsoft Research for providing valuable feedback. We are grateful to Monojit Choudhury for discussions about creating the dataset. We thank Kabir Ahuja for carrying out preliminary experiments that led to this work. We also thank Vageesh Chandramouli and Nalin Patel for their help in dataset construction.
References

Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. 2019. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2357–2367, Minneapolis, Minnesota. Association for Computational Linguistics.

Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72.

Zheng Cai, Lifu Tu, and Kevin Gimpel. 2017. Pay attention to the ending: Strong neural baselines for the ROC story cloze task. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 616–622, Vancouver, Canada. Association for Computational Linguistics.

Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hanna Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. 2020. Evaluating models' local decision boundaries via contrast sets.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112, New Orleans, Louisiana. Association for Computational Linguistics.

Danqing Huang, Shuming Shi, Chin-Yew Lin, and Jian Yin. 2017. Learning fine-grained expressions to solve math word problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 805–814, Copenhagen, Denmark. Association for Computational Linguistics.

Danqing Huang, Shuming Shi, Chin-Yew Lin, Jian Yin, and Wei-Ying Ma. 2016a. How well do computers solve math word problems? Large-scale dataset construction and evaluation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 887–896, Berlin, Germany. Association for Computational Linguistics.

Danqing Huang, Shuming Shi, Chin-Yew Lin, Jian Yin, and Wei-Ying Ma. 2016b. How well do computers solve math word problems? Large-scale dataset construction and evaluation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 887–896, Berlin, Germany. Association for Computational Linguistics.

Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. 2016. MAWPS: A math word problem repository. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1152–1157, San Diego, California. Association for Computational Linguistics.

Tal Linzen. 2020. How can we accelerate progress towards human-like linguistic generalization? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5210–5217, Online. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy. Association for Computational Linguistics.

Shen-yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2020. A diverse corpus for evaluating and developing English math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 975–984, Online. Association for Computational Linguistics.

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885–4901, Online. Association for Computational Linguistics.

Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 180–191, New Orleans, Louisiana. Association for Computational Linguistics.

Jinghui Qin, Lihui Lin, Xiaodan Liang, Rumin Zhang, and Liang Lin. 2020. Semantically-aligned universal tree-structured solver for math word problems.

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online. Association for Computational Linguistics.

Shachar Rosenman, Alon Jacovi, and Yoav Goldberg. 2020. Exposing shallow heuristics of relation extraction models with challenge data. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3702–3710, Online. Association for Computational Linguistics.

Subhro Roy and Dan Roth. 2018. Mapping to declarative knowledge for word problem solving. Transactions of the Association for Computational Linguistics, 6:159–172.

Mrinmaya Sachan and Eric Xing. 2017. Learning to solve geometry problems from natural language demonstrations in textbooks. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), pages 251–261, Vancouver, Canada. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Yan Wang, Xiaojiang Liu, and Shuming Shi. 2017. Deep neural solver for math word problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 845–854, Copenhagen, Denmark. Association for Computational Linguistics.

Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 11–20, Hong Kong, China. Association for Computational Linguistics.

Zhipeng Xie and Shichao Sun. 2019. A goal-driven tree-structured neural model for math word problems. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 5299–5305. International Joint Conferences on Artificial Intelligence Organization.

D. Zhang, L. Wang, L. Zhang, B. T. Dai, and H. T. Shen. 2020. The gap of semantic parsing: A survey on automatic math word problem solvers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(9):2287–2305.

Jipeng Zhang, Lei Wang, Roy Ka-Wei Lee, Yi Bin, Yan Wang, Jie Shao, and Ee-Peng Lim. 2020. Graph-to-tree learning for solving math word problems. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3928–3937, Online. Association for Computational Linguistics.
A Experiments with Transformer Model

We additionally ran all our experiments with the Transformer (Vaswani et al., 2017) model. The 5-fold cross-validation accuracies of the Transformer on MAWPS and ASDiv-A are provided in Table 16. The scores on the Question-removed datasets are provided in Table 17, and on the SVAMP challenge set in Table 18.

Model          MAWPS    ASDiv-A    SVAMP
Transformer     79.4      64.4      25.3

Table 17: 5-fold cross-validation accuracies (↑) of the Transformer model on Question-removed datasets.

             Transformer
             S       R
Full Set    18.4    38.9
One-Op      18.6    40.5
Two-Op      17.8    33.9
ADD         22.3    36.3
SUB         17.1    37.5
MUL         17.9    28.3
DIV         18.6    53.3

Table 18: Results of the Transformer model on the SVAMP challenge set. S indicates that the model is trained from scratch. R indicates that the model was trained with RoBERTa embeddings. The first row shows the results for the full dataset. The next two rows show the results for subsets of SVAMP composed of examples that have equations with one operator and two operators respectively. The last four rows show the results for subsets of SVAMP composed of examples of type Addition, Subtraction, Multiplication and Division respectively.

B Implementation Details

We use 8 NVIDIA Tesla P100 GPUs, each with 16 GB memory, to run our experiments. The hyperparameters used for each model are shown in Table 19. The hyperparameters used for the Transformer model are provided in Table 20. The best hyperparameters are highlighted in bold. Following the setting of Zhang et al. (2020), the arithmetic word problems from MAWPS are divided into five folds, each of equal test size. For ASDiv-A, we consider the 5-fold split [238, 238, 238, 238, 266] provided by the authors (Miao et al., 2020).

C Creation Protocol
Model              MAWPS    ASDiv-A
Transformer (S)     77.9      52.1
Transformer (R)     87.1      77.7

Table 16: 5-fold cross-validation accuracies (↑) of the Transformer model on the datasets. (R) means that the model is provided with RoBERTa pretrained embeddings while (S) means that the model is trained from scratch.

We create variations in template form. Generating more data by scaling up from these templates, or by performing automatic operations on them, is left for future work. The template form of an example is created by replacing certain words with their respective tags. Table 21 lists the various tags used in the templates.

The [NUM] tag is used to replace all the numbers, and the [NAME] tag is used to replace all the names of persons in the example. The [OBJs] and [OBJp] tags are used for replacing the objects in the example. [OBJs] and [OBJp] tags with the same index represent the same object in singular and plural form respectively. The intention when using the [OBJs] or the [OBJp] tag is that it can act as a placeholder for other similar words which, when entered in that place, make sense as per the context. These tags must not be used for collectives; rather, they should be used for the things that the collective represents. Some example uses of the [OBJs] and [OBJp] tags are provided in Table 22. Lastly, the [MOD] tag must be used to replace any modifier preceding the [OBJs]/[OBJp] tag.

A preprocessing script is executed over the Seed Examples to automatically generate template suggestions for the workers. The script uses Named Entity Recognition and Regular Expression matching to automatically mask the names of persons and the numbers found in the Seed Examples. The outputs from the script are called the Script Examples. An illustration is provided in Table 23.

                    Seq2Seq                               GTS                                   Graph2Tree                            Constrained
Hyperparameters     Scratch             RoBERTa           Scratch             RoBERTa           Scratch             RoBERTa           Scratch        RoBERTa
Embedding Size      [128, 256]          [768]             [128, 256]          [768]             [128, 256]          [768]             [128, 256]     [768]
Hidden Size         [256, 384]          [256, 384]        [384, 512]          [384, 512]        [256, 384]          [256, 384]        [256, 384]     [256, 384]
Number of Layers    [1, 2]              [1, 2]            [1, 2]              [1, 2]            [1, 2]              [1, 2]            [1, 2]         [1, 2]
Learning Rate       [5e-4, 8e-4, 1e-3]  [1e-4, 2e-4, 5e-4] [8e-4, 1e-3, 2e-3] [5e-4, 8e-4, 1e-3] [8e-4, 1e-3, 2e-3] [5e-4, 8e-4, 1e-3] [1e-3, 2e-3]  [1e-3, 2e-3]
Embedding LR        [5e-4, 8e-4, 1e-3]  [5e-6, 8e-6, 1e-5] [8e-4, 1e-3, 2e-3] [5e-6, 8e-6, 1e-5] [8e-4, 1e-3, 2e-3] [5e-6, 8e-6, 1e-5] [1e-3, 2e-3]  [1e-3, 2e-3]
Batch Size          [8, 16]             [4, 8]            [8, 16]             [4, 8]            [8, 16]             [4, 8]            [8, 16]        [4, 8]
Dropout             [0.1]               [0.1]             [0.5]               [0.5]             [0.5]               [0.5]             [0.1]          [0.1]
# Parameters        8.5M                130M              15M                 140M              16M                 143M              5M             130M
Epochs              60                  50                60                  50                60                  50                60             50
Avg Time/Epoch      10                  40                60                  120               60                  120               10             15

Table 19: Different hyperparameters and the values considered for each of them in the models. The best hyperparameters for each model for 5-fold cross-validation on ASDiv-A are highlighted in bold. Average Time/Epoch is measured in seconds.

                             Transformer
Hyperparameters              Scratch              RoBERTa
I/P and O/P Embedding Size   [128, 256]           [768]
FFN Size                     [256, 384]           [256, 384]
Heads                        [2, 4]               [2, 4]
Number of Encoder Layers     [1, 2]               [1, 2]
Number of Decoder Layers     [1, 2]               [1, 2]
Learning Rate                [5e-5, 8e-5, 1e-4]   [5e-5, 8e-5, 1e-4]
Embedding LR                 [5e-5, 8e-5, 1e-4]   [1e-5, 5e-6]
Batch Size                   [4, 8]               [4, 8]
Dropout                      [0.1]                [0.1]
# Parameters                 0.67M                132M
Epochs                       100                  100
Avg Time/Epoch               10                   30

Table 20: Different hyperparameters and the values considered for each of them in the Transformer model. The best hyperparameters for 5-fold cross-validation on ASDiv-A are highlighted in bold. Average Time/Epoch is measured in seconds.

Tag         Description
[NUMx]      Number
[NAMEx]     Names of Persons
[OBJsx]     Singular Object
[OBJpx]     Plural Object
[MODx]      Modifier

Table 21: List of tags used in annotated templates. x denotes the index of the tag.

Each worker is provided with the Seed Examples along with their respective Script Examples that have been allotted to them. The worker's task is to edit the Script Example by correcting any mistake made by the preprocessing script and adding any new tags, such as the [OBJs] and [OBJp] tags, in order to create the Base Example. If a worker introduces a new tag, they need to mark it against its example-specific value. If the tag is used to mask objects, the worker needs to mark both the singular and plural forms of the object in a comma-separated manner. Additionally, for each unique index of an [OBJs]/[OBJp] tag in the example, the worker must enter at least one alternate value that can be used in that place. Similarly, the worker must enter at least two modifier words that can be used to precede the principal [OBJs]/[OBJp] tags in the example. These alternate values are used to gather a lexicon which can be utilised to scale up the data at a later stage. An illustration of this process is provided in Table 24.

In order to create the variations, the worker needs to check the different types of variations in Table 8 to see if they can be applied to the Base Example. If applicable, the worker needs to create the Variation Example while also making a note of the type of variation. If a particular example is the result of performing multiple types of variations, all types of variations should be listed according to their order of application, from latest to earliest, in a comma-separated manner. For any variation, if a worker introduces a new tag, they need to mark it against its example-specific value as mentioned before. The index of any new tag introduced needs to be one more than the highest index already in use for that tag in the Base Example or its previously created variations.

To make the annotation more efficient and streamlined, we provide the following steps to be followed in order:

1. Apply the Question Sensitivity variations on the Base Example.

2. Apply the Invert Operation variation on the Base Example and on all the variations obtained so far.

3. Apply the Add relevant information variation on the Base Example. Then, considering these variations as Base Examples, apply the Question Sensitivity variations.

4. Apply the Add irrelevant information variation on the Base Example and on all the variations obtained so far.

5. Apply the Change information variation on the Base Example and on all the variations obtained so far.

6. Apply the Change order of Objects and Change order of Events or Phrases variations on the Base Example and on all the variations obtained so far.

Table 25 provides some variations for the example in Table 24. Note that two separate examples were created through the 'Add irrelevant information' variation: the first by applying the variation on the Original Example, and the second by applying it on a previously created example (as directed in Step 4).

To make sure that different workers following our protocol make similar types of variations, we held a trial where each worker created variations from the same 5 seed examples. We observed that, barring minor linguistic differences, most of the created examples were the same, thereby indicating the effectiveness of our protocol.
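A toy version of the preprocessing step described above is sketched below: numbers are masked with a regular expression, and person names with an NER step, for which a fixed name list stands in here for a real NER model. The bracketed tags ([NUM1], [NAME1], ...) are just our plain-text rendering of the tags in Table 21, and the function name is ours.

```python
import re

KNOWN_NAMES = {"Beth", "Jack", "Mary", "Frank"}  # stand-in for NER output

def to_script_example(text):
    """Mask numbers and person names in a seed example, indexing each tag."""
    out, num_count, name_ids = [], 0, {}
    for token in text.split():
        word = token.rstrip(".,?")
        suffix = token[len(word):]          # keep trailing punctuation
        if re.fullmatch(r"\d+(\.\d+)?", word):
            num_count += 1
            out.append(f"[NUM{num_count}]{suffix}")
        elif word in KNOWN_NAMES:
            if word not in name_ids:        # a repeated name keeps one index
                name_ids[word] = len(name_ids) + 1
            out.append(f"[NAME{name_ids[word]}]{suffix}")
        else:
            out.append(token)
    return " ".join(out)
```

For example, "Beth has 4 packs of crayons." becomes "[NAME1] has [NUM1] packs of crayons.", mirroring the Script Example in Table 23, including the script's failure to tag "crayons" as an object.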

D Analyzing Attention Weights

In Table 26, we provide more examples to illustrate the specific word-to-equation correlation that the constrained model learns.

E Examples of Simple Problems

In Table 27, we provide a few simple examples from SVAMP that the best performing Graph2Tree model could not solve.

F Ethical Considerations

In this paper, we consider the task of automatically solving Math Word Problems (MWPs). Our work encourages the development of better systems that can robustly solve MWPs. Such systems can be deployed for use in the education domain. For example, an application can be developed that takes MWPs as input and provides detailed explanations to solve them. Such applications can aid elementary school students in learning and practicing math.

We present a challenge set called SVAMP of one-unknown English Math Word Problems. SVAMP is created in-house by the authors themselves by applying some simple variations to examples from ASDiv-A (Miao et al., 2020), which is a publicly available dataset. We provide a detailed creation protocol in Appendix C. We are not aware of any risks associated with our proposed dataset.

To provide an estimate of the energy requirements of our experiments, we provide details such as computing platform and running time in Appendix B. Also, in order to reduce carbon costs from our experiments, we first perform a broad hyperparameter search over only a single fold for the datasets and then run the cross-validation experiment over a select few hyperparameters.
Excerpt of Example: Beth has 4 packs of red crayons and 2 packs of green crayons. Each pack has 10 crayons in it.
Template Form:      [NAME1] has [NUM1] packs of [MOD1] [OBJp1] and [NUM2] packs of [MOD2] [OBJp1].

Excerpt of Example: In a game, Frank defeated 6 enemies. Each enemy earned him 9 points.
Template Form:      In a game, [NAME1] defeated [NUM1] [OBJp1]. Each [OBJs1] earned him [NUM2] points.

Table 22: Example uses of tags. Note that in the first example, the word 'packs' was not replaced since it is a collective. In the second example, the word 'points' was not replaced because it is too instance-specific and no other word can be used in that place.

Seed Example Body:       Beth has 4 packs of crayons. Each pack has 10 crayons in it. She also has 6 extra crayons.
Seed Example Question:   How many crayons does Beth have altogether?
Seed Example Equation:   4 * 10 + 6

Script Example Body:     [NAME1] has [NUM1] packs of crayons. Each pack has [NUM2] crayons in it. She also has [NUM3] extra crayons.
Script Example Question: How many crayons does [NAME1] have altogether?
Script Example Equation: [NUM1] * [NUM2] + [NUM3]

Table 23: An example of suggested templates. Note that the preprocessing script could not successfully tag 'crayons' as [OBJp1].
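The annotated equation in Table 23 is itself a template over the number tags. A small sanity check in this spirit (the function name and the bracketed tag rendering are ours) substitutes the seed numbers back in and verifies that the equation yields the expected answer:

```python
def check_annotation(equation_template, numbers, expected, tol=1e-6):
    """True if the equation template evaluates to the expected answer."""
    expr = equation_template
    for tag, value in numbers.items():
        expr = expr.replace(tag, str(value))
    # eval() is acceptable here: the string is assembled locally from the
    # annotation and digits, never from untrusted input.
    return abs(eval(expr) - expected) < tol

ok = check_annotation("[NUM1] * [NUM2] + [NUM3]",
                      {"[NUM1]": 4, "[NUM2]": 10, "[NUM3]": 6},
                      expected=46)
```

For the seed example above, substituting 4, 10 and 6 gives 4 * 10 + 6 = 46, so the annotation checks out.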

Script Example Body:     [NAME1] has [NUM1] packs of crayons. Each pack has [NUM2] crayons in it. She also has [NUM3] extra crayons.
Script Example Question: How many crayons does [NAME1] have altogether?

Base Example Body:       [NAME1] has [NUM1] packs of [OBJp1]. Each pack has [NUM2] [OBJp1] in it. She also has [NUM3] extra [OBJp1].
Base Example Question:   How many [OBJp1] does [NAME1] have altogether?

[OBJ1]:              crayon, crayons
Alternate for [OBJ1]: pencil, pencils
Alternate for [MOD]:  small, large

Table 24: An example of editing the Suggested Templates. The edits are indicated in green.
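The lexicon of alternate values gathered in Table 24 is intended for scaling up the data later, which the appendix leaves to future work. A hypothetical sketch of that template-filling step follows; the template, the lexicon, the bracketed tag rendering, and all names are ours.

```python
import re
from itertools import product

def instantiate(template, assignment):
    """Replace every [TAG] in the template with its assigned surface form."""
    return re.sub(r"\[([A-Za-z]+\d+)\]",
                  lambda m: str(assignment[m.group(1)]), template)

template = ("[NAME1] has [NUM1] packs of [OBJp1]. "
            "Each pack has [NUM2] [OBJp1] in it.")
lexicon = {"NAME1": ["Beth", "Jack"], "OBJp1": ["crayons", "pencils"]}

# Cross all lexicon entries to generate surface variants of one template.
variants = [
    instantiate(template, {"NAME1": n, "OBJp1": o, "NUM1": 4, "NUM2": 10})
    for n, o in product(lexicon["NAME1"], lexicon["OBJp1"])
]
```

Because every occurrence of a tag shares one index, both mentions of [OBJp1] receive the same object, preserving the coreference that the indexing scheme encodes.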

Base Example Body:     [NAME1] has [NUM1] packs of [OBJp1]. Each pack has [NUM2] [OBJp1] in it. She also has [NUM3] extra [OBJp1].
Base Example Question: How many [OBJp1] does [NAME1] have altogether?
Base Example Equation: [NUM1] * [NUM2] + [NUM3]

Category:  Question Sensitivity
Variation: Same Object, Different Structure
Variation Body:     [NAME1] has [NUM1] packs of [OBJp1]. Each pack has [NUM2] [OBJp1] in it. She also has [NUM3] extra [OBJp1].
Variation Question: How many [OBJp1] does [NAME1] have in packs?
Variation Equation: [NUM1] * [NUM2]

Category:  Structural Invariance
Variation: Add irrelevant information
Variation Body:     [NAME1] has [NUM1] packs of [OBJp1] and [NUM4] packs of [OBJp2]. Each pack has [NUM2] [OBJp1] in it. She also has [NUM3] extra [OBJp1].
Variation Question: How many [OBJp1] does [NAME1] have altogether?
Variation Equation: [NUM1] * [NUM2] + [NUM3]

Variation Body:     [NAME1] has [NUM1] packs of [OBJp1] and [NUM4] packs of [OBJp2]. Each pack has [NUM2] [OBJp1] in it. She also has [NUM3] extra [OBJp1].
Variation Question: How many [OBJp1] does [NAME1] have in packs?
Variation Equation: [NUM1] * [NUM2]

Table 25: Example Variations


Input Problem, Predicted Equation, and Answer (✓ = correct prediction, ✗ = incorrect):

- Mike had 8 games. After he gave some to his friend he had 5 left. How many games did he give to his friend?
  Predicted: 8 - 5, answer 3  ✓
- After Mike gave some games to his friend he had 5 left. If he had 8 games initially, how many games did he give to his friend?
  Predicted: 5 - 8, answer -3  ✗
- Jack bought 5 radios but only 2 of them worked. How many radios did not work?
  Predicted: 5 - 2, answer 3  ✓
- Jack bought 5 radios but only 2 of them worked. How many more radios did not work than those that did?
  Predicted: 5 - 2, answer 3  ✗
- Ross had 6 marbles. He sold 2 marbles to Joey. How many marbles does Ross have now?
  Predicted: 6 - 2, answer 4  ✓
- Ross had 6 marbles. Joey sold 2 marbles to Ross. How many marbles does Ross have now?
  Predicted: 6 - 2, answer 4  ✗
- Bob collected 7 cans. He lost 3 of them. How many cans does Bob have now?
  Predicted: 7 - 3, answer 4  ✓
- Bob had 7 cans. He collected 3 more. How many cans does Bob have now?
  Predicted: 7 - 3, answer 4  ✗
- Joey had 9 pens. He used 4 of them. How many pens does he have now?
  Predicted: 9 - 4, answer 5  ✓
- Joey used 4 pens. If he had 9 pens initially, how many pens does he have now?
  Predicted: 4 - 9, answer -5  ✗
- Jill read 30 pages in 10 days. How many pages did she read per day?
  Predicted: 30 / 10, answer 3  ✓
- Jill can read 3 pages per day. How many pages can she read in 10 days?
  Predicted: 3 / 10, answer 0.3  ✗
- Mary's hair was 15 inches long. After she did a haircut, it was 10 inches long. How much did she cut off?
  Predicted: 15 - 10, answer 5  ✓
- Mary cut off 5 inches of her hair. If her hair is now 10 inches long, how long was it earlier?
  Predicted: 5 - 10, answer -5  ✗

Table 26: Attention paid to specific words by the constrained model.

Input Problem, with Correct and Predicted Equations:

- Every day ryan spends 6 hours on learning english and 2 hours on learning chinese. How many more hours does he spend on learning english than he does on learning chinese?
  Correct: 6 - 2    Predicted: 2 - 6
- In a school there are 34 girls and 841 boys. How many more boys than girls does the school have?
  Correct: 841 - 34    Predicted: 34 - 841
- David did 44 push-ups in gym class today. David did 9 more push-ups than zachary. How many push-ups did zachary do?
  Correct: 44 - 9    Predicted: 44 + 9
- Dan has $ 3 left with him after he bought a candy bar for $ 2. How much money did he have initially?
  Correct: 3 + 2    Predicted: 3 - 2
- Jake has 11 fewer peaches than steven. If jake has 17 peaches. How many peaches does steven have?
  Correct: 11 + 17    Predicted: 17 - 11
- Kelly gives away 91 nintendo games. How many did she have initially if she still has 92 games left?
  Correct: 91 + 92    Predicted: 92 - 91
- Emily is making bead necklaces for her friends. She was able to make 18 necklaces and she had 6 beads. How many beads did each necklace need?
  Correct: 18 / 6    Predicted: 6 / 18
- Frank was reading through some books. Each book had 249 pages and it took frank 3 days to finish each book. How many pages did he read per day?
  Correct: 249 / 3    Predicted: ( 249 * 3 ) / 3
- A mailman has to give 5 pieces of junk mail to each block. If he gives 25 mails to each house in a block, how many houses are there in a block?
  Correct: 25 / 5    Predicted: 5 / 25
- Faye was placing her pencils and crayons into 19 rows with 4 pencils and 27 crayons in each row. How many pencils does she have?
  Correct: 19 * 4    Predicted: 19 * 27
- White t - shirts can be purchased in packages of 53. If mom buys 57 packages of white t - shirts and 34 trousers, How many white t - shirts will she have?
  Correct: 53 * 57    Predicted: ( 53 * 57 ) + 34
- An industrial machine can make 6 shirts a minute. It worked for 5 minutes yesterday and for 12 minutes today. How many shirts did machine make today?
  Correct: 6 * 12    Predicted: 5 + 12

Table 27: Some simple examples from SVAMP on which the best performing Graph2Tree model fails.
