Arkil Patel
Satwik Bhattamishra Navin Goyal
Microsoft Research India
[email protected], {t-satbh,navingo}@microsoft.com
Abstract

The problem of designing NLP solvers for math word problems (MWP) has seen sustained research activity and steady gains in

arXiv:2103.07191v2 [cs.CL] 15 Apr 2021

PROBLEM:
Text: Jack had 8 pens and Mary had 5 pens. Jack gave 3 pens to Mary. How many pens does Jack have now?
Equation: 8 - 3 = 5

Problem | Prediction | Answer | Correct?
John delivered 3 letters at every house. If he delivered for 8 houses, how many letters did John deliver? | 3 * 8 | 24 | ✓
John delivered 3 letters at every house. He delivered 24 letters in all. How many houses did John visit to deliver letters? | 3 * 24 | 72 | ✗
Sam made 8 dollars mowing lawns over the Summer. He charged 2 bucks for each lawn. How many lawns did he mow? | 8 / 2 | 4 | ✓
Sam mowed 4 lawns over the Summer. If he charged 2 bucks for each lawn, how much did he earn? | 4 / 2 | 2 | ✗
10 apples were in the box. 6 are red and the rest are green. how many green apples are in the box? | 10 - 6 | 4 | ✓
10 apples were in the box. Each apple is either red or green. 6 apples are red. how many green apples are in the box? | 10 / 6 | 1.67 | ✗
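The ✓/✗ judgments above compare the value of a model's predicted expression against the annotated answer. A minimal sketch of that check (the function name and tolerance are our own; the paper does not specify an implementation):

```python
# Sketch of the correctness check behind the ✓/✗ marks: evaluate the
# predicted arithmetic expression and compare it with the gold answer.
# The function name and tolerance are illustrative, not from the paper.
def is_correct(pred_expr: str, gold_answer: float, tol: float = 1e-6) -> bool:
    try:
        # Restrict eval to plain arithmetic; a real evaluator would parse.
        value = eval(pred_expr, {"__builtins__": {}})
    except Exception:
        return False
    return abs(value - gold_answer) < tol
```

For example, `is_correct("3*8", 24)` holds for the first row above, while `is_correct("3*24", 8)` does not for the second.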
tion, which is now incorrect. Similar observations can be made for the other two examples. Table 26 in the Appendix has more such examples. These examples represent only a few types of spurious correlations that we could find; there could be other types of correlations that we missed.

Note that we do not claim that every model trained on these datasets relies on the occurrence of specific words in the input problem for prediction the way our constrained model does. We are only asserting that it is possible to achieve a good score on these datasets even with such a brittle model, which clearly makes these datasets unreliable for robustly measuring model performance.

5 SVAMP

The efficacy of existing models on benchmark datasets has led researchers to shift their focus towards more difficult MWPs. We claim that this efficacy on benchmarks is misleading and that SOTA MWP solvers are unable to solve even elementary-level one-unknown MWPs. To this end, we create a challenge set named SVAMP containing simple one-unknown arithmetic word problems of grade level up to 4. The examples in SVAMP test a model across different aspects of solving word problems; for instance, a model needs to be sensitive to the question and possess certain reasoning abilities to correctly solve the examples in our challenge set. SVAMP is similar to existing datasets of the same level in terms of scope and difficulty for humans, but is less susceptible to being solved by models relying on superficial patterns.

Our work differs from adversarial data collection methods such as Adversarial NLI (Nie et al., 2020) in that those methods create examples depending on the failures of a particular model, while we create examples without referring to any specific model. Inspired by the notion of normative evaluation (Linzen, 2020), our goal is to create a dataset of simple problems that any system designed to solve MWPs should be expected to solve. We create new problems by applying certain variations to existing problems, similar to the work of Ribeiro et al. (2020). However, unlike their work, our variations do not check for linguistic capabilities; rather, the choice of variations is motivated by the experiments in Section 4 as well as certain simple capabilities that any MWP solver must possess.

5.1 Creating SVAMP

We create SVAMP by applying certain types of variations to a set of seed examples sampled from the recently proposed ASDiv-A dataset, which appears to be of higher quality and harder than the MAWPS dataset. To test the coverage of each dataset, we performed a simple experiment: training a model on one dataset and testing it on the other. For instance, a Graph2Tree model trained on ASDiv-A achieves 82% accuracy on MAWPS; however, when trained on MAWPS and tested on ASDiv-A, it achieves only 73% accuracy. Also recall Table 2, where most models performed better on MAWPS. Moreover, ASDiv has problems annotated according to types and grade levels, which are useful for us.

To select a subset of seed examples that sufficiently represents the different types of problems in the ASDiv-A dataset, we first divide the examples into groups according to their annotated types.

Group | Examples in ASDiv-A | Selected Seed Examples
Addition | 278 | 28
Subtraction | 362 | 33
Multiplication | 188 | 19
Division | 176 | 20
Total | 1004 | 100

Table 7: Distribution of selected seed examples across types.
Category: Question Sensitivity

Same Object, Different Structure
Original: Allan brought two balloons and Jake brought four balloons to the park. How many balloons did Allan and Jake have in the park?
Variation: Allan brought two balloons and Jake brought four balloons to the park. How many more balloons did Jake have than Allan in the park?

Different Object, Same Structure
Original: In a school, there are 542 girls and 387 boys. 290 more boys joined the school. How many pupils are in the school?
Variation: In a school, there are 542 girls and 387 boys. 290 more boys joined the school. How many boys are in the school?

Different Object, Different Structure
Original: He then went to see the oranges being harvested. He found out that they harvest 83 sacks per day and that each sack contains 12 oranges. How many sacks of oranges will they have after 6 days of harvest?
Variation: He then went to see the oranges being harvested. He found out that they harvest 83 sacks per day and that each sack contains 12 oranges. How many oranges do they harvest per day?

Category: Reasoning Ability

Add relevant information
Original: Every day, Ryan spends 4 hours on learning English and 3 hours on learning Chinese. How many hours does he spend on learning English and Chinese in all?
Variation: Every day, Ryan spends 4 hours on learning English and 3 hours on learning Chinese. If he learns for 3 days, how many hours does he spend on learning English and Chinese in all?

Change information
Original: Jack had 142 pencils. Jack gave 31 pencils to Dorothy. How many pencils does Jack have now?
Variation: Dorothy had 142 pencils. Jack gave 31 pencils to Dorothy. How many pencils does Dorothy have now?

Invert Operation
Original: He also made some juice from fresh oranges. If he used 2 oranges per glass of juice and he made 6 glasses of juice, how many oranges did he use?
Variation: He also made some juice from fresh oranges. If he used 2 oranges per glass of juice and he used up 12 oranges, how many glasses of juice did he make?

Category: Structural Invariance

Change order of objects
Original: John has 8 marbles and 3 stones. How many more marbles than stones does he have?
Variation: John has 3 stones and 8 marbles. How many more marbles than stones does he have?

Change order of phrases
Original: Matthew had 27 crackers. If Matthew gave equal numbers of crackers to his 9 friends, how many crackers did each person eat?
Variation: Matthew gave equal numbers of crackers to his 9 friends. If Matthew had a total of 27 crackers initially, how many crackers did each person eat?

Add irrelevant information
Original: Jack had 142 pencils. Jack gave 31 pencils to Dorothy. How many pencils does Jack have now?
Variation: Jack had 142 pencils. Dorothy had 50 pencils. Jack gave 31 pencils to Dorothy. How many pencils does Jack have now?

Table 8: Types of Variations with examples. 'Original:' denotes the base example from which the variation is created; 'Variation:' denotes a manually created variation.
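The seed selection for SVAMP (Section 5.1) clusters each type group with K-Means over RoBERTa sentence embeddings and keeps the example nearest each cluster centroid. A rough sketch, with a hand-rolled Lloyd's iteration standing in for a library K-Means and random vectors standing in for the RoBERTa embeddings (both are our substitutions, not the paper's code):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    # Plain Lloyd's algorithm; a library K-Means would work equally well.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = X[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return labels, centers

def select_seed_indices(X, k):
    # One seed per cluster: the example closest to the cluster centroid.
    labels, centers = kmeans(X, k)
    seeds = []
    for c in range(k):
        idx = np.flatnonzero(labels == c)
        if idx.size:
            d = np.linalg.norm(X[idx] - centers[c], axis=1)
            seeds.append(int(idx[d.argmin()]))
    return seeds
```

In the paper's setting, `X` would hold one sentence embedding per ASDiv-A problem of a given type, and the returned indices would be the seed examples for that group.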
We discard types such as 'TVQ-Change', 'TVQ-Initial', 'Ceil-Division' and 'Floor-Division' that have fewer than 20 examples each. We also do not consider the 'Difference' type since it requires the use of an additional modulus operator. For ease of creation, we discard the few examples that are more than 40 words long. To control the complexity of the resulting variations, we only consider as seed examples those problems that can be solved by an expression with a single operator. Then, within each group, we cluster examples using K-Means over RoBERTa sentence embeddings of each example. From each cluster, the example closest to the cluster centroid is selected as a seed example. We selected a total of 100 seed examples in this manner. The distribution of seed examples according to different types of problems can be seen in Table 7.

5.1.1 Variations

The variations that we make to each seed example can be broadly classified into three categories based on desirable properties of an ideal model: Question Sensitivity, Reasoning Ability and Structural Invariance. Examples of each type of variation are provided in Table 8.

1. Question Sensitivity. Variations in this category check if the model's answer depends on the question. In these variations, we change the question in the seed example while keeping the body the same. The possible variations are as follows:
(a) Same Object, Different Structure: The principal object (i.e. the object whose quantity is unknown) in the question is kept the same while the structure of the question is changed.
(b) Different Object, Same Structure: The principal object in the question is changed while the structure of the question remains fixed.
(c) Different Object, Different Structure: Both the principal object in the question and the structure of the question are changed.

2. Reasoning Ability. Variations here check whether a model has the ability to correctly determine a change in reasoning arising from subtle changes in the problem text. The different possible variations are as follows:
(a) Add relevant information: Extra relevant information is added to the example that affects the output equation.
(b) Change information: The information provided in the example is changed.
(c) Invert operation: The previously unknown quantity is now provided as information and the question instead asks about a previously known quantity which is now unknown.

3. Structural Invariance. Variations in this category check whether a model remains invariant to superficial structural changes that do not alter the answer or the reasoning required to solve the example. The different possible variations are as follows:
(a) Add irrelevant information: Extra irrelevant information is added to the problem text that is not required to solve the example.
(b) Change order of objects: The order of objects appearing in the example is changed.
(c) Change order of phrases: The order of number-containing phrases appearing in the example is changed.

5.1.2 Protocol for creating variations

Since creating variations requires a high level of familiarity with the task, the construction of SVAMP is done in-house by the authors and colleagues, hereafter called the workers. The 100 seed examples (as shown in Table 7) are distributed among the workers.

For each seed example, the worker needs to create new variations by applying the variation types discussed in Section 5.1.1. Importantly, a combination of different variations over the seed example can also be done. For each new example created, the worker needs to annotate it with the equation as well as the type of variation(s) used to create it. More details about the creation protocol can be found in Appendix C.

We created a total of 1098 examples. However, since ASDiv-A does not have examples with equations of more than two operators, we discarded the 98 examples from our set whose equations consist of more than two operators. This ensures that our challenge set does not have any unfairly difficult examples. The final set of 1000 examples was provided to an external volunteer unfamiliar with the task to check the grammatical and logical correctness of each example.

5.2 Dataset Properties

Our challenge set SVAMP consists of one-unknown arithmetic word problems which can be solved by expressions requiring no more than two operators. Table 9 shows some statistics of our dataset and of ASDiv-A and MAWPS. The Equation Template for each example is obtained by converting the corresponding equation into prefix form and masking out all numbers with a meta symbol.

Dataset | # Problems | # Equation Templates | Avg Ops | CLD
MAWPS | 2373 | 39 | 1.78 | 0.26
ASDiv-A | 1218 | 19 | 1.23 | 0.50
SVAMP | 1000 | 26 | 1.24 | 0.22

Table 9: Statistics of our dataset compared with MAWPS and ASDiv-A.

Observe that the number of distinct Equation Templates and the Average Number of Operators are similar for SVAMP and ASDiv-A and are considerably smaller than for MAWPS. This indicates that SVAMP does not contain unfairly difficult MWPs in terms of the arithmetic expression a model is expected to produce.

Previous works, including those introducing MAWPS and ASDiv, have tried to capture the notion of diversity in MWP datasets. Miao et al. (2020) introduced a metric called Corpus Lexicon Diversity (CLD) to measure lexical diversity; their contention was that higher lexical diversity is correlated with the quality of a dataset. As can be seen from Table 9, SVAMP has a much lower CLD than ASDiv-A. SVAMP is also less diverse in terms of problem types compared to ASDiv-A. Despite this, we will show in the next section that SVAMP is in fact more challenging than ASDiv-A for current models. Thus, we believe that lexical diversity is not a reliable way to measure the quality of MWP datasets; rather, quality could depend on other factors such as diversity in MWP structure, which precludes models from exploiting shallow heuristics.

5.3 Experiments on SVAMP

We train the three considered models on a combination of MAWPS and ASDiv-A and test them on SVAMP. The scores of all three models with and without RoBERTa embeddings for various subsets of SVAMP can be seen in Table 10.
Subset | Seq2Seq (S) | Seq2Seq (R) | GTS (S) | GTS (R) | Graph2Tree (S) | Graph2Tree (R)
Full Set | 24.2 | 40.3 | 30.8 | 41.0 | 36.5 | 43.8
One-Op | 25.4 | 42.6 | 31.7 | 44.6 | 42.9 | 51.9
Two-Op | 20.3 | 33.1 | 27.9 | 29.7 | 16.1 | 17.8
ADD | 28.5 | 41.9 | 35.8 | 36.3 | 24.9 | 36.8
SUB | 22.3 | 35.1 | 26.7 | 36.9 | 41.3 | 41.3
MUL | 17.9 | 38.7 | 29.2 | 38.7 | 27.4 | 35.8
DIV | 29.3 | 56.3 | 39.5 | 61.1 | 40.7 | 65.3

Table 10: Accuracies (↑) of the Seq2Seq, GTS and Graph2Tree models on the SVAMP challenge set and its subsets. (S) denotes that the model is trained from scratch; (R) denotes that it is trained with RoBERTa embeddings.

Model | SVAMP
FFN + LSTM Decoder (S) | 17.5
FFN + LSTM Decoder (R) | 18.3
Majority Template Baseline | 11.7

Table 12: Accuracies (↑) of the constrained model on SVAMP. (R) denotes that the model is provided with non-contextual RoBERTa pretrained embeddings while (S) denotes that the model is trained from scratch.
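One plausible reading of the Majority Template Baseline in Table 12 (our reconstruction; the paper does not spell out the implementation): always predict the equation template that is most frequent in the training data, and count a test example as solved when its gold template matches.

```python
from collections import Counter

def majority_template_accuracy(train_templates, test_templates):
    # Predict the single most frequent training template for every test example.
    majority, _ = Counter(train_templates).most_common(1)[0]
    hits = sum(t == majority for t in test_templates)
    return hits / len(test_templates)
```

For instance, with training templates `["- # #", "- # #", "+ # #"]` and test templates `["- # #", "* # #"]`, the baseline predicts `- # #` everywhere and scores 0.5.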
Zheng Cai, Lifu Tu, and Kevin Gimpel. 2017. Pay attention to the ending: Strong neural baselines for the ROC story cloze task. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 616–622, Vancouver, Canada. Association for Computational Linguistics.

Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hanna Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. 2020. Evaluating models' local decision boundaries via contrast sets.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112, New Orleans, Louisiana. Association for Computational Linguistics.

Danqing Huang, Shuming Shi, Chin-Yew Lin, and Jian Yin. 2017. Learning fine-grained expressions to solve math word problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 805–814, Copenhagen, Denmark. Association for Computational Linguistics.

Danqing Huang, Shuming Shi, Chin-Yew Lin, Jian Yin, and Wei-Ying Ma. 2016a. How well do computers solve math word problems? Large-scale dataset construction and evaluation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 887–896, Berlin, Germany. Association for Computational Linguistics.

Tal Linzen. 2020. How can we accelerate progress towards human-like linguistic generalization? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5210–5217, Online. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448, Florence, Italy. Association for Computational Linguistics.

Shen-yun Miao, Chao-Chun Liang, and Keh-Yih Su. 2020. A diverse corpus for evaluating and developing English math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 975–984, Online. Association for Computational Linguistics.

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885–4901, Online. Association for Computational Linguistics.

Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 180–191, New Orleans, Louisiana. Association for Computational Linguistics.

Jinghui Qin, Lihui Lin, Xiaodan Liang, Rumin Zhang, and Liang Lin. 2020. Semantically-aligned universal tree-structured solver for math word problems.

Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online. Association for Computational Linguistics.

Shachar Rosenman, Alon Jacovi, and Yoav Goldberg. 2020. Exposing shallow heuristics of relation extraction models with challenge data. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3702–3710, Online. Association for Computational Linguistics.

Subhro Roy and Dan Roth. 2018. Mapping to declarative knowledge for word problem solving. Transactions of the Association for Computational Linguistics, 6:159–172.

Mrinmaya Sachan and Eric Xing. 2017. Learning to solve geometry problems from natural language demonstrations in textbooks. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), pages 251–261, Vancouver, Canada. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Yan Wang, Xiaojiang Liu, and Shuming Shi. 2017. Deep neural solver for math word problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 845–854, Copenhagen, Denmark. Association for Computational Linguistics.

Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 11–20, Hong Kong, China. Association for Computational Linguistics.

Zhipeng Xie and Shichao Sun. 2019. A goal-driven tree-structured neural model for math word problems. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 5299–5305. International Joint Conferences on Artificial Intelligence Organization.

D. Zhang, L. Wang, L. Zhang, B. T. Dai, and H. T. Shen. 2020. The gap of semantic parsing: A survey on automatic math word problem solvers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(9):2287–2305.

Jipeng Zhang, Lei Wang, Roy Ka-Wei Lee, Yi Bin, Yan Wang, Jie Shao, and Ee-Peng Lim. 2020. Graph-to-tree learning for solving math word problems. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3928–3937, Online. Association for Computational Linguistics.
A Experiments with Transformer

We additionally ran all our experiments with the Transformer (Vaswani et al., 2017) model. The 5-fold cross-validation accuracies of the Transformer on MAWPS and ASDiv-A are provided in Table 16. The scores on Question-removed datasets are provided in Table 17 and those on the SVAMP challenge set in Table 18.

Model | MAWPS | ASDiv-A
Transformer (S) | 77.9 | 52.1
Transformer (R) | 87.1 | 77.7

Table 16: 5-fold cross-validation accuracies (↑) of Transformer model on datasets. (R) means that the model is provided with RoBERTa pretrained embeddings while (S) means that the model is trained from scratch.

Model | MAWPS | ASDiv-A | SVAMP
Transformer | 79.4 | 64.4 | 25.3

Table 17: 5-fold cross-validation accuracies (↑) of Transformer model on Question-removed datasets.

Subset | Transformer (S) | Transformer (R)
Full Set | 18.4 | 38.9
One-Op | 18.6 | 40.5
Two-Op | 17.8 | 33.9
ADD | 22.3 | 36.3
SUB | 17.1 | 37.5
MUL | 17.9 | 28.3
DIV | 18.6 | 53.3

Table 18: Results of Transformer model on the SVAMP challenge set. S indicates that the model is trained from scratch; R indicates that the model was trained with RoBERTa embeddings. The first row shows the results for the full dataset. The next two rows show the results for subsets of SVAMP composed of examples that have equations with one operator and two operators respectively. The last four rows show the results for subsets of SVAMP composed of examples of type Addition, Subtraction, Multiplication and Division respectively.

B Implementation Details

We use 8 NVIDIA Tesla P100 GPUs each with 16 GB memory to run our experiments. The hyperparameters used for each model are shown in Table 19. The hyperparameters used for the Transformer model are provided in Table 20. The best hyperparameters are highlighted in bold. Following the setting of Zhang et al. (2020), the arithmetic word problems from MAWPS are divided into five folds, each of equal test size. For ASDiv-A, we consider the 5-fold split [238, 238, 238, 238, 266] provided by the authors (Miao et al., 2020).

C Creation Protocol

We create variations in template form. Generating more data by scaling up from these templates or by performing automatic operations on them is left for future work. The template form of an example is created by replacing certain words with their respective tags. Table 21 lists the various tags used in the templates. The NUM tag is used to replace all the numbers and the NAME tag is used to replace all the names of persons in the example. The OBJs and OBJp tags are used for replacing the objects in the example; OBJs and OBJp tags with the same index represent the same object in singular and plural form respectively. The intention when using the OBJs or OBJp tag is that it acts as a placeholder for other similar words which, when entered in that place, make sense as per the context. These tags must not be used for collectives; rather, they should be used for the things that the collective represents. Some example uses of OBJs and OBJp tags are provided in Table 22. Lastly, the MOD tag must be used to replace any modifier preceding the OBJs/OBJp tag.

A preprocessing script is executed over the Seed Examples to automatically generate template suggestions for the workers. The script uses Named Entity Recognition and Regular Expression matching to automatically mask the names of persons and the numbers found in the Seed Examples. The outputs from the script are called the Script Examples. An illustration is provided in Table 23.

Each worker is provided with the Seed Examples along with their respective Script Examples that have been allotted to them. The worker's task is to edit the Script Example by correcting any mistake made by the preprocessing script and adding any new tags such as the OBJs and OBJp tags in order to create the Base Example. If a worker introduces a new tag, they need to mark it against its example-specific value. If the tag is used to mask objects, the worker needs to mark both the singular and plural form of the object in a comma-separated manner. Additionally, for each unique
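A rough sketch of the preprocessing script described above; here a fixed name list stands in for the Named Entity Recognizer, and the regular expressions are illustrative rather than the authors' actual patterns:

```python
import re

# Stand-in for the NER step (hypothetical list; the actual script detects
# person names with a Named Entity Recognizer).
NAMES = ["Beth", "Jack", "Mary", "Dorothy", "Frank"]

def make_script_example(text):
    counters = {"NAME": 0, "NUM": 0}
    seen_names = {}

    def num_tag(m):
        counters["NUM"] += 1
        return f"NUM{counters['NUM']}"

    def name_tag(m):
        w = m.group(0)
        if w not in seen_names:
            counters["NAME"] += 1
            seen_names[w] = counters["NAME"]
        return f"NAME{seen_names[w]}"

    # Mask numbers first, so the digit inside a NAME1 tag is never re-masked.
    text = re.sub(r"\d+\.?\d*", num_tag, text)
    return re.sub(r"\b(" + "|".join(NAMES) + r")\b", name_tag, text)
```

On the seed body of Table 23 this yields `NAME1 has NUM1 packs of crayons. Each pack has NUM2 crayons in it. She also has NUM3 extra crayons.`; the worker then still has to tag `crayons` as OBJp1 by hand, as Table 23 notes.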
Hyperparameter | Seq2Seq (Scratch) | Seq2Seq (RoBERTa) | GTS (Scratch) | GTS (RoBERTa) | Graph2Tree (Scratch) | Graph2Tree (RoBERTa) | Constrained (Scratch) | Constrained (RoBERTa)
Embedding Size | [128, 256] | [768] | [128, 256] | [768] | [128, 256] | [768] | [128, 256] | [768]
Hidden Size | [256, 384] | [256, 384] | [384, 512] | [384, 512] | [256, 384] | [256, 384] | [256, 384] | [256, 384]
Number of Layers | [1, 2] | [1, 2] | [1, 2] | [1, 2] | [1, 2] | [1, 2] | [1, 2] | [1, 2]
Learning Rate | [5e-4, 8e-4, 1e-3] | [1e-4, 2e-4, 5e-4] | [8e-4, 1e-3, 2e-3] | [5e-4, 8e-4, 1e-3] | [8e-4, 1e-3, 2e-3] | [5e-4, 8e-4, 1e-3] | [1e-3, 2e-3] | [1e-3, 2e-3]
Embedding LR | [5e-4, 8e-4, 1e-3] | [5e-6, 8e-6, 1e-5] | [8e-4, 1e-3, 2e-3] | [5e-6, 8e-6, 1e-5] | [8e-4, 1e-3, 2e-3] | [5e-6, 8e-6, 1e-5] | [1e-3, 2e-3] | [1e-3, 2e-3]
Batch Size | [8, 16] | [4, 8] | [8, 16] | [4, 8] | [8, 16] | [4, 8] | [8, 16] | [4, 8]
Dropout | [0.1] | [0.1] | [0.5] | [0.5] | [0.5] | [0.5] | [0.1] | [0.1]
# Parameters | 8.5M | 130M | 15M | 140M | 16M | 143M | 5M | 130M
Epochs | 60 | 50 | 60 | 50 | 60 | 50 | 60 | 50
Avg Time/Epoch | 10 | 40 | 60 | 120 | 60 | 120 | 10 | 15

Table 19: Different hyperparameters and the values considered for each of them in the models. The best hyperparameters for each model for 5-fold cross-validation on ASDiv-A are highlighted in bold. Average Time/Epoch is measured in seconds.
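Table 19 lists a grid of candidate values per hyperparameter; enumerating such a grid is a plain Cartesian product. The dictionary below mirrors the Seq2Seq (Scratch) column as we read it from the table (the helper itself is our sketch, not the authors' tuning code):

```python
from itertools import product

# Candidate values mirroring the Seq2Seq (Scratch) column of Table 19.
grid = {
    "embedding_size": [128, 256],
    "hidden_size": [256, 384],
    "num_layers": [1, 2],
    "learning_rate": [5e-4, 8e-4, 1e-3],
    "batch_size": [8, 16],
    "dropout": [0.1],
}

def configurations(grid):
    # Yield every combination of hyperparameter values as a dict.
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))
```

This grid has 2 * 2 * 2 * 3 * 2 * 1 = 48 candidate configurations, each of which would be scored by 5-fold cross-validation on ASDiv-A to pick the bold entries.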
F Ethical Considerations

In this paper, we consider the task of automatically solving Math Word Problems (MWPs). Our work encourages the development of better systems that can robustly solve MWPs. Such systems can be deployed for use in the education domain: e.g., an application can be developed that takes MWPs as input and provides detailed explanations of how to solve them. Such applications can aid elementary school students in learning and practicing math.

We present a challenge set called SVAMP of one-unknown English Math Word Problems. SVAMP is created in-house by the authors themselves by applying some simple variations to examples from ASDiv-A (Miao et al., 2020), which is a publicly available dataset. We provide a detailed creation protocol in Section C. We are not aware of any risks associated with our proposed dataset.

To provide an estimate of the energy requirements of our experiments, we provide details such as the computing platform and running time in Section B. Also, in order to reduce carbon costs
Excerpt of Example: Beth has 4 packs of red crayons and 2 packs of green crayons. Each pack has 10 crayons in it.
Template Form: NAME1 has NUM1 packs of MOD1 OBJp1 and NUM2 packs of MOD2 OBJp1.

Excerpt of Example: In a game, Frank defeated 6 enemies. Each enemy earned him 9 points.
Template Form: In a game, NAME1 defeated NUM1 OBJp1. Each OBJs1 earned him NUM2 points.

Table 22: Example uses of tags. Note that in the first example, the word 'packs' was not replaced since it is a collective. In the second example, the word 'points' was not replaced because it is too instance-specific and no other word can be used in that place.
Seed Example Body: Beth has 4 packs of crayons. Each pack has 10 crayons in it. She also has 6 extra crayons.
Seed Example Question: How many crayons does Beth have altogether?
Seed Example Equation: 4 * 10 + 6

Script Example Body: NAME1 has NUM1 packs of crayons. Each pack has NUM2 crayons in it. She also has NUM3 extra crayons.
Script Example Question: How many crayons does NAME1 have altogether?
Script Example Equation: NUM1 * NUM2 + NUM3

Table 23: An example of suggested templates. Note that the preprocessing script could not successfully tag 'crayons' as OBJp1.
Script Example Body: NAME1 has NUM1 packs of crayons. Each pack has NUM2 crayons in it. She also has NUM3 extra crayons.
Script Example Question: How many crayons does NAME1 have altogether?
Base Example Body: NAME1 has NUM1 packs of OBJp1. Each pack has NUM2 OBJp1 in it. She also has NUM3 extra OBJp1.
Base Example Question: How many OBJp1 does NAME1 have altogether?
OBJ1: crayon, crayons
Alternate for OBJ1: pencil, pencils
Alternate for MOD: small, large

Table 24: An example of editing the Suggested Templates. The edits are indicated in green.
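The alternates in Table 24 suggest how concrete problems are obtained from a Base Example: substitute a surface form for each tag. A minimal, hypothetical instantiation helper (the tag syntax follows Tables 21-24; the fill values below are ours):

```python
import re

def instantiate(template, fills):
    # fills maps a tag occurrence (e.g. "OBJp1") to a concrete surface form.
    return re.sub(
        r"(?:NAME|NUM|OBJs|OBJp|MOD)\d+",
        lambda m: fills[m.group(0)],
        template,
    )
```

For example, `instantiate("NAME1 has NUM1 packs of OBJp1.", {"NAME1": "Beth", "NUM1": "4", "OBJp1": "pencils"})` produces `Beth has 4 packs of pencils.`; swapping in the `pencil, pencils` alternate from Table 24 is just a different `fills` mapping over the same template.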
Base Example Body: NAME1 has NUM1 packs of OBJp1. Each pack has NUM2 OBJp1 in it. She also has NUM3 extra OBJp1.
Base Example Question: How many OBJp1 does NAME1 have altogether?
Base Example Equation: NUM1 * NUM2 + NUM3

Table 27: Some simple examples from SVAMP on which the best performing Graph2Tree model fails.