Automatic Math Word Problem Generation With Topic-Expression Co-Attention Mechanism and Reinforcement Learning
The math word problems they generate may not be complete and solvable. When a key entity in the problem is wrong, the problem cannot be solved. As shown in the example in Fig. 1, the “unsolvable problem” asks for the price of “books” not mentioned in the previous text, which makes this problem unsolvable.
To address these issues, we propose the novel model MWPGen for automatically generating math word problems. First, we propose a topic-expression co-attention mechanism that extracts the correlated information between topic words and expressions, which enables the model to generate problems related to both inputs. We also convert the math expression into a pre-order traversal sequence of the expression tree and use adjacent node embeddings in the expression tree as additional embeddings. In this way, the model can capture structure and global semantic information of the math expression. Furthermore, we use a state-of-the-art math word problem solver to obtain math expression that corresponds to the generated problem, and determine whether this expression is the same as the original expression. To fine-tune our model, we use the results of the problem solver as rewards and apply reinforcement learning.
Our contributions can be summarized as follows:
• We propose a novel model for generating math word problems. The model has a topic-expression co-attention mechanism that can effectively extract correlated information between topic words and expressions. It also uses adjacent node embeddings in the expression tree as additional embeddings to capture the structure and global semantic information of the math expression.
• We use reinforcement learning to further fine-tune the model. We use a math word problem solving model to solve the generated problems, and use the results as rewards in reinforcement learning.
• We conducted experiments on a large-scale math word problem dataset and the results confirmed that the proposed model MWPGen outperformed baselines on popular automatic evaluation metrics. The results of the math word problem solving experiments also prove that the problems generated by MWPGen are more complete and solvable than those generated by other baseline models. In addition, human evaluation verified that these problems are more related to the given topic words and math expressions.

II. MODELS
In this section, we first describe the task of math word problem generation. Then, we introduce our proposed model MWPGen for generating math word problems. As shown in Fig. 2, the whole framework comprises a math word problem generator and a math word problem solver, the procedure for which is as follows: (1) The math word problem generator generates a problem for the given topic words and expression; (2) the generated problem is then sent to the solver to obtain its corresponding expression, and (3) the generated and original expressions are compared to produce a reward for reinforcement learning to fine-tune the MWPGen model.

A. Problem Definition
Given a topic word list $X^o = \{x^o_1, x^o_2, \ldots, x^o_l\}$ and a math expression $X^e = \{x^e_1, x^e_2, \ldots, x^e_m\}$, the generation task is to generate a math word problem $Y = \{y_1, y_2, \ldots, y_n\}$, which is a sequence of $n$ words related to the given topic words $X^o$ that can be solved by the math expression $X^e$. Here, $l$ and $m$ are the respective lengths of $X^o$ and $X^e$. Our goal is to estimate the probability distribution:
$$P(Y \mid X^o, X^e) = \prod_{t=1}^{n} P(y_t \mid y_{<t}, X^o, X^e). \tag{1}$$
In this way, given a new input data pair $(X^o, X^e)$, we can generate a new math word problem $Y$ based on $P(Y \mid X^o, X^e)$.

B. Math Word Problem Generator
Encoder:
Our model accepts two inputs, a topic word list $X^o$ and a math expression $X^e$, each of which is a sequence of words. We embed the words $x^o_i$, $x^e_j$ in these two sequences as word embeddings $e(x^o_i)$, $e(x^e_j)$ through a word embedding layer.
Additional embeddings for math expression:
The math expression can be converted into a binary tree structure with operators as internal nodes and numbers as leaf nodes. The goal of the model is to generate a pre-order traversal sequence of this expression tree. To better capture the structure and global semantic information of the math expression, we use the parent and child nodes of each word in the expression tree as additional embeddings, as shown in Fig. 3. We use this embedding strategy for two reasons. First, pre-ordered math expressions can better implicitly model tree structures than middle-ordered math expressions [12], [13]. Second, it can better capture long-distance dependencies. For example, in the math expression “(N0*N1)/(N2+N3)”, the operator “/” directly depends on its child operators “*,+” instead of its nearby words “N1,), (, N2”. Therefore, we use these adjacent node embeddings of the word in the expression tree as additional embeddings.
Then, we use two-layer bidirectional long short-term memory (BiLSTM) [14] to obtain the hidden states of each word in the math expression. The hidden states $h^e_i$ are updated as follows:
$$h^e_i = \mathrm{BiLSTM}\big(e(x^e_i, x^e_{i,p}, x^e_{i,l}, x^e_{i,r}),\, h^e_{i-1}\big), \tag{2}$$
where $e(x^e_i, x^e_{i,p}, x^e_{i,l}, x^e_{i,r})$ are the embeddings of the $i$-th word and its parent, left child, and right child in the expression tree, respectively, as shown in Fig. 3. We obtain the forward hidden states $\overrightarrow{h^e_i}$ and backward hidden states $\overleftarrow{h^e_i}$ by reading expression $X^e$ in the forward and backward directions. We define the hidden states of the math expression $h^e_i$ as the concatenation of the forward and backward hidden states, i.e., $h^e_i = [\overrightarrow{h^e_i} : \overleftarrow{h^e_i}]$.
Topic-Expression Co-attention Mechanism:
We propose a topic-expression co-attention mechanism to generate a co-attention matrix $M$ for the topic word embeddings $e(x^o)$ and the expression hidden states $h^e$:
$$M_{ij} = \tanh\big(U e(x^o_i) \otimes h^e_j\big), \tag{3}$$
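The adjacent-node inputs to Eq. (2) can be illustrated with a minimal sketch. It walks an expression tree in pre-order and lists, for each word, its parent, left child, and right child. The nested-tuple tree encoding, the function names, and the use of a `<PAD>` symbol for missing neighbours are assumptions made for illustration; the text only specifies which neighbours are embedded, not how absent ones are represented in the encoder.

```python
# Illustrative sketch only: tree encoding, names and PAD symbol are assumptions.
PAD = "<PAD>"  # assumed placeholder for a missing parent or child


def node_label(node):
    """Return the word stored at a node (operator of a subtree, or the leaf itself)."""
    return node[0] if isinstance(node, tuple) else node


def adjacent_nodes(tree, parent=PAD):
    """List (word, parent, left child, right child) rows in pre-order.

    An internal node is a tuple (operator, left_subtree, right_subtree);
    a leaf is a plain string such as "N0".
    """
    if isinstance(tree, tuple):
        op, left, right = tree
        rows = [(op, parent, node_label(left), node_label(right))]
        rows += adjacent_nodes(left, op)    # visit the left subtree first
        rows += adjacent_nodes(right, op)   # then the right subtree
        return rows
    return [(tree, parent, PAD, PAD)]       # leaves have no children


# The expression (N0*N1)/(N2+N3) used as the example in the text.
tree = ("/", ("*", "N0", "N1"), ("+", "N2", "N3"))
for word, parent, left, right in adjacent_nodes(tree):
    print(f"{word:>3}  parent={parent:<5}  left={left:<5}  right={right}")
```

Each row would then be mapped to four embeddings and combined as the input $e(x^e_i, x^e_{i,p}, x^e_{i,l}, x^e_{i,r})$; how the four embeddings are combined is not stated in this section.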
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING 3
Fig. 2. Overview of our Math Word Problem Generation (MWPGen) network. (Left) Math word problem generation model. (Right) Math word problem solving model. The model is finally fine-tuned by rewards that are based on the similarity between the expression predicted by the MWP solving model and the original math expression.
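[Fig. 3: math expression tree with root “/”, internal nodes “*” and “+”, and leaves N0, N1, N2, N3. Word (pre-order): / * N0 N1 + N2 N3. Parent: (none) / * * / + +.]
expression hidden states and expression-aware topic hidden states:
$$\begin{aligned}
\alpha^o_{ij} &= \mathrm{softmax}(\mathrm{max\ pooling}(M)) \in \mathbb{R}^{l \times m},\\
\alpha^e_{ij} &= \mathrm{softmax}((\mathrm{max\ pooling}(M))^{T}) \in \mathbb{R}^{m \times l},\\
\hat{h}^o_i &= \sum_{j=1}^{l} \alpha^o_{ij}\, e(x^o_j), \qquad \hat{h}^e_i = \sum_{j=1}^{m} \alpha^e_{ij}\, h^e_j.
\end{aligned} \tag{4}$$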
Here, $s_{t-1}$ and $e(y_{t-1})$ are the decoder hidden state and the embedding of the generated word in the last time step. $c_t$ is the context vector obtained by the attention mechanism [5]. Taking the topic word hidden states $r^o_i$ and the expression hidden states $r^e_i$ from the encoder, and the last decoder hidden state $s_{t-1}$, we can obtain $c_t$ as:
$$\begin{aligned}
\alpha_{ti} &= \mathrm{softmax}\big(\tanh(W_h [r^o_i : r^e_i] + W_s s_{t-1})\big),\\
c_t &= \sum_{i=1}^{l+m} \alpha_{ti}\, [r^o_i : r^e_i],
\end{aligned} \tag{7}$$
where $W_h$ and $W_s$ are the weight matrices. $\alpha_{ti}$ denotes the attention distribution on the hidden states $r^o$ and $r^e$ at time step $t$.
Finally, using the context vector $c_t$ and hidden state $s_t$, the probability distribution of generating $y_t$ is calculated as:
$$\begin{aligned}
P(y_t \mid y_{<t}, X^o, X^e) &= \mathrm{softmax}(W_g [s_t : c_t]),\\
L_{loss} &= -\sum_{t=1}^{n} \log P(y_t \mid y_{<t}, X^o, X^e).
\end{aligned} \tag{8}$$
During training, we optimized the probability of generating $y_t$ with a cross-entropy loss function. During the test, we used beam search to generate the problem. We set the beam size to 5. At the first time step, we selected the top 5 words with the highest probability under the current distribution as the first word of the 5 candidate output sequences. Subsequently, at each time step, based on the candidate output sequences of the last time step, we selected the top 5 words with the highest probability under the current distribution. Finally, the sequence with the highest probability was selected from the 5 candidate sequences as the generated problem.

C. Math Word Problem Solver
After generating a math word problem $Y = \{y_1, y_2, \ldots, y_n\}$, $Y$ is sent to a pre-trained math word problem solver (MWP solver) to obtain its math expression. The structure of this MWP solver is shown below.
Encoder:
The MWP solver takes the generated problem $Y$ as input, encodes this sequence, and then passes it to a two-layer bidirectional LSTM:
$$h^p_i = \mathrm{BiLSTM}(e(y_i), h^p_{i-1}). \tag{9}$$
Decoder:
Following the method proposed by [13], we use a tree-structured gated recurrent unit (GRU) [15] decoder with an attention mechanism to generate math expressions in a pre-order traversal from top to bottom:
$$\begin{aligned}
s^p_t &= \mathrm{GRU}\big(e(e_{t-1}, e_{t,p}, e_{t,l}),\, c^p_t,\, s^p_{t-1}\big),\\
\alpha^p_{ti} &= \mathrm{softmax}(W_s s^p_{t-1} + W_h h^p_i),\\
c^p_t &= \sum_{i=1}^{n} \alpha^p_{ti}\, h^p_i.
\end{aligned} \tag{10}$$
At time step 1, we use the last hidden states $h^p_n$ of the generated problem to initialize the decoder state $s^p_1$. Here, $e_{t-1}$, $e_{t,p}$ and $e_{t,l}$ represent the output words of the last node, parent node, and left sibling node of the current node $e_t$, respectively. If the current node does not have a parent node or left sibling node in the generated expression tree, we pad it with a PAD token.
In addition, we introduce a copying mechanism [16] to enable the model to either generate a word $e_t$ from the vocabulary $V$ or copy a word from the input problem $Y$. At time step $t$, based on the context vector $c^p_t$ and the decoder state $s^p_t$, this mechanism calculates a copy gate value $g^p \in [0, 1]$ to determine whether the word $e_t$ is generated or copied:
$$\begin{aligned}
g^p &= \sigma(W_s s^p_t + W_c c^p_t),\\
P_c(e_t) &= \sum_{e_t = y_i} \alpha^p_{ti},\\
P_g(e_t) &= \mathrm{softmax}([W_g s^p_t : c^p_t]),\\
P(e_t \mid e_{<t}, Y) &= g^p P_c(e_t) + (1 - g^p) P_g(e_t),
\end{aligned} \tag{11}$$
where $W_s$, $W_c$ and $W_g$ are weight matrices and $\sigma$ is a sigmoid function. We obtain the final probability distribution $P(e_t \mid e_{<t}, Y)$ over both the generate distribution $P_g(e_t)$ and copy distribution $P_c(e_t)$.
During the test, at each time step, if $e_t$ is an operator, this means that the current node is an internal node, and the decoder continues to generate its left child nodes. If $e_t$ is a number, it is a leaf node, so the decoder will generate right child nodes for the previous internal nodes. Once the children of all the internal nodes have been generated, the generated expression sequence $E = \{e_1, e_2, \ldots, e_{n'}\}$ can be transformed into a complete tree and the decoding process is terminated.

D. Reinforcement learning
In each training iteration, we first use MWPGen to generate the math word problem $Y$ based on the topic words $X^o$ and expression $X^e$, and then use the MWP solver to generate the math expression $E$ based on $Y$. We use reinforcement learning [17] to fine-tune our model.
To do so, we define $r(y) = \mathrm{exp}(E, X^e) + \mathrm{ans}(E, X^e)$ as a reward function for the generated problem $Y$, which is obtained by checking the expression correctness and the answer correctness between the generated expression $E$ and the original expression $X^e$. If these two expressions are the same, $\mathrm{exp}(E, X^e)$ is set to 1, otherwise it is 0. If these two expressions can be executed to produce the same answer, $\mathrm{ans}(E, X^e)$ is set to 1, otherwise it is 0. We also consider the correctness of the answer because the solver may produce a correct expression that differs from the original expression, e.g., if the generated expression is “4+5” but the original expression is “5+4”.
The loss function for this reinforcement learning is defined as:
$$L_{RL} = -\big(r(y^s) - r(y^*)\big) \sum \log P(e_t \mid e_{<t}, Y). \tag{12}$$
Here, $r(y^s)$ is the reward for the generated problem $Y$ and $r(y^*)$ is a baseline reward to reduce the variance. Like the self-critical sequence training (SCST) strategy [18], $y^*$ is estimated by using the greedy search results of the MWPGen model.
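The reward $r(y)$ and the self-critical loss of Eq. (12) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: expressions are assumed to be pre-order token lists with numeric literals already substituted for N0, N1, ..., the sum in Eq. (12) is assumed to run over decoding time steps, and all function names (`eval_prefix`, `reward`, `scst_loss`) are hypothetical.

```python
import math

# Binary operators assumed for illustration.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def eval_prefix(tokens):
    """Evaluate a pre-order token list; return None if it is malformed
    or divides by zero."""
    def parse(pos):
        if pos >= len(tokens):
            raise ValueError("expression ended early")
        tok = tokens[pos]
        if tok in OPS:
            left, pos = parse(pos + 1)
            right, pos = parse(pos)
            return OPS[tok](left, right), pos
        return float(tok), pos + 1

    try:
        value, end = parse(0)
    except (ValueError, ZeroDivisionError):
        return None
    return value if end == len(tokens) else None


def reward(generated, original):
    """r(y) = exp(E, X^e) + ans(E, X^e), each term being 0 or 1."""
    exp_term = 1.0 if generated == original else 0.0
    a, b = eval_prefix(generated), eval_prefix(original)
    ans_term = 1.0 if a is not None and b is not None and math.isclose(a, b) else 0.0
    return exp_term + ans_term


def scst_loss(log_probs, r_sample, r_greedy):
    """Self-critical loss: -(r(y^s) - r(y^*)) * sum of per-step log-probabilities."""
    return -(r_sample - r_greedy) * sum(log_probs)


# "4+5" against "5+4": different expression, same answer.
print(reward(["+", "4", "5"], ["+", "5", "4"]))
```

With this shape, a sampled problem that scores above the greedy baseline has its log-probabilities pushed up, and one that scores below the baseline has them pushed down.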
TABLE II
EVALUATION RESULTS ON AUTOMATIC METRICS. THE BEST SCORE OF EACH METRIC IS HIGHLIGHTED IN BOLD.
Model          BLEU1    BLEU2    BLEU3    BLEU4    Rouge L   CIDEr
Seq2Seq [4]    0.5554   0.4303   0.3498   0.2914   0.5498    2.2446
ConvS2S [22]   0.5395   0.4147   0.3347   0.2767   0.5384    2.1028
NQG++ [23]     0.5511   0.4260   0.3459   0.2881   0.5466    2.2190
MAGNET [11]    0.5676   0.4383   0.3544   0.2937   0.5524    2.2596
MWPGen         0.5658   0.4448   0.3638   0.3037   0.5659    2.3166

TABLE III
SOLVABILITY OF THE GENERATED MATH WORD PROBLEMS. EVALUATED BY A STATE-OF-THE-ART MATH WORD PROBLEM SOLVING MODEL GTS [13]. THE BEST SCORE OF EACH METRIC IS HIGHLIGHTED IN BOLD.
Model          Expression   Answer
Ground Truth   63.43%       75.75%
Seq2Seq [4]    51.42%       60.25%
ConvS2S [22]   43.54%       52.14%
NQG++ [23]     47.06%       55.59%
MAGNET [11]    55.57%       62.85%
MWPGen         57.71%       66.43%

truths. These automatic evaluation metrics reflect the fluency of the generated problem. However, they have limited abilities to reflect the solvability and completeness of the generated problem. For example, for the ground truth “each apple costs me N1”, here are two sentences: Sentence A “I spent N1 on each apple” and Sentence B “each desk costs me N1”. Sentence A has the same meaning as the ground truth, but it is not as close to the ground truth as Sentence B, which may result in a lower evaluation score.
To further evaluate the quality of the generated problems, we used a pre-trained math word problem solving model to solve the generated problems and checked whether the results obtained were the same as the original math expressions and answers. In this study, we used the state-of-the-art model GTS released by [13] for math word problem solving. It is a sequence-to-tree model that uses math word problems as input and generates expression trees from top to bottom.
The evaluation metrics include expression accuracy and answer accuracy, indicating whether these mathematical problems can be solved to obtain the original expression and the original answer. For example, if the generated expression is “4+5” and the original expression is “5+4”, solving this problem has obtained a wrong expression and a correct answer. Therefore, the answer accuracy is always higher than the expression accuracy because sometimes the model generates a correct result with an expression different from the original one.
Table II shows the automatic metric evaluation results of our model compared with other baselines. Using problems generated by the different baselines as inputs, Table III shows the accuracies of the expressions and answers to these problems. We have the following observations:
1) The Seq2Seq model performed slightly better than NQG++ on automatic metrics, and its solvability is better than that of NQG++. We believe this is because the copy mechanism of NQG++ will copy keywords directly from inputs during the generation process. This mechanism may reduce the fluency and solvability of the generated problems.
2) MAGNET performed better than other baselines on automatic metrics and solvability. MAGNET uses entity-enforced loss to ensure that the entities in the generated math problem are highly relevant to the words in the given input. Although this additional loss may reduce the fluency and completeness of the generated problem, it is helpful for generating topic-related and expression-related problems.
3) Our proposed model MWPGen achieves competitive results compared with baselines on the automatic metrics. In addition, the solvability of the problems generated by MWPGen is better than these baselines. The observations in the next section also confirm the fluency and completeness of the problems generated by the proposed model. We attribute these improvements to the combination of all three components in MWPGen.

E. Human Evaluation
Because automatic evaluation metrics and problem solving model accuracies are not always related to human judgments on the correctness of a math word problem, human evaluation can help us to better evaluate its quality. We conducted human evaluation comparing generated problems from the baselines mentioned above and our model. Specifically, we consider four metrics in human evaluation:
• Fluency measures whether a problem is grammatically correct and is fluent to read.
• Completeness measures whether a problem has a clear question clause, and provides enough information in the description part to solve the question.
• Expression Accuracy measures whether a problem can be solved to obtain the given math expression.
• Topic Words Relevance measures whether a problem is relevant to all given topic words.
For human evaluation, we used the baselines mentioned above to compare with our model. We randomly selected 100 pairs of topic word lists and math expressions from the test set, and asked 3 native speakers to evaluate the generated problems of each model. For each metric, we asked the reviewers to rate the problems on a 1-3 scale (3 for the best).

TABLE IV
HUMAN EVALUATION RESULTS. THESE METRICS ARE RATED ON A 1-3 SCALE (3 FOR THE BEST). THE BEST SCORE OF EACH METRIC IS HIGHLIGHTED IN BOLD.
Model          Fluency   Completeness   Expression Accuracy   Topic Words Relevance
Seq2Seq [4]    2.49      2.41           1.98                  2.51
ConvS2S [22]   2.37      2.49           1.98                  2.33
NQG++ [23]     2.45      2.50           1.94                  2.39
MAGNET [11]    2.19      2.27           2.01                  2.69
MWPGen         2.62      2.54           2.18                  2.51
TABLE V
ABLATION STUDY OF THE ADJACENT NODE EMBEDDINGS IN THE EXPRESSION TREE. THE BEST SCORE OF EACH METRIC IS HIGHLIGHTED IN BOLD.
Results of each human evaluation metric are presented in Table IV. We can see that:
1) As for Topic Words Relevance, MAGNET gets the best score. MAGNET also achieved competitive results in Expression Accuracy. But for Fluency and Completeness, MAGNET does not perform as well as other baselines. The reason for this may be that its entity-enforced loss module forces MAGNET to generate solvable problems containing words from the input, but makes the generated problems unnatural or incomplete.
2) For Fluency and Topic Words Relevance, ConvS2S does not perform as well as other baselines. Human evaluation found that compared with other baselines, ConvS2S generated simpler and shorter problems. These problems are usually not related to the topic words. Therefore, even if ConvS2S is competitive on Completeness and Expression Accuracy, we still believe that ConvS2S does not perform as well as other baselines.
3) MWPGen gets the best or competitive performance in each metric. It achieves the best performance on Fluency, Completeness and Expression Accuracy. We attribute the superior performance of MWPGen to two properties. First, MWPGen uses adjacent node embeddings in the expression tree, and thus better captures the structure and global semantic information of the math expression. Second, MWPGen uses the quality and solvability of the generated problems as rewards to fine-tune the model, which can improve the score of expression accuracy. In this way, these two properties improve the expression accuracy of MWPGen. In addition, the rewards of reinforcement learning encourage the generation of problems that can be processed by the MWP solving model, which also greatly improves the fluency of MWPGen and slightly improves the completeness.

F. Ablation study
Effect of adjacent node embeddings in the expression tree: To verify the effectiveness of the adjacent node embeddings in the expression tree, we first conduct an ablation study on the adjacent node embeddings. Table V shows evaluation results for several variants of our proposed model on the Math23K dataset. The definitions of the models under comparison are:
• MWPGen-base: It is an ablation model of MWPGen without the adjacent node embeddings in the expression tree, the topic-expression co-attention mechanism and reinforcement learning. It can be seen as a basic sequence to sequence model.
• MWPGen-base with AdjEmb: It adds adjacent node embeddings in the expression tree to the “MWPGen-base” for better comparison.
• MWPGen without AdjEmb: It is an ablation model of MWPGen which removes the adjacent node embeddings of the current word in the expression tree.
• MWPGen: It is the complete version of our proposed model with all three components.
In addition, we explored two other popular approaches to capture the structure and global semantic information of the math expression, as follows:
• Graph Convolutional Network (GCN) [27]: We use a two-layer graph convolutional network in place of the adjacent node embeddings for a fairer comparison with our model. Math expressions are converted into expression trees, where operators are internal nodes and numbers are leaf nodes. We use GCN to update the node state by aggregating its neighbor nodes in the expression tree.
• Tree-LSTM [28]: We use a Tree-LSTM in place of our bidirectional LSTM encoder. Tree-LSTM transforms LSTM from chain-like to tree-like structures. This bottom-up hierarchical tree-structured encoder composes the node state according to the input embedding and the node states of its child nodes in the expression tree.
From Table V we can see:
1) Models with Tree-LSTM perform worse than all other variants. We believe this is because Tree-LSTM transfers the node state in one direction. Instead of obtaining global information from the entire math expression, each node only obtains information from its own subtree.
2) Models with GCN show competitive performance compared to models without GCN, and show improvements in some metrics such as BLEU4 and CIDEr. We believe that both BiLSTM and GCN can capture the structure and global semantic information of the math expression. In this article, we use BiLSTM because it is simpler.
3) With adjacent node embeddings from math expression trees, MWPGen and MWPGen-base achieved competitive scores on automatic metrics, while also achieving higher accuracy on the problem solver. This shows
TABLE VI
ABLATION STUDY OF THE TOPIC-EXPRESSION CO-ATTENTION MECHANISM. THE BEST SCORE OF EACH METRIC IS HIGHLIGHTED IN BOLD.
TABLE VII
ABLATION STUDY OF REINFORCEMENT LEARNING. THE BEST SCORE OF EACH METRIC IS HIGHLIGHTED IN BOLD.
TABLE IX
EXAMPLES OF GENERATED PROBLEMS WITH DIFFERENT NUMBERS OF TOPIC WORDS AS INPUT.
Topic words: quantity, store, balloons, red, yellow
Expression: ( N0 * N1 ) + N2
Ground Truth: There are N0 red balloons in the store. There are N2 more yellow balloons than N1 times the number of red balloons. How many yellow balloons are there?
商店 有 红 气球 N0 个 , 黄 气球 比 红 气球 的 N1 倍 多 N2 个 , 黄 气球 有 多少 个 ?
Num=0: A number is N2 more than N1 times N0, this number = ?
比 N0 的 N1 倍 多 N2 的 数 = ?
Num=1: How many with N2 =N0 * N1.
多少 和 N2 = N0 * N1 .
Num=2: A store bought N0 boxes of apples. The number of pears they bought is N2 more boxes than N1 times the number of apples. How many boxes of pears did they buy?
商店 运 来 苹果 N0 箱 , 运 来 的 梨 比 苹果 的 N1 倍 还 多 N2 箱 , 运 来 梨 多少 箱 ?
Num=3: The store has N0 packs of colorful balloons, and there are N1 colorful balloons in each pack. With N2 white balloons, how many balloons are there in the store?
商店 有 花 气球 N0 包 , 每包 N1 个 . 还有 白 气球 N2 个 , 商店 有 气球 多少 个 ?
Num=4: There are N0 red balloons in the store. The number of yellow balloons is N1 times of red balloons. There are N2 blue balloons. How many yellow balloons and blue balloons are there?
商店 里 有 红 气球 N0 个 , 黄 气球 的 个数 是 红 气球 的 N1 倍 . 商店 里 有 蓝 气球 N2 个 。 黄 气球 和 蓝 气球 一共 多少 个 ?
Num=5: There are N0 yellow balloons in the store. The number of red balloons is N2 more than N1 times of yellow balloons. How many red balloons are there in the store?
商店 有 黄 气球 N0 个 , 红 气球 的 个数 比 黄 气球 的 N1 倍 多 N2 个 . 商店 有 红 气球 多少 个 ?

TABLE X
TWO CASES OF PROBLEMS GENERATED BY MWPGEN COMPARED WITH OTHER BASELINES.
Case 1: Topic words: quantity, rabbit, white, black
Expression: N1 * N0
Ground Truth: N0 of the number of white rabbits is equal to the number of black rabbits. There are N1 white rabbits. How many black rabbits are there?
白兔 只 数 的 N0 等于 黑 兔 的 只 数 , 白兔 有 N1 只 , 黑 兔 有 多少 只 ?
Seq2Seq: A number divided by N0 is equal to N1, the number = ?
一 个 数 除以 N0 等于 N1 , 这个 数 = ?
NQG++: There are N0 white rabbits and N1 black rabbits. How many black rabbits are there?
白兔 有 N0 只 , 黑 兔 有 N1 只 , 黑 兔 有 多少 只 ?
Result: N1
MAGNET: The number of white rabbits is N0 of the black rabbits, and the number of white rabbits is N1 of gray rabbits. How many white rabbits are there?
白兔 的 是 黑 兔 的 N0 , 白兔 的 只 数 是 灰 兔 的 N1 . 白兔 有 多少 只 ?
Result: Unsolvable
MWPGen: The number of black rabbits is N0 of the white rabbits. If there are N1 white rabbits, how many black rabbits are there?
黑 兔 的 只 数 是 白兔 的 N0 . 白兔 有 N1 只 , 黑 兔 有 多少 只 ?
Case 2: Topic words: engineering, road, construction, speed, truck
Expression: ( N0 / N1 ) * N2
Ground Truth: During road construction of UNK School, N0 tons of sand were transported by truck, which accounted for N1 of the daily sand consumption. Workers accelerated the speed of road construction, so the daily sand consumption increased by N2. How many tons of sand should the truck transport every day?
学校 UNK 的 工地 上 在 修路 , 卡车 已 运来了 N0 吨 沙子 ,正好 占 了 一天 用 沙 量 的 N1 , 为了 加快 进 度 , 工人 们 加快 了 修路 的 速度 , 工地 每天 的 用 沙 量 也 增加 了 N2 , 卡车 每天 应 多 运 多少 吨 沙子 ?
Seq2Seq: A cow eats N0 kg of grass every day, which accounts for N1 of the total feed. After N2 days, how much feed did the cow eat?
一 头 奶牛 , 每天 吃 N0 的 草 , 占 所 吃 食物 总数 的 N1 , 吃 了 N2 天 , 这 头 头 一共 吃 多少 ?
NQG++: The road construction team is going to build a N0 meter long road. The plan is completed within N1 days. In fact, N2 meters are built every day. How many days will it actually be built?
修路 队 要 修 一条 长 N0 米 的 路 , 计划 N1 天 修 完 , 实际 每天 修 N2 米 , 实际 多少 天 修 完 ?
Result: N0 / N2
MAGNET: Road construction team A builds a road at a speed of N0 km/h, and the construction is completed in N1 hours. Team B will take N2 more hours to build this road. How long is this road?
甲 修路 队 修 一段路 , 速度 是 N0 千米 / 小时 , N1 小时 修 完 . 如果 是 乙 队 , 要 再 修 N2 小时 , 一共 修 了 多少 千米 ?
Result: N0 * N1
MWPGen: The road construction team built a N0 meter long road in N1 days. At this speed, how many meters of road can be built in N2 days?
一 个 修路 队 要 修 N0 千米 的 路 , 前 N1 天 修 了 全程 的 N2 , 照 这样 的 速度 , 还 需要 多少 天 才能 修 完 ?

more relevant to the topic words and thus closer to the ground truth. At the same time, more initial information makes the generated problems more diverse, which leads to a decrease in their expression and answer accuracy.
An example of this effect can be seen in Table IX. We can see that the more topic words used as input, the more detailed the problem generated by the model. Problems with richer input contain more information and are closer to the ground truth. Human inspection of the results found that models without topic words as input tend to generate simple and limited types of math word problems.
However, if the model needs to be provided with a large number of topic words as input, there will be an additional workload for teachers and in the data construction process. Moreover, when the number of topic words reaches a certain value, the improvement of the model effect by increasing the number of topic words is not as obvious. Therefore, we set the length of a given topic word list in this task to 5.

G. Case Study
Table X lists two examples of problems generated by MWPGen compared with other baselines.
In Case 1, Seq2Seq does not realize that the topic is about the number of rabbits, and therefore it generates a problem unrelated to the topic. The problem generated by NQG++ asks for the number of black rabbits, which is already given in this problem. The problem generated by MAGNET is incomplete and incorrect.
In Case 2, when the given topic is about engineering, Seq2Seq generates a problem about a cow eating grass. The problems generated by NQG++ and MAGNET cannot be solved to obtain the original expression “(N0/N1)*N2”. In this question, the ground truth is asking about the speed of road construction after acceleration. The problem generated by MWPGen
is asking about the length of a road built at a certain speed. However, the problem generated by MWPGen is related to the topic and can be solved to generate the given expression.
From these cases, we can see that the problems generated by Seq2Seq, NQG++ and MAGNET are quite complete and smooth. However, some of these problems cannot be solved or do not correspond to the given expressions. Some of these problems are unrelated to the topic words. Instead, the proposed MWPGen can generate topic-related problems and these problems can be solved to obtain given expressions. These further verify the effectiveness of the topic-expression co-attention mechanism and reinforcement learning.

IV. RELATED WORK
A. Math Word Problem Generation
Recently, many studies on natural language generation have attracted a lot of attention, such as machine translation [4], [5], [29], dialogue generation [7], [8], [30], reading comprehension problem generation [6], [31], [32], and image caption generation [10], [33], [34]. With the development of deep neural networks, the problem generation task uses neural network structures and has achieved remarkable results.
Zhou et al. propose a sequence-to-sequence model with attention mechanism and copy mechanism to generate questions for the text from the SQuAD dataset [23]. They enrich the model with answer position and lexical features. Wu et al. propose a “read-attend-comment” procedure for news comment generation and formalize the procedure with a reading network and a generation network [35]. Zhao et al. propose a novel document-level approach for question generation by using a multi-step recursive attention mechanism on the document and answer representation to extend the relevant context [36].
Math word problem generation is the inverse task of math word problem solving. It can be widely used in artificial intelligence testing, data set construction and education scenarios [1], [2]. Zhou and Huang propose a GRU-based seq2seq model with a maxout layer and entity-enforced loss for generating math word problems [11]. This model encodes topic words and math expressions separately, and uses both the topic and expression information to generate problems. Liyanage and Ranathunga use a BiLSTM model with attention mechanism to generate math word problems for three different languages, English, Sinhala and Tamil [37]. However, the above methods have not effectively considered the characteristics of math word problem generation tasks.
In this paper, we propose a topic-expression co-attention mechanism to extract the correlated information between topic words and expressions. We also leverage the adjacent node embeddings in the math expression tree to capture the structure and global semantic information of the math expression.

B. Math Word Problem Solving
Solving math word problems has long been a very popular task [38], [39] and various methods have been proposed in the past few years [40], [41]. Recent approaches to solve math word problems usually use the Seq2Seq model [4]. These models are trained to generate math expressions, which can be executed to generate answers to questions [42], [43]. For instance, DNS [3] first proposed a method based on the Seq2Seq model, which directly maps the problem to the corresponding expression. Liu et al. incorporated a tree structured model with an auxiliary stack that generates the math expression tree from top to bottom [12]. Xie and Sun proposed a seq2tree model to generate each node in the expression tree based on its parent node and left sibling tree [13]. Recently, many works that treat math word problems as graphs have also shown better performance. Zhang et al. connected each number in the problem with nearby nouns to enrich the problem representations [44]. Wu et al. connected words that belong to the same category in the external knowledge base to capture common sense information [43]. Li et al. constructed an input graph from both the math problem and its corresponding dependency tree to incorporate structural information [45].
For question generation, some methods [46], [47] treat question answering (QA) and question generation as complementary tasks and jointly train these two tasks. Yuan et al. feed the generated problem to a QA system and use the performance of the QA system as a metric of the quality of the problem [48]. Li et al. jointly trained models on visual question answering and visual question generation tasks to leverage the complementary relationship between questions and answers in images [49]. Deng et al. propose a novel joint learning model to solve the task of community question answering and answer summary generation [50].
Inspired by these methods, we use the pre-trained GTS model [13] to measure our proposed model. We feed the generated problem to the GTS and use the results to measure the completeness and solvability of the generated problem.

C. Reinforcement Learning
Recently, some studies have applied reinforcement learning (RL) to natural language generation tasks [51]–[53]. A variety of reinforcement learning methods have been proposed to further improve natural language generation learning by leveraging reward functions. Rennie et al. presented an optimization method called self-critical sequence training (SCST), which normalizes the rewards obtained by sampled sentences and inference sentences [18]. Chen et al. proposed an RL-based graph-to-sequence model. This model uses BLEU and word mover's distance (WMD) as reward functions [54]. Wan et al. proposed a code summarization model based on an abstract syntax tree structure in a reinforcement learning framework, and used BLEU scores as reward [55]. For the math word problem generation task, we solve the generated problems and compare the results with the given math expressions as rewards and fine-tune our model by reinforcement learning, thereby improving the quality and solvability of the generated problem.

V. CONCLUSION
This paper proposed a novel model MWPGen for the math word problem generation task. Adjacent node embeddings in
with attention mechanism [5] and copy mechanism [16], [33]. the expression tree were used as additional embeddings to
capture the structure and global semantic information of the math expression. A topic-expression co-attention mechanism was proposed to effectively consider the correlated information between topic words and expressions. In addition, we used the quality and solvability of the generated math word problems as rewards and fine-tuned our model by reinforcement learning. Experimental results confirmed that the proposed model MWPGen can generate more complete and solvable problems than other baselines, and these problems are more related to the given topic words and mathematical expressions.

ACKNOWLEDGMENT
The authors wish to thank the anonymous reviewers for their helpful comments. This work was funded by China National Key R&D Program (No. 2018YFB1005104).

REFERENCES
[1] O. Polozov, E. O’Rourke, A. M. Smith, L. Zettlemoyer, S. Gulwani, and Z. Popović, “Personalized mathematical word problem generation,” in Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
[2] R. Koncel-Kedziorski, I. Konstas, L. Zettlemoyer, and H. Hajishirzi, “A theme-rewriting approach for generating algebra word problems,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1617–1628.
[3] Y. Wang, X. Liu, and S. Shi, “Deep neural solver for math word problems,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 845–854.
[4] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.
[5] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in 3rd International Conference on Learning Representations, ICLR 2015, 2015.
[6] X. Du, J. Shao, and C. Cardie, “Learning to ask: Neural question generation for reading comprehension,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 1342–1352.
[7] B. Pan, H. Li, Z. Yao, D. Cai, and H. Sun, “Reinforced dynamic reasoning for conversational question generation,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 2114–2124.
[8] J. Wang, J. Liu, W. Bi, X. Liu, K. He, R. Xu, and M. Yang, “Improving knowledge-aware dialogue generation via knowledge base question answering,” arXiv preprint arXiv:1912.07491, 2019.
[9] Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing, “Toward controlled generation of text,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70, 2017, pp. 1587–1596.
[10] X. Liang, Z. Hu, H. Zhang, C. Gan, and E. P. Xing, “Recurrent topic-transition gan for visual paragraph generation,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3362–3371.
[11] Q. Zhou and D. Huang, “Towards generating math word problems from equations and topics,” in Proceedings of the 12th International Conference on Natural Language Generation, 2019, pp. 494–503.
[12] Q. Liu, W. Guan, S. Li, and D. Kawahara, “Tree-structured decoding for solving math word problems,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 2370–2379.
[13] Z. Xie and S. Sun, “A goal-driven tree-structured neural model for math word problems,” in Proceedings of the 28th International Joint Conference on Artificial Intelligence. AAAI Press, 2019, pp. 5299–5305.
[14] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[15] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” 2014.
[16] C. Gulcehre, S. Ahn, R. Nallapati, B. Zhou, and Y. Bengio, “Pointing the unknown words,” Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016. [Online]. Available: http://dx.doi.org/10.18653/v1/P16-1014
[17] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, vol. 8, no. 3-4, pp. 229–256, 1992.
[18] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, “Self-critical sequence training for image captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7008–7024.
[19] Z. Dong, Q. Dong, and C. Hao, “Hownet and the computation of meaning,” 2006.
[20] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[21] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[22] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, “Convolutional sequence to sequence learning,” pp. 1243–1252, 2017.
[23] Q. Zhou, N. Yang, F. Wei, C. Tan, H. Bao, and M. Zhou, “Neural question generation from text: A preliminary study,” in National CCF Conference on Natural Language Processing and Chinese Computing. Springer, 2017, pp. 662–671.
[24] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002, pp. 311–318.
[25] C.-Y. Lin and F. Och, “Looking for a few good metrics: Rouge and its evaluation,” in Ntcir Workshop, 2004.
[26] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “Cider: Consensus-based image description evaluation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4566–4575.
[27] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
[28] K. S. Tai, R. Socher, and C. D. Manning, “Improved semantic representations from tree-structured long short-term memory networks,” pp. 1556–1566, 2015.
[29] J. Zhang, H. Luan, M. Sun, F. Zhai, J. Xu, M. Zhang, and Y. Liu, “Improving the transformer translation model with document-level context,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 533–542.
[30] S. Wu, Y. Li, D. Zhang, Y. Zhou, and Z. Wu, “Diverse and informative dialogue generation with context-specific commonsense knowledge awareness,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 5811–5820.
[31] Y. Zhao, X. Ni, Y. Ding, and Q. Ke, “Paragraph-level neural question generation with maxout pointer and gated self-attention networks,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 3901–3910.
[32] B. Liu, M. Zhao, D. Niu, K. Lai, Y. He, H. Wei, and Y. Xu, “Learning to generate questions by learning what not to generate,” in The World Wide Web Conference, 2019, pp. 1106–1118.
[33] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3156–3164.
[34] Z.-J. Zha, D. Liu, H. Zhang, Y. Zhang, and F. Wu, “Context-aware visual policy network for fine-grained image captioning,” IEEE transactions on pattern analysis and machine intelligence, 2019.
[35] Z. Yang, C. Xu et al., “Read, attend and comment: A deep architecture for automatic news comment generation,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 5080–5092.
[36] L. A. Tuan, D. J. Shah, and R. Barzilay, “Capturing greater context for question generation,” arXiv preprint arXiv:1910.10274, 2019.
[37] V. Liyanage and S. Ranathunga, “Multi-lingual mathematical word problem generation using long short term memory networks with enhanced input features,” in Proceedings of The 12th Language Resources and Evaluation Conference, 2020, pp. 4709–4716.
[38] C. R. Fletcher, “Understanding and solving arithmetic word problems: A computer simulation,” Behavior Research Methods, Instruments, & Computers, vol. 17, no. 5, pp. 565–571, 1985.
[39] Y. Bakman, “Robust understanding of word problems with extraneous information,” arXiv preprint math/0701393, 2007.
[40] D. Huang, S. Shi, C.-Y. Lin, and J. Yin, “Learning fine-grained expressions to solve math word problems,” in EMNLP, 2017, pp. 805–814.
[41] S. Roy and D. Roth, “Mapping to declarative knowledge for word problem solving,” Transactions of the Association for Computational Linguistics, vol. 6, pp. 159–172, 2018.
[42] W. Ling, D. Yogatama, C. Dyer, and P. Blunsom, “Program induction by rationale generation: Learning to solve and explain algebraic word problems,” in ACL, vol. 1, 2017, pp. 158–167.
[43] Q. Wu, Q. Zhang, J. Fu, and X.-J. Huang, “A knowledge-aware sequence-to-tree network for math word problem solving,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 7137–7146.
[44] J. Zhang, L. Wang, R. K.-W. Lee, Y. Bin, Y. Wang, J. Shao, and E.-P. Lim, “Graph-to-tree learning for solving math word problems,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 3928–3937.
[45] S. Li, L. Wu, S. Feng, F. Xu, F. Xu, and S. Zhong, “Graph-to-tree neural networks for learning structured input-output translation with applications to semantic parsing and math word problem,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, 2020, pp. 2841–2852.
[46] T. Wang, X. Yuan, and A. Trischler, “A joint model for question answering and question generation,” arXiv preprint arXiv:1706.01450, 2017.
[47] D. Tang, N. Duan, T. Qin, Z. Yan, and M. Zhou, “Question answering and question generation as dual tasks,” arXiv preprint arXiv:1706.02027, 2017.
[48] X. Yuan, T. Wang, C. Gulcehre, A. Sordoni, P. Bachman, S. Zhang, S. Subramanian, and A. Trischler, “Machine comprehension by text-to-text neural question generation,” in Proceedings of the 2nd Workshop on Representation Learning for NLP, 2017, pp. 15–25.
[49] Y. Li, N. Duan, B. Zhou, X. Chu, W. Ouyang, X. Wang, and M. Zhou, “Visual question generation as dual task of visual question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6116–6124.
[50] Y. Deng, W. Lam, Y. Xie, D. Chen, Y. Li, M. Yang, and Y. Shen, “Joint learning of answer selection and answer summary generation in community question answering,” in AAAI, 2020, pp. 7651–7658.
[51] J. Li, W. Monroe, A. Ritter, D. Jurafsky, M. Galley, and J. Gao, “Deep reinforcement learning for dialogue generation,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1192–1202.
[52] S. Narayan, S. B. Cohen, and M. Lapata, “Ranking sentences for extractive summarization with reinforcement learning,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 1747–1759.
[53] R. Csáky, P. Purgai, and G. Recski, “Improving neural conversational models with entropy-based data filtering,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 5650–5669.
[54] Y. Chen, L. Wu, and M. J. Zaki, “Reinforcement learning based graph-to-sequence model for natural question generation,” 2019.
[55] Y. Wan, Z. Zhao, M. Yang, G. Xu, H. Ying, J. Wu, and P. S. Yu, “Improving automatic source code summarization via deep reinforcement learning,” in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, 2018, pp. 397–407.

Qi Zhang received the PhD degree in computer science from Fudan University. He is an associate professor of computer science at Fudan University, Shanghai, China. His research interests include natural language processing and information retrieval.

Xuanjing Huang received the PhD degree in computer science from Fudan University. She is a professor of computer science at Fudan University, Shanghai, China. Her research interests include natural language processing and information retrieval.