

A Weakly Supervised Model for Solving Math Word Problems

Oishik Chatterjee, Department of CSE, IIT Bombay
Aashish Waikar, Department of CSE, IIT Bombay
Vishwajeet Kumar, IBM Research, Bangalore, India
Ganesh Ramakrishnan, Department of CSE, IIT Bombay
Kavi Arya, Department of CSE, IIT Bombay

arXiv:2104.06722v1 [cs.CL] 14 Apr 2021
Abstract

Solving math word problems (MWPs) is an important and challenging problem in natural language processing. Existing approaches to solve MWPs require full supervision in the form of intermediate equations. However, labeling every math word problem with its corresponding equations is a time-consuming and expensive task. In order to address this challenge of equation annotation, we propose a weakly supervised model for solving math word problems that requires only the final answer as supervision. We approach this problem by first learning to generate the equation using the problem description and the final answer, which we then use to train a supervised MWP solver. We propose and compare various weakly supervised techniques to learn to generate equations directly from the problem description and answer. Through extensive experiments, we demonstrate that even without using equations for supervision, our approach achieves an accuracy of 56.0 on the standard Math23K dataset (Wang et al., 2017). We also curate and release a new dataset for MWPs in English consisting of 10,227 instances suitable for training weakly supervised models.

Problem: It costs Rs 5.0 to buy 10.0 peppermint candies. If the candies all have the same price, how much does it cost to buy 1.0 candy?
Equation: X = (5.0/10.0) × 1.0 (full supervision)
Answer: 0.5 (weak supervision)

Table 1: Example of a Math Word Problem

1 Introduction

A Math Word Problem (MWP) is a numerical problem expressed in natural language (the problem description) that can be transformed into an equation (the solution expression), which can in turn be solved to obtain the final answer. In Table 1, we present an example of an MWP. Automatically solving MWPs has recently gained a lot of research interest in natural language processing (NLP) and Artificial Intelligence (AI). The task of automatically solving MWPs is challenging owing to two primary reasons: i) the unavailability of large training datasets with problem descriptions, equations as well as corresponding answers; as depicted in Table 1, the equations provide full supervision, since they can be solved to obtain the answer (which by itself amounts to weak supervision only); ii) parsing the problem description and representing it suitably for effective decoding of the equations is challenging. Paucity of completely supervised training data can pose a severe challenge in training MWP solvers. Most existing approaches assume the availability of full supervision in the form of both intermediate equations and answers for training. However, annotating MWPs with equations is an expensive and time-consuming task. There exists only one sufficiently large dataset (Wang et al., 2017), in Chinese, consisting of MWPs with annotated intermediate equations for supervised training.

We propose a novel two-step weakly supervised technique to solve MWPs by making use only of weak supervision (in the form of answers). In the first step, using only the answer as supervision, we learn to generate equations for questions in the training set. In the second step, we use the generated equations along with answers to train any state-of-the-art supervised model. We illustrate the effectiveness of our weakly supervised approach on our newly curated, reasonably large dataset in English.

Our main contributions are as follows:
1) An approach, WARM (c.f., Section 4), for generating equations from math word problems, given (weak) supervision only in the form of the final answer.
2) An extended semi-supervised training method to leverage a small amount of annotated equations as strong/complete supervision.
3) A new and relatively large dataset in English (with more than 10k instances) for training weakly supervised models for solving math word problems (c.f., Section 3).

2 Related Work

Automatic math word problem solving has recently drawn significant interest in the natural language processing (NLP) community. Existing MWP solving methods can be broadly classified into four categories: (a) rule-based methods, (b) statistics-based methods, (c) tree-based methods, and (d) neural-network-based methods.

Rule-based systems (Fletcher, 1985; Bakman, 2007; Yuhui et al., 2010) were amongst the earliest approaches to solve MWPs. They rely heavily on hand-engineered rules that cover only a limited domain of problems, whereas math word problems have potentially diverse descriptions in real-world settings. On the other hand, statistics-based methods (Hosseini et al., 2014; Kushman et al., 2014; Sundaram and Khemani, 2015; Mitra and Baral, 2016; Liang et al., 2016a,b) use pre-defined logic templates and employ traditional machine learning models to identify entities, quantities, and operators from the problem text, and then use simple logical inference to yield the numeric answer. Upadhyay et al. (2016) employ a semi-supervised approach by learning to predict templates and corresponding alignments using both explicit and implicit supervision. These approaches involve an additional overhead of annotating the problem text with the relevant template, which can be prohibitive when learning from large-scale datasets or when an MWP solver has to be trained on a language with little or no existing dataset. Moreover, the pre-defined templates used by these methods are not very robust to dataset diversity. Tree-based methods (Roy and Roth, 2015; Koncel-Kedziorski et al., 2015; Roy et al., 2016; Roy and Roth, 2017, 2018) replaced the process of deriving an equation by constructing an equivalent tree structure, step by step, in a bottom-up manner. More recently, neural-network-based MWP solving methods (Wang et al., 2017, 2018a,b; Huang et al., 2018; Chiang and Chen, 2019; Wang et al., 2019; Liu et al., 2019; Xie and Sun, 2019) have been proposed. These employ an encoder-decoder architecture and train in an end-to-end manner without the need for hand-crafted rules or templates.

Wang et al. (2017) were the first to propose a sequence-to-sequence (Seq2Seq) model, viz., Deep Neural Solver, for solving MWPs. They employ an RNN-based encoder-decoder architecture to directly translate the problem text into equation templates, and also release a high-quality large-scale dataset, Math23K, consisting of 23,161 math word problems in Chinese. Wang et al. (2018a) extend the idea of the Deep Neural Solver by introducing equation normalization techniques. Huang et al. (2018) propose the incorporation of copy and alignment mechanisms in the standard Seq2Seq model. Chiang and Chen (2019) propose a Seq2Seq model with an encoder designed to understand the semantics of the problem and a decoder equipped with a stack to facilitate tracking the semantic meanings of the operands. Wang et al. (2019) propose a two-step pipeline that employs i) a Seq2Seq model to convert the input text into a tree-structured template with quantities representing inferred numbers as leaf nodes and unknown operators as inner nodes, and ii) a recursive neural network to encode the quantities and infer the unknown operator nodes in a bottom-up manner.

Unlike the earlier neural-network-based methods, Liu et al. (2019) and Xie and Sun (2019) propose tree-structured decoding that generates the syntax tree of the equation in a top-down manner. In addition to applying tree-structured decoding, Zhang et al. (2020) propose a graph-based encoder to capture relationships and order information among the quantities. Wang et al. (2018b) make the first attempt at applying deep reinforcement learning to solve MWPs. They equivalently express the output equation as an expression tree and propose MathDQN, a customised version of the deep Q-network (DQN) framework, by tailoring the definitions of states, actions and reward functions, and use this for iterative construction of the tree.

For a more comprehensive review of automatic MWP solvers, readers can refer to a recent survey paper (Zhang et al., 2018). Unlike all the previous works that require equations for supervision, we propose a novel weakly supervised model, WARM (c.f., Section 4), for solving MWPs using only the final answer for supervision.
3 Dataset

Currently, there does not exist any sufficiently large English dataset for single-equation math word problems. While Math23K (Wang et al., 2017) is in Chinese, Dolphin18K (Huang et al., 2016) contains mostly multivariable word problems. Hence we curated a new dataset, viz., EW10K, an English MWP dataset containing 10,227 instances (each associated with a single equation) that can be used for training MWP solver models in a weakly supervised manner.

We crawled the IXL website (https://in.ixl.com/) to obtain math word problems for grades VI to X. These word problems involve a wide variety of mathematical computations, ranging from simple addition and subtraction to much harder mensuration and probability problems. The dataset consists of 10 different types of problems across 3 different tiers of difficulty. We also annotate the dataset with the target unit. The exact distributions are presented in Figure 1.

[Figure 1: Distribution of different types of questions]

4 Our Approach: WARM

We propose a weakly supervised model, WARM (WeAkly supeRvised Math solver), for solving math word problems using only the answer as supervision. We employ a two-step cascaded approach for solving the given task. For the first step, we propose a model that predicts the equation given a problem text and answer. This model uses reinforcement learning to search the space of possible equations given only the question and the correct answer. It consists of a two-layer bidirectional GRU (Cho et al., 2014) encoder and a decoder network with fully connected units (described in Section 4.3). We call this model WARM. Using this model, we create a noisy equation-annotated dataset: we keep only those instances for which the equation generated by the model yields the correct answer. Note that the equations are noisy, as there is no guarantee that a generated equation will be the shortest or even correct. In the second step, we use this noisy data to train a supervised model.

[Figure 2: WARM: Our Weakly Supervised Approach. Step 1 (equation generation) converts the question+answer dataset into an equation-annotated dataset; Step 2 performs supervised training on it.]

4.1 Equation Generation

Given a problem text P and answer A, we first pass the encoded representation obtained from the encoder to the decoder. At each time step, the decoder generates an operator and its two operands from the operator and operand vocabulary lists. The operation is then executed to obtain a new quantity. This quantity is checked against the ground truth; if it matches the ground truth, decoding is terminated and a reward of +1 is assigned. Otherwise, a reward of -1 is given and the generated quantity is added to the operand vocabulary list. In the following subsections, we describe this process in more detail.

4.2 Encoder

The encoder takes as input a math word problem represented as a sequence of tokens $P = x_1 x_2 x_3 \ldots x_n$. We replace each number in the question with a special token <num> to obtain this sequence. Each word token $x_i$ is first transformed into the corresponding word embedding by looking up an embedding matrix $M_w$. Next, a binary feature is appended to the embedding to indicate whether the token is a word or a number. This appended embedding vector is then passed through a 2-layer bidirectional GRU (Cho et al., 2014), and the outputs of both directions of the final layer are summed to get the encoded representation of the text. This is then passed on to the decoder.
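As a minimal illustration (not the released implementation), the encoder of Section 4.2 can be sketched in PyTorch as follows. Class and variable names are ours, and the way the initial decoder state is derived from the final encoder state is one plausible reading; dimensions follow Section 5.3 (128-d embeddings, 512-d hidden states).

```python
import torch
import torch.nn as nn

class WarmEncoder(nn.Module):
    """Problem-text encoder sketch: embedding + binary word/number flag + 2-layer BiGRU."""

    def __init__(self, vocab_size, emb_dim=128, hid_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)   # M_w
        self.gru = nn.GRU(emb_dim + 1, hid_dim, num_layers=2,
                          bidirectional=True, batch_first=True)

    def forward(self, token_ids, is_number):
        # token_ids: (batch, seq_len) ints; is_number: (batch, seq_len) 0/1 flags for <num> tokens
        emb = self.embedding(token_ids)
        emb = torch.cat([emb, is_number.float().unsqueeze(-1)], dim=-1)
        outputs, h_n = self.gru(emb)                          # outputs: (batch, seq_len, 2*hid_dim)
        # Sum the forward and backward directions of the final layer (Section 4.2).
        enc = outputs[..., :self.gru.hidden_size] + outputs[..., self.gru.hidden_size:]
        # One way to form the initial decoder state from the last encoder state (an assumption).
        h_init = h_n[-2] + h_n[-1]                            # (batch, hid_dim)
        return enc, h_init
```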
[Figure 3: Inference Illustration]

[Figure 4: Architecture for generating equation tree in WARM]

4.3 Decoder

We have implemented a fully connected decoder network for generating equations in a bottom-up manner. Our decoder takes as input the previously decoded operator and the last decoder hidden state, and outputs the operator, left operand, right operand and hidden state for the current time step. We initialize the decoder hidden state with the last state of the encoder.

$$\mathbf{op}_t, \mathbf{o}^l_t, \mathbf{o}^r_t, h^d_t = \mathrm{DecoderFCN}(op_{t-1}, h^d_{t-1})$$

Here, $h^d_t$ is the decoder hidden state at the $t$-th time step, and $\mathbf{op}_t$, $\mathbf{o}^l_t$ and $\mathbf{o}^r_t$ are probability distributions over operators, left operands and right operands respectively.

4.3.1 Operator Generation

Inside our decoder, we learn an operator embedding matrix $\mathrm{Em}_{op}$, where $\mathrm{Em}_{op}(op_{t-1})$ is the embedding of the operator sampled in the last time step. We use a gating mechanism to generate the operator hidden state $h^{op}_t$:

$$g^{op}_t = \sigma(W^1_{op} [\mathrm{Em}_{op}(op_{t-1}); h^d_{t-1}] + b^1_{op})$$
$$h^{op}_t = g^{op}_t * \tanh(W^2_{op} [\mathrm{Em}_{op}(op_{t-1}); h^d_{t-1}] + b^2_{op})$$
$$\mathbf{op}_t = \mathrm{softmax}(W^3_{op} h^{op}_t + b^3_{op})$$

Here $\sigma()$ denotes the sigmoid function and $*$ denotes elementwise multiplication. We sample the operator $op_t$ from the probability distribution $\mathbf{op}_t$.

4.3.2 Left Operand Generation

We use the embedding of the current operator $\mathrm{Em}_{op}(op_t)$ and the operator hidden state $h^{op}_t$ to get a probability distribution over the operands, using a gating mechanism similar to the one used for generating the operator:

$$g^{ol}_t = \sigma(W^1_{ol} [\mathrm{Em}_{op}(op_t); h^{op}_t] + b^1_{ol})$$
$$h^{ol}_t = g^{ol}_t * \tanh(W^2_{ol} [\mathrm{Em}_{op}(op_t); h^{op}_t] + b^2_{ol})$$
$$\mathbf{o}^l_t = \mathrm{softmax}(W^3_{ol} h^{ol}_t + b^3_{ol})$$

We sample the left operand $o^l_t$ from the probability distribution $\mathbf{o}^l_t$.

4.3.3 Right Operand Generation

For generating the right operand, we use the additional context information that is already available from the generated left operand. So, in addition to the operator embedding $\mathrm{Em}_{op}(op_t)$ and the operator hidden state $h^{op}_t$, we also use the left operand hidden state $h^{ol}_t$ to get the right operand hidden state $h^{or}_t$:

$$g^{or}_t = \sigma(W^1_{or} [\mathrm{Em}_{op}(op_t); h^{op}_t; h^{ol}_t] + b^1_{or})$$
$$h^{or}_t = g^{or}_t * \tanh(W^2_{or} [\mathrm{Em}_{op}(op_t); h^{op}_t; h^{ol}_t] + b^2_{or})$$
$$\mathbf{o}^r_t = \mathrm{softmax}(W^3_{or} h^{or}_t + b^3_{or})$$

We sample the right operand $o^r_t$ from the probability distribution $\mathbf{o}^r_t$. The hidden state $h^{or}_t$ is returned as the current decoder state $h^d_t$.
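The gating pattern of Sections 4.3.1 to 4.3.3 can be sketched in PyTorch as below. This is an illustrative sketch, not the released code: all names are ours, and for simplicity the operand heads score a fixed maximum operand vocabulary, whereas the actual operand list grows dynamically (Section 4.3.4).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedHead(nn.Module):
    """One gated head: gate * tanh(projection), followed by a softmax scoring layer."""
    def __init__(self, in_dim, hidden_dim, vocab_size):
        super().__init__()
        self.gate = nn.Linear(in_dim, hidden_dim)       # W^1, b^1
        self.proj = nn.Linear(in_dim, hidden_dim)       # W^2, b^2
        self.score = nn.Linear(hidden_dim, vocab_size)  # W^3, b^3

    def forward(self, context):
        g = torch.sigmoid(self.gate(context))           # g_t
        h = g * torch.tanh(self.proj(context))          # h_t (elementwise gating)
        dist = F.softmax(self.score(h), dim=-1)         # probability distribution
        return dist, h

class DecoderStep(nn.Module):
    """One decoding step: operator, then left operand, then right operand."""
    def __init__(self, emb_dim, hid_dim, n_ops, max_operands):
        super().__init__()
        self.op_emb = nn.Embedding(n_ops, emb_dim)                       # Em_op
        self.op_head = GatedHead(emb_dim + hid_dim, hid_dim, n_ops)
        self.left_head = GatedHead(emb_dim + hid_dim, hid_dim, max_operands)
        self.right_head = GatedHead(emb_dim + 2 * hid_dim, hid_dim, max_operands)

    def forward(self, prev_op, h_dec):
        e_prev = self.op_emb(prev_op)
        op_dist, h_op = self.op_head(torch.cat([e_prev, h_dec], dim=-1))
        op = torch.multinomial(op_dist, 1).squeeze(-1)                   # sample op_t
        e_cur = self.op_emb(op)
        left_dist, h_ol = self.left_head(torch.cat([e_cur, h_op], dim=-1))
        right_dist, h_or = self.right_head(torch.cat([e_cur, h_op, h_ol], dim=-1))
        return op, op_dist, left_dist, right_dist, h_or                  # h_or becomes h^d_t
```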
4.3.4 Bottom-up Equation Construction

For each training instance, we maintain a dictionary of possible operands, OpDict. Initially, this dictionary contains the numeric values that occur in the instance, i.e., the tokens we replaced with <num> during encoding. At the $t$-th decoding step we sample an operator $op_t$, a left operand $o^l_t$ and a right operand $o^r_t$. We obtain an intermediate result by applying the operator corresponding to $op_t$ to the operands $o^l_t$ and $o^r_t$. This intermediate result is added to OpDict, which enables us to reuse the results of previous computations in future decoding steps. OpDict thus acts as a dynamic dictionary of operands, and we use it to reach the final answer in a bottom-up manner.
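The following sketch illustrates the decode-execute-check loop of Sections 4.1 and 4.3.4 under our own naming. The function `step_fn` stands in for one decoder step, the arithmetic operator set is an assumption (the paper does not enumerate it), and the reward handling is one reading of the description.

```python
import operator

# Assumed operator vocabulary; the paper does not list it explicitly.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def decode_equation(step_fn, numbers, answer, max_steps=40, tol=1e-4):
    """Bottom-up equation construction with a dynamic operand dictionary (sketch).

    `step_fn(op_dict)` returns an operator symbol and the indices of its two operands;
    `numbers` are the quantities found in the problem text; `answer` is the gold answer.
    """
    op_dict = list(numbers)      # OpDict starts with the numbers from the problem text
    trace = []                   # (op, left, right) triples, i.e., the generated equation
    for _ in range(max_steps):
        op_sym, li, ri = step_fn(op_dict)
        left, right = op_dict[li], op_dict[ri]
        try:
            result = OPS[op_sym](left, right)   # execute the sampled operation
        except ZeroDivisionError:
            return trace, -1                    # invalid step: negative reward
        trace.append((op_sym, left, right))
        if abs(result - answer) < tol:          # answer reached: terminate with reward +1
            return trace, +1
        op_dict.append(result)                  # new quantity becomes a reusable operand
    return trace, -1
```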
4.4 Rewards and Loss

We use the REINFORCE (Williams, 1992) algorithm for training the model, using just the final answer as the ground truth. We model the reward as +1 if the predicted answer matches the ground truth and -1 if it does not.

Let $R_t$ denote the reward obtained after generating $y_t = (op_t, o^l_t, o^r_t)$. The probability of generating $y_t$ is

$$p_\theta(y_t) = \prod_{i=1}^{t} \mathbf{op}_i(op_i) \times \mathbf{o}^l_i(o^l_i) \times \mathbf{o}^r_i(o^r_i)$$

where $\mathbf{op}_i(op_i)$ denotes the probability assigned to the sampled operator $op_i$ (and similarly for the operands). The loss is specified by $L = -\sum_i \mathbb{E}_{p_\theta(y_i)}[R_i]$, and the corresponding gradient is $\nabla_\theta L = -\sum_i \sum_{y_i} p_\theta(y_i)\, R_i\, \nabla_\theta \log p_\theta(y_i)$. Since the space of $y_i$ makes it infeasible to compute the exact gradient, we use the standard technique of sampling $y_i$ from $p_\theta(y_i)$ to obtain an estimate of the gradient.
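A minimal sketch of the sampled REINFORCE estimate of Section 4.4 is given below, assuming the log-probabilities of the sampled operator and operand choices are collected during decoding; names and interfaces are ours.

```python
import torch

def reinforce_loss(step_log_probs, reward):
    """REINFORCE surrogate for one sampled equation (sketch).

    `step_log_probs`: list of log-probabilities of the sampled (op_t, o^l_t, o^r_t) choices;
    `reward`: +1 if the executed equation reached the gold answer, else -1.
    Minimising this surrogate yields the sampled estimate of the gradient of L = -E[R].
    """
    log_p_y = torch.stack(step_log_probs).sum()   # log p_theta(y) = sum of per-choice log-probs
    return -reward * log_p_y                      # call .backward() on the batch-accumulated value
```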
4.5 Beam Exploration in Training

Since the reward space for our problem is very sparse, we have observed that the gradients go to zero while training our model: the model converges too quickly to a local optimum, and consequently training accuracy saturates at some fixed value despite training for a large number of epochs. To counter this problem, we employ beam exploration in the training procedure. Instead of sampling the operator $op_t$, left operand $o^l_t$ and right operand $o^r_t$ only once in each decoding step, we sample $w$ triplets $(op_t, o^l_t, o^r_t)$ without replacement from the joint probability space in each decoding step, where $w$ is the beam width. For selecting beams out of all possible candidates, we tried multiple heuristics based on probability and reward values. We have observed empirically that selecting the beam which obtains a positive reward at the earliest decoding step gives the best performance. This enables our model to explore more, and mitigates the above problem significantly.
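The two ingredients of Section 4.5, sampling $w$ triplets without replacement from the joint distribution and keeping the beam whose reward turns positive earliest, can be sketched as follows. This is an illustrative sketch (batching omitted, beam bookkeeping assumed), not the released implementation.

```python
import torch

def sample_triplets(op_dist, left_dist, right_dist, beam_width):
    """Sample `beam_width` (operator, left, right) triplets without replacement
    from the joint distribution formed as an outer product of the three heads."""
    joint = torch.einsum('i,j,k->ijk', op_dist, left_dist, right_dist)
    idx = torch.multinomial(joint.flatten(), beam_width, replacement=False)
    n_left, n_right = left_dist.numel(), right_dist.numel()
    ops = idx // (n_left * n_right)
    lefts = (idx // n_right) % n_left
    rights = idx % n_right
    return list(zip(ops.tolist(), lefts.tolist(), rights.tolist()))

def pick_beam(beams):
    """Keep the candidate beam whose reward turned positive at the earliest decoding step.

    Each beam is assumed to carry 'first_positive_step' (None if the reward never turned positive).
    """
    positives = [b for b in beams if b['first_positive_step'] is not None]
    return min(positives, key=lambda b: b['first_positive_step']) if positives else None
```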

We consider a model that benefits from a relatively We replace every number token in the problem text
small amount of strong supervision in the form with a special word token < num > before provid-
of equation annotated data: D1 , in addition to po- ing it as input to the encoder. We also define a set
tentially larger sized math problem datasets with of numerical constants Vconst to solve those prob-
no equation annotation D2 . For a data instance d: lems which might require special numeric values
d.p, d.e, and d.a represent its problem statement, that may not be present in the problem text. For
equation, and answer respectively. D1 contains example, consider the problem “The radius of a
instances of the form (d.p, d.e, d.a) while D2 con- circle is 2.5, what is its area?”, the solution is “π
tains instances of the form (d.p, d.a). x 2.5 x 2.5” where the constant quantity π cannot
We extend the WARM model to include a Cross- be found in the text. As our model does not use
Entropy loss component for instances belonging to equations as supervision, we cannot know precisely
D1. The net loss is the sum of the REINFORCE what extra numeric values might be required for a
and Cross-Entropy losses shown below:- problem, so we fix Vconst = {1, π}. Finally, the
P operand dictionary for every problem is initialised
Loss 1: RLWARM (d.p, d.a)
d∈D2 as OpDict = nP ∪ Vconst where nP is the set of
Cross_Entropy(d.e, WARM(d.p, d.a)) numeric values present in the problem text.
P
Loss 2:
d∈D1
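A sketch of how the two terms combine is shown below. The `model.sample` and `model.teacher_force` interfaces are hypothetical stand-ins (one returning the log-probabilities and reward of a weakly supervised rollout, the other per-step scores against the gold equation tokens); they are not the authors' API.

```python
import torch
import torch.nn.functional as F

def warm_s_loss(model, batch_d2, batch_d1):
    """Combined WARM-S objective (sketch): REINFORCE on answer-only data (D2)
    plus cross-entropy against gold equations (D1)."""
    loss = torch.tensor(0.0)
    for problem, answer in batch_d2:
        step_log_probs, reward = model.sample(problem, answer)            # weakly supervised rollout
        loss = loss - reward * torch.stack(step_log_probs).sum()          # Loss 1 (REINFORCE surrogate)
    for problem, answer, gold_equation in batch_d1:
        logits, targets = model.teacher_force(problem, gold_equation)     # per-step scores vs gold tokens
        loss = loss + F.cross_entropy(logits, targets)                    # Loss 2 (cross-entropy)
    return loss
```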
5 Experimental Setup

In this section, we conduct experiments on three datasets to examine the performance of the proposed weakly supervised models compared to various baselines and fully supervised models.

5.1 Datasets

We performed all our experiments on the publicly available AllArith (Roy and Roth, 2017) and Math23K (Wang et al., 2017) datasets, and also on our EW10K dataset. For each dataset, we used an 80:20 train-test split.

AllArith contains 831 math word problems, annotated with equations and answers, populated by collecting problems from the smaller datasets AI2 (Hosseini et al., 2014), IL (Roy and Roth, 2015), CC (Roy and Roth, 2015) and SingleEQ (Koncel-Kedziorski et al., 2015), normalizing all mentions of quantities to digits, and filtering out near-duplicate problems (with over 80% match of unigrams and bigrams).

Math23K (Wang et al., 2017) contains 23,161 math word problems in Chinese with 2,187 templates, annotated with equations and answers, aimed at elementary school students and crawled from multiple online education websites. It is the largest publicly available dataset for the task of automatic math word problem solving.

EW10K contains 10,227 math word problems in English for classes VI to X, as described in Section 3.

5.2 Dataset Preprocessing

We replace every number token in the problem text with a special word token <num> before providing it as input to the encoder. We also define a set of numerical constants $V_{const}$ to solve those problems that might require special numeric values not present in the problem text. For example, consider the problem "The radius of a circle is 2.5, what is its area?"; the solution is "π x 2.5 x 2.5", where the constant quantity π cannot be found in the text. As our model does not use equations as supervision, we cannot know precisely what extra numeric values might be required for a problem, so we fix $V_{const} = \{1, \pi\}$. Finally, the operand dictionary for every problem is initialised as $\mathrm{OpDict} = n_P \cup V_{const}$, where $n_P$ is the set of numeric values present in the problem text.
5.3 Implementation Details

We implement all our models in PyTorch (https://pytorch.org/); source code and pre-trained models will be released upon publication. We set the dimension of the word embedding layer to 128, and the dimension of all the hidden states for the other layers to 512. We use the REINFORCE (Williams, 1992) algorithm and Adam (Kingma and Ba, 2014) to optimize the parameters. The initial value of the learning rate is set to 0.001, and the learning rate is multiplied by 0.7 every 75 epochs. We also set the dropout probability to 0.5 and the weight decay to 1e-5 to avoid over-fitting. Finally, we set the beam width to 5 in beam exploration to encourage more exploration. We train our model for 100 epochs on a GTX 1080Ti machine with the batch size set to 256.
5.4 Models

We compare the math word problem solving accuracy of our weakly supervised models with beam exploration against various baselines and fully supervised models. We describe the models below:

WARM (Weakly Supervised Math Solver) is a weakly supervised model with a two-step approach: equation generation as the first step, followed by supervised training on the equations that yield the correct answer. It uses beam exploration (discussed in Section 4.5).

WARM w/o Beam Exploration is the same as WARM but does not use beam exploration while decoding.

WARM-S is a semi-supervised model which assumes the availability of a small amount of data with annotated equations. It employs semi-supervision for both equation generation and supervised training. It also uses beam exploration (discussed in Section 4.5).

WARM-S w/o Beam Exploration is the same as WARM-S but does not use beam exploration while decoding.

Random Equation Sampling consists of a random search over k parallel paths of length d. For each path, an operator and its two operands are uniformly sampled from the given vocabulary and the result is added to the operand vocabulary (similar to WARM). The equation is terminated once the correct answer is reached. We set k = 5 and d = 40 for a fair comparison with our model in terms of the number of search operations.

Seq2Seq Baseline is a GRU (Cho et al., 2014) based seq2seq encoder-decoder model. REINFORCE (Williams, 1992) is used to train the model. Beam exploration is also employed to mitigate the issues mentioned in Section 4.5.

Hybrid model w/ SNI (Wang et al., 2017) is a combination of the retrieval model and the RNN-based Seq2Seq model with significant number identification (SNI).

Ensemble model w/ EN (Wang et al., 2018a) is an ensemble model that selects the result according to generation probability among Bi-LSTM, ConvS2S and Transformer Seq2Seq models with equation normalization (EN).

Semantically-Aligned (Chiang and Chen, 2019) is a Seq2Seq model with an encoder designed to understand the semantics of the problem text and a decoder equipped with a stack to facilitate tracking the semantic meanings of the operands.

T-RNN + Retrieval (Wang et al., 2019) is a combination of the retrieval model and the T-RNN model, comprising a structure prediction module that predicts the template with unknown operators and an answer generation module that predicts the operators.

Seq2Tree (Liu et al., 2019) is a Seq2Tree model with a Bi-LSTM encoder and a top-down hierarchical tree-structured decoder consisting of an LSTM that makes use of the parent and sibling information fed as input, as well as an auxiliary stack to help in the decoding process.

GTS (Xie and Sun, 2019) is a tree-structured neural model that generates the expression tree in a goal-driven manner.

Graph2Tree (Zhang et al., 2020) consists of a graph-based encoder which captures the relationships and order information among the quantities, and a tree-based decoder that generates the expression tree in a goal-driven manner.

As described earlier in Section 4, we use our weakly supervised models to generate labelled data (i.e., equations), which we then use to train a supervised model. We have performed experiments using GTS (Xie and Sun, 2019) and Graph2Tree (Zhang et al., 2020) as the supervised models, since they are the current state of the art.

6 Results and Analysis

Weakly Supervised Models        AllArith  Math23K  EW10K
WARM w/o Beam Exploration       42.1      14.5     45.2
WARM                            95.8      80.1     98.8
Baselines                       AllArith  Math23K  EW10K
Random Equation Sampling        53.4      17.6     46.3
Seq2Seq Baseline                67.0      7.1      77.6

Table 2: Equation generation accuracy of our weakly supervised models compared to baselines.
6.1 Analyzing WARM

In Table 2, we observe that our model WARM yields far higher accuracy than the random baselines, with accuracy values close to 100% on AllArith and EW10K. Thus we are able to generate equations for a given problem and answer more accurately, and these can then be used to train supervised models.

As discussed earlier in Section 4.5, our model WARM w/o Beam Exploration suffers from the problem of converging to local optima because of the sparsity of the reward signal. Training our weakly supervised models with beam exploration alleviates this issue to a large extent, as we explore the solution space much more extensively and the sparsity of the reward signal is thus reduced. We see a vast improvement in training accuracy by virtue of the addition of beam exploration: WARM yields training accuracy significantly higher than its non-beam-exploration counterpart, and gives the best training accuracy overall.

We also observe that our approach yields results comparable to the various supervised models without using any supervision from gold equations. On AllArith our approach achieves an accuracy of 65.7% and 68.1% using GTS and Graph2Tree as the supervised models respectively, while the state-of-the-art supervised model GTS gives 70.5%. On Math23K, the difference between WARM and the supervised models is more pronounced.

Weakly Supervised Models                        AllArith  Math23K  EW10K
WARM w/o Beam Exploration (GTS)                 36.1      12.8     46.9
WARM (GTS)                                      65.7      54.3     86.2
WARM w/o Beam Exploration (G2T)                 48.2      13.5     47.3
WARM (G2T)                                      68.1      56.0     88.1
Fully Supervised Models                         AllArith  Math23K  EW10K
G2T (Zhang et al., 2020)                        –         75.5     NA
GTS (Xie and Sun, 2019)‡                        70.5      73.6     NA
Seq2Tree (Liu et al., 2019)                     –         69.0     NA
T-RNN + Retrieval (Wang et al., 2019)           –         68.7     NA
Semantically-Aligned (Chiang and Chen, 2019)†   –         65.8     NA
Ensemble model w/ EN (Wang et al., 2018a)       –         68.4     NA
Hybrid model w/ SNI (Wang et al., 2017)†        –         64.7     NA

Table 3: Math word problem solving accuracy of our weakly supervised models compared to various supervised models on the AllArith and Math23K datasets. † denotes that the result is 5-fold cross-validation performance; all other models are tested on the test set. ‡ denotes that the result is on the same train-test split as ours. "–" denotes that either the code is unavailable or it is not compatible with that dataset. NA denotes not applicable.

In Table 4, we present some of the predictions. As can be seen, the model is capable of producing long, complex equations as well. Sometimes it may reach the correct answer even though the equation is wrong (as in the last example, where the correct equation would have been X = 7.0 * 6.0 / 3.0, but the model predicted X = 7.0 + 7.0).

Problem: It costs Rs 5.0 to buy 10.0 peppermint candies. If the candies all have the same price, how much does it cost to buy 1.0 candy?
Answer: 0.5
Generated Equation: X=(5.0/10.0) (Correct)

Problem: Ariel already has 4.0 flowers in her garden, and she can also grow 3.0 flowers with every seed packet she uses. With 2.0 seed packets, how many total flowers can Ariel have in her garden?
Answer: 10.0
Generated Equation: X=(4.0+(2.0*3.0)) (Correct)

Problem: In hopes of encouraging healthier snacks at school, Fred brought in a tray of carrot sticks and apple slices to share. He brought 18.0 carrot sticks and 18.0 apple slices. What percentage of the snacks were carrot sticks?
Answer: 50
Generated Equation: X=(100.0*(18.0/(18.0+18.0))) (Correct)

Problem: Celine took a total of 6.0 quizzes over the course of 3.0 weeks. After attending 7.0 weeks of school this quarter, how many quizzes will Celine have taken in total? Assume the relationship is directly proportional.
Answer: 14.0
Generated Equation: X=(7.0+7.0) (Incorrect)

Table 4: Equations generated by the WARM model

6.2 Analysing Semi-supervision through WARM-S

For analyzing semi-supervision, we shuffled and combined AllArith with EW10K. In Table 5 we observe that with less than 10% of fully annotated data, our equation generation accuracy increases from 45.2% to 92.0% when beam exploration is not used. With beam exploration, the accuracy goes from 98.8% to 99.3%.

Weakly Supervised Models            EW10K
WARM w/o Beam Exploration           45.2
WARM                                98.8
Semi Supervised Models              AllArith+EW10K
WARM-S w/o Beam Exploration         92.0
WARM-S                              99.3

Table 5: Equation generation accuracy of our semi-supervised models compared to weakly supervised models and baselines.

Weakly Supervised Models            EW10K
WARM w/o Beam Exploration (GTS)     46.9
WARM (GTS)                          86.2
WARM w/o Beam Exploration (G2T)     47.3
WARM (G2T)                          88.1
Semi Supervised Models              AllArith+EW10K
WARM-S w/o Beam Exploration (GTS)   87.2
WARM-S (GTS)                        92.1
WARM-S w/o Beam Exploration (G2T)   87.4
WARM-S (G2T)                        93.6

Table 6: Math word problem solving accuracy of our semi-supervised models compared to the weakly supervised models.
goes from 98.8% to 99.3%. for supervision. Our weakly supervised approach,
7 Conclusion

We have proposed a two-step approach for solving math word problems using only the final answer for supervision. Our weakly supervised approach, WARM, achieves a reasonable accuracy of 56.0 on the standard Math23K dataset even without leveraging equations for supervision. We also curate and release a large-scale MWP dataset, EW10K, in English. We observed that the results are encouraging for simpler MWPs.

8 Acknowledgement

We would like to acknowledge Saiteja Talluri and Raktim Chaki for their contributions in the initial stages of this work.

References

Yefim Bakman. 2007. Robust understanding of word problems with extraneous information. arXiv preprint math/0701393.

Ting-Rui Chiang and Yun-Nung Chen. 2019. Semantically-aligned equation generation for solving and reasoning math word problems. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. ACL.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.

Charles R. Fletcher. 1985. Understanding and solving arithmetic word problems: A computer simulation. Behavior Research Methods, 17:565–571.

Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. 2014. Learning to solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 523–533. ACL.

Danqing Huang, Jing Liu, Chin-Yew Lin, and Jian Yin. 2018. Neural math word problem solver with reinforcement learning. In Proceedings of the 27th International Conference on Computational Linguistics, pages 213–223. ACL.

Danqing Huang, Shuming Shi, Chin-Yew Lin, Jian Yin, and Wei-Ying Ma. 2016. How well do computers solve math word problems? Large-scale dataset construction and evaluation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 887–896, Berlin, Germany. Association for Computational Linguistics.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. International Conference on Learning Representations.

Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Dumas Ang. 2015. Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics, 3:585–597.

Nate Kushman, Yoav Artzi, Luke Zettlemoyer, and Regina Barzilay. 2014. Learning to automatically solve algebra word problems. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 271–281. Association for Computational Linguistics.

Chao-Chun Liang, Kuang-Yi Hsu, Chien-Tsung Huang, Chung-Min Li, Shen-Yu Miao, and Keh-Yih Su. 2016a. A tag-based English math word problem solver with understanding, reasoning and explanation. In Proceedings of the Demonstrations Session, NAACL HLT 2016, pages 67–71. The Association for Computational Linguistics.

Chao-Chun Liang, Kuang-Yi Hsu, Chien-Tsung Huang, Chung-Min Li, Shen-Yu Miao, and Keh-Yih Su. 2016b. A tag-based statistical English math word problem solver with understanding, reasoning and explanation. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI'16, pages 4254–4255. AAAI Press.

Qianying Liu, Wenyv Guan, Sujian Li, and Daisuke Kawahara. 2019. Tree-structured decoding for solving math word problems. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2370–2379.

Arindam Mitra and Chitta Baral. 2016. Learning to use formulas to solve simple arithmetic problems. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2144–2153. Association for Computational Linguistics.

Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1743–1752. ACL.

Subhro Roy and Dan Roth. 2017. Unit dependency graph and its application to arithmetic word problem solving. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 3082–3088. AAAI Press.

Subhro Roy and Dan Roth. 2018. Mapping to declarative knowledge for word problem solving. Transactions of the Association for Computational Linguistics, 6:159–172.

Subhro Roy, Shyam Upadhyay, and Dan Roth. 2016. Equation parsing: Mapping sentences to grounded equations. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, pages 1088–1097. The Association for Computational Linguistics.

Sowmya S. Sundaram and Deepak Khemani. 2015. Natural language processing for solving simple word problems. In Proceedings of the 12th International Conference on Natural Language Processing, pages 394–402. NLP Association of India.

Shyam Upadhyay, Ming-Wei Chang, Kai-Wei Chang, and Wen-tau Yih. 2016. Learning from explicit and implicit supervision jointly for algebra word problems. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 297–306.

Lei Wang, Yan Wang, Deng Cai, Dongxiang Zhang, and Xiaojiang Liu. 2018a. Translating math word problem to expression tree. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1064–1069. ACL.

Lei Wang, Dongxiang Zhang, Lianli Gao, Jingkuan Song, Long Guo, and Heng Tao Shen. 2018b. MathDQN: Solving arithmetic word problems via deep reinforcement learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, AAAI, pages 5545–5552. AAAI Press.

Lei Wang, Dongxiang Zhang, Jipeng Zhang, Xing Xu, Lianli Gao, Bing Tian Dai, and Heng Tao Shen. 2019. Template-based math word problem solvers with recursive neural networks. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, AAAI, pages 7144–7151. AAAI Press.

Yan Wang, Xiaojiang Liu, and Shuming Shi. 2017. Deep neural solver for math word problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 845–854. ACL.

Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256.

Zhipeng Xie and Shichao Sun. 2019. A goal-driven tree-structured neural model for math word problems. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 5299–5305. AAAI Press.

M. Yuhui, Z. Ying, C. Guangzuo, R. Yun, and H. Ronghuai. 2010. Frame-based calculus of solving arithmetic multi-step addition and subtraction word problems. In 2010 Second International Workshop on Education Technology and Computer Science, volume 2, pages 476–479.

Dongxiang Zhang, Lei Wang, Nuo Xu, Bing Tian Dai, and Heng Tao Shen. 2018. The gap of semantic parsing: A survey on automatic math word problem solvers. CoRR, abs/1808.07290.

Jipeng Zhang, Lei Wang, Roy Ka-Wei Lee, Yi Bin, Yan Wang, Jie Shao, and Ee-Peng Lim. 2020. Graph-to-tree learning for solving math word problems. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3928–3937, Online. Association for Computational Linguistics.
