Lisp Programming
ABSTRACT
Neural networks have a reputation for being better at solving statistical or approximate problems than at performing calculations or working with symbolic data. In this paper, we show that they can be surprisingly good at more elaborate tasks in
mathematics, such as symbolic integration and solving differential equations. We
propose a syntax for representing these mathematical problems, and methods for
generating large datasets that can be used to train sequence-to-sequence models.
We achieve results that outperform commercial Computer Algebra Systems such
as Matlab or Mathematica.
1 INTRODUCTION
Mathematical expressions can be represented as trees, with operators and functions as internal nodes,
operands as children, and numbers, constants and variables as leaves. The following trees represent
expressions $2 + 3 \times (5 + 2)$, $3x^2 + \cos(2x) - 1$, and $\frac{\partial^2 \psi}{\partial x^2} - \frac{1}{\nu^2}\frac{\partial^2 \psi}{\partial t^2}$:

(Figure: the three corresponding expression trees.)
Trees disambiguate the order of operations, take care of precedence and associativity and eliminate
the need for parentheses. Up to the addition of meaningless symbols like spaces, punctuation or
redundant parentheses, different expressions result in different trees. With few assumptions, discussed
in Section A of the appendix, there is a one-to-one mapping between expressions and trees.
We consider expressions from a syntactical perspective, as meaningless sequences of mathematical symbols. $2 + 3$ and $3 + 2$ are different expressions, as are $\sqrt{4x}$ and $2\sqrt{x}$: they will be represented by different trees. $x / 0$, $\sqrt{-2}$ or $\log(0)$ are also legitimate expressions, even though they do not
necessarily make mathematical sense. In practice, expressions represent meaningful mathematical
objects. Since there is a one-to-one correspondence between trees and expressions, equality between expressions is reflected on their associated trees as an equivalence: since $2 + 3 = 5 = 12 - 7 = 1 \times 5$, the four trees corresponding to these expressions are equivalent. For instance,
expression simplification, a standard problem in formal mathematics, amounts to finding a shorter
equivalent representation of a tree.
In this paper, we consider two problems: symbolic integration and differential equations. Both these
problems boil down to transforming an expression into another, e.g. mapping the tree of an equation
to the tree of a solution that satisfies it. We regard this as a particular instance of machine translation.
Machine translation systems typically operate on sequences (Sutskever et al., 2014; Bahdanau et al.,
2015). Alternative approaches have been proposed to generate trees, such as Tree-LSTM (Tai et al.,
2015) or Recurrent Neural Network Grammars (RNNG) (Dyer et al., 2016; Eriguchi et al., 2017).
However, tree-to-tree models are more involved and much slower than their seq2seq counterparts,
both at training and at inference. For the sake of simplicity, we use seq2seq models, which were
shown to be effective at generating trees, e.g. in the context of constituency parsing (Vinyals et al.,
2015), where the task is to predict a syntactic parse tree of input sentences.
Using seq2seq models to generate trees requires mapping trees to sequences. To this end, we use
the prefix notation (also known as normal Polish notation), writing each node before its children,
listed from left to right. For instance, the arithmetic expression 2 + 3 ∗ (5 + 2) is represented as the
sequence [+ 2 ∗ 3 + 5 2]. In contrast to the common infix notation 2 + 3 ∗ (5 + 2), prefix sequences
need no parentheses and are therefore shorter. Inside sequences, operators, functions or variables are
represented by specific tokens, and integers by sequences of digits preceded by a sign. As with expressions and trees, there exists a one-to-one mapping between trees and prefix sequences.
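As an illustration, here is a minimal sketch (our own, not the implementation used for the experiments) of the tree-to-prefix conversion:

```python
# A minimal sketch of the tree -> prefix-sequence mapping.
class Node:
    def __init__(self, label, children=()):
        self.label = label              # operator, function, variable, or digit token
        self.children = list(children)

def to_prefix(node):
    """Write each node before its children, listed from left to right."""
    tokens = [node.label]
    for child in node.children:
        tokens.extend(to_prefix(child))
    return tokens

# 2 + 3 * (5 + 2)  ->  ['+', '2', '*', '3', '+', '5', '2']
expr = Node('+', [Node('2'),
                  Node('*', [Node('3'), Node('+', [Node('5'), Node('2')])])])
print(to_prefix(expr))
```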
In this work, we want to generate a training set of mathematical expressions. However, uniformly sampling expressions with n internal nodes is not a simple task. Naive algorithms (such as recursive
methods or techniques using fixed probabilities for nodes to be leaves, unary, or binary) tend to favour
deep trees over broad trees, or left-leaning over right-leaning trees. Here are examples of different trees that we want to generate with the same probability.
(Figure: four expression trees of different shapes and depths that should all be generated with the same probability.)
In Section B of the appendix, we present an algorithm to generate random trees and expressions that are representative of the whole problem space, and where the four expression trees above are all generated with the same probability. In the next section, we investigate the number of such possible expressions.
Expressions are built from a list of operators, variables (i.e. literals) and integers sampled over a finite set. Based on the task of interest, more involved operators can be used, like differentiation or integration, etc. More precisely, we define our problem space as: expressions built from up to n internal nodes, a set of $p_1$ unary operators, a set of $p_2$ binary operators, and a set of L possible leaves (variables, constants, and integers).
If $p_1 = 0$, expressions are represented by binary trees. The number of binary trees with n internal nodes is given by the n-th Catalan number $C_n$ (Sloane, 1996). A binary tree with n internal nodes has exactly n + 1 leaves. Each node and leaf can take respectively $p_2$ and L different values. As a result, the number of expressions with n binary operators can be expressed by:
$$E_n = C_n\, p_2^n\, L^{n+1} \approx \frac{1}{n\sqrt{\pi n}}\,(4p_2)^n\, L^{n+1} \qquad\text{with}\qquad C_n = \frac{1}{n+1}\binom{2n}{n} \approx \frac{4^n}{n\sqrt{\pi n}}$$
If $p_1 > 0$, the number of trees with n internal nodes is the n-th large Schroeder number $S_n$ (Sloane, 1996). It can be computed by recurrence using the following equation:
$$(n+1)S_n = 3(2n-1)S_{n-1} - (n-2)S_{n-2} \qquad\qquad (1)$$
Finally, the number $E_n$ of expressions with n internal nodes, $p_1$ unary operators, $p_2$ binary operators and L leaves is recursively computed as
$$(n+1)E_n = (p_1 + 2Lp_2)(2n-1)E_{n-1} - p_1(n-2)E_{n-2} \qquad\qquad (2)$$
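The number of expressions can also be enumerated directly from the problem-space definition (an expression is a leaf, a unary operator applied to one subexpression, or a binary operator applied to two); the snippet below is our own illustration, not code from the paper:

```python
# Direct dynamic-programming count of expressions with up to n_max internal
# nodes, built from p1 unary operators, p2 binary operators and L leaves.
def count_expressions(n_max, p1, p2, L):
    E = [L]                                            # E[0]: a single leaf
    for n in range(1, n_max + 1):
        unary = p1 * E[n - 1]                          # unary operator on top
        binary = p2 * sum(E[i] * E[n - 1 - i] for i in range(n))
        E.append(unary + binary)
    return E

# With p1 = 0, the counts reduce to Catalan(n) * p2**n * L**(n+1):
print(count_expressions(5, p1=0, p2=1, L=1))           # [1, 1, 2, 5, 14, 42]
```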
3 GENERATING DATASETS
To train networks on mathematical subjects, we generate datasets of problems and their solutions.
Ideally, we would like datasets to be representative of the problem space. Unfortunately, some
functions do not have an integral which can be expressed with usual functions (e.g. $f(x) = \exp(x^2)$ or $f(x) = \log(\log(x))$), and solutions to arbitrary differential equations cannot always be expressed
with usual functions. To address this issue, we propose other approaches to generate large training
sets for integration and first and second order differential equations.
3.1 INTEGRATION
For integration, we generate random functions with up to n operators, using methods from Section 2,
and calculate their derivative with a computer algebra system. The derivative becomes the problem,
and the original sample the solution. In that setting, we simply train the model to predict the generated
function given its derivative.
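Concretely, with a computer algebra system such as SymPy, a (derivative, primitive) training pair can be produced as follows (an illustrative sketch of our own, with a fixed function standing in for a randomly generated one):

```python
import sympy as sp

x = sp.symbols('x')

def integration_pair(f):
    """The derivative of f is the problem; f itself is a valid solution."""
    return sp.diff(f, x), f

# A function that would normally be sampled at random:
f = x * sp.cos(x**2) + sp.log(x + 1)
problem, solution = integration_pair(f)
print(problem)    # derivative of f, e.g. -2*x**2*sin(x**2) + cos(x**2) + 1/(x + 1)
```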
We now present a method to generate first order differential equations with their solutions. We start from a bivariate function F(x, y) such that the equation F(x, y) = c (where c is a constant) can be analytically solved in y. In other words, there exists a bivariate function f that satisfies $\forall (x, c),\ F(x, f(x, c)) = c$. By differentiation with respect to x, we have that $\forall x, c$:
$$\left.\frac{\partial F(x, y)}{\partial x} + \frac{\partial f(x, c)}{\partial x}\,\frac{\partial F(x, y)}{\partial y}\right|_{y = f(x, c)} = 0$$
In other words, for any constant c, $x \mapsto f(x, c)$ is a solution of the differential equation
$$\frac{\partial F(x, y)}{\partial x} + y'\,\frac{\partial F(x, y)}{\partial y} = 0 \qquad\qquad (3)$$
With this approach, we can use the method described in Section C of the appendix to generate
arbitrary functions F (x, y) analytically solvable in y, and create a dataset of differential equations
with their solutions.
Alternatively, instead of generating a random function F, we propose to directly generate a function f(x, c) that is the solution of a differential equation yet to be determined. If f(x, c) is solvable in c, we can simply compute the function F that satisfies $F(x, f(x, c)) = c$. By applying the approach above, we have that for any constant c, $x \mapsto f(x, c)$ is a solution of the differential Equation 3. Finally, the resulting differential equation is factorized, and we remove all positive factors from the equation.
A necessary condition for this approach to work is the solvability in c of the generated functions f(x, c). For instance, the function $f(x, c) = c \times \log(x + c)$ cannot be analytically solved in c, i.e. the function F that satisfies $F(x, f(x, c)) = c$ cannot be written with usual functions. Since all the operators and functions we use are invertible, a simple condition to ensure the solvability in c is to guarantee that c only appears once among the leaves of the tree representation of f(x, c). A straightforward way to generate a suitable f(x, c) is to sample a random function f(x) with the methods described in Section C of the appendix, and to replace one of the leaves in its tree representation by c.
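For instance, the SymPy sketch below (our own illustration; the concrete f(x, c) is arbitrary) builds an f with a single c leaf, solves it in c to recover F, and differentiates F(x, y(x)) to obtain the equation:

```python
import sympy as sp

x, c, Y = sp.symbols('x c Y')
y = sp.Function('y')

# f(x, c) with a single c leaf (obtained in practice by replacing one leaf
# of a random expression by c).
f = c * x + sp.log(x)

# Solve f(x, c) = Y in c to get F(x, Y) such that F(x, f(x, c)) = c.
F = sp.solve(sp.Eq(f, Y), c)[0]                    # (Y - log(x)) / x

# Differentiate F(x, y(x)) = c with respect to x, and remove the positive
# factor 1/x**2 as described above.
ode = sp.expand(x**2 * sp.diff(F.subs(Y, y(x)), x))
print(sp.Eq(ode, 0))    # x*Derivative(y(x), x) - y(x) + log(x) - 1 = 0
```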
Our method for generating first order equations can be extended to the second order by considering functions of three variables $f(x, c_1, c_2)$ that can be solved in $c_2$. As before, we derive a function of three variables F such that $F(x, f(x, c_1, c_2), c_1) = c_2$. Differentiation with respect to x yields a first order differential equation:
$$\left.\frac{\partial F(x, y, c_1)}{\partial x} + f'_{c_1,c_2}(x)\,\frac{\partial F(x, y, c_1)}{\partial y}\right|_{y = f_{c_1,c_2}(x)} = 0$$
where $f_{c_1,c_2} : x \mapsto f(x, c_1, c_2)$. If this equation can be solved in $c_1$, we can infer another three-variable function G satisfying $\forall x,\ G\big(x, f_{c_1,c_2}(x), f'_{c_1,c_2}(x)\big) = c_1$. Differentiating with respect to x
a second time yields the following equation:
$$\left.\frac{\partial G(x, y, z)}{\partial x} + f'_{c_1,c_2}(x)\,\frac{\partial G(x, y, z)}{\partial y} + f''_{c_1,c_2}(x)\,\frac{\partial G(x, y, z)}{\partial z}\right|_{\substack{y = f_{c_1,c_2}(x)\\ z = f'_{c_1,c_2}(x)}} = 0$$
Therefore, for any constants $c_1$ and $c_2$, $f_{c_1,c_2}$ is a solution of the second order differential equation:
$$\frac{\partial G(x, y, y')}{\partial x} + y'\,\frac{\partial G(x, y, y')}{\partial y} + y''\,\frac{\partial G(x, y, y')}{\partial z} = 0$$
Using this approach, we can create pairs of second order differential equations and solutions, provided the generated $f(x, c_1, c_2)$ is solvable in $c_2$, and the corresponding first order differential equation is solvable in $c_1$. To ensure the solvability in $c_2$, we use the same approach as for first order differential equations, i.e. we create $f_{c_1,c_2}$ so that $c_2$ appears in exactly one leaf of its tree representation. For $c_1$, we employ a simple approach where we skip the current equation if we cannot solve it in $c_1$. Although naive, we found that the differential equation can be solved in $c_1$ about 50% of the time. Another strategy to ensure the solvability in $c_1$ is proposed in Section D of the appendix. As an example:
Coefficients simplification. In the case of first order differential equations, we replace generated expressions by equivalent expressions, up to a change of variable. For instance, $x + x\tan(3) + cx + 1$ will be simplified to $cx + 1$, as a particular choice of the constant c makes these two expressions identical. Similarly, $\log(x^2) + c\log(x)$ becomes $c\log(x)$.
We apply a similar technique to second order differential equations, although the simplification is sometimes more involved because there are two constants $c_1$ and $c_2$. For instance, $c_1 - c_2x/5 + c_2 + 1$ is simplified to $c_1 x + c_2$, while $c_2 e^{c_1} e^{c_1 x} e^{-1}$ can be expressed as $c_2 e^{c_1 x}$, etc.
We also perform transformations that are not strictly equivalent, as long as they hold under specific assumptions. For instance, we simplify $\tan(\sqrt{c_2}\,x) + \cosh(c_1 + 1) + 4$ to $c_1 + \tan(c_2 x)$, although the constant term can be negative in the second expression but not in the first one. Similarly, $e^3 e^{c_1 x} e^{c_1 \log(c_2)}$ is transformed to $c_2 e^{c_1 x}$.
4 EXPERIMENTS
4.1 DATASET
For all considered tasks, we generate datasets using the method presented in Section 3, with:
To produce meaningful mathematical expressions, binary operators are sampled with higher probabilities than unary functions when generating expressions. Additions and multiplications represent on average half of the nodes in our expressions.
4.2 MODEL
For all our experiments, we train a seq2seq model to predict the solutions of given equations, i.e.
to predict a primitive given a function, or predict a solution given a differential equation. We use a
transformer model (Vaswani et al., 2017) with 8 attention heads, 6 layers, and a dimensionality of
512. In our experiments, using larger models did not improve the performance. We train our models with the Adam optimizer (Kingma & Ba, 2014), with a learning rate of $10^{-4}$. We remove expressions with more than 512 tokens, and train our model with 256 equations per batch.
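For reference, a minimal sketch of this configuration (the framework and variable names are our own assumptions; the paper does not provide code):

```python
import torch
import torch.nn as nn

# 8 attention heads, 6 encoder and 6 decoder layers, dimensionality 512,
# Adam with learning rate 1e-4. The vocabulary size is an arbitrary example.
VOCAB_SIZE, DIM = 1000, 512

embedding = nn.Embedding(VOCAB_SIZE, DIM)
model = nn.Transformer(d_model=DIM, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)
projection = nn.Linear(DIM, VOCAB_SIZE)

params = (list(embedding.parameters()) + list(model.parameters())
          + list(projection.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)
```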
At inference, expressions are generated by a beam search (Koehn, 2004), with early stopping. We
normalize the log-likelihood scores of hypotheses in the beam by their sequence length. We report
results with beam widths of 1 (i.e. greedy decoding), 10 and 50.
During decoding, nothing prevents the model from generating an invalid prefix expression, e.g.
[+ 2 ∗ 3]. To address this issue, Dyer et al. (2016) use constraints during decoding, to ensure
that generated sequences can always be converted to valid expression trees. In our case, we found
that model generations are almost always valid and we do not use any constraint. When an invalid
expression is generated, we simply consider it as an incorrect solution and ignore it.
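Checking that a prefix sequence parses into a single expression tree only requires one left-to-right scan (a sketch of our own; integer tokens are treated as single leaves for simplicity):

```python
# Arities of the operators; variables, constants and digit tokens have arity 0.
ARITY = {'+': 2, '-': 2, '*': 2, '/': 2, 'pow': 2,
         'cos': 1, 'sin': 1, 'exp': 1, 'log': 1, 'sqrt': 1}

def is_valid_prefix(tokens):
    needed = 1                            # subexpressions still expected
    for tok in tokens:
        if needed == 0:                   # trailing tokens after a full expression
            return False
        needed += ARITY.get(tok, 0) - 1
    return needed == 0

print(is_valid_prefix(['+', '2', '*', '3', '+', '5', '2']))   # True
print(is_valid_prefix(['+', '2', '*', '3']))                  # False (incomplete)
```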
4.3 EVALUATION
At the end of each epoch, we evaluate the ability of the model to predict the solutions of given
equations. In machine translation, hypotheses given by the model are compared to references written
by human translators, typically with metrics like the BLEU score (Papineni et al., 2002) that measure
the overlap between hypotheses and references. Evaluating the quality of translations is a very
difficult problem, and many studies showed that a better BLEU score does not necessarily correlate
with a better performance according to human evaluation. However, unlike in machine translation,
we can easily verify the correctness of our model by simply comparing generated expressions to their
reference solutions.
For instance, for the given differential equation $xy' - y + x = 0$ with a reference solution $x\log(c / x)$ (where c is a constant), our model may generate $x\log(c) - x\log(x)$. We can check that these
two solutions are equal, although they are written differently, using a symbolic framework like
SymPy (Meurer et al., 2017).
However, our model may also generate $cx - x\log(x)$, which is also a valid solution that is actually equivalent to the previous one for a different choice of the constant c. In that case, we replace y in the
differential equation by the model hypothesis. If xy 0 − y + x = 0, we conclude that the hypothesis is
a valid solution. In the case of integral computation, we can simply differentiate the model hypothesis,
and compare it with the function to integrate. For the three problems, we measure the accuracy of our
model on equations from the test set.
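With SymPy, this verification amounts to substituting the hypothesis into the equation and simplifying (our own sketch; the function name is illustrative):

```python
import sympy as sp

x, c = sp.symbols('x c')
y = sp.Function('y')

def is_ode_solution(ode_lhs, hypothesis):
    """Substitute the hypothesis for y in the left-hand side of `... = 0`
    and check that the residual simplifies to zero."""
    residual = ode_lhs.subs(y(x), hypothesis).doit()
    return sp.simplify(residual) == 0

# The example from the text: x*y' - y + x = 0.
ode = x * sp.Derivative(y(x), x) - y(x) + x
print(is_ode_solution(ode, x * sp.log(c) - x * sp.log(x)))   # True
print(is_ode_solution(ode, c * x - x * sp.log(x)))           # True (other constant)
```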
4.4 RESULTS
In Table 1, we report the accuracy of our models on the three different tasks, on a holdout test set
composed of 5000 equations. We observe that the model is extremely efficient at function integration,
with an accuracy of almost 100%, even with a beam size of 1. Greedy decoding does not work as well
on differential equations. In particular, we observe an improvement in accuracy of almost 40% when
using a large beam size of 50 for second order differential equations. Unlike in machine translation,
where increasing the beam size does not necessarily increase the performance (Ott et al., 2018), in
our case, we always observe significant improvements with wider beams. Typically, using a beam size of 50 provides an improvement of 8% accuracy over a beam size of 10. This makes sense, as increasing the beam size provides more candidate hypotheses, although beam search may still prune a valid hypothesis in favour of invalid ones with higher log-probabilities.
Table 1: Accuracy of our model on the three different tasks. Results are reported on a holdout test set of
5000 equations. Using beam search decoding significantly improves the accuracy of the model.
We compare our model with two popular mathematical frameworks: Mathematica (Wolfram-Research,
2019) and Matlab (MathWorks, 2019).1 Prefix sequences in our test set are converted back to their
infix representations, and given as input to Mathematica. For an input equation, Mathematica either
returns a solution, rewrites the equation in a different way (meaning it was not able to solve it) or
times out after a preset delay. When Mathematica times out, we conclude that it is not able to compute
a solution (although it might have found a solution given more time).
In Table 2, we present accuracy for our model with different beam sizes, and for Mathematica with
timeout delays of 10 and 30 seconds. Because Mathematica times out on a non-negligible number of
equations, the evaluation would take too long to run on our 5000 test equations, and we only evaluate
on a smaller test subset of 500 equations, on which we also re-evaluate our model.
On all tasks, we observe that our model significantly outperforms Mathematica. On function
integration, our model obtains close to 100% accuracy, while Mathematica does not reach 80%. On
first order differential equations, Mathematica is on par with our model when it uses a beam size
of 1, i.e. with greedy decoding. However, using a beam search of size 50 our model accuracy goes
from 81.2% to 97.0%, largely surpassing Mathematica. Similar observations can be made for second
order differential equations, where beam search is even more critical since the number of equivalent
solutions is larger. For Mathematica, the performance naturally increases with the timeout duration.
For each task, using a timeout of 30 seconds improves the test accuracy by about 4% over a timeout
of 10 seconds. Our model, on the other hand, typically finds a solution in less than a second, even
with a large beam size. On average, Matlab has a slightly lower performance than Mathematica on
the problems we tested.
1 All experiments were run with Mathematica 12.0.0.0, and Matlab R2019a.
Table 2: Comparison of our model with Mathematica and Matlab on a test set of 500 equations. For
Mathematica we report results by setting a timeout of 10 or 30 seconds per equation. On a given equation, our
model typically finds the solution in less than a second.
Equation: $y' = \dfrac{16x^3 - 42x^2 + 2x}{\left(-16x^8 + 112x^7 - 204x^6 + 28x^5 - x^4 + 1\right)^{1/2}}$
Solution: $y = \sin^{-1}\!\left(4x^4 - 14x^3 + x^2\right)$

Equation: $4x^4yy'' - 8x^4y'^2 - 8x^3yy' - 3x^3y'' - 8x^2y^2 - 6x^2y' - 3x^2y'' - 9xy' - 3y = 0$
Solution: $y = \dfrac{c_1 + 3x + 3\log(x)}{x\,(c_2 + 4x)}$
Table 3: Examples of problems that our model is able to solve, on which Mathematica and Matlab were not
able to find a solution. For each equation, our model finds a valid solution with greedy decoding.
Table 3 shows examples of functions that our model was able to solve, on which Mathematica and
Matlab did not find a solution. The denominator of the function to integrate, $-16x^8 + 112x^7 - 204x^6 + 28x^5 - x^4 + 1$, can actually be rewritten as $1 - (4x^4 - 14x^3 + x^2)^2$. With the simplified input
$$\frac{16x^3 - 42x^2 + 2x}{\big(1 - (4x^4 - 14x^3 + x^2)^2\big)^{1/2}}$$
integration becomes easy and Mathematica is able to find the solution, but not with the expanded denominator.
When comparing with Matlab and Mathematica, we work from a data set generated for our model and
use their standard differential equation solvers. Different sets of equations, and advanced techniques
for solving them (e.g. transforming them before introducing them in the solver) would probably
result in smaller performance gaps.
An interesting property of our model is that it is able to generate solutions that are exactly equivalent,
but written in different ways. For instance, we consider the following first order differential equation,
along with one of its solutions:
$$162x\log(x)\,y' + 2y^3\log(x)^2 - 81y\log(x) + 81y = 0, \qquad y = \frac{9\sqrt{x}\,\sqrt{\frac{1}{\log(x)}}}{\sqrt{c + 2x}}$$
In Table 4, we report the top 10 hypotheses returned by our model for this equation. We observe that
all generations are actually valid solutions, although they are expressed very differently. They are
however not all equal: merging the square roots within the first and third equations would give the
same expression except that the third one would contain a factor 2 in front of the constant c, but up to
a change of variable, these two solutions are actually equivalent.
The ability of the model to recover equivalent expressions, without having been trained to do so, is
very intriguing. This suggests that some deeper understanding of mathematics has been achieved by
the model.
(Table 4 body: the ten hypotheses returned by the beam, with their normalized log-probability scores; all are equivalent rewritings of the solution above.)
Table 4: Top 10 generations of our model for the first order differential equation $162x\log(x)y' + 2y^3\log(x)^2 - 81y\log(x) + 81y = 0$, generated with a beam search. All hypotheses are valid solutions, and are equivalent up to a change of the variable c. Scores are log-probabilities normalized by sequence lengths.
5 RELATED WORK
Computers have been used for symbolic mathematics since the late 1960s (Moses, 1974). Computer algebra systems (CAS), such as Matlab, Mathematica, Maple, PARI and SAGE, are used for a variety of mathematical tasks (Gathen & Gerhard, 2013). Modern methods for symbolic integration are based on the Risch algorithm (Risch, 1970). Implementations can be found in Bronstein (2005) and Geddes et al. (1992). However, the complete description of the Risch algorithm takes more than 100 pages, and it is not fully implemented in any mathematical framework.
Deep learning networks have been used to simplify tree-like expressions. Zaremba et al. (2014) use recursive neural networks to simplify complex symbolic expressions. They use tree representations for expressions, but provide the model with problem-related information: possible rules for simplification. The neural network is trained to select the best rule. Allamanis et al. (2017) propose
a framework called neural equivalence networks to learn semantic representations of algebraic
expressions. Typically, a model is trained to map different but equivalent expressions (like the 10
expressions proposed in Table 4) to the same representation. However, they only consider Boolean
and polynomial expressions.
Most attempts to use deep networks for mathematics have focused on arithmetic over integers
(sometimes over polynomials with integer coefficients). For instance, Kaiser & Sutskever (2015)
proposed the Neural-GPU architecture, and train networks to perform additions and multiplications
of numbers given in their binary representations. They show that a model trained on numbers
with up to 20 bits can be applied to much larger numbers at test time, while preserving perfect accuracy. Freivalds & Liepins (2017) proposed an improved version of the Neural-GPU by using hard
non-linear activation functions, and a diagonal gating mechanism. Saxton et al. (2019) use LSTMs
(Hochreiter & Schmidhuber, 1997) and transformers on a wide range of problems, from arithmetic to
simplification of formal expressions. However, they only consider polynomial functions, and the task
of differentiation, which is significantly easier than integration. Trask et al. (2018) propose the Neural Arithmetic Logic Unit, a new module designed to learn systematic numerical computation that can be used within any neural network. Like Kaiser & Sutskever (2015), they show that at inference
their model can extrapolate on numbers orders of magnitude larger than the ones seen during training.
6 CONCLUSION
In this paper, we show that standard seq2seq models can be applied to difficult tasks like function integration or solving differential equations. Because there was no existing dataset to train such models, we
propose an approach to generate arbitrarily large datasets of equations, with their associated solutions.
We show that a simple transformer model trained on these datasets can perform extremely well both
at computing function integrals, and solving differential equations, outperforming state-of-the-art
mathematical frameworks like Matlab or Mathematica that rely on a large number of algorithms and
heuristics, and a complex implementation (Risch, 1970). Results also show that the model is able to
write identical expressions in very different ways.
These results are surprising given the incapacity of neural models to perform simpler tasks like integer
addition or multiplication. However, proposed hypotheses are sometimes incorrect, and considering
multiple beam hypotheses is often necessary to obtain a valid solution. The validity of a solution itself
is not provided by the model, but by an external symbolic framework (Meurer et al., 2017). These
results suggest that in the future, standard mathematical frameworks may benefit from integrating
neural components in their solvers.
REFERENCES
Miltiadis Allamanis, Pankajan Chanthirasegaran, Pushmeet Kohli, and Charles Sutton. Learning con-
tinuous semantic representations of symbolic expressions. In Proceedings of the 34th International
Conference on Machine Learning - Volume 70, ICML’17, pp. 80–88. JMLR.org, 2017.
D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and
translate. In International Conference on Learning Representations (ICLR), 2015.
M. Bronstein. Symbolic Integration I: Transcendental Functions. Algorithms and combinatorics.
Springer, 2005. ISBN 978-3-540-21493-9.
Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A Smith. Recurrent neural network
grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, pp. 199–209, 2016.
Akiko Eriguchi, Yoshimasa Tsuruoka, and Kyunghyun Cho. Learning to parse and translate improves
neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short Papers), pp. 72–78, 2017.
Philippe Flajolet and Andrew M. Odlyzko. Singularity analysis of generating functions. SIAM J.
Discrete Math., 3(2):216–240, 1990.
Philippe Flajolet and Robert Sedgewick. Analytic Combinatorics. Cambridge University Press, New
York, NY, USA, 1 edition, 2009. ISBN 0521898064, 9780521898065.
Karlis Freivalds and Renars Liepins. Improving the neural gpu architecture for algorithm learning.
ArXiv, abs/1702.08727, 2017.
Joachim von zur Gathen and Jurgen Gerhard. Modern Computer Algebra. Cambridge University
Press, New York, NY, USA, 3rd edition, 2013. ISBN 1107039037, 9781107039032.
Keith O. Geddes, Stephen R. Czapor, and George Labahn. Algorithms for Computer Algebra. Kluwer
Academic Publishers, Norwell, MA, USA, 1992. ISBN 0-7923-9259-0.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):
1735–1780, 1997.
Lukasz Kaiser and Ilya Sutskever. Neural gpus learn algorithms. CoRR, abs/1511.08228, 2015.
Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
Donald E. Knuth. The Art of Computer Programming, Volume 1 (3rd Ed.): Fundamental Algorithms.
Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 1997. ISBN 0-201-
89683-4.
Philipp Koehn. Pharaoh: a beam search decoder for phrase-based statistical machine translation
models. In Conference of the Association for Machine Translation in the Americas, pp. 115–124.
Springer, 2004.
Sarah Loos, Geoffrey Irving, Christian Szegedy, and Cezary Kaliszyk. Deep network guided proof
search. arXiv preprint arXiv:1701.06972, 2017.
MathWorks. Matlab optimization toolbox (r2019a), 2019. The MathWorks, Natick, MA, USA.
Aaron Meurer, Christopher P. Smith, Mateusz Paprocki, Ondřej Čertı́k, Sergey B. Kirpichev, Matthew
Rocklin, AMiT Kumar, Sergiu Ivanov, Jason K. Moore, Sartaj Singh, Thilina Rathnayake, Sean
Vig, Brian E. Granger, Richard P. Muller, Francesco Bonazzi, Harsh Gupta, Shivam Vats, Fredrik
Johansson, Fabian Pedregosa, Matthew J. Curry, Andy R. Terrel, Štěpán Roučka, Ashutosh
Saboo, Isuru Fernando, Sumith Kulal, Robert Cimrman, and Anthony Scopatz. Sympy: symbolic
computing in python. PeerJ Computer Science, 3:e103, January 2017. ISSN 2376-5992. doi:
10.7717/peerj-cs.103. URL https://doi.org/10.7717/peerj-cs.103.
Joel Moses. Macsyma - the fifth year. SIGSAM Bull., 8(3):105–110, August 1974. ISSN 0163-5824.
Myle Ott, Michael Auli, David Grangier, et al. Analyzing uncertainty in neural machine translation.
In International Conference on Machine Learning, pp. 3953–3962, 2018.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic
evaluation of machine translation. In Proceedings of the 40th annual meeting on association for
computational linguistics, pp. 311–318. Association for Computational Linguistics, 2002.
Robert H. Risch. The solution of the problem of integration in finite terms. Bull. Amer. Math. Soc.,
76(3):605–608, 05 1970.
David E. Rumelhart, James L. McClelland, and CORPORATE PDP Research Group (eds.). Parallel
Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. MIT
Press, Cambridge, MA, USA, 1986. ISBN 0-262-68053-X.
David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli. Analysing mathematical
reasoning abilities of neural models. In International Conference on Learning Representations,
2019.
N. J. A. Sloane. The encyclopedia of integer sequences, 1996.
Richard P. Stanley. Enumerative Combinatorics: Volume 1. Cambridge University Press, New York,
NY, USA, 2nd edition, 2011. ISBN 1107602629, 9781107602625.
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks.
In Advances in Neural Information Processing Systems, pp. 3104–3112, 2014.
Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representations from
tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting
of the Association for Computational Linguistics and the 7th International Joint Conference on
Natural Language Processing (Volume 1: Long Papers), pp. 1556–1566, 2015.
Andrew Trask, Felix Hill, Scott E Reed, Jack Rae, Chris Dyer, and Phil Blunsom. Neural arithmetic
logic units. In Advances in Neural Information Processing Systems, pp. 8035–8044, 2018.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information
Processing Systems, pp. 6000–6010, 2017.
Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. Grammar
as a foreign language. In Advances in neural information processing systems, pp. 2773–2781,
2015.
H.S. Wilf. generatingfunctionology: Third Edition. CRC Press, 2005. ISBN 978-1-4398-6439-5. URL https://www.math.upenn.edu/~wilf/gfologyLinked2.pdf.
We represent mathematical expressions as trees with operators as internal nodes, and numbers,
constants or variables, as leaves. By enumerating nodes in prefix order, we transform trees into
sequences suitable for NLP models.
For this representation to be efficient, we want expressions, trees and sequences to be in one-to-one correspondence. It is plain that different expressions will result in different trees and sequences. For
the reverse to hold, we need to take care of a few special cases.
First, expressions like sums and products may correspond to several trees. For instance, expression
2 + 3 + 5 can be represented as any one of those trees:
(Figure: the three possible trees: a single + node with children 2, 3 and 5; the left-associative tree (2 + 3) + 5; and the right-associative tree 2 + (3 + 5).)
We will assume that all operators have at most two operands, and that, in case of doubt, they are
associative to the right. 2 + 3 + 5 would then correspond to the rightmost tree.
Second, the distinction between internal nodes (operators) and leaves (mathematical primitive objects) is somewhat arbitrary. Should the number −2 be the basic object −2, or a unary minus operator applied to the number 2? What about $\sqrt{5}$, $42X^5$, or the function $\log_{10}$? To make things simple, we only consider numbers, constants and variables as possible leaves, and avoid using a unary minus. In particular, expressions like $-x$ are represented as $-1 * x$.
Here are the trees for $-2$, $\sqrt{5}$, $42X^5$ and $-x$:

(Figure: the four corresponding trees.)
Integers are represented in positional notation, as a sign followed by a sequence of digits (from 0 to
9 in base 10). 2354 and −34 would then be represented as +2 3 5 4 and − 3 4. For zero a unique
representation is chosen (+0 or −0).
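For instance (a small sketch of our own):

```python
# Tokenize an integer in positional base-10 notation: a sign then its digits.
def tokenize_int(n):
    sign = '+' if n >= 0 else '-'
    return [sign] + list(str(abs(n)))

print(tokenize_int(2354))   # ['+', '2', '3', '5', '4']
print(tokenize_int(-34))    # ['-', '3', '4']
print(tokenize_int(0))      # ['+', '0']  (the representation chosen for zero)
```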
All derivations make use of generating functions (Flajolet & Sedgewick, 2009; Wilf, 2005).
We deal first with the simpler binary case ($p_1 = 0$), then proceed to unary-binary trees and expressions. In each case, we calculate a generating function, from which we derive a closed formula or a recurrence to calculate the sequence, and an asymptotic expansion.
The main part of this derivation follows Knuth (1997, pages 388–389).
Let $b_n$ be the number of binary trees with n internal nodes. We have $b_0 = 1$ and $b_1 = 1$. Any binary tree with n internal nodes can be generated by concatenating a left and a right subtree with k and $n - 1 - k$ internal nodes respectively. Summing over all possible values of k,
$$b_n = \sum_{k=0}^{n-1} b_k\, b_{n-1-k}$$
Therefore
$$b_n = \frac{1}{n+1}\binom{2n}{n} = \frac{(2n)!}{(n+1)!\,n!}$$
These are the Catalan numbers.
For expressions, we note that binary trees with n internal nodes have n + 1 leaves, that each node can take one of $p_2$ operators, and each leaf one of L values. For a tree with n internal nodes, this means $p_2^n L^{n+1}$ possible combinations of operators and leaves. And so, the number of binary expressions with n operators is
$$E_n = \frac{(2n)!}{(n+1)!\,n!}\, p_2^n\, L^{n+1}$$
B.1.2 ASYMPTOTICS
To derive an asymptotic approximation of $b_n$, we apply the Stirling formula
$$n! \approx \sqrt{2\pi n}\left(\frac{n}{e}\right)^n$$
to the binomial coefficient
$$\binom{2n}{n} \approx \frac{4^n}{\sqrt{\pi n}}$$
and so
$$b_n \approx \frac{4^n}{n\sqrt{\pi n}}$$
And since binary trees with n internal nodes have n + 1 leaves, we have the following formulas for
the number of expressions with n internal nodes:
$$E_n \approx \frac{1}{n\sqrt{\pi n}}\,(4p_2)^n\, L^{n+1}$$
B.2.2 CALCULATION
Schroeder numbers do not have a simple closed formula, but a recurrence allowing for their calculation
can be derived from their generating function.
Rewriting S(z) as
$$2zS(z) + z - 1 = -\sqrt{1 - 6z + z^2}$$
and differentiating, we have
$$2zS'(z) + 2S(z) + 1 = \frac{3-z}{\sqrt{1-6z+z^2}} = \frac{(3-z)\,(1 - z - 2zS(z))}{1-6z+z^2}$$
$$2zS'(z) + 2S(z)\left(1 + \frac{3z - z^2}{1-6z+z^2}\right) = \frac{(3-z)(1-z)}{1-6z+z^2} - 1$$
$$2zS'(z) + 2S(z)\,\frac{1-3z}{1-6z+z^2} = \frac{2 + 2z}{1-6z+z^2}$$
$$z(1-6z+z^2)S'(z) + (1-3z)S(z) = 1 + z$$
Replacing S(z) and S'(z) with their n-th coefficient yields, for n > 1,
$$n s_n - 6(n-1)s_{n-1} + (n-2)s_{n-2} + s_n - 3s_{n-1} = 0$$
$$(n+1)s_n = 3(2n-1)s_{n-1} - (n-2)s_{n-2}$$
Together with s0 = 1 and s1 = 2, this allows for fast (O(n)) calculation of Schroeder numbers.
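For instance (our own snippet), the recurrence translates into a linear-time computation:

```python
# O(n) computation of the large Schroeder numbers from the recurrence
# (n + 1) s_n = 3(2n - 1) s_{n-1} - (n - 2) s_{n-2}, with s_0 = 1, s_1 = 2.
def schroeder_numbers(n_max):
    s = [1, 2]
    for n in range(2, n_max + 1):
        s.append((3 * (2 * n - 1) * s[n - 1] - (n - 2) * s[n - 2]) // (n + 1))
    return s[:n_max + 1]

print(schroeder_numbers(6))   # [1, 2, 6, 22, 90, 394, 1806]
```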
B.2.3 ASYMPTOTICS
To derive an asymptotic formula of sn , we develop the generating function around its smallest
singularity (Flajolet & Odlyzko, 1990), i.e. the radius of convergence of the power series.
Since
$$1 - 6z + z^2 = \left(1 - (3-\sqrt{8})z\right)\left(1 - (3+\sqrt{8})z\right)$$
the smallest singularity is
$$r_1 = \frac{1}{3+\sqrt{8}}$$
and the asymptotic formula will have the exponential term
$$r_1^{-n} = (3+\sqrt{8})^n = (1+\sqrt{2})^{2n}$$
In a neighborhood of $r_1$, the generating function can be rewritten as
$$S(z) \approx (1+\sqrt{2})\left(1 - 2^{1/4}\sqrt{1-(3+\sqrt{8})z}\right) + O\left(\big(1-(3+\sqrt{8})z\big)^{3/2}\right)$$
Since
$$[z^n]\sqrt{1-az} \approx -\frac{a^n}{\sqrt{4\pi n^3}}$$
denoting by $[z^n]F(z)$ the n-th coefficient in the formal series of F,
$$s_n \approx \frac{(1+\sqrt{2})(3+\sqrt{8})^n}{2^{3/4}\sqrt{\pi n^3}} = \frac{(1+\sqrt{2})^{2n+1}}{2^{3/4}\sqrt{\pi n^3}}$$
Comparing with the number of binary trees, we have
$$s_n \approx 1.44\,(1.46)^n\, b_n$$
In the binary case, the number of expressions can be derived from the number of trees. This cannot be done in the unary-binary case, because the number of leaves in a tree with n internal nodes depends on the number of binary operators only ($n_2 + 1$ leaves for $n_2$ binary operators).
If $b_q$ denotes the q-th Catalan number, the number of trees with $n_2$ binary operators among n internal nodes is
$$\binom{n+n_2}{2n_2}\, b_{n_2}$$
Since such trees have $n_2 + 1$ leaves, with L leaves, $p_2$ binary and $p_1$ unary operators to choose from, the number of expressions is
$$E(n, n_2) = \binom{n+n_2}{2n_2}\, b_{n_2}\, p_2^{n_2}\, p_1^{\,n-n_2}\, L^{n_2+1}$$
Summing over all values of $n_2$ (from 0 to n) yields the number of different expressions
$$E_n = \sum_{n_2=0}^{n} \binom{n+n_2}{2n_2}\, b_{n_2}\, p_2^{n_2}\, p_1^{\,n-n_2}\, L^{n_2+1}$$
Let $E(z) = \sum_n E_n z^n$ be the corresponding generating function. Since $\binom{n+n_2}{2n_2} = 0$ when $n_2 > n$,
$$E(z) = L \sum_{n_2=0}^{\infty} \left(\frac{Lp_2}{p_1}\right)^{n_2} b_{n_2} \sum_{n=0}^{\infty} \binom{n+n_2}{2n_2} (p_1 z)^n$$
$$= L \sum_{n_2=0}^{\infty} \left(\frac{Lp_2}{p_1}\right)^{n_2} b_{n_2} \sum_{n=0}^{\infty} \binom{n+2n_2}{2n_2} (p_1 z)^{n+n_2}$$
$$= L \sum_{n_2=0}^{\infty} b_{n_2} (Lp_2 z)^{n_2} \sum_{n=0}^{\infty} \binom{n+2n_2}{2n_2} (p_1 z)^n$$
Reducing, we have
$$E(z) = \frac{1 - p_1 z - \sqrt{1 - 2(p_1 + 2Lp_2)z + p_1 z^2}}{2p_2 z}$$
B.3.2 CALCULATION
As before, there is no closed simple formula for $E_n$, but we can derive a recurrence formula by differentiating the generating function, rewritten as
$$2p_2 z E(z) + p_1 z - 1 = -\sqrt{1 - 2(p_1 + 2p_2 L)z + p_1 z^2}$$
$$2p_2 z E'(z) + 2p_2 E(z) + p_1 = \frac{p_1 + 2p_2 L - p_1 z}{\sqrt{1 - 2(p_1 + 2p_2 L)z + p_1 z^2}}$$
$$2p_2 z E'(z) + 2p_2 E(z) + p_1 = \frac{(p_1 + 2p_2 L - p_1 z)(1 - p_1 z - 2p_2 z E(z))}{1 - 2(p_1 + 2p_2 L)z + p_1 z^2}$$
$$2p_2 z E'(z) + 2p_2 E(z)\left(1 + \frac{z(p_1 + 2p_2 L - p_1 z)}{1 - 2(p_1 + 2p_2 L)z + p_1 z^2}\right) = \frac{(p_1 + 2p_2 L - p_1 z)(1 - p_1 z)}{1 - 2(p_1 + 2p_2 L)z + p_1 z^2} - p_1$$
$$2p_2 z E'(z) + 2p_2 E(z)\,\frac{1 - (p_1 + 2p_2 L)z}{1 - 2(p_1 + 2p_2 L)z + p_1 z^2} = \frac{2p_2 L(1 + p_1 z) + p_1(p_1 - 1)z}{1 - 2(p_1 + 2p_2 L)z + p_1 z^2}$$
$$2p_2 z E'(z)\big(1 - 2(p_1 + 2p_2 L)z + p_1 z^2\big) + 2p_2 E(z)\big(1 - (p_1 + 2p_2 L)z\big) = 2p_2 L(1 + p_1 z) + p_1(p_1 - 1)z$$
Replacing E(z) and E'(z) with their coefficients,
$$2p_2\big(nE_n - 2(p_1 + 2p_2 L)(n-1)E_{n-1} + p_1(n-2)E_{n-2}\big) + 2p_2\big(E_n - (p_1 + 2p_2 L)E_{n-1}\big) = 0$$
$$(n+1)E_n - (p_1 + 2p_2 L)(2n-1)E_{n-1} + p_1(n-2)E_{n-2} = 0$$
B.3.3 ASYMPTOTICS
As before, approximations of $E_n$ for large n can be found by developing E(z) in the neighbourhood of the root with the smallest modulus of
$$1 - 2(p_1 + 2p_2 L)z + p_1 z^2$$
The roots are
$$r_1 = \frac{p_1}{p_1 + 2p_2L - \sqrt{p_1^2 + 4p_2^2L^2 + 4p_2p_1L - p_1}}, \qquad r_2 = \frac{p_1}{p_1 + 2p_2L + \sqrt{p_1^2 + 4p_2^2L^2 + 4p_2p_1L - p_1}}$$
Both are positive and the smaller one is $r_2$. To alleviate notation, let
$$\delta = \sqrt{p_1^2 + 4p_2^2L^2 + 4p_2p_1L - p_1}, \qquad r_2 = \frac{p_1}{p_1 + 2p_2L + \delta}$$
Developing E(z) near $r_2$,
$$E(z) \approx \frac{1 - p_1 r_2 - \sqrt{1 - r_2\left(\frac{p_1+2p_2L-\delta}{p_1}\right)}\,\sqrt{1-\frac{z}{r_2}}}{2p_2 r_2} + O\left(\left(1-\frac{z}{r_2}\right)^{3/2}\right)$$
$$E(z) \approx \frac{p_1 + 2p_2L + \delta - p_1^2 - \sqrt{p_1+2p_2L+\delta}\,\sqrt{2\delta}\,\sqrt{1-\frac{z}{r_2}}}{2p_2 p_1} + O\left(\left(1-\frac{z}{r_2}\right)^{3/2}\right)$$
and therefore
$$E_n \approx \frac{\sqrt{\delta}\; r_2^{-n-\frac{1}{2}}}{2p_2\sqrt{2\pi p_1 n^3}} = \frac{\sqrt{\delta}\,(p_1+2p_2L+\delta)^{n+\frac{1}{2}}}{2p_2\sqrt{2\pi n^3}\; p_1^{\,n+1}}$$
To generate random binary trees with n internal nodes, we use the following one-pass procedure. At all steps, e denotes the number of empty nodes, and n the number of operators still to be generated.
To make this work, we need to calculate, for all e and n, the probability distribution K(e, n) of the position of the next internal node among the e empty nodes.
Let D(p, n) be the number of different binary subtrees that can be generated from p empty elements, with n internal nodes to generate. Denoting by $b_n$ the Catalan numbers, we have
$$D(0, n) = 0, \qquad D(1, n) = b_n, \qquad D(2, n) = b_{n+1}, \qquad D(p, n) = D(p-1, n+1) - D(p-2, n+1)$$
To derive the recurrence, consider the case where n + 1 internal nodes remain to be generated and p − 1 empty nodes are left (p > 1). The first empty node is either an internal node or a leaf. If it is an internal node, it has two empty children, and D(p, n) different subtrees can then be generated. If it is a leaf, D(p − 2, n + 1) subtrees remain. As a result,
$$D(p-1, n+1) = D(p, n) + D(p-2, n+1)$$
This recurrence formula allows us to calculate D(p, n) for all interesting values of p and n.
From e empty nodes, D(e, n) subtrees can be generated. Consider the first empty node: it is a leaf in D(e − 1, n) of these trees, and an internal node in D(e, n) − D(e − 1, n) of them. Consider now the case where the first node is a leaf: there are D(e − 1, n) such trees; D(e − 2, n) of them have a leaf as their second node, and D(e − 1, n) − D(e − 2, n) have an internal node. This provides us with a formula for the probability distribution K(e, n): for k between 1 and e,
$$P\big(K(e, n) = k\big) = \frac{D(e-k+1, n) - D(e-k, n)}{D(e, n)}$$
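A sketch of this sampling procedure is given below (our own code; D(p, n) is computed here by an equivalent direct recursion, where the first empty slot is either a leaf or an internal node, rather than by the difference formula):

```python
import random
from functools import lru_cache

@lru_cache(maxsize=None)
def D(p, n):
    """Number of binary trees that can be built from p empty slots using
    exactly n internal nodes (D(1, n) equals the n-th Catalan number)."""
    if p == 0:
        return 1 if n == 0 else 0
    if n == 0:
        return 1                      # every remaining slot becomes a leaf
    return D(p - 1, n) + D(p + 1, n - 1)

def sample_internal_positions(n):
    """Sample, for each of the n internal nodes in generation order, the index
    k of the empty slot it occupies, following the K(e, n) distribution."""
    e, positions = 1, []
    while n > 0:
        # Placing the node on slot k leaves e - k + 2 empty slots and n - 1
        # nodes, so P(K(e, n) = k) is proportional to D(e - k + 2, n - 1).
        weights = [D(e - k + 2, n - 1) for k in range(1, e + 1)]
        k = random.choices(range(1, e + 1), weights=weights)[0]
        positions.append(k)
        e = e - k + 2                 # k - 1 slots become leaves, slot k spawns 2
        n -= 1
    return positions

print(sample_internal_positions(5))   # e.g. [1, 2, 1, 3, 2]
```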
For unary-binary trees, we use an extended version of the previous algorithm, which decides at step two whether the next operator is unary or binary, and uses a different distribution L(e, n) in the unary case.
As before, to calculate K and L, we need to derive D(p, n), the number of subtrees that can be
generated from p empty nodes, with n internal nodes to generate, for unary-binary trees. The
following three equations hold.
D(p, 1) = 2p
D(1, n) = D(1, n − 1) + D(2, n − 1)
D(p, n) = D(p − 1, n) + D(p, n − 1) + D(p + 1, n − 1)
The first one states that, since we can build two one-operator trees from one empty element (one unary, one binary), we can build 2p from p empty elements. The second states that if n > 0 operators remain to be allocated, a single empty element must accommodate either a unary operator (with D(1, n − 1) possible subtrees) or a binary operator (with D(2, n − 1) possible subtrees). The last one is the general case, for p > 1: the first empty node is either a leaf, a unary operator, or a binary operator.
These three equations allow us to tabulate D(p, n) for any value of n and p. For K(p, n) and L(p, n), we reason as follows. Starting with p empty nodes and a unary operator to allocate, with n operators remaining, we have D(p, n − 1) trees with the operator on the first empty node, D(p − 1, n − 1) with the operator on the second, and D(p − k + 1, n − 1) with the operator on the k-th node. Overall,
$$P\big(L(p, n) = k\big) = \frac{D(p-k+1, n-1)}{\sum_{i=0}^{p-1} D(p-i, n-1)}$$
The generating method described at the end of Section 3.3 generates an equation from a possible solution $y(x, c_1, c_2)$ by first inverting y with respect to $c_2$, yielding $F(x, y, c_1)$ such that $F(x, y, c_1) = c_2$. Differentiating F produces a first order differential equation that we need to solve in $c_1$. The resulting equation $G(x, y, y') = c_1$ is differentiated to provide a second order equation solved by y.
To guarantee that y can be inverted in $c_2$, we generate it with only one $c_2$ leaf. This does not guarantee that the differential equation derived from F can be solved in $c_1$, and a first version of our generating method relies on trial and error to generate suitable $y(x, c_1, c_2)$.
We present an improved method for generating y, which guarantees solvability in $c_1$ and $c_2$, at the price of additional constraints on the solutions generated.
To guarantee that $y(x, c_1, c_2)$ is solvable in $c_2$, we sample a unique $c_2$ leaf in the tree representation of y. Under which additional conditions will $y'F'_y + F'_x = 0$ (with $F(x, y, c_1) = c_2$) be solvable in $c_1$? It will be the case if both $F'_x$ and $F'_y$ are linear in $c_1$, or in a unique function of $c_1$ only; that is, if there is a $k(c_1)$ such that
$$F'_x = a(x, y)\,k(c_1) + b(x, y), \qquad F'_y = c(x, y)\,k(c_1) + d(x, y)$$
(Figure: the four admissible subtree patterns, built from M(x), K(c1), and a constant C.)
This results in the following method for generating solvable second order differential equations from their solutions.
1. Generate two random functions M(x) and K(c1).
2. Select one of the four subtrees, and build it from M(x) and K(c1).
3. Generate a random function A(x).
4. Select a node of A(x) at random and replace it with the subtree; this yields y(x, c1, c2).
5. Solve in c2 to get F(x, y, c1) = c2.
6. Differentiate, and solve in c1 the resulting first order equation.
7. Differentiate again to get a second order equation solved by y(x, c1, c2).
(Figure: the initial expression tree for y(x, C), containing a single C leaf.)
If the root of the left tree is unary, remove it, and add its inverse at the root of the right tree.
(Figure: the left and right trees after this step.)
If the root of the left tree is binary, only one of its subtrees contains C; move the other subtree to the right tree by inverting the top node.
(Figure: the left and right trees after this step.)
Continue until only C remains in the left tree.
(Figure: the left and right trees after this step.)
The right tree is the function F (x, y) sought.
(Figure: the final trees, with only C in the left tree and F(x, y) in the right tree.)
The number of steps in this algorithm is equal to the depth of C in the tree representing y(x, C).
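The same inversion can also be checked with a computer algebra system; the sketch below (our own, with an arbitrary illustrative expression) recovers F(x, y) from y(x, C) with a single call to a solver, which the tree-walking procedure above avoids:

```python
import sympy as sp

x, y, C = sp.symbols('x y C')

# An expression y(x, C) with a single C leaf, every operator on the path
# from the root to C being invertible (the expression itself is arbitrary).
y_of_x_C = sp.exp(x**2 + C * (x + 3))

# Solve y(x, C) = y in C to obtain F(x, y) such that F(x, y(x, C)) = C.
F = sp.solve(sp.Eq(y_of_x_C, y), C)[0]
print(F)    # (-x**2 + log(y))/(x + 3)
```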