
Semantic Parsing Natural Language into Relational Algebra

2021, arXiv (arXiv:2106.13858v1 [cs.CL], 25 Jun 2021)

The past decade has witnessed the emergence of big data and data science, and people have created vast amounts of data. Relational databases still serve as a major storage and management solution for these data, which means SQL remains the most commonly used language for querying them. However, anyone who wants to work with a relational database still needs to invest a considerable amount of time in learning SQL, and SQL itself, with the flexibility and complexity of a programming language, is not easy to master. Fortunately, recent progress in Natural Language Processing (NLP) opens a new door to this problem. NLP gives the machine the ability to understand human language and to semantically parse it into a SQL query. With this technology, people who want to use a relational database no longer need to learn SQL; instead, they can simply tell the machine what they want and let the machine generate the corresponding SQL query.

Neural Semantic Parsing Natural Language into SQL
Ruiyang Xu, Ayush Singh
{xu.r, singh.ay}@husky.neu.edu
CS6120 NLP Fall 2017, Northeastern University

1 Introduction

Natural language interfaces to databases (NLIDB) have been researched extensively over the past decades. At the core of an NLIDB (Popescu et al., 2003) is a semantic parser used to convert natural language into SQL: for instance, given the schema information of a database, the parser must map a user's question to a SQL query over that schema. Solutions from traditional NLP methodology focus on learning grammar rule patterns (Wang et al., 2017) and on parsing via intermediate logic forms (Kate et al., 2005). Although those methods give acceptable performance on certain specific databases and parsing tasks, they are hard to generalize and scale. On the other hand, recent progress in neural deep learning provides a promising direction towards building a general NLIDB system. Unlike the traditional approaches, these neural methods treat the parsing problem as a sequence-to-sequence learning problem. In this project, we experimented with several sequence-to-sequence learning models and evaluated their performance on a general database parsing task.

2 Related Work

This project was motivated by recent research on Seq2SQL from Salesforce (Zhong et al., 2017). Their paper describes an attention pointer network model enhanced with reinforcement learning, which reaches an accuracy of 59%. However, as they note, the reinforcement learning increases accuracy by only 2%. Earlier work (Vinyals et al., 2015) that uses only a pointer network, but with a tree-like decoder instead of a sequence decoder, already achieves an accuracy of 36%, and we have reason to believe that Seq2SQL also used a tree decoder.

The pointer network (Vinyals et al., 2015) has gained a lot of attention in recent years. It is a location-based attention mechanism that places attention on positions in the input sequence: it composes an output from components of the input by predicting pointers to input indices. This structure has even been shown to solve certain combinatorial optimization problems of bounded size. As its authors note, this mechanism is a modification of the content-based attention mechanism (Bahdanau et al., 2014), which places attention on the contents of the input sequence. Unlike the pointer network, content-based attention extracts input-output correlation information and uses it to assist decoder inference, while still composing its output from an output vocabulary. Underlying both attention mechanisms is a vanilla sequence-to-sequence model (Sutskever et al., 2014), also known as the encoder-decoder model (Cho et al., 2014).

Considering the complexity of tree decoders and the time limitations of this project, we implemented a sequence decoder instead. Since Salesforce kept their implementation closed source, we had to implement our own pointer decoder (at the time, TensorFlow did not yet include a pointer network), and we note that pointer-network-based sequence-to-sequence models have rarely been studied for semantic parsing. We therefore built the whole end-to-end system on our own with TensorFlow. To compare performance, we built three sequence-to-sequence models: a vanilla seq2seq, a seq2seq with content-based attention, and a seq2seq with a pointer network.
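To make the distinction above concrete, the following sketch (our illustration, not code from either paper) computes additive alignment scores in the style of Bahdanau et al. (2014). Content-based attention turns the scores into a context vector that later feeds a prediction over the output vocabulary, whereas a pointer decoder uses the normalized scores directly as a distribution over input positions. All names and dimensions here are hypothetical.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def additive_scores(enc_states, dec_state, W1, W2, v):
        # e_i = v^T tanh(W1 h_i + W2 s_t): Bahdanau-style alignment scores
        return np.array([v @ np.tanh(W1 @ h + W2 @ dec_state) for h in enc_states])

    rng = np.random.default_rng(0)
    T, d = 6, 8                                # input length, hidden size
    enc = rng.normal(size=(T, d))              # encoder hidden states h_1..h_T
    s = rng.normal(size=d)                     # current decoder state s_t
    W1, W2, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)

    scores = additive_scores(enc, s, W1, W2, v)

    # Content-based attention: the weights build a context vector, which is
    # then projected onto the *output vocabulary* to predict the next token.
    context = softmax(scores) @ enc

    # Pointer network: the normalized scores *are* the output distribution,
    # i.e. a pointer to a position in the input sequence.
    print("pointed-to input position:", int(np.argmax(softmax(scores))))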
To mitigate the negative effect of padding symbols on the attention mechanism, we introduced an emphasizing strategy (described later). SQLizer (Yaghmazadeh et al., 2017) was our primary inspiration: it semantically parses the English description into an intermediate logic form, a query sketch (essentially relational algebra with holes). This sketching technique allows the authors to use technology from the program synthesis community (type-theory-guided sketch completion). However, the authors have not discussed the shortcomings and regression performance of SQLizer on medium to large datasets with complex queries; it was only tested on a small handcrafted dataset, so the scalability of their method remains questionable.

3 Dataset

We are using the WikiSQL dataset (Zhong et al., 2017), for the following reasons:

1. It is a large-scale dataset (80,654 records with 24,241 schemas) suitable for effectively training neural networks.
2. The natural language questions were crowd-sourced from human annotators, which helps avoid overfitting to template-synthesized descriptions.
3. The task is to synthesize the SQL query from only the natural language question and the table schema, without relying on the table's content. This helps mitigate the scalability and privacy issues that alternative approaches (Yaghmazadeh et al., 2017) may suffer in realistic application scenarios involving large-scale, sensitive user data.
4. The data is split so that the training (70%), dev (10%), and test (20%) sets do not share tables, which helps evaluate an approach's capability to generalize to unseen schemas.

Note that the WikiSQL task considers synthesizing a SQL query with respect to only one table. Thus, in an output SQL query, only the SELECT clause and the WHERE clause need to be predicted, and the FROM clause can be omitted.

4 Methodology

4.1 Preprocessing

The original dataset is serialized for storage, so we first deserialized it to obtain our training, dev, and test data. For the vanilla seq2seq model this preprocessing is sufficient. For the attention models, however, we augment the input sequence to provide more contextual information to the attention mechanism:

Original: What is the description of a ch-47d-chinook
Columns: Aircraft/VARCHAR, Description/VARCHAR, Max Gross Weight/INT, Total disk area/INT, Max disk Loading/INT
Seq2SQL augmented: <col> ; Aircraft Description Max-Gross-Weight Total-disk-area Max-disk-Loading ; <sql> ; SELECT FROM WHERE COUNT MIN MAX AVG SUM AND = ; <question> ; What is the description of a ch-47d-chinook
Our augmented input: SELECT FROM WHERE COUNT MIN MAX AVG SUM AND = ; Aircraft Description Max-Gross-Weight Total-disk-area Max-disk-Loading ; What is the description of a ch-47d-chinook

The sentinel (;) tokens are only for descriptive purposes, to demarcate boundaries in our data; unlike in Seq2SQL, they are removed in the actual implementation. In the Seq2SQL augmentation, the input comprises all column names, followed by the SQL keywords, and finally the natural language question. Bringing the SQL keywords to the front instead helps the pointer network point to SQL positions easily, since in the Seq2SQL-augmented data those positions shift across schemas due to variable column names.
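As a concrete illustration of the augmentation just described, the following sketch builds our augmented input string from a question and its column names. The helper name augment and the exact separator handling are our assumptions, not the original preprocessing code.

    SQL_KEYWORDS = ["SELECT", "FROM", "WHERE", "COUNT", "MIN", "MAX", "AVG", "SUM", "AND", "="]

    def augment(question, columns):
        """Build the augmented input: SQL keywords first, then hyphenated
        column names, then the natural language question, separated by ';'."""
        cols = [c.replace(" ", "-") for c in columns]   # multi-word columns become single tokens
        return " ".join(SQL_KEYWORDS) + " ; " + " ".join(cols) + " ; " + question

    print(augment(
        "What is the description of a ch-47d-chinook",
        ["Aircraft", "Description", "Max Gross Weight", "Total disk area", "Max disk Loading"]))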
4.2 Vanilla Seq2Seq Model

Figure 1: Distribution of question types in WikiSQL.

We first removed all punctuation from the input data, but then realized this disturbs the column names and makes evaluation tricky, so we only removed the question mark at the end of sentences. Further, we replaced multi-word column names with single hyphenated tokens and aligned column names appearing in both questions and tables, so that the neural network does not treat them as two different objects. The final vocabulary sizes for questions and SQL are 57,625 and 44,554, respectively.

Our vanilla seq2seq model has a sequential encoder and decoder. The encoder is a 2-layer dynamic LSTM network; the hidden state of its last layer is used as the initial state of the decoder. The decoder is also a 2-layer dynamic LSTM network, with a dense projection layer producing the final prediction. We applied word embeddings of size 300 to both encoder and decoder inputs. We used weighted cross-entropy as our loss function and trained with the Adam optimizer with gradient clipping (learning rate 0.01).

4.3 Reversed Seq2Seq Model

Sutskever et al. (2014) found that reversing the source sequence while leaving the target intact reduces the distance between corresponding words, which improves memory on long sentences and reduces computation time. In our experiments, we did not find the reversing technique to be superior, and we would like to point out that its benefit is case specific.

4.4 Bidirectional Seq2Seq Model

Figure 2: Distribution of question, query, and table in WikiSQL.

A traditional LSTM is unidirectional, meaning each position only takes the preceding words into context. Adding a parallel layer that processes the sequence in the opposite direction and merging/concatenating/averaging the two layers allows the model to take the neighboring words on both sides into context, and this has been found to outperform unidirectional LSTMs.

4.5 Attention Seq2Seq Model

We modified our vanilla seq2seq model to add the Bahdanau attention mechanism (Bahdanau et al., 2014) with bidirectional LSTMs. Apart from using the augmented dataset, the other hyperparameters were kept intact.

4.6 Pointer Attention Seq2Seq Model

Since TensorFlow did not provide an attention wrapper for pointer networks, we had to modify the existing Bahdanau attention mechanism in the TensorFlow kernel code to turn it into a pointer network. This modification affected our previous architecture considerably. First, after a long period of trial and error, we found it impossible to implement the pointer network with dynamic LSTMs, so we changed our design to static LSTMs with a fixed cell size, which means all input data must be padded to a fixed length. A second problem caused by this change is that the encoder and decoder must have the same cell size. This is more serious than the first problem because, on average, output sequences are much shorter than input sequences. We also noticed that padding dramatically distorts the probability distribution once the attention mechanism is used. To mitigate this effect, instead of padding with a single symbol, we replicate the input sequence until it reaches the specified cell size. We call this strategy emphasizing.
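A minimal sketch of the emphasizing strategy, under the assumption that sequences are lists of tokens and that the cell size is fixed in advance; emphasize and the <pad> fallback symbol are hypothetical names of ours.

    def emphasize(tokens, cell_size):
        """Pad a sequence to a fixed cell size by repeating the sequence
        itself rather than appending a padding symbol ("emphasizing")."""
        if not tokens:
            return ["<pad>"] * cell_size   # fallback for empty input
        out = []
        while len(out) < cell_size:
            out.extend(tokens)
        return out[:cell_size]

    print(emphasize(["select", "max-gross-weight", "from", "where", "aircraft=robinson-r-22"], 12))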
5 Experiments

5.1 Dataset and Metric

We trained our models on a training set of 60,000 examples and tested them on a test set of 9,145 examples. However, we later noticed that it is very difficult for a sequential LSTM to memorize and learn long sequences, and exact-match accuracy was negligible for all three models. We therefore changed our metric to component-wise match, where we count how many positions of an output sentence are correctly predicted (in the case of pointers) and use a bag-of-words match to compute average accuracy over the whole sequence.

Table 1: Train and validation accuracy (%) of the seq2seq models.

Model               Train    Validation
Vanilla             74.43    50.66
Reversed            79.37    49.96
Bidirectional       81.28    56.29
Attention           100      48.27
Pointer Attention   17.61    14.12

Figure 3: Accuracy and loss measurements for the various models.

5.2 Training Speed and Performance

In terms of training speed, the Bahdanau attention model converged very fast: its training loss kept dropping quickly toward 0. However, this is also an indication of overfitting. Evaluation on the test set confirmed this insight: it performs badly on the test data and maps everything onto a few learned output sequences. The vanilla and reversed seq2seq models run second fastest, but compared to the Bahdanau model they converge much more slowly: the training loss drops more and more slowly and eventually stops dropping. The pointer network runs the slowest and took a long time to train. Because of the design difficulties mentioned previously, it is highly sensitive to data length and consistency, and it does not perform well on the whole training set. After some analysis, we noticed that the pointer network tries to balance its output over all data points, so it ends up outputting the most frequent symbols across the dataset. This indicates that batch size matters. We then tried decreasing the batch size and observed faster convergence per step, but overall training took longer. To see the true power of this new architecture, we trained it on a much smaller toy set with a few data points; even on this small dataset it needed a long time to figure out the position of each output word, but it did eventually learn the correct pointer distribution.

Lastly, we note that deep learning is compute intensive, so we kept our hyperparameters in a modest range and trained everything on an NVIDIA GeForce GTX 1080Ti with TensorFlow 1.3, CUDA 8.0, cuDNN 6.1, and Python 3.6:

1. 2 layers of LSTMs with 200 hidden units, initialized from a uniform distribution between -0.1 and 1.0 with seed 2.
2. A fixed number of 300 epochs to make sure the networks plateau; we avoided early stopping since gradient descent only guarantees a local minimum and can get stuck. We had to stop at epoch 150 for Bahdanau attention to avoid overtraining.
3. Embedding size 300, which is enough to learn a dense representation of our vocabulary; embeddings are allowed to be re-trained in each epoch.
4. A fixed batch size of 128.
5. Adam optimizer with a learning rate of 0.01, reduced to 0.001 for the pointer network as it was otherwise unable to converge.
6. A greedy decoder (GreedyEmbeddingHelper), which takes the argmax of the output logits and passes the result through an embedding layer to obtain the next input.
7. A sequence loss, which is essentially a weighted cross-entropy loss over a sequence of logits.
8. Although LSTMs tend not to suffer from vanishing gradients, they can have exploding gradients; to address this we clipped gradients to (-5.0, 5.0).
9. Different sentences have different lengths. Most sentences are short (e.g., length 20-30) but some are long (e.g., length >100), so a minibatch of 128 randomly chosen training sentences contains many short sentences and few long ones, and much of the computation in the minibatch is wasted. To address this, we made sure that all sentences within a minibatch were roughly of the same length, which gave a 2x speedup (see the sketch after this list).
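A minimal sketch of the length-bucketing idea in item 9, assuming each example is a token list; the helper name and batching details are our own illustration, not the original implementation.

    import random

    def length_bucketed_batches(examples, batch_size=128, seed=2):
        """Group examples of roughly equal length into minibatches so that
        little computation is wasted on padding, then shuffle the batch order."""
        rng = random.Random(seed)
        data = sorted(examples, key=len)          # similar-length examples end up adjacent
        batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
        rng.shuffle(batches)
        return batches

    # Example: sentences of length 3 and length 40 mostly end up in different batches.
    toy = [["a"] * 3] * 200 + [["b"] * 40] * 200
    print([len(batch[0]) for batch in length_bucketed_batches(toy)])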
5.3 Sample cases

Below are predictions by the models on the dev split. Q, T, and P denote the natural language question, the ground-truth query, and the prediction, respectively. The FROM clause has been removed for succinctness but is predicted in its entirety by our models.

Vanilla seq2seq model, after 10 epochs:

Q: How many schools did player number 3 play at? (learns aggregation keyword placement and the importance of numbers)
T: SELECT COUNT school/club team WHERE no. = 3
P: SELECT COUNT played WHERE played = 8

Q: what's the division with league usl first divofion (syntactic learning: typo "divofion" vs. "division")
T: SELECT division WHERE league = usl first division
P: SELECT division WHERE first elected = 1885

Q: what's the u.s. open cup status for regular season of 4th, atlantic division (learns sequential dependencies and year/season concepts)
T: SELECT u.s. open cup WHERE regular season = 4th, atlantic division
P: SELECT last division WHERE year = 2002 AND regular season = 1994/95

Vanilla seq2seq model, after 300 epochs:

Q: what was the score of the away team while playing at the arden street oval?
T: SELECT away team score WHERE venue = arden street oval
P: SELECT away team score WHERE venue = arden street oval

Q: what was the largest crowd where the home team was fitzroy? (aggregation clause prediction)
T: SELECT MAX crowd WHERE home team = fitzroy
P: SELECT MAX crowd WHERE home team = fitzroy

Q: which catalog was formated as a cd under the label alfa records? (multiple WHERE clause prediction)
T: SELECT catalog WHERE label = alfa records AND format = cd
P: SELECT catalog WHERE label = alfa records AND format = cd

Reversed-sequence vanilla seq2seq model:

Q: name the location attendance for january 18
T: SELECT catalog WHERE label = alfa records AND format = cd
P: SELECT catalog WHERE label = alfa records AND format = cd

Q: what was the date of the game when the away team was south melbourne?
T: SELECT date WHERE away team = south melbourne
P: SELECT date WHERE away team = south melbourne

Pointer network model:

As mentioned above, our pointer network is very hard to train on a large dataset: it tries to balance all possible correct pointers, and eventually assigns pointers to frequent common words and gets stuck in a local optimum. Since we were modifying the TensorFlow kernel code, we ran it on a toy set with only a few data points to verify that our design was correct. It still converged very slowly, but it eventually figured out the correct position of each component (see Figure 4).

Bidirectional seq2seq model:

Q: what scores happened on february 9?
T: SELECT score WHERE date = february 9
P: SELECT score WHERE date = february 9

Q: when the away team is south melbourne, what's the home team score? (multiple-query resolution: the question has two question words, "when" and "what")
T: SELECT home team score WHERE away team = south melbourne
P: SELECT home team score WHERE away team = south melbourne

Bahdanau attention seq2seq model:

T: SELECT position WHERE school/club team = butler-cc-(ks)
P: SELECT position WHERE position = colin-actress

T: SELECT school/club team WHERE no. = 21
P: SELECT count # WHERE position = table AND position = 13 kong

T: SELECT count position WHERE years in toronto = 2006-07
P: SELECT min no. of table WHERE date = 13

T: SELECT school/club team WHERE player = amir johnson
P: SELECT actors name WHERE winning team = newman/haas the love = 13 love = 13 kong

The above outputs are totally irrelevant to the questions. We observe that the attention mechanism is sensitive to overtraining, in that it learns to pay attention to highly frequent words in the corpus, such as "position" (many CFL draft tables) and SQL keywords ("table"). Still, these neural networks did learn some weak and strong concepts automatically, such as the usage of aggregation operators and the notions of dates and numbers. This is surprising, since we did no semantic similarity processing or feature engineering, and the embeddings were not even pre-trained.
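For reference, the component-wise (positional) and bag-of-words accuracies used to score pairs like those above can be sketched as follows. These are illustrative interpretations of the metric described in Section 5.1, with hypothetical helper names, not our original evaluation code.

    from collections import Counter

    def component_accuracy(truth, pred):
        """Fraction of positions where the predicted token matches the
        ground-truth token at the same position."""
        n = max(len(truth), len(pred))
        return sum(t == p for t, p in zip(truth, pred)) / n if n else 1.0

    def bag_of_words_accuracy(truth, pred):
        """Fraction of ground-truth tokens recovered anywhere in the
        prediction, ignoring word order."""
        t, p = Counter(truth), Counter(pred)
        return sum(min(c, p[w]) for w, c in t.items()) / sum(t.values()) if t else 1.0

    truth = "SELECT MAX crowd WHERE home team = fitzroy".split()
    pred = "SELECT MAX crowd WHERE home team = fitzroy".split()
    print(component_accuracy(truth, pred), bag_of_words_accuracy(truth, pred))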
Figure 4: Position distribution of the first 7 components. From left to right and top to bottom, data sampled every 15,000 steps.

Although it takes a very long time to converge, this result at least verifies the correctness of our design. Because of the emphasizing strategy (sequence replication), the final accuracy is not very high; however, the model is capable of covering at least one whole segment:

T: select max-gross-weight from where aircraft=robinson-r-22 select max-gross-weight from where aircraft=robinson-r-22 select max-gross-weight from where aircraft=robinson-r-22 select max-gross-weight from where aircraft=robinson-r-22 select max-gross-weight from
P: select max-gross-weight from where aircraft=robinson-r-22 select select select select aircraft=robinson-r-22 select select select aircraft==select from select where aircraft=robinson-r-22 select max-gross-weight from where aircraft

Hence we conclude that the pointer network, though powerful on other tasks, might be incompatible with this sequence-to-sequence semantic parsing task on long sequences. We could instead leverage it to predict only the column name from the given columns and question, like a question-answering system.

5.4 LSTM memorability study

As mentioned above, it is remarkable that even our vanilla seq2seq model could already learn certain weak concepts, and we hypothesize that this ability is related to the memorability of the LSTM cell itself. To verify this hypothesis, we take only the encoder, intercept its output, and use it to predict the aggregation function of the target SQL query from the input natural language description alone. We label the target data with the names of the aggregation functions (NULL, MAX, MIN, COUNT, SUM, AVG), rewire the encoder to a multilayer perceptron, and train this "partial" neural network on the whole training dataset. Using embeddings naturally increased accuracy.
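The probe just described can be sketched as follows in modern tf.keras; this is not our TensorFlow 1.3 code, and the data tensors in the commented usage lines are hypothetical, but the shape of the model (an LSTM encoder feeding an MLP head over six aggregation labels) mirrors the setup above.

    import tensorflow as tf
    from tensorflow.keras import layers

    AGG_LABELS = ["NULL", "MAX", "MIN", "COUNT", "SUM", "AVG"]

    def build_agg_classifier(vocab_size, embed_dim=300, hidden=200):
        """Encoder-only probe: embed the question, encode it with an LSTM,
        and classify the aggregation function with a small MLP head."""
        model = tf.keras.Sequential([
            layers.Embedding(vocab_size, embed_dim, mask_zero=True),
            layers.LSTM(hidden),                      # final hidden state summarizes the question
            layers.Dense(hidden, activation="relu"),  # MLP rewired onto the encoder
            layers.Dense(len(AGG_LABELS), activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    # model = build_agg_classifier(vocab_size=57625)
    # model.fit(question_ids, agg_label_ids, batch_size=128, epochs=5)  # hypothetical tensors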
The results, presented in Figure 5, verified our hypothesis.

Figure 5: Accuracy and loss measurements for the LSTM memorability study.

It takes only a few epochs to achieve very good performance, which indicates that the LSTM cell can indeed capture certain concepts directly from the input natural language description; this ability should be considered the foundation of all sequence-to-sequence models. We could not replicate the results reported in the Seq2SQL paper, but we did verify that our approach and design of the pointer network worked as expected. Apart from this, we also tested the hypothesis that an LSTM can encode sequences efficiently: we were able to predict the aggregation operation with 80% accuracy. Our biggest observation is that the networks do not require many epochs to plateau. To evaluate our results, we wanted to use exact match as in the original paper, but our exact matches were too few to calibrate system performance reliably, so we recorded exact match but used positional and bag-of-words accuracy as our measures, since two queries can be equivalent even when they do not match as strings.

In the future, we plan to further divide the query formation task into dedicated networks: predicting the aggregation, predicting the column with a pointer network, and predicting the WHERE clause with an attention pointer network. We also plan to evaluate with Cosette (Chu et al., 2017), a tool for checking SQL query equivalence.

6 Conclusion

We framed semantic parsing of natural language questions into structured query language as a sequence-to-sequence neural machine translation task. We reviewed different state-of-the-art systems and techniques and evaluated their performance on the WikiSQL task. We were able to parse English questions into queries using just seq2seq models, with 56.29% bag-of-words accuracy and even some exact matches. We found the bidirectional seq2seq model better than the vanilla, reversed, and attention networks. We also observed that parsing into a SQL query is not purely a translation problem but a mix of translation and word-ordering problems, to which we tried to apply pointer networks.

References

[Bahdanau et al., 2014] Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

[Cho et al., 2014] Cho, K., Bahdanau, D., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

[Chu et al., 2017] Chu, S., Weitz, K., Cheung, A., and Suciu, D. (2017). HoTTSQL: Proving query rewrites with univalent SQL semantics. Pages 510–524.

[Kate et al., 2005] Kate, R., Wong, Y. W., and Mooney, R. (2005). Learning to transform natural to formal languages. In Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI-05).

[Popescu et al., 2003] Popescu, A.-M., Etzioni, O., and Kautz, H. (2003). Towards a theory of natural language interfaces to databases. In Proceedings of the 8th International Conference on Intelligent User Interfaces.

[Sutskever et al., 2014] Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. Pages 3104–3112.

[Vinyals et al., 2015] Vinyals, O., Fortunato, M., and Jaitly, N. (2015). Pointer networks. Pages 2692–2700.

[Wang et al., 2017] Wang, C., Cheung, A., and Bodik, R. (2017). Synthesizing highly expressive SQL queries from input-output examples. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 452–466. ACM.

[Yaghmazadeh et al., 2017] Yaghmazadeh, N., Wang, Y., Dillig, I., and Dillig, T. (2017). SQLizer: Query synthesis from natural language.

[Zhong et al., 2017] Zhong, V., Xiong, C., and Socher, R. (2017). Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.