Problem 1 Proposal
[Member 1] – [Role]
[Member 2] – [Role]
Problem Statement
At a high level, the job of a chatbot is to determine the best response to any message it
receives. This “best” response should either (1) answer the sender’s
question, (2) give the sender relevant information, (3) ask follow-up questions, or (4) continue
the conversation in a realistic way. This is a pretty tall order. The chatbot needs to be able to
understand the intentions of the sender’s message, determine what type of response message
(a follow-up question, direct response, etc.) is required, and follow correct grammatical and
lexical rules while forming the response.
It’s safe to say that modern chatbots have trouble accomplishing all of these tasks. For all the
progress we have made in the field, we too often get frustrating chatbot experiences:
chatbots fail to understand our intentions, struggle to get us the
correct information, and are sometimes just exasperatingly difficult to deal with. The use of
machine learning concepts is one of the most effective methods of tackling this tough task.
The use of machine learning algorithms can improve the identification of anomalies in real time
at a faster rate. In this work, we have also used a machine learning algorithm to detect
anomalies in the UCSD Anomaly Detection Dataset.
Background Search
A recurrent neural network (RNN) is a class of artificial neural network where connections between
nodes form a directed graph along a sequence. This allows it to exhibit temporal dynamic behavior
for a time sequence. Unlike feedforward neural networks, RNNs can use their internal state
(memory) to process sequences of inputs. This makes them applicable to tasks such as unsegmented,
connected handwriting recognition or speech recognition.
The recurrent neural network is represented as shown in the figure above. Each node at a time step
takes an input from the previous node, and this can be represented using a feedback loop. We can
unroll this feedback loop and represent it as shown in the figure below. At each time step, we take an
input x_i and a_{i-1} (the output of the previous node), perform a computation on them, and produce an
output h_i. This output is passed to the next node. This process continues until all the time
steps have been evaluated.
The equations describing how the outputs are calculated at each time step are shown below.
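In standard notation (the weight names W_{aa}, W_{ax}, W_{ha} and the biases b_a, b_h are our own labels, not taken from the original figure):

a_i = \tanh(W_{aa} a_{i-1} + W_{ax} x_i + b_a)
h_i = g(W_{ha} a_i + b_h)

where g is typically a softmax when the output at each step is a prediction over the vocabulary.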
Backpropagation in recurrent neural networks occurs in the opposite direction of the arrows drawn in
the figure above. As in all other backpropagation techniques, we evaluate a loss function and obtain
gradients to update our weight parameters. The interesting part of backpropagation in an RNN is that
it occurs from right to left. Since the parameters are updated from the final time steps back to the
initial time steps, this is termed backpropagation through time.
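Concretely (in our own notation, summarizing the standard formulation): if L_i is the loss at time step i, the total loss and its gradient with respect to a shared weight matrix W are

L = \sum_i L_i, \qquad \frac{\partial L}{\partial W} = \sum_i \frac{\partial L_i}{\partial W},

and each term \partial L_i / \partial W itself unrolls backwards through all earlier time steps, which is why the updates flow from the final time step back to the initial one.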
The Sequence to Sequence (Seq2Seq) model was introduced and has since become the go-to model for dialogue
systems and machine translation. It consists of two RNNs (recurrent neural networks): an encoder
and a decoder. The encoder takes a sequence (sentence) as input and processes one symbol (word)
at each time step. Its objective is to convert the sequence of symbols into a fixed-size feature vector
that encodes only the important information in the sequence while discarding the unnecessary
information. You can visualize data flow in the encoder along the time axis as the flow of local
information from one end of the sequence to the other.
Each hidden state influences the next hidden state and the final hidden state can be seen as the
summary of the sequence. This state is called the context or thought vector, as it represents the
intention of the sequence. From the context, the decoder generates another sequence, one
symbol (word) at a time. Here, at each time step, the decoder is influenced by the context and the
previously generated symbols.
There are a few challenges in using this model. The most troublesome one is that the model cannot
handle variable-length sequences, which is a problem because almost all sequence-to-sequence
applications involve variable-length sequences. The next one is the vocabulary size: the decoder has
to run a softmax over a large vocabulary of, say, 20,000 words for each word in the output, and that is
going to slow down the training process even if your hardware is capable of handling it.
The representation of words is also of great importance. How do you represent the words in the sequence?
Using one-hot vectors means we need to deal with large sparse vectors due to the large vocabulary, and
there is no semantic meaning encoded into one-hot vectors (illustrated below). Let’s look at how we can face
these challenges, one by one.
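As a quick illustration of the one-hot problem before moving on (the five-word vocabulary below is purely hypothetical):

import numpy as np

vocab = ["how", "are", "you", "i", "fine"]            # toy vocabulary
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # one position set to 1, everything else 0; for a real 20,000-word vocabulary
    # this vector would be 20,000-dimensional and almost entirely zero
    vec = np.zeros(len(vocab), dtype=np.float32)
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("you"))   # [0. 0. 1. 0. 0.] -- carries no notion of similarity to other words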
Padding
Before training, we work on the dataset to convert the variable-length sequences into fixed-length
sequences by padding. We use a few special symbols to fill in the sequence.
1. EOS : End of sentence
2. PAD : Filler
3. GO : Start decoding
4. UNK : Unknown; word not in vocabulary
Consider the following query-response pair.
Q : How are you?
A : I am fine.
Assuming that we would like our sentences (queries and responses) to be of fixed length, 10, this pair
will be converted to:
Q : [ PAD, PAD, PAD, PAD, PAD, PAD, “?”, “you”, “are”, “How” ]
A : [ GO, “I”, “am”, “fine”, “.”, EOS, PAD, PAD, PAD, PAD ]
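A minimal sketch of this padding step in plain Python (the helper names pad_query and pad_response are our own, purely for illustration):

# special symbols used to fill and delimit sequences
PAD, GO, EOS, UNK = "PAD", "GO", "EOS", "UNK"

def pad_query(tokens, length):
    # queries are reversed and left-padded, as in the example above
    return [PAD] * (length - len(tokens)) + list(reversed(tokens))

def pad_response(tokens, length):
    # responses start with GO, end with EOS, and are right-padded
    seq = [GO] + list(tokens) + [EOS]
    return seq + [PAD] * (length - len(seq))

print(pad_query(["How", "are", "you", "?"], 10))
print(pad_response(["I", "am", "fine", "."], 10))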
Bucketing
Introducing padding did solve the problem of variable-length sequences, but consider the case of
long sentences. If the largest sentence in our dataset has length 100, we need to encode all our
sentences to length 100 in order not to lose any words. Now, what happens to “How are you?” ?
There will be 96 PAD symbols in the encoded version of the sentence, and they will overshadow the actual
information in the sentence.
Bucketing partially solves this problem by putting sentences into buckets of different sizes. Consider this
list of buckets: [ (5,10), (10,15), (20,25), (40,50) ]. If the length of a query is 4 and the length of its
response is 4 (as in our previous example), we put this pair in the bucket (5,10). The query will be
padded to length 5 and the response will be padded to length 10. While running the model (training or
predicting), we use a different model for each bucket, compatible with the lengths of the query and
response. All these models share the same parameters and hence function exactly the same way.
If we are using the bucket (5,10), our sentences will be encoded to :
Q : [ PAD, “?”, “you”, “are”, “How” ]
A : [ GO, “I”, “am”, “fine”, “.”, EOS, PAD, PAD, PAD, PAD ]
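A minimal sketch of the bucket-selection logic (choose_bucket is our own helper name; real implementations do something equivalent):

buckets = [(5, 10), (10, 15), (20, 25), (40, 50)]

def choose_bucket(query_len, response_len):
    # pick the smallest bucket that can hold both the query and the response
    for q_size, r_size in buckets:
        if query_len <= q_size and response_len <= r_size:
            return (q_size, r_size)
    return None   # the pair is too long for every bucket

print(choose_bucket(4, 4))   # -> (5, 10), as in the example above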
Word Embedding
Word Embedding is a technique for learning dense representations of words in a low-dimensional vector
space. Each word can be seen as a point in this space, represented by a fixed-length vector. Semantic
relations between words are captured by this technique. The word vectors have some interesting
properties; for example:
paris – france + poland = warsaw.
The vector difference between paris and france captures the concept of capital city.
Word embedding is typically done in the first layer of the network, the embedding layer, which maps a word
(an index into the vocabulary) to a dense vector of a given size. In the seq2seq model, the
weights of the embedding layer are jointly trained with the other parameters of the model.
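A minimal sketch of such an embedding layer in TensorFlow 1.x (vocab_size, embed_dim and the placeholder are assumptions for illustration; tf.contrib.layers.embed_sequence is a one-call convenience wrapper for the same pattern):

import tensorflow as tf

vocab_size, embed_dim = 20000, 300                    # assumed sizes
word_ids = tf.placeholder(tf.int32, [None, None])     # batch of word indices

# the embedding matrix is a trainable variable, learned jointly with the rest of the model
embedding_matrix = tf.get_variable("embedding_matrix", [vocab_size, embed_dim])
dense_words = tf.nn.embedding_lookup(embedding_matrix, word_ids)   # shape [batch, time, embed_dim]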
Attention Mechanism
One of the limitations of the seq2seq framework is that all of the information in the input sentence must
be encoded into a fixed-length vector, the context. As the length of the sequence grows, we start losing a
considerable amount of information. This is why the basic seq2seq model doesn’t work well when decoding
long sequences. The attention mechanism, introduced in the paper Neural Machine Translation by
Jointly Learning to Align and Translate, allows the decoder to selectively look at the input sequence while
decoding. This takes the pressure off the encoder to encode every useful piece of information from the input.
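A minimal sketch of wiring attention into the decoder with TensorFlow 1.x tf.contrib.seq2seq (rnn_size is an assumed hyperparameter, and the two placeholders stand in for tensors that would normally come from the encoder; this is an illustration, not the exact model built in this proposal):

import tensorflow as tf

rnn_size = 512                                                        # assumed hidden size
encoder_outputs = tf.placeholder(tf.float32, [None, None, rnn_size])  # all encoder hidden states
source_sequence_length = tf.placeholder(tf.int32, [None])             # input sentence lengths

# Bahdanau-style attention over the encoder outputs
attention = tf.contrib.seq2seq.BahdanauAttention(
    num_units=rnn_size,
    memory=encoder_outputs,
    memory_sequence_length=source_sequence_length)

dec_cell = tf.contrib.rnn.LSTMCell(rnn_size)
# the wrapper lets the decoder cell attend over all encoder states at every decoding step
dec_cell = tf.contrib.seq2seq.AttentionWrapper(
    dec_cell, attention, attention_layer_size=rnn_size)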
Seq2seq in TensorFlow
The inputs placeholder will be fed with English sentence data, and its shape is [None, None]. The
first None is the batch size, which is unknown since the user can set it. The
second None is the length of the sentences. The maximum sentence length differs from batch to
batch, so it cannot be set with an exact number.
One option is to set the length of every sentence to the maximum length across all sentences in
each batch; another is to use the maximum length across the entire dataset. No matter which method
you choose, you need to add the special character <PAD> in the empty positions. However, with the latter
option, there could be unnecessarily many <PAD> characters.
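A minimal sketch of this placeholder (the name follows the description above):

import tensorflow as tf

# shape [batch size, sentence length]; both are None because the batch size is chosen
# by the user and the maximum sentence length changes from batch to batch
inputs = tf.placeholder(tf.int32, [None, None], name="inputs")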
The targets placeholder is similar to the inputs placeholder except that it will be fed with French sentence data.
The target_sequence_length placeholder represents the length of each sentence, so its shape is [None], a
rank-1 tensor whose size equals the batch size. This particular value is required as an
argument of TrainingHelper to build the decoder model for training. We will see this in (4).
max_target_len takes the maximum value out of the lengths of all the target sentences (sequences). As you
know, we have the lengths of all the sentences in the target_sequence_length placeholder. The way to get the
maximum value from it is to use tf.reduce_max.
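A minimal sketch of these three pieces (the names follow the description above; this illustrates the described shapes, not verbatim project code):

import tensorflow as tf

targets = tf.placeholder(tf.int32, [None, None], name="targets")
# one length per target sentence in the batch, hence shape [None]
target_sequence_length = tf.placeholder(tf.int32, [None], name="target_sequence_length")
# the longest target sentence in the current batch
max_target_len = tf.reduce_max(target_sequence_length)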
TF strided_slice: extracts a strided slice of a tensor (generalized Python array indexing); it can be thought
of as selecting elements with a striding window from begin to end.
Arguments: a TF Tensor, Begin, End, Strides.
TF fill: creates a tensor filled with a scalar value.
Arguments: dims (a 1-D int32/int64 tensor giving the output shape) and the value to fill with.
TF concat: concatenates tensors along one dimension.
Arguments: a list of TF Tensors (the tf.fill output and after_slice in this case) and axis=1.
After preprocessing the target label data in this way (sketched below), we will embed it later when implementing the
decoding_layer function.
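A hedged sketch of that preprocessing step, assuming a target_vocab_to_int dictionary that maps symbols such as '<GO>' to their integer ids (the function name process_decoder_input is our own label for it):

import tensorflow as tf

def process_decoder_input(target_data, target_vocab_to_int, batch_size):
    # drop the last token of every target sentence; the decoder never consumes it as input
    after_slice = tf.strided_slice(target_data, [0, 0], [batch_size, -1], [1, 1])
    # build a [batch_size, 1] column of <GO> ids and prepend it along axis 1
    go_column = tf.fill([batch_size, 1], target_vocab_to_int["<GO>"])
    return tf.concat([go_column, after_slice], 1)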
Encoding (2)
stacked_cells = tf.contrib.rnn.MultiRNNCell(
    [tf.contrib.rnn.DropoutWrapper(tf.contrib.rnn.LSTMCell(rnn_size), keep_prob)
     for _ in range(num_layers)])
Encoding model
TF nn.dynamic_rnn: puts the embedding layer and the RNN layer(s) together (a sketch follows below).
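Putting the pieces together, a hedged sketch of the encoding step (the function name encoding_layer and its argument list are assumptions consistent with this walkthrough; tf.contrib.layers.embed_sequence provides the embedding layer and tf.nn.dynamic_rnn runs the stacked cells over it):

import tensorflow as tf

def encoding_layer(rnn_inputs, rnn_size, num_layers, keep_prob,
                   source_vocab_size, encoding_embedding_size):
    # embed the integer word ids into dense vectors
    embed = tf.contrib.layers.embed_sequence(rnn_inputs,
                                             vocab_size=source_vocab_size,
                                             embed_dim=encoding_embedding_size)
    # stacked LSTM cells with dropout, as defined above
    stacked_cells = tf.contrib.rnn.MultiRNNCell(
        [tf.contrib.rnn.DropoutWrapper(tf.contrib.rnn.LSTMCell(rnn_size), keep_prob)
         for _ in range(num_layers)])
    # run the RNN; the outputs can feed attention, the state initializes the decoder
    outputs, state = tf.nn.dynamic_rnn(stacked_cells, embed, dtype=tf.float32)
    return outputs, state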
Decoding — Training process (4)
The decoding model can be thought of as two separate processes: training and inference. It is not that they have
different architectures; they share the same architecture and its parameters. Rather, they have
different strategies for feeding the shared model. For this (training) section and the next (inference) one, Fig. 4
clearly shows what they are.
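A hedged sketch of the training-time decoding path with tf.contrib.seq2seq (dec_embed_input, dec_cell, encoder_state, output_layer, target_sequence_length and max_target_len are assumed to have been built as described above):

import tensorflow as tf

# during training, the decoder is fed the ground-truth, <GO>-prefixed, embedded targets at every step
helper = tf.contrib.seq2seq.TrainingHelper(dec_embed_input, target_sequence_length)
decoder = tf.contrib.seq2seq.BasicDecoder(dec_cell, helper, encoder_state, output_layer)
training_outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
    decoder, impute_finished=True, maximum_iterations=max_target_len)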