
I'm currently reading a paper on sequence-to-sequence classification (speech-to-text) and it is stated that

"there are ways to introduce future context, such as adding a delay between the outputs and the targets"

Furthermore there are architectures mentioned such as:

Unidirectional RNN with one hidden layer containing 275 sigmoidal units, trained with target delays from 0 to 10 frames (RNN)

Can somebody explain how this works?

My explanation:

The model sees a certain frame plus a few frames from the future (i.e., the delay) and then predicts the class of the original frame. Am I right?
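In index terms, is the alignment something like this toy sketch (with a made-up delay of 2 frames)?

```python
d = 2  # hypothetical target delay of 2 frames

frames = ["f0", "f1", "f2", "f3", "f4"]   # input frames, one per time step
labels = ["y0", "y1", "y2", "y3", "y4"]   # per-frame class labels

# At output step t the network has already read frames f0..f_t, i.e. it has
# d frames of "future" context relative to the frame it is asked to classify.
for t in range(d, len(frames)):
    print(f"step {t}: input so far {frames[:t + 1]} -> predict {labels[t - d]}")
```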

Further details about the corresponding Keras implementations are in this SO thread: stackoverflow.com/questions/43034960/… It is especially useful for the many-to-many case with a delay, which is not trivial to implement in Keras. Commented Dec 4, 2018 at 15:14

1 Answer


As A. Karpathy illustrates, there are several types of RNNs. The model described in the paper corresponds to the fourth one (many-to-many, where the output sequence starts only after part of the input has been seen).

Figure: RNN types, from A. Karpathy (http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

The target delay (also called time delay) is a chosen number of timesteps by which the targets are shifted relative to the inputs. This gives the network a few timesteps of future context before it has to emit each prediction, which can be crucial: it makes the network more robust to short distortions, especially when combined with LSTM cells.

In other words, as you said, it is the number of additional frames fed into the RNN beyond the frame being classified, before you start reading that frame's prediction from the output.
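As a minimal sketch of how one could set this up in practice (the array names, sizes, and the 5-frame delay here are just assumptions for illustration, not the paper's exact configuration): shift the targets forward by the delay and ignore the first few output steps, which have no valid target.

```python
import numpy as np

delay = 5                    # hypothetical target delay in frames
T, F, C = 100, 40, 61        # made-up sizes: frames, feature dim, classes

x = np.random.randn(T, F)            # acoustic feature frames (inputs)
y = np.random.randint(0, C, size=T)  # per-frame class labels (targets)

# The inputs stay as they are; the targets are pushed 'delay' steps into the
# future, so the label of frame t is compared against the output at step
# t + delay. The first 'delay' output steps get a padding label (-1 here)
# and should be masked out of the loss.
y_delayed = np.full(T, -1, dtype=int)
y_delayed[delay:] = y[:T - delay]
```

With a framework like Keras, this shifted target sequence can then be used to train an ordinary unidirectional RNN frame by frame; the SO thread linked in the comments above discusses the many-to-many-with-delay case in more detail.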
