I'm currently reading a paper on sequence-to-sequence classification (speech-to-text), and it states that
"there are ways to introduce future context, such as adding a delay between the outputs and the targets"
Furthermore, architectures such as the following are mentioned:
Unidirectional RNN with one hidden layer containing 275 sigmoidal units, trained with target delays from 0 to 10 frames (RNN)
Can somebody explain how this works?

My explanation for this:
the model sees a certain frame plus a few frames from the future (i.e. the delay) and then predicts the class for the original frame. Am I right?
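To make my reading concrete, here's a minimal NumPy sketch (names and data are my own, not from the paper) of what "delaying the targets by d frames" would mean at the data level: the label for frame t is asked for at step t + d, so a unidirectional RNN has already consumed d future frames by the time it must emit that label.

```python
import numpy as np

# Hypothetical example: 8 input frames, one class label per frame.
frames = np.arange(8)                         # stand-ins for per-frame feature vectors
labels = np.array([0, 0, 1, 1, 1, 2, 2, 2])   # original frame-aligned targets

delay = 2  # target delay in frames

# Shift the targets right by `delay` steps: the network's output at step
# t + delay is trained against the label of frame t, so it sees frames
# t+1 .. t+delay of "future" context before committing to a prediction.
delayed_labels = np.full(len(labels), -1)     # -1 = padding / ignored steps
delayed_labels[delay:] = labels[:len(labels) - delay]

print(delayed_labels)  # [-1 -1  0  0  1  1  1  2]
```

If this matches the paper's meaning, then my interpretation above holds: the model effectively classifies frame t using frames up to t + delay, at the cost of extra output latency and a few unused steps at the start of each sequence.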