33 questions
1
vote
1
answer
31
views
Masked self-attention not working as expected when each token also masks itself
I was developing a self-attentive module using PyTorch's nn.MultiheadAttention (MHA). My goal was to implement a causal mask that forces each token to attend only to the tokens before it, ...
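A minimal sketch of the usual causal mask for nn.MultiheadAttention (assuming batch_first=True and a boolean mask): torch.triu with diagonal=1 blocks only the future positions, so each token still attends to itself; diagonal=0 also masks the token itself, and the first query row then has no valid keys and produces NaNs after the softmax.

```python
import torch
import torch.nn as nn

L, E = 5, 16
mha = nn.MultiheadAttention(embed_dim=E, num_heads=4, batch_first=True)
x = torch.randn(2, L, E)

# Boolean attn_mask: True means "do not attend".
# diagonal=1 leaves the diagonal unmasked, so every token still sees itself;
# diagonal=0 would also mask the token itself, and the all-True first row
# yields NaNs after the softmax.
causal_mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)

out, weights = mha(x, x, x, attn_mask=causal_mask)
```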
0
votes
0
answers
53
views
Multihead Attention for 4-D tensor in Pytorch
I tried to convert a TensorFlow model to PyTorch, but I am having trouble with multi-head attention because of its dimensions.
The input tensors for MHA are 4-D, but PyTorch's MHA cannot accept a 4-D tensor as input ...
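A common workaround, sketched below under the assumption that the extra axis is independent of the attention (the (batch, groups, seq, embed) layout is hypothetical, for illustration): fold the extra axis into the batch dimension, run nn.MultiheadAttention on the resulting 3-D tensor, and unfold afterwards.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 3, 10, 32)   # hypothetical (batch, groups, seq, embed) layout
B, G, L, E = x.shape

mha = nn.MultiheadAttention(embed_dim=E, num_heads=4, batch_first=True)

x_flat = x.reshape(B * G, L, E)          # fold the extra axis into the batch
out, _ = mha(x_flat, x_flat, x_flat)     # ordinary 3-D self-attention
out = out.reshape(B, G, L, E)            # restore the 4-D layout
```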
2
votes
1
answer
132
views
tensorflow.keras.layers.MultiHeadAttention warning that query layer is destroying mask
I am building a transformer model using tensorflow==2.16.1 and one of the layers is a tensorflow.keras.layers.MultiHeadAttention layer.
I implement the attention layer in the TransformerBlock below:
# ...
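One way to sidestep the "query layer is destroying the mask" warning (a sketch, not the asker's TransformerBlock): build the padding mask from the token ids yourself and pass it explicitly via attention_mask, instead of relying on implicit Keras mask propagation through the projection layers.

```python
import tensorflow as tf
from tensorflow.keras import layers

ids = tf.constant([[5, 8, 9, 0, 0]])                 # toy batch, 0 = padding
x = layers.Embedding(1000, 32)(ids)

# (batch, T, S) boolean mask built directly from the inputs
padding_mask = tf.not_equal(ids, 0)
attn_mask = padding_mask[:, tf.newaxis, :] & padding_mask[:, :, tf.newaxis]

out = layers.MultiHeadAttention(num_heads=2, key_dim=16)(
    x, x, attention_mask=attn_mask)
```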
0
votes
1
answer
96
views
Why is attn_mask in PyTorch's MultiheadAttention specified for each head separately?
PyTorch's MultiheadAttention allows the attention mask to be specified either as 2D or as 3D. The former is broadcast over all N batches; the latter allows one to specify a separate mask for each ...
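For reference, a sketch of the two accepted shapes: a 2-D (L, S) mask is shared by every sample and every head, while a 3-D mask of shape (N * num_heads, L, S) lets each head of each sample use its own pattern.

```python
import torch
import torch.nn as nn

N, L, E, H = 2, 5, 16, 4
mha = nn.MultiheadAttention(E, H, batch_first=True)
x = torch.randn(N, L, E)

# 2-D mask: one (L, S) pattern broadcast over all samples and heads
mask_2d = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
out_2d, _ = mha(x, x, x, attn_mask=mask_2d)

# 3-D mask: (N * num_heads, L, S), i.e. a separate mask per head and sample
mask_3d = mask_2d.unsqueeze(0).repeat(N * H, 1, 1)
out_3d, _ = mha(x, x, x, attn_mask=mask_3d)
```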
0
votes
0
answers
38
views
How to visualize attention for long sequences (e.g., amino acids of length 1000) in Transformer models?
I am working with Transformer models and I have a specific use case where I need to visualize the attention mechanism for long sequences. Specifically, I am dealing with amino acid sequences of length ...
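One practical option (a sketch, assuming you already have a (seq, seq) attention matrix as a numpy array): block-average the 1000x1000 map down to a coarser grid before plotting, so the heatmap stays readable. The block size of 20 residues is an arbitrary choice for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

attn = np.random.rand(1000, 1000)                    # stand-in for a (seq, seq) attention matrix
attn = attn / attn.sum(axis=-1, keepdims=True)

block = 20                                           # average over 20-residue blocks
coarse = attn.reshape(1000 // block, block, 1000 // block, block).mean(axis=(1, 3))

plt.imshow(coarse, cmap="viridis")
plt.xlabel("key position (blocks of 20 residues)")
plt.ylabel("query position (blocks of 20 residues)")
plt.colorbar(label="mean attention")
plt.show()
```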
0
votes
0
answers
62
views
How to mask a multi-head attention layer?
I'm trying to make a Transformer model that can receive sequences of variable length, but I can't get it to work.
I tried this:
def transformer_model(input_shape, num_layers, d_model, num_heads, dropout_rate, ...
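A minimal sketch for variable-length batches (assuming 0 is the padding id; the vocabulary and layer sizes are illustrative): build a boolean padding mask from the token ids and pass it to MultiHeadAttention as attention_mask, so padded positions receive no attention.

```python
import tensorflow as tf
from tensorflow.keras import layers

ids = tf.constant([[5, 8, 9, 0, 0],
                   [3, 1, 0, 0, 0]])                 # 0 = padding
x = layers.Embedding(10000, 64)(ids)

padding_mask = tf.not_equal(ids, 0)                                           # (batch, seq)
attn_mask = padding_mask[:, tf.newaxis, :] & padding_mask[:, :, tf.newaxis]   # (batch, T, S)

mha = layers.MultiHeadAttention(num_heads=4, key_dim=16)
out = mha(x, x, attention_mask=attn_mask)            # padded keys get ~zero attention weight
```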
1
vote
0
answers
47
views
Multi-head self-attention for sentiment analysis gives inaccurate results
I am trying to implement a model for sentiment analysis of text data using self-attention. In this example, I am using multi-head attention but cannot be sure whether the results are accurate or not. It ...
1
vote
0
answers
40
views
Cannot backpropagate on multi-head attention in TensorFlow.js
I am trying to create a multi-head attention layer using TensorFlow.js. When trying to train the model, an error keeps popping up saying that the gradient shape is inconsistent with the input shape.
reproducible ...
0
votes
0
answers
531
views
PyTorch Vision Transformer - How to Visualise Attention Layers
I am trying to extract the attention map from a PyTorch implementation of the Vision Transformer (ViT). However, I am having trouble understanding how to do this. I understand that doing this from ...
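One approach, sketched under the assumptions that the model is torchvision's vit_b_16 and PyTorch >= 2.0 is available (for forward hooks with with_kwargs): capture the inputs each nn.MultiheadAttention receives during a normal forward pass, then re-run those modules with need_weights=True to obtain the attention maps.

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16

model = vit_b_16(weights="DEFAULT").eval()

captured, hooks = [], []

def save_inputs(module, args, kwargs, output):
    # Collect query/key/value whether they were passed positionally or by keyword
    qkv = list(args) + [kwargs[k] for k in ("query", "key", "value") if k in kwargs]
    captured.append((module, qkv[:3]))

for module in model.modules():
    if isinstance(module, nn.MultiheadAttention):
        hooks.append(module.register_forward_hook(save_inputs, with_kwargs=True))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))
for h in hooks:
    h.remove()

# Re-run each attention module on its captured inputs, this time asking for weights.
attn_maps = []
with torch.no_grad():
    for module, (q, k, v) in captured:
        _, w = module(q, k, v, need_weights=True, average_attn_weights=True)
        attn_maps.append(w)          # (batch, tokens, tokens), averaged over heads
```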
0
votes
0
answers
45
views
Interpreting the rows and columns of the attention Heatmap
I have a simple question to which I didn't find an answer:
How do I read an attention heatmap? Do rows attend to columns, or do columns attend to rows?
Since it isn't always symmetric, like in this plot, ...
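In the standard softmax(QK^T / sqrt(d)) convention, rows index queries and columns index keys: row i is a probability distribution (it sums to 1) describing how token i attends to every other token. A tiny numpy check of that convention:

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))          # 4 query tokens
K = rng.normal(size=(4, 8))          # 4 key tokens

scores = Q @ K.T / np.sqrt(8)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax over keys

print(weights.sum(axis=1))   # each row sums to 1 -> rows (queries) attend to columns (keys)
```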
0
votes
0
answers
48
views
Attention Mechanism Scores are the same
Problem Statement:
I am currently working on Aspect-Based Sentiment Analysis, where the objective is to analyze changing sentiment trends within a sentence by employing temporal windows. Ultimately, I ...
1
vote
1
answer
250
views
RuntimeError with PyTorch's MultiheadAttention: How to resolve shape mismatch?
I'm encountering an issue regarding the input shape for PyTorch's MultiheadAttention. I have initialized MultiheadAttention as follows:
attention = MultiheadAttention(embed_dim=1536, num_heads=4)
The ...
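For reference, a minimal sketch of the two accepted layouts: by default nn.MultiheadAttention expects (seq_len, batch, embed_dim); passing batch-first tensors requires batch_first=True at construction.

```python
import torch
from torch.nn import MultiheadAttention

# Default layout: (seq_len, batch, embed_dim)
attention = MultiheadAttention(embed_dim=1536, num_heads=4)
x = torch.randn(10, 2, 1536)
out, _ = attention(x, x, x)

# Batch-first layout: (batch, seq_len, embed_dim)
attention_bf = MultiheadAttention(embed_dim=1536, num_heads=4, batch_first=True)
x = torch.randn(2, 10, 1536)
out, _ = attention_bf(x, x, x)
```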
0
votes
0
answers
37
views
What's the exact input size in MultiHead-Attention of BERT?
I just recently learned about BERT.
Some tutorials show that after embedding a sentence, a matrix X of shape [seq_len, 768] is formed, and X is sent to MultiHead_Attention, that is, multiple Self-...
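The shape bookkeeping for BERT-base (hidden size 768, 12 heads): the [seq_len, 768] matrix is projected to Q, K and V, each projection is split so every head works on a 64-dimensional slice, and the 12 head outputs are concatenated back to 768. A quick sketch of that arithmetic (seq_len 128 is an arbitrary example):

```python
import torch

seq_len, hidden, num_heads = 128, 768, 12
head_dim = hidden // num_heads          # 768 / 12 = 64

X = torch.randn(seq_len, hidden)        # the embedded sentence
W_q = torch.randn(hidden, hidden)       # full Q projection (likewise for K and V)

Q = X @ W_q                             # (128, 768)
Q_heads = Q.view(seq_len, num_heads, head_dim).transpose(0, 1)  # (12, 128, 64)

# After attention, the 12 outputs of shape (128, 64) are concatenated back to (128, 768).
```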
0
votes
0
answers
61
views
How to patch intermediate layers of a python keras model with monkey patching?
I have a tf.keras model which internally contains a "custom tf.keras.layers.MultiHeadAttention() layer". That is, I have divided the multi-head attention layer into two parts:
(1) a first ...
0
votes
1
answer
188
views
PyTorch MultiHeadAttention implementation
In PyTorch's MultiheadAttention implementation, regarding in_proj_weight, is it true that the first embed_dim elements correspond to the query, the next embed_dim elements correspond to the key, and ...
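Yes: when query, key and value share one embedding dimension, in_proj_weight has shape (3 * embed_dim, embed_dim) and is chunked as query rows first, then key, then value. A small sketch:

```python
import torch
import torch.nn as nn

E = 8
mha = nn.MultiheadAttention(embed_dim=E, num_heads=2, bias=False)

print(mha.in_proj_weight.shape)                  # (3 * E, E) = (24, 8)
w_q, w_k, w_v = mha.in_proj_weight.chunk(3, dim=0)
# rows [0:E)   -> query projection
# rows [E:2E)  -> key projection
# rows [2E:3E) -> value projection
```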
1
vote
1
answer
1k
views
Training torch.TransformerDecoder with causal mask
I use torch.TransformerDecoder to generate a sequence, where each next token depends on itself and the first 2 tokens: [CLS] and the first predicted one.
So, the inference steps I need are:
...
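For the training side, a minimal sketch (layer sizes are illustrative): pass an upper-triangular additive mask as tgt_mask so position i of the target cannot look at positions after i, and feed the whole shifted target sequence in one pass (teacher forcing) instead of looping token by token as at inference.

```python
import torch
import torch.nn as nn

d_model, nhead, T = 16, 4, 5
layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)

tgt = torch.randn(2, T, d_model)       # shifted target sequence (teacher forcing)
memory = torch.randn(2, 7, d_model)    # encoder output

# -inf above the diagonal: token i may only attend to tokens <= i
tgt_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

out = decoder(tgt, memory, tgt_mask=tgt_mask)
```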
0
votes
1
answer
323
views
Issue adding an attention block to a deep neural network for a regression problem
I want to add a tf.keras.layers.MultiHeadAttention layer between the two layers of a neural network. However, I am getting an IndexError:
The detailed code is as follows:
x1 = Dense(58, activation='relu')(x1)
x1 =...
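The usual cause of that IndexError is that MultiHeadAttention expects a 3-D (batch, seq, features) tensor while Dense produces a 2-D one. A hedged sketch of one fix (the layer sizes are illustrative, not the asker's): insert a sequence axis before the attention block and flatten it afterwards.

```python
import tensorflow as tf
from tensorflow.keras.layers import (Input, Dense, Reshape, Flatten,
                                     MultiHeadAttention)

inp = Input(shape=(58,))
x1 = Dense(58, activation='relu')(inp)
x1 = Reshape((1, 58))(x1)                      # add a length-1 sequence axis
x1 = MultiHeadAttention(num_heads=2, key_dim=29)(x1, x1)
x1 = Flatten()(x1)
out = Dense(1)(x1)

model = tf.keras.Model(inp, out)
```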
0
votes
0
answers
184
views
How to properly add a MultiHeadAttention Keras layer to an LSTM?
I am trying to build a deep learning network (using TensorFlow Keras) that performs a graph convolution and, at each node, performs an LSTM computation. I want to add a MultiHeadAttention layer to the ...
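A minimal sketch of the usual wiring, independent of the graph-convolution part (shapes and sizes are illustrative): keep return_sequences=True on the LSTM so the attention layer sees the whole sequence of hidden states, then apply self-attention with a residual connection.

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(20, 32))                      # (timesteps, features)
x = layers.LSTM(64, return_sequences=True)(inputs)         # keep the time axis
attn = layers.MultiHeadAttention(num_heads=4, key_dim=16)(x, x)
x = layers.LayerNormalization()(x + attn)                  # residual + norm
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(1)(x)

model = tf.keras.Model(inputs, outputs)
```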
0
votes
0
answers
326
views
How can I convert a multi-head attention layer from Tensorflow to Pytorch where key_dim * num_heads != embed_dim?
I am trying to implement a PyTorch version of some code that was previously written in TensorFlow. In the code I am starting with, there exists a multi-head attention layer that is instantiated in the ...
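Keras projects to num_heads * key_dim internally and then back to the query's feature size, while nn.MultiheadAttention fixes head_dim = embed_dim // num_heads. One shape-level workaround, sketched below (a hypothetical wrapper, not a numerically identical port of the Keras layer), is to wrap the PyTorch layer with explicit projections:

```python
import torch
import torch.nn as nn

class KerasLikeMHA(nn.Module):
    """Shape-compatible stand-in for keras MultiHeadAttention(num_heads, key_dim)."""

    def __init__(self, embed_dim, num_heads, key_dim):
        super().__init__()
        inner = num_heads * key_dim
        self.proj_in = nn.Linear(embed_dim, inner)
        self.mha = nn.MultiheadAttention(inner, num_heads, batch_first=True)
        self.proj_out = nn.Linear(inner, embed_dim)

    def forward(self, query, key, value):
        out, weights = self.mha(self.proj_in(query),
                                self.proj_in(key),
                                self.proj_in(value))
        return self.proj_out(out), weights

layer = KerasLikeMHA(embed_dim=10, num_heads=6, key_dim=4)   # 6 * 4 != 10
x = torch.randn(2, 7, 10)
out, _ = layer(x, x, x)          # out: (2, 7, 10)
```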
1
vote
2
answers
805
views
Understanding the output dimensionality for torch.nn.MultiheadAttention.forward
I want to implement cross-attention between 2 modalities. In my implementation, I take Q from modality A, and K and V from modality B. Modality A is used for guidance via cross-attention, and ...
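For the shape question: the output of nn.MultiheadAttention follows the query, so with Q from modality A of length L_q and K, V from modality B of length L_kv, the output is (batch, L_q, embed_dim) and the head-averaged attention weights are (batch, L_q, L_kv). A quick check with arbitrary sizes:

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)

q = torch.randn(8, 10, 32)     # modality A: (batch, L_q, E)
kv = torch.randn(8, 50, 32)    # modality B: (batch, L_kv, E)

out, weights = mha(q, kv, kv)
print(out.shape)       # torch.Size([8, 10, 32])  -- follows the query length
print(weights.shape)   # torch.Size([8, 10, 50])  -- averaged over heads by default
```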
0
votes
0
answers
192
views
PyTorch RuntimeError: Invalid Shape During Reshaping for Multi-Head Attention
I'm implementing a multi-head self-attention mechanism in PyTorch as part of a Text2Image model that I am trying to build, and I'm encountering a runtime error when trying to reshape the output of ...
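The standard split-into-heads reshape, for reference (a generic sketch, not the asker's model): the embedding dimension must be divisible by the number of heads, and the merge back needs .contiguous() after the transpose, otherwise .view() raises exactly this kind of shape error.

```python
import torch

B, L, E, H = 2, 7, 32, 4
head_dim = E // H                     # E must be divisible by H
x = torch.randn(B, L, E)

# split: (B, L, E) -> (B, H, L, head_dim)
heads = x.view(B, L, H, head_dim).transpose(1, 2)

# ... scaled dot-product attention per head ...

# merge: (B, H, L, head_dim) -> (B, L, E); contiguous() is needed after transpose
merged = heads.transpose(1, 2).contiguous().view(B, L, E)
```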
1
vote
0
answers
174
views
Access attention score when using TransformerEncoderLayer, TransformerEncoder
My input after the following x = self.preTransformerInput(x) is of shape (2,16,4) (batch size, sequence length, embedding dimension). How can we access the attention score for each head using ...
2
votes
0
answers
171
views
What is the reason for MultiHeadAttention having a different call convention than Attention and AdditiveAttention?
Attention and AdditiveAttention are called with their input tensors in a list (the same as Add, Average, Concatenate, Dot, Maximum, Multiply, and Subtract).
But MultiHeadAttention is called by passing the ...
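For comparison, the two call styles side by side (a sketch): the list-style layers merge a small, fixed set of tensors, while MultiHeadAttention takes query and value (and optionally key) as separate arguments plus extra keyword options such as attention_mask.

```python
import tensorflow as tf
from tensorflow.keras import layers

q = tf.random.normal((2, 10, 16))
v = tf.random.normal((2, 10, 16))

out_a = layers.Attention()([q, v])                               # inputs passed as a list
out_m = layers.MultiHeadAttention(num_heads=2, key_dim=8)(q, v)  # separate query/value args
```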
1
vote
0
answers
351
views
Temporal Fusion Transformer model training encountered Gradient Vanishing
I am training on financial data with a Temporal Fusion Transformer. Although this model has skip connections and residual connections to enhance information flow, I believe it encountered vanishing gradients at ...
0
votes
0
answers
152
views
How to convert Tensorflow Multi-head attention to PyTorch?
I'm converting a TensorFlow transformer model to its PyTorch equivalent.
In the TF multi-head attention part of the code I have:
att = layers.MultiHeadAttention(num_heads=6, key_dim=4)
and the input shape is [...
0
votes
1
answer
402
views
Inputs and Outputs Mismatch of Multi-head Attention Module (Tensorflow VS PyTorch)
I am trying to convert my TensorFlow model's layers.MultiHeadAttention module from tf.keras to nn.MultiheadAttention from the torch.nn module. Below are the snippets.
Tensorflow Multi-head Attention
...
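The two main differences to keep in mind (a sketch with illustrative sizes): Keras returns a single tensor unless return_attention_scores=True, whereas PyTorch always returns an (output, attention_weights) tuple, and PyTorch defaults to sequence-first layout unless batch_first=True.

```python
import tensorflow as tf
from tensorflow.keras import layers
import torch
import torch.nn as nn

x_tf = tf.random.normal((2, 10, 32))
out_tf = layers.MultiHeadAttention(num_heads=4, key_dim=8)(x_tf, x_tf)   # single tensor

x_pt = torch.randn(2, 10, 32)
mha = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
out_pt, attn = mha(x_pt, x_pt, x_pt)                                     # tuple of two tensors
```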
0
votes
0
answers
115
views
Exception encountered when calling layer 'tft_multi_head_attention' (type TFTMultiHeadAttention)
I am trying to build a forecasting model with the TFT module (Temporal Fusion Transformer). I am getting the error below when I try to train the model. Since I am new to TensorFlow, I can't understand ...
1
vote
0
answers
448
views
Are the WQ, WK, WV matrices used for generating the query, key and value vectors for attention in Transformers fixed, or are they dependent on the input word?
To calculate self-attention,
for each word we create a Query vector, a Key vector, and a Value vector.
These vectors are created by multiplying the embedding by three matrices that we trained during ...
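The short answer: WQ, WK and WV are fixed learned parameters shared across all words, while the Q, K and V vectors differ per word because they are the product of that word's embedding with those fixed matrices. A tiny sketch:

```python
import torch
import torch.nn as nn

d = 8
W_q = nn.Linear(d, d, bias=False)    # fixed (learned) projection, the same for every word

x1 = torch.randn(d)                  # embedding of word 1
x2 = torch.randn(d)                  # embedding of word 2

q1, q2 = W_q(x1), W_q(x2)            # different query vectors produced by the same W_q
```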
0
votes
1
answer
187
views
Running speed of Pytorch MultiheadAttention compared to Torchvision MVit
I am currently experimenting with my model, which uses Torchvision implementation of MViT_v2_s as backbone. I added a few cross attention modules to the model which looks roughly like this:
class ...
3
votes
2
answers
4k
views
How to read a BERT attention weight matrix?
I have extracted the attention score/weight matrix from the last layer and the last attention head of my BERT model. However, I am not too sure how to read it. The matrix is the following one. I ...
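For context, with Hugging Face transformers such a matrix typically comes from output_attentions=True; each returned tensor has shape (batch, heads, seq, seq), and row i is the softmax distribution of query token i over the key tokens, so every row sums to 1. A sketch, assuming bert-base-uncased:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tok("the cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

attn = out.attentions          # tuple of 12 layers, each (batch, heads, seq, seq)
last_head = attn[-1][0, -1]    # last layer, last head: (seq, seq)
print(last_head.sum(dim=-1))   # every row sums to 1: rows = query tokens, columns = key tokens
```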
1
vote
0
answers
189
views
How to access the value projection at MultiHeadAttention layer in Pytorch
I'm making my own implementation of the Graphormer architecture. Since this architecture needs to add an edge-based bias to the output of the key-query multiplication in the self-attention mechanism ...
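When query, key and value share the same embedding dimension, nn.MultiheadAttention stores all three projections packed in in_proj_weight, so the value projection is the last embed_dim rows; separate q_proj_weight/k_proj_weight/v_proj_weight attributes only exist when kdim/vdim differ. A sketch of pulling out the value projection:

```python
import torch
import torch.nn as nn

E = 16
mha = nn.MultiheadAttention(embed_dim=E, num_heads=4, batch_first=True)

w_v = mha.in_proj_weight[2 * E:, :]          # value projection weights
b_v = mha.in_proj_bias[2 * E:]               # value projection bias

x = torch.randn(2, 5, E)
v = torch.nn.functional.linear(x, w_v, b_v)  # the value vectors the layer would compute
```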
0
votes
0
answers
836
views
How to add a multihead attention layer to a CNN-LSTM model?
I'm trying to make a hybrid binary text classification model using a multi-head attention mechanism with CNN-LSTM. However, I'm facing an issue when trying to pass the values obtained from CNN-LSTM to ...
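A minimal sketch of one way to wire it (layer sizes are illustrative, not the asker's): keep the sequence axis all the way through Conv1D and LSTM (return_sequences=True) so MultiHeadAttention receives a 3-D tensor, then pool before the classification head.

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(100,), dtype="int32")
x = layers.Embedding(10000, 64)(inputs)
x = layers.Conv1D(64, 5, padding="same", activation="relu")(x)
x = layers.LSTM(64, return_sequences=True)(x)              # keep the time axis for attention
attn = layers.MultiHeadAttention(num_heads=4, key_dim=16)(x, x)
x = layers.GlobalAveragePooling1D()(attn)
outputs = layers.Dense(1, activation="sigmoid")(x)         # binary classification head

model = tf.keras.Model(inputs, outputs)
```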
1
vote
1
answer
762
views
Multi head Attention calculation
I create a model with a multi-head attention layer:
import torch
import torch.nn as nn
query = torch.randn(2, 4)
key = torch.randn(2, 4)
value = torch.randn(2, 4)
model = nn.MultiheadAttention(4, 1, ...
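Completing a snippet like the one above for reference (the constructor arguments past num_heads are truncated, so this is a sketch): recent PyTorch versions accept unbatched (L, E) inputs, and the layer returns both the attended output and the attention weights.

```python
import torch
import torch.nn as nn

query = torch.randn(2, 4)
key = torch.randn(2, 4)
value = torch.randn(2, 4)

model = nn.MultiheadAttention(4, 1)

out, weights = model(query, key, value)   # unbatched: out is (2, 4), weights is (2, 2)
# On older PyTorch, add a batch axis first: model(query[None], key[None], value[None])
```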