1 vote
1 answer
31 views

Masked self-attention not working as expected when each token also masks itself

I was developing a self-attentive module using PyTorch's nn.MultiheadAttention (MHA). My goal was to implement a causal mask that forces each token to attend only to the tokens before it, ...
jackjack4468
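
A common cause of this behaviour is building the boolean mask with torch.triu(..., diagonal=0), which also masks the diagonal. A minimal sketch (not the poster's code) of a causal mask that leaves the diagonal unmasked, assuming a batch-first nn.MultiheadAttention:

    import torch
    import torch.nn as nn

    seq_len, embed_dim, num_heads = 5, 16, 4
    mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
    x = torch.randn(2, seq_len, embed_dim)

    # True marks positions that are NOT allowed to attend; diagonal=1 keeps
    # the diagonal False, so every token can still attend to itself.
    causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

    out, weights = mha(x, x, x, attn_mask=causal_mask)
    print(weights[0])  # row i is zero for columns j > i, but non-zero at j == i
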
0 votes
0 answers
53 views

Multi-head attention for a 4-D tensor in PyTorch

I tried to convert a TensorFlow model to PyTorch, but I am having trouble with multi-head attention because of its dimensions. The input tensors for MHA are 4-D, but PyTorch's MHA can't accept a 4-D tensor as input ...
Doru
  • 1
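
One workaround (an illustration, not taken from the question) is to fold the extra dimension into the batch dimension before calling nn.MultiheadAttention and unfold it afterwards; a minimal sketch, assuming an input of shape (batch, groups, seq, embed_dim):

    import torch
    import torch.nn as nn

    B, G, L, E = 2, 3, 7, 32                      # hypothetical 4-D input shape
    mha = nn.MultiheadAttention(E, num_heads=4, batch_first=True)

    x = torch.randn(B, G, L, E)
    x_flat = x.reshape(B * G, L, E)               # merge batch and group dims
    out_flat, _ = mha(x_flat, x_flat, x_flat)     # MHA only sees 3-D tensors
    out = out_flat.reshape(B, G, L, E)            # restore the original 4-D shape
    print(out.shape)                              # torch.Size([2, 3, 7, 32])
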
2 votes
1 answer
132 views

tensorflow.keras.layers.MultiHeadAttention warning that the query layer is destroying the mask

I am building a transformer model using tensorflow==2.16.1 and one of the layers is a tensorflow.keras.layers.MultiHeadAttention layer. I implement the attention layer in the TransformerBlock below: # ...
Stod
  • 83
0 votes
1 answer
96 views

Why is attn_mask in PyTorch's MultiheadAttention specified for each head separately?

PyTorch's MultiheadAttention lets you specify the attention mask either as 2D or as 3D. The former is broadcast over all N batches; the latter lets you specify separate masks for each ...
Bastiaan
  • 4,652
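
For reference, the 3-D form is expected to have shape (batch_size * num_heads, L, S), so each head of each batch element can get its own mask; a small sketch (shapes are assumptions):

    import torch
    import torch.nn as nn

    B, L, E, H = 2, 5, 16, 4
    mha = nn.MultiheadAttention(E, H, batch_first=True)
    x = torch.randn(B, L, E)

    # 2-D mask: one (L, S) mask broadcast to every batch element and head.
    mask_2d = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)

    # 3-D mask: shape (B * H, L, S), one mask per head per batch element.
    mask_3d = mask_2d.unsqueeze(0).expand(B * H, L, L)

    out_a, _ = mha(x, x, x, attn_mask=mask_2d)
    out_b, _ = mha(x, x, x, attn_mask=mask_3d)
    print(torch.allclose(out_a, out_b))  # identical per-head masks -> same output
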
0 votes
0 answers
38 views

How to visualize attention for long sequences (e.g., amino acids of length 1000) in Transformer models?

I am working with Transformer models and I have a specific use case where I need to visualize the attention mechanism for long sequences. Specifically, I am dealing with amino acid sequences of length ...
Farshid B
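
One pragmatic option (an assumption, not from the question) is to average-pool a 1000x1000 attention matrix into coarser blocks before plotting, so the heatmap stays readable; a sketch with matplotlib and a placeholder matrix:

    import torch
    import torch.nn.functional as F
    import matplotlib.pyplot as plt

    attn = torch.rand(1000, 1000)                      # placeholder attention matrix
    attn = attn / attn.sum(dim=-1, keepdim=True)       # rows sum to 1, like softmax output

    # Average-pool 1000x1000 down to 100x100, one cell per block of 10 residues.
    pooled = F.avg_pool2d(attn[None, None], kernel_size=10).squeeze()

    plt.imshow(pooled.numpy(), cmap="viridis", aspect="auto")
    plt.xlabel("key position (blocks of 10 residues)")
    plt.ylabel("query position (blocks of 10 residues)")
    plt.colorbar(label="mean attention weight")
    plt.show()
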
0 votes
0 answers
62 views

How to mask a multi-head attention layer?

I'm trying to make a Transformer model that can receive sequences of variable length, but I can't get it to work. I tried this: def transformer_model(input_shape, num_layers, d_model, num_heads, dropout_rate, ...
Lyx Sword
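
For variable-length batches, one approach is to build a padding mask from the input IDs and pass it as attention_mask, whose shape is (batch, query_len, key_len); a minimal sketch, assuming token ID 0 is padding:

    import tensorflow as tf

    ids = tf.constant([[5, 3, 8, 0, 0],
                       [7, 2, 0, 0, 0]])                 # 0 = padding (assumption)
    x = tf.keras.layers.Embedding(100, 16)(ids)          # (batch, seq, d_model)

    pad = tf.cast(tf.not_equal(ids, 0), tf.bool)         # (batch, seq)
    # Broadcast to (batch, query_len, key_len): a position may attend to a key
    # only if both the query and the key are real (non-padding) tokens.
    attention_mask = pad[:, :, tf.newaxis] & pad[:, tf.newaxis, :]

    mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=8)
    out = mha(query=x, value=x, attention_mask=attention_mask)
    print(out.shape)                                      # (2, 5, 16)
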
1 vote
0 answers
47 views

Multi-head self-attention for sentiment analysis gives inaccurate results

I am trying to implement a model for sentiment analysis of text data using self-attention. In this example, I am using multi-head attention but cannot be sure whether the results are accurate. It ...
phd Mom
  • 11
1 vote
0 answers
40 views

Cannot backpropagate through multi-head attention in TensorFlow.js

I am trying to create a multi-head attention layer using TensorFlow.js. When trying to train the model, an error kept popping up saying that the gradient shape was inconsistent with the input shape. Reproducible ...
MrGeniusProgrammer
0 votes
0 answers
531 views

PyTorch Vision Transformer - How to Visualise Attention Layers

I am trying to extract the attention map for a PyTorch implementation of the Vision Transformer (ViT). However, I am having trouble understanding how to do this. I understand that doing this from ...
Peter
  • 9
0 votes
0 answers
45 views

Interpreting the rows and columns of an attention heatmap

I have a simple question to which I couldn't find an answer: how do I read an attention heatmap? Do rows attend to columns, or do columns attend to rows? Since it isn't always symmetric, like in this plot, ...
Wassim Jaoui
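
With the usual softmax(QK^T / sqrt(d))V convention, rows index the queries and columns index the keys, so row i shows how token i distributes its attention and each row sums to 1; a quick check, assuming PyTorch:

    import torch
    import torch.nn as nn

    mha = nn.MultiheadAttention(embed_dim=8, num_heads=1, batch_first=True)
    x = torch.randn(1, 4, 8)
    _, weights = mha(x, x, x)        # weights: (batch, query_pos, key_pos)

    print(weights[0].sum(dim=-1))    # each row sums to 1 -> rows attend to columns
    print(weights[0].sum(dim=0))     # column sums are generally not 1
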
0 votes
0 answers
48 views

Attention Mechanism Scores are the same

Problem Statement: I am currently working on Aspect-Based Sentiment Analysis, where the objective is to analyze changing sentiment trends within a sentence by employing temporal windows. Ultimately, I ...
sk-19
  • 13
1 vote
1 answer
250 views

RuntimeError with PyTorch's MultiheadAttention: How to resolve shape mismatch?

I'm encountering an issue regarding the input shape for PyTorch's MultiheadAttention. I have initialized MultiheadAttention as follows: attention = MultiheadAttention(embed_dim=1536, num_heads=4) The ...
ララララ
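
A frequent cause of this error is passing (batch, seq, embed) tensors while the layer was created with the default batch_first=False, which expects (seq, batch, embed); a small sketch (the batch and sequence sizes are assumptions):

    import torch
    from torch.nn import MultiheadAttention

    attention = MultiheadAttention(embed_dim=1536, num_heads=4, batch_first=True)

    x = torch.randn(8, 10, 1536)       # (batch, seq_len, embed_dim), assumed shape
    out, weights = attention(x, x, x)
    print(out.shape)                   # torch.Size([8, 10, 1536])

    # With the default batch_first=False the same call would need x.transpose(0, 1),
    # i.e. (seq_len, batch, embed_dim); embed_dim must also be divisible by num_heads.
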
0 votes
0 answers
37 views

What's the exact input size in MultiHead-Attention of BERT?

I just recently learned about BERT. Some tutorials show that after embedding a sentence, a matrix X of shape [seq_len, 768] is formed, and X is sent to multi-head attention, that is, multiple self-...
TomWu
  • 11
0 votes
0 answers
61 views

How to patch intermediate layers of a python keras model with monkey patching?

I have a tf.keras model which internally contains a "custom tf.keras.layers.MultiHeadAttention() layer". That is, I have divided the multi-head attention layer into two parts: (1) a first ...
DROS
  • 1
0 votes
1 answer
188 views

PyTorch MultiHeadAttention implementation

In Pytorch's MultiHeadAttention implementation, regarding in_proj_weight, is it true that the first embed_dim elements correspond to the query, the next embed_dim elements correspond to the key, and ...
carpet119
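
That layout can be checked directly: when query, key and value share the same embedding dimension, in_proj_weight has shape (3 * embed_dim, embed_dim) and its rows are stacked in query, key, value order; a quick sketch:

    import torch
    import torch.nn as nn

    E = 16
    mha = nn.MultiheadAttention(E, num_heads=4)

    print(mha.in_proj_weight.shape)            # torch.Size([48, 16]) == (3*E, E)
    w_q, w_k, w_v = mha.in_proj_weight.chunk(3, dim=0)
    b_q, b_k, b_v = mha.in_proj_bias.chunk(3, dim=0)

    x = torch.randn(5, E)                      # 5 tokens, unbatched
    q = x @ w_q.T + b_q                        # query projection uses the first E rows
    print(q.shape)                             # torch.Size([5, 16])
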
1 vote
1 answer
1k views

Training torch.TransformerDecoder with causal mask

I use torch.TransformerDecoder to generate a sequence, where each next token depends on itself and the first 2 tokens: [CLS] and the first predicted one. So, the steps of execution at inference that I need are: ...
First Name Second Name
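
For training with teacher forcing, the usual recipe is to feed the whole (shifted) target sequence once and pass a causal mask as tgt_mask; a minimal sketch, assuming batch_first=True and made-up sizes:

    import torch
    import torch.nn as nn

    d_model, n_heads, L = 32, 4, 6
    layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
    decoder = nn.TransformerDecoder(layer, num_layers=2)

    tgt = torch.randn(2, L, d_model)           # shifted target sequence (assumed)
    memory = torch.randn(2, 10, d_model)       # encoder output (assumed)

    # Float mask with -inf above the diagonal: position i cannot see positions > i.
    tgt_mask = nn.Transformer.generate_square_subsequent_mask(L)

    out = decoder(tgt, memory, tgt_mask=tgt_mask)
    print(out.shape)                           # torch.Size([2, 6, 32])
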
0 votes
1 answer
323 views

Issue adding an attention block in a deep neural network for a regression problem

I want to add a tf.keras.layers.MultiHeadAttention layer between two layers of a neural network. However, I am getting an IndexError. The detailed code is as follows: x1 = Dense(58, activation='relu')(x1) x1 =...
Zeshan Akber
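
tf.keras.layers.MultiHeadAttention expects a 3-D (batch, seq, features) input, while Dense outputs are 2-D, which typically triggers that IndexError; one way around it (an illustration, not the poster's code, with an assumed input size) is to add a length-1 sequence axis:

    import tensorflow as tf
    from tensorflow.keras import layers

    inputs = tf.keras.Input(shape=(20,))                       # assumed feature vector
    x1 = layers.Dense(58, activation="relu")(inputs)           # (batch, 58)

    x1_seq = layers.Reshape((1, 58))(x1)                       # (batch, 1, 58): add a seq axis
    att = layers.MultiHeadAttention(num_heads=2, key_dim=29)(x1_seq, x1_seq)
    x1_flat = layers.Flatten()(att)                            # back to (batch, 58)

    outputs = layers.Dense(1)(x1_flat)                         # regression head
    model = tf.keras.Model(inputs, outputs)
    model.summary()
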
0 votes
0 answers
184 views

How to properly add a MultiHeadAttention Keras layer to an LSTM?

I am trying to build a deep learning network (USING TENSORFLOW KERAS) that performs a graph convolution, and at each node performs an LSTM computation. I want to add a MultiHeadAttention layer to the ...
Valeria Laynes
0 votes
0 answers
326 views

How can I convert a multi-head attention layer from Tensorflow to Pytorch where key_dim * num_heads != embed_dim?

I am trying to implement a Pytorch version of some code that was previously written in Tensorflow. In the code I am starting with, there exists a multi-head attention layer that is instantiated in the ...
Emery Wade
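
PyTorch's nn.MultiheadAttention hard-codes head_dim = embed_dim // num_heads, so a TF layer whose key_dim * num_heads differs from the model dimension has no direct equivalent; one option (a sketch under that assumption, not a drop-in replacement) is a small custom module with explicit projections:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TFStyleMHA(nn.Module):
        """Multi-head attention whose head size is chosen independently of embed_dim,
        mirroring tf.keras.layers.MultiHeadAttention(num_heads, key_dim)."""
        def __init__(self, embed_dim, num_heads, key_dim):
            super().__init__()
            self.h, self.d = num_heads, key_dim
            self.q = nn.Linear(embed_dim, num_heads * key_dim)
            self.k = nn.Linear(embed_dim, num_heads * key_dim)
            self.v = nn.Linear(embed_dim, num_heads * key_dim)
            self.out = nn.Linear(num_heads * key_dim, embed_dim)

        def forward(self, x):
            B, L, _ = x.shape
            def split(t):  # (B, L, h*d) -> (B, h, L, d)
                return t.view(B, L, self.h, self.d).transpose(1, 2)
            q, k, v = split(self.q(x)), split(self.k(x)), split(self.v(x))
            attn = F.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5, dim=-1)
            ctx = (attn @ v).transpose(1, 2).reshape(B, L, self.h * self.d)
            return self.out(ctx)

    x = torch.randn(2, 7, 10)                                  # embed_dim=10
    print(TFStyleMHA(10, num_heads=6, key_dim=4)(x).shape)     # torch.Size([2, 7, 10])
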
1 vote
2 answers
805 views

Understanding the output dimensionality for torch.nn.MultiheadAttention.forward

I want to implement a cross attention between 2 modalities. In my implementation, I set Q from modality A, and K and V from modality B. Modality A is used for guidance via cross attention, and ...
Tony Ha
  • 11
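
For orientation, nn.MultiheadAttention's output always takes its sequence length from the query and its feature size from embed_dim, while key and value only have to agree with each other in length; a short sketch of cross attention between two modalities (sizes are assumptions):

    import torch
    import torch.nn as nn

    E = 32
    mha = nn.MultiheadAttention(embed_dim=E, num_heads=4, batch_first=True)

    a = torch.randn(2, 10, E)    # modality A -> query
    b = torch.randn(2, 25, E)    # modality B -> key and value

    out, weights = mha(query=a, key=b, value=b)
    print(out.shape)             # torch.Size([2, 10, 32]): query length, embed_dim
    print(weights.shape)         # torch.Size([2, 10, 25]): (batch, target_len, source_len)
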
0 votes
0 answers
192 views

PyTorch RuntimeError: Invalid Shape During Reshaping for Multi-Head Attention

I'm implementing a multi-head self-attention mechanism in PyTorch which is part of Text2Image model that I am trying to build and I'm encountering a runtime error when trying to reshape the output of ...
venkatesh
  • 162
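
Reshape errors in hand-rolled multi-head attention usually come from a model dimension that isn't divisible by the number of heads, or from flattening before transposing back; a reference pattern for splitting and merging heads (shapes are assumptions):

    import torch

    B, L, D, H = 2, 9, 64, 8          # batch, seq_len, model dim, heads
    assert D % H == 0, "model dim must be divisible by the number of heads"
    d_head = D // H

    x = torch.randn(B, L, D)

    # split heads: (B, L, D) -> (B, H, L, d_head)
    heads = x.view(B, L, H, d_head).transpose(1, 2)

    # ... per-head attention would happen here ...

    # merge heads: transpose back BEFORE flattening, then make contiguous
    merged = heads.transpose(1, 2).contiguous().view(B, L, D)
    print(merged.shape)               # torch.Size([2, 9, 64])
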
1 vote
0 answers
174 views

Access attention score when using TransformerEncoderLayer, TransformerEncoder

My input after the following x = self.preTransformerInput(x) is of shape (2,16,4) (batch size, sequence length, embedding dimension). How can we access the attention score for each head using ...
pte
  • 61
2 votes
0 answers
171 views

What is the reason for MultiHeadAttention having a different call convention than Attention and AdditiveAttention?

Attention and AdditiveAttention are called with their input tensors in a list (the same as Add, Average, Concatenate, Dot, Maximum, Multiply, and Subtract), but MultiHeadAttention is called by passing the ...
Tobias Hermann
1 vote
0 answers
351 views

Temporal Fusion Transformer model training encountered vanishing gradients

I am training on financial data with a Temporal Fusion Transformer. Although this model has skip connections and residual connections to preserve information, I believe it encountered vanishing gradients at ...
Jack Lee
0 votes
0 answers
152 views

How to convert Tensorflow Multi-head attention to PyTorch?

I'm converting a TensorFlow transformer model to its PyTorch equivalent. In the TF multi-head attention part of the code I have: att = layers.MultiHeadAttention(num_heads=6, key_dim=4) and the input shape is [...
ORC
  • 18
0 votes
1 answer
402 views

Inputs and Outputs Mismatch of Multi-head Attention Module (Tensorflow VS PyTorch)

I am trying to convert the layers.MultiHeadAttention module from tf.keras in my TensorFlow model to nn.MultiheadAttention from torch.nn. Below are the snippets. TensorFlow multi-head attention ...
Kevin Putra Santoso
0 votes
0 answers
115 views

Exception encountered when calling layer 'tft_multi_head_attention' (type TFTMultiHeadAttention)

I am trying to build a forecasting model with the tft module (Temporal Fusion Transformer). I am getting the error below when I try to train the model; since I am new to TensorFlow, I can't understand ...
Navneet
1 vote
0 answers
448 views

Are the WQ, WK, WV matrices used to generate the query, key, and value vectors for attention in Transformers fixed, or do they depend on the input word?

To calculate self-attention, for each word we create a query vector, a key vector, and a value vector. These vectors are created by multiplying the embedding by three matrices that we trained during ...
Vinay Sharma
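
In short, WQ, WK and WV are learned parameters that are fixed after training, while the query, key and value vectors change with every input because they are the product of those fixed matrices and the input embeddings; a tiny sketch with made-up sizes:

    import torch

    torch.manual_seed(0)
    d_model, d_head = 8, 4

    # Fixed (trained) projection matrices: the same for every input word.
    W_q = torch.randn(d_model, d_head)
    W_k = torch.randn(d_model, d_head)
    W_v = torch.randn(d_model, d_head)

    word_a = torch.randn(d_model)      # embedding of one word
    word_b = torch.randn(d_model)      # embedding of another word

    # The resulting query vectors differ per word even though W_q does not change.
    print(word_a @ W_q)
    print(word_b @ W_q)
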
0 votes
1 answer
187 views

Running speed of PyTorch MultiheadAttention compared to Torchvision MViT

I am currently experimenting with my model, which uses Torchvision implementation of MViT_v2_s as backbone. I added a few cross attention modules to the model which looks roughly like this: class ...
whz
  • 11
3 votes
2 answers
4k views

How to read a BERT attention weight matrix?

I have extracted the attention score/weight matrix from the last layer and the last attention head of my BERT model. However, I am not too sure how to read it. The matrix is the following one. I ...
Chiara
  • 470
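
With the Hugging Face transformers API, each attention tensor has shape (batch, num_heads, seq_len, seq_len), so row i of the last layer's last head shows how token i (the query) distributes attention over the columns (the keys). A sketch, assuming bert-base-uncased and an arbitrary example sentence:

    import torch
    from transformers import AutoTokenizer, AutoModel

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

    inputs = tok("the cat sat on the mat", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)

    last_layer = out.attentions[-1]        # (batch, heads, seq, seq)
    last_head = last_layer[0, -1]          # (seq, seq): rows = queries, columns = keys
    tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0])
    for t, row in zip(tokens, last_head):
        print(t, [round(w, 2) for w in row.tolist()])
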
1 vote
0 answers
189 views

How to access the value projection of the MultiheadAttention layer in PyTorch

I'm making my own implementation of the Graphormer architecture. Since this architecture needs to add an edge-based bias to the output of the key-query multiplication in the self-attention mechanism ...
Angelo
  • 645
0 votes
0 answers
836 views

How to add a multihead attention layer to a CNN-LSTM model?

I'm trying to make a hybrid binary text classification model using a multi-head attention mechanism with CNN-LSTM. However, I'm facing an issue when trying to pass the values obtained from CNN-LSTM to ...
Harsha Vardhan
1 vote
1 answer
762 views

Multi-head attention calculation

I create a model with a multi-head attention layer: import torch import torch.nn as nn query = torch.randn(2, 4) key = torch.randn(2, 4) value = torch.randn(2, 4) model = nn.MultiheadAttention(4, 1, ...
apostofes
  • 3,661
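
With num_heads=1 the layer's output can be reproduced by hand from its own weights, which is a handy way to sanity-check the calculation; a sketch (bias=False is an assumption, not the poster's elided arguments):

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    E = 4
    mha = nn.MultiheadAttention(E, num_heads=1, bias=False)

    query = torch.randn(2, E)                 # unbatched: (seq_len, embed_dim)
    key = torch.randn(2, E)
    value = torch.randn(2, E)

    out, _ = mha(query, key, value)

    # Manual computation using the layer's stacked in-projection weights.
    w_q, w_k, w_v = mha.in_proj_weight.chunk(3, dim=0)
    q, k, v = query @ w_q.T, key @ w_k.T, value @ w_v.T
    attn = torch.softmax(q @ k.T / E ** 0.5, dim=-1)   # head_dim == E when num_heads=1
    manual = attn @ v @ mha.out_proj.weight.T

    print(torch.allclose(out, manual, atol=1e-6))      # True
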