ViT Explained
Department of Intelligent Mechatronics Engineering
Yoon Jeong-Hyun(윤정현)
1. Theory
Main architecture
Main architecture explained
2. [CLS] Token
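- As with BERT's [class] token, a learnable embedding is prepended to the sequence of patch embeddings ($z_0^0 = x_{class}$)
- $z_L^0$ : the 0th token at the output of the final ($L$-th) layer
-> It serves as the image representation; the classification head is attached to it during both pre-training and fine-tuning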
Main architecture explained
- $x_{class}$ : the classification token
- $x_p^i E$ : the $i$-th flattened image patch $x_p^i$, projected by the patch embedding matrix $E$
- $E_{pos}$ : the position embedding, which encodes where each patch sits in the sequence
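The embedding step defined by the symbols above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `embed_patches`, the toy sizes, and the random weights are all assumptions made for the example, and the learned parameters ($E$, $x_{class}$, $E_{pos}$) are simply drawn at random.

```python
import numpy as np

def embed_patches(image, patch_size, E, x_class, E_pos):
    """Build z_0 = [x_class; x_p^1 E; ...; x_p^N E] + E_pos for one image."""
    H, W, C = image.shape
    P = patch_size
    # Cut the image into non-overlapping P x P patches and flatten each one.
    patches = image.reshape(H // P, P, W // P, P, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)  # (N, P*P*C)
    tokens = patches @ E                          # project each patch: (N, D)
    z0 = np.vstack([x_class[None, :], tokens])    # prepend class token: (N+1, D)
    return z0 + E_pos                             # add position embedding

# Toy sizes (hypothetical): a 32x32 RGB image, 16x16 patches, width D = 8.
rng = np.random.default_rng(0)
H = W = 32
P, C, D = 16, 3, 8
N = (H // P) * (W // P)                           # number of patches
image = rng.normal(size=(H, W, C))
E = rng.normal(size=(P * P * C, D))               # stands in for the learned projection
x_class = rng.normal(size=D)                      # stands in for the learned class token
E_pos = rng.normal(size=(N + 1, D))               # stands in for the learned position embedding

z0 = embed_patches(image, P, E, x_class, E_pos)
print(z0.shape)  # (5, 8)
```

With four patches plus the class token, the sequence passed to the encoder has five tokens.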
Transformer Encoder
- $z_{\ell-1}$ : the output of the previous layer
- LN : Layer Normalization
- MSA : Multi-head Self-Attention (several attention heads run in parallel, so a single layer can capture different relationships between patches)
- It also contains a residual connection (the $+z_{\ell-1}$ term) that adds the block's input back to its output, which helps gradients flow through a deep stack of layers.
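The attention half of the encoder block, $z'_\ell = MSA(LN(z_{\ell-1})) + z_{\ell-1}$, can be sketched as below. The names `layer_norm`, `attention_sublayer`, and `placeholder_msa` are assumptions for illustration. The layer norm omits the learned scale and shift, and the MSA is replaced by a stand-in linear map so that only the normalize, transform, add-back structure is shown.

```python
import numpy as np

def layer_norm(z, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance.
    The learned scale and shift parameters are omitted in this sketch."""
    mean = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mean) / np.sqrt(var + eps)

def attention_sublayer(z_prev, msa):
    """Compute z' = MSA(LN(z_prev)) + z_prev (pre-norm with a residual connection)."""
    return msa(layer_norm(z_prev)) + z_prev

rng = np.random.default_rng(0)
z_prev = rng.normal(size=(5, 8))                  # 5 tokens, width 8 (toy sizes)

# Stand-in for MSA: any function mapping (tokens, D) -> (tokens, D) fits here.
W_stub = rng.normal(size=(8, 8)) * 0.1
def placeholder_msa(x):
    return x @ W_stub

z_prime = attention_sublayer(z_prev, placeholder_msa)
print(z_prime.shape)  # (5, 8)
```

In the full encoder block a second sublayer of the same shape follows, with an MLP in place of MSA.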
Calculate Attention
Output combining all resulting vectors
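The two steps named here, calculating attention and combining the resulting vectors, can be sketched as scaled dot-product attention computed per head, followed by concatenation and an output projection. This is a minimal sketch under assumptions: `multi_head_self_attention`, `W_qkv`, and `W_out` are hypothetical names, biases are left out, and the weights are random rather than learned.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)       # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(z, W_qkv, W_out, num_heads):
    """Return (output, attention_weights) for a token sequence z of shape (n, d)."""
    n, d = z.shape
    d_head = d // num_heads
    q, k, v = np.split(z @ W_qkv, 3, axis=-1)     # queries, keys, values: each (n, d)

    def split_heads(x):
        # (n, d) -> (heads, n, d_head)
        return x.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    # Calculate attention: similarity of every token to every other token.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (heads, n, d_head)
    # Combine all resulting vectors: concatenate the heads, then project.
    concat = heads.transpose(1, 0, 2).reshape(n, d)
    return concat @ W_out, weights

rng = np.random.default_rng(0)
n, d, num_heads = 5, 8, 2                         # toy sizes
z = rng.normal(size=(n, d))
W_qkv = rng.normal(size=(d, 3 * d))
W_out = rng.normal(size=(d, d))
out, weights = multi_head_self_attention(z, W_qkv, W_out, num_heads)
print(out.shape, weights.shape)  # (5, 8) (2, 5, 5)
```

Each row of `weights` sums to 1, so every output token is a weighted average of the value vectors within its head.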
Find the class of the image
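Classification reads only the 0th token of the final layer, $z_L^0$. The sketch below assumes a single linear layer as the head (`W_head` and `b_head` are hypothetical names with random values) and a layer norm without learned parameters; the encoder output `z_L` is random stand-in data, not the output of a trained model.

```python
import numpy as np

def layer_norm(v, eps=1e-6):
    """Normalize one token vector (learned scale and shift omitted)."""
    return (v - v.mean()) / np.sqrt(v.var() + eps)

def classify(z_L, W_head, b_head):
    """Predict a class from the final-layer tokens z_L of shape (tokens, D)."""
    y = layer_norm(z_L[0])                        # image representation from the class token
    logits = y @ W_head + b_head                  # assumed single linear head
    exp = np.exp(logits - logits.max())           # softmax, shifted for stability
    probs = exp / exp.sum()
    return int(np.argmax(probs)), probs

rng = np.random.default_rng(0)
num_tokens, D, num_classes = 5, 8, 3              # toy sizes
z_L = rng.normal(size=(num_tokens, D))            # stand-in for the encoder output
W_head = rng.normal(size=(D, num_classes))
b_head = np.zeros(num_classes)

predicted, probs = classify(z_L, W_head, b_head)
print(predicted, probs.round(3))
```

The patch tokens in `z_L[1:]` do not enter the head directly; they influence the prediction only through the attention layers that mix them into the class token.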
To be noted and contributions
2. Applications
ViT Applications