ViT Explained
Department of Intelligent Mechatronics Engineering
22110343
Yoon Jeong-Hyun(윤정현)
1. Theory
Introduction
Main architecture
Main architecture explained
2. [CLS] Token
- Like BERT's [class] token, a trainable embedding is prepended to the patch sequence (z_0^0 = x_class)
- z_L^0 : the 0th token of the final (Lth) layer's output
-> it serves as the image representation fed to the classification head in both pre-training and fine-tuning (see the sketch below)
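A minimal sketch of this step (assuming PyTorch; the tensor sizes are illustrative ViT-Base values, not taken from the slides): a trainable [CLS] embedding is prepended to the sequence of patch embeddings, and its position-0 output is what later drives classification.

```python
# Minimal sketch (assumed PyTorch, illustrative shapes): prepend a learnable
# [CLS] token to the patch-embedding sequence, as in BERT/ViT.
import torch
import torch.nn as nn

batch, num_patches, dim = 8, 196, 768              # assumed ViT-Base sizes (224x224 image, 16x16 patches)
cls_token = nn.Parameter(torch.zeros(1, 1, dim))   # trainable embedding x_class

patch_embeddings = torch.randn(batch, num_patches, dim)  # stand-in for x_p E
cls = cls_token.expand(batch, -1, -1)                     # one copy per image
z0 = torch.cat([cls, patch_embeddings], dim=1)            # z_0 = [x_class; x_p^1 E; ...; x_p^N E]
print(z0.shape)  # torch.Size([8, 197, 768]); z0[:, 0] is the classification token
```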
Main architecture explained
Input
- x_class : the learnable classification token
- x_p E : each flattened image patch, projected by the embedding matrix E
- E_pos : the positional embedding that encodes where each patch sits in the sequence (sketch below)
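A minimal sketch of the input embedding (assuming PyTorch; patch size 16, embedding dimension 768, and image size 224 are illustrative assumptions): a strided convolution plays the role of the linear projection E, and pos_embed stands in for E_pos.

```python
# Minimal sketch (assumed PyTorch): patch embedding x_p E plus positional embedding E_pos.
import torch
import torch.nn as nn

patch, dim, img = 16, 768, 224
num_patches = (img // patch) ** 2                              # 14 * 14 = 196

proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)      # linear projection E over 16x16 patches
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim)) # E_pos (includes the [CLS] slot)

x = torch.randn(2, 3, img, img)                                # a batch of images
patches = proj(x).flatten(2).transpose(1, 2)                   # (2, 196, 768) = x_p E
# after prepending the [CLS] token (previous sketch), add the positional embedding:
# z0 = torch.cat([cls, patches], dim=1) + pos_embed
```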
Transformer Encoder
- z_{l-1} : the output of the previous layer
- LN : Layer Normalization
- MSA : Multi-head Self-Attention layers (each head attends to the image differently, so one attention layer can capture several relationships at once)
- Each sub-layer is wrapped in a residual (skip) connection (the +z_{l-1} term) that adds the input back to the output, which helps the model train well (sketch below)
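A minimal sketch of one encoder block (assuming PyTorch and the pre-norm layout used by ViT; the class name EncoderBlock and its default sizes are illustrative): LN is applied before MSA and before the MLP, and each sub-layer's output is added back to its input (the +z term).

```python
# Minimal sketch (assumed PyTorch) of one pre-norm encoder block:
# z'_l = MSA(LN(z_{l-1})) + z_{l-1},   z_l = MLP(LN(z'_l)) + z'_l
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)                      # LN = Layer Normalization
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, z):
        h = self.ln1(z)
        z = z + self.msa(h, h, h, need_weights=False)[0]  # residual (+z) around MSA
        z = z + self.mlp(self.ln2(z))                     # residual (+z) around MLP
        return z

z = torch.randn(2, 197, 768)
print(EncoderBlock()(z).shape)                            # torch.Size([2, 197, 768])
```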
Calculate attention → combine all resulting vectors into the output → find the class of the image (sketch below)
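A minimal sketch of the final classification step (assuming PyTorch; the variable names and sizes are illustrative): the 0th token of the last layer's output, z_L^0, is layer-normalized and passed through a linear head to produce class logits.

```python
# Minimal sketch (assumed PyTorch): classify from the final [CLS] token z_L^0.
import torch
import torch.nn as nn

dim, num_classes = 768, 1000            # assumed ViT-Base / ImageNet-1k sizes
ln = nn.LayerNorm(dim)
head = nn.Linear(dim, num_classes)      # classification head

z_L = torch.randn(2, 197, dim)          # output of the last encoder layer
logits = head(ln(z_L)[:, 0])            # use only the 0th ([CLS]) token
pred = logits.argmax(dim=-1)            # predicted class per image
print(pred.shape)                       # torch.Size([2])
```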
Points to note and contributions
2. Applications
ViT Applications