ViT Explained


ViT’s theory and applications

Department of Intelligent
Mechatronics Engineering
22110343
Yoon Jeong-Hyun(윤정현)
1. Theory
Introduction

Transformer is a novel model that first drew the spotlight in NLP (Natural Language Processing).


The model sees the meaning of a sentence as a combination of attention over its words,
not just as a sequential array of words.
Transformer consists of 1) an Encoder block, which maps sentences into latent representations,
and 2) a Decoder block, which decodes those encoded representations.
Introduction

It rose to SOTA, beating many conventional RNN-based networks.


Many thought that the same could happen in images and computer vision.
Indeed, a game changer called ViT (Vision Transformer), released in 2020 and presented at ICLR 2021,
disrupted the established models, outperforming the previous SOTAs.
Introduction

ViT has upended the CNN (Convolutional Neural Network)-based image classification world


by 1) splitting an image into fixed-size patches,
2) adding position embeddings,
and 3) feeding the resulting sequence of vectors to a standard Transformer encoder.

Main architecture

Basic understanding of the model


1. Split an image into fixed-size patches
2. Linearly embed each of them
3. Add position embeddings (i.e., give each patch an embedding that encodes its position in the original sequence)
4. Feed the resulting vectors into a Transformer encoder
5. A learnable classification token is prepended to the embedded patches, and its output is used to classify the original image (a minimal sketch follows below)
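A minimal PyTorch sketch of these five steps (an illustrative reimplementation with assumed sizes such as patch_size=16 and embed_dim=768, not the authors' code):

```python
import torch
import torch.nn as nn

class SimpleViT(nn.Module):
    """Minimal ViT-style classifier: patches -> linear embedding -> +[CLS] -> +positions -> encoder -> head."""
    def __init__(self, image_size=224, patch_size=16, in_chans=3,
                 embed_dim=768, depth=12, num_heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2            # N = HW / P^2
        patch_dim = in_chans * patch_size * patch_size           # P^2 * C values per flattened patch

        self.patch_size = patch_size
        self.proj = nn.Linear(patch_dim, embed_dim)              # trainable linear projection E
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))                 # learnable [CLS] token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))   # learnable position embeddings

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            activation="gelu", batch_first=True, norm_first=True)  # pre-LN, as in ViT
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)              # classification head on the [CLS] output

    def forward(self, x):                                          # x: (B, C, H, W)
        B, C, H, W = x.shape
        P = self.patch_size
        # 1. split into fixed-size P x P patches and flatten each into a vector of length P^2 * C
        patches = x.unfold(2, P, P).unfold(3, P, P)                # (B, C, H/P, W/P, P, P)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)
        tokens = self.proj(patches)                                 # 2. linear embedding
        cls = self.cls_token.expand(B, -1, -1)
        tokens = torch.cat([cls, tokens], dim=1)                    # 5. prepend the [CLS] token
        tokens = tokens + self.pos_embed                            # 3. add position embeddings
        out = self.encoder(tokens)                                  # 4. Transformer encoder
        return self.head(out[:, 0])                                 # classify from the [CLS] output

# Usage: logits = SimpleViT()(torch.randn(2, 3, 224, 224))  ->  shape (2, 1000)
```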

Main architecture explained

1. Input Embedding (Linear projection of flattened patches)


: Turn the image into a 1D sequence of token embeddings
- 2D image x ∈ ℝ^(H×W×C) is reshaped into flattened patches x_p ∈ ℝ^(N×(P²·C))
- (P, P) : resolution of each image patch
- N : the number of patches = HW/P²
- D : the latent vector size, kept the same for every layer
-> Each flattened patch is mapped into D dimensions by a trainable Linear Projection (E)
: returning the Patch Embedding

2. [CLS] Token
- Like BERT's [class] token, a trainable embedding is prepended to the patch sequence (z_0^0 = x_class)
- z_L^0 : the 0th token of the final L-th layer
-> a classification head is attached to it during both pre-training and fine-tuning
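As a concrete example of these shapes (assuming the common ViT-Base setting of 224×224 RGB images and 16×16 patches; these numbers are not stated on this slide):

```python
H = W = 224              # image height and width (assumed)
C = 3                    # RGB channels
P = 16                   # patch resolution (assumed)

N = (H * W) // (P * P)   # number of patches: 224*224 / 16^2 = 196 tokens
patch_dim = P * P * C    # flattened patch length: 16*16*3 = 768 values per patch
D = 768                  # latent size D; the projection E maps patch_dim -> D for every patch
print(N, patch_dim, D)   # 196 768 768
```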

Main architecture explained

Input
- x_class : the classification token
- x_p^i E : the i-th image patch, flattened and projected by E
- E_pos : the position embedding that encodes the sequential information of the patches

Transformer Encoder
- z_{ℓ-1} : the vector computed by the previous layer
- LN : Layer Normalization
- MSA : Multi-head Self-Attention layers (multiple heads let one attention layer learn several different relationships within the image)
- It also contains residual connections (the + z_{ℓ-1} term) that skip over each sub-layer so that the model trains well.

Calculate Attention
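The attention computed inside each head is the standard scaled dot-product self-attention; a sketch of the general formulation (not copied from this slide) is:

$$[q,\,k,\,v] = z\,U_{qkv}, \qquad A = \mathrm{softmax}\!\left(\frac{q k^{\top}}{\sqrt{D_h}}\right), \qquad \mathrm{SA}(z) = A v$$

where D_h is the per-head dimension, and MSA concatenates the outputs of all heads before a final projection.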

MLP (Multi-Layer Perceptron) block and MLP Head


- z'_ℓ : the output of the MSA sub-layer, which is fed to the MLP block
- Likewise, the MLP block is wrapped in a residual connection, as above.

The final output for the [CLS] token, which combines information from all the resulting vectors, goes to the MLP head to find the class of the image.
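Putting the bullet points above together, the per-layer equations take the form given in the ViT paper:

$$z_0 = [x_{class};\; x_p^1 E;\; \dots;\; x_p^N E] + E_{pos}$$
$$z'_\ell = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \qquad \ell = 1, \dots, L$$
$$z_\ell = \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell, \qquad \ell = 1, \dots, L$$
$$y = \mathrm{LN}(z_L^0)$$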

To be noted and contributions

- It performs very well only if it is pre-trained on a very large number of training images.


(The authors pre-trained it on roughly 300 million images from Google's internal JFT-300M dataset.)
When pre-trained on fewer images, its performance drops.

- The model outperforms other CNN-based SOTA models.


Not only that, it reaches these results while requiring fewer computational resources for fine-tuning.

- Like other Transformer models, it has no fixed maximum number of parameters.


So as long as GPU resources allow, the model can keep being scaled up, with no clear sign of performance saturation.

2. Applications
ViT Applications

1. Image Classification (Image -> label)


2. Image Captioning (Image -> Sentence)
3. Contrastive Language-Image Pre-Training (Image <-> Text Snippet)

ViT Applications

1. Image Classification (Image -> label)
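For example, using the Hugging Face transformers library with a publicly released ViT checkpoint (an illustrative usage, not part of the original slides), classification looks roughly like this:

```python
from PIL import Image
import requests
import torch
from transformers import ViTImageProcessor, ViTForImageClassification

# ImageNet-pretrained ViT-Base checkpoint released by Google
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"   # sample photo
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")             # resize, normalize, tensorize

with torch.no_grad():
    logits = model(**inputs).logits                                # (1, 1000) ImageNet class scores
print(model.config.id2label[logits.argmax(-1).item()])             # predicted label
```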

ViT Applications

2. Image Captioning (Image -> Sentence)
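A hedged sketch of captioning with a ViT encoder paired with a GPT-2 decoder (using a public community checkpoint; the local file name is a placeholder):

```python
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

name = "nlpconnect/vit-gpt2-image-captioning"      # ViT encoder + GPT-2 decoder fine-tuned for captioning
model = VisionEncoderDecoderModel.from_pretrained(name)
processor = ViTImageProcessor.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)

image = Image.open("example.jpg")                   # any local photo (placeholder path)
pixel_values = processor(images=image, return_tensors="pt").pixel_values
output_ids = model.generate(pixel_values, max_new_tokens=20)          # decode a short caption
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))      # Image -> Sentence
```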

ViT Applications

3. Contrastive Language-Image Pre-Training

Association between an image and a piece of text.


- This can be achieved by training two separate Transformer encoders, one for the text snippet and one for the image.
- The encoded images and texts are then compared by constructing a cosine-similarity matrix between the two sets of embeddings (a sketch follows below).
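A minimal sketch of that comparison step (function names and the temperature value are illustrative assumptions; the two encoders are assumed to output one embedding vector per image or text):

```python
import torch
import torch.nn.functional as F

def clip_similarity(image_features: torch.Tensor, text_features: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Return the (num_images x num_texts) cosine-similarity matrix used in CLIP-style training."""
    img = F.normalize(image_features, dim=-1)   # unit-length image embeddings
    txt = F.normalize(text_features, dim=-1)    # unit-length text embeddings
    return img @ txt.t() / temperature          # cosine similarities, scaled by a temperature

def clip_loss(sim: torch.Tensor) -> torch.Tensor:
    """Symmetric cross-entropy over a square batch: matching (i, i) image-text pairs lie on the diagonal."""
    labels = torch.arange(sim.size(0), device=sim.device)
    return (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels)) / 2
```

Training pushes the diagonal (matching pairs) up and the off-diagonal similarities down, which is what associates an image with its text snippet.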

