ViT Explained


ViT’s theory and applications

Department of Intelligent
Mechatronics Engineering
22110343
Yoon Jeong-Hyun(윤정현)
1. Theory
Introduction

Transformer is a novel model that first drew the spotlight in NLP (Natural Language Processing).


The model sees the meaning of a sentence as a combination of attention over its words,
not just as a sequential array of words.
Transformer consists of 1) an Encoder block, which maps sentences into latent representations,
and 2) a Decoder block, which decodes those encoded representations.
Introduction

It rose to SOTA, beating many conventional RNN-based networks.


Many thought that the same could happen in images and computer vision.
Indeed, a game changer called ViT (Vision Transformer), released in 2020 and presented at ICLR 2021,
disrupted the established models, outperforming the previous SOTAs.
Introduction

ViT has upended the CNN (Convolutional Neural Network)-based image classification world


by 1) splitting an image into fixed-size patches,
2) adding position embeddings,
and 3) feeding the resulting sequence of vectors to a standard Transformer encoder.

Main architecture

Basic understanding of the model


1. Split an image into fixed-size patches
2. Linearly embed each of them
3. Add position embeddings (i.e., give each patch an embedding that encodes its position in the original sequence)
4. Feed the resulting vectors into a Transformer encoder
5. A learnable classification token is prepended to the embedded patches, and its output is used to classify the original image (a minimal sketch follows below)
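A minimal PyTorch sketch of these five steps (an illustrative reimplementation with assumed sizes such as patch_size=16 and embed_dim=768, not the authors' code):

```python
import torch
import torch.nn as nn

class SimpleViT(nn.Module):
    """Minimal ViT-style classifier: patches -> linear embedding -> +[CLS] -> +positions -> encoder -> head."""
    def __init__(self, image_size=224, patch_size=16, in_chans=3,
                 embed_dim=768, depth=12, num_heads=12, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2            # N = HW / P^2
        patch_dim = in_chans * patch_size * patch_size           # P^2 * C values per flattened patch

        self.patch_size = patch_size
        self.proj = nn.Linear(patch_dim, embed_dim)              # trainable linear projection E
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))                 # learnable [CLS] token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))   # learnable position embeddings

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            activation="gelu", batch_first=True, norm_first=True)  # pre-LN, as in ViT
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)              # classification head on the [CLS] output

    def forward(self, x):                                          # x: (B, C, H, W)
        B, C, H, W = x.shape
        P = self.patch_size
        # 1. split into fixed-size P x P patches and flatten each into a vector of length P^2 * C
        patches = x.unfold(2, P, P).unfold(3, P, P)                # (B, C, H/P, W/P, P, P)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)
        tokens = self.proj(patches)                                 # 2. linear embedding
        cls = self.cls_token.expand(B, -1, -1)
        tokens = torch.cat([cls, tokens], dim=1)                    # 5. prepend the [CLS] token
        tokens = tokens + self.pos_embed                            # 3. add position embeddings
        out = self.encoder(tokens)                                  # 4. Transformer encoder
        return self.head(out[:, 0])                                 # classify from the [CLS] output

# Usage: logits = SimpleViT()(torch.randn(2, 3, 224, 224))  ->  shape (2, 1000)
```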

Main architecture explained

1. Input Embedding (Linear projection of flattened patches)


: Turn the image into a 1D sequence of token embeddings
- 2D image x ∈ ℝ^(H×W×C) is reshaped into flattened patches x_p ∈ ℝ^(N×(P²·C))
- (P, P) : resolution of each image patch
- N : the number of patches = HW/P²
- D : the latent vector size, kept the same for every layer
-> Each flattened patch is mapped into D dimensions by a trainable Linear Projection (E)
: returning the Patch Embedding

2. [CLS] Token
- Like BERT's [class] token, a trainable embedding is prepended to the patch sequence (z_0^0 = x_class)
- z_L^0 : the 0th token of the final L-th layer
-> a classification head is attached to it during both pre-training and fine-tuning
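As a concrete example of these shapes (assuming the common ViT-Base setting of 224×224 RGB images and 16×16 patches; these numbers are not stated on this slide):

```python
H = W = 224              # image height and width (assumed)
C = 3                    # RGB channels
P = 16                   # patch resolution (assumed)

N = (H * W) // (P * P)   # number of patches: 224*224 / 16^2 = 196 tokens
patch_dim = P * P * C    # flattened patch length: 16*16*3 = 768 values per patch
D = 768                  # latent size D; the projection E maps patch_dim -> D for every patch
print(N, patch_dim, D)   # 196 768 768
```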

Main architecture explained

Input
- x_class : the classification token
- x_p^i E : the i-th image patch, flattened and projected by E
- E_pos : the position embedding that encodes the sequential information of the patches

Transformer Encoder
- z_{ℓ-1} : the vector computed by the previous layer
- LN : Layer Normalization
- MSA : Multi-head Self-Attention layers (multiple heads let one attention layer learn several different relationships within the image)
- It also contains residual connections (the + z_{ℓ-1} term) that skip over each sub-layer so that the model trains well.

Calculate Attention
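The attention computed inside each head is the standard scaled dot-product self-attention; a sketch of the general formulation (not copied from this slide) is:

$$[q,\,k,\,v] = z\,U_{qkv}, \qquad A = \mathrm{softmax}\!\left(\frac{q k^{\top}}{\sqrt{D_h}}\right), \qquad \mathrm{SA}(z) = A v$$

where D_h is the per-head dimension, and MSA concatenates the outputs of all heads before a final projection.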

MLP (Multi-Layer Perceptron) block and MLP Head


- z'_ℓ : the output of the MSA sub-layer, which is fed to the MLP block
- Likewise, the MLP block is wrapped in a residual connection, as above.

The final output for the [CLS] token, which combines information from all the resulting vectors, goes to the MLP head to find the class of the image.
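Putting the bullet points above together, the per-layer equations take the form given in the ViT paper:

$$z_0 = [x_{class};\; x_p^1 E;\; \dots;\; x_p^N E] + E_{pos}$$
$$z'_\ell = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \qquad \ell = 1, \dots, L$$
$$z_\ell = \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell, \qquad \ell = 1, \dots, L$$
$$y = \mathrm{LN}(z_L^0)$$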

To be noted and contributions

- It performs very well only if it is pre-trained on a very large number of training images.


(The authors pre-trained it on roughly 300 million images from Google's internal JFT-300M dataset.)
When pre-trained on fewer images, its performance drops.

- The model outperforms other CNN-based SOTA models.


Not only that, it reaches these results while requiring fewer computational resources for fine-tuning.

- Like other Transformer models, it has no fixed maximum number of parameters.


So as long as GPU resources allow, the model can keep being scaled up, with no clear sign of performance saturation.

2. Applications
ViT Applications

1. Image Classification (Image -> label)


2. Image Captioning (Image -> Sentence)
3. Contrastive Language-Image Pre-Training (Image <-> Text Snippet)

ViT Applications

1. Image Classification (Image -> label)
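For example, using the Hugging Face transformers library with a publicly released ViT checkpoint (an illustrative usage, not part of the original slides), classification looks roughly like this:

```python
from PIL import Image
import requests
import torch
from transformers import ViTImageProcessor, ViTForImageClassification

# ImageNet-pretrained ViT-Base checkpoint released by Google
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"   # sample photo
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")             # resize, normalize, tensorize

with torch.no_grad():
    logits = model(**inputs).logits                                # (1, 1000) ImageNet class scores
print(model.config.id2label[logits.argmax(-1).item()])             # predicted label
```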

ViT Applications

2. Image Captioning (Image -> Sentence)
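A hedged sketch of captioning with a ViT encoder paired with a GPT-2 decoder (using a public community checkpoint; the local file name is a placeholder):

```python
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

name = "nlpconnect/vit-gpt2-image-captioning"      # ViT encoder + GPT-2 decoder fine-tuned for captioning
model = VisionEncoderDecoderModel.from_pretrained(name)
processor = ViTImageProcessor.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)

image = Image.open("example.jpg")                   # any local photo (placeholder path)
pixel_values = processor(images=image, return_tensors="pt").pixel_values
output_ids = model.generate(pixel_values, max_new_tokens=20)          # decode a short caption
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))      # Image -> Sentence
```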

ViT Applications

3. Contrastive Language-Image Pre-Training

Association between an image and a piece of text.


- This can be achieved by training two separate Transformer encoders, one for the text snippet and one for the image.
- The encoded images and texts are then compared by constructing a cosine-similarity matrix between the two sets of embeddings (a sketch follows below).
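A minimal sketch of that comparison step (function names and the temperature value are illustrative assumptions; the two encoders are assumed to output one embedding vector per image or text):

```python
import torch
import torch.nn.functional as F

def clip_similarity(image_features: torch.Tensor, text_features: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Return the (num_images x num_texts) cosine-similarity matrix used in CLIP-style training."""
    img = F.normalize(image_features, dim=-1)   # unit-length image embeddings
    txt = F.normalize(text_features, dim=-1)    # unit-length text embeddings
    return img @ txt.t() / temperature          # cosine similarities, scaled by a temperature

def clip_loss(sim: torch.Tensor) -> torch.Tensor:
    """Symmetric cross-entropy over a square batch: matching (i, i) image-text pairs lie on the diagonal."""
    labels = torch.arange(sim.size(0), device=sim.device)
    return (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels)) / 2
```

Training pushes the diagonal (matching pairs) up and the off-diagonal similarities down, which is what associates an image with its text snippet.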

