Literature Review On Image Classification Architecture
Sowvik Sarker (ID: 17-33228-1), Mainul Islam Mahi (ID: 18-38468-2), MD. Rezvi Khalid Hridoy (ID: 18-38472-2), Abir Hassan (ID: 18-39206-3)
Department of Computer Science, American International University-Bangladesh
Abstract
Convolutional neural networks (CNNs) have been used to solve visual tasks since the late 1980s. Despite a few dispersed applications, CNNs lay largely unused until the mid-2000s, when advances in machine learning and the availability of large amounts of labeled data, combined with better algorithms, catapulted them to the leading edge of a neural network rebirth that has seen rapid advances since 2012. We also examine ways of scaling up networks that make the most of the additional computation, for example through appropriately factorized convolutions and strong regularization. This literature review helps us determine which image classification architecture to use in different scenarios. We study five image classification architectures, U-Net, VGGNet-19, ResNet, DenseNet, and Inception V3, and also introduce some of their current trends and remaining challenges.
Keywords: Deep learning, Computer Vision, Object detection, NN, CNN
1. Introduction
2. Literature review
2.1. U-Net
Ronneberger et al. [2015] proposed the U-Net model, so named because of its U-shaped architecture. The left side of this model is called the contracting path and the right side the expansive path, and four concatenations connect the expansive path to its corresponding stages of the contracting path. The contracting path starts with a one-channel input image of (572*572) pixels. The network then applies an unpadded convolution with a (3*3) kernel two times. Because the convolution is unpadded, the image shrinks to (570*570) pixels in the first step and (568*568) in the second. For these two convolutions, the channel number is set to 64. The next step begins with (2*2) max-pooling, which halves the spatial size to (284*284), and again the unpadded convolution is applied twice as in the previous step, but this time with the channel number increased from 64 to 128. This continues three more times and finishes at (28*28) pixels with 1024 channels, which is the end of the contracting path and the start of the expansive path. From this step, up-convolution takes the place of max pooling and reverses its effect. The (2*2) up-convolution doubles the spatial size from (28*28) to (56*56) while halving the channel count from 1024 to 512, and the result is concatenated with the corresponding feature map from the contracting path, giving 1024 channels (512+512). Along the expansive path, the unpadded (3*3) convolution is again applied two times, reducing the spatial size to (52*52) and the channel count to 512. Up-convolution followed by concatenation then repeats this step three more times. The network completes its expansive path with a 64-channel feature map of (388*388) pixels. Finally, a (1*1) convolution reduces the channels from 64 to 2, so the model outputs a two-channel image of (388*388) pixels.
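To make the crop-and-concatenate arithmetic above concrete, here is a minimal two-level sketch of one contracting step and one expansive step. PyTorch is assumed purely for illustration (the paper does not prescribe a framework), and double_conv and center_crop are hypothetical helper names; the full U-Net repeats the pooling/up-convolution pattern four times rather than once.

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two unpadded 3x3 convolutions: each shrinks height and width by 2.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3), nn.ReLU(inplace=True),
    )

def center_crop(feat, th, tw):
    # Crop a contracting-path map so it matches the expansive-path size.
    _, _, h, w = feat.shape
    dh, dw = (h - th) // 2, (w - tw) // 2
    return feat[:, :, dh:dh + th, dw:dw + tw]

enc1, enc2 = double_conv(1, 64), double_conv(64, 128)
pool = nn.MaxPool2d(2)                                  # 2x2 max pooling halves H and W
up = nn.ConvTranspose2d(128, 64, 2, stride=2)           # 2x2 up-conv, halves channels
dec1, head = double_conv(128, 64), nn.Conv2d(64, 2, 1)  # final 1x1 conv -> 2 channels

x = torch.randn(1, 1, 572, 572)                  # one-channel 572x572 input
c1 = enc1(x)                                     # (1, 64, 568, 568)
c2 = enc2(pool(c1))                              # (1, 128, 280, 280)
u1 = up(c2)                                      # (1, 64, 560, 560)
skip = center_crop(c1, *u1.shape[2:])            # crop 568 -> 560 before concatenating
out = head(dec1(torch.cat([skip, u1], dim=1)))   # (1, 2, 556, 556) in this two-level toy
```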
2.2. VGGNet-19
The input to a VGG-based ConvNet is a (224*224) RGB image. A preprocessing layer takes the RGB image with pixel values in the range 0-255 and subtracts the mean RGB value, which is calculated over the entire ImageNet training set. After preprocessing, the input images are passed through a stack of convolutional layers. VGG-19 has 19 weight layers, consisting of 16 convolutional layers and 3 fully connected layers, along with 5 max-pooling layers. In VGG-19 there are two fully connected layers with 4096 channels each, followed by another fully connected layer with 1000 channels to predict the 1000 labels. The last fully connected layer uses softmax for classification Simonyan and Zisserman [2015].
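As a concrete illustration of the preprocessing and classifier head described above, here is a minimal sketch, assuming PyTorch and the commonly quoted per-channel ImageNet RGB means used with VGG; vgg_preprocess is a hypothetical helper name.

```python
import torch
import torch.nn as nn

# Commonly quoted ImageNet per-channel means for R, G, B (an assumption here).
IMAGENET_RGB_MEAN = torch.tensor([123.68, 116.779, 103.939])

def vgg_preprocess(img):
    # img: (3, 224, 224) RGB tensor with pixel values in [0, 255].
    return img - IMAGENET_RGB_MEAN.view(3, 1, 1)  # subtract the training-set mean

# The VGG-19 head: two 4096-channel fully connected layers, then 1000 channels.
classifier = nn.Sequential(
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),               # one output per ImageNet label
)

x = vgg_preprocess(torch.randint(0, 256, (3, 224, 224)).float())
feats = torch.randn(1, 512 * 7 * 7)      # stand-in for the conv-stack output
probs = torch.softmax(classifier(feats), dim=1)  # softmax for classification
```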
2.3. ResNet
This network is built from many residual blocks, and its operating principle is to optimize a residual function He et al. [2016]. This architecture allows accuracy to keep improving as layer depth is increased. The authors proposed residual mapping to accommodate the added layers in their research. If we denote the underlying mapping by H(x), then F(x) := H(x) - x determines the residual mapping. The residual block computes y = F(x, {W_i}) + x when the input x and the output y = H(x) have the same dimensions; when the dimensions differ, a linear projection W_s is applied to the shortcut, giving y = F(x, {W_i}) + W_s x. All convolutional layers in ResNet models use the same convolutional window of size (3*3), and the number of filters rises with network depth, from 64 to 512 (for ResNet-18 and ResNet-34) and from 64 to 2048 (for ResNet-50, ResNet-101, and ResNet-152). Only one max-pooling layer, with pooling size (3*3) and stride 2, is used in all models, applied after the first layer; as a result, the resolution of the input is reduced only sparingly during training. An average pooling layer replaces the fully connected layers at the end of all models. This alternative has two advantages. First, there are no parameters to optimize in this layer, so it helps reduce model complexity. Second, this layer is more natural in enforcing correspondences between feature maps and categories. The number of neurons in the output layer corresponds to the number of categories in the ImageNet dataset, which is 1000, and a softmax activation function is used in this layer to calculate the likelihood that the input belongs to each class.
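The residual mapping above translates almost directly into code. Below is a minimal sketch of a basic (3*3)+(3*3) block of the kind used in ResNet-18/34, assuming PyTorch; the BasicBlock name is illustrative. The shortcut is the identity when x and F(x) have the same shape, and a 1x1 projection W_s otherwise.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # F(x, {W_i}): two 3x3 convolutions with batch normalization.
        self.f = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # W_s: identity when dimensions match, a 1x1 projection otherwise.
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        # y = F(x, {W_i}) + W_s x, followed by ReLU.
        return torch.relu(self.f(x) + self.shortcut(x))

block = BasicBlock(64, 128, stride=2)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 128, 28, 28])
```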
2.4. DenseNet
In DenseNet, each dense layer is factorized into two convolutional operations: a (1*1) CONV (a bottleneck that reduces the feature depth/channel count) and a (3*3) CONV (the standard convolution for extracting features). The growth rate (K=32 is used) is the number of channels output by a dense layer (the 1*1 conv followed by the 3*3 conv). This means that a dense layer (l) receives 32 new features from its preceding dense layer (l-1). Because each layer's 32 output channels are concatenated with its input and provided to the following layer, this quantity is referred to as the growth rate. With the same number of parameters, the DenseNet model has a considerably smaller validation error than the ResNet model. These tests were carried out on both models with hyper-parameters that were more suitable for ResNet; after rigorous hyper-parameter searches, the authors claim that DenseNet would perform even better Huang et al. [2017b].
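A minimal sketch of one such dense layer, assuming PyTorch, may help: the 1x1 bottleneck emits 4k channels (a common DenseNet-B choice), the 3x3 convolution emits the k = 32 new channels, and concatenation grows the running feature map, so layer l sees k0 + k(l-1) input channels.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    def __init__(self, in_ch, k=32):
        super().__init__()
        self.f = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, 4 * k, 1, bias=False),         # 1x1 bottleneck
            nn.BatchNorm2d(4 * k), nn.ReLU(inplace=True),
            nn.Conv2d(4 * k, k, 3, padding=1, bias=False),  # k new feature maps
        )

    def forward(self, x):
        # Concatenate the k new channels onto everything seen so far.
        return torch.cat([x, self.f(x)], dim=1)

x = torch.randn(1, 64, 32, 32)
layer1, layer2 = DenseLayer(64), DenseLayer(96)  # channels grow by k = 32 per layer
print(layer2(layer1(x)).shape)                   # torch.Size([1, 128, 32, 32])
```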
3. Discussion
Deep convolutional neural networks have succeeded at object recognition, detection, and localization, as well as a variety of other computer vision tasks. Despite all of the advancements demonstrated by the many proposed designs, there was little insight or principled reasoning about how they attained state-of-the-art records, leaving further improvements to trial-and-error tactics. For example, one of our benchmark findings was that the ResNet-152 architecture, with 152 layers of depth, outperformed the VGG-19 architecture Sachin [2016], which has only 19 layers. DenseNet designs have been found to be the best in terms of parameter efficiency, with up to 4x fewer parameters than the AlexNet model and 10x fewer than VGG-19 Muhammed et al. [2017].

U-Net has a wide range of applications in biomedical image segmentation, including brain and liver image segmentation. VGGNet-19 takes a (224*224) RGB image as input, subtracts the mean RGB value computed over the entire ImageNet training set from pixel values in the range 0-255, and passes the preprocessed images through its weight layers; it has various uses in the field of medical science. ResNet is made up of many residual blocks, and its operating principle is to optimize a residual function; this architecture allows greater accuracy as layer depth increases, and ResNet has seen wide use in agricultural science. DenseNet establishes paths between the network's layers: each layer in a dense block receives feature maps from all preceding layers and transfers its output to all following layers, owing to the network's feed-forward structure Szegedy et al. [2016b]. Concatenation is used to join feature maps from different layers (unlike ResNet, which sums them). Building on the idea of ResNet, dense connections have inspired optimizations in many other deep learning areas such as image super-resolution, image segmentation, medical diagnosis, and so on.

The Inception V3 model was introduced to further explore the Inception architecture. Inception V3 is a convolutional neural network architecture from the Inception family that uses label smoothing, factorized (7*7) convolutions, and an auxiliary classifier to propagate label information down the network. Inception V3 enables health experts to run sample tests to determine systemic diseases in patients; for example, research has analyzed systemic diseases through digital image processing methods based on color analysis of nails.
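Since label smoothing is the one Inception V3 ingredient with a one-line formula, a short sketch may clarify it. Assuming PyTorch and the eps = 0.1 reported by Szegedy et al. [2016b], each one-hot target q(k) becomes q'(k) = (1 - eps) q(k) + eps / K over K classes:

```python
import torch
import torch.nn.functional as F

def smooth_labels(targets, num_classes, eps=0.1):
    # Mix the one-hot target with a uniform distribution over the classes.
    one_hot = F.one_hot(targets, num_classes).float()
    return one_hot * (1.0 - eps) + eps / num_classes

targets = torch.tensor([3, 7])
print(smooth_labels(targets, num_classes=10))  # each row still sums to 1
```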
Architecture    ImageNet Top-1 Error
U-Net           22.50%
VGGNet-19       27.30%
ResNet          21.66%
DenseNet        25.53%
Inception V3    21.90%
Table 1: Comparison of accuracy for different architectures on ImageNet
Such analyses can indicate diseases such as liver cirrhosis Jaworek-Korjakowska et al. [2019].
4. Conclusion
One of the most well-known tasks in computer vision is image classification: given an image, classify it into one of several predefined categories. Image classification is a classic problem because of its wide range of applications. In the future, image classification systems may become an important component of accessibility software, assisting people with vision impairments in making sense of their environment. A literature review of image classification architectures is presented in this paper. It traces their growth and their contribution to the deep learning renaissance during the last few years, focusing in particular on progress in designs, supervisory components, regularization processes, optimization strategies, and computation. This paper does not rank architectures by popularity or by GPU/CPU performance, nor does it present experimental results with comparisons; these are the limitations of this paper and could be addressed in a future update.
9
References
Karpathy, A., et al. Cs231n convolutional neural networks for visual recog-
nition. Neural networks 2016;1(1).
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al.
Imagenet large scale visual recognition challenge. International journal of
computer vision 2015;115(3):211–252.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.. Rethinking
the inception architecture for computer vision. In: Proceedings of the
IEEE conference on computer vision and pattern recognition. 2016a, p.
2818–2826.
Ronneberger, O., Fischer, P., Brox, T.. U-net: Convolutional networks for
biomedical image segmentation. In: International Conference on Medical
image computing and computer-assisted intervention. Springer; 2015, p.
234–241.
Long, J., Shelhamer, E., Darrell, T.. Fully convolutional networks for se-
mantic segmentation. In: Proceedings of the IEEE conference on computer
vision and pattern recognition. 2015, p. 3431–3440.
Simonyan, K., Zisserman, A.. Very deep convolutional networks for large-
scale image recognition. 2015. arXiv:1409.1556.
Too, E.C., Yujian, L., Njuki, S., Yingchun, L.. A comparative study of fine-
tuning deep learning models for plant disease identification. Computers
and Electronics in Agriculture 2019;161:272–279.
Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.. Densely
connected convolutional networks. In: Proceedings of the IEEE conference
on computer vision and pattern recognition. 2017a, p. 4700–4708.
Steinkraus, D., Buck, I., Simard, P.. Using gpus for machine learning
algorithms. In: Eighth International Conference on Document Analysis
and Recognition (ICDAR’05). IEEE; 2005, p. 1115–1120.
He, K., Zhang, X., Ren, S., Sun, J.. Deep residual learning for image
recognition. In: Proceedings of the IEEE conference on computer vision
and pattern recognition. 2016, p. 770–778.
Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.. Densely
connected convolutional networks. In: Proceedings of the IEEE conference
on computer vision and pattern recognition. 2017b, p. 4700–4708.
Sachin, P.. Convolutional neural networks for image classification and cap-
tioning. 2016.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.. Rethinking
the inception architecture for computer vision. In: Proceedings of the
IEEE conference on computer vision and pattern recognition. 2016b, p.
2818–2826.
Duan, J., Shi, T., Zhou, H., Xuan, J., Wang, S.. A novel resnet-based
model structure and its applications in machine health monitoring. Journal
of Vibration and Control 2021;27(9-10):1036–1050.
Andersson, J., Ahlström, H., Kullberg, J.. Separation of water and fat sig-
nal in whole-body gradient echo scans using convolutional neural networks.
Magnetic resonance in medicine 2019;82(3):1177–1186.
Yao, W., Zeng, Z., Lian, C., Tang, H.. Pixel-wise regression using u-net
and its application on pansharpening. Neurocomputing 2018;312:364–371.
Iglovikov, V., Shvets, A.. Ternausnet: U-net with vgg11 encoder pre-trained
on imagenet for image segmentation. arXiv preprint arXiv:1801.05746;
2018.
Kandel, M.E., He, Y.R., Lee, Y.J., Chen, T.H.Y., Sullivan, K.M., Aydin,
O., et al. Phase imaging with computational specificity (pics) for measuring
dry mass changes in sub-cellular compartments. Nature communications
2020;11(1):1–10.
Saba, L., Agarwal, M., Sanagala, S.S., Gupta, S.K., Sinha, G., Johri,
A., et al. Brain mri-based wilson disease tissue classification: an optimised
deep transfer learning approach. Electronics Letters 2020;56(25):1395–
1398.
Xiao, J., Wang, J., Cao, S., Li, B.. Application of a novel and improved
vgg-19 network in the detection of workers wearing masks. In: Journal of
Physics: Conference Series; vol. 1518. IOP Publishing; 2020, p. 012041.
Rafi, A.M., Kamal, U., Hoque, R., Abrar, A., Das, S., Laganière, R.,
et al. Application of densenet in camera model identification and post-
processing detection. In: CVPR workshops. 2019, p. 19–28.
Zeng, X., Feng, G., Zhang, X.. Detection of double jpeg compres-
sion using modified densenet model. Multimedia Tools and Applications
2019;78(7):8183–8196.
Aldoj, N., Biavati, F., Michallek, F., Stober, S., Dewey, M.. Automatic
prostate and prostate zones segmentation of magnetic resonance images
using densenet-like u-net. Scientific reports 2020;10(1):1–17.
Huang, S., Lee, F., Miao, R., Si, Q., Lu, C., Chen, Q.. A deep con-
volutional neural network architecture for interstitial lung disease pattern
classification. Medical & biological engineering & computing 2020;:1–13.
Su, R., Zhang, D., Liu, J., Cheng, C.. Msu-net: Multi-scale u-net for 2d
medical image segmentation. Frontiers in Genetics 2021;12:140.
Khanh, T.L.B., Dao, D.P., Ho, N.H., Yang, H.J., Baek, E.T., Lee, G.,
et al. Enhancing u-net with spatial-channel attention gate for abnormal tis-
sue segmentation in medical imaging. Applied Sciences 2020;10(17):5729.
Cheng, S., Zhou, G.. Facial expression recognition method based on im-
proved vgg convolutional neural network. International Journal of Pattern
Recognition and Artificial Intelligence 2020;34(07):2056003.
Zhang, C., Benz, P., Argaw, D.M., Lee, S., Kim, J., Rameau, F., et al.
Resnet or densenet? introducing dense shortcuts to resnet. In: Proceedings
of the IEEE/CVF Winter Conference on Applications of Computer Vision.
2021, p. 3550–3559.
Zhang, C., Rameau, F., Kim, J., Argaw, D.M., Bazin, J.C., Kweon,
I.S.. Deepptz: Deep self-calibration for ptz cameras. In: Proceedings of
the IEEE/CVF Winter Conference on Applications of Computer Vision.
2020, p. 1041–1049.
Liu, W., Zeng, K.. Sparsenet: A sparse densenet for image classification.
arXiv preprint arXiv:1804.05340; 2018.
Name & ID                              Contribution
Sowvik Sarker, 17-33228-1              VGGNet-19, Discussion
Mainul Islam Mahi, 18-38468-2          Introduction, U-Net, ResNet
MD. Rezvi Khalid Hridoy, 18-38472-2    DenseNet, Abstract
Abir Hassan, 18-39206-3                Inception V3, Conclusion