Feature Fusion Based on Convolutional Neural Networks
ITM Web of Conferences 12, 05001 (2017). DOI: 10.1051/itmconf/20171205001
ITA 2017
Abstract: Recent breakthroughs in algorithms related to deep convolutional neural networks (DCNN) have stimulated the development of a variety of signal processing approaches, among which Automatic Target Recognition (ATR) using Synthetic Aperture Radar (SAR) data has attracted wide attention. Inspired by more efficient architectures such as the Inception network and the residual network, a new feature fusion structure which jointly exploits the merits of each is proposed to reduce the data dimensionality and the computational complexity. The procedure presented in this paper extracts a set of features with a DCNN, fuses them so that the representation of SAR images becomes more distinguishable, and feeds the result to a trainable classifier. In particular, the results obtained on the 10-class benchmark data set demonstrate that the presented architecture achieves remarkable classification performance compared with current state-of-the-art methods.
1 Introduction

As a kind of active microwave imaging radar with all-day, all-weather and long-range detection capabilities, synthetic aperture radar (SAR) has occupied a leading role in areas such as early warning, surveillance and guidance, of which the most extensive application is automatic target recognition (ATR).

With the emergence of well-performing classifiers such as the support vector machine (SVM) [1], k-nearest neighbor (KNN) [2] and AdaBoost [3], machine learning technology has attracted much attention in SAR ATR studies. However, most of the work on machine learning approaches focuses on designing a set of feature extractors, and in the SAR ATR field the existing techniques lack the ability to extract effective features and fuse them. Recently, a newly developed learning method called the convolutional neural network (CNN) has been successfully applied to many large-scale visual recognition tasks such as image classification [4]-[6] and object detection [7][8]. Motivated by the model of the mammalian visual system, CNNs consist of many hidden layers with convolution operations which extract the features in the image, and they achieve state-of-the-art results on visual image data sets such as ImageNet [4]. In this light, we consider employing CNNs in the SAR ATR field by designing a reasonable network structure and optimizing the training algorithm.

Generally, the architecture of a CNN can be interpreted as a two-stage procedure in image classification tasks. Unlike previous methods which rely heavily on handcrafted features with human intervention, the training process of a CNN automatically finds appropriate features in the search space; these are sent to trainable classifiers in the second stage, thus avoiding the complexity of pre-processing and feature selection.

A variety of works have been carried out to achieve better performance in the past decades; however, SAR ATR remains a challenging task since modern advanced techniques require tens of thousands of examples to train adequately. Wang et al. [9] augmented the data set to test adaptability on subsets under different displacement and rotation settings. Furthermore, training very deep networks still faces problems, because the stacking of non-linear transformations in a typical feed-forward network generally results in poor propagation of activations as well as vanishing gradients. Hence it remains necessary to modify the architecture of deep feed-forward networks.

Owing to the background speckle all over SAR images and the irregular distribution of strong scattering centers, SAR ATR is of great complexity, especially when the networks get deeper. To enable training deeper networks with limited data, we adopt CNN architectures which fuse the features extracted from different branches; such structures tend to perform well in maximally exploiting the input data and improving classification accuracy.

The remainder of this paper is organized as follows. Section II discusses the basic components of the CNN as well as the training method. Section III introduces the two feature fusion categories and how they are constructed in the proposed networks. Experimental results on SAR imagery are analyzed in Section IV to demonstrate the performance of the novel network architectures. Finally, we summarize this paper.
2 Structure and Training of CNNs

Among the various deep learning methods, the CNN is an effective model which considers the dependencies between pixels in an image. A CNN consists of consecutive layers of neurons with learnable weights and biases, where the concept of the receptive field is applied. In this section, we give a detailed description of the basic operation modules in a CNN.

2.1 Convolution Layer
In a convolution layer, the n-th output feature map is computed from the feature maps of the previous layer as

$O_n^{(l)}(x, y) = \sigma\Big(\sum_{m=1}^{M}\sum_{p,q=0}^{K-1} k_{nm}^{(l)}(p, q)\, O_m^{(l-1)}(x+p,\; y+q) + b_n^{(l)}\Big)$    (1)

where the convolution kernel $k_{nm}^{(l)}(p, q)$ denotes the trainable filters, $b_n^{(l)}$ and $\sigma$ represent the bias of the n-th output feature map and the nonlinear activation function, and the sums run over the M input feature maps and the K×K kernel support.
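To make Eq. (1) concrete, the following minimal NumPy sketch computes one output feature map from a stack of input maps; the array sizes, the stride-1 valid convolution, and the ReLU choice for the nonlinearity are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def conv_feature_map(inputs, kernels, bias, activation=lambda z: np.maximum(z, 0.0)):
    """One output feature map O_n^(l) of Eq. (1).

    inputs  : array (M, H, W)  -- the M feature maps O_m^(l-1) of the previous layer
    kernels : array (M, K, K)  -- the trainable filters k_nm^(l)
    bias    : float            -- the bias b_n^(l) of the n-th output map
    activation : callable      -- the nonlinearity sigma (ReLU assumed here)
    """
    M, H, W = inputs.shape
    _, K, _ = kernels.shape
    out = np.zeros((H - K + 1, W - K + 1))          # 'valid' convolution, stride 1
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            patch = inputs[:, x:x + K, y:y + K]     # all M maps at position (x, y)
            out[x, y] = np.sum(kernels * patch) + bias
    return activation(out)

# toy usage: 3 input maps of size 8x8 and 5x5 kernels give a 4x4 output map
rng = np.random.default_rng(0)
out_map = conv_feature_map(rng.standard_normal((3, 8, 8)),
                           rng.standard_normal((3, 5, 5)), bias=0.1)
print(out_map.shape)   # (4, 4)
```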
2.2 Activation Function

To form a highly nonlinear mapping between the input image and the output label, a proper activation function is added after each convolution layer [10]. Since a hyperbolic tangent or a sigmoid function [11] may get trapped in saturating nonlinearities during gradient propagation, a piecewise linear function which does not suffer from saturation, the rectified linear unit (ReLU) [12], is adopted instead.

2.3 Pooling Layer

By averaging over a group of neurons or choosing the neuron with the maximal value in the group, pooling layers largely reduce the amount of computation by shrinking the feature maps. These layers also give the CNN better translation, shift and distortion invariance. We assume that each feature map in a convolution layer corresponds to a single map in the associated pooling layer. For example, the max pooling operation [13] is defined as

$O_m^{(l+1)}(x, y) = \max_{p,q = 0,\ldots,K-1} O_m^{(l)}(x \cdot s + p,\; y \cdot s + q)$    (3)

where K is the pooling size and s is the stride, which indicates the interval between adjacent pooling windows.

2.4 Softmax Layer

The outputs of the last layer are fed through the softmax function to yield a normalized probability distribution over classes. The softmax layer operation takes the form [15]

$p_j^{(i)} = \dfrac{e^{y_j^{(i)}}}{\sum_{k=1}^{T} e^{y_k^{(i)}}}$    (4)

We use the softmax output layer in conjunction with the categorical cross-entropy error function. With that as the background, the error function E is defined as

$E = -\dfrac{1}{N}\Big[\sum_{i=1}^{N}\sum_{j=1}^{T} 1\{t^{(i)} = j\}\,\ln p_j^{(i)}\Big]$    (5)

where N is the number of training samples, T is the number of classes, $t^{(i)}$ is the label of the i-th sample and $1\{\cdot\}$ is the indicator function.
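The following NumPy sketch works through Eqs. (3)-(5) on made-up data; the map size, the class count T = 10 and the toy labels are assumptions used only to show the operations.

```python
import numpy as np

def max_pool(feature_map, K=2, s=2):
    """Eq. (3): max pooling with window size K and stride s."""
    H, W = feature_map.shape
    out = np.zeros((H // s, W // s))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            out[x, y] = feature_map[x * s:x * s + K, y * s:y * s + K].max()
    return out

def softmax(y):
    """Eq. (4): normalized probability distribution over the T classes (stabilized)."""
    e = np.exp(y - y.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, targets):
    """Eq. (5): categorical cross-entropy averaged over the N samples."""
    N = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(N), targets]))

rng = np.random.default_rng(1)
pooled = max_pool(rng.standard_normal((4, 4)))     # a 4x4 map becomes a 2x2 map
scores = rng.standard_normal((5, 10))              # N = 5 samples, T = 10 classes
labels = np.array([0, 3, 9, 1, 4])                 # toy ground-truth labels
print(pooled.shape, cross_entropy(softmax(scores), labels))
```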
3 Feature Fusion Patterns

Although a CNN works as a good feature extractor, how to make use of multi-dimensional features and fuse them together still deserves exploration. To increase the diversity of the extracted features and to understand the interactions between them, we formulate two patterns of feature fusion. Since the targets in SAR images may not be square shaped, convolution kernels sharing a single size are too restricted to learn sufficiently rich features; hence some of the inserted convolution kernels are given rectangular shapes.

3.1 Feature Fusion under Concatenation Pattern

Usually, the original convolution layer contains several square kernels; in this pattern we replace them with two convolution branches arranged by concatenation. This mode of feature fusion at the convolution level is referred to as the "conv-fusion" module. The concrete structure of its elements is demonstrated in Figure 1, where a represents the equivalent kernel size in the standard CNN structure while n denotes the adjustable kernel size. The convolutions marked with S are same-padded, while V signifies valid-padded.
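As a rough illustration of the concatenation pattern, the Keras-style sketch below fuses a square a×a branch with an asymmetric 1×n / n×1 branch and concatenates the resulting feature maps along the channel axis. The filter counts, the uniform same padding, and the name conv_fusion_module are assumptions made for readability; in the paper the branches mix same- (S) and valid- (V) padded convolutions as laid out in Figure 1.

```python
from tensorflow.keras import Input, Model, layers

def conv_fusion_module(x, filters=8, a=5, n=3):
    """Sketch of the concatenation ('conv-fusion') pattern: a square a x a branch
    fused with an asymmetric 1 x n / n x 1 branch. Same padding keeps the two
    branches at identical spatial sizes so they can be concatenated."""
    square = layers.Conv2D(filters, (a, a), padding='same', activation='relu')(x)
    asym = layers.Conv2D(filters, (1, n), padding='same', activation='relu')(x)
    asym = layers.Conv2D(filters, (n, 1), padding='same', activation='relu')(asym)
    return layers.concatenate([square, asym])      # feature maps stacked along channels

inp = Input(shape=(64, 64, 1))                     # MSTAR-sized single-channel image chip
fused = conv_fusion_module(inp)                    # 64 x 64 x 16 fused output
Model(inp, fused).summary()
```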
3.2 Fusion under Summation Pattern

Deep networks stacked from several layers usually suffer from the degradation problem. Enlightened by the deep residual learning framework [16], we design the second pattern of feature fusion. Formally, the expected underlying mapping, also known as the unreferenced mapping, is represented as F(x), and H(x) denotes the residual mapping consisting of two consecutive convolution layers. When each block only computes the corrective term H(x) = F(x) − x rather than the entire approximation F(x), the idea of a shortcut connection, which skips several layers, is realized. By adopting the summation mechanism between layers, the information from the previous layer flows more smoothly to the next layer without attenuation, and the layers only have to learn the difference from the input feature map. Additionally, considering the ability of asymmetric kernels to extract features with various scales and structures, we propose a new concept called the asymmetric shortcut connection block (ASCB), as shown in Figure 2.

Figure 2. Asymmetric shortcut connection block (ASCB): shortcut addition of the input x, followed by ReLU and 2×2 max pooling.
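A minimal sketch of the summation pattern in the spirit of the ASCB follows: two consecutive asymmetric convolutions form the corrective branch, the input is added back through a shortcut, and the result passes through ReLU and 2×2 max pooling, matching the elements recoverable from Figure 2. The kernel sizes, the filter counts and the 1×1 projection on the shortcut are assumptions; the exact block in the paper may differ.

```python
from tensorflow.keras import Input, Model, layers

def ascb(x, filters=16, n=7):
    """Illustrative asymmetric shortcut connection block.
    Two consecutive asymmetric convolutions (1xn then nx1, same-padded) form the
    corrective branch; the shortcut adds the input back so the branch only has to
    learn a corrective term, then ReLU and 2x2 max pooling follow."""
    shortcut = layers.Conv2D(filters, (1, 1), padding='same')(x)   # channel match (assumption)
    h = layers.Conv2D(filters, (1, n), padding='same', activation='relu')(x)
    h = layers.Conv2D(filters, (n, 1), padding='same')(h)
    y = layers.add([shortcut, h])                                  # summation fusion
    y = layers.Activation('relu')(y)
    return layers.MaxPooling2D(pool_size=(2, 2))(y)

inp = Input(shape=(30, 30, 16))
out = ascb(inp)                      # 30x30x16 -> 15x15x16
Model(inp, out).summary()
```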
4 Experimental Results and Analysis

4.1 Data Description

The CNN structures in this paper are applied to the problem of SAR ATR on the MSTAR dataset. The MSTAR benchmark data act as a standard data set to test and evaluate the performance of recognition algorithms. In our experiment, images for training are captured at a 17-degree depression angle and those for testing at 15 degrees. The proposed algorithm is evaluated on the ten-target classification problem, and the corresponding numbers of images for training and testing are listed in Table I.

TABLE I. NUMBER OF TRAINING AND TESTING IMAGES FOR THE EXPERIMENTS

Target Types    No. Testing    No. Training
BMP2(9563)      195            233
BMP2(9566)      196            /
BMP2(C21)       196            /
T72(132)        196            232
T72(812)        195            /
T72(S7)         191            /
BTR70           196            233
BTR60           195            256
BRDM2           274            298
ZSU             274            299
T62             273            299
ZIL             274            299
2S1             274            299
D7              274            299
Total           3203           2747
4.2 Training Details

As has been shown in previous work, implementation details play a decisive role in recognition performance. We aim to derive a set of CNN structures that satisfy both low computation cost and high accuracy constraints for applications with limited labelled SAR data. Besides the hyper parameters of the CNNs themselves, other details such as the weight initialization and the learning rate also count.

To improve the training speed and to prevent the network from becoming trapped in local minima, a set of fine-tuning methods has been applied to a variety of recognition tasks. Based on the gradient descent method, the Adam [17] algorithm dynamically adjusts the update of each parameter according to moment estimates and thus trains the CNN efficiently and stably. In view of this, we chose Adam as a substitute for plain mini-batch SGD, using the given default values: learning rate 0.001, β1 = 0.9, β2 = 0.999 [17]. Additionally, several subsets of the training data are chosen stochastically from the whole data set as mini-batches, and all the models are trained with a batch size of 50.

As for weight initialization, the weights of all layers are drawn by an initializer [18] from the uniform distribution (−√(6 / (fan_in + fan_out)), +√(6 / (fan_in + fan_out))), where fan_in and fan_out represent the numbers of input and output units, respectively.
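Assuming a Keras-style implementation, the training setup described above (Adam with the default moments, a batch size of 50, and the uniform initializer with limit √(6/(fan_in + fan_out))) could be expressed as in the sketch below; the model and data variables and the epoch count are placeholders, not code from the paper.

```python
import math
from tensorflow.keras import optimizers, initializers

# Uniform initializer with limit sqrt(6 / (fan_in + fan_out)); Keras' GlorotUniform
# implements this formula, so it stands in for the initializer described above.
glorot = initializers.GlorotUniform()
print(math.sqrt(6.0 / (25 + 16)))   # example limit for a layer with fan_in=25, fan_out=16

# Adam with the default values quoted above: lr = 0.001, beta1 = 0.9, beta2 = 0.999.
adam = optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

# Hypothetical usage on a compiled model (placeholders only):
# layers.Conv2D(16, (5, 5), kernel_initializer=glorot)
# model.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(x_train, y_train, batch_size=50, epochs=..., validation_data=(x_test, y_test))
```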
4.3 Experiments on Different Network Configurations

Here, we present the general layout of our baseline convolutional neural network (BCNN) and then describe the details of the components used in it. The schematic view of BCNN is depicted in Figure 3. To reduce the feature dimension, each convolution layer is followed by a pooling layer with a 2×2 pooling size and a stride of 2 pixels. To avoid the redundancy and the spilling over of local information brought by relatively large receptive fields, we choose convolution kernel sizes smaller than 7. In this structure, the fully connected layer is replaced by a global average pooling (GAP) layer, since the correspondence between feature maps and categories is strengthened and no parameters need optimizing [19].

Figure 3. Overall architecture of BCNN (conv. (kernel depth) @ (kernel size))
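The extracted text does not give the exact per-layer configuration of BCNN, so the sketch below is only one plausible reading of the stated constraints: a 64×64×1 input, convolution kernels smaller than 7, a 2×2 max pooling with stride 2 after every convolution layer, and a global average pooling layer feeding a ten-way softmax in place of fully connected layers. The filter counts and the number of stages are assumptions.

```python
from tensorflow.keras import Input, Model, layers

inp = Input(shape=(64, 64, 1))
x = inp
for filters in (16, 32, 64):                                    # stage count/filters assumed
    x = layers.Conv2D(filters, (5, 5), activation='relu')(x)    # kernel size < 7 as stated
    x = layers.MaxPooling2D(pool_size=(2, 2), strides=2)(x)     # 2x2 pooling, stride 2
x = layers.GlobalAveragePooling2D()(x)                          # GAP replaces FC layers [19]
out = layers.Dense(10, activation='softmax')(x)                 # ten-target classification
bcnn = Model(inp, out, name='bcnn_sketch')
bcnn.summary()
```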
For the MSTAR dataset, the recognition accuracy of BCNN is obtained with the above hyper parameter setting. Next, the feature-fusion patterns are introduced into the BCNN; all these CNN-based methods share the same parameter setting and differ only in the framework. The relevant composition of the transformed conv-fusion CNN is listed in Table II.

TABLE II. COMPOSITION OF CONV-FUSION CNN

Layer name/Type   Output size   Filter size   Left Branch Filter size   Right Branch Filter size
Input image       64×64×1       /             /                         /
Conv-fuse1        60×60×16      /             1×3, 3×1                  5×5
Maxpool1          30×30×16      2×2           /                         /
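As a quick sanity check on the sizes listed in Table II, a valid 5×5 convolution on the 64×64 input gives 60×60 (64 − 5 + 1) and the subsequent 2×2 max pooling halves this to 30×30; the snippet below simply evaluates these formulas.

```python
def valid_conv_size(n, k, stride=1):
    """Spatial size after a 'valid' convolution: floor((n - k) / stride) + 1."""
    return (n - k) // stride + 1

size = valid_conv_size(64, 5)    # Conv-fuse1 branch with a 5x5 valid kernel
print(size, size // 2)           # 60, then 30 after the 2x2 max pooling (Table II)
```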
TABLE III. RECOGNITION RESULTS OF THE CONV-FUSION CNN (ROWS: TRUE TARGET TYPE; COLUMNS: PREDICTED LABEL)

Target Types   BMP2   T72   BTR70   BTR60   BRDM2   ZSU   T62   ZIL   2S1   D7   Accuracy (%)
BMP2(9563) 195 0 0 0 0 0 0 0 0 0 100.00
BMP2(9566) 195 0 0 0 1 0 0 0 0 0 99.49
BMP2(C21) 189 4 0 0 3 0 0 0 0 0 96.43
T72(132) 1 195 0 0 0 0 0 0 0 0 99.49
T72(812) 5 183 0 1 0 0 6 0 0 0 93.85
T72(S7) 8 177 0 1 0 0 4 0 1 0 92.67
BTR70 0 0 196 0 0 0 0 0 0 0 100.00
BTR60 0 0 7 188 0 0 0 0 0 0 96.41
BRDM2 1 0 0 0 272 0 0 1 0 0 99.27
ZSU 0 0 0 0 0 274 0 0 0 0 100.00
T62 0 1 0 0 0 1 271 0 0 0 99.27
ZIL 0 0 0 0 0 0 1 273 0 0 99.64
2S1 1 0 3 0 1 0 0 0 269 0 98.18
D7 0 0 0 0 0 0 0 0 0 274 100.00
Global Accuracy (%) 98.38
TABLE IV. RECOGNITION RESULTS OF THE MODULE-RESIDUAL CNN (ROWS: TRUE TARGET TYPE; COLUMNS: PREDICTED LABEL)

Target Types   BMP2   T72   BTR70   BTR60   BRDM2   ZSU   T62   ZIL   2S1   D7   Accuracy (%)
BMP2(9563) 195 0 0 0 0 0 0 0 0 0 100.00
BMP2(9566) 191 3 0 0 2 0 0 0 0 0 97.45
BMP2(C21) 189 2 2 1 1 0 0 0 1 0 96.43
T72(132) 0 196 0 0 0 0 0 0 0 0 100.00
T72(812) 2 190 0 0 0 0 3 0 0 0 97.44
T72(S7) 6 183 0 0 0 0 2 0 0 0 95.81
BTR70 0 0 194 1 1 0 0 0 0 0 98.98
BTR60 0 0 4 191 0 0 0 0 0 0 97.95
BRDM2 0 0 0 0 274 0 0 0 0 0 100.00
ZSU 0 0 0 0 0 274 0 0 0 0 100.00
T62 0 3 0 0 0 0 270 0 0 0 98.90
ZIL 0 0 0 0 0 1 0 273 0 0 99.64
2S1 0 2 0 0 1 0 3 0 268 0 97.81
D7 0 0 0 0 0 0 0 0 0 274 100.00
Global Accuracy (%) 98.72
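To make explicit how the per-class and global accuracies in Tables III and IV follow from the confusion matrices, a small NumPy sketch is given below; rows are true classes, columns are predicted classes, and the 3×3 matrix is a made-up example rather than the paper's data.

```python
import numpy as np

def accuracies(confusion):
    """Per-class accuracy = diagonal / row total; global accuracy = trace / grand total."""
    confusion = np.asarray(confusion, dtype=float)
    per_class = 100.0 * np.diag(confusion) / confusion.sum(axis=1)
    overall = 100.0 * np.trace(confusion) / confusion.sum()
    return per_class, overall

cm = [[195, 0, 0],       # toy 3-class example, not the full 10-class matrix
      [1, 195, 0],
      [0, 4, 192]]
print(accuracies(cm))
```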
Figure 5. Visualization of the 8×8 feature maps in the ASCB (n = 7)

We can see from Figure 5 that the loss of visual information is suppressed after the last output map is fed through the block. By exploiting the complementarity of the maps aggregated at the next stage, our proposed CNN structure can learn more specific and abstract features.

The recognition results for the different CNN structures are shown in Tables III, IV and V, where each row denotes the real target type and each column denotes the predicted label. As we can conclude from the tables, the overall accuracies of the two proposed architectures reach 98.38% and 98.72%, respectively. Whilst variations exist in the testing set, the lowest per-class accuracy is 95.81%, for the T72 in the second pattern.

TABLE VI. CLASSIFICATION RESULTS OF OTHER METHODS

Method                              Accuracy (%)
SVM [20]                            86.73
Adaptive Boosting [21]              92.70
KPCA [22]                           92.67
MSRC [23]                           93.66
IGT [24]                            95.00
CNN with data augmentation [25]     93.16
BCNN                                97.39
Conv-fusion CNN                     98.38
Module-residual CNN                 98.72
Decision fusion CNN                 99.42
It is desirable that the feature fusion CNNs show the ability to classify the ten-class targets regardless of the existence of variants. Meanwhile, without increasing either the width of the network or the complexity of the architecture, the combination of high-level feature representations and the learning of potentially invariant features achieves particularly good performance even when the labelled training data are limited.

5 Conclusion
Understanding the multi-scale and hierarchical features learned by a CNN and then digging thoroughly into the discriminative features helps us accomplish the target recognition task. In this paper, we first constructed novel CNN architectures which contain two independent modules, the "conv-fusion" module and the asymmetric shortcut connection block, then fine-tuned the hyper parameters of the deep CNN in a second stage, and finally applied the networks to the problem of recognition from SAR images. Experimental results indicate that the proposed networks gain superior performance compared with other state-of-the-art methods by extracting strong feature representations. Furthermore, the CNN approach presented in this paper can be revised or extended to other practical applications and thus provides insightful points for recognition tasks aimed at small datasets.
Acknowledgment

The research work is supported by the National Natural Science Foundation of China under grant No. 61471370.
References

[1] Q. Zhao and J. C. Principe, 2001. "Support vector machines for SAR automatic target recognition," IEEE Transactions on Aerospace & Electronic Systems, vol. 37, no. 2, pp. 643–654.
[2] G. Guo, H. Wang, D. Bell, Y. Bi, and K. Greer, 2003. "KNN model-based approach in classification," Lecture Notes in Computer Science, vol. 2888, pp. 986–996.
[3] Y. Sun, Z. Liu, S. Todorovic, and J. Li, 2007. "Adaptive boosting for SAR automatic target recognition," IEEE Transactions on Aerospace & Electronic Systems, vol. 43, no. 1, pp. 112–125.
[4] A. Krizhevsky, I. Sutskever, and G. Hinton, 2012. "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., pp. 1106–1114.
[5] C. Szegedy et al., 2015. "Going deeper with convolutions," in Proc. IEEE Computer Vision and Pattern Recognition, Boston, MA, USA, Jun. 8–10, pp. 1–9.
[6] K. Simonyan and A. Zisserman, 2015. "Very deep convolutional networks for large-scale image recognition," presented at the Int. Conf. Learning Representations. [Online]. Available: http://arxiv.org/abs/1409.1556
[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik, 2014. "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Computer Vision and Pattern Recognition, pp. 580–587.
[8] X. Chen, S. Xiang, C. Liu, and C. Pan, 2014. "Vehicle detection in satellite images by hybrid deep convolutional neural networks," IEEE Geoscience and Remote Sensing Letters, vol. 11, no. 10, pp. 1797–1801.
[9] K. Du, Y. Deng, R. Wang, T. Zhao, and N. Li, 2016. "SAR ATR based on displacement- and rotation-insensitive CNN," Remote Sensing Letters, vol. 7, no. 9, pp. 895–904.
[10] S. Chen, H. Wang, F. Xu, and Y. Q. Jin, 2016. "Target classification using the deep convolutional networks for SAR images," IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 8, pp. 1–12.
[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, 1998. "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324.
[12] V. Nair and G. E. Hinton, 2010. "Rectified linear units improve restricted Boltzmann machines," in Proc. ICML.
[13] Y. LeCun, K. Kavukcuoglu, and C. Farabet, 2010. "Convolutional networks and applications in vision," in Proc. IEEE International Symposium on Circuits and Systems, pp. 253–256.
[14] Y. LeCun, B. Boser, J. S. Denker, R. E. Howard, W. Habbard, L. D. Jackel, and D. Henderson, 1990. "Handwritten digit recognition with a back-propagation network," in Advances in Neural Information Processing Systems, pp. 396–404.
[15] C. M. Bishop, 2006. Pattern Recognition and Machine Learning. New York, NY, USA: Springer-Verlag.
[16] K. He, X. Zhang, S. Ren, and J. Sun, 2015. "Deep residual learning for image recognition," pp. 770–778.
[17] D. P. Kingma and J. Ba, 2014. "Adam: A method for stochastic optimization," Computer Science.
[18] K. He, X. Zhang, S. Ren, and J. Sun, 2015. "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proc. Int. Conf. Comput. Vis., pp. 1026–1034.
[19] M. Lin, Q. Chen, and S. Yan, 2013. "Network in network," Computer Science.
[20] Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu, 2013. "Advances in optimizing recurrent networks," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8624–8628.
[21] Y. Sun, Z. Liu, S. Todorovic, and J. Li, 2007. "Adaptive boosting for SAR automatic target recognition," IEEE Transactions on Aerospace and Electronic Systems, vol. 43, pp. 112–125.
[22] A. K. Mishra and T. Motaung, 2015. "Application of linear and nonlinear PCA to SAR ATR," in Proc. Radioelektronika, IEEE, pp. 349–354.
[23] D. Ganggang, W. Na, and K. Gangyao, 2014. "Sparse representation of monogenic signal: With application to target recognition in SAR images," vol. 21, no. 8, pp. 952–956.