Article
A Framework for Evaluating Land Use and Land
Cover Classification Using Convolutional
Neural Networks
Manuel Carranza-García *, Jorge García-Gutiérrez and José C. Riquelme
Division of Computer Science, University of Sevilla, ES-41012 Seville, Spain; [email protected] (J.G.-G.);
[email protected] (J.C.R.)
* Correspondence: [email protected]
Abstract: Analyzing land use and land cover (LULC) using remote sensing (RS) imagery is essential
for many environmental and social applications. The increase in availability of RS data has led to the
development of new techniques for digital pattern classification. Very recently, deep learning (DL)
models have emerged as a powerful solution to approach many machine learning (ML) problems.
In particular, convolutional neural networks (CNNs) are currently the state of the art for many image
classification tasks. While there exist several promising proposals on the application of CNNs to
LULC classification, the validation framework proposed for the comparison of different methods
could be improved with the use of a standard validation procedure for ML based on cross-validation
and its subsequent statistical analysis. In this paper, we propose a general CNN, with a fixed
architecture and parametrization, to achieve high accuracy on LULC classification over RS data
from different sources such as radar and hyperspectral. We also present a methodology to perform
a rigorous experimental comparison between our proposed DL method and other ML algorithms
such as support vector machines, random forests, and k-nearest-neighbors. The analysis carried out
demonstrates that the CNN outperforms the rest of the techniques, achieving a high level of performance
for all the datasets studied, regardless of their different characteristics.
Keywords: convolutional neural network; cross-validation; deep learning; land use classification;
land cover classification; remote sensing; statistical analysis
1. Introduction
Land use and land cover (LULC) classification is a fundamental task in remote sensing (RS).
The automated extraction of information about the Earth's surface currently plays an essential role in many
RS applications such as change detection [1], environmental monitoring [2], urban expansion control [3],
infrastructure planning [4], and biodiversity analysis [5]. The ability to identify the physical material
of the surface or the human exploitation of land by analyzing aerial imagery has led to significant
cost and time savings in these studies compared with traditional field surveys.
LULC classification is the problem of assigning a class label to every pixel of an image. It often
proves to be a challenging task due to the heterogeneous appearance and high intraclass variance
of elements [6], the large number of spectral channels and the limited amount of labeled samples
in the images [7]. However, recent technological advancements in optics and the space industry
have increased the availability and quality of RS data, which vary in spatial, spectral, radiometric,
and temporal resolutions [8]. The wide diversity of datasets available (e.g., aerial photographs, thermal
imagery, hyperspectral imagery, radar and lidar) has opened up many lines of research and has
inspired the development of new techniques for digital pattern classification.
The RS community has tried to address the problem from many different perspectives, with
the purpose of seeking an efficient method for mapping LULC patterns. These studies range
from conventional statistical approaches to more powerful machine learning (ML) algorithms that
have enhanced the quality of the solutions for this problem. Traditional RS data classification
methods include the maximum-likelihood classifier (MLC) [9], distance measures [10], clustering [11],
and logistic regression [12]. Over the last decade, more advanced methods such as decision trees [13],
k-nearest-neighbors (kNN) [14], random forest (RF) [15], neural networks [16] and support vector
machines (SVM) [17] have been used for LULC mapping. In 2016, a study on the state of the art of
supervised methods for LULC classification was performed [18]. It was reported that SVM, kNN,
and RF generally provide better performance than other traditional classifiers, SVM being the most
efficient method. Very recently, deep learning (DL) techniques have become the state of the art
in many fields such as image recognition [19]. The capacity of DL models to automatically learn
informative representations of raw input data with multiple levels of abstraction has allowed for
achieving high performance in many computer vision tasks [20]. Therefore, due to its rising potential,
DL has lately been applied with success to LULC classification and recent studies demonstrate that
it outperforms SVM approaches [21,22]. Several DL architectures have been analyzed for RS tasks:
convolutional neural networks (CNNs) [23,24], stacked autoencoders (SAEs) [25], and deep belief
networks (DBNs) [26]. In particular, recent surveys place CNNs as the most used model, as they
generally provide better performance [20]. The deep structure of a CNN allows for learning effective
features by concatenating convolution, pooling, and fully connected layers. In addition, their capacity
to encode spectral and spatial information of RS images, by using contextual information of the
neighbourhood of the pixels, has led to a substantial improvement in the efficiency of the classification.
Furthermore, the properties of local connectivity and parameter sharing of CNNs vastly reduce
the number of parameters that would have to be learned for multidimensional images if using a
standard neural network, thus decreasing processing effort. Numerous methods based on DL have
been proposed recently for LULC classification over specific RS data, especially focusing on high
resolution and hyperspectral images [27,28]. Nevertheless, due to the abundance of data from distinct
sources and characteristics, there is a need to develop models applicable to images at
different scales, which vary in their spatial resolution and number of spectral bands. Therefore, our aim
in the present study was to build a general CNN model for LULC classification that could be used for
diverse RS images such as radar or hyperspectral.
Furthermore, due to the considerable amount of studies with different proposals, it is essential to
establish a standard validation procedure in order to perform a fair comparison between techniques.
With the aim of drawing stronger conclusions on how novel methods improve the existing ones,
the validation framework proposed in LULC studies could be improved with the use of a standard
testing methodology for ML such as cross-validation and its subsequent statistical analysis. Fairly often,
in remote sensing, holdout validation is carried out for comparing classifiers over multiple datasets.
However, the use of repeated stratified cross-validation has been shown to be the most suitable
procedure for ML problems, as it provides a reliable estimate of the capacity of generalization of
different models [29]. Furthermore, the statistical validation of results is a necessity in order to establish
a certain conclusion on an experimental analysis over a ML problem. Statistics allow us to verify the
hypothesis of improved performance between proposals, indicating how significant our compared
results are with respect to the experimentation that we have carried out [30]. Accordingly, in this
paper, we propose a validation procedure to carry out an experimental comparison between the
proposed DL model for LULC classification and other well-known ML techniques such as
k-nearest-neighbours, random forest, and support vector machines.
In summary, the main scientific contributions of this paper can be condensed as follows:
• A general 2D CNN, with a fixed architecture and parametrization, to achieve high accuracy in
LULC classification over remote sensing imagery from different sources, specifically radar and
hyperspectral images.
The rest of the paper is organised as follows: Section 2 describes the materials used for
the study and presents our proposed DL architecture with the experimental setup. The results
obtained when applying the presented methods to the selected datasets are reported and discussed in
Section 3. Finally, Section 4 summarises the main conclusions and presents a number of possibilities of
further research.
Figure 1. Indian Pines dataset. (a) False color composite image (bands 29, 19, and 9); (b) ground truth.
Figure 2. Pavia University dataset. (a) False color composite image (bands 45, 27, and 11); (b) ground truth.
Figure 3. Salinas dataset. (a) False color composite image (bands 29, 19, and 9); (b) ground truth.
patch, an edge padding operation is performed. Once the patches have been collected, we propose a
further improvement: a simple data augmentation based on rotation. This allows the models to learn
different spatial distributions inside the patches, thus reducing overfitting. We studied the performance
after applying three rotations (90 degrees each) or seven rotations (45 degrees each), finding that the
latter option provided better results.
Figure 6. Patch extraction process to generate the instances that would be fed to the ML models.
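A minimal sketch of this patch extraction and rotation augmentation, assuming numpy arrays for the image and label map; function and variable names are illustrative, not from the released source code:

```python
# Illustrative sketch: edge-padded patch extraction per labeled pixel,
# augmented with seven 45-degree rotations, as described above.
import numpy as np
from scipy.ndimage import rotate

def extract_patches(image, labels, patch_size=5,
                    angles=(45, 90, 135, 180, 225, 270, 315)):
    """image: (H, W, bands); labels: (H, W), with 0 assumed to mark
    unlabeled pixels. Returns patches and their class labels."""
    half = patch_size // 2
    # Edge padding keeps patches centered on border pixels in bounds
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="edge")
    X, y = [], []
    for r, c in zip(*np.nonzero(labels)):          # labeled pixels only
        patch = padded[r:r + patch_size, c:c + patch_size, :]
        X.append(patch)
        y.append(labels[r, c])
        for angle in angles:                       # rotation augmentation
            X.append(rotate(patch, angle, reshape=False, mode="nearest"))
            y.append(labels[r, c])
    return np.asarray(X), np.asarray(y)
```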
The output value of a neuron in a convolutional layer is given by

$$ a_{ln}^{xy} = g\left( \sum_{m} \sum_{i=0}^{I_l - 1} \sum_{j=0}^{J_l - 1} w_{lnm}^{ij} \, a_{(l-1)m}^{(x+i)(y+j)} + b_{ln} \right), \qquad (1) $$

where l indicates the current layer; n indexes the feature maps of this layer; a^{xy}_{ln} is the value of the
neuron at position (x, y) of the nth feature map in the lth layer; g is the activation function; m indexes
over the set of feature maps in the (l − 1)th layer connected to the current feature map; I_l and J_l are the
height and width of the convolutional kernel; w^{ij}_{lnm} stands for the weight at position (i, j) connected
to the mth feature map; and b_{ln} is the bias. Rectified Linear Unit (ReLU) layers
are used to compute the nonlinear activation function g(x) = max(0, x), which generally improves
convergence when training the network compared to the sigmoid function [36].
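For illustration, a naive numpy sketch of Equation (1) with ReLU activation follows; the shapes and loop structure are an assumption made for clarity rather than the authors' (library-optimized) implementation:

```python
# Forward pass of one convolutional layer per Equation (1), 'valid' padding.
import numpy as np

def conv_layer_forward(a_prev, w, b):
    """a_prev: (H, W, M) activations of layer l-1
       w:      (I, J, M, N) kernel weights
       b:      (N,) biases
       returns (H-I+1, W-J+1, N) activations of layer l."""
    H, W, M = a_prev.shape
    I, J, _, N = w.shape
    out = np.zeros((H - I + 1, W - J + 1, N))
    for n in range(N):                       # feature maps of layer l
        for x in range(out.shape[0]):
            for y in range(out.shape[1]):
                region = a_prev[x:x + I, y:y + J, :]      # (I, J, M) window
                # Sum over m, i, j as in Equation (1), plus the bias
                out[x, y, n] = np.sum(region * w[:, :, :, n]) + b[n]
    return np.maximum(out, 0)                # ReLU: g(x) = max(0, x)
```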
Max-pooling layers are used to progressively reduce the spatial size of the feature maps.
This operation reduces the number of parameters, and hence the computation in the network, and helps
to control overfitting by selecting superior invariant features. In these layers, input maps are
downsampled by factors of Fx and Fy along the width and height, respectively, while the depth dimension
remains unaltered. For each non-overlapping rectangular region of size (Fx, Fy) of the input, the output is given
by its maximum activation value [37].
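As a brief sketch, the non-overlapping max-pooling described above can be written as follows (a simplified version that trims odd-sized inputs, whereas Table 2 uses ceiling pooling):

```python
# Non-overlapping (fx, fy) max-pooling over (H, W, N) feature maps.
import numpy as np

def max_pool(a, fx=2, fy=2):
    H, W, N = a.shape
    H2, W2 = H // fx, W // fy
    a = a[:H2 * fx, :W2 * fy, :]             # trim to a multiple of the factors
    # Group pixels into (fx, fy) blocks and take each block's maximum value
    blocks = a.reshape(H2, fx, W2, fy, N)
    return blocks.max(axis=(1, 3))           # depth dimension is unaltered
```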
Unlike convolutional layers, fully connected layers have full connections to all activations in the
previous layer. They combine all the extracted features after the convolutions into a one-dimensional
feature vector. The last fully connected layer of the network has one output neuron per class and
performs the classification task by applying a softmax activation function. The output of this function is
a probability distribution representing the likelihood of belonging to a certain class. This is formulated
in Equation (2), where p is the probability distribution, i represents a certain class, and z is the
C-dimensional vector of real-valued scores of the output layer of the network, C being the number
of classes:
$$ p_i(z) = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}. \qquad (2) $$
In all the layers described, the value of the weights and biases of the connections are initialized
randomly using Xavier initialization, which aims to set the values within a reasonable range before
starting the training procedure [38]. However, the optimal set of values has to be found in order
to adjust the model to the training data, hence the whole process can be seen as an optimization
problem. The network is fed with input data and the weights and biases values are updated by
minimizing a loss function H. The loss function quantifies how well the network outputs, given a set
of parameters, match the ground truth labels in the training set. The usual loss function applied at
softmax layers is the cross-entropy (Equation (3)), where y is the one-hot encoded vector of the labels,
p is the predicted probability distribution given by the softmax function (Equation (2)), and i indexes
over the set of classes:
$$ H(y, p) = - \sum_{i} y_i \log(p_i). \qquad (3) $$
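As a small worked sketch of Equations (2) and (3), with illustrative score values:

```python
# Softmax over class scores and cross-entropy against a one-hot label.
import numpy as np

def softmax(z):
    z = z - z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()               # Equation (2)

def cross_entropy(y, p):
    return -np.sum(y * np.log(p))    # Equation (3)

z = np.array([2.0, 1.0, 0.1])        # scores for C = 3 classes
y = np.array([1.0, 0.0, 0.0])        # one-hot ground truth
p = softmax(z)                       # ~[0.659, 0.242, 0.099]
loss = cross_entropy(y, p)           # ~0.417 = -log(0.659)
```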
For carrying out the optimization process, the mini-batch gradient descent with back-propagation
algorithm is used [34]. It iteratively computes the gradient of the cost function H and updates the
model’s parameters. The mini-batch strategy only considers a small set of training samples for
computing the gradient, which makes it suitable for processing large datasets. The weights are updated using a
mini-batch of size n as described in Equation (4), where k is the iteration index, η is the learning rate,
and x (i) and y(i) represent, respectively, the training example i and its label:
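Under these definitions, a minimal sketch of the standard mini-batch update that Equation (4) describes is:

```latex
% Standard mini-batch gradient descent update with the definitions above:
% k is the iteration index, \eta the learning rate, n the mini-batch size,
% and (x^{(i)}, y^{(i)}) the i-th training example and its label.
w_{k+1} = w_k - \frac{\eta}{n} \sum_{i=1}^{n}
          \nabla_{w} H\left( y^{(i)},\, p\left( x^{(i)}; w_k \right) \right)
```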
dimension of the patch, focusing on the central pixel. With these downsampling operations, there is
only a need for two convolutional blocks since the input images are small (5 × 5). The final convolved
features are transferred to the fully connected layer that has 1024 neurons. Finally, a layer with softmax
activation computes the probability for each class and yields the predictions.
Figure 7. CNN architecture. The conv layers refer to convolutional layers, pool to max-pooling layers, fc
to the fully connected layer, and sm to the softmax layer. Dropout and batch normalization are omitted
to simplify the visualization.
Furthermore, dropout and batch normalization layers are introduced in the architecture, which
generally improve the capacity of generalization of the network by reducing overfitting (further
explained in Section 2.3.2). Batch normalization also helps to speed up the learning process of the
network. It reduces the internal covariate shift of hidden-unit values by normalizing layer inputs for each
training mini-batch. This allows the network to train faster, as it prevents layers from having to adapt to
substantial changes in the distribution of output values of previous layers [39]. In the model proposed,
dropout is applied at the fully connected layer, and batch normalization is used at the input and
after convolution.
A summary of the architecture is presented in Table 2, detailing the number of layers proposed,
the number of filters and the kernel sizes.
Table 2. CNN architecture. P refers to the input patch size (5 in our case), N to the number of channels
of the input image, and C to the number of classes to consider.
CNN Architecture

Layer  Type                 Neurons & # Maps     Kernel
0      Input                P × P × N
1      Batch normalization  P × P × N
2      Convolutional        P × P × 32           3 × 3
3      ReLU                 P × P × 32
4      Batch normalization  P × P × 32
5      Max-pooling          ⌈P/2⌉ × ⌈P/2⌉ × 32   2 × 2
6      Convolutional        ⌈P/2⌉ × ⌈P/2⌉ × 64   3 × 3
7      ReLU                 ⌈P/2⌉ × ⌈P/2⌉ × 64
8      Batch normalization  ⌈P/2⌉ × ⌈P/2⌉ × 64
9      Max-pooling          ⌈P/4⌉ × ⌈P/4⌉ × 64   2 × 2
10     Fully connected      1024 neurons
11     Dropout              1024 neurons
12     Softmax              C neurons
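A minimal Keras sketch of this architecture follows; it is an assumption of how Table 2 translates into code (the dropout rate, the optimizer settings, and the fully connected layer's activation are illustrative placeholders), not the authors' released implementation [44]:

```python
# Sketch of the Table 2 architecture, assuming a TensorFlow backend.
import tensorflow as tf

def build_cnn(patch_size, n_channels, n_classes, dropout_rate=0.2):
    return tf.keras.Sequential([
        tf.keras.Input(shape=(patch_size, patch_size, n_channels)),  # layer 0
        tf.keras.layers.BatchNormalization(),                        # layer 1
        tf.keras.layers.Conv2D(32, 3, padding="same",
                               activation="relu"),                   # layers 2-3
        tf.keras.layers.BatchNormalization(),                        # layer 4
        tf.keras.layers.MaxPooling2D(2, padding="same"),             # layer 5
        tf.keras.layers.Conv2D(64, 3, padding="same",
                               activation="relu"),                   # layers 6-7
        tf.keras.layers.BatchNormalization(),                        # layer 8
        tf.keras.layers.MaxPooling2D(2, padding="same"),             # layer 9
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1024, activation="relu"),              # layer 10
        tf.keras.layers.Dropout(dropout_rate),                       # layer 11
        tf.keras.layers.Dense(n_classes, activation="softmax"),      # layer 12
    ])

# e.g., 5x5 patches of a 200-band hyperspectral image with 16 classes;
# mini-batch gradient descent with the cross-entropy loss of Equation (3)
model = build_cnn(patch_size=5, n_channels=200, n_classes=16)
model.compile(optimizer="sgd", loss="categorical_crossentropy")
```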
Dropout layers randomly disable neurons during training: each neuron is kept active
according to some configurable probability or has its output set to zero, thus not contributing to the
back-propagation algorithm. Dropout has proven to be very effective, since it prevents complex
co-adaptations by training the network with different dropped neurons on every iteration [40].
Apart from improving convergence, batch normalization layers also act as regularizers [41].
They add random noise to the value of hidden units, by subtracting the mean and dividing
by the standard deviation of the batches. Moreover, applying this normalization reduces the
need for dropout, hence a lower dropout rate can be used and the loss of information is smaller.
However, the combination of these regularization techniques is complex. According to [42], applying
dropout before a batch normalization layer often leads to unstable numerical behavior in inference that
introduces more erroneous predictions. Following this theory and after trying several combinations,
we have found that the best results are obtained if we place a total of three batch normalization layers
(one at the first input layer and the other two after each convolution operation), and only one dropout
layer with a small drop ratio just before the final softmax prediction layer.
Furthermore, the introduction of nonlinearity through the ReLU activation function and the
reduction of parameters provided by the max-pooling operations also help to control overfitting.
With the combination of all the described techniques, and with the aid of data augmentation
by rotating the patches, the network handles the overfitting problem successfully. Therefore, it is
able to achieve a high level of accuracy when classifying data that has not been seen during training.
Table 3. Grid search and selected values for the training parameters of the CNN.
dataset, followed by a statistical analysis combining all the results. Figure 8 illustrates the proposed
experimental process, which is explained in detail in the following subsections. Finally, the models
have also been tested at the Geoscience and Remote Sensing Society (GRSS) DASE website competition [43].
The source code, with the implementation of the model and the experimental setup with the seeds for
the cross-validation procedure, can be found at [44].
Figure 8. Experimental setup. For each dataset, we obtain five independent subsamples and perform
repeated stratified cross-validation. The collected results of all datasets are analyzed with a statistical
test (Friedman and Post-hoc).
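A brief sketch of this validation procedure with scikit-learn and scipy; the fold and repeat counts, the method list, and the variable names are illustrative assumptions:

```python
# Repeated stratified cross-validation per dataset, then a Friedman test
# over the results collected across all datasets.
import numpy as np
from scipy.stats import friedmanchisquare
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

def cv_accuracy(model, X, y, n_splits=5, n_repeats=5, seed=0):
    """Mean accuracy of a scikit-learn-style classifier under repeated
    stratified cross-validation (the CNN would be scored on the same folds)."""
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats,
                                 random_state=seed)
    return cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()

# results[m] holds one accuracy per dataset for method m. The Friedman test
# checks whether the methods rank significantly differently across datasets;
# a post-hoc test (pairwise comparisons with correction) follows if they do:
# stat, p = friedmanchisquare(results["kNN"], results["SVM"],
#                             results["RF"], results["CNN"])
```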
For all experiments, we apply the same steps for generating the training data and the same
parameter choice for each method. The kNN algorithm has been tested using one, three, and five
neighbours. Moreover, as dimensionality reduction is essential for kNN to avoid the Hughes
phenomenon [48], we apply feature selection with the extra-trees method [49], a further randomised
version of RF. For the SVM, we use the RBF kernel and the best combination of parameter values found
via a grid search: γ (kernel coefficient) ∈ {0.01, 0.1, 0.25, 0.5, 1} and C (penalty) ∈ {1, 10, 50, 100}.
The most suitable values found were γ = 0.01 and C = 50. Finally, for the RF method, we fix the model to
have 100 estimators and no depth limit.
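For reference, a minimal scikit-learn sketch of these baseline configurations might look as follows; the feature-selection wiring and object names are illustrative assumptions:

```python
# Baseline classifiers with the parameter choices described above.
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# kNN with extra-trees feature selection to mitigate the Hughes phenomenon
knn = make_pipeline(
    SelectFromModel(ExtraTreesClassifier(n_estimators=100)),
    KNeighborsClassifier(n_neighbors=3),    # also tested with k = 1 and 5
)

# SVM with RBF kernel, searching the gamma and C ranges reported above
svm = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"gamma": [0.01, 0.1, 0.25, 0.5, 1], "C": [1, 10, 50, 100]},
)   # best values found: gamma = 0.01, C = 50

# Random forest with 100 estimators and no depth limit
rf = RandomForestClassifier(n_estimators=100, max_depth=None)
```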
Table 4. Indian Pines cross-validation results. Sample distribution in each experiment and individual
class, overall, and average accuracies of all methods. Best results are highlighted in bold.
Table 5. Pavia University cross-validation results. Sample distribution in each experiment and
individual class, overall, and average accuracies of all methods. Best results are highlighted in bold.
Table 6. Salinas cross-validation results. Sample distribution in each experiment and individual class,
overall, and average accuracies of all methods. Best results are highlighted in bold.
Salinas Cross-Validation
Sample Distribution Accuracy
# Class Samples 1NN 3NN 5NN SVM RF CNN
1 Brocoli_green_weeds_1 402 99.26 98.94 98.75 99.84 99.99 99.99
2 Brocoli_green_weeds_2 746 99.63 99.53 99.36 99.90 100 99.28
3 Fallow 396 99.72 99.74 99.77 99.86 99.36 99.59
4 Fallow_rough_plow 279 99.76 99.66 99.58 99.76 99.83 99.44
5 Fallow_smooth 536 98.99 98.81 98.48 99.76 99.42 98.48
6 Stubble 792 100 100 100 100 100 100
7 Celery 716 99.90 99.83 99.77 99.94 99.97 99.95
8 Grapes_untrained 2255 82.64 83.15 83.31 92.16 91.43 92.81
9 Soil_vinyard_develop 1241 99.77 99.51 99.23 99.93 99.91 99.91
10 Corn_green_weeds 656 97.08 95.48 94.41 98.70 97.62 98.47
11 Lettuce_romaine_4wk 214 99.23 98.37 98.03 99.03 98.95 99.44
12 Lettuce_romaine_5wk 386 100 99.98 100 100 100 100
13 Lettuce_romaine_6wk 184 99.98 100 100 99.98 99.96 99.52
14 Lettuce_romaine_7wk 214 98.43 97.44 96.96 99.24 97.70 99.20
15 Vinyard_untrained 1454 81.28 81.15 80.84 80.34 77.01 91.34
16 Vinyard_vertical_trellis 362 98.94 98.29 97.90 99.26 98.90 99.58
Total 10833 OA 93.46 93.33 93.16 95.54 94.81 97.03
AA 97.16 96.87 96.65 97.98 97.50 98.56
Table 7. San Francisco cross-validation results. Sample distribution in each experiment and individual
class, overall, and average accuracies of all methods. Best results are highlighted in bold.
Table 8. Flevoland cross-validation results. Sample distribution in each experiment and individual
class, overall, and average accuracies of all methods. Best results are highlighted in bold.
Flevoland Cross-Validation
Sample Distribution Accuracy
# Class Samples 1NN 3NN 5NN SVM RF CNN
1 Rapeseed 3525 51.23 54.09 54.83 99.63 100 100
2 Potato 6571 82.96 88.02 88.06 98.47 99.32 99.23
3 Barley 3295 68.89 67.09 65.00 95.84 99.52 99.44
4 Maize 8004 96.27 96.83 97.24 98.00 99.80 99.62
5 Lucerne 468 12.20 6.10 6.56 96.29 86.55 98.73
6 Peas 482 60.23 59.45 58.77 86.81 99.99 99.99
7 Fruit 832 4.36 0.92 1.08 88.96 99.45 99.47
8 Beans 252 18.30 7.63 7.98 82.15 93.29 95.02
9 Wheat 24 50.50 27.83 17.01 6.83 13.33 77.33
10 Beet 112 7.77 1.04 0.21 58.11 59.23 98.43
11 Grass 1160 73.50 72.45 71.42 89.78 96.18 99.20
12 Oats 72 36.00 24.78 20.78 36.89 12.72 74.39
Total 24797 OA 74.86 76.06 75.96 96.53 98.65 99.36
AA 46.85 42.19 40.74 78.15 79.95 95.06
Figure 9. Cross-validation overall accuracy of all methods over each dataset, ordered by increasing
size of training set.
In general, the results obtained demonstrate the capacity of CNNs to adapt to datasets of different
sizes and characteristics, considering that we have used the same model for images of different
nature (radar and hyperspectral). Furthermore, the fact that we have kept the architecture and
parameter selection of the network constant demonstrates the generality of the model, as it achieves
a high level of accuracy in all experiments.
both CPU and GPU computation time of the CNN model are presented in order to perform a fair
comparison. The kNN models were not considered here due to their lower performance.
Table 9. Average computation time (s) of the cross-validation experiments for all datasets over different
classifiers. The fastest methods are highlighted in bold.
As can be observed, the CNN trained on GPU was the fastest model in four of the five
datasets considered. The largest difference appears in the largest dataset, Flevoland, on which the
rest of the models are considerably slower than the CNN. Although training the CNN on CPU is
time-consuming, GPU training achieves an average speedup of three times. SVM appears to be
the method requiring the most training time, and RF almost doubles the time of the CNN on average.
Furthermore, a clear advantage of the CNN model is that it is very quick at testing, especially when
compared to SVM. Therefore, it can be reasoned that the designed DL model not only achieves
higher accuracy, as shown in the previous section, but is also the fastest method for both training
and testing.
From Table 9, we can also estimate the computational effort invested in performing the grid search
(described in Section 2.3.3) for selecting the optimal set of parameter values of the CNN. The grid
search comprised a total of 192 possible combinations, and five repetitions of cross-validation with
three datasets (Indian Pines, Pavia, and San Francisco) were executed. Therefore, although obtaining
an optimal configuration for the CNN was time-consuming (around 30 h of computation), the effort
for future studies would be minimal, as we have been able to design a general network that works
efficiently with images from different sensors.
Figure 10. Indian Pines GRSS competition. (a) Training set; (b) classification map (OA = 95.53%).
Figure 11. Pavia University GRSS competition. (a) Training set; (b) classification map (OA = 84.79%).
Figure 12. Flevoland GRSS competition. (a) Training set; (b) classification map (OA = 99.05%).
Figure 13. San Francisco GRSS competition. (a) Training set; (b) classification map (OA = 99.37%).
Table 12. Overall accuracies obtained for each method at the GRSS DASE Website competition.
Table 13. Influence of input patch size of the CNN on classification accuracy with different images.
Figure 14. Influence of patch size on classification accuracy over four datasets.
vector machines, random forests, and k-nearest-neighbours. The cross-validation experiments carried
out with all methods over five different datasets, and the subsequent statistical analysis of the
results, demonstrated that the proposed convolutional neural network achieved an improvement in
performance with respect to the rest of the models considered, in both overall and per-class accuracies.
These methods were also evaluated at an online competition of the GRSS, and the convolutional neural
network again provided the highest accuracy. Moreover, comparison was also made with regard to
the needed computation time, and the deep learning model was reported to be the fastest for both
training and testing. In conclusion, our study demonstrated that deep learning is a very powerful
solution for the problem of LULC classification, as it showed promising results for all the images
studied, even though they come from different sources and have distinct characteristics. It also aimed to
show the necessity of establishing a standard validation methodology for evaluating the quality of
novel proposals from which the remote sensing community would benefit.
The study also illustrated the difficulty of designing an optimal configuration for a convolutional
neural network, and provided some suggestions for the parameter settings as a result of a grid
search experimentation. In addition, other aspects regarding the performance of the network were
analysed such as the influence of the size of the patches extracted from the images, and the importance
of preventing overfitting problems by performing data augmentation and applying regularization
techniques such as dropout or batch normalization. Furthermore, one important finding made in this
research was that the convolutional neural network performed exceptionally well on minority classes
compared to other methods, which is essential due to the common presence of the imbalance problem
between classes in remote sensing data.
Future work on this path should include a deeper study on the configuration of the architecture
and parameters of the network, as it is a very time-consuming task. Therefore, being able to automate
this process would definitely help progress in the field. Moreover, further research might be driven
towards evaluating the efficiency of the proposed method when dealing with very large scale images,
when memory and processing time start being a severe issue. Furthermore, the application of more
complex post-processing techniques could be investigated in order to improve the quality of results.
Finally, a more challenging future objective could be to adapt the system to work with fused data from
multiple sources and sensors. For instance, more consistent and accurate classification maps could be
obtained by integrating lidar information with radar and hyperspectral images.
Author Contributions: All authors made substantial contributions to conception and design of the study.
M.C.-G. performed the experiments, analyzed the data, and wrote the paper. J.G.-G. and J.C.R. guided the
research and reviewed the manuscript.
Funding: This research was funded by the Spanish Ministry of Economy and Competitiveness under the projects
TIN2014-55894-C2-1-R and TIN2017-88209-C2-2-R.
Acknowledgments: We are grateful to NVIDIA for their GPU Grant Program that has provided us with the latest
technology for performing our experiments.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:

LULC  Land Use and Land Cover
RS    Remote Sensing
ML    Machine Learning
DL    Deep Learning
CNN   Convolutional Neural Network
SVM   Support Vector Machine
RF    Random Forest
kNN   k-Nearest-Neighbors
MLC   Maximum-Likelihood Classifier
SAE   Stacked Autoencoder
DBN   Deep Belief Network
ReLU  Rectified Linear Unit
GRSS  Geoscience and Remote Sensing Society
OA    Overall Accuracy
AA    Average Accuracy
References
1. Hussain, M.; Chen, D.; Cheng, A.; Wei, H.; Stanley, D. Change detection from remotely sensed images: From
pixel-based to object-based approaches. ISPRS J. Photogramm. Remote Sens. 2013, 80, 91–106. [CrossRef]
2. Dandois, J.; Ellis, E. High spatial resolution three-dimensional mapping of vegetation spectral dynamics
using computer vision. Remote Sens. Environ. 2013, 136, 259–276. [CrossRef]
3. He, C.; Liu, Z.; Tian, J.; Ma, Q. Urban expansion dynamics and natural habitat loss in China: A multiscale
landscape perspective. Glob. Chang. Biol. 2014, 20, 2886–2902. [CrossRef] [PubMed]
4. Xie, J.; Chen, H.; Liao, Z.; Gu, X.; Zhu, D.; Zhang, J. An integrated assessment of urban flooding mitigation
strategies for robust decision making. Environ. Model. Softw. 2017, 95, 143–155. [CrossRef]
5. Zolkos, S.; Goetz, S.; Dubayah, R. A meta-analysis of terrestrial aboveground biomass estimation using lidar
remote sensing. Remote Sens. Environ. 2013, 128, 289–298. [CrossRef]
6. Paisitkriangkrai, S.; Sherrah, J.; Janney, P.; van den Hengel, A. Semantic Labeling of Aerial and Satellite
Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 9, 2868–2881. [CrossRef]
7. Chen, Y.; Lin, Z.; Zhao, X.; Wang, G.; Gu, Y. Deep learning-based classification of hyperspectral data. IEEE J.
Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 2094–2107. [CrossRef]
8. Jensen, J.R. Introductory Digital Image Processing: A Remote Sensing Perspective, 4th ed.; Prentice Hall Press:
Upper Saddle River, NJ, USA, 2015.
9. Rogan, J.; Franklin, J.; Roberts, D.A. A comparison of methods for monitoring multitemporal vegetation
change using Thematic Mapper imagery. Remote Sens. Environ. 2002, 80, 143–156. [CrossRef]
10. Du, Q.; Chang, C.I. A linear constrained distance-based discriminant analysis for hyperspectral image
classification. Pattern Recognit. 2001, 34, 361–373. [CrossRef]
11. Kal-Yi, H. A synergistic automatic clustering technique (SYNERACT) for multispectral image Analysis.
Photogramm. Eng. Remote Sens. 2002, 68, 33–40.
12. Etter, A.; McAlpine, C.; Wilson, K.; Phinn, S.; Possingham, H. Regional patterns of agricultural land use and
deforestation in Colombia. Agric. Ecosyst. Environ. 2006, 114, 369–386. [CrossRef]
13. Xu, M.; Watanachaturaporn, P.; Varshney, P.K.; Arora, M.K. Decision tree regression for soft classification of
remote sensing data. Remote Sens. Environ. 2005, 97, 322–336. [CrossRef]
14. Samaniego, L.; Bardossy, A.; Schulz, K. Supervised Classification of Remotely Sensed Imagery Using a
Modified k-NN Technique. IEEE Trans. Geosci. Remote Sens. 2008, 46, 2112–2125. [CrossRef]
15. Gislason, P.; Benediktsson, J.; Sveinsson, J. Random forests for land cover classification. Pattern Recognit. Lett.
2006, 27, 294–300. [CrossRef]
16. Mas, J.; Flores, J. The application of artificial neural networks to the analysis of remotely sensed data. Int. J.
Remote Sens. 2008, 29, 617–663. [CrossRef]
17. Melgani, F.; Bruzzone, L. Classification of hyperspectral remote sensing images with support vector machines.
IEEE Trans. Geosci. Remote Sens. 2004, 42, 1778–1790. [CrossRef]
18. Khatami, R.; Mountrakis, G.; Stehman, S.V. A meta-analysis of remote sensing research on supervised
pixel-based land-cover image classification processes: General guidelines for practitioners and future
research. Remote Sens. Environ. 2016, 177, 89–100. [CrossRef]
19. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385.
20. Li, Y.; Zhang, H.; Xue, X.; Jiang, Y.; Shen, Q. Deep learning for remote sensing image classification: A survey.
Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2018, 8, e1264. [CrossRef]
21. Kussul, N.; Lavreniuk, M.; Skakun, S.; Shelestov, A. Deep Learning Classification of Land Cover and Crop
Types Using Remote Sensing Data. IEEE Geosci. Remote Sens. Lett. 2017, 14, 778–782. [CrossRef]
22. Ding, J.; Chen, B.; Liu, H.; Huang, M. Convolutional Neural Network With Data Augmentation for SAR
Target Recognition. IEEE Geosci. Remote Sens. Lett. 2016, 13, 364–368. [CrossRef]
23. Romero, A.; Gatta, C.; Camps-Valls, G. Unsupervised deep feature extraction for remote sensing image
classification. IEEE Trans. Geosci. Remote Sens. 2016, 54, 1349–1362. [CrossRef]
24. Castelluccio, M.; Poggi, G.; Sansone, C.; Verdoliva, L. Land Use Classification in Remote Sensing Images by
Convolutional Neural Networks. arXiv 2015, arXiv:1508.00092.
25. Li, W.; Fu, H.; Yu, L.; Gong, P.; Feng, D.; Li, C.; Clinton, N. Stacked Autoencoder-based deep learning for
remote-sensing image classification: A case study of African land-cover mapping. Int. J. Remote Sens. 2016,
37, 5632–5646. [CrossRef]
26. Chen, Y.; Zhao, X.; Jia, X. Spectral–Spatial Classification of Hyperspectral Data Based on Deep Belief
Network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 2381–2392. [CrossRef]
27. Li, Y.; Zhang, H.; Shen, Q. Spectral–Spatial Classification of Hyperspectral Imagery with 3D Convolutional
Neural Network. Remote Sens. 2017, 9, 67. [CrossRef]
28. Hou, F.; Lei, W.; Li, H.; Xi, J. FMRSS Net: Fast Matrix Representation-Based Spectral-Spatial Feature Learning
Convolutional Neural Network for Hyperspectral Image Classification. Math. Probl. Eng. 2018, 2018, 1–11.
[CrossRef]
29. Kohavi, R. A Study of Cross-validation and Bootstrap for Accuracy Estimation and Model Selection.
In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI’95), Montreal, QC,
Canada, 20–25 August 1995; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1995; Volume 2,
pp. 1137–1143.
30. García, S.; Herrera, F. An extension on “statistical comparisons of classifiers over multiple data sets” for all
pairwise comparisons. J. Mach. Learn. Res. 2008, 9, 2677–2694.
31. Mahdianpari, M.; Salehi, B.; Rezaee, M.; Mohammadimanesh, F.; Zhang, Y. Very Deep Convolutional Neural
Networks for Complex Land Cover Mapping Using Multispectral Remote Sensing Imagery. Remote Sens.
2018, 10, 1119. [CrossRef]
32. Sharma, A.; Liu, X.; Yang, X.; Shi, D. A patch-based convolutional neural network for remote sensing image
classification. Neural Netw. 2017, 95, 19–28. [CrossRef]
33. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural
Networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Volume 1
(NIPS’12); Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q., Eds.; Curran Associates Inc.: Red Hook,
NY, USA, 2012; pp. 1097–1105.
34. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition.
arXiv 2014, arXiv:1409.1556.
35. Chen, Y.; Jiang, H.; Li, C.; Jia, X.; Ghamisi, P. Deep Feature Extraction and Classification of Hyperspectral
Images Based on Convolutional Neural Networks. IEEE Trans. Geosci. Remote Sens. 2016, 54, 6232–6251.
[CrossRef]
36. Nair, V.; Hinton, G. Rectified linear units improve Restricted Boltzmann machines. In Proceedings of the
27th International Conference on Machine Learning (ICML 2010), Haifa, Israel, 21 June 2010; Fürnkranz, J.,
Joachims, T., Eds.; pp. 807–814.
37. Boureau, Y.L.; Ponce, J.; Lecun, Y. A Theoretical Analysis of Feature Pooling in Visual Recognition.
In Proceedings of the 27th International Conference On Machine Learning (ICML 2010), Haifa, Israel,
21 June 2010; Fürnkranz, J., Joachims, T., Eds.; pp. 111–118.
38. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks.
In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics; Teh, Y.W.,
Titterington, M., Eds.; PMLR: Sardinia, Italy, 2010; Volume 9, pp. 249–256.
39. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal
Covariate Shift. arXiv 2015, arXiv:1502.03167.
40. Hinton, G.E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Improving neural networks by
preventing co-adaptation of feature detectors. arXiv 2012, arXiv:1207.0580.
41. Luo, P.; Wang, X.; Shao, W.; Peng, Z. Towards Understanding Regularization in Batch Normalization.
arXiv 2018, arXiv:1809.00846.
42. Li, X.; Chen, S.; Hu, X.; Yang, J. Understanding the Disharmony between Dropout and Batch Normalization
by Variance Shift. arXiv 2018, arXiv:1801.05134.
43. GRSS DASE Website Competition. Available online: http://dase.grss-ieee.org (accessed on 10 December 2018).
44. 2D Convolutional Neural Networks for Land Use and Land Cover Classification of Radar and Hyperspectral
Images. Available online: https://github.com/carranza96/cnn-landcover (accessed on 28 January 2019).
45. Kim, J.H. Estimating classification error rate: Repeated cross-validation, repeated hold-out and bootstrap.
Comput. Stat. Data Anal. 2009, 53, 3735–3745. [CrossRef]
46. Hurni, K.; Schneider, A.; Heinimann, A.; Nong, D.H.; Fox, J. Mapping the Expansion of Boom Crops in
Mainland Southeast Asia Using Dense Time Stacks of Landsat Data. Remote Sens. 2017, 9, 320. [CrossRef]
47. Graves, S.J.; Asner, G.P.; Martin, R.E.; Anderson, C.B.; Colgan, M.S.; Kalantari, L.; Bohlman, S.A. Tree Species
Abundance Predictions in a Tropical Agricultural Landscape with a Supervised Classification Model and
Imbalanced Data. Remote Sens. 2016, 8, 161. [CrossRef]
48. Hughes, G. On the mean accuracy of statistical pattern recognizers. IEEE Trans. Inf. Theory 1968, 14, 55–63.
[CrossRef]
49. Geurts, P.; Ernst, D.; Wehenkel, L. Extremely randomized trees. Mach. Learn. 2006, 63, 3–42. [CrossRef]
50. Demšar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res. 2006, 7, 1–30.
51. Gao, Q.; Lim, S.; Jia, X. Hyperspectral Image Classification Using Convolutional Neural Networks and
Multiple Feature Learning. Remote Sens. 2018, 10, 299. [CrossRef]
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).