Wavelet Convnets For Texture Classification


Wavelet Convolutional Neural Networks for Texture Classification

Shin Fujieda (The University of Tokyo, Digital Frontier Inc.)
Kohei Takayama (Digital Frontier Inc.)
Toshiya Hachisuka (The University of Tokyo)

arXiv:1707.07394v1 [cs.CV] 24 Jul 2017

Abstract

Texture classification is an important and challenging problem in many image processing applications. While convolutional neural networks (CNNs) have achieved significant success in image classification, texture classification remains a difficult problem since textures usually do not contain enough information regarding the shape of objects. In image processing, texture classification has traditionally been studied with spectral analyses, which exploit repeated structures in many textures. Since CNNs process images as-is in the spatial domain whereas spectral analyses process images in the frequency domain, these models have different characteristics in terms of performance. We propose a novel CNN architecture, wavelet CNNs, which integrates a spectral analysis into CNNs. Our insight is that the pooling layer and the convolution layer can be viewed as a limited form of a spectral analysis. Based on this insight, we generalize both layers to perform a spectral analysis with wavelet transform. Wavelet CNNs allow us to utilize spectral information which is lost in conventional CNNs but useful in texture classification. The experiments demonstrate that our model achieves better accuracy in texture classification than existing models. We also show that our model has significantly fewer parameters than CNNs, making our model easier to train with less memory.

1. Introduction

Texture is a key component used for various applications in computer graphics. While its definition varies slightly, texture is typically a surface image of an object and it does not represent the shape of the object. For example, a photograph of an entire human face is usually not considered to be a texture, but a close-up of human skin is. In rendering, artists use textures to add surface details to objects without having to increase geometric complexity. For image processing, texture is used to represent types of surfaces that are independent of shape. Texture can be thought of as a basic element that captures the appearance of surfaces of objects.

Accurate classification of textures is also fundamental in many important applications such as inspection and segmentation for image processing and generation of texture databases for rendering. At the same time, texture classification is a challenging problem since textures often vary a lot within the same class, due to changes in viewpoints, scales, lighting configurations, etc. In addition, textures usually do not contain enough information regarding the shape of objects, which is informative for distinguishing different objects in image classification tasks. Due to such difficulties, even the latest approaches based on convolutional neural networks have achieved only limited success compared to other tasks such as image classification.

We propose a unification of two major classification approaches, convolutional neural networks and spectral analyses, to address the difficulty of texture classification. Convolutional neural networks (CNNs) process an input texture as-is and collect statistics in the spatial domain. Spectral analysis transforms an input texture into a spectral domain and uses frequency statistics. CNNs are usually good at capturing spatial features, while a spectral analysis is good at capturing scale invariant features. We aim to consider both the spatial and spectral information so that a single model captures both types of features well. The key idea is that the pooling layer and the convolution layer in CNNs can be thought of as a limited form of a spectral analysis. Based on this idea, we generalize both layers to perform a spectral analysis using multiresolution analysis by wavelet transform. We thus named our model wavelet convolutional neural networks (wavelet CNNs). The overview of wavelet CNNs is shown in Figure 1. We demonstrate that wavelet CNNs achieve better or competitive classification accuracy while having a significantly smaller number of trainable parameters than conventional CNNs. Our model is thus easier to train and consumes less memory than CNNs. To summarize, our contributions are:

• Combination of CNNs and spectral analysis within one model.
• Generalization of pooling and convolution as a spectral analysis.
• Accurate and efficient texture classification using our model.

Several numerical experiments in the results section validate that our model successfully classified failure cases of existing models.

[Figure 1: network diagram. Labels: Input 3×224×224; Multiresolution Analysis producing 12×112×112, 12×56×56, and 12×28×28; 3×3 conv feature maps of 64×112×112, 128×56×56, 256×28×28, 256×14×14, and 128×7×7; Concat; Ave. pool 7×7; Fully connected, 2048 dims; Output.]

Figure 1. Overview of our model with 3-level decomposition of the input image. Our model processes the input image through convolution layers with 3 × 3 kernels and 1 × 1 padding. 3 × 3 convolutional kernels with the stride of 2 and 1 × 1 padding are used to reduce the size of feature maps. Additionally, the input image is decomposed through multiresolution analysis and the decomposed images are concatenated. The output of convolution layers is vectorized by an average pooling layer followed by three fully connected layers. The size of the output is equal to the number of classes included in the input dataset.

2. Related Work

Conventional Texture Descriptors There are a variety of approaches for extracting texture descriptors based on domain-specific knowledge. Structural approaches [36, 12] represent texture features based on spatial arrangements of selected pixels. Statistical approaches [34] consider a set of statistics of pixel intensities such as mean and co-occurrence matrices [15, 35] as features. These approaches can yield certain features of textures in a compact manner, but only under the assumptions in each approach.

More general approaches such as Markov random fields [28], fractal dimensions [6], and the Wold model [26] describe texture images as a probability model or a linear combination of a set of basis functions. The coefficients of these models are the texture features in these approaches. While these approaches are quite general, it is still difficult to choose the most suitable model for given textures.

Convolutional Neural Networks CNNs essentially replaced conventional descriptors by achieving significant performance [25] in various tasks without relying on much domain-specific knowledge. For texture classification, however, directly applying CNNs is known to achieve only moderate accuracy [14]. CNNs alone thus are not very suitable for texture classification, despite their successes in other problems.

Cimpoi et al. [8] demonstrated that a CNN in combination with Fisher Vectors (FV-CNN) can achieve much better accuracy in texture classification. Their model uses a pre-trained CNN to extract texture features and this CNN part is not trained with existing texture datasets. Inherited from conventional CNNs, this model has a large number of parameters, which makes it difficult to train in general. Our model uses fewer parameters and achieves results competitive with FV-CNN by fusing a CNN and a spectral analysis into one model.

Andrearczyk et al. [2] proposed texture CNN (T-CNN), which is a CNN specialized for texture classification. T-CNN uses a novel energy layer in which each feature map is simply pooled by calculating the average of its activated output. This results in a single value for each feature map, similar to an energy response to a filter bank. This approach does not improve classification accuracy, but its simple architecture reduces the number of parameters. While our model is inspired by the problem of texture classification, it is not specialized for texture classification. As discussed in Section 5, we confirmed that our model achieves performance competitive with AlexNet [25] for image classification.
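As a minimal sketch of the energy layer described above for T-CNN, the following Python code averages the activated output of each feature map to produce one value per map. It assumes ReLU as the activation and nested lists as feature maps; the function name `energy_layer` is illustrative and is not taken from any released implementation.

```python
def relu(value):
    """Rectified linear unit, assumed here as the activation function."""
    return value if value > 0.0 else 0.0


def energy_layer(feature_maps):
    """Pool each feature map to a single value.

    feature_maps: list of 2D lists, one H x W map per channel.
    Returns one average of the activated outputs per channel.
    """
    energies = []
    for fmap in feature_maps:
        activated = [relu(v) for row in fmap for v in row]
        energies.append(sum(activated) / len(activated))
    return energies


# Two 2 x 2 feature maps are reduced to two scalars.
maps = [
    [[1.0, -1.0], [3.0, -3.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
print(energy_layer(maps))
```

The output length depends only on the number of feature maps, not on their spatial size, which is why such a layer keeps the number of parameters in the following fully connected layers small.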
Spectral Approaches Spectral approaches transform textures into the frequency domain using a set of spatial filters. The statistics of the spectral information at different scales and orientations define texture features. This approach has been well studied in image processing and achieved practical results [37, 3, 22].

Feature extraction in the frequency domain has two advantages. First, a spatial filter can be easily made selective by enhancing only certain frequencies while suppressing others. This explicit selection of certain frequencies is difficult to control in CNNs. Additionally, the periodic structure of a texture can be represented as a certain spatial frequency in the spectral domain.

Figure 1 shows that the structure of our model is similar to that of CNNs using skip-connections [27, 30]. While deep neural networks including CNNs with skip-connections are known to be universal approximators [18], it is not clear whether CNNs can learn to perform spectral analyses in practice with available datasets. We thus propose to directly integrate spectral approaches into CNNs, particularly based on multiresolution analysis using wavelet transform [37]. Our experiments support the observation that a CNN with more parameters cannot be trained to become equivalent to our model with available datasets in practice. The key difference is that certain layers of wavelet CNNs have no trainable parameters. Instead, those layers perform multiresolution analysis using fixed parameters defined by wavelet transform.

3. Wavelet Convolutional Neural Networks

Overview We propose to formulate convolution and pooling in CNNs as filtering and downsampling. This formulation allows us to connect convolution and pooling with multiresolution analysis. In the following explanations, we use single-channel 1D data for the sake of brevity. Applications to 2D images with multiple channels are trivially possible, as is done in CNNs.

3.1. Convolutional Neural Networks

A convolutional neural network [25] is a variant of the neural network which uses a sparsely connected deep network. In the regular neural network model, every input is connected to every unit in the next layer. In addition to the use of an activation function and a fully connected layer, CNNs introduce convolution/pooling layers that connect only to local neighborhoods (called a local receptive field) around each input. Figure 2 illustrates the configuration we explain in the following.

[Figure 2: (a) inputs $x_0, \ldots, x_{n-1}$ mapped to outputs $y_0, \ldots, y_{n-1}$ through shared weights w over neighborhoods $N_0, \ldots, N_{n-1}$; (b) inputs $x_0, \ldots, x_{n-1}$ mapped to outputs $y_0, \ldots, y_{m-1}$ by pooling with p = 2.]

Figure 2. Concepts of convolution and average pooling layers. (a) Convolution layers compute a weighted sum of neighbors in each local receptive field. (b) Pooling layers take an average and perform downsampling.

Convolution Layers: Given an input vector with n components $\mathbf{x} = (x_0, x_1, \ldots, x_{n-1}) \in \mathbb{R}^n$, a convolution layer outputs a vector of the same number of components $\mathbf{y} = (y_0, y_1, \ldots, y_{n-1}) \in \mathbb{R}^n$:

$$y_i = \sum_{j \in N_i} w_j x_j, \tag{1}$$

where $N_i$ is a set of indices in the local receptive field at $x_i$ and $w_j$ is a weight. Following the notational convention in CNNs, we consider that $w_j$ includes the bias by having a constant input of 1. The equation thus says that each output $y_i$ is a weighted sum of neighbors $\sum_{j \in N_i} w_j x_j$ plus a constant.

Each layer defines the weights $w_j$ as constants over i. By sharing parameters, CNNs reduce the number of parameters and achieve translation invariance in the image space. The definition of $y_i$ in Equation 1 is equivalent to convolution of $x_i$ via a filtering kernel $w_j$, thus this layer is called a convolution layer. We can thus rewrite $y_i$ in Equation 1 using the convolution operator ∗ as

$$\mathbf{y} = \mathbf{x} * \mathbf{w}, \tag{2}$$

where $\mathbf{w} = (w_0, w_1, \ldots, w_{m-1}) \in \mathbb{R}^m$. Convolution layers in CNNs typically use different sets of weights for the same input and output the results as a concatenated vector. This common practice just applies Equation 1 repeatedly for each set of weights.

Pooling Layers: Pooling layers are typically used immediately after convolution layers to simplify the information. While max pooling is used in many applications of CNNs, Gatys et al. [10] showed that average pooling is more suitable for extracting texture features. We thus focus on average pooling, which in fact allows us to see the connection with multiresolution analysis. Given an input $\mathbf{x} \in \mathbb{R}^n$, average pooling outputs a vector of fewer components $\mathbf{y} \in \mathbb{R}^m$ as

$$y_j = \frac{1}{p} \sum_{k=0}^{p-1} x_{pj+k}, \tag{3}$$

where p defines the support of pooling and $m = n/p$. For example, p = 2 means that we reduce the number of outputs to a half of the inputs by taking pair-wise averages. Using the standard downsampling operator ↓, we can rewrite Equation 3 as

$$\mathbf{y} = (\mathbf{x} * \mathbf{p}) \downarrow p, \tag{4}$$

where $\mathbf{p} = (1/p, \ldots, 1/p) \in \mathbb{R}^p$ represents the averaging filter. Average pooling mathematically involves convolution via $\mathbf{p}$ followed by downsampling with the stride of p.

3.2. Generalized Convolution and Pooling

Equation 2 and Equation 4 can be combined into a generalized form of convolution and downsampling as

$$\mathbf{y} = (\mathbf{x} * \mathbf{k}) \downarrow p. \tag{5}$$

The generalized weight $\mathbf{k}$ is defined as

• $\mathbf{k} = \mathbf{w}$ with p = 1 (convolution in Equation 2)
• $\mathbf{k} = \mathbf{p}$ with p > 1 (pooling in Equation 4)
• $\mathbf{k} = \mathbf{w} * \mathbf{p}$ with p > 1 (convolution followed by pooling).

Our insight is that Equation 5 is equivalent to a part of a single level application of multiresolution analysis. Suppose that we use a pair of low-pass $\mathbf{k}_l$ and high-pass $\mathbf{k}_h$ filters to decompose data into the low frequency part $\mathbf{x}_{\mathrm{low}}$ and the high frequency part $\mathbf{x}_{\mathrm{high}}$ with p = 2:

$$\begin{aligned} \mathbf{x}_{\mathrm{low}} &= (\mathbf{x} * \mathbf{k}_l) \downarrow 2 \\ \mathbf{x}_{\mathrm{high}} &= (\mathbf{x} * \mathbf{k}_h) \downarrow 2. \end{aligned} \tag{6}$$

Our key insight is that multiresolution analysis [9] further decomposes the low frequency part $\mathbf{x}_{\mathrm{low}}$ into its low frequency part and high frequency part by repeatedly applying the same form of Equation 6. By defining $\mathbf{x}_{\mathrm{low},0} = \mathbf{x}$, multiresolution analysis can be written as

$$\begin{aligned} \mathbf{x}_{\mathrm{low},l+1} &= (\mathbf{x}_{\mathrm{low},l} * \mathbf{k}_l) \downarrow 2 \\ \mathbf{x}_{\mathrm{high},l+1} &= (\mathbf{x}_{\mathrm{low},l} * \mathbf{k}_h) \downarrow 2. \end{aligned} \tag{7}$$

The number of applications l is called a level in multiresolution analysis. Multiresolution analysis is thus equal to repeated applications of generalized convolution and pooling layers on low frequency parts with a specific pair of convolution kernels.

Figure 3 illustrates how CNNs and our wavelet CNNs differ under this formulation. Conventional CNNs can be seen as using only the low frequency parts and discarding all the high frequency parts. Our model instead uses both the low frequency parts and the high frequency parts within CNNs, so that we do not lose any information of the input $\mathbf{x}$ by the definition of multiresolution analysis. While this idea might look simple after the fact, our model is powerful enough to outperform existing, more complex models, as we will show in the results.

[Figure 3: diagram contrasting conventional CNNs, which follow only the low-pass filter $k_l$ at each level, with wavelet CNNs, which follow both the low-pass filter $k_l$ and the high-pass filter $k_h$ at each level.]

Figure 3. Relationship between conventional CNNs and wavelet CNNs in terms of multiresolution analysis. Conventional CNNs apply convolution and pooling repeatedly to the input, which is essentially equivalent to using only the low frequency components of multiresolution analysis. Our wavelet CNNs instead consider all the components including high frequency components.

Note that we cannot use an arbitrary pair of filters ($\mathbf{k}_l$ and $\mathbf{k}_h$) to perform multiresolution analysis. To avoid any loss of frequency information of the input $\mathbf{x}$, a pair should satisfy the quadrature mirror filter relationship. For wavelet transform, $\mathbf{k}_h$ is known as the wavelet function and $\mathbf{k}_l$ is known as the scaling function. We used Haar wavelets [13] for our experiments, but our model is not restricted to Haar. This constraint also suggests why it is difficult to train conventional CNNs to perform the same computation as wavelet CNNs do: weights in CNNs are ignorant of this important constraint and can only try to learn it from datasets.

Rippel et al. [29] proposed a related approach of replacing convolution and pooling by discrete Fourier transform and truncation of the coefficients. This approach, called spectral pooling, is equivalent to using only the low frequency part in our model, thus it is not essentially different from conventional CNNs. Our model is also different from merely applying multiresolution analysis on input data and using CNNs afterward, since multiresolution analysis is built inside the network and the input passes through CNN layers both before and after multiresolution analysis.

3.3. Implementation

Network Structure Figure 1 illustrates our network structure. We designed our network structure after a VGG-19 network [33] since it has been successfully used for extracting features of textures for different purposes [10, 1]. We use 3 × 3 convolutional kernels exclusively and 1 × 1 padding to ensure the output is the same size as the input.
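The formulation above can be illustrated with a short sketch in plain Python. It implements Equation 5 as filtering followed by downsampling, average pooling as the special case of Equation 4, and Equation 7 with Haar filters, and it also computes the output size of a 3 × 3 convolution with 1 × 1 padding. The orthonormal Haar normalization (1/√2), the correlation-style indexing without kernel flipping, the absence of boundary padding in `filter_and_downsample`, and all function names are assumptions made for illustration; this is not the Caffe implementation used in the experiments.

```python
import math


def filter_and_downsample(x, k, p):
    """Equation 5 sketch: filter x with kernel k, keep every p-th output.

    Assumption: correlation-style indexing and no boundary padding.
    """
    m = len(k)
    return [sum(k[t] * x[i + t] for t in range(m))
            for i in range(0, len(x) - m + 1, p)]


def average_pool(x, p):
    """Equation 4: the kernel is the averaging filter (1/p, ..., 1/p)."""
    return filter_and_downsample(x, [1.0 / p] * p, p)


# Haar filter pair (assumed orthonormal normalization).
S = 1.0 / math.sqrt(2.0)
HAAR_LOW = [S, S]     # scaling function (low-pass)
HAAR_HIGH = [S, -S]   # wavelet function (high-pass)


def haar_multiresolution(x, levels):
    """Equation 7: repeatedly split the low frequency part.

    Returns the final low frequency part and the list of high
    frequency parts, one per level.
    """
    low = list(x)
    highs = []
    for _ in range(levels):
        highs.append(filter_and_downsample(low, HAAR_HIGH, 2))
        low = filter_and_downsample(low, HAAR_LOW, 2)
    return low, highs


def conv_output_size(n, kernel=3, padding=1, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1


signal = [float(i % 7) for i in range(224)]
low, highs = haar_multiresolution(signal, 3)
print([len(h) for h in highs])           # [112, 56, 28]

size, sizes = 224, []
for _ in range(3):
    size = conv_output_size(size, stride=2)
    sizes.append(size)
print(conv_output_size(224), sizes)      # 224 [112, 56, 28]
```

For a 1D signal of length 224, three levels of decomposition give high frequency parts of length 112, 56, and 28, which are the same sizes produced by successive 3 × 3 convolutions with 1 × 1 padding and a stride of 2. With the assumed orthonormal Haar pair, the sum of squares of all coefficients equals that of the input, which reflects that no information is discarded.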
Instead of using the pooling layers to reduce the size of the feature maps, we exploit convolution layers with an increased stride. If 1 × 1 padding is added to the layer with a stride of two, the output becomes half the size of the input layer. This approach can be used to replace max pooling without loss in accuracy [20]. In addition, since both the VGG-like architecture and image decomposition in multiresolution analysis have the same characteristic that the size of images is successively reduced to a half, we combine each level of decomposed images with feature maps of the specific layer that are the same size as those images.

For texture classification, inserting an energy layer directly before fully connected layers improves the performance of a network while keeping the number of parameters small [2]. We used this approach, and the complete network of wavelet CNNs we tested consists of nine convolution layers, the same number of wavelet layers as decomposition levels of multiresolution analysis, and an energy layer followed by three fully connected layers. We implemented this network using Caffe [21]. The code for our model will be available on our website.

Learning Wavelet CNNs exploit an energy layer with the same size as the input of the layer, so the input images are required to have a fixed size. We thus train our proposed model exclusively with images of the size 224 × 224. These images are obtained by first scaling the training images to 256 × 256 pixels and then conducting random crops to 224 × 224 pixels and flipping. This random variation helps the model to prevent overfitting. For further robustness, we use batch normalization [19] throughout our network before activation layers during training. For the optimizer, we exploit the Adam optimizer [23] instead of SGD. We use the Rectified Linear Unit (ReLU) [11] as the activation function in all the experiments.

4. Results

Datasets For our experiments, we used two publicly available texture datasets: kth-tips2-b [16] and DTD [7]. The kth-tips2-b dataset contains 11 classes of 432 texture images. Each class consists of four samples and each sample has 108 images. Each sample is used for training once while the remaining three samples are used for testing. The results for kth-tips2-b are shown as the mean and the standard deviation over the four splits. The DTD dataset contains 47 classes of 120 images "in the wild", which means that images are collected in uncontrolled conditions. This dataset includes 10 available annotated splits with 40 training images, 40 validation images, and 40 testing images for each class. The results for DTD are averaged over the 10 splits. We processed the images in each dataset by global contrast normalization. We calculated the accuracy as the percentage of images that are correctly labeled, which is a common metric in texture classification.

Training from scratch We compared our model with AlexNet and T-CNN using texture datasets to train each model from scratch. For initialization of the parameters, we used a robust method that specifically accounts for ReLU [17]. While the structure of our models is designed after VGG networks, we found that VGG networks tend to perform poorly due to over-fitting with a large number of trainable parameters, if trained from scratch. We thus used AlexNet as an example of conventional CNNs instead of VGG networks for this experiment. Figure 4 and Table 1 show the results of training our model with different levels of multiresolution analysis. For both datasets, our models perform better than AlexNet and T-CNN.

Comparing different levels within wavelet CNNs, we found that the network with 4-level decomposition performed the best. In our experiments, the model with 5-level decomposition achieved almost the same accuracy as 4-level, but with more trainable parameters. Similar to the number of layers in CNNs, the decomposition level of wavelet CNNs is another hyper-parameter which can be tuned for different problems.

[Figure 4: two bar charts, (a) and (b), of accuracy [%] for AlexNet, T-CNN, and wavelet CNNs with 1-level to 5-level decomposition.]

Figure 4. Classification results of (a) kth-tips2-b and (b) DTD for networks trained from scratch. We compared our models (blue) with AlexNet and T-CNN.

              AlexNet    T-CNN      1-level    2-level    3-level    4-level    5-level
kth-tips2-b   48.3±1.4   49.6±0.6   57.5±3.0   57.0±2.3   57.8±2.5   60.5±2.1   59.6±2.5
DTD           22.7±1.3   27.8±1.2   29.0±1.4   30.3±0.9   31.6±1.0   32.2±0.8   32.2±0.7

Table 1. Classification results for networks trained from scratch indicated as accuracy (%).

Training with fine-tuning Figure 5 and Table 2 show the classification rates using the networks pre-trained with the ImageNet 2012 dataset [31]. Since the model with 4-level decomposition achieved the best accuracy in the previous experiment, we used this network in this experiment as well. We compared our model with a spectral approach using shearlet transform [24], a VGG network [5], T-CNN [2], and FV-CNN [8] with a fully connected layer (FC).

Our model achieved the best performance for the kth-tips2-b dataset, while it is outperformed only by FV-CNN for the DTD dataset. We analyzed the results and found that some classes of the DTD dataset contain non-texture images that clearly show the shape of an object. Since FV-CNN has a significantly larger number of trainable parameters than our models, we speculate that FV-CNN is simply trained to account for non-texture images as special cases. We include FC+FV-CNN because it is the state of the art, but a comparison with FV-CNN on accuracy alone is not necessarily fair because of this sheer difference in model complexity.

[Figure 5: two bar charts, (a) and (b), of accuracy [%] for Shearlet, VGG-M, T-CNN, FC+FV-CNN (VGG-M), and Wavelet CNN.]

Figure 5. Classification results of (a) kth-tips2-b and (b) DTD for networks pre-trained with ImageNet 2012 dataset. We compared our model (wavelet CNN with 4-level decomposition) with shearlet transform, VGG-M, T-CNN, and FC+FV-CNN.

              Shearlet   VGG-M      T-CNN      FC+FV-CNN (VGG-M)   Wavelet CNN
kth-tips2-b   62.3±0.8   70.7±1.7   72.4±2.1   73.9±4.9            74.2±1.2
DTD           21.6±0.9   55.2±1.2   55.8±0.8   69.8±1.1            59.8±0.9

Table 2. Classification results for networks pre-trained with ImageNet indicated as accuracy (%).

Number of parameters To assess the complexity of each model, we compared the number of trainable parameters such as weights and biases for classification to 1000 classes (Figure 6). Conventional CNNs such as VGG-M (which is also used in FV-CNN) and AlexNet have a large number of parameters while their depth is a little shallower than our proposed model. Even compared to T-CNN, which aims at reducing the model complexity, the number of parameters in our model with 4-level decomposition is less than half. We also note that our model achieved higher accuracy than T-CNN does.

This result confirms that our model achieves better results with a significantly reduced number of parameters than existing models. The memory consumption of each Caffemodel is: 392 MB (VGG-M), 232 MB (AlexNet), 89.1 MB (T-CNN), and 43.8 MB (Ours). Our network is thus suitable to run on a system with a limited amount of memory. The small number of parameters also generally suppresses over-fitting of the model for small datasets.

[Figure 6: bar chart of the number of parameters in millions for VGG-M (102.9), AlexNet (60.9), T-CNN (23.4), and wavelet CNNs with 1-level (9.14), 2-level (9.22), 3-level (9.68), 4-level (11.5), and 5-level (13.6) decomposition.]

Figure 6. The number of trainable parameters in millions. Our model, even with 5 levels of multiresolution analysis, has fewer parameters than any other competing model we tested.

Visual comparisons of classified images Figure 7 shows some extracted images for several classes in our experiments. The images in the top two rows of Figure 7 are from the kth-tips2-b dataset, while the images in the bottom three rows of Figure 7 are from the DTD dataset. A red square indicates that the texture is incorrectly classified into the class. We can visually see that a spectral approach is insensitive to scale variation and extracts detailed features, whereas a spatial approach is insensitive to distortion. For example, in Aluminium foil, the image of wrinkled aluminium foil is properly classified with a shearlet transform, but not with VGG-M. In Banded, VGG-M classifies distorted lines into the correct class. Since our model is the combination of both approaches, it can assign texture images to the correct label in every variation above.

[Figure 7: image grid. Rows: Corduroy, Aluminium foil, Stratified, Lacelike, Banded. Columns: (a) Shearlet Transform, (b) VGG-M, (c) Our Model, (d) References.]

Figure 7. Some results classified by (a) shearlet transform, (b) VGG-M, (c) our model and (d) references. The images in the top two rows are extracted from kth-tips2-b and the rest are extracted from DTD. The images in red squares are wrongly classified images.

5. Discussion

Application to image classification Since we do not assume anything regarding the input, our model is not necessarily restricted to texture classification. To confirm the generality, we trained a wavelet CNN with four-level decomposition and AlexNet with the ImageNet 2012 dataset from scratch to perform image classification. Our model obtained an accuracy of 59.8% whereas AlexNet resulted in 57.1%. Note that the number of parameters of our model is about five times smaller than that of AlexNet (Figure 6). Our model is thus also suitable for image classification with a smaller memory footprint. Other applications such as image recognition and object detection with our model should be similarly possible.

Lp pooling An interesting generalization of max and average pooling is Lp pooling [4, 32]. The idea of Lp pooling is that max pooling can be thought of as computing the L∞ norm, while average pooling can be considered as computing the L1 norm. In this case, Equation 4 cannot be written as linear convolution anymore due to the non-linear transformation in the norm calculation. Our overall formulation, however, is not necessarily limited to multiresolution analysis either, and we can simply replace the downsampling part by the corresponding norm computation to support Lp pooling. This modification, however, will not retain all the frequency information of the input as it is no longer multiresolution analysis. We focused on average pooling as it has a clear connection to multiresolution analysis.

Limitations We designed wavelet CNNs to put each high frequency part between layers of the CNN. Since our network has four layers to reduce the size of feature maps, the maximum decomposition level is restricted to five. This design is likely to be less than ideal since we cannot tweak the decomposition level independently from the depth (and thereby the number of trainable parameters) of the network. A different network design might make this separation of hyper-parameters possible.

While wavelet CNNs achieved the best accuracy for training from scratch, their performance with fine-tuning with the ImageNet 2012 dataset is only comparable to FV-CNN, albeit with a significantly smaller number of parameters. We speculate that this is partially because of the more complex model of FV-CNN, but another possibility is that pre-training with the ImageNet 2012 dataset is simply not appropriate for texture classification. An exact reasoning of failure cases, however, is generally difficult for any neural network model, and our model is not an exception. We also note that we could not find a publicly available texture dataset at the same scale as the ImageNet 2012 dataset.

We should also point out that, while our model improves accuracy considerably when only texture datasets are used for training, the accuracy is still around 60% for kth-tips2-b and 32% for DTD. This level of accuracy might not be enough yet for certain applications and there is still room for improvement when compared to image classification.

6. Conclusion

We presented a novel CNN architecture which incorporates a spectral analysis into CNNs. We showed how to reformulate convolution and pooling layers in CNNs into a generalized form of filtering and downsampling. This reformulation shows how conventional CNNs perform a limited version of multiresolution analysis, which then allows us to integrate multiresolution analysis into CNNs as a single model called wavelet CNNs. We demonstrated that our model achieves better accuracy for texture classification with a smaller number of trainable parameters than existing models. In particular, our model outperformed all the existing models with significantly more trainable parameters when we trained each model from scratch. A wavelet CNN is a general learning model, and applications to other problems are interesting future work.

References

[1] M. Aittala, T. Aila, and J. Lehtinen. Reflectance modeling by neural texture synthesis. ACM Trans. Graph., 35(4):65:1–65:13, July 2016.
[2] V. Andrearczyk and P. F. Whelan. Using filter banks in convolutional neural networks for texture classification. Pattern Recognition Letters, 84:63–69, 2016.
[3] S. Arivazhagan, T. G. Subash Kumar, and L. Ganesan. Texture classification using curvelet transform. International Journal of Wavelets, Multiresolution and Information Processing, 05(03):451–464, 2007.
[4] Y. Boureau, J. Ponce, and Y. Lecun. A theoretical analysis of feature pooling in visual recognition. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 111–118. Omnipress, 2010.
[5] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014.
[6] B. B. Chaudhuri and N. Sarkar. Texture segmentation using fractal dimension. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(1):72–77, Jan 1995.
[7] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. 5
[8] M. Cimpoi, S. Maji, and A. Vedaldi. Deep filter banks for texture recognition and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015. 2, 6
[9] J. L. Crowley. A representation for visual information. Technical report, 1981. 4
[10] L. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems 28, pages 262–270. Curran Associates, Inc., 2015. 3, 4
[11] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS-11), volume 15, pages 315–323, 2011. 5
[12] R. K. Goyal, W. L. Goh, D. P. Mital, and K. L. Chan. Scale and rotation invariant texture analysis based on structural property. In Industrial Electronics, Control, and Instrumentation, 1995., Proceedings of the 1995 IEEE IECON 21st International Conference on, volume 2, pages 1290–1294, Nov 1995. 2
[13] A. Haar. Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen, 69(3):331–371, 1910. 4
[14] L. G. Hafemann, L. S. Oliveira, and P. R. Cavalin. Forest species recognition using deep convolutional neural networks. In 22nd International Conference on Pattern Recognition (ICPR), pages 1103–1107, 2014. 2
[15] R. M. Haralick, K. Shanmugam, and I. Dinstein. Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics, SMC-3(6):610–621, Nov 1973. 2
[16] E. Hayman, B. Caputo, M. Fritz, and J.-O. Eklundh. On the Significance of Real-World Conditions for Material Classification, volume 4, pages 253–266. 2004. 5
[17] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In IEEE International Conference on Computer Vision (ICCV), pages 1026–1034, Dec 2015. 5
[18] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991. 3
[19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015. 5
[20] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. In International Conference on Learning Representations Workshop Track, 2015. 5
[21] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM, 2014. 5
[22] M. Kanchana and P. Varalakshmi. Texture classification using discrete shearlet transform. International Journal of Scientific Research, 5, 2013. 3
[23] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. 5
[24] K. G. Krishnan, P. T. Vanathi, and R. Abinaya. Performance analysis of texture classification techniques using shearlet transform. In International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), pages 1408–1412, March 2016. 6
[25] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114. 2012. 2, 3
[26] F. Liu and R. W. Picard. Periodicity, directionality, and randomness: Wold features for image modeling and retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(7):722–733, Jul 1996. 2
[27] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015. 3
[28] B. S. Manjunath and W. Y. Ma. Texture features for browsing and retrieval of image data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8):837–842, Aug 1996. 2
[29] O. Rippel, J. Snoek, and R. P. Adams. Spectral representations for convolutional neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems, pages 2449–2457, 2015. 4
[30] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), volume 9351, pages 234–241, 2015. 3
[31] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015. 6
[32] P. Sermanet, S. Chintala, and Y. LeCun. Convolutional neural networks applied to house numbers digit classification. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR), pages 3288–3291, Nov 2012. 8
[33] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. 4
[34] G. N. Srinivasan and G. Shobha. Statistical texture analysis. Proceedings of World Academy of Science, Engineering and Technology, 36, 2008. 2
[35] J. Y. Tou, Y. H. Tay, and P. Y. Lau. One-dimensional grey-level co-occurrence matrices for texture classification. In International Symposium on Information Technology, volume 3, pages 1–6, Aug 2008. 2
[36] M. Tüceryan and A. K. Jain. Texture segmentation using Voronoi polygons. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(2):211–216, Feb 1990. 2
[37] M. Unser. Texture classification and segmentation using wavelet frames. IEEE Transactions on Image Processing, 4(11):1549–1560, Nov 1995. 3
