Sensors: Partial Discharge Recognition With A Multi-Resolution Convolutional Neural Network
sensors
Article
Partial Discharge Recognition with a
Multi-Resolution Convolutional Neural Network
Gaoyang Li 1 , Xiaohua Wang 1, *, Xi Li 2 , Aijun Yang 1 and Mingzhe Rong 1, *
1 State Key Laboratory of Electrical Insulation and Power Equipment, School of Electrical Engineering,
Xi’an Jiaotong University, Xi’an 710049, China; [email protected] (G.L.);
[email protected] (A.Y.)
2 The Global Energy Interconnection Development and Cooperation Organization (GEIDCO), Beijing 100031,
China; [email protected]
* Correspondence: [email protected] (X.W.); [email protected] (M.R.)
Received: 24 September 2018; Accepted: 16 October 2018; Published: 18 October 2018
Abstract: Partial discharge (PD) is not only an important symptom for monitoring the imperfections
in the insulation system of a gas-insulated switchgear (GIS), but also the factor that accelerates the
degradation. At present, monitoring ultra-high-frequency (UHF) signals induced by PDs is regarded
as one of the most effective approaches for assessing the insulation severity and classifying the PDs.
Therefore, in this paper, a deep learning-based PD classification algorithm is proposed and realized
with a multi-column convolutional neural network (CNN) that incorporates UHF spectra of multiple
resolutions. First, three subnetworks, as characterized by their specified designed temporal filters,
frequency filters, and texture filters, are organized and then intergraded by a fully-connected neural
network. Then, a long short-term memory (LSTM) network is utilized for fusing the embedded
multi-sensor information. Furthermore, to alleviate the risk of overfitting, a transfer learning approach
inspired by manifold learning is also presented for model training. To demonstrate, 13 modes of defects
considering both the defect types and their relative positions were well designed for a simulated GIS
tank. A detailed analysis of the performance reveals the clear superiority of the proposed method,
compared to 18 typical baselines. Several advanced visualization techniques are also implemented to
explore the possible qualitative interpretations of the learned features. Finally, a unified framework
based on matrix projection is discussed to provide a possible explanation for the effectiveness of
the architecture.
1. Introduction
Gas-insulated switchgears (GISs) are widely used as the major control and protection equipment
in medium to ultra-high voltage substations, due to their superior compactness, high reliability,
strong dielectric strength, and maintenance-free properties. However, unavoidable imperfections
and deterioration in the insulation system pose serious threats to the reliability and safety of the
GIS and whole power grids [1–3]. The occurrence of partial discharges (PDs) is not only one of the
main characteristics that can effectively reflect the inner dielectric flaws, but is also the cause of the
accelerated degradation [4,5]. Therefore, it is imperative to assess the potential correlation between the
PD patterns and the defect types, so that corresponding maintenance activities can be taken before a
complete breakdown.
The conventional monitoring methods of PD could be summarized into three main categories:
chemical, electrical, and acoustic [6]. Among these, ultra-high-frequency (UHF) signals are becoming
increasingly important due to their high sensitivity, better immunity against interference, and the ability
for real-time monitoring [7,8]. Considering the stochastic nature of PDs, machine learning has long
been the mainstream method for the diagnosis of UHF signals. The existing literature covers a wide
range of techniques, including Support Vector Machines (SVM) [9], Neural Networks (NN) [10–12],
cluster analysis [13], fuzzy logic [14] and so on. A more detailed taxonomy of NNs could give rise to
methodologies such as the self-organizing map (SOM) [10], Ensemble neural networks (ENN) [11],
and Probabilistic Neural Networks (PNN) [12]. Compared with classifiers, feature extraction plays
a more dominant role in the success of PD recognition due to the high dimensionality of the PD
data. A gamut of techniques have been applied to both the time-resolved partial discharge (TRPD)
and phase-resolved partial discharge (PRPD) patterns [6], of which the statistical parameters [15],
the time-power ratio map [16], fractal features [17], Fourier transform [18], wavelet analysis [12],
Hilbert-Huang transform (HHT) [19], and S-transform [20] are a few. In addition, some advanced
signal processing techniques such as compressed sensing [21] and principal component analysis
(PCA) [22] are also utilized. A summary of these implementations is shown in Table 1.
Although the cooperation of features and algorithms indicates a considerable degree of success,
it is pertinent to note that most of these approaches suffer from a fundamental limitation: the carefully
designed feature extraction process usually calls for much human expertise and hence, it is subject to human
error. The significant randomness of the electromagnetic waves caused by propagation and reflection
makes the process even more intricate [23]. Besides, even with all these efforts, the features designed
for a particular task are not guaranteed to be the optimal choices for the other application scenarios
and datasets. Therefore, learning features directly from the dataset itself may be a better choice.
Among all of the analysis methods, time–frequency (TF) analysis is especially applicable, due
to its invertibility, and its ability to provide both time and frequency resolutions [24]. From the
perspective of manifold learning, the intrinsic advantage of the TF method is to obtain continuity
and compactness in a high-dimensional space at the cost of dimensionality. However, the high
dimensionality poses a substantial impediment for the traditional machine learning models. The matrix
compression algorithms for dimensionality reduction such as the two-dimensional PCA (2DPCA) [25]
and Non-negative matrix factorization (NMF) [26] inevitably share similar obstacles with the other
feature engineering-based methodologies, such as adaptation and generalization.
Fortunately, the recent advances in deep learning make it possible to extract high-level features
automatically from high-dimensional sensory data, and they demonstrate dramatic success in various
areas such as natural language processing, image classification, and auto-driving, of which the
Convolutional Neural Network (CNN) has been designed for vision-related tasks [27,28]. Therefore,
the combination of the CNN and TF maps becomes a potentially promising option for the PD signal
recognition tasks. Nevertheless, although it sounds transparent, directly applying CNN for TF maps is
not a good idea, due to the inherently non-local and serial characteristics of TF maps. In our preliminary
work [29], we proposed a CNN of frequency matching to obtain the semantic meaning. In this paper,
we further improve the framework into multiple columns to realize multi-resolution diagnosis, thus
promoting the diagnostic accuracy to a new level. The major contributions are summarized as follows.
Table 1. Summary of the relevant features and classifiers that have been used for partial discharge (PD) recognition.
The rest of this paper is organized as follows. Some related theories are introduced first in
Section 2. Section 3 provides the details of both the model architectures and the transfer learning
process. The platform presented in Section 4 provides the necessary dataset for verifying the proposed
method, and Section 5 describes the recognition results. Based on the experimental results, Section 6
discusses the effectiveness of the proposed model. Finally, Section 7 concludes the paper.
2. Related Work
In deep learning, the most straightforward way to improve model performance has shifted
from feature engineering to network architecture design. ResNet [30], proposed by Microsoft, is
a good example: it introduces an extra column with residual mapping and achieves state-of-the-art
performance. Although the early implementation of the multi-column neural network only aimed
at accelerating the training process with multiple Graphics Processing Units (GPUs) [28], it has been
realized that the several columns can become specified experts for different features. For example,
the Inception model [31] is a successful multi-column deep neural network (DNN) to approximate the
optimal sparse structure of the input with different convolutional kernels. The Xception model [32] goes
even further by decoupling both the cross-channel and spatial correlations to obtain a multi-branch
structure. In addition, Saining exposed a new dimension called cardinality to standardize the
multi-branch structure [33]. Ling attempted to train a multi-column neural network with spectral
clustering to embed the hierarchical manifolds into a compact vector [34].
Although sharing the similar structure of multiple columns, the multi-resolution theories focus
more on the structures of different scales. Sander attempted to learn music features by extracting the
mel-spectrograms with different windows [35]. Michael utilized CNN to predict future images from
different resolutions and concluded that a lower resolution was able to capture the entire context, while
a higher resolution recovered more details [36]. The similar idea was also shared by MMDenseNet [37],
which enables the neural network to model both the fine-grained structures and the long contexts
efficiently. In the Deep Belief Network, the multi-resolution structures also showed their superiority in
image-related tasks [38]. Furthermore, the idea of multiple resolutions has long been integrated into the
underlying design of neural networks. The pooling layers and the variants, such as subsequent spatial
pyramid pooling all attempt to obtain the larger space by reducing the resolution of the intermediary
layers [39].
3. Proposed Method
The short-time Fourier transform (STFT) is defined as:

F_x(t, v; h) = ∫_{−∞}^{+∞} x(u) h*(u − t) exp[−j2πvu] du (1)

where x(u) is the original signal, and h is the analysis window. t and v represent the time and frequency,
respectively. In practical applications, the discrete STFT is preferred, which is given by:

F_x[n, m; h] = F_x(nt_0, mv_0; h) = ∫_{−∞}^{+∞} x(u) h*(u − nt_0) exp[−j2πmv_0·u] du (2)
where m, n ∈ Z. t0 and v0 are the time and frequency steps. Specifically, the STFT with Gaussian
windows is able to theoretically maximize the concentration in the TF plane by approaching the lower
bound of the Heisenberg–Gabor inequality [24]:
T × B ≥ 1 (3)

Here, T is the time duration, and B is the normalized bandwidth. The STFT of this particular case
is also called the Gabor transform. The most important inference of the Heisenberg–Gabor inequality
is the trade-off between the time and frequency resolutions in accordance with the width of the analysis
window. An illustrative example is shown in Figure 1.
Figure 1. Gabor representations of different time and frequency resolutions: (a) windows of different
lengths; (b) spectrogram of higher time resolution; (c) spectrogram of higher frequency resolution;
(d) spectrum of medium resolution.
As shown in Figure 1, the long window gives rise to harmonic information, while the pulses
emerge from the short time window. In this study, the spectrograms of different resolutions are all
employed for the diagnosis of the UHF signals. It should be noted that the concept of multiple
resolutions is different from the wavelet or S-transform, which still generate only a single resolution
for each frequency band.
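The resolution trade-off above can be reproduced with an off-the-shelf STFT. The sketch below uses illustrative parameters only (toy pulse, hypothetical sampling rate; not the paper's acquisition settings) to compute Gabor spectrograms of one signal with a short and a long Gaussian window:

```python
import numpy as np
from scipy.signal import stft

fs = 10e9  # hypothetical 10 GS/s sampling rate
t = np.arange(2048) / fs
# toy stand-in for a UHF PD pulse: an exponentially damped oscillation
x = np.exp(-t * 2e8) * np.sin(2 * np.pi * 1.5e9 * t)

# short Gaussian window -> high time resolution, coarse frequency bins
f_s, t_s, Z_short = stft(x, fs=fs, window=('gaussian', 8), nperseg=64, noverlap=48)
# long Gaussian window -> fine frequency bins, few time frames
f_l, t_l, Z_long = stft(x, fs=fs, window=('gaussian', 64), nperseg=512, noverlap=384)

# the short window yields more time frames; the long one more frequency bins
print(Z_short.shape, Z_long.shape)
```

Per the Heisenberg–Gabor inequality, no single window gives both resolutions at once, which is what motivates feeding several spectrograms of the same signal to the model.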
Furthermore, a semiquantitative method is proposed as an assistance criterion for choosing
the proper analysis windows, considering the morphological characteristics of the STFT spectra.
As indicated by Figure 2, centering on the maximum value, the time and frequency ranges before reaching
a certain threshold are calculated, which can be regarded as a rough description of the peak shape.
Intuitively, the greater the time range is, the less the time resolution is retained, while the frequency
isolation is more obvious. The frequency range varies in the opposite way. The proper window
can be selected based on the trade-off between the time and frequency resolutions, depending
on their priorities.
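One possible reading of this criterion is sketched below with a hypothetical `peak_ranges` helper; the half-maximum threshold convention is an assumption for illustration, not the paper's exact definition:

```python
import numpy as np

def peak_ranges(S, thresh_ratio=0.5):
    """Rough peak-shape descriptor: widths (in bins) of the region around the
    spectrogram maximum that stays above thresh_ratio * max, measured along
    the time and frequency axes."""
    S = np.abs(S)
    fi, ti = np.unravel_index(np.argmax(S), S.shape)
    thr = thresh_ratio * S[fi, ti]

    def width(line, center):
        lo = center
        while lo > 0 and line[lo - 1] >= thr:
            lo -= 1
        hi = center
        while hi < len(line) - 1 and line[hi + 1] >= thr:
            hi += 1
        return hi - lo + 1

    # range along time (row through the peak), range along frequency (column)
    return width(S[fi, :], ti), width(S[:, ti], fi)

# toy spectrogram: a Gaussian blob wider in time than in frequency
f = np.exp(-((np.arange(64) - 32) ** 2) / (2 * 2.0 ** 2))   # frequency profile
g = np.exp(-((np.arange(64) - 32) ** 2) / (2 * 8.0 ** 2))   # time profile
S = np.outer(f, g)  # rows: frequency, cols: time
t_range, f_range = peak_ranges(S)
print(t_range, f_range)  # larger time range -> poorer time resolution retained
```

A window can then be chosen by comparing the two ranges against the priorities of the diagnosis task.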
Figure 2. Schematic diagram of the time range and frequency range before reaching the threshold.
3.2. Convolutional Neural Network and the Design of Convolutional Kernels

In recent years, CNN and its derivatives have been regarded as the state-of-the-art choices for
computer vision tasks. The key ideas in CNN include the concepts of local receptive fields, shared
weights, and spatial subsampling. Specifically, in each convolutional layer, the feature maps are first
convolved with several kernels. After the biases, activation function and, sometimes, the pooling
operation, the outputs are then fed into the next convolutional layer, which can be summarized as:

x_j^l = f( Σ_{i∈M_j} x_i^{l−1} ∗ k_{ij}^l + b_j^l ) (4)
where ∗ denotes the convolutional operator. x_i^{l−1} and x_j^l are the input and output of the lth layer,
respectively, and b is the bias. M_j is a selection of the feature maps, and k_{ij}^l is the element of the
kernel matrix.

The activation function f is chosen as the ReLU function for the convenience of the gradient
calculation. A simple parameter analysis can be given as follows. Assuming that the volume of the input
is W1 × H1 × D1, after applying convolutions of k filters (without zero padding), the new volume
W2 × H2 × D2 is:

W2 = (W1 − F1)/S1 + 1 (5)

H2 = (H1 − F2)/S2 + 1 (6)

D2 = k (7)

where F1 and F2 are the width and height of the filter cores, and S1 and S2 are the strides. One special
characteristic of CNN is the shift invariance. However, although dramatically useful in image
recognition, the invariance features pose an impediment for the recognition of Gabor representations
that own a clear physical meaning. Any shift of a generally rectangular filter may distort the original
TF distributions and confuse the CNN. To alleviate the restriction, three kinds of convolutional kernels
are designed in accordance with the spectrograms of different resolutions, namely frequency filters,
temporal filters, and texture filters.

As shown by the red rectangles in Figure 3, the temporal filters are designed for maps of higher
time resolutions, and they share the same widths with the maps, while the frequency filters are suitable
Downloaded from https://iranpaper.ir
https://www.tarjomano.com https://www.tarjomano.com
for higher frequency resolutions, and they are as high as the spectrograms. The texture filters are of
medium size as supplements.
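Equations (5)–(7) make the size advantage concrete: a filter spanning the full height or width of the map collapses that axis to a single bin. A minimal check, with a hypothetical 64 × 64 spectrogram and illustrative filter sizes:

```python
def conv_output_volume(W1, H1, D1, F1, F2, k, S1=1, S2=1):
    """Output volume of a valid (no zero padding) convolution, Eqs. (5)-(7)."""
    W2 = (W1 - F1) // S1 + 1
    H2 = (H1 - F2) // S2 + 1
    D2 = k
    return W2, H2, D2

# W = width (time axis), H = height (frequency axis)
# temporal filter: as wide as the map -> output collapses to one column in time
print(conv_output_volume(64, 64, 1, F1=64, F2=3, k=16))   # (1, 62, 16)
# frequency filter: as high as the map -> output collapses to one row in frequency
print(conv_output_volume(64, 64, 1, F1=3, F2=64, k=16))   # (62, 1, 16)
# texture filter: medium square kernel keeps a 2-D output
print(conv_output_volume(64, 64, 1, F1=5, F2=5, k=16))    # (60, 60, 16)
```

The one-dimensional outputs of the temporal and frequency filters are what keep the model size small.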
Figure 3. Schematic diagram of the different convolutional filters and their corresponding outputs:
(a) temporal filter; (b) frequency filter; (c) texture filter.
The designing principles of the special convolutional filters may be explained qualitatively with
orthogonality, since the convolutional operation is basically a template matching process. The first
step is to make the filter cover either a complete frequency axis or a whole time axis, thereby allowing
for perfect frequency or temporal matching. Second, by assuming that the filter is also a kind of
special signal, it is assumed that orthogonality exists between the unfolding direction of the filters
and the spectrograms, to maximize the diversity in a filter as much as possible. Following this idea,
the frequency filters are applied to the Gabor maps of higher frequency resolutions, while the temporal
filters are applied to the higher time resolutions. Another advantage of the frequency and temporal
filters refers to Equations (5) and (6). The outputs of the frequency and temporal filters are only
one-dimensional, which dramatically decreases the model's size.
3.3. Architecture of the Proposed Model
As has been mentioned before, TF maps of different resolutions can reveal different rhythms in
the 2-dimensional space. It is hypothesized that it would be of great benefit to explicitly provide TF
maps of different resolutions as inputs to a deep model. This combines the multiple views with deep
learning's ability to extract intrinsic properties from raw inputs, so that the discriminative features
could be learned while maintaining explanatory factors with physical meaning.

Accordingly, the proposed multi-resolution network is summarized in Figure 4. It mainly contains
three parts, namely, single-resolution embedding, multi-resolution fusion, and multi-sensor fusion.
Here, the first part is realized by three subnetworks; the second part is a fully-connected neural network
attached at the end of the subnetworks, and Long Short-Term Memory (LSTM) [40] is implemented for
multi-sensor fusion.
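The three-part layout can be illustrated with a PyTorch sketch. The layer sizes, channel counts, and pooling below are hypothetical placeholders rather than the paper's configuration, and the LSTM-based multi-sensor stage is omitted for brevity:

```python
import torch
import torch.nn as nn

class MultiResolutionNet(nn.Module):
    """Sketch of the multi-column idea: one subnetwork per resolution,
    fused by a fully-connected layer (illustrative sizes only)."""
    def __init__(self, n_classes=13):
        super().__init__()
        # temporal column: kernel as wide as the 64-bin time axis
        self.temporal = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=(3, 64)), nn.ReLU(), nn.Flatten())
        # frequency column: kernel as high as the 64-bin frequency axis
        self.frequency = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=(64, 3)), nn.ReLU(), nn.Flatten())
        # texture column: medium square kernel
        self.texture = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=5), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten())
        # multi-resolution fusion: concatenate the three embeddings
        self.fusion = nn.Linear(8 * 62 + 8 * 62 + 8 * 16, n_classes)

    def forward(self, x_time, x_freq, x_mid):
        z = torch.cat([self.temporal(x_time),
                       self.frequency(x_freq),
                       self.texture(x_mid)], dim=1)
        return self.fusion(z)

net = MultiResolutionNet()
maps = [torch.randn(2, 1, 64, 64) for _ in range(3)]  # three 64x64 spectrograms
logits = net(*maps)
print(logits.shape)  # torch.Size([2, 13])
```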
As shown in Figure 5, in each time step t, the hidden state h_t is updated by a function of the last
hidden state h_{t−1} and the new input x_t:

h_t = tanh(W x_t + U h_{t−1} + b) (8)

where the hidden state h_t is a d-dimensional vector, and b is the bias. W and U are the weights.
Tanh is the activation function. LSTM is a branch of RNN, which introduces a memory cell to preserve
the information for long sequences, and relieves the exploding or vanishing gradients problem in
the original RNN [40]. The basic unit of LSTM can be explained as a collection of a block input x_t,
a memory cell c_t, and a hidden state h_t, controlled by an input gate i_t, a forget gate f_t, and an output
gate o_t, as shown in Figure 5, with the following relationships:

i_t = σ(W^(i) x_t + U^(i) h_{t−1} + b_i) (9)

f_t = σ(W^(f) x_t + U^(f) h_{t−1} + b_f) (10)

o_t = σ(W^(o) x_t + U^(o) h_{t−1} + b_o) (11)

u_t = tanh(W^(u) x_t + U^(u) h_{t−1} + b_u) (12)

c_t = i_t ⊙ u_t + f_t ⊙ c_{t−1} (13)

h_t = o_t ⊙ tanh(c_t) (14)
Figure 5. Architectures of LSTM and a general RNN.
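As a concrete illustration, the gate equations can be stepped through in plain numpy; the sizes and random parameters below are toy assumptions, with the hidden state read out through the standard h_t = o_t ⊙ tanh(c_t):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step: gates, block input, cell update, hidden readout."""
    W, U, b = params  # dicts keyed by gate name: 'i', 'f', 'o', 'u'
    i = sigmoid(W['i'] @ x + U['i'] @ h_prev + b['i'])   # input gate
    f = sigmoid(W['f'] @ x + U['f'] @ h_prev + b['f'])   # forget gate
    o = sigmoid(W['o'] @ x + U['o'] @ h_prev + b['o'])   # output gate
    u = np.tanh(W['u'] @ x + U['u'] @ h_prev + b['u'])   # block input
    c = i * u + f * c_prev                               # memory cell update
    h = o * np.tanh(c)                                   # new hidden state
    return h, c

rng = np.random.default_rng(0)
d, n = 4, 3  # toy hidden size and input size
params = ({g: rng.normal(size=(d, n)) for g in 'ifou'},
          {g: rng.normal(size=(d, d)) for g in 'ifou'},
          {g: np.zeros(d) for g in 'ifou'})
h, c = np.zeros(d), np.zeros(d)
for x in rng.normal(size=(5, n)):  # a length-5 input sequence
    h, c = lstm_step(x, h, c, params)
print(h.shape)  # (4,)
```

The memory cell c accumulates gated contributions across the sequence, which is what lets the fusion stage summarize the four sensor inputs in order.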
low-dimensional manifold to make the similar points adjacent. Although some general dimensionality
reduction frameworks exist [44,45], the same key point is to preserve the distances of the original
vertex pairs. Given the data x_1, …, x_U, the loss function parameterized by α and embedding function
f is:

Loss = Σ_{i,j}^{U} L( ‖f(x_i, α) − f(x_j, α)‖ − W_ij ) (15)
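Taking L in Equation (15) as a squared error (an assumption made here for illustration), the distance-preserving objective can be written out directly:

```python
import numpy as np

def embedding_loss(Z, W_target):
    """Eq. (15) with L = squared error: penalize the gap between pairwise
    embedding distances ||f(x_i) - f(x_j)|| and the target weights W_ij.
    Z holds the embedded vectors f(x_i, alpha) row-wise (toy sketch)."""
    U = Z.shape[0]
    loss = 0.0
    for i in range(U):
        for j in range(U):
            d_ij = np.linalg.norm(Z[i] - Z[j])
            loss += (d_ij - W_target[i, j]) ** 2
    return loss

# toy check: when embedding distances already match the targets, the loss is 0
Z = np.array([[0.0, 0.0], [3.0, 4.0]])      # two embedded points, distance 5
W = np.array([[0.0, 5.0], [5.0, 0.0]])      # target pairwise distances
print(embedding_loss(Z, W))  # 0.0
```

In practice the two branches of the Siamese network share weights, so minimizing this loss pulls similar feature maps toward nearby embedding vectors.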
Figure 7. Siamese network for embedding the feature maps into a constant vector.
3.4.2. Transfer Learning for Cascaded Training

For the other two steps, the weights of the MLP that were attached at the end of the three
subnetworks were randomly initialized first, and then another layer with the same node number as
the classes was connected to make a complete single-sensor diagnosis model. After fine-tuning with
Adadelta, the weights gained from step (2) were frozen, and they only acted as feature extractors for
the information fusion of different sensors, while the weights of the LSTM were made to be trainable.
The convergence speeds of the two steps were likely to be much faster compared with the first step, due
to the existing class-dependent knowledge. In addition, some advanced training techniques such as
early stopping and heuristic learning rate adaptation were also utilized.
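The freezing step can be sketched in PyTorch; the module shapes are illustrative stand-ins for the paper's networks, with `requires_grad = False` turning the fine-tuned columns into fixed feature extractors while only the fusion stage trains:

```python
import torch
import torch.nn as nn

# hypothetical stand-ins: a pretrained per-sensor feature extractor
# and a trainable LSTM fusion head with a classifier
feature_extractor = nn.Sequential(nn.Linear(32, 16), nn.ReLU())
lstm_fusion = nn.LSTM(input_size=16, hidden_size=8, batch_first=True)
classifier = nn.Linear(8, 13)  # 13 defect modes

# freeze the fine-tuned extractor so it only provides features
for p in feature_extractor.parameters():
    p.requires_grad = False

# only the LSTM and classifier receive gradient updates
trainable = list(lstm_fusion.parameters()) + list(classifier.parameters())
optimizer = torch.optim.Adadelta(trainable)

x = torch.randn(2, 4, 32)           # batch of 2, sequence of 4 sensors
feats = feature_extractor(x)        # frozen embedding per sensor
out, (h, c) = lstm_fusion(feats)    # fuse the sensor sequence
logits = classifier(h[-1])          # classify from the last hidden state
print(logits.shape)                 # torch.Size([2, 13])
```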
4. PD Laboratory Setup

For the purpose of the experimental study, a simulative L-shaped GIS tank with a coaxial
cylindrical conductor was implemented to simulate the realistic equipment, as shown in Figures 8
and 9a, with the central conductor diameter being 90 mm, and the enclosure tank diameter being
320 mm. The tank was filled with SF6 gas of 0.1 MPa to simulate the propagation medium.
Four planar equiangular spiral antennas (PESAs) with impedance transformers were installed
inside the GIS chamber in the hand holes parallel to the axis of the GIS bus-bar, as shown in Figures
8 and 9b. The PESA utilized in this study is an ultra-wideband antenna that can be installed internally
in the GIS. Its outside and inside radii are 109 mm and 2 mm, respectively, with the substrate
thickness being 1 mm. The other key parameters that can affect the performance of the PESA sensors
include the relative dielectric constant, the spiral growth rate, and the rotation angle, which are chosen
as 2.65, 0.364, and 3.5π, respectively. In practice, the most commonly quoted parameter to evaluate
the performance of an antenna is the reflection coefficient S11, which represents how much power
is reflected from the antenna. The range of the bandwidth that satisfies S11 < −10 dB was 0.8 GHz
to 3.5 GHz for the PESAs. Besides, the S11 parameter ranged between −8 dB and −6 dB from 0.2 GHz
to 0.7 GHz, and was close to −10 dB in the 0.7 GHz to 0.8 GHz range, which satisfied the measurement
requirements of the UHF signals. The detailed S11 curve can be found in our previous study [48].
The geometrical dimensions of the GIS tank and the relative distances between the PESAs are also
illustrated in Figure 8.

High voltage was introduced by an insulating bushing connected with a non-PD testing
transformer (Xinyuan Electric, Yangzhou, China, YDTW-30/150) with an amplitude of 0–150 kV and
a capacity of 30 kVA. The four PESAs received the UHF waves successively, based on their locations.
Furthermore, a four-channel digital oscilloscope (Tektronix, Beaverton, OR, USA, DPO7354C, 3.5 GHz,
40 GS/s) was employed to acquire and record the TRPD UHF patterns.
Figure 8. Schematic diagram of the GIS tank, PESA sensors, and the experimental circuit.
Figure 9. GIS experimental platform: (a) GIS tank and the positions of the UHF sensors; (b) the
installation of the UHF sensors inside the tank.
Five types of PDs were fabricated to simulate the various insulation defects, including the floating electrode, a metal protrusion on the conductor and the tank, surface contamination, and free metal particles. Besides, three relative angles between the defect position and the PESA antennas, which were 0°, 90°, and 180°, were integrated with the five defect types in the defect simulations.
As shown in Figure 10, the floating electrode defect was simulated by fixing two adjacent copper nuts to an insulated bolt that was attached on the conductor. The nuts were 3 mm away from the conductor, and the distance between the copper nuts was 1 mm. Two kinds of metal protrusions were replicated by a 30 mm needle adhered on the high-voltage conductor and the tank, respectively. Surface contamination was replicated by a piece of aluminum foil (2 × 20 mm²) adhered to the surface of the spacer. Finally, the free particles defect was produced by connecting an insulation tube that contained some aluminum particles inside the conductor, with the length of the insulation tube being 20 mm. The first four types of defects were arranged at three relative angles between the defect positions and the sensor position, 0°, 90°, and 180°, as shown in Figure 8, generating 13 PD conditions in total (the free metal particles were only simulated at 90°). Figure 11 shows some typical acquisitions of the UHF signals within 100 ns, and their corresponding frequency spectra.
Figure 10. Typical PD defect patterns: (a) floating electrode; (b) metal protrusion on the conductor; (c) metal protrusion on the tank; (d) surface contamination; (e) free metal particles.
Figure 11. Typical UHF waveforms: (a) UHF signals of five different defect types; (b) UHF signals acquired from the four channels; (c) UHF signals of the three relative angles.
5. Experiment Evaluations
A detailed analysis was carried out to ascertain the model's discriminatory capability. Both the relative angles and defect types were taken into consideration, including the floating electrode (0°, 90°, and 180°), the metal protrusion on the conductor (0°, 90°, and 180°), the metal protrusion on the tank (0°, 90°, and 180°), the surface contamination (0°, 90°, and 180°), and the free particles (90° only), thus creating 13 defect modes. There were 1386 samples in total, of which 20% were used for testing. Two cases were designed for recognition with the same dataset, namely, (1) the combined diagnosis of positions and defect types together, which had 13 classes in total, and (2) diagnosis of the defect types only, which has five classes only.
5.1. Implementation Details
First, the lengths of the analysis windows were chosen based on the frequency decline range and the time decline range described in Figure 2, and 50% of the max value was chosen as the threshold.
Instead of using the ranges directly, their proportions in the whole analysis time axis and frequency axis were calculated for ease of comparison. The time range limit and frequency range limit were chosen as 100 ns and 3 GHz, respectively. The sampling rate was 10 GHz, indicating that 0.1 ns is
added when the window length increases by 1. Figure 12 shows the variations of the percentages of the average time and frequency ranges. It is obvious that as the window length increased, the frequency resolution increased as well, while the time resolution showed the opposite trend. In order to retain enough resolution information for both axes, 30% was selected as the threshold. It is observed that the window lengths were near 6 and 100 at this point. Therefore, the temporal window and the frequency window were chosen as 6 and 100, respectively. Besides, the crossing point was obtained near 30, which was selected as the window for the texture channel.
Figure 12. Percentages of the time and frequency decline ranges along the window lengths.
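The window-selection procedure above can be sketched numerically. The following is a minimal illustration, not the authors' code: a synthetic damped oscillation stands in for a UHF PD pulse, the spectrogram marginals are thresholded at 50% of their maxima, and the resulting decline-range proportions are reported for several window lengths. The signal parameters are assumptions chosen only to show the time/frequency resolution trade-off.

```python
import numpy as np
from scipy.signal import stft

fs = 10e9                          # 10 GHz sampling rate (0.1 ns per sample)
t = np.arange(0, 100e-9, 1 / fs)   # 100 ns analysis record
# Synthetic damped oscillation standing in for a UHF PD pulse (illustrative)
sig = np.exp(-t / 5e-9) * np.sin(2 * np.pi * 1.2e9 * t)

def decline_proportions(x, nperseg):
    """Proportion of the time / frequency axes where the spectrogram
    marginal stays above 50% of its maximum (the paper's threshold)."""
    f, tt, Z = stft(x, fs=fs, nperseg=nperseg, noverlap=0)
    mag = np.abs(Z)
    t_marg = mag.max(axis=0)       # envelope along the time axis
    f_marg = mag.max(axis=1)       # envelope along the frequency axis
    t_prop = (t_marg >= 0.5 * t_marg.max()).mean()
    f_prop = (f_marg >= 0.5 * f_marg.max()).mean()
    return t_prop, f_prop

for w in (6, 30, 100):
    print(w, decline_proportions(sig, w))
```

Longer windows concentrate the frequency marginal (finer frequency resolution) while smearing the time marginal, which is the trend Figure 12 describes.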
Besides, in order to further reduce the input dimension, a bigger stride was used along the time axis for the frequency channel, while a bigger frequency stride was utilized for the temporal channel in STFT, which could be seen as a subsampling. A cut-off of 2 GHz was also utilized in the temporal channel and the texture channel, considering that the details in the high-frequency band were trivial in their diagnosis. Finally, the windows and the input shapes of the three subnetworks are summarized in Table 2. Considering the limited amount of data, the neural network architecture was wide, but it had only a few convolutional maps per column. The detailed net structures are also shown in Table 2. The three columns are the time column, frequency column, and texture column, respectively.
Table 2. Detailed network structure.

| Index | Window | Input Shape | CNN Structure | MLP | LSTM |
| --- | --- | --- | --- | --- | --- |
| Column 1 | 6 | 30 × 50 | (Conv2 × 50−MaxPooling2 × 1−Dropout)−(Conv2 × 1−MaxPooling2 × 1−Dropout)−(Dense100−Dense50) | Flatten−Concatenation−Dense100−Dense50 | LSTM with 16 inner nodes−Dense50−Output |
| Column 2 | 100 | 150 × 36 | (Conv150 × 4−MaxPooling1 × 2−Dropout)−(Conv1 × 3−MaxPooling1 × 2−Dropout)−(Dense100−Dense50) | | |
| Column 3 | 30 | 100 × 64 | (Conv50 × 4−MaxPooling2 × 2−Dropout)−(Conv2 × 2−MaxPooling2 × 2−Dropout)−(Dense100−Dense50) | | |

The MLP and LSTM stages are shared by the three columns.
The learning algorithm setup was as follows: the Adadelta optimizer with a learning rate of 1.0 for the pretraining and MLP embedding stages, and an RMSprop optimizer for the LSTM with a learning rate of 0.001 and an epsilon of 1 × 10−6. The pretraining was carried out for 1000, 800, and 400 iterations, with the patience being 100, 50, and 50, respectively. All networks were realized with Keras on NVIDIA 750TI GPUs (NVIDIA, Santa Clara, CA, USA). The tests averaged over five sets of initialization parameters were reported as the final results.
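Under the assumption that "patience" refers to Keras-style early stopping, the reported training setup might be reproduced as follows; this is a configuration sketch, not the original training script, and the monitored quantity is an assumption.

```python
from tensorflow.keras.optimizers import Adadelta, RMSprop
from tensorflow.keras.callbacks import EarlyStopping

# Optimizers as reported in the text
opt_pretrain = Adadelta(learning_rate=1.0)             # pretraining / MLP embedding
opt_lstm = RMSprop(learning_rate=0.001, epsilon=1e-6)  # LSTM fusion stage

# Early stopping with the reported patience values (monitor is assumed)
stop_stage1 = EarlyStopping(monitor='val_loss', patience=100)
stop_stage23 = EarlyStopping(monitor='val_loss', patience=50)
```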
1. Multi-resolution diagnosis leads to a smaller misclassification rate than using only single-resolution information.
2. Sensor 1 has the best accuracy among all the sensors, owing to its shortest distance to the defect location. The information loss can be quite significant after the L-shaped corner.
3. A higher frequency resolution is more valuable for single-resolution diagnosis.
4. The multi-sensor combination is not necessarily better than a single sensor without appropriate methods for handling the increased dimensionality and the confusing information carried by different sensors.
Table 3. Diagnostic accuracies and the intermediate outputs for classifying both the positions and
defect types.
Table 4. Diagnostic accuracies and the intermediate outputs for classifying the defect types only.
Sensors 2018, 18, 3512
more than the other two kinds of kernels. Besides, both the iteration number and the time consumption of the second and third training stages were much less, compared with their previous steps. For the final LSTM information fusion stage, less than 10 s was needed.

The results in Tables 3 and 4, and Figures 13 and 14, show the essential role played by transfer learning. It is significant that during the three training stages, the diagnostic accuracy increased, while both the training iterations and time decreased dramatically, even with a larger network. This dynamic indicates that the knowledge learned from earlier stages was successfully transferred to a higher level, with less proneness to overfitting.
Figure 13. Loss curves of the three stages of the transfer learning: (a) the defect types and positions; (b) the defect types only.

Figure 14. Time consumptions of the three stages of the transfer learning: (a) the defect types and positions; (b) the defect types only.
5.2.3. Visualization of the Convolutional Filters
Neural networks are usually considered as black boxes. However, with the help of visualization methods, it is possible to obtain good qualitative interpretations of the black boxes of deep networks. Besides, visualization is also an effective way to diagnose the misclassifications and evaluate the degree of overfitting. In this study, the Activation Maximization [49] method was utilized to visualize the learned filters. The idea of Activation Maximization is quite simple: generating an image to maximize the activation of a hidden unit as:

x* = argmax_{x, ∥x∥ = ρ} h_ij(θ, x)    (17)

where x* is the visualization result, and θ denotes the fixed neural network parameters. Thus, h_ij(θ, x) is the activation value of unit j in layer i, with fixed parameters θ and bounded norm ρ. For simplicity and the convenience of comparison, only the first layer of filters were visualized, as shown in Figure 15.

It is clear that the learned convolutional filters were quite smooth in space, thus indicating that the training was sufficient. The temporal filters were sensitive to pulses along the time axis, while the frequency filters showed clear patterns of different frequency combinations. Besides, the diversity of patterns was richer in the visualization results of the texture filters, where both the temporal and frequency patterns revealed their importance. It is also encouraging that both the temporal filters and frequency filters gained significant resolution power over multiple scales, as shown by the different densities of the lines.
Figure 15. Visualization of the convolutional filters: (a) temporal filters; (b) frequency filters; (c) texture filters.
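The constrained maximization in Equation (17) can be illustrated with a toy projected-gradient-ascent sketch; this is not the paper's implementation. For a linear "hidden unit" h(x) = w·x, the maximizer on the sphere ∥x∥ = ρ is known in closed form, ρw/∥w∥, which makes the sketch easy to check.

```python
import numpy as np

def activation_maximization(grad_h, rho, dim, steps=500, lr=0.1, seed=0):
    """Projected gradient ascent for Eq. (17): maximize h(x) subject to
    ||x|| = rho by re-projecting onto the sphere after every ascent step."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=dim)
    x *= rho / np.linalg.norm(x)
    for _ in range(steps):
        x = x + lr * grad_h(x)
        x *= rho / np.linalg.norm(x)   # enforce the norm constraint
    return x

# Toy linear unit h(x) = w.x: its gradient is constant, and the constrained
# maximizer is rho * w / ||w||, so the "visualization" recovers the filter.
w = np.array([3.0, -1.0, 2.0])
x_star = activation_maximization(lambda x: w, rho=1.0, dim=3)
print(x_star)
```

For a real convolutional unit, grad_h would be the backpropagated gradient of the unit's activation with respect to the input image.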
Besides, a direct visualization of the outputs of the filters was helpful in illustrating the automatic feature extraction and the information compression of the filters. An illustrative example is shown in Figure 16. The outputs of the frequency filter and the temporal filter were only one-dimensional, while the texture filter transferred the two-dimensional input to a smaller feature map. That is also why the texture filter calls for the longest training time.
Figure 16. The outputs of the first layer of convolutional filters: (a) the output of the temporal filter; (b) the output of the frequency filter; (c) the output of the texture filter.

Figure 17. Visualization of the learned features for both the defects and positions: (a) temporal channel; (b) frequency channel; (c) texture channel.
Figure 18. Visualization of the learned features for the defect types only: (a) temporal channel; (b) frequency channel; (c) texture channel.
Table 6. Diagnostic accuracies of the baseline methods for classifying both defect types and positions.
Table 7. Diagnostic accuracies of the baseline methods for classifying the defect types only.
The accuracy comparisons may provide some insights into the fundamental differences between
the traditional feature-based methods and the deep learning approaches, as well as some essential
flaws of the feature engineering methods.
1. The misclassification rate of the proposed multi-resolution CNN clearly indicates its superior
capability compared with the baseline methods. The STFT's performance is slightly better than that of the wavelet spectrum at certain resolutions, and much better than that of the HHT map.
2. The best performance among the baseline methods is gained by the combination of STFT, NMF, and SVM, which is 92.09% for the 13 classes, and 93.88% for the five classes. Firstly, it is observed that NMF slightly outperforms 2DPCA, which may be explained as follows. In the experiments, the NMF gains its best performance with four dimensions, compared with the 12 dimensions of the 2DPCA, indicating that the redundancy is huge in the STFT maps. Thus, searching for a global projection in the redundant data may not be a good idea. Second, feature extraction usually does not increase the model performance, which reflects the eternal contradiction among the learning ability of the model, the input dimensions, the distribution of the data, and the discernibility of the features.
3. The performances of SVM and DNN are similar when utilizing the same input features. Therefore,
for traditional feature-based recognition methods, the recognition accuracy relies heavily on the discriminative features.
4. Similar to the conclusion drawn from Tables 3 and 4, the performance of multiple sensors is not necessarily better than that of a single sensor, especially when the nearest single sensor can achieve good diagnostic accuracy.
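The STFT–NMF–SVM baseline pipeline discussed above can be sketched with scikit-learn. This is an illustrative stand-in, not the paper's experiment: random non-negative matrices replace the real flattened STFT magnitude maps, and the labels and SVM hyperparameters are assumptions; only the four-dimensional NMF reduction follows the text.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-ins for flattened, non-negative STFT magnitude maps (two toy classes)
X = np.abs(rng.normal(size=(40, 300)))
y = np.repeat([0, 1], 20)

# NMF to 4 dimensions (the dimensionality reported to work best), then an SVM
nmf = NMF(n_components=4, init='nndsvda', max_iter=500, random_state=0)
feats = nmf.fit_transform(X)
clf = SVC(kernel='rbf').fit(feats, y)
print(feats.shape, clf.score(feats, y))
```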
6. Discussion
Although deep networks have achieved tremendous performance in many areas, low interpretability is always an obstacle to applying them in highly regulated environments, due to the uncontrollability that comes with the unclear mechanism. Sometimes, the training of the neural networks becomes a trial-and-error problem. In recent years, a great number of studies have focused on the interpretability of deep learning, and great progress has been made in both the theories and experiments. In the discussion, instead of pursuing a general solution, we concentrate only on the special CNN structures that are implemented here, and try to explain their effectiveness from the perspective of manifold learning.
A comprehensive comparison of the similarities between the filters learned by the CNN and the matrix projection-based methods, such as 2DPCA, is presented to provide some insights into the advantages of deep learning methods, compared with the feature extraction solutions. It is quite interesting to notice the similarity between the projection operation in 2DPCA and the convolutions in CNN. In 2DPCA, an m × n image A is projected into a lower dimension by:

Yₖ = AXₖ    (18)

where Xₖ is a vector of n dimensions, and k is the index. A visualization example of the 2DPCA's compressed result of the STFT spectrum is shown in Figure 19. It is easy to notice its similarity with the outputs in Figure 16, while the information concentration in Figure 16 is better.
Figure 19. Visualization of the compressed results of the 2DPCA.
Moreover, it is also possible to reconstruct the original image from the compressed information by:

Â = ∑ₖ₌₁ᵈ Yₖ Xₖᵀ    (19)
where d is the compression dimension. Therefore, in 2DPCA, it is assumed that any input can be approximately reconstructed as a weighted sum of smaller collections. In summary, the key points in 2DPCA include:

(1) The projection operation for compressing,
(2) The assumption that complex images are composed of eigenimages.

It is quite interesting that similar characteristics are shared by the proposed temporal and frequency filters.

(1) The dot product between the frequency filters and the spectrum, and the similar operation between the temporal filters and the spectrum, can be seen as a projection operation,
(2) In the CNN, it is also assumed that images are formed from elements of the lower level, like pixels, to more composite representations [28].

Therefore, a common framework that distinguishes different categories by matrix projection may be extracted to unify the deep structure and 2DPCA. The CNN and 2DPCA both try to classify different categories by using matrix projections, and the main difference lies in their supervision methods, where the CNN is supervised by labels, while 2DPCA is supervised by the inner product in a high-dimensional space. Furthermore, a special case of NMF that aims at reducing the reconstruction errors by projecting could also be defined to fit this framework. The main specialty lies in that the loss function is defined at the single sample, and it uses the mean square error between the original images and the reconstructed ones.
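Equations (18) and (19) can be sketched in a few lines of NumPy. This is a toy illustration with random data standing in for the real STFT maps: the projection vectors Xₖ are taken as the leading eigenvectors of the image covariance matrix, and choosing d equal to the full column dimension makes the reconstruction exact, which gives a quick sanity check.

```python
import numpy as np

def two_dpca(images, d):
    """2DPCA: learn d projection vectors from the column covariance of the
    training images (Eq. 18: Y_k = A X_k), plus a reconstruction helper
    (Eq. 19: A_hat = sum_k Y_k X_k^T)."""
    mean = images.mean(axis=0)
    # Image covariance matrix G = E[(A - mean)^T (A - mean)], shape (n, n)
    G = np.mean([(A - mean).T @ (A - mean) for A in images], axis=0)
    vals, vecs = np.linalg.eigh(G)
    X = vecs[:, np.argsort(vals)[::-1][:d]]   # top-d eigenvectors, (n, d)
    project = lambda A: A @ X                 # Y = A X, shape (m, d)
    reconstruct = lambda Y: Y @ X.T           # A_hat = Y X^T
    return project, reconstruct

rng = np.random.default_rng(0)
imgs = rng.normal(size=(50, 20, 8))           # 50 toy 20 x 8 "spectra"
project, reconstruct = two_dpca(imgs, d=8)    # d = n: lossless special case
A = imgs[0]
err = np.linalg.norm(reconstruct(project(A)) - A)
print(err)
```

With d < n, the same projection compresses each image, which is the redundancy-reduction role 2DPCA plays in the baseline comparison.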
Moreover, based on this framework, the 2DPCA can also be reformulated and approximated as a
special case of the CNN with the following setups: (1) Use only a single layer of convolutions, and the
width of the convolutional kernels should be restricted to one. (2) The loss function is defined as the
sum of the pairwise Euclidean distance losses to force the model to maintain its relative relations in
higher space. Some new algorithms may be developed based on this idea.
Therefore, the superiority of the proposed method may be explained as follows. Based on the same
framework, only the CNN is goal-oriented at the final classifying task, while the 2DPCA and NMF aim
at the other specified goals. It is usually hard to determine the effectiveness of these specified goals
until obtaining the final results. Is it better to maintain the global information as 2DPCA, or to keep
the local completeness as in NMF? In addition, the multiple projections also equip the CNN with a
better learning ability to handle more complex problems.
7. Conclusions
Inspired by the phenomenon that the PD spectrograms of different resolutions can capture
the information of different patterns, both multi-resolution and multi-sensor fusion algorithms are
proposed. They are realized by a multi-column deep CNN characterized by specified filters and a
sequential LSTM network. A multi-stage transfer learning framework is also presented to train the
model with limited data. Several conclusions may be drawn based on the detailed analysis.
1. The proposed multi-resolution model gains accuracies of 97.51% and 98.20% on the tasks of
diagnosing (1) the positions and defect types together, and (2) the defect types only; thus
indicating its clear superiority compared with the baselines.
2. The loss curves, time consumptions, and diagnostic accuracies show the effectiveness of transfer
learning, which successfully transfers knowledge from a lower level to a higher level with less
proneness to overfitting.
3. The comparisons with the baseline methods reveal several fundamental flaws of the feature-based methods, such as the difficulties in choosing the most suitable approach, and the curse of
dimensionality when encountering too many features from multiple sensors.
4. A matrix projection framework is proposed to enhance the interpretability of the proposed deep
network structure.
Author Contributions: G.L. and X.W. conceived and designed the experiments; X.L. provided the experimental
data; G.L. wrote the paper; A.Y. revised the contents and reviewed the manuscript; M.R. provided theoretical
guidance and supported the study.
Funding: This work was supported by the National Natural Science Foundation of China (No. 51521065).
Acknowledgments: The author gratefully acknowledges the support of the K. C. Wong Education Foundation.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Metwally, I.A. Technology progress in high-voltage gas-insulated substations. IEEE Potentials 2010, 29, 25–32.
[CrossRef]
2. Riechert, U.; Holaus, W. Ultra high-voltage gas-insulated switchgear—A technology milestone. Eur. Trans.
Electr. Power 2012, 22, 60–82. [CrossRef]
3. Schichler, U.; Koltunowicz, W.; Endo, F.; Feser, K.; Giboulet, A.; Girodet, A.; Hama, H.; Hampton, B.;
Kranz, H.-G.; Lopez-Roldan, J. Risk assessment on defects in GIS based on PD diagnostics. IEEE Trans.
Dielectr. Electr. Insul. 2013, 20, 2165–2172. [CrossRef]
4. Stone, G.C. Partial discharge diagnostics and electrical equipment insulation condition assessment.
IEEE Trans. Dielectr. Electr. Insul. 2005, 12, 891–904. [CrossRef]
5. Bartnikas, R. Partial discharges. Their mechanism, detection and measurement. IEEE Trans. Dielectr.
Electr. Insul. 2002, 9, 763–808. [CrossRef]
6. Wu, M.; Cao, H.; Cao, J.; Nguyen, H.L. An overview of state-of-the-art partial discharge analysis techniques
for condition monitoring. IEEE Electr. Insul. Mag. 2015, 31, 22–35. [CrossRef]
7. Kurrer, R.; Feser, K. The application of ultra-high-frequency partial discharge measurements to gas-insulated
substations. IEEE Trans. Power Deliv. 1998, 13, 777–782. [CrossRef]
8. High Voltage Test Techniques—Measurement of Partial Discharges by Electromagnetic and Acoustic Methods; IEC TS
62478:2016; BSI: London, UK, 2016.
9. Umamaheswari, R.; Sarathi, R. Identification of partial discharges in gas-insulated switchgear by
ultra-high-frequency technique and classification by adopting multi-class support vector machines.
Electr. Power Compon. Syst. 2011, 39, 1577–1595. [CrossRef]
10. Darabad, V.; Vakilian, M.; Blackburn, T.; Phung, B. An efficient PD data mining method for power transformer
defect models using SOM technique. Int. J. Electr. Power Energy Syst. 2015, 71, 373–382. [CrossRef]
11. Mas’Ud, A.A.; Stewart, B.; McMeekin, S. Application of an ensemble neural network for classifying partial
discharge patterns. Electr. Power Syst. Res. 2014, 110, 154–162. [CrossRef]
12. Evagorou, D.; Kyprianou, A.; Lewin, P.; Stavrou, A.; Efthymiou, V.; Metaxas, A.; Georghiou, G. Feature
extraction of partial discharge signals using the wavelet packet transform and classification with a
probabilistic neural network. IET Sci. Meas. Technol. 2010, 4, 177–192. [CrossRef]
13. Wang, K.; Li, J.; Zhang, S.; Liao, R.; Wu, F.; Yang, L.; Li, J.; Grzybowski, S.; Yan, J. A hybrid algorithm based
on s transform and affinity propagation clustering for separation of two simultaneously artificial partial
discharge sources. IEEE Trans. Dielectr. Electr. Insul. 2015, 22, 1042–1060. [CrossRef]
14. Zhu, M.-X.; Xue, J.-Y.; Zhang, J.-N.; Li, Y.; Deng, J.-B.; Mu, H.-B.; Zhang, G.-J.; Shao, X.-J.; Liu, X.-W.
Classification and separation of partial discharge ultra-high-frequency signals in a 252 kV gas insulated
substation by using cumulative energy technique. IET Sci. Meas. Technol. 2016, 10, 316–326. [CrossRef]
15. Li, L.; Tang, J.; Liu, Y. Partial discharge recognition in gas insulated switchgear based on multi-information
fusion. IEEE Trans. Dielectr. Electr. Insul. 2015, 22, 1080–1087. [CrossRef]
16. Albarracín, R.; Robles, G.; Martinez-Tarifa, J.M.; Ardila-Rey, J. Separation of sources in radiofrequency
measurements of partial discharges using time–power ratio maps. ISA Trans. 2015, 58, 389–397. [CrossRef]
[PubMed]
17. Li, J.; Jiang, T.; Harrison, R.F.; Grzybowski, S. Recognition of ultra high frequency partial discharge signals
using multi-scale features. IEEE Trans. Dielectr. Electr. Insul. 2012, 19, 1412–1420. [CrossRef]
18. Wang, X.; Li, X.; Rong, M.; Xie, D.; Ding, D.; Wang, Z. UHF Signal Processing and Pattern Recognition
of Partial Discharge in Gas-Insulated Switchgear Using Chromatic Methodology. Sensors 2017, 17, 177.
[CrossRef] [PubMed]
19. Gu, F.-C.; Chang, H.-C.; Chen, F.-H.; Kuo, C.-C.; Hsu, C.-H. Application of the Hilbert-Huang transform with
fractal feature enhancement on partial discharge recognition of power cable joints. IET Sci. Meas. Technol.
2012, 6, 440–448. [CrossRef]
20. Dai, D.; Wang, X.; Long, J.; Tian, M.; Zhu, G.; Zhang, J. Feature extraction of GIS partial discharge signal
based on S-transform and singular value decomposition. IET Sci. Meas. Technol. 2016, 11, 186–193. [CrossRef]
21. Majidi, M.; Fadali, M.S.; Etezadi-Amoli, M.; Oskuoee, M. Partial discharge pattern recognition via sparse
representation and ANN. IEEE Trans. Dielectr. Electr. Insul. 2015, 22, 1061–1070. [CrossRef]
22. Khan, Y. Partial discharge pattern analysis using PCA and back-propagation artificial neural network for the
estimation of size and position of metallic particle adhering to spacer in GIS. Electr. Eng. 2016, 98, 29–42.
[CrossRef]
23. Li, X.; Wang, X.; Yang, A.; Xie, D.; Ding, D.; Rong, M. Propagation characteristics of PD-induced UHF signal
in 126 kV GIS with three-phase construction based on time–frequency analysis. IET Sci. Meas. Technol. 2016,
10, 805–812. [CrossRef]
24. Auger, F.; Flandrin, P.; Gonçalvès, P.; Lemoine, O. Time-Frequency Toolbox; CNRS France: Paris, France;
Rice University: Houston, TX, USA, 1996.
25. Zhang, S.; Li, C.; Wang, K.; Li, J.; Liao, R.; Zhou, T.; Zhang, Y. Improving recognition accuracy of partial
discharge patterns by image-oriented feature extraction and selection technique. IEEE Trans. Dielectr.
Electr. Insul. 2016, 23, 1076–1087. [CrossRef]
26. Wang, K.; Liao, R.; Yang, L.; Li, J.; Grzybowski, S.; Hao, J. Optimal features selected by NSGA-II for partial
discharge pulses separation based on time-frequency representation and matrix decomposition. IEEE Trans.
Dielectr. Electr. Insul. 2013, 20, 825–838. [CrossRef]
27. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436. [CrossRef] [PubMed]
28. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks.
In Proceedings of the Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–8 December 2012;
pp. 1097–1105.
29. Li, G.; Rong, M.; Wang, X.; Li, X.; Li, Y. Partial discharge patterns recognition with deep Convolutional Neural
Networks. In Proceedings of the Condition Monitoring and Diagnosis, Xi’an, China, 25–28 September 2016;
pp. 324–327.
30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016;
pp. 770–778.
31. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A.
Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
32. Chollet, F. Xception: Deep Learning With Depthwise Separable Convolutions. In Proceedings of the 2017
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017;
pp. 1251–1258.
33. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks.
In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu,
HI, USA, 21–26 July 2017; pp. 5987–5995.
34. Shao, L.; Wu, D.; Li, X. Learning Deep and Wide: A Spectral Method for Learning Deep Networks. IEEE Trans.
Neural Netw. Learn. Syst. 2017, 25, 2303–2308. [CrossRef] [PubMed]
35. Dieleman, S.; Schrauwen, B. Multiscale approaches to music audio feature learning. In Proceedings of the
14th International Society for Music Information Retrieval Conference (ISMIR-2013), Pontifícia Universidade
Católica do Paraná, Curitiba, Brazil, 4–8 November 2013; pp. 116–121.
36. Mathieu, M.; Couprie, C.; LeCun, Y. Deep multi-scale video prediction beyond mean square error.
arXiv 2015, arXiv:1511.05440.
37. Takahashi, N.; Mitsufuji, Y. Multi-Scale multi-band densenets for audio source separation. In Proceedings of
the 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz,
NY, USA, 15–18 October 2017; pp. 21–25.
38. Tang, Y.; Mohamed, A.-R. Multiresolution deep belief networks. In Proceedings of the Artificial Intelligence
and Statistics, La Palma, Canary Islands, Spain, 21–23 April 2012; pp. 1203–1211.
39. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual
Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [CrossRef] [PubMed]
40. Greff, K.; Srivastava, R.K.; Koutník, J.; Steunebrink, B.R.; Schmidhuber, J. LSTM: A Search Space Odyssey.
IEEE Trans. Neural Netw. Learn. Syst. 2015, 28, 2222–2232. [CrossRef] [PubMed]
41. Li, X.; Wang, X.; Xie, D.; Wang, X.; Yang, A.; Rong, M. Time–frequency analysis of PD-induced UHF signal in
GIS and feature extraction using invariant moments. IET Sci. Meas. Technol. 2018, 12, 169–175. [CrossRef]
42. Mikolov, T.; Karafiát, M.; Burget, L.; Cernocký, J.; Khudanpur, S. Recurrent neural network based language
model. In Proceedings of the INTERSPEECH 2010—11th Annual of Conference of the International Speech
Communication Association, Makuhari, Chiba, Japan, 26–30 September 2010; pp. 1045–1048.
43. Pan, S.J.; Yang, Q. A Survey on Transfer Learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345–1359.
[CrossRef]
44. Weston, J.; Ratle, F.; Mobahi, H.; Collobert, R. Deep learning via semi-supervised embedding. In Neural
Networks: Tricks of the Trade; Springer: Berlin/Heidelberg, Germany, 2012; pp. 639–655.
45. Bengio, Y.; Paiement, J.F.; Vincent, P.; Delalleau, O.; Roux, N.L.; Ouimet, M. Out-of-sample extensions for
LLE, IsoMap, MDS, Eigenmaps, and Spectral Clustering. Adv. Neural Inf. Process. Syst. 2004, 16, 177–184.
46. Williams, C.K.I. On a Connection between Kernel PCA and Metric Multidimensional Scaling. Mach. Learn.
2002, 46, 11–19. [CrossRef]
47. Chopra, S.; Hadsell, R.; Lecun, Y. Learning a Similarity Metric Discriminatively, with Application to Face
Verification. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, San Diego, CA, USA, 20–25 June 2005; pp. 539–546.
48. Li, T.; Rong, M.; Zheng, C.; Wang, X. Development simulation and experiment study on UHF partial
discharge sensor in GIS. IEEE Trans. Dielectr. Electr. Insul. 2012, 19, 1421–1430. [CrossRef]
49. Erhan, D.; Bengio, Y.; Courville, A.; Vincent, P. Visualizing Higher-Layer Features of a Deep Network; Technical
Report 1341; University of Montreal: Montreal, QC, Canada, 2009; pp. 1–13.
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).