Abstract: Evolutionary algorithms have been used with neural networks in two main ways: (i) to optimize the network architecture in terms of the number of hidden layers and the number of neurons in each layer, and (ii) to train the weights of a fixed architecture. While most previous work focuses on only one of these two options, this paper investigates an evolutionary approach called Breeder Genetic Programming (BGP) in which the architecture and the weights are optimized simultaneously. The genotype of each network is represented as a tree whose depth and width are dynamically adapted to the particular application by specifically defined genetic operators. The weights are trained by Gaussian approximation. The fitness function has been chosen as a function that increases when the mean square error at the end of training and the number of epochs needed for learning increase. This paper is concerned with finding an optimal Bayesian neural network using Breeder Genetic Programming to classify, i.e. identify, acoustic radar (SODAR) patterns. The results demonstrate that the method is capable of successfully identifying the different acoustic radar patterns.
Key words: Bayesian Neural Network, Breeder Genetic Programming, Gaussian Approximation, SODAR Pattern Identification.
On the other hand, a small network will achieve good generalization if it converges; however, it generally needs a large amount of training time. Therefore, the size of the network should be as small as possible, but sufficiently large to ensure an accurate fitting of the training set [3].

Koza [4] provides an alternative approach to representing neural networks, under the framework of so-called genetic programming, which enables modification not only of the weights but also of the architecture of a neural network. However, this method provides neither a general method for representing an arbitrary feedforward network, nor a mechanism for finding a network of minimal complexity.
This paper describes a new genetic programming method, called Breeder Genetic Programming (BGP), that employs a fitness function which increases when the number of epochs (EP) needed for learning and the mean square error (MSE) of the network increase. The weights are trained not by backpropagation, but by a Bayesian method.

In this manner, 90 data points were obtained from the charts for each category. These were used as the input to the network. The six acoustic radar pattern categories correspond to the output classes.

2. Representing Neural Networks as Trees

Multilayer feedforward neural networks are networks of simple processing elements, called neurons or units, organized in layers. The external inputs are presented to the input layer and fed forward via one or more layers of hidden units to the output layer. There is no connection between units in the same layer.

A commonly adopted architecture involves full connectivity between neighboring layers only. We allow both partial connectivity and direct connections between non-neighboring layers, since this is important for finding a parsimonious architecture. Specifically, this architecture allows some of the input units to be connected directly to output units.

For the genetic optimization, we represent a feedforward network as a set of m trees [5], each corresponding to one output unit. Figure 1 shows a genotype-to-phenotype mapping example. A genotype is depicted that produces the shown phenotype. There are three input nodes, one hidden node, one output node, and seven connection definitions, one of which is recurrent. The second connection is disabled, so the connection that it specifies (between nodes 2 and 4) is not expressed in the phenotype.

Fig 1. A genotype to phenotype mapping example
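To make the Figure 1 description concrete, the sketch below shows one possible way to write such a genotype down in Python. The node layout (three inputs, one hidden, one output), the disabled connection between nodes 2 and 4, and the single recurrent definition follow the example in the text; the ConnectionGene structure and all weight values are illustrative assumptions and are not the exact encoding of [5].

from dataclasses import dataclass

@dataclass
class ConnectionGene:
    src: int            # source node id
    dst: int            # destination node id
    weight: float
    enabled: bool = True
    recurrent: bool = False

# Node ids: 1-3 are input units, 4 is the hidden unit, 5 is the output unit.
node_genes = {1: "input", 2: "input", 3: "input", 4: "hidden", 5: "output"}

connection_genes = [
    ConnectionGene(1, 4, 0.7),
    ConnectionGene(2, 4, -0.5, enabled=False),   # disabled: not expressed in the phenotype
    ConnectionGene(3, 4, 0.5),
    ConnectionGene(1, 5, 0.2),                   # direct input-to-output connection
    ConnectionGene(3, 5, -0.4),
    ConnectionGene(4, 5, 0.9),
    ConnectionGene(5, 4, 0.1, recurrent=True),   # the one recurrent definition
]

def phenotype(connections):
    """Return only the connections that are actually expressed."""
    return [c for c in connections if c.enabled]

if __name__ == "__main__":
    for c in phenotype(connection_genes):
        print(f"{c.src} -> {c.dst}  (w = {c.weight})")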
3. Genetic Breeding of Neural Networks

3.1. Breeder Genetic Programming (BGP)

For the evolution of optimal neural networks we use concepts based on the Breeder Genetic Algorithm (BGA). While the usual genetic algorithms model natural evolution, the BGA models the rational selection performed by human breeders.

The BGA represents a class of random optimization techniques drawn from the science of population genetics, which have proved their ability to solve hard optimization problems with continuous parameters. The BGA, which can be seen as a recombination of Evolution Strategies (ES) and Genetic Algorithms (GA), uses truncation selection, which is very similar to the (µ, λ) strategy in ESs (in this strategy, also called the comma strategy and denoted (µ, λ), the parents die off after generating offspring and are not taken into account when forming the next generation), while the search process is mainly driven by recombination, which makes BGAs very similar to GAs. It has been proven that BGAs can solve problems more efficiently than GAs, due to their theoretically faster convergence to the optimum, and they can, like GAs, be easily written in parallel form [6].

Our approach differs from the BGA in that we use chromosomes of variable size, a characteristic of Genetic Programming (GP). Thus we call the method Breeder Genetic Programming (BGP). BGP also differs from the usual GP: while GP uses proportional selection combined with crossover as the main operator, BGP uses truncation selection combined with crossover plus Gaussian approximation.
The BGP evolutionary learning algorithm is summarized in eight steps, as explained below. The algorithm maintains a population A consisting of M individuals Ai of variable size. Each individual represents a neural network. The networks of the initial population, A(0), are generated with a random number of layers. The receptive field of each neural unit and its width are also chosen randomly. The (g+1)th population, A(g+1), is created from A(g) in three steps: selection, Gaussian approximation, and mating.

The selection step starts from the population A(g). Only the M% of elements showing the best fitness are chosen to give origin to the individuals of the next generation (i.e., are accepted into the mating pool B(g)). Usually the truncation ratio lies in the range 10% to 50%. The fitness function has been chosen as a function that increases when the mean square error at the end of training and the number of epochs needed for learning increase.

After selection, each individual in B(g) undergoes Bayesian neural network training, in which the weights of the network are adapted by Gaussian approximation. This results in the revised mating pool B(g). The mating phase repeatedly selects two random parent individuals in B(g) to mate and generate one offspring in the new population A(g+1) by applying crossover operators, until the population size amounts to M. Notice that not only may the sizes of individuals within one population differ,

    |Ai(g)| ≠ |Aj(g)|,  i ≠ j and i, j ∈ {1, …, M}    (1)

but the size of the same individual in subsequent populations may also differ,

    |Ai(g+1)| ≠ |Ai(g)|,  i ∈ {1, …, M}    (2)

A new population is generated repeatedly until an acceptable solution is found or the variance of the fitness, V(g), falls below a specified limit value Vmin, i.e., the procedure terminates if

    V(g) = (1/M) Σ_{i=1..M} ( Fi(g) − F̄(g) )²  ≤  Vmin    (3)

where F̄(g) is the average fitness of the individuals in A(g). The algorithm also stops if a specified number of generations, gmax, has been carried out.

In summary, the BGP algorithm proceeds in the following eight steps:

1. Generate an initial population A(0) of M networks at random. Set the current generation g←0.
2. Evaluate the fitness values Fi(g) of the networks using the training set of N examples.
3. If the termination condition is satisfied, stop the evolution; otherwise, continue with step 4.
4. Select the upper M% of the networks of the gth population into the mating pool B(g).
5. Train each network in B(g) by the Bayesian neural network method, resulting in the revised mating pool B(g).
6. Create the (g+1)th population A(g+1) of size M by applying genetic operators to randomly chosen parent networks in B(g).
7. Replace the worst-fit network in A(g+1) by the best in A(g).
8. Set g←g+1 and return to step 2.

3.2. Genetic Operators

The weights of a network are trained by applying a Gaussian approximation to each of the individuals accepted by truncation selection. Given a chromosome Si of the network, the Bayesian network finds a better chromosome by repeatedly applying the Gaussian approximation and adapting the weights until no weight configuration with better fitness is found in a sweep through the individual.

After that, the crossover operator adapts the size and shape of the network architecture. A crossover operation starts by choosing at random two parent individuals from the mating pool B(g). The actual crossover of two individuals, i and j, is done on their genotypical representations Si and Sj. The nodes in the tree are numbered according to depth-first search order, and the crossover sites Ci and Cj are chosen at random subject to the following conditions:

    1 ≤ Ci ≤ size(Si)   and   1 ≤ Cj ≤ size(Sj)    (4)

Here, the length of an individual, size(Si), is defined as the total number of units and weights. Given the crossover points, the subtrees of the two parent individuals Si and Sj are exchanged to form one offspring. The labels of the nodes Ci and Cj must belong to the same class.
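As an illustration of the subtree exchange just described, the following minimal sketch performs one crossover on two variable-size genotypes represented as nested Python lists. It is an illustrative stand-in only: the node labels are hypothetical, and the class-compatibility constraint on the crossover sites is not enforced here.

import copy
import random

def nodes_with_parents(tree, parent=None, idx=None, out=None):
    """List all subtrees in depth-first order as (node, parent, child_slot)."""
    if out is None:
        out = []
    out.append((tree, parent, idx))
    for i, child in enumerate(tree[1:], start=1):
        nodes_with_parents(child, tree, i, out)
    return out

def size(tree):
    """Length of an individual: total number of nodes in its tree."""
    return len(nodes_with_parents(tree))

def crossover(si, sj, rng=random):
    """Graft a randomly chosen subtree of sj into a random site of si, giving one offspring."""
    offspring = copy.deepcopy(si)
    donor = copy.deepcopy(sj)
    sites_i = nodes_with_parents(offspring)
    sites_j = nodes_with_parents(donor)
    node_i, parent_i, slot_i = sites_i[rng.randrange(len(sites_i))]   # crossover site Ci
    node_j, _, _ = sites_j[rng.randrange(len(sites_j))]               # crossover site Cj
    if parent_i is None:              # site Ci is the root of Si
        return node_j
    parent_i[slot_i] = node_j         # replace the chosen subtree in place
    return offspring

# Example: two genotypes of different size, cf. eq. (1).
s1 = ["out", ["hidden", ["x1"], ["x2"]], ["x3"]]
s2 = ["out", ["x1"], ["hidden", ["x2"], ["hidden", ["x3"], ["x1"]]]]
child = crossover(s1, s2, random.Random(0))
print(size(s1), size(s2), size(child))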
3.3. Fitness Function

To evaluate the goodness of an individual, the network is trained with a fixed number of patterns and then evaluated according to determined parameters. The parameters which seem to best describe the goodness of a network configuration are the mean square error MSE at the end of the training and the number of epochs EP needed for learning. Clearly it is desirable to attain values as low as possible for both parameters MSE and EP: in fact, a neural network must learn as fast as possible (a small value of EP) and with a good approximation of the desired output.

It is necessary to establish an upper limit EPmax on the number of epochs EP used by the network for training, thus 0 ≤ EP ≤ EPmax. Moreover, it is desirable that MSE be as close as possible to a small value emin (0 ≤ emin ≤ MSE), which represents the minimum error required (below emin we can say the network has learned its task).

    F(x) = EP/EPmax + MSE    (5)

Please note that, according to the above formula, F(x) can be lower than 1 if and only if the learning phase is such that the minimum required error is reached by the artificial neural network within a number of epochs lower than EPmax [7].
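For concreteness, eq. (5) can be read directly as the short Python sketch below. Since F increases with both EP and MSE, "best fitness" here means the smallest F, so truncation selection keeps the individuals with the lowest values. The record structure and the numbers are illustrative assumptions only.

def fitness(ep, mse, ep_max):
    """Eq. (5): F = EP/EPmax + MSE; lower values indicate better individuals."""
    return ep / ep_max + mse

def truncation_select(population, ratio):
    """Keep the best `ratio` fraction of the population (truncation selection).
    Each individual is assumed to be a dict with 'ep', 'mse' and 'ep_max' fields."""
    ranked = sorted(population, key=lambda ind: fitness(ind["ep"], ind["mse"], ind["ep_max"]))
    keep = max(1, int(len(ranked) * ratio))
    return ranked[:keep]

# Example with made-up numbers: only a network that learns fast and accurately scores below 1.
pool = [
    {"ep": 1500, "mse": 0.04, "ep_max": 5000},   # F = 0.34
    {"ep": 4800, "mse": 0.30, "ep_max": 5000},   # F = 1.26
    {"ep": 3000, "mse": 0.10, "ep_max": 5000},   # F = 0.70
]
print([round(fitness(p["ep"], p["mse"], p["ep_max"]), 2) for p in pool])
print(truncation_select(pool, 0.3))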
4. Bayesian Network

A Bayesian network is defined over a set of discrete variables, partitioned into three sets: input, output, and intermediate variables, respectively. The model consists of a qualitative part (a directed graph) and quantitative parts (dependency models). The vertices of the graph represent the random variables, and the edges define the independence relations (each variable is independent of its non-descendants given its parents [9]). There is a probabilistic dependency model for each variable that describes its dependency on its parents.

Let U = {x1, …, xn}, n ≥ 1, be a set of variables. A Bayesian network B over the set of variables U is a network structure BS, which is a directed acyclic graph (DAG) over U, together with a set of probability tables BP = {p(u | pa(u)) | u ∈ U}, where pa(u) is the set of parents of u in BS. A Bayesian network represents the probability distribution [9]

    P(U) = Π_{u∈U} p(u | pa(u))    (6)

4.1. The Bayesian Solution

The Bayesian viewpoint allows the following advantages to be gained by simple application of the rules of probability theory:

A. By viewing the product of learning as an ensemble of probable classifiers, we can take our uncertainty into account when making predictions. This improves the quality of the predictions.

B. By viewing the regularizer with regularization constant α as defining a prior probability, we can solve the overfitting problem, i.e., the problem of setting hyperparameters such as α that control the "complexity" of the model; with probability theory we can effectively infer the appropriate value of the regularization constant α from the data. This removes the need to waste resources on cross-validation, which is traditionally used to set such hyperparameters. This is an especially important advantage of the Bayesian approach when there are many such regularization constants.

Assume we have a set of n independent training items (x1, y1), …, (xn, yn), each of which gives the value of an output vector yi associated with an "input" vector xi. Given the prior probability distribution over the weight vector, p(w), and the likelihood function (error measure) p(y | x, w), the posterior probability of the weight vector is given by Bayes' rule as follows:

    p(w | (x1,y1), …, (xn,yn)) = p(w) p((x1,y1), …, (xn,yn) | w) / p((x1,y1), …, (xn,yn))
                               = p(w) p(y1, …, yn | x1, …, xn, w) / p(y1, …, yn | x1, …, xn)    (7)

In the Bayesian approach to statistical prediction, one does not use a single "best" set of network weights, but rather integrates the predictions from all possible weight vectors over the posterior weight distribution, which combines information from the data with a prior bias toward more plausible weights. Assuming the additional input is represented by xn+1, the desired output yn+1 can be computed as follows [10]:

    yn+1 = ∫_{R^N} f(xn+1, w) p(w | (x1,y1), …, (xn,yn)) dw    (8)

The prior and predictive distributions represented by the above relations give the optimal solution of the neural network problem, but this technique involves multi-dimensional (multivariate) integration, and this type of integration is the main source of difficulty in the Bayesian approach in many applications [11]. Therefore we can consider training to be the problem of neural networks, while multivariate integration is the problem of the Bayesian neural network. Analytic solutions are impossible, so we need to look for alternative methods to solve these integrations. Among these methods are:

A. Direct Numerical Integration.
B. Monte Carlo Simulation.
C. Gaussian Approximation.

In this paper, I will deal with Gaussian approximation to find the best weights of the network.
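For intuition about why eq. (8) is hard and how the listed methods attack it, the sketch below approximates the predictive integral by Monte Carlo simulation (method B) for a deliberately tiny one-parameter stand-in for the network, f(x, w) = w·x, with a Gaussian prior and Gaussian noise. The data, noise level, prior width and sample count are illustrative assumptions, not values from this paper.

import math
import random

random.seed(1)

def f(x, w):                                    # one-parameter stand-in for the network of eq. (9)
    return w * x

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]     # made-up (x_i, y_i) training items
sigma, omega = 0.5, 2.0                         # assumed noise level and prior width

def likelihood(w):
    """p(y_1..y_n | x_1..x_n, w) up to a constant, under the Gaussian model."""
    return math.exp(-sum((y - f(x, w)) ** 2 for x, y in data) / (2 * sigma ** 2))

# Monte Carlo: draw w from the Gaussian prior p(w) and weight each draw by the
# likelihood; the ratio below then approximates the predictive integral of eq. (8).
draws = [random.gauss(0.0, omega) for _ in range(20000)]
x_new = 4.0
num = sum(likelihood(w) * f(x_new, w) for w in draws)
den = sum(likelihood(w) for w in draws)
print("Monte Carlo estimate of y_(n+1) at x =", x_new, ":", round(num / den, 3))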
4.2. Gaussian Approximation

In this paper, I will deal only with neural networks used for regression, though the method should be applicable to classification networks as well. Assume we have a set of n independent training items (x1, y1), …, (xn, yn), each of which gives the value of an output vector yi associated with an "input" vector xi. We also have the value of the input vector for an additional test item, xn+1. The task is to predict the corresponding output vector yn+1 [10].

A neural network of some given architecture and with weight vector w defines a mapping from an input vector x to a predicted output vector y given by

    y = f(x, w)    (9)

Assuming a Gaussian model, the conditional probability distribution for the output vector given the input vector, based on this mapping, is as follows:

    p(y | x, w) = (2πσ²)^(−N/2) exp( −‖y − f(x, w)‖² / (2σ²) )    (10)

where y is the desired output, f(x, w) is the actual output, N is the dimension of the output vector, and σ is the noise level in the output.

The conventional maximum likelihood approach to neural network training is to use some gradient-based algorithm to find the weight vector that maximizes the degree of fit to the data, given by the log-likelihood L(w), defined as

    L(w) = Σ_{i=1..n} log p(yi | xi, w) = −Σ_{i=1..n} ‖yi − f(xi, w)‖² / (2σ²) + C    (11)

The prior distribution over the weight vector is likewise represented by a Gaussian distribution, as follows:

    p(w) = (2πω²)^(−N/2) exp( −‖w‖² / (2ω²) )    (13)

5. Bayesian Neural Network Training

The Bayesian neural network learning algorithm is a form of supervised learning used to train mainly feedforward neural networks to perform many tasks [12]. It uses the sigmoid function to determine the activation of the neurons, and it uses the MSE as a measure to determine the convergence of the network outputs toward the desired output. Figure 2 shows the structure of the Bayesian neural network, which consists of three kinds of layers [13]:

• Input Layer (i): This layer is responsible for receiving the network inputs, which are distributed to the hidden layer neurons.

• Hidden Layer (j): This layer extracts information about the distribution of features within the set of training examples, and its output is sent to a neighboring hidden layer or to the output layer through the associated weights [14].

• Output Layer (k): This layer is responsible for receiving the stimulus pattern code from the hidden layers through the associated weights and finding the actual output of the network. The output of each layer can be computed as follows:

    hj = f( Σ_{i=1..l} wij xi )    (14)
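To make eq. (14) and the layer structure concrete, here is a minimal forward-pass sketch in plain Python with the sigmoid as activation f; the layer sizes and the weight values are illustrative assumptions only.

import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def layer_output(inputs, weights):
    """Eq. (14): each unit j computes h_j = f( sum_i w_ij * x_i ) with f = sigmoid."""
    return [sigmoid(sum(w_ij * x_i for w_ij, x_i in zip(w_row, inputs)))
            for w_row in weights]

# A toy 3-2-1 network (3 inputs, 2 hidden units, 1 output unit).
w_input_hidden = [[0.2, -0.4, 0.1],    # weights into hidden unit 1
                  [0.7,  0.3, -0.6]]   # weights into hidden unit 2
w_hidden_output = [[0.5, -0.8]]        # weights into the single output unit

x = [0.9, 0.1, 0.4]
h = layer_output(x, w_input_hidden)    # hidden layer activations h_j
y = layer_output(h, w_hidden_output)   # network output
print(h, y)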
How can the weights of the Bayesian neural network be determined? First, we generate random weights between 0 and 1 for the output and hidden layers of the network. Then the weights are calculated through the prior probability distribution, which represents a Gaussian approximation, as follows:

    p(wij) = (2πω²)^(−L/2) exp( −wij² / (2ω²) )    (19)

The error at hidden node j is computed from the errors at the nodes it feeds as

    l′j = hj (1 − hj) Σ_k ( l′k wjk )    (22)

Fig 2. The Structure of Bayesian Neural Network

The weights between the input layer and the hidden layer are adjusted as follows:

    Δwij = η l′j xi + α Δw′ij    (23)

where Δwij represents the change in the weights between the input layer and the hidden layer, η is the learning rate (a value in [0.1, 0.3]), Δw′ij represents the difference between the current and prior weight, α is the momentum rate, and l′j is the error at node j. The new weight becomes

    wij = Δw′ij + w″ij    (24)

where w″ij is the prior weight. The weights between the hidden layer and the output layer are adjusted as follows:

    Δwjk = η l′k hj + α Δw′jk    (25)

where Δwjk represents the change in the weights between the hidden layer and the output layer, η is the learning rate, l′k is the error at output node k, hj is the output of hidden node j, and α is the momentum rate.
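The sketch below strings eqs. (22)-(25) together for a single update of a tiny 2-2-1 sigmoid network. Two caveats: the defining equation for the output-node error l′k does not appear in the text recovered here, so the usual sigmoid-derivative form l′k = yk(1 − yk)(dk − yk) is assumed; and eq. (24) is read as adding the freshly computed change from eq. (23) to the prior weight. All shapes and numbers are illustrative.

# One illustrative weight update following eqs. (22)-(25).
eta, alpha = 0.2, 0.09          # learning rate and momentum rate (within the quoted ranges)

x = [0.9, 0.1]                  # input pattern
h = [0.62, 0.41]                # hidden activations h_j (from eq. 14)
y = [0.55]                      # network output y_k
d = [1.0]                       # desired output

w_jk       = [[0.5, -0.8]]                  # hidden-to-output weights, indexed [k][j]
dw_prev_ij = [[0.0, 0.0], [0.0, 0.0]]       # previous input-to-hidden changes (Δw'_ij)
w_prior_ij = [[0.3, -0.2], [0.1, 0.4]]      # prior input-to-hidden weights (w''_ij)
dw_prev_jk = [[0.0, 0.0]]                   # previous hidden-to-output changes (Δw'_jk)

# Assumed output-node error (its defining equation is not shown in the recovered text):
l_k = [yk * (1 - yk) * (dk - yk) for yk, dk in zip(y, d)]

# Eq. (22): error at each hidden node j, back-propagated from the output nodes.
l_j = [h[j] * (1 - h[j]) * sum(l_k[k] * w_jk[k][j] for k in range(len(l_k)))
       for j in range(len(h))]

# Eqs. (23) and (24): change and new value for each input-to-hidden weight.
new_w_ij = [[0.0, 0.0], [0.0, 0.0]]
for j in range(len(h)):
    for i in range(len(x)):
        dw = eta * l_j[j] * x[i] + alpha * dw_prev_ij[j][i]   # eq. (23)
        new_w_ij[j][i] = dw + w_prior_ij[j][i]                # eq. (24)

# Eq. (25): change for each hidden-to-output weight.
dw_jk = [[eta * l_k[k] * h[j] + alpha * dw_prev_jk[k][j] for j in range(len(h))]
         for k in range(len(l_k))]

print(l_j, new_w_ij, dw_jk)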
6. Implementation and Results

In this paper, 90 data points were obtained from the charts for each category. These were used as the input features. The six pattern categories correspond to the output classes (i.e., the output classes are: convective boundary layer, temperature inversion layer with a flat top, temperature inversion layer with small spikes, temperature inversion layer with tall spikes, rising inversion layer, and rising inversion layer capping a convective boundary layer [15]).

A three-layered Bayesian neural network, with breeder genetic programming, was used for classifying the patterns. Random sets of 20 sequential observations, from the total of 90 data points in each pattern category, constituted a training pattern with 20 attributes (input nodes) for that class. Six nodes were used at the output layer. Various truncation selection ratios were used (10%, 20%, 30%, 40%, 50%); each one produces a different network in shape and size (number of nodes in the hidden layer) and also in training time. The selection ratio Q% refers to a random class-wise selection of Q% of the training data (a sequence of 20 points in the chart) from the entire dataset. The remaining (100−Q)% of the data constitute the test set in each case.

6.1. Case Study Number One

This case study uses the following parameters for the breeder genetic programming (BGP):
Population size = 50
Truncation ratio = 10%
No. of generations = 50

And the following parameters for the Bayesian neural network:
Max no. of epochs = 20000
Learning rate = 0.2
Momentum term = 0.09
Noise level value = 0.15

As a result, the network consists of three layers, its hidden layer has 6 nodes, the value of the fitness function is 0.992068, and the network is able to identify all the patterns after 18795 epochs with an MSE of 0.052318.

6.2. Case Study Number Two

This case study uses the following parameters for the Bayesian neural network:
Noise level value = 0.17

As a result, the network consists of three layers, its hidden layer has 10 nodes, the value of the fitness function is 0.80561, and the network is able to identify all the patterns after 2405 epochs with an MSE of 0.00395.

6.3. Case Study Number Three

This case study uses the following parameters for the breeder genetic programming (BGP):
Population size = 100
Truncation ratio = 30%
No. of generations = 50

And the following parameters for the Bayesian neural network:
Max no. of epochs = 7000
Learning rate = 0.11
Momentum term = 0.09
Noise level value = 0.18

As a result, the network consists of three layers, its hidden layer has 13 nodes, the value of the fitness function is 0.78501, and the network is able to identify all the patterns after 3100 epochs with an MSE of 0.342153.

6.4. Case Study Number Four

This case study uses the following parameters for the breeder genetic programming (BGP):
Population size = 100
Truncation ratio = 40%
No. of generations = 50

And the following parameters for the Bayesian neural network:
Max no. of epochs = 10000
Learning rate = 0.25
Momentum term = 0.04

6.5. Case Study Number Five

This case study uses the following parameters for the Bayesian neural network:
Max no. of epochs = 1000
Learning rate = 0.3
Momentum term = 0.089
Noise level value = 0.18

As a result, the network consists of three layers, its hidden layer has 15 nodes, the value of the fitness function is 0.59468, and the network is able to identify all the patterns after 585 epochs with an MSE of 0.00968.
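The reported fitness values are consistent with eq. (5): the short calculation below, using only the epoch counts, epoch limits and MSE values quoted above, reproduces them.

# Eq. (5) applied to the reported case-study numbers (F = EP/EPmax + MSE).
def fitness(ep, mse, ep_max):
    return ep / ep_max + mse

print(fitness(18795, 0.052318, 20000))   # case one:   reported 0.992068
print(fitness(3100,  0.342153, 7000))    # case three: reported 0.78501
print(fitness(585,   0.00968,  1000))    # case five:  reported 0.59468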
REFERENCES

[1] R. G. Reeves and F. J. Janza, "Manual of Remote Sensing: Theory, Instruments and Techniques". Falls Church, VA: Amer. Soc. Photogramm., 1975.