Neural Networks


Neural networks: theory and practice

Todor Todorov
Hampson-Russell Software Services Ltd.
todor@hampson-russell.com
Introduction

In its most general form, an artificial neural network is a set of electronic components or a computer program designed to model the way in which the brain performs a task. The brain is a highly complex, nonlinear, and parallel information-processing system. The structural constituents of the brain are nerve cells called neurons, which are linked by a large number of connections called synapses. This complex system has the great ability to build up its own rules and store information through what we usually refer to as experience.

The neural network resembles the brain in two respects:

- knowledge is acquired by the network through a learning process;
- inter-neuron connection strengths, known as synaptic weights, are used to store the knowledge.

The procedure used to perform the learning process is called a learning algorithm. Its function is to modify the synaptic weights of the network in an orderly fashion to attain a desired design objective.

Although neural networks are relatively new to the petroleum industry, their origins can be traced back to the 1940s, when psychologists began developing models of human learning. With the advent of the computer, researchers began to program neural network models to simulate the complex behavior of the brain. However, in 1969, Marvin Minsky and Seymour Papert showed that one-layer perceptrons, a simple neural network being studied at that time, are incapable of solving many simple problems. Optimism soared again in 1986, when Rumelhart and McClelland published the two-volume book Parallel Distributed Processing. The book presented the back-propagation algorithm, which has become one of the most popular learning algorithms for the multi-layer feedforward neural network. Since then, the effort to develop and implement different architectures and learning algorithms has been enormous. In 1990, Donald Specht published the idea of the probabilistic neural network, which has its roots in probability theory.

The trend was picked up by research geophysicists, and a number of successful applications were reported in the geophysical literature (Huang et al., 1996; Todorov et al., 1998). Two neural network architectures are described in this paper: the multi-layer feedforward neural network and the probabilistic neural network. The basic theory and its implementation in EMERGE are discussed.

Multi-layer feedforward neural network

Basic architecture

In this section we study one of the most widely used types of neural network, the multi-layer feedforward network, also known as the multi-layer perceptron. Figure 1 shows the basic architecture of the multi-layer feedforward neural network. It consists of a set of neurons, also called processing units, which are arranged into layers. There is always an input layer and an output layer, each containing at least one neuron. Between them there are one or more hidden layers. The neurons are connected in the following fashion: inputs to neurons in each layer come from outputs of the previous layer, and outputs from these neurons are passed to neurons in the next layer. Each connection carries a weight. In the example shown in Figure 1, we have four inputs (for example, four seismic attributes: A1, A2, A3, A4), one hidden layer containing three neurons, and one output neuron (for example, measured porosity). The number of connections is 4 x 3 + 3 x 1 = 15, i.e., we have 15 weights.
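The weight count for the Figure 1 example can be checked with a few lines of Python. This is only a sketch: the helper name `count_weights` is hypothetical, and it assumes fully connected layers with no bias terms, which is what the count of 15 in the text implies.

```python
def count_weights(layer_sizes):
    """Number of connections in a fully connected feedforward network.

    Assumes every neuron feeds every neuron of the next layer and that
    there are no bias weights (an assumption consistent with the text).
    """
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

# Four inputs, three hidden neurons, one output neuron (Figure 1).
n_weights = count_weights([4, 3, 1])
print(n_weights)
```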

Figure 1: Feedforward neural network architecture.

The neurons are information-processing units that are fundamental to the operation of a neural network. Figure 2 shows the model of a neuron. We may identify three basic processes in the neuron model:

- each of the input signals $x_j$ is multiplied by the corresponding synaptic weight $w_j$;
- the weighted input signals are summed;
- a nonlinear function, called the activation function, is applied to the sum.

Figure 2: A model of a neuron.

Mathematically the process is written as:

$$\text{neuron's output} = f\left( \sum_{j=1}^{p} x_j w_j \right)$$

where:

- $w_j$ are the synaptic (connection) weights;
- $x_j$ are the neuron inputs;
- $f(\cdot)$ is the activation function.

The activation function defines the output of a neuron in terms of the activity level at its input. The sigmoid function is by far the most common form of activation function used in the construction of artificial neural networks. It is defined as a strictly increasing function that exhibits smoothness and asymptotic properties. An example of the sigmoid function is the logistic function, defined by:

$$f(x) = \frac{1}{1 + e^{-x}}$$
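As a minimal sketch of the neuron model, the weighted sum followed by the activation function can be written directly in Python. The function names `logistic` and `neuron_output` are illustrative, not taken from any particular software; any other activation, such as `math.tanh`, can be passed in its place.

```python
import math

def logistic(x):
    """Logistic sigmoid: strictly increasing, with values between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights, activation=logistic):
    """Multiply each input by its weight, sum, then apply the activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum)

# The weighted sum here is 1*2 + 2*(-1) = 0, the midpoint of the logistic curve.
out_logistic = neuron_output([1.0, 2.0], [2.0, -1.0])
out_tanh = neuron_output([1.0, 2.0], [2.0, -1.0], activation=math.tanh)
print(out_logistic, out_tanh)
```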

The logistic function assumes a continuous range of values from 0 to 1. It is sometimes desirable to have the activation function range from -1 to +1, in which case the activation function assumes an antisymmetric form with respect to the origin. An example is the hyperbolic tangent function, defined by:

$$f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

A neural net is completely defined by the number of layers, the number of neurons in each layer, and the connection weights. The process of weight estimation is called training or learning.

The process of training

The major task for a neural network is to learn a model by being presented with examples. Each example consists of an input-output pair: an input signal and the corresponding desired response of the neural network. Thus, a set of examples represents the knowledge. For each example we compare the outputs obtained by the network with the outputs we would like to obtain. If $y = [y_1, y_2, \ldots, y_p]$ is a vector containing the outputs ($p$ is the number of neurons in the output layer), and $d = [d_1, d_2, \ldots, d_p]$ is a vector containing the desired response, we can compute the error for example $k$:

$$e_k = \frac{1}{p} \sum_{j=1}^{p} (y_j - d_j)^2$$

If we have n examples, the total error is:


$$e = \frac{1}{n} \sum_{k=1}^{n} e_k$$
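The two error formulas translate directly into code. The sketch below uses hypothetical helper names (`example_error`, `total_error`) and made-up numbers purely for illustration.

```python
def example_error(y, d):
    """Mean squared difference between network outputs y and desired outputs d."""
    return sum((yj - dj) ** 2 for yj, dj in zip(y, d)) / len(y)

def total_error(outputs, targets):
    """Average of the per-example errors over all n examples."""
    errors = [example_error(y, d) for y, d in zip(outputs, targets)]
    return sum(errors) / len(errors)

# Two examples, each with p = 2 output neurons (illustrative values).
outputs = [[1.0, 2.0], [0.0, 0.0]]
targets = [[1.0, 0.0], [0.0, 2.0]]
e = total_error(outputs, targets)
print(e)
```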

Our goal is to reduce the error, which is done by updating the weights. Thus, in its basic form, a neural network training algorithm is an optimization algorithm that minimizes the error with respect to the network weights. In 1986, the back-propagation algorithm, the first practical algorithm for multi-layer perceptron training, was presented (Rumelhart and McClelland, 1986). The basic steps of the training are:

1. initialize the network weights to small, uniformly distributed random numbers;
2. present the examples to the network and compute the outputs;
3. compute the error;
4. update the weights backward, i.e., starting from the output layer and passing through the layers to the input layer, using the delta rule (Figure 3):

$$\Delta w_{ji} = -\eta \frac{\partial e}{\partial w_{ji}}$$

where $\eta$ is a constant called the learning rate.

Figure 3: Back-propagation of the error.

In practice we repeat the above steps until we are satisfied with the neural network performance. Back-propagation is what numerical analysts call gradient descent, or the steepest descent algorithm. Although the method is capable of reaching a local minimum, it is considerably slow. A better optimization method is the conjugate gradient algorithm (Press et al., 1988; Masters, 1993).
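The training steps can be sketched in plain Python for the 4-3-1 network of Figure 1. This is an illustrative sketch under several assumptions that are not in the text: logistic hidden neurons, a linear output neuron, no bias weights, batch updates, a synthetic target, and a learning rate of 0.1. It is steepest descent, not the conjugate gradient method.

```python
import math
import random

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w_hidden, w_out, x):
    """Hidden activations and the (assumed linear) output for one input vector."""
    hidden = [logistic(sum(w * xj for w, xj in zip(row, x))) for row in w_hidden]
    return hidden, sum(w * h for w, h in zip(w_out, hidden))

def total_error(w_hidden, w_out, examples):
    return sum((forward(w_hidden, w_out, x)[1] - d) ** 2
               for x, d in examples) / len(examples)

def train_step(w_hidden, w_out, examples, eta):
    """One batch update: accumulate de/dw backward, then apply the delta rule."""
    n = len(examples)
    grad_hidden = [[0.0] * len(row) for row in w_hidden]
    grad_out = [0.0] * len(w_out)
    for x, d in examples:
        hidden, y = forward(w_hidden, w_out, x)
        delta_out = 2.0 * (y - d) / n               # de/dy at the output layer
        for i, h in enumerate(hidden):
            grad_out[i] += delta_out * h
            # Propagate the error back through the logistic hidden neuron.
            delta_hidden = delta_out * w_out[i] * h * (1.0 - h)
            for j, xj in enumerate(x):
                grad_hidden[i][j] += delta_hidden * xj
    for i in range(len(w_out)):                      # delta w = -eta * de/dw
        w_out[i] -= eta * grad_out[i]
        for j in range(len(w_hidden[i])):
            w_hidden[i][j] -= eta * grad_hidden[i][j]

rng = random.Random(0)
examples = []
for _ in range(20):                                  # synthetic training examples
    x = [rng.uniform(-1.0, 1.0) for _ in range(4)]
    examples.append((x, 0.5 + 0.3 * x[0] - 0.2 * x[1]))

# Step 1: small, uniformly distributed random initial weights.
w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(4)] for _ in range(3)]
w_out = [rng.uniform(-0.1, 0.1) for _ in range(3)]

error_before = total_error(w_hidden, w_out, examples)
for _ in range(200):                                 # steps 2 to 4, repeated
    train_step(w_hidden, w_out, examples, eta=0.1)
error_after = total_error(w_hidden, w_out, examples)
print(error_before, error_after)
```

All gradients are accumulated before any weight is changed, so each pass is a single steepest-descent step on the total error.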

Eluding local minima: simulated annealing

The method of conjugate gradient is extremely efficient in locating the nearest local minimum. However, the conjugate gradient cannot escape from the valley of the local minimum and find the global minimum. In Figure 4, we have a plot of a simple error function. There are three minima, two of which are local. The starting point for the optimization is marked with the black square. The arrow shows the path of the conjugate gradient: it always goes down the valley, so it cannot jump over the hill and find the global minimum.

Figure 4: The conjugate gradient finds the local minimum.

Generally, the error function is quite complex and contains a number of local minima. So if we are looking for a good training algorithm, we have to find a way to escape from the local minima valleys and locate the global minimum. One possible solution is to use simulated annealing. The concept of simulated annealing is simple: we are sitting at some starting point and we want to move to a better point. To do so, we randomly choose a number of points and calculate the error function value at each. The maximum distance for searching is defined by a parameter called temperature. The point with the minimum function value becomes the center, or starting point, for the next search. Figure 5 is an example of the process. The black arrows show three randomly chosen points. We can see that one of them is within the valley of the global minimum. Then we can repeat the process, so we move down the valley. However, the process of simulated annealing is slower than the conjugate gradient, so a smart way is to combine both methods. First, we perform simulated annealing to find the valley of the global minimum, and then we use the conjugate gradient to move faster to the bottom of the valley.

Figure 5: Simulated annealing.
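The search described above can be sketched in one dimension as follows. This follows the random-search scheme in the text (try several random points within the temperature radius, recenter on the best one) and not the classical Metropolis acceptance rule. The toy error function, the cooling factor, and the numbers of points and rounds are assumptions made only for illustration.

```python
import math
import random

def toy_error(x):
    """Illustrative 1-D error function with several minima (not from the paper)."""
    return 0.05 * x * x + math.sin(2.0 * x)

def simulated_annealing(error, start, temperature,
                        n_points=10, n_rounds=40, cooling=0.9, seed=0):
    rng = random.Random(seed)
    center, center_error = start, error(start)
    for _ in range(n_rounds):
        best, best_error = center, center_error
        for _ in range(n_points):
            # The temperature sets the maximum search distance from the center.
            candidate = center + rng.uniform(-temperature, temperature)
            candidate_error = error(candidate)
            if candidate_error < best_error:
                best, best_error = candidate, candidate_error
        # The best point found becomes the center of the next search.
        center, center_error = best, best_error
        temperature *= cooling   # assumed schedule: shrink the radius each round
    return center, center_error

start = 4.0
best_x, best_error = simulated_annealing(toy_error, start, temperature=5.0)
print(best_x, best_error)
```

Because the center only moves when a better point is found, the returned error can never exceed the error at the starting point. In a combined scheme, the returned point would then be handed to a local optimizer such as the conjugate gradient.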


Overtraining and validation

Theoretically, given enough neurons and iterations, the error based on the training set will approach zero. However, this is undesirable, since the neural net will be fitting noise and small details of the individual cases. That often leads to poor prediction on unseen data. This pitfall is called overfitting or overtraining. The problem of overfitting versus generalization is similar to that of fitting a function to known points and then using the function for prediction. If we use a polynomial of high enough order, we may fit the known points exactly (Figure 6). However, if we use a smoother function (the dashed line), the prediction of unknown points is better. The number of neurons in the neural network is analogous to the polynomial degree, i.e., a large number of neurons can lead to overfitting. So how do we determine the number of neurons?

Figure 6: Overfitting versus generalization.

To solve the problem we can divide our data into two sets: training and validation. The first is used to train the neural network. Once built, the neural net is applied to the validation data for evaluation.

Possible training scheme:

1. divide the data into training and validation sets;
2. start the training with a small number of neurons;
3. apply the network to the validation data and compute the error;
4. add new neurons until no improvement on the validation data is seen.
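The stopping logic of this scheme can be sketched independently of any particular network. In the sketch below, `validation_error` is a hypothetical callable that would train a network with the given number of hidden neurons and return its error on the validation set; here it is replaced by a made-up lookup table so that the example runs on its own.

```python
def choose_hidden_size(validation_error, max_neurons):
    """Add neurons one at a time; stop at the first size that fails to improve."""
    best_size = 1
    best_error = validation_error(1)
    for size in range(2, max_neurons + 1):
        error = validation_error(size)
        if error >= best_error:      # no improvement on the validation data
            break
        best_size, best_error = size, error
    return best_size, best_error

# Made-up validation errors, standing in for real training runs.
fake_errors = {1: 0.9, 2: 0.5, 3: 0.3, 4: 0.35, 5: 0.2}
chosen_size, chosen_error = choose_hidden_size(fake_errors.get, max_neurons=5)
print(chosen_size, chosen_error)
```

Note that stopping at the first non-improving size is a greedy choice: in the made-up table the search stops at three neurons and never tries five.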
Probabilistic neural network

The basic idea behind the general regression probabilistic neural network is to use a set of one or more measured values, called independent variables, to predict the value of a single dependent variable. Let us denote the independent variables by a vector $x = [x_1, x_2, \ldots, x_p]$, where $p$ is the number of independent variables. Note that the dependent variable, denoted $y$, is a scalar. The inputs to the neural network are the independent variables, $x_1, x_2, \ldots, x_p$, and the output is the dependent variable, $y$. The goal is to estimate the unknown dependent variable, $y$, at a location where the independent variables are known. This estimation is based on the fundamental equation of the general regression probabilistic neural network:

$$y'(x) = \frac{\sum_{i=1}^{n} y_i \exp(-D(x, x_i))}{\sum_{i=1}^{n} \exp(-D(x, x_i))}$$

where $n$ is the number of examples and $D(x, x_i)$ is defined by:

$$D(x, x_i) = \sum_{j=1}^{p} \left( \frac{x_j - x_{ij}}{\sigma_j} \right)^2$$

$D(x, x_i)$ is the scaled distance between the point we are trying to estimate, $x$, and the training point $x_i$. The distance is scaled by the quantity $\sigma_j$, called the smoothing parameter, which may be different for each independent variable. The actual training of the network consists of determining the optimal set of smoothing parameters, $\sigma_j$. The criterion for optimization is minimization of the validation error. We define the validation result for the $m$-th example as:

$$y'_m(x_m) = \frac{\sum_{i=1,\, i \neq m}^{n} y_i \exp(-D(x_m, x_i))}{\sum_{i=1,\, i \neq m}^{n} \exp(-D(x_m, x_i))}$$

So the predicted value of the $m$-th sample is $y'_m$. Since we know the actual value, $y_m$, we can calculate the prediction error:

$$e_m = (y_m - y'_m)^2$$

The total error for the $n$ examples is:

$$e = \sum_{i=1}^{n} (y_i - y'_i)^2$$

The validation error then is minimized with respect to the smoothing parameters using the conjugate gradient algorithm.
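A minimal Python sketch of the prediction equation and of the leave-one-out validation error is given below. The function names and the three-point data set are illustrative assumptions. The sketch only evaluates the error for a given set of smoothing parameters; it does not perform the conjugate gradient minimization.

```python
import math

def scaled_distance(x, xi, sigmas):
    """D(x, x_i): sum over variables of ((x_j - x_ij) / sigma_j)^2."""
    return sum(((a - b) / s) ** 2 for a, b, s in zip(x, xi, sigmas))

def grnn_predict(x, train_x, train_y, sigmas, skip=None):
    """Weighted average of the training targets; `skip` leaves one example out."""
    numerator = 0.0
    denominator = 0.0
    for i, (xi, yi) in enumerate(zip(train_x, train_y)):
        if i == skip:
            continue
        weight = math.exp(-scaled_distance(x, xi, sigmas))
        numerator += weight * yi
        denominator += weight
    return numerator / denominator

def validation_error(train_x, train_y, sigmas):
    """Total error: each example is predicted from all the other examples."""
    total = 0.0
    for m, (xm, ym) in enumerate(zip(train_x, train_y)):
        predicted = grnn_predict(xm, train_x, train_y, sigmas, skip=m)
        total += (ym - predicted) ** 2
    return total

# Three examples with one independent variable (illustrative values).
train_x = [[0.0], [1.0], [2.0]]
train_y = [0.0, 1.0, 2.0]
sigmas = [1.0]

# The middle point is predicted from two equidistant neighbours.
mid_prediction = grnn_predict([1.0], train_x, train_y, sigmas, skip=1)
loo_error = validation_error(train_x, train_y, sigmas)
print(mid_prediction, loo_error)
```

With very small smoothing parameters all the exponentials can underflow to zero, so a production implementation would need to guard the division.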
Neural networks in EMERGE

Multi-layer feedforward neural network (MLFN)

Figure 7 is a flowchart of the multi-layer feedforward neural network (MLFN) implemented in EMERGE. It combines the global search strategy of simulated annealing (SA) with the powerful minimum-seeking conjugate gradient (CJ) algorithm. Simulated annealing is used in two separate, independent ways. First, it is used in a high-temperature initialization mode, centered at weights of zero, to find a good starting point for CJ minimization. When the CJ algorithm subsequently converges to a local minimum, SA is called into play again. This time its goal is to escape from what may be a local minimum, and it is centered about the best weights found by the CJ algorithm. SA is called until it cannot find a better point than CJ did. This is the so-called loop A; each pass through loop A is called an iteration. When we exit from loop A, CJ is called again to refine the best weights found in loop A. If we have not reached the number of iterations specified in the MLFN menu, SA from zero is called again, i.e., loop B is executed. The flow stops when: loop A iterations + loop B iterations = number of iterations.

Figure 7: MLFN flowchart.

The number of neurons in the input layer is equal to the number of attributes multiplied by the length of the convolutional operator. We have one neuron in the output layer: the well log sample. The number of neurons in the hidden layer is controlled by the "nodes in hidden layer" parameter.

MLFN can work in two modes: mapping or classification. In mapping mode, the actual value of the target log is predicted. In classification mode, an interval value is predicted. Let us assume a gamma ray log with readings from 10 API to 130 API. We may divide the log into three classes:

- class 1: 30 API (interval from 10 to 50 API, i.e., sand);
- class 2: 70 API (interval from 50 to 90 API, i.e., sand and shale);
- class 3: 110 API (interval from 90 to 130 API, i.e., shale).
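The gamma ray example can be written as a small lookup that maps a reading to its class number and class value (the interval midpoint). This is only an illustration of the class definitions, not of how EMERGE trains or applies the network. The treatment of boundary readings is an assumption: each interval is taken to include its lower limit, and the last interval also includes 130 API.

```python
# (lower limit, upper limit, lithology) for each class, in API units.
GAMMA_RAY_CLASSES = [
    (10.0, 50.0, "sand"),
    (50.0, 90.0, "sand and shale"),
    (90.0, 130.0, "shale"),
]

def classify_gamma_ray(api):
    """Return (class number, class value in API, lithology) for a reading."""
    last = len(GAMMA_RAY_CLASSES)
    for number, (low, high, label) in enumerate(GAMMA_RAY_CLASSES, start=1):
        # Assumed convention: half-open intervals, except the last is closed.
        if low <= api < high or (number == last and api == high):
            return number, (low + high) / 2.0, label
    raise ValueError("reading outside the 10 to 130 API range")

print(classify_gamma_ray(45.0))
print(classify_gamma_ray(70.0))
```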
Probabilistic neural network

The first step during probabilistic neural network training is to find a good single smoothing parameter (sigma) as a starting point for the conjugate gradient optimization. This is done by testing a number of smoothing parameters (the "number of sigmas" parameter) within a specified interval (the "sigma range" parameter). The best sigma is called the global sigma, and it is used as the starting point for the CJ algorithm, with the same value assigned to every independent variable. The CJ algorithm then finds the optimum set of sigmas, which is used for prediction.
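The scan for the global sigma can be sketched as a simple one-dimensional search. The linear spacing of the trial values and the function name `find_global_sigma` are assumptions, and `toy_error` is a made-up stand-in for the validation error of the network evaluated with a single shared sigma.

```python
def find_global_sigma(validation_error, sigma_min, sigma_max, n_sigmas):
    """Try n_sigmas evenly spaced values (n_sigmas >= 2) and keep the best one."""
    step = (sigma_max - sigma_min) / (n_sigmas - 1)
    candidates = [sigma_min + k * step for k in range(n_sigmas)]
    return min(candidates, key=validation_error)

def toy_error(sigma):
    """Stand-in for the validation error as a function of a single sigma."""
    return (sigma - 1.0) ** 2

# "Sigma range" 0.25 to 2.25 and "number of sigmas" 9 (illustrative values).
global_sigma = find_global_sigma(toy_error, 0.25, 2.25, 9)
print(global_sigma)
```

The value returned would then seed every smoothing parameter before the per-variable optimization starts.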

References

Haykin, S., 1994, Neural Networks: A Comprehensive Foundation, Prentice Hall.
Huang, Z., Shimeld, J., Williamson, M., and Katsube, J., 1996, Permeability prediction with artificial neural network modeling in the Venture gas field, offshore eastern Canada, Geophysics, vol. 61, p. 422.
Masters, T., 1993, Practical Neural Network Recipes in C++, Academic Press.
Masters, T., 1994, Signal and Image Processing with Neural Networks, John Wiley and Sons.
Masters, T., 1995, Advanced Algorithms for Neural Networks, John Wiley and Sons.
Press, W., Flannery, B., Teukolsky, S., and Vetterling, W., 1988, Numerical Recipes in C, Cambridge University Press.
Rumelhart, D., and McClelland, J., 1986, Parallel Distributed Processing, MIT Press.
Todorov, T., Stewart, R., Hampson, D., and Russell, B., 1998, Well log prediction using attributes from 3C-3D seismic data, Expanded Abstracts, 1998 SEG Annual Meeting.
