Published and copyrighted, 1999, Information and Software Technology 41 (2), pp. 109-119.
HEURISTIC PRINCIPLES FOR
THE DESIGN OF ARTIFICIAL NEURAL NETWORKS
Steven Walczak
University of Colorado at Denver
College of Business & Administration
Campus Box 165, PO Box 173364
Denver, CO 80217-3364 USA
[email protected]
(303) 556-6777
(303) 556-5899 fax

Narciso Cerpa
School of Information Systems
University of New South Wales
Sydney 2052
Australia
[email protected]
(61) 2 9385-4847
(61) 2 9662-4061 fax
Abstract
Artificial neural networks have been used to support applications across a
variety of business and scientific disciplines in recent years. Artificial neural
network applications are frequently viewed as black boxes which mystically
determine complex patterns in data. Contrary to this popular view, neural network
designers typically perform extensive knowledge engineering and incorporate a
significant amount of domain knowledge into artificial neural networks. This paper
details heuristics that utilize domain knowledge to produce an artificial neural
network with optimal output performance. The effect of using the heuristics on
neural network performance is illustrated by examining several applied artificial
neural network systems. Identification of an optimal performance artificial neural
network requires that a full factorial design with respect to the quantity of input
nodes, hidden nodes, hidden layers, and learning algorithm be performed. The
heuristic methods discussed in this paper produce optimal or near-optimal
performance artificial neural networks using only a fraction of the time needed for a
full factorial design.
Keywords: Artificial neural networks; Heuristics; Input vector; Hidden layer size; ANN
learning method; Design.
1. Introduction
Artificial neural network (ANN) applications have exploded onto the scene in the past several years and are continuing to be developed. Industrial applications exist in the financial, manufacturing, marketing, telecommunications, biomedical, and other domains [4,12,15,16,26,27,30,31,55,56]. While business managers are seeking to develop new applications
using ANNs, a basic misunderstanding of the source of intelligence in an ANN exists.
Furthermore, the development of new ANN applications is facilitated by the recent emergence of a
variety of neural network shells (e.g., Neuralworks Professional II Plus, @Brain, and Neuralyst)
which enable anyone to produce neural network systems by simply specifying the ANN architecture
and providing a set of training data to be used by the shell to train the ANN [32,41]. These shell-based neural networks may fail or produce sub-optimal results unless ANN designers in business and industrial domains acquire a deeper understanding of how to use and incorporate domain knowledge in the ANN.
The traditional view of an ANN is of a program that emulates biological neural networks
and "learns" to recognize patterns or categorize input data by being trained on a set of sample data
from the domain. Learning through training and subsequently the ability to generalize broad
categories from specific examples [19] is the unique perceived source of intelligence in an ANN.
However, experienced ANN application designers typically perform extensive knowledge
engineering and incorporate a significant amount of domain knowledge into the design of ANNs
even before the learning through training process has begun.
Design of optimal neural networks is problematic in that there exist a large number of
alternative ANN physical architectures and learning methods, all of which may be applied to a
given business problem. Artificial neural network designers must determine the set of design
criteria specified in Figure 1.
1. The appropriate input (independent) variables.
2. The best learning method. Learning methods may be classified as either supervised or unsupervised; within each of these broad classes exist numerous alternatives, each of which works optimally on different distributions or types of data.
3. The number of hidden layers (depending on the selected learning method).
4. The quantity of processing elements (nodes) per hidden layer, chosen to avoid both under-fitting and over-fitting the training data, either of which produces poor generalization (out-of-sample) performance.

Figure 1. Set of Design Criteria
Each individual choice for these four design variables will affect the performance of the resulting
ANN on out-of-sample data. Improper selection of the values for these four design factors may
produce ANNs that perform worse than random selection of an output (dependent) value.
In this paper, a heuristic approach for incorporating knowledge into an ANN and using
domain knowledge to design optimal ANNs is examined. The heuristic ANN design approach is
made up of the following steps: knowledge-based selection of input values, selection of a learning
method, and design of the hidden layers (quantity and nodes per layer). The majority of the ANN
design steps described will focus on feedforward supervised learning (and more specifically
backpropagation) ANN systems. A discussion of other learning methods is presented in Section 4.
Following these heuristic methods will enable businesses to take advantage of the power of ANNs
and will afford economic benefit by producing an ANN that outperforms similar ANNs with
improperly specified design parameters.
2. Background
Artificial neural networks have been applied to a wide variety of business problems
[1,5,10,12,26,30,41,55], especially in accounting [28,40] and finance [23,51]. Various research
results have shown that ANNs outperform traditional statistical techniques (e.g., regression or logit)
[1,10,40] as well as other standard machine learning techniques (e.g., the ID3 algorithm) [40] for a
large class of problem types.
The ANN learning algorithm called backpropagation is the most popular design choice for
implementing neural networks [8,9,14]. The popularity of backpropagation is due in part to its
wide availability and support in commercial neural network shells [32,41] and to the robustness of
the paradigm.
Hornik et al. [21] have shown that backpropagation ANNs are universal
approximators, able to learn arbitrary category mappings. Supporting this research finding, various
other researchers have shown the superiority of backpropagation ANNs to different ANN learning
paradigms including: radial basis function (RBF) [2], counterpropagation [11], and adaptive
resonance theory [5]. Cherkassky and Lari-Najafi [8] claim that an ANN's performance is more
dependent on data representation than on the selection of a learning rule. Learning rules other than
backpropagation perform well if the data from the domain has specific properties. The functionality and prerequisite properties of several popular ANN learning rules are presented in
Section 4. Readers interested in the mathematical specifications of the various ANN learning
methods are directed to the references [9,14,17,19,21,54].
The generalization performance of supervised learning artificial neural networks (e.g.,
backpropagation) generally improves when the network size is minimized [19,33] with respect to
the weighted connections between processing nodes (elements of the input, hidden, and output
layers). Networks that are too large tend to over-fit or memorize the input data [2]. Conversely,
ANNs with too few weighted connections do not contain enough processing elements to correctly
model the input data set, under-fitting the data. Both of these situations result in poor out-of-sample generalization.
3. Input Variable Selection
The place to begin using and implementing domain knowledge in an ANN is the input
vector. The old adage "garbage in, garbage out" applies to ANNs and ANN designers must spend a
significant amount of time performing the task of knowledge acquisition. Selection of input
variables is an important and complex task for the ANN designer [44]. Pakath and Zaveri [38]
claim that ANNs as well as other artificial intelligence (AI) techniques are highly dependent on the
specification of input variables. However, input variables are routinely misspecified by ANN
developers.
Input variable misspecification occurs because ANN designers follow the expert system
methodology of incorporating as much domain knowledge as possible into an intelligent system
[29]. Hertz et al. [19] specifically state that ANN performance improves as additional domain
knowledge is provided through the input variables. This is certainly true to the extent that if a
sufficient amount of information representing critical decision criteria is not given to an ANN, then
the ANN (or any other modeling technique) cannot develop a correct model of the domain. The
common belief is that since ANNs learn, they will be able to determine those input variables that
are important and develop a corresponding model through the modification of the weights
associated with the connections between the input layer and the hidden layers [9].
However, Smith [43] and others [38,44,46,53] claim that noise input variables produce poor
generalization performance in ANNs. The presence of too many input variables causes poor
generalization when the ANN not only models the true predictors, but includes the noise variables
in the model. Piramuthu et al. [40] state that the interaction between input variables produces
critical differences in output values, further obscuring the ideal problem model when unnecessary
variables are included in the set of input values.
Degraded performance of an ANN is not the only cost of including too many input
variables. Bansal et al. [1] have noted that data (training, test, and standard use) represents an
important and recurring cost for information systems in general and ANNs in particular.
3.1. Heuristic Determination of Input Variables
As indicated above and shown in the following sections, both under and over specification
of input variables produces sub-optimal performance. No formal methods currently exist for
selecting input (independent) variables for an ANN solution to a domain problem.
The first step in determining the optimal set of input variables is to perform standard
knowledge acquisition.
Typically, this involves consultation with multiple domain experts.
Various researchers [25,51,52] have indicated the requirement for extensive knowledge acquisition
utilizing domain experts to specify ANN input variables. The primary purpose of the knowledge
acquisition phase is to guarantee that the input variable set is not under-specified, providing all
relevant domain criteria to the ANN.
Once a base set of input variables is defined through knowledge acquisition, the set can be
pruned to eliminate variables that contribute noise to the ANN and consequently reduce the ANN
generalization performance. Smith [43] claims that ANN input variables need to be predictive, but
should not be correlated. Correlated variables degrade ANN performance by interacting with each
other as well as other elements to produce a biased effect. A first pass filter to help identify “noise”
variables is to calculate the correlation of pairs of variables (Pearson correlation matrix).
Alternatively, a chi-square test may be used for categorical variables. If two variables have a high
correlation, then one of these two variables may be removed from the set of variables without
adversely affecting the ANN performance. The cutoff value for variable elimination is a heuristic
value and must be determined separately for every ANN application, but any correlation absolute
value of 0.20 or higher indicates a probable noise source to the ANN.
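As an illustrative sketch of this first-pass filter (assuming the candidate variables are held in a pandas DataFrame; the function name, default cutoff argument, and library choice are assumptions of the sketch rather than part of the original study), the pruning step might look like:

    import pandas as pd

    def prune_correlated_inputs(data: pd.DataFrame, cutoff: float = 0.20) -> list:
        """Return a reduced list of input variables, dropping one member of every
        pair whose absolute Pearson correlation meets or exceeds the cutoff.
        The 0.20 cutoff is the heuristic value suggested in the text and should
        be tuned separately for each ANN application."""
        corr = data.corr(method='pearson').abs()   # pairwise Pearson correlations
        keep = []
        for col in data.columns:
            # keep `col` only if it is not highly correlated with a variable already kept
            if all(corr.loc[col, kept] < cutoff for kept in keep):
                keep.append(col)
        return keep

    # Hypothetical usage: `candidates` holds one column per candidate input variable.
    # reduced_inputs = prune_correlated_inputs(candidates, cutoff=0.20)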
Additional statistical techniques may be applied, depending on the distribution properties of
the data set. Stepwise regression (multiple or logistic) and factor analysis provide viable tools for
evaluating the predictive value of input variables [23,34] and may serve as a secondary filter to the
Pearson correlation matrix.
Multiple regression and factor analysis perform best with normally
distributed linear data, while logistic regression assumes a curvilinear relationship [34].
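A secondary, predictiveness-based filter can be approximated with a forward selection procedure; the sketch below uses scikit-learn's SequentialFeatureSelector with a linear regression scorer as a stand-in for classical stepwise regression (the function name, the choice of scorer, and the fixed number of retained variables are illustrative assumptions):

    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LinearRegression

    def forward_select(X, y, n_keep):
        """Forward selection of input variables with a linear regression scorer.
        X is a 2-D array of candidate inputs that survived the correlation filter;
        y is the output (dependent) variable."""
        selector = SequentialFeatureSelector(
            LinearRegression(), n_features_to_select=n_keep, direction='forward')
        selector.fit(X, y)
        return selector.get_support(indices=True)   # indices of retained variables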
3.2. Effect of Variable Set on ANN Performance
Although efficient design of the input variable set has not been a previous research topic,
several researchers have shown that smaller input variable sets can produce better generalization
performance by an ANN. Lenard et al. [28], in developing a neural network model to predict an auditor’s going-concern decision for a corporation, reduced their input set size by 50 percent, going from eight to four input variables, with a corresponding improvement in performance of 1.2 to 9 percent.
The variable reduction performed in Lenard et al.'s research is accomplished by selecting only a
single variable for representing each of liquidity, solvency, and profitability instead of the two to
three variables used for each measure in the original ANN design. Lenard et al.’s variable
reduction successfully eliminates highly correlated variables.
Jain and Nag [23] however, when developing a neural network model to predict the value of
initial public stock offerings, indicate that reducing their input variable set from eleven to four
variables did not result in an improvement in the ANN's performance, but actually decreased the
performance level. Further analysis of the Jain and Nag research shows that a four-input-variable set tested on a single ANN hidden layer architecture performed similarly to two of the five different architectures used to evaluate the eleven-variable model. Additionally, unlike the Lenard research,
variables are eliminated in Jain's research based on a statistical measure of predictiveness instead of
eliminating correlated variables.
Several of the variables used in this research have high
correlations, such as a correlation of 0.75 between two variables used in both the eleven and four
variable models. Use of the recommended correlated variable elimination heuristic may have
produced an ANN with better or at least equal performance to the larger ANN model, with a
corresponding economic savings in the data costs for the network.
Tahai et al. [46] report on both the under and over specification effects of input variables on
an ANN designed to assist new ambulance medical personnel in determining the need for
application of cardio-pulmonary resuscitation. When only a partial set of the necessary input
variable set determined from knowledge acquisition interviews was used, the ANN was incapable
of learning when to advise (predict) the application of artificial respiration. When all of the
specified variables were used, the ANN performed with 100 percent accuracy. When two additional input variables were added to the ANN input vector (these new variables were all acquired by ambulance technicians as part of their routine patient evaluation), the performance of the ANN dropped by 7 percent.
Another explicit example of the effect of both under and over specification of the input
variable set is derived by extending the research of Walczak [51]. The purpose of the ANNs in this
research is to forecast the change to the future cable-rate values for exchanging US dollars to either
British pounds or Japanese yen.
The variable set determined by Walczak (using traditional
knowledge acquisition techniques, such as interviewing domain experts) to produce the best
forecast values is the one-day lags of the ten cross-rate values among the dollar, pound, yen, mark, and Swiss franc, and nine rates of return on international bonds (three each for Japan, Britain, and Germany), resulting in nineteen input variables. New ANNs are designed to attempt to reduce the
nineteen input variables by eliminating either the international bond variables or the cross rate
variables. Next, a reduced set is derived by eliminating all variables that are not directly related to
the British pound. Finally, other financial indicators (the S&P 500 and the CRSP indexes, and oil
and non-ferrous metals indexes) are added to the original set of input variables to provide related
noise variables.
A partial Pearson correlation matrix of the input variables is given in Table 1 (correlations of .20 or higher are marked with an asterisk). Although all of the cable-rate cross rates have correlations greater than .20 with one another, this information redundancy is unavoidable: each of these variables is composed of two pieces of information (the two currencies being traded), and the dollar is a common element of each of these cross rates. As a further illustration of the required correlations, the dollar/pound cross rate has a correlation value of -.05 with the franc/mark cross rate, and the dollar/yen has a correlation of 0.001 with the pound/mark cross rate.

                     $/P     $/Y     $/M    Brt.bd  Jap.bd  Ger.bd   S&P    CRSP    Oil   Non-Fe
$/Pound               1      .30*    .80*    .19    -.08     .14     .20*    .15     .25*    .15
$/Yen                 .30*   1       .37*   -.06    -.02     .10    -.11    -.06    -.06     .08
$/Mark                .80*   .37*    1       .24*   -.07     .20*    .13     .12     .15     .19
Brt. 1-month bond     .19   -.06     .24*   1        .02     .48*    .06     .25*    .04     .23*
Jap. 1-month bond    -.08   -.02    -.07     .02    1        .02    -.06    -.05    -.05    -.08
Ger. 1-month bond     .14    .10     .20*    .48*    .02    1        .12     .33*    .07     .16
S&P                   .20*  -.11     .13     .06    -.06     .12    1        .56*    .27*    .09
CRSP                  .15   -.06     .12     .25*   -.05     .33*    .56*   1        .15     .10
Oil                   .25*  -.06     .15     .04    -.05     .07     .27*    .15    1        .12
Non-Ferrous           .15    .08     .19     .23*   -.08     .16     .09     .10     .12    1

Table 1: Pearson correlation matrix for variables in the cable-rate forecast ANN.
Various hidden layer architectures are implemented for each of the input variable sets just
described. All hidden layer architectures consist of two layers following the design of [51]. The
various hidden node quantities are used to reduce the opportunity for introducing non-input-variable-dependent error, such as under-fitting or over-fitting the data [2]. Backpropagation is the
learning method for all of the ANNs. The training data for all ANNs consists of the corresponding
values from January 2, 1993 until October 31, 1993 (143 cases) and the test data consists of the
corresponding values from November 1, 1993 until December 31, 1993 (37 cases). Each ANN is
trained until the in-sample MSE is less than .05 or the MSE does not change over the presentation
of 2000 training examples. Forecasting results of the various networks are displayed in Table 2 as
(1-MAPE), where a value of 1.0 indicates perfect forecasting.
Following the results of the
November to December 1993 test set, an additional test set covering the period from January 1994
through April 1994 was also tested with the original ANN and produced nearly identical results.
The extension of the test time period serves to demonstrate the robustness of the ANN forecasting
model.
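The stopping rule just described (train until the in-sample MSE falls below .05 or fails to change over the presentation of 2000 training examples) can be sketched as follows; present_example and in_sample_mse are hypothetical callbacks standing in for whatever backpropagation implementation or shell is actually used:

    def train_until_converged(network, training_set, present_example, in_sample_mse,
                              mse_target=0.05, patience=2000, eps=1e-6):
        """Present training examples until the in-sample MSE drops below mse_target
        or shows no measurable change over `patience` consecutive presentations."""
        unchanged = 0
        previous_mse = in_sample_mse(network, training_set)
        while previous_mse >= mse_target and unchanged < patience:
            for example in training_set:
                present_example(network, example)            # one backpropagation update
                current_mse = in_sample_mse(network, training_set)
                unchanged = unchanged + 1 if abs(current_mse - previous_mse) < eps else 0
                previous_mse = current_mse
                if previous_mse < mse_target or unchanged >= patience:
                    break
        return network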
Input variables [quantity]                    Best Performance                       Average Performance of All Networks Tested
                                              Pound   Yen    [nodes per layer]       Pound   Yen
Cross Rates (CR) and Bond Returns (BR) [19]   .5676†  .5405  [_,18,3,2]              .5541†  .4956  (8 ANNs)
CR only [10]                                  .4762   .5135  [_,9,6,2] [_,6,3,2]     .4445   .4886  (3 ANNs)
BR only [9]                                   .3810   .5135  [_,6,3,2] [_,9,6,2]     .3333   .4728  (3 ANNs)
CR and BR for Pound only [7]                  .5135   N/A    [_,6,3,1]               .5135   N/A    (1 ANN)
CR and BR plus S&P 500 & CRSP [21]            .5405   .4865  [_,18,6,2]              .5169   .4662  (8 ANNs)
CR and BR plus Oil and Non-Ferrous [21]       .5405   .4595  [_,18,3,2]              .5067   .4122  (8 ANNs)

† statistically significant at the .05 level

Table 2: Forecasting performance for the various input variable sets.
An MAPE value of .50 represents chance, so the ideal networks will have a 1 - MAPE value
greater than .50. Table 2 indicates that the combination of cross rates and bond return values
(Model 1) produces the optimal forecasting results for all of the ANNs with different input vectors.
In fact, for the dollar/pound cable rate forecasts, none of the other ANN variable sets surpasses the
average performance of all Model 1 ANN architectures evaluated. The test set for forecasting the
pound only using Model 1 is extended to April 30, 1994, bringing the total test cases to 120, and
produces similar results. The similarity of the forecasting performance for the British pound on the
second test set (using the same training set), demonstrates the robustness of the ANN for producing
out-of-sample forecasts.
The same cannot be said of the dollar/yen forecasts, shown in Table 2. However, the best
and average ANN forecasting performance for the yen, of the Model 1 ANNs, consistently
outperforms the corresponding measurement for all other variable set models.
As noted above, high correlation values of variables that share a common element need to
be disregarded. Identification of the single element components of the ten cross rate variables that
are correlated requires a comparison of similar correlation values. In the full Pearson matrix, the
correlation values for the Swiss franc and German mark are highly similar. The franc-yen and
mark-yen cross rates have an average difference in correlation values of .013 when correlated
against identical variables and a similar result occurs for the pound-franc and pound-mark variables.
The only other two variables with this type of similarity in the full correlation matrix are the non-ferrous metals index and the CRSP index; however, these two variables do not appear together in
any of the ANN models. Removal of the three Swiss franc variables (cross rates against the dollar,
pound, and yen) yields ANNs with identical best case performances to the Model 1 ANNs.
Elimination of the Swiss franc values does not improve the ANN forecasting performance, but
instead yields the secondary benefits of reduced data acquisition costs (by eliminating the need for
the four Swiss franc variables) and a reduction in the complexity of the ANN model (producing
shorter training times as discussed in Section 5).
In this section, research examples are provided from accounting, finance, and medicine
which indicate that many ANN systems can be improved through variable reduction. Smaller input
variable sets frequently improve the ANN generalization performance and reduce the net cost of
data acquisition for development and usage of the ANN. However, care must be taken when
removing variables from the ANN's input set to ensure that a complete set of non-correlated
predictor variables is available for the ANN, otherwise the reduced variable sets may worsen
generalization performance.
4. ANN Learning Method Selection
After determining a heuristically optimal set of input variables using the methods from the
previous section, a learning method must be selected. Learning methods can be divided into two
distinct categories: unsupervised learning and supervised learning. Both types of ANN learning
require a collection of training examples that enable the ANN to model the data set and produce
accurate output values. Unsupervised learning systems, such as adaptive resonance theory (ART) [6], self-organizing map (SOM) [24], or Hopfield [20] networks, do not require that the output value for a training sample be provided at the time of training. Supervised learning systems, such as backpropagation (multi-layer perceptron), radial basis function (RBF) [35], counterpropagation [18], or fuzzy ARTMAP [7] (a supervised learning extension of the ART learning method) networks, require that a known output value be provided to the ANN for every training sample.
Unsupervised learning methods determine output values (classifications) directly from the
input variable data set. Because the "answers" must be directly learnable from or contained within
the input values, most unsupervised learning methods have less computational complexity and less
generalization accuracy than supervised methods [14]. Therefore, unsupervised learning techniques
are typically used for classification problems, where the desired classes are self-descriptive. ART
networks are a favorite technique for object recognition in pictorial or graphical data. The
advantage of using unsupervised learning methods is that these ANNs can be designed to learn
much more rapidly than supervised learning systems [14,26].
As discussed in the Background Section, backpropagation is the most common learning
method for ANNs and because of its generality (robustness) and ease of implementation [32] it is
the best choice for a majority of ANN systems. Backpropagation is the superior learning method
when a sufficient number of noise/error free training examples exist, regardless of the complexity
of the specific domain problem. Although backpropagation networks can handle noise in the
training data (and may actually generalize better if some noise is present in the training data), too
many erroneous training values may prevent the ANN from learning the desired model.
A heuristic lower bound on the number of training examples required to train a
backpropagation ANN is four times the number of weighted connections contained in the network.
Therefore, if a training database contains only 100 training examples, the maximum size of the
ANN is 25 connections or approximately (depending on the ANN architecture) ten nodes. While
the general heuristic of four times the number of connections is applicable to most categorization
problems, time series problems, including the prediction of financial time series (e.g., stocks or
commodities values), are more dependent on “business cycles”. Walczak’s [49] research claims
that only a maximum of one or two years of data is required to produce optimal forecasting results
for neural networks performing financial time series prediction. Wilson and Sharda [56] make a
similar claim, stating that the neural network training set should be representative of the population at large. Therefore, the training set size of 10 months of data used in this article's ongoing example to predict foreign exchange rates is very near to the optimal value specified by Walczak.
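A minimal sketch of this training-set size heuristic, assuming a fully connected feedforward architecture described by a tuple of layer sizes (the function names are illustrative), is given below:

    def weighted_connections(layer_sizes):
        """Count the weighted connections in a fully connected feedforward ANN,
        e.g. layer_sizes = (19, 18, 6, 2) for input, two hidden layers, and output."""
        return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

    def minimum_training_examples(layer_sizes, factor=4):
        """Heuristic lower bound from the text: four times the number of weighted
        connections (for categorization problems; financial time series are instead
        governed by one to two years of business-cycle data)."""
        return factor * weighted_connections(layer_sizes)

    # Example: a 19-18-6-2 network has 19*18 + 18*6 + 6*2 = 462 connections,
    # so the heuristic calls for at least 1848 training examples.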
For ANN applications that provide only a few training examples or very noisy training data,
other supervised learning methods should be selected. RBF networks perform well in domains with
limited training sets [2] and counterpropagation networks perform well when a sufficient number of
training examples is available, but may contain very noisy data [11]. Walczak [48] performs an
analysis of six different neural network supervised learning methods on the resource allocation
problem of assigning workers to perform mission critical tasks. Walczak concludes that for the
specific domain problem being addressed, backpropagation produced the best results (although the
first appearance of the problem indicated that counterpropagation might outperform
backpropagation due to anticipated noise in the training data set). Hence, although properties of the
data population may strongly indicate a preference for a particular training method, the strength of the backpropagation network [21,54] means that this type of learning method should always be tried in addition to any other methods indicated by the domain data tendencies.
The dollar to pound and dollar to yen cable rate forecasting ANN described in Section 3 is
implemented using backpropagation. The domain for the cable rate forecasting ANN has a large
collection of relatively error free historical examples with known outcomes, suiting it for
backpropagation ANN implementations. A comparison of different ANN learning models is
performed using the training and test data from the Model 1 (CR and BR variables) ANN to
illustrate the correctness of the learning method choice. An ART ANN and an RBF ANN are both
implemented and, to the extent possible given the hidden layer constraints imposed by the ART and RBF learning methods, the ANN hidden layer architecture is duplicated.
The results of the
backpropagation ANN for the original Model 1 input set and the best results for the two new
learning method ANNs are shown in Table 3. Both the ART and RBF ANNs have worse
performance than the backpropagation ANN performance for this specific domain problem.
ANN Learning Method                  Forecast Accuracy   Forecast Accuracy
                                     Dollar/Pound        Dollar/Yen
Backpropagation                      .5676               .5405
Adaptive Resonance Theory (ART)      .4054               .4595
Radial Basis Function (RBF)          .3784               .4595

Table 3: ANNs with different learning methods for predicting cable rates.
Many other ANN learning methods exist and each is subject to constraints on the type of
data that are best processed by that specific learning method. For example, general regression
neural networks [45] are capable of solving any problem that can also be solved by a statistical
regression model, but do not require that a specific model type (e.g., multiple linear or logistic) be
specified in advance. However, regression ANNs suffer from the same constraints as regression
models [14], such as the linear or curvilinear relationship of the data with heteroscedastic error [34].
Likewise, learning vector quantization (LVQ) networks [24] try to divide input values into disjoint
categories similar to discriminant analysis and consequently have the same data distribution
requirements as discriminant analysis. The resource/employee allocation example described earlier
[48] indicated that LVQ neural networks produced the second-best allocation results, which revealed the previously unrecognized insight that the categories used for allocating employee resources were distinct (unlike the prior assumption that the two job categories were correlated).
The selection of a learning method is an open problem and ANN designers must use the
constraints of the training data set for determining the optimal learning method. If a reasonably
large quantity of relatively noise-free training examples is available, then backpropagation
provides an effective learning method which is relatively easy to implement.
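As a hedged illustration of how such a backpropagation network might be implemented with a modern library, the following sketch trains a multi-layer perceptron with scikit-learn; the (18, 6) hidden layer shape echoes one of the Model 1 architectures from Section 3, while the library, parameter values, and placeholder data are assumptions of the sketch rather than the implementation used in the study:

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # X: rows of the 19 input variables (cross rates and bond returns);
    # y: next-day cable-rate changes.  Random placeholders are used for illustration.
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(143, 19)), rng.normal(size=(143, 2))

    # Multi-layer perceptron trained with stochastic gradient descent (backpropagation).
    model = MLPRegressor(hidden_layer_sizes=(18, 6), activation='logistic',
                         solver='sgd', learning_rate_init=0.01,
                         max_iter=2000, tol=1e-4, random_state=0)
    model.fit(X, y)
    forecasts = model.predict(X[:5])   # in-sample forecasts for the first five days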
5. Design of Hidden Layers
The architecture of an ANN consists of the number of layers of processing elements (or
nodes), including the input, output, and any hidden layers, and the quantity of nodes contained in each
layer. Heuristic design of the input vector is discussed in Section 3 and the output vector is
normally predefined by the problem to be solved with the ANN. Design of hidden layers is
dependent on the selected learning algorithm. For example, unsupervised learning methods such as
ART normally require a first hidden layer quantity of nodes equal to the size of the input layer.
Supervised learning systems are generally more flexible in the design of hidden layers. The
remaining discussion focuses on backpropagation ANN systems or other similar supervised
learning ANNs. The primary questions for the ANN designer concerning hidden layers are:
• how many hidden layers should exist in the ANN architecture, and
• how many nodes should be present in the hidden layer(s)?
5.1. Number of Hidden Layers
While it is possible to design an ANN with no hidden layers, these types of ANNs can only
classify input data that is linearly separable [17], which severely limits their application. Artificial
neural networks that contain hidden layers have the ability to deal robustly with nonlinear and
complex problems and therefore can operate on more interesting problems [32]. The quantity of
hidden layers corresponds to the complexity of the domain problem to be solved. Single hidden
layer ANNs create a hyperplane. Two hidden layer networks combine hyperplanes to form convex
decision areas and three hidden layer ANNs combine convex decision areas to form convex
decision areas that contain concave regions [14]. The convexity or concavity of a decision region
corresponds roughly to the number of unique inferences or abstractions that are performed on the
input variables to produce the desired output result.
Increasing the number of hidden unit layers enables a trade-off between smoothness and
closeness-of-fit [2]. A greater quantity of hidden layers enables an ANN to improve its closeness-of-fit, while a smaller quantity improves the smoothness or extrapolation capabilities of the ANN.
Several researchers have indicated that a single hidden layer architecture with an arbitrarily large
quantity of hidden nodes in the single layer is capable of modeling any categorization mapping
[21,54], while others [19,51] have demonstrated that two hidden layer networks outperform their
single hidden layer counterparts for specific problems. A heuristic for determining the quantity of
hidden layers required by an ANN is: “As the dimensionality of the problem space increases (higher
order problems), the number of hidden layers should increase correspondingly”. The number of
hidden layers is heuristically set by determining the number of intermediate steps, dependent on
previous categorizations, to translate the input variables into an output value. Therefore, domain
problems that have a standard non-linear equation solution are solvable by a single hidden layer
ANN [1].
5.2. Quantity of Nodes Per Hidden Layer
A trade-off between training time and training accuracy exists when choosing the number of nodes to be contained in a hidden layer. A greater number of hidden unit nodes results
in a longer training period, while fewer hidden units provide faster training at the cost of having
fewer feature detectors [9]. Too many hidden nodes in an ANN enable the ANN to memorize
(over-fit) the training data set, which produces poor generalization performance [22]. Several
quantitative heuristics exist for selecting the quantity of hidden nodes for an ANN such as: using 75
percent of the quantity of input nodes [23,28], using 50 percent of the quantity of input and output
nodes [40], or using 2n + 1 hidden layer nodes where n is the number of nodes in the input layer
[13,39]. These algorithmic heuristics do not utilize domain knowledge for estimating the quantity
of hidden nodes and may be counterproductive.
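For comparison, these purely quantitative heuristics can be computed directly (the function name is illustrative):

    def hidden_node_heuristics(n_inputs, n_outputs):
        """Rule-of-thumb hidden node counts cited in the text; none of them uses
        domain knowledge, which is why the decision-factor approach is preferred."""
        return {
            '75% of input nodes [23,28]': round(0.75 * n_inputs),
            '50% of input + output nodes [40]': round(0.50 * (n_inputs + n_outputs)),
            '2n + 1 input-layer nodes [13,39]': 2 * n_inputs + 1,
        }

    # For the 19-input, 2-output cable-rate model these rules give 14, 10, and 39
    # hidden nodes, respectively.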
Jain and Nag [23] use the 75 percent rule for their initial public offering pricing ANN to show that an eleven-input-variable ANN with six or seven hidden nodes produces the best results
on the training examples, but an ANN with twelve hidden nodes produces the best generalization
result. As with the knowledge acquisition and elimination of correlated variables heuristic for
defining the optimal input node set, the number of decision factors (DF) heuristically determines
the optimal number of hidden units for an ANN. Knowledge acquisition or existing knowledge
bases may be used to determine the DF for a particular domain and consequently the hidden layer
architecture and optimal quantity of hidden nodes [1,14,47]. Decision factors are the separable
elements which serve to form the unique categories of the input vector space. The DF can be
equated to the collection of heuristic production rules used in an expert system.
The NETTalk neural network research [42] provides an illustration of the DF design
principle. NETTalk has 203 input nodes representing seven textual characters and 33 output units
representing the phonetic notation of the spoken text words. Hidden units are varied from zero to
120. The researchers claim that improved output accuracy is obtained as the number of hidden
units is increased from zero to 120, but only a minimal improvement in the output accuracy is
observed between 60 and 120 hidden units. This indicates that the quantity of DF for the NETTalk
problem was close to 60 and adding hidden units beyond 60 served to increase the training time
while not providing any appreciable difference in the ANN's performance.
Other research provides additional examples to illustrate that understanding the DF for a
domain enables efficient design of hidden layers. Patuwo et al. [39] use various quantities of
hidden nodes (3,6,8) in an ANN that solves a two-group classification problem given two input
variables. The DF for this problem can be inferred to be two. While Patuwo et al. did not use a
hidden layer with two nodes, the best generalization performance is reported to come from the
ANN with three hidden nodes. A similar two-group classification problem, with an inferred DF of
two, is addressed by Hung et al. [22] and they report that an ANN with two hidden nodes produces
the best generalization.
The two hidden node architecture used by Hung et al. produced a
performance accuracy two to three times better than ANNs with three to six hidden nodes.
Another example is provided by Fletcher and Goss [13] who vary the quantity of hidden
nodes from three to six, in a neural network to predict bankruptcy. Their results indicate that ANN
performance improves by more than two percent from three to four hidden nodes and then falls dramatically (a 9.8 percent decrease) as additional hidden nodes are added. The
additional nodes beyond four enabled Fletcher and Goss’s ANN to memorize the training data set,
by providing a one-to-one mapping between input values and the corresponding output values,
which decreased the generalization performance of the ANN.
In each of the examples discussed above, the ANNs performed poorly until a sufficient
number of hidden units were available to represent the correlations between the input vector and the
desired output values; increasing the number of hidden units beyond the sufficient number
served to increase training time without a corresponding increase in output accuracy. Knowledge
acquisition is necessary to determine the optimal input variable set to be used in an ANN system.
During the knowledge acquisition phase, additional knowledge engineering can be performed to
determine the DF and subsequently the minimum number of hidden units required by the ANN
architecture. The ANN designer must acquire the heuristic rules or clustering methods used by
domain experts, similar to the knowledge that must be acquired during the knowledge acquisition
process for expert systems [50]. The number of heuristic rules or clusters used by domain experts
is equivalent to the DF used in the domain.
For example, the Model 1 ANN for predicting currency exchange cable rates described in
Section 3 has three DF: the anticipated change to the exchange rate, anticipated changes in relative
interest rates, and anticipated action by the Federal Reserve or other government entity [51]. The
DF indicate the quantity of nodes for the final hidden layer (layer 2 in this case). Because the
Model 1 cable rate forecasting ANN predicts two cable rate values, the optimal quantity of hidden
nodes is 3 or 6. Additionally, the decision making process in forecasting cable rate changes is a
multi-step process indicating the need for two (or three) hidden layers. The quantity of hidden
nodes required in the first layer is heuristically estimated to be from 16 to 19 (16 if the Swiss franc
and German mark are in fact correlated). The Model 1 ANN architectures briefly presented in
Section 3 are extended to include models with 9 to 21 nodes in the first hidden layer
and 3 to (n - 3) in the second layer (except for the 21 node ANN that was stopped after 12 nodes in
the second layer), where n is the quantity of hidden nodes in the first layer. Results for these ANN
architectures for predicting the dollar/pound cable rate only, shown in Table 4, indicate that 15 to
18 first layer nodes and 6 second layer nodes, corresponding to the DF (3 DF * 2 output values)
produce the best forecast accuracies. The poor performance of the ANN architectures with 21
nodes in the first layer is caused by over-fitting (memorizing) the data.
Recent research [3,36] has demonstrated techniques for automatically producing an ANN
architecture with the exact number of hidden units required to model the DF for the problem space.
The general philosophy of these automatic methods is to initially create a neural network
architecture with a very small or very large number of hidden units, train the network for some predetermined number of epochs and evaluate the error of the output nodes [14,47]. If the error
exceeds a threshold value then a hidden unit is added or deleted respectively and the process is
repeated until the error term is less than the threshold value. Another method to automatically
determine the optimum architecture is to use genetic algorithms to generate multiple ANN
architectures and select the architectures with the best performance [37]. Determining the optimum
number of hidden units for an ANN application is a very complex problem [36] and a method for
automatically determining the DF quantity of hidden units without performing the corresponding
knowledge acquisition remains a current research topic.
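A minimal sketch of the grow-until-the-error-threshold-is-met variant described above follows; build_and_train and validation_error are hypothetical helpers standing in for the chosen training routine and error measure, and the starting size, unit budget, and threshold are illustrative:

    def grow_hidden_units(build_and_train, validation_error, data,
                          start_units=1, max_units=64, error_threshold=0.05):
        """Start with a very small hidden layer and add one hidden unit at a time,
        retraining for a fixed number of epochs each pass, until the output error
        falls below the threshold (or the unit budget is exhausted)."""
        units = start_units
        network = build_and_train(hidden_units=units, data=data)
        while validation_error(network, data) > error_threshold and units < max_units:
            units += 1                                   # add one hidden unit and retrain
            network = build_and_train(hidden_units=units, data=data)
        return network, units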
Hidden Units   Hidden Units   Forecast Accuracy
First Layer    Second Layer   Dollar/Pound
     9              3             0.5135
     9              6             0.5405
    12              3             0.5405
    12              6             0.5405
    12              9             0.5405
    15              3             0.5405
    15              6             0.5676
    15              9             0.5405
    15             12             0.5405
    18              3             0.5676
    18              6             0.5676
    18              9             0.5405
    18             12             0.5405
    18             15             0.5405
    21              3             0.4867
    21              6             0.5135
    21              9             0.5135
    21             12             0.4324

Table 4: Effect of hidden units on Model 1 forecast accuracy.
In this section, the heuristic design principle of acquiring decision factors to determine the
quantity of hidden nodes and the configuration of hidden layers has been presented. A quantity of
hidden nodes equal to the quantity of the DF is required by an ANN to perform robustly in a
domain and produce accurate results. This concept is similar to the principle of a minimum size
input vector determined through knowledge acquisition presented in Section 3. The knowledge
acquisition process for ANN designers must acquire the heuristic decision rules or clustering
methods of domain experts. The DF for a domain are equivalent to the heuristic decision rules used
by domain experts. Further analysis of the DF to determine the dimensionality of the problem
space enables the knowledge engineer to configure the hidden nodes into the optimal number of
hidden layers for efficient modeling of the problem space.
6. Conclusions
General guidelines for the development of artificial neural networks (ANNs) are few, so this
paper presents several heuristics for developing ANNs that produce optimal generalization
performance, as outlined in Figure 2. Extensive knowledge acquisition is the key to the design of
ANNs. First, the correct input vector for the ANN must be determined by capturing all relevant
decision criteria used by domain experts for solving the domain problem to be modeled by the
ANN and eliminating correlated variables. Next, the selection of a learning method is an open
problem and an appropriate learning method can be selected by examining the set of constraints
imposed by the collection of available training examples for training the ANN. Finally, the
architecture of the hidden layers is determined by further analyzing a domain expert's clustering of
the input variables or heuristic rules for producing an output value from the input variables. The
collection of clustering/decision heuristics used by the domain expert has been called the set of
decision factors (DF). The quantity of DF is equivalent to the minimum number of hidden units
required by an ANN to correctly represent the problem space of the domain.
1. Perform extensive knowledge acquisition. This knowledge acquisition should be targeted at identifying the necessary domain criteria (information) required for solving the problem and identifying the decision factors that are used by domain experts for solving the type of problem to be modeled by the ANN.
2. Once the relevant domain information/criteria are identified as potential input values to the ANN, remove noise variables.
   A. Identify highly correlated variables via a Pearson correlation matrix or chi-square test (and keep only one variable from each correlated pair).
   B. Identify and remove non-contributing variables (depending on data distribution and type) via discriminant/factor analysis or step-wise regression.
3. Analyze the demographic features of the data and decision problem to select an ANN learning method. If supervised learning methods are applicable, then implement backpropagation in addition to any other method indicated by the data demographics (i.e., radial basis function for small training sets or counterpropagation for very noisy training data).
4. Make sure that an adequate quantity of training data is available for the selected method. Recall that time-series models differ from standard categorization models (1 to 2 years of data vs. 4 times the number of weighted connections).
5. Analyze the complexity (number of unique steps) of the traditional expert decision-making solution to determine the number of hidden layers. If in doubt, then use a single hidden layer, but realize that additional nodes (processing elements) may be required to adequately model the domain problem.
6. From the knowledge acquisition step (1), set the quantity of hidden nodes in the last hidden layer equal to the number of decision factors (expert heuristics) used by domain experts to solve the problem.

Figure 2. Heuristic Principles for Design of Artificial Neural Networks
Use of the knowledge-based design heuristics enables an ANN designer to build a
minimum size ANN that is capable of robustly dealing with specific domain problems. The future
may hold automatic methods for determining the optimum configuration of the hidden layers for
ANNs. Minimum size ANN configurations guarantee optimal results with the minimum amount of
training time.
References
[1]
Bansal, A., Kauffman, R. J., & Weitz R. R. “Comparing the Modeling Performance of
Regression and Neural Networks As Data Quality Varies: A Business Value Approach”,
Journal of Management Information Systems, Vol. 10 No. 1, 1993, pp. 11-32.
[2]
Barnard, E., & Wessels, L. “Extrapolation and Interpolation in Neural Network Classifiers”,
IEEE Control Systems, Vol. 12 No. 5, 1992, pp. 50-53.
[3]
Bartlett, E. “Dynamic Node Architecture Learning: An Information Theoretic Approach”,
Neural Networks, Vol. 7 No. 1, 1994, pp. 129-140.
[4]
Bejou, D., Wray, B., & Ingram, T. “Determinants of Relationship Quality: An Artificial
Neural Network Analysis”, Journal of Business Research, Vol. 36, 1996, pp. 137-143.
[5]
Benjamin, C. O., Chi, S., Gaber, T., & Riordan, C. A. “Comparing BP and ART II Neural
Network Classifiers for Facility Location”, Computers and Industrial Engineering, Vol. 28
No. 1, 1995, pp. 43-50.
[6]
Carpenter, G. A., & Grossberg, S. “The ART of Adaptive Pattern Recognition by a Self-Organizing Neural Network”, Computer, Vol. 21 No. 3, 1988, pp. 77-88.
[7]
Carpenter, G. A., Grossberg, S., Markuzon, N., & Reynolds, J. H. “Fuzzy ARTMAP: A
Neural Network Architecture for Incremental Learning of Analog Multidimensional Maps”,
IEEE Transactions on Neural Networks, Vol. 3 No. 5, 1992, pp. 698-712.
[8]
Cherkassky, V., & Lari-Najafi, H. “Data Representation for Diagnostic Neural Networks”,
IEEE Expert, Vol. 7 No. 5, 1992, pp. 43-53.
[9]
Dayhoff, J. Neural Network Architectures: An Introduction, New York: Van Nostrand
Reinhold, 1990.
[10]
Falas, T., Charitou, A., & Charalambous, C. “The Application of Artificial Neural
Networks in The Prediction of Earnings”, Proceedings IEEE International Conference on
Neural Networks, 1994, pp. 3629-3633.
[11]
Fausett, L., & Elwasif, W. “Predicting Performance from Test Scores Using
Backpropagation and Counterpropagation”, Proceedings of IEEE International Conference
on Neural Networks, 1994, pp. 3398-3402.
[12]
Fish, K. E., Barnes, J. H., & Aiken, M. W. “Artificial Neural Networks: A Methodology
for Industrial Market Segmentation”, Industrial Marketing Management, Vol. 24, No. 5,
1995, pp. 431-438.
[13]
Fletcher, D., & Goss, E. “Forecasting With Neural Networks: An Application Using
Bankruptcy Data”, Information and Management, Vol. 24 No. 3, 1993, pp. 159-167.
[14]
Fu, L. Neural Networks in Computer Intelligence, New York: McGraw-Hill, 1994.
[15]
Grudnitski, G., & Osburn, L. “Forecasting S&P and Gold Futures Prices: An Application
of Neural Networks”, The Journal of Futures Markets, Vol. 13, No. 6, 1993, pp. 631-643.
[16]
Hammerstrom, D. “Neural Networks At Work”, IEEE Spectrum, Vol. 30 No. 6, 1993, pp.
26-32.
[17]
Haykin, S. Neural Networks: A Comprehensive Foundation, New York: Macmillan, 1994.
[18]
Hecht-Nielsen, R. “Applications of Counterpropagation Networks”, Neural Networks, Vol.
1, 1988, pp. 131-139.
[19]
Hertz, J., Krogh, A., & Palmer, R. Introduction To The Theory of Neural Computation,
Reading, MA: Addison-Wesley, 1991.
[20]
Hopfield, J. J., & Tank, D. W. “Computing With Neural Circuits: A Model”, Science, Vol.
233 No. 4764, 1986, pp. 625-633.
[21]
Hornik, K., Stinchcombe, M., & White, H. “Multilayer Feedforward Networks Are
Universal Approximators”, Neural Networks, Vol. 2 No. 5, 1989, pp. 359-366.
[22]
Hung, M. S., Hu, M. Y., Shanker, M. S., & Patuwo, B. E. “Estimating Posterior
Probabilities in Classification Problems With Neural Networks”, International Journal of
Computational Intelligence and Organizations, Vol. 1 No. 1, 1996, pp. 49-60.
[23]
Jain, B. A., & Nag, B. R. “Artificial Neural Network Models for Pricing Initial Public
Offerings”, Decision Sciences, Vol. 26 No. 3, 1995, pp. 283-302.
[24]
Kohonen, T. Self-Organization and Associative Memory, Berlin: Springer-Verlag, 1988.
[25]
Kou, C., Shih, J., Lin, C., & Lee, Z. “An Application of Neural Networks to Reconstruct
Crime Scene Based on Non-Mark Theory - Suspicious Factors Analysis”, In P. K. Simpson
(Ed.), Neural Networks: Theory, Technology, and Applications, New York: IEEE Press,
1996, pp. 537-543.
[26]
Kulkarni, U. R., & Kiang, M. Y. “Dynamic grouping of parts in flexible manufacturing
systems - A self-organizing neural networks approach”, European Journal of Operational
Research, Vol. 84, No. 1, 1995, pp. 192-212.
[27]
Lacher, R. C., Coats, P. K., Sharma, S. C., & Fant, L. F. “A neural network for classifying
the financial health of a firm”, European Journal of Operational Research, Vol. 85, No. 1,
1995, pp. 53-65.
[28]
Lenard, M. J., Alam, P., & Madey, G. R. “The Application of Neural Networks and a
Qualitative Response Model to the Auditor's Going Concern Uncertainty Decision”,
Decision Sciences, Vol. 26 No. 2, 1995, pp. 209-227.
[29]
Lenat, D. B., & Feigenbaum, E. A. “On the Thresholds of Knowledge”, Technical report
AI-126-87, Austin, TX: Microelectronics and Computer Technology Corporation (MCC),
1987.
[30]
Li, E. Y. “Artificial neural networks and their business applications”, Information &
Management, Vol. 27, No. 5, 1994, pp. 303-313.
[31]
McLeod, R. W., Malhotra, D. K., & Malhotra, R. “Predicting Credit Risk: A Neural
Network Approach”, Journal of Retail Banking, Vol. 15, No. 3, 1993, pp. 37-40.
[32]
Medsker, L., & Liebowitz, J. Design and Development of Expert Systems and Neural
Networks, New York: Macmillan, 1994.
[33]
Mehra, P., & Wah, B. W. Artificial Neural Networks: Concepts and Theory, New York:
IEEE Press, 1992.
[34]
Mendenhall, W., & Sincich, T. A Second Course in Statistics: Regression Analysis, Upper
Saddle River, NJ: Prentice Hall, 1996.
[35]
Moody, J., & Darken, C. J. “Fast Learning in Networks of Locally-Tuned Processing
Elements”, Neural Computation, Vol. 1 No. 2, 1989, pp. 281-294.
[36]
Nabhan, T., & Zomaya, A. “Toward Generating Neural Network Structures for Function
Approximation”, Neural Networks, Vol. 7 No. 1, 1994, pp. 89-99.
[37]
Opitz, D., & Shavlik, J. “Genetically Refining Topologies of Knowledge-Based Neural
Networks”, International Symposium on Integrating Knowledge and Neural Heuristics,
Pensacola, FL, 1994, pp. 57-66.
[38]
Pakath, R., & Zaveri, J. S. “Specifying Critical Inputs in a Genetic Algorithm-driven
Decision Support System: An Automated Facility”, Decision Sciences, Vol. 26 No. 6, 1995,
pp. 749-779.
[39]
Patuwo, E., Hu, M. Y., & Hung, M. S. “Two-Group Classification Using Neural
Networks”, Decision Sciences, Vol. 24 No. 4, 1993, pp. 825-845.
[40]
Piramuthu, S., Shaw, M., & Gentry, J. “A classification approach using multi-layered
neural networks”, Decision Support Systems, Vol. 11 No. 5, 1994, pp. 509-525.
[41]
Schocken, S. & Ariav, G. “Neural Networks for Decision Support: Problems and
Opportunities”, Decision Support Systems, Vol. 11 No. 5, 1994, pp. 393-414.
[42]
Sejnowski, T., & Rosenberg, C. “Parallel Networks That Learn to Pronounce English Text”,
Complex Systems, Vol. 1 No. 1, 1987, pp. 145-168.
[43]
Smith, M. Neural Networks for Statistical Modeling, New York: Van Nostrand Reinhold,
1993.
[44]
Soulié, F. “Integrating Neural Networks for Real World Applications”, In Zurada, Marks, &
Robinson (Eds.), Computational Intelligence: Imitating Life, New York: IEEE Press, 1994,
pp. 396-405.
[45]
Specht, D. F. “A General Regression Neural Network”, IEEE Transactions on Neural
Networks, Vol. 2 No. 6, 1991, pp. 568-576.
[46]
Tahai, A., Walczak, S., & Rigsby, J. T. “Improving Artificial Neural Network Performance
Through Input Variable Selection”, In P. Siegel, K. Omer, A. deKorvin, & A. Zebda (Eds.)
Applications of Fuzzy Sets and The Theory of Evidence to Accounting II, Stamford,
Connecticut: JAI Press, 1998, pp. 277-292.
[47]
Towell, G. G., Shavlik, J. W., & Noordewier, M. O. “Refinement of Approximate Domain
Theories by Knowledge-Based Neural Networks”, Proceedings Eighth National
Conference on Artificial Intelligence, Boston, 1990, pp. 861-866.
[48]
Walczak, S. “Neural Network Models for A Resource Allocation Problem”, Transactions
on Systems, Man and Cybernetics, Vol. 28 B No. 2, 1998a, pp. 276-284.
[49]
_____
“Information Effects on Neural Network Forecasting Model Accuracy”, 1998
Proceedings Decision Sciences Institute Annual Meeting, 1998, in press.
[50]
_____ “A Modified Decision Tree Approach for Evaluating the Potential for Application of
Neural Networks and Expert Systems”, Journal of Computer Information Systems, Vol. 36
No. 4, 1996, pp. 1-6.
[51]
_____ “Developing Neural Nets for Currency Trading”, Artificial Intelligence in Finance,
Vol. 2 No. 1, 1995, pp. 27-34.
[52]
_____ “Categorizing University Student Applicants With Neural Networks”, Proceedings
of IEEE International Conference on Neural Networks, 1994, pp. 3680-3685.
[53]
Weigand, A. S., & Zimmermann, H. G. “The Observer-Observation Dilemma in Neuro-Forecasting: Reliable Models From Unreliable Data Through CLEARNING”, Proceedings
of Artificial Intelligence Applications on Wall Street, New York, 1995, pp. 308-317.
[54]
White, H. “Connectionist Nonparametric Regression: Multilayer Feedforward Networks
Can Learn Arbitrary Mappings”, Neural Networks, Vol. 3, 1990, pp. 535-549.
[55]
Widrow, B., Rumelhart, D., & Lehr, M. “Neural Networks: Applications in Industry,
Business, and Science”, Communications of the ACM, Vol. 37 No. 3, 1994, pp. 93-105.
[56]
Wilson, R. L., & Sharda, R. “Bankruptcy prediction using neural networks”, Decision
Support Systems, Vol. 11, No. 5, 1994, pp. 545-557.