First-Order Logical Neural Networks
Thanupol Lerdlamnaochai and Boonserm Kijsirikul
Department of Computer Engineering, Chulalongkorn University,
Pathumwan, Bangkok, 10330, Thailand
[email protected],
[email protected]
Abstract
Inductive Logic Programming (ILP) is a well-known machine learning technique for learning concepts from relational data. Nevertheless, ILP systems are not robust to noisy or unseen data in real-world domains. Furthermore, in multi-class problems, if an example is not matched by any learned rule, it cannot be classified. This paper presents a novel hybrid learning method that alleviates this restriction by enabling neural networks to handle first-order logic programs directly. The proposed method, called First-Order Logical Neural Network (FOLNN), is based on feedforward neural networks and integrates inductive learning from examples and background knowledge. We also propose a method for determining the appropriate variable substitution in FOLNN learning by using Multiple-Instance Learning (MIL). In the experiments, the proposed method is evaluated on two first-order learning problems, the Finite Element Mesh Design and Mutagenesis, and compared with the state-of-the-art PROGOL system. The experimental results show that the proposed method performs better than PROGOL.
1. Introduction
Inductive Logic Programming (ILP) [1, 2] is one of the few machine learning techniques that adopt first-order logical concepts for hypothesis learning. The advantages of ILP are its ability to employ background knowledge and the expressive representation of first-order logic. However, first-order rules learned by ILP are limited in handling imperfect data in real-world domains, such as noisy or unseen data. This problem is especially noticeable in multi-class classification: if an example is not covered by any learned rule, it cannot be classified. A simple solution is to assign the majority class of the training examples to test data that cannot be labeled [3]. A more effective approach to this problem is to use the concept of intelligent hybrid systems [4].
Artificial Neural Networks (ANNs) [5] avoid many of the restrictions of symbolic rule-based systems described above. Neural networks are able to process inconsistent and noisy data, and they compute the most reasonable output for each input. Because of their potential for noise tolerance and multi-class classification, neural networks are attractive candidates for combination with symbolic components. Although neural networks could alleviate the problems of symbolic rule-based systems, the hypothesis learned by a neural network is not available in a form that is legible to humans. Therefore neural networks still require interpretation by rule-based systems [4]. Several works show that integrating robust neural networks with symbolic knowledge representation can improve classification accuracy, such as Towell and Shavlik's KBANN [6], Mahoney and Mooney's RAPTURE [7], the work of Parekh and Honavar [8], and that of d'Avila Garcez et al. [9]. Nevertheless, this research has been restricted to propositional theory refinement. Some models have been proposed for first-order theories. SHRUTI [10] employs a restricted form of unification; in fact, the system only propagates bindings. The work of Botta et al. [11] created a network that performs a restricted form of first-order learning. Kijsirikul et al. [12] proposed a feature generation method and a partial matching technique for first-order logic, but their method still uses an ILP system in its first learning step and cannot select the appropriate values to substitute for variables.
In this paper, we are interested in the direct learning of first-order logic programs by neural networks. The proposed method, called First-Order Logical Neural Network (FOLNN), is a neural-symbolic learning system based on the feedforward neural network that integrates
inductive learning from examples and background knowledge. We also propose a method that makes use of Multiple-Instance Learning (MIL) [13, 14] for determining the variable substitution in our model. The proposed method has been evaluated on two standard first-order learning datasets, the Finite Element Mesh Design [15] and Mutagenesis [16]. The results show that the proposed method provides more accurate results than the original ILP system.
The rest of this paper is organized as follows. Section 2 describes the process of first-order learning by FOLNN in three subsections. The experimental results on the first-order problems are reported in Section 3. Finally, conclusions are given in Section 4.
2. First-Order Logical Neural Network
(FOLNN)
Commonly, the main reason for integrating robust neural networks with symbolic components is to reduce the weaknesses of the rule-based system. Combining these two techniques is normally known as a neural-symbolic learning system [9]. Our proposed method, FOLNN, is also this type of learning system. The FOLNN structure is based on the feedforward neural network and can receive examples and background knowledge in the form of first-order logic programs as its inputs. FOLNN weight adaptation is based on the Backpropagation (BP) algorithm [17].
The following subsections describe the FOLNN algorithm, which is composed of creating an initial network, feeding examples to the network, and training the network.
2.1. Creating an initial network
In this subsection, we present the first step of the FOLNN algorithm: creating an initial network from background knowledge. A three-layer feedforward network, composed of one input layer, one output layer and one hidden layer [18], is employed for the FOLNN structure. We define the functionality of each layer as follows.
• Input layer: The input layer is the first layer; it receives the input data, processes it, and transmits the processed data to the hidden layer. This layer represents the literals used to describe the target rule. The number of units in this layer depends on the predicates in the background knowledge. A predicate is represented by one unit in the input layer if it has arity one. Otherwise, the number of units for a predicate equals the number of all possible combinations of variables of that predicate (see the sketch after this list).
• Hidden layer: This layer connects the input layer and the output layer and helps the network learn complex classifications. The number of units in this layer depends on the complexity of the learning concept and is determined experimentally.
• Output layer: This layer is the last layer of the network and produces the output for classification. The target concept is represented in this layer, so the number of units in the output layer equals the number of concepts to be learned, i.e., the number of classes.
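As a concrete reading of the unit-counting rule above (one unit per arity-one predicate, and one unit per variable ordering for a higher-arity predicate, which matches the parent(x,y)/parent(y,x) case in the example that follows), here is a minimal Python sketch; the helper name and the variable pool are illustrative and not part of the original system:

```python
from itertools import permutations

def input_units(predicates):
    """Enumerate the input-layer units from (name, arity) pairs.

    An arity-one predicate contributes one unit; a higher-arity predicate
    contributes one unit per ordering of the variables it can take.
    """
    variables = ["x", "y", "z", "w"]      # illustrative variable pool
    units = []
    for name, arity in predicates:
        if arity == 1:
            units.append((name, ("x",)))
        else:
            for args in permutations(variables[:arity]):
                units.append((name, args))
    return units

# Predicates of the rich(x) example discussed next.
units = input_units([("genius", 1), ("diligent", 1), ("strong", 1), ("parent", 2)])
print(len(units))   # 5: genius(x), diligent(x), strong(x), parent(x,y), parent(y,x)
```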
An initial network is created using the above definitions. To illustrate the construction of the network, consider the task of learning rules for classifying rich persons (rich(x)). Background knowledge and the positive and negative examples are given as follows.
background knowledge:  diligent(Alan), parent(Bob,Alan),
                       genius(Bob), diligent(Bob),
                       diligent(Chris), strong(Chris)
positive examples:     rich(Alan), rich(Bob)
negative example:      rich(Chris)

Figure 1. Inputs for learning the concept rich(x).
As shown in Figure 1, the background knowledge contains three predicates of arity one, genius(x), diligent(x) and strong(x), and one predicate of arity two, parent(x,y). Each predicate of arity one is represented by one unit in the input layer, so three units are created. The predicate parent is represented by two input units for the literals parent(x,y) and parent(y,x). Furthermore, because there is only one target concept (rich(x)), the output layer has only one unit. Therefore, in this case, the constructed network has five input units and one output unit. The network created from the inputs in Figure 1 is shown in Figure 2. In addition, all network weights are initialized to small random numbers.
Figure 2. The created network with one hidden
unit.
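For illustration only, a minimal sketch of the network in Figure 2 as a dense feedforward net with sigmoid units and small random weights; everything beyond the 5-1-1 layer sizes (biases, weight scale, the helper names) is an assumption rather than a detail specified in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_network(n_in, n_hidden, n_out, scale=0.1):
    """Create a fully connected feedforward net with small random weights."""
    return {
        "W1": rng.uniform(-scale, scale, size=(n_hidden, n_in)),
        "b1": np.zeros(n_hidden),
        "W2": rng.uniform(-scale, scale, size=(n_out, n_hidden)),
        "b2": np.zeros(n_out),
    }

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(net, x):
    """Propagate one input vector through the network."""
    h = sigmoid(net["W1"] @ x + net["b1"])
    o = sigmoid(net["W2"] @ h + net["b2"])
    return h, o

# Network of Figure 2: five input units, one hidden unit, one output unit.
net = init_network(n_in=5, n_hidden=1, n_out=1)
_, out = forward(net, np.array([0., 1., 0., 0., 1.]))  # one instance of bag rich(Alan)
print(out)
```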
The completely constructed network then receives
examples for refining the network. The process for
feeding examples to the network is described in the
next subsection.
2.2. Feeding examples to the network
In general, neural networks receive inputs in real
value form. However, inputs of the ILP system
(background knowledge and examples) are in logical
form. So we change the logical inputs to the form that
can be learned by neural networks. The examples are
fed to the network one by one and independently
transformed to the network input for each unit. The
value for each unit is defined as follows.
$$
X_{ij} =
\begin{cases}
1 & \text{if } L_i\theta_j \text{ is true in background knowledge} \\
0 & \text{otherwise}
\end{cases}
\qquad (1)
$$

where Xij, Li, and θj are the input value for input unit i when feeding example j, the literal represented by input unit i, and the variable binding with constants in example j, respectively.
The input value for an input unit is 1 if there exists a substitution that makes the truth value of the literal true in the background knowledge; otherwise the input value for that unit is 0. In addition, the target value for an output unit is defined as follows.
$$
T_{kj} =
\begin{cases}
1 & \text{if } L_k\theta_j \text{ is a positive example} \\
0 & \text{otherwise}
\end{cases}
\qquad (2)
$$

where Tkj and Lk are the target value for output unit k when feeding example j, and the literal represented by output unit k, respectively.
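A minimal sketch of the encoding in Equation (1) for a single, fixed substitution θ, with the background knowledge of Figure 1 represented as a set of ground facts; the data structures and the helper name are illustrative:

```python
# Background knowledge of Figure 1 as a set of ground facts.
FACTS = {
    ("diligent", ("Alan",)), ("parent", ("Bob", "Alan")),
    ("genius", ("Bob",)), ("diligent", ("Bob",)),
    ("diligent", ("Chris",)), ("strong", ("Chris",)),
}

def literal_value(literal, theta, facts=FACTS):
    """Equation (1): 1 if the literal is true under substitution theta, else 0."""
    name, args = literal
    ground = tuple(theta[v] for v in args)
    return 1 if (name, ground) in facts else 0

# Input-layer literals of the rich(x) example.
LITERALS = [("genius", ("x",)), ("diligent", ("x",)), ("strong", ("x",)),
            ("parent", ("x", "y")), ("parent", ("y", "x"))]

theta = {"x": "Alan", "y": "Bob"}                   # one possible binding
print([literal_value(l, theta) for l in LITERALS])  # -> [0, 1, 0, 0, 1]
```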
For instance, with the same inputs as in Figure 1, if the fed example is rich(Alan), then the first three units, i.e., genius(x), diligent(x) and strong(x), receive 0, 1 and 0 as their inputs respectively, because the literal diligent(Alan) is true in the background knowledge, while the literals genius(Alan) and strong(Alan) are false. The target value for the output unit is 1 because rich(Alan) is a positive example. However, the input values for the literals parent(x,y) and parent(y,x) cannot be easily determined, since there are many possible constants that can be mapped to the relational variables. For unit parent(y,x), the variable x is certainly replaced by Alan, but it is ambiguous which term (Alan, Bob or Chris) should replace variable y. If we select Bob for the substitution, the truth value for this input unit is 1; the other substitutions give 0 for this unit (see Table 1). The truth value for parent(x,y) is 0 for any substitution. From the above example, the input value for unit parent(y,x) is uncertain and cannot be easily determined for network training. This problem arises whenever the background knowledge contains relational data and the learner cannot determine the appropriate value for the variable substitution.
Table 1. Input value of unit parent(y,x) for each constant replacement.

  Unit parent(y,x)                     Input value
  Replace x by Alan, and y by Alan          0
  Replace x by Alan, and y by Bob           1
  Replace x by Alan, and y by Chris         0
To solve this problem, we use Multiple-Instance Learning (MIL) to provide the input data for our network. In the MIL framework [13, 14], the training set is composed of a set of bags, each of which is a collection of a possibly different number of instances. A bag is labeled negative if all the instances in it are negative; if a bag contains at least one positive instance, it is labeled positive. With this concept, we define the FOLNN training data as a set of training examples {B1, B2, ..., Bn}, where n is the number of examples, both positive and negative. A bag is labeled positive if its example is positive, and negative otherwise (in multi-class classification, all bags are labeled positive for their own classes). A positive bag is given 1 as its target value and a negative bag is assigned 0, as defined in Equation (2). Each bag contains mi instances {Bi1, Bi2, ..., Bimi}, where Bij is one possible binding (substitution). This is the key point: all cases of variable substitution can be used together as one bag for learning, so selecting the appropriate substitution value is no longer a problem. Consider the positive bag for example rich(Alan) as input data (see Table 2).
Table 2. Transformation of example rich(Alan) into input data of FOLNN.

  Positive bag of example rich(Alan)    genius(x)  diligent(x)  strong(x)  parent(x,y)  parent(y,x)
  Replace x by Alan, and y by Alan          0           1           0           0            0
  Replace x by Alan, and y by Bob           0           1           0           0            1
  Replace x by Alan, and y by Chris         0           1           0           0            0
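The rows of Table 2 can be generated mechanically by enumerating the possible bindings of the free variable y; a sketch, reusing literal_value and LITERALS from the sketch in Section 2.2 and assuming the set of constants is known in advance:

```python
CONSTANTS = ["Alan", "Bob", "Chris"]

def make_bag(example_constant, literals, constants=CONSTANTS):
    """Build one MIL bag: one instance per binding of the free variable y,
    with x fixed to the constant of the example (as in Table 2)."""
    bag = []
    for y in constants:
        theta = {"x": example_constant, "y": y}
        bag.append([literal_value(l, theta) for l in literals])
    return bag

# Positive bag for rich(Alan): three instances, one per substitution of y.
for row in make_bag("Alan", LITERALS):
    print(row)
# [0, 1, 0, 0, 0]
# [0, 1, 0, 0, 1]
# [0, 1, 0, 0, 0]
```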
As shown in Table 2, the bag rich(Alan) has 3 instances, each of which corresponds to one case of substitution. Likewise, the positive bag rich(Bob) and the negative bag rich(Chris) each have 3 instances, just as the positive bag rich(Alan) does. For training, these 3 bags are fed to the network one by one and the network weights are adapted by the Backpropagation (BP) algorithm for MIL [19], as described in the next subsection.
2.3. Training the network
To train the network, the training bags are fed to the network and the network weights are adapted. Weight adaptation is based on the BP algorithm and the activation function is the sigmoid function. Suppose the network has p input units, o output units, and one hidden layer. The global error function E of the network is defined as follows.
$$
E = \sum_{i=1}^{n} E_i
\qquad (3)
$$

where Ei is the error on bag i. Ei is defined according to the type of the bag i as

$$
E_i =
\begin{cases}
\min_{1 \le j \le m_i} \sum_{k=1}^{o} E_{ijk} & \text{if } B_i = + \\
\max_{1 \le j \le m_i} \sum_{k=1}^{o} E_{ijk} & \text{if } B_i = -
\end{cases}
\qquad (4)
$$
$$
E_{ijk} =
\begin{cases}
0 & \text{if } B_i = + \text{ and } 0.5 \le o_{ijk},\ l_{ik} = 1 \\
0 & \text{if } B_i = - \text{ and } o_{ijk} < 0.5 \text{ for all } k \\
\tfrac{1}{2}\,(l_{ik} - o_{ijk})^2 & \text{otherwise}
\end{cases}
\qquad (5)
$$

where
• Eijk is the error of output unit k on instance j of bag example i,
• Bi = + denotes a positive bag example,
• Bi = - denotes a negative bag example,
• oijk is the actual output of output unit k for instance j of bag example i, and
• lik is the target output of output unit k for bag example i.
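To make Equations (3)-(5) concrete, here is a minimal sketch of the bag-level error computation, assuming the per-instance network outputs are already available; the "for all k" condition of the negative-bag case is applied per output unit for simplicity, which is one possible reading of Equation (5):

```python
import numpy as np

def instance_errors(outputs, targets, positive):
    """Equation (5): per-output-unit error E_ijk for one instance of a bag."""
    errs = np.zeros(len(targets))
    for k, (o, l) in enumerate(zip(outputs, targets)):
        if positive and o >= 0.5 and l == 1:
            errs[k] = 0.0
        elif (not positive) and o < 0.5:
            errs[k] = 0.0
        else:
            errs[k] = 0.5 * (l - o) ** 2
    return errs

def bag_error(bag_outputs, targets, positive):
    """Equation (4): min over instances for a positive bag, max for a negative bag."""
    sums = [instance_errors(o, targets, positive).sum() for o in bag_outputs]
    return min(sums) if positive else max(sums)

def global_error(bags):
    """Equation (3): sum of bag errors; bags = [(bag_outputs, targets, positive), ...]."""
    return sum(bag_error(*b) for b in bags)

# Outputs of the single output unit for the three instances of bag rich(Alan).
outs = [[0.3], [0.7], [0.4]]
print(bag_error(outs, targets=[1], positive=True))   # 0.0: one instance is already correct
```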
With the error function defined above, the error BP algorithm is readily adapted for training FOLNN. In each training epoch, the training bags are fed to the network one by one. Then the error Eijk is computed according to Equation (5). For a positive bag Bi, if Eijk is 0 then the rest of the instances of this bag are disregarded and the weights are not changed for this bag in the current epoch. Otherwise the process continues, and when all the instances of Bi have been fed, Ei is computed using Equation (4) and the weights of the network are changed according to the weight update rule of BP [17]. Then the next training bag is fed to the network, and the training process is repeated until the number of training iterations reaches a predefined threshold or the global error E in Equation (3) falls below a predefined threshold. After training, the network can be used to classify unseen data.

3. Results

In the previous section, the three steps of the FOLNN learning algorithm were described. In this section, we evaluate FOLNN by performing experiments on the finite element mesh design and mutagenesis datasets, both well-known ILP problems. We also compare the results obtained by FOLNN with those obtained by an ILP system.

3.1. Datasets
3.1.1. Finite Element Mesh Design. The dataset for
the finite element mesh design [15] consists of 5 structures and has 13 classes (the 13 possible numbers of partitions for an edge of a structure). Additionally,
there are 278 examples each of which has the form
mesh(Edge,Number_of_elements) where Edge is an
edge label (unique for each edge) and
Number_of_elements indicates the number of
partitions. The background knowledge contains
relations describing the types of an edge (e.g. circuit,
short), boundary conditions (e.g. free, fixed), loadings
(e.g. not_loaded, one_side_loaded) and the relations
describing the structure of the object (e.g. neighbour,
opposite). The goal of finite element mesh design is to
learn general rules describing how many elements
should be used to model each edge of a structure.
3.1.2. Mutagenesis. The mutagenesis dataset [16] consists of 188 molecules, of which 125 are
mutagenic (active) and 63 are non-mutagenic
(inactive). A molecule is described by listing its atoms
as atom(AtomID,Element,Type,Charge) and the bonds
between atoms as bond(Atom1,Atom2,BondType). This
is a two-class learning problem: predicting whether a molecule is active or inactive in terms of mutagenicity.
3.2. Experiments
For the finite element mesh design dataset, we created a network containing 130 units in the input
layer (determined by predicates in background
knowledge), 13 output units (as the number of classes)
and one hidden layer with 80 hidden units (determined
by experiment). For the mutagenesis dataset, the constructed network has 235 input units, 100 hidden units and 2 output units. The weights of the two networks are randomly initialized and then adapted using the BP algorithm with the sigmoid activation function. We performed three-fold cross validation [20] on each dataset. The dataset is partitioned into three roughly equal-sized subsets with roughly the same class proportions as the original dataset. Each subset is used as the test set once, and the remaining subsets are used as the training set. The final result is the average over the three folds. For each fold of both datasets, we trained FOLNN with learning rate 0.0001 and momentum 0.97.
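The stratified three-fold split described above can be sketched as follows; the labels are synthetic and the function name is illustrative (this does not reproduce the actual folds used in the paper):

```python
import numpy as np

def stratified_three_folds(labels, seed=0):
    """Assign each example to one of three folds, keeping class proportions
    roughly equal in every fold."""
    rng = np.random.default_rng(seed)
    folds = np.empty(len(labels), dtype=int)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        folds[idx] = np.arange(len(idx)) % 3   # deal class members round-robin
    return folds

labels = np.array([1] * 12 + [0] * 6)          # illustrative class labels
folds = stratified_three_folds(labels)
for f in range(3):
    test = folds == f                           # one fold as test set, rest as training
    print(f"fold {f}: {test.sum()} test, {(~test).sum()} training examples")
```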
Table 3. The percent accuracies of FOLNN and PROGOL on the first-order datasets; FEM – Finite Element Mesh Design, MUTA – Mutagenesis.

  Dataset    FOLNN    PROGOL
  FEM        59.18    57.80
  MUTA       88.27    84.58
The average results over the three folds on the FEM and MUTA datasets are summarized in Table 3. PROGOL [21], a state-of-the-art ILP system, was used as the baseline for comparison with our proposed method, FOLNN. The experimental results show that the accuracy of FOLNN is higher than that of PROGOL on both datasets. The improvement can be attributed to the weakness of the rules learned by PROGOL.
In addition to the results on the original datasets, to see how well our learner handles noisy data, we also evaluated FOLNN on a noisy domain. The mutagenesis dataset was selected for this task. Using the three-fold data of the mutagenesis dataset from the previous experiment, 10% and 15% class noise was randomly added to the training set, while no noise was added to the test set. In our setting, adding x% of noise means that the class value is replaced with a wrong value for x out of every 100 training examples, selected at random. The accuracies of PROGOL and FOLNN on the noisy data are shown in Table 4.
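A sketch of this class-noise injection for the two-class mutagenesis labels; the label array and function name are illustrative:

```python
import numpy as np

def add_class_noise(labels, noise_percent, seed=0):
    """Flip the class of noise_percent out of every 100 training examples,
    chosen at random; the test set is left untouched."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    n_flip = int(round(len(labels) * noise_percent / 100.0))
    flip_idx = rng.choice(len(labels), size=n_flip, replace=False)
    noisy[flip_idx] = 1 - noisy[flip_idx]       # two-class case: replace with the wrong label
    return noisy

train_labels = np.array([1] * 12 + [0] * 6)     # illustrative training labels
print(add_class_noise(train_labels, 15))
```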
Table 4. Performance comparison on the noisy mutagenesis dataset.

  Noise level    PROGOL              PROGOL               PROGOL               FOLNN
  in dataset     (0% noise setting)  (10% noise setting)  (15% noise setting)
  10%            69.72               71.29                64.23                84.01
  15%            61.54               65.31                60.56                81.28
Since PROGOL can handle noise in data via its noise option, "x% noise setting" in the table indicates that the noise option of PROGOL was set to x%. As can be seen in Table 4, our proposed algorithm still provides higher average accuracies than PROGOL. When 10% and 15% noise is added to the dataset, the performance of PROGOL drops significantly due to its sensitivity to noise, which is the main disadvantage of first-order rules directly induced by an ILP system. The accuracy of our method, however, decreases much more slowly and remains much higher than that of PROGOL. Thanks to the noise tolerance contributed by the neural network component, FOLNN is more robust against noise than the original first-order rules. FOLNN avoids overfitting the noisy data because the neural network gives higher weights to important features and pays less attention to unimportant ones.
4. Conclusions
Learning first-order logic programs with neural networks is still an open problem. This paper presents a novel hybrid connectionist-symbolic system, called FOLNN (First-Order Logical Neural Network), based on the feedforward neural network, that incorporates inductive learning from examples and background knowledge. FOLNN alleviates the problem that first-order rules induced by an ILP system are not robust to noisy or unseen data. The prominent advantage of FOLNN is that it can learn directly from inputs provided in the form of first-order logic programs. Other neural learners cannot learn directly from such programs because they cannot select the appropriate values for variable substitution; our method solves this problem by applying the MIL concept to obtain well-defined input data from the first-order logic input.
The experimental results show that FOLNN is tolerant to noise and performs better than PROGOL. This is because the neural network can identify important attributes, assign them higher weights, and give less weight to unimportant ones.
Although our main objective is learning first-order logic, FOLNN can also be applied to other tasks such as learning from propositional datasets with missing values in some attributes.
One interesting issue is knowledge extraction. Knowledge extraction from a trained network is one phase of a neural-symbolic learning system [9] and is of significant interest in data mining and knowledge discovery applications such as medical diagnosis. However, this phase is not included in this work, and we have not yet explored rule extraction from trained networks. Nevertheless, we surmise that many existing techniques [9, 22-24] can be adapted to extract rules from our networks.
5. Acknowledgement
This work was supported by the Thailand Research
Fund.
6. References
[1] N. Lavrac and S. Dzeroski, Inductive Logic Programming: Techniques and Applications, Ellis Horwood, New York, 1994.
[2] S.-H. Nienhuys-Cheng and R. de Wolf, Foundations of Inductive Logic Programming, Springer-Verlag, New York, 1997.
[3] S. Dzeroski, S. Schulze-Kremer, K. R. Heidtke, K.
Siems, and D. Wettschereck, "Applying ILP to
Diterpene Structure Elucidation from 13C NMR
Spectra", Proceedings of the Sixth International
Workshop on Inductive Logic Programming, 1996.
[4] S. Wermter and R. Sun, "An Overview of Hybrid
Neural Systems," Hybrid Neural Systems, number 1778
in Lecture Notes in Artificial Intelligence, S. Wermter
and R. Sun, Eds., Springer, 2000, pp. 1-13.
[5] C. M. Bishop, Neural Networks for Pattern
Recognition, Oxford University Press, 1995.
[6] G. G. Towell and J. W. Shavlik, "Knowledge-based
artificial neural networks", Artificial Intelligence, vol.
70(1-2), 1994, pp. 119-165.
[7] J. J. Mahoney and R. J. Mooney, "Combining
connectionist and symbolic learning to refine certaintyfactor rule-bases", Connection Science, vol. 5, 1993, pp.
339-364.
[8] R. Parekh and V. Honavar, "Constructive Theory
Refinement in Knowledge Based Neural Networks",
Proceedings of the International Joint Conference on
Neural Networks, Anchorage, Alaska, 1998.
[9] A. S. d. A. Garcez, K. B. Broda, and D. M. Gabbay,
Neural-Symbolic Learning Systems, Springer-Verlag,
2002.
[10] L. Shastri and V. Ajjanagadde, "From simple
associations to systematic reasoning", Behavioral and
Brain Sciences, vol. 16, 1993, pp. 417-494.
[11] M. Botta, A. Giordana, and R. Piola, "FONN:
Combining First Order Logic with Connectionist
Learning", Proceedings of the 14th International
Conference on Machine Learning, Nashville, TN, 1997.
[12] B. Kijsirikul, S. Sinthupinyo, and K. Chongkasemwongse, "Approximate Match of Rules Using Backpropagation Neural Networks", Machine Learning Journal, vol. 44, 2001, pp. 273-299.
[13] Y. Chevaleyre and J.-D. Zucker, "A Framework for
Learning Rules from Multiple Instance Data", 12th
European Conference on Machine Learning, Freiburg,
Germany, 2001.
[14] X. Huang, S.-C. Chen, and M.-L. Shyu, "An Open
Multiple Instance Learning Framework and Its
Application in Drug Activity Prediction Problems",
Proceedings of the Third IEEE Symposium on
BioInformatics and BioEngineering (BIBE'03),
Bethesda, Maryland, 2003.
[15] B. Dolsak and S. Muggleton, "The Application of
Inductive Logic Programming to Finite Element Mesh
Design", Inductive Logic Programming, S. Muggleton,
Ed., Academic Press, 1992, pp. 453--472.
[16] A. Srinivasan, S. H. Muggleton, M. J. E. Sternberg, and R. D. King, "Theories for mutagenicity: a study in first-order and feature-based induction", Artificial Intelligence, vol. 85, Elsevier Science Publishers Ltd., 1996, pp. 277-299.
[17] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation", Parallel Distributed Processing, vol. 1, D. E. Rumelhart and J. L. McClelland, Eds., The MIT Press, Cambridge, MA, 1986.
[18] S. Holldobler and Y. Kalinke, "Towards a massively
parallel computational model for logic programming",
Proceedings of the ECAI94 Workshop on Combining
Symbolic and Connectionist Processing, ECAI94, 1994,
pp. 68-77.
[19] Z.-H. Zhou and M.-L. Zhang, "Neural Network for
Multi-Instance Learning", Proceedings of the
International Conference on Intelligent Information
Technology, Beijing, China, 2002.
[20] T. M. Mitchell, Machine Learning, The McGraw-Hill
Companies Inc, New York, 1997.
[21] S. Roberts, An Introduction to Progol. Technical
Manual, University of York, 1997.
[22] R. Andrew, J. Diederich, and A. B. Tickle, "Survey and
Critique of Techniques for Extracting Rules from
Trained Artificial Neural Networks", Knowledge-Based
Systems, vol. 8, 1995, pp. 373-389.
[23] M. W. Craven, "Extracting Comprehensible Models
from Trained Neural Networks", Department of
Computer Science: University of Wisconsin-Madison,
1996.
[24] G. G. Towell and J. W. Shavlik, "The Extraction of
Refined Rules from Knowledge-Based Neural
Networks", Machine Learning Journal, vol. 13, pp. 71101, 1993.