
First-Order Logical Neural Networks

Fourth International Conference on Hybrid Intelligent Systems (HIS'04)


First-Order Logical Neural Networks

Thanupol Lerdlamnaochai and Boonserm Kijsirikul
Department of Computer Engineering, Chulalongkorn University, Pathumwan, Bangkok, 10330, Thailand
[email protected], [email protected]

Abstract

Inductive Logic Programming (ILP) is a well-known machine learning technique for learning concepts from relational data. Nevertheless, ILP systems are not robust to noisy or unseen data in real-world domains. Furthermore, in multi-class problems, an example that matches no learned rule cannot be classified. This paper presents a novel hybrid learning method that alleviates this restriction by enabling neural networks to handle first-order logic programs directly. The proposed method, called First-Order Logical Neural Network (FOLNN), is based on feedforward neural networks and integrates inductive learning from examples and background knowledge. We also propose a method for determining the appropriate variable substitution in FOLNN learning by using Multiple-Instance Learning (MIL). In the experiments, the proposed method is evaluated on two first-order learning problems, the Finite Element Mesh Design and Mutagenesis, and compared with a state-of-the-art ILP system, PROGOL. The experimental results show that the proposed method performs better than PROGOL.

1. Introduction

Inductive Logic Programming (ILP) [1, 2] is a machine learning technique that adopts first-order logic for hypothesis learning. The advantages of ILP are its ability to employ background knowledge and the expressive representation of first-order logic. However, first-order rules learned by ILP are limited in handling imperfect data in real-world domains, such as noisy or unseen data. This problem is especially noticeable in multi-class classification: if an example is not covered by any learned rule, it cannot be classified.
A simple solution is to assign the majority class of the training examples to test data that cannot otherwise be labeled [3]. A more effective approach is to use the concept of intelligent hybrid systems [4]. Artificial Neural Networks (ANNs) [5] are claimed to avoid the restrictions of symbolic rule-based systems described above: they can process inconsistent and noisy data, and they compute the most reasonable output for each input. Because of their potential for noise tolerance and multi-class classification, neural networks are attractive candidates for combination with symbolic components. Although neural networks can alleviate the problems of symbolic rule-based systems, the hypotheses they learn are not available in a form that is legible to humans; therefore, neural networks require interpretation by rule-based systems [4].

Several works show that integrating robust neural networks with symbolic knowledge representation can improve classification accuracy, such as Towell and Shavlik's KBANN [6], Mahoney and Mooney's RAPTURE [7], and the works of Parekh and Honavar [8] and d'Avila Garcez et al. [9]. Nevertheless, these approaches are restricted to propositional theory refinement. Some models have been proposed for first-order theories. SHRUTI [10] performs a restricted form of unification; in fact, the system only propagates bindings. The work of Botta et al. [11] constructs a network that learns a restricted form of first-order logic. Kijsirikul et al. [12] proposed a feature generation method and a partial matching technique for first-order logic, but their method still uses an ILP system in its first learning step and cannot select the appropriate values to substitute for variables.
In this paper, we are interested in the direct learning of first-order logic programs by neural networks, through a model called First-Order Logical Neural Network (FOLNN). FOLNN is a neural-symbolic learning system based on the feedforward neural network that integrates inductive learning from examples and background knowledge. We also propose a method that uses Multiple-Instance Learning (MIL) [13, 14] to determine the variable substitution in our model. The proposed method has been evaluated on two standard first-order learning datasets, i.e., the Finite Element Mesh Design [15] and Mutagenesis [16]. The results show that the proposed method provides more accurate results than the original ILP system.

The rest of this paper is organized as follows. Section 2 outlines the process of first-order learning by FOLNN in three subsections. The experimental results on first-order problems are shown in Section 3. Finally, conclusions are given in Section 4.

2. First-Order Logical Neural Network (FOLNN)

The main reason for integrating robust neural networks with symbolic components is commonly to reduce the weaknesses of rule-based systems. The combination of these two techniques is normally known as a neural-symbolic learning system [9], and our proposed method, FOLNN, is of this type. The FOLNN structure is based on the feedforward neural network and can receive examples and background knowledge in the form of first-order logic programs as inputs. FOLNN weight adaptation is based on the Backpropagation (BP) algorithm [17]. The following subsections explain the FOLNN algorithm, which is composed of creating an initial network, feeding examples to the network, and training the network.

2.1. Creating an initial network

In this subsection, we present the first step of the FOLNN algorithm: creating an initial network from background knowledge.
A three-layer feedforward network, composed of one input layer, one hidden layer, and one output layer [18], is employed as the FOLNN structure. We define the functionality of each layer as follows.

- Input layer: the first layer, which receives input data, computes it, and transmits the processed data to the hidden layer. This layer represents the literals describing the target rule. The number of units in this layer depends on the predicates in background knowledge: a predicate of arity one is represented by a single unit, while for higher-arity predicates the number of units equals the number of all possible combinations of variables for that predicate.
- Hidden layer: the layer connecting the input and output layers, which helps the network learn complex classifications. The number of hidden units depends on the complexity of the concept to be learned and is determined by experiment.
- Output layer: the last layer of the network, which produces the output for classification. The target concept is represented in this layer, so the number of output units equals the number of concepts to be learned, i.e., the number of classes.

An initial network is created using the above definition. To illustrate the construction of the network, consider the task of learning rules for classifying rich persons (rich(x)). Background knowledge and the positive and negative examples are given as follows.

background knowledge: diligent(Alan), parent(Bob,Alan), genius(Bob), diligent(Bob), diligent(Chris), strong(Chris)
positive examples: rich(Alan), rich(Bob)
negative example: rich(Chris)

Figure 1. Inputs for learning the concept rich(x).
As shown in Figure 1, background knowledge contains three predicates of arity one, genius(x), diligent(x), and strong(x), and one predicate of arity two, parent(x,y). Each predicate of arity one is represented by one unit in the input layer, so three units are created. Predicate parent is represented by two input units, for the literals parent(x,y) and parent(y,x). Furthermore, because there is only one target concept (rich(x)), the output layer has only one unit. Therefore, the constructed network has five input units and one output unit. The network created from the inputs in Figure 1 is shown in Figure 2. In addition, all network weights are initialized to small random numbers.

Figure 2. The created network with one hidden unit.

The completely constructed network then receives examples for refining the network. The process of feeding examples to the network is described in the next subsection.

2.2. Feeding examples to the network

In general, neural networks receive inputs as real values, whereas the inputs of an ILP system (background knowledge and examples) are in logical form. We therefore transform the logical inputs into a form that can be learned by neural networks. The examples are fed to the network one by one and independently transformed into a network input for each unit. The value for each input unit is defined as follows:

    X_ij = 1   if L_i θ_j is true in background knowledge
         = 0   otherwise                                         (1)

where X_ij is the input value for input unit i when feeding example j, L_i is the literal represented by input unit i, and θ_j is the binding of variables to the constants in example j. The input value for an input unit is 1 if there exists a substitution that makes the truth value of the literal true in background knowledge; otherwise, the input value for that unit is 0.
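The input-unit construction of Section 2.1 can be sketched in code. This is a minimal illustration rather than the authors' implementation; the helper name, data representation, and the use of ordered permutations for "all possible combinations of variables" are our own assumptions:

```python
from itertools import permutations

def input_units(predicates, variables):
    """Enumerate input-layer units: one unit per arity-1 predicate,
    and one unit per ordered combination of variables otherwise."""
    units = []
    for name, arity in predicates:
        if arity == 1:
            units.append((name, (variables[0],)))   # e.g. genius(x)
        else:
            units.extend((name, combo) for combo in permutations(variables, arity))
    return units

# Background knowledge of Figure 1: three arity-1 predicates and parent/2.
preds = [("genius", 1), ("diligent", 1), ("strong", 1), ("parent", 2)]
units = input_units(preds, ["x", "y"])
# -> five units: genius(x), diligent(x), strong(x), parent(x,y), parent(y,x)
```

These five units are exactly the literals L_i whose truth values Equation (1) assigns.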
In addition, the target value for an output unit is defined as follows:

    T_kj = 1   if L_k θ_j is a positive example
         = 0   otherwise                                         (2)

where T_kj and L_k are the target value for output unit k when feeding example j and the literal represented by output unit k, respectively.

For instance, with the inputs of Figure 1, if the fed example is rich(Alan), then the first three units, genius(x), diligent(x), and strong(x), receive 0, 1, and 0 as their inputs, respectively, because the literal diligent(Alan) is true in background knowledge while genius(Alan) and strong(Alan) are false. The target value for the output unit is 1 because rich(Alan) is a positive example. However, the input values for the literals parent(x,y) and parent(y,x) cannot be determined as easily, since many possible constants can be mapped to the relational variables. For unit parent(y,x), variable x is certainly replaced by Alan, but it is ambiguous with which constant (Alan, Bob, or Chris) variable y should be replaced. If we select Bob for the substitution, the truth value for this input unit is 1; the other substitutions give 0 for this unit (see Table 1). The truth value for parent(x,y) is 0 under any substitution.

From the above example, the input value for unit parent(y,x) is uncertain and cannot be easily determined for network training. This problem occurs whenever background knowledge contains relational data and the learner cannot determine the appropriate value for the variable substitution.

Table 1. Input value of unit parent(y,x) for each constant replacement.

Substitution for unit parent(y,x)   Input value
Replace x by Alan, and y by Alan    0
Replace x by Alan, and y by Bob     1
Replace x by Alan, and y by Chris   0

To solve this problem, we use the power of Multiple-Instance Learning (MIL) to provide the input data for our network. In the MIL framework [13, 14], the training set is composed of a set of bags, each of which is a collection of a different number of instances.
A bag is labeled negative if all the instances in it are negative; if a bag contains at least one positive instance, it is labeled positive. With this concept, we define the FOLNN training data as a set of training bags {B_1, B_2, ..., B_n}, where n is the number of examples, both positive and negative. A bag is labeled positive if its example is positive, and negative otherwise (in multi-class classification, all bags are labeled as positive bags of their classes). A positive bag is given 1 as its target value and a negative bag 0, as defined in Equation (2). Each bag contains m_i instances {B_i1, B_i2, ..., B_imi}, where B_ij is one possible binding (substitution). This is the key point: we can now use all cases of variable substitution together as one bag for learning, so selecting the appropriate value is no longer a problem. Consider the positive bag for example rich(Alan) as input data (see Table 2).

Table 2. Transformation of example rich(Alan) into input data of FOLNN.

Positive bag of example rich(Alan)   genius(x)  diligent(x)  strong(x)  parent(x,y)  parent(y,x)
Replace x by Alan, and y by Alan         0          1            0           0            0
Replace x by Alan, and y by Bob          0          1            0           0            1
Replace x by Alan, and y by Chris        0          1            0           0            0

As shown in Table 2, the bag rich(Alan) has 3 instances, each of which corresponds to one substitution. The positive bag rich(Bob) and the negative bag rich(Chris) likewise have 3 instances each. For training, these 3 bags are fed to the network one by one, and the network weights are adapted by the Backpropagation (BP) algorithm for MIL [19], as described in the next subsection.
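The transformation of example rich(Alan) into the bag of Table 2 can be sketched as follows. This is an illustrative reimplementation under our own representation of facts and units, not the authors' code:

```python
# Ground facts from the background knowledge of Figure 1.
FACTS = {("diligent", ("Alan",)), ("parent", ("Bob", "Alan")),
         ("genius", ("Bob",)), ("diligent", ("Bob",)),
         ("diligent", ("Chris",)), ("strong", ("Chris",))}

# The five input-layer units: a predicate name and its variable tuple.
UNITS = [("genius", ("x",)), ("diligent", ("x",)), ("strong", ("x",)),
         ("parent", ("x", "y")), ("parent", ("y", "x"))]

CONSTANTS = ["Alan", "Bob", "Chris"]

def make_bag(example_constant):
    """Build the MIL bag for one example: one binary instance (Equation (1))
    per substitution of the free variable y; x is bound by the example."""
    bag = []
    for y in CONSTANTS:
        theta = {"x": example_constant, "y": y}
        instance = [1 if (name, tuple(theta[v] for v in vars_)) in FACTS else 0
                    for name, vars_ in UNITS]
        bag.append(instance)
    return bag

bag = make_bag("Alan")
# Reproduces the three rows of Table 2:
# [[0, 1, 0, 0, 0], [0, 1, 0, 0, 1], [0, 1, 0, 0, 0]]
```

Since rich(Alan) is a positive example, this bag would be labeled positive with target value 1, following Equation (2).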
2.3. Training the network

To train the network, training bags are fed to the network to adapt the network weights. Weight adaptation is based on the BP algorithm, and the activation function is the sigmoid function. Suppose the network has p input units, o output units, and one hidden layer. The global error function E of the network is defined as follows:

    E = sum_{i=1}^{n} E_i                                        (3)

where E_i is the error on bag i, defined according to the type of the bag as:

    E_i = min_{1<=j<=m_i} sum_{k=1}^{o} E_ijk   if B_i = +
        = max_{1<=j<=m_i} sum_{k=1}^{o} E_ijk   if B_i = -       (4)

    E_ijk = 0                       if B_i = + and 0.5 <= o_ijk for the unit k with l_ik = 1
          = 0                       if B_i = - and o_ijk < 0.5 for all k
          = (1/2)(l_ik - o_ijk)^2   otherwise                    (5)

where:
- E_ijk is the error of output unit k on instance j of bag i,
- B_i = + denotes a positive bag and B_i = - a negative bag,
- o_ijk is the actual output of output unit k on instance j of bag i, and
- l_ik is the target output of output unit k for bag i.

With the error function defined above, the BP algorithm is readily adapted for training FOLNN. In each training epoch, the training bags are fed to the network one by one, and the error E_ijk is computed according to Equation (5). For a positive bag B_i, if E_ijk is 0, then the remaining instances of this bag are disregarded and the weights are not changed for this epoch. Otherwise, the process continues; when all the instances of B_i have been fed, E_i is computed using Equation (4) and the weights in the network are changed according to the weight update rule of BP [17]. Then the next training bag is fed to the network, and the training process is repeated until the number of training iterations reaches a predefined threshold or the global error E in Equation (3) falls below a predefined threshold. After training, the network can be used to classify unseen data.

3. Results

In the previous section, the three steps of the FOLNN learning algorithm were described. In this section, we evaluate FOLNN on the finite element mesh design and mutagenesis datasets, two well-known ILP problems, and compare the results obtained by FOLNN with those obtained by an ILP system.

3.1. Datasets
3.1.1. Finite Element Mesh Design. The finite element mesh design dataset [15] consists of 5 structures and has 13 classes (the 13 possible numbers of partitions for an edge in a structure). There are 278 examples, each of the form mesh(Edge,Number_of_elements), where Edge is an edge label (unique for each edge) and Number_of_elements indicates the number of partitions. The background knowledge contains relations describing the type of an edge (e.g., circuit, short), boundary conditions (e.g., free, fixed), loadings (e.g., not_loaded, one_side_loaded), and the structure of the object (e.g., neighbour, opposite). The goal of finite element mesh design is to learn general rules describing how many elements should be used to model each edge of a structure.

3.1.2. Mutagenesis. The mutagenesis dataset [16] consists of 188 molecules, of which 125 are mutagenic (active) and 63 are non-mutagenic (inactive). A molecule is described by listing its atoms as atom(AtomID,Element,Type,Charge) and the bonds between atoms as bond(Atom1,Atom2,BondType). This is a two-class learning problem: predicting whether a molecule is active or inactive in terms of mutagenicity.

3.2. Experiments

For the finite element mesh design dataset, we create a network containing 130 units in the input layer (determined by the predicates in background knowledge), 13 output units (the number of classes), and one hidden layer with 80 hidden units (determined by experiment). For the mutagenesis dataset, the constructed network has 235 input units, 100 hidden units, and 2 output units.
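For concreteness, these experiment networks are plain three-layer sigmoid feedforward networks. The mesh-design architecture can be sketched as follows; the initialization scale, seed, and helper names are our own assumptions, and bias terms are omitted for brevity:

```python
import math
import random

def init_network(n_in, n_hidden, n_out, scale=0.05, seed=0):
    """Create weight matrices for a three-layer network,
    initialized to small random numbers."""
    rnd = random.Random(seed)
    w1 = [[rnd.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [[rnd.uniform(-scale, scale) for _ in range(n_hidden)] for _ in range(n_out)]
    return w1, w2

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x):
    """Sigmoid forward pass: input -> hidden -> output."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sigmoid(sum(w * hi for w, hi in zip(row, h))) for row in w2]

# Mesh-design network: 130 binary inputs, 80 hidden units, 13 class outputs.
w1, w2 = init_network(130, 80, 13)
out = forward(w1, w2, [0.0] * 130)   # 13 outputs, one per class, each in (0, 1)
```

The mutagenesis network would be built the same way with init_network(235, 100, 2).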
The weights of the two networks are randomly initialized and then adapted using the BP algorithm with the sigmoid activation function. We performed three-fold cross validation [20] on each dataset: the dataset is partitioned into three roughly equal-sized subsets with roughly the same class proportions as the original dataset; each subset is used as the test set once, with the remaining subsets used as the training set; and the final result is the average over the three folds. For each fold of both datasets, we trained FOLNN with learning rate 0.0001 and momentum 0.97.

Table 3. The percent accuracies of FOLNN and PROGOL on first-order datasets; FEM - Finite Element Mesh Design, MUTA - Mutagenesis.

Dataset   FOLNN   PROGOL
FEM       59.18   57.80
MUTA      88.27   84.58

The average three-fold results on the FEM and MUTA datasets are summarized in Table 3. PROGOL [21], a state-of-the-art ILP system, was used as the baseline for our proposed method, FOLNN. The experimental results show that FOLNN is more accurate than PROGOL on both datasets, which we attribute to the weakness of the rules generated by PROGOL.

In addition to the results on the original datasets, to see how well our learner handles noisy data, we also evaluated FOLNN on a noisy domain, using the mutagenesis dataset. Using the three-fold data of the mutagenesis dataset from the previous experiment, 10% and 15% class noise was randomly added to the training set; no noise was added to the test set. In our setting, adding x% noise means that the class value is replaced with a wrong value in x out of 100 randomly selected examples. The accuracies of PROGOL and FOLNN on noisy data are shown in Table 4.

Table 4. Performance comparison on the noisy mutagenesis dataset.
                             Noise level in dataset
                             10%      15%
PROGOL (0% noise setting)    69.72    61.54
PROGOL (10% noise setting)   71.29    65.31
PROGOL (15% noise setting)   64.23    60.56
FOLNN                        84.01    81.28

Since PROGOL can handle noise in data via an option, "x% noise setting" in the table specifies that the noise option of PROGOL is set to x%. As can be seen in Table 4, our proposed algorithm still achieves higher average accuracies than PROGOL. When 10% and 15% noise is added to the dataset, PROGOL's performance drops significantly because of its sensitivity to noise, which is the main disadvantage of first-order rules directly induced by an ILP system. The accuracy of our method decreases much more slowly and remains much higher than that of PROGOL. FOLNN, thanks to the noise tolerance gained by incorporating neural networks, is more robust against noise than the original first-order rules: it mitigates overfitting to noisy data by giving higher weights to important features and less attention to unimportant ones.

4. Conclusions

Learning first-order logic programs with neural networks is still an open problem. This paper presented FOLNN (First-Order Logical Neural Network), a novel hybrid connectionist-symbolic system based on the feedforward neural network that incorporates inductive learning from examples and background knowledge. FOLNN alleviates the problem that first-order rules induced by ILP systems are not robust to noisy or unseen data. The prominent advantage of FOLNN is that it can learn directly from inputs provided in the form of first-order logic programs. Other learners cannot learn such programs directly because they cannot select the appropriate values for variable substitution; our method solves this problem by applying the MIL concept to obtain well-defined input data from first-order logic input.
The experimental results show that FOLNN provides noise tolerance and better performance than PROGOL. This is due to the ability of neural networks to select important attributes, giving them higher weights and unimportant attributes lower ones.

Although our main objective is to learn first-order logic, FOLNN can be applied to other tasks, such as learning from propositional datasets with missing attribute values. One interesting issue is knowledge extraction. Extracting knowledge from a trained network is one phase of a neural-symbolic learning system [9] and is of significant interest in data mining and knowledge discovery applications such as medical diagnosis. This phase is not included in the present work, and we have not yet explored rule extraction from trained networks; nevertheless, we surmise that many existing techniques [9, 22-24] can be adapted to extract rules from our networks.

5. Acknowledgement

This work was supported by the Thailand Research Fund.

6. References

[1] N. Lavrac and S. Dzeroski, Inductive Logic Programming: Techniques and Applications, Ellis Horwood, New York, 1994.
[2] S.-H. Nienhuys-Cheng and R. de Wolf, Foundations of Inductive Logic Programming, Springer-Verlag, New York, 1997.
[3] S. Dzeroski, S. Schulze-Kremer, K. R. Heidtke, K. Siems, and D. Wettschereck, "Applying ILP to Diterpene Structure Elucidation from 13C NMR Spectra", Proceedings of the Sixth International Workshop on Inductive Logic Programming, 1996.
[4] S. Wermter and R. Sun, "An Overview of Hybrid Neural Systems", Hybrid Neural Systems, Lecture Notes in Artificial Intelligence 1778, S. Wermter and R. Sun, Eds., Springer, 2000, pp. 1-13.
[5] C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
[6] G. G. Towell and J. W. Shavlik, "Knowledge-based artificial neural networks", Artificial Intelligence, vol. 70(1-2), 1994, pp. 119-165.
[7] J. J. Mahoney and R. J. Mooney, "Combining connectionist and symbolic learning to refine certainty-factor rule-bases", Connection Science, vol. 5, 1993, pp. 339-364.
[8] R. Parekh and V. Honavar, "Constructive Theory Refinement in Knowledge Based Neural Networks", Proceedings of the International Joint Conference on Neural Networks, Anchorage, Alaska, 1998.
[9] A. S. d'Avila Garcez, K. B. Broda, and D. M. Gabbay, Neural-Symbolic Learning Systems, Springer-Verlag, 2002.
[10] L. Shastri and V. Ajjanagadde, "From simple associations to systematic reasoning", Behavioral and Brain Sciences, vol. 16, 1993, pp. 417-494.
[11] M. Botta, A. Giordana, and R. Piola, "FONN: Combining First Order Logic with Connectionist Learning", Proceedings of the 14th International Conference on Machine Learning, Nashville, TN, 1997.
[12] B. Kijsirikul, S. Sinthupinyo, and K. Chongkasemwongse, "Approximate Match of Rules Using Backpropagation Neural Networks", Machine Learning, vol. 44, 2001, pp. 273-299.
[13] Y. Chevaleyre and J.-D. Zucker, "A Framework for Learning Rules from Multiple Instance Data", Proceedings of the 12th European Conference on Machine Learning, Freiburg, Germany, 2001.
[14] X. Huang, S.-C. Chen, and M.-L. Shyu, "An Open Multiple Instance Learning Framework and Its Application in Drug Activity Prediction Problems", Proceedings of the Third IEEE Symposium on BioInformatics and BioEngineering (BIBE'03), Bethesda, Maryland, 2003.
[15] B. Dolsak and S. Muggleton, "The Application of Inductive Logic Programming to Finite Element Mesh Design", Inductive Logic Programming, S. Muggleton, Ed., Academic Press, 1992, pp. 453-472.
[16] A. Srinivasan, S. H. Muggleton, M. J. E. Sternberg, and R. D. King, "Theories for mutagenicity: a study in first-order and feature-based induction", Artificial Intelligence, vol. 85, 1996, pp. 277-299.
[17] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation", Parallel Distributed Processing, vol. 1, D. E. Rumelhart and J. L. McClelland, Eds., The MIT Press, Cambridge, MA, 1986.
[18] S. Holldobler and Y. Kalinke, "Towards a massively parallel computational model for logic programming", Proceedings of the ECAI'94 Workshop on Combining Symbolic and Connectionist Processing, 1994, pp. 68-77.
[19] Z.-H. Zhou and M.-L. Zhang, "Neural Networks for Multi-Instance Learning", Proceedings of the International Conference on Intelligent Information Technology, Beijing, China, 2002.
[20] T. M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[21] S. Roberts, An Introduction to Progol, Technical Manual, University of York, 1997.
[22] R. Andrews, J. Diederich, and A. B. Tickle, "Survey and Critique of Techniques for Extracting Rules from Trained Artificial Neural Networks", Knowledge-Based Systems, vol. 8, 1995, pp. 373-389.
[23] M. W. Craven, "Extracting Comprehensible Models from Trained Neural Networks", Ph.D. thesis, Department of Computer Science, University of Wisconsin-Madison, 1996.
[24] G. G. Towell and J. W. Shavlik, "The Extraction of Refined Rules from Knowledge-Based Neural Networks", Machine Learning, vol. 13, 1993, pp. 71-101.