15 Wk7 Decision Tree ILP

Download as pdf or txt
Download as pdf or txt
You are on page 1of 44

Logic

FOL
Syntax
FOL
Rules
(Copi)
1. Courses are either tough or boring.
2. Not all courses are boring.
3. Therefore there are tough courses.
(Cx, Tx, Bx, )
Dealing with Time

Translate into first-order logic :

• He left town in the morning

• Understanding leads to friendship


Russell & Norvig 3d ed:

8.1 to 8.4
9.1, 9.2 (skip 9.2.3 to 9.4),
9.5 resolution (skip 9.5.4 - 9.5.5)
Problems
from
Nilsson
82
Learning Logical Rules
Decision Trees

Duda and Hart, Ch.1


Russell & Norvig Ch. 18
Boolean Decision Trees
Attribute-based representations
• Examples described by attribute values (Boolean, discrete, continuous)
• E.g., situations where I will/won't wait for a table:

• Classification of examples is positive (T) or negative (F)


Learning decision trees
Problem: decide whether to wait for a table at a restaurant,
based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
Continuous orthogonal domains

classification and regression trees CART


[Breiman 84] ID3: [Quinlan 86]
Expressiveness
• Decision trees can express any function of the input attributes.
• E.g., for Boolean functions, truth table row → path to leaf:

• Trivially, there is a consistent decision tree for any training set with one path to
leaf for each example (unless f nondeterministic in x) but it probably won't
generalize to new examples

• Prefer to find more compact decision trees


Which attribute to use first?

Gain(S; A) = expected reduction in entropy due to sorting on


attribute A
Choosing an attribute
• Idea: a good attribute splits the examples into subsets that are
(ideally) "all positive" or "all negative"

6 2 0 4 4 6 2
• Gain (Patrons)  B( ) [ B( )  B( )  B( )]  0.541bits
12 12 2 12 4 12 2
1
• Gain (Type)  1  [1.B( )]  0 Information Gain is higher
2 for Patrons
Decision trees
• One possible representation for hypotheses
• E.g., here is the “true” tree for deciding whether to wait:
Information gain
• A chosen attribute A divides the training set E into subsets
E1, … , Ev according to their values for A, where A has v
distinct values.
v
p i ni pi
remainder( A)   B( )
i 1 p  n pi  ni
• Information Gain (IG) or reduction in entropy from the
attribute test:
p
IG ( A)  B( )  remainder ( A)
pn
• Choose the attribute with the largest IG
Decision tree learning
• Aim: find a small tree consistent with the training examples
• Idea: (recursively) choose "most significant" attribute as root of
(sub)tree
Example contd.
• Decision tree learned from the 12 examples:

• Substantially simpler than “true” tree---a more complex


hypothesis isn’t justified by small amount of data
Too many ways to order the tree

How many distinct decision trees with n Boolean attributes?


= number of Boolean functions
n
= number of distinct truth tables with 2n rows = 22

• E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616


trees
Hypothesis spaces
How many distinct decision trees with n Boolean attributes?
= number of Boolean functions
n
= number of distinct truth tables with 2n rows = 22

• E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616


trees

How many purely conjunctive hypotheses (e.g., Hungry  Rain)?


• Each attribute can be in (positive), in (negative), or out
 3n distinct conjunctive hypotheses
• More expressive hypothesis space
– increases chance that target function can be expressed
– increases number of hypotheses consistent with training set
 may get worse predictions
Inductive Logic Programming
Review

• literal : atomic formula or its negation


• clause : disjunction of literals
• CNF : conjunction of clauses
Horn Clauses
• Horn clause : clause with single positive literal
• E.g. ∼p ∨ ∼q ∨ ∼r ∨ s
or equivalently:
p∧q∧r⇒s
• useful restriction on full first-order logic :
propositional Horn clauses are satisfiable in
polynomial time (HORNSAT)
• General proposition SAT: NP-complete
Prolog
• Prolog = Programming with Logic

• Works with Horn clauses


∼p ∨ ∼q ∨ ∼r ∨ s written as
s :- p, q, r

• read as s if p and q and r


• Resolving two Horn clauses always results in a
Horn clause
Example
• Set of Horn clauses :
[r,∼p,∼q] [p] [q]
• Goal: to prove r
 Add goal negation ∼r
[r,∼p,∼ q] [a] [q] [∼r]
Resolution refutation

[r, ~p, ~q] [p] [q] [~r]

[r, ~q] [q] [~r]

[r, ] [~r]

[]
Logic Programming
• example of a logic program:
parent_of(charles,george).
parent_of(george,diana).
parent_of(bob,harry).
parent_of(harry,elizabeth).
grandparent_of(X,Y) :- parent_of(X,Z),
parent_of(Z,Y).
• From the program, we can ask queries about
grandparents.
• Query: grandparent_of(X,Y)?
• Answers:
• grandparent_of(charles,diana).
• grandparent_of(bob,elizabeth).
Forms of Reasoning
• Deduction: From causes to effect (Prediction)
– fact a, rule a => b
INFER b (*First-order logic*)

• Abduction: From effects to possible causes (Explanation)


– rule a => b, observe b
AN EXPLANATION a

• Induction: From correlated observations to rules (Learning)


– observe correlation between a1, b1, ... an, bn
LEARN RULE a -> b
Inductive Logic Programming :
Progol

Given:
Background knowledge (of the domain): facts
Examples (of the relation to be learned): facts

Try to learn
Theories (as a result of learning): rules

ILP : a form of machine learning where both the data


and the hypotheses are logical expressions
ILP – formal definitions
• Given
– a logic program B representing background
knowledge
– a set of positive examples E+
– a set of negative examples E-
• Find hypothesis H such that:
1. B U H e for every e  E+.
2. B U H f for every f  E-.
3. B U H is consistent.
Assume that B ⊭ e for some e  E+.
Example
• Background knowledge B:
– parent_of(charles,george).
– parent_of(george,diana).
– parent_of(bob,harry).
– parent_of(harry,elizabeth).

• Positive examples E+:


– grandparent_of(charles,diana).
– grandparent_of(bob,elizabeth).

• Generate hypothesis H:
– grandparent_of(X,Y) :- parent_of(X,Z),
parent_of(Z,Y).
Rule Learning (Intuition)
• How to come up with a rule for grandparent_of(X,Y)?
1. Take the example grandparent_of(bob,elizabeth).
2. Find the subset of background knowledge relevant to this
example: parent_of(bob,harry),
parent_of(harry,elizabeth).
3. Form a rule from these facts
grandparent_of(bob,elizabeth) :-
parent_of(bob,harry), parent_of(harry,elizabeth).
4. Generalize the rule
grandparent_of(X,Y) :- parent_of(X,Z), parent_of(Z,Y).
5. Check if this rule is valid w.r.t the positive and the negative
examples
Progol Algorithm Outline

1. From a subset of positive examples, construct the


most specific rule rs.
2. Based on rs, find a generalized form rg of rs so that
score(rg) has the highest value among all
candidates.
3. Remove all positive examples that are covered by
rg.
4. Go to step 1 if there are still positive examples that
are not yet covered.
Scoring hypotheses
• score(r) is a measure of how well a rule r
explains all the examples with preference
given to shorter rules.
– pr = number of +ve examples correctly deducible
from r
– nr = number of -ve examples correctly deducible
from r
– cr = number of body literals in rule r
– score(r) = pr – (nr + cr)
Decision Theory
Decision Theory
Inference step
Determine either or .

Decision step
For given x, determine optimal t.
Minimum Expected Loss
Example: classify medical images as ‘cancer’ or ‘normal’

Loss matrix L:

Decision
Truth
Minimum Expected Loss

Regions are chosen to


minimize
Reject Option
Why Separate Inference and
Decision?
• Minimizing risk (loss matrix may change over time)
• Reject option
• Unbalanced class priors
• Combining models
Decision Theory for Regression
Inference step
Determine .

Decision step
For given x, make optimal
prediction, y(x), for t.

Loss function:
The Squared Loss Function
Performance measurement
• How do we know that h ≈ f ?
1. Use theorems of computational/statistical learning theory
2. Try h on a new test set of examples
(use same distribution over example space as training set)
Learning curve = % correct on test set as a function of training set size

You might also like