15 Wk7 Decision Tree ILP

Logic
FOL
Syntax
FOL
Rules
(Copi)
1. Courses are either tough or boring.
2. Not all courses are boring.
3. Therefore there are tough courses.
(Cx, Tx, Bx, )
Dealing with Time
Translate into first-order logic :
• He left town in the morning
• Understanding leads to friendship

Russell & Norvig 3d ed:
8.1 to 8.4
9.1, 9.2 (skip 9.2.3 to 9.4),
9.5 resolution (skip 9.5.4 - 9.5.5)
Problems
from
Nilsson
82
Learning Logical Rules
Decision Trees
Duda and Hart, Ch.1

Russell & Norvig Ch. 18
Boolean Decision Trees
Attribute-based representations
• Examples described by attribute values (Boolean, discrete, continuous)
• E.g., situations where I will/won't wait for a table:
• Classification of examples is positive (T) or negative (F)

Learning decision trees
Problem: decide whether to wait for a table at a restaurant,
based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
Continuous orthogonal domains
classification and regression trees CART

[Breiman 84] ID3: [Quinlan 86]
Expressiveness
• Decision trees can express any function of the input attributes.
• E.g., for Boolean functions, truth table row → path to leaf:
• Trivially, there is a consistent decision tree for any training set with one path to
leaf for each example (unless f nondeterministic in x) but it probably won't
generalize to new examples
• Prefer to find more compact decision trees

Which attribute to use first?
Gain(S; A) = expected reduction in entropy due to sorting on

attribute A
Choosing an attribute
• Idea: a good attribute splits the examples into subsets that are
(ideally) "all positive" or "all negative"
6 2 0 4 4 6 2
• Gain (Patrons)  B( ) [ B( )  B( )  B( )]  0.541bits
12 12 2 12 4 12 2
1
• Gain (Type)  1  [1.B( )]  0 Information Gain is higher
2 for Patrons
Decision trees
• One possible representation for hypotheses
• E.g., here is the “true” tree for deciding whether to wait:
Information gain
• A chosen attribute A divides the training set E into subsets
E1, … , Ev according to their values for A, where A has v
distinct values.
v
p i ni pi
remainder( A)   B( )
i 1 p  n pi  ni
• Information Gain (IG) or reduction in entropy from the
attribute test:
p
IG ( A)  B( )  remainder ( A)
pn
• Choose the attribute with the largest IG
Decision tree learning
• Aim: find a small tree consistent with the training examples
• Idea: (recursively) choose "most significant" attribute as root of
(sub)tree
Example contd.
• Decision tree learned from the 12 examples:
• Substantially simpler than “true” tree---a more complex

hypothesis isn’t justified by small amount of data
Too many ways to order the tree
How many distinct decision trees with n Boolean attributes?

= number of Boolean functions
n
= number of distinct truth tables with 2n rows = 22
• E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616

trees
Hypothesis spaces
How many distinct decision trees with n Boolean attributes?
= number of Boolean functions
n
= number of distinct truth tables with 2n rows = 22
• E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616

trees
How many purely conjunctive hypotheses (e.g., Hungry  Rain)?

• Each attribute can be in (positive), in (negative), or out
 3n distinct conjunctive hypotheses
• More expressive hypothesis space
– increases chance that target function can be expressed
– increases number of hypotheses consistent with training set
 may get worse predictions
Inductive Logic Programming
Review
• literal : atomic formula or its negation

• clause : disjunction of literals
• CNF : conjunction of clauses
Horn Clauses
• Horn clause : clause with single positive literal
• E.g. ∼p ∨ ∼q ∨ ∼r ∨ s
or equivalently:
p∧q∧r⇒s
• useful restriction on full first-order logic :
propositional Horn clauses are satisfiable in
polynomial time (HORNSAT)
• General proposition SAT: NP-complete
Prolog
• Prolog = Programming with Logic
• Works with Horn clauses

∼p ∨ ∼q ∨ ∼r ∨ s written as
s :- p, q, r
• read as s if p and q and r

• Resolving two Horn clauses always results in a
Horn clause
Example
• Set of Horn clauses :
[r,∼p,∼q] [p] [q]
• Goal: to prove r
 Add goal negation ∼r
[r,∼p,∼ q] [a] [q] [∼r]
Resolution refutation
[r, ~p, ~q] [p] [q] [~r]
[r, ~q] [q] [~r]
[r, ] [~r]
[]
Logic Programming
• example of a logic program:
parent_of(charles,george).
parent_of(george,diana).
parent_of(bob,harry).
parent_of(harry,elizabeth).
grandparent_of(X,Y) :- parent_of(X,Z),
parent_of(Z,Y).
• From the program, we can ask queries about
grandparents.
• Query: grandparent_of(X,Y)?
• Answers:
• grandparent_of(charles,diana).
• grandparent_of(bob,elizabeth).
Forms of Reasoning
• Deduction: From causes to effect (Prediction)
– fact a, rule a => b
INFER b (*First-order logic*)
• Abduction: From effects to possible causes (Explanation)

– rule a => b, observe b
AN EXPLANATION a
• Induction: From correlated observations to rules (Learning)

– observe correlation between a1, b1, ... an, bn
LEARN RULE a -> b
Inductive Logic Programming :
Progol
Given:
Background knowledge (of the domain): facts
Examples (of the relation to be learned): facts
Try to learn
Theories (as a result of learning): rules
ILP : a form of machine learning where both the data

and the hypotheses are logical expressions
ILP – formal definitions
• Given
– a logic program B representing background
knowledge
– a set of positive examples E+
– a set of negative examples E-
• Find hypothesis H such that:
1. B U H e for every e  E+.
2. B U H f for every f  E-.
3. B U H is consistent.
Assume that B ⊭ e for some e  E+.
Example
• Background knowledge B:
– parent_of(charles,george).
– parent_of(george,diana).
– parent_of(bob,harry).
– parent_of(harry,elizabeth).
• Positive examples E+:

– grandparent_of(charles,diana).
– grandparent_of(bob,elizabeth).
• Generate hypothesis H:
– grandparent_of(X,Y) :- parent_of(X,Z),
parent_of(Z,Y).
Rule Learning (Intuition)
• How to come up with a rule for grandparent_of(X,Y)?
1. Take the example grandparent_of(bob,elizabeth).
2. Find the subset of background knowledge relevant to this
example: parent_of(bob,harry),
parent_of(harry,elizabeth).
3. Form a rule from these facts
grandparent_of(bob,elizabeth) :-
parent_of(bob,harry), parent_of(harry,elizabeth).
4. Generalize the rule
grandparent_of(X,Y) :- parent_of(X,Z), parent_of(Z,Y).
5. Check if this rule is valid w.r.t the positive and the negative
examples
Progol Algorithm Outline
1. From a subset of positive examples, construct the

most specific rule rs.
2. Based on rs, find a generalized form rg of rs so that
score(rg) has the highest value among all
candidates.
3. Remove all positive examples that are covered by
rg.
4. Go to step 1 if there are still positive examples that
are not yet covered.
Scoring hypotheses
• score(r) is a measure of how well a rule r
explains all the examples with preference
given to shorter rules.
– pr = number of +ve examples correctly deducible
from r
– nr = number of -ve examples correctly deducible
from r
– cr = number of body literals in rule r
– score(r) = pr – (nr + cr)
Decision Theory
Decision Theory
Inference step
Determine either or .
Decision step
For given x, determine optimal t.
Minimum Expected Loss
Example: classify medical images as ‘cancer’ or ‘normal’
Loss matrix L:
Decision
Truth
Minimum Expected Loss
Regions are chosen to

minimize
Reject Option
Why Separate Inference and
Decision?
• Minimizing risk (loss matrix may change over time)
• Reject option
• Unbalanced class priors
• Combining models
Decision Theory for Regression
Inference step
Determine .
Decision step
For given x, make optimal
prediction, y(x), for t.
Loss function:
The Squared Loss Function
Performance measurement
• How do we know that h ≈ f ?
1. Use theorems of computational/statistical learning theory
2. Try h on a new test set of examples
(use same distribution over example space as training set)
Learning curve = % correct on test set as a function of training set size

15 Wk7 Decision Tree ILP

Uploaded by

Copyright:

Available Formats

15 Wk7 Decision Tree ILP

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

15 Wk7 Decision Tree ILP

Uploaded by

Copyright:

Available Formats

Logic

Translate into first-order logic :

• He left town in the morning

• Understanding leads to friendship

Duda and Hart, Ch.1

• Classification of examples is positive (T) or negative (F)

classification and regression trees CART

• Prefer to find more compact decision trees

Gain(S; A) = expected reduction in entropy due to sorting on

• Substantially simpler than “true” tree---a more complex

How many distinct decision trees with n Boolean attributes?

• E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616

• E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616

How many purely conjunctive hypotheses (e.g., Hungry  Rain)?

• literal : atomic formula or its negation

• Works with Horn clauses

• read as s if p and q and r

[r, ~p, ~q] [p] [q] [~r]

[r, ~q] [q] [~r]

• Abduction: From effects to possible causes (Explanation)

• Induction: From correlated observations to rules (Learning)

ILP : a form of machine learning where both the data

• Positive examples E+:

1. From a subset of positive examples, construct the

Regions are chosen to

You might also like