Decision Tree

The key takeaways are that decision trees are hierarchical models that implement a divide and conquer strategy for classification and regression problems. They are efficient nonparametric methods.

The main components of a decision tree are internal decision nodes, terminal leaves, and branches. At each node a test is applied to partition the data and reach a leaf node which makes a prediction.

The two main approaches to avoid overfitting in decision trees are prepruning, which halts tree construction early, and postpruning, which removes branches from a fully grown tree.

Decision Trees
• A decision tree is a hierarchical data structure
implementing the divide-and-conquer strategy.
• It is an efficient nonparametric method, which
can be used for both classification and
regression.
• We discuss learning algorithms that build the
tree from a given labeled training sample, as
well as how the tree can be converted to a set
of simple rules that are easy to understand.
Decision Trees
• In parametric estimation, we define a model over
the whole input space and learn its parameters
from all of the training data. Then we use the
same model and the same parameter set for any
test input.
• In nonparametric estimation, we divide the input
space into local regions, defined by a distance
measure like the Euclidean norm, and for each
input, the corresponding local model computed
from the training data in that region is used.
Decision Trees
• A decision tree is a hierarchical model for supervised
learning whereby the local region is identified in a
sequence of recursive splits in a smaller number of
steps.
• A decision tree is composed of internal decision nodes
and terminal leaves.
• Given an input, at each node, a test is applied and one
of the branches is taken depending on the outcome.
• This process starts at the root and is repeated
recursively until a leaf node is hit, at which point the
value written in the leaf constitutes the output.
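
The traversal described above can be sketched in a few lines of Python. This is only an illustration: the `Leaf` and `DecisionNode` classes, the `predict` function, and the small outlook/humidity tree are hypothetical names and data, not part of the original slides.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Union


@dataclass
class Leaf:
    value: str  # the output "written in the leaf"


@dataclass
class DecisionNode:
    # test maps an input to the label of the branch to follow
    test: Callable[[dict], str]
    branches: Dict[str, "Tree"] = field(default_factory=dict)


Tree = Union[Leaf, DecisionNode]


def predict(node: Tree, x: dict) -> str:
    """Start at the root and apply one test per node until a leaf is hit."""
    if isinstance(node, Leaf):
        return node.value
    return predict(node.branches[node.test(x)], x)


# Hypothetical two-level tree, used only to exercise predict().
root = DecisionNode(
    test=lambda x: x["outlook"],
    branches={
        "sunny": DecisionNode(
            test=lambda x: "high" if x["humidity"] > 70 else "normal",
            branches={"high": Leaf("no"), "normal": Leaf("yes")},
        ),
        "rainy": Leaf("no"),
    },
)

print(predict(root, {"outlook": "sunny", "humidity": 60}))
```

Each call follows exactly one branch, so the cost of a prediction grows with the depth of the tree rather than with the number of training examples.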
Algorithm
• Basic algorithm
– Tree is constructed in a top-down recursive divide-and-conquer manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are
discretized in advance)
– Examples are partitioned recursively based on selected
attributes
– Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
– There are no samples left
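
A minimal sketch of the basic algorithm above, assuming categorical attributes and information gain as the selection measure. The function names (`entropy`, `information_gain`, `build_tree`), the nested-dict tree representation, and the tiny outlook/windy dataset are illustrative assumptions, not taken from the slides.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Class entropy minus the weighted entropy after splitting on attr."""
    n = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [y for r, y in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def build_tree(rows, labels, attrs, default=None):
    """Top-down recursive construction; returns a label or {attr: {value: subtree}}."""
    if not rows:                      # stopping condition: no samples left
        return default                # (only reachable if called with an empty sample)
    if len(set(labels)) == 1:         # stopping condition: all samples in one class
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not attrs:                     # stopping condition: no attributes left
        return majority               # majority voting labels the leaf
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    rest = [a for a in attrs if a != best]
    tree = {best: {}}
    for value in sorted({r[best] for r in rows}):
        pairs = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        tree[best][value] = build_tree(
            [r for r, _ in pairs], [y for _, y in pairs], rest, majority
        )
    return tree


# Hypothetical toy data, for illustration only.
rows = [
    {"outlook": "overcast", "windy": "no"},
    {"outlook": "overcast", "windy": "yes"},
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "play", "play", "play", "stay"]

tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree)
```

Branches are created only for attribute values that occur in the current partition, which is why the "no samples left" case is kept just as a guard in this sketch.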
Univariate Trees
• In a univariate tree, in each internal node, the
test uses only one of the input dimensions.
• For example, if an attribute is color ∈ {red,
blue, green}, then a node on that attribute has
three branches, each one corresponding to one
of the three possible values of the attribute.
Impurity
• Perfect purity: each split has either all
claims or all no-claims.
• Perfect impurity: each split has the same
proportion of claims as the overall
population.
Age   Competition   Type       Profit
Old   Yes           Software   Down
Old   No            Software   Down
Old   No            Hardware   Down
Mid   Yes           Software   Down
Mid   Yes           Hardware   Down
Mid   No            Hardware   Up
Mid   No            Software   Up
New   Yes           Software   Up
New   No            Hardware   Up
New   No            Software   Up
Formulas

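• Entropy (class) = $-\frac{P}{P+N}\log_2\frac{P}{P+N} - \frac{N}{P+N}\log_2\frac{N}{P+N}$
• Information gain of each attribute: $I(P_i, N_i)$ has the same formula,
applied to the counts $P_i$ and $N_i$ for the $i$-th value of the attribute
• Entropy (attribute) = $\sum_i \frac{P_i + N_i}{P + N}\, I(P_i, N_i)$
• Gain = Entropy (class) – Entropy (attribute)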
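
As a worked instance of these formulas, take the Profit table above, reading P as the number of Up rows and N as the number of Down rows (this reading is inferred from the counts used in the tables that follow). There are 5 Up and 5 Down rows, which is where the class entropy of 1 used below comes from. The convention $0 \log_2 0 = 0$ is assumed for pure partitions.

```latex
\begin{aligned}
\text{Entropy(class)} &= -\tfrac{5}{10}\log_2\tfrac{5}{10} - \tfrac{5}{10}\log_2\tfrac{5}{10} = 1 \\
I(2, 2) &= -\tfrac{2}{4}\log_2\tfrac{2}{4} - \tfrac{2}{4}\log_2\tfrac{2}{4} = 1 \\
I(0, 3) &= -0\log_2 0 - \tfrac{3}{3}\log_2\tfrac{3}{3} = 0 \\
I(1, 3) &= -\tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{3}{4}\log_2\tfrac{3}{4} \approx 0.811
\end{aligned}
```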
Age   Pi   Ni   I (Pi, Ni)
Old   0    3    0
Mid   2    2    1
New   3    0    0

Entropy of age = 0.4
Gain = 1 – 0.4 = 0.6

Competition

Competition   Pi   Ni   I (Pi, Ni)
Yes           1    3    0.811
No            4    2    0.918

Entropy of competition = 0.8755
Gain = 1 – 0.8755 = 0.1245
Type       Pi   Ni   I (Pi, Ni)
Software   3    3    1
Hardware   2    2    1

Entropy of type = 1
Gain = 1 – 1 = 0
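
The three gains can be checked numerically. The sketch below recomputes them from the ten-row Profit table. The names `DATA`, `COLUMNS`, `entropy`, and `gain` are illustrative, and the lowercase "new" entries of the table are normalized to "New".

```python
import math
from collections import Counter

# (Age, Competition, Type, Profit) rows copied from the table in this deck;
# "new" is normalized to "New".
DATA = [
    ("Old", "Yes", "Software", "Down"),
    ("Old", "No", "Software", "Down"),
    ("Old", "No", "Hardware", "Down"),
    ("Mid", "Yes", "Software", "Down"),
    ("Mid", "Yes", "Hardware", "Down"),
    ("Mid", "No", "Hardware", "Up"),
    ("Mid", "No", "Software", "Up"),
    ("New", "Yes", "Software", "Up"),
    ("New", "No", "Hardware", "Up"),
    ("New", "No", "Software", "Up"),
]
COLUMNS = {"Age": 0, "Competition": 1, "Type": 2}


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain(rows, col):
    """Entropy(class) minus Entropy(attribute) for the attribute in column col."""
    labels = [r[-1] for r in rows]
    weighted = 0.0
    for value in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == value]
        weighted += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - weighted


gains = {name: gain(DATA, idx) for name, idx in COLUMNS.items()}
for name, g in gains.items():
    print(f"{name}: {g:.4f}")
```

Age has the largest gain, so it is selected as the test at the root.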

Age
├─ Old: Down
├─ Mid: ?
└─ New: Up
The Mid branch still contains both classes, so the rows with Age = Mid are split again:

Age   Competition   Type       Profit
Mid   Yes           Software   Down
Mid   Yes           Hardware   Down
Mid   No            Hardware   Up
Mid   No            Software   Up

Competition   Pi   Ni   I (Pi, Ni)
Yes           0    2    0
No            2    0    0

Entropy of competition = 0
Gain = 1 – 0 = 1

Type       Pi   Ni   I (Pi, Ni)
Software   1    1    1
Hardware   1    1    1

Entropy of type = 1
Gain = 1 – 1 = 0
Age
├─ Old: Down
├─ Mid: Competition
│   ├─ Yes: Down
│   └─ No: Up
└─ New: Up
Tree Pruning
• Overfitting: An induced tree may overfit the training data
– Too many branches, some may reflect anomalies due to
noise or outliers
– Poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early; do not split a
node if this would result in the goodness measure falling
below a threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown” tree
to get a sequence of progressively pruned trees
• Use a set of data different from the training data to
decide which is the “best pruned tree”
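
One simple way to realize postpruning is reduced-error pruning, sketched below. It is a simplified stand-in for the procedure on the slide: instead of producing a whole sequence of pruned trees, it works bottom-up and replaces a subtree by its majority-class leaf whenever that does not increase the error on a held-out validation set. The tree layout (`attr`, `majority`, `branches`), the function names, and the data are all hypothetical.

```python
def classify(tree, x):
    """Follow branches until a leaf (a plain label) is reached."""
    while isinstance(tree, dict):
        # Unseen attribute values fall back to the node's majority class.
        tree = tree["branches"].get(x[tree["attr"]], tree["majority"])
    return tree


def errors(tree, rows):
    """Number of (input, label) pairs that the tree misclassifies."""
    return sum(classify(tree, x) != y for x, y in rows)


def prune(tree, validation):
    """Reduced-error pruning using only the validation rows that reach each node."""
    if not isinstance(tree, dict):
        return tree
    branches = {
        value: prune(child, [(x, y) for x, y in validation if x[tree["attr"]] == value])
        for value, child in tree["branches"].items()
    }
    pruned = {**tree, "branches": branches}
    leaf = tree["majority"]
    # Assumption: ties (including "no validation rows reach this node")
    # are resolved in favour of the simpler tree.
    if errors(leaf, validation) <= errors(pruned, validation):
        return leaf
    return pruned


# Hypothetical fully grown tree; the split under "old" is assumed to fit noise.
full_tree = {
    "attr": "age", "majority": "yes",
    "branches": {
        "young": {"attr": "student", "majority": "no",
                  "branches": {"no": "no", "yes": "yes"}},
        "old": {"attr": "income", "majority": "yes",
                "branches": {"high": "yes", "low": "no"}},
    },
}
# Hypothetical validation set, disjoint from the (unshown) training data.
validation = [
    ({"age": "young", "student": "no", "income": "low"}, "no"),
    ({"age": "young", "student": "yes", "income": "low"}, "yes"),
    ({"age": "old", "student": "no", "income": "low"}, "yes"),
    ({"age": "old", "student": "no", "income": "high"}, "yes"),
]

pruned_tree = prune(full_tree, validation)
print(pruned_tree)
```

In this example the split on student survives because removing it would cost a validation error, while the split on income is collapsed into a single leaf.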
Rule Extraction from Trees
• Rules are easier to understand than large trees
• One rule is created for each path from the
root to a leaf
• Each attribute-value pair along a path forms a
conjunction: the leaf holds the class prediction
• Rules are mutually exclusive and exhaustive
Rule Extraction from Trees
age?
├─ <=30 (young): student?
│   ├─ no: no
│   └─ yes: yes
├─ 31..40 (mid-age): yes
└─ >40 (old): credit rating?
    ├─ excellent: no
    └─ fair: yes

• Example: Rule extraction from our buys_computer decision-tree


IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
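
Rule extraction is a walk over all root-to-leaf paths. The sketch below encodes the buys_computer tree as a nested dict and regenerates the five rules. The dict encoding and the `extract_rules` function are illustrative assumptions, not something given on the slides.

```python
# The buys_computer tree as {attribute: {value: subtree or class label}}.
TREE = {
    "age": {
        "young": {"student": {"no": "no", "yes": "yes"}},
        "mid-age": "yes",
        "old": {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}


def extract_rules(tree, target, conditions=()):
    """Return one IF-THEN rule per path from the root to a leaf."""
    if not isinstance(tree, dict):
        # A leaf: the attribute-value pairs collected so far form the conjunction.
        body = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {body} THEN {target} = {tree}"]
    (attr, branches), = tree.items()  # each internal node tests one attribute
    rules = []
    for value, subtree in branches.items():
        rules.extend(extract_rules(subtree, target, conditions + ((attr, value),)))
    return rules


rules = extract_rules(TREE, "buys_computer")
for rule in rules:
    print(rule)
```

Because every input follows exactly one path, the rules produced this way are mutually exclusive and exhaustive.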
Multivariate Trees
• In a univariate tree, each split uses only one input
dimension; in a multivariate tree, a split can use
several input dimensions at once.
Multivariate Trees
• To approximate the class boundary, the
corresponding univariate decision tree uses a
series of orthogonal splits, whereas the
multivariate test uses only one linear split.
Multivariate Trees
• The multivariate decision tree-constructing algorithm selects
not the best attribute but the best linear combination of the
attributes: $\sum_{i=1}^{f} w_i x_i > w_0$
• $w_i$ are the weights associated with each feature $x_i$ and $w_0$ is
the threshold to be determined from the data.
• So there are basically two main operations in multivariate
algorithms: feature selection, which determines which features to
use, and finding the weights $w_i$ of those features and the
threshold $w_0$.
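
The difference between the two kinds of node test can be sketched as follows, assuming the multivariate test compares the weighted sum of the features against the threshold w0 with ">". The function names, weights, thresholds, and sample points are hypothetical and chosen only for illustration.

```python
def univariate_test(x, j, threshold):
    """Axis-aligned test: looks at a single input dimension j."""
    return x[j] > threshold


def multivariate_test(x, w, w0):
    """Linear test: weighted sum of all features compared with threshold w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) > w0


# Hypothetical oblique class boundary x1 + x2 > 1.
w, w0 = (1.0, 1.0), 1.0
points = [(0.2, 0.3), (0.8, 0.6), (0.9, 0.05)]

for p in points:
    print(p, multivariate_test(p, w, w0), univariate_test(p, 0, 0.5))
```

The point (0.9, 0.05) lies below the oblique boundary, yet a single split on the first dimension at 0.5 puts it on the positive side. A univariate tree would need further orthogonal splits to correct this, whereas the linear test handles it in one node.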