Lec4 Tree v2.4
Mingsheng Long
Tsinghua University
Spring 2024
Outline
• Decision Tree
-ID3 Algorithm
-C4.5 Algorithm
-Regression Tree
• Random Forest
-Bagging
-Breiman Algorithm
Machine Learning 2
2-class Linear Model Hypothesis Space
Classification: h_w(x) = w · Φ(x)
[Diagram: feature engineering maps the input space X = ℝ^d (discrete values, different scales) through a basis function Φ (polynomial basis + discretization) to the feature space Z = ℝ^D; the dataset carries discrete labels from the label space Y = {0,1}.]
Machine Learning 4
Human Classifier
• We find the most useful feature, such as shape: curved or round.
- Split the dataset: if the fruit is curved, then it is a banana.
• After the first split, select the next best feature until each node contains only one class.
[Tree: ① split on Shape (Curved vs. Round); ② split on Taste (Sweet vs. Sour).]
• Finally, we obtain a decision tree.
Machine Learning 5
Decision Tree
Interpretability ✓   Generalizes to unseen data ✓
• During test: route a new sample down the tree (first Curved vs. Round, then Sweet vs. Sour) and predict with the leaf it reaches.
[Figure: scatter plots of the dataset partitioned by the Curved/Round split and then the Sweet/Sour split.]
Machine Learning 7
Decision Tree
[Figure: an example decision tree with root node Outlook.]
Machine Learning 8
Outline
• Decision Tree
-ID3 Algorithm
-C4.5 Algorithm
-Regression Tree
• Random Forest
-Bagging
-Breiman Algorithm
Machine Learning 9
Node Splitting
• Which split is better?
[Figure: the Curved/Round split produces a single-class leaf; the alternative split produces no single-class leaf.]
Machine Learning 10
How to Measure a Node
• We want pure leaf nodes, as close to a single class as possible:
- Lead to a lower classification error.
- With n₁ and n₂ samples of the two classes in a node, the classification error is min(n₁, n₂)/(n₁ + n₂).
[Figure: a pure group has minimum impurity; less impure groups have lower impurity; a very impure group has maximum impurity.]
• We'll choose the splitting feature that minimizes the impurity measure.
Example
• Classification error: Err(S) = 1 − max_{1≤k≤K} |S_k|/|S|, where each S_k ⊂ S is the subset of samples of class k.
- Entropy (used in ID3 and C4.5):
  H(S) = − Σ_k (|S_k|/|S|) log₂ (|S_k|/|S|)   (convention: 0 log 0 = 0)
• What is the entropy of a group in which all examples belong to the same class?
- entropy = −1 · log₂ 1 = 0 → minimum impurity; not a good training set for learning.
• What is the entropy of a group with 50% in either class?
- entropy = −0.5 log₂ 0.5 − 0.5 log₂ 0.5 = 1 → maximum impurity; good training set for learning.
- Gini index (used in CART):
  Gini(S) = 1 − Σ_k (|S_k|/|S|)²   (max: 0.5, min: 0)
(Based on slides by Pedro Domingos; Mihaela van der Schaar, Classification and Regression Trees.)
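To make these measures concrete, here is a minimal Python sketch (illustrative, not from the slides; the helper name impurities is made up, and class counts are passed as a list):

import numpy as np

def impurities(counts):
    # Classification error, entropy (in bits), and Gini index from per-class counts.
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()                        # class proportions |S_k| / |S|
    err = 1.0 - p.max()                    # classification error
    nz = p[p > 0]                          # convention: 0 log 0 = 0
    entropy = float((nz * np.log2(1.0 / nz)).sum())
    gini = float(1.0 - (p ** 2).sum())
    return float(err), entropy, gini

print(impurities([10, 0]))   # pure node: (0.0, 0.0, 0.0)
print(impurities([5, 5]))    # 50/50 node: (0.5, 1.0, 0.5)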
Machine Learning 12
How to Measure a Split
Entire dataset (30 instances): S = S₁ ∪ S₂
Parent entropy:
H(S) = − (14/30) log₂ (14/30) − (16/30) log₂ (16/30) = 0.996
Child entropies:
S₁ (13 instances): H(S₁) = − (1/13) log₂ (1/13) − (12/13) log₂ (12/13) = 0.391
S₂ (17 instances): H(S₂) = − (13/17) log₂ (13/17) − (4/17) log₂ (4/17) = 0.787
• We want to measure this split – how much it reduces entropy.
- How about H(S) − (H(S₁) + H(S₂))/2?
- But the number of instances in each node is different → weighting.
Machine Learning 13
Information Gain (IG)
• A good split gives a minimal weighted average of the child-node impurities:
(|S₁|/|S|) H(S₁) + (|S₂|/|S|) H(S₂)
• Equivalent to maximizing the Information Gain (IG) of splitting the parent node S = S₁ ∪ S₂ into child nodes S₁ and S₂:
IG = H(S₁ ∪ S₂) − (|S₁|/|S|) H(S₁) − (|S₂|/|S|) H(S₂)
• In information theory, entropy is the expected number of bits needed to encode a randomly drawn value of C.
• Larger entropy means more uncertainty, i.e., less information; reducing entropy is therefore gaining information. That's why we call it IG.
https://en.wikipedia.org/wiki/Entropy_(information_theory)
Claude Shannon
Machine Learning 14
Calculating Information Gain (IG)
Entire population (30 instances), parent entropy:
H(S) = − (14/30) log₂ (14/30) − (16/30) log₂ (16/30) = 0.996
Child entropy (13 instances):
H(S₁) = − (1/13) log₂ (1/13) − (12/13) log₂ (12/13) = 0.391
Child entropy (17 instances):
H(S₂) = − (13/17) log₂ (13/17) − (4/17) log₂ (4/17) = 0.787
Information Gain (IG):
0.996 − (13/30) · 0.391 − (17/30) · 0.787 = 0.38
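The same calculation in a short Python sketch (illustrative, not from the slides; the entropy helper takes per-class counts), reproducing the numbers above:

import numpy as np

def entropy(counts):
    # Entropy in bits from per-class counts.
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float((p * np.log2(1.0 / p)).sum())

parent = [14, 16]                  # 30 instances in total
left, right = [1, 12], [13, 4]     # 13 and 17 instances after the split

ig = entropy(parent) - 13/30 * entropy(left) - 17/30 * entropy(right)
print(round(ig, 2))                # 0.38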
Machine Learning 15
Split Search
[Figure: dataset distribution with four classes; two candidate splits — Split 1 (top vs. bottom) and Split 2 (left vs. right); the better split attains Information Gain = 0.40.]
• Four classes: how to compute IG? (The entropy sum simply runs over all four classes.)
Machine Learning 17
ID3 Algorithm
[Algorithm box: ID3 grows the tree recursively, splitting each node on the feature with the highest Information Gain until every leaf contains a single class label (stop criterion).]
• Cost: O(dn) work in each layer; tree depth at most min(d, log n).
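A compact Python sketch of the ID3 recursion (an illustration under simplifying assumptions: categorical features stored as columns of an integer array, integer class labels, no pruning; the function names are made up):

import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float((p * np.log2(1.0 / p)).sum())

def id3(X, y, features):
    # Stop criterion: the node contains a single class label (or no features remain).
    if len(np.unique(y)) == 1 or not features:
        return int(np.bincount(y).argmax())          # leaf: majority class
    # Choose the feature with the highest Information Gain.
    def gain(j):
        return entropy(y) - sum((X[:, j] == v).mean() * entropy(y[X[:, j] == v])
                                for v in np.unique(X[:, j]))
    best = max(features, key=gain)
    rest = [f for f in features if f != best]
    # One child per value of the chosen feature.
    return {best: {int(v): id3(X[X[:, best] == v], y[X[:, best] == v], rest)
                   for v in np.unique(X[:, best])}}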
Machine Learning 18
Outline
• Decision Tree
-ID3 Algorithm
-C4.5 Algorithm
-Regression Tree
• Random Forest
-Bagging
-Breiman Algorithm
Machine Learning 19
ID3 → C4.5
• Modification on criteria: IG rate (gain ratio).
• Solve overfitting: pruning.
• Adapt to various features: continuous values, missing values, attributes with costs.
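For reference, the gain-ratio criterion (the standard C4.5 formulation; not spelled out on the slide) normalizes IG by the entropy of the split itself, which penalizes attributes with many values:

GainRatio(S, A) = IG(S, A) / SplitInfo(S, A),  where
SplitInfo(S, A) = − Σ_{v ∈ values(A)} (|S_v|/|S|) log₂ (|S_v|/|S|)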
Machine Learning 20
Overfitting in Decision Tree
• A decision tree has the capacity to fit all training data:
- except noisy samples with the same features but different labels.
• Overfitting also appears in decision trees:
Desired
Overfitting
Machine Learning 21
Pre-Pruning
• Tree Complexity is measured by numbers of layers and branches.
- To prevent overfitting: Control number of layers – Pruning.
• The simplest way is pre-pruning:
• During training: for every split, use the change of accuracy on the validation set to measure the necessity of this split.
- If the accuracy grows by no more than a threshold ε, or even decreases, stop the split; otherwise go on training.
[Figure: a split with validation accuracy 71% before and 71% after is stopped (×); a split improving 71% → 80% is kept (✓).]
• May cause underfitting:
- a good split may not immediately change the classification error.
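In practice (an aside, not from the slides), pre-pruning is usually exposed as hyperparameters that stop splitting early; for instance, scikit-learn's DecisionTreeClassifier accepts:

from sklearn.tree import DecisionTreeClassifier

# Pre-pruning style controls: stop splitting instead of pruning afterwards.
clf = DecisionTreeClassifier(
    max_depth=4,                  # cap the number of layers
    min_samples_split=10,         # do not split very small nodes
    min_impurity_decrease=0.01,   # require a minimum impurity reduction per split
)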
Machine Learning 22
Post-Pruning
• Pre-pruning may lead to underfitting.
• A safer idea is post-pruning:
- prune the tree after training.
• From the leaf nodes to the root node:
- use the change of accuracy on the validation set to measure the necessity of pruning at each node;
- remove the node whose pruning most improves validation-set accuracy.
[Figure: validation accuracy of candidate prunings (e.g., 68% → 57%, 54% → 58%); overall accuracy before pruning: 54%, after pruning: 68%.]
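As an aside (not the reduced-error pruning described above), scikit-learn implements a related post-pruning scheme, cost-complexity pruning, controlled by ccp_alpha:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# A larger ccp_alpha prunes more of the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
clf = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[-2]).fit(X, y)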
How to understand the C4.5 decision tree? What does it learn?
Machine Learning 25
[Figure (ISLR, Tree-Based Methods): recursive binary splits X1 ≤ t1, X2 ≤ t2, X1 ≤ t3, X2 ≤ t4 partition the (X1, X2) feature space into axis-aligned regions R1, …, R5, together with the corresponding tree.]
- Curse of dimensionality.
[Figure: iris dataset scatter plot, Petal.Width vs. Petal.Length.]

## Construct Decision Trees
from sklearn.datasets import load_iris
from sklearn import tree

iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)
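To inspect what the fitted tree has learned (a follow-up to the snippet above; export_text and plot_tree are scikit-learn utilities), one can print the learned rules or draw the tree:

from sklearn.tree import export_text, plot_tree
import matplotlib.pyplot as plt

print(export_text(clf, feature_names=iris.feature_names))   # textual if-then rules
plot_tree(clf, feature_names=iris.feature_names, class_names=iris.target_names)
plt.show()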
Machine Learning 28
Expressivity vs. Interpretability
• Multivariate decision tree: good expressivity but weak interpretability.
• Stronger expressivity? Weaker interpretability.
[Figure: estimated decision boundaries of degree 1–4 (linear, quadratic, cubic, quartic) logistic regressions, shown against the Bayes decision boundary (purple dashed line); the test error rates of the four fits are 0.201, 0.197, 0.160, and 0.162, while the Bayes error rate is 0.133.]
[Figure (Fig. 3.1, decision tree): a tree is a set of nodes and edges organized in a hierarchical fashion — a graph with no loops; internal nodes are denoted with circles and terminal nodes with squares. A decision tree is a tree where each internal node stores a split (or test) function applied to the incoming data, and each leaf stores the final answer (predictor); the illustrative tree decides whether a photo represents an indoor or outdoor scene.]
Machine Learning 29
Outline
• Decision Tree
-ID3 Algorithm
-C4.5 Algorithm
-Regression Tree
• Random Forest
-Bagging
-Breiman Algorithm
Machine Learning 30
Regression Tree
[Figure 8.3 (ISLR): Top Left: a partition of two-dimensional feature space that could not result from recursive binary splitting. Top Right: the output of recursive binary splitting on a two-dimensional example (regions R1–R5 from splits X1 ≤ t1, X2 ≤ t2, X1 ≤ t3, X2 ≤ t4). Bottom Left: a tree corresponding to the partition in the top right panel. Bottom Right: a perspective plot of the prediction surface corresponding to that tree.]
ŷ = h(x) = Σ_{m=1}^{M} c_m · 1{x ∈ R_m}
• How to choose c_1, …, c_M (the constant predictions of the leaves)?
• For the L2 loss ℓ(ŷ, y) = (ŷ − y)², the best choice is the average. Why? (See the derivation below.)
• In regression we find nodes of low L2 loss with respect to their average target value.
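To answer the "Why?": a short derivation (standard, not spelled out on the slide). For the samples falling in region R_m, the constant c minimizing the total squared error satisfies

d/dc Σ_{i: xᵢ ∈ R_m} (yᵢ − c)² = −2 Σ_{i: xᵢ ∈ R_m} (yᵢ − c) = 0
⟹  c = (1/N_m) Σ_{i: xᵢ ∈ R_m} yᵢ = avg(yᵢ | xᵢ ∈ R_m),   where N_m is the number of samples in R_m,

i.e. the best constant prediction in a region is the average of its target values.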
Machine Learning 32
Continuous Variables
• Add a threshold s to split:  {x | x_j ≤ s} ∪ {x | x_j > s}
[Figure: samples y⁽¹⁾, …, y⁽⁷⁾ ordered along feature x_j, split by the threshold s into R₁ and R₂.]
• In classification, we try to find the best feature and split that give the best IG.
• In regression, we want to find the feature j and splitting threshold s_j
- to minimize the L2 loss on the left node and the right node of the split:
(avg(yᵢ | xᵢ ∈ R) − yᵢ)²
Machine Learning 33
Node Splitting
• For each splitting variable j and splitting point s:
- assume that the region R is split by j and s into R₁ and R₂,
ĉ₁(j, s) = avg(yᵢ | xᵢ ∈ R₁(j, s)),   ĉ₂(j, s) = avg(yᵢ | xᵢ ∈ R₂(j, s)).
• Find j, s that minimize the loss
ℓ(j, s) = Σ_{i: xᵢ ∈ R₁(j,s)} (yᵢ − ĉ₁(j, s))² + Σ_{i: xᵢ ∈ R₂(j,s)} (yᵢ − ĉ₂(j, s))²
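A brute-force Python sketch of this search (illustrative; it scans every feature j and every observed value as a candidate threshold s, and returns the pair with the smallest two-node L2 loss):

import numpy as np

def best_split(X, y):
    # Returns (feature j, threshold s, loss) minimizing ℓ(j, s).
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j])[:-1]:            # candidate thresholds
            left, right = X[:, j] <= s, X[:, j] > s
            loss = ((y[left] - y[left].mean()) ** 2).sum() \
                 + ((y[right] - y[right].mean()) ** 2).sum()
            if loss < best[2]:
                best = (j, s, loss)
    return best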
Machine Learning 35
Decision Tree
• Advantages:
-Inexpensive to construct and fast at classifying new samples.
-Easy to interpret for small-sized trees (caution: unstable).
• Trees make no use of geometry
- A nonmetric method, feature scale irrelevant (vs. SVM or KNN).
- Can simultaneously cope with categorical and continuous variables.
- Can naturally cope with missing values and noisy data (why?)
• Prediction functions are not continuous
- Not so bad for classification, but may be undesirable for regression.
Machine Learning 36
Outline
• Decision Tree
-ID3 Algorithm
-C4.5 Algorithm
-Regression Tree
• Random Forest
-Bagging
-Breiman Algorithm
Machine Learning 38
Ensemble Learning
• Ensemble methods train multiple learners and combine them for use.
Machine Learning 39
Random Forest
[Figure (ESL, Ch. 8, Model Inference and Averaging): trees b = 1, …, 11 grown on different bootstrap samples; their splits differ, e.g. x.2 < 0.285, x.3 < 0.985, x.4 < −1.36, x.1 < 0.395.]
• Data: n examples × p features.
• Inference: taking the majority vote of the trees.
Machine Learning 40
Random Forest
[Figure (ESL, Ch. 8): the same collection of trees, each grown on a different bootstrap sample.]
• Data: n examples × p features.
• Idea: train each tree on different samples and different features; at inference, take the majority vote.
Machine Learning 41
Outline
• Decision Tree
-ID3 Algorithm
-C4.5 Algorithm
-Regression Tree
• Random Forest
-Bagging
-Breiman Algorithm
Machine Learning 42
Bootstrap Sample
• Bootstrap (自助法): randomly draw datasets with replacement from
the training data, with the same sample size as the original training set
[Diagram: from the total dataset, draw Sample 1 → Model I, Sample 2 → Model II, Sample 3 → Model III.]
Machine Learning 43
Bootstrap Sample
• Definition: A bootstrap sample from D_n is a sample of size n drawn with replacement from D_n.
• In a bootstrap sample, some elements of D_n
- will show up multiple times,
- some will not show up at all. Will this happen?
• Each xᵢ has probability (1 − 1/n)ⁿ of not being selected.
• Recall from calculus that for large n,
(1 − 1/n)ⁿ ≈ 1/e ≈ 0.368.
• So we expect ~63.2% of the elements of D_n to show up at least once.
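A quick numerical check of the ~63.2% claim (a sketch, not from the slides), drawing one bootstrap sample with numpy and counting the distinct indices that appear:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
sample = rng.integers(0, n, size=n)       # bootstrap: n indices drawn with replacement
coverage = np.unique(sample).size / n
print(coverage)                           # ≈ 0.632, i.e. 1 − 1/e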
Machine Learning 44
Bootstrap Method
• Definition: A bootstrap method simulates B independent samples from P by taking B bootstrap samples from the dataset D_n.
• Given the original data D_n, compute B bootstrap samples D_n⁽¹⁾, …, D_n⁽ᴮ⁾.
• For each bootstrap sample, compute some function:
φ(D_n⁽¹⁾), …, φ(D_n⁽ᴮ⁾)
- Work with these values as though D_n⁽¹⁾, …, D_n⁽ᴮ⁾ were i.i.d. from P.
- For example, find some independent classifiers h_{D_n⁽¹⁾}, …, h_{D_n⁽ᴮ⁾}.
• Amazing fact:
- φ(D_n⁽¹⁾), …, φ(D_n⁽ᴮ⁾) often come out very close to what we would obtain with independent samples from P.
Machine Learning 45
Bootstrap Aggregation (Bagging)
• Bagging is a general technique to reduce the variance of an estimator.
• Draw B bootstrap samples D_n⁽¹⁾, …, D_n⁽ᴮ⁾ from the original data D_n, train a predictor on each, and aggregate their outputs (average for regression, majority vote for classification); a sketch follows below.
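A minimal bagging sketch in Python (illustrative; it assumes X and y are numpy arrays with integer class labels and a scikit-learn-style base estimator with fit/predict; the function names are made up):

import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, base=DecisionTreeClassifier(), B=25, seed=0):
    # Train B copies of the base estimator on B bootstrap samples of (X, y).
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))   # one bootstrap sample
        models.append(clone(base).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # Aggregate by majority vote over the fitted models.
    votes = np.stack([m.predict(X) for m in models])     # shape (B, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])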
• Decision Tree
-ID3 Algorithm
-C4.5 Algorithm
-Regression Tree
• Random Forest
-Bagging
-Breiman Algorithm
Machine Learning 48
Random Forest
[Figure (ESL, Ch. 8): trees b = 1, …, 11, each grown on a different bootstrap sample, with different splits.]
• n examples: grow each tree on a randomly sampled (bootstrap) set of examples.
• Different features for each split: at every split, consider only a random subset of m of the p features (m ≈ √p).
Machine Learning 49
Breiman Algorithm
[Diagram: one tree per bootstrap sample, combined with randomized features (m ≈ √p) at each split.]
By Leo Breiman
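A Python sketch of the Breiman-style procedure (illustrative; it highlights the two sources of randomness — a bootstrap sample per tree and a random subset of m ≈ √p features per split, the latter via scikit-learn's max_features="sqrt"; function names are made up):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_fit(X, y, B=100, seed=0):
    # One tree per bootstrap sample; each split considers only sqrt(p) random features.
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))            # bootstrap sample
        t = DecisionTreeClassifier(max_features="sqrt",
                                   random_state=int(rng.integers(1 << 31)))
        trees.append(t.fit(X[idx], y[idx]))
    return trees

def random_forest_predict(trees, X):
    votes = np.stack([t.predict(X) for t in trees])           # shape (B, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])   # majority vote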
Machine Learning 50
Breiman Algorithm
• A widely-used algorithm in practice:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=100,   ## The number of trees
    criterion="gini",
)
clf.fit(X, y)

[Figure 8.8 (ISLR): bagging and random forest results for the Heart data. The test error (curves Test: Bagging and Test: RandomForest) is shown as a function of B, the number of bootstrapped training sets used; random forests were applied with m = √p. The dashed line indicates the test error of a single classification tree, and the OOB traces (OOB: Bagging, OOB: RandomForest) are in this case considerably lower.]
[Figure 8.10 (ISLR): results from random forests on the fifteen-class gene expression data set with p = 500 predictors; test error as a function of the number of trees, for m = p, m = p/2, and m = √p. Random forests (m < p) lead to a slight improvement over bagging (m = p).]
Machine Learning 51
Bias-Variance Decomposition
[Figure: expected new-data error Ē_new vs. model complexity, decomposed into squared bias, variance, and irreducible error Var(y | x); underfitting on the left, overfitting on the right.]
Variance: Var_D[ h_D(x) | x ]
Bias²: ( E_D[ h_D(x) | x ] − f(x) )²
[Figure: individual fits g⁽¹⁾(x) (Model Order = 1) and g⁽¹⁰⁾(x) (Model Order = 10), the mean of all fits, the target f(x), and the squared error.]
Finding a balanced fit (neither over- nor underfit) is called the bias-variance tradeoff.
Machine Learning 52
Decision Trees Have High Variance
• Input space X = ℝ^q and output space Y = {−1, 1}.
• Different bootstrap samples lead to very different splits (see the trees in the ESL figure): decision trees have high variance.
• Bagging is needed.
Machine Learning 54
Thank You
Questions?
Mingsheng Long
[email protected]
http://ise.thss.tsinghua.edu.cn/~mlong
Office hours: Room 413, Zone 11, East Main Building (答疑:东主楼11区413室)