Topic 7

Decision Trees
What are trees?

2
Decision Trees
• Example: classifying lemons vs. apples

Images from [Link]
3
Decision Trees
• Root node
• Branches
• Leaves

Images from [Link]
4
Rules for classifying data using attributes
• The tree consists of decision nodes and leaf
nodes.
• A decision node has two or more branches,
each representing values for the attribute
tested.
• A leaf node produces a homogeneous result
(all examples in one class), which requires no
further classification testing.

5
• Each internal node: tests one feature Xi
• Each branch from a node: selects one value for Xi
• Each leaf node: prediction for Y

• Features can be discrete, continuous or categorical

6
Images from [Link]
Read
• Features can be discrete, continuous or categorical
• Each internal node: tests some set of features {Xi}
• Each branch from a node: selects a set of values for {Xi}
• Each leaf node: prediction for Y

7
Example: What to do this Weekend?
• If my parents are visiting
– We’ll go to the cinema
• If not
– Then, if it’s sunny I’ll play tennis
– But if it’s windy and I’m rich, I’ll go shopping
– If it’s windy and I’m poor, I’ll go to the cinema
– If it’s rainy, I’ll stay in

8
Written as a Decision Tree

Root of tree

Leaves

9
Using the Decision Tree
(No parents on a Sunny Day)

10
Using the Decision Tree
(No parents on a Sunny Day)

11
From Decision Trees to Logic
• Read from the root to every tip
– If this and this and this … and this, then do this

• In our example:
– If no_parents and sunny_day, then play_tennis
– no_parents ∧ sunny_day → play_tennis

12
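As an illustration (not part of the original slides), the rules read off this tree can be written directly as nested conditionals; the attribute names parents, weather and money below are assumptions:

```python
def weekend_decision(parents: bool, weather: str, money: str) -> str:
    """Hypothetical sketch of the weekend decision tree read as if/then rules."""
    if parents:                       # parents visiting -> cinema
        return "cinema"
    if weather == "sunny":            # no parents and sunny -> tennis
        return "tennis"
    if weather == "windy":            # no parents and windy -> depends on money
        return "shopping" if money == "rich" else "cinema"
    return "stay in"                  # rainy -> stay in

# no_parents and sunny_day -> play_tennis
print(weekend_decision(parents=False, weather="sunny", money="rich"))  # tennis
```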
How to design a decision tree
• Decision tree can be seen as rules for performing a
categorisation
– E.g., “what kind of weekend will this be?”

• Remember that we’re learning from examples


– Not turning thought processes into decision trees

• The major question in decision tree learning is


– Which nodes to put in which positions
– Including the root node and the leaf nodes

13
Training and Visualizing a Decision Tree
(Try the Iris decision tree in the Jupyter notebook)

• Decision trees require very little data preparation.
• Gini impurity is a measure of node impurity: a pure node
(all samples of one class) has gini = 0.
• Gini impurity of node i: G_i = 1 − Σ_k p_{i,k}^2, where p_{i,k} is the
proportion of class-k samples among the node's training samples.
• The tree shown is trained with max_depth=2.
• Decision trees are a white box model – easy to interpret –
not a black box model like neural networks.
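A minimal scikit-learn sketch of how such a tree could be trained and inspected; using the two petal features and max_depth=2 follows the slide, while the rest (random_state, the text export) is illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, 2:]   # petal length and petal width (assumed, as in the figure)
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

# Text view of the learned tree: each node shows the tested feature and threshold.
print(export_text(tree_clf, feature_names=iris.feature_names[2:]))
```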
Decision Tree Boundaries
• For max_depth=2, the decision boundaries partition the
petal length / petal width plane.
• max_depth=3 adds the vertical dotted lines.
CART Training Algorithm
• Classification And Regression Tree algorithm.
• Splits the training set into two subsets using a single feature k and a
threshold t_k – in this case, petal length ≤ 2.45.
• Searches for the (k, t_k) combination that produces the purest subsets,
weighted by their size. The cost function is:
J(k, t_k) = (m_left / m) · G_left + (m_right / m) · G_right
where G_left/right is the impurity of the left/right subset and
m_left/right is the number of instances in it.
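A hedged sketch of one level of this split search – trying every (feature, threshold) pair and keeping the one that minimizes the weighted Gini cost above; all names here are illustrative, not scikit-learn internals:

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array: 1 - sum_k p_k^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return the (feature k, threshold t_k) pair minimizing the weighted Gini cost."""
    m, n_features = X.shape
    best_k, best_t, best_cost = None, None, float("inf")
    for k in range(n_features):
        for t in np.unique(X[:, k]):
            left, right = y[X[:, k] <= t], y[X[:, k] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            cost = len(left) / m * gini(left) + len(right) / m * gini(right)
            if cost < best_cost:
                best_k, best_t, best_cost = k, t, cost
    return best_k, best_t
```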
Decision Tree Regularisation
• Recall from the data pre-processing lecture that a decision tree gave a model with
0 error – decision trees have a high tendency to overfit. Hence regularisation!

• max_depth: maximum depth of the tree in terms of layers
• max_features: maximum number of features that are evaluated for splitting at each node
• max_leaf_nodes: maximum number of leaf nodes
• min_samples_split: minimum number of samples a node must have before it can be split
• min_samples_leaf: minimum number of samples a leaf node must have to be created
• min_weight_fraction_leaf: same as min_samples_leaf but expressed as a fraction of the
total number of weighted instances
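For instance, a couple of these scikit-learn hyperparameters in use; this is only a sketch, and the moons dataset and parameter values are illustrative rather than taken from the slides:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Unrestricted tree: grows until every leaf is pure and tends to overfit.
free_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Regularized tree: every leaf must contain at least 4 training samples.
reg_tree = DecisionTreeClassifier(min_samples_leaf=4, random_state=42).fit(X_train, y_train)

print(free_tree.score(X_test, y_test), reg_tree.score(X_test, y_test))
```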
Non-regularized vs. Regularized Tree
• Test accuracy:
– No restrictions: 0.898
– Restricted: 0.92
Regression with Decision Trees
• Instead of predicting a class, it predicts a value.
• Example: x_new = 0.2
Regression with Decision Trees
• The predicted value for each region is the average target value of the
instances in that region.
CART for Regression
• Instead of trying to minimize impurity, CART splits the training data
in order to minimize the MSE.
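A minimal sketch of tree regression with scikit-learn; the noisy quadratic toy data and the x_new = 0.2 query are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.random((200, 1))                                  # one feature in [0, 1]
y = (X[:, 0] - 0.5) ** 2 + rng.normal(0, 0.05, size=200)  # noisy quadratic target

tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X, y)

# The prediction is the average target value of the training instances
# that fall in the same leaf (region) as x_new.
print(tree_reg.predict([[0.2]]))
```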
Importance of Regularization in Regression
Sensitivity to Axis Orientation
• Decision trees love orthogonal decision boundaries.
• Rotate the data by 45° and note the convoluted boundary.
High Variance in Decision Trees
• Decision trees have high variance.
• Small changes in hyperparameters lead to very different models.
• Since the training algorithm used by scikit-learn is stochastic in
nature, retraining the same model on the same data can produce a very
different model.
• By averaging predictions over many trees, it is possible to reduce the
variance. Such an ensemble of trees is called a random forest.
• The next slide shows an example of the same dataset and two different
tree configurations.
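A sketch of variance reduction by averaging many trees, using scikit-learn's RandomForestClassifier; the dataset and parameters are illustrative:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)

single_tree = DecisionTreeClassifier(random_state=42)
forest = RandomForestClassifier(n_estimators=200, random_state=42)  # averages 200 trees

# Cross-validated accuracy: the ensemble typically scores higher and varies less.
print(cross_val_score(single_tree, X, y, cv=5).mean())
print(cross_val_score(forest, X, y, cv=5).mean())
```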
High Variance
The ID3 Algorithm

• Invented by J. Ross Quinlan in 1979


• ID3 uses a measure called Information Gain
– Used to choose which node to put next
• Node with the highest information gain is chosen
– When there are no choices, a leaf node is put on
• Builds the tree from the top down, with no
backtracking
• Information Gain is used to select the most useful
attribute for classification
15
Entropy – General Idea
• From Tom Mitchell’s book:
– “In order to define information gain precisely, we begin by
defining a measure commonly used in information theory,
called entropy that characterizes the (im)purity of an
arbitrary collection of examples”

• A notion of impurity in data


• A formula to calculate the homogeneity of a sample
• A completely homogeneous sample has entropy of 0
• An equally divided sample has entropy of 1

16
Entropy - Formulae

• Given a set of examples, S

• For example, in a binary categorization:
Entropy(S) = −p+ log2(p+) − p− log2(p−)
– Where p+ is the proportion of positives
– And p− is the proportion of negatives

• For examples belonging to classes c1 to cn:
Entropy(S) = −Σ_n p_n log2(p_n)
– Where p_n is the proportion of examples in c_n

17
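A small helper consistent with these formulas, reused in the worked examples below (an illustrative sketch, not from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i), over the classes present in labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(entropy(["+", "-"]))            # equally divided binary sample -> 1.0
print(entropy(["+", "+", "+", "+"]))  # completely homogeneous sample -> 0.0
```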
Entropy Example

18
(Hand-written working: a 14-example set with 9 positives and 5 negatives, split on Outlook)

Outlook branches: Sunny, Overcast, Rain
E(Sunny) = 0.97, E(Overcast) = 0, E(Rain) = 0.97

Gain(S, Outlook) = 0.94 − (5/14)·0.97 − (4/14)·0 − (5/14)·0.97
                 = 0.94 − 0.69 = 0.25

Exercise: find the information gain for the other attributes:
1) Temperature  2) Humidity  3) Wind
Entropy Example

Entropy(S) =
- (9/14) Log2 (9/14) - (5/14) Log2 (5/14)
= 0.940

19
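Checking this value with the entropy helper sketched earlier:

```python
S = ["+"] * 9 + ["-"] * 5          # 14 examples: 9 positive, 5 negative
print(round(entropy(S), 3))        # 0.94
```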
Information Gain (IG)
• Information gain is based on the decrease in entropy after a dataset is
split on an attribute.
• Which attribute creates the most homogeneous branches?

• First the entropy of the total dataset is calculated


• The dataset is then split on different attributes
• The entropy for each branch is calculated. Then it is added
proportionally, to get total entropy for the split
• The resulting entropy is subtracted from the entropy before the split

• The result is the Information Gain, or decrease in entropy


• The attribute that yields the largest IG is chosen for the decision node
• It is the expected reduction in the entropy of the target variable Y
for a data sample S, due to sorting on an attribute.

20
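A sketch of information gain built on the same entropy helper; the representation of examples as dicts mapping attribute names to values is an assumption for illustration:

```python
from collections import defaultdict

def information_gain(examples, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v)."""
    total = entropy(labels)
    branches = defaultdict(list)
    for example, label in zip(examples, labels):
        branches[example[attribute]].append(label)   # group labels by attribute value
    remainder = sum(len(subset) / len(labels) * entropy(subset)
                    for subset in branches.values())
    return total - remainder
```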
Information Gain (cont’d)
• A branch set with entropy of 0 is a leaf node.
• Otherwise, the branch needs further splitting to classify its
dataset.

• The ID3 algorithm is run recursively on the non-leaf branches,


until all the data is classified.

21
Information Gain (cont’d)

• Calculate Gain(S,A)
– Estimate the reduction in entropy we obtain if we know
the value of attribute A for the examples in S

22
An Example Calculation of
Information Gain
• Suppose we have a set of examples
– S = {s1, s2, s3, s4}
– In a binary categorization
• With one positive example and three negative examples
• The positive example is s1
• And attribute A
– Which takes values v1, v2, v3
• s1 takes value v2 for A, s2 takes value v2 for A
• s3 takes value v3 for A, s4 takes value v1 for A

23
First Calculate Entropy(S)
• Recall that
Entropy(S) = -p+log2(p+) – p-log2(p-)

• From binary categorisation, we know that


p+ = ¼ and p- = ¾

• Hence, Entropy(S) = -(1/4)log2(1/4) – (3/4)log2(3/4)


= 0.811

24
Calculate Gain for each Value of A
• Remember that
Gain(S, A) = Entropy(S) − Σ_v (|Sv|/|S|) · Entropy(Sv)
• And that Sv = {set of examples with value v for A}
– So, Sv1 = {s4}, Sv2 = {s1, s2}, Sv3 = {s3}

• Now, (|Sv1|/|S|) * Entropy(Sv1)
= (1/4) * (−(0/1)*log2(0/1) − (1/1)*log2(1/1))
= (1/4) * (0 − (1)*log2(1)) = (1/4)(0 − 0) = 0

• Similarly, (|Sv2|/|S|) * Entropy(Sv2) = (2/4) * 1 = 0.5
– since Entropy(Sv2) = −(1/2)log2(1/2) − (1/2)log2(1/2) = 1/2 + 1/2 = 1
• And (|Sv3|/|S|) * Entropy(Sv3) = (1/4) * 0 = 0

25
Final Calculation

• So, we add up the three calculations and take them


from the overall entropy of S:

• Final answer for information gain:


– Gain(S,A) = 0.811 – (0.25*0 + 0.5*1 + 0.25*0) = 0.311

26
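Reproducing this number with the information_gain sketch above:

```python
examples = [{"A": "v2"}, {"A": "v2"}, {"A": "v3"}, {"A": "v1"}]  # s1..s4
labels = ["+", "-", "-", "-"]                                     # only s1 is positive
print(round(information_gain(examples, labels, "A"), 3))          # 0.311
```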
A Worked Example

Weekend  Weather  Parents  Money  Decision (Category)
W1       Sunny    Yes      Rich   Cinema
W2       Sunny    No       Rich   Tennis
W3       Windy    Yes      Rich   Cinema
W4       Rainy    Yes      Poor   Cinema
W5       Rainy    No       Rich   Stay in
W6       Rainy    Yes      Poor   Cinema
W7       Windy    No       Poor   Cinema
W8       Windy    No       Rich   Shopping
W9       Windy    Yes      Rich   Cinema
W10      Sunny    No       Rich   Tennis

27
(Hand-written working: Entropy(S) for the 10 weekend examples
– 6 Cinema, 2 Tennis, 1 Shopping, 1 Stay in)

Entropy(S) = −0.6·log2(0.6) − 0.2·log2(0.2) − 0.1·log2(0.1) − 0.1·log2(0.1)
           = 0.4422 + 0.4644 + 0.3322 + 0.3322
           = 1.571
Information Gain for All of S
• S = {W1,W2,…,W10}
• Firstly, we need to calculate:
– Entropy(S) = … = 1.571

• Next, we need to calculate information gain


– For all the attributes we currently have available
• (which is all of them at the moment)
– Gain(S, weather) = 0.7
– Gain(S, parents) = 0.61
– Gain(S, money) = 0.2816

28
(Hand-written working: Gain(S, Money))

Money = Rich for 7 examples (3 Cinema, 2 Tennis, 1 Stay in, 1 Shopping);
Money = Poor for 3 examples (all Cinema).

Entropy(S_rich) = −(3/7)log2(3/7) − (2/7)log2(2/7) − 2·(1/7)log2(1/7)
                = 0.5239 + 0.5164 + 0.8021 = 1.842
Entropy(S_poor) = 0

Gain(S, Money) = 1.571 − (7/10)·1.842 − (3/10)·0
               = 1.571 − 1.289 ≈ 0.2816
The ID3 Algorithm
• Given a set of examples, S
– Described by a set of attributes Ai
– Categorised into categories cj
1. Choose the root node to be attribute A
– Such that A scores highest for information gain
• Relative to S, i.e., gain(S,A) is the highest over all
attributes
2. For each value v that A can take
– Draw a branch and label each with corresponding v

29
The ID3 Algorithm
• For each branch you’ve just drawn (for value v)
– If Sv only contains examples in category c
• Then put that category as a leaf node in the tree
– If Sv is empty
• Then find the default category (which contains the most
examples from S)
– Put this default category as a leaf node in the tree
– Otherwise
• Remove A from attributes which can be put into nodes
• Replace S with Sv
• Find new attribute A scoring best for Gain(S, A)
• Start again at part 2
• Make sure you replace S with Sv

30
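A hedged sketch of this recursion, built on the entropy and information_gain helpers above; it returns the tree as nested dicts, and all names are illustrative:

```python
from collections import Counter

def id3(examples, labels, attributes):
    """Recursively build a decision tree as nested dicts: {attribute: {value: subtree}}."""
    # If all examples are in one category, that category becomes a leaf node.
    if len(set(labels)) == 1:
        return labels[0]
    # If no attributes are left, use the default (most common) category as the leaf.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    # Choose the attribute scoring highest for information gain as this node.
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    tree = {best: {}}

    # One branch per value of the chosen attribute seen in the examples
    # (a value with no examples would get the default category in full ID3).
    for value in {ex[best] for ex in examples}:
        idx = [i for i, ex in enumerate(examples) if ex[best] == value]
        sub_examples = [examples[i] for i in idx]
        sub_labels = [labels[i] for i in idx]
        tree[best][value] = id3(sub_examples, sub_labels,
                                [a for a in attributes if a != best])
    return tree
```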
Explanatory Diagram

31
Information Gain for All of S
• S = {W1,W2,…,W10}
• Firstly, we need to calculate:
– Entropy(S) = … = 1.571
• Next, we need to calculate information gain
– For all the attributes we currently have available
• (which is all of them at the moment)
– Gain(S, weather) = … = 0.7
– Gain(S, parents) = … = 0.61
– Gain(S, money) = … = 0.2816
• Hence, the weather is the first attribute to split on
– Because this gives us the biggest information gain

33
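Reproducing these gains with the helpers sketched earlier:

```python
weekend_data = [
    ("Sunny", "Yes", "Rich", "Cinema"),  ("Sunny", "No",  "Rich", "Tennis"),
    ("Windy", "Yes", "Rich", "Cinema"),  ("Rainy", "Yes", "Poor", "Cinema"),
    ("Rainy", "No",  "Rich", "Stay in"), ("Rainy", "Yes", "Poor", "Cinema"),
    ("Windy", "No",  "Poor", "Cinema"),  ("Windy", "No",  "Rich", "Shopping"),
    ("Windy", "Yes", "Rich", "Cinema"),  ("Sunny", "No",  "Rich", "Tennis"),
]
examples = [{"weather": w, "parents": p, "money": m} for w, p, m, _ in weekend_data]
labels = [d for _, _, _, d in weekend_data]

print(round(entropy(labels), 3))                    # 1.571
for attr in ("weather", "parents", "money"):
    print(attr, round(information_gain(examples, labels, attr), 2))
# weather ≈ 0.70, parents ≈ 0.61, money ≈ 0.28 (the slides' 0.2816, up to rounding)
```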
Top of the Tree
• So, this is the top of our tree:
• Now, we look at each branch in turn
– In particular, we look at the examples with the attribute
prescribed by the branch
• Ssunny = {W1,W2,W10}
– Categorisations are cinema, tennis and tennis for W1,W2
and W10
– What does the algorithm say?
• Set is neither empty, nor a single category
• So we have to replace S by Ssunny and start again

34
Getting to the leaf nodes
• If it’s sunny and the parents have turned up
– Then, looking at the table in the previous slide
• There’s only one answer: go to cinema
• If it’s sunny and the parents haven’t turned up
– Then, again, there’s only one answer: play tennis
• Hence our decision tree looks like this:

36
What is the optimal Tree Depth?
• We need to be careful to pick an appropriate
tree depth.
• If the tree is too deep, we can overfit.
• If the tree is too shallow, we underfit
• Max depth is a hyper-parameter that should
be tuned on the data. An alternative strategy is to
create a very deep tree, and then to prune it.

37
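One common way to tune this hyper-parameter is a cross-validated grid search; a sketch, with an illustrative dataset and parameter grid:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [2, 3, 4, 5, 6, None]},  # None = grow until leaves are pure
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```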
Control the size of the tree
• If we stop early, not all
training samples would
be classified correctly.
• How do we classify a new
instance:
– We label the leaves of this
smaller tree with the
majority of training
samples’ labels
38
Summary of learning classification
trees
• Advantages:
– Easily interpretable by human (as long as the tree is not too big)
– Computationally efficient
– Handles both numerical and categorical data
– It is parametric, thus compact: unlike nearest-neighbour
classification, we do not have to carry our training instances
around
– Building block for various ensemble methods (more on
this later)
• Disadvantages
– Heuristic training techniques
– Finding the partition of space that minimizes empirical
error is NP-hard.
– We resort to greedy approaches with limited
theoretical underpinning.
39
Feature Space
• Suppose that we have p explanatory variables
X1, . . . , Xp and n observations.

– a numeric variable: n − 1 possible splits
– an ordered factor with k levels: k − 1 possible splits
– an unordered factor with k levels: 2^(k−1) − 1 possible splits

41
Measures of Impurity
• At each node i of a classification tree, we have a
probability distribution p_{ik} over k classes.

• Deviance:
• Entropy: −Σ_k p_{ik} log(p_{ik})
• Gini index: 1 − Σ_k p_{ik}^2
• Residual sum of squares

42
