Decision Tree Learning in AI

The document discusses decision tree induction as a method of inductive learning, focusing on its application in classification and regression tasks. It explains how decision trees operate by testing attributes to reach a decision, and outlines the process of inducing decision trees from examples, including the selection of attributes based on information gain. Additionally, it touches on the challenges of handling noise in data and the importance of choosing effective attributes to minimize the depth of the decision tree.


Module-V

LEARNING
Chapter 18: Learning from Examples

Department of CSE, GIT ECS302: AI 1


18.3 Learning Decision Trees
Decision tree induction serves as a good introduction to the area of inductive learning, and is easy to implement.

Decision trees as performance elements:

 A decision tree takes as input an object or situation described by a set of attributes and returns a "decision" -- the
predicted output value for that input.
 The input attributes can be discrete or continuous.
For now, we assume discrete inputs.
 The output value can also be discrete or continuous.

 Learning a discrete-valued function is called classification learning.


 Learning a continuous function is called regression.

 We will concentrate on Boolean classification,


wherein each example is classified as true (positive) or false (negative).



Continued…
 A decision tree reaches its decision by performing a sequence of tests.

 Each internal node in the tree corresponds to a test of the value of one of the properties,
and the branches from the node are labeled with the possible values of the test.

 Each leaf node in the tree specifies the value to be returned if that leaf is reached.

Ex: The problem of whether to wait for a table at a restaurant.


The aim here is to learn a definition for the goal predicate WillWait.

The list of attributes:


1. Alternate: whether there is a suitable alternative restaurant nearby.
2. Bar: whether the restaurant has a comfortable bar area to wait in.
3. Fri/Sat: true on Fridays and Saturdays.
4. Hungry: whether we are hungry.
5. Patrons: how many people are in the restaurant (values are None, Some, and Full).



Continued…
6. Price: the restaurant's price range ($, $$, $$$).
7. Raining: whether it is raining outside.
8. Reservation: whether we made a reservation.
9. Type: the kind of restaurant (French, Italian, Thai, or burger).
10. WaitEstimate: the wait estimated by the host (0-10 minutes, 10-30, 30-60, >60).



Continued…
 The tree does not use the Price and Type attributes (since they are irrelevant).
 Examples are processed by the tree starting at the root and following the appropriate branch until a leaf is
reached.
For instance, an example with Patrons = Full and WaitEstimate = 0-10
will be classified as positive (i.e., yes, we will wait for a table).
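Traced as code, such a classification is just a chain of attribute tests. The sketch below (hypothetical Python; only the branches stated in the text are filled in, and the remaining WaitEstimate branches are elided) shows how an example is routed from root to leaf:

```python
def will_wait(x):
    """Route an example from root to leaf through a chain of attribute tests.
    Only branches stated in the text are implemented; the rest are elided."""
    if x["Patrons"] == "None":
        return False          # leaf: No -- the restaurant is empty
    if x["Patrons"] == "Some":
        return True           # leaf: Yes -- we will be seated soon
    # Patrons == "Full": follow that branch and test the next attribute
    if x["WaitEstimate"] == "0-10":
        return True           # leaf: Yes -- a short wait is acceptable
    raise NotImplementedError("further tests (Hungry, Alternate, ...) elided")

print(will_wait({"Patrons": "Full", "WaitEstimate": "0-10"}))  # True
```

The example from the text, Patrons = Full and WaitEstimate = 0-10, follows two tests and reaches a positive leaf.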

Expressiveness of decision trees:


Logically speaking, any particular decision tree hypothesis for the WillWait goal predicate can be seen as an
assertion of the form:

∀s WillWait(s) ⇔ (P1(s) ∨ P2(s) ∨ ….. ∨ Pn(s))

where each condition Pi(s) is a conjunction of tests corresponding to a path from the root of the tree to a leaf with a positive outcome.



Continued…
• Although this looks like a first-order sentence, it is, in a sense, propositional,
because it contains just one variable and all the predicates are unary.

• The decision tree is really describing a relationship between WillWait and some logical combination of attribute
values.

• Decision trees can express any function of the input attributes. For Boolean functions, each row of the truth table
corresponds to a path to a leaf.

• If the function is the parity function, which returns 1 if and only if an even number of inputs are 1, then an
exponentially large decision tree will be needed. It is also difficult to use a decision tree to represent a majority
function, which returns 1 if more than half of its inputs are 1.

• The truth table has 2^n rows, because each input case is described by n attributes.
We can consider the "answer" column of the table as a 2^n-bit number that defines the function.
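Since the answer column is a 2^n-bit number, there are 2^(2^n) distinct Boolean functions of n attributes, and this counting argument can be checked directly:

```python
# Number of truth-table rows and of distinct Boolean functions of n attributes.
for n in (2, 5):
    rows = 2 ** n             # one row per combination of attribute values
    functions = 2 ** rows     # one function per possible "answer" column
    print(n, rows, functions)
# n=2 gives 4 rows and 16 functions; n=5 already gives 2**32 functions.
```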



Continued…
Inducing decision trees from examples :
Ex: An example for a Boolean decision tree consists of a vector of input attributes, X, and
a single Boolean output value y.
A set of examples (X1, y1), ….. , (X12, y12) is shown in Figure 18.3.



Continued…
• The positive examples are the ones in which the goal WillWait is true (X1, X3, …..).
• The negative examples are the ones in which it is false (X2, X5,……).
• The complete set of examples is called the training set.

A trivial solution for the problem of finding a decision tree that agrees with the training set:

• Construct a decision tree that has one path to a leaf for each example,
where the path tests each attribute in turn and follows the value for the example and the leaf has the
classification of the example.

• When given the same example again, the decision tree will come up with the right classification.

• Unfortunately, it will not have much to say about any other cases!



Continued…
Figure 18.4 shows how the algorithm gets started.



Continued…
• We are given 12 training examples, which we classify into positive and negative sets.

• We then decide which attribute to use as the first test in the tree.

• Figure 18.4(a) shows that Type is a poor attribute, because it leaves us with four possible outcomes,
each of which has the same number of positive and negative examples.

• On the other hand, in Figure 18.4(b) we see that Patrons is a fairly important attribute, because if the value is
None or Some, then we are left with example sets for which we can answer definitively (No and Yes,
respectively).

• If the value is Full, we are left with a mixed set of examples.

• In general, after the first attribute test splits up the examples, each outcome is a new decision tree learning
problem in itself, with fewer examples and one fewer attribute.



Continued…
There are four cases to consider for these recursive problems:

1. If there are some positive and some negative examples, then choose the best attribute to split them.
(Figure 18.4(b) shows Hungry being used to split the remaining examples.)

2. If all the remaining examples are positive (or all negative), then we are done: we can answer Yes or No.
(Figure 18.4(b) shows examples of this in the None and Some cases.)

3. If there are no examples left, it means that no such example has been observed, and we return a default value
calculated from the majority classification at the node's parent.

4. If there are no attributes left, but both positive and negative examples, we have a problem.
 It means that these examples have exactly the same description, but different classifications.
 This happens when some of the data are incorrect; we say there is noise in the data.
 It also happens either when the attributes do not give enough information to describe the situation fully, or when
the domain is truly nondeterministic. One simple way out of the problem is to use a majority vote.
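The four cases map directly onto a recursive procedure. Below is a minimal Python sketch; the choose_attribute parameter stands in for the information-gain selection described later, and the dict-of-dicts tree representation is an assumption for illustration:

```python
from collections import Counter

def majority_value(examples):
    """Most common classification -- the fallback for cases 3 and 4."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def decision_tree_learning(examples, attributes, default, choose_attribute, values):
    """examples: list of (attribute->value dict, label) pairs;
    values: dict mapping each attribute to its possible values."""
    if not examples:                          # case 3: no examples left
        return default
    labels = {label for _, label in examples}
    if len(labels) == 1:                      # case 2: all positive or all negative
        return labels.pop()
    if not attributes:                        # case 4: noise -- take a majority vote
        return majority_value(examples)
    best = choose_attribute(attributes, examples)   # case 1: split on the best attribute
    subtree_of = {}
    remaining = [a for a in attributes if a != best]
    for v in values[best]:
        subset = [(x, y) for x, y in examples if x[best] == v]
        subtree_of[v] = decision_tree_learning(
            subset, remaining, majority_value(examples), choose_attribute, values)
    return {best: subtree_of}
```

For instance, with a single attribute A taking values 0 and 1 and the examples [({'A': 0}, 'No'), ({'A': 1}, 'Yes')], the call returns {'A': {0: 'No', 1: 'Yes'}}.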



Continued…
Decision Tree Learning Algorithm (Figure 18.5):



Continued…
The Decision Tree induced (Figure 18.6):



Continued…
• The learning algorithm looks at the examples, not at the correct function, and in fact, its hypothesis (see Figure
18.6) not only agrees with all the examples, but is considerably simpler than the original tree.

• The learning algorithm has no reason to include tests for Raining and Reservation, because it can classify all the
examples without them.

4. Choosing attribute tests:


• The scheme used in decision tree learning for selecting attributes is designed to minimize the depth of the final
tree.
• The idea is to pick the attribute that goes as far as possible toward providing an exact classification of the
examples.

• A perfect attribute divides the examples into sets that are all positive or all negative.
-- The Patrons attribute is not perfect, but it is fairly good.
-- A really useless attribute, such as Type, leaves the example sets with roughly the same proportion of
positive and negative examples as the original set.



Continued…
• We need a formal measure of "fairly good" and "really useless" so that we can implement the CHOOSE-ATTRIBUTE
function of Figure 18.5.

• The measure should have its maximum value when the attribute is perfect and its minimum value when the
attribute is of no use at all.

• One suitable measure is the expected amount of information provided by the attribute.
Ex: Whether a coin will come up heads.
-- The amount of information contained in the answer depends on one's prior knowledge.
-- The less you know, the more information is provided.

• Information theory measures information content in bits.

• One bit of information is enough to answer a yes/no question about which one has no idea, such as the flip of a
fair coin.



Continued…
• In general, if the possible answers vi have probabilities P(vi),

then the information content I of the actual answer is given by

I( P(v1), ….. , P(vn) ) = Σi −P(vi) log2 P(vi)

To check this equation, for the tossing of a fair coin, we get

I(1/2, 1/2) = −(1/2) log2 (1/2) − (1/2) log2 (1/2) = 1 bit

• If the coin is loaded to give 99% heads, we get I(1/100, 99/100) = 0.08 bits, and
as the probability of heads goes to 1, the information of the actual answer goes to 0.
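These numbers are easy to verify directly (a small Python sketch; the function name is ours):

```python
import math

def information_content(probs):
    """I(P(v1), ..., P(vn)) = sum over i of -P(vi) * log2 P(vi), in bits.
    Zero-probability terms contribute nothing (p * log p -> 0 as p -> 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(information_content([0.5, 0.5]))              # fair coin: 1.0 bit
print(round(information_content([0.01, 0.99]), 2))  # loaded coin: 0.08 bits
```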



Continued…
• A correct decision tree can answer the question “what is the correct classification?”

• An estimate of the probabilities of the possible answers before any of the attributes have been tested is given by
the proportions of positive and negative examples in the training set.

• Suppose the training set contains p positive examples and n negative examples.
• Then an estimate of the information contained in a correct answer is:

I( p/(p+n), n/(p+n) )

• The restaurant training set in Figure 18.3 has p = n = 6, so we need 1 bit of information.

• Now a test on a single attribute A will not usually tell us this much information, but it will give us some of it.

• We can measure exactly how much by looking at how much information we still need after the attribute test.



Continued…
• Any attribute A divides the training set E into subsets E1, ….. , Ev according to their values for A,
where A can have v distinct values.
• Each subset Ei has pi positive examples and ni negative examples,
so if we go along that branch, we will need an additional

I( pi/(pi+ni), ni/(pi+ni) )

bits of information to answer the question.

• A randomly chosen example from the training set has the ith value for the attribute
with probability (pi + ni)/(p + n).
 So on average, after testing attribute A, we will need

Remainder(A) = Σi=1..v (pi + ni)/(p + n) × I( pi/(pi+ni), ni/(pi+ni) )

bits of information to classify the example.


Continued…
• The information gain from the attribute test is
the difference between the original information requirement and the new requirement:

Gain(A) = I( p/(p+n), n/(p+n) ) − Remainder(A)

• The heuristic used in the CHOOSE-ATTRIBUTE function is just to choose the attribute with the largest gain.
• Returning to the attributes considered in Figure 18.4, we have

Gain(Patrons) = 1 − [ (2/12) I(0, 1) + (4/12) I(1, 0) + (6/12) I(2/6, 4/6) ] ≈ 0.541 bits
Gain(Type) = 1 − [ (2/12) I(1/2, 1/2) + (2/12) I(1/2, 1/2) + (4/12) I(2/4, 2/4) + (4/12) I(2/4, 2/4) ] = 0 bits

confirming our intuition that Patrons is a better attribute to split on.


 In fact, Patrons has the highest gain of any of the attributes and would be chosen by the decision-tree learning
algorithm as the root.
***
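These two gains can be reproduced from the example counts in Figure 18.4 (a Python sketch; the function and variable names are ours):

```python
import math

def I(p, n):
    """Information content of a set with p positive and n negative examples."""
    return -sum(x / (p + n) * math.log2(x / (p + n)) for x in (p, n) if x > 0)

def gain(splits, p, n):
    """Gain(A) = I(p, n) - Remainder(A), where splits lists (pi, ni) per value."""
    remainder = sum((pi + ni) / (p + n) * I(pi, ni) for pi, ni in splits)
    return I(p, n) - remainder

# (positive, negative) counts per branch for the 12 restaurant examples.
patrons = [(0, 2), (4, 0), (2, 4)]          # None, Some, Full
type_ = [(1, 1), (1, 1), (2, 2), (2, 2)]    # French, Italian, Thai, burger
print(round(gain(patrons, 6, 6), 3))  # 0.541
print(round(gain(type_, 6, 6), 3))    # 0.0
```

Patrons has the largest gain of any attribute, which is why it ends up at the root of the learned tree.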



13.7 The Wumpus world revisited

 Uncertainty arises in the wumpus world because the agent's sensors give only partial, local information about the
world.



Continued...

 Figure 13.6 shows a situation in which each of the three reachable squares, [1,3], [2,2], and [3,1], might contain a
pit.
 Pure logical inference can conclude nothing about which square is most likely to be safe,
so a logical agent might be forced to choose randomly.
 A probabilistic agent can do much better than the logical agent.



Continued...
Aim: To calculate the probability that each of the three squares contains a pit.
(For the purposes of this example, we will ignore the wumpus and the gold.)

The relevant properties of the wumpus world are that


(1) a pit causes breezes in all neighboring squares, and
(2) each square other than [1,1] contains a pit with probability 0.2.

The first step is to identify the set of random variables we need:

• Pi,j is true if and only if square [i, j] actually contains a pit.

• Bi,j is true if and only if square [i, j] is breezy.

• The Bi,j variables are included only for the observed squares--[1,1], [1,2], and [2,1].



Continued...
Now specify the full joint distribution: P(P1,1, ….. , P4,4, B1,1, B1,2, B2,1)

Applying the product rule, we have: [ Product rule: P (a Ʌ b) = P (a | b) P (b) ]

P(P1,1, ….. , P4,4, B1,1, B1,2, B2,1) = P(B1,1, B1,2, B2,1 | P1,1, ….. , P4,4) P(P1,1, ….. , P4,4)

• The first term (on the RHS) : Conditional probability of a breeze configuration, given a pit configuration
-- this is 1 if the breezes are adjacent to the pits and 0 otherwise.
• The second term : Prior probability of a pit configuration.
-- Each square contains a pit with probability 0.2 independently of the other squares.
Hence,

P(P1,1, ….. , P4,4) = Π(i,j) P(Pi,j)

For a configuration with n pits, this is just 0.2^n × 0.8^(16−n).



Continued...
In the situation in Figure 13.6(a), the evidence consists of --
the observed breeze (or its absence) in each square that is visited, combined with the fact that
each such square contains no pit.

We'll abbreviate these facts as :


b = ¬b1,1 Ʌ b1,2 Ʌ b2,1 and known = ¬p1,1 Ʌ ¬p1,2 Ʌ ¬p2,1

We are interested in answering queries such as : P(P1,3 | known, b).


(how likely is it that [1, 3] contains a pit, given the observation so far?)

To answer this query, we can follow the standard approach suggested by the equation

P(X | e) = α P(X, e) = α Σy P(X, e, y) --- (4)

namely summing over entries from the full joint distribution.



Continued...
• Let Unknown be a composite variable consisting of the Pi,j variables for squares other than the Known squares and
the query square [1,3].
• Then by equation (4) we have

P(P1,3 | known, b) = α Σunknown P(P1,3, unknown, known, b)
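This sum is small enough to evaluate by brute force. The sketch below (hypothetical Python; the 4x4 grid and square coordinates follow the figure) enumerates every pit configuration over the unknown squares, keeps those consistent with known and b, and normalizes:

```python
from itertools import product

def neighbors(i, j):
    """Orthogonal neighbors of square (i, j) inside the 4x4 grid (1-indexed)."""
    return [(a, b) for a, b in [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
            if 1 <= a <= 4 and 1 <= b <= 4]

# Visited squares, known to be pit-free; every other square has pit prior 0.2.
known_clear = {(1, 1), (1, 2), (2, 1)}
unknown = [(i, j) for i in range(1, 5) for j in range(1, 5)
           if (i, j) not in known_clear]

def posterior_pit(query):
    """P(query has a pit | known squares clear, no breeze at [1,1],
    breeze at [1,2] and [2,1]), by summing the full joint distribution."""
    num = den = 0.0
    for bits in product([True, False], repeat=len(unknown)):
        pits = dict(zip(unknown, bits))
        for sq in known_clear:
            pits[sq] = False
        # A square is breezy iff some neighboring square contains a pit.
        breezy = lambda sq: any(pits[nb] for nb in neighbors(*sq))
        if breezy((1, 1)) or not breezy((1, 2)) or not breezy((2, 1)):
            continue  # inconsistent with the observed breezes b
        n_pits = sum(pits[sq] for sq in unknown)
        weight = (0.2 ** n_pits) * (0.8 ** (len(unknown) - n_pits))
        den += weight
        if pits[query]:
            num += weight
    return num / den  # normalization plays the role of alpha

print(round(posterior_pit((1, 3)), 2))  # 0.31
print(round(posterior_pit((2, 2)), 2))  # 0.86
```

The squares beyond the frontier cancel out in the normalization, which is why only [1,3], [2,2], and [3,1] actually influence the answer.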


Continued...
Summing over the unknown squares and normalizing gives P(P1,3 | known, b) = α ⟨0.31, 0.69⟩.
That is, [1,3] (and, by symmetry, [3,1]) contains a pit with probability about 0.31,
while the corresponding calculation for [2,2] gives a probability of about 0.86.
