Types of Learning in AI Explained

Learning

Learning with complete data


Learning with hidden data

Unit-IV
Learning in AI

Deductive: deduce rules/facts from already known rules/facts. (We have already dealt with this.)

(A ⇒ B) ∧ (B ⇒ C)  ⊢  (A ⇒ C)

Inductive: learn new rules/facts from a data set D.

D = {(x(n), y(n))}, n = 1…N  ⊢  (A ⇒ C)
Types of learning
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
Unsupervised Learning
• Learning without a teacher.
• No teacher to oversee the learning process.
• No knowledge.
• The agent learns patterns in the input even though no
explicit feedback is supplied.
• The most common unsupervised learning task is
clustering: detecting potentially useful clusters of
input examples.
• For example, a taxi agent might gradually develop a
concept of "good traffic days" and "bad traffic days"
without ever being given labeled examples of each by
a teacher.
Block diagram of unsupervised learning.
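As a concrete illustration of clustering, here is a minimal sketch of 1-D k-means in Python; the data points and starting centers are invented for illustration (e.g. commute times on "good" vs. "bad" traffic days):

```python
# Minimal sketch of unsupervised clustering: 1-D k-means with k = 2.
# The data and the starting centers are made-up illustrative values.
def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# hypothetical commute times: two groups emerge with no labels given
times = [10, 12, 11, 30, 33, 31]
centers, clusters = kmeans_1d(times, centers=[10.0, 30.0])
```

No labels are supplied anywhere; the two "traffic day" concepts emerge from the data alone.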
Supervised Learning
• Learning with a teacher.
• The teacher has knowledge of the environment.
• This knowledge is represented by a set of input–output examples.
• The agent observes some example input–output pairs
and learns a function that maps from input to output.
Block diagram of learning with a teacher
Reinforcement Learning
• The agent learns from a series of reinforcements—rewards
or punishments.
• For example, the lack of a tip at the end of the journey
gives the taxi agent an indication that it did something
wrong.
• It is up to the agent to decide which of the actions prior to
the reinforcement were most responsible for it.
Block diagram of reinforcement learning;
Semi-supervised Learning
• In practice, these distinctions are not always so crisp.
• In semi-supervised learning we are given a few labeled
examples and must make what we can of a large collection
of unlabeled examples.
• Imagine that you are trying to build a system to guess a
person's age from a photo. You gather some labeled
examples by snapping pictures of people and asking their
age. That's supervised learning.
• But in reality some of the people lied about their age. It's
not just that there is random noise in the data; rather, the
inaccuracies are systematic, and uncovering them is an
unsupervised learning problem involving images, self-reported
ages, and true (unknown) ages.
• Both noise and lack of labels create a continuum between
supervised and unsupervised learning.
Inductive Learning
• The problem of supervised learning:
– Given a training set of N example input–output pairs,
– try to find the unknown function that generated them.
• Learning is a search through the space of possible
hypotheses for one that will perform well, even on new
examples beyond the training set.
• To measure the accuracy of a hypothesis, a test set of
examples is given that are distinct from the training set.
• A hypothesis generalizes well if it correctly predicts the
value for novel examples.
Inductive learning - example A

(Figure: three example input vectors x, each a column of binary
attributes, with target values f(x) = 1, f(x) = 1, and f(x) = 0.)
• f(x) is the target function


• An example is a pair [x, f(x)].
• Learning task: find a hypothesis h such that h(x) ≈ f(x), given a
training set of examples D = {[xᵢ, f(xᵢ)]}, i = 1, 2, …, N.
Inductive learning – example B
(Figure: fits to the same data set – a consistent linear fit, a consistent
7th-order polynomial fit, an inconsistent linear fit, a consistent
6th-order polynomial fit, and a consistent sinusoidal fit.)

• Construct h so that it agrees with f.


• The hypothesis h is consistent if it agrees with f on all
observations.
• Ockham’s razor: Select the simplest consistent hypothesis.

• How to achieve good generalization?
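Ockham's razor can be made concrete with a hedged sketch (all data made up, drawn from a roughly linear source): a simple least-squares line versus a degree-3 polynomial that is consistent with every training point.

```python
from math import prod

# made-up data from a roughly linear source y ~ 2x, with small "noise"
train = [(0, 0.1), (1, 2.2), (2, 3.9), (3, 6.1)]
test_pts = [(4, 8.0), (5, 10.0)]

# Simple hypothesis: least-squares straight line y = a + b*x
n = len(train)
sx = sum(x for x, _ in train); sy = sum(y for _, y in train)
sxx = sum(x * x for x, _ in train); sxy = sum(x * y for x, y in train)
b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
a = (sy - b * sx) / n

# Complex hypothesis: the degree-3 polynomial through all four points
# (Lagrange form), consistent with every training example.
def interpolate(x):
    return sum(y0 * prod((x - xj) / (x0 - xj) for xj, _ in train if xj != x0)
               for x0, y0 in train)

linear_err = sum((a + b * x - y) ** 2 for x, y in test_pts)
interp_err = sum((interpolate(x) - y) ** 2 for x, y in test_pts)
# the simpler, slightly inconsistent line extrapolates far better
```

The interpolating polynomial is consistent on the training set yet generalizes worse, which is exactly the point of preferring the simplest (near-)consistent hypothesis.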


Inductive learning – example C

(Figure: a sequence of increasingly complex fits to the same data set.)

Sometimes a consistent hypothesis is worse than an inconsistent one.
The idealized inductive learning
problem
Find an appropriate hypothesis space H, and find h(x) ∈ H with
minimum "distance" (error) to f(x).

(Diagram: our hypothesis space H contains hopt(x); the error is the
distance from hopt(x) to the true function f(x), which may lie outside H.)

The learning problem is realizable if f(x) ∈ H, i.e. the hypothesis
space contains the true function.
Hypothesis space
• We cannot tell whether a learning problem is realizable or
unrealizable, because the true function is not known.
• One way out:
– Use prior knowledge to derive a hypothesis space in
which the true function must lie.
– Another is to use the largest possible hypothesis space.
Tradeoff
• There is a tradeoff between the expressiveness of a
hypothesis space and the complexity of finding a good
hypothesis within that space.
• For example, fitting a straight line to data is an easy
computation; fitting high-degree polynomials is somewhat
harder; and fitting Turing machines is in general
undecidable.
• A second reason to prefer simple hypothesis spaces is that
presumably we will want to use h after we have learned it,
and computing h(x) when h is a linear function is
guaranteed to be fast, while computing an arbitrary Turing
machine program is not even guaranteed to terminate. For
these reasons, most work on learning has focused on
simple representations.
Hypothesis spaces (examples)

f(x) = 0.5 + x + x² + 6x³

(Diagram: nested hypothesis spaces, H1 ⊂ H2 ⊂ H3.)

H1 = {a + bx} (linear); H2 = {a + bx + cx²} (quadratic);
H3 = {a + bx + cx² + dx³} (cubic)
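The nesting H1 ⊂ H2 ⊂ H3 can be sketched by treating hypotheses as coefficient vectors (a minimal illustration, not from the slides):

```python
def h(coeffs):
    """A polynomial hypothesis h(x) = a + b*x + c*x^2 + ...; coeffs = (a, b, c, ...)."""
    return lambda x: sum(c * x ** i for i, c in enumerate(coeffs))

f = h((0.5, 1, 1, 6))    # the true function 0.5 + x + x^2 + 6x^3 lies in H3
line = h((0.5, 1))       # every member of H1 is also in H2 and H3 (c = d = 0)

# f is not realizable in H1: its values at x = 0, 1, 2 are not collinear
values = [f(x) for x in (0, 1, 2)]   # [0.5, 8.5, 54.5]
```

Only H3 (or larger) contains the cubic f, while the linear space H1 does not, so the problem is realizable only in the richer spaces.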
Learning problems
• The hypothesis takes as input a set of
attributes x and returns a "decision" h(x):
the predicted (estimated) output value
for the input x.

• Discrete valued function ⇒ classification


• Continuous valued function ⇒ regression
Decision Trees
• A DT takes as input an object or situation
described by a set of attributes and returns a
decision: the predicted output value for the input.
• Reaches its decision by performing a sequence of
tests.
• Each internal node in the tree corresponds to a
test of the value of one of the properties, and the
branches from the node are labeled with the
possible values of the test.
• Each leaf node in the tree specifies the value to
be returned if that leaf is reached.
• Very natural for humans.
Decision trees

• "Divide and conquer": split data into smaller
and smaller subsets.
• Splits are usually on a single variable.

(Diagram: root test x1 > a?; its yes and no branches lead to
further tests x2 > b? and x2 > g?, each with yes/no branches.)
An example:Whether to wait for a
table at a restaurant
Goal predicate: WillWait. To set this up as a learning problem,
we must decide what attributes are available to describe examples
in the domain.
10 attributes:
1. Alternate: Is there a suitable alternative restaurant
nearby? {yes,no}
2. Bar: Is there a bar to wait in? {yes,no}
3. Fri/Sat: Is it Friday or Saturday? {yes,no}
4. Hungry: Are you hungry? {yes,no}
5. Patrons: How many are seated in the restaurant? {none,
some, full}
6. Price: Price level {$,$$,$$$}
7. Raining: Is it raining? {yes,no}
8. Reservation: Did you make a reservation? {yes,no}
9. Type: Type of food {French,Italian,Thai,Burger}
10. WaitEstimate: {0-10 min, 10-30 min, 30-60 min, >60
min}
The wait@restaurant decision tree
Expressiveness
• DTs are fully expressible within the class of
propositional languages.
• Any Boolean function can be written as a decision
tree.
• Each row in the truth table for the function
corresponds to a path in the tree, but this may yield
an exponentially large decision tree.
• Decision trees are good for some kinds of
functions and bad for others.
• There is no representation which is efficient for
all kinds of functions.
Inductive learning of decision tree
• In Boolean decision tree learning, each example consists of a
vector of input attributes X and a single Boolean output value y.
• The positive examples: those where the goal is TRUE.
• The negative examples: those where the goal is FALSE.
• The complete set of examples is the training set.
• The basic idea behind the learning algorithm is to test the
most important attribute first.
• "Most important" means the one that makes the most
difference to the classification of an example.
• This way we hope to get the correct classification with a small
number of tests, so that all paths are short and the tree is
small as a whole.
Decision tree learning example

T = True, F = False. The full training set contains 6 True and 6 False examples.

Entropy = −(6/12) log(6/12) − (6/12) log(6/12) ≈ 0.30    (log base 10)
Inductive learning of decision tree

• Simplest: Construct a decision tree with one leaf


for every example = memory based learning.
Not very good generalization.
• Advanced: Split on each variable so that the
purity of each split increases (i.e. either only yes
or only no)
• Purity is measured, e.g., with entropy:

Entropy = −Σᵢ P(vᵢ) log P(vᵢ)

(the numerical examples on the following slides use log base 10)
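The entropy formula can be written as a small helper (base 10 to match the numbers in the worked examples; any base works up to a constant factor):

```python
import math

def entropy(counts, base=10):
    """Entropy of a distribution given as counts; zero-count terms drop out."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total, base)
                for c in counts if c > 0)

entropy([6, 6])   # 6 True, 6 False -> about 0.30
```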
Decision tree learning algorithm

• Create pure nodes (0 entropy) whenever


possible.
• If pure nodes are not possible, choose the
split that leads to the largest decrease in
entropy.
Decision tree learning example
Alternate?

Yes No

3 T, 3 F 3 T, 3 F

Entropy = (6/12)[−(3/6) log(3/6) − (3/6) log(3/6)]
        + (6/12)[−(3/6) log(3/6) − (3/6) log(3/6)] ≈ 0.30

Entropy decrease = 0.30 – 0.30 = 0


Decision tree learning example
Bar?

Yes No

3 T, 3 F 3 T, 3 F

Entropy = (6/12)[−(3/6) log(3/6) − (3/6) log(3/6)]
        + (6/12)[−(3/6) log(3/6) − (3/6) log(3/6)] ≈ 0.30

Entropy decrease = 0.30 – 0.30 = 0


Decision tree learning example
Sat/Fri?

Yes No

2 T, 3 F 4 T, 3 F

Entropy = (5/12)[−(2/5) log(2/5) − (3/5) log(3/5)]
        + (7/12)[−(4/7) log(4/7) − (3/7) log(3/7)] ≈ 0.29

Entropy decrease = 0.30 – 0.29 = 0.01


Decision tree learning example
Hungry?

Yes No

5 T, 2 F 1 T, 4 F

Entropy = (7/12)[−(5/7) log(5/7) − (2/7) log(2/7)]
        + (5/12)[−(1/5) log(1/5) − (4/5) log(4/5)] ≈ 0.24

Entropy decrease = 0.30 – 0.24 = 0.06


Decision tree learning example
Raining?

Yes No

2 T, 2 F 4 T, 4 F

Entropy = (4/12)[−(2/4) log(2/4) − (2/4) log(2/4)]
        + (8/12)[−(4/8) log(4/8) − (4/8) log(4/8)] ≈ 0.30

Entropy decrease = 0.30 – 0.30 = 0


Decision tree learning example
Reservation?

Yes No

3 T, 2 F 3 T, 4 F

Entropy = (5/12)[−(3/5) log(3/5) − (2/5) log(2/5)]
        + (7/12)[−(3/7) log(3/7) − (4/7) log(4/7)] ≈ 0.29

Entropy decrease = 0.30 – 0.29 = 0.01


Decision tree learning example
Patrons?

None: 2 F    Some: 4 T    Full: 2 T, 4 F

Entropy = (2/12)[−(0/2) log(0/2) − (2/2) log(2/2)]
        + (4/12)[−(4/4) log(4/4) − (0/4) log(0/4)]
        + (6/12)[−(2/6) log(2/6) − (4/6) log(4/6)] ≈ 0.14

(taking 0 · log 0 = 0)

Entropy decrease = 0.30 – 0.14 = 0.16


Decision tree learning example
Price

$: 3 T, 3 F    $$: 2 T    $$$: 1 T, 3 F

Entropy = (6/12)[−(3/6) log(3/6) − (3/6) log(3/6)]
        + (2/12)[−(2/2) log(2/2) − (0/2) log(0/2)]
        + (4/12)[−(1/4) log(1/4) − (3/4) log(3/4)] ≈ 0.23

Entropy decrease = 0.30 – 0.23 = 0.07


Decision tree learning example
Type
French: 1 T, 1 F    Italian: 1 T, 1 F    Thai: 2 T, 2 F    Burger: 2 T, 2 F

Entropy = (2/12)[−(1/2) log(1/2) − (1/2) log(1/2)]
        + (2/12)[−(1/2) log(1/2) − (1/2) log(1/2)]
        + (4/12)[−(2/4) log(2/4) − (2/4) log(2/4)]
        + (4/12)[−(2/4) log(2/4) − (2/4) log(2/4)] ≈ 0.30
Entropy decrease = 0.30 – 0.30 = 0
Decision tree learning example
Est. waiting
time
0–10: 4 T, 2 F    10–30: 1 T, 1 F    30–60: 1 T, 1 F    >60: 2 F

Entropy = (6/12)[−(4/6) log(4/6) − (2/6) log(2/6)]
        + (2/12)[−(1/2) log(1/2) − (1/2) log(1/2)]
        + (2/12)[−(1/2) log(1/2) − (1/2) log(1/2)]
        + (2/12)[−(0/2) log(0/2) − (2/2) log(2/2)] ≈ 0.24
Entropy decrease = 0.30 – 0.24 = 0.06
Decision tree learning example
Decision tree learning example

The largest entropy decrease (0.16) is achieved by splitting on Patrons.

(Diagram: Patrons? → None: 2 F; Some: 4 T; Full: 2 T, 4 F, to be
split further on some attribute X?)

Continue like this, making new splits, always purifying nodes.
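The attribute comparisons above can be reproduced with a short script (the true/false counts come from the preceding slides; log base 10 as before):

```python
import math

def entropy(counts, base=10):
    total = sum(counts)
    return -sum((c / total) * math.log(c / total, base)
                for c in counts if c > 0)

def split_entropy(groups, n_total=12):
    """Weighted entropy after a split; groups = [(n_true, n_false), ...]."""
    return sum(((t + f) / n_total) * entropy([t, f])
               for t, f in groups if t + f)

root    = entropy([6, 6])                          # ~0.30
patrons = split_entropy([(0, 2), (4, 0), (2, 4)])  # None / Some / Full
hungry  = split_entropy([(5, 2), (1, 4)])          # Yes / No
# root - patrons ~ 0.16 (the winner); root - hungry ~ 0.06
```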
Decision tree learning example
Induced tree (from examples)
Decision tree learning example
True tree
Decision tree learning example
Induced tree (from examples)

The induced tree cannot be made more complex than what the data supports.
Statistical Learning Models
Learning with complete data
Learning with hidden data
Statistical Learning
• Here, learning is a form of uncertain reasoning from observations.
• Agents can handle uncertainty by using the methods of
probability and decision theory, but first they must learn
their probabilistic theories of the world from experience.
• The data are evidence, i.e. instantiations of some or all of
the random variables describing the domain.
• The hypotheses are probabilistic theories of how the domain
works.
Bayesian Learning
• View learning as Bayesian updating of a probability
distribution over the hypothesis space.
• Let H be the hypothesis variable, with values h1,…,hn, the
possible hypotheses.
• Let D = (d1,…,dN) be the observed data vectors.
• Let X denote the prediction.
• In Bayesian learning:
– Compute the probability of each hypothesis given the data; predictions
are then made on that basis.
– Predictions are made by using all hypotheses, weighted by their
probabilities, rather than by using just a single best hypothesis.
– Learning in the Bayesian setting is reduced to probabilistic inference.
Bayesian Learning
• The probability that the prediction is X, when the data d is
observed, is

P(X|d) = Σᵢ P(X|d, hᵢ) P(hᵢ|d) = Σᵢ P(X|hᵢ) P(hᵢ|d)

• The prediction is a weighted average over the predictions of the
individual hypotheses.
• Hypotheses are intermediaries between the data and the
predictions.
• This requires computing P(hᵢ|d) for all i.
Bayesian Learning Basics Terms
• P(hᵢ) is called the (hypothesis) prior.
– We can embed knowledge by means of the prior.
– It also controls the complexity of the model.
• P(hᵢ|d) is called the posterior (or a posteriori) probability.
• Using Bayes' rule,

P(hᵢ|d) ∝ P(d|hᵢ) P(hᵢ)

• P(d|hᵢ) is called the likelihood of the data.
– Under the i.i.d. assumption, P(d|hᵢ) = Πⱼ P(dⱼ|hᵢ).
• Let hMAP be the hypothesis for which the posterior
probability P(hᵢ|d) is maximal. It is called the maximum a
posteriori (or MAP) hypothesis.
Candy Example
• Two flavors of candy, cherry and lime, wrapped in the same opaque
wrapper. (cannot see inside)
• Sold in very large bags, of which there are known to be five kinds:
– h1: 100% cherry
– h2: 75% cherry + 25% lime
– h3: 50% cherry + 50% lime
– h4: 25% cherry + 75% lime
– h5: 100% lime

• Priors known: P(h1),…,P(h5) are 0.1, 0.2, 0.4, 0.2, 0.1


• Suppose from a bag of candy, we took N pieces of candy and all of
them were lime (data dN).

• What kind of bag is it ?


• What flavor will the next candy be ?

Candy Example
Posterior probability of hypotheses
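The posterior update can be sketched directly, using the priors and per-hypothesis lime probabilities given above (the code layout is an illustration, not from the slides):

```python
# Bayesian updating for the candy example.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]        # P(h1) .. P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]      # P(lime | hi)

def posteriors(n_limes):
    """P(hi | d) after observing n_limes lime candies in a row."""
    unnorm = [p * q ** n_limes for p, q in zip(priors, p_lime)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

posteriors(0)    # just the priors
posteriors(10)   # h5 (all-lime) dominates
```

As more limes are unwrapped, the posterior mass shifts toward h5, exactly as the plotted curves show.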

Candy Example
Prediction Probability

Optimality
• Characteristic: the true hypothesis eventually dominates the
Bayesian prediction.
• For any fixed prior that doesn't rule out the true hypothesis,
the posterior probability of any false hypothesis will eventually
vanish, because the probability of that hypothesis generating
uncharacteristic data is vanishingly small.
• The Bayesian prediction is optimal, whether the data set is small
or large.
• The cost: for real learning problems, the hypothesis space
is usually very large or infinite.
Maximum a posteriori (MAP) Learning
• Since calculating the exact probability is often impractical,
we approximate using the MAP hypothesis. That is,

P(X|d) ≈ P(X|hMAP)

• Make the prediction with the most probable hypothesis.
• Summing over the hypothesis space is often intractable;
instead of a large summation (or integration), an optimization
problem can be solved.
• In the candy example, the MAP and Bayesian predictions differ
at the start, but as more data arrives they become closer.
MAP approximation MDL Principle

• Since P(hᵢ|d) ∝ P(d|hᵢ)P(hᵢ), instead of maximizing P(hᵢ|d) we
may maximize P(d|hᵢ)P(hᵢ).
• Equivalently, we may minimize

−log₂ P(d|hᵢ)P(hᵢ) = −log₂ P(d|hᵢ) − log₂ P(hᵢ)

– We can interpret this as choosing hᵢ to minimize the number of bits
required to encode the hypothesis hᵢ and the data d under that
hypothesis.
• The principle of minimizing code length (under some pre-determined
coding scheme) is called the minimum description length (or MDL)
principle.
– MDL is used in a wide range of practical machine learning applications.
Maximum Likelihood Approximation
• Assume furthermore that the P(hᵢ) are all equal, i.e., assume a
uniform prior.
– This is reasonable when there is no reason to prefer one hypothesis over
another a priori.
– For a large data set, the prior becomes irrelevant anyway.
• To obtain the MAP hypothesis it then suffices to maximize P(d|hᵢ),
the likelihood.
– The maximizer is the maximum likelihood hypothesis hML.
• MAP with a uniform prior ⇒ ML.
• ML is the standard statistical learning method:
– simply find the best fit to the data.
Learning with Complete Data
• Data is complete when each data point contains values for
every variable in the probability model being learned.
• Complete data greatly simplifies the problem of learning the
parameters of a complex model.
Learning with Data : Parameter
Learning
• Introduce a parametric probability model with
parameter θ.
– Then the hypotheses are hθ, i.e., the hypotheses are
parameterized.
– In the simplest case, θ is a single scalar. In more
complex cases, θ consists of many components.
• Using the data d, estimate the parameter θ.
ML Parameter Learning Examples : discrete case

• A bag of candy whose lime–cherry proportions are
completely unknown.
– In this case we have hypotheses hθ parameterized
by the probability θ of cherry.
– P(d|hθ) = Πⱼ P(dⱼ|hθ) = θᶜ (1−θ)ˡ, where c and l are the
numbers of cherry and lime candies observed.
– Find the hθ that maximizes P(d|hθ).
– The same value is obtained by maximizing the
log likelihood:

L(d|hθ) = log P(d|hθ) = c log θ + l log(1−θ)

– The maximum is at θ = c/(c+l).
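A quick numerical check of the closed-form maximum, with hypothetical counts of 7 cherries and 3 limes (the counts are made up for illustration):

```python
import math

n_cherry, n_lime = 7, 3                       # hypothetical observed counts
theta_ml = n_cherry / (n_cherry + n_lime)     # closed form: c/(c+l) = 0.7

def log_likelihood(theta):
    # c log(theta) + l log(1 - theta), as on the slide above
    return n_cherry * math.log(theta) + n_lime * math.log(1 - theta)

# a simple grid search over (0, 1) lands on the same value
best = max((i / 100 for i in range(1, 100)), key=log_likelihood)
```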
ML Parameter Learning Examples : discrete case

• Two wrappers, green and red, are selected according to
some unknown conditional distribution, depending on the
flavor.
– The model has three parameters: θ = P(F=cherry),
θ₁ = P(W=red | F=cherry), θ₂ = P(W=red | F=lime).
– With the counts of each flavor–wrapper combination in the
exponents (e.g. g,c = number of cherry candies in green
wrappers), the likelihood of the data is

P(d|hθ,θ₁,θ₂) = θᶜ(1−θ)ˡ · θ₁^(r,c)(1−θ₁)^(g,c) · θ₂^(r,l)(1−θ₂)^(g,l)

– Maximize it.
Naïve Bayes Method
• Attributes (components of the observed data) are assumed to be
conditionally independent given the class in the Naïve Bayes method.
– Works well for about 2/3 of real-world problems, despite the naivety
of the assumption.
– Goal: predict the class C, given the observed data xᵢ.
• By the independence assumption,

P(C|x1,…,xn) ∝ P(C) Πᵢ P(xᵢ|C)

• We choose the most likely class.
• Merits of NB:
– Scales well: no search is required.
– Robust against noisy data.
– Gives probabilistic predictions.
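A minimal Naïve Bayes sketch on a made-up two-attribute data set (the attributes, classes, and the add-one smoothing are illustrative assumptions, not from the slides):

```python
from collections import Counter
import math

# Tiny hypothetical training set: (attributes, class)
data = [
    ({"hungry": "yes", "raining": "no"},  "wait"),
    ({"hungry": "yes", "raining": "yes"}, "wait"),
    ({"hungry": "no",  "raining": "yes"}, "leave"),
    ({"hungry": "no",  "raining": "no"},  "leave"),
]

def nb_predict(x):
    classes = Counter(c for _, c in data)
    scores = {}
    for c, n_c in classes.items():
        # log P(C) + sum_i log P(x_i | C), with add-one smoothing
        score = math.log(n_c / len(data))
        for attr, val in x.items():
            n_match = sum(1 for feats, cls in data
                          if cls == c and feats[attr] == val)
            score += math.log((n_match + 1) / (n_c + 2))
        scores[c] = score
    return max(scores, key=scores.get)

nb_predict({"hungry": "yes", "raining": "no"})   # -> "wait"
```

No search is involved: prediction is just counting and a product of per-attribute probabilities, which is why NB scales so well.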
Learning Curve on the Restaurant Problem

Learning with Hidden variables
• Many real-world problems have hidden (latent) variables,
which are not observable in the data available for learning.
• Example: medical records (the observed symptoms, the
treatment applied, and the outcome); the disease itself is
hidden from direct observation.
• Why not construct a model without the disease variable if it is
not observed?
• Latent variables can drastically reduce the number of
parameters required to specify a Bayesian network, and hence
the amount of data needed to learn those parameters.
• But hidden variables complicate the learning problem.
Missing Data
• Data collection and feature generation are not perfect. Objects in a pattern
recognition application may have missing features:
– Features are extracted by multiple sensors, and one or more sensors may be
malfunctioning when the object is presented.
– A respondent fails to answer all the questions in a questionnaire.
– Some participants quit in the middle of a study in which measurements
are taken at different time points.
• An example of missing data: the "dermatology" data set from UCI.
(Figure: the data matrix with missing values highlighted.)
Missing Data
• Different types of missing data:
– Missing completely at random (MCAR).
– Missing at random (MAR): whether a feature is missing is random,
after conditioning on another feature.
• Example: people who are depressed might be less inclined to
report their income, and thus reported income will be related to
depression. However, if within depressed patients the probability
of reporting income is unrelated to income level, then the data
would be considered MAR.
• If the data are neither MAR nor MCAR, the missing-data mechanism
needs to be modeled explicitly.
– Example: if a user fails to provide a rating for a movie, it is more
likely that he/she dislikes the movie and therefore has not watched
it.
Strategy of Coping With Missing
Data
• Discarding cases:
– Ignore patterns with missing features.
– Ignore features with missing values.
– Easy to implement, but wasteful of data, and may bias the parameter
estimates if the data are not MCAR.
• Maximum likelihood (the EM algorithm):
– Under some assumption of how the data are generated, estimate the
parameters that maximize the likelihood of the observed data (the
non-missing features).
The Expectation-Maximization
Algorithm
• The EM algorithm is an iterative algorithm that finds the parameters that
maximize the log-likelihood when there are missing data.
• Complete data: the combination of the missing data and the observed
data.
• The EM algorithm is best used in situations where the log-likelihood of the
observed data is difficult to optimize, whereas the log-likelihood of the
complete data is easy to optimize.
The EM Algorithm
• Let X and Z denote the vectors of all observed data and hidden data,
respectively.
• Let θᵗ be the parameter estimate at the t-th iteration.
• Define the Q-function (a function of θ); for discrete Z,

Q(θ; θᵗ) = Σ_z P(Z=z | X, θᵗ) log P(X, Z=z | θ)

• Intuitively, the Q-function is the "expected value of the complete-data
log-likelihood":
– Fill in all the possible values for the missing data Z. This gives rise to
the complete data, and we compute its log-likelihood.
– Not all possible values for the missing features are equally "good".
The "goodness" of a particular way of filling in (Z=z) is determined by
how likely the r.v. Z is to take the value z.
– This probability is determined by the observed data X and the current
parameter θᵗ.
The EM Algorithm
• An improved parameter estimate at iteration (t+1) is obtained by
maximizing the Q-function:

θᵗ⁺¹ = argmax_θ Q(θ; θᵗ)

• The EM algorithm takes its name from the fact that it alternates between
the E-step (expectation) and the M-step (maximization):
– E-step: compute Q(θ; θᵗ).
– M-step: find the θ that maximizes Q(θ; θᵗ).
• The difference between θ and θᵗ: one is a variable, while the other is given
(fixed).
Contd.
• Since EM is iterative, it may end up in a local maximum (instead of the
global maximum) of the log-likelihood of the observed data.
– A good initialization is needed to find a good local maximum.
• The EM algorithm may not be the most efficient algorithm for maximizing
the log-likelihood; however, it is often fairly simple to implement.
• If the missing data are continuous, integration should be used instead of
the summation to form Q(θ; θᵗ).
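A classic small instance of EM (not from the slides): two coins with unknown head probabilities, where the identity of the coin behind each sequence of tosses is the hidden variable Z. All numbers below are made up for illustration:

```python
# EM sketch: estimate head probabilities (tA, tB) of two coins when we
# only see the number of heads per 10-toss sequence, not which coin
# produced it. The hidden variable is the coin identity per sequence.
def em_two_coins(heads, n_flips, theta=(0.6, 0.5), iters=20):
    tA, tB = theta
    for _ in range(iters):
        a_h = a_t = b_h = b_t = 0.0   # expected head/tail counts per coin
        for h in heads:
            t = n_flips - h
            like_a = tA ** h * (1 - tA) ** t
            like_b = tB ** h * (1 - tB) ** t
            w = like_a / (like_a + like_b)   # E-step: P(coin A | sequence)
            a_h += w * h;       a_t += w * t
            b_h += (1 - w) * h; b_t += (1 - w) * t
        tA = a_h / (a_h + a_t)               # M-step: re-estimate biases
        tB = b_h / (b_h + b_t)
    return tA, tB

# hypothetical data: heads out of 10 flips for five sequences
tA, tB = em_two_coins(heads=[9, 8, 2, 1, 9], n_flips=10)
```

The E-step fills in the hidden coin identity softly (the weight w), and the M-step maximizes the resulting expected complete-data log-likelihood, which here has a closed form.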
Reinforcement Learning

How can an agent learn from success and
failure, from reward and punishment?
Reinforcement
• Without some feedback about what is good and what
is bad, the agent will have no grounds for deciding
which move to make.
• The agent needs to know that something good has
happened when it wins and that something bad has
happened when it loses. This type of feedback is called
a reward, or reinforcement.
• The reward may come at the end of the sequence or
along the way.
• The reward is part of the input percept; the agent must
be able to recognize it as such.
• Task of reinforcement learning is to use observed
rewards to learn an optimal policy for the
environment.
Reinforcement
• An agent is placed in an environment and must
learn to behave successfully therein.
• Passive learning: where the agent’s policy is fixed
and the task is to learn the utilities of states.
• Active learning: where the agent must also learn
what to do.
– Exploration (the principal issue): an agent must experience
as much as possible of its environment in order to
learn how to behave in it.
Passive Reinforcement Learning
• Using state-based representation in a fully
observable environment.
• Here:
– The agent's policy is fixed: in state s, it always
executes the action π(s).
– Goal: to learn how good the policy is, i.e. to learn
the utility function Uπ(s).
Example
Contd.
• The agent executes a set of trials in the
environment using its policy π.
• In each trial, the agent starts in state (1,1) and
experiences a sequence of state transitions
until it reaches one of the terminal states, (4,
2) or (4, 3).
• Its percepts supply both the current state and
the reward received in the state.
Contd.
• The trials might look like:
(1,1)-.04 (1,2)-.04 (1,3)-.04 (1,2)-.04 (1,3)-.04 (2,3)-.04  (3,3)-.04 (4,3)+1
(1,1)-.04 (1,2)-.04 (1,3)-.04 (2,3)-.04 (3,3)-.04 (3,2)-.04  (3,3)-.04 (4,3)+1
(1,1)-.04 (2,1)-.04 (3,1)-.04 (3,2)-.04 (4,2)-1

• Each state percept is subscripted with the reward


received.
• Objective: use the information about rewards to
learn the expected utility U(s) associated with
each non-terminal state s.
Contd.
• The utility is defined to be the expected sum of
(discounted) rewards obtained if policy π is followed:

Uπ(s) = E[ Σₜ₌₀^∞ γᵗ R(sₜ) | π, s₀ = s ]

• Here, the discount factor γ = 1.
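Direct utility estimation on the three trials above can be sketched as follows (reward-to-go = sum of rewards from a state's visit to the end of its trial, with γ = 1):

```python
from collections import defaultdict

# the three trials above, as (state, reward) pairs
trials = [
    [((1,1), -.04), ((1,2), -.04), ((1,3), -.04), ((1,2), -.04),
     ((1,3), -.04), ((2,3), -.04), ((3,3), -.04), ((4,3), 1.0)],
    [((1,1), -.04), ((1,2), -.04), ((1,3), -.04), ((2,3), -.04),
     ((3,3), -.04), ((3,2), -.04), ((3,3), -.04), ((4,3), 1.0)],
    [((1,1), -.04), ((2,1), -.04), ((3,1), -.04), ((3,2), -.04), ((4,2), -1.0)],
]

totals, counts = defaultdict(float), defaultdict(int)
for trial in trials:
    rewards = [r for _, r in trial]
    for i, (state, _) in enumerate(trial):
        totals[state] += sum(rewards[i:])   # reward-to-go from this visit
        counts[state] += 1

U = {s: totals[s] / counts[s] for s in totals}
# e.g. U[(1,1)] averages the reward-to-go values 0.72, 0.72 and -1.16
```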
Contd.
• Direct utility estimation: uses the total observed
reward-to-go for a given state as direct
evidence for learning its utility.
• Adaptive dynamic programming: learns a transition model and a
reward function from observations, and then
uses value or policy iteration to obtain the
utilities or an optimal policy.
• Temporal-difference learning: updates utility estimates to
match those of successor states.
Active Reinforcement Learning
• An active agent must decide what actions to
take.
• The agent will need to learn a complete model
with outcome probabilities for all actions,
rather than just the model for the fixed policy.
• Fact: an agent has a choice of actions.
Contd.
• Adaptive dynamic programming.
• Need to take into account the fact that the agent has a
choice of actions.
• Bellman equation:

U(s) = R(s) + γ maxₐ Σ_s′ T(s, a, s′) U(s′)
• Having obtained a utility function that is optimal for
the learned model, the agent can extract an optimal
action by one-step look-ahead to maximize the
expected utility.
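A minimal value-iteration sketch on a made-up 2-state MDP (the states, rewards, transition model, and γ = 0.9 are all invented for illustration):

```python
# Value iteration for a tiny hypothetical MDP: state B carries the
# reward, and from A the agent can "go" to B or "stay".
def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
    """T[(s, a)] maps successor state -> probability."""
    U = {s: 0.0 for s in states}
    while True:
        U_new = {s: R[s] + gamma * max(
                     sum(p * U[s2] for s2, p in T[(s, a)].items())
                     for a in actions)
                 for s in states}
        if max(abs(U_new[s] - U[s]) for s in states) < eps:
            return U_new
        U = U_new

states, actions = ["A", "B"], ["stay", "go"]
R = {"A": 0.0, "B": 1.0}
T = {("A", "stay"): {"A": 1.0}, ("A", "go"): {"B": 1.0},
     ("B", "stay"): {"B": 1.0}, ("B", "go"): {"A": 1.0}}
U = value_iteration(states, actions, T, R)
# U["B"] converges to 1/(1-0.9) = 10 and U["A"] to 0.9*10 = 9, so
# one-step look-ahead in A picks "go" (toward the rewarding state).
```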
Exploration
• Learned model is not the same as the true
environment.
• What is optimal in learned model can
therefore be suboptimal in the true
environment.
• The agent does not know what the true
environment is, so it cannot compute the
optimal action for the true environment.
• What is to be done?
Exploitation/Exploration
• Actions do more than provide reward according to the
current learned model
– Also contribute to learning the true model by affecting the
percepts that are received.
• By improving the model, the agent will receive greater
reward in the future.
• Exploitation: maximize reward, as measured by the current
utility estimates. (Pure exploitation risks getting stuck.)
• Exploration: maximize long-term well-being. (Pure exploration
is useless: the knowledge gained is never put into practice.)
• There must be a tradeoff between the two.
Optimal Exploration Policy
• Finding an optimal exploration policy is extremely difficult.
• Instead, we come up with a reasonable scheme:
• Greedy in the Limit of Infinite Exploration (GLIE).
• A GLIE scheme must try each action in each state an
unbounded number of times to avoid having a finite
probability that an optimal action is missed because of
an unusually bad series of outcomes.
• Agent using such a scheme will eventually learn the
true environment model.
• A GLIE scheme must also eventually become greedy, so
that the agent’s actions become optimal w.r.t. the
learned model.
GLIE schemes
• Simplest is to choose a random action a fraction
1/t of the time and to follow the greedy policy
otherwise.
• Can be extremely slow.
• More sensible approach is to give some weight to
actions that the agent has not tried very often,
while tending to avoid actions that are believed
to be of low utility.
• This causes the agent to behave initially as if there
were wonderful rewards scattered all over the
place (an optimistic view).
Contd.
• Let U⁺(s) denote the optimistic estimate of the utility
of the state.
• N(a, s): the number of times action a has been tried in
state s.
• The update equation incorporating the optimistic
estimate:

U⁺(s) ← R(s) + γ maxₐ f( Σ_s′ T(s, a, s′) U⁺(s′), N(a, s) )
• f(u, n) is the exploration function. It determines how
greed is traded off against curiosity .
• It should be increasing in u and decreasing in n.
Possible function
• One possible definition of f(u, n):

f(u, n) = R⁺  if n < Nₑ
        = u   otherwise

– R⁺ is an optimistic estimate of the best possible reward
obtainable in any state.
– The agent will try each state–action pair (s, a) at least Nₑ
times.
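The exploration function can be sketched directly; the values R⁺ = 2.0 and Nₑ = 5 below are hypothetical, chosen only for illustration:

```python
R_PLUS, NE = 2.0, 5   # hypothetical optimistic reward R+ and trial count Ne

def f(u, n):
    """Optimistic utility: assume the best possible reward R+ until the
    state-action pair has been tried at least Ne times, then trust u."""
    return R_PLUS if n < NE else u

f(0.3, 2)   # -> 2.0 (fewer than Ne trials: keep exploring)
f(0.3, 7)   # -> 0.3 (enough trials: trust the estimate)
```

Because f is increasing in u and decreasing in n, under-tried actions look attractive early on, and the agent gradually becomes greedy, as a GLIE scheme requires.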
