Types of Learning in AI Explained

Learning

Learning with complete data


Learning with hidden data

Unit-IV
Learning in AI

Deductive: deduce rules/facts from already known rules/facts. (We have already dealt with this.)

(A ⇒ B) ∧ (B ⇒ C)  ⊢  (A ⇒ C)

Inductive: learn new rules/facts from a data set D.

D = {(x(n), y(n))}, n = 1…N  ⊢  (A ⇒ C)
Types of learning
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
Unsupervised Learning
• Learning without a teacher.
• No teacher to oversee the learning process.
• No knowledge.
• The agent learns patterns in the input even though no
explicit feedback is supplied.
• The most common unsupervised learning task is
clustering: detecting potentially useful clusters of
input examples.
• For example, a taxi agent might gradually develop a
concept of "good traffic days" and "bad traffic days"
without ever being given labeled examples of each by
a teacher.
Block diagram of unsupervised learning.
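As a concrete illustration of clustering, here is a minimal sketch of 1-D k-means in Python; the data points and starting centers are invented for illustration (e.g. commute times on "good" vs. "bad" traffic days):

```python
# Minimal sketch of unsupervised clustering: 1-D k-means with k = 2.
# The data and the starting centers are made-up illustrative values.
def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# hypothetical commute times: two groups emerge with no labels given
times = [10, 12, 11, 30, 33, 31]
centers, clusters = kmeans_1d(times, centers=[10.0, 30.0])
```

No labels are supplied anywhere; the two "traffic day" concepts emerge from the data alone.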
Supervised Learning
• Learning with a teacher.
• The teacher has knowledge of the environment.
• This knowledge is represented by a set of input–output examples.
• The agent observes some example input–output pairs
and learns a function that maps from input to output.
Block diagram of learning with a teacher
Reinforcement Learning
• The agent learns from a series of reinforcements—rewards
or punishments.
• For example, the lack of a tip at the end of the journey
gives the taxi agent an indication that it did something
wrong.
• It is up to the agent to decide which of the actions prior to
the reinforcement were most responsible for it.
Block diagram of reinforcement learning;
Semi-supervised Learning
• In practice, these distinctions are not always so crisp.
• In semi-supervised learning we are given a few labeled
examples and must make what we can of a large collection
of unlabeled examples.
• Imagine that you are trying to build a system to guess a
person's age from a photo. You gather some labeled
examples by snapping pictures of people and asking their
age. That's supervised learning.
• But in reality some of the people lied about their age. It's
not just that there is random noise in the data; rather, the
inaccuracies are systematic, and uncovering them is an
unsupervised learning problem involving images, self-reported
ages, and true (unknown) ages.
• Both noise and lack of labels create a continuum between
supervised and unsupervised learning.
Inductive Learning
• The problem of supervised learning:
– Given a training set of N example input–output pairs,
– try to find the unknown function that generated them.
• Learning is a search through the space of possible
hypotheses for one that will perform well, even on new
examples beyond the training set.
• To measure the accuracy of a hypothesis, a test set of
examples is given that are distinct from the training set.
• A hypothesis generalizes well if it correctly predicts the
value for novel examples.
Inductive learning - example A

(Figure: three example input vectors x, each a column of binary
attributes, with target values f(x) = 1, f(x) = 1, and f(x) = 0.)
• f(x) is the target function


• An example is a pair [x, f(x)].
• Learning task: find a hypothesis h such that h(x) ≈ f(x), given a
training set of examples D = {[xᵢ, f(xᵢ)]}, i = 1, 2, …, N.
Inductive learning – example B
(Figure: fits to the same data set – a consistent linear fit, a consistent
7th-order polynomial fit, an inconsistent linear fit, a consistent
6th-order polynomial fit, and a consistent sinusoidal fit.)

• Construct h so that it agrees with f.


• The hypothesis h is consistent if it agrees with f on all
observations.
• Ockham’s razor: Select the simplest consistent hypothesis.

• How to achieve good generalization?
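Ockham's razor can be made concrete with a hedged sketch (all data made up, drawn from a roughly linear source): a simple least-squares line versus a degree-3 polynomial that is consistent with every training point.

```python
from math import prod

# made-up data from a roughly linear source y ~ 2x, with small "noise"
train = [(0, 0.1), (1, 2.2), (2, 3.9), (3, 6.1)]
test_pts = [(4, 8.0), (5, 10.0)]

# Simple hypothesis: least-squares straight line y = a + b*x
n = len(train)
sx = sum(x for x, _ in train); sy = sum(y for _, y in train)
sxx = sum(x * x for x, _ in train); sxy = sum(x * y for x, y in train)
b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
a = (sy - b * sx) / n

# Complex hypothesis: the degree-3 polynomial through all four points
# (Lagrange form), consistent with every training example.
def interpolate(x):
    return sum(y0 * prod((x - xj) / (x0 - xj) for xj, _ in train if xj != x0)
               for x0, y0 in train)

linear_err = sum((a + b * x - y) ** 2 for x, y in test_pts)
interp_err = sum((interpolate(x) - y) ** 2 for x, y in test_pts)
# the simpler, slightly inconsistent line extrapolates far better
```

The interpolating polynomial is consistent on the training set yet generalizes worse, which is exactly the point of preferring the simplest (near-)consistent hypothesis.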


Inductive learning – example C

(Figure: a sequence of increasingly complex fits to the same data set.)

Sometimes a consistent hypothesis is worse than an inconsistent one.
The idealized inductive learning
problem
Find an appropriate hypothesis space H, and find h(x) ∈ H with
minimum "distance" (error) to f(x).

(Diagram: our hypothesis space H contains hopt(x); the error is the
distance from hopt(x) to the true function f(x), which may lie outside H.)

The learning problem is realizable if f(x) ∈ H, i.e. the hypothesis
space contains the true function.
Hypothesis space
• We cannot tell whether a learning problem is realizable or
unrealizable, because the true function is not known.
• One way out:
– Use prior knowledge to derive a hypothesis space in
which the true function must lie.
– Another is to use the largest possible hypothesis space.
Tradeoff
• There is a tradeoff between the expressiveness of a
hypothesis space and the complexity of finding a good
hypothesis within that space.
• For example, fitting a straight line to data is an easy
computation; fitting high-degree polynomials is somewhat
harder; and fitting Turing machines is in general
undecidable.
• A second reason to prefer simple hypothesis spaces is that
presumably we will want to use h after we have learned it,
and computing h(x) when h is a linear function is
guaranteed to be fast, while computing an arbitrary Turing
machine program is not even guaranteed to terminate. For
these reasons, most work on learning has focused on
simple representations.
Hypothesis spaces (examples)

f(x) = 0.5 + x + x² + 6x³

(Diagram: nested hypothesis spaces, H1 ⊂ H2 ⊂ H3.)

H1 = {a + bx} (linear); H2 = {a + bx + cx²} (quadratic);
H3 = {a + bx + cx² + dx³} (cubic)
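The nesting H1 ⊂ H2 ⊂ H3 can be sketched by treating hypotheses as coefficient vectors (a minimal illustration, not from the slides):

```python
def h(coeffs):
    """A polynomial hypothesis h(x) = a + b*x + c*x^2 + ...; coeffs = (a, b, c, ...)."""
    return lambda x: sum(c * x ** i for i, c in enumerate(coeffs))

f = h((0.5, 1, 1, 6))    # the true function 0.5 + x + x^2 + 6x^3 lies in H3
line = h((0.5, 1))       # every member of H1 is also in H2 and H3 (c = d = 0)

# f is not realizable in H1: its values at x = 0, 1, 2 are not collinear
values = [f(x) for x in (0, 1, 2)]   # [0.5, 8.5, 54.5]
```

Only H3 (or larger) contains the cubic f, while the linear space H1 does not, so the problem is realizable only in the richer spaces.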
Learning problems
• The hypothesis takes as input a set of
attributes x and returns a "decision" h(x):
the predicted (estimated) output value
for the input x.

• Discrete valued function ⇒ classification


• Continuous valued function ⇒ regression
Decision Trees
• A DT takes as input an object or situation
described by a set of attributes and returns a
decision: the predicted output value for the input.
• Reaches its decision by performing a sequence of
tests.
• Each internal node in the tree corresponds to a
test of the value of one of the properties, and the
branches from the node are labeled with the
possible values of the test.
• Each leaf node in the tree specifies the value to
be returned if that leaf is reached.
• Very natural for humans.
Decision trees

• "Divide and conquer": split data into smaller
and smaller subsets.
• Splits are usually on a single variable.

(Diagram: root test x1 > a?; its yes and no branches lead to
further tests x2 > b? and x2 > g?, each with yes/no branches.)
An example:Whether to wait for a
table at a restaurant
Goal predicate: WillWait. To set this up as a learning problem,
we must decide what attributes are available to describe examples
in the domain.
10 attributes:
1. Alternate: Is there a suitable alternative restaurant
nearby? {yes,no}
2. Bar: Is there a bar to wait in? {yes,no}
3. Fri/Sat: Is it Friday or Saturday? {yes,no}
4. Hungry: Are you hungry? {yes,no}
5. Patrons: How many are seated in the restaurant? {none,
some, full}
6. Price: Price level {$,$$,$$$}
7. Raining: Is it raining? {yes,no}
8. Reservation: Did you make a reservation? {yes,no}
9. Type: Type of food {French,Italian,Thai,Burger}
10. WaitEstimate: {0-10 min, 10-30 min, 30-60 min, >60
min}
The wait@restaurant decision tree
Expressiveness
• DTs are fully expressible within the class of
propositional languages.
• Any Boolean function can be written as a decision
tree.
• Each row in the truth table for the function
corresponds to a path in the tree, but this may yield
an exponentially large decision tree.
• Decision trees are good for some kinds of
functions and bad for others.
• There is no representation which is efficient for
all kinds of functions.
Inductive learning of decision tree
• In Boolean decision tree learning, each example consists of a
vector of input attributes X and a single Boolean output value y.
• The positive examples: those where the goal is TRUE.
• The negative examples: those where the goal is FALSE.
• The complete set of examples is the training set.
• The basic idea behind the learning algorithm is to test the
most important attribute first.
• "Most important" means the one that makes the most
difference to the classification of an example.
• This way we hope to get the correct classification with a small
number of tests, so that all paths are short and the tree is
small as a whole.
Decision tree learning example

T = True, F = False. The full training set contains 6 True and 6 False examples.

Entropy = −(6/12) log(6/12) − (6/12) log(6/12) ≈ 0.30    (log base 10)
Inductive learning of decision tree

• Simplest: Construct a decision tree with one leaf


for every example = memory based learning.
Not very good generalization.
• Advanced: Split on each variable so that the
purity of each split increases (i.e. either only yes
or only no)
• Purity is measured, e.g., with entropy:

Entropy = −Σᵢ P(vᵢ) log P(vᵢ)

(the numerical examples on the following slides use log base 10)
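The entropy formula can be written as a small helper (base 10 to match the numbers in the worked examples; any base works up to a constant factor):

```python
import math

def entropy(counts, base=10):
    """Entropy of a distribution given as counts; zero-count terms drop out."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total, base)
                for c in counts if c > 0)

entropy([6, 6])   # 6 True, 6 False -> about 0.30
```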
Decision tree learning algorithm

• Create pure nodes (0 entropy) whenever


possible.
• If pure nodes are not possible, choose the
split that leads to the largest decrease in
entropy.
Decision tree learning example
Alternate?

Yes No

3 T, 3 F 3 T, 3 F

Entropy = (6/12)[−(3/6) log(3/6) − (3/6) log(3/6)]
        + (6/12)[−(3/6) log(3/6) − (3/6) log(3/6)] ≈ 0.30

Entropy decrease = 0.30 – 0.30 = 0


Decision tree learning example
Bar?

Yes No

3 T, 3 F 3 T, 3 F

Entropy = (6/12)[−(3/6) log(3/6) − (3/6) log(3/6)]
        + (6/12)[−(3/6) log(3/6) − (3/6) log(3/6)] ≈ 0.30

Entropy decrease = 0.30 – 0.30 = 0


Decision tree learning example
Sat/Fri?

Yes No

2 T, 3 F 4 T, 3 F

Entropy = (5/12)[−(2/5) log(2/5) − (3/5) log(3/5)]
        + (7/12)[−(4/7) log(4/7) − (3/7) log(3/7)] ≈ 0.29

Entropy decrease = 0.30 – 0.29 = 0.01


Decision tree learning example
Hungry?

Yes No

5 T, 2 F 1 T, 4 F

Entropy = (7/12)[−(5/7) log(5/7) − (2/7) log(2/7)]
        + (5/12)[−(1/5) log(1/5) − (4/5) log(4/5)] ≈ 0.24

Entropy decrease = 0.30 – 0.24 = 0.06


Decision tree learning example
Raining?

Yes No

2 T, 2 F 4 T, 4 F

Entropy = (4/12)[−(2/4) log(2/4) − (2/4) log(2/4)]
        + (8/12)[−(4/8) log(4/8) − (4/8) log(4/8)] ≈ 0.30

Entropy decrease = 0.30 – 0.30 = 0


Decision tree learning example
Reservation?

Yes No

3 T, 2 F 3 T, 4 F

Entropy = (5/12)[−(3/5) log(3/5) − (2/5) log(2/5)]
        + (7/12)[−(3/7) log(3/7) − (4/7) log(4/7)] ≈ 0.29

Entropy decrease = 0.30 – 0.29 = 0.01


Decision tree learning example
Patrons?

None: 2 F    Some: 4 T    Full: 2 T, 4 F

Entropy = (2/12)[−(0/2) log(0/2) − (2/2) log(2/2)]
        + (4/12)[−(4/4) log(4/4) − (0/4) log(0/4)]
        + (6/12)[−(2/6) log(2/6) − (4/6) log(4/6)] ≈ 0.14

(taking 0 · log 0 = 0)

Entropy decrease = 0.30 – 0.14 = 0.16


Decision tree learning example
Price

$: 3 T, 3 F    $$: 2 T    $$$: 1 T, 3 F

Entropy = (6/12)[−(3/6) log(3/6) − (3/6) log(3/6)]
        + (2/12)[−(2/2) log(2/2) − (0/2) log(0/2)]
        + (4/12)[−(1/4) log(1/4) − (3/4) log(3/4)] ≈ 0.23

Entropy decrease = 0.30 – 0.23 = 0.07


Decision tree learning example
Type
French: 1 T, 1 F    Italian: 1 T, 1 F    Thai: 2 T, 2 F    Burger: 2 T, 2 F

Entropy = (2/12)[−(1/2) log(1/2) − (1/2) log(1/2)]
        + (2/12)[−(1/2) log(1/2) − (1/2) log(1/2)]
        + (4/12)[−(2/4) log(2/4) − (2/4) log(2/4)]
        + (4/12)[−(2/4) log(2/4) − (2/4) log(2/4)] ≈ 0.30
Entropy decrease = 0.30 – 0.30 = 0
Decision tree learning example
Est. waiting
time
0–10: 4 T, 2 F    10–30: 1 T, 1 F    30–60: 1 T, 1 F    >60: 2 F

Entropy = (6/12)[−(4/6) log(4/6) − (2/6) log(2/6)]
        + (2/12)[−(1/2) log(1/2) − (1/2) log(1/2)]
        + (2/12)[−(1/2) log(1/2) − (1/2) log(1/2)]
        + (2/12)[−(0/2) log(0/2) − (2/2) log(2/2)] ≈ 0.24
Entropy decrease = 0.30 – 0.24 = 0.06
Decision tree learning example
Decision tree learning example

The largest entropy decrease (0.16) is achieved by splitting on Patrons.

(Diagram: Patrons? → None: 2 F; Some: 4 T; Full: 2 T, 4 F, to be
split further on some attribute X?)

Continue like this, making new splits, always purifying nodes.
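The attribute comparisons above can be reproduced with a short script (the true/false counts come from the preceding slides; log base 10 as before):

```python
import math

def entropy(counts, base=10):
    total = sum(counts)
    return -sum((c / total) * math.log(c / total, base)
                for c in counts if c > 0)

def split_entropy(groups, n_total=12):
    """Weighted entropy after a split; groups = [(n_true, n_false), ...]."""
    return sum(((t + f) / n_total) * entropy([t, f])
               for t, f in groups if t + f)

root    = entropy([6, 6])                          # ~0.30
patrons = split_entropy([(0, 2), (4, 0), (2, 4)])  # None / Some / Full
hungry  = split_entropy([(5, 2), (1, 4)])          # Yes / No
# root - patrons ~ 0.16 (the winner); root - hungry ~ 0.06
```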
Decision tree learning example
Induced tree (from examples)
Decision tree learning example
True tree
Decision tree learning example
Induced tree (from examples)

The induced tree cannot be made more complex than what the data supports.
Statistical Learning Models
Learning with complete data
Learning with hidden data
Statistical Learning
• Here, learning is a form of uncertain reasoning from observations.
• Agents can handle uncertainty by using the methods of
probability and decision theory, but first they must learn
their probabilistic theories of the world from experience.
• The data are evidence, i.e. instantiations of some or all of
the random variables describing the domain.
• The hypotheses are probabilistic theories of how the domain
works.
Bayesian Learning
• View learning as Bayesian updating of a probability
distribution over the hypothesis space.
• Let H be the hypothesis variable, with values h1,…,hn, the
possible hypotheses.
• Let D = (d1,…,dN) be the observed data vectors.
• Let X denote the prediction.
• In Bayesian learning:
– Compute the probability of each hypothesis given the data; predictions
are then made on that basis.
– Predictions are made by using all hypotheses, weighted by their
probabilities, rather than by using just a single best hypothesis.
– Learning in the Bayesian setting is reduced to probabilistic inference.
Bayesian Learning
• The probability that the prediction is X, when the data d is
observed, is

P(X|d) = Σᵢ P(X|d, hᵢ) P(hᵢ|d) = Σᵢ P(X|hᵢ) P(hᵢ|d)

• The prediction is a weighted average over the predictions of the
individual hypotheses.
• Hypotheses are intermediaries between the data and the
predictions.
• This requires computing P(hᵢ|d) for all i.
Bayesian Learning Basics Terms
• P(hᵢ) is called the (hypothesis) prior.
– We can embed knowledge by means of the prior.
– It also controls the complexity of the model.
• P(hᵢ|d) is called the posterior (or a posteriori) probability.
• Using Bayes' rule,

P(hᵢ|d) ∝ P(d|hᵢ) P(hᵢ)

• P(d|hᵢ) is called the likelihood of the data.
– Under the i.i.d. assumption, P(d|hᵢ) = Πⱼ P(dⱼ|hᵢ).
• Let hMAP be the hypothesis for which the posterior
probability P(hᵢ|d) is maximal. It is called the maximum a
posteriori (or MAP) hypothesis.
Candy Example
• Two flavors of candy, cherry and lime, wrapped in the same opaque
wrapper. (cannot see inside)
• Sold in very large bags, of which there are known to be five kinds:
– h1: 100% cherry
– h2: 75% cherry + 25% lime
– h3: 50% cherry + 50% lime
– h4: 25% cherry + 75% lime
– h5: 100% lime

• Priors known: P(h1),…,P(h5) are 0.1, 0.2, 0.4, 0.2, 0.1


• Suppose from a bag of candy, we took N pieces of candy and all of
them were lime (data dN).

• What kind of bag is it ?


• What flavor will the next candy be ?

Candy Example
Posterior probability of hypotheses
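The posterior update can be sketched directly, using the priors and per-hypothesis lime probabilities given above (the code layout is an illustration, not from the slides):

```python
# Bayesian updating for the candy example.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]        # P(h1) .. P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]      # P(lime | hi)

def posteriors(n_limes):
    """P(hi | d) after observing n_limes lime candies in a row."""
    unnorm = [p * q ** n_limes for p, q in zip(priors, p_lime)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

posteriors(0)    # just the priors
posteriors(10)   # h5 (all-lime) dominates
```

As more limes are unwrapped, the posterior mass shifts toward h5, exactly as the plotted curves show.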

Candy Example
Prediction Probability

Optimality
• Characteristic: the true hypothesis eventually dominates the
Bayesian prediction.
• For any fixed prior that doesn't rule out the true hypothesis,
the posterior probability of any false hypothesis will eventually
vanish, because the probability of that hypothesis generating
uncharacteristic data is vanishingly small.
• The Bayesian prediction is optimal, whether the data set is small
or large.
• The cost: for real learning problems, the hypothesis space
is usually very large or infinite.
Maximum a posteriori (MAP) Learning
• Since calculating the exact probability is often impractical,
we approximate using the MAP hypothesis. That is,

P(X|d) ≈ P(X|hMAP)

• Make the prediction with the most probable hypothesis.
• Summing over the hypothesis space is often intractable;
instead of a large summation (or integration), an optimization
problem can be solved.
• In the candy example, the MAP and Bayesian predictions differ
at the start, but as more data arrives they become closer.
MAP approximation MDL Principle

• Since P(hᵢ|d) ∝ P(d|hᵢ)P(hᵢ), instead of maximizing P(hᵢ|d) we
may maximize P(d|hᵢ)P(hᵢ).
• Equivalently, we may minimize

−log₂ P(d|hᵢ)P(hᵢ) = −log₂ P(d|hᵢ) − log₂ P(hᵢ)

– We can interpret this as choosing hᵢ to minimize the number of bits
required to encode the hypothesis hᵢ and the data d under that
hypothesis.
• The principle of minimizing code length (under some pre-determined
coding scheme) is called the minimum description length (or MDL)
principle.
– MDL is used in a wide range of practical machine learning applications.
Maximum Likelihood Approximation
• Assume furthermore that the P(hᵢ) are all equal, i.e., assume a
uniform prior.
– This is reasonable when there is no reason to prefer one hypothesis over
another a priori.
– For a large data set, the prior becomes irrelevant anyway.
• To obtain the MAP hypothesis it then suffices to maximize P(d|hᵢ),
the likelihood.
– The maximizer is the maximum likelihood hypothesis hML.
• MAP with a uniform prior ⇒ ML.
• ML is the standard statistical learning method:
– simply find the best fit to the data.
Learning with Complete Data
• Data is complete when each data point contains values for
every variable in the probability model being learned.
• Complete data greatly simplifies the problem of learning the
parameters of a complex model.
Learning with Data : Parameter
Learning
• Introduce a parametric probability model with
parameter θ.
– Then the hypotheses are hθ, i.e., the hypotheses are
parameterized.
– In the simplest case, θ is a single scalar. In more
complex cases, θ consists of many components.
• Using the data d, estimate the parameter θ.
ML Parameter Learning Examples : discrete case

• A bag of candy whose lime–cherry proportions are
completely unknown.
– In this case we have hypotheses hθ parameterized
by the probability θ of cherry.
– P(d|hθ) = Πⱼ P(dⱼ|hθ) = θᶜ (1−θ)ˡ, where c and l are the
numbers of cherry and lime candies observed.
– Find the hθ that maximizes P(d|hθ).
– The same value is obtained by maximizing the
log likelihood:

L(d|hθ) = log P(d|hθ) = c log θ + l log(1−θ)

– The maximum is at θ = c/(c+l).
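A quick numerical check of the closed-form maximum, with hypothetical counts of 7 cherries and 3 limes (the counts are made up for illustration):

```python
import math

n_cherry, n_lime = 7, 3                       # hypothetical observed counts
theta_ml = n_cherry / (n_cherry + n_lime)     # closed form: c/(c+l) = 0.7

def log_likelihood(theta):
    # c log(theta) + l log(1 - theta), as on the slide above
    return n_cherry * math.log(theta) + n_lime * math.log(1 - theta)

# a simple grid search over (0, 1) lands on the same value
best = max((i / 100 for i in range(1, 100)), key=log_likelihood)
```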
ML Parameter Learning Examples : discrete case

• Two wrappers, green and red, are selected according to
some unknown conditional distribution, depending on the
flavor.
– The model has three parameters: θ = P(F=cherry),
θ₁ = P(W=red | F=cherry), θ₂ = P(W=red | F=lime).
– With the counts of each flavor–wrapper combination in the
exponents (e.g. g,c = number of cherry candies in green
wrappers), the likelihood of the data is

P(d|hθ,θ₁,θ₂) = θᶜ(1−θ)ˡ · θ₁^(r,c)(1−θ₁)^(g,c) · θ₂^(r,l)(1−θ₂)^(g,l)

– Maximize it.
Naïve Bayes Method
• Attributes (components of the observed data) are assumed to be
conditionally independent given the class in the Naïve Bayes method.
– Works well for about 2/3 of real-world problems, despite the naivety
of the assumption.
– Goal: predict the class C, given the observed data xᵢ.
• By the independence assumption,

P(C|x1,…,xn) ∝ P(C) Πᵢ P(xᵢ|C)

• We choose the most likely class.
• Merits of NB:
– Scales well: no search is required.
– Robust against noisy data.
– Gives probabilistic predictions.
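A minimal Naïve Bayes sketch on a made-up two-attribute data set (the attributes, classes, and the add-one smoothing are illustrative assumptions, not from the slides):

```python
from collections import Counter
import math

# Tiny hypothetical training set: (attributes, class)
data = [
    ({"hungry": "yes", "raining": "no"},  "wait"),
    ({"hungry": "yes", "raining": "yes"}, "wait"),
    ({"hungry": "no",  "raining": "yes"}, "leave"),
    ({"hungry": "no",  "raining": "no"},  "leave"),
]

def nb_predict(x):
    classes = Counter(c for _, c in data)
    scores = {}
    for c, n_c in classes.items():
        # log P(C) + sum_i log P(x_i | C), with add-one smoothing
        score = math.log(n_c / len(data))
        for attr, val in x.items():
            n_match = sum(1 for feats, cls in data
                          if cls == c and feats[attr] == val)
            score += math.log((n_match + 1) / (n_c + 2))
        scores[c] = score
    return max(scores, key=scores.get)

nb_predict({"hungry": "yes", "raining": "no"})   # -> "wait"
```

No search is involved: prediction is just counting and a product of per-attribute probabilities, which is why NB scales so well.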
Learning Curve on the Restaurant Problem

Learning with Hidden variables
• Many real-world problems have hidden (latent) variables,
which are not observable in the data available for learning.
• Example: medical records (the observed symptoms, the
treatment applied, and the outcome); the disease itself is
hidden from direct observation.
• Why not construct a model without the disease variable if it is
not observed?
• Latent variables can drastically reduce the number of
parameters required to specify a Bayesian network, and hence
the amount of data needed to learn those parameters.
• But hidden variables complicate the learning problem.
Missing Data
• Data collection and feature generation are not perfect. Objects in a pattern
recognition application may have missing features:
– Features are extracted by multiple sensors, and one or more sensors may be
malfunctioning when the object is presented.
– A respondent fails to answer all the questions in a questionnaire.
– Some participants quit in the middle of a study in which measurements
are taken at different time points.
• An example of missing data: the "dermatology" data set from UCI.
(Figure: the data matrix with missing values highlighted.)
Missing Data
• Different types of missing data:
– Missing completely at random (MCAR).
– Missing at random (MAR): whether a feature is missing is random,
after conditioning on another feature.
• Example: people who are depressed might be less inclined to
report their income, and thus reported income will be related to
depression. However, if within depressed patients the probability
of reporting income is unrelated to income level, then the data
would be considered MAR.
• If the data are neither MAR nor MCAR, the missing-data mechanism
needs to be modeled explicitly.
– Example: if a user fails to provide a rating for a movie, it is more
likely that he/she dislikes the movie and therefore has not watched
it.
Strategy of Coping With Missing
Data
• Discarding cases:
– Ignore patterns with missing features.
– Ignore features with missing values.
– Easy to implement, but wasteful of data, and may bias the parameter
estimates if the data are not MCAR.
• Maximum likelihood (the EM algorithm):
– Under some assumption of how the data are generated, estimate the
parameters that maximize the likelihood of the observed data (the
non-missing features).
The Expectation-Maximization
Algorithm
• The EM algorithm is an iterative algorithm that finds the parameters that
maximize the log-likelihood when there are missing data.
• Complete data: the combination of the missing data and the observed
data.
• The EM algorithm is best used in situations where the log-likelihood of the
observed data is difficult to optimize, whereas the log-likelihood of the
complete data is easy to optimize.
The EM Algorithm
• Let X and Z denote the vectors of all observed data and hidden data,
respectively.
• Let θᵗ be the parameter estimate at the t-th iteration.
• Define the Q-function (a function of θ); for discrete Z,

Q(θ; θᵗ) = Σ_z P(Z=z | X, θᵗ) log P(X, Z=z | θ)

• Intuitively, the Q-function is the "expected value of the complete-data
log-likelihood":
– Fill in all the possible values for the missing data Z. This gives rise to
the complete data, and we compute its log-likelihood.
– Not all possible values for the missing features are equally "good".
The "goodness" of a particular way of filling in (Z=z) is determined by
how likely the r.v. Z is to take the value z.
– This probability is determined by the observed data X and the current
parameter θᵗ.
The EM Algorithm
• An improved parameter estimate at iteration (t+1) is obtained by
maximizing the Q-function:

θᵗ⁺¹ = argmax_θ Q(θ; θᵗ)

• The EM algorithm takes its name from the fact that it alternates between
the E-step (expectation) and the M-step (maximization):
– E-step: compute Q(θ; θᵗ).
– M-step: find the θ that maximizes Q(θ; θᵗ).
• The difference between θ and θᵗ: one is a variable, while the other is given
(fixed).
Contd.
• Since EM is iterative, it may end up in a local maximum (instead of the
global maximum) of the log-likelihood of the observed data.
– A good initialization is needed to find a good local maximum.
• The EM algorithm may not be the most efficient algorithm for maximizing
the log-likelihood; however, it is often fairly simple to implement.
• If the missing data are continuous, integration should be used instead of
the summation to form Q(θ; θᵗ).
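A classic small instance of EM (not from the slides): two coins with unknown head probabilities, where the identity of the coin behind each sequence of tosses is the hidden variable Z. All numbers below are made up for illustration:

```python
# EM sketch: estimate head probabilities (tA, tB) of two coins when we
# only see the number of heads per 10-toss sequence, not which coin
# produced it. The hidden variable is the coin identity per sequence.
def em_two_coins(heads, n_flips, theta=(0.6, 0.5), iters=20):
    tA, tB = theta
    for _ in range(iters):
        a_h = a_t = b_h = b_t = 0.0   # expected head/tail counts per coin
        for h in heads:
            t = n_flips - h
            like_a = tA ** h * (1 - tA) ** t
            like_b = tB ** h * (1 - tB) ** t
            w = like_a / (like_a + like_b)   # E-step: P(coin A | sequence)
            a_h += w * h;       a_t += w * t
            b_h += (1 - w) * h; b_t += (1 - w) * t
        tA = a_h / (a_h + a_t)               # M-step: re-estimate biases
        tB = b_h / (b_h + b_t)
    return tA, tB

# hypothetical data: heads out of 10 flips for five sequences
tA, tB = em_two_coins(heads=[9, 8, 2, 1, 9], n_flips=10)
```

The E-step fills in the hidden coin identity softly (the weight w), and the M-step maximizes the resulting expected complete-data log-likelihood, which here has a closed form.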
Reinforcement Learning

How can an agent learn from success and
failure, from reward and punishment?
Reinforcement
• Without some feedback about what is good and what
is bad, the agent will have no grounds for deciding
which move to make.
• The agent needs to know that something good has
happened when it wins and that something bad has
happened when it loses. This type of feedback is called
a reward, or reinforcement.
• The reward may come at the end of the sequence or
along the way.
• The reward is part of the input percept; the agent must
be able to recognize it as such.
• Task of reinforcement learning is to use observed
rewards to learn an optimal policy for the
environment.
Reinforcement
• An agent is placed in an environment and must
learn to behave successfully therein.
• Passive learning: where the agent’s policy is fixed
and the task is to learn the utilities of states.
• Active learning: where the agent must also learn
what to do.
– Exploration (the principal issue): an agent must experience
as much as possible of its environment in order to
learn how to behave in it.
Passive Reinforcement Learning
• Using state-based representation in a fully
observable environment.
• Here:
– The agent's policy is fixed: in state s, it always
executes the action π(s).
– Goal: to learn how good the policy is, i.e. to learn
the utility function Uπ(s).
Example
Contd.
• The agent executes a set of trials in the
environment using its policy π.
• In each trial, the agent starts in state (1,1) and
experiences a sequence of state transitions
until it reaches one of the terminal states, (4,
2) or (4, 3).
• Its percepts supply both the current state and
the reward received in the state.
Contd.
• The trials might look like:
(1,1)-.04 (1,2)-.04 (1,3)-.04 (1,2)-.04 (1,3)-.04 (2,3)-.04  (3,3)-.04 (4,3)+1
(1,1)-.04 (1,2)-.04 (1,3)-.04 (2,3)-.04 (3,3)-.04 (3,2)-.04  (3,3)-.04 (4,3)+1
(1,1)-.04 (2,1)-.04 (3,1)-.04 (3,2)-.04 (4,2)-1

• Each state percept is subscripted with the reward


received.
• Objective: use the information about rewards to
learn the expected utility U(s) associated with
each non-terminal state s.
Contd.
• The utility is defined to be the expected sum of
(discounted) rewards obtained if policy π is followed:

Uπ(s) = E[ Σₜ₌₀^∞ γᵗ R(sₜ) | π, s₀ = s ]

• Here, the discount factor γ = 1.
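Direct utility estimation on the three trials above can be sketched as follows (reward-to-go = sum of rewards from a state's visit to the end of its trial, with γ = 1):

```python
from collections import defaultdict

# the three trials above, as (state, reward) pairs
trials = [
    [((1,1), -.04), ((1,2), -.04), ((1,3), -.04), ((1,2), -.04),
     ((1,3), -.04), ((2,3), -.04), ((3,3), -.04), ((4,3), 1.0)],
    [((1,1), -.04), ((1,2), -.04), ((1,3), -.04), ((2,3), -.04),
     ((3,3), -.04), ((3,2), -.04), ((3,3), -.04), ((4,3), 1.0)],
    [((1,1), -.04), ((2,1), -.04), ((3,1), -.04), ((3,2), -.04), ((4,2), -1.0)],
]

totals, counts = defaultdict(float), defaultdict(int)
for trial in trials:
    rewards = [r for _, r in trial]
    for i, (state, _) in enumerate(trial):
        totals[state] += sum(rewards[i:])   # reward-to-go from this visit
        counts[state] += 1

U = {s: totals[s] / counts[s] for s in totals}
# e.g. U[(1,1)] averages the reward-to-go values 0.72, 0.72 and -1.16
```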
Contd.
• Direct utility estimation: uses the total observed
reward-to-go for a given state as direct
evidence for learning its utility.
• Adaptive dynamic programming: learns a transition model and a
reward function from observations, and then
uses value or policy iteration to obtain the
utilities or an optimal policy.
• Temporal-difference learning: updates utility estimates to
match those of successor states.
Active Reinforcement Learning
• An active agent must decide what actions to
take.
• The agent will need to learn a complete model
with outcome probabilities for all actions,
rather than just the model for the fixed policy.
• Fact: an agent has a choice of actions.
Contd.
• Adaptive dynamic programming.
• Need to take into account the fact that the agent has a
choice of actions.
• Bellman equation:

U(s) = R(s) + γ maxₐ Σ_s′ T(s, a, s′) U(s′)
• Having obtained a utility function that is optimal for
the learned model, the agent can extract an optimal
action by one-step look-ahead to maximize the
expected utility.
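A minimal value-iteration sketch on a made-up 2-state MDP (the states, rewards, transition model, and γ = 0.9 are all invented for illustration):

```python
# Value iteration for a tiny hypothetical MDP: state B carries the
# reward, and from A the agent can "go" to B or "stay".
def value_iteration(states, actions, T, R, gamma=0.9, eps=1e-6):
    """T[(s, a)] maps successor state -> probability."""
    U = {s: 0.0 for s in states}
    while True:
        U_new = {s: R[s] + gamma * max(
                     sum(p * U[s2] for s2, p in T[(s, a)].items())
                     for a in actions)
                 for s in states}
        if max(abs(U_new[s] - U[s]) for s in states) < eps:
            return U_new
        U = U_new

states, actions = ["A", "B"], ["stay", "go"]
R = {"A": 0.0, "B": 1.0}
T = {("A", "stay"): {"A": 1.0}, ("A", "go"): {"B": 1.0},
     ("B", "stay"): {"B": 1.0}, ("B", "go"): {"A": 1.0}}
U = value_iteration(states, actions, T, R)
# U["B"] converges to 1/(1-0.9) = 10 and U["A"] to 0.9*10 = 9, so
# one-step look-ahead in A picks "go" (toward the rewarding state).
```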
Exploration
• Learned model is not the same as the true
environment.
• What is optimal in learned model can
therefore be suboptimal in the true
environment.
• The agent does not know what the true
environment is, so it cannot compute the
optimal action for the true environment.
• What is to be done?
Exploitation/Exploration
• Actions do more than provide reward according to the
current learned model
– Also contribute to learning the true model by affecting the
percepts that are received.
• By improving the model, the agent will receive greater
reward in the future.
• Exploitation: maximize reward, as measured by the current
utility estimates. (Pure exploitation risks getting stuck.)
• Exploration: maximize long-term well-being. (Pure exploration
is useless: the knowledge gained is never put into practice.)
• There must be a tradeoff between the two.
Optimal Exploration Policy
• Finding an optimal exploration policy is extremely difficult.
• Instead, we come up with a reasonable scheme:
• Greedy in the Limit of Infinite Exploration (GLIE).
• A GLIE scheme must try each action in each state an
unbounded number of times to avoid having a finite
probability that an optimal action is missed because of
an unusually bad series of outcomes.
• Agent using such a scheme will eventually learn the
true environment model.
• A GLIE scheme must also eventually become greedy, so
that the agent’s actions become optimal w.r.t. the
learned model.
GLIE schemes
• Simplest is to choose a random action a fraction
1/t of the time and to follow the greedy policy
otherwise.
• Can be extremely slow.
• More sensible approach is to give some weight to
actions that the agent has not tried very often,
while tending to avoid actions that are believed
to be of low utility.
• This causes the agent to behave initially as if there
were wonderful rewards scattered all over the
place (an optimistic view).
Contd.
• Let U⁺(s) denote the optimistic estimate of the utility
of the state.
• N(a, s): the number of times action a has been tried in
state s.
• The update equation incorporating the optimistic
estimate:

U⁺(s) ← R(s) + γ maxₐ f( Σ_s′ T(s, a, s′) U⁺(s′), N(a, s) )
• f(u, n) is the exploration function. It determines how
greed is traded off against curiosity .
• It should be increasing in u and decreasing in n.
Possible function
• One possible definition of f(u, n):

f(u, n) = R⁺  if n < Nₑ
        = u   otherwise

– R⁺ is an optimistic estimate of the best possible reward
obtainable in any state.
– The agent will try each state–action pair (s, a) at least Nₑ
times.
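The exploration function can be sketched directly; the values R⁺ = 2.0 and Nₑ = 5 below are hypothetical, chosen only for illustration:

```python
R_PLUS, NE = 2.0, 5   # hypothetical optimistic reward R+ and trial count Ne

def f(u, n):
    """Optimistic utility: assume the best possible reward R+ until the
    state-action pair has been tried at least Ne times, then trust u."""
    return R_PLUS if n < NE else u

f(0.3, 2)   # -> 2.0 (fewer than Ne trials: keep exploring)
f(0.3, 7)   # -> 0.3 (enough trials: trust the estimate)
```

Because f is increasing in u and decreasing in n, under-tried actions look attractive early on, and the agent gradually becomes greedy, as a GLIE scheme requires.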
