Types of Learning in AI Explained
Unit-IV
Learning in AI
Inductive: Learn new rules/facts from a data set D = {(x(n), y(n))}, n = 1…N.
Types of learning
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
Unsupervised Learning
• Learning without a teacher: no one oversees the learning process.
• The agent has no prior knowledge of the correct outputs.
• The agent learns patterns in the input even though no
explicit feedback is supplied.
• The most common unsupervised learning task is
clustering: detecting potentially useful clusters of
input examples.
• For example, a taxi agent might gradually develop a
concept of "good traffic days" and "bad traffic days"
without ever being given labeled examples of each by
a teacher.
Block diagram of unsupervised learning.
Supervised Learning
• With a teacher.
• Teacher has knowledge of environment
• Knowledge is represented by a set of input-output
examples.
• The agent observes some example input–output pairs
and learns a function that maps from input to output.
Block diagram of learning with a teacher
Reinforcement Learning
• The agent learns from a series of reinforcements—rewards
or punishments.
• For example, the lack of a tip at the end of the journey
gives the taxi agent an indication that it did something
wrong.
• It is up to the agent to decide which of the actions prior to
the reinforcement were most responsible for it.
Block diagram of reinforcement learning.
Semi-supervised Learning
• In practice, these distinctions are not always so crisp.
• In semi-supervised learning we are given a few labeled
examples and must make what we can of a large collection
of unlabeled examples.
• Imagine that you are trying to build a system to guess a
person's age from a photo. You gather some labeled
examples by snapping pictures of people and asking their
age. That's supervised learning.
• But in reality some of the people lied about their age. It's
not just that there is random noise in the data; rather, the
inaccuracies are systematic, and uncovering them is an
unsupervised learning problem involving images, self-reported
ages, and true (unknown) ages.
• Both noise and lack of labels create a continuum between
supervised and unsupervised learning.
Inductive Learning
• The task of supervised learning is this:
– TRAINING SET: Given a training set of N example input–
output pairs, where each output was generated by an
unknown function f,
– try to discover a function h that approximates the
unknown function f.
• Learning is a search through the space of possible
hypotheses for one that will perform well, even on new
examples beyond the training set.
• To measure the accuracy of a hypothesis, a test set of
examples is given that are distinct from the training set.
• A hypothesis generalizes well if it correctly predicts the
value for novel examples.
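This search-and-generalize view can be made concrete in code. Below is a minimal sketch, assuming a toy hypothesis space of threshold functions and a hypothetical target function; none of these names or values come from the slides.

```python
# Learning as search through a hypothesis space.
# Hypothesis space H: threshold functions h_a(x) = 1 if x > a else 0.

def f(x):                      # the unknown target function (hidden from the learner)
    return 1 if x > 5.0 else 0

train = [(x, f(x)) for x in [1.2, 3.4, 4.9, 5.1, 6.8, 8.0]]
test  = [(x, f(x)) for x in [0.5, 2.2, 5.5, 7.7, 9.3]]   # distinct from the training set

def error(a, examples):
    """Fraction of examples misclassified by hypothesis h_a."""
    return sum(1 for x, y in examples if (1 if x > a else 0) != y) / len(examples)

# Search H (candidate thresholds) for a hypothesis that fits the training data.
candidates = [a / 10 for a in range(0, 101)]
best_a = min(candidates, key=lambda a: error(a, train))

train_err = error(best_a, train)
test_err = error(best_a, test)   # generalization: accuracy on novel examples
```

A hypothesis that fits the training set and also classifies the held-out test examples correctly is one that generalizes well.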
Inductive learning – examples A–C
[Figures not reproduced: candidate hypothesis curves fitted to sample (x, f(x)) data points for examples A, B, and C.]
Sometimes a consistent hypothesis is worse than an inconsistent one.
The idealized inductive learning problem
Find an appropriate hypothesis space H and an h(x) ∈ H with
minimum “distance” (error) to f(x).
[Figure not reproduced: nested hypothesis spaces H1, H2, H3 and the error between hopt(x) ∈ H and f(x).]
• Splits are usually on a single variable (e.g., x2 > b?).
An example: whether to wait for a table at a restaurant
Goal predicate: WillWait. To set this up as a learning problem,
we must decide what attributes are available to describe
examples in the domain.
10 attributes:
1. Alternate: Is there a suitable alternative restaurant
nearby? {yes,no}
2. Bar: Is there a bar to wait in? {yes,no}
3. Fri/Sat: Is it Friday or Saturday? {yes,no}
4. Hungry: Are you hungry? {yes,no}
5. Patrons: How many are seated in the restaurant? {none,
some, full}
6. Price: Price level {$,$$,$$$}
7. Raining: Is it raining? {yes,no}
8. Reservation: Did you make a reservation? {yes,no}
9. Type: Type of food {French,Italian,Thai,Burger}
10. WaitEstimate: {0-10 min, 10-30 min, 30-60 min, >60
min}
The wait@restaurant decision tree
Expressiveness
• Decision trees are fully expressive within the class of
propositional languages.
• Any Boolean function can be written as a decision
tree.
• Each row in the truth table for the function
corresponds to a path in the tree, but this may yield
an exponentially large decision tree.
• Decision trees are good for some kinds of
functions and bad for others.
• There is no representation which is efficient for
all kinds of functions.
Inductive learning of decision tree
• A Boolean decision tree consists of a vector of input
attributes, X and a single Boolean output value y.
• The positive examples: TRUE goal.
• The negative examples: FALSE goal.
• The complete set: Training set.
• Basic idea behind the learning algorithm is to test the
most important attribute first.
• Most important means the one that makes the most
difference to the classification of an example.
• So we get the correct classification with a small
number of tests, meaning all paths are short and the
tree as a whole is small.
Decision tree learning example
T = True, F = False
The full training set has 6 True and 6 False examples:
Entropy = −(6/12)·log(6/12) − (6/12)·log(6/12) ≈ 0.30
(logarithms base 10 throughout)
Inductive learning of decision tree
Weighted entropy after splitting on each binary (yes/no) attribute:

Yes: 3 T, 3 F   No: 3 T, 3 F   Entropy ≈ 0.30
Yes: 3 T, 3 F   No: 3 T, 3 F   Entropy ≈ 0.30
Yes: 2 T, 3 F   No: 4 T, 3 F   Entropy ≈ 0.29
Yes: 5 T, 2 F   No: 1 T, 4 F   Entropy ≈ 0.24
Yes: 2 T, 2 F   No: 4 T, 4 F   Entropy ≈ 0.30
Yes: 3 T, 2 F   No: 3 T, 4 F   Entropy ≈ 0.29

For example, for the split with Yes: 5 T, 2 F and No: 1 T, 4 F:
Entropy = (7/12)[−(5/7)·log(5/7) − (2/7)·log(2/7)] + (5/12)[−(1/5)·log(1/5) − (4/5)·log(4/5)] ≈ 0.24
Splitting on Patrons (None: 2 F; Some: 4 T; Full: 2 T, 4 F):
Entropy = (2/12)·0 + (4/12)·0 + (6/12)[−(2/6)·log(2/6) − (4/6)·log(4/6)] ≈ 0.14
Splitting on Price ($: 3 T, 3 F; $$: 2 T; $$$: 1 T, 3 F):
Entropy = (6/12)[−(3/6)·log(3/6) − (3/6)·log(3/6)] + (2/12)·0 + (4/12)[−(1/4)·log(1/4) − (3/4)·log(3/4)] ≈ 0.23
Splitting on Type (French: 1 T, 1 F; Italian: 1 T, 1 F; Thai: 2 T, 2 F; Burger: 2 T, 2 F):
Entropy = 2·(2/12)[−(1/2)·log(1/2) − (1/2)·log(1/2)] + 2·(4/12)[−(2/4)·log(2/4) − (2/4)·log(2/4)] ≈ 0.30
Entropy decrease = 0.30 − 0.30 = 0
Decision tree learning example
Splitting on WaitEstimate (0–10 min: 4 T, 2 F; 10–30: 1 T, 1 F; 30–60: 1 T, 1 F; >60: 2 F):
Entropy = (6/12)[−(4/6)·log(4/6) − (2/6)·log(2/6)] + (2/12)[−(1/2)·log(1/2) − (1/2)·log(1/2)] + (2/12)[−(1/2)·log(1/2) − (1/2)·log(1/2)] + (2/12)·0 ≈ 0.24
Entropy decrease = 0.30 − 0.24 = 0.06
Decision tree learning example
The largest entropy decrease (0.30 − 0.14 = 0.16) is achieved by
splitting on Patrons (None: 2 F; Some: 4 T; Full: 2 T, 4 F).
Continue like this, making new splits, always purifying the nodes.
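The entropy computations in this example can be reproduced in a few lines. A minimal sketch (log base 10, matching the worked values; the branch counts are those from the example):

```python
import math

def entropy(groups, n_total):
    """Weighted entropy (log base 10, as in the worked example) after a split.
    groups: list of (true_count, false_count) per branch; n_total: all examples."""
    h = 0.0
    for t, f in groups:
        n = t + f
        for c in (t, f):
            if c > 0:                     # convention: 0 * log 0 = 0
                h -= (n / n_total) * (c / n) * math.log10(c / n)
    return h

N = 12
root    = entropy([(6, 6)], N)                            # 6 T, 6 F -> ~0.30
patrons = entropy([(0, 2), (4, 0), (2, 4)], N)            # None/Some/Full -> ~0.14
type_   = entropy([(1, 1), (1, 1), (2, 2), (2, 2)], N)    # -> ~0.30
gain_patrons = root - patrons   # ~0.16: the largest entropy decrease
gain_type = root - type_        # 0: Type is useless as the first split
```

Running this confirms that Patrons gives the largest decrease (≈0.16) while Type gives none.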
Decision tree learning example
Induced tree (from examples)
Decision tree learning example
True tree
Decision tree learning example
Induced tree (from examples)
Bayesian Learning
• View learning as Bayesian updating of a probability
distribution over the hypothesis space.
• H is the hypothesis variable, whose values h1,…,hn are the
possible hypotheses.
• Let d = (d1,…,dN) be the observed data.
• Let X denote the prediction.
In Bayesian learning,
– Compute the probability of each hypothesis given the data, and predict
on that basis.
– Predictions are made by using all hypotheses, weighted by their
probabilities, rather than by using just a single best hypothesis.
– Learning in the Bayesian setting is reduced to probabilistic inference.
Bayesian Learning
• The probability that the prediction is X, given the observed
data d, is
P(X|d) = Σi P(X|d, hi) P(hi|d)
= Σi P(X|hi) P(hi|d)
The prediction is a weighted average over the predictions of the
individual hypotheses.
Hypotheses are intermediaries between the data and the
predictions.
This requires computing P(hi|d) for all i.
Bayesian Learning: Basic Terms
• P(hi) is called the (hypothesis) prior.
– We can embed knowledge by means of prior.
– It also controls the complexity of the model.
• P(hi|d) is called posterior (or a posteriori) probability.
• Using Bayes’ rule,
P(hi|d) ∝ P(d|hi) P(hi)
• P(d|hi) is called the likelihood of the data.
– Under the i.i.d. assumption,
P(d|hi) = Πj P(dj|hi).
• Let hMAP be the hypothesis for which the posterior
probability P(hi|d) is maximal. It is called the maximum a
posteriori (or MAP) hypothesis.
Candy Example
• Two flavors of candy, cherry and lime, wrapped in the same opaque
wrapper. (cannot see inside)
• Sold in very large bags, of which there are known to be five kinds:
– h1: 100% cherry
– h2: 75% cherry + 25% lime
– h3: 50% cherry + 50% lime
– h4: 25% cherry + 75% lime
– h5: 100% lime
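Bayesian updating for this example can be sketched directly. The priors (0.1, 0.2, 0.4, 0.2, 0.1) are an assumption not stated above (they are the ones commonly used with this example); after a run of lime candies, the posterior concentrates on h5:

```python
# Posterior updating over the five candy hypotheses (assumed priors).
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | hi) for h1..h5

def update(post, likelihoods):
    """One step of Bayes' rule: posterior is proportional to likelihood x prior."""
    unnorm = [p * l for p, l in zip(post, likelihoods)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

post = priors
for _ in range(10):               # observe 10 lime candies in a row
    post = update(post, p_lime)

# Bayesian prediction: weighted average over all hypotheses.
p_next_lime = sum(p * l for p, l in zip(post, p_lime))
```

As the data accumulate, h5 dominates and the predicted probability of the next candy being lime approaches 1, illustrating how the true hypothesis eventually dominates the prediction.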
Candy Example
Posterior probability of hypotheses
Candy Example
Prediction Probability
Optimality
• Characteristic: the true hypothesis eventually dominates the
Bayesian prediction.
• For any fixed prior that doesn't rule out the true hypothesis,
the probability of any false hypothesis will eventually vanish
because the probability of generating uncharacteristic data is
vanishingly small.
• Bayesian prediction is optimal, whether the data set is small
or large.
• The cost: for real learning problems, the hypothesis space
is usually very large or infinite.
Maximum a posteriori (MAP) Learning
• Since calculating the exact probability is often impractical,
we use approximation by MAP hypothesis. That is,
P(X|d) P(X|hMAP).
MAP Approximation and the MDL Principle
Maximum Likelihood Approximation
• Assume furthermore that the P(hi) are all equal, i.e., assume a
uniform prior.
– Reasonable when there is no reason to prefer one hypothesis over
another a priori.
– For a large data set, the prior becomes irrelevant.
• To obtain the MAP hypothesis, it then suffices to maximize P(d|hi),
the likelihood.
– This yields the maximum-likelihood hypothesis hML.
• MAP with a uniform prior reduces to ML.
• ML is the standard statistical learning method.
– Simply find the best fit to the data.
Learning with Complete Data
• Data is complete when each data point contains values for
every variable in the probability model being learned.
• Complete data greatly simplifies the problem of learning the
parameters of a complex model.
Learning with Data: Parameter Learning
• Introduce a parametric probability model with
parameter θ.
– Then the hypotheses are hθ, i.e., the hypotheses are
parameterized.
– In the simplest case, θ is a single scalar. In more
complex cases, θ consists of many components.
• Using the data d, estimate the parameter θ.
ML Parameter Learning Examples : discrete case
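The slide figures for the discrete case are not reproduced here. As a minimal sketch with hypothetical counts: for a Bernoulli parameter (e.g., the fraction of cherry candies), the ML estimate is simply the observed frequency.

```python
# ML parameter learning, discrete case: for theta = P(cherry),
# the likelihood P(d | h_theta) = theta^c * (1 - theta)^(N - c)
# (c cherries out of N draws) is maximized at the observed fraction.
# The candy draws below are hypothetical.
data = ['cherry', 'lime', 'cherry', 'cherry', 'lime', 'cherry']
c = data.count('cherry')
theta_ml = c / len(data)   # closed-form ML estimate: c / N

def likelihood(theta):
    """Likelihood of the observed draws under h_theta."""
    return theta ** c * (1 - theta) ** (len(data) - c)
```

No search is required: differentiating the log-likelihood and setting it to zero gives the counting formula directly.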
Naïve Bayes Method
• Attributes (components of the observed data) are assumed to be
independent in the naïve Bayes method.
– Works well for about 2/3 of real-world problems, despite the naivety
of this assumption.
– Goal: predict the class C, given the observed data xi.
• By the independence assumption,
P(C|x1,…,xn) ∝ P(C) Πi P(xi|C)
• We choose the most likely class.
Merits of NB
– Scales well: No search is required.
– Robust against noisy data.
– Gives probabilistic predictions.
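A minimal sketch of the method, estimating every probability by counting. The toy data set (feature names and counts) is invented for illustration:

```python
import math

# Each example: (features dict, class label).
data = [
    ({'outlook': 'sunny', 'windy': 'no'},  'play'),
    ({'outlook': 'sunny', 'windy': 'yes'}, 'stay'),
    ({'outlook': 'rain',  'windy': 'yes'}, 'stay'),
    ({'outlook': 'sunny', 'windy': 'no'},  'play'),
    ({'outlook': 'rain',  'windy': 'no'},  'play'),
    ({'outlook': 'rain',  'windy': 'yes'}, 'stay'),
]

def predict(x):
    """Choose the class maximizing P(C) * prod_i P(x_i | C), by counting."""
    classes = {c for _, c in data}
    best, best_score = None, -math.inf
    for c in classes:
        rows = [feats for feats, cls in data if cls == c]
        score = len(rows) / len(data)              # P(C)
        for feat, val in x.items():                # independence assumption
            score *= sum(1 for f in rows if f[feat] == val) / len(rows)
        if score > best_score:
            best, best_score = c, score
    return best
```

Note that no search or iterative fitting is needed: all the probabilities come straight from counts, which is why naïve Bayes scales so well.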
Learning Curve on the Restaurant Problem
Learning with Hidden variables
• Many real world problems have hidden (latent) variables,
which are not observable in the data that are available for
learning.
• Example: medical records show the observed symptoms, the
treatment applied, and the results; direct observation of the
disease itself is hidden.
• Why not construct a model without the disease variable if it is
not observed?
• Latent variables can drastically reduce the number of
parameters required to specify a Bayesian network, and hence
the amount of data needed to learn those parameters.
• Hidden variables complicate the learning problem.
Missing Data
• Data collection and feature generation are not perfect. Objects in a pattern
recognition application may have missing features
– Features are extracted by multiple sensors. One or multiple sensors may be
malfunctioning when the object is presented
– A respondent fails to answer all the questions in a questionnaire
– Some participants quit during the middle of a study, where measurements
are taken at different time points
• An example of missing data: the “dermatology” data set from
UCI, in which some feature values are missing.
21
Missing Data
• Different types of missing data
– Missing completely at random (MCAR)
– Missing at random (MAR): the fact that a feature is missing is random
after conditioning on another feature
• Example: people who are depressed might be less inclined to
report their income, and thus reported income will be related to
depression. However, if, within depressed patients, the probability
of reporting income was unrelated to income level, then the data
would be considered MAR
• If the data are neither MAR nor MCAR, the missing-data mechanism
needs to be modeled explicitly
– Example: if a user fails to provide his/her rating on a movie, it is
more likely that he/she dislikes the movie (and for that reason has
not watched it).
Strategy of Coping With Missing
Data
• Discarding cases
– Ignore patterns with missing features
– Ignore features with missing values
– Easy to implement, but wasteful of data. May bias the parameter
estimate if not MCAR
• Maximum likelihood (the EM algorithm)
– Under some assumption of how the data are generated, estimate the
parameters that maximize the likelihood of the observed data (non-
missing features)
The Expectation-Maximization
Algorithm
• The EM algorithm is an iterative algorithm that finds the parameters which
maximize the log-likelihood when there are missing data
• Complete data: the combination of the missing data and the observed
data
• The EM algorithm is best used in situations where the log-likelihood of
the observed data is difficult to optimize, whereas the log-likelihood of
the complete data is easy to optimize
The EM Algorithm
• Let X and Z denote the vector of all observed data and hidden data,
respectively
• Let θt be the parameter estimate at the t-th iteration
• Define the Q-function (a function of θ):
Q(θ; θt) = Σz P(Z = z | X, θt) log P(X, Z = z | θ)
• Intuitively, the Q-function is the “expected value of the complete-data
log-likelihood”
– Fill in all the possible values for the missing data Z. This gives rise to
the complete data, and we compute its log-likelihood
– Not all possible values for the missing features are equally “good”.
The “goodness” of a particular way of filling in (Z = z) is determined
by how likely the r.v. Z takes the value z
– This probability is determined by the observed data X and the current
parameter θt
The EM Algorithm
• An improved parameter estimate at iteration (t+1) is obtained by
maximizing the Q-function:
θt+1 = argmaxθ Q(θ; θt)
• The EM algorithm takes its name because it alternates between the
E-step (expectation) and the M-step (maximization)
– E-step: compute Q(θ; θt)
– M-step: find the θ that maximizes Q(θ; θt)
• The difference between θ and θt: one is a variable, while the other is
given (fixed)
Contd.
• Since EM is iterative, it may end up in a local maximum (instead of the
global maximum) of the log-likelihood of the observed data
– A good initialization is needed to find a good local maximum
• The EM algorithm may not be the most efficient algorithm for
maximizing the log-likelihood; however, it is often fairly simple to
implement
• If the missing data are continuous, integration should be used instead
of summation to form Q(θ; θt)
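A classic small instance of EM with missing data is the two-coin problem (this specific example is an assumption, not from the slides): on each trial one of two biased coins is picked at random and flipped n times, but which coin was used is the hidden variable Z.

```python
# EM for two biased coins A and B. We observe only head counts per trial;
# the coin identity Z is the missing data. Head counts below are hypothetical.
from math import comb

heads = [5, 9, 8, 4, 7]      # head counts out of n = 10 flips per trial
n = 10
theta_a, theta_b = 0.6, 0.5  # initial parameter guesses

for _ in range(50):          # alternate E and M steps until convergence
    # E-step: responsibility w = P(Z = A | data, current thetas) per trial,
    # used to accumulate expected head/tail counts for each coin.
    exp_a_h = exp_a_t = exp_b_h = exp_b_t = 0.0
    for h in heads:
        la = comb(n, h) * theta_a**h * (1 - theta_a)**(n - h)
        lb = comb(n, h) * theta_b**h * (1 - theta_b)**(n - h)
        w = la / (la + lb)
        exp_a_h += w * h;       exp_a_t += w * (n - h)
        exp_b_h += (1 - w) * h; exp_b_t += (1 - w) * (n - h)
    # M-step: re-estimate each coin's bias from the expected counts.
    theta_a = exp_a_h / (exp_a_h + exp_a_t)
    theta_b = exp_b_h / (exp_b_h + exp_b_t)
```

The complete-data likelihood (coin identities known) would be trivial to maximize by counting; EM recovers that simplicity by replacing the unknown identities with their expected values.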
Reinforcement Learning
• The utility of a state sequence is the discounted sum of the
rewards received: U = Σt≥0 γ^t R(st).
• Here, the discount factor γ = 1.
Contd.
• Direct utility estimation: uses the total observed
reward-to-go for a given state as direct
evidence for learning its utility.
• Adaptive dynamic programming (ADP): learns a transition
model and a reward function from observations and then
uses value or policy iteration to obtain the
utilities or an optimal policy.
• Temporal-difference (TD) learning: updates utility estimates
to match those of successor states.
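The temporal-difference update can be sketched in a few lines. This is a minimal sketch with an assumed learning rate α, discount γ = 1, and a hypothetical three-state chain (the rewards are chosen for illustration):

```python
# TD utility update: move U(s) toward R(s) + gamma * U(s') after each transition.
alpha, gamma = 0.1, 1.0
U = {'s1': 0.0, 's2': 0.0, 's3': 0.0}
R = {'s1': -0.04, 's2': -0.04, 's3': 1.0}   # s3 is terminal with reward +1

for _ in range(1000):                        # replay the trial s1 -> s2 -> s3
    for s, s_next in [('s1', 's2'), ('s2', 's3')]:
        U[s] += alpha * (R[s] + gamma * U[s_next] - U[s])
    U['s3'] = R['s3']                        # a terminal state's utility is its reward
```

The estimates settle at the fixed point of the update: U(s2) = −0.04 + U(s3) = 0.96 and U(s1) = −0.04 + U(s2) = 0.92, matching the successor-state utilities as described above.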
Active Reinforcement Learning
• An active agent must decide what actions to
take.
• The agent will need to learn a complete model
with outcome probabilities for all actions,
rather than just the model for the fixed policy.
• Fact: an agent has a choice of actions.
Contd.
• Adaptive dynamic programming for an active agent
• needs to take into account the fact that the agent has a
choice of actions.
• Bellman equation with an exploration function:
U+(s) = R(s) + γ maxa f( Σs′ P(s′|s,a) U+(s′), N(s,a) )
where
f(u, n) = R+ if n < Ne
f(u, n) = u otherwise
(R+ is an optimistic estimate of the best possible reward,
and Ne is a fixed visit-count threshold).