Bayesian Learning in Machine Learning
Unit-4
Bayesian Learning
Dr. Ashwini B P
Assistant Professor
Department of CSE, SIT
Contents
• Introduction
• Bayes theorem
• Bayes theorem and concept learning
• Maximum likelihood and Least Squared Error Hypothesis
• Maximum likelihood Hypotheses for predicting probabilities
• Minimum Description Length Principle
• Bayes Optimal Classifier
• Naive Bayes classifier
• Bayesian belief networks
• EM algorithm
• Introduction to Neural Networks
Introduction
• Bayesian reasoning provides a probabilistic approach to inference
• It is important to machine learning because it provides a quantitative approach to
weighing the evidence supporting alternative hypotheses.
• Bayesian learning methods are relevant to the study of machine learning for two
different reasons.
• The first reason is that Bayesian learning algorithms that calculate explicit
probabilities for hypotheses, such as the naive Bayes classifier, are among the
most practical approaches to certain types of learning problems.
• The second reason is that they provide a useful perspective for understanding
many learning algorithms that do not explicitly manipulate probabilities.
Features of Bayesian Learning
Methods
• Each observed training example can incrementally decrease or increase the estimated
probability that a hypothesis is correct.
• This provides a more flexible approach to learning than algorithms that
completely eliminate a hypothesis if it is found to be inconsistent with any single
example
• Prior knowledge can be combined with observed data to determine the final
probability of a hypothesis. In Bayesian learning, prior knowledge is provided by
asserting (1) a prior probability for each candidate hypothesis, and
(2) a probability distribution over observed data for each possible hypothesis.
Features of Bayesian Learning Methods
contd…
• Bayesian methods can accommodate hypotheses that make probabilistic
predictions (e.g., "this pneumonia patient has a 93% chance of complete recovery").
• New instances can be classified by combining the predictions of multiple
hypotheses, weighted by their probabilities.
Bayes Theorem
Bayes theorem provides a way to calculate the probability of a hypothesis based on
its prior probability, the probabilities of observing various data given the hypothesis,
and the observed data itself.
Notations
• P(h): prior probability of h; reflects any background knowledge about the chance
that h is correct
• P(D): prior probability that the training data D will be observed
• P(D|h): probability of observing the data D given a world in which hypothesis h holds
• P(h|D): posterior probability of h; reflects confidence that h holds after D has been
observed

Bayes theorem:

P(h|D) = P(D|h) P(h) / P(D)
Bayes theorem is the cornerstone of Bayesian learning methods because it
provides a way to calculate the posterior probability P(h|D), from the prior
probability P(h), together with P(D) and P(D|h).
P(h|D) increases with P(h) and with P(D|h), according to Bayes theorem.
P(h|D) decreases as P(D) increases, because the more probable it is that D will be
observed independent of h, the less evidence D provides in support of h.
Maximum a Posteriori (MAP) Hypothesis
• In many learning scenarios, the learner considers some set of candidate
hypotheses H and is interested in finding the most probable hypothesis h∈ H
given the observed data D. Any such maximally probable hypothesis is
called a maximum a posteriori (MAP) hypothesis.
• Using Bayes theorem to calculate the posterior probability of each candidate
hypothesis, hMAP is a MAP hypothesis provided

hMAP = argmax_{h∈H} P(h|D)
     = argmax_{h∈H} P(D|h) P(h) / P(D)
     = argmax_{h∈H} P(D|h) P(h)

(in the final step P(D) is dropped because it is a constant independent of h).
• If we assume in addition that every hypothesis in H is equally probable a priori
(P(hi) = P(hj) for all hi and hj in H), this equation can be simplified further and we
need only consider the term P(D|h) to find the most probable hypothesis.
• P(D|h) is often called the likelihood of the data D given h, and any hypothesis
that maximizes P(D|h) is called a maximum likelihood (ML) hypothesis:

hML = argmax_{h∈H} P(D|h)
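The argmax definitions above can be illustrated with a small sketch over a hypothetical three-hypothesis space (the prior and likelihood values below are invented for illustration):

```python
# Sketch: picking the MAP and ML hypotheses from explicit (hypothetical)
# priors P(h) and likelihoods P(D|h) over a small hypothesis space.
priors      = {"h1": 0.5, "h2": 0.3, "h3": 0.2}   # P(h)
likelihoods = {"h1": 0.1, "h2": 0.6, "h3": 0.7}   # P(D|h)

# hMAP maximizes P(D|h) * P(h); P(D) is omitted because it is constant in h.
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])

# hML maximizes P(D|h) alone (equivalent to MAP under a uniform prior).
h_ml = max(likelihoods, key=lambda h: likelihoods[h])

print(h_map, h_ml)   # here h2 and h3: MAP and ML disagree under a non-uniform prior
```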
Example-1
• Consider a medical diagnosis problem in which there are two alternative hypotheses:
(1) that the patient has a particular form of cancer.
(2) that the patient does not.
• The available data is from a particular laboratory test with two possible outcomes: +
(positive) and - (negative). We have prior knowledge that over the entire population,
only 0.008 of people have this disease.
• Furthermore, the lab test is only an imperfect indicator of the disease.
• The test returns a correct positive result in only 98% of the cases in which the disease is
actually present and a correct negative result in only 97% of the cases in which the
disease is not present.
• In other cases, the test returns the opposite result.
Example contd…
• Suppose a new patient is observed for whom the lab test returns a positive
(+) result.
• Should we diagnose the patient as having cancer or not?
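Plugging the stated numbers into Bayes theorem answers the question (a minimal sketch; the variable names are ours):

```python
# Worked version of the medical-diagnosis example using Bayes theorem.
p_cancer = 0.008                 # prior P(cancer)
p_pos_given_cancer = 0.98        # P(+|cancer): correct positive rate
p_neg_given_not = 0.97           # P(-|~cancer): correct negative rate

# Unnormalized posteriors P(+|h) * P(h) for the two hypotheses
score_cancer = p_pos_given_cancer * p_cancer             # 0.98 * 0.008
score_not    = (1 - p_neg_given_not) * (1 - p_cancer)    # 0.03 * 0.992

h_map = "cancer" if score_cancer > score_not else "no cancer"

# Normalizing gives the exact posterior probability of cancer given +
p_cancer_given_pos = score_cancer / (score_cancer + score_not)
print(h_map, round(p_cancer_given_pos, 2))   # no cancer 0.21
```

Even after a positive test, the MAP hypothesis is "no cancer", because the disease's very low prior outweighs the test's accuracy.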
Bayes Theorem and Concept
Learning
What is the relationship between Bayes theorem and the problem of concept learning?
Bayes theorem provides a principled way to calculate the posterior probability of each
hypothesis given the training data. We can therefore use it as the basis for a
straightforward learning algorithm that calculates the probability of each possible
hypothesis and outputs the most probable one.
Brute-Force Bayes Concept Learning
We can design a straightforward concept learning algorithm to output the maximum
a posteriori hypothesis, based on Bayes theorem, as follows:

BRUTE-FORCE MAP LEARNING algorithm:
1. For each hypothesis h in H, calculate the posterior probability
   P(h|D) = P(D|h) P(h) / P(D)
2. Output the hypothesis hMAP with the highest posterior probability
   hMAP = argmax_{h∈H} P(h|D)
Brute-Force Bayes Concept Learning
contd…
Let us choose P(h) and P(D|h) to be consistent with the following assumptions:
• The training data D is noise free (i.e., di = c(xi)).
• The target concept c is contained in the hypothesis space H.
• We have no a priori reason to believe that any hypothesis is more probable than
any other.
Brute-Force Bayes Concept Learning
contd…
What values should we specify for P(h)?
• Given no prior knowledge that one hypothesis is more likely than another, it
is reasonable to assign the same prior probability to every hypothesis h in H:

P(h) = 1 / |H|  for all h in H
Brute-Force Bayes Concept Learning
contd…
What choice shall we make for P(D|h)?
• P(D|h) is the probability of observing the target values D = (d1 . . . dm) for the
fixed set of instances (x1 . . . xm), given a world in which hypothesis h holds.
• Since we assume noise-free training data, the probability of observing
classification di given h is just 1 if di = h(xi) and 0 if di ≠ h(xi). Therefore,

P(D|h) = 1  if di = h(xi) for all di in D
P(D|h) = 0  otherwise
Brute-Force Bayes Concept Learning
contd…
Given these choices for P(h) and for P(D|h), we now have a fully-defined problem for
the above BRUTE-FORCE MAP LEARNING algorithm.
In a first step, we have to determine the probabilities P(h|D):
• If h is inconsistent with the training data D:
P(h|D) = (0 · P(h)) / P(D) = 0
• If h is consistent with the training data D:
P(h|D) = (1 · 1/|H|) / P(D) = (1/|H|) / (|VS_H,D| / |H|) = 1 / |VS_H,D|
where VS_H,D is the version space of H with respect to D (the subset of hypotheses
from H that are consistent with D).
Brute-Force Bayes Concept Learning
contd…
To summarize, Bayes theorem implies that the posterior probability P(h|D) under our
assumed P(h) and P(D|h) is

P(h|D) = 1 / |VS_H,D|  if h is consistent with D
P(h|D) = 0             otherwise

We can derive P(D) from the theorem of total probability and the fact that the
hypotheses are mutually exclusive (P(hi ∧ hj) = 0 if i ≠ j):

P(D) = Σ_{hi∈H} P(D|hi) P(hi) = Σ_{hi∈VS_H,D} 1 · (1/|H|) = |VS_H,D| / |H|
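The brute-force computation above can be sketched on a toy concept-learning task (the hypothesis space and training set here are invented for illustration):

```python
from fractions import Fraction

# Sketch of BRUTE-FORCE MAP LEARNING on a toy boolean concept task.
# Hypotheses are hypothetical boolean functions of a single integer feature.
H = {
    "ge2":  lambda x: x >= 2,
    "ge3":  lambda x: x >= 3,
    "even": lambda x: x % 2 == 0,
}
D = [(2, True), (4, True), (1, False)]     # (instance, target value) pairs

prior = Fraction(1, len(H))                # uniform prior P(h) = 1/|H|

def p_data_given_h(h):
    # Noise-free setting: P(D|h) = 1 if h is consistent with every example, else 0
    return 1 if all(h(x) == d for x, d in D) else 0

# P(D) = sum over h of P(D|h) P(h) = |VS| / |H|
p_data = sum(p_data_given_h(h) * prior for h in H.values())

# Posterior: 1/|VS| for consistent hypotheses, 0 for inconsistent ones
posterior = {name: p_data_given_h(h) * prior / p_data for name, h in H.items()}
print(posterior)   # "ge2" and "even" each get 1/2, "ge3" gets 0
```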
The Evolution of Probabilities Associated with
Hypotheses
• In figure (a), all hypotheses have the same probability.
• In figures (b) and (c), as training data accumulates, the posterior probability of
inconsistent hypotheses becomes zero, while the total probability (summing to 1) is
shared equally among the remaining consistent hypotheses.
MAP Hypotheses and Consistent Learners
• A learning algorithm is a consistent learner if it outputs a hypothesis that commits zero errors over the
training examples.
• Every consistent learner outputs a MAP hypothesis, if we assume a uniform prior probability
distribution over H (P(hi) = P(hj) for all i, j), and deterministic, noise free training data (P(D|h) =1 if D
and h are consistent, and 0 otherwise).
Example:
• Because FIND-S outputs a consistent hypothesis, it will output a MAP hypothesis under the
probability distributions P(h) and P(D|h) defined above.
• To summarize, the Bayesian framework allows one way to characterize the behavior of learning
algorithms (e.g., FIND-S), even when the learning algorithm does not explicitly manipulate probabilities.
• By identifying the probability distributions P(h) and P(D|h) under which the algorithm outputs
optimal (i.e., MAP) hypotheses, we can characterize the implicit assumptions under which this
algorithm behaves optimally.
Normal Probability Distribution (Gaussian
Distribution)
A Normal distribution is a smooth, bell-shaped distribution that can be
completely characterized by its mean μ and its standard deviation σ.
Its probability density function is

p(x) = (1 / √(2πσ²)) e^(−(x−μ)² / (2σ²))
Maximum Likelihood & Least-Squared Error
Hypotheses
• Bayesian analysis can sometimes be used to show that a particular learning
algorithm outputs MAP hypotheses even though it may not explicitly use Bayes
rule or calculate probabilities in any form.
• Consider the problem of learning a continuous-valued target function such as
neural network learning, linear regression, and polynomial curve fitting
• A straightforward Bayesian analysis will show that under certain assumptions
any learning algorithm that minimizes the squared error between the output
hypothesis predictions and the training data will output a maximum likelihood
(ML) hypothesis
Learning A Continuous-Valued Target
Function
• Learner L considers an instance space X and a hypothesis space H consisting of some class of
real-valued functions defined over X, i.e., (∀ h ∈ H)[ h : X → R ], and training examples of the form <xi, di>
• A set of m training examples is provided, where the target value of each example is corrupted by
random noise drawn according to a Normal probability distribution with zero mean (di = f(xi) + ei)
• Each training example is a pair of the form (xi ,di ) where di = f (xi ) + ei .
– Here f(xi) is the noise-free value of the target function and ei is a random variable representing
the noise.
– It is assumed that the values of the ei are drawn independently and that they are
distributed according to a Normal distribution with zero mean.
• The task of the learner is to output a maximum likelihood hypothesis or, equivalently, a MAP
hypothesis assuming all hypotheses are equally probable a priori.
Learning A Linear Function
• The target function f corresponds to the solid line.
• The training examples (xi, di) are assumed to have Normally distributed noise ei with
zero mean added to the true target value f(xi).
• The dashed line corresponds to the hypothesis with minimum sum of squared training
errors, hence the maximum likelihood hypothesis hML.

Assuming the training examples are mutually independent given h, we can write P(D|h) as the product
of the various p(di|h):

hML = argmax_{h∈H} p(D|h) = argmax_{h∈H} ∏_{i=1}^{m} p(di|h)

Given that the noise ei obeys a Normal distribution with zero mean and unknown variance σ², each di
must also obey a Normal distribution with variance σ² centered around the true target value f(xi).
Because we are writing the expression for P(D|h), we assume h is the correct description of f.
Hence, μ = f(xi) = h(xi), giving

hML = argmax_{h∈H} ∏_{i=1}^{m} (1/√(2πσ²)) e^(−(di − h(xi))² / (2σ²))
It is common to maximize the less complicated logarithm, which is justified because
ln p is a monotonic function of p:

hML = argmax_{h∈H} Σ_{i=1}^{m} [ ln(1/√(2πσ²)) − (di − h(xi))² / (2σ²) ]

The first term in this expression is a constant independent of h and can therefore be
discarded:

hML = argmax_{h∈H} Σ_{i=1}^{m} − (di − h(xi))² / (2σ²)
Finally, discarding the remaining constants that are independent of h:

hML = argmax_{h∈H} Σ_{i=1}^{m} −(di − h(xi))² = argmin_{h∈H} Σ_{i=1}^{m} (di − h(xi))²

• Thus hML is the hypothesis that minimizes the sum of the squared errors between the
observed training values di and the hypothesis predictions h(xi).
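The derivation can be checked numerically: under hypothetical data with Gaussian noise, the candidate hypothesis minimizing the sum of squared errors is exactly the one maximizing the Gaussian log-likelihood (a sketch assuming a one-parameter hypothesis space h(x) = w·x):

```python
import math, random

# Numerical check of the derivation above: with Normally distributed noise,
# ln p(D|h) equals a constant minus SSE/(2*sigma^2), so the least-squared-error
# hypothesis is the maximum likelihood one. The data and the one-parameter
# hypothesis space h(x) = w*x are hypothetical.
random.seed(0)
sigma = 1.0
xs = [0.5 * i for i in range(20)]
ds = [2.0 * x + random.gauss(0.0, sigma) for x in xs]   # f(x) = 2x plus noise

def sse(w):
    # Sum of squared errors of hypothesis h(x) = w*x on the training data
    return sum((d - w * x) ** 2 for x, d in zip(xs, ds))

def log_likelihood(w):
    # ln p(D|h) = sum_i [ ln(1/sqrt(2*pi*sigma^2)) - (di - h(xi))^2 / (2*sigma^2) ]
    const = -0.5 * len(xs) * math.log(2 * math.pi * sigma ** 2)
    return const - sse(w) / (2 * sigma ** 2)

candidates = [1.0 + 0.01 * k for k in range(201)]       # w in [1.0, 3.0]
w_lse = min(candidates, key=sse)
w_ml = max(candidates, key=log_likelihood)
print(w_lse == w_ml)    # True: the two criteria pick the same hypothesis
```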
Minimum Description Length Principle
Philosophy: Occam's razor suggests that the simplest hypothesis is the best.
Occam's razor: Prefer the simplest hypothesis that fits the data.
• A Bayesian perspective on Occam's razor.
• Motivated by interpreting the definition of hMAP in the light of basic concepts from information
theory (under an optimal code, the number of bits needed to encode a message of probability p
is −log2 p):

hMAP = argmax_{h∈H} P(D|h) P(h)
     = argmax_{h∈H} [ log2 P(D|h) + log2 P(h) ]
     = argmin_{h∈H} [ −log2 P(D|h) − log2 P(h) ]        (1)

• This equation can be interpreted as a statement that short hypotheses are preferred, assuming
a particular representation scheme for encoding hypotheses and data.
Interpreting Equation (1) with Information Theory
• −log2 P(h) is the description length of h under the optimal encoding for the hypothesis space
H: L_CH(h) = −log2 P(h), where CH is the optimal code for hypothesis space H.
• −log2 P(D|h) is the description length of the training data D given hypothesis h:
L_CD|h(D|h) = −log2 P(D|h), where CD|h is the optimal code for describing data D assuming that
both the sender and receiver know the hypothesis h.
• Rewriting Equation (1) shows that hMAP is the hypothesis h that minimizes the sum given by the
description length of the hypothesis plus the description length of the data given the hypothesis:

hMAP = argmin_{h∈H} [ L_CH(h) + L_CD|h(D|h) ]

where CH and CD|h are the optimal encodings for H and for D given h.
The Minimum Description Length (MDL) principle recommends choosing the hypothesis that
minimizes the sum of these two description lengths, under whatever codes C1 and C2 are
chosen to represent hypotheses and data:

hMDL = argmin_{h∈H} [ L_C1(h) + L_C2(D|h) ]

• The MDL principle provides a way of trading off hypothesis complexity against the number
of errors committed by the hypothesis.
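A minimal sketch of this trade-off, with invented probabilities: because −log2 is monotonically decreasing, minimizing the total description length picks the same hypothesis as maximizing P(D|h)P(h):

```python
import math

# Sketch of the MDL trade-off on hypothetical probabilities. Description
# lengths are -log2 of the corresponding probabilities (optimal-code lengths).
hypotheses = {
    "simple":  {"p_h": 0.50, "p_d_given_h": 0.10},   # short hypothesis, fits poorly
    "medium":  {"p_h": 0.30, "p_d_given_h": 0.40},
    "complex": {"p_h": 0.20, "p_d_given_h": 0.45},   # long hypothesis, fits well
}

def description_length(v):
    # L_CH(h) + L_CD|h(D|h) = -log2 P(h) - log2 P(D|h)
    return -math.log2(v["p_h"]) - math.log2(v["p_d_given_h"])

h_mdl = min(hypotheses, key=lambda h: description_length(hypotheses[h]))
h_map = max(hypotheses, key=lambda h: hypotheses[h]["p_h"] * hypotheses[h]["p_d_given_h"])
print(h_mdl, h_map)   # the same hypothesis: shortest total description = MAP
```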
Bayes Optimal Classifier
• What is the most probable classification of a new instance given the training data? It is not
necessarily the classification produced by the MAP hypothesis.
• Example: suppose the posterior probabilities of three hypotheses are P(h1|D) = 0.4,
P(h2|D) = 0.3, and P(h3|D) = 0.3, so h1 is the MAP hypothesis. Suppose a new instance x is
classified positive by h1 but negative by h2 and h3.
• Taking all hypotheses into account, the probability that x is positive is 0.4 (the probability
associated with h1), and the probability that it is negative is therefore 0.6.
• The most probable classification (negative) in this case is different from the classification
generated by the MAP hypothesis.
• If the possible classification of the new instance can take on any value vj from some set V,
then the probability P(vj|D) that the correct classification for the new instance is vj is just

P(vj|D) = Σ_{hi∈H} P(vj|hi) P(hi|D)

The optimal classification of the new instance is the value vj for which P(vj|D) is maximum:

argmax_{vj∈V} Σ_{hi∈H} P(vj|hi) P(hi|D)

Any system that classifies new instances according to the above equation is called a Bayes optimal classifier.
Bayes Optimal Classifier contd…
• No other classification method using the same hypothesis space and same prior knowledge can
outperform this method on average.
• This method maximizes the probability that the new instance is classified correctly
• In learning boolean concepts using version spaces as in the earlier section, the Bayes optimal
classification of a new instance is obtained by taking a weighted vote among all members of the
version space, with each candidate hypothesis weighted by its posterior probability.
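The three-hypothesis example above can be written out directly (a sketch using the posteriors 0.4, 0.3, 0.3 from the example):

```python
# Bayes optimal classification of the three-hypothesis example: posteriors
# 0.4, 0.3, 0.3, where h1 classifies x as "+" and h2, h3 classify it as "-".
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}      # P(hi|D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}     # each hi's classification of x

def p_value_given_data(value):
    # P(vj|D) = sum over hi of P(vj|hi) P(hi|D); here P(vj|hi) is 1 or 0
    return sum(p for h, p in posteriors.items() if predictions[h] == value)

scores = {v: p_value_given_data(v) for v in ("+", "-")}
best = max(scores, key=scores.get)
print(scores, best)   # "-" wins with 0.6, disagreeing with the MAP hypothesis h1
```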
Naive Bayes Classifier
• The naive Bayes classifier applies to learning tasks where each instance is described by a
conjunction of attribute values and the target function takes values from some finite set V.
• The Bayesian approach to classifying a new instance is to assign the most probable target
value, vMAP, given the attribute values (a1, a2 . . . an) that describe the instance:

vMAP = argmax_{vj∈V} P(vj | a1, a2 . . . an) = argmax_{vj∈V} P(a1, a2 . . . an | vj) P(vj)

• The naive Bayes classifier adds the simplifying assumption that the attribute values are
conditionally independent given the target value:

vNB = argmax_{vj∈V} P(vj) ∏_i P(ai | vj)

Example-1 (PlayTennis): first, the probabilities of the different target values can easily be
estimated based on their frequencies over the 14 training examples:
P(PlayTennis = yes) = 9/14 = 0.64
P(PlayTennis = no) = 5/14 = 0.36
Estimating probabilities and classification of the new instance
For the new instance <Outlook = sunny, Temperature = cool, Humidity = high, Wind = strong>,
each conditional probability is estimated from its frequency over the training examples.
The naive Bayes quantities for the two target values are

P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.64 · 2/9 · 3/9 · 3/9 · 3/9 = 0.0053
P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = 0.36 · 3/5 · 1/5 · 4/5 · 3/5 = 0.0206

so the naive Bayes classifier predicts PlayTennis = no for this instance.
Normalizing the two quantities to sum to one gives the conditional probability of each classification:

P(no | instance) = 0.0206 / (0.0206 + 0.0053) = 0.795
P(yes | instance) = 0.0053 / (0.0206 + 0.0053) = 0.205
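The computation can be reproduced exactly with rational arithmetic (a sketch; the conditional probabilities are those read off the 14 PlayTennis training examples):

```python
from fractions import Fraction as F

# PlayTennis naive Bayes computation for the new instance
# <Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong>.
p_yes = F(9, 14) * F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9)
p_no  = F(5, 14) * F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5)

print(round(float(p_yes), 4), round(float(p_no), 4))   # 0.0053 0.0206

# Normalize so the two quantities sum to one
print(round(float(p_no / (p_yes + p_no)), 3))          # 0.795 -> predict PlayTennis = no
print(round(float(p_yes / (p_yes + p_no)), 3))         # 0.205
```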
Naïve Bayes Classifier: Example-2
Ten training examples (5 with Accident = yes, 5 with Accident = no) describe each situation by
Weather Condition, Road Condition, and two further attributes. The estimated probabilities for
Weather Condition are:

Weather Condition | Accident = yes | Accident = no | Total
Rain              | 1/5            | 2/5           | 3/10
Snow              | 2/5            | 1/5           | 3/10
Clear             | 2/5            | 2/5           | 4/10

Analogous tables are built for Road Condition (Good, Bad, Average) and the remaining attributes.
To predict for the new instance <Snow, Bad, High, No>, compute

P(yes) · P(snow|yes) · P(bad|yes) · P(high|yes) · P(no|yes)
P(no) · P(snow|no) · P(bad|no) · P(high|no) · P(no|no)

and output the target value with the larger quantity.
Naïve Bayes Classifier: Example-3 (exercise)
Training data over four attributes A1–A4:

# | A1 | A2 | A3     | A4 | Class
1 | Y  | N  | Mild   | Y  | N
2 | Y  | Y  | No     | N  | Y
3 | Y  | N  | Strong | Y  | Y
4 | N  | Y  | Mild   | Y  | Y
5 | N  | N  | No     | N  | N
6 | N  | Y  | Strong | Y  | Y
7 | N  | Y  | Strong | N  | N
8 | Y  | Y  | Mild   | Y  | Y

Predict the class for the new instance <N, Y, No, Y>.
Estimating Probabilities
• Up to this point we have estimated probabilities by the fraction of times the event
is observed to occur over the total number of opportunities.
• For example, in Example-1 we estimated P(Wind = strong | PlayTennis = no) by the
fraction nc/n, where n = 5 is the number of training examples with PlayTennis = no
and nc = 3 is the number of these (among the 5) for which Wind = strong.
• While this observed fraction provides a good estimate of the probability in many
cases, it provides poor estimates when nc is very small. This raises two difficulties.
• First, nc/n gives a biased underestimate of the probability.
• Second, when this probability estimate is zero, the probability term will dominate
the Bayes classifier if a future query contains Wind = strong. The reason is that
the naive Bayes quantity requires multiplying all the other probability terms by
this zero value.
m-Estimate of probability
• To avoid this difficulty we can adopt a Bayesian approach to estimating the probability,
using the m-estimate defined as follows:

(nc + m·p) / (n + m)

where p is our prior estimate of the probability we wish to determine (often p = 1/k for
an attribute with k possible values) and m is a constant called the equivalent sample
size, which determines how heavily to weight p relative to the observed data.
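A minimal sketch of the m-estimate (the choice m = 4 below is an arbitrary illustration, not a value from the slides):

```python
from fractions import Fraction

def m_estimate(n_c, n, p, m):
    # m-estimate of probability: (n_c + m*p) / (n + m).
    # Equivalent to augmenting the n observed examples with m "virtual"
    # examples distributed according to the prior estimate p.
    return (n_c + m * p) / (n + m)

# P(Wind=strong | PlayTennis=no): n_c = 3 of n = 5, prior p = 1/2, m = 4
print(m_estimate(3, 5, Fraction(1, 2), 4))   # 5/9 instead of the raw 3/5

# A zero raw count no longer yields a zero probability estimate
print(m_estimate(0, 5, Fraction(1, 2), 4))   # 2/9
```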
Learning to Classify Text
• We now present a general algorithm for learning to classify text, based on the naive Bayes
classifier.
• Consider an instance space X consisting of all possible text documents (i.e., all possible
strings of words and punctuation of all possible lengths), together with training examples
consisting of documents labeled with their target value.
• The task is to learn from these training examples to predict the target value for subsequent
text documents.
• An example target function: classifying documents as interesting or uninteresting to a
particular person, using the target values like and dislike to indicate these two classes.
Sample text to classify
Our approach to representing arbitrary text documents is disturbingly simple: Given a text document,
such as this paragraph, we define an attribute for each word position in the document and define
the value of that attribute to be the English word found in that position. Thus, the current paragraph
would be described by 111 attribute values, corresponding to the 111 word positions. The value of
the first attribute is the word "our," the value of the second attribute is the word "approach," and so
on. Notice that long text documents will require a larger number of attributes than short documents.
Each word probability is estimated with an m-estimate (uniform priors, m = |Vocabulary|):

P(wk|vj) = (nk + 1) / (n + |Vocabulary|)

• where n is the total number of word positions in all training examples whose target value is vj,
• nk is the number of times word wk is found among these n word positions,
• |Vocabulary| is the total number of distinct words (and other tokens) found within the training
data.
Naive Bayes algorithms for learning and classifying text
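The learning and classification steps can be sketched on invented toy documents (the document texts and target values below are hypothetical, not from the slides):

```python
import math
from collections import Counter

# Minimal sketch of naive Bayes text learning/classification on toy data
# with target values "like"/"dislike".
docs = [
    ("great fun great story", "like"),
    ("fun and great acting", "like"),
    ("boring slow story", "dislike"),
]

vocabulary = {w for text, _ in docs for w in text.split()}
classes = {label for _, label in docs}

priors, word_probs = {}, {}
for v in classes:
    texts = [text.split() for text, label in docs if label == v]
    priors[v] = len(texts) / len(docs)          # P(vj) = fraction of docs in class vj
    counts = Counter(w for words in texts for w in words)
    n = sum(counts.values())                    # total word positions for class vj
    # P(wk|vj) = (nk + 1) / (n + |Vocabulary|)
    word_probs[v] = {w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary}

def classify(text):
    # Sum of logs instead of product of probabilities, to avoid underflow;
    # words outside the vocabulary are simply skipped
    def score(v):
        return math.log(priors[v]) + sum(math.log(word_probs[v][w])
                                         for w in text.split() if w in vocabulary)
    return max(classes, key=score)

print(classify("great story"))    # -> "like" on this toy training set
print(classify("boring slow"))    # -> "dislike"
```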
K-means Clustering Algorithm (not in syllabus)
Solved example: the cluster-membership indicators for the six records are

Record | Zi,1 | Zi,2
1      | 1    | 0
2      | 1    | 0
3      | 0    | 1
4      | 0    | 1
5      | 0    | 1
6      | 0    | 1

Iteration 3: each record is assigned to the closer of the two centroids,
C1 = (1.25, 1.5) or C2 = (3.9, 5.1), and Zi,1, Zi,2 are updated accordingly.
• Where xi is the observed value of the ith instance and where Zi1 and Zi2
indicate which of the two Normal distributions was used to generate the value
xi.
• In particular, Zij has the value 1 if xi was created by the jth Normal distribution
and 0 otherwise.
• Here xi is the observed variable in the description of the instance, and Zi1 and
Zi2 are hidden variables.
• The EM algorithm first initializes the hypothesis to h = <µ1, µ2>, where
µ1 and µ2 are arbitrary initial values.
• It then iteratively re-estimates h by repeating the following two steps until the
procedure converges to a stationary value for h:
– Step 1 (Estimation): calculate the expected value E[Zij] of each hidden variable Zij,
assuming the current hypothesis h = <µ1, µ2> holds.
– Step 2 (Maximization): calculate a new maximum likelihood hypothesis h' = <µ1', µ2'>,
assuming the value taken on by each hidden variable Zij is its expected value E[Zij]
calculated in Step 1. Then replace the hypothesis h by h'.
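The two steps can be sketched on hypothetical 1-D data drawn from two Normal distributions (the true means 0 and 6 are invented for this sketch; the variance is assumed known and equal to 1):

```python
import math, random

# Sketch of the two-means EM algorithm on hypothetical 1-D data: a mixture of
# two Normal distributions with equal mixing weights and known variance 1.
random.seed(1)
data = ([random.gauss(0.0, 1.0) for _ in range(100)] +
        [random.gauss(6.0, 1.0) for _ in range(100)])

def normal_pdf(x, mu):
    return math.exp(-(x - mu) ** 2 / 2.0) / math.sqrt(2.0 * math.pi)

mu1, mu2 = -1.0, 1.0                 # arbitrary initial hypothesis h = <mu1, mu2>
for _ in range(50):
    # E step: expected values of the hidden indicators, E[Zi1] and E[Zi2]
    e1 = [normal_pdf(x, mu1) / (normal_pdf(x, mu1) + normal_pdf(x, mu2))
          for x in data]
    e2 = [1.0 - p for p in e1]
    # M step: re-estimate each mean as the E[Zij]-weighted average of the xi
    mu1 = sum(p * x for p, x in zip(e1, data)) / sum(e1)
    mu2 = sum(p * x for p, x in zip(e2, data)) / sum(e2)

print(round(mu1, 1), round(mu2, 1))   # converges close to the true means 0 and 6
```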
General Statement of EM Algorithm
• More generally, the EM algorithm can be applied in many settings where we wish to estimate
some set of parameters θ that describe an underlying probability distribution, given only the
observed portion of the full data produced by this distribution.
• In the above two-means example the parameters of interest were θ =<µ1, µ2>, and the full data
were the triples (xi,zi1,zi2) of which only the xi were observed.
• In general, let X = {x1,. . . ,xm} denote the observed data in a set of m independently drawn
instances, let Z = {zl, . . . , zm} denote the unobserved data in these same instances, and let
Y = X U Z denote the full data.
• The EM algorithm searches for the maximum likelihood hypothesis h' by seeking the h' that
maximizes E[ln P(Y|h')], the expected value of the log-likelihood of the full data Y.
• In its general form, the EM algorithm repeats the following two steps until convergence:
– Estimation (E) step: calculate Q(h'|h) = E[ln P(Y|h') | h, X], using the current hypothesis
h and the observed data X to estimate the probability distribution over Y.
– Maximization (M) step: replace hypothesis h by the hypothesis h' that maximizes this
Q function: h ← argmax_{h'} Q(h'|h).
Bayesian belief networks
• Naïve Bayes assumption of conditional independence is too restrictive.
• A Bayesian belief network describes the joint probability distribution for a set of
variables.
Bayesian belief networks contd…
• A Bayesian network, or belief network, shows conditional probability and causality relationships
between variables.
• The probability of an event occurring given that another event has already occurred is called
a conditional probability.
• The vertices of the graph, which represent variables, are called nodes. The nodes are
represented as circles containing the variable name.
• The connections between the nodes are called arcs, or edges. The edges are drawn as arrows
between the nodes, and represent dependence between the variables.
• An edge between a pair of nodes indicates that one node is a parent of the other, so no
independence is assumed between them. Independence assumptions are implied in a
Bayesian network by the absence of a link.
Example DAG
• The main point of Bayesian Networks is to allow for probabilistic inference to be performed.
• This means that the probability of each value of a node in the Bayesian network can be computed
when the values of the other variables are known.
• Also, because independence among the variables is easy to recognize since conditional
relationships are clearly defined by a graph edge, not all joint probabilities in the Bayesian system
need to be calculated in order to make a decision.
Conditional Independence
• Definition: X is conditionally independent of Y given Z if the probability distribution governing X
is independent of the value of Y given the value of Z; that is, if
(∀ xi, yj, zk) P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)
more compactly we write P(X|Y,Z) = P(X|Z)
• This definition of conditional independence can be extended to sets of variables as well. We say
that the set of variables X1 . . . Xl is conditionally independent of the set of variables Y1 . . . Ym given
the set of variables Z1 . . . Zn if

P(X1 . . . Xl | Y1 . . . Ym, Z1 . . . Zn) = P(X1 . . . Xl | Z1 . . . Zn)

• In a Bayesian network, each node is asserted to be conditionally independent of its
non-descendants, given its immediate predecessors.
• A Bayesian network represents the joint probability distribution by specifying a set of conditional
independence assumptions (represented by a directed acyclic graph), together with sets of local
conditional probabilities.
Bayesian Belief Network Representation contd…
• Each variable in the joint space is represented by a node in the Bayesian network.
• First, the network arcs represent the assertion that each variable is conditionally independent of its
nondescendants in the network given its immediate predecessors in the network. We say X is a
descendant of Y if there is a directed path from Y to X.
• Second, a conditional probability table is given for each variable, describing the probability
distribution for that variable given the values of its immediate predecessors.
• The joint probability for any desired assignment of values (y1, . . . , yn) to the tuple of network
variables (Y1 . . . Yn) can be computed by the formula

P(y1, . . . , yn) = ∏_{i=1}^{n} P(yi | Parents(Yi))

where Parents(Yi) denotes the set of immediate predecessors of Yi in the network.
Example with probability table for Campfire
• The network nodes and arcs represent the assertion that Campfire is conditionally independent of its nondescendants
Lightning and Thunder, given its immediate parents Storm and BusTourGroup.
• This means that once we know the value of the variables Storm and BusTourGroup, the variables Lightning and
Thunder provide no additional information about Campfire.
• The set of local conditional probability tables for all the variables, together with the set of
conditional independence assumptions described by the network, describe the full joint
probability distribution for the network.
• Bayesian belief networks allow a convenient way to represent causal knowledge, such as the
fact that Lightning causes Thunder.
Full joint probability distribution for the network: Example
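A sketch of computing the joint distribution from local conditional probability tables, for a three-variable fragment (Storm, BusTourGroup, Campfire); the probability numbers below are illustrative placeholders, not necessarily those of the original figure:

```python
# Joint probability from the network factorization P(S, B, C) = P(S) P(B) P(C|S, B),
# where Storm and BusTourGroup have no parents and Campfire's parents are both.
# The CPT numbers are hypothetical placeholders.
p_storm = 0.1
p_bus = 0.4
# P(Campfire = True | Storm, BusTourGroup)
p_campfire = {(True, True): 0.4, (True, False): 0.1,
              (False, True): 0.8, (False, False): 0.2}

def joint(storm, bus, campfire):
    # Each variable is conditioned only on its parents in the network
    p = (p_storm if storm else 1 - p_storm) * (p_bus if bus else 1 - p_bus)
    pc = p_campfire[(storm, bus)]
    return p * (pc if campfire else 1 - pc)

print(joint(True, True, True))   # P(S) * P(B) * P(C|S,B) = 0.1 * 0.4 * 0.4

# Sanity check: the joint distribution sums to 1 over all 8 assignments
total = sum(joint(s, b, c)
            for s in (True, False) for b in (True, False) for c in (True, False))
print(round(total, 10))
```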
Inference
• We might wish to use a Bayesian network to infer the value of some target
variable (e.g., ForestFire) given the observed values of the other variables.
• This inference step can be straightforward if values for all of the other
variables in the network are known exactly.
• In the more general case we may wish to infer the probability distribution for
some variable (e.g., ForestFire) given observed values for only a subset of
the other variables (e.g., Thunder and BusTourGroup may be the only
observed values available).