
Bayesian Learning in Machine Learning

The document discusses Bayesian learning in the context of artificial intelligence and machine learning, highlighting its importance in providing a probabilistic approach to inference and hypothesis evaluation. Key concepts include Bayes theorem, maximum likelihood hypotheses, the Minimum Description Length Principle, and various classifiers such as the Naive Bayes classifier and Bayesian belief networks. It also addresses practical challenges in applying Bayesian methods, such as the need for initial probability knowledge and computational costs.


Artificial Intelligence and Machine Learning
Unit-4
Bayesian Learning
Dr. Ashwini B P
Assistant Professor
Department of CSE, SIT
Contents
• Introduction
• Bayes theorem
• Bayes theorem and concept learning
• Maximum likelihood and Least Squared Error Hypothesis
• Maximum likelihood Hypotheses for predicting probabilities
• Minimum Description Length Principle
• Bayes Optimal Classifier
• Naive Bayes classifier
• Bayesian belief networks
• EM algorithm
• Introduction to Neural Networks
Introduction
• Bayesian reasoning provides a probabilistic approach to inference.
• It is important to machine learning because it provides a quantitative approach to weighing the evidence supporting alternative hypotheses.
• Bayesian learning methods are relevant to the study of machine learning for two reasons:
• First, Bayesian learning algorithms that calculate explicit probabilities for hypotheses, such as the naïve Bayes classifier, are among the most practical approaches to certain types of learning problems.
• Second, they provide a useful perspective for understanding many learning algorithms that do not explicitly manipulate probabilities.
Features of Bayesian Learning Methods
• Each observed training example can incrementally decrease or increase the estimated probability that a hypothesis is correct.
• This provides a more flexible approach to learning than algorithms that completely eliminate a hypothesis if it is found to be inconsistent with any single example.
• Prior knowledge can be combined with observed data to determine the final probability of a hypothesis.
• In Bayesian learning, prior knowledge is provided by asserting
(1) a prior probability for each candidate hypothesis, and
(2) a probability distribution over observed data for each possible hypothesis.
Features of Bayesian Learning Methods contd…
• Bayesian methods can accommodate hypotheses that make probabilistic predictions.
• New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities.
• Even in cases where Bayesian methods prove computationally intractable, they can provide a standard of optimal decision making against which other practical methods can be measured.
Practical Difficulty in Applying Bayesian Methods
• The first practical difficulty in applying Bayesian methods is that they typically
require initial knowledge of many probabilities. When these probabilities are not
known in advance they are often estimated based on background knowledge,
previously available data, and assumptions about the form of the underlying
distributions.
• A second practical difficulty is the significant computational cost required to
determine the Bayes optimal hypothesis in the general case. In certain
specialized situations, this computational cost can be significantly reduced.

Bayes Theorem
Bayes theorem provides a way to calculate the probability of a hypothesis based on
its prior probability, the probabilities of observing various data given the hypothesis,
and the observed data itself.
Notations
• P(h): prior probability of h; reflects any background knowledge about the chance that h is correct.
• P(D): prior probability of D; the probability that D will be observed.
• P(D|h): probability of observing D given a world in which h holds.
• P(h|D): posterior probability of h; reflects confidence that h holds after D has been observed.
Bayes theorem is the cornerstone of Bayesian learning methods because it provides a way to calculate the posterior probability P(h|D) from the prior probability P(h), together with P(D) and P(D|h):

P(h|D) = P(D|h) P(h) / P(D)

P(h|D) increases with P(h) and with P(D|h), according to Bayes theorem.

P(h|D) decreases as P(D) increases, because the more probable it is that D will be observed independent of h, the less evidence D provides in support of h.
Maximum a Posteriori (MAP) Hypothesis
• In many learning scenarios, the learner considers some set of candidate hypotheses H and is interested in finding the most probable hypothesis h ∈ H given the observed data D. Any such maximally probable hypothesis is called a maximum a posteriori (MAP) hypothesis.
• Using Bayes theorem to calculate the posterior probability of each candidate hypothesis, hMAP is a MAP hypothesis provided

hMAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h) P(h) / P(D) = argmax_{h∈H} P(D|h) P(h)

• P(D) can be dropped in the final step because it is a constant independent of h.


Maximum Likelihood (ML) Hypothesis
In some cases, it is assumed that every hypothesis in H is equally probable a priori (P(hi) = P(hj) for all hi and hj in H).

In this case the expression for hMAP can be simplified: we need only consider the term P(D|h) to find the most probable hypothesis:

hML = argmax_{h∈H} P(D|h)

P(D|h) is often called the likelihood of the data D given h, and any hypothesis that maximizes P(D|h) is called a maximum likelihood (ML) hypothesis.
Example-1
• Consider a medical diagnosis problem in which there are two alternative hypotheses:
(1) that the patient has a particular form of cancer.
(2) that the patient does not.
• The available data is from a particular laboratory test with two possible outcomes: +
(positive) and - (negative). We have prior knowledge that over the entire population of
people, only 0.008 have this disease.
• Furthermore, the lab test is only an imperfect indicator of the disease.
• The test returns a correct positive result in only 98% of the cases in which the disease is
actually present and a correct negative result in only 97% of the cases in which the
disease is not present.
• In other cases, the test returns the opposite result.
Example contd…

The above situation can be summarized by the following probabilities:

h1: the patient has a particular form of cancer; h2: the patient does not.

P(cancer) = 0.008        P(¬cancer) = 0.992
P(+|cancer) = 0.98       P(−|cancer) = 0.02
P(+|¬cancer) = 0.03      P(−|¬cancer) = 0.97
• Suppose a new patient is observed for whom the lab test returns a positive
(+) result.
• Should we diagnose the patient as having cancer or not?

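The diagnosis question above can be answered with a short sketch, using the probabilities stated in the example:

```python
# MAP diagnosis for the cancer example above, using the slide's probabilities.
p_cancer = 0.008            # prior P(cancer)
p_no_cancer = 0.992         # prior P(~cancer)
p_pos_given_cancer = 0.98   # P(+ | cancer)
p_pos_given_no = 0.03       # P(+ | ~cancer) = 1 - 0.97

# Unnormalized posteriors P(+|h) P(h); the constant P(+) cancels in the argmax.
score_cancer = p_pos_given_cancer * p_cancer   # 0.00784
score_no = p_pos_given_no * p_no_cancer        # 0.02976

h_map = "cancer" if score_cancer > score_no else "no cancer"
p_cancer_given_pos = score_cancer / (score_cancer + score_no)
print(h_map, round(p_cancer_given_pos, 2))   # MAP is "no cancer"; P(cancer|+) is about 0.21
```

So even after a positive test, the MAP hypothesis is that the patient does not have cancer, because the prior P(cancer) is so small.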
Example-2
Bayes Theorem and Concept Learning
What is the relationship between Bayes theorem and the problem of concept learning?

Bayes theorem provides a principled way to calculate the posterior probability of each hypothesis given the training data, so we can use it as the basis for a straightforward learning algorithm that calculates the probability of each possible hypothesis and outputs the most probable one.
Brute-Force Bayes Concept Learning
We can design a straightforward concept learning algorithm to output the maximum a posteriori hypothesis, based on Bayes theorem, as follows.

BRUTE-FORCE MAP LEARNING algorithm:
1. For each hypothesis h in H, calculate the posterior probability
   P(h|D) = P(D|h) P(h) / P(D)
2. Output the hypothesis hMAP with the highest posterior probability
   hMAP = argmax_{h∈H} P(h|D)
Brute-Force Bayes Concept Learning contd…

In order to specify a learning problem for the BRUTE-FORCE MAP LEARNING algorithm, we must specify what values are to be used for P(h) and for P(D|h).

Let's choose P(h) and P(D|h) to be consistent with the following assumptions:

• The training data D is noise free (i.e., di = c(xi))

• The target concept c is contained in the hypothesis space H

• We have no a priori reason to believe that any hypothesis is more probable than any other.
Brute-Force Bayes Concept Learning contd…
What values should we specify for P(h)?

• Given no prior knowledge that one hypothesis is more likely than another, it is reasonable to assign the same prior probability to every hypothesis h in H.

• Assuming the target concept is contained in H and requiring that these prior probabilities sum to 1, we get

P(h) = 1/|H| for all h in H
Brute-Force Bayes Concept Learning contd…
What choice shall we make for P(D|h)?
• P(D|h) is the probability of observing the target values D = (d1 . . . dm) for the fixed set of instances (x1 . . . xm), given a world in which hypothesis h holds.
• Since we assume noise-free training data, the probability of observing classification di given h is just 1 if di = h(xi) and 0 if di ≠ h(xi). Therefore,

P(D|h) = 1 if di = h(xi) for all di in D, and P(D|h) = 0 otherwise.
Brute-Force Bayes Concept Learning contd…
Given these choices for P(h) and for P(D|h), we now have a fully defined problem for the above BRUTE-FORCE MAP LEARNING algorithm.
In a first step, we have to determine the probabilities P(h|D).
Brute-Force Bayes Concept Learning contd…

To summarize, Bayes theorem implies that the posterior probability P(h|D) under our assumed P(h) and P(D|h) is

P(h|D) = 1/|VSH,D| if h is consistent with D, and 0 otherwise,

where |VSH,D| is the number of hypotheses from H consistent with D (the version space).

We can derive P(D) from the theorem of total probability and the fact that the hypotheses are mutually exclusive:

P(D) = Σ_{h∈H} P(D|h) P(h) = |VSH,D| / |H|
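The scheme above can be sketched directly in code. The hypothesis space and noise-free examples below are made up for illustration; the point is that under a uniform prior, every consistent hypothesis ends up with posterior 1/|VS| and every inconsistent one with 0:

```python
# Minimal sketch of BRUTE-FORCE MAP LEARNING under the noise-free,
# uniform-prior assumptions above. Hypotheses are illustrative boolean
# functions over integers; they are not from the slides.
H = {
    "always_true":  lambda x: True,
    "always_false": lambda x: False,
    "is_even":      lambda x: x % 2 == 0,
}
D = [(2, True), (4, True), (3, False)]  # hypothetical examples (x, c(x))

prior = 1 / len(H)  # uniform P(h) = 1/|H|
# P(D|h) = 1 if h is consistent with every example, 0 otherwise
likelihood = {h: 1.0 if all(f(x) == d for x, d in D) else 0.0 for h, f in H.items()}
unnorm = {h: likelihood[h] * prior for h in H}   # P(D|h) P(h)
p_d = sum(unnorm.values())                       # P(D) = |VS| / |H|
posterior = {h: unnorm[h] / p_d for h in H}      # 1/|VS| for consistent h, else 0
print(posterior)
```

Here only "is_even" is consistent with all three examples, so it receives the full posterior mass.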
The Evolution of Probabilities Associated with Hypotheses
• Figure (a): all hypotheses have the same prior probability.
• Figures (b) and (c): as training data accumulates, the posterior probability of inconsistent hypotheses becomes zero, while the total probability (summing to 1) is shared equally among the remaining consistent hypotheses.
MAP Hypotheses and Consistent Learners
• A learning algorithm is a consistent learner if it outputs a hypothesis that commits zero errors over the training examples.
• Every consistent learner outputs a MAP hypothesis if we assume a uniform prior probability distribution over H (P(hi) = P(hj) for all i, j) and deterministic, noise-free training data (P(D|h) = 1 if D and h are consistent, and 0 otherwise).
Example:
• Since FIND-S outputs a consistent hypothesis, it will output a MAP hypothesis under the probability distributions P(h) and P(D|h) defined above.
• To summarize, the Bayesian framework gives one way to characterize the behavior of learning algorithms (e.g., FIND-S), even when the learning algorithm does not explicitly manipulate probabilities.
• By identifying probability distributions P(h) and P(D|h) under which the algorithm outputs optimal (i.e., MAP) hypotheses, we can characterize the implicit assumptions under which the algorithm behaves optimally.
Normal Probability Distribution (Gaussian Distribution)
A Normal distribution is a smooth, bell-shaped distribution that can be completely characterized by its mean μ and its standard deviation σ:

p(x) = (1 / √(2πσ²)) e^(−(x−μ)² / (2σ²))
Maximum Likelihood & Least-Squared Error Hypotheses
• Bayesian analysis can sometimes be used to show that a particular learning
algorithm outputs MAP hypotheses even though it may not explicitly use Bayes
rule or calculate probabilities in any form.
• Consider the problem of learning a continuous-valued target function such as
neural network learning, linear regression, and polynomial curve fitting
• A straightforward Bayesian analysis will show that under certain assumptions
any learning algorithm that minimizes the squared error between the output
hypothesis predictions and the training data will output a maximum likelihood
(ML) hypothesis

Learning A Continuous-Valued Target Function
• Learner L considers an instance space X and a hypothesis space H consisting of some class of real-valued functions defined over X, i.e., (∀ h ∈ H)[ h : X → R ], and training examples of the form <xi, di>.
• The problem faced by L is to learn an unknown target function f : X → R.
• A set of m training examples is provided, where the target value of each example is corrupted by random noise drawn according to a Normal probability distribution with zero mean (di = f(xi) + ei).
• Each training example is a pair of the form (xi, di) where di = f(xi) + ei.
  – Here f(xi) is the noise-free value of the target function and ei is a random variable representing the noise.
  – It is assumed that the values of the ei are drawn independently and that they are distributed according to a Normal distribution with zero mean.
• The task of the learner is to output a maximum likelihood hypothesis or, equivalently, a MAP hypothesis assuming all hypotheses are equally probable a priori.
Learning A Linear Function
• The target function f corresponds to the solid line.
• The training examples (xi, di) are assumed to have Normally distributed noise ei with zero mean added to the true target value f(xi).
• The dashed line corresponds to the hypothesis hML with least-squared training error, and hence is the maximum likelihood hypothesis.
• Notice that the maximum likelihood hypothesis is not necessarily identical to the correct hypothesis, f, because it is inferred from only a limited sample of noisy training data.
Using the previous definition of hML we have

hML = argmax_{h∈H} P(D|h)

Assuming the training examples are mutually independent given h, we can write P(D|h) as the product of the individual p(di|h):

hML = argmax_{h∈H} Π_{i=1..m} p(di|h)

Given that the noise ei obeys a Normal distribution with zero mean and unknown variance σ², each di must also obey a Normal distribution around the true target value f(xi). Because we are writing the expression for P(D|h), we assume h is the correct description of f; hence μ = f(xi) = h(xi), and

hML = argmax_{h∈H} Π_{i=1..m} (1/√(2πσ²)) e^(−(di − h(xi))² / (2σ²))
It is common to maximize the less complicated logarithm, which is justified by the monotonicity of ln:

hML = argmax_{h∈H} Σ_{i=1..m} [ ln(1/√(2πσ²)) − (di − h(xi))² / (2σ²) ]

The first term in this expression is a constant independent of h and can therefore be discarded:

hML = argmax_{h∈H} Σ_{i=1..m} − (di − h(xi))² / (2σ²)

Maximizing this negative quantity is equivalent to minimizing the corresponding positive quantity.
Finally, discarding constants that are independent of h,

hML = argmin_{h∈H} Σ_{i=1..m} (di − h(xi))²

• hML is the hypothesis that minimizes the sum of the squared errors.

Why is it reasonable to choose the Normal distribution to characterize noise?

• It is a good approximation of many types of noise in physical systems.
• The Central Limit Theorem shows that the sum of a sufficiently large number of independent, identically distributed random variables itself obeys a Normal distribution.

Note that only noise in the target value is considered, not noise in the attributes describing the instances themselves.
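As a small numerical illustration (with made-up data and a no-intercept line, not the figure from the slides), minimizing the sum of squared errors picks out the ML slope in closed form:

```python
# Under the Normal-noise assumptions above, the ML hypothesis minimizes the
# sum of squared errors. Sketch: fit d = w*x by least squares on invented data.
xs = [1.0, 2.0, 3.0, 4.0]
ds = [2.1, 3.9, 6.2, 7.8]   # roughly d = 2x plus noise

# Closed-form least-squares slope for a no-intercept line: w = sum(x*d)/sum(x^2)
w = sum(x * d for x, d in zip(xs, ds)) / sum(x * x for x in xs)
print(w)   # close to the "true" slope 2
```

Any hypothesis class (neural network, polynomial, etc.) trained to minimize the same squared-error criterion is, under these noise assumptions, searching for an ML hypothesis.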
Minimum Description Length Principle
Philosophy: Occam's razor suggests that the simplest hypothesis is the best.
Occam's razor: Prefer the simplest hypothesis that fits the data.
• The MDL principle gives a Bayesian perspective on Occam's razor.
• It is motivated by interpreting the definition of hMAP in light of basic concepts from information theory (the number of bits needed to encode a message is measured in log base 2):

hMAP = argmax_{h∈H} P(D|h) P(h)

which can be equivalently expressed in terms of maximizing the log2:

hMAP = argmax_{h∈H} [ log2 P(D|h) + log2 P(h) ]

or alternatively, minimizing the negative of this quantity:

hMAP = argmin_{h∈H} [ −log2 P(D|h) − log2 P(h) ]    (1)

• This equation can be interpreted as a statement that short hypotheses are preferred, assuming a particular representation scheme for encoding hypotheses and data.
Example: Information Theory

• Consider the problem of designing a code to transmit messages drawn at random, where the probability of encountering message i is pi.
• According to information theory, we are interested in the code that minimizes the expected number of bits we must transmit in order to encode a message drawn at random.
• To minimize the expected code length we should assign shorter codes to messages that are more probable.
• Shannon and Weaver (1949) showed that the optimal code (i.e., the code that minimizes the expected message length) assigns −log2 pi bits to encode message i.
• The number of bits required to encode message i using code C (the description length of message i with respect to C) is denoted LC(i).
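A tiny check of the −log2 pi rule, using a hypothetical three-message distribution:

```python
import math

# Optimal code lengths -log2(p_i) for a made-up message distribution.
probs = {"a": 0.5, "b": 0.25, "c": 0.25}
lengths = {m: -math.log2(p) for m, p in probs.items()}   # a: 1 bit, b: 2, c: 2
expected = sum(p * lengths[m] for m, p in probs.items()) # expected length = entropy
print(lengths, expected)   # expected length is 1.5 bits
```

The most probable message gets the shortest code, exactly as the argument above requires.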
Interpreting the equation

• −log2 P(h) is the description length of h under the optimal encoding for the hypothesis space H: LCH(h) = −log2 P(h), where CH is the optimal code for hypothesis space H.

• −log2 P(D|h) is the description length of the training data D given hypothesis h: LCD|h(D|h) = −log2 P(D|h), where CD|h is the optimal code for describing data D assuming that both the sender and receiver know the hypothesis h.

Rewriting Equation (1) shows that hMAP is the hypothesis h that minimizes the sum of the description length of the hypothesis plus the description length of the data given the hypothesis:

hMAP = argmin_{h∈H} [ LCH(h) + LCD|h(D|h) ]

where CH and CD|h are the optimal encodings for H and for D given h.
The Minimum Description Length (MDL) principle recommends choosing the hypothesis that minimizes the sum of these two description lengths.

Minimum Description Length principle:

hMDL = argmin_{h∈H} [ LC1(h) + LC2(D|h) ]

The above analysis shows that if we choose C1 to be the optimal encoding of hypotheses CH, and C2 to be the optimal encoding CD|h, then hMDL = hMAP.

• The MDL principle provides a way of trading off hypothesis complexity against the number of errors committed by the hypothesis.

• A short imperfect hypothesis may be selected over a long perfect hypothesis.
Bayes Optimal Classifier

• So far we have considered the question: "What is the most probable hypothesis given the training data?"

• But the question that is often of most significance is: "What is the most probable classification of the new instance given the training data?"
Example
• To develop some intuition, consider a hypothesis space containing three hypotheses, h1, h2, and h3.
• Suppose that the posterior probabilities of these hypotheses given the training data are 0.4, 0.3, and 0.3 respectively.
• Thus, h1 is the MAP hypothesis.
• Suppose a new instance x is encountered, which is classified positive by h1 but negative by h2 and h3.
• Taking all hypotheses into account, the probability that x is positive is 0.4 (the probability associated with h1), and the probability that it is negative is therefore 0.6.
• The most probable classification (negative) in this case is different from the classification generated by the MAP hypothesis.


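The weighted vote in this example can be written out as a short sketch:

```python
# Bayes optimal vote for the three-hypothesis example above.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}     # P(h|D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}     # each h's label for x

# P(v|D) = sum over h of P(v|h) P(h|D); here P(v|h) is 1 for h's own
# prediction and 0 otherwise, so each h simply adds its posterior to one side.
p_v = {}
for h, p in posteriors.items():
    v = predictions[h]
    p_v[v] = p_v.get(v, 0.0) + p

best = max(p_v, key=p_v.get)
print(p_v, best)   # "-" wins with probability 0.6, despite h1 being the MAP hypothesis
```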
Bayes Optimal Classifier contd…
If the possible classification of the new example can take on any value vj from some set V, then the probability P(vj|D) that the correct classification for the new instance is vj is

P(vj|D) = Σ_{hi∈H} P(vj|hi) P(hi|D)

The optimal classification of the new instance is the value vj for which P(vj|D) is maximum:

argmax_{vj∈V} Σ_{hi∈H} P(vj|hi) P(hi|D)

Any system that classifies new instances according to this equation is called a Bayes optimal classifier.
Bayes Optimal Classifier contd…
• No other classification method using the same hypothesis space and same prior knowledge can outperform this method on average.

• This method maximizes the probability that the new instance is classified correctly.

• In learning boolean concepts using version spaces as in the earlier section, the Bayes optimal classification of a new instance is obtained by taking a weighted vote among all members of the version space, with each candidate hypothesis weighted by its posterior probability.

• The predictions it makes can correspond to a hypothesis not contained in H: the labelling (classification) of instances defined in this way need not correspond to the instance labelling of any single hypothesis h from H.
Naïve Bayes Classifier
• The naive Bayes learner is a highly practical Bayesian learning method.

• Its performance has been shown to be comparable to that of neural network and decision tree learning.

• The naive Bayes classifier applies to learning tasks where each instance x is described by a conjunction of attribute values and where the target function f(x) can take on any value from some finite set V.

• The learner is asked to predict the target value, or classification, for a new instance given the attribute values <a1, a2, . . ., an> that describe it.

• The Bayesian approach to classifying the new instance is to assign the most probable target value, vMAP, given the attribute values:

vMAP = argmax_{vj∈V} P(vj | a1, a2, . . ., an) = argmax_{vj∈V} P(a1, a2, . . ., an | vj) P(vj)

1. P(vj) is obtained simply by counting the frequency with which each target value vj occurs in the training data.
2. The naive Bayes classifier is based on the simplifying assumption that the attribute values are conditionally independent given the target value. In other words, given the target value of the instance, the probability of observing the conjunction <a1, a2, . . ., an> is just the product of the probabilities of the individual attributes:

P(a1, a2, . . ., an | vj) = Π_i P(ai | vj)

giving the naive Bayes classifier:

vNB = argmax_{vj∈V} P(vj) Π_i P(ai | vj)
Naïve Bayes Classifier: Example-1
• To calculate vNB for the new instance <Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong>, we require 10 probabilities that can be estimated from the training data.

First, the probabilities of the different target values can easily be estimated from their frequencies over the 14 training examples:
P(PlayTennis = yes) = 9/14 = 0.64
P(PlayTennis = no) = 5/14 = 0.36

Estimated conditional probabilities:

Outlook     | yes (9) | no (5) | Total
Sunny       | 2/9     | 3/5    | 5/14
Overcast    | 4/9     | 0/5    | 4/14
Rain        | 3/9     | 2/5    | 5/14

Temperature | yes (9) | no (5) | Total
Hot         | 2/9     | 2/5    | 4/14
Mild        | 4/9     | 2/5    | 6/14
Cool        | 3/9     | 1/5    | 4/14

Humidity    | yes (9) | no (5) | Total
High        | 3/9     | 4/5    | 7/14
Normal      | 6/9     | 1/5    | 7/14

Wind        | yes (9) | no (5) | Total
Weak        | 6/9     | 2/5    | 8/14
Strong      | 3/9     | 3/5    | 6/14

P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.64 × 2/9 × 3/9 × 3/9 × 3/9 = 0.0053
P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = 0.36 × 3/5 × 1/5 × 4/5 × 3/5 = 0.0206
Normalizing Probabilities

• The two quantities can be normalized to sum to one, giving the conditional probabilities of the target values yes and no, given the observed attribute values.

• For the current example, these probabilities are

P(no) = 0.0206 / (0.0206 + 0.0053) = 0.795
P(yes) = 0.0053 / (0.0206 + 0.0053) = 0.205
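The PlayTennis computation above, carried out as a sketch with the table counts (9 yes, 5 no over 14 examples):

```python
# Naive Bayes for the instance <Sunny, Cool, High, Strong> using the
# PlayTennis table counts from the example above.
p_yes = 9 / 14
p_no = 5 / 14

# P(v) * product of P(ai|v) for each attribute value of the instance
yes_score = p_yes * (2/9) * (3/9) * (3/9) * (3/9)   # about 0.0053
no_score = p_no * (3/5) * (1/5) * (4/5) * (3/5)     # about 0.0206

prediction = "no" if no_score > yes_score else "yes"
p_no_normalized = no_score / (yes_score + no_score)  # about 0.795
print(prediction, round(p_no_normalized, 3))
```

The classifier predicts PlayTennis = no, matching the hand calculation.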
Naïve Bayes Classifier: Example-2

Probabilities (partially filled; the remaining entries are to be estimated from the training data):

Weather Condition | Accident = yes | Accident = no | Total
Rain              | 1/5            | 2/5           | 3/10
Snow              | 2/5            | 1/5           | 3/10
Clear             | 2/5            | 2/5           | 4/10

Road Condition (Good, Bad, Average), Traffic Condition (High, Normal, Light) and Engine Problem (Yes, No) have corresponding tables of P(value | Accident = yes) and P(value | Accident = no).

Compute:
P(Yes) · P(snow|yes) · P(bad|yes) · P(high|yes) · P(no|yes) =
P(No) · P(snow|No) · P(bad|No) · P(high|No) · P(no|No) =

Predict for:
1. <Snow, Bad, High, No> = ?
2. <Rain, Good, Light, No> = ?
3. <Snow, Average, High, Yes> = ?
4. <Clear, Average, Normal, No> = ?

Also estimate the conditional probabilities of each class.
Naïve Bayes Classifier: Example-3

Sl. No. | Chills | Runny Nose | Headache | Fever | Flu
1       | Y      | N          | Mild     | Y     | N
2       | Y      | Y          | No       | N     | Y
3       | Y      | N          | Strong   | Y     | Y
4       | N      | Y          | Mild     | Y     | Y
5       | N      | N          | No       | N     | N
6       | N      | Y          | Strong   | Y     | Y
7       | N      | Y          | Strong   | N     | N
8       | Y      | Y          | Mild     | Y     | Y

Predict for <Chills = N, Runny Nose = Y, Headache = No, Fever = Y>.
Estimating Probabilities
• Up to this point we have estimated probabilities by the fraction of times an event is observed to occur over the total number of opportunities: nc/n.

• For example, in Example-1 we estimated P(Wind = strong | PlayTennis = no) by the fraction nc/n, where n = 5 is the number of examples with PlayTennis = no and nc = 3 is the number of these (among the 5) for which Wind = strong.

• While this observed fraction provides a good estimate of the probability in many cases, it provides poor estimates when nc is very small.

• In the extreme, when nc = 0, the estimated probability is 0.
Estimating Probabilities contd…

This raises two difficulties.

• First, nc/n produces a biased underestimate of the probability.

• Second, when this probability estimate is zero, the term will dominate the Bayes classifier if a future query contains Wind = strong. The reason is that the quantity calculated by the naive Bayes classifier requires multiplying all the other probability terms by this zero value.
m-Estimate of Probability
• To avoid this difficulty we can adopt a Bayesian approach to estimating the probability, using the m-estimate defined as follows:

(nc + m·p) / (n + m)

• Here, nc and n are defined as before.
• p is our prior estimate of the probability we wish to determine.
• A typical method for choosing p in the absence of other information is to assume uniform priors; that is, if an attribute has k possible values we set p = 1/k (in the example p = 0.5, because the Wind attribute takes two values).
• m is a constant called the equivalent sample size, which determines how heavily to weight p relative to the observed data.
• Note that if m is zero, the m-estimate is equivalent to the simple fraction nc/n.
• If both n and m are nonzero, then the observed fraction and the prior p will be combined according to the weight m.
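The m-estimate is a one-line function. The example values below follow the Wind = strong | PlayTennis = no case discussed above (n = 5, nc = 3, p = 1/2):

```python
# m-estimate of probability: (n_c + m*p) / (n + m)
def m_estimate(n_c, n, p, m):
    return (n_c + m * p) / (n + m)

print(m_estimate(3, 5, 0.5, 0))   # m = 0: plain fraction 3/5 = 0.6
print(m_estimate(3, 5, 0.5, 2))   # m = 2: blends observed fraction with prior p
```

As m grows, the estimate is pulled toward the prior p; as more data arrives (n grows), the observed fraction dominates.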
An Example: Learning To Classify Text
• Practical application of Naïve Bayes classifier

• Learn which news articles are of interest

• Learn to classify web pages by topic

• General algorithm for learning to classify text, based on the naive Bayes classifier.

• Consider an instance space X consisting of all possible text documents (i.e., all possible
strings of words and punctuation of all possible lengths).
• The target function classifies documents as interesting or uninteresting to a particular person, using the target values like and dislike to indicate these two classes.
• The task is to learn from training examples to predict the target value for subsequent text documents.
Sample text to classify

Our approach to representing arbitrary text documents is disturbingly simple: given a text document, such as this paragraph, we define an attribute for each word position in the document and define the value of that attribute to be the English word found in that position. Thus, the current paragraph would be described by 111 attribute values, corresponding to the 111 word positions. The value of the first attribute is the word "our," the value of the second attribute is the word "approach," and so on. Notice that long text documents will require a larger number of attributes than short documents. As we shall see, this will not cause us any trouble.

• Assume a set of 700 training documents classified as dislike and another 300 classified as like.
• We are now given a new document, such as the paragraph above (described by 111 attribute values), and asked to classify it.
Assumptions
• The independence assumption states, in this setting, that the word probabilities for one text position are independent of the words that occur in other positions, given the document classification vj.
• We shall also assume the probability of encountering a specific word wk is independent of the specific word position being considered. The word probabilities are then estimated as

P(wk | vj) = (nk + 1) / (n + |Vocabulary|)

• where n is the total number of word positions in all training examples whose target value is vj,
• nk is the number of times word wk is found among these n word positions,
• |Vocabulary| is the total number of distinct words (and other tokens) found within the training data.
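This estimate is the m-estimate with uniform priors p = 1/|Vocabulary| and m = |Vocabulary|, and can be sketched directly (counts below are invented):

```python
# Smoothed word-probability estimate P(wk|vj) = (nk + 1) / (n + |Vocabulary|).
def word_prob(nk, n, vocab_size):
    return (nk + 1) / (n + vocab_size)

# Hypothetical counts: a word seen 3 times among 100 positions, vocabulary of 50 words
print(word_prob(3, 100, 50))
# An unseen word still gets nonzero probability, so products never collapse to 0
print(word_prob(0, 100, 50))
```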
Naive Bayes algorithms for learning and classifying text
K-Means Clustering Algorithm (not in syllabus)

K-means clustering is a popular method for grouping data by assigning observations to clusters based on proximity to the cluster's center.
Solved Example

The worked example iterates cluster assignment with indicator variables zi,1 and zi,2 (1 if the record belongs to that cluster, 0 otherwise):

Iteration 1 assignments (zi,1, zi,2): (1,0), (1,0), (1,0), (0,1), (0,1), (0,1)
Iteration 2 assignments (zi,1, zi,2): (1,0), (1,0), (0,1), (0,1), (0,1), (0,1)
Iteration 3: each record is compared against the updated centroids C1 = (1.25, 1.5) and C2 = (3.9, 5.1) and assigned to the closer one.

The new clusters are C1 { }, C2 { }.
The EM Algorithm
• The EM algorithm is a widely used approach to learning in the presence of unobserved variables.
• It is the basis for many unsupervised clustering algorithms.
• Consider a problem in which the data D is a set of instances generated by a probability distribution that is a mixture of k distinct Normal distributions.
• The learning task is to output a hypothesis that describes the means of each of the k distributions.
• We seek a maximum likelihood hypothesis for these means: a hypothesis h that maximizes p(D|h).

• In the two-distribution case, the full description of each instance is the triple (xi, zi1, zi2), where xi is the observed value of the ith instance and zi1 and zi2 indicate which of the two Normal distributions was used to generate the value xi.
• In particular, zij has the value 1 if xi was created by the jth Normal distribution and 0 otherwise.
• Here xi is the observed variable in the description of the instance, and zi1 and zi2 are hidden variables.

• The EM algorithm first initializes the hypothesis to h = <μ1, μ2>, where μ1 and μ2 are arbitrary initial values.
• It then iteratively re-estimates h by repeating the following two steps until the procedure converges to a stationary value for h:

Step 1 (Estimation): Calculate the expected value E[zij] of each hidden variable zij, assuming the current hypothesis h = <μ1, μ2> holds:

E[zij] = p(x = xi | μ = μj) / Σ_{n=1..2} p(x = xi | μ = μn)

Step 2 (Maximization): Calculate a new maximum likelihood hypothesis h' = <μ1', μ2'>, assuming each hidden variable zij takes its expected value E[zij] computed in Step 1:

μj' = Σ_{i=1..m} E[zij] xi / Σ_{i=1..m} E[zij]
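The two steps above can be sketched for the two-means case. The data and starting values below are made up, with a known, equal variance for both components:

```python
import math

# Two-means EM sketch: estimate mu1, mu2 of a mixture of two Normals with
# known, equal variance sigma. Illustrative data, not from the slides.
def em_two_means(xs, mu1, mu2, sigma=1.0, iters=50):
    for _ in range(iters):
        # E-step: E[z_i1] for each x under the current hypothesis <mu1, mu2>
        e1 = []
        for x in xs:
            p1 = math.exp(-((x - mu1) ** 2) / (2 * sigma ** 2))
            p2 = math.exp(-((x - mu2) ** 2) / (2 * sigma ** 2))
            e1.append(p1 / (p1 + p2))
        # M-step: each mean becomes the E[z]-weighted average of the data
        mu1 = sum(e * x for e, x in zip(e1, xs)) / sum(e1)
        mu2 = sum((1 - e) * x for e, x in zip(e1, xs)) / sum(1 - e for e in e1)
    return mu1, mu2

xs = [0.9, 1.1, 1.0, 4.9, 5.1, 5.0]   # two tight clusters near 1 and 5
mu1, mu2 = em_two_means(xs, mu1=0.0, mu2=6.0)
print(mu1, mu2)   # converges to roughly 1.0 and 5.0
```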
General Statement of EM Algorithm
• More generally, the EM algorithm can be applied in many settings where we wish to estimate
some set of parameters θ that describe an underlying probability distribution, given only the
observed portion of the full data produced by this distribution.

• In the above two-means example the parameters of interest were θ =<µ1, µ2>, and the full data
were the triples (xi,zi1,zi2) of which only the xi were observed.

• In general, let X = {x1, . . . , xm} denote the observed data in a set of m independently drawn instances, let Z = {z1, . . . , zm} denote the unobserved data in these same instances, and let Y = X ∪ Z denote the full data.

• The EM algorithm searches for the maximum likelihood hypothesis h’ by seeking the h' that
maximizes

E[ln P(Y |h’)]


General Statement of EM Algorithm contd…
• Let us define a function Q(h'|h) that gives E[ln P(Y|h')] as a function of h', under the assumption that θ = h and given the observed portion X of the full data Y:

Q(h'|h) = E[ln P(Y|h') | h, X]

• In its general form, the EM algorithm repeats the following two steps until convergence:

Estimation (E) step: Calculate Q(h'|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y.

Maximization (M) step: Replace hypothesis h by the hypothesis h' that maximizes this Q function: h ← argmax_{h'} Q(h'|h)
Bayesian Belief Networks
• The naïve Bayes assumption of conditional independence is too restrictive, but inference is intractable without some such assumptions.

• Bayesian belief networks describe conditional independence among subsets of variables.

• They allow combining prior knowledge about (in)dependence among variables with observed training data.

• They are also called Bayes nets.

• A Bayesian belief network describes the joint probability distribution for a set of variables.
Bayesian belief networks contd…
• A Bayesian network, or belief network, shows conditional probability and causality relationships
between variables.

• The probability of an event occurring given that another event has already occurred is called
a conditional probability.

• The probabilistic model is described qualitatively by a directed acyclic graph, or DAG.

• The vertices of the graph, which represent variables, are called nodes. The nodes are
represented as circles containing the variable name.
• The connections between the nodes are called arcs, or edges. The edges are drawn as arrows between the nodes and represent dependence between the variables.

• An arc between a pair of nodes indicates that one node is the parent of the other, so no independence assumption is made between them. Independence assumptions are implied in Bayesian networks by the absence of a link.
Example DAG

• The node where the arc originates is called the parent, while the node where the arc ends is called the child.
• In this case, A is a parent of C, and C is a child of A.
• Nodes that can be reached from other nodes are called descendants.
• Nodes that lead a path to a specific node are called ancestors.
• For example, C and E are descendants of A, and A and C are ancestors of E.
• There are no loops in Bayesian networks, since no child can be its own ancestor or descendant.
Bayesian belief networks contd…
• Bayesian networks will generally also include a set of probability tables, stating the probabilities
for the true/false values of the variables.

• The main point of Bayesian Networks is to allow for probabilistic inference to be performed.

• This means that the probability of each value of a node in the Bayesian network can be computed
when the values of the other variables are known.

• Also, because independence among the variables is easy to recognize since conditional
relationships are clearly defined by a graph edge, not all joint probabilities in the Bayesian system
need to be calculated in order to make a decision.
Conditional Independence
• Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if

(∀ xi, yj, zk) P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)

More compactly, we write P(X|Y,Z) = P(X|Z).
• This definition of conditional independence can be extended to sets of variables as well. We say that the set of variables X1 . . . Xl is conditionally independent of the set of variables Y1 . . . Ym given the set of variables Z1 . . . Zn if

P(X1 . . . Xl | Y1 . . . Ym, Z1 . . . Zn) = P(X1 . . . Xl | Z1 . . . Zn)
Bayesian Belief Network Representation
• Network represents a set of conditional independence assumptions

Each node is asserted to be conditionally independent of its non-descendants, given its immediate
predecessors

• A Bayesian network represents the joint probability distribution by specifying a set of conditional
independence assumptions (represented by a directed acyclic graph), together with sets of local
conditional probabilities.
Bayesian Belief Network Representation contd…

• Each variable in the joint space is represented by a node in the Bayesian network.

• For each variable two types of information are specified.

• First, the network arcs represent the assertion that the variable is conditionally independent of its
nondescendants in the network given its immediate predecessors in the network. We say X is a
descendant of, Y if there is a directed path from Y to X.
• Second, a conditional probability table is given for each variable, describing the probability
distribution for that variable given the values of its immediate predecessors.
• The joint probability for any desired assignment of values (y1, . . . , yn) to the tuple of network variables (Y1 . . . Yn) can be computed by the formula

P(y1, . . . , yn) = Π_{i=1..n} P(yi | Parents(Yi))

where Parents(Yi) denotes the set of immediate predecessors of Yi in the network.
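This product formula can be sketched for a hypothetical two-node network A → B; the probability-table numbers below are invented for illustration:

```python
# Joint probability P(a, b) = P(a) * P(b|a) for a two-node network A -> B,
# with made-up conditional probability tables.
p_a = {True: 0.2, False: 0.8}            # P(A)
p_b_true_given_a = {True: 0.9, False: 0.1}  # P(B=True | A)

def joint(a, b):
    pb = p_b_true_given_a[a] if b else 1 - p_b_true_given_a[a]
    return p_a[a] * pb

# The four joint assignments form a full distribution and must sum to 1
total = sum(joint(a, b) for a in (True, False) for b in (True, False))
print(joint(True, True), total)
```

The same product extends node by node to larger networks such as the Storm/Campfire example that follows.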
Example with probability table for Campfire

• The network nodes and arcs represent the assertion that Campfire is conditionally independent of its nondescendants Lightning and Thunder, given its immediate parents Storm and BusTourGroup.
• This means that once we know the values of the variables Storm and BusTourGroup, the variables Lightning and Thunder provide no additional information about Campfire.
• The set of local conditional probability tables for all the variables, together with the set of conditional independence assumptions described by the network, describes the full joint probability distribution for the network.
• Bayesian belief networks thus provide a convenient way to represent causal knowledge, such as the fact that Lightning causes Thunder.

Full joint probability distribution for the network: Example
Inference
• We might wish to use a Bayesian network to infer the value of some target
variable (e.g., ForestFire) given the observed values of the other variables.

• This inference step can be straightforward if values for all of the other
variables in the network are known exactly.

• In the more general case we may wish to infer the probability distribution for
some variable (e.g., ForestFire) given observed values for only a subset of
the other variables (e.g., Thunder and BusTourGroup may be the only
observed values available).
