Bayesian Learning in Machine Learning

This document discusses Bayesian learning methods, emphasizing their importance in machine learning for calculating probabilities of hypotheses and understanding various learning algorithms. It covers concepts such as Bayes theorem, maximum likelihood estimation, and the maximum a posteriori hypothesis, along with practical difficulties in applying Bayesian methods. The document also explores the relationship between Bayes theorem and concept learning, as well as the implications of Bayesian analysis in learning continuous-valued target functions.

Uploaded by

gymstartup4

MODULE 3

Chapter 6
Bayesian Learning and Unsupervised Learning: Bayes Theorem

Maximum Likelihood Estimation

Bayes Optimal Classifier

Naive Bayes Classifier

Bayesian Belief Networks

Chapter-6 (6.1-6.7,6.9,6.11)
Prof. K. Deepa Shree
INTRODUCTION
Bayesian learning methods are relevant to the study of machine learning for two
different reasons.
• First, Bayesian learning algorithms that calculate explicit probabilities for
hypotheses, such as the naive Bayes classifier, are among the most practical
approaches to certain types of learning problems.
• Second, they provide a useful perspective for understanding
many learning algorithms that do not explicitly manipulate probabilities.

Prof. K. Deepa Shree DEEPAK D, ASST. PROF., DEPT. OF CSE, CANARA ENGG. COLLEGE 3
Features of Bayesian Learning Methods
• Each observed training example can incrementally decrease or increase the estimated
probability that a hypothesis is correct. This provides a more flexible approach to
learning than algorithms that completely eliminate a hypothesis if it is found to be
inconsistent with any single example.
• Prior knowledge can be combined with observed data to determine the final
probability of a hypothesis. In Bayesian learning, prior knowledge is provided by
asserting (1) a prior probability for each candidate hypothesis, and (2) a probability
distribution over observed data for each possible hypothesis.
• Bayesian methods can accommodate hypotheses that make probabilistic predictions.
• New instances can be classified by combining the predictions of multiple hypotheses,
weighted by their probabilities.
• Even in cases where Bayesian methods prove computationally intractable, they can
provide a standard of optimal decision making against which other practical methods
can be measured.

Practical difficulty in applying Bayesian methods
• One practical difficulty in applying Bayesian methods is that they typically require
initial knowledge of many probabilities. When these probabilities are not known
in advance they are often estimated based on background knowledge, previously
available data, and assumptions about the form of the underlying distributions.

• A second practical difficulty is the significant computational cost required to
determine the Bayes optimal hypothesis in the general case. In certain specialized
situations, this computational cost can be significantly reduced.

BAYES THEOREM
Bayes theorem provides a way to calculate the probability of a hypothesis based on
its prior probability, the probabilities of observing various data given the hypothesis,
and the observed data itself.
Notations
• P(h) prior probability of h, reflects any background knowledge about the chance
that h is correct
• P(D) prior probability of D, probability that D will be observed
• P(D|h) probability of observing D given a world in which h holds
• P(h|D) posterior probability of h, reflects confidence that h holds after D has been
observed

Bayes theorem is the cornerstone of Bayesian learning methods because it provides a
way to calculate the posterior probability P(h|D) from the prior probability P(h),
together with P(D) and P(D|h):

\[ P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)} \]

P(h|D) increases with P(h) and with P(D|h) according to Bayes theorem.
P(h|D) decreases as P(D) increases, because the more probable it is that D will be
observed independent of h, the less evidence D provides in support of h.

Maximum a Posteriori (MAP) Hypothesis

• In many learning scenarios, the learner considers some set of candidate hypotheses
H and is interested in finding the most probable hypothesis h ∈H given the
observed data D. Any such maximally probable hypothesis is called a maximum a
posteriori (MAP) hypothesis.
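• Using Bayes theorem to calculate the posterior probability of each candidate hypothesis,
hMAP is a MAP hypothesis provided

\[ h_{MAP} \equiv \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\, P(h) \]

• In the final step P(D) can be dropped, because it is a constant independent of h.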

Maximum Likelihood (ML) Hypothesis

In some cases, it is assumed that every hypothesis in H is equally probable a priori
(P(hi) = P(hj) for all hi and hj in H).
In this case the expression for hMAP, the argmax over h ∈ H of P(D|h) P(h), can be
simplified: we need only consider the term P(D|h) to find the most probable hypothesis.

P(D|h) is often called the likelihood of the data D given h, and any hypothesis that
maximizes P(D|h) is called a maximum likelihood (ML) hypothesis:

\[ h_{ML} \equiv \arg\max_{h \in H} P(D \mid h) \]
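
To make the distinction between the two definitions concrete, here is a minimal Python sketch. It assumes a hypothetical hypothesis space of three coin biases, with made-up priors and made-up flips; none of these values come from the text. It selects the hypothesis that maximizes P(D|h) P(h) (MAP) and the one that maximizes P(D|h) alone (ML).

```python
from math import prod  # requires Python 3.8+

# Hypothetical setup (assumed values): each hypothesis is a coin bias P(heads),
# mapped to its prior probability P(h).
priors = {0.3: 0.1, 0.5: 0.8, 0.7: 0.1}
flips = [1, 1, 1, 0, 1, 1]  # assumed observations, 1 = heads, 0 = tails


def likelihood(bias, observations):
    """P(D|h) for independent flips under the given bias."""
    return prod(bias if f == 1 else 1.0 - bias for f in observations)


# ML hypothesis: maximize P(D|h) only.
h_ml = max(priors, key=lambda h: likelihood(h, flips))

# MAP hypothesis: maximize P(D|h) * P(h); P(D) is the same for every h.
h_map = max(priors, key=lambda h: likelihood(h, flips) * priors[h])

# With a uniform prior, the MAP choice reduces to the ML choice.
uniform = {h: 1.0 / len(priors) for h in priors}
h_map_uniform = max(uniform, key=lambda h: likelihood(h, flips) * uniform[h])

print("ML:", h_ml, "MAP:", h_map, "MAP under uniform prior:", h_map_uniform)
```

With these assumed numbers the data favour bias 0.7, but the strong prior on 0.5 makes 0.5 the MAP hypothesis; under a uniform prior the two choices coincide.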

Example
Consider a medical diagnosis problem in which there are two alternative hypotheses
• The patient has a particular form of cancer (denoted by cancer)
• The patient does not (denoted by ¬ cancer)

The available data is from a particular laboratory test with two possible outcomes: +
(positive) and − (negative).

• Suppose a new patient is observed for whom the lab test returns a positive (+)
result.
• Should we diagnose the patient as having cancer or not?
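
The question can be worked through numerically with Bayes theorem. The sketch below assumes illustrative values: a prior P(cancer) = 0.008, P(+|cancer) = 0.98 and P(−|¬cancer) = 0.97. These three numbers are assumptions chosen for the sketch, not figures stated in the text above.

```python
# Illustrative values only (assumptions for this sketch).
p_cancer = 0.008               # prior P(cancer)
p_pos_given_cancer = 0.98      # P(+ | cancer)
p_neg_given_no_cancer = 0.97   # P(- | not cancer)

p_no_cancer = 1.0 - p_cancer
p_pos_given_no_cancer = 1.0 - p_neg_given_no_cancer

# Unnormalized posteriors P(+|h) P(h) for the two hypotheses.
score_cancer = p_pos_given_cancer * p_cancer
score_no_cancer = p_pos_given_no_cancer * p_no_cancer

# P(+) by the theorem of total probability; dividing by it normalizes.
p_pos = score_cancer + score_no_cancer
post_cancer = score_cancer / p_pos
post_no_cancer = score_no_cancer / p_pos

# The MAP hypothesis is the one with the larger unnormalized posterior.
h_map = "cancer" if score_cancer > score_no_cancer else "not cancer"

print(f"P(cancer | +)     = {post_cancer:.3f}")
print(f"P(not cancer | +) = {post_no_cancer:.3f}")
print(f"MAP hypothesis: {h_map}")
```

Under these assumed values the posterior probability of cancer rises well above its prior after a positive test, yet the MAP hypothesis is still ¬cancer, because the disease is rare and false positives on the large healthy population outweigh true positives.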

BAYES THEOREM AND CONCEPT LEARNING
What is the relationship between Bayes theorem and the problem of concept
learning?

Bayes theorem provides a principled way to calculate the posterior probability
of each hypothesis given the training data, so we can use it as the basis for a
straightforward learning algorithm that calculates the probability of each possible
hypothesis and then outputs the most probable one.

Brute-Force Bayes Concept Learning
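We can design a straightforward concept learning algorithm to output the maximum
a posteriori hypothesis, based on Bayes theorem, as follows:

BRUTE-FORCE MAP LEARNING algorithm
1. For each hypothesis h in H, calculate the posterior probability P(h|D) = P(D|h) P(h) / P(D).
2. Output the hypothesis hMAP with the highest posterior probability, that is, the h in H that maximizes P(h|D).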

To specify a learning problem for the BRUTE-FORCE MAP LEARNING
algorithm, we must specify what values are to be used for P(h) and for P(D|h).
Let us choose P(h) and P(D|h) to be consistent with the following assumptions:
• The training data D is noise free (i.e., di = c(xi))
• The target concept c is contained in the hypothesis space H
• We have no a priori reason to believe that any hypothesis is more probable than any other.

What values should we specify for P(h)?
• Given no prior knowledge that one hypothesis is more likely than another, it is
reasonable to assign the same prior probability to every hypothesis h in H.
• Assume the target concept is contained in H and require that these prior
probabilities sum to 1. Together these constraints give

\[ P(h) = \frac{1}{|H|} \quad \text{for all } h \in H \]

What choice shall we make for P(D|h)?

• P(D|h) is the probability of observing the target values D = (d1 . . . dm) for the
fixed set of instances (x1 . . . xm), given a world in which hypothesis h holds.
• Since we assume noise-free training data, the probability of observing
classification di given h is just 1 if di = h(xi) and 0 if di ≠ h(xi). Therefore,

\[ P(D \mid h) = \begin{cases} 1 & \text{if } d_i = h(x_i) \text{ for all } d_i \in D \\ 0 & \text{otherwise} \end{cases} \]

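Given these choices for P(h) and for P(D|h) we now have a fully-defined problem for
the above BRUTE-FORCE MAP LEARNING algorithm.
As a first step, we determine the posterior probability P(h|D) of each hypothesis using Bayes theorem, considering two cases:
• If h is inconsistent with D, then P(D|h) = 0, so P(h|D) = 0 · P(h) / P(D) = 0.
• If h is consistent with D, then P(D|h) = 1, so P(h|D) = (1 · 1/|H|) / P(D) = (1/|H|) / (|VSH,D| / |H|) = 1 / |VSH,D|, because under these assumptions P(D) = |VSH,D| / |H|.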

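To summarize, Bayes theorem implies that the posterior probability P(h|D) under our
assumed P(h) and P(D|h) is

\[ P(h \mid D) = \begin{cases} \dfrac{1}{|VS_{H,D}|} & \text{if } h \text{ is consistent with } D \\ 0 & \text{otherwise} \end{cases} \]

where |VSH,D| is the number of hypotheses from H consistent with D (the size of the version space of H with respect to D).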
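
The brute-force procedure and the resulting posterior can be sketched in a few lines of Python. The hypothesis space of threshold concepts and the two training examples below are hypothetical, chosen only to keep the enumeration small; the uniform prior and the 0/1 likelihood follow the assumptions stated above.

```python
from fractions import Fraction

# Hypothetical hypothesis space: threshold concepts h_t(x) = 1 iff x >= t.
thresholds = range(6)            # H = {h_0, ..., h_5}
train = [(1, 0), (4, 1)]         # assumed noise-free examples (x_i, d_i)


def predict(t, x):
    """Classification of instance x by hypothesis h_t."""
    return 1 if x >= t else 0


prior = Fraction(1, len(thresholds))   # uniform prior P(h) = 1/|H|


def likelihood(t):
    """P(D|h): 1 if h_t agrees with every training example, else 0."""
    return 1 if all(predict(t, x) == d for x, d in train) else 0


# P(D) = sum over h of P(D|h) P(h), which equals |VS| / |H| here.
p_data = sum(likelihood(t) * prior for t in thresholds)

# Step 1: posterior of every hypothesis via Bayes theorem.
posterior = {t: likelihood(t) * prior / p_data for t in thresholds}

# Step 2: any hypothesis with maximal posterior is a MAP hypothesis.
version_space = [t for t in thresholds if likelihood(t) == 1]
h_map = max(posterior, key=posterior.get)

print("version space:", version_space)
print("posteriors:", {t: str(p) for t, p in posterior.items()})
```

Every consistent threshold receives posterior 1/|VS| and every inconsistent one receives 0, so each member of the version space is a MAP hypothesis. The enumeration costs time linear in |H|, which is why the approach is only practical for small hypothesis spaces.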

The Evolution of Probabilities
Associated with Hypotheses
• Figure (a): initially, all hypotheses have the same probability.
• Figures (b) and (c): as training data accumulates, the posterior probability of inconsistent
hypotheses becomes zero, while the total probability, which still sums to 1, is shared equally
among the remaining consistent hypotheses.

MAP Hypotheses and Consistent Learners
A learning algorithm is a consistent learner if it outputs a hypothesis that commits zero errors over
the training examples.
Every consistent learner outputs a MAP hypothesis, if we assume a uniform prior probability
distribution over H (P(hi) = P(hj) for all i, j), and deterministic, noise free training data (P(D|h) =1 if
D and h are consistent, and 0 otherwise).

Example:
• Because FIND-S outputs a consistent hypothesis, it outputs a MAP hypothesis under the probability
distributions P(h) and P(D|h) defined above.
• Are there other probability distributions for P(h) and P(D|h) under which FIND-S outputs MAP
hypotheses? Yes.
• Because FIND-S outputs a maximally specific hypothesis from the version space, its output
hypothesis will be a MAP hypothesis relative to any prior probability distribution that favours more
specific hypotheses.

• The Bayesian framework is a way to characterize the behaviour of learning algorithms.
• By identifying probability distributions P(h) and P(D|h) under which the output is
an optimal hypothesis, the implicit assumptions of the algorithm (its inductive bias)
can be characterized.
• Inductive inference is modelled by an equivalent probabilistic reasoning system
based on Bayes theorem.

MAXIMUM LIKELIHOOD AND LEAST-SQUARED
ERROR HYPOTHESES
Consider the problem of learning a continuous-valued target function, as in neural
network learning, linear regression, and polynomial curve fitting.

A straightforward Bayesian analysis will show that, under certain assumptions, any
learning algorithm that minimizes the squared error between the output hypothesis
predictions and the training data will output a maximum likelihood (ML) hypothesis.

Learning A Continuous-Valued Target Function
• Learner L considers an instance space X and a hypothesis space H consisting of some class of real-
valued functions defined over X, i.e., (∀ h ∈ H)[ h : X → R] and training examples of the form
<xi,di>
• The problem faced by L is to learn an unknown target function f : X → R
• A set of m training examples is provided, where the target value of each example is corrupted by
random noise drawn according to a Normal probability distribution with zero mean (di = f(xi) + ei)
• Each training example is a pair of the form (xi ,di ) where di = f (xi ) + ei .
– Here f(xi) is the noise-free value of the target function and ei is a random variable representing
the noise.
– It is assumed that the values of the ei are drawn independently and that they are distributed
according to a Normal distribution with zero mean.
• The task of the learner is to output a maximum likelihood hypothesis, or, equivalently, a MAP
hypothesis assuming all hypotheses are equally probable a priori.

Learning A Linear Function
• The target function f corresponds to the solid
line.
• The training examples (xi , di ) are assumed to
have Normally distributed noise ei with zero
mean added to the true target value f (xi ).
• The dashed line corresponds to the hypothesis
hML with least-squared training error, hence the
maximum likelihood hypothesis.
• Notice that the maximum likelihood hypothesis is
not necessarily identical to the correct
hypothesis, f, because it is inferred from only a
limited sample of noisy training data

Before showing why a hypothesis that minimizes the sum of squared errors in this setting is also a
maximum likelihood hypothesis, let us quickly review probability densities and Normal distributions

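Probability Density for continuous variables

\[ p(x_0) \equiv \lim_{\epsilon \to 0} \frac{1}{\epsilon} P(x_0 \le x < x_0 + \epsilon) \]

Lowercase p denotes a probability density, to distinguish it from a finite probability P.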

e: a random noise variable generated by a Normal probability distribution

<x1 . . . xm>: the sequence of instances (as before)
<d1 . . . dm>: the sequence of target values with di = f(xi) + ei

Normal Probability Distribution (Gaussian Distribution)

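A Normal distribution is a smooth, bell-shaped distribution that can be completely
characterized by its mean μ and its standard deviation σ. Its probability density function is

\[ p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]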

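Using the previous definition of hML we have

\[ h_{ML} = \arg\max_{h \in H} p(D \mid h) \]

Assuming the training examples are mutually independent given h, we can write p(D|h) as the product of
the various p(di|h):

\[ h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i \mid h) \]

Given that the noise ei obeys a Normal distribution with zero mean and unknown variance σ², each di
must also obey a Normal distribution with variance σ² centered around the true target value f(xi). Because we are writing the
expression for p(D|h), we assume h is the correct description of f. Hence, µ = f(xi) = h(xi), and

\[ h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(d_i - h(x_i))^2} \]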

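It is common to maximize the less complicated logarithm instead, which is justified because ln p is a
monotonic function of p:

\[ h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} \left[ \ln \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}(d_i - h(x_i))^2 \right] \]

The first term in this expression is a constant independent of h and can therefore be discarded:

\[ h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} -\frac{1}{2\sigma^2}(d_i - h(x_i))^2 \]

Maximizing this negative term is equivalent to minimizing the corresponding positive term:

\[ h_{ML} = \arg\min_{h \in H} \sum_{i=1}^{m} \frac{1}{2\sigma^2}(d_i - h(x_i))^2 \]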

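Finally, discard constants that are independent of h:

\[ h_{ML} = \arg\min_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2 \]

• Thus hML is the hypothesis that minimizes the sum of the squared errors between the observed training values di and the hypothesis predictions h(xi).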

Why is it reasonable to choose the Normal distribution to characterize noise?

• It is a good approximation of many types of noise in physical systems.
• The Central Limit Theorem shows that the sum of a sufficiently large number of independent,
identically distributed random variables (with finite variance) approximately obeys a Normal distribution,
regardless of the distributions of the individual variables.

Note that only noise in the target value is considered, not noise in the attributes describing the instances
themselves.
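
The equivalence between least squares and maximum likelihood under zero-mean Normal noise can be checked numerically. The sketch below is only an illustration under assumed settings: a hypothetical linear target f(x) = 2x + 1, an assumed noise standard deviation of 0.5, and a fixed random seed. It fits a line by closed-form least squares and compares the Gaussian log-likelihood of that fit with the log-likelihood of slightly perturbed lines.

```python
import math
import random

rng = random.Random(0)       # fixed seed so the sketch is repeatable
SIGMA = 0.5                  # assumed noise standard deviation

# Assumed data: d_i = f(x_i) + e_i with f(x) = 2x + 1 and Normal noise e_i.
xs = [i / 10 for i in range(30)]
ds = [2.0 * x + 1.0 + rng.gauss(0.0, SIGMA) for x in xs]


def sse(w0, w1):
    """Sum of squared errors of the line h(x) = w0 + w1 * x."""
    return sum((d - (w0 + w1 * x)) ** 2 for x, d in zip(xs, ds))


def log_likelihood(w0, w1):
    """ln p(D|h) for independent Normal noise with std SIGMA."""
    m = len(xs)
    const = -0.5 * m * math.log(2 * math.pi * SIGMA ** 2)
    return const - sse(w0, w1) / (2 * SIGMA ** 2)


# Closed-form least-squares estimates for a straight line.
m = len(xs)
mean_x = sum(xs) / m
mean_d = sum(ds) / m
w1_ls = (sum((x - mean_x) * (d - mean_d) for x, d in zip(xs, ds))
         / sum((x - mean_x) ** 2 for x in xs))
w0_ls = mean_d - w1_ls * mean_x

best = log_likelihood(w0_ls, w1_ls)
for dw0, dw1 in [(0.1, 0.0), (0.0, -0.1), (0.05, 0.05)]:
    other = log_likelihood(w0_ls + dw0, w1_ls + dw1)
    print(f"shift ({dw0:+.2f}, {dw1:+.2f}): log-likelihood drops by {best - other:.4f}")
```

Because the log-likelihood is a constant minus the squared error divided by 2σ², any line other than the least-squares fit has a strictly lower likelihood. The fitted coefficients still differ from the assumed true values 1 and 2, which reflects the earlier point that hML is inferred from a limited noisy sample.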

MAXIMUM LIKELIHOOD HYPOTHESES FOR
PREDICTING PROBABILITIES

Consider the setting in which we wish to learn a nondeterministic (probabilistic) function
f : X → {0, 1}, which has two discrete output values.

We want a function approximator whose output is the probability that f(x) = 1. In other
words, we want to learn the target function
f' : X → [0, 1] such that f'(x) = P(f(x) = 1)

How can we learn f' using a neural network?

One brute-force approach would be to first collect the observed frequencies of 1's and 0's for each
possible value of x and then train the neural network to output the target frequency for each x.

What criterion should we optimize in order to find a maximum likelihood hypothesis for
f' in this setting?
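• First obtain an expression for P(D|h).
• Assume the training data D is of the form D = {(x1, d1) . . . (xm, dm)}, where di is the observed 0 or
1 value for f(xi).
• Treating both xi and di as random variables, and assuming that each training example is drawn
independently, we can write P(D|h) as

\[ P(D \mid h) = \prod_{i=1}^{m} P(x_i, d_i \mid h) \]

Applying the product rule, and assuming that the probability of encountering instance xi is independent of the hypothesis h,

\[ P(D \mid h) = \prod_{i=1}^{m} P(d_i \mid h, x_i)\, P(x_i) \]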

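The probability P(di|h, xi) of observing di = 1 for instance xi is h(xi), and of observing di = 0 is 1 − h(xi):

\[ P(d_i \mid h, x_i) = \begin{cases} h(x_i) & \text{if } d_i = 1 \\ 1 - h(x_i) & \text{if } d_i = 0 \end{cases} \]

Re-express it in a more mathematically manipulable form, as

\[ P(d_i \mid h, x_i) = h(x_i)^{d_i}\, (1 - h(x_i))^{1 - d_i} \tag{4} \]

Using Equation (4) to substitute for P(di|h, xi) in the product-rule expression for P(D|h), we obtain

\[ P(D \mid h) = \prod_{i=1}^{m} h(x_i)^{d_i}\, (1 - h(x_i))^{1 - d_i}\, P(x_i) \]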

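We write an expression for the maximum likelihood hypothesis

\[ h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} h(x_i)^{d_i}\, (1 - h(x_i))^{1 - d_i}\, P(x_i) \]

The last term, P(xi), is a constant independent of h, so it can be dropped

\[ h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} h(x_i)^{d_i}\, (1 - h(x_i))^{1 - d_i} \]

It is easier to work with the log of the likelihood, yielding

\[ h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m} d_i \ln h(x_i) + (1 - d_i) \ln (1 - h(x_i)) \tag{7} \]

Equation (7) describes the quantity that must be maximized in order to obtain the maximum
likelihood hypothesis in our current problem setting. This quantity is denoted G(h, D).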
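
As a small illustration of this criterion, the sketch below evaluates the log-likelihood, the sum over i of di ln h(xi) + (1 − di) ln(1 − h(xi)), for two candidate hypotheses. The five training pairs and both hypotheses (a constant predictor and a logistic curve with hand-picked coefficients) are assumptions made up for the example.

```python
import math

# Hypothetical training data: pairs (x_i, d_i) with d_i in {0, 1}.
examples = [(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 1), (4.0, 1)]


def log_likelihood(h, data):
    """Sum over i of d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i))."""
    return sum(d * math.log(h(x)) + (1 - d) * math.log(1.0 - h(x))
               for x, d in data)


def h_constant(x):
    """Always predicts P(f(x) = 1) = 0.6, the overall frequency of 1's."""
    return 0.6


def h_logistic(x):
    """Logistic curve with hand-picked (assumed) coefficients."""
    return 1.0 / (1.0 + math.exp(-(2.0 * x - 3.0)))


g_constant = log_likelihood(h_constant, examples)
g_logistic = log_likelihood(h_logistic, examples)
print(f"constant hypothesis: {g_constant:.4f}")
print(f"logistic hypothesis: {g_logistic:.4f}")
```

The logistic hypothesis assigns high probability to the labels that were actually observed, so its log-likelihood is closer to zero and it would be preferred by the maximum likelihood criterion over the constant predictor.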

Gradient Search to Maximize Likelihood in a Neural Net
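Derive a weight-training rule for neural network learning that seeks to maximize G(h, D) using
gradient ascent, where G(h, D) denotes the log-likelihood, the sum over i of di ln h(xi) + (1 − di) ln(1 − h(xi)).
• The gradient of G(h, D) is given by the vector of partial derivatives of G(h, D) with respect to the
various network weights that define the hypothesis h represented by the learned network.
• In this case, the partial derivative of G(h, D) with respect to weight wjk from input k to unit j is

\[ \frac{\partial G(h, D)}{\partial w_{jk}} = \sum_{i=1}^{m} \frac{\partial G(h, D)}{\partial h(x_i)} \frac{\partial h(x_i)}{\partial w_{jk}} = \sum_{i=1}^{m} \frac{\partial \left( d_i \ln h(x_i) + (1 - d_i) \ln (1 - h(x_i)) \right)}{\partial h(x_i)} \frac{\partial h(x_i)}{\partial w_{jk}} = \sum_{i=1}^{m} \frac{d_i - h(x_i)}{h(x_i)(1 - h(x_i))} \frac{\partial h(x_i)}{\partial w_{jk}} \tag{1} \]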

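Suppose our neural network is constructed from a single layer of sigmoid units. Then,

\[ \frac{\partial h(x_i)}{\partial w_{jk}} = \sigma'(x_i)\, x_{ijk} = h(x_i)(1 - h(x_i))\, x_{ijk} \]

where xijk is the kth input to unit j for the ith training example, and σ'(x) is the derivative of the sigmoid
squashing function.
Finally, substituting this expression into Equation (1), we obtain a simple expression for the
derivatives that constitute the gradient

\[ \frac{\partial G(h, D)}{\partial w_{jk}} = \sum_{i=1}^{m} (d_i - h(x_i))\, x_{ijk} \]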

Because we seek to maximize rather than minimize P(D|h), we perform gradient ascent rather than
gradient descent search. On each iteration of the search the weight vector is adjusted in the direction of the
gradient, using the weight update rule

\[ w_{jk} \leftarrow w_{jk} + \Delta w_{jk}, \qquad \Delta w_{jk} = \eta \sum_{i=1}^{m} (d_i - h(x_i))\, x_{ijk} \]

where η is a small positive constant that determines the step size of the gradient ascent search.
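
The update rule can be tried on a single sigmoid unit. The following is a minimal sketch, not a full network trainer: the five training examples, the constant bias input, the step size η = 0.05 and the 200 iterations are all assumptions picked for illustration. Each step adds η times the sum of (di − h(xi)) xik to weight wk and records the log-likelihood G so its growth can be observed.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical data: each input is (bias input 1.0, x); targets are 0 or 1.
X = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0), (1.0, 4.0)]
d = [0, 0, 1, 0, 1]


def h(w, x):
    """Output of a single sigmoid unit with weight vector w."""
    return sigmoid(sum(wk * xk for wk, xk in zip(w, x)))


def G(w):
    """Log-likelihood: sum of d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i))."""
    return sum(di * math.log(h(w, xi)) + (1 - di) * math.log(1.0 - h(w, xi))
               for xi, di in zip(X, d))


def ascent_step(w, eta):
    """One gradient ascent step: w_k <- w_k + eta * sum_i (d_i - h(x_i)) x_ik."""
    grad = [sum((di - h(w, xi)) * xi[k] for xi, di in zip(X, d))
            for k in range(len(w))]
    return [wk + eta * gk for wk, gk in zip(w, grad)]


w = [0.0, 0.0]
history = [G(w)]
for _ in range(200):
    w = ascent_step(w, eta=0.05)   # eta is an assumed step size
    history.append(G(w))

print(f"G before training: {history[0]:.4f}")
print(f"G after training:  {history[-1]:.4f}")
print("learned weights:", [round(wk, 3) for wk in w])
```

With a step size this small the recorded values of G rise steadily from the initial value 5 ln 0.5; a much larger η could overshoot, which is why η is described as a small positive constant.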

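It is interesting to compare this weight-update rule to the weight-update rule used by the
BACKPROPAGATION algorithm to minimize the sum of squared errors between predicted and
observed network outputs.
The BACKPROPAGATION update rule for output unit weights, re-expressed using our current
notation, is

\[ w_{jk} \leftarrow w_{jk} + \Delta w_{jk}, \qquad \Delta w_{jk} = \eta \sum_{i=1}^{m} h(x_i)(1 - h(x_i))(d_i - h(x_i))\, x_{ijk} \]

The two rules are the same except for the extra factor h(xi)(1 − h(xi)), which is the derivative of the sigmoid function.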

MINIMUM DESCRIPTION LENGTH PRINCIPLE
• A Bayesian perspective on Occam’s razor
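• Motivated by interpreting the definition of hMAP in the light of basic concepts from information
theory.

\[ h_{MAP} = \arg\max_{h \in H} P(D \mid h)\, P(h) \]

which can be equivalently expressed in terms of maximizing the log2

\[ h_{MAP} = \arg\max_{h \in H} \log_2 P(D \mid h) + \log_2 P(h) \]

or alternatively, minimizing the negative of this quantity

\[ h_{MAP} = \arg\min_{h \in H} -\log_2 P(D \mid h) - \log_2 P(h) \tag{1} \]

• Equation (1) can be interpreted as a statement that short hypotheses are preferred, assuming a
particular representation scheme for encoding hypotheses and data.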

Introduction to a basic result of information theory
• Consider the problem of designing a code to transmit messages drawn at random.
• Let i denote a message, and let pi be the probability of encountering message i.
• We are interested in the most compact code; that is, the code that minimizes the
expected number of bits we must transmit in order to encode a message drawn at random.
• To minimize the expected code length we should assign shorter codes to messages that are
more probable.
• Shannon and Weaver (1949) showed that the optimal code (i.e., the code that minimizes
the expected message length) assigns −log2 pi bits to encode message i.
• We refer to the number of bits required to encode message i using code C as the description length
of message i with respect to C, which we denote by LC(i).
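
A short numerical sketch of this result follows. The four messages and their probabilities are assumptions chosen so that every optimal code length is a whole number of bits; the code simply evaluates −log2 pi for each message and the resulting expected message length.

```python
import math

# Hypothetical message distribution (assumed values that sum to 1).
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Optimal description length of each message: -log2(p_i) bits.
code_len = {msg: -math.log2(p) for msg, p in probs.items()}

# Expected number of bits per transmitted message under this code.
expected_bits = sum(p * code_len[msg] for msg, p in probs.items())

for msg in probs:
    print(f"message {msg}: p = {probs[msg]:<6} length = {code_len[msg]:.0f} bits")
print(f"expected length = {expected_bits} bits")
```

The most probable message gets the shortest code, and the expected length equals the entropy of the distribution. A fixed-length code for four messages would need 2 bits per message, more than the expected length obtained here.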

Interpreting the equation
• −log2 P(h): the description length of h under the optimal encoding for the hypothesis space H:
LCH(h) = −log2 P(h), where CH is the optimal code for hypothesis space H.
• −log2 P(D|h): the description length of the training data D given hypothesis h, under its
optimal encoding: LCD|h(D|h) = −log2 P(D|h), where CD|h is the
optimal code for describing data D assuming that both the sender and receiver know the
hypothesis h.

Rewrite Equation (1) to show that hMAP is the hypothesis h that minimizes the sum given by the
description length of the hypothesis plus the description length of the data given the hypothesis.

\[ h_{MAP} = \arg\min_{h \in H} L_{C_H}(h) + L_{C_{D|h}}(D \mid h) \]

where CH and CD|h are the optimal encodings for H and for D given h.

The Minimum Description Length (MDL) principle recommends choosing the hypothesis that
minimizes the sum of these two description lengths.

Minimum Description Length principle: choose hMDL where

\[ h_{MDL} = \arg\min_{h \in H} L_{C_1}(h) + L_{C_2}(D \mid h) \]

where codes C1 and C2 are used to represent the hypothesis and the data given the hypothesis, respectively.

The above analysis shows that if we choose C1 to be the optimal encoding of hypotheses CH, and if
we choose C2 to be the optimal encoding CD|h, then hMDL = hMAP.

Application to Decision Tree Learning
Apply the MDL principle to the problem of learning decision trees from some training data.
What should we choose for the representations C1 and C2 of hypotheses and data?
• For C1: C1 might be some obvious encoding, in which the description length grows with the
number of nodes and with the number of edges
• For C2: Suppose that the sequence of instances (x1 . . .xm) is already known to both the transmitter
and receiver, so that we need only transmit the classifications (f (x1) . . . f (xm)).
If the training classifications (f(x1) . . . f(xm)) are identical to the predictions of the
hypothesis, then there is no need to transmit any information about these examples, and the
description length of the classifications given the hypothesis is zero.
If examples are misclassified by h, then for each misclassification we need to transmit a message
that identifies which example is misclassified as well as its correct classification.
The hypothesis hMDL under the encodings C1 and C2 is just the one that minimizes the sum of these
description lengths.

• MDL principle provides a way for trading off hypothesis complexity for the number of errors
committed by the hypothesis
• MDL provides a way to deal with the issue of overfitting the data.
• A short imperfect hypothesis may be selected over a long perfect hypothesis.
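
The trade-off can be illustrated with a toy calculation. Everything in the sketch below is assumed for illustration: three hypothetical decision trees described only by node count and number of training errors, an encoding C1 that costs a fixed 8 bits per node, and an encoding C2 that spends log2(m) bits to identify each misclassified example plus 1 bit for its correct binary class.

```python
import math

M = 64               # number of training examples (assumed)
BITS_PER_NODE = 8    # assumed cost per tree node under encoding C1

# Hypothetical candidates: (name, number of nodes, misclassified examples).
candidates = [
    ("small tree", 3, 2),
    ("medium tree", 9, 1),
    ("large tree", 40, 0),
]


def hypothesis_bits(nodes):
    """Description length of the tree under the assumed encoding C1."""
    return nodes * BITS_PER_NODE


def data_bits(errors):
    """Description length of the classifications given the tree (C2):
    log2(M) bits name the misclassified example, 1 bit gives its class."""
    return errors * (math.log2(M) + 1)


def total_bits(candidate):
    _, nodes, errors = candidate
    return hypothesis_bits(nodes) + data_bits(errors)


h_mdl = min(candidates, key=total_bits)

for cand in candidates:
    print(f"{cand[0]:<12} total description length = {total_bits(cand):.0f} bits")
print("MDL choice:", h_mdl[0])
```

Under these assumed encodings the small tree wins even though it misclassifies two examples, because describing the large error-free tree costs far more bits than describing the two exceptions. A different choice of C1 or C2 could change the outcome.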

END OF CHAPTER 6
