BAYESIAN LEARNING
Bayesian Classifiers
Bayesian classifiers are statistical classifiers based on Bayes' theorem.
They can calculate the probability that a given sample belongs to a particular class.
1
BAYESIAN LEARNING
Bayesian learning algorithms are among the most
practical approaches to certain types of learning
problems.
Their results are, in many cases, comparable to the performance of other classifiers, such as decision trees and neural networks.
2
BAYESIAN LEARNING
Bayes Theorem
Let X be a data sample, e.g. a red and round fruit
Let H be some hypothesis, such as that X belongs to a
specified class C (e.g. X is an apple)
For classification problems, we want to determine P(H|X),
the probability that the hypothesis H holds given the
observed data sample X
3
BAYESIAN LEARNING
Prior & Posterior Probability
The probability P(H) is called the prior probability of H, i.e. the probability that any given data sample is an apple, regardless of how the data sample looks.
The probability P(H|X) is called the posterior probability. It is based on more information than the prior probability P(H), which is independent of X.
4
BAYESIAN LEARNING
Bayes Theorem
It provides a way of calculating the posterior probability:
P(H|X) = P(X|H) P(H) / P(X)
P(X|H) is the posterior probability of X conditioned on H (the probability that X is red and round given that X is an apple)
P(X) is the prior probability of X (the probability that a data sample is red and round)
5
BAYESIAN LEARNING
Bayes Theorem: Proof
The posterior probability of the fruit being an apple, given that its shape is round and its colour is red, is
P(H|X) = |H ∩ X| / |X|
i.e. the number of apples which are red and round divided by the total number of red and round fruits.
Since P(H ∩ X) = |H ∩ X| / |total fruits of all sizes and shapes|
and P(X) = |X| / |total fruits of all sizes and shapes|,
hence P(H|X) = P(H ∩ X) / P(X)
6
BAYESIAN LEARNING
Bayes Theorem: Proof
Similarly, P(X|H) = P(H ∩ X) / P(H)
Since we have P(H ∩ X) = P(H|X) P(X)
and also P(H ∩ X) = P(X|H) P(H),
therefore P(H|X) P(X) = P(X|H) P(H)
and hence P(H|X) = P(X|H) P(H) / P(X)
7
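As a quick sanity check of the identity above, the sketch below verifies Bayes' theorem on a small made-up fruit table (the counts are purely illustrative, not taken from the slides):

```python
# Hypothetical counts, purely for illustration.
total_fruits = 100          # all fruits of all sizes and shapes
apples = 40                 # |H|
red_round = 25              # |X|
red_round_apples = 20       # |H ∩ X|

p_h = apples / total_fruits                  # P(H)
p_x = red_round / total_fruits               # P(X)
p_h_and_x = red_round_apples / total_fruits  # P(H ∩ X)

p_h_given_x = p_h_and_x / p_x                # P(H|X) by definition
p_x_given_h = p_h_and_x / p_h                # P(X|H) by definition

# Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X)
assert abs(p_h_given_x - p_x_given_h * p_h / p_x) < 1e-12
print(p_h_given_x)  # 0.8: 20 of the 25 red, round fruits are apples
```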
BAYESIAN LEARNING
Naïve (Simple) Bayesian Classification
Studies comparing classification algorithms have found that
the simple Bayesian classifier is comparable in performance
with decision tree and neural network classifiers
It works as follows:
1. Each data sample is represented by an n-dimensional feature vector X = (x1, x2, …, xn), containing n measurements made on the sample for the attributes A1, A2, …, An, respectively.
8
BAYESIAN LEARNING
Naïve (Simple) Bayesian Classification
2. Suppose that there are m classes C1, C2, … Cm. Given an
unknown data sample, X (i.e. having no class label), the
classifier will predict that X belongs to the class having the
highest posterior probability given X
Thus if P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i,
then X is assigned to Ci.
This is called the Bayes decision rule.
9
BAYESIAN LEARNING
Naïve (Simple) Bayesian Classification
3. We have P(Ci|X) = P(X|Ci) P(Ci) / P(X)
As P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be calculated.
The class prior probabilities may be estimated by
P(Ci) = si / s
where si is the number of training samples of class Ci
and s is the total number of training samples.
If the class prior probabilities are equal (or are not known and thus assumed to be equal), then we need to calculate only P(X|Ci).
10
BAYESIAN LEARNING
Naïve (Simple) Bayesian Classification
4. Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci).
For example, assuming the attributes colour and shape to be Boolean, we need to store 4 probabilities for the category apple:
P(¬red ∧ ¬round | apple)
P(¬red ∧ round | apple)
P(red ∧ ¬round | apple)
P(red ∧ round | apple)
If there are 6 attributes and they are Boolean, then we need to store 2^6 = 64 such probabilities.
11
BAYESIAN LEARNING
Naïve (Simple) Bayesian Classification
In order to reduce computation, the naïve assumption of
class conditional independence is made
This presumes that the values of the attributes are
conditionally independent of one another, given the class
label of the sample (we assume that there are no dependence
relationships among the attributes)
12
BAYESIAN LEARNING
Naïve (Simple) Bayesian Classification
Thus we assume that P(X|Ci) = ∏k=1..n P(xk|Ci)
Example
P(colour ∧ shape | apple) = P(colour | apple) P(shape | apple)
For 6 Boolean attributes, we would have only 2 × 6 = 12 probabilities to store instead of 2^6 = 64.
Similarly, for 6 three-valued attributes, we would have 3 × 6 = 18 probabilities to store instead of 3^6 = 729.
13
BAYESIAN LEARNING
Naïve (Simple) Bayesian Classification
The probabilities P(x1|Ci), P(x2|Ci), …, P(xn|Ci) can be estimated from the training samples as follows.
For an attribute Ak, which can take on the values x1k, x2k, … (e.g. colour = red, green, …):
P(xk|Ci) = sik / si
where sik is the number of training samples of class Ci having the value xk for Ak,
and si is the number of training samples belonging to Ci.
e.g. P(red|apple) = 7/10 if 7 out of 10 apples are red
14
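A minimal sketch of this counting-based estimate, using a tiny hypothetical fruit table (the samples and attribute names are illustrative only, not from the slides):

```python
# Hypothetical training samples: (colour, shape, class label)
samples = [
    ("red", "round", "apple"), ("red", "round", "apple"),
    ("green", "round", "apple"), ("red", "elongated", "banana"),
    ("yellow", "elongated", "banana"),
]

def cond_prob(attr_index, value, label):
    """Estimate P(x_k | C_i) = s_ik / s_i by counting."""
    in_class = [s for s in samples if s[-1] == label]           # the s_i samples of class C_i
    matching = [s for s in in_class if s[attr_index] == value]  # the s_ik of them with value x_k
    return len(matching) / len(in_class)

print(cond_prob(0, "red", "apple"))    # 2/3: P(colour = red | apple)
print(cond_prob(1, "round", "apple"))  # 3/3: P(shape = round | apple)
```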
BAYESIAN LEARNING
Naïve (Simple) Bayesian Classification
Example:
15
Play-tennis example: estimating P(xi|C)

Training data:
Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

Class priors: P(p) = 9/14, P(n) = 5/14

outlook:     P(sunny|p) = 2/9     P(sunny|n) = 3/5
             P(overcast|p) = 4/9  P(overcast|n) = 0
             P(rain|p) = 3/9      P(rain|n) = 2/5
temperature: P(hot|p) = 2/9       P(hot|n) = 2/5
             P(mild|p) = 4/9      P(mild|n) = 2/5
             P(cool|p) = 3/9      P(cool|n) = 1/5
humidity:    P(high|p) = 3/9      P(high|n) = 4/5
             P(normal|p) = 6/9    P(normal|n) = 1/5
windy:       P(true|p) = 3/9      P(true|n) = 3/5
             P(false|p) = 6/9     P(false|n) = 2/5
Naive Bayesian Classifier (II)
Given a training set, we can compute the probabilities:

Outlook      P    N      Humidity  P    N
sunny        2/9  3/5    high      3/9  4/5
overcast     4/9  0      normal    6/9  1/5
rain         3/9  2/5
Temperature  P    N      Windy     P    N
hot          2/9  2/5    true      3/9  3/5
mild         4/9  2/5    false     6/9  2/5
cool         3/9  1/5
Play-tennis example: classifying X
An unseen sample X = <rain, hot, high, false>
P(X|p)·P(p) =
P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) =
3/9·2/9·3/9·6/9·9/14 = 0.010582
P(X|n)·P(n) =
P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) =
2/5·2/5·4/5·2/5·5/14 = 0.018286
Sample X is classified in class n (don’t play)
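The same computation can be written out directly; this short sketch (assuming the training table above) recomputes the two scores and picks the larger one:

```python
# Play-tennis training data from the table above:
# (outlook, temperature, humidity, windy, class)
data = [
    ("sunny","hot","high","false","N"), ("sunny","hot","high","true","N"),
    ("overcast","hot","high","false","P"), ("rain","mild","high","false","P"),
    ("rain","cool","normal","false","P"), ("rain","cool","normal","true","N"),
    ("overcast","cool","normal","true","P"), ("sunny","mild","high","false","N"),
    ("sunny","cool","normal","false","P"), ("rain","mild","normal","false","P"),
    ("sunny","mild","normal","true","P"), ("overcast","mild","high","true","P"),
    ("overcast","hot","normal","false","P"), ("rain","mild","high","true","N"),
]

def score(x, label):
    """P(X|C) * P(C) under the naive independence assumption."""
    rows = [r for r in data if r[-1] == label]
    p = len(rows) / len(data)                      # class prior P(C)
    for k, value in enumerate(x):                  # product of the P(x_k | C) estimates
        p *= sum(1 for r in rows if r[k] == value) / len(rows)
    return p

x = ("rain", "hot", "high", "false")
print(score(x, "P"))   # ~0.010582
print(score(x, "N"))   # ~0.018286
print(max(["P", "N"], key=lambda c: score(x, c)))  # N -> don't play
```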
Naïve Bayesian Classifier: Example 2
Training dataset:

age    income  student  credit_rating  buys_computer
<=30   high    no       fair           no
<=30   high    no       excellent      no
31…40  high    no       fair           yes
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
31…40  low     yes      excellent      yes
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
>40    medium  yes      fair           yes
<=30   medium  yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes
>40    medium  no       excellent      no

Classes:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'

Data sample to classify:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
BAYESIAN LEARNING
Naïve (Simple) Bayesian Classification
Example:
Let C1 = class buys_computer = yes and C2 = class buys_computer = no.
The unknown sample is
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
The prior probability of each class can be computed as
P(buys_computer = yes) = 9/14 = 0.643
P(buys_computer = no) = 5/14 = 0.357
21
BAYESIAN LEARNING
Naïve (Simple) Bayesian Classification
Example:
To compute P(X|Ci) we compute the following conditional probabilities from the training data:
P(age <= 30 | buys_computer = yes) = 2/9 = 0.222
P(age <= 30 | buys_computer = no) = 3/5 = 0.600
P(income = medium | buys_computer = yes) = 4/9 = 0.444
P(income = medium | buys_computer = no) = 2/5 = 0.400
P(student = yes | buys_computer = yes) = 6/9 = 0.667
P(student = yes | buys_computer = no) = 1/5 = 0.200
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no) = 2/5 = 0.400
22
BAYESIAN LEARNING
Naïve (Simple) Bayesian Classification
Example:
Using the above probabilities we obtain
P(X | buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
P(X | buys_computer = yes) P(buys_computer = yes) = 0.044 × 0.643 = 0.028
P(X | buys_computer = no) P(buys_computer = no) = 0.019 × 0.357 = 0.007
Hence the naïve Bayesian classifier predicts that the student will buy a computer, because 0.028 > 0.007.
23
BAYESIAN LEARNING
Naïve (Simple) Bayesian Classification
An Example: Learning to classify text
- Instances (training samples) are text documents
- Classification labels can be: like-dislike, etc.
- The task is to learn from these training examples to
predict the class of unseen documents
Design issue:
- How to represent a text document in terms of
attribute values
24
BAYESIAN LEARNING
Naïve (Simple) Bayesian Classification
One approach:
- The attributes are the word positions
- Value of an attribute is the word found in that
position
Note that the number of attributes may differ from one document to another.
We calculate the prior probabilities of the classes from the training samples.
The probability of each word occurring at each position is also estimated,
e.g. P(“The” in first position | like document)
25
BAYESIAN LEARNING
Naïve (Simple) Bayesian Classification
Second approach:
The frequency with which a word occurs is counted, irrespective of the word's position.
Note that here, too, the number of attributes may differ from one document to another.
The probabilities of words are estimated,
e.g. P(“The” | like document)
26
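A minimal sketch of this second, position-independent approach (the documents and labels below are made up; add-one smoothing, discussed later under the zero-probability problem, is used to avoid zero counts):

```python
from collections import Counter
import math

# Hypothetical labelled documents.
docs = [
    ("great plot great acting", "like"),
    ("boring plot", "dislike"),
    ("great fun", "like"),
]

vocab = {w for text, _ in docs for w in text.split()}

def log_score(text, label):
    """log P(C) + sum of log P(word|C), with add-one (Laplace) smoothing."""
    class_docs = [t for t, c in docs if c == label]
    counts = Counter(w for t in class_docs for w in t.split())
    total = sum(counts.values())
    s = math.log(len(class_docs) / len(docs))            # class prior
    for w in text.split():
        s += math.log((counts[w] + 1) / (total + len(vocab)))
    return s

query = "great boring acting"
print(max(["like", "dislike"], key=lambda c: log_score(query, c)))  # like
```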
BAYESIAN LEARNING
Naïve (Simple) Bayesian Classification
Results
An algorithm based on the second approach was applied to the problem of classifying newsgroup articles:
- 20 newsgroups were considered
- 1,000 articles from each newsgroup were collected (20,000 articles in total)
- The naïve Bayes algorithm was applied using two thirds of these articles as training samples
- Testing was done on the remaining third
27
BAYESIAN LEARNING
Naïve (Simple) Bayesian Classification
Results
- Given 20 news groups, we would expect random guessing
to achieve a classification accuracy of 5%
- The accuracy achieved by this program was 89%
28
BAYESIAN LEARNING
Naïve (Simple) Bayesian Classification
Minor Variant
The algorithm used only a subset of the words used in the
documents
- 100 most frequent words were removed (these include
words such as “the”, and “of”)
- Any word occurring fewer than 3 times was also
removed
29
Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise the predicted probability will be zero
• Example: suppose a dataset with 1000 tuples has income = low (0 tuples), income = medium (990), and income = high (10)
• Use the Laplacian correction (or Laplacian estimator)
  • Add 1 to each case
  • P(income = low) = 1/1003
  • P(income = medium) = 991/1003
  • P(income = high) = 11/1003
• The “corrected” probability estimates are close to their “uncorrected” counterparts
30
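A small sketch of the correction applied to exactly these counts:

```python
# Counts from the example: income = low, medium, high within the dataset.
counts = {"low": 0, "medium": 990, "high": 10}

def laplace(counts, k=1):
    """Add k to every count before normalising (Laplacian correction)."""
    total = sum(counts.values()) + k * len(counts)
    return {value: (c + k) / total for value, c in counts.items()}

print(laplace(counts))
# low: 1/1003 ~ 0.001, medium: 991/1003 ~ 0.988, high: 11/1003 ~ 0.011
```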
Handling Real-Valued Data
31
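The body of this slide is not reproduced in the text. A common way to handle real-valued attributes (an assumption on my part, not stated in the slides) is to model each such attribute within a class as a Gaussian and use its density in place of the count-based estimate:

```python
import math

def gaussian_likelihood(x, values):
    """Estimate P(x | C) for a continuous attribute by fitting a Gaussian
    to the attribute's values observed in the training samples of class C."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / (len(values) - 1)
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# e.g. hypothetical weights (in grams) of the apples in a training set
apple_weights = [150.0, 170.0, 160.0, 155.0]
print(gaussian_likelihood(162.0, apple_weights))  # density used as P(weight = 162 | apple)
```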
The independence hypothesis…
• … makes computation possible
• … yields optimal classifiers when satisfied
• … but is seldom satisfied in practice, as attributes
(variables) are often correlated.
• Attempts to overcome this limitation:
– Bayesian networks, that combine Bayesian reasoning with
causal relationships between attributes
BAYESIAN LEARNING
Bayesian Belief Networks
Since dependencies can exist between variables in the real world, Bayesian belief networks are used to specify joint conditional probability distributions.
They allow class conditional independencies to be defined between subsets of variables.
They provide a graphical model of causal relationships, on which learning can be performed.
These networks are also called belief networks, Bayesian networks, and probabilistic networks.
33
BAYESIAN LEARNING
Bayesian Belief Networks
A belief network is defined by two components
The first is a directed acyclic graph, where each node
represents a random variable and each arc represents a
probabilistic dependence
[Figure: example belief network with nodes Age, FamilyH, Diabetes, Mass, Insulin, Glucose]
34
BAYESIAN LEARNING
Bayesian Belief Networks
If an arc is drawn from a node Y to a node Z, then Y is a parent of Z and Z is a descendant of Y.
Each variable is conditionally independent of its non-descendants in the graph, given its parents.
The variables may be discrete or continuous.
35
BAYESIAN LEARNING
Bayesian Belief Networks
The second component of a belief network consists of a conditional probability table (CPT) for each variable.
For a variable Z, it specifies the conditional probability distribution P(Z | Parents(Z)).
The conditional probability for each value of Z is listed for each possible combination of values of its parents.
CPT for the variable M, whose parents are FH and A:

     (FH, A)  (FH, ~A)  (~FH, A)  (~FH, ~A)
M    0.8      0.5       0.7       0.1
~M   0.2      0.5       0.3       0.9
36
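The CPT above can be stored directly as a lookup table; a minimal sketch (the variable and value names follow the table's abbreviations):

```python
# P(M = true | FH, A): keys are truth assignments (FH, A) of the parents.
cpt_M = {
    (True,  True):  0.8,
    (True,  False): 0.5,
    (False, True):  0.7,
    (False, False): 0.1,
}

def p_m(m, fh, a):
    """Return P(M = m | FH = fh, A = a) from the table."""
    p_true = cpt_M[(fh, a)]
    return p_true if m else 1.0 - p_true

print(p_m(True, True, False))    # 0.5
print(p_m(False, False, False))  # 0.9
```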
BAYESIAN LEARNING
Bayesian Belief Networks
The joint probability of any tuple (z1, …, zn) corresponding to the attributes Z1, …, Zn is computed by
P(z1, …, zn) = ∏i=1..n P(zi | Parents(Zi))
where the values P(zi | Parents(Zi)) correspond to the entries in the CPT for Zi.
37
BAYESIAN LEARNING
Bayesian Belief Networks
A node within the network can be selected as an output
node representing a class label attribute
The structure of the network can be given by an expert
38
BAYESIAN LEARNING
Example: the probability that a fish caught in summer, in the north Atlantic, is a sea bass, and is dark and thin:
P(a3, b1, x2, c3, d2) = P(a3) P(b1) P(x2|a3, b1) P(c3|x2) P(d2|x2)
= 0.25 × 0.6 × 0.4 × 0.5 × 0.4 = 0.012
39
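A quick check of this product (the mapping of symbols to the sentence is inferred from their order in the example; only the factor values shown on the slide are needed):

```python
# Factors for the fish example.
p_a3 = 0.25             # P(a3): season = summer
p_b1 = 0.6              # P(b1): location = north Atlantic
p_x2_given_a3_b1 = 0.4  # P(x2 | a3, b1): sea bass given summer, north Atlantic
p_c3_given_x2 = 0.5     # P(c3 | x2): dark given sea bass
p_d2_given_x2 = 0.4     # P(d2 | x2): thin given sea bass

joint = p_a3 * p_b1 * p_x2_given_a3_b1 * p_c3_given_x2 * p_d2_given_x2
print(joint)  # 0.012
```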
BAYESIAN LEARNING
Learning Bayesian Belief Networks
The problem of learning a Bayes network is the problem of finding a network that best matches a training set of data, according to some scoring metric.
By finding a network we mean finding both
- the structure of the net, and
- the conditional probability tables (CPTs) associated with each node
40
BAYESIAN LEARNING
Learning Bayesian Belief Networks
Known Network Structure
If the structure of the network is known, we only have to find the CPTs.
Often human experts can come up with an appropriate structure for a problem domain, but not with the CPTs.
If we have an ample number of training samples, we can compute sample statistics for each node and its parents.
41
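A minimal sketch of estimating one CPT entry from such sample statistics by counting (the variable names reuse the earlier M/FH/A example; the records are made up):

```python
# Hypothetical complete training records: (FH, A, M) as booleans.
records = [
    (True, True, True), (True, True, True), (True, True, False),
    (False, True, True), (False, True, False), (False, False, False),
]

def estimate(m, fh, a):
    """Estimate P(M = m | FH = fh, A = a) as count(M, FH, A) / count(FH, A)."""
    parent_rows = [r for r in records if r[0] == fh and r[1] == a]
    matching = [r for r in parent_rows if r[2] == m]
    return len(matching) / len(parent_rows)

print(estimate(True, True, True))  # 2/3 from the records above
```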
BAYESIAN LEARNING
Learning Bayesian Belief Networks
Unknown Network Structure
If the network structure is not known, we must attempt to find the structure, as well as its associated CPTs, that best fits the training data.
In order to do so, we need
- a metric by which to score candidate networks, and
- a procedure for searching among possible structures
42
BAYESIAN LEARNING
Learning Bayesian Belief Networks
43
BAYESIAN LEARNING
Learning Bayesian Belief Networks
44
BAYESIAN LEARNING
Pros & Cons of Bayesian approach
It provides a more flexible approach to learning than algorithms that completely eliminate a hypothesis if it is found to be inconsistent with any single example: in the Bayesian approach, each observed training sample can gradually increase or decrease the estimated probability that a hypothesis is correct.
The probabilities used can be incrementally updated as new training samples arrive.
45
BAYESIAN LEARNING
Pros & Cons of Bayesian approach
One practical difficulty in applying Bayesian methods is that
they typically require initial knowledge of many
probabilities
A second practical difficulty is the significant computational
cost required to determine the Bayes optimal hypothesis
in the general case (linear in the number of candidate
hypotheses)
46
BAYESIAN LEARNING
Reference
Chapter 6 of T. Mitchell
47
INSTANCE-BASED LEARNING
k – NEAREST NEIGHBOUR
48
k – NEAREST NEIGHBOUR LEARNING
Introduction
Key Idea:
Just store all training examples <xi, f(xi)>
Thus the training algorithm is very simple
49
k – NEAREST NEIGHBOUR LEARNING
Introduction
Classification Algorithm (1-nearest neighbour):
* Given query instance xq
* Locate the nearest training example xn
* Estimate f̂(xq) ← f(xn)
50
k – NEAREST NEIGHBOUR LEARNING
Introduction
Classification Algorithm (k-nearest neighbour):
* Given query instance xq
* Locate the k nearest training examples
* Estimate the class label of xq by taking a vote among the class labels of the k nearest neighbours
51
k – NEAREST NEIGHBOUR LEARNING
Introduction
Note that, in the accompanying figure, 1-nearest neighbour classifies xq as positive, whereas 5-nearest neighbour classifies it as negative.
52
k – NEAREST NEIGHBOUR LEARNING
Introduction
Classification Algorithm (k-nearest neighbour):
* If the target values (the values of the target function f) are real-valued,
take the mean of the f values of the k nearest neighbours
53
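A compact sketch of both variants (majority vote for class labels, mean for real-valued targets) under the usual Euclidean distance; the toy points are illustrative:

```python
import math
from collections import Counter

def knn_predict(query, examples, k=3, regression=False):
    """k-nearest-neighbour prediction: majority vote for class labels,
    mean of the target values when regression=True."""
    neighbours = sorted(examples, key=lambda e: math.dist(query, e[0]))[:k]
    targets = [t for _, t in neighbours]
    if regression:
        return sum(targets) / len(targets)
    return Counter(targets).most_common(1)[0][0]

# Toy 2-D training examples <x_i, f(x_i)>
train = [((0, 0), "neg"), ((1, 0), "neg"), ((0, 1), "neg"),
         ((5, 5), "pos"), ((6, 5), "pos")]
print(knn_predict((0.5, 0.5), train, k=3))          # neg
print(knn_predict((5.5, 5.0),
                  [(p, 1.0 if c == "pos" else 0.0) for p, c in train],
                  k=3, regression=True))            # mean of the 3 nearest targets
```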
k – NEAREST NEIGHBOUR LEARNING
Hypothesis about the target function
What classifications would be assigned if we were to hold the training examples constant and query the algorithm with every possible instance?
The resulting decision surface (shown in the accompanying figure) partitions the instance space into cells around the training examples; this is called a Voronoi diagram.
54
k – NEAREST NEIGHBOUR LEARNING
Distance weighted NN-algorithm
We weight the contribution of each of the k neighbours according to its distance to the query point xq; closer neighbours are given a greater weight:
f̂(xq) ← argmax_v Σi=1..k wi δ(v, f(xi))
where wi = 1 / d(xq, xi)²
55
k – NEAREST NEIGHBOUR LEARNING
Distance weighted NN-algorithm
If for some training example xi we have d(xq, xi)² = 0 (i.e. xq exactly matches xi),
we assign the class of xi to xq.
If there are several xi equal to xq, we take a majority vote among them.
For real-valued target functions,
f̂(xq) ← Σi=1..k wi f(xi) / Σi=1..k wi
56
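A sketch of the distance-weighted variant for real-valued targets, with weights 1/d²; since the slide's exact-match rule is stated for class labels, averaging exact matches here is my adaptation for the real-valued case:

```python
import math

def weighted_knn_regress(query, examples, k=3):
    """Distance-weighted k-NN for a real-valued target: weighted mean with w_i = 1/d^2."""
    neighbours = sorted(examples, key=lambda e: math.dist(query, e[0]))[:k]
    exact = [t for p, t in neighbours if math.dist(query, p) == 0.0]
    if exact:                        # query coincides with one or more training points
        return sum(exact) / len(exact)
    weights = [1.0 / math.dist(query, p) ** 2 for p, _ in neighbours]
    return sum(w * t for w, (_, t) in zip(weights, neighbours)) / sum(weights)

# Toy 1-D training examples <x_i, f(x_i)>
train = [((0.0,), 1.0), ((1.0,), 2.0), ((2.0,), 4.0), ((3.0,), 8.0)]
print(weighted_knn_regress((1.2,), train, k=3))  # weighted mean, dominated by the point at 1.0
```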
k – NEAREST NEIGHBOUR LEARNING
Distance weighted NN-algorithm
If all the training examples are used to determine the classification of xq, then the algorithm is called a global method; otherwise it is called a local method.
For real-valued target functions, the global method is also called Shepard's method.
57
k – NEAREST NEIGHBOUR LEARNING
Remarks
Advantages
• It is robust to noise
• Training is fast
Disadvantages
• All attributes are used when calculating distances, whereas only a few may be relevant (this problem of irrelevant attributes is known as the curse of dimensionality)
• Classification is a slow process, since all computation is deferred until a query arrives
58
Reading Assignment & References
Chapter 8 of T. Mitchell
59