Module -1 ML-II [BAI702]
1.1 WELL-POSED LEARNING PROBLEMS
What is Learning?
Learning is defined broadly as: "A computer program is said to learn from experience
E with respect to some class of tasks T and performance measure P, if its performance
at tasks in T, as measured by P, improves with experience E."
This definition highlights three essential components of a learning problem:
1. Task (T) - What the system is trying to do.
2. Performance measure (P) - How success is measured.
3. Experience (E) - The data/interaction from which the system learns.
Examples of Well-Posed Learning Problems
(a) Checkers game
T (Task): Playing checkers.
P (Performance measure): Percentage of games won against opponents.
E (Experience): Playing practice games against itself.
This makes the learning problem well-defined, because we clearly state
what is being improved, how it is measured, and from what experience
it is learned.
(b) Handwriting recognition
T: Recognizing and classifying handwritten words within images.
P: Percentage of words correctly classified.
E: A database of handwritten words with their correct labels.
(c) Autonomous driving
T: Driving on four-lane public highways using vision sensors.
P: Average distance traveled before making an error (evaluated by a human overseer).
E: A sequence of images and steering commands recorded while observing a human driver.
Dept. of AI, VCET Puttur
A well-posed learning problem must always specify:
1. Task (T)
2. Performance measure (P)
3. Experience (E)
Without explicitly defining these, it is not possible to clearly evaluate whether a
system is truly learning.
1.2 DESIGNING A LEARNING SYSTEM
In order to illustrate some of the basic design issues and approaches to machine
learning, let us consider designing a program to learn to play checkers, with the goal
of entering it in the world checkers tournament. We adopt the obvious performance
measure: the percent of games it wins in this world tournament.
1.2.1 Choosing the Training Experience
When building a learning system (like a game-playing AI), the first major decision is
to select "From what kind of experience should the program learn?". This choice
significantly affects how successful the system will be.
Attributes of Training Experience:
a) Direct vs. Indirect Feedback
Direct Feedback:
Provides immediate information (e.g., correct move for a board state)
Example: In checkers, showing the system a board state and telling it the correct
move to make.
- Easier to learn because it knows exactly what it did right or wrong.
Indirect Feedback:
The system only sees the overall result, not immediate feedback for each action.
Example: The system plays a full game, and only knows if it won or lost. It must
figure out which moves contributed to the final result.
❌ Harder to figure out which moves were good or bad because one bad move at the
end could ruin an otherwise great game.
This introduces the credit assignment problem:
Figuring out which moves should get credit or blame for the final result.
➡ Learning from direct feedback is usually easier than from indirect feedback.
b) Who Controls the Sequence of Training Examples?
Teacher-Selected Examples:
A teacher picks specific, informative examples and correct answers.
Learner-Driven Queries:
The learner can ask the teacher for help on situations it finds confusing.
Autonomous Exploration (Self-Play/Learner trains alone):
The program plays games against itself and learns from those.
➕ It can generate unlimited training data.
➖ It might miss the strategies human players use.
c) Representation of the Training Experience
For learning to be effective:
The training examples should match the situations the system will face in real-
world use.
Example with Checkers:
Performance Measure (P): Percent of games won in a world tournament.
If the system only trains by playing itself, it may:
Never experience certain board states likely to be used by expert human
players.
Struggle in the actual tournament because its training experience didn’t
prepare it for those situations.
➡Problem: In reality, the training data may differ from real-world test situations.
Theoretically, we assume training and testing data come from the same distribution.
But in practice, this assumption often doesn't hold, making it harder to guarantee
success.
✅ Example: Checkers Learning Problem
The authors decide that the system will train by playing games against itself because:
-No external trainer is needed.
- It can generate unlimited training data.
➕ Completing the Learning System Design
To fully design the learning system, three more decisions are needed:
1. Type of Knowledge to be Learned
What exactly should the system learn?
(E.g., strategies, evaluation functions for board positions, etc.)
2. Representation of the Target Knowledge
How will this knowledge be represented?
(E.g., rules, decision trees, neural networks, etc.)
3. Learning Mechanism
What method or algorithm will the system use to acquire knowledge?
(E.g., reinforcement learning, supervised learning, etc.)
1.2.2 Choosing the Target Function
Once we decide how the program will learn (from games it plays with itself), the next
important question is: “What exactly should the program learn?”
What is a target function?
Think of it as a formula or rule the program is trying to learn, to help it make the best
move during the game.
Example: In checkers, the program knows the legal moves from any board setup.
But it doesn’t know:
- Which move is best at a given moment.
So our job is to help it learn: "From all possible moves, which one should I choose?"
1. First Obvious Target Function: ChooseMove
Learn a function called ChooseMove.
ChooseMove : B → M, where:
B = set of all legal board states.
M = set of all legal moves.
This means:
The program takes a board state as input.
It outputs the best move.
BUT… Learning this directly is hard, especially when:
The system only gets indirect feedback (it only knows if it won or lost after the
whole game, not whether each move was good or bad).
It's difficult to figure out the best move at every possible board state without lots
of specific, detailed examples.
2. Better Alternative: Learning an Evaluation Function (V)
Instead of learning to pick a move directly, we can teach the system to: "Give a
score to a board position."
A function called V, which assigns a numerical score to each board state.
Notation: V : B → ℝ, meaning:
For every board state b in set B,
V(b) gives a real number (ℝ),
Higher value of V(b) = better board state b
For a given current board:
1. The program simulates all possible legal moves.
2. It uses V to evaluate each resulting board state.
3. It selects the move leading to the board state with the highest score.
This simplifies learning and decision-making.
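The three-step move-selection loop above can be sketched in Python. This is only an illustration: `legal_moves` and `apply_move` are hypothetical helpers standing in for a real checkers engine, and `v_hat` can be any evaluation function.

```python
def choose_move(board, v_hat, legal_moves, apply_move):
    """Pick the legal move whose resulting board scores highest under v_hat."""
    return max(legal_moves(board),
               key=lambda m: v_hat(apply_move(board, m)))

# Tiny illustration: boards are plain integers, the two "moves" add or
# subtract 1, and v_hat simply prefers larger numbers.
best = choose_move(
    5,
    v_hat=lambda b: b,
    legal_moves=lambda b: [+1, -1],
    apply_move=lambda b, m: b + m,
)
```

With these toy helpers, `best` is the move `+1`, because it leads to the board with the higher score.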
3. Defining the Ideal Target Function V
To guide the learning process, we define:
V(b) = the ideal score for board state b.
The ideal (but theoretical) definition:
If the board is a win, V(b) = 100.
If it's a loss, V(b) = -100.
If it's a draw, V(b) = 0.
If it's not final, Then V(b) = V(b′) Where b′ is the best final board state that can be
reached starting from b, assuming both players play perfectly until the end.
4. Why This Ideal Definition Is Impractical
For non-final positions, V(b) is the score of the best possible end result that can
happen from that board.
This definition is called recursive, because it defines V(b) in terms of another
board state b′.
Problem to be noted:-
This definition of V(b) is:
Too slow to compute.
The system would need to:
Look ahead to every possible move
Simulate what the opponent might do
Go all the way to the end of the game
This is not practical. So we say this version of V(b) is not operational — it can’t be
used efficiently in a real-time game.
5. The Goal: Learn an Operational Approximation
Our goal is to learn an operational version of V, meaning:
A version of V that gives a good estimate of the board’s value quickly, without
going all the way to the end of the game.
That’s what learning is all about here:
Learn an approximation of the ideal V(b) that is good enough to play well.
Notation:
We call the learned version of V as V̂ (read "V hat").
It won’t be perfect.
But it will be close enough to the ideal V.
This process of learning is called function approximation.
1.2.3: Choosing a Representation for the Target Function
Importance of Representation:
So far we have decided:
☆ How the system will learn (by playing games against itself)
☆ What it will learn (a function V̂ that evaluates how good a board position is)
Now we need to decide:
How do we represent the function V̂ inside the program?
This step is very important, because the way we choose to represent the function
affects how well the system can learn it.
Why is Representation Important?
There are many ways to represent the function:
☆ We could make a huge lookup table that lists the exact value for every board
position.
But there are too many possible board positions—this isn’t practical.
☆ We could use something smarter:
A list of predefined board features
A polynomial formula
A neural network
All of these are just different ways to express the knowledge.
What makes a good representation?
☆ If the representation is very simple, it might be easy to compute but not accurate.
☆ If the representation is very complex, it might capture the target function well, but
needs a lot of training data and takes more time.
So we must strike a balance between:
☆ Being simple enough to learn quickly
☆ But expressive enough to represent useful knowledge
Chosen Simple Representation:
☆ Linear function based on board features.
6 Board Features Used:
1. Number of black pieces (x1).
2. Number of red pieces (x2).
3. Number of black kings (x3).
4. Number of red kings (x4).
5. Number of black pieces threatened by red (x5) [i.e., pieces that can be captured
on red's next turn].
6. Number of red pieces threatened by black (x6).
Function Representation:
☆ Thus, our learning program will represent V̂(b) as a linear function of the form:
V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
Where:
☆ b is a board position.
☆ V̂(b) gives the estimated score of board b.
☆ x₁ to x₆ are the feature values for that board.
☆ w₀ to w₆ are the weights (numbers the system learns).
The system learns the weights by training:
☆ A positive weight means the feature makes the board better.
☆ A negative weight means the feature makes the board worse.
☆ w₀ is a constant offset—it adjusts the baseline value.
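As a minimal sketch, the linear representation can be written as a one-line Python function. The weight values below are made up purely for illustration, not learned ones.

```python
def v_hat(features, weights):
    """Linear evaluation: weights = (w0, w1, ..., w6), features = (x1, ..., x6).
    Returns w0 + w1*x1 + ... + w6*x6."""
    w0, *ws = weights
    return w0 + sum(w * x for w, x in zip(ws, features))

# Illustrative weights only: reward black pieces/kings, penalize red's.
weights = (0, 1, -1, 3, -3, -1, 1)
opening = (12, 12, 0, 0, 0, 0)   # x1..x6 for the standard opening board
score = v_hat(opening, weights)  # 12 - 12 = 0: neither side is ahead
```

Learning then reduces to finding the seven numbers in `weights`; the function itself never changes shape.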
So far, we’ve completely defined our learning problem by deciding:
☆ Task (T): Playing checkers
☆ Performance measure (P): % of games won in the world tournament
☆ Training experience (E): Games played against itself
☆ Target function (V̂ ): Board → score
☆ Target function representation: V̂(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6
1.2.4 Choosing a Function Approximation Algorithm
We want to learn the target function, i.e., a function that predicts a value from the
current state of the system. In this case, the system is the checkers game and the
function is V̂.
Training Example Format
☆ To learn this function, we need training examples.
☆ Each training example has:
☆ A board state b — like the current position of pieces on the board.
☆ A training value Vtrain(b) — which tells us how good or bad that board state is (for
example, +100 if black is winning).
☆ So, each training example is an ordered pair ⟨b, Vtrain(b)⟩.
☆ This means: "Here's a board, and here's how good that board is."
Example Given
☆ A sample training example: ⟨⟨x1 = 3, x2 = 0, x3 = 1, x4 = 0, x5 = 0, x6 = 0⟩, +100⟩
☆ These x values describe features of the board (e.g., number of black pieces, red
pieces, kings, etc.)
☆ For example, x2 = 0 means red has no remaining pieces.
☆ So black has won the game.
☆ Therefore, the value of that board state is +100.
☆ This is just one training example. A learner would get many such examples to
learn from.
Once we have these training examples:
1. The system derives them from the indirect training experience (e.g., results of
previous games).
2. It then adjusts weights wi — these are parameters in the learning algorithm — so
the function it learns matches the training examples as closely as possible.
ESTIMATING TRAINING VALUES
☆ We are trying to train a program to evaluate how good a particular board state is in
a game (e.g., chess or checkers).
☆ We only know the final result of the game (win or lose).
☆ But we need to train the model on many board states from the middle of the game,
not just the end.
The challenge:
☆ How do we assign a score (training value) to those intermediate board states?
It is tricky because:
☆ A game might be lost at the end due to one bad move, but the earlier moves might
have been very good.
☆ So it's not accurate to assign a low score to all earlier board states just because the
final outcome was a loss.
Solution:
We use an estimation method:
☆ Assign the value of a board state b as the predicted value of its next state
Successor(b), using the current learned function V̂.
Rule: Vtrain(b) ← V̂(Successor(b))
That means:
☆ We use the current approximation V̂ of the function to estimate the value of a
state.
☆ We apply it to the next board state, and use that estimate as the training value for
the current state.
Why this works?:
☆ It’s a bootstrapping approach: we’re using the current version of the function to
improve itself.
☆ This works well because:
The learner's function tends to be more accurate near the end of the game
(where outcomes are clearer).
So those more confident estimates help pull earlier estimates in the right
direction.
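A small sketch of this bootstrapping rule, under simplifying assumptions: a game trace is just a list of board states, "successor" here is simply the next board in the trace, and the final board receives the true game outcome instead of an estimate.

```python
def training_values(trace, v_hat, final_value):
    """Vtrain(b) <- v_hat(Successor(b)) for intermediate boards;
    the final board gets the actual outcome (e.g., +100 for a win)."""
    pairs = [(b, v_hat(successor)) for b, successor in zip(trace, trace[1:])]
    pairs.append((trace[-1], final_value))
    return pairs

# Toy trace: boards are plain numbers and v_hat just scales them.
examples = training_values([1, 2, 3], v_hat=lambda b: 10 * b, final_value=100)
```

Note how the final, certain value anchors the estimates, and each earlier board inherits the current estimate of the board that followed it.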
ADJUSTING THE WEIGHTS
☆ Goal: to learn a good function V̂ that estimates how good a board state b is.
☆ We do this by adjusting the weights wi of the function to minimize error between:
The actual training value Vtrain(b), and
The predicted value V̂(b)
Error Function:
To measure how well our function is doing, we use the squared error formula:
E ≡ Σ over training examples ⟨b, Vtrain(b)⟩ of (Vtrain(b) − V̂(b))²
☆ Error Function: add up the square of the difference between the actual and
predicted values over all training examples.
☆ The goal is to minimize this error E.
How do we minimize this error?
☆ We use an algorithm to adjust the weights wi that define the function V̂(b).
☆ One such method is the Least Mean Squares (LMS) algorithm, a simple form of
gradient descent.
LMS Weight Update Rule:
For each training example ⟨b, Vtrain(b)⟩:
Use the current weights to calculate V̂(b).
For each weight wi, update:
wi ← wi + η (Vtrain(b) − V̂(b)) xi
Where:
η is a small constant (e.g., 0.1) that moderates the size of the weight update,
(Vtrain(b) − V̂(b)) is the prediction error, and
xi is the value of feature i for board b.
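One LMS step for the linear V̂ can be sketched directly (η = 0.1; the weights and the training example below are chosen for illustration only):

```python
def lms_update(weights, features, v_train, eta=0.1):
    """One LMS step: wi <- wi + eta * (v_train - v_hat(b)) * xi.
    weights = [w0, ..., w6]; features = [x1, ..., x6]; x0 = 1 pairs with w0."""
    xs = [1] + list(features)
    v_hat = sum(w * x for w, x in zip(weights, xs))
    error = v_train - v_hat
    return [w + eta * error * x for w, x in zip(weights, xs)]

# Board where black has won (x2 = 0), so v_train = +100; all weights start at 0.
new_w = lms_update([0, 0, 0, 0, 0, 0, 0], [3, 0, 1, 0, 0, 0], v_train=100)
```

Each step nudges every weight in proportion to its feature value and the prediction error, so repeated passes over the training examples drive E downward.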
1.2.5 The Final Design
☆ This topic talks about how a machine learning system is designed to learn how to
play checkers.
☆ The same design structure can be applied to other learning systems too.
☆ It involves 4 main components, like the different departments of a company
working together.
The Four Modules:
☆ Let’s say the learning system is like a student learning how to play checkers.
☆ These four parts act like teachers, examiners, and study guides for that student.
☆ 4 Modules are:
1. Performance System
2. Critic
3. Generalizer
4. Experiment Generator
The Performance System is the module that must solve the given performance
task, in this case playing checkers, by using the learned target function(s). It takes
an instance of a new problem (new game) as input and produces a trace of its
solution (game history) as output. In our case, the strategy used by the
Performance System to select its next move at each step is determined by the
learned evaluation function. Therefore, we expect its performance to improve
as this evaluation function becomes increasingly accurate.
The Critic takes as input the history or trace of the game and produces as output
a set of training examples of the target function. As shown in the diagram, each
training example in this case corresponds to some game state in the trace, along
with an estimate Vtrain of the target function value for this example. In our
example, the Critic corresponds to the training rule given by Equation (1.1).
The Generalizer takes as input the training examples and produces an output
hypothesis that is its estimate of the target function. It generalizes from the
specific training examples, hypothesizing a general function that covers these
examples and other cases beyond the training examples. In our example, the
Generalizer corresponds to the LMS algorithm, and the output hypothesis is the
function V̂ described by the learned weights w0, ..., w6.
The Experiment Generator takes as input the current hypothesis (currently
learned function) and outputs a new problem (i.e., initial board state) for the
Performance System to explore. Its role is to pick new practice problems that will
maximize the learning rate of the overall system. In our example, the Experiment
Generator follows a very simple strategy: It always proposes the same initial
game board to begin a new game. More sophisticated strategies could involve
creating board positions designed to explore particular regions of the state space.
Before finalizing the design of the checkers learning system, we must make several
key design choices. These include selecting the type of training experience, target
function, representation, and learning algorithm. Figure 1.2 summarizes these
decisions and the available options at each step.
Alternative Learning Methods
The same task (learning to play checkers) can be done using other techniques, like:
1. Nearest Neighbor – Store old games and find similar past situations to decide what
to do.
2. Genetic Algorithms – Create many checkers-playing programs, let them play each
other, and evolve the best ones.
3. Explanation-Based Learning – Think like humans: reflect on why certain moves
worked or failed.
1.3 PERSPECTIVES AND ISSUES IN MACHINE LEARNING
Machine learning can be seen as a process of searching through many possible
solutions (called hypotheses) to find the best one that fits the training data.
Hypothesis Space:-
☆ A hypothesis is just a guess or a rule about how to make predictions from input
data.
☆ Example: In the checkers game, a hypothesis could be a function that decides how
good a move is, based on certain features.
☆ The learner uses a formula with weights (w₀ to w₆), and by changing these weights,
it can make different guesses.
All the different possible weight combinations form a hypothesis space — a huge set
of possible solutions.
The Learner’s Job
The machine learning algorithm (the learner) tries to:
☆ Search through this space of hypotheses.
☆ Find the one that fits the training data as closely as possible.
How Do We Search for the Best Hypothesis?
☆ The learner (or algorithm) tries different combinations to see which one matches
the training data best.
☆ One way to do this is with the LMS (Least Mean Squares) algorithm.
This algorithm tweaks the weights step by step.
If the prediction is wrong, it makes a small correction to the weights.
Over time, it gets better at predicting.
Different Algorithms Use Different Representations
Not all problems can be solved the same way. That’s why we have different types of
hypothesis representations:
☆ Linear functions – good for simple numeric relationships.
☆ Logical descriptions – useful for rule-based reasoning.
☆ Decision trees – great for choices based on conditions.
☆ Neural networks – good for complex problems like image recognition.
Each type uses a different method to search through its own kind of hypothesis space.
Why This Perspective is Useful
This idea — learning as a search problem — helps us:
☆ Understand how learning works.
☆ Compare algorithms based on how they search.
☆ Analyze:
How big the hypothesis space is,
How much training data we need,
And how confident we can be that the learned model will work well on new,
unseen data.
1.3.1 Issues in Machine Learning
Our checkers example raises a number of generic questions about machine learning.
The field of machine learning, and much of this book, is concerned with answering
questions such as the following:
☆ What algorithms exist for learning general target functions from specific training
examples? In what settings will particular algorithms converge to the desired function,
given sufficient training data? Which algorithms perform best for which types of
problems and representations?
☆ How much training data is sufficient? What general bounds can be found to relate
the confidence in learned hypotheses to the amount of training experience and the
character of the learner's hypothesis space?
☆ When and how can prior knowledge held by the learner guide the process of
generalizing from examples? Can prior knowledge be helpful even when it is only
approximately correct?
☆ What is the best strategy for choosing a useful next training experience, and how
does the choice of this strategy alter the complexity of the learning problem?
☆ What is the best way to reduce the learning task to one or more function
approximation problems? Put another way, what specific functions should the system
attempt to learn? Can this process itself be automated?
☆ How can the learner automatically alter its representation to improve its ability to
represent and learn the target function?
2.1 INTRODUCTION
Much of learning is about understanding general concepts from specific examples.
For instance, people learn concepts like "bird," "car," or "when to study more to pass
an exam." These concepts represent a specific subset of items or events taken from a
larger set. For example, the concept "bird" includes a subset of animals.
Each concept can also be seen as a boolean-valued function — a function that gives
either true or false. For example, this function might take any animal as input and
return true if it is a bird and false otherwise.
This chapter focuses on the problem of automatically finding or learning the general
definition of a concept using labeled examples — examples marked as members (+)
or non-members (–) of that concept. This task is called concept learning, which means
approximating a boolean-valued function based on input-output examples.
Definition of Concept learning: Inferring a boolean-valued function from training
examples of its input and output.
2.2 A CONCEPT LEARNING TASK
To better understand concept learning, let’s consider a specific example:
We want to learn the target concept “days on which my friend Aldo enjoys his
favorite water sport.”
Setup:
Table 2.1 shows examples of days, with each day described using a set of
attributes like:
Sky, AirTemp, Humidity, Wind, Water, Forecast
Each example also shows whether Aldo enjoys his sport that day (EnjoySport =
Yes or No).
The task is to predict EnjoySport (Yes/No) based on the values of these six attributes.
Hypothesis Representation:
We need a way to represent hypotheses that can generalize from the examples.
Each hypothesis is a set (vector) of 6 conditions, one for each attribute.
Each attribute condition in the hypothesis can be:
“?” → Any value is acceptable for this attribute.
A specific value → Only this value is acceptable.(e.g., Warm)
“Ø” (the empty-set symbol) → No value is acceptable (used in the most specific
hypothesis).
So, a hypothesis is like a filter. If a day (example) matches all the conditions in
the hypothesis, it is classified as a positive example (i.e., EnjoySport = Yes).
Example Hypothesis:
Let’s say the hypothesis is: (?, Cold, High, ?, ?, ?)
This means:
Sky: any value
AirTemp: must be Cold
Humidity: must be High
Wind, Water, Forecast: any value
Any day that matches this pattern (Cold air and high humidity) will be predicted as a
day Aldo enjoys the sport.
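This "filter" behaviour can be sketched directly, representing hypotheses and instances as 6-tuples of strings in the attribute order above:

```python
def matches(hypothesis, instance):
    """True iff every attribute constraint accepts the instance's value:
    '?' accepts anything; 'Ø' accepts nothing (it never equals a real value);
    a specific value must match exactly."""
    return all(c == "?" or c == v for c, v in zip(hypothesis, instance))

h = ("?", "Cold", "High", "?", "?", "?")
cold_humid_day = ("Sunny", "Cold", "High", "Strong", "Warm", "Same")
warm_day = ("Sunny", "Warm", "High", "Strong", "Warm", "Same")
```

Here `matches(h, cold_humid_day)` is True (predicted EnjoySport = Yes), while `matches(h, warm_day)` is False, because AirTemp fails the "Cold" constraint.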
Note:
Boundary Hypotheses:
★ The most general hypothesis (everything is allowed, so every day is positive) is:
(?, ?, ?, ?, ?, ?)
★ The most specific hypothesis (no day is positive, allows nothing) is:
(Ø, Ø, Ø, Ø, Ø, Ø)
2.2.1 Notation
This section defines standard terminology used when talking about concept learning.
Definitions:
1. Instances (x):
An instance is one possible day described using the attributes:
Sky → values: Sunny, Cloudy, Rainy
AirTemp → values: Warm, Cold
Humidity → values: Normal, High
Wind → values: Strong, Weak
Water → values: Warm, Cool
Forecast → values: Same, Change
The set of all such possible days (combinations of attribute values) is called the
instance space, denoted by X.
2. Target Concept (c):
This is the function we are trying to learn.
It's a boolean-valued function defined over instances in X:- c : X → {0,1}
c(x) = 1: instance x belongs to the concept (i.e., Aldo enjoys his sport on that day)
c(x) = 0: instance x does not belong to the concept (i.e., Aldo does not enjoy it)
In the EnjoySport example:
⊛ c(x) = 1 if EnjoySport = Yes
⊛ c(x) = 0 if EnjoySport = No
3. Hypotheses (H):
A hypothesis describes a possible definition of the concept.
Each hypothesis is a vector of constraints on the 6 attributes.
For each attribute, it can:
Be “?” → any value is acceptable
Be a specific value (e.g., Sunny, Warm, etc.)
Be “Ø” → no value is acceptable for that attribute
A hypothesis h classifies an instance x as positive (i.e., Aldo enjoys the sport) if:
h(x) = 1
4. Training Examples (D):
A training example is a pair (x,c(x))
→ it includes both the instance and its correct label.
From Table 2.1:
Positive examples: c(x) = 1
Negative examples: c(x) = 0
Goal:
Find a hypothesis ℎ∈H such that:
h(x) = c(x) for all x∈X
→ i.e., the hypothesis must match the target concept on all examples in X.
2.2.2 The Inductive Learning Hypothesis
The problem:
We want a hypothesis (h) that matches the target concept (c) for all possible cases (X).
Target concept c: The real truth or correct rule we want to learn (but we don’t
know it fully).
Hypothesis h: Our guess/model that tries to match c.
X: All possible inputs or instances.
But… we only get training examples (a limited set of known cases). We don’t know c
for all possible situations.
The challenge:
Since we only see a small portion of all possible data:
We can’t be 100% sure h will be correct everywhere.
We can only check how well h matches c on the training data we have.
The inductive learning hypothesis:
If a hypothesis h does a good job of matching the target function c on a large enough
and representative set of training examples, it will probably do a good job on unseen
(new) examples too.
Importance:
This is the core assumption behind all inductive learning:
We generalize from past experience to future situations.
We assume patterns in the training data will also appear in new data.
Without this assumption, machine learning wouldn’t work — because the past would
tell us nothing about the future.
Example
Target concept c = “An animal is a cat if it has whiskers, sharp ears, and meows.”
Training data = 100 animals we’ve already checked (with their correct labels).
We train a model (h) that correctly labels all 100 animals in the training set.
The inductive learning hypothesis assumes:
If h works well on these 100 examples (and they represent the real world well), it will
also work well on new animals we’ve never seen before.
2.3 CONCEPT LEARNING AS SEARCH
Concept learning means finding a hypothesis that matches training data.
This process is like searching through a space of possible hypotheses (H) to find
the one that best fits the examples.
Why is it a search?
The hypothesis representation (the format we allow our hypothesis to take)
defines the search space.
The learning algorithm searches inside this space for the best match to the
training examples.
Example: EnjoySport
They give a small example with:
Instance space X: All possible situations (combinations of attribute values).
Attributes:
Sky: 3 values (Sunny, Cloudy, Rainy)
AirTemp: 2 values
Humidity: 2 values
Wind: 2 values
Water: 2 values
Forecast: 2 values
Total distinct instances: 3 x 2 x 2 x 2 x 2 x 2 = 96
So there are 96 different situations possible.
Hypothesis space H
A hypothesis specifies which instances are positive (e.g., “EnjoySport”) and
which are negative.
Syntactically distinct hypotheses: Different combinations of allowed symbols in
the representation.
The given calculation: 5 x 4 x 4 x 4 x 4 x 4 = 5120
Here, the numbers 5 and 4 come from:
Sky: 3 actual values + “?” (any value) + “Ø” (no value) → 5 possibilities
Other attributes: 2 actual values + “?” + “Ø” → 4 possibilities each
So: 5 × 4⁵ = 5 × 1024 = 5120 syntactically different hypotheses.
Semantic vs. syntactic distinctness
Syntactic: Different written forms (symbol arrangements).
Semantic: Different meanings (classifications of instances).
Many syntactically different hypotheses mean the same thing.
Example: Any hypothesis with “Ø” for an attribute means no instances are
positive → empty set → same meaning.
After removing duplicates in meaning: 1 + (4 × 3 × 3 × 3 × 3 × 3) = 973
Note:
Here:
1 = the reject-all case (any hypothesis containing Ø)
4 × 3 × 3 × 3 × 3 × 3 = 972 = all hypotheses with no Ø (just specific values or “?”)
Total = 973 semantically distinct hypotheses
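These counts are easy to verify programmatically; the loop below is just a sanity check of the arithmetic above:

```python
values = [3, 2, 2, 2, 2, 2]   # distinct values per attribute (Sky first)

instances = 1
syntactic = 1
semantic_no_null = 1
for v in values:
    instances *= v             # every combination of attribute values
    syntactic *= v + 2         # values plus "?" plus "Ø"
    semantic_no_null *= v + 1  # values plus "?" (no "Ø")

semantic = semantic_no_null + 1   # plus the single reject-all hypothesis
```

This reproduces 96 instances, 5120 syntactically distinct hypotheses, and 973 semantically distinct ones.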
2.3.1 General-to-Specific Ordering of Hypotheses
When we are searching for the best hypothesis in concept learning, we can organize
the hypotheses in a way that makes searching easier. One very useful method is to
arrange them from most general to most specific.
How this helps:
Instead of blindly checking every hypothesis in the hypothesis space, we can use
the general-to-specific ordering to navigate more systematically.
This way, the algorithm doesn’t have to enumerate every single possibility one by
one.
Example setup
We have two hypotheses:
h1 = (Sunny, ?, ?, Strong, ?, ?)
h2 = (Sunny, ?, ?, ?, ?, ?)
Each hypothesis is like a rule about when the target concept is positive.
The “?” means “any value is okay” for that attribute.
Hypothesis h1 says:
Sky = Sunny Wind = Strong
AirTemp = anything (?) Water = anything (?)
Humidity = anything (?) Forecast = anything (?)
Hypothesis h2 is the same except it allows any value for Wind as well (it’s “?”
instead of “Strong”).
h2 puts fewer restrictions than h1.
That means more instances will be classified as positive by h2 compared to h1.
In fact, every instance that h1 classifies as positive will also be classified positive
by h2 — but not the other way around.
Therefore: h2 is more general than h1.
Definition:
Let hj and hk be boolean-valued functions defined over X. Then hj is more general
than or equal to hk (written hj ≥g hk) if and only if:
(∀x ∈ X) [(hk(x) = 1) → (hj(x) = 1)]
Meaning:
If hk says an instance is positive, hj must also say it's positive.
This means hj is at least as general as hk.
If hj is more general than but not equal to hk, we write hj >g hk ("strictly more general")
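For the conjunctive representation, this relation can be checked attribute by attribute. A sketch (note the special case: a hypothesis containing Ø classifies every instance negative, so every hypothesis is ≥g it):

```python
def more_general_or_equal(hj, hk):
    """hj >=g hk for 6-tuple hypotheses over the same attributes."""
    if "Ø" in hk:                 # hk accepts nothing, so anything covers it
        return True
    # Otherwise each constraint of hj must accept whatever hk's does:
    # "?" accepts anything; a specific value must match exactly.
    return all(cj == "?" or cj == ck for cj, ck in zip(hj, hk))

h1 = ("Sunny", "?", "?", "Strong", "?", "?")
h2 = ("Sunny", "?", "?", "?", "?", "?")
h3 = ("?", "?", "?", "Strong", "?", "?")
```

With the h1, h2, h3 used later in this section, `more_general_or_equal(h2, h1)` holds, while h2 and h3 are incomparable in both directions.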
Diagram:
Instance space X:
The outer box represents all possible instances.
The smaller shapes represent the sets of instances each hypothesis accepts
(classifies as positive).
Larger shapes mean more general hypotheses (cover more instances).
Hypothesis space H:
Hypotheses are plotted so that more general ones are towards the top, more
specific towards the bottom.
An arrow from ha to hb means ha is immediately more general than hb (no hypothesis
lies between them).
Comparing multiple hypotheses:
They compare h1, h2, and h3 from the EnjoySport example:
h1 = (Sunny, ?, ?, Strong, ?, ?)
h2 = (Sunny, ?, ?, ?, ?, ?)
h3 = (?, ?, ?, Strong, ?, ?)
Observations:
h2 is more general than h1 because it replaces "Strong" with "?" in Wind.
But h2 and h3 cannot be compared as more-general-than or less-general-than:-
Some instances match h2 but not h3.
Some match h3 but not h2.
This is called incomparability.
Why this ordering matters:
The more-general-than relation organizes hypotheses into a partial order (not
every pair can be compared).
It’s useful because many learning algorithms (like Version Space search) rely on
moving from general to specific or vice versa.
It allows the search to be systematic — moving through hypotheses in an ordered
way instead of random guessing.
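The more-general-than test can be checked mechanically. Below is a minimal Python sketch (the helper names are mine, not from the text) that brute-forces hj ≥g hk over all 96 EnjoySport instances, confirming both the h1/h2 ordering and the h2/h3 incomparability:

```python
from itertools import product

# Attribute domains of the EnjoySport instance space (3*2*2*2*2*2 = 96 instances).
DOMAINS = [("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
           ("Strong", "Light"), ("Warm", "Cool"), ("Same", "Change")]

def matches(h, x):
    """True if instance x satisfies hypothesis h ('?' accepts any value)."""
    return all(c == "?" or c == v for c, v in zip(h, x))

def more_general_or_equal(hj, hk):
    """hj >=g hk iff every instance hk classifies positive, hj does too."""
    return all(matches(hj, x) for x in product(*DOMAINS) if matches(hk, x))

h1 = ("Sunny", "?", "?", "Strong", "?", "?")
h2 = ("Sunny", "?", "?", "?", "?", "?")
h3 = ("?", "?", "?", "Strong", "?", "?")

print(more_general_or_equal(h2, h1))  # True: h2 covers everything h1 covers
print(more_general_or_equal(h1, h2))  # False: not the other way around
# h2 and h3 are incomparable: neither covers all of the other's positives.
print(more_general_or_equal(h2, h3), more_general_or_equal(h3, h2))  # False False
```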
2.4 FIND-S: FINDING A MAXIMALLY SPECIFIC HYPOTHESIS
FIND-S uses the more-general-than partial order over hypotheses to search for a
hypothesis consistent with the observed training examples. Instead of
enumerating many hypotheses, FIND-S starts at the most specific hypothesis and
generalizes it only as needed to cover positive training examples. This idea builds
directly on the general-to-specific ordering introduced above.
Algorithm:
1. Initialize h to the most specific hypothesis in H.
2. For each positive training instance x:
For each attribute constraint ai in h:
If the constraint ai is satisfied by x, do nothing;
Otherwise, replace ai in h by the next more general constraint that is satisfied by x.
3. Output hypothesis h.
Next more general constraint (how updates happen):
In conjunctive attribute constraints:
Placeholder Ø → change to specific value from x
Specific value → change to “?” if mismatch with x
Change only attributes that conflict with the positive example.
Example:
Start:
h0 = (Ø, Ø, Ø, Ø, Ø, Ø)
First positive example: (Sunny, Warm, Normal, Strong, Warm, Same)
-> Replace each Ø with the example's value:
h1 = (Sunny, Warm, Normal, Strong, Warm, Same)
Second positive example: (Sunny, Warm, High, Strong, Warm, Same)
3rd attr: Normal ≠ High -> change to "?".
New hypothesis:
h2 = (Sunny, Warm, ?, Strong, Warm, Same)
Third example: negative (Rainy, Cold, High, Strong, Warm, Change)
FIND-S ignores it -> no change.
Reason: under the assumptions (true concept in H, no errors), the current h cannot
misclassify negatives.
Fourth positive example: (Sunny, Warm, High, Strong, Cool, Change)
5th attr: Warm ≠ Cool -> "?".
6th attr: Same ≠ Change -> "?".
Final hypothesis:
h3 = (Sunny, Warm, ?, Strong, ?, ?)
Why Negatives Are Ignored
h is always the most specific hypothesis consistent with the positives so far.
If c (the target concept) ∈ H and matches all positives:
c ≥g h in generality.
Since c never covers negatives, h won't either.
No revision is needed for negatives.
Search Path in Hypothesis Space
Moves along one chain: most specific → progressively more general.
Generalize only as far as necessary to cover each positive.
At all times: hypothesis = most specific consistent with positives seen.
Illustrated in Figure 2.2 (instance + hypothesis spaces).
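The trace above can be expressed as a short program. This is a hedged sketch of FIND-S (the function name and data layout are assumptions, not from the text); on the four EnjoySport examples it reproduces the final hypothesis h3:

```python
# A minimal FIND-S sketch for conjunctive hypotheses.
# Attributes are strings; 'Ø' is the null constraint, '?' matches any value.
def find_s(examples):
    """examples: list of (instance_tuple, label) pairs; label True = positive."""
    n = len(examples[0][0])
    h = ["Ø"] * n  # most specific hypothesis: matches nothing
    for x, positive in examples:
        if not positive:
            continue  # FIND-S ignores negative examples
        for i, value in enumerate(x):
            if h[i] == "Ø":
                h[i] = value          # first positive: copy its values
            elif h[i] != value:
                h[i] = "?"            # conflict: generalize to 'any value'
    return tuple(h)

training = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
print(find_s(training))
# -> ('Sunny', 'Warm', '?', 'Strong', '?', '?')
```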
Note
For conjunctive attribute constraints:
FIND-S always outputs the most specific hypothesis in H consistent with
positives.
Also consistent with negatives if:
1. Target concept is in H.
2. Training examples are error-free.
Limitations
1. No convergence test - can't tell if it found the correct concept or if others fit too.
2. Bias toward most specific - may not be optimal choice.
3. Ignores negatives - can't detect inconsistency/noise.
4. Multiple maximally specific hypotheses - requires backtracking if not unique.
5. Some H may have no maximally specific consistent hypothesis (theoretical case).
(Refer to the class notes for the problems)
2.5 VERSION SPACES AND THE CANDIDATE-ELIMINATION
ALGORITHM
Limitations of FIND-S
What FIND-S does:
It searches for a single hypothesis in the hypothesis space H that fits (is consistent
with) all the training examples given.
Problem:
There is usually more than one hypothesis in H that can explain the same training
data equally well.
FIND-S only returns one of them, so we lose information about other equally
possible hypotheses.
Idea of Candidate-Elimination Algorithm
Improvement:
Instead of finding just one hypothesis, it finds all hypotheses in H that are
consistent with the training data.
These hypotheses together form the Version Space (set of all consistent
hypotheses).
Note:
It does not explicitly list every single hypothesis (which could be huge or infinite
in number).
Instead, it uses a compact representation of the version space.
How it works
It uses a more-general-than partial ordering to organize hypotheses.
As each new training example comes in:
The algorithm incrementally refines the set of possible hypotheses.
This ensures we always keep only those hypotheses that remain consistent
with all examples so far.
Applications in research
Has been applied in:
Chemical mass spectroscopy (Mitchell, 1979) — learning patterns in
chemical analysis data.
Learning control rules for heuristic search (Mitchell et al., 1983) —
learning strategies to guide problem-solving searches.
Limitations in practice:
Like FIND-S, it performs poorly when training data contains noise (errors or
inconsistencies).
Because even one noisy (wrong) example can wrongly eliminate many correct
hypotheses.
Importance for learning theory
Even though it is not widely used in noisy-data scenarios, it is important
conceptually because:
It provides a framework for discussing important ideas in machine learning.
2.5.1 Representation
Purpose of the Candidate-Elimination Algorithm
The Candidate-Elimination algorithm aims to find all describable hypotheses that
are consistent with the training examples we have observed so far.
To define this algorithm precisely, we first need some basic definitions.
Representation of the Hypotheses Set
The Candidate-Elimination algorithm represents the set of all hypotheses that are
consistent with the observed training examples.
This subset of all possible hypotheses is called the Version Space.
Definition of Version Space:
The version space, denoted VS_{H,D}, with respect to hypothesis space H and
training examples D, is the subset of hypotheses from H consistent with the
training examples in D:
VS_{H,D} ≡ {h ∈ H | Consistent(h, D)}
Meaning:
Take all hypotheses in H.
Keep only those that correctly classify every example in D.
That’s your version space: all plausible “versions” of the target concept.
2.5.2 The LIST-THEN-ELIMINATE Algorithm
Idea of the List-Then-Eliminate Algorithm
One simple way to represent a version space is to list all of its members.
This leads to a straightforward learning method called List-Then-Eliminate.
Algorithm:
1. VersionSpace ← a list containing every hypothesis in H.
2. For each training example ⟨x, c(x)⟩: remove from VersionSpace any hypothesis h
for which h(x) ≠ c(x).
3. Output the list of hypotheses in VersionSpace.
Outcome of the Algorithm:
If enough training examples are provided, the algorithm will narrow the version
space down to a single correct hypothesis.
If insufficient data is available, it will not be able to identify a single hypothesis.
In that case, the algorithm outputs the entire set of hypotheses still consistent
with the observed data.
Conditions for Using the Algorithm:
It works in principle when the hypothesis space H is finite.
Advantages
It guarantees that all hypotheses consistent with the training data will be output.
This ensures no valid possibilities are missed.
Limitations
It requires exhaustively listing all hypotheses in H.
This is impractical for large or infinite hypothesis spaces — feasible only for very
small, trivial cases.
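The EnjoySport space is exactly such a small case (973 conjunctive hypotheses), so LIST-THEN-ELIMINATE can be sketched directly. This is an assumed implementation, not the textbook's code; the single all-Ø hypothesis is omitted since it survives no positive example:

```python
from itertools import product

# Attribute domains of the EnjoySport instance space.
DOMAINS = [("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
           ("Strong", "Light"), ("Warm", "Cool"), ("Same", "Change")]

def matches(h, x):
    """True if instance x satisfies hypothesis h ('?' accepts any value)."""
    return all(c == "?" or c == v for c, v in zip(h, x))

def list_then_eliminate(examples):
    # Step 1: list every conjunctive hypothesis (a value or '?' per attribute).
    version_space = list(product(*[vals + ("?",) for vals in DOMAINS]))
    # Step 2: eliminate any hypothesis that misclassifies a training example.
    for x, label in examples:
        version_space = [h for h in version_space if matches(h, x) == label]
    return version_space

training = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
vs = list_then_eliminate(training)
print(len(vs))  # 6: the familiar EnjoySport version space
```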
2.5.3 A More Compact Representation for Version Spaces
Motivation for a More Compact Representation
The Candidate-Elimination algorithm works on the same basic principle as LIST-
THEN-ELIMINATE:
Keep only the hypotheses that are consistent with all training examples.
Problem with LIST-THEN-ELIMINATE:
It stores every hypothesis in the version space explicitly.
This can be huge and impractical for large hypothesis spaces.
Solution:
Use a more compact representation by storing only:
Most general members of the version space.
Most specific members of the version space.
These form boundary sets (general boundary and specific boundary) that define
the version space.
The Boundaries in the Hypothesis Space
The hypothesis space can be partially ordered using the more-general-than
relation.
The version space lies between:
General boundary set (G): maximally general consistent hypotheses.
Specific boundary set (S): maximally specific consistent hypotheses.
Any hypothesis between S and G (in the ordering) is part of the version space.
Version space representation theorem:
VS_{H,D} = {h ∈ H | (∃s ∈ S)(∃g ∈ G) (g ≥g h ≥g s)}
Meaning: every consistent hypothesis h sits "between" some most-specific consistent
rule s and some most-general consistent rule g. Conversely, anything that lies between
a boundary s and a boundary g is itself consistent.
2.5.4 CANDIDATE-ELIMINATION Learning Algorithm
Goal: compute the version space, i.e., all hypotheses in H that are consistent with the
observed training examples.
Initial point: pretend everything in H could still be right, but summarize that huge set
using just two boundaries:
General boundary G starts as the single most general hypothesis in H:
G0 ← {(?, ?, ?, ?, ?, ?)}
"?" means any value; this hypothesis says every instance is positive.
Specific boundary S starts as the single most specific hypothesis in H:
S0 ← {(∅, ∅, ∅, ∅, ∅, ∅)}
Here ∅ is a special value that matches no value, so this hypothesis classifies
nothing as positive.
Why these two delimit the whole space: every hypothesis in H is more general
than S0 and more specific than G0. So S0 and G0 fence in the entire hypothesis
space.
Update idea as data arrives: for each new example,
Positive example: push S outward (make it more general just enough to
include the example) and prune any G members that reject the positive.
Negative example: push G inward (make it more specific just enough to
exclude the example) and prune any S members that would accept the
negative.
After all examples, the hypotheses lying between the final S and G (in the "more-
general-than" order) are exactly the hypotheses consistent with the data: no more,
no less.
Algorithm:
Initialize G to the set of maximally general hypotheses in H, and S to the set of
maximally specific hypotheses in H. Then, for each training example d:
If d is a positive example: remove from G any hypothesis inconsistent with d; for
each hypothesis s in S that is inconsistent with d, replace s by all of its minimal
generalizations that are consistent with d and that have some member of G more
general than them; remove from S any hypothesis that is more general than
another hypothesis in S.
If d is a negative example: remove from S any hypothesis inconsistent with d; for
each hypothesis g in G that is inconsistent with d, replace g by all of its minimal
specializations that are consistent with d and that have some member of S more
specific than them; remove from G any hypothesis that is less general than
another hypothesis in G.
Note:
Positive example: generalize S, prune G.
Negative example: specialize G, prune S.
The two boundaries “close in” on the true concept from opposite directions.
The description uses abstract operations:
Computing minimal generalizations/specializations,
Testing the generality order between hypotheses,
Recognizing and removing non-maximal / non-minimal members.
The implementation of those operations depends on how you represent
hypotheses (attribute vectors, rules, intervals, etc.).
The method applies to any concept learning problem where H is partially ordered
by a more-general-than relation (which is true for most common hypothesis
spaces).
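For the conjunctive attribute-vector representation, the abstract operations above can be made concrete. The following is my own sketch (not the textbook's pseudocode verbatim); on the four EnjoySport examples it reproduces the familiar final S and G boundaries:

```python
# Candidate-Elimination for conjunctive hypotheses, as a sketch.
# '?' = any value, 'Ø' = no value.
DOMAINS = [("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
           ("Strong", "Light"), ("Warm", "Cool"), ("Same", "Change")]

def matches(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

def ge(hj, hk):
    """Syntactic test: hj is more general than or equal to hk."""
    return all(cj == "?" or ck == "Ø" or cj == ck for cj, ck in zip(hj, hk))

def candidate_elimination(examples):
    S = [("Ø",) * len(DOMAINS)]   # specific boundary
    G = [("?",) * len(DOMAINS)]   # general boundary
    for x, positive in examples:
        if positive:
            G = [g for g in G if matches(g, x)]        # prune G
            new_S = []
            for s in S:
                if matches(s, x):
                    new_S.append(s)
                    continue
                # minimal generalization of s that covers x
                h = tuple(v if c == "Ø" else (c if c == v else "?")
                          for c, v in zip(s, x))
                if any(ge(g, h) for g in G):           # keep only if under some g
                    new_S.append(h)
            S = [s for s in new_S                      # drop non-minimal members
                 if not any(t != s and ge(s, t) for t in new_S)]
        else:
            S = [s for s in S if not matches(s, x)]    # prune S
            new_G = []
            for g in G:
                if not matches(g, x):
                    new_G.append(g)
                    continue
                # minimal specializations of g that exclude x
                for i, c in enumerate(g):
                    if c != "?":
                        continue
                    for v in DOMAINS[i]:
                        if v != x[i]:
                            h = g[:i] + (v,) + g[i + 1:]
                            if any(ge(h, s) for s in S):
                                new_G.append(h)
            G = [g for g in new_G                      # drop non-maximal members
                 if not any(t != g and ge(t, g) for t in new_G)]
    return S, G

training = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
S, G = candidate_elimination(training)
print(S)          # [('Sunny', 'Warm', '?', 'Strong', '?', '?')]
print(sorted(G))  # [('?', 'Warm', '?', '?', '?', '?'), ('Sunny', '?', '?', '?', '?', '?')]
```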
2.5.5 An Illustrative Example
(Note:
1. Read the theory from the textbook.
2. Refer to the class notes for the problems.)
2.6 REMARKS ON VERSION SPACES AND CANDIDATE-ELIMINATION
2.6.1 Will the CANDIDATE-ELIMINATION Algorithm converge to the
Correct Hypothesis?
✅ Case 1: When everything is perfect
Conditions for success:
1. The training examples have no errors (every positive/negative label is correct).
2. The true target concept (the actual rule you’re trying to learn) exists somewhere in
the hypothesis space H.
What happens then?
As the algorithm sees more examples:
The S boundary (the most specific hypothesis consistent with the data) and
the G boundary (the most general hypothesis consistent with the data) move
closer together.
Eventually, they converge to one single hypothesis — and that’s the true
target concept.
Example: If you’re learning "animals that can fly," and your hypothesis space can
describe it perfectly, then as you see more examples (birds, bats, etc.), S and G will
shrink until they exactly describe the concept "can fly."
❌ Case 2: When training data has errors
Suppose you accidentally label a positive example as negative.
Example: A bird (which can fly) is wrongly labeled as "cannot fly."
What happens?
The algorithm immediately removes every hypothesis that says "bird can fly."
This means the correct target concept gets kicked out of the version space.
Result: The learner can never recover the true concept from that point onward (at
least with the current data).
However, if the learner keeps getting more data, it may notice something strange:
The S and G sets never overlap properly.
Eventually, the version space becomes empty (meaning no hypothesis fits all
the examples).
This emptiness is a warning signal that something is wrong:
Either the training data had errors, or
The true concept is not expressible in the chosen hypothesis space.
❌ Case 3: When the hypothesis space is too weak
Sometimes, even if data is correct, the hypothesis space H cannot describe the real
concept.
Example:
Suppose the target concept is “red OR round objects”.
But your hypothesis space only allows AND (conjunctions), not OR.
Then no matter what data you feed, the algorithm won’t find the true concept.
Again, the version space collapses to empty, showing "no solution in this
hypothesis space."
2.6.2 What Training Example Should the Learner Request Next?
The Usual Setup
Normally, we assume the learner is just given training examples
But here, we imagine a different situation:
The learner can choose an instance (a new example, like a weather condition)
and ask the teacher/oracle (external source) for the correct classification.
This is called a query.
Example: Instead of waiting for nature to show "Sunny, Warm, Normal, Light, Warm,
Same," the learner actively asks: “If it is Sunny, Warm, Normal, Light, Warm, Same
— will we play or not?”
Why is Querying Important?
The learner’s goal is to shrink the version space (the set of all hypotheses still
consistent with data).
A good query is one that helps the learner eliminate as many wrong hypotheses as
possible.
Example Given:
From the EnjoySport concept, suppose the learner has 4 training examples so far.
Currently, there are 6 hypotheses in the version space (still possible explanations
for the target concept).
The learner now considers querying the instance:
(Sunny, Warm, Normal, Light, Warm, Same)
Why is this a Good Query?
This instance is classified positive by some hypotheses in the version space but
negative by others.
That means, no matter what the teacher says:
If teacher says Positive → the learner eliminates all hypotheses that said
“Negative.”
The S boundary gets generalized (more inclusive).
If teacher says Negative → the learner eliminates all hypotheses that said
“Positive.”
The G boundary gets specialized (more restrictive).
Either way, the version space shrinks — from 6 hypotheses to fewer ones.
This means the learner has gained useful information, regardless of the answer.
General Strategy (Optimal Query Strategy)
The best strategy is to query instances that are:
1. Classified positive by some hypotheses and negative by others.
2. This way, whatever the true answer is, the version space shrinks by half (or nearly
half).
This process is like the game “Twenty Questions”, where each yes/no question
should eliminate about half of the possibilities.
After enough queries, only the true target hypothesis will remain.
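The value of this query can be checked by brute force. The sketch below (assumed code, reusing the tuple representation from earlier sections) recovers the six-hypothesis version space and shows that the proposed instance splits it exactly in half:

```python
from itertools import product

DOMAINS = [("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
           ("Strong", "Light"), ("Warm", "Cool"), ("Same", "Change")]

def matches(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

training = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
# Brute-force version space over all conjunctive hypotheses (no Ø members).
vs = [h for h in product(*[vals + ("?",) for vals in DOMAINS])
      if all(matches(h, x) == label for x, label in training)]

query = ("Sunny", "Warm", "Normal", "Light", "Warm", "Same")
votes = sum(matches(h, query) for h in vs)
print(len(vs), votes)  # 6 hypotheses; 3 say positive, 3 say negative
```

Whatever label the teacher returns, half of the six hypotheses are eliminated, which is exactly the "Twenty Questions" behavior described above.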
2.6.3 How Can Partially Learned Concepts Be Used?
The Situation:
Suppose the learner hasn’t yet fully converged on the correct target concept.
That means the version space still contains multiple hypotheses (not just one).
The learner is now asked to classify new, unseen instances.
Question: How can the learner make predictions when it hasn’t nailed down the
exact target concept yet?
Key Idea
Even if the learner doesn’t know the exact target concept, it can still make useful
predictions using the current version space.
Example with Table 2.6
We have four new instances (A, B, C, D) to classify with the EnjoySport concept.
How to Classify with a Version Space
Case 1: Unanimous Agreement
If all hypotheses in the version space classify an instance the same way → we can
confidently assign that label.
Example:
Instance A is classified positive by every hypothesis in the version space.
Therefore, learner can safely say "Yes" (positive).
This is as good as if the learner had already found the exact target concept.
Similarly, B is classified negative by every hypothesis in the version space.
So the learner can confidently say "No."
Case 2: Disagreement
If some hypotheses predict positive and others predict negative, the learner
cannot be certain.
Example:
Instance C: half the hypotheses say positive, half say negative.
Here the learner can’t confidently decide.
This is a situation of ambiguity.
Case 3: Borderline Cases
Sometimes, the version space shrinks so much that only a few hypotheses remain.
If the majority predict one way, the learner could use that as a probabilistic guess.
General Rule
Safe prediction: If all hypotheses agree, the learner can classify with high
confidence.
Ambiguous prediction: If hypotheses disagree, the learner either:
Refuses to classify, or
Provides a probabilistic prediction (like "60% chance positive") based on
how many hypotheses agree.
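These rules can be folded into one classification routine over the version space. A sketch, assuming the brute-force representation used earlier; the three instances below correspond to the unanimous-positive, unanimous-negative, and ambiguous cases:

```python
from itertools import product

DOMAINS = [("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
           ("Strong", "Light"), ("Warm", "Cool"), ("Same", "Change")]

def matches(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

training = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]
vs = [h for h in product(*[vals + ("?",) for vals in DOMAINS])
      if all(matches(h, x) == label for x, label in training)]

def classify(x):
    """Vote over the version space: unanimous -> confident label, else ambiguous."""
    pos = sum(matches(h, x) for h in vs)
    if pos == len(vs):
        return "Yes"
    if pos == 0:
        return "No"
    return f"ambiguous ({pos}/{len(vs)} positive)"

print(classify(("Sunny", "Warm", "Normal", "Strong", "Cool", "Change")))  # Yes
print(classify(("Rainy", "Cold", "Normal", "Light", "Warm", "Same")))     # No
print(classify(("Sunny", "Warm", "Normal", "Light", "Warm", "Same")))     # ambiguous (3/6 positive)
```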
2.7 INDUCTIVE BIAS
Background on Candidate-Elimination
The Candidate-Elimination algorithm works by narrowing down the set of all
possible hypotheses (called the version space) until only those hypotheses that are
consistent with the training examples remain.
It will converge to the correct target concept if:
It is given accurate training examples (no errors in data).
The target concept is present in the hypothesis space (i.e., the set of
hypotheses you consider contains the real answer).
The Problem if Target Concept Is Missing
Suppose the target concept is not in the hypothesis space (the set of possible
hypotheses you chose to search through).
Example: If the target concept is “rectangles,” but your hypothesis space only
considers “triangles,” then the correct concept cannot be found.
Question raised: Can we avoid this problem by just including every possible
hypothesis in the hypothesis space?
That means making the hypothesis space so large that no matter what the
target concept is, it will be in there.
Consequences of Making Hypothesis Space Very Large
If we do that, new questions arise:
How does the size of the hypothesis space affect generalization?
Generalization means: after learning from training data, how well can the
algorithm correctly classify unseen (new) examples?
How does the size of the hypothesis space affect the number of training
examples needed?
Intuitively: A larger hypothesis space means there are many more
possible consistent hypotheses. So, you need more training examples to
eliminate the wrong ones and pin down the correct one.
Why These Are Fundamental Questions
These are not just issues for Candidate-Elimination, but for inductive inference in
general.
Inductive inference = learning general rules from specific examples.
In short: we will study these issues in the context of Candidate-Elimination, but the
lessons learned apply to any concept learning system that outputs hypotheses
consistent with the training data.
Note:
∧ (also written “.” or “*”): logical AND
∨: logical OR
2.7.1 A Biased Hypothesis Space
Goal:
We want the hypothesis space to contain the true target concept.
Idea: expand the hypothesis space to include every possible hypothesis.
Problem (Restriction in EnjoySport example):
In the EnjoySport case, the hypothesis space was restricted to only conjunctions
of attribute values (rules using AND).
Because of this restriction, it cannot represent disjunctive (OR) concepts like:
“Sky = Sunny OR Sky = Cloudy.”
Training Examples (Table):
Example 1: ⟨Sunny, Warm, Normal, Strong, Cool, Change⟩ → Positive
Example 2: ⟨Cloudy, Warm, Normal, Strong, Cool, Change⟩ → Positive
Example 3: ⟨Rainy, Warm, Normal, Strong, Cool, Change⟩ → Negative
The true concept here is something like: “EnjoySport if Sky = Sunny OR Cloudy.”
But our restricted hypothesis space cannot represent that.
Why no hypothesis fits:
The most specific hypothesis that matches Examples 1 and 2 is:
? for Sky, since Example 1 had Sunny and Example 2 had Cloudy.
Other attributes stay fixed (Warm, Normal, Strong, Cool, Change).
But this hypothesis is already too general: it also matches Example 3 (Rainy),
which is a negative example.
Therefore, no hypothesis in the restricted hypothesis space is consistent with all
three examples.
Conclusion:
The learner has been biased to only consider conjunctive hypotheses.
This bias prevents it from learning disjunctive concepts.
To fix this, we need a more expressive hypothesis space.
2.7.2 An Unbiased Learner
The Idea
To guarantee that the target concept is always inside the hypothesis space H, the
most obvious solution is to make H capable of representing every possible subset
of the instance space X.
Mathematically, the set of all subsets of X is called the power set of X.
EnjoySport Example
In EnjoySport, each instance of a day is described using 6 attributes.
The total number of possible instances is |X| = 96
(3 × 2 × 2 × 2 × 2 × 2 = 96).
Now:
How many different concepts can be defined over this space?
Answer: the size of the power set = 2^96.
That's an enormous number → each concept is just a subset of all possible
days.
Compare this to the earlier restricted hypothesis space (only conjunctions of
attribute values):
That space could represent only 973 concepts [(4 × 3 × 3 × 3 × 3 × 3) + 1 = 973].
So the original was very biased (too restrictive).
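The counts above follow from simple arithmetic; a quick check (an assumed snippet, using the document's attribute sizes):

```python
from math import prod

domain_sizes = [3, 2, 2, 2, 2, 2]           # Sky has 3 values, the rest have 2
num_instances = prod(domain_sizes)           # |X| = 96
# Conjunctive space: each attribute takes a value or '?', plus one all-Ø hypothesis.
num_conjunctive = prod(d + 1 for d in domain_sizes) + 1
num_concepts = 2 ** num_instances            # size of the power set of X
print(num_instances, num_conjunctive)        # 96 973
print(num_concepts)                          # astronomically larger than 973
```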
Defining an unbiased hypothesis space
To remove bias, let's define a new hypothesis space H' = power set of X.
This means H' can represent any possible subset of instances.
Another way: H' allows arbitrary conjunctions, disjunctions, and negations of
attribute values.
Example: the disjunctive target concept
"Sky = Sunny OR Sky = Cloudy"
can be written as:
⟨Sunny, ?, ?, ?, ?, ?⟩ ∨ ⟨Cloudy, ?, ?, ?, ?, ?⟩
Using Candidate-Elimination with H'
With this new unbiased hypothesis space, we can safely run Candidate-
Elimination.
We don't have to worry that the target concept is missing from H'.
BUT → this causes a new, worse problem:
The learner cannot generalize beyond the examples it has seen.
Why Generalization Fails
Suppose we train on:
Three positive examples x1, x2, x3
Two negative examples x4, x5
Then:
The S boundary will just be the disjunction of the observed positives:
S = {x1 ∨ x2 ∨ x3}
→ This is the most specific hypothesis covering those positives.
The G boundary will just be the negation of the disjunction of the observed negatives:
G = {¬(x4 ∨ x5)}
→ This rules out only those negatives.
Therefore:
S = “just the positives”
G = “everything except the negatives”
Result: the only examples classified unambiguously are the training examples
themselves.
To truly converge to the correct target concept, we’d need to present all 96
instances as training examples.
Why Voting Doesn't Help
Could we avoid this problem by taking a vote among version space hypotheses?
Unfortunately, no.
Why?
For any unseen instance, exactly half the hypotheses in the version space classify
it positive, and the other half classify it negative.
This happens because:
If a hypothesis h classifies x as positive, there exists another hypothesis h'
(also in the power set) that is identical to h but flips its classification of x.
Since both agree on all training examples, both are in the version space.
So for unobserved examples, voting is useless → it will always split evenly.
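This even split can be demonstrated on a toy problem. The sketch below (an assumed setup: a 4-instance space and its 16-concept power set as H') shows that after one positive and one negative example, every unseen instance is classified positive by exactly half of the surviving hypotheses:

```python
from itertools import chain, combinations

# Toy instance space of 4 instances; H' = its power set (16 "concepts").
X = list(range(4))
H = [frozenset(s) for s in chain.from_iterable(
        combinations(X, r) for r in range(len(X) + 1))]

training = [(0, True), (1, False)]          # one positive, one negative example
vs = [h for h in H if all((x in h) == label for x, label in training)]
print(len(vs))                              # 4 consistent hypotheses remain

for unseen in (2, 3):
    pos = sum(unseen in h for h in vs)
    print(unseen, pos, len(vs) - pos)       # every unseen instance splits 2 vs 2
```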
2.7.3 The Futility of Bias-Free Learning
Key point being introduced
From Section 2.7.2, we saw that if we remove all bias (by letting the hypothesis
space = power set of X), the learner cannot generalize beyond training data.
A fundamental property of inductive inference:
A learner that makes no prior assumptions about the target concept has no
rational basis for classifying unseen instances.
Why Candidate-Elimination worked earlier
In the original EnjoySport setting, the Candidate-Elimination algorithm could
generalize.
Why? Because it was biased: it assumed the target concept could be represented
by a conjunction of attribute values.
This implicit assumption is an inductive bias.
If this assumption is correct (and training data is error-free), generalization will
also be correct.
If the assumption is wrong, the algorithm will misclassify some instances.
General idea of inductive bias
Since inductive learning always requires bias, Mitchell suggests we characterize
learners by their inductive bias.
The bias describes:
The policy the learner uses to generalize from training data to new examples.
Setup:
Suppose a learner L is given training data Dc = {(x, c(x))} for some target concept
c.
After training, it is asked to classify a new instance xi.
Let L(xi,Dc) = classification given by learner.
This classification is an inductive inference (not deductively guaranteed).
Formal definition of Inductive Bias
The question: What assumptions must be added so that the learner’s inference
becomes deductively justified?
Mitchell defines:
The inductive bias of a learner = the minimal set of additional assumptions B
such that for all new instances xi:
(B ∧ Dc ∧ xi) ⊢ L(xi, Dc)
where ⊢ means "follows deductively."
So, inductive bias = the assumptions that justify inductive inferences as if they
were deductive.
∧ = logical AND
⊢ = "entails" or "provably implies" in deductive logic
B = the learner's bias (set of assumptions)
Dc = the training data (examples labeled by the target concept c)
xi = the description of a new instance
L(xi, Dc) = the classification (Yes/No) that the learning algorithm L produces for
xi after learning from Dc
"Given the bias B, the training data Dc, and the new instance xi, we can deductively
prove the classification L(xi, Dc)."
Candidate-Elimination’s Inductive Bias
For Candidate-Elimination, the bias is simply:
The target concept c is contained in the given hypothesis space H.
With this assumption, all inductive inferences of Candidate-Elimination can be
justified deductively.
How this works logically:
Assume c ∈ H.
Then it follows that c ∈ VS_{H,Dc} (the version space), since by definition the
version space contains every hypothesis in H consistent with the data.
Candidate-Elimination classifies xi only if all hypotheses in the version space
agree.
Since c is in the version space, and all hypotheses agree, c(xi) must equal
the learner's classification.
Thus: c(xi) = L(xi, Dc).
The Figure (2.8)
What the figure shows
Top: the inductive Candidate-Elimination system:
Inputs: training examples + new instance.
Output: classification (Yes/No or "don't know").
Bottom: an equivalent deductive system (a theorem prover):
Inputs: training examples + new instance + the assumption ("Hypothesis
space H contains the target concept").
Output: the same classification.
✅ The figure makes clear:
The only difference is that in the deductive system, the inductive bias is made
explicit as an assertion ("H contains the target concept").
In Candidate-Elimination, the bias is implicit (hidden inside the code).
So, the inductive system and deductive system are equivalent, once you explicitly
state the bias.
Why modeling learners by bias is useful
Two advantages:
1. It gives a clear description of the learner’s generalization policy (non-procedural,
not hidden in the algorithm’s code).
2. It allows comparison of learners according to the strength of their inductive bias.
Comparing strengths of inductive bias (examples)
Mitchell compares three learners:
1. ROTE-LEARNER
Simply memorizes training data.
Classifies only if an exact match is found in memory.
Has no inductive bias.
2. Candidate-Elimination Algorithm
Classifies if all hypotheses in version space agree.
Bias: target concept is in hypothesis space H.
Stronger bias than rote learning.
3. FIND-S Algorithm
Finds the most specific hypothesis consistent with data, then uses it to classify all
future examples.
Bias: (1) target concept is in hypothesis space H, and (2) assume instances are
negative unless proven positive.
Even stronger bias.
Note:
Please refer to the textbook for the detailed explanation.
Question Bank
Level 2 (Understanding)
1. Define machine learning by its three core components (T, P, E), and provide a
novel example illustrating their interaction in a well-posed learning problem. (Ans:
Refer 1.1)
2. Differentiate between direct and indirect feedback for training experience. Using
the checkers example, explain the "credit assignment" problem associated with
indirect feedback.(Ans: Refer 1.2.1)
3. Compare ChooseMove and V target functions for checkers. Explain why V is
easier to learn with indirect experience and how it selects moves.(Ans: Refer 1.2.2)
4. Explain the purpose of the LMS weight update rule in the checkers program.
Describe how it adjusts weights, referencing the error term and feature values.(Ans:
Refer [Link])
5. Describe the four modules (Performance System, Critic, Generalizer, Experiment
Generator) of the checkers learning system, outlining their functions and
interactions.(Ans: Refer 1.2.5)
6. Explain the more-general-than-or-equal-to relation (hj ≥ hk) for boolean hypotheses.
Use an EnjoySport example to show its independence from the target concept.(Ans:
Refer 2.3.1)
7. Summarize key differences between FIND-S and CANDIDATE-ELIMINATION
algorithms regarding consistent hypotheses and hypothesis space navigation.(Ans:
Refer 2.4)
8. Explain why FIND-S ignores negative training examples. What two critical
assumptions justify this, and why?(Ans: Refer 2.4)
9. Distinguish between an instance "satisfying" a hypothesis and a hypothesis being
"consistent" with a training example. (Ans: Refer 2.5.1)
10. Explain inductive bias. Give an EnjoySport example where a biased hypothesis
space (e.g., conjunctive) prevents learning the true concept.(Ans: Refer 2.7.1)
11. How does CANDIDATE-ELIMINATION compactly represent the version space
compared to LIST-THEN-ELIMINATE? Explain the roles of G and S boundary
sets.(Ans: Refer 2.5.2)
12. Explain the "futility of bias-free learning" principle. Why can't an unbiased learner
generalize beyond observed examples?(Ans: Refer 2.7.3)
Level 3 (Applying)
13. Analyze why CANDIDATE-ELIMINATION's S and G boundary sets might
converge to an empty version space. Identify and explain two distinct reasons related
to training data and hypothesis space.(Ans: Refer 2.6)
14. For a system detecting fraudulent online transactions, propose a well-posed
learning problem by specifying Task (T), Performance measure (P), and Training
experience (E). Justify your choices. (Ans: Refer 1.1)
15. For CANDIDATE-ELIMINATION with a partially learned version space, if the
learner can choose the next training example (query), what characteristic should that
instance have to maximize learning rate? Explain its optimality. (Ans: Refer 2.6.2)
16. Refer the problems.