MACHINE LEARNING 2 NOTES
MODULE 1
1)THE LEARNING SYSTEM
ANS) A learning system, or computer program designed to learn, is fundamentally
characterized by its ability to improve its performance at a certain task automatically with
experience. For a learning system to be well-defined, three features must be clearly specified:
the class of tasks, the measure of performance to be improved, and the source of experience.
Well-Posed Learning Problems
A computer program is defined to learn if its performance improves with experience,
provided that improvement is measured precisely against three necessary components:
1. Task (T): The specific class of tasks the system performs.
2. Performance Measure (P): A quantitative measure used to determine if performance
at task T has improved.
3. Experience (E): The source of information (data or practice) from which the system
learns.
Examples of Well-Posed Learning Problems:
Speech Recognition– Systems like SPHINX learn to recognize spoken words using machine
learning to identify sounds and words.
Autonomous Driving– The ALVINN system learned to drive on roads using data from past
driving experience.
Astronomical Classification– NASA used decision trees to classify celestial objects from sky
surveys.
Game Playing– TD-Gammon learned backgammon strategies by playing over a million
games, reaching near world-champion level.
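The T/P/E structure above can be made concrete with a small sketch. The dictionary keys and the checker that validates them are illustrative assumptions, not part of any standard API; the checkers values follow Mitchell's classic formulation.

```python
# Illustrative sketch: a well-posed learning problem specified by (T, P, E).
checkers_problem = {
    "task": "playing checkers",                                  # T
    "performance_measure": "percent of games won against opponents",  # P
    "experience": "practice games played against itself",        # E
}

def is_well_posed(problem):
    """A learning problem is well-posed only if T, P, and E are all specified."""
    return all(problem.get(k) for k in ("task", "performance_measure", "experience"))

print(is_well_posed(checkers_problem))  # True
```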
Designing a Learning System
Designing a machine learning system involves making a series of critical choices regarding
the type of knowledge to be learned, its representation, and the mechanism used to acquire it.
1. Choosing the Training Experience (E)
The nature of the training experience significantly impacts the learner's success. Key
attributes include:
Feedback: Training can offer direct feedback (e.g., the correct move at each step) or
indirect feedback (e.g., only the final outcome of a game).
Credit Assignment: When feedback is indirect (e.g., in game playing), the system
faces the credit assignment problem, determining which individual actions deserve
credit or blame for the final outcome.
Control over Examples: The learner might rely on a teacher, ask questions, or
operate autonomously, generating its own examples (e.g., playing against itself).
Distribution: Learning is generally most reliable when the training examples (E) are
drawn from a distribution similar to the future test cases.
2. Choosing the Target Function
The target function defines what type of knowledge the model needs to learn to perform a
task (T).
• In a checkers-playing system, one option is to learn a function that selects the best legal
move for any board state.
• An alternative is to learn a value function (V) that assigns a numerical score to each board
state, representing how favourable that position is. The system then selects the move that
leads to the highest-valued successor state.
The ideal value function, for instance, would assign +100 for a winning position, -100 for a
losing position, and 0 for a draw. However, this ideal function is not computable in
practice, especially for non-final board states, because it would require evaluating every
possible future move sequence.
Therefore, machine learning focuses on approximating this ideal target function through a
process called function approximation. Here, the system learns an estimated version of the
true function, denoted $\hat{V}$, from examples and experience. This approximation enables
the system to make near-optimal decisions without needing to compute the perfect function.
3. Choosing a Representation for the Target Function
A representation is how we express or model the target function.
4. Choosing a Function Approximation Algorithm
This involves selecting the algorithm that uses the training examples to find the optimal
representation (e.g., learning the weights w_i).
Estimating Training Values: Since indirect feedback is often used, the system must
first estimate training values for intermediate states. A successful approach assigns
the training value of a state based on the learned evaluation of the successor state:
$V_{train}(b) \leftarrow r + \hat{V}(Successor(b))$
where r is the immediate reward (zero for all non-final board states in checkers) and
$Successor(b)$ is the next state in which it is again the program's turn to move.
Adjusting Weights: The goal is often to minimize the squared error E between the
training values and the values predicted by the current hypothesis $\hat{V}$:
$E \equiv \sum_{\langle b, V_{train}(b) \rangle} \left( V_{train}(b) - \hat{V}(b) \right)^2$
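Under the common linear representation $\hat{V}(b) = w_0 + w_1 x_1 + \dots + w_n x_n$, the LMS rule nudges each weight in proportion to the prediction error. The sketch below is a minimal illustration; the feature values and learning rate are invented, not taken from the notes.

```python
# Minimal sketch of the LMS weight-update rule for a linear evaluation function.

def v_hat(weights, features):
    """Linear evaluation: w0 + sum(wi * xi)."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

def lms_update(weights, features, v_train, eta=0.1):
    """One LMS step: w_i <- w_i + eta * (V_train(b) - V_hat(b)) * x_i."""
    error = v_train - v_hat(weights, features)
    new_weights = [weights[0] + eta * error]          # x0 is implicitly 1
    new_weights += [w + eta * error * x for w, x in zip(weights[1:], features)]
    return new_weights

weights = [0.0, 0.0, 0.0]   # w0, w1, w2
features = [3, 1]           # e.g., counts of two board features (assumed values)
weights = lms_update(weights, features, v_train=100)
# error = 100, so w0 ≈ 10, w1 ≈ 30, w2 ≈ 10 after one step
```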
Architecture of a Learning System OR THE FINAL
DESIGN
The final design often comprises four distinct program modules :
1. Performance System: Executes the primary task (T) using the currently learned
target function(s). For checkers, it uses the $\hat{V}$ function to select the next
move.
2. Critic: Generates a set of training examples (e.g., board state b and estimated value
$V_{train}(b)$) from the task history/trace.
3. Generalizer: Takes the training examples from the Critic and generates the output
hypothesis ($\hat{V}$). It generalizes from specific examples, often using an
algorithm like LMS to adjust weights.
4. Experiment Generator: Selects a new problem or state for the Performance System
to explore, aiming to maximize the overall learning rate (e.g., proposing an initial
board state for a new game).
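The interaction of the four modules can be sketched as a control loop. Everything concrete below (the fake game dynamics, feature vectors, and outcome rule) is invented purely to make the loop runnable; only the module boundaries mirror the design described above.

```python
import random

random.seed(0)  # make the toy run deterministic

def experiment_generator(v_hat):
    """Propose an initial state to explore (here: a random feature vector)."""
    return [random.randint(0, 4) for _ in range(3)]

def performance_system(v_hat, state):
    """Play out a fake 'game' from state; return the trace of states."""
    trace = [state]
    for _ in range(3):
        trace.append([max(0, x + random.randint(-1, 1)) for x in trace[-1]])
    return trace

def critic(trace):
    """Assign an estimated training value to every state in the trace."""
    outcome = 100 if sum(trace[-1]) >= sum(trace[0]) else -100
    return [(b, outcome) for b in trace]

def generalizer(examples, v_hat, eta=0.001):
    """One LMS-style pass over the training examples to update the hypothesis."""
    for b, v_train in examples:
        pred = sum(w * x for w, x in zip(v_hat, b))
        v_hat = [w + eta * (v_train - pred) * x for w, x in zip(v_hat, b)]
    return v_hat

v_hat = [0.0, 0.0, 0.0]
for _ in range(5):  # five rounds of the Experiment Generator -> Performance
    state = experiment_generator(v_hat)        # System -> Critic -> Generalizer loop
    trace = performance_system(v_hat, state)
    v_hat = generalizer(critic(trace), v_hat)
print(len(v_hat))  # 3
```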
Perspectives on Machine Learning
Machine learning is inherently a multidisciplinary field, drawing on results from fields such
as artificial intelligence, statistics, computational complexity theory, and neurobiology.
A fundamental perspective on machine learning is that it involves searching a very large
space of possible hypotheses to find the one that best fits the observed data and any prior
constraints.
Different learning methods are distinguished by:
The hypothesis space they search (e.g., linear functions, neural networks, decision
trees, symbolic rules).
The search strategy they employ (e.g., simple-to-complex ordering in decision trees,
or gradient descent in neural networks).
Q) WHAT IS CONCEPT LEARNING ,EXPLAIN CONCEPT LEARNING AS A
TASK
ANS) Concept learning is the task of inferring a Boolean-valued function from training
examples of its input and output. Each concept can be viewed as a subset of a larger set of
instances, or equivalently as a function that returns true for instances belonging to the
concept and false otherwise.
Concept Learning as a Task
A concept learning task is specified by four components:
Instances (X): the set of items over which the concept is defined (e.g., possible days,
each described by attributes such as Sky, AirTemp, Humidity, Wind, Water, and Forecast).
Target concept (c): the Boolean-valued function to be learned, c : X → {0, 1} (e.g.,
EnjoySport).
Training examples (D): pairs (x, c(x)); an example is positive when c(x) = 1 and
negative when c(x) = 0.
Hypotheses (H): the set of candidate descriptions the learner may consider, each a
function h : X → {0, 1}.
The goal is to determine a hypothesis h in H such that h(x) = c(x) for all x in X.
Q) EXPLAIN CONCEPT LEARNING AS A SEARCH
ANS) In concept learning, this search is done to find a hypothesis that is consistent with
both positive and negative training examples, meaning it correctly classifies all of them.
Defining the Search Space
The search space is the set of all hypotheses that the learner can possibly consider.
This space is implicitly decided by the hypothesis representation chosen by the
system designer.
By choosing a particular representation, the designer limits what the system can
express, and therefore limits what the system can ever learn.
For example, in the EnjoySport learning problem:
The hypothesis is represented as a conjunction of six attribute constraints.
Because of this representation, the total search space contains 973 semantically
distinct hypotheses.
Although this space is finite and relatively small, in real-world machine learning problems,
hypothesis spaces are often extremely large or even infinite.
Organizing the Search
Because hypothesis spaces are very large, learning algorithms do not try every possible
hypothesis one by one. Instead, they use the natural general-to-specific structure of
hypotheses to search efficiently.
General-to-Specific Partial Order
Hypotheses can be organized using a general-to-specific partial ordering.
One hypothesis is said to be more general than another if it classifies at least all the
same instances as positive, and possibly more.
This ordering helps relate hypotheses based on which instances they cover.
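The more-general-than-or-equal-to relation can be sketched directly for conjunctive hypotheses like those in EnjoySport. Here '?' is the universal constraint and '0' stands in for the empty constraint Ø; the specific hypotheses chosen are illustrative.

```python
# Sketch of the general-to-specific partial order for conjunctive hypotheses.

def covers(h, x):
    """True if hypothesis h classifies instance x as positive."""
    return all(hc == '?' or hc == xc for hc, xc in zip(h, x))

def more_general_or_equal(h1, h2):
    """h1 >=g h2: every instance satisfying h2 also satisfies h1."""
    return all(a == '?' or b == '0' or a == b for a, b in zip(h1, h2))

h_general  = ('Sunny', '?', '?', '?', '?', '?')
h_specific = ('Sunny', 'Warm', '?', 'Strong', '?', '?')
print(more_general_or_equal(h_general, h_specific))   # True
print(more_general_or_equal(h_specific, h_general))   # False (partial order, not total)
```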
Search Strategies Used by Algorithms
Different learning algorithms explore this ordered space in different ways:
FIND-S Algorithm
o Starts with the most specific hypothesis.
o Gradually moves toward more general hypotheses.
o It follows one single path in the general-to-specific order.
o Its goal is to cover all positive training examples.
CANDIDATE-ELIMINATION Algorithm
o Maintains a version space, which is the set of all hypotheses consistent with
the training data.
o It keeps track of:
The most specific boundary
The most general boundary
o These boundaries are refined as more training examples are seen.
Search in Different Hypothesis Representations
The way the search is performed depends on how hypotheses are represented:
Linear Function Representations
o Algorithms such as Least Mean Squares (LMS) are used.
o They perform a stochastic gradient descent search.
o The search happens in a continuous space of weight values.
o The goal is to minimize prediction error.
Logical or Symbolic Representations
o Some algorithms search through discrete spaces, such as:
Logical rules
Decision trees
Why Viewing Learning as Search Is Important
By treating learning as a search problem, researchers can:
Analyze how the size of the hypothesis space affects learning
Understand how much training data is required
Estimate how confident we can be that the learned hypothesis will:
o Perform correctly
o Generalize well to unseen examples
2)FIND-S ALGORITHM
ANS) The FIND-S algorithm is a fundamental concept learning algorithm designed to find the
most specific hypothesis within a predefined hypothesis space H that is consistent with a set of
observed positive training examples
The crucial feature is that FIND-S ignores every negative example.
It only generalizes the current hypothesis when it encounters a positive example that the current
hypothesis fails to "cover" (i.e., fails to classify correctly as positive).
If the hypothesis is already consistent with a negative example (i.e., correctly classifies it as
negative), no revision is made
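The algorithm can be sketched compactly in Python. The EnjoySport data below is Mitchell's standard four-example training set; '0' stands in for the empty constraint Ø.

```python
# Compact sketch of FIND-S for conjunctive hypotheses.

def find_s(examples):
    """examples: list of (instance_tuple, label) with label True/False."""
    n = len(examples[0][0])
    h = ['0'] * n                       # most specific hypothesis
    for x, positive in examples:
        if not positive:
            continue                    # FIND-S ignores every negative example
        for i, value in enumerate(x):
            if h[i] == '0':
                h[i] = value            # first positive example: copy it verbatim
            elif h[i] != value:
                h[i] = '?'              # generalize each mismatching constraint
    return tuple(h)

# EnjoySport training data:
data = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
    (('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'), True),
    (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change'), False),
    (('Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change'), True),
]
print(find_s(data))  # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
```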
Limitations Addressed by Candidate-Elimination
While FIND-S is efficient, it leaves several fundamental questions unanswered, which led to
the development of the CANDIDATE-ELIMINATION algorithm:
1. Convergence:
FIND-S can find one hypothesis that fits all positive examples, but it cannot tell if this
is the only correct or true hypothesis in the hypothesis space.
2. Preference Bias:
FIND-S always prefers the most specific consistent hypothesis. However, there’s no
clear reason why the most specific hypothesis should always be better than a more
general one. It assumes that specific hypotheses are more likely to be correct.
3. Inconsistency or Noise:
FIND-S is very sensitive to errors in training data because it only looks at positive
examples and ignores negatives. Even a single incorrect example can mislead it
completely.
4. Multiple Maxima:
In some cases, there might be several equally specific hypotheses that fit the data,
and FIND-S cannot choose between them.
3)THE CANDIDATE ELIMINATION ALGORITHM
ANS) The key idea of Candidate Elimination Algorithm is to output a
description of the set of all hypothesis consistent with the training examples.
Applications: to learn regularities in chemical mass spectroscopy and to learn
control rules for heuristic search.
• Disadvantage: Both FIND-S and the Candidate-Elimination algorithm perform
poorly when given noisy training data.
A hypothesis h is consistent with a set of training examples D if and only if
h(x) = c(x) for each example (x, c(x)) in D.
The CANDIDATE-ELIMINATION algorithm represents the set of all hypotheses
consistent with the observed training examples. This subset of all hypotheses is
called the version space.
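A simplified implementation on the EnjoySport data can make the boundary-set idea concrete. This sketch keeps S as a single hypothesis (which suffices for this conjunctive space) and omits the pruning of redundant members of G; the attribute domains are from Mitchell's example, while the code itself is an illustrative reconstruction.

```python
# Simplified sketch of CANDIDATE-ELIMINATION for conjunctive hypotheses.

def covers(h, x):
    return all(c == '?' or c == v for c, v in zip(h, x))

def more_general_or_equal(g, s):
    return all(a == '?' or a == b or b == '0' for a, b in zip(g, s))

def candidate_elimination(examples, domains):
    n = len(domains)
    S = [tuple('0' for _ in range(n))]   # most specific boundary
    G = [tuple('?' for _ in range(n))]   # most general boundary
    for x, positive in examples:
        if positive:
            G = [g for g in G if covers(g, x)]       # drop inconsistent g
            s = list(S[0])                           # minimally generalize S
            for i, v in enumerate(x):
                if s[i] == '0':
                    s[i] = v
                elif s[i] != v:
                    s[i] = '?'
            S = [tuple(s)]
        else:
            S = [s for s in S if not covers(s, x)]   # drop inconsistent s
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                    continue
                for i in range(n):                   # minimal specializations of g
                    if g[i] != '?':
                        continue
                    for v in domains[i]:
                        if v != x[i]:
                            g2 = g[:i] + (v,) + g[i + 1:]
                            if any(more_general_or_equal(g2, s) for s in S):
                                new_G.append(g2)
            G = new_G
    return S, G

domains = [('Sunny', 'Cloudy', 'Rainy'), ('Warm', 'Cold'), ('Normal', 'High'),
           ('Strong', 'Weak'), ('Warm', 'Cool'), ('Same', 'Change')]
data = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), True),
    (('Sunny', 'Warm', 'High', 'Strong', 'Warm', 'Same'), True),
    (('Rainy', 'Cold', 'High', 'Strong', 'Warm', 'Change'), False),
    (('Sunny', 'Warm', 'High', 'Strong', 'Cool', 'Change'), True),
]
S, G = candidate_elimination(data, domains)
print(S)  # [('Sunny', 'Warm', '?', 'Strong', '?', '?')]
print(G)  # [('Sunny', '?', '?', '?', '?', '?'), ('?', 'Warm', '?', '?', '?', '?')]
```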
Q) The LIST-THEN-ELIMINATE Algorithm
ANS) The LIST-THEN-ELIMINATE algorithm is a straightforward approach used in
concept learning to identify the version space—the subset of all hypotheses ($H$) that are
consistent with the observed training examples ($D$). It serves as a conceptual precursor to
more efficient algorithms like Candidate-Elimination.
Step-by-Step Mechanism
The algorithm follows a simple three-step process to narrow down potential target concepts:
1. Initialisation: Create a list called VersionSpace that contains every possible
hypothesis in the hypothesis space $H$.
2. Elimination: For each training example $(x, c(x))$, examine every hypothesis $h$ in
the VersionSpace. If a hypothesis incorrectly classifies the example (meaning
$h(x) \neq c(x)$), it is removed from the list.
3. Output: After processing all training data, the algorithm outputs the remaining list of
hypotheses in the VersionSpace.
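The three steps can be sketched over a deliberately tiny hypothesis space: conjunctions over two binary attributes, so the exhaustive enumeration that makes this algorithm impractical in general stays feasible. The data and domains are invented for illustration.

```python
from itertools import product

# Sketch of LIST-THEN-ELIMINATE over a tiny conjunctive hypothesis space.

def covers(h, x):
    return all(c == '?' or c == v for c, v in zip(h, x))

def list_then_eliminate(examples, domains):
    # 1. Initialise: enumerate every hypothesis expressible in the representation.
    version_space = list(product(*[d + ('?',) for d in domains]))
    # 2. Eliminate: drop any h with h(x) != c(x) for some training example.
    for x, label in examples:
        version_space = [h for h in version_space if covers(h, x) == label]
    # 3. Output the remaining (consistent) hypotheses.
    return version_space

domains = (('0', '1'), ('0', '1'))
examples = [(('1', '1'), True), (('0', '1'), False)]
print(list_then_eliminate(examples, domains))  # [('1', '1'), ('1', '?')]
```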
Characteristics and Outcomes
Convergence: As more training examples are observed, the version space shrinks.
Ideally, it eventually narrows down to a single hypothesis that represents the desired
target concept.
Insufficient Data: If there is not enough data to identify a single correct hypothesis,
the algorithm simply outputs the entire set of consistent candidate hypotheses.
Finite Requirement: This algorithm can only be applied in principle if the hypothesis
space $H$ is finite.
Advantages and Disadvantages
The sources highlight a significant trade-off when using this method:
Advantage: It is guaranteed to output all hypotheses that are consistent with the
training data.
Disadvantage: It is highly impractical. It requires the exhaustive enumeration of
all hypotheses, which is an unrealistic requirement for any but the most trivial and
simple hypothesis spaces.
To better understand this, you can think of the LIST-THEN-ELIMINATE algorithm like a
student taking a multiple-choice test where they don't initially know the answer. They start
by looking at every single option (initializing the version space). As they read the question's
clues (training examples), they cross out every option that contradicts a clue. Whatever is left
at the end, whether it's one choice or three, is their final set of plausible answers.
Q) THE INDUCTIVE BIAS
ANS)
The Necessity of Bias
It can be clearly stated that learning without bias is useless for generalization.
A learner that makes no prior assumptions about the target concept has no logical
reason to classify any new or unseen instance.
Without bias, the learner cannot decide whether a new example should be positive or
negative.
If a learner is given a completely unbiased hypothesis space, meaning:
A hypothesis space that can represent every possible subset of instances (this is
called a powerset),
then the learner cannot generalize at all.
In such a case:
The Specific boundary (S) would simply be the disjunction of observed positive
examples.
The General boundary (G) would be the negated disjunction of observed negative
examples.
Unseen instances would be classified as positive by exactly half the hypotheses in the
version space and negative by the other half, resulting in no unanimous or reliable
classification.
Categorizing Learners by Bias Strength
Different algorithms can be characterized and compared based on the strength of their
inductive bias. The sources list three examples from weakest to strongest bias:
1. Rote-Learner: This has no inductive bias. It simply stores training examples and
only classifies new instances if they exactly match a previously observed case.
2. Candidate-Elimination Algorithm: This has a stronger bias: the assumption that
the target concept $c$ is contained within the hypothesis space $H$. This
assumption allows it to classify some instances that a Rote-Learner would leave
unclassified.
3. FIND-S Algorithm: This has an even stronger inductive bias. In addition to
assuming $c \in H$, it assumes that all instances are negative unless the contrary is
entailed by its existing knowledge. Because it is more strongly biased, it makes more
"inductive leaps" and classifies a higher proportion of unseen instances.
4) CHALLENGES AND APPLICATION OF ML
ANS)
Challenges and Issues in Machine Learning (10 Points)
1. Credit Assignment Problem: In learning scenarios where the program
receives only indirect feedback (e.g., only the final outcome of a game), it is a
major challenge to determine which specific action or move along the way was
responsible for the final win or loss.
2. Inconsistent Training Examples: For concept learning algorithms like FIND-S,
a key question left unanswered is how to proceed if the available training
examples are inconsistent.
3. Handling Noisy Data: Concept learning algorithms, specifically FIND-S and
the Candidate Elimination algorithm, are noted to perform poorly when given
noisy training data.
4. Data Distribution Mismatch: Learning is most effective when the training
data comes from a distribution similar to the distribution of future test cases.
If a learner trains only against itself, it might not perform well in real scenarios
where test conditions differ.
5. Function Approximation Necessity: The ideal target function (e.g., assigning
perfect scores to all board states, like +100 for a guaranteed win) is often not
computable in practice, especially for non final intermediate states, forcing
machine learning to rely on approximations.
6. Estimating Intermediate Training Values: It is a challenge to assign specific,
accurate scores to the numerous intermediate board states when the only
information available is the final game outcome (won or lost). The final
outcome does not necessarily indicate whether every preceding board state
was good or bad.
7. Dependence on Hypothesis Space: A fundamental challenge is that learning
algorithms, such as the Candidate-Elimination algorithm, will only converge
toward the true target concept provided the Initial hypothesis space contains
the target concept. A key question arises if the target concept is not contained
within the hypothesis space (H).
8. Influence of Hypothesis Space Size: Determining how the size of the
hypothesis space influences the ability of the algorithm to generalize to
unobserved instances is a fundamental question for inductive inference.
9. K-Means Sensitivity to Cluster Shape/Size: The K-Means clustering algorithm
does not perform well when clusters have very different sizes (diameters),
because the assignment step relies solely on minimizing the distance to the
centroid.
10. Difficulty in Choosing K: In clustering, choosing the correct number of
clusters (k) is generally difficult, and setting it incorrectly leads to poor
clustering results. Furthermore, metrics like inertia are unreliable alone
because inertia always decreases as k increases.
Applications of Machine Learning (10 Points)
Machine learning systems are applied across various domains to improve
performance on specific
tasks (T) through experience (E), leading to measurable performance (P).
1. Game Playing (Checkers): A foundational example involves a program
learning to play checkers, with performance (P) measured by how often it wins
games.
2. Game Playing (Backgammon): The TD-Gammon system successfully learned
backgammon strategies by playing over a million games, achieving a near
world-champion level of performance.
3. Autonomous Driving: The ALVINN system learned to drive on roads by using
data derived from past driving experience.
4. Speech Recognition: Systems like SPHINX apply machine learning techniques
to identify sounds and words, enabling them to recognize spoken language.
5. Astronomical Classification: NASA utilized decision trees, a machine learning
technique, to classify celestial objects from sky survey data.
6. Concept Learning Tasks (Boolean Prediction): ML is applied to infer Boolean
valued functions from training examples, such as the EnjoySport task, where
the system learns the conditions (e.g., Sky, AirTemp, Humidity) under which
Aldo enjoys his favorite sport.
7. Learning Heuristic Control Rules: The Candidate-Elimination algorithm can
be applied to learn control rules for heuristic search.
8. Chemical Mass Spectroscopy Analysis: Candidate-Elimination algorithms are
used to learn regularities in Chemical Mass Spectroscopy.
9. Approximating Complex Functions: ML is used to approximate complex, non-
computable target functions (e.g., a value function that scores intermediate
board states in a game).
10. Learning Recursive Knowledge: Learning systems are applied to infer
expressive knowledge in the form of first-order rules (Horn clauses) containing
variables, which can represent recursive concepts like the Ancestor relation.
MODULE 2
1) Define sequential covering and explain its working with clear
steps, algorithm, example, and a neat diagram.
Discuss the variations of the sequential covering algorithm.
Compare sequential covering with simultaneous covering
algorithms.
Finally, differentiate sequential covering and decision tree learning.
ANS) Definition and Working of Sequential Covering
Sequential covering algorithms are a group of learning methods used to learn if–then rules.
These algorithms work by learning one rule at a time, removing the training examples that
the rule covers, and then repeating the process on the remaining data.
Because the algorithm separates the data and then conquers each part step by step, this
approach is also called the “separate-and-conquer” strategy.
Working Steps
The algorithm reduces the complex problem of learning a large disjunctive rule set into a
series of simpler problems: learning individual conjunctive rules. The steps are as follows:
1. Initialize: Begin with an empty set of learned rules and the full set of training examples.
2. Learn-One-Rule: Invoke a subroutine (like LEARN-ONE-RULE) to find a single rule with high
accuracy on the current data, even if it has low coverage.
3. Update Rule Set: Add this newly learned rule to the disjunctive set of rules.
4. Remove Examples: Delete all positive examples that are correctly classified (covered) by the
new rule from the training set.
5. Iterate: Repeat the process using the remaining examples until a user-defined threshold is
reached or no more high-quality rules can be found.
6. Finalize: Sort the learned rules according to their performance to ensure the most accurate
rules are applied first during classification.
Algorithm Pseudocode
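The steps above can be sketched as a runnable toy. Here a "rule" is a single attribute-value test, a deliberately crude stand-in for LEARN-ONE-RULE, and the PlayTennis-style data is invented for illustration.

```python
# Runnable toy sketch of the SEQUENTIAL-COVERING (separate-and-conquer) loop.

def rule_covers(rule, x):
    attr, value = rule
    return x[attr] == value

def accuracy(rule, examples):
    covered = [label for x, label in examples if rule_covers(rule, x)]
    return sum(covered) / len(covered) if covered else 0.0

def learn_one_rule(examples):
    """Greedily pick the attribute-value test with the best accuracy, then coverage."""
    candidates = {(i, x[i]) for x, _ in examples for i in range(len(x))}
    return max(candidates,
               key=lambda r: (accuracy(r, examples),
                              sum(rule_covers(r, x) for x, l in examples if l)))

def sequential_covering(examples, threshold=0.99):
    rules = []
    while any(label for _, label in examples):       # positives remain uncovered
        rule = learn_one_rule(examples)
        if accuracy(rule, examples) < threshold:
            break                                    # no more high-quality rules
        rules.append(rule)
        # separate-and-conquer: remove the positives the new rule covers
        examples = [(x, l) for x, l in examples
                    if not (l and rule_covers(rule, x))]
    return rules

# Toy data: (Outlook, Humidity) -> PlayTennis (1 = yes)
data = [(('Sunny', 'Normal'), 1), (('Rain', 'Normal'), 1),
        (('Sunny', 'High'), 0),   (('Rain', 'High'), 0)]
print(sequential_covering(data))  # [(1, 'Normal')]
```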
Example and Search Diagram
One effective approach to implementing LEARN-ONE-RULE is to organize the hypothesis
space search in the same general fashion as the ID3 algorithm, but to follow only the most
promising branch in the tree at each step.
The search begins by considering the most general rule precondition possible (the empty test
that matches every instance), then greedily adding the attribute test that most improves rule
performance measured over the training examples. Once this test has been added, the process
is repeated by greedily adding a second attribute test, and so on.
Consider learning the concept "PlayTennis." The algorithm starts with the most general
precondition (an empty test that matches everything). It then greedily adds attribute tests
(e.g., Humidity=Normal) that most improve the rule's performance.
Variations of Sequential Covering
The research literature highlights several dimensions along which these algorithms vary:
Search Direction: Most algorithms use a general-to-specific search (starting with an empty
rule), but others (like GOLEM) search specific-to-general, using positive examples to initialize
the search.
Search Guidance: Some are generate-and-test, meaning they generate candidate rules
based on syntax and then test them against data. Others are example-driven (like AQ),
where a specific "seed" example is used to constrain the search for new rules.
Performance Metrics: Different functions evaluate rule quality, including relative
frequency, the m-estimate of accuracy (to handle scarce data), and entropy.
Expressiveness: While many learn propositional rules, variations like FOIL are designed to
learn first-order Horn clauses, which can include variables and recursive definitions.
Sequential vs. Simultaneous Covering
The primary difference between these approaches lies in how they navigate the search space:
Strategy: Sequential covering (e.g., CN2) learns one rule at a time independently;
simultaneous covering (e.g., ID3) learns the entire set of disjuncts (the tree)
simultaneously.
Search Choice: Sequential covering chooses among attribute-value pairs; simultaneous
covering chooses among attributes that partition the data.
Efficiency: Sequential covering makes many independent choices (n × k search steps);
simultaneous covering makes fewer independent choices because each tree node serves
multiple rules.
Data Requirements: Sequential covering requires plentiful data to support its many
independent decisions; simultaneous covering is more effective when data is scarce
because decisions are "shared" across rules.
Sequential Covering vs Decision Tree Learning
How rules are learned: Sequential covering learns rules directly, one by one; decision
tree learning learns the tree first, then converts it into rules.
Rule structure: In sequential covering each rule is independent; in a decision tree all
rules depend on the same tree structure.
Attributes used: Different sequential-covering rules can use different attributes;
tree-derived rules must follow the attributes chosen at the top of the tree.
Flexibility: Sequential covering is very flexible; decision tree learning is less flexible.
Recursive rules: Sequential covering can learn recursive relations (e.g., Ancestor);
decision trees cannot easily represent recursive relations.
Search method: Sequential covering is greedy (fast but not always optimal); decision
tree learning is greedy within a fixed tree structure.
Rule quality: Sequential covering may not find the smallest rule set; decision trees may
yield redundant rules.
2) Explain the Learn-One-Rule algorithm and show how it uses a general-to-
specific beam search for rule learning.
The LEARN-ONE-RULE algorithm is a core subroutine used within sequential covering
algorithms to incrementally build a set of if-then rules. Its primary goal is to find a single
rule that correctly classifies many positive examples while covering very few negative ones.
Core Functionality
• High Accuracy, Low Coverage: The algorithm prioritizes accuracy over coverage; it must
be correct in its predictions, but it does not need to account for every training example in the
dataset.
• Input and Output: It accepts a set of positive and negative training examples and outputs a
single conjunctive rule.
• Role in Sequential Covering: Once LEARN-ONE-RULE identifies a high-quality rule, the
examples it covers are removed, and the subroutine is called again on the remaining data to
find the next rule.
Beam Search Mechanism
To avoid poor greedy decisions, LEARN-ONE-RULE often uses beam search.
Steps in Beam Search:
1. Initialization
Maintain a list of the top k candidate hypotheses.
Initially, this list contains only the most general hypothesis.
2. Generation
For each hypothesis in the beam:
o Generate all possible specializations
o This is done by adding all possible attribute constraints
3. Selection
o Evaluate all generated candidates using a performance measure
o Keep only the best k candidates
o Discard the rest
4. Termination
o Repeat the above steps until:
No candidates remain, or
A stopping condition is reached
o Return the best rule found during the search
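The four steps above can be sketched as a toy general-to-specific beam search. A rule here is a set of attribute-value tests (the empty set is the most general rule), and the data and beam width are invented for illustration.

```python
# Toy sketch of LEARN-ONE-RULE's general-to-specific beam search.

def covers(rule, x):
    return all(x[a] == v for a, v in rule)

def accuracy(rule, examples):
    covered = [l for x, l in examples if covers(rule, x)]
    return sum(covered) / len(covered) if covered else 0.0

def learn_one_rule(examples, k=2):
    candidates = {(i, v) for x, _ in examples for i, v in enumerate(x)}
    beam = [frozenset()]                 # 1. init: the most general rule (empty test)
    best = frozenset()
    while beam:
        # 2. generation: all one-test specializations of every rule in the beam
        specs = {r | {c} for r in beam for c in candidates if c not in r}
        specs = {r for r in specs if any(covers(r, x) for x, _ in examples)}
        if not specs:
            break
        # 3. selection: keep the k best candidates; remember the best rule seen
        beam = sorted(specs, key=lambda r: accuracy(r, examples), reverse=True)[:k]
        if accuracy(beam[0], examples) > accuracy(best, examples):
            best = beam[0]
        # 4. termination: stop early once a perfect rule is found
        if accuracy(best, examples) == 1.0:
            break
    return best

data = [(('Sunny', 'Normal'), 1), (('Rain', 'Normal'), 1),
        (('Sunny', 'High'), 0),   (('Rain', 'High'), 0)]
print(sorted(learn_one_rule(data)))  # [(1, 'Normal')]
```

Keeping k candidates instead of one guards against a single bad greedy choice early in the search.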
3) Explain the learning of first-order rules and Horn clauses.
Describe first-order Horn clauses and the terminology used in first-order
logic with suitable examples.
ANS)
FIRST ORDER HORN CLAUSES
Operational Logic in PROLOG Systems
Systems like PROLOG, which use Horn clauses, usually follow a strategy called negation-
as-failure.
This means:
An expression is assumed to be false by default
If the system cannot prove it to be true using the available rules and data
TERMINOLOGY
Constants: symbols naming specific objects (e.g., Bob, 23).
Variables: symbols that range over objects (e.g., x, y).
Predicate symbols: denote relations and evaluate to true or false (e.g., Married,
Greater-Than).
Function symbols: denote functions over objects (e.g., age).
Term: any constant, any variable, or any function applied to a term (e.g., Bob, x,
age(Bob)).
Literal: a predicate or its negation applied to terms; Married(Bob, Louise) is a
positive literal and ¬Greater-Than(age(x), 20) is a negative literal. A ground literal
contains no variables.
Clause: a disjunction of literals whose variables are universally quantified.
Horn clause: a clause containing at most one positive literal, H ∨ ¬L1 ∨ ... ∨ ¬Ln,
which is equivalent to the rule IF L1 ∧ ... ∧ Ln THEN H. Here H is the head
(consequent) and L1 ∧ ... ∧ Ln is the body (antecedents).
4) describe the rule performance measures used in rule learning.
ANS) Rule learning algorithms need a measure to evaluate how promising a candidate rule
is over the training examples. Common measures include:
1. Relative frequency: n_c / n, where n is the number of examples the rule matches and
n_c is the number it classifies correctly.
2. m-estimate of accuracy: (n_c + m·p) / (n + m), where p is the prior probability of the
class and m weights the influence of this prior. It is preferred when data is scarce,
since it smooths the relative frequency toward p.
3. Entropy: measures the impurity of the set of examples covered by the rule,
−Σ p_i log2 p_i over the classes; lower entropy means the covered examples are
purer. This is the measure used by the CN2 algorithm.
1) FOIL ALGORITHM
ANS) FOIL (First-Order Inductive Learner) is an algorithm designed to learn sets of
first-order rules and is an extension of earlier sequential-covering algorithms
The hypotheses learned by FOIL are sets of first-order rules. While each rule is similar to a
Horn clause, there are two key exceptions :
1. Restriction: The literals in the rules learned by FOIL are restricted because they are
not permitted to contain function symbols. This constraint is implemented to
reduce the complexity of searching the hypothesis space.
2. Expressiveness: FOIL rules are more expressive than general Horn clauses
because the literals appearing in the body of the rule may be negated.
Algorithm Structure
The FOIL algorithm operates using nested loops that correspond to variants of earlier
sequential rule-learning methods:
THE ALGORITHM
The outer loop corresponds to a variant of the SEQUENTIAL-COVERING algorithm: it
learns new rules one at a time, removing the positive examples covered by the latest rule
before attempting to learn the next, until no positive examples remain uncovered.
The inner loop corresponds to a variant of LEARN-ONE-RULE, performing a general-to-
specific search: it starts with the most general rule (an empty body) and greedily adds
candidate literals one at a time, choosing at each step the literal with the highest
Foil_Gain, until the rule covers no negative examples.
Applications
FOIL has been successfully applied to a variety of problem domains. Examples include
demonstrating its ability to learn:
A recursive definition of the QUICKSORT algorithm.
Rules to discriminate legal from illegal chess positions.
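FOIL's literal-selection measure, Foil_Gain, can be sketched directly. In the formula, p0 and n0 are the positive and negative bindings of the rule before adding the candidate literal L, p1 and n1 are the counts afterward, and t is the number of positive bindings still covered once L is added; the example counts below are invented.

```python
import math

# Sketch of FOIL's Foil_Gain measure for choosing the next literal.

def information(p, n):
    """-log2 of the fraction of bindings that are positive."""
    return -math.log2(p / (p + n))

def foil_gain(p0, n0, p1, n1, t):
    """t * (information before adding L - information after adding L)."""
    return t * (information(p0, n0) - information(p1, n1))

# e.g., a literal that keeps 10 of 12 positives while cutting negatives 20 -> 2:
print(round(foil_gain(12, 20, 10, 2, 10), 2))  # 11.52
```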
2) PROLOG-EBG
ANS)
PROLOG-EBG is a representative explanation-based learning (EBL) algorithm designed
to learn sets of if-then rules, specifically first-order Horn clauses. Unlike purely inductive
methods that rely on identifying statistical patterns across many examples, PROLOG-EBG is
an analytical learning method; it uses prior knowledge, called a domain theory, and
deductive reasoning to generalize from as little as a single training example.
Core Mechanism: The Three-Step Process
PROLOG-EBG operates as a sequential covering algorithm, meaning it learns one rule at a
time, removes the positive examples covered by that rule, and repeats the process until all
positive examples are accounted for. For each new positive example, it follows three distinct
steps:
1. Explain the Training Example: The learner uses its domain theory to construct an
explanation—a logical proof—showing exactly how the training example satisfies
the target concept. In PROLOG-EBG, this is generated using a backward-chaining
search similar to the way the PROLOG programming language operates.
2. Analyze the Explanation: The algorithm analyzes the proof to determine the
weakest preimage of the target concept. This is the most general set of features from
the training example that are sufficient to satisfy the concept according to the proof.
To do this, it employs a procedure called regression, which works backward from the
conclusion to the leaf nodes of the proof tree, substituting specific constants with
general variables.
3. Refine the Hypothesis: The algorithm constructs a new Horn clause where the head
is the target concept and the body consists of the generalized features (weakest
preimage) found in the analysis. This rule is added to the learner's hypothesis, and all
positive examples it covers are removed from further consideration.
Characteristics and Assumptions
Perfect Domain Theory: PROLOG-EBG assumes the provided domain theory is
correct (each assertion is true) and complete (it can prove every positive example in
the instance space).
Knowledge Reformulation: It is often viewed as a form of "knowledge
compilation" or reformulation. Because the learner already "knows" the concept in
principle via the domain theory, PROLOG-EBG's role is to transform that knowledge
into a more operational form—rules that can classify instances in a single inference
step rather than a complex search.
Inductive Bias: The inductive bias of PROLOG-EBG is primarily the input domain
theory itself, combined with a preference for small sets of maximally general Horn
clauses.
Feature Discovery: A unique capability of the algorithm is its ability to formulate
new features not present in the original instance description but required by the
domain theory to explain the concept (e.g., calculating weight from density and
volume).
Negation-as-Failure: Following PROLOG conventions, the algorithm predicts only
positive examples; an instance is classified as negative if the learned rules fail to
prove it is positive.
Illustrative Example: SafeToStack
The sources use the SafeToStack(x, y) problem to illustrate the algorithm. Given a
positive example of a box on an endtable, PROLOG-EBG will explain why it is safe (e.g.,
because the box is lighter than the table). It will then ignore irrelevant features like the
color or owner of the objects because they were not part of the logical proof, resulting in a
general rule that applies to any similar objects.
3) Recursive rule learning
Ans) Recursive rule learning is a specialized form of Inductive Logic Programming (ILP)
where the learned rules contain the target predicate within both the head (postcondition) and
the body (preconditions) of the rule. This allows the learner to automatically infer programs
that can describe complex, hierarchical relationships or repetitive processes.
Representational Power
Recursive rules are significantly more expressive than propositional or standard first-order
rules. They can compactly describe functions that are extremely difficult to represent using
decision trees or other propositional methods. Because these rules can be interpreted as
programs in the PROLOG language, learning them is essentially a task of automatically
inferring PROLOG programs from training examples.
The Classic Example: The Ancestor Relation
The most common illustration of a recursive rule set involves defining the Ancestor concept:
Base Case: IF Parent(x, y) THEN Ancestor(x, y)
Recursive Step: IF Parent(x, z) ∧ Ancestor(z, y) THEN Ancestor(x, y).
In this example, the second rule uses the Ancestor predicate in its own definition to indicate
that $x$ is an ancestor of $y$ if there exists some intermediate person $z$ such that $x$ is the
parent of $z$ and $z$ is an ancestor of $y$.
How Recursive Rules are Learned (The FOIL Approach)
Algorithms like FOIL can learn recursive rules by following these steps:
Predicate Inclusion: The target predicate (the one occurring in the rule head) is
included in the input list of predicates that the algorithm can use to build
preconditions.
Candidate Generation: When FOIL performs its general-to-specific search to
specialize a rule, it considers candidate literals that refer to the target predicate itself.
Greedy Selection: If a literal referring to the target predicate yields the highest Foil-
Gain (a measure of how much the literal improves the rule's performance over the
training data), it is added to the rule's body.
Practical Applications and Challenges
Recursive rule learning has been successfully used to discover definitions for various
computational tasks, including:
Sorting algorithms such as QUICKSORT.
List operations like appending two lists, sorting elements, or removing specific
elements from a list.
Logical Relations like the Ancestor function.
A major challenge in this field is the risk of learning rule sets that lead to infinite recursion.
Sophisticated ILP methods must include "subtleties" and constraints to ensure that the learned
recursive programs will eventually terminate when executed.
MODULE 3
1) BAGGING AND BOOSTING
ANS) Both bagging and boosting are ensemble methods used in machine learning to
combine multiple predictors, but they differ fundamentally in how they are trained and how
they operate.
Boosting
Boosting is an ensemble method designed to combine weak learners into a strong learner.
Key Characteristics and Mechanism:
Predictors are trained sequentially, where each subsequent model corrects the
errors of the previous one.
Boosting focuses on training instances that were misclassified by previous models.
Many boosting methods exist, with AdaBoost and Gradient Boosting being the most
popular.
AdaBoost:
In methods like AdaBoost the process starts by training a base classifier (e.g., a
Decision Tree). After each training round, misclassified samples get higher weights.
A new classifier is then trained using these updated weights. This sequence repeats
until a strong ensemble model is built.
Focus: Each new predictor gives greater attention to these "harder cases," and the
sequence of predictors gradually improves the decision boundaries.
Boosting improves accuracy by reducing both bias and variance.
A lower learning rate in boosting methods results in smaller weight updates, which
makes the learning process smoother.
Bagging
Bagging is a sampling technique where training instances are selected with replacement.
This means that the same data point can appear multiple times in a single subset used for
training.
Key Characteristics and Mechanism:
Training instances are selected with replacement. Because of replacement, some data
points may appear repeatedly, while others may be left out entirely.
Each subset of data is used to train a separate model (predictor).
Bagging primarily helps reduce variance and improves the stability of machine
learning models.
After all predictors are trained, the ensemble makes predictions by combining the
outputs of all models.
o For classification, the final decision is usually the most frequent prediction
o For regression, the final prediction is often the average of all model outputs.
Out-of-Bag (OOB) Evaluation:
In bagging, because instances are sampled with replacement, some instances are never
selected for a particular predictor's training.
The approximately 37% of the data not used for training a specific predictor are
called Out-of-Bag (OOB) instances.
Since the predictor never saw its OOB instances, these can be used to evaluate the
predictor's performance without needing a separate validation set.
Averaging OOB evaluations across all predictors gives the OOB score, which
estimates the overall model performance. This feature can be enabled in Scikit-Learn
by setting oob_score=True.
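The OOB workflow described above can be sketched with Scikit-Learn's BaggingClassifier; the synthetic dataset and hyperparameter values below are illustrative choices, not prescribed by the notes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic dataset
X, y = make_classification(n_samples=500, random_state=42)

bag = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,      # sample training instances with replacement
    oob_score=True,      # score each predictor on its ~37% unseen instances
    random_state=42,
)
bag.fit(X, y)
print(bag.oob_score_)    # OOB estimate of accuracy, no validation set needed
```

Because every predictor is evaluated only on instances it never saw, `oob_score_` behaves like a free cross-validation estimate.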
2) Explain ensemble learning and decision by committee with neat
diagram
Ans)
Ensemble learning, also called decision by committee, is a machine learning approach
based on the idea that “two heads are better than one.”
Instead of depending on just one model, this method combines many different models.
Each model learns the data in a slightly different way:
Some models learn certain patterns well,
Other models perform better on different parts of the data.
By combining all these individual opinions, the ensemble produces results that are more
accurate, more stable, and more robust than any single model on its own.
Core Principles of Decision by Committee
The success of an ensemble depends on three main factors:
1. Which learners are chosen
2. How they are trained to see different patterns
3. How their outputs are combined
1. Creating Complexity from Simplicity
One major advantage of ensemble learning is that it can combine many simple
classifiers.
Each simple classifier may have a basic decision boundary (for example, an
elliptical boundary).
When these simple boundaries are combined, the ensemble forms a much more
complex decision boundary.
This allows the system to correctly separate difficult data points that a single simple
model cannot handle.
2. Handling Data Extremes
Ensemble methods work well in both extreme data situations:
When data is scarce:
o All models trained during cross-validation are kept.
o Their predictions are combined to improve reliability.
When data is very large:
o The data is divided into parts.
o Different committee members are trained on different subsets.
3. Improving Stability
Ensemble learning improves model stability by reducing bias and variance.
A single model may have:
o High bias, or
o High variance.
When many models are combined:
o Individual errors cancel out,
o The overall variance is reduced,
o The final prediction becomes more reliable.
Common Ensemble Strategies
The sources describe three main types of ensemble methods:
1. Bagging (Bootstrap Aggregating)
Multiple models are trained on bootstrap samples of the dataset.
Bootstrap samples are created by sampling with replacement.
Bagging mainly helps in reducing variance.
It is widely used in Random Forests.
2. Boosting (e.g., AdaBoost)
Boosting is a sequential process.
It combines many weak learners to form a strong learner.
Each new model focuses more on the examples that were misclassified by previous
models.
Misclassified samples are given higher importance (weights).
3. Random Forests
Random Forests are ensembles made entirely of decision trees.
They introduce randomness in two ways:
1. Each tree is trained on a different data subset (bagging).
2. At each node, only a random subset of features is considered.
This randomness increases diversity and improves performance.
Aggregation Methods (How Decisions Are Made)
After all committee members are trained, their outputs are combined using one of the
following methods:
1. Majority Voting
Used for classification problems.
The final class is the one predicted by most models.
In binary classification:
o The ensemble is wrong only if more than half of the models are wrong.
2. Averaging / Median
Used for regression problems.
The final output is:
o The mean, or
o The median of all model predictions.
3. Mixture of Experts
A more advanced method.
Uses a gating system that learns:
o Which classifier should be trusted
o Based on the current input.
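The first two aggregation rules (majority vote for classification; mean or median for regression) can be illustrated with Python's statistics module on made-up committee outputs:

```python
from statistics import mode, mean, median

# Hypothetical committee outputs for one test point (illustrative values).
class_votes = ["cat", "dog", "cat", "cat", "dog"]   # 5 classifiers
regression_preds = [2.9, 3.1, 3.0, 10.0, 3.2]       # 5 regressors, one outlier

print(mode(class_votes))         # majority vote -> "cat"
print(mean(regression_preds))    # the mean is pulled up by the outlier
print(median(regression_preds))  # the median stays near 3.0
```

The outlier at 10.0 shows why the median is sometimes preferred over the mean for regression ensembles.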
3) Explain AdaBoost algorithm in detail with equations.
Ans) AdaBoost, which stands for Adaptive Boosting, is a popular ensemble learning
algorithm. Its main goal is to combine many weak learners into one strong learner.
A weak learner is a classifier that performs only slightly better than random
guessing.
By combining many such weak learners, AdaBoost produces a model with high
accuracy.
AdaBoost works in a sequential manner, meaning:
Models are trained one after another.
Each new model focuses on correcting the mistakes made by the previous models.
Core Mechanism: Adaptive Weighting
The word “adaptive” comes from the way AdaBoost changes the weights of training
examples based on how hard they are to classify.
1. Initialization
Initially, all training data points are treated equally.
Each point is given a weight of:
$\frac{1}{N}$
where $N$ is the total number of training examples.
2. Weight Shifting
After training a classifier:
o The algorithm checks which data points were misclassified.
These misclassified (hard) points are given higher weights.
Correctly classified points have their weights reduced.
3. Focusing Attention
In the next iteration, the new classifier is trained using the updated weights.
Because misclassified points now have higher weights, the model is forced to focus
more on them.
Repeating this process gradually improves the decision boundary.
Key Insights
• Complex Boundaries: By combining many simple classifiers (e.g., horizontal or vertical
lines), AdaBoost can create a highly complex decision boundary capable of separating
difficult datasets.
• Loss Function: AdaBoost effectively minimizes an exponential loss function, which is
mathematically well-behaved but sensitive to outliers, since badly misclassified points receive
exponentially large weights.
• Bias and Variance: The algorithm improves accuracy by reducing both bias and
variance.
• Stumping: A common extreme version of AdaBoost uses "stumps" (trees with only one
split), which are often successful when boosted despite being poor learners on their own.
4) Explain stumping and its role in boosting.
Ans)
Stumping is a very specific and extreme form of ensemble learning that is used with
decision trees. In stumping, a decision tree is reduced to its simplest possible form, called a
decision stump.
Definition of Stumping
A decision stump is created by taking only the root node of a decision tree and using it as
the entire classifier.
A full decision tree asks many questions, one after another, before making a
decision.
A stump asks only the first question and completely ignores all branches below it.
Because of this, a stump makes its decision using only one variable (feature).
So, a decision stump is a one-level decision tree.
Role of Stumping in Boosting
In boosting algorithms such as AdaBoost, stumping is used as the ideal weak learner.
A weak learner is a model that performs only slightly better than random guessing.
Stumping fits this role perfectly for the following reasons:
1. Creating a Strong Learner from Weak Learners
Boosting theory states that:
o If you combine many very weak classifiers,
o You can build an ensemble that performs extremely well.
Decision stumps are intentionally low-quality learners.
These simple stumps act as building blocks for creating a strong model.
2. Adaptive Weighting
On its own, a single stump often performs poorly, sometimes even worse than
chance.
In AdaBoost:
o Each datapoint is assigned a weight.
o Datapoints that are misclassified by earlier stumps get higher weights.
This allows AdaBoost to:
o Decide which stump to trust,
o And how much importance each stump should have in the final model.
3. Developing Complex Decision Boundaries
A single stump can create only a very simple decision boundary, such as:
o One horizontal line, or
o One vertical line.
Such simple boundaries cannot separate complex datasets.
However:
When many stumps are combined using boosting,
Their simple boundaries merge to form a highly complex decision boundary.
This complex boundary can successfully separate very difficult data patterns.
4. Efficiency
Decision stumps are very fast to train.
They are much faster than full decision trees because:
o They consider only one split.
This efficiency allows boosting algorithms to:
o Train many stumps quickly,
o And iterate through many rounds of learning.
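A minimal usage sketch: scikit-learn's AdaBoostClassifier uses a depth-1 decision tree (a stump) as its default base learner, so boosted stumps need no extra configuration. The moons dataset and hyperparameters below are illustrative.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# A dataset no single stump (one axis-aligned split) can separate
X, y = make_moons(n_samples=400, noise=0.25, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Default base learner is a decision stump; 200 boosting rounds
ada = AdaBoostClassifier(n_estimators=200, random_state=1)
ada.fit(X_tr, y_tr)
print(ada.score(X_te, y_te))   # far better than any single stump could do
```

A lone stump draws one horizontal or vertical boundary; 200 of them, weighted and summed, carve out the curved moon shapes.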
5) Bagging vs Boosting
Ans)
Basic Concept:
o Bagging: A sampling technique where multiple models are trained on different
bootstrap samples (subsets created with replacement).
o Boosting: An ensemble method that combines multiple weak learners into a
single strong learner.
Training Process:
o Bagging: Parallel and independent; because the models do not depend on each
other, they can be trained simultaneously with nearly linear speedup.
o Boosting: Sequential; each model is trained one after the other to correct the
errors made by the previous predictor.
Data Selection Strategy:
o Bagging: Uses sampling with replacement; some instances appear multiple times
in a subset while others (approx. 37%) are left out as "Out-of-Bag".
o Boosting: The data remains the same, but weights are assigned to datapoints;
misclassified instances receive higher weights to increase their importance in the
next round.
Combining Predictions:
o Bagging: Uses majority voting (statistical mode) for classification and the mean
(average) or median for regression.
o Boosting: Uses a weighted majority vote; the final output is determined by the
sum of hypotheses multiplied by their respective importance ($\alpha$).
Primary Goal:
o Bagging: Primarily a variance-reducing algorithm; it improves the stability and
reliability of models.
o Boosting: Improves accuracy by reducing both bias and variance.
Computational Effort:
o Bagging: Efficient and parallelizable; can run on as many processors as available.
o Boosting: Exhaustive and sequential; individual steps can be more expensive to
run as they depend on the previous stage.
Common Examples:
o Bagging: Random Forests, BaggingClassifier.
o Boosting: AdaBoost, Gradient Boosting.
6) Explain random forest algorithm
Ans) A Random Forest is an ensemble learning method composed of a collection of
Decision Trees. The core philosophy is that while one tree is good, a "forest" of many trees
is better, provided they are sufficiently varied. It is considered a robust and highly accurate
model that improves stability by reducing variance without significantly increasing bias.
Key Features of Random Forests
• Double Randomness: Random Forests create diversity among trees through two
mechanisms:
1. Bagging (Bootstrap Aggregating): Each tree is trained on a different bootstrap
sample (a subset of the data sampled with replacement).
2. Feature Sampling: At every node in a tree, the algorithm does not look at all available
features. Instead, it selects a random subset of features and chooses the best split only from
that subset.
• Efficiency and Parallelism: Because each tree in the forest is independent of the others,
the training process is "embarrassingly parallel". This allows for nearly linear speedup
when run on multiple processors.
• Out-of-Bag (OOB) Evaluation: Since each tree is trained on roughly 63% of the data, the
remaining 37% (the OOB instances) can be used to estimate test error automatically,
eliminating the need for separate cross-validation.
• Stability: Unlike individual decision trees, random forests generally do not need to be
pruned.
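The double randomness, parallelism, and OOB evaluation described above map directly onto Scikit-Learn's RandomForestClassifier; the dataset and hyperparameter values below are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative synthetic dataset
X, y = make_classification(n_samples=600, n_features=20, random_state=7)

rf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",   # random feature subset considered at each split
    bootstrap=True,        # each tree sees a different bootstrap sample
    oob_score=True,        # ~37% OOB instances estimate test error for free
    n_jobs=-1,             # trees are independent, so train them in parallel
    random_state=7,
)
rf.fit(X, y)
print(rf.oob_score_)       # automatic test-error estimate, no hold-out set
```

`max_features="sqrt"` is the second source of randomness: at every node only a random subset of features competes for the split.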
7)Explain mixture of experts and its training procedure.
Ans) The Mixture of Experts is an advanced ensemble learning strategy that, instead of
using simple voting or averaging, learns how to combine classifiers by determining which
"expert" is best suited for a specific input. It essentially treats different parts of the input
space as specialized domains where certain classifiers perform better than others.
Structure and Mechanism
The architecture consists of two primary components:
Experts: A set of individual classifiers (or regressors) that each make an assessment
of the input.
Gating System: A network that uses the current inputs to decide which experts to
trust. It assigns a weight to each expert's output, effectively "routing" the input to the
most relevant classifier.
The Mixture of Experts Algorithm
To reach a final decision, the gating network first computes a weight for each expert from
the current input, each expert then produces its own output, and the final prediction is the
weighted combination of the experts' outputs.
Training Procedure
Training a mixture of experts involves optimizing both the experts' parameters and the gating
network's parameters simultaneously. The sources identify two main methods:
EM (Expectation-Maximization) Algorithm: This is the most common way to
train these networks. It is a general statistical approximation algorithm that iteratively
improves the fit of the experts to their assigned data.
Gradient Descent: It is also possible to use gradient descent to optimize the
parameters, provided the functions used are differentiable.
Conceptual Perspectives
Soft Trees: Mixture of experts can be viewed as a type of decision tree where the
splits are "soft" (probabilistic) rather than the "hard" binary splits used in traditional
decision trees.
RBF Comparison: They are also comparable to Radial Basis Function (RBF)
networks; while a standard RBF node provides a constant output, a mixture of experts
node provides a linear approximation to the data.
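The gate-and-combine mechanism can be sketched with two hypothetical linear experts and a softmax gating network. All weights below are made-up toy values chosen so the gate routes on the sign of the first input feature; this is a sketch of the structure, not a trained model.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # numerically stable softmax
    return e / e.sum()

# Two hypothetical linear experts and a linear gating network (toy weights).
expert_weights = [np.array([2.0, 0.0]), np.array([0.0, -1.5])]
gate_weights = np.array([[5.0, 0.0],    # gate logit for expert 0
                         [-5.0, 0.0]])  # gate logit for expert 1

def mixture_predict(x):
    g = softmax(gate_weights @ x)              # how much to trust each expert
    outputs = [w @ x for w in expert_weights]  # each expert's assessment
    return float(np.dot(g, outputs))           # gated weighted combination

print(mixture_predict(np.array([1.0, 1.0])))   # gate favours expert 0
print(mixture_predict(np.array([-1.0, 1.0])))  # gate favours expert 1
```

In real systems the expert and gate parameters are fitted jointly (e.g., by EM or gradient descent, as described above) rather than set by hand.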
8) explain k-means clustering
Ans) k-Means clustering is a popular unsupervised learning algorithm used to partition a
dataset into $k$ distinct groups based on similarity. First proposed by Stuart Lloyd in 1957
and independently by Edward W. Forgy in 1965, it is sometimes referred to as the Lloyd–
Forgy algorithm. The primary goal is to identify clusters—often visualized as "blobs" of
data points—by minimizing the distance between each point and its respective cluster center,
known as a centroid.
The k-Means Algorithm Procedure
The algorithm proceeds iteratively: (1) place $k$ centroids at initial (often random)
positions; (2) assign each datapoint to its nearest centroid; (3) move each centroid to the
mean of the points assigned to it; (4) repeat steps 2 and 3 until the assignments no longer
change.
Challenges and Solutions
While efficient, k-means has several inherent limitations that require specific strategies to
overcome:
Local Minima: The final clusters depend heavily on the initial random placement of
centroids, which can lead to poor solutions. To solve this, the algorithm is often run
multiple times with different starting locations; the solution with the lowest overall
sum-of-squares error (inertia) is selected.
Choosing the Optimal $k$: Setting $k$ too low leads to underfitting, while setting it
too high causes over-splitting and overfitting. Two common solutions are:
o The Elbow Method: Plotting inertia (how tight the clusters are) against $k$.
The "elbow" point, where the decrease in inertia slows significantly, suggests
the optimal $k$.
o Silhouette Coefficient: A measure of how well a point fits its own cluster
compared to others, ranging from +1 (well-matched) to -1 (likely
misclassified).
Sensitivity to Noise: Because it uses the mean average, k-means is highly susceptible
to outliers, which can pull centroids away from the actual center of the data. A
"robust" alternative is k-medians, which replaces the mean with the median to
minimize the impact of extreme values.
Varying Cluster Sizes: The algorithm struggles when clusters have different
diameters because it only considers distance to the centroid. Soft clustering can
address this by assigning points a "score" for each cluster (using functions like the
Gaussian Radial Basis Function) rather than a single hard label.
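The elbow and silhouette diagnostics above can be sketched with Scikit-Learn's KMeans; the blob dataset and the range of $k$ values are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative "blobs" of data points with 4 true clusters
X, _ = make_blobs(n_samples=500, centers=4, random_state=3)

inertias, silhouettes = {}, {}
# n_init restarts combat local minima: the run with lowest inertia is kept
for k in (2, 3, 4, 5, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=3).fit(X)
    inertias[k] = km.inertia_                         # for the elbow plot
    silhouettes[k] = silhouette_score(X, km.labels_)  # fit quality per k
    print(k, round(inertias[k], 1), round(silhouettes[k], 3))
```

Inertia always falls as $k$ grows, which is why the elbow (where the decrease slows) rather than the minimum is read off the plot; the silhouette coefficient gives a complementary, bounded score.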
Implementation as a Neural Network
k-Means can be replicated as a single-layer neural network using competitive learning. In
this setup, each neuron represents a cluster center in weight space. When an input is
presented, the neurons "compete" to fire; the neuron closest to the input is the winner
(winner-takes-all) and its weights are updated to move it slightly closer to that input. To
ensure fair comparison between neurons, all weight vectors must be normalized to a length
of one, placing them on a unit hypersphere.
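The winner-takes-all update described above can be sketched in NumPy; the learning rate, data, and number of neurons are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalise the inputs

k, eta = 4, 0.1                                 # 4 neurons, learning rate 0.1
W = rng.normal(size=(k, 3))
W /= np.linalg.norm(W, axis=1, keepdims=True)   # unit-length weight vectors

for x in X:
    h = W @ x                                   # activation of each neuron
    winner = np.argmax(h)                       # winner-takes-all
    W[winner] += eta * (x - W[winner])          # move winner toward the input
    W[winner] /= np.linalg.norm(W[winner])      # re-normalise to length one

print(np.linalg.norm(W, axis=1))                # all rows stay on unit sphere
```

Re-normalising after every update keeps all neurons on the unit hypersphere, so no neuron can win merely by having large weights.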
9) Explain k-means neural network and competitive learning in
detail.
Ans) The k-means neural network is a single-layer architecture that implements the
traditional k-means clustering algorithm through a biological-inspired framework known as
competitive learning. Instead of calculating clusters in a single batch, this network uses
neurons to represent cluster centers in weight space.
Network Structure
The architecture of a k-means neural network is relatively simple, consisting of:
Input Nodes: These represent the features of the data point being analyzed.
Competitive Layer: A single layer of $k$ neurons, where each neuron represents a
potential cluster center (centroid).
Weights: The connections between the inputs and a specific neuron represent the
coordinates of that cluster center in the input space.
Competitive Learning and "Winner-Takes-All"
In competitive learning, neurons "compete" with one another to fire in response to a specific
input. This is governed by the winner-takes-all strategy:
1. Activation Calculation: For every input, the network calculates the activation ($h$)
of each neuron. This is generally measured as the distance between the input vector
and the neuron's weight vector.
2. The Competition: The neuron with the highest activation—meaning it is the closest
to the input in weight space—is declared the winner.
3. Grandmother Cells: This process can lead to the formation of "grandmother cells,"
where a specific neuron learns to recognize one particular feature or group and fires
only when that specific input is detected.
The Importance of Normalisation
To ensure a fair competition, the magnitudes of the weight vectors must be equal. If one
neuron has much larger weights than others, it might "win" the competition simply because
of the size of its weights rather than its genuine closeness to the input. To prevent this, all
weight vectors are normalised to unit length, placing them on the unit hypersphere.
MODULE 4
1) Explain the codebook approach to data compression with an
example. And vector quantisation .
Ans) The codebook approach to data compression is closely related to competitive learning,
where inputs are replaced by representative prototypes to reduce redundancy. This method is
utilized both for storing data and for transmitting data such as speech and images.
Explanation of the Codebook Approach
The fundamental concept involves data communication between a sender and a receiver,
particularly when minimizing the amount of transmitted data is necessary due to costs
associated with transmission.
1. Agreement on Prototypes: The sender and receiver first agree on a codebook of prototype
vectors. This codebook effectively holds a reduced set of representative data points.
2. Encoding the Data: Instead of transmitting the entire original data, the sender encodes the
data by finding the corresponding prototype vector in the codebook.
3. Transmission: The sender then transmits only the index of that prototype vector within the
codebook, which is significantly shorter than sending the raw data.
4. Decoding: The receiver uses the received index to look up the full prototype vector in the
shared codebook, thereby reconstructing the data.
Vector Quantisation and Lossy Compression
Vector quantisation (VQ) is a method used primarily for data compression where an input
datapoint is replaced by the cluster center, or prototype vector, that it belongs to.
A crucial aspect of this approach is that the codebook will likely not contain every single
possible datapoint. When an input datapoint is encountered that is not in the codebook, the
sender must accept that the reconstructed data will not be exactly the same.
In this scenario, the sender approximates the input by sending the index of the prototype
vector that is closest to it. This process is known as vector quantisation and serves as the
mechanism for lossy compression.
Example
Suppose a sender wants to minimize transmission costs while sending a stream of data.
1. Codebook Creation: The sender and receiver pre-agree on a codebook containing a limited
set of representative prototype vectors.
2. Compression: When the sender has a piece of data to transmit, they look for the prototype
vector in the codebook that either exactly matches the data or is the closest match (vector
quantisation).
3. Transmission: If the original data point is 40 bits long, but the agreed-upon codebook has
1,024 prototypes (requiring 10 bits for an index, since $2^{10} = 1024$), the sender
transmits only the 10-bit index corresponding to the closest prototype, saving 30 bits.
4. Decompression: The receiver receives the 10-bit index, retrieves the 40-bit prototype vector
from their copy of the codebook, and thereby recovers the compressed data.
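The encode/decode cycle above can be sketched in NumPy, assuming a random 1,024-entry codebook of 5-dimensional prototypes (both sizes are illustrative, chosen to match the 10-bit index in the example).

```python
import numpy as np

rng = np.random.default_rng(42)
codebook = rng.normal(size=(1024, 5))   # 1024 prototypes -> 10-bit indices

def encode(x):
    """Vector quantisation: send only the index of the nearest prototype."""
    return int(np.argmin(np.linalg.norm(codebook - x, axis=1)))

def decode(index):
    return codebook[index]              # receiver looks up the prototype

x = rng.normal(size=5)                  # original datapoint to transmit
i = encode(x)                           # 10-bit index (0..1023)
x_hat = decode(i)                       # lossy reconstruction
print(i, np.linalg.norm(x - x_hat))     # nonzero error -> lossy compression
```

When the datapoint is not itself in the codebook, the reconstruction error above is exactly the information lost by the quantisation.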
2) Define prototype vectors and Voronoi regions. What is the role of
Voronoi regions in VQ, LVQ, and k-means? Distinguish between LVQ
and k-means.
Ans)
Prototype Vectors and Voronoi Regions
Prototype vectors are representative vectors that form the "codebook" used in VQ and data
compression. They are the points at the center of regions in the input space and are used to
represent any datapoint that falls within their cell.
Voronoi regions (also called Voronoi sets) are the cells into which the prototype vectors
divide the input space. Collectively, these cells form the Voronoi tessellation of the space.
The role of Voronoi regions in VQ, Learning Vector Quantisation (LVQ), and k-means is to
define the boundaries of representation:
Vector Quantisation (VQ): The Voronoi regions define the area of influence for each
prototype vector. Any datapoint that lies within a specific Voronoi region is represented by
the prototype vector at the center of that cell.
LVQ and k-means: Learning Vector Quantisation (LVQ) is the application concerned with
learning efficient VQ by choosing the prototype vectors that are as close as possible to all
possible inputs. The k-means algorithm is a method that can be used to solve this problem.
Both LVQ and k-means aim to optimally position the prototype vectors, which consequently
optimizes the size and placement of the Voronoi regions that partition the input space.
Distinguishing LVQ and k-means
The sources identify Learning Vector Quantisation (LVQ) as the overall application of
learning an efficient set of prototype vectors for vector quantisation.
The k-means algorithm is a specific method used to achieve this goal of finding efficient
prototype vectors, particularly if the required size of the codebook is known.
In essence, LVQ is the general problem of choosing optimal prototype vectors, and k-means
is one algorithm that can be employed to solve this problem.
3) SOM vs k-means – compare.
Ans)
Both the Self-Organising Feature Map (SOM) and the k-means algorithm are competitive
learning techniques used to solve the problem of Learning Vector Quantisation (LVQ)—
that is, learning an efficient set of prototype vectors for data compression.
Goal:
o k-means: Find cluster centers (prototypes) if the number of clusters ($k$) is known.
o SOM: Learn an efficient, ordered set of prototype vectors.
Topology/Ordering:
o k-means: Does not inherently preserve the relative ordering of the input data.
o SOM: Preserves topology (relative ordering) by arranging neurons in a grid (e.g.,
2D) so that neighboring neurons represent similar inputs.
Mechanism:
o k-means: Updates a cluster center based solely on the data points assigned to that
center (competitive learning).
o SOM: Updates the winning neuron (BMU) and its surrounding neighbors using a
neighborhood function and lateral connections (cooperative competitive learning).
Usefulness for LVQ:
o k-means: Used for LVQ when the codebook size is known.
o SOM: Often considered more useful than k-means for LVQ, as it can achieve
clustering and ordering simultaneously.
While k-means partitions the data into optimal clusters, the SOM goes further by imposing a
global ordering on the resulting clusters through local interactions between map neurons, a
process known as self-organisation.
4) Monte Carlo (MCMC) method
Ans) The Markov Chain Monte Carlo (MCMC) method has profoundly impacted
statistical computing and statistical physics over the past 20 years, despite the core algorithm
having existed since 1953. It became influential once computers were fast enough to perform
the necessary computations for real-world problems in hours rather than weeks.
MCMC methods are designed to solve two major problems in machine learning and statistics:
computing the optimum solution to an objective function, or calculating the posterior
distribution of a statistical learning problem, especially when the state space is very large.
The fundamental idea behind MCMC is that by exploring the state space, samples can be
constructed in a way that ensures they are likely to originate from the most probable parts of
that space.
Monte Carlo Sampling Foundation
MCMC is built upon the Monte Carlo principle, which states that if independent and
identically distributed samples $x^{(i)}$ are taken from an unknown high-dimensional
distribution $p(x)$, the sample distribution $p_N(x)$ will converge to the true distribution
$p(x)$ as the number of samples ($N$) increases.
This convergence is valuable because samples are more likely to be drawn from regions of
high probability. By concentrating computational resources in these important areas, MCMC
avoids wasting time on regions where the probability is low. Samples generated via the
Monte Carlo principle can be used to approximate the expectation of a function $f(x)$:
$$E_N(f) = \frac{1}{N} \sum_{i=1}^{N} f(x^{(i)}), \qquad \lim_{N\rightarrow\infty} E_N(f) = \sum_{\mathbf{x}} f(\mathbf{x})p(\mathbf{x}).$$
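The convergence of the sample average $E_N(f)$ to the true expectation can be illustrated with a case whose answer is known in closed form: for $f(x) = x^2$ with $x \sim N(0,1)$, the true expectation is 1. The sample sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E[f(x)] for f(x) = x**2 with x ~ N(0, 1); the true value is 1.
for N in (100, 10_000, 1_000_000):
    samples = rng.standard_normal(N)     # i.i.d. samples from p(x)
    estimate = np.mean(samples ** 2)     # E_N(f) = (1/N) * sum of f(x_i)
    print(N, estimate)                   # approaches 1 as N grows
```

The estimate tightens as $N$ increases, exactly as the limit above states.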
Markov Chains
To implement MCMC, a Markov chain is constructed. A Markov chain is a sequence of
possible states where the probability of being in state $s$ at time $t$ depends only on the
state at $t-1$ (the Markov property). The transition probabilities link possible states together,
defining how likely it is to move from the current state to others.
To sample effectively from a distribution $p(x)$, the Markov chain must be constructed so
that:
1. Irreducible: Every state is reachable from every other state, ensuring the distribution
$p(x^{(i)})$ converges to $p(x)$ regardless of the starting state.
2. Ergodic: Every state will eventually be revisited, and the chain is not periodic.
3. Invariant (Reversible): The distribution $p(x)$ is invariant to the Markov chain, meaning the
transition probabilities $T(y, x)$ do not change the distribution:
$$p(\mathbf{x}) = \sum_{\mathbf{y}} T(\mathbf{y},\mathbf{x})p(\mathbf{y}).$$
The condition that ensures this invariance is the detailed balance condition:
$$p(x)T(x,x') = p(x')T(x',x).$$
If a Markov chain satisfies detailed balance, it is guaranteed to be ergodic, and sampling from
it allows one to sample from the target distribution $p(x)$.
Key MCMC Algorithms
The most popular algorithm used for MCMC sampling is the Metropolis–Hastings
algorithm.
1. Metropolis–Hastings Algorithm
The Metropolis–Hastings algorithm uses a proposal distribution, $q(x^*|x^{(i-1)})$, that is
easy to sample from. Similar to rejection sampling, a sample $x^*$ is generated, and a
decision is made whether to keep it or reject it. If the sample is rejected, a copy of the
previous accepted sample is added to the sequence instead.
The decision to accept the new sample $x^*$ is governed by the acceptance probability $u(x^*|x^{(i)})$:
$$u(x^*|x^{(i)}) = \min \left( 1, \frac{\tilde{p}(x^*)\,q(x^{(i)}|x^*)}{\tilde{p}(x^{(i)})\,q(x^*|x^{(i)})} \right).$$
This mechanism works because accepted values tend to move the Markov chain toward more
likely states, and since the chain is reversible, it explores states proportional to the target
distribution $p(x)$.
If the proposal distribution $q(x^*|x)$ is symmetric, the $q$ terms cancel out, simplifying the
acceptance test and resulting in the original Metropolis algorithm.
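A minimal sketch of the original Metropolis algorithm (symmetric Gaussian proposal, so the $q$ terms cancel), targeting an unnormalised Gaussian. The target density, proposal width, chain length, and burn-in fraction are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def p_tilde(x):
    """Unnormalised target density: proportional to N(3, 1)."""
    return np.exp(-0.5 * (x - 3.0) ** 2)

x = 0.0                                   # arbitrary starting state
chain = []
for _ in range(20_000):
    x_star = x + rng.normal(scale=1.0)    # symmetric proposal: q terms cancel
    u = min(1.0, p_tilde(x_star) / p_tilde(x))   # Metropolis acceptance
    if rng.random() < u:
        x = x_star                        # accept the proposed move
    chain.append(x)                       # on rejection, repeat the old state

burned = np.array(chain[5_000:])          # discard burn-in samples
print(burned.mean(), burned.std())        # approximately 3.0 and 1.0
```

Note that only the unnormalised $\tilde{p}$ is ever evaluated: the chain recovers the target's mean and spread without knowing the normalising constant.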
2. Gibbs Sampling
Gibbs sampling is a variation of Metropolis–Hastings used when the full conditional
probability of each variable is known, often seen in models like Bayesian networks.
In Gibbs sampling, a new sample is always accepted ($P_a = 1$). The total algorithm
involves choosing each variable and sampling from its conditional distribution $p(x_j|x_{-j})
$. The process is repeated until the joint distribution stops changing, and variables may be
updated in order or in a random sequence.
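As a sketch of the idea, the example below runs Gibbs sampling on a bivariate Gaussian with correlation $\rho$, a standard illustrative case where both full conditionals $p(x_j|x_{-j})$ are known in closed form (the specific numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

rho = 0.8   # correlation of the illustrative bivariate Gaussian target

def gibbs(n_samples):
    """Alternately sample each variable from its full conditional."""
    x1, x2 = 0.0, 0.0
    out = np.empty((n_samples, 2))
    for i in range(n_samples):
        # p(x1 | x2) and p(x2 | x1) are Gaussians with known mean/variance
        x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
        x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
        out[i] = x1, x2
    return out

s = gibbs(20000)
print(np.corrcoef(s.T)[0, 1])   # close to rho = 0.8
```

Every conditional draw is accepted, which is what makes Gibbs sampling attractive when the conditionals are tractable.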
3. Simulated Annealing
MCMC can also be modified to find the maximum of a distribution rather than approximating
the distribution itself. This is achieved by using simulated annealing, which changes the
Markov chain so that its invariant distribution is $p^{1/T_i}(x)$, where the temperature
$T_i$ decreases over time ($T_i \rightarrow 0$ as $i \rightarrow \infty$). This cooling
schedule progressively makes the algorithm less likely to accept solutions that are worse than
the current one. The modification requires extending the acceptance criterion of Metropolis–
Hastings to include the temperature.
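A rough sketch of this temperature-modified acceptance rule, maximising a made-up unimodal target (the cooling rate, proposal scale, and starting point are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def p_tilde(x):
    """Unnormalised target whose mode (at x = 2) we want to find."""
    return np.exp(-(x - 2.0)**2)

def simulated_annealing(n_steps, T0=10.0):
    """Metropolis moves with the target raised to 1/T and T shrinking to 0."""
    x = -5.0
    for i in range(n_steps):
        T = T0 * 0.99**i                       # geometric cooling schedule
        x_star = x + rng.normal(scale=0.5)
        ratio = p_tilde(x_star) / p_tilde(x)
        # acceptance raised to 1/T: at high T worse moves are often taken;
        # as T -> 0 only improvements get through
        a = 1.0 if ratio >= 1.0 else ratio ** (1.0 / T)
        if rng.random() < a:
            x = x_star
    return x

x_best = simulated_annealing(2000)
print(x_best)   # settles near the mode at x = 2
```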
MCMC methods act like a sophisticated tour guide in a vast landscape (the state space).
Instead of randomly wandering around (like basic sampling) or trying to map the entire area
uniformly (which is too expensive), the Markov Chain focuses the exploration on the
mountain peaks (the high-probability regions of the distribution). By ensuring the transitions
between locations are balanced (detailed balance), the algorithm guarantees that the amount
of time spent at any location is proportional to its height (its true probability).
5) SOM
Ans) The Self-Organising Feature Map, usually abbreviated as SOM, is one of the
most widely used competitive learning algorithms in unsupervised learning. It was proposed
by Teuvo Kohonen in 1988.
Key Concepts
The SOM departs from traditional neural network structures in two main ways:
1. Feature Mapping: The positions of the neurons in the network matter. Neurons that are
close to each other represent similar input patterns.
2. Topology Preservation: The neurons are placed in a grid (usually 1D or 2D) and are
connected within the same layer. This arrangement ensures that the order of the inputs is
preserved—inputs that are close to each other are represented by neurons that are close to
each other on the map. This is also called relative ordering preservation.
Implementation and Structure
Network Arrangement: The typical arrangement for the SOM is a 2D grid of
neurons, though a 1D line of neurons is also sometimes used.
Lateral Connections: The SOM requires interaction between neurons within the
same layer, known as lateral connections. These connections implement feature
mapping:
o The winning neuron positively affects nearby neurons, pulling them closer to
itself in weight space so they represent similar features.
o The winning neuron repels neurons that are further away using negative
connections, pushing them away in weight space to represent different
features.
o Neurons that are very far away are ignored.
Mexican Hat: This form of interaction is known as the "Mexican Hat" form of lateral
connections. However, in Kohonen’s standard SOM algorithm, the weight update rule
is modified instead to include information about neighboring neurons, making the
algorithm simpler.
The SOM Algorithm
The SOM algorithm is a competitive learning method where one neuron is chosen as the
winner (best match), and its weights, along with those of its neighbors, are updated.
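The competition and neighbourhood update can be sketched as follows for a 1-D line of neurons learning 2-D inputs; the Gaussian neighbourhood function and the decay schedules are common choices, not the only ones, and all numbers here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

n_nodes, n_steps = 10, 2000
weights = rng.random((n_nodes, 2))          # one weight vector per neuron
data = rng.random((200, 2))                 # made-up 2-D input data

for t in range(n_steps):
    eta = 0.5 * (1 - t / n_steps)           # decaying learning rate
    sigma = 3.0 * (1 - t / n_steps) + 0.5   # shrinking neighbourhood width
    x = data[rng.integers(len(data))]
    # competition: the winner is the neuron whose weights best match x
    winner = np.argmin(np.linalg.norm(weights - x, axis=1))
    # cooperation: grid neighbours of the winner are pulled toward x too,
    # weighted by a Gaussian of their distance to the winner on the grid
    grid_dist = np.abs(np.arange(n_nodes) - winner)
    h = np.exp(-grid_dist**2 / (2 * sigma**2))
    weights += eta * h[:, None] * (x - weights)

# after training, neighbouring neurons hold similar weight vectors
print(np.linalg.norm(np.diff(weights, axis=0), axis=1))
```

Because neighbours share each update, nearby neurons end up representing nearby regions of the input space, which is exactly the topology preservation described above.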
Implementation Details
Batch vs. Online Learning: The SOM is generally not designed for on-line learning
(where data arrives continuously), as successful convergence is not guaranteed unless
batch learning is applied (where all data is available from the start).
Boundary Conditions: Since neurons on the edges of a rectangular map behave
differently due to having fewer neighbors, boundary conditions may need to be
addressed. To achieve uniform neighborhood interaction, the map boundaries can be
wrapped:
o In 1D, a line is turned into a circle.
o In 2D, a rectangle is turned into a torus. This removal of boundaries ensures
that there are no neurons on the edge of the feature map, treating all neurons
equally during learning.
Self-Organisation: The spontaneous emergence of a global ordering of neurons in
the network, despite the interactions between neurons being only local (i.e., only
neighbors interact), is known as self-organisation. This phenomenon demonstrates
that complex, ordered behavior can arise from simple, local, decentralized rules,
similar to birds flying in formation.
MODULE 5
1) FORWARD ALGORITHM IN HIDDEN MARKOV MODELS
ANS)
The Forward Algorithm is a key method used in Hidden Markov Models (HMMs) to solve
the problem of determining how well a sequence of observations matches a specified HMM.
Specifically, it calculates the probability of the entire observation sequence, P(O), given the
model.
Context and Purpose in HMMs
Hidden Markov Models are a type of dynamic Bayesian network particularly popular for
processing temporal data, such as sequences of measurements made at regular time intervals.
The model consists of a set of hidden states, the transition probabilities $a_{ij}$ between
those states, the observation (emission) probabilities $b_j(o)$ of each output given a state,
and the initial state distribution $\pi$.
Key Variables and Steps:
1. Initialisation: $\alpha_1(i) = \pi_i \, b_i(o_1)$, the probability of starting in state $i$
and emitting the first observation.
2. Recursion: $\alpha_{t+1}(j) = \left[ \sum_i \alpha_t(i)\, a_{ij} \right] b_j(o_{t+1})$,
summing over every state that could have led to state $j$.
3. Termination: $P(O) = \sum_i \alpha_T(i)$, the total probability of the observation
sequence under the model.
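A minimal sketch of the forward ($\alpha$) recursion on a toy two-state HMM (all parameter values below are made up for illustration):

```python
import numpy as np

# Toy HMM: 2 hidden states, 2 observation symbols (numbers illustrative).
pi = np.array([0.6, 0.4])                # initial state probabilities
A = np.array([[0.7, 0.3],                # A[i, j] = P(state j | state i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],                # B[i, o] = P(observation o | state i)
              [0.2, 0.8]])

def forward(obs):
    """Return P(O) by summing over all state paths with the alpha recursion."""
    alpha = pi * B[:, obs[0]]            # initialisation: alpha_1(i)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # recursion: sum over previous states
    return alpha.sum()                   # termination: P(O) = sum_i alpha_T(i)

print(forward([0, 1, 0]))   # ≈ 0.1089
```

The recursion costs $O(T N^2)$ for $T$ observations and $N$ states, instead of the exponential cost of enumerating every state path.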
2) VITERBI ALGORITHM
Ans) The Viterbi Algorithm is a dynamic programming method used in Hidden Markov
Models (HMMs) to solve the decoding problem—that is, determining the most probable
sequence of hidden states that produced the observed sequence of outputs.
Implementation Consideration: Numerical Underflow
As with the Forward algorithm, Viterbi repeatedly multiplies probabilities less than 1,
which causes the numbers to shrink rapidly towards 0. The standard remedy is to work with
log probabilities, so that the products become sums.
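A sketch of the algorithm in log space, using the same style of toy two-state HMM (parameter values are illustrative):

```python
import numpy as np

# Log probabilities avoid underflow from repeated multiplication.
log_pi = np.log([0.6, 0.4])
log_A = np.log([[0.7, 0.3],
                [0.4, 0.6]])
log_B = np.log([[0.9, 0.1],
                [0.2, 0.8]])

def viterbi(obs):
    """Most probable hidden-state path for the observation sequence."""
    delta = log_pi + log_B[:, obs[0]]    # best log-prob ending in each state
    back = []                            # backpointers for path recovery
    for o in obs[1:]:
        scores = delta[:, None] + log_A  # scores[i, j]: come from i, go to j
        back.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + log_B[:, o]
    # backtrack from the best final state
    path = [int(delta.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]

print(viterbi([0, 1, 1]))   # -> [0, 1, 1]
```

The only change from the Forward algorithm is replacing the sum over previous states with a max, plus keeping backpointers so the winning path can be read off at the end.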
3) BAUM–WELCH ALGORITHM
ANS)
The Baum–Welch algorithm is also known as the Forward–Backward algorithm.
This algorithm is essential in Hidden Markov Models (HMMs) for solving the learning
problem: determining the optimal set of transition and observation probabilities from a set of
observations. Since this process involves learning probabilities without target solutions, it is
categorized as an unsupervised learning problem.
Role as an EM Algorithm
The Baum–Welch algorithm works as an Expectation-Maximization (EM) algorithm. The
main idea is to estimate the values for the hidden states (the expectation step) and then use
those estimates to maximize the likelihood of the training data (the maximization step),
repeating this until the parameters stop changing.
The algorithm fits the three core parameters of the HMM: the initial state distribution
$\pi$, the transition probabilities $a_{ij}$, and the observation (emission) probabilities
$b_j(o)$.
Core Steps (Forward–Backward)
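As a rough sketch of one EM loop (toy observation sequence, random initial guesses, and variable names are all illustrative assumptions), combining the forward pass ($\alpha$), the backward pass ($\beta$), and the re-estimation step:

```python
import numpy as np

rng = np.random.default_rng(4)

def baum_welch(obs, n_states, n_symbols, n_iters=50):
    """Sketch of Baum-Welch (EM) on a single observation sequence."""
    # random initial guesses for pi, A, B, each row normalised
    pi = np.full(n_states, 1.0 / n_states)
    A = rng.random((n_states, n_states)); A /= A.sum(axis=1, keepdims=True)
    B = rng.random((n_states, n_symbols)); B /= B.sum(axis=1, keepdims=True)
    T = len(obs)
    for _ in range(n_iters):
        # E-step: forward (alpha) and backward (beta) passes
        alpha = np.zeros((T, n_states))
        beta = np.zeros((T, n_states))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        likelihood = alpha[-1].sum()
        # gamma[t, i]: P(state i at t | O); xi[t, i, j]: P(i at t, j at t+1 | O)
        gamma = alpha * beta / likelihood
        xi = (alpha[:-1, :, None] * A[None] *
              (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood
        # M-step: re-estimate parameters from the expected counts
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(n_symbols):
            B[:, k] = gamma[np.array(obs) == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]
    return pi, A, B

obs = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0]
pi, A, B = baum_welch(obs, n_states=2, n_symbols=2)
print(np.round(A, 2))
print(np.round(B, 2))
```

Each iteration is guaranteed not to decrease the likelihood of the observations, though, like all EM procedures, it may converge to a local rather than global optimum.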