Linear classifier
Binary classification
KNN
k-Nearest Neighbors algorithm
Majority vote among the k nearest points
Calculates distance against the whole dataset; no linear-separability assumption needed
Cons:
Prediction time is high – distance calculation against every training point
No learning phase (lazy learner)
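A minimal pure-Python sketch of the algorithm (brute force, to show why prediction time grows with the dataset; the function name and toy data are illustrative only):

```python
from collections import Counter
import math

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Distance to the WHOLE dataset -- this is why prediction time is high.
    dists = [(math.dist(x, x_new), label) for x, label in zip(X_train, y_train)]
    dists.sort(key=lambda pair: pair[0])
    votes = [label for _, label in dists[:k]]
    return Counter(votes).most_common(1)[0][0]

# Toy usage: two clusters, classify a point near the first one
X_train = [[1, 1], [1, 2], [5, 5], [6, 5]]
y_train = [0, 0, 1, 1]
pred = knn_predict(X_train, y_train, [1, 1.5], k=3)
```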
decision tree
Now there is one red , misclassified if we
stop question on right side .
So decision tree will again keep on asking questions , drawing boundaries around that red , ,
and making it fully over fitting , meaning memory the the data point boundary.
But if u do cross validation , u will find which one is good without overfitting.
Why Decision Trees? (Motivation)
Decision Trees (DTs) are used because:
1. They handle non-linear data
Most real data is not linearly separable. Linear models → fail.
DTs split the space into smaller regions and can fit complex patterns.
2. They handle mixed data types
● Numerical
● Categorical
No scaling/normalization needed.
3. They are interpretable
You can explain exactly why a prediction was made (“Why did my model predict
this?”)
4. No assumptions
Unlike Logistic Regression or SVM, DTs make:
● No linearity assumption
● No distribution assumption
5. They handle missing values and outliers well
Concept of a Decision Tree
A Decision Tree tries to mimic human decision-making:
A decision tree is built by:
1. Taking all training data at the root
2. Finding the best feature and best split point
3. Dividing data into child nodes
4. Repeating until:
○ node becomes pure (all same class),
○ or max depth is reached,
○ or no improvement in impurity.
The key question:
Which split is “best”?
Goal of a Decision Tree
Find the best question at each step so that the data gets separated as cleanly as
possible.
Entropy = how confused / uncertain the class distribution is
If the node is pure (only one class) → Entropy = 0
✔ If the node is mixed 50–50 → Entropy = maximum
Entropy measures impurity.
PART 6 — Information Gain (very
important for IIT courses)
Decision Tree chooses the best feature using Information Gain.
❓ What does Information Gain measure?
How much uncertainty (entropy) is removed after splitting.
✔ High Information Gain = good split
(because entropy decreases a lot)
✔ Low Information Gain = bad split
(entropy hardly decreases)
How much confusion is reduced
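A tiny pure-Python sketch of entropy and Information Gain (illustrative only, not an optimized implementation):

```python
import math

def entropy(labels):
    """Entropy of a class distribution: 0 if pure, 1 bit for a 50-50 binary mix."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(parent, children):
    """How much entropy is removed: parent entropy minus weighted child entropy."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Pure node -> entropy 0; 50-50 node -> entropy 1 (maximum for binary)
# A perfect split of [0,0,1,1] into [0,0] and [1,1] removes all uncertainty.
gain = information_gain([0, 0, 1, 1], [[0, 0], [1, 1]])
```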
PART 8 — If you have MANY features —
how does the tree choose a condition?
The algorithm tries ALL features and ALL possible threshold values and picks the one with
the highest Information Gain.
Example:
Features:
● Age
● Salary
● Height
● Weight
● BloodPressure
● Gender
● SugarLevel
● Cholesterol
● BMI
● SmokingStatus
For each feature:
● For continuous: test all possible splits like Age ≤ 21, 22, 23, ...
● For categorical: test sets like Gender = Male vs Female
Compute:
● Entropy of parent
● Entropy of each child
● Compute Information Gain
Pick the best one.
This is why decision trees handle multiple features easily.
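The try-all-features, try-all-thresholds loop above can be sketched in plain Python (a slow, illustrative version; real libraries are heavily optimized):

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_split(X, y):
    """Try every feature and every threshold between sorted values;
    return (feature index, threshold, information gain) of the best split."""
    best = (None, None, -1.0)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # candidate thresholds: midpoints between consecutive sorted values
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y[i] for i, row in enumerate(X) if row[j] <= t]
            right = [y[i] for i, row in enumerate(X) if row[j] > t]
            n = len(y)
            gain = (entropy(y)
                    - (len(left) / n) * entropy(left)
                    - (len(right) / n) * entropy(right))
            if gain > best[2]:
                best = (j, t, gain)
    return best

# Toy data: columns are (Age, Salary); age separates the classes perfectly
result = best_split([[21, 100], [22, 110], [40, 500], [45, 520]], [0, 0, 1, 1])
```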
1. Classic decision trees (CART, ID3, C4.5)
DO NOT do binning
They try all possible thresholds between sorted values.
✔ sklearn’s DecisionTreeClassifier
✔ ID3
✔ C4.5
✔ CART (Gini)
👉 No binning. No min–max. No fixed intervals.
Only dynamic thresholds based on data.
✅ 2. But some implementations DO use
binning
Especially for speed in big data systems.
✔ These use binning:
● XGBoost
● LightGBM
● CatBoost
● Spark ML decision trees
● Histogram-based Gradient Boosting (sklearn’s HistGradientBoostingClassifier)
Hyperparameters:
Depth: how many questions along a path
Impurity criterion
Stop when node is pure
Tab 19
Types of modeling
If test data is exactly the same as training – no use
Completely different – no use
Test and train should be related
ANS: split the distribution of the data into training and testing
How to do modeling in ml:
Types of modeling
Generative model
Discriminative model
👉 Generative models learn the full data distribution and can generate data;
discriminative models learn only the decision boundary for classification.
In the machine learning context
We often use the joint probability between:
● Feature(s) X
● Target/label Y
So yes, in ML:
P(X, Y)
means:
👉 probability of seeing a data point with feature value X and label Y
This is exactly what generative models learn.
Meaning:
● How each class generates the features
● What the whole data distribution looks like
● Can compute anything: marginals, conditionals, likelihoods
● Can generate new samples
Discriminative models learn purely the boundary that separates classes, not the distribution of the data.
2. Core difference in one line
● Generative: learns how each class generates the data → then classify
● Discriminative: learns the best boundary between classes
⭐ 3. Examples
🔵 Generative models
● Naive Bayes
● Gaussian Mixture Models
● Hidden Markov Models
● Probabilistic PCA
● Variational Autoencoders (VAEs)
● GANs (generative adversarial networks)
🔴 Discriminative models
● Logistic Regression
● Linear/Kernel SVM
● Neural Networks (when doing classification)
● Decision Trees / Random Forest
● Perceptron
● Gradient Boosting
5. Intuitive analogy
Imagine classifying emails as SPAM or NOT SPAM.
🔵 Generative
Learns:
● What SPAM emails look like
● What NOT-SPAM emails look like
Then compares which class “generated” the observed email.
🔴 Discriminative
Learns:
● A boundary that separates SPAM vs NOT SPAM
It doesn’t care what SPAM looks like internally.
⭐ 6. Advantages & disadvantages
🔵 Generative — Pros:
✔ Can generate new data
✔ Works well with smaller datasets
✔ Can handle missing data
✔ Usually faster training (e.g., Naive Bayes)
🔵 Generative — Cons:
✘ Needs strong assumptions (e.g., Naive Bayes independence)
✘ Can be wrong if model assumptions are wrong
🔴 Discriminative — Pros:
✔ Better performance in classification tasks
✔ Directly models the boundary
✔ No assumptions about data distribution
✔ SVM, Logistic Regression often beat NB
🔴 Discriminative — Cons:
✘ Needs more data
✘ Cannot generate new samples
✘ Cannot model hidden structure of data
⭐ 7. One-sentence summary for
exams/interviews
👉 Generative models learn the full data distribution and can generate data;
discriminative models learn only the decision boundary for classification.
Note: discriminative models such as logistic regression still learn probabilities P(y | x), just not the full data distribution.
Tab 20
Generative model
Let's start with:
Let all features be binary, in d dimensions.
Example: spam or not.
Emails can have different lengths – convert each into a fixed-length d-dimensional
feature vector.
Way: take d = the number of words in the dictionary, where the dictionary is the set of
unique words in the data.
So for the email “hello how are you”, those words' dimensions get 1, all others 0.
Now our features themselves are binary.
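A minimal sketch of this conversion (whitespace tokenization is a simplifying assumption; real pipelines normalize punctuation etc.):

```python
def build_dictionary(emails):
    """Dictionary = sorted unique words across the whole dataset; d = its size."""
    return sorted({word for email in emails for word in email.lower().split()})

def to_binary_vector(email, dictionary):
    """Fixed-length d-dimensional vector: 1 if the word appears in the email, else 0."""
    words = set(email.lower().split())
    return [1 if w in words else 0 for w in dictionary]

emails = ["hello how are you", "win free offer"]
dictionary = build_dictionary(emails)        # d = 7 unique words
vec = to_binary_vector("hello how are you", dictionary)
```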
Now associate every email with a label using probability:
P({hello how are you}, spam) = ?
How?
Generative story:
Assume a label (spam) first; then, given spam, model how the email is generated.
p(x, y) = p(y) · p(x | y)
This is better way why ?
Your question:
“Why don’t we take features P(X) as prior and then generate P(Y∣X)?”
In the real world:
✔ Labels cause features
Not the other way around.
Examples:
● Disease (Y) → causes symptoms (X)
● Digit (Y) → pixels (X)
● Spam label (Y) → words in email (X)
Reason 2: P(X) has nothing to do with
classification
P(X) is the probability of seeing certain features no matter what the class is – not useful on its own.
Example:
● Probability that an email contains the word “free”
● Probability that height is 170 cm
That doesn’t help classify anything.
You are saying:
👉 “I will first assume and generate features without knowing the class.”
👉 “Then I will decide the class using those features.”
This is the opposite of how the world works.
Question:
How do you assume the prior?
Step 1: how we estimate P(y) (label prior)
During training you are given (features, label) pairs.
Example:
Step 2: how we estimate P(x | y) (feature likelihoods)
End of training.
Now new features come in at test time 👍
Do we assume the label?
No — we estimate its distribution from data:
P(y = c) = number of training samples with label c / total samples
That is the label prior.
ONE-LINE SUMMARY
Naive Bayes does this:
1. Estimate label probabilities from training data → P(y)
2. Estimate feature behavior within each class → P(x | y)
3. Use Bayes rule to compute P(y | x) for classification
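The three-step summary can be sketched in plain Python for binary features (a toy estimator to show the counting, not a production implementation):

```python
def train_naive_bayes(X, y):
    """Estimate P(y) and P(x_j = 1 | y) from (features, label) pairs.
    X: list of binary feature vectors, y: list of class labels."""
    n = len(y)
    classes = sorted(set(y))
    d = len(X[0])
    # Step 1: label prior = count of class c / total samples
    prior = {c: y.count(c) / n for c in classes}
    # Step 2: per-class feature likelihoods P(x_j = 1 | y = c)
    likelihood = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        likelihood[c] = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
    return prior, likelihood

# Toy data: feature 0 fires only for class 1, feature 1 is uninformative
prior, likelihood = train_naive_bayes([[1, 0], [1, 1], [0, 1], [0, 0]], [1, 1, 0, 0])
```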
Generative story :]
Explanation:
Goal
Understand:
❗ Full generative model
We need P(x|y) and x is a vector → must model every possible combination of feature values.
✔ Naive Bayes
Assumes independence → combination probability becomes a product of single-feature
probabilities → much fewer parameters.
But why?
❗ Why a general generative model needs 2ᵈ parameters
and
✔ Why Naive Bayes needs only d parameters
🌟 PART 1 — Let’s take a simple example
Meaning:
What is the probability of each of these 8 combinations if the email is spam?
WHY do we need combination
probabilities in a full generative
model?
Because a generative model tries to learn:
P(x∣y)
Where x is a full feature vector, not individual features.
Example:
If an email has 3 features:
x=(x1,x2,x3)
Then the object whose probability we want is the entire vector:
x = (1,0,1)
This is one combination of features.
x – vector of all features combined
Naive Bayes
Naive Bayes Assumption
No need to model every combination of features,
because the features are treated as independent of each other (given the class).
So:
When can we consider features independent (given the class)?
This assumption works well in practice, especially for text classification.
EMPIRICAL (PRACTICAL) FACT
✔ In text, words are almost independent after knowing
the class.
Example (Spam):
● If you know the email is spam,
the probability of seeing “free” does not strongly depend on seeing “win”.
Why?
➡️
Because the cause of both features is the same:
the email is spam.
Once you know the class, the dependence between features weakens.
So conditional independence is often a reasonable approximation.
Concrete Spam Example
If you know an email is spam:
● “free” is likely
● “win” is likely
● “offer” is likely
But:
Does knowing “free” appeared change my belief about “win”?
Not much.
They’re both just spammy words.
So:
This is the heart of Naive Bayes.
Drawbacks of Naive Bayes:
Zero probabilities for feature values never seen in training.
Possible correction: smoothing –
artificial counts added to the real counts to prevent zero probabilities or to express prior belief.
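A sketch of that correction for the likelihood estimates (the name Laplace/additive smoothing and the parameter alpha follow the standard convention; the function name is illustrative):

```python
def smoothed_likelihood(rows, d, alpha=1.0):
    """P(x_j = 1 | y) with Laplace smoothing: add alpha artificial counts to
    both outcomes of each binary feature, so no probability is ever exactly zero."""
    n = len(rows)
    return [(sum(r[j] for r in rows) + alpha) / (n + 2 * alpha) for j in range(d)]

# Feature 0 never fired in this class: unsmoothed estimate would be 0
probs = smoothed_likelihood([[0, 1], [0, 1]], d=2)
```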
Parameters:
1. Binary classification, d binary features (full joint model): 2^(d+1) − 1
2. Binary classification, d binary features, independent (Naive Bayes): 2d + 1 (2d likelihood terms + 1 prior)
Class-Conditional Probability Estimation
P(x_j = 1 | y) = (number of training examples of class y with feature j = 1) / (total number of examples in class y)
Question:
loss function for generative model
for any probability-based (generative) model, three concepts always appear together:
✅ 1. Likelihood
✅ 2. Maximum Likelihood
✅ 3. Parameter Estimate
Loss function:
Why not the other options?
✖ Likelihood
Not a loss, because it is maximized, not minimized.
✖ Log-likelihood
Also maximized, not minimized.
✖ Negative likelihood
Possible but not used, because likelihood is a product of many tiny probabilities → underflow.
✔ Negative log-likelihood
Used everywhere because:
● numerically stable
● converts product into sum
● easier derivatives
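A small sketch of why the negative log-likelihood is preferred (toy per-sample probabilities, just for illustration):

```python
import math

probs = [0.9, 0.8, 0.95, 0.7]          # per-sample likelihoods for some model

# Raw likelihood: product of many small numbers -> shrinks toward 0 (underflow risk)
likelihood = math.prod(probs)

# Negative log-likelihood: the product becomes a numerically stable SUM,
# and maximizing likelihood == minimizing this loss
nll = -sum(math.log(p) for p in probs)
```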
For naive bayes:
SVM
SVM labels: −1/+1
Binary classifier
Uses a hyperplane, which comes from a linear equation, to separate classes
What is the SVM Margin? (Simple
Meaning)
Margin = Gap between the decision boundary and the closest data points.
SVM tries to separate the classes with maximum margin.
Loss function:
Why hinge loss?
⭐ It gives zero loss only if margin is satisfied
(y(wᵀx) ≥ 1)
⭐ Penalizes inside-margin points
(0 < y(wᵀx) < 1)
⭐ Penalizes misclassified points heavily
(y(wᵀx) < 0)
⭐ Is convex → optimizable with gradient methods
⭐ Fits SVM geometrical idea of margin maximization
So what does “penalizing inner margin”
mean?
It means:
The model receives a positive loss even though the point is correctly classified.
Why?
Because SVM wants confidence, not just correctness.
Correct classification is not enough; SVM wants the point to be at least 1 unit away from the
boundary.
Why punish points inside the margin?
Because SVM wants:
● a wide margin
● all points safely outside that margin
● a classifier that generalizes well
Inner-margin points make the boundary unstable.
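The three regimes above can be sketched as a tiny function (y in {−1, +1}, score = wᵀx):

```python
def hinge_loss(y, score):
    """Zero loss only when the margin is satisfied (y * score >= 1);
    positive loss inside the margin; large loss when misclassified."""
    return max(0.0, 1.0 - y * score)

safe = hinge_loss(+1, 2.0)     # outside the margin: no loss
inside = hinge_loss(+1, 0.5)   # correct but inside the margin: still penalized
wrong = hinge_loss(+1, -1.0)   # misclassified: penalized heavily
```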
The size of each weight update is controlled by the learning rate η.
Gaussian Naive Bayes:
Intuition:
It is the most common shape of data in the real world.
It looks like a bell curve:
/\
/ \
/ \
-----/------\------
It is highest in the middle, and goes down smoothly on both sides.
A Gaussian distribution describes quantities where:
● most values are near the average
● fewer values appear far away
● extremely small or extremely large values are rare
Example:
Heights of people in a class:
most people ≈ around average height
few people ≈ very short or very tall
This naturally forms a bell curve.
Real examples of Gaussian-like data:
Feature – Why it looks Gaussian
Human height – Most people near average height
Weight – Same reason
Exam scores – Most get moderate marks
The Gaussian is defined by 2 numbers
1. Mean (μ)
→ center of the curve
2. Standard deviation / variance (σ)
→ how spread out the curve is
Narrow curve (small σ):
/\
/ \
------/----\-------
Wide curve (large σ):
/------\
/ \
---/----------\-----
That’s all.
Why ML uses Gaussian distribution?
Because many numeric features naturally look like this bell shape.
So modeling them as Gaussian makes classifiers (like Gaussian Naive Bayes) work well.
Gaussian Naive Bayes:
Models each feature with a class-specific mean and standard deviation.
Example:
Step-by-step Intuition
Suppose you have a feature: email length
And two classes:
● y=1 spam
● y=0: not spam
Step 1️⃣ GNB learns two bell curves, i.e. the mean and standard deviation for each class
From training data, it computes:
For spam:
● mean length = 50 words
● std deviation = 10
For non-spam:
● mean length = 200 words
● std deviation = 30
So you get:
● A spam bell curve centered at 50
● A non-spam bell curve centered at 200
Step 2️⃣ For a new email, say length = 60
GNB calculates:
● How likely is 60 under the spam Gaussian?
● How likely is 60 under the non-spam Gaussian?
Since 60 is very close to 50,
the spam bell curve gives a high probability.
60 is far from 200,
so the non-spam bell curve gives a very tiny probability.
Step 3️⃣ Multiply likelihoods across
features
If there are multiple features x1, x2, …, xd, GNB multiplies the likelihoods P(xj | y) of all of them.
🌈 Visual Intuition
Which bell curve is taller at 60?
→ That class has higher likelihood.
GNB repeats this for every feature and multiplies results.
💡 Why does this work?
Because the Gaussian assigns:
● high probability to values close to its mean
● low probability to values far from the mean
So GNB says:
“A point belongs to the class whose Gaussian curves give it the highest probability.”
🔥 Summary (super simple)
Gaussian Naive Bayes:
1. Learns a Gaussian for each feature and class
2. Calculates how likely each feature value is
3. Multiplies all feature likelihoods
4. Picks the class with biggest probability
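The whole procedure, using the email-length numbers from the example above (the means and deviations are the illustrative values from the text, not learned here):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Height of the bell curve at x: high near the mean, tiny far away."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# "Learned" per-class parameters from the example
spam_mu, spam_sd = 50, 10      # spam bell curve centered at 50 words
ham_mu, ham_sd = 200, 30       # non-spam bell curve centered at 200 words

length = 60                    # new email
p_spam = gaussian_pdf(length, spam_mu, spam_sd)
p_ham = gaussian_pdf(length, ham_mu, ham_sd)
# 60 is close to 50, so the spam curve is far taller there -> classify as spam
```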
Tab 23
Loss function comparison
The key conceptual difference
🔵 SVM
● Only cares if a point is inside the margin.
● Update is constant (±x direction).
● Update does NOT depend on how close the score is to the boundary.
🔴 Logistic Regression
● Cares about probability quality.
● Update depends on how wrong the prediction is.
● Update scales automatically with the prediction error.
gradient descent
Short explanation (most intuitive):
● Gradient says which way loss increases.
● We update in the opposite direction to reduce loss.
● The sign simply tells us whether to move left or right.
Why does the sign of the gradient
determine the direction of weight update?
✔️ Core idea
Gradient descent wants to minimize the loss.
To minimize something, you must move in the opposite direction of its slope.
Gradient = slope.
● If gradient is positive → slope goes up when weight increases
→ to reduce loss, you must move weight down.
● If gradient is negative → slope goes down when weight increases
→ to reduce loss, you must move weight up.
Therefore:
● Positive gradient → subtract → weight decreases
● Negative gradient → subtract negative → weight increases
If increasing my weight increases the loss, then to decrease the
loss I move against the rate of change of the loss, i.e. I
subtract the gradient.
When slope (gradient) is negative
The value of the function (loss) decreases when the weight increases.
Let’s break it down simply:
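A minimal 1-D sketch, assuming a toy loss (w − 3)² whose minimum is at w = 3:

```python
# Minimize loss(w) = (w - 3)^2 with gradient descent.
# gradient = 2 * (w - 3): positive when w > 3 (move left), negative when w < 3 (move right).
w = 0.0
eta = 0.1                      # learning rate
for _ in range(100):
    grad = 2 * (w - 3)
    w = w - eta * grad         # always subtract: move opposite to the slope
# w converges toward 3, the minimum of the loss
```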
Random forest
SGDRegressor – all
Classification scores:
Precision
Accuracy
Positive class
KNeighborsClassifier
BaggingClassifier
VotingClassifier
AdaBoost classifier
AdaBoost – Simple Step-by-Step
Explanation
Step 1: Start with your data
● You have a dataset with features (X) and labels (y).
● Example: Emails labeled as spam or not spam.
Step 2: Train a weak learner
● Pick a simple model, called a weak learner (usually a tiny decision tree with depth 1, a “stump”).
● This model is just slightly better than random guessing.
● Train it on all your data.
Step 3: Check which samples it got wrong
● Look at the predictions.
● Some samples are correctly classified, some are misclassified.
Step 4: Give more weight to the hard examples
● Increase the “importance” of the misclassified samples.
● The next weak learner will focus more on these hard-to-classify samples.
Step 5: Train the next weak learner
● Train a new weak learner on the weighted dataset.
● Repeat Step 3 and Step 4: check errors and increase weights for misclassified points.
Step 6: Combine all weak learners
● Each weak learner gets a weight based on how well it performed.
● The final prediction is a weighted vote of all weak learners.
● Strong prediction comes from combining many weak learners.
Step 7: Make predictions
● When a new sample comes in, each weak learner predicts.
● The predictions are added up using the weights, and the majority (or weighted sum)
decides the final label.
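Steps 1–7 can be sketched as a miniature AdaBoost on 1-D data with threshold stumps (a toy version to show the reweighting loop; the function names and data are illustrative, and real use would go through a library classifier):

```python
import math

def stump_predict(x, threshold, polarity):
    """Depth-1 weak learner: +polarity on one side of the threshold, -polarity on the other."""
    return polarity if x <= threshold else -polarity

def train_adaboost(X, y, rounds=5):
    """X: list of 1-D feature values, y: labels in {-1, +1}."""
    n = len(X)
    weights = [1 / n] * n                        # Step 1: equal sample weights
    learners = []
    for _ in range(rounds):
        # Step 2: pick the stump with the lowest WEIGHTED error
        best = None
        for t in sorted(set(X)):
            for polarity in (+1, -1):
                err = sum(w for x, yi, w in zip(X, y, weights)
                          if stump_predict(x, t, polarity) != yi)
                if best is None or err < best[0]:
                    best = (err, t, polarity)
        err, t, polarity = best
        err = max(err, 1e-10)                    # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # learner weight (Step 6)
        learners.append((alpha, t, polarity))
        # Steps 3-4: raise weights of misclassified samples, shrink the rest
        weights = [w * math.exp(-alpha * yi * stump_predict(x, t, polarity))
                   for x, yi, w in zip(X, y, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return learners

def adaboost_predict(x, learners):
    """Step 7: weighted vote of all weak learners."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in learners)
    return 1 if score >= 0 else -1

learners = train_adaboost([1, 2, 3, 4, 5, 6], [1, 1, 1, -1, -1, -1], rounds=3)
```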
XGBoost vs LightGBM
XGBoost = LEVEL-WISE
At every depth XGBoost tries to split ALL nodes at that level:
Level 0
Split root.
Level 1
Try to split:
● Left child
● Right child
Level 2
Try to split:
● Left-Left
● Left-Right
● Right-Left
● Right-Right
ALL nodes are considered, even if only one node actually splits.
🔵 Visualization (XGBoost)
Level 0:
O
Level 1:
O
/ \
O O ← both tried for split
Level 2:
/ \ / \
O O O O ← all tried for split
✔ Slower tree growth
✔ More balanced structure
✔ Tries splitting “left and right” at each depth
✔ Computation-heavy
✅ LightGBM = LEAF-WISE (with depth
limit)
LightGBM does NOT go to all nodes at the same level.
Instead, it chooses only 1 leaf — the one that gives maximum gain — and splits that leaf
further.
This creates a deep, uneven tree.
🔵 Visualization (LightGBM)
Start:
O
Split the leaf with highest gain:
O
|
O
Split the leaf with highest gain again:
O
|
O
|
O
Then maybe side branch:
O
|
O
/ \
O O
✔ Very fast
✔ Very deep
✔ Grows like a vine in the direction of best gain
✔ Only splits 1 branch at a time (not both)
XGBoost
Dataset (Binary Classification Example)
Feature: Seats Filled %
Label y: 0 = low audience, 1 = high audience
Seats % y
10 0
15 0
22 0
25 1
30 1
55 1
70 1
90 0
More mixed, not clean.
Now you can see real XGBoost behavior.
🌳 XGBoost CLASSIFIER – LEVEL WISE
TREE
🔵 LEVEL 0 → Root Split
XGBoost checks all possible splits:
<12, <18, <24, <28, <40, <60, <80 …
It finds the highest gain split, for example:
✔ Best split = seats < 28
ROOT
(seats < 28)
/ \
Left: 10,15,22 → all 0 Right: 25,30,55,70,90 → mixed
🔵 LEVEL 1 → Split BOTH branches
(level-wise)
We now have 2 nodes → XGBoost tries splitting BOTH.
🟢 Split LEFT node (10,15,22 → all 0)
It tries splits like:
● <12
● <18
But gain might still exist if probabilities differ (in logistic loss).
Assume best split is:
✔ Left split: seats < 18
Left Left: 10,15 → all 0
Left Right: 22 → 0
Left subtree:
(seats < 28)
/
(seats < 18)
/ \
10,15 22
🔵 Split RIGHT node (25,30,55,70,90 → mixed)
Possible splits:
● <27
● <45
● <60
● <80
Assume best split is:
✔ Right split: seats < 60
Right Left: 25,30,55 → all 1
Right Right: 70,90 → [1,0]
Right subtree:
(seats < 60)
/ \
25,30,55 (1,1,1) 70,90 (1,0)
Now whole tree becomes:
ROOT
(seats < 28)
/ \
L Node R Node
(seats < 18) (seats < 60)
/ \ / \
10,15 22 25,30,55 70,90
🔵 LEVEL 2 → Try to split ALL 4 nodes
Now tree has 4 nodes at Level 2:
1. 10,15
2. 22
3. 25,30,55
4. 70,90
XGBoost tries splitting all 4 of them.
✔ Node 1 (10,15 → both 0)
→ Pure, no split (gain too small)
✔ Node 2 (22 → one sample)
→ Cannot split
✔ Node 3 (25,30,55 → all 1)
→ Pure, no split
✔ Node 4 (70,90 → [1,0])
This is the only node with mixed classes.
XGBoost checks split:
● <75 probably best
So:
70 → 1
90 → 0
Split accepted.
Now tree becomes:
ROOT
(seats < 28)
/ \
(seats < 18) (seats < 60)
/ \ / \
10,15 22 25,30,55 (seats < 75)
/ \
70 90
🔥 FINAL TREE (Classifier)
LEVEL 0:
(seats < 28)
LEVEL 1:
/ \
(seats < 18) (seats < 60)
LEVEL 2:
/ \ / \
10,15 22 25,30,55 (seats < 75)
LEVEL 3:
/ \
70 90
⭐ WHY THIS IS LEVEL-WISE?
✔ Level 0: 1 node split
✔ Level 1: BOTH children split
✔ Level 2: ALL 4 nodes tested
✔ Level 3: Only 1 node actually split (others were pure)
Even if only ONE node splits at Level 3, XGBoost still attempted splitting all nodes → that is
level-wise behavior.