GATE DA Machine Learning Revision Notes
Contents
1 Supervised Learning: Regression
1.1 Simple Linear Regression (SLR)
1.2 Multiple Linear Regression
1.3 Ridge Regression (L2 Regularization)
2.6.6 6. Algorithm Comparison (GATE High-Yield)
3 Model Evaluation
3.1 Bias-Variance Trade-off
3.2 Cross-Validation
5 Unsupervised Learning
5.1 1. Clustering Algorithms
5.1.1 A. K-Means Clustering
5.1.2 B. K-Medoids (PAM - Partitioning Around Medoids)
5.1.3 C. Hierarchical Clustering
5.1.4 D. Cluster Evaluation: Silhouette Score
5.2 2. Principal Component Analysis (PCA)
5.2.1 A. Concept
5.2.2 B. Mathematical Derivation (Optimization)
5.2.3 C. The PCA Algorithm Steps
5.2.4 D. SVD Approach (Alternative Calculation)
1 Supervised Learning: Regression
β̂1 = Cov(x, y) / Var(x) = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² = r_xy · (Sy / Sx)
β̂0 = ȳ − β̂1 x̄
Coefficient of Determination (R2 ): Measures the proportion of variance explained by the model.
R² = 1 − RSS/TSS = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²
∇β J(β) = −2XᵀY + 2XᵀXβ = 0
⟹ XᵀXβ = XᵀY
⟹ β̂ = (XᵀX)⁻¹XᵀY
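The Normal Equation can be checked numerically; a minimal sketch on synthetic, noiseless data (all values hypothetical):

```python
import numpy as np

# Synthetic noiseless data: y = 1 + 2*x1 + 3*x2 (hypothetical coefficients)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])  # intercept column
beta_true = np.array([1.0, 2.0, 3.0])
Y = X @ beta_true

# beta_hat = (X^T X)^{-1} X^T Y, via solve() rather than an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
```

On noiseless, full-rank data this recovers β exactly (up to floating point); solving the linear system is preferred over forming (XᵀX)⁻¹ explicitly for numerical stability.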
Invertibility: The Normal Equation requires (XᵀX) to be invertible. It is not invertible if: 1. Multicollinearity exists (features are linearly dependent). 2. p > n (more features than samples).
J(β) = Σ(yᵢ − ŷᵢ)² + λ Σⱼ βⱼ²,  j = 1, …, p
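Minimizing this objective gives the closed form β̂_ridge = (XᵀX + λI)⁻¹XᵀY, which is always solvable for λ > 0. A minimal sketch (data and λ hypothetical) showing that ridge coefficients shrink relative to OLS:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
Y = rng.normal(size=30)
lam = 0.5  # hypothetical regularization strength

# Ridge closed form: (X^T X + lambda*I) is invertible for any lambda > 0
p = X.shape[1]
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Shrinkage: the ridge solution has smaller norm than the OLS solution
beta_ols = np.linalg.lstsq(X, Y, rcond=None)[0]
```

Note that in practice the intercept term is usually left unpenalized; this sketch penalizes all coefficients for brevity.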
2 Supervised Learning: Classification
Sigmoid Function: hθ(x) = σ(z) = 1 / (1 + e⁻ᶻ)
• Lazy Learner: It does not learn a discriminative function from the training data. Instead, it “memorizes”
the dataset. All computation is deferred until the prediction phase.
• Non-Parametric: It assumes no underlying probability distribution (like Gaussian) for the data.
D(x, y) = ( Σᵢ₌₁ⁿ |xᵢ − yᵢ|ᵖ )^(1/p)
• Large K (e.g., K = N ):
– Model: Simple, smooth boundaries. Ignores local structure. Approximates the global majority class.
– Result: High Bias, Low Variance (Underfitting).
• Rule of Thumb: K ≈ √N. Usually choose an odd number to avoid voting ties in binary classification.
Unlike other models (like Regression/Neural Nets), KNN shifts the cost to the Test phase.
Note: Prediction can be optimized to O(log N ) using spatial tree structures like KD-Trees or Ball-Trees, but
these degrade in high dimensions.
1. The Necessity of Feature Scaling KNN relies on distance. If Feature A ranges over [0, 1] and Feature B over [0, 1000], Feature B dominates the distance and Feature A is effectively ignored, so features must be standardized (e.g., min-max or z-score scaling) before computing distances.
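A quick numeric illustration of the scaling problem (all feature values hypothetical):

```python
import numpy as np

# Two points; feature 1 in [0, 1], feature 2 in [0, 1000] (hypothetical values)
a = np.array([0.1, 900.0])
b = np.array([0.9, 905.0])

diff = a - b
dist = np.sqrt(np.sum(diff ** 2))
contrib = abs(diff[1]) / dist   # share of the distance driven by feature 2 alone

# After min-max scaling by each feature's range, feature 1 matters again
diff_scaled = diff / np.array([1.0, 1000.0])
```

Here `contrib` is close to 1: the unscaled distance is driven almost entirely by feature 2, even though the points differ far more (relative to range) in feature 1.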
We rely on Bayes’ Theorem to find the class y with the Maximum A Posteriori (MAP) probability.
P(y|X) = P(X|y) P(y) / P(X)
Since the denominator P (X) (evidence) is constant for all classes, we ignore it for optimization:
ŷ = argmax_y P(y) ∏ⱼ₌₁ⁿ P(xⱼ|y)
2.3.2 2. Variants of Naive Bayes
Note: We must estimate the mean µ_jc and variance σ²_jc for every feature j within every class c.
B. Multinomial Naive Bayes (Discrete Counts) Used for document classification (word counts).
P(xⱼ|y = c) = (Count(xⱼ in class c) + α) / (Total words in class c + α·V)
where α is the Laplace smoothing parameter and V is the vocabulary size.
C. Bernoulli Naive Bayes (Binary) Used when features are binary (0 or 1). It penalizes the absence of a
feature as well.
P(xⱼ|y) = P(j|y)^xⱼ · (1 − P(j|y))^(1 − xⱼ)
Log-Probabilities (Numerical Stability) Multiplying many small probabilities (e.g., 0.001 × 0.002 . . . )
causes floating-point underflow. Solution: Maximize the sum of logs instead of the product.
ŷ = argmax_y [ log P(y) + Σⱼ₌₁ⁿ log P(xⱼ|y) ]
Since log is strictly increasing, the argmax is unchanged; the product simply becomes a numerically stable sum.
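A minimal sketch of MAP prediction in log space, assuming already-estimated Bernoulli parameters (all probabilities below are hypothetical):

```python
import numpy as np

# Hypothetical fitted parameters: 2 classes, 4 binary features
log_prior = np.log(np.array([0.6, 0.4]))          # log P(y)
cond = np.array([[0.9, 0.2, 0.7, 0.1],            # P(x_j = 1 | y = 0)
                 [0.3, 0.8, 0.4, 0.6]])           # P(x_j = 1 | y = 1)

x = np.array([1, 0, 1, 0])  # observed binary feature vector

# Bernoulli NB in log space: log P(y) + sum_j log P(x_j | y)
log_lik = x * np.log(cond) + (1 - x) * np.log(1 - cond)  # shape (2, 4)
scores = log_prior + log_lik.sum(axis=1)
y_hat = int(np.argmax(scores))
```

Working in log space avoids the underflow that multiplying many small probabilities would cause.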
LDA is a Supervised dimensionality reduction and classification technique. Unlike PCA (which maximizes
global variance), LDA aims to find a projection vector w that:
1. Maximizes the distance between the means of different classes (Between-Class Variance).
2. Minimizes the spread/scatter within each class (Within-Class Variance).
Intuition: "Group items of the same class tightly together, and push the groups as far apart as possible."
2.4.2 2. The Scatter Matrices
To formulate this mathematically, we define two scatter matrices. Let N1 , N2 be the number of samples in Class
1 and Class 2, and µ1 , µ2 be the class means.
A. Within-Class Scatter Matrix (SW ): Measures how compact each class is. It is the sum of the covariance
matrices of each individual class.
SW = Σ_{x∈C1} (x − µ1)(x − µ1)ᵀ + Σ_{x∈C2} (x − µ2)(x − µ2)ᵀ
B. Between-Class Scatter Matrix (SB ): Measures the separation between class means.
SB = (µ1 − µ2 )(µ1 − µ2 )T
The Solution (Derivation Result): To maximize J(w), we take the derivative w.r.t w and set to 0. This
solves to a Generalized Eigenvalue Problem:
SW⁻¹ SB w = λw
(Note: If the data is already whitened/spherical where SW = I, then LDA is just the vector connecting the
means).
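For two classes the generalized eigenproblem reduces to the closed form w ∝ SW⁻¹(µ1 − µ2); a sketch on synthetic Gaussian classes with shared covariance (all data hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
# Two Gaussian classes sharing an identity covariance (LDA's optimality assumption)
C1 = rng.normal(loc=[0, 0], scale=1.0, size=(100, 2))
C2 = rng.normal(loc=[4, 4], scale=1.0, size=(100, 2))

mu1, mu2 = C1.mean(axis=0), C2.mean(axis=0)
# Within-class scatter: sum of the per-class scatter matrices
SW = (C1 - mu1).T @ (C1 - mu1) + (C2 - mu2).T @ (C2 - mu2)

# Two-class closed form: w is proportional to SW^{-1} (mu1 - mu2)
w = np.linalg.solve(SW, mu1 - mu2)
w /= np.linalg.norm(w)

# Projected class means should be well separated along w
sep = abs((mu1 - mu2) @ w)
```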
LDA works optimally (is the Bayes Optimal Classifier) only if: 1. Each class is Gaussian-distributed. 2. All classes share the same covariance matrix.
Goal: Find the optimal hyperplane wT x + b = 0 that separates classes with the maximum Margin.
• Support Vectors: Data points closest to the hyperplane. Only these points influence the position of the
boundary.
• Margin Width: The distance between the dashed lines (wT x + b = 1 and wT x + b = −1).
Width = 2 / ||w||
Minimize: (1/2)||w||²
Subject to: yᵢ(wᵀxᵢ + b) ≥ 1, ∀i
• Large C: Penalizes errors heavily. Tries to classify everything correctly. Result: Narrow margin, potential
Overfitting (High Variance).
• Small C: Allows more errors (higher ξ). Result: Wider margin, smoother boundary, potential Underfit-
ting (High Bias).
Using Lagrange Multipliers (αi ), we convert the Primal problem to the Dual problem. Why? The Dual form
reveals that the algorithm only depends on the dot product between points (xTi xj ).
Maximize LD(α) = Σᵢ₌₁ⁿ αᵢ − (1/2) Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ αᵢαⱼ yᵢyⱼ (xᵢᵀxⱼ)
Constraints: Σᵢ αᵢyᵢ = 0 and 0 ≤ αᵢ ≤ C.
Key Property (KKT Conditions): Complementary slackness requires αᵢ[yᵢ(wᵀxᵢ + b) − 1] = 0 for every i. Hence αᵢ > 0 only for points lying on or inside the margin (the Support Vectors); all other points have αᵢ = 0 and do not affect the solution.
2.5.5 5. The Kernel Trick
When data is not linearly separable in input space Rd , we map it to a higher-dimensional feature space RD using
a mapping ϕ(x).
x → ϕ(x)
Instead of computing ϕ(x) explicitly (which is computationally expensive), we use a Kernel Function K(xi , xj )
that computes the dot product directly in high dimensions.
K(xi , xj ) = ϕ(xi )T ϕ(xj )
Common Kernels:
• Linear: K(xᵢ, xⱼ) = xᵢᵀxⱼ
• Polynomial: K(xᵢ, xⱼ) = (xᵢᵀxⱼ + c)ᵈ
• RBF (Gaussian): K(xᵢ, xⱼ) = exp(−γ ||xᵢ − xⱼ||²)
RBF Parameter γ (Gamma): Defines how far the influence of a single training example reaches.
• High γ: Influence is short-range. Decision boundary is jagged/complex. Risk of Overfitting.
• Low γ: Influence is long-range. Decision boundary is smooth. Risk of Underfitting.
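A sketch of the RBF kernel matrix showing the effect of γ (the points and γ values are hypothetical):

```python
import numpy as np

def rbf_kernel(X, Z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2), computed for all pairs."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2 * X @ Z.T
    return np.exp(-gamma * sq)

X = np.array([[0.0, 0.0], [1.0, 1.0]])
K_lo = rbf_kernel(X, X, gamma=0.1)   # long-range influence
K_hi = rbf_kernel(X, X, gamma=10.0)  # short-range influence

# Higher gamma pushes off-diagonal similarity toward 0 (more local influence)
```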
A non-parametric, supervised learning method. It builds a model in the form of a tree structure.
B. Gini Impurity (CART) Measure of misclassification probability. Computationally faster (no log).
Gini(S) = 1 − Σᵢ₌₁ᶜ pᵢ²
i=1
C. Information Gain (IG) The expected reduction in entropy after splitting set S on attribute A.
IG(S, A) = H(S) − Σ_{v∈Values(A)} (|Sᵥ| / |S|) · H(Sᵥ)
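Entropy, Gini, and Information Gain can be sketched directly from the definitions (toy labels hypothetical):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    probs = np.array(list(Counter(labels).values())) / len(labels)
    return -np.sum(probs * np.log2(probs))

def gini(labels):
    probs = np.array(list(Counter(labels).values())) / len(labels)
    return 1.0 - np.sum(probs ** 2)

def information_gain(parent, splits):
    """IG = H(parent) - weighted sum of H(child) over the split subsets."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in splits)

# A pure split of a balanced binary node: entropy drops from 1 bit to 0
parent = [0, 0, 1, 1]
ig = information_gain(parent, [[0, 0], [1, 1]])
```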
2.6.3 3. The "High Cardinality" Trap
Problem with Information Gain: IG is biased towards attributes with many distinct values (e.g.,
"User_ID" or "Date"). If "User_ID" is used, every leaf has 1 sample (Pure, Entropy=0), resulting in
max IG but a useless model.
Solution: Gain Ratio (Used in C4.5) We normalize the gain by the "Split Information" (Intrinsic entropy
of the split itself).
SplitInfoA(S) = − Σᵥ (|Sᵥ|/|S|) log₂(|Sᵥ|/|S|)
GainRatio(S, A) = IG(S, A) / SplitInfoA(S)
• Prediction: The mean value (ȳ) of the samples in the leaf node.
3 Model Evaluation
3.2 Cross-Validation
• LOOCV (Leave-One-Out): k = n. Train on n − 1, test on 1. Unbiased estimate of error, but High
Variance and computationally expensive.
• K-Fold CV: Split into K parts (usually 5 or 10). Lower variance than LOOCV, biased if K is small.
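A minimal sketch of generating K-fold splits by hand (fold count and data size hypothetical):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for K-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

splits = list(kfold_indices(n=10, k=5))
# Every point appears in exactly one test fold across the k splits
all_test = np.sort(np.concatenate([t for _, t in splits]))
```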
Dimensions: If layer l has n[l] units and layer l − 1 has n[l−1] units, then W [l] is (n[l] × n[l−1] ).
Softmax (Output Layer): Used for Multi-Class Classification. It ensures Σᵢ ŷᵢ = 1 (Probability Distribution).
4.5 5. Backpropagation (Derivation Steps)
We use the Chain Rule to calculate gradients starting from the Output Layer (L) backwards.
Step 1: Output Layer Error (δ [L] ) For Cross-Entropy Loss with Softmax/Sigmoid:
δ [L] = a[L] − y
Step 2: Hidden Layer Error (δ [l] ) How much did layer l contribute to the error in l + 1?
δ [l] = (W [l+1] )T δ [l+1] ⊙ g ′[l] (z [l] )
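The two steps can be verified against a numerical gradient; a minimal sketch for a tiny sigmoid network with binary cross-entropy (shapes, data, and seed all hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
# Tiny network: 2 inputs -> 3 hidden (sigmoid) -> 1 output (sigmoid), one sample
x = rng.normal(size=(2, 1)); y = np.array([[1.0]])
W1 = rng.normal(size=(3, 2)); W2 = rng.normal(size=(1, 3))

def loss(W1, W2):
    a1 = sigmoid(W1 @ x)
    a2 = sigmoid(W2 @ a1)
    return (-(y * np.log(a2) + (1 - y) * np.log(1 - a2))).item()

# Backprop: delta_L = a_L - y for sigmoid output + cross-entropy loss
a1 = sigmoid(W1 @ x)
a2 = sigmoid(W2 @ a1)
d2 = a2 - y                          # output-layer error
d1 = (W2.T @ d2) * a1 * (1 - a1)     # hidden-layer error (elementwise sigmoid')
gW2 = d2 @ a1.T
gW1 = d1 @ x.T

# Check one gradient entry against a finite difference
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
num = (loss(W1p, W2) - loss(W1, W2)) / eps
```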
5 Unsupervised Learning
Objective: Partition n observations into k clusters to minimize the Within-Cluster Sum of Squares
(WCSS) (also called Inertia).
J(µ, C) = Σⱼ₌₁ᵏ Σ_{xᵢ∈Cⱼ} ||xᵢ − µⱼ||²
3. Update (Maximization Step): Move centroid to the mean of assigned points.
µⱼ := (1/|Cⱼ|) Σ_{x∈Cⱼ} x
• Geometry: Assumes clusters are Spherical and of similar size. Fails on concentric circles or irregular
shapes.
• Complexity: O(t · k · n · d) where t is iterations. Linear in n, making it fast.
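The assignment/update loop can be sketched in a few lines (the data, k, and the deterministic init below are hypothetical choices for a reproducible demo):

```python
import numpy as np

def kmeans(X, k, init, iters=100):
    """Lloyd's algorithm: alternate the assignment and update steps."""
    centers = init.astype(float).copy()
    for _ in range(iters):
        # Assignment step: label each point with its nearest centroid
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated blobs; init with one point from each blob
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
labels, centers = kmeans(X, k=2, init=X[[0, 20]])
```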
Instead of the Mean (which is a virtual point), the center must be an actual data point (Medoid).
• Robustness: Highly robust to Outliers. (A mean shifts drastically with one outlier; a median/medoid
does not).
• Complexity: O(k(n − k)²). Much slower than K-Means.
Builds a hierarchy of clusters. Output is a Dendrogram. Approach: Agglomerative (Bottom-Up). Start with
n clusters, merge closest pair until 1 remains.
Linkage Criteria (The distance metric between Cluster A and B):
1. Single Linkage (Min): d(A, B) = min{d(a, b) : a ∈ A, b ∈ B}. Effect: Produces long, "chain-like"
clusters. Good for non-globular shapes.
2. Complete Linkage (Max): d(A, B) = max{d(a, b) : a ∈ A, b ∈ B}. Effect: Forces compact, spherical
clusters. Sensitive to outliers.
3. Average Linkage: Average distance between all pairs.
4. Ward’s Method: Minimizes the increase in Variance (WCSS) when merging.
Complexity: O(n³) (Standard) or O(n² log n) (Optimized). Not suitable for large datasets.
S(i) = (b(i) − a(i)) / max(a(i), b(i))
where a(i) is the mean distance from point i to the other points in its own cluster, and b(i) is the mean distance to the points of the nearest other cluster.
Range: [−1, 1].
• +1: Perfect clustering.
• 0: Overlapping clusters.
• −1: Assigned to wrong cluster.
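A direct sketch of the silhouette computation from its definition (toy points hypothetical):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette score: s(i) = (b(i) - a(i)) / max(a(i), b(i))."""
    n = len(X)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = np.empty(n)
    for i in range(n):
        same = (labels == labels[i])
        # a(i): mean distance to the other points in the same cluster
        a = D[i, same].sum() / (same.sum() - 1)
        # b(i): mean distance to the nearest other cluster
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# Tight, well-separated clusters score close to +1
X = np.array([[0.0, 0], [0.1, 0], [5.0, 0], [5.1, 0]])
s = silhouette(X, np.array([0, 0, 1, 1]))
```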
A linear transformation that projects data onto a new coordinate system where:
Let u1 be a unit vector (||u1|| = 1). We want to project the (centered) data X onto u1 such that the variance of the projection is maximized. Projection: z = Xu1. Variance: Var(z) = (1/n) zᵀz = (1/n) u1ᵀXᵀXu1 = u1ᵀΣu1.
Objective: Maximize u1ᵀΣu1 subject to u1ᵀu1 = 1. Using Lagrange Multipliers: L(u1, λ) = u1ᵀΣu1 − λ(u1ᵀu1 − 1); setting the gradient to zero gives Σu1 = λu1.
Conclusion: The direction of maximum variance u1 is simply the Eigenvector of the Covariance Matrix Σ,
and the variance itself is the Eigenvalue λ.
Xnew = X · Wk
In practice (e.g., Python’s sklearn), PCA uses Singular Value Decomposition (SVD) of X directly, not the
Covariance Matrix.
X = U SV T
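A sketch confirming that the SVD route agrees with the covariance-eigendecomposition route (data hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
# Correlated 2-D data: most variance along the first axis
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])

Xc = X - X.mean(axis=0)              # PCA requires centered data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Vt are the principal directions; covariance eigenvalues are S^2 / n
eigvals = S ** 2 / len(Xc)
evals2, evecs = np.linalg.eigh(np.cov(Xc.T, bias=True))  # eigh sorts ascending

# Project onto the top principal component
Z = Xc @ Vt[0]
```

Both routes give the same spectrum; SVD is preferred in practice because it avoids forming XᵀX, which squares the condition number.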