GATE DA Machine Learning Revision Notes

The document provides comprehensive revision notes on machine learning, covering supervised learning techniques such as regression and classification, as well as unsupervised learning and neural networks. Key topics include various regression methods, classification algorithms like logistic regression and KNN, and model evaluation techniques. Additionally, it discusses mathematical foundations and concepts essential for understanding machine learning models.

Uploaded by Abhishek Kumar

GATE DA: Complete Machine Learning Revision Notes

Deep Dive & Mathematical Foundations

Contents
1 Supervised Learning: Regression
  1.1 Simple Linear Regression (SLR)
  1.2 Multiple Linear Regression
  1.3 Ridge Regression (L2 Regularization)

2 Supervised Learning: Classification
  2.1 Logistic Regression
  2.2 K-Nearest Neighbors (KNN)
    2.2.1 1. Core Characteristics
    2.2.2 2. Distance Metrics (The "Closeness" Logic)
    2.2.3 3. The Hyperparameter K (Bias-Variance Trade-off)
    2.2.4 4. Computational Complexity (GATE Trap)
    2.2.5 5. Critical Limitations
  2.3 Naive Bayes Classifier
    2.3.1 1. Mathematical Derivation (MAP Estimation)
    2.3.2 2. Variants of Naive Bayes
    2.3.3 3. Crucial Implementation Details (GATE Traps)
  2.4 Linear Discriminant Analysis (LDA)
    2.4.1 1. Core Concept
    2.4.2 2. The Scatter Matrices
    2.4.3 3. Fisher's Linear Discriminant
    2.4.4 4. LDA vs. PCA (GATE High-Yield)
    2.4.5 5. Assumptions & Limitations
  2.5 Support Vector Machine (SVM)
    2.5.1 1. Geometric Intuition
    2.5.2 2. Hard Margin SVM (Linearly Separable)
    2.5.3 3. Soft Margin SVM (Non-Linearly Separable)
    2.5.4 4. The Dual Formulation
    2.5.5 5. The Kernel Trick
  2.6 Decision Trees
    2.6.1 1. Core Concept
    2.6.2 2. Splitting Criteria (Impurity Measures)
    2.6.3 3. The "High Cardinality" Trap
    2.6.4 4. Regression Trees (CART)
    2.6.5 5. Overfitting & Pruning
    2.6.6 6. Algorithm Comparison (GATE High-Yield)

3 Model Evaluation
  3.1 Bias-Variance Trade-off
  3.2 Cross-Validation

4 Neural Networks (Deep Learning)
  4.1 1. The Multi-Layer Perceptron (MLP)
  4.2 2. Forward Propagation
  4.3 3. Activation Functions & Derivatives
  4.4 4. Loss Functions (Cost Functions)
  4.5 5. Backpropagation (Derivation Steps)
  4.6 6. Common Problems & Solutions

5 Unsupervised Learning
  5.1 1. Clustering Algorithms
    5.1.1 A. K-Means Clustering
    5.1.2 B. K-Medoids (PAM - Partitioning Around Medoids)
    5.1.3 C. Hierarchical Clustering
    5.1.4 D. Cluster Evaluation: Silhouette Score
  5.2 2. Principal Component Analysis (PCA)
    5.2.1 A. Concept
    5.2.2 B. Mathematical Derivation (Optimization)
    5.2.3 C. The PCA Algorithm Steps
    5.2.4 D. SVD Approach (Alternative Calculation)
1 Supervised Learning: Regression

1.1 Simple Linear Regression (SLR)


Model: yi = β0 + β1 xi + ϵi, where ϵi ∼ N(0, σ²).

Ordinary Least Squares (OLS) Derivation

Objective: Minimize Residual Sum of Squares (RSS).


J(β0, β1) = Σ_{i=1}^{n} (yi − (β0 + β1 xi))²

Closed Form Solutions: Taking partial derivatives ∂J/∂β0 = 0 and ∂J/∂β1 = 0:

β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = Cov(x, y) / Var(x) = rxy (Sy / Sx)

β̂0 = ȳ − β̂1 x̄

Coefficient of Determination (R²): Measures the proportion of variance explained by the model.

R² = 1 − RSS/TSS = 1 − Σ(yi − ŷi)² / Σ(yi − ȳ)²

Range: 0 ≤ R² ≤ 1 (for training data).
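A quick numerical check of these formulas (the data and variable names below are invented for illustration):

```python
import numpy as np

# Toy data roughly following y = 2x, to check the OLS closed forms above.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.1, 6.2, 7.9, 10.1])

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

# R² = 1 − RSS/TSS on the same (training) data
y_hat = beta0 + beta1 * x
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
```

For this toy data, β̂1 works out to 1.98 and R² is close to 1, as expected for near-linear data.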

1.2 Multiple Linear Regression


Matrix Form: Y = Xβ + ϵ, where Y ∈ R^{n×1}, X ∈ R^{n×(p+1)} (Design Matrix), β ∈ R^{(p+1)×1}.

The Normal Equation

To minimize J(β) = (Y − Xβ)T (Y − Xβ):

∇β J(β) = −2X T Y + 2X T Xβ = 0

=⇒ X T Xβ = X T Y
=⇒ β̂ = (X T X)−1 X T Y

GATE Trap / Crucial Note

Invertibility: The Normal Equation requires (XᵀX) to be invertible. It is not invertible if: 1. Multicollinearity exists (features are linearly dependent). 2. p > n (more features than samples).
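A sketch of the Normal Equation in code, including the singular case (data invented; `np.linalg.solve` is preferred over forming the inverse explicitly):

```python
import numpy as np

# Design matrix with an intercept column; β is recovered exactly on noiseless data.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])
beta_true = np.array([1.0, 2.0, -3.0])
Y = X @ beta_true

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # solves X^T X β = X^T Y

# Duplicating a feature makes X^T X rank-deficient (multicollinearity):
X_bad = np.column_stack([X, X[:, 1]])
rank_deficient = np.linalg.matrix_rank(X_bad.T @ X_bad) < X_bad.shape[1]
```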

1.3 Ridge Regression (L2 Regularization)


Addresses multicollinearity and overfitting by adding a penalty term.

J(β) = Σ(yi − ŷi)² + λ Σ_{j=1}^{p} βj²

In matrix form: Minimize ||Y − Xβ||² + λ||β||².


Analytical Solution:
β̂_Ridge = (XᵀX + λI)⁻¹ XᵀY
Note: Adding λI to the diagonal ensures the matrix is always invertible (for λ > 0).
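A minimal sketch of the Ridge closed form, assuming made-up data, showing that the penalty shrinks the coefficient norm:

```python
import numpy as np

# Compare OLS with the Ridge solution (X^T X + λI)^{-1} X^T Y on toy data.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
Y = X @ np.array([1.0, 0.0, -2.0, 3.0]) + 0.1 * rng.normal(size=30)

lam = 10.0
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ Y)

# The L2 penalty shrinks coefficients toward zero:
shrunk = np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ols)
```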

2 Supervised Learning: Classification

2.1 Logistic Regression


Used for binary classification. Models the log-odds.
 
ln( P(y = 1|x) / (1 − P(y = 1|x)) ) = βᵀx

Sigmoid Function: hθ(x) = σ(z) = 1 / (1 + e⁻ᶻ).

Cost Function (Log Loss / Cross Entropy)

We cannot use MSE (non-convex). We use Maximum Likelihood Estimation (MLE).


J(θ) = −(1/m) Σ_{i=1}^{m} [ y⁽ⁱ⁾ log(hθ(x⁽ⁱ⁾)) + (1 − y⁽ⁱ⁾) log(1 − hθ(x⁽ⁱ⁾)) ]

Gradient Descent Update:


θj := θj − α Σ_{i=1}^{m} (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) xj⁽ⁱ⁾
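The update rule can be sketched as batch gradient descent (toy separable data; the 1/m scaling here is an implementation choice, not part of the formula above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D problem: label is 1 when the feature is positive.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.normal(size=100)])  # intercept + feature
y = (X[:, 1] > 0).astype(float)

theta = np.zeros(2)
alpha = 0.1
for _ in range(500):
    grad = X.T @ (sigmoid(X @ theta) - y)   # sum over samples, as in the update rule
    theta -= alpha * grad / len(y)          # scaled by 1/m for stability

acc = np.mean((sigmoid(X @ theta) > 0.5) == y.astype(bool))
```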

2.2 K-Nearest Neighbors (KNN)


2.2.1 1. Core Characteristics

• Lazy Learner: It does not learn a discriminative function from the training data. Instead, it “memorizes”
the dataset. All computation is deferred until the prediction phase.
• Non-Parametric: It assumes no underlying probability distribution (like Gaussian) for the data.

• Instance-Based: Predictions are based purely on local instances found in memory.

2.2.2 2. Distance Metrics (The “Closeness” Logic)

The choice of metric dictates the geometry of the decision boundary.

1. Minkowski Distance (General Form):

D(x, y) = ( Σ_{i=1}^{n} |xi − yi|ᵖ )^{1/p}

2. Euclidean Distance (p = 2): Standard straight-line distance. Sensitive to magnitude.


3. Manhattan Distance (p = 1): Sum of absolute differences (Grid/Taxi-cab geometry). Preferred for
high-dimensional sparse data.
4. Hamming Distance: Used for Categorical Variables.
D(x, y) = Σ I(xi ≠ yi) (count of mismatched attributes)

2.2.3 3. The Hyperparameter K (Bias-Variance Trade-off)

How many neighbors vote?

• Small K (e.g., K = 1):

– Model: Extremely complex, jagged boundaries. Captures local noise.


– Result: Low Bias, High Variance (Overfitting).

• Large K (e.g., K = N ):
– Model: Simple, smooth boundaries. Ignores local structure. Approximates the global majority class.
– Result: High Bias, Low Variance (Underfitting).

• Rule of Thumb: K ≈ √N. Usually choose an odd number to avoid voting ties in binary classification.

2.2.4 4. Computational Complexity (GATE Trap)

Unlike other models (like Regression/Neural Nets), KNN shifts the cost to the Test phase.

Phase        Time Complexity   Explanation
Training     O(1)              Just storing the data.
Prediction   O(N · d)          Must compute distance to every point in the training set.

Note: Prediction can be optimized to O(log N ) using spatial tree structures like KD-Trees or Ball-Trees, but
these degrade in high dimensions.
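The O(N · d) prediction cost corresponds to a brute-force scan over the stored data, as in this sketch (data invented):

```python
import numpy as np

# Brute-force KNN: distance to every stored point, then a majority vote.
def knn_predict(X_train, y_train, x_query, k=3):
    d = np.linalg.norm(X_train - x_query, axis=1)   # O(N·d) distance pass
    nearest = np.argsort(d)[:k]                     # indices of the k closest
    return np.bincount(y_train[nearest]).argmax()   # majority vote

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
label = knn_predict(X_train, y_train, np.array([0.95, 1.0]), k=3)
```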

2.2.5 5. Critical Limitations

GATE Trap / Crucial Note

1. The Necessity of Feature Scaling KNN relies on distance. If Feature A ranges [0, 1] and Feature B
ranges [0, 1000]:

• Feature B will dominate the distance calculation (1000² ≫ 1²).


• Feature A becomes irrelevant.
Solution: You MUST apply Min-Max Scaling or Standardization (Z-score) before using KNN.

2. The Curse of Dimensionality


In very high dimensions (d → ∞):
• All points become roughly equidistant from each other.
• The ratio (Dmax − Dmin)/Dmin → 0.

• The concept of "nearest" neighbor becomes meaningless.


Fix: Use PCA/LDA to reduce dimensions before applying KNN.

2.3 Naive Bayes Classifier


Type: Generative Learning Model. Core Assumption: All features x1 , x2 , . . . , xn are mutually independent
given the class label y.

2.3.1 1. Mathematical Derivation (MAP Estimation)

We rely on Bayes’ Theorem to find the class y with the Maximum A Posteriori (MAP) probability.
P(y|X) = P(X|y) P(y) / P(X)
Since the denominator P (X) (evidence) is constant for all classes, we ignore it for optimization:
 
ŷ = argmax_y P(y) ∏_{j=1}^{n} P(xj|y)

• Prior P (y): Probability of class y (based on frequency in training data).


• Likelihood P (xj |y): Probability of feature j appearing given class y.

2.3.2 2. Variants of Naive Bayes

The difference lies in how we calculate the Likelihood P (xj |y).


A. Gaussian Naive Bayes (Continuous Data) Used when features are continuous real numbers. We assume
the likelihood follows a Normal Distribution.
P(xj|y = c) = (1/√(2πσ²jc)) exp( −(xj − µjc)² / (2σ²jc) )

Note: We must estimate the mean µjc and variance σ²jc for every feature j within every class c.
B. Multinomial Naive Bayes (Discrete Counts) Used for document classification (word counts).

P(xj|y = c) = (Count(xj in class c) + α) / (Total words in class c + α · V)

• V : Vocabulary size (number of unique words).


• α: Smoothing parameter (see below).

C. Bernoulli Naive Bayes (Binary) Used when features are binary (0 or 1). It penalizes the absence of a
feature as well.
P(xj|y) = P(j|y)^{xj} · (1 − P(j|y))^{1−xj}

2.3.3 3. Crucial Implementation Details (GATE Traps)

The Zero-Frequency Problem & Laplace Smoothing


Problem: If a word W never appears in the training set for class A, then P (W |A) = 0. Since probabilities
are multiplied, the entire posterior becomes zero, wiping out all other useful information.
Solution (Additive Smoothing): Add a dummy count α (usually α = 1, called Laplace Smoothing) to
the numerator and normalize the denominator.
θ̂ji = (xji + α) / (Ni + αd)

GATE Trap / Crucial Note

Log-Probabilities (Numerical Stability) Multiplying many small probabilities (e.g., 0.001 × 0.002 . . . )
causes floating-point underflow. Solution: Maximize the sum of logs instead of the product.
 
ŷ = argmax_y [ log P(y) + Σ_{j=1}^{n} log P(xj|y) ]

This transforms the non-linear product space into a linear sum space.
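Putting the pieces together, a sketch of Multinomial NB with Laplace smoothing and log-sums (the word counts below are invented):

```python
import numpy as np

# Rows = documents, columns = word counts over a 3-word vocabulary.
X = np.array([[3, 0, 1],
              [2, 0, 0],
              [0, 4, 1],
              [0, 3, 2]])
y = np.array([0, 0, 1, 1])
alpha, V = 1.0, X.shape[1]          # Laplace smoothing, vocabulary size

log_prior = np.log(np.bincount(y) / len(y))
log_lik = np.zeros((2, V))
for c in (0, 1):
    counts = X[y == c].sum(axis=0)
    log_lik[c] = np.log((counts + alpha) / (counts.sum() + alpha * V))

# Classify a new document that mostly uses word 0 (typical of class 0):
x_new = np.array([2, 0, 1])
scores = log_prior + (log_lik * x_new).sum(axis=1)   # sum of logs, not a product
pred = int(np.argmax(scores))
```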

2.4 Linear Discriminant Analysis (LDA)


2.4.1 1. Core Concept

LDA is a Supervised dimensionality reduction and classification technique. Unlike PCA (which maximizes
global variance), LDA aims to find a projection vector w that:

1. Maximizes the distance between the means of different classes (Between-Class Variance).
2. Minimizes the spread/scatter within each class (Within-Class Variance).

Intuition: "Group items of the same class tightly together, and push the groups as far apart as possible."

2.4.2 2. The Scatter Matrices

To formulate this mathematically, we define two scatter matrices. Let N1 , N2 be the number of samples in Class
1 and Class 2, and µ1 , µ2 be the class means.
A. Within-Class Scatter Matrix (SW ): Measures how compact each class is. It is the sum of the covariance
matrices of each individual class.
SW = Σ_{x∈C1} (x − µ1)(x − µ1)ᵀ + Σ_{x∈C2} (x − µ2)(x − µ2)ᵀ

B. Between-Class Scatter Matrix (SB ): Measures the separation between class means.

SB = (µ1 − µ2)(µ1 − µ2)ᵀ

2.4.3 3. Fisher’s Linear Discriminant

We want to find a projection vector w to maximize the ratio (Fisher’s Criterion):

The Objective Function J(w)

J(w) = (Projected Between-Class Variance) / (Projected Within-Class Variance) = (wᵀ SB w) / (wᵀ SW w)

The Solution (Derivation Result): To maximize J(w), we take the derivative w.r.t. w and set it to 0. This solves to a Generalized Eigenvalue Problem:

SW⁻¹ SB w = λw

For a 2-class problem, the solution simplifies directly to:

w* ∝ SW⁻¹ (µ1 − µ2)

(Note: If the data is already whitened/spherical where SW = I, then LDA is just the vector connecting the
means).
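A numerical sketch of the 2-class solution w ∝ SW⁻¹(µ1 − µ2), on invented Gaussian blobs:

```python
import numpy as np

# Two well-separated 2-D classes (means and spread invented for illustration).
rng = np.random.default_rng(3)
C1 = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
C2 = rng.normal(loc=[3, 3], scale=0.5, size=(50, 2))

mu1, mu2 = C1.mean(axis=0), C2.mean(axis=0)
SW = (C1 - mu1).T @ (C1 - mu1) + (C2 - mu2).T @ (C2 - mu2)
w = np.linalg.solve(SW, mu1 - mu2)   # w ∝ SW^{-1}(µ1 − µ2)

# The projected class means should be clearly separated along w:
gap = abs((C1 @ w).mean() - (C2 @ w).mean())
```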

2.4.4 4. LDA vs. PCA (GATE High-Yield)

PCA (Principal Component Analysis)         LDA (Linear Discriminant Analysis)

Unsupervised (ignores class labels).       Supervised (uses class labels).
Maximizes Total Variance.                  Maximizes Class Separation.
Can produce up to d components             Can produce at most C − 1 components
(where d is the number of features).       (where C is the number of classes).
Used for Feature Extraction.               Used for Classification + Extraction.

2.4.5 5. Assumptions & Limitations

LDA works optimally (is the Bayes Optimal Classifier) only if:

1. Data in each class follows a Gaussian (Normal) Distribution.


2. All classes share the same Covariance Matrix (Σ1 = Σ2 = · · · = Σk ).
3. Warning: If classes have different covariance matrices (e.g., one is a tight circle, the other is a wide oval),
LDA fails. You must use QDA (Quadratic Discriminant Analysis), which gives a curved decision
boundary.

2.5 Support Vector Machine (SVM)


2.5.1 1. Geometric Intuition

Goal: Find the optimal hyperplane wT x + b = 0 that separates classes with the maximum Margin.

• Support Vectors: Data points closest to the hyperplane. Only these points influence the position of the
boundary.
• Margin Width: The distance between the dashed lines (wᵀx + b = 1 and wᵀx + b = −1):

Width = 2 / ||w||

2.5.2 2. Hard Margin SVM (Linearly Separable)

Assumes data is perfectly separable without errors.

Primal Optimization Problem

Minimize: (1/2)||w||²
Subject to: yi(wᵀxi + b) ≥ 1, ∀i

If data has noise or outliers, this formulation has no solution.

2.5.3 3. Soft Margin SVM (Non-Linearly Separable)

Introduces Slack Variables (ξi ≥ 0) to allow some misclassification.

• If ξi = 0: Point is correctly classified and outside/on the margin.

• If 0 < ξi < 1: Point is correctly classified but inside the margin.


• If ξi > 1: Point is misclassified.

Soft Margin Objective


min_{w,b,ξ} (1/2)||w||² + C Σ_{i=1}^{n} ξi

Subject to: yi(wᵀxi + b) ≥ 1 − ξi, ξi ≥ 0

The Hyperparameter C (Regularization):

• Large C: Penalizes errors heavily. Tries to classify everything correctly. Result: Narrow margin, potential
Overfitting (High Variance).
• Small C: Allows more errors (higher ξ). Result: Wider margin, smoother boundary, potential Underfit-
ting (High Bias).

2.5.4 4. The Dual Formulation

Using Lagrange Multipliers (αi ), we convert the Primal problem to the Dual problem. Why? The Dual form
reveals that the algorithm only depends on the dot product between points (xTi xj ).

Maximize L_D(α) = Σ_{i=1}^{n} αi − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj (xiᵀxj)

Constraints: Σ αi yi = 0 and 0 ≤ αi ≤ C.
Key Property (KKT Conditions):

• If αi = 0: The point is correctly classified and ignored.

• If αi > 0: The point is a Support Vector.

2.5.5 5. The Kernel Trick

When data is not linearly separable in input space Rd , we map it to a higher-dimensional feature space RD using
a mapping ϕ(x).
x → ϕ(x)
Instead of computing ϕ(x) explicitly (which is computationally expensive), we use a Kernel Function K(xi , xj )
that computes the dot product directly in high dimensions.
K(xi, xj) = ϕ(xi)ᵀ ϕ(xj)

Common Kernels:

1. Linear Kernel: K(x, y) = xᵀy. (For high-dimensional text data.)

2. Polynomial Kernel: K(x, y) = (xᵀy + c)ᵈ.

3. RBF (Gaussian) Kernel: (Most popular for non-linear data.)

K(x, y) = exp(−γ||x − y||²)

GATE Trap / Crucial Note

RBF Parameter γ (Gamma): Defines how far the influence of a single training example reaches.
• High γ: Influence is short-range. Decision boundary is jagged/complex. Risk of Overfitting.

• Low γ: Influence is long-range. Decision boundary is smooth/linear. Risk of Underfitting.
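A sketch of the RBF kernel making γ's effect concrete (the points are chosen arbitrarily):

```python
import numpy as np

# RBF kernel K(x, y) = exp(−γ ||x − y||²): similarity decays with squared distance.
def rbf(x, y, gamma):
    return np.exp(-gamma * np.sum((x - y) ** 2))

x, y = np.array([0.0, 0.0]), np.array([1.0, 1.0])   # squared distance = 2

near = rbf(x, x, gamma=1.0)      # identical points → similarity exactly 1
low_g = rbf(x, y, gamma=0.1)     # low γ: long-range influence, similarity stays high
high_g = rbf(x, y, gamma=10.0)   # high γ: influence decays almost to zero
```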

2.6 Decision Trees


2.6.1 1. Core Concept

A non-parametric, supervised learning method. It builds a model in the form of a tree structure.

• Strategy: Greedy, Top-Down, Recursive Partitioning (Divide and Conquer).


• Goal: Create pure leaf nodes (homogeneous class distribution).

2.6.2 2. Splitting Criteria (Impurity Measures)

To split a node, we maximize the decrease in impurity.


A. Entropy (Information Theory - ID3, C4.5) Measure of randomness/disorder.
H(S) = − Σ_{i=1}^{c} pi log2(pi)

• Range: [0, log2 c].


• 0: Perfectly Pure (All elements belong to one class).
• High: Maximal impurity (Uniform distribution).

B. Gini Impurity (CART) Measure of misclassification probability. Computationally faster (no log).
Gini(S) = 1 − Σ_{i=1}^{c} pi²

• Range (for binary): [0, 0.5].


• Favors larger partitions (binary splits).

C. Information Gain (IG) The expected reduction in entropy after splitting set S on attribute A.
IG(S, A) = H(S) − Σ_{v∈Values(A)} (|Sv|/|S|) H(Sv)

Decision Rule: Choose attribute A with the highest IG.
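The three measures can be checked on a toy split (the counts below are invented):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                         # 0·log(0) is taken as 0
    return -np.sum(p * np.log2(p))

def gini(p):
    return 1.0 - np.sum(np.asarray(p, dtype=float) ** 2)

# Parent node: 8 positives, 8 negatives. A split sends [6+, 2-] left, [2+, 6-] right.
H_parent = entropy([0.5, 0.5])           # = 1 bit (maximal for binary)
H_left = H_right = entropy([6/8, 2/8])
IG = H_parent - (8/16) * H_left - (8/16) * H_right
```

Here IG ≈ 0.19 bits: the split helps, but the children are still impure.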

2.6.3 3. The "High Cardinality" Trap

GATE Trap / Crucial Note

Problem with Information Gain: IG is biased towards attributes with many distinct values (e.g.,
"User_ID" or "Date"). If "User_ID" is used, every leaf has 1 sample (Pure, Entropy=0), resulting in
max IG but a useless model.
Solution: Gain Ratio (Used in C4.5) We normalize the gain by the "Split Information" (Intrinsic entropy
of the split itself).
SplitInfo_A(S) = − Σ_v (|Sv|/|S|) log2(|Sv|/|S|)

GainRatio(S, A) = IG(S, A) / SplitInfo_A(S)

2.6.4 4. Regression Trees (CART)

When the target y is continuous, we cannot use Entropy.

• Splitting Criterion: Variance Reduction (or MSE).


Reduction = Var(S) − Σ_v (|Sv|/|S|) Var(Sv)

• Prediction: The mean value (ȳ) of the samples in the leaf node.

2.6.5 5. Overfitting & Pruning

Trees tend to memorize noise (Low Bias, High Variance).

• Pre-Pruning (Early Stopping): Stop growing if:


– Tree depth reaches max limit.
– Number of samples in node < min threshold.
– Information gain < threshold.
• Post-Pruning (Cost-Complexity Pruning): Grow full tree, then remove branches that add little
predictive power. Minimize: RSS + α(number of terminal nodes).

2.6.6 6. Algorithm Comparison (GATE High-Yield)

Feature        ID3           C4.5              CART
Tree Type      Multi-way     Multi-way         Binary Only
Criterion      Info Gain     Gain Ratio        Gini (Class) / Variance (Reg)
Data Type      Categorical   Cat + Numerical   Cat + Numerical
Missing Data   No            Yes               Yes (Surrogate Splits)

3 Model Evaluation

3.1 Bias-Variance Trade-off


Expected Test Error Decomposition:

E[(y − f̂(x))²] = Bias[f̂(x)]² + Var[f̂(x)] + σ²

• Bias: Error due to simplifying assumptions (Underfitting).


• Variance: Error due to sensitivity to small fluctuations in training set (Overfitting).
• σ 2 : Irreducible error (Noise).

3.2 Cross-Validation
• LOOCV (Leave-One-Out): k = n. Train on n − 1, test on 1. Unbiased estimate of error, but High
Variance and computationally expensive.
• K-Fold CV: Split into K parts (usually 5 or 10). Lower variance than LOOCV, biased if K is small.
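A hand-rolled K-fold split (no shuffling) to make the fold bookkeeping concrete:

```python
import numpy as np

# Yield (train_indices, test_indices) pairs: each fold serves as the test set once.
def k_fold_indices(n, k):
    folds = np.array_split(np.arange(n), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(10, 5))
# Every point appears in exactly one test fold, and train/test never overlap.
```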

4 Neural Networks (Deep Learning)

4.1 1. The Multi-Layer Perceptron (MLP)


A feed-forward network mapping input X to output Y via hidden layers. Notation:

• L: Total number of layers.


• n[l] : Number of units in layer l.
• g [l] : Activation function for layer l.

4.2 2. Forward Propagation


Computing the output step-by-step. For a single sample x:
z [l] = W [l] a[l−1] + b[l] (Linear Step)
a[l] = g [l] (z [l] ) (Non-Linear Activation)

Dimensions: If layer l has n[l] units and layer l − 1 has n[l−1] units, then W [l] is (n[l] × n[l−1] ).
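The two steps and the dimension rule can be sketched as follows (sizes and weights are arbitrary; the ReLU hidden layer and softmax output are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n0, n1, n2 = 3, 4, 2                 # input, hidden, output widths

# W[l] has shape (n[l] × n[l-1]), per the dimension rule above.
W1, b1 = rng.normal(size=(n1, n0)), np.zeros((n1, 1))
W2, b2 = rng.normal(size=(n2, n1)), np.zeros((n2, 1))

a0 = rng.normal(size=(n0, 1))        # a single sample x as a column vector
z1 = W1 @ a0 + b1                    # linear step
a1 = np.maximum(0, z1)               # non-linear activation (ReLU)
z2 = W2 @ a1 + b2
a2 = np.exp(z2) / np.exp(z2).sum()   # softmax output: sums to 1
```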

4.3 3. Activation Functions & Derivatives


The derivative g ′ (z) is needed for Gradient Descent.

Function   Equation g(z)             Derivative g′(z)         Range
Sigmoid    σ(z) = 1/(1 + e⁻ᶻ)        σ(z)(1 − σ(z))           (0, 1)
Tanh       (eᶻ − e⁻ᶻ)/(eᶻ + e⁻ᶻ)     1 − tanh²(z)             (−1, 1)
ReLU       max(0, z)                 1 if z > 0, 0 if z < 0   [0, ∞)
Softmax    e^{zi} / Σk e^{zk}        (complex Jacobian)       (0, 1)

GATE Trap / Crucial Note

Softmax (Output Layer): Used for Multi-Class Classification. It ensures Σ ŷi = 1 (Probability Distribution).

4.4 4. Loss Functions (Cost Functions)


A. Regression: Mean Squared Error (MSE).
J(θ) = (1/m) Σ_{i=1}^{m} (y⁽ⁱ⁾ − ŷ⁽ⁱ⁾)²

B. Binary Classification: Binary Cross-Entropy (Log Loss).


J(θ) = −(1/m) Σ_{i=1}^{m} [ y⁽ⁱ⁾ log(ŷ⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − ŷ⁽ⁱ⁾) ]

C. Multi-Class Classification: Categorical Cross-Entropy.


J(θ) = − Σ_{i=1}^{m} Σ_{k=1}^{C} yk⁽ⁱ⁾ log(ŷk⁽ⁱ⁾)

4.5 5. Backpropagation (Derivation Steps)
We use the Chain Rule to calculate gradients starting from the Output Layer (L) backwards.
Step 1: Output Layer Error (δ [L] ) For Cross-Entropy Loss with Softmax/Sigmoid:

δ [L] = a[L] − y

Step 2: Hidden Layer Error (δ[l]) How much did layer l contribute to the error in layer l + 1?

δ[l] = ((W[l+1])ᵀ δ[l+1]) ⊙ g′[l](z[l])

Note: ⊙ represents element-wise multiplication (Hadamard Product).


Step 3: Gradients (Partial Derivatives)

∂J/∂W[l] = δ[l] (a[l−1])ᵀ
∂J/∂b[l] = δ[l]

Step 4: Update Weights (Gradient Descent)

W[l] := W[l] − α ∂J/∂W[l]
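The four steps can be sketched for a one-hidden-layer net and verified against a finite-difference gradient (the seed, sizes, and tanh hidden activation are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=(3, 1))
y = np.array([[1.0]])
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W1_):
    a1_ = np.tanh(W1_ @ x + b1)
    a2_ = sigmoid(W2 @ a1_ + b2)
    return (-(y * np.log(a2_) + (1 - y) * np.log(1 - a2_))).item()

# Forward pass
a1 = np.tanh(W1 @ x + b1)
a2 = sigmoid(W2 @ a1 + b2)

d2 = a2 - y                        # Step 1: output error (sigmoid + cross-entropy)
d1 = (W2.T @ d2) * (1 - a1 ** 2)   # Step 2: Hadamard product with tanh'(z) = 1 − tanh²(z)
dW1 = d1 @ x.T                     # Step 3: gradient w.r.t. W1

# Finite-difference check on one entry of W1
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
num_grad = (loss(W1p) - loss(W1)) / eps
```

The analytic entry `dW1[0, 0]` and `num_grad` should agree to several decimal places.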

4.6 6. Common Problems & Solutions


The Vanishing Gradient Problem
Cause: In deep networks using Sigmoid/Tanh, the derivative is always < 1 (Max 0.25 for Sigmoid). As
we multiply these small derivatives back through many layers (Chain Rule), the gradient → 0. Effect: Early
layers stop learning. Solution: Use ReLU (Derivative is 1 for z > 0) and He Initialization.

Overfitting & Regularization


Neural Networks have high variance.

1. L2 Regularization (Weight Decay): Add (λ/2m)||W||² to the cost.
2. Dropout: Randomly kill neurons during training with probability p. Forces network to learn redundant
representations.

5 Unsupervised Learning

5.1 1. Clustering Algorithms


5.1.1 A. K-Means Clustering

Objective: Partition n observations into k clusters to minimize the Within-Cluster Sum of Squares
(WCSS) (also called Inertia).
J(µ, C) = Σ_{j=1}^{k} Σ_{xi∈Cj} ||xi − µj||²

Nature: Iterative Coordinate Descent algorithm.


The Algorithm Steps:

1. Initialization: Pick k centroids. (Standard: Random. Robust: K-Means++).


2. Assignment (Expectation Step): Assign each point to the nearest centroid.

c⁽ⁱ⁾ := argmin_j ||x⁽ⁱ⁾ − µj||²

3. Update (Maximization Step): Move centroid to the mean of assigned points.
µj := (1/|Cj|) Σ_{x∈Cj} x

4. Repeat until convergence (centroids do not move).

GATE Trap / Crucial Note


Crucial Properties for GATE:
• Convergence: Guaranteed to converge to a Local Optimum, not necessarily Global.

• Geometry: Assumes clusters are Spherical and of similar size. Fails on concentric circles or irregular
shapes.
• Complexity: O(t · k · n · d) where t is iterations. Linear in n, making it fast.
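The assignment/update loop can be sketched as below (data and initialization invented; a production implementation would also guard against empty clusters):

```python
import numpy as np

# Two well-separated blobs; one initial centroid is picked from each blob.
rng = np.random.default_rng(6)
X = np.vstack([rng.normal([0, 0], 0.3, (30, 2)),
               rng.normal([5, 5], 0.3, (30, 2))])
mu = X[[0, -1]].astype(float)

for _ in range(10):
    # Assignment step: nearest centroid for every point, shape (n, k)
    d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
    c = d.argmin(axis=1)
    # Update step: each centroid moves to the mean of its assigned points
    mu = np.array([X[c == j].mean(axis=0) for j in range(2)])

wcss = sum(np.sum((X[c == j] - mu[j]) ** 2) for j in range(2))
```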

5.1.2 B. K-Medoids (PAM - Partitioning Around Medoids)

Instead of the Mean (which is a virtual point), the center must be an actual data point (Medoid).

• Cost Function: Sum of absolute differences (L1 Norm / Manhattan).


J = Σ_{j=1}^{k} Σ_{xi∈Cj} |xi − µj|

• Robustness: Highly robust to Outliers. (A mean shifts drastically with one outlier; a median/medoid
does not).
• Complexity: O(k(n − k)²). Much slower than K-Means.

5.1.3 C. Hierarchical Clustering

Builds a hierarchy of clusters. Output is a Dendrogram. Approach: Agglomerative (Bottom-Up). Start with
n clusters, merge closest pair until 1 remains.
Linkage Criteria (The distance metric between Cluster A and B):

1. Single Linkage (Min): d(A, B) = min{d(a, b) : a ∈ A, b ∈ B}. Effect: Produces long, "chain-like"
clusters. Good for non-globular shapes.
2. Complete Linkage (Max): d(A, B) = max{d(a, b) : a ∈ A, b ∈ B}. Effect: Forces compact, spherical
clusters. Sensitive to outliers.
3. Average Linkage: Average distance between all pairs.
4. Ward’s Method: Minimizes the increase in Variance (WCSS) when merging.

Complexity: O(n³) (Standard) or O(n² log n) (Optimized). Not suitable for large datasets.

5.1.4 D. Cluster Evaluation: Silhouette Score

Used when ground truth is unknown. For a point i:

• a(i): Mean distance to other points in the same cluster (Cohesion).


• b(i): Mean distance to points in the nearest neighboring cluster (Separation).

S(i) = (b(i) − a(i)) / max(a(i), b(i))

Range: [−1, 1].

• +1: Perfect clustering.
• 0: Overlapping clusters.
• −1: Assigned to wrong cluster.

5.2 2. Principal Component Analysis (PCA)


5.2.1 A. Concept

A linear transformation that projects data onto a new coordinate system where:

1. The first coordinate (PC1) has the maximum variance.


2. The second coordinate (PC2) has the second max variance and is orthogonal to PC1.

5.2.2 B. Mathematical Derivation (Optimization)

Let u1 be a unit vector (||u1|| = 1). We want to project data X onto u1 such that the variance of the projection is maximized. Projection: z = Xu1. Variance: Var(z) = (1/n) zᵀz = (1/n) u1ᵀXᵀXu1 = u1ᵀΣu1.
Objective: Maximize u1ᵀΣu1 subject to u1ᵀu1 = 1. Using Lagrange Multipliers:

L = u1ᵀΣu1 − λ(u1ᵀu1 − 1)

Take derivative w.r.t u1 and set to 0:

2Σu1 − 2λu1 = 0 =⇒ Σu1 = λu1

Conclusion: The direction of maximum variance u1 is simply the Eigenvector of the Covariance Matrix Σ,
and the variance itself is the Eigenvalue λ.

5.2.3 C. The PCA Algorithm Steps

1. Standardize Data: X_std = (X − µ)/σ (Crucial! PCA is sensitive to scale).

2. Covariance Matrix: Σ = (1/(n−1)) XᵀX.

3. Eigen Decomposition: Compute eigenvectors/values of Σ.


4. Sort: λ1 ≥ λ2 ≥ · · · ≥ λd .
5. Projection: Select top k eigenvectors (Wk ) and project:

Xnew = X · Wk

5.2.4 D. SVD Approach (Alternative Calculation)

In practice (e.g., Python’s sklearn), PCA uses Singular Value Decomposition (SVD) of X directly, not the
Covariance Matrix.
X = U S Vᵀ

• The columns of V (Right Singular Vectors) are the Principal Components.


• The squared singular values relate to the eigenvalues: λi = si² / (n − 1).
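Both routes can be cross-checked numerically (data invented; `np.linalg.eigh` is used since Σ is symmetric):

```python
import numpy as np

# Data with one dominant direction (column scales 3, 1, 0.1 chosen for illustration).
rng = np.random.default_rng(7)
X = rng.normal(size=(100, 3)) @ np.array([[3, 0, 0], [0, 1, 0], [0, 0, 0.1]])
X = X - X.mean(axis=0)                      # center (scaling step omitted here)

# Route 1: eigendecomposition of the covariance matrix
cov = X.T @ X / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]           # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of X directly
U, S, Vt = np.linalg.svd(X, full_matrices=False)
lam_from_svd = S ** 2 / (len(X) - 1)        # λi = si² / (n − 1)

same_vals = np.allclose(eigvals, lam_from_svd)
# Eigenvectors match the rows of V^T up to sign:
same_dirs = np.allclose(np.abs(eigvecs), np.abs(Vt.T), atol=1e-8)
```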
