GATE DA Machine Learning Revision Notes
Contents
1 Supervised Learning: Regression
1.1 Simple Linear Regression (SLR)
1.2 Multiple Linear Regression
1.3 Ridge Regression (L2 Regularization)
2.6.6 6. Algorithm Comparison (GATE High-Yield)
3 Model Evaluation
3.1 Bias-Variance Trade-off
3.2 Cross-Validation
5 Unsupervised Learning
5.1 1. Clustering Algorithms
5.1.1 A. K-Means Clustering
5.1.2 B. K-Medoids (PAM - Partitioning Around Medoids)
5.1.3 C. Hierarchical Clustering
5.1.4 D. Cluster Evaluation: Silhouette Score
5.2 2. Principal Component Analysis (PCA)
5.2.1 A. Concept
5.2.2 B. Mathematical Derivation (Optimization)
5.2.3 C. The PCA Algorithm Steps
5.2.4 D. SVD Approach (Alternative Calculation)
1 Supervised Learning: Regression
β̂1 = Cov(x, y) / Var(x) = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² = r_xy · (Sy / Sx)
β̂0 = ȳ − β̂1 x̄
Coefficient of Determination (R2 ): Measures the proportion of variance explained by the model.
R² = 1 − RSS/TSS = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²
∇β J(β) = −2XᵀY + 2XᵀXβ = 0
⟹ XᵀXβ = XᵀY
⟹ β̂ = (XᵀX)⁻¹XᵀY
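The Normal Equation can be checked numerically; a minimal sketch on synthetic, noiseless data (all values hypothetical):

```python
import numpy as np

# Synthetic noiseless data: y = 1 + 2*x1 + 3*x2 (hypothetical coefficients)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])  # intercept column
beta_true = np.array([1.0, 2.0, 3.0])
Y = X @ beta_true

# beta_hat = (X^T X)^{-1} X^T Y, via solve() rather than an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
```

On noiseless, full-rank data this recovers β exactly (up to floating point); solving the linear system is preferred over forming (XᵀX)⁻¹ explicitly for numerical stability.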
Invertibility: The Normal Equation requires (XᵀX) to be invertible. It is not invertible if: 1. Multicollinearity exists (features are linearly dependent). 2. p > n (more features than samples).
J(β) = Σ(yᵢ − ŷᵢ)² + λ Σⱼ βⱼ²,  j = 1, …, p
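Minimizing this objective gives the closed form β̂_ridge = (XᵀX + λI)⁻¹XᵀY, which is always solvable for λ > 0. A minimal sketch (data and λ hypothetical) showing that ridge coefficients shrink relative to OLS:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
Y = rng.normal(size=30)
lam = 0.5  # hypothetical regularization strength

# Ridge closed form: (X^T X + lambda*I) is invertible for any lambda > 0
p = X.shape[1]
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Shrinkage: the ridge solution has smaller norm than the OLS solution
beta_ols = np.linalg.lstsq(X, Y, rcond=None)[0]
```

Note that in practice the intercept term is usually left unpenalized; this sketch penalizes all coefficients for brevity.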
2 Supervised Learning: Classification
Sigmoid Function: hθ(x) = σ(z) = 1 / (1 + e⁻ᶻ)
• Lazy Learner: It does not learn a discriminative function from the training data. Instead, it “memorizes”
the dataset. All computation is deferred until the prediction phase.
• Non-Parametric: It assumes no underlying probability distribution (like Gaussian) for the data.
D(x, y) = ( Σᵢ₌₁ⁿ |xᵢ − yᵢ|ᵖ )^(1/p)
• Large K (e.g., K = N ):
– Model: Simple, smooth boundaries. Ignores local structure. Approximates the global majority class.
– Result: High Bias, Low Variance (Underfitting).
• Rule of Thumb: K ≈ √N. Usually choose an odd number to avoid voting ties in binary classification.
Unlike other models (like Regression/Neural Nets), KNN shifts the cost to the Test phase.
Note: Prediction can be optimized to O(log N ) using spatial tree structures like KD-Trees or Ball-Trees, but
these degrade in high dimensions.
1. The Necessity of Feature Scaling KNN relies on distance. If Feature A ranges over [0, 1] and Feature B over [0, 1000], Feature B dominates the distance and Feature A is effectively ignored, so features must be standardized (e.g., min-max or z-score scaling) before computing distances.
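A quick numeric illustration of the scaling problem (all feature values hypothetical):

```python
import numpy as np

# Two points; feature 1 in [0, 1], feature 2 in [0, 1000] (hypothetical values)
a = np.array([0.1, 900.0])
b = np.array([0.9, 905.0])

diff = a - b
dist = np.sqrt(np.sum(diff ** 2))
contrib = abs(diff[1]) / dist   # share of the distance driven by feature 2 alone

# After min-max scaling by each feature's range, feature 1 matters again
diff_scaled = diff / np.array([1.0, 1000.0])
```

Here `contrib` is close to 1: the unscaled distance is driven almost entirely by feature 2, even though the points differ far more (relative to range) in feature 1.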
We rely on Bayes’ Theorem to find the class y with the Maximum A Posteriori (MAP) probability.
P(y|X) = P(X|y) P(y) / P(X)
Since the denominator P (X) (evidence) is constant for all classes, we ignore it for optimization:
ŷ = argmax_y P(y) ∏ⱼ₌₁ⁿ P(xⱼ|y)
2.3.2 2. Variants of Naive Bayes
Note: We must estimate the mean µ_jc and variance σ²_jc for every feature j within every class c.
B. Multinomial Naive Bayes (Discrete Counts) Used for document classification (word counts).
P(xⱼ|y = c) = (Count(xⱼ in class c) + α) / (Total words in class c + α·V)
where α is the Laplace smoothing parameter and V is the vocabulary size.
C. Bernoulli Naive Bayes (Binary) Used when features are binary (0 or 1). It penalizes the absence of a
feature as well.
P(xⱼ|y) = P(j|y)^xⱼ · (1 − P(j|y))^(1 − xⱼ)
Log-Probabilities (Numerical Stability) Multiplying many small probabilities (e.g., 0.001 × 0.002 . . . )
causes floating-point underflow. Solution: Maximize the sum of logs instead of the product.
ŷ = argmax_y [ log P(y) + Σⱼ₌₁ⁿ log P(xⱼ|y) ]
Since log is strictly increasing, the argmax is unchanged; the product simply becomes a numerically stable sum.
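A minimal sketch of MAP prediction in log space, assuming already-estimated Bernoulli parameters (all probabilities below are hypothetical):

```python
import numpy as np

# Hypothetical fitted parameters: 2 classes, 4 binary features
log_prior = np.log(np.array([0.6, 0.4]))          # log P(y)
cond = np.array([[0.9, 0.2, 0.7, 0.1],            # P(x_j = 1 | y = 0)
                 [0.3, 0.8, 0.4, 0.6]])           # P(x_j = 1 | y = 1)

x = np.array([1, 0, 1, 0])  # observed binary feature vector

# Bernoulli NB in log space: log P(y) + sum_j log P(x_j | y)
log_lik = x * np.log(cond) + (1 - x) * np.log(1 - cond)  # shape (2, 4)
scores = log_prior + log_lik.sum(axis=1)
y_hat = int(np.argmax(scores))
```

Working in log space avoids the underflow that multiplying many small probabilities would cause.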
LDA is a Supervised dimensionality reduction and classification technique. Unlike PCA (which maximizes
global variance), LDA aims to find a projection vector w that:
1. Maximizes the distance between the means of different classes (Between-Class Variance).
2. Minimizes the spread/scatter within each class (Within-Class Variance).
Intuition: "Group items of the same class tightly together, and push the groups as far apart as possible."
2.4.2 2. The Scatter Matrices
To formulate this mathematically, we define two scatter matrices. Let N1 , N2 be the number of samples in Class
1 and Class 2, and µ1 , µ2 be the class means.
A. Within-Class Scatter Matrix (SW ): Measures how compact each class is. It is the sum of the covariance
matrices of each individual class.
SW = Σ_{x∈C1} (x − µ1)(x − µ1)ᵀ + Σ_{x∈C2} (x − µ2)(x − µ2)ᵀ
B. Between-Class Scatter Matrix (SB ): Measures the separation between class means.
SB = (µ1 − µ2 )(µ1 − µ2 )T
The Solution (Derivation Result): To maximize J(w), we take the derivative w.r.t w and set to 0. This
solves to a Generalized Eigenvalue Problem:
SW⁻¹ SB w = λw
(Note: If the data is already whitened/spherical where SW = I, then LDA is just the vector connecting the
means).
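For two classes the generalized eigenproblem reduces to the closed form w ∝ SW⁻¹(µ1 − µ2); a sketch on synthetic Gaussian classes with shared covariance (all data hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
# Two Gaussian classes sharing an identity covariance (LDA's optimality assumption)
C1 = rng.normal(loc=[0, 0], scale=1.0, size=(100, 2))
C2 = rng.normal(loc=[4, 4], scale=1.0, size=(100, 2))

mu1, mu2 = C1.mean(axis=0), C2.mean(axis=0)
# Within-class scatter: sum of the per-class scatter matrices
SW = (C1 - mu1).T @ (C1 - mu1) + (C2 - mu2).T @ (C2 - mu2)

# Two-class closed form: w is proportional to SW^{-1} (mu1 - mu2)
w = np.linalg.solve(SW, mu1 - mu2)
w /= np.linalg.norm(w)

# Projected class means should be well separated along w
sep = abs((mu1 - mu2) @ w)
```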
LDA works optimally (is the Bayes Optimal Classifier) only if: 1. Each class is Gaussian-distributed. 2. All classes share the same covariance matrix.
Goal: Find the optimal hyperplane wT x + b = 0 that separates classes with the maximum Margin.
• Support Vectors: Data points closest to the hyperplane. Only these points influence the position of the
boundary.
• Margin Width: The distance between the dashed lines (wT x + b = 1 and wT x + b = −1).
Width = 2 / ||w||
Minimize: (1/2)||w||²
Subject to: yᵢ(wᵀxᵢ + b) ≥ 1, ∀i
• Large C: Penalizes errors heavily. Tries to classify everything correctly. Result: Narrow margin, potential
Overfitting (High Variance).
• Small C: Allows more errors (higher ξ). Result: Wider margin, smoother boundary, potential Underfit-
ting (High Bias).
Using Lagrange Multipliers (αi ), we convert the Primal problem to the Dual problem. Why? The Dual form
reveals that the algorithm only depends on the dot product between points (xTi xj ).
Maximize LD(α) = Σᵢ₌₁ⁿ αᵢ − (1/2) Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ αᵢαⱼ yᵢyⱼ (xᵢᵀxⱼ)
Constraints: Σᵢ αᵢyᵢ = 0 and 0 ≤ αᵢ ≤ C.
Key Property (KKT Conditions): Complementary slackness requires αᵢ[yᵢ(wᵀxᵢ + b) − 1] = 0 for every i. Hence αᵢ > 0 only for points lying on or inside the margin (the Support Vectors); all other points have αᵢ = 0 and do not affect the solution.
2.5.5 5. The Kernel Trick
When data is not linearly separable in input space Rd , we map it to a higher-dimensional feature space RD using
a mapping ϕ(x).
x → ϕ(x)
Instead of computing ϕ(x) explicitly (which is computationally expensive), we use a Kernel Function K(xi , xj )
that computes the dot product directly in high dimensions.
K(xi , xj ) = ϕ(xi )T ϕ(xj )
Common Kernels:
• Linear: K(xᵢ, xⱼ) = xᵢᵀxⱼ
• Polynomial: K(xᵢ, xⱼ) = (xᵢᵀxⱼ + c)ᵈ
• RBF (Gaussian): K(xᵢ, xⱼ) = exp(−γ ||xᵢ − xⱼ||²)
RBF Parameter γ (Gamma): Defines how far the influence of a single training example reaches.
• High γ: Influence is short-range. Decision boundary is jagged/complex. Risk of Overfitting.
• Low γ: Influence is long-range. Decision boundary is smooth. Risk of Underfitting.
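A sketch of the RBF kernel matrix showing the effect of γ (the points and γ values are hypothetical):

```python
import numpy as np

def rbf_kernel(X, Z, gamma):
    """K(x, z) = exp(-gamma * ||x - z||^2), computed for all pairs."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2 * X @ Z.T
    return np.exp(-gamma * sq)

X = np.array([[0.0, 0.0], [1.0, 1.0]])
K_lo = rbf_kernel(X, X, gamma=0.1)   # long-range influence
K_hi = rbf_kernel(X, X, gamma=10.0)  # short-range influence

# Higher gamma pushes off-diagonal similarity toward 0 (more local influence)
```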
A non-parametric, supervised learning method. It builds a model in the form of a tree structure.
B. Gini Impurity (CART) Measure of misclassification probability. Computationally faster (no log).
Gini(S) = 1 − Σᵢ₌₁ᶜ pᵢ²
i=1
C. Information Gain (IG) The expected reduction in entropy after splitting set S on attribute A.
IG(S, A) = H(S) − Σ_{v∈Values(A)} (|Sᵥ| / |S|) · H(Sᵥ)
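Entropy, Gini, and Information Gain can be sketched directly from the definitions (toy labels hypothetical):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    probs = np.array(list(Counter(labels).values())) / len(labels)
    return -np.sum(probs * np.log2(probs))

def gini(labels):
    probs = np.array(list(Counter(labels).values())) / len(labels)
    return 1.0 - np.sum(probs ** 2)

def information_gain(parent, splits):
    """IG = H(parent) - weighted sum of H(child) over the split subsets."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in splits)

# A pure split of a balanced binary node: entropy drops from 1 bit to 0
parent = [0, 0, 1, 1]
ig = information_gain(parent, [[0, 0], [1, 1]])
```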
2.6.3 3. The "High Cardinality" Trap
Problem with Information Gain: IG is biased towards attributes with many distinct values (e.g.,
"User_ID" or "Date"). If "User_ID" is used, every leaf has 1 sample (Pure, Entropy=0), resulting in
max IG but a useless model.
Solution: Gain Ratio (Used in C4.5) We normalize the gain by the "Split Information" (Intrinsic entropy
of the split itself).
SplitInfoA(S) = − Σᵥ (|Sᵥ|/|S|) log₂(|Sᵥ|/|S|)
GainRatio(S, A) = IG(S, A) / SplitInfoA(S)
• Prediction: The mean value (ȳ) of the samples in the leaf node.
3 Model Evaluation
3.2 Cross-Validation
• LOOCV (Leave-One-Out): k = n. Train on n − 1, test on 1. Unbiased estimate of error, but High
Variance and computationally expensive.
• K-Fold CV: Split into K parts (usually 5 or 10). Lower variance than LOOCV, biased if K is small.
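A minimal sketch of generating K-fold splits by hand (fold count and data size hypothetical):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for K-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

splits = list(kfold_indices(n=10, k=5))
# Every point appears in exactly one test fold across the k splits
all_test = np.sort(np.concatenate([t for _, t in splits]))
```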
Dimensions: If layer l has n[l] units and layer l − 1 has n[l−1] units, then W [l] is (n[l] × n[l−1] ).
Softmax (Output Layer): Used for Multi-Class Classification. It ensures Σᵢ ŷᵢ = 1 (Probability Distribution).
4.5 5. Backpropagation (Derivation Steps)
We use the Chain Rule to calculate gradients starting from the Output Layer (L) backwards.
Step 1: Output Layer Error (δ [L] ) For Cross-Entropy Loss with Softmax/Sigmoid:
δ [L] = a[L] − y
Step 2: Hidden Layer Error (δ [l] ) How much did layer l contribute to the error in l + 1?
δ [l] = (W [l+1] )T δ [l+1] ⊙ g ′[l] (z [l] )
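The two steps can be verified against a numerical gradient; a minimal sketch for a tiny sigmoid network with binary cross-entropy (shapes, data, and seed all hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
# Tiny network: 2 inputs -> 3 hidden (sigmoid) -> 1 output (sigmoid), one sample
x = rng.normal(size=(2, 1)); y = np.array([[1.0]])
W1 = rng.normal(size=(3, 2)); W2 = rng.normal(size=(1, 3))

def loss(W1, W2):
    a1 = sigmoid(W1 @ x)
    a2 = sigmoid(W2 @ a1)
    return (-(y * np.log(a2) + (1 - y) * np.log(1 - a2))).item()

# Backprop: delta_L = a_L - y for sigmoid output + cross-entropy loss
a1 = sigmoid(W1 @ x)
a2 = sigmoid(W2 @ a1)
d2 = a2 - y                          # output-layer error
d1 = (W2.T @ d2) * a1 * (1 - a1)     # hidden-layer error (elementwise sigmoid')
gW2 = d2 @ a1.T
gW1 = d1 @ x.T

# Check one gradient entry against a finite difference
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
num = (loss(W1p, W2) - loss(W1, W2)) / eps
```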
5 Unsupervised Learning
Objective: Partition n observations into k clusters to minimize the Within-Cluster Sum of Squares
(WCSS) (also called Inertia).
J(µ, C) = Σⱼ₌₁ᵏ Σ_{xᵢ∈Cⱼ} ||xᵢ − µⱼ||²
3. Update (Maximization Step): Move centroid to the mean of assigned points.
µⱼ := (1/|Cⱼ|) Σ_{x∈Cⱼ} x
• Geometry: Assumes clusters are Spherical and of similar size. Fails on concentric circles or irregular
shapes.
• Complexity: O(t · k · n · d) where t is iterations. Linear in n, making it fast.
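The assignment/update loop can be sketched in a few lines (the data, k, and the deterministic init below are hypothetical choices for a reproducible demo):

```python
import numpy as np

def kmeans(X, k, init, iters=100):
    """Lloyd's algorithm: alternate the assignment and update steps."""
    centers = init.astype(float).copy()
    for _ in range(iters):
        # Assignment step: label each point with its nearest centroid
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated blobs; init with one point from each blob
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
labels, centers = kmeans(X, k=2, init=X[[0, 20]])
```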
Instead of the Mean (which is a virtual point), the center must be an actual data point (Medoid).
• Robustness: Highly robust to Outliers. (A mean shifts drastically with one outlier; a median/medoid
does not).
• Complexity: O(k(n − k)²). Much slower than K-Means.
Builds a hierarchy of clusters. Output is a Dendrogram. Approach: Agglomerative (Bottom-Up). Start with
n clusters, merge closest pair until 1 remains.
Linkage Criteria (The distance metric between Cluster A and B):
1. Single Linkage (Min): d(A, B) = min{d(a, b) : a ∈ A, b ∈ B}. Effect: Produces long, "chain-like"
clusters. Good for non-globular shapes.
2. Complete Linkage (Max): d(A, B) = max{d(a, b) : a ∈ A, b ∈ B}. Effect: Forces compact, spherical
clusters. Sensitive to outliers.
3. Average Linkage: Average distance between all pairs.
4. Ward’s Method: Minimizes the increase in Variance (WCSS) when merging.
Complexity: O(n³) (Standard) or O(n² log n) (Optimized). Not suitable for large datasets.
S(i) = (b(i) − a(i)) / max(a(i), b(i))
where a(i) is the mean distance from point i to the other points in its own cluster, and b(i) is the mean distance to the points of the nearest other cluster.
Range: [−1, 1].
• +1: Perfect clustering.
• 0: Overlapping clusters.
• −1: Assigned to wrong cluster.
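A direct sketch of the silhouette computation from its definition (toy points hypothetical):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette score: s(i) = (b(i) - a(i)) / max(a(i), b(i))."""
    n = len(X)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = np.empty(n)
    for i in range(n):
        same = (labels == labels[i])
        # a(i): mean distance to the other points in the same cluster
        a = D[i, same].sum() / (same.sum() - 1)
        # b(i): mean distance to the nearest other cluster
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# Tight, well-separated clusters score close to +1
X = np.array([[0.0, 0], [0.1, 0], [5.0, 0], [5.1, 0]])
s = silhouette(X, np.array([0, 0, 1, 1]))
```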
A linear transformation that projects data onto a new coordinate system where:
Let u1 be a unit vector (||u1|| = 1). We want to project the (centered) data X onto u1 such that the variance of the projection is maximized. Projection: z = Xu1. Variance: Var(z) = (1/n) zᵀz = (1/n) u1ᵀXᵀXu1 = u1ᵀΣu1.
Objective: Maximize u1ᵀΣu1 subject to u1ᵀu1 = 1. Using Lagrange Multipliers: L(u1, λ) = u1ᵀΣu1 − λ(u1ᵀu1 − 1); setting the gradient to zero gives Σu1 = λu1.
Conclusion: The direction of maximum variance u1 is simply the Eigenvector of the Covariance Matrix Σ,
and the variance itself is the Eigenvalue λ.
Xnew = X · Wk
In practice (e.g., Python’s sklearn), PCA uses Singular Value Decomposition (SVD) of X directly, not the
Covariance Matrix.
X = U SV T
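A sketch confirming that the SVD route agrees with the covariance-eigendecomposition route (data hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
# Correlated 2-D data: most variance along the first axis
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])

Xc = X - X.mean(axis=0)              # PCA requires centered data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Vt are the principal directions; covariance eigenvalues are S^2 / n
eigvals = S ** 2 / len(Xc)
evals2, evecs = np.linalg.eigh(np.cov(Xc.T, bias=True))  # eigh sorts ascending

# Project onto the top principal component
Z = Xc @ Vt[0]
```

Both routes give the same spectrum; SVD is preferred in practice because it avoids forming XᵀX, which squares the condition number.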