INSTITUTE OF AERONAUTICAL ENGINEERING
AN ASSIGNMENT REPORT OF
Machine Learning Techniques and Practices
COURSE CODE: ACAD20
BY
STUDENT NAME: [Link]
ROLL NUMBER: 23951A0444
ELECTRONICS AND COMMUNICATION
ENGINEERING
INSTITUTE OF AERONAUTICAL ENGINEERING DUNDIGAL,
HYDERABAD-500 043, TELANGANA, INDIA.
1. Overfitting and Underfitting in Supervised Learning
Overfitting and underfitting are two fundamental problems in supervised learning that
directly impact the predictive performance of machine learning models.
Overfitting occurs when a model learns not only the underlying patterns in the training data
but also the noise and random fluctuations. As a result, the model performs extremely well
on training data but poorly on unseen test data. This typically happens when the model is
too complex relative to the amount of available data. For example, a high-degree polynomial
regression model may perfectly fit training points but fail to generalize. Overfitting is
characterized by low bias and high variance. Common causes include small datasets,
excessive features, very complex models, and insufficient regularization.
Solutions to overfitting include:
1. Cross-validation to evaluate model generalization.
2. Regularization techniques such as L1 and L2.
3. Pruning in decision trees.
4. Early stopping in neural networks.
5. Increasing training data.
6. Dropout in deep learning.
7. Feature selection and dimensionality reduction.
Underfitting occurs when a model is too simple to capture the underlying structure of the
data. It results in poor performance on both training and test data. This is characterized by
high bias and low variance. For example, fitting a linear model to nonlinear data causes
underfitting.
Solutions to underfitting include:
1. Increasing model complexity.
2. Adding more relevant features.
3. Reducing regularization.
4. Using nonlinear models.
5. Training longer in iterative algorithms.
Balancing bias and variance is key. Techniques such as bias-variance tradeoff analysis,
cross-validation, and model comparison help in selecting appropriate models.
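The contrast above can be seen numerically. The following is a minimal sketch on synthetic data (the sine curve, noise level, and polynomial degrees are illustrative assumptions, not part of the assignment): a degree-1 fit underfits, while a degree-9 fit through 10 noisy training points memorizes them, giving near-zero training error but a larger held-out error.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.shape)
x_tr, y_tr = x[::2], y[::2]      # 10 training points
x_te, y_te = x[1::2], y[1::2]    # 10 held-out points

def train_test_mse(degree):
    # Least-squares polynomial fit on the training half only.
    coeffs = np.polyfit(x_tr, y_tr, degree)
    err = lambda xs, ys: float(np.mean((ys - np.polyval(coeffs, xs)) ** 2))
    return err(x_tr, y_tr), err(x_te, y_te)

tr_lo, te_lo = train_test_mse(1)   # underfit: both errors stay high
tr_hi, te_hi = train_test_mse(9)   # overfit: interpolates the noisy training points
```

On this data the degree-9 model's training error is essentially zero while its held-out error is larger, which is exactly the low-bias, high-variance signature of overfitting described above.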
2. Supervised Learning and Importance of Labeled Data
(a) The dataset described is used for supervised learning because it contains input features
along with corresponding output labels. In supervised learning, the model learns a mapping
function from inputs to outputs based on labeled examples.
(b) Classification problems require labeled data because the goal is to assign inputs to
predefined categories. Without labels, the algorithm cannot learn the relationship between
input features and target classes. Labels serve as ground truth. During training, the model
compares predicted labels with actual labels using a loss function. The difference (error)
guides weight updates.
For example, in spam detection, emails are labeled as 'spam' or 'not spam.' The model learns
patterns such as certain keywords or sender characteristics. Without labels, it would be
impossible to measure error or improve predictions.
Labeled data ensures:
1. Supervised training with clear objectives.
2. Quantifiable performance evaluation.
3. Model validation and tuning.
4. Ability to compute metrics such as accuracy, precision, recall, and F1-score.
Thus, labeled data is essential for classification because it provides the learning signal.
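The spam example can be sketched in a few lines. This is a toy illustration, not a production classifier; the mini-corpus and the add-one word-count scoring are assumptions made here to show how labels supply the learning signal.

```python
from collections import Counter

# Labeled training examples: the label is the ground truth the model learns from.
labeled = [
    ("win free prize now", "spam"),
    ("free offer click now", "spam"),
    ("meeting agenda attached", "not spam"),
    ("project status report", "not spam"),
]

# Estimate per-class word frequencies from the labels.
counts = {"spam": Counter(), "not spam": Counter()}
for text, label in labeled:
    counts[label].update(text.split())

def classify(text):
    # Score each class by how often it has seen the email's words (add-one).
    scores = {c: sum(counts[c][w] + 1 for w in text.split()) for c in counts}
    return max(scores, key=scores.get)

print(classify("free prize now"))  # prints "spam"
```

Without the `label` field there would be nothing to populate `counts` with, and no way to measure whether `classify` is right, which is the point made above.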
3. Model Performs Well on Training but Poor on Test Data
(a) The problem is overfitting.
(b) The model performs well on training data but poorly on test data because it has
memorized the training examples instead of learning general patterns. This may be due to
excessive model complexity, insufficient training samples, or noisy data.
Reasons include:
1. High variance model.
2. Too many parameters.
3. Small dataset.
4. Noise in training data.
Solutions:
1. Apply regularization (L1/L2).
2. Use cross-validation.
3. Simplify the model.
4. Collect more data.
5. Use dropout (for neural networks).
6. Prune decision trees.
The goal is to improve generalization performance.
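Of the fixes listed, L2 regularization has a convenient closed form that can be shown directly. The synthetic problem below is an assumption for illustration; the formula w = (XᵀX + λI)⁻¹Xᵀy is standard ridge regression, and increasing λ shrinks the coefficient vector, trading a little bias for less variance.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
# Only the first feature actually matters; the rest invite overfitting.
y = X @ np.array([1.0, 0, 0, 0, 0]) + rng.normal(0, 0.1, 20)

def ridge(lmbda):
    # Closed-form ridge solution: w = (X^T X + lambda*I)^-1 X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lmbda * np.eye(d), X.T @ y)

w_ols = ridge(0.0)   # no penalty: ordinary least squares
w_reg = ridge(10.0)  # penalized: coefficients shrink toward zero
```

The norm of `w_reg` is smaller than that of `w_ols`, which is how the penalty keeps a high-variance model from chasing noise.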
4. Concept and Types of Machine Learning
Machine Learning is a branch of artificial intelligence that enables systems to learn patterns
from data and improve performance without explicit programming. It relies on algorithms
that learn from experience.
Types of Machine Learning:
1. Supervised Learning – Uses labeled data (classification, regression).
2. Unsupervised Learning – No labels (clustering, dimensionality reduction).
3. Semi-supervised Learning – Combination of labeled and unlabeled data.
4. Reinforcement Learning – Learning through rewards and penalties.
Supervised Learning in Detail:
Supervised learning uses labeled datasets. It consists of:
- Classification (categorical output)
- Regression (continuous output)
Example 1: House price prediction (Regression)
Example 2: Email spam detection (Classification)
Key components:
- Input features
- Target variable
- Loss function
- Optimization algorithm
Algorithms:
- Linear Regression
- Logistic Regression
- Decision Trees
- KNN
- SVM
- Neural Networks
Advantages:
- Clear objective
- High accuracy with sufficient data
Disadvantages:
- Requires labeled data
- Risk of overfitting
Supervised learning is widely used in finance, healthcare, marketing, and image recognition.
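The house-price regression example can be made concrete. The sizes and prices below are made-up toy values chosen to lie exactly on the line price = 2 × size + 10, so ordinary least squares recovers that mapping from the labeled pairs.

```python
sizes  = [50, 80, 100, 120, 150]    # input feature (square metres)
prices = [110, 170, 210, 250, 310]  # target labels (exactly 2*size + 10 here)

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# Ordinary least squares for a 1-D linear model y = slope*x + intercept.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) \
        / sum((x - mean_x) ** 2 for x in sizes)
intercept = mean_y - slope * mean_x

print(round(slope, 2), round(intercept, 2))  # prints 2.0 10.0
```

The model has learned the input-to-output mapping from labeled examples, which is the defining property of supervised learning stated above.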
5. Entropy and Information Gain Calculation
Entropy measures impurity in a dataset. It is calculated as:
Entropy(S) = -Σ p_i log2(p_i)
Where p_i is the probability of class i.
Information Gain (IG) measures reduction in entropy after splitting:
IG = Entropy(before split) - Entropy(after split)
Given:
Entropy before split = 0.94
Entropy after split = 0.55
IG = 0.94 - 0.55 = 0.39
Thus, the information gain is 0.39.
A higher information gain indicates a better attribute for splitting in decision trees.
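The entropy formula and the worked numbers above can be checked in a few lines. The 9-positive / 5-negative split used as a sanity check is the classic 14-example play-tennis dataset, assumed here for illustration.

```python
import math

def entropy(probs):
    # Entropy(S) = -sum p_i * log2(p_i); terms with p = 0 contribute 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The worked information gain from the text:
ig = 0.94 - 0.55
print(round(ig, 2))  # prints 0.39

# Sanity check: a 9-positive / 5-negative split has entropy of about 0.94 bits.
print(round(entropy([9 / 14, 5 / 14]), 2))  # prints 0.94
```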
6. Ensemble Learning
Ensemble learning combines multiple models to improve predictive performance.
(a) Ensemble methods are used to:
- Improve accuracy
- Reduce variance
- Reduce bias
- Increase robustness
(b) Types of ensemble techniques:
1. Bagging (Bootstrap Aggregating)
2. Boosting
3. Stacking
4. Voting
5. Random Forest
Ensembles typically outperform individual weak learners by aggregating their predictions.
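Hard voting, the simplest of the techniques listed, can be sketched directly. The three base models and their predictions below are hypothetical; the point is only the aggregation step.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    # predictions_per_model: one list of per-sample predictions per base model.
    n_samples = len(predictions_per_model[0])
    combined = []
    for i in range(n_samples):
        votes = [preds[i] for preds in predictions_per_model]
        combined.append(Counter(votes).most_common(1)[0][0])  # majority class
    return combined

preds = [
    ["spam", "ham", "spam"],   # model 1
    ["spam", "spam", "ham"],   # model 2
    ["ham",  "ham", "spam"],   # model 3
]
print(majority_vote(preds))  # prints ['spam', 'ham', 'spam']
```

Each individual model makes one mistake, but the majority vote is correct on all three samples, which is the robustness gain point (a) describes.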
7. Gradient Boosting Algorithm
Gradient Boosting is an ensemble method that builds models sequentially.
(a) Loss Function:
Common loss functions include:
- Mean Squared Error (regression)
- Log Loss (classification)
(b) Sequential Learning Process:
1. Initialize model with constant prediction.
2. Compute residual errors.
3. Train weak learner on residuals.
4. Update model predictions.
5. Repeat iteratively.
Each new model corrects the errors of the models before it. The learning rate controls how much each new model contributes to the final prediction.
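The five steps above can be sketched for squared-error loss, where the residuals are exactly the negative gradient. The toy 1-D dataset and the depth-1 "stump" weak learner below are assumptions for illustration.

```python
# Toy data: two clusters of targets around 1.0 and 3.0.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.2, 0.9, 3.0, 3.2, 2.9]

def fit_stump(xs, residuals):
    # Weak learner: try each threshold, predict the mean residual on each side.
    best = None
    for t in xs:
        left  = [r for xi, r in zip(xs, residuals) if xi <= t]
        right = [r for xi, r in zip(xs, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

pred = [sum(y) / len(y)] * len(y)          # 1. initialize with a constant
lr = 0.5                                   # learning rate
for _ in range(20):                        # 5. repeat iteratively
    residuals = [yi - pi for yi, pi in zip(y, pred)]   # 2. residual errors
    stump = fit_stump(x, residuals)                    # 3. weak learner on residuals
    pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]  # 4. update

mse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)
print(round(mse, 3))
```

After 20 rounds the training MSE drops well below the error of the initial constant model, showing how each stump chips away at the remaining residuals.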
8. Bagging and Unstable Classifiers
Bagging improves unstable classifiers by reducing variance.
Steps:
1. Create bootstrap samples.
2. Train base learners independently.
3. Aggregate predictions (average or majority vote).
Bagging stabilizes models like decision trees and reduces overfitting. Random Forest is a
popular example.
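The three steps can be sketched with a deliberately simple base learner (a slope-through-the-origin estimator, chosen here only so the example stays short; the synthetic data with true slope 2 is likewise an assumption).

```python
import random

random.seed(0)
# Synthetic data: y = 2x plus Gaussian noise.
data = [(x, 2.0 * x + random.gauss(0, 1.0)) for x in range(20)]

def fit_slope(sample):
    # Base learner: least-squares slope through the origin.
    num = sum(x * y for x, y in sample)
    den = sum(x * x for x, _ in sample)
    return num / den

def bagged_slope(n_models=25):
    slopes = []
    for _ in range(n_models):
        boot = [random.choice(data) for _ in data]  # 1. bootstrap sample
        slopes.append(fit_slope(boot))              # 2. independent base learner
    return sum(slopes) / len(slopes)                # 3. aggregate (average)

print(round(bagged_slope(), 2))
```

Each bootstrap model sees a slightly different resampling of the data; averaging their estimates smooths out the individual fluctuations, which is the variance reduction the section describes.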
9. Kernel Functions in SVM
SVM performance depends on kernel choice.
(a) Linear Kernel:
Suitable for linearly separable data. Fast and efficient.
(b) Polynomial Kernel:
Captures polynomial relationships. Degree controls complexity.
(c) RBF Kernel:
Handles nonlinear data effectively. Its gamma parameter controls how far the influence of each training point reaches.
Kernel selection impacts bias-variance tradeoff and computational complexity.
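The three kernels are just different similarity functions between feature vectors, which can be evaluated directly. The vectors and hyperparameter values below are illustrative assumptions, not outputs of a trained SVM.

```python
import math

def linear_kernel(a, b):
    # K(a, b) = a . b
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(a, b, degree=3, c=1.0):
    # K(a, b) = (a . b + c)^degree; degree controls complexity.
    return (linear_kernel(a, b) + c) ** degree

def rbf_kernel(a, b, gamma=0.5):
    # K(a, b) = exp(-gamma * ||a - b||^2); gamma controls locality.
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)

x1, x2 = [1.0, 2.0], [2.0, 1.0]
print(linear_kernel(x1, x2))          # prints 4.0
print(poly_kernel(x1, x2))            # prints 125.0, i.e. (4 + 1)^3
print(round(rbf_kernel(x1, x2), 3))   # prints 0.368, i.e. exp(-1)
```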
10. K-Nearest Neighbors Algorithm
KNN is a supervised learning algorithm that classifies a data point based on the majority class of its nearest neighbors.
Steps:
1. Choose K.
2. Calculate distance.
3. Select K nearest neighbors.
4. Assign majority class.
Effect of K:
Small K:
- Low bias
- High variance
- Sensitive to noise
Large K:
- High bias
- Low variance
- More stable but may underfit
Choosing optimal K requires cross-validation.
KNN is simple but computationally expensive for large datasets.
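The four steps and the effect of K can both be shown on a toy example (the 2-D points and labels below are made up for illustration):

```python
import math
from collections import Counter

# Tiny labeled training set: class "A" near (1, 1.5), class "B" near (5.5, 5.5).
train = [((1.0, 1.0), "A"), ((1.5, 2.0), "A"),
         ((5.0, 5.0), "B"), ((6.0, 5.5), "B"), ((5.5, 6.0), "B")]

def knn_predict(query, k=3):
    # 2. Calculate the distance from the query to every training point.
    dists = sorted((math.dist(query, p), label) for p, label in train)
    # 3. Select the K nearest neighbors; 4. assign the majority class.
    top = [label for _, label in dists[:k]]
    return Counter(top).most_common(1)[0][0]

print(knn_predict((5.2, 5.1)))        # prints "B": all 3 nearest points are B
print(knn_predict((1.2, 1.5)))        # prints "A": the 2 A points dominate
print(knn_predict((1.2, 1.5), k=5))   # prints "B": too-large K underfits
```

With K = 5 the vote includes the entire training set, so the majority class "B" wins even for a point sitting inside the "A" cluster, illustrating how a large K over-smooths the decision boundary.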