UNIT-1
Introduction
Machine Learning
Machine Learning is the field of study that gives computers the capability to learn without
being explicitly programmed.
Machine learning (ML) is a branch of artificial intelligence (AI) and computer science that
focuses on using data and algorithms to enable AI systems to imitate the way that humans
learn, gradually improving their accuracy.
(Figure: Traditional Programming vs. Machine Learning)
Terminologies:
Model: A model is a specific representation learned from data by applying some machine
learning algorithm. A model is also called a hypothesis.
Feature: A feature is an individual measurable property of our data. Feature vectors are fed
as input to the model. For example, in order to predict a fruit, there may be features like
color, smell, taste, etc. Note: choosing informative, discriminating, and independent features
is a crucial step for effective algorithms.
Target (Label): A target variable, or label, is the value to be predicted by our model. For the
fruit example discussed in the features section, the label for each set of inputs would be the
name of the fruit, such as apple, orange, or banana.
Training: The idea is to give a set of inputs (features) and their expected outputs (labels), so
that after training we have a model (hypothesis) that maps new data to one of the categories
it was trained on.
Prediction: Once our model is ready, it can be fed a set of inputs for which it will produce a
predicted output (label). Only if the model performs well on unseen data can we say that it
has truly learned.
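The terminology above can be made concrete with a tiny sketch in Python. The fruits and their feature values are invented for illustration, and the "model" is a deliberately simple nearest-neighbour rule rather than any particular algorithm:

```python
# Terminology in miniature: feature vectors, labels, training data, and
# prediction, using a deliberately simple 1-nearest-neighbour "model".
# The fruits and their (sweetness, roundness) scores are invented values.

training_features = [(9, 8), (7, 9), (6, 2)]     # feature vectors
training_labels = ["apple", "orange", "banana"]  # targets (labels)

def predict(features):
    """Predict the label of the closest training example."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best = min(range(len(training_features)),
               key=lambda i: dist(features, training_features[i]))
    return training_labels[best]

print(predict((8, 8)))  # a sweet, round fruit -> "apple"
```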
Features of Machine learning
Machine learning is a data-driven technology. Organizations generate large amounts of data
daily, and by noticing relationships in that data they can make better decisions.
A machine can learn on its own from past data and improve automatically.
From a given dataset, it detects various patterns in the data.
For big organizations branding is important, and it becomes easier to target a relatable
customer base.
It is similar to data mining because it also deals with huge amounts of data.
Definition
A computer program is said to learn from experience E with respect to some class of tasks T
and performance measure P, if its performance at tasks in T, as measured by P, improves
with experience E.
Example
o Task T : Recognizing and classifying handwritten words within images
o Performance P : Percent of words correctly classified
o Training experience E : A dataset of handwritten words with given classifications
Types of Machine Learning:
Machine learning algorithms can be trained in many ways, with each method having its pros
and cons. Based on these methods and ways of learning, machine learning is broadly
categorized into four main types:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning
Supervised Machine Learning:
Supervised learning is a technique in which a model is trained on a "Labelled Dataset", i.e. a
dataset containing both input and output parameters. In supervised learning, algorithms learn
to map inputs to their correct outputs. Both the training and validation datasets are labelled.
Example:
Suppose we want to build an image classifier to differentiate between cats and dogs. If we
feed a dataset of labelled dog and cat images to the algorithm, the machine will learn to
distinguish a dog from a cat using these labelled images. When we input new dog or cat
images that it has never seen before, it will use what it has learned to predict whether each
image shows a dog or a cat. This is how supervised learning works, and this particular task is
image classification.
There are two main categories of supervised learning that are mentioned below:
Classification
Regression
Classification
Classification deals with predicting categorical target variables, which represent discrete
classes or labels. For instance, classifying emails as spam or not spam, or predicting whether
a patient has a high risk of heart disease. Classification algorithms learn to map the input
features to one of the predefined classes.
Here are some classification algorithms:
Logistic Regression
Support Vector Machine
Random Forest
Decision Tree
K-Nearest Neighbors (KNN)
Naive Bayes
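Classification can be illustrated with a toy "decision stump", i.e. a depth-1 decision tree, applied to the spam example above. The feature (how often the word "free" appears) and all the counts are invented for illustration:

```python
# A toy decision stump (depth-1 decision tree) for the spam example.
# The feature (count of the word "free") and the data are invented.

def train_stump(counts, labels):
    """Pick the count threshold that maximises training accuracy."""
    best_t, best_acc = 0, 0.0
    for t in sorted(set(counts)):
        acc = sum((c > t) == (y == "spam")
                  for c, y in zip(counts, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

free_counts = [5, 0, 3, 1, 4, 0]
labels = ["spam", "ham", "spam", "ham", "spam", "ham"]

threshold = train_stump(free_counts, labels)
classify = lambda count: "spam" if count > threshold else "ham"
print(threshold, classify(6), classify(0))  # 1 spam ham
```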
Regression
Regression, on the other hand, deals with predicting continuous target variables, which
represent numerical values. For example, predicting the price of a house based on its size,
location, and amenities, or forecasting the sales of a product. Regression algorithms learn to
map the input features to a continuous numerical value.
Here are some regression algorithms:
Linear Regression
Polynomial Regression
Ridge Regression
Lasso Regression
Decision tree
Random Forest
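Regression can be sketched in a few lines of plain Python using the closed-form least-squares formulas for one feature. The house sizes and prices below are invented toy numbers, chosen to lie on a perfect line so the fit is easy to check:

```python
# Least-squares linear regression with one feature, fitted with the
# closed-form slope/intercept formulas. Sizes and prices are invented.

sizes = [50, 80, 110, 140]      # square metres
prices = [100, 160, 220, 280]   # price in thousands (perfectly linear here)

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n
# slope = covariance(x, y) / variance(x); intercept from the means
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices))
         / sum((x - mean_x) ** 2 for x in sizes))
intercept = mean_y - slope * mean_x

predict_price = lambda size: intercept + slope * size
print(slope, intercept, predict_price(100))  # 2.0 0.0 200.0
```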
Unsupervised Machine Learning
Unsupervised learning is a machine learning technique in which an algorithm discovers
patterns and relationships using unlabelled data. The primary goal of unsupervised
learning is often to discover hidden patterns, similarities, or clusters within the data, which
can then be used for various purposes, such as data exploration, visualization, dimensionality
reduction, and more.
Example:
For example, consider an input dataset of images of a fruit-filled container. Here, the images
are unlabelled, so they are unknown to the model in advance. When we feed the dataset to
the ML model, its task is to identify patterns among the objects, such as the colors, shapes,
or other differences visible in the input images, and to categorize them accordingly. Once it
has formed categories, the machine predicts outputs when tested on a test dataset.
There are three main categories of unsupervised learning that are mentioned below:
Clustering
Association
Dimensionality reduction
Clustering
Clustering is the process of grouping data points into clusters based on their similarity. This
technique is useful for identifying patterns and relationships in data without the need for
labelled examples.
Here are some clustering algorithms:
K-Means Clustering
Hierarchical Clustering
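The clustering idea can be sketched with a compact k-means loop on one-dimensional toy data; a real project would rely on a library implementation, but the two alternating steps are the same:

```python
# k-means (k = 2) on 1-D toy data: alternate between assigning each point to
# its nearest centre and moving each centre to the mean of its cluster.

def kmeans_1d(points, centers, steps=10):
    for _ in range(steps):
        clusters = [[], []]
        for p in points:  # assignment step
            nearest = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]  # update step
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centers, clusters = kmeans_1d(points, centers=[0.0, 10.0])
print(centers)  # the two centres settle near the two groups of points
```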
Association
Association rule learning is a technique for discovering relationships between items in a
dataset. It identifies rules indicating that the presence of one item implies the presence of
another item with a specific probability.
Here are some association rule learning algorithms:
Apriori Algorithm
Eclat
FP-growth Algorithm
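The two quantities behind association rules, support and confidence, can be illustrated with a bare-bones sketch on invented shopping transactions. This is not a full Apriori implementation, just the counting the rules are built from:

```python
# Support and confidence, the quantities behind rules like "bread -> butter".
# The transactions below are invented.

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "butter"}))       # 0.5
print(confidence({"bread"}, {"butter"}))  # ~0.667
```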
Semi-Supervised Learning:
Semi-supervised learning sits between supervised and unsupervised learning: it uses both
labelled and unlabelled data. It is particularly useful when obtaining labelled data is costly,
time-consuming, or resource-intensive, for example when labelling requires skilled
annotators and specialized resources.
We use these techniques when dealing with data in which only a small portion is labelled and
the large remainder is unlabelled. We can use unsupervised techniques to predict labels for
the unlabelled portion and then feed these labels to supervised techniques. This approach is
especially common with image datasets, where usually not all images are labelled.
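The pseudo-labelling idea described above can be sketched in a few lines: use the few labelled points to assign provisional labels to the unlabelled ones, here via the nearest labelled neighbour on a single made-up feature:

```python
# Pseudo-labelling in miniature: give each unlabelled point the label of its
# nearest labelled neighbour, then treat the result as extra training data.
# The single feature and its values are invented for illustration.

labelled = [(0.1, "cat"), (0.9, "dog")]   # (feature value, label)
unlabelled = [0.2, 0.85, 0.15]

def pseudo_label(x):
    return min(labelled, key=lambda pair: abs(pair[0] - x))[1]

pseudo = [(x, pseudo_label(x)) for x in unlabelled]
print(pseudo)  # these pairs could now feed a supervised learner
```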
Example:
Consider building a language translation model: obtaining labelled translations for every
sentence pair can be resource-intensive. Semi-supervised learning allows the model to learn
from both labelled and unlabelled sentence pairs, making it more accurate. This technique
has led to significant improvements in the quality of machine translation services.
Reinforcement Learning:
Reinforcement learning is a learning method in which an algorithm interacts with an
environment by producing actions and discovering errors. Trial, error, and delayed reward
are the most relevant characteristics of reinforcement learning. In this technique, the model
keeps improving its performance using reward feedback to learn a behaviour or pattern.
Such algorithms are tailored to a specific problem, e.g. the Google self-driving car, or
AlphaGo, where a bot competes with humans, and even with itself, to become a better and
better Go player. Each time the agent acts, it learns from the outcome and adds that
experience to its knowledge, which serves as its training data; the more it learns, the better
trained and more experienced it becomes.
Example:
Consider training an AI agent to play a game like chess. The agent explores different moves
and receives positive or negative feedback based on the outcome. Reinforcement learning
also finds applications in robotics and other settings where agents learn to perform tasks by
interacting with their surroundings.
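The reward-feedback loop can be sketched with a one-state value update. The two actions and their rewards are invented, and a real agent would also deal with states and exploration; the point is only how feedback nudges the value estimates:

```python
# Reward feedback in miniature: a one-state value update moves each action's
# estimated value toward the reward it produces. Rewards are invented.

rewards = {"good_move": 1.0, "bad_move": -1.0}
values = {"good_move": 0.0, "bad_move": 0.0}
alpha = 0.5  # learning rate

for _ in range(10):  # the agent tries both actions repeatedly
    for action, reward in rewards.items():
        values[action] += alpha * (reward - values[action])

best_action = max(values, key=values.get)
print(best_action, values)  # the rewarded action ends up preferred
```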
Machine Learning Process:
1. Data Collection and integration:
The first step of the ML pipeline involves the collection of data and integration of
data.
The collected data acts as input to the model.
These inputs are called features.
The more data we have, the better our model can become.
Once the data is collected, we need to integrate and prepare it.
Integration of data means placing all related data together.
Then the data preparation phase starts, in which we manually and critically explore the
data.
The data preparation phase tells the developer whether the data matches expectations. Is
there enough information to make an accurate prediction? Is the data consistent?
2. Exploratory Data Analysis and Visualisation:
Once the data is prepared, the developer needs to visualize it to gain a better understanding
of the relationships within the dataset.
Visualizing the data can reveal patterns that may not have been noticed in the first phase.
It helps developers easily identify missing data and outliers.
Data visualization can be done by plotting histograms, scatter plots, etc.
After visualization, the data is analyzed so that the developer can decide which ML
technique to use.
3. Feature Selection and Engineering:
Feature selection means choosing which features the developer wants to use in the model.
Features should be selected so that correlation among the chosen features is minimal, while
correlation between the chosen features and the output is maximal.
Feature engineering is the process of transforming the original data into new, potentially
more informative features.
In simple words, feature engineering is converting raw data into useful data, i.e. getting the
maximum out of the original data.
Feature engineering is arguably the most crucial and time-consuming step of the ML
pipeline.
Feature selection directly affects the accuracy and precision of the model.
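A small illustration of feature engineering: deriving a potentially more informative feature from two raw ones. The columns and numbers below are invented:

```python
# Feature engineering in one line: derive "price per square metre" from two
# raw columns. All numbers are invented.

rows = [
    {"price": 300_000, "area_m2": 100},
    {"price": 450_000, "area_m2": 150},
]

for row in rows:
    row["price_per_m2"] = row["price"] / row["area_m2"]  # engineered feature

print([row["price_per_m2"] for row in rows])  # [3000.0, 3000.0]
```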
4. Model Training:
This is the step in which the model is actually trained on the data.
To train the model, the data is split into three parts: training data, validation data, and test
data.
Around 70%-80% of the data goes into the training set, which is used to train the model.
Validation data, also known as the development set or dev set, is used to avoid overfitting
or underfitting, i.e. it enables hyperparameter tuning.
Hyperparameter tuning is a technique used to combat overfitting and underfitting.
Validation data is used during model evaluation.
Around 10%-15% of the data is used as validation data.
The remaining 10%-15% of the data goes into the test set, which is used for testing after
the model is built.
It is crucial to randomize the data while splitting it in order to obtain an accurate model.
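The split described above can be sketched as follows; the "dataset" is a stand-in list of example IDs, and the exact 70/15/15 percentages are a common convention rather than a fixed rule:

```python
# Shuffle, then split roughly 70% / 15% / 15% into training, validation,
# and test sets. The "dataset" is a stand-in list of 100 example IDs.
import random

data = list(range(100))
random.Random(42).shuffle(data)  # randomize before splitting

train = data[:70]
validation = data[70:85]
test = data[85:]
print(len(train), len(validation), len(test))  # 70 15 15
```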
5. Model Evaluation:
After model training, the validation (development) data is used to evaluate the model.
To get the most accurate assessment, the test data may be used for further model
evaluation.
A confusion matrix can be created after model evaluation to calculate accuracy and
precision numerically.
After model evaluation, the model enters the final stage, that is, prediction.
6. Prediction:
In the prediction phase, the developer deploys the model.
After deployment, the model is ready to make predictions.
Predictions are made on training data and test data to gain a better understanding of the
built model.
The deployment of the model isn't a one-time exercise. As more and more data is generated,
the model is retrained on the new data, evaluated again, and deployed again. The model
training, model evaluation, and prediction phases thus form a continuous cycle.
Supervised Learning
Supervised learning, as introduced above, trains a model on a labelled dataset so that it
learns to map inputs to their correct outputs. Its two main categories, classification and
regression, were described earlier.
Advantages of Supervised Machine Learning
Supervised Learning models can have high accuracy as they are trained on labelled
data.
The process of decision-making in supervised learning models is often interpretable.
Pre-trained supervised models can often be reused, which saves time and resources
compared to developing new models from scratch.
Disadvantages of Supervised Machine Learning
It may struggle with unseen or unexpected patterns that are not present in the training
data.
It can be time-consuming and costly, as it relies on labelled data.
It may generalize poorly to new data.
Evaluating Supervised Learning Models
Evaluating supervised learning models is an important step in ensuring that the model is
accurate and generalizable. There are a number of different metrics that can be used to
evaluate supervised learning models:
For Regression
Mean Squared Error (MSE): MSE measures the average squared difference between
the predicted values and the actual values. Lower MSE values indicate better model
performance.
Root Mean Squared Error (RMSE): RMSE is the square root of MSE, representing
the standard deviation of the prediction errors. Similar to MSE, lower RMSE values
indicate better model performance.
Mean Absolute Error (MAE): MAE measures the average absolute difference
between the predicted values and the actual values. It is less sensitive to outliers
compared to MSE or RMSE.
R-squared (Coefficient of Determination): R-squared measures the proportion of the
variance in the target variable that is explained by the model. Higher R-squared values
indicate better model fit.
Problem:
Actual values : [3, 5, 2, 7, 10]
Predicted values : [2, 5, 3, 8, 9]
Errors (actual − predicted): [1, 0, −1, −1, 1]
MAE = (1 + 0 + 1 + 1 + 1) / 5 = 0.8
MSE = (1 + 0 + 1 + 1 + 1) / 5 = 0.8
RMSE = sqrt(0.8) ≈ 0.8944
R2 = 1 − (SS_res / SS_tot) = 1 − (4 / 41.2) ≈ 0.9029
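The worked example above can be checked in a few lines of Python:

```python
# Checking the worked regression-metrics example in code.
actual = [3, 5, 2, 7, 10]
predicted = [2, 5, 3, 8, 9]
n = len(actual)

errors = [a - p for a, p in zip(actual, predicted)]
mae = sum(abs(e) for e in errors) / n
mse = sum(e * e for e in errors) / n
rmse = mse ** 0.5

mean_actual = sum(actual) / n
ss_res = sum(e * e for e in errors)                   # 4
ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # 41.2
r2 = 1 - ss_res / ss_tot

print(mae, mse, round(rmse, 4), round(r2, 4))  # 0.8 0.8 0.8944 0.9029
```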
For Classification
Accuracy: Accuracy is the percentage of predictions that the model makes correctly. It
is calculated by dividing the number of correct predictions by the total number of
predictions.
Precision: Precision is the percentage of positive predictions that the model makes
that are actually correct. It is calculated by dividing the number of true positives by
the total number of positive predictions.
Recall: Recall is the percentage of all positive examples that the model correctly
identifies. It is calculated by dividing the number of true positives by the total number
of positive examples.
F1 score: The F1 score combines precision and recall into a single number. It is calculated
by taking the harmonic mean of precision and recall.
Confusion matrix: A confusion matrix is a table that shows the number of predictions
for each class, along with the actual class labels. It can be used to visualize the
performance of the model and identify areas where the model is struggling.
True Positive (TP): The model correctly predicted a positive outcome (the actual
outcome was positive).
True Negative (TN): The model correctly predicted a negative outcome (the actual
outcome was negative).
False Positive (FP): The model incorrectly predicted a positive outcome (the actual
outcome was negative). Also known as a Type I error.
False Negative (FN): The model incorrectly predicted a negative outcome (the actual
outcome was positive). Also known as a Type II error.
Problem:
Actual Outcomes:
[1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
Predicted Outcomes:
[1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
Using the provided labels:
TP = 4 (Positions: 1, 3, 6, 9)
TN = 4 (Positions: 2, 5, 8, 10)
FP = 1 (Position: 7)
FN = 1 (Position: 4)
Accuracy = (TP + TN) / Total = (4 + 4) / 10 = 0.8
Precision = TP / (TP + FP) = 4 / 5 = 0.8
Recall = TP / (TP + FN) = 4 / 5 = 0.8
F1 Score = 2 × Precision × Recall / (Precision + Recall) = 0.8
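The counts and metrics above can be verified in code:

```python
# Reproducing the confusion-matrix counts and metrics from the example above.
actual    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

pairs = list(zip(actual, predicted))
tp = sum(a == 1 and p == 1 for a, p in pairs)
tn = sum(a == 0 and p == 0 for a, p in pairs)
fp = sum(a == 0 and p == 1 for a, p in pairs)
fn = sum(a == 1 and p == 0 for a, p in pairs)

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(tp, tn, fp, fn, accuracy, precision, recall, f1)
```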
Applications:
Supervised learning can be used to solve a wide variety of problems, including:
Spam filtering: Supervised learning algorithms can be trained to identify and classify
spam emails based on their content, helping users avoid unwanted messages.
Image classification: Supervised learning can automatically classify images into
different categories, such as animals, objects, or scenes, facilitating tasks like image
search, content moderation, and image-based product recommendations.
Medical diagnosis: Supervised learning can assist in medical diagnosis by analyzing
patient data, such as medical images, test results, and patient history, to identify
patterns that suggest specific diseases or conditions.
Fraud detection: Supervised learning models can analyze financial transactions and
identify patterns that indicate fraudulent activity, helping financial institutions prevent
fraud and protect their customers.
Natural language processing (NLP): Supervised learning plays a crucial role in NLP
tasks, including sentiment analysis, machine translation, and text summarization,
enabling machines to understand and process human language effectively.
Unsupervised Learning
Unsupervised learning, as introduced above, discovers hidden patterns, similarities, and
clusters in unlabelled data. Its main categories are clustering, association rule learning, and
dimensionality reduction; clustering and association were described earlier, and
dimensionality reduction is covered below.
Dimensionality Reduction
Dimensionality reduction is the process of reducing the number of features in a dataset while
preserving as much information as possible.
Here are some dimensionality reduction algorithms:
Principal Component Analysis
Independent Component Analysis
Linear Discriminant Analysis
Isomap
Advantages of Unsupervised Learning
No labelled data required
Can uncover hidden patterns
Can be used for a variety of tasks: clustering, dimensionality reduction, and anomaly
detection
Can be used to explore new data
Disadvantages:
Difficult to evaluate
Difficult to interpret
Sensitive to quality of data
Computationally expensive
Applications
Customer segmentation
Recommendation system
Natural language processing
Image Analysis
Reinforcement Learning:
Reinforcement learning is a feedback-based machine learning technique in which an agent
learns to behave in an environment by performing actions and seeing their results. For each
good action the agent gets positive feedback, and for each bad action it gets negative
feedback, or a penalty.
In reinforcement learning, the agent learns automatically from this feedback, without any
labelled data.
o Agent: An entity that can perceive/explore the environment and act upon it.
o Environment: The situation in which the agent is present or by which it is surrounded. In
RL, we assume a stochastic environment, which means it is random in nature.
o Action: Actions are the moves taken by the agent within the environment.
o State: The situation returned by the environment after each action taken by the agent.
o Reward: Feedback returned to the agent from the environment to evaluate its actions.
o Policy: A strategy applied by the agent to choose the next action based on the current
state.
Elements of Reinforcement Learning:
1. Policy
2. Reward Signal
3. Value Function
4. Model of the environment
Policy: The policy defines the learning agent's behaviour at a given time. It is a mapping
from perceived states of the environment to the actions to be taken when in those states.
Reward function: The reward function defines the goal of a reinforcement learning problem.
It provides a numerical score based on the state of the environment.
Value function: Value functions specify what is good in the long run. The value of a state is
the total amount of reward an agent can expect to accumulate over the future, starting from
that state.
Model of the environment: A model mimics the behaviour of the environment and is used
for planning, i.e. anticipating the next state and reward before actions are actually taken.
Hypothesis:
A hypothesis is a function that best describes the target in supervised machine learning.
A hypothesis in machine learning is the model's assumption about the relationship between
the input features and the output.
Hypothesis Space (H)
Hypothesis space is the set of all possible legal hypotheses. It is the set from which the
machine learning algorithm determines the single best hypothesis that describes the target
function or the outputs.
Evaluation:
The process of machine learning involves not only formulating hypotheses but also
evaluating their performance. Common evaluation metrics include mean squared error
(MSE), accuracy, precision, recall, F1-score, and others. By comparing the predictions of the
hypothesis with the actual outcomes on a validation or test dataset, one can assess the
effectiveness of the model.
Inductive Bias:
Inductive bias can be defined as the set of assumptions or biases that a learning algorithm
employs to make predictions on unseen data based on its training data.
Eg: If we choose to use a linear model, we are introducing an inductive bias that assumes the
relationship between house size and price is linear.
Cross-Validation:
Cross validation is a technique used in machine learning to evaluate the performance of a
model on unseen data. The main idea is to split the dataset into several parts (folds) and
systematically train and test the model on different subsets of the data.
The table below shows the training and evaluation subsets generated in k-fold
cross-validation. Here, we have 25 instances in total. In the first iteration we use the first 20
percent of the data for evaluation and the remaining 80 percent for training (the first five
instances for testing, the remaining twenty for training); in the second iteration we use the
second subset of 20 percent for evaluation and the remaining subsets for training, and so on.
Total instances: 25
Value of k : 5
No. Iteration Training set observations Testing set observations
1 [ 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [0 1 2 3 4]
2 [ 0 1 2 3 4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] [5 6 7 8 9]
3 [ 0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22 23 24] [10 11 12 13 14]
4 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 22 23 24] [15 16 17 18 19]
5 [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19] [20 21 22 23 24]
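The fold indices in the table can be reproduced with a short loop (plain Python, no library assumed):

```python
# Reproducing the 5-fold indices from the table: 25 instances, and each block
# of 5 consecutive observations takes a turn as the test set.
n, k = 25, 5
fold_size = n // k

folds = []
for i in range(k):
    test_idx = list(range(i * fold_size, (i + 1) * fold_size))
    train_idx = [j for j in range(n) if j not in test_idx]
    folds.append((train_idx, test_idx))
    print(f"Iteration {i + 1}: testing on {test_idx}")
```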
Advantages:
1. Overcoming overfitting: Cross validation helps to prevent overfitting by providing a
more robust estimate of the model's performance on unseen data.
2. Model selection: Cross validation can be used to compare different models and select
the one that performs the best on average.
3. Hyperparameter tuning: Cross validation can be used to optimize the hyperparameters
of a model, such as the regularization parameter, by selecting the values that result in
the best performance on the validation set.
4. Data efficiency: Cross validation allows the use of all the available data for both
training and validation, making it a more data-efficient method compared to
traditional validation techniques.
Disadvantages:
1. Computationally expensive: Cross validation can be computationally expensive,
especially when the number of folds is large or when the model is complex and
requires a long time to train.
2. Time-consuming: Cross validation can be time-consuming, especially when there are
many hyperparameters to tune or when multiple models need to be compared.
3. Bias-variance tradeoff: The choice of the number of folds affects the bias-variance
tradeoff of the performance estimate: too few folds may give a pessimistically biased
estimate (each model is trained on less data), while too many folds (e.g. leave-one-out)
may result in a high-variance estimate.
Weight Space
Weight space is the multidimensional space in which each dimension corresponds to one of
the parameters (weights) of the model. Training can be viewed as searching this space for
the point that minimises the model's error.
Testing:
Unit Testing for Components
Like traditional software testing, unit testing in ML focuses on testing individual components
of the ML pipeline. It involves assessing the correctness of each step, from data
preprocessing to feature extraction, model architecture, and hyperparameters. Ensuring that
each building block functions as expected contributes to the overall reliability of the model.
Data Testing and Preprocessing
The quality of input data impacts the performance of an ML model. Data testing involves
verifying the data’s integrity, accuracy, and consistency. This step also includes preprocessing
testing to ensure that data transformation, normalisation, and cleaning processes are executed
correctly. Clean and reliable data leads to accurate predictions.
Cross-Validation
Cross-validation is a powerful technique for assessing how well an ML model generalises to
new, unseen data. It involves partitioning the dataset into multiple subsets, training the model
on different subsets, and testing its performance on the remaining data. Cross-validation
provides insights into a model’s potential performance on diverse inputs by repeating this
process and averaging the results.
Performance Metrics Testing
Choosing appropriate performance metrics is crucial for evaluating model performance.
Metrics like accuracy, precision, recall, and F1-score provide quantitative measures of how
well the model is doing. Testing these metrics ensures that the model delivers results per the
intended objectives.
Robustness and Adversarial Testing
Robustness testing involves assessing how well the model handles unexpected inputs or
adversarial attacks. Adversarial testing explicitly evaluates the model’s behaviour when
exposed to deliberately modified inputs designed to confuse it. Robust models are less likely
to make erroneous predictions under challenging conditions.
A/B Testing for Deployment
Once a model is ready for deployment, A/B testing can be employed. It involves deploying
the new ML model alongside an existing one and comparing their performance in a real-
world setting. A/B testing helps ensure that the new model doesn’t introduce unexpected
issues and performs at least as well as the current solution.
Bias Testing
Bias in ML models can lead to unfair or discriminatory outcomes. To tackle this, bias and
fairness testing aims to identify and mitigate biases in the data and the ML model’s
predictions. It ensures that the model treats all individuals and groups fairly.