MACHINE
LEARNING-C
SC701
MODULE 1- INTRODUCTION TO ML
Machine learning is a growing technology that enables computers to learn automatically from past data. Machine learning uses various algorithms to build mathematical models and make predictions using historical data or information. Currently, it is used for various tasks such as image recognition, speech recognition, email filtering, Facebook auto-tagging, recommender systems, and many more.
What is Machine Learning
In the real world, we are surrounded by humans who can learn everything from their experiences, and by computers or machines that work on our instructions. But can a machine also learn from experiences or past data the way a human does? This is where Machine Learning comes in.
Machine Learning is a subset of artificial intelligence that is mainly concerned with the development of algorithms that allow a computer to learn on its own from data and past experience. The term machine learning was first introduced by Arthur Samuel in 1959.
WE CAN DEFINE IT IN A SUMMARIZED WAY AS:
“MACHINE LEARNING ENABLES A MACHINE TO
AUTOMATICALLY LEARN FROM DATA, IMPROVE
PERFORMANCE FROM EXPERIENCES, AND PREDICT
THINGS WITHOUT BEING EXPLICITLY
PROGRAMMED”.
HOW DOES MACHINE LEARNING WORK
A Machine Learning system learns from historical data, builds prediction models, and, whenever it receives new data, predicts the output for it. The accuracy of the predicted output depends on the amount of data: a larger amount of data helps to build a better model, which predicts the output more accurately.
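This learn-then-predict loop can be sketched in a few lines of Python. The sketch is purely illustrative: it "trains" a one-variable linear model on historical data by least squares (no ML library is assumed; all names are invented for the example) and then predicts the output for new input.

```python
# A minimal sketch of the learn-from-history, predict-on-new-data loop.

def train(xs, ys):
    """Fit y = a*x + b by ordinary least squares; returns the model (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

def predict(model, x):
    a, b = model
    return a * x + b

# Historical data: inputs with their known outputs (here exactly y = 2x).
history_x = [1, 2, 3, 4, 5]
history_y = [2, 4, 6, 8, 10]

model = train(history_x, history_y)   # build the prediction model
print(predict(model, 6))              # new data -> predicted output: 12.0
```

With more (and cleaner) historical pairs, the fitted coefficients stabilise, which is the slide's point that more data tends to yield a better model.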
MACHINE LEARNING MODEL
Training Data → Train Machine Learning algorithm → Trained Model → Test the model with new input → Is the model performing correctly?
If yes (Y): the Machine Learning model is ready.
If no (N): retrain with the training data.
DATA FORMATS
Structured data is stored in a predefined format and is highly specific.
Unstructured data is a collection of many varied data types that are stored in their native formats.
Semi-structured data does not follow the tabular data-structure models associated with relational databases or other data tables.
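The three formats can be contrasted in a short Python sketch (records and values are invented for illustration; JSON stands in for semi-structured data, free text for unstructured data):

```python
import json

# Structured: fixed schema, tabular rows (e.g. a relational table).
structured = [
    {"id": 1, "name": "Asha", "age": 30},
    {"id": 2, "name": "Ravi", "age": 25},
]

# Semi-structured: tagged keys but no rigid tabular schema;
# fields can vary from record to record.
semi_structured = json.loads('{"id": 3, "name": "Meena", "skills": ["ML", "SQL"]}')

# Unstructured: kept in its native format with no predefined fields.
unstructured = "Customer wrote: the delivery was late but the product is great."

print(structured[0]["name"])        # Asha
print(semi_structured["skills"])    # ['ML', 'SQL']
print(len(unstructured.split()))    # 11 words of free text
```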
DIKW PYRAMID
[Figure: the DIKW pyramid, from Data (events, records and transactions) at the base, through Information (meaning) and Knowledge (content), to Wisdom (understanding) at the top.]
CATEGORIES OF DATA ANALYTICS
Descriptive, diagnostic, predictive, and prescriptive analytics.
TYPES OF MACHINE LEARNING
Based on the methods and the way of learning, machine learning is divided into the following types:
Supervised Machine Learning
Unsupervised Machine Learning
Reinforcement Learning
Supervised Machine Learning
Supervised machine learning is based on supervision. In the supervised learning technique, we train the machine using a "labeled" dataset, and based on this training, the machine predicts the output. Here, the labeled data specifies that some of the inputs are already mapped to outputs.
Categories of Supervised Machine Learning
Supervised machine learning can be classified
into two types of problems, which are given
below:
Classification
Regression
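As a minimal, hand-rolled sketch of supervised classification (the dataset and the 1-nearest-neighbour rule are illustrative assumptions, not taken from any particular library):

```python
# Supervised learning sketch: a 1-nearest-neighbour classifier trained on a
# labelled dataset, where each input is already mapped to an output label.

def nearest_neighbour_predict(train_X, train_y, x):
    """Classify x with the label of its closest training example."""
    distances = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_X]
    return train_y[distances.index(min(distances))]

# Labelled training data: [height_cm, weight_kg] -> "cat" / "dog"
train_X = [[25, 4], [30, 5], [60, 25], [70, 30]]
train_y = ["cat", "cat", "dog", "dog"]

print(nearest_neighbour_predict(train_X, train_y, [28, 4.5]))  # cat
print(nearest_neighbour_predict(train_X, train_y, [65, 28]))   # dog
```

The same labelled-data idea covers regression too: there, the mapped outputs are continuous numbers rather than class labels.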
ADVANTAGES
Since supervised learning works with a labeled dataset, we can have an exact idea about the data.
These algorithms are helpful in predicting outputs on the basis of prior experience.
DISADVANTAGES
These algorithms are not able to solve complex tasks.
The model may predict the wrong output if the test data differs from the training data.
Training the algorithm requires a lot of computational time.
APPLICATIONS
Image Segmentation
Medical Diagnosis
Fraud Detection
Spam detection
Speech Recognition
UNSUPERVISED MACHINE LEARNING
The main aim of an unsupervised learning algorithm is to group or categorize the unsorted/unlabeled dataset according to similarities, patterns, and differences.
Machines are instructed to find the hidden patterns in the input dataset.
CATEGORIES OF UNSUPERVISED MACHINE
LEARNING
Unsupervised Learning can be further classified
into two types, which are given below:
Clustering
Association
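Clustering, the first of these categories, can be sketched with a tiny hand-rolled k-means (the data and the naive initialisation are illustrative; production implementations choose starting centroids more carefully):

```python
# Unsupervised learning sketch: k-means (k = 2) groups an unlabelled dataset
# purely by similarity; no output labels are given to the algorithm.

def kmeans(points, k=2, iters=10):
    centroids = points[:k]                      # naive initialisation
    for _ in range(iters):
        # assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # move each centroid to the mean of its cluster
        centroids = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

points = [[1, 1], [1.5, 2], [1, 0.5],      # one natural group
          [8, 8], [9, 9], [8.5, 7.5]]      # another natural group
centroids, clusters = kmeans(points)
print(len(clusters[0]), len(clusters[1]))   # the two hidden groups: 3 3
```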
ADVANTAGES:
These algorithms can be used for more complicated tasks than supervised ones because they work on unlabeled datasets.
Unsupervised algorithms are preferable for various tasks, as obtaining an unlabeled dataset is easier than obtaining a labeled one.
DISADVANTAGES:
The output of an unsupervised algorithm can be less accurate, as the dataset is not labeled and the algorithm is not trained with the exact output in advance.
Working with unsupervised learning is more difficult, as it uses an unlabeled dataset that is not mapped to outputs.
APPLICATIONS
Network Analysis
Recommendation Systems
Anomaly Detection
Singular Value Decomposition
REINFORCEMENT LEARNING
Reinforcement learning works on a feedback-based process in which an AI agent (a software component) automatically explores its surroundings by trial and error: taking actions, learning from experience, and improving its performance.
The agent gets rewarded for each good action and punished for each bad action; hence, the goal of a reinforcement learning agent is to maximize the rewards.
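The reward-and-punishment loop can be sketched with tabular Q-learning on a toy corridor environment (the environment, rewards, and hyper-parameters are all illustrative assumptions, not part of any standard API):

```python
import random

# RL sketch: states 0..4 in a line; the agent starts at 0.
# Action 0 = left, 1 = right. Reaching state 4 gives reward +1 (good action
# rewarded); every other step costs -0.01 (wasteful actions punished).
# The agent explores by trial and error and improves from the feedback.

random.seed(0)
N_STATES, GOAL = 5, 4
alpha, gamma, epsilon = 0.5, 0.9, 0.2
Q = [[0.0, 0.0] for _ in range(N_STATES)]     # Q[state][action]

for episode in range(200):
    s = 0
    while s != GOAL:
        # epsilon-greedy: mostly exploit the best known action, sometimes explore
        a = random.randrange(2) if random.random() < epsilon \
            else (0 if Q[s][0] > Q[s][1] else 1)
        s2 = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
        r = 1.0 if s2 == GOAL else -0.01
        # Q-learning update: nudge Q towards reward + discounted best future value
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

policy = [("left", "right")[0 if Q[s][0] > Q[s][1] else 1] for s in range(GOAL)]
print(policy)   # the learned policy heads right, towards the reward
```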
ADVANTAGES
It helps in solving complex real-world problems that are difficult to solve with general techniques.
The learning model of RL is similar to how human beings learn; hence, highly accurate results can be obtained.
It helps in achieving long-term results.
DISADVANTAGES
RL algorithms are not preferred for simple problems.
RL algorithms require huge amounts of data and computation.
Too much reinforcement learning can lead to an overload of states, which can weaken the results.
APPLICATION
Video Games
Resource Management
Robotics
Text Mining
ISSUES IN MACHINE LEARNING
Inadequate Training Data
Poor quality of data
Massive training data
Overfitting and underfitting
Monitoring and maintenance
Getting bad recommendations
Lack of skilled resources
Limited possibilities to reuse a model
Data Bias
APPLICATION OF MACHINE LEARNING
STEPS IN DEVELOPING A MACHINE LEARNING
APPLICATION.
Collect Data
Prepare the input data
Analyze the input data
Train the algorithm
Test the algorithm
Use the algorithm
Periodic Revisit
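One possible walk-through of these steps on a toy pass/fail problem, using only the Python standard library (the data and the simple threshold model are invented for illustration):

```python
import statistics

# 1. Collect data (here: hard-coded exam scores with pass/fail labels).
raw = [(35, "fail"), (42, "fail"), (55, "pass"), (61, "pass"),
       (48, "fail"), (70, "pass"), (52, "pass"), (39, "fail")]

# 2. Prepare the input data: split into training and test portions.
train, test = raw[:6], raw[6:]

# 3. Analyze the input data: inspect the class balance.
print({label: sum(1 for _, l in train if l == label) for label in ("pass", "fail")})

# 4. Train the algorithm: learn a threshold between the two class means.
pass_mean = statistics.mean(x for x, l in train if l == "pass")
fail_mean = statistics.mean(x for x, l in train if l == "fail")
threshold = (pass_mean + fail_mean) / 2

# 5. Test the algorithm on the held-out data.
accuracy = sum(("pass" if x >= threshold else "fail") == l for x, l in test) / len(test)
print("test accuracy:", accuracy)

# 6. Use the algorithm on new input.
print("score 58 ->", "pass" if 58 >= threshold else "fail")
```

The "periodic revisit" step would repeat this cycle on fresh data to check that the threshold still holds.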
TRAINING ERROR
Training error is the error a model makes on the same data it was trained on. It can be inflated by problems during model training, e.g. a dataset handled inappropriately during preprocessing or in feature selection.
GENERALIZATION ERROR
In supervised learning applications in machine
learning and statistical learning theory,
generalization error (also known as the
out-of-sample error) is a measure of how
accurately an algorithm is able to predict
outcome values for previously unseen data.
Notice that the gap between predictions and observed data is induced by model inaccuracy, sampling error, and noise. Some of these errors are reducible, but some are not. Choosing the right algorithm and tuning its parameters can improve model accuracy, but we will never be able to make our predictions 100% accurate.
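The gap between training error and generalization error can be demonstrated with a model that simply memorises its training set: a 1-nearest-neighbour regressor (synthetic noisy data; all values here are illustrative) has zero training error, but its error on held-out, previously unseen data is not zero.

```python
import random
random.seed(1)

# Noisy samples from the true relationship y = 2x + 1.
data = [(x, 2 * x + 1 + random.gauss(0, 0.5)) for x in [i / 10 for i in range(40)]]
train, test = data[::2], data[1::2]          # alternate points: train vs held out

def predict(x):
    """1-NN regressor: return the y of the closest training x (pure memorisation)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def mse(dataset):
    return sum((predict(x) - y) ** 2 for x, y in dataset) / len(dataset)

print("training error:", mse(train))              # exactly 0.0: memorised
print("generalization (test) error:", mse(test))  # > 0 on unseen data
```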
TRAINING ERROR AND GENERALIZATION
ERROR
OVERFITTING
A statistical model is said to be overfitted when it makes accurate predictions on the training data but not on testing data. When a model is trained with too much capacity or for too long, it starts learning from the noise and inaccurate entries in the dataset, and testing with test data then results in high variance. The model does not categorize the data correctly, because of too many details and noise.
REASONS FOR OVERFITTING:
High variance and low bias.
The model is too complex.
The training dataset is too small.
Bias: the difference between the actual or expected values and the predicted values is known as error, bias error, or error due to bias.
Low Bias: Low bias value means fewer
assumptions are taken to build the target
function. In this case, the model will closely
match the training dataset.
High Bias: High bias value means more
assumptions are taken to build the target
function. In this case, the model will not match
the training dataset closely.
Variance is the measure of spread in data from its mean position. In machine learning, variance is the amount by which the performance of a predictive model changes when it is trained on different subsets of the training data.
Low variance: Low variance means that the
model is less sensitive to changes in the training
data and can produce consistent estimates of the
target function with different subsets of data
from the same distribution.
High variance: High variance means that the
model is very sensitive to changes in the training
data and can result in significant changes in the
estimate of the target function when trained on
different subsets of data from the same
distribution.
EXAMPLE
Actual value: 9.2
Predicted values: 8.9, 12.2, 7.2, 7.8
Bias = |actual - predicted|
Low bias: |9.2 - 8.9| = 0.3
High bias: |9.2 - 7.2| = 2.0
Variance = spread among the predicted outputs
Low variance: 7.2 and 7.8 (close together)
High variance: 8.9 and 12.2 (far apart)
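The arithmetic of this example can be checked directly in Python (using the population variance as one reasonable measure of spread; the low/high pairings follow the example above):

```python
import statistics

# The example's numbers: actual value 9.2, four predictions.
actual = 9.2
preds = [8.9, 12.2, 7.2, 7.8]

errors = [abs(actual - p) for p in preds]
print(errors)                                   # approx [0.3, 3.0, 2.0, 1.4]

# Variance as spread: the close pair versus the far-apart pair.
print(statistics.pvariance([7.2, 7.8]))         # approx 0.09 -> low variance
print(statistics.pvariance([12.2, 8.9]))        # approx 2.72 -> high variance
```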
TECHNIQUES TO REDUCE OVERFITTING
Increase the amount of training data.
Reduce model complexity.
UNDERFITTING
A machine learning algorithm is said to underfit when it cannot capture the underlying trend of the data, i.e., it performs poorly even on the training data, and consequently also performs poorly on testing data.
REASONS FOR UNDERFITTING
High bias and low variance.
The size of the training dataset used is not
enough.
The model is too simple.
Training data is not cleaned and contains noise.
TECHNIQUES TO REDUCE UNDERFITTING
Increase model complexity
Increase the number of features by performing feature engineering
Remove noise from the data.
BIAS-VARIANCE TRADE-OFF.
Bias is the difference between the predictions of the Machine Learning model and the correct values. High bias gives a large error on training as well as testing data.
The variability of model predictions for a given data point, which tells us the spread of our predictions, is called the variance of the model. A model with high variance fits the training data very closely and thus is not able to predict accurately on data it hasn't seen before. As a result, such models perform very well on training data but have high error rates on test data.
If the algorithm is too simple (a hypothesis with a linear equation), it may end up with high bias and low variance, and thus be error-prone. If the algorithm is too complex (a hypothesis with a high-degree equation), it may end up with high variance and low bias; in this latter condition, it will not perform well on new entries. There is a sweet spot between these two conditions, known as the Trade-off, or Bias-Variance Trade-off.