Cross Validation
Contents
Understanding Underfitting and Overfitting:
Overfit Model
Overfitting occurs when a statistical model or
machine learning algorithm captures the
noise of the data. Intuitively, overfitting
occurs when the model or the algorithm fits
the data too well.
An overfit model yields good accuracy on the
training data set but poor results on new
data sets. Such a model is of little use in
the real world, as it is not able to predict
outcomes for new cases.
Underfit Model
Underfitting occurs when a statistical model or machine
learning algorithm cannot capture the underlying trend
of the data.
Intuitively, underfitting occurs when the model or the
algorithm does not fit the data well enough.
Underfitting is often a result of an excessively simple
model or poor data preparation, for example:
Missing data is not handled properly.
No outlier treatment is applied.
Relevant features, i.e. features that do contribute to
the target variable, are removed.
How to tackle the problem of Overfitting:
The answer is Cross Validation
A key challenge with overfitting, and with machine
learning in general, is that we can’t know how well our
model will perform on new data until we actually test it.
There are different types of Cross Validation
techniques, but the overall concept remains the same:
• Partition the data into a number of subsets
• Hold out one subset at a time and train the model on
the remaining subsets
• Test the model on the held-out subset
How K-fold works
Divide your training data into K equal-sized
“folds.”
The algorithm iterates through each fold, treating
that fold as holdout data, training a model on
all the other K-1 folds, and evaluating the
model’s performance on the one holdout fold.
This results in K different models,
each with an out-of-sample accuracy
score on a different holdout set.
The average of these K models’ out-of-sample
scores is the model’s cross-validation score.
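The fold-splitting step above can be sketched in pure Python. The helper names `make_folds` and `k_fold_splits` are illustrative, not from any library:

```python
# Minimal sketch of k-fold splitting (pure Python, no libraries).

def make_folds(n_samples, k):
    """Split indices 0..n_samples-1 into k roughly equal folds."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `remainder` folds get one extra sample.
        size = fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds

def k_fold_splits(n_samples, k):
    """Yield (train_indices, holdout_indices) for each of the k folds."""
    folds = make_folds(n_samples, k)
    for i, holdout in enumerate(folds):
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, holdout
```

For 10 samples and k=5, each of the 5 splits holds out 2 samples and trains on the other 8.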
What is Cross Validation?
Cross-validation is a technique for evaluating ML models by
training several ML models on subsets of the available input
data and evaluating them on the complementary subset of the
data. In k-fold cross-validation, you split the input data
into k subsets of data (also known as folds).
Here are the steps involved in cross validation:
1. You reserve a sample data set
2. Train the model using the remaining part of the dataset
3. Use the reserved sample as the test (validation) set. This will
help you gauge the effectiveness of your model’s
performance. If your model delivers a positive result on
validation data, go ahead with the current model. It rocks!
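The three steps above can be sketched with a toy majority-class "model" (all names here are illustrative, not a library API):

```python
# Reserve a validation sample, train on the rest, score on the reserve.

def train_majority_model(labels):
    """'Train' by memorizing the most common label in the training set."""
    return max(set(labels), key=labels.count)

def accuracy(model_label, labels):
    """Fraction of labels the majority-class model predicts correctly."""
    return sum(1 for y in labels if y == model_label) / len(labels)

data = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
reserved = data[:3]        # step 1: reserve a sample data set
training = data[3:]        # step 2: train on the remaining part
model = train_majority_model(training)
score = accuracy(model, reserved)   # step 3: gauge effectiveness
```

Here the "model" is deliberately trivial; in practice any estimator with fit/predict behaviour slots into the same three steps.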
Why to use Cross Validation?
Cross Validation is a very useful technique for
assessing the effectiveness of your model,
particularly in cases where you need to
mitigate over-fitting.
The process of cross validation in general
Types of Cross Validation
K-Fold Cross Validation
Stratified K-fold Cross Validation
Leave One Out Cross Validation
k-Fold Cross Validation:
The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. As such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation.
If k=5 the dataset will be divided into 5 equal parts and the below process will run 5 times, each time with a different holdout set.
1. Take one group as the holdout or test data set
2. Take the remaining groups as a training data set
3. Fit a model on the training set and evaluate it on the test set
4. Retain the evaluation score and discard the model
At the end of the above process, summarize the skill of the model using the sample of model evaluation scores.
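The four-step loop can be sketched in pure Python with a toy majority-class model (illustrative names, not a library API):

```python
# k-fold cross-validation loop: hold out each fold once, fit on the rest,
# score on the holdout, and average the retained scores.

def cross_val_score(labels, k):
    n = len(labels)
    fold_size, rem = divmod(n, k)
    scores, start = [], 0
    for i in range(k):
        size = fold_size + (1 if i < rem else 0)
        holdout = labels[start:start + size]            # step 1: holdout set
        train = labels[:start] + labels[start + size:]  # step 2: training set
        majority = max(set(train), key=train.count)     # step 3: fit and ...
        acc = sum(1 for y in holdout if y == majority) / len(holdout)
        scores.append(acc)                              # step 4: retain score
        start += size
    return sum(scores) / k  # summarize: the cross-validation score

labels = [0, 0, 1, 0, 1, 0, 1, 0, 0, 0]
cv_score = cross_val_score(labels, k=5)  # average of 5 holdout accuracies
```

Note that each fitted "model" (here just the majority label) is discarded after scoring; only the evaluation scores are kept.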
How to decide the value of k?
The value for k is chosen such that each train/test group
of data samples is large enough to be statistically
representative of the broader dataset.
A value of k=10 is very common in the field of applied
machine learning, and is recommended if you are struggling
to choose a value for your dataset.
If a value for k is chosen that does not evenly split the
data sample, then some groups will contain a remainder of
the examples. It is preferable to split the data sample into
k groups with the same number of samples, so that the
model skill scores are all comparable.
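A quick sketch of how an uneven split is commonly handled, spreading the remainder over the first folds (`fold_sizes` is an illustrative helper, not a library function):

```python
# Fold sizes for n_samples split into k folds: when k does not divide
# n_samples evenly, the first folds each take one extra sample.

def fold_sizes(n_samples, k):
    """Return the size of each of the k folds."""
    base, rem = divmod(n_samples, k)
    return [base + (1 if i < rem else 0) for i in range(k)]

fold_sizes(10, 3)  # 10 samples cannot split evenly into 3 folds
fold_sizes(10, 5)  # 10 samples split evenly into 5 folds of 2
```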
Stratified k-Fold Cross Validation:
Same as K-Fold Cross Validation, with just a slight difference:
The splitting of data into folds may be governed by criteria such as ensuring that each fold has the same proportion of observations with a given categorical value, such as the class outcome value. This is called stratified cross-validation.
For example, a stratified k-fold split can be made on the basis of Gender (M or F), so that each fold preserves the same proportion of M and F observations.
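A minimal sketch of a stratified split, assuming a simple round-robin deal of each class's indices into the folds (`stratified_folds` is an illustrative name, not a library function):

```python
# Stratified fold assignment: indices of each class are dealt round-robin
# into k folds, so every fold keeps roughly the same class proportions.

def stratified_folds(labels, k):
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal each class across the folds
    return folds

# e.g. 6 "M" and 4 "F" labels split into 2 folds of 3 M + 2 F each
labels = ["M"] * 6 + ["F"] * 4
folds = stratified_folds(labels, 2)
```

A plain (unstratified) split of this data could easily put all four F samples into one fold; the stratified split cannot.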
Leave One Out Cross Validation (LOOCV):
This approach leaves 1 data point out of the training data, i.e. if there are n data points in the original sample then n-1 points are used to train the model and the 1 remaining point is used as the validation set. This is repeated for every way
the original sample can be separated like this, and then the error is averaged over all trials to give the overall effectiveness.
The number of possible combinations is equal to the number of data points in the original sample, or n.
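The LOOCV procedure can be sketched with a toy "model" that predicts the mean of the training points (all names here are illustrative):

```python
# Leave-one-out cross-validation: each of the n points is held out once,
# the model is fit on the other n-1 points, and the errors are averaged.

def loocv_scores(values):
    """For each point, train on the other n-1 values and record the
    squared error of predicting the held-out point with the training mean."""
    errors = []
    for i, held_out in enumerate(values):
        train = values[:i] + values[i + 1:]   # the n-1 training points
        prediction = sum(train) / len(train)  # toy model: training mean
        errors.append((held_out - prediction) ** 2)
    return errors

values = [1.0, 2.0, 3.0, 4.0]
mean_error = sum(loocv_scores(values)) / len(values)  # average over all n trials
```

With n=4 points there are exactly n=4 trials, matching the combination count stated above.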