CNN Batch Size and Optimization Techniques

Lecture 09 – Convolutional Neural Network

◼ Base concepts

◼ Optimizers
◼ Gradient Descent
◼ Batch Normalization
◼ Dropout

◼ Convolutional NN
◼ Basic idea
◼ Pooling Layer
◼ CNN modeling
◼ Batch Size
◼ Epoch

◼ Transfer Learning

◼ Let’s take an example:

◼ If we draw the graph of the expression 𝒚 = 𝒙𝟐, we get a U-shaped (parabolic)
curve; such a function is called a CONVEX function.
◼ If we take any 2 points on this curve (other than the minimum point), the straight
line segment joining them lies above the function.

◼ A convex function has exactly 1 minimum point, called the global minimum.

◼ If we take a convex function of 2 variables and take any
points with similar heights on its graph, they will
sketch a round circle on the surface of the convex function.

◼ These circles are called contour lines.

◼ Contour lines are used to visualize a 3D (2-variable)
convex function in a 2D plane using circles.
◼ They also allow us to see where the minimum of the function lies.

◼ A non-convex function is neither a straight line nor
a single smooth bowl-shaped curve.

◼ It is a curve with multiple ups and downs.

◼ This kind of function is called a Non-Convex
Function.


▪ Contour lines give us a clear view of where the
local or global minima exist.

▪ Now let’s map these functions onto a
Deep Neural Network.

◼ DNN models for regression use loss functions such as mean squared
error, mean absolute error, etc.

◼ In fact, these loss functions are 1-variable convex functions and have one minimal
point (also called the global minimum).

◼ With classification, we are dealing with a Non-Convex function, because we have
more than 1 variable, and over those variables we are searching for the global
minimum point that gives the minimum value of the Loss function.

◼ Let’s take an example, where we will see that the neural network is working
with a Non-Convex function.

◼ The NN has some inputs, hidden layers and an output layer. The weights are initialized with random
values.
◼ The first neuron has W’s = 1, 4, 7, the second neuron has W’s = 2, 5, 8, and the third neuron has
W’s = 3, 6, 9.
◼ It means there are different local points (weight settings) for the value that we need to find out.

◼ Now, if we exchange (SWAP) the W’s assigned to the neurons, we will get the
same output values.
◼ This means we get the same value for different assignments of the weights
to the NN, and we reach almost the same global minimum.
◼ With multiple W’s, we have multiple local minima, and one of them is the
global minimum.
◼ Using the Backpropagation technique, we update the W’s values and
calculate the LOSS function of the model.
◼ In fact, we are trying to find the Global Minimum throughout the process.
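The swapping idea above can be checked directly. Below is a minimal sketch (plain Python, a hypothetical toy network with one hidden layer; the weight values 1–9 come from the slide, the output weights are assumed for illustration): permuting hidden neurons together with their incoming and outgoing weights leaves the network output unchanged.

```python
import math

def forward(W1, W2, x):
    # W1: one weight list per hidden neuron; W2: output-layer weights
    hidden = [math.tanh(sum(w * xi for w, xi in zip(ws, x))) for ws in W1]
    return sum(w * h for w, h in zip(W2, hidden))

x = [0.5, -1.0, 2.0]
W1 = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]   # three hidden neurons, as on the slide
W2 = [0.1, 0.2, 0.3]                     # assumed output weights

# Swap hidden neurons 0 and 1 (their incoming AND outgoing weights together)
W1_swapped = [W1[1], W1[0], W1[2]]
W2_swapped = [W2[1], W2[0], W2[2]]

y1 = forward(W1, W2, x)
y2 = forward(W1_swapped, W2_swapped, x)
print(abs(y1 - y2) < 1e-12)  # True: different weight assignment, same output
```

This is exactly why the loss surface has many equivalent local minima: every permutation of hidden neurons gives a different weight vector with the same loss.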

◼ Question: How can we find the global minimum if more than one minimal point
exists? Or: how can we make sure that we are finding the correct minimal (global
minimum) point?
◼ Answer: By using Backpropagation and SGD, we calculate and
update the W’s values and then try to find the minimal point (expected to be the
global minimum).

◼ But sometimes we can get stuck at a saddle point (which is
one of the biggest problems with Non-Convex functions).

◼ Before discussing the Saddle point, let’s have an example.

◼ In the figure there’s a small curve with a saddle point.

◼ Around a saddle point the function is locally flat, and we know that the
derivative of a flat region is zero.

◼ This means the saddle points are points on the function where gradient descent
stops moving (they are stuck points), not allowing us to move toward either a maximum or a minimum.

◼ Taking the derivative at a saddle point, we get the answer 0, which means no
learning occurs and there are no changes for the model to learn.
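This can be seen numerically. Below is a small check using the classic saddle example f(x, y) = x² − y² (an assumed illustration, not from the slide): at the saddle point (0, 0) both partial derivatives are 0, so a gradient descent update leaves the point unchanged.

```python
def f(x, y):
    return x**2 - y**2   # classic saddle: minimum along x, maximum along y

def grad(x, y, h=1e-6):
    # central finite differences for both partial derivatives
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dfdx, dfdy

gx, gy = grad(0.0, 0.0)
x, y, lr = 0.0, 0.0, 0.1
x, y = x - lr * gx, y - lr * gy   # gradient step: no movement at the saddle
print(gx, gy, x, y)
```

Both gradient components are 0, so the weight update is zero: the model is stuck and no learning occurs.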

◼ With Backpropagation we calculate the LOSS, and by using SGD we
update the W’s; then we get good results.
◼ If the model is stuck at some point, we call it a SADDLE Point, and the
model does not get further tuned up (i.e., it does not get new values for
the W’s); as a result the model is not learning from the data.

◼ This situation is called the saddle point problem.

◼ Here we need to add the Learning Rate.

◼ The learning rate is a tuning parameter in an optimization algorithm that determines
the step size at each iteration while moving toward a minimum of a loss function.

◼ The Learning Rate is typically between 0 and 1.
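The step-size role of the learning rate can be sketched on the convex example from earlier, y = x² (derivative 2x): each update moves x toward the global minimum at 0, and the learning rate scales how far each step goes.

```python
def step(x, lr):
    # gradient descent update: x_new = x - lr * dy/dx, with dy/dx = 2x
    return x - lr * (2 * x)

x = 4.0
for _ in range(50):
    x = step(x, lr=0.1)   # each step shrinks x by a factor (1 - 2*lr) = 0.8
print(x)  # very close to 0, the global minimum
```

A larger learning rate takes bigger steps (and can overshoot); a smaller one converges more slowly.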

March 28, 2022

◼ Let’s understand this problem related to SGD with an
example using Linear Regression data points.

◼ With initial randomly assigned values of the W’s, we get line
(1); after the 1st iteration and updating the W’s values,
we get line (2), and so on.

◼ To update the W’s values, we have the option to use the
Gradient Descent, Stochastic Gradient Descent and
Batch Gradient Descent algorithms.

◼ The W’s values are updated using all data points (Gradient Descent).

Important Point:

◼ Based on all points (data points), the W’s are updated using the GD algorithm.

◼ Advantages of GD:
◼ Fast Convergence: It has fast convergence towards the minimal point.

◼ Disadvantages of GD:
◼ Computationally Costly: As it considers all data points for the derivative and then
sums all derivative values, it is computationally costly.
◼ SGD: Takes each point individually and updates the W’s values.

◼ BGD: Takes a batch/group of data points and updates the W’s values.

◼ Let’s take batch size 4; in total we have 4 sets (batches).

◼ With initial values, we get line (1).

◼ With the 1st batch (red points), we get line (2).

◼ With the 2nd batch (blue points), we get line (3).

◼ And so on… finally we get full convergence (the minimal point).

◼ The Linear Regression line takes different directions, but finally it reaches
convergence. Let’s understand this via contour lines.

◼ Advantages of SGD/BGD:
◼ Computationally Easy: As it considers fewer data points for the derivative
and then sums the derivative values, it is computationally easy.

◼ Disadvantages of SGD/BGD:
◼ Slow Convergence: It has slow convergence towards the minimal point.
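The batch-by-batch updates described above can be sketched in a few lines. Below is a minimal mini-batch gradient descent sketch for 1-D linear regression (y = w·x + b); the toy dataset (generated from w = 2, b = 1) and the hyperparameters are assumptions for illustration.

```python
import random

random.seed(0)
data = [(x, 2 * x + 1) for x in [i / 10 for i in range(40)]]  # toy data: y = 2x + 1

w, b, lr, batch_size = 0.0, 0.0, 0.05, 4
for epoch in range(200):
    random.shuffle(data)
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # gradients of mean squared error over this batch only
        gw = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
        gb = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
        w, b = w - lr * gw, b - lr * gb   # one W update per batch

print(w, b)  # close to the true values 2 and 1
```

Each batch gives a cheap, slightly noisy update, which is why the fitted line wanders in different directions before converging.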

◼ If we take a big NN, with multiple layers and multiple neurons, how
many W’s will there be?

◼ To obtain the minimum loss we need to keep updating the W’s values until
convergence. This will take a lot of time.
◼ If we use GD, it has fast convergence, but is computationally costly.
◼ If we use SGD/BGD, it has SLOW convergence, but is computationally easy.

What’s the solution?

◼ We need to follow SGD or BGD, but we also need to improve the convergence
speed.

How to improve the Convergence Speed: we need to adopt Optimizers.

◼ There are various optimizers that can support us in getting optimal outcomes:

◼ AdaGrad & AdaDelta optimizers

◼ Adam, etc.

Documentation [Link]

Example [Link]
accelerate-learning-of-deep-neural-networks-with-
batch-normalization/


◼ Before discussing Batch Normalization, let’s take an example with a Cat /
Non-Cat dataset.

◼ If we plot the distribution of the image data, it will look like the figure.

◼ The Test Dataset distribution will sit a
little bit to the right side, because its
images have different colors
representing cats.

◼ Apply Batch Normalization to each layer to normalize its activations,
i.e., add a sub-layer within the hidden layer named
BatchNormalization().
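What that sub-layer computes in the forward pass can be sketched in plain Python (an assumed simplified version for one feature over a batch; the learnable scale gamma and shift beta are left at their initial values 1 and 0): normalize over the batch, then scale and shift.

```python
def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    # normalize one feature across the batch, then apply learnable scale/shift
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / (var + eps) ** 0.5 + beta for x in batch]

activations = [90.0, 83.0, 42.0, 24.0, 2.0, 4.0]  # assumed example activations
normed = batch_norm(activations)
print(normed)  # roughly zero mean and unit variance
```

After normalization the activations of every layer have a stable distribution, which is what lets the network train faster.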
Documentation [Link]

Example [Link]
regularization-deep-learning-models-keras/

◼ Dropout is a technique to avoid overfitting.

◼ But before going to Dropout, let’s discuss the Ensemble technique, and then we
will return to the Neural Network.

◼ The Random Forest algorithm uses the Ensemble technique.

▪ The process here is:

▪ Select a sample of data points and construct
a tree.
▪ Select another sample of data points and
construct another tree.
▪ And so on…
▪ Until we construct ‘m’ decision trees.

◼ Q: Why do we use the Ensemble technique?
◼ Answer: To avoid overfitting.

◼ Q: Can we adopt the Ensemble technique for a NN?
◼ Answer: We cannot adopt the Ensemble technique for a NN because, if we try it
with NNs, it will take too much time to train the models.

◼ If we want to construct N neural networks and then train them with different random
samples, it is almost impossible, because it will take too much time and computation.

◼ Also, during testing, the data would have to be provided to all networks to maintain
the ensemble process.

◼ Q: What is the Dropout technique?

◼ Answer: This technique implements a similar idea to the Ensemble technique
but reduces the training/testing computation and time.

◼ In dropout, we assign a probability to each layer, including the input layer
(here, say 0.8); this value is not fixed (you may assign any probability value).

▪ Q: What does this probability mean here?

▪ Answer:

▪ The probability 0.8 means that any neuron in that layer has an 80% chance of being kept in
the current iteration, while there is a 20% chance of dropping that neuron.

▪ This is done for each neuron separately in each layer. The same process applies to
the Input Layer as well, to decide about input participation in each iteration.

▪ For the 2nd iteration, we will have a different set of neurons to keep and drop:
based on the probability values we randomly select neurons.

▪ The 3rd iteration will have yet another set of neurons.
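The per-neuron keep/drop decision can be sketched as below (a minimal "inverted dropout" version, the variant commonly used in practice; the layer of 10,000 unit activations is an assumed example): each neuron is kept with probability p = 0.8, and surviving activations are scaled by 1/p so the expected activation is unchanged.

```python
import random

def dropout(activations, p=0.8, rng=random):
    out = []
    for a in activations:
        if rng.random() < p:       # keep this neuron (80% chance)
            out.append(a / p)      # scale so the expected value stays the same
        else:
            out.append(0.0)        # drop this neuron for this iteration
    return out

random.seed(42)
layer = [1.0] * 10000
dropped = dropout(layer)
kept = sum(1 for a in dropped if a != 0.0)
print(kept / len(dropped))   # close to the keep probability 0.8
```

Calling `dropout` again with a fresh random state keeps a different subset, which is exactly the "different thinned network per iteration" described above.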

◼ Q: What happens with this dropping and
selecting of neurons?

◼ Answer: By dropping some of the neurons,
we form a Thinned network in the 1st
iteration, while in the 2nd iteration we
form a different Thinned network, and so
on.

◼ Q: How many thinned networks are possible here?

◼ Answer: In a neural network with N nodes, 2^N
Thinned networks are possible.

◼ In a normal DNN, each neuron depends entirely on the inputs received from
the previous neurons; it then processes the data and provides an output
which is subsequently input to other neurons. Through this co-adaptation
process, the model learns a lot and becomes overly specific or sensitive
to the data, and hence we get the Overfitting issue.

◼ In a DNN with Dropout, the participation of a neuron is based on the
probability value, and we drop some of the neurons for that
specific iteration, meaning we restrict those neurons from participating in
updating their dependent weights.

◼ Hence some of the neurons are not co-adapting: we break
the co-adaptation and make the model more generalized.

Documentation [Link]

Example



Input image (7×6):                  Weight matrix / Filter / Kernel (1×2):

90 83 42 24  2  4                   1  0.3
54 87 20 55  8 91
97 15 34 64 53 30
 3 29 78 53 53 55
72 27 54 68  9 64
14 25 31 40 71 93
54 46 16 76 67 24

Output (first 6 rows shown, last column padded from the input):

114.9  95.6 49.2 24.6  3.2  4
 80.1  93   36.5 57.4 35.3 91
101.5  25.2 53.2 79.9 62   30
 11.7  52.4 93.9 68.9 69.5 55
 80.1  43.2 74.4 70.7 28.2 64
 21.5  34.3 43   61.3 98.9 93

Example computation (top-left output): 90*1 + 83*0.3 = 114.9
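Sliding the 1×2 filter [1, 0.3] across a row reproduces the output values above. A minimal sketch of this 1-D convolution (valid mode, no padding):

```python
def conv1d(row, kernel):
    # slide the kernel over the row; each output is a weighted sum
    k = len(kernel)
    return [sum(row[i + j] * kernel[j] for j in range(k))
            for i in range(len(row) - k + 1)]

row = [90, 83, 42, 24, 2, 4]        # first row of the input above
out = conv1d(row, [1, 0.3])
print([round(v, 1) for v in out])   # [114.9, 95.6, 49.2, 24.6, 3.2]
```

The first value is exactly the worked example from the slide: 90*1 + 83*0.3 = 114.9.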

Conv1D filter Conv2D filter



Input image (6×6):        Filter (3×3):       Output (4×4):

50 50 50 0 0 0            1 0 -1              0 150 150 0
50 50 50 0 0 0            1 0 -1              0 150 150 0
50 50 50 0 0 0     x      1 0 -1        =     0 150 150 0
50 50 50 0 0 0                                0 150 150 0
50 50 50 0 0 0
50 50 50 0 0 0

Through this filter we identify Vertical Edges.

For Horizontal Edges:      1  1  1
                           0  0  0
                          -1 -1 -1
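The vertical-edge example above can be computed directly. A minimal 2-D convolution sketch (valid mode): sliding the 3×3 filter with columns [1, 0, −1] over the 6×6 image produces strong responses (150) exactly where the vertical edge lies.

```python
def conv2d(img, ker):
    # valid 2-D convolution (cross-correlation form, as used in CNNs)
    kh, kw = len(ker), len(ker[0])
    oh, ow = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(img[i + a][j + b] * ker[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(ow)] for i in range(oh)]

image = [[50, 50, 50, 0, 0, 0]] * 6   # left half bright, right half dark
vertical = [[1, 0, -1]] * 3           # vertical-edge filter from the slide
out = conv2d(image, vertical)
print(out[0])  # [0, 150, 150, 0]
```

Transposing the filter ([[1, 1, 1], [0, 0, 0], [-1, -1, -1]]) gives the horizontal-edge detector instead.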

Various kinds of filters
to detect edges

First Layer: Image → [W1 W2 W3; W4 W5 W6; W7 W8 W9] → New Image

The W’s are identified (learned) by the CNN.

First layer filters (this is one layer; F’s = W’s):
F1  Horizontal
F2  Vertical
F3  45°
F4  …
Fn  etc.

Second Layer: the 2nd layer gets its input from the 1st layer.
How Max Pooling Works:

It divides each feature map into non-overlapping rectangular regions
(usually 2x2 or 3x3).
For each region, it retains only the maximum value and discards the rest.
The output feature map has reduced spatial dimensions (the depth, i.e. the
number of channels, is unchanged), achieved by downsampling and
retaining the maximum values from local neighborhoods.
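The steps above can be sketched as 2×2 max pooling with stride 2 (the 4×4 feature map below is an assumed example, taken from the earlier input values): each non-overlapping 2×2 region is replaced by its maximum, halving each spatial dimension.

```python
def max_pool2x2(fmap):
    # stride-2, 2x2 max pooling over non-overlapping regions
    return [[max(fmap[i][j], fmap[i][j + 1],
                 fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]), 2)]
            for i in range(0, len(fmap), 2)]

fmap = [[90, 83, 42, 24],
        [54, 87, 20, 55],
        [97, 15, 34, 64],
        [ 3, 29, 78, 53]]
print(max_pool2x2(fmap))  # [[90, 55], [97, 78]]
```

The 4×4 map becomes 2×2: spatial size is halved while the strongest responses survive.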

◼ This pooling is often referred to as GAP, for Global Average Pooling.

◼ Output:

◼ The output is a 1-dimensional tensor of size (input channels),
since each entire feature map is averaged to a single value.

◼ Global Max Pooling is applied similarly to reduce
spatial dimensions before the fully
connected layers.
Documentation [Link]
[Link]
Example cb0883dd6529#:~:text=Examples%20of%20CNN%20in%20computer,i.e%2
C%20weights%2C%20biases%20etc.

◼ A CNN is a combination of two components: a feature extractor module


followed by a trainable classifier.

◼ The first component includes a stack of convolution, activation, and pooling layers.

◼ A DNN does the classification. Each neuron in a layer is connected to those in the next
layer.


◼ Today, the overall architecture of CNNs is already streamlined.

◼ The final part of a CNN is very similar to feedforward neural networks, where
there are fully connected layers of neurons with weights and biases.

◼ Just like in feedforward neural networks, there is a loss function (e.g.,
cross-entropy, MSE), a number of activation functions, and an optimizer (e.g.,
SGD, the Adam optimizer) in CNNs. Additionally, in CNNs there are also
Convolutional layers, Pooling layers, and Flatten layers.
Why not a Feedforward Network?

◼ This is our image; the goal of our network will be to determine whether this
image is a cat or not.

◼ A dense layer will consider the ENTIRE image. It will look at all the pixels
and use that information to generate some output.

◼ The convolutional layer will look at specific parts of the image. In
this example, let’s say it analyzes the highlighted parts below and detects patterns
there.
Why not a Feedforward Network?

◼ A densely connected network has only recognized patterns globally: it will look
where it thinks the eyes should be present. Clearly it does not find them there
and therefore would likely determine this image is not a dog, even though the
pattern of the eyes is present, just in a different location.

Let’s say it’s determined that an image     Now let’s flip the image.
is likely to be a dog if an eye is present
in the boxed-off locations of the image
above.

◼ Since convolutional layers learn and detect patterns from different areas of
the image, they don’t have problems with the location of the object.

◼ They know what an eye looks like and by analyzing different parts of the
image can find where it is present.


◼ What is Data Augmentation?

◼ A regularization technique whereby the dataset is expanded by the creation of
artificial variations such as zooming, rotation, shifting, salt-and-pepper noise, blur,
etc.

◼ The augmentation approach chosen is often more important than the type of
network architecture used.

◼ Simple data augmentation like rotation, flips,
distortion, obstruction, etc. has proven to improve
generalization performance: adding more variations of the same image makes it
difficult for the model to memorize the training data, i.e., to
OVERFIT.



◼ For loading the data set into memory we have two options:

1. You can load the whole data set into memory at once.

2. You can load a sample set of the data into memory.

◼ While training a neural network, instead of sending the entire dataset at once,
we divide the input into batches of samples.

Number of batches * Number of images in a single batch = Total number of data set

Example
◼ For an image data set containing 100,000 images, we can convert it into
batches where each batch has 32 images in it.
◼ Total images = 100000
◼ 1 batch = 32 images
◼ So, a total of 3125 batches (100000 / 32 = 3125).
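The arithmetic from the example above, written out: number of batches × batch size = dataset size.

```python
total_images = 100_000
batch_size = 32
num_batches = total_images // batch_size   # full batches per epoch
remainder = total_images % batch_size      # images left over in a final partial batch
print(num_batches, remainder)  # 3125 0
```

Here the dataset divides evenly, so there is no partial final batch.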

◼ A training dataset can be divided into one or more batches.

◼ Batch Gradient Descent: Batch Size = Size of Training Set
◼ Stochastic Gradient Descent: Batch Size = 1
◼ Mini-Batch Gradient Descent: 1 < Batch Size < Size of Training Set
◼ In the case of mini-batch gradient descent, popular batch sizes include 32, 64, and
128.

◼ If the dataset does not divide evenly by the batch size, it simply means that the
final batch has fewer samples than the other batches.

◼ Alternately, you can remove some samples from the dataset or change the
batch size such that the number of samples in the dataset does divide evenly
by the batch size.

◼ One Epoch is when the complete dataset is passed forward and backward through
the neural network exactly ONCE.
◼ Since one epoch is too big to feed to the computer at once, we divide it into several
smaller batches.

◼ The weights are updated once per batch.
◼ One epoch alone leads to underfitting of the curve.
◼ As the number of epochs increases, more weights are changed in the
neural network and the curve goes from underfitting to optimal to overfitting.

◼ The number of epochs is traditionally large, often hundreds or thousands, allowing the
learning algorithm to run until the error from the model has been sufficiently minimized.
◼ The number of epochs in the literature is set to 10, 100, 500, 1000, or larger.
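How epochs, batches and weight updates relate can be sketched as a training-loop skeleton (the epoch and batch counts below are assumptions taken from the earlier example): with B batches per epoch, the weights are updated B times per epoch, so E epochs give E × B updates in total.

```python
epochs, batches_per_epoch = 10, 3125
updates = 0
for epoch in range(epochs):
    for batch in range(batches_per_epoch):
        # forward pass, loss, backpropagation, weight update would happen here
        updates += 1   # one weight update per batch
print(updates)  # 31250
```

This is why "number of iterations" and "number of epochs" are different quantities: iterations count batches, epochs count full passes over the dataset.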
Documentation [Link]
[Link]
transfer-learning-with-real-world-applications-in-deep-learning-
212bf3b2f27a

[Link]
Example
[Link]

1. Train on ImageNet.

2. Small dataset: freeze the pretrained layers and
train only the final layers.

3. Medium dataset: more data = retrain
more of the network
(or all of it).

◼ What Is Transfer Learning?

◼ In transfer learning, the knowledge of an already trained machine
learning model is applied to a different but related problem.

◼ The general idea is to use the knowledge of a model
(that learned from a task with a lot of available labeled
training data) in a new task that doesn't have much
data.

◼ Instead of starting the learning process from scratch,
we start with patterns learned from solving a related
task.

◼ Neural networks dealing with images usually detect edges in the
earlier layers, shapes in the middle layers, and high-level, task-specific
features in the later layers.

◼ In transfer learning, the early and middle layers are reused, and we only
retrain the task-specific layers. This lets us leverage the labeled data of the
task the model was initially trained on.

◼ Task 1 and task 2 must have the same input format.

◼ Popular pretrained transfer learning models for TensorFlow are available in
Keras Applications (e.g., VGG, ResNet, Inception).

End of Lecture – 09