LeNet 5 Model and Backpropagation Insights

The document provides an overview of the LeNet 5 model, a pioneering convolutional neural network developed in the 1980s, detailing its architecture, including convolutional and fully connected layers, and implementation specifics. It discusses backpropagation techniques for training, including challenges related to numerical optimization and automatic differentiation. Additionally, it highlights the model's evaluation on the MNIST dataset and its successful application in postal code reading.


Outline

Statistical learning: decision modeling and deep learning
Neural Networks and Deep Learning

1 Case Study: LeNet 5 Model
2 Backpropagation: implementation issues

LeNet Implementation

Case Study: LeNet 5 Model

80's: 1st Convolutional Neural Networks

● LeNet 5 Model [LBD+89], trained using back-prop
● Input: 32x32 pixel image. Largest character is 20x20
● 2 successive blocks [Convolution + Sigmoid + Pooling (+ sigmoid)]
  Cx: Convolutional layer, Sx: Subsampling layer
● C5: convolution layer ∼ fully connected
● 2 Fully connected layers Fx

C1 Layer

● Convolutional layer with 6 5x5 filters ⇒ 6 feature maps of size 28x28 (no padding)
● # Parameters: 26 per filter (25 weights + bias) ⇒ (5 ∗ 5 + 1) ∗ 6 = 156
  If it were fully connected: (32 ∗ 32 + 1) ∗ (28 ∗ 28) ∗ 6 parameters ≈ 5 ⋅ 10⁶ !

S2 Layer

● Subsampling layer = pooling layer
● Pooling area: 2x2 in C1
● Pooling stride: 2 ⇒ 6 feature maps of size 14x14
● Pooling type: sum, multiplied by a trainable param + bias
  ⇒ 2 parameters per channel
● Total # Parameters: 2 ∗ 6 = 12

C3 Layer: Convolutional

● C3: 16 filters ⇒ 16 feature maps of size 10x10 (no padding)
● 5x5 filters connected to a subset of S2 maps
  ⇒ maps 0-5 connected to 3 S2 maps, 6-14 to 4, 15 connected to all 6
● # Parameters: (5 ∗ 5 ∗ 3 + 1) ∗ 6 + (5 ∗ 5 ∗ 4 + 1) ∗ 9 + (5 ∗ 5 ∗ 6 + 1) = 456 + 909 + 151 = 1516


S4 Layer

● Subsampling layer = pooling layer
● Pooling area: 2x2 in C3
● Pooling stride: 2 ⇒ 16 feature maps of size 5x5
● Pooling type: sum, multiplied by a trainable param + bias
  ⇒ 2 parameters per channel
● Total # Parameters: 2 ∗ 16 = 32

C5 Layer: Convolutional layer

● 120 5x5x16 filters ⇒ whole depth of S4 (≠ C3)
● Each map in S4 is 5x5 ⇒ single value for each C5 map
● C5: 120 feature maps of size 1x1 (vector of size 120)
  ⇒ spatial information lost, ∼ to a fully connected layer
● Total # Parameters: (5 ∗ 5 ∗ 16 + 1) ∗ 120 = 48120

F6 Layer: Fully Connected layer

● 84 fully connected units
● # Parameters: 84 ∗ (120 + 1) = 10164

F7 Layer (output): Fully Connected layer

● 10 (# classes) fully connected units
● # Parameters: 10 ∗ (84 + 1) = 850

● Total # parameters ∼ 60000
● Evaluation on MNIST:
  60,000 original images: test error: 0.95%
  540,000 artificial distortions + 60,000 original: test error: 0.8%
● Successful deployment for postal code reading in the US
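The per-layer counts above can be verified in a few lines; a minimal sketch (layer sizes taken from the slides, the dictionary is just for illustration):

```python
# Parameter counts for LeNet-5, layer by layer (figures from the slides above)
params = {
    "C1": (5 * 5 + 1) * 6,                                  # 6 5x5 filters + bias -> 156
    "S2": 2 * 6,                                            # scale + bias per map -> 12
    "C3": (5*5*3 + 1)*6 + (5*5*4 + 1)*9 + (5*5*6 + 1)*1,    # sparse connectivity -> 1516
    "S4": 2 * 16,                                           # -> 32
    "C5": (5 * 5 * 16 + 1) * 120,                           # -> 48120
    "F6": 84 * (120 + 1),                                   # -> 10164
    "F7": 10 * (84 + 1),                                    # -> 850
}
total = sum(params.values())
print(total)  # 60850, i.e. ~60000 parameters in total
```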

Outline

1 Case Study: LeNet 5 Model
2 Backpropagation: implementation issues

Error Back-Propagation with ConvNets

● Convolution: example for 1d scalar conv with mask w = [w1 w2 w3]ᵀ
● Shared weights: simple chain rule application
  ⇒ Sum gradients for every region y_k!

  ∂L/∂w = ∑_{k=1}^{N} (∂L/∂y_k) (∂y_k/∂w), with ∂y_k/∂w = [x_{k−1} x_k x_{k+1}]ᵀ
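The summed weight gradient above can be sketched for a tiny 1D convolution (a hypothetical toy example: `conv1d` and `grad_w` are illustrative names, and the window is indexed x_k..x_{k+2} here rather than x_{k−1}..x_{k+1}):

```python
import numpy as np

# 1D "valid" convolution with a 3-tap shared mask w.
def conv1d(x, w):
    N = len(x) - len(w) + 1
    return np.array([x[k:k + 3] @ w for k in range(N)])

def grad_w(x, dL_dy):
    # dL/dw = sum_k dL/dy_k * dy_k/dw, with dy_k/dw = [x_k, x_{k+1}, x_{k+2}]:
    # the shared mask accumulates a gradient contribution from every region.
    return sum(dL_dy[k] * x[k:k + 3] for k in range(len(dL_dy)))

x = np.array([1.0, 2.0, -1.0, 0.5, 3.0])
w = np.array([0.2, -0.1, 0.4])
y = conv1d(x, w)
dL_dy = np.ones_like(y)      # pretend L = sum(y), so dL/dy_k = 1
print(grad_w(x, dL_dy))      # [2.  1.5 2.5]
```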
Error Back-Propagation with ConvNets

● Pooling: example for a pooling area of size 2L + 1:
  y_k = f(x_{k−L}, ..., x_{k+L})

  ∂L/∂x_h = ∑_k (∂L/∂y_k) (∂y_k/∂x_h)

Error Back-Propagation for Average pooling

● Average pooling: y_k = (1/N) ∑_{h=k−L}^{k+L} x_h, with N = 2L + 1

  ∂y_k/∂x_h = 1/N ⇒ ∂L/∂x_h = ∑_k (∂L/∂y_k) ⋅ (1/N)

⇒ Gradient propagated through each input node (factor 1/N)
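A minimal sketch of the average-pooling rule, using the non-overlapping stride-n windows that LeNet's subsampling layers actually use (function names are illustrative):

```python
import numpy as np

# 1D average pooling over windows of size n with stride n, and its backward pass.
def avg_pool(x, n):
    return x.reshape(-1, n).mean(axis=1)

def avg_pool_backward(dL_dy, n):
    # Each input in a window receives 1/n of that window's upstream gradient.
    return np.repeat(dL_dy / n, n)

x = np.array([1.0, 3.0, 2.0, 6.0])
print(avg_pool(x, 2))                              # [2. 4.]
print(avg_pool_backward(np.array([1.0, 1.0]), 2))  # [0.5 0.5 0.5 0.5]
```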

Error Back-Propagation for Max pooling

● Max pooling: y_k = max_{h ∈ {k−L,...,k+L}} x_h

  ∂y_k/∂x_h = 1 if x_h = max_{h′ ∈ {k−L,...,k+L}} x_{h′}, 0 otherwise

⇒ Gradient propagated through the arg max input node only

Challenges for Training Deep Learning Models

● Training of deep ConvNets: gradient descent on the loss function L:

  w_{t+1} = w_t − η ∇L(w_t)

● Error-Backpropagation: the way to compute ∇L(w_t) in neural networks
  Analytical expression of the gradient is straightforward: chain rule
  BUT: efficient evaluation of the gradient can be very tricky
  ⇒ Efficient backprop implementation far from trivial

1 Numerical Optimization Issues
2 Automatic Differentiation
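The max-pooling rule can be sketched the same way (again non-overlapping windows, illustrative names): the upstream gradient is routed only to the arg max position of each window.

```python
import numpy as np

def max_pool(x, n):
    return x.reshape(-1, n).max(axis=1)

def max_pool_backward(x, dL_dy, n):
    # Route each window's upstream gradient to its arg max; zeros elsewhere.
    dL_dx = np.zeros_like(x)
    windows = x.reshape(-1, n)
    idx = windows.argmax(axis=1) + n * np.arange(windows.shape[0])
    dL_dx[idx] = dL_dy
    return dL_dx

x = np.array([1.0, 3.0, 6.0, 2.0])
print(max_pool(x, 2))                                 # [3. 6.]
print(max_pool_backward(x, np.array([1.0, 1.0]), 2))  # [0. 1. 1. 0.]
```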

Numerical Optimization Issues

● Accumulating approximations (rounding) can be problematic
● Underflow: x ≈ 0 rounded to x = 0: different behaviors
  Division by 0 ⇒ NaN
● Overflow: large positive or negative numbers ⇒ NaN
● Ex: softmax on x = {x1, x2, ..., xK}:

  SM(x)_i = e^{x_i} / ∑_{j=1}^K e^{x_j}

● What if x_i = C ∀i, and C → +∞? C → −∞?
  SM(x)_i = 1/K ∀i expected, but:
  C → −∞ ⇒ e^C → 0: division by 0, NaN! (underflow)
  C → +∞ ⇒ e^C → +∞: NaN! (overflow)
● Numerical stabilization for the denominator:
  z = x − max_i(x_i) ⇒ SM(z) = SM(x)
  max_i(e^{z_i}) = 1 ⇒ no overflow
  max_i(e^{z_i}) = 1 ⇒ no underflow in the denominator (at least one term equals 1)
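The C → +∞ failure and the max-shift fix can be reproduced directly; a minimal numpy sketch (function names are illustrative):

```python
import numpy as np

def softmax_naive(x):
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    # Shift by the max: mathematically identical, but max(e^z) = 1,
    # so the denominator can neither overflow nor underflow to 0.
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

x = np.array([1000.0, 1000.0, 1000.0])   # the C -> +inf case from the slide
print(softmax_naive(x))    # [nan nan nan] : exp overflows to inf, inf/inf = nan
print(softmax_stable(x))   # [0.333... 0.333... 0.333...] = 1/K as expected
```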


Numerical Optimization Issues

● SM(x)_i = e^{x_i} / ∑_{j=1}^K e^{x_j}, with z = x − max_i(x_i) ⇒ SM(z) = SM(x)
● If we compute log[SM(z)], e.g. for the cross-entropy loss?
  SM(z)_i can underflow to 0 ⇒ log[SM(z)_i] → −∞: NaN!
● Solution: stabilize the log directly: log[SM(x)_i] = x_i − log(∑_{j=1}^K e^{x_j})
  Same trick: z = x − max_i(x_i)
  ⇒ log[SM(z)_i] = log[SM(x)_i], with neither underflow nor overflow

Computing Derivatives

● Numerical approximation: f′(x) ≈ (f(x + h) − f(x)) / h?
  ⊖ Approximate, numerical issues (underflow/overflow)
● Symbolic differentiation?
  ⊖ Rapidly: huge expressions, many duplicated terms ⇒ lack of efficiency
● Automatic differentiation
  Interleave symbolic differentiation and simplification steps
  Symbolic differentiation at the elementary operation level
  Keep intermediate numerical results
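The stabilized log-softmax can likewise be sketched in a few lines (illustrative names; the C → −∞ case shows why taking the log of an already-computed softmax fails):

```python
import numpy as np

def log_softmax(x):
    # log SM(x)_i = x_i - log(sum_j e^{x_j}), computed on z = x - max(x):
    # the largest term inside the sum is e^0 = 1, so the log never sees 0.
    z = x - x.max()
    return z - np.log(np.exp(z).sum())

x = np.array([-1000.0, -1000.0, -1000.0])     # the C -> -inf case from the slide
naive = np.log(np.exp(x) / np.exp(x).sum())   # exp underflows: 0/0 = nan
print(naive)                                   # [nan nan nan]
print(log_softmax(x))                          # [-1.0986 -1.0986 -1.0986] = -log(3)
```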

Automatic Differentiation (AD): Computation Graph & backprop

● Computation graph: core abstraction for computing gradients with backprop
● Ex: f(x, y, z) = (x + y) ∗ z, i.e. q = x + y, f = q ∗ z
● Ultimate goal: compute numerical values ∂f/∂x, ∂f/∂y, ∂f/∂z
● x = −2, y = 5, z = −4, forward prop ⇒ q = 3, f = −12
● ∂f/∂f = 1, back-prop ⇒ ∂f/∂q, ∂f/∂z:
  f = q ∗ z ⇒ ∂f/∂q = z = −4, ∂f/∂z = q = 3
● back-prop ⇒ ∂f/∂x, ∂f/∂y:
  q = x + y ⇒ ∂f/∂x = (∂f/∂q)(∂q/∂x) = −4 ∗ 1 = −4, ∂f/∂y = (∂f/∂q)(∂q/∂y) = −4 ∗ 1 = −4

From Stanford course: [Link]
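The forward and backward passes on this toy graph can be written out by hand; a minimal sketch reproducing the slide's numbers:

```python
# Manual forward + backward pass on the toy graph f(x, y, z) = (x + y) * z.
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y            # q = 3
f = q * z            # f = -12

# Backward pass (reverse order, chain rule at each node)
df_df = 1.0
df_dq = z * df_df    # f = q*z : df/dq = z = -4
df_dz = q * df_df    # f = q*z : df/dz = q = 3
df_dx = df_dq * 1.0  # q = x+y : dq/dx = 1 -> -4
df_dy = df_dq * 1.0  # q = x+y : dq/dy = 1 -> -4

print(q, f, df_dx, df_dy, df_dz)  # 3.0 -12.0 -4.0 -4.0 3.0
```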

Computation Graph (CG) & backprop

● Automatic differentiation with CG: forward + backward pass
  Backward pass: recursively compute derivatives from top → bottom
  Dynamic programming, big speed-up wrt naive forward-mode differentiation
  Ex: forward-mode differentiation from x: ∂x/∂x = 1, ∂q/∂x = 1, ∂f/∂x = −4
  ⇒ N nodes: N forward passes vs 1 forward + 1 backward for backprop
● Symbolic differentiation only at the atomic operation level, e.g. binary arithmetic operators (multiplication, addition, subtraction, division, etc), exp, log, trigonometric functions
● Include blocks with known derivatives that are numerically well behaved
  Ex: sigmoid σ(x) = 1/(1 + e^{−x}), σ′(x) = σ(x)[1 − σ(x)]
  x = 1.0 ⇒ σ(x) = 0.73, σ′(x) = 0.2
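The sigmoid block above can be sketched as such an "atomic" node: a forward value plus a closed-form local derivative, instead of symbolically expanding 1/(1 + e^{−x}) into elementary operations:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Known closed form: sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

print(round(sigmoid(1.0), 2), round(sigmoid_grad(1.0), 2))  # 0.73 0.2
```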


Back-prop on Tensors

● Application for a simple neural network: h = f(x, W)
  ⇒ Backward: store numerical values for ∂h/∂W, ∂h/∂x
● Simply compute the derivative wrt the flattened tensor?
  BUT: direct chain rule application ⇒ big memory & computation issues!
  The problem appears already with the last linear layers

Back-prop on Tensors

● Example: batch training for the last linear layer
● Data matrix X (N × m), label matrices Ŷ, Y∗ (N × K)
● Cross-entropy loss: L_CE(W, b) = −(1/N) ∑_{i=1}^N log(ŷ_{c∗,i})
● Chain rule: ∂L_CE/∂W = (∂L_CE/∂S)(∂S/∂W)
● Ex: m = 4096, N = 100, K = 1000 (ImageNet)
  ∂L_CE/∂S = Ŷ − Y∗: size K ⋅ N = 100K params ≈ 800KB — small
  BUT ∂S/∂W: size (K ⋅ N) ⋅ (K ⋅ d) ≈ 1000 ⋅ 100 ⋅ 1000 ⋅ 4000 = 400G params ≈ 3.2TB — huge!!
● ∂L_CE/∂W: size K ⋅ d ≈ 4M params ≈ 32MB — OK
● ∂S/∂W: far too large to fit into memory, not explicitly computable; it is block-sparse, each row containing a single copy of one input x_i:

          [ x_1  0  ...  0  ]
          [  0  x_1 ...  0  ]
          [  0   0  ... x_1 ]
  ∂S/∂W = [ ... ... ... ... ]
          [ x_N  0  ...  0  ]
          [  0  x_N ...  0  ]
          [  0   0  ... x_N ]
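The sizes quoted above follow from simple arithmetic; a sketch assuming 8-byte floats, which matches the slide's 800KB / 32MB / 3.2TB figures:

```python
# Rough memory footprint of the tensors above, for
# d = m = 4096 inputs, N = 100 batch size, K = 1000 classes, float64.
d, N, K = 4096, 100, 1000
bytes_per = 8

dL_dS = K * N * bytes_per                # (Y_hat - Y*): ~0.8 MB, small
dL_dW = K * d * bytes_per                # final gradient: ~32 MB, fine
dS_dW = (K * N) * (K * d) * bytes_per    # full Jacobian: ~3.3 TB (!)
print(dL_dS / 1e6, dL_dW / 1e6, dS_dW / 1e12)  # MB, MB, TB
```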

Back-prop on Tensors

● ∂S/∂W is intractable, but ∂L_CE/∂W = (∂L_CE/∂S)(∂S/∂W) is tractable!
● For a single example x, with scores s_k = ∑_{j=1}^m w_jk x_j and δ = ∂L_CE/∂s = (δ_1, δ_2, ..., δ_K):
  ∂s/∂W = (∂s/∂w_1, ..., ∂s/∂w_K) is the huge block matrix above — never form it
● Solution: first project s on δ:

  s_p = δᵀ s = ∑_{k=1}^K δ_k s_k = ∑_{k=1}^K ∑_{j=1}^m δ_k w_jk x_j

  Compute the gradient of the scalar s_p ∈ R directly:

  ∂L_CE/∂W = ∂s_p/∂W = (δ_1 x, δ_2 x, ..., δ_K x) = x δᵀ

● Each layer should be able to compute ∂s_p/∂W = (∂s/∂W)ᵀ δ — tractable (≠ ∂s/∂W)
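This projection (vector-Jacobian product) trick can be checked at toy scale; a sketch with illustrative names — at real sizes the explicit Jacobian below would of course not fit in memory:

```python
import numpy as np

# Linear layer s_k = sum_j w_jk x_j: instead of the full Jacobian ds/dW,
# compute dL/dW = x delta^T directly, where delta = dL/ds.
m, K = 4, 3
rng = np.random.default_rng(0)
x = rng.standard_normal(m)
W = rng.standard_normal((m, K))
delta = rng.standard_normal(K)          # pretend upstream gradient dL/ds

dL_dW = np.outer(x, delta)              # x delta^T, shape (m, K) -- small

# Same result via the explicit Jacobian, only feasible at toy scale:
J = np.zeros((K, m * K))                # ds/dW, row k = ds_k / dvec(W)
for k in range(K):
    J[k, k::K] = x                      # s_k depends on column k of W only

explicit = (delta @ J).reshape(m, K)
print(np.allclose(dL_dW, explicit))     # True
```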


Computation on GPU

● Training deep ConvNets: huge speed-up with Graphical Processing Units (GPU), more next week
  Especially for convolution
● Solutions for GPU programming:
  CUDA (NVIDIA): C code
  Higher-level APIs: cuBLAS, cuFFT, cuDNN, etc
  OpenCL: ∼ CUDA (not limited to NVIDIA): generally slower

data from [Link] benchmarks

Computation on GPU

● Big data: transfer from disk to memory ⇒ bottleneck
● Solution: use SSD instead of HDD

Deep Learning resources for the community

Deep learning software / libraries
● Main features:
  Transparent use of computational graphs and auto-differentiation (autograd)
  Transparent GPU computing, no low-level programming
● Libraries made available to the community:
  MatConvNet (Oxford): easy
  Caffe (UC Berkeley) / Caffe2 (Facebook): script ⇒ good for production
  Torch (NYU, Facebook) / PyTorch (Facebook), Theano (U Montreal), TensorFlow (Google) ⇒ good for research
● Nowadays, the main maintained libraries are:
  1 TensorFlow (Google)
  2 PyTorch (Facebook)

Deep Learning resources: Tensorflow & Keras

● Do not re-invent the wheel: use them!
● Library used in this course: Tensorflow & Keras
  Keras: python wrapper on top of Tensorflow / Theano (F. Chollet)
  Now fully included in Tensorflow
  Install (with pip): pip install keras
  [Link]

Keras: Logistic Regression for Classification

● Simple example: Logistic Regression (LR)
● x_i vector, s_i = x_i W + b
● Soft-Max (SM): ŷ_k ∼ P(k|x_i, W, b) = e^{s_k} / ∑_{k′=1}^K e^{s_{k′}}
● Training with cross-entropy: (1/N) ∑_{i=1}^N ℓ_CE(ŷ_i, y_i)
● Example on the MNIST database: K = 10 classes
● Input: flattened images: d = 28² = 784
● # parameters: 784 ∗ 10 + 10 = 7850


Keras: Logistic Regression

● Load MNIST data:

from tensorflow import keras
from tensorflow.keras.datasets import mnist

# MNIST data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# Some pre-processing
X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
# convert class vectors to binary class matrices
Y_train = keras.utils.to_categorical(y_train, 10)
Y_test = keras.utils.to_categorical(y_test, 10)

● Keras: use the class Sequential for (chain) feedforward networks
● Define an (empty) feedforward network:

from tensorflow.keras.models import Sequential
model = Sequential()

● Add a fully connected layer (size 10) + softmax activation:

from tensorflow.keras.layers import Dense, Activation
model.add(Dense(10, input_dim=784, name='fc1'))
model.add(Activation('softmax'))

Keras: Logistic Regression

● Visualize the (text) architecture:

model.summary()

● Compile the model with cross-entropy loss and SGD optimizer:

from tensorflow.keras.optimizers import SGD
learning_rate = 0.5
model.compile(loss='categorical_crossentropy',
              optimizer=SGD(learning_rate=learning_rate),
              metrics=['accuracy'])

● Optimize model parameters to fit the training data (e.g. MNIST):

# Fit model to data (Y_train: one-hot targets for categorical_crossentropy)
model.fit(X_train, Y_train, batch_size=128, epochs=20, verbose=1)

Keras: Logistic Regression

● Evaluate performance on the test set:

scores = model.evaluate(X_test, Y_test, verbose=0)
print("%s TEST: %.2f%%" % (model.metrics_names[0], scores[0]*100))
print("%s TEST: %.2f%%" % (model.metrics_names[1], scores[1]*100))

Keras: more complex models

● Design more complex models by adding layers:
  Fully connected, convolution, non-linearity, pooling, etc
  Class for convolution on images: Conv2D
● Code for training remains unchanged (back-prop does the job)

from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

nb_classes = 10
kernel = (5, 5)
input_shape = (28, 28, 1)  # note: 'is' is a reserved word in Python
model = Sequential()
model.add(Conv2D(16, kernel_size=kernel, activation='relu',
                 input_shape=input_shape, padding='valid'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(32, (5, 5), activation='relu', padding='valid'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(100, activation='sigmoid'))
model.add(Dense(nb_classes, activation='softmax'))


Keras: more complex models

● Many other layers implemented, e.g. locally-connected, dropout, normalization
● Various optimizers implemented, e.g. Nesterov, Adam
● Various loss functions, e.g. hinge loss, regression, Kullback-Leibler divergence
● Various datasets, e.g. CIFAR

Keras: GPU

● Automatic detection and selection of existing resources
● Several GPU cards ⇒ use them all
● Ex for selecting a specific card:
  CUDA_VISIBLE_DEVICES=0 python my_deep_script.py

Keras: Conclusion

● Intuitive library: work with layers
  Does not expose the computation graph
● Very quickly train a model and evaluate performance
● Use all Tensorflow backend features (e.g. tensorboard)
● Try it: [Link]

References

Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel, Backpropagation applied to handwritten zip code recognition, Neural Computation 1 (1989), no. 4, 541–551.
