LeNet 5 Model and Backpropagation Insights
S4 Layer
C5 Layer: Convolutional layer
● Evaluation on MNIST
● Total # parameters ∼ 60,000
● 60,000 original training examples: test error 0.95%
● 540,000 artificial distortions + 60,000 original: test error 0.8%
LeNet Implementation
Outline
Error Back-Propagation in ConvNets
● Weights w are shared across positions: y_k = w^T [x_{k−1} x_k x_{k+1}]^T
● The chain rule sums the contributions of all positions that use w:
  ∂L/∂w = Σ_{k=1}^{N} (∂L/∂y_k)(∂y_k/∂w), with ∂y_k/∂w = [x_{k−1} x_k x_{k+1}]^T
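As a sketch (not from the slides), the shared-weight chain rule above can be checked numerically with NumPy; the loss L = ½‖y‖², the input length, and the 3-tap kernel are illustrative assumptions:

```python
import numpy as np

# Sketch: 1D convolution with a shared 3-tap kernel w, where
# y_k = w . [x_{k-1}, x_k, x_{k+1}] and (assumed) loss L = 0.5 * sum(y_k^2).
rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w = rng.standard_normal(3)

def forward(w, x):
    # valid positions k = 1 .. len(x)-2
    return np.array([w @ x[k-1:k+2] for k in range(1, len(x) - 1)])

y = forward(w, x)
dL_dy = y                                # dL/dy_k for L = 0.5 * sum(y_k^2)

# Backprop: accumulate dL/dw over all positions sharing the same w
dL_dw = np.zeros_like(w)
for k in range(1, len(x) - 1):
    dL_dw += dL_dy[k - 1] * x[k-1:k+2]   # dy_k/dw = [x_{k-1}, x_k, x_{k+1}]

# Central-difference check of the accumulated gradient
eps = 1e-6
num = np.array([
    (0.5 * np.sum(forward(w + eps * np.eye(3)[i], x) ** 2)
     - 0.5 * np.sum(forward(w - eps * np.eye(3)[i], x) ** 2)) / (2 * eps)
    for i in range(3)
])
print(np.allclose(dL_dw, num, atol=1e-5))   # True
```

The per-position gradients are summed precisely because the same w appears in every y_k.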
LeNet Implementation
● Back-propagating to the inputs: each x_h feeds several outputs y_k, so
  ∂L/∂x_h = Σ_k (∂L/∂y_k)(∂y_k/∂x_h)
Error Back-Propagation for Max Pooling
● Pooling output: y_k = max_{h′ ∈ {k−L,...,k+L}} x_{h′}
● Gradient with respect to the inputs:
  ∂L/∂x_h = ∂L/∂y_k if x_h = max_{h′ ∈ {k−L,...,k+L}} x_{h′}, 0 otherwise
⇒ Gradient propagated through the arg max input node only

Challenges for Training Deep Learning Models
● Training of deep ConvNets: gradient descent on the loss function L:
  w^{t+1} = w^t − η ∇L(w^t)
● Error-Backpropagation: way to compute ∇L(w^t) in neural networks
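A minimal NumPy sketch of this routing rule, assuming non-overlapping 1D windows of size P (the function names and shapes are illustrative):

```python
import numpy as np

# Sketch: 1D max pooling over non-overlapping windows of size P, and its
# backward pass. The upstream gradient dL/dy is routed entirely to the
# arg max input of each window; every other input receives 0.
def maxpool_forward(x, P):
    xw = x.reshape(-1, P)            # split into windows (len(x) % P == 0)
    idx = xw.argmax(axis=1)          # arg max position inside each window
    return xw.max(axis=1), idx

def maxpool_backward(dL_dy, idx, n, P):
    dL_dx = np.zeros(n)
    dL_dx[np.arange(len(idx)) * P + idx] = dL_dy   # route to arg max only
    return dL_dx

x = np.array([1.0, 3.0, 2.0, 0.5, -1.0, 4.0])
y, idx = maxpool_forward(x, 2)       # y = [3., 2., 4.]
dL_dx = maxpool_backward(np.array([10.0, 20.0, 30.0]), idx, len(x), 2)
print(dL_dx)                         # [ 0. 10. 20.  0.  0. 30.]
```

Note how the zeros in dL_dx implement the "0 otherwise" branch of the formula above.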
The analytical expression of the gradient is straightforward (chain rule),
BUT the efficient evaluation of the gradient can be very tricky.
● Numerical issues with the softmax SM(x)_j = e^{x_j} / Σ_{k=1}^K e^{x_k}:
  What if a component x_j → +∞ (overflow)? x_j → −∞ (underflow)?
  Solution: compute SM(z) with z = x − max_i(x_i) ⇒ SM(z)_i = SM(x)_i
● If we compute log[SM(z)], e.g. for the cross-entropy loss:
  SM(z)_i = 0 (underflow) ⇒ log[SM(z)_i] → −∞: NaN!
  Same solution: z = x − max_i(x_i) ⇒ log[SM(z)_i] = log[SM(x)_i]
● Symbolic differentiation?
  ⊖ Rapidly leads to huge expressions with many duplicated terms ⇒ lack of efficiency
● Automatic differentiation:
  Interleave symbolic differentiation and simplification steps
  Symbolic differentiation at the elementary operation level
  Keep intermediate numerical results
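A short sketch of the max-shift trick in NumPy (the input values are illustrative; the point is that a naive exp would overflow where these versions stay finite):

```python
import numpy as np

# Sketch: z = x - max_i(x_i) leaves the softmax unchanged but keeps exp()
# in a safe range; log-softmax via log-sum-exp avoids log(0) = -inf.
def softmax(x):
    z = x - np.max(x)                      # SM(z) == SM(x)
    e = np.exp(z)
    return e / e.sum()

def log_softmax(x):
    z = x - np.max(x)
    return z - np.log(np.sum(np.exp(z)))   # log-sum-exp, no -inf / NaN

x = np.array([1000.0, 0.0, -1000.0])       # naive exp(1000) overflows to inf
print(softmax(x))                          # [1. 0. 0.]
print(log_softmax(x))                      # finite values, no NaN
```

Without the shift, exp(1000) returns inf and the subsequent division and log produce NaN, exactly the failure mode described above.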
From the Stanford course: [Link]
Computation Graph (CG) & backprop
● Automatic differentiation with CG: forward pass + backward pass
  Backward pass: recursively compute derivatives from top → bottom
  Dynamic programming: big speed-up wrt naive forward-mode differentiation
  Ex: forward-mode differentiation from x: ∂x/∂x = 1, ∂q/∂x = 1, ∂f/∂x = −4
● Symbolic differentiation only at the atomic operation level, e.g. binary
  arithmetic operators (multiplication, addition, subtraction, division, etc.),
  exp, log, trigonometric functions
● Include blocks with a known derivative which is numerically well behaved
  Ex: sigmoid σ(x) = 1/(1 + e^{−x}), σ′(x) = σ(x)[1 − σ(x)]
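The sigmoid example can be sketched as one such atomic block; the class layout below is an assumption, not the API of any particular framework:

```python
import numpy as np

# Sketch: the sigmoid as a single CG node with a known, numerically
# well-behaved derivative sigma'(x) = sigma(x) * (1 - sigma(x)),
# instead of differentiating through exp and division symbolically.
class Sigmoid:
    def forward(self, x):
        self.s = 1.0 / (1.0 + np.exp(-x))   # cache forward value
        return self.s

    def backward(self, dL_dy):
        # reuse the cached value: no re-differentiation, no extra exp()
        return dL_dy * self.s * (1.0 - self.s)

sig = Sigmoid()
y = sig.forward(np.array([0.0]))
print(y)                                    # [0.5]
print(sig.backward(np.array([1.0])))        # [0.25] = 0.5 * (1 - 0.5)
```

Caching the forward output is the "keep intermediate numerical results" idea: the backward pass is one multiply instead of a fresh symbolic expression.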
Back-propagation for a fully connected layer (s = W x, W ∈ R^{K×m})
● ∂s/∂W: far too large to fit into memory, not explicitly computable
● However, computing ∂L_CE/∂W = (∂L_CE/∂s)(∂s/∂W) is tractable!
● Solution: first project s on δ = ∂L_CE/∂s:
  s_p = δ^T s = Σ_{k=1}^K δ_k s_k = Σ_{k=1}^K Σ_{j=1}^m δ_k w_{kj} x_j
● For a single example, ∂s/∂W = (∂s/∂w_1, ..., ∂s/∂w_K) stacks copies of x^T
  block-diagonally:
  ∂s/∂W = [x^T 0 ... 0; 0 x^T ... 0; ...; 0 ... 0 x^T]
● Compute the gradient on s_p ∈ R:
  ∂L_CE/∂W = ∂s_p/∂W = [δ_1 δ_2 ... δ_K] ∂s/∂W = (δ_1 x^T, ..., δ_K x^T) = δ x^T
⇒ Each layer only needs the product (∂L_CE/∂s)(∂s/∂W), which is tractable
  (≠ forming ∂s/∂W explicitly)
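A NumPy sketch of this outer-product shortcut (the sizes K, m and the stand-in δ are illustrative assumptions; the check loss L = δ^T (W x) is chosen so its gradient in W is exactly δ x^T):

```python
import numpy as np

# Sketch: for s = W x the full Jacobian ds/dW is never formed; the gradient
# dL/dW is just the outer product delta @ x^T, where delta = dL/ds.
K, m = 4, 3
rng = np.random.default_rng(1)
W = rng.standard_normal((K, m))
x = rng.standard_normal(m)
delta = rng.standard_normal(K)       # stands in for dL/ds from the layer above

dL_dW = np.outer(delta, x)           # delta x^T, shape (K, m) == W.shape

# Central-difference check on L = delta^T (W x)
eps = 1e-6
num = np.zeros_like(W)
for k in range(K):
    for j in range(m):
        E = np.zeros_like(W); E[k, j] = eps
        num[k, j] = (delta @ ((W + E) @ x) - delta @ ((W - E) @ x)) / (2 * eps)
print(np.allclose(dL_dW, num, atol=1e-5))   # True
```

The memory cost is K·m for the gradient instead of K·(K·m) for the explicit Jacobian, which is what makes the layer-by-layer backward pass feasible.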
Deep Learning resources: Tensorflow & Keras

Keras: Logistic Regression for Classification
● Softmax output: SM(s)_k = e^{s_k} / Σ_{k′=1}^K e^{s_{k′}}
● Evaluate performance on the test set:

scores = model.evaluate(X_test, Y_test, verbose=0)
print("%s TEST: %.2f%%" % (model.metrics_names[0], scores[0]*100))
print("%s TEST: %.2f%%" % (model.metrics_names[1], scores[1]*100))

● Design more complex models by adding layers:
  Fully connected, convolution, non-linearity, pooling, etc.
  Class for convolution on images: Conv2D
● Code for training remains unchanged (back-prop does the job)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

nb_classes = 10
ksize = (5, 5)
ishape = (28, 28, 1)   # renamed: 'is' is a reserved word in Python

model = Sequential()
model.add(Conv2D(16, kernel_size=ksize, activation='relu', input_shape=ishape, padding='valid'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(32, (5, 5), activation='relu', padding='valid'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(100, activation='sigmoid'))
model.add(Dense(nb_classes, activation='softmax'))