Module 3 DL

The document provides an overview of autoencoders, a type of artificial neural network used for unsupervised learning and dimensionality reduction. It covers various types of autoencoders, including linear, undercomplete, and regularized versions, as well as their architectures, training methods, and applications such as image compression and denoising. Key concepts include the encoder-decoder structure, bottleneck representation, and the importance of hyperparameters in training autoencoders.


Deep Learning (AIC701)

Module 3-Autoencoder

Devanand K. Bathe
Asst. Professor
3.1:
• Introduction,
• Linear Autoencoder,
• Undercomplete Autoencoder,
• Overcomplete Autoencoders,
• Regularization in Autoencoders

3.2:
• Denoising Autoencoders,
• Sparse Autoencoders,
• Contractive Autoencoders.

3.3:
• Application of Autoencoders: Image Compression
Introduction:
• An autoencoder is a type of artificial neural network used to learn
efficient data codings in an unsupervised manner.
• The goal of an autoencoder is to-
• learn a representation for a set of data, usually for dimensionality reduction
by training the network to ignore signal noise.
• Along with the reduction side, a reconstructing side is also learned, where the
autoencoder tries to generate from the reduced encoding a representation as
close as possible to its original input.
• This helps autoencoders to learn important features present in the
data.
• When a representation allows a good reconstruction of its input then
it has retained much of the information present in the input.
• Recently, the autoencoder concept has become more widely used for
learning generative models of data.
• Autoencoders are a specific type of feedforward neural networks
where the input is the same as the output.
• They compress the input into a lower-dimensional code and then
reconstruct the output from this representation.
• The code is a compact “summary” or “compression” of the input, also
called the latent-space representation.
• An autoencoder consists of 3 components: encoder, code and
decoder. The encoder compresses the input and produces the code,
the decoder then reconstructs the input only using this code.
• Autoencoders are mainly a dimensionality reduction (or compression)
algorithm with a couple of important properties:
• Data-specific: Autoencoders are only able to meaningfully compress data
similar to what they have been trained on. Since they learn features specific
for the given training data, they are different than a standard data
compression algorithm like gzip. So we can’t expect an autoencoder trained
on handwritten digits to compress landscape photos.
• Lossy: The output of the autoencoder will not be exactly the same as the
input, it will be a close but degraded representation. If you want lossless
compression they are not the way to go.
• Unsupervised: To train an autoencoder we don’t need to do anything fancy,
just throw the raw input data at it. Autoencoders are considered an
unsupervised learning technique since they don’t need explicit labels to
train on. But to be more precise they are self-supervised because they
generate their own labels from the training data.
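The encoder → code → decoder pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration only: the layer sizes, the tanh activation, and the random weights are assumptions for the sketch, not values from the slides.

```python
import numpy as np

# Minimal sketch of the encoder -> code -> decoder pipeline.
# Sizes and the tanh activation are illustrative assumptions.
rng = np.random.default_rng(0)

input_dim, code_dim = 8, 3            # bottleneck smaller than the input
W_enc = rng.normal(scale=0.1, size=(code_dim, input_dim))
W_dec = rng.normal(scale=0.1, size=(input_dim, code_dim))

def encode(x):
    return np.tanh(W_enc @ x)         # h = f(x): compress to the code

def decode(h):
    return W_dec @ h                  # x_hat = g(h): reconstruct

x = rng.normal(size=input_dim)
h = encode(x)
x_hat = decode(h)

print(h.shape)      # (3,)  -- the latent-space "summary"
print(x_hat.shape)  # (8,)  -- same shape as the input
```

In a real model the weights would be trained so that x_hat approximates x; here they are random, so only the shapes are meaningful.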
Architecture of autoencoder:
Let’s start with a quick overview of autoencoders’ architecture.
• Autoencoders consist of 3 parts:
1. Encoder: A module that compresses the input data into an encoded
representation, typically orders of magnitude smaller than the input data.
2. Bottleneck: A module that contains the compressed knowledge representation and is therefore the
most important part of the network.
3. Decoder: A module that helps the network "decompress" the knowledge representation and
reconstructs the data back from its encoded form. The output is then compared with the ground truth.
• The architecture as a whole: input → encoder → bottleneck (code) → decoder → reconstruction.
Encoder
• The encoder is a set of convolutional blocks followed by pooling modules that compress
the input to the model into a compact section called the bottleneck.
• The bottleneck is followed by the decoder that consists of a series of upsampling modules
to bring the compressed feature back into the form of an image.
• In case of simple autoencoders, the output is expected to be the same as the input data
with reduced noise.
• However, for variational autoencoders it is a completely new image, formed with
information the model has been provided as input.

Bottleneck
• The most important part of the neural network, and ironically the smallest one, is the
bottleneck.
• The bottleneck exists to restrict the flow of information from the encoder to the decoder,
thus allowing only the most vital information to pass through.
• Since the bottleneck is designed in such a way that the maximum information possessed
by an image is captured in it, we can say that the bottleneck helps us form a
knowledge-representation of the input.
• Thus, the encoder-decoder structure helps us extract the most from an image in the form
of data and establish useful correlations between various inputs within the network.
• A bottleneck as a compressed representation of the input further prevents the neural
network from memorising the input and overfitting on the data.
• As a rule of thumb, remember this: The smaller the bottleneck, the lower the risk of
overfitting.
• However, very small bottlenecks would restrict the amount of information storable, which
increases the chances of important information slipping out through the pooling layers of
the encoder.

Decoder
• Finally, the decoder is a set of upsampling and convolutional blocks that reconstructs the
output from the bottleneck's compressed representation.

• Since the input to the decoder is a compressed knowledge representation, the decoder
serves as a “decompressor” and builds back the image from its latent attributes.
• Properties and Hyperparameters
• Data-specific: Autoencoders are only able to compress data similar to
what they have been trained on.
• Lossy: The decompressed outputs will be degraded compared to the
original inputs.
• Learned automatically from examples: It is easy to train specialized
instances of the algorithm that will perform well on a specific type of
input.
• Hyperparameters of Autoencoders:
• There are 4 hyperparameters that we need to set before training an
autoencoder:
• Code size: It represents the number of nodes in the middle layer.
Smaller size results in more compression.
• Number of layers: The autoencoder can consist of as many layers as
we want.
• Number of nodes per layer: The number of nodes per layer decreases
with each subsequent layer of the encoder, and increases back in the
decoder. The decoder is symmetric to the encoder in terms of the
layer structure.
• Loss function: We either use mean squared error or binary
cross-entropy. If the input values are in the range [0, 1] then we
typically use cross-entropy, otherwise, we use the mean squared
error.
• Similar to a standard feedforward neural network, with a key difference:
• Unsupervised: no label at the output layer; the output layer simply tries to recreate the
input

• Defined by two (possibly nonlinear) mapping functions: an encoding function f and a
decoding function g
• h = f(x) denotes an encoding (possibly nonlinear) for the input x
• x̂ = g(h) = g(f(x)) denotes the reconstruction (or the decoding) for the input x
• For an autoencoder, f and g are learned with the goal of minimizing the difference
between x̂ and x
• The learned code h = f(x) can be used as a new feature representation of the
input x
• Therefore autoencoders can also be used for feature learning
• Note: the number of hidden units (the size of the encoding) can also be larger than the input
• General structure of an autoencoder
• Maps an input x to an output r (called reconstruction) through
• an internal representation code h
• It has a hidden layer h that describes a code used to represent the input
• The network has two parts
• The encoder function
h=f(x)
• A decoder that produces a reconstruction
r=g(h)
Linear Autoencoder:
• A linear autoencoder and Principal Component Analysis (PCA) are similar in that both
methods aim to reduce the dimensionality of a dataset. However, there are some key
differences between the two.

• PCA is a linear dimensionality reduction technique that finds the directions (principal
components) of maximum variance in the data, and projects the data onto a
lower-dimensional subspace. It is unsupervised, which means it does not use any labeled
data.

• A linear autoencoder is an unsupervised neural network that aims to learn a
lower-dimensional representation of the data by training an encoder to compress the
input data and a decoder to reconstruct the original data from the compressed
representation. The encoder and decoder are both linear, which means they are
composed of fully-connected layers without non-linear activations.

• In summary, PCA is a linear technique that finds the directions of maximum variance in the
data, while a linear autoencoder is a neural network that learns a linear
encoding-decoding process to compress and reconstruct the data. Both methods can be
used for dimensionality reduction, but they have different assumptions and limitations.
• An autoencoder with linear transfer functions is equivalent to PCA
• Let’s prove the equivalence for the case of an autoencoder with just 1 hidden layer, the bottleneck
layer.
• First recall how PCA works:
• x is the original data, z the reduced data, and x̂ the reconstructed data from the reduced
representation.
• Then we can write PCA as:
z = B^T x
x̂ = B z
• Now consider an autoencoder with a single hidden (bottleneck) layer,
• which basically computes x → z → x̂.
• Since we said that the activation functions are linear transfer
functions, σ(x) = x, we can write the autoencoder as:
x̂ = W1 W2 x
• where W2 is the weight matrix of the encoder layer and W1 that of the decoder layer.
• Now if we set W1 = B and W2 = B^T
we have:
x̂ = W1 (W2 x) = W1 z = B z
• which is the same solution that we had for PCA.
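The equivalence can be checked numerically. In this sketch the principal directions B are taken from the SVD of the centered data (an illustrative way to obtain them, not from the slides), and the "autoencoder" weights are set to W1 = B, W2 = B^T as in the derivation above:

```python
import numpy as np

# Numeric check: with linear activations, W2 = B^T and W1 = B,
# the autoencoder x_hat = W1 W2 x equals the PCA reconstruction B B^T x.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)               # PCA assumes centered data

# Principal directions from the SVD of the centered data matrix.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
B = Vt[:k].T                         # (5, 2): columns are the top-k components

W2 = B.T                             # "encoder" weights: z = W2 x
W1 = B                               # "decoder" weights: x_hat = W1 z

x = X[0]
z_pca = B.T @ x                      # PCA projection
xhat_pca = B @ z_pca                 # PCA reconstruction
xhat_ae = W1 @ (W2 @ x)              # linear-autoencoder reconstruction

assert np.allclose(xhat_ae, xhat_pca)
```

Note the caveat: a linear autoencoder trained by gradient descent learns the same subspace as PCA, but its weights need not be the orthonormal principal directions themselves.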
Autoencoders are preferred over PCA because:
• An autoencoder can learn non-linear transformations with a
non-linear activation function and multiple layers.
• It doesn't have to use only dense layers: it can use convolutional layers,
which are better suited to image, video, and sequence data.
• It is more efficient to learn several layers with an autoencoder rather
than learn one huge transformation with PCA.
• An autoencoder provides a representation of each layer as the
output.
• It can make use of pre-trained layers from another model to apply
transfer learning to enhance the encoder/decoder.
(Figures: a linear autoencoder and a non-linear autoencoder.)
How to train autoencoder:
• You need to set 4 hyperparameters before training an autoencoder:
• Code size: The code size or the size of the bottleneck is the most important
hyperparameter used to tune the autoencoder. The bottleneck size decides how
much the data has to be compressed. This can also act as a regularisation term.
• Number of layers: Like all neural networks, an important hyperparameter to tune
autoencoders is the depth of the encoder and the decoder. While a higher depth
increases model complexity, a lower depth is faster to process.
• Number of nodes per layer: The number of nodes per layer defines the weights we
use per layer. Typically, the number of nodes decreases with each subsequent
layer in the autoencoder as the input to each of these layers becomes smaller
across the layers.
• Reconstruction Loss: The loss function we use to train the autoencoder is highly
dependent on the type of input and output we want the autoencoder to adapt to.
If we are working with image data, the most popular loss functions for
reconstruction are MSE Loss and L1 Loss. In case the inputs and outputs are
within the range [0,1], we can also make use of Binary Cross Entropy as the
reconstruction loss.
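The two reconstruction losses named above can be written out directly in NumPy. The example inputs are illustrative:

```python
import numpy as np

# Mean squared error: the default choice for real-valued inputs.
def mse_loss(x, x_hat):
    return np.mean((x - x_hat) ** 2)

# Binary cross-entropy: suitable when inputs lie in [0, 1].
def bce_loss(x, x_hat, eps=1e-12):
    x_hat = np.clip(x_hat, eps, 1 - eps)   # avoid log(0)
    return -np.mean(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

x = np.array([0.0, 1.0, 0.5])
x_hat = np.array([0.1, 0.9, 0.5])
print(round(mse_loss(x, x_hat), 4))  # 0.0067
```

Both compare the reconstruction x_hat against the original input x; which one to use depends on the range and type of the input data, as stated above.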
Training the autoencoder:
Undercomplete, Overcomplete and the need for
Regularization
(Figure: undercomplete autoencoder vs. overcomplete autoencoder.)
Undercomplete autoencoder:
An undercomplete autoencoder is one of the simplest types of autoencoders.
• The way it works is very straightforward:
• An undercomplete autoencoder takes in an image and tries to predict the
same image as output, thus reconstructing the image from the compressed
bottleneck region.
• Undercomplete autoencoders are truly unsupervised as they do not take
any form of label; the target is the same as the input.
• The primary use of such autoencoders is the generation of the latent
space or bottleneck, which forms a compressed substitute of the input
data and can easily be decompressed back with the help of the network
when needed.
• This form of compression in the data can be modeled as a form of
dimensionality reduction.
• When we think of dimensionality reduction, we tend to think of methods like PCA
(Principal Component Analysis) that form a lower-dimensional hyperplane to represent
high-dimensional data while losing as little information as possible.
• However, PCA can only build linear relationships. As a result, it is at a disadvantage
compared with methods like undercomplete autoencoders, which can learn non-linear
relationships and therefore perform better at dimensionality reduction.
• This form of nonlinear dimensionality reduction, where the autoencoder learns a
non-linear manifold, is also termed manifold learning.
• Effectively, if we remove all non-linear activations from an undercomplete autoencoder
and use only linear layers, we reduce it to something that works on an equal footing
with PCA.
• The loss function used to train an undercomplete autoencoder is called the reconstruction
loss, as it is a check of how well the image has been reconstructed from the input data.
Applications of Autoencoders
• Image Coloring
• Autoencoders are used for converting any black and white picture
into a colored image. Depending on what is in the picture, it is
possible to tell what the color should be.
• Feature variation: It extracts only the required features of an image and generates the output by
removing any noise or unnecessary interruption.

• Dimensionality Reduction: The reconstructed image is the same as our input but with reduced
dimensions. It provides a similar image with a reduced number of pixels.
• Denoising Image: The input seen by the autoencoder is not the raw input but a stochastically
corrupted version. A denoising autoencoder is thus trained to reconstruct the original input from
the noisy version.

• Watermark Removal: It is also used for removing watermarks from images or to remove any
object while filming a video or a movie.
Regularized autoencoder:
• Regularized autoencoders are useful for preventing the autoencoder from merely copying
the input features, so that it learns the important characteristics as well.
• They are useful when the autoencoder has the same input and output dimension,
and in the case of overcomplete autoencoders.
• There are other ways we can constraint the reconstruction of an
autoencoder than to impose a hidden layer of smaller dimension than the
input.
• Rather than limiting the model capacity by keeping the encoder and
decoder shallow and the code size small, regularized autoencoders use a
loss function that encourages the model to have other properties besides
the ability to copy its input to its output.
• In practice, we usually find two types of regularized autoencoder: the
sparse autoencoder and the denoising autoencoder.
Sparse Autoencoder:
• A sparse autoencoder is simply an autoencoder whose training criterion
involves a sparsity penalty.
• In most cases, we would construct our loss function by penalizing
activations of hidden layers so that only a few nodes are encouraged to
activate when a single sample is fed into the network.
• The intuition behind this method is that, for example, if a man claims to be
an expert in mathematics, computer science, psychology, and classical
music, he might be just learning some quite shallow knowledge in these
subjects.
• However, if he only claims to be devoted to mathematics, we would like to
anticipate some useful insights from him. And it’s the same for
autoencoders we’re training — fewer nodes activating while still keeping its
performance would guarantee that the autoencoder is actually learning
latent representations instead of redundant information in our input data.
• There are actually two different ways to construct our sparsity
penalty: L1 regularization and KL-divergence.
• Why L1 Regularization Encourages Sparsity
• L1 regularization and L2 regularization are widely used in machine
learning and deep learning. L1 regularization adds “absolute value of
magnitude” of coefficients as penalty term while L2 regularization
adds “squared magnitude” of coefficient as a penalty term.
• Although L1 and L2 can both be used as a regularization term, the key
difference between them is that L1 regularization tends to shrink
coefficients exactly to zero, while L2 regularization moves
coefficients towards zero but never makes them exactly zero.
• Thus L1 regularization is often used as a method of feature selection.
• Loss Function
• Finally, after the above analysis, we get the idea of using L1 regularization in
the sparse autoencoder, and the loss function becomes:

Loss = L(x, x̂) + λ Σ_i |a_i^(h)|

• In addition to the reconstruction term, the added penalty term sums the absolute
values of the vector of activations a in layer h for sample i.
• Then we use the hyperparameter λ to control its effect on the whole loss function.
And in this way, we build a sparse autoencoder.
• Due to the sparsity of L1 regularization, sparse autoencoder actually learns better
representations and its activations are more sparse which makes it perform
better than original autoencoder without L1 regularization.
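The L1-penalized objective can be sketched in NumPy as reconstruction error plus λ times the sum of absolute hidden activations. The sigmoid encoder, the overcomplete layer sizes, and the random weights here are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Sparse-autoencoder objective: reconstruction error + lambda * sum |a_j|.
def sparse_loss(x, W_enc, W_dec, lam=0.01):
    a = sigmoid(W_enc @ x)                # hidden activations
    x_hat = W_dec @ a                     # reconstruction
    recon = np.mean((x - x_hat) ** 2)
    sparsity = lam * np.sum(np.abs(a))    # L1 penalty pushes a_j toward 0
    return recon + sparsity

rng = np.random.default_rng(2)
x = rng.normal(size=4)
W_enc = rng.normal(scale=0.1, size=(6, 4))   # overcomplete: 6 hidden > 4 inputs
W_dec = rng.normal(scale=0.1, size=(4, 6))
loss = sparse_loss(x, W_enc, W_dec)
assert loss > 0
```

During training, gradient descent on this loss trades reconstruction quality against the number and magnitude of active hidden units, which is exactly the sparsity pressure described above.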
• Sparse autoencoders have hidden nodes greater than input nodes.
They can still discover important features from the data.
• A generic sparse autoencoder can be visualized with the opacity of each
node corresponding to its level of activation.
• A sparsity constraint is introduced on the hidden layer. This prevents
the output layer from simply copying the input data.
• Sparsity may be obtained by additional terms in the loss function
during the training process, either by comparing the probability
distribution of the hidden unit activations with some low desired
value, or by manually zeroing all but the strongest hidden unit
activations. Some of the most powerful AIs in the 2010s involved
sparse autoencoders stacked inside deep neural networks.
• Advantages-
• Sparse autoencoders apply a sparsity penalty that keeps activations close
to zero but not exactly zero. The sparsity penalty is applied to the hidden
layer in addition to the reconstruction error. This prevents overfitting.
• They keep the highest activation values in the hidden layer and zero
out the rest of the hidden nodes. This prevents the autoencoder from using
all of the hidden nodes at a time, forcing only a reduced number
of hidden nodes to be used.
• Drawbacks-
• For this to work, it is essential that the individual nodes of a
trained model which activate are data-dependent, and that different
inputs result in activations of different nodes through the
network.
Sparse Autoencoder:
Denoising Autoencoder:
• Autoencoders are Neural Networks which are commonly used for
feature selection and extraction.
• However, when there are more nodes in the hidden layer than there
are inputs, the network risks learning the so-called "identity
function", also called the "null function", meaning that the output equals
the input, rendering the autoencoder useless.
• Denoising Autoencoders solve this problem by corrupting the data on
purpose by randomly turning some of the input values to zero.
• In general, the percentage of input nodes which are being set to zero
is about 50%.
• Other sources suggest a lower count, such as 30%. It depends on the
amount of data and input nodes you have.
DAE architecture:
• When calculating the Loss function, it is important to compare the output values with
the original input, not with the corrupted input. That way, the risk of learning the
identity function instead of extracting features is eliminated.
• Denoising Autoencoders are an important and crucial tool for feature selection and
extraction
(Figure: original, corrupted, and reconstructed images.)

Denoising autoencoders create a corrupted copy of the input by introducing some noise.
This helps prevent the autoencoder from copying the input to the output without learning
features of the data.
These autoencoders take a partially corrupted input while training to recover the original
undistorted input. The model learns a vector field for mapping the input data towards a
lower dimensional manifold which describes the natural data to cancel out the added
noise.
• Advantages-
• It was introduced to achieve good representation. Such a representation is one
that can be obtained robustly from a corrupted input and that will be useful for
recovering the corresponding clean input.
• Corruption of the input can be done randomly by setting some of the inputs to
zero; the remaining nodes keep their original values.
• Minimizes the loss function between the output and the original, uncorrupted input.
• Setting up a single-thread denoising autoencoder is easy.

• Drawbacks-
• To train an autoencoder to denoise data, it is necessary to perform a preliminary
stochastic mapping to corrupt the data and use it as input.
• This model isn't able to develop a mapping which memorizes the training data
because our input and target output are no longer the same.
In denoising autoencoders, some noise is added to the input data and then the
model is trained to get the denoised version of the input data. The loss function
that is used in denoising autoencoders is:-
Loss = L(x, g(f(x’)))
where x’ is the input data with some noise and x is input data without noise.
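The masking corruption and the loss L(x, g(f(x'))) can be sketched as follows. The 50% drop rate follows the slides; the identity stand-in for the network g(f(·)) is purely illustrative, since only the corruption step and the choice of target are being demonstrated:

```python
import numpy as np

rng = np.random.default_rng(3)

# Masking noise: randomly zero a fraction of the inputs to get x'.
def corrupt(x, drop_prob=0.5):
    mask = rng.random(x.shape) >= drop_prob
    return x * mask                  # some entries forced to zero

x = np.ones(10)                      # clean input
x_noisy = corrupt(x)                 # corrupted input x'

# Loss = L(x, g(f(x'))): the target is the CLEAN x, not x'.
# Here an identity placeholder stands in for the trained network g(f(.)).
x_hat = x_noisy
loss = np.mean((x - x_hat) ** 2)

print(x_noisy.sum() <= x.sum())      # True: masking only removes values
```

The key point encoded above: the corrupted x' enters the network, but the loss always compares against the clean x, which is what forces the model to learn denoising rather than the identity function.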
Denoising Autoencoder:
Contractive autoencoder:
• Contractive autoencoder is an unsupervised deep learning technique
that helps a neural network encode unlabeled training data.
• A simple autoencoder is used to compress information of the given
data while keeping the reconstruction cost as low as possible.
The contractive autoencoder aims to learn representations that are
invariant to unimportant transformations of the data.
• It learns only the variations present in the given dataset, which
makes the encoding process less sensitive to small variations in
its training data.
• The goal of Contractive Autoencoder is to reduce the representation’s
sensitivity towards the training input data.
• In order to achieve this, we must add a regularizer or penalty term to
the cost function that the autoencoder is trying to minimize.
• So from the mathematical point of view, it achieves the contraction effect by
adding to the reconstruction cost a penalty term: the Frobenius norm of the
Jacobian matrix of the encoder activations with respect to the input.
• If this value is zero, it means that as we change input values, we don't observe
any change on the learned hidden representations.
• But if the value is very large, then the learned representation is unstable as the
input values change.
• Contractive autoencoders are usually deployed as just one of several other
autoencoder nodes, activating only when other encoding schemes fail to label a
data point.
• The objective of a contractive autoencoder is to have a robust learned representation which is less
sensitive to small variation in the data.
• Robustness of the representation for the data is done by applying a penalty term to the loss
function.
• Contractive autoencoder is another regularization technique just like sparse and denoising
autoencoders. However, this regularizer corresponds to the Frobenius norm of the Jacobian
matrix of the encoder activations with respect to the input.
• Frobenius norm of the Jacobian matrix for the hidden layer is calculated with respect to input and
it is basically the sum of square of all elements.

Advantages-
• Contractive autoencoder is a better choice than denoising autoencoder to learn useful feature
extraction.
• This model learns an encoding in which similar inputs have similar encodings. Hence, we're
forcing the model to learn how to contract a neighborhood of inputs into a smaller neighborhood
of outputs.
• Contractive autoencoders are the same as sparse autoencoders except for a
difference in the penalty term. The loss function of a contractive autoencoder
is written as:
Loss = L(x, g(f(x))) + d(h, x)
where h = f(x) and d(h, x) can be written as:

d(h, x) = λ * ||∂h/∂x||²_F

• This forces the model to learn a function that does not change much when
x changes slightly.
• There is a connection between the denoising autoencoder and the
contractive autoencoder:
• the denoising reconstruction error is equivalent to a contractive
penalty on the reconstruction function that maps x to r = g(f(x)).
• In other words, denoising autoencoders make the reconstruction
function resist small but finite sized perturbations of the input,
whereas contractive autoencoders make the feature extraction
function resist infinitesimal perturbations of the input.
• When using the Jacobian based contractive penalty to pretrain
features f(x) for use with a classifier, the best classification accuracy
usually results from applying the contractive penalty to f(x) rather
than to g(f(x)).
Contractive autoencoder:
• There are some important equations we need to know first before deriving
contractive autoencoder. Before going there, we'll touch base on the
Frobenius norm of the Jacobian matrix.

• The Frobenius norm, also called the Euclidean norm, is a matrix norm of an
m×n matrix A, defined as the square root of the sum of the absolute squares
of its elements.
• The Jacobian matrix is the matrix of all first-order partial derivatives of a
vector-valued function. When the matrix is square, both the
matrix and its determinant are referred to as the Jacobian.
• Combining these two definitions gives us the meaning of the Frobenius norm of
the Jacobian matrix.
• The loss function is

L = Σ_x L(x, x̂) + λ ||J_f(x)||²_F

where the penalty term λ ||J_f(x)||²_F is the squared Frobenius norm of the Jacobian
matrix of partial derivatives associated with the encoder function:

||J_f(x)||²_F = Σ_ij (∂h_j(x) / ∂x_i)²

• With a sigmoid nonlinearity φ, the j-th hidden unit is h_j = φ(Σ_i W_ji x_i + b_j):
the dot product of the input features with the corresponding weights. Using the chain
rule, the partial derivatives are:

∂h_j / ∂x_i = h_j (1 − h_j) W_ji

• Our main objective is to calculate the norm, so we can simplify the implementation so
that we don't need to construct the diagonal matrix:

||J_f(x)||²_F = Σ_j (h_j (1 − h_j))² Σ_i W_ji²
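The diagonal-free simplification for a sigmoid encoder can be verified numerically. The weights and input below are illustrative assumptions; the check compares the directly-built Jacobian against the vectorised shortcut:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
W = rng.normal(scale=0.5, size=(3, 5))   # encoder weights (3 hidden, 5 inputs)
x = rng.normal(size=5)
h = sigmoid(W @ x)                       # hidden activations h_j

# Direct construction: dh_j/dx_i = h_j(1-h_j) * W_ji, then Frobenius norm.
J = (h * (1 - h))[:, None] * W
frob_direct = np.sum(J ** 2)

# Vectorised shortcut: sum_j (h_j(1-h_j))^2 * sum_i W_ji^2,
# no diagonal matrix or explicit Jacobian needed.
frob_fast = np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1))

assert np.isclose(frob_direct, frob_fast)
```

In training, frob_fast (scaled by λ) would simply be added to the reconstruction loss, giving the contractive penalty without ever materialising the full Jacobian.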
THANK YOU
