Deep Generative Models - 1
Variational Autoencoders (VAEs)
The machine learning framework
y = f(x)
y: output (prediction)   f: prediction function   x: input (image features)
• Training: given a training set of labeled examples {(x1,y1), …, (xN,yN)}, estimate the prediction function f by minimizing the prediction error on the training set
• Testing: apply f to a never-before-seen test example x and output the predicted value y = f(x)
Slide credit: L. Lazebnik
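The training/testing recipe above can be sketched with a toy model. The 1-D linear model, the synthetic data, and the least-squares fit below are illustrative assumptions, not from the slides:

```python
# Minimal sketch of the supervised-learning recipe: estimate f from
# labeled pairs by minimizing prediction error, then apply f to new x.
import numpy as np

# Training set {(x_i, y_i)}: here y = 2x + 1 plus a little noise.
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 20)
y_train = 2.0 * x_train + 1.0 + 0.01 * rng.standard_normal(20)

# "Estimate f by minimizing the prediction error": least squares.
A = np.stack([x_train, np.ones_like(x_train)], axis=1)
(w, b), *_ = np.linalg.lstsq(A, y_train, rcond=None)

def f(x):
    return w * x + b

# "Testing": apply f to a never-before-seen test example.
y_pred = f(0.5)
```

The fitted parameters should land near the true (w, b) = (2, 1) that generated the data.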
Structured Learning
Machine learning is about finding a function f
Regression: output a scalar
Classification: output a “class” (one-hot vector)
Class 1: (1, 0, 0)   Class 2: (0, 1, 0)   Class 3: (0, 0, 1)
Structured Learning/Prediction: output a
sequence, a matrix, a graph, a tree ……
Output is composed of components with dependency
Output Sequence
Machine Translation
“機器學習及其深層與結構化” (sentence of language 1) → “Machine learning and having it deep and structured” (sentence of language 2)
Speech Recognition
(speech) → “感謝大家來上課” (“Thank you all for coming to class”) (transcription)
Chat-bot
“How are you?” (what a user says) → “I’m fine.” (response of machine)
Output Matrix
Image to Image: Colorization
Ref: [Link]
Text to Image
“this white and yellow flower has thin white petals and a round yellow stamen”
ref: [Link]
What does Deep Learning (DL) offer?
Deep learning is the machine-learning subfield concerned with learning representations of data.
Deep learning algorithms attempt to learn multiple levels of representation by using a hierarchy of layers.
Given large amounts of data, such a system begins to capture useful structure and respond in useful ways.
[Link]
Deep Learning in One Slide (Review)
Many kinds of network structures:
• Fully connected feedforward network
• Convolutional neural network (CNN)
• Recurrent neural network (RNN)
Different networks can take different kinds of input/output: vectors, matrices, or sequences (speech, video, sentences).
How to find the function? Given the example inputs/outputs as training data: {(x1,y1), (x2,y2), …, (x1000,y1000)}
Two deep models:
• Predictive: predict the category/class of an object by learning an inference model. Example: CNN
• Generative: generate a new object that closely matches those in a limited training set, by closely approximating the underlying (but unknown) probability distribution of the set of objects. Example: VAE, GAN
So what are these Deep Nets in a nutshell?.....
News Feature: What are the limits of deep learning?
M. Mitchell Waldrop, Proceedings of the National Academy of Sciences, Jan 2019, 116 (4): 1074-1077; DOI: 10.1073/pnas.1821594116
The 2012 Breakthrough in Predictive Learning….
• A Krizhevsky, I Sutskever, and GE Hinton, “Imagenet classification with
deep convolutional neural networks”, Advances in neural information
processing systems (NIPS 2012), 1097-1105
Has 42,000+ citations by now!!
Reduced the top-5 error rate on ImageNet in the ILSVRC competition to about 16%.
The architecture became known as AlexNet.
What happens inside a Convolutional Net….?
Liu et al., A survey of deep neural network architectures and their applications. Neurocomputing 234: 11-26 (2017)
The standard autoencoder: An autoencoder is actually a pair of connected networks, an encoder and a decoder. The encoder network takes in an input and converts it into a smaller, dense representation, which the decoder network can use to convert it back to the original input.
The convolutional layers of any CNN take in a large image (e.g. a rank-3 tensor of size 299x299x3) and convert it to a much more compact, dense representation (e.g. a rank-1 tensor of size 1000). This dense representation is then used by the fully connected classifier network to classify the image.
The encoder is similar: it is simply a network that takes in an input and produces a much smaller representation (the encoding) that contains enough information for the next part of the network to process it into the desired output format. Typically, the encoder is trained together with the other parts of the network, optimized via back-propagation, to produce encodings specifically useful for the task at hand. In CNNs, the 1000-dimensional encodings produced are specifically useful for classification.
Autoencoders take this idea, and slightly flip it on its head, by making the encoder generate encodings
specifically useful for reconstructing its own input.
A simple neural network based autoencoder architecture:
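Concretely, the data flow through such an architecture can be sketched as below. The layer sizes (784 → 32 → 784), the tanh nonlinearity, and the random (untrained) weights are illustrative assumptions; a real model would be trained with back-propagation to minimize the reconstruction error:

```python
# Minimal numpy sketch of an encoder/decoder pair: a wide input is
# squeezed into a small dense code, then expanded back to input size.
import numpy as np

rng = np.random.default_rng(0)
input_dim, code_dim = 784, 32

# Untrained weights, just to show the shapes and the data flow.
W_enc = 0.01 * rng.standard_normal((input_dim, code_dim))
W_dec = 0.01 * rng.standard_normal((code_dim, input_dim))

def encoder(x):
    return np.tanh(x @ W_enc)        # compact, dense representation

def decoder(z):
    return z @ W_dec                 # reconstruction of the input

x = rng.standard_normal(input_dim)   # stand-in for a flattened image
z = encoder(x)                       # shape (32,)
x_hat = decoder(z)                   # shape (784,)

# Training would minimize this quantity over the dataset.
reconstruction_error = float(np.mean((x - x_hat) ** 2))
```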
Because neural networks are capable of learning nonlinear relationships, this can be thought of as a more
powerful (nonlinear) generalization of PCA. Whereas PCA attempts to discover a lower dimensional hyperplane
which describes the original data, autoencoders are capable of learning nonlinear manifolds (a manifold is
defined in simple terms as a continuous, non-intersecting surface). The difference between these two approaches
is visualized below.
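The linear PCA baseline mentioned above can be sketched numerically: project centered data onto its top principal components and reconstruct. The made-up 3-D data below (points lying near a line) and the choice of one component are illustrative:

```python
# PCA via SVD: find the lower-dimensional hyperplane (here a line)
# that best describes the data, then reconstruct from the projection.
import numpy as np

rng = np.random.default_rng(1)
t = rng.standard_normal(100)
# Data that lies almost on a line in 3-D, plus small noise.
X = np.stack([t, 2 * t, -t], axis=1) + 0.01 * rng.standard_normal((100, 3))

Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 1                                   # dimension of the hyperplane
Z = Xc @ Vt[:k].T                       # linear "encoding"
X_hat = Z @ Vt[:k] + X.mean(axis=0)     # linear "decoding"

pca_error = float(np.mean((X - X_hat) ** 2))
```

Because the data really is almost one-dimensional, a single component reconstructs it nearly perfectly; a nonlinear autoencoder plays the same role for curved manifolds.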
The problem with standard autoencoders:
Standard autoencoders learn to generate compact representations and reconstruct their inputs well, but aside from a few applications like denoising autoencoders, they are fairly limited.
The fundamental problem with autoencoders, for generation, is that the latent space they convert their inputs to, where their encoded vectors lie, may not be continuous or allow easy interpolation.
For example, training an autoencoder on the
MNIST dataset, and visualizing the encodings
from a 2D latent space reveals the formation of
distinct clusters. This makes sense, as distinct
encodings for each image type makes it far
easier for the decoder to decode them. This is
fine if you’re just replicating the same images.
But when you’re building a generative model, you don’t want to merely replicate the image you put in. You want to randomly sample from the latent space, or generate variations on an input image, from a continuous latent space.
If the space has discontinuities (e.g. gaps
between clusters) and you sample/generate a
variation from there, the decoder will simply
generate an unrealistic output, because the
decoder has no idea how to deal with that
region of the latent space. During training,
it never saw encoded vectors coming from that
region of latent space.
Let’s suppose we've trained an autoencoder model on a large dataset of faces with an encoding dimension of 6.
An ideal autoencoder will learn descriptive attributes of faces such as skin color, whether or not the person is
wearing glasses, etc. in an attempt to describe an observation in some compressed representation.
Slides made from Jeremy Jordan’s blog
In the last example, we've described the input image in terms of its latent attributes using a single value to
describe each attribute. However, we may prefer to represent each latent attribute as a range of possible
values. For instance, what single value would you assign for the smile attribute if you feed in a photo of the
Mona Lisa? Using a variational autoencoder, we can describe latent attributes in probabilistic terms.
With this approach, we'll now represent each latent attribute for a given input as a probability distribution.
When decoding from the latent state, we'll randomly sample from each latent state distribution to generate a
vector as input for our decoder model.
By constructing our encoder model to output a range of possible values (a statistical distribution) from which
we'll randomly sample to feed into our decoder model, we're essentially enforcing a continuous, smooth
latent space representation. For any sampling of the latent distributions, we're expecting our decoder model
to be able to accurately reconstruct the input. Thus, values which are nearby to one another in latent space should
correspond with very similar reconstructions.
Variational Autoencoders (VAEs) have one fundamentally
unique property that separates them from vanilla
autoencoders, and it is this property that makes them so
useful for generative modeling: their latent spaces are, by
design, continuous, allowing easy random sampling and
interpolation.
It achieves this by doing something that seems rather surprising at first: its encoder does not output a single encoding vector of size n; instead, it outputs two vectors of size n: a vector of means, μ, and a vector of standard deviations, σ.
[Link]
They form the parameters of a vector of random variables of length n, with the i-th elements of μ and σ being the mean and standard deviation of the i-th random variable, X_i, from which we sample to obtain the sampled encoding that we pass onward to the decoder:
This stochastic generation means that, even for the same input, while the mean and standard deviation remain the same, the actual encoding will vary somewhat on every single pass, simply due to sampling.
• Intuitively, the mean vector controls where the encoding of an input should be centered, while the standard deviation controls the “area”: how much the encoding can vary from the mean.
• As encodings are generated at random from anywhere inside the “circle” (the distribution), the decoder learns that not only does a single point in latent space refer to a sample of that class, but all nearby points do as well.
• This allows the decoder to not just decode single, specific encodings in the latent space (leaving the
decodable latent space discontinuous), but ones that slightly vary too, as the decoder is exposed to a range
of variations of the encoding of the same input during training.
• The model is now exposed to a certain degree of local variation by varying the encoding of one
sample, resulting in smooth latent spaces on a local scale, that is, for similar samples.
• Ideally, we want overlap between samples that are not very similar too, in order to
interpolate between classes.
The regularity that is expected from the latent space in order to make the generative process possible can be expressed through two main properties: continuity (two close points in the latent space should not give two completely different contents once decoded) and completeness (for a chosen distribution, a point sampled from the latent space should give “meaningful” content once decoded).
The fact that VAEs encode inputs as distributions instead of simple points is not, on its own, sufficient to ensure continuity and completeness. Without a well-defined regularization term, the model can learn, in order to minimize its reconstruction error, to “ignore” the fact that distributions are returned and behave almost like a classic autoencoder (leading to overfitting). To do so, the encoder can either return distributions with tiny variances (which tend toward point-like distributions) or return distributions with very different means (which would then be very far apart from each other in the latent space). In both cases, the distributions are used the wrong way (cancelling the expected benefit), and continuity and/or completeness are not satisfied.
So, in order to avoid these effects, we have to regularize both the covariance matrix and the mean of the distributions returned by the encoder. In practice, this regularization is done by enforcing the distributions to be close to a standard normal distribution (zero mean and unit variance). This way, we require the covariance matrices to be close to the identity, preventing point-like distributions, and the means to be close to 0, preventing the encoded distributions from being too far apart from each other.
Regularization tends to create a “gradient” over the
information encoded in the latent space.
However, since there are no limits on what values vectors μ and σ can take on, the encoder can learn to
generate very different μ for different classes, clustering them apart, and minimize σ, making sure the
encodings themselves don’t vary much for the same sample (that is, less uncertainty for the decoder). This
allows the decoder to efficiently reconstruct the training data.
What we ideally want are encodings, all of which are as close as
possible to each other while still being distinct, allowing smooth
interpolation, and enabling the construction of new samples.
In order to force this, we introduce the Kullback–Leibler
divergence (KL divergence) into the loss function. The KL
divergence between two probability distributions simply measures
how much they diverge from each other. Minimizing the KL
divergence here means optimizing the probability distribution
parameters (μ and σ) to closely resemble that of the target
distribution.
For the discrete case:  D_KL(P ‖ Q) = Σ_x P(x) log( P(x) / Q(x) )
For the continuous case:  D_KL(p ‖ q) = ∫ p(x) log( p(x) / q(x) ) dx
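A numerical sketch of the discrete KL divergence, together with the closed form used in VAEs for KL between a diagonal Gaussian N(μ, σ²) and the standard normal N(0, I), which is 0.5 · Σ(σ² + μ² − 1 − log σ²). The example distributions below are made up:

```python
# KL divergence: how much one distribution diverges from another.
import numpy as np

def kl_discrete(p, q):
    # Assumes p and q are strictly positive and sum to 1.
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def kl_gaussian_vs_standard_normal(mu, sigma):
    # Closed form for KL( N(mu, diag(sigma^2)) || N(0, I) ).
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    return float(0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2)))

# KL of a distribution with itself is zero...
same = kl_discrete([0.5, 0.5], [0.5, 0.5])
# ...and positive when the distributions differ.
different = kl_discrete([0.9, 0.1], [0.5, 0.5])

# Matching the standard normal exactly also gives zero KL.
zero_kl = kl_gaussian_vs_standard_normal(mu=[0.0], sigma=[1.0])
```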
Homework:
Intuitively, this loss encourages the
encoder to distribute all encodings
(for all types of inputs, eg. all MNIST
numbers), evenly around the center
of the latent space. If it tries to
“cheat” by clustering them apart
into specific regions, away from
the origin, it will be penalized.
Now, using purely the KL loss results in a latent space with encodings placed densely and randomly near its center, with little regard for similarity among nearby encodings. The decoder finds it impossible to decode anything meaningful from this space, simply because there really isn’t any meaning.
Optimizing the two
together, however, results
in the generation of a
latent space which
maintains the similarity of
nearby encodings on
the local scale via
clustering, yet globally, is
very densely packed near
the latent space origin
(compare the axes with
the original).
Intuitively, this is the equilibrium reached
by the cluster-forming nature of the
reconstruction loss, and the dense
packing nature of the KL loss, forming
distinct clusters the decoder can decode.
This is great, as it means when randomly
generating, if you sample a vector from the
same prior distribution of the encoded
vectors, N(0, I), the decoder will
successfully decode it. And if you’re
interpolating, there are no sudden gaps
between clusters, but a smooth mix of
features a decoder can understand.
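The equilibrium described above comes from optimizing both terms at once. A toy numerical sketch of the combined objective follows; using MSE as the reconstruction loss is an illustrative choice (the original VAE uses a likelihood-based term), and all values are made up:

```python
# VAE objective: reconstruction loss + KL regularizer toward N(0, I).
import numpy as np

def vae_loss(x, x_hat, mu, sigma, beta=1.0):
    recon = np.mean((x - x_hat) ** 2)                          # reconstruction
    kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2))  # KL term
    return float(recon + beta * kl)

x = np.array([0.2, 0.8, 0.5])            # input (hypothetical)
x_hat = np.array([0.25, 0.75, 0.5])      # decoder output (hypothetical)
mu = np.array([0.1, -0.2])               # encoder means (hypothetical)
sigma = np.array([0.9, 1.1])             # encoder std devs (hypothetical)

loss = vae_loss(x, x_hat, mu, sigma)
```

A perfect reconstruction with encodings matching N(0, I) exactly would drive both terms, and hence the total loss, to zero.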
For latent space
visualizations, we can
train a VAE with 2-D latent
variables (though this
space is generally too
small for the intrinsic
dimensionality of
real-world data). Picturing
this compressed latent
space lets us see how the
model has disentangled
complex raw data into
abstract higher-order
features.
This is how the
encoder/inference network
learns to map the training
set from the input data
space to the latent
space…
[Link]
…and this is how the decoder/generative network learns to map latent coordinates into reconstructions of the
original data space:
Here we are sampling
evenly-spaced percentiles along
the latent manifold and plotting
their corresponding output from the
decoder, with the same axis labels
as above.
❑ This tableau highlights the overall smoothness of the latent manifold—and how any “unrealistic”
outputs from the generative decoder correspond to apparent discontinuities in the variational
posterior of the encoder (e.g. between the “7-space” and the “1-space”). These gaps could
probably be improved by experimenting with model hyperparameters.
❑ Whereas the original data dotted a sparse landscape in 784 dimensions, where “realistic” images
were few and far between, this 2-dimensional latent manifold is densely populated with such
samples. Beyond its inherent visual coolness, latent space smoothness shows the model’s ability to
leverage its “understanding” of the underlying data-generating process to generalize beyond the
training set.
❑ Smooth interpolation within and between digits—in contrast to the spotty latent space characteristic
of many autoencoders—is a direct result of the variational regularization intrinsic to VAEs.
When using generative models, you might simply want to generate a random, new output that looks similar to the training data, and you can certainly do that with VAEs. But more often, you’d like to alter or explore variations on data you already have, and not just in a random way, but in a desired, specific direction. This is where VAEs work particularly well compared with other generative methods.
And that’s
it…..