Understanding Variational Autoencoders

The document discusses deep generative models, focusing on Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs). VAEs utilize learned approximate inference for generating samples and have advantages in manifold learning, but may produce blurry images. GANs involve a generator and discriminator in a zero-sum game, facing challenges in convergence, with various formulations proposed to improve learning stability.


5.6. DEEP GENERATIVE MODELS

Variational Autoencoders

 The variational autoencoder or VAE (Kingma, 2013; Rezende et al., 2014) is a directed model that uses learned approximate inference and can be trained purely with gradient-based methods.

 To generate a sample from the model, the VAE first draws a sample z from the code
distribution pmodel(z). The sample is then run through a differentiable generator
network g(z).

 Finally, x is sampled from a distribution pmodel(x; g(z)) = pmodel(x | z). However, during training, the approximate inference network (or encoder) q(z | x) is used to obtain z, and pmodel(x | z) is then viewed as a decoder network.

 The key insight behind variational autoencoders is that they may be trained by maximizing the variational lower bound L(q) associated with data point x:

L(q) = Ez∼q(z|x) log pmodel(z, x) + H(q(z | x))    (20.76)
     = Ez∼q(z|x) log pmodel(x | z) − DKL(q(z | x)||pmodel(z))    (20.77)
     ≤ log pmodel(x).

 In equation 20.76, we recognize the first term as the joint log-likelihood of the visible
and hidden variables under the approximate posterior over the latent variables (just
like with EM, except that we use an approximate rather than the exact posterior).

 We recognize also a second term, the entropy of the approximate posterior. When q is
chosen to be a Gaussian distribution, with noise added to a predicted mean value,
maximizing this entropy term encourages increasing the standard deviation of this
noise.

 More generally, this entropy term encourages the variational posterior to place high
probability mass on many z values that could have generated x, rather than collapsing
to a single point estimate of the most likely value.
 In equation 20.77, we recognize the first term as the reconstruction log-likelihood
found in other autoencoders. The second term tries to make the approximate posterior
distribution q(z | x) and the model prior pmodel(z) approach each other.
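For the common choice q(z | x) = N(μ, diag(σ²)) with prior pmodel(z) = N(0, I), the KL term in equation 20.77 has a closed form. A minimal numpy sketch, with function names of my own choosing:

```python
import numpy as np

def gaussian_kl(mu, sigma):
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ):
    # 0.5 * sum( mu^2 + sigma^2 - 1 - log sigma^2 )
    return 0.5 * np.sum(mu ** 2 + sigma ** 2 - 1.0 - 2.0 * np.log(sigma))

def elbo(recon_loglik, mu, sigma):
    # Equation 20.77: reconstruction log-likelihood minus the KL
    # term pulling q(z | x) toward the prior pmodel(z).
    return recon_loglik - gaussian_kl(mu, sigma)

# The KL vanishes exactly when q equals the prior ...
print(gaussian_kl(np.zeros(4), np.ones(4)))              # 0.0
# ... and penalizes any departure from it.
print(gaussian_kl(np.ones(4), 0.5 * np.ones(4)) > 0.0)   # True
```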

 Traditional approaches to variational inference and learning infer q via an optimization algorithm, typically iterated fixed point equations (section 19.4).

 These approaches are slow and often require the ability to compute Ez∼q log pmodel(z, x) in closed form.

 The main idea behind the variational autoencoder is to train a parametric encoder (also sometimes called an inference network or recognition model) that produces the parameters of q. So long as z is a continuous variable, we can then back-propagate through samples of z drawn from q(z | x) = q(z; f(x; θ)) in order to obtain a gradient with respect to θ.

 Learning then consists solely of maximizing L with respect to the parameters of the
encoder and decoder. All of the expectations in L may be approximated by Monte
Carlo sampling.
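The back-propagation through samples of z relies on the reparameterization trick. A small numpy sketch of the sampling step for a Gaussian q; the variable names are mine:

```python
import numpy as np

def sample_z(mu, sigma, rng):
    # Reparameterization: isolate the randomness in eps ~ N(0, I) so that
    # z = mu + sigma * eps is a deterministic, differentiable function of
    # the encoder outputs (dz/dmu = 1, dz/dsigma = eps), letting gradients
    # of a Monte Carlo estimate of L flow back into the encoder parameters.
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu, sigma = np.array([0.0, 1.0]), np.array([0.5, 2.0])
zs = np.stack([sample_z(mu, sigma, rng) for _ in range(100_000)])
print(zs.mean(axis=0))  # close to mu
print(zs.std(axis=0))   # close to sigma
```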

 The variational autoencoder approach is elegant, theoretically pleasing, and simple to implement. It also obtains excellent results and is among the state-of-the-art approaches to generative modeling.

 Its main drawback is that samples from variational autoencoders trained on images
tend to be somewhat blurry. The causes of this phenomenon are not yet known.

 One possibility is that the blurriness is an intrinsic effect of maximum likelihood, which minimizes DKL(pdata||pmodel).

 As illustrated in figure 3.6, this means that the model will assign high probability to
points that occur in the training set, but may also assign high probability to other
points. These other points may include blurry images.
 Part of the reason that the model would choose to put probability mass on blurry
images rather than some other part of the space is that the variational autoencoders
used in practice usually have a Gaussian distribution for pmodel(x; g(z)).

 Maximizing a lower bound on the likelihood of such a distribution is similar to training a traditional autoencoder with mean squared error, in the sense that it has a tendency to ignore features of the input that occupy few pixels or that cause only a small change in the brightness of the pixels that they occupy. This issue is not specific to VAEs and is shared with generative models that optimize a log-likelihood, or equivalently, DKL(pdata||pmodel), as argued by Theis et al. (2015) and by Huszár (2015).
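The connection between a Gaussian pmodel(x; g(z)) and mean squared error can be made concrete: up to an additive constant, the Gaussian log-likelihood is the negative squared reconstruction error. A small sketch (the function name is mine):

```python
import numpy as np

def gaussian_decoder_loglik(x, x_hat, var=1.0):
    # log N(x; x_hat, var * I) = const - ||x - x_hat||^2 / (2 * var),
    # so maximizing it is equivalent to minimizing mean squared error,
    # which down-weights small, localized differences in pixel values.
    d = x.size
    return -0.5 * (d * np.log(2.0 * np.pi * var)
                   + np.sum((x - x_hat) ** 2) / var)

x = np.array([1.0, 2.0, 3.0])
near = np.array([1.1, 2.0, 3.0])   # small error on one "pixel"
far = np.array([1.0, 2.0, 4.0])    # larger error
print(gaussian_decoder_loglik(x, near) > gaussian_decoder_loglik(x, far))  # True
```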

 Another troubling issue with contemporary VAE models is that they tend to use only
a small subset of the dimensions of z, as if the encoder was not able to transform
enough of the local directions in input space to a space where the marginal
distribution matches the factorized prior.

 The VAE framework is very straightforward to extend to a wide range of model architectures. This is a key advantage over Boltzmann machines, which require extremely careful model design to maintain tractability.

 VAEs work very well with a diverse family of differentiable operators. One
particularly sophisticated VAE is the deep recurrent attention writer or DRAW model
(Gregor et al., 2015).

 DRAW uses a recurrent encoder and recurrent decoder combined with an attention
mechanism. The generation process for the DRAW model consists of sequentially
visiting different small image patches and drawing the values of the pixels at those
points.

 VAEs can also be extended to generate sequences by defining variational RNNs (Chung et al., 2015b), using a recurrent encoder and decoder within the VAE framework. Generating a sample from a traditional RNN involves non-deterministic operations only at the output space.

 Variational RNNs also have random variability at the potentially more abstract level
captured by the VAE latent variables.
 The VAE framework has been extended to maximize not just the traditional variational lower bound, but instead the importance weighted autoencoder (Burda et al., 2015) objective:

Lk(x, q) = Ez(1),…,z(k)∼q(z|x) [ log (1/k) Σi pmodel(x, z(i)) / q(z(i) | x) ]    (20.78)

 This new objective is equivalent to the traditional lower bound L when k = 1. However, it may also be interpreted as forming an estimate of the true log pmodel(x) using importance sampling of z from proposal distribution q(z | x).

 The importance weighted autoencoder objective is also a lower bound on log pmodel(x) and becomes tighter as k increases.
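This tightening can be checked on a toy conjugate Gaussian model where log pmodel(x) is known exactly. The deliberately mismatched proposal q(z | x) = N(0, 1) is an assumption for illustration only:

```python
import numpy as np

def log_normal(x, mean, var):
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def iwae_bound(x, k, n_rep, rng):
    # L_k = E[ log (1/k) sum_i pmodel(x, z_i) / q(z_i | x) ],
    # averaged over n_rep independent draws of the k proposal samples.
    total = 0.0
    for _ in range(n_rep):
        z = rng.standard_normal(k)            # proposal q(z | x) = N(0, 1)
        log_w = (log_normal(z, 0.0, 1.0)      # prior pmodel(z) = N(0, 1)
                 + log_normal(x, z, 1.0)      # likelihood pmodel(x | z) = N(z, 1)
                 - log_normal(z, 0.0, 1.0))   # minus log-proposal
        m = log_w.max()
        total += m + np.log(np.exp(log_w - m).mean())  # stable log-mean-exp
    return total / n_rep

rng = np.random.default_rng(0)
x = 1.5
exact = log_normal(x, 0.0, 2.0)   # by conjugacy, pmodel(x) = N(x; 0, 1 + 1)
l1 = iwae_bound(x, 1, 20_000, rng)
l50 = iwae_bound(x, 50, 20_000, rng)
print(l1, l50, exact)   # the bound tightens toward exact as k grows
```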

 Variational autoencoders have some interesting connections to the MP-DBM and other approaches that involve back-propagation through the approximate inference graph. These previous approaches required an inference procedure such as mean field fixed point equations to provide the computational graph.

 The variational autoencoder is defined for arbitrary computational graphs, which makes it applicable to a wider range of probabilistic model families because there is no need to restrict the choice of models to those with tractable mean field fixed point equations.

 The variational autoencoder also has the advantage that it increases a bound on the
log-likelihood of the model, while the criteria for the MP-DBM and related models
are more heuristic and have little probabilistic interpretation beyond making the
results of approximate inference accurate.

 One disadvantage of the variational autoencoder is that it learns an inference network for only one problem, inferring z given x.

 The older methods are able to perform approximate inference over any subset of
variables given any other subset of variables, because the mean field fixed point
equations specify how to share parameters between the computational graphs for all
of these different problems.

 One very nice property of the variational autoencoder is that simultaneously training a
parametric encoder in combination with the generator network forces the model to
learn a predictable coordinate system that the encoder can capture.

 This makes it an excellent manifold learning algorithm. See figure 20.6 for examples
of low-dimensional manifolds learned by the variational autoencoder. In one of the
cases demonstrated in the figure, the algorithm discovered two independent factors of
variation present in images of faces: angle of rotation and emotional expression.

Generative Adversarial Networks

 Generative adversarial networks or GANs (Goodfellow et al., 2014c) are another generative modeling approach based on differentiable generator networks. They are based on a game theoretic scenario in which the generator network must compete against an adversary.

 The generator network directly produces samples x = g(z; θ(g)). Its adversary, the discriminator network, attempts to distinguish between samples drawn from the training data and samples drawn from the generator.

 The discriminator emits a probability value given by d(x; θ(d)), indicating the probability that x is a real training example rather than a fake sample drawn from the model.

 The simplest way to formulate learning in generative adversarial networks is as a zero-sum game, in which a function v(θ(g), θ(d)) determines the payoff of the discriminator.

 The generator receives −v(θ(g), θ(d)) as its own payoff. During learning, each player attempts to maximize its own payoff, so that at convergence

g∗ = arg ming maxd v(g, d).
 This drives the discriminator to attempt to learn to correctly classify samples as real or
fake. Simultaneously, the generator attempts to fool the classifier into believing its
samples are real.

 At convergence, the generator’s samples are indistinguishable from real data, and the discriminator outputs 1/2 everywhere. The discriminator may then be discarded.

 The main motivation for the design of GANs is that the learning process requires
neither approximate inference nor approximation of a partition function gradient.

 In the case where maxd v(g, d) is convex in θ(g) (such as the case where optimization is performed directly in the space of probability density functions), the procedure is guaranteed to converge and is asymptotically consistent.

 Unfortunately, learning in GANs can be difficult in practice when g and d are represented by neural networks and maxd v(g, d) is not convex. Goodfellow (2014) identified non-convergence as an issue that may cause GANs to underfit.

 In general, simultaneous gradient descent on two players’ costs is not guaranteed to reach an equilibrium. Consider for example the value function v(a, b) = ab, where one player controls a and incurs cost ab, while the other player controls b and receives a cost −ab.

 If we model each player as making infinitesimally small gradient steps, each player
reducing their own cost at the expense of the other player, then a and b go into a
stable, circular orbit, rather than arriving at the equilibrium point at the origin.
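This orbit is easy to simulate. One caveat: with finite step sizes the discrete updates actually spiral slowly outward (each step multiplies the radius by sqrt(1 + η²)); the perfect circle appears only in the continuous-time limit. Either way, the players never approach the equilibrium at the origin:

```python
import numpy as np

a, b = 1.0, 0.0
eta = 0.01
radii = []
for _ in range(2000):
    # Player 1 lowers its cost  v = a*b  via  a <- a - eta * dv/da = a - eta*b.
    # Player 2 lowers its cost -v = -a*b via  b <- b + eta * a.
    a, b = a - eta * b, b + eta * a    # simultaneous, not alternating, steps
    radii.append(float(np.hypot(a, b)))

# The distance from the origin never shrinks below its starting value.
print(min(radii), radii[-1])
```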

 Note that the equilibria for a minimax game are not local minima of v. Instead, they
are points that are simultaneously minima for both players’ costs. This means that
they are saddle points of v that are local minima with respect to the first player’s
parameters and local maxima with respect to the second player’s parameters.

 It is possible for the two players to take turns increasing then decreasing v forever,
rather than landing exactly on the saddle point where neither player is capable of
reducing its cost. It is not known to what extent this non-convergence problem affects
GANs.
 Goodfellow (2014) identified an alternative formulation of the payoffs, in which the
game is no longer zero-sum, that has the same expected gradient as maximum
likelihood learning whenever the discriminator is optimal. Because maximum
likelihood training converges, this reformulation of the GAN game should also
converge, given enough samples.
 Unfortunately, this alternative formulation does not seem to improve convergence in
practice, possibly due to suboptimality of the discriminator, or possibly due to high
variance around the expected gradient. In realistic experiments, the best-performing
formulation of the GAN game is a different formulation that is neither zero-sum nor
equivalent to maximum likelihood, introduced by Goodfellow et al. (2014c) with a
heuristic motivation.

 In this best-performing formulation, the generator aims to increase the log probability
that the discriminator makes a mistake, rather than aiming to decrease the log
probability that the discriminator makes the correct prediction.

 This reformulation is motivated solely by the observation that it causes the derivative
of the generator’s cost function with respect to the discriminator’s logits to remain
large even in the situation where the discriminator confidently rejects all generator
samples. Stabilization of GAN learning remains an open problem. Fortunately, GAN
learning performs well when the model architecture and hyperparameters are carefully
selected.
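The derivative claim above is easy to verify with a sigmoid discriminator output d = σ(l) for logit l. Under the minimax cost the generator minimizes log(1 − σ(l)), whose derivative with respect to l is −σ(l) and vanishes when the discriminator confidently rejects (l very negative); under the heuristic cost it minimizes −log σ(l), whose derivative is σ(l) − 1 and stays near −1 there. A sketch (function names are mine):

```python
import numpy as np

def sigmoid(l):
    return 1.0 / (1.0 + np.exp(-l))

def minimax_generator_grad(logit):
    # d/dl log(1 - sigmoid(l)) = -sigmoid(l): saturates to 0 as l -> -inf.
    return -sigmoid(logit)

def heuristic_generator_grad(logit):
    # d/dl -log(sigmoid(l)) = sigmoid(l) - 1: stays near -1 as l -> -inf.
    return sigmoid(logit) - 1.0

l = -10.0   # discriminator confidently rejects the generator sample
print(minimax_generator_grad(l))    # tiny magnitude, learning stalls
print(heuristic_generator_grad(l))  # close to -1, learning proceeds
```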

 Radford et al. (2015) crafted a deep convolutional GAN (DCGAN) that performs very well for image synthesis tasks, and showed that its latent representation space captures important factors of variation, as shown in figure 15.9. See figure 20.7 for examples of images generated by a DCGAN generator. The GAN learning problem can also be simplified by breaking the generation process into many levels of detail.

