DEEP LEARNING
23DS5PCDLG
• Representation Learning using Autoencoders:
Stacked Autoencoders
Convolutional Autoencoders
Recurrent Autoencoders
Denoising Autoencoders
Sparse Autoencoders
Applications of Autoencoders
• Generative Adversarial Networks (GANs):
Difficulties of training GANs
Deep Convolutional GANs
Progressive Growing of GANs
StyleGANs
• Transfer Learning and Domain Adaptation
Representation Learning Using
Autoencoders
• Autoencoders are neural networks that learn dense latent
representations (codings) of input data without supervision.
• Uses:
Dimensionality reduction: Compress data for visualization or analysis.
Feature detection: Identify key patterns in data.
Unsupervised pretraining: Prepare deep networks for better learning.
Generative capability: Can create data similar to the training data (e.g.,
new faces), though the outputs often appear fuzzy and less realistic.
• GANs: Realistic Image Generation and Applications
• Convincing Faces: GANs generate highly realistic faces.
• Check examples:
• [Link]
• [Link]
• Key Applications:
1. Super resolution: Enhance image quality.
2. Colorization: Add realistic colors to black-and-white images.
3. Image editing: Replace objects with realistic backgrounds.
4. Sketch-to-photo: Convert sketches to photorealistic images.
5. Video prediction: Predict future frames in videos.
6. Data augmentation: Generate data for training models.
7. Cross-domain generation: Create text, audio, or time-series data.
8. Model enhancement: Identify and fix weaknesses in other models.
Autoencoders and GANs are unsupervised, learn dense
representations, can be used as generative models, and they
have many similar applications. However, they work very
differently:
• Purpose:
• Autoencoders: Learn to copy inputs to outputs under constraints.
• GANs: Generate data similar to training data.
• Key Concept:
• Autoencoders: Constrained learning prevents direct copying, forcing efficient data representation.
• GANs: Adversarial training: a generator creates data, and a discriminator distinguishes real from fake.
• Output Representation:
• Autoencoders: Produce dense latent representations (codings) of input data.
• GANs: Produce realistic data samples.
• Architectural Components:
• Autoencoders: Single neural network with encoder and decoder.
• GANs: Two competing networks: generator and discriminator.
• Training Objective:
• Autoencoders: Minimize reconstruction error between inputs and outputs.
• GANs: Compete: the generator improves to fool the discriminator; the discriminator improves to detect fakes.
• Innovation:
• Autoencoders: Constraints like noise or size limits make copying difficult, leading to efficient representations.
• GANs: Adversarial training is considered one of the most significant recent advancements in ML.
• Famous Use Case:
• Autoencoders: Dimensionality reduction and data encoding.
• GANs: Generating photorealistic images, video prediction, and more.
Efficient Data Representations
• Which of the following number sequences do you find easier to
memorize?
• 40, 27, 25, 36, 81, 57, 10, 73, 19, 68
• 50, 48, 46, 44, 42, 40, 38, 36, 34, 32, 30, 28, 26, 24, 22, 20, 18, 16, 14
• The second sequence:
• Notice the pattern: it is the list of even numbers from 50 down to 14.
• It is much easier to memorize than the first because you only need to remember
the pattern (i.e., decreasing even numbers) and the starting and ending
numbers (i.e., 50 and 14).
Constraining an autoencoder during training pushes it to discover and
exploit patterns in the data.
• Autoencoders work like chess players recalling positions from
memory.
• They analyze inputs, create efficient latent representations, and
reconstruct outputs that resemble the inputs.
• Composed of two parts:
• Encoder: Encodes inputs into concise latent representations.
• Decoder: Decodes latent representations back into outputs.
The chess memory experiment (left) and a simple autoencoder (right)
• Autoencoders resemble MLPs but have input and output layers of equal
size (same number of neurons).
• Example (figure on previous slide):
• Hidden layer (encoder): 2 neurons.
• Output layer (decoder): 3 neurons.
• Outputs are called reconstructions because they attempt to replicate the inputs.
• Reconstruction loss penalizes differences between inputs and outputs.
• Undercomplete autoencoder:
• Latent representation has lower dimensionality (e.g., 2D vs. 3D input).
• Learns key features of data by dropping less important details.
Performing PCA with an Undercomplete Linear
Autoencoder
• If the autoencoder uses only linear activations and the cost function is
the mean squared error (MSE), then it ends up performing Principal
Component Analysis (PCA).
• A simple linear autoencoder can perform PCA on a 3D dataset,
projecting it to 2D:
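A minimal sketch of such a model in Keras, consistent with the notes below (the optimizer and learning rate are illustrative choices, not fixed by the slides):

```python
import numpy as np
from tensorflow import keras

# Encoder and decoder are each a regular Sequential model
# with a single linear Dense layer.
encoder = keras.models.Sequential([keras.layers.Input(shape=(3,)),
                                   keras.layers.Dense(2)])
decoder = keras.models.Sequential([keras.layers.Input(shape=(2,)),
                                   keras.layers.Dense(3)])

# The autoencoder is the encoder followed by the decoder
# (a model can be used as a layer in another model).
autoencoder = keras.models.Sequential([encoder, decoder])

# No activation functions (all neurons are linear) and MSE loss,
# so training ends up performing PCA.
autoencoder.compile(loss="mse",
                    optimizer=keras.optimizers.SGD(learning_rate=0.5))

# Sanity check: 3 inputs in, 3 outputs out; codings are 2D.
reconstructions = autoencoder.predict(np.zeros((5, 3)), verbose=0)
```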
• This code is not very different from an MLP, but a few things to note:
• The autoencoder is organized into two subcomponents: the encoder
and the decoder.
• Both are regular Sequential models with a single Dense layer each, and
the autoencoder is a Sequential model containing the encoder followed by the
decoder.
• Remember that a model can be used as a layer in another model.
• The autoencoder’s number of outputs is equal to the number of inputs
(i.e., 3).
• To perform simple PCA, do not use any activation function (i.e., all
neurons are linear), and the cost function is the MSE.
• Train the model on a simple generated 3D dataset and use it to
encode that same dataset (i.e., project it to 2D):
• Same dataset, X_train, is used as both the inputs and the targets.
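A sketch of this training step; the 3D dataset below is generated on the spot as a stand-in for X_train (it lies close to a 2D plane), and the learning rate and epoch count are illustrative:

```python
import numpy as np
from tensorflow import keras

# Synthetic 3D dataset lying close to a 2D plane (stand-in for X_train)
rng = np.random.default_rng(42)
X_train = (rng.normal(size=(200, 2)) @ rng.normal(size=(2, 3))
           + 0.05 * rng.normal(size=(200, 3)))

encoder = keras.models.Sequential([keras.layers.Input(shape=(3,)),
                                   keras.layers.Dense(2)])
decoder = keras.models.Sequential([keras.layers.Input(shape=(2,)),
                                   keras.layers.Dense(3)])
autoencoder = keras.models.Sequential([encoder, decoder])
autoencoder.compile(loss="mse",
                    optimizer=keras.optimizers.SGD(learning_rate=0.05))

# X_train is used as both the inputs and the targets
history = autoencoder.fit(X_train, X_train, epochs=30, verbose=0)

# Project the same dataset to 2D using the encoder alone
codings = encoder.predict(X_train, verbose=0)
```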
• Figure shows the original 3D dataset (on the left) and the output of
the autoencoder’s hidden layer (i.e., the coding layer, on the right).
• The autoencoder found the best 2D plane to project the data onto,
preserving as much variance in the data as it could (just like PCA).
Stacked Autoencoders
• Stacked Autoencoders:
• Have multiple hidden layers.
• Learn more complex codings.
• Also called deep autoencoders.
• Caution:
• Example: Encoder maps inputs to arbitrary numbers, and decoder
reverses this.
• Result: Perfect reconstruction of training data but no useful data
representation.
• Poor generalization to new data.
• Stacked Autoencoder Architecture:
• Symmetrical around the coding layer (central hidden layer, like a
sandwich).
• Example:
• Input layer: 784 neurons (e.g., MNIST).
• Hidden layers: 100 → 30 (coding layer) → 100 neurons.
• Output layer: 784 neurons.
Implementing a Stacked Autoencoder Using Keras
• The following code builds a stacked autoencoder for Fashion MNIST
(loaded and normalized), using the SELU activation function:
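A sketch of such a model; the layer sizes follow the 784 → 100 → 30 → 100 → 784 architecture described earlier, while the optimizer choice is illustrative:

```python
import numpy as np
from tensorflow import keras

# Encoder: 28x28 image -> flatten to 784 -> 100 -> 30-D coding vector
stacked_encoder = keras.models.Sequential([
    keras.layers.Input(shape=(28, 28)),
    keras.layers.Flatten(),
    keras.layers.Dense(100, activation="selu"),
    keras.layers.Dense(30, activation="selu"),
])
# Decoder: 30-D coding -> 100 -> 784 -> reshape back to 28x28
stacked_decoder = keras.models.Sequential([
    keras.layers.Input(shape=(30,)),
    keras.layers.Dense(100, activation="selu"),
    keras.layers.Dense(28 * 28, activation="sigmoid"),
    keras.layers.Reshape((28, 28)),
])
stacked_ae = keras.models.Sequential([stacked_encoder, stacked_decoder])

# Binary cross-entropy instead of MSE: reconstruction is treated as a
# multilabel binary classification task (each pixel intensity in [0, 1]).
stacked_ae.compile(loss="binary_crossentropy", optimizer="sgd")

# Sanity check: reconstructions have the same shape as the inputs
reconstructions = stacked_ae.predict(np.zeros((2, 28, 28)), verbose=0)
```

Training would use `stacked_ae.fit(X_train, X_train, ...)` with `X_valid` as both validation inputs and targets.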
Code Explanation:
• Model Split:
• Encoder:
• Input: 28×28 grayscale image (flattened to 784).
• Layers: Two Dense layers (100 → 30 units).
• Activation: SELU (optionally with LeCun initialization).
• Output: 30-dimensional coding vector.
• Decoder:
• Input: Coding vector (size 30).
• Layers: Two Dense layers (30 → 100 → 784 units).
• Reshapes output to 28×28 array.
• Loss Function:
• Use binary cross-entropy instead of MSE.
• Treat reconstruction as a multilabel binary classification task.
• Training:
• Inputs = Outputs: Train using X_train for both inputs and targets.
• Validate using X_valid as both inputs and targets.
Visualizing the Reconstructions
• To ensure that an autoencoder is properly trained, compare
the inputs and the outputs:
• the differences should not be too significant.
• Plot a few images from the validation set, as well as their
reconstructions:
• The reconstructions are
recognizable, but a bit too
lossy.
• Need to train the model for
longer, or make the encoder
and decoder deeper, or
make the codings larger.
• But if we make the network
too powerful, it will manage
to make perfect
reconstructions without
having learned any useful
patterns in the data.
Original images (top) and their reconstructions (bottom)
Visualizing the Fashion MNIST
Dataset
• Using a Stacked Autoencoder for Dimensionality Reduction
• Dimensionality Reduction:
• Autoencoders handle large datasets efficiently.
• Reduce dataset dimensions to a manageable size using the autoencoder.
• Visualization Strategy:
• Use the autoencoder to shrink the data to a lower dimension.
• Apply another algorithm (e.g., PCA, t-SNE) to further reduce dimensions for
visualization.
• Example:
• Apply this two-step approach to Fashion MNIST for better visualization results.
• Use the encoder from the stacked autoencoder to reduce the
dimensionality down to 30.
• Then use Scikit-Learn’s implementation of the t-SNE algorithm to
reduce the dimensionality down to 2 for visualization:
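The second, t-SNE step can be sketched as follows; the 30-dimensional codings below are random stand-ins for the encoder's actual output on the validation set:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for stacked_encoder.predict(X_valid): 500 instances, 30-D codings
rng = np.random.default_rng(0)
X_valid_compressed = rng.random((500, 30))

# Further reduce from 30 dimensions down to 2 for visualization
tsne = TSNE(n_components=2, random_state=42)
X_valid_2D = tsne.fit_transform(X_valid_compressed)  # shape: (500, 2)
```

The resulting 2D points can then be scatter-plotted, colored by class label.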
• plot the dataset:
• Figure shows the resulting scatterplot.
• The t-SNE algorithm identified several clusters which match the
classes reasonably well (each class is represented with a different
color).
Unsupervised Pretraining Using
Stacked Autoencoders
• Problem: Large dataset with mostly unlabeled data.
• Solution:
• Train a stacked autoencoder using all the data (unsupervised).
• Reuse its lower layers to build a neural network for the task.
• Train the new network using labeled data.
• If labeled data is very limited, freeze the pretrained layers (especially
the lower ones).
• Example: Use this approach for classification tasks.
Tying Weights
• Symmetrical Autoencoders: Encoder and decoder layers mirror each
other.
• Weight Tying:
• Formula: W_{N−L+1} = W_L^T, for L = 1, 2, …, N/2.
• Decoder weights are tied to encoder weights.
• N is the number of layers; W_L represents the connection weights of the Lth layer.
• Benefits:
• Reduces the number of weights (halves them).
• Speeds up training.
• Lowers the risk of overfitting.
• A custom layer to tie weights between layers using Keras:
• This custom layer acts like a regular Dense layer, but it uses another
Dense layer’s weights, transposed.
• Setting transpose_b=True is equivalent to transposing the second argument,
• but it is more efficient because the transposition is performed on the fly
within the matmul() operation.
• However, it uses its own bias vector.
Build a new stacked autoencoder but with the decoder’s Dense layers tied to the
encoder’s Dense layers:
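One possible implementation of the custom layer and the tied autoencoder (a sketch; the DenseTranspose name and the layer sizes are illustrative):

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

class DenseTranspose(keras.layers.Layer):
    """Acts like a Dense layer, but reuses another Dense layer's weights, transposed."""
    def __init__(self, dense, activation=None, **kwargs):
        super().__init__(**kwargs)
        self.dense = dense
        self.activation = keras.activations.get(activation)

    def build(self, batch_input_shape):
        # Own bias vector; its size is the tied layer's *input* dimension
        self.biases = self.add_weight(name="bias",
                                      shape=[self.dense.kernel.shape[0]],
                                      initializer="zeros")
        super().build(batch_input_shape)

    def call(self, inputs):
        # transpose_b=True transposes the kernel on the fly inside matmul()
        z = tf.matmul(inputs, self.dense.kernel, transpose_b=True)
        return self.activation(z + self.biases)

dense_1 = keras.layers.Dense(100, activation="selu")
dense_2 = keras.layers.Dense(30, activation="selu")

tied_encoder = keras.models.Sequential([
    keras.layers.Input(shape=(28, 28)),
    keras.layers.Flatten(),
    dense_1,
    dense_2,
])
tied_decoder = keras.models.Sequential([
    keras.layers.Input(shape=(30,)),
    DenseTranspose(dense_2, activation="selu"),
    DenseTranspose(dense_1, activation="sigmoid"),
    keras.layers.Reshape((28, 28)),
])
tied_ae = keras.models.Sequential([tied_encoder, tied_decoder])

# Sanity check: reconstructions have the same shape as the inputs
reconstructions = tied_ae.predict(np.zeros((2, 28, 28)), verbose=0)
```

The decoder adds only bias vectors of its own, so the weight count is roughly halved compared with an untied decoder.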
• This model achieves a very slightly lower reconstruction error with
almost half the number of parameters.
Training One Autoencoder at a Time
• Rather than training the whole stacked autoencoder in one go, it is
possible to train one shallow autoencoder at a time, then stack all of
them into a single stacked autoencoder.
• Phase 1:
• Train the first autoencoder to reconstruct inputs.
• Encode the entire dataset using this trained autoencoder, creating a compressed
dataset.
• Phase 2:
• Train the second autoencoder on the compressed dataset.
• Final Model:
• Stack hidden layers from all autoencoders.
• Add output layers in reverse order.
• This forms a deep stacked autoencoder.
• This process allows easy extension by adding more autoencoders for greater
depth.
Convolutional Autoencoders
• Why Use CNNs?
• Dense autoencoders struggle with large images.
• CNNs handle images better by capturing spatial hierarchies.
• Encoder:
• Composed of convolutional and pooling layers.
• Reduces height and width while increasing depth (feature maps).
• Decoder:
• Reverses the process to restore original dimensions.
• Uses transpose convolutional layers or a combination of upsampling and
convolutional layers.
• This architecture works well for tasks like unsupervised pretraining or
dimensionality reduction.
Recurrent Autoencoders
• To build an autoencoder for sequences, such as time series or text,
recurrent neural networks are better suited than dense networks.
• Building a recurrent autoencoder:
• The encoder is a sequence-to-vector RNN which compresses the input
sequence down to a single vector.
• The decoder is a vector-to-sequence RNN that does the reverse
• Sequence Processing: Handles sequences of any length with 28
dimensions per time step.
• Fashion MNIST Example:
• Treats each image as a sequence of rows (each row = 28 pixels).
• Processes one row at a time using the RNN.
• General Use: Suitable for any kind of sequence.
• Decoder Setup: Includes a RepeatVector layer to ensure the input
vector is fed at every time step.
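A sketch of such a recurrent autoencoder (the LSTM cells and layer sizes are illustrative choices):

```python
import numpy as np
from tensorflow import keras

# Encoder: sequence-to-vector RNN
# (sequences of any length, 28 dimensions per time step -> one 30-D vector)
recurrent_encoder = keras.models.Sequential([
    keras.layers.Input(shape=(None, 28)),
    keras.layers.LSTM(100, return_sequences=True),
    keras.layers.LSTM(30),
])
# Decoder: vector-to-sequence RNN
recurrent_decoder = keras.models.Sequential([
    keras.layers.Input(shape=(30,)),
    keras.layers.RepeatVector(28),  # feed the coding vector at every time step
    keras.layers.LSTM(100, return_sequences=True),
    keras.layers.TimeDistributed(
        keras.layers.Dense(28, activation="sigmoid")),
])
recurrent_ae = keras.models.Sequential([recurrent_encoder, recurrent_decoder])

# Each Fashion MNIST image is treated as a sequence of 28 rows of 28 pixels
reconstructions = recurrent_ae.predict(
    np.zeros((2, 28, 28), dtype="float32"), verbose=0)
```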
Overcomplete autoencoder
• An overcomplete autoencoder has a coding layer with more
dimensions than the input data.
• It risks simply copying inputs, so regularization techniques like sparsity
or noise are used to ensure meaningful feature learning.
• This makes it useful for capturing rich and detailed representations of
the data.
Denoising Autoencoders
• Adding noise to inputs trains an autoencoder to recover the original
noise-free inputs.
• This technique dates back to the 1980s; it appeared, for example, in Yann LeCun’s 1987 thesis.
• Pascal Vincent's 2008 paper demonstrated autoencoders for feature
extraction.
• In 2010, Vincent introduced stacked denoising autoencoders.
• The noise can be Gaussian or randomly switched-off inputs, like
dropout.
Denoising autoencoders, with Gaussian noise (left) or dropout (right)
• Implementation:
• A regular stacked autoencoder with an additional Dropout layer
applied to the encoder’s inputs (or GaussianNoise layer instead).
• The Dropout layer is only active during training (and so is the
GaussianNoise layer):
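A sketch of the dropout-based variant; the layer sizes follow the stacked autoencoder described earlier, and the 50% dropout rate is an illustrative choice:

```python
import numpy as np
from tensorflow import keras

# Regular stacked autoencoder plus a Dropout layer on the encoder's inputs.
# (A GaussianNoise layer could be used instead; both are active only
# during training.)
denoising_encoder = keras.models.Sequential([
    keras.layers.Input(shape=(28, 28)),
    keras.layers.Flatten(),
    keras.layers.Dropout(0.5),  # randomly switch off half the input pixels
    keras.layers.Dense(100, activation="selu"),
    keras.layers.Dense(30, activation="selu"),
])
denoising_decoder = keras.models.Sequential([
    keras.layers.Input(shape=(30,)),
    keras.layers.Dense(100, activation="selu"),
    keras.layers.Dense(28 * 28, activation="sigmoid"),
    keras.layers.Reshape((28, 28)),
])
denoising_ae = keras.models.Sequential([denoising_encoder, denoising_decoder])
denoising_ae.compile(loss="binary_crossentropy", optimizer="sgd")

# At inference time Dropout is inactive, so reconstruction is deterministic
reconstructions = denoising_ae.predict(np.zeros((2, 28, 28)), verbose=0)
```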
• Figure shows noisy images with half the pixels turned off and their
reconstructions by a dropout-based denoising autoencoder.
• The autoencoder fills in missing details, like the top of the white shirt
(bottom row, fourth image).
• Denoising autoencoders are useful for data visualization,
unsupervised pretraining, and efficiently removing noise from images.
Noisy images (top) and their reconstructions (bottom)
Sparse Autoencoders
• Sparsity can be used as a constraint in autoencoders for better feature
extraction.
• By adding a term to the cost function, the autoencoder is encouraged
to reduce the number of active neurons in the coding layer (e.g., only
5% active).
• This forces the autoencoder to represent each input with a small
number of activations, making each neuron in the coding layer
represent a useful feature.
• Approach:
• Use the sigmoid activation function in the coding layer (to constrain
the codings to values between 0 and 1)
• Use a large coding layer (e.g., with 300 units), and add some ℓ1
regularization to the coding layer’s activations (the decoder is just a
regular decoder):
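A sketch of this setup; the 300-unit sigmoid coding layer and the ℓ1 factor of 1e-3 follow the description above, while the other layer sizes are illustrative:

```python
import numpy as np
from tensorflow import keras

sparse_l1_encoder = keras.models.Sequential([
    keras.layers.Input(shape=(28, 28)),
    keras.layers.Flatten(),
    keras.layers.Dense(100, activation="selu"),
    keras.layers.Dense(300, activation="sigmoid"),  # large coding layer, codings in [0, 1]
    keras.layers.ActivityRegularization(l1=1e-3),   # adds sum(|activations|) to the loss
])
# The decoder is just a regular decoder
sparse_l1_decoder = keras.models.Sequential([
    keras.layers.Input(shape=(300,)),
    keras.layers.Dense(100, activation="selu"),
    keras.layers.Dense(28 * 28, activation="sigmoid"),
    keras.layers.Reshape((28, 28)),
])
sparse_l1_ae = keras.models.Sequential([sparse_l1_encoder, sparse_l1_decoder])
sparse_l1_ae.compile(loss="binary_crossentropy", optimizer="sgd")

reconstructions = sparse_l1_ae.predict(np.zeros((2, 28, 28)), verbose=0)
```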
• The ActivityRegularization layer adds a training loss equal to the sum
of absolute values of its inputs.
• Alternatively, you can set
activity_regularizer=keras.regularizers.l1(1e-3) in the previous layer.
• This regularization encourages the network to produce codings close
to 0, but still allows for necessary nonzero values for reconstruction.
• The ℓ1 norm pushes unnecessary codings to exactly 0 while preserving
the important ones, unlike the ℓ2 norm, which merely shrinks all codings.
Variational Autoencoders
• They are probabilistic autoencoders, meaning that their outputs are
partly determined by chance, even after training (as opposed to
denoising autoencoders, which use randomness only during training).
• Most importantly, they are generative autoencoders, meaning that
they can generate new instances that look like they were sampled
from the training set.
Applications of Autoencoders
• Autoencoders are widely used for dimensionality reduction and
information retrieval.
• Hinton and Salakhutdinov (2006) trained stacked RBMs to initialize a
deep autoencoder with progressively smaller hidden layers, ending in
a 30-unit bottleneck.
• Their approach achieved lower reconstruction error than PCA with 30
dimensions.
• The learned representations were more interpretable, forming well-
separated clusters corresponding to underlying categories.
• Lower-dimensional representations enhance performance in tasks like
classification.
• Smaller models use less memory and have faster runtime.
• Dimensionality reduction techniques group semantically related
examples together.
• These lower-dimensional mappings help improve generalization.
• Information retrieval benefits from dimensionality reduction.
• Reduced dimensions improve efficiency for database searches.
• A binary low-dimensional code enables storing database entries in a
hash table for fast retrieval.
• Semantic hashing allows retrieving entries with identical binary codes
or slightly less similar ones by flipping bits.
• This approach has been applied to text and images.
• Binary codes for semantic hashing are generated using an encoder
with sigmoids in the final layer.
• During training, additive noise is injected before the sigmoid
activation, gradually increasing over time.
• To counter the noise, the network amplifies the inputs to the sigmoid,
forcing saturation near 0 or 1.
• Extensions to this idea include training representations to optimize
task-specific loss for finding nearby examples in the hash table.
Transfer Learning and Domain Adaptation
• Transfer Learning vs. Domain Adaptation
• Both involve using knowledge from one distribution (P1) to improve
performance in another (P2).
• Transfer learning: applies knowledge across tasks or settings.
• Domain adaptation: specifically adjusts for differences between
distributions (P1 and P2).
• Builds on ideas of transferring representations between unsupervised
and supervised tasks.
Transfer Learning
• Different tasks: Learner handles two or more distinct tasks.
• Shared relevance: Factors explaining variations in P1 are helpful for learning P2.
• Example:
• Task 1: Classify cats and dogs (P1).
• Task 2: Classify ants and wasps (P2).
• Shared low-level features: edges, shapes, lighting effects.
• Benefit: Large data in P1 helps train representations that generalize well for P2,
even with limited data.
• Representation learning: Achieves transfer learning, multi-task learning, and
domain adaptation by leveraging features that work across tasks or domains.
• Illustration: Shared lower layers for common features; task-specific upper layers for
distinct outputs.
• Task Sharing in Outputs
• Output Semantics: Shared among tasks, not input
semantics.
• Example:
• Speech Recognition: Outputs valid sentences, but
input layers adapt to different speakers (phoneme
variations).
• Approach:
• Share upper layers (output-related).
• Use task-specific preprocessing for input variations.
• Scenario:
• Input (x): Different meanings/dimensions for each task (e.g., x(1), x(2), x(3)).
• Output (y): Same semantics for all tasks.
• Design:
• Lower Levels:
• Task-specific up to a selection switch.
• Translate task-specific inputs into a generic feature set.
• Upper Levels:
• Shared across tasks.
• Process generic features to produce consistent outputs.
Domain Adaptation
• Definition: The task and optimal input-to-output mapping remain the
same, but the input distribution varies across settings.
• Example:
• Task: Sentiment analysis (positive, neutral, or negative sentiment).
• Domains:
• Training: Customer reviews of books, videos, music.
• Application: Comments about electronics like TVs or smartphones.
• Challenge: Vocabulary and style differ across domains, affecting
generalization.
• Solution:
• Unsupervised Pretraining:
• Use techniques like denoising autoencoders.
• Found effective for sentiment analysis in domain adaptation.
• Concept Drift
• Definition: A gradual change in the data distribution over time.
• Relation to Transfer Learning: Viewed as a form of transfer learning
where models adapt to changing distributions.
• Connection to Multi-Task Learning:
• Both concept drift and transfer learning can be seen as specific forms of
multi-task learning.
• Multi-task learning usually refers to supervised tasks, while transfer learning
applies to unsupervised and reinforcement learning too.
• Objective Across Settings
• Leverage data from one setting to improve learning or predictions in
another.
• Representation learning enables shared representations across
settings.
• Shared representations benefit from training data of both tasks.
• Unsupervised Deep Learning in Transfer Learning
• Used successfully in machine learning competitions (Mesnil et al.,
2011; Goodfellow et al., 2011).
• Participants learn a feature space from the first setting (P1) and apply
it to the transfer setting (P2).
• A linear classifier is trained with few labeled examples in P2.
• Deeper representations from P1 improve generalization in P2,
reducing the need for labeled data.
• Forms of transfer learning
• One-Shot Learning: Only one labeled example is provided for the
transfer task.
• The representation cleanly separates classes.
• Only one labeled example is needed to infer the label of many test
examples.
• Works if factors of variation are well-separated in the learned
representation space.
• Zero-Shot Learning: No labeled examples are provided for the transfer
task.
• Learner uses textual descriptions to solve recognition problems.
• Can recognize an object without seeing it if described well (e.g.,
recognizing a cat from text about its features).
• Zero-Data Learning:
• Involves three variables: inputs (x), outputs (y), and task description (T).
• Trained to estimate p(y∣x,T), where T describes the task.
• Example: Recognizing cats after reading text about cats, using task T like
"Is there a cat in this image?"
• Key Point:
• The model leverages unsupervised examples to infer meanings of unseen
task instances.
• Text like "cats have four legs" aids in recognizing cats without prior
images.
• Zero-shot Learning:
• Requires a generalizable representation for task T (not just a one-hot
code). Example: use word embeddings to represent object categories.
• Machine Translation:
• Words in one language are represented via distributed embeddings.
• Even without direct translation examples, a model can infer translations
by learning relationships between two languages through matched
sentence pairs.
• Successful transfer requires jointly learning word representations and
their relationships.
• Zero-shot Learning wrt Multi-modality:
• A form of transfer learning that involves multi-modal learning.
• It captures representations in one modality and the relationship between
pairs (x, y) from different modalities.
• Multi-modal Learning:
• Learn three sets of parameters:
• From x to its representation.
• From y to its representation.
• The relationship between the two representations.
• Concepts from one modality are anchored in the other, enabling
generalization to new pairs.
• Representation Functions:
• Learn functions f_x for domain x and f_y for domain y using labeled/unlabeled examples.
• Distance in Representation Space:
• Distances in the h_x and h_y representation spaces provide similarity metrics, offering more
meaningful comparisons than the raw x or y space.
• Similarity Functions:
• Bidirectional dotted arrows represent similarity between points in x and y spaces.
• Learning Maps:
• Labeled pairs (x, y) help learn maps between f_x(x) and f_y(y), anchoring the representations.
• Zero-Data Learning:
• Enables associating test images and words without prior pairing by relating their feature vectors
through learned maps.
---XXX---