1.
A) Write a short note on i) Auto Encoders ii) Deep Generative Models
Ans) i) Auto Encoders: An autoencoder is a type of neural network designed to closely replicate its input in its
output. It has two main parts: an encoder that compresses the input into a reduced form (or embedding) and a
decoder that reconstructs the input from this compressed representation. This structure allows the network to
learn a simplified, latent representation of data.
Training an autoencoder involves providing it with input-target pairs, with the goal being to minimize the loss
(difference) between the original and reconstructed inputs. For instance, in a game analogy, one child (encoder)
tries to communicate an image’s features to another child (decoder), who then tries to recreate it. Any error in the
recreation measures how well they communicated.
Architecture
Let’s explore the details of the architecture of the auto-encoder. An auto-encoder consists of
three main components:
1. Encoder
2. Code or Embedding
3. Decoder
Autoencoders can perform tasks such as anomaly detection (detecting data that differs significantly from the
norm) and style transfer. Key points about autoencoders include:
1. Data-Specific Compression: They perform best on data similar to their training data.
2. Unsupervised Learning: They require no labeled data.
3. Lossy Compression: Reconstructed data might lack some details.
To optimize autoencoders, constraints like fewer nodes in layers, added noise, and regularization are applied to
prevent overfitting. Key hyperparameters to adjust include the number of layers, nodes in the code layer, and loss
function (often Mean Squared Error or Binary Cross Entropy).
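The encoder/decoder structure and the MSE training objective above can be sketched in a few lines of NumPy. This is a minimal illustrative linear autoencoder; the sizes, learning rate, and toy data are arbitrary choices, not from the text:

```python
import numpy as np

# A minimal linear autoencoder: 4-D inputs compressed to a 2-D code.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                # toy dataset

W_enc = rng.normal(scale=0.1, size=(4, 2))   # encoder weights
W_dec = rng.normal(scale=0.1, size=(2, 4))   # decoder weights

lr = 0.05
for _ in range(500):
    code = X @ W_enc                         # encoder: compress to the embedding
    X_hat = code @ W_dec                     # decoder: reconstruct the input
    err = X_hat - X                          # reconstruction error
    # Gradients of the mean-squared-error loss w.r.t. both weight matrices
    grad_dec = code.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

mse = np.mean((X @ W_enc @ W_dec - X) ** 2)
```

The 2-D code layer forces a lossy compression: the reconstruction cannot be perfect, which is exactly the "lossy compression" property noted above.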
Types of Auto Encoders
There are multiple types or variants of auto-encoders that are developed for specific tasks.
1. Vanilla Autoencoder
2. Deep autoencoder
3. Convolutional autoencoder
4. Denoising autoencoder
5. Variational autoencoder
The feature learned by the auto-encoder can be used for other tasks like image classification
or text classification. It is also useful for dimensionality reduction or compression of the data
which can be important in some applications.
ii) Deep Generative Models:
Generative models have surged in popularity, thanks to their ability to learn complex data distributions in
unsupervised settings. They aim to approximate the underlying distribution of training data and generate new,
similar data points. However, capturing the true data distribution is challenging, so neural networks are typically
used to approximate the distribution as closely as possible. The two leading generative approaches are Variational
Autoencoders (VAEs) and Generative Adversarial Networks (GANs), each with unique training techniques and
use cases.
Variational Autoencoder (VAE)
A VAE is a probabilistic model rooted in Bayesian inference. It captures the data distribution by learning a low-
dimensional latent representation that reflects the characteristics of the training data. In VAEs:
1. Encoder: Encodes input data to a lower-dimensional latent vector. Instead of encoding fixed values, VAE
outputs parameters of a probability distribution (e.g., mean and variance), allowing randomness in the
representation.
2. Decoder: Decodes the latent vector back to the original data format.
3. Objective: Maximizes the likelihood of the data by balancing two losses:
- Reconstruction Loss: Ensures generated samples are similar to the input.
- KL-Divergence Loss: Ensures the encoded latent vectors follow a Gaussian distribution.
Reparameterization Trick: To make the network differentiable and enable backpropagation through random
nodes, the VAE samples latent variables through a Gaussian transformation, ensuring stability during training.
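The trick can be shown concretely. In this sketch, mu and log_var stand in for the outputs of an encoder network (the values are illustrative assumptions, not from the text):

```python
import numpy as np

# Reparameterization trick: instead of sampling z ~ N(mu, sigma^2) directly
# (which blocks backpropagation through the random node), sample
# eps ~ N(0, 1) and set z = mu + sigma * eps, so z is a differentiable
# function of mu and sigma while all randomness is isolated in eps.
rng = np.random.default_rng(0)
mu, log_var = 1.5, np.log(0.25)        # illustrative encoder outputs
sigma = np.exp(0.5 * log_var)          # standard deviation from log-variance

eps = rng.standard_normal(100_000)     # randomness lives only here
z = mu + sigma * eps

# KL divergence between N(mu, sigma^2) and the standard normal prior N(0, 1)
kl = 0.5 * (mu**2 + sigma**2 - 1.0 - log_var)
```

The empirical mean and standard deviation of z match mu and sigma, confirming that the transformed samples follow the intended distribution.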
Final objective function of VAE:
L(θ, φ) = E_{q_φ(z|x)}[log p_θ(x|z)] − D_KL(q_φ(z|x) || p(z))
i.e., the reconstruction term minus the KL-divergence term.
Limitations of VAEs
Although VAEs capture the underlying data distribution well, they often produce blurry images because of their
assumptions and approximations, particularly when generating detailed images. This is where GANs excel, as
they are better suited for creating high-resolution, realistic images.
Generative Adversarial Networks (GANs)
GANs consist of two competing neural networks:
1. Generator: Generates synthetic data.
2. Discriminator: Differentiates between real and fake data.
These networks are trained in a minimax game where the generator aims to create data that can fool the
discriminator, while the discriminator improves its accuracy in distinguishing real from fake data. GANs are
flexible and powerful, generating high-quality images without the need for explicit density estimation.
The objective function of a GAN is the minimax value function:
min_G max_D V(D, G) = E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 − D(G(z)))]
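To make the value function concrete, the sketch below evaluates V(D, G) on toy 1-D data with a fixed, hand-made sigmoid discriminator. The score function and the two Gaussians are illustrative assumptions, not a trained GAN:

```python
import numpy as np

# Evaluating V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))] on toy data.
# The "discriminator" here is a fixed logistic score, not a trained network;
# it only illustrates what each player in the minimax game is optimizing.
rng = np.random.default_rng(0)

real = rng.normal(loc=4.0, scale=1.0, size=5000)   # samples from p_data
fake = rng.normal(loc=0.0, scale=1.0, size=5000)   # generator output G(z)

def discriminator(x):
    # Illustrative sigmoid that scores how "real" a sample looks
    return 1.0 / (1.0 + np.exp(-(x - 2.0)))

V = np.mean(np.log(discriminator(real))) + np.mean(np.log(1.0 - discriminator(fake)))
```

With well-separated real and fake data and a sensible discriminator, V stays well above -log 4, the equilibrium value reached when the generator matches the data distribution exactly.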
DCGAN (Deep Convolutional GAN)
An early variant of GANs, DCGAN uses convolutional layers and batch normalization for stabilized training,
achieving realistic image generation. Conditional GANs (cGANs) further extend GANs by allowing user-defined
conditions on the generated data, such as generating images with specific attributes.
GAN Training Challenges
Training GANs can be complex due to:
- Instability: Balancing the generator and discriminator to avoid mode collapse (where the generator only outputs
limited patterns).
- Hyperparameter Sensitivity: Tuning GANs requires careful adjustment of parameters for convergence and
quality.
Conclusion
Both VAEs and GANs are powerful unsupervised learning frameworks with unique benefits and limitations.
VAEs provide a structured approach with lower variational bounds, but GANs excel in generating high-quality,
realistic data and continue to push the boundaries in generative tasks.
B) Describe in detail how to compute the gradient in a recurrent neural network?
Ans) Computing gradients in a Recurrent Neural Network (RNN) involves a process known as Backpropagation
Through Time (BPTT). This process extends the usual backpropagation in feedforward neural networks to
account for the temporal dependencies in RNNs. Here’s a breakdown of the steps and concepts involved in
computing gradients in RNNs:
1. Unfold the network: since the same weights are applied across all time steps, we treat each step in the
sequence as part of a single unfolded computation graph, where each time step has its own node in the graph
but shares weights with the other nodes.
2. Forward pass: compute the hidden states h_t = f(W_hh h_(t-1) + W_xh x_t) and the per-step losses L_t for
every time step, storing the intermediate states.
3. Backward pass: starting from the last time step and moving backwards, propagate dL/dh_t; at each step it
combines the gradient of the local loss L_t with the gradient flowing back from step t+1 through the recurrent
weights.
4. Accumulate the gradients: because the weights are shared, the total gradient dL/dW is the sum of the
contributions from every time step; these summed gradients are then used to update the weights.
Key Challenges in RNN Gradient Computation
1. Vanishing/Exploding Gradients:
o Gradients tend to shrink or grow exponentially as they propagate through many time steps.
o Solutions include techniques like gradient clipping, using gated architectures like LSTMs and
GRUs, or initializing weights carefully.
2. Computational Cost:
o BPTT is computationally intensive, especially for long sequences. Often, truncated BPTT is used,
where the sequence is broken into smaller segments, and gradients are only backpropagated over a
limited number of steps.
3. Recurrent Nature of Dependencies:
o Unlike in feedforward networks, the recurrent dependencies mean that each time step’s gradient
depends not only on its loss but also on all future losses in the sequence, making the computation
and analysis more complex.
By following BPTT, we can update the weights of an RNN so that it learns temporal patterns in sequential data,
optimizing for the sequence prediction task.
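The BPTT procedure above can be sketched for a tiny vanilla RNN and verified against a numerical gradient. The shapes, the squared-error loss, and the tanh recurrence are illustrative choices:

```python
import numpy as np

# BPTT for a tiny vanilla RNN:
# h_t = tanh(W_hh h_{t-1} + W_xh x_t), loss = sum_t 0.5 * ||h_t - y_t||^2.
rng = np.random.default_rng(0)
T, H = 5, 3                                  # sequence length, hidden size
xs = rng.normal(size=(T, H))
ys = rng.normal(size=(T, H))
W_hh = rng.normal(scale=0.5, size=(H, H))
W_xh = rng.normal(scale=0.5, size=(H, H))

def forward(W_hh):
    hs, h = [], np.zeros(H)
    for t in range(T):
        h = np.tanh(W_hh @ h + W_xh @ xs[t])
        hs.append(h)
    loss = 0.5 * sum(np.sum((h - y) ** 2) for h, y in zip(hs, ys))
    return hs, loss

hs, loss = forward(W_hh)

# Backward pass: walk the unrolled graph from t = T-1 down to 0, carrying
# dL/dh_t, which mixes this step's loss with all future losses.
dW_hh = np.zeros_like(W_hh)
dh_next = np.zeros(H)
for t in reversed(range(T)):
    dh = (hs[t] - ys[t]) + dh_next           # local loss + future gradient
    da = dh * (1.0 - hs[t] ** 2)             # through the tanh nonlinearity
    h_prev = hs[t - 1] if t > 0 else np.zeros(H)
    dW_hh += np.outer(da, h_prev)            # shared weights: accumulate
    dh_next = W_hh.T @ da                    # propagate to step t-1

# Numerical check of one entry of dL/dW_hh by central differences
eps = 1e-6
Wp = W_hh.copy(); Wp[0, 1] += eps
Wm = W_hh.copy(); Wm[0, 1] -= eps
num_grad = (forward(Wp)[1] - forward(Wm)[1]) / (2 * eps)
```

Note how dW_hh accumulates one term per time step because the weights are shared across the unrolled graph, and how dh_next carries the gradient of all future losses back to the previous step.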
2. A) Explain the following terms i) Bi-directional RNN ii) LSTM
Ans) i) Bi-directional RNN: A typical state in an RNN (simple RNN, GRU, or LSTM) relies on the
past and the present events. A state at time t depends on the inputs x_1, x_2, …, x_(t-1) and x_t. However, there
can be situations where a prediction depends on past, present, and future events.
For example, predicting a word to be included in a sentence might require us to look into the
future, i.e., a word in a sentence could depend on a future event. Such linguistic dependencies are
customary in several text prediction tasks.
Take speech recognition. When you use a voice assistant, you initially utter a few words after which
the assistant interprets and responds. This interpretation may not entirely depend on the preceding
words; the whole sequence of words can make sense only when the succeeding words are analyzed.
Thus, capturing and analyzing both past and future events is helpful in the above- mentioned
scenarios.
To enable straight (past) and reverse traversal of input (future), Bidirectional RNNs, or BRNNs, are
used. A BRNN is a combination of two RNNs - one RNN moves forward, beginning from the start
of the data sequence, and the other, moves backward, beginning from the end of the data sequence.
The network blocks in a BRNN can either be simple RNNs, GRUs, or LSTMs.
A BRNN has an additional hidden layer to accommodate the backward training process.
The training of a BRNN is similar to the Back-Propagation Through Time (BPTT) algorithm. BPTT is
the back-propagation algorithm used while training RNNs. A typical BPTT algorithm works as
follows:
• Unroll the network and compute errors at every time step.
• Roll-up the network and update weights.
In a BRNN, however, since forward and backward passes happen simultaneously, updating
the weights for the two processes could occur at the same point in time, which leads to erroneous
results. Thus, to handle the forward and backward passes separately, the following algorithm is
used for training a BRNN:
Forward Pass
• Forward states (from t = 1 to N) and backward states (from t = N to 1) are passed.
• Output neuron values are passed (from t = 1 to N).
Backward Pass
• Output neuron values are passed (from t = N to 1).
• Forward states (from t = N to 1) and backward states (from t = 1 to N) are passed.
Both the forward and backward passes together train a BRNN.
Example: Bi Directional RNN for word sequence
Consider the word sequence “I love mango juice”. The forward layer would feed the sequence as
such. But, the Backward Layer would feed the sequence in the reverse order “juice mango love I”.
Now, the outputs would be generated by concatenating the word sequences at each time and
generating weights accordingly. This can be used for POS tagging problems as well.
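The "I love mango juice" example can be sketched directly. Random embeddings and weights stand in for a trained model; the point is only the forward/backward traversal and the concatenation at each step:

```python
import numpy as np

# Bidirectional pass over the toy sequence "I love mango juice".
rng = np.random.default_rng(0)
words = ["I", "love", "mango", "juice"]
E = {w: rng.normal(size=4) for w in words}     # toy word embeddings
W = rng.normal(scale=0.3, size=(3, 3))         # recurrent weights
U = rng.normal(scale=0.3, size=(3, 4))         # input weights

def run(seq):
    h, states = np.zeros(3), []
    for w in seq:
        h = np.tanh(W @ h + U @ E[w])
        states.append(h)
    return states

fwd = run(words)                 # fed as "I love mango juice"
bwd = run(words[::-1])[::-1]     # fed as "juice mango love I", then realigned

# Output at each time step: concatenation of the two directional states
outputs = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

Each of the four positions now carries a 6-dimensional output that summarizes both its left context and its right context, which is what a POS tagger would consume.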
Applications
BRNN is useful for the following applications:
• Handwriting Recognition
• Speech Recognition
• Dependency Parsing
• Natural Language Processing
The bidirectional traversal idea can also be extended to 2D inputs such as images. We can have four
RNNs, each denoting one direction. Unlike a Convolutional Neural Network (CNN), a BRNN can
capture long-term dependencies between the image feature maps.
ii) LSTM (Long Short Term Memory):
A small RNN can be used effectively because there is no problem of vanishing gradients. But with long
sequences there is not much a traditional RNN can do, which is why it was not widely used. This is what
led to the development of LSTMs, which use a slightly different neuron structure designed with one basic
goal in mind: the gradients should not vanish even if the sequence is very long.
Long Short Term Memory networks (LSTMs) are a special kind of recurrent neural network
capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber in
1997. Remembering information for long periods of time is their default behavior. An LSTM
is made up of a memory cell, an input gate, an output gate and a forget gate.
The memory cell is responsible for remembering the previous state while the gates are responsible for
controlling the amount of memory to be exposed.
The memory cell keeps track of the dependencies between the elements in the input sequence. The
present input and the previous hidden state are passed to the forget gate, and the output of this
forget gate is applied to the previous cell state. After that, the output from the input gate is also added
to the cell state. Using this updated cell state, the output gate operates and generates the output.
Forget Gate
There is some information from the previous cell state that is not needed by the present unit in an
LSTM. The forget gate is responsible for removing this information from the cell state. Information
that is no longer required for the LSTM, or that is of less importance, is
removed via multiplication by a filter. This is required for optimizing the performance of the LSTM
network. In other words, the forget gate determines how much of the previous state is to be passed to
the next state.
The gate has two inputs, x_t and h_(t-1), where h_(t-1) is the output of the previous cell and x_t is the
input at that particular time step. These inputs are multiplied by the weight matrix and a bias is added.
Following this, the sigmoid activation function is applied to this value.
Input Gate
The process of adding new information takes place in the input gate. Here a combination of x_t and
h_(t-1) is passed through sigmoid and tanh activation functions. The tanh function creates a vector
containing all the candidate values that could be added (as perceived from h_(t-1) and x_t) to the cell
state, while the sigmoid filter ensures that only information which is important and not redundant is
added to the cell state.
Output Gate
A vector is created by applying the tanh function to the cell state. Then a filter is made using the values
of h_(t-1) and x_t, such that it can regulate the values that need to be output from the vector created
above. This filter again employs a sigmoid function. The two are then multiplied to form the output
of that cell.
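The three gates described above can be sketched as a single LSTM cell step in NumPy (weight scales, sizes, and inputs are arbitrary illustrative values):

```python
import numpy as np

# One LSTM cell step: sigmoid gates decide what to forget, what to write,
# and what to expose from the cell state.
rng = np.random.default_rng(0)
H = 4                                   # hidden size (and input size, for brevity)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x_t = rng.normal(size=H)                # input at time t
h_prev = rng.normal(size=H)             # h_(t-1)
c_prev = rng.normal(size=H)             # previous cell state

z = np.concatenate([h_prev, x_t])       # shared input to all four transforms
W_f, W_i, W_c, W_o = (rng.normal(scale=0.3, size=(H, 2 * H)) for _ in range(4))

f = sigmoid(W_f @ z)                    # forget gate: keep/drop old memory
i = sigmoid(W_i @ z)                    # input gate: how much to write
c_tilde = np.tanh(W_c @ z)              # candidate values to add
c_t = f * c_prev + i * c_tilde          # updated cell state
o = sigmoid(W_o @ z)                    # output gate: what to expose
h_t = o * np.tanh(c_t)                  # new hidden state
```

The four weight matrices W_f, W_i, W_c, W_o are also why an LSTM has roughly four times the parameters of a plain RNN.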
Most of the remarkable results achieved with recurrent neural networks have been obtained using
LSTMs. The real magic behind LSTM networks is that they achieve almost human-level
sequence generation quality, without any magic at all.
In an LSTM, the 3 sigmoid and 1 tanh activation functions, whose input is a concatenation of h_(t-1)
and x_t, each have different weights associated with them, say W_f, W_i, W_c and W_o. The total
number of parameters required for training an LSTM is therefore about 4 times larger than for a
normal RNN, so the computational cost is much higher. To address this problem, the GRU was
invented.
B) Explain in detail about the Sequence-to-Sequence RNN architecture.
Ans) A sequence to sequence model lies behind numerous systems that you face on a daily basis.
For instance, the seq2seq model powers applications like Google Translate, voice-enabled devices and
online chatbots.
Definition: Introduced for the first time in 2014 by Google, a sequence to sequence model aims to
map an input sequence to an output sequence, where the lengths of the input and output may
differ.
For example, translating “What are you doing today?” from English to Chinese has input of 5 words
and output of 7 symbols (今天你在做什麼?). Clearly, we can’t use a regular LSTM network to map
each word from the English sentence to the Chinese sentence.
This is why the sequence to sequence model is used to address problems like that one.
In order to fully understand the model’s underlying logic, we will go over its structure.
The model consists of 3 parts: the encoder, the intermediate (encoder) vector and the decoder.
Encoder
• A stack of several recurrent units (LSTM or GRU cells for better performance)
where each accepts a single element of the input sequence, collects information
for that element and propagates it forward.
• In question-answering problem, the input sequence is a collection of all words
from the question. Each word is represented as x_i where i is the order of that
word.
• The hidden states h_i are computed using the formula:
h_t = f(W^(hh) h_(t-1) + W^(hx) x_t)
This simple formula represents the result of an ordinary recurrent neural network. As you can see, we
just apply the appropriate weights to the previous hidden state h_(t-1) and the input vector x_t.
Encoder Vector
• This is the final hidden state produced from the encoder part of the model. It is
calculated using the formula above.
• This vector aims to encapsulate the information for all input elements in order to
help the decoder make accurate predictions.
• It acts as the initial hidden state of the decoder part of the model.
Decoder
• A stack of several recurrent units where each predicts an output y_t at a time
step t.
• Each recurrent unit accepts a hidden state from the previous unit and produces an
output as well as its own hidden state.
• In the question-answering problem, the output sequence is a collection of all
words from the answer. Each word is represented as y_i where i is the order of
that word.
• Any hidden state h_i is computed using the formula:
h_t = f(W^(hh) h_(t-1))
As you can see, we are just using the previous hidden state to compute the next one.
• The output y_t at time step t is computed using the formula:
y_t = softmax(W(S) h_t)
We calculate the outputs using the hidden state at the current time step together with the respective
weight W(S). Softmax is used to create a probability vector which will help us determine the final
output (e.g. the word in the question-answering problem).
The power of this model lies in the fact that it can map sequences of different lengths to each
other. The inputs and outputs do not need to be aligned one-to-one, and their lengths can differ.
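A hedged sketch of the whole encoder/vector/decoder pipeline, with random weights standing in for a trained model, a 5-step input and a 3-step output to show that the lengths may differ:

```python
import numpy as np

# Minimal seq2seq sketch: an encoder RNN compresses a length-5 input into
# one context vector, which seeds a decoder that emits a length-3 output.
rng = np.random.default_rng(0)
H, V = 4, 6                                  # hidden size, output vocabulary size

W_e = rng.normal(scale=0.3, size=(H, H))     # encoder recurrent weights
U_e = rng.normal(scale=0.3, size=(H, H))     # encoder input weights
W_d = rng.normal(scale=0.3, size=(H, H))     # decoder recurrent weights
W_s = rng.normal(scale=0.3, size=(V, H))     # output projection W(S)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Encoder: fold a 5-step input sequence into the final hidden state
h = np.zeros(H)
for x_t in rng.normal(size=(5, H)):
    h = np.tanh(W_e @ h + U_e @ x_t)
context = h                                  # the encoder vector

# Decoder: unroll 3 steps from the context, emitting a token distribution
h, outputs = context, []
for _ in range(3):
    h = np.tanh(W_d @ h)
    outputs.append(softmax(W_s @ h))         # y_t = softmax(W(S) h_t)
```

Every decoder step yields a proper probability vector over the 6-token vocabulary, and the input length (5) and output length (3) are decoupled, which is the whole point of the architecture.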
3. A) Discuss the regularization and under-constrained problems.
Ans) Regularization is a technique used in ML that prevents over-fitting, i.e., it puts rules in place to keep the
model from getting too wild with its predictions and to stay balanced. Over-fitting means the model is great at
predicting on the training data but not so good at generalizing to new, unseen data.
Imagine you're a teacher trying to predict how well students will do on their next test based on how much they
studied for the last one. If you only look at a few students and fit your prediction model to their scores
perfectly, the model might end up predicting unrealistic things for everyone else, like someone getting 150%
on the test. This is over-fitting: the model has learned too much from just a small set of data, and now it's
making bad predictions for others.
Regularization helps by saying, "Don't go overboard! Stick to reasonable predictions, even if it means you're a
bit less precise with the few students you know about." It pulls back on extreme predictions and ensures the
model stays general enough to work well on new, unseen data.
Under-constrained or underdetermined problems occur when there are not enough constraints or
information to solve a problem uniquely. In simple terms, it means there are too many possible solutions and
not enough data or rules to pinpoint the right one.
In machine learning, an underdetermined problem happens when you have more variables (features) than you
have data points (examples). For instance, if you're trying to build a predictive model with 10 features (like
temperature, humidity, wind speed, etc.), but you only have data from 5 days, there's not enough data to decide
how all 10 features interact to make a prediction. There are too many possible ways the model could fit the
data, leading to uncertainty or multiple solutions.
Linear Models and Matrix Inversion:
• In machine learning, some models use mathematical formulas that require you to "reverse" or invert a
matrix (like a grid of numbers). However, sometimes the data doesn't provide enough information to
do this, kind of like trying to solve a puzzle with missing pieces. Without all the information, the
puzzle can't be completed.
Regularization and Invertibility:
• To address singularity, regularization is often introduced. For example, ridge regression (or
Tikhonov regularization) modifies the matrix to be inverted to X^T X + αI,
where I is the identity matrix and α is the regularization strength. This regularization
guarantees that the matrix becomes invertible.
• Regularization is essential for ensuring that a well-defined solution exists, especially in cases where
the data distribution lacks variance in some directions.
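This can be checked numerically: with more features than examples, X^T X is singular, while X^T X + αI is full rank and can be solved. The sizes and α below are arbitrary illustrative choices:

```python
import numpy as np

# When X has more columns than rows, X^T X is rank-deficient and cannot be
# inverted; adding alpha * I makes it invertible (ridge regression).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))             # 5 examples, 10 features
y = rng.normal(size=5)

gram = X.T @ X                           # 10x10 but rank at most 5
alpha = 0.1
ridge_gram = gram + alpha * np.eye(10)   # guaranteed positive definite

w = np.linalg.solve(ridge_gram, X.T @ y) # unique ridge solution
rank_plain = np.linalg.matrix_rank(gram)
rank_ridge = np.linalg.matrix_rank(ridge_gram)
```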
Underdetermined Problems:
• Problems like logistic regression can be underdetermined when the data is linearly separable. In
such cases, the weight vector w can theoretically increase indefinitely during optimization
without halting (e.g., under gradient descent).
• Without regularization, the weights can grow too large, potentially leading to numerical
instability or overflow.
Role of Regularization in Optimization:
• Regularization methods like weight decay prevent the weights from growing indefinitely by
ensuring that the optimization process converges.
• In the case of gradient descent, weight decay ensures that the optimization halts when the gradient of
the likelihood becomes equal to the regularization term.
Pseudoinverse and Regularization:
• This idea is linked to the Moore-Penrose pseudoinverse, a method used to solve
underdetermined linear systems. The pseudoinverse is defined as
X^+ = lim_{α→0} (X^T X + αI)^{-1} X^T, which can be interpreted as a form of regularized solution.
• This shows how the pseudoinverse stabilizes solutions to underdetermined problems by incorporating
a small regularization term, which effectively shrinks to zero.
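The limit definition can be verified numerically by comparing NumPy's pseudoinverse with the regularized inverse at a very small α (the matrix size and α are arbitrary choices):

```python
import numpy as np

# The Moore-Penrose pseudoinverse as the alpha -> 0 limit of the
# regularized solution (X^T X + alpha I)^{-1} X^T.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))              # underdetermined: 5 rows, 8 columns

pinv = np.linalg.pinv(X)                 # Moore-Penrose pseudoinverse X^+

alpha = 1e-8
reg = np.linalg.solve(X.T @ X + alpha * np.eye(8), X.T)

gap = np.max(np.abs(pinv - reg))         # shrinks as alpha -> 0
```

The two matrices agree to high precision, and the agreement tightens further as α is decreased toward zero.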
B) Explain in detail how early stopping acts as a regularizer.
Ans) While training machine learning models with significant learning capacity, we often observe that the
training error steadily decreases but after some epochs the validation error starts increasing.
After each epoch, the model learns the data and updates the weights accordingly. Training and validation error
decreases as long as our model is generalising the input data. After some iterations, the model starts memorising
the data and even though training error decreases the validation error increases, causing the overfitting of the
model.
Regularization is a technique to avoid the overfitting of the models with large learning capacity. The
regularization techniques increase the bias and reduce the variance of the model.
Early Stopping
If the performance of the model on the validation dataset starts to degrade (e.g. loss begins to increase, or
accuracy begins to decrease), then the training process is stopped. The model at this stage has low variance and
is known to generalize the data well. Training the model further would increase the variance of the model and
lead to overfitting. This regularization technique is called “early stopping”.
We can show that for a simple linear model with a quadratic error function and simple gradient descent, early
stopping is equivalent to L2 regularization.
From a Taylor series expansion, we make a quadratic approximation of the cost around the empirically optimal
value of the weights w*, where H is the Hessian matrix:
J^(w) = J(w*) + (1/2) (w − w*)^T H (w − w*)
The gradient of this approximation comes out to be a linear function of H and w:
∇J^(w) = H (w − w*)
Now, we can calculate the weights at each step of gradient descent:
w(τ) = w(τ−1) − ε H (w(τ−1) − w*)
Since the Hessian matrix is real, symmetric and positive semi-definite, by eigenvalue decomposition H can be
written as H = Q Λ Q^T, where Λ is the diagonal matrix of eigenvalues λi and Q is an orthonormal basis of
eigenvectors.
Assuming ε is chosen to be small and w(0) = 0, the weights after 𝜏 iterations are:
Q^T w(τ) = [I − (I − εΛ)^τ] Q^T w*        ... Eq(1)
The optimal weights obtained on imposing an L2 norm penalty on the objective function are given by:
Q^T w~ = [I − (Λ + αI)^{-1} α] Q^T w*        ... Eq(2)
Comparing Eq(1) and Eq(2), we derive the following relation between 𝜏, α, and ε, from which the equivalence
between L2 regularization and early stopping can be seen:
(I − εΛ)^τ = (Λ + αI)^{-1} α
Taking the log on both sides and using the series approximation of log(1+x), we can conclude that if all λi are
small (that is, ελi << 1 and λi/α << 1) then the following holds:
τ ≈ 1/(εα)
Here, α is the regularization constant, 𝜏 is the number of iterations, and ε is the learning rate.
Increasing no. of epochs/iterations 𝜏 is equivalent to reducing the regularization constant. Similarly, early
stopping the model, i.e. reducing the no. of iterations is similar to L2 regularization with large α. Thus, we can
say that early stopping regularizes the model.
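The relation τ ≈ 1/(εα) can be checked numerically on a quadratic loss with a diagonal Hessian, comparing τ steps of gradient descent from w(0) = 0 against the ridge solution with α = 1/(τε). The eigenvalues, ε and τ below are illustrative and chosen so that ελi << 1:

```python
import numpy as np

# Early stopping vs L2: gradient descent on J(w) = 1/2 (w - w*)^T H (w - w*)
# with diagonal H, stopped after tau steps, tracks the ridge solution with
# alpha = 1 / (tau * eps).
lam = np.array([0.01, 0.02, 0.05])       # eigenvalues of H
w_star = np.array([1.0, -1.0, 2.0])      # unregularized optimum
eps, tau = 0.1, 50
alpha = 1.0 / (tau * eps)                # predicted equivalent L2 strength

# tau steps of gradient descent from w(0) = 0; gradient is H (w - w*)
w = np.zeros(3)
for _ in range(tau):
    w -= eps * lam * (w - w_star)

# Minimizer of J(w) + (alpha/2) ||w||^2 in the same eigenbasis
w_ridge = lam / (lam + alpha) * w_star

gap = np.max(np.abs(w - w_ridge))
```

The early-stopped iterate lands close to the ridge solution, and both shrink every component of w* toward zero, which is the regularizing effect the derivation predicts.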
4.A) Explain the difference between L2 and L1 parameter regularization.
Ans) L1 regularization:
L1 regularization, also known as Lasso regularization, is a machine-learning strategy that inhibits overfitting by
introducing a penalty term into the model's loss function based on the absolute values of the model's parameters.
L1 regularization seeks to reduce some model parameters toward zero in order to lower the number of non-zero
parameters in the model (sparse model).
L1 regularization is particularly useful when working with high-dimensional data since it enables one to choose a
subset of the most important attributes. This lessens the risk of overfitting and also makes the model easier to
understand. The size of a penalty term is controlled by the hyperparameter lambda, which regulates the L1
regularization's regularization strength. As lambda rises, more parameters will be lowered to zero, improving
regularization.
L2 regularization:
L2 regularization, also known as Ridge regularization, is a machine learning technique that avoids overfitting by
introducing a penalty term into the model's loss function based on the squares of the model's parameters. The goal
of L2 regularization is to keep the model's parameters small and prevent them from growing too large.
In order to achieve L2 regularization, a term that is proportional to the squares of the model's parameters is
added to the loss function. This term acts as a limit on the parameters' size, preventing them from growing
out of control. The size of the penalty term is again controlled by the hyperparameter lambda, which regulates
the regularization's intensity. The larger the lambda, the smaller the parameters and the stronger the
regularization.
Difference between L1 & L2 regularization
• Penalty term: L1 is based on the absolute values of the model's parameters; L2 is based on the squares of the
model's parameters.
• Sparsity: L1 produces sparse solutions (some parameters are shrunk to exactly zero); L2 produces non-sparse
solutions (all parameters are used by the model).
• Outliers: L1 is more robust to outliers; L2 is more sensitive to outliers, since squaring amplifies large values.
• Feature selection: L1 selects a subset of the most important features; with L2, all features are used by the
model.
• Optimization: the L1 penalty is convex but not differentiable at zero; the L2 penalty is convex and smooth,
making optimization easier.
• Correlated features: the L1 penalty is less sensitive to correlated features; the L2 penalty is more sensitive to
correlated features.
• Use case: L1 is useful when dealing with high-dimensional data with many correlated features and when the
goal is to have a less complex (sparse) model; L2 is useful when dealing with high-dimensional data with many
correlated features.
• Also known as: L1 is Lasso regularization; L2 is Ridge regularization.
Conclusion
To sum up, L1 and L2 regularization are two methods for preventing overfitting in machine learning models. L1
regularization, which is based on the absolute values of the model's parameters and generates sparse solutions, is
helpful for feature selection. In contrast, L2 regularization yields non-sparse solutions and is based on the squares
of the model's parameters, making it beneficial for building simpler models. Both methods are controlled by a
hyperparameter called lambda that sets the degree of regularization. Depending on the particular situation and
the required model attributes, L1 or L2 regularization is chosen.
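The qualitative difference shows up even in the simplest one-dimensional problem min_w (1/2)(w − b)^2 + penalty, which has closed-form solutions for both penalties (the values of b and lambda are arbitrary):

```python
import numpy as np

# L1 (lambda * |w|) soft-thresholds b and can produce exact zeros;
# L2 (lambda/2 * w^2) only rescales b and never reaches zero.
b = np.array([3.0, 0.4, -0.2, -2.5])     # unregularized solutions
lam = 1.0

w_l1 = np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)   # soft-thresholding
w_l2 = b / (1.0 + lam)                                  # uniform shrinkage
```

The L1 solution sets the two small coefficients exactly to zero (sparsity), while the L2 solution shrinks every coefficient but keeps all of them nonzero.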
B) Briefly explain how learning differs from pure optimization.
Ans) In Machine Learning (ML), we care about a certain performance measure P (e.g. accuracy) defined w.r.t. the
test set, and optimize J(θ) (e.g. cross-entropy loss) with the hope that doing so improves P as well. In pure
optimization, optimizing J(θ) is the final goal. The expected generalization error (risk) is taken over the true
data-generating distribution p_data. If we had access to p_data, this would be an optimization problem. When we
don't have p_data but only a finite training set, we have an ML problem. The latter can be converted back into an
optimization problem by replacing p_data with the empirical distribution p~data obtained from the training set,
thereby minimizing the empirical risk. This is called empirical risk minimization (ERM) and although it might
look relatively similar to optimization, there are two main problems:
• ERM is prone to overfitting, with the possibility of the dataset simply being memorized by high-capacity
models.
• ERM might not be feasible: most optimization algorithms are based on Gradient Descent (GD), which does not
work with various loss functions like the 0-1 loss (as it is not differentiable).
For the reasons mentioned above, a surrogate loss function (SLF) is used instead, acting as a proxy. E.g.
the negative log-likelihood of the true class is used as a surrogate for the 0-1 loss. Using an SLF might even turn
out to be beneficial, as the test error can keep improving by pushing the classes even further apart to get a more
reliable classifier.
• Another common difference is that training might be halted following some convergence criterion based
on early stopping to prevent overfitting, even when the derivative of the surrogate loss function might still be
large. This is different from pure optimization, which is halted only when the derivative becomes very
small.
• In ML, the objective function typically decomposes as a sum over the training examples, and we
can perform updates by randomly sampling a batch of examples and taking the average over those
examples. The standard error of the mean estimated from n examples is σ/√n, indicating that as we
include more examples for making an update, the return from additional examples in improving the error is
less than linear. Thus, if we use 100 versus 10,000 examples to make an update, the latter takes 100 times
more compute, but reduces the error only by a factor of 10. It is therefore better to compute rapid
approximate updates than a slow exact update.
• There are 3 types of sampling-based algorithms: batch gradient descent (BGD), where the entire training
set is used to make a single update; stochastic gradient descent (SGD), where a single example is used to
make an update; and mini-batch gradient descent (MBGD), where a batch (not to be confused with
BGD) of examples is randomly sampled from the entire training set and used to make an
update. MBGD is nowadays commonly referred to as SGD. It is common practice to use batch sizes that are
powers of 2 to get better runtime with certain hardware. Small batches tend to have a regularizing effect
because of the noise they inject, as each update is made by seeing only a very small portion of the entire
training set, i.e., a batch of samples.
• The minibatches should be selected randomly. It is sufficient to shuffle the dataset once and iterate over it
multiple times. In the first epoch, the network sees each example for the first time and hence, the estimate
of gradient is an unbiased estimate of the gradient of the true generalization error. However, from the
second epoch onwards, the estimate becomes biased as it is resampling from data that it has already seen.
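The σ/√n scaling of the gradient estimate discussed above is easy to verify empirically (the batch sizes and the number of trials are arbitrary):

```python
import numpy as np

# Standard error of a mean estimated from n samples scales as sigma/sqrt(n):
# 100x more samples reduces the error only ~10x.
rng = np.random.default_rng(0)

def std_error(n, trials=500):
    # Empirical std of the sample mean over many batches of size n,
    # drawn from a unit-variance population
    means = rng.normal(size=(trials, n)).mean(axis=1)
    return means.std()

se_100 = std_error(100)       # theory: 1/sqrt(100)   = 0.1
se_10000 = std_error(10_000)  # theory: 1/sqrt(10000) = 0.01
```

The measured ratio between the two standard errors is close to 10, not 100, which is why rapid approximate updates from small batches are usually the better trade-off.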