
Understanding Biological Neurons and Autoencoders

The document discusses the structure and function of biological neurons, detailing components such as dendrites, soma, axon, and synapses, and how they interact to process information. It also covers autoencoders, a type of unsupervised neural network used for data compression and anomaly detection, explaining their architecture and various types. Additionally, it highlights challenges in deep learning, including overfitting, data quality, computational resources, interpretability, hyperparameter tuning, scalability, and ethical issues.


Biological Neuron

 The human brain consists of a large number (more than a billion) of neural
cells that process information. Each cell works like a simple processor.
 The massive interaction between all these cells and their parallel processing
is what makes the brain’s abilities possible.
 Figure 1 represents a human biological nervous unit. The various parts of the
biological neural network (BNN) are marked in Figure 1.


 Figure 1: Biological Neural Network
 Dendrites are branching fibres that extend from the cell body or soma.
 Soma, or cell body, of a neuron contains the nucleus and other structures
that support chemical processing and the production of neurotransmitters.
 Axon is a single fibre that carries information away from the soma to the
synaptic sites of other neurons (dendrites and somas), muscles, or glands.
The axon hillock is the site of summation for incoming information.
 At any moment, the collective influence of all neurons that conduct
impulses to a given neuron will determine whether or not an action
potential will be initiated at the axon hillock and propagated along the
axon.
 Myelin sheath consists of fat-containing cells that insulate the axon from
electrical activity. This insulation acts to increase the rate of transmission
of signals.
 A gap exists between each myelin sheath cell along the axon. Since fat
inhibits the propagation of electricity, the signals jump from one gap to the
next.
 Nodes of Ranvier are the gaps (about 1 μm) between myelin sheath cells.
Since fat serves as a good insulator, the myelin sheaths speed the rate of
transmission of an electrical impulse along the axon.
 Synapse is the point of connection between two neurons or a neuron and a
muscle or a gland. Electrochemical communication between neurons takes
place at these junctions.
 Terminal buttons of a neuron are the small knobs at the end of an axon
that release chemicals called neurotransmitters.
 Information flow in a neural cell: input signals arrive at the dendrites, are
aggregated in the soma, and propagate along the axon to the synapses.

Auto Encoders
 Autoencoders are a special type of unsupervised feedforward neural
network (no labels needed!). The main application of Autoencoders is to
accurately capture the key aspects of the provided data to provide a
compressed version of the input data, generate realistic synthetic data, or
flag anomalies.
 Autoencoders are composed of 2 key fully connected feedforward neural
networks (Figure 1):
 Encoder: compresses the input data, removing noise, and generates a
latent space/bottleneck. The output dimensions of this neural network are
therefore smaller than the input and can be adjusted as a hyperparameter
in order to decide how lossy our compression should be.
 Decoder: making use of only the compressed data representation from the
latent space, tries to reconstruct with as much fidelity as possible the
original input data (the architecture of this neural network is, therefore,
generally a mirror image of the encoder). The “goodness” of the prediction
can then be measured by calculating the reconstruction error between the
input and output data using a loss function.
 By iteratively passing data through the encoder and decoder, measuring
the error, and tuning the parameters through backpropagation, the
Autoencoder can, over time, learn to work with extremely difficult forms of
data.
 If an Autoencoder is provided with a set of input features completely
independent of each other, then it would be really difficult for the model to
find a good lower-dimensional representation without losing a great deal of
information (lossy compression).
 Autoencoders can, therefore, also be considered a dimensionality reduction
technique, which, compared to traditional techniques such as Principal
Component Analysis (PCA), can make use of non-linear transformations to
project data into a lower-dimensional space. If you are interested in
learning more about other feature extraction techniques, additional
information is available in this feature extraction tutorial.
 Additionally, compared to standard data compression algorithms like gzip,
Autoencoders cannot be used as general-purpose compression algorithms;
instead, they are handcrafted to work best only on data similar to what
they were trained on.
 Some of the most common hyperparameters that can be tuned when
optimizing your Autoencoder are:
 The number of layers for the Encoder and Decoder neural networks
 The number of nodes for each of these layers
 The loss function to use for the optimization process (e.g., binary cross-
entropy or mean squared error)
 The size of the latent space (the smaller it is, the higher the compression,
therefore acting as a regularization mechanism)
 Finally, Autoencoders can be designed to work with different types of data,
such as tabular, time-series, or image data, and can, therefore, be
designed to use a variety of layers, such as convolutional layers, for image
analysis.
 Ideally, a well-trained Autoencoder should be responsive enough to adapt
to the input data in order to provide a tailor-made response but not so
much as to just mimic the input data and not be able to generalize with
unseen data (therefore overfitting).
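As a concrete (and heavily simplified) sketch of this encode/bottleneck/decode loop, the following trains a tiny linear autoencoder in plain NumPy; the data, layer sizes, learning rate, and iteration count are all illustrative choices, not taken from the text above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 samples of 8 features lying near a 4-dimensional subspace
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 8))

# Encoder and decoder as single linear layers with a bottleneck of size 3
# (the latent size is the hyperparameter controlling how lossy compression is)
W_enc = rng.normal(scale=0.1, size=(8, 3))
W_dec = rng.normal(scale=0.1, size=(3, 8))

def forward(X):
    z = X @ W_enc        # encoder: compress 8 features down to 3
    X_hat = z @ W_dec    # decoder: reconstruct the 8 features from 3
    return z, X_hat

def mse(X, X_hat):
    return float(np.mean((X_hat - X) ** 2))

mse_before = mse(X, forward(X)[1])

# Tune the parameters by backpropagating the reconstruction error
lr = 0.01
for _ in range(1000):
    z, X_hat = forward(X)
    err = (X_hat - X) / len(X)
    grad_dec = z.T @ err
    grad_enc = X.T @ (err @ W_dec.T)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

mse_after = mse(X, forward(X)[1])
# The reconstruction error drops substantially after training
```

A real autoencoder would add non-linear activations and use a framework such as PyTorch or Keras, but the loop above captures the essential encode/decode/measure/backpropagate cycle.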
Types of Autoencoders
Over the years, different types of Autoencoders have been developed:
1. Undercomplete Autoencoder
2. Sparse Autoencoder
3. Contractive Autoencoder
4. Denoising Autoencoder
5. Convolutional Autoencoder
6. Variational Autoencoder
Let’s explore each in more detail.
1. Undercomplete Autoencoder
This is the simplest version of an autoencoder. In this case, we don’t
have an explicit regularization mechanism, but we ensure that the size of
the bottleneck is always lower than the original input size to avoid
overfitting. This type of configuration is typically used as a dimensionality
reduction technique (more powerful than PCA since it’s also able to capture
non-linearities in the data).
2. Sparse Autoencoder
A Sparse Autoencoder is quite similar to an Undercomplete
Autoencoder, but their main difference lies in how regularization is applied.
In fact, with Sparse Autoencoders, we don’t necessarily have to reduce the
dimensions of the bottleneck; instead, we use a loss function that
penalizes the model for activating all of its neurons in the different hidden
layers (Figure 2).
 This penalty is commonly referred to as a sparsity function, and it's quite
different from traditional regularization techniques since it doesn’t focus on
penalizing the size of the weights but the number of nodes activated.
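A minimal sketch of such a sparsity penalty, here a simple L1 penalty on the activations; the activation values and the weight `lam` are hypothetical:

```python
import numpy as np

# Hidden-layer activations for a small batch (illustrative values):
# most entries are already near zero, i.e. few neurons are active
activations = np.array([[0.9, 0.0, 0.1],
                        [0.8, 0.0, 0.0]])

lam = 0.01  # hypothetical sparsity weight

# L1 sparsity penalty: added to the reconstruction loss, it penalizes
# the magnitude of the activations (how many nodes fire) rather than
# the size of the weights
sparsity_penalty = lam * np.sum(np.abs(activations))
# total_loss = reconstruction_error + sparsity_penalty
```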
3. Contractive Autoencoder
The main idea behind Contractive Autoencoders is that, given some
similar inputs, their compressed representations should be quite similar
(neighborhoods of inputs should be contracted into small neighborhoods of
outputs). In mathematical terms, this can be enforced by keeping the
derivatives of the hidden-layer activations with respect to the inputs small,
so that similar inputs produce similar activations.
4. Denoising Autoencoder
With Denoising Autoencoders, the input and output of the model are no
longer the same. For example, the model could be fed some low-resolution,
corrupted images and be trained to output improved, higher-quality
versions of those images. In order to assess the performance of the model
and improve it over time, we would then need some form of labeled clean
image to compare with the model prediction.
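The corruption step can be sketched as follows; the noise level, noise type, and data shapes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X_clean = rng.uniform(size=(32, 784))  # e.g. a batch of flattened 28x28 images

# Corrupt the inputs with Gaussian noise; the "label" the model is
# trained against is the clean image itself
noise_level = 0.2  # hypothetical hyperparameter
X_noisy = X_clean + noise_level * rng.normal(size=X_clean.shape)

# Training pairs: input = X_noisy, target = X_clean
# loss = mse(model(X_noisy), X_clean)
```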
5. Convolutional Autoencoder

To work with image data, Convolutional Autoencoders replace traditional
feedforward neural networks with Convolutional Neural Networks for both
the encoder and decoder steps. By updating the type of loss function, etc.,
this type of Autoencoder can also be made, for example, Sparse or
Denoising, depending on your use case requirements.

6. Variational Autoencoder

In every type of Autoencoder considered so far, the encoder outputs a
single value for each dimension involved. With Variational Autoencoders
(VAE), we instead make this process probabilistic, creating a probability
distribution for each dimension. The decoder can then sample a value
from each of the distributions describing the different dimensions and
construct the input vector, which can then be used to reconstruct the
original input data.

One of the main applications of Variational Autoencoders is generative
tasks. In fact, sampling the latent vector from these distributions enables
the decoder to create new forms of output that were not possible using a
deterministic approach.

If you are interested in testing online a Variational Autoencoder
trained on the MNIST dataset, you can find a live example.
McCulloch-Pitts Neuron
The first computational model of a neuron was proposed by Warren
McCulloch (neuroscientist) and Walter Pitts (logician) in 1943.

This is where it all began.

It may be divided into 2 parts. The first part, g, takes an input (ahem
dendrite ahem) and performs an aggregation; based on the aggregated
value, the second part, f, makes a decision.

Let’s suppose that I want to predict my own decision, whether to watch
a random football game or not on TV. The inputs are all boolean, i.e., {0,1},
and my output variable is also boolean {1: Will watch it, 0: Won’t watch
it}.

So, x_1 could be isPremierLeagueOn (I like Premier League more)

x_2 could be isItAFriendlyGame (I tend to care less about the friendlies)

x_3 could be isNotHome (Can’t watch it when I’m running errands. Can
I?)

x_4 could be isManUnitedPlaying (I am a big Man United fan. GGMU!)


and so on.

These inputs can either be excitatory or inhibitory. Inhibitory inputs are
those that have maximum effect on the decision making irrespective of
other inputs, i.e., if x_3 is 1 (not home) then my output will always be 0,
i.e., the neuron will never fire, so x_3 is an inhibitory input. Excitatory
inputs are NOT the ones that will make the neuron fire on their own, but
they might fire it when combined together. Formally, this is what is going
on:

We can see that g(x) is just doing a sum of the inputs, a simple
aggregation: g(x_1, ..., x_n) = x_1 + x_2 + ... + x_n, and the output
y = f(g(x)) is 1 if g(x) ≥ theta and 0 otherwise. Theta here is called the
thresholding parameter. For example, if I always watch the game when the
sum turns out to be 2 or more, the theta is 2 here. This is called
Thresholding Logic.

Boolean Functions Using M-P Neuron

So far we have seen how the M-P neuron works. Now let’s look at how
this very neuron can be used to represent a few boolean functions. Mind
you that our inputs are all boolean and the output is also boolean, so
essentially, the neuron is just trying to learn a boolean function. A lot of
boolean decision problems can be cast into this form with appropriate
input variables: whether to continue reading this post, whether to
watch Friends after reading this post, etc. can all be represented by the
M-P neuron.

M-P Neuron: A Concise Representation


This representation just denotes that, for the boolean
inputs x_1, x_2 and x_3, if g(x), i.e., the sum, is ≥ theta, the neuron will
fire; otherwise, it won’t.

AND Function

An AND function neuron would only fire when ALL the inputs are ON
i.e., g(x) ≥ 3 here.

OR Function

I believe this is self-explanatory, as we know that an OR function neuron
would fire if ANY of the inputs is ON, i.e., g(x) ≥ 1 here.

A Function With An Inhibitory Input


Now this might look like a tricky one, but it’s really not. Here, we have an
inhibitory input, i.e., x_2, so whenever x_2 is 1, the output will be 0.
Keeping that in mind, we know that x_1 AND !x_2 would output 1 only
when x_1 is 1 and x_2 is 0, so it is obvious that the thresholding parameter
should be 1.

NOR Function

For a NOR neuron to fire, we want ALL the inputs to be 0, so the
thresholding parameter should also be 0 and we take them all as
inhibitory inputs.

NOT Function
For a NOT neuron, 1 outputs 0 and 0 outputs 1. So we take the input as
an inhibitory input and set the thresholding parameter to 0. It works!
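The boolean functions above can be sketched with a tiny Python helper; the function name and signature are my own illustrative choices, not from the original text:

```python
def mp_neuron(inputs, theta, inhibitory=()):
    """McCulloch-Pitts neuron: fires (returns 1) when the aggregated
    sum of the boolean inputs reaches the threshold theta, unless any
    inhibitory input is on."""
    if any(inputs[i] for i in inhibitory):
        return 0  # an active inhibitory input suppresses the neuron
    return 1 if sum(inputs) >= theta else 0

# AND over 3 inputs: fires only when all inputs are ON (theta = 3)
assert mp_neuron((1, 1, 1), theta=3) == 1
assert mp_neuron((1, 1, 0), theta=3) == 0

# OR over 3 inputs: fires when any input is ON (theta = 1)
assert mp_neuron((0, 1, 0), theta=1) == 1

# x_1 AND !x_2: x_2 is inhibitory, theta = 1
assert mp_neuron((1, 0), theta=1, inhibitory=(1,)) == 1
assert mp_neuron((1, 1), theta=1, inhibitory=(1,)) == 0

# NOT: a single inhibitory input with theta = 0
assert mp_neuron((0,), theta=0, inhibitory=(0,)) == 1
assert mp_neuron((1,), theta=0, inhibitory=(0,)) == 0
```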

Can any boolean function be represented using the M-P neuron? Before
you answer that, let’s understand what the M-P neuron is doing geometrically.

Geometric Interpretation Of M-P Neuron

This is the best part of the post, according to me. Let’s start with the OR
function.

OR Function

We already discussed that the OR function’s thresholding
parameter theta is 1, for obvious reasons. The inputs are obviously
boolean, so only 4 combinations are possible: (0,0), (0,1), (1,0) and
(1,1). Now, plotting them on a 2D graph and making use of the OR
function’s aggregation equation, i.e., x_1 + x_2 ≥ 1, we can draw the
decision boundary as shown in the graph below. Mind you again, this is
not a real-number graph.


We just used the aggregation equation, i.e., x_1 + x_2 = 1, to graphically
show that all those inputs whose output is 1 when passed through the OR
function M-P neuron lie ON or ABOVE that line, and all the input points that
lie BELOW that line output 0.

Voila!! The M-P neuron just learnt a linear decision boundary! The M-P
neuron is splitting the input sets into two classes — positive and negative.
Positive ones (which output 1) are those that lie ON or ABOVE the decision
boundary and negative ones (which output 0) are those that lie BELOW
the decision boundary.

Let’s convince ourselves that the M-P unit is doing the same for all the
boolean functions by looking at more examples (if it is not already clear
from the math).

AND Function

In this case, the decision boundary equation is x_1 + x_2 = 2. Here, the
only input point that lies ON or ABOVE the line, namely (1,1), outputs 1
when passed through the AND function M-P neuron. It fits! The decision
boundary works!

Tautology

A tautology neuron outputs 1 for every input, so the thresholding
parameter is 0: every input point lies on or above the decision boundary,
and the neuron always fires.

Greedy-Layerwise in Deep Learning
Deep Learning Challenges
Deep learning offers immense potential, but several challenges can
hinder its effective implementation. Addressing these challenges is
crucial for developing reliable and efficient models. Here are the main
challenges faced in deep learning:
1. Overfitting and Underfitting
Balancing model complexity to ensure it generalizes well to new data is
challenging. Overfitting occurs when a model is too complex and
captures noise in the training data. Underfitting happens when a model
is too simple and fails to capture the underlying patterns.
2. Data Quality and Quantity
Deep learning models require large, high-quality datasets for training.
Insufficient or poor-quality data can lead to inaccurate predictions and
model failures. Acquiring and annotating large datasets is often time-
consuming and expensive.
3. Computational Resources
Training deep learning models demands significant computational power
and resources. This can be expensive and inaccessible for many
organizations. High-performance hardware like GPUs and TPUs are often
necessary to handle the intensive computations.
4. Interpretability
Deep learning models often function as "black boxes," making it
difficult to understand how they make decisions. This lack of
transparency can be problematic, especially in critical applications.
Understanding the decision-making process is crucial for trust and
accountability.
5. Hyperparameter Tuning
Finding the optimal settings for a model’s hyperparameters requires
expertise. This process can be time-consuming and computationally
intensive. Hyperparameters significantly impact the model’s
performance, and tuning them effectively is essential for achieving high
accuracy.
6. Scalability
Scaling deep learning models to handle large datasets and complex
tasks efficiently is a major challenge. Ensuring models perform well in
real-world applications often requires significant adjustments. This
involves optimizing both algorithms and infrastructure to manage
increased loads.
7. Ethical and Bias Issues
Deep learning models can inadvertently learn and perpetuate biases
present in the training data. This can lead to unfair outcomes and ethical
concerns. Addressing bias and ensuring fairness in models is critical for
their acceptance and trustworthiness.
8. Hardware Limitations
Training deep learning models requires substantial computational
resources, including high-performance GPUs or TPUs. Access to such
hardware can be a bottleneck for researchers and practitioners.
9. Adversarial Attacks
Deep learning models are susceptible to adversarial attacks, where
subtle perturbations to input data can cause misclassification.
Robustness against such attacks remains a significant concern in safety-
critical applications.
Strategies to Overcome Deep Learning Challenges
Addressing the challenges in deep learning is crucial for developing
effective and reliable models. By implementing the right strategies, we
can mitigate these issues and enhance the performance of our deep
learning systems. Here are the key strategies:
Enhancing Data Quality and Quantity
 Preprocessing: Invest in data preprocessing techniques to clean and
organize data.
 Data Augmentation: Use data augmentation methods to artificially
increase the size of your dataset.
 Data Collection: Gathering more labeled data improves model
accuracy and robustness.
Leveraging Cloud Computing
 Cloud Platforms: Utilize cloud-based platforms like AWS, Google
Cloud, or Azure for accessing computational resources.
 Scalable Computing: These platforms offer scalable computing
power without the need for significant upfront investment.
 Tools and Frameworks: Cloud services also provide tools and
frameworks that simplify the deployment and management of deep
learning models.
Implementing Regularization Techniques
 Dropout: Use techniques like dropout to prevent overfitting.
 L2 Regularization: Regularization helps the model generalize better
by adding constraints or noise during training.
 Data Augmentation: This ensures that the model performs well on
new, unseen data.
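As a rough sketch of how two of these mechanisms look in code (plain NumPy, with hypothetical values for the regularization strength `lam` and the drop rate `p`):

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 3))

# L2 regularization: add lam * sum(w**2) to the loss so that large
# weights are penalized and the model generalizes better
lam = 1e-3  # hypothetical regularization strength
l2_penalty = lam * np.sum(weights ** 2)

# Inverted dropout: during training, randomly zero a fraction p of the
# activations and rescale the survivors so the expected value is unchanged
p = 0.5
activations = rng.uniform(size=(8, 3))
mask = (rng.uniform(size=activations.shape) >= p) / (1 - p)
dropped = activations * mask
```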
Improving Model Interpretability
 Interpretability Tools: Employ tools like LIME (Local
Interpretable Model-agnostic Explanations) or SHAP (SHapley
Additive explanations) to understand model decisions.
 Transparency: Enhancing interpretability helps build trust in the
model, especially in critical applications.
Automating Hyperparameter Tuning
 Automated Tuning: Use automated tools like grid search, random
search, or Bayesian optimization for hyperparameter tuning.
 Efficiency: Automated tuning saves time and computational
resources by systematically exploring the hyperparameter space.
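A grid search over two hyperparameters can be sketched as below; `validation_loss` is a stand-in for the expensive step of actually training and evaluating a model, and the grid values are illustrative:

```python
import itertools

# Stand-in for training a model at a given configuration and returning
# its validation loss (hypothetical smooth objective for illustration)
def validation_loss(lr, hidden):
    return (lr - 0.01) ** 2 + (hidden - 64) ** 2 / 1e4

grid = {"lr": [0.001, 0.01, 0.1], "hidden": [32, 64, 128]}

# Grid search: evaluate every combination and keep the best one
best = min(
    (dict(zip(grid, combo)) for combo in itertools.product(*grid.values())),
    key=lambda cfg: validation_loss(**cfg),
)
print(best)  # {'lr': 0.01, 'hidden': 64}
```

Random search and Bayesian optimization replace the exhaustive product with smarter sampling, which matters once the grid grows beyond a few dimensions.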
Optimizing Algorithms and Hardware
 Efficient Algorithms: Implement efficient algorithms and leverage
specialized hardware like GPUs and TPUs.
 Advanced Hardware: These optimizations significantly reduce
training time and improve model performance.
Addressing Bias and Ethical Concerns
 Fairness Practices: Implement fairness-aware machine learning
practices to identify and mitigate biases.
 Regular Audits: Regularly audit models to ensure they do not
perpetuate harmful biases present in the training data.
Saddle Points in Deep Learning

What Are Saddle Points?

A saddle point is a point on the loss surface where the gradient is zero but
the surface curves upward in some directions and downward in others, so it
is neither a local minimum nor a local maximum.

Difference from Minima and Maxima

 Local Minimum

 Gradient = 0

 Curvature: upward in all directions

 Hessian eigenvalues: all positive

 Intuition: bottom of a bowl

 Local Maximum

 Gradient = 0

 Curvature: downward in all directions

 Hessian eigenvalues: all negative

 Intuition: top of a hill

 Saddle Point

 Gradient = 0

 Curvature: upward in some directions, downward in others

 Hessian eigenvalues: mixed (positive and negative)

 Intuition: a mountain ridge or a horse saddle
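The three cases above can be checked numerically from the Hessian eigenvalues. For the classic saddle f(x, y) = x² − y², the Hessian at the origin is diag(2, −2):

```python
import numpy as np

# f(x, y) = x**2 - y**2 has zero gradient at the origin, curving
# upward along x and downward along y
H = np.array([[2.0, 0.0],    # d²f/dx² =  2
              [0.0, -2.0]])  # d²f/dy² = -2

def classify(hessian):
    eig = np.linalg.eigvalsh(hessian)  # eigenvalues of a symmetric matrix
    if np.all(eig > 0):
        return "local minimum"   # bowl: curves up everywhere
    if np.all(eig < 0):
        return "local maximum"   # hill: curves down everywhere
    return "saddle point"        # mixed signs: up in some directions, down in others

print(classify(H))  # saddle point
```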


3. Why Are Saddle Points a Problem?
 Slow Training: Near a saddle, gradients are almost zero. Optimizers like
vanilla gradient descent make microscopic steps → convergence stalls.
 Illusion of Convergence: You might think training is “done” because loss
isn’t decreasing. In reality, the optimizer is stuck on a flat plateau.
 High-Dimensional Curse: In low dimensions, minima/maxima dominate.
But in high-dimensional spaces (millions of parameters), most critical
points are saddle points (Dauphin et al., 2014).


4. How to Detect if You’re Stuck in a Saddle Point

It’s hard to directly detect saddle points, but you can infer:
1. Training Loss Stagnates: Loss stops decreasing but doesn’t reach a low
value.
2. Gradients Are Tiny: The norm of the gradient is ≈ 0 across many steps.
3. Oscillation Without Progress: Model parameters “hover” without
meaningful improvement.
4. Second-Order Clues: In theory, compute the Hessian eigenvalues; if
they’re mixed → saddle. But the Hessian is infeasible to compute for large
networks.

In practice, deep learning frameworks don’t check Hessians. Instead, they
rely on optimizer dynamics to automatically “escape.”
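A rough heuristic combining signals 1 and 2 above might look like this; the thresholds and window size are arbitrary illustrative choices:

```python
def looks_stuck(losses, grad_norms, tol_loss=1e-3, tol_grad=1e-4):
    # Signal 1: loss has barely moved over the last 10 steps
    loss_flat = abs(losses[-1] - losses[-10]) < tol_loss
    # Signal 2: gradient norm has been ~0 over the same window
    grads_tiny = max(grad_norms[-10:]) < tol_grad
    return loss_flat and grads_tiny

# Flat, still-high loss with vanishing gradients: likely stuck
print(looks_stuck([2.31] * 50, [1e-6] * 50))   # True
# Steadily decreasing loss: making progress, not stuck
print(looks_stuck([2.0 - 0.01 * i for i in range(50)], [0.5] * 50))  # False
```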


5. How to Escape Saddle Points

Several strategies are used in modern training:

 Stochasticity (SGD mini-batches): Random batch noise helps push the
optimizer out of flat regions.
 Momentum: Adds velocity so even when gradients vanish, momentum
carries parameters forward.
 Adaptive Methods (Adam, RMSProp): Scale updates per parameter. Flat
directions get relatively larger steps, spiky ones smaller → accelerates
escape.
 Learning Rate Scheduling: A carefully tuned learning rate avoids stalling.
 Regularization (Weight Decay, Dropout): Helps reduce degeneracy and
reshapes the surface.


6. How TensorFlow and PyTorch Deal with Saddle Points
🔹 PyTorch

PyTorch’s optimizers ([Link]) use exactly these strategies:

 SGD with momentum: Keeps a buffer of past gradients. Even if ∇f ≈ 0,
the buffer pushes forward.

buf = momentum * buf + grad
param = param - lr * buf

 Adam: Combines momentum (first moment) with per-parameter scaling
(second moment). This prevents stalls in flat regions.

m = β1 * m + (1-β1) * grad
v = β2 * v + (1-β2) * grad**2
m_hat = m / (1 - β1**t)
v_hat = v / (1 - β2**t)
param = param - lr * m_hat / (sqrt(v_hat) + eps)

🔹 TensorFlow / Keras

TensorFlow’s optimizers ([Link]) follow the same rules:

 SGD(momentum=0.9) uses identical momentum logic.
 Adam implements the same bias-corrected first and second moments.
 With XLA compilation, updates may be fused, but the math remains
identical.

Both frameworks rely on momentum + noise + adaptive scaling to
naturally escape saddle points.


7. Are There Other “Points” Like Saddle Points?

Yes, optimization landscapes are full of interesting stationary points:

 Local Minima: Stable but not necessarily optimal.
 Global Minimum: The absolute best point; rare in high-dimensional
nonconvex settings.
 Flat Minima: Regions where the loss is low and the curvature is nearly
zero in all directions. These are actually good: they often generalize
better.
 Sharp Minima: The opposite of flat, with high curvature. Usually worse
generalization.
 Plateaus: Extended regions with almost zero gradient. Similar to saddles
in effect, though not defined by mixed curvature.


8. Key Takeaways

 Saddle points are stationary points with mixed curvature.
 They are more common than minima in deep networks and slow down
training.
 You can detect being stuck if loss plateaus, gradients vanish, and no
progress occurs.
 SGD noise, momentum, and adaptive optimizers (Adam, RMSProp) are
the main weapons against saddles.
 PyTorch and TensorFlow explicitly implement these techniques in their
optimizer code.

What is Adam Optimizer?


Adam (Adaptive Moment Estimation) optimizer combines the advantages of
Momentum and RMSprop techniques to adjust learning rates during training. It
works well with large datasets and complex models because it uses memory
efficiently and adapts the learning rate for each parameter automatically.
How Does Adam Work?
Adam builds upon two key concepts in optimization:
1. Momentum
Momentum is used to accelerate the gradient descent process by incorporating an
exponentially weighted moving average of past gradients. This helps smooth out
the trajectory of the optimization, allowing the algorithm to converge faster by
reducing oscillations.
The update rule with momentum is:
w_{t+1} = w_t − α · m_t
where:
 m_t is the moving average of the gradients at time t
 α is the learning rate
 w_t and w_{t+1} are the weights at time t and t+1, respectively
The momentum term m_t is updated recursively as:
m_t = β1 · m_{t−1} + (1 − β1) · ∂L/∂w_t
where:
 β1 is the momentum parameter (typically set to 0.9)
 ∂L/∂w_t is the gradient of the loss function with respect to the weights at
time t
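Putting the momentum update together with RMSprop-style scaling and bias correction, a single Adam step can be sketched in NumPy as follows; the hyperparameter values are the common defaults, and the quadratic objective is just an illustration:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad        # first moment (momentum)
    v = b2 * v + (1 - b2) * grad ** 2   # second moment (squared gradients)
    m_hat = m / (1 - b1 ** t)           # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize L(w) = w**2, whose gradient is 2w, starting from w = 1.0
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 1001):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.01)
# w ends up close to the minimum at 0
```

Note how the per-parameter scaling by sqrt(v_hat) keeps the step size roughly constant whether gradients are large or tiny, which is exactly what lets Adam keep moving through flat regions.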
Why Adam Works So Well?
Adam addresses several challenges of gradient descent optimization:
 Dynamic learning rates: Each parameter has its own adaptive
learning rate based on past gradients and their magnitudes. This
helps the optimizer avoid oscillations and get past local minima more
effectively.
 Bias correction: Adam adjusts for the initial bias when the first and
second moment estimates are close to zero, which helps prevent early-
stage instability.
 Efficient performance: Adam typically requires less hyperparameter
tuning compared to other optimization algorithms like SGD, making it a
more convenient choice for most problems.
Performance of Adam
In comparison to other optimizers like SGD (Stochastic Gradient
Descent) and momentum-based SGD, Adam often converges faster and
more reliably, in terms of both training time and accuracy. Its ability to
adjust the learning rate per parameter, combined with the bias-correction
mechanism, leads to faster convergence and more stable optimization.
This makes Adam especially useful in complex models with large datasets,
as it avoids slow convergence and instability while approaching a good
minimum.
