Deep Learning: Linear Regression & Optimization

Uploaded by Nalain Abbas
Topics covered

  • Sigmoid Function,
  • Contour Plot,
  • Data Visualization,
  • Predictive Modeling,
  • Weight Adjustment,
  • Statistical Models,
  • Linear Regression,
  • Training Data,
  • Training Epochs,
  • Function Parameters

Deep Learning

Lecture-2
Dr. Abdul Jaleel
Associate Professor
Machine learning: A new Programming Paradigm
Linear Regression: y = mx + c

But how do we determine the exact values of m and c?

Linear Regression: y = mx + c

There are too many possible lines. Which one is best suited?
Mean Squared Error

Residuals, error, or loss feed into a cost function, which may be used to compare different hypothetical lines.

There are lots of regression lines, each having some cost. Which one is the best? The one with the minimum MSE cost.
Loss Functions

 Squared Loss: Loss = (y_i − ŷ_i)²

 Mean Squared Error (MSE): MSE = (1/n) Σ (y_i − ŷ_i)²

 Absolute Loss: Loss = |y_i − ŷ_i|

 Mean Absolute Error (MAE): MAE = (1/n) Σ |y_i − ŷ_i|
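These four losses can be sketched in plain Python (a minimal sketch; `y_true` and `y_pred` are assumed to be equal-length lists, and the function names are illustrative):

```python
def squared_loss(y_true, y_pred):
    # Per-sample squared residuals (y_i - yhat_i)^2
    return [(yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)]

def mse(y_true, y_pred):
    # Mean Squared Error: average of the squared residuals
    return sum(squared_loss(y_true, y_pred)) / len(y_true)

def absolute_loss(y_true, y_pred):
    # Per-sample absolute residuals |y_i - yhat_i|
    return [abs(yt - yp) for yt, yp in zip(y_true, y_pred)]

def mae(y_true, y_pred):
    # Mean Absolute Error: average of the absolute residuals
    return sum(absolute_loss(y_true, y_pred)) / len(y_true)
```

MSE squares each residual before averaging, so large errors dominate; MAE treats all residuals linearly.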
Convex Optimization and Gradient Descent Approach

A real-valued function defined on an n-dimensional interval is called convex if the line segment between any two points on its graph lies above or on the graph.
Convex Optimization for a Set of 21 Data Points

Regression line: y_p = wx + 0
Convex Optimization

Loss function plot of the 21 data points for

W_j = {−1, −0.5, 0, 0.5, 1, 1.5, 2, 2.5, …, 5}

Loss = (1/n) Σ (y_i − w_j x_i)²
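That weight sweep can be reproduced as a sketch, assuming synthetic data generated from y = 2x in place of the lecture's actual 21 points:

```python
# Evaluate MSE for each candidate weight w_j on synthetic data from
# y = 2x (illustrative; the lecture's 21 real points are not reproduced).
xs = [i * 0.1 for i in range(21)]        # 21 data points
ys = [2 * x for x in xs]                 # true relationship y = 2x

def mse_for_weight(w):
    # Loss for the hypothesis y_p = w * x (no bias term yet)
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

weights = [-1 + 0.5 * j for j in range(13)]   # {-1, -0.5, ..., 5}
losses = [mse_for_weight(w) for w in weights]
best = weights[losses.index(min(losses))]
print(best)   # → 2.0, the weight with minimum loss
```

Plotting `losses` against `weights` gives the convex bowl shown on the slide.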
Convex Optimization

From the range of weight values plotted in the left-side graph, let us estimate the loss for weight value zero.
Convex Optimization

y_p = 0x + 0

The Mean Squared Error loss for weight value zero is calculated from the differences (distances) between the red and green lines plotted on the right side.
Convex Optimization

y_p = 0.5x + 0
Convex Optimization

y_p = 1x + 0

Next, let us estimate the loss for weight value one.
Convex Optimization

y_p = 1.5x + 0

Weight value 1.5 decreases the MSE.
Convex Optimization

y_p = 2x + 0

For weight value 2, the predicted line best fits the data points.
Convex Optimization with Bias

y_p = wx + b
Convex Optimization with Bias

Loss = (1/n) Σ (y_i − (w x_i + b))²
Convex Optimization with Bias

Loss = (1/n) Σ (y_i − (w x_i + b))²

The loss function's surface plot is converted into a contour plot.
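The surface behind that contour plot can be sketched numerically. Assuming illustrative data drawn from y = 2x + 0 (not the slide's actual points), a coarse grid over (w, b) already locates the minimum:

```python
# Sketch of the loss surface over (w, b); the contour plot on the slide
# is the level sets of this same surface (data here is illustrative).
xs = [i * 0.1 for i in range(21)]
ys = [2 * x + 0 for x in xs]

def loss(w, b):
    # MSE for the hypothesis y_p = w*x + b
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Evaluate a coarse (w, b) grid and pick the lowest point on the surface
grid = [(w * 0.5, b * 0.5) for w in range(-2, 11) for b in range(-4, 5)]
best_w, best_b = min(grid, key=lambda p: loss(*p))
print(best_w, best_b)   # → 2.0 0.0
```

Passing the same `loss` values for a dense grid to a plotting library would reproduce the surface and contour views.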
Convex Optimization with Bias

y_p = wx + b
Convex Optimization with Bias

Trial lines of the form y_p = wx + b:

y_p = 0x − 1
y_p = 1x − 1
y_p = 2x − 1
y_p = 2x + 0
y_p = 2x + 1
Gradient Descent Approach

Slope and Derivative

Result: the derivative of x² is 2x.

Derivative and Partial Derivative
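Putting the partial derivatives to work: a minimal gradient-descent sketch for y_p = wx + b, assuming a toy dataset drawn from y = 2x + 1 and an illustrative learning rate:

```python
# Gradient descent on MSE for y_p = w*x + b (illustrative data and
# hyperparameters; not the lecture's dataset).
xs = [i * 0.1 for i in range(21)]
ys = [2 * x + 1 for x in xs]   # target line y = 2x + 1

w, b, lr = 0.0, 0.0, 0.1
n = len(xs)
for epoch in range(5000):
    # Partial derivatives of MSE with respect to w and b
    dw = -(2 / n) * sum(x * (y - (w * x + b)) for x, y in zip(xs, ys))
    db = -(2 / n) * sum((y - (w * x + b)) for x, y in zip(xs, ys))
    # Step opposite the gradient
    w -= lr * dw
    b -= lr * db

print(round(w, 2), round(b, 2))   # → 2.0 1.0
```

Each epoch moves (w, b) a small step opposite the gradient, so the loss shrinks until the parameters converge on the true line.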
Deep Learning
Lecture-3
Dr. Abdul Jaleel
Associate Professor
H(x) = Pred_y

Gradient Descent

Let us apply gradient descent in coefficient learning to find the values of a function's parameters that minimize the cost function as far as possible.

We have almost reached the best-fit line.
- In neural networks, we apply logistic regression on top of the gradient-descent-learned parameters of the linear regression best-fit line.

- The Sigmoid function works as an activation function for the neuron to classify the outcome.
Why we need a Sigmoid / Logit function instead of a Step Function for Neuron Activation
The Linear Equation

Non-linear Activation Function
How it works for Row 1

Predicted and actual outcome for Row 1: error calculated with the LogLoss function instead of MSE.

Predicted and actual outcome for Row 2: error calculated with the LogLoss function.

Predicted and actual outcome for Row 13: error calculated with the LogLoss function.
Loss is high for W1 = 1, W2 = 1; we need to apply gradient descent.
Implementation of activation functions in Python
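A minimal sketch of the two activations contrasted above, the hard step and the smooth sigmoid:

```python
import math

def step(z):
    # Step activation: abrupt 0/1 output, zero gradient almost everywhere
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    # Sigmoid activation: smooth output in (0, 1), differentiable everywhere
    return 1.0 / (1.0 + math.exp(-z))

print(step(-2), step(2), sigmoid(0.0))   # → 0.0 1.0 0.5
```

The sigmoid's smooth, well-defined derivative is what makes it usable with gradient descent, unlike the step function.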
Implementation of loss functions in Python
Now we start implementing gradient descent in plain Python. Again, the goal is to come up with the same w1, w2, and bias that the Keras model calculated. We want to show how Keras/TensorFlow would have computed these values internally using gradient descent.

First, write a couple of helper routines, such as sigmoid and log_loss.
Now it is time to implement our own custom neural network class.
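A minimal sketch of such a class: one sigmoid neuron with two input features, trained by batch gradient descent on log loss (the data and hyperparameters here are illustrative, not the lecture's Keras example):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class MyNN:
    """A single sigmoid neuron: y_p = sigmoid(w1*x1 + w2*x2 + bias)."""

    def __init__(self):
        self.w1, self.w2, self.bias = 1.0, 1.0, 0.0

    def fit(self, X1, X2, y, epochs=10000, rate=1.0):
        n = len(y)
        for _ in range(epochs):
            preds = [sigmoid(self.w1 * a + self.w2 * b + self.bias)
                     for a, b in zip(X1, X2)]
            # Gradients of log loss w.r.t. w1, w2, and bias
            dw1 = sum(a * (p - t) for a, p, t in zip(X1, preds, y)) / n
            dw2 = sum(b * (p - t) for b, p, t in zip(X2, preds, y)) / n
            db = sum(p - t for p, t in zip(preds, y)) / n
            # Step opposite the gradient
            self.w1 -= rate * dw1
            self.w2 -= rate * dw2
            self.bias -= rate * db

    def predict(self, x1, x2):
        return sigmoid(self.w1 * x1 + self.w2 * x2 + self.bias)

# Illustrative training run on a tiny OR-style dataset
model = MyNN()
model.fit([0, 0, 1, 1], [0, 1, 0, 1], [0, 1, 1, 1])
```

After training, `model.predict(0, 0)` should be near 0 and `model.predict(1, 1)` near 1; a Keras single-neuron model fitted on the same data should learn similar w1, w2, and bias values.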
This shows that, in the end, we were able to come up with the same values of w1, w2, and bias using a plain-Python implementation of gradient descent.

You can compare predictions from our own custom model and the TensorFlow model. You will notice that the predictions are almost the same.

LINKS

 [Link]ear-regression-with-mathematical-insights/

 [Link]

 [Link]

 https://[Link]/why-not-mse-as-a-loss-function-for-logistic-regression-589816b5e03c

 [Link]c-regression-logarithmic-expr#:~:text=Mean%20Squared%20Error%2C%20commonly%20used,function%20is%20however%20always%20convex

Common questions

Gradient descent is a fundamental optimization algorithm used to find the optimal parameter values of a function by iteratively moving towards the minimum of a cost function. This is achieved by computing the gradient, or derivative, of the cost function with respect to the parameters and updating the parameters in the opposite direction to the gradient. In neural networks, this approach is crucial for training the network as it systematically reduces the error by fine-tuning the weights and biases based on the computed gradients. By continuously applying these updates, the network's parameters converge towards values that minimize the loss, thus ensuring an optimal model fit. This iterative process allows deep learning models to adjust during training, ultimately improving their predictive accuracy.

The Sigmoid function is preferred over the Step function in neural networks for neuron activation primarily due to its non-linear property. While the Step function changes output abruptly between 0 and 1, the Sigmoid function provides a smooth gradient which helps in better gradient-based learning via backpropagation. The continuous nature of the Sigmoid function allows the network to adjust weights gently and update them incrementally, aiding convergence and learning. Also, the derivative of the Sigmoid function is well-defined everywhere, offering a useful gradient for optimization methods like gradient descent, which a Step function lacks due to its discontinuous nature.

Using the Step function instead of the Sigmoid function for neuron activation in neural networks presents several challenges. The primary issue arises from the Step function's non-differentiable and abrupt output transition between classes, making it unsuitable for gradient-based optimization methods such as backpropagation. This discontinuity prevents the effective computation and propagation of gradients through the network, stymieing learning and convergence. Moreover, the lack of output sensitivity to input variations essentially nullifies the potential for fine-tuning during training, as neuron outputs either stay the same or change instantaneously without intermediate values. Consequently, replacing the Sigmoid function with a Step function can severely limit the network's ability to learn complex patterns and adjust to errors incrementally during training.

The gradient descent approach assists in reaching the best-fit line in linear regression by iteratively updating the model's parameters to minimize the cost function, typically the Mean Squared Error (MSE). This is achieved by calculating the gradient of the cost function with respect to the parameters, directing how adjustments should be made to decrease error. The effectiveness of this method in linear regression stems from its ability to handle the optimization of continuous cost landscapes effectively, ensuring convergence by following the steepest descent path. Similarly, gradient descent is applied in other machine learning models where it optimizes complex parameters subject to a cost function. This technique is especially indispensable in deep learning for training multi-layered networks by allowing backpropagation to adjust weights and biases systematically across layers, ultimately enhancing the model's predictive performance.

A neural network utilizes gradient descent internally by iteratively tuning its weights and biases to minimize a defined cost function, achieving a configuration that parallels implementations like those in Keras or TensorFlow. When training a neural network, the algorithm calculates the gradient of the cost function in relation to each weight, indicating the direction in which the weights should be adjusted to reduce error. By applying these adjustments step-by-step, the network's parameters get fine-tuned towards minimizing the loss. In implementations such as Keras or TensorFlow, this process is streamlined through automatic differentiation, which efficiently computes gradients during model training. By coding gradient descent in Python, similar outcomes can be achieved, highlighting that the choice of framework, while offering ease and optimizations, primarily facilitates the gradient computations that a custom neural network can replicate through manual gradient descent steps.

Convex optimization plays a crucial role in determining the best-fit line in linear regression by minimizing the cost function, which is often represented as Mean Squared Error (MSE). In this context, a real-valued cost function defined over an n-dimensional interval is convex if the line segment between any two points on the graph of the function lies above or on the graph. This property is utilized to find the optimal values of the weights by comparing costs for different hypothetical regression lines. The lowest cost, or MSE, indicates the best-fit line for the data points. An iterative approach, such as adjusting the weights and calculating the corresponding losses, helps identify the optimal regression line. For example, varying the weight value, as shown by a loss function plot for different weight values like 0, 0.5, 1.5, 2, etc., helps evaluate how well the line fits the data, ultimately selecting the one with minimal MSE, effectively optimizing the model's parameters.

A contour plot helps visualize the optimization of a loss function during linear regression model training by representing the cost function's surface and illustrating how adjustments to parameters move toward a minimization point. Each contour line shows points where the function takes on a constant value; hence, adjacent contours differ by changes in the cost function value. As gradient descent progresses, moving along a path on the contour plot, the plotted path highlights how weight and bias adjustments affect the loss reduction. This visualization offers an intuitive understanding of how the optimization process approaches the optimal solution via systematically descending to regions of lower cost, aiding in tracking the convergence and diagnosing potential issues during training.
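That descent path across contour lines can be sketched as follows, assuming a small illustrative dataset drawn from y = 2x + 1:

```python
# Track (w, b) across gradient-descent steps and confirm the loss only
# decreases, i.e. each step lands on a lower contour than the last.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]          # data on the line y = 2x + 1

def loss(w, b):
    # MSE for the hypothesis y_p = w*x + b
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

w, b, lr = 0.0, 0.0, 0.05
path = [(w, b, loss(w, b))]
for _ in range(200):
    n = len(xs)
    dw = -(2 / n) * sum(x * (y - (w * x + b)) for x, y in zip(xs, ys))
    db = -(2 / n) * sum(y - (w * x + b) for x, y in zip(xs, ys))
    w, b = w - lr * dw, b - lr * db
    path.append((w, b, loss(w, b)))

# The recorded losses are non-increasing: the path descends the contours
assert all(q[2] <= p[2] for p, q in zip(path, path[1:]))
```

Plotting the (w, b) pairs in `path` on top of the loss function's contour plot would show the trajectory crossing successively lower contour lines toward the minimum.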

Log Loss is crucial in logistic regression as it aligns with the nature of classification problems, providing a measure of how far the predicted probabilities deviate from the actual labels. Unlike MSE, which is suited for regression tasks and assumes a continuous output, Log Loss evaluates the accuracy of probability predictions in binary classification. It measures the uncertainty of predictions, penalizing incorrect ones far more heavily than correct predictions. This allows for more nuanced updates during model training, effectively adjusting weights based on prediction confidence. The Log Loss approach ensures that the cost function used is convex, facilitating convergence during optimization and making it a preferred choice over MSE, which may lead to non-convex problems in classification scenarios.

Mean Squared Error (MSE) and Mean Absolute Error (MAE) are both used as loss functions in linear regression but have distinct characteristics and impacts on the model. MSE squares the residuals, which emphasizes larger errors more than smaller ones, making it sensitive to outliers. It provides a solution that minimizes the variance of the error distribution, often preferred when outliers are less of a concern, aiming to reduce large prediction errors more aggressively. Conversely, MAE computes the absolute differences between actual and predicted values, offering a linear penalization of errors and more robust performance in the presence of outliers due to its lower sensitivity. The choice between the two depends on the specific model goals and data characteristics; MSE is useful when large errors are especially problematic or when the model requires a differentiable function for optimization via gradient methods, while MAE is preferred for more uniform treatment of all errors.
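The outlier sensitivity described above can be illustrated with a small sketch (the numbers are made up):

```python
# One corrupted data point moves MSE far more than MAE.
def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

clean = [1.0, 2.0, 3.0, 4.0]
preds = [1.1, 2.1, 2.9, 4.1]
dirty = [1.0, 2.0, 3.0, 14.0]    # last point is an outlier

print(mse(clean, preds), mae(clean, preds))
print(mse(dirty, preds), mae(dirty, preds))   # MSE grows quadratically, MAE linearly
```

The outlier inflates MSE by a factor of thousands here while MAE only grows by a modest factor, which is exactly the robustness trade-off described above.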

The convexity of a cost function is pivotal in the effectiveness of optimization methods like gradient descent because it ensures the function has a single global minimum and no local minima. This property guarantees that the iterative updates carried out by gradient descent will reliably converge to the optimal solution if the learning rate is appropriately set. A convex cost function allows for more straightforward optimization because the surface defined by the function is smooth and predictable, enabling consistent and efficient parameter updates. This contrasts with non-convex functions that may lead to suboptimal local minima, thereby complicating the training process, making convergence more difficult, and heightening the risk of getting stuck in less ideal parameter spaces.
