Linear Regression Equation Explained
Linear regression is a general form of predictive analysis and one of the most widely used statistical techniques. It measures the relationship between one or more predictor variables and one outcome variable; that is, regression analysis examines the relationship between a dependent variable and independent variables.
UQ. What do you mean by linear regression? Which applications are best modeled by linear regression?
(SPPU - Q. 5(a), March 19, Q. 5(b), March 19, Q. 1(b), Nov./Dec. 17, 4 Marks)
• A linear regression model illustrates the relationship between two variables or factors; regression analysis is generally used to show the correlation between two variables.
• The predicted variable in the linear regression equation is known as the dependent variable; let us call it Y. The variables that predict the dependent variable are known as independent variables; let us call them X.
• Y is called the dependent variable because the prediction of Y depends on the other variables (X).
• In simple linear regression analysis, each observation has two variables: the independent variable and the dependent variable. Multiple regression analysis uses two or more independent variables and examines how they relate to the dependent variable. The equation that defines how Y is related to X is called the regression model.
• The coefficient indicates that a one-unit change in the independent variable does not necessarily produce an equal change in Y.
• Now let us look at how the line is fit: we want to put a line through our data that fits the data best. A regression line shows a positive linear relationship (the line slopes up), a negative linear relationship (the line slopes down), or a total absence of relationship (a flat line).
• The point where the line crosses the vertical axis is called the constant (intercept).
• The steeper the slope, the greater the increase in salary per year of experience.
• For example, for 1 more year of experience, salary (Y) might be expected to increase by $10,000; with a steeper slope, it may instead increase by, say, $15,000.
• When we look at the graph, vertical lines can be drawn from the regression line to the actual observations. The actual observations are shown as dots, while the line displays the model's observations (the predictions).
Fig. 3.1.5
Fig. 3.1.6
• The vertical lines show the difference between what an employee actually earns and what the model predicts the employee earns. To find the best line, we look for the minimum sum of squares: all the squared differences are summed, and the line that minimizes this sum is chosen. This is known as the Ordinary Least Squares (OLS) method.
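The OLS idea can be sketched numerically: compute the sum of squared vertical distances for any candidate line, and prefer the line with the smallest sum. A minimal sketch (the data and the fitted line 0.425X + 0.785 are taken from Table 3.1.1 and Fig. 3.1.8 later in this section; the alternative line is an arbitrary comparison):

```python
# Sum of squared errors (SSE) of a candidate line y = b*x + a over the data.
def sse(xs, ys, b, a):
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.00, 2.00, 1.30, 3.75, 2.25]

# The least-squares line has an SSE no larger than any other line's.
print(sse(xs, ys, 0.425, 0.785) <= sse(xs, ys, 0.5, 0.5))  # True
```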
• Regression is a parametric technique, which means it makes assumptions. Let's have a glance at the assumptions it makes:
1. A linear and additive relationship is present between the dependent variable (DV) and the independent variables (IV). Linear means that the change in DV for a 1-unit change in IV is constant. Additive means that the effect of X on Y is independent of the other variables.
2. There must be no correlation between the independent variables. Correlation among independent variables causes multicollinearity; that is, it becomes difficult for the model to determine the actual effect of each IV on the DV.
3. The error terms should have constant variance. When this is violated, heteroskedasticity arises.
4. The error term ∈t should not determine the error term ∈t+1; i.e., the error terms must be uncorrelated.
5. Correlation in error terms is called autocorrelation. Its presence strongly affects the regression coefficients and standard-error values, as these are based on the assumption of uncorrelated error terms.
6. The dependent variable and the error terms should be normally distributed.
• These assumptions make regression relatively restrictive. By restrictive it is meant that the performance of a regression model depends on the fulfilment of these assumptions.
• Simple linear regression has a single input, and statistics can be used to estimate the coefficients.
• Statistical properties of the data need to be calculated: means, standard deviations, correlations, and covariance. All the data must be available to traverse and compute these statistics. The hypothesis function is given by
y = β0 + β1 x
• In simple linear regression, the topic of this section, the predictions of Y when plotted as a
function of X form a straight line.
• The example data in Table 3.1.1 is plotted in Fig. 3.1.7. There is a positive relationship between X and Y: if Y is predicted from X, then the higher the value of X, the higher the prediction of Y.
X     Y
1.00  1.00
2.00  2.00
3.00  1.30
4.00  3.75
5.00  2.25
• Linear regression finds the best-fitting straight line through the points; that line is known as the regression line. In Fig. 3.1.8, the black diagonal line is the regression line, which consists of the predicted score on Y for each possible value of X. Errors of prediction are represented by the vertical lines from each point to the regression line. As shown in Fig. 3.1.8, the red point is near the regression line, so it has a small error of prediction, whereas the yellow point is much higher above the line, so it has a larger error of prediction.
• The black line comprises the predictions, the points depict the actual data, and the vertical lines between the points and the black line denote the prediction errors.
• The error of prediction for a point is the value of the point minus the predicted value (the value on the line).
• Table 3.1.2 shows the predicted values (Y′) and the errors of prediction (Y − Y′). For example, the first point has Y = 1.00 and a predicted Y (called Y′) of 1.21, so its error of prediction is −0.21.
X   Y   Y′   Y − Y′   (Y − Y′)²
• We have not yet defined the term "best-fitting line." The most widely used criterion is the line that minimizes the sum of the squared errors of prediction; this is the criterion used to find the line in Fig. 3.1.8.
• The squared errors of prediction are given in the last column of Table 3.1.2. Compared to any other line, the regression line has the lowest sum of squared errors of prediction.
The regression line can be written as
Y′ = bX + A
where Y′ is the predicted value, b is the slope of the line, and A is the Y intercept.
The equation for the line in Fig. 3.1.8 is
Y′ = 0.425X + 0.785
For X = 1, Y′ = 0.425(1) + 0.785 = 1.21
For X = 2, Y′ = 0.425(2) + 0.785 = 1.635
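The predicted values and errors of prediction in Table 3.1.2 can be reproduced directly from this equation:

```python
# Y' = 0.425*X + 0.785 applied to the Table 3.1.1 data; the error of
# prediction is Y - Y' (e.g. 1.00 - 1.21 = -0.21 for the first point).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.00, 2.00, 1.30, 3.75, 2.25]

for x, y in zip(xs, ys):
    y_pred = 0.425 * x + 0.785
    print(x, y, round(y_pred, 3), round(y - y_pred, 3))
```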
• Multiple linear regression models the relationship between two or more explanatory variables and a response variable by fitting a linear equation to observed data.
• Each value of the independent variables x is associated with a value of the dependent variable y. The population regression line for p explanatory variables x1, x2, ..., xp is defined to be
μy = β0 + β1 x1 + β2 x2 + ... + βp xp
• This line describes how the mean response μy changes with the explanatory variables. The observed values of y vary about their means μy and are assumed to have the same standard deviation σ. The parameters β0, β1, ..., βp are estimated by the fitted values b0, b1, ..., bp of the sample regression line.
• The multiple regression model includes a term for this variation, since the observed values of y vary about their means μy:
DATA = FIT + RESIDUAL, where the "FIT" term represents the expression β0 + β1 x1 + β2 x2 + ... + βp xp.
• The "RESIDUAL" term signifies the deviations of the observed values y from their means μy; these deviations are assumed to be normally distributed with mean 0 and variance σ². The notation for the model deviations is ε.
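The DATA = FIT + RESIDUAL decomposition can be made concrete for two predictors; the coefficients and data below are illustrative, not fitted values:

```python
# FIT part of a two-predictor model, b0 + b1*x1 + b2*x2
# (b0, b1, b2 are illustrative values, not estimates).
def fit_part(x1, x2, b0=1.0, b1=2.0, b2=-0.5):
    return b0 + b1 * x1 + b2 * x2

rows = [(1.0, 2.0, 2.1), (2.0, 1.0, 4.4), (3.0, 3.0, 5.6)]  # (x1, x2, y)
for x1, x2, y in rows:
    residual = y - fit_part(x1, x2)       # RESIDUAL = DATA - FIT
    print(fit_part(x1, x2), residual)     # y always equals FIT + RESIDUAL
```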
• So far we have seen the concept of simple linear regression, where a single predictor variable X is used to model the response variable Y. In many applications, more than one factor influences the response.
• Multiple regression models describe how a single response variable Y depends linearly on a number of predictor variables.
Examples
• The selling price of a house can depend on various factors like the popularity of the
location, the number of bedrooms, the number of bathrooms, the year the house was built,
the square footage of the plot etc.
• A child's height can depend on the height of the parents, the nutrition the child gets, and other environmental factors.
• The “least squares” method is a type of mathematical regression analysis that determines
the best fit line for a collection of data, displaying the relationship between the points
visually.
UQ. What do you mean by least square method? Explain least square method in the context of linear regression. (SPPU - Q. 2(b), Dec. 19, 5 Marks, Q. 1(a), May/June 2016, 5 Marks)
UQ. What do you mean by coefficient of regression? Explain SST, SSE, SSR, MSE in the context of regression. (SPPU - Q. 4(b), Dec. 19, 5 Marks)
• Regression analysis is used to identify the linear relationship between a single dependent variable and an independent variable:
Y = β0 + β1 X + ε
where,
Y = value of the dependent variable
X = independent variable
ε = random error
b = Y-axis intercept = β0
m = slope of the line = β1
Fig. 3.3.1
and the predicted value of y is
ŷ = β0 + β1 X
• Even if X = 0, i.e., the value of the independent variable is zero, the expected value of Y is β0 (the intercept).
• The regression line must pass through the centroid of the sample data, where the centroid is (X̄, Ȳ) and
X̄ = (Σ xi)/n , Ȳ = (Σ yi)/n
• It does not need to have the same number of sample points above and below it.
• The coefficients are given by
β0 = Ȳ − β1 X̄
β1 = Σ (xi − X̄)(yi − Ȳ) / Σ (xi − X̄)²
with summations over i = 1, ..., n.
• The least-squares method is discussed with respect to simple univariate linear regression:
Y = β0 + β1 X + ε
• Here, the target is to find the values of β0 and β1 that best fit the given sample data. Each yi is predicted as ŷi = β0 + β1 xi.
• For each point, we can calculate the difference between the actual value yi and the value ŷi predicted by the regression line. The sum of squared errors, over i = 1, ..., n, is
SSE = Σ (yi − β0 − β1 xi)² ...(1)
• To get the minimum value of SSE, the partial derivatives of SSE with respect to β0 and β1 must equal 0 (all summations below run over i = 1, ..., n):
∂SSE/∂β0 = 0 ; ∂SSE/∂β1 = 0
Taking the partial derivative with respect to β0:
∂SSE/∂β0 = ∂/∂β0 Σ (yi − β0 − β1 xi)² = 0
Σ −2 (yi − β0 − β1 xi) = 0
−2 Σ (yi − β0 − β1 xi) = 0
Σ (yi − β0 − β1 xi) = 0 ...(2)
Taking the partial derivative with respect to β1:
∂SSE/∂β1 = ∂/∂β1 Σ (yi − β0 − β1 xi)² = 0
Σ −2 xi (yi − β0 − β1 xi) = 0
−2 Σ xi (yi − β0 − β1 xi) = 0
Σ xi (yi − β0 − β1 xi) = 0 ...(3)
Expanding equation (2):
Σ yi − Σ β0 − β1 Σ xi = 0
Σ yi − n β0 − β1 Σ xi = 0
n β0 = Σ yi − β1 Σ xi
β0 = (Σ yi)/n − β1 (Σ xi)/n
But X̄ = (Σ xi)/n and Ȳ = (Σ yi)/n, so
β0 = Ȳ − β1 X̄ ...(4)
Substituting (4) into equation (3):
Σ xi (yi − Ȳ + β1 X̄ − β1 xi) = 0
Σ xi (yi − Ȳ) − β1 Σ xi (xi − X̄) = 0
Σ xi (yi − Ȳ) = β1 Σ xi (xi − X̄)
β1 = Σ xi (yi − Ȳ) / Σ xi (xi − X̄)
Since Σ (yi − Ȳ) = 0 and Σ (xi − X̄) = 0, subtracting X̄ Σ (yi − Ȳ) from the numerator and X̄ Σ (xi − X̄) from the denominator changes nothing, giving
β1 = Σ (xi − X̄)(yi − Ȳ) / Σ (xi − X̄)² ...(5)
Writing Sxy = Σ (xi − X̄)(yi − Ȳ) and Sxx = Σ (xi − X̄)², this is
β1 = Sxy / Sxx
In summary, the fitted line is
ŷ = β0 + β1 X
• Even if x = 0, i.e., the value of the independent variable is zero, the expected value of Y is β0.
With
X̄ = (Σ xi)/n and Ȳ = (Σ yi)/n,
the coefficients are
β0 = Ȳ − β1 X̄ ; β1 = Σ (xi − X̄)(yi − Ȳ) / Σ (xi − X̄)²
(summations over i = 1, ..., n).
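The closed-form estimates from equations (4) and (5) translate directly into code; a minimal sketch:

```python
# Least-squares estimates:
#   b1 = sum((xi - xbar)*(yi - ybar)) / sum((xi - xbar)^2)   ... eq. (5)
#   b0 = ybar - b1 * xbar                                    ... eq. (4)
def least_squares(xs, ys):
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

b0, b1 = least_squares([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(b0, b1)  # 0.0 2.0 for perfectly linear data y = 2x
```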
UQ. What do you mean by coefficient of regression? Explain SST, SSE, SSR, MSE in the context of regression. (SPPU - Q. 4(b), Dec. 19, 5 Marks)
UQ. What are the ingredients of machine learning? (SPPU - Q. 4(b), May/June 2016, 5 Marks)
UQ. Enlist ingredients of ML. Explain each ingredient in two or three sentences. (SPPU - Q. 2(b), Nov./Dec. 16, 5 Marks)
UQ. How is the performance of a regression function measured? (SPPU - Q. 2(b), Nov./Dec. 17, 4 Marks)
UQ. Define and explain Squared Error (SE) and Mean Squared Error (MSE) w.r.t. regression. (SPPU - Q. 1(b), May/June 2018, 5 Marks)
UQ. How is the performance of regression assessed? Write the various performance metrics used for it. (SPPU - Q. 4(b), May/June 2019, 4 Marks)
UQ. Suppose you have been given a set of training examples {(x1, y1), (x2, y2), ..., (xn, yn)}. Find the equation of the line that best fits the data, i.e., that minimizes the squared error. (SPPU - Q. 6(b), Oct. 19, 5 Marks)
• A cost function is a measure of how wrong the model is in terms of its ability to estimate the relationship between X and y. It is typically expressed as a difference or distance between the predicted value and the actual value.
• A cost function used in a regression problem is called a "regression cost function".
Mean Error (ME)
• Calculating the mean of the errors is the simplest and most intuitive approach.
• However, the errors can be both negative and positive, so they can cancel each other out during summation, giving a zero mean error for the model.
• Thus this is not a recommended cost function, but it does lay the foundation for the other cost functions of regression models.
Mean Squared Error (MSE)
• This improves on the drawback we encountered in mean error above. Here the square of the difference between the actual and predicted value is calculated, avoiding any possibility of a negative error.
• MSE is measured as the average of the sum of squared differences between predictions and actual observations. It is also known as L2 loss.
• In MSE, since each error is squared, even small deviations in prediction are penalized more than in MAE. But if the dataset has outliers that produce large prediction errors, squaring magnifies those errors many times over and leads to a much higher MSE.
Mean Absolute Error (MAE)
• In this cost function, MAE is measured as the average of the sum of absolute differences between predictions and actual observations.
• It is robust to outliers and thus gives better results even when the dataset has noise or outliers.
R-Squared
• R-squared (R²) is a statistical measure that represents the proportion of the variance of a dependent variable that is explained by the independent variable(s) in a regression model. It is also known as the coefficient of determination.
• Consider the model
yi = β0 + β1 xi + εi
• The values of β0 and β1 are estimated by various methods, e.g. the least-squares method or the maximum-likelihood method.
• The estimated β0 and β1 are used to predict the values yi as ŷi, where
ŷi = β0 + β1 xi
The common performance metrics (summations over i = 1, ..., n) are:
MSE = (1/n) SSE = (1/n) Σ (yi − ŷi)²
RMSE = √MSE = √[(1/n) Σ (yi − ŷi)²]
R-squared = 1 − SSE/SST, where SST = Σ (yi − Ȳ)² is the total variation of y about its mean
MAE = (1/n) Σ |yi − ŷi|
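These metrics can be sketched in a few lines (ŷ written as `yhat`; R² computed against the total sum of squares):

```python
import math

def mse(y, yhat):
    # Mean of the squared differences between actual and predicted values.
    return sum((a - p) ** 2 for a, p in zip(y, yhat)) / len(y)

def rmse(y, yhat):
    # Square root of MSE, in the same units as y.
    return math.sqrt(mse(y, yhat))

def mae(y, yhat):
    # Mean of the absolute differences; more robust to outliers than MSE.
    return sum(abs(a - p) for a, p in zip(y, yhat)) / len(y)

def r_squared(y, yhat):
    # 1 - SSE/SST: the proportion of the variation in y explained by the model.
    ybar = sum(y) / len(y)
    sse = sum((a - p) ** 2 for a, p in zip(y, yhat))
    sst = sum((a - ybar) ** 2 for a in y)
    return 1 - sse / sst

y, yhat = [1.0, 2.0, 3.0], [1.1, 1.9, 3.2]
print(mse(y, yhat), rmse(y, yhat), mae(y, yhat), r_squared(y, yhat))
```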
Solved Examples
Ex. : The scores of five students in Xth std and XIIth std are given below. Fit a regression line predicting the XIIth-std score from the Xth-std score.
Student   Score in Xth std (xi)   Score in XIIth std (yi)
1         95                      85
2         85                      95
3         80                      70
4         70                      65
5         60                      70
Soln. :
Given xi = score of the ith student in Xth std and yi = score in XIIth std, the regression line is
Y = β0 + β1 X
where the values of β0 and β1 given by the least-squares method are as follows.
Fig. 3.4.1 : Linear Regression for Students example
–– ––
0 = Y – 1 X
n
i=1
(x – ––X ) × (y – ––Y )
i i
1 = n
i=1
(x – ––X )
i
2
In this example, n = 5
xi    yi    (xi − X̄)   (xi − X̄)²   (yi − Ȳ)   (xi − X̄)(yi − Ȳ)
95    85    17          289          +8          136
85    95    7           49           +18         126
80    70    2           4            −7          −14
70    65    −8          64           −12         96
60    70    −18         324          −7          126
Σ xi = 390 ; X̄ = (Σ xi)/n = 390/5 = 78
Σ yi = 385 ; Ȳ = (Σ yi)/n = 385/5 = 77
Σ (xi − X̄)² = 289 + 49 + 4 + 64 + 324 = 730
and Σ (xi − X̄)(yi − Ȳ) = 136 + 126 − 14 + 96 + 126 = 470
β1 = Σ (xi − X̄)(yi − Ȳ) / Σ (xi − X̄)² = 470/730 = 0.644
and β0 = Ȳ − β1 X̄ = 77 − (0.644)(78) = 26.768
The fitted regression line is therefore
y = 26.768 + 0.644 x
Interpretation 1
For a 1-mark increase in a student's score in Xth std, the expected increase in the same student's score in XIIth std is 0.644.
Interpretation 2
Even if X = 0 (practically it is not possible to apply this interpretation in the real world), i.e., a student's score in Xth std is 0, the expected score of the student in XIIth std is still 26.768.
Interpretation 3
If a student's score in Xth std is 90, then the student's expected score in XIIth std is calculated as:
y = 26.768 + 0.644 (90) = 84.73
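The worked example can be verified in a few lines; note that keeping full precision gives β1 = 470/730 ≈ 0.6438, so β0 and the prediction differ slightly from the hand calculation, which rounded β1 to 0.644:

```python
xs = [95, 85, 80, 70, 60]   # Xth-std scores
ys = [85, 95, 70, 65, 70]   # XIIth-std scores

n = len(xs)
xbar = sum(xs) / n                                           # 78.0
ybar = sum(ys) / n                                           # 77.0
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))   # 470.0
sxx = sum((x - xbar) ** 2 for x in xs)                       # 730.0
b1 = sxy / sxx                                               # ~0.644
b0 = ybar - b1 * xbar                                        # ~26.78
print(round(b1, 3), round(b0, 2), round(b0 + b1 * 90, 2))
```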
UQ. For a given data having 100 examples, if squared errors SE1, SE2, and SE3 are 13.33, 3.33 and 4.00 respectively, calculate Mean Squared Error (MSE). State the formula for MSE. (SPPU - Q. 1(b), May/June 16, 5 Marks)
UQ. Consider the following data points. Calculate the cost function for θ0 = 0.5 and θ1 = 1 using linear regression. (SPPU - Q. 4(b), Nov./Dec. 17, 6 Marks)
X Y
1 1.5
2 2.75
3 4
4 4.5
5 5.5
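A sketch of the calculation asked in the second question above, using the common convention J(θ) = (1/2n)·Σ(h(x) − y)²; if the plain MSE convention is intended instead, drop the factor of 2:

```python
# h(x) = theta0 + theta1*x with theta0 = 0.5, theta1 = 1.
data = [(1, 1.5), (2, 2.75), (3, 4.0), (4, 4.5), (5, 5.5)]
theta0, theta1 = 0.5, 1.0

squared_errors = [((theta0 + theta1 * x) - y) ** 2 for x, y in data]
cost = sum(squared_errors) / (2 * len(data))
print(cost)  # 0.03125
```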
• Linear regression is a statistical model that captures the linear relationship between two (simple linear regression) or more (multiple linear regression) variables: a dependent variable and independent variable(s). A linear relationship mainly means that the dependent variable increases (or decreases) when one or more independent variables increase (or decrease).
• As can be seen, a linear relationship may be positive (independent variable goes up, dependent variable goes up) or negative (independent variable goes up, dependent variable goes down).
• Multiple linear regression attempts to model the relationship between two or more features and a response by fitting a linear equation to observed data. The steps needed to perform multiple linear regression are similar to those of simple linear regression.
• The difference is in the evaluation: it can be used to find out which factor has the highest impact on the predicted output and how the different variables relate to each other.
• There are two important disadvantages of linear regression. Let's assume that the underlying model is actually close to (or exactly) linear, i.e.,
yi = r(xi) + εi, i = 1, ..., n,
for some underlying regression function r(x) that is approximately (or exactly) linear in x.
• In a high-dimensional regression setting, where the number of predictors p rivals or even exceeds the number of observations n, these limitations become major problems. In fact, when p > n, the linear regression estimate is not even well defined.
How can we do better?
• For a linear model, linear regression has expected test error σ² + p σ²/n. The first term is the irreducible error; the second term comes entirely from the variance of the linear regression estimate (averaged over the input points). Its bias is exactly zero.
• What can be understood from this? If another predictor variable is added into the mix, the same amount of variance, σ²/n, is added, irrespective of whether its true coefficient is large or small (or even zero).
• Hence, in the last example, variance was "spent" trying to fit truly small coefficients; there were 20 of them out of 30.
• One may find that we can do better by shrinking small coefficients towards zero, which possibly introduces some bias but also reduces the variance. In other words, "small details" are ignored in order to get a more stable "big picture". If properly done, this approach can actually work.
UQ. Write short notes on : Linearly and non-linearly separable data. (SPPU - Q. 6(a), March 19, 5 Marks)
UQ. What is a polynomial regression? How can it be represented in the form of a matrix? (SPPU - Q. 2(b), May/June 18, 5 Marks)
UQ. What do you mean by zero-centered and un-correlated features? What is their use in the solution of multivariate linear regression? (SPPU - Q. 4(b), May/June 18, 6 Marks)
• Polynomial regression fits the non-linear relationship between the value of x and the corresponding conditional mean of y, denoted E(y | x).
• Though polynomial regression fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense that the regression function E(y | x) is linear in the unknown parameters estimated from the data.
• Because of this, polynomial regression is considered a special case of multiple linear regression.
• In this case, the model remains linear externally, but it can capture internal non-linearity. Consider Fig. 3.9.1, which shows how scikit-learn implements this technique: this is obviously a non-linear dataset, and any linear regression based only on the original two-dimensional points cannot capture the dynamics.
• If a linear model is applied to a linear dataset, it gives good results, as seen in simple linear regression; but if the same model is applied without any alteration to a non-linear dataset, the results can be drastic: the loss function increases, the error rate becomes high, and the accuracy ultimately decreases.
• Thus in such cases, where data points are arranged in a non-linear fashion, a polynomial regression model is needed. This can be understood better using the comparison of the linear dataset and non-linear dataset shown in Fig. 3.9.2 and Fig. 3.9.3.
• Hence, if a dataset is organized in a non-linear way, the polynomial regression model is used instead of simple linear regression.
y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ
• Polynomial regression is a crucial extension of linear regression. Its main idea is how to select the features. Consider multivariate regression with 2 variables, x1 and x2. Linear regression will look like this:
y = a1·x1 + a2·x2
• To get a polynomial regression (say, a 2-degree polynomial), a few additional features are created: x1·x2, x1², and x2². So we get our "linear regression":
y = a1·x1 + a2·x2 + a3·x1·x2 + a4·x1² + a5·x2²
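The feature construction above can be sketched directly; a "linear" model over the expanded features is a degree-2 polynomial in x1 and x2 (the coefficient values are illustrative):

```python
# Degree-2 polynomial feature expansion for two inputs:
# [x1, x2] -> [x1, x2, x1*x2, x1**2, x2**2]
def poly2_features(x1, x2):
    return [x1, x2, x1 * x2, x1 ** 2, x2 ** 2]

coeffs = [1.0, 2.0, 0.5, -1.0, 0.25]  # a1..a5, illustrative values

def predict(x1, x2):
    # Linear in the expanded features, polynomial in the original inputs.
    return sum(a * f for a, f in zip(coeffs, poly2_features(x1, x2)))

print(poly2_features(2.0, 3.0))  # [2.0, 3.0, 6.0, 4.0, 9.0]
```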
• A polynomial term turns a linear model into a curve via quadratic (squared) or cubic (cubed) terms. But since it is the data X that is squared or cubed, and not the beta coefficients, it is still a linear model. This lets us model curves easily without explicitly building a complicated nonlinear model.
• One frequent pattern in machine learning is to use linear models trained on nonlinear functions of the data. This approach keeps the fast performance of linear methods while letting them fit a much wider range of data.
Advantages of polynomial regression:
1. A polynomial provides the best approximation of the relationship between the dependent and independent variable.
2. A wide range of functions can be fit with it.
3. Polynomials can fit a wide range of curvature.
Fig. 3.12.1
• A machine learning model can be called good if it generalizes appropriately to new input data from the problem domain. This allows it to make predictions on future data that the model has never seen.
• To check how well a machine learning model learns and generalizes to new data, we use the concepts of overfitting and underfitting. They are responsible for the poor performance of machine learning algorithms.
3.12.1 Underfitting
• Underfitting means the model is not able to fit the data well. This usually happens when limited data is available to build an accurate model, or when a linear model is used to fit non-linear data.
• In these situations the machine learning model applies overly simple rules to minimal data, with the result that the model makes wrong predictions. To avoid underfitting, more data and richer features are required.
3.12.2 Overfitting
• A model is said to be overfitted when it is trained with a lot of data. When this happens, it starts learning from the noise and the inaccurate data entries in the data set.
• The model then cannot categorize the data properly, due to too much detail and noise. Non-parametric and non-linear methods are common causes of overfitting, as these types of machine learning algorithms have more freedom in building the model from the dataset and hence can build unrealistic models.
• Overfitting is more probable with nonparametric and nonlinear models that have more flexibility when learning a target function.
• For instance, decision trees are a nonparametric machine learning algorithm; because they are very flexible, the problem of overfitting arises.
• This problem can be overcome by pruning the tree after learning, so that some of the detail it has picked up is removed.
3.13.1 Bias
• Consider two values: one predicted by our model and the other the actual value of the data (target value).
• Bias refers to the gap between these two values (the value predicted by our model and the actual value of the data).
• Some bias helps us generalize better and makes our model less sensitive to any single data point.
• A model with high bias pays very little attention to the training data and oversimplifies the model.
High Bias
The estimated data value is a long way from the actual data value, resulting in a large gap between the two.
Low Bias
The estimated data value is close to the actual data value, i.e., there is a smaller gap between the expected and actual data value.
3.13.2 Variance
• A high-variance model pays close attention to the training data and does not generalise to data it hasn't seen before.
• On training data, such models work well, but on test data they have a high error rate.
• Variance comes from highly complex models with a large number of features.
Low Bias, Low Variance
The difference between actual and predicted values is small, and the predictions are grouped together (refer Fig. 3.13.1).
Low Bias, High Variance
Data is scattered due to high variance, but by the rule of low bias it is not far from the actual data (target value), as seen in Fig. 3.13.1.
High Bias, Low Variance
By the rule of high bias there is a huge gap, and by the rule of low variance the predictions are grouped together (refer Fig. 3.13.1).
Fig. 3.13.1 : Bias and Variance graphical visualization
High Bias, High Variance
By the rule of high bias there is a huge gap, and by the rule of high variance the data is scattered (refer Fig. 3.13.1).
• The ideal case is when the predicted values are almost identical to the actual data values. So the ideal option is Low Bias and Low Variance.
Underfitting