INTRODUCTION TO STATISTICS &
PROBABILITY
Chapter 11:
Multiple Regression
Dr. Nahid Sultana
1/29/2023 Copyright© Nahid Sultana 2017-2018
11.1 Inference for Multiple Regression
Population multiple regression equation
Data for multiple regression
Multiple linear regression model
Confidence intervals and significance tests
ANOVA
Squared multiple correlation R2
Population Multiple Regression Equation
Linear regression model in which the mean response, μy, is related to one
explanatory variable x:
μy = β0 + β1x
Usually, more complex linear models are needed in practical situations.
There are many problems in which knowledge of more than one explanatory
variable is necessary in order to obtain a better understanding and better
prediction of a particular response.
In multiple regression, the response variable y depends on p explanatory
variables x1, x2, …, xp:
μy = β0 + β1x1 + β2x2 + ⋯ + βpxp
Data for Multiple Regression
The data for a simple linear regression problem consists of n observations
(𝑥𝑖 , 𝑦𝑖 ) of two variables.
Data for multiple linear regression consists of the value of a response
variable y and p explanatory variables (𝑥1 , 𝑥2 , … , 𝑥𝑝 ) on each of n cases.
We write the data and enter them into software in the form:
Case   x1    x2    …   xp    y
1      x11   x12   …   x1p   y1
2      x21   x22   …   x2p   y2
…      …     …     …   …     …
n      xn1   xn2   …   xnp   yn
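As a concrete sketch (hypothetical numbers, assuming numpy is available), the case-by-variable layout above can be held in an n × (p + 1) array, with the response in the last column:

```python
import numpy as np

# Hypothetical data: n = 4 cases, p = 3 explanatory variables.
# Each row is one case: (x1, x2, x3, y).
data = np.array([
    [1.2, 3.4, 0.5, 10.1],
    [2.0, 1.1, 0.9, 12.3],
    [0.7, 2.8, 1.4,  9.8],
    [1.5, 0.6, 2.0, 11.0],
])

X = data[:, :-1]  # n x p matrix of explanatory variables
y = data[:, -1]   # response vector of length n
```

This is the form most regression software expects: one row per case, one column per variable.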
Multiple Linear Regression Model
The statistical model for multiple linear regression is
yi = β0 + β1xi1 + β2xi2 + ⋯ + βpxip + εi
for 𝑖 = 1, 2, … , 𝑛.
The mean response µy is a linear function of the explanatory variables:
𝜇𝑦 = 𝛽0 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + ⋯ + 𝛽𝑝 𝑥𝑝
The deviations εi are independent and Normally distributed N(0, σ).
The parameters of the model are β0, β1, …, βp and σ.
The coefficient 𝜷𝒊 (𝒊 = 𝟏, … , 𝒑) has the following interpretation: It
represents the average change in the response when the variable xi increases
by one unit and all other x variables are held constant.
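A minimal simulation of this model (a sketch with numpy; the parameter values β0 = 2, β1 = 1.5, β2 = −0.8 and σ = 0.5 are hypothetical) shows how responses are generated as a linear mean plus independent Normal deviations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter values for illustration.
beta0 = 2.0
beta = np.array([1.5, -0.8])   # p = 2 coefficients
sigma = 0.5                    # common standard deviation of the deviations
n = 200

X = rng.normal(size=(n, 2))                  # explanatory variables
mu_y = beta0 + X @ beta                      # mean response: linear in the x's
y = mu_y + rng.normal(0.0, sigma, size=n)    # independent N(0, sigma) deviations
```

Holding x2 fixed and increasing x1 by one unit shifts mu_y by exactly β1 = 1.5, which is the interpretation of the coefficient given above.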
Estimation of the Parameters
Select a random sample of n individuals on which 𝑝 + 1 variables
(𝑥1, … , 𝑥𝑝, 𝑦) are measured. The least-squares regression method chooses
b0, b1, …, bp to minimize the sum of squared deviations Σ(yi − ŷi)², where
ŷi = b0 + b1xi1 + ⋯ + bpxip
As with simple linear regression, the constant b0 is the y intercept.
The regression coefficients (𝑏1, … , 𝑏𝑝) reflect the unique association of
each independent variable with the y variable. They are analogous to
the slope in simple regression.
The parameter 𝜎2 measures the variability of the responses about the
population response mean. The estimator of σ2 is
s² = Σ ei² / (n − p − 1) = Σ (yi − ŷi)² / (n − p − 1)
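The estimation step can be sketched with numpy's least-squares solver (simulated data with hypothetical parameter values; not any particular software's output):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.3, size=n)

# Design matrix with a leading column of ones for the intercept b0.
A = np.column_stack([np.ones(n), X])

# Least squares chooses b = (b0, b1, ..., bp) to minimize sum((y - yhat)^2).
b, _, _, _ = np.linalg.lstsq(A, y, rcond=None)

yhat = A @ b
sse = np.sum((y - yhat) ** 2)
s2 = sse / (n - p - 1)   # estimator of sigma^2, with n - p - 1 degrees of freedom
```

Note the divisor n − p − 1: one degree of freedom is spent on each of the p + 1 estimated coefficients.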
Confidence Interval for βj
Estimating the regression parameters 𝛽0, … , 𝛽𝑗, … , 𝛽𝑝 is a case of one-
sample inference with unknown population variance.
We rely on the t distribution, with n – p – 1 degrees of freedom.
A level C confidence interval for βj is
𝑏𝑗 ± 𝑡 ∗ 𝑆𝐸𝑏𝑗
where SEbj is the standard error of bj and t* is the critical value for the
t(n − p − 1) distribution with area C between −t* and t*.
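A small helper, assuming scipy is available, computes this interval (the function name beta_ci is ours, not from any library):

```python
from scipy import stats

def beta_ci(bj, se_bj, n, p, level=0.95):
    """Level-C confidence interval bj +/- t* * SEbj with n - p - 1 df."""
    df = n - p - 1
    tstar = stats.t.ppf(0.5 + level / 2, df)   # area `level` between -t* and t*
    return bj - tstar * se_bj, bj + tstar * se_bj
```

For example, beta_ci(4.7082, 0.2643, 31, 2, level=0.90) reproduces the 90% interval for the diameter coefficient in the tree example of this section.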
Example
A multiple regression was performed on data obtained from 31 trees to
predict usable volume from the height (in feet) and the diameter (in inches)
of the tree. Partial computer results are shown below.

Predictor   Coef      SE Coef   t
Constant    −57.988   8.638     −6.71
Diameter    4.7082    0.2643    17.82
Height      0.3393    0.1302    2.61

With df = n − p − 1 = 31 − 2 − 1 = 28, t* = 1.701.
Based on these results, a 90% confidence interval for the coefficient β1 of
diameter is
a. 4.708 ± 1.645 (0.2643).
b. 4.708 ± 1.701 (0.2643).
c. 4.708 ± 2.048 (0.2643).
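The correct choice (b) can be checked with scipy's t quantile function:

```python
from scipy import stats

# Tree example: df = 31 - 2 - 1 = 28, 90% CI for the diameter coefficient.
tstar = stats.t.ppf(0.95, 28)    # area 0.90 between -t* and t*
lo = 4.708 - tstar * 0.2643
hi = 4.708 + tstar * 0.2643
# tstar is about 1.701, so the interval is roughly (4.258, 5.158).
```

The value 1.645 in choice (a) is the large-sample z* critical value, and 2.048 in choice (c) is the 95% (not 90%) t* for 28 df.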
Significance Test for 𝛽𝑗
To test the hypothesis H0: βj = 0 versus a one- or two-sided alternative,
we calculate the t statistic t = bj / SEbj, which has the t(n − p − 1) distribution
when H0 is true. The P-value of the test is found in the usual way.
Note: Software typically provides two-sided P-values.
Example
A multiple regression was performed on data obtained from 50 stores
predicting sales given traffic, ease of access, and income. Output is given
below. Using .05 as level of significance, which of the following is true?
            Coefficients   Standard Error   t Stat     P-value
Intercept   284.2341216    165.341245       1.719076   0.092324
TRAFFIC     0.42858422     0.996331003      0.430162   0.669086
EASE        6.221037701    2.991297216      2.079712   0.043152
INCOME      6.408191928    5.184074702      1.23613    0.222685
a. Only ease of access is significant.
b. Only traffic is significant.
c. All three variables are significant.
Only EASE has a P-value (0.043) below 0.05, so the answer is a.
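The two-sided P-values in the output can be reproduced from the t statistics with scipy, using df = n − p − 1 = 50 − 3 − 1 = 46:

```python
from scipy import stats

df = 50 - 3 - 1   # n - p - 1 = 46

# t statistics copied from the output above.
tstats = {"TRAFFIC": 0.430162, "EASE": 2.079712, "INCOME": 1.23613}

# Two-sided P-value: twice the upper-tail area beyond |t|.
pvals = {name: 2 * stats.t.sf(abs(t), df) for name, t in tstats.items()}
# Only EASE falls below 0.05.
```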
Significance Test for 𝛽𝑗
Suppose we test H0: βj = 0 for each j and find that none of the p tests is
significant.
Should we then conclude that none of the explanatory variables is related to
the response?
No, we should not! When we fail to reject H0: βj = 0, this means that we
probably do not need xj in the model with all the other variables.
So, failure to reject all such hypotheses merely means that it is safe to
throw away at least one of the variables. Further analysis must be done to
see which subset of variables provides the best model.
ANOVA F-test for Multiple Regression
In multiple regression, the ANOVA F statistic tests the hypotheses
H0: β1 = β2 = ⋯ = βp = 0
versus Ha: at least one βj ≠ 0
by computing the F statistic F = MSM / MSE.
When H0 is true, F follows
the F(p, n − p − 1) distribution.
The P-value is P(F > f ).
A significant P-value does not mean that all p explanatory variables have
a significant influence on y—only that at least one does.
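A short helper, assuming scipy is available, carries out this F test from the sums of squares (the function name anova_f_pvalue is ours):

```python
from scipy import stats

def anova_f_pvalue(ssm, sse, n, p):
    """ANOVA F test of H0: beta_1 = ... = beta_p = 0."""
    msm = ssm / p              # mean square for the model
    mse = sse / (n - p - 1)    # mean square for error
    F = msm / mse
    return F, stats.f.sf(F, p, n - p - 1)   # P(F(p, n - p - 1) > F)
```

For the tree example later in this section, anova_f_pvalue(7684.2, 421.9, 31, 2) gives F ≈ 254.99 with a P-value that is essentially zero.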
ANOVA Table for Multiple Regression
Source   Sum of squares SS   df          Mean square MS   F         P-value
Model    Σ(ŷi − ȳ)²          p           MSM = SSM/DFM    MSM/MSE   Tail area above F
Error    Σ(yi − ŷi)²         n − p − 1   MSE = SSE/DFE
Total    Σ(yi − ȳ)²          n − 1

SSM = model sum of squares    SSE = error sum of squares
SST = total sum of squares    SST = SSM + SSE
DFM = p    DFE = n − p − 1    DFT = n − 1    DFT = DFM + DFE
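The decomposition SST = SSM + SSE can be verified numerically on simulated data (a numpy sketch with hypothetical parameters; the identity holds exactly for least squares with an intercept):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 2
X = rng.normal(size=(n, p))
y = 1.0 + X @ np.array([0.5, -0.3]) + rng.normal(0, 1.0, size=n)

# Fit by least squares with an intercept column.
A = np.column_stack([np.ones(n), X])
b, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
yhat = A @ b
ybar = y.mean()

ssm = np.sum((yhat - ybar) ** 2)   # model sum of squares
sse = np.sum((y - yhat) ** 2)      # error sum of squares
sst = np.sum((y - ybar) ** 2)      # total sum of squares
# sst equals ssm + sse up to floating-point rounding.
```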
Example
A multiple regression was performed on data obtained from 31 trees to
predict usable volume from the height (in feet) and the diameter (in inches)
of the tree. Partial computer results are shown below.
Predictor        Coef      SE Coef   t
Constant         −57.988   8.638     −6.71
Diameter         4.7082    0.2643    17.82
Height           0.3393    0.1302    2.61

Analysis of Variance
Source           DF   SS       MS       F   p
Regression       2    7684.2   3842.1
Residual Error   28   421.9
Total            30   8106.1
The value of the MSR for this model is
a. 7684.2.
b. 3842.1.
c. 15.07.
MSR = SSR/df = 7684.2/2 = 3842.1, so the answer is b.
Example
A multiple regression was performed on data obtained from 31 trees to predict usable
volume from the height (in feet) and the diameter (in inches) of the tree. Partial computer
results are shown below.
Predictor        Coef      SE Coef   t
Constant         −57.988   8.638     −6.71
Diameter         4.7082    0.2643    17.82
Height           0.3393    0.1302    2.61

Analysis of Variance
Source           DF   SS       MS       F        p
Regression       2    7684.2   3842.1   254.99
Residual Error   28   421.9    15.07
Total            30   8106.1
The value of the F statistic for this model is
a. 18.21.
b. 254.99.
c. 14.22.
F = MSR/MSE = 3842.1/15.07 ≈ 254.99 (software uses the unrounded MSE), so the
answer is b.
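The answer (b) follows by direct arithmetic from the ANOVA table:

```python
# Arithmetic check of the tree ANOVA output.
msr = 7684.2 / 2    # SSR / df for the model
mse = 421.9 / 28    # SSE / df for error (about 15.07)
F = msr / mse       # about 254.99, matching the printed output
```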
Squared Multiple Correlation R²
Just as with simple linear regression, R2, the squared multiple correlation, is
the proportion of the variation in the response variable y that is explained by
the model.
R² = Σ(ŷi − ȳ)² / Σ(yi − ȳ)² = SSM / SST
In the particular case of multiple linear regression, the model is all p
explanatory variables taken together.
The square root of R2, called the multiple correlation coefficient, is the
correlation between the observations and the predicted values.
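The claim that the multiple correlation coefficient equals √R² can be checked on simulated data (a numpy sketch with hypothetical parameters):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 80
X = rng.normal(size=(n, 2))
y = 0.5 + X @ np.array([1.0, 2.0]) + rng.normal(0, 1.0, size=n)

# Least-squares fit with an intercept.
A = np.column_stack([np.ones(n), X])
b, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
yhat = A @ b

r2 = np.sum((yhat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)  # SSM / SST
mult_corr = np.corrcoef(y, yhat)[0, 1]   # correlation of observed vs predicted
# sqrt(r2) agrees with mult_corr.
```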
Example
A multiple regression was performed on data obtained from 50 stores
predicting sales given traffic, ease of access, and income. Output is given
below. The value of R2 is
ANOVA
            df   SS          MS         F
Regression  3    2124768.5   708256.2   17.95249
Residual    46   1814777.9   39451.69
Total       49   3939546.4
a. 73.4%.
b. 53.9%.
c. 46.1%.
R² = SSRegression/SSTotal = 2124768.5/3939546.4 ≈ 53.9%, so the answer is b.
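The answer (b) is a one-line check:

```python
# R^2 from the store ANOVA table: SSRegression / SSTotal.
r2 = 2124768.5 / 3939546.4   # about 0.539, i.e. 53.9%
```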