Unit Nine
Correlation and Simple Linear Regression
Unit Objectives
By the end of this unit, students will be able to:
Describe association of two variables
Measure association of variables
Describe independent and dependent variables
Model relationship of two variables using simple linear regression
Make inference on the slope
Construct confidence and prediction intervals
Introduction
In many studies, we measure more than one variable for each individual. For example:
Precipitation and plant growth
Soil erosion and volume of water …
Height and weight
Education and self-esteem
We collect pairs of data from each sampled unit and want to describe the pair of data (bivariate data) instead of describing each variable separately (univariate data).
When we describe bivariate data, we check for a relationship between the variables using a scatterplot or numerically. Correlation and regression are the methods used to describe such a relationship numerically.
Correlation
Correlation is defined as the statistical association between two variables.
The first step in checking for the presence of association is a scatter plot.
Visual examination is largely subjective, so we need a more precise and objective measure of the correlation between the two variables.
To quantify the strength and direction of the association, we use the Pearson coefficient of correlation.
The Pearson coefficient of correlation (r) for paired data (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ) is defined as:

r = (nΣxy − Σx Σy) / [√(nΣx² − (Σx)²) × √(nΣy² − (Σy)²)]

The sums are conveniently organized in a computing table:

  x    y    x²    y²    xy
  x₁   y₁   x₁²   y₁²   x₁ × y₁
  x₂   y₂   x₂²   y₂²   x₂ × y₂
  ⋮    ⋮    ⋮     ⋮     ⋮
  xₙ   yₙ   xₙ²   yₙ²   xₙ × yₙ
  Σxᵢ  Σyᵢ  Σxᵢ²  Σyᵢ²  Σ(xᵢ × yᵢ)

The value of r ranges in [−1, 1].
Interpretation of r:
✓ r = +1: perfect positive relationship
✓ r = −1: perfect negative relationship
✓ r → 1: strong positive relationship
✓ r → −1: strong negative relationship
✓ |r| → 0: weak to no linear relationship
Example
A researcher wishes to see whether there is a relationship between the number of hours of study (X) and test scores on an exam (Y). She took a random sample of 6 students. The data, with the corresponding scatter plot, are shown on the right. Compute the correlation coefficient r.
[Scatter plot: test score Y (0–100) against study hours X (0–8)]
The result, with n = 6, Σx = 19, Σy = 433, Σxy = 1476, Σx² = 79, Σy² = 31935:
r = (nΣxy − Σx Σy) / [√(nΣx² − (Σx)²) × √(nΣy² − (Σy)²)]
  = (6×1476 − 19×433) / [√(6×79 − 19²) × √(6×31935 − 433²)]
  ≈ 0.921, which shows a strong positive relationship.
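As a check, the correlation can be computed directly from the summary sums quoted in the example. This is a minimal Python sketch; the six raw data pairs are not listed on the slide, so only the totals are used:

```python
import math

# Summary sums from the example (n = 6 students)
n, Sx, Sy = 6, 19, 433
Sxy, Sx2, Sy2 = 1476, 79, 31935

# Pearson correlation from the computational formula
num = n * Sxy - Sx * Sy                                   # n·Σxy − Σx·Σy
den = math.sqrt(n * Sx2 - Sx**2) * math.sqrt(n * Sy2 - Sy**2)
r = num / den
print(round(r, 3))  # about 0.92: a strong positive relationship
```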
Simple Linear Regression (SLR)
Once a relationship between two variables has been established, the next step is to model the relationship.
When we model, we need:
An explanatory, independent, or predictor variable (X)
A response or dependent variable (Y): the variable being explained
The model can be used to predict the value of the response variable for a given value of the independent variable.
The SLR model is a mathematical equation that allows us to predict a response for a given predictor value.
Simple Linear Regression Model (SLR)
The SLR equation provides an estimate of the population regression line, called the fitted line: Ŷᵢ = b₀ + b₁Xᵢ. But how do we obtain b₀ and b₁?
Measures of Variation
Total sum of squares (SST or Syy): the total squared deviation from the mean of the dependent variable Y.
SST = Σ(Y − Ȳ)² measures the total variation of the Y values about the mean Ȳ.
SSR = regression sum of squares: the explained variation, attributable to the relationship between X and Y.
SSE = error sum of squares: the unexplained variation, attributable to factors other than the relationship between X and Y.
o SSE measures the variability of the points around the fitted line.
o s²ᵧ|ₓ = SSE / (n − 2) = MSE and SE = √MSE
The building blocks used throughout are:
Sxx = Σ(X − X̄)² = ΣX² − nX̄²
Sxy = Σ(X − X̄)(Y − Ȳ) = ΣXY − nX̄Ȳ
Least Squares Method
Helps us to get the best fitted line. How?
Minimize: SSE = Σ(Y − Ŷ)² = Σ[Y − (b₀ + b₁X)]²
To minimize SSE:
o Find the derivatives of SSE with respect to b₀ and b₁,
o set them to zero, and
o solve the resulting equations for b₀ and b₁,
o which gives us:
b₁ = Sxy / Sxx and b₀ = Ȳ − b₁X̄
Example (Continued)
For the study-hour and test-score example, fit the regression line:
Ȳ = ΣY/n = 433/6 ≈ 72.17,  X̄ = ΣX/n = 19/6 ≈ 3.17
b₁ = Sxy/Sxx = (ΣXY − nX̄Ȳ)/(ΣX² − nX̄²) = (nΣXY − ΣxΣy)/(nΣX² − (Σx)²) = (6×1476 − 19×433)/(6×79 − 19²) = 629/113 ≈ 5.57
b₀ = Ȳ − b₁X̄ = 72.17 − 5.57×3.17 ≈ 54.5
Based on the given data, we can say that for each one-hour increase in study time, the predicted test score increases by 5.57 points.
b₀ = 54.5 means Ŷ = 54.5 when X = 0. Since a score like that without any study is implausible, and no observed student studied zero hours, the intercept should be read as an extrapolation that anchors the line within the observed range, not as a literal prediction.
If X = 4, Ŷ = b₀ + b₁X = 54.5 + 5.57×4 = 54.5 + 22.28 ≈ 76.8
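The fitted coefficients and the prediction at X = 4 can be reproduced from the same summary sums; a minimal sketch in plain Python:

```python
# Least-squares fit from the example's summary sums
n, Sx, Sy, sum_xy, sum_x2 = 6, 19, 433, 1476, 79

xbar, ybar = Sx / n, Sy / n
Sxy = sum_xy - n * xbar * ybar        # ΣXY − n·X̄·Ȳ
Sxx = sum_x2 - n * xbar**2            # ΣX² − n·X̄²

b1 = Sxy / Sxx                        # slope: ≈ 5.57
b0 = ybar - b1 * xbar                 # intercept: ≈ 54.5

yhat_4 = b0 + b1 * 4                  # predicted score after 4 hours of study
print(round(b1, 2), round(b0, 1), round(yhat_4, 1))
```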
Coefficient of Determination
Measures the proportion of the total variation in the dependent variable that is explained
by variation in the independent variable
It is also called r-squared, denoted r².
r² = SSR/SST = 1 − SSE/SST
0 ≤ r² ≤ 1
Example (Continued)
SST = Σ(Y − Ȳ)² = ΣY² − (ΣY)²/n = 686.83
SSE = Σ(Y − Ŷ)² = 103.4
r² = SSR/SST = 1 − SSE/SST = 1 − 103.4/686.83 = 0.8495
84.95% of the variation in the test score is explained by the variation in the study hours.
Adjusted r² = 1 − [SSE/(n − k)] / [SST/(n − 1)], where k is the number of estimated parameters (here k = 2); it penalizes r² for model size. For this example, adjusted r² = 1 − (103.29/4)/(686.83/5) ≈ 0.812.
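A quick sketch verifying SST and r² from the sums; SSE is taken from the text, since the raw residuals are not listed:

```python
# Coefficient of determination from the example's summary sums
n, Sy, sum_y2 = 6, 433, 31935

SST = sum_y2 - Sy**2 / n      # ΣY² − (ΣY)²/n : total variation
SSE = 103.4                   # error sum of squares as reported in the text
r2 = 1 - SSE / SST            # proportion of variation explained
print(round(SST, 2), round(r2, 4))
```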
Software (Stata) regression output for the example:

      Source |        SS      df         MS         Number of obs =       6
       Model | 583.541298      1   583.541298       F(1, 4)       =   22.60
    Residual | 103.292035      4   25.8230088       Prob > F      =  0.0089
       Total | 686.833333      5   137.366667       R-squared     =  0.8496
                                                    Adj R-squared =  0.8120
                                                    Root MSE      =  5.0816

           Y |     Coef.   Std. Err.      t    P>|t|    [95% Conf. Interval]
           X |  5.566372   1.170954    4.75   0.009     2.315282    8.817461
       _cons |  54.53982   4.248912   12.84   0.000     42.74295    66.33669
Confidence Interval for the Slope (β₁)
As the sample varies, the value of the slope varies accordingly. Hence b₁ is a random variable with:
Mean: the population slope β₁.
Standard error: SE_b₁ = √[ (SSE/(n − 2)) / Σ(X − X̄)² ] = √(MSE/Sxx)
It follows a t distribution with n − 2 degrees of freedom.
Confidence interval for the slope:
o (b₁ − E, b₁ + E)
o where E = SE_b₁ × t₁₋α/₂, ₙ₋₂

Example Continued: 95% CI for β₁:
o SE_b₁ = √[ (103.39/4) / 18.83 ] = 1.17
o b₁ = 5.57
o E = SE_b₁ × t₀.₉₇₅,₄ = 1.17 × 2.776 = 3.25
o (b₁ − E, b₁ + E) is the 95% CI
o = (5.57 − 3.25, 5.57 + 3.25)
o = (2.32, 8.82). We are 95% confident that this interval covers the true population slope.
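The interval can be sketched in Python; this assumes SciPy is available for the t critical value:

```python
import math
from scipy.stats import t

n = 6
SSE, Sxx = 103.292, 18.8333       # from the regression output and Σ(X − X̄)²
b1 = 5.566372                     # estimated slope

MSE = SSE / (n - 2)
se_b1 = math.sqrt(MSE / Sxx)      # standard error of the slope
tcrit = t.ppf(0.975, n - 2)       # t critical value with 4 df (≈ 2.776)
E = tcrit * se_b1                 # margin of error
ci = (b1 - E, b1 + E)
print(round(se_b1, 2), tuple(round(v, 2) for v in ci))
```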
Hypothesis Test on the Slope (β₁)
b₁ is a point estimator of β₁ and follows a t distribution.
t-test for the population slope:
o Tests whether there is a linear relationship between X and Y
Hypotheses to be tested:
o H₀: β₁ = 0 (no linear relationship)
o Hₐ: β₁ ≠ 0 (linear relationship exists)
o Test statistic: t = (b₁ − β₁) / SE_b₁

Example Continued: To check whether the relationship between test score and study hours is significant at the 0.05 level:
✓ H₀: β₁ = 0 (no linear relationship) vs. Hₐ: β₁ ≠ 0 (relationship exists)
✓ t = (b₁ − β₁)/SE_b₁ = (5.57 − 0)/1.17 = 4.75
✓ P-value = 0.009
✓ Since the P-value < 0.05, reject H₀.
✓ At the 0.05 level of significance, there is sufficient evidence of a linear relationship between study hours and test score.
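A sketch of the test statistic and its two-sided p-value, again assuming SciPy for the t distribution:

```python
from scipy.stats import t

b1, se_b1, n = 5.566372, 1.170954, 6
t_stat = b1 / se_b1                       # test statistic under H0: β1 = 0
p_value = 2 * t.sf(abs(t_stat), n - 2)    # two-sided p-value with 4 df
print(round(t_stat, 2), round(p_value, 3))
```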
Confidence and Prediction Intervals
Confidence interval (mean response for a given x value):
o (Ŷ − E, Ŷ + E) represents a range of values that is likely to contain the true mean value of the response variable for a specific value of the predictor variable.
o E = t₁₋α/₂, ₙ₋₂ × SE × √( 1/n + (x − x̄)²/Sxx )
o SE = √MSE = standard error
Prediction interval (single new response):
o (Ŷ − E, Ŷ + E) represents a range of values that is likely to contain the true value of the response variable for a single new observation at a specific value of the predictor variable.
o E = t₁₋α/₂, ₙ₋₂ × SE × √( 1 + 1/n + (x − x̄)²/Sxx )
o SE = √MSE = standard error
Example cont'd
Eg. If we want to estimate the average score of students with study hours = 3, we use the confidence interval, whereas if we want to predict the performance of a single student with study hours = 3, we use the prediction interval.
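Both intervals at x = 3 can be sketched from the fitted coefficients and the root MSE reported in the regression output (SciPy assumed for the critical value):

```python
import math
from scipy.stats import t

n, xbar, Sxx = 6, 19 / 6, 18.8333
b0, b1 = 54.53982, 5.566372
SE = 5.0816                               # root MSE from the regression output
x_new = 3

yhat = b0 + b1 * x_new                    # point estimate at x = 3
tcrit = t.ppf(0.975, n - 2)
h = 1 / n + (x_new - xbar) ** 2 / Sxx     # term shared by both intervals

E_ci = tcrit * SE * math.sqrt(h)          # margin for the mean response
E_pi = tcrit * SE * math.sqrt(1 + h)      # margin for a single new response
ci = (yhat - E_ci, yhat + E_ci)
pi = (yhat - E_pi, yhat + E_pi)
print(round(yhat, 2), tuple(round(v, 1) for v in ci), tuple(round(v, 1) for v in pi))
```

Note that the prediction interval is always wider than the confidence interval at the same x: a single observation carries the extra variance of one new error term, hence the additional 1 under the square root.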
Assumptions of the Regression
Linearity
▪ The underlying relationship between X and Y is linear
Independence of Errors
▪ Error values are statistically independent
Normality of Errors
▪ Error values are normally distributed for a given value of X
Equal Variance
▪ The probability distribution of the error has constant variance
You can use the acronym LINE
Checking Assumptions: Residual Analysis
A residual is the difference between the observed and predicted values:
eᵢ = Yᵢ − Ŷᵢ
To check the assumptions:
Examine residuals for linearity
Examine residuals for independence
Examine residuals for normality
Examine residuals for homoscedasticity
Graphical analysis of residuals: residual vs. X scatter plot
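Since the slide's raw data pairs are not listed, here is a hedged sketch using hypothetical (x, y) values to show how residuals are computed; with an intercept in the model, least-squares residuals always sum to (numerically) zero:

```python
# Hypothetical data for illustration only (not the slide's sample)
xs = [1, 2, 3, 4, 5, 6]
ys = [52, 60, 71, 75, 84, 92]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
Sxx = sum((x - xbar) ** 2 for x in xs)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

# Residuals: observed minus fitted; plot these against x (or against
# the fitted values) to inspect the LINE assumptions visually.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(round(sum(residuals), 10))
```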
Quiz[5%]
1. A residual is computed as a y‐coordinate from the data minus a y‐coordinate
predicted by the line. [TRUE |False]
2. A paired data set has n = 5, Σx = 15, Σy = 27, Σxy = 100, and Σx² = 55.
Then Sxy = __________________________
Sxx = ____________________________
3. A data set has x̄ = 10, ȳ = 8 and slope = 1.5; the intercept is = ________________________
4. If a confidence interval for the slope of the least squares regression line for all pairs in
the population has a positive lower bound and a positive upper bound, then there is
no evidence for linear association between the variables. [TRUE |False]
5. Which of the following statements is NOT true regarding linear regression?
a) It identifies significant predictors for a continuous outcome variable.
b) It predicts the outcome of a binary variable with continuous variables.
c) It quantifies a relationship between two continuous variables.
d) It models a linear relationship between two continuous variables.
End of the course!
Thank You !