Introduction to Econometric Modeling
Basic Concept
1. Regression analysis is the process of estimating the functional relationship between a random
variable Y (called the dependent variable) and one or more non-random variables Xs (called the
independent variables).
2. In regression analysis, we assume that the relationship between the dependent variable Y and the
independent variables Xs is linear, as in the following form:
$$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k + \varepsilon \tag{1}$$
3. If equation (1) above involves only one independent variable (i.e., $Y = \beta_0 + \beta_1 X_1 + \varepsilon$), it is called a
simple regression model. If it involves more than one independent variable, it is referred to as a
multiple regression model.
4. In equation (1) above, we assume that all Xs are independent (uncorrelated) and, in addition, that the
values of the Xs are given constants (not random variables). The random errors, $\varepsilon$'s (epsilons), are
assumed to be independently and normally distributed with mean equal to zero and a constant (but
unknown) variance $\sigma^2$. With these assumptions, the expected value of Y is:
$$E(Y) = E(\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k + \varepsilon) = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k \tag{2}$$
5. Since the coefficients ($\beta$'s) of equation (1) are unknown, we must estimate them from the
data collected for X and Y. In regression analysis, the method used to estimate these coefficients
is called the "least squares method", so called because the resulting mathematical function (called
the least squares equation, least squares line, or simply regression line) has the smallest sum of
squared estimation errors (minimum SSE). The least squares equation can be written as:
$$\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \dots + \hat\beta_k x_k \tag{3}$$
where $\hat{y}$ (the fitted value of y) is used to estimate E(y), and
$\hat\beta_i$ (the fitted value of $\beta_i$) is used to estimate $\beta_i$.
Note that the $\hat\beta_i$ are obtained by solving the following k + 1 equations (called normal equations)
simultaneously:
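$$\frac{\partial}{\partial \hat\beta_i} \sum (y - \hat{y})^2 = \frac{\partial}{\partial \hat\beta_i} \sum (y - \hat\beta_0 - \hat\beta_1 x_1 - \dots - \hat\beta_k x_k)^2 = 0, \quad i = 0, \dots, k \tag{4}$$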
6. If all the assumptions mentioned above hold, sample statistics such as $\hat\beta_i$ have important
properties that allow us to perform statistical analysis on the regression model obtained.
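To make the normal equations concrete, here is a minimal Python sketch that builds and solves them for a small multiple regression. It relies on the standard fact that the k + 1 normal equations can be written in matrix form as (X'X)b = X'y, where X has a leading column of ones for the intercept. The data are hypothetical and noise-free (generated from y = 1 + 2 x1 + 3 x2), and the helper names `solve_linear_system` and `least_squares` are our own, not part of any library.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so row operations act on both sides.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def least_squares(xs, ys):
    """Return [b0, b1, ..., bk] from the normal equations (X'X) b = X'y."""
    rows = [[1.0] + list(x) for x in xs]  # x_0 = 1 gives the intercept
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical, noise-free data generated from y = 1 + 2*x1 + 3*x2.
xs = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in xs]
coefs = least_squares(xs, ys)
print([round(b, 6) for b in coefs])
```

Because the data contain no random error, the estimates should recover the generating coefficients up to rounding. The sketch does not guard against perfectly collinear Xs, in which case X'X is singular and the division by the pivot fails.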
Simple Linear Regression
Steps involved in simple linear regression:
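1 Propose a model: Generally, the model is $Y = \beta_0 + \beta_1 X + \varepsilon$.
2 Collect data to estimate $\beta_0$ and $\beta_1$.
$$\hat{y} = \hat\beta_0 + \hat\beta_1 x; \quad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}; \quad \hat\beta_1 = \frac{SS_{XY}}{SS_{XX}} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$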
The Analysis of Variance Table
SOURCE      DF     SS   MS                 F        P
Regression  1      SSR  MSR = SSR          MSR/MSE  p-value
Error       n - 2  SSE  MSE = SSE/(n - 2)
Total       n - 1  SST
$$SST = \sum (y - \bar{y})^2 = SS_{YY}; \quad SSE = \sum (y - \hat{y})^2 = SS_{YY} - \hat\beta_1 SS_{XY}; \quad SSR = SST - SSE$$
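As a rough illustration of step 2, the sketch below computes the least squares estimates and the ANOVA quantities in plain Python. It uses the column totals of the 10-house data from the "Simple Regression: An Example" section of these notes, and it assumes the usual computational shortcuts such as SS_XX = sum(X^2) - (sum X)^2 / n. The variable names are ours.

```python
# Column totals from the 10-house example in these notes.
n = 10
sum_x, sum_y = 18800, 15947
sum_xy, sum_xx, sum_yy = 31283250, 37755400, 26277083

# Corrected sums of squares and cross-products (shortcut formulas).
ss_xx = sum_xx - sum_x ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n
ss_yy = sum_yy - sum_y ** 2 / n

# Least squares estimates of the slope and intercept.
b1 = ss_xy / ss_xx
b0 = sum_y / n - b1 * sum_x / n

# ANOVA decomposition: SST = SSR + SSE.
sst = ss_yy
sse = ss_yy - b1 * ss_xy
ssr = sst - sse
msr = ssr / 1            # regression has 1 degree of freedom
mse = sse / (n - 2)      # error has n - 2 degrees of freedom
f_stat = msr / mse

print(f"b0 = {b0:.1f}, b1 = {b1:.5f}")
print(f"SSR = {ssr:.0f}, SSE = {sse:.0f}, SST = {sst:.0f}")
print(f"MSE = {mse:.0f}, F = {f_stat:.2f}")
```

The printed values can be compared against the regression output reported for that example (slope about 0.540, intercept about 579, F about 39.5).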
3 Test the usefulness of the model
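1) H0: $\beta_1 = 0$ vs. H1: $\beta_1 \neq 0$
Test Statistic:
$$t = \frac{\hat\beta_1}{s_{\hat\beta_1}}, \quad \text{where } s_{\hat\beta_1} = \frac{s}{\sqrt{SS_{XX}}} \text{ and } s = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}}.$$
Decision Rule: Reject H0 if $|t| > t_{n-2,\,\alpha/2}$ and do not reject otherwise (for a two-tailed test).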
You can also construct a confidence interval for $\beta_1$. The formula is:
$$\hat\beta_1 \pm t_{n-2,\,\alpha/2}\, s_{\hat\beta_1} = \hat\beta_1 \pm t_{n-2,\,\alpha/2}\, \frac{s}{\sqrt{SS_{XX}}}.$$
If the confidence interval does not cover 0, the null hypothesis can be rejected at significance
level $\alpha$.
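The t-test and the confidence interval for the slope can be sketched in a few lines of Python. This example again uses the column totals of the 10-house data from the example section of these notes. It assumes a 5% significance level, and the critical value t(8, 0.025) = 2.306 is typed in from a t table rather than computed, since the Python standard library has no t distribution.

```python
import math

# Column totals from the 10-house example in these notes.
n = 10
sum_x, sum_y = 18800, 15947
sum_xy, sum_xx, sum_yy = 31283250, 37755400, 26277083

ss_xx = sum_xx - sum_x ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n
ss_yy = sum_yy - sum_y ** 2 / n

b1 = ss_xy / ss_xx
sse = ss_yy - b1 * ss_xy
s = math.sqrt(sse / (n - 2))        # estimated standard deviation of the errors
se_b1 = s / math.sqrt(ss_xx)        # standard error of the slope
t_stat = b1 / se_b1

t_crit = 2.306                      # t_{8, 0.025} from a t table (alpha = 0.05)
reject_h0 = abs(t_stat) > t_crit

# 95% confidence interval for the slope.
ci_low = b1 - t_crit * se_b1
ci_high = b1 + t_crit * se_b1

print(f"s = {s:.1f}, SE(b1) = {se_b1:.5f}, t = {t_stat:.2f}")
print(f"reject H0: {reject_h0}")
print(f"95% CI for the slope: ({ci_low:.3f}, {ci_high:.3f})")
```

Both routes give the same decision here: |t| exceeds the critical value, and the interval lies entirely above 0.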
2) Calculate r, $R^2$ and adjusted $R^2$ to check the strength of the linear relationship between Y and X.
Note that $r^2 = R^2$ in a simple regression model.
$$r = \frac{SS_{XY}}{\sqrt{SS_{XX}\, SS_{YY}}}; \quad R^2 = \frac{SST - SSE}{SST} = \frac{SSR}{SST}; \quad R^2_{adj} = \frac{SST/df_{SST} - SSE/df_{SSE}}{SST/df_{SST}}$$
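A short Python sketch of these three measures follows, using the column totals of the 10-house data from the example section of these notes. The degrees of freedom are taken from the ANOVA table (n - 1 for SST and n - 2 for SSE), and the variable names are ours.

```python
import math

# Column totals from the 10-house example in these notes.
n = 10
sum_x, sum_y = 18800, 15947
sum_xy, sum_xx, sum_yy = 31283250, 37755400, 26277083

ss_xx = sum_xx - sum_x ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n
ss_yy = sum_yy - sum_y ** 2 / n

# Sample correlation coefficient between X and Y.
r = ss_xy / math.sqrt(ss_xx * ss_yy)

# Coefficient of determination from the ANOVA sums of squares.
sst = ss_yy
sse = ss_yy - (ss_xy / ss_xx) * ss_xy
r_squared = (sst - sse) / sst

# Adjusted R^2 compares mean squares instead of raw sums of squares.
df_sst, df_sse = n - 1, n - 2
r_squared_adj = (sst / df_sst - sse / df_sse) / (sst / df_sst)

print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}")
print(f"R^2 = {r_squared:.4f}, adjusted R^2 = {r_squared_adj:.4f}")
```

The output illustrates that r^2 and R^2 coincide in simple regression, and that the adjusted value is slightly smaller because it penalizes the lost degree of freedom.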
4 Make Prediction
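1) Point prediction of y is $\hat{y}$.
2) $100(1 - \alpha)\%$ Confidence Interval for the mean of y when $x = x_p$:
$$\hat{y} \pm t_{n-2,\,\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{XX}}}$$
3) $100(1 - \alpha)\%$ Prediction Interval for y when $x = x_p$:
$$\hat{y} \pm t_{n-2,\,\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{XX}}}$$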
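The two intervals differ only by the extra 1 under the square root, which the following Python sketch makes explicit. It uses the column totals of the 10-house data from the example section of these notes. The value x_p = 2000 is a hypothetical new house chosen for illustration, the function name `intervals` is ours, and the 95% critical value t(8, 0.025) = 2.306 is typed in from a t table.

```python
import math

# Column totals from the 10-house example in these notes.
n = 10
sum_x, sum_y = 18800, 15947
sum_xy, sum_xx, sum_yy = 31283250, 37755400, 26277083

x_bar, y_bar = sum_x / n, sum_y / n
ss_xx = sum_xx - sum_x ** 2 / n
ss_xy = sum_xy - sum_x * sum_y / n
ss_yy = sum_yy - sum_y ** 2 / n

b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar
s = math.sqrt((ss_yy - b1 * ss_xy) / (n - 2))

T_CRIT = 2.306  # t_{8, 0.025} from a t table (95% intervals)


def intervals(x_p):
    """Return (y_hat, CI for the mean of y, PI for a single y) at x = x_p."""
    y_hat = b0 + b1 * x_p
    # Term shared by both formulas: 1/n + (x_p - x_bar)^2 / SS_XX.
    spread = 1 / n + (x_p - x_bar) ** 2 / ss_xx
    ci_half = T_CRIT * s * math.sqrt(spread)
    pi_half = T_CRIT * s * math.sqrt(1 + spread)
    return (y_hat,
            (y_hat - ci_half, y_hat + ci_half),
            (y_hat - pi_half, y_hat + pi_half))


# Hypothetical new house of 2000 square feet.
y_hat, ci, pi = intervals(2000)
print(f"point prediction: {y_hat:.1f}")
print(f"95% CI for mean:  ({ci[0]:.1f}, {ci[1]:.1f})")
print(f"95% PI for y:     ({pi[0]:.1f}, {pi[1]:.1f})")
```

The prediction interval is always wider than the confidence interval, because it must also cover the random error of an individual observation. Both intervals widen as x_p moves away from the sample mean of x.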
Simple Regression: An Example
In the following example, we will use regression analysis to find the relationship between size of house
and electricity usage. Data collected from 10 houses are given below, where X = size of the house (in
square ft) and Y = electricity usage (in kWh). To facilitate calculations, we also include columns of
XY, $X^2$ and $Y^2$.
Obs      Y      X        XY       X^2       Y^2
1     1182   1290   1524780   1664100   1397124
2     1172   1350   1582200   1822500   1373584
3     1264   1470   1858080   2160900   1597696
4     1493   1600   2388800   2560000   2229049
5     1571   1710   2686410   2924100   2468041
6     1711   1840   3148240   3385600   2927521
7     1804   1980   3571920   3920400   3254416
8     1840   2230   4103200   4972900   3385600
9     1956   2400   4694400   5760000   3825936
10    1954   2930   5725220   8584900   3818116
Sum  15947  18800  31283250  37755400  26277083
The plot of Y against X shown below indicates that the two variables have a fairly strong positive
linear relationship.
[Figure: scatter plot of Y against X with the fitted line y = 0.5403x + 578.93, R^2 = 0.8317.]
We will now do the following:
1 Fit the simple regression model: $Y = \beta_0 + \beta_1 X + \varepsilon$.
2 Test the validity of the model.
3 Examine how well the model fits the data by using $R^2$.
Regression Analysis: Y versus X
The regression equation is
Y = 579 + 0.540 X
Predictor Coef SE Coef T P
Constant 578.9 167.0 3.47 0.008
X 0.54030 0.08593 6.29 0.000
S = 133.4 R-Sq = 83.2% R-Sq(adj) = 81.1%
Analysis of Variance
Source DF SS MS F P
Regression 1 703957 703957 39.54 0.000
Residual Error 8 142445 17806
Total 9 846402
Obs X Y Fit SE Fit Residual St Resid
1 1290 1182.0 1275.9 66.0 -93.9 -0.81
2 1350 1172.0 1308.3 62.1 -136.3 -1.15
3 1470 1264.0 1373.2 55.0 -109.2 -0.90
4 1600 1493.0 1443.4 48.6 49.6 0.40
5 1710 1571.0 1502.8 44.7 68.2 0.54
6 1840 1711.0 1573.1 42.3 137.9 1.09
7 1980 1804.0 1648.7 43.1 155.3 1.23
8 2230 1840.0 1783.8 51.8 56.2 0.46
9 2400 1956.0 1875.7 61.5 80.3 0.68
10 2930 1954.0 2162.0 99.6 -208.0 -2.34R
R denotes an observation with a large standardized residual
You should try to verify all the numbers printed in the above regression analysis output, using formulae
that we covered in class.
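As one way to carry out this check, the Python sketch below reproduces the observation-level columns of the output (Fit, SE Fit, Residual, St Resid) from the raw data. The SE Fit column follows the standard error term of the confidence interval for the mean of y. The formula used for St Resid, residual / (s * sqrt(1 - h_i)), is our assumption about how the software defines that column, and it is not derived in these notes.

```python
import math

# House data from the example: X = size, Y = electricity usage.
x = [1290, 1350, 1470, 1600, 1710, 1840, 1980, 2230, 2400, 2930]
y = [1182, 1172, 1264, 1493, 1571, 1711, 1804, 1840, 1956, 1954]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

fits = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fits)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# h_i = 1/n + (x_i - x_bar)^2 / SS_XX, the term under the square root
# in the confidence interval for the mean of y.
h = [1 / n + (xi - x_bar) ** 2 / ss_xx for xi in x]
se_fit = [s * math.sqrt(hi) for hi in h]
# Assumed definition of "St Resid": residual / (s * sqrt(1 - h_i)).
st_resid = [e / (s * math.sqrt(1 - hi)) for e, hi in zip(residuals, h)]

print("Obs     X       Y     Fit  SE Fit  Residual  St Resid")
for i in range(n):
    print(f"{i + 1:3d} {x[i]:5d} {y[i]:7.1f} {fits[i]:7.1f} {se_fit[i]:7.1f} "
          f"{residuals[i]:9.1f} {st_resid[i]:9.2f}")
```

Under that assumption the last observation (X = 2930) has a standardized residual beyond 2 in absolute value, consistent with the R flag in the output.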
From the output, we can see that the $R^2$ of the model is fairly high (83.2%) and the p-values for the F-test and
the t-test of the slope coefficient are very small (both reported as 0.000), indicating that this is a fairly satisfactory model.
Thus, we can use this model to make predictions.