UNIVERSITY OF CALIFORNIA, LOS ANGELES
Department of Economics

Economics 143 (Cameron) - Applied Regression Analysis

Computing Lab Session #4: Multiple Regression and Omitted Variables Bias


Goals for this lab:

Tasks:

1. Omitted Variables Bias Example from Problem Set #4: Concept: Finding an apparent relationship in a simple regression that actually isn't there, as a consequence of omitted variables bias. File needed: n:colds.sha.

a.) Try a naive regression model that treats COLDS as a function of VISITS to places of worship: OLS COLDS VISITS. If your atheist brother-in-law asserts that more-religious people are light-weight hypochondriacs who are always getting sick, what would this regression imply with respect to his assertion? Using the basic regression results, test the null hypothesis that VISITS have no effect on COLDS. (Also try an appropriate version of the TEST command to verify these results.) BE SURE YOU KNOW HOW TO DO THIS AND HOW TO INTERPRET THE RESULTS.

b.) You wonder if any other factors besides religiosity affect health status. AGE is a variable you have handy. You control for AGE before examining the effect of VISITS on COLDS by running the regression OLS COLDS VISITS AGE. (Note that the order of the explanatory variables doesn't matter; the results for each variable will be unaffected.) Repeat the test of the hypothesis that VISITS have no effect on COLDS. What happens now? Test the hypothesis that AGE has no effect on COLDS.

c.) Examine a plot of the relationship between the two explanatory variables: VISITS and AGE. Use PLOT VISITS AGE / GNU. (Why do you not use the LINEONLY or LINE option?) Explain what accounts for the conflicting implications of the simple regression and the multiple regression. Which model is more "right"? Why?

d.) Compare the "fit" of the simple and multiple regression models. Which model does a better job of explaining the observed variation across people in the number of colds? What about the fact that one model has an unfair advantage in that it uses more regressors? How do you control for this unfair advantage?

2. Omitted Variables Bias Example from Problem Set #4: Concept: Finding NO effect of one variable on another (or an effect that is counter to what you might expect) when there actually is an effect, but it is obscured by omitted variables. File needed: n:study.sha.

a.) Repeat the steps for the n:colds.sha example using the n:study.sha program and data. Here you are interested in knowing whether hours of studying (STUDY) has a statistically significant effect on midterm grades (MIDTERM). If you fail to control for "ability" approximated by prior GPA, you find some counter-intuitive results; an effect you might think should be there is not statistically significant--and the sign is even "wrong." When you do control for ability, something more reasonable appears.

b.) Be sure to examine the relationship between study hours and GPA for a clue to the source of the bias in the simple regression. Ensure that you can explain in words what accounts for the difference in results between the simple and the multiple regressions in this case.

c.) For your preferred specification, test the hypothesis that study time has no effect on midterm grade. What do you conclude, based on this specification? Now test the hypothesis that an extra hour of study time, on average, will produce a 5-point-higher midterm score. What does your model imply?

d.) Compare the "fit" of the simple and multiple regression models. Which model does a better job of explaining the observed variation across people in midterm scores? What about the fact that one model has an unfair advantage in that it uses more regressors? How do you control for this unfair advantage?

e.) Contemplate the interpretation of the intercept in this model. Is it meaningful? Why or why not?
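
For part c.), recall that the t-ratio printed with the regression output tests a coefficient against zero. To test against a nonzero value such as 5, the hypothesized value is subtracted from the estimate before dividing by the standard error. A minimal sketch in Python of the arithmetic; the function name and the numbers are hypothetical and are NOT results from study.sha:

```python
def t_statistic(estimate, std_error, null_value=0.0):
    """t-ratio for the null hypothesis: coefficient == null_value."""
    return (estimate - null_value) / std_error

# Hypothetical estimate and standard error, for illustration only
# (these are not output from the study.sha regression).
b_study, se_study = 3.2, 1.1

print(t_statistic(b_study, se_study))       # H0: study time has no effect
print(t_statistic(b_study, se_study, 5.0))  # H0: one more hour adds 5 points
```

Each statistic is then compared with the critical value from the t distribution with n - k degrees of freedom, where k is the number of estimated parameters in the specification.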

3. Suppose the "true" data-generating process (DGP) which we call the population regression function (PRF) is of the form:

Yi = B1 + B2 Xi + B3 Zi + ui

Suppose that we draw a random sample from this population, forming a sample of size n with observations on Yi, Xi and Zi. Perhaps we are most interested in the size and sign of the coefficient B2. If we regress Y on both X and Z, and the usual "maintained hypotheses" for regression are met, then OLS is appropriate and it will provide "best linear unbiased estimates" (B.L.U.E.) of the true population parameters B1, B2, and B3.

However, it is possible, since we do not really know the form of the underlying PRF, that we might estimate a "misspecified" model of the following form:

Yi = b1 + b2 Xi + ei.

What are the properties of the estimated parameters under these circumstances? They may no longer be unbiased. We have omitted the Zi variable when it really ought to be there, since Zi truly does help explain Yi.
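
The source of the bias can be made explicit. The following is a sketch of the standard derivation, assuming the maintained hypotheses hold for the true PRF; the bar notation for sample means is introduced here and is not part of the handout. Substituting the true PRF into the OLS slope formula for the misspecified model gives:

```latex
\begin{align*}
b_2 &= \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2} \\
    &= B_2
       + B_3 \, \frac{\sum_i (X_i - \bar{X})(Z_i - \bar{Z})}{\sum_i (X_i - \bar{X})^2}
       + \frac{\sum_i (X_i - \bar{X})\, u_i}{\sum_i (X_i - \bar{X})^2}, \\
E[\, b_2 \mid X, Z \,] &= B_2
       + B_3 \, \frac{\sum_i (X_i - \bar{X})(Z_i - \bar{Z})}{\sum_i (X_i - \bar{X})^2}.
\end{align*}
```

The extra term is B3 multiplied by the slope from a regression of Z on X. It is zero when X and Z are uncorrelated in the sample, and otherwise its sign is the sign of B3 times the sign of that correlation.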

Some discussion:

1. If both X and Z have a positive effect on Y (i.e. B2, B3 > zero) and Zi is uncorrelated with Xi, then omission of Zi from the regression model will NOT bias b2. However, the error terms will be larger, since they are now capturing (B3 Zi + ui). Thus, the estimate of σ² (s²) is larger than it needs to be, meaning that the standard errors of the regression parameters are all larger than they need to be. Consequently, the associated t-test statistics will be smaller than necessary and we will be less likely to be able to reject specific hypotheses about the sizes of any of the underlying parameters.

2. If both X and Z have a positive effect on Y (i.e. B2, B3 > zero) and Zi is positively correlated with Xi, then the omission of Zi from the regression model WILL bias b2. This is because larger values of Xi are proxying for larger values of Zi (and smaller values of Xi are proxying for smaller values of Zi). Thus, the estimated parameter b2 is not telling us how Yi changes when Xi changes, ceteris paribus. Instead, it is telling us how Yi changes when BOTH Xi and Zi change. Because B2 and B3 are both positive here, b2 will tend to be larger than the true B2. The more correlated is Zi with Xi, the greater will be the bias. That portion of the variability in Zi that does not coincide with the variability in Xi (i.e. that part which remains "orthogonal" to the variation in Xi) will be absorbed by the intercept and the error term, leading to a potential for the error term again to be inflated. This will again make for smaller t-test statistics and more difficulty in rejecting hypotheses about the parameters.

3. If both X and Z have a positive effect on Y (i.e. B2, B3 > zero) and Zi is negatively correlated with Xi, then the omission of Zi from the regression model WILL bias b2. The bias may be sufficiently large in some cases that the apparent sign of b2 is actually reversed. The problem is that when Zi is left out, larger values of Xi (which should increase Yi) are associated with smaller values of Zi (which should decrease Yi). If the effect of smaller Zi values dominates the effect of larger Xi values, Yi will appear to be made smaller by larger values of Xi and the estimated value of b2 will be negative.

There are several more cases to consider....
 

In general:

The direction and size of the bias in b2 as an estimator for the true B2 depends on:

a.) the actual signs and sizes of B2 and B3
b.) the degree of correlation (positive or negative) between Xi and Zi.
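
These two ingredients can be combined in one line of arithmetic. In large samples, the slope from the regression that omits Z is centered near B2 + B3 * Corr(X,Z) * sd(Z) / sd(X). The following is a minimal sketch in Python, not code from the course; the function name is hypothetical, and the inputs in the loop are the default values used by the simulation program described below (B2 = 2, B3 = 3, sd of X = 3, sd of Z = 2):

```python
def expected_b2(B2, B3, corr_xz, sd_x, sd_z):
    """Approximate center of b2 when Z is omitted from the regression.

    Population analogue of the omitted-variables formula:
    B2 + B3 * Cov(X, Z) / Var(X) = B2 + B3 * corr_xz * sd_z / sd_x.
    """
    return B2 + B3 * corr_xz * sd_z / sd_x

# Trace out the approximate center of b2 for several correlations.
for corr in (-0.9, -0.5, 0.0, 0.5, 0.9):
    print(corr, expected_b2(2.0, 3.0, corr, 3.0, 2.0))
```

With these defaults the value stays positive for every correlation above -1. A smaller true B2 (for example 1 instead of 2) is enough for a strong negative correlation to reverse the apparent sign.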
 

Some key points to remember are:

a.) if the omitted variable(s) are uncorrelated with all of the included variable(s), then the point estimates of the slope coefficients on the included variables will be unbiased (the intercept may still absorb the mean effect of the omitted variable).

b.) despite the lack of bias with exclusion of uncorrelated relevant variables, the efficiency of the estimation can be compromised (i.e. the standard errors on the slope coefficients for included variables will be inflated).

c.) omission of correlated relevant variables is often the explanation for the appearance of unexpected signs (or sizes) for coefficients on included variables.

 
The simulation program:

To illustrate some of these points, we have the simulation program contained in n:omitvar.sha. Here is what is going on in the default version of the program:

1. We are exploring the consequences of omitting a relevant explanatory variable as the correlation between the excluded (Z) explanatory variable and the included (X) explanatory variable changes from "close to -1," to zero, to "close to +1."

2. For each of almost 100 different degrees of correlation, we generate a set of data on X and Z. As defaults, the mean of X is set to 3, the mean of Z is set to 5, and the marginal standard deviation of X is 3 and that of Z is 2.

3. We use a common statistical method involving so-called "Cholesky factorization" of matrices to convert a pair of independent N(0,1) random variables into a pair of correlated random variables, each with an N(0,1) marginal distribution. Then we change the location and scale of these correlated normal random variables to the default values mentioned above. (You are not responsible for this matrix algebra--it is simply a means to an end.)

4. For these correlated data on X and Z, we specify a population regression function of the form:

Yi = B1 + B2 Xi + B3 Zi + ui.

As default values, we have set B1 = 1, B2 = 2 and B3 = 3 (all positive, in this case). We determine a "random" value of Yi to go with each (Xi,Zi) pair by plugging the Xi and Zi values into this formula, and then tacking on a random draw from a N(0,σ²) distribution, where the default value of σ is set to 3 in this illustration.

5. Now that we have this sample of "data" from a known population regression function, we use it in two regression models:

Model I.       Yi = b1 + b2 Xi + b3 Zi + ei
Model II.       Yi = b1* + b2* Xi + ei

We then save the estimated parameters and their associated t-test statistics from the first model and from the second model (which may suffer from omitted variables bias).

6. The final task is to observe how certain key regression quantities change as we consider data sets with differing levels of correlation between Xi and Zi. We can plot:

a.) b2 from Model I and b2* from Model II as functions of the correlation between X and Z (called CORR_X_Z in the program).

b.) b1 from Model I and b1* from Model II as functions of the correlation between X and Z.

c.) t-ratios on b2 and b3 from Model I alone as functions of the correlation between X and Z (this illustrates the consequences of multicollinearity between regressors in one model).

d.) t-ratios on b2 from Model I and b2* from Model II as functions of the correlation between X and Z.
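
The steps above can be sketched compactly in Python. This is an illustration of the idea only, not a transcription of omitvar.sha: the sample size (n = 200), the random seed, and every function name are assumptions, and only the slope estimates b2 and b2* are computed (intercepts and t-ratios are left out for brevity).

```python
import math
import random

def draw_xz(rho, n, rng, mean_x=3.0, mean_z=5.0, sd_x=3.0, sd_z=2.0):
    """Draw n (X, Z) pairs with correlation rho using a 2x2 Cholesky factor.

    For the correlation matrix [[1, rho], [rho, 1]] the lower-triangular
    Cholesky factor is [[1, 0], [rho, sqrt(1 - rho**2)]].
    """
    xs, zs = [], []
    for _ in range(n):
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)  # independent N(0,1)
        x_std = e1
        z_std = rho * e1 + math.sqrt(1.0 - rho * rho) * e2
        xs.append(mean_x + sd_x * x_std)  # shift location and scale
        zs.append(mean_z + sd_z * z_std)
    return xs, zs

def cross(a, b):
    """Sum of cross-products of deviations from the sample means."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))

def slope_model_1(y, x, z):
    """b2 from Model I (Y on X and Z), from the two-regressor normal equations."""
    sxx, szz, sxz = cross(x, x), cross(z, z), cross(x, z)
    sxy, szy = cross(x, y), cross(z, y)
    return (sxy * szz - szy * sxz) / (sxx * szz - sxz ** 2)

def slope_model_2(y, x):
    """b2* from Model II (Y on X only, Z omitted)."""
    return cross(x, y) / cross(x, x)

def simulate(rho, n=200, B1=1.0, B2=2.0, B3=3.0, sigma=3.0, seed=0):
    """Generate one sample from the PRF and return (b2, b2*)."""
    rng = random.Random(seed)
    x, z = draw_xz(rho, n, rng)
    y = [B1 + B2 * xi + B3 * zi + rng.gauss(0.0, sigma) for xi, zi in zip(x, z)]
    return slope_model_1(y, x, z), slope_model_2(y, x)

for rho in (-0.9, -0.5, 0.0, 0.5, 0.9):
    b2, b2_star = simulate(rho)
    print(f"rho={rho:+.1f}  b2={b2:.3f}  b2*={b2_star:.3f}")
```

Across the range of correlations, b2 from Model I should stay near the true value of 2, while b2* from Model II should drift away from 2 in the direction of the correlation between X and Z.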

4. (Time permitting) Exploring how to get fancy laser-printed plots that show both raw data and fitted values: GNUPLOT option on SHAZAM PLOT command with subsequent editing of the gnuplot command file.

a.) Consider a plot of the relationship between VISITS and AGE from the colds.sha example (as an illustration of the technique). Append to the end of your own version of colds.sha the following additional code:
ols visits age / predict=vhat
plot visits vhat age / gnu lineonly
plot visits vhat age / gnu lineonly commfile=cold.gnu &
       datafile=cold.dat
The first plot command lets you see a screen version of the graphics plot. It will have a straight line and a wiggly line. We are going to erase the wiggly line and leave just its "dots." This requires the second plot command and subsequent editing of the appropriate hidden file indicated in the cold.gnu output file generated by this code. The second plot command won't send output to the screen; instead, it will all go to files. Note (as before) that the names you specify for the commfile= and datafile= options cannot be more than 8 characters in total, including the "." and the extension, so if you are going to use informative extensions like those above (.gnu and .dat), the first part of the filenames should be no more than 4 characters long. Make a note of the name you selected for the commfile. Now type STOP to exit SHAZAM.
b.) Now select TED, because we are going to peek into the cold.gnu file. Note the name of the real gnuplot program that this file points to, exit TED, and then enter TED again to edit this hidden file. This is where you can change the title for the plot, and change the names of the variables if you like. In particular, you want to find the line of code for the VISITS variable (not VHAT) and delete the w lines part. When you are happy with your modifications, save and exit TED.

c.) Now select GNUPLOT for Windows and tell the program the name of the file you want processed to create a plot for the laser printer. This will be the COLD.GNU file (or the hidden filename), if you have been proceeding as above.

d.) As before, you can print the plot on the laser printer by right-clicking on the plot after it appears on the screen, ensuring that the print options are to your liking, and then sending the print task to the printer.

Update date: February 6, 1998
Prepared by: Trudy Ann Cameron