1. Omitted Variables Bias Example from Problem Set #4: Concept: Finding an apparent relationship in a simple regression that actually isn't there, as a consequence of omitted variables bias. File needed: colds.sha.
2. Omitted Variables Bias Example from Problem Set #4: Concept: Finding NO effect of one variable on another (or an effect that is counter to what you might expect) when there actually is an effect, but it is obscured by omitted variables. File needed: study.sha.
3. Suppose the "true" data-generating process (DGP), which we call the population regression function (PRF), is of the form:

    Yi = B1 + B2 Xi + B3 Zi + ui
Suppose that we draw a random sample from this population, forming a sample of size n with observations on Yi, Xi and Zi. Perhaps we are most interested in the size and sign of the coefficient B2. If we regress Y on both X and Z, and the usual "maintained hypotheses" for regression are met, then OLS is appropriate and it will provide "best linear unbiased estimates" (B.L.U.E.) of the true population parameters B1, B2, and B3.
However, since we do not really know the form of the underlying PRF, it is possible that we might estimate a "misspecified" model of the following form:

    Yi = B1* + B2* Xi + ui*
What are the properties of the estimated parameters under these circumstances? They may no longer be unbiased. We have omitted the Zi variable when it really ought to be there, since Zi truly does help explain Yi.
Some discussion:
1. If both X and Z have a positive effect on Y (i.e. B2, B3 > zero) and Zi is uncorrelated with Xi, then omission of Zi from the regression model will NOT bias b2. However, the error terms will be larger, since they are now capturing (B3 Zi + ui). Thus, the estimate of σ² (namely s²) is larger than it needs to be, meaning that the standard errors of the regression parameters are all larger than they need to be. Consequently, the associated t-test statistics will be smaller than necessary and we will be less likely to be able to reject specific hypotheses about the sizes of any of the underlying parameters.
2. If both X and Z have a positive effect on Y (i.e. B2, B3 > zero) and Zi is positively correlated with Xi, then the omission of Zi from the regression model WILL bias b2. This is because larger values of Xi are proxying for larger values of Zi (and smaller values of Xi are proxying for smaller values of Zi). Thus, the estimated parameter b2 is not telling us how Yi changes when Xi changes, ceteris paribus. Instead, it is telling us how Yi changes when BOTH Xi and Zi change. Because B2 and B3 have the same sign here, the parameter estimate will tend to be larger in absolute value than it ought to be. The more correlated is Zi with Xi, the greater will be the bias. That portion of the variability in Zi that does not coincide with the variability in Xi (i.e. that part which remains "orthogonal" to the variation in Xi) will be absorbed by the intercept and the error term, leading to a potential for the error term again to be inflated. This will again make for smaller t-test statistics and more difficulty in rejecting hypotheses about the parameters.
3. If both X and Z have a positive effect on Y (i.e. B2, B3 > zero) and Zi is negatively correlated with Xi, then the omission of Zi from the regression model WILL bias b2. The bias may be sufficiently large in some cases that the apparent sign of b2 is actually reversed. The problem is that when Zi is left out, larger values of Xi (which should increase Yi) are associated with smaller values of Zi (which should decrease Yi). If the effect of smaller Zi values dominates the effect of larger Xi values, Yi will appear to be made smaller by larger values of Xi and the estimated value of b2 will be negative.
There are several more cases to consider....
In general:
The direction and size of the bias in b2 as an estimator for the true B2 depends on:
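The standard textbook result for the two-regressor case makes this dependence explicit. The lines below are a sketch, assuming the PRF is Yi = B1 + B2 Xi + B3 Zi + ui and that Y is regressed on X alone; the short-regression slope is written b2* (as in the simulation section), and the symbols r_XZ, s_X and s_Z (the sample correlation and sample standard deviations of X and Z) are notation introduced here for illustration.

```latex
\begin{aligned}
b_2^{*} &= \frac{\sum_i (X_i-\bar X)(Y_i-\bar Y)}{\sum_i (X_i-\bar X)^2}
        = B_2 + B_3\,\frac{\sum_i (X_i-\bar X)(Z_i-\bar Z)}{\sum_i (X_i-\bar X)^2}
          + \frac{\sum_i (X_i-\bar X)\,u_i}{\sum_i (X_i-\bar X)^2}, \\[4pt]
E\!\left[b_2^{*}\mid X, Z\right] &= B_2 + B_3\, r_{XZ}\,\frac{s_Z}{s_X}.
\end{aligned}
```

Read this way, the bias term B3 r_XZ (s_Z / s_X) takes the sign of B3 times the sign of the correlation between X and Z, and it grows with the size of B3, the strength of the correlation, and the spread of Z relative to X. With the simulation defaults described below (B3 = 3, standard deviation of X equal to 3 and of Z equal to 2), the bias is roughly 2 times the correlation.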
Some key points to remember are:
The simulation program:
To illustrate some of these points, we have the simulation program contained in omitvar.sha. Here is what is going on in the default version of the program:
1. We are exploring the consequences of omitting a relevant explanatory variable as the correlation between the excluded (Z) explanatory variable and the included (X) explanatory variable changes from "close to -1," to zero, to "close to +1."
2. For each of almost 100 different degrees of correlation, we generate a set of data on X and Z. As defaults, the mean of X is set to 3, the mean of Z is set to 5, and the marginal standard deviation of X is 3 and that of Z is 2.
3. We use a common statistical method involving so-called "Cholesky factorization" of matrices to convert a pair of independent N(0,1) random variables into a pair of correlated random variables, each with an N(0,1) marginal distribution. Then we change the location and scale of these correlated normal random variables to the default values mentioned above. (You are not responsible for this matrix algebra--it is simply a means to an end.)
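For readers curious about the mechanics, here is a minimal Python sketch of the Cholesky step. It is not the SHAZAM code in omitvar.sha; the function name correlated_normals, the seed, and the sample size are assumptions made for illustration, while the means and standard deviations are the defaults quoted above.

```python
import numpy as np

def correlated_normals(n, rho, mean_x=3.0, mean_z=5.0, sd_x=3.0, sd_z=2.0, seed=0):
    """Draw n (X, Z) pairs with correlation rho (requires -1 < rho < 1)."""
    rng = np.random.default_rng(seed)
    # Target correlation matrix for the standardized pair.
    corr = np.array([[1.0, rho], [rho, 1.0]])
    # Lower-triangular factor L with L @ L.T equal to corr.
    L = np.linalg.cholesky(corr)
    # n rows of two independent N(0,1) draws.
    e = rng.standard_normal((n, 2))
    # Each row becomes L @ e_row: correlated, but still N(0,1) marginally.
    w = e @ L.T
    # Shift and rescale to the desired means and standard deviations.
    x = mean_x + sd_x * w[:, 0]
    z = mean_z + sd_z * w[:, 1]
    return x, z

x, z = correlated_normals(100_000, rho=0.6)
print("sample correlation:", np.corrcoef(x, z)[0, 1])
print("mean and sd of X:", x.mean(), x.std(ddof=1))
print("mean and sd of Z:", z.mean(), z.std(ddof=1))
```

Shifting and rescaling each variable separately leaves the correlation unchanged, which is why the location and scale can be set after the Cholesky step.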
4. For these correlated data on X and Z, we specify a population regression function of the form:

    Yi = B1 + B2 Xi + B3 Zi + ui
As default values, we have set B1 = 1, B2 = 2 and B3 = 3 (all positive, in this case). We determine a "random" value of Yi to go with each (Xi,Zi) pair by plugging the Xi and Zi values into this formula, and then tacking on a random draw from a N(0, σ²) distribution, where the default value of σ is set to 3 in this illustration.
5. Now that we have this sample of "data" from a known population regression function, we use it in two regression models:

    Model I:   Yi = b1 + b2 Xi + b3 Zi + ei
    Model II:  Yi = b1* + b2* Xi + ei*
We then save the estimated parameters and their associated t-test statistics from the first model and from the second model (which may suffer from omitted variables bias).
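Steps 4 and 5 can be sketched in Python as follows. This is an illustration, not the SHAZAM code in omitvar.sha. It assumes the PRF is Y = B1 + B2 X + B3 Z + u with the default values quoted above, and it assumes Model I regresses Y on X and Z while Model II regresses Y on X alone. The helper fit_ols, the seed, the sample size, and the chosen correlation of 0.7 are all hypothetical.

```python
import numpy as np

def fit_ols(y, *regressors):
    """OLS with an intercept; returns (coefficients, t-ratios)."""
    X = np.column_stack([np.ones_like(y), *regressors])
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (n - k)          # estimate of the error variance
    se = np.sqrt(s2 * np.diag(xtx_inv))   # standard errors of the coefficients
    return b, b / se

rng = np.random.default_rng(1)
n, rho = 5_000, 0.7                       # assumed sample size and correlation
e1, e2 = rng.standard_normal(n), rng.standard_normal(n)
x = 3.0 + 3.0 * e1                        # mean 3, sd 3
# The 2x2 Cholesky step written out by hand: Z has correlation rho with X.
z = 5.0 + 2.0 * (rho * e1 + np.sqrt(1.0 - rho**2) * e2)   # mean 5, sd 2
# PRF with B1 = 1, B2 = 2, B3 = 3 and an N(0, 3^2) error.
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(0.0, 3.0, n)

b_full, t_full = fit_ols(y, x, z)         # Model I: Y on X and Z
b_short, t_short = fit_ols(y, x)          # Model II: Y on X only
print("Model I  coefficients:", b_full, " t-ratios:", t_full)
print("Model II coefficients:", b_short, " t-ratios:", t_short)
```

With a positive correlation between X and Z, the Model II slope should land noticeably above 2, while the Model I slope stays close to 2.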
6. The final task is to observe how certain key regression quantities change as we consider data sets with differing levels of correlation between Xi and Zi. We can plot:
a.) b2 from Model I and b2* from Model II as functions of the correlation between X and Z (called CORR_X_Z in the program).
b.) b1 from Model I and b1* from Model II as functions of the correlation between X and Z.
c.) t-ratios on b2 and b3 from Model I alone as a function of the correlation between X and Z (this illustrates the consequences of multicollinearity between regressors in one model).
d.) t-ratios on b2 from Model I and b2* from Model II as functions of the correlation between X and Z.
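The sweep over correlations in step 6 can be sketched in Python as well. Again, this is not the SHAZAM program; it assumes Model I includes both X and Z and Model II includes X only. The grid of 99 correlations from -0.98 to 0.98, the sample size, the seed, and the helper slope_on_x are assumptions made for illustration. It prints a small table in place of plot (a).

```python
import numpy as np

def slope_on_x(y, x, z=None):
    """Least-squares slope on x, with an intercept and optionally z as well."""
    cols = [np.ones_like(x), x] + ([] if z is None else [z])
    b, *_ = np.linalg.lstsq(np.column_stack(cols), y, rcond=None)
    return b[1]

rng = np.random.default_rng(2)
n = 2_000                                  # assumed sample size per data set
rhos = np.linspace(-0.98, 0.98, 99)        # assumed grid of correlations
b2_full, b2_short = [], []
for rho in rhos:
    e1, e2 = rng.standard_normal(n), rng.standard_normal(n)
    x = 3.0 + 3.0 * e1
    z = 5.0 + 2.0 * (rho * e1 + np.sqrt(1.0 - rho**2) * e2)
    y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(0.0, 3.0, n)
    b2_full.append(slope_on_x(y, x, z))    # Model I: Z included
    b2_short.append(slope_on_x(y, x))      # Model II: Z omitted
b2_full, b2_short = np.array(b2_full), np.array(b2_short)

# Print every 14th row as a compact stand-in for the plot.
for rho, bf, bs in zip(rhos[::14], b2_full[::14], b2_short[::14]):
    print(f"corr={rho:+.2f}  b2 (Model I)={bf:.2f}  b2* (Model II)={bs:.2f}")
```

Under these defaults the Model I slope should hover around 2 at every correlation, while the Model II slope should rise roughly linearly with the correlation, from near 0 at the negative end to near 4 at the positive end.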
4. (Time permitting) Exploring how to get fancy laser-printed plots that show both raw data and fitted values: GNUPLOT option on SHAZAM PLOT command with subsequent editing of the gnuplot command file.
    ols visits age / predict=vhat
    plot visits vhat age / gnu lineonly
    plot visits vhat age / gnu lineonly commfile=cold.gnu &
    datafile=cold.dat

The first plot command lets you see a screen version of the graphics plot. It will have a straight line and a wiggly line. We are going to erase the wiggly line and leave just its "dots." Editing a gnuplot plotting program requires the second version of the plot command and subsequent editing of the appropriate hidden file indicated in the cold.gnu output file generated by this code. Note (as before) that the names you specify for the commfile= and datafile= options cannot be more than 8 characters in total, including the "." and the extension, so if you are going to use informative extensions like those above (.gnu and .dat), the first part of the filenames should be no more than 4 characters long. Make a note of the name you selected for the commfile. Now type STOP and exit SHAZAM.