Department of Economics

Economics 143 (Cameron) - Applied Regression Analysis

Problem Set #6: Heteroscedasticity

March 5, 1998
Due:  Thursday, March 12

NETWORK FILES NEEDED:  n:het1.dat, n:het2.dat, n:het1.sha, n:het2.sha, n:wls_sim.sha

NOTE: After this, there will be one more homework set, covering serially correlated errors, endogeneity, and dummy dependent variable models. However, that homework set will NOT be turned in for grading. It will be designed solely to give you an idea of the types of exam issues you might encounter concerning these models.

1. Weighted least squares (WLS) is the most common technique to use for (mostly) cross-sectional data where heteroskedasticity is a problem. We will look at a pair of contrived data sets in the files n:het1.dat and n:het2.dat. In n:het1.dat, there are 50 observations for recently hired production line workers on two variables: weeks of experience (weeks_i) and percent perfect gizmos produced (perfect_i). To save you some programming anxiety, I have begun the program to deal with these data in a file called n:het1.sha. NOTE: The type of heteroskedasticity explored in these data differs from the usual "fanning out" form, where the magnitude of the conditional error variance in the regression increases as some explanatory variable increases.

a.) Examine a plot of these data. Are plotting techniques always sufficient to identify the nature of heteroskedasticity in a simple regression? In a multiple regression? Might a problem of omitted variables masquerade as heteroskedasticity? Ignoring, for now, any potential heteroskedasticity, run a naive OLS regression of perfect_i on weeks_i and test the hypothesis that experience has no effect on workers' abilities to produce perfect gizmos.

b.) Now note that the data set contains several workers with each level of experience (from 1 to 7 weeks). For each of these seven experience levels, calculate the sample variance in perfect_i. Plot these variances as a function of the number of weeks of experience. How would you describe this relationship?

c.) Since we are lucky enough with these data to be able to estimate σ_i for each experience level, we can design explicit weights to use in SHAZAM that boost the influence of low-variance data and diminish the influence of high-variance data in the process of estimating the regression parameters. SHAZAM multiplies all of the data in a regression by the square root of the specified weighting variable, so construct wsig_i as 1/σ_i^2 for each observation, using the sample variance you calculated for that observation's experience level. (The weight option is demonstrated in the SHAZAM manual--on page 95 of Version 8.) Now run an OLS regression of perfect_i on weeks_i using these weights and test the hypothesis that experience has no effect on percent perfect gizmos produced. What has happened to your conclusions? Are they altered substantially by the use of WLS instead of OLS? In particular, think about the range of values for the PRF slope parameter that would be deemed "acceptable" hypotheses about the true but unknown value of that parameter. Does this range of values differ for the WLS and the OLS specifications? Remember that the point estimates by either method are unbiased--does this mean they should be identical? Why or why not?
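
For readers who want to see the arithmetic behind parts b.) and c.) outside of SHAZAM, here is a minimal Python sketch. It is an illustration under stated assumptions, not a solution: the data values are made up (they are NOT the contents of n:het1.dat), and the helper names group_variances and wls_line are hypothetical. The sketch computes a sample variance for each experience level, sets each observation's weight to 1/variance, and fits the weighted line in closed form.

```python
from collections import defaultdict
from statistics import variance

# Hypothetical (weeks, percent perfect) pairs, NOT the contents of n:het1.dat.
data = [
    (1, 60.0), (1, 72.0), (1, 51.0),
    (2, 66.0), (2, 74.0), (2, 61.0),
    (3, 70.0), (3, 75.0), (3, 68.0),
    (4, 74.0), (4, 77.0), (4, 73.0),
]

def group_variances(pairs):
    """Sample variance of y within each distinct value of x."""
    groups = defaultdict(list)
    for x, y in pairs:
        groups[x].append(y)
    return {x: variance(ys) for x, ys in groups.items()}

def wls_line(pairs, weights):
    """Intercept and slope minimizing sum of w_i * (y_i - a - b*x_i)**2."""
    sw = sum(weights)
    xbar = sum(w * x for w, (x, _) in zip(weights, pairs)) / sw
    ybar = sum(w * y for w, (_, y) in zip(weights, pairs)) / sw
    sxy = sum(w * (x - xbar) * (y - ybar) for w, (x, y) in zip(weights, pairs))
    sxx = sum(w * (x - xbar) ** 2 for w, (x, _) in zip(weights, pairs))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

s2 = group_variances(data)                    # one variance per experience level
wsig = [1.0 / s2[x] for x, _ in data]         # weight = 1 / estimated variance
ols = wls_line(data, [1.0] * len(data))       # equal weights reproduce plain OLS
wls = wls_line(data, wsig)

print("group variances:", s2)
print("OLS (intercept, slope):", ols)
print("WLS (intercept, slope):", wls)
```

Passing equal weights to the same function gives the ordinary least squares line, which makes it easy to compare the two sets of point estimates side by side. The sketch reports only coefficients; standard errors and t-tests would still come from your SHAZAM output.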

2. At other times, we will not have the luxury of repeated observations at each value of the explanatory variable(s) that let us calculate estimates of the conditional error variance for each group of observations to use in our weighting computations. The data set in n:het2.dat contains another sample of 50 new workers for the same firm, but this time, experience is reported in days (days_i). Percent perfect gizmos (perfect_i) is reported as before. In this type of situation, we need to figure out a way to construct weights using numbers that are at least proportional to the conditional error variances, σ_i^2. These methods are always somewhat ad hoc, since it is impossible to know the true exact relationship between σ_i^2 and any of the observable data, but it is important to make a concerted effort to correct for heteroskedasticity. WHY? The program in n:het2.sha reads the second data set.

a.) Using a naive OLS specification, test the hypothesis that days of experience has no effect on percent perfect gizmos produced. Be sure to save the estimated residuals with a / resid=e option on the OLS command.

b.) Now attempt to find a relationship between σ_i^2 and something observable so that you can construct a proxy (for σ_i^2) that is at least proportional to σ_i^2. First generate e_i^2. Plot and/or regress e_i^2 on days_i to examine any systematic relationship. If there appears to be some kind of statistically discernible relationship between e_i^2 and days_i, we can use the apparently best-fitting relationship between these two quantities to construct a satisfactory set of weights. Some possibilities for weights are: genr wda=days; genr wdb=days*days; genr wdc=1/days; and genr wdd=1/(days*days). Given your results in the first part of this section, which set of weights makes the most sense? (Think very carefully; the situation in this data set is not the most common form of heteroskedasticity observed in "the wild".) Then try each of these as weights in an OLS perfect days / weight=... command. Now test the hypothesis that experience has no effect on percent perfect gizmos produced. Are your inferences from the WLS model any different from those from the OLS model? Explain.

c.) Does heteroskedasticity always mean that naive OLS models will give you the appearance of statistically significant slope coefficients when in fact the slope coefficients are insignificantly different from zero?

d.) Is it always completely clear-cut what weights should be used to accommodate heteroskedasticity in an empirical model? (HINT: The answer is "no." Explain why.)

e.) Sometimes, using a specification wherein the dependent variable is logged will effectively eliminate a heteroskedasticity problem. Why? Does a log-linear model eliminate the heteroskedasticity you observe for either of the data sets used for this problem? Why or why not?

f.) OPTIONAL: If there were two or more explanatory variables, how would you determine which of the regressors (or what subset) could be employed in constructing weights for a WLS regression? Discuss.

g.) OPTIONAL: Explore an array of tests for the presence of heteroskedasticity in an OLS model by running the DIAGNOS / HET command after a naive OLS specification such as the first models run with the n:het1.dat or n:het2.dat data. In subsequent courses in econometrics, you will learn more about these tests. You will also learn more about an available option on OLS called HETCOV that corrects the estimated OLS standard errors for an unknown form of heteroskedasticity. Try this and see what happens to your OLS results.
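
The residual-based search in part b.) can also be sketched in a few lines of Python. This is only an illustration under assumptions: the data are generated by a made-up rule (they are NOT n:het2.dat), the helper simple_ols is hypothetical, and comparing R-squared values across auxiliary regressions is just one informal way to judge "best-fitting". The sketch runs a naive OLS, squares the residuals, and regresses them on each candidate function of days.

```python
def simple_ols(x, y):
    """Return (intercept, slope, residuals, r_squared) for y = a + b*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    resid = [yi - a - b * xi for xi, yi in zip(x, y)]
    sst = sum((yi - ybar) ** 2 for yi in y)
    r2 = 1.0 - sum(e * e for e in resid) / sst
    return a, b, resid, r2

# Hypothetical data, NOT n:het2.dat: the disturbance alternates in sign and
# its size depends on days through an assumed rule chosen for illustration.
days = [float(d) for d in range(1, 21)]
perfect = [50.0 + 1.5 * d + (-1) ** i * 20.0 / d for i, d in enumerate(days)]

# Step 1: naive OLS, keeping the residuals (the role of / resid=e in SHAZAM).
_, _, e, _ = simple_ols(days, perfect)
e2 = [ei * ei for ei in e]                    # squared residuals

# Step 2: regress e^2 on each candidate function of days and compare fits.
candidates = {
    "days": days,
    "days*days": [d * d for d in days],
    "1/days": [1.0 / d for d in days],
    "1/(days*days)": [1.0 / (d * d) for d in days],
}
fit = {name: simple_ols(h, e2)[3] for name, h in candidates.items()}
best = max(fit, key=fit.get)

for name, r2 in fit.items():
    print(f"e^2 on {name:14s} R-squared = {r2:.3f}")
print("best-fitting variance proxy:", best)
```

Keep in mind that a function of days that tracks the error variance is a proxy for the variance itself. Because the weighting variable should be proportional to 1/variance, the weight is the reciprocal of whichever function fits e_i^2 best, not that function.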

3. Verifying what goes on in the background when you specify the weight= option on an OLS command: Run the simulation program contained in n:wls_sim.sha and review the output. This program creates a single sample of size 300 for a dependent variable d and an explanatory variable r with a known form of heteroscedasticity in the population regression function from which the data were drawn. Each time the program is run, an entirely different set of data will be created, so your results will differ from those of other people in your study group.

Review the commentary and questions contained in the program, then run it and see if your results are typical. Does the ols d r / weight=... method of producing weighted least squares regression estimates produce the same results as the older approach of explicitly transforming all the variables in the model and running the regression on the transformed data? Note also that the transformed data, when plotted, probably look much more like they satisfy the maintained hypotheses of ordinary least squares regression. A scatterplot for the transformed data certainly looks different from a scatterplot of the raw data.
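
The equivalence asked about here can be checked by hand on a tiny example. The Python sketch below is an illustration only: the numbers are made up, the function names are hypothetical, and it makes no claim about how SHAZAM or n:wls_sim.sha is implemented. It fits a weighted regression of d on r directly, then multiplies the constant, r, and d by the square root of the weight and runs an unweighted regression with no added intercept on the transformed columns.

```python
import math

def wls_direct(x, y, w):
    """Minimize sum of w_i * (y_i - a - b*x_i)**2 using weighted means."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return ybar - b * xbar, b

def ols_two_regressors_no_constant(z1, z2, y):
    """OLS of y on z1 and z2 with no intercept (2x2 normal equations)."""
    s11 = sum(a * a for a in z1)
    s12 = sum(a * b for a, b in zip(z1, z2))
    s22 = sum(b * b for b in z2)
    s1y = sum(a * c for a, c in zip(z1, y))
    s2y = sum(b * c for b, c in zip(z2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Hypothetical data; the weight assumes error variance proportional to r**2.
r = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
d = [3.1, 4.8, 7.9, 8.2, 12.5, 11.0]
w = [1.0 / (ri * ri) for ri in r]

# Explicit transformation: multiply every column by sqrt(weight).
root = [math.sqrt(wi) for wi in w]
z1 = root                                     # transformed constant term
z2 = [s * ri for s, ri in zip(root, r)]       # transformed r
dstar = [s * di for s, di in zip(root, d)]    # transformed d

direct = wls_direct(r, d, w)
transformed = ols_two_regressors_no_constant(z1, z2, dstar)
rescaled = wls_direct(r, d, [10.0 * wi for wi in w])

print("weighted fit      :", direct)
print("transformed fit   :", transformed)
print("weights times 10  :", rescaled)
```

The coefficient on the transformed constant plays the role of the intercept, and the coefficient on transformed r is the slope. The third fit shows that multiplying every weight by the same constant leaves the coefficient estimates unchanged, so only the relative sizes of the weights matter for the point estimates.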

Mission: insert lines to perform an initial "naive" OLS regression without any weights. Save the fitted residuals and square them. Then regress these squared residuals alternately on r (the explanatory variable in the main model), on the square of r, and on both at the same time. Do you recover the type of heteroscedasticity that was used in the production of the data? The lesson here is that a sample will not necessarily always reveal exactly the type of heteroscedasticity that afflicts the population from which the sample is drawn. BOTTOM LINE: We usually just do the best we can to characterize the approximate nature of our heteroscedasticity problem. We then use this information to effect the most appropriate WLS solution that we can come up with.
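
If you would like to rehearse the mission outside of SHAZAM, the following Python sketch mimics the steps on simulated data. Everything specific in it is an assumption made for illustration: the population model (error standard deviation proportional to r), the parameter values, the fixed random seed, and the helper name ols are all hypothetical and are not taken from n:wls_sim.sha. Unlike the SHAZAM program, the fixed seed makes this sketch produce the same sample on every run; change the seed to see how much the auxiliary regressions vary from sample to sample.

```python
import random

def ols(columns, y):
    """OLS coefficients of y on the given columns (no constant is added
    automatically), found by solving the normal equations X'X b = X'y."""
    k = len(columns)
    # Augmented matrix [X'X | X'y].
    a = [[sum(ci * cj for ci, cj in zip(columns[i], columns[j])) for j in range(k)]
         + [sum(ci * yi for ci, yi in zip(columns[i], y))] for i in range(k)]
    for p in range(k):                        # Gauss-Jordan, partial pivoting
        pivot = max(range(p, k), key=lambda row: abs(a[row][p]))
        a[p], a[pivot] = a[pivot], a[p]
        for row in range(k):
            if row != p:
                f = a[row][p] / a[p][p]
                a[row] = [u - f * v for u, v in zip(a[row], a[p])]
    return [a[i][k] / a[i][i] for i in range(k)]

rng = random.Random(143)                      # fixed seed, assumed for the sketch
n = 300
r = [rng.uniform(1.0, 10.0) for _ in range(n)]
# Assumed population model: d = 2 + 3 r + error, with sd(error) = 0.5 * r.
d = [2.0 + 3.0 * ri + rng.gauss(0.0, 0.5 * ri) for ri in r]

ones = [1.0] * n
b = ols([ones, r], d)                         # naive OLS, no weights
e2 = [(di - b[0] - b[1] * ri) ** 2 for di, ri in zip(d, r)]
rsq = [ri * ri for ri in r]

aux_r = ols([ones, r], e2)                    # e^2 on constant and r
aux_rsq = ols([ones, rsq], e2)                # e^2 on constant and r^2
aux_both = ols([ones, r, rsq], e2)            # e^2 on constant, r, and r^2

print("naive OLS       :", b)
print("e^2 on r        :", aux_r)
print("e^2 on r^2      :", aux_rsq)
print("e^2 on r and r^2:", aux_both)
```

Under the assumed model the true error variance is 0.25 times r squared, so you can compare the auxiliary coefficients against that benchmark and judge for yourself how clearly a single sample of 300 reveals it.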


Updated: March 4, 1998
Prepared by: Trudy Ann Cameron