UNIVERSITY OF CALIFORNIA, LOS ANGELES
Department of Economics
Economics 143 (Cameron) - Applied Regression Analysis
Problem Set #6: Heteroscedasticity
March 5, 1998
Due: Thursday, March 12
NETWORK FILES NEEDED:
n:het1.dat,
n:het2.dat,
n:het1.sha,
n:het2.sha,
n:wls_sim.sha
NOTE: After this, there will be one more homework set, covering serially
correlated errors, endogeneity, and dummy dependent variable models. However,
that final homework set will NOT be turned in for grading. It will be
designed solely to give you an idea of the types of exam issues you might
encounter concerning these models.
1. Weighted least squares (WLS) is the most common technique to use for
(mostly) cross-sectional data where heteroskedasticity is a problem. We
will look at a pair of contrived data sets in the files n:het1.dat and
n:het2.dat. In n:het1.dat, there are 50 observations on recently hired
production line workers for two variables: weeks of experience (weeks_i) and
percent perfect gizmos produced (perfect_i). To save you some programming
anxiety, I have begun the program to deal with these data in a file called
n:het1.sha. NOTE: the type of heteroskedasticity explored in these data
differs from the usual "fanning out" form, where the magnitude of the
conditional error variance in the regression increases as some explanatory
variable increases.
a.) Examine a plot of these data. Are plotting techniques always sufficient
to identify the nature of heteroscedasticity...in simple regression? ...in
multiple regression? Might a problem of omitted variables masquerade as
heteroscedasticity? Ignoring, for now, any potential heteroskedasticity, run a
naive OLS regression of perfect_i on weeks_i and test the hypothesis that
experience has no effect on workers' abilities to produce perfect gizmos.
b.) Now note that the data set contains several workers with each level of
experience (from 1 to 7 weeks). For each of these seven experience levels,
calculate the sample variance in perfect_i. Plot these variances as a
function of the number of weeks of experience. How would you describe this
relationship?
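The grouping step in part b.) can be sketched outside SHAZAM. The short Python illustration below is only a sketch: the function name variance_by_level and the nine data points are made up for the example and are NOT the contents of n:het1.dat. It computes the sample variance (n-1 divisor) of the outcome within each experience level.

```python
from collections import defaultdict
from statistics import variance


def variance_by_level(levels, values):
    """Return {level: sample variance of the values observed at that level}."""
    groups = defaultdict(list)
    for level, value in zip(levels, values):
        groups[level].append(value)
    # statistics.variance uses the n-1 divisor, so each level needs >= 2 points
    return {level: variance(vals) for level, vals in sorted(groups.items())}


# Hypothetical illustration data (not the course data set)
weeks = [1, 1, 1, 2, 2, 2, 3, 3, 3]
perfect = [40.0, 50.0, 60.0, 55.0, 60.0, 65.0, 69.0, 70.0, 71.0]

group_var = variance_by_level(weeks, perfect)
for level, s2 in group_var.items():
    print(level, s2)
```

Plotting the values in group_var against the experience level gives the kind of picture part b.) asks you to describe.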
c.) Since we are lucky enough with these data to be able to estimate σ_i for
each experience level, we can design explicit weights to use in SHAZAM that boost
the influence of low-variance data and diminish the influence of high-variance
data in the process of estimating the regression parameters. SHAZAM
multiplies all of the data in a regression by the square root of the
specified weighting variable, so construct wsig_i as 1/σ_i^2 for each
observation; each observation is then effectively divided by σ_i. (The weight
option is demonstrated in the SHAZAM manual, on page 95 of Version 8.) Now
run an OLS regression of
perfect_i on weeks_i using these
weights and test the hypothesis that experience has no effect on percent
perfect gizmos produced. What has happened to your conclusions? Are they altered
substantially by the use of WLS instead of OLS? In particular, think about the
range of values for the PRF slope parameter that would be deemed "acceptable"
hypotheses about the true but unknown value of that parameter. Does this range of
values differ for the WLS and the OLS specifications? Remember that the point
estimates by either method are unbiased. Does this mean they should be identical?
Why or why not?
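The logic behind a weighting variable whose square root multiplies the data can be written out as a short derivation. This is a sketch in generic notation: y_i, x_i, w_i, and the constant c are symbols introduced here for illustration and do not appear in the SHAZAM programs.

```latex
\begin{align*}
y_i &= \beta_0 + \beta_1 x_i + \varepsilon_i,
  & \operatorname{Var}(\varepsilon_i \mid x_i) &= \sigma_i^2 \\
\sqrt{w_i}\,y_i &= \beta_0\sqrt{w_i} + \beta_1\sqrt{w_i}\,x_i + \sqrt{w_i}\,\varepsilon_i,
  & \operatorname{Var}(\sqrt{w_i}\,\varepsilon_i \mid x_i) &= w_i\,\sigma_i^2 \\
w_i &= \frac{c}{\sigma_i^2} \quad\Longrightarrow
  & \operatorname{Var}(\sqrt{w_i}\,\varepsilon_i \mid x_i) &= c
\end{align*}
```

In words: if the weighting variable is proportional to the reciprocal of the conditional error variance, the transformed errors have a constant variance, so OLS on the transformed data satisfies the homoskedasticity assumption. Note that the constant term is transformed too.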
2. At other times, we will not have the luxury of repeated observations at
each value of the explanatory variable(s) that let us calculate estimates of
the conditional error variance for each group of observations to use in our
weighting computations. The data set in n:het2.dat contains another sample
of 50 new workers for the same firm, but this time, experience is reported
in days (days_i). Percent perfect gizmos (perfect_i) is reported as before.
In this type of situation, we need to figure out a way to construct weights
using numbers that are at least proportional to the conditional error
variances, σ_i^2. These methods are always somewhat ad hoc, since it is
impossible to know the true exact relationship between σ_i^2 and any of the
observable data, but it is important to make a concerted effort to correct for
heteroskedasticity. WHY? The program in n:het2.sha reads the second data set.
a.) Using a naive OLS specification, test the hypothesis that days of
experience has no effect on percent perfect gizmos produced. Be sure to save
the estimated residuals with a / resid=e option on the OLS command.
b.) Now attempt to find a relationship between σ_i^2 and something observable
so that you can construct a proxy (for σ_i^2) that is at least proportional
to σ_i^2. First generate e_i^2. Plot and/or regress e_i^2 on days_i to
examine any systematic relationship. If there appears to
be some kind of statistically discernible relationship between e_i^2 and
days_i, we can use the
apparently best-fitting relationship between these two quantities to
construct a satisfactory set of weights. Some possibilities for weights are:
genr wda=days; genr wdb=days*days; genr wdc=1/days; and
genr wdd=1/(days*days).
Given your results in the first part of this section, which set of weights
makes the most sense? (Think very carefully; the situation in this data set
is not the most common form of heteroskedasticity observed in "the wild".)
Then try each of these as weights in an OLS perfect days / weight=...
command. Now test the hypothesis that experience has no effect on percent
perfect gizmos produced. Are your inferences from the WLS model any different
from those from the OLS model? Explain.
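The search for a variance proxy described in part b.) can be sketched as a set of auxiliary regressions. The Python illustration below is only a sketch on contrived numbers: the helper simple_ols and the data are hypothetical, and the data are built so that the error spread "fans out" with x, which the text warns is NOT the pattern in n:het2.dat. It saves the naive OLS residuals, squares them, and compares how well several candidate functions of x fit the squared residuals.

```python
def simple_ols(x, y):
    """OLS of y on a constant and x: (intercept, slope, R-squared, residuals)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    sst = sum((yi - ybar) ** 2 for yi in y)
    r2 = 1.0 - sum(e * e for e in resid) / sst
    return intercept, slope, r2, resid


# Contrived data: two observations at each x = 1..10 with errors +x and -x,
# so the error standard deviation is proportional to x by construction.
x = [float(v) for v in range(1, 11) for _ in range(2)]
err = [(xi if k % 2 == 0 else -xi) for k, xi in enumerate(x)]
y = [2.0 + 3.0 * xi + ei for xi, ei in zip(x, err)]

# Step 1: naive OLS, keep the residuals, and square them
_, _, _, e = simple_ols(x, y)
e2 = [ei * ei for ei in e]

# Step 2: regress the squared residuals on each candidate proxy
candidates = {
    "x": x,
    "x*x": [xi * xi for xi in x],
    "1/x": [1.0 / xi for xi in x],
    "1/(x*x)": [1.0 / (xi * xi) for xi in x],
}
fit = {name: simple_ols(z, e2)[2] for name, z in candidates.items()}
best = max(fit, key=fit.get)
for name, r2 in fit.items():
    print(name, round(r2, 4))
print("best-fitting proxy:", best)
```

The best-fitting candidate is a proxy for the error variance itself. Whether the weight should be that proxy or its reciprocal depends on how the software uses the weighting variable, which is part of what the question asks you to think through.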
c.) Does heteroskedasticity always mean that naive OLS models will give you
the appearance of statistically significant slope coefficients when in fact
the slope coefficients are insignificantly different from zero?
d.) Is it always completely clear-cut what weights should be used to
accommodate heteroskedasticity in an empirical model? (HINT: The answer is
"no." Explain why.)
e.) Sometimes, using a specification wherein the dependent variable is
logged will effectively eliminate a heteroskedasticity problem. Why? Does
a log-linear model eliminate the heteroskedasticity you observe for either
of the data sets used for this problem? Why or why not?
f.) OPTIONAL: If there were two or more explanatory variables, how would
you determine which of the regressors (or what subset) could be employed in
constructing weights for a WLS regression? Discuss.
g.) OPTIONAL: Explore an array of tests for the presence of
heteroskedasticity in an OLS model by running the DIAGNOS / HET command
after a naive OLS specification such as the first models run with the
n:het1.dat or n:het2.dat data. In subsequent courses in econometrics, you
will learn more about these tests. You will also learn more about an
available option on OLS called HETCOV that corrects OLS estimates for an
unknown form of heteroskedasticity. Try this and see what happens to your
OLS estimates.
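To get a feel for what a correction for an unknown form of heteroskedasticity involves, the Python sketch below compares the conventional OLS standard error of a slope with a White-type heteroskedasticity-consistent standard error. Treat the details as assumptions: whether HETCOV computes exactly this variant is not confirmed here, and the function name and data are hypothetical.

```python
import math


def slope_standard_errors(x, y):
    """Return (slope, conventional SE, White-type robust SE) for y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    dx = [xi - xbar for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - ybar) for d, yi in zip(dx, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # Conventional formula: one common error variance for all observations
    s2 = sum(e * e for e in resid) / (n - 2)
    se_conv = math.sqrt(s2 / sxx)
    # White-type formula: each squared residual stands in for its own variance
    se_white = math.sqrt(sum(d * d * e * e for d, e in zip(dx, resid))) / sxx
    return slope, se_conv, se_white


# Contrived data: two observations at each x = 1..10 with errors +x and -x
x = [float(v) for v in range(1, 11) for _ in range(2)]
err = [(xi if k % 2 == 0 else -xi) for k, xi in enumerate(x)]
y = [2.0 + 3.0 * xi + ei for xi, ei in zip(x, err)]

slope, se_conv, se_white = slope_standard_errors(x, y)
print(slope, se_conv, se_white)
```

Both standard errors are computed from the same residuals and attached to the same slope estimate; only the formula for the sampling variance differs.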
3. Verifying what goes on in the background when you specify the
weight= option on an OLS command: Run the simulation program contained
in n:wls_sim.sha and review the output. This program creates
a single sample of size 300 for a dependent variable d and an explanatory
variable r with a known form of heteroscedasticity in the
population regression function from which the data were drawn. Each time the
program is run, an entirely different set of data will be created, so your results
will differ from those of other people in your study group.
Review the commentary and questions contained in the program, then run it and
see if your results are typical. Does
the ols d r / weight=... method of producing weighted least squares
regression estimates produce the same results as were traditionally obtained by
making explicit transformations of all the variables in the model and running the
regression on the transformed data? Note also that the transformed data, when
plotted, probably look much more like they satisfy the maintained hypotheses of
ordinary least squares regression. A scatterplot for the transformed data
certainly looks different from a scatterplot of the raw data.
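The comparison between a weighted regression and a regression on explicitly transformed data can be checked numerically. The Python sketch below is an illustration on a handful of made-up numbers, not the output of n:wls_sim.sha. The names d and r echo the variables in the text, while the function names and the choice of weights 1/r^2 are assumptions for the example.

```python
import math


def wls_direct(x, y, w):
    """Minimise sum of w_i * (y_i - a - b*x_i)^2 using weighted means."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    num = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b = num / den
    return ybar - b * xbar, b


def ols_transformed(x, y, w):
    """OLS with NO intercept of sqrt(w)*y on the columns sqrt(w) and sqrt(w)*x."""
    c = [math.sqrt(wi) for wi in w]            # transformed constant term
    xs = [ci * xi for ci, xi in zip(c, x)]     # transformed regressor
    ys = [ci * yi for ci, yi in zip(c, y)]     # transformed dependent variable
    scc = sum(ci * ci for ci in c)
    scx = sum(ci * xi for ci, xi in zip(c, xs))
    sxx = sum(xi * xi for xi in xs)
    scy = sum(ci * yi for ci, yi in zip(c, ys))
    sxy = sum(xi * yi for xi, yi in zip(xs, ys))
    det = scc * sxx - scx * scx                # solve the 2x2 normal equations
    a = (sxx * scy - scx * sxy) / det
    b = (scc * sxy - scx * scy) / det
    return a, b


# Hypothetical data and weights
r = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
d = [2.1, 3.9, 6.5, 7.4, 11.0, 11.2]
w = [1.0 / (ri * ri) for ri in r]

a_w, b_w = wls_direct(r, d, w)
a_t, b_t = ols_transformed(r, d, w)
print(a_w, b_w)
print(a_t, b_t)
```

The detail that is easy to miss in the explicit approach is that the constant term must be transformed as well, which is why the transformed regression is run without a separate intercept.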
Mission: insert lines to perform an initial "naive" OLS regression without
any weights. Save the fitted residuals and square them. Then regress these
squared residuals alternately on r (the explanatory variable in the main
model), on the square of r, and on both at the same time. Do you recover the
type of heteroscedasticity that was used in the production of the data? The
lesson here is that a sample will not necessarily always reveal exactly the type
of heteroscedasticity that afflicts the population from which the sample is drawn.
BOTTOM LINE: We usually just do the best we can to characterize the approximate
nature of our heteroscedasticity problem. We then use this information to effect
the most appropriate WLS solution that we can come up with.
Updated: March 4, 1998
Prepared by: Trudy Ann Cameron