UNIVERSITY OF CALIFORNIA, LOS ANGELES
Department of Economics
Winter 1998
Policy Studies 208 - Midterm Examination - Outlines of Solutions in Red
INSTRUCTIONS: Answer all questions
in the spaces provided (or indicate clearly where you have continued your
answer). Calculators are NOT permitted. Reduce all computations
to the simplest form so that anyone with a calculator could attain the
answer easily. Show your work and reasoning to the fullest extent possible
so that part marks can be assigned as warranted. You have 75 minutes to
complete this exam. All parts of both questions are worth 10 points
(and some are much easier than others). Total points = 150. This means
roughly 5 minutes for each answer. Budget your time carefully. NOTE: these
data are fictitious.
SCENARIO: Your consulting firm has been
hired by the Department of Environmental Health and Safety. The DEHS would
like you to analyze the relationship between the annual number of toxic
releases (per 100 establishments) for dry-cleaning firms (spills_i)
and some policy variables deemed relevant for the control of these accidents.
You have been provided with a sample of data from 19 randomly selected
jurisdictions. The variables they have given you include the average age
of such establishments in the jurisdiction (age_i), the
number of full-time inspectors per 100 establishments in the jurisdiction
(insp_i), the average legal penalties imposed for toxic
releases in that jurisdiction (in thousands of dollars) (pen_i), and the median
income of households in the jurisdiction (in thousands of dollars) (medinc_i). The
statistical analyses you perform are given in the Exhibits.
1. Begin gently. Fill in the blanks:
Across these 19 jurisdictions, what is
the mean "number of toxic releases per 100 establishments"? 3.0625 releases
What is the highest observed "number
of inspectors per 100 establishments"? 10 inspectors (overkill?)
What is the standard deviation in "average
legal penalties per release" across the sample? 5.8239 thousand dollars or
$5823.90
Do the descriptive statistics you have
just provided refer to the joint distribution of these three variables,
or to their marginal distributions? marginal distributions
What is the correlation between insp_i
and pen_i in this sample? -0.71181
What are the units for this correlation
measure? none, correlation is a unit-free measure.
2. Using the descriptive statistics only,
test the hypothesis that the true marginal mean "number of toxic releases
per 100 establishments" is 4 per year.
The key thing to remember is that this is univariate statistics, the stuff from
the beginning of the quarter. We are asking about the true mean of a single variable,
which means that we must use the sample mean and the variance of the sample mean. The
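test statistic will be [Y-bar - 4] / (s_Y / square root of n). For this example, the
numbers will be [3.0625 - 4] / (1.7016 / square root of 19), to be compared with a critical
value from the t distribution with n - 1 = 18 degrees of freedom.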
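As a check on the arithmetic, the statistic above can be evaluated numerically. This is a minimal Python sketch using only the numbers quoted in the solution. The variable names and the two-sided 5% critical value for 18 degrees of freedom (2.101, taken from a standard t table) are additions that do not appear in the original.

```python
import math

y_bar = 3.0625   # sample mean of spills, from the descriptive statistics
s_y = 1.7016     # sample standard deviation of spills
n = 19           # number of jurisdictions
mu_0 = 4.0       # hypothesized true mean

# Standard error of the sample mean, then the t-ratio
se_mean = s_y / math.sqrt(n)
t_stat = (y_bar - mu_0) / se_mean

# Assumption: tabulated two-sided 5% critical value for 18 d.f.
t_crit = 2.101
reject = abs(t_stat) > t_crit

print(round(t_stat, 2), reject)
```

Under these assumptions the statistic comes out at roughly -2.4, which exceeds the critical value in absolute value, so a true mean of 4 would be rejected at the 5% level.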
3. Does Regression 1 make sense?
Why or why not?
Not really. This model suggests that the average age of dry-cleaning establishments
in a jurisdiction is determined by the number of spills, the number of inspectors, typical
penalties, and the median income in the jurisdiction. The causality probably does
not run this way, or if it does, the effects are minimal.
4. The Administrator for the DEHS says
"If the number of inspectors in a jurisdiction has no statistically discernible
effect on the number of toxic releases from these establishments, why are
we paying the salaries of these people?" Based upon the relevant simple
regression in the Exhibits, is it possible that there is a downward-sloping
relationship between the number of inspectors and the number of releases?
Explain how you have reached this conclusion.
The Administrator is essentially saying that if the slope of a regression of
spills on inspectors is zero, then the number of inspectors does not affect the
frequency of spills, so why do we need inspectors? The expected number of spills
would be the same with zero inspectors as with ten of them, for example. This is
a zero hypothesis about the slope in Regression 2.
We can just look at the t-ratio and its associated P-value. The P-value of 0.479
says we cannot reject the zero hypothesis, so it looks like the Administrator might
be justifiably unhappy about paying for inspectors. However, one must always
suspect possible omitted variables bias.
5. Based on Regression 3, test
the hypothesis that in order to reduce the "number of releases per 100
establishments", on average, by one per year, it would be necessary to
increase the average legal penalty per spill by $5,000.
This hypothesis first needs to be translated into something that involves
the estimated coefficients of the model. Since this is a linear specification,
the slope is the same everywhere. If a 5-unit change in pen leads to a one-unit
decrease in spills, then this is equivalent to a 1-unit change in pen leading to
a 0.2 unit decrease in spills. A convenient version of this hypothesis is thus
to test whether the slope on pen equals -0.2. Let's look at what is available in the way
of tests associated with Regression 3. No hypotheses other
than the zero hypothesis have been requested, so we have to construct our own
t-test statistic. This test statistic is the point estimate minus the null hypothesis, all
divided by the standard error of the point estimate. Specifically, [-0.188 - (-0.2)] / .05425.
If this number, when calculated, is larger in absolute value than the relevant critical value for
a t-distributed random variable with 17 degrees of freedom (which is 2.11 at the 5%
level of significance) we would reject the null hypothesis. Eyeballing the formula,
the number is going to be about 0.012/0.05, which will be nowhere near this size. Thus,
we will fail to reject this hypothesis. It is plausible.
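The eyeballed calculation can be confirmed with a few lines of Python. This is only a sketch: the point estimate, standard error, and critical value are the ones quoted in the solution, while the variable names are invented for illustration.

```python
b_pen = -0.188    # estimated slope on pen (Regression 3)
se_b = 0.05425    # standard error of that slope
beta_0 = -0.2     # hypothesized slope: 5 units of pen per one fewer spill

# t-ratio for a non-zero null hypothesis
t_stat = (b_pen - beta_0) / se_b

t_crit = 2.11     # 5% two-sided critical value, 17 d.f. (from the text)
reject = abs(t_stat) > t_crit

print(round(t_stat, 3), reject)
```

The ratio is about 0.22, far below 2.11, which matches the conclusion that the hypothesis cannot be rejected.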
6. Based on Regression 3, what
average number of releases per 100 establishments would you expect for
a jurisdiction with an average legal penalty of $50,000 per release? Give
the precise formula for a point estimate and explain explicitly how a 95%
confidence interval for this prediction would be constructed. Why should
you use caution in making this prediction?
Here is where you use the "big messy formula" for a confidence interval
for mean prediction. You would plug 50 into the fitted model from this regression
to get the midpoint of the confidence interval (the point estimate). Then you will need
x-bar, the marginal mean of pen from the stat output, and the sample size n = 19 to plug
into the formula for the standard error of the point estimate. The thing that
needs to be constructed is the sum of the squared deviations of pen from its mean (the sum of the little x_i^2).
For this, you need to use the estimate of s ("standard error of the estimate - sigma")
from the third line below the R-squared information. Divide this by the
standard error of the slope estimate (since that standard error is s divided by the square root of what you want).
Finally, square the resulting number to get the desired sum of squared deviations.
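Written out, the formula being described is presumably the standard confidence interval for the mean prediction in a simple regression. The solution does not print it, so the notation below is introduced here as an assumption: a and b are the fitted intercept and slope from Regression 3, x_0 = 50 is the penalty in thousands of dollars, and x_i denotes the deviation of pen from its sample mean.

```latex
\begin{align*}
\hat{y}_0 &= a + b\,x_0, \qquad x_0 = 50 \\
\hat{y}_0 &\pm t_{0.025,\,17}\; s\,
  \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_i x_i^2}},
  \qquad n = 19 \\
\sum_i x_i^2 &= \left(\frac{s}{\mathrm{s.e.}(b)}\right)^2
  \quad\text{because}\quad
  \mathrm{s.e.}(b) = \frac{s}{\sqrt{\sum_i x_i^2}}
\end{align*}
```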
7. You think for a while and then realize
that the number of toxic releases per 100 establishments is probably a
joint function of several different factors, rather than just one at a
time. You estimate Regression 6 in order to ascertain the joint
effects of all available determinants on the average number of toxic releases
per 100 establishments. Describe what seems to happen to the apparent effect
of the insp_i variable when you include the other variables
in your model. If this apparent effect is different, explain why. What
do you tell the Administrator of the DEHS?
When you control for other factors that influence the number of spills,
compared to Regression 2, the coefficient on insp changes from positive to
negative and actually becomes statistically significantly different from
zero. There must have been some omitted variables bias obscuring the effect
of insp in the simple regression. The culprit variable is probably the
size of the penalties. There is a fairly high negative correlation between
these two variables (about -0.7). Thus greater numbers of inspectors were
serving as proxies for smaller fines in that jurisdiction. Since smaller
fines reduce the incentive to prevent releases, the effects of the smaller
fines were offsetting the effects of more inspectors (and a higher probability
of problems being detected). You can tell the Administrator not to worry,
because it seems that inspectors are making a difference after all, when
you control for variations in the levels of penalties.
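The sign flip described above can be reproduced with a small constructed example. The numbers below are synthetic and are NOT the exam data; the "true" coefficients (-0.5 on insp, -0.3 on pen) and the helper name simple_slope are hypothetical choices made only to illustrate how omitting a negatively correlated regressor can turn a negative effect into a positive simple-regression slope.

```python
def simple_slope(x, y):
    """OLS slope from regressing y on x with an intercept."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

# Synthetic data (an assumption for illustration, not from the Exhibits):
# more inspectors go together with smaller penalties.
insp = [1, 2, 3, 4, 5]
pen = [18, 14, 13, 9, 8]

# Hypothetical true model: both inspectors and penalties reduce spills.
true_b_insp, true_b_pen = -0.5, -0.3
spills = [10 + true_b_insp * i + true_b_pen * p for i, p in zip(insp, pen)]

# Simple regression of spills on insp alone (pen omitted)
slope_short = simple_slope(insp, spills)

# Omitted-variable formula: short slope = own effect + pen effect * (slope of pen on insp)
delta = simple_slope(insp, pen)
implied = true_b_insp + true_b_pen * delta

print(slope_short, delta, implied)
```

In this constructed case the slope of pen on insp is -2.5, so the simple-regression slope on insp is -0.5 + (-0.3)(-2.5) = +0.25: positive, even though the effect of inspectors holding penalties fixed is negative.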
8. In Regression 6, explain the
use of the / auxrsqr option on the ols command. What does it tell
you here?
The auxrsqr option seems to be unique to SHAZAM among popular regression
packages. It allows you to track down the probable sources of multicollinearity
problems that might be affecting your data and therefore your inferences. This
option, in the background, runs regressions for each explanatory variable on
all of the others and reveals where "good linear fit" is found. High auxiliary
R-squared values suggest that the subset of variables with these high values
has a high degree of (possibly higher-order) multicollinearity. The coefficients
on these variables may be insignificant because the OLS algorithm is unable
to parcel out explanatory power between them, even though collectively, they
might have great bearing on the expected value of Y (although maybe they don't).
9. For Regression 6, test the
hypothesis that the age_i variable does not belong in
the model. What do you conclude?
Age is completely uncorrelated with any of the other regressors AND it has
a lousy t-ratio and P-value, so we are fairly safe in concluding that age
probably does not belong in this specification. However, there is always a
danger that it IS correlated with some unidentified omitted variable, and
omitted variables bias is still obscuring its role in the model. The unnerving
thing is that you can never be sure, so you just think about potential determinants
as carefully as possible and try to argue that you have everything...
10. One employee of the DEHS, who has
worked there for decades, claims that higher expected penalties for infractions
can work as a substitute for greater monitoring of establishments by inspectors.
In fact, she says, if you can work to write higher penalties into the regulations,
a $1000 higher expected penalty for violations is as good at preventing
spills from happening as the presence of one more full-time inspector.
Test this hypothesis statistically.
In environmental economics, it is a common insight that compliance with
environmental regulations depends upon both the certainty and the severity of
punishment for non-compliance. (In fact, your TA's dissertation concerns this
issue!) In this simple model, then, there can be expected to be tradeoffs
between certainty and severity in achieving a certain level of compliance with
requirements for minimal releases. The number of inspectors is serving as a
proxy for the certainty of punishment, and the typical sizes of penalties for
infractions is serving as a proxy for the severity of punishment. The relevant
hypothesis here is that the effect of a one-unit increase in pen is equivalent
to the effect of a one-unit increase in insp. This is an hypothesis that
the coefficients on these two variables are equal. Fortunately, the output
contains an explicit F-test of this hypothesis: test insp=pen. The P-value
associated with this hypothesis test is 0.69, indicating that we cannot reject
the hypothesis. It seems that the employee's assertion is plausible.
11. In the different specifications in
the Exhibits, what proportion of the variation in toxic releases
across jurisdictions can be explained by a model that uses only the average
legal penalty? From Regression 3, this proportion is
0.4140, or 41.4% What proportion can be explained by a model that
uses the average legal penalty, the number of inspectors, the average ages
of facilities and median incomes? From Regression 6, this
proportion is 0.5967, or 59.67%. Can these be compared? Why or
why not?
These results cannot be compared, because the models have different numbers of
explanatory variables.
A model with more variables MUST have a higher R-squared value (at least, it can
be no lower, since the coefficients could stay the same on the existing variables
and be zero on the new variables, if that is what the data dictated). To compare
goodness of fit across models for the same dependent variable with different numbers
of explanatory variables, we need to use the adjusted R-squared value. For these two
models the adjusted R-squared values are 0.3796 and 0.4815, respectively. Therefore,
the loss of degrees of freedom in the larger model does not offset the improvement
in the explained sum of squares with more variables. We would say that the bigger
model displays a better fit, for the number of variables it uses.
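The adjusted R-squared values quoted above can be reproduced from the unadjusted ones. The sketch below assumes the usual degrees-of-freedom correction, with k counting slope coefficients and one additional degree of freedom used by the intercept; the function name is illustrative only.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k slopes plus an intercept."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 19
adj_3 = adjusted_r2(0.4140, n, 1)   # Regression 3: pen only
adj_6 = adjusted_r2(0.5967, n, 4)   # Regression 6: pen, insp, age, medinc

print(round(adj_3, 4), round(adj_6, 4))
```

The results agree with the reported 0.3796 and 0.4815 up to rounding of the R-squared inputs.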
12. The estimated model in Regression
6 exhibits a positive intercept. What is the interpretation of this
point estimate? Does it make sense to test hypotheses about the size of
the coefficient on the constant term? Explain.
The intercept of a regression is the expected value of the dependent variable
when ALL of the explanatory variables are simultaneously zero (not their coefficients, that is
the standard F-test story). In this case, it would mean the expected number of
spills for brand-new facilities in jurisdictions with no penalties and no inspectors (fine so far) and zero
median household incomes (nonsense). Thus, the intercept is irrelevant in this model, since
the data will never wander into that territory in any plausible scenario.
13. In Regression 6, are the slope
coefficients
on the age_i and medinc_i variables individually
statistically significantly different from zero? No. Does our inability
to discern the separate effects of age_i and medinc_i
on spills_i stem from multicollinearity problems? Explain.
If we look at the STAT output, we see that age and
medinc are not correlated at all. Likewise, the auxiliary R-squared values reveal
that these two variables are not higher-order linear functions of other variables. Thus, multicollinearity is
not what is causing the individual insignificance of these coefficients.
It is probably just the case that neither age nor medinc has much of an
effect on spills. Incidentally, notice that there is another source of information: the
confidence ellipse for this pair of coefficients has been provided.
Despite the crummy plot, it is abundantly clear that the point (0,0) is well
within this ellipse. The relatively "round" ellipse that pretty much
covers the intersection of the two marginal confidence intervals reveals
that our inferences with respect to these parameters are not being confounded
by any multicollinearity between these regressors.
14. Is the model in Regression 7 an
adequate model for toxic releases by these establishments? Explain your
reasoning. Can we conclude that there are no other relevant determinants
of the average annual number of releases for this set of jurisdictions?
Explain.
It would be nice to have a specific F-test for the incremental contribution
of the pair of variables, age and medinc, to the explained sum of squares for the model. This test
statement would have read:
test
test age=0
test medinc=0
end
However, this has not been provided. Still, we do have all the ingredients for
this test because of Regression 7. We need the
explained sums of squares from the analysis of variance (from means) tables for
the unrestricted model (Regression 6) and restricted model (Regression 7), as well
as the residual sum of squares from the unrestricted model (Regression 6).
Other ingredients are the number of restrictions (2) and the degrees of freedom
in the unrestricted model (14, from the t-ratio column heading or from the
"ERROR" DF in the analysis of variance from means table).
The formula is as follows:
[ (ExSS_ur - ExSS_r) / 2 ] / [ ResSS_ur / 14 ]
Note that the denominator is just the variance of the error, or s^2
for the unrestricted model. Plugging in the numbers for this example, we
get:
[ (31.102 - 30.171) / 2 ] / 1.5013
This is going to be a very tiny number, so we have a sense that the
hypothesis that both of these slopes are simultaneously zero will not be
rejected. So, the answer is
that a model without these variables is probably going to be fine.
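For completeness, the F statistic can be evaluated directly from the quantities quoted above. This is a sketch: the sums of squares and s^2 come from the solution, while the 5% critical value for F(2, 14), about 3.74 from a standard F table, is an added assumption not given in the original.

```python
ex_ss_ur = 31.102   # explained sum of squares, unrestricted (Regression 6)
ex_ss_r = 30.171    # explained sum of squares, restricted (Regression 7)
s2_ur = 1.5013      # ResSS_ur / 14, the error variance of the unrestricted model
q = 2               # number of restrictions (age = 0 and medinc = 0)

# Incremental F statistic for the joint contribution of age and medinc
f_stat = ((ex_ss_ur - ex_ss_r) / q) / s2_ur

# Assumption: tabulated 5% critical value for F with (2, 14) degrees of freedom
f_crit = 3.74
reject = f_stat > f_crit

print(round(f_stat, 3), reject)
```

The statistic is about 0.31, well below the critical value, which is consistent with not rejecting the hypothesis that both slopes are zero.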
15. (i.) Specifically, what do we call
the distribution that appears in the histogram at the end of the
Exhibits?
This is a marginal distribution, for this sample, for the number of toxic
releases.
(ii.) Specifically, what do we call the
scatterplot that appears at the end of the Exhibits?
This is the joint distribution, within the sample, of the number of inspectors
and the size of the legal penalties. (Note the negative correlation.)
Updated: February 18, 1998
Prepared by: Trudy Ann Cameron