UNIVERSITY OF CALIFORNIA, LOS ANGELES
Department of Economics
Winter 1998
Policy Studies 208 - Midterm Examination - Outlines of Solutions in Red

INSTRUCTIONS: Answer all questions in the spaces provided (or indicate clearly where you have continued your answer). Calculators are NOT permitted. Reduce all computations to the simplest form so that anyone with a calculator could obtain the answer easily. Show your work and reasoning to the fullest extent possible so that part marks can be assigned as warranted. You have 75 minutes to complete this exam. All 15 questions are worth 10 points each (and some are much easier than others). Total points = 150. This means roughly 5 minutes for each answer. Budget your time carefully. NOTE: these data are fictitious.
 

SCENARIO: Your consulting firm has been hired by the Department of Environmental Health and Safety. The DEHS would like you to analyze the relationship between the annual number of toxic releases (per 100 establishments) for dry-cleaning firms (spills_i) and some policy variables deemed relevant for the control of these accidents. You have been provided with a sample of data from 19 randomly selected jurisdictions. The variables they have given you include the average age of such establishments in the jurisdiction (age_i), the number of full-time inspectors per 100 establishments in the jurisdiction (insp_i), the average legal penalties imposed for toxic releases in that jurisdiction (in thousands of dollars) (pen_i), and the median income of households in the jurisdiction (in thousands of dollars) (medinc_i). The statistical analyses you perform are given in the Exhibits.
 

1. Begin gently. Fill in the blanks:

Across these 19 jurisdictions, what is the mean "number of toxic releases per 100 establishments"? 3.0625 releases

What is the highest observed "number of inspectors per 100 establishments"? 10 inspectors (overkill?)

What is the standard deviation in "average legal penalties per release" across the sample? 5.8239 thousand dollars or $5823.90

Do the descriptive statistics you have just provided refer to the joint distribution of these three variables, or to their marginal distributions? marginal distributions

What is the correlation between insp_i and pen_i in this sample? -0.71181

What are the units for this correlation measure? none, correlation is a unit-free measure.
 

2. Using the descriptive statistics only, test the hypothesis that the true marginal mean "number of toxic releases per 100 establishments" is 4 per year.

The key thing to remember is that this is univariate statistics, the stuff from the beginning of the quarter. We are asking about the true mean of a single variable, which means that we must use the sample mean and the variance of the sample mean. The test statistic will be [Y-bar - 4] / (sY/square root of n). For this example, the numbers will be [3.0625 - 4] / (1.7016/square root of 19).
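As a rough check on the arithmetic (a sketch, not part of the original solution outline), the statistic can be evaluated directly. The critical value of 2.101 for 18 degrees of freedom at the two-sided 5% level is read from a standard t table and is an assumption added here, as are the variable names.

```python
import math

# Values quoted in the answer above
y_bar = 3.0625   # sample mean of spills
mu_0 = 4.0       # hypothesized true mean
s_y = 1.7016     # sample standard deviation of spills
n = 19           # number of jurisdictions

# Standard error of the sample mean and the t statistic
se_mean = s_y / math.sqrt(n)
t_stat = (y_bar - mu_0) / se_mean

# Assumption: two-sided 5% critical value for n - 1 = 18 df, from a t table
t_crit = 2.101
reject = abs(t_stat) > t_crit

print(f"t = {t_stat:.3f}, reject H0 at the 5% level: {reject}")
```

Under that assumed critical value, the statistic is about -2.4 in this sketch, so the hypothesis of a true mean of 4 would be rejected at the 5% level.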

3. Does Regression 1 make sense? Why or why not?

Not really. This model suggests that the average age of dry-cleaning establishments in a jurisdiction is determined by the number of spills, the number of inspectors, typical penalties, and the median income in the jurisdiction. The causality probably does not run this way, or if it does, the effects are extremely minimal.

4. The Administrator for the DEHS says "If the number of inspectors in a jurisdiction has no statistically discernible effect on the number of toxic releases from these establishments, why are we paying the salaries of these people?" Based upon the relevant simple regression in the Exhibits, is it possible that there is a downward-sloping relationship between the number of inspectors and the number of releases? Explain how you have reached this conclusion.

The Administrator is essentially saying that if the slope of a regression of spills on inspectors is zero, then the number of inspectors does not affect the frequency of spills, so why do we need inspectors? The expected number of spills would be the same with zero inspectors as with ten of them, for example. This is a zero hypothesis about the slope in Regression 2. We can just look at the t-ratio and its associated P-value. The P-value of 0.479 says we cannot reject the zero hypothesis, so it looks like the Administrator might be justifiably unhappy about paying for inspectors. Failing to reject a zero slope does not establish that the slope is zero, however: a confidence interval that contains zero also contains negative values, so a downward-sloping relationship remains possible. One must also always suspect possible omitted variables bias.

5. Based on Regression 3, test the hypothesis that in order to reduce the "number of releases per 100 establishments", on average, by one per year, it would be necessary to increase the average legal penalty per spill by $5,000.

This hypothesis first needs to be translated into something that involves the estimated coefficients of the model. Since this is a linear specification, the slope is the same everywhere. If a 5-unit change in pen leads to a one-unit decrease in spills, then this is equivalent to a 1-unit change in pen leading to a 0.2 unit decrease in spills. A convenient version of this hypothesis is thus to test whether the slope on pen equals -0.2. Let's look at what is available in the way of tests associated with Regression 3. No hypotheses other than the zero hypothesis have been requested, so we have to construct our own t-test statistic. This test statistic is the point estimate minus the value under the null hypothesis, all divided by the standard error of the point estimate. Specifically, [-0.188 - (-0.2)] / 0.05425. If this number, when calculated, is larger in absolute value than the relevant critical value for a t-distributed random variable with 17 degrees of freedom (which is 2.11 at the 5% level of significance), we would reject the null hypothesis. Eyeballing the formula, the number is going to be about 0.012/0.05, which will be nowhere near this size. Thus, we will fail to reject this hypothesis. It is plausible.
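The eyeballed ratio can be confirmed with a few lines of arithmetic. This is only a sketch using the numbers quoted in the answer; the variable names are illustrative.

```python
# Values quoted in the answer above (Regression 3)
b_pen = -0.188       # estimated slope on pen
se_b_pen = 0.05425   # standard error of that slope
beta_null = -0.2     # slope implied by "5 units of pen per one fewer spill"
t_crit = 2.11        # 17 df, 5% level, as stated in the answer

# t statistic for H0: slope on pen equals -0.2
t_stat = (b_pen - beta_null) / se_b_pen
reject = abs(t_stat) > t_crit

print(f"t = {t_stat:.3f}, reject H0 at the 5% level: {reject}")
```

The statistic comes out near 0.22, far below 2.11, which agrees with the conclusion that the hypothesis cannot be rejected.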

6. Based on Regression 3, what average number of releases per 100 establishments would you expect for a jurisdiction with an average legal penalty of $50,000 per release? Give the precise formula for a point estimate and explain explicitly how a 95% confidence interval for this prediction would be constructed. Why should you use caution in making this prediction?

Here is where you use the "big messy formula" for a confidence interval for mean prediction. You would plug 50 into the fitted model from this regression to get the midpoint of the confidence interval (the point estimate). Then you will need x-bar, the marginal mean of pen from the stat output, and the sample size n = 19 to plug into the formula for the standard error of the point estimate. The thing that needs to be constructed is the sum of the squared deviations of pen from its mean (the sum of the little x_i squared). For this, you need to use the estimate of s ("standard error of the estimate - sigma") from the third line below the R-squared information. Divide this by the standard error of the slope estimate (since that standard error is s/(the square root of what you want)). Finally, square the resulting number to get the desired sum of squared deviations.
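The recipe in the answer above can be sketched as a small function. This is an illustration of the standard mean-prediction interval for simple regression, not output from the Exhibits: the slope, its standard error, n and the critical value come from the text, while the intercept, the mean of pen and s are HYPOTHETICAL placeholders, because those Exhibit values are not reproduced here.

```python
import math

def mean_prediction_ci(b0, b1, x0, x_bar, n, s, se_b1, t_crit):
    """Point estimate and confidence interval for the mean of y at x0."""
    # se(b1) = s / sqrt(sum of squared deviations), so invert and square
    sum_sq_dev = (s / se_b1) ** 2
    y_hat = b0 + b1 * x0
    # Standard error of the fitted mean at x0
    se_fit = s * math.sqrt(1.0 / n + (x0 - x_bar) ** 2 / sum_sq_dev)
    return y_hat, y_hat - t_crit * se_fit, y_hat + t_crit * se_fit

# Quoted in the text for Regression 3
b1, se_b1, n, t_crit = -0.188, 0.05425, 19, 2.11
x0 = 50.0  # penalty of $50,000, in thousands of dollars

# HYPOTHETICAL placeholders: read the real values from the Exhibits
b0, x_bar, s = 5.0, 10.0, 1.34

y_hat, lower, upper = mean_prediction_ci(b0, b1, x0, x_bar, n, s, se_b1, t_crit)
print(f"point estimate {y_hat:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

The term (x0 - x_bar) squared makes the interval widen as the chosen penalty moves away from the sample mean of pen, which is one reason for caution if 50 lies far from the penalties observed in the sample.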

7. You think for a while and then realize that the number of toxic releases per 100 establishments is probably a joint function of several different factors, rather than just one at a time. You estimate Regression 6 in order to ascertain the joint effects of all available determinants on the average number of toxic releases per 100 establishments. Describe what seems to happen to the apparent effect of the insp_i variable when you include the other variables in your model. If this apparent effect is different, explain why. What do you tell the Administrator of the DEHS?

When you control for other factors that influence the number of spills, compared to Regression 2, the coefficient on insp changes from positive to negative and actually becomes statistically significantly different from zero. There must have been some omitted variables bias obscuring the effect of insp in the simple regression. The culprit variable is probably the size of the penalties. There is a fairly high negative correlation between these two variables (about -0.7). Thus greater numbers of inspectors were serving as proxies for smaller fines in that jurisdiction. Since smaller fines reduce the incentive to prevent releases, the effects of the smaller fines were offsetting the effects of more inspectors (and a higher probability of problems being detected). You can tell the Administrator not to worry, because it seems that inspectors are making a difference after all, when you control for variations in the levels of penalties.
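The direction of the bias described above can be written out with the standard omitted-variable formula. This is a sketch under a simplifying assumption that only insp and pen matter; the symbols beta and delta are notation introduced here, not taken from the Exhibits.

```latex
\begin{aligned}
\text{True model:}\quad & \mathit{spills}_i = \beta_0 + \beta_1\,\mathit{insp}_i + \beta_2\,\mathit{pen}_i + \varepsilon_i \\
\text{Auxiliary regression:}\quad & \mathit{pen}_i = \delta_0 + \delta_1\,\mathit{insp}_i + v_i \\
\text{Slope when } \mathit{pen} \text{ is omitted:}\quad & E[\tilde{b}_1] = \beta_1 + \beta_2\,\delta_1
\end{aligned}
```

With higher penalties reducing spills (beta_2 < 0) and a negative sample correlation between insp and pen (delta_1 < 0), the bias term beta_2 times delta_1 is positive. It pushes the simple-regression slope on insp upward, which is consistent with the sign change between Regression 2 and Regression 6.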

8. In Regression 6, explain the use of the / auxrsqr option on the ols command. What does it tell you here?

The auxrsqr option seems to be unique to SHAZAM among popular regression packages. It allows you to track down the probable sources of multicollinearity problems that might be affecting your data and therefore your inferences. This option, in the background, runs regressions for each explanatory variable on all of the others and reveals where "good linear fit" is found. High auxiliary R-squared values suggest that the subset of variables with these high values has a high degree of (possibly higher-order) multicollinearity. The coefficients on these variables may be insignificant because the OLS algorithm is unable to parcel out explanatory power between them, even though collectively, they might have great bearing on the expected value of Y (although maybe they don't).
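To make the idea of auxiliary regressions concrete, here is a from-scratch sketch in plain Python. It is not SHAZAM's implementation, the function names are invented for illustration, and the toy data are made up: x2 is built to be almost exactly twice x1, while x3 is constructed to be unrelated to both.

```python
def ols_r_squared(y, X):
    """R-squared from regressing y on the columns of X plus a constant."""
    n = len(y)
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    k = len(rows[0])
    # Augmented normal equations: (X'X | X'y)
    A = [[sum(rows[i][a] * rows[i][b] for i in range(n)) for b in range(k)]
         + [sum(rows[i][a] * y[i] for i in range(n))] for a in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [u - f * v for u, v in zip(A[r], A[c])]
    b = [A[j][k] / A[j][j] for j in range(k)]
    fitted = [sum(bj * xj for bj, xj in zip(b, row)) for row in rows]
    y_mean = sum(y) / n
    res_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tot_ss = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - res_ss / tot_ss

def auxiliary_r_squared(columns):
    """Regress each explanatory variable on all of the others."""
    out = {}
    for name, y in columns.items():
        others = [v for other, v in columns.items() if other != name]
        out[name] = ols_r_squared(y, list(zip(*others)))
    return out

# Made-up toy regressors (NOT the exam data)
toy = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [2.1, 3.9, 5.9, 8.1, 10.1, 11.9, 13.9, 16.1],  # roughly 2 * x1
    "x3": [4.0, 4.0, 6.0, 6.0, 3.0, 3.0, 5.0, 5.0],      # unrelated to x1, x2
}
aux = auxiliary_r_squared(toy)
for name, r2 in aux.items():
    print(f"{name}: auxiliary R-squared = {r2:.4f}")
```

In this toy example the auxiliary R-squared values for x1 and x2 are close to one, flagging them as the collinear pair, while the value for x3 is essentially zero.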

9. For Regression 6, test the hypothesis that the age_i variable does not belong in the model. What do you conclude?

Age is completely uncorrelated with any of the other regressors AND it has a lousy t-ratio and P-value, so we are fairly safe in concluding that age probably does not belong in this specification. However, there is always a danger that it IS correlated with some unidentified omitted variable, and omitted variables bias is still obscuring its role in the model. The unnerving thing is that you can never be sure, so you just think about potential determinants as carefully as possible and try to argue that you have everything...

10. One employee of the DEHS, who has worked there for decades, claims that higher expected penalties for infractions can work as a substitute for greater monitoring of establishments by inspectors. In fact, she says, if you can work to write higher penalties into the regulations, a $1000 higher expected penalty for violations is as good at preventing spills from happening as the presence of one more full-time inspector. Test this hypothesis statistically.

In environmental economics, it is a common insight that compliance with environmental regulations depends upon both the certainty and the severity of punishment for non-compliance. (In fact, your TA's dissertation concerns this issue!) In this simple model, then, there can be expected to be tradeoffs between certainty and severity in achieving a certain level of compliance with requirements for minimal releases. The number of inspectors is serving as a proxy for the certainty of punishment, and the typical size of penalties for infractions is serving as a proxy for the severity of punishment. Because pen is measured in thousands of dollars, a $1000 higher penalty is a one-unit increase in pen. The relevant hypothesis here is that the effect of a one-unit increase in pen is equivalent to the effect of a one-unit increase in insp. This is a hypothesis that the coefficients on these two variables are equal. Fortunately, the output contains an explicit F-test of this hypothesis: test insp=pen. The P-value associated with this hypothesis test is 0.69, indicating that we cannot reject the hypothesis. It seems that the employee's assertion is plausible.
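For reference, a single linear restriction of this kind has a standard textbook form, sketched below. The notation is introduced here for illustration; the estimated covariance between the two coefficients is not quoted in this outline, so the statistic cannot be recomputed from the numbers given.

```latex
t = \frac{b_{\mathit{insp}} - b_{\mathit{pen}}}
         {\sqrt{\widehat{\mathrm{Var}}(b_{\mathit{insp}}) + \widehat{\mathrm{Var}}(b_{\mathit{pen}})
                - 2\,\widehat{\mathrm{Cov}}(b_{\mathit{insp}},\, b_{\mathit{pen}})}},
\qquad F = t^{2} \sim F_{1,\,14} \ \text{under } H_0
```

The 14 denominator degrees of freedom are those of Regression 6 (19 observations less 5 estimated coefficients).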

11. In the different specifications in the Exhibits, what proportion of the variation in toxic releases across jurisdictions can be explained by a model that uses only the average legal penalty? From Regression 3, this proportion is 0.4140, or 41.4%. What proportion can be explained by a model that uses the average legal penalty, the number of inspectors, the average ages of facilities and median incomes? From Regression 6, this proportion is 0.5967, or 59.67%. Can these be compared? Why or why not?

These results cannot be compared, because the models have different numbers of explanatory variables. A model with more variables MUST have a higher R-squared value (at least, it can be no lower, since the coefficients could stay the same on the existing variables and be zero on the new variables, if that is what the data dictated). To compare goodness of fit across models for the same dependent variable with different numbers of explanatory variables, we need to use the adjusted R-squared value. For these two models the adjusted R-squared values are 0.3796 and 0.4815, respectively. Therefore, the loss of degrees of freedom in the larger model does not offset the improvement in the explained sum of squares with more variables. We would say that the bigger model displays a better fit, for the number of variables it uses.
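The adjusted values quoted above can be reproduced from the unadjusted ones. This sketch uses the usual degrees-of-freedom correction; the function name is illustrative.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared; k counts slope coefficients, excluding the constant."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

n = 19
adj_reg3 = adjusted_r_squared(0.4140, n, 1)  # Regression 3: pen only
adj_reg6 = adjusted_r_squared(0.5967, n, 4)  # Regression 6: four regressors

print(f"Regression 3: {adj_reg3:.4f}")
print(f"Regression 6: {adj_reg6:.4f}")
```

The results agree with the reported 0.3796 and 0.4815 up to rounding of the inputs.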

 
12. The estimated model in Regression 6 exhibits a positive intercept. What is the interpretation of this point estimate? Does it make sense to test hypotheses about the size of the coefficient on the constant term? Explain.

The intercept of a regression is the expected value of the dependent variable when ALL of the explanatory variables are simultaneously zero (not their coefficients, that is the standard F-test story). In this case, it would mean the expected number of spills for brand-new facilities in jurisdictions with no penalties and no inspectors (fine so far) and zero median household incomes (nonsense). Thus, the intercept is irrelevant in this model, since the data will never wander into that territory in any plausible scenario.

 
13. In Regression 6, are the slope coefficients on the age_i and medinc_i variables individually statistically significantly different from zero? No. Does our inability to discern the separate effect of age_i and medinc_i on spills_i stem from multicollinearity problems? Explain.

If we look at the STAT output, we see that age and medinc are not correlated at all. Likewise, the auxiliary R-squared values reveal that these two variables are not higher-order linear functions of other variables. Thus, multicollinearity is not what is causing the individual insignificance of these coefficients. It is probably just the case that neither age nor medinc has much of an effect on spills. Incidentally, notice that there is another source of information: the confidence ellipse for this pair of coefficients has been provided. Despite the crummy plot, it is abundantly clear that the point (0,0) is well within this ellipse. The relatively "round" ellipse that pretty much covers the intersection of the two marginal confidence intervals reveals that our inferences with respect to these parameters are not being confounded by any multicollinearity between these regressors.

 
14. Is the model in Regression 7 an adequate model for toxic releases by these establishments? Explain your reasoning. Can we conclude that there are no other relevant determinants of the average annual number of releases for this set of jurisdictions? Explain.

It would be nice to have a specific F-test for the incremental contribution of the pair of variables, age and medinc, to the explained sum of squares for the model. This test statement would have read:
test
test age=0
test medinc=0
end
However, this has not been provided. Still, we do have all the ingredients for this test because of Regression 7. We need the explained sums of squares from the analysis of variance (from means) tables for the unrestricted model (Regression 6) and restricted model (Regression 7), as well as the residual sum of squares from the unrestricted model (Regression 6). Other ingredients are the number of restrictions (2) and the degrees of freedom in the unrestricted model (14, from the t-ratio column heading or from the "ERROR" DF in the analysis of variance from means table). The formula is as follows:
[ (ExSSur - ExSSr) / 2 ] / [ResSSur/14]
Note that the denominator is just the variance of the error, or s-squared for the unrestricted model. Plugging in the numbers for this example, we get:
[ (31.102 - 30.171 )/2 ] / 1.5013
This is going to be a very tiny number, so we have a sense that the hypothesis that both of these slopes are simultaneously zero will not be rejected. So, the answer is that a model without these variables is probably going to be fine.
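The "very tiny number" can be checked directly. This sketch uses the sums of squares quoted above; the 5% critical value of 3.74 for an F distribution with (2, 14) degrees of freedom is taken from a standard F table and is an assumption added here.

```python
# Values quoted in the answer above
ex_ss_ur = 31.102   # explained sum of squares, unrestricted (Regression 6)
ex_ss_r = 30.171    # explained sum of squares, restricted (Regression 7)
q = 2               # number of restrictions (age = 0 and medinc = 0)
s2_ur = 1.5013      # ResSS / 14 for the unrestricted model

# F statistic for the joint exclusion of age and medinc
f_stat = ((ex_ss_ur - ex_ss_r) / q) / s2_ur

# Assumption: 5% critical value for F(2, 14), from an F table
f_crit = 3.74
reject = f_stat > f_crit

print(f"F = {f_stat:.3f}, reject H0 at the 5% level: {reject}")
```

The statistic is roughly 0.31, well below the assumed critical value, so the joint zero restriction is not rejected.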

 
15. (i.) Specifically, what do we call the distribution that appears in the histogram at the end of the Exhibits?

This is a marginal distribution, for this sample, for the number of toxic releases.

(ii.) Specifically, what do we call the scatterplot that appears at the end of the Exhibits?

This is the joint distribution, within the sample, of the number of inspectors and the size of the legal penalties. (Note the negative correlation.)
 
Updated: February 18, 1998
Prepared by: Trudy Ann Cameron