UNIVERSITY OF CALIFORNIA, LOS ANGELES
Department of Policy Studies
Winter 1998 Cameron
Policy Studies 208 - Final Examination

Version without answers

INSTRUCTIONS: Answer all questions in the space provided (or indicate clearly where you have continued your answer on the back of the page). Calculators are NOT permitted. Reduce all computations to the simplest form so that anyone with a calculator could attain the answer easily. Show your work and reasoning to the fullest extent possible so that part marks can be assigned as warranted. You have three hours to complete this exam. There are 25 questions (or question sections) worth 5 points each except where noted. Total points = 125. Budget your time carefully. Exhibit pages should not be turned in with your exam. Remember: answer questions in a manner that reflects the econometric reasoning you have learned in this course.

1.  Exhibit A describes an analysis of some fictitious data concerning graduate student movie-going behavior as a function of the average full-price (including parking costs) of movies at theater complexes in the student's local area and the annual income of the student (in thousands of dollars per year).

a.)  According to Regression A1 and Regression A2, are movies a normal or an inferior good, on average, for the graduate students in the sample? Or are they a normal good for lower-income students and an inferior good for higher-income students? Explain.

It does not appear that income has much effect on quantity no matter how it is incorporated into the model. It does not appear to be statistically significant if it enters linearly; nor is either term significant if it enters both in linear and quadratic form. Note, however, that residuals analysis following Regression A2 suggests that we should be cautious in drawing any inferences about anything from Regression A2.

b.)  What does the diagnos / het output following Regression A2 tell us? What are the implications for the standard OLS results produced by Regression A2?

The diagnos / het output provides just the relevant results from a set of regressions of the squared errors from the previous regression on a variety of things to which they might be shown to be related. The squared errors from a preliminary naive OLS regression represent the best information we have about the actual sizes of the true $\sigma_i^2$. If there is homoscedasticity, these conditional error variances should be independent of any other variable we consider. These results show that they are not. We reject the hypothesis of no relationship between the error variance and each thing except in one case. The exception is the test for ARCH (AutoRegressive Conditional Heteroscedasticity). This is only relevant if we have time series data, where there are patterns in the error variance over time (as opposed to patterns in the sign of the error over time). Since there is no particular order to these data, it is not surprising that the variance associated with adjacent observations is unrelated.
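The idea behind these auxiliary regressions can be sketched in a few lines of Python. This is only an illustration on simulated data, not a reproduction of the SHAZAM diagnos / het output or of Exhibit A: the variable names, the sample size, and the assumption that the error standard deviation is proportional to price are all invented for the example.

```python
import random

def simple_ols(x, y):
    """Return (intercept, slope, slope t-ratio) for y = a + b*x + error."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    se_b = (sse / (n - 2) / sxx) ** 0.5
    return a, b, b / se_b

random.seed(0)
# Hypothetical data: quantity falls with price p, and the error spread grows with p.
p = [random.uniform(4.0, 10.0) for _ in range(1000)]
q = [20.0 - 1.0 * pi + random.gauss(0.0, 0.5 * pi) for pi in p]

a, b, _ = simple_ols(p, q)                              # naive OLS
e2 = [(qi - a - b * pi) ** 2 for pi, qi in zip(p, q)]   # squared residuals
_, slope_e2, t_e2 = simple_ols(p, e2)                   # auxiliary regression
print(f"slope of e^2 on p = {slope_e2:.2f}, t-ratio = {t_e2:.1f}")
```

A clearly positive and significant slope in the auxiliary regression is the signal that the error variance is not constant across observations.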

c.)  Consider Regression A3, Regression A4, Regression A5, and Regression A6. Is there one exogenous variable that is unambiguously the most closely related to the sizes of the unobserved individual conditional error variances, $\sigma_i^2$? Explain.

In the one-by-one regressions, the squared errors appear to be positively related to p and to p2. They are unrelated to y. When we regress the squared errors on all three candidates, we see a high degree of multicollinearity between p and p2, such that it is impossible to distinguish, statistically, the independent contributions of these two variables to explaining the squared errors if both are used. It looks like a toss-up which we choose from among p and p2 to capture the variations in the error variance across observations. We definitely do not want to use y.

d.)  Among Regression A7, Regression A8, and Regression A9, which specification is inappropriate as a potential remedy for the problems afflicting Regression A2? Explain why.

Since the error variance appears to vary directly with the magnitude of either p or p2, and the weights should vary inversely with the error variance, the weights should vary inversely with either p or p2. The inappropriate weighting variable would be wt1. Either wt2 or wt3 is probably sufficient, with perhaps the slightest edge to 1/p, since the statistical significance of p as an explanatory variable for the squared errors is ever so slightly larger and the R-squared value for that model is slightly higher.
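A minimal sketch of the weighting logic, assuming the error variance is proportional to p so that each squared error is weighted by 1/p. The data and the function name wls_line are made up for illustration, and the convention here (weights multiply the squared errors) is an assumption; software packages differ in how they expect weights to be supplied.

```python
def wls_line(x, y, w):
    """Weighted least squares for y = a + b*x, minimizing sum(w_i * e_i^2)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b

# Hypothetical prices and quantities (not the Exhibit A data).
p = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
q = [16.2, 14.7, 14.5, 12.1, 13.0, 9.8]

# If Var(e_i) is proportional to p_i, weight each squared error by 1/p_i.
w = [1.0 / pi for pi in p]
a_wls, b_wls = wls_line(p, q, w)
a_ols, b_ols = wls_line(p, q, [1.0] * len(p))   # equal weights reproduce OLS
print(f"WLS: a = {a_wls:.3f}, b = {b_wls:.3f}")
print(f"OLS: a = {a_ols:.3f}, b = {b_ols:.3f}")
```

Observations with high prices, where the errors are presumed noisier, count for less in the weighted fit.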

e.)  In this example, are the substantive implications of the fitted model altered by the use of weighted least squares methods? Discuss.

If my preferred model is taken to be Regression A9, the parameter point estimates change only very slightly (since different formulas are used to calculate them under WLS than under OLS). As for the standard errors, if we had the corrected OLS standard errors, they should be expected to be larger than the WLS standard errors, since WLS is more efficient than OLS under heteroscedasticity. However, all we get from SHAZAM is the uncorrected standard errors, which are just plain wrong, since they have been calculated using a formula that assumes you can factor out a common $\sigma^2$, when you cannot. All the same, there appear to be no meaningful or surprising changes in the implications of this model when we correct for heteroscedasticity. Unfortunately, this is not a general result. You never know whether heteroscedasticity correction will make little difference to your estimates or inferences until you do the weighted least squares model and find out.

f.)  If you did not have to worry about violations of the maintained hypotheses for OLS regarding error terms, would you prefer the linear specification in Regression A2 or the log-log specification in Regression A10? By what criterion? Explain.

In Regression A10, we have been careful to include the loglog option in the regression command, so SHAZAM knows that the dependent variable in this model has been logged. Thus the regression algorithm undertakes to adjust the formula for the log-likelihood to account for the logged dependent variable, thereby making the log-likelihood comparable to models which use the (raw) "levels" of the dependent variable. For the log-log model, the maximized log-likelihood is -253.223. For the levels model, it is -246.915. The higher value is obtained for the levels model, so it would be preferred. Recall that a log-likelihood can be interpreted as the log of the joint probability of observing the data that we have observed in our sample.
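The usual form of this adjustment can be written out as follows. This is the standard change-of-variables result for a normally distributed error on the log of y; it is offered as a sketch of the kind of correction involved, not as a statement of SHAZAM's internal formula.

```latex
% If \ln y_i is modeled as normal, the implied density of y_i itself is
% f_Y(y_i) = f_{\ln Y}(\ln y_i) \cdot \frac{1}{y_i},
% so the log-likelihood expressed in terms of the levels of y is
\ln L_{\text{comparable}}
  \;=\; \ln L_{\text{log model}} \;-\; \sum_{i=1}^{n} \ln y_i
```

Only after subtracting the sum of the logged dependent variable can the log-likelihood of the logged model be set beside that of the levels model.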

g.)  Sometimes, using a log-log model will eliminate a heteroscedasticity problem. Is this the case here? Explain. Mention the circumstances under which a logarithmic transformation of the dependent variable will perfectly remedy a heteroscedasticity problem.

Residuals analysis following the log-log model in Regression A10 is conducted in Regression A11. The squared errors from the log-log model (note that e is redefined in Regression A10) are statistically significantly related to the magnitude of the log of price. Thus the log transformation is not successful, for these data, in eliminating the heteroscedasticity that plagues the levels data. Logging can sometimes eliminate heteroscedasticity, but only if the nature of the heteroscedasticity is exactly such that logging will get rid of it. If the heteroscedasticity is not of this kind, logging can fail to remedy the problem, or even make it worse.
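One textbook case in which logging works exactly is a multiplicative error with constant variance. The notation below is introduced for illustration and does not come from the exhibits.

```latex
% Multiplicative error, with \varepsilon_i independent of x_i and \mathrm{Var}(\varepsilon_i) = \sigma^2:
y_i = \exp(x_i'\beta)\,\exp(\varepsilon_i)
% In levels, the error spread scales with the conditional mean:
\mathrm{sd}(y_i \mid x_i) \;\propto\; E[\,y_i \mid x_i\,]
% After taking logs, the error is additive and homoscedastic:
\ln y_i = x_i'\beta + \varepsilon_i , \qquad \mathrm{Var}(\varepsilon_i) = \sigma^2
```

When the spread of the levels errors does not grow in proportion to the conditional mean, the logged errors will generally still be heteroscedastic.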

2. The following questions pertain to EXHIBIT B. These are real data, and we will explore a preliminary model to explain the observed monthly time-series variation in new construction of public buildings for education. The variables read by the program are defined as follows:

YRMO = year and month in CITIBASE format (e.g. 9509 = September, 1995)
PUBLIC = New construction, public buildings, educational (million $, monthly, not seasonally adjusted) [CITIBASE variable CZONQE; (1964:1-1995:12)].
P1 = Population estimate; under 5 years (thousands, annual) [CITIBASE variable PAN1; annual data replicated for each month of the corresponding year; (1964-1995)].
P2 = Population estimate; 5-9 years (thousands, annual) [CITIBASE variable PAN2; annual data replicated for each month of the corresponding year; (1964-1995)].
P3 = Population estimate; 10-14 years (thousands, annual) [CITIBASE variable PAN3; annual data replicated for each month of the corresponding year; (1964-1995)].
P4 = Population estimate; 15-19 years (thousands, annual) [CITIBASE variable PAN4; annual data replicated for each month of the corresponding year; (1964-1995)].

There is also a variable called YROB, which is a CITIBASE annual data observation indicator running from 1964-01 to 1995-01. The annual population data are "spread" across all twelve months in the relevant year, since no more-frequent data on populations are available.

a.)  According to Regression B1, has new construction of public school buildings been growing over time? Explain.

Yes, the coefficient on the time trend variable T is positive and appears to be hugely statistically significant (although we'll have more on this later). The point estimate suggests that public school expenditure has been growing, on average, by $4.4 million per month over the time period from January 1964 through December 1995.

b.)  According to Regression B1, does new construction of public school buildings depend upon the numbers of kids of different ages in the population? Explain.

New public school construction appears to depend positively on the sizes of the first three cohorts (P1, P2, and P3), but not on the size of the oldest cohort of school-aged children (P4). Changes in the size of the P2 group appear to have the biggest influence on new public school construction. Again, we will see in a minute that we cannot really trust the t-ratios, although in this model they seem pretty "healthy."

c.)  Is there multicollinearity among the regressors in Regression B1? Is it causing any problems of inference concerning the parameters in this model? Explain.

There is considerable multicollinearity between the sizes of the four different age cohorts of school-aged children. One would expect this to compromise the individual statistical significance of the slope coefficients on these variables. The strong linear relationship between P4 and the others might indeed explain the statistical insignificance of its coefficient in the regression. (We'll argue in a minute that time-series hypothesis tests, before we have assessed the error properties, are always suspect. Thus, while it looks as though this multicollinearity has not ruined our ability to detect statistically-different-from-zero slope coefficients on three of the cohort-size variables, we cannot be sure. In any event, multicollinearity will make the parameter standard errors larger, meaning relatively more hypotheses will be deemed acceptable. I.e., our parameter estimates will offer less resolution than might have been possible with uncorrelated regressors. Here, unfortunately, there is no option of going back for a different sample. History is history. Although there might be hope for future data, say over the next 20 years.)

d.)  Based on the output following Regression B1 and on the results of Regression B2, what do you suspect might be wrong with the results of Regression B1? Why?

Inconveniently, whoever ran these regressions for you did not ask for a DWPVALUE (exact Durbin-Watson test statistic and p-value). All you get with this output is a point estimate of the D-W test statistic (0.2884) and a point estimate of the AR(1) error correlation (rho=0.85648). These look suspicious, even though you do not have d-tables in which to look up the critical values. Smells like positively serially correlated errors of at least AR(1), possibly more, since these are monthly data. Fortunately, Regression B2 provides pretty strong evidence of systematic relationships between the errors associated with different time periods at particular intervals of spacing. Positive serial correlation in the errors generally means that the standard errors on the parameter estimates are understated, leaving the t-ratios overstated, and the P-values too small. You end up rejecting a lot of hypotheses you really cannot reject. The point estimates are still unbiased, but your inferences are pretty worthless until you have corrected the problem.
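The two reported numbers tell the same story, which can be checked with the familiar approximation d = 2(1 - rho). The short sketch below also writes out the Durbin-Watson formula; the function name durbin_watson is just a label for this example.

```python
def durbin_watson(resid):
    """d = sum over t of (e_t - e_{t-1})^2, divided by the sum of e_t^2."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e ** 2 for e in resid)
    return num / den

rho_hat = 0.85648      # AR(1) error correlation reported after Regression B1
d_reported = 0.2884    # Durbin-Watson statistic reported after Regression B1

# With no serial correlation d is near 2; with rho near +1 it approaches 0.
d_approx = 2.0 * (1.0 - rho_hat)
print(f"2(1 - rho) = {d_approx:.4f} versus reported d = {d_reported}")
```

A value of d this far below 2 is what strongly positive first-order serial correlation looks like.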

e.)  Why does Regression B2 involve so many explanatory variables? Are we concerned that there might be multicollinearity among these regressors? Explain.

Since we have monthly data and many processes have a regular annual cycle, we can expect a priori that 12th order autoregressive errors may be relevant in monthly data. Thus, we want to explore at least the relationship between current error and each of the first twelve lags of the error term. This is an interesting side-issue, intended to show if you understand serial correlation patterns in errors and can tie this to multicollinearity. If you have high-order serial correlation in the errors and you run this regression of current errors on twelve different lags of the same variable, then each $e_t$ will be correlated with each $e_{t-1}$. If rho is large, there could be very substantial multicollinearity and it could get difficult to discern the individual coefficients in this regression. Despite this problem here, the first, eighth, and eleventh lags appear to be statistically significant. Note that there would be no way to "fix" this multicollinearity, since all "variables" are just different past observations on the same variable.

f.)  Explain succinctly the main tasks that are performed "behind the scenes" by SHAZAM when the AUTO command is used.

Consider an AR(1) model, where the presumption is that $u_t = \rho u_{t-1} + e_t$. SHAZAM comes up with an initial estimate of the correlation between current and lagged error terms and uses this to transform the regression equation by taking each variable (including the constant term), lagging it, and calculating the transformed variables according to the current-period value minus rho times the lagged-1-period value. SHAZAM then uses these generalized differences in a regression where the error term is "fine" because the current minus rho-times-lagged error term is just $e_t$, which is pure noise and therefore fits the criteria for OLS. The "iteration" part: Next, SHAZAM takes the point estimates from this generalized difference regression and applies them to the raw X data to compute a fitted Y value for each observation. The difference between the actual Y and this fitted Y is a new estimate for $u_t$. If you take the correlation between this new estimate of $u_t$ and its lagged value, you get a revised estimate of rho. Use this again in creating the generalized differences. Continue these iterations until some convergence criterion is achieved (either a stable largest-achievable maximized log-likelihood, or a stable smallest-achievable sum of squared errors). These final "fine-tuned" parameter estimates (intercept, slope, rho value(s)...and sigma-squared, of course) can then be reported along with their asymptotic standard errors and associated asymptotic t-ratios that allow us to test statistically the null hypotheses that individual rho parameters are zero.
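The iteration described above can be sketched as a Cochrane-Orcutt style loop for a model with one regressor. This is an illustrative approximation, not SHAZAM's actual AUTO implementation: it drops the first observation, estimates rho by regressing the residual on its own lag through the origin, and stops when rho settles down. The data are simulated and every name is hypothetical.

```python
import random

def simple_ols(x, y):
    """Intercept and slope for y = a + b*x + error."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b

def cochrane_orcutt(x, y, tol=1e-8, max_iter=100):
    """Alternate between estimating rho and re-fitting on generalized differences."""
    a, b = simple_ols(x, y)               # naive OLS starting values
    rho = 0.0
    for _ in range(max_iter):
        # Residuals on the raw data estimate the serially correlated error.
        u = [yi - a - b * xi for xi, yi in zip(x, y)]
        new_rho = (sum(u[t] * u[t - 1] for t in range(1, len(u)))
                   / sum(u[t - 1] ** 2 for t in range(1, len(u))))
        # Generalized differences (the first observation is dropped).
        xs = [x[t] - new_rho * x[t - 1] for t in range(1, len(x))]
        ys = [y[t] - new_rho * y[t - 1] for t in range(1, len(y))]
        a_star, b = simple_ols(xs, ys)
        a = a_star / (1.0 - new_rho)      # transformed constant equals a*(1 - rho)
        converged = abs(new_rho - rho) < tol
        rho = new_rho
        if converged:
            break
    return a, b, rho

random.seed(1)
# Hypothetical series: y_t = 1 + 2 x_t + u_t, with u_t = 0.8 u_{t-1} + e_t.
x, y, u_prev = [], [], 0.0
for _ in range(500):
    u_prev = 0.8 * u_prev + random.gauss(0.0, 1.0)
    xt = random.gauss(0.0, 1.0)
    x.append(xt)
    y.append(1.0 + 2.0 * xt + u_prev)

a_hat, b_hat, rho_hat = cochrane_orcutt(x, y)
print(f"a = {a_hat:.2f}, b = {b_hat:.2f}, rho = {rho_hat:.2f}")
```

The slope and rho estimates should land near the values used to generate the data (2 and 0.8), which is the sense in which the transformed regression "cleans up" the error term.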

g.)  Consider the revised results in Regression B3: (i.) Does public school construction depend on demographics? Explain. (ii.) Does public school construction activity seem to anticipate future enrollments, or simply respond to current enrollments? Explain.

Conveniently, there is an F-test provided to test the hypothesis that all coefficients on the P1 through P4 variables are simultaneously zero. This hypothesis is soundly rejected, even though ONLY P2 is now individually statistically significant. Just the P2 coefficient could account for this, however. The interesting test would have been to see a joint test for the coefficients that appear individually insignificant: P1, P3 and P4. You'd certainly want to try one of these. If it weren't for the multicollinearity issue, we'd probably conclude that public school construction only starts when there has been a surge in kindergarten-through-fourth-grade aged children. If preschool populations (P1) had an individually statistically significant effect on new school construction, you could say that construction anticipated enrollments. The insignificance of this coefficient could still be due to multicollinearity, however, so we cannot be entirely certain about this conclusion.

h.)  Regression B4 explores a more-general specification for the public school construction model. According to this model, does this new construction change systematically over time? Does it change systematically in response to populations of children in different age groups? Explain each answer carefully.

Regression B4 includes interaction terms between the time trend variable and each of the four population variables. This means that the derivative of new construction with respect to time is no longer constant and simply equal to the coefficient on the time variable. Likewise, the derivatives of new construction with respect to the sizes of each of the four relevant sub-populations now depend on time, rather than simply being constants. The coefficient on the linear term in T is no longer individually statistically significant. However, in order for the "change in construction for a 1-unit change in time" to be zero, the coefficients on t, tp1, tp2, tp3, and tp4 would have to be all simultaneously zero. If we write out the relevant part of the main regression model, it is:

$$b_1 + b_2 T_i + \dots + b_7 \, T_i \cdot P1_i + b_8 \, T_i \cdot P2_i + b_9 \, T_i \cdot P3_i + b_{10} \, T_i \cdot P4_i + \dots + e_i$$

If we write out the time derivative of the estimated function, we get:

$$b_2 + b_7 \, P1_i + b_8 \, P2_i + b_9 \, P3_i + b_{10} \, P4_i$$

Individually, tp2 has a statistically significant coefficient. But we are provided with just the F-test we need: If the coefficients on t, tp1, tp2, tp3 and tp4 are all simultaneously zero, the time derivative is zero. Apparently, it is not.

We are also provided with an appropriate F-test to determine whether all the derivatives of the new construction regression function with respect to the population variables are simultaneously zero. This hypothesis is also rejected, even though none of the individual coefficients on the P1 through P4 variables is individually statistically significant any more. Fortunately, T*P2 is individually significant, and this does the trick.

i.)  Is there a "typical" seasonal pattern in public school construction expenditures? Characterize this pattern. Does it conform with your intuition?

The left-out month is January, so the basic intercept term is the intercept for January. The coefficients on the other monthly dummies tell how much expected new public school construction differs in that month compared to its expected level in January. For example, expected new public school construction in August is higher than in January by about $287 million. In February, it is lower by $15.9 million compared to January, although this difference is not statistically significantly different from zero. What is the overall pattern? Public school construction is highest in the summer months, when there is good weather and most kids are out of school. It is lowest in the winter months when weather is bad and attempts to build new structures at existing campuses would disrupt classes.


3. (10 points) Non-experimental data can sometimes make it very difficult to draw policy implications from regression analysis. Choose (a.) OR (b.)

a.) GUN CONTROL: Suppose your sample consists of households that have been victimized by robbery. The dependent variable takes a value of 1 if a household member is shot during the robbery and 0 otherwise. One of your explanatory variables is a dummy variable equal to 1 if there is a handgun present in the house, 0 otherwise. When a handgun is present in a household, an occupant of that house is much more likely to be shot in the process of a robbery than when no handgun is present. Therefore, to minimize injury and loss of life from robbery incidents, private ownership of handguns should be banned. Evaluate this policy proposal and the "evidence" upon which it is premised. Briefly describe the nature of the true "experiment" that would allow an unambiguous determination of the effect of handgun presence on robbery shootings via a regression like this.

Households get to choose whether to own a handgun or not. The reasons a household might choose to have a handgun might include fear of robbery by violent criminals who might also be armed. If handgun ownership is greater when the odds of violent robberies are higher, then it could even be the case that presence of a handgun has no bearing whatsoever on whether a household member is shot in a robbery. It might just be an indicator for a more dangerous neighborhood, or a more ostentatious home with lots of goodies that look ripe for robbery. It might also be an indicator for a more belligerent householder who is more likely to resist or attempt to attack a robber. If the model fails to control for all these other factors, it could look like the mere presence of a gun leads to more homeowner shootings in robberies.

The "experiment" that would be necessary to discern the effects of handgun presence on householder shootings in robberies might be something like the following. Randomly give handguns to some households and ensure that other households do not have them. (Would the NRA let you do that?) After some suitable period of time, identify the households from this population that have been robbed. Compare the frequency of homeowner shootings in the robbery group that had handguns with the frequency of same in the robbery group who did not have handguns. Since the presence or absence of a handgun in the household would have been completely random (exogenous), and independent of anything unobservable about the household, the difference in homeowner shooting rates between these two groups, if statistically significant, would tell you whether the apparent effect in the non-experimental data was real.

In the absence of an opportunity to conduct such an experiment (which would seem to be the case), a researcher could attempt to first model handgun holdings in terms of strictly exogenous variables, and then to use "two-stage" types of methods to purge the endogenous handgun-ownership variable of any correlation with the error term in the main model. Note that the main model in this case is probably going to be a probit or logit type model, but that is only a minor variation on the usual intuition.

b.) LEGALIZATION OF MARIJUANA: Suppose you have a random sample of at-risk 18-year-olds. The dependent variable is the number of times each teenager has used heroin. Among the explanatory variables is a dummy variable that takes a value of 1 if the subject experimented with marijuana prior to age 13, and 0 otherwise. You find that the coefficient on this dummy variable is positive and strongly statistically significant. Therefore, we should not legalize marijuana use (which would make it much more accessible to pre-teens) since this will lead to widespread use of heroin. Evaluate this policy proposal and the "evidence" upon which it is premised. Briefly describe the nature of the true "experiment" that would allow an unambiguous determination of the effect of pre-teen marijuana use on subsequent heroin use via a regression like this.

The same individuals who are making choices that lead to heroin experimentation are making the earlier choices about marijuana experimentation. There may be innate individual tendencies to seek and use mood- or mind-altering substances. Perhaps one could call this an "addictive personality." Or perhaps there are hereditary or social factors that predispose certain youngsters to illegal drug use (a neighborhood or school drug culture, for example). Pre-teen experimentation with marijuana may have no effect whatsoever on the odds of later heroin use (other kids might have started with their parents' liquor cabinet). But if pre-teen marijuana experimentation is an indicator for a host of conditions that combine to lead teens to try heroin, then it could certainly look like the marijuana use is "causing" the later heroin use.

The "experiment" that would be necessary to judge causality might be as follows. Take a sample of early pre-teen at-risk children who have not yet tried marijuana. Randomly assign them to two groups and make one group use marijuana and ensure that nobody in the other (control) group does. (Sure would be hard to get funding for that research!) Revisit the group when they turn 18 and compare average heroin use rates in the two groups. If the rates are statistically significantly different, then you will have demonstrated causality.

In the absence of any opportunity to conduct such a controlled experiment, the researcher would have to work with non-experimental data. This would necessitate constructing a model to explain pre-teen marijuana experimentation in terms of solely exogenous variables. The fitted portion of this model could then be used in the main model to explain later heroin use, ensuring that this revised exogenous "pre-teen marijuana-use propensity" variable is uncorrelated with unobserved components of the main model error term.
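A small simulation can make the two-stage idea concrete. Everything here is hypothetical: z stands in for some exogenous variable that shifts early marijuana use but has no direct effect on later heroin use, c is an unobserved propensity that drives both behaviors, and the true causal effect of early use is set to zero by construction. Whether a credible z exists in real data is exactly the hard part.

```python
import random

def simple_ols(x, y):
    """Intercept and slope for y = a + b*x + error."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b

random.seed(3)
z, d, y = [], [], []
for _ in range(5000):
    zi = 1.0 if random.random() < 0.5 else 0.0      # hypothetical exogenous shifter
    c = random.gauss(0.0, 1.0)                      # unobserved propensity
    di = 1.0 if 1.5 * zi + c + random.gauss(0.0, 1.0) > 0.75 else 0.0
    yi = 0.0 * di + 2.0 * c + random.gauss(0.0, 1.0)  # true effect of d is zero
    z.append(zi)
    d.append(di)
    y.append(yi)

_, naive = simple_ols(d, y)               # picks up the propensity c, not causation

a1, b1 = simple_ols(z, d)                 # stage 1: explain d with exogenous z only
d_hat = [a1 + b1 * zi for zi in z]        # fitted "use propensity"
_, two_stage = simple_ols(d_hat, y)       # stage 2: use the fitted values instead

print(f"naive slope = {naive:.2f}, two-stage slope = {two_stage:.2f}")
```

The naive coefficient is large even though early use has no effect in the simulated world, while the two-stage coefficient is close to zero. Note that standard errors from a second stage run by hand like this are not correct and would need adjusting.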

 
4. Assume your dependent variable takes on a value of 1 if a high-school student is affiliated with a gang and zero otherwise. Among your explanatory variables are included: family income level, GPA in school, dummy variables for father present in household and mother present in household, eligibility for after-school programs, educational attainment of each parent, etc.

a.)  What sort of estimation method would you probably choose to determine empirically the effect of after-school program eligibility on gang affiliation? How would you interpret the results? Are there any caveats you might add concerning this single-equation model?

This is a discrete-outcome, or dummy dependent variable model. Thus, a probit or logit model is probably appropriate. Provided after-school program eligibility is randomly assigned across school districts, independent of levels of gang activity, you might be able to make the desired assessment of the policy. If there is systematically greater (or lesser) access to after-school programs in areas where gangs are more active, you might have trouble with this "program evaluation."

Suppose the kid makes the decision to belong to the gang or not. The kid also makes decisions with respect to how much effort to put out in school, thereby making GPA a potentially endogenous variable. If gang membership influences household stability, it may be that absent parents are sometimes a result, rather than a cause, of gang membership (or at least these two outcomes may be jointly determined by some of the same conditions in the neighborhood or local culture).

b.)  Multicollinearity among the regressors can lead to problems in making clear inferences about the effects of changes in individual explanatory variables only in Ordinary Least Squares models. It is not a concern in fundamentally nonlinear estimation methods such as probit or logit models. True, False, Uncertain? Explain.

FALSE. OLS estimation methods are called linear estimation methods (even if they are non-linear in the variables) because it is possible to calculate the parameter point estimates as a solution to k equations in k unknowns (there are k first-order conditions for the minimization of the sum of squared errors function with respect to k unknown intercept and slope parameters). Nonlinear models (like the probit and the logit) cannot be solved this simply and it is necessary to use a search algorithm to find the best parameter values (usually those which maximize the log-likelihood function for the model). Almost all commonly used "regression-type" models, however, involve some linear-in-parameters "index" of the explanatory variables. If any of the explanatory variables are highly correlated, it will be difficult to identify the separate slope coefficients on each of them.

 
5. Suppose you are reading an article concerning the effects of immigration status on utilization levels of social services among legal and undocumented immigrants who have been in the US for less than 10 years and who have been receiving social services. You encounter the following estimated model. (Note that the sample producing these results is fictitious.)
 

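$$\text{SERV}_i = \underset{(5.2)}{30.90} \;-\; \underset{(0.31)}{1.20}\,\text{TIME}_i \;+\; \underset{(5.50)}{9.30}\,\text{LEGAL}_i \;+\; \underset{(0.20)}{0.33}\,\text{TIME}_i \cdot \text{LEGAL}_i$$

where $\text{SERV}_i$ = value of social services utilized (in hundreds of dollars per year);
          $\text{TIME}_i$ = time spent in the US (in years);
          $\text{LEGAL}_i$ = 1 if legal immigrant; = 0 if undocumented; i = 1,...,676.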

and the parameter standard errors are given in parentheses below each point estimate.


a.)   Based on the point estimates, what is the average utilization of social services for a legal immigrant in the first year after arrival in the US? ______________ For an undocumented immigrant in the first year? ___________________________

For a legal immigrant, the "LEGAL" variable takes on a value of 1, so all coefficients are relevant. Interpret the TIME variable as "number of PRIOR years in the US" (as per instructions during the exam). In the "first year after arrival in the US" an immigrant would have a value of the TIME variable equal to zero. Therefore, fitted SERV will be just 30.90 + 9.30 = 40.20. For an undocumented immigrant, predicted SERV is just 30.90, since the LEGAL dummy variable takes on a value of 0. Note, however that the point estimate of the 9.30 coefficient on LEGAL is less than twice its standard error, so we cannot reject the hypothesis that there is no difference between the two groups.

b.)   Based on the point estimates, how does utilization of social services vary with time in the US for a legal immigrant? _________________ For an undocumented immigrant? ____________________________

The derivative of SERV with respect to TIME is not a single constant, but is equal to (-1.20 + 0.33 LEGALi). Thus, for legal immigrants, SERV changes by -1.20 + 0.33 = -0.87 per year. In words, since SERV is measured in hundreds of dollars per year, social service utilization falls by $87 per year for legal immigrants who have been receiving services. For undocumented immigrants, LEGAL=0, so service utilization falls by $120 per year. However, note that the point estimate for the coefficient on the interaction term, 0.33, is less than twice its standard error, so we cannot reject the hypothesis that there is no difference in the rates at which service utilization changes over time.

c.)   Overall, does legal/undocumented status have a statistically significant effect on utilization of social services? Explain.

This is a question about the derivative of SERV with respect to LEGAL, which shows up in two places in the estimated model. The formula for this derivative is 9.30 + 0.33 TIMEi. The only way for there to be NO effect of status on utilization would be if both the 9.30 and 0.33 coefficients were actually zero. Individually, it has already been noted that we cannot reject the zero hypothesis for either of these parameters. The relevant test, however, would be an F-test to discern whether they could be jointly zero. We are not provided with enough information to perform this test, so the question cannot be answered with the available data. Of course, for the F-test to reject the jointly zero hypothesis when the individual t-tests fail to reject the marginal hypotheses, there would have to be some correlation between the two variables in question. Since one variable is LEGAL and the other is TIME*LEGAL, this is certainly a possibility.

d.)   Does this model predict that legal immigrants will always utilize more social services than undocumented immigrants (or vice-versa)? If not, how does the predicted utilization differential (legal minus undocumented) change with time in the US? When will predicted utilization be the same for both groups? Comment.

If we were to plot SERV against TIME, for each of the two groups in the sample (the LEGAL=1 and LEGAL=0 groups), we would see that the intercept is higher for the LEGAL=1 group, but its slope is less negative. This means that the undocumented immigrant utilization profile starts lower and drops more quickly. Thus the model says (within the range of the data only) that legal immigrants will always use more services. Utilization falls for both groups, but the utilization differential widens over time. The predicted utilization will not be the same for the two groups anywhere within the relevant range of the data.
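The two fitted profiles and the gap between them can be traced out directly from the point estimates. The function names below are just labels for this illustration, and TIME is read as the number of prior years in the US, as in part (a).

```python
# Point estimates from the fitted model (SERV is in hundreds of dollars per year).
B0, B_TIME, B_LEGAL, B_INTER = 30.90, -1.20, 9.30, 0.33

def predicted_serv(time, legal):
    """Fitted SERV for a given number of prior years in the US and status dummy."""
    return B0 + B_TIME * time + B_LEGAL * legal + B_INTER * time * legal

def differential(time):
    """Predicted legal-minus-undocumented gap at a given TIME."""
    return predicted_serv(time, 1) - predicted_serv(time, 0)

for t in (0, 5, 10):
    print(t, round(predicted_serv(t, 1), 2), round(predicted_serv(t, 0), 2),
          round(differential(t), 2))

# The gap is 9.30 + 0.33*TIME, which equals zero only at a negative TIME.
crossing_time = -B_LEGAL / B_INTER
print(round(crossing_time, 1))
```

The gap grows with TIME, and the only point at which the two lines would meet lies roughly 28 years before arrival, far outside the 0 to 10 year range of the sample.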

6.  Suppose you are working with individual household survey data. If you do not have data at the individual household level for one of your explanatory variables, you might be able to use group averages as a proxy for this variable (e.g. 5-digit zip code median household income instead of individual household incomes for a nation-wide sample). To the extent that the groups you use are relatively homogeneous, the proxies may be very useful in mitigating what would otherwise be omitted variables bias. The same strategy is appropriate if you do not have any individual data for your desired dependent variable. True, False, Uncertain? Explain, suggesting the best alternative if you disagree.

Everything is fine here until you get to the part that asserts that you can do the same thing if you do not have any individual data for the desired dependent variable. The key issue here is the "unit of observation." It is possible to "spread" more-aggregated data across observations on the right-hand-side of a regression model. (In fact, we did that with the annual data on population cohorts in question 2 above!) However, it is considered poor form to have variables on the right-hand-side be at a lesser level of aggregation than the dependent variable. For instance, we would not get very far if we had county-level unemployment rates as a dependent variable, but individual data on people's education levels, genders, and ethnicities on the right-hand side. In a sense, there would be more than one observation on each X variable for each available value of the Y variable. If faced with this sort of a situation, researchers generally resign themselves to doing the whole analysis at the higher level of aggregation. E.g. switch to county proportions of people at each of several education levels, county proportions of each of several ethnic groups, and so on. We lose the individual detail (and gender, for example, will almost always be 0.50 female), so it is often a shame that we do not have disaggregated data for the dependent variable (such as whether or not each individual has a job).

BONUS: (5 points)  If you estimate a regression model and get a counter-intuitive sign on a slope coefficient, what sort of problem(s) do you initially suspect? Explain.

First guess: omitted variables bias. Second guess: endogeneity bias. "Wrong" signs are generally signs that are biased so much by one of the potential reasons for bias that the "true" sign is reversed. We saw an example of this in the study.sha program in one of the early labs. When we failed to control for GPA, it looked like more studying might actually decrease your expected midterm grade, implying a policy recommendation that one should study zero hours in order to maximize their grade.
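A sign reversal of this kind is easy to reproduce on simulated data. The numbers below are invented and are not the study.sha data: the sketch assumes that stronger students (higher GPA) study fewer hours, and that an extra hour of study truly raises the grade by one point. The regression with GPA held constant is computed by first partialling GPA out of both hours and grade.

```python
import random

def simple_ols(x, y):
    """Intercept and slope for y = a + b*x + error."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b

def residuals(x, y):
    """Residuals from regressing y on x (used to partial x out of y)."""
    a, b = simple_ols(x, y)
    return [yi - a - b * xi for xi, yi in zip(x, y)]

random.seed(2)
n = 2000
gpa = [random.gauss(0.0, 1.0) for _ in range(n)]              # standardized GPA
hours = [10.0 - 2.0 * g + random.gauss(0.0, 1.0) for g in gpa]
grade = [50.0 + 1.0 * h + 5.0 * g + random.gauss(0.0, 2.0)
         for h, g in zip(hours, gpa)]

# GPA omitted: the hours coefficient absorbs the effect of GPA.
_, short_slope = simple_ols(hours, grade)

# GPA controlled for: regress the GPA-purged grade on the GPA-purged hours.
_, long_slope = simple_ols(residuals(gpa, hours), residuals(gpa, grade))

print(f"GPA omitted: {short_slope:.2f}; GPA controlled: {long_slope:.2f}")
```

With GPA left out, the hours coefficient comes out negative even though the true effect is positive; once GPA is controlled for, the estimate returns to roughly one.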



Updated: March 26, 1998
Prepared by: Trudy Ann Cameron