THE UNIVERSITY OF CALIFORNIA, LOS ANGELES
Department of Economics
Economics 143 - Applied Regression Analysis
October 22, 1998
Cameron
Problem Set #4: Multiple Regression, Introduction

Due: October 29, 1998

NETWORK FILES NEEDED: n:study.sha, n:colds.sha, n:larent.sha, n:larent.dat

1. Omitted variables bias is a very common problem, both in simple regression models and in multiple regression. It is often the reason for the appearance of counter-intuitive signs or sizes of estimated coefficients on variables which are included as regressors in a model. In this question, we will explore two instances of omitted variables bias.

a.) In the first example, imagine you first looked only at the relationship between midterm scores and amount of time spent studying. Intuitively, what should be the effect of an additional hour of study time on midterm scores? Now look at the program in the file study.sha. Regress MIDTERM on STUDY. If you fail to "control for" GPA by including the GPA variable in the regression of midterm scores on amount of study time, what are the implications of the model? Test the hypothesis that studying time has no effect on grades. Now, controlling for GPA, by regressing MIDTERM on both STUDY and GPA at the same time, what is the contribution of an extra hour of study time to midterm score? Again test the hypothesis that studying time has no effect on grades. Look at the correlation of STUDY and GPA and tell a story about why this relationship might be observed. If GPA is left out of the specification, what dual role does the STUDY variable play? What is the nature of the "omitted variables bias" when the effect of GPA is not included explicitly in the model?
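The "dual role" of STUDY can be made concrete with a small numerical sketch. The data below are invented for illustration (they are NOT the contents of study.sha), and the helper names are hypothetical. Under those assumptions, the sketch checks an in-sample identity: the slope from the short regression of MIDTERM on STUDY equals the slope on STUDY in the long regression plus the slope on GPA times the slope from regressing GPA on STUDY.

```python
# Hypothetical data, NOT the contents of study.sha: stronger students
# (higher GPA) are assumed to study fewer hours.
study   = [2.0, 4.0, 5.0, 7.0, 8.0, 10.0]
gpa     = [3.9, 3.6, 3.2, 3.0, 2.6, 2.3]
midterm = [88.0, 86.0, 80.0, 81.0, 74.0, 73.0]

def mean(v):
    return sum(v) / len(v)

def cross(a, b):
    """Sum of cross-products of deviations from the means."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b))

def slope(y, x):
    """Slope from a simple regression of y on x (with intercept)."""
    return cross(x, y) / cross(x, x)

def slopes2(y, x1, x2):
    """Slopes from regressing y on x1 and x2 (with intercept)."""
    s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
    s1y, s2y = cross(x1, y), cross(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

b_short = slope(midterm, study)                # MIDTERM on STUDY only
b_study, b_gpa = slopes2(midterm, study, gpa)  # MIDTERM on STUDY and GPA
delta = slope(gpa, study)                      # GPA on STUDY

print("short-regression slope on STUDY:", b_short)
print("long-regression slope on STUDY :", b_study)
print("long slope + b_gpa * delta     :", b_study + b_gpa * delta)
```

With these made-up numbers the short-regression slope on STUDY is negative while the long-regression slope is positive, because STUDY is also standing in for the (negatively correlated) omitted GPA variable.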

2. The file larent.dat contains some (hypothetical) information on rental rates for one-, two-, and three-bedroom apartments in the Los Angeles area. The data can be read in using the statements

sample 1 26

read(n:larent.dat) RENT SQKLD BED SQBED BATH SQBATH PKG BEACH UCLA

where the variables are:

   RENT   = monthly rental in dollars
   SQKLD  = square feet of common area in the kitchen, livingroom and dining
            areas
   BED    = number of bedrooms
   SQBED  = total square feet in all bedrooms
   BATH   = number of bathrooms
   SQBATH = total square feet in all bathrooms
   PKG    = number of parking spaces included in rent
   BEACH  = number of miles from the beach
   UCLA   = number of miles from UCLA 

We will use these data to estimate what is called a "hedonic" rent model. Similar models have been calibrated in attempts to explain the selling prices of properties in the residential and commercial real estate markets. The idea is to use regression analysis to decompose the rental rate or selling price into additive components due to different features of the property in question. Categories of variables often used include the features of the dwelling, the lot, the neighborhood, or the municipality. One use for such a model is to predict what "should" be the rental rate or selling price for a given property. This is an "assessment" strategy such as that used by realtors or mortgage lenders. Another use has been for assessing the economic welfare effects of externalities. For example, localized air pollution, airport noise levels, or distance from a Superfund site have all been included in such models in an effort to determine willingness to pay for lower levels of such disamenities. This willingness to pay can be interpreted as the social cost of these pollution or noise levels.

In the Urban Economics literature, there is considerable discussion of the theoretical underpinnings of models such as these; in practice, however, researchers often simply take the OLS regressions at face value. (CAVEAT: Before you go into business consulting for the Board of Realtors and Property Managers, you should be sure you understand the theoretical limitations on these models.)

a.) Do a STAT / PCOR command to verify the plausibility of the values for individual variables (always check the minima and maxima). This also produces a pairwise correlation matrix that allows you to check for simple correlations that might be producing multicollinearity in your model. []
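For readers without SHAZAM at hand, the kind of check that STAT / PCOR performs can be sketched in Python. This is only an illustration: the five rows below are invented (they are NOT the contents of larent.dat), only three of the variables are shown, and the function name corr is hypothetical.

```python
from math import sqrt

# Hypothetical rows, NOT the contents of larent.dat:
# RENT, BED and SQBED for five made-up apartments.
data = {
    "RENT":  [650.0, 900.0, 1200.0, 800.0, 1500.0],
    "BED":   [1.0, 2.0, 3.0, 1.0, 3.0],
    "SQBED": [100.0, 210.0, 330.0, 120.0, 360.0],
}

def corr(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / sqrt(saa * sbb)

names = list(data)

# Minimum, mean, maximum: a quick plausibility check on each variable.
for n in names:
    v = data[n]
    print(f"{n:6s} min={min(v):8.1f} mean={sum(v) / len(v):8.1f} max={max(v):8.1f}")

# Pairwise correlation matrix, one row per variable.
pcor = {(a, b): corr(data[a], data[b]) for a in names for b in names}
for a in names:
    print(f"{a:6s}", " ".join(f"{pcor[(a, b)]:6.3f}" for b in names))
```

A pairwise correlation near 1 in absolute value (as between the made-up BED and SQBED columns here) is the kind of pattern to look for before interpreting individual coefficients.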

b.) Then regress RENT on all of the other variables available. Interpret the coefficients, in general. Specifically, what does the coefficient on PKG mean, in words? The coefficient on BEACH? []

c.) Now regress RENT on BED, BATH, PKG, BEACH, and UCLA only (this assumes that only the number of rooms of each type matters). Why do we not use a variable for "number of kitchens, livingrooms and diningrooms" (assuming all apartments have a kitchen, livingroom and dining area)? How do the estimated coefficients on individual variables compare with their values (and significance) in the model in (b.)? Compare the goodness-of-fit of this model with that of the full model in (b.). Does an adjusted R^2 statistic tell you whether one model is statistically significantly better than another, or just "better"? []

d.) Now regress RENT on SQKLD, SQBED, SQBATH, PKG, BEACH, and UCLA only. What does this specification imply about which factors matter in determining rental rates? Compare the fit of this model with the ones in (b.) and (c.). How do the estimated coefficients on individual variables compare with their values (and significance) in the models in (b.) and (c.)? Why do you observe these differences? (Think about the results of the STAT / PCOR in part (a.).) []

e.) To detect higher-order collinearities among the explanatory variables, auxiliary regressions of each explanatory variable on all of the others can be done. Try a few of these. For example, (i.) SQKLD on all of the other variables on the right hand side; (ii.) BED on all of the other variables on the right hand side, and (iii.) BEACH on all of the other variables on the right hand side. When you get an auxiliary regression with a good fit (high R^2) you know that there is a strong linear relationship between the variables. Are there any of these? Why is it the case that a good fit or lots of significant coefficients in an auxiliary regression can mean poor statistical significance on some or all of the coefficients in the main regression? []
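One standard way to connect the auxiliary regressions to the main regression is the following textbook result, stated here under the classical OLS assumptions. The notation is an assumption of this note rather than the handout's: R_j^2 is the R^2 from the auxiliary regression of regressor j on all the other regressors (with an intercept), and sigma^2 is the error variance of the main regression.

```latex
\operatorname{Var}(b_j) \;=\;
  \frac{\sigma^2}{\left(1 - R_j^2\right)\,\sum_{i=1}^{n}\left(x_{ij} - \bar{x}_j\right)^2}
```

As R_j^2 approaches 1, the denominator shrinks and the variance of b_j grows without bound.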

f.) CHALLENGE: (We may not have gotten to F-tests by the time you are working on this homework, in which case this question may be considered optional.) Starting from the model in (b.), use an F-test to determine whether the marginal rental price for an additional square foot of space is the same regardless of whether it is kitchen/livingroom/diningroom space, bedroom space, or bathroom space. When the coefficients on these three variables are restricted to be the same, the written form of the SRF can be simplified (how?). (You will have to create a new variable based on SQKLD, SQBED and SQBATH.) If two of the coefficients are being restricted to equal the third, this means that the model has 2 restrictions (since the third coefficient can still be anything). This F-test involves the explained sum of squares for the unrestricted and restricted models, the number of restrictions, and the residual sum of squares for the unrestricted model. You may also be able to get SHAZAM to do this test for you, if you are clever. []
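Written out, the F-statistic described in words above can be sketched as follows. The subscripts U and R (unrestricted and restricted), q (number of restrictions), n (observations) and k (estimated coefficients including the intercept) are notation assumed for this note. If the unrestricted model is the one in (b.), then q = 2, n = 26 and k = 9.

```latex
F \;=\; \frac{\left(\mathrm{ESS}_U - \mathrm{ESS}_R\right)/q}{\mathrm{RSS}_U/(n-k)}
  \;\sim\; F_{q,\;n-k} \quad \text{under } H_0
```

Because both models have the same dependent variable, and hence the same total sum of squares, the numerator can equivalently be written with RSS_R - RSS_U in place of ESS_U - ESS_R.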

g.) Starting from the model in (b.), test the hypothesis that the incremental rental value of distance from the BEACH is the same as the incremental rental value of distance from UCLA. Do this "by hand," and then see if you can check it within SHAZAM. The CONFID BEACH UCLA command will give you the ellipse containing all acceptable joint hypotheses about the two parameters on these variables. []
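For the "by hand" version, one common approach (a sketch, not necessarily the method intended in lecture) is a t-test on the difference of the two coefficients. It needs the estimated variances of both coefficients and their covariance from the coefficient covariance matrix. The symbols b_B and b_U for the estimated coefficients on BEACH and UCLA are assumed notation.

```latex
t \;=\; \frac{b_B - b_U}
  {\sqrt{\widehat{\operatorname{Var}}(b_B) + \widehat{\operatorname{Var}}(b_U)
         - 2\,\widehat{\operatorname{Cov}}(b_B, b_U)}}
  \;\sim\; t_{\,n-k} \quad \text{under } H_0:\ \beta_B = \beta_U
```

For the model in (b.), n - k = 26 - 9 = 17 degrees of freedom.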

h.) A well-meaning community activist is concerned about the plight of students and is worried that apartments closer to the campus are priced higher than those further away from campus. The activist plots RENT against UCLA (or even regresses RENT against UCLA). Are her fears confirmed? Why or why not? Then she realizes that she is not controlling for distance from the beach, which will also affect rental rates. So she tosses BEACH into the regression as well. Now is her intuition about the plight of students confirmed? Why or why not? Maybe the relationships are obscured because she is not controlling for apartment size? Include BED, and then PKG. Observe the behavior of the coefficients on the existing variables and their statistical significance. If coefficients change when a new variable is added, that variable is correlated somehow with the existing variables. This means that the earlier regression coefficients were suffering from "omitted variables bias." How does the information produced by the STAT / PCOR help to explain what is happening as variables are added? []

i.) Suppose you are a landlord trying to decide on the market rental rate for an apartment you are about to rent out. (The apartment has been rented at 1970 rates for the last twenty years to a sweet little old lady who you knew would not be able to handle a rent increase. Now she has moved in with her daughter.) This apartment has one bedroom, one bath, and one parking space. It has 300 square feet in the kitchen/livingroom/dining area, 92 square feet in the bedroom, and 45 square feet in the bath. It is also four miles from the beach and three miles from the campus. What rent should you set? What is a 95% confidence interval for the mean prediction in this case? []
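In matrix form, the point prediction and a 95% confidence interval for the mean prediction can be sketched as below. This assumes the full model in (b.) is used, with regressors ordered as intercept, SQKLD, BED, SQBED, BATH, SQBATH, PKG, BEACH, UCLA. The symbols x_0, b, s and X are notation assumed for this note: x_0 is the vector of characteristics of the apartment, b the OLS coefficient vector, s^2 = RSS/(n-k) the estimated error variance, and X the regressor matrix.

```latex
x_0 = (1,\ 300,\ 1,\ 92,\ 1,\ 45,\ 1,\ 4,\ 3)', \qquad
\hat{y}_0 = x_0' b, \qquad
\operatorname{se}(\hat{y}_0) = s\,\sqrt{x_0' (X'X)^{-1} x_0}

\hat{y}_0 \;\pm\; t_{0.025,\;n-k}\;\operatorname{se}(\hat{y}_0),
\qquad n - k = 26 - 9 = 17
```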

j.) Suppose that you have a second empty apartment that is identical in description, except it is more remote: 18 miles from the beach and 22 miles from the campus. Should you follow the same procedure you did in part (i.)? Why or why not? What are the implications of using the same procedures? Where are the pitfalls? []

k.) Suppose that you believe that the rents on one, two and three-bedroom apartments are determined in separate markets, so that it is not reasonable to expect that, say, the additional rent for an additional square foot of SQKLD space is the same for each of these categories of renters (as is the assumption in (b.) for example). Using the command

SKIPIF(BED.NE.1) (...can be undone by DELETE SKIP$)

to delete all two and three-bedroom apartments from the estimating sample, regress RENT on SQKLD, SQBED, SQBATH, PKG, BEACH, and UCLA. Then do the same for BED.NE.2 and BED.NE.3 in turn. (Why does the last regression "bomb" and how can you fix the problem?)

Are the hedonic rent functions different between these three groups? Just apparently, or statistically different? We will be able to test whether they are the same or different when we explore the use of "dummy" variables later on. []


Updated: 11/2/98; Prepared by: Trudy Ann Cameron