UNIVERSITY OF CALIFORNIA, LOS ANGELES
Department of Economics

Economics 143 (Cameron) - Applied Regression Analysis

1997 Proposal Bloopers and Problems

This is a partial inventory of some of the things that went wrong in the research proposals that were submitted for the Fall 1997 version of the course. Studying these, and being sure you know WHY they are problems, may solidify your ability to think and write about empirical issues in economics.

Semantic and Grammar Problems:
  1. F-tests have a capitalized F, whereas t-tests use a lower-case t.


  2. The word "it's" means "it is." The possessive form is "its" with no apostrophe.


  3. The word "cannot" is a single word, not "can not" (in most cases).


  4. Note the difference between "whether" and "weather."


  5. You should say "A number of models is possible," since you are talking about "a number." If this gives you the creeps, switch to "Several models are possible."


  6. When you multiply two explanatory variables to create a new variable, the result is called an "interaction term," not an "interactive term."


  7. Contractions such as "wouldn't," "didn't," and "shouldn't" are not commonly seen in professional writing. Switch to the longer form.


  8. "Constituency" is commonly used to describe the people who belong to the voting public relevant to a politician. But it is common also to talk about the constituency for a research project. This is the group in society that will be interested in the results. This is not necessarily the same thing as one politician's electorate.


  9. You can assume your audience for these proposals is familiar with econometrics at the level of Economics 143. It is not necessary to explain what a slope is, or how to do a t-test or an F-test.


  10. The particular software used to estimate an ordinary least squares regression model is not typically of general interest. There are many alternatives. At most, any mention of the program used and discussion of specific commands should be relegated to footnotes. But the software used should be mentioned, since a proper citation is needed.

Problems in Specific Proposals

I have not selected problems from everyone's proposals, just some specific examples. These are worth mentioning (so everyone can learn from everybody else's mistakes, as opposed to just their own). In some cases, I have appended an outline of the nature of the problem.

  1. Free throw percentage in a particular game regressed on (among other things) free throw percentages in the first, second, third, and fourth quarters of that game; the expectation that the coefficients on these variables will drop as the game progresses.


  2. Dependent variable = number of car accidents per year for each of 1000 randomly selected individuals. Are all accidents equal? Is there a better dependent variable? A dummy variable for "young age" (<45) and "old age" (>45) both included in the model; a plan then to use the logs of these two variables in an alternative log-log specification.


  3. Dependent variable = percentage of cases won by a trial attorney; failure to control for types of cases taken (contingency fees create incentives to choose only winnable cases)? "We can regress a single explanatory variable on the dependent variable." "By holding all other explanatory variables constant, we could formulate a null hypothesis..."


  4. Including an intercept term and a full set of 12 monthly dummy variables to capture seasonality. [If demand for a product is expected to change over one's life-cycle, rather than simply to increase with age or fall with age, then a quadratic function of age may be appropriate (or one might switch to a set of age-interval dummies to detect the actual shape of the relationship).]
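The first problem in item 4 (an intercept plus all 12 monthly dummies) can be checked numerically. The sketch below is illustrative only: the ten years of monthly data and the choice of January as the omitted month are assumptions, not details from any proposal. It shows that the 12 dummies sum to the intercept column, so the design matrix loses a rank and OLS has no unique solution.

```python
import numpy as np

# Hypothetical data: 10 years of monthly observations (120 rows).
n_years = 10
month = np.tile(np.arange(12), n_years)           # 0..11, repeated each year

# One dummy column per month, plus a column of ones for the intercept.
dummies = np.eye(12)[month]                       # shape (120, 12)
intercept = np.ones((month.size, 1))

X_trap = np.hstack([intercept, dummies])          # intercept + all 12 dummies
X_ok = np.hstack([intercept, dummies[:, 1:]])     # intercept + 11 dummies (January omitted)

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)

print(X_trap.shape[1], rank_trap)   # 13 columns, but rank 12: perfectly collinear
print(X_ok.shape[1], rank_ok)       # 12 columns, rank 12: estimable
```

Dropping any one month (or dropping the intercept instead) restores full column rank. The coefficients on the remaining dummies are then differences relative to the omitted month.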


  5. Dependent variable = GPA, as a function of average daily commute time per quarter. Should interact commute time with mode of commuting. Public transit commuting could foster more reading, driving alone should not. Should control for time studying and going to classes, as well as time commuting. Failing to do this could lead to omitted variable bias if study time varies inversely with commute time.


  6. Dependent variable is a dummy variable (0,1)--either a condition is present for an observation or it is not. We had not discussed these models formally as of the time the proposals were due, but as long as Y differs from observation to observation, you can get some useful information by regressing it on potential factors that could contribute to it being either a 0 or a 1. Example: a study on osteoporosis.
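As a rough illustration of item 6, the sketch below regresses a 0/1 dependent variable on a single 0/1 explanatory variable. Everything in it is hypothetical (the risk factor, the assumed probabilities of 0.2 and 0.5, the sample size); it is not drawn from the osteoporosis proposal. With one dummy regressor, the OLS intercept and slope are simply the share of observations with the condition in the baseline group and the difference in shares between the two groups.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical 0/1 risk factor and 0/1 outcome (condition present or not).
risk = rng.integers(0, 2, size=n)
prob = 0.2 + 0.3 * risk                    # assumed true probabilities: 0.2 vs 0.5
y = (rng.random(n) < prob).astype(float)

# OLS of the dummy dependent variable on an intercept and the risk factor.
X = np.column_stack([np.ones(n), risk])
b = np.linalg.lstsq(X, y, rcond=None)[0]

share_without = y[risk == 0].mean()
share_with = y[risk == 1].mean()

# Intercept = share with the condition when risk = 0;
# slope = difference in shares between the two groups.
print(b[0], share_without)
print(b[1], share_with - share_without)
```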


  7. Once a model has been estimated, identify which explanatory variables are amenable to being manipulated by policy decisions and which are not. Your ancestors may influence your susceptibility to disease, but they are exogenous and predetermined and not within the reach of policy prescriptions. However, knowledge of the influence of heredity means people could be advised of their greater risk and the need to pay attention to minimizing other factors that contribute to the disease in question. E.g. how much attention to different good-health practices might be necessary to make up for bad genes?


  8. Dependent variable: robbery sentences. Model fails to control for the nature or severity of the robbery (amount, gender or race of victim, etc.). If this is uncorrelated with all of the other things that act to determine the length of a sentence (here, emphasizing gender and race), then there is no problem. However, if males commit more heinous crimes, for example, then the result that they get longer sentences on average than women should not be construed as evidence of discrimination. Could also consider identity of the judge.


  9. Dependent variable: sales (of bus company? for a particular bus route?), presumably quarterly, because there is mention of using quarterly dummy variables for seasonality. Acronyms used before the variables are defined. Must be across cities as well as over time, since bus fares might not vary enough to explain sales, especially if they are regulated by a public transit authority. Explanatory variables include frequency of buses going through the route. Is an observation thus a single bus route? Coverage of bus routes? Conclusion mentions "daily sales." (A clear definition of the unit of observation is needed.) Also, specification section mentions "Certain quantities will be chosen for each variable and all combinations of the values will be tried to obtain the sales for that data point, this will be done for every season to get the seasonal values." Is this an economic experiment, where X's can be manipulated at will and Y can then be observed?


  10. "...we explain the behavior of one variable in relation to the behavior of other variables allowing for the fact that the relationship between the variables is not exact by adding the error term E. And because of the presence of other unmeasurable factors such as ability, personality, performance and motivation that differ from one person to another, we include the term U."


  11. "...number of children is expected to be negatively correlated with experience and gender..." (as explanatory variables in a model to explain earnings). "The Log w formula indicates that all explanatory variables determine earnings." (model is not yet estimated!) "...hoping to make information more available at the time of hire to both employers and employees as much as for the census bureau in order to change wage rules accordingly."


  12. Dependent variable: starting incomes for a sample of recent UCLA graduates. Suggests that implications from the research will include insights into "...Is a bachelor's degree worth the same today as it was 10,20,30 years ago?...How much is a UCLA degree worth today when compared to other universities? Has the value of a UCLA degree declined in the past decade?"


  13. Dependent variable: economic growth in an urban region. Modelled as a function of annual growth rates in multi-lane highways, growth in rail systems, number of airplanes and buses in service, and population growth. (HINT: Some variables are growth rates and some are current levels.) Is this local or interregional travel?


  14. Dependent variable: a woman's weight. Does not include mother's average adult weight (and grandmothers' weights). Genetics can play a very important role in weight determination, in addition to behavioral factors.


  15. Dependent variable: public railway ridership demand (average annual number of people). Claims pooled data of 900 observations of thirty metropolitan cities in the world with elaborate railway systems, over a thirty year time span from 1980 to 2010.


  16. "In the following model, only one factor from the four groups are discussed." Proposal identifies broad classes of explanatory factors, but illustrates each with only one example from that class. (CAN test factor collectively by doing a joint test of the significance of the coefficients on ALL of the variables making up that "factor.")


  17. Dependent variable: GPA. Key explanatory variables=dummies for large campus, small campus, distance learning; distance learning the omitted category. We now know about endogeneity bias, since students self-select to participate in distance learning. If the ones who choose it are predisposed to have greater success by that mode than learning by other modes, distance learning will look artificially successful.


  18. Dependent variable: "the composition of waste material and its toxicity (WASTE)." Explanatory variables: amounts of metals, cloth, rubber, glass, plastics, yard waste, ..., food waste. Apparently no independent measure of WASTE. RHS variables ARE the waste composition and toxicity. Would be better to try to explain the quantity of each component as a function of economic conditions, season of the year, etc., unless some scientists can monitor and measure emissions from a waste dump and provide some index of effluent from the site. THAT could be a useful dependent variable.


  19. "Any variable that may have an effect on a high school student's GPA need to be accounted for. Only in this way can we determine whether or not learning how to play a musical instrument has a positive effect on a high school student's GPA. If I did not include all of these other variables, the omitted variable bias problem would occur." (HINT: Not necessarily. When would it NOT be a problem?)


  20. Dependent variable: GPA. Explanatory variables include SAT score and race dummies. "Some have criticized the (SAT) test to be culturally biased and an unfair indicator of a student's academic potential. If the SAT score variable coefficient is significantly positive, then these critics can be statistically proven to be wrong." (Need to interact SAT score with ethnicity variables to see if the difference in college GPA for a one-unit difference in SAT score (slope on SAT) differs by ethnic group.)


  21. Dependent variable: number of adults who suffer from fear of flying ("proportional to the total random population in the sample"). Explanatory variables include a female dummy, years of education, whether it is a big plane,... (Hint: dependent var is for a population, explanatory variables (are variously) for an individual or for an individual on a specific flight.)


  22. "This model is designed to observe any differences, if they exist, between salaries due to race and gender." (Model includes ONLY gender and race variables. Model fails to control for other left-out factors that might affect salaries and are correlated with race and gender. If women choose jobs with more-flexible hours (perhaps in anticipation of child-rearing), and these jobs pay less, failing to control for the flexibility of the job would create an apparent salary decrement just for being female.)


  23. Dependent variable: hourly labor charges for auto repairs. Explanatory variables: value of the auto being fixed and income of the auto owner. Objective: look for evidence that mechanics price-discriminate on the basis of owners' incomes. Proposal suggests this might be a socially undesirable "bias." Recognize market power on the part of the repair shop and economically rational exploitation of lesser demand elasticities of higher income consumers. Auto repairs are non-transferable and seller can identify different groups by the auto they own. Still, differences could be due to differing complexity of fixing a high-priced auto (more bells and whistles in the technology?).


  24. Dependent variable: total cost of water treatment (data over ten years, monthly). Coefficients discussed before regression specification is spelled out. Simple regression only. Discussion of economic theory, but no distinction between conventional generic micro theory and the assertion of increasing returns in this industry. Not clear on whether slope of total cost function (marginal cost) should be everywhere falling as output increases. A diagram summarizing a sense of the technology in this industry would have helped.


  25. Dependent variable: number of home security alarm systems installed in the greater Los Angeles area per year. Explanatory variables: whether this is a house or a condo, number of floors, family yearly income, ...price of an alarm system...number of additional functions in an alarm system. Problem: dependent variable is annual aggregate, some explanatory variables are for individual households and some are for individual home security options facing any given household. Units of observation MUST conform. Could use as dependent variable a dummy variable for whether or not a given household HAS an alarm system.


  26. Dependent variable: number of network computers (NCs) demanded (no mention during what time period or by whom). Explanatory variables include prices of substitute and complementary systems (good!), but also income (whose??) and tastes (whose??). Proposal suggests surveying users. This would yield individual information. But how many people demand more than one NC? Either RHS variables should be aggregated to the state level (maybe) and NCs could be measured at the state level (for each month?). Then monthly average total state income could be used. Tastes can be proxied by a vector of individual attributes (for an individual) or by state average attributes for an entire state. Units of observation must conform for LHS and RHS variables.


  27. Dependent variable: "Mexican Americans at the university level" (no indication as to year of college or what geographical scope...all universities, all US universities, UCLA?). Explanatory variables include: family economic status, number of parents in household, number of siblings, etc. RHS variables all pertain to individual college-aged Mexican Americans, yet dependent variable is total number at university level. (Sample is presumably drawn from the population of all college aged Mexican Americans. Could convert dependent variable to a dummy variable equal to 1 if the individual is in college, 0 if not.)


  28. "If there is a suspicion that OFFER and HRSTUDY [two explanatory variables in a model] are somehow related (which is very plausible), we need to include an interactive terms in our model." --NO, not necessarily. At issue is whether the contributions of OFFER and HRSTUDY to explaining the dependent variable are distinct and simply additive, or whether the effect of OFFER on the dependent variable depends on the level of HRSTUDY, for example. Also, the expression is "interaction term" not "interactive term."
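The distinction in item 28 can be illustrated with simulated data. In this sketch, x1 and x2 are generic correlated regressors standing in for OFFER and HRSTUDY; the coefficient values, the correlation of about 0.7, and the sample size are all assumptions chosen for illustration. In case A the regressors are correlated but their effects are purely additive, so the estimated interaction coefficient is near zero. In case B the effect of x1 depends on the level of x2, and the interaction coefficient picks that up.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y = X b + e."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(0)
n = 5000

# Two correlated regressors (correlation about 0.7).
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + np.sqrt(1 - 0.7**2) * rng.normal(size=n)
e = rng.normal(size=n)

# Design matrix: intercept, x1, x2, and the interaction term x1 * x2.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])

# Case A: additive effects, even though x1 and x2 are correlated.
y_additive = 1.0 + 2.0 * x1 + 1.0 * x2 + e
# Case B: the effect of x1 on y depends on the level of x2.
y_interact = 1.0 + 2.0 * x1 + 1.0 * x2 + 0.5 * x1 * x2 + e

b_additive = ols(X, y_additive)
b_interact = ols(X, y_interact)

print(b_additive[3])   # close to 0: correlation alone does not call for an interaction
print(b_interact[3])   # close to the assumed value of 0.5
```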


  29. In describing a model that looks for differences in wages between natives and immigrants, first specify a model that pools the data, using dummy variables for status to distinguish the intercept (and slopes) for the two groups. Do not start with the separate specialized models that obtain when the "immigrant" dummy variable is set equal to zero or one.
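A minimal sketch of the pooled specification recommended in item 29, using made-up data (the variable names, coefficient values, and sample size are assumptions). It shows that one pooled regression with an intercept shifter and a slope shifter reproduces the two separate regressions, so nothing is lost by starting with the pooled form.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y = X b + e."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(0)
n = 400

# Hypothetical data: log wage as a function of years of experience.
immigrant = rng.integers(0, 2, size=n)             # 1 = immigrant, 0 = native
exper = rng.uniform(0, 30, size=n)
logw = (2.0 + 0.03 * exper
        - 0.20 * immigrant + 0.01 * immigrant * exper
        + rng.normal(scale=0.3, size=n))

# Pooled model: intercept, slope, intercept shifter, slope shifter.
X_pooled = np.column_stack([np.ones(n), exper, immigrant, immigrant * exper])
b = ols(X_pooled, logw)

def fit_group(mask):
    """Separate regression of log wage on experience for one group."""
    X = np.column_stack([np.ones(mask.sum()), exper[mask]])
    return ols(X, logw[mask])

b_native = fit_group(immigrant == 0)
b_immig = fit_group(immigrant == 1)

print(b[:2], b_native)          # pooled intercept and slope = native regression
print(b[:2] + b[2:], b_immig)   # native values plus shifters = immigrant regression
```

The advantage of the pooled form is that the shifter coefficients are estimated directly, so the hypothesis that the two groups share the same intercept and slope can be tested (jointly, with an F-test).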


  30. The modelling of durations is complicated because many interesting durations (such as duration of a marriage) are not known until divorce occurs or one partner dies. Some durations are "censored," in that all we know is that the duration for a particular couple is at least as great as the length of time they have been married at the time of the survey. Other marriages will have ended in divorce, so we know exactly how long they lasted.


  31. Dependent variable: crime rates in seven areas of Los Angeles (per month? for how many months?). Explanatory variables: ...include dummy variable for the level of organization within each gang, dummies for the involvement of individual gangs in illegal gun and drug distribution... Problem. Dependent variable is for city level, explanatory variables are for individual gangs in that city. RHS variables cannot be more disaggregated than LHS variables (although sometimes if we do not have a sufficiently disaggregated measure for a RHS variable, we proxy with an average for a larger geographic area or longer time period incorporating the one represented by the dependent variable). ALSO: "expect that number of gun shops will have no effect on the crime level." However, now that we know about joint endogeneity of dependent and explanatory variables, it is likely that this variable would be significant in such a regression, if gun shops spring up in response to citizens' demands for protection against existing crime.


  32. "The dependent variable will be a numerical figure of all of the independent variables added together, given the individual applicant's characteristics. ... The higher the figure is for the dependent variable, the higher will be the applicant's chances of acceptance." (HINT: you need a separate measure for the dependent variable, such as ACCEPT=1 if accepted, =0 if not. Regression reveals relationships between Y and the X's. It is not the way to create a Y variable.)


  33. Dependent variable: rate of acceptance to colleges. Independent variables: grade point average, SAT score, and extracurricular activities. Again, a problem with the RHS variables corresponding to individuals, but the LHS variable not matching. The LHS variable applies to a group.


  34. Dependent variable: weight gain during the first quarter of college. Independent variables: number of grams of fat and the number of calories taken in, recorded daily, number of hours of exercise, recorded on a weekly basis.... (The LHS variable corresponds to a time interval of a quarter, whereas the explanatory variables correspond to an interval of a day (fat and calories) and a week (exercise). These RHS variables should be aggregated (or averaged) over the same quarter. RHS variables cannot be more disaggregated than the LHS variable.)
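The aggregation step described in item 34 might look like the sketch below. The data are invented, and the 10-week quarter, the 50 students, and the choice of averages for intake and a total for exercise are assumptions for illustration. The point is only that each RHS variable ends up with one value per student per quarter, matching the LHS variable.

```python
import numpy as np

rng = np.random.default_rng(0)
n_students, n_weeks = 50, 10                  # assumed: a 10-week quarter
n_days = 7 * n_weeks

# Hypothetical raw records, one row per student.
calories_daily = rng.normal(2200, 300, size=(n_students, n_days))
fat_daily = rng.normal(70, 15, size=(n_students, n_days))
exercise_weekly = rng.uniform(0, 6, size=(n_students, n_weeks))

# Collapse every RHS variable to one number per student per quarter,
# matching the LHS variable (weight gain over the quarter).
avg_calories = calories_daily.mean(axis=1)
avg_fat = fat_daily.mean(axis=1)
total_exercise = exercise_weekly.sum(axis=1)

X = np.column_stack([np.ones(n_students), avg_calories, avg_fat, total_exercise])
print(X.shape)   # one row per student, which is the unit of observation
```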


  35. "There will definitely be an omitted variable bias because there are always more variable that can be added that could probably affect test scores." Again, omitted variable bias only occurs if an important explanatory variable that has been left out of the model is correlated with another variable that is included.
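The condition stated in item 35 can be seen in a small simulation. This is a sketch under assumed values (true coefficients of 2 on x and 3 on the omitted variable z, unit variances, 20,000 observations), not a result from any proposal. The variable z always matters for y, but leaving it out biases the slope on x only when z is correlated with x.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y = X b + e."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(0)
n = 20000

def slope_when_z_omitted(corr_xz):
    """Estimate the slope on x from a regression that leaves out z."""
    x = rng.normal(size=n)
    # z has unit variance and correlation corr_xz with x.
    z = corr_xz * x + np.sqrt(1 - corr_xz**2) * rng.normal(size=n)
    y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)   # z truly belongs in the model
    X_short = np.column_stack([np.ones(n), x])         # z is omitted
    return ols(X_short, y)[1]

b_uncorrelated = slope_when_z_omitted(0.0)
b_correlated = slope_when_z_omitted(0.5)

print(b_uncorrelated)   # close to the true value 2: no bias
print(b_correlated)     # close to 2 + 3 * 0.5 = 3.5: biased upward
```

In the uncorrelated case the omission still costs precision (the error variance is larger), but the slope estimate is not biased.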


  36. "Regression analysis can be performed in each chosen country to find out the relationships, if any, between the dependable [!] variable...and the following explanatory variables..." [Pool the data across countries and use dummy intercept shifters and dummy slope shifters to distinguish between the regressions for each country. The differences in regressions across countries can then be tested.]


  37. "Age would certainly increase the risk of the developing lung cancer. It is in fact a more dominant factor than cigarette smoking as conclude by other research. However, since this research purpose in exam the effect of cigarette. We will leave this factor out by taking people of the same age group." [NO. Can easily control for age by including an age variable in the regression. That is the whole purpose of multiple regression. If you use only one age group, you can only describe the relationship among the variables for that one age group.]


  38. Dependent variable: Bicycle accidents. Explanatory variables: bike lanes, cars, bikes,.... "The intercept b1 is the number of accidents when all variables are equal to zero, which means the number of automobile-related accidents that occur without the effects of each of the variables." [Should be careful to point out that it has no real meaning in this case because it is unlikely that a community will have no cars (in particular).]


  39. Grade inflation model. "The high schools should have approximately equal student populations. There might be an effect on evaluation which might depend on the number of students in a classroom. This should be avoided in our model. It is also important that the schools have a similar academic curriculum and offer about the same number of honors and advanced placement classes." [Why not extend your sample to a wide array of schools and specifically control for systematic variations in these factors by including them in your regression model? Rather than just allowing the GPA as a function of SAT scores to vary between two high schools, we could then see whether the relationship has a different slope or different intercept according to a wide range of measurable characteristics of schools. For example, is there more grade inflation in high-income neighborhoods where parents have high expectations for the college prospects of their students?]


  40. "The primary statistical information I would explore would be to test the ordinary least square regression of all the variables on sprodi." "Testing the hypothesis of what will happen if you set b2, b3,b6, and b7 to zero and if you decrease teach and books by one additional unit, this will; bring about a negative effect on sprodi. You would also be able to come to a conclusion on the opposite, a positive effect of these two variables if there is an increase."


  41. "Similarly I would use a T-test to test the other explanatory variables and dummy variable by setting their coefficients, B2=B3=B4=B5=0, all equal to zero." [Actually, this sounds like a job for an F-test.] "i would be very careful of multicollinearity in my econometric model since if any explanatory variables is a linear combination of the other variables that would mean that multicollinearity exists and my hypothesis testing could be erroneous." [Not necessarily. Multicollinearity may make it harder to reject zero hypotheses because of the inability to distinguish the separate contributions of sets of variables. Their standard errors are large, not necessarily wrong.]
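The joint test suggested in item 41 can be sketched by comparing restricted and unrestricted sums of squared residuals. The data below are simulated, and the true coefficient values, the sample size of 200, and the column layout (column 0 is the intercept, column j holds the variable with coefficient Bj) are assumptions made for illustration.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared residuals from an OLS fit of y on X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    return float(resid @ resid)

rng = np.random.default_rng(1)
n = 200

# Hypothetical data: an intercept plus five regressors (B0, B1, ..., B5).
X = np.column_stack([np.ones(n), rng.normal(size=(n, 5))])
beta = np.array([1.0, 0.5, 0.5, 0.5, 0.5, 0.5])     # assumed true values
y = X @ beta + rng.normal(size=n)

# Restricted model imposes B2 = B3 = B4 = B5 = 0 by dropping those columns.
X_restricted = X[:, :2]
q = X.shape[1] - X_restricted.shape[1]     # number of restrictions (4)
df = n - X.shape[1]                        # residual degrees of freedom (194)

ssr_u = ssr(X, y)
ssr_r = ssr(X_restricted, y)
F = ((ssr_r - ssr_u) / q) / (ssr_u / df)
print(q, df, F)
```

The statistic is compared with the critical value of the F distribution with q and df degrees of freedom (about 2.4 at the 5 percent level here). Four separate t-tests would not answer the joint question.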


  42. Performing an ordinary least square on the data will give us b1 and b2. The correlation between usage and income is expected to be a high positive number while that between usage and rate is expected to be a low negative number. [The author seems to be referring to coefficients, not correlations. You can make a coefficient as big as you want by changing the units of measurement (for example, measuring the dependent variable in smaller units or the explanatory variable in larger units). About all that is relevant is expectations about sign of coefficients.]
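The point about units in item 42 can be checked directly. In the sketch below the data are invented (usage in minutes, income between $20,000 and $100,000, a per-minute rate, and the coefficient values are all assumptions). Measuring income in thousands of dollars instead of dollars multiplies its coefficient by 1000, while the sign and the fitted values are unchanged.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for y = X b + e."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: monthly usage (minutes), household income, and call rate.
income_dollars = rng.uniform(20_000, 100_000, size=n)
rate = rng.uniform(0.10, 0.50, size=n)               # dollars per minute
usage = 50 + 0.002 * income_dollars - 100 * rate + rng.normal(scale=20, size=n)

income_thousands = income_dollars / 1000.0            # same information, larger unit

X_dollars = np.column_stack([np.ones(n), income_dollars, rate])
X_thousands = np.column_stack([np.ones(n), income_thousands, rate])
b_dollars = ols(X_dollars, usage)
b_thousands = ols(X_thousands, usage)

# The income coefficient is 1000 times larger in the second regression;
# the signs and the fitted values are the same in both.
print(b_dollars[1], b_thousands[1])
print(np.sign(b_dollars[1:]), np.sign(b_thousands[1:]))
```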


  43. Dependent variable: long-distance telephone usage. Explanatory variables: call rates and income. Says companies "...may encourage long distance usage by higher income household and the middle income households only. The model predicts that low-income households will have generally low usage despite the average rates." [There is an unexploited opportunity here to see if the price responsiveness of demand varies with income level. This can be done by interacting income and price and including this in the model as well.]


  44. Dependent variable: average income of a person over first five years of being employed after graduation. "This particular model will concentrate only on using the graduating GPA of a student and the specification of the university as the explanatory variables, since it will estimate possible significance of the choice between schools." [Choice of university can be correlated with many other variables that will also influence job placement and income post-college. Think about "old-boy" networks. Also, which university you attend is not always a choice, since options are constrained by where you are successful in gaining admission. This paper uses dummies for UCLA, USC, and CSULA (where one should have been left out, since all observations come from one of these three), with a plan to see whether choice of college affects income. GPA prior to college might have been helpful, in addition to college GPA. Suppose a student is not qualified for admission to UCLA, and also not qualified for a good job after graduation (due to glaring deficiencies in math or English composition skills, for example). The proposed model will make it look like a randomly selected student, assigned arbitrarily to CSULA, will do much worse on the job market than they actually would. Weaker students go to CS schools and weaker students get poorer jobs. College attendance is not randomly assigned across students. We now know that it is an endogenous variable.]


  45. If you need a set of dummy variables to capture different categories of observations, and choose to keep the intercept in the model and use m-1 dummy variables, it is imperative that there be SOME observations in your sample in the omitted category.
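Item 45 can be illustrated with a tiny made-up sample (the three categories and eight observations are assumptions). When no observation falls in the omitted category, the m-1 included dummies add up to one in every row, which reproduces the intercept column and leaves the model perfectly collinear.

```python
import numpy as np

# Hypothetical sample: three possible categories (0, 1, 2), but no observation
# falls in category 0, the one chosen as the omitted (reference) category.
category = np.array([1, 2, 1, 1, 2, 2, 1, 2])

intercept = np.ones(category.size)
d1 = (category == 1).astype(float)
d2 = (category == 2).astype(float)

X = np.column_stack([intercept, d1, d2])    # intercept + (m - 1) = 2 dummies

# Every row has d1 + d2 = 1, so the dummies reproduce the intercept column.
rank = np.linalg.matrix_rank(X)
print(X.shape[1], rank)                     # 3 columns, but rank 2
```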




Update date: January 15, 1997
Prepared by: Trudy Ann Cameron