UNIVERSITY OF CALIFORNIA, LOS ANGELES
Department of Economics

Economics 143 (Cameron) - Applied Regression Analysis

1998 (Fall) Proposal Bloopers and Problems

The feature of this year’s proposals that overwhelmed all other problems, on average, was the quality of the exposition of ideas. There were some very good ideas, but it sometimes took some real patience and persistence to overcome the expositional difficulties of the writers in order to appreciate these ideas.

Given that I am incapable of expressing myself in writing in any language other than English (which my parents spoke), I can be very sympathetic to the challenges presented by a requirement to write clearly and accurately in some language other than your native language. However, for a large proportion of you, your employability in an English-speaking firm will be seriously compromised if you cannot hone your expositional skills to a higher level. Research proposals represent the "first impression" that employers, supervisors, funding agencies, etc., receive regarding the package of skills that your expertise represents. Clear standard English is indispensable. Please continue to work on English composition skills, no matter how tangential they may seem at this point.

To the extent that I could, I marked up expositional problems in this year's proposals as thoroughly as possible, making alternative suggestions for how ideas could be expressed. In some cases, however, there was more to do than I had time to deal with. In no case did expositional problems compromise your grade on the proposal, unless it was not possible for me to figure out what you were saying. In a few years, however, exposition will begin to "count." Please keep working on the task of expressing ideas in writing. The real world is not all "short answer questions."

Here is a partial inventory of major and minor problems I encountered in this year's proposals. Some papers supplied more than one quote; others had no errors significant enough to warrant inclusion in this inventory.

  1. [In a model to explain demand for a new model of motorcycle,] the major factors that will be tested are the average income of a given area and the average top level income (the upper quartile). Other main facors are meteorological: days of sunshine, average rainfall, and average temperature. Subsequent factors such as whether or not the area has a helmet law, local sales tax in percent, the aerage price of gasoline, and number of riders in the given area, and the number of days to delivery after purchase will be included to improve accuracy. … In a preliminary evaluation, it was indicated that number of days of sunshine and the average and top income may have a positive, or increasing effect on the quantity demanded of the R6. While factors such as average rainfall, temperature, and whether the area has a helmet law are all potentially negative. The other subsequent factors were inconclusive.

    Be sure to specify which area the observations will cover (county? state?) and what time period (year?). Remember that most legislation (like helmet laws) must be viewed as potentially endogenous. The number of potential riders in each jurisdiction is probably more relevant than the number of actual riders, unless the company is only interested in people who might "trade up." If there are shortages, there may be unsatisfied demand for motorcycles of particular varieties. If preliminary analysis was actually done, cite sample sizes, actual variables used, and regression coefficients and standard errors. If it is just intuition, make this clear as well. AND it is not the factors which are "negative" or "positive," it is their coefficients. Furthermore, the observation that a variable has a "positive" effect is different from the observation that it has an "increasing" effect. The latter requires non-constant derivatives, such as might be found in a log-linear specification, or certain quadratic models. Whenever you model demand for anything, remember the basics from Econ 1. Demand depends on own price, income, prices of substitute and complementary goods, and tastes. The author noted gasoline prices, but failed to mention the prices of other substitute modes of transportation, from other motorcycle models (new and used) to cars, to bus fare, etc.
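
    To make the "positive" versus "increasing" distinction concrete, compare a linear and a quadratic specification. The symbols here (Q for quantity demanded, S for days of sunshine) are illustrative assumptions and do not come from the proposal.

```latex
\begin{align*}
\text{Linear:}\quad E[Q \mid S] &= \beta_0 + \beta_1 S, &
  \frac{\partial E[Q \mid S]}{\partial S} &= \beta_1 \\
\text{Quadratic:}\quad E[Q \mid S] &= \beta_0 + \beta_1 S + \beta_2 S^2, &
  \frac{\partial E[Q \mid S]}{\partial S} &= \beta_1 + 2\beta_2 S
\end{align*}
```

    With a positive slope coefficient, the linear effect is positive but constant. An "increasing" effect needs the derivative itself to rise with S, which in the quadratic model requires a positive coefficient on the squared term.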



  2. "In this general regression, the constant term B0 is anticipated to be positive but near zero."

    This is hard to argue, since for no aggregate data set will it ever be the case that an entire county or state will display zero income (one of the variables in the model). Thus, the data will never span the situation where all explanatory variables are simultaneously zero, so the intercept is merely an artifact of fitting the plane, and will have no intuitive content.



  3. For a random sample of books from different bookstores in Los Angeles county, one author proposes regressing the absolute price of an academic textbook on the number of pages, number of copies published, cost of the copyright, years since publication, number of books written by the author, number of years the author has worked in this field, whether the book is in color, whether it has a soft cover, and whether a CD-ROM is included.

    The first thing to think about is whether the equilibrium price of these books is determined by the free interaction of supply and demand in a market characterized by perfect competition among buyers and sellers. I suspect buyers are price-takers, but the mere existence of copyright suggests monopoly power over the publication of each specific text. This means that price will depart from marginal costs. The disparity will reflect the degree of market power for that text. Perhaps the number of other titles on the same or similar topics would be a relevant regressor (number of competing texts by other publishers). I also find it difficult to imagine that the cost of the copyright is something that would be known in advance of an attempt to market that copyright, and the resulting price would depend upon the dependent variable in this regression (endogenous).



  4. I do not include the cost of production of a book since it depends on many other factors and on how the publisher allocates the manufacturing and administrative costs. I may assume that it is a constant and it may be interpreted by the intercept.

    The last sentence is strictly incorrect. The intercept is the expected price of a book with zero pages, zero copies published, zero copyright cost, zero years published, zero books by same author, zero years of work in the field by the author, etc. This is a nonsense scenario, so your intercept has no intuitive interpretation. It is just an artifact of the functional form you fit through the actual data.



  5. One author proposes to "assess the validity of proposed risk factors as statistically significant to the likelihood of alcoholism in females." The method will involve distribution "of sample surveys to selected female alcoholics (population)." "The sample..will be diagnosed female alcoholics as sampled from two self-help organizations – Alcoholics Anonymous and Women for Sobriety. The observation in this proposal is a female alcoholic who is a newcomer (within the past week) to AA and/or WFS."

    Problem: if your sample consists only of alcoholics, it will be difficult to predict what factors contribute to the development of alcoholism in females. You would prefer a sample that contained just a random selection of females, and the dependent variable could be whether or not they have problem drinking patterns, say, by age 40. Or, you could use the current proposed variable (number of times in the past year the observed individual was intoxicated). Keep in mind, however, that drinking behavior is not the same thing as alcoholism. Perhaps the AA’s criteria for alcoholism could be employed.



  6. "I expect to see a strong positive relationship—with the addtion of the dummy variable in which a spouse is also an alcoholic, the chances of the observed female being an alcoholic is heightened." The proposal also has dummy variables for whether the respondent is a widow, etc.

    All of the females in the sample are "diagnosed alcoholics" so it won’t matter if the spouse is an alcoholic or not. In any event, the specification recommends taking logs of all regressors. Remember that it is inadvisable to try to take the log of a dummy variable: the log of zero is negative infinity. A dummy just recognizes two possible levels for a categorical effect. Coding it 0,1 is fine whether the model is in logs or in levels; the coefficients will simply adjust to take up the difference in the dependent variable when it is either in levels or in logs. (How about having an alcoholic parent??? That is widely suspected to be correlated with alcoholism, and you cannot choose your parents, so it is an exogenous variable.)
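
    A minimal Python sketch of the point about logging dummies. The variable names (age, spouse_dummy) and the helper log_or_level are hypothetical and not taken from the proposal; the idea is simply to log the continuous regressors and pass 0/1 dummies through untouched.

```python
import math

def log_or_level(value, is_dummy):
    """Return log(value) for a continuous regressor; pass a 0/1 dummy through unchanged."""
    if is_dummy:
        return value
    return math.log(value)

# Hypothetical respondent: age (continuous) and a spouse-is-alcoholic dummy.
age, spouse_dummy = 35.0, 0

# Logging the dummy fails as soon as it takes the value zero.
try:
    math.log(spouse_dummy)
    log_failed = False
except ValueError:
    log_failed = True

# The regressor row that would actually enter a log specification.
row = [log_or_level(age, False), log_or_level(spouse_dummy, True)]
print(log_failed, row)
```

    In Python the failure shows up as a ValueError; other packages may report a missing value or negative infinity instead, but the regression cannot use the result in any of these cases.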



  7. One proposal suggested regressing current period GDP on a number of "leading indicators" for GDP. "The predictions of this model would therefore be able to influence government policy makers as to which options are best-suited for a stimulation of the GDP. Predictions could also aid major corporations who are concerned with how next quarter’s GDP will change."

    The $64,000 question is whether leading indicators are exogenous variables that can be independently manipulated by policy, or whether all the variables in such a model are jointly determined by the same underlying process. Regularities among certain variables can be exploited to good effect by many businesses (for example, those who must plan the sizes of production runs based on the strength of demands for their products, when demand depends upon income levels). However, it is harder to imagine that the government can intervene to manipulate GDP quite as mechanically as this seems to suggest. We now know that endogeneity can really complicate the interpretation of regression coefficients as ceteris paribus effects to be enjoyed by policy makers.



  8. One author proposes the following: "Based on the data of male,fmale salary differentials, we decompose this total figure into a portion in which based on the accepted standard in the sense that it reflect differences in qualifications between men and women and an unexplained portion. The unexplained residual represents the salary difference remaining after we control for all available salary determinants and therefore, constitutes our estimate of discrimination."

    This strategy apparently relies on data aggregated to the level of the individual academic department. It is almost always preferable to work with the most disaggregated data that are available, so that heterogeneity across individuals can be preserved and measured to the fullest extent possible. If the dependent variable is the "salary difference" between males and females in a given department, then the only explanatory variables that can be used, pretty much, are variables that describe the department (which is the unit of observation). It would be preferable to have the unit of observation be the individual faculty member and to use their individual salary as the dependent variable. All sorts of factors that might influence individuals’ salaries (number of publications, placement of publications, number of citations in the literature, number of conference presentations per year, number of seminar invitations per year, etc.) could be included to control for systematic differences in salaries due to these "prestige" and "influence" factors. There could also be a dummy variable for FEMALE and perhaps a set of slope-shifting dummy variables using interaction terms between FEMALE and these other factors. A test of whether the coefficient on FEMALE is zero, or a joint test of whether the coefficients on all terms involving FEMALE are zero, would be a standard sort of test for the existence of a salary differential not explained by other things. This still does not prove discrimination, as there may still be other important omitted variables that are correlated with gender (such as time out of the labor force, or at reduced productivity, for child-bearing/rearing).
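
    One way the individual-level specification might be written, with a single "prestige" control for brevity. FEMALE is the dummy named above; SALARY and PUBS (number of publications) are shorthand labels assumed here for illustration.

```latex
\begin{align*}
\text{SALARY}_i &= \beta_0 + \beta_1\,\text{PUBS}_i + \beta_2\,\text{FEMALE}_i
  + \beta_3\,(\text{FEMALE}_i \times \text{PUBS}_i) + \varepsilon_i \\
H_0 &: \beta_2 = 0 \ \text{and}\ \beta_3 = 0
\end{align*}
```

    Here the coefficient on FEMALE shifts the intercept and the coefficient on the interaction term shifts the slope on publications. Rejecting the joint null says only that a differential remains after the included controls.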



  9. "One of the dummy variables we can construct is sex dummy variables and the coefficient of this dummy variable can indicates the male faculty members received an annual salary in the case of males equal to one. There are possibilities that the coefficient may capture the effects of omitted variables. If, for examples, male faculty members are generally multicollinearity with female faculty members, as a result the sex coefficient may be biased."

    A degree of confusion is evidenced here. In the first part of the statement, individual-specific dummy variables are being proposed, yet the unit of observation has earlier been identified as the department. If the data are at the department level, it is probably appropriate to use the proportions of male and female faculty. (Only one proportion would be used, since the other would result in perfect multicollinearity with the intercept term. The fractions should sum to one.) With individual data, the person-specific dummy variable would be appropriate, but then there can be no multicollinearity, since you cannot belong to both genders at the same time (or to neither). It is correct that only one dummy can be used if there is an intercept term in the model, since otherwise the sum of the male and the female dummies would equal the intercept term. This would not result in coefficient "bias." It would result in a total failure of the OLS algorithm.
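
    The "total failure" can be seen directly in a small sketch. The five-person sample below is made up for illustration, and the helper names are hypothetical. With an intercept, a MALE dummy, and a FEMALE dummy all included, the cross-product matrix X'X has a zero determinant and cannot be inverted; dropping one dummy restores a unique solution.

```python
def gram_matrix(rows):
    """Return X'X for a list of observation rows."""
    k = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

# Hypothetical sample of five faculty members: 1 = female, 0 = male.
female = [1, 0, 0, 1, 0]

# Columns: intercept, MALE, FEMALE. MALE + FEMALE reproduces the intercept column.
trap_rows = [[1, 1 - f, f] for f in female]
det_trap = det3(gram_matrix(trap_rows))

# Columns: intercept, FEMALE only.
ok = gram_matrix([[1, f] for f in female])
det_ok = ok[0][0] * ok[1][1] - ok[0][1] * ok[1][0]

print(det_trap, det_ok)  # 0 and 6
```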



  10. "In this case, we would need to look at the adjusted R-squared to determine the multicollinearity."

    R-squared values for "auxiliary regressions" are appropriate, not "adjusted R-squared" values, which are something else entirely.



  11. "Null hypothesis" is not the same thing as "Zero hypothesis." Sometimes, the null hypothesis may consist of the assertion that a particular underlying population parameter takes on a value of zero (a "zero" hypothesis about a parameter). But the null hypothesis could equally well be an hypothesis that the underlying population parameter takes on some specific non-zero value.



  12. "These coefficients are the key of this regression to determine not only how important each variable is, but also how our growth will be affected when we increase a unit of one these variables. Big variables such as income savings, unemployment, and inflation will have higher coefficients than the rest of variables in all cases since they have an enormous effect on growth. "

    Remember that the magnitude of a coefficient is entirely dependent upon the units in which you choose to measure the variable with which it is associated. Remember the example where quantity was measured in dozens, rather than units, and the cost was measured in "dollars in excess of 100" rather than straight dollars.
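
    A small sketch of the units point. The numbers are invented for illustration and are not the data from the class example; simple_ols is a hypothetical helper for a one-regressor fit.

```python
def simple_ols(x, y):
    """Return (intercept, slope) from a one-regressor least-squares fit."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Made-up data: quantity in units, cost in dollars (exactly cost = 100 + 2.5 * units).
units = [12, 24, 36, 48, 60]
cost = [130, 160, 190, 220, 250]
a_units, b_units = simple_ols(units, cost)

# Same observations, with quantity in dozens and cost in dollars in excess of 100.
dozens = [u / 12 for u in units]
excess = [c - 100 for c in cost]
a_doz, b_doz = simple_ols(dozens, excess)

print(a_units, b_units)  # 100.0 2.5
print(a_doz, b_doz)      # 0.0 30.0
```

    The slope is twelve times larger in the second fit and the intercept has moved from 100 to zero, yet both fits describe exactly the same relationship.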



  13. "In this paper, I decided to do a cross-sectional analysis on data collected from all casinos (observation=casino) in Las Vegas for the same month."

    It might be very beneficial to collect monthly data for a panel of casinos, over a span of several years. So-called "panel" data can be a very useful source of information, because a much larger data set can typically result. Furthermore, since there are multiple time-series observations for each cross-sectional entity, techniques related to the idea of using a dummy for each cross-sectional entity (casino) can be used to net out any unmeasured attributes for each casino that can systematically affect the dependent variable.
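
    A sketch of how the casino-specific dummies net out unmeasured attributes. The three casinos and their data are invented, with no noise, so that the arithmetic is easy to follow. Subtracting each casino's own mean from its observations gives the same slope as including one dummy per casino.

```python
def slope(x, y):
    """Least-squares slope from regressing y on x with an intercept."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
            / sum((a - xbar) ** 2 for a in x))

def demean_by_group(values, groups):
    """Subtract each group's own mean from its observations."""
    totals, counts = {}, {}
    for v, g in zip(values, groups):
        totals[g] = totals.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]

# Hypothetical panel: three casinos observed in three months each.
casino = ["A"] * 3 + ["B"] * 3 + ["C"] * 3
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Unmeasured casino attribute, correlated with x by construction; true slope is 2.
effect = {"A": 10, "B": 20, "C": 30}
y = [effect[c] + 2 * xi for c, xi in zip(casino, x)]

pooled_slope = slope(x, y)
within_slope = slope(demean_by_group(x, casino), demean_by_group(y, casino))
print(pooled_slope, within_slope)  # 5.0 2.0
```

    The pooled regression attributes the casino-level differences to x and reports a slope of 5; the within-casino fit recovers the slope of 2 that was built into the data.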



  14. In double-spaced text, it is critical either to indent new paragraphs, or to triple- or quadruple-space before a new paragraph, so that readers know where one idea ends and another begins. Also, don't forget to use paragraphs. It is surprising how many people offer paragraphs that go on for more than a page. Give the reader a break. Offer digestible chunks.


  15. "Assuming the data can be gathered, I would regress the seven explanatory variables on TV. Simultaneously, auxiliary regressions will be employed to detect multicollinearity. The resulting R-squared score needs to be evaluated against typical R-squared values of similar behavior studies."

    First of all, the implicit "direction" of regression needs to be clear. It is backwards here. The dependent variable is TV, which will be "regressed on" the seven explanatory variables. Checking for multicollinearity is important, but there is some confusion about "the resulting R-squared score." We are concerned about the individual R-squared values for the auxiliary regressions which are used to identify the sources of multicollinearity. A high R-squared in an auxiliary regression among the regressors for the main model signals multicollinearity problems. However, the discussion above, about evaluating these against typical R-squared values, is incorrect.
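
    A sketch of what the auxiliary regressions involve: each regressor is regressed on all the others, and its R-squared is inspected. The three regressors x1, x2, x3 below are invented, with x3 built to be nearly the sum of the other two; ols_r_squared is a hypothetical helper written in plain Python rather than any particular package's routine.

```python
def ols_r_squared(X, y):
    """R-squared from regressing y on the columns of X plus an intercept."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [u - factor * v for u, v in zip(a[r], a[col])]
    b = [a[i][k] / a[i][i] for i in range(k)]
    fitted = [sum(bi * ri for bi, ri in zip(b, r)) for r in rows]
    ybar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Made-up regressors: x3 is almost exactly x1 + x2.
columns = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 1, 4, 3, 6, 5],
    "x3": [3.1, 2.9, 7.0, 7.1, 10.9, 11.0],
}

# One auxiliary regression per regressor: that regressor on all the others.
aux_r2 = {}
for name in columns:
    others = [columns[o] for o in columns if o != name]
    aux_r2[name] = ols_r_squared(list(zip(*others)), columns[name])

print(aux_r2)
```

    Every auxiliary R-squared here is very close to one, which is the warning sign. None of these numbers is compared with the R-squared of the main regression or with values from other studies.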



  16. "The study does assert that subscribers to America Online watch 15 percent [less?] television than average viewers."

    Now we know that this could be an endogenous variable, since the same individuals that are making decisions to watch television are also deciding whether to subscribe to America Online. Perhaps there is something about these (self-selected) individuals that means they would be watching less television even if there was no opportunity to subscribe to AOL. It is hard, without a lot more statistical work, to attribute the differences in television watching time exclusively to AOL. It is possibly due to AOL, but the effect has not been proven. Certainly there is correlation; it is just causality that is questionable.



  17. "Once model is calibrated using the actual data, it could be used to predict US Stock Market performance base on current/past economic conditions outside of the United States. The model would show the positive relationship between the explanatory variables and the dependent variable. After adjustment of possible existence of collinearity between the explanatory variables, the econometric model could be used to predict future performance of the US economy. Possible "policy" implications may be formulation of specific investment strategies according to the current economic status of the world."

    First, one cannot really assert the signs of coefficients in a model that has not yet been estimated. It is fine to offer hypotheses about the likely sign, but we do not know yet that the model "would show the positive relationship…". No mention is made about just how the researcher would "adjust" for collinearity. The assertion sounds very naïve; it would be best not to mention it at all, just to state that the possibility of multicollinearity will be explored. Specific investment strategies are not really an output of the model, although an ability to predict broad trends in markets as a function of external conditions might inform such investment strategies. Never overstate what your model will be able to do. What it CAN do is show what to expect about the Y variable for a unit change in each X variable, providing nothing else changes, AND providing all of the necessary conditions for OLS to be valid have been met in either the raw data or in some transformation of the data that is used for estimation.



  18. "If the model is a success we will be able to see what the effects of amount of time spent in the library have on GPA. Students can use this to help them judge how much time they want to spend studying in a library and out of the library."

    As we found in the STUDY.SHA example, back near the beginning of the course, choices about time spent studying are likely to be endogenous. Nobody does the experiment of randomly assigning different enforced study times to different students to see what a controlled experiment would show about the effects of study time on performance. Having studied the problem of endogeneity bias now in the lectures, we are better equipped to worry about the consequences of drawing inferences about the effects of changes in endogenous variables. Proceed with caution.



  19. "This research will determine the affect of consumption of cigarettes in underage smoking on an increase in prices in packs of cigarettes. This can then be extended to further statistical analyses of the affects on illegal activity and violence in society."

    First, be sure you can use "affect" and "effect" correctly. Both of these words are used widely in empirical economics. Using them incorrectly is sure to aggravate many readers. Also, the direction of the cause/effect relationship is backwards in this statement. The author is trying to find the "effect on" consumption of cigarettes in underage smoking "of" an increase in prices of packs of cigarettes. The proposal does not go into any model that would shed light on the effects "on illegal activity and violence in society." This seems to be conjecture. However, it would be possible to attempt to model the levels of these outcomes as a function of cigarette prices. Maybe there is a data trail that would support the conjecture; it is simply not discussed in this proposal.



  20. "To explore this model, I would first create a scatter plot and see if there is a general curve or shape that would best fit the data. I would then test a linear relationship between the dependant and explanatory variables. I would expect all explanatory variables, but value of imports, to have a negative value with the amount of debt accumulated."

    In a multivariate model, scatter plots cannot be relied upon to reveal the true shapes of conditional bivariate relationships among pairs of variables. Omitted variables will bias the apparent relationships, perhaps obscuring them completely. A scatterplot is the geometric analog to a simple regression that fails to control for important covariates. [Remember it is "dependent" variable, not "dependant" variable.] Remember that "variables" do not have negative values as a result of regressions; their "coefficients" may have negative values, however.



  21. "While gathering the data, we would require about 10 random samples from 10 different randomly selected colleges/universities all across the United States. Each sample would consist of about 1000 college students who play video games."

    No real justification is given for why it is necessary to collect samples from 10 different schools. This sort of decision needs to be justified. More important, restricting the sample just to people who play video games means that the variable for the "number of hours spent playing video games" can never take on a value of zero. It would be interesting to have a bunch of genuine zeros in the population from which you are sampling so that it makes sense to simulate, ex post, the expected GPA for somebody who does not play video games. As it is, you get only people who currently do play video games, so simulating zero hours for everybody would be an "out-of-sample" prediction. There may be a big jump between zero hours and positive hours, so we cannot predict what would actually happen.



  22. "We might also want to plot dependent variable GPA and the explanatory variables video and study individually to check for multicollinearity and confirm that each actually varies across the observation in the data set."

    A simple STAT command will reveal whether there are any "constants" among the variables in the data set. If the standard deviation and variance of a variable are both zero, you know the values are constant across the sample. If there are any such constants, they will not be absorbed by the intercept term unless you leave them out of the list of regressors. If they stay in, the algorithm will fail.
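
    The same check can be sketched outside SHAZAM. The Python fragment below is only a stand-in for the STAT output; the variable names and values are invented, with PLAYER equal to one for everybody as it would be in a sample restricted to video game players.

```python
import statistics

def constant_columns(data):
    """Return the names of variables whose values never vary across the sample."""
    return [name for name, values in data.items()
            if statistics.pstdev(values) == 0]

# Hypothetical data set of four students.
sample = {
    "GPA": [3.1, 2.8, 3.6, 3.3],
    "VIDEO": [5, 12, 0, 7],
    "PLAYER": [1, 1, 1, 1],
}

flagged = constant_columns(sample)
print(flagged)  # ['PLAYER']
```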



  23. "Video game players would probably play up to as much video games they can play while keeping an optimum GPA."

    This seems to scream for a quadratic specification that might let the researcher infer, on average, how many hours of video game playing maximizes GPA. If GPA is first increasing, then decreasing, in video game hours, then there will be some maximum GPA corresponding to the "optimal" number of hours of game playing. Remember, however, that video game playing is an endogenous decision, by the same individuals who bring you the GPA scores that represent the dependent variable in this model. Endogeneity bias may be present.
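
    The quadratic specification and the implied optimum can be written out as follows. V (weekly hours of video game playing) is a label assumed here for illustration, and other regressors are suppressed.

```latex
\begin{align*}
E[\text{GPA} \mid V] &= \beta_0 + \beta_1 V + \beta_2 V^2 \\
\frac{\partial E[\text{GPA} \mid V]}{\partial V} &= \beta_1 + 2\beta_2 V = 0
  \quad\Longrightarrow\quad V^{*} = -\frac{\beta_1}{2\beta_2}
\end{align*}
```

    This turning point is a maximum only if the coefficient on the squared term is negative, and it falls at a positive number of hours only if the coefficient on the linear term is positive. Both signs need to be checked in the estimates before the optimum is interpreted.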



  24. "It is important to get the dependent variables as similar as possible; a comparison of Microsoft Headquarters to the local Circuit City would still not be a good fit. In collecting observations for the dependent variable, the size of the company is ultimately the key, such as a comparison between UCLA and Cedars Sinai Hospital."

    Here, the researcher missed out on an opportunity to specifically control for big differences across business entities by including these sorts of factors as regressors. By using multiple regression, we effectively "control for" other sources of variation in the Y variable that an experimental researcher might "control for" by ensuring that observations were the same along these dimensions. Rather than limiting the sample to "similar companies," use all kinds of companies and control for their major differences in size and financial strength by including measures of these variables among the regressors for each firm.



  25. "Knowledge we learn in college is highly specialized in different fields. The job market demand for different people with different skill varies very much, so we cannot test all students together."

    With a large enough sample, you can accommodate a whole raft of explanatory variables, including dummy variables for different fields. Job market demands may vary systematically by type of field. Also in this proposal, SALARY might be profitably modeled as quadratic in the different modes of time allocation for each student (future worker). This is done later in the proposal. But remember that since time in different activities must sum to 24, you will have to leave out one category (miscellaneous? Or sleep?) or there will be perfect multicollinearity with the intercept term.



  26. "Each observation would be from pre-defined categories such as Office Suite components, browsers, email clients, and financial programs. These categories are most relevant to the users who have the least amount of information available to them, and are the bulk of where consumer dollars are being spent."

    Are these categories completely separable? Or will these categories be dummy variables that will also influence the dependent variable, number of program bugs? Reminder again: never try to log a dummy variable, since the log operation will fail when you try to take the log of zero. Interaction terms might also be very useful in this specification.



  27. "My dependent variable is average cost of living (COL) in thousands of dollars. I am choosing this variable because no matter where the income comes from, there is an established basis of how a family, no matter how many parents and depending on a certain amount of kids, are able to sustain a decent lifestyle."

    Cost of living is then proposed to be regressed on average county check amount, money paid for child care, food stamps, MediCal, housing costs, personal (money spent for clothing, entertainment, etc.) and income earned from working. The big problem here is that the left-hand side is something that is defined by government dictate, not by individual choice. It is not a behavioral variable to be explained by exogenously given factors that make up a list of explanatory variables. The individual’s income and expenditures are all lumped together on the right-hand side. If they are living in a subsistence fashion, the sum of all the terms on the right (without any coefficients) will be close to zero, which cannot be expected to bear much relationship to some officially sanctioned cost of living. There is simply no relationship between the left-hand side and the right-hand side variables.



  28. "The intercept in the model should be the average starting salary for all UCLA graduates for all majors (if all other coefficients are zero)."

    NO. If all the regressors are simultaneously zero! Regressors in this case include IQ and family income, and since few UCLA students will have an IQ of zero, the intercept is probably not meaningful in this context.



  29. "This is why our model only distinguishes between science and non-science majors, although it could be made more accurate at the cost of additional complexity."

    All that precludes greater complexity is sample size; with enough observations, you can accommodate a large number of regressors.



  30. One proposal suggests that we model GPA as a function of AOD: the amount of drinking (binge drinking in times per week), PAR: prior academic performance (SAT scores), and AOS: amount of studying (weekly hours spent studying). The researcher suggests first running a model with GPA as a function solely of AOD. The slope is argued to give the effect that an additional binge drinking day will have on the student’s GPA.

    This is true only if there are no important omitted variables correlated with weekly episodes of binge drinking. There may be omitted variables bias. Binge drinking is a behavioral decision by the same individuals who are making the choices that lead to their GPA. It may be highly endogenous.
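
    The omitted variables problem can be made explicit with the proposal's own variable names. The hats denote least-squares estimates, and the third line is an exact in-sample identity linking the short regression (GPA on AOD alone) to the long one that also includes AOS.

```latex
\begin{align*}
\text{Long regression:}\quad \text{GPA}_i &= \hat\beta_0 + \hat\beta_1\,\text{AOD}_i
  + \hat\beta_2\,\text{AOS}_i + \hat\varepsilon_i \\
\text{Auxiliary regression:}\quad \text{AOS}_i &= \hat\delta_0 + \hat\delta_1\,\text{AOD}_i + \hat\nu_i \\
\text{Short regression slope:}\quad \tilde\beta_1 &= \hat\beta_1 + \hat\beta_2\,\hat\delta_1
\end{align*}
```

    If, say, studying raises GPA and heavier drinkers study less, the product in the last line is negative, and the short regression overstates the damage attributable to drinking itself. This addresses only the omitted variable; it does nothing about the endogeneity of the drinking decision.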



  31. "Setting all the independent variables to zero and getting an F-test statistic, we can test the hypothesis that neither of the variables has any effect on GPA by once again looking at the P-value."

    NO, NO, NO. You do not set the independent variables to zero. You are testing the null hypothesis that the slope coefficients on each of the explanatory variables are all simultaneously zero. In SHAZAM, where the program has no way to refer to coefficients except by using the name of the variable that the coefficient modifies, the use of the TEST…END block of statements requires that you state the hypotheses as VAR1=0, VAR2=0, VAR3=0. But you ARE NOT setting the variables equal to zero, you are referring to their coefficients in a linear-in-parameters regression model.
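
    Written out for a model with two explanatory variables (generic labels X1 and X2 are assumed here, since the proposal's variable names are not quoted), the hypothesis and the usual test statistic are:

```latex
\begin{align*}
\text{GPA}_i &= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \varepsilon_i \\
H_0 &: \beta_1 = 0 \ \text{and}\ \beta_2 = 0
  \qquad H_1 : \text{at least one of } \beta_1, \beta_2 \text{ is non-zero} \\
F &= \frac{(\mathrm{SSR}_r - \mathrm{SSR}_u)/2}{\mathrm{SSR}_u/(n-3)}
\end{align*}
```

    Here the restricted sum of squared residuals comes from a model containing only the intercept, the unrestricted one comes from the full model, and n is the sample size. The restrictions are on the betas. The X variables keep whatever values they have in the data.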



  32. "If greater precision is preferred, then examination for possible heteroscedasticity condition is recommended."

    No again. If you want to be reassured that your hypothesis testing is not invalid, you need to ensure that the homoscedasticity assumption implicit in every plain OLS command is satisfied by the data. If you have heteroscedasticity in your data, but ignore the problem, all of your standard error estimates for the parameters will be wrong and your hypothesis testing invalid. WLS allows greater efficiency than corrected OLS estimates, but uncorrected OLS estimates are basically garbage.



  33. "As shown at the econometric model that was figured out, the price of coffee has the greatest effect to the customers’ monthly expenditure on coffee because it has the greatest coefficient."

    This paper used what appears to be simulated data to estimate an illustrative model using actual regression analysis. "Fake" data are not usually used in research proposals (although they are widely used in teaching to illustrate possible properties of some data sets). Given that the data are constructed, there is no reason to expect that the real data for these variables would display the same relationships. (Unless, of course, the true "data generating process" is identical to the data generating process used to build the hypothetical data.) Once again, however, the size of the coefficient has nothing to do with the magnitude of the effect of any one variable on the dependent variable. Coefficient magnitudes can be changed arbitrarily by changing the units in which the explanatory variables are measured. That is why we sometimes use "standardized coefficients."
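
    A sketch of why standardized coefficients are sometimes preferred for comparing magnitudes. The coffee numbers are invented purely to show the arithmetic, and the helper names are hypothetical. The raw slope changes by a factor of 100 when price is restated in cents, while the standardized slope (the raw slope times the ratio of the standard deviations of x and y) does not change at all.

```python
import statistics

def slope(x, y):
    """Least-squares slope from regressing y on x with an intercept."""
    xbar, ybar = statistics.mean(x), statistics.mean(y)
    return (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
            / sum((a - xbar) ** 2 for a in x))

def standardized_slope(x, y):
    """Slope measured in standard deviations of y per standard deviation of x."""
    return slope(x, y) * statistics.stdev(x) / statistics.stdev(y)

# Made-up data: coffee price and monthly coffee expenditure.
price_dollars = [1.0, 1.5, 2.0, 2.5, 3.0]
spend = [40.0, 38.0, 35.0, 31.0, 30.0]
price_cents = [100 * p for p in price_dollars]

b_dollars = slope(price_dollars, spend)
b_cents = slope(price_cents, spend)
s_dollars = standardized_slope(price_dollars, spend)
s_cents = standardized_slope(price_cents, spend)

print(b_dollars, b_cents)
print(s_dollars, s_cents)
```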



  34. "The dependent varible of this model is rage, which means the chance of road rage would flare in the traffic. While ‘I’ her represents an individual. As of the observations obtained data include age, gender, education, the area where the driver at, the traffic condition, the distance between car to car, wealth of the driver, insurance plan on the cars and the season." …. Data is proposed to be collected by survey "by randomly selecting 5,000 people who have registered their cars and driver licenses through Department of Motor Vehicle (DMV) across the whole California."

    It sounds from the descriptions of the variables that an observation should be an individual driving a car at a moment in time (since distance between cars is proposed as one regressor). Yet most data will be at the level of the individual California driver. Perhaps the dependent variable should be measured at that level as well, possibly "episodes of road rage occurring per month, on average" or "episodes of road rage inflicted by other drivers per month." Both variables would be interesting stories, and there would be better conformability between the dependent variable and the explanatory variables. Perhaps the "event specific" factors such as traffic conditions and distances between cars would be better expressed as typical values for the individual’s usual commute.




Updated: 12/7/98; Prepared by: Trudy Ann Cameron