UNIVERSITY OF CALIFORNIA, LOS ANGELES
Department of Economics
Economics 143 (Cameron) - Applied Regression Analysis
1997 Proposal Bloopers and Problems
This is a partial inventory of some of the things that went wrong in the
research proposals that were submitted for the Fall 1997 version of the course.
Studying these, and being sure you know WHY they are problems, may solidify your
ability to think and write about empirical issues in economics.
Semantic and Grammar Problems:
- F-tests have a capitalized F, whereas t-tests use a lower-case t.
- The word "it's" means "it is." The possessive form is "its" with no
apostrophe.
- The word "cannot" is a single word, not "can not" (in most
cases).
- Note the difference between "whether" and "weather."
- You should say "A number of models is possible," since you are talking about
"a number." If this gives you the creeps, switch to "Several models are
possible."
- When you multiply two explanatory variables to create a new variable, the
result is called an "interaction term," not an "interactive term."
- Contractions such as "wouldn't", "didn't", "shouldn't" are not commonly seen in
professional writing. Switch to the longer form.
- "Constituency" is commonly used to describe the people who belong to the
voting public relevant to a politician. But it is common also to talk about the
constituency for a research project. This is the group in society that will be
interested in the results. This is not necessarily the same thing as one
politician's electorate.
- You can assume your audience for these proposals is familiar with econometrics
at the level of Economics 143. It is not necessary to explain what a slope is, or
how to do a t-test or an F-test.
- The particular software used to estimate an ordinary least squares regression
model is not typically of general interest. There are many alternatives. At
best, any mention of the program used and discussion of specific commands should
be relegated to footnotes. But the software used should be mentioned,
since a proper citation is needed.
Problems in Specific Proposals
I have not selected problems from everyone's proposals, just some specific
examples. These are worth mentioning (so everyone can learn from everybody else's
mistakes, as opposed to just their own). In some cases, I have appended an
outline of the nature of the problem.
- Free throw percentage in a particular game regressed on (among other things)
free throw percentages in the first, second, third, and fourth quarters of that
game; the expectation that the coefficients on these variables will drop as the
game progresses.
- Dependent variable = number of car accidents per year for each of 1000
randomly selected individuals. Are all accidents equal? Is there a better
dependent variable? A dummy variable for "young age" (<45) and "old age" (>45)
both included in the model; a plan then to use the logs of these two variables in
an alternative log-log specification.
- Dependent variable = percentage of cases won by a trial attorney; failure to
control for types of cases taken (contingency fees create incentives to choose
only winnable cases)? "We can regress a single explanatory variable on the
dependent variable." "By holding all other explanatory variables constant, we
could formulate a null hypothesis..."
- Including an intercept term and a full set of 12 monthly dummy variables to
capture seasonality. [If demand for a product is expected to change over one's
life-cycle, rather than simply to increase with age or fall with age, then a
quadratic function of age may be appropriate (or one might switch to a set of age-
interval dummies to detect the actual shape of the relationship).]
- Dependent variable = GPA, as a function of average daily commute time per
quarter. Should interact commute time with mode of commuting. Public transit
commuting could foster more reading; driving alone should not. Should control for
time studying and going to classes, as well as time commuting. Failing to do this
could lead to omitted variable bias if study time varies inversely with commute
time.
- Dependent variable is a dummy variable (0,1)--either a condition is present
for an observation or it is not. We have not discussed these models formally as
of the time the proposals were due, but as long as Y differs from observation to
observation, you can get some useful information by regressing it on potential
factors that could contribute to it being either a zero or a 1. Examples: study
on osteoporosis.
- Once a model has been estimated, identify which explanatory variables are
amenable to being manipulated by policy decisions and which are not. Your
ancestors may influence your susceptibility to disease, but they are exogenous and
predetermined and not within the reach of policy prescriptions. However,
knowledge of the influence of heredity means people could be advised of their
greater risk and the need to pay attention to minimizing other factors that
contribute to the disease in question. E.g. how much attention to different good-
health practices might be necessary to make up for bad genes?
- Dependent variable: robbery sentences. Model fails to control for the nature
or severity of the robbery (amount, gender or race of victim, etc.). If this is
uncorrelated with all of the other things that act to determine the length of a
sentence (here, emphasizing gender and race), then there is no problem. However,
if males commit more heinous crimes, for example, then the result that they get
longer sentences on average than women should not be construed as evidence of
discrimination. Could also consider identity of the judge.
- Dependent variable: sales (of bus company? for a particular bus route?),
presumably quarterly, because there is mention of using quarterly dummy variables
for seasonality. Acronyms used before the variables are defined. Must be across
cities as well as over time, since bus fares might not vary enough to explain
sales, especially if they are regulated by a public transit authority.
Explanatory variables include frequency of buses going through the route. Is an
observation thus a single bus route? Coverage of bus routes? Conclusion mentions
"daily sales." (A clear definition of the unit of observation is needed.) Also,
specification section mentions "Certain quantities will be chosen for each
variable and all combinations of the values will be tried to obtain the sales for
that data point, this will be done for every season to get the seasonal values."
Is this an economic experiment, where X's can be manipulated at will and Y can
then be observed?
- "...we explain the behavior of one variable in relation to the behavior of
other variables allowing for the fact that the relationship between the variables
is not exact by adding the error term E. And because of the presence of other
unmeasurable factors such as ability, personality, performance and motivation that
differ from one person to another, we include the term U."
- "...number of children is expected to be negatively correlated with experience
and gender..." (as explanatory variables in a model to explain earnings). "The
Log w formula indicates that all explanatory variables determine earnings." (model
is not yet estimated!) "...hoping to make information more available at the time
of hire to both employers and employees as much as for the census bureau in order
to change wage rules accordingly."
- Dependent variable: starting incomes for a sample of recent UCLA graduates.
Suggests that implications from the research will include insights into "...Is a
bachelor's degree worth the same today as it was 10,20,30 years ago?...How much is
a UCLA degree worth today when compared to other universities? Has the value of a
UCLA degree declined in the past decade?"
- Dependent variable: economic growth in an urban region. Modelled as a
function of annual growth rates in multi-lane highways, growth in rail systems,
number of airplanes and buses in service, and population growth. (HINT: Some
variables are growth rates and some are current levels.) Is this local or
interregional travel?
- Dependent variable: a woman's weight. Does not include mother's average
adult weight (and grandmothers' weights). Genetics can play a very important role
in weight determination, in addition to behavioral factors.
- Dependent variable: public railway ridership demand (average annual number of
people). Claims pooled data of 900 observations of thirty metropolitan cities in
the world with elaborate railway systems, over a thirty year time span from 1980
to 2010.
- "In the following model, only one factor from the four groups are discussed."
Proposal identifies broad classes of explanatory factors, but illustrates each
with only one example from that class. (CAN test factor collectively by doing a
joint test of the significance of the coefficients on ALL of the variables making
up that "factor.")
- Dependent variable: GPA. Key explanatory variables=dummies for large campus,
small campus, distance learning; distance learning the omitted category. Now know
about endogeneity bias since students self-select to participate in distance
learning. If the ones who choose it are predisposed to have greater success by
that mode than learning by other modes, distance learning will look artificially
successful.
- Dependent variable: "the composition of waste material and its toxicity
(WASTE)." Explanatory variables: amounts of metals, cloth, rubber, glass,
plastics, yard waste, ..., food waste. Apparently no independent measure of
WASTE. RHS variables ARE the waste composition and toxicity. Would be better to
try to explain the quantity of each component as a function of economic
conditions, season of the year, etc., unless some scientists can monitor and
measure emissions from a waste dump and provide some index of effluent from the
site. THAT could be a useful dependent variable.
- "Any variable that may have an effect on a high school student's GPA need to
be accounted for. Only in this way can we determine whether or not learning how
to play a musical instrument has a positive effect on a high school student's GPA.
If I did not include all of these other variables, the omitted variable bias
problem would occur." (HINT: Not necessarily. When would it NOT be a
problem?)
- Dependent variable: GPA. Explanatory variables include SAT score and race
dummies. "Some have criticized the (SAT) test to be culturally biased and an
unfair indicator of a student's academic potential. If the SAT score variable
coefficient is significantly positive, then these critics can be statistically
proven to be wrong." (Need to interact SAT score with ethnicity variables to see
if the difference in college GPA for a one-unit difference in SAT score (slope on
SAT) differs by ethnic group.)
- Dependent variable: number of adults who suffer from fear of flying
("proportional to the total random population in the sample"). Explanatory
variables include a female dummy, years of education, whether it is a big
plane,... (Hint: dependent var is for a population, explanatory variables (are
variously) for an individual or for an individual on a specific
flight.)
- "This model is designed to observe any differences, if they exist, between
salaries due to race and gender." (Model includes ONLY gender and race variables.
Model fails to control for other left-out factors that might affect salaries and
are correlated with race and gender. If women choose jobs with more-flexible
hours (perhaps in anticipation of child-rearing), and these jobs pay less, failing
to control for the flexibility of the job would create an apparent salary
decrement just for being female.)
- Dependent variable: hourly labor charges for auto repairs. Explanatory
variables: value of the auto being fixed and income of the auto owner.
Objective: look for evidence that mechanics price-discriminate on the basis of
owners' incomes. Proposal suggests this might be a socially undesirable "bias."
Recognize market power on the part of the repair shop and economically rational
exploitation of lesser demand elasticities of higher income consumers. Auto
repairs are non-transferable and seller can identify different groups by the auto
they own. Still, differences could be due to differing complexity of fixing a
high-priced auto (more bells and whistles in the technology?).
- Dependent variable: total cost of water treatment (data over ten years,
monthly). Coefficients discussed before regression specification is spelled out.
Simple regression only. Discussion of economic theory, but no distinction between
conventional generic micro theory and the assertion of increasing returns in this
industry. Not clear on whether slope of total cost function (marginal cost)
should be everywhere falling as output increases. A diagram summarizing a sense
of the technology in this industry would have helped.
- Dependent variable: number of home security alarm systems installed in the
greater Los Angeles area per year. Explanatory variables: whether this is a
house or a condo, number of floors, family yearly income, ...price of an alarm
system...number of additional functions in an alarm system. Problem: dependent
variable is annual aggregate, some explanatory variables are for individual
households and some are for individual home security options facing any given
household. Units of observation MUST conform. Could use as dependent variable a
dummy variable for whether or not a given household HAS an alarm
system.
- Dependent variable: number of network computers (NCs) demanded (no mention
during what time period or by whom). Explanatory variables include prices of
substitute and complementary systems (good!), but also income (whose??) and tastes
(whose??). Proposal suggests surveying users. This would yield individual
information. But how many people demand more than one NC? Alternatively, RHS
variables could be aggregated to the state level (maybe) and NCs could be measured
at the state level (for each month?). Then monthly average total state income could be
used. Tastes can be proxied by a vector of individual attributes (for an
individual) or by state average attributes for an entire state. Units of
observation must conform for LHS and RHS variables.
- Dependent variable: "Mexican Americans at the university level" (no
indication as to year of college or what geographical scope...all universities,
all US universities, UCLA?). Explanatory variables include: family economic
status, number of parents in household, number of siblings, etc. RHS variables
all pertain to individual college-aged Mexican Americans, yet dependent variable
is total number at university level. (Sample is presumably drawn from the
population of all college aged Mexican Americans. Could convert dependent
variable to a dummy variable equal to 1 if the individual is in college, 0 if
not.)
- "If there is a suspicion that OFFER and HRSTUDY [two explanatory variables in
a model] are somehow related (which is very plausible), we need to include an
interactive terms in our model." --NO, not necessarily. At issue is whether the
contributions of OFFER and HRSTUDY to explaining the dependent variable are
distinct and simply additive, or whether the effect of OFFER on the dependent
variable depends on the level of HRSTUDY, for example. Also, the expression is
"interaction term" not "interactive term."
- In describing a model that looks for differences in wages between natives and
immigrants, first specify a model that pools the data, using dummy variables for
status to distinguish the intercept (and slopes) for the two groups. Do not start
with the separate specialized models that obtain when the "immigrant" dummy
variable is set equal to zero or one.
- The modelling of durations is complicated because many interesting durations
(such as duration of a marriage) are not known until divorce occurs or one partner
dies. Some durations are "censored," in that all we know is that the duration for
a particular couple is at least as great as the length of time they have been
married at the time of the survey. Other marriages will have ended in divorce, so
we know exactly how long they lasted.
- Dependent variable: crime rates in seven areas of Los Angeles (per month? for
how many months?). Explanatory variables: ...include dummy variable for the
level of organization within each gang, dummies for the involvement of individual
gangs in illegal gun and drug distribution... Problem. Dependent variable is for
city level, explanatory variables are for individual gangs in that city. RHS
variables cannot be more disaggregated than LHS variables (although sometimes if
we do not have a sufficiently disaggregated measure for a RHS variable, we proxy
with an average for a larger geographic area or longer time period incorporating
the one represented by the dependent variable). ALSO: "expect that number of gun
shops will have no effect on the crime level." However, now that we know about
joint endogeneity of dependent and explanatory variables, it is likely that this
variable would be significant in such a regression, if gun shops spring up in
response to citizens' demands for protection against existing crime.
- "The dependent variable will be a numerical figure of all of the independent
variables added together, given the individual applicant's characteristics. ...
The higher the figure is for the dependent variable, the higher will be the
applicant's chances of acceptance." (HINT: you need a separate measure for the
dependent variable, such as ACCEPT=1 if accepted, =0 if not. Regression reveals
relationships between Y and the X's. It is not the way to create a Y
variable.)
- Dependent variable: rate of acceptance to colleges. Independent variables:
grade point average, SAT score, and extracurricular activities. Again, a problem
with the RHS variables corresponding to individuals, but the LHS variable not
matching. The LHS variable applies to a group.
- Dependent variable: weight gain during the first quarter of college.
Independent variables: number of grams of fat and the number of calories taken
in, recorded daily, number of hours of exercise, recorded on a weekly basis....
(The LHS variable corresponds to a time interval of a quarter, whereas the first
two explanatory variables correspond, respectively, to an interval of a day, and a
week. These RHS variables should be aggregated (or averaged) over the same
quarter. RHS variables cannot be more disaggregated than the LHS
variable.)
- "There will definitely be an omitted variable bias because there are always
more variable that can be added that could probably affect test scores." Again,
omitted variable bias only occurs if an important explanatory variable that has
been left out of the model is correlated with another variable that is
included.
- "Regression analysis can be performed in each chosen country to find out the
relationships, if any, between the dependable [!] variable...and the following
explanatory variables..." [Pool the data across countries and use dummy intercept
shifters and dummy slope shifters to distinguish between the regressions for each
country. The differences in regressions across countries can then be
tested.]
- "Age would certainly increase the risk of the developing lung cancer. It is
in fact a more dominant factor than cigarette smoking as conclude by other
research. However, since this research purpose in exam the effect of cigarette.
We will leave this factor out by taking people of the same age group." [NO. Can
easily control for age by including an age variable in the regression. That is
the whole purpose of multiple regression. If you use only one age group, you can
only describe the relationship among the variables for that one age
group.]
- Dependent variable: Bicycle accidents. Explanatory variables: bike lanes,
cars, bikes,.... "The intercept b1 is the number of accidents when all variables
are equal to zero, which means the number of automobile-related accidents that
occur without the effects of each of the variables." [Should be careful to point
out that it has no real meaning in this case because it is unlikely that a
community will have no cars (in particular).]
- Grade inflation model. "The high schools should have approximately equal
student populations. There might be an effect on evaluation which might depend on
the number of students in a classroom. This should be avoided in our model. It
is also important that the schools have a similar academic curriculum and offer
about the same number of honors and advanced placement classes." [Why not extend
your sample to a wide array of schools and specifically control for systematic
variations in these factors by including them in your regression model? Rather
than just allowing the GPA as a function of SAT scores to vary between two high
schools, we could then see whether the relationship has a different slope or
different intercept according to a wide range of measurable characteristics of
schools. For example, is there more grade inflation in high-income neighborhoods
where parents have high expectations for the college prospects of their
students?]
- "The primary statistical information I would explore would be to test the
ordinary least square regression of all the variables on sprodi." "Testing the
hypothesis of what will happen if you set b2, b3,b6, and b7 to zero and if you
decrease teach and books by one additional unit, this will; bring about a negative
effect on sprodi. You would also be able to come to a conclusion on the opposite,
a positive effect of these two variables if there is an increase."
- "Similarly I would use a T-test to test the other explanatory variables and
dummy variable by setting their coefficients, B2=B3=B4=B5=0, all equal to zero."
[Actually, this sounds like a job for an F-test.] "i would be very careful of
multicollinearity in my econometric model since if any explanatory variables is a
linear combination of the other variables that would mean that multicollinearity
exists and my hypothesis testing could be erroneous." [Not necessarily.
Multicollinearity may make it harder to reject zero hypotheses because of the
inability to distinguish the separate contributions of sets of variables. Their
standard errors are large, not necessarily wrong.]
- Performing an ordinary least square on the data will give us b1 and b2. The
correlation between usage and income is expected to be a high positive number
while that between usage and rate is expected to be a low negative number. [The
author seems to be referring to coefficients, not correlations. You can make the
coefficient as big as you want by defining your variables in the smallest possible
units. About all that is relevant is expectations about sign of
coefficients.]
- Dependent variable: long-distance telephone usage. Explanatory variables:
call rates and income. Says companies "...may encourage long distance usage by
higher income household and the middle income households only. The model predicts
that low-income households will have generally low usage despite the average
rates." [There is an unexploited opportunity here to see if the price
responsiveness of demand varies with income level. This can be done by
interacting income and price and including this in the model as well.]
- Dependent variable: average income of a person over first five years of being
employed after graduation. "This particular model will concentrate only on using
the graduating GPA of a student and the specification of the university as the
explanatory variables, since it will estimate possible significance of the choice
between schools." [Choice of university can be correlated with many other
variables that will also influence job placement and income post-college. Think
about "old-boy" networks. Also, which university you attend is not always a
choice, since options are constrained by where you are successful in gaining
admission. This paper uses dummies for UCLA, USC, and CSULA (where one should
have been left out, since all observations come from one of these three), with a
plan to see whether choice of college affects income. GPA prior to college
might have been helpful, in addition to college GPA. Suppose a student is not
qualified for admission to UCLA, and also not qualified for a good job after
graduation (due to glaring deficiencies in math or English composition skills, for
example). The proposed model will make it look like a randomly selected student,
assigned arbitrarily to CSULA, will do much worse on the job market than he or
she actually would. Weaker students go to CS schools and weaker students get poorer
jobs. College attendance is not randomly assigned across students. We now know
that it is an endogenous variable.]
- If you need a set of dummy variables to capture different categories of
observations, and choose to keep the intercept in the model and use m-1 dummy
variables, it is imperative that there be SOME observations in your sample in the
omitted category.
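
The two dummy-variable points above (an intercept plus a full set of 12 monthly
dummies, and the need for SOME observations in the omitted category) can be
checked with a small sketch. This is an illustration only and does not come from
any of the proposals: the sample (three years of monthly observations), the
helper names matrix_rank and design_matrix, and the choice of January as the
omitted month are all assumptions made for the example. The sketch counts the
linearly independent columns of the regressor matrix. When that rank is smaller
than the number of columns, ordinary least squares cannot separate the
coefficients.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix (given as a list of rows) by exact Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Look for a row at or below position `rank` with a nonzero entry here.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column: it depends on earlier columns
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            if factor != 0:
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def design_matrix(months, dummy_months):
    """One row per observation: an intercept, then a 0/1 dummy per listed month."""
    return [[1] + [1 if obs == d else 0 for d in dummy_months] for obs in months]


# Hypothetical sample: three years of monthly observations, months coded 1..12.
months = [m for _ in range(3) for m in range(1, 13)]

# Case 1: intercept plus all 12 monthly dummies.
full = design_matrix(months, range(1, 13))
# Case 2: intercept plus 11 dummies, with January as the omitted category.
dropped = design_matrix(months, range(2, 13))
# Case 3: the same 11 dummies, but the sample contains no January observations.
no_january = [m for m in months if m != 1]
empty_base = design_matrix(no_january, range(2, 13))

full_rank = matrix_rank(full)
dropped_rank = matrix_rank(dropped)
empty_base_rank = matrix_rank(empty_base)

print(len(full[0]), full_rank)              # 13 columns, rank 12
print(len(dropped[0]), dropped_rank)        # 12 columns, rank 12
print(len(empty_base[0]), empty_base_rank)  # 12 columns, rank 11
```

In the first and third cases the intercept column equals the sum of the dummy
columns, so one regressor must be dropped before the model can be estimated.
Only the second case has as many independent columns as coefficients.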
Update date: January 15, 1997
Prepared by: Trudy Ann Cameron