UNIVERSITY OF CALIFORNIA, LOS ANGELES
Department of Economics
Fall 1997; Cameron
Economics 143 - Midterm Examination Answer Outline
INSTRUCTIONS: Answer all questions in the
spaces provided (or indicate clearly where you have continued your answer).
[NOTE: Ample space was provided in the hardcopy version of the exam.]
Calculators are NOT permitted. Reduce all computations to the simplest
form so that anyone with a calculator could attain the answer easily. Show
your work and reasoning to the fullest extent possible so that part marks
can be assigned as warranted. You have 75 minutes to complete this exam.
All parts of both questions are worth 10 points (and some are much
easier than others). Total points = 150. This means roughly 5 minutes for
each answer. Budget your time carefully. NOTE: these data are fictitious.
These answer outlines do not include sketches of
regression relationships that would ordinarily be very helpful to the process of
sorting out what a particular regression means. On an actual exam, you are
encouraged to illustrate the points you are making, in order to verify your
intuition.
The actual exam included a "crib sheet" of standard formulas and a copy of t-
test critical values from the back of the Gujarati textbook.
1. Imagine that you are in charge of head office
personnel at Huge Corp. The corporate vice president keeps getting other
people's mail by mistake, so she tasks you to conduct a study of mailroom
productivity. Fortunately, the mailroom supervisor has, for years, been
sampling productivity during his monthly employee reviews. For a random
sample of 27 reviews (all from different employees) he provides you with
data on productivity (prod_i = letters correctly sorted
per minute), and experience (months_i = months of experience
in the Huge Corp. mailroom). You match these data to other information
about each employee (score_i = score on the aptitude test
they were required to take when they applied for a position at Huge Corp.).
The statistical analyses you perform are given in
Exhibit A.
a.) Fill in the blanks:
Across these 27 employees, what is the mean number
of months on the job? 4.8148
What is the maximum score on the aptitude test?
95
What is the standard deviation in productivity?
1.5779
Do the descriptive statistics you have just provided
refer to the joint distribution of these three variables, or to
their marginal distributions? marginal
What is the correlation between score and months
in this sample? -.75607
What are the units for this correlation measure?
correlation is a unit-free measure
b.) Using the STAT
output, test the hypothesis
that the true marginal mean value of productivity in the population of
all mailroom workers is 11 letters per minute.
This requires inference about the true mean of the
univariate distribution of the prod variable, regardless of the values of the
other variables. One would take the sample evidence regarding the mean, 9.7704,
subtract the hypothesized value (11), and divide by the standard error of the
mean, which is the sample standard deviation (1.5779) divided by the square root
of the sample size (the square root of 27). This number, when calculated, would be compared
to the critical value of a t-distributed random variable with 26 degrees of
freedom (because only one degree of freedom is used up by the calculation of an
estimator for the true population mean of prod). For a test at the usual 5%
significance level (95% confidence level), if we found the calculated t-test
statistic to be larger than 2.056 in absolute value, we would typically reject the
null hypothesis of the true mean being 11 letters per minute.
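The arithmetic described above can be sketched as follows. This is a minimal illustration using only the figures quoted in this answer (mean 9.7704, standard deviation 1.5779, n = 27, critical value 2.056); the variable names are chosen here for illustration and are not SHAZAM output.

```python
import math

# Figures quoted in the answer outline (from the STAT output in Exhibit A)
n = 27
sample_mean = 9.7704
sample_sd = 1.5779
hypothesized_mean = 11.0
t_critical = 2.056  # two-tailed 5% critical value, t with 26 degrees of freedom

# Standard error of the mean: sample standard deviation over sqrt(n)
standard_error = sample_sd / math.sqrt(n)

# t-test statistic for H0: true mean = 11
t_stat = (sample_mean - hypothesized_mean) / standard_error

# Reject H0 if the statistic is larger than the critical value in absolute value
reject_null = abs(t_stat) > t_critical

print(f"standard error = {standard_error:.4f}")
print(f"t statistic    = {t_stat:.3f}")
print(f"reject H0 at the 5% level: {reject_null}")
```

With these inputs the statistic is about -4.05, well beyond 2.056 in absolute value, so the null hypothesis of 11 letters per minute would be rejected.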
c.) The vice-president of Huge Corp. asks you
(not rhetorically): "Don't these mailroom people ever learn?" Translate
this question into a simple regression specification and test an
appropriate
hypothesis statistically, using the information in Exhibit
A.
This verbal question could be translated as a question
about whether, as they spend more time on the job, mailroom employees experience
an improvement in the number of letters accurately sorted per minute (on average).
You are asked specifically to consider simple regressions, of which two are
offered in Exhibit A. If you believe that job experience affects productivity,
rather than the other way around, your dependent variable should be productivity
and your explanatory variable should be months. A reasonable specification would
be prod_i = b1 + b2*months_i + e_i. An appropriate hypothesis
would be H0: b2 = 0. If the point estimate is positive, and if we
can reject this null hypothesis, we would infer that the mailroom people do
experience improvements in their productivity with practice. Based on the results
of Regression 2 (or, more conveniently, on
the output of the confid command that follows it), we see that 0 lies within the
95% confidence interval, so we cannot reject the hypothesis that additional months
do not increase productivity. NOTE: Regression 1 was included just to see if
anybody was confused enough to select the wrong "outcome" variable. I hope you
did not fall for that.
d.) How is it that we can argue that a "t-test"
statistic, if the null hypothesis is true, has a t-distribution?
This was expected to be one of the more difficult
questions, since it asks you to remember the "theory" of regression as we covered
it in lecture. If the underlying parent population distribution for
the dependent variable is (conditionally) normal, then any linear combination of
the independent and identically distributed (i.i.d.) observations on the dependent
variable will also be normal. Even if the population is not exactly normal,
versions of the Central Limit Theorem can provide the same result. The slopes
(and intercept) of a regression model can be expressed as just such a linear
combination (they are "linear" estimators). Since we have to use sample variance,
s-squared, instead of the true population variance (sigma-squared), the
distribution of the standardized parameter estimate (point estimate-null
hypothesis)/(standard error of point estimate) is somewhat noisier than a standard
normal; it is a t-distributed variable.
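The "somewhat noisier than a standard normal" claim can be checked with a small simulation. The sketch below is an illustration only: it uses the sample mean (the simplest linear estimator) rather than a regression slope, and the sample size of 5, the 20,000 replications and the seed are arbitrary choices made here, not values from the exam.

```python
import math
import random
import statistics

random.seed(143)  # arbitrary seed so the run is repeatable

def t_ratio(sample, true_mean):
    """(point estimate - null value) / (estimated standard error) for a sample mean."""
    n = len(sample)
    se = statistics.stdev(sample) / math.sqrt(n)  # uses s, not the true sigma
    return (statistics.mean(sample) - true_mean) / se

n_obs = 5             # small sample, so the extra noise from estimating s is visible
n_reps = 20000        # number of simulated samples
normal_cutoff = 1.96  # two-tailed 5% cutoff for a standard normal

# Draw repeated normal samples for which the null hypothesis (mean 0) is true
ratios = [
    t_ratio([random.gauss(0.0, 1.0) for _ in range(n_obs)], 0.0)
    for _ in range(n_reps)
]

# Share of standardized estimates that land beyond the normal cutoff
share_beyond = sum(abs(r) > normal_cutoff for r in ratios) / n_reps
print(f"share beyond +/-1.96: {share_beyond:.3f} (a standard normal would give about 0.05)")
```

The share comes out well above 5% because the ratio follows a t-distribution with 4 degrees of freedom, which has fatter tails than the standard normal.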
e.) What productivity would you expect from a
new hire, based on Regression 2? Give a
point estimate and explain explicitly how a 95% confidence interval for this
prediction would be constructed.
This is a question about the intercept in Regression 2.
The interpretation of the intercept is the expected productivity when the
explanatory variable (months of experience) is zero. A "new hire" is somebody
with zero experience on the job. Since no handy confid
statement has been employed, you need to construct the confidence interval the
old-fashioned way. The point estimate is 9.4569. This is the center of the
confidence interval. The amount for the "plus or minus" term involves the 0.025
critical value of t(df=25), which, incidentally, shows up in the output for the
confid statement concerning the slope. It is 2.060. The other necessary
ingredient is the standard error of the point estimate of the intercept, which is
0.5765. In practice, it would be easiest to use a confid command. If this
does not work when you refer to the intercept as the coefficient on "CONSTANT,"
try creating a variable that is always 1 (i.e. genr one=1), and then
perform a regression of prod on months and one, forcing the regression "through
the origin" by using ols prod months one / noconstant. Then use confid
one. This will definitely work.
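The "old-fashioned" construction amounts to the following arithmetic, sketched here with the three figures quoted above (point estimate 9.4569, standard error 0.5765, critical value 2.060). The variable names are illustrative.

```python
# Figures quoted in the answer outline (Regression 2, Exhibit A)
intercept = 9.4569     # point estimate: expected productivity at months = 0
se_intercept = 0.5765  # standard error of the intercept
t_critical = 2.060     # 0.025 upper-tail critical value, t with 25 degrees of freedom

# 95% confidence interval: point estimate plus or minus t * standard error
half_width = t_critical * se_intercept
ci_lower = intercept - half_width
ci_upper = intercept + half_width

print(f"95% CI for new-hire productivity: ({ci_lower:.4f}, {ci_upper:.4f})")
```

This gives an interval of roughly 8.27 to 10.64 letters per minute.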
f.) Now you realize that cognitive skills and
manual dexterity may also affect productivity and account for differences
across employees. You include aptitude scores in the regression and obtain
the results in Regression 3. What does this
alternative specification suggest about "learning-by-doing" in the mailroom? Is
there any statistical evidence that experience affects productivity? Conduct an
appropriate hypothesis test.
"Learning-by-doing" means that the longer you do
something, the better you get at doing it. Regression 3 suggests that months (and
score) have a
strongly statistically significant effect on productivity, since the t-ratio for
the test of the zero hypothesis on the months coefficient is very large (4.232).
The 0.025 critical value of a t-distributed random variable does not exceed 4
until you get down to 2 degrees of freedom, and here we have 24 (=27-3 estimated
coefficients). To three decimal places, there is essentially zero probability out
in the tail of a t(24) distribution beyond 4.232. We soundly reject the
hypothesis that experience does NOT affect productivity, and the point estimate is
positive.
g.) Observe the STAT
output. Statistically,
what accounts for the difference between the implications of Regressions
2 and 3 regarding the effect of experience on productivity?
In this data set, months and score are negatively
correlated. If the variable score is omitted, then more months in the mailroom
(which should increase productivity) are serving as a proxy for lower aptitude
test scores (which will tend to mean lower productivity). The two effects tend to
cancel each other. This is akin to the study.sha example from the lab. It
seems that the people who are sampled after having been in the mailroom a longer
time are also likely to be relatively low scorers on their initial aptitude tests.
In contrast, in this sample, relatively few high-scorers on the aptitude test are
observed after a large number of months on the job.
h.) Can you give a logical intuitive explanation
for the process that leads to the relationship between months_i
and score_i that is revealed in the STAT output?
Bright and productive people tend to get promoted out
of menial entry-level tasks like mailroom clerk very rapidly. We don't get to
observe them having served in the mailroom for large numbers of months. On the
other hand, less bright people might be stuck working in the mailroom
forever. Later in the course, we will talk about a variant of omitted variables
bias, called "endogeneity bias." If your "observations" in some sense determine
the values for their explanatory variables, you have a potential problem. Here,
by their performance, people have an influence on how long they have to spend
working in the mailroom. So this X variable cannot be viewed as "exogenously
given" and therefore independent of the error term in the regression model. More
on this later.
i.) If a new employee scores 75% on his aptitude
test, what should mailroom management expect in terms of productivity at
his 2-month review?
This is a question about the expected value of the
dependent variable, given a particular value for the explanatory variable(s). We
should take the fitted model from Regression 3, and plug in months=2 and score=75,
and see what productivity is predicted. (Some people wondered if score was
measured in percent or as a decimal. For a good guess at this, you could simply
check the STAT output and see what the mean, minimum and maximum values of the
score happened to be.) An ambitious response would recommend
that a confidence interval for mean prediction be constructed. However, we have
not yet covered the algebra for constructing such a confidence interval, except in
the case of a simple regression, which this is not.
j.) In these specifications, what fraction of
the variation in productivity across employees can be explained by a model
that uses only months of experience? 0.0162What fraction can be explained
by a model that uses both experience and aptitude test scores? 0.5013 Can
these be compared? Why or why not?
This is a question about R-squared. From Regression 2,
we get the first number; we get the second from Regression 3. Note that goodness-
of-fit is improved dramatically when we control for score before looking for the
incremental effect of months on productivity. These ordinary R-squared values can
be compared, but we would hesitate to conclude that the second was better than the
first without first correcting for the fact that, since Regression 3 uses more
variables than Regression 2, we would expect the fit to be better.
Adjusted R-squared allows the comparison. These numbers are -.0231 versus 0.4598.
Note that the fit in Regression 2 is so abysmal that when R-squared is adjusted,
the value is even negative. It is unambiguous that Regression 3 is a better fit
to the data than Regression 2.
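The adjustment can be reproduced from the ordinary R-squared values using the standard adjusted R-squared formula. The sketch below assumes n = 27 observations, with 2 estimated coefficients in Regression 2 and 3 in Regression 3; the function name is illustrative.

```python
def adjusted_r_squared(r_squared, n_obs, n_coefficients):
    """Adjusted R-squared: penalizes fit for the number of estimated coefficients."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_coefficients)

# Ordinary R-squared values quoted in the answer outline
adj_r2_reg2 = adjusted_r_squared(0.0162, n_obs=27, n_coefficients=2)  # intercept + months
adj_r2_reg3 = adjusted_r_squared(0.5013, n_obs=27, n_coefficients=3)  # intercept + months + score

print(f"Regression 2 adjusted R-squared: {adj_r2_reg2:.4f}")
print(f"Regression 3 adjusted R-squared: {adj_r2_reg3:.4f}")
```

The results match the reported -.0231 and 0.4598 up to rounding of the inputs.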
k.) On any given shift, the mailroom is staffed
by three people. The mailroom supervisor observes that productivity of
any particular mailroom worker seems to depend on how hard the other people
in the mailroom are working. What data would you collect, what variable(s)
would you construct and what model would you estimate in order to test
a statistical hypothesis that would show whether there is any evidence
to support the supervisor's conjecture?
Since the identities of the other two workers cannot be
ordered in any unambiguous way, the best strategy would be to combine their
respective productivity measures, perhaps using their mean productivity (call it
mprod_i). A reasonable specification would then be ols prod months
score mprod. The relevant hypothesis test would be a zero-hypothesis test for
the coefficient on mprod. If the productivity of the other two people
working in the mailroom has no effect on the productivity of the workers in the
sample, then the supervisor's casual empiricism would be deemed incorrect. If the
point estimate is positive, and the zero hypothesis can be rejected, we would
conclude that the supervisor's assertion may be valid.
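Constructing the coworker variable might look like the sketch below. It assumes you have collected productivity for all three workers on the same shift; the function name and the example numbers are made up for illustration and are not from Exhibit A.

```python
def coworker_mean_productivity(shift_prods):
    """For each worker on a shift, return the mean productivity of the OTHER workers."""
    total = sum(shift_prods)
    n_others = len(shift_prods) - 1
    return [(total - own) / n_others for own in shift_prods]

# Hypothetical shift: letters correctly sorted per minute for three workers
example_shift = [9.0, 10.5, 12.0]
mprod = coworker_mean_productivity(example_shift)

for own, others in zip(example_shift, mprod):
    print(f"own prod = {own:5.2f}   mprod (mean of the other two) = {others:5.2f}")
```

Averaging the other two workers sidesteps the problem that there is no natural way to label one coworker "first" and the other "second".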
2. You have always wanted to start your own business,
and the specialty coffee-bar business appeals to you. You have a friend
at the Association of Specialty Coffee Retailers who manages to get you
data on the average costs of different establishments. The technology is
virtually identical across firms. You have data on 21 firms, for
atc_i
= average total costs of production and for q_i the rate
of output (in cups per hour). Exhibit B shows
the analyses you perform.
a.) - What is the interpretation of the intercept
in this model? Is it "meaningful"? Explain.
Note that you are not told the units for average total
costs. However, a reasonable assumption, given the output from the stat
command, is that average total costs are in cents per cup, since quantity
is measured in cups per hour. The intercept is the expected average total cost of
production if output is zero. In reality, this number must go to infinity, since
fixed costs are divided by zero. Since the stat command reveals that the
lowest output level observed in the data is 51 units, we cannot say anything at
all about the "true" intercept, since it is outside the range of the data.
Whatever number we get for an intercept, it is merely an artifact of the best-
fitting relationship through the cluster of points in the observed data (all at
much higher output levels than zero).
- By how much do average costs change if output
is higher by 1 unit?
This is a simple question about the slope. The point
estimate for this change is -0.38427 cents per cup for each additional unit of output.
- By how much do average costs change if output
is lower by 10 units?
This is again a simple question about the slope. The
answer will be +3.8427 cents per cup (for a 10-unit decrease), since we are now talking about lower
output, and a change that is ten times as great as in the last question.
b.) Are all firms in your sample experiencing
"increasing returns to scale" (declining average costs)? Answer carefully.
This answer requires some thought. I have asserted
that all firms have identical technologies. If this is strictly true, then the
actual cost experiences of one firm at a particular output level will be a good
predictor for the likely cost experiences of another firm as it adjusts its output
to this level from another level. We can never be certain, however. A good idea
is to look at the plot of average costs against output. If we believe that
average cost functions tend to be U-shaped, then it may be the case that some of
the firms at higher output levels are beginning to experience diminishing returns
to scale, so that average costs are now rising. We really do not yet have enough
information to examine this possibility in more detail. However, we will find
later that a quadratic form in quantity will be of some help.
c.) Suppose you plan to open a shop that will
operate at 80 cups per hour. What do you expect will be your average total
costs? What is the 95% confidence interval for these average costs?
Here is where you get to use the unpleasant formula for
a "confidence interval for mean prediction." We have a simple regression. To get
the center of the confidence interval, one substitutes 80 into the fitted
regression model: E[atc] = 93.784 - 0.38427*q. The ingredients you need to
finish constructing the confidence interval include the "standard error of the
estimate" (s = 10.227), the sample size (n=21), the mean of q (=94.857), the two-tailed 5%
critical value for a t-distribution with 19 degrees of freedom (from the tables,
2.093), and lastly, an estimate of the sum of the squared x-deviations. This can
be found by using the variance of the estimate (104.58) and dividing by the square
of the standard error on the slope (0.07693*0.07693). (Recall that the variance
of the slope is sigma-squared divided by the sum of the squared x-deviations.)
All of these quantities get plugged into the formula for a confidence interval for
mean prediction, and you are done.
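Plugging the listed ingredients into the usual simple-regression formula for a confidence interval for mean prediction gives the sketch below. It uses only the figures quoted in this answer and assumes, as in part (a), that atc is measured in cents per cup; the variable names are illustrative.

```python
import math

# Figures quoted in the answer outline (Exhibit B)
intercept, slope = 93.784, -0.38427
s = 10.227          # standard error of the estimate
s_squared = 104.58  # variance of the estimate
se_slope = 0.07693  # standard error of the slope
n = 21
q_mean = 94.857
t_critical = 2.093  # two-tailed 5% critical value, t with 19 degrees of freedom
q_new = 80.0        # planned output, cups per hour

# Sum of squared x-deviations, recovered from var(slope) = s^2 / sum_sq_dev
sum_sq_dev = s_squared / se_slope ** 2

# Center of the interval: fitted average total cost at q = 80
atc_hat = intercept + slope * q_new

# Standard error of the mean prediction at q_new
se_mean_pred = s * math.sqrt(1.0 / n + (q_new - q_mean) ** 2 / sum_sq_dev)

half_width = t_critical * se_mean_pred
ci_lower, ci_upper = atc_hat - half_width, atc_hat + half_width

print(f"fitted atc at q = 80: {atc_hat:.3f}")
print(f"95% CI for mean atc:  ({ci_lower:.3f}, {ci_upper:.3f})")
```

With these inputs the fitted value is about 63.0 and the interval runs from roughly 57.8 to 68.3.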
d.) Suppose that in the market area where you
plan to open your shop, perfect competition prevails and you can be certain
that your price per unit for a cup of coffee will be exactly $0.70. Is
it statistically likely that you will make some positive profit?
You will make some positive profit if, at your chosen
output level, p-atc>0. Thus, for positive profit, you want p greater than atc
(or, atc less than p).
For testing whether this could be the case, it is often easiest to think about it
if you construct a confidence interval for atc and see whether any values less
than or equal to $0.70 are included in it. If so, then these values of atc are
acceptable hypotheses, so the condition could be true. If you have successfully
constructed the confidence interval requested in part (c.) you will have all the
necessary information.
e.) (BONUS) Given what you know about average
total costs (from Economics 1 or the equivalent), is the regression you
have specified likely to be appropriate to capture the shape of a typical
average total cost function? Explain why or why not. Do you have enough
information to determine the profit-maximizing level of output to produce?
Average total costs are often thought to be U-shaped,
so we probably want to create a new variable: genr q2=q*q and then consider a
specification like ols atc q q2. This is like homework set #3, for the
part about marginal cost. We would need to do a little more work to determine the
profit-maximizing level of output, even if we assume perfect competition, since
this would require that we set output such that price equals marginal cost. From
atc at each output level, we can compute atc*q to get total cost. Marginal cost
can be read off either total cost or total variable cost. For the linear atc
curve we have actually estimated, tc=93.7*q - 0.384*q*q, so that marginal costs
are given by 93.7 - 0.768*q. If we set this equal to 70 and solve for the optimal
q, we get a number on the order of 30.85 units of output. Compare this to the
plot.
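The marginal-cost calculation for the linear atc curve can be sketched as follows, using the rounded coefficients quoted above and treating the price as 70 cents. The names are illustrative.

```python
# Rounded coefficients of the fitted linear atc curve: atc = a + b*q
a = 93.7
b = -0.384
price = 70.0  # cents per cup, i.e. $0.70

def total_cost(q):
    """Total cost implied by the linear atc curve: tc = atc * q = a*q + b*q^2."""
    return a * q + b * q * q

def marginal_cost(q):
    """Derivative of total cost with respect to q: mc = a + 2*b*q."""
    return a + 2.0 * b * q

# Solve price = marginal cost for q
q_star = (price - a) / (2.0 * b)

print(f"q where price equals marginal cost: {q_star:.2f} cups per hour")
print(f"marginal cost at that q: {marginal_cost(q_star):.2f} cents")
```

This reproduces the figure of about 30.85 to 30.86 units. Note that the fitted marginal cost line slopes downward, so the price = marginal cost condition alone does not confirm a profit maximum here, and the solution lies below the lowest output observed in the data (51 units). Both are reasons to compare against the plot and to prefer the quadratic specification.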
Updated: October 28, 1997
Prepared by: Trudy Ann Cameron