UNIVERSITY OF CALIFORNIA, LOS ANGELES
Department of Economics
Fall 1997; Cameron
Economics 143 - Midterm Examination Answer Outline

INSTRUCTIONS: Answer all questions in the spaces provided (or indicate clearly where you have continued your answer). [NOTE:  Ample space was provided in the hardcopy version of the exam.]  Calculators are NOT permitted. Reduce all computations to the simplest form so that anyone with a calculator could attain the answer easily. Show your work and reasoning to the fullest extent possible so that part marks can be assigned as warranted. You have 75 minutes to complete this exam. All parts of both questions are worth 10 points (and some are much easier than others). Total points = 150. This means roughly 5 minutes for each answer. Budget your time carefully. NOTE: these data are fictitious.

These answer outlines do not include sketches of regression relationships that would ordinarily be very helpful to the process of sorting out what a particular regression means. On an actual exam, you are encouraged to illustrate the points you are making, in order to verify your intuition. The actual exam included a "crib sheet" of standard formulas and a copy of t-test critical values from the back of the Gujarati textbook.

1. Imagine that you are in charge of head office personnel at Huge Corp. The corporate vice president keeps getting other people's mail by mistake, so she tasks you to conduct a study of mailroom productivity. Fortunately, the mailroom supervisor has, for years, been sampling productivity during his monthly employee reviews. For a random sample of 27 reviews (all from different employees) he provides you with data on productivity (prodi = letters correctly sorted per minute), and experience (monthsi = months of experience in the Huge Corp. mailroom). You match these data to other information about each employee (scorei = score on the aptitude test they were required to take when they applied for a position at Huge Corp.). The statistical analyses you perform are given in Exhibit A.

a.) Fill in the blanks:

Across these 27 employees, what is the mean number of months on the job? 4.8148
What is the maximum score on the aptitude test? 95
What is the standard deviation in productivity? 1.5779
Do the descriptive statistics you have just provided refer to the joint distribution of these three variables, or to their marginal distributions? marginal
What is the correlation between score and months in this sample? -.75607
What are the units for this correlation measure? correlation is a unit-free measure

b.) Using the STAT output, test the hypothesis that the true marginal mean value of productivity in the population of all mailroom workers is 11 letters per minute.

This requires inference about the true mean of the univariate distribution of the prod variable, regardless of the values of the other variables. One would take the sample evidence regarding the mean, 9.7704, subtract the hypothesized value (11), and divide by the standard error of the mean, which is the sample standard deviation (1.5779) divided by the square root of the sample size (the square root of 27). This number, when calculated, would be compared to the critical value of a t-distributed random variable with 26 degrees of freedom (because only one degree of freedom is used up by the calculation of an estimator for the true population mean of prod). For a test at the usual 5% significance level (95% confidence level), if we found the calculated t-test statistic to be larger than 2.056 in absolute value, we would typically reject the null hypothesis of the true mean being 11 letters per minute.
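
The arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the figures (9.7704, 1.5779, n = 27, critical value 2.056) are the ones quoted in this answer, and the variable names are made up for the example.

```python
import math

# Values quoted in the answer above (from the STAT output in Exhibit A)
sample_mean = 9.7704       # sample mean of prod
sample_sd = 1.5779         # sample standard deviation of prod
n = 27                     # number of employees in the sample
hypothesized_mean = 11.0   # value of the mean under the null hypothesis
t_critical = 2.056         # two-tailed 5% critical value, t with 26 df (from tables)

# Standard error of the mean: standard deviation over root of sample size
standard_error = sample_sd / math.sqrt(n)

# t-test statistic: (estimate - hypothesized value) / standard error
t_statistic = (sample_mean - hypothesized_mean) / standard_error

# Reject the null if the statistic is beyond the critical value in absolute value
reject_null = abs(t_statistic) > t_critical

print(f"standard error  = {standard_error:.4f}")
print(f"t statistic     = {t_statistic:.3f}")
print(f"reject H0 at 5% = {reject_null}")
```

With these figures the statistic works out to roughly -4.05, which is well beyond 2.056 in absolute value, so the null hypothesis of 11 letters per minute would be rejected.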

c.) The vice-president of Huge Corp. asks you (not rhetorically): "Don't these mailroom people ever learn?" Translate this question into a simple regression specification and test an appropriate hypothesis statistically, using the information in Exhibit A.

This verbal question could be translated as a question about whether, as they spend more time on the job, mailroom employees experience an improvement in the number of letters accurately sorted per minute (on average). You are asked specifically to consider simple regressions, of which two are offered in Exhibit A. If you believe that job experience affects productivity, rather than the other way around, your dependent variable should be productivity and your explanatory variable should be months. A reasonable specification would be prodi = b1 + b2*monthsi + ei. An appropriate hypothesis would be H0: b2=0. If the point estimate is positive, and if we can reject this null hypothesis, we would infer that the mailroom people do experience improvements in their productivity with practice. Based on the results of Regression 2 (or, more conveniently, on the output of the confid command that follows it), we see that 0 lies within the 95% confidence interval, so we cannot reject the hypothesis that additional months do not increase productivity. NOTE: Regression 1 was included just to see if anybody was confused enough to select the wrong "outcome" variable. I hope you did not fall for that.

d.) How is it that we can argue that a "t-test" statistic, if the null hypothesis is true, has a t-distribution?

This was expected to be one of the more difficult questions, since it asks you to remember the "theory" of regression as we covered it in lecture. If the underlying parent population distribution for the dependent variable is (conditionally) normal, then any linear combination of the independent and identically distributed (i.i.d.) observations on the dependent variable will also be normal. Even if the population is not exactly normal, versions of the Central Limit Theorem can provide the same result. The slopes (and intercept) of a regression model can be expressed as just such a linear combination (they are "linear" estimators). Since we have to use sample variance, s-squared, instead of the true population variance (sigma-squared), the distribution of the standardized parameter estimate (point estimate-null hypothesis)/(standard error of point estimate) is somewhat noisier than a standard normal; it is a t-distributed variable.

e.) What productivity would you expect from a new hire, based on Regression 2? Give a point estimate and explain explicitly how a 95% confidence interval for this prediction would be constructed.

This is a question about the intercept in Regression 2. The interpretation of the intercept is the expected productivity when the explanatory variable (months of experience) is zero. A "new hire" is somebody with zero experience on the job. Since no handy confid statement has been employed, you need to construct the confidence interval the old-fashioned way. The point estimate is 9.4569. This is the center of the confidence interval. The amount for the "plus or minus" term involves the 0.025 critical value of t(df=25), which, incidentally, shows up in the output for the confid statement concerning the slope. It is 2.060. The other necessary ingredient is the standard error of the point estimate of the intercept, which is 0.5765. In practice, it would be easiest to use a confid command. If this does not work when you refer to the intercept as the coefficient on "CONSTANT," try creating a variable that is always 1 (i.e. genr one=1), and then perform a regression of prod on months and one, forcing the regression "through the origin" by using ols prod months one / noconstant. Then use confid one. This will definitely work.
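
The "old-fashioned" construction described above amounts to point estimate plus or minus (critical value times standard error). A minimal Python sketch, using only the three figures quoted in this answer (the variable names are illustrative):

```python
# Figures quoted in the answer above (Regression 2 in Exhibit A)
intercept = 9.4569      # point estimate of the intercept (new-hire productivity)
se_intercept = 0.5765   # standard error of the intercept estimate
t_critical = 2.060      # 0.025 critical value for t with 25 degrees of freedom

# Half-width of the 95% confidence interval
half_width = t_critical * se_intercept

lower = intercept - half_width
upper = intercept + half_width

print(f"95% CI for new-hire productivity: ({lower:.3f}, {upper:.3f})")
```

This gives an interval of roughly 8.27 to 10.64 letters per minute for the expected productivity of a worker with zero months of experience.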

f.) Now you realize that cognitive skills and manual dexterity may also affect productivity and account for differences across employees. You include aptitude scores in the regression and obtain the results in Regression 3. What does this alternative specification suggest about "learning-by-doing" in the mailroom? Is there any statistical evidence that experience affects productivity? Conduct an appropriate hypothesis test.

"Learning-by-doing" means that the longer you do something, the better you get at doing it. Regression 3 suggests that months (and score) have a strongly statistically significant effect on productivity, since the t-ratio for the test of the zero hypothesis on the months coefficient is very large (4.232). The 0.025 critical value of a t-distributed random variable does not exceed 4 until you get down to 2 degrees of freedom, and here we have 24 (=27-3 estimated coefficients). To three decimal places, there is essentially zero probability out in the tail of a t(24) distribution beyond 4.232. We soundly reject the hypothesis that experience does NOT affect productivity, and the point estimate is positive.

g.) Observe the STAT output. Statistically, what accounts for the difference between the implications of Regressions 2 and 3 regarding the effect of experience on productivity?

In this data set, months and score are negatively correlated. If the variable score is omitted, then more months in the mailroom (which should increase productivity) are serving as a proxy for lower aptitude test scores (which will tend to mean lower productivity). The two effects tend to cancel each other. This is akin to the study.sha example from the lab. It seems that the people who are sampled after having been in the mailroom a longer time are also likely to be relatively low scorers on their initial aptitude tests. In contrast, in this sample, relatively few high-scorers on the aptitude test are observed after a large number of months on the job.

h.) Can you give a logical intuitive explanation for the process that leads to the relationship between monthsi and scorei that is revealed in the STAT output?

Bright and productive people tend to get promoted out of menial entry-level tasks like mailroom clerk very rapidly. We don't get to observe them having served in the mailroom for large numbers of months. On the other hand, less bright people might be stuck working in the mailroom forever. Later in the course, we will talk about a variant of omitted variables bias, called "endogeneity bias." If your "observations" in some sense determine the values for their explanatory variables, you have a potential problem. Here, by their performance, people have an influence on how long they have to spend working in the mailroom. So this X variable cannot be viewed as "exogenously given" and therefore independent of the error term in the regression model. More on this later.

i.) If a new employee scores 75% on his aptitude test, what should mailroom management expect in terms of productivity at his 2-month review?

This is a question about the expected value of the dependent variable, given a particular value for the explanatory variable(s). We should take the fitted model from Regression 3, and plug in months=2 and score=75, and see what productivity is predicted. (Some people wondered if score was measured in percent or as a decimal. For a good guess at this, you could simply check the STAT output and see what the mean, minimum and maximum values of the score happened to be.) An ambitious response would recommend that a confidence interval for mean prediction be constructed. However, we have not yet covered the algebra for constructing such a confidence interval, except in the case of a simple regression, which this is not.

j.) In these specifications, what fraction of the variation in productivity across employees can be explained by a model that uses only months of experience? 0.0162 What fraction can be explained by a model that uses both experience and aptitude test scores? 0.5013 Can these be compared? Why or why not?

This is a question about R-squared. From Regression 2, we get the first number; we get the second from Regression 3. Note that goodness-of-fit is improved dramatically when we control for score before looking for the incremental effect of months on productivity. These ordinary R-squared values can be compared, but we would hesitate to conclude that the second was better than the first without first correcting for the fact that, since Regression 3 uses more variables than Regression 2, we would expect the fit to be better. Adjusted R-squared allows the comparison. These numbers are -.0231 versus 0.4598. Note that the fit in Regression 2 is so abysmal that when R-squared is adjusted, the value is even negative. It is unambiguous that Regression 3 is a better fit to the data than Regression 2.
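
As a rough check on the adjusted figures, the standard adjustment 1 - (1 - R-squared)*(n - 1)/(n - k) can be applied to the ordinary R-squared values quoted above. The sketch below assumes k counts every estimated coefficient including the intercept (2 for Regression 2, 3 for Regression 3); the function name is made up for the example.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R-squared for n observations and k estimated coefficients
    (k includes the intercept)."""
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - k)

n = 27  # employees in the sample

# Ordinary R-squared values quoted in the answer above
adj_reg2 = adjusted_r_squared(0.0162, n, k=2)   # prod on months
adj_reg3 = adjusted_r_squared(0.5013, n, k=3)   # prod on months and score

print(f"Regression 2 adjusted R-squared: {adj_reg2:.4f}")
print(f"Regression 3 adjusted R-squared: {adj_reg3:.4f}")
```

Because the inputs are already rounded to four decimals, the results agree with the reported -.0231 and 0.4598 only up to small rounding differences in the last digit.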

k.) On any given shift, the mailroom is staffed by three people. The mailroom supervisor observes that productivity of any particular mailroom worker seems to depend on how hard the other people in the mailroom are working. What data would you collect, what variable(s) would you construct and what model would you estimate in order to test a statistical hypothesis that would show whether there is any evidence to support the supervisor's conjecture?

Since the identities of the other two workers cannot be ordered in any unambiguous way, the best strategy would be to combine their respective productivity measures, perhaps using their mean productivity (call it mprodi). A reasonable specification would then be ols prod months score mprod. The relevant hypothesis test would be a zero-hypothesis test for the coefficient on mprod. If the productivity of the other two people working in the mailroom has no effect on the productivity of the workers in the sample, then the supervisor's casual empiricism would be deemed incorrect. If the point estimate is positive, and the zero hypothesis can be rejected, we would conclude that the supervisor's assertion may be valid.
 

2. You have always wanted to start your own business, and the specialty coffee-bar business appeals to you. You have a friend at the Association of Specialty Coffee Retailers who manages to get you data on the average costs of different establishments. The technology is virtually identical across firms. You have data on 21 firms, for atci = average total costs of production and for qi the rate of output (in cups per hour). Exhibit B shows the analyses you perform.

a.) - What is the interpretation of the intercept in this model? Is it "meaningful"? Explain.

Note that you are not told the units for average total costs. However, a reasonable assumption, given the output from the stat command, is that average total costs are in cents per cup, since quantity is measured in cups per hour. The intercept is the expected average total cost of production if output is zero. In reality, this number must go to infinity, since fixed costs are divided by zero. Since the stat command reveals that the lowest output level observed in the data is 51 units, we cannot say anything at all about the "true" intercept, since it is outside the range of the data. Whatever number we get for an intercept, it is merely an artifact of the best-fitting relationship through the cluster of points in the observed data (all at much higher output levels than zero).
- By how much do average costs change if output is higher by 1 unit? This is a simple question about the slope. The point estimate for this change is -0.38427 cents per cup for each additional unit of output (assuming, as above, that atc is measured in cents per cup).
- By how much do average costs change if output is lower by 10 units? This is again a simple question about the slope. The answer will be +3.8427 cents per cup for the 10-unit change, since we are now talking about lower output, and a change that is ten times as great as in the last question.

b.) Are all firms in your sample experiencing "increasing returns to scale" (declining average costs)? Answer carefully.

This answer requires some thought. I have asserted that all firms have identical technologies. If this is strictly true, then the actual cost experiences of one firm at a particular output level will be a good predictor for the likely cost experiences of another firm as it adjusts its output to this level from another level. We can never be certain, however. A good idea is to look at the plot of average costs against output. If we believe that average cost functions tend to be U-shaped, then it may be the case that some of the firms at higher output levels are beginning to experience diminishing returns to scale, so that average costs are now rising. We really do not yet have enough information to examine this possibility in more detail. However, we will find later that a quadratic form in quantity will be of some help.

c.) Suppose you plan to open a shop that will operate at 80 cups per hour. What do you expect will be your average total costs? What is the 95% confidence interval for these average costs?

Here is where you get to use the unpleasant formula for a "confidence interval for mean prediction." We have a simple regression. To get the center of the confidence interval, one substitutes 80 into the fitted regression model: E[atc] = 93.784 - 0.38427*q. The ingredients you need to finish constructing the confidence interval include the "standard error of the estimate" (s = 10.227), the sample size (n=21), the mean of q (=94.857), the 5% critical value for a t-distribution with 19 degrees of freedom (from the tables, 2.093), and lastly, an estimate of the sum of the squared x-deviations. This can be found by using the variance of the estimate (104.58) and dividing by the square of the standard error on the slope (0.07693*0.07693). (Recall that the variance of the slope is sigma-squared divided by the sum of the squared x-deviations.) All of these quantities get plugged into the formula for a confidence interval for mean prediction, and you are done.
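
For readers who want to see the pieces assembled, here is a minimal Python sketch of the mean-prediction interval in a simple regression, fitted value plus or minus t*s*sqrt(1/n + (q0 - qbar)^2/Sxx). All numbers are the ones quoted in this answer; the variable names are illustrative, and atc is assumed to be in cents per cup.

```python
import math

# Figures quoted in the answer above (Exhibit B)
b1 = 93.784           # intercept of the fitted atc regression
b2 = -0.38427         # slope on q
s_squared = 104.58    # variance of the estimate (s = 10.227)
se_slope = 0.07693    # standard error of the slope
n = 21                # number of firms
q_mean = 94.857       # sample mean of q
t_critical = 2.093    # two-tailed 5% critical value, t with 19 df (from tables)

q0 = 80.0             # planned output, cups per hour

# Center of the interval: fitted average total cost at q0
fitted_atc = b1 + b2 * q0

# Sum of squared x-deviations, recovered from var(slope) = s^2 / Sxx
sxx = s_squared / se_slope ** 2

# Standard error of the mean prediction at q0
se_mean_prediction = math.sqrt(s_squared * (1.0 / n + (q0 - q_mean) ** 2 / sxx))

half_width = t_critical * se_mean_prediction
lower = fitted_atc - half_width
upper = fitted_atc + half_width

print(f"fitted atc at q = 80: {fitted_atc:.2f}")
print(f"95% CI for mean atc:  ({lower:.2f}, {upper:.2f})")
```

With these inputs the point prediction is about 63.0 and the interval runs from roughly 57.8 to 68.3, in whatever units atc is measured.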

d.) Suppose that in the market area where you plan to open your shop, perfect competition prevails and you can be certain that your price per unit for a cup of coffee will be exactly $0.70. Is it statistically likely that you will make some positive profit?

You will make some positive profit if, at your chosen output level, p-atc>0. Thus, for positive profit, you want p greater than atc (or, atc less than p). For testing whether this could be the case, it is often easiest to think about it if you construct a confidence interval for atc and see whether any values less than or equal to $0.70 (that is, 70, if atc is measured in cents per cup) are included in it. If so, then these values of atc are acceptable hypotheses, so the condition could be true. If you have successfully constructed the confidence interval requested in part (c.) you will have all the necessary information.

e.) (BONUS) Given what you know about average total costs (from Economics 1 or the equivalent), is the regression you have specified likely to be appropriate to capture the shape of a typical average total cost function? Explain why or why not. Do you have enough information to determine the profit-maximizing level of output to produce?

Average total costs are often thought to be U-shaped, so we probably want to create a new variable: genr q2=q*q and then consider a specification like ols atc q q2. This is like homework set #3, for the part about marginal cost. We would need to do a little more work to determine the profit-maximizing level of output, even if we assume perfect competition, since this would require that we set output such that price equals marginal cost. From atc at each output level, we can compute atc*q to get total cost. Marginal cost can be read off either total cost or total variable cost. For the linear atc curve we have actually estimated, tc=93.7*q - 0.384*q*q, so that marginal costs are given by 93.7 - 0.768*q. If we set this equal to 70 (the price in cents) and solve for the optimal q, we get a number on the order of 30.85 units of output. Compare this to the plot.
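
The price-equals-marginal-cost step for the linear atc curve can be sketched as follows. The coefficients are the rounded values used in the answer (93.7 and -0.384), the price of 70 assumes atc is in cents per cup, and the function and variable names are made up for the example.

```python
# Rounded coefficients of the fitted linear atc curve, as used in the answer
a = 93.7       # intercept
b = -0.384     # slope on q
price = 70.0   # price per cup in cents ($0.70)

def total_cost(q):
    """Total cost implied by atc = a + b*q, i.e. tc = atc * q."""
    return a * q + b * q * q

def marginal_cost(q):
    """Derivative of total cost with respect to q."""
    return a + 2.0 * b * q

# Solve price = a + 2*b*q for q
q_star = (price - a) / (2.0 * b)

print(f"output where price equals marginal cost: {q_star:.2f}")
print(f"marginal cost at that output:            {marginal_cost(q_star):.2f}")
```

The result, about 30.86 cups per hour, matches the figure "on the order of 30.85" given above.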
Updated: October 28, 1997
Prepared by: Trudy Ann Cameron