INSTRUCTIONS: Answer all questions in the spaces provided (or indicate clearly where you have continued your answer). Calculators are NOT permitted. Reduce all computations to the simplest form so that anyone with a calculator could attain the answer easily. Show your work and reasoning to the fullest extent possible so that part marks can be assigned as warranted. You have 75 minutes to complete this exam. All parts are worth 10 points (and some are much easier than others). Total points = 150. This means roughly 5 minutes for each answer. Budget your time carefully. NOTE: these data are fictitious.
SCENARIO: Suppose you are interested in learning about the determinants of demand for packs of cigarettes (per month) by young adult smokers between the ages of 18 and 30. An elected official with ties to the tobacco industry has claimed that falling real incomes in a mild recession that is likely to result from global economic instability will have a large negative effect on cigarette consumption by this group. Being aware of the commonly addictive properties of nicotine, you wish to determine the effect of a decrease in income on cigarette consumption. You have collected a random sample of 30 young adults and have recorded their ages (AGEi in years), incomes (INCi in thousands of dollars), the price they typically pay for a pack of cigarettes (PRICEi in dollars), and the average number of packs per month they smoke (PACKSi per month). The statistical analyses you perform are given in the Exhibits.
1. Fill in the blanks:
Across these 30 individuals, what is the mean number of packs of cigarettes consumed? 14.433
This is a hypothesis about the true mean of a marginal distribution. We need the formula: (mean X - hypothesized value)/(s/(square root of n)). The formula is thus (14.433 - 10)/(6.1691/sqrt(30)). This answer is 4.433/1.126 = 3.937 (although you are only required to provide the correct formula on the exam).
3. Does Regression 1 make sense? Why or why not?
We are trying to figure out what explains packs of cigarettes consumed per month. It does not make much sense to propose packs of cigarettes consumed as the only variable that explains income levels. Other regressions can be used to try to model the determinants of income levels, but Regression 1 would be a rather silly one, under any circumstances.
4. Based on Regression 2, what is the verbal interpretation of the slope? Comment. Test the hypothesis that a $1,000 decrease in annual income (recession) will have no effect on the number of packs of cigarettes consumed per month by smokers in this age group. If the hypothesis can be rejected, comment upon the qualitative importance of the relationship, in terms of improved health outcomes for smokers as a consequence of recession.
The slope in Regression 2 gives the "change in the expected number of packs of cigarettes consumed in one month for a one-unit (namely $1000) increase in the young adult's income." A $1000 decrease in annual income is a "1 unit" decrease in annual income, since income is measured in thousands of dollars. Whether we are talking about an increase or a decrease in an explanatory variable, the hypothesis that the slope is zero is tested the same way. Look at the t-test statistic (t-ratio) on INC. This is 10.85, which is a really implausible value of a t-distributed random variable with 28 degrees of freedom. The implausibility of the null hypothesis is confirmed by the fact that the probability in the tails of such a distribution beyond + or - 10.85 is essentially zero (when rounded off to three decimal places). While the point estimate is strongly statistically significant, the predicted decrease in packs consumed per month for a $1000 decrease in income is only 0.26 packs. This response is not very "large," so there will probably not be much effect on people's health as a result.
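As a rough check on the arithmetic in answers 2 and 4, the two t-statistics can be recomputed from the reported summary numbers. This is only a sketch: the variable names are illustrative, and the slope (0.26122) and its standard error (0.02408) are the Regression 2 values quoted elsewhere in this answer key.

```python
import math

# Numbers quoted in the answer key (Descriptive Statistics and Regression 2).
n = 30                    # sample size
mean_packs = 14.433       # sample mean of PACKS
sd_packs = 6.1691         # sample standard deviation of PACKS
hypothesized_mean = 10.0  # null hypothesis for question 2

# Question 2: one-sample t-statistic, (mean - hypothesized) / (s / sqrt(n)).
t_mean = (mean_packs - hypothesized_mean) / (sd_packs / math.sqrt(n))

# Question 4: t-ratio on INC for the null hypothesis of a zero slope.
slope_inc = 0.26122
se_inc = 0.02408
t_slope = slope_inc / se_inc

# Predicted change in packs per month for a one-unit ($1000) drop in income.
change_for_1000_drop = -1.0 * slope_inc

print(f"t for mean = 10: {t_mean:.3f}")
print(f"t for zero slope on INC: {t_slope:.2f}")
print(f"change in packs for a $1000 drop: {change_for_1000_drop:.2f}")
```

The first statistic has 29 degrees of freedom and the second has 28; both are far out in the tails, which is the basis for rejecting each null hypothesis.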
5. Based upon the simple regression results of Regression 2, do cigarettes appear to be inferior (as opposed to normal) goods? Explain how you have reached this conclusion.
Recall from your knowledge of basic economic demand theory that a normal good is one for which quantity demanded increases as income increases, whereas an inferior good is one for which quantity demanded decreases as income increases. Cigarettes appear to be a normal good, based on the results of this simple regression: the estimated slope on income is positive (0.26), so expected consumption rises with income.
6. Based on Regression 2, what level of monthly cigarette consumption is expected for a smoker with an annual income of $50,000? Give the formula for a point estimate and explain explicitly how a 95% confidence interval for this prediction would be constructed (plug in all the numbers). Why should you use caution in making this prediction?
The point estimate of cigarette consumption at an income level of $50,000 is found by plugging this income into the fitted regression equation: E[packs] = 3.9934 + 0.26122 * (50) = 17.0544. Constructing the confidence interval involves using the formula provided on the sheet of equations and realizing that this E[packs] is the middle of the confidence interval, plus or minus t * 2.7521 * sqrt{ 1/30 + [(50-39.967)^2]/(sum of the little xi-squared) }. Here 2.7521 is sigma, the square root of SIGMA**2 = 7.5739 from the output, and t = 2.048 is the critical value of a t-distribution with 28 degrees of freedom that leaves 2.5% in each tail. The sum of the little xi-squared can be found by realizing that it appears inside a square root sign in the denominator of the formula for the standard error of the slope coefficient. The numerator is sigma. Thus SIGMA**2, divided by (0.02408)^2, gives the desired piece, which will be approximately 13061.9. The +/- part will thus be 2.048 * 2.7521 * sqrt(0.033333 + (100.66108/13061.9)) = 2.048 * 2.7521 * sqrt(0.04104) = 2.048 * 2.7521 * 0.20258 = 1.142. Thus, the confidence interval is 17.054 +/- 1.142 = [15.912, 18.196]. This would be the full set of hypotheses (about expected packs of cigarettes per month for somebody with $50,000 income) that could not be rejected, according to this model.
7. You finally remember that demand functions are functions of several variables, not just one at a time. You estimate Regression 3 in order to ascertain the effects of both income and price on the number of packs of cigarettes consumed per month. Controlling for income levels, do cigarette prices have any statistically discernible effect on cigarette consumption? What will be the expected effect of a $0.50 additional tax on each pack of cigarettes?
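For the interval asked for in question 6, the following sketch applies the standard simple-regression formula for a confidence interval on the mean prediction: fitted value +/- t * sigma * sqrt(1/n + (x0 - xbar)^2 / sum of squared deviations of x). It is a check under stated assumptions, not the official solution: the critical value 2.048 is the tabulated 97.5th percentile of a t-distribution with 28 degrees of freedom (it does not appear in the exhibits), and all variable names are illustrative.

```python
import math

# Regression 2 quantities quoted in the answer key.
n = 30
intercept = 3.9934
slope = 0.26122
se_slope = 0.02408   # standard error of the slope on INC
sigma_sq = 7.5739    # SIGMA**2 from the output
mean_inc = 39.967    # sample mean of INC (thousands of dollars)
x0 = 50.0            # income of interest (thousands of dollars)
t_crit = 2.048       # assumption: tabulated t(0.025, 28 d.f.)

sigma = math.sqrt(sigma_sq)

# se(slope) = sigma / sqrt(sum of squared deviations of x), so invert it.
sum_sq_dev = sigma_sq / se_slope ** 2

point_estimate = intercept + slope * x0
se_mean_pred = sigma * math.sqrt(1.0 / n + (x0 - mean_inc) ** 2 / sum_sq_dev)
half_width = t_crit * se_mean_pred
lower, upper = point_estimate - half_width, point_estimate + half_width

print(f"sum of squared deviations of INC: {sum_sq_dev:.1f}")
print(f"point estimate: {point_estimate:.4f}")
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```

Recovering the sum of squared deviations from the slope's standard error avoids needing the raw data, which is the same trick the written answer relies on.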
In a regression that also contains INC, the slope on the price variable is -11.390. If we believe we are actually estimating a demand function (i.e. price is exogenous, which it probably is for these individual consumers), then it appears that a one dollar increase in the price of cigarettes will cause the number of packs consumed per month to fall by 11.39 on average in the sample. This is a much bigger effect than the effect of a $1000 decrease in income. However big this point estimate may be, it is not statistically significantly different from zero, which means we cannot reject the hypothesis that an increase in price has NO effect on consumption. If these data constituted a random sample from the population of California smokers in this age group, this model would suggest that the recently voter-approved $0.50 per pack increase in cigarette taxes would cause these smokers to cut back by an average of 11.39/2 packs per month, or by about 5.7 packs per month. However, we cannot be confident that they will cut back at all (at least at the 95% level).
8. What is the interpretation of the intercept term in Regression 3? Should you be interested in testing statistical hypotheses about the magnitude of the intercept in this model? Why or why not? Explain.
In Regression 3, the intercept gives the expected number of packs per month for somebody with zero income and who faces a zero price of cigarettes. Since the minimum values of income and price in the data are 0.00 and 2.04, nobody in the sample faces a zero price, so these conditions do not hold jointly anywhere in the data sample that was used for estimation. Therefore, we will not spend much time worrying about the intercept, since it is merely an artifact of extending the fitted regression plane back to the origin in the income-price plane.
9. Being sensitized to the ever-present potential for omitted variables bias, you begin to wonder whether the results in Regression 3 are robust. You collected age data when you surveyed this sample of smokers, and cumulative time spent smoking might affect monthly demand for cigarettes, since this is widely understood to be an addictive product. Based on the results of Regression 4, on the Descriptive Statistics, and on the data displayed in Plot 1 and Plot 2, assess whether and why the coefficient on the income variable differs from that estimated in Regression 3. Is a recession likely to improve health outcomes by causing a statistically significant decrease in monthly cigarette consumption?
When you include age in the regression, along with income and price, the coefficient on income drops from 0.26 to 0.08, less than one-third as big. This new point estimate is still statistically significantly different from zero, since the prob-value associated with the value of the t-test statistic for the zero hypothesis is only 0.03 (less than the 0.05 required for statistical significance at the 5% level). The coefficient drops because age and income in the sample are rather highly positively correlated (at 0.88). When age was left out of the model, income carried a good portion of its explanatory power, and about 2/3 of the apparent income effect was actually an age effect. Regression 4 says that for each extra year of age, cigarette consumption goes up by almost a whole pack per month, and this effect is overwhelmingly statistically significant. Whether the "actual" effect of a recession on incomes and therefore on cigarette consumption (and thus on health) is likely to be enough to make a detectable difference to health is not really clear from this regression.
10. Does failure to include the age variable in Regression 3 lead to bias in the estimation of the effect of price on packs of cigarettes consumed per month? Explain carefully.
Age and price have very little correlation (-0.23). Without age, the statistically insignificant price coefficient is -11.39, whereas with age included, the price coefficient changes to -0.53. This is a large relative difference, but since neither point estimate is statistically significantly different from zero, there is no real point in comparing these estimates. There is no real effect of price on cigarette consumption in these data.
11. For Regression 4, explain the use of the / auxrsqr option on the OLS command. What does it tell you here?
The AUXRSQR option on the ols statement in Regression 4 is included to explore for potential sources of multicollinearity among the RHS regressor variables. In this instance, the R-squared values from the auxiliary regressions of each RHS variable on all of the other RHS variables show that there are likely to be problems between age and income, as we have already suspected. In contrast, no other variables in the model bear much of a linear relationship to price (it is more-or-less "orthogonal" to the other regressors).
12. For Regression 4, test the hypothesis that none of the explanatory variables has any effect on the dependent variable. Explain your reasoning.
A test of whether all slopes could be simultaneously equal to zero is accomplished by an F-test of the "joint significance" of the regressors. If all slopes are simultaneously zero, then the ratio of the mean explained sum of squares to the mean residual sum of squares should be F-distributed with 3 and 26 degrees of freedom. We know that a value of 92.795 for such a random variable is highly implausible, because the P-value tells us that there is virtually NO probability in the upper tail beyond this value. Thus we choose to reject the hypothesis that all slopes are zero. (However, this is actually obvious, since we know from the t-test statistics that two of them are individually statistically significantly different from zero.)
13. For Regression 4, test the hypothesis that neither of the "economic" variables (i.e. PRICEi and INCi) has any effect on the dependent variable. Explain your reasoning.
Fortunately, we have been provided with a ready-made F-test of the joint hypothesis that the slopes on inc and price are simultaneously zero. This is the first "test...end" block of test statements. Since the p-value for this joint test is greater than 0.05, we CANNOT reject the hypothesis that the "economic" determinants of cigarette consumption are jointly zero. If you did a confid inc price statement, you would find that zero is well within the one-dimensional confidence interval for the price coefficient, and it is just outside the one-dimensional confidence interval for the income coefficient. However, the point (0,0) lies just inside the joint confidence ellipse for the joint test. (See the relevant confidence ellipse just after the appropriate F-test.)
14. For Regression 4, test the hypothesis that being one year older, in this age group of smokers, means that you consume, on average, one more pack of cigarettes per month.
This is a test that the slope coefficient on the age variable is one. This test is not produced automatically when you run a regression, although you are lucky that the person who ran this regression thought to ask SHAZAM to produce a test of this hypothesis. The value of the t-test statistic is found by taking (0.98616 - 1)/0.1833, i.e. the difference between the point estimate and the hypothesized value, divided by the standard error of the point estimate. This number should be t-distributed with 26 degrees of freedom if the null hypothesis is true. The value of the test statistic turns out to be -0.0755, which is very small, and close to the expected value of the t-test statistic if the null hypothesis is true (the expected value is zero, of course, since a t-distribution is a bell-shaped distribution centered on zero, much the same shape as a standard normal). The probability in the two tails of the distribution, out beyond + or - 0.0755, is 0.94039. Thus it is entirely plausible that one might observe a point estimate of the age coefficient of 0.98616 if the true slope is 1. So the answer is that yes, we cannot reject that being one year older means you consume one more pack of cigarettes per month.
15. An expert in smoking behavior has asserted for years that "A recession that cuts gross incomes of smokers by $10,000 per year in this age group would have the same effect on cigarette consumption as turning back the clock by one year for these smokers." For Regression 4, assess the statistical validity of this assertion.
In a linear model, the slope on the income variable is the same whether we are talking about an increase or a decrease in income. The slope gives the change in expected number of packs per month for a $1,000, not a $10,000, change in income, with the change in consumption being the same sign as the change in income. For the age variable, the slope is the change in consumption for a 1-year change in age, with the change again being the same sign as the change in age. Therefore, we would have to consider whether 10 times the income coefficient could be equal to the age coefficient. The last test statement invokes exactly this test. If this is true, then 10 times the income coefficient minus the age coefficient is zero. The program calculates the point estimate of this difference, and then (recognizing that both the income and the age coefficient are random variables with their own respective variances and covariances), calculates the standard error of this linear function of the estimated parameters. The point estimate of this function, minus its hypothesized value (0), divided by its standard error, is -.2776 (and should be a t-distributed random variable with 26 degrees of freedom). The p-value for this test statistic is 0.78354, which is far greater than the 5% cutoff value, so we cannot reject this hypothesis.
Bonus: (Trickier) Suppose you are trying to use the model in Regression 4 to predict how many packs per month each of these smokers will consume ten years from now. Is this possible?
This is a subtle question. I'll be impressed if anybody gets it. At the simplest level, people might respond to this question by suggesting that we construct a 95% confidence interval for prediction for each person, substituting their values of (age+10), inc and price into the fitted regression model to come up with a point estimate. We do not cover the matrix algebra version of confidence intervals for mean prediction in this course, but there will be an analogous formula for multiple regression models that generalizes the simple regression formula that we have seen. You would get part marks for recognizing that something like this must be possible. The bigger issue when predicting the effects of age, however, comes from a very standard conundrum encountered when collecting age data from a single cross-sectional sample. This sample does not follow individuals as they age; it just looks at different individuals of different ages at the same point in time. Thus the "age effects" are confounded with "cohort effects." Perhaps people who are older now began smoking in an era when smoking was not perceived to be as dangerous, so maybe they have always smoked more and always will. They smoke a lot now not because they are old, but because they are part of a cohort that had a different attitude about smoking than do other cohorts. On the other hand, people who are younger now may smoke fewer packs per month because there is more of a social stigma against smoking, so they may always smoke less, no matter how much older they get. "Cohort" effects are never discernible in a single cross-sectional sample because there is perfect collinearity between "year born" (cohort) and age. You cannot discern the distinct effects of age as opposed to cohort if they are always perfectly collinear.
However, if you had cross-sections taken in a number of different years, you could break the perfect collinearity because you would have 25-year-olds from several different cohorts in the estimation sample. This would give you a chance at isolating the pure age effect. Since a single cross-section cannot separate age and cohort effects, you cannot control for cohort differences in your regression, so the coefficient on age will actually be a compound effect of both age and cohort, and may be afflicted by omitted variables bias. If it is cohort, and not age, that dictates smoking habits, then as these people age, they will remain members of the same cohort, so their smoking habits will not change at all. But the (potentially biased) age coefficient will predict that as they age, they will smoke more.
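The collinearity argument can be illustrated with a small sketch. Everything here is hypothetical and not taken from the exam data: the survey years (1990, 2000, 2010), the age range, and the helper name pearson are all assumptions made for illustration. In a single survey year, birth year equals survey year minus age, so the correlation between age and cohort is exactly -1; pooling several survey years breaks that link.

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

ages = list(range(18, 31))  # hypothetical respondents aged 18 to 30

# Single cross-section (hypothetical survey year 2000):
# cohort (birth year) is an exact linear function of age.
single_ages = ages
single_cohorts = [2000 - a for a in ages]
r_single = pearson(single_ages, single_cohorts)

# Pooled cross-sections from three hypothetical survey years:
# the same age now appears in several different cohorts.
pooled_ages, pooled_cohorts = [], []
for survey_year in (1990, 2000, 2010):
    for a in ages:
        pooled_ages.append(a)
        pooled_cohorts.append(survey_year - a)
r_pooled = pearson(pooled_ages, pooled_cohorts)

print(f"single cross-section corr(age, cohort): {r_single:.3f}")
print(f"pooled cross-sections corr(age, cohort): {r_pooled:.3f}")
```

With a correlation of -1 in magnitude, a regression cannot include both age and cohort; once the magnitude falls below 1, separate coefficients become estimable in principle.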
What is the highest observed price paid for cigarettes across these 30 people? $2.25
What is the standard deviation in incomes across the sample? $21.227 (in thousands of dollars, i.e. $21,227)
Do the descriptive statistics you have just provided refer to the joint distribution of these three variables, to their conditional distributions, or to their marginal distributions? the marginal distributions
What is the correlation between AGEi and INCi in this sample? 0.88253
What are the units for this correlation measure? correlation is a unit-free measure of linear association
2. Using the Descriptive Statistics only, test the hypothesis that the true marginal mean number of packs of cigarettes consumed across all young adult smokers in this age group is 10 packs per month.