UNIVERSITY OF CALIFORNIA, LOS ANGELES
Department of Economics

Economics 143 (Cameron) - Applied Regression Analysis

Problem Set #3: Simple Regression; Estimation

Outline of Solutions

Due: October 22, 1998

INSTRUCTIONS: This homework set is intended to consolidate in your mind what happens when you ask SHAZAM to run an OLS regression.

NETWORK FILES NEEDED: transfer.dat, doodad.dat. NOTE: If you have BruinOnline or other Web access, you should be able to get the contents of these files by going to http://www.sscnet.ucla.edu/98W/econ143-1. Select the link to Problem Sets, find Problem Set #3, and once inside, click on the name of the file you want. Or, you can go directly to the list of Data Sets for the course and look for the ones you need by name. You should then be able to highlight, copy, and save to a new file in SHAZAM the relevant contents of the file of interest, and proceed with the homework. Be sure to get the file extensions right (either *.sha or *.dat, accordingly). See the instructions in the SHAZAM for WINDOWS orientation handout.

1. Do a very small "simple regression" problem by hand, using the computations necessary to arrive at b1, b2, and s^2 = (1/(n-2)) Σ ei^2. Also calculate the standard errors of the point estimates for b1 and b2. (You might want to postpone the standard error calculations if we do not get to this in time during the lectures.) You will probably want to mimic some of the steps displayed in the handout entitled "Table 5.4" which shows the kinds of step-by-step calculations that are now done much more efficiently by computers than by people. Assume that your data are as follows:

Y    6    5    1    0
X    1    2    3    4

Verify your results using the appropriate one-line OLS command in SHAZAM. (Note that with this small number of observations, you will probably find it easiest to embed the data directly within the SHAZAM commands, rather than reading the data from a separate file. See the handout on how to run SHAZAM for Windows for how to do this: use the READ statement with no filename given.) NOTE: Be sure to regress Y on X (i.e. OLS Y X), not vice versa.

When doing this by hand, I certainly recommend following the format of Table 5.4 in the handout. However, I will provide answers here by getting SHAZAM to make all of the individual calculations that would be necessary to arrive at the numbers that are needed.

 |_sample 1 4

 |_read y x
    2 VARIABLES AND        4 OBSERVATIONS STARTING AT OBS       1
 
 |_* get ybar and xbar

 |_stat y x / mean=m
 NAME        N   MEAN        ST. DEV      VARIANCE     MINIMUM      MAXIMUM
 Y            4   3.0000       2.9439       8.6667      0.00000       6.0000
 X            4   2.5000       1.2910       1.6667       1.0000       4.0000

 |_* create the deviations from the means
 |_genr ydev=y-m:1
 |_genr xdev=x-m:2

 |_* check these values
 |_print ydev xdev
       YDEV           XDEV
    3.000000      -1.500000
    2.000000     -0.5000000
   -2.000000      0.5000000
   -3.000000       1.500000

 |_stat ydev xdev
 NAME        N   MEAN        ST. DEV      VARIANCE     MINIMUM      MAXIMUM
 YDEV         4  0.00000       2.9439       8.6667      -3.0000       3.0000
 XDEV         4  0.00000       1.2910       1.6667      -1.5000       1.5000

 |_* generate the terms to be summed in the numerator and denominator
 |_genr xydev=xdev*ydev
 |_genr xxdev=xdev*xdev

 |_print xydev xxdev
       XYDEV          XXDEV
   -4.500000       2.250000
   -1.000000      0.2500000
   -1.000000      0.2500000
   -4.500000       2.250000

 |_* calculate and save the sums of these for later

 |_stat xydev xxdev / sum=s
 NAME        N   MEAN        ST. DEV      VARIANCE     MINIMUM      MAXIMUM
 XYDEV        4  -2.7500       2.0207       4.0833      -4.5000      -1.0000
 XXDEV        4   1.2500       1.1547       1.3333      0.25000       2.2500

 |_* now explicitly calculate the regression parameters
 |_gen1 b2=s:1/s:2
 |_gen1 b1=m:1-b2*m:2

 |_print b1 b2
     B1
    8.500000
     B2
   -2.200000

 |_* calculate the error terms
 |_genr e=y-b1-b2*x

 |_print e
     E
  -0.3000000      0.9000000     -0.9000000      0.3000000

 |_genr e2=e*e
 |_stat e2 / sum=resss
 NAME        N   MEAN        ST. DEV      VARIANCE     MINIMUM      MAXIMUM
 E2           4  0.45000      0.41569      0.17280      0.90000E-01  0.81000

 |_gen1 s2=(1/(4-2))*resss:1
 |_print s2
     S2
   0.9000000

 |_* do the OLS and see if it comes out the same as doing it the long way
 
 |_ols y x
 
 REQUIRED MEMORY IS PAR=     2 CURRENT PAR=   500
  OLS ESTIMATION
        4 OBSERVATIONS     DEPENDENT VARIABLE = Y
 ...NOTE..SAMPLE RANGE SET TO:      1,      4
 
  R-SQUARE =   0.9308     R-SQUARE ADJUSTED =   0.8962
 VARIANCE OF THE ESTIMATE-SIGMA**2 =  0.90000
 STANDARD ERROR OF THE ESTIMATE-SIGMA =  0.94868
 SUM OF SQUARED ERRORS-SSE=   1.8000
 MEAN OF DEPENDENT VARIABLE =   3.0000
 LOG OF THE LIKELIHOOD FUNCTION = -4.07874
 
 VARIABLE   ESTIMATED  STANDARD   T-RATIO        PARTIAL STANDARDIZED ELASTICITY
   NAME    COEFFICIENT   ERROR       2 DF   P-VALUE CORR. COEFFICIENT  AT MEANS
 X         -2.2000     0.4243      -5.185     0.035-0.965    -0.9648    -1.8333
 CONSTANT   8.5000      1.162       7.316     0.018 0.982     0.0000     2.8333

For this trivial data set, the fitted regression line has an intercept of 8.5 and a slope of -2.2. Furthermore, both of these point estimates are statistically significantly different from zero at the 5% level. In fact, even with this tiny number of degrees of freedom (only 2), the slope is significantly different from zero at the 3.5% level, and the intercept is different from zero at the 1.8% level.
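The hand calculation can also be cross-checked outside SHAZAM. The following is a minimal sketch in Python (not part of the course software; the variable names are my own) that repeats the steps in the transcript above and adds the two standard errors, using the usual simple-regression formulas se(b2) = sqrt(s^2 / Σ xdev^2) and se(b1) = sqrt(s^2 Σ Xi^2 / (n Σ xdev^2)).

```python
import math

# Data from Problem 1.
y = [6.0, 5.0, 1.0, 0.0]
x = [1.0, 2.0, 3.0, 4.0]
n = len(y)

ybar = sum(y) / n
xbar = sum(x) / n

# Sums of cross-products and squares of deviations from the means.
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)

b2 = sxy / sxx          # slope
b1 = ybar - b2 * xbar   # intercept

# Residuals and estimated error variance with n - 2 degrees of freedom.
e = [yi - b1 - b2 * xi for xi, yi in zip(x, y)]
s2 = sum(ei ** 2 for ei in e) / (n - 2)

# Standard errors of the slope and the intercept.
se_b2 = math.sqrt(s2 / sxx)
se_b1 = math.sqrt(s2 * sum(xi ** 2 for xi in x) / (n * sxx))

print(b1, b2, s2, se_b1, se_b2)
```

Up to floating-point rounding, the printed values should agree with the B1, B2, SIGMA**2 and STANDARD ERROR entries in the SHAZAM output above.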

I have prepared a virtual handout that goes over the geometric intuition of finding the values for the slope and intercept that minimize the sum of squared (vertical) errors from each point to the "best" regression line. The objective function in ordinary least squares (the function we are trying to minimize by our choices of the slope and intercept) is quadratic in these two unknown parameters, so we can readily solve for the best values of these parameters by taking derivatives in each direction, and setting them equal to zero. This yields the "normal equations": two equations in two unknowns that are solved to yield the formulas for slope and intercept that are used to produce parameter estimates.
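To make the "two equations in two unknowns" concrete, here is a small Python sketch (an illustration only, with variable names of my own choosing) that writes out the normal equations for the Problem 1 data and solves them directly by Cramer's rule.

```python
# Data from Problem 1.
y = [6.0, 5.0, 1.0, 0.0]
x = [1.0, 2.0, 3.0, 4.0]
n = len(y)

sum_x = sum(x)
sum_y = sum(y)
sum_xx = sum(xi * xi for xi in x)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Normal equations (from setting both partial derivatives of the
# sum of squared errors to zero):
#   n * b1     + sum_x  * b2 = sum_y
#   sum_x * b1 + sum_xx * b2 = sum_xy
det = n * sum_xx - sum_x ** 2

b1_normal = (sum_y * sum_xx - sum_x * sum_xy) / det
b2_normal = (n * sum_xy - sum_x * sum_y) / det

print(b1_normal, b2_normal)
```

Solving the system this way gives the same intercept and slope as the deviations-from-means formulas used in the SHAZAM transcript.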

2. Determine whether the following models are linear in the parameters, or the variables, or both. Which of these models can be estimated as linear regression models (possibly after transformation of the data)?

a.) Yi = B1 + B2 (1/Xi) + ui

This population regression function is linear in the parameters and linear in Y, but it is nonlinear in X, since X appears in reciprocal form. This can be estimated by OLS because we would simply create a new variable, say Z=1/X, and regress Y on Z. The desired coefficients B1 and B2 would result.

b.) Yi = B1 + B2 log(Xi) + ui

This population regression function is also linear in parameters. It, too, is linear in the dependent variable, Y. However, it is nonlinear in X. Still, we can estimate any model that is linear in parameters, even if it is not linear in the variables. All we do is redefine the variables, such as W = log(X), and regress Y on W to get the desired intercept and slope coefficients.

c.) Yi = B1 Xi^B2 e^ui (the exponent on Xi is B2; the exponent on e is ui)

This might seem a little tricky at first, but you will begin to recognize this class of models. As shown, the model is linear in Y, but nonlinear in X (since X appears with an exponent) and nonlinear in B2, since this parameter is itself an exponent. As it stands, it cannot be estimated by OLS. Importantly, the error term u, rather than simply being added to the expression, enters multiplicatively as an exponent on the mathematical constant e. Things look bleak for OLS until you notice that taking the natural logarithm of both sides preserves the same underlying relationship between the variables: log(Yi) = log(B1) + B2 log(Xi) + ui. This is linear in the parameters log(B1) and B2. If you then redefine the variables, say, by using genr ly=log(y) and genr lx=log(x) in SHAZAM, then the command ols ly lx will yield an estimate of B2 as the slope and an estimate of log(B1) as the intercept; exponentiating the intercept gives the corresponding estimate of B1.

Note: ln and log are used interchangeably to signify natural logarithms (log to the base e). Base 10 logarithms are almost never used in econometrics.
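The log transformation in part (c.) can be illustrated with a short Python sketch. The data here are made up purely for illustration: I assume hypothetical values B1 = 3 and B2 = 0.5 and generate Y with no error term, so the regression of log(Y) on log(X) should recover both parameters almost exactly. The helper name simple_ols is my own.

```python
import math

def simple_ols(x, y):
    """Return (intercept, slope) from a simple regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Hypothetical parameter values and made-up X data (illustration only).
B1_true, B2_true = 3.0, 0.5
x_raw = [1.0, 2.0, 4.0, 8.0, 16.0]
y_raw = [B1_true * xi ** B2_true for xi in x_raw]   # ui = 0 for every i

# Equivalent of genr ly=log(y) and genr lx=log(x).
lx = [math.log(v) for v in x_raw]
ly = [math.log(v) for v in y_raw]

log_b1_hat, b2_hat = simple_ols(lx, ly)
b1_hat = math.exp(log_b1_hat)   # undo the log to get back to B1

print(b1_hat, b2_hat)
```

Note that the intercept of the transformed regression is on the log scale, which is why the last step exponentiates it.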

4. Two special cases. Note: these exercises are easiest if you mimic the algebra covered in the two class handouts on OLS estimator formulas and derivation of the variances of these estimators. Just zero-out the parameters that are not relevant.

a.) There are occasions when the two-variable population regression function (PRF) assumes the following form:

Yi = B2Xi + ui

In this model, the intercept term is absent (perhaps some theory tells us it should be exactly zero--that when X is zero, Y must also be zero). The model is therefore known as "regression through the origin." SHAZAM can estimate such a model by using the command OLS Y X / NOCONSTANT. For this model, show that:

i.) b2 = Σ XiYi / Σ Xi^2

For a model with no intercept, the fitted regression is Yi = b2Xi + ei. Thus, the SRF error can be written as ei = Yi - b2Xi. The "best-fitting line" through a scatter of points for Xi and Yi, using the least squares criterion, requires minimizing the sum of the squared errors (measured vertically from the chosen regression line). Thus the problem is min Σ (Yi - b2Xi)^2. As usual, we take the derivative of this function with respect to the unknown parameter b2, set it equal to zero, and solve. The first-order condition is 2 Σ (Yi - b2Xi)(-Xi) = 0 at the optimal value of b2. We can cancel the 2 and the (-) and the equality will still hold. The equation simplifies to Σ XiYi - b2 (Σ Xi^2) = 0, or b2 = Σ XiYi / Σ Xi^2, as claimed.
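As a numerical check on this formula, the following Python sketch (an illustration, reusing the Problem 1 data rather than anything assigned in this question) computes the no-intercept slope, confirms that the first-order condition holds, and verifies that nudging the slope in either direction raises the sum of squared errors.

```python
# Problem 1 data, reused here only as a convenient example.
y = [6.0, 5.0, 1.0, 0.0]
x = [1.0, 2.0, 3.0, 4.0]

def sse_origin(slope):
    """Sum of squared errors for the line Y = slope * X (no intercept)."""
    return sum((yi - slope * xi) ** 2 for xi, yi in zip(x, y))

# Regression through the origin: b2 = sum(X*Y) / sum(X^2).
b2_origin = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

# First-order condition: sum((Y - b2*X) * X) should be zero at the optimum.
foc = sum((yi - b2_origin * xi) * xi for xi, yi in zip(x, y))

print(b2_origin, foc)
print(sse_origin(b2_origin - 0.1), sse_origin(b2_origin), sse_origin(b2_origin + 0.1))
```

For these data Σ XiYi = 19 and Σ Xi^2 = 30, so the slope through the origin is 19/30, which is very different from the slope of -2.2 obtained when an intercept is included.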

ii.) Var(b2) = σ^2 / Σ Xi^2      (This question should be considered optional if we do not get to the discussion of the variance of a regression slope estimator before the problem set is due.)

In order to derive the variance of b2, we can mimic the steps in the handout where we considered the expected values and variances of the coefficients in the familiar two-coefficient model. First, recognize that b2 can be expressed as a linear function of the individual Yi values in the data (if it is a random sample, these are independent random variables with the same variances...although they have different conditional means). We can write:

b2 = (X1/(Σ Xi^2)) Y1 + (X2/(Σ Xi^2)) Y2 + ... + (Xn/(Σ Xi^2)) Yn.

The usual formula for the variance of a linear combination of random variables can be employed: coefficient-squared times the variance of the first variable, plus coefficient-squared times the variance of the second variable, and so on. We can ignore the usual covariance terms because these independent Yi observations are uncorrelated if the sample is randomly drawn. Var(Yi) = σ^2 for all i, so we can simplify this variance as: Var(b2) = σ^2 Σ Xi^2 / (Σ Xi^2)^2. The sum of squared Xi values in the numerator cancels one of the two factors of that sum in the denominator, and we are left with Var(b2) = σ^2 / (Σ Xi^2). This differs from the usual formula in that we use the actual level of each Xi value, rather than their differences from the overall marginal mean of the observed X values.
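The cancellation in this argument can be checked numerically. In the Python sketch below (illustrative only; the X values are borrowed from Problem 1 and the names are my own), the weights wi = Xi / Σ Xi^2 are formed explicitly, and the sum of their squares is compared with 1 / Σ Xi^2, which is the factor that multiplies σ^2 in Var(b2).

```python
# X and Y values borrowed from Problem 1 as a convenient example.
x = [1.0, 2.0, 3.0, 4.0]
y = [6.0, 5.0, 1.0, 0.0]

sum_xx = sum(xi * xi for xi in x)

# b2 is a linear combination of the Yi with weights wi = Xi / sum(X^2).
w = [xi / sum_xx for xi in x]
b2_from_weights = sum(wi * yi for wi, yi in zip(w, y))

# Var(b2) = sigma^2 * sum(wi^2); the sum of squared weights collapses
# to 1 / sum(X^2).
sum_w2 = sum(wi * wi for wi in w)

print(b2_from_weights, sum_w2, 1.0 / sum_xx)
```

The last two printed numbers agree, which is the step where one factor of Σ Xi^2 cancels.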
 

b.) What happens if your population regression function (PRF) assumes the following form:
 

Yi = B1 + ui ?
The result in this specification is what you would find if you issued to SHAZAM the command OLS Y with no explanatory variables at all. Compare to results of STAT Y.
 

If Yi = b1 + ei is your model, then the error term is just ei = Yi - b1. Minimizing the sum of squared errors from the regression "line" involves taking the derivative of Σ (Yi - b1)^2 and setting it equal to zero. This first-order condition is 2 Σ (Yi - b1)(-1) = 0 at the optimal value of the single unknown parameter, b1. The 2 and the (-1) can be eliminated while preserving the equality. Then Σ Yi - n b1 = 0, or b1 = Σ Yi / n, which is just the marginal mean of Y in the sample. Thus, saying OLS Y will give you the opportunity to test hypotheses about the sample mean of Y. (E.g., you can make great use of the TEST or CONFID auxiliary commands that may follow regression models.)
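A quick Python sketch of this special case, again using the Y values from Problem 1 as an example (the variable names are my own): the least-squares constant is the sample mean, the error variance uses n - 1 degrees of freedom because only one parameter is estimated, and the standard error of the constant is the familiar standard error of a sample mean.

```python
import math

# Y values from Problem 1, used as an example.
y = [6.0, 5.0, 1.0, 0.0]
n = len(y)

# Intercept-only regression: b1 = sum(Y) / n, the sample mean.
b1_only = sum(y) / n

# One estimated parameter, so n - 1 degrees of freedom.
resid = [yi - b1_only for yi in y]
s2_only = sum(r * r for r in resid) / (n - 1)

# Standard error of the constant = standard error of the sample mean.
se_b1_only = math.sqrt(s2_only / n)

print(b1_only, s2_only, se_b1_only)
```

The mean (3.0) and the variance (about 8.6667) match the line for Y in the STAT output from Problem 1, which is the comparison the question asks for.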

 

5. I have sent to the network (and posted on the web) a copy of the data file transfer.dat. Imagine that this file contains data on government transfer payments to families (transfer) and family expenditures on children (childexp). Look at the contents of n:transfer.dat. Create your own SHAZAM command file by opening a new file and entering appropriate commands to accomplish the following tasks.

a.) Read in the data using:

sample 1 100
read(n:transfer.dat) transfer childexp

Remember, if you have copied the file n:transfer.dat from the network to your own diskette, which resides, say, in your a: drive, you would refer to the file as a:transfer.dat.

b.) Using all the data provided, estimate the parameters in a linear regression of "monthly expenditures on family's children" (childexp) on "monthly receipts of transfer payments" (transfer) and obtain the coefficient of determination (r-squared value) for the model. What does this coefficient imply in a simple regression model? (Text pp. 160-164)


 |_sample 1 100

 |_read(transfer.dat) transfer childexp
 UNIT 88 IS NOW ASSIGNED TO: transfer.dat
    2 VARIABLES AND      100 OBSERVATIONS STARTING AT OBS       1
 
 |_stat transfer childexp
 NAME        N   MEAN        ST. DEV      VARIANCE     MINIMUM      MAXIMUM
 TRANSFER   100   284.85       154.45       23854.       51.816       990.34
 CHILDEXP   100   241.79       142.40       20279.       14.740       935.40
 
 |_ols childexp transfer
 
 REQUIRED MEMORY IS PAR=     5 CURRENT PAR=   500
  OLS ESTIMATION
      100 OBSERVATIONS     DEPENDENT VARIABLE = CHILDEXP
 ...NOTE..SAMPLE RANGE SET TO:      1,    100
 
  R-SQUARE =   0.0401     R-SQUARE ADJUSTED =   0.0303
 VARIANCE OF THE ESTIMATE-SIGMA**2 =   19665.
 STANDARD ERROR OF THE ESTIMATE-SIGMA =   140.23
 SUM OF SQUARED ERRORS-SSE=  0.19272E+07
 MEAN OF DEPENDENT VARIABLE =   241.79
 LOG OF THE LIKELIHOOD FUNCTION = -635.213
 
 VARIABLE   ESTIMATED  STANDARD   T-RATIO        PARTIAL STANDARDIZED ELASTICITY
   NAME    COEFFICIENT   ERROR      98 DF   P-VALUE CORR. COEFFICIENT  AT MEANS
 TRANSFER  0.18459     0.9125E-01   2.023     0.046 0.200     0.2002     0.2175
 CONSTANT   189.21      29.53       6.406     0.000 0.543     0.0000     0.7825

The coefficient of determination (the r-squared value) gives the proportion of the variation in the dependent variable that is explained by the right-hand-side variable. Here, it is only about 4%. Not much.
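The r-squared value can be reproduced from other numbers on the printout. This Python sketch (illustrative; it uses only the rounded figures shown in the STAT and OLS output above, so the agreement is approximate) computes R-squared as 1 - SSE/SST, where SST is the sample variance of childexp times n - 1.

```python
# Rounded figures copied from the SHAZAM output above.
n_obs = 100
var_childexp = 20279.0      # VARIANCE of CHILDEXP from the STAT command
sse = 0.19272e7             # SUM OF SQUARED ERRORS-SSE from the OLS output

# Total sum of squares about the mean.
sst = var_childexp * (n_obs - 1)

r_square = 1.0 - sse / sst

# Adjusted R-square penalizes for the two estimated parameters.
r_square_adj = 1.0 - (1.0 - r_square) * (n_obs - 1) / (n_obs - 2)

print(round(r_square, 4), round(r_square_adj, 4))
```

Both values should come out close to the R-SQUARE and R-SQUARE ADJUSTED entries reported by SHAZAM.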

c.) Does this model suggest that
 

(i.) on average, for each additional dollar of transfer payments, these families will spend some of that dollar on their children?

  We usually set this up the other way. The null hypothesis is that families will spend zero dollars out of every additional dollar on their children. This corresponds to a null hypothesis that the slope is zero. The t-test statistic for the slope is 2.023, with 98 degrees of freedom, which looks close to being significant. (We could look up the 5% critical value in the back of the text and find that it is something less than 2 --i.e. between 2.000 and 1.980, the values for 60 and 120 degrees of freedom.) However, the P-value gives us what we need directly. We can reject the zero hypothesis for the slope at any significance level of 4.6% or higher, so we can reject (if only narrowly) at the usual 5%. So yes, it looks like families spend some portion of each extra transfer dollar on the kids (the point estimate is about 18 cents).

(ii.) on average, families spend positive amounts on their children, even if they receive no transfer payments?

  Again, we generally set this up the other way. The null hypothesis is that if people receive no transfer payments, then they spend zero dollars on their children. This is a hypothesis about the intercept being zero. If the point estimate of the intercept is greater than zero, and we can reject the hypothesis of a zero intercept at the 5% level, then we would conclude that people do spend money on their kids even if they receive no transfer payments. Here, the point estimate of the intercept is about $189. The t-test statistic is larger than 6, which is way out in the tails of any t-distribution. The p-value confirms that there is virtually no probability left in the symmetric tails of a t-distribution with 98 degrees of freedom beyond -6.406 and +6.406. We clearly reject the hypothesis that people spend zero money on their kids if they get zero transfer payments.

The answers to these questions concern the slope coefficient and intercept coefficient in the regression. (It is helpful to think about the verbal definition of the slope and intercept in any regression model. The slope is the "change in Y for a one-unit change in X." The intercept is the "expected value of Y when X is zero.")
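The two t-ratios used in part (c.) are simply each coefficient divided by its standard error. The Python sketch below is only an arithmetic check on the rounded figures printed above (so the last digit can differ slightly from SHAZAM's, which works with unrounded values).

```python
# Rounded figures copied from the OLS output for childexp on transfer.
slope_hat = 0.18459
slope_se = 0.9125e-01
const_hat = 189.21
const_se = 29.53

# t-ratio for the "zero hypothesis" = (estimate - 0) / standard error.
t_slope = slope_hat / slope_se
t_const = const_hat / const_se

print(round(t_slope, 3), round(t_const, 3))
```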

d.) A little harder: Does this model suggest that, on average, for an additional dollar of transfer payments, these families tend to spend all of that additional dollar on the family's children? The answer to this also concerns the slope coefficient in the regression.

This is a test of a non-standard hypothesis (i.e. something other than the "zero hypothesis" for which it is automatically assumed that you will be interested in knowing the answer). Now we need to test whether the slope could be one. This could be done explicitly, by calculating the difference between the point estimate and one, and dividing by the standard error of the point estimate. But make your life easier by having SHAZAM do it for you. The test command uses information from the last ols command that has been run.


 |_test transfer=1
 TEST VALUE = -0.81541     STD. ERROR OF TEST VALUE  0.91253E-01
 T STATISTIC =  -8.9357075     WITH   98 D.F.    P-VALUE= 0.00000
 F STATISTIC =   79.846868     WITH    1 AND   98 D.F.  P-VALUE= 0.00000
 WALD CHI-SQUARE STATISTIC =   79.846868     WITH    1 D.F.  P-VALUE= 0.00000
 UPPER BOUND ON P-VALUE BY CHEBYCHEV INEQUALITY = 0.01252

The difference between the point estimate of the slope and one is about -0.8, but the standard error is really tiny, so the test statistic is on the order of about -9, which is a clear rejection at the 5% level. The extremely small P-value confirms this.
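The explicit calculation described above is easy to reproduce. This Python sketch (an arithmetic check only, using the rounded estimate and standard error from the printout) forms the t statistic for the hypothesis that the slope equals one, and also shows where two of the other lines of the TEST output come from: with a single restriction the F statistic is the square of the t statistic, and the Chebychev upper bound on the p-value is 1/t^2.

```python
# Rounded figures copied from the SHAZAM output.
slope_hat = 0.18459
slope_se = 0.91253e-01
hypothesized = 1.0

test_value = slope_hat - hypothesized      # TEST VALUE
t_stat = test_value / slope_se             # T STATISTIC
f_stat = t_stat ** 2                       # F STATISTIC (one restriction)
chebychev_bound = 1.0 / t_stat ** 2        # upper bound on the p-value

print(round(test_value, 5), round(t_stat, 4), round(f_stat, 3),
      round(chebychev_bound, 5))
```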

e.) Plot the data in a scattergram. Examine the plot carefully. Are there any points that are likely to be "influential" in the fitting of a regression line (often called "outliers")? Explain.


 |_plot childexp transfer
 
 REQUIRED MEMORY IS PAR=     2 CURRENT PAR=   500
 FOR MAXIMUM EFFICIENCY USE AT LEAST PAR=     4
       100 OBSERVATIONS
                    *=CHILDEXP
                    M=MULTIPLE POINT
    884.21        |                                *  <--this is the outlier
    821.05        |
    757.89        |
    694.74        |
    631.58        |
    568.42        |
    505.26        |
    442.11        |  ***  *    *    *
    378.95        |  * *  * M** *  *
    315.79        |  M * *M** ***  **
    252.63        |  *MM M*M M*  M  *
    189.47        | * *** ** *M MM
    126.32        |  *M ****MM * * *
    63.158        |  * M* MM M *M *M
   0.32685E-12    |  *M * * *    * *
                   ________________________________________
 
               0.000   300.000   600.000   900.000  1200.000
 
                                TRANSFER

This is a case where the crummy dot-matrix-type SHAZAM plots may have an advantage over the fancier gnuplots. If you had used the gnuplot option on your plot command, you might have found that it was difficult to see the outlier, because it was rather close to the key that gnuplot provides.


One way to tell if a point is data or just related to the key is to connect the points together. There is no natural order to the observations, so it will look like scribbling, but if a point is connected to other points, it is part of the data and not related to the key.


Almost all of the data lie in an amorphous blob with no particular linear relationship at all. It seems, upon inspection of the data, that the outlier in the upper right is completely responsible for the apparent positive slope in the estimated relationship.

Now back to the task at hand... From a simple plot of childexp against transfer, identify a range of values for, say, transfers, that includes ONLY the offending observation. Say this range includes values greater than 800. Use the SKIPIF command to force SHAZAM to leave this observation out of subsequent calculations. The format will be skipif(transfer.ge.800). The ".ge." is the way SHAZAM expresses "greater than or equal to" when comparing variable values to some benchmark. Correspondingly, you could use .le., .gt., .lt., .eq., and .ne. for "less than or equal to," "greater than," "less than," "equal to," and "not equal to." Re-run the above tasks on this reduced data set. What happens to your results from the above regressions?

 |_skipif(transfer.ge.800)
 OBSERVATION    79 WILL BE SKIPPED
 
 |_ols childexp transfer
 
 REQUIRED MEMORY IS PAR=     6 CURRENT PAR=   500
  OLS ESTIMATION
       99 OBSERVATIONS     DEPENDENT VARIABLE = CHILDEXP
 ...NOTE..SAMPLE RANGE SET TO:      1,    100
 
  R-SQUARE =   0.0012     R-SQUARE ADJUSTED =  -0.0091
 VARIANCE OF THE ESTIMATE-SIGMA**2 =   15668.
 STANDARD ERROR OF THE ESTIMATE-SIGMA =   125.17
 SUM OF SQUARED ERRORS-SSE=  0.15198E+07
 MEAN OF DEPENDENT VARIABLE =   234.79
 LOG OF THE LIKELIHOOD FUNCTION = -617.605
 
 VARIABLE   ESTIMATED  STANDARD   T-RATIO        PARTIAL STANDARDIZED ELASTICITY
   NAME    COEFFICIENT   ERROR      97 DF   P-VALUE CORR. COEFFICIENT  AT MEANS
 TRANSFER -0.31404E-01 0.9181E-01 -0.3420     0.733-0.035    -0.0347    -0.0371
 CONSTANT   243.51      28.43       8.564     0.000 0.656     0.0000     1.0371

 |_test transfer=1
 TEST VALUE =  -1.0314     STD. ERROR OF TEST VALUE  0.91812E-01
 T STATISTIC =  -11.233887     WITH   97 D.F.    P-VALUE= 0.00000
 F STATISTIC =   126.20021     WITH    1 AND   97 D.F.  P-VALUE= 0.00000
 WALD CHI-SQUARE STATISTIC =   126.20021     WITH    1 D.F.  P-VALUE= 0.00000
 UPPER BOUND ON P-VALUE BY CHEBYCHEV INEQUALITY = 0.00792

Without the influential outlier in the data set, all of the apparent relationship between transfers and child expenditures disappears. The slope is no longer statistically significantly different from zero. The fitted regression line is essentially flat. As before, we can clearly reject a unit slope. There is NO evidence to support these households spending an extra dollar on the kids for each extra dollar of transfer payments.
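The same effect can be reproduced in miniature. The Python sketch below uses a tiny made-up data set (NOT the transfer.dat data; the numbers and names are purely hypothetical) in which five points show no linear relationship at all, and then adds a single point far out to the upper right.

```python
def simple_ols(x, y):
    """Return (intercept, slope) from a simple regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Made-up "blob" with no linear relationship (illustration only).
x_blob = [1.0, 2.0, 3.0, 4.0, 5.0]
y_blob = [2.0, 1.0, 2.0, 1.0, 2.0]

# The same blob plus one hypothetical outlier in the upper right.
x_all = x_blob + [20.0]
y_all = y_blob + [10.0]

_, slope_without = simple_ols(x_blob, y_blob)
_, slope_with = simple_ols(x_all, y_all)

print(slope_without, slope_with)
```

In this toy example the slope is zero without the extra point and clearly positive with it, which is the same qualitative story as observation 79 in the transfer data.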

INTUITION??? We have a Java Applet that helps you gain an understanding of what types of observations can have a very influential effect on the slope and/or intercept of a regression line. After you start this applet, click on different places in the plot to see how the fitted regression line can get "dragged around" by additional points in different places. See if you can figure out what kinds of outliers are the most "influential" and which do relatively little damage to the slope and/or intercept estimates.


7. In this problem, you will explore the consequences of 'changes in scale' and 'changes in origin' in the measurement of either the dependent or the explanatory variable. Imagine that you have been supplied with six observations on the marginal costs incurred by the Acme Doodad company for the production of one additional doodad. Marginal costs (MC) depend crucially on the level of output (Q) at which the company is producing. The data are available on the network as the file n:doodad.dat (or you could type these data into your program or into your own data file).

MC ($)    117   111   109   114   126   131
Q (#)      94   106   118   130   142   154

a.) Using OLS MC Q, estimate a linear marginal cost "curve" for this firm using SHAZAM. Be sure to give the units associated with each coefficient on your annotated computer output.
 |_sample 1 6
 |_read(doodad.dat) mc q
 UNIT 88 IS NOW ASSIGNED TO: doodad.dat
    2 VARIABLES AND        6 OBSERVATIONS STARTING AT OBS       1
 
 |_print mc q
       MC             Q
    117.0000       94.00000
    111.0000       106.0000
    109.0000       118.0000
    114.0000       130.0000
    126.0000       142.0000
    131.0000       154.0000

 |_* try a straight ols regression on the raw data
 
 |_ols mc q
 REQUIRED MEMORY IS PAR=     1 CURRENT PAR=   500
  OLS ESTIMATION
        6 OBSERVATIONS     DEPENDENT VARIABLE = MC
 ...NOTE..SAMPLE RANGE SET TO:      1,      6
 
  R-SQUARE =   0.5414     R-SQUARE ADJUSTED =   0.4267
 VARIANCE OF THE ESTIMATE-SIGMA**2 =   43.571
 STANDARD ERROR OF THE ESTIMATE-SIGMA =   6.6009
 SUM OF SQUARED ERRORS-SSE=   174.29
 MEAN OF DEPENDENT VARIABLE =   118.00
 LOG OF THE LIKELIHOOD FUNCTION = -18.6204
 
 VARIABLE   ESTIMATED  STANDARD   T-RATIO        PARTIAL STANDARDIZED ELASTICITY
   NAME    COEFFICIENT   ERROR       4 DF   P-VALUE CORR. COEFFICIENT  AT MEANS
 Q         0.28571     0.1315       2.173     0.096 0.736     0.7358     0.3002
 CONSTANT   82.571      16.53       4.996     0.008 0.928     0.0000     0.6998

Marginal cost as a function of quantity appears to be positively sloped. Marginal cost goes up by about $0.29 for each additional unit of output. However, note that this slope is only statistically significantly different from zero at the 10% level, not the usual 5% level, because the standard error of the slope is quite large relative to the size of the point estimate. The marginal cost at zero units of output appears to be about $82.57. However, since there are no output levels anywhere near zero in the data, the intercept is not really meaningful. It is just where the fitted regression line happens to cut through the vertical axis when we project it back to Q=0.

b.) Change of scale: Now measure output in dozens (i.e. GENR QD=Q/12), re-estimate the model, and identify which quantities of interest on the regression output have changed and which have not. Why? What happens to the product (slope coefficient times variable) when you change the scale of measurement of an explanatory variable?

 |_* now measure quantity in numbers of dozens
 |_genr qd=q/12
 |_* regress mc on quantity in dozens
 
 |_ols mc qd
 
 REQUIRED MEMORY IS PAR=     1 CURRENT PAR=   500
  OLS ESTIMATION
        6 OBSERVATIONS     DEPENDENT VARIABLE = MC
 ...NOTE..SAMPLE RANGE SET TO:      1,      6
 
  R-SQUARE =   0.5414     R-SQUARE ADJUSTED =   0.4267
 VARIANCE OF THE ESTIMATE-SIGMA**2 =   43.571
 STANDARD ERROR OF THE ESTIMATE-SIGMA =   6.6009
 SUM OF SQUARED ERRORS-SSE=   174.29
 MEAN OF DEPENDENT VARIABLE =   118.00
 LOG OF THE LIKELIHOOD FUNCTION = -18.6204
 
 VARIABLE   ESTIMATED  STANDARD   T-RATIO        PARTIAL STANDARDIZED ELASTICITY
   NAME    COEFFICIENT   ERROR       4 DF   P-VALUE CORR. COEFFICIENT  AT MEANS
 QD         3.4286      1.578       2.173     0.096 0.736     0.7358     0.3002
 CONSTANT   82.571      16.53       4.996     0.008 0.928     0.0000     0.6998

When the magnitude of the explanatory variable is made smaller by a factor of 12 by measuring output in dozens, the slope coefficient (and its associated standard error) increases by a factor of exactly 12. As a result, [slope*variable] is unchanged. We are still using regression to partition the actual value of Y into three parts: an intercept that is always there, a portion that varies with X, and a random error term. Since we have the same underlying data, the relationship between the variables cannot have changed. Everything else besides the slope coefficient and its standard error is unaffected.
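The change-of-scale result can be verified directly on the doodad data. This Python sketch (an independent check, not SHAZAM; the helper name simple_ols is my own) fits the regression with Q in units and in dozens and compares the slopes and the [slope*variable] products.

```python
def simple_ols(x, y):
    """Return (intercept, slope) from a simple regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Doodad data from the problem statement.
mc = [117.0, 111.0, 109.0, 114.0, 126.0, 131.0]
q = [94.0, 106.0, 118.0, 130.0, 142.0, 154.0]
qd = [v / 12.0 for v in q]          # output measured in dozens

b1_units, b2_units = simple_ols(q, mc)
b1_dozens, b2_dozens = simple_ols(qd, mc)

# The part of MC attributed to output is the same either way.
product_units = [b2_units * v for v in q]
product_dozens = [b2_dozens * v for v in qd]

print(b1_units, b2_units)
print(b1_dozens, b2_dozens)
```

The intercept is the same in both fits, and the slope in dozens is 12 times the slope in units.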

c.) Change of origin: Go back to the original quantity measure, Q, but now measure MC in "dollars in excess of $100." (i.e. GENR MC100=MC-100.) Which quantities are now different from the original model, which aren't, and why?

 |_* now measure marginal cost in dollars in excess of 100
 |_genr mc100=mc-100
 |_* regress "mc in dollars in excess of 100) on plain q
 
 |_ols mc100 q
 
 REQUIRED MEMORY IS PAR=     1 CURRENT PAR=   500
  OLS ESTIMATION
        6 OBSERVATIONS     DEPENDENT VARIABLE = MC100
 ...NOTE..SAMPLE RANGE SET TO:      1,      6
 
  R-SQUARE =   0.5414     R-SQUARE ADJUSTED =   0.4267
 VARIANCE OF THE ESTIMATE-SIGMA**2 =   43.571
 STANDARD ERROR OF THE ESTIMATE-SIGMA =   6.6009
 SUM OF SQUARED ERRORS-SSE=   174.29
 MEAN OF DEPENDENT VARIABLE =   18.000
 LOG OF THE LIKELIHOOD FUNCTION = -18.6204
 
 VARIABLE   ESTIMATED  STANDARD   T-RATIO        PARTIAL STANDARDIZED ELASTICITY
   NAME    COEFFICIENT   ERROR       4 DF   P-VALUE CORR. COEFFICIENT  AT MEANS
 Q         0.28571     0.1315       2.173     0.096 0.736     0.7358     1.9683
 CONSTANT  -17.429      16.53      -1.055     0.351-0.466     0.0000    -0.9683

The point estimate of the intercept parameter changes by 100 units, although its standard error is unchanged. Since all values of MC in the sample have been made smaller by 100 units, so has the intercept. This means the t-ratio and p-value associated with the intercept term also change (as do the elasticities at the means, which depend on the mean of the dependent variable). Nothing happens to the slope, however, since the units for "rise" are the same as before, and slope is still "rise"/"run."
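The change-of-origin result can be checked the same way. In this Python sketch (illustrative; names are my own), MC is shifted down by 100 and the regression is refit: the slope and the residuals are unchanged, and the intercept falls by exactly 100.

```python
def simple_ols(x, y):
    """Return (intercept, slope) from a simple regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Doodad data from the problem statement.
mc = [117.0, 111.0, 109.0, 114.0, 126.0, 131.0]
q = [94.0, 106.0, 118.0, 130.0, 142.0, 154.0]
mc100 = [v - 100.0 for v in mc]     # dollars in excess of $100

b1_orig, b2_orig = simple_ols(q, mc)
b1_shift, b2_shift = simple_ols(q, mc100)

# Residuals are identical, so SSE and SIGMA**2 cannot change.
resid_orig = [m - b1_orig - b2_orig * v for m, v in zip(mc, q)]
resid_shift = [m - b1_shift - b2_shift * v for m, v in zip(mc100, q)]

print(b1_orig, b1_shift, b2_orig, b2_shift)
```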

d.) Part (b.) represented a 'change of scale,' while part (c.) was a 'change of origin.' A special combination of a change of scale and a change of origin is called "standardization." Variable-by-variable, one first subtracts the mean and then divides by the standard deviation. A regression of standardized MC on standardized Q is interesting in that the slope coefficient(s) tell the number of standard deviations by which MC will change when Q changes by one standard deviation. When we begin considering models with more than one explanatory variable, this will be a useful way to compare the relative influence of different explanatory variables on the dependent variable. The units of the different explanatory variables will not matter. (Why?)

The units all drop out, since dividing by the standard deviation, which is in the same units as the variable itself, causes the units to cancel, leaving pure numbers.

SHAZAM produces the coefficients for this "standardized" regression automatically on every run. Locate them on your output. How do these coefficients change between (a.), (b.), and (c.) above? Can you visualize why using a graph? Optional: Can you produce them explicitly by generating the standardized variables directly and regressing them? Try it. (HINT: You can get the means and the standard deviations using the "STAT MC Q / MEAN=mvars STDEV=svars" command. The mean of the first variable, MC, can then be referred to as mvars:1 and its standard deviation as svars:1; likewise, the mean of Q will be mvars:2 and the standard deviation of Q will be svars:2.)

 |_* now try the standardization process; calculate and save means and
 |_* standard deviations

 |_stat mc q / mean=m stdev=s
 NAME        N   MEAN        ST. DEV      VARIANCE     MINIMUM      MAXIMUM
 MC           6   118.00       8.7178       76.000       109.00       131.00
 Q            6   124.00       22.450       504.00       94.000       154.00

 |_genr mcstd=(mc-m:1)/s:1
 |_genr qstd=(q-m:2)/s:2

 |_* now regress the standardized mc on the standardized q
 
 |_ols mcstd qstd
 
  R-SQUARE =   0.5414     R-SQUARE ADJUSTED =   0.4267
 VARIANCE OF THE ESTIMATE-SIGMA**2 =  0.57331
 STANDARD ERROR OF THE ESTIMATE-SIGMA =  0.75717
 SUM OF SQUARED ERRORS-SSE=   2.2932
 MEAN OF DEPENDENT VARIABLE =  0.37007E-16
 LOG OF THE LIKELIHOOD FUNCTION = -5.62824
 
                      ANALYSIS OF VARIANCE - FROM MEAN
                       SS         DF             MS                 F
 REGRESSION        2.7068          1.        2.7068                 4.721
 ERROR             2.2932          4.       0.57331               P-VALUE
 TOTAL             5.0000          5.        1.0000                 0.096
 
                      ANALYSIS OF VARIANCE - FROM ZERO
                       SS         DF             MS                 F
 REGRESSION        2.7068          2.        1.3534                 2.361
 ERROR             2.2932          4.       0.57331               P-VALUE
 TOTAL             5.0000          6.       0.83333                 0.210
 
 
 VARIABLE   ESTIMATED  STANDARD   T-RATIO        PARTIAL STANDARDIZED ELASTICITY
   NAME    COEFFICIENT   ERROR       4 DF   P-VALUE CORR. COEFFICIENT  AT MEANS
 QSTD      0.73577     0.3386       2.173     0.096 0.736     0.7358     0.0000
 CONSTANT  0.18504E-16 0.3091      0.5986E-16 1.000 0.000     0.0000     0.5000

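Note that the regular parameter estimates are now identical to the standardized coefficients, except for a very tiny rounding error in the constant. The 0.18504E-16 is SHAZAM's exponential notation: it means 0.18504 x 10^(-16), so 16 zeros need to be inserted between the decimal point and the 18504. Writing the number this way lets SHAZAM report a very small value without losing significant figures. For all practical purposes, this constant is zero.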
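
As a cross-check outside SHAZAM, the same standardization can be sketched in a few lines of Python. The six (Q, MC) pairs below are an assumption: they are read off the STAT output and the plots in this handout rather than copied from doodad.dat, so verify them against the file before relying on them. The helper names standardize and ols_simple are illustrative only. In a simple regression on standardized variables, the slope equals the sample correlation between the two variables, so its square should reproduce R-SQUARE.

```python
from statistics import mean, stdev

# ASSUMPTION: these six (Q, MC) pairs are reconstructed from the STAT summary
# and the plots in this handout, not copied from doodad.dat.
q = [94, 106, 118, 130, 142, 154]
mc = [117, 111, 109, 114, 126, 131]

def standardize(values):
    """Subtract the sample mean and divide by the sample (n-1) std. deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def ols_simple(y, x):
    """Return (intercept, slope) from regressing y on x with a constant."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sxy / sxx
    return ybar - b * xbar, b

intercept, slope = ols_simple(standardize(mc), standardize(q))
r_square = slope ** 2  # valid only for a one-regressor model with a constant

print("slope    :", round(slope, 5))     # compare with QSTD in the printout
print("intercept:", intercept)           # zero up to floating-point rounding
print("R-square :", round(r_square, 4))  # compare with R-SQUARE in the printout
```

If the reconstructed data are right, the slope and R-square agree with the SHAZAM printout, and the intercept is zero up to rounding error, just like the 0.18504E-16 reported for CONSTANT.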

e.) Optional: Reflect upon the validity of fitting a straight line to these data. Think back to Economics 1. What does economic theory have to say about the shape of a MC curve? What does a plot of MC against quantity suggest about the shape of the MC curve?

If the technology (the total product curve) is S-shaped, then the associated marginal cost curve will be U-shaped. The data look distinctly U-shaped rather than linear, so fitting a straight line is inappropriate.

 |_* visually check the relationship between the raw variables
 
 |_plot mc q
 
    132.00        |
    130.74        |                                    *
    129.47        |
    128.21        |
    126.95        |
    125.68        |                              *
    124.42        |
    123.16        |
    121.89        |
    120.63        |
    119.37        |
    118.11        |
    116.84        |      *
    115.58        |
    114.32        |
    113.05        |                        *
    111.79        |
    110.53        |            *
    109.26        |
    108.00        |                  *
                   ________________________________________
 
              80.000   100.000   120.000   140.000   160.000
 
                                Q

 |_* try creating a quadratic term in q
 |_genr q2=q*q

 |_* now try a "multiple regression"
 
 |_ols mc q q2 / predict=mchat
 
  R-SQUARE =   0.9273     R-SQUARE ADJUSTED =   0.8789
 VARIANCE OF THE ESTIMATE-SIGMA**2 =   9.2024
 STANDARD ERROR OF THE ESTIMATE-SIGMA =   3.0335
 SUM OF SQUARED ERRORS-SSE=   27.607
 MEAN OF DEPENDENT VARIABLE =   118.00
 LOG OF THE LIKELIHOOD FUNCTION = -13.0926
 
 VARIABLE   ESTIMATED  STANDARD   T-RATIO        PARTIAL STANDARDIZED ELASTICITY
   NAME    COEFFICIENT   ERROR       3 DF   P-VALUE CORR. COEFFICIENT  AT MEANS
 Q         -3.1280     0.8572      -3.649     0.036-0.903    -8.0551    -3.2870
 Q2        0.13765E-01 0.3448E-02   3.992     0.028 0.917     8.8128     1.8426
 CONSTANT   288.44      52.12       5.534     0.012 0.954     0.0000     2.4444

 |_* note that mchat (mc-hat) is the fitted value of the regression equation
 |_* at each observation.  We can now plot mchat and true mc against q:
 |_* use the next plot command (comment deleted) if you have a graphics
 |_* adaptor.  Line printers won't be able to display this very well, though

 |_*plot mc mchat q / ega line
 |_* this type of plot output is adequate for this course, fortunately
 
 |_plot mc mchat q

                    *=MC
                    +=MCHAT
                    M=MULTIPLE POINT
    135.00        |
    133.42        |
    131.84        |                                    +
    130.26        |                                    *
    128.68        |
    127.11        |
    125.53        |                              *
    123.95        |
    122.37        |
    120.79        |                              +
    119.21        |
    117.63        |
    116.05        |      *
    114.47        |      +
    112.89        |                        M
    111.32        |            +
    109.74        |            *     +
    108.16        |                  *
    106.58        |
    105.00        |
                   ________________________________________
 
              80.000   100.000   120.000   140.000   160.000
 
                                Q

In this last regression, the "+" signs trace out the least-squares quadratic curve fitted to these U-shaped data. The "*" signs are the actual values of MC that go along with each Q. The U-shaped curve fits the data much better than any straight line: R-SQUARE rises from 0.5414 in the (standardized) straight-line regression above to 0.9273 here.
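
The fitted quadratic can also be explored directly from the printed coefficients. The Python sketch below (again a cross-check outside SHAZAM) locates the bottom of the fitted U and reproduces two of the printed columns for Q. The formulas used for the STANDARDIZED COEFFICIENT (b times s_Q/s_MC) and the ELASTICITY AT MEANS (b times mean Q over mean MC) are assumptions about how SHAZAM fills those columns; they match the printout to rounding. The names mc_hat, q_min, and so on are illustrative only.

```python
# Coefficients and summary statistics copied from the SHAZAM output above.
B_CONST, B_Q, B_Q2 = 288.44, -3.1280, 0.013765
MEAN_Q, MEAN_MC = 124.00, 118.00
SD_Q, SD_MC = 22.450, 8.7178

def mc_hat(q):
    """Fitted marginal cost from the quadratic regression."""
    return B_CONST + B_Q * q + B_Q2 * q * q

# A parabola a + b*q + c*q**2 with c > 0 is lowest where b + 2*c*q = 0.
q_min = -B_Q / (2 * B_Q2)
mc_min = mc_hat(q_min)

# ASSUMED formulas for two printed columns, using the Q coefficient.
std_coef_q = B_Q * SD_Q / SD_MC        # compare with STANDARDIZED COEFFICIENT
elasticity_q = B_Q * MEAN_Q / MEAN_MC  # compare with ELASTICITY AT MEANS

print(f"fitted MC is lowest near Q = {q_min:.1f}, where MC-hat = {mc_min:.1f}")
print(f"standardized coefficient on Q = {std_coef_q:.4f}")
print(f"elasticity at means for Q     = {elasticity_q:.4f}")
```

The turning point lies inside the range of Q shown on the plot (roughly 94 to 154), which is what one would expect if the data really do trace out both arms of a U.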


Updated: 11:28 AM 10/21/98; Prepared by: Trudy Ann Cameron