Due: October 22, 1998
INSTRUCTIONS: This homework set is intended to consolidate in your mind what happens when you ask SHAZAM to run an OLS regression.
NETWORK FILES NEEDED: transfer.dat, doodad.dat. NOTE: If you have BruinOnline or other Web access, you should be able to get the contents of these files by going to http://www.sscnet.ucla.edu/98W/econ143-1. Select the link to Problem Sets, find Problem Set #3, and once inside, click on the name of the file you want. Or, you can go directly to the list of Data Sets for the course and look for the ones you need by name. You should then be able to highlight and copy the relevant contents of the file of interest, save them to a new file in SHAZAM, and proceed with the homework. Be sure to get the file extensions right (either *.sha or *.dat, accordingly). See the instructions in the SHAZAM for WINDOWS orientation handout.
1. Do a very small "simple regression" problem by hand, using the computations necessary to arrive at $b_1$, $b_2$, and $s^2 = \frac{1}{n-2}\sum e_i^2$. Also calculate the standard errors of the point estimates for $b_1$ and $b_2$. (You might want to postpone the standard error calculations if we do not get to this in time during the lectures.)
You will probably want to mimic some of the steps displayed in the
handout entitled "Table 5.4" which shows the kinds of step-by-step
calculations that are now done much more efficiently by computers than by people.
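Assume that your data are as follows (these are the values implied by the means and deviations printed in the SHAZAM output below):
Y X
6 1
5 2
1 3
0 4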
Verify your results using the appropriate one-line OLS command in SHAZAM. (Note that with this small number of observations, you will probably find it easiest to embed the data directly within the SHAZAM commands, rather than reading the data from a separate file. See the handout on how to run SHAZAM for Windows for how to do this: use the READ statement with no filename given.) NOTE: Be sure to regress Y on X (i.e. OLS Y X), not vice versa.
I would certainly recommend following the format of Table 5.4 in the handout when doing this by hand. However, I will provide answers here by getting SHAZAM to make all of the individual calculations that would be necessary to arrive at the numbers that are needed. For this trivial data set, the fitted regression line has an intercept of 8.5 and a slope of -2.2. Furthermore, both of these point estimates are statistically significantly different from zero at the 5% level. In fact, even with this tiny number of degrees of freedom (only 2), the slope is significantly different from zero at the 3.5% level, and the intercept is different from zero at the 1.8% level.
I have prepared a virtual handout that goes over the geometric intuition of finding the values for the slope and intercept that minimize the sum of squared (vertical) errors from each point to the "best" regression line. The objective function in ordinary least squares (the function we are trying to minimize by our choices of the slope and intercept) is quadratic in these two unknown parameters, so we can readily solve for the best values of these parameters by taking the derivative with respect to each one and setting it equal to zero. This yields the "normal equations": two equations in two unknowns that are solved to yield the formulas for slope and intercept that are used to produce parameter estimates.
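As a sketch of the algebra just described, the derivation can be written out as follows. The name $S(b_1, b_2)$ for the objective function is an assumption introduced here for illustration; the final line uses the means and sums that appear in the SHAZAM session for this data set.

```latex
\begin{align*}
S(b_1, b_2) &= \sum_{i=1}^{n} (Y_i - b_1 - b_2 X_i)^2 \\
\frac{\partial S}{\partial b_1} &= -2 \sum (Y_i - b_1 - b_2 X_i) = 0 \\
\frac{\partial S}{\partial b_2} &= -2 \sum X_i (Y_i - b_1 - b_2 X_i) = 0 \\
b_2 &= \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2},
\qquad b_1 = \bar{Y} - b_2 \bar{X} \\
b_2 &= \frac{-11}{5} = -2.2,
\qquad b_1 = 3 - (-2.2)(2.5) = 8.5
\end{align*}
```

The first two derivative conditions are the normal equations; solving the first for $b_1$ and substituting into the second gives the slope formula.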
|_sample 1 4
|_read y x
2 VARIABLES AND 4 OBSERVATIONS STARTING AT OBS 1
|_* get ybar and xbar
|_stat y x / mean=m
NAME N MEAN ST. DEV VARIANCE MINIMUM MAXIMUM
Y 4 3.0000 2.9439 8.6667 0.00000 6.0000
X 4 2.5000 1.2910 1.6667 1.0000 4.0000
|_* create the deviations from the means
|_genr ydev=y-m:1
|_genr xdev=x-m:2
|_* check these values
|_print ydev xdev
YDEV XDEV
3.000000 -1.500000
2.000000 -0.5000000
-2.000000 0.5000000
-3.000000 1.500000
|_stat ydev xdev
NAME N MEAN ST. DEV VARIANCE MINIMUM MAXIMUM
YDEV 4 0.00000 2.9439 8.6667 -3.0000 3.0000
XDEV 4 0.00000 1.2910 1.6667 -1.5000 1.5000
|_* generate the terms to be summed in the numerator and denominator
|_genr xydev=xdev*ydev
|_genr xxdev=xdev*xdev
|_print xydev xxdev
XYDEV XXDEV
-4.500000 2.250000
-1.000000 0.2500000
-1.000000 0.2500000
-4.500000 2.250000
|_* calculate and save the sums of these for later
|_stat xydev xxdev / sum=s
NAME N MEAN ST. DEV VARIANCE MINIMUM MAXIMUM
XYDEV 4 -2.7500 2.0207 4.0833 -4.5000 -1.0000
XXDEV 4 1.2500 1.1547 1.3333 0.25000 2.2500
|_* now explicitly calculate the regression parameters
|_gen1 b2=s:1/s:2
|_gen1 b1=m:1-b2*m:2
|_print b1 b2
B1
8.500000
B2
-2.200000
|_* calculate the error terms
|_genr e=y-b1-b2*x
|_print e
E
-0.3000000 0.9000000 -0.9000000 0.3000000
|_genr e2=e*e
|_stat e2 / sum=resss
NAME N MEAN ST. DEV VARIANCE MINIMUM MAXIMUM
E2 4 0.45000 0.41569 0.17280 0.90000E-01 0.81000
|_gen1 s2=(1/(4-2))*resss:1
|_print s2
S2
0.9000000
|_* do the OLS and see if it comes out the same as doing it the long way
|_ols y x
REQUIRED MEMORY IS PAR= 2 CURRENT PAR= 500
OLS ESTIMATION
4 OBSERVATIONS DEPENDENT VARIABLE = Y
...NOTE..SAMPLE RANGE SET TO: 1, 4
R-SQUARE = 0.9308 R-SQUARE ADJUSTED = 0.8962
VARIANCE OF THE ESTIMATE-SIGMA**2 = 0.90000
STANDARD ERROR OF THE ESTIMATE-SIGMA = 0.94868
SUM OF SQUARED ERRORS-SSE= 1.8000
MEAN OF DEPENDENT VARIABLE = 3.0000
LOG OF THE LIKELIHOOD FUNCTION = -4.07874
VARIABLE ESTIMATED STANDARD T-RATIO PARTIAL STANDARDIZED ELASTICITY
NAME COEFFICIENT ERROR 2 DF P-VALUE CORR. COEFFICIENT AT MEANS
X -2.2000 0.4243 -5.185 0.035-0.965 -0.9648 -1.8333
CONSTANT 8.5000 1.162 7.316 0.018 0.982 0.0000 2.8333
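The long-hand steps in the session above stop at s2, but the question also asks for the standard errors. The following is a minimal cross-check in Python (not part of the SHAZAM assignment), assuming the data are Y = 6, 5, 1, 0 and X = 1, 2, 3, 4, which is what the printed means and deviations imply. It uses the usual simple-regression formulas se(b2) = sqrt(s2/Sxx) and se(b1) = sqrt(s2 * sum(X^2) / (n * Sxx)); the variable names are illustrative.

```python
import math

# Data implied by the means and deviations printed in the SHAZAM session
# (an assumption: the original data table is not reproduced in the text).
y = [6.0, 5.0, 1.0, 0.0]
x = [1.0, 2.0, 3.0, 4.0]
n = len(y)

ybar = sum(y) / n
xbar = sum(x) / n

# Sums of cross-products and squares of deviations from the means.
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)

# Slope and intercept from the normal equations.
b2 = sxy / sxx
b1 = ybar - b2 * xbar

# Residuals, sum of squared errors, and the error variance estimate.
resid = [yi - b1 - b2 * xi for xi, yi in zip(x, y)]
sse = sum(e * e for e in resid)
s2 = sse / (n - 2)

# Standard errors of the slope and intercept estimates.
se_b2 = math.sqrt(s2 / sxx)
se_b1 = math.sqrt(s2 * sum(xi * xi for xi in x) / (n * sxx))

print(f"b1 = {b1:.4f}, b2 = {b2:.4f}, s2 = {s2:.4f}")
print(f"se(b1) = {se_b1:.4f}, se(b2) = {se_b2:.4f}")
print(f"t(b1) = {b1 / se_b1:.3f}, t(b2) = {b2 / se_b2:.3f}")
```

The printed standard errors and t-ratios can be compared with the STANDARD ERROR and T-RATIO columns of the OLS output.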
2. Determine whether the following models are linear in the parameters, or the variables, or both. Which of these models can be estimated as linear regression models (possibly after transformation of the data)?
a.) $Y_i = B_1 + B_2 (1/X_i) + u_i$
This population regression function is linear in the parameters and linear in Y, but it is nonlinear in X, since X appears in reciprocal form. This can be estimated by OLS because we would simply create a new variable, say Z=1/X, and regress Y on Z. The desired coefficients $B_1$ and $B_2$ would result.
b.) $Y_i = B_1 + B_2 \log(X_i) + u_i$
This population regression function is also linear in parameters. It, too, is linear in the dependent variable, Y. However, it is nonlinear in X. Still, we can estimate any model that is linear in parameters, even if it is not linear in the variables. All we do is redefine the variables, such as W = log(X), and regress Y on W to get the desired intercept and slope coefficients.
c.) $Y_i = B_1 X_i^{B_2} e^{u_i}$
This might seem a little tricky at first, but you will begin to recognize this class of models. As shown, the model is linear in Y, but nonlinear in X (since it appears with an exponent) and nonlinear in $B_2$, since this parameter is IN an exponent. As it stands, it cannot be estimated by OLS. Importantly, the error term u, rather than simply being added to the expression, appears multiplicatively as an exponent of the mathematical constant e. Things look bleak for OLS until you notice that taking the natural logarithm of both sides preserves the same underlying relationship between the variables: $\log(Y_i) = \log(B_1) + B_2 \log(X_i) + u_i$. If you then redefine the variables, say, by using genr ly=log(y) and genr lx=log(x) in SHAZAM, then the command ols ly lx will yield the desired coefficients: the slope estimates $B_2$ directly, and the intercept estimates $\log(B_1)$, so $B_1$ is recovered by exponentiating the intercept.
Note: ln and log are used interchangeably to signify natural logarithms (log to the base e). Base 10 logarithms are almost never used in econometrics.
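To see the log transformation of part c.) at work, here is a small Python sketch (not SHAZAM). The parameter values B1 = 3 and B2 = 0.5, the X values, and the helper name simple_ols are all hypothetical, chosen only for illustration, and the data are generated without any error term so that the regression on the logged variables recovers the parameters up to rounding.

```python
import math

# Hypothetical parameter values, chosen only for illustration.
B1_TRUE = 3.0
B2_TRUE = 0.5

# Noise-free data from Y = B1 * X^B2 (that is, every u_i is set to zero).
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [B1_TRUE * xi ** B2_TRUE for xi in x]

def simple_ols(xs, ys):
    """Return (intercept, slope) from a simple regression of ys on xs."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(xs, ys))
    sxx = sum((a - xbar) ** 2 for a in xs)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Natural logs of both variables, as in genr ly=log(y) and genr lx=log(x).
lx = [math.log(xi) for xi in x]
ly = [math.log(yi) for yi in y]

intercept, b2_hat = simple_ols(lx, ly)

# The intercept estimates log(B1), so exponentiate it to get back to B1.
b1_hat = math.exp(intercept)

print(f"slope (estimate of B2) = {b2_hat:.4f}")
print(f"exp(intercept) (estimate of B1) = {b1_hat:.4f}")
```

With real data the error term would not be zero, and exponentiating the intercept gives only a point estimate of B1.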
4. Two special cases. Note: these exercises are easiest if you mimic the algebra covered in the two class handouts on OLS estimator formulas and derivation of the variances of these estimators. Just zero-out the parameters that are not relevant.
a.) $Y_i = B_2 X_i + u_i$
In this model, the intercept term is absent (perhaps some theory tells us it should be exactly zero--that when X is zero, Y must also be zero). The model is therefore known as "regression through the origin." SHAZAM can estimate such a model by using the command OLS Y X / NOCONSTANT. For this model, show that:
i.) $b_2 = \sum X_i Y_i / \sum X_i^2$
ii.) $\mathrm{Var}(b_2) = \sigma^2 / \sum X_i^2$ (This question should be considered optional if we do not get to the discussion of the variance of a regression slope estimator before the problem set is due.)
In order to derive the variance of $b_2$, we can mimic the steps in the handout where we considered the expected values and variances of the coefficients in the familiar two-coefficient model. First, recognize that $b_2$ can be expressed as a linear function of the individual $Y_i$ values in the data (if it is a random sample, these are independent random variables with the same variances...although they have different conditional means). We can write: $b_2 = \sum X_i Y_i / \sum X_i^2 = \sum w_i Y_i$, where each weight $w_i = X_i / \sum X_j^2$ is treated as a fixed number.
The usual formula for the variance of a linear combination of random variables can be employed: coefficient-squared times the variance of the first variable, plus coefficient-squared times the variance of the second variable, and so on. We can ignore the usual covariance terms because these independent $Y_i$ observations are uncorrelated if the sample is randomly drawn. $\mathrm{Var}(Y_i) = \sigma^2$ for all i, so we can simplify this variance as $\mathrm{Var}(b_2) = \sigma^2 \sum X_i^2 / (\sum X_i^2)^2$. The sum of squared $X_i$ values in the numerator cancels one power of that term in the denominator, and we are left with $\mathrm{Var}(b_2) = \sigma^2 / \sum X_i^2$. This differs from the usual formula in that we use the actual level of each $X_i$ value, rather than its difference from the overall marginal mean of the observed X values.
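For part i.), the estimator formula can be obtained with the same least-squares steps used in the two-coefficient case, with the intercept zeroed out. The following is a sketch of that derivation; the name $S(b_2)$ for the sum of squared errors is an assumption introduced here.

```latex
\begin{align*}
S(b_2) &= \sum (Y_i - b_2 X_i)^2 \\
\frac{dS}{db_2} &= -2 \sum X_i (Y_i - b_2 X_i) = 0 \\
\sum X_i Y_i &= b_2 \sum X_i^2 \\
b_2 &= \frac{\sum X_i Y_i}{\sum X_i^2}
\end{align*}
```

There is only one normal equation here because there is only one unknown parameter.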
b.) What happens if your population regression function (PRF) assumes the following form: $Y_i = B_1 + u_i$
If $Y_i = b_1 + e_i$ is your model, then the error term is just $e_i = Y_i - b_1$. Minimizing the sum of squared errors from the regression "line" involves taking the derivative of $\sum (Y_i - b_1)^2$ with respect to $b_1$ and setting it equal to zero. This gives $2\sum (Y_i - b_1)(-1) = 0$ at the optimal value of the single unknown parameter, $b_1$. The 2 and the (-1) can be eliminated while preserving the equality. Then $\sum Y_i - n b_1 = 0$, or $b_1 = \sum Y_i / n$, which is just the marginal mean of Y in the sample. Thus, saying OLS Y will give you the opportunity to test hypotheses about the mean of Y. (E.g., you can make great use of the TEST or CONFID auxiliary commands that may follow regression models.)
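Both special cases can be checked numerically. The sketch below is in Python rather than SHAZAM and uses a small hypothetical data set (the numbers are made up for illustration only). It computes each estimator from its formula, confirms that the first-order condition holds, and confirms that nudging the estimate in either direction raises the sum of squared errors.

```python
# Hypothetical data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
n = len(y)

def sse_origin(b2):
    """Sum of squared errors for the model Y = b2*X (no intercept)."""
    return sum((yi - b2 * xi) ** 2 for xi, yi in zip(x, y))

def sse_mean(b1):
    """Sum of squared errors for the model Y = b1 (intercept only)."""
    return sum((yi - b1) ** 2 for yi in y)

# Case a.): regression through the origin, b2 = sum(XY) / sum(X^2).
b2 = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
foc_origin = sum(xi * (yi - b2 * xi) for xi, yi in zip(x, y))

# Case b.): intercept only, b1 = sum(Y) / n, the sample mean.
b1 = sum(y) / n
foc_mean = sum(yi - b1 for yi in y)

print(f"through the origin: b2 = {b2:.4f}, first-order condition = {foc_origin:.2e}")
print(f"intercept only:     b1 = {b1:.4f}, first-order condition = {foc_mean:.2e}")
print("SSE rises when b2 is nudged:",
      sse_origin(b2) < min(sse_origin(b2 - 0.01), sse_origin(b2 + 0.01)))
print("SSE rises when b1 is nudged:",
      sse_mean(b1) < min(sse_mean(b1 - 0.01), sse_mean(b1 + 0.01)))
```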
5. I have sent to the network (and posted on the web) a copy of the data file transfer.dat. Imagine that this file contains data on government transfer payments to families (transfer) and family expenditures on children (childexp). Look at the contents of n:transfer.dat. Create your own SHAZAM command file by opening a new file and entering appropriate commands to accomplish the following tasks.
b.) Using all the data provided, estimate the parameters in a linear regression of "monthly expenditures on family's children" (childexp) on "monthly receipts of transfer payments" (transfer) and obtain the coefficient of determination (r-squared value) for the model. What does this coefficient imply in a simple regression model? (Text pp. 160-164)
|_sample 1 100
|_read(transfer.dat) transfer childexp
UNIT 88 IS NOW ASSIGNED TO: transfer.dat
2 VARIABLES AND 100 OBSERVATIONS STARTING AT OBS 1
|_stat transfer childexp
NAME N MEAN ST. DEV VARIANCE MINIMUM MAXIMUM
TRANSFER 100 284.85 154.45 23854. 51.816 990.34
CHILDEXP 100 241.79 142.40 20279. 14.740 935.40
|_ols childexp transfer
REQUIRED MEMORY IS PAR= 5 CURRENT PAR= 500
OLS ESTIMATION
100 OBSERVATIONS DEPENDENT VARIABLE = CHILDEXP
...NOTE..SAMPLE RANGE SET TO: 1, 100
R-SQUARE = 0.0401 R-SQUARE ADJUSTED = 0.0303
VARIANCE OF THE ESTIMATE-SIGMA**2 = 19665.
STANDARD ERROR OF THE ESTIMATE-SIGMA = 140.23
SUM OF SQUARED ERRORS-SSE= 0.19272E+07
MEAN OF DEPENDENT VARIABLE = 241.79
LOG OF THE LIKELIHOOD FUNCTION = -635.213
VARIABLE ESTIMATED STANDARD T-RATIO PARTIAL STANDARDIZED ELASTICITY
NAME COEFFICIENT ERROR 98 DF P-VALUE CORR. COEFFICIENT AT MEANS
TRANSFER 0.18459 0.9125E-01 2.023 0.046 0.200 0.2002 0.2175
CONSTANT 189.21 29.53 6.406 0.000 0.543 0.0000 0.7825
The coefficient of determination (the r-squared value) gives the proportion of the variation in the dependent variable that is explained by the right-hand-side variable. Here, it is only about 4%. Not much.
c.) Does this model suggest
that
We usually set this up the other way. The null hypothesis is that families will spend zero dollars out of every additional dollar on their children. This corresponds to a null hypothesis that the slope is zero. The t-test statistic for the slope is 2.023, with 98 degrees of freedom, which looks close to being significant. (We could look up the 5% critical value in the back of the text and find that it is something less than 2, i.e. between 2.000 and 1.98, the values for 60 and 120 degrees of freedom.) However, the P-value gives us what we need directly. We can just reject the zero hypothesis for the slope at a significance level of 4.6%, so we can reject at the usual 5%. So yes, it looks like families spend some portion of each extra transfer dollar on the kids (the point estimate is about 18 cents).
(ii.) on average, families spend positive amounts on their children, even if
they receive no transfer payments?
Again, we generally set this up the other way. The null hypothesis is that if people receive no transfer payments, then they spend zero dollars on their children. This is a hypothesis about the intercept being zero. If the point estimate of the intercept is greater than zero, and we can reject the hypothesis of a zero intercept at the 5% level, then we would conclude that people do spend money on their kids even if they receive no transfer payments. Here, the point estimate of the intercept is about $189. The t-test statistic is larger than 6, which is way out in the tails of any t-distribution. The p-value confirms that there is virtually no probability left in the symmetric tails of a t-distribution with 98 degrees of freedom beyond -6.406 and +6.406. We clearly reject the hypothesis that people spend zero money on their kids if they get zero transfer payments.
The answers to these questions concern the slope coefficient and
intercept coefficient in the regression. (It is helpful to think
about the verbal definition of the slope and intercept in any regression model.
The slope is the "change in Y for a one-unit change in X." The intercept is the
"expected value of Y when X is zero.")
d.) A little harder: Does this model suggest
that, on average, for an additional dollar of transfer payments, these families
tend to spend all of that additional dollar on the family's children? The
answer to this also concerns the slope coefficient in the
regression.
This is a test of a non-standard hypothesis (i.e. something other than the "zero hypothesis" for which it is automatically assumed that you will be interested in knowing the answer). Now we need to test whether the slope could be one. This could be done explicitly, by calculating the difference between the point estimate and one, and dividing by the standard error of the point estimate. But make your life easier by having SHAZAM do it for you with the command test transfer=1. The test command uses information from the last ols command that has been run. The difference between the point estimate of the slope and one is about -0.8, but the standard error is small (about 0.09), so the test statistic is about -9, which is a clear rejection at the 5% level. The extremely small P-value confirms this.
e.) Plot the data in a scattergram. Examine the plot carefully. Are there any points that are likely to be "influential" in the fitting of a regression line (called "outliers")? Explain.
This is a case where the crummy dot-matrix-type SHAZAM plots may have an advantage over the fancier gnuplots. If you had used the gnuplot option on your plot command, you might have found that it was difficult to see the outlier, because it was rather close to the key that gnuplot provides.
One way to tell if a point is data or just related to the key is to connect the points together. There is no natural order to the observations, so it will look like scribbling, but if a point is connected to other points, it is part of the data and not related to the key.
Almost all of the data lie in an amorphous blob with no particular linear relationship at all. It seems, upon inspection of the data, that the outlier in the upper right is completely responsible for the apparent positive slope in the estimated relationship.
Now back to the task at hand...
From a simple plot of childexp against transfer, identify a range of values for, say, transfers, that includes ONLY the offending observations. Say this range includes values greater than 800. Use the SKIPIF command to force SHAZAM to leave this observation out of subsequent calculations. The format will be skipif(transfer.ge.800). The ".ge." ("greater than or equal to") is the way SHAZAM compares variable values to some benchmark. Correspondingly, you could use .le., .gt., .lt., .eq., or .ne. for "less than or equal to," "greater than," "less than," "equal to," and "not equal to." Re-run the above tasks on this reduced data set. What happens to your results from the above regressions?
Without the influential outlier in the data set, all of the apparent relationship between transfers and child expenditures disappears. The slope is no longer statistically significantly different from zero. The fitted regression line is essentially flat. We can still clearly reject a unit slope. There is NO evidence to support these households spending an extra dollar on the kids for each extra dollar of transfer payments.
INTUITION??? We have a Java Applet
that helps you gain an understanding of what types
of observations can have a very influential effect on the slope and/or intercept
of a regression line. After you start this applet, click on different places in the plot
to see how the fitted regression line can get "dragged around" by additional points in
different places. See if you can figure out what kinds of outliers are the most "influential"
and which do relatively little damage to the slope and/or intercept estimates.
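The same kind of experiment can be sketched numerically. The Python example below does not use the transfer.dat data; the five "blob" points and the two added points are hypothetical values picked so the arithmetic is easy to follow. It compares the fitted slope with no added point, with a point far out in the upper right, and with a point that has an extreme Y but sits at the mean of X.

```python
def slope_intercept(xs, ys):
    """Return (slope, intercept) from a simple regression of ys on xs."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, ybar - slope * xbar

# Hypothetical "blob" with no linear relationship (the slope is exactly zero).
base_x = [1.0, 2.0, 3.0, 4.0, 5.0]
base_y = [2.0, 4.0, 3.0, 4.0, 2.0]
slope_base, intercept_base = slope_intercept(base_x, base_y)

# Add one point far out in the upper right: extreme X and extreme Y.
slope_corner, intercept_corner = slope_intercept(base_x + [15.0], base_y + [15.0])

# Add one point with the same extreme Y, but with X at the mean of X.
slope_middle, intercept_middle = slope_intercept(base_x + [3.0], base_y + [15.0])

print(f"blob only:            slope = {slope_base:.3f}, intercept = {intercept_base:.3f}")
print(f"plus point (15, 15):  slope = {slope_corner:.3f}, intercept = {intercept_corner:.3f}")
print(f"plus point (3, 15):   slope = {slope_middle:.3f}, intercept = {intercept_middle:.3f}")
```

In this made-up example the point with the extreme X value drags the slope up sharply, while the point at the mean of X leaves the slope alone and only shifts the intercept.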
|_test transfer=1
TEST VALUE = -0.81541 STD. ERROR OF TEST VALUE 0.91253E-01
T STATISTIC = -8.9357075 WITH 98 D.F. P-VALUE= 0.00000
F STATISTIC = 79.846868 WITH 1 AND 98 D.F. P-VALUE= 0.00000
WALD CHI-SQUARE STATISTIC = 79.846868 WITH 1 D.F. P-VALUE= 0.00000
UPPER BOUND ON P-VALUE BY CHEBYCHEV INEQUALITY = 0.01252
|_plot childexp transfer
REQUIRED MEMORY IS PAR= 2 CURRENT PAR= 500
FOR MAXIMUM EFFICIENCY USE AT LEAST PAR= 4
100 OBSERVATIONS
*=CHILDEXP
M=MULTIPLE POINT
884.21 | * <--this is the outlier
821.05 |
757.89 |
694.74 |
631.58 |
568.42 |
505.26 |
442.11 | *** * * *
378.95 | * * * M** * *
315.79 | M * *M** *** **
252.63 | *MM M*M M* M *
189.47 | * *** ** *M MM
126.32 | *M ****MM * * *
63.158 | * M* MM M *M *M
0.32685E-12 | *M * * * * *
________________________________________
0.000 300.000 600.000 900.000 1200.000
TRANSFER

|_skipif(transfer.ge.800)
OBSERVATION 79 WILL BE SKIPPED
|_ols childexp transfer
REQUIRED MEMORY IS PAR= 6 CURRENT PAR= 500
OLS ESTIMATION
99 OBSERVATIONS DEPENDENT VARIABLE = CHILDEXP
...NOTE..SAMPLE RANGE SET TO: 1, 100
R-SQUARE = 0.0012 R-SQUARE ADJUSTED = -0.0091
VARIANCE OF THE ESTIMATE-SIGMA**2 = 15668.
STANDARD ERROR OF THE ESTIMATE-SIGMA = 125.17
SUM OF SQUARED ERRORS-SSE= 0.15198E+07
MEAN OF DEPENDENT VARIABLE = 234.79
LOG OF THE LIKELIHOOD FUNCTION = -617.605
VARIABLE ESTIMATED STANDARD T-RATIO PARTIAL STANDARDIZED ELASTICITY
NAME COEFFICIENT ERROR 97 DF P-VALUE CORR. COEFFICIENT AT MEANS
TRANSFER -0.31404E-01 0.9181E-01 -0.3420 0.733-0.035 -0.0347 -0.0371
CONSTANT 243.51 28.43 8.564 0.000 0.656 0.0000 1.0371
|_test transfer=1
TEST VALUE = -1.0314 STD. ERROR OF TEST VALUE 0.91812E-01
T STATISTIC = -11.233887 WITH 97 D.F. P-VALUE= 0.00000
F STATISTIC = 126.20021 WITH 1 AND 97 D.F. P-VALUE= 0.00000
WALD CHI-SQUARE STATISTIC = 126.20021 WITH 1 D.F. P-VALUE= 0.00000
UPPER BOUND ON P-VALUE BY CHEBYCHEV INEQUALITY = 0.00792
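The two TEST results above can be reproduced by hand from the OLS output: subtract the hypothesized value from the point estimate and divide by the standard error. The Python lines below are only an arithmetic cross-check; the helper name t_stat is illustrative, and the inputs are the rounded coefficients and standard errors copied from the SHAZAM tables, so the last digits differ slightly from SHAZAM's own figures.

```python
def t_stat(estimate, hypothesized, std_error):
    """t statistic for the null hypothesis that the coefficient equals `hypothesized`."""
    return (estimate - hypothesized) / std_error

# Full sample (100 observations): slope and standard error from the OLS output.
t_full = t_stat(0.18459, 1.0, 0.91253e-01)

# Reduced sample (outlier skipped, 99 observations).
t_reduced = t_stat(-0.31404e-01, 1.0, 0.91812e-01)

# With a single restriction, the F statistic is the square of the t statistic.
print(f"full sample:    t = {t_full:.4f}, F = {t_full ** 2:.3f}")
print(f"reduced sample: t = {t_reduced:.4f}, F = {t_reduced ** 2:.3f}")
```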
|_sample 1 6
|_read(doodad.dat) mc q
UNIT 88 IS NOW ASSIGNED TO: doodad.dat
2 VARIABLES AND 6 OBSERVATIONS STARTING AT OBS 1
|_print mc q
MC Q
117.0000 94.00000
111.0000 106.0000
109.0000 118.0000
114.0000 130.0000
126.0000 142.0000
131.0000 154.0000
|_* try a straight ols regression on the raw data
|_ols mc q
REQUIRED MEMORY IS PAR= 1 CURRENT PAR= 500
OLS ESTIMATION
6 OBSERVATIONS DEPENDENT VARIABLE = MC
...NOTE..SAMPLE RANGE SET TO: 1, 6
R-SQUARE = 0.5414 R-SQUARE ADJUSTED = 0.4267
VARIANCE OF THE ESTIMATE-SIGMA**2 = 43.571
STANDARD ERROR OF THE ESTIMATE-SIGMA = 6.6009
SUM OF SQUARED ERRORS-SSE= 174.29
MEAN OF DEPENDENT VARIABLE = 118.00
LOG OF THE LIKELIHOOD FUNCTION = -18.6204
VARIABLE ESTIMATED STANDARD T-RATIO PARTIAL STANDARDIZED ELASTICITY
NAME COEFFICIENT ERROR 4 DF P-VALUE CORR. COEFFICIENT AT MEANS
Q 0.28571 0.1315 2.173 0.096 0.736 0.7358 0.3002
CONSTANT 82.571 16.53 4.996 0.008 0.928 0.0000 0.6998
Marginal cost as a function of quantity appears to be positively sloped. Marginal cost goes up by about $0.29 for each additional unit of output. However, note that this slope is only statistically significantly different from zero at the 10% level, not the usual 5% level, because the standard error of the slope estimate is quite large relative to the size of the point estimate. The marginal cost at zero units of output appears to be about $82.57. However, since there are no output levels anywhere near zero in the data, the intercept is not really meaningful. It is just where the fitted regression line happens to cut through the vertical axis when we project it back to Q=0.
b.) Change of scale: Now measure units in dozens (i.e. GENR QD=Q/12), re-estimate the model, and identify which quantities of interest on the regression output have changed and which have not. Why? What happens to the product (slope coefficient times variable) when you change the scale of measurement of an explanatory variable?
When the magnitude of the explanatory variable is made smaller by a factor of 12 by measuring output in dozens, the slope coefficient (and its associated standard error) increases by a factor of exactly 12. As a result, [slope*variable] is unchanged. We are still using regression to partition the actual value of Y into three parts: an intercept that is always there, a portion that varies with X, and a random error term. Since we have the same underlying data, the relationship between the variables cannot have changed. Everything else besides the slope coefficient and its standard error is unaffected.
|_* now measure quantity in numbers of dozens
|_genr qd=q/12
|_* regress mc on quantity in dozens
|_ols mc qd
REQUIRED MEMORY IS PAR= 1 CURRENT PAR= 500
OLS ESTIMATION
6 OBSERVATIONS DEPENDENT VARIABLE = MC
...NOTE..SAMPLE RANGE SET TO: 1, 6
R-SQUARE = 0.5414 R-SQUARE ADJUSTED = 0.4267
VARIANCE OF THE ESTIMATE-SIGMA**2 = 43.571
STANDARD ERROR OF THE ESTIMATE-SIGMA = 6.6009
SUM OF SQUARED ERRORS-SSE= 174.29
MEAN OF DEPENDENT VARIABLE = 118.00
LOG OF THE LIKELIHOOD FUNCTION = -18.6204
VARIABLE ESTIMATED STANDARD T-RATIO PARTIAL STANDARDIZED ELASTICITY
NAME COEFFICIENT ERROR 4 DF P-VALUE CORR. COEFFICIENT AT MEANS
QD 3.4286 1.578 2.173 0.096 0.736 0.7358 0.3002
CONSTANT 82.571 16.53 4.996 0.008 0.928 0.0000 0.6998
c.) Change of origin: Go back to the original
quantity measure, Q, but now measure MC in "dollars in excess of $100."
(i.e. GENR MC100=MC-100.) Which quantities are now different from the original
model, which aren't, and why?
The point estimate of the intercept parameter changes by 100 units, although its standard error is unchanged. Since all values of MC in the sample have been made smaller by 100 units, so has the intercept. This means the t-test and p-value associated with the intercept term also change. The mean of the dependent variable and the elasticities at the means change as well, because both depend on the level of MC. Nothing happens to the slope, however, since the units for "rise" are the same as before, and slope is still "rise"/"run."
|_* now measure marginal cost in dollars in excess of 100
|_genr mc100=mc-100
|_* regress "mc in dollars in excess of 100) on plain q
|_ols mc100 q
REQUIRED MEMORY IS PAR= 1 CURRENT PAR= 500
OLS ESTIMATION
6 OBSERVATIONS DEPENDENT VARIABLE = MC100
...NOTE..SAMPLE RANGE SET TO: 1, 6
R-SQUARE = 0.5414 R-SQUARE ADJUSTED = 0.4267
VARIANCE OF THE ESTIMATE-SIGMA**2 = 43.571
STANDARD ERROR OF THE ESTIMATE-SIGMA = 6.6009
SUM OF SQUARED ERRORS-SSE= 174.29
MEAN OF DEPENDENT VARIABLE = 18.000
LOG OF THE LIKELIHOOD FUNCTION = -18.6204
VARIABLE ESTIMATED STANDARD T-RATIO PARTIAL STANDARDIZED ELASTICITY
NAME COEFFICIENT ERROR 4 DF P-VALUE CORR. COEFFICIENT AT MEANS
Q 0.28571 0.1315 2.173 0.096 0.736 0.7358 1.9683
CONSTANT -17.429 16.53 -1.055 0.351-0.466 0.0000 -0.9683
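The change-of-scale and change-of-origin results in parts (b.) and (c.) can be confirmed outside SHAZAM with a few lines of Python. This is only a cross-check sketch: it uses the six MC and Q values printed earlier in the session, and the helper name simple_ols is illustrative.

```python
# Data as printed by "print mc q" in the SHAZAM session.
mc = [117.0, 111.0, 109.0, 114.0, 126.0, 131.0]
q = [94.0, 106.0, 118.0, 130.0, 142.0, 154.0]

def simple_ols(xs, ys):
    """Return (intercept, slope) from a simple regression of ys on xs."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(xs, ys))
    sxx = sum((a - xbar) ** 2 for a in xs)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Original model: MC on Q.
a_raw, b_raw = simple_ols(q, mc)

# Change of scale: quantity measured in dozens.
qd = [qi / 12.0 for qi in q]
a_doz, b_doz = simple_ols(qd, mc)

# Change of origin: marginal cost measured in dollars in excess of 100.
mc100 = [m - 100.0 for m in mc]
a_shift, b_shift = simple_ols(q, mc100)

print(f"MC on Q:     intercept = {a_raw:.3f}, slope = {b_raw:.5f}")
print(f"MC on QD:    intercept = {a_doz:.3f}, slope = {b_doz:.5f}")
print(f"MC100 on Q:  intercept = {a_shift:.3f}, slope = {b_shift:.5f}")
print(f"slope ratio (dozens / units) = {b_doz / b_raw:.4f}")
```

The product slope*variable is the same in the first two regressions for every observation, since the factor of 12 in the slope cancels the factor of 1/12 in the variable.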
d.) Part (b.) represented a 'change of scale,'
while part (c.) was a 'change of origin.' A special combination
of a change of scale and a change of origin is called "standardization."
Variable-by-variable, one first subtracts the mean and then divides by
the standard deviation. A regression of standardized MC on standardized
Q is interesting in that the slope coefficient(s) tell the number of standard
deviations by which MC will change when Q changes by one standard deviation.
When we begin considering models with more than one explanatory variable,
this will be a useful way to compare the relative influence of different
explanatory variables on the dependent variable. The units of the different
explanatory variables will not matter. (Why?)
The units all drop out, since dividing by the standard
deviation, which is in the same units as the variable itself, causes the units to
cancel, leaving pure numbers.
SHAZAM produces the coefficients for this "standardized" regression automatically on every run. Locate them on your output. How do these coefficients change between (a.), (b.), and (c.) above? Can you visualize why using a graph? Optional: Can you produce them explicitly by generating the standardized variables directly and regressing them? Try it. (HINT: You can get the means and the standard deviations using the "STAT MC Q / MEAN=mvars STDEV=svars" command. The mean of the first variable, MC, can then be referred to as mvars:1 and its standard deviation as svars:1; likewise, the mean of Q will be mvars:2 and the standard deviation of Q will be svars:2.)
Note that the regular parameter estimates are now identical to the standardized coefficients, except for a very tiny rounding error. The 0.18504E-16 means that 16 zeros need to be inserted between the decimal point and the 18504. This notation preserves significant figures. Another way to write it is $0.18504 \times 10^{-16}$.
|_* now try the standardization process; calculate and save means and
|_* standard deviations
|_stat mc q / mean=m stdev=s
NAME N MEAN ST. DEV VARIANCE MINIMUM MAXIMUM
MC 6 118.00 8.7178 76.000 109.00 131.00
Q 6 124.00 22.450 504.00 94.000 154.00
|_genr mcstd=(mc-m:1)/s:1
|_genr qstd=(q-m:2)/s:2
|_* now regress the standardized mc on the standardized q
|_ols mcstd qstd
R-SQUARE = 0.5414 R-SQUARE ADJUSTED = 0.4267
VARIANCE OF THE ESTIMATE-SIGMA**2 = 0.57331
STANDARD ERROR OF THE ESTIMATE-SIGMA = 0.75717
SUM OF SQUARED ERRORS-SSE= 2.2932
MEAN OF DEPENDENT VARIABLE = 0.37007E-16
LOG OF THE LIKELIHOOD FUNCTION = -5.62824
ANALYSIS OF VARIANCE - FROM MEAN
SS DF MS F
REGRESSION 2.7068 1. 2.7068 4.721
ERROR 2.2932 4. 0.57331 P-VALUE
TOTAL 5.0000 5. 1.0000 0.096
ANALYSIS OF VARIANCE - FROM ZERO
SS DF MS F
REGRESSION 2.7068 2. 1.3534 2.361
ERROR 2.2932 4. 0.57331 P-VALUE
TOTAL 5.0000 6. 0.83333 0.210
VARIABLE ESTIMATED STANDARD T-RATIO PARTIAL STANDARDIZED ELASTICITY
NAME COEFFICIENT ERROR 4 DF P-VALUE CORR. COEFFICIENT AT MEANS
QSTD 0.73577 0.3386 2.173 0.096 0.736 0.7358 0.0000
CONSTANT 0.18504E-16 0.3091 0.5986E-16 1.000 0.000 0.0000 0.5000
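The standardized regression above can also be reproduced in Python as a cross-check. The sketch uses statistics.stdev, which divides by n-1 and therefore matches the ST. DEV column that SHAZAM prints; the helper name simple_ols is illustrative. It also shows that the standardized slope equals the raw slope multiplied by the ratio of the two standard deviations.

```python
import statistics

# Data as printed by "print mc q" in the SHAZAM session.
mc = [117.0, 111.0, 109.0, 114.0, 126.0, 131.0]
q = [94.0, 106.0, 118.0, 130.0, 142.0, 154.0]

def simple_ols(xs, ys):
    """Return (intercept, slope) from a simple regression of ys on xs."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(xs, ys))
    sxx = sum((a - xbar) ** 2 for a in xs)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Means and sample standard deviations (n-1 in the denominator).
m_mc, s_mc = statistics.mean(mc), statistics.stdev(mc)
m_q, s_q = statistics.mean(q), statistics.stdev(q)

# Standardize: subtract the mean, then divide by the standard deviation.
mcstd = [(v - m_mc) / s_mc for v in mc]
qstd = [(v - m_q) / s_q for v in q]

a_std, b_std = simple_ols(qstd, mcstd)
a_raw, b_raw = simple_ols(q, mc)

print(f"standardized regression: intercept = {a_std:.2e}, slope = {b_std:.5f}")
print(f"raw slope * s_q / s_mc = {b_raw * s_q / s_mc:.5f}")
```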
e.) Optional: Reflect upon the validity
of fitting a straight line to these data. Think back to Economics 1. What
does economic theory have to say about the shape of a MC curve? What does
a plot of MC against quantity suggest about the shape of the MC curve?
If technology (the total product curve) is s-shaped, then the associated marginal cost curve will be U-shaped. The data look more than a little U-shaped, as opposed to linear. Fitting a straight line will be inappropriate.
In the plot from the last regression (the quadratic fit), the "+" signs describe the smooth quadratic curve that best fits these U-shaped data. The "*" signs are the actual values of MC that go along with each Q. Clearly, the U-shape seems to fit the data better than does any straight line.
|_* visually check the relationship between the raw variables
|_plot mc q
132.00 |
130.74 | *
129.47 |
128.21 |
126.95 |
125.68 | *
124.42 |
123.16 |
121.89 |
120.63 |
119.37 |
118.11 |
116.84 | *
115.58 |
114.32 |
113.05 | *
111.79 |
110.53 | *
109.26 |
108.00 | *
________________________________________
80.000 100.000 120.000 140.000 160.000
Q
|_* try creating a quadratic term in q
|_genr q2=q*q
|_* now try a "multiple regression"
|_ols mc q q2 / predict=mchat
R-SQUARE = 0.9273 R-SQUARE ADJUSTED = 0.8789
VARIANCE OF THE ESTIMATE-SIGMA**2 = 9.2024
STANDARD ERROR OF THE ESTIMATE-SIGMA = 3.0335
SUM OF SQUARED ERRORS-SSE= 27.607
MEAN OF DEPENDENT VARIABLE = 118.00
LOG OF THE LIKELIHOOD FUNCTION = -13.0926
VARIABLE ESTIMATED STANDARD T-RATIO PARTIAL STANDARDIZED ELASTICITY
NAME COEFFICIENT ERROR 3 DF P-VALUE CORR. COEFFICIENT AT MEANS
Q -3.1280 0.8572 -3.649 0.036-0.903 -8.0551 -3.2870
Q2 0.13765E-01 0.3448E-02 3.992 0.028 0.917 8.8128 1.8426
CONSTANT 288.44 52.12 5.534 0.012 0.954 0.0000 2.4444
|_* note that mchat (mc-hat) is the fitted value of the regression equation
|_* at each observation. We can now plot mchat and true mc against q:
|_* use the next plot command (comment deleted) if you have a graphics
|_* adaptor. Line printers won't be able to display this very well, though
|_*plot mc mchat q / ega line
|_* this type of plot output is adequate for this course, fortunately
|_plot mc mchat q
*=MC
+=MCHAT
M=MULTIPLE POINT
135.00 |
133.42 |
131.84 | +
130.26 | *
128.68 |
127.11 |
125.53 | *
123.95 |
122.37 |
120.79 | +
119.21 |
117.63 |
116.05 | *
114.47 | +
112.89 | M
111.32 | +
109.74 | * +
108.16 | *
106.58 |
105.00 |
________________________________________
80.000 100.000 120.000 140.000 160.000
Q
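The quadratic "multiple regression" in the session above can be cross-checked in Python by solving the three normal equations directly. This is a sketch, not a description of how SHAZAM computes its estimates: the function name ols_exact is made up for illustration, and exact rational arithmetic (fractions.Fraction) is an assumption chosen here to sidestep rounding problems that arise because Q and Q squared are nearly collinear over this range.

```python
from fractions import Fraction

def ols_exact(columns, y):
    """Solve the normal equations (X'X) b = X'y exactly by Gauss-Jordan elimination."""
    k = len(columns)
    # Augmented matrix [X'X | X'y], built from sums of cross-products.
    a = [[sum(ci * cj for ci, cj in zip(columns[i], columns[j])) for j in range(k)]
         + [sum(ci * yi for ci, yi in zip(columns[i], y))]
         for i in range(k)]
    for col in range(k):
        # Find a row with a nonzero pivot and swap it into place.
        pivot_row = next(r for r in range(col, k) if a[r][col] != 0)
        a[col], a[pivot_row] = a[pivot_row], a[col]
        pivot = a[col][col]
        a[col] = [v / pivot for v in a[col]]
        # Eliminate this column from every other row.
        for r in range(k):
            if r != col:
                factor = a[r][col]
                a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[col])]
    return [a[i][k] for i in range(k)]

# Data as printed by "print mc q" in the SHAZAM session.
mc = [Fraction(v) for v in (117, 111, 109, 114, 126, 131)]
q = [Fraction(v) for v in (94, 106, 118, 130, 142, 154)]
ones = [Fraction(1)] * len(q)
q2 = [qi * qi for qi in q]

const, b_q, b_q2 = ols_exact([ones, q, q2], mc)

# Fitted values (the counterpart of predict=mchat) and sum of squared errors.
mchat = [const + b_q * qi + b_q2 * qi * qi for qi in q]
sse = sum((m - f) ** 2 for m, f in zip(mc, mchat))

# Output level at which the fitted quadratic reaches its minimum.
q_min = -b_q / (2 * b_q2)

print(f"CONSTANT = {float(const):.2f}, Q = {float(b_q):.4f}, Q2 = {float(b_q2):.6f}")
print(f"SSE = {float(sse):.3f}")
print(f"fitted MC is lowest near Q = {float(q_min):.1f}")
```

A positive coefficient on Q2 together with a negative coefficient on Q is what gives the fitted curve its U shape, with the bottom of the U inside the range of observed output levels.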