THE UNIVERSITY OF CALIFORNIA, LOS ANGELES
Department of Economics
Economics 143 - Applied Regression Analysis
October 8, 1998
Cameron
Problem Set #2: Statistics Review, Continued
Outline of Solutions

INSTRUCTIONS: Some review of your prerequisite coursework in univariate statistics may be required to complete these problems.

NETWORK FILES NEEDED: n:alaska.dat, n:alaska0.sha

1. Is either of the following a valid probability density function? Why or why not?

 f(x) = [1/sqrt(2 pi)] exp(-x^2/2) (this should be very familiar). This is a valid p.d.f. because it is the algebraic form of the standard normal distribution for a random variable called X. We reviewed in class the reasons for the bell shape of this distribution. Its symmetry around zero comes from x appearing only in squared form. The function exp is another way of saying "e to the power." Anything to a negative power is the same as 1 over that thing to the same positive power. The smaller x is in absolute value, the greater the height of the density function, with a maximum height at x = 0. The larger x is in absolute value, the smaller the height of the function. The "bell" shape comes from the exp(.) function that is used.

 f(x) = exp(-x^2/2). This function is exactly the same as the first one, except it is missing the factor of 1 over the square root of 2 pi. We know the first p.d.f. is valid, which means that the total area under the function is 1.0. Therefore the area under this one must be equal to the square root of 2 pi (about 2.507), not 1.0, so it is not a valid p.d.f. Now we know why the square root of 2 pi factor is there in the standard normal. The exp(.) part gives the shape of the distribution, and we need the factor to ensure that the area under the curve is 1.0 for a valid p.d.f.
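As a rough numerical check of the area argument, the sketch below integrates both functions with a simple trapezoid rule. The function names, the integration range of -10 to 10, and the number of steps are illustrative choices, not part of the problem set.

```python
import math

def kernel(x):
    """Bell-shaped part only: exp(-x^2 / 2), with no normalizing factor."""
    return math.exp(-x * x / 2.0)

def std_normal_pdf(x):
    """Standard normal density: the kernel divided by sqrt(2 pi)."""
    return kernel(x) / math.sqrt(2.0 * math.pi)

def trapezoid_area(f, lo, hi, steps):
    """Approximate the area under f between lo and hi (trapezoid rule)."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * h)
    return total * h

# The tails beyond +/-10 are negligible, so this range stands in for the whole line.
area_pdf = trapezoid_area(std_normal_pdf, -10.0, 10.0, 20000)
area_kernel = trapezoid_area(kernel, -10.0, 10.0, 20000)

print("area under first function :", round(area_pdf, 6))
print("area under second function:", round(area_kernel, 6))
print("sqrt(2 pi)                :", round(math.sqrt(2.0 * math.pi), 6))
```

The first area comes out at essentially 1.0 and the second at essentially the square root of 2 pi, which is the point of the normalizing factor.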

 2. For a standard normal random variable Z, what is Pr(Z = 0.5)? Don't be fooled. Explain. For a continuous random variable, such as the standard normal, the probability associated with any single value of the variable is exactly zero. Probabilities must be determined over intervals of the variable.

 3. If Z is the standard normal random variable, use the table inside the front cover of HGJ to determine:

 a.) Pr (Z > 0.5) Go down the margin of Table 1 until you find 0.5. Go "across" the columns until you get to the decimal amount needed to be added to 0.5, which is 0.00 in this simple case (just the first column in the body of the table, 0.1915). The number in the body of the table gives the area under the standard normal curve between zero and the value 0.5. The cumulative probability up to zero is 0.50. Thus the amount of probability in the region less than 0.5 is 0.50 +.1915 = 0.6915. This means that the probability above 0.5 is 1 - 0.6915 = .3085. It is usually very helpful to sketch a diagram when you are figuring out these probabilities.  Another way to have calculated this would have been to think of it as ALL the probability above 0 (i.e. 0.5) less the amount between zero and 0.5 (i.e. 0.1915).  This would yield the same answer.

b.) Pr (0.3 < Z < 0.8) Both of these values are to the right of 0. Thus, we can figure the probability between these bounds by finding the area between 0 and each of them, and then taking the difference. The answer is 0.2881 - 0.1179 = 0.1702.

c.) Pr (-0.5 < Z < 1.2) One of these is to the left of zero, and one to the right, so we need to exploit the symmetry of the standard normal distribution. First, find the bit between zero and 1.2. This is 0.3849. Then, recognize that the bit between -0.5 and 0 will be the same as the amount between 0 and 0.5, which is .1915. Now add them, to get .5764.

d.) Pr (|Z| > 1.96) (an important one!) To compute probabilities to a second decimal place, you need to find the first decimal place on the left margin of the table, then move across the row to get the second decimal place. Here, you want the sum of the probabilities associated with values less than -1.96 and greater than 1.96. First, remember that the probability greater than 0 is 0.5, and that the distribution is symmetric around zero. The probability between 0 and 1.96 is 0.4750, so the probability beyond 1.96 (to the right) is 0.025. Since we want the probability in both ends, the answer will be 0.05.
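The table lookups in Problem 3 can be cross-checked in software. The sketch below uses Python's standard-library statistics.NormalDist as a stand-in for the printed table; the variable names are illustrative only.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

# cdf(z) is the cumulative probability up to z, i.e. 0.5 plus the table entry for z > 0.
p_a = 1.0 - Z.cdf(0.5)            # Pr(Z > 0.5)
p_b = Z.cdf(0.8) - Z.cdf(0.3)     # Pr(0.3 < Z < 0.8)
p_c = Z.cdf(1.2) - Z.cdf(-0.5)    # Pr(-0.5 < Z < 1.2)
p_d = 2.0 * (1.0 - Z.cdf(1.96))   # Pr(|Z| > 1.96), both tails by symmetry

for label, p in [("a", p_a), ("b", p_b), ("c", p_c), ("d", p_d)]:
    print(label, round(p, 4))
```

Any small differences from the table-based answers come from the four-decimal rounding of the table entries.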

4. If X is approximately normally distributed with mean 5 and variance 9, determine

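a.) Pr (X > 6) Standardize using the standard deviation, 3 (the square root of the variance, 9): Pr(X > 6) = Pr(Z > (6-5)/3 ), which is roughly Pr(Z > 0.33) = 0.5 - 0.1293 = 0.3707, or about 0.37.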

b.) Pr (3 < X < 6) Same as Pr( (3-5)/3 < Z < (6-5)/3 ) = Pr (-0.67 < Z < 0.33). The first cutoff value is to the left of zero, and the second to the right. Find the probability between zero and the absolute value of each of these values, and then add to find the probability between -0.67 and 0.33. Thus, 0.2486 + 0.1293 = 0.3779.

c.) x* such that Pr (X < x*) = .025 To answer this one, you need to look in the body of the table and work back to the margin. The cutoff value x* will be well out to the left of zero, since there is only a little bit of cumulative probability up to this point. We need to use both the symmetry of the standard normal distribution and the fact that the probability under either half of the distribution is 0.5. If we find the positive cutoff value that captures 0.4750 between zero and this cutoff, the negative of this cutoff value will be the one we are seeking. This cutoff appears to be +1.96, so the mirror image, -1.96, is the cutoff of the standard normal that leaves 0.025 in the lower tail of the distribution. Now we need to get from the standard normal to the general normal with mean 5 and standard deviation 3. Since Z = (X - mean)/std.dev., we can solve for X = mean + std.dev.*Z. Thus, the answer is x* = 5 + 3*(-1.96) = -0.88. Double-checking for common sense, this number is roughly two standard deviations below the mean of X.
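The standardization steps in Problem 4 can be sketched in code as well. This again uses statistics.NormalDist in place of the table, working directly with a normal distribution of mean 5 and standard deviation 3; the variable names are illustrative.

```python
from statistics import NormalDist

# Variance 9 means standard deviation 3.
X = NormalDist(mu=5.0, sigma=3.0)

p_a = 1.0 - X.cdf(6.0)           # Pr(X > 6)
p_b = X.cdf(6.0) - X.cdf(3.0)    # Pr(3 < X < 6)
x_star = X.inv_cdf(0.025)        # cutoff with 2.5% of the probability below it

print("Pr(X > 6)     =", round(p_a, 4))
print("Pr(3 < X < 6) =", round(p_b, 4))
print("x*            =", round(x_star, 2))

# The same cutoff via the standard normal: x* = mean + std.dev. * z
z_star = NormalDist().inv_cdf(0.025)
print("via Z         =", round(5.0 + 3.0 * z_star, 2))
```

The software does not round z to two decimals before looking it up, so its probabilities can differ from the table-based ones in the third or fourth decimal place.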

5. Distinguish between a population "parameter," an "estimator," and an "estimate." A population parameter describes a characteristic of the population (a population from which you may be planning to draw a sample for analysis). It is a true but unknown constant. It does not have any variance and is not random. An example would be the true population mean, denoted by the Greek mu. An estimator is a formula that you use to calculate an estimate of the true but unknown population parameter. An example would be the sample mean: (1/n)*[sum of the Xi values in the sample]. When you plug the values from a particular sample into the estimator formula, you get an estimate of the true but unknown population parameter. You will always use the same estimator formula, but depending on what sample you draw from the population, you will come up with different "estimates" of the population parameter of interest.
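A small simulation may help keep the three terms apart. Everything in it is hypothetical: the population (normal with mean 5 and standard deviation 3), the sample size, the seed, and the names MU, SIGMA and sample_mean are chosen only for illustration.

```python
import random

random.seed(143)  # fixed seed so the run is repeatable

MU = 5.0      # the population parameter: a fixed constant (unknown in practice)
SIGMA = 3.0   # population standard deviation (also hypothetical)

def sample_mean(values):
    """The estimator: a formula applied to whatever sample is drawn."""
    return sum(values) / len(values)

# Each new sample plugged into the same estimator gives a different estimate.
estimates = []
for _ in range(5):
    sample = [random.gauss(MU, SIGMA) for _ in range(36)]
    estimates.append(sample_mean(sample))

for i, est in enumerate(estimates, start=1):
    print("sample", i, "estimate of mu:", round(est, 3))
```

The parameter never changes from one sample to the next; only the estimates do.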

6. Distinguish between "point estimation" and "interval estimation." Point estimation means coming up with the best possible guess about a true but unknown population parameter. However, the probability of this value equalling the true parameter is zero (it's a continuous distribution). Often, we prefer an interval estimate--a range within which the true but unknown parameter value will lie with some degree of "confidence." There is a positive probability that this will happen.

7. Is the median a "linear estimator"? Is the mean? An estimator is a linear estimator if its formula combines the values of the variable (those showing up in a sample) in a linear fashion, namely, the sum of coefficient times value, plus coefficient times value, etc. The observed values cannot be raised to powers, or multiplied by other variables, etc. The mean is a linear estimator because it is calculated as (1/n)X1 + (1/n)X2 + (1/n)X3 +...+ (1/n)Xn, where n is the number of observations in the sample. In contrast, the recipe for calculating a median is not a "linear" formula. It involves sorting all of the observations from smallest to largest and identifying the single observation in the "middle" (for odd-number sized samples), or averaging the pair of observations in the "middle" (for even-number sized samples). This is definitely not a linear formula.
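One consequence of linearity is easy to check numerically: a fixed-weight sum of the observations must be additive, so the estimator applied to the element-by-element sum of two samples equals the sum of the two separate results. The toy samples below are made up purely to illustrate this; they are not from the course data.

```python
from statistics import mean, median

# Two hypothetical samples of the same size.
a = [1.0, 2.0, 10.0]
b = [10.0, 2.0, 1.0]
summed = [x + y for x, y in zip(a, b)]  # element-by-element sum

# The mean is a fixed-weight sum, so it is additive.
print("mean(a + b)          =", mean(summed))
print("mean(a) + mean(b)    =", mean(a) + mean(b))

# The median depends on the sort order, so additivity fails.
print("median(a + b)        =", median(summed))
print("median(a) + median(b)=", median(a) + median(b))
```

Because the median fails this additivity check, no fixed set of weights on X1, ..., Xn can reproduce it.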

8. If we draw a random sample of size 36 and we find that the sample mean is 7 and the sample variance is s^2 = 4, construct a 95% "confidence interval" for the value of the population mean mu_x. Show your work carefully. The confidence interval is based on the sample evidence concerning the unknown parameter value, in this case, the sample mean, which is 7. You then add/subtract an amount that accounts for the noise in the sample, which is assumed to provide information about the noise in the population. This plus-or-minus amount involves the 0.025 critical value of the t distribution with n-1 = 35 degrees of freedom, which is 2.0315 (by "linear" interpolation from Table 2 inside the front of the text). This must be multiplied by the standard error of the estimate of the mean, which is equal to the sample standard deviation (s = 2, since s^2 = 4) divided by the square root of the sample size (6). The confidence interval can thus be written as: [ 7 - (2.0315)(2/6), 7 + (2.0315)(2/6) ] = [6.3228, 7.6772].

10. If we draw a random sample of size 25 and discover that the mean of X in the sample is 10 and s^2 = 16, test the null hypothesis that mu_x = 8.5 (using either a "two-tailed" Z-test or a "two-tailed" t-test, whichever is most appropriate, justifying your choice): Since we know only the sample variance, not the true underlying population sigma-squared value, and since the sample size is only 25 (not asymptotically large), we cannot use a Z-test. The t-test is appropriate, with its somewhat greater dispersion due to the additional uncertainty stemming from using s-squared as a guess about the true sigma-squared.

a.) at the 5% "level of significance"; You can test hypotheses either by constructing a 95% confidence interval and seeing if the hypothesized value lies inside this confidence interval, or you can do a classical hypothesis test that involves subtracting the "true mean" and dividing by the "true standard deviation," and comparing to the "known" distribution of this standardized variable. We don't know the true mean, but this is usually what we are hypothesizing. We don't know the true standard deviation, so we use the sample data s over the square root of n, which sends us to a t-distribution with n-1 degrees of freedom, instead of a standard normal distribution. Both of these strategies (confidence interval or t-test) use the same sample information, just combined in a different way.

For a confidence interval, we use the point estimate of the true mean, 10, plus or minus the 0.025 critical value for a t(25-1) distribution (which is 2.064) times s/(square root n) (equal to 4/5). CI.95(mu) = 10 +/- 2.064(4/5). If 8.5 is in this interval, we cannot reject 8.5 as a plausible hypothesis about the true mean. The interval works out to be (8.349, 11.651), thus we cannot reject 8.5 as the true mean.

For the t-test, we construct (10 - 8.5)/(4/5) = 1.875. Since this number lies inside the 95% range of values of a t distribution with 24 degrees of freedom (-2.064, +2.064), we cannot reject the null hypothesis that the true mean is 8.5.

b.) at the 10% "level of significance". For this level of significance, nothing changes except the critical value of the t-distribution. Instead of the cutoff that leaves a total of 5% in the two tails of the t-distribution, we want the cutoffs that leave a total of 10% in the two tails, still for (25-1) degrees of freedom. For one tail, then, we want to have 5% of the probability out in the tail beyond the cutoff, so the relevant value, from the first column of Table 2 in the text, is 1.711. Since the critical value for the 10% significance level test is smaller in absolute value than it was in the case of the 5% significance level hypothesis test, the confidence interval is narrower, being now CI.90(mu) = 10 +/- 1.711(4/5). This gives (8.631, 11.369), which excludes the hypothesized value of 8.5. If 8.5 lies outside the 90% confidence interval, we reject this value as a hypothesis about the true mean.

By traditional t-test, the calculation is the same. We still get a standardized value of 1.875. Now, however, this number is beyond the 1.711 critical value of the t-distribution. If the null hypothesis is true, a value further from zero than 1.711 would happen only 10% or less of the time. Thus 1.875 is an implausible value of the test statistic, so we reject the null hypothesis (even though there is a 10% or less chance that this is just a bizarre sample).
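Both versions of the test in Problem 10 can be reproduced in a few lines. The two critical values are typed in from the text's Table 2 rather than computed, and the variable and dictionary names are illustrative.

```python
import math

n = 25        # sample size
xbar = 10.0   # sample mean
s2 = 16.0     # sample variance
mu0 = 8.5     # hypothesized value of the population mean

std_error = math.sqrt(s2) / math.sqrt(n)  # s / sqrt(n) = 4/5
t_stat = (xbar - mu0) / std_error         # standardized test statistic

# Two-tailed critical values for t with 24 d.f., taken from the text.
critical = {0.05: 2.064, 0.10: 1.711}

reject = {}
for alpha, t_crit in critical.items():
    lower = xbar - t_crit * std_error
    upper = xbar + t_crit * std_error
    # Rejecting by t-test and by confidence interval are the same decision.
    reject[alpha] = abs(t_stat) > t_crit
    print("alpha=%.2f  CI=(%.3f, %.3f)  t=%.3f  reject H0: %s"
          % (alpha, lower, upper, t_stat, reject[alpha]))
```

The test statistic is the same at both levels; only the cutoff it is compared against changes.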
 

 
11. EXPLORING A DATA SET (Calisthenics with SHAZAM): Download from the network (or from the website) to your own diskette the files n:alaska.dat and n:alaska0.sha using the procedure outlined in the SHAZAM computer software orientation handout. Be sure to print out a copy of the program itself so you can refer to it later.

A description of the data set is contained in comment lines at the top of the program file.

a.) Start the SHAZAM program and when it says "TYPE COMMAND" invoke an already-created set of commands from the alaska0.sha file by issuing the command "file 5 alaska0.sha." (If you are working from a disk in the a: drive, use file 5 a:alaska0.sha.) Use the pause button to view intermediate steps; this run is simply to verify that you can read and use the data (i.e. that all files are in the right places). Enter the command stat to verify that you have all the data.

b.) Now, use the SHAZAM editor to make some changes to the alaska0.sha file. Instead of issuing commands interactively after executing the initial set of commands from the alaska0.sha file, incorporate some additional commands into the program. Until you think you have the program running smoothly and correctly, just have the output sent to the screen. When it all looks like it works fine, save the output in the SHAZAM window to a file, for later printing via Notepad (from the Econ143 menu, if you are working in the lab).

For example, look at the actual numbers and then produce a set of descriptive statistics for some of the variables in the augmented data set by using the commands:

print year ptot qtot rtot
stat ptot qtot rtot / pcor

You can just make sure that these command lines appear, without comment characters (*) in the first column, at the end of the program file.

c.) Descriptive statistics: What are the highest and lowest prices (in 1989 dollars) that have been observed over the 1964-1993 period? What has been the average size of the catch, in millions of pounds, over this time period? What has been the standard deviation in catch over this period? This can be read directly from the above stat output.


|_stat ptot qtot rtot / pcor
 NAME        N    MEAN        ST. DEV      VARIANCE     MINIMUM      MAXIMUM
 PTOT         30  0.94821     0.32028     0.10258      0.52831       1.8280
 QTOT         30  0.43576E+06 0.21326E+06 0.45479E+11  0.13160E+06  0.84608E+06
 RTOT         30  0.40407E+06 0.22485E+06 0.50557E+11   99802.      0.97655E+06

The minimum and maximum values of PTOT in the sample are given in the last two columns of the STAT output: the lowest price is about 0.53 and the highest about 1.83 (in 1989 dollars). The average catch in thousands of pounds is given by the mean value of QTOT, about 435,760; in millions of pounds, this is about 436. The standard deviation in catch has been about 213 million pounds.

d.) SOME ECONOMIC THEORY: When demand is "elastic" (such that a given percent change in price leads to a larger percent opposite change in quantity demanded), an increase in price results in a decrease in total revenues in a market. When demand is "inelastic" (such that a given percent change in price leads to a smaller percent opposite change in quantity demanded), an increase in price results in an increase in total revenues. If we (erroneously) considered all five major types of Alaskan salmon to be sold in one market, what does the correlation table produced by the / pcor option on the stat command imply about overall demand elasticity in this market?


 CORRELATION MATRIX OF VARIABLES -       30 OBSERVATIONS

 PTOT       1.0000
 QTOT     -0.13806       1.0000
 RTOT      0.49714      0.74976       1.0000
              PTOT         QTOT         RTOT

The correlation between revenues and prices is positive (about 0.50), suggesting that when prices are higher, revenues are higher. Thus, demand appears to be inelastic. Would you consider this implication reliable? Why or why not? (Think about the implicit ceteris paribus requirement underlying "demand curves," i.e. that everything else be held constant.) This is not a solid conclusion because lots of things besides prices have changed over the thirty years of data. The price-quantity pairs we observe are not points on one demand curve unless demand has remained constant and the supply curve has been shifting around. Plotting the relationship between ptot (average prices) and qtot (overall quantities) reveals that we are probably not looking at points along one stable demand curve over the time period represented by the data.
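The elasticity rule quoted in part d.) can be illustrated with a stylized demand curve. The constant-elasticity form Q = scale * P^(-elasticity), the elasticity values 0.5 and 1.5, and the prices used are all assumptions made for this sketch; nothing here is estimated from the Alaska data.

```python
def revenue(price, elasticity, scale=100.0):
    """Total revenue P*Q along a hypothetical demand curve Q = scale * P**(-elasticity)."""
    quantity = scale * price ** (-elasticity)
    return price * quantity

# Compare revenue before and after a 20% price increase.
for elasticity, label in [(0.5, "inelastic"), (1.5, "elastic")]:
    before = revenue(1.0, elasticity)
    after = revenue(1.2, elasticity)
    direction = "rises" if after > before else "falls"
    print("%-9s demand: revenue %s from %.2f to %.2f"
          % (label, direction, before, after))
```

Along a single demand curve the sign of the price-revenue relationship reveals the elasticity, which is why the inference requires that the curve itself has not shifted.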

e.) Now take into account that there are five different "goods" involved--chum, king, pink, red, and silver salmon species, possibly each with a distinct market. Use stat / pcor and the crude plot option in SHAZAM to see whether catch levels for each of these five species "move together" over this time period. Comment.

stat cquant kquant rquant pquant squant / pcor
plot cquant year / nopretty
plot kquant year / nopretty
plot rquant year / nopretty
plot pquant year / nopretty
plot squant year / nopretty


 |_stat cquant kquant rquant pquant squant / pcor
 NAME        N    MEAN        ST. DEV      VARIANCE     MINIMUM      MAXIMUM
 CQUANT       30   64333.      23979.     0.57499E+09   22668.      0.12157E+06
 KQUANT       30   11828.      2299.0     0.52853E+07   7184.0       16904.
 RQUANT       30  0.16031E+06  99843.     0.99685E+10   32246.      0.37840E+06
 PQUANT       30  0.17384E+06  93265.     0.86985E+10   28822.      0.33879E+06
 SQUANT       30   25455.      13958.     0.19482E+09   7688.0       53776.

  CORRELATION MATRIX OF VARIABLES -       30 OBSERVATIONS
 CQUANT     1.0000
 KQUANT    0.40846       1.0000
 RQUANT    0.59101      0.38130       1.0000
 PQUANT    0.63651      0.39569      0.81231       1.0000
 SQUANT    0.72498      0.35044      0.78959      0.81930       1.0000
              CQUANT       KQUANT       RQUANT       PQUANT       SQUANT

All of the pairwise correlations between different catch quantities are positive, so that on average, if one quantity is higher, so are all others. You could certainly use plot kquant pquant for all pairs and see the scattergrams of quantity pairs that would reveal the same sort of thing.

Or try some fancier plots by using:

plot cquant kquant rquant pquant squant year / gnu line

If you have your own stand-alone computer equipment and an attached laser printer, you are welcome to experiment with the gnuplot options mentioned in the manual in an attempt to print out hard copies of graphics files. Note that these fancier files do not come out if you direct the output to a file. The cruder dot matrix plots will typically be adequate for homeworks.

f.) Have the prices of these five species moved together? An appropriate stat output with interpretation will be sufficient to answer this question.


 |_stat cprice kprice rprice pprice sprice / pcor
 NAME        N    MEAN        ST. DEV      VARIANCE     MINIMUM      MAXIMUM
 CPRICE       30  0.65220     0.28124     0.79095E-01  0.32800       1.2590
 KPRICE       30   2.0559     0.62286     0.38796       1.0810       3.3880
 RPRICE       30   1.3528     0.50440     0.25442      0.79600       2.9390
 PPRICE       30  0.55017     0.23236     0.53989E-01  0.16200      0.99300
 SPRICE       30   1.3817     0.54441     0.29638      0.74700       2.5510

  CORRELATION MATRIX OF VARIABLES -       30 OBSERVATIONS

 CPRICE     1.0000
 KPRICE    0.69505       1.0000
 RPRICE    0.64362      0.76019       1.0000
 PPRICE    0.84334      0.38005      0.51253       1.0000
 SPRICE    0.95809      0.72896      0.68958      0.77243       1.0000
              CPRICE       KPRICE       RPRICE       PPRICE       SPRICE

All of the price pairs are also positively correlated, so the prices tend to move together as well. Higher prices for one species are associated with higher prices for all other species.

g.) Generate revenues from each species and assess whether these have moved similarly over the time period of these data. Use genr commands like:

genr crev=cquant*cprice
genr krev=kquant*kprice
genr rrev=rquant*rprice
genr prev=pquant*pprice
genr srev=squant*sprice
stat crev krev rrev prev srev / pcor


NAME        N    MEAN        ST. DEV      VARIANCE     MINIMUM      MAXIMUM
 CREV         30   41600.      23877.     0.57010E+09   10213.      0.13178E+06
 KREV         30   24695.      9983.3     0.99665E+08   11998.       44276.
 RREV         30  0.22087E+06 0.15725E+06 0.24726E+11   42404.      0.55412E+06
 PREV         30   83413.      44449.     0.19757E+10   13056.      0.17652E+06
 SREV         30   33495.      18958.     0.35940E+09   8242.9       77217.

  CORRELATION MATRIX OF VARIABLES -       30 OBSERVATIONS

 CREV       1.0000
 KREV      0.72146       1.0000
 RREV      0.49899      0.61728       1.0000
 PREV      0.65114      0.68561      0.59497       1.0000
 SREV      0.70110      0.78329      0.77147      0.56880       1.0000
              CREV         KREV         RREV         PREV         SREV

If prices are positively correlated across all species and so are quantities, it is not surprising that revenues are also positively correlated (that they tend to "move together").
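For readers who want to see what the genr and stat / pcor steps are doing, here is a rough Python equivalent. The four observations per series are made-up toy numbers, not values from alaska.dat, and the pearson helper is written out by hand only for this sketch.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical quantities and prices for two species (toy data only).
cquant = [10.0, 12.0, 15.0, 11.0]
cprice = [0.50, 0.60, 0.80, 0.55]
kquant = [2.0, 2.5, 3.0, 2.2]
kprice = [2.0, 2.2, 2.6, 2.1]

# Same idea as "genr crev=cquant*cprice": revenue = quantity * price, row by row.
crev = [q * p for q, p in zip(cquant, cprice)]
krev = [q * p for q, p in zip(kquant, kprice)]

print("corr(cquant, kquant):", round(pearson(cquant, kquant), 3))
print("corr(cprice, kprice):", round(pearson(cprice, kprice), 3))
print("corr(crev, krev)    :", round(pearson(crev, krev), 3))
```

In this toy case, as in the salmon data, quantities and prices that move together produce revenues that move together.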

h.) If an Alaska commercial fisher targeted only one species, say the low-end chum salmon, would this fisher have felt much of a decrease in their income in 1989, assuming a fairly constant number of fishers and an even distribution of the catch? Comment. What about fishers who targeted king salmon? Note that you can limit all analysis to a subset of the observations, say observations 24 through 30 (1987-1993) by using the command sample 24 30. (To undo this limitation, issue the command sample 1 30.)

 |_sample 24 30
 
 |_plot crev year
                    *=CREV
   0.13178E+06    |         *
   0.12485E+06    |
   0.11791E+06    |
   0.11097E+06    |
   0.10404E+06    |
    97102.        |
    90167.        |
    83231.        |
    76295.        |
    69359.        |
    62423.        |
    55487.        |
    48551.        |
    41615.        |    *
    34679.        |                                  *
    27744.        |              *    *         *
    20808.        |                        *
                    ________________________________________
 
            1986.000  1988.000  1990.000  1992.000  1994.000
 
                                YEAR
 
 |_plot krev year
                      *=KREV
 
   0.13178E+06    |         *
   0.12485E+06    |
   0.11791E+06    |
   0.11097E+06    |
   0.10404E+06    |
    97102.        |
    90167.        |
    83231.        |
    76295.        |
    69359.        |
    62423.        |
    55487.        |
    48551.        |
    41615.        |    *
    34679.        |                                  *
    27744.        |              *    *         *
    20808.        |                        *
                    ________________________________________
 
            1986.000  1988.000  1990.000  1992.000  1994.000
 
                               YEAR
For both species, these particular plots show that revenues went up between 1987 and 1988, and then fell sharply in 1989 and stayed lower in subsequent years. Whether this was due to perceived product "taint" from the oil spill cannot be concluded without further analysis.

i.) OPTIONAL (If all the other tasks were easy for you): Try the GNUPLOT capability of SSCnet SHAZAM by including in your program the following command:

plot cquant kquant rquant pquant squant year / gnu line &
   commfile=plot.gnu datafile=plot.dat

After you exit the program, choose the GNUPLOT for Windows icon from the Econ 143 folder. Open the file you identified as your commfile= file (here, plot.gnu). If you wish, you can edit this file first, using Notepad. The file you actually edit will be called something like "C000.gnu," which is what is loaded by your plot.gnu program. [A VERY good idea is to put your name in the title of the plot, so that you can find your own output as it emerges from the laser printer in the lab!] 

Contents of the C000.gnu program file that was created by the commfile=plot.gnu option on the above plot command (SHAZAM actually creates a program file called "plot.gnu" which points to an automatically numbered C00?.gnu file that does not already exist):

set samples           30
set title
set title "                                                                "
set key
set xlabel "YEAR    "
set ylabel
plot  "plot.dat" using      1:       2  title "CQUANT  "  w linespoint,\
      "plot.dat" using      1:       3  title "KQUANT  "  w linespoint,\
      "plot.dat" using      1:       4  title "RQUANT  "  w linespoint,\
      "plot.dat" using      1:       5  title "PQUANT  "  w linespoint,\
      "plot.dat" using      1:       6  title "SQUANT  "  w linespoint  

To change this file so that it will produce a gnuplot plot with a title and different labels for the vertical axis and the key to the plotted lines, modify the file as shown below:

set samples           30
set title
set title " Quantities of different Salmon Species Caught over Time "
set key
set xlabel "YEAR    "
set ylabel  " Quantities Caught"
plot  "plot.dat" using      1:       2  title "CHUM    "  w linespoint,\
      "plot.dat" using      1:       3  title "KING    "  w linespoint,\
      "plot.dat" using      1:       4  title "RED     "  w linespoint,\
      "plot.dat" using      1:       5  title "PINK    "  w linespoint,\
      "plot.dat" using      1:       6  title "SILVER  "  w linespoint  

When you run this plot, it should come out looking like the following graphic:




Updated: October 15, 1998; Trudy Ann Cameron