INSTRUCTIONS: Some review of your prerequisite coursework in univariate statistics may be required to complete these problems.
NETWORK FILES NEEDED: n:alaska.dat, n:alaska0.sha
1. Is either of the following a valid probability density function? Why or why not?
(i) f(x) = (1/sqrt(2 pi)) exp(-x^2/2), (ii) g(x) = exp(-x^2/2)
(this should be very familiar) The first function is a valid
p.d.f. because it is the algebraic form of the standard normal distribution
for a random variable called X. We reviewed in class the reasons for the
bell shape of this distribution. Its symmetry around zero comes from x
appearing only in squared form. The function exp is another way of saying
"e to the power." Anything to a negative power is the same as 1 over that thing
to the corresponding positive power, so exp(-x^2/2) = 1/exp(x^2/2). The smaller
x is in absolute value, the greater the height of the density function, with a
maximum height at zero. The larger x is in absolute value, the smaller the
height of the function. The "bell" shape comes from the exp(.) function that is used.
The second function is exactly the same as the first one,
except it is missing the factor of 1 over the square root of 2 pi, so it is
not a valid p.d.f. We know the first p.d.f. is valid, which means that the
total area under that function is 1.0. Therefore the area under the second
one must be equal to the square root of 2 pi, not 1.0. Now we know why the
square root of 2 pi factor is there in the standard normal. The exp(.) part
gives the shape of the distribution, and we need the factor to ensure that
the area under the curve is 1.0 for a valid p.d.f.
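The area argument above can be checked numerically. The following is a minimal sketch, assuming a simple trapezoid rule over the range -10 to 10 is accurate enough (the tails beyond that range are negligible); the function names `kernel` and `trapezoid_area` are illustrative only.

```python
import math

def kernel(x):
    """Unnormalised bell curve exp(-x^2 / 2), i.e. the second function."""
    return math.exp(-x * x / 2.0)

def trapezoid_area(f, lo, hi, steps):
    """Approximate the integral of f over [lo, hi] by the trapezoid rule."""
    width = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * width)
    return total * width

# Area under exp(-x^2/2); it should be close to sqrt(2 pi), not 1.
area = trapezoid_area(kernel, -10.0, 10.0, 20000)

# Dividing by sqrt(2 pi) rescales the area to 1, as a valid p.d.f. requires.
scaled_area = area / math.sqrt(2.0 * math.pi)

print(area, math.sqrt(2.0 * math.pi), scaled_area)
```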
2. For a standard normal random variable Z, what is Pr(Z = 0.5)? Don't be fooled. Explain. For a continuous random variable, such as the standard normal, the probability associated with any single value of the variable is exactly zero. Probabilities must be determined over intervals of the variable.
3. If Z is the standard normal random variable, use the table inside the front cover of HGJ to determine:
a.) Pr (Z > 0.5) Go down the margin of Table 1 until you find 0.5. Go "across" the columns until you get to the decimal amount needed to be added to 0.5, which is 0.00 in this simple case (just the first column in the body of the table, 0.1915). The number in the body of the table gives the area under the standard normal curve between zero and the value 0.5. The cumulative probability up to zero is 0.50. Thus the amount of probability in the region less than 0.5 is 0.50 +.1915 = 0.6915. This means that the probability above 0.5 is 1 - 0.6915 = .3085. It is usually very helpful to sketch a diagram when you are figuring out these probabilities. Another way to have calculated this would have been to think of it as ALL the probability above 0 (i.e. 0.5) less the amount between zero and 0.5 (i.e. 0.1915). This would yield the same answer.
b.) Pr (0.3 < Z < 0.8) Both of these values are to the right of 0. Thus, we can figure the probability between these bounds by finding the area between 0 and each of them, and then taking the difference. The answer is 0.2881 - 0.1179 = 0.1702.
c.) Pr (-0.5 < Z < 1.2) One of these is to the left of zero, and one to the right, so we need to exploit the symmetry of the standard normal distribution. First, find the bit between zero and 1.2. This is 0.3849. Then, recognize that the bit between -0.5 and 0 will be the same as the amount between 0 and 0.5, which is .1915. Now add them, to get .5764.
d.) Pr (|Z| > 1.96) (an important one!) To compute probabilities to a second decimal place, you need to find the first decimal place on the left margin of the table, then move across the row to get the second decimal place. Here, you want the sum of the probabilities associated with values less than -1.96 and greater than 1.96. First, remember that the probability greater than 0 is 0.5, and that the distribution is symmetric around zero. The probability between 0 and 1.96 is 0.4750, so the probability beyond 1.96 (to the right) is 0.5 - 0.4750 = 0.025. Since we want the probability in both ends, the answer will be 0.05.
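The table lookups in problem 3 can be cross-checked in software. This is a sketch only, assuming the standard identity that the normal cumulative probability equals 0.5*(1 + erf(z/sqrt(2))); the helper name `phi` is not from the text.

```python
import math

def phi(z):
    """Standard normal cumulative probability Pr(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

p_a = 1.0 - phi(0.5)             # Pr(Z > 0.5)
p_b = phi(0.8) - phi(0.3)        # Pr(0.3 < Z < 0.8)
p_c = phi(1.2) - phi(-0.5)       # Pr(-0.5 < Z < 1.2)
p_d = 2.0 * (1.0 - phi(1.96))    # Pr(|Z| > 1.96), both tails by symmetry

for label, p in [("a", p_a), ("b", p_b), ("c", p_c), ("d", p_d)]:
    print(label, round(p, 4))
```

The results should agree with the table-based answers to about four decimal places.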
4. If X is approximately normally distributed with mean 5 and variance 9, determine
a.) Pr (X > 6) Pr(X > 6) = Pr(Z > (6-5)/3 ) = roughly Pr(Z > 0.33) = 0.5 - 0.1293 = 0.37.
b.) Pr (3 < X < 6) Same as Pr( (3-5)/3 < Z < (6-5)/3 ) = Pr (-0.67 < Z < 0.33). The first cutoff value is to the left of zero, and the second to the right. Find the probability between zero and the absolute value of each of these values, and then add to find the probability between -0.67 and 0.33. Thus, 0.2486 + 0.1293 = 0.3779.
c.) x* such that Pr (X < x*) = .025 To answer this one, you need to look in the body of the table and work back to the margin. The cutoff value x* will be well out to the left of zero, since there is only a little bit of cumulative probability up to this point. We need to use both the symmetry of the standard normal distribution and the fact that the probability under either half of the distribution is 0.5. If we find the positive cutoff value that captures 0.4750 between zero and this cutoff, the negative of this cutoff value will be the one we are seeking. This cutoff is +1.96, so the mirror image, -1.96, is the cutoff of the standard normal that leaves 0.025 in the lower tail of the distribution. Now we need to get from the standard normal to the general normal with mean 5 and standard deviation 3. Since Z = (X - mean)/std.dev., we can solve for X = mean + std.dev*Z. Thus, the answer is x* = 5 + 3*(-1.96) = -0.88. Double-checking for common sense, this number is roughly two standard deviations below the mean of X.
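The three parts of problem 4 can also be checked without standardizing by hand. The sketch below assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`. Because it does not round Z to two decimals, its answers differ slightly in the third decimal place from table-based values.

```python
from statistics import NormalDist

# X is normal with mean 5 and variance 9, so the standard deviation is 3.
x = NormalDist(mu=5, sigma=3)

p_a = 1 - x.cdf(6)           # Pr(X > 6)
p_b = x.cdf(6) - x.cdf(3)    # Pr(3 < X < 6)
x_star = x.inv_cdf(0.025)    # cutoff leaving 0.025 in the lower tail

print(round(p_a, 4), round(p_b, 4), round(x_star, 2))
```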
5. Distinguish between a population "parameter," an "estimator," and an "estimate." A population parameter describes a characteristic of the population (a population from which you may be planning to draw a sample for analysis). It is a true but unknown constant. It does not have any variance and is not random. An example would be the true population mean, denoted by the Greek mu. An estimator is a formula that you use to calculate an estimate of the true but unknown population parameter. An example would be the sample mean: (1/n)*[sum of the Xi values in the sample]. When you plug the values from a particular sample into the estimator formula, you get an estimate of the true but unknown population parameter. You will always use the same estimator formula, but depending on what sample you draw from the population, you will come up with different "estimates" of the population parameter of interest.
6. Distinguish between "point estimation" and "interval estimation." Point estimation means coming up with the best possible guess about a true but unknown population parameter. However, the probability of this value equalling the true parameter is zero (it's a continuous distribution). Often, we prefer an interval estimate--a range within which the true but unknown parameter value will lie with some degree of "confidence." There is a positive probability that this will happen.
7. Is the median a "linear estimator"? Is the mean? An estimator is a linear estimator if its formula combines the values of the variable (those showing up in a sample) in a linear fashion, namely, the sum of coefficient times value, plus coefficient times value, etc. The observed values cannot be raised to powers, or multiplied by other variables, etc. The mean is a linear estimator because it is calculated as (1/n)X1 + (1/n)X2 + (1/n)X3 +...+ (1/n)Xn, where n is the number of observations in the sample. In contrast, the recipe for calculating a median is not a "linear" formula. It involves sorting all of the observations from smallest to largest and identifying the single observation in the "middle" (for odd-number sized samples), or averaging the pair of observations in the "middle" (for even-number sized samples). This is definitely not a linear formula.
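One way to see the difference is that any linear formula is additive: applying it to the sum of two samples gives the sum of the two results. The sketch below uses two small made-up samples (hypothetical numbers, chosen only for illustration) to show that the mean passes this check and the median does not.

```python
import math
from statistics import mean, median

# Hypothetical samples, invented purely for illustration.
x = [1.0, 2.0, 9.0]
y = [9.0, 2.0, 1.0]
total = [a + b for a, b in zip(x, y)]

# The mean written explicitly as a weighted sum with weights 1/n.
weights = [1.0 / len(x)] * len(x)
weighted_sum = sum(w * v for w, v in zip(weights, x))

# Additivity holds for the mean but fails for the median on this data.
mean_is_additive = math.isclose(mean(total), mean(x) + mean(y))
median_is_additive = math.isclose(median(total), median(x) + median(y))

print(weighted_sum, mean(x), mean_is_additive, median_is_additive)
```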
8. If we draw a random sample of size 36 and we find that the sample mean is 7 and the sample variance is s^2 = 4, construct a 95% "confidence interval" for the value of the population mean mu_x. Show your work carefully. The confidence interval is based on the sample evidence concerning the unknown parameter value, in this case, the sample mean, which is 7. You then add/subtract an amount that accounts for the noise in the sample, which is assumed to provide information about the noise in the population. This plus-or-minus amount involves the 0.025 critical value of the t distribution with n-1 = 35 degrees of freedom, which is 2.0315 (by "linear" interpolation from Table 2 inside the front of the text). This must be multiplied by the standard error of the estimate of the mean, which is equal to the sample standard deviation (s = 2, since s^2 = 4) divided by the square root of the sample size (6). The confidence interval can thus be written as: [ 7 - (2.0315)(2/6), 7 + (2.0315)(2/6) ] = [6.3228, 7.6772].
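The interval arithmetic in problem 8 can be reproduced step by step. This is a sketch, assuming the usual tabulated 0.025 critical values of 2.042 for 30 degrees of freedom and 2.021 for 40 degrees of freedom (these are standard values, taken here to match Table 2 in the text); the variable names are illustrative.

```python
import math

n = 36        # sample size
xbar = 7.0    # sample mean
s2 = 4.0      # sample variance

# Linear interpolation for 35 d.f. between the tabulated 30 and 40 d.f. values
# (assumed table entries: 2.042 and 2.021).
t_crit = 2.042 + (35 - 30) / (40 - 30) * (2.021 - 2.042)

# Standard error of the sample mean: s / sqrt(n) = 2 / 6.
std_err = math.sqrt(s2) / math.sqrt(n)

lower = xbar - t_crit * std_err
upper = xbar + t_crit * std_err
print(round(t_crit, 4), round(lower, 4), round(upper, 4))
```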
10. If we draw a random sample of size 25 and discover that the mean of X in the sample is 10 and s^2 = 16, test the null hypothesis that mu_x = 8.5 (using either a "two-tailed" Z-test or a "two-tailed" t-test, whichever is most appropriate, justifying your choice): Since we know only the sample variance, not the true underlying population sigma-squared value, and since the sample size is only 25 (not asymptotically large), we cannot use a Z-test. The t-test is appropriate, with its somewhat greater dispersion due to the additional uncertainty stemming from using s-squared as a guess about the true sigma-squared.
a.) at the 5% "level of significance"; You can test hypotheses either by constructing a 95% confidence interval and seeing if the hypothesized value lies inside this confidence interval, or you can do a classical hypothesis test that involves subtracting the "true mean" and dividing by the "true standard deviation," and comparing to the "known" distribution of this standardized variable. We don't know the true mean, but this is usually what we are hypothesizing. We don't know the true standard deviation, so we use the sample data s over the square root of n, which sends us to a t-distribution with n-1 degrees of freedom, instead of a standard normal distribution. Both of these strategies (confidence interval or t-test) use the same sample information, just combined in a different way.
For a confidence interval, we use the point estimate of the true mean, 10, plus or minus the 0.025 critical value for a t(25-1) distribution (which is 2.064) times s/(square root n) (equal to 4/5). CI.95(mu) = 10 +/- 2.064(4/5). If 8.5 is in this interval, we cannot reject 8.5 as a plausible hypothesis about the true mean. The interval works out to be (8.349, 11.651), thus we cannot reject 8.5 as the true mean.
For the t-test, we construct (10 - 8.5)/(4/5) = 1.875. Since this number lies inside the 95% range of values of a t distribution with 24 degrees of freedom (-2.064, +2.064), we cannot reject the null hypothesis that the true mean is 8.5.
b.) at the 10% "level of significance". For this level of significance, nothing changes except the critical value of the t-distribution. Instead of the cutoffs that leave a total of 5% in the two tails of the t-distribution, we use the cutoffs that leave a total of 10% in the two tails, still for (25-1) degrees of freedom. For one tail, then, we want to have 5% of the probability out in the tail beyond the cutoff, so the relevant value, from the first column of Table 2 in the text, is 1.711. Since the critical value for the 10% significance level test is smaller in absolute value than it was in the case of the 5% significance level hypothesis test, the confidence interval is narrower, being now CI.90(mu) = 10 +/- 1.711(4/5). This gives (8.631, 11.369), which excludes the hypothesized value of 8.5. Since 8.5 lies outside the 90% confidence interval, we reject this value as a hypothesis about the true mean.
By the traditional t-test, the calculation is the
same. We still get a standardized value of 1.875. Now, however, this number
is beyond the 1.711 critical value of the t-distribution. If the null
hypothesis is true, a value further from zero than 1.711 would happen only
10% or less of the time. Thus 1.875 is an implausible value of the test
statistic, so we reject the null hypothesis (even though there is a 10%
or less chance that this is just a bizarre sample).
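The two tests in problem 10 differ only in the critical value, which a short calculation makes explicit. This is a sketch using the critical values quoted in the answer (2.064 and 1.711 for 24 degrees of freedom); the dictionary layout is just one convenient way to organize it.

```python
import math

n = 25        # sample size
xbar = 10.0   # sample mean
s2 = 16.0     # sample variance
mu0 = 8.5     # hypothesized mean under the null

std_err = math.sqrt(s2 / n)        # 4/5
t_stat = (xbar - mu0) / std_err    # standardized test statistic

# Two-tailed critical values for t with 24 d.f., as quoted in the answer.
critical = {0.05: 2.064, 0.10: 1.711}

# Reject the null when |t| exceeds the critical value for that level.
reject = {alpha: abs(t_stat) > c for alpha, c in critical.items()}

for alpha, c in critical.items():
    half_width = c * std_err
    print(alpha, round(xbar - half_width, 3), round(xbar + half_width, 3), reject[alpha])
```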
11. EXPLORING A DATA SET (Calisthenics
with SHAZAM): Download from the network (or from the website) to your own
diskette the files n:alaska.dat and n:alaska0.sha using the
procedure outlined in the SHAZAM computer software orientation handout.
Be sure to print out a copy of the program itself so you can refer to it
later.
A description of the data set is contained in comment lines at the top of the program file.
a.) Start the SHAZAM program and when it says "TYPE COMMAND" invoke an already-created set of commands from the alaska0.sha file by issuing the command "file 5 alaska0.sha." (If you are working from a disk in the a: drive, use file 5 a:alaska0.sha.) Use the pause button to view intermediate steps; this run is simply to verify that you can read and use the data (i.e. that all files are in the right places). Enter the command stat to verify that you have all the data.
b.) Now, use the SHAZAM editor to make some changes to the alaska0.sha file. Instead of issuing commands interactively after executing the initial set of commands from the alaska0.sha file, incorporate some additional commands into the program. Until you think you have the program running smoothly and correctly, just have the output sent to the screen. When it all looks like it works fine, save the output in the SHAZAM window to a file, for later printing via Notepad (from the Econ143 menu, if you are working in the lab).
For example, look at the actual numbers and then produce a set of descriptive statistics for some of the variables in the augmented data set by using the commands:
print year ptot qtot rtot
stat ptot qtot rtot / pcor
You can just make sure that these command lines appear,
without comment characters (*) in the first column, at the end of the program
file.
c.) Descriptive statistics: What are the highest and lowest prices (in 1989 dollars) that have been observed over the 1964-1993 period? What has been the average size of the catch, in millions of pounds, over this time period? What has been the standard deviation in catch over this period? This can be read directly from the above stat output.
|_stat ptot qtot rtot / pcor
NAME N MEAN ST. DEV VARIANCE MINIMUM MAXIMUM
PTOT 30 0.94821 0.32028 0.10258 0.52831 1.8280
QTOT 30 0.43576E+06 0.21326E+06 0.45479E+11 0.13160E+06 0.84608E+06
RTOT 30 0.40407E+06 0.22485E+06 0.50557E+11 99802. 0.97655E+06
The minimum and maximum values of PTOT in the sample are given in the last two columns of the STAT output (about 0.53 and 1.83). The average catch in thousands of pounds is given by the mean value of QTOT. In millions of pounds, it would be about 436. The standard deviation in catch has been about 213 million pounds.
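The unit conversion for the catch figures is simple but easy to slip on. As a small sketch, taking QTOT to be measured in thousands of pounds as stated in the answer, the mean and standard deviation from the STAT output convert to millions of pounds as follows (variable names are illustrative):

```python
# Values read from the QTOT row of the STAT output (thousands of pounds).
qtot_mean_thousand_lb = 0.43576e06
qtot_sd_thousand_lb = 0.21326e06

# One million pounds is 1000 thousand pounds.
mean_million_lb = qtot_mean_thousand_lb / 1000.0
sd_million_lb = qtot_sd_thousand_lb / 1000.0

print(round(mean_million_lb, 2), round(sd_million_lb, 2))
```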
d.) SOME ECONOMIC THEORY: When demand is "elastic" (such that a given percent change in price leads to a larger percent opposite change in quantity demanded), an increase in price results in a decrease in total revenues in a market. When demand is "inelastic" (such that a given percent change in price leads to a smaller percent opposite change in quantity demanded), an increase in price results in an increase in total revenues. If we (erroneously) considered all five major types of Alaskan salmon to be sold in one market, what does the correlation table produced by the / pcor option on the stat command imply about overall demand elasticity in this market?
CORRELATION MATRIX OF VARIABLES - 30 OBSERVATIONS
PTOT 1.0000
QTOT -0.13806 1.0000
RTOT 0.49714 0.74976 1.0000
PTOT QTOT RTOT
Correlation between revenues and prices is positive (0.49714), suggesting that
when prices are higher, revenues are higher. Thus, demand appears to be inelastic.
Would you consider this implication reliable? Why or why not? (Think about the
implicit ceteris paribus requirement underlying "demand curves,"
i.e. that everything else be held constant.)
This is not a solid conclusion because lots of things besides prices have changed
over the thirty years of data. The price-quantity pairs we observe are
points on one demand curve only if demand has remained constant while
the supply curve has been shifting around. Plotting the relationship between
ptot (average prices) and qtot (overall quantities) reveals that we are probably
not looking at points along one stable demand curve over the time period
represented by the data.
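The elastic versus inelastic revenue logic in part d.) can be illustrated with a little arithmetic. This is a sketch under stated assumptions: the elasticities (-2.0 and -0.5) and the 10% price increase are hypothetical numbers, not estimates from the Alaska data, and the quantity response is approximated as elasticity times the percent price change.

```python
def revenue_ratio(price_change, elasticity):
    """New revenue divided by old revenue after a proportional price change.

    Approximates the proportional quantity change as elasticity * price_change.
    """
    quantity_change = elasticity * price_change
    return (1.0 + price_change) * (1.0 + quantity_change)

# Hypothetical 10% price increase under elastic and inelastic demand.
elastic_case = revenue_ratio(0.10, -2.0)     # quantity falls 20%
inelastic_case = revenue_ratio(0.10, -0.5)   # quantity falls 5%

print(round(elastic_case, 3), round(inelastic_case, 3))
```

A ratio below 1 means total revenue falls when price rises (elastic demand), and a ratio above 1 means it rises (inelastic demand).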
e.) Now take into account that there are five different "goods" involved--chum, king, pink, red, and silver salmon species, possibly each with a distinct market. Use stat / pcor and the crude plot option in SHAZAM to see whether catch levels for each of these five species "move together" over this time period. Comment.
stat cquant kquant rquant pquant squant / pcor
plot cquant year / nopretty
plot kquant year / nopretty
plot rquant year / nopretty
plot pquant year / nopretty
plot squant year / nopretty
|_stat cquant kquant rquant pquant squant / pcor
NAME N MEAN ST. DEV VARIANCE MINIMUM MAXIMUM
CQUANT 30 64333. 23979. 0.57499E+09 22668. 0.12157E+06
KQUANT 30 11828. 2299.0 0.52853E+07 7184.0 16904.
RQUANT 30 0.16031E+06 99843. 0.99685E+10 32246. 0.37840E+06
PQUANT 30 0.17384E+06 93265. 0.86985E+10 28822. 0.33879E+06
SQUANT 30 25455. 13958. 0.19482E+09 7688.0 53776.
CORRELATION MATRIX OF VARIABLES - 30 OBSERVATIONS
CQUANT 1.0000
KQUANT 0.40846 1.0000
RQUANT 0.59101 0.38130 1.0000
PQUANT 0.63651 0.39569 0.81231 1.0000
SQUANT 0.72498 0.35044 0.78959 0.81930 1.0000
CQUANT KQUANT RQUANT PQUANT SQUANT
All of the pairwise correlations between different
catch quantities are positive, so that on average, if one quantity is higher,
so are all others. You could certainly use plot kquant pquant for
all pairs and see the scattergrams of quantity pairs that would reveal
the same sort of thing.
or try some fancy plots by using:
plot cquant kquant rquant pquant squant year / gnu line
If you have your own stand-alone computer equipment and an attached laser printer, you are welcome to experiment with the gnuplot options mentioned in the manual in an attempt to print out hard copies of graphics files. Note that these fancier plots do not come out if you direct the output to a file. The cruder dot matrix plots will typically be adequate for homeworks.
f.) Have the prices of these five species moved together? An appropriate stat output with interpretation will be sufficient to answer this question.
|_stat cprice kprice rprice pprice sprice / pcor
NAME N MEAN ST. DEV VARIANCE MINIMUM MAXIMUM
CPRICE 30 0.65220 0.28124 0.79095E-01 0.32800 1.2590
KPRICE 30 2.0559 0.62286 0.38796 1.0810 3.3880
RPRICE 30 1.3528 0.50440 0.25442 0.79600 2.9390
PPRICE 30 0.55017 0.23236 0.53989E-01 0.16200 0.99300
SPRICE 30 1.3817 0.54441 0.29638 0.74700 2.5510
CORRELATION MATRIX OF VARIABLES - 30 OBSERVATIONS
CPRICE 1.0000
KPRICE 0.69505 1.0000
RPRICE 0.64362 0.76019 1.0000
PPRICE 0.84334 0.38005 0.51253 1.0000
SPRICE 0.95809 0.72896 0.68958 0.77243 1.0000
CPRICE KPRICE RPRICE PPRICE SPRICE
All of the price pairs are also positively correlated,
so the prices tend to move together as well. Higher prices for one species
are associated with higher prices for all other species.
g.) Generate revenues from each species and assess whether these have moved similarly over the time period of these data. Use genr commands like:
genr crev=cquant*cprice
genr krev=kquant*kprice
genr rrev=rquant*rprice
genr prev=pquant*pprice
genr srev=squant*sprice
stat crev krev rrev prev srev / pcor
NAME N MEAN ST. DEV VARIANCE MINIMUM MAXIMUM
CREV 30 41600. 23877. 0.57010E+09 10213. 0.13178E+06
KREV 30 24695. 9983.3 0.99665E+08 11998. 44276.
RREV 30 0.22087E+06 0.15725E+06 0.24726E+11 42404. 0.55412E+06
PREV 30 83413. 44449. 0.19757E+10 13056. 0.17652E+06
SREV 30 33495. 18958. 0.35940E+09 8242.9 77217.
CORRELATION MATRIX OF VARIABLES - 30 OBSERVATIONS
CREV 1.0000
KREV 0.72146 1.0000
RREV 0.49899 0.61728 1.0000
PREV 0.65114 0.68561 0.59497 1.0000
SREV 0.70110 0.78329 0.77147 0.56880 1.0000
CREV KREV RREV PREV SREV
If prices are positively correlated across all species
and so are quantities, it is not surprising that revenues are also positively
correlated (that they tend to "move together").
h.) If an Alaska commercial fisher targeted only one species, say the low-end chum salmon, would this fisher have felt much of a decrease in their income in 1989, assuming a fairly constant number of fishers and an even distribution of the catch? Comment. What about fishers who targeted king salmon? Note that you can limit all analysis to a subset of the observations, say observations 24 through 30 (1987-1993) by using the command sample 24 30. (To undo this limitation, issue the command sample 1 30.)
|_sample 24 30
|_plot crev year
[SHAZAM character plot of CREV (*) against YEAR, horizontal axis 1986 to 1994; the vertical axis runs from 20808. to 0.13178E+06, with one point at the top of the scale and the other six between 20808. and 41615.]
|_plot krev year
[SHAZAM character plot of KREV (*) against YEAR, horizontal axis 1986 to 1994]
For both species, these particular plots show that revenues went up between 1987 and 1988, and then fell sharply in 1989 and stayed lower in subsequent years. Whether this was due to perceived product "taint" from the oil spill cannot be concluded without further analysis.
i.) OPTIONAL (If all the other tasks were easy for you): Try the GNUPLOT capability of SSCnet SHAZAM by including in your program the following command:
plot cquant kquant rquant pquant squant year / gnu line &
commfile=plot.gnu datafile=plot.dat
After you exit the program, choose the
GNUPLOT for Windows icon from the Econ 143 folder. Open the file you identified
as your commfile= file (here, plot.gnu). If you wish, you can edit this
file first, using Notepad. The file you actually edit will be called something
like "C000.gnu," which is what is loaded by your plot.gnu program. [A VERY
good idea is to put your name in the title of the plot, so that you can
find your own output as it emerges from the laser printer in the lab!]
Contents of the C000.gnu program file that was created by the commfile=plot.gnu option
on the above plot command (SHAZAM actually creates a program file called "plot.gnu" which
points to an automatically numbered C00?.gnu file that does not already exist):
set samples 30
set title
set title " "
set key
set xlabel "YEAR "
set ylabel
plot "plot.dat" using 1: 2 title "CQUANT " w linespoint,\
"plot.dat" using 1: 3 title "KQUANT " w linespoint,\
"plot.dat" using 1: 4 title "RQUANT " w linespoint,\
"plot.dat" using 1: 5 title "PQUANT " w linespoint,\
"plot.dat" using 1: 6 title "SQUANT " w linespoint
To change this file so that it will produce a gnuplot plot with a title and different
labels for the vertical axis and the key to the plotted lines, modify the file as shown
below:
set samples 30
set title
set title " Quantities of different Salmon Species Caught over Time "
set key
set xlabel "YEAR "
set ylabel " Quantities Caught"
plot "plot.dat" using 1: 2 title "CHUM " w linespoint,\
"plot.dat" using 1: 3 title "KING " w linespoint,\
"plot.dat" using 1: 4 title "RED " w linespoint,\
"plot.dat" using 1: 5 title "PINK " w linespoint,\
"plot.dat" using 1: 6 title "SILVER " w linespoint
When you run the modified file, the plot should show the new title, vertical-axis label, and key entries.