Basics of Parameter Estimation and Sampling Theory

Now that we've covered the necessary foundations in probability theory and descriptive statistics, we can consider the area of inferential statistics, in particular the interesting problem of parameter estimation. Let's introduce some concepts and terms in the context of a very important current problem: how can we estimate how many people in the U.S. use the World Wide Web? How could we answer this question?

It's too hard and expensive to ask every single individual. So we will use a sample to estimate this property of the larger population. In this case, the population might consist of all adults 18 or older in the United States. The sample might consist of a few thousand of those adults, randomly sampled from the population.

In the most general case, a population defines a particular box model. It's easiest to think of a population in terms of people, but the central notion is of a population of scores (or tickets in the box). In the case of the WWW example, the box consists of about 250,000,000 tickets, some of which represent WWW-users, and some of which represent WWW-nonusers. This describes the population. Our task is to estimate the proportion of WWW-users in the box, without examining all the tickets in the box.

Typically, we are interested in getting a sense about certain numerical properties of the population (e.g., the mean). These numerical properties of the population are known as parameters.

Numbers calculated from a sample are called statistics. (Nice and easy mnemonic for remembering which is which: population parameter and sample statistic.) In statistical inference, we use sample statistics to estimate population parameters.

Estimation and hypothesis testing

There are two subtasks of statistical inference: estimation and hypothesis testing.

Example: the World Wide Web

Simple random sampling

Probability samples are those samples that incorporate "the planned use of chance". Freedman discusses in some detail the pitfalls of sampling methods that do not incorporate the planned use of chance, chief among them selection bias.

In simple random sampling, we draw a sample of n tickets from the box without replacement. Note some of the key properties of this approach:

The way the sample is drawn is critically important (and is more important than the size of the sample).

Although in simple random sampling the tickets are drawn without replacement, if the sample size is small relative to the population size, we can very closely approximate the situation using our results from sampling with replacement.

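Recall the key results about the sampling distribution of the sample mean when sampling with replacement: the expected value of the sample mean equals the population mean, and the standard error of the sample mean equals the population SD divided by the square root of the sample size n.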
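As a rough check of those results (expected value of the sample mean equal to the box average, and SE equal to the box SD divided by the square root of n), here is a small simulation sketch. The box contents, sample size, and number of repetitions are illustrative assumptions, not values from these notes.

```python
import random
import statistics


def simulate_sample_means(box, n, repetitions, seed=0):
    """Draw `repetitions` samples of size n WITH replacement; return their means."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.choices(box, k=n)) for _ in range(repetitions)]


# Hypothetical zero-one box in which 20% of the tickets are 1's.
box = [0, 0, 0, 0, 1]
n = 100
means = simulate_sample_means(box, n, repetitions=5000)

box_mean = statistics.fmean(box)      # population mean
box_sd = statistics.pstdev(box)       # population SD
theoretical_se = box_sd / n ** 0.5    # SD / sqrt(n)

observed_mean = statistics.fmean(means)   # should be close to box_mean
observed_se = statistics.pstdev(means)    # should be close to theoretical_se

print(f"box mean {box_mean:.3f}, average of sample means {observed_mean:.3f}")
print(f"theoretical SE {theoretical_se:.4f}, observed SE {observed_se:.4f}")
```

The spread of the simulated sample means plays the role of the standard error, so the two printed SE values should agree closely.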

Now how will things be different if we sample without replacement? The draws are no longer exactly independent (later draws depend on what comes out on earlier draws, as in dealing cards from a deck). But if the population is large enough, this dependence is quite small.

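Results for sampling without replacement: the expected value of the sample mean is still the population mean, but the standard error is multiplied by a correction factor (CF) that depends on the population size N and the sample size n: CF = sqrt((N - n) / (N - 1)).

So sampling without replacement affects the standard error of the sample mean. How will it affect the standard error: will it go up (CF>1) or down (CF<1)? Since N - n is smaller than N - 1 whenever n > 1, the CF is less than 1 and the standard error goes down. When n is small relative to N, however, the CF is very close to 1.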

The final result is that we can typically use our results for sampling with replacement.
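To see how little the correction matters for a large population, the sketch below evaluates the standard finite population correction factor, sqrt((N - n) / (N - 1)). The function names are illustrative, and the sample size of 2,500 is an assumed value standing in for "a few thousand".

```python
import math


def correction_factor(population_size, sample_size):
    """Finite population correction factor: sqrt((N - n) / (N - 1))."""
    return math.sqrt((population_size - sample_size) / (population_size - 1))


def se_mean_without_replacement(sd, population_size, sample_size):
    """SE of the sample mean when drawing without replacement."""
    se_with_replacement = sd / math.sqrt(sample_size)
    return correction_factor(population_size, sample_size) * se_with_replacement


# A few thousand draws from a box of about 250,000,000 tickets:
cf_large = correction_factor(250_000_000, 2_500)

# For contrast, sampling half of a tiny box of 100 tickets:
cf_small = correction_factor(100, 50)

print(f"CF for the large box: {cf_large:.6f}")
print(f"CF for the small box: {cf_small:.4f}")
```

For the large box the factor is indistinguishable from 1 for practical purposes, which is why the with-replacement results can be used. The correction only becomes noticeable when the sample is a sizable fraction of the population.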

Point estimation of a population proportion

Given a simple random sample from a Bernoulli (zero-one) box with an unknown proportion of 1's, use the sample proportion of 1's to estimate the proportion of 1's in the population (i.e., the box).

Notational note: Often, to distinguish between a parameter and our estimate of a parameter, we use a hat (^) over the symbol for a parameter to denote an estimate of that parameter.
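A minimal sketch of point estimation from a zero-one box follows. The box here is scaled down to 100,000 tickets and the true proportion of 0.2 is a made-up value. In a real survey the true proportion is exactly what we do not know.

```python
import random


def sample_proportion(sample):
    """Proportion of 1's in a zero-one sample (the estimate, p-hat)."""
    return sum(sample) / len(sample)


rng = random.Random(42)

# Hypothetical box: 20,000 users (1) and 80,000 non-users (0).
box = [1] * 20_000 + [0] * 80_000

# Simple random sample: 1,000 tickets drawn without replacement.
sample = rng.sample(box, k=1_000)

p_hat = sample_proportion(sample)
print(f"estimated proportion of 1's: {p_hat:.3f}")
```

`random.sample` draws without replacement, which matches the definition of simple random sampling used in these notes.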

Interval estimation of a population proportion

The standard error of the sample proportion is a measure of the sampling variability, or chance error, of the sample proportion; it measures how far off we can expect the sample proportion to be from the population proportion. We can use this measure to find a range of plausible values for the population proportion.

For a sample of size n from a zero-one box in which the proportion of 1's is p, the standard error of the sample proportion is sqrt(p(1 - p) / n).

We face a small problem, however: the standard error of the sample proportion depends on the population proportion, which we don't know.

If the sample size is large enough, however, this problem is quite easily solved, by simply using the sample proportion as our estimate of the population proportion, and plugging that value into the expression for the standard error. The result is an "estimated standard error."
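The plug-in step can be sketched as follows, using the usual formula sqrt(p(1 - p) / n) with the sample proportion substituted for p. The numbers are illustrative assumptions.

```python
import math


def estimated_se(p_hat, n):
    """Estimated SE of a sample proportion: plug p-hat in for the unknown p."""
    return math.sqrt(p_hat * (1 - p_hat) / n)


n = 1_000
se_hat = estimated_se(0.20, n)

# The estimate is not very sensitive to modest errors in p-hat:
se_if_low = estimated_se(0.18, n)
se_if_high = estimated_se(0.22, n)

print(f"estimated SE at p-hat = 0.20: {se_hat:.4f}")
print(f"estimated SE at p-hat = 0.18: {se_if_low:.4f}")
print(f"estimated SE at p-hat = 0.22: {se_if_high:.4f}")
```

Because p(1 - p) changes slowly as p moves a little, a sample proportion that is off by a couple of percentage points still gives nearly the same standard error. This is why the substitution works well for large samples.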

Constructing confidence intervals

Now let us consider how to construct confidence intervals. Using the normal approximation to the sampling distribution of the sample proportion, we know that the sample proportion will be within 1 SE of the population proportion about 68% of the time (i.e., for about 68% of samples of size n).

Adding and subtracting one SE from the sample proportion gives us an approximate 68% confidence interval. We can say, with 68% confidence, that the population proportion falls within the interval. What this means is that if we repeated, many times, the procedure of taking a sample of size n and computing the 68% confidence interval, about 68% of those intervals would cover the true population proportion.
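The repeated-sampling interpretation can be checked with a simulation. This is a sketch under assumed values (true proportion 0.2, samples of size 400, 2,000 repetitions), drawing with replacement as an approximation to simple random sampling from a very large box.

```python
import math
import random


def coverage_of_one_se_interval(p, n, repetitions, seed=0):
    """Fraction of (p-hat +/- 1 estimated SE) intervals that cover the true p."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(repetitions):
        # Draw n tickets with replacement from a zero-one box with proportion p.
        ones = sum(rng.random() < p for _ in range(n))
        p_hat = ones / n
        se_hat = math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - se_hat <= p <= p_hat + se_hat:
            covered += 1
    return covered / repetitions


coverage = coverage_of_one_se_interval(p=0.2, n=400, repetitions=2000)
print(f"fraction of intervals covering the true proportion: {coverage:.3f}")
```

Each repetition produces a different interval, because the sample changes; the population proportion stays fixed. The printed fraction should land near 68%, with some simulation noise.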

A confidence interval can be interpreted as a range of plausible values for the parameter of interest. (We think p=.2. Could it be .21? Could it be .3? Could it be .1?)

Recipe for constructing an interval estimate, or confidence interval:

Take the point estimate and move away from it in either direction by a multiple of the (estimated) SE of the point estimate. How far to move depends on the level of confidence desired: use the value from the standard normal table that covers the desired area.

(Point estimate) +/- (z-value based on desired confidence level) * (estimated SE of point estimate)
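The recipe above can be sketched in a few lines. Here the z-value comes from Python's `statistics.NormalDist` rather than a printed table, and the inputs (a sample proportion of 0.2 from 1,000 respondents) are assumed for illustration.

```python
import math
from statistics import NormalDist


def proportion_confidence_interval(p_hat, n, confidence=0.95):
    """(point estimate) +/- (z-value) * (estimated SE of point estimate)."""
    # z such that the central `confidence` area of the standard normal
    # curve lies between -z and +z.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    se_hat = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * se_hat
    return p_hat - margin, p_hat + margin


low, high = proportion_confidence_interval(p_hat=0.2, n=1_000, confidence=0.95)
print(f"approximate 95% confidence interval: ({low:.3f}, {high:.3f})")
```

With these inputs the interval runs from roughly 0.175 to 0.225, so values such as .21 are plausible for the population proportion while .3 and .1 are not.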

How do various factors affect the width of confidence intervals?

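There is a tradeoff between the informativeness of the interval and the risk of making a mistake: for a fixed sample size, a narrower interval says more about the parameter but is less likely to cover it.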
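Two factors that drive the width are the sample size and the confidence level. The sketch below tabulates the width of the interval for a few assumed values of each, holding the sample proportion at an illustrative 0.2.

```python
import math
from statistics import NormalDist


def interval_width(p_hat, n, confidence):
    """Full width of the normal-approximation interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return 2 * z * math.sqrt(p_hat * (1 - p_hat) / n)


# Effect of sample size at a fixed 95% confidence level.
widths_by_n = {n: interval_width(0.2, n, 0.95) for n in (100, 400, 1600)}

# Effect of confidence level at a fixed sample size of 400.
widths_by_confidence = {c: interval_width(0.2, 400, c) for c in (0.68, 0.95, 0.99)}

for n, width in widths_by_n.items():
    print(f"n = {n:5d}: width = {width:.3f}")
for c, width in widths_by_confidence.items():
    print(f"confidence = {c:.2f}: width = {width:.3f}")
```

Quadrupling the sample size halves the width, because the SE shrinks with the square root of n. Asking for more confidence widens the interval.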

Margin of error in polling

Reports of polls and surveys usually state the "margin of error" of the poll. The margin of error is typically the half-width of a 95% confidence interval around the reported point estimate. However, journalists typically don't understand statistics well, and often make mistakes when describing the "margin of error".
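As an illustration of where a reported margin of error might come from, the sketch below computes the half-width of a 95% interval. Using a proportion of 0.5 is an assumption here: it is a common conservative convention because p(1 - p) is largest at 0.5, but a given poll may compute its margin differently.

```python
import math
from statistics import NormalDist


def margin_of_error(n, p_hat=0.5, confidence=0.95):
    """Half-width of the normal-approximation interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


for n in (500, 1_000, 2_000):
    print(f"n = {n:5d}: margin of error = +/- {100 * margin_of_error(n):.1f} percentage points")
```

A poll of about 1,000 respondents comes out at roughly plus or minus 3 percentage points, which is the figure often quoted in news reports.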

Summary

  1. Notion of estimation: use sample statistics to estimate population parameters
  2. Simple random sampling: given large population, can be approximated by sampling with replacement
  3. Importance of probability methods in sampling: avoid selection bias!
  4. Use standard error of sample proportion to create confidence interval
  5. Confidence interval refers to the parameter, not the sample value. We know the sample with certainty; however, we are uncertain about the population parameter, because of chance error / sampling variability.
  6. Confidence interval only accounts for sampling variability. It does not account for a host of other possible problems, e.g., selection bias in how the sample was drawn.