UCLA Soc. 210A, Topic 7, Sampling Distributions and Estimation

6nov2000

Professor: David D. McFarland

Topic 7: Sampling Distributions and Estimation

We have already looked at probability distributions in the context where it was the social phenomenon under observation, and not the observation process, that was conceptualized as probabilistic. We will now review them, and then proceed to consider probability distributions that arise in the observation process, as when we select cases to interview, and that are used in inference, that is, in reasoning from observed data to informed guesses about the population or process from which those data arose.

Some distributions generated by social processes

Binomial distribution, for a count of the number of occurrences of some event, in a fixed number of "trials", on each of which the event's probability is the same, regardless of the number and outcomes of previous trials.
Example: In a family with four children, the probabilities of 0, 1, 2, 3, or 4 girls will be binomially distributed if the probability of a girl on each birth is the same, regardless of the number and gender of previous births.
If p denotes the probability of the event occurring on any particular trial, and n denotes the number of trials, then the number of trials on which the event occurs has:
mean = np
standard deviation = sqrt(np(1-p))
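
As a check on the two formulas above, the following minimal sketch computes the binomial probabilities for the four-child example and compares the mean and standard deviation obtained directly from those probabilities with np and sqrt(np(1-p)). The value p = 0.5 and the helper name binomial_pmf are assumptions made for illustration only.

```python
from math import comb, sqrt

def binomial_pmf(n, p):
    """Return [P(X = 0), ..., P(X = n)] for a binomial count."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

n, p = 4, 0.5  # four children; p = 0.5 is an illustrative assumption
pmf = binomial_pmf(n, p)

# Mean and standard deviation computed directly from the distribution.
mean = sum(k * pk for k, pk in enumerate(pmf))
sd = sqrt(sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf)))

print(pmf)                         # probabilities of 0, 1, 2, 3, 4 girls
print(mean, n * p)                 # direct calculation vs. np
print(sd, sqrt(n * p * (1 - p)))   # direct calculation vs. sqrt(np(1-p))
```
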
Gaussian or "Normal" distribution for a quantity which is the sum of a large number of small effects which operate independently of one another.
This is a symmetrical, bell-shaped distribution, with two parameters, the mean, mu, and the standard deviation, sigma, whose values are 0 and 1 respectively in the case called a standard normal distribution.
"Normal" in this context is a technical term, with a technical meaning different from its everyday usage. You should not assume that this shape of distribution is "normal" in the everyday sense of that word, and that other shapes are "abnormal". Indeed, for this reason some authors have abandoned the term, Normal Distribution, and instead call it the Gaussian Distribution. These are two names for the same distribution.
The Central Limit Theorem states conditions under which a sequence of distributions of partial sums converges to a normal distribution.
Remark: Feller (1957, pp. 238-241), who along with Lindeberg was a leading probabilist examining necessary and sufficient conditions for convergence to normality, has a section on "variable distributions", in which he points out that the various terms of the sum actually do not need to be either independent or identically distributed, although those conditions do make the task of proving theorems more manageable. In applying the central limit theorem to biometric measurements, such as height, Feller remarks, "It is true that not all of the (terms) are mutually independent. However, the central limit theorem holds also for large classes of dependent variables, and, besides, it is plausible that the great majority of the (terms) can be treated as independent".
Example: Employees' current salaries are the salaries they had a decade ago, modified by successive pay raises. An egalitarian employer might consider awarding raises in similar dollar amounts, regardless of the employees' previous salaries. In some years $800 raises might be about average, with some getting more, some less; in other years $1200 raises might be about average, with some getting more, some less; etc. Thus after two such raises an employee who began at $40,000 would have a new salary in the vicinity of (40,000 + 800 + 1200), and similarly for subsequent years.
Lognormal distribution for a quantity which is the product of a large number of small effects which operate independently of one another. Factors greater than 1 increase the product, while factors less than 1 decrease it. All factors are positive, as is their product. A lognormal distribution has a long upper tail, rather than being symmetric like Gaussian distributions.
Example: Employees' current salaries are the salaries they had a decade ago, modified by successive pay raises. Rather than giving similar dollar amounts, most employers would give similar percentage raises. In some years 5% raises might be about average, with some getting more, some less; in other years 3% raises might be about average, with some getting more, some less; etc. Thus after two such raises an employee who began at $40,000 would have a new salary in the vicinity of ($40,000)(1.05)(1.03), and similarly for subsequent years.
If some variable X is lognormally distributed, then log(X) is normally distributed. This helps explain why researchers often transform variables, for example to analyze log(income) rather than income itself.
A lognormal distribution has two parameters, usually expressed in terms of the mean and standard deviation of the corresponding normal distribution.
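
The contrast between the two salary examples can be sketched by simulation. The code below gives each hypothetical employee thirty years of raises, once as dollar amounts added to the salary and once as percentage factors multiplied into it, and then measures skewness (0 for a symmetric distribution, positive for a long upper tail). The raise ranges, the number of years, and the seed are all assumptions chosen for illustration; they are not taken from any data.

```python
import math
import random
import statistics

def skewness(xs):
    """Moment-based skewness: near 0 if symmetric, > 0 for a long upper tail."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean([((x - m) / s) ** 3 for x in xs])

rng = random.Random(210)  # fixed seed so the sketch is repeatable
start, years, employees = 40_000, 30, 5_000

additive, multiplicative = [], []
for _ in range(employees):
    a = m = start
    for _ in range(years):
        a += rng.uniform(0, 2_000)          # similar dollar raises (hypothetical range)
        m *= 1 + rng.uniform(-0.05, 0.15)   # similar percentage raises, averaging 5% (hypothetical range)
    additive.append(a)
    multiplicative.append(m)

skew_add = skewness(additive)                               # sum of effects: roughly symmetric
skew_mult = skewness(multiplicative)                        # product of effects: long upper tail
skew_log = skewness([math.log(x) for x in multiplicative])  # log of product: roughly symmetric again

print(round(skew_add, 2), round(skew_mult, 2), round(skew_log, 2))
```

Taking logs turns the product of factors into a sum of log-factors, which is why the logged salaries look approximately Gaussian in this sketch.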
Bimodal distributions. Unobserved heterogeneity is one of various ways that social processes can produce bimodal distributions.
Example: In calculating the distribution of the number of girls in a two-child family, we would ordinarily assume p=1/2 throughout, for each family and regardless of the sex of any previous children in that family. This would yield:
P(0 girls) = 1/4, P(1 girl) = 1/2, P(2 girls) = 1/4
which has a single mode, at 1 girl.
Suppose, instead, that there are two types of couples, boy-prone and girl-prone, with P(girl|boy-prone) = .1, P(boy-prone) = .5, P(girl|girl-prone) = .9, P(girl-prone) = .5. Then applying the total probability rule we could calculate P(girl) = .5, the same as before. However, the distribution in two-child families would be:
P(0 girls) = .41, P(1 girl) = .18, P(2 girls) = .41.
Note that this is not only different, but also bimodal.
Empirically, there may be some heterogeneity of this sort, with some couples slightly more boy-prone and others slightly more girl-prone, but it is nowhere near as extreme as in this numerical example. However, in empirical survey data, we have seen bimodal distributions on some other variables. Polarized attitudes on some survey items pertaining to affirmative action and homosexuality appeared as bimodal distributions.
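
The two-child calculation above can be reproduced as a mixture of two binomial distributions, weighted by the proportions of the two types of couples. This is a minimal sketch; the function names binomial_pmf and mixture_pmf are illustrative, not from any particular library.

```python
from math import comb

def binomial_pmf(n, p):
    """Return [P(X = 0), ..., P(X = n)] for a binomial count."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

def mixture_pmf(n, components):
    """Total probability rule: components is a list of (weight, p) pairs."""
    pmf = [0.0] * (n + 1)
    for weight, p in components:
        for k, pk in enumerate(binomial_pmf(n, p)):
            pmf[k] += weight * pk
    return pmf

# Homogeneous couples: p = 1/2 for everyone.
homogeneous = binomial_pmf(2, 0.5)

# Heterogeneous couples: half boy-prone (p = .1), half girl-prone (p = .9).
heterogeneous = mixture_pmf(2, [(0.5, 0.1), (0.5, 0.9)])

# Overall probability of a girl on a single birth is unchanged.
p_girl = 0.5 * 0.1 + 0.5 * 0.9

print(homogeneous)     # single mode at 1 girl
print(heterogeneous)   # modes at 0 girls and 2 girls
print(p_girl)
```
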
Others. Various other distributions arising in social processes include the exponential, geometric, and Poisson distributions. There are also various more complicated distributions. These include "censored" distributions (such as a non-negative variable with many zeros; e.g., Seltzer 1991).

Some distributions utilized in inference

The binomial also occurs in the context of estimation, as it is the sampling distribution of the number of occurrences of a dichotomous response.
The t distribution resembles the standard Gaussian, but has a lower peak and heavier tails. It has one parameter, known as "degrees of freedom", or df. The t is the sampling distribution of a sample mean that has been standardized using the sample standard deviation in place of the unknown population standard deviation, when sampling from a Gaussian population.
The Chi-square distribution is one of the distributions for non-negative variables. Accordingly it has no left tail extending along the negative axis, but may have a long right tail. The chi-square distribution arises in several contexts. One of these is the sampling distribution of the sample variance: when sampling from a Gaussian population, the sum of squared deviations from the sample mean, divided by the population variance, has a chi-square distribution with n-1 degrees of freedom. Another is the Pearson Chi-square goodness-of-fit test for a multinomial hypothesis about the frequencies in several categories. A special case of the latter is the Chi-square test for the independence hypothesis in a two-way frequency table.
The Gaussian ("Normal") distribution arises in inference because with sufficiently large samples the binomial, t, and chi-square distributions are approximately normal, and the normal may be more convenient for computations. Also, many statistical procedures are built on a rationale that includes normality assumptions.

Point Estimation

Next we focus directly on estimation, a transition topic between descriptive and inferential statistics, at least insofar as a quantity calculated from data is interpreted as (an estimate of) the corresponding quantity for some larger population.

Ordinarily the investigator will only select a single sample, not a large number of replications, as suggested in the imagery of sampling distributions. However, the investigator has no control over which of the outcomes in the sampling distribution he or she happens to get. Thus the strategy is to arrange the sampling distribution, which can be controlled, in such a manner that the vast majority of the possible samples would, if they happened to be the one actually selected, yield suitably accurate inferences about the population being sampled.

Specifically, if a particular statistic in the sample is to be used as an estimate of a parameter in the population, one would like:

The statistic equals the parameter, not in any particular sample, but on average over all possible samples. An estimation procedure with this property is called unbiased. (Note that "bias" is a technical term, whose usage in statistics bears no specific relationship to its everyday usage.)
The various possible values of the statistic, that would be yielded by the different possible samples, do not differ greatly, or at least have only a small probability of doing so. The term "reliability" is sometimes used here. Such an estimation procedure has a small standard error of estimate. Alternatively one might say it has a small "margin of error", a term preferred by pollsters (see Moore and McCabe, pp 437, 446), usually referring to 1.96 times the standard error, which is the half-width of a 95% confidence interval.
In addition to unbiasedness and small standard error, there exist other desirable features of estimation procedures, including likelihood maximization, but we will not consider them now.

Sometimes the sample counterpart of a population parameter is unbiased. In examining sample statistics from the TVHOURS variable, this appeared to be the case for sample mean, but possibly not for sample standard deviation, and certainly not for sample maximum. The results from the five samples we examined do in fact generalize:

The sample mean is an unbiased estimate for the population mean.
The sample standard deviation is not an unbiased estimate for the population standard deviation.
A fix is available for bias in the variance, which is the square of the standard deviation. Formulae for variances involve sums of squared deviations from the mean, divided by one or another of three things: (a) the population size, say N, (b) the sample size, say n, or (c) one less than the sample size, (n-1). Authors who use all three distinguish them as (a) population variance, (b) sample variance, and (c) unbiased sample estimate of population variance. The sample variance is a biased estimate of the population variance, but multiplying it by n/(n-1) removes the bias. Some authors (Moore and McCabe among them) slur over that distinction, and use (n-1) throughout. [Note that that adjustment factor, n/(n-1), does not differ very much from 1.0 except when n is so small that one might wish to collect additional data anyhow.]
The sample max is a biased estimator of the population max, but we shall treat it only as a blatant illustration of bias, and not try to adjust it.
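
These claims about bias can be checked exactly on a toy example by listing every possible sample and averaging each statistic over all of them. The sketch below assumes a tiny hypothetical population of six values and samples of size 3 drawn with replacement; it is not the TVHOURS data examined in class.

```python
from itertools import product
from statistics import fmean, pstdev, pvariance, stdev, variance

# Hypothetical population of daily TV hours (not the course's TVHOURS data).
population = [0, 1, 2, 3, 4, 14]
n = 3

# Every ordered sample of size n drawn with replacement; all are equally likely.
samples = list(product(population, repeat=n))

avg_mean = fmean([fmean(s) for s in samples])
avg_var_n = fmean([pvariance(s) for s in samples])   # divisor n
avg_var_n1 = fmean([variance(s) for s in samples])   # divisor n - 1
avg_sd_n1 = fmean([stdev(s) for s in samples])       # square root of the (n - 1) version
avg_max = fmean([max(s) for s in samples])

print("mean:         ", avg_mean, "vs population", fmean(population))
print("variance (n): ", avg_var_n, "vs population", pvariance(population))
print("variance (n-1):", avg_var_n1, "vs population", pvariance(population))
print("std. dev.:    ", avg_sd_n1, "vs population", pstdev(population))
print("max:          ", avg_max, "vs population", max(population))
```

In this sketch the average of the divisor-n variance equals (n-1)/n times the population variance, which is why multiplying by n/(n-1) removes the bias. The standard deviation remains biased even when it is computed from the (n-1) version of the variance.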

The standard error of a sample estimator:

depends on the size of the sample, with n inside a square root sign in the denominator of formulae such as those in the boxes on pages 382, 399, or 440 of Moore and McCabe. The larger the sample size, the smaller the standard error of estimates.
does not depend on the size of the population, provided the population is much larger than the sample. This is important, and counter-intuitive for many people, who imagine that a larger sample is required to get accurate estimates for a larger population.
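
Both points can be illustrated with the standard error of a sample proportion. The sketch below is hedged in one respect: it also applies the finite population correction for sampling without replacement, sqrt((N-n)/(N-1)), which is not discussed in these notes, only to show how little the population size N matters once N is much larger than n. The function name se_proportion and the particular values of n and N are assumptions for illustration.

```python
from math import sqrt

def se_proportion(p, n, N=None):
    """Standard error of a sample proportion.

    If a population size N is given, apply the finite population
    correction for simple random sampling without replacement.
    """
    se = sqrt(p * (1 - p) / n)
    if N is not None:
        se *= sqrt((N - n) / (N - 1))
    return se

# Effect of sample size: quadrupling n halves the standard error.
for n in (100, 400, 1600):
    print("n =", n, "SE =", round(se_proportion(0.5, n), 4))

# Effect of population size with n fixed at 400: almost none.
for N in (10_000, 1_000_000, 100_000_000):
    print("N =", N, "SE =", round(se_proportion(0.5, 400, N), 4))
```
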

Here it might be noted that reliability, in something like the sense used in statistics, is also a concern of at least some nonquantitative sociologists. Katz (1982) rejects such quantitative formulations, but suggests how similar sorts of concerns may be addressed in a style of research known as Analytic Induction. In this course we espouse careful use, not rejection, of quantitative tools. Sampling theory does not solve all our conceptual problems, particularly those involving indefinite theoretically relevant populations from which data at hand were not randomly selected; but it does tell how to obtain data representative of a population from which they were randomly selected, and also the amount of data required to provide specified levels of reliability.

Interval Estimation

Instead of estimating some population parameter with a single number based on sample data, an investigator sometimes prefers to use sample data to calculate endpoints of an interval, in such a manner that the interval has a high probability of including the true value of the parameter being estimated. Such an interval is referred to as a confidence interval. Its endpoints are called confidence limits. And the probability that it contains the true parameter value is called the confidence level.

Desirable properties of confidence intervals

A confidence interval should have a high probability of containing the parameter being estimated, with 95% a commonly used confidence level.
A confidence interval should be narrow. For example, the assertion that some population proportion is between 39% and 42% is much more informative than the assertion that it is between 19% and 72%.
Alas, there are tradeoffs between those two desiderata.
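
The tradeoff can be made concrete with a small sketch that computes the width of a Gaussian-based interval for a proportion at several confidence levels and sample sizes. It assumes a sample proportion of 0.4 and uses the normal approximation throughout; the function name interval_width is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def interval_width(p_hat, n, level):
    """Width of a normal-approximation confidence interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # e.g. about 1.96 when level = 0.95
    return 2 * z * sqrt(p_hat * (1 - p_hat) / n)

# Higher confidence level, same data: wider interval.
for level in (0.80, 0.95, 0.99):
    print("level", level, "width", round(interval_width(0.4, 1000, level), 3))

# Same confidence level, more data: narrower interval.
for n in (250, 1000, 4000):
    print("n", n, "width", round(interval_width(0.4, n, 0.95), 3))
```

With a fixed sample, raising the confidence level widens the interval. The only way to have both high confidence and a narrow interval is to collect more data.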

Bayesian Interval Estimation

As indicated earlier, Bayes' Rule is a theorem that follows directly from the probability axioms and the definition of conditional probability; it does not depend on any particular interpretation such as degree-of-belief. However, when a statistician is described as a "Bayesian", that ordinarily refers to someone using a degree-of-belief interpretation of probability.

Both frequentists and Bayesians use interval estimates, but they use somewhat different ways of describing them.

Frequentist: The interval is random, in the sense that the same procedure applied to a different sample would have yielded different confidence limits. But the procedure used is one that 95% of the time produces a confidence interval that includes the true parameter value.
Bayesian: I am 95% certain that this particular interval covers the true parameter value.
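
The frequentist description can be illustrated by simulation: draw many samples from a population whose proportion is known, compute an interval from each, and count how often the interval includes the true value. This is only a sketch under assumed values (true proportion 0.8, n = 400, 2000 replications, a fixed seed), using the simple normal-approximation interval; it says nothing about the Bayesian interpretation.

```python
import random
from math import sqrt

def coverage(p_true, n, replications, z=1.96, seed=7):
    """Fraction of simulated samples whose interval includes p_true."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    hits = 0
    for _ in range(replications):
        successes = sum(rng.random() < p_true for _ in range(n))
        p_hat = successes / n
        margin = z * sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= p_true <= p_hat + margin:
            hits += 1
    return hits / replications

rate = coverage(p_true=0.8, n=400, replications=2000)
print(rate)  # should be in the neighborhood of 0.95
```

Each simulated interval either contains 0.8 or does not; the 95% describes the procedure's long-run success rate, not any single interval.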

Formulae for Interval Estimates

The prototypical formula applies to a parameter whose point estimate is unbiased and has a Gaussian sampling distribution, when we would like to calculate an interval estimate instead of a point estimate. Noting that the standard Gaussian distribution has 95% probability between the values -1.96 and +1.96, we could use as endpoints the values 1.96 standard errors below and above the point estimate.

Example: Find a confidence interval for the proportion of all voters favoring a particular measure, based on the proportion of respondents in a sample favoring it. The sample proportion is an unbiased estimate of the population proportion, and it has a standard error of sqrt[p(1-p)/n]. In a sample of n=400, if p has a value near .8, this would work out to about .02, and 1.96 times that would be about .04, yielding an interval of .8-.04 to .8+.04, or .76 to .84. Thus instead of using .8 as a point estimate of the population proportion, one would use .76 to .84 as a 95% confidence interval.
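
The arithmetic in this example can be written as a short function. This is a minimal sketch of the normal-approximation interval described above; the name proportion_ci is illustrative.

```python
from math import sqrt

def proportion_ci(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a population proportion."""
    se = sqrt(p_hat * (1 - p_hat) / n)   # standard error of the sample proportion
    return p_hat - z * se, p_hat + z * se

low, high = proportion_ci(0.8, 400)
print(round(low, 2), round(high, 2))  # 0.76 0.84
```

Here the standard error is sqrt(.8 × .2 / 400) = .02 and the margin is 1.96 × .02 = .0392, which the text rounds to .04.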

Special Values in or out of Confidence Interval

Sometimes one may wish to know whether some special value, typically zero, is in a confidence interval. A value of 0 for some parameter might mean that the patterns in the data are simpler than anticipated, that a simpler formula which omits that parameter will suffice for the data in hand. Such considerations lead directly to the next topic, tests of statistical hypotheses.

Feller, William. 1957. An Introduction to Probability Theory and Its Applications. Volume 1, 2nd edn. New York: Wiley. Section X.5, pages 238-241, "Variable Distributions".

Katz, Jack. 1982. "A Theory of Qualitative Methodology: The Social System of Analytic Fieldwork." Pages 197-218 in: Poor People's Lawyers in Transition. New Brunswick, NJ: Rutgers University Press. Reprinted, pages 127-148 in: Robert M. Emerson, ed. 1988. Contemporary Field Research: A Collection of Readings. Prospect Heights, IL: Waveland Press.

Seltzer, Judith A. 1991. "Legal Custody Arrangements and Children's Economic Welfare." American Journal of Sociology 96 (#4, January): 895-929.