Here we consider just a few concrete examples of hypothesis tests; the Moore and McCabe book contains many more, and omits far more than it contains. The point here is to understand the main logic of hypothesis tests, not to try to learn all the tests for all the possible situations.
One of the most important uses of statistical tests is really beyond the predominantly univariate scope of this quarter's course, but will arise frequently in 210b and 210c, which treat complicated multivariate models. This is a test of an hypothesis to the effect that a specific simpler model will suffice for the data at hand.
For example, in predicting whether a high school graduate goes on to college, based on parental and grandparental education and income and other variables, one might wish to test whether grandparents have any direct effect on grandchildren, beyond their indirect effects through the intervening generation, and, if not, to simplify the model by omitting the grandparental variables.
Note that a researcher may be accustomed to stating substantive hypotheses differently, in two respects: positively, and vaguely. A sociological theory may lead one to expect that some coefficient is important, rather than unimportant, as a value of 0 would imply, but sociological theories are seldom sufficiently precise to specify any particular numerical value.
With very large samples, authors commonly save a tree or two by not bothering to report that every null hypothesis they considered was emphatically rejected, or save some effort by not bothering with formal tests in the first place. For example, some research on the 5-county Los Angeles area, which had a population of 14.5 million in 1990, is based on a 5% sample, the US Census Bureau's 1990 Public Use Microdata Sample (PUMS). But 5% of 14.5 million is around 700,000, and when that value of n is plugged into the denominator of a standard error formula, the result is very close to zero.
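That arithmetic can be checked directly. The sketch below uses the usual formula for the standard error of a sample proportion, with p = 0.5 chosen (as an assumption, not from the source) because it gives the largest possible standard error:

```python
import math

n = 0.05 * 14_500_000              # 5% sample of 14.5 million: 725,000 cases
p = 0.5                            # worst case: the proportion with maximal variance
se = math.sqrt(p * (1 - p) / n)    # standard error of a sample proportion
print(round(n), round(se, 5))      # n is about 725,000; se is about 0.0006
```

Even in this worst case, the standard error is about six hundredths of one percentage point, which is why the formal tests add little.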
The only cautions apply when rare subpopulations, rather than the entire population, are the topic under consideration, and (more relevant in 210B and 210C than here) when one is testing hypotheses about a complicated model that incorporates a large number of variables.
Example: The book, Ethnic Los Angeles, edited by Waldinger and Bozorgmehr, includes many results based on about 700,000 cases in the 5% PUMS sample, rather than the 14.5 million in the entire population. Most of its prose ignores that distinction--as it should, since 700,000 cases is a huge sample, much larger than needed for precise estimates of the kinds of things being discussed therein. The book contains numerous tables, but they are not cluttered with p-values and double asterisks denoting statistical significance beyond the .01 level. [One apparent exception turns out not to be. A table with columns labeled "P*" in Ortiz' chapter on the Mexican-origin population (page 270) is not about either p-values or null hypotheses rejected at the .05 significance level. Rather, the quantity denoted P* there is an index that measures exposure of members of one ethnic group to members of another ethnic group (page 476).]
Several related concepts are as follows:
When the population standard deviation is unknown and must be estimated from the sample, the standardized sample mean, t = (sample mean - hypothesized mean) / (estimated standard error of the mean), follows the t distribution under the null hypothesis; so one calculates t from the sample data, and compares it with values in the table of the t distribution.
Unless one has a directional alternative hypothesis, the alternative is simply that this population is different from the one specified in the null hypothesis, and the appropriate test is two-tailed. Using the conventional .05 significance level, the critical region would be chosen to cut off .025 probability in each tail, and the null hypothesis rejected if the observed t value lies in either half of the critical region.
The t distribution differs from the Gaussian when df is small, such as 10 or 20, but for df as large as 100 or so, the Gaussian (shown in the t table as the bottom row, with df = infinity) is a good approximation.
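The two-tailed t test described above can be sketched with only the standard library. The sample data and the hypothesized mean of 100 below are invented for illustration; the critical value 2.262 is the standard tabled value for df = 9 at the .05 level, two-tailed:

```python
import math
import statistics

# Invented sample data; null hypothesis: population mean = 100.
sample = [102.1, 99.3, 104.8, 101.2, 98.7, 103.5, 100.9, 105.0, 97.8, 102.6]
mu0 = 100.0

n = len(sample)
xbar = statistics.mean(sample)
s = statistics.stdev(sample)              # sample standard deviation (n - 1 divisor)
t = (xbar - mu0) / (s / math.sqrt(n))     # t statistic, df = n - 1 = 9

# Two-tailed test at the .05 level: the critical region cuts off .025
# probability in each tail; from the t table, the cutoff for df = 9 is 2.262.
critical = 2.262
reject = abs(t) > critical
print(round(t, 3), reject)
```

Here t is about 2.03, short of 2.262, so this particular invented sample would not lead to rejection at the .05 level.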
In Chi-square tests, one calculates the expected frequencies under some theoretical model, and compares them with the corresponding observed frequencies, using the formula:

    Chi-square = Sum[ (observed - expected)^2 / expected ]

Each discrepancy is squared, and the square divided by the expected frequency; then all such terms are summed.
Under the null hypothesis that observed frequencies are from the same distribution used to calculate the expected frequencies, the value of Chi-square follows a distribution of the same name, which appears in Moore and McCabe's Table F, on page T-20. The Chi-square distribution has one parameter, called "degrees of freedom", or "df" for short.
The degrees of freedom, which tells which part of the Chi-square table to use to find the significance level, is found as follows:

    df = (number of categories) - (number of parameters estimated from the dataset being fitted) - (number of constraints on parameters)

The latter constraints are such things as requiring expected frequencies to have the same marginal totals as the observed frequencies.
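The formula and the df rule can be illustrated with a small hand computation; the die-rolling frequencies below are invented for the purpose:

```python
# Hypothetical example: 60 rolls of a die, testing the model that all six
# faces are equally likely, so each expected frequency is 60/6 = 10.
observed = [8, 12, 9, 11, 6, 14]
expected = [10] * 6

# Sum[ (observed - expected)^2 / expected ]
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# df = 6 categories - 0 parameters estimated - 1 constraint
# (the expected frequencies are constrained to sum to n = 60).
df = 6 - 0 - 1
print(round(chi_square, 1), df)
```

The resulting Chi-square of 4.2 on 5 df falls well short of the tabled .05 critical value of 11.07, so these invented rolls give no evidence against the equal-probability model.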
We have already seen an instance of the Chi-square test earlier in the course, when we covered conditional probability and independence; the theoretical model in that case was independence of the row and column variables in a table. This special case, where the model being fitted is one of independence, is treated in Moore and McCabe, Section 9.2. In our application to actual data, Stata automatically calculated the expected frequencies, the value of Chi-square, the degrees of freedom, and the significance probability.
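The independence case can be sketched by hand as well, mirroring the computation Stata performs automatically. The 2 x 2 table below is invented; under independence, each expected frequency is (row total)(column total)/n:

```python
# Invented 2 x 2 table of observed frequencies; the model being fitted
# is independence of the row and column variables.
observed = [[30, 20],
            [10, 40]]

n = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# Under independence, expected frequency = (row total)(column total) / n.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n
        chi_square += (o - e) ** 2 / e

# df = (rows - 1)(columns - 1): the expected frequencies are constrained
# to reproduce both sets of marginal totals of the observed table.
df = (2 - 1) * (2 - 1)
print(round(chi_square, 2), df)
```

Here Chi-square is about 16.67 on 1 df, far beyond the tabled .05 critical value of 3.84, so independence would be rejected for this invented table.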
A more recent discussion of some of the same issues, but now with a Bayesian slant, is given in the article by Berk, Western, and Weiss (1995). Note, however, that those authors have not settled the matter to the satisfaction of their own critics. Still, this article does provide some progress over earlier authors who merely complained about hypothesis tests being used to justify inferences to theoretical relevant populations from which the data at hand were not random samples; Berk et al. take the further step of proposing alternative procedures for some such situations.
Notice that the controversy is not about statistical hypothesis testing per se, as much as about its use in situations where the data being analyzed are not a probability sample from some larger population of theoretical interest.
Consider a simplified situation involving only two hypotheses, H1 and H2, and only two possible values for the data to be collected, D1 and D2. The likelihood of an hypothesis, given the data, is defined as the conditional probability of the observed data, conditioned on that hypothesis being true. Thus if we observed data outcome D1, we would consider the likelihoods of the two different hypotheses, given the one data outcome actually observed:

    likelihood of H1 = p(D1 | H1)
    likelihood of H2 = p(D1 | H2)
Bayes' Theorem, with subjective prior and posterior probabilities, commonly is used with continuous distributions, but we will consider only a couple of discrete examples, whose mathematics is much more straightforward, while still giving some of the flavor of Bayesian inference.
The Bayesian begins with subjective prior probabilities expressing his or her degree of belief in the hypotheses, namely p(H1) and p(H2), two non-negative numbers which (in the simplified case we consider, which has only two hypotheses) sum to 1.0.
On observing the data, the Bayesian revises those subjective probabilities, replacing his or her prior probabilities with posterior probabilities; in particular, replacing p(H1) with p(H1|data) and replacing p(H2) with p(H2|data). Bayes' Rule tells how to calculate the appropriate revised subjective probabilities.
In case the data happened to have the outcome D1, these revisions would be:

    p(H1 | D1) = p(H1) p(D1 | H1) / [ p(H1) p(D1 | H1) + p(H2) p(D1 | H2) ]
    p(H2 | D1) = p(H2) p(D1 | H2) / [ p(H1) p(D1 | H1) + p(H2) p(D1 | H2) ]
A probability distribution for a set of competing hypotheses is called diffuse if the various hypotheses (two in our example) are given nearly equal values, and is called informative if some hypotheses are given much higher probabilities than others.
Data may also be informative or not, depending on whether some particular outcomes are much more probable under some hypotheses than under other hypotheses. If the data are relatively informative, compared to the priors, the posterior probabilities will depend mainly on the data, and the Bayesian will reach conclusions similar to those of a frequentist.
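The two-hypothesis revision can be worked through numerically. All the probabilities below are invented for illustration: a diffuse prior of .5 on each hypothesis, and likelihoods under which outcome D1 is considerably more probable under H1 than under H2:

```python
# Invented subjective prior probabilities for the two hypotheses (diffuse).
prior = {"H1": 0.5, "H2": 0.5}

# Invented likelihoods: probability of each data outcome under each hypothesis.
likelihood = {"H1": {"D1": 0.8, "D2": 0.2},
              "H2": {"D1": 0.3, "D2": 0.7}}

def posterior(prior, likelihood, outcome):
    """Bayes' Rule: p(H | data) is proportional to p(H) * p(data | H)."""
    joint = {h: prior[h] * likelihood[h][outcome] for h in prior}
    total = sum(joint.values())      # p(data), the normalizing denominator
    return {h: joint[h] / total for h in joint}

# Suppose the data turn out to have outcome D1.
post = posterior(prior, likelihood, "D1")
print(post)
```

With these numbers the posterior probability of H1 rises from .5 to 8/11, about .73: the data were informative, and they shift the diffuse prior substantially, just as the discussion above describes.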
Berk, Richard A., Bruce Western, and Robert E. Weiss. 1995. "Statistical Inference for Apparent Populations." Sociological Methodology 25: 421-458. [With discussions by: Kenneth A. Bollen; Glenn Firebaugh; Donald B. Rubin; and Reply by Berk, Western, and Weiss.]
Ortiz, Vilma. 1996. "The Mexican-Origin Population: Permanent Working Class or Emerging Middle Class?" Chapter 9, pp. 247-277, in Waldinger and Bozorgmehr 1996.
Waldinger, Roger, and Mehdi Bozorgmehr, eds. 1996. Ethnic Los Angeles. New York: Russell Sage Foundation.