The probability of attending college, for example, is higher for someone whose parents attended college than for someone whose parents did not. Similarly, the probability of dying within the next year varies systematically with age.
Conditional probability considerations also arise in the design of research projects, such as sample surveys, which ordinarily preclude someone who has been sampled already from being sampled a second time in the same wave of the survey. Generally, the probability of any particular outcome at one stage of the sampling depends on what has happened at the previous stages.
From the relative emphasis that statistics books give to independence and to the lack thereof, one might erroneously conclude that independence is the usual state of affairs. Not so in empirical phenomena. Indeed, we social researchers make a living examining the specific forms of dependence among the things we study.
P(A) = P(A|B) P(B) + P(A|notB) P(notB)

and similarly

P(B) = P(B|A) P(A) + P(B|notA) P(notA)

Substituting the latter into the multiplication rule for P(A|B) gives Bayes' rule:

P(A|B) = P(B|A) P(A) / [P(B|A) P(A) + P(B|notA) P(notA)]
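As a numeric check of the formula, Bayes' rule can be applied to made-up numbers. The probabilities below are hypothetical, chosen only for illustration:

```python
# Bayes' rule: P(A|B) = P(B|A) P(A) / [P(B|A) P(A) + P(B|notA) P(notA)]
# All input probabilities here are hypothetical, for illustration only.
p_A = 0.10             # prior probability of A
p_B_given_A = 0.90     # probability of B when A holds
p_B_given_notA = 0.20  # probability of B when A does not hold

# The denominator is P(B), by the law of total probability above.
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)
p_A_given_B = p_B_given_A * p_A / p_B

print(round(p_B, 4))          # 0.27
print(round(p_A_given_B, 4))  # 0.3333
```

Note that even though B is much more probable under A than under notA, the posterior P(A|B) stays well below 1 because the prior P(A) is small.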
This interpretation is illustrated in some cross-classifications from the GSS data, in a Stata do-file and a log of the results.
Before collecting and analyzing any new data, a Bayesian will assess his or her prior probability distribution. A Bayesian making inferences about a population proportion would typically specify the prior as a beta distribution, selecting parameter values so that the resulting distribution reflects his or her best guess about p, and how certain or uncertain he or she is about it. The reason for choosing beta distributions is that they are "conjugate" to the process of estimating a population proportion: a beta prior yields a beta posterior as well, only with shifted parameter values (Leamer pp 40-51; Winkler and Hays pp 498-506). The beta distribution has two parameters, and can, by appropriate choice of parameter values, be made single-peaked, flat, or bimodal, as well as symmetric or skewed in either direction. Thus, confining one's attention to beta distributions is not, in fact, very restrictive, as far as the shape of the distribution is concerned.
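The conjugate updating described above reduces to simple arithmetic on the beta parameters: a Beta(a, b) prior combined with k successes in n trials yields a Beta(a + k, b + n - k) posterior. A minimal sketch, with hypothetical parameter values:

```python
# Conjugate beta-binomial updating: a Beta(a, b) prior plus k successes in
# n trials gives a Beta(a + k, b + n - k) posterior.
# The parameter values below are hypothetical, for illustration only.
a, b = 2, 8    # prior roughly centered near p = .2 (mean = a/(a+b) = .2)
k, n = 4, 10   # observed 4 successes in 10 trials

post_a, post_b = a + k, b + (n - k)     # posterior parameters
prior_mean = a / (a + b)
post_mean = post_a / (post_a + post_b)  # (2+4)/(2+8+10) = .3

print(post_a, post_b)  # 6 14
print(post_mean)       # 0.3
```

The posterior mean, .3, sits between the prior mean (.2) and the sample proportion (.4), with the data and the prior each getting weight in proportion to their "sample sizes."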
Beta distributions, like normal distributions, are for continuous variables, and treatment of such requires calculus. Alas, unlike normal distributions, beta distributions are not widely discussed in basic statistics textbooks, with the calculus already worked out and the numerical values already tabulated. Thus, in this example, instead of a beta distribution, we will use a discrete distribution that requires only arithmetic, but that will give some of the flavor of an actual Bayesian analysis.
For illustration, suppose that, based on similar previous studies or other background knowledge, the Bayesian believes that p is probably around .2 or a little higher, but could be on either side of that. Not specific enough! We need particular numbers to insert in the calculations, so let's pick numbers that are a precise instance of the vague ideas just expressed. Let's make the prior a discrete distribution with positive probabilities only on multiples of .1: give p=.2 the highest prior probability, with smaller positive probabilities for p=.1, .3, and .4, and zero prior probability for any other value of p. For example:
    p    prior
    0      0
   .1     .20
   .2     .40
   .3     .30
   .4     .10
  >.4      0
  sum    1.00

Notice that the numbers in the prior probability column are all non-negative, and sum to 1.0, as required of a probability distribution.
How would those prior probabilities be revised after observing some data? Suppose, for example, 10 cases were observed, and 4 of the 10 had the characteristic being considered. How should the prior probabilities be revised? Bayes' rule gives the formula.
One needs to find the likelihood of the observed 4 in 10, calculated separately using each of the p values. Actually, these can be looked up in tables of the binomial distribution, such as Moore and McCabe pp T8-9, using the parts of the table for the probability of k=4 occurrences out of n=10 trials.
In the column for p=.10 we find L(.1|data) = .0112; in the column for p=.20 we find L(.2|data) = .0881; and similarly for the other likelihoods. Adding them as a third column makes the table:
    p    prior   likelihood
    0      0        0
   .1     .20      .0112
   .2     .40      .0881
   .3     .30      .2001
   .4     .10      .2508
  >.4     --       --
  sum    1.00    (not 1.0)

Notice that, unlike the prior probabilities, the likelihoods do not sum to 1.0. Recall my earlier warning not to treat 'likelihood' as a synonym for 'probability'.
To complete the calculation of posterior probabilities, each likelihood is multiplied by the corresponding prior, giving the 4th column; finally each of those products is divided by their sum, yielding the posterior probabilities in the 5th column.
    p    prior   likelihood   prior x likelihood   posterior
    0      0        0               0                 0
   .1     .20      .0112          .00224             .018
   .2     .40      .0881          .03524             .287
   .3     .30      .2001          .06003             .490
   .4     .10      .2508          .02508             .205
  >.4     --       --              --                --
  sum    1.00    (not 1.0)        .12259            1.000

Remark: The posteriors are shown here with a spurious precision, merely to facilitate a student's working through the calculations. The numbers in the third decimal place are meaningless, and those in the second place also rather doubtful.
Observation of 4 in 10 in the data led to the following revisions: the probability on p=.2 fell from .40 to about .29, while the probability on p=.3 rose from .30 to about .49 and the probability on p=.4 rose from .10 to about .21. The data, with a sample proportion of .4, pulled the distribution toward higher values of p.
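The whole table can be reproduced with a few lines of arithmetic. The sketch below uses the same discrete prior as above and the binomial formula for the likelihood of 4 occurrences in 10 trials:

```python
from math import comb

# Discrete prior on p, as in the table above.
prior = {0.1: 0.20, 0.2: 0.40, 0.3: 0.30, 0.4: 0.10}
k, n = 4, 10  # observed 4 in 10

# Binomial likelihood of the data at each value of p:
# L(p|data) = C(n,k) p^k (1-p)^(n-k).
lik = {p: comb(n, k) * p**k * (1 - p)**(n - k) for p in prior}

# Posterior: prior times likelihood, renormalized to sum to 1.
products = {p: prior[p] * lik[p] for p in prior}
total = sum(products.values())
posterior = {p: products[p] / total for p in prior}

for p in sorted(posterior):
    print(p, round(lik[p], 4), round(posterior[p], 3))
```

Running this recovers the likelihood and posterior columns of the table, without recourse to printed binomial tables.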
To consider sequences we need more notation and concepts.
Example: A classic paper by Lorge and Solomon (1955) provides a model of group decision making with the individuals operating independently.
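In the Lorge and Solomon setup (their "Model A"), if each of k individuals independently solves a problem with probability p, the group succeeds whenever at least one member does, so the group's success probability is 1 - (1 - p)^k. A sketch, with a hypothetical individual success probability:

```python
# Lorge-Solomon "Model A": with k independent individuals, each solving the
# problem with probability p, the group fails only if every member fails.
def group_solve_prob(p, k):
    return 1 - (1 - p) ** k

# Hypothetical individual success probability .3, group sizes 1 through 5.
for k in range(1, 6):
    print(k, round(group_solve_prob(0.3, k), 3))
```

Independence is doing all the work here: each additional member multiplies the probability that everyone fails by another factor of (1 - p).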
Markovian models provide a kind of compromise between overly simplistic independence, on the one hand, and everything-depends-on-everything-else anarchy, on the other hand. Social status depends on one's parents, but not on all the ancestors back to Lucy or Adam and Eve; that sort of thing.
A Markov model takes more possibly relevant information into account than does an independence model, and thus may be closer to reality. But in fact, things may not be that simple either.
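The Markov property says that the next state depends only on the current state, through a fixed matrix of transition probabilities. A minimal sketch, using a hypothetical two-class mobility table (the transition probabilities are invented for illustration):

```python
# Hypothetical two-state mobility chain: rows index the current class,
# columns the next generation's class; entries are transition probabilities.
P = [[0.7, 0.3],   # from class 0: stay with prob .7, move with prob .3
     [0.4, 0.6]]   # from class 1: move with prob .4, stay with prob .6

def step(dist, P):
    """One generation: multiply the class distribution by the transition matrix."""
    return [sum(dist[i] * P[i][j] for i in range(len(P)))
            for j in range(len(P))]

dist = [1.0, 0.0]  # start with everyone in class 0
for _ in range(20):
    dist = step(dist, P)
print([round(x, 3) for x in dist])  # [0.571, 0.429]
```

After enough generations the distribution approaches the chain's stationary distribution (here 4/7, 3/7), regardless of the starting state; this kind of long-run behavior is what the mobility papers below exploit.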
Examples: McFarland (1970a) and Oliver and Glick (1982) treat occupational mobility using Markovian models. Weingart et al. (1999) treat negotiations along similar lines.
Leamer, Edward E. 1978. Specification Searches: Ad Hoc Inference with Nonexperimental Data. New York: Wiley.
McFarland, David D. 1970a. "Intragenerational Social Mobility as a Markov Process: Including a Time-Stationary Markovian Model That Explains Observed Declines in Mobility Rates over Time." American Sociological Review 35 (June): 463-476. [Available on jstor] [Also see 1974 comment by Larry Schroeder and reply by McFarland in ASR 39: 883-885.]
Oliver, Melvin L., and Mark A. Glick. 1982. "An Analysis of the New Orthodoxy on Black Mobility." Social Problems 29 (No. 5, June): 511-523.
Weingart, Laurie R., Michael J. Prietula, Elaine B. Hyder, and Christopher R. Genovese. 1999. "Knowledge and the Sequential Processes of Negotiation: A Markov Chain Analysis of Response-in-Kind." Journal of Experimental Social Psychology 35: 366-393. (Online in idealibrary.)
Winkler, Robert L., and William L. Hays. 1975. Statistics: Probability, Inference, and Decision. 2nd edn. New York: Holt.