Binomial distribution

sum of n independent Bernoulli trials with constant p

PDF, Expected value, Variance, Standard error

Normal approximation

 

The Law of Averages

What can we say about the standard error of a sum of Bernoulli draws? What can we say about the standard error of the mean of Bernoulli draws? These are questions about sampling distributions. A sampling distribution describes the distribution of a quantity computed based on a sample (i.e., a set of draws). In general, we’ll be interested in describing the sampling distributions of the sample sum and sample mean.

 

T=X1 + X2 + X3 +… + Xn

 

E(T) = E(X1 ) + E(X2 ) + E(X3 ) + … + E(Xn ) = p + p + p + …+ p = np

 

Var(T) = Var(X1) + Var(X2) + Var(X3) + … + Var(Xn)

(can add variances because draws are independent)

 

Var(T) = p(1-p) + p(1-p) + p(1-p) + …+ p(1-p) = np(1-p)

 

SE(T) = sqrt(Var(T)) = sqrt(np(1-p))

 

Note that the standard error of the total increases as the number of draws n increases. Think of Kerrich’s coin tossing experiment. What this result about the standard error says is that as the number of flips increases, the typical size of the chance error, that is, the difference between the observed number of heads (T) and the expected number of heads (n/2, or more generally np), increases. You’re more likely to be 10 or more heads away from expectation if you flip 1000 times (say, 510 heads) than if you flip 100 times (60 heads).
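The coin-flip comparison can be checked exactly from the binomial formula. The following is a minimal sketch using only the Python standard library; the function names (binom_pmf, se_sum, prob_at_least_off) are illustrative and not part of the original notes. It assumes a fair coin (p = 0.5) and asks how likely the number of heads is to land 10 or more away from np.

```python
from math import comb, sqrt

def binom_pmf(k, n, p):
    """P(T = k) for T ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def se_sum(n, p):
    """Standard error of the sum of n Bernoulli(p) draws."""
    return sqrt(n * p * (1 - p))

def prob_at_least_off(n, p, d):
    """Chance that T lands d or more away from its expected value np."""
    expected = n * p
    return sum(binom_pmf(k, n, p)
               for k in range(n + 1)
               if abs(k - expected) >= d)

for n in (100, 1000):
    print(f"n={n}: SE(T)={se_sum(n, 0.5):.2f}, "
          f"P(10 or more heads from np)={prob_at_least_off(n, 0.5, 10):.3f}")
```

With 100 flips the chance should come out small (roughly 0.06), while with 1000 flips it should be better than even, in line with SE(T) growing from 5 to about 15.8.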

 

Now let’s consider the proportion of heads (or equivalently, the mean of the draws) rather than the sum.

 

X̄ = T/n

E(X̄) = E(T/n) = (1/n) E(T) = (1/n) * np = p

Var(X̄) = Var(T/n) = [(1/n)^2] * Var(T) = [(1/n)^2] np(1-p) = p(1-p)/n

SE(X̄) = sqrt(Var(X̄)) = sqrt(p(1-p)/n)

 

Note that the standard error of the mean/proportion decreases as n increases. What this means is that as the number of flips increases, the observed proportion of heads will tend to be closer to the expected proportion of heads (in the case of a fair coin, ½, or more generally, p).

 

In short, as n increases, the chance error (as measured by the standard error) increases in absolute terms, but decreases in relative terms.
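To make the absolute-versus-relative contrast concrete, here is a small sketch that tabulates both standard errors, taken as the square roots of the variances derived above, for a fair coin. The helper names se_sum and se_mean and the chosen values of n are assumptions made for illustration.

```python
from math import sqrt

def se_sum(n, p):
    """Standard error of the sum of n Bernoulli(p) draws: sqrt(np(1-p))."""
    return sqrt(n * p * (1 - p))

def se_mean(n, p):
    """Standard error of the mean of n Bernoulli(p) draws: sqrt(p(1-p)/n)."""
    return sqrt(p * (1 - p) / n)

p = 0.5
for n in (100, 10_000, 1_000_000):
    print(f"n={n:>9,}  SE(sum)={se_sum(n, p):7.1f}  SE(mean)={se_mean(n, p):.4f}")
```

Each hundredfold increase in n multiplies the standard error of the sum by 10 and divides the standard error of the mean by 10.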

 

Summary of Binomial

Let’s take stock of what we’ve done here:

We’ve considered a sequence of n independent draws from a Bernoulli box, a box that has only two kinds of tickets: 0’s and 1’s.

The proportion of 1’s in the box is p (and the proportion of 0’s in the box is 1-p).

We’ve seen that the sum of n draws follows a binomial distribution, described by the parameter p and the number of draws n. [The mean/proportion of n draws does not have a binomial distribution, but we can easily convert a mean to a sum by multiplying by n.]

We have determined the expected value, variance, and standard error of the sum of n draws, and the mean of n draws.

The standard error of a sum increases with (the square root of) n, while the standard error of a mean decreases with (the square root of) n.


 

Normal approximation to binomial

Now, if n gets sufficiently large, binomial probabilities are a hassle to calculate. To see why, consider the following example:

 

You are faced with a 20-question multiple choice test, where each question has 3 alternatives. The test is on a completely unknown topic, so you are forced to guess on each question. What is the probability of scoring the passing grade of 10 or better?

 

We can think of the number of correct answers as a binomial random variable with n=20 and p=1/3. Each question is like a draw from a box with 1 correct answer and 2 wrong answers. We assume that the guesses on the different questions are independent, and the probability of a correct answer is the same (1/3) for each question, so the binomial distribution is appropriate.

 

P(X >= 10) = P(X=10) + P(X=11) + … + P(X=19) + P(X=20)

P(X >= 10) = .0543 + .0247 + .0092 + .0028 + .0007 + .0001 + … = .0919
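The sum above is tedious by hand but short in code. The following sketch recomputes the individual terms and the tail probability for n = 20 and p = 1/3 using only the Python standard library; the function names binom_pmf and binom_upper_tail are illustrative.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_upper_tail(k_min, n, p):
    """P(X >= k_min), summing the binomial probabilities term by term."""
    return sum(binom_pmf(k, n, p) for k in range(k_min, n + 1))

n, p = 20, 1 / 3
for k in range(10, 14):
    print(f"P(X={k}) = {binom_pmf(k, n, p):.4f}")
print(f"P(X>=10) = {binom_upper_tail(10, n, p):.4f}")
```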

 

Now, instead of using the binomial directly, we can use the normal approximation to the binomial. As n gets sufficiently large, the binomial distribution resembles the normal distribution more and more closely.

 

But which normal distribution is appropriate? This is where the results about the expected value and variance of the binomial distribution come in handy. In this particular case, we can approximate the number of correct answers with a normal distribution with mean = 20/3 and variance = 20 (1/3)(2/3) = 40/9. What proportion of the area of this normal distribution is to the right of 10?

 

We face one additional complication in using the normal approximation. The binomial is a discrete distribution, while the normal is continuous. We need to make a small continuity correction. We do so by using values for the normal approximation that are halfway between possible values of the binomial. In this case, we would consider the area to the right of 9.5 rather than 10. The z-score of interest then becomes (9.5-20/3) / 2.11 = 1.344, where 2.11 is the standard error, sqrt(40/9). The area to the right of 1.34 from the normal table is .0901, quite close to the value obtained from the binomial directly (.0919).
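The same calculation can be done without a normal table. This sketch uses NormalDist from Python's standard statistics module in place of the table lookup; the variable names are chosen here for illustration.

```python
from math import sqrt
from statistics import NormalDist

n, p = 20, 1 / 3
mean = n * p                   # expected number correct: 20/3
se = sqrt(n * p * (1 - p))     # sqrt(40/9), about 2.11

cutoff = 10 - 0.5              # continuity correction: 9.5 instead of 10
z = (cutoff - mean) / se
approx = 1 - NormalDist().cdf(z)   # area to the right of z under the standard normal

print(f"z = {z:.3f}")
print(f"normal approximation to P(X >= 10) = {approx:.4f}")
```

The result differs slightly from the table value of .0901 because the table lookup rounds z to 1.34, while the code uses the unrounded z-score.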

 

In summary, to use the normal approximation to the binomial:

Make sure the number of draws is large enough. As a rule of thumb np and n(1-p) should both be greater than 5.

Apply the continuity correction (change from whole numbers to halves).

Use the mean (np) and standard error (sqrt(np(1-p))) for the binomial to determine the z-score of interest.

Find the appropriate area from the normal table.
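The four steps above can be sketched as one reusable function. This is a hypothetical helper, not something from the original notes: the name normal_approx_upper_tail, the choice to raise an error when the rule of thumb fails, and the restriction to upper-tail probabilities P(T >= k) are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def normal_approx_upper_tail(k, n, p):
    """Approximate P(T >= k) for T ~ Binomial(n, p) by a normal curve."""
    # Step 1: rule of thumb, np and n(1-p) should both exceed 5.
    if min(n * p, n * (1 - p)) <= 5:
        raise ValueError("rule of thumb not met: need np > 5 and n(1-p) > 5")
    # Step 2: continuity correction, move from the whole number k to k - 0.5.
    cutoff = k - 0.5
    # Step 3: z-score from the binomial mean and standard error.
    z = (cutoff - n * p) / sqrt(n * p * (1 - p))
    # Step 4: area to the right of z (replaces the normal table lookup).
    return 1 - NormalDist().cdf(z)

# Example: chance of 60 or more heads in 100 flips of a fair coin.
print(round(normal_approx_upper_tail(60, 100, 0.5), 4))
```

For a lower-tail probability P(T <= k), the correction would go the other way (k + 0.5) and the area to the left of z would be used.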

 

 

Sampling distribution results for Bernoulli boxes

n independent draws from Bernoulli(p) box. [sample of size n drawn from box, with replacement]

Expected value of box is p.

Standard error of 1 draw from box is sqrt(p(1-p)).

 

                | Sum of draws (T)                            | Mean of draws (X̄)
Expected value  | E(T) = np                                   | E(X̄) = p
Variance        | Var(T) = np(1-p)                            | Var(X̄) = p(1-p)/n
Standard Error  | SE(T) = sqrt(np(1-p))                       | SE(X̄) = sqrt(p(1-p)/n)
Distribution    | Binomial / Approximately normal for large n | Approximately normal for large n

 

Some summary notation ("~" is read "distributed as")

Xi ~ Bernoulli(p) [The Xi are independent Bernoulli random variables.]

T=X1 + X2 + X3 +… + Xn

T ~ Binomial(n,p)

T approx ~ Normal(np, sqrt(np(1-p))) [parameters given as mean, standard error]

X̄ = T/n

X̄ approx ~ Normal(p, sqrt(p(1-p)/n))
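These results can be checked by simulation. The sketch below repeatedly draws n tickets from a Bernoulli(p) box and compares the observed average and spread of the sum and the mean with np, sqrt(np(1-p)), p, and sqrt(p(1-p)/n). The values n = 100, p = 0.3, the number of repetitions, and the random seed are arbitrary choices made for illustration.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(0)                 # fixed seed so the run is repeatable
n, p, reps = 100, 0.3, 5000

# Each repetition: draw n tickets (1 with chance p, else 0) and record the sum.
sums = [sum(1 for _ in range(n) if random.random() < p) for _ in range(reps)]
means = [t / n for t in sums]

print(f"sum:  average {mean(sums):.2f} vs np = {n * p:.2f}")
print(f"sum:  spread  {pstdev(sums):.2f} vs sqrt(np(1-p)) = {sqrt(n * p * (1 - p)):.2f}")
print(f"mean: average {mean(means):.4f} vs p = {p:.4f}")
print(f"mean: spread  {pstdev(means):.4f} vs sqrt(p(1-p)/n) = {sqrt(p * (1 - p) / n):.4f}")
```

The simulated values will not match the formulas exactly, since they are themselves subject to chance error, but they should be close with this many repetitions.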

 

Sampling distribution results for any box

n independent draws from box. [sample of size n drawn from box, with replacement]

Expected value of box is μ.

Standard error of 1 draw from box is σ.

 

                | Sum of draws (T)                 | Mean of draws (X̄)
Expected value  | E(T) = nμ                        | E(X̄) = μ
Variance        | Var(T) = nσ^2                    | Var(X̄) = σ^2/n
Standard Error  | SE(T) = sqrt(n) σ                | SE(X̄) = σ/sqrt(n)
Distribution    | Approximately normal for large n | Approximately normal for large n

 

T approx ~ Normal(nμ, sqrt(n) σ) [parameters given as mean, standard error]

X̄ = T/n

X̄ approx ~ Normal(μ, σ/sqrt(n))
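The general results can be illustrated the same way with a box that is not Bernoulli. In the sketch below, the box contents (the six faces of a die), the names mu and sigma for the box average and box SD, and the values of n, the number of repetitions, and the seed are all assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import mean, pstdev

box = [1, 2, 3, 4, 5, 6]       # hypothetical box: one ticket per face of a die
mu = mean(box)                 # average of the box
sigma = pstdev(box)            # SD of the box (population SD of the tickets)

random.seed(1)                 # fixed seed so the run is repeatable
n, reps = 50, 5000

# Each repetition: n draws with replacement, then record the sum.
sums = [sum(random.choices(box, k=n)) for _ in range(reps)]
means = [t / n for t in sums]

print(f"sum:  average {mean(sums):.2f} vs n*mu = {n * mu:.2f}")
print(f"sum:  spread  {pstdev(sums):.2f} vs sqrt(n)*sigma = {sqrt(n) * sigma:.2f}")
print(f"mean: average {mean(means):.3f} vs mu = {mu:.3f}")
print(f"mean: spread  {pstdev(means):.3f} vs sigma/sqrt(n) = {sigma / sqrt(n):.3f}")
```

A histogram of sums or means from such a run would be expected to look roughly bell-shaped, which is what the "approximately normal for large n" entries describe.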