Conditional Probability and Bayes' Rule 1


The Importance of Thinking Conditionally

Conditional probability is the concept that addresses the following fundamental question: how should we update our beliefs in light of the evidence we observe? In fact, a useful perspective is that all probabilities are conditional; whether or not it's written explicitly, there is always background knowledge (or assumptions) built into every probability.

In addition to giving us a technique for updating our probabilities based on observed information, conditioning is a very powerful problem-solving strategy, often making it possible to solve a complicated problem by decomposing it into manageable pieces with case-by-case reasoning. Due to the central importance of conditioning, we say that conditioning is the soul of statistics.

Definition and Intuition

Definition: Conditional Probability

If $A$ and $B$ are events with $P(B) > 0$, then the conditional probability of $A$ given $B$, denoted by $P(A|B)$, is defined as

$$P(A|B) = \frac{P(A \cap B)}{P(B)}.$$

Here $A$ is the event whose uncertainty we want to update, and $B$ is the evidence we observe (or want to treat as given). We call $P(A)$ the prior probability of $A$ and $P(A|B)$ the posterior probability of $A$ ("prior" means before updating based on the evidence, and "posterior" means after updating based on the evidence). It is important to interpret the event appearing after the vertical conditioning bar as the evidence that we have observed or that is being conditioned on: $P(A|B)$ is the probability of $A$ given the evidence $B$, not the probability of some entity called $A|B$.

Example: Two Cards

A standard, well-shuffled 52-card deck is dealt two cards, one at a time without replacement. Let $A$ be the event that the first card is a heart, and let $B$ be the event that the second card is red. Find $P(A|B)$ and $P(B|A)$.

Solution: By the naive definition of probability and the multiplication rule,

$$P(A \cap B) = \frac{13 \cdot 25}{52 \cdot 51} = \frac{25}{204},$$

since a favorable outcome is determined by choosing any of the 13 hearts and then any of the remaining 25 red cards. Also, $P(A) = 1/4$ since the 4 suits are equally likely, and

$$P(B) = \frac{26 \cdot 51}{52 \cdot 51} = \frac{1}{2}$$

since there are 26 favorable possibilities for the second card, and for each of those, the first card can be any of the other 51 cards. A neater way to see that $P(B) = 1/2$ is by symmetry: from a vantage point before having done the experiment, the second card is equally likely to be any card in the deck. We now have all the pieces needed to apply the definition of conditional probability:

$$\begin{align*} P(A|B) &= \frac{P(A \cap B)}{P(B)} = \frac{25/204}{1/2} = \frac{25}{102}, \\ P(B|A) &= \frac{P(B \cap A)}{P(A)} = \frac{25/204}{1/4} = \frac{25}{51}. \end{align*}$$

This is a simple example, but already there are several things worth noting.

  1. It's extremely important to be careful about which events to put on which side of the conditioning bar. In particular, $P(A|B) \neq P(B|A)$. The next section explores how $P(A|B)$ and $P(B|A)$ are related in general. Confusing these two quantities is called the prosecutor's fallacy. If instead we had defined $B$ to be the event that the second card is a heart, the two conditional probabilities would have been equal.
  2. Both $P(A|B)$ and $P(B|A)$ make sense (intuitively and mathematically); the chronological order in which the cards were chosen does not dictate which conditional probabilities we can look at. When we calculate conditional probabilities, we are considering what information observing one event provides about another event, not whether one event causes another.
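As a sanity check on these numbers, here is a minimal Monte Carlo sketch (the variable names and trial count are my own choices, not part of the original example). It deals two cards repeatedly and estimates $P(A|B)$ and $P(B|A)$ by restricting attention to the relevant trials.

```python
import random

# Deck of 52 cards encoded as (rank, suit); suit 0 = hearts, suits 0 and 1 = red.
deck = [(rank, suit) for rank in range(13) for suit in range(4)]

n = 10**6
n_A = n_B = n_AB = 0
for _ in range(n):
    first, second = random.sample(deck, 2)   # deal two cards without replacement
    A = first[1] == 0                        # first card is a heart
    B = second[1] in (0, 1)                  # second card is red
    n_A += A
    n_B += B
    n_AB += A and B

print("P(A|B) estimate:", n_AB / n_B)   # should be near 25/102 ≈ 0.245
print("P(B|A) estimate:", n_AB / n_A)   # should be near 25/51  ≈ 0.490
```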

To shed more light on what conditional probability means, here are two intuitive interpretations.

Intuition: Pebble World

Consider a finite sample space, with the outcomes visualized as pebbles with total mass 1. Since $A$ is an event, it is a set of pebbles, and likewise for $B$.

Events $A$ and $B$ are subsets of the sample space.

Now suppose that we learn that $B$ occurred. Upon obtaining this information, we get rid of all the pebbles in $B^c$ because they are incompatible with the knowledge that $B$ has occurred. Then $P(A \cap B)$ is the total mass of the pebbles remaining in $A$.

Then, we renormalize, that is, divide all the masses by a constant so that the new total mass of the remaining pebbles is 1. This is achieved by dividing by $P(B)$, the total mass of the pebbles in $B$. The updated mass of the outcomes corresponding to event $A$ is the conditional probability $P(A|B) = P(A \cap B)/P(B)$.
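The discard-and-renormalize step can be made concrete in a few lines of code. This is just a sketch with made-up pebble masses and events; none of the names come from the text.

```python
# Pebble world with four hypothetical outcomes and masses summing to 1.
masses = {"s1": 0.1, "s2": 0.2, "s3": 0.3, "s4": 0.4}
A = {"s1", "s2"}
B = {"s2", "s3"}

# Learn that B occurred: discard the pebbles in B^c and renormalize by P(B).
p_B = sum(m for s, m in masses.items() if s in B)
updated = {s: (m / p_B if s in B else 0.0) for s, m in masses.items()}

# The updated mass of A equals P(A ∩ B) / P(B).
p_A_given_B = sum(updated[s] for s in A)
print(p_A_given_B)   # 0.2 / 0.5 = 0.4
```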

Intuition: Frequentist Interpretation


Imagine repeating an experiment many times, randomly generating a long list of observed outcomes, each of them represented by a string of twenty-four 0's and 1's. The conditional probability of $A$ given $B$ can then be thought of in a natural way: it is the fraction of times that $A$ occurs, restricting attention to the trials where $B$ occurs. For example, suppose $B$ is the event that the first digit of the outcome string is 1 and $A$ is the event that the second digit is 1. Conditioning on $B$, we circle all the repetitions where $B$ occurred, and then we look at the fraction of circled repetitions in which event $A$ also occurred.

In symbols, let $n_A$, $n_B$, $n_{AB}$ be the number of occurrences of $A$, $B$, $A \cap B$ respectively in a large number $n$ of repetitions of the experiment. The frequentist interpretation is that

$$P(A) \approx \frac{n_A}{n}, \quad P(B) \approx \frac{n_B}{n}, \quad P(A \cap B) \approx \frac{n_{AB}}{n}.$$

Then $P(A|B)$ is interpreted as $n_{AB}/n_B$, which equals $(n_{AB}/n)/(n_B/n)$. This interpretation again translates to $P(A|B) = P(A \cap B)/P(B)$.
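Here is a minimal simulation of this counting procedure for the first-digit/second-digit example above, under the added assumption (mine, purely for illustration) that each digit is an independent fair coin flip.

```python
import random

# B = "first digit is 1", A = "second digit is 1"; digits are i.i.d. fair for illustration.
n = 10**6
n_B = n_AB = 0
for _ in range(n):
    first, second = random.randint(0, 1), random.randint(0, 1)
    B = first == 1
    A = second == 1
    n_B += B
    n_AB += A and B

# Fraction of B-trials in which A also occurred: the frequentist reading of P(A|B).
print(n_AB / n_B)   # ≈ 0.5 here, since the two digits are independent in this toy setup
```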

Bayes' Rule and the Law of Total Probability

The definition of conditional probability is simple—just a ratio of two probabilities—but it has far-reaching consequences. The first consequence is obtained easily by moving the denominator in the definition to the other side of the equation.

Theorem

For any events $A$ and $B$ with positive probabilities,

$$P(A \cap B) = P(B)\, P(A|B) = P(A)\, P(B|A).$$

At first sight this theorem may not seem very useful: it is the definition of conditional probability, just written slightly differently, and anyway it seems circular to use $P(A|B)$ to help find $P(A \cap B)$ when $P(A|B)$ was defined in terms of $P(A \cap B)$. But we will see that the theorem is in fact very useful, since it often turns out to be possible to find conditional probabilities without going back to the definition.

Applying the above theorem repeatedly, we can generalize to the intersection of $n$ events.

Theorem

For any events $A_1, \dots, A_n$ with $P(A_1, A_2, \dots, A_{n-1}) > 0$,

$$P(A_1, A_2, \dots, A_n) = P(A_1)\, P(A_2|A_1)\, P(A_3|A_1, A_2) \cdots P(A_n | A_1, \dots, A_{n-1}).$$

The commas denote intersections. For example, $P(A_3|A_1, A_2)$ is the probability that $A_3$ occurs, given that both $A_1$ and $A_2$ occur.
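As a quick illustration (an example added here, not taken from the text), let $A_i$ be the event that the $i$th card dealt from a well-shuffled deck is a heart. Then

$$P(A_1, A_2, A_3) = P(A_1)\, P(A_2|A_1)\, P(A_3|A_1, A_2) = \frac{13}{52} \cdot \frac{12}{51} \cdot \frac{11}{50} = \frac{11}{850},$$

since each conditional probability can be written down directly: for instance, given that the first two cards are hearts, 11 of the remaining 50 cards are hearts.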

We are now ready to introduce the two main theorems about conditional probability: Bayes' rule and the law of total probability (LOTP).

Theorem: Bayes' Rule

$$P(A|B) = \frac{P(B|A)\, P(A)}{P(B)}.$$

This follows immediately from the theorem $P(A \cap B) = P(B)P(A|B) = P(A)P(B|A)$ above, which in turn followed immediately from the definition of conditional probability. Yet Bayes' rule has important implications and applications in probability and statistics, since it is so often necessary to find conditional probabilities, and often $P(B|A)$ is much easier to find directly than $P(A|B)$ (or vice versa).

The law of total probability (LOTP) relates conditional probability to unconditional probability. It is essential for fulfilling the promise that conditional probability can be used to decompose complicated probability problems into simpler pieces.

Theorem: Law of Total Probability (LOTP)

Let $A_1, \dots, A_n$ be a partition of the sample space $S$ (i.e., the $A_i$ are disjoint events and their union is $S$), with $P(A_i) > 0$ for all $i$. Then

$$P(B) = \sum_{i=1}^n P(B|A_i)\, P(A_i).$$

Proof:

Since the $A_i$ form a partition of $S$, we can decompose $B$ as

$$B = (B \cap A_1) \cup (B \cap A_2) \cup \dots \cup (B \cap A_n).$$

The pieces $B \cap A_i$ are disjoint because the $A_i$ are disjoint, so by additivity and then the relation $P(B \cap A_i) = P(B|A_i)P(A_i)$,

$$P(B) = \sum_{i=1}^n P(B \cap A_i) = \sum_{i=1}^n P(B|A_i)\, P(A_i). \qquad \blacksquare$$

The law of total probability tells us that to get the unconditional probability of BB, we can divide the sample space into disjoint slices AiA_i, find the conditional probability of BB within each of the slices, then take a weighted sum of the conditional probabilities, where the weights are the probabilities P(Ai)P(A_i). The choice of how to divide up the sample space is crucial: a well-chosen partition will reduce a complicated problem into simpler pieces, whereas a poorly chosen partition will only exacerbate our problems, requiring us to calculate nn difficult probabilities instead of just one!
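Substituting LOTP into the denominator of Bayes' rule gives the combined form that the examples below rely on: for a partition $A_1, \dots, A_n$ of $S$ and an event $B$ with $P(B) > 0$,

$$P(A_j|B) = \frac{P(B|A_j)\, P(A_j)}{\sum_{i=1}^n P(B|A_i)\, P(A_i)}.$$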

Example: Random Coin

You have one fair coin, and one biased coin which lands Heads with probability 3/4. You pick one of the coins at random and flip it three times. It lands Heads all three times. Given this information, what is the probability that the coin you picked is the fair one?

Solution:

Solution: Let $A$ be the event that the chosen coin lands Heads three times and let $F$ be the event that we picked the fair coin. We are interested in $P(F|A)$, but it is easier to find $P(A|F)$ and $P(A|F^c)$, since it helps to know which coin we are flipping. This suggests using Bayes' rule and LOTP, with the partition $\{F, F^c\}$:

$$P(F|A) = \frac{P(A|F)\, P(F)}{P(A|F)\, P(F) + P(A|F^c)\, P(F^c)} = \frac{(1/2)^3 \cdot 1/2}{(1/2)^3 \cdot 1/2 + (3/4)^3 \cdot 1/2} = \frac{8}{35} \approx 0.23.$$

Before any flips, the fair and biased coins were equally likely; after seeing three Heads in a row, the biased coin becomes more plausible, so the posterior probability that we picked the fair coin drops to about 23%.
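The same calculation, plus a simulation check, in a short Python sketch (the variable names and trial count are illustrative, not from the original):

```python
from fractions import Fraction
import random

# Exact posterior via Bayes' rule + LOTP.
p_fair = Fraction(1, 2)                       # prior probability of picking the fair coin
p_hhh_fair = Fraction(1, 2) ** 3              # P(3 Heads | fair coin)
p_hhh_biased = Fraction(3, 4) ** 3            # P(3 Heads | biased coin)
numer = p_hhh_fair * p_fair
denom = p_hhh_fair * p_fair + p_hhh_biased * (1 - p_fair)
print(numer / denom)                          # 8/35 ≈ 0.23

# Monte Carlo check: pick a coin at random, flip it three times, condition on 3 Heads.
trials = 10**6
fair_and_hhh = hhh = 0
for _ in range(trials):
    fair = random.random() < 0.5
    p_heads = 0.5 if fair else 0.75
    three_heads = all(random.random() < p_heads for _ in range(3))
    hhh += three_heads
    fair_and_hhh += fair and three_heads
print(fair_and_hhh / hhh)                     # should be close to 8/35
```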

Example: Testing for a Rare Disease
