Conditional Probability and Bayes' Rule 1


The Importance of Thinking Conditionally

Conditional probability is the concept that addresses the following fundamental question: how should we update our beliefs in light of the evidence we observe? In fact, a useful perspective is that all probabilities are conditional; whether or not it's written explicitly, there is always background knowledge (or assumptions) built into every probability.

In addition to giving us a technique for updating our probabilities based on observed information, conditioning is a very powerful problem-solving strategy, often making it possible to solve a complicated problem by decomposing it into manageable pieces with case-by-case reasoning. Due to the central importance of conditioning, we say that conditioning is the soul of statistics.

Definition and Intuition

Definition: Conditional Probability

If $A$ and $B$ are events with $P(B) > 0$, then the conditional probability of $A$ given $B$, denoted by $P(A|B)$, is defined as

$$P(A|B) = \frac{P(A \cap B)}{P(B)}.$$

Here $A$ is the event whose uncertainty we want to update, and $B$ is the evidence we observe (or want to treat as given). We call $P(A)$ the prior probability of $A$ and $P(A|B)$ the posterior probability of $A$ ("prior" means before updating based on the evidence, and "posterior" means after updating based on the evidence). It is important to interpret the event appearing after the vertical conditioning bar as the evidence that we have observed or that is being conditioned on: $P(A|B)$ is the probability of $A$ given the evidence $B$, not the probability of some entity called $A|B$.

Example: Two Cards

A standard, well-shuffled 52-card deck is dealt two cards, one at a time without replacement. Let $A$ be the event that the first card is a heart, and let $B$ be the event that the second card is red. Find $P(A|B)$ and $P(B|A)$.

Solution: By the naive definition of probability and the multiplication rule,

$$P(A \cap B) = \frac{13 \cdot 25}{52 \cdot 51} = \frac{25}{204},$$

since a favorable outcome is determined by choosing any of the 13 hearts and then any of the remaining 25 red cards. Also, $P(A) = 1/4$ since the 4 suits are equally likely, and

$$P(B) = \frac{26 \cdot 51}{52 \cdot 51} = \frac{1}{2}$$

since there are 26 favorable possibilities for the second card, and for each of those, the first card can be any of the other 51 cards. A neater way to see that $P(B) = 1/2$ is by symmetry: from a vantage point before having done the experiment, the second card is equally likely to be any card in the deck. We now have all the pieces needed to apply the definition of conditional probability:

$$\begin{align*} P(A|B) &= \frac{P(A \cap B)}{P(B)} = \frac{25/204}{1/2} = \frac{25}{102}, \\ P(B|A) &= \frac{P(B \cap A)}{P(A)} = \frac{25/204}{1/4} = \frac{25}{51}. \end{align*}$$

This is a simple example, but already there are several things worth noting.

  1. It's extremely important to be careful about which events to put on which side of the conditioning bar. In particular, $P(A|B) \neq P(B|A)$. The next section explores how $P(A|B)$ and $P(B|A)$ are related in general. Confusing these two quantities is called the prosecutor's fallacy. If instead we had defined $B$ to be the event that the second card is a heart, the two conditional probabilities would have been equal.
  2. Both $P(A|B)$ and $P(B|A)$ make sense (intuitively and mathematically); the chronological order in which the cards were chosen does not dictate which conditional probabilities we can look at. When we calculate conditional probabilities, we are considering what information observing one event provides about another event, not whether one event causes another.
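As a sanity check on these numbers, here is a minimal Monte Carlo sketch (the variable names and trial count are my own choices, not part of the original example). It deals two cards repeatedly and estimates $P(A|B)$ and $P(B|A)$ by restricting attention to the relevant trials.

```python
import random

# Deck of 52 cards encoded as (rank, suit); suit 0 = hearts, suits 0 and 1 = red.
deck = [(rank, suit) for rank in range(13) for suit in range(4)]

n = 10**6
n_A = n_B = n_AB = 0
for _ in range(n):
    first, second = random.sample(deck, 2)   # deal two cards without replacement
    A = first[1] == 0                        # first card is a heart
    B = second[1] in (0, 1)                  # second card is red
    n_A += A
    n_B += B
    n_AB += A and B

print("P(A|B) estimate:", n_AB / n_B)   # should be near 25/102 ≈ 0.245
print("P(B|A) estimate:", n_AB / n_A)   # should be near 25/51  ≈ 0.490
```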

To shed more light on what conditional probability means, here are two intuitive interpretations.

Intuition: Pebble World

Consider a finite sample space, with the outcomes visualized as pebbles with total mass 1. Since $A$ is an event, it is a set of pebbles, and likewise for $B$.

Events $A$ and $B$ are subsets of the sample space.

Now suppose that we learn that $B$ occurred. Upon obtaining this information, we get rid of all the pebbles in $B^c$ because they are incompatible with the knowledge that $B$ has occurred. Then $P(A \cap B)$ is the total mass of the pebbles remaining in $A$.

Then, we renormalize, that is, divide all the masses by a constant so that the new total mass of the remaining pebbles is 1. This is achieved by dividing by $P(B)$, the total mass of the pebbles in $B$. The updated mass of the outcomes corresponding to event $A$ is the conditional probability $P(A|B) = P(A \cap B)/P(B)$.
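The discard-and-renormalize step can be made concrete in a few lines of code. This is just a sketch with made-up pebble masses and events; none of the names come from the text.

```python
# Pebble world with four hypothetical outcomes and masses summing to 1.
masses = {"s1": 0.1, "s2": 0.2, "s3": 0.3, "s4": 0.4}
A = {"s1", "s2"}
B = {"s2", "s3"}

# Learn that B occurred: discard the pebbles in B^c and renormalize by P(B).
p_B = sum(m for s, m in masses.items() if s in B)
updated = {s: (m / p_B if s in B else 0.0) for s, m in masses.items()}

# The updated mass of A equals P(A ∩ B) / P(B).
p_A_given_B = sum(updated[s] for s in A)
print(p_A_given_B)   # 0.2 / 0.5 = 0.4
```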

Intuition: Frequentist Interpretation


Imagine repeating an experiment many times, randomly generating a long list of observed outcomes, each of them represented by a string of twenty-four 0's and 1's. The conditional probability of $A$ given $B$ can then be thought of in a natural way: it is the fraction of times that $A$ occurs, restricting attention to the trials where $B$ occurs. For example, suppose $B$ is the event that the first digit of the outcome string is 1 and $A$ is the event that the second digit is 1. Conditioning on $B$, we circle all the repetitions where $B$ occurred, and then we look at the fraction of circled repetitions in which event $A$ also occurred.

In symbols, let $n_A$, $n_B$, $n_{AB}$ be the number of occurrences of $A$, $B$, $A \cap B$ respectively in a large number $n$ of repetitions of the experiment. The frequentist interpretation is that

$$P(A) \approx \frac{n_A}{n}, \quad P(B) \approx \frac{n_B}{n}, \quad P(A \cap B) \approx \frac{n_{AB}}{n}.$$

Then $P(A|B)$ is interpreted as $n_{AB}/n_B$, which equals $(n_{AB}/n)/(n_B/n)$. This interpretation again translates to $P(A|B) = P(A \cap B)/P(B)$.
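Here is a minimal simulation of this counting procedure for the first-digit/second-digit example above, under the added assumption (mine, purely for illustration) that each digit is an independent fair coin flip.

```python
import random

# B = "first digit is 1", A = "second digit is 1"; digits are i.i.d. fair for illustration.
n = 10**6
n_B = n_AB = 0
for _ in range(n):
    first, second = random.randint(0, 1), random.randint(0, 1)
    B = first == 1
    A = second == 1
    n_B += B
    n_AB += A and B

# Fraction of B-trials in which A also occurred: the frequentist reading of P(A|B).
print(n_AB / n_B)   # ≈ 0.5 here, since the two digits are independent in this toy setup
```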

Bayes' Rule and the Law of Total Probability

The definition of conditional probability is simple—just a ratio of two probabilities—but it has far-reaching consequences. The first consequence is obtained easily by moving the denominator in the definition to the other side of the equation.

Theorem

For any events $A$ and $B$ with positive probabilities,

$$P(A \cap B) = P(B)\, P(A|B) = P(A)\, P(B|A).$$

At first sight this theorem may not seem very useful: it is the definition of conditional probability, just written slightly differently, and anyway it seems circular to use $P(A|B)$ to help find $P(A \cap B)$ when $P(A|B)$ was defined in terms of $P(A \cap B)$. But we will see that the theorem is in fact very useful, since it often turns out to be possible to find conditional probabilities without going back to the definition.

Applying the above theorem repeatedly, we can generalize to the intersection of $n$ events.

Theorem

For any events $A_1, \dots, A_n$ with $P(A_1, A_2, \dots, A_{n-1}) > 0$,

$$P(A_1, A_2, \dots, A_n) = P(A_1)\, P(A_2|A_1)\, P(A_3|A_1, A_2) \cdots P(A_n | A_1, \dots, A_{n-1}).$$

The commas denote intersections. For example, $P(A_3|A_1, A_2)$ is the probability that $A_3$ occurs, given that both $A_1$ and $A_2$ occur.
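As a quick illustration (an example added here, not taken from the text), let $A_i$ be the event that the $i$th card dealt from a well-shuffled deck is a heart. Then

$$P(A_1, A_2, A_3) = P(A_1)\, P(A_2|A_1)\, P(A_3|A_1, A_2) = \frac{13}{52} \cdot \frac{12}{51} \cdot \frac{11}{50} = \frac{11}{850},$$

since each conditional probability can be written down directly: for instance, given that the first two cards are hearts, 11 of the remaining 50 cards are hearts.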

We are now ready to introduce the two main theorems about conditional probability: Bayes' rule and the law of total probability (LOTP).

Theorem: Bayes' Rule

$$P(A|B) = \frac{P(B|A)\, P(A)}{P(B)}.$$

This follows immediately from the theorem $P(A \cap B) = P(B)P(A|B) = P(A)P(B|A)$ above, which in turn followed immediately from the definition of conditional probability. Yet Bayes' rule has important implications and applications in probability and statistics, since it is so often necessary to find conditional probabilities, and often $P(B|A)$ is much easier to find directly than $P(A|B)$ (or vice versa).

The law of total probability (LOTP) relates conditional probability to unconditional probability. It is essential for fulfilling the promise that conditional probability can be used to decompose complicated probability problems into simpler pieces.

Theorem: Law of Total Probability (LOTP)

Let $A_1, \dots, A_n$ be a partition of the sample space $S$ (i.e., the $A_i$ are disjoint events and their union is $S$), with $P(A_i) > 0$ for all $i$. Then

$$P(B) = \sum_{i=1}^n P(B|A_i)\, P(A_i).$$

Proof:

Since the $A_i$ form a partition of $S$, we can decompose $B$ as

$$B = (B \cap A_1) \cup (B \cap A_2) \cup \dots \cup (B \cap A_n).$$

The pieces $B \cap A_i$ are disjoint because the $A_i$ are disjoint, so by additivity and then the relation $P(B \cap A_i) = P(B|A_i)P(A_i)$,

$$P(B) = \sum_{i=1}^n P(B \cap A_i) = \sum_{i=1}^n P(B|A_i)\, P(A_i). \qquad \blacksquare$$

The law of total probability tells us that to get the unconditional probability of BB, we can divide the sample space into disjoint slices AiA_i, find the conditional probability of BB within each of the slices, then take a weighted sum of the conditional probabilities, where the weights are the probabilities P(Ai)P(A_i). The choice of how to divide up the sample space is crucial: a well-chosen partition will reduce a complicated problem into simpler pieces, whereas a poorly chosen partition will only exacerbate our problems, requiring us to calculate nn difficult probabilities instead of just one!
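Substituting LOTP into the denominator of Bayes' rule gives the combined form that the examples below rely on: for a partition $A_1, \dots, A_n$ of $S$ and an event $B$ with $P(B) > 0$,

$$P(A_j|B) = \frac{P(B|A_j)\, P(A_j)}{\sum_{i=1}^n P(B|A_i)\, P(A_i)}.$$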

Example: Random Coin

You have one fair coin, and one biased coin which lands Heads with probability 3/4. You pick one of the coins at random and flip it three times. It lands Heads all three times. Given this information, what is the probability that the coin you picked is the fair one?

Solution:

Solution: Let $A$ be the event that the chosen coin lands Heads three times and let $F$ be the event that we picked the fair coin. We are interested in $P(F|A)$, but it is easier to find $P(A|F)$ and $P(A|F^c)$, since it helps to know which coin we are flipping. This suggests using Bayes' rule and LOTP, with the partition $\{F, F^c\}$:

$$P(F|A) = \frac{P(A|F)\, P(F)}{P(A|F)\, P(F) + P(A|F^c)\, P(F^c)} = \frac{(1/2)^3 \cdot 1/2}{(1/2)^3 \cdot 1/2 + (3/4)^3 \cdot 1/2} = \frac{8}{35} \approx 0.23.$$

Before any flips, the fair and biased coins were equally likely; after seeing three Heads in a row, the biased coin becomes more plausible, so the posterior probability that we picked the fair coin drops to about 23%.
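The same calculation, plus a simulation check, in a short Python sketch (the variable names and trial count are illustrative, not from the original):

```python
from fractions import Fraction
import random

# Exact posterior via Bayes' rule + LOTP.
p_fair = Fraction(1, 2)                       # prior probability of picking the fair coin
p_hhh_fair = Fraction(1, 2) ** 3              # P(3 Heads | fair coin)
p_hhh_biased = Fraction(3, 4) ** 3            # P(3 Heads | biased coin)
numer = p_hhh_fair * p_fair
denom = p_hhh_fair * p_fair + p_hhh_biased * (1 - p_fair)
print(numer / denom)                          # 8/35 ≈ 0.23

# Monte Carlo check: pick a coin at random, flip it three times, condition on 3 Heads.
trials = 10**6
fair_and_hhh = hhh = 0
for _ in range(trials):
    fair = random.random() < 0.5
    p_heads = 0.5 if fair else 0.75
    three_heads = all(random.random() < p_heads for _ in range(3))
    hhh += three_heads
    fair_and_hhh += fair and three_heads
print(fair_and_hhh / hhh)                     # should be close to 8/35
```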

Example: Testing for a Rare Disease
