Joint Distributions and Conditional Expectation


Joint, marginal, and conditional distributions

So far we have been focusing on the distribution of one random variable at a time, but very often we care about the relationship between multiple r.v.s in the same experiment. To give just a few examples:

  • Surveys: When conducting a survey, we may ask multiple questions to each respondent in order to determine the relationship between, say, opinions on social issues and opinions on economic issues.
  • Medicine: To evaluate the effectiveness of a treatment, we may take multiple measurements per patient; an ensemble of blood pressure, heart rate, and cholesterol readings can be more informative than any of these measurements considered separately.
  • Genetics: To study the relationships between various genetic markers and a particular disease, if we only looked separately at distributions for each genetic marker, we could fail to learn about whether an interaction between markers is related to the disease.
  • Time series: To study how something evolves over time, we can often make a series of measurements over time, and then study the series jointly. There are many applications of such series, such as global temperatures, stock prices, or national unemployment rates. The series of measurements considered jointly can help us deduce trends for the purpose of forecasting future measurements.

This unit considers joint distributions, also called multivariate distributions, which describe how multiple r.v.s interact with each other. We introduce multivariate analogs of the CDF, PMF, and PDF in order to provide a complete specification of the relationship between multiple r.v.s. After this groundwork is in place, we'll study a couple of famous named multivariate distributions, generalizing the Binomial and Normal distributions to higher dimensions.

The three key concepts for this section are joint, marginal, and conditional distributions. Recall that the distribution of a single r.v. X provides complete information about the probability of X falling into any subset of the real line. Analogously, the joint distribution of two r.v.s X and Y provides complete information about the probability of the vector (X, Y) falling into any subset of the plane. The marginal distribution of X is the individual distribution of X, ignoring the value of Y, and the conditional distribution of X given Y = y is the updated distribution for X after observing Y = y. We'll look at these concepts in the discrete case first, then extend them to the continuous case.

Discrete

The most general description of the joint distribution of two r.v.s is the joint CDF, which applies to discrete and continuous r.v.s alike.

Definition: Joint CDF

The joint CDF of r.v.s X and Y is the function F_{X,Y} given by

F_{X,Y}(x,y) = P(X \leq x, Y \leq y).

The joint CDF of n r.v.s is defined analogously.

Unfortunately, the joint CDF of discrete r.v.s is not a well-behaved function; as in the univariate case, it consists of jumps and flat regions. For this reason, with discrete r.v.s we usually work with the joint PMF, which also determines the joint distribution and is much easier to visualize.

Definition: Joint PMF

The joint PMF of discrete r.v.s X and Y is the function p_{X,Y} given by

p_{X,Y}(x,y) = P(X=x, Y=y).

The joint PMF of n discrete r.v.s is defined analogously.

Just as univariate PMFs must be nonnegative and sum to 1, we require valid joint PMFs to be nonnegative and sum to 1, where the sum is taken over all possible values of X and Y:

\sum_x \sum_y P(X=x, Y=y) = 1.

The following figure shows a sketch of what the joint PMF of two discrete r.v.s could look like. The height of a vertical bar at (x, y) represents the probability P(X=x, Y=y). For the joint PMF to be valid, the total height of the vertical bars must be 1.

[Figure: joint PMF of two discrete r.v.s, drawn as vertical bars over the (x, y) plane]
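As a concrete (and entirely made-up) illustration, a joint PMF of two discrete r.v.s with finitely many values can be stored as a 2D array, with rows indexed by the values of X and columns by the values of Y. The sketch below just checks the validity conditions; the numbers are arbitrary choices, not anything from the text.

```python
import numpy as np

# Hypothetical joint PMF of X in {0, 1, 2} and Y in {0, 1}:
# rows are indexed by x, columns by y.
joint_pmf = np.array([
    [0.10, 0.20],   # P(X=0, Y=0), P(X=0, Y=1)
    [0.25, 0.15],   # P(X=1, Y=0), P(X=1, Y=1)
    [0.20, 0.10],   # P(X=2, Y=0), P(X=2, Y=1)
])

# Validity: every entry is nonnegative and all entries sum to 1.
assert (joint_pmf >= 0).all()
assert np.isclose(joint_pmf.sum(), 1.0)
```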

From the joint distribution of X and Y, we can get the distribution of X alone by summing over the possible values of Y. This gives us the familiar PMF of X that we have seen in previous chapters. In the context of joint distributions, we will call it the marginal or unconditional distribution of X, to make it clear that we are referring to the distribution of X alone, without regard for the value of Y.

Definition: Marginal PMF

For discrete r.v.s X and Y, the marginal PMF of X is

P(X=x) = \sum_y P(X=x, Y=y).

The marginal PMF of X is the PMF of X, viewing X individually rather than jointly with Y. The above equation follows from the axioms of probability (we are summing over disjoint cases). The operation of summing over the possible values of Y in order to convert the joint PMF into the marginal PMF of X is known as marginalizing out Y.

Similarly, the marginal PMF of Y is obtained by summing over all possible values of X. So given the joint PMF, we can marginalize out Y to get the PMF of X, or marginalize out X to get the PMF of Y. But if we only know the marginal PMFs of X and Y, there is no way to recover the joint PMF without further assumptions.
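In the array representation from the earlier sketch (same made-up joint_pmf values), marginalizing is just summing the array along one axis:

```python
import numpy as np

# Same hypothetical joint PMF as in the sketch above (rows = x, columns = y).
joint_pmf = np.array([[0.10, 0.20],
                      [0.25, 0.15],
                      [0.20, 0.10]])

# Marginalize out Y (sum across columns) to get P(X = x),
# and marginalize out X (sum across rows) to get P(Y = y).
pmf_x = joint_pmf.sum(axis=1)   # array([0.3, 0.4, 0.3])
pmf_y = joint_pmf.sum(axis=0)   # array([0.55, 0.45])
```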

Now suppose that we observe the value of X and want to update our distribution of Y to reflect this information. Instead of using the marginal PMF P(Y=y), which does not take into account any information about X, we should use a PMF that conditions on the event X=x, where x is the value we observed for X. This naturally leads us to consider conditional PMFs.

Definition: Conditional PMF

For discrete r.v.s X and Y, the conditional PMF of Y given X=x is

P(Y=y | X=x) = \frac{P(X=x, Y=y)}{P(X=x)}.

This is viewed as a function of y for fixed x.

The following figure illustrates the definition of conditional PMF. To condition on the event X=x, we first take the joint PMF and focus in on the vertical bars where X takes on the value x; in the figure, these are shown in bold. All of the other vertical bars are irrelevant because they are inconsistent with the knowledge that X=x occurred. Since the total height of the bold bars is the marginal probability P(X=x), we then renormalize the conditional PMF by dividing by P(X=x); this ensures that the conditional PMF will sum to 1. Therefore conditional PMFs are PMFs, just as conditional probabilities are probabilities. Notice that there is a different conditional PMF of Y for every possible value of X; the following figure highlights just one of these conditional PMFs.

[Figure: conditional PMF of Y given X = x, obtained by renormalizing one slice of the joint PMF]
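With the same made-up array, conditioning on X = x amounts to taking the slice of the joint PMF where X = x and renormalizing it by P(X = x):

```python
import numpy as np

joint_pmf = np.array([[0.10, 0.20],
                      [0.25, 0.15],
                      [0.20, 0.10]])   # same hypothetical joint PMF

x = 1   # condition on the event X = 1

# Slice out the bars consistent with X = x, then divide by P(X = x)
# so that the conditional PMF sums to 1.
p_x = joint_pmf[x, :].sum()                  # P(X = 1) = 0.40
cond_pmf_y_given_x = joint_pmf[x, :] / p_x   # array([0.625, 0.375])

assert np.isclose(cond_pmf_y_given_x.sum(), 1.0)
```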


Definition: Independence of Discrete r.v.s

Random variables X and Y are independent if for all x and y,

F_{X,Y}(x,y) = F_X(x) F_Y(y).

If X and Y are discrete, this is equivalent to the condition

P(X=x, Y=y) = P(X=x) P(Y=y)

for all x and y, and it is also equivalent to the condition

P(Y=y|X=x) = P(Y=y)

for all y and all x such that P(X=x) > 0.

Using the terminology from this chapter, the definition says that for independent r.v.s, the joint CDF factors into the product of the marginal CDFs, or that the joint PMF factors into the product of the marginal PMFs. Remember that in general, the marginal distributions do not determine the joint distribution: this is the entire reason why we wanted to study joint distributions in the first place! But in the special case of independence, the marginal distributions are all we need in order to specify the joint distribution; we can get the joint PMF by multiplying the marginal PMFs.

Another way of looking at independence is that all the conditional PMFs are the same as the marginal PMF. In other words, starting with the marginal PMF of Y, no updating is necessary when we condition on X=x, regardless of what x is. There is no event purely involving X that influences our distribution of Y, and vice versa.
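In the array picture, independence means the joint PMF equals the outer product of its marginals. Here is a sketch of that check, again with made-up numbers:

```python
import numpy as np

# An independent joint PMF, built as the outer product of two marginals.
pmf_x = np.array([0.3, 0.4, 0.3])
pmf_y = np.array([0.55, 0.45])
joint_indep = np.outer(pmf_x, pmf_y)   # P(X=x, Y=y) = P(X=x) P(Y=y)

# Independence check: the joint PMF equals the outer product of the
# marginals recovered from it by marginalization.
marg_x = joint_indep.sum(axis=1)
marg_y = joint_indep.sum(axis=0)
assert np.allclose(joint_indep, np.outer(marg_x, marg_y))

# The dependent joint PMF from the earlier sketches fails the same check.
joint_dep = np.array([[0.10, 0.20],
                      [0.25, 0.15],
                      [0.20, 0.10]])
assert not np.allclose(joint_dep,
                       np.outer(joint_dep.sum(axis=1), joint_dep.sum(axis=0)))
```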

Continuous

Once we have a handle on discrete joint distributions, it isn't much harder to consider continuous joint distributions. We simply make the now-familiar substitutions of integrals for sums and PDFs for PMFs, remembering that the probability of any individual point is now 0.

Formally, in order for X and Y to have a continuous joint distribution, we require that the joint CDF

F_{X,Y}(x,y) = P(X \leq x, Y \leq y)

be differentiable with respect to x and y. The partial derivative with respect to x and y is called the joint PDF. The joint PDF determines the joint distribution, as does the joint CDF.

Definition: Joint PDF

If X and Y are continuous with joint CDF F_{X,Y}, their *joint PDF* is the derivative of the joint CDF with respect to x and y:

f_{X,Y}(x,y) = \frac{\partial^2}{\partial x \partial y} F_{X,Y}(x,y).

We require valid joint PDFs to be nonnegative and integrate to 1:

f_{X,Y}(x,y) \geq 0, \textrm{ and } \int_{-\infty}^\infty \int_{-\infty}^\infty f_{X,Y}(x,y) dx dy = 1.

In the univariate case, the PDF was the function we integrated to get the probability of an interval. Similarly, the joint PDF of two r.v.s is the function we integrate to get the probability of a two-dimensional region.

The following figure shows a sketch of what a joint PDF of two r.v.s could look like. As usual with continuous r.v.s, we need to keep in mind that the height of the surface f_{X,Y}(x,y) at a single point does not represent a probability. The probability of any specific point in the plane is 0; furthermore, now that we've gone up a dimension, the probability of any line or curve in the plane is also 0. The only way we can get nonzero probability is by integrating over a region of positive area in the xy-plane.

When we integrate the joint PDF over an area A, what we are calculating is the volume under the surface of the joint PDF and above A. Thus, probability is represented by volume under the joint PDF. The total volume under a valid joint PDF is 1.
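As a numerical sketch of probability-as-volume, take the made-up joint PDF f(x, y) = x + y on the unit square (nonnegative and integrating to 1, so valid) and integrate it over a region with scipy. The function names here are illustrative, not from the text.

```python
from scipy.integrate import dblquad

# Hypothetical joint PDF: f(x, y) = x + y on the unit square, 0 elsewhere.
def joint_pdf(x, y):
    return x + y if (0 <= x <= 1 and 0 <= y <= 1) else 0.0

# Total volume under the joint PDF should be 1.
# Note: dblquad integrates func(y, x), with y as the inner variable.
total, _ = dblquad(lambda y, x: joint_pdf(x, y), 0, 1, lambda x: 0, lambda x: 1)
print(total)   # ≈ 1.0

# P(X <= 0.5, Y <= 0.5): the volume above the square [0, 0.5] x [0, 0.5].
prob, _ = dblquad(lambda y, x: joint_pdf(x, y), 0, 0.5, lambda x: 0, lambda x: 0.5)
print(prob)    # ≈ 0.125, matching the integral done by hand
```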

In the discrete case, we get the marginal PMF of X by summing over all possible values of Y in the joint PMF. In the continuous case, we get the marginal PDF of X by integrating over all possible values of Y in the joint PDF.

Definition: Marginal PDF

For continuous r.v.s X and Y with joint PDF f_{X,Y}, the marginal PDF of X is

f_X(x) = \int_{-\infty}^\infty f_{X,Y}(x,y) dy.

This is the PDF of X, viewing X individually rather than jointly with Y.
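For the same made-up joint PDF f(x, y) = x + y on the unit square, the marginal PDF of X can be obtained by integrating out y numerically; by hand, f_X(x) = x + 1/2 for 0 ≤ x ≤ 1, which the sketch below reproduces.

```python
from scipy.integrate import quad

# Hypothetical joint PDF from the previous sketch.
def joint_pdf(x, y):
    return x + y if (0 <= x <= 1 and 0 <= y <= 1) else 0.0

# Marginal PDF of X at a point x: integrate the joint PDF over all y.
def marginal_pdf_x(x):
    value, _ = quad(lambda y: joint_pdf(x, y), 0, 1)
    return value

print(marginal_pdf_x(0.3))   # ≈ 0.8, matching f_X(x) = x + 1/2
```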

To simplify notation, we have mainly been looking at the joint distribution of two r.v.s rather than n r.v.s, but marginalization works analogously with any number of variables. For example, if we have the joint PDF of X, Y, Z, W but want the joint PDF of X, W, we just have to integrate over all possible values of Y and Z:

f_{X,W}(x,w) = \int_{-\infty}^\infty \int_{-\infty}^\infty f_{X,Y,Z,W}(x,y,z,w) dy dz.

Conceptually this is very easy---just integrate over the unwanted variables to get the joint PDF of the wanted variables---but computing the integral may or may not be difficult. Returning to the case of the joint distribution of two r.v.s X and Y, let's consider how to update our distribution for Y after observing the value of X, using the conditional PDF.

Definition: Conditional PDF

For continuous r.v.s X and Y with joint PDF f_{X,Y}, the conditional PDF of Y given X=x is

f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)}.

This is considered as a function of y for fixed x.


[Figure: conditional PDF of Y given X = x, obtained by renormalizing one slice of the joint PDF]

Note that we can recover the joint PDF f_{X,Y} if we have the conditional PDF f_{Y|X} and the corresponding marginal f_X:

f_{X,Y}(x,y) = f_{Y|X}(y|x) f_X(x).

Similarly, we can recover the joint PDF if we have f_{X|Y} and f_Y:

f_{X,Y}(x,y) = f_{X|Y}(x|y) f_Y(y).

This allows us to develop continuous analogs of Bayes' rule and LOTP: both formulas still hold in the continuous case, with probabilities replaced by probability density functions.

Theorem: Continuous Form of Bayes' Rule and LOTP

For continuous r.v.s X and Y,

f_{Y|X}(y|x) = \frac{f_{X|Y}(x|y)f_Y(y)}{f_X(x)},
f_X(x) = \int_{-\infty}^\infty f_{X|Y}(x|y)f_Y(y) dy.

Proof: By the definition of conditional PDFs, f_{X,Y}(x,y) = f_{Y|X}(y|x) f_X(x) and f_{X,Y}(x,y) = f_{X|Y}(x|y) f_Y(y). Equating the two expressions and dividing by f_X(x) gives Bayes' rule. For LOTP, integrate the joint PDF over all values of y: f_X(x) = \int_{-\infty}^\infty f_{X,Y}(x,y) dy = \int_{-\infty}^\infty f_{X|Y}(x|y) f_Y(y) dy.
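As a numerical sanity check of continuous LOTP and Bayes' rule, here is a sketch using a made-up model (not one from the text): Y ~ Unif(0, 1) and X | Y = y ~ Expo(y + 1). We compute f_X by LOTP and then verify that the Bayes-rule conditional PDF of Y given X = x integrates to 1.

```python
import numpy as np
from scipy.integrate import quad

def f_y(y):
    return 1.0 if 0 <= y <= 1 else 0.0          # Y ~ Unif(0, 1)

def f_x_given_y(x, y):
    rate = y + 1                                 # X | Y = y ~ Expo(y + 1)
    return rate * np.exp(-rate * x) if x >= 0 else 0.0

# Continuous LOTP: f_X(x) = integral of f_{X|Y}(x|y) f_Y(y) over y.
def f_x(x):
    value, _ = quad(lambda y: f_x_given_y(x, y) * f_y(y), 0, 1)
    return value

# Bayes' rule: f_{Y|X}(y|x) = f_{X|Y}(x|y) f_Y(y) / f_X(x).
# As a check, this conditional PDF should integrate to 1 over y.
x0 = 0.7
fx0 = f_x(x0)
total, _ = quad(lambda y: f_x_given_y(x0, y) * f_y(y) / fx0, 0, 1)
print(total)   # ≈ 1.0
```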

Finally, let's discuss the definition of independence for continuous r.v.s; then we'll turn to concrete examples. As in the discrete case, we can view independence of continuous r.v.s in two ways. One is that the joint CDF factors into the product of the marginal CDFs, or the joint PDF factors into the product of the marginal PDFs. The other is that the conditional PDF of Y given X=x is the same as the marginal PDF of Y, so conditioning on X provides no information about Y.

Definition: Independence of Continuous r.v.s

Random variables X and Y are independent if for all x and y,

F_{X,Y}(x,y) = F_X(x) F_Y(y).

If X and Y are continuous with joint PDF f_{X,Y}, this is equivalent to the condition

f_{X,Y}(x,y) = f_X(x) f_Y(y)

for all x and y, and it is also equivalent to the condition

f_{Y|X}(y|x) = f_Y(y)

for all y and all x such that f_X(x) > 0.

Example: Comparing Exponentials of Different Rates

Let T_1 \sim \textrm{Expo}(\lambda_1) and T_2 \sim \textrm{Expo}(\lambda_2) be independent. Find P(T_1 < T_2). For example, T_1 could be the lifetime of a refrigerator and T_2 could be the lifetime of a stove (if we are willing to assume Exponential distributions for these), and then P(T_1 < T_2) is the probability that the refrigerator fails before the stove. We know from Chapter 5 that \min(T_1, T_2) \sim \textrm{Expo}(\lambda_1 + \lambda_2), which tells us about when the first appliance failure will occur, but we also may want to know about which appliance will fail first.

Solution:

By independence, the joint PDF of (T_1, T_2) is the product of the marginal PDFs, so we integrate it over the region where t_1 < t_2:

P(T_1 < T_2) = \int_0^\infty \int_0^{t_2} \lambda_1 e^{-\lambda_1 t_1} \lambda_2 e^{-\lambda_2 t_2} dt_1 dt_2 = \int_0^\infty (1 - e^{-\lambda_1 t_2}) \lambda_2 e^{-\lambda_2 t_2} dt_2 = 1 - \frac{\lambda_2}{\lambda_1 + \lambda_2} = \frac{\lambda_1}{\lambda_1 + \lambda_2}.

So the refrigerator fails before the stove with probability \lambda_1/(\lambda_1 + \lambda_2); the appliance with the larger rate (and hence the shorter expected lifetime) is the one more likely to fail first.
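A quick Monte Carlo check of this answer (the rates and sample size below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
lam1, lam2 = 2.0, 3.0          # example rates
n = 10**6

# NumPy parametrizes the Exponential by its mean (scale = 1 / rate).
t1 = rng.exponential(scale=1 / lam1, size=n)
t2 = rng.exponential(scale=1 / lam2, size=n)

print((t1 < t2).mean())        # simulated P(T1 < T2), ≈ 0.4
print(lam1 / (lam1 + lam2))    # exact answer: 0.4
```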