Expectation
Often it is useful to have one number summarizing the "average" value of a random variable. There are several senses in which the word "average" is used, but by far the most commonly used is the mean of an r.v., also known as its expected value.
In addition, much of statistics is about understanding variability in the world, so it is often important to know how "spread out" the distribution is; we will formalize this with the concepts of variance and standard deviation. As we'll see, variance and standard deviation are defined in terms of expected values, so the uses of expected values go far beyond just computing averages.
Given a list of numbers $x_1, x_2, \dots, x_n$, the familiar way to average them is to add them up and divide by $n$. This is called the arithmetic mean, and is defined by
$$\bar{x} = \frac{1}{n} \sum_{j=1}^{n} x_j.$$
More generally, we can define a weighted mean of $x_1, x_2, \dots, x_n$ as
$$\bar{x}_w = \sum_{j=1}^{n} w_j x_j,$$
where the weights $w_1, w_2, \dots, w_n$ are pre-specified nonnegative numbers that add up to 1 (so the unweighted mean is obtained when $w_j = 1/n$ for all $j$).
The definition of expectation for a discrete r.v. is inspired by the weighted mean of a list of numbers, with weights given by probabilities.
Definition: Expectation of a Discrete r.v.
The expected value (also called the expectation or mean) of a discrete r.v. $X$ whose distinct possible values are $x_1, x_2, \dots$ is
$$E(X) = \sum_{j=1}^{\infty} x_j P(X = x_j).$$
If the support is finite, then this is replaced by a finite sum. We can also write
$$E(X) = \sum_{x} x P(X = x),$$
where the sum is over the support of $X$ (in any case, $x P(X = x)$ is 0 for any $x$ not in the support). The expectation is undefined if $\sum_{j=1}^{\infty} |x_j| P(X = x_j)$ diverges, since then the series for $E(X)$ diverges or its value depends on the order in which the $x_j$ are listed.
In words, the expected value of $X$ is a weighted average of the possible values that $X$ can take on, weighted by their probabilities. Let's check that the definition makes sense in two simple examples:
-
Let $X$ be the result of rolling a fair 6-sided die, so $X$ takes on the values $1, 2, 3, 4, 5, 6$, with equal probabilities. Intuitively, we should be able to get the average by adding up these values and dividing by 6. Using the definition, the expected value is
$$E(X) = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = 3.5,$$
as we expected. Note though that $X$ never equals its mean in this example. This is similar to the fact that the average number of children per household in some country could be 1.8, but that doesn't mean that a typical household has 1.8 children!
-
Let $X \sim \mathrm{Bern}(p)$ and $q = 1 - p$. Then
$$E(X) = 1 \cdot p + 0 \cdot q = p,$$
which makes sense intuitively since it is between the two possible values of $X$, compromising between 0 and 1 based on how likely each is. This is illustrated in the following figure: two pebbles, with weights $q$ and $p$, are being balanced on a seesaw at positions 0 and 1. For the seesaw to balance, the fulcrum (shown as a triangle) must be at $p$, which in physics terms is the center of mass.
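Both computations can be sanity-checked numerically. The following Python sketch is illustrative only; representing a PMF as a dictionary mapping values to probabilities is a choice of mine, not part of the text.

```python
from fractions import Fraction

def expectation(pmf):
    """Expected value of a discrete r.v., given its PMF as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

# Fair 6-sided die: each face has probability 1/6.
die = {k: Fraction(1, 6) for k in range(1, 7)}
print(expectation(die))    # 7/2, i.e. 3.5

# Bernoulli(p) with p = 3/10: P(X = 1) = p, P(X = 0) = 1 - p.
p = Fraction(3, 10)
bern = {1: p, 0: 1 - p}
print(expectation(bern))   # 3/10, i.e. p
```

Using `Fraction` keeps the arithmetic exact, so the die example returns exactly $7/2$ rather than a rounded float.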
Warning
For any discrete r.v. $X$, the expected value $E(X)$ is a number (if it exists). A common mistake is to replace an r.v. by its expectation without justification, which is wrong both mathematically ($X$ is a function, $E(X)$ is a constant) and statistically (it ignores the variability of $X$), except in the degenerate case where $X$ is a constant.
Notation
We often abbreviate $E(X)$ to $EX$. Similarly, we often abbreviate $E(X^2)$ to $EX^2$. Thus $EX^2$ is the expectation of the random variable $X^2$, not the square of the number $EX$. In general, unless the parentheses explicitly indicate otherwise, the expectation is to be taken at the very end. For example, $E(X - 3)^2$ is $E((X - 3)^2)$, not $(E(X - 3))^2$. As we will see, the order of operations here is very important!
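The order of operations is easy to check concretely. Here is a small Python sketch for a fair die roll (the helper `E` and the die PMF are illustrative constructions of mine):

```python
from fractions import Fraction

# PMF of a fair 6-sided die roll X.
die = {k: Fraction(1, 6) for k in range(1, 7)}

def E(g):
    """Expectation of g(X), where X is a fair die roll."""
    return sum(g(x) * p for x, p in die.items())

# E((X - 3)^2): apply the square first, then take the expectation.
print(E(lambda x: (x - 3) ** 2))   # 19/6

# (E(X - 3))^2: take the expectation first, then square.
print(E(lambda x: x - 3) ** 2)     # 1/4
```

The two results, $19/6$ and $1/4$, differ substantially, illustrating why the convention matters.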
Linearity of Expectation
The most important property of expectation is linearity: the expected value of a sum of r.v.s is the sum of the individual expected values.
Theorem: Linearity of Expectation
For any r.v.s $X, Y$ and any constant $c$,
$$E(X + Y) = E(X) + E(Y),$$
$$E(cX) = cE(X).$$
We will now show that linearity is true for discrete r.v.s $X$ and $Y$. Before doing that, let's recall some basic facts about averages. If we have a list of numbers, say $(1, 1, 1, 1, 1, 3, 3, 5)$, we can calculate their mean by adding all the values and dividing by the length of the list, so that each element of the list gets a weight of $\frac{1}{8}$:
$$\frac{1}{8}(1 + 1 + 1 + 1 + 1 + 3 + 3 + 5) = 2.$$
But another way to calculate the mean is to group together all the 1's, all the 3's, and all the 5's, and then take a weighted average, giving appropriate weights to 1's, 3's, and 5's:
$$\frac{5}{8} \cdot 1 + \frac{2}{8} \cdot 3 + \frac{1}{8} \cdot 5 = 2.$$
This insight, that averages can be calculated in two ways, ungrouped or grouped, is all that is needed to prove linearity (in the discrete case)! Recall that $X$ is a function which assigns a real number to every outcome $s$ in the sample space. The r.v. $X$ may assign the same value to multiple sample outcomes. When this happens, our definition of expectation groups all these outcomes together into a super-pebble whose weight, $P(X = x)$, is the total weight of the constituent pebbles. This grouping process is illustrated in the following figure for a hypothetical r.v. $X$ taking on three distinct values. So our definition of expectation corresponds to the grouped way of taking averages.
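The agreement between the two ways of averaging is easy to verify in code. This Python sketch uses an example list of my own choosing with repeated values:

```python
from fractions import Fraction
from collections import Counter

xs = [1, 1, 1, 1, 1, 3, 3, 5]
n = len(xs)

# Ungrouped: every element of the list gets weight 1/n.
ungrouped = sum(Fraction(x, n) for x in xs)

# Grouped: weight each distinct value by its relative frequency.
counts = Counter(xs)
grouped = sum(Fraction(c, n) * v for v, c in counts.items())

print(ungrouped, grouped)   # both equal 2
```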
Left: $X$ assigns a number to each pebble in the sample space. Right: grouping the pebbles by the value that $X$ assigns to them, the 9 pebbles become 3 super-pebbles. The weight of a super-pebble is the sum of the weights of the constituent pebbles.
The advantage of this definition is that it allows us to work with the distribution of $X$ directly, without returning to the sample space. The disadvantage comes when we have to prove theorems like this one, for if we have another r.v. $Y$ on the same sample space, the super-pebbles created by $Y$ are different from those created by $X$, with different weights $P(Y = y)$; this makes it difficult to combine $E(X)$ and $E(Y)$.
Fortunately, we know there's another equally valid way to calculate an average: we can take a weighted average of the values of individual pebbles. In other words, if $X(s)$ is the value that $X$ assigns to pebble $s$, we can take the weighted average
$$E(X) = \sum_{s} X(s) P(\{s\}),$$
where $P(\{s\})$ is the weight of pebble $s$.
This corresponds to the ungrouped way of taking averages. The advantage of this definition is that it breaks down the sample space into the smallest possible units, so we are now using the same weights $P(\{s\})$ for every random variable defined on this sample space. If $Y$ is another random variable, then
$$E(X + Y) = \sum_{s} (X + Y)(s) P(\{s\}) = \sum_{s} X(s) P(\{s\}) + \sum_{s} Y(s) P(\{s\}) = E(X) + E(Y).$$
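The pebble-by-pebble argument can be mirrored directly in code. In this Python sketch, the tiny sample space, its weights, and the two r.v.s are all hypothetical choices of mine:

```python
from fractions import Fraction

# A tiny sample space of "pebbles" with weights P({s}) summing to 1.
pebbles = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}

# Two r.v.s defined as functions on the pebbles (here, as lookup tables).
X = {"a": 1, "b": 3, "c": 5}
Y = {"a": 2, "b": 2, "c": 0}

def E(rv):
    """Ungrouped expectation: weight the value at each pebble s by P({s})."""
    return sum(rv[s] * w for s, w in pebbles.items())

# Linearity: E(X + Y) = E(X) + E(Y); no independence assumption is needed.
XplusY = {s: X[s] + Y[s] for s in pebbles}
print(E(XplusY), E(X) + E(Y))   # 4 4
```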
Example: Binomial Expectation
For $X \sim \mathrm{Bin}(n, p)$, let's find $E(X)$. By definition of expectation,
$$E(X) = \sum_{k=0}^{n} k P(X = k) = \sum_{k=0}^{n} k \binom{n}{k} p^k q^{n-k}.$$
This sum can be done with some work, but linearity of expectation provides a much shorter path to the same result. Let's write $X$ as the sum of $n$ independent $\mathrm{Bern}(p)$ r.v.s:
$$X = I_1 + I_2 + \dots + I_n,$$
where each $I_j$ has expectation $E(I_j) = p$. By linearity,
$$E(X) = E(I_1) + E(I_2) + \dots + E(I_n) = np.$$
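As a numerical check, the direct sum from the definition can be compared against $np$. The function name and parameter values in this Python sketch are my own:

```python
from math import comb

def binom_mean_direct(n, p):
    """E(X) from the definition: sum of k * P(X = k) for X ~ Bin(n, p)."""
    return sum(k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))

n, p = 10, 0.3
print(binom_mean_direct(n, p))   # agrees with n * p = 3.0 up to float rounding
```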
Example: Hypergeometric Expectation
Let $X \sim \mathrm{HGeom}(w, b, n)$, interpreted as the number of white balls in a sample of size $n$ drawn without replacement from an urn with $w$ white and $b$ black balls. As in the Binomial case, we can write $X$ as a sum of Bernoulli random variables,
$$X = I_1 + I_2 + \dots + I_n,$$
where $I_j$ equals 1 if the $j$th ball in the sample is white and 0 otherwise. By symmetry, $I_j \sim \mathrm{Bern}(p)$ with $p = w/(w + b)$, since unconditionally the $j$th ball drawn is equally likely to be any of the $w + b$ balls.
Unlike in the Binomial case, the $I_j$ are not independent, since the sampling is without replacement: given that a ball in the sample is white, there is a lower chance that another ball in the sample is white. However, linearity still holds for dependent r.v.s! Thus,
$$E(X) = E(I_1) + E(I_2) + \dots + E(I_n) = \frac{nw}{w + b}.$$
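The same kind of check works here: summing $k \, P(X = k)$ over the Hypergeometric PMF should reproduce $nw/(w+b)$. The function name and the urn parameters in this Python sketch are my own:

```python
from math import comb
from fractions import Fraction

def hgeom_mean_direct(w, b, n):
    """E(X) from the definition for X ~ HGeom(w, b, n), computed exactly."""
    total = comb(w + b, n)
    # math.comb(b, n - k) is 0 when n - k > b, so out-of-range terms vanish.
    return sum(Fraction(k * comb(w, k) * comb(b, n - k), total)
               for k in range(min(w, n) + 1))

w, b, n = 6, 4, 5   # 6 white, 4 black, sample of size 5
print(hgeom_mean_direct(w, b, n))   # 3
print(Fraction(n * w, w + b))       # n*w/(w+b) = 3
```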