Law of Large Numbers
We turn next to two theorems, the law of large numbers and the central limit theorem, which describe the behavior of the sample mean of i.i.d. r.v.s as the sample size grows. Let $X_1, X_2, \ldots$ be i.i.d. with finite mean $\mu$ and finite variance $\sigma^2$. For all positive integers $n$, let
$$\bar{X}_n = \frac{X_1 + \cdots + X_n}{n}$$
be the sample mean of $X_1$ through $X_n$. The sample mean is itself an r.v., with mean $\mu$ and variance $\sigma^2/n$:
$$E(\bar{X}_n) = \frac{1}{n}\sum_{j=1}^{n} E(X_j) = \mu, \qquad \mathrm{Var}(\bar{X}_n) = \frac{1}{n^2}\sum_{j=1}^{n} \mathrm{Var}(X_j) = \frac{\sigma^2}{n}.$$
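As a quick numerical check on these formulas, here is a minimal simulation sketch; the Expo(1) choice for the $X_j$ (so that $\mu = 1$ and $\sigma^2 = 1$) is just an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 50, 100_000

# Each row is one sample of size n; Expo(1) has mu = 1 and sigma^2 = 1.
samples = rng.exponential(scale=1.0, size=(reps, n))
xbar = samples.mean(axis=1)  # one sample mean per row

print(xbar.mean())  # close to mu = 1
print(xbar.var())   # close to sigma^2 / n = 1/50 = 0.02
```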
The law of large numbers (LLN) says that as $n$ grows, the sample mean $\bar{X}_n$ converges to the true mean $\mu$ (in a sense that is explained below). The LLN comes in two versions, which use slightly different definitions of what it means for a sequence of random variables to converge to a number. We will state both versions.
Theorem: Strong Law of Large Numbers
The sample mean $\bar{X}_n$ converges to the true mean $\mu$ pointwise as $n \to \infty$, with probability 1. In other words, the event $\bar{X}_n \to \mu$ has probability $1$.
Theorem: Weak Law of Large Numbers
For all $\epsilon > 0$, $P(|\bar{X}_n - \mu| > \epsilon) \to 0$ as $n \to \infty$. (This form of convergence is called convergence in probability.)
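A quick way to see convergence in probability numerically is to estimate $P(|\bar{X}_n - \mu| > \epsilon)$ by simulation for increasing $n$. A minimal sketch for fair coin tosses (so $\mu = 1/2$; the choice $\epsilon = 0.05$ is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
eps, reps = 0.05, 10_000

# Estimate P(|Xbar_n - 1/2| > eps) for fair coin tosses as n grows.
for n in [10, 100, 1000]:
    xbar = rng.integers(0, 2, size=(reps, n)).mean(axis=1)
    print(n, np.mean(np.abs(xbar - 0.5) > eps))
```

The printed probabilities shrink toward 0 as $n$ grows, which is exactly what the WLLN asserts.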
The law of large numbers is essential for simulations, statistics, and science. Consider generating "data" from a large number of independent replications of an experiment, performed either by computer simulation or in the real world. Every time we use the proportion of times that something happened as an approximation to its probability, we are implicitly appealing to LLN. Every time we use the average value in the replications of some quantity to approximate its theoretical average, we are implicitly appealing to LLN.
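For instance, here is a minimal sketch of both uses of the LLN, with a standard Normal "experiment" chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
draws = rng.standard_normal(1_000_000)  # replications of the "experiment"

# Proportion of replications in which the event {Z > 1} occurred,
# as an approximation to P(Z > 1).
print(np.mean(draws > 1))   # approximately 1 - Phi(1), about 0.1587

# Average value of Z^2 across replications, as an approximation to E(Z^2) = 1.
print(np.mean(draws**2))    # approximately 1
```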
Example: Running Proportion of Heads
Let $X_1, X_2, \ldots$ be i.i.d. $\mathrm{Bern}(1/2)$. Interpreting the $X_j$ as indicators of Heads in a string of fair coin tosses, $\bar{X}_n$ is the proportion of Heads after $n$ tosses. SLLN says that with probability $1$, when the sequence of r.v.s $\bar{X}_1, \bar{X}_2, \ldots$ crystallizes into a sequence of numbers, the sequence of numbers will converge to $1/2$. Mathematically, there are bizarre outcomes such as HHHHHH... and HHTHHTHHTHHT..., but collectively they have zero probability of occurring. WLLN says that for any $\epsilon > 0$, the probability of $\bar{X}_n$ being more than $\epsilon$ away from $1/2$ can be made as small as we like by letting $n$ grow.
As an illustration, we simulated six sequences of fair coin tosses and, for each sequence, computed $\bar{X}_n$ as a function of $n$. Of course, in real life we cannot simulate infinitely many coin tosses, so we stopped after 300 tosses. The figure below plots $\bar{X}_n$ as a function of $n$ for each sequence.
At the beginning, we can see that there is quite a bit of fluctuation in the running proportion of Heads. As the number of coin tosses increases, however, $\mathrm{Var}(\bar{X}_n)$ gets smaller and smaller, and $\bar{X}_n$ approaches $1/2$.
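A simulation along these lines is easy to reproduce; here is a sketch using NumPy and Matplotlib (the seed and plotting details are arbitrary, not those behind the figure above):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2024)
num_sequences, num_tosses = 6, 300

for _ in range(num_sequences):
    tosses = rng.integers(0, 2, size=num_tosses)  # i.i.d. Bern(1/2) tosses
    running_prop = np.cumsum(tosses) / np.arange(1, num_tosses + 1)
    plt.plot(running_prop)

plt.axhline(0.5, linestyle="--", color="black")  # the true mean, 1/2
plt.xlabel("number of tosses n")
plt.ylabel("running proportion of Heads")
plt.show()
```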
Central Limit Theorem
Let $X_1, X_2, \ldots$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. The law of large numbers says that as $n \to \infty$, $\bar{X}_n$ converges to the constant $\mu$ (with probability $1$). But what is its distribution along the way to becoming a constant? This is addressed by the central limit theorem (CLT), which, as its name suggests, is a limit theorem of central importance in statistics.
The CLT states that for large $n$, the distribution of $\bar{X}_n$ after standardization approaches a standard Normal distribution. By standardization, we mean that we subtract $\mu$, the expected value of $\bar{X}_n$, and divide by $\sigma/\sqrt{n}$, the standard deviation of $\bar{X}_n$.
Theorem: Central Limit Theorem
As $n \to \infty$,
$$\sqrt{n}\left(\frac{\bar{X}_n - \mu}{\sigma}\right) \to \mathcal{N}(0, 1) \text{ in distribution.}$$
In words, the CDF of the left-hand side approaches $\Phi$, the CDF of the standard Normal distribution.
The CLT is an asymptotic result, telling us about the limiting distribution of $\bar{X}_n$ as $n \to \infty$, but it also suggests an approximation for the distribution of $\bar{X}_n$ when $n$ is a large but finite number.
Central limit theorem, approximation form.
For large $n$, the distribution of $\bar{X}_n$ is approximately $\mathcal{N}(\mu, \sigma^2/n)$. Of course, we already knew from properties of expectation and variance that $\bar{X}_n$ has mean $\mu$ and variance $\sigma^2/n$; the central limit theorem gives us the additional information that $\bar{X}_n$ is approximately Normal with said mean and variance.
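To see the quality of this approximation numerically, here is a minimal sketch (the Unif(0,1) summands and $n = 30$ are arbitrary choices, and the $\Phi$ values are hard-coded for comparison):

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 30, 200_000
mu, sigma2 = 0.5, 1 / 12  # mean and variance of Unif(0, 1)

xbar = rng.uniform(0, 1, size=(reps, n)).mean(axis=1)

# Standardize; by the CLT the result should look close to N(0, 1).
z = (xbar - mu) / np.sqrt(sigma2 / n)

# Compare empirical CDF values of z with Phi at a few points.
for c, phi_c in [(-1.0, 0.1587), (0.0, 0.5), (1.0, 0.8413)]:
    print(c, np.mean(z <= c), phi_c)
```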
Let's take a moment to admire the generality of this result. The distribution of the individual $X_j$ can be anything in the world, as long as the mean and variance are finite. We could have a discrete distribution like the Binomial, a bounded continuous distribution, or a distribution with multiple peaks and valleys. No matter what, the act of averaging will cause Normality to emerge. In the figure below we show histograms of the distribution of $\bar{X}_n$ for four different starting distributions and for several values of $n$. We can see that as $n$ increases, the distribution of $\bar{X}_n$ starts to look Normal, regardless of the distribution of the $X_j$.
This does not mean that the distribution of the $X_j$ is irrelevant, however. If the $X_j$ have a highly skewed or multimodal distribution, we may need $n$ to be very large before the Normal approximation becomes accurate; at the other extreme, if the $X_j$ are already i.i.d. Normals, the distribution of $\bar{X}_n$ is exactly $\mathcal{N}(\mu, \sigma^2/n)$ for all $n$. Since there are no infinite datasets in the real world, the quality of the Normal approximation for finite $n$ is an important consideration.
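One way to quantify this dependence on the starting distribution is to track the skewness of $\bar{X}_n$ as $n$ grows; a sketch with Expo(1) summands (an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
reps = 100_000

def sample_skewness(x):
    # Sample skewness: the third standardized moment.
    z = (x - x.mean()) / x.std()
    return np.mean(z**3)

# Expo(1) is highly right-skewed (skewness 2). The skewness of the sample
# mean decays like 2 / sqrt(n), so Normality emerges only as n grows.
for n in [1, 5, 30, 100]:
    xbar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
    print(n, sample_skewness(xbar))
```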
The CLT says that the sample mean $\bar{X}_n$ is approximately Normal, but since the sum $W_n = X_1 + \cdots + X_n = n\bar{X}_n$ is just a scaled version of $\bar{X}_n$, the CLT also implies $W_n$ is approximately Normal. If the $X_j$ have mean $\mu$ and variance $\sigma^2$, $W_n$ has mean $n\mu$ and variance $n\sigma^2$. The CLT then states that for large $n$,
$$W_n \text{ is approximately } \mathcal{N}(n\mu, n\sigma^2).$$
This is completely equivalent to the approximation for $\bar{X}_n$, but it can be useful to state it in this form because many of the named distributions we have studied can be considered as a sum of i.i.d. r.v.s.
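For example, a $\mathrm{Bin}(n, p)$ r.v. is a sum of $n$ i.i.d. $\mathrm{Bern}(p)$ r.v.s, so the CLT yields the classical Normal approximation to the Binomial. A minimal sketch (the parameter values are arbitrary, and the $\Phi$ value in the comment is precomputed):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 0.3  # Bin(n, p) is a sum of n i.i.d. Bern(p) r.v.s

draws = rng.binomial(n, p, size=500_000)

# P(W <= 35) by simulation; compare to the CLT approximation N(np, np(1-p)):
# with a continuity correction, Phi((35.5 - 30) / sqrt(21)) is about 0.8849.
print(np.mean(draws <= 35))
```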