Random variables and distributions are among the most useful concepts in all of probability and statistics. This unit introduces discrete random variables and distributions, and the next unit introduces continuous random variables and distributions.
Random Variables
Random variables are an incredibly useful concept that simplifies notation and expands our ability to quantify uncertainty and summarize the results of experiments. Random variables are essential throughout statistics, so it is crucial to think through what they mean, both intuitively and mathematically.
Sometimes a definition of "random variable" (r.v.) is given that is a barely paraphrased version of "a random variable is a variable that takes on random values", but this fails to say where the randomness comes from. To make the notion of a random variable precise, we define it as a function mapping the sample space to the real line.
Definition: Random Variable
Given an experiment with sample space $S$, a random variable (r.v.) is a function from the sample space $S$ to the real numbers $\mathbb{R}$. It is common, but not required, to denote random variables by capital letters.
Thus, a random variable assigns a numerical value to each possible outcome of the experiment. The randomness comes from the fact that we have a random experiment (with probabilities described by the probability function $P$); the mapping itself is deterministic.
This definition is abstract but fundamental; one of the most important skills to develop when studying probability and statistics is the ability to go back and forth between abstract ideas and concrete examples. Relatedly, it is important to work on recognizing the essential pattern or structure of a problem and how it connects to problems you have studied previously. We will often discuss stories that involve tossing coins or drawing balls from urns because they are simple, convenient scenarios to work with, but many other problems are isomorphic: they have the same essential structure, but in a different guise.
To start, let's consider a coin-tossing example. The structure of the problem is that we have a sequence of trials where there are two possible outcomes for each trial. Here we think of the possible outcomes as $H$ (Heads) and $T$ (Tails), but we could just as well think of them as "success" and "failure" or as 1 and 0, for example.
Example Coin Tosses
Consider an experiment where we toss a fair coin twice. The sample space consists of four possible outcomes: $S = \{HH, HT, TH, TT\}$. Here are some random variables on this space (for practice, you can think up some of your own):

- Let $X$ be the number of Heads.
- Let $Y$ be the number of Tails; note that $Y = 2 - X$.
- Let $I$ be 1 if the first toss lands Heads and 0 otherwise.

Each r.v. is a numerical summary of some aspect of the experiment.
We can also encode the sample space as $S = \{(s_1, s_2) : s_1, s_2 \in \{0, 1\}\}$, where $1$ is the code for Heads and $0$ is the code for Tails. Then we can give explicit formulas for $X$, $Y$, and $I$:

$$X(s_1, s_2) = s_1 + s_2, \qquad Y(s_1, s_2) = 2 - s_1 - s_2, \qquad I(s_1, s_2) = s_1,$$

where for simplicity we write $X(s_1, s_2)$ to mean $X((s_1, s_2))$, etc. For most r.v.s we will consider, it is tedious or infeasible to write down an explicit formula in this way. Fortunately, it is usually unnecessary to do so. As before, for a sample space with a finite number of outcomes we can visualize the outcomes as pebbles, with the mass of a pebble corresponding to its probability, such that the total mass of the pebbles is 1. A random variable simply labels each pebble with a number. The following figure shows two random variables defined on the same sample space: the pebbles or outcomes are the same, but the real numbers assigned to the outcomes are different.
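To make the "r.v. as a deterministic function on the sample space" idea concrete, here is a minimal Python sketch of the two-coin-toss example, encoding Heads as 1 and Tails as 0. The functions X, Y, and I below are illustrative implementations of the r.v.s discussed in the example (number of Heads, number of Tails, and the indicator of the first toss landing Heads).

```python
# Sketch: the two-coin-toss sample space, with 1 = Heads, 0 = Tails.
from itertools import product

sample_space = list(product([0, 1], repeat=2))  # (0,0), (0,1), (1,0), (1,1)

def X(s):
    """Number of Heads in the outcome s = (s1, s2)."""
    return s[0] + s[1]

def Y(s):
    """Number of Tails; note Y(s) = 2 - X(s) for every outcome."""
    return 2 - s[0] - s[1]

def I(s):
    """Indicator of the first toss landing Heads."""
    return s[0]

# Each r.v. deterministically labels each outcome ("pebble") with a number;
# the randomness lives entirely in which outcome the experiment produces.
labels = {s: (X(s), Y(s), I(s)) for s in sample_space}
```

Note that nothing here is random: the functions themselves are fixed, which is exactly the point of the definition.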
Before we perform the experiment, the outcome $s$ has not yet been realized, so we don't know the value of $X$, though we could calculate the probability that $X$ will take on a given value or range of values. After we perform the experiment and the outcome $s$ has been realized, the random variable crystallizes into the numerical value $X(s)$. In this way, random variables provide numerical summaries of the experiment in question.
Discrete Random Variables and Probability Mass Functions
Definition: Discrete Random Variable
A random variable $X$ is said to be discrete if there is a finite list of values $a_1, a_2, \ldots, a_n$ or an infinite list of values $a_1, a_2, \ldots$ such that $P(X = a_j \text{ for some } j) = 1$. If $X$ is a discrete r.v., then the finite or countably infinite set of values $x$ such that $P(X = x) > 0$ is called the support of $X$.
Most commonly in applications, the support of a discrete r.v. is a set of integers. In contrast, a continuous r.v. can take on any real value in an interval (possibly even the entire real line); such r.v.s are defined more precisely in Unit 4. It is also possible to have an r.v. that is a hybrid of discrete and continuous, such as by flipping a coin and then generating a discrete r.v. if the coin lands Heads and generating a continuous r.v. if the coin lands Tails. For example, imagine that a customer in a store flips a coin to decide whether to make a purchase. If the coin lands Heads, the customer doesn't buy anything; if Tails, the customer spends some random positive real amount of money. But the starting point for understanding such r.v.s is to understand discrete and continuous r.v.s.
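The hybrid customer example can be simulated directly. The following sketch is illustrative only: the choice of an exponential distribution and its rate for the "random positive amount" are assumptions, not part of the text, which just says the amount is some random positive real number.

```python
# Sketch of the hybrid r.v. from the text: a coin flip decides whether the
# customer buys nothing (Heads) or spends a random positive amount (Tails).
# The exponential distribution and rate 0.05 are illustrative assumptions.
import random

def spending(rng=random):
    if rng.random() < 0.5:           # Heads: no purchase
        return 0.0
    return rng.expovariate(0.05)     # Tails: a random positive amount

samples = [spending() for _ in range(10_000)]
# Roughly half the samples sit exactly at 0 (a point mass), while the rest
# spread continuously over the positive reals: neither purely discrete
# nor purely continuous.
```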
Given a random variable, we would like to be able to describe its behavior using the language of probability. For example, we might want to answer questions about the probability that the r.v. will fall into a given range: if $X$ is the lifetime earnings of a randomly chosen U.S. college graduate, what is the probability that $X$ exceeds a million dollars? If $X$ is the number of major earthquakes in California in the next five years, what is the probability that $X$ equals 0? The distribution of a random variable provides the answers to these questions; it specifies the probabilities of all events associated with the r.v., such as the probability of it equaling 3 and the probability of it being at least 110. We will see that there are several equivalent ways to express the distribution of an r.v. For a discrete r.v., the most natural way to do so is with a probability mass function, which we now define.
Definition: Probability Mass Function
The probability mass function (PMF) of a discrete r.v. $X$ is the function $p_X$ given by $p_X(x) = P(X = x)$. Note that this is positive if $x$ is in the support of $X$, and $0$ otherwise. Here $X = x$ denotes an event, consisting of all outcomes $s$ to which $X$ assigns the number $x$.
Let's look at a few examples of PMFs.
Example Coin Tosses Continued
In this example we'll find the PMFs of all the random variables in Example Coin Tosses, the example with two fair coin tosses. Here are the r.v.s we defined, along with their PMFs:

- $X$, the number of Heads: $p_X(0) = 1/4$, $p_X(1) = 1/2$, $p_X(2) = 1/4$.
- $Y = 2 - X$, the number of Tails: by symmetry, $p_Y(0) = 1/4$, $p_Y(1) = 1/2$, $p_Y(2) = 1/4$.
- $I$, the indicator of the first toss landing Heads: $p_I(0) = 1/2$, $p_I(1) = 1/2$.
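These PMFs can be computed mechanically by counting equally likely outcomes. A short sketch (assuming, as in the example, X = number of Heads, Y = number of Tails, and I = indicator of the first toss landing Heads, with 1 coding Heads):

```python
# Sketch: computing PMFs by counting over four equally likely outcomes.
from collections import Counter
from fractions import Fraction
from itertools import product

outcomes = list(product([0, 1], repeat=2))  # four equally likely outcomes
n = len(outcomes)

def pmf(rv):
    """PMF of rv as a dict {value: probability}, by counting outcomes."""
    counts = Counter(rv(s) for s in outcomes)
    return {value: Fraction(c, n) for value, c in counts.items()}

pmf_X = pmf(lambda s: s[0] + s[1])      # number of Heads
pmf_Y = pmf(lambda s: 2 - s[0] - s[1])  # number of Tails
pmf_I = pmf(lambda s: s[0])             # first toss Heads?
```

Using exact fractions avoids floating-point noise when checking that each PMF sums to 1.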
The PMFs of $X$, $Y$, and $I$ are plotted in the above figure. Vertical bars are drawn to make it easier to compare the heights of different points.
We will now state the properties of a valid PMF.
Theorem: Valid PMFs
Let $X$ be a discrete r.v. with support $x_1, x_2, \ldots$ (assume these values are distinct and, for notational simplicity, that the support is countably infinite; the analogous results hold if the support is finite).
The PMF $p_X$ of $X$ must satisfy the following two criteria:
- Nonnegative: $p_X(x) > 0$ if $x = x_j$ for some $j$, and $p_X(x) = 0$ otherwise;
- Sums to 1: $\sum_{j=1}^{\infty} p_X(x_j) = 1$.
Proof: The first criterion is true since probability is nonnegative. The second is true since $X$ must take on some value, and the events $\{X = x_j\}$ are disjoint, so

$$\sum_{j=1}^{\infty} P(X = x_j) = P(X = x_1 \text{ or } X = x_2 \text{ or } \ldots) = 1.$$
Conversely, if distinct values $x_1, x_2, \ldots$ are specified and we have a function satisfying the two criteria above, then this function is the PMF of some r.v.
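The two validity criteria translate directly into a check one can run on a candidate PMF. A minimal sketch, representing a PMF as a dict from support values to probabilities (the function name and tolerance are illustrative choices):

```python
# Sketch: checking the two PMF validity criteria for a candidate function
# given as {support value: probability}.
import math

def is_valid_pmf(p, tol=1e-12):
    nonnegative = all(prob > 0 for prob in p.values())  # positive on the support
    sums_to_one = math.isclose(sum(p.values()), 1.0, abs_tol=tol)
    return nonnegative and sums_to_one

print(is_valid_pmf({0: 0.25, 1: 0.5, 2: 0.25}))  # valid: PMF of X above
print(is_valid_pmf({0: 0.3, 1: 0.3}))            # invalid: sums to 0.6
```

The tolerance is needed because floating-point sums rarely equal 1 exactly; with exact fractions an equality test would do.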
The PMF is one way of expressing the distribution of a discrete r.v. This is because once we know the PMF of $X$, we can calculate the probability that $X$ will fall into a given subset of the real numbers by summing over the appropriate values of $x$. Given a discrete r.v. $X$ and a set $B$ of real numbers, if we know the PMF of $X$ we can find $P(X \in B)$, the probability that $X$ is in $B$, by summing up the heights of the vertical bars at points in $B$ in the plot of the PMF of $X$:

$$P(X \in B) = \sum_{x \in B} P(X = x).$$

Knowing the PMF of a discrete r.v. determines its distribution.
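This "sum the PMF over B" recipe is a one-liner in code. A sketch, reusing the PMF of X (number of Heads in two fair tosses) from the example; the helper name `prob_in` is hypothetical:

```python
# Sketch: P(X in B) obtained by summing the PMF over the values in B.
pmf_X = {0: 0.25, 1: 0.5, 2: 0.25}  # PMF of X, the number of Heads

def prob_in(pmf, B):
    """P(X in B): sum pmf(x) over x in B; values outside the support add 0."""
    return sum(p for x, p in pmf.items() if x in B)

p_at_least_one = prob_in(pmf_X, {1, 2, 3})  # 3 is outside the support
```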
Example Poisson Distribution
An r.v. $X$ has the Poisson distribution with parameter $\lambda$, where $\lambda > 0$, if the PMF of $X$ is

$$P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}, \quad k = 0, 1, 2, \ldots$$
We write this as $X \sim \text{Pois}(\lambda)$. The Poisson is one of the most widely used distributions in all of statistics, and is a very common choice of model (or building block for more complicated models) for data counting the number of occurrences of some kind. The Poisson is discussed in much more detail in Unit 5. The Poisson also arises through the Poisson process, a model that is used in a wide variety of problems in which events occur at random points in time. Poisson processes are introduced in Unit 4.
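The Poisson PMF is easy to evaluate directly. A sketch (with an illustrative choice of $\lambda = 3$), which also numerically checks the "sums to 1" criterion: the support is infinite, but the tail beyond a large $k$ is negligible, so a long partial sum lands very close to 1.

```python
# Sketch: the Pois(lam) PMF  e^{-lam} * lam^k / k!  evaluated directly.
import math

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 3.0  # illustrative parameter choice
partial_sum = sum(poisson_pmf(k, lam) for k in range(100))
# partial_sum is extremely close to 1: the tail beyond k = 99 is negligible.
```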