Random Variables and Probability Distributions

A random variable represents the result of some random process, like flipping a coin, rolling dice, or spinning a bottle. A random variable can be discrete, taking on only certain values (like heads or tails, or 1 through 6 on a die), or continuous, taking on any real value (like the direction of the bottle, or 2.3, 58.19, or -3.18575683). Every random variable comes from a probability distribution, which tells you which values it can take and how likely each value is, and you can learn about a probability distribution by sampling from it.

For example, let’s say you want to know whether an equal number of M&Ms is made in each color. The random variable is the color of a random M&M you grab from the bag, and the probability distribution is the proportion of each color. So, you take samples by opening a bag and sampling (eating) the M&Ms one by one, noting each one’s color. In the end, you have a count of each color, called an estimate, and the proportion of each color in your estimate tells you about the true distribution of colors.

A random variable can be the result of more than one process. For example, a single fair 6-sided die has an equal chance of landing on each number. If you roll two 6-sided dice and add them together, you get a different distribution. Both distributions are shown below:

Top. The probability distribution of rolling one fair 6-sided die. All results are equally probable because it is a fair die.
Bottom. The probability distribution of rolling two 6-sided dice and adding them together. Certain outcomes are more probable because there are more chances to get that sum (e.g. there’s only one way to get a 2, by rolling two 1s, but there are six ways to get a 7).
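You can see the second distribution emerge by simulation. The sketch below (in Python, using only the standard library) rolls two fair dice many times and tallies the sums; 7 should come out as the most common result:

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the run is reproducible

# Simulate 100,000 rolls of two fair six-sided dice and tally the sums.
rolls = 100_000
counts = Counter(random.randint(1, 6) + random.randint(1, 6) for _ in range(rolls))

# Estimated probability of each sum; 7 is the most common (6/36 ≈ 0.167).
for total in range(2, 13):
    print(f"{total:2d}: {counts[total] / rolls:.3f}")
```

With enough rolls, the estimated proportions settle close to the true probabilities (1/36 for a sum of 2, 6/36 for a sum of 7, and so on).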

Expected Value and Variance

You might be familiar with the average, or mean, of a set of numbers, which is often written as μ. The same idea exists for a random variable, where it is known as the expected value and written as E[X] or E(X), where X is our random variable. This is also called the first moment. Formally, the expected value is the weighted average of the possible outcomes. For example, if X is the result of rolling one die, then every possible result x_i (where i = 1, 2, 3, 4, 5, or 6) has probability p_i = 1/6 and the expected value is:

\begin{aligned}
E[X] &= p_1x_1+p_2x_2+p_3x_3+p_4x_4+p_5x_5+p_6x_6 \\
&= \frac{1}{6} \cdot 1+\frac{1}{6}\cdot 2+\frac{1}{6}\cdot 3+\frac{1}{6}\cdot 4+\frac{1}{6}\cdot 5+\frac{1}{6} \cdot 6 \\
& = 3.5
\end{aligned}
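The same weighted sum is easy to verify in code. A minimal sketch in Python, using exact fractions so no rounding sneaks in:

```python
from fractions import Fraction

# Each face 1..6 has probability 1/6; the expected value is the weighted sum.
p = Fraction(1, 6)
expected = sum(p * x for x in range(1, 7))
print(expected)  # 7/2, i.e. 3.5
```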

The variance (and its square root, the standard deviation) can be thought of as how “spread out” the variable is. It is sometimes called the second moment (more precisely, the second central moment) and is written as σ², where σ is the standard deviation.

The classic example of low vs. high variance. In the low variance case, the shots are all tightly packed together. In the high variance case, they’re spread out all over the target board.

Low variance means the samples are tightly packed together. This is like if you had a cookie-making machine. Each one might not be exactly the same, but the differences between each cookie are very small. High variance is like if you tried to make cookies by hand. They’re all going to be recognizably cookies (I hope!), but some will be a lot bigger than others or have weird shapes or uneven chocolate chips or sprinkles.

Variance is formally defined as the expected value of the squared difference between X and its mean (which is itself the expected value of X):

\begin{aligned}
Var(X) &= E[(X - \mu)^2]\\
&= E[(X - E[X])^2]\\
&=E[X^2 - 2XE[X]+E[X]^2]\\
&=E[X^2] - 2E[X]E[X] + E[X]^2\\
&=E[X^2] - E[X]^2
\end{aligned}

This is how to calculate the variance for rolling a single die:

\begin{aligned}
Var(X) &=\sum_{i=1}^{n}p_i(x_i-\mu)^2 = \sum_{i=1}^{n}p_i\cdot x_i^2 - \mu^2\\
& = \frac{1}{6}\cdot 1^2 +\frac{1}{6}\cdot 2^2 +\frac{1}{6}\cdot 3^2 +\frac{1}{6}\cdot 4^2 +\frac{1}{6}\cdot 5^2 +\frac{1}{6}\cdot 6^2  - 3.5^2\\
&=\frac{1}{6} \cdot (1+4+9+16+25+36) - 12.25 \\
&\approx 15.1667 - 12.25\\
&= 2.9167
\end{aligned}
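As with the expected value, this calculation is short to reproduce. A sketch in Python using exact fractions, following the E[X²] − E[X]² form derived above:

```python
from fractions import Fraction

# Variance of one fair die via Var(X) = E[X^2] - E[X]^2.
p = Fraction(1, 6)
faces = range(1, 7)
mu = sum(p * x for x in faces)               # E[X] = 7/2
var = sum(p * x**2 for x in faces) - mu**2   # E[X^2] - E[X]^2
print(float(var))  # ≈ 2.9167 (exactly 35/12)
```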

Probability Distributions

There are many different types of probability distributions for random variables. They are described using a function that tells you the probability of each value over the domain of possible values. A probability mass function (pmf) is used for discrete variables and a probability density function (pdf) is used for continuous variables. Common probability distributions include the uniform, binomial, and normal (or Gaussian), but many others exist.

Uniform Distribution

There are two kinds of uniform distribution: the discrete and the continuous. The discrete uniform distribution is like rolling a fair die: there are 6 possible outcomes and each is equally likely. The probability distribution is described by P(x) = 1/n, where n is the number of possible outcomes.

The continuous version is very similar, except now the random variable can take on any value between the minimum and maximum. The probability density is P(x) = 1/(b-a) for any x between a and b, where a is the minimum value (often 0) and b is the maximum value. Because the variable is continuous, this is a density rather than a probability: the probability of landing in a sub-interval is the density times that interval’s width.
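A quick sanity check on the continuous case, sketched in Python: samples drawn uniformly from [a, b] always stay inside the interval, the density 1/(b−a) is constant, and the sample mean settles near the midpoint (a+b)/2.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible
a, b = 2.0, 5.0
density = 1 / (b - a)  # constant pdf value everywhere on [a, b]

# Draw many samples and check they behave as the density predicts.
samples = [random.uniform(a, b) for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(round(density, 3), round(mean, 2))  # density ≈ 0.333, mean ≈ 3.5
```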

Binomial Distribution

The binomial distribution describes the number of successes in n independent trials, each of which has one of two outcomes, like counting heads across repeated coin flips. The probability mass function is defined as:

P(X=k) = {n \choose k} p^k (1-p)^{n-k}\\
\text{where}   
\begin{cases}
p & \text{probability of a success}\\
n & \text{number of samples} \\
k & \text{number of successes} 
\end{cases} \\
\text{and } {n \choose k} = \frac{n!}{k!(n-k)!}

As an example, the probability of getting 3 out of 10 heads with a fair coin with P(H)=0.5 is:

\begin{aligned}
P(\text{3 out of 10 Heads}) &= {10 \choose 3} \cdot 0.5^3 \cdot (1-0.5)^{10-3}\\
&= \frac{10!}{3!(10-3)!}\cdot 0.5^3\cdot 0.5^7\\
&= 0.1172 \text{ or 11.72\%}
\end{aligned}
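This pmf is straightforward to compute directly. A sketch in Python, using the standard library’s `math.comb` for the binomial coefficient (the function name `binomial_pmf` is just an illustrative choice):

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of exactly 3 heads in 10 flips of a fair coin.
prob = binomial_pmf(3, 10, 0.5)
print(round(prob, 4))  # 0.1172
```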

Normal (Gaussian) Distribution

The normal distribution is one of the most important probability distributions in statistics. It is defined as:

P(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2} \\
\text{where}\begin{cases}
\mu &\text{mean}\\
\sigma&\text{standard deviation}
\end{cases}
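The density formula translates directly into a few lines of Python (the function name `normal_pdf` is an illustrative choice, not a standard API):

```python
from math import exp, pi, sqrt

def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of the normal distribution with mean mu and std dev sigma at x."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

# Peak of the standard normal is at x = mu, with height 1/sqrt(2*pi).
print(round(normal_pdf(0.0), 4))  # 0.3989
```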

Many statistical models use the normal distribution because of several important properties. One is that any linear combination of normal distributions is itself a normal distribution, making certain situations more convenient to solve. Another is the Central Limit Theorem, which states that, under certain conditions, the average of many samples of a random variable is itself a random variable that converges to a normal distribution as the number of samples grows. In other words, even if the process you’re looking at isn’t distributed normally, you can still sample it many times and estimate the mean using a normal distribution. I will cover this in more detail in another post.
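You can glimpse the Central Limit Theorem with the die from earlier. A sketch in Python: averaging 50 rolls at a time produces sample means that cluster tightly around the die’s mean of 3.5, with a spread close to the theoretical sqrt(Var(X)/n):

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is reproducible

def sample_mean(n: int = 50) -> float:
    """Average of n fair die rolls - one sample of the 'mean' random variable."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

# Collect many sample means; their distribution is approximately normal
# even though a single roll is uniform.
means = [sample_mean() for _ in range(10_000)]
print(round(statistics.mean(means), 2))   # ≈ 3.5
print(round(statistics.stdev(means), 3))  # ≈ sqrt((35/12) / 50) ≈ 0.242
```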
