A hypothesis test is something you do when you want to find out about the world. You make a hypothesis, test it against another hypothesis, and see which one is less wrong. Sometimes these are called A/B tests (usually in business or marketing), but the idea is the same: you have two options, and you see how they differ.
Notice that I didn’t say you find out which one is right. This is because all models are wrong. Models are simplified representations of reality that we use to make it easier to understand. Because they’re less complex than reality, they can’t capture every aspect of it, so in a sense every model is “wrong”, even if it’s useful.
Experiments
There are three main parts to an experiment:
- Make a hypothesis
- Collect data
- Analyze the results
Traditionally, you compare your hypothesis, H1, against the null hypothesis, H0. You can think of the null hypothesis as the base state of the world: what things look like if your hypothesis is wrong. Then, you collect data and see how likely that data is, assuming the null hypothesis is true. If you get a really unlikely result (you decide how unlikely before running the experiment), then you can reject the null hypothesis. The probability you consider too unlikely is called alpha, or α. In the social sciences it’s traditionally set to 0.05, although you can pick lower or higher values depending on your experiment. Alpha is the rate of making a Type I error (a false positive): the chance of rejecting the null hypothesis even though it is actually true.
This doesn’t necessarily mean your alternative hypothesis is true, nor does it even mean that the null hypothesis is false. If you get a result that should only happen 1% of the time, there’s still a 1% chance that you just got “lucky”. This is why repeating experiments is extremely important.
Let’s do a simple example with coin flips to see how we can make a decision on whether or not a coin is fair.
Coin Flip Example
Let’s say you have a coin and want to decide if you think it is fair or not. A fair coin has a 50% chance of being heads or tails. The null hypothesis in this situation would be “this coin is fair”. The alternative hypothesis is “this coin is not fair”.
You now decide to flip the coin 10 times and get THTTHHTTTT, or 3 heads and 7 tails. What’s the likelihood of getting this or a more extreme result, assuming P(H) = 0.5? Why do we also count the “more extreme” outcomes? Take a look below, which is a graph of how often you can expect to get each result for a fair coin. This is called a probability mass function because it shows the probability of each outcome. The reason it isn’t smooth is that coin flips are discrete: you can’t get half a heads. For continuous variables, the equivalent is called a probability density function. These functions are also called probability distributions because they show how the probability is distributed among all the different possible outcomes.
[Figure: probability mass function for the number of heads in 10 flips of a fair coin]
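You can compute this probability mass function directly with the binomial formula. Here’s a minimal sketch in Python using only the standard library (the function name `pmf` is mine):

```python
from math import comb

def pmf(k, n=10):
    # P(exactly k heads in n flips of a fair coin) = C(n, k) / 2^n
    return comb(n, k) / 2**n

for k in range(11):
    print(f"{k:2d} heads: {pmf(k):.4f}")
```

The probabilities sum to 1, and the peak is at k = 5, exactly as the graph shows.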
As you’d expect, the most likely outcome is to get exactly 5 heads. In our case, we only got 3 heads, so we want to know how unlikely it is to get this kind of result. We draw a line between 3 and 4 and ask “what’s the probability of the left side vs. the right side?”
This is where we do some counting. We want the probability of getting 0, 1, 2, or 3 heads out of 10 coin flips. Since we don’t care about order, we just have to count how many sequences of flips there are total, then count how many of those have 3 or fewer heads.
If we did 3 flips, there are 2*2*2 = 8 possible outcomes: HHH, HHT, HTH, THH, HTT, THT, TTH, and TTT. Then, we just count up how many results have 0, 1, 2, or 3 heads and we can get the full probability mass function. For example, the probability of getting exactly 1 heads is 3 out of 8, or 37.5%.
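For a small case like 3 flips, we can just enumerate every outcome by brute force and count. A quick sketch:

```python
from itertools import product

# Enumerate all 2^3 = 8 possible outcomes of three coin flips
outcomes = list(product("HT", repeat=3))

# Count how many outcomes contain exactly one heads
one_head = [o for o in outcomes if o.count("H") == 1]
print(one_head)                        # HTT, THT, TTH
print(len(one_head) / len(outcomes))   # 3/8 = 0.375
```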
What about for 10 flips? Well, there’s no way you’d want to count it by hand, but lucky for us there’s a formula for this. It is called the binomial coefficient, sometimes seen as “n choose k”, or “n choose x”:
\text{Number of possibilities} = 2^{10} = 1024 \\ \text{Number of ways to get 0, 1, 2, or 3 heads out of 10 flips:} \\ \begin{aligned} {n \choose k} &= \frac{n!}{k!\,(n-k)!} \\ {10 \choose 3} &= \frac{10!}{3!\,(10-3)!} = \frac{10 \cdot 9 \cdot 8 \cdot 7!}{3 \cdot 2 \cdot 1 \cdot 7!} = \frac{10 \cdot 9 \cdot 8}{3 \cdot 2 \cdot 1} = 120 \\ {10 \choose 2} &= 45 \\ {10 \choose 1} &= 10 \\ {10 \choose 0} &= 1 \end{aligned} \\ \text{total outcomes} = 120 + 45 + 10 + 1 = 176
We will use α = 0.05 for our hypothesis test. So, what is the likelihood of getting 3 or fewer heads out of 10? There are 1024 possible sequences of flips and 176 of them have 3 or fewer heads, which is 176/1024 = 0.171875. In this case, the likelihood of getting this result is greater than our cutoff of α (0.171875 > 0.05), so we cannot reject the null hypothesis. Note that this doesn’t prove the coin is fair; it just means we have no evidence that it isn’t. Still, this is good news, because the coin I used to generate the initial sequence really was fair.
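The same tail probability can be computed in a couple of lines with `math.comb`, which is the “n choose k” formula from above:

```python
from math import comb

# Number of sequences with 3 or fewer heads: C(10,0) + C(10,1) + C(10,2) + C(10,3)
favorable = sum(comb(10, k) for k in range(4))
p_value = favorable / 2**10
print(favorable, p_value)  # 176 0.171875

alpha = 0.05
print("reject H0" if p_value < alpha else "fail to reject H0")
```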
Power
What if I secretly gave you another, unfair coin with P(H) = 0.2 and you repeated the experiment from before? You flip the coin 10 times and this time get 2 heads. You repeat the process above and find that the probability of getting 2 or fewer heads is (45 + 10 + 1)/1024 ≈ 0.055. This is greater than α (0.055 > 0.05), so you cannot reject the null hypothesis.
But wait, the coin wasn’t actually fair, why didn’t the experiment detect that?
This idea is called power. It is the ability of an experiment to correctly reject the null hypothesis when the alternative is true. Power is equal to 1 − β, where β is the chance of making a Type II error (a false negative). Our experiment of flipping the coin 10 times did not have enough power to detect the unfair coin.
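We can estimate this power by simulation: flip the unfair coin 10 times over and over, and count how often the one-tailed rule from the example (reject when the tail probability under the fair coin is below α) actually fires. A Monte Carlo sketch (function names are mine):

```python
import random
from math import comb

def p_value(heads, n=10):
    # One-sided tail probability under a fair coin: P(X <= heads)
    return sum(comb(n, k) for k in range(heads + 1)) / 2**n

def estimate_power(p_true=0.2, n=10, alpha=0.05, trials=100_000, seed=0):
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        heads = sum(rng.random() < p_true for _ in range(n))
        if p_value(heads, n) < alpha:
            rejections += 1
    return rejections / trials

print(estimate_power())  # roughly 0.38: we miss the unfair coin most of the time
```

With only 10 flips, we correctly flag the P(H) = 0.2 coin only about 38% of the time.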
Below is a picture of the distributions of the fair coin and the unfair coin. You can see they have a lot of overlap, meaning it would be hard to separate the two of them only using 10 flips.
[Figure: overlapping distributions of heads in 10 flips for the fair coin (P(H) = 0.5) and the unfair coin (P(H) = 0.2)]
How do we increase our power? One way to do it is to increase the sample size. In the coin flip example, this means flipping the coin more times before stopping the experiment. What if we now flipped the coin 100 times but got the same proportion of heads (20%)? We get the following distributions for a fair (blue) and unfair (orange) coin.
[Figure: distributions of heads in 100 flips for a fair (blue) and unfair (orange) coin]
If we got only 20 heads out of 100, there’s essentially no way it’s a fair coin; a result that extreme is just too rare. Our experiment now has enough power to correctly reject the null hypothesis.
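You can check just how rare with the same counting approach as before, scaled up to 100 flips:

```python
from math import comb

# P(20 or fewer heads in 100 fair flips), computed exactly
p_value = sum(comb(100, k) for k in range(21)) / 2**100
print(p_value)  # on the order of one in a billion
```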
As you might have guessed, the effect size and number of needed samples are linked. The smaller the effect size you’re trying to detect, the more samples you’re going to need.
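One way to see this link is to compute the power exactly for different sample sizes, still using the one-tailed rule from the coin example. A sketch (the helper name `power` is mine):

```python
from math import comb

def power(n, p_true=0.2, alpha=0.05):
    # Find the largest head-count c whose fair-coin tail probability stays under alpha
    cdf_fair = 0.0
    c = -1
    for k in range(n + 1):
        cdf_fair += comb(n, k) / 2**n
        if cdf_fair < alpha:
            c = k
        else:
            break
    # Power: probability of landing in the rejection region when P(H) = p_true
    return sum(comb(n, k) * p_true**k * (1 - p_true)**(n - k) for k in range(c + 1))

for n in (10, 20, 50, 100):
    print(n, round(power(n), 3))  # power at n=10 is about 0.38 and climbs toward 1
```

The more flips we collect, the more the two distributions separate, and the closer the power gets to 1.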
Putting It All Together
All of the concepts I just talked about can be summed up in one neat little graph.
[Figure: null (blue) and alternative (orange) distributions with the α threshold drawn as a vertical line]
Here, our null hypothesis is the blue distribution and the alternative is the orange distribution. When we do our experiment, assuming the null hypothesis is true, then all our results will come from the blue area. If the alternative hypothesis is true, our results will come from the orange area. Of course while we’re doing the experiment we don’t know which area we are sampling from.
The black vertical line represents setting α (the false positive rate). The horizontally striped area is equal to α, so if α = 0.05, we have to draw the vertical line so that the striped area is 5% of blue’s total area. If the null hypothesis is true, then that α area is the chance we incorrectly reject the null hypothesis (e.g., we say the coin is not fair when it actually is).
The area with vertical orange bars is β, where we fail to reject the null hypothesis. The power is 1 − β, the rest of the area under the orange curve. This is because if the alternative hypothesis is true, we have a probability of β of getting a result that looks like it came from the null distribution, and a 1 − β chance of getting a result that crosses the threshold and lets us correctly reject the null hypothesis.
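For a concrete (hypothetical) version of this picture, take two normal distributions, put the cutoff where the null’s upper tail equals α, and measure how much of the alternative falls on each side. The means, the spread, and the variable names below are all made up for illustration; only the standard library is used:

```python
from math import erf, sqrt

def norm_cdf(x, mu=0.0, sigma=1.0):
    # Normal CDF via the error function
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# Hypothetical setup: null centered at 0, alternative at 3, both sigma = 1
mu0, mu1, sigma, alpha = 0.0, 3.0, 1.0, 0.05

# Cutoff where the null's upper tail equals alpha: the 95th percentile
# of a normal, i.e. mu0 + 1.6449 * sigma
critical = mu0 + 1.6448536269514722 * sigma

beta = norm_cdf(critical, mu1, sigma)  # alternative mass left of the line (≈ 0.09)
power = 1 - beta
print(round(beta, 3), round(power, 3))
```

Slide `mu1` closer to `mu0` (a smaller effect size) and β grows, which is exactly the overlap problem the 10-flip experiment ran into.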