In one sense, our hypothesis test is complete. We've constructed a test statistic, figured out its sampling distribution if the null hypothesis is true, and then constructed the critical region for the test. Nevertheless, I've actually omitted the most important number of all, the p-value. It is to this topic that we now turn. There are two somewhat different ways of interpreting a p-value, one proposed by Sir Ronald Fisher and the other by Jerzy Neyman. Both versions are legitimate, though they reflect very different ways of thinking about hypothesis tests. Most introductory textbooks tend to give Fisher's version only, but I think that's a bit of a shame. To my mind, Neyman's version is cleaner and actually better reflects the logic of the null hypothesis test. You might disagree though, so I've included both. I'll start with Neyman's version.
A softer view of decision making
One problem with the hypothesis testing procedure that I've described is that it makes no distinction at all between a result that is "barely significant" and one that is "highly significant". For instance, in my ESP study the data I obtained only just fell inside the critical region, so I did get a significant effect, but it was a pretty near thing. In contrast, suppose that I'd run a study in which X = 97 out of my N = 100 participants got the answer right. This would obviously be significant too, but by a much larger margin, such that there's really no ambiguity about this at all. The procedure that I have already described makes no distinction between the two. If I adopt the standard convention of allowing α = 0.05 as my acceptable Type I error rate, then both of these are significant results.
This is where the p-value comes in handy. To understand how it works, let's suppose that we ran lots of hypothesis tests on the same data set, but with a different value of α in each case. When we do that for my original ESP data, what we'd get is something like this:
Value of α       | 0.05 | 0.04 | 0.03 | 0.02 | 0.01
Reject the null? | Yes  | Yes  | Yes  | No   | No
When we test the ESP data (X = 62 successes out of N = 100 observations), using α levels of 0.03 and above we'd always find ourselves rejecting the null hypothesis. For α levels of 0.02 and below we always end up retaining the null hypothesis. Therefore, somewhere between 0.02 and 0.03 there must be a smallest value of α that would allow us to reject the null hypothesis for this data. This is the p-value. As it turns out, the ESP data has p = 0.021. In short,
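To make this concrete, here is a short sketch of how the calculation might be run in Python. The use of scipy is my assumption; the text itself doesn't tie the calculation to any particular software.

```python
# Hypothetical check of the ESP numbers above, using scipy (an
# assumption -- the text doesn't prescribe any particular software).
from scipy.stats import binom

N, X, p_null = 100, 62, 0.5   # trials, observed successes, chance rate

# Two-sided p-value: the upper tail P(X >= 62), doubled, because
# Binomial(100, 0.5) is symmetric around 50.
p_value = 2 * binom.sf(X - 1, N, p_null)   # sf(k) = P(X > k)
print(round(p_value, 3))                   # about 0.021

# Re-running the "many tests" idea from the table above:
for alpha in (0.05, 0.04, 0.03, 0.02, 0.01):
    print(alpha, "reject" if p_value <= alpha else "retain")
```

As the loop shows, the same p-value settles every one of the tests in the table at once: reject whenever p ≤ α, retain otherwise.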
p is defined to be the smallest Type I error rate (α) that you have to be willing to tolerate if you want to reject the null hypothesis.
If it turns out that p describes an error rate that you find intolerable, then you must retain the null. If you're comfortable with an error rate equal to p, then it's okay to reject the null hypothesis in favour of your preferred alternative.
In effect, p is a summary of all the possible hypothesis tests that you could have run, taken across all possible α values. And as a consequence it has the effect of "softening" our decision process. For those tests in which p ≤ α you would have rejected the null hypothesis, whereas for those tests in which p > α you would have retained the null. In my ESP study I obtained X = 62, and as a consequence I've ended up with p = 0.021. So the error rate I have to tolerate is 2.1%. In contrast, suppose my experiment had yielded X = 97. What happens to my p-value now? This time it's shrunk to p = 1.36 × 10⁻²⁵, which is a tiny, tiny[1] Type I error rate. For this second case I would be able to reject the null hypothesis with a lot more confidence, because I only have to be "willing" to tolerate a Type I error rate of about 1 in 10 trillion trillion in order to justify my decision to reject.
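The same sketch can be applied to the hypothetical X = 97 result (again assuming scipy). One caveat: the exact figure depends on how the two tails are combined, so the take-away is the order of magnitude, around 10⁻²⁵, rather than the precise digits.

```python
from scipy.stats import binom

# Same two-sided tail calculation as before, now with 97 successes.
# The exact value depends on the tail convention; what matters is
# that it is on the order of 1e-25 -- an absurdly small error rate.
p_value_97 = 2 * binom.sf(97 - 1, 100, 0.5)
print(p_value_97)
```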
The probability of extreme data
The second definition of the p-value comes from Sir Ronald Fisher, and it's actually this one that you tend to see in most introductory statistics textbooks. Notice how, when I constructed the critical region, it corresponded to the tails (i.e., extreme values) of the sampling distribution? That's not a coincidence: almost all "good" tests have this characteristic (good in the sense of minimising our Type II error rate, β). The reason for that is that a good critical region almost always corresponds to those values of the test statistic that are least likely to be observed if the null hypothesis is true. If this rule is true, then we can define the p-value as the probability that we would have observed a test statistic that is at least as extreme as the one we actually did get. In other words, if the data are extremely implausible according to the null hypothesis, then the null hypothesis is probably wrong.
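Fisher's reading can be made concrete with the same kind of sketch (scipy again being my assumption): enumerate every outcome at least as far from the null expectation of 50 as the observed X = 62, and add up their probabilities under the null.

```python
from scipy.stats import binom

N, p_null, X_obs = 100, 0.5, 62
expected = N * p_null             # 50 successes expected under the null
distance = abs(X_obs - expected)  # observed outcome is 12 away from 50

# "At least as extreme": every outcome at least 12 away from 50,
# in either direction (x <= 38 or x >= 62).
extreme = [x for x in range(N + 1) if abs(x - expected) >= distance]
p_value = sum(binom.pmf(x, N, p_null) for x in extreme)
print(round(p_value, 3))   # about 0.021, matching the Neyman calculation
```

That the two definitions land on the same number for this symmetric sampling distribution is the point: they are different interpretations of one quantity, not two different quantities.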
A common mistake
Okay, so you can see that there are two rather different but legitimate ways to interpret the p-value, one based on Neyman's approach to hypothesis testing and the other based on Fisher's. Unfortunately, there is a third explanation that people sometimes give, especially when they're first learning statistics, and it is absolutely and completely wrong. This mistaken approach is to refer to the p-value as "the probability that the null hypothesis is true". It's an intuitively appealing way to think, but it's wrong in two key respects. First, null hypothesis testing is a frequentist tool, and the frequentist approach to probability does not allow you to assign probabilities to the null hypothesis. According to this view of probability, the null hypothesis is either true or it is not; it cannot have a "5% chance" of being true. Second, even within the Bayesian approach, which does let you assign probabilities to hypotheses, the p-value would not correspond to the probability that the null is true. This interpretation is entirely inconsistent with the mathematics of how the p-value is calculated. Put bluntly, despite its intuitive appeal, there is no justification for interpreting a p-value as the probability that the null hypothesis is true. Never do it.
[1] That's p = 0.000000000000000000000000136 for folks who don't like scientific notation!