Reproducibility of statistical test results
This is a short, simple exercise to assess the reproducibility of decisions based on statistical testing.
See also, e.g.:
Consider a null hypothesis H0 with a set of alternative hypotheses containing H1 and H2. Set up the statistical hypothesis test procedure at a significance level of 0.05 so that it has a power of 0.8 if H1 is true. Further assume that the power under H2 is 0.5. To assess the reproducibility of the test result, consider the experiment of executing the test procedure twice, independently.
Starting with the situation in which H0 is true, the probabilities for the outcomes of the joint experiment are displayed in Table 1. The probability of not being able to reproduce the decision, i.e. of the two runs disagreeing, is 2 × 0.05 × 0.95 = 0.095.
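The two-run experiment can also be simulated directly. The following is a minimal sketch, assuming that the two runs are independent and that a single run rejects H0 with a fixed probability (0.05 under H0, 0.8 under H1, 0.5 under H2). The function name `simulate_disagreement`, the number of simulated pairs and the seed are illustrative choices, not part of the original exercise.

```python
import random


def simulate_disagreement(p_reject, n_pairs=200_000, seed=1):
    """Estimate how often two independent runs of a test disagree.

    p_reject is the assumed probability that a single run rejects H0
    under the true state of nature.
    """
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(n_pairs):
        first_rejects = rng.random() < p_reject
        second_rejects = rng.random() < p_reject
        if first_rejects != second_rejects:
            disagreements += 1
    return disagreements / n_pairs


# Assumed single-run rejection probabilities for the three states of nature.
for label, p in [("H0", 0.05), ("H1", 0.8), ("H2", 0.5)]:
    print(f"{label} true: estimated disagreement = {simulate_disagreement(p):.3f}")
```

The estimates can be compared with the exact values reported with Tables 1 to 3; they should agree up to simulation noise.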
Table 1. Frequencies, if H0 is true
Frequency of decision | Reject H0 | Retain H0
Reject H0             | 0.0025    | 0.0475
Retain H0             | 0.0475    | 0.9025

The frequencies change as the true state of nature changes. Assuming H1 is true, H0 is rejected, as designed, with a probability (power) of 0.8. The resulting frequencies for the different outcomes of the joint experiment are displayed in Table 2. The probability of not being able to reproduce the decision is 2 × 0.8 × 0.2 = 0.32.
Table 2. Frequencies, if H1 is true
Frequency of decision | Reject H0 | Retain H0
Reject H0             | 0.64      | 0.16
Retain H0             | 0.16      | 0.04

Assuming H2 is true, H0 will be rejected with a probability of 0.5. The resulting frequencies for the different outcomes of the joint experiment are displayed in Table 3. The probability of not being able to reproduce the decision is 2 × 0.5 × 0.5 = 0.5.
Table 3. Frequencies, if H2 is true
Frequency of decision | Reject H0 | Retain H0
Reject H0             | 0.25      | 0.25
Retain H0             | 0.25      | 0.25
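The three tables follow the same pattern: each cell is the product of the single-run probabilities, and the two off-diagonal cells add up to the probability of disagreement. A small sketch of that arithmetic is given here, assuming independent runs; the helper names `joint_table` and `disagreement` are introduced only for illustration.

```python
def joint_table(p_reject):
    """Return the 2x2 table of joint probabilities for two independent runs.

    Rows index one run, columns the other; index 0 = reject H0,
    index 1 = retain H0.
    """
    probs = (p_reject, 1.0 - p_reject)
    return [[a * b for b in probs] for a in probs]


def disagreement(p_reject):
    """Probability that the two runs reach different decisions."""
    table = joint_table(p_reject)
    return table[0][1] + table[1][0]


# Single-run rejection probabilities taken from the text.
for label, p in [("H0", 0.05), ("H1", 0.8), ("H2", 0.5)]:
    table = joint_table(p)
    print(f"{label} true: reject/reject = {table[0][0]:.4f}, "
          f"reject/retain = {table[0][1]:.4f}, "
          f"retain/retain = {table[1][1]:.4f}, "
          f"disagree = {disagreement(p):.3f}")
```

In closed form the disagreement probability is 2p(1 - p) for a single-run rejection probability p, which is largest (0.5) at p = 0.5.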

The test procedure was designed to control type I errors (rejecting the null hypothesis even though it is true) at a probability of 0.05 and to limit type II errors (not rejecting the null hypothesis even though it is wrong and H1 is true) to a probability of 0.2. In both cases, with either H0 or H1 assumed to be true, this leads to non-negligible frequencies, 0.095 and 0.32, respectively, of “nonreproducible”, “contradictory” decisions if the same experiment is performed twice. The situation gets worse, with a frequency of up to 0.5 for “nonreproducible”, “contradictory” decisions, if the true state of nature lies between the null and the alternative hypothesis used to design the experiment. The situation can also get better if type I errors are controlled more strictly, or if the true state of nature is far away from the null, such that the power to reject the null is close to 1.
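To illustrate how the disagreement rate moves with the true state of nature, the following sketch assumes a specific test that the original exercise does not name: a one-sided z-test with known variance. The standardized effect, the function `power_one_sided_z` and the grid of effect sizes are assumptions made purely for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_one_sided_z(effect, alpha=0.05):
    """Power of a one-sided z-test (an assumed test, for illustration).

    `effect` is the true mean shift divided by the standard error of the
    estimate; effect = 0 corresponds to H0 being true.
    """
    critical = STD_NORMAL.inv_cdf(1.0 - alpha)
    return 1.0 - STD_NORMAL.cdf(critical - effect)


def disagreement(p_reject):
    """Probability that two independent runs reach different decisions."""
    return 2.0 * p_reject * (1.0 - p_reject)


# Standardized effect at which the design reaches a power of 0.8 (H1).
design_effect = STD_NORMAL.inv_cdf(0.95) + STD_NORMAL.inv_cdf(0.8)

# True effects from the null (0) to twice the design alternative.
for fraction in (0.0, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0):
    power = power_one_sided_z(fraction * design_effect)
    print(f"effect = {fraction:4.2f} x design: power = {power:.3f}, "
          f"disagree = {disagreement(power):.3f}")

# A stricter significance level lowers the disagreement rate under H0.
for alpha in (0.05, 0.01, 0.001):
    print(f"alpha = {alpha}: disagree under H0 = {disagreement(alpha):.4f}")
```

Under these assumptions the disagreement rate peaks at 0.5 where the power equals 0.5 and falls towards 0 as the true effect moves far from the null. With the sample size held fixed, a stricter significance level also lowers the power under H1, so the improvement under H0 is not free.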