An interesting conversation is taking place in science about the replicability and reproducibility of results and the use and misuse of statistics. A well-written introduction to the subject, and to other problems of contemporary science, is available at fivethirtyeight.com: Science isn’t broken.

A recent scientific article tried to replicate the findings of psychological science articles and managed to replicate only 36% of the significant results, instead of the 95% we would expect. Jeff Leek had a more positive view and showed that 77% of the replicated effect sizes were in the 95% confidence interval of the original study (EDIT: actually, the prediction interval, which also takes into account the uncertainty in the replication sample).
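As a quick sketch of why the prediction interval matters (the numbers below are hypothetical, not from the actual study): a 95% confidence interval uses only the original study's standard error, while a 95% prediction interval for the replication estimate combines the standard errors of both studies, so it is wider and captures more replications.

```r
# Sketch: 95% prediction interval for a replication estimate.
# orig_est, orig_se and rep_se are illustrative made-up numbers.
orig_est <- 0.40   # hypothetical original effect size
orig_se  <- 0.10   # hypothetical original standard error
rep_se   <- 0.12   # hypothetical standard error of the replication
margin   <- 1.96 * sqrt(orig_se^2 + rep_se^2)  # wider than 1.96 * orig_se
c(orig_est - margin, orig_est + margin)
```

The margin combines both uncertainties, so the prediction interval here is wider than the plain confidence interval (0.40 ± 1.96 × 0.10).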

If you want a reminder of what a p-value is, you can look at the introduction of my earlier post.
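The one-line version, illustrated with a small simulation of my own: when the null hypothesis is true, p-values are uniformly distributed on [0, 1], so about 5% of tests come out "significant" at the 0.05 level.

```r
# Under a true null, p-values are uniform on [0, 1]:
# roughly 5% of tests fall below 0.05 by chance alone.
set.seed(1)
pvals <- replicate(10000, t.test(rnorm(30), mu = 0)$p.value)
mean(pvals < 0.05)  # close to 0.05
```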

In that Jeff Leek post, I also discovered a very interesting article: The garden of forking paths. The basic idea is that a single scientific hypothesis can translate into many different statistical hypotheses. The researcher performs only one test, but the choice of test depends on the data collected: he first looks at the data and tunes his hypothesis to it, not necessarily in a dishonest way. The problem is that the p-value the test produces no longer offers the control over false positives that it should. Had the data been different, another test would have been run, and it might have obtained a significant result. This is a very valid criticism, and it accurately reflects how the scientific process works: we collect data with some idea of what we are looking for, then look at the data to try to translate that idea into a statistical framework. What Gelman suggests is that we do this as a first step and then try to replicate our precise statistical hypothesis on a second round of data collection.
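A minimal simulation of the forking-paths effect (my own illustration, not Gelman's code): imagine two outcomes are measured, both with no true effect, and the researcher tests only the one that looks more promising. The reported p-value then no longer controls false positives at 5%.

```r
# Forking paths, minimal version: measure two null outcomes,
# test only the one with the larger observed mean.
set.seed(1)
falsepos <- replicate(10000, {
  a <- rnorm(30)  # outcome 1, true mean 0
  b <- rnorm(30)  # outcome 2, true mean 0
  chosen <- if (abs(mean(a)) > abs(mean(b))) a else b
  t.test(chosen, mu = 0)$p.value < 0.05
})
mean(falsepos)  # noticeably above the nominal 0.05
```

With only two forks the false positive rate roughly doubles; with more forks it grows further.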

This reflection on the way science is done and statistics are used led me to other thoughts on the subject. Now let us assume that we have a very specific hypothesis, but data collection is very expensive and slow. The scientific team wants to publish its results but would also like to have enough money left to present them at a conference in a luxurious hotel in Hawaii. So they collect the data one point at a time and, after each new observation, look at the p-value of the test. They start doing this after 30 observations, as they have some self-respect. When the test is significant at p < 0.05, they stop and publish in a high-impact-factor journal. If they reach 200 observations with no significant result, the team gives up and does not publish anything; the principal investigator is fired and ends up herding goats in Azerbaijan. Unfortunately for the team, the null hypothesis is true. Will the team go to Hawaii or look for another job?

The answer to that question is given by this R code:

# P-value as stopping criterion
n <- 200
signifstop <- logical(1000)
for (seed in 1:1000) {
  set.seed(seed)
  y <- rnorm(n)  # The data: standard normal, mean 0 and variance 1
  p <- numeric(n)
  for (j in 30:n) {
    # The alternative hypothesis is that the mean is different from 0
    p[j] <- t.test(y[1:j], mu = 0)$p.value
  }
  signifstop[seed] <- (min(p[30:n]) < 0.05)
}
sum(signifstop)

Over 1000 simulations, 251 produce a significant result at some point, instead of the 50 we would expect for a fixed sample size! The odds are not too bad for our team. I am not the first to think about this, of course. By typing the title of my post into Google, I found this interesting post on what can be done to avoid this problem.

Here are two examples of success: