When you take a statistics class, the data is perfect and you can apply all kinds of fancy algorithms and procedures to it to get to the truth. Sometimes you even have theoretical justifications for them. But the first time you encounter real data, you are shocked: there are holes in the data!
You have missing values, encoded as NA, in all real data. And you can’t just keep the observations that have no NAs: you would end up with nothing. A first step is to exclude variables and observations that have too many missing values. This process is called quality control, or QC. Once you give it this name, it seems difficult to argue for less quality control. But we could also call it Throwing Expensive Data Away. It is all a matter of perspective.
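To make the trade-off concrete, here is a minimal sketch of such a QC step in plain Python. The toy data, the function names and the 50% thresholds are all assumptions made up for illustration, with `None` playing the role of NA; real pipelines choose their own thresholds.

```python
# Hypothetical toy data: rows are observations, columns are variables,
# and None plays the role of NA.
data = [
    [1.2, None, 3.1, 4.2],
    [0.7, 2.2, None, None],
    [None, 1.9, 2.8, None],
    [1.1, 2.0, 3.0, None],
    [None, None, None, None],
    [0.9, 2.4, 2.7, None],
]

def missing_fraction(values):
    """Share of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def quality_control(rows, max_missing_col=0.5, max_missing_row=0.5):
    """Drop variables, then observations, whose missing share exceeds a threshold."""
    n_cols = len(rows[0])
    keep_cols = [j for j in range(n_cols)
                 if missing_fraction([r[j] for r in rows]) <= max_missing_col]
    reduced = [[r[j] for j in keep_cols] for r in rows]
    kept_rows = [r for r in reduced if missing_fraction(r) <= max_missing_row]
    return kept_rows, keep_cols

# Keeping only complete observations versus applying the thresholds.
complete_cases = [r for r in data if None not in r]
cleaned, kept_columns = quality_control(data)

print(len(complete_cases), "complete observations out of", len(data))
print(len(cleaned), "observations and", len(kept_columns), "variables kept after QC")
```

On this toy table, keeping only complete observations leaves nothing, while the threshold-based QC keeps most of the rows at the price of one variable. The rows that survive still contain some missing values.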
Even after you throw away the observations and variables with …
An interesting conversation is taking place in science about the replicability and reproducibility of results and the use and misuse of statistics. A very well-written introductory article on the subject, and on other problems of contemporary science, is available at fivethirtyeight.com: Science isn’t broken.
A recent scientific article tried to replicate the findings of psychological science articles and managed to replicate only 36% of the significant results, instead of the 95% that we expect. Jeff Leek had a more positive view and showed that 77% of the replicated effect sizes were within the 95% confidence interval of the original study (EDIT: actually, the prediction interval, which also takes into account the uncertainty in the replication sample).
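The difference between the two intervals can be sketched in a few lines of Python. This assumes both estimates are approximately normal and independent, so that their variances add; it is not necessarily the exact computation used in Jeff Leek's analysis, and the numbers are hypothetical.

```python
import math

def confidence_interval(estimate, se, z=1.96):
    """95% interval reflecting the uncertainty of the original estimate only."""
    return estimate - z * se, estimate + z * se

def prediction_interval(estimate, se_original, se_replication, z=1.96):
    """95% interval for a replication estimate: both standard errors contribute."""
    half_width = z * math.sqrt(se_original**2 + se_replication**2)
    return estimate - half_width, estimate + half_width

# Hypothetical numbers, chosen only for illustration.
original_effect, se_orig, se_rep = 0.40, 0.10, 0.10
replication_effect = 0.15

ci = confidence_interval(original_effect, se_orig)
pi = prediction_interval(original_effect, se_orig, se_rep)
in_ci = ci[0] <= replication_effect <= ci[1]
in_pi = pi[0] <= replication_effect <= pi[1]

print("confidence interval:", ci, "contains replication:", in_ci)
print("prediction interval:", pi, "contains replication:", in_pi)
```

With these made-up values, the replication estimate falls outside the confidence interval of the original study but inside the wider prediction interval, which is why the choice of interval changes how many replications look consistent with the originals.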
If you want a reminder of what a p-value is, you can look at the introduction of my earlier post.
In that Jeff Leek post, I also discovered a very interesting article: The garden of forking paths. The basic idea is that a scientific hypothesis can translate into many different statistical hypotheses. The researcher will perform only one test, but his choice of test will depend on the data he collected. He will first look at the data and tune his hypothesis to it, not necessarily in a dishonest way. The problem is that the p-value the test produces will not offer the control over false positives that it should: had the data been different, another test could have produced a significant result. This is a very valid criticism and it reflects well how the scientific process works. We collect some data with some idea of what we are looking for, and then look at the data to try to translate the idea into a statistical framework. What Gelman suggests is that we should do this in a first step and then try to replicate our precise statistical hypothesis in a second round of data collection.
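A small simulation can illustrate the loss of control over false positives. The sketch below is not taken from the article: it assumes a hypothetical study with two groups and four independent outcome measures, none of which truly differs between groups, and a simple z-test with known unit variance. The researcher reports a single test either way, but in the second scenario the outcome is picked after looking at the data.

```python
import math
import random

def z_test_p_value(group_a, group_b):
    """Two-sided z-test for a difference in means, assuming known unit variance."""
    n_a, n_b = len(group_a), len(group_b)
    diff = sum(group_a) / n_a - sum(group_b) / n_b
    z = diff / math.sqrt(1 / n_a + 1 / n_b)
    return math.erfc(abs(z) / math.sqrt(2))

def simulate(n_studies=2000, n_per_group=30, n_outcomes=4, alpha=0.05, seed=1):
    """Return the false positive rates of a fixed test and of a data-driven choice."""
    rng = random.Random(seed)
    fixed_hits = 0
    forking_hits = 0
    for _study in range(n_studies):
        # The null hypothesis is true: no outcome differs between the groups.
        p_values = []
        for _outcome in range(n_outcomes):
            a = [rng.gauss(0, 1) for _ in range(n_per_group)]
            b = [rng.gauss(0, 1) for _ in range(n_per_group)]
            p_values.append(z_test_p_value(a, b))
        # Test fixed in advance: always the first outcome, whatever the data show.
        fixed_hits += p_values[0] < alpha
        # Forking paths: report the one outcome that looks most promising,
        # which here is the outcome with the smallest p-value.
        forking_hits += min(p_values) < alpha
    return fixed_hits / n_studies, forking_hits / n_studies

fixed_rate, forking_rate = simulate()
print("false positive rate, test fixed in advance:", fixed_rate)
print("false positive rate, test chosen after looking:", forking_rate)
```

With four independent outcomes, the chance that at least one reaches p < 0.05 under the null is 1 - 0.95^4, about 19%, so the data-driven choice should produce false positives far more often than the nominal 5% even though only one test is ever reported.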
This reflection on the way science is done and statistics are used led me to other thoughts on the subject. Now let us assume that we have a very specific hypothesis, but the data collection is very expensive and slow. The scientific team wants to publish their results but would also like to have enough money left to present the results at this conference in a luxurious hotel in Hawaii. So they collect …