Unrealistic standards of beauty for data in statistics class

When you take a statistics class, the data is perfect and you can apply all kinds of fancy algorithms and procedures to it to get to the truth. Sometimes you even have theoretical justifications for them. But the first time you encounter real data, you are shocked: there are holes in it!

This is what actual data looks like. By Dieter Seeger [CC BY-SA 2.0 (http://creativecommons.org/licenses/by-sa/2.0)], via Wikimedia Commons

You have missing values, encoded as NA, all over the data. And you can’t just keep the observations that have no NAs: you would end up with nothing. A first step is to exclude variables and observations that have too many missing values. This process is called quality control, or QC. Once you give it this name, it seems difficult to defend doing less quality control. But we could also call it Throwing Expensive Data Away. It is all a matter of perspective.
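To make this concrete, here is a minimal sketch of such a two-step filter in Python (my actual analyses were in R, and the thresholds here, 20% missing per variable and 10% per observation, are made up for the example; every study picks its own):

```python
def qc_filter(rows, var_thresh=0.2, obs_thresh=0.1):
    """rows: list of dicts mapping variable name -> value (None = missing).
    First drop variables with too many missing values, then drop
    observations with too many missing values among the kept variables."""
    n = len(rows)
    variables = list(rows[0].keys())
    # Keep variables whose fraction of missing values is under the threshold.
    kept_vars = [v for v in variables
                 if sum(r[v] is None for r in rows) / n <= var_thresh]
    # Then keep observations that are mostly complete on the kept variables.
    kept = [r for r in rows
            if sum(r[v] is None for v in kept_vars) / len(kept_vars) <= obs_thresh]
    return kept_vars, kept
```

The order matters: filtering variables first can rescue observations whose only missing values sat in the bad variables.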

Even after you throw away the observations and variables with the most missing values, you often have to deal with residual missing values if the procedure you want to apply cannot handle them. There is some literature on missing-data subtleties, asking whether the data is missing at random or not, but I never went too deep in that direction. Of more practical interest are the missing-data imputation methods. A simple approach is to replace missing values by the median or the mode. This is what na.roughfix of the randomForest package for R does. The same package also has a function called rfImpute that takes correlations in the data into account to impute it.
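The median/mode trick fits in a few lines. Here is a Python sketch of that idea (the same spirit as na.roughfix, not its actual code, which is R):

```python
from statistics import median, mode

def rough_fix(values):
    """Replace None with the median (numeric column) or the mode
    (categorical column) of the observed values."""
    observed = [v for v in values if v is not None]
    if isinstance(observed[0], (int, float)):
        fill = median(observed)   # numeric: use the median
    else:
        fill = mode(observed)     # categorical: use the most frequent level
    return [fill if v is None else v for v in values]

print(rough_fix([1, None, 3, 5]))        # [1, 3, 3, 5]
print(rough_fix(["a", None, "a", "b"]))  # ['a', 'a', 'a', 'b']
```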

There are several R packages tailored to imputing missing data: mice, Amelia, missMDA, missForest. Of course you can’t just trust the results of those packages, and you should try to propagate the uncertainty into downstream analyses. Which is a pain, and which no one really does… In any case, you should check that your results are not too sensitive to the kind of imputation you use.
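A cheap version of that sensitivity check: run the same downstream analysis on the data imputed two different ways and see how much the answer moves. A toy Python sketch (the downstream "analysis" here is just a mean, standing in for whatever you actually compute):

```python
from statistics import mean, median

def impute(values, strategy):
    """Fill missing values (None) with a summary of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in values]

# Toy vector with two missing entries.
x = [2.0, None, 4.0, 10.0, None, 6.0]
for strategy in (mean, median):
    filled = impute(x, strategy)
    print(strategy.__name__, round(mean(filled), 2))
# mean 5.5
# median 5.33
```

If the two numbers disagreed wildly, you would know the missing values are doing real work in your conclusion.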

I discovered this problem two and a half years ago, and after that I felt confident that I had mastered the concept of QC. But reality is more perverse than that, and QC does not stop at missing values. This story starts when I gained access to the WTCCC 1 ❤ data. With it, you can download the exclusion lists of individuals and SNPs that were used in the original 2007 study. I thought: “Great! I don’t have to deal with QC, I will just use those lists.” Then I went to work and coded for the first time in my life. After a few weeks (ok, maybe months), I finally obtained a result for my method: an AUC of 0.92 in cross-validation for type 1 diabetes. This is better than the 0.89 that Wei et al. obtained in cross-validation on the same dataset. I then concluded that I had mastered genetic risk prediction and that I was a genius. I even posted a cryptic Facebook message about it (Facebook friends, this is the only retraction you are going to get).

But then I compared other methods on the data and obtained similar or even better results. I therefore suspected that something was off with my results. As we say in French: il y a une couille dans le potage (literally, there is a testicle in the soup; i.e., something is fishy). So I searched the internet for SNPs that came up in my results but were not reported to be linked with type 1 diabetes. This is how I found the work of Botta et al. that was the subject of my last post. In that work, they ran into the same problem as I did, unrealistically good performance, and then compared this with more stringent QC. Our common point is that we are not geneticists by training, so we did not know about the need for QC.

I would like to elaborate on a QC criterion that is specific to genetics and that discriminated the bad data well in my case: Hardy-Weinberg Equilibrium. For those of you who are mathematically inclined: yes, this Hardy is the famous Hardy, the one from Hardy-Littlewood, the one who brought Ramanujan to Cambridge, the one who said to Ramanujan: “The cab number was 1729 and I could find nothing interesting to say about it.” To which Ramanujan responded: “1729! The smallest number that can be written as the sum of two cubes in two distinct ways: $1729 = 1000 + 729 = 10^3 + 9^3 = 1728 + 1 = 12^3 + 1^3$.” Anyway, you should read The Man Who Knew Infinity, a very nice biography of Ramanujan.

Wilhelm Weinberg was a German physician who discovered the same principle at around the same time. Anyway, what is the Hardy-Weinberg equilibrium? Consider a gene with two alleles $A$ and $a$, under no selection, in a large randomly mating population. Then the frequencies of the alleles, say $p$ for $A$ and $q$ for $a$, will remain stable from generation to generation (the Hardy-Weinberg principle). Furthermore, the frequencies of homozygotes $AA$, heterozygotes $Aa$ and homozygotes $aa$ will be $p^2$, $2pq$ and $q^2$. This should remind you of binomial expansion (by the way, that is the best translation I could find for identités remarquables/productos notables/prodotto notevole, which is something we learn about in middle school. Funny that there seems to be no real equivalent in English), and this is no coincidence. Just look at this: $(p+q)^2 = p^2 + 2pq + q^2$, with $p + q = 1$.
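You can convince yourself of the genotype frequencies with a small simulation: under random mating, each individual simply draws two alleles independently, so a Python sketch like this should land close to $p^2$, $2pq$ and $q^2$ (the allele frequency 0.3 is arbitrary):

```python
import random

random.seed(42)  # reproducible sketch

p, n = 0.3, 100_000  # frequency of allele A, population size
counts = {"AA": 0, "Aa": 0, "aa": 0}
for _ in range(n):
    # Random mating: two independent allele draws per individual.
    a1 = random.random() < p
    a2 = random.random() < p
    if a1 and a2:
        counts["AA"] += 1
    elif a1 or a2:
        counts["Aa"] += 1
    else:
        counts["aa"] += 1

print({g: round(c / n, 3) for g, c in counts.items()})
# roughly {'AA': 0.09, 'Aa': 0.42, 'aa': 0.49}
```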

There is no reason for Hardy-Weinberg Equilibrium (HWE) to be respected by the cases, especially for SNPs that are linked with the disease. But we do expect HWE to be respected by all SNPs for the controls. Selection at a single SNP is never strong enough to impact HWE, and as long as you do not have population structure, random mating is a reasonable assumption. Agreement with HWE can be tested using a $\chi^2$ test. After removing the SNPs that did not respect HWE, Botta et al. obtained more reasonable results (but still great ones).
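The $\chi^2$ test itself is short enough to sketch in Python: estimate the allele frequency from the genotype counts, compute the expected genotype counts under HWE, and compare. One degree of freedom, so the 5% critical value is about 3.84 (the genotype counts below are invented for illustration):

```python
def hwe_chi2(n_AA, n_Aa, n_aa):
    """Chi-square statistic (1 degree of freedom) for departure from HWE."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # estimated frequency of allele A
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_AA, n_Aa, n_aa)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# A SNP in perfect equilibrium, then one with a heterozygote deficit
# (a classic signature of a badly clustering probe):
print(hwe_chi2(250, 500, 250))  # 0.0
print(hwe_chi2(300, 400, 300))  # 40.0, far above the 3.84 critical value
```

In practice an exact test is often preferred for rare alleles, but the $\chi^2$ version shows the idea.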

So what went wrong with this data? The problem is not the biology but the technology. Some probes on an assay just do not work well, or are very dependent on the way the chemical agents are used (I have no idea how this works). Nevertheless, the end result is that there are corrupt variables that behave differently for cases and controls while not carrying real biological information. This kind of noise would destroy the signal if we were not lucky enough to have an additional quality criterion, in our case HWE. Besides, it is not impossible for a bad probe to respect HWE by chance. So we need independent cohorts to see if the results replicate.

Oh, and one last word about the WTCCC <3. Their exclusion lists were not strict enough, but their results are good because they manually inspected the cluster plots of the hits to make sure that there had been no problems.

Also, I guess the moral of this story is that you can never apply statistical methods without looking really hard at the data and without a lot of domain-specific knowledge.