Using a hammer to wash the dishes
Statistical procedures offer control over uncertainty. For example, the Bonferroni correction and other corrections for multiple testing make it possible to control the Family-Wise Error Rate (FWER), the probability of reporting at least one false positive when performing many tests. A false positive is a variable that follows the null hypothesis (a SNP not associated with the disease) but is reported as significant.
A statistical test is a way to decide whether a null hypothesis, e.g. "this SNP is independent of the disease", should be rejected or not. The p-value is the probability, under the null hypothesis, of observing data at least as extreme as what was actually observed. The more tests you perform, the more likely you are to obtain by chance a p-value smaller than the dreaded 0.05. In fact, if all the variables you test follow the null hypothesis, on average 1 in 20 will have a p-value smaller than 0.05. The Bonferroni correction simply divides the cut-off for the p-values by the number of tests being performed. This way, the probability of having at least one false positive in the list of significant variables is at most 0.05 (the chosen cut-off). Of course, this does not mean that the remaining variables are not associated. The correction is very conservative, and sometimes you may want a more relaxed control over uncertainty (mainly if nothing is significant after Bonferroni). One example is the Benjamini-Hochberg procedure, which controls the False Discovery Rate (FDR, not Franklin Delano Roosevelt), i.e. the expected proportion of false positives in your list of findings. If you control the FDR at 0.05 and you have 40 significant results, you can expect about two of them to be false positives.
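To make the two corrections concrete, here is a minimal sketch in plain Python. The function names and the list of p-values are made up for illustration, and the FWER formula assumes independent tests that all follow the null hypothesis.

```python
def fwer_if_independent(n_tests, alpha=0.05):
    """Probability of at least one p-value below alpha when all n_tests
    are independent and follow the null hypothesis (an assumption)."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni(p_values, alpha=0.05):
    """Indices of the tests that stay significant after dividing the
    cut-off by the number of tests."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]


def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of the tests declared significant by the Benjamini-Hochberg
    step-up procedure at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) such that p_(k) <= k * alpha / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            k_max = rank
    # Everything ranked at or below k_max is reported.
    return sorted(order[:k_max])


# Made-up p-values, for illustration only.
p_values = [0.001, 0.008, 0.012, 0.03, 0.04, 0.2, 0.5, 0.9]
print(fwer_if_independent(20))       # chance of a false positive with 20 null tests
print(bonferroni(p_values))          # indices kept by Bonferroni
print(benjamini_hochberg(p_values))  # indices kept by Benjamini-Hochberg
```

On this toy list Bonferroni keeps only the smallest p-value, while Benjamini-Hochberg keeps the three smallest, which is the "more relaxed control" mentioned above.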
All this to say that the answer you get from the data depends on the question you ask. The missing heritability problem is (to some extent) a failure to grasp this simple notion. In aggregate, the GWAS-significant SNPs explain only a small proportion of the heritability of the disease simply because they form a restrictive list built to control the FWER, not to maximize predictive accuracy. There are many false negatives. When trying to explain heritability or predict a disease, we leave the realm of statistical tests and enter the joyous land of statistical learning. There, we will use the lasso, which is computationally efficient, theoretically well understood and sparse. The following review is not exhaustive and you are welcome to complete it in the comments section.
Lasso for GWAS: a review
The predictive accuracy of the different models in our binary setting has been evaluated with the Receiver Operating Characteristic (ROC) curve and, as a summary, the Area Under the ROC Curve (AUC). In practice the AUC lies between 0.5 and 1, with 0.5 meaning no predictive power and 1 meaning complete separation between the classes. It can be interpreted as the probability that the classifier assigns a higher score to a randomly chosen positive example than to a randomly chosen negative example. It is independent of the proportion of cases and controls in the study. This is important for epidemiology, where the proportions in a case-control study are very different from those in the general population. As with any summary indicator, there is information loss. The ROC curve is more informative, as it lets you build different screening scenarios for different classification thresholds. As with any measure of predictive power, it must be evaluated on a test set and not on the training set.
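The probabilistic interpretation of the AUC can be computed directly by comparing every (positive, negative) pair. The sketch below is a naive quadratic-time version in plain Python; the function name and the scores are invented for illustration, and ties are counted as half a win by convention.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count for one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up classifier scores: three cases (label 1) and three controls (label 0).
scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.1]
labels = [1, 1, 1, 0, 0, 0]
balanced = auc(scores, labels)

# Duplicating the controls changes the case/control proportion but not the AUC.
unbalanced = auc(scores + [0.6, 0.4, 0.1], labels + [0, 0, 0])
print(balanced, unbalanced)
```

Here 7 of the 9 case/control pairs are ranked correctly, and doubling the number of controls leaves the value unchanged, which illustrates the independence from class proportions.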
The first article is Large Sample Size, Wide Variant Spectrum, and Advanced Machine-Learning Technique Boost Risk Prediction for Inflammatory Bowel Disease (2013) by Zhi Wei et al. It is remarkable for its sample size: ~17,000 Crohn’s disease cases, ~13,000 ulcerative colitis cases, and ~22,000 controls from 15 European countries. The data comes from the international IBD (Inflammatory Bowel Disease) consortium, whose original focus was a very large GWAS for IBD. The SNP chip used for this study is the Immunochip, an Illumina custom chip focused on regions of the genome known or suspected to be linked with the immune system. The interest of such a custom chip is that it becomes cheaper if you order it in large volumes. For more on the economics of the Immunochip, you should read chapter 4 of the Ph.D. thesis of Luke Jostins, who was involved in the consortium. Its introduction on the history of genetics is also much more complete than my previous post on the subject.
The Immunochip has on the order of 200,000 SNPs. This is too many to handle computationally, even for the lasso. So in the article, they pre-selected variables with univariate tests before applying the lasso, using a much more lenient p-value threshold than genome-wide significance. The threshold is designed to keep a manageable number of variables, here ~10,000. Since they had so many samples, they did not bother with cross-validation and simply split the data into three folds. The first fold was used for the pre-selection step, the second fold to train the lasso on the pre-selected SNPs (including the choice of the regularization parameter) and the third to test its performance. For Crohn’s disease, the final model had ~600 SNPs and obtained an AUC of 0.86 on the test fold. For ulcerative colitis, it was 300 SNPs and an AUC of 0.82. This should be compared with the AUC obtained using only GWAS-significant SNPs, as shown in part B of the following figure:
We next investigated the contribution by predictors. The most recent IBD association study increases the number of susceptibility loci to 163, including 23 UC-specific loci, 30 CD-specific loci, and 110 IBD loci. By using different combinations of these confirmed loci as predictors, we trained [logistic regression] on the second fold data set followed by testing on the third fold data set. […] We note that these validated loci were selected by using all data so the AUC results might be inflated. Even so, the resulting performance was inferior (AUC < 0.75 for CD and AUC < 0.7 for UC), confirming that using only validated loci is not a good strategy for risk prediction.
Another nice thing they did is to show how the AUC varies with the sample size used to train the lasso, as seen in part A of the following figure. The increase in predictive power is large.
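The three-fold protocol used by Wei et al. (pre-select on one fold, fit the lasso on a second, evaluate on a third) can be sketched on toy data. This is only an illustration in plain Python, not their implementation: the data is simulated, the univariate score (a difference in mean genotype) stands in for their lenient p-value threshold, the lasso uses a squared loss instead of a logistic one, and the regularization parameter is fixed by hand rather than chosen on the second fold. All names and numbers are hypothetical.

```python
import random

random.seed(0)

# --- Simulated genotypes (0/1/2) and phenotype; purely hypothetical data ---
n_samples, n_snps, n_causal = 600, 200, 5
X = [[random.choice([0, 1, 2]) for _ in range(n_snps)] for _ in range(n_samples)]
# A sample is a case (1) when a noisy sum of the first n_causal SNPs is high.
y = [1 if sum(row[:n_causal]) + random.gauss(0, 1) > n_causal else 0 for row in X]

# --- Three disjoint folds of sample indices ---
idx = list(range(n_samples))
random.shuffle(idx)
third = n_samples // 3
fold1, fold2, fold3 = idx[:third], idx[third:2 * third], idx[2 * third:]


def mean(values):
    return sum(values) / len(values)


def preselect(rows, keep):
    """Fold 1: keep the `keep` SNPs with the largest absolute difference in
    mean genotype between cases and controls (stand-in for a p-value cut)."""
    cases = [i for i in rows if y[i] == 1]
    controls = [i for i in rows if y[i] == 0]
    score = [abs(mean([X[i][j] for i in cases]) - mean([X[i][j] for i in controls]))
             for j in range(n_snps)]
    return sorted(range(n_snps), key=lambda j: -score[j])[:keep]


def soft_threshold(z, lam):
    return z - lam if z > lam else z + lam if z < -lam else 0.0


def fit_lasso(rows, snps, lam, n_iter=50):
    """Fold 2: squared-loss lasso fitted by cyclic coordinate descent on
    centered data. Returns the column means and the coefficients."""
    n = len(rows)
    mu = [mean([X[i][j] for i in rows]) for j in snps]
    y_bar = mean([y[i] for i in rows])
    Z = [[X[i][j] - m for j, m in zip(snps, mu)] for i in rows]
    resid = [y[i] - y_bar for i in rows]
    beta = [0.0] * len(snps)
    for _ in range(n_iter):
        for k in range(len(snps)):
            col = [z[k] for z in Z]
            norm = sum(c * c for c in col) / n
            if norm == 0.0:
                continue  # monomorphic SNP in this fold, nothing to fit
            rho = sum(c * r for c, r in zip(col, resid)) / n + norm * beta[k]
            new = soft_threshold(rho, lam) / norm
            delta = new - beta[k]
            resid = [r - delta * c for r, c in zip(resid, col)]
            beta[k] = new
    return mu, beta


def pairwise_auc(scores, labels):
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


selected = preselect(fold1, keep=20)              # fold 1: pre-selection
mu, beta = fit_lasso(fold2, selected, lam=0.01)   # fold 2: lasso training
test_scores = [sum(b * (X[i][j] - m) for j, m, b in zip(selected, mu, beta))
               for i in fold3]                    # fold 3: scoring
test_auc = pairwise_auc(test_scores, [y[i] for i in fold3])
n_used = sum(1 for b in beta if b != 0.0)
print(f"{n_used} SNPs kept by the lasso, test AUC = {test_auc:.2f}")
```

The point of keeping the folds disjoint is that neither the pre-selection nor the fitting ever sees the samples used to report the AUC, so the reported value is not inflated by selection.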
The second article that I will review is Accurate and Robust Genomic Prediction of Celiac Disease Using Statistical Learning (2014) by Gad Abraham et al. It uses the software described in SparSNP: Fast and memory-efficient analysis of all SNPs for phenotype prediction (2012) by Gad Abraham et al. SparSNP is an implementation of a lasso-penalized linear model with squared hinge loss, optimized to be fast and memory efficient. A nice feature is that it takes as input the format used to store GWAS data. It is advertised as being able to run on a laptop.
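For readers unfamiliar with the squared hinge loss, here is a small sketch of what a lasso-penalized objective of this kind looks like. It is written from the general definition, not taken from the SparSNP code: the function name, the coding of labels as +1/-1, the averaging over samples and the toy numbers are all assumptions.

```python
def squared_hinge_lasso_objective(X, y, beta0, beta, lam):
    """Average squared hinge loss plus an L1 penalty on the coefficients.
    Labels in y are assumed to be coded +1 (case) and -1 (control)."""
    loss = 0.0
    for row, label in zip(X, y):
        score = beta0 + sum(b * x for b, x in zip(beta, row))
        # No loss once the sample is on the right side with margin >= 1.
        loss += max(0.0, 1.0 - label * score) ** 2
    penalty = lam * sum(abs(b) for b in beta)  # the intercept is not penalized
    return loss / len(X) + penalty


# Toy genotypes for three samples and two SNPs (hypothetical values).
X_toy = [[0, 2], [1, 1], [2, 0]]
y_toy = [1, -1, -1]
objective = squared_hinge_lasso_objective(X_toy, y_toy, 0.0, [-0.5, 0.5], lam=0.1)
print(objective)
```

The L1 penalty is what pushes most coefficients to exactly zero, so the fitted model only uses a subset of the SNPs.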
The paper on Celiac disease uses European datasets on the order of thousands of patients. It obtains AUCs between 0.86 and 0.9, which compares favourably with the more expensive HLA typing that is sometimes used to diagnose Celiac disease. There is an interesting discussion in the paper on what this predictive power means in a clinical setting. The short version is that it is not powerful enough to be used for whole-population screening, except to define a group of people unlikely to develop the disease. It could be useful in clinical settings where someone is at higher risk, either because they have symptoms suggestive of Celiac disease or because they have a relative with the disease.
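The reason the same classifier behaves so differently in the two settings is Bayes' rule: the share of true cases among positive predictions depends on how common the disease is in the group being tested. The sketch below uses made-up values for sensitivity, specificity and prevalence; none of them come from the paper.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive values from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical operating point and prevalences, for illustration only.
sensitivity, specificity = 0.80, 0.80
ppv_pop, npv_pop = predictive_values(sensitivity, specificity, prevalence=0.01)
ppv_risk, npv_risk = predictive_values(sensitivity, specificity, prevalence=0.10)

print(f"general population: PPV = {ppv_pop:.3f}, NPV = {npv_pop:.3f}")
print(f"higher-risk group:  PPV = {ppv_risk:.3f}, NPV = {npv_risk:.3f}")
```

With a rare disease, most positive predictions are false alarms while a negative prediction is very reliable, which matches the idea of using the score to rule people out. In a higher-risk group a positive prediction carries much more weight.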
Another use of machine learning (but not lasso) is From Disease Association to Risk Assessment: An Optimistic View from Genome-Wide Association Studies on Type 1 Diabetes (2009) by Zhi Wei et al. It uses three datasets on the order of thousands of patients and obtains AUCs of the order of 0.82-0.84.
The hygiene hypothesis
The four diseases for which I reviewed results are all auto-immune diseases. They are caused by the immune system attacking the cells it is supposed to protect. For type 1 diabetes, the targeted cells are the beta cells of the pancreas, which produce insulin. After their destruction, the only treatment is regular injections of insulin. For the other three, gut cells are attacked, leading to digestive pains and problems. I did not pick them on purpose. They are, to the best of my knowledge, the best results obtained for genetic risk prediction (using linear models). The fact that auto-immune diseases are the most easily predicted diseases is, in my opinion, linked to the hygiene hypothesis.
Auto-immune diseases are diseases of rich countries and are still on the rise. The hygiene hypothesis says that the malfunctioning of the immune system is due to an overly sanitized environment. The immune system is not being trained as it used to be when most people were farmers, in contact with farm animals and with less hygiene. As a result, it malfunctions and fails to distinguish self from non-self. This higher level of hygiene is very recent in evolutionary history, and therefore mutations that are highly deleterious in this new environment have not had time to be selected against. They might even have been beneficial in a dirtier environment. Infectious diseases have been the strongest selective pressure on humans ever since the Neolithic and the beginning of agriculture 10,000 years ago. A more aggressive immune system was potentially a favourable trait.