Tag Archives: lasso

How good are we at genetic risk prediction of complex diseases?

Using a hammer to wash the dishes

Statistical procedures offer control over uncertainty. For example, the Bonferroni correction and other corrections for multiple testing make it possible to control the Family-Wise Error Rate (FWER), the probability of reporting at least one false positive when performing many tests. A false positive is a variable that follows the null hypothesis (the SNP is not associated with the disease) but is nevertheless reported as significant.

A statistical test is a way to decide whether a null hypothesis should be rejected, e.g. H_0: this SNP is independent of the disease. The p-value is the probability, under the null hypothesis, of observing data at least as extreme as what was actually observed. The more tests you perform, the more likely you are to obtain by chance a p-value smaller than the dreaded 0.05. In fact, if all the variables you test follow the null hypothesis, on average 1 in 20 will have a p-value smaller than 0.05. The Bonferroni correction simply divides the cut-off for the p-values by the number of tests being performed. This way, the probability of having at least one false positive in the list of significant variables is at most 0.05 (the chosen cut-off). Of course, this does not mean that the variables left off the list are not associated with the disease. The correction is very conservative, and sometimes you may want a more relaxed control over uncertainty (mainly when nothing is significant after Bonferroni). One example is the Benjamini-Hochberg procedure, which controls the False Discovery Rate (FDR, not to be confused with Franklin Delano Roosevelt), i.e. the expected proportion of false positives in your list of findings. If you control the FDR at 0.05 and you have 40 significant results, you can expect about two of them to be false positives.
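To make the difference between the two corrections concrete, here is a minimal sketch in plain Python. The function names `bonferroni` and `benjamini_hochberg` and the list of p-values are made up for illustration; they do not come from any GWAS or from a particular library.

```python
def bonferroni(pvalues, alpha=0.05):
    """Indices of tests significant when controlling the FWER at level alpha."""
    m = len(pvalues)
    # Divide the cut-off by the number of tests.
    return [i for i, p in enumerate(pvalues) if p <= alpha / m]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Indices of tests rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k (1-based) such that p_(k) <= k * alpha / m.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            k_max = rank
    # Reject the k_max smallest p-values.
    return sorted(order[:k_max])


# Hypothetical p-values for 10 tests (illustration only).
pvalues = [0.001, 0.008, 0.012, 0.019, 0.03, 0.2, 0.5, 0.6, 0.8, 0.95]

print("Bonferroni:", bonferroni(pvalues))                  # Bonferroni: [0]
print("Benjamini-Hochberg:", benjamini_hochberg(pvalues))  # Benjamini-Hochberg: [0, 1, 2, 3]
```

On this toy list, Bonferroni keeps only the smallest p-value while Benjamini-Hochberg keeps four, which is the kind of relaxation described above.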

All this to say that the answer you get from the data depends on the question you ask. The missing heritability problem is (to some extent) a failure to grasp this simple notion. In aggregate, the GWAS-significant SNPs explain only a small proportion of the heritability of the disease simply because they form a restrictive list, selected to control the FWER rather than to maximize predictive accuracy. There are many false negatives. When trying to explain heritability or predict a disease, we leave the realm of statistical tests and enter the joyous land of statistical learning. And therefore we will use the lasso, which is computationally efficient, theoretically well understood and sparse. The following review is not exhaustive, and you are welcome to add to it in the comments section.

Lasso for GWAS: a review




Linear regression in high dimension, sparsity and convex relaxation

I don’t feel like explaining what linear regression is, so I’ll let someone else do it for me (you probably need to know at least some linear algebra to follow the notation):

When I was in high school, during a physics practical, we made some observations on a pendulum or something and had to graph them. The points were almost on a line, so I simply joined each point to the next and ended up with a broken line. Seeing that, the teacher told me: “Where do you think you are? Kindergarten? Draw a line!” Well, look at me now, Ms Mauprivez! Doing a PhD and all!

In physics, for such simple experiments, it is obvious that the relation is linear. The data have almost no noise apart from some small measurement error, and they reveal a “true” linear relation embodied by the line. In the rest of science, linear regression is not expected to uncover true linear relations. It would be unrealistic to hope to predict precisely the age at which you will get lung cancer from the length of time you were a smoker (and very difficult to draw the line just by looking at the points). It is rather a way to find a correlation and a trend between noisy features that have many other determinants: smoking is correlated with cancer. Proving causation is another, more complicated step.

But linear regression breaks down if you try to apply it with many explanatory features, as in GWAS. The error on the training data (mean squared error) will decrease as you add more and more features, but if you use the model to predict on new data, you will be completely off target. This problem is called overfitting. If you allow the model to be very complicated, it can fit the training data perfectly but will be useless for prediction (just like the broken line).
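The overfitting effect can be reproduced with a small simulation. The sketch below uses only the Python standard library and makes several assumptions for illustration: the data are simulated (30 training samples, 25 Gaussian features, only the first of which actually influences the outcome), there is no intercept, and the helpers `solve`, `fit_ols`, `mse` and `simulate` are hypothetical names written for this example, not part of any library.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_ols(X, y):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)


def mse(X, y, beta):
    """Mean squared error of the linear predictor X beta."""
    preds = [sum(b * v for b, v in zip(beta, row)) for row in X]
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)


def simulate(n, p, rng):
    """Simulated data: only the first feature has a real effect on y."""
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [2.0 * row[0] + rng.gauss(0, 1) for row in X]
    return X, y


rng = random.Random(0)
X_train, y_train = simulate(30, 25, rng)
X_test, y_test = simulate(1000, 25, rng)

results = {}
for k in (1, 5, 15, 25):
    # Fit using only the first k features, then evaluate on both sets.
    Xk_train = [row[:k] for row in X_train]
    Xk_test = [row[:k] for row in X_test]
    beta = fit_ols(Xk_train, y_train)
    results[k] = (mse(Xk_train, y_train, beta), mse(Xk_test, y_test, beta))
    print(f"{k:2d} features: train MSE = {results[k][0]:.3f}, "
          f"test MSE = {results[k][1]:.3f}")
```

Because the models are nested, the training error can only go down as features are added, while the error on fresh data should get markedly worse once the number of features approaches the number of training samples.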

