Tag Archives: statistical learning

Haplotype based genetic risk estimation

I have recently submitted an article with the same title as this blog post, and it is already accessible as a preprint. A preprint is a scientific paper made public before it has been peer-reviewed. The idea behind publishing preprints is that the research becomes available more quickly than if readers had to wait for the sometimes lengthy peer-review process.

My idea when I decided to start this blog was to be able to give context to my work. A scientific publication is not designed to be understandable to the layperson (and most of the time, it is hidden behind a paywall). A blog is therefore useful as an antechamber to the scientific literature. Also, shameless self-promotion.

If you want to understand what the preprint is about, you should follow the links to blog posts in the following paragraphs.

I used data from Genome-Wide Association Studies (GWAS) to try to predict the genetic risk of disease with machine learning techniques. In particular, I combined lasso regression and random forests. Unlike more traditional approaches, I used biological structure to try to improve predictions, namely chromosomal distance and phase information.

Chromosomal distance is simply the fact that SNPs (single-nucleotide polymorphisms) have a physical location on chromosomes and you can therefore define a distance measured in base pairs between two SNPs that are on the same chromosome. This structure was exploited in T-trees.
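As a minimal sketch of this definition, the snippet below computes the distance in base pairs between two SNPs. The `SNP` type, the SNP names and the positions are hypothetical and only serve as an illustration; the distance is left undefined (`None`) when the two SNPs are on different chromosomes.

```python
from typing import NamedTuple, Optional


class SNP(NamedTuple):
    name: str          # hypothetical identifier
    chromosome: int    # chromosome number
    position: int      # physical location, in base pairs


def chromosomal_distance(a: SNP, b: SNP) -> Optional[int]:
    """Distance in base pairs, or None if the SNPs are on different chromosomes."""
    if a.chromosome != b.chromosome:
        return None
    return abs(a.position - b.position)


# Made-up SNPs for illustration only.
snp_a = SNP("snp_a", chromosome=1, position=1_000_000)
snp_b = SNP("snp_b", chromosome=1, position=1_250_000)
snp_c = SNP("snp_c", chromosome=2, position=1_000_000)

print(chromosomal_distance(snp_a, snp_b))  # 250000
print(chromosomal_distance(snp_a, snp_c))  # None
```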

The second structure I tried to use is phase information, or haplotypes. We have 22 pairs of autosomal (non-sex) chromosomes. Each autosomal SNP is therefore present twice in each individual. Because of the way the technology works, we do not have access to the separate sequences of the two chromosomes of a pair, only to the genotypes. To make this clearer:


Knowing the genotype does not allow us to distinguish between the two scenarios. You should note that the two black lines are the two different chromosomes of the pair, one coming from the mother and one from the father.

OK, but why is this information important? Suppose that the two SNPs are located in the coding sequence of the same gene. Further suppose that SNP1=A and SNP2=C are nonsense mutations, each of which results in a dysfunctional protein. In the scenario on the left, the two mutations are on the same chromosome, so the other chromosome will still produce a healthy version of the protein. On the right, on the other hand, both copies of the gene are dysfunctional: the healthy protein will not be produced and the individual will be sick. This is called compound heterozygosity.
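The two scenarios can be sketched in a few lines of Python. This is only an illustration: the mutant alleles A and C come from the example above, but the other alleles (G and T here) are assumptions made up for the sketch, as are the function names.

```python
# Mutant (nonsense) alleles from the example above.
MUTANT = {"SNP1": "A", "SNP2": "C"}

# Each scenario is a pair of haplotypes, one per chromosome of the pair.
# The non-mutant alleles G and T are hypothetical.
same_chromosome = (
    {"SNP1": "A", "SNP2": "C"},  # carries both mutations
    {"SNP1": "G", "SNP2": "T"},  # carries none
)
different_chromosomes = (
    {"SNP1": "A", "SNP2": "T"},  # carries the SNP1 mutation
    {"SNP1": "G", "SNP2": "C"},  # carries the SNP2 mutation
)


def genotype(haplotype_pair):
    """Unordered alleles at each SNP: what we observe without phase information."""
    first, second = haplotype_pair
    return {snp: tuple(sorted((first[snp], second[snp]))) for snp in first}


def has_functional_copy(haplotype_pair):
    """True if at least one chromosome carries no mutant allele."""
    return any(
        all(haplotype[snp] != MUTANT[snp] for snp in MUTANT)
        for haplotype in haplotype_pair
    )


# Same genotype, different biological consequences.
print(genotype(same_chromosome) == genotype(different_chromosomes))  # True
print(has_functional_copy(same_chromosome))        # True
print(has_functional_copy(different_chromosomes))  # False
```

The genotypes are identical, so only the haplotypes tell the two situations apart.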

With all this, you are equipped to read my first preprint!



27 May 2016 · 11 h 18 min

The sleeping beauty in the random forest: T-Trees

A sleeping beauty is an article that is undervalued and rarely cited at first, then awakens later and is recognised as important. The most famous one is the work of Mendel, which was published in 1866 and rediscovered 34 years later. To learn more about this concept, you can read this or that.

Today, I want to talk about a paper that is a bit too young to be a sleeping beauty but that seems undervalued:

Exploiting SNP Correlations within Random Forest for Genome-Wide Association Studies (2014) by Vincent Botta, Gilles Louppe, Pierre Geurts, Louis Wehenkel

The four authors are from Liège in Belgium. Their team is known for its work on random forests (if you do not know what that is, you can read my earlier blog post on the subject). Geurts and Wehenkel proposed a variant of random forests called extremely randomized trees. Gilles Louppe is the one who implemented random forests for the scikit-learn package for Python (which I use. Thanks!). The first author, Vincent Botta, left academia after his PhD and went to work for a company (a start-up in newspeak).


The idea of the paper is to use biological structure in order to increase prediction accuracy. The additional structure used here is chromosomal distance. A SNP is located on a chromosome and it has neighbours. This information can be useful in several ways: Continue reading


Filed under Review

Everything is not linear: the example of Random Forest

Linear regression is great. But unfortunately, not everything in nature is linear. If you drink alcohol, you get drunk. If you take your prescribed drugs, you are healthy. But if you do both at the same time, you will not be drunk and healthy, you will probably get very sick. This is an interaction. In general, we talk about interaction when there is a departure from linearity, that is, when the effect of two factors together is not the sum of their separate effects. There are many ways to try to capture interactions using statistical learning, but today I will focus on Random Forest. But before I explain what a forest is, I have to explain what a decision tree is.
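To make the alcohol and drugs example concrete, here is a small sketch. The "sickness scores" are numbers I made up for illustration (higher is worse), and the tree is written by hand rather than learned from data. An additive model adds up the separate effects and misses the combined outcome, whereas a tree can return a different value in each branch.

```python
# Hypothetical sickness scores (higher is worse), keyed by (alcohol, drugs).
observed = {
    (False, False): 0,   # neither
    (True, False): 2,    # alcohol only: drunk
    (False, True): -1,   # drugs only: healthier
    (True, True): 10,    # both: very sick
}


def additive_prediction(alcohol, drugs):
    """Baseline plus the separate effect of each factor (no interaction term)."""
    baseline = observed[(False, False)]
    alcohol_effect = observed[(True, False)] - baseline
    drug_effect = observed[(False, True)] - baseline
    return baseline + alcohol_effect * alcohol + drug_effect * drugs


def tiny_tree(alcohol, drugs):
    """A hand-written decision tree: one split on alcohol, then one on drugs."""
    if alcohol:
        return 10 if drugs else 2
    return -1 if drugs else 0


# The gap between observation and additive prediction is the interaction.
interaction = observed[(True, True)] - additive_prediction(True, True)
print(additive_prediction(True, True))  # 1
print(interaction)                      # 9
print(all(tiny_tree(a, d) == score for (a, d), score in observed.items()))  # True
```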

“Erik – Prunus sp 02” by Zeynel Cebeci – Own work. Licensed under CC BY-SA 4.0 via Wikimedia Commons – https://commons.wikimedia.org/wiki/File:Erik_-_Prunus_sp_02.JPG#/media/File:Erik_-_Prunus_sp_02.JPG

The good people at www.r2d3.us did a great job of explaining what a decision tree is in a very visual way. So click here and go look at it. Also, subtle Star Wars reference. Continue reading


Filed under introductory

How good are we at genetic risk prediction of complex diseases?

Using a hammer to wash the dishes

Statistical procedures offer control over uncertainty. For example, the Bonferroni correction and other corrections for multiple testing make it possible to control the Family-Wise Error Rate (FWER). The FWER is the probability of reporting at least one false positive when performing many tests. A false positive is a variable that follows the null hypothesis (the SNP is not associated with the disease) but is reported as significant.
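A quick sketch shows how fast the FWER grows without correction. It assumes that all tests are independent and that every variable follows the null hypothesis, which is a simplification (SNPs are typically correlated). The function names are mine.

```python
def fwer_uncorrected(n_tests, alpha=0.05):
    """P(at least one false positive) when each test uses the cut-off alpha."""
    return 1 - (1 - alpha) ** n_tests


def fwer_bonferroni(n_tests, alpha=0.05):
    """Same probability when each test uses the cut-off alpha / n_tests."""
    return 1 - (1 - alpha / n_tests) ** n_tests


for n_tests in (1, 20, 1000):
    print(
        n_tests,
        round(fwer_uncorrected(n_tests), 3),
        round(fwer_bonferroni(n_tests), 3),
    )
```

Under these assumptions, the uncorrected FWER approaches 1 as the number of tests grows, while the corrected one stays below the chosen cut-off.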

A statistical test is a way to say whether a null hypothesis should be rejected or not, e.g. H_0 : this SNP is independent of the disease. The p-value is the probability, under the null hypothesis, of observing data at least as extreme as what was actually observed. The more tests you perform, the more likely you are to obtain by chance a p-value smaller than the dreaded 0.05. In fact, if all the variables you test follow the null hypothesis, on average 1 in 20 will have a p-value smaller than 0.05. The Bonferroni correction simply divides the cut-off for the p-values by the number of tests being performed. This way, the probability of having at least one false positive in the list of all the significant variables is at most 0.05 (the chosen cut-off). Of course, this does not mean that the remaining variables are not associated with the disease.

The Bonferroni correction is very conservative, and sometimes you may want a more relaxed control over uncertainty (mainly if nothing is significant after Bonferroni). One example is the Benjamini-Hochberg procedure, which controls the False Discovery Rate (FDR, but not Franklin Delano Roosevelt), i.e. the expected proportion of false positives in your list of findings. If you control the FDR at 0.05 and you have 40 significant results, you can expect about two of them to be false positives.
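The two procedures can be sketched in plain Python. The p-values below are invented for illustration, and the Benjamini-Hochberg function is the standard step-up procedure (reject the k smallest p-values, where k is the largest rank such that the k-th smallest p-value is at most k / m times the cut-off). It is a bare-bones sketch, not a replacement for a tested statistics library.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H_0 when the p-value is at most alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure controlling the FDR at level alpha."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k whose p-value is at most (k / m) * alpha.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


# Made-up p-values for ten hypothetical SNPs.
p_values = [0.001, 0.008, 0.012, 0.03, 0.04, 0.2, 0.5, 0.7, 0.9, 0.95]

print(sum(bonferroni(p_values)))          # 1
print(sum(benjamini_hochberg(p_values)))  # 3
```

On these made-up numbers, Bonferroni keeps one finding and Benjamini-Hochberg keeps three, which illustrates the more relaxed control.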

All this to say that the answer you get from the data depends on the question you ask. The missing heritability problem is (to some extent) a failure to grasp this simple notion. In aggregate, the GWAS-significant SNPs explain only a small proportion of the heritability of the disease, simply because they form a restrictive list chosen to allow FWER control, not to maximize predictive accuracy. There are many false negatives. When trying to explain heritability or predict a disease, we leave the realm of statistical tests and enter the joyous land of statistical learning. There, we will use the lasso, which is computationally efficient, theoretically well understood and sparse. The following review is not exhaustive and you are welcome to complete it in the comments section.

Lasso for GWAS: a review Continue reading


Filed under Review

Linear regression in high dimension, sparsity and convex relaxation

I don’t feel like explaining what linear regression is so I’ll let someone else do it for me (you probably need to know at least some linear algebra to follow the notations):

When I was in high school, we made some observations on a pendulum or something during a physics practical and we had to graph them. They were almost on a line, so I simply joined each point to the next and ended up with a broken line. The teacher, seeing that, told me: "Where do you think you are? Kindergarten? Draw a line!" Well, look at me now, Ms Mauprivez! Doing a PhD and all!

In physics, for such easy experiments, it is obvious that the relation is linear. The data have almost no noise apart from some small measurement error, and they reveal a “true” linear relation embodied by the line. In the rest of science, linear regression is not expected to uncover true linear relations. It would be unrealistic to hope to predict precisely the age at which you will have lung cancer from the length of time you were a smoker (and very difficult to draw the line just by looking at the points). It is rather a way to find a correlation and a trend between noisy features that have many other determinants: smoking is correlated with cancer. Proving causation is another complicated step.

But linear regression breaks down if you try to apply it with many explanatory features, as in GWAS. The training error (mean squared error) will decrease as you add more and more features, but if you use the model to predict on new data, you will be completely off target. This problem is called overfitting. If you allow the model to be very complicated, it can fit the training data perfectly but will be useless for prediction (just like the broken line). Continue reading
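The broken-line analogy can be sketched in a few lines of Python. Note that this is not a regression with many features: it compares a straight line fitted by least squares with the broken line that joins every training point, on simulated data. The true relation (y = 2x + 1), the noise level and the sample sizes are all assumptions chosen for the illustration.

```python
import random
from bisect import bisect_right


def fit_line(xs, ys):
    """Ordinary least squares with a single feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def broken_line(xs, ys):
    """Predictor that joins consecutive training points (xs must be sorted)."""
    def predict(x):
        x = min(max(x, xs[0]), xs[-1])  # stay within the training range
        i = min(max(bisect_right(xs, x), 1), len(xs) - 1)
        t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
        return ys[i - 1] + t * (ys[i] - ys[i - 1])
    return predict


def mse(predict, xs, ys):
    """Mean squared error of a predictor on a dataset."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)


def sample(n):
    """Simulated data: assumed true relation y = 2x + 1 plus Gaussian noise."""
    xs = sorted(rng.uniform(0, 10) for _ in range(n))
    ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]
    return xs, ys


x_train, y_train = sample(1000)
x_test, y_test = sample(1000)

slope, intercept = fit_line(x_train, y_train)
line = lambda x: slope * x + intercept
broken = broken_line(x_train, y_train)

print("training error:", mse(line, x_train, y_train), mse(broken, x_train, y_train))
print("test error:    ", mse(line, x_test, y_test), mse(broken, x_test, y_test))
```

The broken line has essentially zero training error because it passes through every training point, but on new data it carries the noise of its neighbouring points and does worse than the simple line.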


Filed under introductory