The sleeping beauty in the random forest: T-Trees

A sleeping beauty is an article that goes undervalued and rarely cited for years, then awakens and is recognised as important. The most famous example is the work of Mendel, published in 1865 and rediscovered 34 years later. To learn more about this concept, you can read this or that.

Today, I want to talk about a paper that is a bit too young to be a sleeping beauty but that seems undervalued:

Exploiting SNP Correlations within Random Forest for Genome-Wide Association Studies (2014) by Vincent Botta, Gilles Louppe, Pierre Geurts, Louis Wehenkel

The four authors are from Liège in Belgium. Their team is known for its work on random forests (if you do not know what that is, you can read my earlier blog post on the subject). Geurts and Wehenkel proposed a variant of random forests called extremely randomized trees. Gilles Louppe is the one who implemented random forests for the scikit-learn package for Python (which I use. Thanks!). The first author, Vincent Botta, left academia after his PhD and went to work for a company (a start-up, in newspeak).

Motivation

The idea of the paper is to use biological structure in order to increase prediction accuracy. The additional structure used here is chromosomal distance. A SNP is located on a chromosome and it has neighbours. This information can be useful in several ways:

-First, SNPs that are next to each other might impact the same gene. In other words, they are more likely to interact than SNPs that are far away (even if long-range interactions do exist in genetics).

-Second, SNPs that are close to each other are often in strong Linkage Disequilibrium. This means that they are highly correlated. If they carry the same information, grouping them achieves dimension reduction. Correlated variables are also problematic for computing variable importance with random forests; grouping variables once again allows us to circumvent this problem (the sketch right after this list illustrates the correlation).
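
To make the correlation point concrete, here is a toy simulation, not real genotype data: a block of contiguous SNPs that mostly copy one shared signal, so their pairwise correlations crudely mimic LD. The block size and mutation rate are made up.

```python
# Toy illustration of linkage disequilibrium as correlation: simulate a
# block of contiguous SNPs that mostly copy one shared "ancestral"
# column, then look at their pairwise correlations. Purely synthetic;
# the block size and the 10% mutation rate are invented for the demo.
import numpy as np

rng = np.random.default_rng(1)
n, block_size = 1000, 5
ancestral = rng.integers(0, 3, size=n)               # shared genotype signal
noise = rng.integers(0, 3, size=(n, block_size))
copied = rng.random((n, block_size)) < 0.9           # 90% copied, 10% "mutated"
block = np.where(copied, ancestral[:, None], noise)

r = np.corrcoef(block, rowvar=False)
print(np.round(r, 2))  # strong off-diagonal correlations, a crude proxy for LD
```

Since the five columns are nearly redundant, one summary of the block retains most of the information: that is the dimension reduction.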

Random forests are nice because they can capture interactions. However, if two interacting variables explain all the signal and you then add a large number of noise variables, it becomes less likely that the two relevant variables will both be used to grow the same tree. Hence the interest of dimension reduction; the simulation below makes this dilution effect visible.
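
A minimal sketch of that dilution, assuming simulated genotypes and an interaction-driven phenotype (all parameter values are illustrative):

```python
# Two SNPs carry the signal through an XOR-style interaction; watch
# random forest performance degrade as noise SNPs are added.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000

for n_noise in [0, 10, 100, 1000]:
    snps = rng.integers(0, 3, size=(n, 2 + n_noise))   # genotypes coded 0/1/2
    # Phenotype depends on the interaction of the first two SNPs only.
    y = ((snps[:, 0] > 0) ^ (snps[:, 1] > 0)).astype(int)
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    auc = cross_val_score(rf, snps, y, cv=3, scoring="roc_auc").mean()
    print(f"{n_noise:>5} noise SNPs -> AUC {auc:.2f}")
```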

Algorithm

The paper proposes an approach that achieves both objectives: capturing local interactions and reducing dimension. They call it Trees inside Trees, or T-Trees for short (lolilol). They define groups of contiguous SNPs. Groups play the role that individual variables play in an ordinary random forest: to split a node, you select a fixed number of groups at random. For each group, the candidate split is produced by a weak learner that uses the SNPs in the group as its variables. The weak learner could be many things, but here they used a single random decision tree.


Figure 1. A closer look at a T-Tree test node. Group 1 is tested; out of this group, three SNPs are exploited by the weak learner. In red (resp. green): the probability of being a case (resp. control) as estimated by the weak learner. doi:10.1371/journal.pone.0093379.g001

They then grow many such trees and average the results.
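
To fix ideas, here is a rough sketch of what a single T-Tree split could look like, written from my reading of the paper rather than from the authors' code; the group boundaries, the weak learner's depth and the scoring rule are simplifications of mine.

```python
# Sketch of one T-Tree node split: draw a few SNP groups at random, fit
# a small randomized decision tree on each group's SNPs, and keep the
# group whose weak learner best separates cases from controls.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

def ttree_split(X, y, groups, n_groups_tried=5, rng=None):
    """X: (n_samples, n_snps) genotypes, y: 0/1 phenotype,
    groups: list of arrays of column indices (contiguous SNPs)."""
    rng = rng or np.random.default_rng()
    k = min(n_groups_tried, len(groups))
    best = None
    for g in rng.choice(len(groups), size=k, replace=False):
        cols = groups[g]
        weak = DecisionTreeClassifier(max_depth=2, splitter="random")
        weak.fit(X[:, cols], y)
        score = roc_auc_score(y, weak.predict_proba(X[:, cols])[:, 1])
        if best is None or score > best[0]:
            best = (score, g, weak)
    return best  # (score, chosen group, fitted weak learner)

# Toy usage: 100 groups of 10 contiguous SNPs each, random labels.
X = np.random.default_rng(3).integers(0, 3, size=(500, 1000))
y = np.random.default_rng(4).integers(0, 2, size=500)
groups = [np.arange(i, i + 10) for i in range(0, 1000, 10)]
score, g, weak = ttree_split(X, y, groups)
```

As I understand it, the chosen weak learner's output then routes samples to the child nodes and the recursion continues, just as an ordinary split on a single variable would.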

They also defined an analogue of variable importance, both at the group level and at the SNP level. This allowed them to report new loci associated with Crohn’s disease, which would need to be replicated in an independent dataset.
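
The paper has its own definitions; as a generic illustration of what "importance at the group level" can mean, here is a permutation-based stand-in (permute all SNPs of a group together, measure the drop in AUC). This is not the measure used in the paper.

```python
# Hypothetical group-level permutation importance for any fitted
# classifier: shuffling a whole group at once preserves the within-group
# correlation structure, which per-variable shuffling would break.
import numpy as np
from sklearn.metrics import roc_auc_score

def group_importance(model, X, y, groups, rng=None):
    rng = rng or np.random.default_rng()
    base = roc_auc_score(y, model.predict_proba(X)[:, 1])
    drops = []
    for cols in groups:
        Xp = X.copy()
        perm = rng.permutation(len(X))
        Xp[:, cols] = X[perm][:, cols]       # shuffle the group jointly
        drops.append(base - roc_auc_score(y, model.predict_proba(Xp)[:, 1]))
    return np.array(drops)                   # one importance per group
```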

Results

They used T-Trees on the WTCCC-1 data ❤, which is the same data I am working on: seven diseases (BD = bipolar disorder, CAD = coronary artery disease, CD = Crohn’s disease, HT = hypertension, RA = rheumatoid arthritis, T1D = type 1 diabetes, T2D = type 2 diabetes) with 2000 patients each and 3000 shared controls.

The results are very impressive:

[Results table from Botta et al.: AUCs for the seven diseases, with strict quality control on the left and the default WTCCC filters on the right.]

For now, focus on the two columns on the left. The numbers are AUCs, for Area Under the Receiver Operating Characteristic curve. The AUC is the probability that, given one patient and one control, the model attributes the higher risk to the patient. If you are at all familiar with the genetic epidemiology of complex diseases, those AUCs should put you in shock: they are much better than what you can find in the literature. You cannot directly compare AUCs between diseases; to get a sense of what they mean for prediction, you need to take prevalence into account. In this paper, Gad Abraham et al. argued that an AUC of 0.85 for celiac disease means prediction could be useful in a population with excess risk, such as siblings of patients or people showing early symptoms. But celiac disease is rare, affecting less than 1% of the population. Things are very different for diseases with high prevalence like T2D or CAD. If those results held in the population, personalised medicine could stop being a faraway target and start having real impact.
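
If the pairwise reading of the AUC sounds abstract, this little check shows the two definitions agree in code, using arbitrary simulated risk scores:

```python
# Compare sklearn's roc_auc_score with a direct Monte Carlo estimate of
# P(risk score of a random case > risk score of a random control).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
cases = rng.normal(1.0, 1.0, size=2000)      # predicted risks for patients
controls = rng.normal(0.0, 1.0, size=3000)   # predicted risks for controls

y = np.r_[np.ones(2000), np.zeros(3000)]
scores = np.r_[cases, controls]
print("roc_auc_score:    ", roc_auc_score(y, scores))
print("pairwise estimate:", np.mean(rng.choice(cases, 100_000) >
                                    rng.choice(controls, 100_000)))
```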

But now, you should temper your enthusiasm by looking at the two columns on the right. The results there are too good to be true. The difference between the two sides of the table is quality control. As I mentioned in my last post, quality control is often 90% of a bioinformatician’s job. The data produced by the technologies that became available in the last decades are very noisy, here mainly for technological reasons. On the right, the filters used are the ones shipped by the WTCCC with the data. On the left, much stricter quality control has been applied, including a test for Hardy-Weinberg equilibrium (more on that in a coming blog post; a minimal version of such a test is sketched below). The thing is, random forests and T-Trees seem extremely good at separating cases and controls using corrupted variables. So how can we know for sure that the results on the left would hold in the general population? The answer is that we need to try to predict on an independent dataset.
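
For the curious, a minimal Hardy-Weinberg QC filter can look like the chi-square sketch below, comparing observed genotype counts with those expected from the allele frequency. The counts are made up, and real pipelines typically use an exact test, but the idea is the same.

```python
# Chi-square test for Hardy-Weinberg equilibrium (1 degree of freedom:
# three genotype classes minus one estimated allele frequency minus one).
from scipy.stats import chi2

def hwe_chi2_p(n_aa, n_ab, n_bb):
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)                      # frequency of allele A
    expected = [n * p**2, 2 * n * p * (1 - p), n * (1 - p)**2]
    observed = [n_aa, n_ab, n_bb]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2.sf(stat, df=1)

# A SNP with a large heterozygote deficit, often a genotyping artefact:
print(hwe_chi2_p(1200, 400, 1400))   # tiny p-value -> filter this SNP out
```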

SCIENCE NEEDS YOUR DATASET FOR REPLICATION

“Fortunately, it is very easy for a researcher to get access to GWAS datasets,” I would say in a perfect world. For example, you cannot access dbGaP (the database of Genotypes and Phenotypes, the American equivalent of the European Genome-phenome Archive, EGA) if your PI does not have an NIH accreditation. The best data sit in consortia like the IBDGC, but to access them you need to be part of the consortium and then ask for them. It is no coincidence that I am working on the same data as Botta et al.: they are accessible. You just have to ask, wait two months for a response, sign papers, and download them. And you get seven datasets at once. All of this is of course because of medical privacy and the risk of misuse of the data. Still, what is the point of spending a lot of money on datasets if scientists cannot access them?

Which is why I ask you, GWAS researcher of the internets, to use your dataset to see if T-Trees is as awesome as it looks! The code is on GitHub.
