Tag Archives: genetics

Haplotype based genetic risk estimation

I have recently submitted an article with the same title as this blog post, and it is already accessible as a preprint. A preprint is a scientific paper that has not yet been peer-reviewed. The idea behind publishing preprints is that research becomes available more quickly than if you have to wait for the sometimes lengthy peer-review process to take place.

My idea when I decided to start this blog was to be able to give context to my work. A scientific publication is not designed to be understandable to the layman (and most of the time, it is hidden behind a paywall). A blog is therefore useful as an antechamber to the scientific literature. Also, shameless self-promotion.

If you want to try and understand what the preprint is about you should follow the links to blog posts in the following paragraphs.

I have used Genome-Wide Association Studies (GWAS) data in order to try and predict genetic risk of disease using machine learning techniques. In particular, I have combined lasso regression and random forests. Compared to more traditional approaches, I have used biological structure to try and improve predictions, namely chromosomal distance and phase information.

Chromosomal distance is simply the fact that SNPs (single-nucleotide polymorphisms) have a physical location on chromosomes and you can therefore define a distance measured in base pairs between two SNPs that are on the same chromosome. This structure was exploited in T-trees.
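To make the definition concrete, here is a minimal sketch of chromosomal distance. The `SNP` class, the SNP names and the positions are all hypothetical and chosen only for illustration; they do not come from the preprint or from T-trees.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SNP:
    name: str        # hypothetical identifier
    chromosome: int  # autosome number, 1 to 22
    position: int    # coordinate in base pairs along the chromosome


def chromosomal_distance(a: SNP, b: SNP) -> Optional[int]:
    """Distance in base pairs, or None when the SNPs are on different chromosomes."""
    if a.chromosome != b.chromosome:
        return None
    return abs(a.position - b.position)


# Made-up SNPs: two on chromosome 1, one on chromosome 2.
snp_a = SNP("snp_a", chromosome=1, position=1_000_000)
snp_b = SNP("snp_b", chromosome=1, position=1_250_000)
snp_c = SNP("snp_c", chromosome=2, position=1_000_000)

print(chromosomal_distance(snp_a, snp_b))  # 250000
print(chromosomal_distance(snp_a, snp_c))  # None
```

The distance is only defined within a chromosome, which is why the function returns `None` rather than a number for the last pair.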

The second structure I tried to use is phase information, or haplotypes. We have 22 pairs of autosomal (non-sex) chromosomes. Each autosomal SNP is therefore present twice in each individual. Because of the way the technology works, we do not have access to the two sequences of the two chromosomes of a pair, but only to the genotypes. To make this clearer:


Knowing the genotype does not allow us to distinguish between the two scenarios. You should note that the two black lines are the two different chromosomes of the pair, one coming from the mother and one from the father.

OK, but why is this information important? Suppose that the two SNPs are located in the coding sequence of the same gene. Further suppose that SNP1=A and SNP2=C are nonsense mutations that each result in a dysfunctional protein. In the scenario on the left, the two mutations are on the same chromosome, so the other chromosome will produce a healthy version of the protein. In the scenario on the right, on the other hand, both copies of the gene are dysfunctional: the healthy protein will not be produced and the individual will be sick. This is called compound heterozygosity.
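The two scenarios can be sketched in a few lines of Python. This is only an illustration: the alleles "A" at SNP1 and "C" at SNP2 are the mutations from the text, while the healthy alleles "G" and "T" and all function names are assumptions made for the example.

```python
# A haplotype is the sequence of alleles carried by one chromosome.
# Each haplotype here is a tuple (allele at SNP1, allele at SNP2).
# "A" and "C" are the mutations from the text; "G" and "T" are
# hypothetical healthy alleles chosen for illustration.
MUTANT = ("A", "C")

same_chromosome = (("A", "C"), ("G", "T"))       # both mutations on one copy
opposite_chromosomes = (("A", "T"), ("G", "C"))  # one mutation on each copy


def genotype(haplotype_pair):
    """Unordered alleles per SNP: what genotyping reports, without phase."""
    first, second = haplotype_pair
    return tuple(tuple(sorted(alleles)) for alleles in zip(first, second))


def has_healthy_copy(haplotype_pair):
    """True if at least one chromosome carries neither mutation."""
    return any(
        all(allele != bad for allele, bad in zip(haplotype, MUTANT))
        for haplotype in haplotype_pair
    )


# Same genotype in both scenarios...
print(genotype(same_chromosome) == genotype(opposite_chromosomes))  # True
# ...but different consequences for the protein.
print(has_healthy_copy(same_chromosome))       # True
print(has_healthy_copy(opposite_chromosomes))  # False
```

The genotype alone cannot separate the two cases; only the phased haplotypes tell you whether a healthy copy of the gene remains.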

With all this, you are equipped to read my first preprint!



27 May 2016 · 11 h 18 min

SMPGD 2016: Microbial genetics and miRNA

I was in Lille on Thursday and Friday for an intense conference on Statistical Models for Post-Genomic Data. Two main themes emerged: the genetics of bacteria and viruses, and change-point detection. I’ll just talk about the first one and an unrelated talk on miRNA.

Viral evolutionary inference

Philippe Lemey showed us how sequencing viral genomes can be used to retrace the spatio-temporal evolution of diseases. By sequencing viruses, you can reconstruct their phylogeny and therefore find where a virus came from. This makes it possible to understand the dynamics of an epidemic much more precisely. See for example the spread of H1N1. He also showed us his results on Ebola, which is the first epidemic to be sequenced as it unfolds. They showed how the disease went from district to district. His work was retrospective, as he pooled the data of different teams, and he stressed the importance of efficient data sharing. His work shows how the epidemic propagates and therefore helps to understand which public health measures are effective.

GWAS for bacteria

Genome-wide association studies can help discover the genetic determinants of traits. But this idea is not limited to humans. One of the main traits of interest in bacteria is resistance to antibiotics. However, bacterial genomes are very challenging in several ways:



Filed under Review

Unrealistic standards of beauty for data in statistics class

When you follow a statistics class, data is perfect and you can apply all kinds of fancy algorithms and procedures to it to get to the truth. And sometimes you even have theoretical justifications for them. But the first time you encounter real data, you are shocked: there are holes in the data!


This is what actual data looks like. By Dieter Seeger [CC BY-SA 2.0 (http://creativecommons.org/licenses/by-sa/2.0)], via Wikimedia Commons

You have missing values, encoded as NA, in all data. And you can’t just keep the observations that have no NAs: you would end up with nothing. A first step is to exclude variables and observations that have too many missing values. This process is called quality control, or QC. Once you have given it this name, it seems difficult to argue for less quality control. But we could also call it Throwing Expensive Data Away. It is all a matter of perspective.
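Here is a minimal sketch of that first QC step in plain Python. The toy data, the use of `None` to stand for NA, the function names and the 50% thresholds are all assumptions made for illustration; real pipelines pick their own cut-offs.

```python
NA = None  # stand-in for a missing value


def missing_rate(values):
    """Fraction of entries that are missing; an empty list counts as fully missing."""
    return sum(v is NA for v in values) / len(values) if values else 1.0


def quality_control(rows, max_missing_per_variable=0.5, max_missing_per_observation=0.5):
    """Drop variables (columns), then observations (rows), with too many missing values."""
    n_columns = len(rows[0])
    kept_columns = [
        j for j in range(n_columns)
        if missing_rate([row[j] for row in rows]) <= max_missing_per_variable
    ]
    filtered = [[row[j] for j in kept_columns] for row in rows]
    kept_rows = [row for row in filtered
                 if missing_rate(row) <= max_missing_per_observation]
    return kept_rows, kept_columns


# Toy data: rows are observations, columns are variables.
data = [
    [0,  1,  NA, 2],
    [1,  NA, NA, 0],
    [NA, NA, NA, NA],
    [2,  0,  NA, 1],
]

# Keeping only complete observations leaves nothing at all.
complete_cases = [row for row in data if NA not in row]
print(complete_cases)  # []

clean, kept_columns = quality_control(data)
print(kept_columns)  # [0, 1, 3]
print(clean)         # [[0, 1, 2], [1, None, 0], [2, 0, 1]]
```

Note that the cleaned data still contains a missing value: QC reduces the problem but does not make it go away.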

Even after you throw away the observations and variables with …


Filed under introductory

The sleeping beauty in the random forest: T-Trees

A sleeping beauty is an article that is undervalued and rarely cited, and that then awakens later and is recognised as important. The most famous one is the work of Mendel, which was presented in 1865, published in 1866 and rediscovered 34 years later. To learn more about this concept, you can read this or that.

Today, I want to talk about a paper that is a bit too young to be a sleeping beauty but that seems undervalued:

Exploiting SNP Correlations within Random Forest for Genome-Wide Association Studies (2014) by Vincent Botta, Gilles Louppe, Pierre Geurts, Louis Wehenkel

The four authors are from Liège in Belgium. Their team is known for its work on random forests (if you do not know what that is, you can read my earlier blog post on the subject). Geurts and Wehenkel proposed a variant of random forests called extremely randomized trees. Gilles Louppe is the one who implemented random forests for the scikit-learn package for Python (which I use. Thanks!). The first author, Vincent Botta, left academia after his PhD and went to work for a company (a start-up in newspeak).


The idea of the paper is to use biological structure in order to increase prediction accuracy. The additional structure used here is chromosomal distance. A SNP is located on a chromosome and it has neighbours. This information can be useful in several ways:


Filed under Review

How good are we at genetic risk prediction of complex diseases?

Using a hammer to wash the dishes

Statistical procedures offer control over uncertainty. For example, the Bonferroni correction and other corrections for multiple testing make it possible to control the Family-Wise Error Rate (FWER). The Family-Wise Error Rate is the probability of reporting at least one false positive when performing many tests. A false positive is a variable that follows the null hypothesis (SNP not associated with the disease) but is nonetheless reported as significant.

A statistical test is a way to decide whether a null hypothesis should be rejected or not, e.g. H_0: this SNP is independent of the disease. The p-value is the probability, under the null hypothesis, of observing the data or something more extreme. The more tests you perform, the more likely you are to obtain by chance a p-value smaller than the dreaded 0.05. In fact, if all the variables you test follow the null hypothesis, on average 1 in 20 will have a p-value smaller than 0.05. The Bonferroni correction simply divides the cut-off for the p-values by the number of tests being performed. This way, the probability of having at least one false positive in the list of significant variables is at most 0.05 (the chosen cut-off). Of course, this does not mean that the remaining variables are not associated with the disease.

The Bonferroni correction is very conservative, and sometimes you want a more relaxed control over uncertainty (mainly when nothing is significant after Bonferroni). One example is the Benjamini-Hochberg procedure, which controls the False Discovery Rate (FDR, but not Franklin Delano Roosevelt), i.e. the expected proportion of false positives in your list of findings. If you control the FDR at 0.05 and you have 40 significant results, you can expect two of them to be false positives.
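The two corrections can be sketched in plain Python. The p-values below are made up for illustration, the function names are mine, and the uncorrected FWER formula assumes independent tests, which is rarely true for neighbouring SNPs.

```python
def fwer_uncorrected(n_tests, alpha=0.05):
    """P(at least one false positive) when all nulls are true and tests are independent."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni(p_values, alpha=0.05):
    """Indices of the tests that are significant at FWER level alpha."""
    threshold = alpha / len(p_values)
    return [i for i, p in enumerate(p_values) if p <= threshold]


def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of the tests that are significant at FDR level alpha (step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that the k-th smallest p-value is <= k * alpha / m.
    largest_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            largest_rank = rank
    # Everything ranked at or below that rank is declared significant.
    return sorted(order[:largest_rank])


# Made-up p-values for 10 tests.
p_values = [0.001, 0.008, 0.012, 0.03, 0.04, 0.2, 0.5, 0.7, 0.9, 0.95]

print(round(fwer_uncorrected(20), 2))  # 0.64: with 20 tests, a false positive is likely
print(bonferroni(p_values))            # [0]
print(benjamini_hochberg(p_values))    # [0, 1, 2]
```

On these numbers Bonferroni keeps a single finding while Benjamini-Hochberg keeps three, which is the trade-off between the two kinds of control.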

All this to say that the answer you get from the data depends on the question you ask. The missing heritability problem is (to some extent) a failure to grasp this simple notion. The GWAS-significant SNPs in aggregate explain a small proportion of the heritability of the disease simply because they form a restrictive list built for FWER control, not chosen to maximize predictive accuracy. There are many false negatives. When trying to explain heritability or predict a disease, we leave the realm of statistical tests and enter the joyous land of statistical learning. And therefore, we will use the computationally efficient, theoretically well understood and sparse lasso. The following review is not exhaustive and you are welcome to add to it in the comments section.

Lasso for GWAS: a review


Filed under Review

A short (and biased) history of genetics up to GWAS

This post is the first in a series of introductory posts that I will write. I will get more technical at some point.

The history of genetics begins with this major discovery.

There is a separation in genetics between Mendelian traits and complex traits. Mendelian traits depend on only a few genes, whereas complex traits result from many genes and environmental factors. For example, Mendelian traits include eye color, cystic fibrosis and Tay-Sachs disease. Complex traits include height, skin color and type 1 diabetes.


Filed under introductory