Monthly Archives: May 2016

Haplotype based genetic risk estimation

I have recently submitted an article with the same title as this blog post, and it is already accessible as a preprint. A preprint is a scientific paper that has not yet been peer-reviewed. The idea behind publishing preprints is that research becomes available more quickly than if you have to wait for the sometimes lengthy peer-review process to take place.

My idea when I decided to start this blog was to be able to give context to my work. A scientific publication is not designed to be understandable to the layperson (and most of the time, it is hidden behind a paywall). A blog is therefore useful as an antechamber to the scientific literature. Also, shameless self-promotion.

If you want to try and understand what the preprint is about you should follow the links to blog posts in the following paragraphs.

I have used Genome-Wide Association Studies (GWAS) data to try to predict genetic risk of disease using machine learning techniques. In particular, I have combined lasso regression and random forests. Unlike more traditional approaches, I have used biological structure to try to improve predictions, namely chromosomal distance and phase information.

Chromosomal distance is simply the fact that SNPs (single-nucleotide polymorphisms) have a physical location on chromosomes and you can therefore define a distance measured in base pairs between two SNPs that are on the same chromosome. This structure was exploited in T-trees.
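As a minimal sketch of this distance (not code from the preprint), the Python below assumes each SNP is described by a chromosome number and a base-pair position. The `SNP` type, the `distance` function, and the SNP names and positions are all hypothetical and only serve as illustration.

```python
from typing import NamedTuple, Optional


class SNP(NamedTuple):
    """A SNP with its physical location (hypothetical structure)."""
    name: str
    chromosome: int
    position: int  # coordinate on the chromosome, in base pairs


def distance(a: SNP, b: SNP) -> Optional[int]:
    """Distance in base pairs, or None if the SNPs are on different chromosomes."""
    if a.chromosome != b.chromosome:
        return None  # the distance is only defined within one chromosome
    return abs(a.position - b.position)


# Made-up SNPs for illustration only.
snp_a = SNP("snp_a", chromosome=1, position=1000)
snp_b = SNP("snp_b", chromosome=1, position=4500)
snp_c = SNP("snp_c", chromosome=2, position=1200)

print(distance(snp_a, snp_b))  # same chromosome: a number of base pairs
print(distance(snp_a, snp_c))  # different chromosomes: None
```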

The second structure I tried to use is phase information, or haplotypes. We have 22 pairs of autosomal (non-sex) chromosomes, so each autosomal SNP is present twice in each individual. Because of the way genotyping technology works, we do not have access to the separate sequences of the two chromosomes of a pair, but only to the genotypes. To make this clearer:


Knowing the genotype does not allow us to distinguish between the two scenarios. You should note that the two black lines are the two different chromosomes of the pair, one coming from the mother and one from the father.

OK, but why is this information important? Suppose that the two SNPs are located in the coding sequence of the same gene. Further suppose that SNP1=A and SNP2=C are nonsense mutations, each of which results in a dysfunctional protein. In the scenario on the left, the two mutations are on the same chromosome, and therefore the other chromosome will produce a healthy version of the protein. In the scenario on the right, on the other hand, both copies of the gene are dysfunctional, the healthy protein will not be produced, and the individual will be sick. This is called compound heterozygosity.
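The two points above, that genotypes hide the phase and that the phase changes the outcome, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text names just the mutant alleles (SNP1=A, SNP2=C), so the other alleles (G and T) are made up here, and the function names are hypothetical.

```python
# Mutant allele at SNP1 and at SNP2, as named in the text.
MUTANT = ("A", "C")


def genotype(hap1, hap2):
    """Unordered pair of alleles at each SNP: this is all that genotyping reports."""
    return tuple(tuple(sorted(pair)) for pair in zip(hap1, hap2))


def has_working_copy(hap1, hap2):
    """True if at least one chromosome carries neither mutant allele."""
    return any(
        all(allele != mutant for allele, mutant in zip(hap, MUTANT))
        for hap in (hap1, hap2)
    )


# Each scenario is a pair of chromosomes; each chromosome is (SNP1, SNP2).
# The non-mutant alleles G and T are assumptions made for illustration.
left = (("A", "C"), ("G", "T"))   # both mutations on the same chromosome
right = (("A", "T"), ("G", "C"))  # one mutation on each chromosome

print(genotype(*left) == genotype(*right))  # the genotypes are identical
print(has_working_copy(*left))   # one chromosome is free of mutations
print(has_working_copy(*right))  # both copies of the gene are hit
```

The genotypes are the same in both scenarios, yet only the left one keeps an intact copy of the gene, which is why the phase carries information that the genotype alone does not.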

With all this, you are equipped to read my first preprint!



27 May 2016 · 11 h 18 min

David Ledbetter at Collège de France

Earlier today, I attended a lecture by David H. Ledbetter at the Collège de France. He had been invited by Jean-Louis Mandel, who holds the human genetics chair at that glorious institution.

He talked about his experience of building a large cohort with Whole Exome Sequencing (WES) at Geisinger Health System. He used to work in academia but was recruited by Glenn D. Steele, another former academic, to lead a large genomics program at Geisinger.

What is Geisinger?

Geisinger is a not-for-profit organisation that offers health insurance and also runs large hospitals. It mainly covers a rural area of Pennsylvania where the biggest city is Scranton, the city where the TV series The Office was set to symbolize small-town America. However, the headquarters of Geisinger are not even in Scranton, but in Danville, which has a population of around 5,000 inhabitants. So, not the most glamorous place!

Geisinger seems to have many advantages for the success of a large cohort:

Continue reading
