The different lines of evidence in epidemiology

Epidemiology is the study of diseases in populations. For example, John Snow, a precursor of epidemiology, understood that an outbreak of cholera in London was due to contaminated water.

However, epidemiology is not limited to infectious diseases. For example, the study of type 1 diabetes (T1D) in the human population falls within the field of epidemiology. T1D is an autoimmune disease that mainly affects children and results in the destruction of the insulin-producing beta cells in the pancreas. The treatment is to inject insulin several times a day for the rest of the patient’s life. The first things epidemiologists study are the incidence (number of new cases per unit of time) and the prevalence (total number of cases in the population) of a disease. For T1D in France, incidence is 13.5 new cases per 100,000 children under 15 per year and prevalence is around 2 per 1,000 people. Incidence after the age of 15 is not zero but is one order of magnitude lower.
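To make the two definitions concrete, here is a minimal sketch in Python. The function names and the raw counts are my own assumptions, picked only so that the results land on the orders of magnitude quoted above; they are not real registry data.

```python
def incidence_rate(new_cases, population_at_risk, years=1.0, per=100_000):
    """New cases per `per` people at risk per year."""
    return new_cases / (population_at_risk * years) * per


def prevalence(existing_cases, population, per=1_000):
    """Existing cases per `per` people at a given point in time."""
    return existing_cases / population * per


# Hypothetical counts, chosen only so that the results match the
# orders of magnitude quoted in the text (not real registry data).
children_under_15 = 12_000_000
new_cases_in_one_year = 1_620
total_population = 65_000_000
people_living_with_t1d = 130_000

t1d_incidence = incidence_rate(new_cases_in_one_year, children_under_15)
t1d_prevalence = prevalence(people_living_with_t1d, total_population)

print(f"Incidence: {t1d_incidence:.1f} per 100,000 children per year")
print(f"Prevalence: {t1d_prevalence:.1f} per 1,000 people")
```

Note that incidence is a rate (it has a time unit) while prevalence is a proportion measured at one point in time, which is why only the first function takes a number of years.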

A second question that interests epidemiologists is what causes a disease. Genetic causes have been investigated using the genome-wide association study design (cf. earlier post). Here, I will present the different kinds of studies that can be done to try and understand the environmental determinants of a disease. I will go from the study design that provides the weakest evidence and is the least expensive to the study design that provides the strongest evidence but is the most expensive.

Ecological study

Filed under introductory, Review

Haplotype based genetic risk estimation

I have recently submitted an article with the same title as this blog post, and it is already accessible as a preprint. A preprint is a scientific paper that has not yet been peer-reviewed. The idea behind publishing preprints is that research becomes available more quickly than if you have to wait for the sometimes lengthy peer-review process to take place.

My idea when I decided to start this blog was to be able to give context to my work. A scientific publication is not designed to be understandable to the layman (and most of the time, it is hidden behind a paywall). A blog is therefore useful as an antechamber to the scientific literature. Also, shameless self-promotion.

If you want to try and understand what the preprint is about you should follow the links to blog posts in the following paragraphs.

I have used Genome-Wide Association Studies (GWAS) data in order to try and predict genetic risk of disease using machine learning techniques. In particular, I have combined lasso regression and random forests. Compared to more traditional approaches, I have used biological structure to try and improve predictions, namely chromosomal distance and phase information.

Chromosomal distance is simply the fact that SNPs (single-nucleotide polymorphisms) have a physical location on chromosomes and you can therefore define a distance measured in base pairs between two SNPs that are on the same chromosome. This structure was exploited in T-trees.
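As a small illustration of this definition, here is a sketch in Python. The `SNP` record and the function name are my own assumptions for the example, and the SNP names and positions are made up.

```python
from typing import NamedTuple, Optional


class SNP(NamedTuple):
    name: str
    chromosome: int
    position: int  # base-pair coordinate on the chromosome


def chromosomal_distance(a: SNP, b: SNP) -> Optional[int]:
    """Distance in base pairs, or None if the SNPs are on different chromosomes."""
    if a.chromosome != b.chromosome:
        return None
    return abs(a.position - b.position)


# Made-up SNPs for illustration only.
snp_a = SNP("snp_a", chromosome=6, position=31_000_000)
snp_b = SNP("snp_b", chromosome=6, position=31_250_000)
snp_c = SNP("snp_c", chromosome=11, position=2_000_000)

print(chromosomal_distance(snp_a, snp_b))  # same chromosome: a number of base pairs
print(chromosomal_distance(snp_a, snp_c))  # different chromosomes: undefined
```

Returning `None` for SNPs on different chromosomes is a design choice: the distance is simply not defined in that case, and a sentinel such as infinity would hide that fact.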

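The second structure I tried to use is phase information, or haplotypes. We have 22 pairs of autosomal (non-sex) chromosomes. Each autosomal SNP is therefore present twice in each individual. Because of the way the technology works, we do not have access to the two sequences of the two chromosomes of the same pair but only to the genotypes. To make this clearer, take two SNPs at which an individual is heterozygous: the two alleles of interest can sit either on the same chromosome of the pair (scenario on the left) or on different chromosomes (scenario on the right).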


Knowing the genotype does not allow us to distinguish between the two scenarios. You should note that the two black lines are the two different chromosomes of the pair, one coming from the mother and one from the father.

OK, but why is this information important? Suppose that the two SNPs are located in the coding sequence of the same gene. Further suppose that SNP1=A and SNP2=C are nonsense mutations that each result in a dysfunctional protein. In the scenario on the left, the two mutations are on the same chromosome and therefore the other chromosome will produce a healthy version of the protein. In the scenario on the right, on the other hand, both copies of the gene are dysfunctional, the healthy protein will not be produced and the individual will be sick. This is called compound heterozygosity.
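The two scenarios can be written down as a toy example in Python. This is only a sketch: the alleles G and T for the healthy versions are hypothetical (only A at SNP1 and C at SNP2 come from the text above), and the function names are mine.

```python
def genotype(hap1, hap2):
    """Unordered pair of alleles at each SNP: this is all that genotyping observes."""
    return [tuple(sorted(pair)) for pair in zip(hap1, hap2)]


def has_functional_copy(hap1, hap2, deleterious):
    """True if at least one chromosome carries none of the deleterious alleles."""
    return any(
        all(allele != bad for allele, bad in zip(hap, deleterious))
        for hap in (hap1, hap2)
    )


# A at SNP1 and C at SNP2 are the deleterious alleles;
# G and T are assumed healthy alleles, chosen for illustration.
deleterious = ("A", "C")

same_chromosome = (("A", "C"), ("G", "T"))        # both mutations on one chromosome
different_chromosomes = (("A", "T"), ("G", "C"))  # one mutation on each chromosome

# The observed genotypes are identical in the two scenarios...
print(genotype(*same_chromosome))
print(genotype(*different_chromosomes))

# ...but only the first scenario leaves a working copy of the gene.
print(has_functional_copy(*same_chromosome, deleterious))
print(has_functional_copy(*different_chromosomes, deleterious))
```

The first two lines print the same genotype, which is exactly why the phase has to be inferred: the biologically relevant difference only shows up in the last two lines.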

With all this, you are equipped to read my first preprint!

27 May 2016 · 11 h 18 min

David Ledbetter at Collège de France

Earlier today, I attended a lecture by David H. Ledbetter at the Collège de France. He was invited by Jean-Louis Mandel, who holds the human genetics chair at that glorious institution.

He talked about his experience of building a large cohort with Whole Exome Sequencing (WES) at Geisinger Health System. He used to work in academia but was recruited by Glenn D. Steele, another former academic, to lead a large genomics program at Geisinger.

What is Geisinger?

Geisinger is a not-for-profit organisation that offers health insurance and also runs large hospitals. It mainly covers a rural area of Pennsylvania where the biggest city is Scranton, the city where the TV series The Office was set to symbolize small-town America. However, the headquarters of Geisinger are not even in Scranton, but in Danville, which has around 5,000 inhabitants. So, not the most glamorous place!

Geisinger seems to have many advantages for the success of a large cohort:

Filed under Uncategorized

SMPGD 2016: Microbial genetics and miRNA

I was in Lille on Thursday and Friday for an intense conference on Statistical Models for Post-Genomic Data. Two main themes emerged: the genetics of bacteria and viruses, and change-point detection. I’ll just talk about the first one and about an unrelated talk on miRNA.

Viral evolutionary inference

Philippe Lemey showed us how sequencing virus genomes can be used to retrace the spatio-temporal evolution of diseases. By sequencing viruses, you can reconstruct their phylogeny and therefore find out where a virus came from. This makes it possible to understand the dynamics of an epidemic in a much more precise way. See for example the spread of H1N1. He also showed us his results on Ebola, which is the first epidemic to be sequenced as it unfolds. This showed how the disease went from district to district. His work was retrospective, as he pooled the data of different teams, and he stressed the importance of efficient data sharing. His work shows how the epidemic propagates and therefore helps to understand which public health measures are effective.

GWAS for bacteria

Genome-wide association studies can help discover the genetic determinants of traits. But this idea is not limited to humans. One of the main traits of interest in bacteria is resistance to antibiotics. However, bacterial genomes are very challenging in several ways:

Filed under Review

Unrealistic standards of beauty for data in statistics class

When you follow a statistics class, data is perfect and you can apply all kinds of fancy algorithms and procedures to it to get to the truth. And sometimes you even have theoretical justifications for them. But the first time you encounter real data, you are shocked: there are holes in the data!


This is what actual data looks like. By Dieter Seeger (CC BY-SA 2.0), via Wikimedia Commons

You have missing values encoded by NA in all data. And you can’t just take all the observations that have no NAs: you would end up with nothing. A first step is to exclude variables and observations that have too many missing values. This process is called quality control or QC. Once you have given it this name, it seems difficult to argue for less quality control. But we could also call it Throwing Expensive Data Away. It is all a matter of perspective.
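Here is a minimal sketch of such a QC step in plain Python, with `None` standing in for NA. The function names, the toy data and the 50% thresholds are all assumptions made for the example; real pipelines pick their own thresholds, and filtering variables before observations is just one possible order.

```python
def missing_rate(values):
    """Fraction of missing (None) entries; 0.0 for an empty list."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def quality_control(rows, max_missing_variable=0.5, max_missing_observation=0.5):
    """Drop variables (columns), then observations (rows), with too many NAs."""
    n_cols = len(rows[0])
    kept_cols = [
        j for j in range(n_cols)
        if missing_rate([row[j] for row in rows]) <= max_missing_variable
    ]
    filtered = [[row[j] for j in kept_cols] for row in rows]
    kept_rows = [row for row in filtered if missing_rate(row) <= max_missing_observation]
    return kept_rows, kept_cols


# Toy data: 4 observations (rows) of 4 variables (columns), None = NA.
data = [
    [1, None, 0, 2],
    [0, None, 1, 1],
    [None, None, None, 2],
    [2, 1, 0, None],
]

complete_cases = [row for row in data if None not in row]
clean_rows, clean_cols = quality_control(data)

print(len(complete_cases))  # observations without any NA
print(clean_cols)           # indices of the variables that survive QC
print(clean_rows)           # observations that survive QC (NAs may remain)
```

In this toy data set, every observation has at least one NA, so keeping only complete cases leaves nothing, while QC throws away one variable and one observation and still leaves some NAs to deal with.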

Even after you throw away the observations and variables with…

Filed under introductory

The sleeping beauty in the random forest: T-Trees

A sleeping beauty is an article that is undervalued and rarely cited at first, and then awakens later and is recognised as important. The most famous one is the work of Mendel, which was published in 1866 and rediscovered 34 years later. To learn more about this concept, you can read this or that.

Today, I want to talk about a paper that is a bit too young to be a sleeping beauty but that seems undervalued:

Exploiting SNP Correlations within Random Forest for Genome-Wide Association Studies (2014) by Vincent Botta, Gilles Louppe, Pierre Geurts, Louis Wehenkel

The four authors are from Liège in Belgium. Their team is known for its work on random forests (if you do not know what that is, you can read my earlier blog post on the subject). Geurts and Wehenkel have proposed a variant of random forests called extremely randomized trees. Gilles Louppe is the one who implemented random forests for the scikit-learn package for Python (which I use. Thanks!). The first author, Vincent Botta, left academia after his PhD and went to work for a company (a start-up in newspeak).


The idea of the paper is to use biological structure in order to increase prediction accuracy. The additional structure used here is chromosomal distance. A SNP is located on a chromosome and it has neighbours. This information can be useful in several ways:


Filed under Review

Transcriptomics and the stochasticity of biological cells

I just came home from a two-day conference in Evry, 30 km south of Paris. Evry hosts a biocluster centred on genetics. It produced the first genetic map in the 90s, which inspired the Human Genome Project, and it is also where the French contribution to the Human Genome Project (the sequencing of chromosome 14) took place.

I won’t be as rigorous as I usually aim to be, I will just try to give you a flavour of some of the talks.

Transcriptome from micro-arrays to RNA-seq by François Cambien

The transcriptome is the set of RNA molecules produced when genes are expressed. According to the central dogma of molecular biology, DNA is transcribed into RNA in the nucleus. The RNA then moves to the cytoplasm, where it is translated into proteins. Transcriptomics is therefore the study of RNA abundance.

François Cambien works on heart diseases

Filed under Uncategorized

The p-value as a stopping criterion

An interesting conversation is taking place in science about replicability and reproducibility of results and the use and misuse of statistics. A very well written introductory article on the subject and other problems of contemporary science is available at Science isn’t broken.

A recent scientific article tried to replicate the findings of psychological science articles and managed to replicate only 36% of the significant results instead of the 95% that one might naively expect. Jeff Leek had a more positive view and showed that 77% of the replicated effect sizes were in the 95% confidence interval of the original study (EDIT: actually, the prediction interval, which also takes into account the uncertainty in the replication sample).

If you want a reminder of what a p-value is you can look at the introduction of my earlier post.

In that Jeff Leek post, I also discovered a very interesting article: The garden of forking paths. The basic idea is that a scientific hypothesis can translate into many different statistical hypotheses. The researcher will perform only one test, but his choice of test will depend on the data he collected. He will first look at the data and tune his hypothesis to it, not necessarily in a dishonest way. The problem is that the p-value the test produces will not offer the control over false positives that it should. Had the data been different, another test would have been chosen and could have obtained a significant result. This is a very valid criticism and it reflects well how the scientific process works. We collect some data with some idea of what we are looking for and then look at the data to try and translate the idea into a statistical framework. What Gelman suggests is that we should do this in a first step and then try and replicate our precise statistical hypothesis in a second round of data collection.

This reflection on the way science is done and statistics are used led me to other thoughts on the subject. Now let us assume that we have a very specific hypothesis but that data collection is very expensive and slow. The scientific team wants to publish their results but would also like to have enough money left to present the results at this conference in a luxurious hotel in Hawaii. So they collect…

Filed under introductory

Everything is not linear: the example of Random Forest

Linear regression is great. But unfortunately, not everything in nature is linear. If you drink alcohol, you get drunk. If you take your prescribed drugs, you are healthy. But if you do both at the same time, you will not be drunk and healthy, you will probably get very sick. This is an interaction. In general, we talk about interaction when there is a departure from linearity, that is, when the effect of two factors together is not the sum of their separate effects. There are many ways to try and capture interactions using statistical learning but today, I will focus on Random Forest. But before I explain what a forest is, I have to explain what a decision tree is.
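To put numbers on the alcohol and drugs example, here is a small sketch in Python. The "sickness scores" are entirely made up, and the function names are mine; the point is only to show what a departure from the additive prediction looks like.

```python
# Hypothetical "sickness scores" (higher = sicker) for the four
# combinations of (alcohol, drugs). The numbers are made up.
observed = {
    (0, 0): 2.0,  # neither: the untreated illness
    (1, 0): 3.0,  # alcohol only: drunk and still ill
    (0, 1): 0.0,  # drugs only: healthy
    (1, 1): 9.0,  # both at once: very sick
}


def additive_prediction(outcomes, alcohol, drugs):
    """What a purely additive (linear) model predicts from the separate effects."""
    baseline = outcomes[(0, 0)]
    effect_alcohol = outcomes[(1, 0)] - baseline
    effect_drugs = outcomes[(0, 1)] - baseline
    return baseline + alcohol * effect_alcohol + drugs * effect_drugs


def interaction(outcomes):
    """Gap between what is observed for both factors and the additive prediction."""
    return outcomes[(1, 1)] - additive_prediction(outcomes, 1, 1)


print(additive_prediction(observed, 1, 1))  # what "drunk and healthy" would look like
print(interaction(observed))                # non-zero: the two effects do not simply add up
```

With these numbers, the additive model predicts a mild score for someone who takes both, while the observed score is much higher. A model without an interaction term cannot represent that gap, whatever its coefficients.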

“Erik – Prunus sp 02” by Zeynel Cebeci – Own work. Licensed under CC BY-SA 4.0 via Wikimedia Commons

The good people at did a great job of explaining what a decision tree is in a very visual way. So click here and go look at it. Also, subtle Star Wars reference.


Filed under introductory

How good are we at genetic risk prediction of complex diseases?

Using a hammer to wash the dishes

Statistical procedures offer control over uncertainty. For example, the Bonferroni correction and other corrections for multiple testing make it possible to control the Family-Wise Error Rate (FWER). The Family-Wise Error Rate is the probability of reporting at least one false positive when performing many tests. A false positive is a variable that follows the null hypothesis (SNP not associated with the disease) but is reported as significant.

A statistical test is a way to say whether a null hypothesis should be rejected or not, e.g. H_0: this SNP is independent of the disease. The p-value is the probability, under the null hypothesis, of observing the data or something more extreme. The more tests you perform, the more likely you are to obtain by chance a p-value smaller than the dreaded 0.05. In fact, if all the variables you test follow the null hypothesis, on average 1 in 20 will have a p-value smaller than 0.05. The Bonferroni correction simply divides the cut-off for the p-values by the number of tests being performed. This way, the probability of having at least one false positive in the list of all the significant variables is at most 0.05 (the chosen cut-off). This does not mean that the rest is not associated, of course.

This is very conservative and sometimes you may want a more relaxed control over uncertainty (mainly if you do not have significant results with Bonferroni). One example is the Benjamini-Hochberg procedure, which controls the False Discovery Rate (FDR but not Franklin Delano Roosevelt), i.e. the expected proportion of false positives in your list of findings. If you control for an FDR of 0.05 and you have 40 significant results, you can expect about two of them to be false positives.
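The two procedures fit in a few lines of Python. This is a sketch written for this post, not a reference implementation: the function names and the ten p-values are made up for illustration, and in practice you would probably rely on a statistics library instead.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H_0 when p <= alpha / number of tests (controls the FWER)."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure: reject the k smallest p-values, where k is the
    largest rank such that p_(k) <= k / m * alpha (controls the FDR)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


# Made-up p-values for ten tests.
p_values = [0.001, 0.008, 0.012, 0.019, 0.04, 0.3, 0.6, 0.9, 0.95, 0.99]

n_bonferroni = sum(bonferroni(p_values))
n_bh = sum(benjamini_hochberg(p_values))

print(n_bonferroni)  # number of findings under FWER control
print(n_bh)          # number of findings under FDR control
```

On these p-values, Bonferroni keeps a single finding while Benjamini-Hochberg keeps four, which illustrates how much more conservative FWER control is. The price is that some of the four may be false positives, whereas the Bonferroni list is very unlikely to contain any.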

All this to say that the answer you get from the data depends on the question you ask. The missing heritability problem is (to some extent) a failure to grasp this simple notion. The GWAS significant SNPs in aggregate explain a small proportion of the heritability of the disease simply because they form a restrictive list that was built to allow for FWER control and not chosen to maximize predictive accuracy. There are many false negatives. When trying to explain heritability or predict a disease, we leave the realm of statistical tests and enter the joyous land of statistical learning. And therefore, we will use the computationally efficient, theoretically well-understood and sparse lasso. The following review is not exhaustive and you are welcome to complete it in the comments section.

Lasso for GWAS: a review


Filed under Review