# Category Archives: Review

## The different lines of evidence in epidemiology

Epidemiology is the study of diseases in populations. For example, John Snow, a pioneer of epidemiology, understood that an outbreak of cholera in London was caused by contaminated water.

However, epidemiology is not limited to infectious diseases. For example, the study of type 1 diabetes (T1D) in the human population falls within the field of epidemiology. T1D is an autoimmune disease that mainly affects children and results in the destruction of the insulin-producing beta cells in the pancreas. The treatment for the disease is to inject insulin several times a day for the rest of a patient’s life. The first things epidemiologists study are the incidence (number of new cases per unit of time) and the prevalence (total number of cases in the population) of a disease. For T1D in France, incidence is 13.5 new cases per 100,000 children under 15 per year and prevalence is around 2 per 1,000 people. Incidence after age 15 is not zero, but it is one order of magnitude lower.
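To make the two definitions concrete, here is a minimal sketch in Python. The function names and the raw counts are hypothetical: the counts were chosen only so that the results match the French T1D figures quoted above, and they are not real surveillance data.

```python
def incidence_rate(new_cases, population_at_risk, years=1.0, per=100_000):
    """New cases per `per` people at risk, per year."""
    return new_cases / (population_at_risk * years) * per


def prevalence(existing_cases, population, per=1_000):
    """Existing cases per `per` people at a given point in time."""
    return existing_cases / population * per


# Hypothetical counts, picked to reproduce the rates quoted in the text.
children_under_15 = 200_000   # made-up population at risk
new_t1d_cases = 27            # made-up new diagnoses over one year
print(incidence_rate(new_t1d_cases, children_under_15))

whole_population = 65_000     # made-up population
living_with_t1d = 130         # made-up number of existing cases
print(prevalence(living_with_t1d, whole_population))
```

Incidence is a flow (it needs a time window), whereas prevalence is a stock (it is a snapshot), which is why only the first function takes a `years` argument.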

A second question that interests epidemiologists is what causes a disease. Genetic causes have been investigated using the genome-wide association study design (cf. earlier post). Here, I will present the different kinds of studies that can be used to try to understand the environmental determinants of a disease. I will go from the study design that provides the weakest evidence and is the least expensive to the one that provides the strongest evidence but is the most expensive.

Filed under introductory, Review

## SMPGD 2016: Microbial genetics and miRNA

I was in Lille on Thursday and Friday for an intense conference on Statistical Models for Post-Genomic Data. Two main themes emerged: the genetics of bacteria and viruses, and change-point detection. I’ll just talk about the first one and an unrelated talk on miRNA.

Viral evolutionary inference

Philippe Lemey showed us how sequencing virus genomes can be used to retrace the spatio-temporal evolution of diseases. By sequencing viruses, you can reconstruct their phylogeny and therefore find out where a virus came from. This makes it possible to understand the dynamics of an epidemic much more precisely. See for example the spread of H1N1. He also showed us his results on Ebola, the first epidemic to be sequenced as it unfolded. These results showed how the disease moved from district to district. His work was retrospective, as he pooled the data of different teams, and he stressed the importance of efficient data sharing. This kind of analysis shows how an epidemic propagates and therefore helps to understand which public health measures are effective.

GWAS for bacteria

Genome-wide association studies can help discover the genetic determinants of traits. But this idea is not limited to humans. One of the main traits of interest in bacteria is resistance to antibiotics. However, bacterial genomes are very challenging in several ways:

Filed under Review

## The sleeping beauty in the random forest: T-Trees

A sleeping beauty is an article that is undervalued and rarely cited at first, then awakens later and is recognised as important. The most famous one is the work of Mendel, which was published in 1866 and rediscovered 34 years later. To learn more about this concept, you can read this or that.

Today, I want to talk about a paper that is a bit too young to be a sleeping beauty but that seems undervalued:

Exploiting SNP Correlations within Random Forest for Genome-Wide Association Studies (2014) by Vincent Botta, Gilles Louppe, Pierre Geurts, Louis Wehenkel

The four authors are from Liège in Belgium. Their team is known for its work on random forests (if you do not know what that is, you can read my earlier blogpost on the subject). Geurts and Wehenkel have proposed a variant of random forests called extremely randomized trees. Gilles Louppe is the one who implemented random forests for the scikit-learn package for Python (which I use. Thanks!). The first author, Vincent Botta, left academia after his PhD and went to work for a company (a start-up in newspeak).

Motivation

The idea of the paper is to use biological structure to increase prediction accuracy. The additional structure used here is chromosomal distance. A SNP is located on a chromosome and it has neighbours. This information can be useful in several ways:

Filed under Review

## How good are we at genetic risk prediction of complex diseases?

Using a hammer to wash the dishes

Statistical procedures offer control over uncertainty. For example, the Bonferroni correction and other corrections for multiple testing make it possible to control the Family-Wise Error Rate (FWER). The Family-Wise Error Rate is the probability of reporting at least one false positive when performing many tests. A false positive is a variable that follows the null hypothesis (SNP not associated with the disease) but is reported as significant.

A statistical test is a way to decide whether a null hypothesis should be rejected or not, e.g. $H_0$: this SNP is independent of the disease. The p-value is the probability, under the null hypothesis, of observing data at least as extreme as what was actually observed. The more tests you perform, the more likely you are to obtain by chance a p-value smaller than the dreaded 0.05. In fact, if all the variables you test follow the null hypothesis, on average 1 in 20 will have a p-value smaller than 0.05. The Bonferroni correction simply divides the p-value cut-off by the number of tests being performed. This way, the probability of having at least one false positive in the list of significant variables is at most 0.05 (the chosen cut-off). Of course, this does not mean that the remaining variables are not associated. This correction is very conservative, and sometimes you may want a more relaxed control over uncertainty (mainly when nothing is significant after Bonferroni). One example is the Benjamini-Hochberg procedure, which controls the False Discovery Rate (FDR, but not Franklin Delano Roosevelt), i.e. the expected proportion of false positives in your list of findings. If you control the FDR at 0.05 and you have 40 significant results, you can expect about two of them to be false positives.
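The two corrections described above can be sketched in a few lines of Python. This is only an illustration: the function names are mine, the p-values are made up, and the FWER formula assumes independent tests that all follow the null hypothesis.

```python
def fwer_independent(alpha, m):
    """Probability of at least one p-value below alpha among m independent
    tests that all follow the null hypothesis (independence is assumed)."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni(p_values, alpha=0.05):
    """Reject test i when p_i <= alpha / m; controls the FWER at alpha."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure: find the largest rank k such that the k-th smallest
    p-value is <= k / m * alpha, then reject the k smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Made-up p-values for ten hypothetical SNPs.
p_values = [0.001, 0.008, 0.012, 0.03, 0.04, 0.2, 0.5, 0.9, 0.6, 0.7]

print(fwer_independent(0.05, 20))         # chance of a false positive in 20 null tests
print(sum(bonferroni(p_values)))          # number of SNPs kept by Bonferroni
print(sum(benjamini_hochberg(p_values)))  # number of SNPs kept by Benjamini-Hochberg
```

On this toy list, Bonferroni keeps fewer SNPs than Benjamini-Hochberg, which is the trade-off described above: a stricter guarantee (no false positive at all, with high probability) against a longer list that tolerates a small fraction of false positives.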

All this to say that the answer you get from the data depends on the question you ask. The missing heritability problem is (to some extent) a failure to grasp this simple notion. The GWAS-significant SNPs in aggregate explain a small proportion of the heritability of the disease simply because they form a restrictive list built for FWER control, not chosen to maximize predictive accuracy. There are many false negatives. When trying to explain heritability or predict a disease, we leave the realm of statistical tests and enter the joyous land of statistical learning. We will therefore use the lasso, which is computationally efficient, theoretically well understood and sparse. The following review is not exhaustive, and you are welcome to complete it in the comments section.
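To show what "sparse" means for the lasso, here is a minimal sketch in plain Python of coordinate descent with soft-thresholding, minimizing (1/2n)·||y − Xβ||² + λ·||β||₁. It is not the implementation used in any of the papers reviewed; the function names, the tiny design matrix and the value of λ are all made up for illustration, and the code assumes no column of X is entirely zero.

```python
def soft_threshold(z, lam):
    """Shrink z towards zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for the lasso.
    X is a list of n rows with p features each, y a list of n responses.
    Assumes every column of X has at least one non-zero entry."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0    # correlation of feature j with the partial residual
            norm = 0.0   # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (norm / n)
    return beta


# Made-up toy data: three orthogonal "SNP" columns, one strong effect,
# one very weak effect and one null effect.
X = [[1, 1, 1],
     [1, -1, 1],
     [1, 1, -1],
     [1, -1, -1]]
y = [2.05, 1.95, 2.05, 1.95]   # equals 2 * column 1 + 0.05 * column 2

print(lasso_coordinate_descent(X, y, lam=0.1))
```

The weak and null coefficients are set exactly to zero rather than merely shrunk, while the strong one is kept (shrunk by λ). This is the sense in which the lasso selects variables for prediction instead of testing them one by one.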

Lasso for GWAS: a review