Category Archives: introductory

The different lines of evidence in epidemiology

Epidemiology is the study of diseases in populations. For example, John Snow, a precursor of epidemiology, understood that an outbreak of cholera in London was due to contaminated water:

However, epidemiology is not limited to infectious diseases. For example, the study of type 1 diabetes (T1D) in the human population falls within the field of epidemiology. T1D is an autoimmune disease that typically appears in childhood and results in the destruction of the insulin-producing beta cells of the pancreas. The treatment is to inject insulin several times a day for the rest of the patient's life. The first things epidemiologists study are the incidence (number of new cases per unit of time) and the prevalence (total number of cases in the population) of a disease. For T1D in France, the incidence is 13.5 new cases per 100,000 children under 15 per year, and the prevalence is around 2 per 1,000 people. Incidence after age 15 is not zero, but it is one order of magnitude lower.
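To make these two quantities concrete, here is a back-of-the-envelope sketch using the figures above; the city and its age structure are hypothetical, chosen only for illustration.

```python
# Figures from the text: French T1D incidence and prevalence.
incidence_under_15 = 13.5 / 100_000   # new cases per child under 15, per year
prevalence = 2 / 1_000                # proportion of the whole population with T1D

# Hypothetical city: 1,000,000 inhabitants, 180,000 of them under 15.
population = 1_000_000
children_under_15 = 180_000

# Incidence is a rate: it tells us how many NEW cases to expect per year.
expected_new_cases_per_year = incidence_under_15 * children_under_15
# Prevalence is a stock: it tells us how many people are living with the disease.
expected_total_cases = prevalence * population

print(f"expected new cases per year (under 15): {expected_new_cases_per_year:.1f}")
print(f"expected people living with T1D: {expected_total_cases:.0f}")
```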

A second question that epidemiologists are interested in is the causes of diseases. Genetic causes have been investigated using the genome-wide association study design (cf. earlier post). Here, I will present the different kinds of studies that can be done to try and understand the environmental determinants of a disease, starting from the design that provides the weakest evidence but is the least expensive, and moving to the design that provides the strongest evidence but is the most expensive.

Ecological study



Filed under introductory, Review

Unrealistic standards of beauty for data in statistics class

When you follow a statistics class, the data is perfect and you can apply all kinds of fancy algorithms and procedures to it to get to the truth. Sometimes you even have theoretical justifications for them. But the first time you encounter real data, you are shocked: there are holes in it!


This is what actual data looks like. By Dieter Seeger, CC BY-SA 2.0, via Wikimedia Commons

You have missing values, encoded as NA, in all real data. And you can't just keep the observations that have no NAs: you would end up with nothing. A first step is to exclude variables and observations that have too many missing values. This process is called quality control, or QC. Once you give it this name, it seems difficult to argue for less quality control. But we could also call it Throwing Expensive Data Away. It is all a matter of perspective.
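A small simulated sketch of the trade-off: the 20% missingness threshold and the hole pattern are invented for illustration, but they show why complete-case analysis throws away far more data than a column-first QC step.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Punch random holes: ~15% missing everywhere, one variable much worse.
X[rng.random(X.shape) < 0.15] = np.nan
X[rng.random(200) < 0.60, 3] = np.nan   # variable 3 is >60% missing

# Naive approach: keep only rows with no NA at all (complete-case analysis).
complete_rows = ~np.isnan(X).any(axis=1)
print("complete cases:", complete_rows.sum(), "out of", X.shape[0])

# QC instead: first drop variables with >20% missing, then rows with any NA left.
col_missing = np.isnan(X).mean(axis=0)
X_qc = X[:, col_missing <= 0.20]
X_qc = X_qc[~np.isnan(X_qc).any(axis=1)]
print("after QC:", X_qc.shape)
```

Dropping the one very holey variable first rescues many observations that would otherwise be thrown away with it.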

Even after you throw away the observations and variables with …


Filed under introductory

The p-value as a stopping criterion

An interesting conversation is taking place in science about the replicability and reproducibility of results and the use and misuse of statistics. A very well-written introductory article on the subject, and on other problems of contemporary science, is available at Science isn't broken.

A recent scientific article tried to replicate the findings of psychological science articles and managed to replicate only 36% of the significant results, instead of the 95% we would expect. Jeff Leek had a more positive view and showed that 77% of the replicated effect sizes were in the 95% confidence interval of the original study (EDIT: actually, the 95% prediction interval, which also takes into account the uncertainty in the replication sample).

If you want a reminder of what a p-value is you can look at the introduction of my earlier post.

In that Jeff Leek post, I also discovered a very interesting article: The garden of forking paths. The basic idea is that a single scientific hypothesis can translate into many different statistical hypotheses. The researcher will perform only one test, but the choice of test depends on the data collected: he first looks at the data and tunes his hypothesis to it, not necessarily in a dishonest way. The problem is that the p-value the test produces will not offer the control over false positives that it should: had the data been different, another test would have been chosen and might have produced a significant result. This is a very valid criticism, and it accurately reflects how the scientific process works. We collect some data with some idea of what we are looking for, and then look at the data to try and translate that idea into a statistical framework. What Gelman suggests is that we do this as a first step, and then try to replicate our now precise statistical hypothesis on a second round of data collection.
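A toy simulation makes Gelman's point concrete. The setup is invented: everything is drawn under the null, and the "forking path" is reduced to its simplest form, a researcher who looks at two subgroups and tests only the one whose mean looks furthest from zero. Each individual test is a perfectly valid 5% z-test, yet the data-dependent choice inflates the false positive rate.

```python
import math
import random

random.seed(42)

def z_test_p(xs):
    """Two-sided p-value for H0: mean = 0, with known sigma = 1 (z-test)."""
    z = sum(xs) / math.sqrt(len(xs))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

n_sim, n = 5_000, 30
false_pos = 0
for _ in range(n_sim):
    # Two subgroups, both drawn under the null hypothesis.
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    # The forking path: look at the data first, then test only the
    # subgroup whose mean looks furthest from zero.
    chosen = a if abs(sum(a)) > abs(sum(b)) else b
    if z_test_p(chosen) < 0.05:
        false_pos += 1

rate = false_pos / n_sim
print(f"false positive rate: {rate:.3f}")  # ~0.10 instead of the nominal 0.05
```

With two candidate tests the rate roughly doubles (about 1 - 0.95² ≈ 0.0975); real analyses offer far more than two forks.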

This reflection on the way science is done and statistics are used led me to other thoughts on the subject. Now let us assume that we have a very specific hypothesis, but that data collection is very expensive and slow. The scientific team wants to publish its results, but would also like to have enough money left to present them at this conference in a luxurious hotel in Hawaii. So they collect …
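The scenario the post is heading toward can be simulated directly; the batch sizes and number of peeks below are assumptions for illustration. The team collects data in small batches, computes a p-value after each batch, and stops (and publishes) as soon as p < 0.05. Even when the null hypothesis is true, using the p-value as a stopping criterion pushes the false positive rate well above 5%.

```python
import math
import random

random.seed(1)

def z_test_p(xs):
    # Two-sided p-value for H0: mean = 0, with known sigma = 1.
    z = sum(xs) / math.sqrt(len(xs))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

n_sim, batch, max_batches = 5_000, 10, 10
stopped_early = 0
for _ in range(n_sim):
    data = []
    for _ in range(max_batches):
        # Collect one more (expensive) batch, all under the null.
        data += [random.gauss(0, 1) for _ in range(batch)]
        # Peek at the p-value; stop at the first "significant" result.
        if z_test_p(data) < 0.05:
            stopped_early += 1
            break

rate = stopped_early / n_sim
print(f"null experiments declared significant: {rate:.3f}")
```

With ten peeks, roughly one null experiment in five ends up "significant", which is why sequential designs need corrected thresholds rather than the fixed-sample 0.05.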


Filed under introductory

Everything is not linear: the example of Random Forest

Linear regression is great. But unfortunately, not everything in nature is linear. If you drink alcohol, you get drunk. If you take your prescribed drugs, you stay healthy. But if you do both at the same time, you will not be drunk and healthy; you will probably get very sick. This is an interaction. In general, we talk about interaction when there is a departure from linearity. There are many ways to try and capture interactions using statistical learning, but today I will focus on Random Forest. And before I explain what a forest is, I have to explain what a decision tree is.
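Here is a minimal numerical sketch of why trees handle interactions and linear regression does not. It uses a stylized XOR-type target (the outcome depends on exactly one of two binary features), which is not quite the alcohol-plus-drugs story but is the cleanest case of pure interaction, and a hand-coded depth-2 tree rather than a full Random Forest.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, size=400)   # first binary feature (0/1)
x2 = rng.integers(0, 2, size=400)   # second binary feature (0/1)
y = (x1 ^ x2).astype(float)         # outcome = XOR: pure interaction, no main effects

# Linear regression with an intercept cannot represent XOR:
# the best it can do is predict the overall mean.
X = np.column_stack([np.ones_like(x1), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
mse_linear = np.mean((X @ beta - y) ** 2)

# A depth-2 decision tree splits on x1 first, then on x2 in each branch,
# and recovers the interaction exactly.
def tree_predict(x1, x2):
    return np.where(x1 == 0,
                    np.where(x2 == 0, 0.0, 1.0),
                    np.where(x2 == 0, 1.0, 0.0))

mse_tree = np.mean((tree_predict(x1, x2) - y) ** 2)
print(f"linear MSE: {mse_linear:.3f}, tree MSE: {mse_tree:.3f}")
```

The linear model is stuck near an MSE of 0.25 (the variance of a fair coin), while the tree gets the relation exactly right; a Random Forest averages many such trees fit to resampled data.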

“Erik – Prunus sp 02” by Zeynel Cebeci – Own work. Licensed under CC BY-SA 4.0 via Wikimedia Commons.

The good people at did a great job of explaining what a decision tree is in a very visual way. So click here and go look at it. Also, subtle Star Wars reference.


Filed under introductory

Linear regression in high dimension, sparsity and convex relaxation

I don’t feel like explaining what linear regression is, so I’ll let someone else do it for me (you probably need to know at least some linear algebra to follow the notation):

When I was in high school, in a physics practical, we had made some observations on a pendulum or something, and we had to graph them. They were almost on a line, so I simply joined each point to the next and ended up with a broken line. The teacher, seeing that, told me: “Where do you think you are? Kindergarten? Draw a line!” Well, look at me now, Ms Mauprivez! Doing a PhD and all!

In physics, for such simple experiments, it is obvious that the relation is linear. There is almost no noise except for some small measurement error, and the data reveal a “true” linear relation embodied by the line. In the rest of science, linear regression is not expected to uncover true linear relations. It would be unrealistic to hope to predict precisely the age at which you will get lung cancer from the length of time you have been a smoker (and very difficult to draw the line just by looking at the points). Linear regression is rather a way to find correlations and trends between noisy features that have many other determinants: smoking is correlated with cancer. Proving causation is another, more complicated step.

But linear regression breaks down if you try to apply it with many explanatory features, as in GWAS. The training error (mean squared error) will decrease as you add more and more features, but if you use the model to predict on new data, you will be completely off target. This problem is called overfitting: if you allow the model to be very complicated, it can fit the training data perfectly but will be useless for prediction (just like the broken line).
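A small simulation illustrates this; the sample sizes and the "only two features actually matter" setup are assumptions chosen to mimic the GWAS situation in miniature. As we feed ordinary least squares more and more irrelevant features, the training MSE keeps shrinking while the test MSE blows up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_feat = 50, 200, 60
# Truth: y depends on only the first 2 of the 60 available features.
beta_true = np.zeros(n_feat)
beta_true[:2] = [2.0, -1.0]

X_train = rng.normal(size=(n_train, n_feat))
X_test = rng.normal(size=(n_test, n_feat))
y_train = X_train @ beta_true + rng.normal(size=n_train)   # noise sigma = 1
y_test = X_test @ beta_true + rng.normal(size=n_test)

train_mses, test_mses = [], []
for p in [2, 10, 40]:
    # Fit OLS using the first p features (p - 2 of them are pure noise).
    beta, *_ = np.linalg.lstsq(X_train[:, :p], y_train, rcond=None)
    train_mses.append(np.mean((X_train[:, :p] @ beta - y_train) ** 2))
    test_mses.append(np.mean((X_test[:, :p] @ beta - y_test) ** 2))
    print(f"p={p:2d}  train MSE={train_mses[-1]:.2f}  test MSE={test_mses[-1]:.2f}")
```

Training error is guaranteed not to increase when features are added to a nested model; the test error is what exposes the overfitting, which is why sparsity-inducing penalties become essential when the number of features approaches (or exceeds) the number of observations.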


Filed under introductory

A short (and biased) history of genetics up to GWAS

This post is the first of a series of introductory posts. I will get more technical at some point.

The history of genetics begins with this major discovery.

There is a distinction in genetics between Mendelian traits and complex traits. Mendelian traits depend on only a few genes, whereas complex traits result from many genes and environmental factors. For example, Mendelian traits include eye color, cystic fibrosis and Tay-Sachs disease. Complex traits include height, skin color and type 1 diabetes.


Filed under introductory

I just started a blog !

This will be a scientific blog. I will use it to comment on others’ work and to try and make some methodological points. I will also use it as an informal space to present my own work (a process known as shameless self-promotion).

I was inspired to do it by Lior Pachter, who does a great job of always introducing a technical issue with a seemingly unrelated anecdote. He also happens to place a lot of emphasis on method and mathematical rigor, as I will try to do. However, I will not be as aggressive with fellow researchers, as I do not have tenure.

The areas of science this blog will focus on are statistics, machine learning, the genomics of complex diseases and epidemiology. I have a background in math, which is why I will focus a lot on methodology.


Filed under introductory