## Linear regression in high dimension, sparsity and convex relaxation

I don’t feel like explaining what linear regression is, so I’ll let someone else do it for me (you probably need to know at least some linear algebra to follow the notation):

When I was in high school, we made some observations on a pendulum or something in a physics practical and had to graph them. The points were almost on a line, so I simply joined each point to the next and ended up with a broken line. The teacher, seeing that, told me: “Where do you think you are? Kindergarten? Draw a line!” Well, look at me now, Ms Mauprivez! Doing a PhD and all!

In physics, for such easy experiments, it is obvious that the relation is linear. The data have almost no noise apart from some small measurement error, and they reveal a “true” linear relation embodied by the line. In the rest of science, linear regression is not expected to uncover true linear relations. It would be unrealistic to hope to predict precisely the age at which you will get lung cancer from the number of years you smoked (and very difficult to draw the line just by looking at the points). It is rather a way to find a correlation and a trend between noisy variables that have many other determinants: smoking is correlated with cancer. Proving causation is another, more complicated step.

But linear regression breaks down if you try to apply it with many explanatory features, as in GWAS. The error on the training data (mean squared error) will keep decreasing as you add more and more features, but if you use the model to predict on new data, you will be completely off target. This problem is called overfitting. If you allow the model to be very complicated, it can fit the training data perfectly but will be useless for prediction (just like the broken line).
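To make the overfitting point concrete, here is a minimal simulation sketch in Python with NumPy. Everything in it is an assumption made for illustration: the sample sizes, the Gaussian features, the three "true" effects in `beta`, and the helper name `fit_and_score` are all made up, and nothing here is real GWAS data. It fits ordinary least squares on the first `k` features and compares the error on the training data with the error on fresh data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all values are illustrative assumptions, not real data).
n_train, n_test, n_features = 50, 1000, 50

# Only the first 3 features truly influence the outcome.
beta = np.zeros(n_features)
beta[:3] = [2.0, -1.0, 0.5]

X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
y_train = X_train @ beta + rng.normal(size=n_train)
y_test = X_test @ beta + rng.normal(size=n_test)


def fit_and_score(k):
    """Fit least squares on the first k features; return (train MSE, test MSE)."""
    coef, *_ = np.linalg.lstsq(X_train[:, :k], y_train, rcond=None)
    train_mse = np.mean((X_train[:, :k] @ coef - y_train) ** 2)
    test_mse = np.mean((X_test[:, :k] @ coef - y_test) ** 2)
    return train_mse, test_mse


feature_counts = (3, 10, 25, 45, 50)
results = {k: fit_and_score(k) for k in feature_counts}

for k, (train_mse, test_mse) in results.items():
    print(f"{k:2d} features: train MSE = {train_mse:.3g}, test MSE = {test_mse:.3g}")
```

With as many features as training samples, the fit passes through every training point, which is the regression version of the broken line: the training error drops to essentially zero while the error on new data gets worse than with the three features that actually matter.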

Filed under introductory

## A short (and biased) history of genetics up to GWAS

This is the first in a series of introductory posts that I will write. I will get more technical at some point.

The history of genetics begins with this major discovery.

Genetics distinguishes between Mendelian traits and complex traits. Mendelian traits depend on only a few genes, whereas complex traits are the result of many genes and environmental factors. For example, Mendelian traits include eye color, cystic fibrosis and Tay-Sachs disease. Complex traits include height, skin color and type 1 diabetes.

Filed under introductory

## I just started a blog!

This will be a scientific blog. I will use this blog to comment on others’ work and to try and make some methodological points. I will also use it as an informal space to present my work (a process known as shameless self-promotion).

I was inspired to do it by Lior Pachter, who does a great job of always introducing a technical issue with a seemingly unrelated anecdote. He also happens to put a lot of emphasis on method and mathematical rigor, as I will try to do. However, I will not be as aggressive with fellow researchers, as I do not have tenure.

The areas of science this blog will focus on are statistics, machine learning, genomics of complex diseases and epidemiology. I have a background in math, which is why I will focus a lot on methodology.