I don’t feel like explaining what linear regression is so I’ll let someone else do it for me (you probably need to know at least some linear algebra to follow the notations):
When I was in high school, in a physics practical we had done some observations on a pendulum or something and we had to graph them. They were almost on a line so I simply joined each point to the next and ended up with a broken line. The teacher, seeing that, told me : ” Where do you think you are? Kindergarten? Draw a line!” Well, look at me now, Ms Mauprivez! Doing a PhD and all!
In physics, for such easy experiments, it is obvious that the relation is linear. It can have almost no noise except for some small measurement error and it reveals a “true” linear relation embodied by the line. In the rest of science, linear regression is not expected to uncover true linear relations. It would be unrealistic to hope to predict precisely the age at which you will have pulmonary cancer by the period of time you were a smoker (and very difficult to draw the line just by looking at the points). It is rather a way to find correlation and a trend between noisy features that have many other determinants: smoking is correlated with cancer. Proving causation is another complicated step.
But linear regression breaks down if you try to apply it with many explaining features like in GWAS. The error (mean squared error) will decrease as you add more and more features but if you use the model to predict on new data, you will be completely off target. This problem is called overfitting. If you allow the model to be very complicated, it can fit perfectly to the training data but will be useless in prediction (just like the broken line). I’ll let Alexander Ihler tell you more about it :
After this, it is clear that the performance of a trained estimator must be assessed using a test set different from the training set.
To deal with overfitting, we penalize the complexity of the model. Instead of asking for minimization of the mean squared error, we minimize the sum of the mean squared error and a penalization term. In the next video, considering higher order polynomial means adding features (the successive powers of the feature).
We have introduced a new parameter that controls how much complexity is allowed. But now we have to find a way to estimate it ! The way we do this is cross-validation, we split the training data in k folds and we use the first k-1 folds to train our estimator for different values of and we evaluate the performance on the last remaining fold. We then do this for each fold and we select the value for that obtained the best performance on average.
By Joan.domenech91 (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)%5D, via Wikimedia Commons
At first, it is not intuitively clear why the norm of the parameter vector is a measure of the complexity. But when you think about it, it makes sense. If our predictions depend a lot on many features, any small change in those features will change the prediction. It is even clearer if you consider the 0-norm which is the number of non-zero coefficients of a vector. The less features you depend on, the less complex your model is. We call a solution that has many coefficients set to 0 a sparse solution. It is often desirable to have a sparse solution because it allows for more interpretability. In many cases, we are not so much interested by predictions but by which features are important.
Unfortunately, it is not computationally feasible to solve the minimization with the 0-norm as it is not convex. When we are not capable of having a closed form solution (a formula for the solution) and we want to find an approximate minimum, we have fast algorithms only if the function we want to minimize is convex. For the 0-norm problem, we would have to consider the (where is the number of features) possible subsets and compute for each the linear regression in order to solve the minimization problem.
We want a convex problem so we make it convex by replacing the 0-norm by the 1-norm, the sum of the absolute values of the coefficients. The 1-norm is the convex relaxation of the 0-norm. And we hope that the desired properties are conserved. The short version is : it works. The 1-norm penalized linear regression is called lasso because all statisticians wish they were cowboys like Tibshirani.
The long version is that a lot of things are known theoretically on lasso. The book Statistics for High Dimensional Data exposes many results for the lasso. Theory is important for machine learning because it allows to understand better how things work and under what conditions. Of course, you need to be into it to appreciate it.
After this funny interlude, you are ready for some math. Usually in statistics, when we talk about asymptotic results, we mean the limit when the number of observations goes to infinity while the number of features stays fixed. The interest of Lasso is its use in a high-dimensional setting with so this would be unsatisfactory. To capture high dimensional scenarios, the results are with respect to a triangular array of observations :
where are independent identically distributed “reasonable” errors with mean 0.
If we assume that the true parameter vector is sparse i.e. that we have then we have consistency when meaning that lasso is successful in estimating the true regression function.
The fact that the dependence in is in log in the sparsity assumption means that lasso can deal satisfactorily with many more features than observations.
Even if we do not assume that there is a linear relation between and , it can be shown that the lasso will come close to the best sparse linear prediction possible. This kind of result is called an oracle inequality.
Convex relaxation is not always that successful. I followed a class this june at ENSAE by Philippe Rigollet who works on sparse PCA. You can make a sparsity assumption on the first principal component () in order to detect it more easily and statistically it works. But by adding a constraint you have made the problem computationally hard. If you try to take a convex relaxation of the problem, you lose in power. What he shows is that any computationally efficient method will pay a statistical price. It was very interesting to see connections between computational complexity and statistical performance.
The next post will be a review of linear regression for GWAS.
P.S. (21/04/2016) : I am in New York for a few days on vacation and I went and saw a seminar at Courant institute by Quentin Berthet called trade-offs in statistical learning. I was surprised to find it was on the same subject than the class by Philippe Rigollet. It all makes sense since Quentin was is PhD student. It was nice to have a half hour summary of what I had spent 8 hour studying. He then went on to show what he had been up to since that time. Basically, he (and other people) added a constraint for example having data distributed between different data centres. Think a grid of hospital, each having their data centre with some patients and that do not want (or have the right to) share the data. He exhibited a statistical problem where an optimal statistical solution was available with reasonable computational complexity. An optimal statistical solution was available with the distributed data. But you could not have an optimal statistical solution that was both computationally efficient and distributed.