Tag Archives: QC

Unrealistic standards of beauty for data in statistics class

When you take a statistics class, the data is perfect and you can apply all kinds of fancy algorithms and procedures to it to get to the truth. Sometimes you even have theoretical justifications for them. But the first time you encounter real data, you are shocked: there are holes in the data!

[Image: a block of Emmentaler cheese]

This is what actual data looks like. By Dieter Seeger [CC BY-SA 2.0 (http://creativecommons.org/licenses/by-sa/2.0)], via Wikimedia Commons

Almost all real data has missing values, encoded as NA. And you can’t just keep the observations that have no NAs: you would end up with nothing. A first step is to exclude variables and observations that have too many missing values. This process is called quality control, or QC. Once you give it this name, it seems difficult to argue for less quality control. But we could also call it Throwing Expensive Data Away. It is all a matter of perspective.
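To make this concrete, here is a minimal sketch of such a QC step in plain Python. It is only an illustration, not a standard procedure: the function names, the toy table, the use of `None` to stand in for NA, and the 50% thresholds are all assumptions of mine.

```python
# Toy QC sketch. Missing values (the "NA" of the text) are stored as None.
# The thresholds below are arbitrary, illustrative choices.

def missing_fraction(values):
    """Fraction of entries in `values` that are missing (None)."""
    if not values:
        return 1.0  # treat an empty row or column as entirely missing
    return sum(v is None for v in values) / len(values)


def quality_control(rows, max_missing_var=0.5, max_missing_obs=0.5):
    """Drop variables (columns) first, then observations (rows), with too many NAs.

    Returns the filtered rows and the indices of the columns that were kept.
    """
    n_cols = len(rows[0])
    # Step 1: keep the columns whose missing fraction is at most the threshold.
    kept_cols = [
        j for j in range(n_cols)
        if missing_fraction([row[j] for row in rows]) <= max_missing_var
    ]
    reduced = [[row[j] for j in kept_cols] for row in rows]
    # Step 2: among the remaining columns, keep the rows that are complete enough.
    kept_rows = [row for row in reduced if missing_fraction(row) <= max_missing_obs]
    return kept_rows, kept_cols


# Four observations (rows) of four variables (columns).
data = [
    [1.0, None, 3.0, None],
    [2.0, None, None, 1.0],
    [None, None, None, None],
    [4.0, None, 6.0, 2.0],
]

# Number of observations without any NA: none at all in this toy table.
n_complete = sum(all(v is not None for v in row) for row in data)

clean, kept = quality_control(data)
```

In this toy table no observation is complete, so keeping only the NA-free rows would indeed leave nothing. The sketch instead drops the one variable that is always missing and the one observation that is entirely empty, and keeps the rest, holes included.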

Even after you throw away the observations and variables with Continue reading



Filed under introductory

The sleeping beauty in the random forest: T-Trees

A sleeping beauty is an article that stays undervalued and rarely cited for a long time, then awakens later and is recognised as important. The most famous one is the work of Mendel, which was published in 1866 and rediscovered 34 years later. To learn more about this concept, you can read this or that.

Today, I want to talk about a paper that is a bit too young to be a sleeping beauty but that seems undervalued:

Exploiting SNP Correlations within Random Forest for Genome-Wide Association Studies (2014) by Vincent Botta, Gilles Louppe, Pierre Geurts, Louis Wehenkel

The four authors are from Liège in Belgium. Their team is known for its work on random forests (if you do not know what that is, you can read my earlier blog post on the subject). Geurts and Wehenkel proposed a variant of random forests called extremely randomized trees. Gilles Louppe is the one who implemented random forests for the scikit-learn package for Python (which I use. Thanks!). The first author, Vincent Botta, left academia after his PhD and went to work for a company (a start-up in newspeak).

Motivation

The idea of the paper is to use biological structure in order to increase prediction accuracy. The additional structure used here is chromosomal distance. A SNP is located on a chromosome and it has neighbours. This information can be useful in several ways: Continue reading


Filed under Review