I have recently submitted an article titled as this blog post and it is already accessible as a preprint. A preprint is a scientific paper before it has been peer-reviewed. The idea behind publishing preprints is that research is more quickly available than if you have to wait for the sometimes lengthy peer-review process to take place.
My idea when I decided to start this blog was to be able to give context to my work. A scientific publication is not designed to be understandable to the lay man (and most of the time, it is hidden behind a paywall). A blog is therefore useful as an antechamber to the scientific literature. Also, shameless self promotion.
If you want to try and understand what the preprint is about you should follow the links to blog posts in the following paragraphs.
I have used Genome-Wide Association Studies (GWAS) data in order to try and predict genetic risk of disease using machine learning techniques. In particular, I have combined lasso regression and random forests. Compared to more traditional approaches, I have used biological structure to try and improve predictions, namely chromosomal distance and phase information.
Chromosomal distance is simply the fact that SNPs (single-nucleotide polymorphisms) have a physical location on chromosomes and you can therefore define a distance measured in base pairs between two SNPs that are on the same chromosome. This structure was exploited in T-trees.
The second structure I tried to use is phase information or haplotypes. We have 22 pairs of autosomal (not sexual) chromosomes. Each autosomal SNP is therefore present twice in each individual. Because of the way the technology works, we do not have access to the two sequences of the two chromosomes of the same pair but only to the genotypes. To make this clearer:
Knowing the genotype does not allow us to distinguish between the two scenarios. You should note that the two black lines are the two different chromosomes of the pair, one coming from the mother and one from the father.
Ok, but why is this information important ? Suppose that the two SNPs are located on the same gene in the coding sequence. Further suppose that the SNP1=A and SNP2=C are nonsense mutation that imply a dysfunctional protein. In the scenario on the left, the two mutations are on the same chromosome and therefore the other chromosome will produce a healthy version of the protein. On the other hand, on the right, both copies of the gene are dysfunctional, the healthy protein will not be produced and the individual will be sick. This is called compound heterozigoty.
With all this you are equipped to read my first preprint !