I was in Lille on Thursday and Friday for an intense conference on Statistical Models for Post-Genomic Data. Two main themes emerged: the genetics of bacteria and viruses, and change point detection. I’ll only talk about the first one, plus an unrelated talk on miRNA.
Viral evolutionary inference
Philippe Lemey showed us how sequencing viral genomes can be used to retrace the spatio-temporal evolution of diseases. By sequencing viruses, you can reconstruct their phylogeny and therefore find out where a virus came from. This gives a much more precise picture of the dynamics of an epidemic. See for example the spread of H1N1. He also showed us his results on Ebola, the first epidemic to be sequenced as it unfolded. They showed how the disease moved from district to district. His work was retrospective, as he pooled data from different teams, and he stressed the importance of efficient data sharing. Seeing how an epidemic propagates in this way helps to understand which public health measures are effective.
GWAS for bacteria
Genome-wide association studies (GWAS) can help discover the genetic determinants of traits. This idea is not limited to humans: one of the main traits of interest in bacteria is resistance to antibiotics. However, bacterial genomes are very challenging in several ways:
- They have very high linkage disequilibrium, because it is not broken down by recombination. This means that there is very strong population structure. Danny Wilson proposed a method to deal with that: Bugwas. Instead of simply correcting for population structure by principal component analysis, he proposed to also look at lineage effects by testing whether the principal components themselves are associated with antibiotic resistance.
- They have highly variable genome sizes. Because of this, a reference genome cannot really be used for some bacteria. A different approach has therefore been to represent a pool of genomes by a de Bruijn graph. Rayan Chikhi gave a detailed and interesting tutorial on this. Here is another if you are interested. This technique was first used for de novo assembly but has found new applications, notably in RNA-seq quantification with kallisto, a method as accurate as others but orders of magnitude faster. For bacterial genomes, it makes it possible to replace SNPs by nodes of the de Bruijn graph, as shown by Magali Jaillard. Alexandre Drouin showed very nice results on that kind of data with a Set Covering Machine, a machine learning technique I had never heard of. He had theoretical results on his method and showed good performance, but most importantly the resulting model is very sparse, i.e. it depends on very few variables, which makes it more interpretable.
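To make the de Bruijn graph idea a bit more concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions, not the method presented by any of the speakers: the function names, the toy sequences and the choice of k are all made up, and real tools work on compacted graphs built from much longer k-mers. Each k-mer of each genome adds an edge between its (k-1)-mer prefix and suffix, and every node can then be described by the genomes in which it is present, which is the kind of feature that can stand in for a SNP.

```python
from collections import defaultdict


def de_bruijn_graph(sequences, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, and each k-mer
    adds an edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def node_presence(sequences, k):
    """For each node ((k-1)-mer), record in which genomes it occurs.
    These presence/absence vectors play the role of SNP genotypes."""
    presence = {}
    for idx, seq in enumerate(sequences):
        for i in range(len(seq) - k + 2):
            node = seq[i:i + k - 1]
            presence.setdefault(node, [False] * len(sequences))[idx] = True
    return presence


# Two toy "genomes" (hypothetical) that differ at a single position.
genomes = ["ACGTAC", "ACGGAC"]
graph = de_bruijn_graph(genomes, k=3)
presence = node_presence(genomes, k=3)

for node in sorted(presence):
    print(node, sorted(graph.get(node, ())), presence[node])
```

The single difference between the two toy genomes shows up as a fork at node `CG`, and the nodes on each branch are present in only one of the genomes. No reference genome is needed to see this.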
All this was made more concrete by the talk of Zamin Iqbal, who explained the program he will be implementing with a hospital on antibiotic-resistant tuberculosis. Nowadays, the standard procedure is to take some bacteria from a patient and culture them for two weeks. At this point, the first-line drugs are tested on the strain to evaluate its sensitivity to them. This takes several weeks. If the strain is sensitive, the patient gets the first-line drugs. If not, the strain is then tested for resistance to second-line drugs, which are much more toxic. This also takes several weeks. In comparison, what Iqbal proposes is to culture the bacteria for two weeks (otherwise there is not enough material to sequence), then sequence them and determine which drugs they are resistant to, using results from GWAS on resistance. This saves patients a lot of time before they receive their treatment.
I’ll finish this post by mentioning an unrelated talk that I really liked.
False positives in microRNA target prediction, by Hervé Seitz
MicroRNAs (miRNAs) are very small RNAs that can inhibit the expression of a gene by binding to its messenger RNA. Their bases numbered 2 through 7 are called the seed. The seed attaches to the complementary target on the mRNA. The seed is a very short sequence, and therefore the complementary sequence, i.e. the binding site, will appear in many places in the genome. To try to predict which mRNAs are repressed by a miRNA, a comparative genomics approach was taken: if the binding site is conserved across many species in a precise region of a gene, it is taken to be functional, and it therefore seems likely that it is a target for the miRNA. Doing this, we find that around 60% of human coding genes are targeted by miRNAs.
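Here is a small Python sketch of why a seed of six bases gives so many candidate sites. It is a toy illustration, not a real target prediction tool: the function names and both sequences are made up for the example, and it assumes the simplest possible rule, namely a perfect match to the reverse complement of bases 2 through 7. By pure chance, a given 6-mer is expected about once every 4^6 = 4096 bases of random sequence.

```python
# Watson-Crick complement for RNA bases.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_site(mirna, start=2, end=7):
    """Return the mRNA sequence complementary to the miRNA seed
    (bases start..end, 1-based and inclusive), read 5' to 3'."""
    seed = mirna[start - 1:end]
    return seed.translate(COMPLEMENT)[::-1]  # reverse complement


def find_sites(mrna, mirna):
    """Return the 0-based positions of perfect seed matches in the mRNA."""
    site = seed_site(mirna)
    return [i for i in range(len(mrna) - len(site) + 1)
            if mrna[i:i + len(site)] == site]


# Toy sequences (hypothetical), both written 5' to 3'.
mirna = "UGAGGUAGUAGGUUGUAUAGUU"
mrna = "AAUACCUCGGGUACCUCA"

print(seed_site(mirna))          # site searched for in the mRNA
print(find_sites(mrna, mirna))   # positions of candidate binding sites
```

A search like this only says where a site could exist. The conservation filter described above is what is supposed to separate functional sites from chance matches, and that is precisely the step Seitz questions.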
But Seitz does not think that those sequences are conserved because of miRNAs. He argues that the effects of miRNAs on gene expression are too weak to affect phenotypes. The inter-individual variability of gene expression between very similar mice is much larger than the effect size of miRNAs. And because biological systems are very redundant and have many mitigating mechanisms, a difference in gene expression level will not necessarily have a phenotypic impact.
Another of his arguments was that some of the binding sites are conserved even in species that do not have the miRNA.
To understand why those binding sites are conserved even though miRNAs are unlikely to be the cause, his team plans to modify the binding sites with CRISPR/Cas9.