I just came home from a two-day conference in Evry, 30 km south of Paris. Evry hosts a biocluster centred on genetics: it produced the first genetic maps in the 90s, which inspired the Human Genome Project, and it is also where the French contribution to that project, the sequencing of chromosome 14, took place.
I won’t be as rigorous as I usually aim to be; I will just try to give you a flavour of some of the talks.
Transcriptome from micro-arrays to RNA-seq by François Cambien
The transcriptome is the set of all RNA transcripts in a cell, and transcriptomics is the study of their expression. According to the central dogma of molecular biology, DNA is transcribed into RNA in the nucleus; the RNA then moves to the cytoplasm, where it is translated into proteins. Transcriptomics is therefore the study of RNA abundance.
François Cambien works on heart disease, and to better understand it he looks for genes that are differentially expressed between patients with heart disease and controls. He first described the change in technology that happened over the last few years as we went from micro-arrays to RNA-seq. Micro-arrays are chips on which a large number of probes are arranged; the RNA sample of interest is then applied, and the complementary strands hybridize to the probes. The abundance of a transcript is measured by the fluorescence intensity at the probes it is supposed to attach to. The same kind of technology is behind GWAS.
The reference project for transcriptomics using micro-arrays is ENCODE, for ENCyclopedia Of DNA Elements. While being a useful resource, this project has had to face some controversy because the project team claimed that 80% of the human genome was functional. This controversy makes for some very entertaining reading. Just to quote a bit:
A recent slew of ENCyclopedia Of DNA Elements (ENCODE) Consortium publications, specifically the article signed by all Consortium members, put forward the idea that more than 80% of the human genome is functional. This claim flies in the face of current estimates according to which the fraction of the genome that is evolutionarily conserved through purifying selection is less than 10%.[…] The ENCODE results were predicted by one of its authors to necessitate the rewriting of textbooks. We agree, many textbooks dealing with marketing, mass-media hype, and public relations may well have to be rewritten.
The new technology now in use is RNA-seq, which consists of sequencing the RNA directly. This means you can capture all transcripts, not only those targeted by the probes on an array. Furthermore, it makes it possible to investigate alternative splicing and allele-specific expression. On the other hand, it is more sensitive to RNA quality.
This brings us to one of Cambien’s main points: RNA is much noisier than DNA. It is less stable and degrades faster. Expression also depends on cell type, hence the importance of cell-type separation. He therefore stressed the importance of precise protocols for sample collection, storage and sequencing. Even once you have the data, the analysis is not without challenges: you have to try to recover a signal hidden behind a lot of noise.
The big project using RNA-seq is GTEx (Genotype-Tissue Expression), which aims to characterize expression across tissues and cell types. This is important for understanding when and where a gene is expressed, and therefore what impact a mutation in it will have. Of course, in humans you cannot just take a sample of any cell type from a living subject. In type 1 diabetes, for example, we would be interested in the expression of the cells of the islets of Langerhans, which produce insulin, but accessing them would require heavy surgery on small children. GTEx therefore relies on post-mortem samples, and Cambien stressed the importance of the death-to-analysis time for the quality of this data. In his own work, Cambien mostly used blood samples. This is of course a limitation, but it is also an opportunity to study living subjects: “the blood window”. And blood contains important things besides erythrocytes (red blood cells), especially if you are interested in the immune system.
Single-cell transcriptomics by Sandrine Dudoit
We had the chance to have a speaker working on the cutting edge of technological development (i.e. in the USA). Sandrine Dudoit works on single-cell transcriptomics as part of the Brain Research through Advancing Innovative Neurotechnologies (BRAIN!) initiative. In her project, cells were taken from two neighbouring layers of a mouse cortex and the transcripts of each cell were sequenced independently, which is very impressive. The point of this work is to see whether this kind of data can discriminate between distinct cell types and, if possible, identify new cell subtypes.
However, this data is even more problematic than bulk RNA-seq data, where RNA from many cells of the same tissue is sequenced together and the cells therefore average each other out. The main difference is zero inflation: many genes have no reads in many cells. This can be for biological reasons (the gene was not transcribed in that cell) or technical ones (the procedure failed to capture the transcript). To deal with this, Dudoit presented the zero-inflated negative binomial (ZINB) model.
For count data (when your variable takes integer values), the standard model is not Gaussian but Poisson. However, a Poisson random variable has equal mean and variance, which is not always desirable. The negative binomial distribution generalizes the Poisson by allowing the variance to exceed the mean, and the zero-inflated version then adds extra zeros on top of the negative binomial model.
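Dudoit’s exact parameterization wasn’t given in the talk, but a minimal sketch of sampling from a zero-inflated negative binomial (the parameters pi, mu and theta below are my own illustrative choices) shows where the extra zeros and the extra variance come from:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_zinb(n, pi, mu, theta):
    """Draw n values from a zero-inflated negative binomial.

    pi    -- probability of a structural ("dropout") zero
    mu    -- mean of the negative binomial component
    theta -- dispersion (smaller theta => larger variance)
    """
    # Negative binomial parameterized by mean mu and dispersion theta:
    # its variance is mu + mu**2 / theta, i.e. larger than the mean,
    # unlike the Poisson.
    p = theta / (theta + mu)
    counts = rng.negative_binomial(theta, p, size=n)
    # With probability pi, replace the count by a structural zero
    # (modelling the transcripts the procedure failed to capture).
    dropout = rng.random(n) < pi
    counts[dropout] = 0
    return counts

counts = sample_zinb(10_000, pi=0.3, mu=5.0, theta=2.0)
print("fraction of zeros:", (counts == 0).mean())
print("mean:", counts.mean(), "variance:", counts.var())
```

With these parameters, well over 30% of the draws are zeros and the variance is several times the mean, which is exactly the pattern that makes a plain Poisson model a poor fit for single-cell counts.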
After a lot of technical filtering, quality control and normalization, she finally arrived at reasonably clean data to analyse. Unfortunately, the clustering she attempted was somewhat unsatisfactory: it split the cells into three groups, two clusters corresponding to the distinct layers and one being a mix of the two kinds of cells. She proposed a direction of study to improve on the clustering: consensus clustering, which would be an analogue of ensemble methods (like random forests).
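She didn’t detail the algorithm, but a common flavour of consensus clustering is the co-association approach: run a base clusterer many times, record how often each pair of points ends up together, and cluster that agreement matrix. Here is a toy sketch of the idea (make_blobs stands in for the real normalized expression data; none of this is Dudoit’s actual pipeline):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy stand-in for normalized single-cell expression profiles:
# 200 "cells" drawn from 2 underlying types.
X, _ = make_blobs(n_samples=200, centers=2, random_state=0)

# Run k-means many times with different seeds and record, for each
# pair of cells, the fraction of runs in which they co-cluster.
n = X.shape[0]
n_runs = 30
coassoc = np.zeros((n, n))
for seed in range(n_runs):
    labels = KMeans(n_clusters=2, n_init=1, random_state=seed).fit_predict(X)
    coassoc += labels[:, None] == labels[None, :]
coassoc /= n_runs

# Cells that robustly belong together have a co-association near 1;
# the final consensus partition clusters the agreement matrix itself.
consensus = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coassoc)
```

The appeal, as with random forests, is that averaging over many unstable base partitions tends to be more robust than any single run.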
Stochasticity as a challenge for genetic determinism in the cell by Andras Paldi
The last talk I will mention tonight was given by a biologist, about stochasticity in single cells. We can sometimes have a very deterministic picture of how a cell works. For example, signalling pathways seem to imply that once a signal arrives, it automatically triggers a cascade of consequences, as if you had pressed a button on a vending machine.
But this is a false impression. Everything that happens in a cell depends on molecules meeting, and molecules mainly drift around randomly, following Brownian motion. Pathway diagrams with their arrows should really be understood as the average behaviour of a cell. An important reference is “Stochastic gene expression in a single cell” (2002):
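The point that the deterministic picture is only the average can be made with a toy simulation. Below is a crude birth-death sketch (my own illustration, not from the talk): each “cell” produces and degrades a molecule at random, individual trajectories fluctuate wildly, yet the population mean settles on the smooth deterministic value birth_rate / death_rate.

```python
import numpy as np

rng = np.random.default_rng(42)

# Copy number of one molecular species in many independent "cells",
# each following a noisy birth-death process.
n_cells, n_steps = 500, 200
birth_rate, death_rate = 10.0, 0.1

x = np.zeros(n_cells)
trajectories = np.zeros((n_steps, n_cells))
for t in range(n_steps):
    births = rng.poisson(birth_rate, n_cells)            # random production
    deaths = rng.binomial(x.astype(int), death_rate)     # random degradation
    x = x + births - deaths
    trajectories[t] = x

# Any single cell is noisy, but the average over cells approaches the
# deterministic steady state birth_rate / death_rate = 100.
print("one cell at the end:", trajectories[-1, 0])
print("population mean:", trajectories[-1].mean())
```

Looking at the population mean, you would never guess how erratic each individual trajectory is, which is exactly why single-cell measurements tell a different story from bulk ones.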
These results establish a quantitative foundation for modeling noise in genetic networks and reveal how low intracellular copy numbers of molecules can fundamentally limit the precision of gene regulation.
This is of course connected with thermodynamics and statistical physics: you describe the behaviour of random particles (a gas molecule / a cell), and when you look at many of them you can observe measurable quantities (the temperature or the pressure / a phenotype). Paldi also made the point that it is very expensive to have tight control over gene regulation:
In such systems, the minimal error decreases with the quartic root of the integer number of signalling events, making a decent job 16 times harder than a half-decent job.
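The arithmetic behind the quote is worth spelling out: if the minimal relative error scales as N to the power -1/4, where N is the number of signalling events, then halving the error requires 2 to the power 4, i.e. 16 times, as many events.

```python
import math

# Scaling law from the quote: minimal relative error ~ N**(-1/4),
# where N is the number of signalling events.
def min_error(n_events):
    return n_events ** -0.25

n = 10_000
# Halving the error (a "decent job" vs a "half-decent job") requires
# 2**4 = 16 times as many signalling events.
assert math.isclose(min_error(16 * n), min_error(n) / 2)
```

So every extra digit of regulatory precision costs the cell a disproportionate amount of signalling, which is why tight control is expensive.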
This connects with Dudoit’s difficulties: if you look at individual cells, you are no longer looking at the simple average story but at all the very noisy individual cells. Furthermore, the cells in the mixed cluster might actually be oscillating between the two distinct neuron types. Still, being able to get expression data for a single cell is awesome.