# A short (and biased) history of genetics up to GWAS

This is the first in a series of introductory posts. I will get more technical at some point.

The history of genetics begins with this major discovery.

Genetics distinguishes between Mendelian traits and complex traits. Mendelian traits depend on only a few genes, whereas complex traits result from many genes and environmental factors. For example, Mendelian traits include eye color, cystic fibrosis and Tay-Sachs disease. Complex traits include height, skin color and type 1 diabetes.

Mendelian traits were discovered by Mendel (Surprise!). He was an Austrian monk who grew different varieties of peas. By crossing them, he observed patterns that led him to understand that, for some traits like color or wrinkles, there are two versions (alleles) of the same gene inside each pea, and that one version is dominant and the other is recessive. A dominant allele needs to appear only once to determine the outcome (the phenotype). A recessive allele needs to appear twice to determine the phenotype. For example, blue eyes are a recessive trait while brown eyes are dominant. He published his findings in 1866, but his work was only rediscovered at the beginning of the 20th century, by three researchers simultaneously.

Mendelian traits are easy enough to understand and allow for accurate prediction. They also guided the understanding of much of biology thanks to Mendel-style hybridizing experiments on drosophila, mice or other model animals. They are also an important public health issue, as many rare diseases (mainly Mendelian) affect humans: 7% of the world population has a rare disease. I will not talk about Mendelian traits anymore.

The founding father of the study of complex traits is Francis Galton, an English jack of all trades who introduced the use of fingerprints by the police (cool), coined the term anticyclone in meteorology (cool) and also founded eugenics (not cool). He is known in probability theory for the Galton-Watson process, which is the simplest model of population growth. He was a cousin of Darwin. Let us now turn to his work on the heredity of height. It is common knowledge that taller parents have taller offspring. To study this more quantitatively, he collected the heights of trios made of two parents and their offspring. He corrected for the fact that women are on average shorter than men to produce a modified mean height of the two parents (a mid-parent), and he compared this to the height of the offspring.

On the left, the original graph by Galton. Mediocrity means the mean height.

He observed that the offspring are closer to the mean of the population than their parents. This phenomenon is known as regression to the mean and this is where the term linear regression comes from.

To understand this, you have to see height (phenotype) as the sum of the mean $\mu$, a genetic component and an environmental component:

$P=\mu +G+E$

and we are going to assume that $G$ and $E$ are independent and normally distributed in the population, with mean zero. The normal distribution is central to much of probability, statistics and therefore science, and it already fascinated Galton. If you want to learn more about it, I recommend reading chapter 7 of Probability Theory: The Logic of Science by Jaynes, or Wikipedia.

Tall parents tend to be tall because both their $G$ and their $E$ are positive, but offspring only inherit the $G$ part from their parents, not the $E$ part. To put this in other words, choosing tall people in a population is conditioning on the sum of genetics and environment being large, but only the genetic part is transmitted to offspring.
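The argument above can be checked with a small simulation. This is only a sketch under assumptions that are mine and not Galton's: the mean of 170 cm, the variances of $G$ and $E$, the selection cutoff, and the way the child's $G$ is drawn (average of the parents' $G$ plus independent noise that keeps the variance of $G$ constant across generations) are all hypothetical choices. The sex correction is ignored.

```python
import random
import statistics


def simulate_families(n_families, mu=170.0, var_g=40.0, var_e=10.0, seed=0):
    """Return (mid_parent_height, child_height) pairs under P = mu + G + E.

    All numeric defaults are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    sd_g, sd_e = var_g ** 0.5, var_e ** 0.5
    families = []
    for _ in range(n_families):
        g_father, g_mother = rng.gauss(0.0, sd_g), rng.gauss(0.0, sd_g)
        e_father, e_mother = rng.gauss(0.0, sd_e), rng.gauss(0.0, sd_e)
        mid_parent = mu + (g_father + g_mother) / 2 + (e_father + e_mother) / 2
        # Assumption: the child gets the average parental G plus noise of
        # variance var_g / 2, so that var(G) is the same in both generations.
        g_child = (g_father + g_mother) / 2 + rng.gauss(0.0, (var_g / 2) ** 0.5)
        # The child's E is drawn afresh: it is not inherited from the parents.
        child = mu + g_child + rng.gauss(0.0, sd_e)
        families.append((mid_parent, child))
    return families


families = simulate_families(50_000)

# Condition on tall parents: keep families whose mid-parent exceeds 175 cm.
tall = [(mid_parent, child) for mid_parent, child in families if mid_parent > 175.0]
mean_mid_parent = statistics.fmean([mid_parent for mid_parent, _ in tall])
mean_child = statistics.fmean([child for _, child in tall])

print(f"selected families:     {len(tall)}")
print(f"mean mid-parent height: {mean_mid_parent:.1f}")
print(f"mean child height:      {mean_child:.1f}")
```

The children of the selected parents are still taller than the population mean, but less so than their parents: the lucky $E$ of the parents was not passed on.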

For any continuous trait, we can now quantify how much of the variation in the trait is due to genetics. We call this quantity heritability and its definition is:

$H^2 = \frac{\text{var}(G)}{\text{var}(P)}$

where $\text{var}$ stands for variance.

Heritability lies between 0 and 1, with 0 meaning no genetic influence and 1 meaning no environmental influence. Height has a heritability of around 0.8. Here is a recent review on heritability. It is estimated from studies on families.
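To make the definition concrete, here is a minimal sketch that simulates $G$ and $E$ directly and computes the variance ratio. The variances (40 and 10) and the mean (170) are hypothetical values that I picked so that the true heritability is 0.8. In real data $G$ is of course not observed, which is why family studies are needed.

```python
import random
import statistics

rng = random.Random(1)
n_people = 50_000
var_g, var_e = 40.0, 10.0  # hypothetical variances, true H^2 = 40 / 50 = 0.8

# Independent, normally distributed genetic and environmental components.
g = [rng.gauss(0.0, var_g ** 0.5) for _ in range(n_people)]
e = [rng.gauss(0.0, var_e ** 0.5) for _ in range(n_people)]
p = [170.0 + g_i + e_i for g_i, e_i in zip(g, e)]

# H^2 = var(G) / var(P)
heritability = statistics.pvariance(g) / statistics.pvariance(p)
print(f"estimated heritability: {heritability:.3f}")
```

Because $G$ and $E$ are independent, $\text{var}(P) = \text{var}(G) + \text{var}(E)$, which is why the ratio cannot leave the interval from 0 to 1.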

For complex diseases, the problem is a bit different since the outcome is binary. There is an adaptation of heritability for binary traits, but I am not a big fan of it for reasons that I will probably explain in another post. You can also look at the risk to siblings: if my brother is sick, what does that mean for me? For example, brothers of patients with type 1 diabetes (T1D) are 15 times more likely to have T1D. An excellent question is: how do you know that it is the genetic part that matters and not the environment shared between brothers? The answer is twins. We compare monozygotic twins to dizygotic twins: in this study in Finland, in 27% of the monozygotic twin pairs with at least one T1D case, both twins were sick, while for dizygotic twins this number was only 4%. Monozygotic twins share the same environment and the same genes, while dizygotic twins are like brothers born on the same date, sharing their environment and around half of their genes. The big difference in concordance shows that genes play an important role.
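The percentages quoted for twins are pairwise concordance rates. As a small sketch of how such a rate is computed, the counts below are made up by me so that they reproduce the 27% and 4% quoted above. They are not the counts of the Finnish study.

```python
def pairwise_concordance(both_affected, only_one_affected):
    """Share of twin pairs with at least one case in which both twins are affected."""
    pairs_with_a_case = both_affected + only_one_affected
    if pairs_with_a_case == 0:
        raise ValueError("need at least one pair with an affected twin")
    return both_affected / pairs_with_a_case


# Hypothetical counts, chosen only to match the percentages in the text.
mz_concordance = pairwise_concordance(both_affected=27, only_one_affected=73)
dz_concordance = pairwise_concordance(both_affected=4, only_one_affected=96)

print(f"monozygotic: {mz_concordance:.0%}, dizygotic: {dz_concordance:.0%}")
```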

So far, the study of complex traits has found its main application in animal and plant breeding. An animal breeder wants animals that grow faster, are bigger or produce more eggs or milk. Those are complex traits. The breeder can choose which animals will reproduce and therefore try to maximize the trait in the next generation. The breeder could simply pick the biggest animals, but they are not necessarily the ones with the best genetics: maybe they were just lucky environment-wise. To determine the genetic value of an animal more precisely, you can build a pedigree of its relatives. While they select for the traits they want, animal breeders also want to preserve some genetic diversity in their population to avoid inbreeding problems.

Genetics started as a discipline without access to genes, only to phenotypes and family ties. Technical progress has slowly provided that access. I won’t go into details here as I don’t know them. Chromosomes were discovered in the middle of the 19th century, but it took about another century, until 1955, to get the right count (which is 46 chromosomes or 23 pairs). In 1953, Crick and Watson discovered the double helix structure of DNA. From 1977, Sanger sequencing made sequencing possible, although slow and expensive. In 1983, the Polymerase Chain Reaction (PCR) was invented. Then things accelerated, and the price of sequencing, which had followed Moore’s law up to 2007, suddenly fell much faster. This is due to next-generation sequencing, also known as high-throughput sequencing. The result is that we now have access to DNA.

## GWAS

I will now talk about Genome-Wide Association Studies (GWAS), which are not the most recent development in genomics but are what I am working on. The main interest, both economically and socially, of human genomics is insight into complex diseases’ etiology (only because we feel threatened by genetically enhanced humans with super powers as shown in the documentary series X-men). A GWAS genotypes thousands of patients with a disease, as well as controls, in order to compare them. Genotyping is not whole genome sequencing. In genotyping, only the most common variants in the human population are measured. This is of the order of hundreds of thousands to millions of Single Nucleotide Polymorphisms (SNPs). Whole genome sequencing, on the other hand, is the sequencing of the 3 billion base pairs of a person’s genome. It captures SNPs but also variants that are rare or very rare and therefore hard to interpret.

An example of an early GWAS is the Wellcome Trust Case Control Consortium (WTCCC) ❤ whose first study was published in 2007. They genotyped 2000 patients for each of seven diseases: bipolar disorder, Crohn’s disease, coronary artery disease, hypertension, rheumatoid arthritis, type 1 diabetes and type 2 diabetes. They also genotyped 3000 controls that were used for all the diseases. They used a chip that allowed for genotyping of 500 000 SNPs and they found 24 SNPs that were genome-wide significant for a disease. Their dedication to sharing their data means that I have access to it, and I want to thank them here for that, as well as the good people at the EGA (European Genome-phenome Archive).

A SNP typically has only two possible alleles, taken from the four nucleotides A, T, C and G. The genotype at a SNP, made of the two alleles coming from the two homologous chromosomes, therefore takes one of three possible values, for example A/A (homozygous), A/C (heterozygous) or C/C (homozygous). The association of a SNP with a disease is assessed through a statistical test. As a huge number of tests are performed, the results are corrected for multiple testing. The correction ensures that there is less than 1 chance in 20 of getting at least one false positive. In other words, this procedure controls the family-wise error rate.
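To make the test and the correction concrete, here is a minimal sketch of one possible association test: a chi-square test on allele counts in cases versus controls, followed by a Bonferroni correction for 500 000 tests. This is my choice of illustration, not necessarily the test used by the WTCCC, and the genotype counts are made up.

```python
import math


def allele_counts(genotype_counts):
    """Turn genotype counts (n_AA, n_AC, n_CC) into allele counts (n_A, n_C)."""
    n_aa, n_ac, n_cc = genotype_counts
    return 2 * n_aa + n_ac, 2 * n_cc + n_ac


def allelic_chi2_test(case_genotypes, control_genotypes):
    """Chi-square test (1 degree of freedom) on the 2x2 table of allele counts."""
    table = [allele_counts(case_genotypes), allele_counts(control_genotypes)]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Bonferroni: divide the family-wise error rate by the number of tests.
n_tests = 500_000
bonferroni_threshold = 0.05 / n_tests

# Hypothetical SNP: (A/A, A/C, C/C) counts in 2000 cases and 3000 controls.
chi2, p_value = allelic_chi2_test((500, 1000, 500), (600, 1500, 900))
print(f"chi2 = {chi2:.2f}, p = {p_value:.2e}")
print("genome-wide significant:", p_value < bonferroni_threshold)

# Hypothetical null SNP: same genotype frequencies in cases and controls.
null_chi2, null_p_value = allelic_chi2_test((500, 1000, 500), (750, 1500, 750))
```

With these made-up counts, the first SNP has a p-value far below 0.05 and yet does not pass the corrected threshold, which shows how stringent the correction is.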

Manhattan plot of a GWAS. The smaller the p-value, the higher -log(P). The highest dots correspond to the most strongly associated SNPs. By M. Kamran Ikram et al. [CC BY 2.5 (http://creativecommons.org/licenses/by/2.5)], via Wikimedia Commons

SNPs have a strong correlation structure that depends on their distance along the chromosomes. Each parent transmits one chromosome of each pair to its offspring. However, there is another mixing event that happens during meiosis (the formation of gametes): recombination. It consists of the exchange of segments between the two homologous chromosomes in each parent before they are segregated. This creates new chromosomes that are a mix of the earlier chromosomes. Since recombination rarely separates SNPs that are close to each other on the same chromosome, those SNPs will be strongly correlated. This phenomenon is called linkage disequilibrium (LD). As recombination is not homogeneous along the genome, LD depends on the recombination distance measured in centimorgans and not directly on the physical distance measured in base pairs. Linkage disequilibrium means that what is identified is not a SNP-disease association but a locus-disease association. A group of SNPs in linkage disequilibrium will all be associated with the disease. This can be seen in the Manhattan plot.
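One common way to quantify LD between two SNPs is the squared correlation $r^2$ between their alleles on the same chromosome. The sketch below computes it from a list of haplotypes. The 0/1 coding of alleles and the toy haplotype lists are my own assumptions for illustration.

```python
def ld_r2(haplotypes):
    """Squared correlation r^2 between two biallelic SNPs.

    `haplotypes` is a list of (allele_at_snp_1, allele_at_snp_2) pairs,
    one pair per chromosome, with alleles coded as 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n  # frequency of allele 1 at SNP 1
    p_b = sum(b for _, b in haplotypes) / n  # frequency of allele 1 at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # LD coefficient D: departure from independence
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    return d * d / denominator


# Toy example 1: the two SNPs always travel together (perfect LD).
linked = [(0, 0)] * 6 + [(1, 1)] * 4
# Toy example 2: all four haplotypes are equally frequent (no LD).
unlinked = [(0, 0)] * 25 + [(0, 1)] * 25 + [(1, 0)] * 25 + [(1, 1)] * 25

r2_linked = ld_r2(linked)
r2_unlinked = ld_r2(unlinked)
print(f"r^2 linked: {r2_linked:.2f}, r^2 unlinked: {r2_unlinked:.2f}")
```

When $r^2$ is close to 1, knowing the allele at one SNP tells you the allele at the other, so both SNPs will show nearly the same association with the disease.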

Many geneticists think that there must be a causal SNP that explains the association, and that all the other associated SNPs are significant only because they are in linkage disequilibrium with that causal SNP. In order to understand precisely the biological mechanism that is responsible for the association with the disease, it can be worthwhile to identify this SNP. This exercise is called fine mapping.

Another goal, of particular interest for pharmaceutical companies, is to identify potential drug targets. One success story is the discovery of PCSK9. It was discovered that a rare nonsense mutation in African Americans in Texas leads to lower levels of cholesterol and triglyceride and a lower risk of coronary artery disease. This means that the action of the protein is detrimental to health in our sedentary societies. It also means that if you can block the action of the protein in people without the mutation, you should be able to produce similar protection against coronary artery disease, one of the leading causes of death in our societies. As of 2015, three drugs have been released based on that discovery. Here is a longer exposition.

Another question of interest raised by GWAS is: can we use it to predict who will develop a disease? A related question is: are we explaining the heritability we were supposed to find? The answer to both those questions has been a resounding no. The GWAS findings had small effect sizes and, in aggregate, did not explain a large part of the heritability. This problem is called the missing heritability.

I am deeply interested in those two questions. I think that at least some progress can be achieved through better methodology and better modelling of heritability. It is irresponsible to call for ever larger sample sizes at great cost to the public while using naive methodology. I am not saying anyone does that of course.