In this blog post (and others to follow) I want to give some examples of how statistics and statisticians have helped advance genetics research.
Most genetic studies these days measure variation across all 20,000 human genes simultaneously. This is a great advance, as it means we can forget the old biological theories that previous research was built around but that never gained any concrete support. This is the basis of a genome-wide association study, often shortened to GWAS. GWAS are often referred to as a hypothesis-free approach. Technically, they are not completely hypothesis-free, as to do any statistics we need a hypothesis to test. They work on the hypothesis that the disease of interest has genetic risk factors; however, we don’t need to have any idea which gene or genes may be involved before we start. This means we may find a completely new gene or novel biological process which could revolutionize our understanding of a particular disease. Hence, GWAS have brought great promise, and new insight, to contemporary genetics research.
So when it comes to doing the statistical analysis for our GWAS, we are essentially performing the same mathematical routine over and over again for each genetic variant in turn. This procedure is automated by computer programs designed to do this efficiently. At the end we have a vast table of summary statistics to draw our conclusions from; because each gene has multiple genetic variants across it, this table can contain hundreds of thousands or even millions of rows. One highly important number for each variant is the p-value from its statistical test, which we can use to rank our table of results. There is no plausible way to apply the standard checks that a mathematician is typically taught to do for an individual statistical test (i.e. do the data meet the assumptions?) to every single genetic variant that we have tested. Instead we often look at the distribution of p-values across all the tests, generally using a Q-Q plot to compare the expected quantiles to the observed quantiles, to decide if there is major bias, or any confounders, affecting the results. Once happy in general, we can look at which genetic variants are significantly associated with our disease of interest.
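To make the Q-Q check a little more concrete, here is a minimal sketch of how the points on such a plot could be computed. It assumes that, if nothing is associated, the p-values behave like uniform random numbers between 0 and 1. The function name `qq_points` and the simulated p-values are my own illustration, not the output of any particular GWAS package.

```python
import math
import random


def qq_points(p_values):
    """Return (expected, observed) pairs on the -log10 scale for a Q-Q plot.

    Hypothetical helper for illustration only.
    """
    n = len(p_values)
    observed = sorted(p_values)
    # Under the null, the i-th smallest of n uniform p-values is expected
    # to sit near i / (n + 1).
    expected = [(i + 1) / (n + 1) for i in range(n)]
    return [(-math.log10(e), -math.log10(o)) for e, o in zip(expected, observed)]


# Simulated p-values for 10,000 variants with no true associations.
# 1.0 - random() lies in (0, 1], which avoids taking log10 of zero.
rng = random.Random(1)
null_p = [1.0 - rng.random() for _ in range(10_000)]
points = qq_points(null_p)
```

If the points stay close to the diagonal (observed roughly equal to expected), there is no obvious sign of systematic bias. Real results that drift above the diagonal across most of the range would be a reason to look for confounders before trusting any individual p-value.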
With a number of computer software tools available, it can be fairly straightforward to plug in the numbers and perform the required statistical test. The challenge is often the interpretation, or drawing conclusions, in particular when it comes to the p-value. This is made harder by the fact that most statistical training courses make the rather unrealistic assumption that you will only ever do one statistical test at a time, and teach you how to apply a significance threshold in this scenario. This knowledge is then taken forward, and merrily applied in exactly the same manner to all statistical tests performed from that point forward.
However, there is a potential trap.
When you perform multiple tests, you increase your chances of getting a significant finding, even if there are no true associations. For example, let’s assume that there is no association between eating fruit and time spent watching TV. But to be 100% sure, we find a group of people and ask about their TV watching habits and how many apples, bananas, oranges, strawberries, kiwis, melons, pears, blueberries, mangoes and plums they eat each week, and then we test each one of these ten different fruits individually. At a 10% significance level (i.e. p-value < 0.1) we would expect 0.1 x 10 = 1 test to give a significant finding, which would be a false positive. The more things we test, the more we increase our chances of finding a significant association, even where none exists. This is called the ‘multiple testing’, or ‘multiple comparisons’, problem.
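The fruit example can be checked with a few lines of code. The sketch below assumes the ten tests are independent of each other, which is a simplification (people who eat a lot of one fruit may well eat a lot of another), and the function names are just ones I have made up for illustration.

```python
import random


def expected_false_positives(n_tests, alpha):
    """Average number of significant results when no association is real."""
    return n_tests * alpha


def prob_at_least_one(n_tests, alpha):
    """Chance of one or more false positives, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests


def simulate_any_false_positive(n_tests, alpha, n_studies, seed=0):
    """Repeat the whole 'study' many times with purely random p-values and
    return the fraction of studies with at least one significant test."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_studies


print(expected_false_positives(10, 0.1))
print(prob_at_least_one(10, 0.1))
print(simulate_any_false_positive(10, 0.1, 20_000))
```

Under these assumptions, the chance of at least one of the ten fruits looking 'significant' is roughly 65%, even though none of them is truly related to TV watching, and the simulated fraction should land close to that figure.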
This knowledge is crucial for correctly interpreting the results of a GWAS. Say we have tested 500,000 genetic variants. Even if none of them were truly associated, at a significance threshold of P < 0.05 we would expect around 500,000 x 0.05 = 25,000 significant associations! That is (potentially) a rather hefty number of false positives (associations you report as true that are in fact false). To prevent this, we need to adjust our significance threshold to account for the number of tests we have performed, minimizing our chances of reporting a false positive. Multiple methodologies have been proposed to resolve this issue, and this is one example where statistics plays an important role in genetic research.
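As one illustration of adjusting the threshold, the sketch below uses the Bonferroni correction, which simply divides the significance level by the number of tests. It is only one of the available methods and I am using it here because it is the simplest to show; the variant names and p-values are invented for the example.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the chance of any false positive
    across all tests at or below alpha (Bonferroni correction)."""
    return alpha / n_tests


def significant_variants(p_values, alpha, n_tests):
    """Return the names of variants whose p-value passes the corrected threshold.

    p_values: dict mapping a variant name to its p-value (hypothetical data).
    """
    threshold = bonferroni_threshold(alpha, n_tests)
    return [name for name, p in p_values.items() if p < threshold]


n_variants = 500_000
print(n_variants * 0.05)                       # expected false positives, uncorrected
print(bonferroni_threshold(0.05, n_variants))  # corrected per-variant threshold

# Made-up results for three of the 500,000 variants.
example = {"variant_1": 3e-9, "variant_2": 0.01, "variant_3": 4e-6}
print(significant_variants(example, 0.05, n_variants))
```

With 500,000 variants the corrected threshold is 0.05 / 500,000 = 0.0000001, so a p-value of 0.01, which would look convincing for a single test, is nowhere near small enough here.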
What’s more, because of the high probability of chance findings in GWAS, there is a common consensus that all findings, even if they withstand the appropriate control for the number of genetic variants tested, must be replicated before they are ‘believed’ or taken seriously. Replication means repeating the whole GWAS process in a completely separate sample. So that’s more work for the statisticians then!
If you are interested in this topic you may enjoy this cartoon, which gives an alternative (comical) solution.