Category Archives: Mathematics

You may never have heard of a bioinformatician, but there is big demand for them.

Today while scanning Twitter I came across two posts relating to the demand for data science/analytical/programming jobs in both academia and industry.

The first tweet was from the Royal Society and Royal Statistical Society promoting a report on the need for STEM (science, technology, engineering and maths) skills in the workforce. It is the product of a conference which brought together academics, the Government’s Chief Scientific Adviser and senior representatives from BT, Amazon and Jaguar Land Rover, all united by their need for numerate, computer-literate graduates. They estimate that up to 58,000 data science jobs are created each year, and that a large number of these positions go unfilled because of a lack of suitable candidates. In industry there is demand to model data, to make predictions and decisions about which trends to follow, and to visualize data in a way that allows those without strong numerical skills to make sense of it. Employers need people who can communicate effectively what they are doing and think creatively about what further information they can extract from the data to improve the commercial side of the business. It is a worthwhile read for anyone wondering where their mathematics or computer science training might take them.

The second was a tweet from Barefoot Computing stating that in the UK we are losing £63bn from GDP because we don’t have the appropriate technical or digital skills. I don’t know where the statistic comes from, but Barefoot are using it to encourage confidence and enthusiasm in the teaching of the primary school computer science curriculum, which is their underlying ethos.

Both of these reiterated to me the demand there is, and will be, across a wide variety of disciplines for individuals with strong mathematical or computer science skill sets. So if you are considering your career options, or know someone who is, encourage them to consider a mathematics or computer science degree, as this demand will just keep on growing.


Bringing together statistics, genetics and genealogy

In this post I want to highlight a genetic study published this week in Nature Communications which uses genetic data to characterize the current population of the US and, with the help of databases of family history, understand how it came to be.

Their starting point was a very large genetic data set of 774,516 people currently residing in the US, the majority of whom were also born there, with measurements at 709,358 different genetic positions.

They compared the genetic profiles of all pairs of individuals to identify regions of the genome (of a certain size) shared by both individuals, consistent with those two individuals having a common ancestor. It is important to note that this is very unlikely to be the case for two randomly selected, or even two distantly related, individuals. The study was therefore only possible because they had accumulated such a large genetic data set, meaning they had enough pairs of individuals sharing such a genomic region to make any inferences. Based on this information they produced a plot of US states in which the distance between points represents the similarity in ancestry between individuals born in those states, and which closely resembles a geographical map of the US. What this means is that, in general, the closer together two individuals live, the closer their ancestry is likely to be. This isn’t hard to believe, and has been shown before: similar studies in European populations have produced similar figures in the past.
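To give a flavour of how such a map can be drawn, here is a minimal sketch of my own (invented numbers, not the authors’ pipeline) that projects a small matrix of pairwise ancestry dissimilarities between a handful of states into two dimensions using multidimensional scaling from scikit-learn:

# Toy sketch: project a made-up matrix of pairwise ancestry
# dissimilarities between four states into 2D with classical MDS.
# The numbers are invented purely for illustration.
import numpy as np
from sklearn.manifold import MDS

states = ["NY", "PA", "GA", "TX"]
# Symmetric dissimilarity matrix: smaller = more shared ancestry
dissimilarity = np.array([
    [0.0, 0.2, 0.6, 0.7],
    [0.2, 0.0, 0.5, 0.6],
    [0.6, 0.5, 0.0, 0.4],
    [0.7, 0.6, 0.4, 0.0],
])

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissimilarity)

for state, (x, y) in zip(states, coords):
    print(f"{state}: ({x:.2f}, {y:.2f})")

States with more shared ancestry end up closer together in the resulting plot, which is exactly the effect that makes the authors’ figure resemble a map of the US.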

The aim of the study was to divide the sample into groups, referred to as clusters, of individuals whose genetic data implied common ancestry and which represented the substructure of the US population. What is perhaps novel about this study is the inclusion of information from participants about when and where their relatives were born, used to interpret the origins and migratory patterns of each cluster. All of this is then discussed in the context of known immigration and migration patterns in recent times (roughly the last 500 years).

A few things struck me about this article. Firstly, the data came from a genetic genealogy service, AncestryDNA, which uses a saliva sample to genetically profile customers and generate statistics on their ancestry. The analytical sample comprised 774,516 individuals of US origin who consented to their data being included in genetics research, demonstrating how interested the general population potentially is in the information their genome holds. What’s more, these individuals are also keen for it to be used to improve our understanding of how genetics influences health and disease.

Secondly, the authors used network analysis to identify their clusters of individuals with common ancestry. The article is littered with mathematical terminology ("principal components", "weight function", "hierarchical clustering", "spectral dimensionality reduction technique"), demonstrating not only the utility of statistics in genetics but also its application to supplementing our knowledge of modern history.
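As a rough, hypothetical illustration of one of these ideas (not the method actually used in the paper), hierarchical clustering can group individuals directly from a matrix of pairwise relatedness scores; the numbers below are made up:

# Rough illustration: hierarchical clustering on a toy matrix of
# pairwise relatedness scores (higher = more shared ancestry).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

relatedness = np.array([
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.7],
    [0.2, 0.1, 0.7, 1.0],
])

# Convert similarity to distance, then build the clustering tree
distance = 1.0 - relatedness
np.fill_diagonal(distance, 0.0)
tree = linkage(squareform(distance), method="average")
clusters = fcluster(tree, t=2, criterion="maxclust")
print(clusters)  # e.g. [1 1 2 2]: two groups of more closely related individuals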

Thirdly, they make use of a range of large data sets (multiple genetic data sets and genealogy databases). This is increasingly necessary in genetics research in order to interpret findings and draw conclusions, making this a nice demonstration of how to incorporate additional sources of information (as a historian would) to contextualize your results.

Finally, if nothing else, this research serves as a timely reminder of the broad roots and origins of the current residents of the USA and how they came to be there.

Let’s test all the genes!

In this blog post (and others to follow) I want to give some examples of how statistics and statisticians have helped advance genetics research.

Most genetic studies these days consider and measure variation across all 20,000 human genes simultaneously. This is a great advance, as it means we can set aside all the old biological theories that previous research was based around but that have yet to find any concrete support. This is the basis of a genome-wide association study, often shortened to GWAS. GWAS are often referred to as a hypothesis-free approach. Technically they are not completely hypothesis-free, as to do any statistics we need a hypothesis to test. They work on the hypothesis that the disease of interest has genetic risk factors; however, we don’t need to have any idea which gene or genes may be involved before we start. This means we may find a completely new gene or novel biological process which could revolutionize our understanding of a particular disease. Hence, they brought great promise, and new insight, to contemporary genetics research.

So when it comes to doing the statistical analysis for our GWAS, we are essentially performing the same mathematical routine over and over again for each genetic variant in turn. This procedure is automated by computer programmes designed to do it efficiently. At the end we have a vast table of summary statistics to draw our conclusions from (as a gene will have multiple genetic variants across it, this table can contain hundreds of thousands or even millions of rows). One highly important number for each site is the p-value from each statistical test, which we can use to rank our table of results. There is no plausible way we can apply the standard checks of individual statistical tests that a mathematician would typically be taught (i.e. do the data meet the assumptions?) to every single genetic variant we have tested. Instead we often look at the distribution of p-values across all the tests, generally using a Q-Q plot to compare the expected quantiles to the observed quantiles, to decide whether there is major bias or any confounders affecting the results. Once happy in general, we can look at which genetic variants are significantly associated with our disease of interest.
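Here is a minimal simulated sketch of that workflow (my own illustration in Python, not any particular GWAS package): run the same simple test for every variant, collect the p-values, and compare observed against expected quantiles in a Q-Q plot. Because the data are simulated with no true associations, the points should hug the diagonal.

# Sketch: simulate per-variant association tests and inspect the
# p-value distribution with a Q-Q plot.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
n_variants, n_samples = 5000, 1000
phenotype = rng.normal(size=n_samples)

pvals = np.empty(n_variants)
for i in range(n_variants):
    genotype = rng.binomial(2, 0.3, size=n_samples)  # 0/1/2 allele counts
    # Simple per-variant test: does genotype correlate with phenotype?
    _, pvals[i] = stats.pearsonr(genotype, phenotype)

observed = -np.log10(np.sort(pvals))
expected = -np.log10(np.arange(1, n_variants + 1) / (n_variants + 1))

plt.scatter(expected, observed, s=4)
plt.plot([0, expected.max()], [0, expected.max()], color="red")
plt.xlabel("Expected -log10(p)")
plt.ylabel("Observed -log10(p)")
plt.title("Q-Q plot of p-values (simulated null)")
plt.show()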

With a number of computer software tools it can be fairly straightforward to plug in the numbers and perform the required statistical test. The challenge is often the interpretation or drawing of conclusions, in particular when it comes to the p-value. This is made harder by the fact that most statistical training courses make the rather unrealistic assumption that you will only ever do one statistical test at a time, and teach you how to apply a significance threshold in that scenario. This knowledge is then taken forward and merrily applied in exactly the same manner to every statistical test performed from that point on.

However, there is a potential trap.

When you perform multiple tests, you increase your chances of getting a significant finding even if there are no true associations. For example, let’s assume that there is no association between eating fruit and time spent watching TV. But to be 100% sure, we find a group of people and ask about their TV-watching habits and how many apples, bananas, oranges, strawberries, kiwis, melons, pears, blueberries, mangoes and plums they eat each week, then we test each of these ten different fruits individually. At a 10% significance level (i.e. p-value < 0.1) we would expect 0.1 x 10 = 1 of the tests to identify a significant finding, which would be a false positive. The more things we test, the more we increase our chances of finding a significant association, even where none exists. This is called ‘multiple testing’, or ‘multiple comparisons’.
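A quick simulation makes this concrete (a toy sketch with made-up data; fruit counts and TV hours are generated completely independently of each other):

# Simulate the fruit-and-TV example: ten independent tests where no
# true association exists. At a 10% significance level we expect
# roughly one "significant" result per run, purely by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_people, n_fruits, alpha = 200, 10, 0.1

tv_hours = rng.normal(loc=2, scale=1, size=n_people)
false_positives = 0
for _ in range(n_fruits):
    fruit_per_week = rng.poisson(lam=3, size=n_people)  # unrelated to TV
    _, p = stats.pearsonr(fruit_per_week, tv_hours)
    if p < alpha:
        false_positives += 1

print(f"'Significant' associations out of {n_fruits} tests: {false_positives}")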

This knowledge is crucial for correctly interpreting the results of a GWAS. Say we have tested 500,000 genetic variants: even if none of them were truly associated, at a significance threshold of P < 0.05 we would expect 500,000 x 0.05 = 25,000 associations! That is (potentially) a rather hefty number of false positives (associations you report as true but which are in fact false). To prevent this, we need to adjust our significance threshold to account for the number of tests we have performed, minimizing our chances of incorrectly reporting a false positive. There are multiple methodologies proposed to resolve this issue, and this is one example of where statistics plays an important role in genetic research.
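One of the simplest such corrections, and a common starting point, is the Bonferroni adjustment: divide the desired significance level by the number of tests. A back-of-the-envelope sketch:

# Adjusting the significance threshold for the number of tests.
# Bonferroni: divide the overall error rate by the number of tests.
n_tests = 500_000
alpha = 0.05

expected_false_positives = n_tests * alpha
bonferroni_threshold = alpha / n_tests

print(f"Expected false positives at p < {alpha}: {expected_false_positives:.0f}")
print(f"Bonferroni-adjusted threshold: p < {bonferroni_threshold:.1e}")
# The conventional 'genome-wide significance' threshold of 5e-8 is
# roughly a Bonferroni correction for about a million independent tests.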

What’s more, because of this high probability of chance findings in GWAS, there is a common consensus that all findings, even if they withstand the appropriate correction for the number of genetic variants tested, must be replicated before they are ‘believed’ or taken seriously. Replication means repeating the whole GWAS process in a completely separate sample. So that’s more work for the statisticians then!

If you are interested in this topic you may enjoy this cartoon, which gives an alternative (comical) solution.

You can’t do that!

I have previously discussed what I feel is the disconnect between taught statistics and the reality of being a statistician. Part of this is that the hard and fast rules are not always obeyed by the data you are working with. This can lead to a state of paralysis, either through confusion about what to do next or through refusal to use any of the standard approaches.


Unfortunately though, I am paid to do data analysis. I am expected to present results, not a list of reasons why I didn’t think any of the tests I know were appropriate. Now, I am not advocating that all the assumptions are there to be ignored, but sometimes you just have to give something a go, get stuck in and see how far you can bend the rules. For something like a regression model, some of the assumptions relate to the model fit itself. For example, you can’t check whether the residuals are normally distributed until you have fitted the model. Therefore you have to do the calculations and generate the results before you know whether the model is appropriate.
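As a small illustration of what I mean (toy data, statsmodels chosen purely for convenience): fit the model first, then inspect the residuals.

# You can only check the normality of residuals after the model has
# been fitted. Toy data; ordinary least squares via statsmodels.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)

model = sm.OLS(y, sm.add_constant(x)).fit()
residuals = model.resid

# Now that the model is fitted, the assumption can actually be checked
stat, p = stats.shapiro(residuals)
print(model.params)  # intercept and slope estimates
print(f"Shapiro-Wilk p-value for residual normality: {p:.3f}")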


A big part of statistical analysis is ensuring the robustness of your results. In other words, are they a fluke? Is there another variable you haven’t factored in? I find visualization helpful here: can you see any outliers that would change the result if you took them out? Is there one particular group of samples driving your association? Is your sample size large enough that you can pick up subtle but noisy relationships? Does the result hold true in males and females? Old and young? Essentially you are trying to break your finding.
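A toy sketch of the kind of “try to break it” checks I mean (invented data): leave-one-out sensitivity and a simple subgroup split.

# Robustness checks on a toy correlation: does it survive dropping
# each point in turn, and does it hold in both subgroups?
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 60
group = np.repeat(["male", "female"], n // 2)
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=1.0, size=n)

r_all, p_all = stats.pearsonr(x, y)
print(f"All samples: r={r_all:.2f}, p={p_all:.3f}")

# Leave-one-out: how much does any single sample move the estimate?
loo_r = [stats.pearsonr(np.delete(x, i), np.delete(y, i))[0] for i in range(n)]
print(f"Leave-one-out r range: {min(loo_r):.2f} to {max(loo_r):.2f}")

# Subgroup check: does the association hold in both sexes?
for g in ["male", "female"]:
    r, p = stats.pearsonr(x[group == g], y[group == g])
    print(f"{g}: r={r:.2f}, p={p:.3f}")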


In genomics, large datasets with measurements at thousands of markers for hundreds or thousands of individuals often mean repeating the same test for each marker. Doing statistics at this intensity makes it implausible to check the robustness of every single test. To prevent serious violations, fairly stringent filtering is applied to each of the markers prior to analysis, as sketched below. But the main approach to avoiding false positives is to try to replicate your findings in an independent dataset.
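As a hypothetical sketch of such pre-analysis filtering (the thresholds and the simulated genotype matrix are mine, purely illustrative), one might drop markers with a low minor allele frequency or a poor call rate:

# Toy marker filtering: remove variants with a low minor allele
# frequency or too many missing genotype calls.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Rows = individuals, columns = markers; values are 0/1/2 allele counts
allele_freqs = rng.uniform(0.01, 0.5, size=1000)
genotypes = pd.DataFrame(rng.binomial(2, allele_freqs, size=(500, 1000)).astype(float))

# Randomly mask 2% of genotype calls as missing
missing = rng.random(genotypes.shape) < 0.02
genotypes = genotypes.mask(missing)

maf = genotypes.mean() / 2            # allele frequency per marker
maf = np.minimum(maf, 1 - maf)        # fold to the minor allele frequency
call_rate = genotypes.notna().mean()  # proportion of non-missing calls per marker

keep = (maf >= 0.05) & (call_rate >= 0.95)
filtered = genotypes.loc[:, keep]
print(f"Markers kept after filtering: {keep.sum()} of {len(keep)}")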


Often performing the analysis is quite quick: it’s checking and believing that it’s true that takes the time.