Bringing together statistics, genetics and genealogy

In this post I want to highlight a genetic study published this week in Nature Communications, which uses genetic data to characterize the current population of the US and databases of family history to understand how it came to be.

Their starting point was a very large genetic data set of 774,516 people currently residing in the US, the majority of whom were also born there, with measurements at 709,358 different genetic positions.

They compared the genetic profiles of all pairs of individuals to identify regions of the genome (of a certain size) shared by both individuals, consistent with those two individuals having a common ancestor. It is important to note that such sharing is very unlikely between two randomly selected, or even two distantly related, individuals. This study was therefore only possible because they had accumulated such a large genetic data set, which gave them enough pairs of individuals with a genomic region in common to make inferences. Based on this information they produce a plot of US states in which the distance between points reflects how similar the common ancestry is between individuals born in those states, and the result closely resembles a geographical map of the US. What it means is that, in general, the closer together two individuals live, the closer their ancestry is likely to be. This isn’t hard to believe and has been shown before: similar studies in European populations have produced similar figures in the past.
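To make the idea of a shared region "of a certain size" concrete, here is a toy sketch in Python. It is not the method used in the study: real analyses work on phased genetic data and measure segment length in genetic distance, whereas this simply looks for runs of identical markers between two made-up sequences. The function name `shared_segments`, the `min_len` threshold and the example sequences are all my own illustrative assumptions.

```python
def shared_segments(a, b, min_len):
    """Return (start, end) index pairs for runs where a and b are identical
    for at least min_len consecutive positions (end is exclusive)."""
    segments = []
    start = None  # index where the current matching run began, if any
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] == b[i]:
            if start is None:
                start = i
        else:
            # A mismatch closes the current run; keep it only if long enough
            if start is not None and i - start >= min_len:
                segments.append((start, i))
            start = None
    # Close a run that extends to the end of the sequences
    if start is not None and n - start >= min_len:
        segments.append((start, n))
    return segments


# Made-up marker sequences for two hypothetical individuals
person_a = "AACGTTGCAATCGGA"
person_b = "TACGTTGCATTCGCA"

print(shared_segments(person_a, person_b, min_len=5))  # [(1, 9)]
```

Short matching runs (like the three-marker run near the end of these sequences) are discarded, which mirrors the point above: brief stretches of similarity happen by chance, so only sufficiently long shared regions are treated as evidence of a common ancestor.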

The aim of the study was to divide the sample up into groups, referred to as clusters, of individuals whose genetic data implied common ancestry and which represented the substructure of the US population. What is perhaps novel to this study is the inclusion of information from participants about when and where their relatives were born, which is used to interpret the origins and migratory patterns of each cluster. All of this is then discussed in the context of known immigration and migration patterns in recent times (roughly the last 500 years).

A few things struck me about this article. Firstly, the data was taken from a genetic genealogy service, AncestryDNA, which uses a saliva sample to genetically profile customers and generate statistics on their ancestry. The analytical sample was 774,516 individuals of US origin who provided consent for their data to be included in genetics research, potentially demonstrating how interested the general population is in the information that their genome holds. What’s more, these individuals are also keen for it to be used to improve our understanding of how genetics influences health and disease.

Secondly, the authors used network analysis to identify their clusters of individuals with common ancestry. The article is littered with mathematical terminology (“principal components”, “weight function”, “hierarchical clustering”, “spectral dimensionality reduction technique”), demonstrating not only the utility of statistics in genetics but also its application to supplementing our knowledge of modern history.
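For readers unfamiliar with these terms, the following is a minimal Python sketch of one of them, hierarchical clustering, applied to a small weighted network. It is purely illustrative and is not the authors' pipeline: the people, the sharing weights and the function `average_linkage_clusters` are all invented for this example, and I have assumed a simple average-linkage merging rule.

```python
from itertools import combinations


def average_linkage_clusters(weights, people, n_clusters):
    """Greedy agglomerative clustering: repeatedly merge the two clusters
    with the highest average pairwise weight until n_clusters remain."""

    def w(p, q):
        # Edges are undirected; pairs with no recorded edge get weight 0
        return weights.get((p, q), weights.get((q, p), 0.0))

    clusters = [{p} for p in people]  # start with one cluster per person
    while len(clusters) > max(n_clusters, 1):
        best_pair, best_score = None, -1.0
        for i, j in combinations(range(len(clusters)), 2):
            total = sum(w(p, q) for p in clusters[i] for q in clusters[j])
            score = total / (len(clusters[i]) * len(clusters[j]))
            if score > best_score:
                best_pair, best_score = (i, j), score
        i, j = best_pair
        clusters[i] |= clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return clusters


# Hypothetical network: nodes are individuals, weights are made-up
# measures of how much of the genome each pair shares
people = ["p1", "p2", "p3", "p4", "p5", "p6"]
weights = {
    ("p1", "p2"): 0.9, ("p1", "p3"): 0.8, ("p2", "p3"): 0.7,
    ("p4", "p5"): 0.9, ("p4", "p6"): 0.6, ("p5", "p6"): 0.8,
    ("p3", "p4"): 0.1,  # a weak link between the two groups
}

clusters = average_linkage_clusters(weights, people, n_clusters=2)
print(sorted(sorted(c) for c in clusters))
# [['p1', 'p2', 'p3'], ['p4', 'p5', 'p6']]
```

The weak link between p3 and p4 is not enough to pull the two groups together, so the individuals fall into two clusters of closely connected people. With hundreds of thousands of individuals a brute-force approach like this would be far too slow, which is presumably part of why the study needs more sophisticated machinery.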

Thirdly, they make use of a range of large data sets (multiple genetic data sets and genealogy databases). This is increasingly necessary in genetics research in order to interpret findings and draw conclusions, making this a nice demonstration of how to think about incorporating additional sources of information (like a historian would) in order to contextualize your results.

Finally, if nothing else, this research serves as a timely reminder of the broad roots and origins of the current residents of the USA and how they came to be there.
