In this blog post, I wanted to draw your attention to an article published in Nature at the beginning of the week.
Essentially, it documents how a group of bioinformaticians have turned the seemingly ridiculous idea of storing data in DNA into a plausible option.
It is quite ironic really, as one of the major hurdles of sequencing the genome is how or where to store the 1.5 Gb worth of A, C, T and Gs. We have now turned this on its head with the realization that DNA is such an efficient way of compacting lots of information into a small space that maybe we should try to take advantage of it. It is a neat example of how nature has a solution for a very modern problem.
This article also highlights to me the computational challenge genetics faces to manage and process data efficiently. Once we were able to sequence the genome, the technology continued to develop to do it faster, cheaper and more accurately, dramatically increasing the data output. Alongside this we need the software to keep up with, or even stay ahead of, the technology.
Some software is developed by researchers as a necessity to get their projects done, but many companies will develop programs in parallel with developing sequencing machines in order to offer the complete package to consumers. This means that if you are not interested in the academic lifestyle or a career in scientific research, there are plenty of alternative opportunities in industry.
It means you will be working at the cutting edge of genomic technology, but also, as the article highlights, the cutting edge of computational solutions.
I have previously discussed what I feel is the disconnect between taught statistics and the reality of being a statistician. Part of this is that the hard and fast rules are not always obeyed by the data you are working with. This can lead to a state of paralysis, either through confusion about what to do next or refusal to use any of the standard approaches.
Unfortunately though, I am paid to do data analysis. I am expected to present results, not a list of reasons why I didn't think any of the tests I know were appropriate. Now I am not advocating that all the assumptions are there to be ignored, but sometimes you just have to give something a go, get stuck in and see how far you can bend the rules. For something like a regression model, some of the assumptions relate to the model fitting. For example, you can't check whether the residuals are normally distributed until you have fit the model. Therefore you have to do the calculations and generate the results before you know if it is appropriate.
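To make that concrete, here is a minimal sketch (using simulated data, purely for illustration) of why the check has to come after the fit: the residuals only exist once the model has been estimated, and only then can you test the normality assumption, for instance with a Shapiro-Wilk test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated data for illustration: a linear trend plus noise
x = rng.uniform(0, 10, size=200)
y = 2.5 * x + 1.0 + rng.normal(0, 1, size=200)

# Step 1: fit the regression model
slope, intercept = np.polyfit(x, y, deg=1)

# Step 2: only now do the residuals exist
residuals = y - (slope * x + intercept)

# Step 3: check the normality-of-residuals assumption after the fact
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")
```

If the residuals fail the check, the result is not wasted work: it tells you the model needs revisiting, which you could not have known in advance.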
A big part of statistical analysis is ensuring the robustness of your results. In other words, are they a fluke? Is there another variable you haven't factored in? I find visualization helpful here: can you see any outliers that change the result if you were to take them out? Is there one particular group of samples that is driving your association? Is your sample size large enough that you can pick up subtle but noisy relationships? Does it hold true in males and females? Old and young? Essentially you are trying to break your finding.
In genomics the large datasets with measurements at thousands of markers for hundreds or thousands of individuals often mean repeating the same test for each marker. Doing statistics at this intensity makes it implausible to check the robustness of every single test. To prevent serious violations, fairly stringent filtering is applied to each of the markers prior to analysis. But the main approach to avoid false positives is to try to replicate your findings in an independent dataset.
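A toy sketch of this workflow (simulated null data, marker counts chosen arbitrarily for illustration): run the same test at every marker, then apply a stringent Bonferroni threshold before believing anything.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_markers, n_cases, n_controls = 1000, 50, 50

# Simulated marker matrix (markers x individuals); no real signal here
cases = rng.normal(0, 1, size=(n_markers, n_cases))
controls = rng.normal(0, 1, size=(n_markers, n_controls))

# Repeat the same test at every marker in one vectorised call
t_stats, p_values = stats.ttest_ind(cases, controls, axis=1)

# Stringent Bonferroni threshold to guard against false positives
alpha = 0.05
threshold = alpha / n_markers
hits = np.flatnonzero(p_values < threshold)
print(f"{len(hits)} markers pass the threshold of {threshold:.1e}")
```

Anything that survives the threshold would then go forward for replication in an independent dataset, which remains the real safeguard.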
Often performing the analysis is quite quick: it’s checking and believing that it’s true that takes the time.
One aspect of my job I never really expected to be involved with is study design. Early in my career I worked with publicly available data, so I had virtually no insight into how the experiment had been run or any experience of what might actually happen whilst working in a lab.
The nice thing about being in a group that generates a lot of data is the opportunity to be involved at the conception of an idea and have an input in how the project proceeds. I can imagine this may make some data analysts rather jealous as they get handed a dataset and a question to answer with no obvious link between the two, or a technical flaw that scuppers the proposed analysis.
There is no such thing as the perfect experiment. There are so many variables that may influence the outcome either grossly or subtly: quality of sample going in, temperature, batch of reagents, individual(s) doing the experiment, day of the week, time of day; the list is endless. In larger studies you will inevitably need to perform the experiment multiple times over days, weeks or months. This will lead to batch effects. I don't like the word batch, as I think it is used very loosely to cover a range of different factors. Broadly it means a group of samples that have something in common that may mean their data is more similar to each other than to samples included in other batches. Often this means they were processed at the same time (think about a batch of cakes) and refers to technical factors relating to the experiment.
The challenge is to organise your samples prior to the experimental procedure so that these technical variations do not influence the statistics you want to do. If you are doing a case-control study, you want to randomly allocate each sample so that each batch contains a mix of both groups. What you don't want is the cases to be processed as one group and all the controls to be processed together, as then you can't be sure whether the differences you see are due to the disease or to the fact the experiment was run by two different people.
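A simple way to do this in practice (hypothetical sample labels, batch count chosen arbitrarily) is to shuffle the processing order and then deal the samples out across batches, so cases and controls end up interleaved rather than clustered.

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the allocation is reproducible

# Hypothetical study: 24 cases and 24 controls
samples = [("case", i) for i in range(24)] + \
          [("control", i) for i in range(24)]

# Randomise the order, then deal samples into batches round-robin
random.shuffle(samples)
n_batches = 4
batches = [samples[i::n_batches] for i in range(n_batches)]

for b, batch in enumerate(batches):
    counts = Counter(group for group, _ in batch)
    print(f"batch {b}: {dict(counts)}")
```

Plain shuffling only balances the batches on average; if you need each batch to contain an exact case-control ratio, you would shuffle within each group separately and deal them out in alternation (a stratified allocation).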
There are times when you want to make sure that what you are comparing is from the same batch. For example, we do a lot of work with the discordant twin design. Here we are looking at the differences between the two members of twin pairs, so we want to be sure that those differences are not an artifact of the pair having been processed two months apart.
While I have no desire to go into the lab to run any experiments, I have learnt a lot by having day-to-day interaction with the colleagues who generated the data. That knowledge can really help when it comes to processing the data. Comparing notes with the person who ran the experiments to identify why something doesn't look the way you were expecting invariably gives you confidence once it is resolved. This is the kind of interaction I always wanted out of a job. I enjoy bringing my skills and having responsibility for certain parts of a project whilst others with different skill sets are responsible for something else.
There is enough data out there that, as a bioinformatician, I don't have to work with a group who generate data. However, I would strongly recommend spending some time in that environment, as it is always beneficial to understand a bit more about how and where your data came from.
I am a mathematician.
It’s not my job title, nor do I work in a Mathematics department but that is what I am.
If you wanted me to be more specific, I would say I am a statistician. And I work in a biological field, so maybe biostatistician would be more accurate. But I use computer programming to do my maths, so that makes me a bioinformatician, and hey presto we've reached my actual job title. Underneath it all though I am a mathematician; that is my fundamental skill set, but I have always seen it as a toolkit designed to be applied to a range of fields – in fact anything you fancy (energy output, retail, population demographics, economic trends, sports performance, elections, …). I chose genetics and now work in the School of Medicine amongst predominantly biologists.
Through this blog, I will discuss what it is like to continue doing maths beyond the classroom and hopefully encourage a few future mathematicians to stick with it.