Science is all about dealing with unknowns.
There are the big unknowns: ‘Can we eradicate cancer?’, ‘Why do we forget things as we get older?’, ‘Can we grow replacement organs?’. Then there are the day-to-day niggling unknowns. These are the ones that tend to cause the most anxiety, perhaps because we never expect to completely answer the big questions and are simply looking to add to the body of knowledge.
Pretty much all of the day-to-day problems I deal with relate to how we are going to test a particular hypothesis. Once you have data in hand, it is not uncommon for technicalities or oversights to emerge. We have to accept that the perfect study design is often unobtainable, and instead strive to control for as many external factors as possible that may influence the result. Where you couldn’t do so in the way the experiment was conducted, you have a second chance at the analysis stage. This is limited by two things: 1) knowing what all of the possible confounders are, and 2) actually having a measure or proxy for each confounder.
There are two routes taken when dealing with confounders: either you perform the initial analysis and then see if the results change with the addition of extra covariates, or you include all the variables from the outset. Personally I don’t see the point of doing an analysis if you are subsequently going to discount any of the results you later find to be a product of some other factor. Of course, this view may reflect my ‘omics background, where, given the large number of features tested in every experiment, spurious results are par for the course, and the quicker you can discount them the better.
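As a toy sketch of why the choice matters (simulated data, hypothetical variable names — not any particular study), the two routes might look like this in Python. Here a confounder drives both the exposure and the outcome, so the unadjusted model suggests an effect that adjustment removes:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated scenario: a confounder z influences both exposure x and outcome y.
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)
y = 0.5 * z + rng.normal(size=n)  # note: x has no direct effect on y

def ols_coef(y, X):
    """Least-squares coefficients, with an intercept column prepended."""
    X = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]  # drop the intercept

# Route 1: fit the naive model first, then check whether it survives adjustment.
naive = ols_coef(y, x[:, None])[0]
# Route 2: include the known confounder from the outset.
adjusted = ols_coef(y, np.column_stack([x, z]))[0]

print(f"naive x effect:    {naive:.2f}")     # spuriously non-zero
print(f"adjusted x effect: {adjusted:.2f}")  # close to the true value of zero
```

Either route reaches the same adjusted estimate in the end; the difference is whether you spend time interpreting the naive result before discounting it.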
Recently I have been working with some data for which we are aware of many possible confounders. Some of these were obvious at the start, and we have the relevant information to include in the analysis. For some of the unknowns, we have calculated estimates from our data using a commonly accepted methodology; however, we are unsure how accurate these estimates are, or whether they capture everything they should, as there is little empirical evidence with which to truly assess them.
An alternative in high-dimensional data (that is, when you have lots of data points for each sample) is to use methods that create surrogate variables. These capture the variation present in your dataset that is presumed to reflect the confounders we are concerned about (and those we perhaps haven’t thought of yet). I have always been cautious of such an approach, as I don’t like the idea of not understanding exactly what you are putting into your model. What’s more, there is a possibility that you are removing some of the true effects you are interested in. However, there is the opposing argument: ‘What does it matter? If it prevents false positive results then that’s the whole point.’
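To make the idea concrete, here is a toy sketch in the spirit of surrogate variable analysis — not the actual sva algorithm, and all the data below is simulated. The effect of the variable of interest is regressed out of every feature, and the top singular vectors of the residuals are taken as surrogate variables; with a strong hidden batch effect, the first surrogate tracks that batch closely:

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_features = 60, 1000

# Simulated high-dimensional data: a hidden batch effect drives
# variation across many features, alongside the variable of interest.
batch = rng.normal(size=n_samples)    # unobserved confounder
primary = rng.normal(size=n_samples)  # the variable we actually care about
data = (np.outer(primary, rng.normal(size=n_features) * 0.5)
        + np.outer(batch, rng.normal(size=n_features))
        + rng.normal(size=(n_samples, n_features)) * 0.5)

# Remove the modelled effect of the primary variable from each feature,
# then take the top left singular vectors of the residuals as surrogates.
X = np.column_stack([np.ones(n_samples), primary])
beta, *_ = np.linalg.lstsq(X, data, rcond=None)
residuals = data - X @ beta
U, S, Vt = np.linalg.svd(residuals, full_matrices=False)
surrogate = U[:, 0]  # first surrogate variable

# The surrogate should correlate strongly with the hidden batch (up to sign),
# even though the batch was never measured directly.
corr = abs(np.corrcoef(surrogate, batch)[0, 1])
print(f"|correlation| with hidden batch: {corr:.2f}")
```

This also illustrates the worry above: the surrogate is just a direction of variation, so if a true signal happens to be correlated with it, adjusting for it can remove some of that signal too.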
At present it is somewhat of an open question which way we should proceed. It is good practice to question your approach and test it until it breaks. Having tried a few ways of doing something, each of which produces slightly different numbers, how do we decide which is the correct one? Part of the problem is that we don’t know what the right answer is. We can keep trying new things, but how do we know when to stop? Unlike school, we can’t just turn the textbook upside-down and flick to the back pages to mark our effort as right or wrong. Instead we have to think outside the box to come up with additional ways to check the robustness of our result. But this is par for the course; research is all about unknowns. These are the challenges we relish, and maybe eventually we will start to convince ourselves that our result might be true!
Often the gold standard is replication, that is, repeating the analysis and finding the same result in a completely independent sample. Sometimes you might have a second cohort already lined up, so this validation can be internal and give you confidence in what you are doing. Or you may face a nervous wait to collect more data, or for another group to follow up your work.
Sometimes though, you just have to go with what you have got. Sharing your work with the research community is a great opportunity for feedback and may prompt a long overdue conversation about the issues at hand. Ultimately, as long as you are clear about exactly what has been done, your findings can be interpreted appropriately.