
You can’t do that!

I have previously discussed what I feel is the disconnect between taught statistics and the reality of being a statistician. Part of this is that the hard and fast rules are not always obeyed by the data you are working with. This can lead to a state of paralysis, either through confusion about what to do next or through refusal to use any of the standard approaches.


Unfortunately, though, I am paid to do data analysis. I am expected to present results, not a list of reasons why I thought none of the tests I know were appropriate. Now, I am not advocating that all the assumptions are there to be ignored, but sometimes you just have to give something a go, get stuck in and see how far you can bend the rules. For something like a regression model, some of the assumptions relate to the model fit itself. For example, you can't check whether the residuals are normally distributed until you have fitted the model. Therefore you have to do the calculations and generate the results before you know if the model is appropriate.
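To make that concrete, here is a toy sketch in Python (simulated data, and statsmodels/scipy are just one way of doing it) showing that the normality check can only happen once the model has been fitted:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Simulated predictor and outcome, purely for illustration
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# You have to fit the model first -- the residuals don't exist until you do
X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

# Only now can the normality assumption be checked, on the residuals
stat, p = stats.shapiro(fit.resid)
print(f"Shapiro-Wilk p-value for residual normality: {p:.3f}")
```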


A big part of statistical analysis is ensuring the robustness of your results. In other words, are they a fluke? Is there another variable you haven't factored in? I find visualization helpful here: can you see any outliers that would change the result if you were to take them out? Is there one particular group of samples that is driving your association? Is your sample size large enough that you can pick up subtle but noisy relationships? Does it hold true in males and females? Old and young? Essentially you are trying to break your finding.
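These checks are easy to script. Here is a toy version of that "try to break it" loop in Python (the dataset, column names and cut-offs are all made up; statsmodels is just one convenient tool for it):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy dataset; the variables and effect size are invented for illustration
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "x": rng.normal(size=n),
    "sex": rng.choice(["M", "F"], size=n),
    "age": rng.integers(20, 80, size=n),
})
df["y"] = 0.5 * df["x"] + rng.normal(size=n)

# The headline result
full = smf.ols("y ~ x", data=df).fit()
print("all samples:", round(full.params["x"], 3))

# Does it survive dropping influential points?
# (Cook's distance > 4/n is a common rule of thumb)
cooks = full.get_influence().cooks_distance[0]
trimmed = smf.ols("y ~ x", data=df[cooks < 4 / n]).fit()
print("influential points removed:", round(trimmed.params["x"], 3))

# Does it hold in males and females? Old and young?
subsets = {"males": df["sex"] == "M", "females": df["sex"] == "F",
           "under 50": df["age"] < 50, "50 and over": df["age"] >= 50}
for label, mask in subsets.items():
    sub = smf.ols("y ~ x", data=df[mask]).fit()
    print(label + ":", round(sub.params["x"], 3))
```

If the estimate for x stays roughly the same across all of these, the finding is harder to dismiss as a fluke.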


In genomics, the large datasets, with measurements at thousands of markers for hundreds or thousands of individuals, often mean repeating the same test for each marker. Doing statistics at this intensity makes it impractical to check the robustness of every single test. To prevent serious violations, fairly stringent filtering is applied to each of the markers prior to analysis. But the main approach to avoiding false positives is to try to replicate your findings in an independent dataset.
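A toy sketch of that workflow in Python (entirely simulated genotypes and phenotypes, with a plain Pearson correlation standing in for whatever per-marker test you would actually run):

```python
import numpy as np
from scipy import stats

# Simulated 0/1/2 genotype counts and phenotypes; sizes and data are made up
rng = np.random.default_rng(2)
n_ind, n_markers = 1000, 5000
disc_geno = rng.integers(0, 3, size=(n_ind, n_markers))
disc_pheno = rng.normal(size=n_ind)
rep_geno = rng.integers(0, 3, size=(n_ind, n_markers))
rep_pheno = rng.normal(size=n_ind)

def marker_pvalues(geno, pheno):
    """Repeat the same simple test (here a Pearson correlation) at every marker."""
    return np.array([stats.pearsonr(geno[:, j], pheno)[1]
                     for j in range(geno.shape[1])])

# Discovery pass: thousands of identical tests, so correct for multiple testing
disc_p = marker_pvalues(disc_geno, disc_pheno)
hits = np.where(disc_p < 0.05 / n_markers)[0]   # Bonferroni threshold

# The real safeguard: do the hits hold up in an independent dataset?
rep_p = marker_pvalues(rep_geno[:, hits], rep_pheno)
replicated = hits[rep_p < 0.05]
print(f"{len(hits)} discovery hits, {len(replicated)} replicate")
```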


Often performing the analysis is quite quick; it's checking the result, and convincing yourself it's true, that takes the time.