Saturday, 18 June 2011

Five statistical things I wished I had been taught 20 years ago

I came through the English educational system, which meant that although I was mathematically minded, my maths teaching rapidly stopped once I had chosen biochemistry for my undergraduate degree. At university I took the more challenging "Maths for Chemists" option in my first year, though in retrospect that was probably a mistake: it was all about partial differentiation, with not enough stats. The maths for biologists was probably a better course, but even that, I think, spent too much time on things like the t-test and ANOVA, and not enough on what you actually need. To my subsequent regret, no one took me aside and said "listen mate, you're going to be doing a lot of statistics, so just get the major statistical tools under your belt now".


Biology is really about stats. Indeed, the founders of much of frequentist statistics - RA Fisher and colleagues - were motivated by biological problems. We just lost the link in the heyday of molecular biology, when you could get away with n=2 (or n=1!) experiments due to the "large effect size" style of experiment - ie, the band is either there or not. But now we're back to working out far more intertwined and subtle things. So every biologist - molecular or not - is going to have to become a "reasonable" statistician.


These are the pieces of hard-won statistical knowledge I wish someone had taught me 20 years ago, rather than leaving me to my meandering, random-walk approach.


1. Non-parametric statistics. These are statistical tests which make a bare minimum of assumptions about underlying distributions; in biology we are rarely confident that we know the underlying distribution, and hand-waving about the central limit theorem can only get you so far. Wherever possible you should use a non-parametric test. This means Mann-Whitney (or Wilcoxon if you prefer) for testing the "medians" of two distributions (medians is in quotes because this is not quite true: these tests compare something closely related to the median), Spearman's Rho (rather than Pearson's r) for correlation, and the Kruskal-Wallis test rather than ANOVA (though, if I have this right, Kruskal-Wallis cannot handle the more sophisticated nested models that ANOVA can). Finally, don't forget the rather wonderful Kolmogorov-Smirnov test (I always think it sounds like really good vodka) of whether two sets of observations come from the same distribution. All of these methods share a basic theme: they work on the rank of items in a distribution, not the actual values. So, if in doubt, do things on the rank of a metric rather than the metric itself.
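
For anyone who wants to try these outside R, the same rank-based tests are a single import away in Python's scipy; a minimal sketch with made-up, deliberately skewed (lognormal) data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.lognormal(mean=0.0, size=50)   # skewed data: no normality assumed
b = rng.lognormal(mean=1.0, size=50)   # shifted in log-space
c = rng.lognormal(mean=0.0, size=50)

u_stat, p_mw = stats.mannwhitneyu(a, b)   # "median" shift between two samples
rho, p_sp = stats.spearmanr(a, b)         # rank correlation, not Pearson's r
h_stat, p_kw = stats.kruskal(a, b, c)     # rank-based alternative to ANOVA
d_stat, p_ks = stats.ks_2samp(a, b)       # same-distribution test
```

Because all of these work on ranks, a monotone transform of the data (taking logs, say) leaves the answers unchanged - which is exactly the point.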


2. R (or I guess S). R is a cranky, odd statistical language/system with a great scientific plotting package. It's a package written mainly by statisticians for statisticians, and is rather unforgiving the first time you use it, but it is definitely worth persevering. It's basically a combination of three things: Excel spreadsheets on steroids (with no data entry - an R data frame is really the same logical object as an Excel workbook, but able to handle millions of points, not thousands); a statistical methods compendium (statistical methods are usually written first in R, and you can almost guarantee that there are no bugs in the major functions - unlike many other scenarios); and a graphical data exploration tool (in particular the lattice and ggplot packages). The syntax is inconsistent, the documentation is sometimes wonderful and often awful, and the learning curve is like the face of the Eiger. But once you've met p.adjust(), xyplot() and apply(), you can never turn back.

3. The problem of multiple testing, and how to handle it - either with the expected value, or FDR, and the backstop of many a piece of bioinformatics: large-scale permutation. Permutation is sometimes frowned upon by more maths/distribution-minded purists, but it is often the only way to get a sensible sense of whether something is likely "by chance" (whatever that phrase means - it's a very open question) given the complex, heterogeneous data we have. 10 years ago the lack of large-scale compute resources perhaps meant this option was less open to people, but these days basically everyone should be working out how to appropriately permute their data to allow a good estimate of the "surprisingness" of an observation.
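
Both ideas fit in a few lines of Python. A sketch with made-up data - a label-permutation test, plus a hand-rolled Benjamini-Hochberg FDR adjustment (illustrative code; in R you would just call p.adjust()):

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.normal(0.5, 1, 30), rng.normal(0.0, 1, 30)
obs = x.mean() - y.mean()

# Permutation: shuffle the group labels many times and ask how often a
# difference at least this large arises "by chance"
pooled = np.concatenate([x, y])
diffs = []
for _ in range(10000):
    s = rng.permutation(pooled)
    diffs.append(s[:30].mean() - s[30:].mean())
p_perm = (1 + np.sum(np.abs(diffs) >= abs(obs))) / (1 + 10000)

def bh_adjust(p):
    """Benjamini-Hochberg adjusted p-values (step-up FDR)."""
    p = np.asarray(p, dtype=float)
    order = np.argsort(p)
    ranked = p[order] * len(p) / (np.arange(len(p)) + 1)
    adj = np.minimum.accumulate(ranked[::-1])[::-1].clip(max=1)
    out = np.empty_like(adj)
    out[order] = adj
    return out
```

The "+1" in the permutation p-value counts the observed labelling itself, which keeps the estimate away from an impossible p of exactly zero.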

4. The relationship between P value, effect size, and sample size. This needs to be drilled into everyone: we are far too trigger-happy quoting P values alone, when we should often be quoting P values and effect sizes. Once a P value is significant, its degree of significance is somewhat meaningless (or rather, it compounds effect-size considerations with sample-size considerations, the latter often being about relative frequency). So if something is significantly correlated/different, you then want to know how large the effect is. This is not just a GWAS-style concern: in genomic biology we are all too happy quoting some small P value, not realising that with a million or so data points even very small deviations will be significant. Quote your r², your Rho, or your proportion of variance explained...
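
A short simulation makes the point: with enough data, an effect that explains almost none of the variance still earns a spectacular P value. A Python sketch (made-up numbers):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 100_000
x = rng.normal(size=n)
y = 0.03 * x + rng.normal(size=n)   # a truly tiny effect

r, p = stats.pearsonr(x, y)
# p is "highly significant", yet r**2 shows the effect explains well
# under 1% of the variance - quote both, not just p
```

With n this large the standard error of r is roughly 1/sqrt(n), so even a correlation of 0.03 sits many standard errors away from zero - significance says almost nothing about importance.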


5. Linear models and PCA. There is often a tendency to jump straight to quite complex models - networks, or biologically inspired combinations - when our first instinct should be to crack out the well-established lm() (linear model) for prediction and princomp() (PCA) for dimensionality reduction. These are old-school techniques - and to talk about statistical fits one often needs to make Gaussian assumptions about the distributions - but most of the things we do could be done well in a linear model, and most of the correlations we look at could have been found with a PCA biplot. The fact that these are 1970s bits of statistics doesn't mean they don't work well.
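
The R calls are lm() and princomp(); for completeness, here is roughly the same old-school pair sketched in Python with nothing but numpy - ordinary least squares via lstsq, and PCA via SVD of the centred data (simulated toy data throughout):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.0]) + rng.normal(scale=0.1, size=200)

# Linear model: ordinary least squares with an intercept column
A = np.column_stack([np.ones(200), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# PCA: singular value decomposition of the centred data matrix
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                    # principal-component coordinates
var_explained = s**2 / np.sum(s**2)   # fraction of variance per component
```

Nothing here was invented after the 1970s, and it recovers the planted coefficients almost exactly - which is rather the point of the section above.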


Kudos to Leonid Kruglyak for spotting some minor errors

21 comments:

Rosie Redfield said...

Geez, I still don't know any of those things (except the problem of multiple testing)!

I was (badly) trained in the "If your experiment needs statistics you should have done a better experiment" school.

Dave Bridges said...

As per the Pearson vs Spearman debate, my understanding is that it depends on normality of the data. Some good discussions here and here

Rosie Redfield said...

In self-defence (against my own criticism) I should add that I have a reasonably good intuition for when my experiments do need statistics, and then I go ask Mike Whitlock. Or, if it's a really simple problem, I read his book (Whitlock and Schluter, The Analysis of Biological Data).

Paul Harrison said...

This.

If I were to add anything to this list, it would be testing statistical routines with simulated data. This gives you some assurance that the p-values they report are accurate. Is that moderated test of count data accurate? Yes... but only if you set the prior n high enough! It also tells you if an experiment is powerful enough to detect effects of the size you expect. The false negative rate is always a worry in microarray experiments and similar.
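
A minimal version of this calibration check, sketched in Python: under a true null, p-values should be uniform, so the fraction falling below 0.05 should be close to 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Both groups come from the same distribution, so every "hit" is a false
# positive; a well-calibrated test should reject at about the nominal rate
pvals = np.array([stats.ttest_ind(rng.normal(size=20), rng.normal(size=20)).pvalue
                  for _ in range(2000)])
fp_rate = (pvals < 0.05).mean()
```

The same scaffold, with an effect added to one group, gives an empirical power estimate for a planned sample size.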

Iddo said...

Hear, hear. My PhD adviser insisted that we all take a course that covered 1, 3, & 4, and I believe it improved my ability not only to test hypotheses, but to come up with them. I don't think R was as robust 8 years ago as it is today (most people then used S or MatLab). I would add that the skill I consider the most important in statistics is the proper formulation of a null hypothesis & the alternative hypothesis.

Ralph said...

I agree with your "things," and I'd add that people who are serious about quantitative biology must be willing and able to learn new methods all the time. Almost every project I've undertaken in the past five years has involved learning yet another "advanced topic" in statistics (e.g., one of my papers from last year involves regression trees, and another is essentially a formal meta-analysis).

Incidentally, my undergraduate education was in physics and mathematics and included no statistics whatsoever. ("Statistical mechanics" is great fun but has very little to do with the inferential statistics that matter to biologists.) The attitude of many physicists at the time was, and most likely still is, "If it takes statistics to prove it, then I don't believe it."

PatrikD said...

Heck - if people could just learn #3, I'd be happy. Large-scale/omics data is becoming the norm, so you will always find *something* that seems "significant" if you blindly look at p-values.

I've rejected more than one manuscript because they failed to correct for multiple hypothesis testing and were making a big deal out of something that was essentially random noise...

Susan Gurney said...

I am a mathematician who came to biostatistics late in my career. I heartily support Ewan's blog, and his comments about Fisher are a good starting point. Unfortunately, there is even more work to be done, as the fact that underlying distributions for small sample sizes (the sad reality of modern labs and real-life constraints) are not known is not necessarily remedied by the use of 'non parametric' tests. These tests actually do assume a parameter, as Roger Newson, a statistician at Imperial College, writes: "I personally would argue that there are no such things as 'nonparametric methods'. They should be called 'rank methods', and are based on rank order parameters, which can be estimated with confidence limits. They are particularly suitable for use with Likert scales (I think). For details of how to use these methods in Stata, see Newson (2002), Newson (2006a) and Newson (2008b)."

So....more knowledge is good, but even more is....also good, and even more is....just good - and there are only so many hours in the day and night shift. My advice- think like a scientist, and repeat experiments often!

John Mark said...

The next level - number 6 - would be to get beyond P values and instead compute probability distributions of the quantities of interest. This leads naturally to number 7, which is to delve into the generative models that are currently solved by MCMC methods; this is basically the Bayesian approach. Just as an aside, "non parametrics" in some new work is also used to mean models where the number of parameters varies as a consequence of the method.

Dr. Fox said...

As an ecologist (so, a biologist in a field that has always been obliged to use statistics), I agree with most of what you have to say, but I can't say I'm with you on nonparametric tests. If the assumptions of a parametric test are reasonably well met, you're just giving up power for no reason by doing a nonparametric test. Indeed, classical parametric stats like ANOVA are actually pretty robust to non-normality (less so to heteroscedasticity, but heteroscedasticity violates the assumptions of nonparametric tests too). Further, these days 'parametric test' doesn't necessarily mean assuming normality - think generalized linear models. Finally, if you really do need or want to minimize your assumptions about the shape of the distribution, in many cases you're probably going to get more power with a randomization- or bootstrapping-based approach than with a classical nonparametric test.

nico said...

Absolutely agree with Dr. Fox. Also, a very interesting discussion on normality testing can be found here: http://stats.stackexchange.com/questions/2492/normality-testing-essentially-useless

Xin said...

to Rosie Redfield:
I actually think there is an element of truth in the statement you quoted, "If your experiment needs statistics you should have done a better experiment".

Try to think about the most awesome drug that has been developed by human beings.

It is not XXXX cancer drug that extends patients' lives from 11 months to 13 months with a p value of 0.001 derived from a Cox proportional hazards model, done by an army of statisticians and approved by the FDA...

It is penicillin! Which statistician was overseeing that clinical trial, and what was the p value for efficacy? I assure you none, because RA Fisher had not yet developed all the tools that statisticians nowadays make a living with, and the FDA did not even exist then...

Statistics does not replace biological intuition, and I am not sure it can even better the latter. Statistics is always an aid to science and should never be a driving force. Abuse of it may help generate tons of papers but rarely any real medical breakthrough...

a bit of my two cents...

Jenny Koenig said...

I too did no statistics in my undergraduate course. I had fully intended to do maths and physics at uni and so did the "hard" pure and applied maths at uni. Then I saw the light and switched to chemistry and pharmacology. As a result I started the Open University Statistics diploma a few years ago - excellent course with quite a lot of biological examples.
One question though: do you mean that these topics should be covered at undergraduate bioscience degree level or in postgrad courses?
I think this sort of discussion is really important to help define what should be in the undergrad curriculum for a bioscientist.

Felix said...

I am missing the use of confidence intervals around effects of interest, like mean differences, or around effect sizes. A p-value is usually a rather useless piece of information; only a CI gives an estimate of the likely range of the effect. And since the correct definition of confidence intervals is awkward, I also wish I had started with Bayesian statistics and not with frequentist.

alf leo said...

I don't know any of these either, so don't worry.
Anyway, I would like to learn the 5th first.





jenny said...

Maths for Chemists... sounds complex.


jenny said...

Non-parametric statistics sounds interesting. Willing to learn that.


Frozen_toes said...

Great post (I'm reading it quite late). Would you recommend some resources for learning these methods? It would save time ploughing through multiple statistics books unrelated to biology.

Jason Bodnar said...

I am a traditional PhD industry statistician (12 yrs in industry) looking for my next adventure. I came across bioinformatics, a field that sounds scientifically mesmerizing. Does anyone have any input on whether, in this field, it would be better to be a statistician who understands the basics of bioinformatics, or someone with an MS in bioinformatics who is versed in many facets of statistical theory?

Paul Morrison said...

Great post and I love the comments. I grew up in the era of "the band increased, write a paper", but at least I understood what your five tips were talking about.

My tip is to become very good friends with someone who loves statistics and is also willing to tell you that your experimental plan is highly flawed. I married one of those so I got that covered. She came home last year with four huge books with titles like "R is Good". I think she read them all in a weekend and now she says stuff like "R is awesome you should read ..."

ps. I post and click on my name and it connects to some other Paul Morrison although it loads my G+ account. Anyway, I'm a scientist, not a dilettante baker.

https://plus.google.com/107071332326571881653