Wednesday, 7 December 2011

My Emotions in Science

First off, apologies that I have not blogged recently. This is because I have been pushing hard on a large consortium paper, and it felt bad to have public evidence of not working on it when I was asking many other people for pretty continuous effort over a protracted period of months. This is not entirely logical - we all actually need some diversity in what we do scientifically, even when we're on a big push - but it somehow makes sense sociologically.

That consortium paper has (finally!) been submitted (of course it is not done yet; there will be review - I have all fingers and toes crossed on that), and the sheer physical relief that gave me has made me think about the emotional, personal connection I have to science. I think, in common with many other scientists, I have a very visceral connection to my work - emotion is bound up not just with the successes or failures, or the interactions with people, but also, for me, with the very direct doing of science - the thinking, mulling and frustrations.

Here are my "big emotions" in science.

The relief - of a large body of work done. Sometimes this feels almost like a mental marathon, and at the end of the race you cross the finish line and just ache and half-collapse. Some neurons seem to be physically hurting from overuse, and just want a week of you staring into the middle distance not really using them. You know - or you hope - that once recovered you will be proud of the race and the effort, but at the actual time of completion you are just thankful it is all over. This is the emotion that I had last week.

The thrill - of when you see a new result for the first time. At the moment this is usually watching the spinning cursor wheel on my computer end for an R plotting session, and - et voilà - some plot. Previously it has been the final cat/sort/more of some large data processing pipeline, and sometime in my very distant past, the peeling off of an autoradiograph from a gel (northern blots if you are interested). The best are the cases where any of the outcomes are interesting - correlation means X, whereas non-correlation means Y. I get a physical tingling of anticipation in these cases, and can spend a moment or two in silent thanks that I'm the person who gets to see this tiny piece of science for the first time. And then you have to think about artefacts, and other tests, and how best to plot, the right statistic, etc etc.

The satisfaction - of removing a bug in the code. There are times when you are sure that X should work, but the program is crashing, or (worse still) giving the wrong number, and one sets off to debug the system. As a computational scientist I think I have this far easier than my experimental colleagues - in developing a new method there are so many steps that could go wrong, and often one has to start patiently at the beginning, so I suspect their sense of complete mastery when it all works must be even stronger.

The frustration - of not understanding something. Often there are two bits of data, or two ideas, that somehow won't merge together. One ruminates on this - turning the ideas over in your head, or trying to recast one of the datasets or problems in a different way. The most frustrating is when you feel they should interconnect, but don't quite. This can be quite distracting, and often when something like this is happening I have a small frown on my face and am not very good at social interaction (my wife is very tolerant of me in this mode, and it has taken me years to realise how socially inept I am when I am really thinking about something). This frustration hopefully leads onto...

The 'aha' - of gaining new understanding. Calling this a 'Eureka' moment is too strong - and it is telling that there isn't really a commonly used English word to describe this emotion (I wonder if other languages have one?). For me I can almost hear internally the "click" when I have rearranged some ideas in my head and they now fit. I definitely have a statistically higher chance of this happening in the morning (in the shower is my normal thing), but I've been known to leap out of bed at 5am getting very excited about some new insight.

The delight - in learning about a new area of science, explained by an excellent scientist. The best scientists can do this to an audience of hundreds, providing deep insight in a succinct way. More commonly I find myself in a one-on-one, asking questions and getting appropriate responses as one explores an area with someone who thinks about it every day of the week as a personal guide.

The 'esprit de corps' - the feeling when a group of scientists in an area come together. Sometimes these are the large consortia, but as often for me they are about a field coming together and seeing progress - sometimes there has been a closely contested race, or disagreement on some point, but even with that competition one can feel part of the same broader team in an area.

I am sure there are more, and indeed, I'd be interested if these emotions resonate with others.

Sunday, 18 September 2011

Customising Ensembl Displays

Ensembl's display customisation has become both far more detailed and far easier to access. I'm going to talk you through how to customise it, and show what my favourite view is.

The default Ensembl display is something like this. Nice, pretty, but very transcript/gene centric, and with a strong emphasis on splitting up the strands. I am now a more regulatory person, and genes really form the context of what I want to see, but most of the time I don't want to see the details of the transcripts.

So - on each track title to the left you can click to get an immediate configuration of that track. I click on this to set the genes to "Collapsed with labels". The order of the tracks can also be changed by dragging the left hand yellow/green bars (the colour indicates the strand). I push the contig to the top, then I have the forward (green) genes and the reverse (yellow) genes. I also put the ncRNA track into "Collapsed, no label" mode (as the gene names for ncRNAs are often pretty uninformative). At the end of all of this, I end up with a display like:

But now we need to add things to the display. The "Configure" button is your friend, on the left hand side under the menus (it's important to realise that the "Configure" button is page specific, so as you move around the pages in Ensembl it will configure things specifically for that page).

Click on this and you will get a rather bewildering set of options for tracks (over 300 or so). Notice you can search across these tracks on the top left, and there are different sections.

I like to put in Comparative genomics - the 35 way GERP conservation scores, and the conserved elements. Then I like to switch on regulatory information. Choose the regulatory information section from the Configuration panel. This gives a matrix of cell lines (on the top) and factors (on the left). On the final display you will see these organised first by cell line and then by factor as colours inside that track. Be careful of the multi cell track, as this is the union over all cell lines - if you switch that on you can run out of a lot of screen space! I like GM12878, K562 and H1esc on, and go for the "peaks and signal" style. Then I usually have DNaseI, CTCF, cFos, and P300 on (cFos just to get a sense of a "relatively normal" factor). There are some "gotchas" in this matrix. For example, if you use the "Select all" on the left hand side for a factor, this will trigger every cell line in which it is present to be displayed - the logic being that it is better to show the user something that they can then get rid of - but I often wanted it on only in the already selected cell lines. Once you get used to this you can either go back to the top and switch things off, or switch things on and off in the individual cells.

Notice that you can also set up histone modifications and polymerase tracks.

Once all this is done, I get something like:

Before you start browsing around even more, do make sure you set these tracks as "favourites". On the configure page, click on "Active tracks" (things which are on now) and then hit the greyed out stars. This moves these tracks into your "Favourites", which means when you want to turn things on and off, you just need to go to favourites rather than searching for them all over again.

Go play on:

If you are not human or mouse obsessed, the Ensembl Genomes project has the same sortable and configurable tracks for worm, fly and others, but we have not (yet) put regulatory information at the same level into these species. We hope to, of course.

(Ugh. I've just discovered the mismatch between the blogger "preview" and the real preview of this post and its images. For the last screen shot I've now added in a higher resolution capture - if anyone knows a better way to get images to work in blogger in the place you think you'd like them to go, let me know :)).

Thursday, 21 July 2011

10 rules-of-thumb in genomics

In my semi-ongoing series of "things I wished someone had told me..." I wanted to share my sense of common "gotchas" in genomics. Here's my top 10.

  1. No large scale dataset is perfect - and many are at some distance from perfection. This is true of everything from assemblies through gene sets and transcription factor binding sites to phenotypes generated in cohorts. For people who come from a more focused, single area of biology, where in a single experiment you can have mastery of every component if desired, this ends up being a bit weird - whenever you dig into something you'll find some miscalls, errors or general weirdness. Welcome to large scale biology.
  2. When you do predictions, favour specificity over sensitivity. Most problems in genomics are bounded by the genome/gene set/protein products, so the totally wide "capture everything" experiment (usually called "a genome-wide approach") has become routine. It is rarely the case (though not never) that one wants a high sensitivity set which is not genome-wide. This means for prediction methods you want to focus on specificity (ie, driving down your error rate) as long as one is generating a reasonable number of predictions (>1000 say) and, of course, cross validating your method.
  3. When you compare experiments you are comparing the combination of the experiment and the processing. If the processing was done in separate groups, in particular with complex scripting or filtering, expect many differences to be solely due to the processing.
  4. Interesting biology is confounded with Artefact (1). Interesting biology is inevitably about things which are outliers or form a separate cluster in some analysis. So are artefacts in the experimental process or bioinformatics process - everything from biases towards reference genome, to correlations of signals actually being driven by a small set of sites.
  5. Interesting biology is confounded with Artefacts (2). There is a subset of the above which is so common as to be worth noting separately. When you have an error rate - and everything has an error rate, due to point 1 - the errors are either correlated with the biological classification (see point 4) or uniform. Even when they are uniform, you still get misled, because often you want to look at things which are rare - for example, homozygous stop codons in a whole genome sequencing run, or lack of orthologs between species. The fact that the biological phenomenon you are looking for is rare means that you enrich for errors.
  6. Interesting biology is usually hard to model as a formal data structure, and one has to make some compromises just to make things systematic. Should one classify the Ig locus as one gene, or many genes, or something else? Should one try to handle the creation of a new selenocysteine amber stop codon by a SNP as a non synonymous variant? To what extent should you model the difference between two epitopes for a ChIP-seq pull down of the same factor when done in different laboratories? Trying to handle all of this "correctly" becomes such a struggle to be both systematic and precise that one has to compromise at some point, and just write the discussion down in plain old English (or reference the papers that do). Much of bioinformatics database work is trying to push the boundary between systematic knowledge and written down knowledge further; but you will always have to compromise. Biology is too diverse.
  7. The corollary of 1, 2 and 4 is that when most of your problems in making a large scale dataset are about modelling biological exceptions, your error rate is low enough. Until you are agonising over biological weirdness, you've still got to work on your error rate.
  8. Evolution has a requirement that things work, not that they are elegant engineering solutions. Expect jury-rigged systems which can be bewildering in their complexity. My current favourite is the platypus X chromosome system, which is just clearly a bonkers solution to heterogametic sex. Have fun reading about it (here's one paper to get you started)!
  9. Everyone should learn the basics of evolution (trees, orthologs vs paralogs - and please, could everyone use these terms correctly!) and population genetics (Hardy-Weinberg equilibrium, the impact of selection on allele frequency, and coalescence, in particular with respect to human populations). For the latter, people often need to be reminded that the fact that a site is polymorphic does not mean it is not under negative selection.
  10. Chromosomes have been arbitrarily orientated p to q. This means that the forward and reverse strands have no special status on a reference genome. If any method gives a difference between forward and reverse strands on a genome-wide scale - it's probably a bug somewhere :)
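A sketch of that strand sanity check from rule 10 (in Python; the counts and the function name are invented for illustration, not from any real pipeline):

```python
# Rule-of-thumb 10 as code: genome-wide, a method's calls should split
# roughly 50:50 between forward and reverse strands. Hypothetical counts.
from scipy.stats import chisquare

def strand_bias_pvalue(forward_calls, reverse_calls):
    """Chi-squared test of forward/reverse counts against a 50:50 expectation."""
    total = forward_calls + reverse_calls
    stat, p = chisquare([forward_calls, reverse_calls],
                        f_exp=[total / 2, total / 2])
    return p

# A well-behaved method: no significant strand bias.
print(strand_bias_pvalue(50_200, 49_800))

# A buggy method calling 10% more sites on the forward strand:
# a tiny P-value here is the "probably a bug somewhere" signal.
print(strand_bias_pvalue(55_000, 45_000))
```

A P-value well below 0.05 in the second case is the cue to go hunting for the bug, not for new biology.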

I am sure other people have other comments or suggestions ;)

Wednesday, 13 July 2011

Biology is a 1st class quantitative science

Last week I was at a "Town Hall" meeting for the UK High Performance Computing community (or at least part of it), which has a lot of computer people rubbing shoulders with what used to be called "hard" science scientists: high energy physicists, astronomers, modellers (particularly climate change/oceanography) and the odd molecular modeller/chemist. High beard/non-beard ratio, and an almost 1:1 Y to X chromosome ratio. From their perspective I was a straight up biologist (though from the biologist's perspective, I am a computer scientist. The curse of being interdisciplinary is that you are often seen as on the outside of all the disciplines you span. Hey ho.).

At that meeting one person almost inadvertently stated the standard science hierarchy - something like (this is not a direct quote): we are here to serve the hard, mathematical sciences of physics and climate change, and bring in new quantitative sciences of molecular biology and even reach out to Arts and Humanities based research, such as Dance research.

At one level this is great - we (biology) are both fully acknowledged, and we're no longer the newest in town! It's the Arts and Humanities who are the newest.

But I do find it a bit frustrating, and when it came to my talk I ended a slide with a bullet point

"Biology is a 1st class quantitative science"

I really believe this. You can't really move in biology without having to do at least some number crunching, and the fully quantitative end of biology - be it genomics, image analysis or systems biology - has easily enough quantitative/mathematical/statistical meat on it to be totally at home with astronomy/physical chemistry or oceanography/climate science. The extent to which biology is not quantitative is, I think, far more due to the 20 years of us getting away with "cartoon based" explanations of our data (which worked out fine for quite a long time, realistically) than anything inherent.

In short, in 50 years' time I think biology will be (or should be) reunited with its quantitative roots, where it started at the turn of the 20th century - in particular in the way that frequentist statistics developed mainly to tackle biological questions.

But this idea of Biology as a "soft" or "semi-soft" science is pervasive - in the expectations we have of students, in the way we train graduates, and in the way we deal with peer review on grants and papers. Even the Royal Society has an "A" section (old school "hard" science) and a "B" section (old school "soft" science). So we've got almost a generational change to execute here (which also concerns me; I am probably both interdisciplinary and intergenerational in this aspect. sigh).

There is a subtle but deep cultural change needed to place mathematics and statistics in a central position in biology; this has to go across every part of the science. I don't mean we all have to understand fields, tensors and high dimensional mathematical topology (and nor do most chemists/material scientists/climate modellers....), but just as a practicing physicist of any sort - from material science through to experimental physics - must be mathematically conversant at some level (ditto chemist, ditto climate modeller), so every biologist must have some fundamental mathematical skills - mainly associated with statistics. It's just part of the science we do.

So - say it proudly:

"Biology is a 1st class quantitative science"

Tuesday, 21 June 2011

Bayesian vs Frequentist. A pragmatic bioinformatician's view

Ian Holmes, with whom I did my graduate studies at the Sanger (two biologists/coders in a room with happy house blaring out in the late 90s - makes me feel old just thinking about it), was chastising me about my last post about statistical methods not having anything about Bayesian statistics.

Which, let's face it, was an oversight, because Bayesian statistics totally rocks in some scenarios in bioinformatics. At one level Bayesian statistics is straightforward (just write down as a probability model what you believe, sum over any uncertainty, et voilà, your answer), but at another level it is profound; almost a way of life for some statisticians.

If you haven't met Bayesian statistics, it is really worth learning about (if you are a bioinformatician used to sequences, I highly recommend the book "Biological Sequence Analysis" by Durbin, Eddy, Krogh and Mitchison, which introduces probabilistic modelling in sequence bioinformatics). The rather profound viewpoint of Bayesian statistics is that your confidence in believing something is true is related to the probability of the observed data given a model. By either having alternative models, or by having different parameters inside one model, one can choose which model "is most likely" (ie, fits the observed data the best).

In contrast, the other view of statistics - so called "Frequentist" statistics - sets up the problem as a multiple trial problem with a null hypothesis. The null hypothesis is usually some interpretation of "random chance". By computing the probability of a particular data point under the null hypothesis, one either accepts the null hypothesis (ie, it was reasonably likely to occur by chance) or rejects it; if you reject the null hypothesis there is normally an alternative which represents "means are different" or "correlation exists". The Frequentist viewpoint is to focus on modelling the null hypothesis.

Better people than I have written about the difference between the Bayesian and Frequentist viewpoints - and have provided the arguments that they unify conceptually. Previously this was an almost religious schism in statistics, but now my feeling is there is more agreement that both perspectives are valid. People often get distracted by the philosophical aspects of Bayesian statistics - for example, Bayesian statistics insists you provide a "prior belief" in your model before you see any data. This sounds like something you don't ask for on the Frequentist side, though Bayesians would argue that the choice of null and alternative hypotheses, and in particular the distribution of the null hypothesis, is basically all "prior belief". In Bayesian statistics there is a particularly neat business of being able to factor out variables which you know are important, but don't know how to measure/estimate - you can "just" sum over all possible values of that variable, weighted by your prior belief. (The "just" in the previous sentence often hides quite a bit of magic - doing the straightforward "sum" is often hard to impossible, and much of the magic in Bayesian statistics is working out a way to do this sum in maths - ie, as a calculable integral - rather than explicitly. However, Bayesians can always fall back on Markov chain Monte Carlo - which is in effect randomly sampling and walking around the space of possibilities. It's a bit like the frequentist approach of using permutation to get an effective null.)
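To make that MCMC fallback concrete, here is a toy Metropolis sampler (a sketch in Python; the coin-flip data and every number in it are invented) that walks over a single parameter rather than doing the integral analytically:

```python
# Toy Metropolis sampler: estimate the posterior over a coin's bias p,
# with a flat prior, by random-walking over p instead of integrating.
import math
import random

random.seed(1)
heads, tails = 70, 30  # hypothetical observed data

def log_posterior(p):
    """log of (likelihood x flat prior), up to a constant."""
    if not 0.0 < p < 1.0:
        return -math.inf
    return heads * math.log(p) + tails * math.log(1 - p)

samples = []
p = 0.5
for step in range(20000):
    proposal = p + random.gauss(0, 0.05)       # small random step
    log_ratio = log_posterior(proposal) - log_posterior(p)
    if log_ratio >= 0 or random.random() < math.exp(log_ratio):
        p = proposal                            # accept the move
    if step >= 2000:                            # discard burn-in
        samples.append(p)

mean_p = sum(samples) / len(samples)
print(f"posterior mean of the bias: {mean_p:.2f}")
```

With 70 heads in 100 flips the sampled mean should sit close to the analytic posterior mean (71/102 under a flat prior), which is the point: the walk recovers the answer without ever writing the integral down.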

But the real benefit of Bayesian statistics is that it is fully compatible with models of the data. And those models can be as sophisticated as you like, as long as you can calculate the probability of observing a dataset given the model (this is called the "likelihood"). In sequence based bioinformatics this is invaluable - for all our understanding of sequence evolution on phylogenetic trees, or base substitutions, or codon behaviour, or exon/intron rules, we can write down sensible models of the behaviour. If we then had to work out some "frequentist" null model as background for this, it is just plain... weird... to think about, and almost impossible beyond generating "random" sequence (itself a rather non-trivial thing to do right) and feeding it into metrics. Basically Bayesian statistics, due to the fact that it's totally at home with explicit models of the data, is a great fit to sequence analysis, and frequentist statistics just doesn't work as well.
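As a toy illustration of the "likelihood" driving model choice, here is a sketch (in Python, with an invented GC-rich model and an invented sequence) comparing two base-composition models via their likelihood ratio - which is the Bayes factor when both models are equally probable a priori:

```python
# Compare two explicit models of a DNA sequence via their likelihoods.
# Both the models and the sequence are made up for illustration.
import math

def log_likelihood(seq, base_probs):
    """log P(sequence | model) for an i.i.d. base-composition model."""
    return sum(math.log(base_probs[base]) for base in seq)

uniform_model = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
gc_rich_model = {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}

seq = "GCCGCGGGCCGCATGC"  # a GC-rich stretch

log_bf = (log_likelihood(seq, gc_rich_model)
          - log_likelihood(seq, uniform_model))
print(f"log Bayes factor (GC-rich vs uniform): {log_bf:.2f}")
# positive => the GC-rich model explains this sequence better
```

Real sequence models (trees, substitution matrices, gene structure) are far richer, but the move is the same: write the model down, compute the likelihood, and let the data arbitrate.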

In my PhD and the early days of running Ensembl, everything I did had to do with one or other aspect of sequence analysis. In particular Genewise, which is a principled combination of homology matching and gene prediction, was impossible to think about outside of Bayesian style statistics. Genewise was at the core of both Ensembl and in fact Celera genomics models of the human and mouse genome, and this rather venerable algorithm still happily chugs away at the core of a variety of genome annotation tools (including still some parts of Ensembl) although newer methods - such as Exonerate - work far faster with almost no loss in power.

This got me thinking about why I am not using Bayesian methods (or, let's face it, their poor cousin, maximum likelihood methods; most "Bayesian" methods actually don't do much summing over uncertainty but rather take a maximum likelihood point - whenever you get out an alignment, you are doing maximum likelihood) so much at the moment. And that's really because I've shifted into more open analysis of ENCODE data, where I don't trust my own internal models of what is going on - can anyone really "know" what the right model of a promoter or an enhancer is? Or whether even these words "enhancer" and "promoter" are the right things? And when we compare experimental results to predictions, I want to make the least number of assumptions and just ask the simple question - did my prediction process "predict" the outcome of the experiment? Even though sometimes those predictions are set up with a Bayesian style formalism, often with a very naive model, I want to be as naive as possible in the next part of the test. So I find myself most comfortable with the non-parametric statistics (either Chi-squared - category vs category, or Wilcoxon/Kruskal - category vs continuous, or Spearman's rho - continuous vs continuous). As a good Bayesian perhaps one can write down the "most naive model of association possible" - itself something I don't quite know how to do - and then do a likelihood/posterior calculation, but... it feels like this is going to be the same as the non-parametric statistics anyway.
And it's far easier to say "success in the Medaka fish enhancer assays was significantly associated with the prediction class of element (Chi-squared test, P-value 0.0007)" rather than "we then set up a naive Bayesian model for the association of the Medaka fish enhancer assays with the prediction class of elements, with flat prior beliefs of association or not, and considering all possible proportions of association; the Bayes factor of 78.9 indicates that there is high confidence in the model with association" - not least because I think it would give the same result.
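For what it's worth, the frequentist version of that sentence is a couple of lines of code (a sketch with a made-up 2x2 table, not the real Medaka data):

```python
# Chi-squared test of association between prediction class (rows) and
# enhancer-assay outcome (columns). The counts are hypothetical.
from scipy.stats import chi2_contingency

#                 assay positive, assay negative
table = [[30, 10],   # predicted "enhancer"
         [12, 28]]   # predicted "not enhancer"

stat, p, dof, expected = chi2_contingency(table)
print(f"chi-squared = {stat:.2f}, dof = {dof}, P = {p:.4g}")
```

One function call, one P-value, and the minimum of modelling assumptions - which is exactly the appeal in this setting.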

Perhaps in the future this area of research will come back to Bayesian statistics, but I think this will be because we are confident in our models. Once you want to have more sophisticated models, one is almost forced to be more Bayesian.

But, I should have had Bayesian statistics in my top 5 statistical techniques. Or said the top 6. Because they do totally rock in the right place.

So - you are right Ian :)

Saturday, 18 June 2011

Five statistical things I wished I had been taught 20 years ago

I came through the English educational system, which meant that although I was mathematically minded, because I had chosen biochemistry for my undergraduate degree my maths teaching rapidly stopped. In university I took the more challenging "Maths for Chemists" option in my first year, though in retrospect that was probably a mistake because it was all about partial differentiation, and not enough stats. Probably the maths for biologists was a better course, but even that, I think, spent too much time on things like the t-test and ANOVA, and not enough on what you actually need. To my subsequent regret, no one took me aside and said "listen mate, you're going to be doing a lot of statistics, so just get the major statistical tools under your belt now".

Biology is really about stats. Indeed, the founders of much of frequentist statistics - RA Fisher and colleagues - were totally motivated by biological problems. We just lost the link in the heyday of molecular biology, when you could get away with n=2 (or n=1!) experiments due to the "large effect size" - ie, the band is either there or not - style of experiment. But now we're back to working out far more intertwined and subtle things. So every biologist - molecular or not - is going to have to become a "reasonable" statistician.

These are the pieces of hard won statistical knowledge I wish someone had taught me 20 years ago rather than my meandering, random walk approach.

1. Non-parametric statistics. These are statistical tests which make a bare minimum of assumptions about underlying distributions; in biology we are rarely confident that we know the underlying distribution, and hand waving about the central limit theorem can only get you so far. Wherever possible you should use a non-parametric test. This is Mann-Whitney (or Wilcoxon if you prefer) for testing "medians" ("medians" is in quotes because this is not quite true - they test something which is closely related to the median) of two distributions, Spearman's rho (rather than Pearson's r2) for correlation, and the Kruskal test rather than ANOVA (though if I have this right, you can't in Kruskal do the more sophisticated nested models you can do with ANOVA). Finally, don't forget the rather wonderful Kolmogorov-Smirnov test (I always think it sounds like really good vodka) of whether two sets of observations come from the same distribution. All of these methods have a basic theme of doing things on the rank of items in a distribution, not the actual values. So - if in doubt, do things on the rank of a metric, rather than the metric itself.

2. R (or I guess S). R is a cranky, odd statistical language/system with a great scientific plotting package. It's a package written mainly by statisticians for statisticians, and is rather unforgiving the first time you use it. It is definitely worth persevering. It's basically a combination of Excel spreadsheets on steroids (with no data entry; an R data frame is really the same logical thing as an Excel workbook, but able to handle millions of points, not thousands), a statistical methods compendium (it's usually the case that statistical methods are written first in R, and you can almost guarantee that there are no bugs in the major functions - unlike many other scenarios) and a graphical data exploration tool (in particular the lattice and ggplot packages). The syntax is inconsistent, the documentation sometimes wonderful, often awful, and the learning curve is like the face of the Eiger. But once you've met p.adjust(), xyplot() and apply(), you can never turn back.

3. The problem of multiple testing, and how to handle it, either with the expected value, or FDR, and the backstop of many a piece of bioinformatics - large scale permutation. Large scale permutation is sometimes frowned upon by maths/distribution purists but is often the only way to get a sensible sense of whether something is likely "by chance" (whatever that phrase means - it's a very open question) given the complex, heterogeneous data we have. 10 years ago perhaps the lack of large scale compute resources meant this option was less open to people, but these days basically everyone should be working out how to appropriately permute the data to allow a good estimate of the "surprisingness" of an observation.

4. The relationship between P-value, effect size, and sample size. This needs to be drilled into everyone - we're far too trigger happy quoting P-values, when we should often be quoting P-values and effect sizes. Once a P-value is significant, its higher significance is sort of meaningless (or rather it compounds effect size things with sample size things, the latter often being about relative frequency). So - if something is significantly correlated/different, then you want to know how much of an effect this observation has. This is not just about GWAS-like statistics - in genomic biology we're all too happy to quote some small P-value, not realising that with a million or so points, even very small deviations will be significant. Quote your r2, rhos or proportion of variance explained...

5. Linear models and PCA. There is a tendency to jump to quite complex models - networks, or biologically inspired combinations - when our first instinct should be to crack out the well established lm() (linear model) for prediction and princomp() (PCA) for dimensionality reduction. These are old school techniques - and often if you want to talk about statistical fits one needs to make Gaussian assumptions about distributions - but most of the things we do could be done well in a linear model, and most of the correlations we look at could have been found with a PCA biplot. The fact that these are 1970s bits of statistics doesn't mean they don't work well.
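A sketch pulling a few of these together (in Python rather than R, on invented data): a rank-based test, a rank correlation, and the permutation backstop from point 3:

```python
# Invented data: two groups with a shifted median, and a monotonic but
# non-linear relationship, analysed with rank-based tools.
import random
from scipy.stats import mannwhitneyu, spearmanr

random.seed(42)
group_a = [random.gauss(0.0, 1.0) for _ in range(200)]
group_b = [random.gauss(0.5, 1.0) for _ in range(200)]  # shifted "median"

# Non-parametric comparison of the two groups (Mann-Whitney / Wilcoxon).
u_stat, u_p = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"Mann-Whitney P = {u_p:.3g}")

# Spearman's rho works on ranks, so it catches monotonic relationships
# that Pearson's r would understate.
x = list(range(100))
y = [v ** 2 + random.gauss(0, 500) for v in x]  # monotonic, non-linear
rho, rho_p = spearmanr(x, y)
print(f"Spearman rho = {rho:.2f}")

# The permutation backstop: shuffle the group labels many times and ask
# how often a mean difference this large arises "by chance".
observed = abs(sum(group_b) / len(group_b) - sum(group_a) / len(group_a))
pooled = group_a + group_b
hits, n_perm = 0, 1000
for _ in range(n_perm):
    random.shuffle(pooled)
    diff = abs(sum(pooled[200:]) / 200 - sum(pooled[:200]) / 200)
    if diff >= observed:
        hits += 1
perm_p = (hits + 1) / (n_perm + 1)
print(f"permutation P ~ {perm_p:.3g}")
```

The same analyses in R are wilcox.test(), cor.test(method="spearman") and a sample() loop; the point is the habit of working on ranks and permutations, not the language.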

Kudos to Leonid Kruglyak for spotting some minor errors

Tuesday, 31 May 2011

Compressing DNA: The future plan

I've talked about our work earlier this year on compressing next generation sequencing data in the first part of this series of three blog posts; my second post outlined the engineering task ahead of us (mainly Guy, Rasko and Vadim). This third post gets into the fundamental question of whether this can scale "for the foreseeable future". By "scale" here, I mean a very simple question - can we (the ENA) store the world's public research DNA output within the fixed disk budget we currently spend each year? (This applies to the ENA as an archive, and to each sequencing centre or lab-with-a-machine individually for their own output.)

A note of caution - the raw disk cost is just one cost component of running an archive or a sequencing centre, and actually not the largest cost component (though its growth is concerning); please don't feel that if we make the disk part of this problem "solved", then somehow all the other costs of submission management, meta-data management and output presentation - let alone appropriate analysis - are magically solved. There is a lot more to an archive - or a sequencing centre, or a lab - than just storing stuff.

At one level this is hard; the foreseeable future is ... unpredictable. The breakneck speed of technology innovation in DNA sequencing might plateau; some people think that other aspects of the DNA sequencing process - perhaps sample acquisition, or the electricity for the machines, or downstream informatics - will be the choke point in the future, and so the growth of DNA sequencing will slow. Alternatively, storage might get radically cheaper due to your choice of technologies - holographic storage, or physical storage, or a new type of optical disk. These "risks" are all risks in our favour as archives and informaticians; we need to worry far more about whether there is a sustained increase in sequencing technology (doubling every 6 months per $ spent) for the next 5 years. As disk technology doubles every 18 to 24 months, there is a factor of around 3-fold each year we need to make up - in other words, for constant cost spend, we have to "win" about 3-fold in compression technology each year.

First off - to dispel some commonly suggested "solutions" to this which we ourselves have seriously considered but then (reluctantly) realised don't change the problem enough. Firstly, discarding "old data" doesn't change the problem enough to impact the curves. As the observed growth is exponential, one is always storing more next month than one did last month, and very quickly the old parts of the archive are not a big component of the total storage. Realistically one is always making a decision about whether one can handle the next 6 months or so of data. Sorting data into "valuable" (eg, one-off cancer samples) and "less valuable" (eg, repeated cell line sequencing) is going to be useful for some decisions, but the naive business of discarding the less valuable sequences only gives you a one-off win by the proportion you discard; as you'll see below, this doesn't actually make the problem scalable. And, in any case, much of the data growth is in high-value, one-off samples, such as cancer samples, which have significant sample acquisition costs. It would feel very weird to be in the situation where we can sequence large numbers of cancer genomes to help discover new aspects of cancer biology... but can't actually reanalyse those genomes in 5 years' time because we can't store them.

A repeated suggestion, though, is to store this as DNA - as the sample itself. At first glance this seems quite intriguing; DNA is a good storage medium (dense, well understood, doesn't consume much electricity to store), and you have the sample in one's hand by definition. One major issue is that for some of the most important samples - eg, cancers - one often exhausts the sample itself in sequencing. But there are also some gnarly practical issues to "just store the sample" - it implies that to pool the analysis of a current set of cancer genomes with a set of cancer genomes in 5 years' time, one would have to order those samples and fully resequence them. That's quite a bit of logistics and sequencing machine time, and it implies not only a large amount of sample material available to send out to each group that wants it, but also highly time-efficient and low-cost resequencing in the future (not an impossible suggestion, but not definitely going to happen). Then there are the practicalities of freezer management and shipping logistics. (Does anyone know the doubling time of freezer capacity in terms of fixed $ spend over time?) Doing this in a centralised way is akin to running biobanks - themselves not a trivial nor particularly cheap undertaking - while in a decentralised way one is trusting everyone to run a biobank level of service in frozen-sample tracking and shipping. (Or you accept that some proportion of samples is irretrievable, but if we accept a high level of data loss, then there are quite a few other options open.) Finally, this would mean that only places with good, efficient sequencing machines could participate in this level of scientific data reuse - sequencing machines might be democratised, but they're nowhere near the reach of hard drives and CPUs. Although I think this is an interesting idea to think about, I just can't see it as a practical and responsible way of storing our DNA data over the next 5 years.

So - can we find a better solution to this scaling problem? In particular, will data compression allow us to scale?

You can view this problem in two ways. Either we need technology to provide a sustained 3-fold improvement in performance every year, year on year - it is going to be hard to have a credible solution for this with an open-ended time window, as the solution has to deliver an integer-factor improvement each year. Or we can give ourselves a fixed time window - say 5 years, or 10 years - and ask ourselves to manage within the current year-on-year budget for that time. We then appeal to our paymasters that a 5- or 10-year time horizon is far enough away that many, many things can change in this time, and it's pointless to try to second-guess things beyond that horizon. This latter proposition is more bounded, though still challenging. It means over 5 years we need ~250-fold compression, and over 10 years in theory it's a whopping ~60,000-fold.
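The arithmetic behind those numbers is worth spelling out, using the doubling times quoted above (sequencing per $ doubling every 6 months, disk every 18-24 months; I've taken 21 months as a midpoint):

```python
# Sequencing capacity per $ doubles every 6 months; disk every ~18-24
# months (21 used here as a midpoint) - hence the ~3-fold yearly gap.
seq_per_year = 2 ** (12 / 6)       # 4x per year
disk_per_year = 2 ** (12 / 21)     # ~1.5x per year
gap = seq_per_year / disk_per_year
print(f"yearly gap to make up: ~{gap:.1f}x")

# Compounded over a fixed horizon at ~3x per year:
for years in (5, 10):
    print(f"{years} years: ~{3 ** years:,}-fold compression needed")
```

Compounding roughly 3x per year gives 3^5 = 243 (the "~250-fold") over 5 years and 3^10 = 59,049 (the "~60,000-fold") over 10.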

(This prediction also implies that in 10 years the cost of a genome is below $1, which seems to be where the "there will be another choke point" argument will kick in. It is also, I think, pretty safe to say that there has to be a shift over to healthcare as the major user at some point during the 10-year horizon. That doesn't change the fundamental mismatch between sequencing and storage technology, but it does mean there is a whole other portion of society which will be grappling with this problem; for example, my colleague Paul Flicek reckons that healthcare will not want to store read-level data, due to a combination of data flow issues and liability issues. This again is a "risk in our favour".)

Setting ourselves realistic goals, my feeling is that we've got to be confident that we can deliver at least 200-fold compression (worst case - this will take us to around 4 years away) and aim for 2,000-fold compression (worst case this will be somewhere near 6-7 years away, but it may well be that other factors kick in before then). This means a genome dataset will be around a gigabyte (at 200-fold compression) or much below that at 2,000-fold compression, and it's likely that in the scientific process there will be many other data files (or even just emails!) larger than the genome data.

This isn't actually too bad. We show in the paper - and all the current engineering is pointing the same way - that reference-based compression gives a 10-fold one-off win on real datasets now. So we have a factor of 20-fold to go to reach the lower bound. The upper bound, though, is still challenging.

In the original paper we pointed out that the compression scheme gets better both with coverage and read length. The shift from medical-style sequencing at around 4x (1,000 Genomes style), to 20-30x personal-genome style, to 30-50x cancer genome sequencing means that this density increase is occurring per sample at the moment. There is also a steady increase in read length. So both areas of technology growth "positively" impact our compression scheme. The "10-fold" number I've quoted above is for a 30x high-coverage genome (sadly!), but improvements in read length might give us another 4-fold (this is predicting something like 500bp reads over the time window we are considering). Finally, in the paper we note that both the absolute compression rate and the way the compression responds to read length are altered by the quality budget. So if, over time, we reduce the quality budget from, say, 2% of identical residues to 0.5% of identical residues, and we argue that the trend on high-value genomes (cancer samples) will be to go somewhat higher in density (50 to 60x perhaps), we might get another 4-5-fold win. 10 x 4 x 5 == 200. Lower bound reached. Phew!

But there is a fly in the ointment here. Actually, the "quality budget" concept covers both identical and mismatched bases combined, and thus there is a "floor" on the quality budget: the read error rate. This is below 1% for Illumina, but definitely above 0.1%; for things like color-space reads from ABI SOLiD, the "color error" rate is far higher (the properties of color space mean you don't see these errors in standard analyses), and PacBio reads currently have a quoted error rate of around 10% or higher. Let's leave aside questions of the best mix and/or cost-effectiveness of different technologies, and accept that the error rate in the reads is quite likely to be higher than our desired quality budget (perhaps dramatically higher). All in all it looks like we've got quite a fundamental problem to crack here.

Problems of course are there to be solved, and I am confident we can do this. In many ways it's obvious - if we've reached the end of where we can go with lossy compression on qualities, we've got to be lossy on base information as well. But which bases? And if we're happy about losing bases, why not chuck all of them out?

Going back to my original post about video compression: lossy compression schemes ask you to think about what data you know you can discard - in particular, a good understanding of noise processes allows us to rank all the information from "I am certain I don't need this for future analysis" to "I am certain I need this for future analysis", and think far more about the first, uninformative end of this ranked list as the part to discard.

Because we understand sequencing quite well, we actually have a lot of apparent "information" in sequencing output which, considered in aggregate, we can be very confident comes from the sequencing process. When we see 30 reads crossing a base pair of a normal genome, and just one read has a difference (and if that difference is low quality...), then the probability that this is due to a sequencing error process is very high; the alternative hypothesis, that this is something biological, is unlikely. Note of course that if I told you this was a cancer genome you'd definitely change the probability of it being a biological process. In effect, we can do "partial error correction" on the most confident errors, leading to a more compressible dataset.
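That intuition is easy to make quantitative. A back-of-envelope sketch, assuming an illustrative 1% per-base error rate (my number, for illustration only): compare the likelihood of seeing exactly one alternate read out of 30 under "sequencing error" versus under "true heterozygous site".

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k 'successes' in n trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n_reads, alt_reads = 30, 1
err = 0.01                                      # assumed per-base error rate

p_error = binom_pmf(alt_reads, n_reads, err)    # one miscalled base
p_het = binom_pmf(alt_reads, n_reads, 0.5)      # true heterozygous site

print(f"P(1 alt in 30 | sequencing error) = {p_error:.3g}")
print(f"P(1 alt in 30 | het site)         = {p_het:.3g}")
print(f"likelihood ratio ~ {p_error / p_het:,.0f}:1 in favour of error")
```

The likelihood ratio comes out at millions to one in favour of the error hypothesis - which is why smoothing that one read's difference away is such a safe bet in a normal genome (and why the prior matters so much for a cancer genome).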

And not only can we change our prior on the "biology" side (for example, normal vs tumor samples), and indeed the aggressiveness of this error correction, but we can also borrow from video compression the idea of doing this adaptively across a set of DNA reads; ie, inside one dataset we might identify regions of the genome where we have high confidence that it is a "normal, well-behaved" diploid genome, and then in those regions use the most aggressive error correction. We would of course keep track of which regions we error-corrected (but not of course the errors themselves; that's the whole point of making this more compressible), and the parameters used, but this would be a small overhead. So one might imagine that in a "normal" genome one could have 3 tiers: 1% of the genome (the really nasty bits) with all mismatches and many identical-to-reference qualities stored (quality budget of 2%); a second tier of 10% of the genome where all differences are stored with qualities (quality budget of 0.5%); and then 90% of the genome where the top 95% of confident errors are smoothed out, and biological differences (homozygous SNPs) treated as an edit string on the reference, leaving an apparent error rate of ~0.05% - still above the rate of heterozygous sites, which is about 1 in 10,000. Overall the mean "quality budget" is 0.11% ish. This might get us down to something like 0.04 bits/base, around 500-fold better than the data footprint now. One might be able to do a bit better with a clever way to encode the heterozygous sites, and homozygous sites taking into account known haplotypes, though we will be doing very well in terms of compression if this is what we're worrying about. To achieve 2,000-fold compression from now, we'd need to squeeze out another factor of 4, though I am still making the assumption of 500bp reads here, which might well be exceeded by then.
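The "0.11 ish" figure is just the weighted mean over the three tiers (the tier fractions above sum to 101% - they are the text's rough numbers, close enough for an estimate):

```python
# Weighted mean of the three-tier quality budget described above
# (tier fractions are rough and sum to 101%, as in the text)
tiers = [
    (0.01, 0.0200),   # 1% nastiest regions at a 2% budget
    (0.10, 0.0050),   # next 10% at a 0.5% budget
    (0.90, 0.0005),   # remaining 90% at ~0.05% residual error
]
mean_budget = sum(frac * budget for frac, budget in tiers)
print(f"mean quality budget ~ {mean_budget:.3%}")   # ~0.11%
```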
In any case, once we're in the framework of lossy compression on both the bases and the qualities, there is a more straightforward scientific question to pose - to what extent does this loss of quality impact your analysis? This might be easier to ask for (say) medical resequencing of normal diploids, and harder for heterogeneous tumor samples, but at least it's a bounded question.

Indeed, as soon as you start thinking about the quality budget not being smooth - either between samples (due to scientific value) or inside a sample - it raises the question of why we don't look at this earlier to help mitigate whether we need to (say) discard unalignable and unassemblable reads, something which will otherwise be hard to justify. And, indeed, when you think about doing reference-based compression on ABI SOLiD color-space reads, you've got to think about this sort of scheme from the get-go, otherwise the compression just won't give you the results you want. This is not surprising - after all, discarding "confident errors" is precisely what the color-space aligners are doing in their reconciliation of a color-space read to a reference.

A critic might argue that all I am stating here is that we should analyse our data and store the analysis. Although there are clear parallels between "optimal analysis" and "appropriate lossy data compression" - at some limit they must meet to describe the entire dataset - it is, I think, a fundamental change in perspective to think about what information one can lose rather than what information one should keep. For example, not only can one rate how likely each difference is to be an error, but in effect weight this by how confident you are in the model itself. This encourages us to be a bit more generous about the information in regions we don't have good models for (CNVs, segmental duplications etc) at the expense of regions we do have good models for. Nevertheless, good compression is clearly closely tied to good models of the data; the better we understand the data, the better we can compress it. No surprise.

Coming back to the here-and-now - this more aggressive error smoothing scheme is a twinkle in my eye at the moment. We (and by that I really mean Guy, Rasko, Vadim and the rest of the ENA crew...) still have a considerable amount of entirely practical work to complete this year on the mundane business of good, solid, working code; all these things need to be critically tested on real-life examples and by people both inside and outside the EBI - both the near-term things for which we have a path in the next year, and the longer-term things such as this error smoothing process. And indeed, we might not need such aggressive schemes, as technology might change in our favour - either slower DNA growth or faster disk growth. But I wanted to let people know our current planning. I've always loved this quote from General Eisenhower (apparently himself quoting Army teaching at West Point):

"Plans are worthless, but planning is everything".

Tuesday, 24 May 2011

Engineering around reference based compression

(this is my second blog post about DNA Compression, focusing on the practicalities of compression).

The reference-based DNA compression which I talked about in my previous blog post was a mixture of theoretical results and real-life examples taken from submitted datasets. But it was definitely a research paper, and the code that Markus wrote was Python code with some awful optimisations in the encode/decode streams (the codec, as it's called by the pros). So it... didn't have a great runtime. This was one of the points we had to negotiate with the reviewers, and there's a whole section in the discussion about how this is going to work in the future with nice clean engineering.

So - what are the practicalities of reference-based DNA compression? Well, first off, this has passed from my area (more research and development) to my colleague Guy Cochrane's area. Guy runs the European Nucleotide Archive (ENA) - the EBI's integrated resource for DNA archives, spanning "traditional" assemblies and annotation (the EMBL Data Bank, for people who remember it) and the more modern "Sequence Read Archive" (SRA) submissions. ENA has a unified namespace for all these objects and a progressively harmonised underlying data model. Guy brought in Rasko Leinonen (ENA's technical lead), and they took on the task of creating appropriate production-level code for this.

Some of the decisions taken by Guy and Rasko: the first code base is in Java, aiming to be compatible with the existing Picard toolkit from the Broad (Picard reads/writes the successful SAM/BAM formats, alongside the C-based samtools); the format copies much of BAM in terms of its header and meta-data structure; and the format has to consider all the practicalities of both external and internal references (see below). The format has to be written with an eye to indexability (so people can slice into it), and there is some discussion about having hooks for future extensibility, though one has to be totally careful not to bloat the format - after all, the whole point of the format is that it doesn't take up space. Vadim Zalunin is the lead engineer on this, and there has basically been an extensive transfer of knowledge from Markus to Vadim from January onwards. Finally we had to hit on a good name, and we've chosen "cram". (We know the name cram has been used in other computing contexts, eg, cramfs in Linux systems, but there are so many other uses of cram in computer science that we don't think one more name clash is a headache. And it's too good a name...)

The CRAM toolkit is now on its second point release (0.2) and, perhaps more importantly, Vadim, Rasko and Guy feel confident that "in the summer" of 2011 there will be useful end-point code which does the basic operation of compressing a read-paired sequence file with a user-defined "quality budget". This is basically the point at which other groups can try this on realistic scenarios. The quality budget is the idea that we would only be saving a proportion of the qualities (between 1% and 5%, realistically) from datasets. We had an interesting internal argument about whether the first target was a quality budget that could be run from the command line with simple parameters (eg, save the bottom 2% of the qualities, or save all the qualities at places where more than 2 bases differ from the reference) or a more "complete control" of qualities, where a structured text file provides the precise quality values to save per read - what we are now calling a "quality mask". We've ended up going for the latter for the first release, as we think the first group of people who want to use this will be "power users" in cancer genomics, medical genome sequencing, and de novo sequencing who will want to critically explore the trade-offs between quality budgets, downstream analysis and storage. Clearly, as this matures and is used more broadly, there will have to be good 'default choices' of quality budgets.
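To illustrate the concept (this is a hypothetical sketch of the idea, not CRAM's actual mask format): a quality mask names the positions whose quality values survive; everything else is discarded, and the budget spent is simply the fraction retained.

```python
# Hypothetical sketch of a per-read quality mask (not CRAM's real format):
# keep quality values only at masked positions, discard the rest.
def apply_quality_mask(quals, mask):
    """Return the retained (position, quality) pairs and the fraction kept."""
    kept = [(i, q) for i, q in enumerate(quals) if i in mask]
    return kept, len(kept) / len(quals)

quals = [30, 32, 12, 31, 8, 33, 30, 29, 31, 30]   # toy Phred scores, 10bp read
mask = {2, 4}                                     # say, two mismatch positions
kept, budget = apply_quality_mask(quals, mask)
print(kept)                                       # [(2, 12), (4, 8)]
print(f"quality budget used: {budget:.0%}")       # 20%
```

The "power user" version is exactly this level of control per read; the command-line version would just be a rule for generating the mask automatically.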

This brings me to the next point: although CRAM is being developed as a pragmatic response to the challenge of storing DNA in archives, we want to push CRAM both upstream and downstream in the toolchains - upstream as a potential target file for sequencers to write, and downstream for analysis groups. We're deliberately trying to mimic the use of SAM/BAM (hence calling it CRAM, and hence targeting Picard as a potential toolkit). There are risks here - the information that a sequencer wants to write out is not necessarily precisely what the archives want to store, nor is an archival format necessarily ideal for analysis. But the success of the BAM format - which the archives now take as a submission format, and which the ENA is planning to provide access to - suggests that the needs are aligned well enough that this is doable. However, this does give Guy an extra headache - he has to manage not only the engineering for the archive, but also try to keep a rather diverse community happy. We've been working with the samtools/Picard guys for a while now, but it's clear that we'll be getting closer. I know Guy is considering how best to structure the alpha/beta testing cycles and how to work with the community as this is pushed out.

A key question to answer early on is to what extent we can compress "unmapped" reads onto "crash" assemblies - taking all the unmapped reads and not caring about making a correct assembly, but aiming for a good compression reference. Disturbingly, we find in the paper that for an early 1,000 Genomes dataset (the NA12878 high-coverage trio), of the ~20% of reads which don't map to the current reference, only about 15% can be incorporated into a crash assembly. As there is basically about a 10-fold difference in compressibility between truly unmappable reads and reference-mapped ones (against either the "true" reference or crash assemblies), having ~15% of the reads as singletons means that holding the entire dataset would leave the storage dominated by these reads. In the paper we discuss whether we should consider discarding these reads completely - the ultimate loss in lossy compression, and not something you want to do without proper consideration.
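The arithmetic of why singletons dominate is simple, assuming the ~10-fold compressibility gap from the paper: each singleton read costs roughly ten times as much to store as a mapped read.

```python
# Relative storage of singleton (unassemblable) reads vs mapped reads,
# assuming singletons compress ~10x worse (the gap quoted in the paper)
def singleton_vs_mapped(singleton_frac, penalty=10):
    """Ratio of singleton storage to mapped-read storage."""
    return (singleton_frac * penalty) / ((1 - singleton_frac) * 1.0)

for frac in (0.15, 0.10, 0.05):
    print(f"{frac:.0%} singletons -> ratio = {singleton_vs_mapped(frac):.2f}")
```

At 15% singletons, the unmapped tail takes nearly twice the storage of everything else; at 10% it is about equal; only near 5% does it drop safely below.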

Thankfully this is probably an overly pessimistic view, triggered by the fact that we used early datasets in which people were submitting all the reads, even the ones flagged as probable errors. More recent datasets we've looked at - even a metagenomic microbiome dataset - and cancer datasets show far fewer unmapped reads at the start, and fewer post crash assembly - between 10% and perhaps 5% truly unmapped reads. This would mean that even in the worst case of storing all of this, it would be at worst roughly equal to the main storage. With a bit more critical analysis of these reads I feel we might be able to push this down to below 5% - then we don't have to worry too much until we get into 20-fold or 50-fold compression of the main aligned datasets. But we've clearly got more work to do here, and I know there are some people across the world keen to get involved in this assessment, in particular from the cancer community.

Another concern often expressed is about complex datasets such as RNA-seq - RNA-seq will have variable expression rates across genes, and spliced reads will not necessarily map well, if at all. It seems hard to discard reads from RNA-seq at all, as it's precisely the low-count, novel splice reads that are interesting. We need to do some more work here on how many unmapped reads there are in an average RNA-seq experiment, but it is worth noting that RNA-seq experiments are far, far smaller than whole-genome shotgun. As a proportion of the current archival storage, RNA-seq is less than 10% of the data, arguing that one can tolerate perhaps 5-fold worse compression properties in RNA-seq before it starts to dominate the overall storage.

Finally we have the practicalities of the "reference" part of reference-based compression. Reference-based compression implies a set of references that can be reused. People often get overly exercised about this - as we show in the paper, even in the worst case, where we pack the reference inside each data package (appropriately compressed itself), for the real datasets we examined it did not inflate the size too much. Guy and Rasko have made sure that the idea of a mosaic set of references can be handled (for example, the 1,000 Genomes Project uses the human GRCh37 reference with the Epstein-Barr Virus - EBV - genome added - quite sensibly, as the 1,000 Genomes Project uses EBV-transformed cells).

There is still a complex task ahead of Guy and Rasko, and then the broader community, to make this a reality. That said, gains of even 5-fold compressibility will be welcomed, both for people's disk budgets and for transfer volumes. And I think there is a lot more headroom in compression way beyond 5-fold, into 20-fold, 100-fold compressibility. And that's going to be the topic of my third blog post on this...

Monday, 16 May 2011

Compressing DNA: Part 1.

(First off, I am back to blogging, and this is the first of three blog posts I'm planning about DNA compression.)

Markus Hsi-yang Fritz - a student in my group - came up with a reference-based compression scheme for next-generation sequencing data. This has just come out in "proper" print format this month (it was previously fully available online) (full open access). Markus not only handled the bases, but also the other key components in a sequencing experiment - read pairing, quality values and unaligned sequence. For the bases and read-pair information, the compression is lossless. For the quality information and unaligned sequence one has to be lossy - if not, we can't see a way of making an efficient compression scheme. But one can quite carefully control where one keeps and loses data ("controlled loss of precision" in the terminology of compression), meaning that one can save information around important regions (eg, SNPs, CNVs etc).
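A toy sketch of the core idea (illustrative only - the real scheme also handles pairing, qualities and unaligned reads, and encodes everything far more compactly): a mapped read reduces to a position plus its differences from the reference, and the bases come back losslessly.

```python
# Toy illustration (bases only): a mapped read stored as its position plus
# the list of differences from the reference, instead of its full sequence.
def compress_read(ref, read, pos):
    """Encode read as (pos, [(offset, base), ...]) of mismatches only."""
    diffs = [(i, b) for i, b in enumerate(read) if b != ref[pos + i]]
    return pos, diffs

def decompress_read(ref, pos, diffs, length):
    """Rebuild the read losslessly from the reference plus its diffs."""
    bases = list(ref[pos:pos + length])
    for i, b in diffs:
        bases[i] = b
    return "".join(bases)

ref = "ACGTACGTACGTACGT"
read = "ACGAACGT"                       # one mismatch against ref[0:8]
pos, diffs = compress_read(ref, read, 0)
print(pos, diffs)                       # 0 [(3, 'A')]
assert decompress_read(ref, pos, diffs, len(read)) == read
```

Most reads of a high-coverage genome match the reference exactly, so they reduce to little more than a position - which is where the big wins come from.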

In our paper we show that one can achieve 10-fold to 100-fold better compression than current methods (gzipped FASTQ, or BAM files), and, rather brilliantly, not only does this compression scheme get better with read length, but the rate of improvement with respect to read length can be tuned by the amount of precision held on the quality information. As the increase in read length is a reasonably large component of why next-generation sequencing is improving so fast, this makes the compression partially mitigate the gap between next-generation sequencing technology improvement and disk drive improvement. By judicious tuning of the "precision" at which we hold quality data (progressively tightening this over time, presumably because we will get better at knowing where to keep precision), we believe there is a sustainable storage model (sustainable for us means a constant money spend each year on disk) for next-generation sequence data for at least 5 years, if not 10.

Due to this paper, and thinking about this problem more generally, over this last year I've learnt a lot more about compression. Video compression - for the entertainment industry (all sorts!) - is a strong driver for compression technologies; not all of this work is accessible (both in a publication sort of way and in an easy-to-understand way), but nevertheless reading about video compression has radically changed my thinking about how to compress DNA data. There are three interlinked themes I've learnt.

1. Lossy compression makes you think about what information you can lose, rather than what information you want to keep. You might think that as this is just partitioning the total information into two categories, and as there is just one line to draw, it doesn't actually matter which way around you define this. But that's not the case - one is going to have different levels of certainty about how important particular bits of information are. So really you end up ranking all the information from "I am sure this is important" to "I am sure this is unimportant". Lossy compression is basically about discarding from the "I am sure this is unimportant" end and working towards the more uncertain "I am not clear whether it's important or not", leaving the "I am sure this is important" as the last to go. This is the absolute opposite of assessing information in terms of how well it fits a biological model.

2. By thinking about this as a lossy compression problem, and by realising that one can tune the "aggressiveness" of the compression, one naturally thinks about the budget trade-offs - how much space/bandwidth is available, rather than how much space this particular compression takes. In Skype video calls, and some of the more sophisticated internet video codecs, the system deliberately estimates the bandwidth and then dynamically adjusts the compression scheme to discard as little as possible consistent with the bandwidth constraints. This is close to what we have: a certain budget for storage, and we want to fit DNA sequence into that budget, though one always has to be careful to understand how degraded the resulting signal is. For Skype, low bandwidth means dropping the call; for storing cancer genomes, we can't be quite so trigger-happy... Still - the basic set-up is "what should I fit into this relatively fixed envelope of storage/bandwidth?", not "how much storage is this going to cost me?".

3. It is far easier to model and understand systematic noise processes introduced by measuring procedures than to model the "real world" of things you don't necessarily have full models for. You'd be surprised how much bandwidth on a high-definition TV goes on color differences which people _know_ can't be a good representation of the image, as the original sensor of the image has a poorer color resolution than the output device. In other words, sometimes people are effectively adding known noise to image systems. In the case of DNA, we (... or at the very least the vendors) know a lot about the different error modes in the physical process of sequencing. It is precisely these errors which make up the majority of the information stored for a read set after reference-based compression (without reference-based compression, most of the information is indeed in the fundamental bases and qualities, but reference-based compression effectively subtracts out the repeated, redundant base information).

So - these three themes - that we're trying to rank "confident" noise information, that we're working to fit data inside fixed (or at best, less flexible) storage budgets, and that we're more confident about our "noise" models than our "biology" models - mean that lossy compression, of which our proposed reference-based compression is one sort, is very, very different from thinking about this problem as (say) an optimal analysis problem, or as how one stores analysis results.

More on the practicalities of compression next week...