Saturday, 14 July 2012

Optical and nucleic capture: the future of high-density information capture in biology

Last week a positioning paper by Guy Cochrane, Chuck Cook and myself finally came out in GigaScience. Its premise is rather simple: we are all going to have to get used to lossy compression of DNA sequence, and as lossy compression is variable (you can set it at a variety of levels), we will need a community consensus on how much to compress different kinds of data. This is really part of our two-year effort to make efficient compression a practical reality for DNA sequencing, which I've blogged about before, numerous times.

I encourage you to read the paper, but in this blog post I want to explore the analogies between imaging and DNA sequencing - which are now numerous. I believe that at the core of biology (of all sorts), the majority of data gathering will be either optical or "nucleic" (the third machine type probably being mass spectrometry, if you are interested). As a colleague once said about molecular biology - the only game in town now is to get your method to output a set of DNA molecules. If you can do that, you can scale.

The first question to ask is why these two technologies are so dominant. The first reason is that each is fundamentally trying to capture information - information about distributions of photons, or information about the make-up of molecules. One is not trying to do something large and physical. This means that the mechanisms for detection can be exquisitely sensitive. In the case of imaging, single-photon sensitivity is almost routine on modern instruments, and techniques like super-resolution imaging - a whole bunch of tricks to (in effect) convert multiple time-separated images of a static scene into better spatial resolution - push this even further (I saw a deeply impressive talk by a postdoc in Jan Ellenberg's group showing a remarkable resolution of the nuclear pore). But one does not need fancy tricks to make great use of imaging - rather mundane, cheap imaging is the mainstay of all sorts of molecular biology; Drosophila embryo development would be a different world without it. In the case of DNA, the ability to sequence at scale has been the focus for the last 20 years, and will probably remain so until a human genome is in the ~$1,000 or $100 zone - as expensive as a serious X-ray or MRI. At the same time there is a shift towards more real-time systems (the "latency" of the big sequencers is probably the biggest drag - 3 weeks is your best case on the old HiSeqs) and towards single-molecule systems. People talk about real time as critical to the clinic, and certainly the difference between even 12 hours and 2 weeks is night and day (at 12 hours or less one can complete the cycle within a one-day stay, and start to impact in-time diagnosis), but faster cycle times will really change research as well. Going back to the information aspect of these two technologies: because one is only trying to get information out of these things, the physical limits of the technologies are remarkably far away.
Imaging is hitting some of these limits (though there is still plenty of space for innovation); third-generation DNA sequencers will get closer to some limits in DNA processing as well, but again, we have some way to go. The future is bright for both of these technologies.

The second similarity is the mundane business of storing the output of these technologies - they are high-density information streams, and therefore have a lot of inherent entropy. Some of that entropy one wants to utilise - that's the whole point - but there is also quite a bit of extra "field" (in imaging terms) or "other bits of the genome" (in DNA terms) which one often knows is going to be there, but is less interested in. Imaging has long led the area of data-specific compression, using at first a variety of techniques to transform the data from a straightforward x,y layout of pixel intensities into forms which inherently capture the correlation between pixels, allowing for efficient lossless compression. But the real breakthroughs came with lossy compression: the understanding that for a lot of the pixels, a transformation which lost some information for a large gain in compressibility was appropriate for many uses. Although the tendency is to think about lossy compression in terms of "visual" display-to-user uses, in fact many technical groups use a variety of lossy forms for their storage, choosing mainly the amount of loss appropriately (I'd be interested in experiences on this, and in particular whether people deliberately choose lossy algorithms away from the JPEG family). But video compression has really taken lossy compression into new directions, with complex between-frame transformations followed by lossy encoding, in particular adaptive modelling.
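The idea of a transform that captures pixel correlation before handing the data to a generic compressor can be illustrated with a toy one-dimensional example: delta-encoding a smooth synthetic "scanline" (a random walk standing in for real image data) and comparing how well zlib does on the raw versus the transformed bytes. This is a sketch, not any real codec.

```python
import zlib
import random

random.seed(42)

# Synthetic 1-D "scanline": a slowly varying random walk standing in for
# smooth image data (neighbouring pixels are highly correlated).
vals = []
v = 128
for _ in range(100_000):
    v = max(0, min(255, v + random.randint(-2, 2)))
    vals.append(v)
raw = bytes(vals)

# Delta transform: keep the first pixel, then store differences between
# neighbours (mod 256). The transform is lossless and exactly invertible,
# but it concentrates the signal into a handful of symbol values.
delta = bytes([vals[0]] + [(vals[i] - vals[i - 1]) % 256
                           for i in range(1, len(vals))])

raw_size = len(zlib.compress(raw))
delta_size = len(zlib.compress(delta))
print(raw_size, delta_size)  # the delta stream compresses far better
```

The same byte-level compressor does dramatically better once the correlation between neighbouring values has been made explicit - which is the whole point of the transform step.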

When we started in DNA compression, many people critiqued it, saying that we "couldn't beat established generic compression" or that certain compression forms were "already optimal". This totally misses the point - generic and optimal compression schemes are only generic and optimal for a particular data model, and to be generic, that data model involves a byte stream. One doesn't hear people saying about video compression "oh well, that problem has been solved generically and there are optimal compression methods" - putting a set of raw TIFFs straight into a byte-based compressor would not do very well. The key thing is first a data transformation that makes the correlation in the data explicit for standard generic methods to compress (in the case of DNA, reference-based alignment provides a sensible realisation of the redundancy between bases in a high-coverage sample, and for a low-coverage sample realises the redundancy with respect to previous samples). The second thing is the insight that not all the precision of the data is needed for interpretation. Interestingly, lossy compression makes you think about the problem as the inverse of the normal thought process - often you ask "what information am I looking for" for some biological process - SNPs, or structural variants. Lossy compression inverts the problem to ask "what information are you pretty sure you don't need". For example, when you know your photon detector will generate some random noise in particular patterns, having a lossy compression remove that entropy is highly unlikely to affect downstream analysis. Similarly, when we can confidently recognise an isolated sequencing error, degrading the entropy of the quality score of the base is unlikely to change downstream analysis.
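The quality-score side of this can be sketched very simply: quantise the scores into a small number of bins and measure how much entropy (bits per symbol, a lower bound on lossless compressed size) disappears. The bin boundaries below are my own illustrative choice, not any vendor's actual binning scheme, and the scores are synthetic.

```python
import math
import random
from collections import Counter

def entropy(symbols):
    """Shannon entropy in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

random.seed(1)
# Synthetic Phred quality scores in a roughly Illumina-like range (2-40).
quals = [max(2, min(40, round(random.gauss(32, 6)))) for _ in range(100_000)]

# Quantise into 8 levels; with 8 symbols the stream can never need more
# than 3 bits per quality value, whatever the distribution.
BIN_TOPS = [2, 10, 15, 20, 25, 30, 35, 40]

def binned(q):
    # map each score to the top of the bin that contains it
    return min(b for b in BIN_TOPS if b >= q)

qbin = [binned(q) for q in quals]
print(entropy(quals), entropy(qbin))  # binned stream needs fewer bits/symbol
```

Because binning is a many-to-one mapping, the entropy can only go down; the engineering question is whether the lost precision was ever informative, which is exactly the "what are you sure you don't need" framing above.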
I've enjoyed learning more about image compression, and I think we've only started in DNA compression - at the moment we can get 2- to 4-fold compression compared to standard methods with a clearly acceptable lossy mode (acceptable because the machine manufacturers sort of know that they are generating a little too much precision in their quality scores). But with more aggressive techniques we can already think about 50- to 100-fold compression - though this is getting quite lossy. And this is not the end of the road - I reckon we could be 1,000-fold more compressed in the future.

The third similarity is the informatics intensity of the processing. Both for image analysis and DNA analysis there are standard tools (segmentation, hull finding, texture characterisation in imaging; alignment, assembly, variant calling in DNA sequence analysis), but how these tools are deployed is very experiment-specific. There is no "generic image analysis pipeline" any more than there is a "generic DNA analysis pipeline". One has to choose particular analysis routes driven mainly by the experiment that was performed, and then to some extent by the output you want to see. This means that the bioinformatician must have a good mastery of the techniques. I have to admit, although I live and breathe DNA analysis, often developing new tools, I am pretty naive about image analysis - not that that's stopping me diving in with my students to use (but not develop...) image analysis tools. I think we're not making image analysis enough of a mainstream skill set in bioinformatics, and this needs to change.

Finally, the cheapness and ubiquity of imaging has meant that from the start, image-based techniques had to think carefully about which images to store and at what compression. Clearly DNA sequencing is heading the same way, and this is the argument that Guy, Chuck and I put forward in the paper. Similarly to imaging, the key question is the overall cost of replacing the experiment, not the details of how much the image itself cost. So for a rare sample (such as a Neanderthal bone) it is very hard to repeat the experiment - you need to store that information at high fidelity. But a routine mouse ChIP-seq is far more reproducible, and one can be far more aggressive on compression. I actually think it has been to the detriment of biological imaging that there has not been a good reference archive - probably because of this problem of knowing which things are worth archiving, coupled with the awesome diversity of uses for imaging - but projects like Euro-BioImaging will, I think, provide the first (in this case federated) archiving process.

Over the next decade, then, I see imaging and DNA sequencing converging more and more. Time to learn some image analysis basics (does anyone know a good book on the topic that is geeky and detailed but starts at the basics?).

Tuesday, 3 July 2012

Galois and Sequencing

It is not often that anyone hears the phrases "Galois field" and "DNA" together, but this paper from my colleagues Tim Massingham and Nick Goldman provides a great link between these topics. Some other authors have used Galois fields in DNA analysis, but this is the first time I have seen a practical application of this level of mathematics in bioinformatics. It's a tour de force by Tim, and although published in the lowly BMC Bioinformatics, I think it should be celebrated for its sheer chutzpah in crossing scientific - indeed academic - domains.

If you have not met Galois fields, then here is a small crash course in some pretty dense pure maths. Fields come from the rather glorious, whimsical world of pure maths where one gets to fool around with the fundamentals - in this case redefining addition and multiplication. If certain conditions are met (that one can "add" and "multiply" in any order, that multiplying an addition is the same as adding together the multiplication of each element, and a couple of other requirements), one has a "Ring" (with the wonderful Tolkien-like world of ring theory). With a couple more criteria to meet - in particular that every non-zero element has a multiplicative inverse - one has a Field. Numbers are fields - real numbers, complex numbers, rational numbers... but so are all sorts of other things, including exotic beasts like p-adic numbers, which somehow involve primes in a suitably Alice-in-Wonderland-like way. Importantly, one can also have finite fields, in which there is a limited number of elements. A gifted young Frenchman, Galois, explored the properties of these finite fields and showed that only a limited number of them exist - in fact, for 2, 3 or 4 elements there is exactly one field each - i.e., only one way to define "addition" and "multiplication" that satisfies the criteria of a Field. (If you are curious, in the 2-element finite field "add" is like an XOR and "multiply" like an AND in logic.) Many wonderful and deep things have been proved using Galois fields, in both pure and applied maths.
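The unique four-element field, GF(4), is small enough to write down completely. A minimal sketch, representing the elements as the integers 0-3 (i.e. as 2-bit polynomials over GF(2)): addition is bitwise XOR, and multiplication is polynomial multiplication reduced modulo the irreducible polynomial x² + x + 1.

```python
# The (unique) four-element Galois field GF(4), elements 0-3 viewed as
# 2-bit polynomials over GF(2).

def gf4_add(x, y):
    # addition is bitwise XOR (polynomial addition over GF(2))
    return x ^ y

def gf4_mul(x, y):
    # carry-less polynomial multiplication...
    p = 0
    for i in range(2):
        if (y >> i) & 1:
            p ^= x << i
    # ...reduced modulo the irreducible polynomial x^2 + x + 1 (0b111)
    if p & 0b100:
        p ^= 0b111
    return p

elems = range(4)
# spot-check the field axioms: commutativity, distributivity, and the
# existence of a multiplicative inverse for every non-zero element
assert all(gf4_mul(x, y) == gf4_mul(y, x) for x in elems for y in elems)
assert all(gf4_mul(x, gf4_add(y, z)) == gf4_add(gf4_mul(x, y), gf4_mul(x, z))
           for x in elems for y in elems for z in elems)
assert all(any(gf4_mul(x, y) == 1 for y in elems) for x in elems if x != 0)
```

In the traditional {0, 1, a, b} notation, the elements 2 and 3 here play the roles of a and b; for example a·a = b falls out as gf4_mul(2, 2) == 3.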

So much for maths. Now on to sequencing chemistry.

At the EBI we were funded to explore Lifetech's new "Exact Call Chemistry" (ECC). Normal Lifetech SOLiD chemistry reads bases in pairs, where two adjacent bases give one read-out. Because there are 16 possibilities for two adjacent bases, but only 4 fluorophore read-outs, each read-out is ambiguous, representing multiple possible scenarios. The sequence of these ambiguous calls is called "colorspace", written with the set {0,1,2,3} to distinguish it from the underlying bases ("basespace" in SOLiD-speak, with {A,C,T,G}). As the very first (primer) base is known, due to the way the SOLiD chemistry works, one can in theory work out the next base (the first in the read), and chain down the entire read. But people rarely do this, because if you make an error in one position, the error propagates throughout the rest of the read. It is far more appropriate to do all the calculations (such as alignment, SNP calling and even assembly) in colorspace, and then "key" the answer into basespace right at the end. A whole host of tools has sprung up around the colorspace world.
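A toy encode/decode makes both the chaining and the error propagation concrete. This is a sketch of the colorspace idea, assuming the conventional mapping A,C,G,T → 0,1,2,3, under which the color between two adjacent bases is the XOR of their 2-bit codes (equal bases give color 0).

```python
# Toy SOLiD-style colorspace encode/decode.
B2I = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
I2B = 'ACGT'

def encode(primer, seq):
    """Bases -> colors, starting from the known primer base."""
    colors, prev = [], primer
    for base in seq:
        colors.append(B2I[prev] ^ B2I[base])
        prev = base
    return colors

def decode(primer, colors):
    """Colors -> bases, chaining down the read from the primer base."""
    bases, cur = [], B2I[primer]
    for c in colors:
        cur ^= c
        bases.append(I2B[cur])
    return ''.join(bases)

read = 'ACGGTA'
colors = encode('T', read)
assert decode('T', colors) == read  # round-trip works when calls are clean

# but a single color error corrupts every base from that point onwards
bad = list(colors)
bad[2] ^= 1
decoded = decode('T', bad)
print(read, decoded)  # identical up to position 2, wrong ever after
```

This is exactly why naive base-by-base decoding is avoided: one miscall shifts the whole remainder of the read, so analysis stays in colorspace until the very end.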

Exact Call Chemistry added another ligation-read step which, rather than interrogating the sequence in adjacent pairs, interrogates it in a series of read-2, skip-1, read-1, skip-1 steps - a complex overlay. This pattern was chosen for its error-correcting properties, and had the useful side effect that one could easily map the read directly into basespace. With this, many of the sequencing errors can be corrected, meaning that one can get to something like a 10^-6 error rate on the chemistry. But this poses lots of questions - the error model is no longer a simple process associated with each base; rather, one has to take a rather gestalt view of the error process across the entire read.

But how does one do this? How does one represent the combination of this colorspace plus this 2-on, 1-off, 1-on, 1-off read? Here Tim rather beautifully brings in a Galois field - remember that there is only one 4-element Galois field, and the design of colorspace allows for a one-to-one mapping of the 4 elements, traditionally called {0,1,a,b} (or sometimes alpha and beta), to both colors {0,1,2,3} and bases {A,T,G,C}. The color that occurs between two bases is just "addition" in the Galois field. This is a very elegant way to consider colorspace, but it really comes into its own for ECC space. The additional 7th read of ECC space, with its complex structure, is a type of matrix multiplication in the Galois field (at this point, by the way, my mathematical abilities have been stretched to breaking point by Tim and Nick, and I just have to trust them). By using this transformation, therefore, a lot of the things you might want to do with ECC can now be written down as "straightforward" maths in this transformed space.
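The "color is GF(4) addition" observation can be checked directly. A sketch, assuming the mapping A,C,G,T → 0,1,2,3 and the di-base color table as I understand it (both are my assumptions, not taken from the paper): GF(4) addition is bitwise XOR on the 2-bit representation, and it reproduces the whole 16-entry table.

```python
# Checking the identity: the SOLiD di-base color table is exactly
# "addition" in GF(4), with A,C,G,T as the field elements 0,1,2,3.

CODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

# the di-base -> color table (as I understand it)
COLOR = {
    ('A','A'): 0, ('A','C'): 1, ('A','G'): 2, ('A','T'): 3,
    ('C','A'): 1, ('C','C'): 0, ('C','G'): 3, ('C','T'): 2,
    ('G','A'): 2, ('G','C'): 3, ('G','G'): 0, ('G','T'): 1,
    ('T','A'): 3, ('T','C'): 2, ('T','G'): 1, ('T','T'): 0,
}

def gf4_add(x, y):
    # GF(4) addition is XOR of the 2-bit representations
    return x ^ y

assert all(COLOR[(b1, b2)] == gf4_add(CODE[b1], CODE[b2])
           for b1 in 'ACGT' for b2 in 'ACGT')
print("the di-base color table is GF(4) addition")
```

Once the table is recognised as field addition, operations on whole reads (like the ECC read's overlay) become linear algebra over GF(4) rather than ad hoc bookkeeping.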

So now Tim can explore the impact of, say, an error model considering all the separate reads independently (what he calls a "trellis" model), or other representations of the data, most obviously a basespace model with independent errors - i.e., the traditional "fastq" model. Unsurprisingly, the trellis model gives you far more information in, say, calling SNPs than the "fastq" model, because the underlying data has complex interdependencies between the errors - for example, a mistaken T for a G in position 1 also implies a particular error in position 3. However, a different representation, corrected colorspace, in which one stores an error-corrected colorspace read, in fact retains the majority of the information but is far more compact. Furthermore, as it is error-corrected, it will compress better in reference-based schemes (something we're pretty obsessed with at the EBI: see previous blog posts), and indeed this compression is best thought of as a compression of the Galois field elements on the reference, which "naturally" compresses well. There are a bunch of other things that Tim can do in this framework, for example determining the number of errors that can be detected (up to 2) and the number that can be definitely corrected (only 1).

It's not clear how much of a future ECC chemistry has out there. Lifetech recently bought Ion Torrent (with a completely different, 454-like chemistry, with very different error properties). It's clear to me that if colorspace and friends (like ECC) were the only way to sequence DNA, we'd all know this backwards, and "Galois field" would be as commonplace as "dynamic programming" in bioinformatics circles. But whatever the future of the chemistry, it's great to see a relatively deep piece of maths (and pretty modern in terms of pure maths - from the early 19th century) being used in a totally profound way in bioinformatics. I have no doubt that over time biology and bioinformatics will end up using more and more of the mathematical toolkit developed over the years.

   Who knows, we (bioinformatics) might even inspire the development of new maths sometime in the future.