Tuesday, 26 August 2014

CRAM goes mainline


Two weeks ago John Marshall at the Sanger Institute announced SAMtools 1.0 - one of the two most widely used Next Generation Sequencing (NGS) variant-calling tools, embedded in hundreds if not thousands of bioinformatics pipelines worldwide. (The majority of germline variant calling happens through either SAMtools or the Broad's GATK toolkit.) SAMtools was started at Sanger by Heng Li when he was in Richard Durbin's group, and it has stayed there, now under the watchful eye of Thomas Keane.

The 1.0 release of SAMtools has all sorts of good features, including more robust variant calling. But a real delight for me is the inclusion of CRAM read/write, making it a first-class format alongside SAM and BAM. SAM started all this off (hence the name samtools), and BAM is a binary form of SAM with generic compression. CRAM was written to be compatible with the SAM/BAM data model, and its main purpose is to leverage data-specific compression routines. These include reference-based compression of base sequences, based on work done by Markus Hsi-Yang Fritz and me (blogged here, 3 years ago).
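
To make this concrete, here is a toy sketch (in C, since htslib is C) of reference-based encoding. The real CRAM codecs are far more sophisticated - they handle insertions, deletions, soft-clipping and unmapped reads - but the core observation is simply that a mapped read can be stored as a position plus its few differences from the reference:

    /* Toy illustration only - real CRAM encoding is far richer. */
    #include <stdio.h>
    #include <string.h>

    /* Store a mapped read as (position, length) plus only the bases that
     * differ from the reference; in resequencing data most reads match
     * the reference almost everywhere, so the diff list is tiny. */
    static void diff_encode(const char *ref, size_t pos, const char *read)
    {
        size_t len = strlen(read);
        printf("pos=%zu len=%zu", pos, len);
        for (size_t i = 0; i < len; i++)
            if (read[i] != ref[pos + i])
                printf(" %zu:%c", i, read[i]); /* offset + substituted base */
        putchar('\n');
    }

    int main(void)
    {
        const char *ref = "ACGTACGTACGTACGT";
        diff_encode(ref, 4, "ACGTACTT");  /* prints: pos=4 len=8 6:T */
        return 0;
    }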

CRAM also brings in a whole set of compression tricks from James Bonfield (Sanger) and Vadim Zalunin (EMBL-EBI) that go beyond reference-based compression. James was the winner of the Pistoia Alliance's "Sequence Squeeze" competition, but sensibly said that his techniques would be best used as part of CRAM. He was also instrumental in splitting the C code out into a reusable library (htslib), which is part of what SAMtools is now built on. There is a whole mini-ecosystem of tools and hooks that enable reference-based compression to work, including a lightweight reference sequence server developed by Rasko Leinonen at EMBL-EBI.
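
To give a flavour of what htslib makes possible, here is a minimal sketch of a BAM-to-CRAM converter against the htslib 1.x API. This is not the samtools implementation itself: the file names are placeholders and error handling is pared to the bone.

    #include <stdio.h>
    #include <htslib/sam.h>

    int main(void)
    {
        samFile *in  = sam_open("input.bam", "r");    /* read BAM */
        samFile *out = sam_open("output.cram", "wc"); /* "wc" = write CRAM */
        if (!in || !out) { fprintf(stderr, "failed to open files\n"); return 1; }

        /* CRAM needs the reference for reference-based compression */
        hts_set_fai_filename(out, "ref.fa");

        bam_hdr_t *hdr = sam_hdr_read(in);
        if (!hdr || sam_hdr_write(out, hdr) < 0) return 1;

        bam1_t *rec = bam_init1();
        while (sam_read1(in, hdr, rec) >= 0)   /* copy record by record */
            if (sam_write1(out, hdr, rec) < 0) return 1;

        bam_destroy1(rec);
        bam_hdr_destroy(hdr);
        sam_close(out);
        sam_close(in);
        return 0;
    }

From the command line, the equivalent is roughly: samtools view -C -T ref.fa -o output.cram input.bam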


Elegant engineering


SAMtools is basically the work of a series of good engineers (John, James, Vadim and others), each building different components of NGS processing under the bonnet - with Sanger and EMBL-EBI investing considerable effort in making the machinery come together. Just as an internal combustion engine requires more complex engineering than simply mixing fuel, igniting it and making a car go, good engineering takes more than a proof of principle. Really elegant engineering is invisible, and that is what SAMtools offers: it just works. It has been great to see John and James work CRAM into this indispensable piece of software, and to see Sanger coordinate the project so well.

Space saving and flexibility

With this release CRAM becomes part of the natural SAMtools upgrade cycle, so when people upgrade their installation they will immediately see at least a 10% saving on disk space, if not better. If they let the new Illumina machines output the lower-entropy quality values (this is the default), the savings will be more like 30-40%.

Another practical benefit of CRAM is its future-proofing: CRAM comes "out of the box" with a variety of lossy compression techniques, and the format is quite flexible about potential new compression routines. We've seen that the main source of entropy is not the bases themselves but the quality values attached to them. CRAM provides options for controlled loss of precision on these qualities (something Markus and I explored in the original 2011 paper). It's important to stress that the decision about the right level of lossy compression is best made by the scientist using and submitting the data. It may be that community standards grow up around lossy compression levels - it's worth realising that, in effect, Illumina already makes a whole host of appropriate "data loss" decisions in its processing pipeline, most recently the shift to a reduced-entropy quality-score scheme. The first upgrade cycle will allow groups to mainline this and give them the option to explore appropriate levels of compression.
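
As an illustration of how reduced-entropy quality schemes work, here is a toy binning function in C. The bin boundaries below are made up for demonstration - they are not Illumina's published scheme, and CRAM's own lossy modes are configurable rather than fixed - but they show the principle: collapsing ~40 distinct quality values down to 8 slashes the entropy the compressor has to encode.

    #include <stdio.h>

    /* Map a Phred quality to one of 8 representative values; fewer
     * distinct values means lower entropy, which compresses far better.
     * Boundaries are illustrative only. */
    static unsigned char bin_quality(unsigned char q)
    {
        if (q < 2)  return 0;   /* no-call */
        if (q < 10) return 6;
        if (q < 20) return 15;
        if (q < 25) return 22;
        if (q < 30) return 27;
        if (q < 35) return 33;
        if (q < 40) return 37;
        return 40;
    }

    int main(void)
    {
        unsigned char quals[] = { 2, 11, 23, 28, 34, 38, 41 };
        for (size_t i = 0; i < sizeof quals; i++)
            printf("Q%d -> Q%d\n", quals[i], bin_quality(quals[i]));
        return 0;
    }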

The Sanger Institute has been submitting CRAM since February 2014. And in preparation for the widespread use of CRAM - with the option of lossy compression - the European Nucleotide Archive has already announced that submitters can choose the level of compression and the service will archive data accordingly. Guy Cochrane, Chuck Cook and I also explored the potential community usage of compression for DNA sequencing in this paper. So we have solid engineering, a flexible technical ecosystem and preparation for the social systems in which it will all work.


R&D ...&D ...&D

When the first CRAM submission from Sanger to the EBI happened about a year ago, I blogged about how well this illustrates the difference between research and development. Our research and proof-of-principle implementation of data-specific compression of DNA took perhaps 1.5 FTE-years of work, but getting from there to the release of SAMtools 1.0 has taken at least 8 FTE-years of development. In modern biology, I think we still underestimate the effort needed to develop a research idea into deployable infrastructure, but I sincerely hope this is changing - this won't be the last time we need to roll in a more complex piece of engineering.

SAM, CRAM and the Global Alliance for Genomics and Health

CRAM development has come under the umbrella of the Global Alliance for Genomics and Health (GA4GH), a worldwide consortium of 200 organisations (including Sanger and the EBI) that aims to develop technologies for the secure sharing of genomic data. CRAM is part of the rather antiquated world in which specific file formats are needed for everything (all the cool kids focus on APIs these days), but it also represents an efficient storage scheme for the more modern API vision of GA4GH. APIs need implementations, and the large-scale "CRAM store" at the EBI provides one of the largest single instances of DNA sequence worldwide. Interestingly, the container-based format of CRAM has strong similarities to the row-grouped, column-oriented data stores common in modern web technology. We will be exploring this more in the coming months.
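
For readers who know those systems, here is a conceptual sketch - emphatically not the CRAM specification - of what a row-grouped, column-oriented layout means: records are batched into containers, and within a container each field is stored as its own "column", so a purpose-built codec can be applied to each data series.

    /* Conceptual sketch only - not the actual CRAM container format. */
    #include <stddef.h>

    #define RECORDS_PER_CONTAINER 10000   /* one "row group" of reads */

    /* Within a container, the same field from every record is laid out
     * together, so bases can get reference-based compression while
     * qualities, names and flags each get their own codec. */
    struct container {
        int n_records;
        unsigned char *read_names;   /* all names, concatenated */
        unsigned char *bases;        /* all base calls          */
        unsigned char *quals;        /* all quality values      */
        unsigned char *flags;        /* all flag fields         */
    };

    int main(void) { struct container c = {0}; (void)c; return 0; }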


Upgrade now, please.

The main thing for the genomics community now is to upgrade to SAMtools 1.0. This will bring many benefits across short-read calling and management, one of which is the ability to read and write CRAM directly. Doing so will mean considerable savings in disk space for your institute and for us at the EBI, and it will help ensure we're all prepared for whatever volume of data you might produce in the future.



Thursday, 10 July 2014

Scaling up bioinformatics training online

Bioinformatics has grown very quickly since the EBI opened 20 years ago, and I think it’s fair to say that it will grow even faster over the next 20 years. Biology is being transformed into a fundamentally information-centric science, and a key part of this has been the aggregation of knowledge in large-scale databases. When you put all the hard-won information about living systems together – their genome sequences, variation, proteins, interactions with small molecules – the result is, potentially, incredibly useful. I say “potentially” because even the most pristine, large, interconnected data collection in the world isn’t worth much if people don’t know how to use it.
So at the EBI we have this challenge of making sure researchers (a) know we have all this amazing data for them, and (b) are able to use it. One aspect of this is making easy-to-use, intuitive websites, which is something I’ve blogged about before. But training, in all its forms, is really important.

A moving target

Not very surprisingly, face-to-face interactions really make the biggest difference. Nothing is better than having a person to guide you through using a resource (and making sure you’re using the right one), which is why some seven years ago we increased our training efforts substantially. We now run a huge number of courses on the Genome Campus in Cambridge and deliver an even larger number around the globe.
All this training is coordinated by one team, but of course training is embedded in all the different resource teams, so the people actually leading the courses really know their stuff. I know first-hand from my days as co-lead of Ensembl how effective this can be.
These courses have been taken out to over 200 sites in close to 30 countries, and they’ve reached more than 7000 people. But as I said, bioinformatics is growing fast and face-to-face training is just really hard to scale up. One way we are dealing with that is through a train-the-trainer programme, which we run in lots of different places, and while it’s effective, it’s just not enough.

Training online

So about three years ago we launched Train online, an e-learning platform set up to help molecular biologists figure out how to make the best use of our resources, dipping in as time permits. We now have 43 online courses, and over 60,000 people from close to 200 countries have visited Train online this year alone (effectively doubling the number we had this time last year).
[Map: location of visitors to Train online]

Good content first

These training materials are put together by teams of people who put a lot of effort into making them engaging and motivating. The way people learn individually (specifically, using online tools) is rather different from the way they learn through interacting with another person, so we try to accommodate this in a number of ways, for example making more ‘bite-sized’ courses.
Face-to-face training happens in real time and is fairly fluid, and there is a lot of preparation that goes on right up to the moment a course starts. The workflow for e-learning, on the other hand, needs constant reviewing and refreshing to stay current, so someone with adequate expertise needs to stay on it.
(As a point of interest, we assign DOIs to our courses so that the authors get recognition for their work and so we can track citations of them.)

Production value matters

As bioinformatics and computational techniques become increasingly important to more and more applied fields – healthcare, agriculture, environmental research and others – we will need to continue to innovate around how we train people. That means anything from new, effective methods for training educators to making e-learning platforms like Train online as interactive as possible.
High-quality online courses need to be inviting to explore, so that you remember what you learn and are inspired to learn more. That requires significant infrastructure. You need much more than just the technical capacity to set things up properly on the web – you need video equipment, editing and production workflows, video hosting and a great interface… but more than anything, you need solid in-house UX and multimedia expertise and you have to be ready to use it.

Who needs it?

A rough, back-of-the-envelope calculation suggests that there are up to 2 million life science researchers worldwide, and depending on how you count healthcare-related research that number could go up to 4 million. If 100,000 people use Train online this year, our online learning resource alone will reach between 2.5% and 5% of the scientists who probably need bioinformatics training. That’s a pretty good start, but there’s a long way to go.

So if you haven’t had a look at Train online yet, please do – you might be surprised. It’s one innovation we’re particularly proud of, and we’re looking forward to seeing more in the future.