Tuesday, 17 December 2013

The Start of a Journey

Last week a new paper, “Policy challenges of clinical genome sequencing,” led by Caroline Wright and Helen Firth and on which I am a co-author, was published in the British Medical Journal. It lays out the challenges of making more widespread use of genetic information in clinical practice, in particular around ‘incidental findings’. Caroline and I have a joint blog post on this paper on Genomes Unzipped.
This paper also marks an important watershed in my own career, as it is my first paper in an outright clinical journal. Like many other genomicists and bioinformaticians I have started to interweave my work more tightly with clinical research, as the world of molecular biology – previously mainly basic research – begins to gravitate towards clinical practice.

Worlds apart

Clinical research and basic research are profoundly different, both in terms of scientific approach and culture. Clinical researchers who keep a hand in clinical practice are nearly always time pressured (e.g. hospital meetings, clinics, inflexible public responsibilities) and their research has to be squeezed in around their practice. The language of clinical research is also distinctly different from that of genomics. For example, I used to use the word ‘doctor’ interchangeably with ‘clinician,’ until a generous clinician took me aside and patiently explained that ‘doctor’ is not the word clinicians use, as it does not capture the myriad disciplines in the healthcare system. They use the word… clinician.
But the differences run deeper than terminology and schedules. Clinical practice involves seeing patients, each of whom presents a different constellation of symptoms, tolerance to treatment and set of personal circumstances – it’s a far cry from the nice, uniform draws from statistical distributions that one hopes to see in designed experiments. A clinician has to work out the true underlying problem – often different from the one described by the patient – and find a way to make it better, often under pressure to act quickly and contain costs.
In theory, molecular measurements – from genotypes to metabolomics – should be informative and useful to the clinician. In practice, there is a wide gulf between any given molecular approach (usually from a retrospective study) and the uptake of molecular information into clinical practice.
Hanging out with more clinicians has given me a deeper appreciation of the difficulty of achieving this, and of why clinicians make such a sharp distinction between people who are part of medical practice and those who are not. I, for one, have never had the responsibility of making a clinical decision (I’m rather glad other people have taken that on, and appreciate the amount of training and mental balance it takes), so I know I haven’t grasped all the crucial details and interactions that make up the whole process.

Different perspectives

Medicine is also quite diverse, and rightly so. A clinical geneticist might be dealing with a family with a suspected genetic disorder, but a number of family members are currently healthy. Meanwhile, a pancreatic cancer specialist might be helping a new patient whose chances of living another five years are around 2% – and who is therefore a lot more willing to look into experimental treatments than the clinical geneticist’s family.
Even within a discipline, it is not so obvious where the new molecular information is best used. I had the pleasure to be the examiner for Dr James Ware, a young clinician doing a PhD on cardiac arrhythmias (a subset of inherited cardiac diseases) with Dr Stuart Cook. He presented excellent work on genetically ‘dissecting’ out some new arrhythmia mutations from families. He also revealed a passion not just for using genetics but for finding practical ways to do so. From his perspective, in this particular medical area, the bigger impact for genetics would be after a phenotype-led diagnosis, rather than for diagnosis itself.

Discussions leading to insight

Our recent paper in the BMJ is a good example of how much I have learned in recent years simply by discussing things with clinicians in detail. I have long advocated a more open and collaborative approach to sharing information about variants with ‘known’ pathogenic impact, even considering the daunting complexity of variant reporting and phenotypic definition (progress is steady in this area, e.g. the LRG project), and this seemed to be aligned with the discussion about making a definitive list of variants for “incidental findings”. So I was somewhat taken aback to find that many clinicians did not share my enthusiasm about incidental findings.
After a workshop organised with Helen and with strong input from Dr Caroline Wright, both passionate, open-minded clinical researchers, I fundamentally changed my mind about the utility of ‘incidental findings’ (better described as ‘opportunistic genetic screening’). For the vast majority of known variants we either have poor estimates of penetrance or – at best – estimates driven by ‘cascade screening’ in affected families (i.e., an initial case presents to a clinical geneticist, triggering exploration around the family).
While this is a really important aspect to consider, my passion for more open sharing of knowledge around specific variants remains firmly in place. Caroline, Helen and I remain positive about the growing utility of genome information in clinical research and in targeted diagnostic scenarios, but not for incidental findings until more systematic research is performed (see our ‘Genomes Unzipped’ blog post).

Bridging the gulf

Working with clinicians has given me deeper insights into my own work, and in this particular instance changed my opinion. I hope that these interactions have also been positive for the clinicians, perhaps changing their minds about the utility of bioinformatics and genomics and giving a new perspective on the possibilities and pitfalls of the technology.

More broadly, the coming decade is expected to be characterised by basic researchers delving deeper into other areas of science, in particular applied science: areas of medicine, public health, epidemiology, agricultural and ecological research. This is a fascinating, if daunting, challenge for us all. New people to meet, new terminology and language to navigate, new science and applications to wrap our heads around… These are all good things, and I’m sure we will get used to it. We have to.

Friday, 13 December 2013

Making decisions: metrics and judgement

The conversation around impact factors and the assessment of research outputs, amplified by the recent 'splash' boycott by Randy Schekman, is turning my mind to a different aspect of science - and indeed society - and that is the use of metrics.

We are becoming better and better at producing metrics: more of the things we do are digitised, and by coordinating what we do more carefully we can 'instrument' our lives better. Familiar examples might be monitoring household electricity meters to improve energy consumption, analysing traffic patterns to control traffic flow, or even tracking the movement of people in stores to improve sales. 

At the workplace it's more about how many citations we have, how much grant funding we obtain, how many conferences we participate in, how much disk space we use... even how often we tweet. All these things usually have fairly 'low friction' instrumentation (with notable exceptions). 

This means there is a lot more quantitative data about us as scientists out there than ever before, particularly our 'outputs' and related citations, and mostly with an emphasis on the traditional (often maligned) Impact Factor of journals and increasingly on "altmetrics". This is only going to intensify in the future.

Data driven... to a point

At one level this is great. I'm a big believer in data-driven decisions in science, and logically this should be extended to other arenas. But on another level, metrics can be dangerous.

Four dangers of metrics

  1. Metrics are low-dimensional rankings of high-dimensional spaces;
  2. Metrics are horribly confounded and correlated;
  3. A few metrics are more easily 'gamed' than a broad array of metrics;
  4. There is a shift towards arguments that are supported by available metrics.

The tangle of multidimensional metrics

A metric, by definition, provides a single dimension on which to place people or things (in this case scientists). The big downside is that we know science is judged "good" only after evaluating it on many levels; it can't be judged usefully along any single, linear metric. On a big-picture, strategic level, one has to consider things within the context of different disciplines. Then there is the aspect of 'science community': successful science needs both people who are excellent mentors and community drivers, and the 'lone cats' who tend to keep to themselves. Even at the smallest level, you have to have a diversity of thinking patterns (even within the same discipline, even with the same modus operandi) for science to be really good. It would be a disaster if scientists were too homogeneous. Metrics implicitly assume low dimensionality (in the most extreme case, a single dimension), which by definition cannot capture this multi-dimensional space.
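To make the dimensionality point concrete, here is a toy sketch (all numbers and dimensions invented purely for illustration) of how two very different scientific profiles can collapse onto nearly the same point on a one-dimensional metric:

```python
# Toy illustration with invented numbers: a single metric is a projection
# of a high-dimensional profile, and very different profiles can land
# close together once projected.

profiles = {
    # dimensions: papers and citations, plus qualities no citation metric sees
    "scientist_A": {"papers": 40, "citations": 900, "mentoring": 1, "community": 1},
    "scientist_B": {"papers": 8, "citations": 150, "mentoring": 9, "community": 8},
}

def citations_per_paper(p):
    # A one-dimensional ranking: mentoring and community work simply vanish.
    return p["citations"] / p["papers"]

for name, p in profiles.items():
    print(name, citations_per_paper(p))
# scientist_A 22.5
# scientist_B 18.75
# Two profiles that could hardly be more different end up within a few
# points of each other on the single axis the metric can see.
```

The point is not that citations-per-paper is a bad metric in particular; any single projection of this space discards most of it.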

Clearly, there are going to be a lot of factors blending into metrics, and a lot of those will be unwanted confounders and/or correlation structures that confuse the picture. Some of this is well known: for example, different subfields have very different citation rates; parents who take career breaks to raise children (the majority being women) will often have a different readout of their career through this period. Perhaps less widely considered is that institutions in less well-resourced countries often have poorer access to the 'hidden' channels of meetings and workshops in science.

Some of the correlations are hard to untangle. Currently, many good scientists like to publish in Science, Nature and Cell, and so ... judging people by their Science, Nature and Cell papers is (again, currently) an 'informative proxy'. But this confounding goes way deeper than one or two factors; rather, it is a really crazy series of things: a 'fashion' in a particular discipline, a 'momentum' effect in a particular field, attendance at certain conferences, the tweeting and blogging of papers... 

Because of the complex correlation between these factors, people can use a whole series of implicit or explicit proxies for success to get a reasonable estimation of where someone might be placed in this broad correlation structure. The harder question is: why is this scientist - or this project proposed by this scientist - in this position in the correlation structure? What happens next if we fund this project/scientist/scheme?

Gaming the system

I've observed that developing metrics, even when one is transparent about their use, encourages more narrow thinking and opens up the ability to game systems more. This gaming is done by people, communities and institutions alike, often in quite an unconscious way. So... when journal impact factors become the metric, there is a bias - across the board - to shift towards fitting the science to the journal. When paper citation numbers (or h-indexes) become the measure by which one's work is judged, communities that are generous in their authorship benefit relative to others. When 'excellent individuals' are the commodity by which departments are assessed, complex cross-holdings of individuals between institutions begin to emerge. And so on.

In some sense there is a desire to keep metrics more closed (consider NICE, who have a methodology but are deliberately fuzzy about the precise details, making it hard to game the system). But this is completely at odds with transparency and the notion of providing a level playing field. I think transparency trumps any efficiency here, and so the push has to be towards a broader array of metrics.

Making the judgement call

One unconscious aspect of using metrics is the way it affects the whole judgement process. I've seen committees - and myself sometimes when I catch myself at it - shift towards making arguments based on available metrics, rather than stepping back and saying, "These metrics are one of a number of inputs, including my own judgement of their work". 

One needs to almost read past the numbers - even if they are poor - and ask, "Is the science worth it?" In the worst case, the person or committee making that judgement call will be asked to justify the decision based entirely on metrics, in order to present a sort of watertight argument. But there are real dangers in believing - against all evidence - that metrics are adequate measures. That said, judgement is itself in tension with 'using objective evidence' and 'removing establishment bias' - the very things that metrics help provide. There has to be a balance.

So what is to be done here? I don't believe there is an easy solution. Getting rid of metrics exposes us to the risk of sticking with the people we already know and other equally bad processes. 

I would argue that:

  • We need more, not fewer, metrics, and to have a diversity of metrics presented to us when we make judgements. This might make interpretation seem more complicated, and therefore harder to judge. And that is, in many cases, correct - it is more complicated and it is hard to judge these things.
  • We need good research on metrics and confounders. At the very least this will help expose their strengths and weaknesses; even better, it will potentially make it possible to adjust for (perhaps unexpected) major influencing factors.
  • We should collectively accept that, even with a large number of somewhat un-confounded metrics, there will still be confounders we have not thought about. And even if there were perfect, unconfounded metrics, we would still have to decide which aspects of this high-dimensional space we want to select for; after all, selecting along just one axis of 'science' is, well, not going to be good.
  • We should trust the judgement of committees, in particular when they 're-rank' against metrics. Indeed, if there is a committee whose results can be accurately predicted by its input metrics, what's the point of that grouping?
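As a sketch of the second point - research on metrics and their confounders - the crudest possible adjustment for one known confounder (subfield citation rates differ enormously) is to express a raw citation metric relative to a field baseline. All the numbers below are made up for illustration:

```python
# Sketch with made-up numbers: normalising a citation metric by a subfield
# baseline can flip a naive comparison between fields.

field_baseline = {
    # hypothetical median citations per paper in each subfield
    "genomics": 25.0,
    "mathematical_logic": 3.0,
}

def field_normalised(citations_per_paper, field):
    """Citations per paper expressed as a multiple of the field baseline."""
    return citations_per_paper / field_baseline[field]

# Raw numbers say the genomicist (30 vs 6 citations per paper) is far
# 'better'; the field-adjusted view reverses the ranking.
print(field_normalised(30.0, "genomics"))           # 1.2
print(field_normalised(6.0, "mathematical_logic"))  # 2.0
```

Real adjustment is of course far harder - baselines are themselves confounded - which is exactly why this needs proper research rather than one-line fixes.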


My thinking on this subject has been influenced by two great books. One is Daniel Kahneman's "Thinking, Fast and Slow", which I've blogged about previously. The other is Nate Silver's excellent "The Signal and the Noise". Both are seriously worth reading, for any scientist.

Thursday, 7 November 2013

Heterogeneity in cancer genomics

I've just come back from a great meeting on Cancer Genomics, held at EMBL Heidelberg (full disclosure: I was an organiser, so no surprise I enjoyed the talks!)

The application of genomics to cancer has been progressing for a long time, but we are now in the era where "cheap enough" exome sequencing (and increasingly whole-genome sequencing) is available for both fundamental cancer research and clinical research - and there is really a sense of starting to "mainstream" sequencing into clinical care (clinical care and clinical research seem closer in the cancer field than in some other areas of medicine).

A Cluster of breast cancer cells showing visual evidence of programmed cell death (apoptosis) in yellow. Credit: Annie Cavanagh, Wellcome Images
Before I go into more detail, just a reminder for people not used to thinking about cancer. Cancer is really a large collection of diseases in which a collection of cells in the body grows uncontrollably. For this to happen there are always some genomic changes in the cancer cell, and sometimes quite extensive ones (there are also quite a few knock-on changes in RNA and epigenetics, and possibly some of the initiating events are epigenetic, though it's pretty clear that in the majority of cancers DNA changes are the main culprit). For a cancer cell to become a concern it not only has to start dividing, but it also has to circumvent a considerable amount of both intra-cellular and immune-system monitoring of its growth; then, if it is in a tissue, it has to encourage blood vessels to grow towards it to feed it nutrients. Basically, a lot of changes and features are needed for a single cancer cell to become a tumour.

The advent of cheap(ish) genomics leads to a very simple-sounding experiment: sequence cancer genomes so we have a catalogue of these genomic changes. This is more technically demanding than it at first looks. Firstly, by the word "changes" we mean "changes from the healthy tissue" - and of course each person is unique. So, to know the difference between the cancer genome and normal, one needs to sequence the genome of the individual who has the cancer as well. The second problem is that the human genome is very big (3 billion bases), which means one has to be very accurate in sequencing both the cancer and the normal; even a small final error rate will cause a considerable number of sequencing artefacts. This means both the cancer genome and the healthy genome need to be sequenced at high depth to give that low error rate, and that you have to be really careful about variant calling. The third problem is that for the majority of cancers there is always a mixture of normal and cancer cells in a tumour sample (the normal cells are both the surrounding tissue and things like blood vessels, which the cancer has encouraged to grow in to feed the tumour, and immune system cells trying to attack the cancer). Furthermore, the cancer continues to evolve, with different cancer cells changing their genomes even more (very often the DNA repair and genome stability mechanisms are damaged in cancer), so there isn't even a sense of "one" cancer genome in a tumour. The fourth problem is that the genomic changes are not just simple changes of single bases. There are all sorts of other things, in particular wholesale movements, losses and duplications of chromosomes (something that has been recognised in cancer for some time), as well as far more "focal" medium-scale amplifications or losses.
As well as these changes being challenging to "call" from sequencing data that comes in runs of only 100-200 letters (the changes themselves can be thousands to many millions of letters long), they also play havoc with calling the single-base changes in the context of all this other rearrangement.
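To see why depth matters so much here, a back-of-envelope sketch (all numbers illustrative, not from any particular study) of the expected variant allele fraction for a somatic mutation in an impure tumour sample, and how many supporting reads that implies at a given depth:

```python
# Back-of-envelope sketch (illustrative numbers only): in a tumour sample
# that is only partly cancer cells, a heterozygous somatic mutation shows
# up in a small fraction of reads, which has to be distinguishable from
# the sequencing error rate.

def expected_vaf(purity, copies_mutated=1, total_copies=2):
    """Expected variant allele fraction for a somatic mutation,
    given tumour purity (fraction of cells in the sample that are cancer)."""
    return purity * copies_mutated / total_copies

def expected_reads(depth, vaf):
    """Mean number of reads expected to carry the variant at a given depth."""
    return depth * vaf

purity = 0.4                    # e.g. a biopsy that is 40% cancer cells
vaf = expected_vaf(purity)      # 0.2: roughly 1 read in 5 carries the mutation
print(expected_reads(30, vaf))  # ~6 variant reads at 30x - borderline
print(expected_reads(100, vaf)) # ~20 variant reads at 100x - much safer
# With per-base error rates around 0.1-1%, a handful of 'variant' reads at
# low depth is hard to tell apart from noise across 3 billion positions.
```

This is before accounting for subclones, copy-number changes and mapping artefacts, all of which push the required depth and care in variant calling up further.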

Pretty much as soon as there was cheap sequencing, people started to apply it to cancer genomics, but it has taken time to get on top of all these issues - a considerable amount of the problems are in the informatics and methods - as well as to get good samples with good DNA for sequencing. But now there really is a steady stream of cancer genome projects, each reporting hundreds of cancers of a particular type, from both the American TCGA umbrella project and the "rest of the world" ICGC umbrella project. Cancers are divided mainly by the tissue of origin (e.g. bone cancer, breast cancer, colorectal cancer) and then sometimes subdivided by features that one can see by looking at the cancer under a microscope (so-called histopathology). Each cancer project takes a well-defined cancer type and sequences a number of cancers (in the hundreds at the moment). Currently we are far, far better at analysing changes to protein-coding genes, so an effective approach is to focus on exomes.

Back now to the meeting. For me the major theme of the meeting was heterogeneity. Heterogeneity between cancers: some cancers have a relatively low number of changes (like this astrocytoma study from the DKFZ and EMBL groups, presented by Peter Lichter from DKFZ) and only knock out one or two pathways, while some are just all over the place (here's a lung cancer study from TCGA, part of a tour of the TCGA pan-cancer analysis from Josh Stuart). Heterogeneity between patients: some cancers that look histologically similar have very different genomic alterations. Heterogeneity over time: cancers often come back (recur), and this is often due to a single rare change in the original cancer that was resistant to treatment; Elaine Mardis and Sam Aparicio presented results with the general theme of tracking cancer mutations longitudinally. And then there is the long list of "ways a cancer genome changes", with Jan Korbel presenting the work on chromothripsis (chromosome shattering) in medulloblastoma. Naz Rahman showed how variable germline cancer predisposition is, along with a surprising feature of mosaicism associated with cancer (biology can endlessly surprise one!).

This heterogeneity in cancers is both a positive and a negative for clinical use of genomic sequence. The positive is that the current low response rate to treatment of some cancers may well be a function of not choosing the right drug for the right molecular type of cancer. By "typing" the cancer better, treatment can be better tailored. The negative is that the high heterogeneity between patients means that running well-structured trials is hard. Not only are there the challenges of just turning around the whole cancer sequencing and analysis process in time for the results to be useful (something elegantly presented by Steve Jones from the British Columbia Cancer Agency), but the rapid branching of options means that a simple treat-with-A/treat-with-B randomisation scheme is hard (and the confounding between feasible treatment options for a molecular subtype and the aggressiveness of the cancer makes this really annoying). Andrew Biankin - previously from Brisbane and now in Glasgow - presented impressive work from their Australian pancreatic cancer initiative (some of which is published here). To have a good baseline of effective treatments whose molecular components one understands, one needs a thorough and controlled investigation of genomic lesions vs drug response; Ultan McDermott presented the systematic cancer screening work (some of which is published here) from the Sanger Institute. Finally there was a rather sobering presentation from Ivo Gut (CNAG, Barcelona) on the heterogeneity in sequencing itself and in somatic variant calling - it's clear that as a community we have to tighten up and better understand this process (a clear-cut reason why we must keep the ability to go back to at least the sequence-level data for cancer, and probably stay that way in research for at least another five years).

At some level this heterogeneity is daunting - it is going to take a lot of samples with careful analysis to sort out both what is going on biologically in cancers and then how to leverage that knowledge into improving treatments. That said, this heterogeneity is not something generated by these experiments - this is how it is for cancer, and this is the task we have to collectively take on. As a bioinformatician, there is the rather conceptually mundane aspect of minimising technical variance, and it stresses again the importance of keeping raw sequence data available in archives such as the EGA. In addition, I am concerned that it is going to be very easy to find correlations between all sorts of things in these datasets - between types of mutations and outcomes, or between types of RNA expression and genomic changes or structural variants. What will be far, far harder is deciding whether these changes are causal (the phrase used in this field is "cancer drivers") or whether they are the consequence of a more complex process that confounds the correlation. Peter Campbell from the Sanger Institute gave a great, detailed talk dissecting out one recurrent mutational mechanism in ALL (acute lymphoblastic leukaemia) which, if you didn't know about it, would look like a potential cancer driver.

But the other side of the coin is that this heterogeneity means that even sorting out things in a couple of areas might have a big effect. The "extreme responders" shown by Andrew Biankin - people who got targeted therapies and had remarkable improvements - show some of the potential, in particular for cancers such as pancreatic cancer where the five-year survival rate is a depressing 2%. Even small gains in understanding might have a real impact on this number. And we're early in this game: as the sample numbers go up from hundreds to thousands (and, I am sure, to tens of thousands in the future) we will have more power to sort out some of this heterogeneity in all these areas. The pan-cancer analyses - the first by the TCGA at the exome level, published this year (http://www.nature.com/tcga/), and the future plans by the ICGC for a whole-genome analysis - are the start of this.

I am by nature a glass half full sort of person, and optimistic about the future of cancer genomics - but realistic about the task, and the fact that this will need to draw on all the talents of many oncologists, clinical geneticists, genomicists, bioinformaticians and mechanistic basic biology researchers worldwide.

Monday, 14 October 2013

CERN for molecular biologists

This September I visited CERN again, this time with a rather technical delegation from the EBI to meet with their ‘big physics data’ counterparts. Our generous hosts Ian Bird, Bob Jones and several experimental scientists showed us a great day, and gave us an extended opportunity to understand their data flow in detail. We also got a tour of the anti-matter experiments, which was very cool (though, sadly, it did not include a trip down to the main tunnels of the LHC).

CERN is a marvellous place, and it triggers some latent physics I learnt as an undergraduate. Sometimes the data challenges at CERN are used as a template for the data challenges across all of science in the future; I have come to learn that these analogies – unsurprisingly – are sometimes useful and sometimes not. To understand the data flow behind CERN, though, one needs to understand CERN itself in more detail.

CERN basics

Like EMBL-EBI, CERN is an international treaty organisation that performs research. It is right on the Swiss/French border (you can enter from either side). On the hardware side, CERN has been building, running and maintaining particle accelerators since its founding in 1954. These are powerful collections of magnets, radio-wave producers and other whiz-bang things that can push particles (often protons) to very high energies.
CERN’s main accelerators are circular, which is a good design for proton accelerators. To reach high speed the particles need to be in a vacuum, circulating at close to the speed of light. Because this is done in a circular loop they need to be constantly turning, which means you need some really, really BIG magnets. This means using superconductors and, accordingly, keeping everything extremely cold (as superconductivity only works at very low temperatures). Just building all this requires the application of some serious physics (for example, they actively use the quantum properties of super-cold liquid helium in their engineering), so that other people can explore some profoundly interesting physics.
CERN’s ‘big daddy’ accelerator these days is the Large Hadron Collider (LHC), which produced the very fast protons that led researchers to the Higgs boson. The previous generation accelerator (the SPS) is not only still active but crucial for the smooth running of the LHC. Basically, protons get a running start in the SPS before they start their sprint in the LHC, and its “fast protons” are also used in other experiments around CERN.

Research at CERN

When you visit CERN, a healthy percentage of the people you see don’t actually work for CERN – they are just conducting their research there. At first it seems a bit chaotic, because not everything fits neatly into a formal ‘line management’ organisation. But, like other science consortia, including those in biology, it is held together by the fact that everyone is science focused. The main thing is that it does actually work.
‘Experiments’ at CERN refer to complex, multi-year projects. They are actually run by a series of international consortia – of which CERN is always a member. An ‘experiment’ at CERN can be anything from a small-scale, 5-year project to simulate upper-atmosphere cloud-forming behaviour to the very long-term projects of building and running detectors on the LHC, like the CMS experiment. For biologists, a small-scale CERN experiment maps to a reasonably large molecular biology laboratory, and a large-scale project dwarfs the biggest of big genomics consortia. High-energy physics does not really have the equivalent of an individual experiment as used in laboratory science (or at least I have not come across an analogue).
Experimental consortia operating at CERN usually have charters, and are coordinated by an elected high-energy physics professor usually from another institute (e.g. Imperial College London). In addition to the major experiments on the LHC (Atlas, CMS, LHC-B and Alice), there are three anti-matter experiments and an atmospheric cloud experiment. They (nearly) all have one thing in common: they need a source of high-energy particles that can be supplied by one of the accelerator rings.

Really, really big data

CERN experiments produce a huge amount of data, and to understand this better it’s best to think of the data flow in five phases:
Phase 1. The detector. In most cases one has to convert the paths of particles that have come from high-energy collisions into some sort of data that can be captured. This is an awesome feat of measurement using micro-electronics, physics and engineering. The primary measurements are of the momentum and direction of each particle coming from collisions inside the detector, though each detector is very bespoke in what precisely it measures.
Phase 2. Frame-by-frame selection of data. The decision-making process for this selection has to be near instantaneous, as the data rate from the detector is far too high to capture everything. So there has to be a process to decide which events are interesting. This is a mixture of triggering on certain characteristic patterns (e.g. the Higgs boson will often decay via a path that releases two Z bosons, which themselves release two muons in opposite directions – spotting such paired muons is an ‘interesting event’) and, on top of this, a very thin random sampling of general events. The number of ‘interesting events’ selected is based on two things: what the experiment is looking for, and the data rate that can be ingested in the next phase. Thresholds are set to optimise the number of interesting events collected given that data rate.
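A toy sketch of this selection phase (emphatically not a real trigger system – the event structure, names and rates are all invented for illustration): keep events matching a characteristic pattern, such as an opposite-direction muon pair, plus a very thin random sample of everything else:

```python
import random

# Toy trigger sketch (all structures and rates invented): an event is kept
# if it matches a characteristic pattern, and otherwise only with a tiny
# random-sampling probability, because the raw rate cannot all be stored.

RANDOM_SAMPLE_RATE = 0.001  # keep 0.1% of 'uninteresting' events

def opposite_muon_pair(event):
    """Pattern trigger: two muons whose (signed, 1-D) directions roughly
    cancel out, i.e. heading in opposite directions."""
    muons = [p for p in event["particles"] if p["type"] == "muon"]
    return any(abs(a["direction"] + b["direction"]) < 0.1
               for i, a in enumerate(muons)
               for b in muons[i + 1:])

def keep_event(event, rng=random.random):
    # Keep pattern matches, plus a thin random sample of the rest.
    return opposite_muon_pair(event) or rng() < RANDOM_SAMPLE_RATE

paired = {"particles": [{"type": "muon", "direction": 1.57},
                        {"type": "muon", "direction": -1.57}]}
print(keep_event(paired))  # True: the paired muons fire the trigger
```

The real systems make this decision in custom hardware and software at rates far beyond anything a script could manage, but the shape of the logic – pattern triggers plus thin random sampling – is the point.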
Phase 3. Capture. The resulting filtered/sampled data are streamed to disk (ultimately tape; here’s an interesting piece by the Economist about tape at CERN). These data are then distributed from CERN (i.e. Tier-0) to other sites worldwide (i.e. Tier-1). People then crunch the collection of events to try to understand what was going on in the detector:
Phase 4: Basic data crunching. This first involves standard stuff, like creating particle tracks from the set of detector readouts. A lot of detailed background knowledge is needed to do this effectively – for example, you can’t just look up detector specifications in the blueprints at the level of accuracy needed for these experiments, and at that accuracy the detector will shift subtly from month to month. A whole data collection needs to be calibrated. Intriguingly, they do this by leaving the detector on with no protons going through the ring and no bending magnets on, so that cosmic rays, entering at random directions through the detector, provide calibration paths for the spatial detector components.
Phase 5: High-end data crunching. Once the basic data crunching is sorted, it’s time for the more ‘high-end’ physics. They now look at questions like: what events did this collision produce? For example, the Higgs boson decays into two Z bosons and then into muons, and the momenta of these muons specify the properties of the boson. By having a collection of such events one can start to infer things such as the mass (or energy) of the boson. (By the way, high-energy physics is a fearsome world, but if you want a good introduction, Richard Feynman’s QED is very readable and informative; though it is not about this level of particle physics, it’s a good place to start.)

Parallels with biology...

At the moment, the main large-data-volume detectors in biology are sequencing machines, mass spec devices, image-collection systems at synchrotrons and a vast array of microscopes. In all these cases there are institutions with larger or smaller concentrations of devices. I am pretty sure the raw ‘data rates’ of all these detectors are well in excess of those of the CERN experiments, though of course these are distributed worldwide rather than concentrated at a single location. (It would be interesting to make an estimate of the worldwide raw data rates from these instruments.) The phases listed above have surprisingly close analogies in biology.
Some have drawn parallels between the ‘interesting event’ capture in high-energy physics and things like variant calling in DNA sequence analysis. But I think this is not quite right. The low-level ‘interesting event’ calling is far more analogous to the pixels => spot => read-calling process that happens in a sequencing machine, or to the noise-suppression that happens on some imaging devices. These are very low-level, sometimes quite crude processes that ensure good signal is propagated further without having to export the full data volume, much of which is ‘noise’ or ‘background’.
We don’t usually build our own detectors in biology – these are usually purchased from machine vendors, and the ‘raw data’ in biology is not usually the detector readout. Take, for example, the distinctly different outputs of the first-generation Solexa/Illumina machines. The image processing happened on a separate machine, and you could do your own image/low-level processing (here’s a paper from my colleague Nick Goldman doing precisely that). But most biologists did not go that deep, and the more modern HiSeqs now do the image processing in-line inside the machine.
The next step of the pipeline – the standardised processing followed by the more bespoke processing – seems to match up well to the more academic features of a genomics or proteomics experiment. Interestingly, the business of calibrating a set of events (experiments in molecular biology speak) in a deliberate manner is really quite similar in terms of data flow. At CERN, it was nice to see Laura Clarke (who heads up our DNA pipelines) smile knowingly as the corresponding CMS data manager was describing the versioning problems associated with analysis for the Higgs Boson.

...and some important differences

These are the similarities between large-scale biology projects and CERN experiments: the large-scale data flow, the many stages of analysis, and the need to keep the right metadata around and propagated through to the next stage of analysis. But the differences are considerable. For example, the LHC data volume is about one order of magnitude (×10) larger than that of molecular biology – though our data doubling time (~1 year) is shorter than their basic data doubling time (~2 years).
Another difference is that high-energy-physics data flow is more ‘starburst’ in shape, emanating from a few central sites to progressively broader groups. Molecular biology data has a more ‘uneven bow-tie’ topology: a reasonable number of data-producing sites (in the thousands) feeding a small number of global archive sites (which interchange the data), which then distribute to 100,000s of institutions worldwide. For both inputs (data producers) and outputs (data users), the ‘long tail’ of all sorts of wonderful species, experiments and usage is more important in biology.
The user community for HEP data is smaller than in the life sciences (10,000s of scientists compared to the millions of life-science and clinical researchers), and more technically minded. Most high-energy physicists would prefer a command-line interface to a graphical one. Although high-energy physics is not uniform – the results for each experiment are different – there is a far more limited repertoire of types-of-things one might want to catch. In molecular biology, in particular as you head towards cells, organs and organisms, the incredible heterogeneity of life is simply awe-inspiring. So in addition to data-volume tasks in molecular biology, we also have fundamental, large-scale annotation and metadata-structuring tasks.

What we can learn from CERN

There is a lot more we can learn in biology from high-energy physics than one might expect. Some relates to pragmatic information engineering and some to deeper scientific aspects. There is certainly much to learn from how the LHC handles its data storage (for example, they are still quite bullish about using magnetic-tape archives). We should also look carefully at how they have created portable compute schemes, including robust fail-over for data access (systems attempt to find data locally first, then fall back on global servers).
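The local-first, global-fallback access pattern mentioned above can be sketched in a few lines of Python. The directory layout, mirror URLs and fetch function here are hypothetical placeholders, not any real LHC grid API:

```python
import os

# Hypothetical global mirrors to try when no local copy exists
GLOBAL_MIRRORS = ["https://mirror-a.example.org", "https://mirror-b.example.org"]

def fetch_remote(url):
    """Placeholder for an HTTP/grid transfer; raises IOError on failure."""
    raise IOError("could not reach " + url)

def open_dataset(name, local_root="/data/local"):
    """Open a dataset, preferring a local copy over the global mirrors."""
    # 1. Prefer a local copy if one exists
    local_path = os.path.join(local_root, name)
    if os.path.exists(local_path):
        return open(local_path, "rb")
    # 2. Otherwise fall back on the global mirrors, in order
    errors = []
    for mirror in GLOBAL_MIRRORS:
        try:
            return fetch_remote(mirror + "/" + name)
        except IOError as e:
            errors.append(e)
    raise IOError(name + " unavailable locally or from any mirror")
```

The design point is that compute jobs stay portable: the same code runs at any site, using whatever data happens to be cached locally and reaching out to the wider network only when it must.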
There is a lot of knowledge we can share as well, for example in ontology engineering. The Experimental Factor Ontology’s ability to deal with hundreds of component ontologies without exploding could well be translated to other areas of science, and I think they were quietly impressed with the way we are still able to make good use of archived experimental data from the 1970s and 1980s in analysis and rendering schemes. In molecular biology, I think this ongoing use of data is something to be proud of.
Engaging further with our counterparts in the high-energy-physics fields, at both the information-engineering and analysis levels, is something I am really looking forward to. It will be great to see Ian, Bob and the team at EMBL-EBI next year. CERN is a leader in data-intensive science, but its science does not map one-to-one onto everything else; we will need to adopt, adapt and sometimes create custom solutions for each data-intensive science.