Monday, 19 January 2015

Untangling Big Data

"Big Data" is a trendy, catch-all phrase for handling large datasets in all sorts of domains: finance, advertising, food distribution, physics, astronomy and molecular biology - notably genomics. It means different things to different people, and has inspired any number of conferences, meetings and new companies. Amidst the general hype, some outstanding examples shine forth and today sees an exceptional Big Data analysis paper by a trio of EMBL-EBI research labs - Oliver Stegle, John Marioni and Sarah Teichmann - that shows why all this attention is more than just hype.

The paper is about analysing single-cell transcriptomics. The ability to measure all the RNA levels in a single cell simultaneously - and to do so in many cells at the same time - is one of the most powerful new technologies of this decade. Looking at gene regulation cell by cell brings genomics and transcriptomics closer to the world of cellular imaging. Unsurprisingly, many of the things we've had to treat as homogeneous samples in the past - just because of the limitations of biochemical assays - break apart into different components at the single-cell level. The most obvious examples are tissues, but even quite "homogeneous" samples separate into different constituents.

These data pose analysis challenges, the most immediate of which are technical. Single-cell transcriptomics requires quite aggressive PCR, which can easily be variable (for all sorts of reasons). The Marioni group created a model that both measures and accounts for this technical noise. But in addition to technical noise there are other large sources of variability, first and foremost of which is the cell cycle. 
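To give a flavour of what "accounting for technical noise" can look like in practice, here is a minimal sketch - emphatically not the published model, and all the function and variable names are mine - of fitting a technical-noise baseline from spike-in controls, whose counts should vary between cells for technical reasons only.

```python
# A minimal sketch (not the Marioni group's published model) of estimating
# technical noise from spike-in controls: fit the squared coefficient of
# variation (CV^2) of spike-in counts as a function of their mean, then use
# that fit as a baseline for purely technical variability.
import numpy as np

def fit_technical_noise(spike_in_counts):
    """spike_in_counts: (cells x spike-ins) matrix of normalised counts."""
    mean = spike_in_counts.mean(axis=0)
    cv2 = spike_in_counts.var(axis=0) / mean**2
    # Fit cv2 ~ a1/mean + a0 by ordinary least squares.
    X = np.column_stack([1.0 / mean, np.ones_like(mean)])
    (a1, a0), *_ = np.linalg.lstsq(X, cv2, rcond=None)
    return a1, a0

def expected_technical_cv2(mean_expression, a1, a0):
    """Technical CV^2 expected at a given mean level, under the fitted baseline."""
    return a1 / mean_expression + a0
```

Any gene whose cell-to-cell variability sits well above this baseline is then a candidate for having genuinely biological variation - which is exactly the distinction the rest of the analysis needs.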


Cell cycle redux

For the non-biologists reading this, cells are nearly always dividing, and when they're not they are usually paused in a specific state. Cell division is a complex dance: not only does the genome have to be duplicated, but much of the internal structure must also be split - the nucleus has to disassemble and reassemble each time (that's just for eukaryotic cells, not bacteria). This dance has been pieced together thanks to elegant research conducted over the past 30 years in yeast (two different types), frog cells, human cells and many others. But much remains to be understood. Because cells divide multiple times, the fundamental cycle (the cell cycle) has very tightly defined stages in which specific processes must happen. Much of the cell cycle is controlled by both protein regulation and gene regulation. Indeed, the whole process of the nucleus "dissolving", sister DNA chromosomes being pulled to either side, and the nucleus reassembling has a big impact on RNA levels. 

When you are measuring cells in bulk (i.e. 10,000 or more at the same time), the results will be weighted by the different 'lengths of stay' in different stages of the cell cycle. (You can sometimes synchronise the cell cycle, which is useful for research into the cell cycle, but it's hard to do routinely on any sample of interest). Now that we have single-cell measurements, which presumably tell us something about cell-by-cell variation, we also have an elephant in the room: namely, massive variation due to the cells being at different stages of the cell cycle. Bugger.
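As a toy illustration of that weighting - with made-up numbers, purely to show the arithmetic - a bulk measurement behaves like a dwell-time-weighted average over the stages, not like any single stage:

```python
# A toy illustration (hypothetical stage durations and expression levels) of
# why bulk measurements are weighted by how long cells spend in each
# cell-cycle stage: the measured level is a dwell-time-weighted average.
import numpy as np

stages = ["G1", "S", "G2/M"]
dwell_hours = np.array([11.0, 8.0, 5.0])    # assumed time spent in each stage
expression = np.array([10.0, 40.0, 25.0])   # assumed per-stage level of one gene

weights = dwell_hours / dwell_hours.sum()
bulk_estimate = (weights * expression).sum()
print(f"Bulk-style estimate: {bulk_estimate:.1f}")  # ~23.1, dominated by the long stages
```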

One approach is to focus on cell populations that have paused (presumably in a consistent manner) in the cell cycle, like dendritic cells. But this is limiting, and many of the more interesting processes happen during cell proliferation; for example, Sarah Teichmann's favourite process of T-cell differentiation nearly always occurs in the context of proliferating cells. If we want to see things clearly, we need to somehow factor out the cell-cycle variation so we can look at other features.


Latent variables to the rescue

Taking a step back, our task is to untangle many different sources of variation - technical noise, the cell cycle and other factors - understand them, and set them to one side. Once we do that, the interesting biology will begin to come out. This is generally how Oliver Stegle approaches most problems, in particular using Bayesian techniques to coax unknown, often complex factors (also called 'latent variables') out of the data. For these techniques to work you need a lot of data (i.e. Big Data) to allow for variance decomposition, which shows how much each factor contributes to the overall variation. 
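To make "variance decomposition" concrete, here is the simplest possible analogue - plain least squares rather than the Bayesian latent-variable machinery the paper actually uses - asking, gene by gene, how much variance a single known or inferred factor explains, with the remainder left as "other variation".

```python
# A deliberately simple analogue of variance decomposition (ordinary
# regression, not the paper's Bayesian model): per gene, what fraction of the
# variance does one factor explain?
import numpy as np

def variance_explained(expression, factor):
    """expression: (cells x genes) matrix; factor: (cells,) e.g. a cell-cycle score."""
    X = np.column_stack([np.ones_like(factor), factor])
    beta, *_ = np.linalg.lstsq(X, expression, rcond=None)
    fitted = X @ beta
    residual = expression - fitted
    total_var = expression.var(axis=0)
    return 1.0 - residual.var(axis=0) / total_var   # per-gene R^2 for this factor
```

The real method does this jointly for several components (and does it properly, with uncertainty), but the question being asked of each gene is essentially the one above.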

But even the best algorithm needs good targeting. Rather than trying to learn everything at once, Oli, John and Sarah set up the method to learn the identity of cell-cycling genes from a synchronised dataset - capturing both well-established cell-cycle genes and some less well-characterised ones. They then brought that gene list into the context of single-cell experiments to learn the behaviour of these genes in a particular cell population, paying careful attention to technical noise. Et voilà: one can split the variation between cells into 'cell-cycle components' (in effect, assigning each cell to its cell-cycle stage), 'technical noise' and 'other variation'. 
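In spirit - and only in spirit, since this is a rough sketch with my own simplifications rather than the published implementation - the workflow looks something like this: take the learned cell-cycle gene list, summarise the cell-cycle signal per cell, and then remove that signal from every gene.

```python
# A rough sketch of the overall workflow (not the published implementation):
# (1) start from a list of cell-cycle genes learned from synchronised data,
# (2) summarise the cell-cycle signal per cell, here crudely as the first
#     principal component across those genes, and
# (3) regress that signal out of every gene, leaving the "other variation".
import numpy as np

def remove_cell_cycle_signal(expr, cc_gene_idx):
    """expr: (cells x genes) log-expression; cc_gene_idx: indices of cell-cycle genes."""
    cc = expr[:, cc_gene_idx]
    cc = cc - cc.mean(axis=0)
    # First principal component as a per-cell cell-cycle score.
    _, _, vt = np.linalg.svd(cc, full_matrices=False)
    score = cc @ vt[0]
    # Regress the score out of each gene (the paper does this far more carefully,
    # jointly with the technical-noise model).
    X = np.column_stack([np.ones_like(score), score])
    beta, *_ = np.linalg.lstsq(X, expr, rcond=None)
    return expr - np.outer(score, beta[1])   # keep gene means, remove the cc component
```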

This really changes the result. Before applying the method, the cells looked like one large, variable population. After factoring out the cell cycle, two subpopulations emerged that had been hidden by the overlay of variable cell-cycle position, cell by cell, and those two subpopulations correlated with aspects of T-cell biology. Taking it from there, they could start to model other aspects, such as T-cell differentiation, as specific latent variables.
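Continuing the sketch above, once the cell-cycle component has been removed you can go looking for the remaining structure directly - here with an off-the-shelf clustering, purely for illustration; the paper's analysis is considerably more sophisticated.

```python
# Illustrative only: cluster the corrected expression matrix to look for
# subpopulations that the cell-cycle variation had been masking.
from sklearn.cluster import KMeans

def find_subpopulations(corrected_expr, n_clusters=2):
    """corrected_expr: (cells x genes) after removing the cell-cycle component."""
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(corrected_expr)
```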


You say confounder, I say signal

We are going to see variations on this method again and again (in my research group, we are heavy users of Oliver's latent-variable work). This variance decomposition is about splitting the different components apart and seeing each one more clearly. If you are interested in the cell cycle, in the cell-cycle decomposition itself, or in how particular factors change between cell populations, it will be incredibly useful. If you are interested in differentiation, you can now "factor out" the cell cycle. Conversely, you might only be interested in the cell cycle and prefer to drop the other biological sources of variation. Even the technical variation is interesting if you are trying to optimise the PCR or machine conditions. "Noise" is a pejorative term here - it is all variation, just with different sources and explanations. 

These techniques are not just about the cell cycle or single-cell genomics. Taken together, they represent a general mindset of isolating, understanding and ultimately modelling sources of variation in all datasets, whether they are cells, tissues, organs, whole organisms or populations. It is perhaps counter-intuitive that if you have enough samples with enough homogeneous dimensions (e.g. gene expression, metabolites, or other features), you can cope with otherwise quite variable data by splitting out the different components. 

This will be a mainstay for biological studies over this century. In many ways, we are just walking down the same road that the founders of statistics (Fisher, Pearson and others) laid down a century ago in their discussions on variance. But we are carrying on with far, far more data points and previously unimaginable abilities to compute. Big Data is allowing us to really get a grip on these complex datasets, using statistical tools, and thus to see the processes of life more clearly. 

Tuesday, 13 January 2015

Moving 20 Petabytes


EMBL-EBI's data resources are built on a constantly running compute and storage infrastructure. Over the past decade that infrastructure has grown exponentially, keeping pace with the rapid growth of molecular data and the corresponding need for computation. Terabytes of data flow every day on and off our storage systems, making up the hidden life-blood of data and knowledge that permeates much of modern molecular biology.

There is a somewhat bewildering complexity to all of this. We have 57 key resources: everything from low-level, raw DNA storage (ENA) through genome analysis (Ensembl and Ensembl Genomes) and complex knowledge systems (UniProt) to 3D protein structures (PDBe). Over half a million users visit at least one of the EMBL-EBI websites each month, making 12 million web hits and downloading 35 Terabytes each day. Each resource has its own release cycle, with different international collaborations (e.g. INSDC, wwPDB, ProteomeXchange) handling the worldwide data flow. 

To achieve consistent delivery, we have a complex arrangement of compute hardware distributed around different machine rooms, some at Hinxton and some off site. Around two years ago we started the process of rebidding our machine-room space, and last year Gyron, a commercial machine-room provider in the southeast of England, won the next five-year contract. This was good news for efficiency (Gyron provided a similar level of service at a sharper price) but posed an immediate problem for EMBL-EBI's systems, networking and service teams: to wit, how were we going to move our infrastructure without disruption?


Moving a mountain

The carefully laid plans were put into operation in October 2014 and the move was completed in December 2014. Over that time, the EMBL-EBI systems team moved 9,500 CPU cores and 22 petabytes of disk, and reconnected 3,400 network and fibre cables along with 850 power cables. Effectively, they moved half our storage infrastructure with no unscheduled downtime for any resource. In fact, most resources ran as usual throughout the entire operation. That the vast majority of users were totally unaware of the move is a huge tribute to the team, who had to work closely with each of the 57 resource groups to deliver constant service.

Much of this was due to good planning some five years ago, when EMBL-EBI originally grew out of its Hinxton-based machine rooms and started leasing machine-room space in London. Two key decisions were made. The first was that every service would run from two identical, isolated systems, such that one system could be incapacitated and the usage would switch to the other. The second decision was that only the technical groups (i.e. systems and web production) would be allowed direct access to the machines running the front-end services. 

All the testing and development happened in a separate, cloned system (running in Hinxton), and deployment was carried out via a series of automated procedures. These procedures were designed to accommodate the different standard operations of each resource, and to handle complications around, for example, user-uploaded data. After a couple of (rather painful) years of fixing and fine-tuning, our front-end services were logically and formally separated from testing and development. All of this is done in a highly virtualised environment (I don't think anyone at EMBL-EBI logs into or uses a "real" machine anymore), allowing yet more resilience and flexibility in the system.

This preparation made the conceptual process of moving relatively easy. One half of the system was brought down, and the active traffic was diverted to the other half. Then the machines were moved, reconnected in their new location, tested and brought back up. Once the new system checked out, the services were started up in the new location, and the second system went through the same process. 
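Conceptually - and this is only a conceptual outline in pseudo-form, not the team's actual tooling - the pattern is the classic "move one half while the other serves":

```python
# A conceptual outline (in no way EMBL-EBI's actual tooling) of the
# two-halves migration pattern: drain one half, move it, verify it, bring it
# back, then repeat for the other half, so one half is always serving.
def migrate_in_halves(halves, divert_traffic_to, physically_move, health_check):
    for idx, half in enumerate(halves):
        other = halves[1 - idx]
        divert_traffic_to(other)       # all live traffic now runs on the other half
        physically_move(half)          # power down, transport, reconnect, power up
        assert health_check(half), f"{half} failed post-move checks"
        divert_traffic_to(half)        # bring the moved half back into service
    # In practice you would finish by re-balancing load across both halves.
```

The hard part, of course, is not the loop - it is the years of duplication, automation and separation of concerns that make each of those steps safe to run against live services.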

Some of our largest resources (e.g. ENA storage and 1000 Genomes storage) went through a "read-only" period to allow our technicians to transfer half of the disk component safely. For our rare single-point services, in particular the European Genome-phenome Archive (EGA), which operates with a far, far higher level of security, we scheduled one week of downtime. Our high-bandwidth, redundant link with JANET had to be up and running, and our internal network across the three machine rooms (Hinxton, backup and Gyron) had to be configured correctly.


The dreaded downtime

I have to admit, I was concerned about this move. It is all too easy to uncover hidden dependencies in the way machines are configured with respect to each other, or to find some unexpected flaw deep in the workings of a network or subnetwork. Even though everything seemed fine in theory, I was dreading one or two days - perhaps even a week - of serious access problems. That kind of downtime might be alright once every five years, but webpages being returned in a timely manner is the first, most basic test of a robust informatics infrastructure. Any time it fails, users lose a bit of confidence, and that is the last thing we want. 


Thanks, guys!

I am truly impressed at what the technical cluster at EMBL-EBI has achieved in this move. These four separate but closely interlinked teams keep the technical infrastructure working: Systems Infrastructure, Systems Applications, Web Production and Web Development. They are headed up by Steven Newhouse, who came into the job just six months before this all started. Many of the people who made the move possible worked with incredible dedication throughout: Pettri Jokenien, Rodrigo Lopez, Bren Vaughan, Andy Cafferkey, Jonathan Barker, Manuela Menchi, Conor McMenamin and Mary Barlow. All of them understood implicitly what needed to be done to make the system robust, and pulled out all the stops to make it happen. 

People only really notice infrastructure when it goes wrong. When you switch on a light, do you marvel at the complexity of a system that constantly produces and ships a defined voltage and amperage to your house, to provide light the instant you wish it? When you use public transport, do you praise the metro and train network for delivering you on (or nearly on) time? Probably not. Similarly, when you click on a gene in Ensembl, look up a protein's function in UniProt, or run a GO enrichment analysis, you probably don't think about the scientific and technical complexity of delivering those results accurately and efficiently. And that's just how it should be.

So - many thanks to the EMBL-EBI technical cluster, who finished the job just before Christmas 2014. I hope you all enjoyed a well-deserved break.

(I've just about uncrossed my fingers and toes now....)