Tuesday, 11 December 2012

EBI as a data refinery

In describing what the EBI does, it is sometimes hard to provide a feel for the complexity and detail of our processes. Recently I have been using an analogy of EBI as a “data refinery”: it takes in raw data streams ("feedstocks"), combines and refines them, and transforms them into multi-use outputs ("products"). I see it as a pristine, state-of-the-art refinery with large, imposing central tanks (perhaps with steam venting here and there for effect) from which massive pipes emerge, covered in reflective cladding and connected in refreshingly sensible ways. In the foreground are arrayed a series of smaller tanks and systems, interconnected in more complex ways. Surrounding the whole are workshops, and if you look closely enough you can see a group of workers decommissioning one part of the system whilst another builds a new one.

[Image: Oil refinery on the north end of March Point; Mount Baker]
I find this analogy useful for a number of reasons. First, a "product" is often itself a feedstock, which is why the EBI has so many complex cycles of information. For example, InterPro member database models and patterns are feedstocks for the InterPro entries; during refinement they become associated with one another, with documentation and with Gene Ontology (GO) assignments. InterPro takes in UniProt (UniParc) protein sequences and combines them with models to provide boundaries on proteins; these in turn allow the ‘InterPro2GO’ GO assignment process to occur. This automatic GO annotation is then applied to the UniProtKB entries along with experimentally defined GO annotations, which come from GO curators worldwide and include many entries about model organisms. The InterPro entries additionally provide raw information (feedstock) for the UniRule automatic annotation, where InterPro matches are the mainstay condition of a particular rule. The UniProt curator combines these matches with other conditions, such as taxonomic restrictions and sequence properties, to ensure the most accurate application of the annotation extracted from the experimentally proven UniProtKB entries to the proteins of unknown function.

This is a complex network of inputs and outputs. Just writing it down and trying to keep it all straight is exhausting unless you are part of it (I went through a couple of rounds with Claire O’Donovan and Sarah Hunter to get the above flow absolutely straight). But the main input – bare protein sequences (coming from internal feedstocks including ENA and Ensembl) – is being converted into the main output: annotated protein entries, with human-readable annotation and a careful audit trail of its 'refinement'. This is what the user sees as the output of the refinery, and the user understandably does not want to spend too much time worrying about the details of pipe connectivity inside the refinery.

Another reason I find the refinery analogy useful is that volume can be deceptive. The biggest, most impressive tanks in this refinery are filled with DNA sequence data, but for the refinery to work as a whole it needs many "specialist" chemicals, in lower volumes, to serve as critical catalytic components. It might be necessary for the refinery to make and store some components in order to streamline a more complex flow of information. The EBI works with key "catalyst" streams of information that have a disproportionate impact relative to their volume (e.g. the assignment of experimentally defined annotation described above).

A deceptive view of this refinery would focus exclusively on the final outputs and the most recent refinement process, without taking in the intricate web of components behind them. People might use Reactome or IntAct to understand a particular functional dataset, but the protein information in these resources depends on UniProt to track and annotate these sequences. The protein information in UniProtKB is dependent on the ENA database smoothly accepting submissions with annotated CDS proteins present. In this way, asking to visualise, say, phosphoprotein results on a pathway diagram is not as simple as it might seem. It implicitly draws on many of the tanks in the EBI refinery. This larger network actually goes beyond the EBI's borders to its worldwide collaborators (e.g. wwPDB, or the INSDC’s GenBank/ENA/DDBJ).
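For the computationally minded, the dependencies described so far can be sketched as a small graph. This is only an illustration: the edge list is a simplified reading of this post, not an official map of EBI data flows, and the names `DEPENDS_ON` and `upstream` are made up for the example.

```python
# Illustrative only: a simplified reading of the dependencies mentioned in
# this post, not an authoritative description of EBI pipelines.
DEPENDS_ON = {
    "Reactome": ["UniProtKB"],
    "IntAct": ["UniProtKB"],
    "UniProtKB": ["ENA", "Ensembl", "InterPro", "GO curation"],
    "InterPro": ["UniParc", "InterPro member databases"],
}


def upstream(resource, graph=DEPENDS_ON):
    """Return every resource that `resource` draws on, directly or indirectly."""
    seen = set()
    stack = [resource]
    while stack:
        current = stack.pop()
        for dep in graph.get(current, []):
            if dep not in seen:  # the seen-set also guards against cycles
                seen.add(dep)
                stack.append(dep)
    return seen


if __name__ == "__main__":
    # A pathway view in Reactome implicitly touches all of these "tanks".
    for name in sorted(upstream("Reactome")):
        print(name)
```

Even in this toy version, asking about one pathway resource pulls in every other tank listed, which is the point of the paragraph above.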

The final "product" that the user sees often has a local manufacturer (i.e., bioinformatician/computational biologist) who pulls in information from the large tanks and combines it with local data to provide an overall picture and give context. Often, the research group querying EBI data does not worry too much about the details of how the refinery works, or about the complex inter-dependencies of the refinery; they just want easy access to a product they can rely on. It is the job of the EBI, and in future will be the job of ELIXIR, to satisfy this desire.

A refinery does not stay still. In each process, engineers (in our case bioinformaticians and software engineers) work to improve minor, everyday things and to carry out major retooling. New types of experimental information might require a new tank and pipelines, or become cheap enough to replace older feedstocks, in both cases opening up potential for new, useful products. New discoveries might change the way processes or transformations are handled, perhaps by adding a certain catalyst at a particular stage to improve the products.

Clearly the EBI is not the only refinery. Our European partners, such as SIB and Sanger, collaborate so closely with us on key projects that it’s hard to work out where one refinery stops and the other begins. We exchange data and expertise regularly with large refineries in the US and Japan, such as NCBI, UCSC, NIG and RCSB. It is exciting to see all of the proto-refineries in Europe, which offer different core competencies and are coalescing into a single, robust refinery: ELIXIR.

Like all analogies, this is not perfect. The concept of free data sharing, which is at the heart of molecular biology, does not fit well with this analogy. Although the complex process of providing the necessary CPU, disk and network has some resonance with the internal “plant” infrastructure, the fact that it is so generic and tradable does not. The EBI's products are also directly used via the web, often without much intermediation (no need for a network of gas stations, etc.). Nevertheless, the picture of a complex interplay of inputs being progressively refined is helpful when trying to disentangle some of our trickier problems.

I welcome feedback on this analogy, and on the extent to which it helps one understand the EBI.

Thursday, 6 December 2012

West meets East

I've just come back from around 10 days in China, visiting Nanjing, Shanghai and Hong Kong, and have a whole new perspective on this part of the world. I was not able to work Beijing into my trip this time, which was frustrating because I know there is a lot of good science happening there.

What was really different about this trip was that I came away feeling much more of a connection to China. It was great to meet new people and to renew more longstanding scientific contacts – but I also had more time (and, perhaps more importantly, more confidence) to travel between cities, have breakfast in local cafes rather than hotels, and generally get to know each place a little better. Previous trips (this was my fourth) required such a packed schedule that jetlag and the whole novelty of China completely dominated my experience.

Now that I’m sitting down to write about the experience, the first thing I’m inclined to do is draw some analogies with western countries. But analogies only go so far – even when they fit relatively well, they break down in the face of China’s distinct character. I do feel more knowledgeable than I did after previous visits to China, but I fully expect that future visits will reveal further dimensions and facets to this immense and complex country.

On some level, China reminds me of the US: it’s a huge country with vast distances to travel between locations, and has a tremendously strong sense of a single nation. Everyone I met considered themselves "Chinese", and there is a strong sense of a binding history and cultural underpinning. Also, similar to the US, China (and Chinese...) is aware of its size and economic power, and is conscious of having a strong voice on the world stage. Hong Kong, Shanghai and Beijing are cosmopolitan cities, with a sometimes exuberant celebration of the past 20 years of economic growth. I won’t stray into geopolitics – it’s not my field of expertise at all – but a country of this size with sophisticated metropolitan areas will almost certainly make a big impact on science over the next couple of decades.

China shares some features with Europe – notably a diversity of language and culture across many provinces. Chinese provinces are often larger than European countries, and often have similar overall GDP. The many Chinese "dialects" are better described as different spoken languages, but importantly they share a set of written characters (with some modifications). The implications of having a universally comprehensible written language for such a range of linguistic groups are profound.

My initial impression was that China had two major languages – Cantonese (used around Guangdong and Hong Kong) and Mandarin – with various dialects, but this trip really impressed upon me just how diverse the linguistic landscape of Mainland China is. For example, Shanghainese is a dialect of Wu, which is a language family predominant in the eastern central area. When I was out for dinner in Shanghai with a Mandarin speaker, the waiter spoke to us in this lilting tone (Shanghainese, as it turned out) and I turned to my companion for translation; she smiled, shrugged her shoulders and shifted the conversation to Mandarin. It was like dining with an Italian colleague in Finland and thinking she would know Finnish.

I’m much more aware now of the distinctive character and cultures of China’s provinces, which, along with the importance of personal networks, resonates with Europe.

While it’s fun to draw familiar parallels, China is clearly nothing like a mixture of the US and Europe. It is hard enough to completely understand the historical perspectives and cultures of one’s neighbours – it is going to be a long time before I will completely grasp the fundamental complexities of China. What I can say now is that its diversity is more and more fascinating to me, and something to be celebrated.

I wrote some time ago about scientific collaboration with China (see East meets West), focusing on the positive aspects of openness and collaboration in engaging with this and other emerging economies (e.g. Brazil, India, Russia and Vietnam). As scientists, we have the good fortune of being expected to share scientific advances, discuss collaborations and discover new things jointly, because these are the right things to do – socially and strategically.

China already has some leading scientists and excellent scientific institutions, and I am sure this will only grow in the future. But communication is an essential component of community, and social media has been highly beneficial in keeping information flowing in much of the global scientific community. It’s frustrating that news platforms like Twitter are blocked in China. The EBI has set up a Weibo account (www.weibo.com/emblebi) where we will be posting (in English!) news items from the EBI. Hopefully this will help keep scientists in China up to date with developments at the EBI – so please do distribute to your Chinese colleagues.

On a more personal note, I've discovered that my first name (Ewan) is pronounced (in some dialects) almost identically to Yuan (a Chinese word for money). In Wikipedia, one of the pronunciation descriptions of Yuan is written identically to one of Ewan (what more proof do you need!) but I am not clear (a) if this is a variation in pronouncing Yuan in Mandarin or a dialect shift and (b) what tonal form it has. I'd be delighted to get some sort of linguistic survey of Yuan forms geo-tagged across China. People who have read my name sometimes get confused because they have a pre-formed idea of how to pronounce it (often "Evan" or "Ee-Wan" – one to save for my next Star Wars role). So it’s useful to know that I can say, "Ewan, like Money, Yuan," and this will provide some relief to my new acquaintance, who can file the name alongside a well-known phrase. (And before you say it, I know that I am just as bad when it comes to pronouncing some names – Chinese or not – in other languages!)

So – I'm "Money" Birney. I can't quite work out whether I should be proud or a bit worried about this moniker.

Many, many thanks to my hosts and the new people I met on this journey: Ying, Philipp, Jing, Jun, Huaming, Hong, Laurie, Scott and many others. I look forward to seeing you again, and learning more on my next trips to China.