Tuesday, 17 May 2016

Publishing Big Data Science

This is the third and final post in a series in which I share some lessons learned about how to plan, manage, analyse and deliver a ‘big biodata’ project successfully.

Now that you have the results of your carefully planned, meticulously managed and diligently analysed experiment, it’s time to decide on what to publish, and where.

1. Present your work

I love presenting, because having to explain my work to a mixed audience helps me understand and articulate the science better, and to convey the excitement of discovery. What is the work for, if not the joy of exploration? Creating figures to use in a presentation is enjoyable, and helps me get my thoughts in order.

I find writing papers less enjoyable than presenting, but the same core is present in both – good figures which provide a strong narrative from design through to analysis. There is, however, a particular rigour in writing a paper that brings out the best in a piece of scientific work. Present, and publish – it’s important to us all.

2. Organise your material

Most of these papers comprise both a main paper and a supplement. The main paper will feature the figures that tell the story: experimental design, discovery, main findings, interesting cases. It should be written for the interested reader who will mainly trust you on the experimental and analysis details. The supplement is for the reader (including a reviewer or two) who does not trust you. Sometimes, on other people's papers, you will be that reader. The supplement should have the same flow, but have all the supporting details that tell that reader the data and analysis are kosher.

3. Figures first

Make good figures that illustrate your point, and test them out in presentations, first to the group, then to colleagues in your institute, and then more widely. You’ll fine-tune the figures as you go. Your presentation will need quite a bit of scaffolding (why the question is interesting, your experimental design, key statistics), but don’t be afraid to show sample data from your results to convey your motivation. Consider showing a boring and an interesting case side by side. You may find this scaffolding can be condensed into your Figure 1 for the paper. You can show other figures in the supplement if they support your work.

4. Put pen to paper

Once your figures flow, you can write the results. You can also start working on the supplement, following the same general flow. All the ‘data is good’ plots will go in the supplement, as it can have extra “lemmas” about the data. Don’t skimp in the supplement – include technical details supporting things like why your normalisation is sensible, or better than other approaches. If the supplement gets big, provide an index in the supplement for navigation. The sceptical reader will like to see this.

5. Focus on the results

Write the introduction and discussion after you are happy with your results write-up. Think about the readers and the reviewers, and make sure to cite widely. If you are coming into a new arena with this high-throughput approach, lavish praise on the importance of the field and the massive amount of individual-locus work on which you are building. Basically, if you are publishing a large-scale approach in an area that hasn’t had one, avoid being seen as an interloper; read the papers, cite them – and you are likely to find a couple of new angles on your work through this process.

6. Length angst

If you are aiming for a journal with strict length limits (and I do wonder why we tolerate these in this day and age), don’t let that hold you back at the submission phase. Write as much as you need to, and acknowledge the length in your cover letter. Emphasise that you want the reviewers to have a full understanding of the science. Reviewing a paper at the density these space-restricted journals demand is often really hard – the text can always be cut down after review.

7. Be open

It is pretty standard that you will eventually be publishing open access (certainly if you are funded by the NIH, the Wellcome Trust or similar funders). It is easier to do this via journals which automatically handle the open access submission (PLOS, Genome Biology, the BMC series and many others, sometimes with open access fees). Due to the funder mandates, pretty much every journal will at least allow submission of your author manuscript to PubMed Central, but doing it yourself is quite annoying.

There are new experiments in open publishing to look at as well. Two examples are F1000 and bioRxiv. In F1000 the whole process of submission, peer review and publication is done in the open – it is interesting to watch open peer review in action. bioRxiv follows the model of the physics pre-print servers, and many journals allow pre-print posting whilst a paper is under review. This is a cool way to avoid being scooped, and provides a way to get community input (“informal peer review”). I think we're in an experimentation phase of this next stage in open science, and it's going to be interesting to see where we end up.

8. Tidy up and submit your data

Make sure you have all the raw data to submit, with the meta-data nicely tidied up (ideally, your LIMS system will have this ready to go by default). Submit your structured data (DNA, proteomics, metabolomics, X-ray structures, EM) to the appropriate archive (EMBL-EBI has the full range). Keep a directory in house, but also put all the intermediate datasets and files on the web. This is good for transparency – the sceptical reader will be even more reassured knowing that they can (if they want) not only get the raw data (a given for molecular biology) but also come into the analysis half-way through. About half of these readers could be future members of your own group, whom you may ask to "follow the analysis in paper A", or to confirm that "XXX did this in paper B". Do this for your own group's sanity and for extra brownie points from readers around the world.

Tuesday, 10 May 2016

Managing and Analysing Big Data - Part II

This is the second of three blog posts about planning, managing and delivering a ‘big biodata’ project. Here, I share some of my experience and lessons learned in management and analysis – because you can’t have one without the other.


1. Monitor progress – actively!

You need a good structure to monitor progress, particularly if you’re going to have more than 10 samples or experiments. If this is a one-off, use a program that’s well supported at your institute, like FileMakerPro or... Excel (more on this below). If you’re going to do this a lot, think about investing in a LIMS system, as this is better suited to handling things at a high level of detail routinely. Whatever you use, make sure your structure invites active updating – you’ll need to stay on top of things and you don’t want to struggle to find what you need.

2. Excel (with apologies to the bioinformaticians)

Most bioinformaticians would prefer the experimental group not to use Excel for tracking, for very good reasons: Excel provides too much freedom, has extremely annoying (sometimes dangerous) "autocorrect" schemes, fails in weird ways and is often hard to integrate into other data flows. However, it is a pragmatic choice for data entry and project management due to its ubiquity and familiar interface.

Experimental group: before you set up the project tracking in Excel, discuss it with your computational colleagues, perhaps offering a bribe to soften the blow. It will help if the Excel form template comes out of discussions with both groups, and bioinformaticians can set up drop-downs with fixed text where possible and use Excel’s data entry restrictions to (kind of) bullet proof it.

One important thing with Excel: NEVER use formatting or colour as the primary store of meaning. It is extremely hard to extract this information from Excel into other schemes. Also, two things might look the same visually (say, subtly different shades of red) but be computationally as different as red and blue. When presentation matters (often to show progress against targets), you or your colleagues can (pretty easily) knock up a pivot table/Excel formula/Visual Basic solution to turn basic information (one type in each column) into a visually appealing set of summaries.
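The same one-type-per-column discipline pays off outside Excel too. As a minimal sketch (the column names and values here are made up, assuming the tracking sheet is exported with one type of information per column), a few lines of pandas reproduce an Excel pivot-table summary:

```python
import pandas as pd

# Hypothetical flat tracking table: one type of information per column,
# no meaning stored in colour or formatting.
tracking = pd.DataFrame({
    "batch":  ["B1", "B1", "B2", "B2", "B2"],
    "status": ["done", "done", "done", "failed", "pending"],
})

# Progress-against-targets summary, equivalent to an Excel pivot table:
# one row per batch, one column per status, counts in the cells.
summary = tracking.pivot_table(index="batch", columns="status",
                               aggfunc="size", fill_value=0)
print(summary)
```

Because the meaning lives in plain columns rather than formatting, the same table feeds both the pretty summary and any downstream analysis.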

3. Remember planning?

When you planned the project (you did plan, right?), you decided on which key confounders and metadata to track. So here’s where you set things up to track them, and anything else that’s easy and potentially useful. What’s potentially useful? It’s hard to say. Even if things look trivial, they (a) might not be and (b) could be related to something complex that you can’t track. You will thank yourself later, when it comes time to regress things out, for having tracked them.

4. Protect your key datasets

Have a data ‘write only’ area for storing the key datasets as they come out of your sequencing core/proteomics core/microscopes. There are going to be sample swaps (have you detected them? They will certainly be there in any experimental scheme with more than 20 samples), so don’t edit the received files directly! Make sure you have a mapping file, kept elsewhere, showing the relationships between the original data and the new fixed terms.
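A minimal sketch of the mapping-file idea (sample IDs and layout are hypothetical): corrections live in the mapping, which is applied on the fly, and the raw files are never edited.

```python
# Raw files in the write-only area stay untouched; every sample-swap
# correction is recorded in a mapping kept elsewhere. IDs are made up.
swap_mapping = {
    "S002": "S003",   # these two samples were swapped at the bench
    "S003": "S002",
}

def corrected_label(sample_id, mapping):
    """Return the corrected sample label, defaulting to the original."""
    return mapping.get(sample_id, sample_id)

received = ["S001", "S002", "S003", "S004"]
print([corrected_label(s, swap_mapping) for s in received])
```

In practice the mapping would be a versioned file (e.g. a CSV) rather than a dict in code, but the principle is the same: the raw data plus the mapping always reproduces the corrected view.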

5. Be meticulous about workflow

Keep track of all the individual steps and processes in your analysis. At any point, it should be possible to trace individual steps back to the original starting data and repeat the entire analysis from start to finish.

My approach is to make a new directory with soft-links for each ‘analysis prototyping’, then lock down components for a final run. Others make heavy use of IPython notebooks – you might well have your own tried-and-tested approach. Just make sure it’s tight.

6. “Measure twice, cut once”

If you are really, really careful in the beginning, the computational team will thank you, and may even forgive you for using Excel. Try to get a test dataset out and put it all the way through as soon as possible. This will give you time to understand the major confounders to the data, and to tidy things up before the full analysis.

You may be tempted to churn out a partial (but in a more limited sense ‘complete’) dataset early, perhaps even for a part-way publication. After some experience playing this game, my lesson learned is to go for full randomisation every time, and not to have a partial, early dataset that breaks the randomisation of the samples against time or key reagents. The alternative is to commit to a separate, early pilot experiment, which explicitly will not be combined with the main analysis. It is fine for this preliminary dataset to be mostly about understanding confounders and looking at normalisation procedures.

7. Communicate

It is all too easy to gloss over the social aspect of this kind of project, but believe me, it is absolutely essential to get this right. Schedule several in-person meetings with ‘people time’ baked in (shared dinner, etc.) so people can get to know each other. Have regular phone calls involving the whole team, so people have a good understanding of where things stand at any given time. Keep a Slack channel or email list open for all those little exchanges that help people clarify details and build trust.

Of course there will be glitches – sometimes quite serious – in both the experimental work and the analysis. You will have to respond to these issues as a team, rather than resorting to finger-pointing. Building relationships on regular, open communication raises the empathy level and helps you weather technical storms, big and small.


1. You know what they say about ‘assume’

Computational team: Don’t assume your data is correct as received – everything that can go wrong, will go wrong. Start with unbiased clustering (heat-maps are a great visualisation) and let the data point to sample swaps or large issues. If you collect data over a longer period of time, plot key metrics v. time to see if there are unwanted batch/time effects. For sample swaps, check things like genotypes (e.g. RNAseq-implied to sample-DNA genotypes). If you have mixed genders, a chromosome check will catch many sample swaps. Backtrack any suspected swaps with the experimental team and fail suspect samples by default. Sample swaps are the same as bugs in analysis code - be nice to the experimental team so they will be nice when you have a bug in your code.
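As one concrete (and entirely hypothetical) version of the sex-chromosome check: compare Y-chromosome signal against the recorded sex and flag disagreements. The expression values and threshold below are illustrative, not recommended cutoffs.

```python
import numpy as np

# Hypothetical sample-swap screen: samples recorded as female with high
# chrY expression (or vice versa) are flagged as possible swaps.
def flag_sex_mismatches(chry_expression, recorded_sex, threshold=1.0):
    flags = []
    for sample, (expr, sex) in enumerate(zip(chry_expression, recorded_sex)):
        inferred = "M" if expr > threshold else "F"
        if inferred != sex:
            flags.append(sample)
    return flags

chry = np.array([5.2, 0.1, 4.8, 0.2])
recorded = ["M", "F", "F", "F"]   # sample index 2 looks male but is recorded female
print(flag_sex_mismatches(chry, recorded))
```

Any flagged sample then goes back to the experimental team for backtracking, and fails by default if it cannot be resolved.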

Experimental team: Don’t assume the data is correct at the end of an analytical process. Even with the best will in the world, careful analysis and detailed method-testing, mistakes are inevitable. Flag results that don't feel right to you. Repeat appropriate sample-identity checks at key time points. At an absolute minimum, you should perform checks after initial data receipt and before data release.

2. One thing you can assume

You can safely assume that there are many confounders to your data. But thanks to careful planning, the analysis team will have all the metadata the experimental team has lovingly stored to work with.
Work with unsupervised methods (e.g. straight PCA; we’re also very fond of PEER in my group), and correlate the components with the known covariates. Put the big ones in the analysis, or even regress them out completely (it’s usually best to put them in as terms in the downstream analysis). Don’t be surprised by strongly structured covariates that you didn’t capture as metadata. Once you have convinced yourself that you are really not interested, move on.

(Side note on PCA and PEER: take out the means first, and scale. Otherwise, your first PCA component will be the means, and everything else will have to be orthogonal to that. PEER, in theory, can handle that non-orthogonality, but it's a big ask, and the means in particular are best removed. This means it is all wrapped up with normalisation, below.)
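A sketch of that flow on simulated data – centre and scale, run PCA, then correlate the top component with a known covariate. Nothing here is from a real dataset, and plain SVD-based PCA stands in for PCA/PEER:

```python
import numpy as np

# Simulated expression matrix with a deliberate batch effect.
rng = np.random.default_rng(0)
n_samples, n_features = 40, 200
batch = np.repeat([0.0, 1.0], n_samples // 2)          # known covariate
data = rng.normal(size=(n_samples, n_features))
data += batch[:, None] * 2.0                           # inject batch effect

# Take out the means first, and scale (per feature), before PCA.
data = (data - data.mean(axis=0)) / data.std(axis=0)

# PCA via SVD: u * s gives the per-sample component scores.
u, s, vt = np.linalg.svd(data, full_matrices=False)
pc1 = u[:, 0] * s[0]

# Correlate PC1 with the known covariate: a strong correlation means
# batch should go into the model, or be regressed out.
r = np.corrcoef(pc1, batch)[0, 1]
print(f"correlation of PC1 with batch: {abs(r):.2f}")
```

Here PC1 locks onto the injected batch effect almost perfectly; on real data you would repeat the correlation for every tracked covariate against the top handful of components.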

3. Pay attention to your reagents

Pay careful attention to key reagents, such as antibody or oligo batches, on which your analysis will rely. If they are skewed, all sorts of bad things can happen. If you notice your reagent data is skewed, you’ll have to make some important decisions. Your carefully prepared randomisation procedure will help you here.

4. The new normal

It is highly unlikely that the raw data can just be put into downstream analysis schemes - you will need to normalise. But what is your normalisation procedure? Lately, my mantra is, “If in doubt, inverse normalise.” Rank the data, then project those ranks back onto a normal distribution. You’ll probably lose only a bit of power – the trade-off is that you can use all your normal parametric modelling without worrying (too much) about outliers. 
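A minimal version of that recipe – rank the data, then project the ranks back onto a normal distribution – might look like this (the half-rank offset is one common convention for keeping the extreme ranks off plus/minus infinity):

```python
import numpy as np
from scipy.stats import norm, rankdata

def inverse_normalise(x):
    """Rank the data, then project the ranks onto a standard normal.

    Uses the (rank - 0.5) / n offset so the smallest and largest
    values do not map to -inf / +inf.
    """
    ranks = rankdata(x)
    return norm.ppf((ranks - 0.5) / len(x))

# A skewed variable with a wild outlier becomes well behaved: the
# outlier is tamed to the largest quantile, order is preserved.
raw = np.array([0.1, 0.2, 0.3, 0.4, 1000.0])
print(inverse_normalise(raw))
```

The outlier at 1000 ends up only one rank-step beyond its neighbours, which is exactly the "lose a bit of power, gain robustness" trade-off described above.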

You need to decide on a host of things: how to correct for lane biases, GC, library complexity, cell numbers, plate effects in imaging. Even using inverse normalisation, you can do this in all sorts of ways (e.g. in a genome direction or a per-feature direction – sometimes both) so there are lots of options, and no automatic flow chart about how to select the right option.

Use an obvious technical normalisation to start with (e.g. read depth, GC, plate effects), then progress to a more aggressive normalisation (i.e. inverse normalisation). When you get to interpretation, you may want to present things in the lighter, more intuitive normalisation space, even if the underlying statistics are more aggressive.

You’ll likely end up with three or four solid choices through this flow chart. Choose the one you like on the basis of first-round analysis (see below). Don’t get hung up on regrets! But if you don’t discover anything interesting, come back to this point and choose again. Taking a more paranoid approach, using two normalisation schemes through the analysis will give you a bit of extra security - strong results will not change too much on different "reasonable" normalisation approaches.

5. Is the data good?

Do a number of ‘data is good’ analyses.

  • Can you replicate the overall gene-expression results? 
  • Is the SNP Tv/Ts rate good? 
  • Is the number of rare variants per sample as expected? 
  • Do you see the right combination of mitotic-to-nonmitotic cells in your images? 
  • Where does your dataset sit, when compared with other previously published datasets? 

These answers can guide you to the ‘right’ normalisation strategy - so flipping between normalisation procedures and these sorts of "validation" analyses helps make the choice of the normalisation.

6. Entering the discovery phase

‘Discovery’ is a good way to describe the next phase of the analysis, whether it’s differential-expression or time-course or GWAS. This is where one needs to have quite a bit more discipline in how to handle the statistics.

First, use a small (but not too small) subset of the data to test your pipelines (in Human, I am fond of the small, distinctly un-weird chromosome 20). If you can make a QQ plot, check that the QQ plot looks good (i.e. calibrated). Then, do the whole pipeline.

7. False-discovery check

Now you’re ready to apply your carefully thought-through ‘false discovery rate’ approach, ideally without fiddling around. Hopefully your QQ plot looks good (calibrated with a kick at the end), and you can roll out false discovery control now. Aim to do this just once (and when that happens, be very proud). 
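The post doesn't prescribe a particular false-discovery procedure; as one standard example, Benjamini–Hochberg control could be sketched like this (the P values are made up):

```python
import numpy as np

def bh_threshold(pvalues, fdr=0.1):
    """Benjamini-Hochberg: largest p with p_(i) <= (i / n) * fdr."""
    p = np.sort(np.asarray(pvalues))
    n = len(p)
    below = p <= (np.arange(1, n + 1) / n) * fdr
    if not below.any():
        return 0.0                      # nothing passes the threshold
    return p[np.nonzero(below)[0].max()]

# Three strong hits among mostly-null P values:
pvals = [1e-6, 5e-5, 1e-4, 0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.99]
cutoff = bh_threshold(pvals, fdr=0.1)
print([p for p in pvals if p <= cutoff])
```

The discipline the post asks for is in how often you run this, not in the arithmetic: decide the FDR level in advance, apply it once, and log every re-run.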

8. There is no spoon

At this point you will either have some statistically interesting findings above your false discovery rate threshold, or you won’t have anything above threshold. In neither case should you assume you are successful or unsuccessful. You are not there yet.

9. Interesting findings

You may have nominally interesting results, but don’t trust the first full analysis. Interesting results often enrich for errors and artefacts from earlier in your process. Be paranoid about the underlying variant calls, imputation quality or sample issues.

Do a QQ plot (quantile-quantile plot of the P values, expected v. observed). Is the test well calibrated (i.e. the QQ plot starts on expected == observed, with a kick at the end)? If you can’t do a straight-up QQ plot, carry out some close alternative so you can get a frequentist P value out. In my experience, a bad QQ plot is the easiest way to spot dodgy whole-genome statistics.
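The QQ computation itself is short; here is a sketch on simulated null P values (under the null, observed quantiles should hug the expected ones, with a kick at the end only when there is real signal):

```python
import numpy as np

def qq_points(pvalues):
    """Expected and observed -log10(P) quantiles for a QQ plot."""
    n = len(pvalues)
    expected = -np.log10((np.arange(1, n + 1) - 0.5) / n)
    observed = -np.log10(np.sort(pvalues))
    return expected, observed

# Well-calibrated null: P values drawn from a uniform distribution.
rng = np.random.default_rng(1)
null_p = rng.uniform(size=10_000)
exp_q, obs_q = qq_points(null_p)

# The bulk of the distribution should match; systematic inflation
# across the whole plot signals mis-calibration.
print(np.median(obs_q), np.median(exp_q))
```

Plotting `obs_q` against `exp_q` (with the diagonal for reference) gives the standard QQ plot; the same two arrays also support numeric checks such as a genomic inflation factor.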

Spot-check that things make sense up to here. Take one or two results all the way through a ‘manual’ analysis. Plot the final results so you can eyeball outliers and interesting cases. Plot in both normalisation spaces (i.e. ‘light’ and aggressive/inverse).

For genome-wide datasets, have an ‘old hand’ at genomes/imaging/proteomics eyeball either all results or a random subset on a browser. When weird things pop up ("oh, look, it’s always in a zinc finger!"), they might offer an alternative (and hopefully still interesting, though often not) explanation. Talk with colleagues who have done similar things, and listen to the war stories of nasty, subtle artefacts that mislead us all.

10. ‘Meh’ findings

If your results look uninteresting:

  • Double check that things have been set up right in your pipeline (you wouldn’t be the first to have a forehead-smacking moment at this point). 
  • Put dummy data that you know should be right into the discovery pipeline to test whether it works. 
  • Triple-check all the joining mechanisms (e.g. the genotype sample labels with the phenotype). 
  • Make sure incredibly stupid things have not happened – like the compute farm died silently, and with spite, making the data files look valid when they are in fact… not.

11. When good data goes bad

So you’ve checked everything, and confirmed that nothing obvious went wrong. At this point, I would allow myself some alternative normalisation approaches, FDR thresholding or confounder manipulation. But stay disciplined here.

Start a mental audit of your monkeying around (penalising yourself appropriately in your FDR). I normally allow four or five trips on the normalisation merry-go-round or on the “confounders-in-or-out” wheel. What I really want out of these rides is to see a P value / FDR rate that’s around five-fold better than a default threshold (of, say 0.1 FDR, so hits at 0.02 FDR or better).

Often you are struggling here with the multiple testing burden if there is a genome-wide scan. If you are not quite there with your FDRs, here are some tricks: examine whether just using protein-coding genes will help the denominator, and look at whether restricting by expression level/quantification helps (i.e. removing lowly expressed genes which you couldn't find a signal in anyway). 

You may still have nothing over threshold. So, after a drink/ice cream, open up your plan to the “Found Nothing Interesting” chapter (you did that part, right?) and follow the instructions.

Do stop monkeying around if you can’t get to that 0.02 FDR. You could spend your career chasing will-o-the-wisps if you keep doing this. You have to be honest with yourself: be prepared to say “There’s nothing here.” If you end up here, shift to salvage mode (it’s in the plan, right?).

12. But is it a result?

Hopefully you have something above threshold, and are pretty happy as a team. But is it a good biological result? Has your FDR merry-go-round actually been a house of mirrors? You don’t want to be in any doubt when you go to pull that final trigger on a replication / validation experiment.

It may seem a bit shallow, but look at the top genes/genomic regions, and see if there is other interesting, already-published data to support what you see. I don't, at this point, trust GO analysis (which is often "non random"), but the Ensembl phenotype-per-gene feature is really freakily useful (in particular its ‘phenotypes in orthologous genes’ section), as is the UniProt comments section. Sometimes you stumble across a completely amazing confirmation at this point, from a previously published paper.

But be warned: humans can find patterns in nearly everything – clouds, leaf patterns, histology, and Ensembl/UniProt function pages. Keep yourself honest by inverting the list of genes and looking at the least interesting genes from the discovery process. If the story is equally consistent from bottom to top, I’d be sceptical that this list actually provides confirmation. Cancer poses a particular problem: half the genome has been implicated in one type of cancer or another by some study.

Sometimes, though, you just have a really clean discovery dataset with no previous literature support, and you need to go into replication with no more confidence than your statistics give you that you are confirming something valuable.

13. Replication/Validation

Put your replication/validation strategy into effect. You might have baked it into your original discovery. Once you are happy with a clean (or as clean as you can get) discovery and biological context, it’s time to pull the trigger on the strategy. This can be surprisingly time consuming.

If you have specific follow-up experiments, sort some of them out now and get them going. You may also want to pick out some of the juiciest results to get additional experimental data to show them off. It’s hard to predict what the best experiment or analysis will be; you can only start thinking about these things when you get the list.

My goal is for the replication / validation experiments to be as unmanipulated as possible, and you should be confident that they will work. It's a world of pain when they don't!

14. Feed the GO

With the replication/validation strategy underway, your analysis can now move onto broader things, like the dreaded GO enrichments. Biology is very non-random, so biological datasets will nearly always give some sort of enriched GO terms. There are weird confounders both in terms of genome structure (e.g. developmental genes are often longer on the genome) and confounders in GO annotation schemes.

Controlling for all this is almost impossible, so this is more about gathering hints to chase up in a more targeted analysis. Or to satisfy the “did you do GO enrichment?” requirement that a reviewer might ask. Look at other things, like related datasets, or orthologous genes. If you are in a model organism, Human is a likely target. If you are in Human, go to mouse, as the genome-wide phenotyping in mouse is pretty good now. Look at other external datasets you can bring in, for example chromatin states on the genome, or lethality data in model organisms.

15. Work up examples

Work up your one or two examples, as these will help people understand the whole thing. Explore data visualisations that are both appealing and informative. Consider working up examples of interesting, expected and even boring cases.

16. Serendipity strikes!

Throughout this process, always be ready for serendipity to strike. What might look like a confounder could turn out to be a really cool piece of biology – this was certainly the case for us, when we were looking at CTCF across human individuals and found a really interesting CTCF behaviour involved in X inactivation. 

My guess is that serendipity has graced our group in one out of every ten or so projects – enough to keep us poised for it. But serendipity must be approached with caution, as it could just be an artefact in your data that simply lent itself to an interesting narrative. If you’ve made an observation and worked out what you think is going on, you almost have to create a new discovery process, as if this was your main driver. It can be frustrating, because you might now not have the ideal dataset for this biological question. In the worst case, you might have to set up an entirely new discovery experiment.

But often you are looking at a truly interesting phenomenon (not an artefact). In our CTCF paper, the very allele-specific behaviour of two types of CTCF sites we found was the clincher: this was real (Figure 5C). That was a glorious moment.

17. Confirmation

When you get the replication dataset in, slot it into place. It should confirm your discovery. Ideally, the replication experiments fit in nicely. Your cherry-on-the-cake experiment or analysis will show off the top result.

18. Pat on the back if it is boring

The most important thing to know is that sometimes, nothing too interesting will come out of the data. Nobody can get a cool result out of every large-scale experiment. These will be sad moments for you and the team, but be proud of yourself when you don’t push a dataset too far – and for students and postdocs, this is why having two projects is often good. You can still publish an interesting approach, or a call for larger datasets. It might be less exciting, but it’s better than forcing a result.

Monday, 9 May 2016

Advice on Big Data Experiments and Analysis, Part I: Planning

Biology has changed a lot over the past decade, driven by ever-cheaper data gathering technologies: genomics, transcriptomics, proteomics, metabolomics and imaging of all sorts. After a few years of gleeful abandon in the data generation department, analysis has come to the fore, demanding a whole new outlook and on-going collaboration between scientists, statisticians, engineers and others who bring to the table a very broad range of skills and experience.

Finding meaning in these beautiful datasets and connecting them up, particularly when they are extremely different from one another, is a detail-riddled journey fraught with perils. Innovation is happening so quickly that trusty guides are rather thin on the ground, so I’ve tried to put down some of my hard-won experience, mistakes and all, to help you plan, manage, analyse and deliver these projects successfully.

Without up-front planning, you won’t really have much of a ‘project’. Throwing yourself into data gathering just because it’s ‘cheap’ or ‘possible’ is really not the best thing to do (I've seen this happen a number of times - and embarrassingly I've done it myself). ‘Wrong’ experiments are time vampires: they will slurp up a massive amount of your time and energy, potentially exposing you to reputational risk in the event you are tempted to force a result out of a dataset.

This post, the first of three, is about having the strongest possible start for your project via good planning.

1. Buddy up

In the olden days, experimental biologists would generate a bunch of data and then ask a bioinformatician how to deal with it. Well, that didn’t work too well. At the very outset of a project, we have all learned that you need to ensure there are two PIs: one to focus on the experimental/sample-gathering side, and one to keep the analysis in their sights at all times. These two PIs must have healthy, mutual respect, and be motivated by the same overall goal. There are a few, rare individuals who can honestly be described as being both experimental and computational, but in most cases you’ll need two people to make sure both perspectives are represented in the study’s design.

Now, I’m not saying that experimentalists are strangers to analysis, or that bioinformaticians are strangers to data generation. It’s just important to acknowledge that being able to ‘talk the talk’ of another discipline does not, on its own, qualify you to manage that end of the project, with all its complexities, gotchas and signature fails.

As with anything you set out to do for a couple of years, you’ll need to make sure you are working with someone you get on with. There will be tense moments, and you’ll get past them if your co-PI shares your motivation and goal. Provided you get on and share information as you go, buddying up will save you resources in the long run.

Note to Experimental PIs: never assume you’ll be able to tack on an analytic collaboration at the end, after you’ve gathered the data. You don’t want to be caught out by not having considered some important analysis aspect. 

Note to Computational PIs: Never assume you can delegate sample management and experimental details to a third party through facility technicians. You know there is a huge difference between experimental data and good experimental data – you will need a trusted experimental partner who understands all the relevant confounders and lab processes, and who can spot a serendipitous result if one pops out, if you’re going to have a successful project.

2. Outlining

The idea that you can generate datasets first and then watch your results emerge from the depths is simply misguided. It is really quite painful (and wasteful) when a dataset doesn’t have what it needs to support an analysis - it is a set-up for forcing results. Before you do anything, have a brief discussion with your co-PI about the main questions you are looking to answer and make a high-level sketch of the project.

I’m not talking about a laboured series of chapter outlines – the main thing is to determine the central question. Large-scale data-gathering projects often focus on basic, descriptive things, like, “How much of phenomenon X do we see under Y or Z conditions?” Sometimes the questions are more directed, for example, “How does mitosis coordinate with chromosomal condensation?”

Outlining your hypotheses need only be as simple as, “At the end, we will have a list of proteins in the Q process.” If you’re hoping to test a hypothesis, aim for something straightforward, like, “I believe the B process is downstream of the Ras process.” 

Consider your possible hypothesis-testing modes, but avoid trying too hard to imagine where the analysis might take you; your data and analysis might not agree with your preconceptions in the end. 

Also, do not commit to a specific follow-up strategy too early! Your follow-up strategy should be determined after your initial analysis has been explored, or your pilot study has been performed.

3. Back-of-the-envelope ‘power calculations’

Take some of the anxiety out of the process by doing a rough calculation before getting into things too deeply. If you (or someone else) has done a similar analysis well in the past, simply use their analysis as a basis for your rough estimate. If you are on completely new ground, make sure you factor in false positives (e.g. mutation calls, miscalled allele-specific events, general messiness) and pay careful attention to frequencies (e.g. alleles, rare cell types).

Many a bad project could have been stopped in its tracks by a half hour’s worth of power analysis. Unless you really need to impress reviewers, you probably don’t need to go overboard – just make a quick sketch. But be honest with yourself! It is all too easy to fudge the numbers in a power analysis to get an answer you want. Use it as a tool for looking honestly at what sort of results you could expect.
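That half hour of power analysis can be as simple as a few lines of arithmetic. The numbers below are all invented for illustration – plug in your own guesses for effect frequency, detection power and per-test false positive rate:

```python
# Back-of-the-envelope 'power' sketch with made-up numbers.
# Question: testing 20,000 genes, how many hits do we expect,
# and how many of those are likely to be false positives?

n_tests = 20_000          # e.g. one test per gene
true_effect_rate = 0.01   # guess: 1% of genes genuinely involved
power = 0.5               # guess: chance of detecting a real effect
alpha = 1e-6              # per-test significance threshold

expected_true_hits = n_tests * true_effect_rate * power
expected_false_hits = n_tests * (1 - true_effect_rate) * alpha

print(f"expected true hits:  {expected_true_hits:.0f}")
print(f"expected false hits: {expected_false_hits:.2f}")
```

If the expected false hits rival the true hits, the design needs rethinking before any samples are run.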

4. Get logistical

Plan the logistics according to Sod’s Law. Assume everything that can go wrong will go wrong at least once. This is particularly important if you are scaling up, for example moving an assay from single-well/Eppendorf to an array.

For assays, give yourself at least a year for scale-up in the lab (better still, do a pilot scale-up with publication before moving on to the real thing). Pad out all sample acquisition with at least three months for general monkeying around.

5. Have a healthy respect for confounders

Think about the major confounders you will encounter downstream, and randomise your experimental flow accordingly. That is, do not just do all of state X first, then progress to state Y, then Z.

Make sure you record all the known confounders (e.g. antibody batch number, day of growth). Try to work off single antibody batches/oligo batches for key reagents. If you know you will need more than one batch, remember the randomisation! You absolutely do not want the key reagent batch being confounded with your key experimental question, i.e. normal with batch 1, disease with batch 2. Disaster!
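As a toy sketch of the randomisation point (sample counts and batch sizes are invented), shuffling samples before assigning them to batches stops disease state from tracking reagent batch:

```python
import random

# Toy sketch: randomise sample processing order so that disease state
# is not confounded with batch or day. 12 normal + 12 disease samples.
samples = [("normal", i) for i in range(12)] + [("disease", i) for i in range(12)]

random.seed(42)  # record the seed so the design is reproducible
random.shuffle(samples)

# Assign the shuffled samples to batches of 8; each batch now mixes states.
batches = [samples[i:i + 8] for i in range(0, len(samples), 8)]
for b, batch in enumerate(batches, start=1):
    states = [state for state, _ in batch]
    print(f"batch {b}: {states.count('normal')} normal, {states.count('disease')} disease")
```

The same trick applies to processing order across days: shuffle first, then assign.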

6. Plan the analysis

If possible, stagger the experimental and analysis work. See if you can have your analysis postdocs come in to the project later, ideally with some prior knowledge of the work (the best case is that they are around but on another project early, and then switch into this project about a year in). Unfortunately, because funding agencies like to have neat and tidy three-year projects, this is often quite difficult to arrange.

Determine when an initial dataset will be available, and time the data coordination accordingly. Budget at least six months (more likely 12–18 months) of pure computational work. Use early data to ‘kick the tyres’ and test different analysis schemes, but plan for a single, definitive analysis run over the full, frozen dataset.

7. Replication/validation strategy

You know you’re not going to cook up the data and analysis, but how will you convince the sceptical reader? Make sure you have a strategy in place.

I find it helps to think of this as two separate phases: discovery, and validation/replication. In discovery, you have plenty of freedom to try out different methods and normalisation before settling. The validation/replication phase, for a project of any size, features ‘single-shot’ experiments, which offer a minimal amount of flexibility.

Generally speaking, you should not be doing single-sample-per-state experiments; rather, you should be carrying out at least two biological replicates, which is enough to show up any major problems. With five or more biological replicates, you can make good estimates of the mean. The one exception to the "no single sample" rule is QTL/GWAS, where it is nearly always better to sample new genotypes each time than to replicate data from the same genotype (i.e. maximise your genetic samples first, and then improve on per-genetic-individual variance).
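The replicate numbers above follow from how the standard error of the mean shrinks with the square root of n. A toy illustration with invented measurements for one condition:

```python
import statistics

# Toy illustration: the standard error of the mean shrinks with sqrt(n),
# so going from 2 to 5 biological replicates noticeably tightens estimates.
measurements = [10.2, 9.8, 10.5, 9.9, 10.1]  # invented values

for n in (2, 5):
    subset = measurements[:n]
    se = statistics.stdev(subset) / n ** 0.5
    print(f"n={n}: mean={statistics.mean(subset):.2f}, standard error={se:.3f}")
```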

8. Confront multiple testing

How many tests are you going to do? If it is a genome-wide project, you will do a lot, so you need to control for multiple testing. This is partly about the power calculation, but it requires some up-front thinking. Will you do permutations, or trust the magic of p.adjust() (a wonderful R function that offers a set of false discovery rate approaches)? What will you do if you find nothing? Is finding nothing interesting in itself?
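For those outside R, the Benjamini–Hochberg procedure behind p.adjust(p, method = "BH") is short enough to sketch by hand; the p-values here are invented:

```python
# A minimal Benjamini-Hochberg FDR sketch in Python, mirroring what
# R's p.adjust(p, method = "BH") computes.

def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n - 1, -1, -1):
        i = order[rank]
        running_min = min(running_min, pvalues[i] * n / (rank + 1))
        adjusted[i] = running_min
    return adjusted

pvals = [0.001, 0.008, 0.039, 0.041, 0.22, 0.87]
print([round(q, 4) for q in bh_adjust(pvals)])
```

Each adjusted value estimates the false discovery rate you accept by calling everything at or below it significant.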

You’ll all have agreed to try and discover something excellent, but make sure you have a serious conversation up front with your co-PI about what you’ll do if you don’t find anything interesting. Is there a fall-back plan?

Traditional, outright replication of an entire discovery cohort needs as much logistical planning as the discovery itself – if not more. Alternatively, you might decide to use prior data to show that your results are at least solid. Organise this beforehand.

9. Publishing parameters

What would you consider to be the first publishable output from this project? Could you put it into a technical publication (e.g. assay scale-up, bespoke analytical methods)? At the beginning of the project, you and your co-PI should agree on the broad parameters of authorship on papers, and how multiple papers might be coordinated. For example, will you credit two first authors and two last authors, swapping in priority if there is more than one paper?

If you are a more senior partner in a collaboration, be generous with your “last last” position. Your junior PI partners need it more than you do!

Next up

This is the first of three posts. Next up: Managing your Big Data project.