views: 98
answers: 4

I come from a CS background but am now doing genomics.

My projects include a lot of bioinformatics, typically involving: aligning sequences, comparing overlap between sequences and various genome annotation features, working across different classes of biological samples, time-course data, microarray data, and high-throughput sequencing data ("next-gen" sequencing, though it's actually the current generation by now), that kind of stuff.

The workflow with this kind of analysis is quite different from what I experienced during my CS studies: no UML or thoughtfully designed objects shining with sublime elegance, no version management, no proper documentation (often no documentation at all), no software engineering at all.

Instead, what everyone in this field does is hack out one Perl script or AWK one-liner after another, usually for one-time use.

I think the reason is that the input data and formats change so fast, and the questions need to be answered so soon (deadlines!), that there seems to be no time for project organization.

One example to illustrate this: let's say you want to write a raytracer. You would probably put a lot of effort into the software engineering first, then program it, finally in some highly optimized form, because you would use the raytracer countless times with different input data and would keep changing the source code over the years to come. So good software engineering is paramount when coding a serious raytracer from scratch. But imagine you want to write a raytracer where you already know you will only ever use it to raytrace one single picture, and that picture is of a reflecting sphere over a checkered floor. In that case you would just hack it together somehow. Bioinformatics is only ever like the latter case.

What you end up with are whole directory trees containing the same information in different formats, until you have reached the one particular format necessary for the next step, and dozens of files with names like "tmp_SNP_cancer_34521_unique_IDs_not_Chimp.csv" where one day later you don't have the slightest idea why you created the file and what exactly it is.

For a while I was using MySQL, which helped, but now new data is generated, and formats change, at such a speed that proper database design is no longer possible.

I am aware of a single publication which deals with these issues (Noble, W. S. (2009). A quick guide to organizing computational biology projects. PLoS Comput Biol 5(7): e1000424). The author sums up the goal quite nicely:

The core guiding principle is simple: Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why.

Well, that's what I want too! But I am already following the same practices as that author, and I feel they are absolutely insufficient.

Documenting each and every command you issue in bash, commenting on why exactly you ran it, and so on, is just tedious and error-prone; the steps in the workflow are too fine-grained. And even if you do it, it can still be an extremely tedious task to figure out what each file was for, at which point a particular workflow was interrupted, for what reason, and where you continued.

(I am not using the word "workflow" in the sense of Taverna; by workflow I just mean the steps, commands and programs you choose to execute to reach a particular goal.)

My question is: how do you organize your bioinformatics projects?

A: 

Your question is about project management. Bad project management is not unique to bioinformatics, and I find it hard to believe that the entire bioinformatics industry is committed to bad software design.

About the pressure... Again, there are others in this world who have very challenging deadlines, and they still use good software design.

In many cases, following good software design does not hold a project back; it may even speed up its development and maintenance (at least in the long run).

Now to your real question... You can offer your manager a redesign of some small part of the code that has no influence on the rest, as a POC (proof of concept). But it's really hard to stop a truck that keeps on moving, so don't get upset if the reaction is "we've worked this way for years - we know what we are doing, and we don't need a child to teach us how to do our work". Learn to work like the rest, and once you have gained their trust you can do your own thing once in a while (I hope you will have the time and the devotion to do the right thing).

Good luck

Asaf
+3  A: 

I'm a software specialist embedded in a team of research scientists, though in the earth sciences, not the life sciences. A lot of what you write is familiar to me.

One thing to bear in mind is that much of what you have learned in your studies is about engineering software for continued use. As you have observed, a lot of what research scientists do is about one-off use, and the engineered approach is not suitable. If you want to implement some aspects of good software engineering, you are going to have to pick your battles carefully.

Before you start fighting any battles, you are going to have to critically examine your own ideas to ensure that what you learned in school about general-purpose software engineering is valid for your current situation. Don't assume that it is.

In my case the first battle I picked was the implementation of source code control. It wasn't hard to find examples of all the things that go wrong when you don't have version control in place:

  • some users had dozens of directories, each with different versions of the 'same' code, and only the haziest idea of what was unique about most of them, or why they were there;
  • some users had lost useful modifications by overwriting them and not being able to remember what they had done;
  • it was easy to find situations where people were working on what should have been the same program but were in fact developing incompatibly in different directions;
  • etc etc etc

Once I had gathered the information -- and make sure you keep good notes about who said what and what it cost them -- it became relatively easy to paint a picture of a better world with source code control.
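To make that picture concrete: here is a minimal sketch of what putting an analysis directory under source code control can look like, assuming git (the directory and file names below are invented purely for illustration):

    cd snp_overlap_analysis                       # hypothetical project directory
    git init
    printf '*.bam\n*.sam\ntmp_*\n' >> .gitignore  # keep bulky or derived data out of the repo
    git add .gitignore *.pl *.sh                  # track the scripts, not the data
    git commit -m "initial import of analysis scripts"

Even this much is enough to remove the "dozens of directories with slightly different versions of the same code" problem from the list above.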

Next, well, next you have to choose your own next battle. But one of the seeds of doubt you have to sow in your scientist-colleagues' minds is 'reproducibility'. Scientific experiments are not valid if they are not reproducible; if the experiments involve software (and they always do), then careful software engineering is essential for reproducibility. A lot of this is about data provenance, but that's a topic for another day.

High Performance Mark
+3  A: 

For bioinformatics-specific answers, you'll likely be interested in these two threads over at BioStar (the bioinformatics stackexchange)

chrisamiller
A: 

Part of the issue here is the distinction between documentation for software vs documentation for publication.

For software development (and research-plan design), the important documentation is structural and intentional: the data model, the reasons why you are doing something, and so on. I strongly recommend using the skills you've learned in CS to document your research plan. Having a plan for what you want to do gives you a lot of freedom to multi-task while long analyses are running.

On the other hand, a lot of bioinformatics work is analysis. Here you need to treat documentation like a lab notebook, not necessarily like a project plan. You want to document what you did, perhaps with a brief comment on why (e.g. when you are troubleshooting data), and what the outputs and results are.

What I do is fairly simple. First, I start in a directory and create a git repo. Then, whenever I change some file, I commit it to the repo. As much as possible, I try to name data outputs in a way that lets me drop them into my .gitignore file. Also as much as possible, I work in a single terminal session per project at a time, and when I hit a pause point (like when I've sent a set of jobs up to the grid), I run 'history | cut -c 8-' and paste that into a lab notes file. I then edit the file to add comments on what I did and, remember, change the git add/commit lines to git checkout (I have a script that does this based on the commit messages). As long as I start in the right directory and my external data doesn't go away, this means I can recreate the entire process later.
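As a rough illustration of that pause-point step, assuming bash and git (the file names are mine, not from the answer above):

    cd my_project                            # analysis directory that already has a git repo
    git add filter_reads.sh
    git commit -m "relax quality cutoff for second pass"
    ./filter_reads.sh raw_reads.fastq > filtered_reads.fastq
    # ... submit jobs to the grid, reach a pause point ...
    history | cut -c 8- >> lab_notes.txt     # append the session, minus the history line numbers
    # then edit lab_notes.txt by hand: note why each step was run, and rewrite the
    # 'git add'/'git commit' lines as 'git checkout <commit>' so the notebook can
    # later restore the exact script versions that were used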

For any even slightly complex processing task, I write a script to do it, so that my notebook looks as clean as possible. To a first approximation, a helper script can be viewed as a subroutine in a larger project, and should be documented internally to at least that level.
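For example, "documented internally to at least that level" might look something like this (a hypothetical helper script, not one from this answer):

    #!/usr/bin/env bash
    # filter_unique_ids.sh  (hypothetical example)
    #
    # Usage:    filter_unique_ids.sh <input.csv> <output.csv>
    # Purpose:  keep only the rows whose ID (column 1) occurs exactly once,
    #           so downstream overlap comparisons are not skewed by duplicates.
    # Inputs:   comma-separated file with the ID in the first column
    # Outputs:  same format, duplicate-ID rows removed
    set -euo pipefail
    in_csv=$1
    out_csv=$2
    # first pass counts each ID, second pass prints rows whose ID is unique
    awk -F',' 'NR==FNR { count[$1]++; next } count[$1] == 1' "$in_csv" "$in_csv" > "$out_csv"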