In my Copious Free Time, I collaborate with a number of scientists (mostly biologists) who develop software, databases, and other tools related to the work they do. Generally these projects are built on a one-off basis, used in-house, and eventually someone decides "oh, this could be useful to other people," so they release a binary or slap a PHP interface onto it and shove it onto the web. However, they typically can't be bothered to make their source code or dumps of their databases available for other developers, so in practice, these projects usually die when the project for which the code was written comes to an end or loses funding. A few months (or years) later, some other lab needs the same kind of tool and has to repeat the work that the first lab did. That project eventually dies too: lather, rinse, repeat.

Does anyone have any suggestions for how to persuade people whose primary job isn't programming that it's of benefit to their community for them to be more open with the tools they've built? Similarly, any advice on how to communicate the idea that version control, bug tracking, refactoring, automated tests, continuous integration and other common practices we professional developers take for granted are good ideas worth spending time on? Unfortunately, a lot of scientists seem to hold the opinion that programming is a dull, make-work necessary evil and that their research is much more important, not realising that these days, software development is part of scientific research, and if the community as a whole were to raise the bar for development standards, everyone would benefit.

Have you ever been in a situation like this? What worked for you?

A: 

You could have them use a content management system, like Joomla. That way they only push content and not code.

Tony Borf
+5  A: 

It's not exactly simple, but demonstration by example would probably drive the point home most effectively - find a task the researcher needs done, find someone who did take the time to make a tool w/source available, and point out how much time the researcher could save as a result due to having that tool available - then point out that they could give back to the community in the same fashion.

Amber
+2  A: 

Actually, asking any busy project team to include in their schedule time for making their software suitable for adoption by another team is extremely hard in my experience.

Doing extra work for the public good is a big ask.

I've seen a common pattern of "harvesting" after the project is complete, reflecting that immediate coding for reuse tends to get lost in the urgency of the day.

The only avenue I can think of is if the reuse is within an organisation with a budget for a "hunter gatherer", someone whose reason for being there is IT.

You may be on more of a win for things such as unit tests because they have immediate payback for the development.

djna
A: 

I wouldn't so much persuade as I would streamline the process. Document it clearly, make video tutorials and bundle some kind of tool chain that makes it ridiculously easy to get source repositories set up without requiring them to become experts in something that isn't their main field.

Soviut
+4  A: 

In effect, what you are asking them to do is become professional developers (with their copious free time), in addition to their chosen profession. Their reluctance is understandable.

Robert Harvey
It's literally true "that programming is a dull, make-work necessary evil and that their research is much more important". Hard to convince them otherwise, since their career doesn't depend on dull, make-work evil software development practices. Sorry, but software IS hard.
S.Lott
+12  A: 

Software Carpentry sounds like a match for your request:

Overview

Many scientists and engineers spend much of their lives programming, but only a handful have ever been taught how to do this well. As a result, they spend their time wrestling with software, instead of doing research, but have no idea how reliable or efficient their programs are.

This course is an intensive introduction to basic software development practices for scientists and engineers that can reduce the time they spend programming by 20-25%. All of the material is open source: it may be used freely by anyone for educational or commercial purposes, and research groups in academia and industry are actively encouraged to adapt it to their needs.

Dirk Eddelbuettel
Have a look at Intermediate and Advanced Software Carpentry in Python (http://ivory.idyll.org/articles/advanced-swc/), too.
dalloliogm
+4  A: 

Does anyone have any suggestions for how to persuade people whose primary job isn't programming that it's of benefit to their community for them to be more open with the tools they've built?

Give up. Seriously, this is like teaching a pig to sing. (I can say this because I used to be a physicist so I know what they're like.)

The real issue is that your colleagues are rewarded for scientific output measured in publications, not software. It's hard enough in computer science to get recognized for building software; in the other sciences, it's nearly impossible.

You can't sell good development practice to your biology friends on the grounds that "it's good for you." They're going to ask "should I invest effort in learning about good software practice, or should I invest the same effort to publish another biology paper?" No contest.

Norman Ramsey
In big science (I'm a particle physicist), there is money and some recognition for building long-term software projects, so we have a few. But most of the work is done by hired programmers...
dmckee
+2  A: 

Maybe framing it in terms of academic/intellectual responsibility would help, to a degree - sharing your source is, in many ways, like properly citing your sources or detailing your research methodology. There are similar arguments to be made for some of the "professional software developer" behaviors you'd like to encourage, though I think releasing the code is probably an easier sell on these grounds than other things which could require significantly more work.

brennen
A: 

Take a really good programmer who already knows best practices, and ask your scientists to teach him what they need and what they do. Eventually the programmer will have the minimum domain knowledge (I suspect it takes between 1 and 3 years, depending on the domain) to do what the scientists ask for.

Developers are always learning another domain of competency, because most of their programs are not written for developers, so they need to know what the "client" does.

Nicolas Dorier
+7  A: 

Let me preface this by saying that I'm a bioinformatician, so I see the things you're talking about all the time. There's some truth to the fact that many of these people are biologists-turned-coders who just don't have the exposure to best practices.

That said, the core problem isn't that these people don't know about good practices, or don't care. The problem is that there is no incentive for them to spend more time learning software engineering, or to clean up their code and release it.

In an academic research setting, your reputation (and thus your future job prospects) depends almost entirely on the number and quality of publications that you've contributed to. Publications on methods or new algorithms are not given as much respect as those that report new biological findings. So after I do a quick analysis of a dataset, there's very little incentive for me to spend lots of time cleaning up my code and releasing it, when I could be moving on to the next dataset and making more biological discoveries.

I'll also note that the availability of funding for computational development is orders of magnitude less than that available for doing the biology. In a climate where only 10% of submitted grants are getting funded, scientists don't have the luxury of taking time to clean and release their code, when doing so doesn't help them keep their lab funded.

So, there's the problem in a nutshell. As a bioinformatician, I think it's perverse and often frustrating.

That said, there is hope for the future. With second- and third-generation sequencing, in particular, biology is moving into the realm of high-throughput discovery, where data mining and solid computational pipelines become integral to the success of the science. As that happens, you'll see more and more funding for computational projects, and more and more real software engineering happening.

chrisamiller
+2  A: 

For one thing, could we please stop teaching biologists Perl? Teaching non-professional programmers a write-only language is practically guaranteed to lead to unmaintainable, throw-away code. Python fills the same niche, is just as easy to learn (it's even used to teach kids programming!), and is much more readable.

Does it have as many ready-made bioinformatics modules / libraries? If not, then stopping bioinformaticians/biologists from using it would be as futile as stopping astronomers from using FORTRAN.
Jakub Narębski
BioPython isn't quite as mature as BioPerl, but it's getting there. (There's also BioRuby, but it's a distant third.) I have no idea what the chemists use; the physicists I know all use C.
Meredith L. Patterson
+2  A: 

Draw parallels with statistics. Stats is a crucial part of scientific research, and one where the only sensible advice is: either learn to do it properly, or get an expert to do it for you. Incorrectly-done stats can completely undermine a paper, just as badly-written code can completely undermine a public database or web resource.

PS: This blog is very good, but getting them to read it will be an uphill struggle: Programming for Scientists

Andrew Clegg
A: 

To be devil's advocate, is teaching scientists to be good software engineers the right thing to do? Software in research is usually very purpose specific - sometimes to the point where a piece of code needs to run successfully only once on a single data set. The results then feed into a publication and the goal is met. And there's a high risk that your technique or algorithm will be superseded by a better one in short order. So, there's a real risk that effort spent producing sparkling code will be wasted.

When you're frustrated by wading through a swamp of ill-formed Perl code, just think that the code you're looking at is one of the rare survivors. Mountains of such code have been written, used a few times, then discarded, never to see the light of day again.

I guess I'm just saying there's a big place in research for smelly heinous one-off prototype code. There are good reasons why such code exists. It may not be pretty, but if it gets the job done, who cares? We can always hire a software engineer to write the production-ready version later, IF it turns out to be justified, and let our scientists move on.

+1  A: 

Chris,

I agree with you to a degree, but in my experience, people's eagerness to publish leaves you with too many "me too" codes and methods, which don't really add to the quality of science. If there were a little more thought about open sourcing code and encouraging others to contribute (without necessarily getting publications out of it), then everyone would benefit.

Definitely agree that a separation between the scientific programmers and the software engineers is a good thing, especially for production applications. But even for scientific programming, the quality of my code would have been so much better if I had followed good practices at the time.

mndoci
+1  A: 

In my experience the best way of getting people to program cleanly is to set a good example when you're working with them, e.g.: "I never spend hopeless days debugging my code, because the first things I code are automated unit tests that will pinpoint problems when they are small and easily detectable." Or: "I'm very bad at keeping track of versions of things, but sometimes my new code does break what did work before. So I use svn/git/dropbox to keep track of things for me."

In my experience that kind of statement can raise the interest of "biologists that learned how to script". And if you need to collaborate on a bigger project, make it clear that you have more experience and that everything will go more smoothly if things are done your way.
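
To make the unit-test habit concrete when showing it to a collaborator, something like the following might serve as a starting point. It is only a minimal sketch: the `gc_content` helper and its test cases are hypothetical, invented here for illustration, and use nothing beyond Python's standard `unittest` module.

```python
import unittest


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence.

    Hypothetical helper, used only to give the tests something to check.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


class GcContentTest(unittest.TestCase):
    """Small checks that fail loudly if a later change breaks the helper."""

    def test_all_gc(self):
        self.assertEqual(gc_content("GGCC"), 1.0)

    def test_mixed_case_input(self):
        # Lower-case bases should be counted the same as upper-case ones.
        self.assertAlmostEqual(gc_content("atGC"), 0.5)

    def test_empty_sequence_is_rejected(self):
        with self.assertRaises(ValueError):
            gc_content("")
```

Saved as a file such as `test_gc.py` (the name is an assumption), the tests can be run with `python -m unittest test_gc`, and a failure points at the specific case that broke.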


Regarding publication of code, current practice is indeed frustrating. I would like to see a new journal like Source Code for Biology and Medicine, where code is peer-reviewed and can be published, but that has no (or very low) publication costs. Putting code on SourceForge or similar sites is indeed not "scientifically worth it" because it doesn't add a line to your publication list, and most code is not revolutionary enough to warrant paying $1,000 for publication in Source Code for Biology and Medicine or PLoS One...

Yannick Wurm
Journal: you don't need the backing of Elsevier or Reed to start a journal. In fact, it need not be very time consuming, nor does it require an open-ended commitment. Just how much would you like to see this journal?
Charles Stewart