I frequently write throwaway code (in a research environment) - for example to explore an algorithm or a model for a scientific property or process. Many of these "experiments" are one-off but sometimes I find that I need to use a few later. For example I have just unearthed code for string matching I wrote 7 years ago (stopped because of other priorities) but which is now valuable for a coworker's project. Having looked at it (did I really write such impenetrable code?) I realise there are some things I could have done then to help me when I restarted the "project" ("experiment" is still a better word). The earlier experiment "worked" but I know that at the time I would not have had time to refactor as my priorities lay elsewhere.

What approaches are cost-effective in enabling such work to be dug up and re-used?

EDIT: I have answered my own question (below) because there are issues beyond the actual source itself.

A: 

Some strategies:

  1. Good comments. Hard to reuse what you can't find or understand later.
  2. Save every query to a folder that is backed up or is under source control.
  3. Have a common library of useful functions that you "promote" something to once it has been reused.
richardtallent
+1  A: 

I think the most important thing (if you do no refactoring it isn't going to happen) is to comment and document your thought process at the time. It will help make the code less impenetrable and help you find the good bits when needed.

pdemarest
+1  A: 

Comments - describe what you were thinking and why you chose to implement something a certain way including what alternatives you considered. There are probably all sorts of fancy solutions but just commenting your code properly at the time you are writing it seems to work the best.

TLiebe
+5  A: 

I disagree with all of the answers saying "write comments". That's being offered as a catch-all for the code itself not being understandable.

Get yourself a copy of Code Complete (Steve McConnell, 2nd edition). If you learn the techniques of writing maintainable code in the first place, it won't take you more time, and you will be able to return to your work later with less trouble.

Which would you prefer:

  • Cryptic code with comments?
  • Mostly OK code without?

I strongly prefer the latter, as the OK code is easier to understand in the situations where the cryptic code was uncommented, and comments are another place that the original developer can make mistakes. The code may be buggy, but it's never wrong.

Once you're comfortable with Code Complete, I'd recommend The Pragmatic Programmer, as it gives slightly higher-level software-development advice.

Novelocrat
Yes, writing maintainable code is very important and I would certainly second the advice on reading Code Complete (I haven't read The Pragmatic Programmer). However, I still think it's important to comment why you made the decisions you did. Things like why you chose a certain sorting algorithm over others (maybe you were expecting the data to be partially sorted already or it wasn't going to be more than X records). Comments can clear up a lot of the "What the heck was I thinking?" questions later on.
TLiebe
If the comments could directly contradict the code, strongly consider removing them. In almost all cases, comments are only useful when they provide information you couldn't easily get by reading the code.
David Thornley
@Novelocrat I agree. Comments are not the primary problem or solution. I'll post my particular problems in more detail shortly but I'm still interested in general experience. (As a result of SO I have now ordered CC and started dipping into it, and will certainly adopt any near-zero-cost approaches.) Like you I prefer the mostly OK code (which this is), and I'm not worried about the tough algorithms being uncommented - they work and I give a single pointer to the generic source.
peter.murray.rust
+1  A: 

I would echo what the others have said about commenting the "why" of the code and its intended usage, but I would also add this:

Code as if you were planning on putting this into production even when you're just messing around. Code for:

  • Clarity and readability
  • Conformance to the coding conventions of the time (naming conventions, etc.). Even though such conventions change over time, if you stick to the standards you're more likely to be able to understand the code later.
  • Security (if applicable)
  • Performance (if applicable)

Particularly, I would stress the first point, but the others are important as well. I find that if I use "test code" later on, I tend to just use it if it works, rather than refactoring it.

David Stratton
+2  A: 

As the excellent answers in your other post indicate, and from my own experience, there is a difficult-to-cross gap between the software used for research and software that has been engineered. In my opinion, Code Complete might help a little, but not much. As an economic question, is it going to be worthwhile to refactor everything for reuse compared to the occasional reward for finding a later use for something? Your balance point may vary.

Here's a practical tip for storing snippets. Instead of full-blown comments, throw in some keywords:

  • "graph isomorphism wrapper"
  • "polymer simulated annealing"
  • "string match feynmann"
  • "equilibrium"

and then put the code somewhere Google-searchable, like a GMail account.
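
If the snippets also live on a local disk, the same keyword idea can be sketched as a small search script. This is a minimal sketch, not part of the original tip: the function name `find_snippets`, the `# keywords:` comment convention and the file names in the demo are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def find_snippets(root, keywords):
    """Return files under `root` whose text contains every keyword.

    Matching is a case-insensitive substring test, so "string match"
    finds a file tagged "# keywords: string match feynmann".
    """
    wanted = [k.lower() for k in keywords]
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # errors="ignore" lets the scan pass over undecodable bytes.
        text = path.read_text(encoding="utf-8", errors="ignore").lower()
        if all(k in text for k in wanted):
            hits.append(path)
    return hits


# Demo on a temporary directory holding two hypothetical snippets.
with tempfile.TemporaryDirectory() as snippet_dir:
    root = Path(snippet_dir)
    (root / "aligner.py").write_text(
        "# keywords: string match feynmann\n", encoding="utf-8")
    (root / "anneal.py").write_text(
        "# keywords: polymer simulated annealing\n", encoding="utf-8")
    demo_hits = [p.name for p in find_snippets(root, ["string match"])]

print(demo_hits)  # ['aligner.py']
```

Requiring every keyword to match keeps the result list short once there are hundreds of snippets; a looser "any keyword" search would only need `any` in place of `all`.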

Edit: I might add that free Google Sites are really searchable wikis that are a good place to put code, either in the form of attachments or pasted in.

Also, I should say that I am a fan of Code Complete and have given copies to grad students writing software for scientific research for several years. It's a good start, but no silver bullet. I'm writing a paper right now on using open source frameworks to solve scientific data management problems and one of the conclusions is that some software engineering expertise is essential for long-running systems. Many scientific projects should probably budget for this from the beginning.

Glenn
+1 This is an excellent idea. If the code had been easily searchable (unlike Sourceforge, where I have found it impossible to search) it would have helped.
peter.murray.rust
Good advice, but *every* such snippet should be defined within the project, in a conspicuous place. To a person who's into the project they're obvious, but to an outsider they are not at all.
Pavel Shved
+1  A: 

No, No, No, No, No!

Do not write throwaway code even in a research environment. Please!

Currently I'm messing with just such "throwaway code", namely the BLAST project. It started as a playground but then happened to become somewhat successful. Now it's a neat tool with many concepts implemented, but the code is virtually unmaintainable. But that's not the main point.

The main point is that you do research so that engineers can later benefit from your findings. Having done good scientific work on a general concept and written a tool that proves it successful, you can easily forget that you're not doing it only for publication and a PhD. You do it for the benefit of mankind. Your code may contain a bunch of "special cases" that were hard to debug, and a set of quirks and hacks that do not fit into any conference article. It's especially important to document and comment such things throughout your code.

If a developer decides to implement your concepts in a commercial product, he can study the quirks and hacks in your code, and the implementation will then have fewer bugs than it might otherwise have had. Everyone says "Wow, his research on A really is useful!" But if you write "throwaway" code, they say "his concept looks nice on paper, but X tried to implement it and drowned in a bunch of bugs".

(EDIT: taken from comments below) To help future developers of your codebase, you don't need much. First, comment what each function does. Second, make sure that every non-obvious fix of a tricky bug is placed in a separate commit in the revision-control system (with an appropriate comment, of course). That's quite enough. And if you even make things modular (even if they're not ready for outright reuse--that's three times more costly, according to Brooks) you will be adored by the engineers who implement your research.

I think that the world would be a better place if researchers threw away their hubris and stopped haughtily thinking that they're not those dirty coders who do the menial job of writing good code. Writing good code is not just a job for those stupid programmers. It is a really valuable thing that everyone should strive for. Without it, your experimental ground, your code, your brainchild will just die.

Pavel Shved
@Pavel Shved. I empathize with your comments. But I shall defend the concept of throwaway code - the one I am talking about came from a sequence alignment project based on Kruskal's book that I set students about 20 years ago - I think in BBC Basic - it has migrated from that to C and then C++ and then to Java and now I am using it for text alignment. At no stage was it ever anything but a learning exercise and a playground. So I own no moral responsibility for its evolution and state. But now I think it has some added value in a different domain and so I am refactoring it.
peter.murray.rust
@peter: No one forces you to be responsible. Writing good code is a good habit and it really doesn't require a lot of time to put it into practice! Tiny descriptions of what the functions do, comments on each module's purpose, a revision-control system that shows me *why* a particular line exists (`svn blame`). This will just make people re-use your work and hence give you credit and attribution.
Pavel Shved
@Pavel. Thanks. It is *because* I wish to behave responsibly that I am raising this issue. FWIW when the original code was written SVN didn't exist :-) and the primitive CVS system we had wasn't maintained.
peter.murray.rust
@peter: well, you got my point. Comment what each function does. And make sure that *every* non-obvious line (resulting from a fix of a tricky bug) has an appropriate comment and is placed in a separate commit. I can assure you -- that's enough for you to be praised by the followers. And if you even make things modular (even if they're not ready for outright reuse--that's three times more costly, according to Brooks) you will be adored by engineers.
Pavel Shved
+4  A: 

[Answering own question] There are several other aspects to the problem which haven't been raised and which I would have found useful when revisiting it. Some of these may be "self-evident" but remember this code was pre-SVN and IDEs.

  • Discoverability. It has been difficult actually to find the code. I believe it's in my SourceForge project but there are so many versions and branches over 7 years that I can't find it. So I would have to have a system that searched code and until IDEs appeared I don't think there was any.
  • What does it do? The current checkout contains about 13 classes (all in one package as it wasn't easy to refactor at the time). Some are clear (DynamicAligner) but others are opaque (MainBox, named because it extended a Swing Box). There are four main() programs and there are actually about 3 subprojects in the distrib. So it is critical to have an external manifest as to what the components actually were.
  • Instructions on how to run it. When running the program, main() will offer a brief commandline usage (e.g. DynamicAligner file1 file2) but it doesn't say what the contents of the files actually look like. I knew this at the time, of course, but not now. So there should be associated example files in sibling directories. These are more valuable than trying to document file formats.
  • Does it still work? It should be possible to run each example without thinking. The first question will be whether the associated libraries, runtimes, etc. are still relevant and available. One ex-coworker wrote a system which only runs with a particular version of Python. The only answer is to rewrite. So certainly we should avoid any lock-in where possible, and I have trained myself (though not necessarily coworkers) to do this.
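
The usage and example-file points above can be sketched together: a command-line help text that says what the input files contain and where the examples live. This is a sketch under assumptions, not the original program: the `dynamic_aligner` name, the one-string-per-line format and the `examples/` paths are all hypothetical.

```python
import argparse


def build_parser():
    """Build a parser whose --help text documents inputs and examples."""
    parser = argparse.ArgumentParser(
        prog="dynamic_aligner",  # hypothetical program name
        description="Align the strings in file1 against the strings in file2.",
        # The epilog states what the files look like (an assumed format)
        # and points at runnable examples in a sibling directory.
        epilog=(
            "Input format (assumed): plain text, one string per line.\n"
            "Worked examples: examples/file1.txt and examples/file2.txt"
        ),
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument("file1", help="reference strings, one per line")
    parser.add_argument("file2", help="strings to align, one per line")
    return parser


# Parse an explicit argument list so the sketch runs without a shell.
args = build_parser().parse_args(["examples/file1.txt", "examples/file2.txt"])
print(args.file1, args.file2)
```

Keeping the format note in the help text means it travels with the program rather than in a separate document that can be lost.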

So how can I and coworkers avoid problems in the future? I think the first step is that there should be a discipline of creating a "project" (however small) when you create code and that these projects should be under version control. This may sound obvious to some of you, but in some environments (academia, domestic) there is a significant overhead to setting up a project management system. I suspect that the majority of academic code is not under any version control.

Then there is the question as to how the projects should be organized. They can't be on Sourceforge by default as the code is (a) trivial and (b) not open by default. We need a server where there can be both communal projects and private ones. I would calculate that the effort to set this up and run it is about 0.1 FTE - that's 20 days a year from all parties (installation, training, maintenance) - if there are easier options I'd like to know as this is a large expense in some cases - do I spend my time setting up a server or do I write papers?

The project should try to encourage good discipline. This is really what I was hoping to get from this question. It could include:

  1. A template of required components (manifest, README, log of commits, examples, required libraries, etc.). Not all projects can run under Maven - e.g. FORTRAN.
  2. A means of searching a large number (hundreds at least) of small projects for mnemonic strings (I liked the idea of dumping the code in Google Docs, and this may be a fruitful avenue - but it's extra maintenance effort).
  3. Clear naming conventions. These are more valuable than comments. I now regularly have names of the type iterateOverAllXAndDoY. I try to use createX() rather than getX() when the routine actually creates information. I have a bad habit of calling routines process() rather than convertAllBToY().
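
Item 1 in the list above could be checked with a few lines of script. The following is a minimal sketch; the required names (`README`, `MANIFEST`, `examples`) and the function name `missing_components` are assumptions chosen for illustration, not an agreed template.

```python
import tempfile
from pathlib import Path

# Hypothetical template: adjust the names to whatever the group agrees on.
REQUIRED_COMPONENTS = ("README", "MANIFEST", "examples")


def missing_components(project_dir, required=REQUIRED_COMPONENTS):
    """Return the required names that do not exist in `project_dir`."""
    project = Path(project_dir)
    return [name for name in required if not (project / name).exists()]


# Demo: a project that has a README and examples/ but no MANIFEST.
with tempfile.TemporaryDirectory() as project_dir:
    project = Path(project_dir)
    (project / "README").write_text(
        "What this experiment does.\n", encoding="utf-8")
    (project / "examples").mkdir()
    demo_missing = missing_components(project)

print(demo_missing)  # ['MANIFEST']
```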

I am aware of but haven't used Git, Mercurial and Google Code. I do not know how much effort these are to set up and how many of my concerns they answer. I would be delighted if there was an IDE plugin which helped create better code (e.g. "poor choice of method name").

And whatever the approaches, they have got to come naturally to people who do not naturally have good code discipline, and to be worth the effort.

peter.murray.rust
Peter, I couldn't agree more with your suggestion about the naming convention. IMHO clear names are by far the most effective way of communicating your intent via code. We have a guideline: take at least twice the time you originally wanted to think up a good name.
LaszloG
Git or Mercurial can help with the discovery and maintenance effort concerns, at least somewhat. If you have some idea of a string that would identify something you're looking for, Git's 'grep' and 'pickaxe' operations can be helpful. As for keeping up a repository, if you can move some files (by whatever means), you can distribute the code with all its history.
Novelocrat
Also, you should really just edit this into the question, as it clarifies, rather than responds to, what you wrote above.
Novelocrat
A: 

You could also borrow the idea of unit tests from the TDD (test-driven development) folks. You need to make sure that the throwaway code actually works OK anyway, so why not express the check like a small unit test? This would have two advantages:

  1. Reading the test code communicates the intent of the throwaway quite clearly: after all it expresses its expectations in the same language: code.

  2. It would also help with the 4th problem of your self-reply: "does it still work?". Well, it's easy: just run the unit tests and they tell you what and where (and, with a bit of luck, why) it doesn't work.
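
A minimal sketch of the idea, using Python's standard `unittest` module. The `edit_distance` and `similarity` functions stand in for a piece of throwaway string-matching code; they are illustrative assumptions, not the code from the question.

```python
import unittest


def edit_distance(a, b):
    """Levenshtein distance: insert, delete and substitute each cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalised score: 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


class EditDistanceTest(unittest.TestCase):
    """Each test name states one expectation about the experiment."""

    def test_identical_strings_have_distance_zero(self):
        self.assertEqual(edit_distance("align", "align"), 0)

    def test_known_example(self):
        self.assertEqual(edit_distance("kitten", "sitting"), 3)

    def test_empty_string_costs_one_insert_per_character(self):
        self.assertEqual(edit_distance("", "abc"), 3)

    def test_float_score_within_tolerance(self):
        # Floating-point results are compared to a tolerance, not exactly.
        self.assertAlmostEqual(similarity("kitten", "sitting"), 0.5714, places=4)
```

Saved in a file (say `test_alignment.py`, a hypothetical name), the tests run with `python -m unittest test_alignment.py`. The test names double as a short statement of what the experiment was meant to do.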

LaszloG
Thanks. I do use unit tests for this purpose. This type of test is normally "has it got to the end without crashing". It's very difficult to check that - say - a complex output with floating point numbers is still correct.
peter.murray.rust
+1  A: 

I've probably missed the point of this whole discussion, I frequently do, but here goes, an invitation for brickbats and downvoting ...

If it's throwaway code, throw it away !

If you don't want to throw it away then follow the good advice above. For me, and I write a fair amount of throwaway code, the question of whether it gets thrown away or put into a reusable state and kept against a rainy day boils down to the economics.

Can I foresee circumstances in which this code will be useful again ? Once in a blue moon, twice a year, every month ?

Will I be able to rewrite this code in less time than it takes to make it reusable ? If the answer to this question is No, then how many times will I have to reuse it to make it worth while enhancing it now ? (Back to the previous question.)

If I do make this code reusable, will I be able to find it again when I next want it ? (Anyone ever had the experience of knowing, with absolute certainty, that somewhere in your code repository there is just the fragment you want, but not having a clue what it was called, nor where to look nor what to grep for ?)

Finally, the 3 step approach to making quickly-written code reusable. Stop after whichever of these steps you like:

1) Document the code as a black-box. Inputs, outputs, operation(s). File this document carefully.

2) Write instructions about how to build/interpret/install the code, in case you ever have to port it. File these instructions carefully.

3) Only if worth the effort -- improve the source code quality to make the code maintainable in future. Make sure the sources are in the source control system and findable.
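
Step 1 can be as light as a structured docstring kept with the code. The sketch below is one possible layout, not a standard; the function `count_shared_lines` and its behaviour are invented purely to show where inputs, outputs and operation would be written down.

```python
def count_shared_lines(lines_a, lines_b):
    """Count the distinct lines that appear in both inputs.

    Inputs:    two iterables of strings; surrounding whitespace is ignored.
    Output:    an int, the number of distinct non-empty lines common to both.
    Operation: strip each line, drop empty ones, intersect the two sets.
    """
    set_a = {line.strip() for line in lines_a if line.strip()}
    set_b = {line.strip() for line in lines_b if line.strip()}
    return len(set_a & set_b)


print(count_shared_lines(["alpha", "beta ", ""], ["beta", "gamma"]))  # 1
```

In an interactive session, `help(count_shared_lines)` prints the same black-box description, so it can be read without opening the source.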

Regards

Mark

High Performance Mark
Quite right, when it comes down to it there is a distinction between throwaway and unfinished code. Throwaway by definition is not sufficiently complex to want or need to keep, so can easily be re-generated in a more enlightened epoch. Throwaway code is a shot of cheap tequila. Unfinished code, on the other hand, REALLY needs good care: wrapped in (commented) cotton wool, kept at a constant temperature in a dark place, turned at regular intervals, and having 'Mastering Regular Expressions' read to it in a soft voice at bedtime.
Alistair Knock