I have some research code that's a real rat's nest, with code duplication everywhere, and it clearly needs to be refactored. However, the code base is evolving as I come up with new variations on the theme and fit them in. I've put off refactoring for so long because I feel that the minute I spend a few days coming up with good abstractions, seeing what design patterns fit where, etc., I'll want to try out some new unforeseen idea that makes my abstractions completely inadequate. In other words, because of the rate at which the code is evolving, I really have no idea where the abstraction lines belong, even though there is no shortage of (approximate) duplication and the general messiness of the code makes adding to it a real pain. What are some general best practices for coping with this kind of situation?

+13  A: 

Don't spend so long refactoring!

When you're about to make a change in a piece of code, consider refactoring it first to make the change easier.

After making the change, refactor again to clean up the damage done by that change.

In both cases, make the refactorings small and do them quickly, and move on.

You don't have to keep your code pristine at all times, but remember that it's easier to go fast if you have well-factored code to work in (and if you have good unit tests, of course).
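
To make "refactor to make the change easier" concrete, here is a minimal sketch in Python. Everything in it is hypothetical (the scoring functions are invented for illustration, not taken from the asker's code): two near-duplicate variants differ only in the transform applied to each sample, so a quick refactoring turns that difference into a parameter before the next variant is added.

```python
import math

# Before: two hypothetical experiment variants that differ only in the
# transform applied to each sample (approximate duplication).
def mean_log_score_before(samples):
    total = 0.0
    for x in samples:
        total += math.log(x)
    return total / len(samples)

def mean_sqrt_score_before(samples):
    total = 0.0
    for x in samples:
        total += math.sqrt(x)
    return total / len(samples)

# After a small, quick refactoring: the part that varies becomes a parameter.
def mean_score(samples, transform):
    """Return the average of transform(x) over samples."""
    return sum(transform(x) for x in samples) / len(samples)

# The existing variants shrink to one-liners...
def mean_log_score(samples):
    return mean_score(samples, math.log)

def mean_sqrt_score(samples):
    return mean_score(samples, math.sqrt)

# ...and the new idea that prompted the refactoring is cheap to add.
def mean_square_score(samples):
    return mean_score(samples, lambda x: x * x)
```

The abstraction here is deliberately tiny: it took minutes, not days, so little is lost if the next idea makes it obsolete.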

Jay Bazuzi
A: 

At the very least, put it in a distributed SCM like Git. That way, when you break something while refactoring, you can bisect back through history to find the last commit before the breaking change, and you can work on changes and commit them in branches without interfering with others' work.

Git's branch merging is great for things like this: you'll find out easily if two people made incompatible changes in parallel, without having to worry about the rest of the code.

For the above reasons, I would also create a separate branch in the repository just for refactoring, and keep it updated regularly. This way, not only will others not interfere with your progress, but they can also keep an eye on it and see changes that will eventually hit the main branch, so they can pre-emptively code around those changes.

Kent Fredric
A: 

If you already know where there is duplication, you don't need several days to refactor it away.

Pete Kirkham
+5  A: 

Test Driven Development:

Red, Green, Refactor. Rinse, repeat.

Since it's one of the steps in every single cycle, you'll notice that's a LOT of usually minor refactoring taking place. That's the way it should be.
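
One cycle might look roughly like the sketch below. It is only an illustration: `normalize` and both tests are hypothetical names made up for this example, and the tests are plain functions with `assert` so that a runner such as pytest could collect them.

```python
# Red: these tests are written first; they fail until normalize() exists.
def test_scales_values_to_sum_to_one():
    assert normalize([1.0, 1.0, 2.0]) == [0.25, 0.25, 0.5]

def test_empty_input_gives_empty_output():
    assert normalize([]) == []

# Green: a first version whose only job is to make the tests pass.
def normalize_first_pass(values):
    total = 0.0
    for v in values:
        total = total + v
    result = []
    for v in values:
        result.append(v / total)
    return result

# Refactor: same behaviour, tidier code; the tests are rerun after the change.
def normalize(values):
    total = sum(values)
    return [v / total for v in values]
```

The refactor step is small because the tests already pin down the behaviour, which is why it can happen in every cycle.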

Epaga
+5  A: 
MadKeithV
A: 

Sometimes a rewrite is the only choice. This seems to be the case.

Andrei Rinea
+2  A: 

Clean up the code a little bit at a time. Whenever you touch a class, try to leave it cleaner than it was before you touched it ("the boy scout rule"). Refactoring is best done in very small steps, but very often.

Things like renaming a variable or splitting a method take only seconds or minutes. Large refactorings, such as splitting or joining classes, may take an hour or two (and you do them in small steps, so that all tests pass at least every five minutes; otherwise you have entered Refactoring Hell and should revert to the last known working state). If it takes you days or weeks to refactor something, then it's no longer "refactoring"; it's more like rewriting.

An article about this topic: http://blog.objectmentor.com/articles/2007/07/20/whats-your-unit-of-measure

Esko Luontola
A: 

The CloneDR finds duplicate code, both exact copies and near-misses, across large source systems, parameterized by language syntax. It supports Java, C#, COBOL, C++, PHP and many other languages.

When it shows a parameterized abstraction of a set of found clones, it is essentially proposing that you refactor the code with that abstraction implemented (as a method, a function, a class, ...).

So running the CloneDR gets a list of potential abstractions to be added to your code, and replacing the clone instances by calls on the abstraction refactors your code thus cleaning it up (somewhat).

Even more remarkably, when it shows the parameter bindings needed at each clone site to invoke the abstraction, it often exposes a bungled clone instance, easily recognized when the bound parameters are conceptually inconsistent. If a parameter is bound to variables named YYYY-MM-DD, and one of them is YY-MM-DD, the "it's a 4-digit year" parameter type looks violated, and in this case there's a broken Y2K remediation. So examining the clone bindings often finds bugs.
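
A hand-written Python illustration of the idea (this is not CloneDR output, and Python is used only for the sketch; all function names are hypothetical): two near-miss clones differ only in a format string, the abstraction turns that difference into a parameter, and listing the value bound at each clone site makes the inconsistent one stand out.

```python
from datetime import datetime

# Two hypothetical near-miss clones: identical except for the format string.
def parse_order_date(text):
    return datetime.strptime(text.strip(), "%Y-%m-%d").date()

def parse_ship_date(text):
    return datetime.strptime(text.strip(), "%y-%m-%d").date()  # two-digit year

# The parameterized abstraction: the part that varies becomes an argument.
def parse_date(text, date_format):
    return datetime.strptime(text.strip(), date_format).date()

# The value bound to the parameter at each clone site. Written side by
# side, the binding that breaks the "four-digit year" pattern is easy to spot.
bindings = {
    "parse_order_date": "%Y-%m-%d",
    "parse_ship_date": "%y-%m-%d",
}
suspicious = sorted(name for name, fmt in bindings.items() if "%Y" not in fmt)
```

Here `suspicious` ends up naming `parse_ship_date`, the clone site whose binding does not fit the others.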

Ira Baxter
Answer removed after being flagged as spam. The question is asking for general practice guidelines. If it was asking for product recommendations, I would let this stay.
Bill the Lizard
The OP asked for suggestions on how to deal with code with lots of duplication and bemoaned his inability to propose abstractions fast enough to cope. This tool specifically addresses his question of what to do ("run a clone detector") and how to respond to abstraction-slowness ("it shows you proposed abstractions"). There are other clone detection tools. Most of them don't find near-miss clones and don't provide a proposed abstraction; all they do is say: "that's a clone..." So I think this response spoke directly to his questions.
Ira Baxter
A: 

This is a very common problem in scientific computing. Some of the most effective ideas for reducing the size and complexity of code require leveraging assumptions, and science demands that you constantly change those assumptions.

All you can do is try to refactor your code as you go, and try not to write yourself into any corners. Also work with good people who understand the value of not making a mess.

James Thompson