Policy for fixing broken nightly builds

views:

answers:

+8 Q:

Policy for fixing broken nightly builds

I guess everybody agrees that having continuous builds and continuous integration is beneficial for quality of the software product. Defects are found early so they can be fixed ASAP. For continuous builds, which take several minutes, it is usually easy to find the one who caused the defect. However, for nightly integration tests, which take long time to run, this may be a challenge. Here are specifics of the situation, for which I'm looking for an optimal solution:

Running integration tests takes more than 1 hour. Therefore they are run overnight. Multiple check-ins happen every day (team of about 15 developers) so it is sometimes difficult to find the "culprit" (if any).
Integration testing environment depends on other environments (web services and databases), which may fail from time to time. This causes integration tests to fail.

So how to organize the team so that these failures are fixed early? In my opinion, there should be someone appointed to DIAGNOSE the defect(s). This should be the first task in the morning. If he needs an expertise of others, they should be readily available. Once the source (component, database, web service) of the failure is determined, the owner should start fixing it (or another team should be notified).

How to appoint the one who diagnoses the defects? Ideally, someone would volunteer (ha ha). This won't happen very often, I'm afraid. I've heard other option - whoever comes first to the office should check the results of the nightly builds. This is OK, if the whole team agrees. However, this rewards those who come late. I suppose that this role should rotate in the team. The excuse "I don't know much about builds" should not be accepted. Diagnostics of the source of the failure should be rather straightforward. If it is not, then adding more of diagnostics logging to the code should improve the visibility into integration test failures.

Any experience in this area or suggestions for improvements of the above approach?

+3 A:

What I generally do (I've done it for a team of between 8 and 10 persons) is two have one guy that checks the build, as the first thing he does in the morning -- some would say he is responsible for QA, I suppose.

If there is a problem, he's responsible for finding out what/how -- of course, he can ask help from the other members of the team, if needed.

This means there's at least one member of the team that has to have a great knowledge of the whole application -- but that's not a bad thing anyway : it'll help diagnose problems the day that application is used in production and suffers a failure.

And instead of having one guy to do that, I like when there are two : one for one week, the other for the second week -- for instance ; this way, there are greater chances of always having someone who can diagnose problems, even if one of them is in holidays.

As a sidenote : the more useful things you log during the build, the easier it is to find out what went wrong -- and why.

Why not let everyone in the team check the build every morning ?

Well, not every one wants to, first of all -- and that will be done better if the one doing it likes what he does
And you don't want 10 people spending half an hour every day on that ^^

Pascal MARTIN 2009-12-22 15:07:32

+2 A:

In your case I'd suggest whoever is in charge of the CM. If that is the manager or technical lead who has too many responsibilities why not give it to a junior developer? I wish someone had forced me early in my career to get to know source control more thoroughly. Not only that but looking at other people's code to track down a source of error is a real skill building or knowledge learning exercise. They say you gain the most from looking at other people's code and I'm a firm believer of this.

wheaties 2009-12-22 15:11:11

+6 A:

A famous policy about broken nightly builds, attributed to Microsoft, is that the guy whose commit broke the build becomes responsible for maintaining nightly builds until someone else breaks it.

That makes sense, since

everyone makes mistakes, so the necessary rotation will occur (empowered with Least-Recently-Used choice pattern for ambiguous cases)
it encourages people to write better code

Pavel Shved 2009-12-22 15:16:18

I'd be tempted to suggest splitting things up in either of a couple of ways:

Time split - Assuming that the tests could run twice a night, why not run the tests against the code at 2 different time points,i.e. all the check-ins up to X p.m. and then the remainder, so that could help narrow down where the problem is.
Team split - Can the code be split into smaller pieces so that the tests could be run on different machines to help narrow down which group should dig into things?

This assumes you could run the tests multiple times and divide things up in such a way so it is a rough idea.

JB King 2009-12-22 15:16:57

+1 A:

Pair experienced with unexperienced

You may want to consider having pairs of developers diagnose the broken builds. I've had good luck with that. Especially if you pair team members who have little familiarity with the build system and team members who have significant familiarity. This may reduce the possibility of team members saying "I don't know much about builds" as a way to try and get out of the duty, and it will decrease your bus number and increase collective ownership.

Give the team a choice of your assigned solution or one of their own making

You could put the issue to your team and ask them to offer a solution. Tell them that if they don't come up with a workable solution, you will make a weekly schedule, assigning one pair per day and making sure that everyone has the opportunity to participate.

JeffH 2010-01-11 18:22:34

+1 A:

Practice continuous integration so you don't need infrequent mega-builds ** you can distribute builds between machines if it's too slow for one machine to do
Use a build status monitor so that whoever checked something in can be made responsible for build failures.
Have an afternoon check-in deadline

Either:

Nobody checks-in after 5pm

Nobody checks-in after 5pm unless they're prepared to stay at work until their build passes as green - even if that means working on, committing a fix and waiting for a rebuild.

It's much easier to enforce and obey the first form, so that's what I'd follow.

Members of a former team of mine actually got phoned up and told to return to work to fix the build... and they did.

cartoonfox 2010-01-14 12:15:59

We have a stand-up meeting every morning, before starting work. One of the things on the checklist is the status of the nightly build. Our build system spits out an email after it's run, reporting the status, so this is easy to find out - as it happens, it goes to one guy, but really it should go to everyone, or be posted onto the project wiki.

If the build is broken, then fixing it becomes a top-priority task, to be handled like any other task, which means that we'll decide at the standup who is going to work on it, and then they go and do so. We do pair programming, and will usually treat this as a pair task. If we're short-staffed, we might assign one person to investigate the breakage, and then pair someone with him to fix it a bit later.

We don't have a formal mechanism for assigning the task: we're a small team (six people, usually), and have collective code ownership, so we just work it out between ourselves. If we think one particular pair's checkin broke the build, it would usually be them who fix it. If not, it could be anyone; it's usually decided by seeing who isn't currently in the middle of some other task.

Tom Anderson 2010-01-14 18:28:10

ansaurus

tags:

views:

answers:

Policy for fixing broken nightly builds

related questions