What are some of the nastiest, most difficult bugs you have had to track and fix and why?

I am both genuinely curious and knee-deep in the process as we speak. As they say, misery loves company.

+29  A: 

Race conditions and deadlocks. I do a lot of multithreaded processes and that is the hardest thing to deal with.

Otávio Décio
Yep - that would be my answer. Sprinkle in a complex regular expression or two and a server array. Good times.
+1 - Especially since these tend to be intermittent which can make it a nightmare both to track them down and prove they are fixed.
Justin Ethier
+1  A: 

Difficulty of tracking:

  • off-by-one errors
  • boundary condition errors
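
For anyone who has not been bitten yet, here is a minimal Java sketch of the classic fencepost mistake, which is an off-by-one and a boundary bug at the same time. The class and method names are made up for illustration; nothing here comes from a real code base.

```java
// Hypothetical example: counting the integers in an inclusive range [lo, hi].
class FencepostDemo {
    // Buggy: forgets that both endpoints are part of the range.
    static int countBuggy(int lo, int hi) {
        return hi - lo;
    }

    // Fixed: an inclusive range holds one more element than the difference.
    static int countFixed(int lo, int hi) {
        return hi - lo + 1;
    }

    public static void main(String[] args) {
        System.out.println(countBuggy(3, 7)); // 4, but 3,4,5,6,7 is five numbers
        System.out.println(countFixed(3, 7)); // 5
        System.out.println(countFixed(5, 5)); // 1 (the boundary case: a single element)
    }
}
```

The single-element range is usually the quickest test to write, since the buggy version reports zero elements there.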
Aren't these particularly easy ones?
Fabian Steeg
Yes, only after I've found them.
I find these easily, but thinking back they used to be hard. Experience can really help in where to start looking.
@Richard: Good point. Difficulty is subjective. The two I mentioned occur frequently and you can identify them with some effort. I have only once seen someone track a Heisenbug. I wouldn't know one even if I got one :D
+7  A: 

Threading bugs, especially race conditions. When you cannot stop the system (because the bug goes away), things quickly get tough.
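
To make the race condition concrete, here is a rough Java sketch of the simplest one I know: two threads incrementing a shared counter. `RaceDemo` and its fields are hypothetical names chosen for this example. The plain `int` can lose updates because `count++` is a read, an add and a write; the `AtomicInteger` next to it does not.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo: two threads bump the same counters.
class RaceDemo {
    static int unsafeCount = 0; // plain int: ++ is a non-atomic read-modify-write
    static final AtomicInteger safeCount = new AtomicInteger();

    static void run(int perThread) {
        unsafeCount = 0;
        safeCount.set(0);
        Runnable work = () -> {
            for (int i = 0; i < perThread; i++) {
                unsafeCount++;               // increments from the other thread can be lost
                safeCount.incrementAndGet(); // atomic, never loses an increment
            }
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
    }

    public static void main(String[] args) {
        run(1_000_000);
        System.out.println("unsafe: " + unsafeCount + " (often less than 2000000)");
        System.out.println("safe:   " + safeCount.get());
    }
}
```

The unsafe total changes from run to run and can even come out right, which is the intermittent behaviour that makes these bugs so hard to reproduce on demand.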

+3  A: 
  • Bugs that happen on one server and not another, and you don't have access to the offending server to debug it.
  • Bugs that have to do with threading.
+44  A: 


A heisenbug (named after the Heisenberg Uncertainty Principle) is a computer bug that disappears or alters its characteristics when an attempt is made to study it.

Good link - especially like the Schroedinbug. I swear I have seen these. "There is no way this ever worked".
+2  A: 

Buffer overflows ( in native code )

Matt Brunell
For those, memory checking tools are very very useful. Valgrind saved my ass more times than I care to admit.
+1  A: 

When objects are cached and their equals and hashCode implementations are so poor that the hash code isn't unique and equals returns true for objects that aren't equal.

No problem with hash codes being non-unique (they have to be, a lot more possible object values than hash values). However breaking the rule that hash values are constant for an object, or equal objects always have the same hash value is definitely a problem.
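
A minimal sketch of a key class that keeps both rules, assuming Java. `CacheKey` and its fields are hypothetical and only serve the example. The fields are final, so the hash never changes, and hashCode is built from the same fields as equals, so equal keys always hash alike.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical value class used as a cache key.
final class CacheKey {
    private final String region; // final fields: the hash stays constant for the object's lifetime
    private final int id;

    CacheKey(String region, int id) {
        this.region = region;
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CacheKey)) return false;
        CacheKey other = (CacheKey) o;
        return id == other.id && Objects.equals(region, other.region);
    }

    // Uses exactly the fields compared in equals(), so equal keys share a hash.
    // Unequal keys may still collide, and that is allowed.
    @Override
    public int hashCode() {
        return Objects.hash(region, id);
    }

    public static void main(String[] args) {
        Set<CacheKey> cache = new HashSet<>();
        cache.add(new CacheKey("eu", 42));
        System.out.println(cache.contains(new CacheKey("eu", 42))); // true
        System.out.println(cache.contains(new CacheKey("us", 42))); // false
    }
}
```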
+1  A: 

A friend of mine had this bug. He accidentally put a function argument in a C program in square brackets instead of parentheses, like this: foo[5] instead of foo(5). The compiler was perfectly happy, because the function name is a pointer, and there is nothing illegal about indexing off a pointer.

+11  A: 

Bugs that are not in your code per se, but rather in a vendor's module on which you depend. Particularly when the vendor is unresponsive and you are forced to hack a work-around. Very frustrating!

Sometimes "select" IS broken. I found a bug in one of the Java libraries... Sun's response: "will not fix".
This is especially so with certain game engines that exist out there. I am looking at you Torque!
+13  A: 

Any bug based on timing conditions. These often come when working with inter-thread communication, an external system, reading from a network, reading from a file, or communicating with any external server or device.

Robert P
You mean inter-thread? Threads don't usually communicate with themselves :)
+2  A: 

Last year I spent a couple of months tracking a problem that ended up being a bug in a downstream system. The team lead from the offending system kept claiming that it must be something funny in our processing, even though we passed the data just as they had requested it from us. If the lead had been a little more cooperative, we might have nailed the bug sooner.

+2  A: 

Uninitialized variables. (Or have modern languages done away with this?)

+16  A: 

Bugs that happen when compiled in release mode but not in debug mode.

17 of 26
Subclass of Heisenbugs.
Loren Pechtel
These infuriate me as well.
+5  A: 

We were developing a database to hold words and definitions in another language. It turns out that this language had only recently been added to the Unicode standard and it didn't make it into SQL Server 2005 (though it was added around 2005). This had a very frustrating effect when it came to collation.

Words and definitions went in just fine, I could see everything in Management Studio. But whenever we tried to find the definition for a given word, our queries returned nothing. After a solid 8 hours of debugging, I was at the point of thinking I had lost the ability to write a simple SELECT query.

That is, until I noticed English letters matched other English letters with any amount of foreign letters thrown in. For example, EnglishWord would match Engl!@##$ish$&Word (with !@#$%^&* representing foreign letters).

When a collation doesn't know about certain characters, it can't sort them. If it can't sort them, it can't tell whether two strings match or not (a surprise for me). So frustrating, and a whole day down the drain for a stupid collation setting.

+1  A: 

Machine dependent problems.

I'm currently trying to debug why an application has an unhandled exception in a try{} catch{} block (yes, unhandled inside of a try / catch) that only manifests on certain OS / machine builds, and not on others.

Same version of software, same installation media, same source code, works on some - unhandled exception in what should be a very well handled part of code on others.


Matt Jordan
+2  A: 

The most frustrating for me have been compiler bugs, where the code is correct but I've hit an undocumented corner case or something where the compiler's wrong. I start with the assumption that I've made a mistake, and then spend days trying to find it.

Edit: The other most frustrating was the time I got the test case set slightly wrong, so my code was correct but the test wasn't. That took days to find.

In general, I guess the worst bugs I've had have been the ones that aren't my fault.

David Thornley
Worse is when it's in the OS.
Loren Pechtel
You're right, but my personal worst experiences have been with compiler bugs.
David Thornley
+1  A: 

Cosmetic web bugs involving styling in various browser/OS configurations, e.g. a page looks fine in Firefox and IE on Windows and Mac, but in Safari on the Mac something gets messed up. These are annoying because they require so much attention to detail, and a change that fixes Safari may break something in Firefox or IE, so one has to tread carefully and accept that the styling may end up as a series of hacks to fix page after page. I'd say those are my nastiest ones, and sometimes they just don't get fixed because they aren't viewed as important.

JB King
Most browser bugs are well documented on the web, especially when developing for IE.
Jasper Bekkers
+1  A: 

WAY back in the day, memory leaks. Thankfully, there are a lot of tools to find them these days.

+3  A: 

The hardest ones I usually run into are ones that don't show up in any log trace. You should never silently eat an exception! The problem is that eating an exception often moves your code into an invalid state, where it fails later in another thread and in a completely unrelated manner.
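
To show what I mean by eating an exception, here is a small Java sketch. `SwallowDemo` and the port-parsing scenario are invented for illustration. The first method hides the failure and hands back a plausible value; the second fails loudly and keeps the original exception as the cause.

```java
// Hypothetical parser, used only to contrast the two styles.
class SwallowDemo {
    // Anti-pattern: the failure vanishes and the caller gets a valid-looking value.
    static int parsePortSilently(String text) {
        try {
            return Integer.parseInt(text);
        } catch (NumberFormatException e) {
            return 0; // swallowed: nothing in the log, nothing to trace later
        }
    }

    // Better: report the problem where it happens and keep the cause attached.
    static int parsePort(String text) {
        try {
            return Integer.parseInt(text);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("bad port value: '" + text + "'", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parsePortSilently("80a")); // 0, and nobody knows why
        try {
            parsePort("80a");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage() + ", caused by " + e.getCause());
        }
    }
}
```

With the silent version the bad value travels on and the failure surfaces somewhere unrelated, if it surfaces at all.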

That said, the hardest one I ever really ran into was a C program in a function call where the calling signature didn't exactly match the called signature (one was a long, the other an int). There were no errors at compile time or link time and most tests passed, but the stack was off by sizeof(int), so the variables after it on the stack would randomly have bad values, but most of the time it would work fine (the values following that bad parameter were generally being passed in as zero).

That was a BITCH to track.

Bill K
+2  A: 

There was a project building a chemical engineering simulator using a Beowulf cluster. It so happened that the network cards would not transmit one particular sequence of bytes. If a packet contained that string, the packet would be lost. They solved the problem by replacing the hardware - finding it in the first place was much harder.

+1  A: 

Memory issues, particularly on older systems. We have some legacy 16-bit C software that must remain 16-bit for the time being. The 64K memory blocks are a royal pain to work with, and we constantly add statics or code logic that pushes us past the 64K group limits.

To make matters worse, memory errors usually don't cause the program to crash, but cause certain features to sporadically break (and not always the same features). Debugging is a non-option - the debugger doesn't have the same memory constraints so the programs always run fine in debug mode ... plus, we can't add inline printf statements for testing since that bumps the memory usage even higher.

As a result, we can sometimes spend DAYS trying to find a single block of code to rewrite, or hours moving static chars to files. Luckily the system is slowly being moved offline.

+1  A: 

Multithreading, memory leaks, anything requiring extensive mocks, interfacing with third-party software.

+4  A: 

Memory corruption under load due to bad hardware.

+1 because every weird problem with Linux installs turned out to be bad memory. Which Windows ignored. Someone else here also mentioned "not reporting errors".
+1  A: 

One of the most frustrating for me was when the algorithm was wrong in the software spec.

+1  A: 

For embedded systems:

Unusual behaviour reported by customers in the field, but which we're unable to reproduce.

After that, bugs which turn out to be due to a freak series or concurrence of events. These are at least reproducible, but obviously they can take a long time - and a lot of experimentation - to make happen.

Steve Melnikoff
+2  A: 

The hardest bugs to track down and fix are those that combine all the difficult cases:

  • reported by a third party but you can't reproduce it under your own testing conditions;
  • bug occurs rarely and unpredictably (e.g. because it's caused by a race condition);
  • bug is on an embedded system and you can't attach a debugger;
  • when you try to get logging information out the bug goes away;
  • bug is in third-party code such as a library ...
  • ... to which you don't have the source code so you have to work with disassembly only;
  • and the bug is at the interface between multiple hardware systems (e.g. networking protocol bugs or bus contention bugs).

I was working on a bug with all these features this week. It was necessary to reverse engineer the library to find out what it was up to; then generate hypotheses about which two devices were racing; then make specially-instrumented versions of the program designed to provoke the hypothesized race condition; then once one of the hypotheses was confirmed it was possible to synchronize the timing of events so that the library won the race 100% of the time.

Gareth Rees
+2  A: 

Probably not the hardest, but they are extremely common and not trivial:

  • Bugs concerning mutable state. It is hard to maintain invariants in a data structure if it has many mutable fields. And you have operation order dependency - swap two lines and something bad occurs. One of my recent hard-to-find bugs was when I found that a previous developer of the system I maintained had used mutable data for hashtable keys - in some rare conditions it led to infinite loops.
  • Order of initialization bugs. Can be obvious when found, but not so when coding.
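
A small Java sketch of the mutable-key problem, with hypothetical names (`MutableKeyDemo`, `Key`). It does not reproduce the infinite loop from that system; it only shows the more common symptom, where an entry becomes unreachable once its key changes after insertion.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical key type whose hash depends on a mutable field.
class MutableKeyDemo {
    static final class Key {
        int id; // mutable on purpose, to show the problem

        Key(int id) {
            this.id = id;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).id == id;
        }

        @Override
        public int hashCode() {
            return Integer.hashCode(id);
        }
    }

    // Returns true if the entry can still be found after its key was mutated.
    static boolean findableAfterMutation() {
        Map<Key, String> map = new HashMap<>();
        Key key = new Key(1);
        map.put(key, "value");
        key.id = 2; // the entry stays in the bucket that was chosen for hash 1
        // The mutated key hashes to another bucket; a fresh Key(1) finds the
        // right bucket but no longer equals the stored key.
        return map.containsKey(key) || map.containsKey(new Key(1));
    }

    public static void main(String[] args) {
        System.out.println(findableAfterMutation()); // false
    }
}
```

The map still reports a size of one, so the entry is not gone, just unreachable by lookup.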
+1  A: 

Without a doubt memory leaks. Especially when you do things like dynamically create controls and add handlers in ASP.NET. On Page Load.

+2  A: 

Ever used Crystal Reports?

Gavin Miller
+1  A: 

The hardest one ever was actually a bug I was helping a friend with. He was writing C in MS Visual Studio 2005, and forgot to include time.h. He further called time without the required argument, usually NULL. This implicitly declared time like: int time(); This corrupted the stack, and in a completely unpredictable way. It was a large amount of code, and we didn't think to look at the time() call for quite some time.


Thread leaks. You often forget to count the number of threads.


This is purely fictional, but The Bug by Ellen Ullman is a great tale of a hard to find bug that had tragic consequences.

+1  A: 

One of the hardest bugs I had to find was a memory corruption error that only occurred after the program had been running for hours. Because of the length of time it took to corrupt the data, we assumed hardware and tried two or three other computers first.

The bug would take hours to appear, and when it did appear it was usually only noticed quite a length of time after when the program got so messed up it started misbehaving. Narrowing down in the code base to where the bug was occurring was very difficult because the crashes due to corrupted memory never occurred in the function that corrupted the memory, and it took so damned long for the bug to manifest itself.

The bug turned out to be an off-by-one error in a rarely called piece of code to handle a data line that had something wrong with it (invalid character encoding from memory).

In the end the debugger proved next to useless because the crashes never occurred in the call tree for the offending function. A well sequenced stream of fprintf(stderr, ...) calls in the code and dumping the output to a file was what eventually allowed us to identify what the problem was.

Adam Hawes
+2  A: 

Concurrency bugs are quite hard to track, because reproducing them can be very hard when you do not yet know what the bug is. That's why, every time you see an unexplained stack trace in the logs, you should search for the reason for that exception until you find it. Even if it happens only one time in a million, that does not make it unimportant.

Since you cannot rely on the tests to reproduce the bug, you must use deductive reasoning to find it. That in turn requires a deep understanding of how the system works (for example, how Java's memory model works and what the possible sources of concurrency bugs are).

Here is an example of a concurrency bug in Guice 1.0 which I located just some days ago. You can test your bug finding skills by trying to find out what is the bug causing that exception. The bug is not too hard to find - I found its cause in some 15-30 min (the answer is here).

    at ...

P.S. Faulty hardware might cause even nastier bugs than concurrency, because it may take a long time before you can confidently conclude that there is no bug in the code. Luckily hardware bugs are rarer than software bugs.

Esko Luontola

Character conversion problems on EBCDIC systems where the system mechanism to handle this has been disabled - urgh

Thorbjørn Ravn Andersen