On a recent round of interviews, one interviewer at "a leading brand technology company" asked - as an ice breaker - this question. I thought it was a good question, worth asking (and warning) others.

Interesting can mean a multitude of things, so here are a few suggested constraints to make this a valid question:

  • minor cause, major effect
  • unexpected place in the code
  • deterministic cause, non-deterministic effect
  • non-deterministic cause, deterministic effect

Props to answers which admit fault.

I don't think this necessarily means:

  • the hardest to track down
  • the strangest symptoms
  • the dumbest programming errors

What's important is that the bug is interesting, and that you fixed it.

+5  A: 

February 29th bug / End Of Month Calculation bug.

It wasn't hard to fix; it just wasn't obvious, because the bug only turned up once a month and then mysteriously vanished again.

Code was using PHP's mktime function and putting "31" in the date field.

Created all sorts of interesting problems, which of course, only lasted a few days.

Recognising that it was in fact a date calculation bug and the customer wasn't however inhaling petrol... that was the challenging part.
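A minimal sketch of why this kind of bug only shows up at month end, written in Python rather than PHP. It assumes the code relied on mktime's documented behaviour of rolling out-of-range days over into the next month; the function names are hypothetical.

```python
import calendar
from datetime import date, timedelta

def naive_last_day(year, month):
    # Mimics "day 31 of this month" with mktime-style overflow:
    # days past the end of the month spill into the next month.
    return date(year, month, 1) + timedelta(days=30)

def real_last_day(year, month):
    # calendar.monthrange returns (weekday of the 1st, number of days in month).
    return date(year, month, calendar.monthrange(year, month)[1])

print(naive_last_day(2009, 1))  # a 31-day month: looks fine
print(naive_last_day(2009, 2))  # February: lands in March
print(real_last_day(2009, 2))   # the actual last day of February
```

In 31-day months both functions agree, which is why the symptom appears and disappears on its own.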

Kent Fredric
"it was in fact a date calculation bug and the customer wasn't however inhaling petrol..." haha! I know exactly what you mean :-)
Jay
hahahah - "inhaling petrol"... I've had *LOTS* of customers doing that, sadly :-\
warren
+2  A: 

It was from a telecom company with their biggest competitor.
A bill came out missing a couple of million because of a flaw in Oracle.
The software looked OK, and after spending a weekend on it with a friend/colleague we found out that a power cut had happened, so that the row number of the index had a gap.
We even had a call from the CEO telling us we could have whatever help we wanted.
We didn't need any :)
Solving that was kind of cool.

Dr. Hfuhruhurr
+7  A: 

This piece of code caused us real pain:

          int lpr = receivedData[9] & 0x07 - 1;

The variable LPR stands for "Last Pointer Register", a value inside a device EPROM that points to the last register holding some data. The variable could have a value between 1 and 4, and it was encoded in the 3 least significant bits of the byte accessed. The expression above was meant to give an LPR starting from 0, so it could be used as a base index into the data structure.

The problem here is operator precedence. Unless you recall the precedence rules clearly, you would conclude that the expression gives the expected result, but subtraction binds more tightly than bitwise AND, so the expression was really doing this:

          int lpr = receivedData[9] & 0x06;

What is more evil is that this expression worked for odd values of LPR, while for even values the register accessed was the next one.

  • LPR = 1 --> LPR-0-based = 0
  • LPR = 2 --> LPR-0-based = 2 !!
  • LPR = 3 --> LPR-0-based = 2
  • LPR = 4 --> LPR-0-based = 4 !! (beyond data zone)
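The table above can be reproduced with a short sketch. Python happens to share the relevant precedence rule (binary minus binds more tightly than `&`), so it is used here as a stand-in for the original code; the function names are hypothetical.

```python
def lpr_buggy(byte):
    # Parsed as byte & (0x07 - 1), i.e. byte & 0x06.
    return byte & 0x07 - 1

def lpr_fixed(byte):
    # Parentheses force the mask to be applied before the subtraction.
    return (byte & 0x07) - 1

for lpr in range(1, 5):
    print(lpr, lpr_buggy(lpr), lpr_fixed(lpr))
```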

The results were puzzling, but we never thought we had a BUG in our code since we hadn't much trust in the device that wrote EPROM data either.

The most painful part is that, for more than three years, the combination of our application and the external device never worked properly in a production system, supposedly because of a hardware malfunction in the external device.

Fernando Miguélez
A: 

all bugs which involve multi-threading, multi-processing and concurrency..

E.
This doesn't answer the question. Downvoting.
jamesh
+1  A: 
Johannes Schaub - litb
Can you explain how that works? Or any pointers to info on the net?
sundar
I have no idea how it works. Sorry.
Johannes Schaub - litb
+5  A: 

I had to debug a deadlock caused by a finalizer in Oracle's .NET code.

We had a stored proc with an in/out parameter. We were putting in a string, but getting a clob back. The clob implemented IDisposable - but we were never actually using it, so didn't notice. (I still don't know why the parameter was in/out in the first place.)

So, we had these clobs being created for us "invisibly" - we were disposing of everything we'd created, but not the clobs. The clob had an implicit connection to the database... and something in the release code wasn't thread-safe. We'd get a complete deadlock between a thread trying to acquire a connection from the pool, and the finalizer trying to release it.

Fix: dispose of anything disposable in the command's parameter collection, rather than just the command itself. Ick. (Oh, and leave in a comment for someone to investigate why it was in/out at some point - I didn't have time then, and I'm not working there now.)

Jon Skeet
A: 

Detection of memory loss due to a bug in the Weblogic JMS service was quite interesting, since it caused server instability for quite a long time.

This finally had to be raised to Weblogic and we got the fix as a service pack.

Nrj
+3  A: 

Very rarely (we hit it about 1 out of 10,000 times), we generated data to give to g++'s implementation of std::sort which caused the algorithm to use heap_sort, as the quicksort it used was going badly.

A bug in the debugging version of heap_sort caused O(n^2) instead of O(n log n) performance.

For those who think that's not too bad, for sorting a million objects, it increased the runtime from a fraction of a second to about 10 minutes.

So to summarise:

A bug which occurred: A) Only when debugging B) About 1 out of each 10,000 program runs C) Because of heap_sort (which we assumed worked, and weren't directly calling anyway).

Caused our code to hang for about 10 minutes.

We eventually found the bug by getting the program to carefully record all sources of randomness in the program, then automatically running it thousands of times until we found a bad run, and then debugging that run.
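The record-and-replay approach can be sketched roughly as follows. This is not the original harness; it assumes all randomness can be funnelled through one seeded generator, and `find_failing_seed`, `run` and `is_bad` are hypothetical names.

```python
import random

def find_failing_seed(run, is_bad, attempts=10_000):
    """Call run(rng) with recorded seeds until is_bad(result) is true."""
    for seed in range(attempts):
        rng = random.Random(seed)  # every random choice comes from a known seed
        if is_bad(run(rng)):
            return seed            # replay later with random.Random(seed)
    return None                    # no bad run found in this many attempts
```

Once a bad seed is known, the same run can be repeated under a debugger as often as needed.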

The fix is now in the latest version of g++, but as the program is open source we've had to switch to using our own sort until we can be sure all old dodgy versions of g++ have gone away (which could well take a long, long time).

Chris Jefferson
A classic Heisenbug. I hates them.
jamesh
+1  A: 

Our company had a problem with the software causing a freeze at VERY intermittent intervals. Really horrible because the software worked nearly all of the time which made it a nightmare to debug. This problem had plagued us years before I even joined the company.

On the face of it the code appeared to work perfectly well, either when debugging or when used in release mode.

A few weeks grunt work and liberal use of logging revealed that the problem was with the setting of RS-232 timeout intervals, which for all FPGA commands except one were adequate. This setting worked almost all of the time except for a few isolated moments, such as when giving demonstrations to potential customers :(

This bug was interesting in that resetting the timeout interval cured the problem for this particular software release, but a major code cleanup/refactoring done later also cured the problem, regardless of any 'wrong' timeout settings!

AndyUK
+3  A: 

This is an ASP.Net bug which still lives today. Our application was running in .Net 2.0 on a Windows Server 2003 machine. While developing and testing everything was great. Going to staging was also perfect, but when we got to pre-prod something odd started showing up from time to time.

From time to time the execution context would switch from 2.0 to 1.1, causing the application to crash. We checked that all the settings were correct (Web.Config, AppPool, Asp.Net version) and that Asp.Net was properly installed.

A colleague of mine noticed that the application worked properly only half the time; the other half it would load the application in a 1.1 context. He found that everything depended on which page was hit first to load up the application. If the first page was an ASPX page then everything would load with the right context. But if the page was a Classic ASP page, the AppPool would load in a 1.1 context and crash whenever it tried to read XML.

I hope this can help someone out there, we spent 2 weeks trying to figure it out.

Alexandre Brisebois
+4  A: 

Not so much debugging, but we did diagnose a bug with Excel once.

A client was testing our application. As part of the test process, they'd load a bunch of data in, do some work with it, then replicate the same steps in Excel. Turns out they'd found out that we were out by a few cents in a huge data processing job.

"floating point error" was my initial thought. Checked through the code, safe usage of FP arithmetic etc, we were stumped, and were banging away at it for days trying to reproduce it. We opened up Excel and replicated what they had done, but couldn't find the bug there, but they were adamant there was a difference.

We eventually went to their office and asked them to demonstrate what they had done.

Turns out what they were doing was highlighting the entire worksheet (all 5000 rows, and 30 columns) and then narrowing the selection to the column they wanted - And in doing that, it exposed a FP bug in Excel - it loses precision, and the FP arithmetic is out by exactly the amount they were telling us.

We went back to the office happy that our app was right, and Excel was wrong.

madlep
A: 

Not too long ago a problem landed in my lap - a part of our web application that had been working fine for years had suddenly stopped working. No changes to the code had been made.

After a fair amount of digging I discovered that, buried somewhere in SQL, a function or stored proc (I forget which) was being called that was expecting a datetime as one of the parameters. Instead of that, it was receiving an int from an Identity column.

Somehow all this time SQL was getting this ID column and managing to parse it as a datetime, but eventually this int got too big and it fell over with an OutOfRange type exception.

Of course the weirdest part was the rest of the code ran fine and seemed no different once the code was fixed to pass the correct params.

Valerion
As for the int to datetime thing, I'm guessing the SQL server thought it was a (4 byte) UNIX timestamp.
R. Bemrose
+1  A: 

This happened a couple of years ago, but so far one of the more interesting ones I have worked on was figuring out why data coming back from a scientific formula calculator was incorrect.

When I sat down and started looking at the code I found it to be Visual Basic 6 (Visual Basic .NET had just come out) code with little to no comments. Stepping through the code I found the area where the calculations were done, and everything looked alright, as the math was correct and the data was being moved around as a Double. However, when I actually sat down and watched the code executing in depth I noticed that the variables were losing a bit of data on some of the calculations. Now normally, this isn't a big deal, as the Double in Visual Basic 6 can store values from -1.79769313486232E+308 to 1.79769313486232E+308; however, in this context, precision was needed and loss of that precision was bad. After doing a bit of research, I found that the Variant data type in Visual Basic 6 actually has a way of storing Decimal data using CDec. A couple of revisions to the code, some testing, and a discussion with the scientists to explain exactly how precise they could expect the application to be, and everything was good to go.

Things are a bit different now in Visual Basic .NET, but About.com actually has a pretty interesting short article on the Decimal data type.
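The same Double-versus-Decimal trade-off can be seen in a few lines of Python, used here only as an illustration; Python's `float` is a binary double and `decimal.Decimal` stands in for the VB6 Decimal subtype.

```python
from decimal import Decimal

# Binary doubles cannot represent 0.1 or 0.2 exactly, so the sum drifts.
float_sum = 0.1 + 0.2

# Decimal keeps the base-10 digits exactly as written.
decimal_sum = Decimal("0.1") + Decimal("0.2")

print(float_sum == 0.3)
print(decimal_sum == Decimal("0.3"))
```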

Rob
+2  A: 

Long time ago, and in a galaxy far, far away... Well, back when I was doing COBOL anyway, there was this big COBOL program that displayed vehicle accident reports and usually produced the correct results, but once in a while it would get certain things wrong. It seemed random, but a pattern emerged eventually, though it was still pretty inscrutable. Officer-originated reports never failed. Citizen-originated reports messed up, but only occasionally. A number of coders had a crack at it, but nobody could find the source of the problem, and since it usually worked and since citizen reports were a small minority of the reports entered into the system, the project manager finally shrugged his shoulders and just let it ride.

When I came on-staff, the PM decided I needed to have a crack at the problem. Fresh eye and all that. I discovered that the two types of reports were generated by different subroutines that were similar, but not identical, and both of them were over 2,000 lines long, much of which was inside a triply-nested for-next loop. I acquired a certain amount of modest fame as a miracle worker when I found that the citizen subroutine was missing a single instruction at the bottom of one of the loops. It took just one hour of work, too, which made it look even better.

All I did was use the "exclude lines" function in the SPF editor to look at both the top and bottom of the triple loop, and the problem became completely obvious.

Cyberherbalist
A: 

Once I installed Ubuntu GNU/Linux over MS Windows XP.

zvoase
Haha. Nice fix :D
jamesh
I'd consider a down-vote a good price to pay for what is in my opinion the best possible answer to this question
Alex Brault
Ditto. Counter-upvoted. :-)
Adam Liss
+2  A: 

In or around 1998 I decided to use a third party vendor library to do registry access in a VB6 SQL Server Classic ASP web application. Everything worked fine on the dev box, but crashed on the server. After several weeks I discovered that, inside their dll, the third party vendor developers were using HKEY_CURRENT_USER to store application data.

Charles Bretana
A: 

Years ago, when TCP/IP was just starting to take over the corporate world, one of our developers spent several weeks at a customer site trying to understand why a new product worked perfectly until it spontaneously stopped responding to all network traffic after an apparently random length of time, typically between 4 and 10 days. Every other module was completely unaffected.

The cause was a wayward piece of customer equipment that sent BOOTP packets to port 0. (RFC 951, anyone?) Rather than discarding these packets, our 3rd-party stacks erroneously queued them for processing by a (nonexistent) task. Each packet monopolized an entire buffer, so whenever the offending device rebooted for the 20th time, our buffers were permanently exhausted and we became deaf.

The fix was about a dozen keystrokes. The customer was across the ocean. In Finland. In winter.

Adam Liss
+3  A: 

Way-back-when, I wrote a function for a game I was working on in my spare time, to return the distance between two tiles on a slightly odd square/hexagonal grid. The grid looked hexagonal in the game, but was stored as a plain ol' square grid - every other line was staggered and you needed to know if a tile was on an odd or an even row before you could tell what its neighbours were.

Still, reasonably simple I thought - the function just tested all directions, scoring each using a matrix designed to account for the quirk, and moved into the best scoring for the next iteration. What could go wrong with that?

It worked for all cases except one, and then only failed for half of those cases, and then only by a distance of one tile. If it had just spat out random garbage it would have been so much less frustrating. Anyway, I spent days, literally days, poring over the code and the matrices, rewriting them from scratch only to see them coming out identical, swearing at them, the works.

When I found the bug I didn't know whether to laugh or cry - I had used a 'greater than' instead of a 'greater than or equal to' in one of the loops, giving precedence to the first of equal candidates instead of the last. I had even used '>=' in all my notepad pseudo-code, right from the first. I had just never noticed the difference between pad and screen. It was enough to steer the path wrong, avoiding a slight shortcut that was to be had in just the right circumstances. One single press of the equals key and a couple of mouse-clicks later, et voila - the hardest bug I've ever had to grok was also the easiest to fix...
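For anyone wondering how one character changes the winner, here is a minimal sketch of the tie-breaking difference. It is not the original game code; the function names and the score list are made up for illustration.

```python
def best_first(scores):
    best = 0
    for i in range(1, len(scores)):
        if scores[i] > scores[best]:   # strict: the earliest of equal scores wins
            best = i
    return best

def best_last(scores):
    best = 0
    for i in range(1, len(scores)):
        if scores[i] >= scores[best]:  # non-strict: the latest of equal scores wins
            best = i
    return best

scores = [3, 5, 5, 1]                  # directions 1 and 2 tie
print(best_first(scores), best_last(scores))
```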

jTresidder
That kind of thing always makes me feel so stupid, especially after getting angry with the computer. As they say, garbage in, garbage out...
Erik Forbes
Been there, done that... I don't remember what it was (too long ago) but it was also a single character somewhere.
some
+6  A: 

I was dealing with some very old C code at our company which a few users were reporting bugs against. One of our tests was failing on their machines and we just simply could not reproduce it in-house.

After scanning over the list of bug reports, I realized that all of them had come from international users, specifically in Eastern Europe and Asia.

The problem turned out to be timezone related, and Windows specific as well. The mktime function apparently doesn't handle dates pre-epoch on Windows, and one of our test cases was for dates very near the epoch. For users in specific timezones, the specific date would become before the epoch, and the test would blow up on Windows.

Once I had a hunch of what the problem was, debugging involved "spoofing" my timezone by setting the TZ environment variable to confirm my suspicion, and then add some special checks for mktime deficiencies on Windows.

Figuring out what the heck was going on was pretty difficult, but it was pretty satisfying to fix such a peculiar bug in the end.
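A small sketch of why only some timezones were affected, in Python rather than C. It uses timezone-aware datetimes, so it shows the arithmetic only and does not call the platform's mktime; `epoch_seconds` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

def epoch_seconds(year, month, day, utc_offset_hours):
    # Local midnight in a zone with the given fixed UTC offset,
    # expressed as seconds since 1970-01-01 00:00 UTC.
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime(year, month, day, tzinfo=tz).timestamp()

print(epoch_seconds(1970, 1, 1, -5))  # west of UTC: after the epoch
print(epoch_seconds(1970, 1, 1, 8))   # east of UTC: before the epoch
```

East of UTC, local midnight on 1 January 1970 is already a negative timestamp, which matches the pattern in the bug reports.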

Scott Wegner
+1  A: 

I was once hired on a part-time basis to try and fix a problem with an online logging program for a school network. For some reason, late at night, the program would crash and data would occasionally become corrupted, sometimes for the entire day.

After ensuring that data was saved before the crashes I couldn't find out what was causing the system to crash, as there shouldn't be any traffic on logged-out machines, except for automatic updates. It was about 10PM and I was about to call it a night when the administrator walked in with a large bag of Doritos and some dip.

The bug was the administrator, who liked to come into the school late at night and download movies, games and music whilst the network was free.

EnderMB
A: 

removed network hotspot using setsockopt()'s *IP_MULTICAST_IF* option.

6.3 *IP_MULTICAST_IF*.

Usually, the system administrator specifies the default interface multicast datagrams should be sent from. The programmer can override this and choose a concrete outgoing interface for a given socket with this option.

struct in_addr interface_addr;
setsockopt (socket, IPPROTO_IP, IP_MULTICAST_IF, &interface_addr,
            sizeof(interface_addr));

From now on, all multicast traffic generated in this socket will be output from the interface chosen. To revert to the original behavior and let the kernel choose the outgoing interface based on the system administrator's configuration, it is enough to call setsockopt() with this same option and INADDR_ANY in the interface field.

In determining or selecting outgoing interfaces, the following ioctls might be useful: SIOCGIFADDR (to get an interface's address), SIOCGIFCONF (to get the list of all the interfaces) and SIOCGIFFLAGS (to get an interface's flags and, thus, determine whether the interface is multicast capable or not -the IFF_MULTICAST flag-).

If the host has more than one interface and the IP_MULTICAST_IF option is not set, multicast transmissions are sent from the default interface, although the remaining interfaces might be used for multicast forwarding if the host is acting as a multicast router.

plan9assembler
+27  A: 

In the summer of 1973, I'd just graduated from college and started working at Data General, which developed minicomputer hardware and software. Being low person on the totem pole, I was presented with "There's a professor at U of X with a Nova 800 and 8020 Floating Point Processor who reports that in one out of every million or so invocations, the Fortran sin function returns a number greater than 1". This stinker had rattled around DG's language runtime group for awhile, but they'd accomplished little more than writing a proof that it couldn't be their fault. I started with a minimum faulting Fortran program, reduced it to a minimum faulting assembler program, and with continued reduction got to a reasonably frequent failure rate. The good news was that I could scope any signal in the CPU or FPP; the bad news was that logic analyzers had yet to be invented and storage scopes were slow, so I spent a lot of time with my head under a hood peering at very faint traces. Eventually, I tracked the problem to data-specific meta-stability in a D flipflop in the FPP: its clock and data were entirely asynchronous, but the meta-stability only lasted long enough to cause trouble when one particular double-precision value was being returned to the CPU -- a relationship attributable to small variations in the flipflop's power supply caused by the nearby register storing that value. Adding one stage of synchronous delay reduced the predicted failure incidence to once per epoch.

A tech was dispatched to modify the professor's floating point processor, after which there were no further sins of trouble.

Dave
+1 for literally going inside the processor
some
+1 for the pun.
Andy Mikula
+1 because, although I read that, I don't understand a word.
Noon Silk
+1  A: 

One of the more interesting bugs I've found concerned SQL queries used in some reports. The report was identifying items by a serial number, and taking input from an HTML form that included a list of serials. I'm by no means a SQL expert, but as the report seemed to be taking excessively long to run, I decided to take a look at the query.

The SQL query was being generated from the inputs in a form similar to the below (pseudo-SQL):

SELECT foo WHERE 
    serialnum = '1111' 
    OR serialnum = '1112' 
    OR serialnum = '1113' 
  [...]

I discussed this with a friend who had more SQL knowledge and he pointed out this would be more efficiently expressed with IN instead of a chain of ORs:

SELECT foo WHERE
    serialnum IN (1111,1112,1113,[...])

For the heck of it, I took one of our generated production queries, and rewrote it to use IN instead of OR chains, and tested it. I wasn't sure how much performance difference (if any) to expect, but I was surprised to find using IN turned out to be at least 10x faster even on a smaller query. Obviously, the two queries aren't exactly the same semantically, but even so this was a pretty unusual "fix" to the performance issues we were experiencing.

Note: I'm sure there were plenty of other issues such as inefficient indexes and other query optimizations to be done. No need to preach ;) This just happened to be the first thing I ran across and had by far the largest effect.
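A rough sketch of generating both query shapes from the same list of serials, using Python's built-in sqlite3 with bound parameters. The table name `items` and the sample rows are made up for illustration, and this says nothing about relative speed on the original database; it only shows that the two forms select the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (serialnum TEXT, foo TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(str(n), f"item-{n}") for n in range(1110, 1120)])

serials = ["1111", "1112", "1113"]

# Chain of ORs, one placeholder per serial.
or_clause = " OR ".join("serialnum = ?" for _ in serials)
or_rows = conn.execute(
    f"SELECT foo FROM items WHERE {or_clause} ORDER BY foo", serials).fetchall()

# Single IN list, same placeholders.
placeholders = ", ".join("?" for _ in serials)
in_rows = conn.execute(
    f"SELECT foo FROM items WHERE serialnum IN ({placeholders}) ORDER BY foo",
    serials).fetchall()

print(or_rows == in_rows)
```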

Jay
Sounds like WTF to me.
jamesh
yeah - they return identical results, but IN is faster? who knew? OTOH, makes you think that way in the future :)
warren
A: 

On a webform I built, some of the buttons wouldn't fire their event handlers until the second time they were clicked.

After a couple hours of troubleshooting, the closest I had gotten was having JavaScript click one of the buttons on page_load.

Ends up it was a bug in the AJAX Control Toolkit. Apparently Accordions don't like to be in UpdatePanels. Putting the UpdatePanels in the AccordionPanes instead fixed it.

Weird.

tsilb
+1  A: 

I just had a crash course in .NET assembly lazy loading due to this bug.

For a little background, I've ripped System.Core/LINQ out of Mono to use in .NET 2.0, since our users are still on Win2K and can't use .NET 3.5 until we finish upgrading them all to XP. I'm writing a new rules engine using CS-Script that's working great. The following code worked fine:

if(someCrapInvolvingTheOldRulesEngine)
{
    // snipped
}

if(useNewRulesEngine)
{
    InitializeNewRulesEngine();
}

I needed to rewrite the first if block, which would depend on the new rules engine, so I switched the order of the "if" blocks, and suddenly, the dynamic compilation fails and completely crashes my app, saying it can't find the namespace "Linq".

After pounding my head on the desk for 2 hours (not an effective debugging technique), an idea hit, which solved the problem, and clued me in as to what was happening. I added the following line:

if(useNewRulesEngine)
{
    List<int> unused = (new[]{1,2,3}).ToList();
    InitializeNewRulesEngine();
}

if(someCrapInvolvingTheOldRulesEngine)
{
    // snipped
}

That one additional line solved everything. Apparently, until I called that, System.Core wasn't loaded, and CS-Script couldn't find the assembly to reference. The "someCrapInvolvingTheOldRulesEngine" function in another assembly actually used System.Core, so it was loaded deep in that code. The old and new code have no dependencies on each other, so it was just baffling to me that simply switching the order of two unrelated "if" blocks could completely break my application.

I wound up moving the new line of code inside my RulesEngine initialization code with a "DO NOT REMOVE OR I WILL END YOU" comment, where it's going to have to stay until the users are able to use .NET 3.5 and can find the real System.Core in the GAC.

Chris Doggett
Gah. Lazy loading/class loading is magic when it works. Horrible to debug. Great find.
jamesh
+2  A: 

No karma to comment on answers, but the rough explanation for this:

#define DETECTNULL(X) (((X) - 0x01010101) & ~(X) & 0x80808080)

The right hand side will set the high order bit for each byte if and only if it's not set in X. We AND this against the left hand side, and will get a non-zero value if and only if one of the high order bits for one of the bytes is also set on the left hand side.

So how can that happen? Think about a single byte - if X >= 0x81, 0x80 will still be set (0x81 - 1 = 0x80). Or if X = 0, you'll get 0xFF (-1), setting the high order bit.

So our condition is (X-1) sets the high order bit (left hand side), and the high order bit was not already set (right hand side) - translation, if subtracting one from the byte causes it to roll to a negative number, and hence it was a 0 byte.

In the X >= 0x81 case, the bit on the right hand side is NOT set, and thus you get 0.
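To check the explanation, here is the same expression as a Python sketch. Python integers are unbounded, so the value is masked to 32 bits by hand; that masking and the name `detect_null` are assumptions added for this port.

```python
def detect_null(x):
    """Nonzero if and only if some byte of the 32-bit value x is 0x00."""
    x &= 0xFFFFFFFF
    return ((x - 0x01010101) & ~x & 0x80808080) & 0xFFFFFFFF

print(hex(detect_null(0x41424344)))  # "ABCD": no zero byte
print(hex(detect_null(0x41420044)))  # one zero byte
print(hex(detect_null(0x81818181)))  # high bits already set, still no zero byte
```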

James
+1  A: 

One of our web apps was returning seemingly random numbers. In JavaScript, if you call parseInt without a radix and give it a number beginning with 0, it will treat it as octal, which nobody on the team knew. I told them to add the radix parameter to make sure it comes back as base 10, and everything was fine.
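To see how far off the numbers can be, the same digits can be read in both bases. Python's `int()` is used purely as a stand-in here; it does not reproduce JavaScript's parseInt, it only shows what the two interpretations yield.

```python
as_octal = int("010", 8)     # the digits read as base 8
as_decimal = int("010", 10)  # the same digits read as base 10

print(as_octal, as_decimal)
```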

Joe Philllips
+1  A: 

In 1994 I was working on an application that was an order of magnitude faster than its competitors. There was a lot of pressure from on high to keep the program fast, and a lot of effort went into optimising the code (and converting critical parts of it to assembler) to squeeze even fractions of a percent of extra performance out of it.

One day I discovered that a linked list class used throughout the program had some debugging code in it. Whenever you inserted an item, it scanned the entire list to check that it wasn't already in the list. When deleting, it scanned the entire list to check that the item was in the list.

Unfortunately the author had forgotten to place this debug code in an '#if DEBUG', so it was included in the release version. When I added this missing line, a number of important operations in the program suddenly went about twice as fast.

(Moral of the story: Profile before you optimise!)
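A minimal sketch of the pattern, not the original class: the expensive consistency scan runs only when a debug flag is on. The `DEBUG` flag is a hypothetical stand-in for '#if DEBUG', and a Python list stands in for the linked list.

```python
DEBUG = False  # hypothetical build flag, standing in for '#if DEBUG'

class ItemList:
    def __init__(self):
        self._items = []
        self.scans = 0  # counts how many full scans were performed

    def insert(self, item):
        if DEBUG:
            # O(n) scan of the whole list: useful while debugging, costly in release.
            self.scans += 1
            assert item not in self._items, "duplicate insert"
        self._items.append(item)

    def __len__(self):
        return len(self._items)
```

Without the guard, inserting n items costs on the order of n squared comparisons, which is the kind of hidden cost a profiler would have pointed at straight away.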

Jason Williams