What made it hard to find? How did you track it down?
Not close enough to close but see also
http://stackoverflow.com/questions/175854/what-is-the-funniest-bug-youve-ever-experienced
While I don't recall a specific instance, the toughest category is the bugs that only manifest after the system has been running for hours or days, and that, when the system goes down, leave little or no trace of what caused the crash. What makes them particularly bad is that no matter how well you think you've reasoned out the cause and applied the appropriate fix, you'll have to wait another few hours or days to get any confidence at all that you've really nailed it.
A multi-threaded application where running in debug was fine, but as soon as you ran in release it went wrong because of slightly different timing. Even adding Console.WriteLine calls to produce basic debugging output changed the timing enough for it to work and not show the issue. It took a week to find and fix the couple of lines of code that needed changing.
Not sure this is the toughest, but several years ago I had a Java program which used XMLEncoder to save/load a particular class. For some reason the class wasn't working properly. I did a simple binary search for the error and discovered that it was happening after one function call but before another, which should have been impossible. Two hours later I had not figured it out, though the moment I took a break (and was leaving) I realized the problem. It turned out that XMLEncoder was creating a default-constructed instance of the class instead of having both the class and the reference to it refer to the same object. So, while I thought the two function calls were both on members of the same instance of a particular class, one was actually on a default-constructed copy.
It was tough to find since I knew they were both references to the same class.
I had a bug in a console game that occurred only after you fought and won a lengthy boss-battle, and then only around 1 time in 5. When it triggered, it would leave the hardware 100% wedged and unable to talk to outside world at all.
It was the shyest bug I've ever encountered; modifying, automating, instrumenting or debugging the boss-battle would hide the bug (and of course I'd have to do 10-20 runs to determine that the bug had hidden).
In the end I found the problem (a cache/DMA/interrupt race thing) by reading the code over and over for 2-3 days.
A bug where you come across some code, and after studying it you conclude, "There's no way this could have ever worked!" and suddenly it stops working though it always did work before.
The toughest bug I ever had to fix was one I'd raised myself - I contracted as a tester for a large telco, testing another company's product. Several years later, I had a contract with the other company and the first thing they gave me were the bugs I'd raised myself.
It was a kernel race condition in an embedded operating system written in 6809 assembler and BCPL. The debugging environment consisted of a special printf which wrote to a serial device; no fancy IDE stuff in this setup.
Took quite a while to fix but it was a huge satisfaction boost when I finally nutted it out.
A nasty crash in a GUI app written in Turbo Pascal. It took three days plus before I discovered, by single-stepping in the debugger at the machine-code level over simple and obviously correct code, that I was putting a 16-bit integer on the call stack for a function expecting 32-bit (or some such mismatch).
Now I am wise to that, although modern compilers don't allow that kind of trouble any more.
Long ago, I wrote an object-oriented language using C and a (character-based) forms library; each form was an object, forms could contain subforms, and so on. The complex invoicing application written using this would work fine for about 20 minutes, then random garbage characters would appear every now and then on the screen. After a few more minutes of using the app, the machine would reboot, hang, or do something equally drastic.
This turned out to be a bad deallocation resulting from a misdirected delegation in the message-processing engine; mis-routed messages were being delegated up the containment tree when we ran out of superclasses, and sometimes the parent objects would have methods with the same name, so it would appear to work most of the time. The rest of the time it would deallocate a small buffer (8 bytes or so) in the wrong context. The pointer being deallocated incorrectly was actually dead memory used by an intermediate counter for another operation, so its value tended to converge on zero over time.
Yes, the bad pointer would cross through the memory-mapped area of the screen on its way to the zero page, where it eventually overwrote an interrupt vector and killed the PC.
This was way before modern debugging tools, so figuring out what was happening took a couple of weeks...
Our network interface, a DMA-capable ATM card, would very occasionally deliver corrupted data in received packets. The AAL5 CRC had checked out as correct when the packet came in off the wire, yet the data DMAd to memory would be incorrect. The TCP checksum would generally catch it, but back in the heady days of ATM people were enthused about running native applications directly on AAL5, dispensing with TCP/IP altogether. We eventually noticed that the corruption only occurred on some models of the vendor's workstation (who shall remain nameless), not others.
By calculating the CRC in the driver software we were able to detect the corrupted packets, at the cost of a huge performance hit. While trying to debug we noticed that if we just stored the packet for a while and went back to look at it later, the data corruption would magically heal itself. The packet contents would be fine, and if the driver calculated the CRC a second time it would check out ok.
We'd found a bug in the data cache of a shipping CPU. The cache in this processor was not coherent with DMA, requiring the software to explicitly flush it at the proper times. The bug was that sometimes the cache didn't actually flush its contents when told to do so.
This requires knowing a bit of Z-8000 assembler, which I'll explain as we go.
I was working on an embedded system (in Z-8000 assembler). A different division of the company was building a different system on the same platform, and had written a library of functions, which I was also using on my project. The bug was that every time I called one function, the program crashed. I checked all my inputs; they were fine. It had to be a bug in the library -- except that the library had been used (and was working fine) in thousands of POS sites across the country.
Now, Z-8000 CPUs have 16 16-bit registers, R0, R1, R2 ... R15, which can also be addressed as 8 32-bit registers, named RR0, RR2, RR4 ... RR14. The library was written from scratch, refactoring a bunch of older libraries. It was very clean and followed strict programming standards. At the start of each function, every register that would be used in the function was pushed onto the stack to preserve its value. Everything was neat and tidy -- it was perfect.
Nevertheless, I studied the assembler listing for the library, and I noticed something odd about that function: at the start it had PUSH RR0 / PUSH RR2, and at the end it had POP RR2 / POP R0. Now, if you didn't follow that, it pushed four word values onto the stack at the start, but only removed three of them at the end. That's a recipe for disaster: an unknown value was left on top of the stack where the return address needed to be. The function couldn't possibly work.
Except, may I remind you, that it WAS working. It was being called thousands of times a day on thousands of machines. It couldn't possibly NOT work.
After some time debugging (which wasn't easy in assembler on an embedded system with the tools of the mid-1980s), I found it would always crash on the return, because the bad value sent it to a random address. Evidently I had to debug the working app to figure out why it didn't fail.
Well, remember that the library was very good about preserving the values in the registers, so once you put a value into the register, it stayed there. R1 had 0000 in it. It would always have 0000 in it when that function was called. The bug therefore left 0000 on the stack. So when the function returned it would jump to address 0000, which just so happened to be a RET, which would pop the next value (the correct return address) off the stack, and jump to that. The data perfectly masked the bug.
Of course, in my app, I had a different value in R1, so it just crashed....
This didn't happen to me, but a friend told me about it.
He had to debug an app which would crash very rarely. It would only fail on Wednesdays -- in September -- after the 9th. Yes, 362 days of the year it was fine, and three days out of the year it would crash immediately.
It would format a date as "Wednesday, September 22 2008", but the buffer was one character too short -- so it would only cause a problem when you had a two-digit day of the month on the day with the longest name in the month with the longest name.
My team inherited a CGI-based, multi-threaded C++ web app. The main platform was Windows; a distant, secondary platform was Solaris with Posix threads. Stability on Solaris was a disaster, for some reason. We had various people who looked at the problem for over a year, off and on (mostly off), while our sales staff successfully pushed the Windows version.
The symptom was pathetic stability: a wide range of system crashes with little rhyme or reason. The app used both Corba and a home-grown protocol. One developer went so far as to remove the entire Corba subsystem as a desperate measure: no luck.
Finally, a senior, original developer wondered aloud about an idea. We looked into it and eventually found the problem: on Solaris, there was a compile-time (or run-time?) parameter to adjust the stack size for the executable. It was set incorrectly: far too small. So, the app was running out of stack and printing stack traces that were total red herrings.
It was a true nightmare.
Lessons learned:
Had a bug on a platform with a very bad on-device debugger. We would get a crash on the device if we added a printf to the code. It would then crash at a different spot than the location of the printf. If we moved the printf, the crash would either move or disappear. In fact, if we changed that code by reordering some simple statements, the crash would happen somewhere unrelated to the code we did change.
It turned out there was a bug in the relocator for our platform. The relocator was not zero-initializing the ZI section but rather using the relocation table to initialize the values. So any time the relocation table changed in the binary, the bug would move. Simply adding a printf would change the relocation table, and therefore the bug.
The first was that our released product exhibited a bug, but when I tried to debug the problem, it didn't occur. I thought this was a "release vs. debug" thing at first -- but even when I compiled the code in release mode, I couldn't reproduce the problem. I went to see if any other developer could reproduce the problem. Nope.
After much investigation (producing a mixed assembly code / C code listing of the program) and stepping through the assembly code of the released product (yuck!), I found the offending line. But the line looked just fine to me! I then had to look up what the assembly instructions did -- and sure enough the wrong assembly instruction was in the released executable. Then I checked the executable that my build environment produced -- it had the correct assembly instruction.
It turned out that the build machine had somehow become corrupt and produced bad assembly code for only one instruction for this application. Everything else (including previous versions of our product) produced code identical to other developers' machines. After I showed my research to the software manager, we quickly rebuilt our build machine.
One of the products I helped build at my work was running on a customer site for several months, collecting and happily recording each event it received to a SQL Server database. It ran very well for about 6 months, collecting about 35 million records or so.
Then one day our customer asked us why the database hadn't updated for almost two weeks. Upon further investigation we found that the database connection that was doing the inserts had failed to return from the ODBC call. Thankfully the thread that does the recording was separated from the rest of the threads, allowing everything but the recording thread to continue functioning correctly for almost two weeks!
We tried for several weeks on end to reproduce the problem on any machine other than this one. We never could. Unfortunately, several of our other products then began to fail in about the same manner, and none of them had their database threads separated from the rest of their functionality, so the entire application would hang and had to be restarted by hand each time it crashed.
Weeks of investigation turned into several months, and we still had the same symptoms: full ODBC deadlocks in any application where we used a database. By this time our products were riddled with debugging information and ways to determine what went wrong and where, even to the point that some of them would detect the deadlock, collect information, email us the results, and then restart themselves.
While working on the server one day, still collecting debugging information from the applications as they crashed, trying to figure out what was going on, the server BSoD on me. When the server came back online, I opened the minidump in WinDbg to figure out what the offending driver was. I got the file name and traced it back to the actual file. After examining the version information in the file, I figured out it was part of the McAfee anti-virus suite installed on the computer.
We disabled the anti-virus and haven't had a single problem since!!
A deadlock in my first multi-threaded program!
It was very tough to find it because it happened in a thread pool. Occasionally a thread in the pool would deadlock but the others would still work. Since the size of the pool was much greater than needed it took a week or two to notice the first symptom: application completely hung.
The two toughest bugs that come to mind were both in the same type of software, only one was in the web-based version, and one in the windows version.
This product is a floorplan viewer/editor. The web-based version has a flash front-end that loads the data as SVG. Now, this was working fine, only sometimes the browser would hang. Only on a few drawings, and only when you wiggled the mouse over the drawing for a bit. I narrowed the problem down to a single drawing layer, containing 1.5 MB of SVG data. If I took only a subsection of the data, any subsection, the hang didn't occur. Eventually it dawned on me that the problem probably was that there were several different sections in the file that in combination caused the bug. Sure enough, after randomly deleting sections of the layer and testing for the bug, I found the offending combination of drawing statements. I wrote a workaround in the SVG generator, and the bug was fixed without changing a line of actionscript.
In the same product on the windows side, written in Delphi, we had a comparable problem. Here the product takes autocad DXF files, imports them to an internal drawing format, and renders them in a custom drawing engine. This import routine isn't particularly efficient (it uses a lot of substring copying), but it gets the job done. Only in this case it wasn't. A 5 megabyte file generally imports in 20 seconds, but on one file it took 20 minutes, because the memory footprint ballooned to a gigabyte or more. At first it seemed like a typical memory leak, but memory leak tools reported it clean, and manual code inspection turned up nothing either. The problem turned out to be a bug in Delphi 5's memory allocator. In some conditions, which this particular file was duly recreating, it would be prone to severe memory fragmentation. The system would keep trying to allocate large strings, and find nowhere to put them except above the highest allocated memory block. Integrating a new memory allocation library fixed the bug, without changing a line of import code.
Thinking back, the toughest bugs seem to be the ones whose fix involves changing a different part of the system than the one where the problem occurs.
Not one of mine, but a colleague at a previous place of employment spent 3 days debugging his JavaScript popout editor control (this was quite a while ago, before the joys of frameworks), only to find that it was missing a single semicolon halfway down one of its huge core files.
We dubbed it "the world's most expensive semicolon", but I'm sure there's been far worse throughout history!
When the client's pet bunny rabbit gnawed partway through the ethernet cable. Yes. It was bad.
A heap memory violation in a text edit control that I used. After many months (...) looking for it, I found the solution working with another programmer, peer debugging the problem. This very instance convinced me of the value of working in teams and Agile in general. Read more about it at my blog
This was on Linux but could have happened on virtually any OS. Now most of you are probably familiar with the BSD socket API. We happily use it year after year, and it works.
We were working on a massively parallel application that would have many sockets open. To test its operation we had a testing team that would open hundreds, and sometimes over a thousand, connections for data transfer. With the highest channel numbers our application would begin to show weird behavior. Sometimes it just crashed. Other times we got errors that simply could not be true (e.g. accept() returning the same file descriptor on subsequent calls, which of course resulted in chaos).
We could see in the log files that something went wrong, but it was insanely hard to pinpoint. Tests with Rational Purify said nothing was wrong. But something WAS wrong. We worked on this for days and got increasingly frustrated. It was a showstopper, because the already negotiated test would cause havoc in the app.
As the error only occurred in high-load situations, I double-checked everything we did with sockets. We had never tested high-load cases in Purify because it was not feasible in such a memory-intensive situation.
Finally (and luckily) I remembered that the massive number of sockets might be a problem with select() which waits for state changes on sockets (may read / may write / error). Sure enough our application began to wreak havoc exactly the moment it reached the socket with descriptor 1025. The problem is that select() works with bit field parameters. The bit fields are filled by macros FD_SET() and friends which DON'T CHECK THEIR PARAMETERS FOR VALIDITY.
So every time we got over 1024 descriptors (each OS has its own limit; vanilla Linux kernels have 1024, and the actual value is defined as FD_SETSIZE), the FD_SET macro would happily overrun its bit field and write garbage into the next structure in memory.
I replaced all select() calls with poll(), which is a well-designed alternative to the arcane select() call, and high-load situations have never been a problem since. We were lucky because all socket handling was in one framework class, where 15 minutes of work could solve the problem. It would have been a lot worse if select() calls had been sprinkled all over the code.
Lessons learned:
even if an API function is 25 years old and everybody uses it, it can have dark corners you don't know yet
unchecked memory writes in API macros are EVIL
a debugging tool like Purify can't help with all situations, especially when a lot of memory is used
Always have a framework for your application if possible. Using it not only increases portability but also helps in case of API bugs
many applications use select() without thinking about the socket limit. So I'm pretty sure you can cause bugs in a LOT of popular software by simply using many many sockets. Thankfully, most applications will never have more than 1024 sockets.
Instead of having a secure API, OS developers like to put the blame on the developer. The Linux select() man page says
"The behavior of these macros is undefined if a descriptor value is less than zero or greater than or equal to FD_SETSIZE, which is normally at least equal to the maximum number of descriptors supported by the system."
That's misleading. Linux can open more than 1024 sockets. And the behavior is absolutely well defined: using unexpected values will ruin the running application. Instead of making the macros resilient to illegal values, the developers simply let them overwrite other structures. FD_SET is implemented as inline assembly(!) in the Linux headers and evaluates to a single assembler write instruction. Not the slightest bounds checking happens anywhere.
To test your own application, you can artificially inflate the number of descriptors used by programmatically opening FD_SETSIZE files or sockets directly after main() and then running your application.
Thorsten79
I have spent hours to days debugging a number of things that ended up being fixable with literally just a couple characters.
Some various examples:
ffmpeg has this nasty habit of producing a warning about "brainfart cropping" (referring to a case where in-stream cropping values are >= 16) when the crop values in the stream were actually perfectly valid. I fixed it by adding three characters: "h->".
x264 had a bug where in extremely rare cases (one in a million frames) with certain options it would produce a random block of completely the wrong color. I fixed the bug by adding the letter "O" in two places in the code. It turned out I had misspelled the name of a #define in an earlier commit.
That was an access violation crash.
From the crash dump I could only figure out that a parameter on the call stack was corrupted.
The reason was this code:
n = strlen(p->s) - 1;
if (p->s[n] == '\n')
p->s[n] = '\0';
If the string length was 0, and the parameter on the stack above happened to be at address 0x0Axxxxxxx
==> stack corruption.
Fortunately this code was close enough to the actual crash location, so browsing the (ugly) source code was the way to find the culprit.
With FORTRAN on a Data General minicomputer in the 80's we had a case where the compiler caused a constant 1 (one) to be treated as 0 (zero). It happened because some old code was passing a constant of value 1 to a function which declared the variable as a FORTRAN parameter, which meant it was (supposed to be) immutable. Due to a defect in the code we did an assignment to the parameter variable and the compiler gleefully changed the data in the memory location it used for a constant 1 to 0.
Many unrelated functions later we had code that did a compare against the literal value 1 and the test would fail. I remember staring at that code for the longest time in the debugger. I would print out the value of the variable, it would be 1 yet the test 'if (foo .EQ. 1)' would fail. It took me a long time before I thought to ask the debugger to print out what it thought the value of 1 was. It then took a lot of hair pulling to trace back through the code to find when the constant 1 became 0.
There are a couple of those I can recollect, most of them caused by me :). Almost every one of these needed lots of head scratching.
I was part of a Java project (rich client). The Java code used to work well on vanilla builds or new machines without problem, but when installed on the presentation laptops, it suddenly stopped working and started throwing stack dumps. Further investigation showed that the code relied on a custom DLL which conflicted with Cygwin. That's not the end of the story: we were supposed to install it on 5 other machines, and guess what, on one of the machines it crashed again! This time the culprit was the JVM; the code we shipped was built using Sun Microsystems' JDK and the machine had IBM's JVM.
Another instance I can recollect has to do with custom event handler code. The code was unit tested and verified; finally, when I removed the print() statements, BOOM! When we debugged, the code ran perfectly, adding to our woes. I had to resort to zen meditation (a nap on the desk) and it occurred to me that there might be a temporal anomaly: the event we were delegating was triggering the function even before the condition was set, and the print statements and debug mode gave enough time for the condition to be set, so it worked properly. A sigh of relief and some refactoring solved the issue.
One fine day I decided that some of the domain objects needed to implement the Cloneable interface, and things were fine. After some weeks, we observed that the application started behaving weirdly. Guess what? We were adding these shallow copies to collection classes, and the remove() methods were not actually clearing the contents properly (due to duplicate references pointing to the same object). This caused some serious model review and a couple of raised brows.
Somewhere deep in the bowels of a networked application was the line (simplified):
if (socket = accept() == 0)
return false;
// code using the socket
What happened when the call succeeded? socket was set to 1. What does send() do when given a 1 (as in send(socket, "mystring", 7);)? It prints to stdout... this I found after 4 hours of wondering why, with all my printf()s taken out, my app was printing to the terminal window instead of sending the data over the network.
Not very tough, but I laughed a lot when it was uncovered.
When I was maintaining a 24/7 order processing system for an online shop, a customer complained that his order was "truncated". He claimed that while the order he placed actually contained N positions, the system accepted much less positions without any warning whatsoever.
After we traced order flow through the system, the following facts were revealed. There was a stored procedure responsible for storing order items in the database. It accepted the list of order items as a string, which encoded a list of (product-id, quantity, price) triples like this:
"<12345, 3, 19.99><56452, 1, 8.99><26586, 2, 12.99>"
Now, the author of the stored procedure was too smart to resort to anything like ordinary parsing and looping. So he directly transformed the string into an SQL multi-insert statement by replacing "<" with "insert into ... values (" and ">" with ");". Which was all fine and dandy, if only he hadn't stored the resulting string in a varchar(8000) variable!
What happened is that his "insert ...; insert ...;" was truncated at the 8000th character, and for that particular order the cut was "lucky" enough to happen right between inserts, so that the truncated SQL remained syntactically correct.
Later I found out the author of sp was my boss.
Had a bug on a platform with a very bad on-device debugger. We would get a crash on the device if we added a printf to the code. It would then crash at a different spot than the location of the printf. If we moved the printf, the crash would either move or disappear. In fact, if we changed that code by reordering some simple statements, the crash would happen somewhere unrelated to the code we did change.
This looks like a classic Heisenbug. The minute you recognize it, you immediately go looking for uninitialized variables or stack boundary trashing.
Just before the internet caught on, we were working on a modem-based home banking application (The first in North America).
Three days before release, we were (almost) on schedule and were planning to use the remaining time to exhaustively test the system. We had a test plan, and next on the list was modem communications.
Right about then, our client came rushing in wanting a last minute feature upgrade. Of course, I was completely against this, but I was overruled. We burned the midnight oil for three days adding the stupid thing, and got it working by release date. We made the deadline, and delivered over 2000 floppy disks to the customers.
The day after release, I got back to my testing schedule and resumed testing the modem communication module. Much to my surprise, I found that the modem would randomly fail to connect. Just about then, our phones started ringing off the hook with angry customers not being able to use their application.
After much gnashing of teeth and pulling of hair, I traced the problem to the serial port initialization. A junior programmer had commented out a write to one of the control registers. The register remained uninitialized, and there was about a 10% chance that it would contain an invalid value -- depending upon the user's configuration and what applications he had run beforehand.
When asked about it, the programmer claimed that it made it work on his machine.
So we had to re-burn those 2000+ floppies, and track down each and every customer to recall them. Not a fun thing to do, especially with an already burnt-out team.
We took a big hit on that one. Our client claimed that because it was our bug, we should have to absorb the cost of the recall. Our schedule for the next release was put back a month. And our relationship with the client was tarnished.
Nowadays, I am much less flexible with last-minute feature additions, and I try to communicate better with my team.
Designed a realtime multithreaded (shudder) system once which polled images from multiple network surveillance cameras and did all kinds of magic on the images.
The bug simply made the system crash, some critical section being mistreated of course. I had no idea how to trigger the failure directly, but had to wait for it to occur, which was about once every three or four days (odds: about 1 in 15,000,000 at 30 fps).
I had to prepare everything I could: debug output messages soiling the code, trace tools, remote debugging tools on the camera, and the list goes on. Then I just had to wait two or three days and hope to catch all the info for locating the failing mutex or whatever. It took four of these runs before I tracked it down -- four weeks! One more run and I would have broken the customer deadline.
A jpeg parser, running on a surveillance camera, which crashed every time the company's CEO came into the room.
100% reproducible error.
I kid you not!
This is why:
For those of you who don't know much about JPEG compression: the image is broken down into a matrix of small blocks which are then encoded using magic, etc.
The parser choked when the CEO came into the room because he always wore a shirt with a square pattern on it, which triggered some special case of the contrast and block-boundary algorithms.
Truly classic.
While testing some new functionality that I had recently added to a trading application, I happened to notice that the code to display the results of a certain type of trade would never work properly. After looking at the source control system, it was obvious that this bug had existed for at least a year, and I was amazed that none of the traders had ever spotted it.
After puzzling for a while and checking with a colleague, I fixed the bug and went on testing my new functionality. About 3 minutes later, my phone rang. On the other end of the line was an irate trader who complained that one of his trades wasn’t showing correctly.
Upon further investigation, I realized that the trader had been hit with the exact same bug I had noticed in the code 3 minutes earlier. This bug had been lying around for a year, just waiting for a developer to come along and spot it so that it could strike for real.
This is a good example of a type of bug known as a Schroedinbug. While most of us have heard about these peculiar entities, it is an eerie feeling when you actually encounter one in the wild.
This is back when I thought that C++ and digital watches were pretty neat...
I got a reputation for being able to solve difficult memory leaks. Another team had a leak they couldn't track down. They asked me to investigate.
In this case, they were COM objects. At the core of the system was a component that gave out many twisty little COM objects that all looked more or less the same. Each one was handed out to many different clients, each of which was responsible for calling AddRef() and Release() the same number of times.
There wasn't a way to automatically work out who had called each AddRef, and whether they had Released.
I spent a few days in the debugger, writing down hex addresses on little pieces of paper. My office was covered with them. Finally I found the culprit. The team that asked me for help was very grateful.
The next day I switched to a GC'd language.*
(*Not actually true, but would be a good ending to the story.)
Bryan Cantrill of Sun Microsystems gave an excellent Google Tech Talk on a bug he tracked down using a tool he helped develop called DTrace.
The Tech Talk is funny, geeky, informative, and very impressive (and long, about 78 minutes).
I won't give any spoilers here on what the bug was but he starts revealing the culprit at around 53:00.
My first "real" job was for a company that wrote client-server sales-force automation software. Our customers ran the client app on their (15-pound) laptops, and at the end of the day they dialed up to our unix servers to synchronize with the Mother database. After a series of complaints, we found that an astronomical number of calls were dropping at the very beginning, during authentication.
After weeks of debugging, we discovered that the authentication always failed if the incoming call was answered by a getty process on the server whose Process ID contained an even number followed immediately by a 9. Turns out the authentication was a homebrew scheme that depended on an 8-character string representation of the PID; a bug caused an offending PID to crash the getty, which respawned with a new PID. The second or third call usually found an acceptable PID, and automatic redial made it unnecessary for the customers to intervene, so it wasn't considered a significant problem until the phone bills arrived at the end of the month.
The "fix" (ahem) was to convert the PID to a string representing its value in octal rather than decimal, making it impossible to contain a 9 and unnecessary to address the underlying problem.
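The "fix" fits in a line; here is a sketch in Python (the original was presumably C on Unix):

```python
def pid_string(pid):
    """Render a PID as a fixed-width octal string.

    Octal digits run 0-7, so the result can never contain a 9 --
    which sidesteps (rather than fixes) the crash-triggering pattern.
    """
    return format(pid, "08o")

# No decimal PID, however unlucky, produces a 9 in octal.
print(pid_string(29193))  # "00071011"
```

The width of 8 characters matches the homebrew scheme's 8-character string; any zero-padded octal format would do.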
There was code that set an expiry date to the current date plus one year, by adding 1 to the current year and keeping the day and month the same. This failed big time on Feb 29, 2008, because the database refused to accept Feb 29, 2009!
Not sure whether that qualifies as 'tough', but it was weird code, which of course was rewritten immediately!
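The failure is easy to reproduce; a sketch in Python (the original language isn't stated, and the fallback-to-Feb-28 fix is just one defensible choice):

```python
from datetime import date

def naive_plus_one_year(d):
    # The buggy approach: bump the year, keep month and day.
    # Blows up when d is Feb 29, since the next year has no Feb 29.
    return d.replace(year=d.year + 1)

def safe_plus_one_year(d):
    # One defensible fix: fall back to Feb 28 in non-leap years.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)

try:
    naive_plus_one_year(date(2008, 2, 29))
except ValueError as e:
    print("naive version fails:", e)

print(safe_plus_one_year(date(2008, 2, 29)))  # 2009-02-28
```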
When I first started at the company I work for I did a lot of CPR to learn the products.
This embedded product, written in HC11 assembly, had a feature that ran every eight hours, driven by a counter. Turns out the interrupt that decremented the counter was firing during the code that was checking it. Slapped some CLI/STI around the code and it was fine. I tracked it down by hacking the event to happen twice a second rather than every eight hours.
The lesson I learned from this was when debugging code that fails infrequently I should check the variables used by interrupts first.
Adam Liss's message above talking about the project we both worked on, reminded me of a fun bug I had to deal with. Actually, it wasn't a bug, but we'll get to that in a minute.
Executive summary of the app in case you haven't seen Adam message yet: sales-force automation software...on laptops...end of the day they dialed up ...to synchronize with the Mother database.
One user complained that every time he tried to dial in, the application would crash. The customer support folks went through all their usual over-the-phone diagnostic tricks and found nothing. So they had to relent to the ultimate: have the user FedEx the laptop to our offices. (This was a very big deal, as each laptop's local database was customized to the user, so a new laptop had to be prepared, shipped to the user for him to use while we worked on his original, and then we had to swap back and have him finally sync the data on the original laptop.)
So, when the laptop arrived, it was given to me to figure out the problem. Now, syncing involved hooking up the phone line to the internal modem, going to the "Communication" page of our app, and selecting a phone number from a drop-down list (with the last number used pre-selected). The numbers in the list were part of the customization, and were basically the number of the office, the number of the office prefixed with "+1", the number of the office prefixed with "9,,," in case they were calling from a hotel, etc.
So, I clicked the "COMM" icon and pressed return. It dialed in, it connected to a modem -- and then immediately crashed. I tried a couple more times. 100% repeatability.
So, I hooked a datascope between the laptop and the phone line, and looked at the data going across the line. It looked rather odd... The oddest part was that I could read it!
The user had apparently wanted to use his laptop to dial into a local BBS system, and so had changed the configuration of the app to use the BBS's phone number instead of the company's. Our app was expecting our proprietary binary protocol -- not long streams of ASCII text. Buffers overflowed -- KaBoom!
The fact that a problem dialing in started immediately after he changed the phone number, might give the average user a clue that it was the cause of the problem, but this guy never mentioned it.
I fixed the phone number and sent the laptop back to the support team, with a note electing the guy "Bonehead User of the Week". (*)
(*) OK, OK... There's probably a very good chance that what actually happened is that the guy's kid, seeing his father dial in every night, figured that's how you dial into BBSs too, and changed the phone number some time when he was home alone with the laptop. When it crashed, he didn't want to admit he had touched the laptop, let alone broken it; so he just put it away and didn't tell anyone.
Basically, anything involving threads.
I held a position at a company once in which I had the dubious distinction of being one of the only people comfortable enough with threading to debug nasty issues. The horror. You should have to get some kind of certification before you're allowed to write threaded code.
Thanks to a flash of inspiration this didn't take too long to track down, but it was a bit odd nonetheless. Small application, only used by other people in the IT department. It connects in turn to all of the desktop PCs in the domain. Many are turned off, and the connection takes AGES to time out, so it runs on the thread pool: it just scans AD and queues thousands of work items to the thread pool. All worked fine. Some years later I was talking to another member of staff who actually uses this application, and he mentioned it made the PC unusable. While it was running, trying to open web pages or browse a network drive would take minutes, or just never happen.
The problem turned out to be XP's half-open TCP connection limit. The original PCs were dual-processor, so .NET allocated 50 (or 100, I'm not sure) threads to the pool -- no problem. Now that we had dual-processor, dual-core machines, there were more threads in the thread pool than the number of half-open connections allowed, so other network activity became impossible while the application was running.
It is now fixed: it pings machines before attempting to connect to them, so the timeout is much shorter, and it uses a small fixed number of threads to do the actual work.
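The shape of that fix, sketched in Python (the real application was .NET; the probe port, timeout, and pool size here are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor
import socket

def reachable(host, port=445, timeout=0.5):
    """Cheap liveness probe before any expensive connection attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def scan(hosts):
    # A small fixed pool keeps the number of simultaneous half-open
    # TCP connections well under the OS limit, so the machine stays
    # usable while the scan runs.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(reachable, hosts))
```

A short probe timeout means dead machines cost half a second instead of a full TCP connect timeout each.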
Mine was a hardware problem...
Back in the day, I used a DEC VaxStation with a big 21" CRT monitor. We moved to a lab in our new building, and installed two VaxStations in opposite corners of the room. Upon power-up, my monitor flickered like a disco (yeah, it was the 80's), but the other monitor didn't.
Okay, swap the monitors. The other monitor (now connected to my VaxStation) flickered, and my former monitor (moved across the room) didn't.
I remembered that CRT-based monitors were susceptible to magnetic fields. In fact, they were -very- susceptible to 60 Hz alternating magnetic fields, so I immediately suspected that something in my work area was generating one.
Unfortunately, the monitor still flickered even when all the other equipment was turned off and unplugged. At that point, I began to suspect something in the building.
To test this theory, we converted the VaxStation and its 85 lb monitor into a portable system. We placed the entire system on a rollaround cart and connected it to a 100-foot orange construction extension cord. The plan was to use this setup as a portable field-strength meter, in order to locate the offending piece of equipment.
Rolling the monitor around confused us totally. The monitor flickered in exactly one half of the room, but not the other. The room was square, with doors in opposite corners, and the monitor flickered on one side of a diagonal line connecting the doors, but not on the other. The room was surrounded on all four sides by hallways; we pushed the monitor out into the hallways, and the flickering stopped. In fact, we discovered that the flicker only occurred in one triangular half of the room, and nowhere else.
After a period of total confusion, I remembered that the room had a two-way ceiling lighting system, with light switches at each door. At that moment, I realized what was wrong.
I moved the monitor into the half of the room with the problem, and turned the ceiling lights off. The flicker stopped. When I turned the lights on, the flicker resumed. Turning the lights on or off from either light switch, turned the flicker on or off within half of the room.
The problem was caused by somebody cutting corners when they wired the ceiling lights. When wiring up a two-way switch on a lighting circuit, you run a pair of wires between the SPDT switch contacts, and a single wire from the common on one switch, through the lights, and over to the common on the other switch.
Normally, these wires are bundled together. They leave as a group from one switchbox, run to the overhead ceiling fixture, and on to the other box. The key idea is that all of the current-carrying wires are bundled together.
When the building was wired, the single wire between the switches and the light was routed through the ceiling, but the wires travelling between the switches were routed through the walls.
If all of the wires ran close and parallel to each other, then the magnetic field generated by the current in one wire was cancelled out by the magnetic field generated by the equal and opposite current in a nearby wire. Unfortunately, the way that the lights were actually wired meant that one half of the room was basically inside a large, single-turn transformer primary. When the lights were on, the current flowed in a loop, and the poor monitor was basically sitting inside of a large electromagnet.
Moral of the story: The hot and neutral lines in your AC power wiring are next to each other for a good reason.
Now, all I had to do was to explain to management why they had to rewire part of their new building...
This happened to me on the time I worked on a computer store.
One day a customer came into the shop and told us that his brand-new computer worked fine in the evenings and at night, but not at all at midday or in the late morning: the mouse pointer would not move at those times.
The first thing we did was swap his mouse for a new one, but that didn't fix the trouble. Of course, both mice worked in the store with no fault.
After several tries, we found the trouble was with that particular brand and model of mouse. The customer's workstation was close to a very big window, and at midday the mouse sat in direct sunlight. Its plastic case was so thin that under those circumstances it became translucent, and the sunlight kept the optomechanical wheel from working :|
In a game I was working on, a particular sprite would not display anymore in Release mode, but worked fine in Debug mode, and only in one particular edition. Another programmer tried to find this bug for 2 days, then left for vacation. It ended up on my shoulders to try to find the bug ~5 hours before release.
Since the Debug build worked, I had to debug with the release build. Visual Studio supports some debugging in the Release build, but you can't rely on everything the debugger tells you to be correct (especially with the aggressive optimization settings we were using). Therefore, I had to step through half code listings and half assembler listings, sometimes looking at objects directly in the hex dump instead of in the nicely formatted debugger view.
After spending a while making sure that all the correct draw calls were being made, I found out that the material color of the sprite was incorrect - it was supposed to be full-opacity orange, but instead was set to black and completely transparent. The color was grabbed from a palette residing in a const array in our EditionManager class. It was set up initially as the correct orange color, but when the actual color was retrieved from the sprite drawing code, it was that transparent black again. I set a memory breakpoint, which was triggered in the EditionManager constructor. A write to a different array caused the value in the palette array to change.
As it turns out, the other programmer changed an essential enum of the system:
enum {
    EDITION_A = 0,
    EDITION_B,
    //EDITION_DEMO,
    EDITION_MAX,
    EDITION_DEMO,
};
He had put EDITION_DEMO right after EDITION_MAX, and the array that was being written to was indexed with EDITION_DEMO, so the write overflowed into the palette and set the wrong values there. I couldn't change the enum back, however, since the edition numbers couldn't change anymore (they were being used in binary transmission). Therefore, I ended up adding an EDITION_REAL_MAX entry to the enum and using that as the array size.
It was a tiny bug in Rhino (Javascript interpreter in Java) that was causing one script to fail. It was hard because I knew little about how the interpreter would work, but I had to jump in there to fix the bug as quickly as possible, for the sake of another project.
First I tracked down which call in the Javascript was failing, so I could reproduce the problem. I stepped through the running interpreter in debug mode, initially quite lost, but slowly learning bits of how it worked. (Reading the docs helped a little.) I added printlns/logging at points I thought might be relevant.
I diffed the (cleaned-up) logfile of a working run against a breaking run, to see at what point they first diverged. By re-running and adding lots of breakpoints, I found my way to the chain of events that led up to the failure. Somewhere in there was a line of code that, if written slightly differently, solved the problem! (It was something very simple, like nextNode() needing to return null instead of throwing IndexOutOfBounds.)
Two weeks after that I realised my fix broke scripts in certain other situations, and I changed the line to work well for all the cases.
I was in an unfamiliar environment. So I just tried a lot of different things, until one of them worked, or at least helped to make some progress/understanding. It did take a while, but I was pleased to get there in the end!
If I was doing it again now, I would look for the project's IRC channel (not only its mailing list), to ask a few polite questions and seek pointers.
I had a piece of Delphi code that ran a long processing routine, updating a progress bar as it went. The code ran fine in 16-bit Delphi 1, but when we upgraded to Delphi 2, a process that had been taking 2 minutes suddenly took about an hour.
After weeks of pulling the routine apart, it turned out to be the line that updated the progress bar. On every iteration we were checking the record count using table1.RecordCount. In Delphi 1 this worked fine, but it seems that in later versions of Delphi, calling RecordCount on a dBase table takes a copy of the table, counts the records, and returns the total. Calling this on every iteration of our progress loop was causing the table to be downloaded from the network and counted again and again. The solution was to count the records once before the processing started and store the total in a variable.
Took ages to find but turned out to be so simple.
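The pattern is worth a sketch (Python here; `expensive_count` is a made-up stand-in for the network round-trip that RecordCount became):

```python
import time

def expensive_count(table):
    # Stand-in for table1.RecordCount in later Delphi versions:
    # each call re-fetches and re-counts the table over the network.
    time.sleep(0.005)
    return len(table)

def process_slow(table):
    # The original bug: one "network" count per iteration.
    for i, _row in enumerate(table):
        progress = (i + 1) / expensive_count(table)
    return progress

def process_fast(table):
    # The fix: count once, store it in a variable, reuse it.
    total = expensive_count(table)
    for i, _row in enumerate(table):
        progress = (i + 1) / total
    return progress
```

Hoisting a single expensive call out of a loop turns O(n) round-trips into one, which is exactly the 2-minutes-to-an-hour difference above.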
I heard about a classic bug back in high school; a terminal that you could only log into if you sat in the chair in front of it. (It would reject your password if you were standing.)
It reproduced pretty reliably for most people; you could sit in the chair, log in, log out... but if you stand up, you're denied, every time.
Eventually it turned out some jerk had swapped a couple of adjacent keys on the keyboard, E/R and C/V IIRC. When you sat down, you touch-typed and got in; but when you stood, you had to hunt and peck, so you read the incorrect labels and failed.
I work for a large community college and we switched over from Blackboard to Moodle last year. Moodle uses the nomenclature of "courses" and "groups". A course might be Microeconomics ECO-150, for example, and groups are what we would call sections (OL1, OL2, 01, 14, W09 as examples).
Anyway, we are primitive. We don't even have LDAP. Everything is text files, Excel spreadsheets and GD Microsoft Access databases. My job is to create a web application that takes all of the above as input and produces still more text files that can then be uploaded into Moodle to create courses, groups in courses and users, and to put users into courses and groups. The whole setup is positively byzantine, with about 17 individual steps that must be done in order. But the thing works, and replaces a process that previously took days during the busiest time of the semester.
But there was one problem. Sometimes we got what I dubbed "Crazy Groups". So instead of creating a course with 4 groups of 20 students each it would create a course with 80 groups of 1 student each. The worst part, there is no way programmatically short of getting into cpanel(which I don't have access to) to delete a group once it is created. It is a manual process that takes about 5 button clicks. So every time a course with Crazy Groups got created I either had to delete the course, which is preferable but not an option if the teacher had already started putting content in the course, or I had to spend an hour repetitively following the same pattern: Select group, display group, edit group, delete group, Are you sure you want to delete group? Yes for godsake!
And there was no way to know whether crazy groups had occurred unless you manually opened each course and looked (with hundreds of courses), or until you got a complaint. Crazy Groups seemed to happen randomly, and Google and the Moodle forums were no help; it seems everyone else uses this thing called LDAP or a REAL database, so they'd never encountered the problem.
Finally, after I don't know how much investigating and more time deleting crazy groups than I ever want to admit I figured it out. It was a bug in Moodle not my code! This gave me not a little pleasure. You see the way to create a group is just try to enroll someone into the group and if the group does not already exist then Moodle creates it. And this worked fine for groups named OL1 or W12 or even SugarCandyMountain but if you tried to create a group with a number as the name, say 01 or 14 THAT is when crazy groups would occur. Moodle does not properly compare numbers as strings. No matter how many groups named 01 inside a course there are it will always think that group does not exist yet and will therefore create it. That is how you end up with 80 groups with 1 person in each.
Proud of my discovery I went to the Moodle forum and posted my findings complete with steps to reproduce the problem at will. That was about a year ago and the problem still exists inside of Moodle to my knowledge, no one seems motivated to fix it because no one but us primitives uses the text file enrollment. My solution, simply to make sure that all our group names contained at least 1 non-numeric character. Crazy groups are gone forever at least for us but I feel for that guy who works at a community college in outer Mongolia who just uploaded a semester's worth of courses and is about to have a rude awakening. At least this time Google may help him because I've written him this message in a bottle on the tides of cyberspace.
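For the curious, the failure mode can be sketched like this (illustrative Python, not Moodle's actual PHP; the lookup logic is an assumption based on the symptoms):

```python
# If group names get normalized through a numeric type anywhere in the
# "does this group exist?" lookup, a name with a leading zero never
# matches what was stored -- so the check always says no, and a fresh
# group is created on every single enrolment.
existing_groups = {"01"}  # group created on the first enrolment

def group_exists_buggy(name):
    # Coercing to int drops the leading zero: "01" becomes "1".
    if name.isdigit():
        return str(int(name)) in existing_groups
    return name in existing_groups

def group_exists_fixed(name):
    # Compare names strictly as strings.
    return name in existing_groups

print(group_exists_buggy("01"))  # False -> Moodle creates "01" yet again
print(group_exists_fixed("01"))  # True
```

The one-non-numeric-character workaround above works because it keeps names off the numeric code path entirely.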
DevExpress XPO talking to an Oracle database crashing hard (as in: program exits silently) if directory path that the application is installed to does not contain at least one space, and the data dictionary XPO checks for isn't 100% correctly cased in the database.
Problem described here.
I can tell you this: I was this >< close to crying when we figured out how to circumvent the problem. I still don't know what the actual, real, cause of the problem is, but our product is not going to support Oracle in future version so I'm actually not giving a .... any more.
I had a bug with a custom synchronization program once. It used the date/time stamp of files/folders to compare what was modified to synchronize data from a flash key to a network share in windows, with some extra integrity and business logic built in it.
One day, an operator reported that his sync was taking forever...after reviewing the logs, for some reason, the software thought every file on the stick (or the server) was 3 hours older than it should be, refreshing all 8 gigs of data! I was using UTC, how the heck could this be?
It turns out, this particular operator did indeed set his time zone to Pacific time instead of Eastern, causing the problem, but it shouldn't have, because all the code was using UTC - good god what could it be?! It worked when testing it on my local system...what gives?
At this point, we requested all operators ensure that their laptops were set to eastern time before they synced, and the bug stayed in the queue until we had more time to investigate.
Then, October came around and BOOM! Daylight savings time! What the heck!? Now everyone was complaining syncing was taking forever! Had to be fixed, and fast!
I tracked it down by modifying the test case to run off a stick instead of off my local hard drive, and sure enough, it failed... phew, must be a memory stick thing -- wait a sec, is it formatted FAT32?... AH HA! FAT32 uses local time when recording the timestamp of a file!
http://msdn.microsoft.com/en-us/library/ms724290(VS.85).aspx
So, the software was rewritten so that when writing to FAT32 media, we programmatically set the timestamps to UTC...
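The essence of the rewrite, sketched in Python (assumed details; the actual code presumably used the Win32 file-time APIs): treat a FAT32 timestamp as local wall-clock time and convert it to UTC before comparing.

```python
from datetime import datetime, timezone

def fat32_mtime_to_utc(stamp):
    """Interpret a naive FAT32 timestamp as local wall-clock time
    and convert it to UTC for comparison.

    Because FAT32 records local time, the same file appears to move
    by the UTC offset when the time zone or DST changes -- exactly
    the 3-hour shift the Pacific-time operator saw.
    """
    # A naive datetime passed to astimezone() is taken as local time.
    return stamp.astimezone(timezone.utc)
```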
In CS435 back at Purdue, we had to write a raytracer for our final project. Everything mine produced had a strong orange tint to it, but I could see every one of the objects in my scene. I finally gave up and submitted it as is, and had the professor look over my code to find the bug, and when he couldn't find it, I spent most of the summer digging to find just what the hell was wrong.
Buried deep in the code, as part of a color calculation function, I finally realized I was doing an integer division and passing the result to an OpenGL function that expected a float value. One of the color components was just low enough throughout most of the scene that it would round down to 0, causing the orange tint. Casting to float in just one place (before the division) fixed the bug.
Always check your inputs and expected types.
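The trap in miniature -- Python's `//` operator behaves like C's int/int division (the values are illustrative):

```python
color = 200    # a color component, 0-255
max_val = 255

bad = color // max_val   # 0 -- truncating division zeroes the channel
good = color / max_val   # ~0.784 -- what the float-expecting call needed

print(bad, round(good, 3))
```

In C the fix is a single cast, e.g. `(float)color / max_val`, applied before the division rather than after.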
A deadlock in a Java Server Application. But not a simple deadlock with two threads. I tracked down a deadlock involving eight threads. Thread 1 waits for thread 2 that waits for thread 3, etc, and finally thread 8 waits for thread 1.
It took me about one entire day to understand what was going on, and then just 15 minutes to fix it. I used Eclipse to monitor about 40 threads until I discovered the deadlock.
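An eight-thread wait-for cycle can only close if threads acquire locks in inconsistent orders. One standard prevention, sketched here in Python with illustrative names, is to give every lock a rank and always acquire in ascending order:

```python
import threading
from contextlib import ExitStack

# Rank each lock by its index. A cycle thread1 -> thread2 -> ... ->
# thread8 -> thread1 cannot form if every thread takes its locks in
# ascending rank order.
locks = [threading.Lock() for _ in range(8)]

def with_locks(*ranks):
    """Acquire the locks with the given ranks, lowest rank first."""
    stack = ExitStack()
    for r in sorted(ranks):
        stack.enter_context(locks[r])
    return stack

with with_locks(5, 2, 7):
    pass  # locks 2, 5, 7 held here; all released together on exit
```

The discipline costs nothing at runtime; the hard part, as the story shows, is noticing you needed it.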
It was during my diploma thesis. I was writing a program in FORTRAN to simulate the effect of a high-intensity laser on a helium atom.
One test run worked like this:
These should be constant in total, but they weren't. They did all kinds of weird things.
After debugging for two weeks I went berserk on the logging and logged every variable in every step of the simulation including the constants.
That way I found out that I was writing past the end of an array, which changed a constant!
A friend said he once changed the literal 2 with such a mistake.
The toughest bug would have to be when a programmer logged the message "General Error!". After looking through the code, I found the text "General Error!" scattered everywhere. Try nailing that one down.
At least a macro that output __LINE__ or __FUNCTION__ would have made the debug output a little more helpful.
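A Python analogue of that idea: the logging module can stamp every record with its origin for free (the file, function, and line below are made up), which is the moral equivalent of wrapping __FILE__/__LINE__/__FUNCTION__ in a C macro.

```python
import logging

# One formatter change makes every "General Error!" traceable.
formatter = logging.Formatter(
    "%(levelname)s %(filename)s:%(funcName)s:%(lineno)d %(message)s"
)

# Build a record by hand to show what the formatter produces.
record = logging.LogRecord(
    name="app", level=logging.ERROR, pathname="billing.py",
    lineno=217, msg="General Error!", args=(), exc_info=None,
    func="post_invoice",
)
print(formatter.format(record))
# ERROR billing.py:post_invoice:217 General Error!
```

In normal use you'd pass the format string to `logging.basicConfig(format=...)` and the origin fields are filled in automatically at each call site.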
A race between Oracle's OracleDecimal class's ToString method (which P/Invokes the native version of the same functionality) and the garbage collector caused by a missing GC.KeepAlive call which can cause OracleDecimal.ToString() to return essentially arbitrary junk if its heap space happens to be overwritten before the call finishes.
I wrote a detailed bug report and never heard back, for all I know this is still out there. I even had a test harness that did nothing but create new OracleDecimal representations of the number 1, call ToString on them, and compare the result with "1". It would fail every ten-millionth time or so with crazy gibberish (huge numbers, negative numbers, and even alphanumeric junk strings).
Be careful out there with your P/Invoke calls! It is legal for the .NET garbage collector to collect your instance while a call to an instance method on that instance is still pending, as long as the instance method has finished using the this reference.
Reflector is an absolute lifesaver for stuff like this.
In Python, I had a thread doing something like this:
while True:
    with some_mutex:
        ...
        clock.tick(60)
clock.tick(60) suspends the thread so that it runs no more than 60 times per second.
The problem was that most of the time the program just showed a black screen. If I let it run for some time, it finally showed the game screen.
It's because the thread was sleeping while holding the mutex, so it rarely let other threads acquire it. It may seem obvious here, but it took me two days to figure out. The solution is simply to remove an indent level:
while True:
    with some_mutex:
        ...
    clock.tick(60)
A crash happening in a DLL, loaded from a service. Triggered by shutting the system down.
The bug was simple to fix, but it took about a week - and a lot of frustration - to locate.
Years ago I spent several days trying to track down and fix a small bug in dbx, the text-based debugger on AIX. I don't remember the exact bug. What made it tough was I was using the installed dbx to debug the dev version of dbx I was working on. It was very tough to keep track of where I was. More than once, I prepared to leave for the day and exited dbx twice (the dev version and the installed version) only to see that I was still running inside dbx, sometimes two or more levels "deep".
--
bmb
A Heisenbug where the main difficulty was not realizing it wasn't my bug at all.
The problem was an API interface. Calling any real function (as opposed to the setup stuff) had a very high probability of crashing with a protection violation. Single-stepping through the function (to the extent possible, it would hit an interrupt and you couldn't trace past that point--this was back when you used interrupts to talk to the system) produced the correct output, no crash.
After a long search in vain for what I was doing wrong I finally dug through the RTL routines to try to understand what I was doing wrong. What I was doing wrong was believing the routines worked--all the routines that bombed were manipulating a real-mode pointer with a protected-mode pointer type. Unless the real-mode segment value happened to be valid in protected mode this went boom.
However, something about the debugger's manipulation of the program caused correct operation while single-stepping, I never bothered to figure out why.
Might seem funny, but when I was learning I spent an entire afternoon trying to figure out why an if statement always evaluated to true: I had used = instead of ==. :D I'd rewritten everything twice on another computer. :)
A box had crashed at a big customer's site, and we had to connect via a WebX session to an IT guy's computer, which was connected to our box. I poked around for about an hour, grabbing stack traces, register dumps, statistics, counters, and dumping sections of memory that seemed relevant.
Their IT guys then emailed me a transcript of my session, and I got to work.
After a few hours, I'd traced it back to an array of structures which contained packet metadata followed by packet data. One of the packet's metadata was corrupt, and it looked like it had been overwritten by a few bytes of packet data. Bugzilla had no record of anything similar.
Delving into the code, I checked all the obvious things. The code that copied packet data into the buffer was meticulous about not exceeding its bounds: the buffer was the MTU size for the interface, and the copy routine checked that the data didn't exceed the MTU size. My memory dumps allowed me to validate that, yes, foo->bar was indeed 4 when the crash happened. Nothing added up. Nothing was wrong in a way that should have caused the problem. There were what looked like 16 bytes of packet data in the next header.
A couple days later, I started checking anything and everything that I could think of.
I noticed that the length of the data buffer was actually correct. That is, the number of bytes from start of data until end of data was an MTU, even though the next header started at MTU-16.
When these structs were malloc'd, pointers to each element were placed in an array, and I'd dumped that array. I started measuring distance between these pointers. 6888... 6888... 6888... 6872... 6904... 6880... 6880...
Wait, what?
I started looking at the internal pointers and offsets in both structures. Everything added up. It just looked like my one bad structure - the one that'd been partially clobbered - was just 16 bytes too soon in memory.
The allocation routine malloc'd these guys as a chunk, and then carved them up in a loop:
for (i = 0; i < NUM_ELEMS; i++) {
    array[i] = &head[i * sizeof(foo)];
}
(with allowances for alignment, etc.).
When the array was filled, the value for my corrupt pointer must have been read as 0x8a1128ac instead of 0x8a1129ac.
I came to the conclusion that I'd been the victim of a 1-bit memory error during allocation (I know, I know! I didn't believe it either, but we'd seen them before on this hardware -- NULL values that were read as 0x00800000). In any case, I managed to convince my boss and co-workers that there was no other reasonable explanation, and that my explanation exactly explained what we were seeing.
So, box RMA'd.
A legacy database application (with only part of the source available) crashed when one particular user accessed a certain inventory feature. It worked perfectly for all other users. The user profile, right? Nope. Logging in as a different user (even as admin), the same user had the same problem.
Computer problem? Nope. Same user, different PC (under her login or any other login) still crashed.
The problem: when logging in, the program displayed a copyright splash screen that could be closed either by clicking the "X" on the window or by pressing any key. This user always clicked the "X", where other users always pressed a key. Clicking the "X" caused a memory leak, which only surfaced when the inventory lookup was accessed.
Fix: Don't click the X.
We had an RMI server running in a DOS prompt. Someone "selected" the window, which paused the process.
The fix was quite simple...press enter.
It was quite an agonizing day...
The toughest bugs I ever fixed actually came quite early in my career. I was working on a real-time system for a power station that used pairs of GEC 2050 computers with shared memory.
2050 RTOS had a main scheduling table which consisted of one slot per process, the contents of which were either an add 1,X instruction for an inactive process or a jump for an executable process. Executing this table with X set to zero meant that the first runnable process automatically got entered with the X register being the process number. Whoever designed this obviously felt he was being very clever!
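Purely as an illustration (the real table was 2050 machine code, not C), the scheduling-table trick might be modeled like this: each slot either advances the index (the "add 1,X" filler for an inactive process) or dispatches the process with its own index as the process number. Everything here is invented for the sketch:

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PROCS 8

typedef void (*proc_fn)(int);

/* One slot per process: NULL plays the role of the "add 1,X" filler
   (inactive); a function pointer plays the role of the jump (runnable). */
static proc_fn sched_table[NUM_PROCS];

static int last_run = -1;
static void run_proc(int id) { last_run = id; }

/* Walk the table with X starting at zero; the first runnable process is
   entered with X holding its process number, as on the 2050. */
static int dispatch(void) {
    for (int x = 0; x < NUM_PROCS; x++) {
        if (sched_table[x] != NULL) {
            sched_table[x](x);
            return x;
        }
    }
    return -1;  /* nothing runnable */
}
```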
The 2050 architecture also had a security feature where an unrecognised opcode always caused a halt. Since the 2050 had a full-blown front panel, you could then use that to try to work out what had crashed. Since the X register always held the current process ID, this was usually fairly straightforward.
There was no memory segmentation or protection, so it was perfectly possible for a process to corrupt either any other process currently in memory or indeed anything in the system area.
So far so consistent for the era (late 70s).
Since this particular system had shared memory between the two CPUs, the system configuration placed the system tables in the shared memory, to allow one CPU to start and stop processes in the other without having to go through any namby pamby secure interface.
Unfortunately this also allowed one CPU's wild process to corrupt the tables for the other CPU, so one CPU could happily crash the other. If this happened, what was running in the crashed CPU bore no relationship at all to the actual fault. Meanwhile the other CPU had happily carried on so there was no way to tell if it had caused the problem.
Needless to say, this provided a few hard to fix issues!
After a little bit of hair tearing, I ended up writing a fairly substantial patch to the O/S which looked for corruption in the scheduler table for the other CPU and crashed the CPU it was running on. This was hooked into a regular interrupt so while not being perfectly synchronised, at least it had a good chance of catching the offending process.
This helped me clear up quite a few mutual-CPU issues...
Unexplained SQL Server Timeouts and Intermittent Blocking
We had a problem where our users would time out for apparently no reason. I monitored the SQL Server for a while and found that every once in a while there would be a lot of blocking going on. So I needed to find the cause of this and fix it.
If there was blocking going on, then there must have been exclusive locks somewhere in the chain of stored proc calls… Right?
I walked through the full list of stored procs that were called, and all of the subsequent stored procs, functions and views. Sometimes this hierarchy was deep and even recursive.
I was looking for any UPDATE or INSERT statements… There weren’t any (except on temporary tables, which only had the scope of the stored proc, so they didn’t count).
On further research I found the locking is caused by the following:
A. If you use a SELECT INTO to create your temp table, then SQL Server places locks on system objects. The following was in our getUserPrivileges proc:
--get all permissions for the specified user
select permissionLocationId,
       permissionId,
       siteNodeHierarchyPermissionId,
       contactDescr as contactName,
       l.locationId, description, siteNodeId, roleId
into #tmpPLoc
from vw_PermissionLocationUsers vplu
     inner join vw_ContactAllTypes vcat on vplu.contactId = vcat.contactId
     inner join Location l on vplu.locationId = l.locationId
where isSelected = 1 and
      contactStatusId = 1 and
      vplu.contactId = @contactId
The getUserPrivileges proc is called with every page request (it is in the base pages). It was not cached like you might expect. It doesn’t look like it, but the SQL above references 23 tables in the FROM or JOIN clauses. None of these tables has the “with(nolock)” hint on it, so it is taking longer than it should. If I remove the WHERE clause to get an idea of the number of rows involved, it returns 159,710 rows and takes 3 to 5 seconds to run (after hours, with no one else on the server).
So if this stored proc can only be run one-at-a-time because of the lock, and it is being called once per page, and it holds the locks on the system tables for the duration of the select and temp table creation, you can see how it might be affecting the performance of the whole application.
The fix for this would be:
1. Use session-level caching so this is only called once per session.
2. Replace the SELECT INTO with code that creates the table using standard Transact-SQL DDL statements, and then use INSERT INTO to populate the table.
3. Put “with(nolock)” on everything involved with this call.
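Fix 2 might look something like the following sketch (the column types here are guesses; the real definitions would come from the underlying views):

```sql
--create the temp table up front with plain DDL, then populate it,
--instead of SELECT INTO (which holds locks on system objects while it runs)
create table #tmpPLoc (
    permissionLocationId          int,
    permissionId                  int,
    siteNodeHierarchyPermissionId int,
    contactName                   varchar(255),
    locationId                    int,
    description                   varchar(255),
    siteNodeId                    int,
    roleId                        int
)

insert into #tmpPLoc
select permissionLocationId,
       permissionId,
       siteNodeHierarchyPermissionId,
       contactDescr as contactName,
       l.locationId, description, siteNodeId, roleId
from vw_PermissionLocationUsers vplu with(nolock)
     inner join vw_ContactAllTypes vcat with(nolock) on vplu.contactId = vcat.contactId
     inner join Location l with(nolock) on vplu.locationId = l.locationId
where isSelected = 1 and
      contactStatusId = 1 and
      vplu.contactId = @contactId
```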
B. If the stored proc getUserPrivileges didn’t have enough problems for you, then let me add: it probably gets recompiled on each call. So SQL Server acquires a COMPILE lock on each call.
The reason it gets recompiled is that the temp table gets created and then a lot of rows are deleted from it (if @locationId or @permissionLocationId is passed in). This causes the stored proc to be recompiled on the SELECT that follows (yes, in the middle of running the stored proc). In other procs I’ve noticed a DECLARE CURSOR statement whose SELECT statement references a temporary table – this will force a recompile too.
For more info on recompilation see: http://support.microsoft.com/kb/243586/en-us
The fix for this would be:
1. Again, hit this stored proc far fewer times by using caching.
2. Have the @locationId or @permissionLocationId filtering applied in the WHERE clause while the table is being created.
3. Replace the temp tables with table variables – they result in fewer recompilations.
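Fixes 2 and 3 might be sketched together like so (again, the column types are guesses, and the @locationId handling is only illustrative): the rows that would have been deleted are filtered out while the table is filled, and the table variable avoids the temp-table recompilations.

```sql
--a table variable instead of #tmpPLoc, filtered as it is filled,
--so no post-create DELETE forces a recompile
declare @tmpPLoc table (
    permissionLocationId          int,
    permissionId                  int,
    siteNodeHierarchyPermissionId int,
    contactName                   varchar(255),
    locationId                    int,
    description                   varchar(255),
    siteNodeId                    int,
    roleId                        int
)

insert into @tmpPLoc
select permissionLocationId, permissionId, siteNodeHierarchyPermissionId,
       contactDescr, l.locationId, description, siteNodeId, roleId
from vw_PermissionLocationUsers vplu
     inner join vw_ContactAllTypes vcat on vplu.contactId = vcat.contactId
     inner join Location l on vplu.locationId = l.locationId
where isSelected = 1 and
      contactStatusId = 1 and
      vplu.contactId = @contactId and
      (@locationId is null or l.locationId = @locationId)
```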
If things don’t work like you expect them to, then you can spend a lot of time staring at something without ever figuring out what is wrong.
I'm currently attending university and the hardest bug I encountered was from a programming class there. In the previous two semesters, we simply wrote all of our own code. But for the third semester, the professor and TA would write half the code, and we were to write the other half. This was to help us learn to read code.
Our first assignment for that semester was to write a program that simulates DNA gene splitting. Basically, we just had to find a substring in a larger one and process the results. Apparently, the professor and TA were both busy that week and gave us their half of the code without having finished their own full implementation; their half would compile, but without a complete solution there was no way for them to test it. We were told not to alter the professor's code. Everyone in the class had the exact same bug, but we all still assumed we were just all making the same mistake.
The program was gobbling gigabytes of memory, then running out and crashing. We (the students) all assumed that our half of the code must have some obscure memory leak in it. Everyone in the class spent two weeks scouring the code and running it through a debugger over and over again. Our input file was a 5.7 MB string, and we were finding hundreds of substrings in it and storing them. The professor/TA's code used this:
myString = myString.substr(0,pos);
See the problem? When you assign a string variable to its own substring, the memory is not reallocated. That's a tidbit of information nobody (not even the professor or TA) knew. So myString had 5.7 MB of allocated memory only to hold a few bytes of actual data. This was repeated hundreds of times; hence the massive memory usage. I spent two weeks on this problem: the first week checking my own code for memory leaks, and, after concluding in frustration that the professor/TA's half must have the leak, the second week checking their code. Even then, it took me a long time to find because this wasn't technically a leak. All allocations were eventually freed, and the program worked fine when our input data was only a dozen kilobytes. The only reason I found it was that I went psycho and decided to analyze every single last variable, even the temporary throw-away stuff. I had also been checking how many chars each string actually held, not how much memory was allocated; I assumed the string class was taking care of that. Here was the solution, a one-line change that fixed weeks of frustration and earned me an A on the assignment for finding/fixing the teacher's code:
myString.substr(0,pos).swap(myString);
The swap method does force a reallocation.
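Worth noting: with C++11 move semantics, the plain assignment now usually steals the temporary's buffer, so modern code is less likely to hit this exact behavior. The swap idiom (or, since C++11, shrink_to_fit) still illustrates the size-versus-capacity distinction; `shrink` here is a made-up helper name:

```cpp
#include <cassert>
#include <string>

// Shrink a string's allocation down to its contents via the swap idiom.
// The temporary copy allocates only what the current contents need, and
// swap hands that small buffer back to s.  (Since C++11 the same request
// can be made directly with s.shrink_to_fit().)
void shrink(std::string &s) {
    std::string(s).swap(s);
}
```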
I fixed someone's bug in the code below:
private void foo(Bar bar) {
bar = new Bar();
bar.setXXX(yyy);
}
He was expecting bar to be changed outside foo!
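Java passes object references by value: the parameter bar is a local copy of the caller's reference, so pointing it at a new Bar is invisible to the caller, while calling a setter on the object it refers to is not. A sketch (this Bar class, its field, and the value are invented for illustration):

```java
class Bar {
    String xxx;                        // hypothetical field
    void setXXX(String v) { xxx = v; }
}

class Demo {
    // Buggy: reassigning the parameter only changes the local copy
    // of the reference; the caller's Bar is untouched.
    static void fooBuggy(Bar bar) {
        bar = new Bar();
        bar.setXXX("yyy");
    }

    // Fixed: mutate the object the caller's reference points to.
    static void fooFixed(Bar bar) {
        bar.setXXX("yyy");
    }
}
```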
I once had a bug in a .NET app that would cause the CLR to crash - yes the CLR would just exit with a non-zero result and there'd be no debug info.
I peppered the code with console trace messages trying to find out where the issue was (the error would occur at startup) and eventually found the few lines causing the problem. I tried isolating the issue but every time I did the isolated case would work!
In the end I changed the code from:
int value = obj.CalculateSomething();
to
int value;
value = obj.CalculateSomething();
Don't ask me why, but this worked.