There's a nasty problem that has temporarily stumped a number of engineers at my company trying to debug it.

The C++ program is normally run on a cluster of multicore computers with MPI.

It will run for a very long time -- perhaps days -- and then suddenly fail.

Most of the engineers working on it have eliminated any reasonable possibility of a bug in the program itself, so they're starting to blame a possible hardware problem. I suspect instead that there is a software problem in either a Linux kernel module or a device driver.

What I suspect is that a kernel module or device driver, in order to do some floating-point calculations, is doing FXSAVE/FXRSTOR in a manner that is unsafe on SMP systems. It could be something as simple as doing the FXSAVE to a static buffer in a kernel routine that needs to be reentrant. That would create a race condition that would very rarely corrupt the floating-point context of a thread.
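To make the suspected pattern concrete, here is a minimal user-space sketch. It is not kernel code and does not execute FXSAVE/FXRSTOR; `FpState`, `buggy_save`, `buggy_restore`, and `replay_bad_interleaving` are hypothetical names invented for illustration. It replays, in a fixed order, one interleaving in which two CPUs share a single static save area.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the SSE state that FXSAVE/FXRSTOR would move (only MXCSR shown).
struct FpState {
    std::uint32_t mxcsr;
};

// The hypothesized bug: one static save area shared by every caller.
static FpState g_shared_save_area;

void buggy_save(const FpState& live) { g_shared_save_area = live; }
void buggy_restore(FpState& live) { live = g_shared_save_area; }

// Replays one bad interleaving deterministically and returns the MXCSR
// that "CPU A" ends up with after leaving the routine.
std::uint32_t replay_bad_interleaving(std::uint32_t mxcsr_a, std::uint32_t mxcsr_b) {
    FpState cpu_a{mxcsr_a};
    FpState cpu_b{mxcsr_b};

    buggy_save(cpu_a);     // CPU A enters the routine and saves its state
    buggy_save(cpu_b);     // CPU B enters concurrently and overwrites the buffer
    buggy_restore(cpu_b);  // CPU B leaves and gets its own state back
    assert(cpu_b.mxcsr == mxcsr_b);  // B is unaffected in this interleaving
    buggy_restore(cpu_a);  // CPU A leaves and is handed B's state instead
    return cpu_a.mxcsr;
}
```

If the two contexts had different MXCSR values, CPU A resumes with CPU B's settings, which from the application's point of view looks like bits changing with no code to change them. A save area that is per-CPU or on the caller's stack would not have this problem.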

At the application level, what appears to be happening is that one or more bits of the MXCSR -- which is part of the FXSAVE/FXRSTOR context -- are suddenly changed, even though there is no application code that changes them.
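One way to narrow down when the change happens is to checkpoint the MXCSR from the application itself. The following is a minimal sketch, assuming an x86 build where `_mm_getcsr()` from `<xmmintrin.h>` is available; the helper names (`mxcsr_changed_fields`, `read_mxcsr`, `mxcsr_checkpoint`) are hypothetical. The bit masks follow the MXCSR layout in the Intel manuals, and 0x1F80 is the documented reset default.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

#if defined(__SSE__) || defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>  // _mm_getcsr
#define HAVE_MXCSR 1
#else
#define HAVE_MXCSR 0
#endif

struct MxcsrField {
    std::uint32_t mask;
    const char* name;
};

// MXCSR layout: bits 0-5 sticky exception flags, bit 6 DAZ,
// bits 7-12 exception masks, bits 13-14 rounding control, bit 15 FTZ.
static const MxcsrField kMxcsrFields[] = {
    {0x003Fu, "exception flags"},
    {0x0040u, "DAZ"},
    {0x1F80u, "exception masks"},
    {0x6000u, "rounding control"},
    {0x8000u, "FTZ"},
};

// Returns the names of the fields that differ between two MXCSR values.
// The sticky exception flags are skipped by default because ordinary
// arithmetic sets them legitimately.
std::vector<std::string> mxcsr_changed_fields(std::uint32_t expected,
                                              std::uint32_t actual,
                                              bool ignore_sticky_flags = true) {
    std::vector<std::string> changed;
    const std::uint32_t diff = expected ^ actual;
    for (const MxcsrField& f : kMxcsrFields) {
        if (ignore_sticky_flags && f.mask == 0x003Fu) continue;
        if (diff & f.mask) changed.push_back(f.name);
    }
    return changed;
}

// Reads the calling thread's MXCSR. On non-x86 builds this just returns
// the x86 reset default so the sketch still compiles.
std::uint32_t read_mxcsr() {
#if HAVE_MXCSR
    return _mm_getcsr();
#else
    return 0x1F80u;
#endif
}

// Debug-build checkpoint: aborts if a control field drifted from `expected`.
inline void mxcsr_checkpoint(std::uint32_t expected) {
    assert(mxcsr_changed_fields(expected, read_mxcsr()).empty() &&
           "MXCSR control bits changed unexpectedly");
}
```

A thread would record `read_mxcsr()` once at startup and call `mxcsr_checkpoint` with that value at convenient points in the long-running loop (for example, around MPI calls). This only shrinks the window in which the corruption is noticed; it does not by itself identify which kernel code, if any, is responsible.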

I encountered something similar many years ago on Windows, which ultimately turned out to be a bug in a video driver, such that when the application code was preempted by the operating system, some MXCSR bits in that thread's context were corrupted.

I'm not an expert at Linux kernel hacking or device driver development, but from what I'm reading, the reentrancy rules have changed a lot: between non-SMP and SMP (multi-core) systems, between kernel versions, and so on. So the possibility of a race-condition bug seems reasonable.

So my question is: Are there any known Linux driver (or kernel) bugs that fit that description?

Any precedents with similar symptoms that I could cite would be helpful. At this point, a lot of the people involved are (IMHO) wasting time thinking, "Well, there's no bug in my code, so it must be bad hardware," and I'd like to get them past that and looking for something more likely to be the true cause.

A: 

This isn't the best place to ask this question - you should mail the LKML (http://en.wikipedia.org/wiki/LKML)

Paul Betts
A: 

Could you tell us how the program fails?

It's complicated, but it blows up with a segmentation fault, which can be traced back to a wrong bit in the MXCSR.
Die in Sente
A: 

The source for your kernel is available, usually as a src.rpm. You can extract this (and the .tgz inside) and then grep everything for fxsave asm instructions and the like. I'd be very surprised if you find something, but who knows? If you are running any binary video drivers then see if the problem persists without them loaded.

  1. download kernel-2-whatever.src.rpm
  2. mkdir temp; cd temp
  3. rpm2cpio ../kernel*rpm | cpio -id
  4. tar xvf linux-*.tgz
  5. grep -ri fxsave *
Andy Grover