My application uses lseek() to seek to the desired position before writing data. The file is successfully opened using open(), and my application has been able to use lseek() and write() many times.

At some point, for some users and not easily reproducible, lseek() returns -1 with an errno of 9 (EBADF). The file is not closed before this, and the file handle (int) isn't reset.

After this, another file is created; open() succeeds again, and lseek() and write() work again.

To make it even stranger, this user tried the complete sequence again and all was well.

So my question is, can the OS close the file handle for me for some reason? What could cause this? A file indexer or file scanner of some sort?

What is the best way to solve this? Is this pseudocode the best solution? (Never mind the code layout; I will create functions for it.)

int fd = open(...);
if (fd > -1) {
    off_t result = lseek(fd, ...);
    if (result == (off_t)-1 && errno == EBADF) {  /* errno 9 */
        close(fd);   /* make sure we try to close nicely */
        fd = open(...);
        result = lseek(fd, ...);
    }
}

Anybody have experience with something similar?

Summary: seeking and writing work fine for a given fd, then suddenly lseek() fails with errno = 9 (EBADF) for no apparent reason.

+2  A: 

The OS will not close file handles randomly (I am assuming a Unix-like system). If your file handle is closed, then there is something wrong with your code, most probably elsewhere (thanks to the C language and the Unix API, this can be almost anywhere in the code, e.g. a slight buffer overflow in a piece of code which looks completely unrelated).

Your pseudo-code is the worst solution, since it will give you the impression of having fixed the problem, while the bug still lurks.

I suggest that you add debug prints (e.g. printf() calls) wherever you open and close a file or socket. Also, try Valgrind.
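
For instance, a minimal sketch of such logging wrappers (the helper names open_logged/close_logged are made up for illustration, not part of this answer): route every open()/close() through one place so an unexpected close or a clobbered descriptor shows up in the log.

    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <errno.h>

    /* Hypothetical wrappers: log every open()/close() with its fd. */
    static int open_logged(const char *path, int flags, mode_t mode)
    {
        int fd = open(path, flags, mode);
        if (fd == -1)
            fprintf(stderr, "open(\"%s\") failed: %s\n", path, strerror(errno));
        else
            fprintf(stderr, "open(\"%s\") -> fd %d\n", path, fd);
        return fd;
    }

    static int close_logged(int fd)
    {
        int rc = close(fd);
        if (rc == -1)
            fprintf(stderr, "close(%d) failed: %s\n", fd, strerror(errno));
        else
            fprintf(stderr, "close(%d) ok\n", fd);
        return rc;
    }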

(Just yesterday I had a spooky off-by-one buffer overflow which damaged the least significant byte of a temporary slot the compiler had generated to save a CPU register; the indirect effect was that a structure in another function appeared to be shifted by a few bytes. It took me quite some time to understand what was going on, including some thorough reading of MIPS assembly code.)

Thomas Pornin
I was expecting this reaction, which is the most obvious one of course. Not to be cocky, but I am almost sure there isn't a memory leak. The file handle id remains the same (checked), and the lseek and write just before it worked fine. There is nothing dynamic (memory-wise) going on between write operations. All file I/O code is being logged (NSLog) and everything goes okay, until suddenly in some situations (not reproducible a second time) the -1 occurs with errno=9. You have to agree that as long as the fd doesn't change it is most unlikely that errno=9 would appear, right? For some reason it does.
Ger Teunis
@Thomas: +1 I couldn't agree more. Bugs should be fixed, not hidden. @Ger, how can you be sure you don't have stack or heap corruption? This sounds like a sinister bug which needs fixing.
Sam Post
Sure, I do agree; better to find the cause. But at first sight it seems the cause is outside my scope. It still seems that way, because there is no way the current fd is being overwritten with another value.
Ger Teunis
Someone made a suggestion about what might happen if the (network) drive suddenly gets disconnected. Will have a look into that.
Ger Teunis
A: 

No, the OS should not close file handles just like that, and other applications (file scanners etc.) should not be able to do it.

Do not work around the problem; find its source. If you don't know what the reason for your problem was, you will never know whether your workaround actually works.

  1. Check your assumptions. Is errno set to 0 before the call? Is fd really valid at the point the call is being made? (I know you said it is, but did you check it? A quick check like the sketch below can verify this.)
  2. What is the output of puts( strerror( 9 ) ); on your platform?
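
As an illustration (not part of the original answer), both checks can be bundled into a small helper; fcntl(fd, F_GETFD) fails with EBADF if the descriptor is no longer open:

    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <errno.h>

    /* Sketch: report whether fd is still a valid open descriptor,
       and what errno 9 means on this platform. */
    static void check_fd(int fd)
    {
        errno = 0;
        if (fcntl(fd, F_GETFD) == -1)
            printf("fd %d is NOT valid: %s\n", fd, strerror(errno));
        else
            printf("fd %d is still open\n", fd);

        printf("errno 9 on this platform: %s\n", strerror(9));
    }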
DevSolar
errno is 0 before the call, and the result of lseek suddenly is -1. The fd is REALLY, REALLY open, because the writes just before it in the same iteration do work.
Ger Teunis
Well then, go on checking your assumptions. You know your code better than me. Printing the value of fd before each call would be my next move.
DevSolar
Good idea, and there was also a suggestion about what happens when a (network) drive gets disconnected for some reason.
Ger Teunis
+4  A: 

So my question is, can the OS close the file handle for me for some reason? What could cause this? A file indexer or file scanner of some sort?

No, this will not happen.

What is the best way to solve this? Is this pseudocode the best solution? (Never mind the code layout; I will create functions for it.)

No, the best way is to find the bug and fix it.

Anybody have experience with something similar?

I've seen fds get messed up many times, resulting in EBADF in some of the cases and blowing up spectacularly in others. It's been:

  • buffer overflows - overflowing something and writing a nonsense value into an 'int fd;' variable.
  • silly bugs where, in some corner case, someone wrote if(fd = foo[i].fd) when they meant if(fd == foo[i].fd) (see the sketch below).
  • race conditions between threads, where one thread closes the wrong file descriptor while another thread still wants to use it.
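
A minimal sketch of the second point (the struct and variable names are made up for illustration, not from this answer): the accidental assignment silently overwrites fd with whatever happens to be in the table.

    #include <stdio.h>

    struct conn { int fd; };

    int main(void)
    {
        struct conn foo[2] = { { 7 }, { 0 } };
        int fd = 3;                  /* the descriptor we actually use */

        /* Bug: '=' assigns foo[0].fd (7) to fd and tests the result,
           instead of comparing. fd is now silently corrupted. */
        if (fd = foo[0].fd)
            printf("fd is now %d, not 3\n", fd);

        /* Intended: compare without modifying fd. */
        if (fd == foo[0].fd)
            printf("match\n");

        return 0;
    }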

If you can find a way to reproduce this problem, run your program under 'strace' so you can see what's going on.
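
For example, assuming a Linux system where strace is available (the program name is a placeholder), something like:

    strace -f -e trace=open,close,lseek,write -o trace.log ./yourapp

On Mac OS X, dtruss provides similar syscall tracing.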

nos
I do agree about finding the bug. I'll do some more code review, but I will also check what happens if e.g. a network drive is disconnected. Perhaps this will cause a Bad File Descriptor error?
Ger Teunis
After close inspection, the fd doesn't change at all and only one thread is doing the disk I/O (all file handling). I may assume that if I do not close the fd and the fd remains the same, I should not get the errno and the -1 result, right? Yet it really does happen.
Ger Teunis
While it's entirely possible that some particular OS, using some particular filesystem/network I/O, has a corner-case bug that causes this to happen, it'd just be guesswork. A strace of all the threads when this does happen would be very helpful though.
nos
After a lot more logging and debugging, it was not an error in my code. The fh's never change and my code doesn't close them. The file handles are controlled outside of my program scope (open and write), so it's not a memory issue either. You are right here. All file handles were okay. No closing of file handles. There wasn't a problem in my code, just OS X marking the fh invalid. Reopening the file also returned the same fd id. Funny stuff.
Ger Teunis
+1  A: 

I don't know what type of setup you have, but the following scenario could, I think, produce such an effect (or one similar to it). I have not tested this to verify it, so please take it with a grain of salt.

If the file/device you are opening is implemented by a server application (e.g. NFS), consider what could happen if the server application goes down / restarts / reboots. The file descriptor, though originally valid at the client end, might no longer map to a valid file handle at the server end. This can conceivably lead to a sequence of events in which the client gets EBADF.

Hope this helps.

Sparky
You are right here. All file handles were okay. No closing of file handles. There wasn't a problem in my code, just OS X marking the fh invalid. Reopening the file also returned the same fd id. Funny stuff.
Ger Teunis