views:

90

answers:

4

Are there any alternatives to stat (which is found on most Unix systems) which can determine the file type? The manpage says that a call to stat is expensive, and I need to call it quite often in my app.

+1  A: 

Are you aware of the "magic" file on *nix systems? By querying a file from the command line with something like `file myfile.ext`, you can get the real file type.

This is done by reading the contents of the file rather than looking at its extension, and is widely used on *nix (Linux, Unix, ...) systems.

Etamar L.
If `stat()` is expensive, can you imagine the cost of running a program to determine the file type?
Jonathan Leffler
(1) I didn't say to run the program. I mean detecting a file type can be done by its first bytes ("signature") if Helper really needs to know the type. It's a tradeoff to consider. (2) But if the extension is trusted, or sufficient, then it can be used without opening the file as previously suggested. Note that he said "any alternative".
Etamar L.
The cost of determining file type is immense compared with a stat call. Also the OP refers to unix file type, not mime type.
Matt Joiner
+5  A: 

The alternative is fstat() if you already have the file open (so you have a file descriptor for it). Or lstat() if you want to find out about symbolic links rather than the file the symlink points to.
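As a rough illustration of the difference between the two calls, here is a minimal sketch. The helper names `is_symlink` and `fd_is_regular` are made up for this example; only `lstat()`, `fstat()` and the `S_IS*` macros come from POSIX.

```c
#define _DEFAULT_SOURCE /* assumption: glibc; makes lstat() and S_ISLNK visible */
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: 1 if `path` itself is a symbolic link, 0 if not,
 * -1 on error. lstat() examines the link; stat() would follow it. */
int is_symlink(const char *path)
{
    struct stat st;
    if (lstat(path, &st) != 0)
        return -1;
    return S_ISLNK(st.st_mode) ? 1 : 0;
}

/* Hypothetical helper: 1 if the already-open descriptor refers to a
 * regular file, 0 if not, -1 on error. fstat() takes no path at all,
 * so there is no name to resolve. */
int fd_is_regular(int fd)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;
    return S_ISREG(st.st_mode) ? 1 : 0;
}
```

If you already hold a descriptor from `open()`, `fd_is_regular(fd)` answers the question without touching the path again.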

I think the man page is exaggerating the cost; it is not much worse than any other system call that has to resolve the name of the file into an inode. It is more costly than getpid(); it is less costly than open().

Jonathan Leffler
The manpage's note about the expense of stat isn't so much saying to use something besides stat; it's implying that you should try to avoid unnecessary calls to stat.
Jefromi
Indeed... in fact, if you look at straces of some linux programs, you'll see quite a few stat calls.
jpalecek
The typical process stat's everything constantly. Especially those that wrap or are built on a layer on top of C. They generate constant stat calls to "check" things are as they expected etc. Even hitting "tab" on an autocompleting terminal generates piles of stat calls.
Matt Joiner
+2  A: 

The "file type" that stat() gives you is whether the file is a regular file or something like a device file or directory, among other things like its size and inode number. If that's what you need to know, then you must use stat().
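For reference, this is roughly how that kind of file type is read out of the `st_mode` field. It is only a sketch: `file_type_name` is a name invented here, and the labels are arbitrary strings.

```c
#define _DEFAULT_SOURCE /* assumption: glibc; makes S_ISLNK and S_ISSOCK visible */
#include <sys/stat.h>

/* Hypothetical helper: maps the st_mode value filled in by
 * stat(), lstat() or fstat() to a human-readable label. */
const char *file_type_name(mode_t mode)
{
    if (S_ISREG(mode))  return "regular file";
    if (S_ISDIR(mode))  return "directory";
    if (S_ISLNK(mode))  return "symbolic link"; /* only ever seen via lstat() */
    if (S_ISCHR(mode))  return "character device";
    if (S_ISBLK(mode))  return "block device";
    if (S_ISFIFO(mode)) return "FIFO";
    if (S_ISSOCK(mode)) return "socket";
    return "unknown";
}
```

Typical use would be `struct stat st; if (stat(path, &st) == 0) puts(file_type_name(st.st_mode));` with `<stdio.h>` included.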

If what you actually need to know is the type of the file's contents -- e.g. text file, JPEG image, MP3 audio -- then you have two options. You can guess based on the filename extension (if it ends in ".mp3", the file probably contains MP3 audio), or you can use libmagic, which actually opens the file and reads some of its contents to figure out what it is. The libmagic approach is more expensive (if you're trying to avoid stat(), you probably want to avoid open() too), but less prone to error (in case that ".mp3" file is actually a JPEG image, for example).
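To make the tradeoff concrete, here is a small sketch of both approaches. It does not use libmagic; it checks two hand-picked signatures (JPEG data starts with the bytes FF D8 FF, and an MP3 carrying an ID3v2 tag starts with "ID3"), whereas libmagic consults a far larger database. The function names are invented for this example.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: guess from the name alone. No open(), no read(),
 * but it believes whatever the extension says. */
const char *guess_by_extension(const char *path)
{
    const char *dot = strrchr(path, '.');
    if (dot == NULL)
        return "unknown";
    if (strcmp(dot, ".mp3") == 0)
        return "mp3";
    if (strcmp(dot, ".jpg") == 0 || strcmp(dot, ".jpeg") == 0)
        return "jpeg";
    return "unknown";
}

/* Hypothetical helper: guess from the first few bytes. Costs an open and
 * a read, but is not fooled by a misleading file name. */
const char *guess_by_signature(const char *path)
{
    static const unsigned char jpeg_sig[] = { 0xFF, 0xD8, 0xFF };
    static const unsigned char id3_sig[]  = { 'I', 'D', '3' };
    unsigned char buf[3];
    size_t n;

    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return "unknown";
    n = fread(buf, 1, sizeof buf, fp);
    fclose(fp);

    if (n == sizeof buf && memcmp(buf, jpeg_sig, sizeof jpeg_sig) == 0)
        return "jpeg";
    if (n == sizeof buf && memcmp(buf, id3_sig, sizeof id3_sig) == 0)
        return "mp3";
    return "unknown";
}
```

For a JPEG that has been renamed to end in ".mp3", the first function reports "mp3" and the second reports "jpeg", which is exactly the kind of error the content-based approach avoids.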

Wyzard
+2  A: 

Under Linux with some filesystems the file type (regular, char device, block device, directory, pipe, sym link, ...) is stored in the linux_dirent struct, which is how the kernel hands directory entries to applications via the getdents system call. If the file type is the only thing you need from the stat structure, and you need it for all or many entries of a directory, you could use getdents directly (rather than readdir) and try to get the file type out of that, falling back to stat only when linux_dirent reports an unknown file type. Depending on your application's filesystem usage pattern this could be faster than using stat if you are on Linux, but stat should be fast in many cases.
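A sketch of that idea follows. Note the substitution: instead of calling getdents directly, it uses `readdir()`, which on Linux with glibc (and on the BSDs) exposes the same information as the `d_type` field of `struct dirent`. The function name `count_subdirs` is invented for this example, and `d_type` is not part of POSIX, so treat this as Linux/BSD-specific.

```c
#define _DEFAULT_SOURCE /* assumption: glibc; exposes d_type and the DT_* constants */
#include <dirent.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper: counts the subdirectories of `dir_path`
 * (not counting "." and ".."). Returns -1 if the directory cannot be opened. */
int count_subdirs(const char *dir_path)
{
    DIR *dir = opendir(dir_path);
    struct dirent *entry;
    int count = 0;

    if (dir == NULL)
        return -1;

    while ((entry = readdir(dir)) != NULL) {
        int is_dir;

        if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
            continue;

        if (entry->d_type != DT_UNKNOWN) {
            /* The filesystem supplied the type: no stat call needed. */
            is_dir = (entry->d_type == DT_DIR);
        } else {
            /* The filesystem did not fill in d_type: fall back to an
             * lstat-style lookup relative to the open directory. */
            struct stat st;
            is_dir = (fstatat(dirfd(dir), entry->d_name, &st,
                              AT_SYMLINK_NOFOLLOW) == 0
                      && S_ISDIR(st.st_mode));
        }
        count += is_dir;
    }

    closedir(dir);
    return count;
}
```

The fallback branch matters: some filesystems always report `DT_UNKNOWN`, so code that skips it will silently miscount there.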

Stat's speed has mostly to do with locating the requested data on disk. If you are traversing a directory recursively and stat-ing every file, each stat should be fairly quick overall, because earlier stat calls will already have pulled most of the data it needs into the cache before you ask the kernel for it. If, on the other hand, you stat the same number of files scattered randomly around the system, the kernel will likely have to read several directories from disk for each file you call stat on.

fstat should always be very fast, since the kernel must already have the data you're asking for in RAM in order for the file to be open. It also doesn't have to walk the path of the filename, checking whether each component is in RAM or on disk and possibly (though probably not) reading a directory from disk, only to discover that it already has the data you are asking for in RAM.

That being said, calling stat on an open file should be faster than calling it on an unopened file.

nategoose