Since reading this question, I've been wondering: what would be the Right Thing™ to do to associate a UNIX file (i.e., a blob of data) with a file type? We have file extensions (DOS), data/resource forks (Apple), and ... err, file extensions and file(1) (UNIX), and MIME types (HTML et al.).

Given all your grief with handling files of various types: what would the perfect data storage system look like (you may dream), in terms of structure and/or API?

+4  A: 

The thing to remember is that there really is only one file type in real UNIX: a vector of bytes. Extensions etc are conventions only; real UNIX doesn't have a mechanism that associates a file with some particular tool or executable; all the tools, theoretically, should work on any file. (This isn't as true as it was in the Good Old Days, but that's still the basic design theory.)

What you do to associate a file with a particular tool is put an appropriate "magic number" at the start of the file, traditionally in the first 16 bits. The most common example of that, by far, is the "shebang line": the one that starts with #! as the first two characters. That particular magic number indicates a file that is to be interpreted or processed with some other tool, and while it's most commonly used for interpreters like bash, perl, or Python, it will work with almost any program.

For some of the others, see man magic and man 1 file.

(I think it would even work with, say, wc. I'll try it in a second and update.)

UPDATE

Sure enough, it does. Here's my example:

This is my Clozure Common Lisp init file, just because it was handy, copied to a file wctry, with a shebang line to run wc -w, followed by executing the file.

$ cat wctry 
#!/usr/bin/wc -w
(format t "In init file.")
$ ./wctry 
       7 ./wctry
$
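
A minimal Python sketch of how a tool could check for that magic number itself. The function name read_shebang and the demo helper are hypothetical, and the kernel's real handling has more rules (line-length limits, how arguments are split) than this shows.

```python
import os
import tempfile


def read_shebang(path):
    """Return the command after '#!' on the first line, or None if absent."""
    with open(path, "rb") as f:
        first_line = f.readline()
    # The two bytes 0x23 0x21 ('#!') are the 16-bit magic number.
    if not first_line.startswith(b"#!"):
        return None
    return first_line[2:].strip().decode("utf-8", errors="replace")


def demo():
    # Recreate the wctry file from the session above in a temporary directory.
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "wctry")
        with open(path, "w") as f:
            f.write('#!/usr/bin/wc -w\n(format t "In init file.")\n')
        return read_shebang(path)
```

Calling demo() only inspects the first line; it does not execute anything.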
Charlie Martin
A: 

The infrastructure for this already exists (where "this" means associating a file with its type without relying on a set of magic bytes): metadata stored in extended file attributes.

Few people use them in practice, which suggests that there isn't a real need for them and that the file(1) method plus extension conventions are good enough.
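
For illustration, a hedged Python sketch of tagging a file with its type through an extended attribute. It assumes Linux, where Python exposes os.setxattr and os.getxattr, and a filesystem that allows user.* attributes; the attribute name user.mime_type and both helper names are assumptions made for this example. Where attributes are unavailable, the helpers simply report failure.

```python
import os
import tempfile

ATTR = "user.mime_type"  # assumed attribute name, chosen for this sketch


def tag_mime_type(path, mime):
    """Try to store the MIME type as an extended attribute; True on success."""
    if not hasattr(os, "setxattr"):  # not available on every platform
        return False
    try:
        os.setxattr(path, ATTR, mime.encode("ascii"))
        return True
    except OSError:  # filesystem does not support (user) xattrs
        return False


def read_mime_type(path):
    """Return the stored MIME type, or None if there is none to read."""
    if not hasattr(os, "getxattr"):
        return None
    try:
        return os.getxattr(path, ATTR).decode("ascii")
    except OSError:
        return None


def demo():
    with tempfile.NamedTemporaryFile() as f:
        stored = tag_mime_type(f.name, "text/x-lisp")
        return stored, read_mime_type(f.name)
```

The fallback paths are the practical point: the tag is only as reliable as the filesystem underneath it.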

Vinko Vrsalovic
+4  A: 

Some kind of metadata system is probably best, but the trouble comes when you try to integrate with other operating systems and disks that don't support the same metadata. I think that's why Mac OS X started using file extensions, even though it has no real need for them.

JW
+3  A: 

Your question kind of contradicts itself: you ask for the "perfect way" and yet say that it has to be Unix, which already has a pretty fixed way of doing things.

If I had to dream of the "Perfect" way then I have to admit that Windows already comes pretty close to it. IMHO the perfect way would be something like this:

  1. Every tool that can open some files installs a handler in the system. Whenever the system needs to determine which tool can handle a given file, it queries all the handlers and checks which ones return TRUE. Or, more likely, the handlers will each return a list of actions that they can perform on the file. This is already implemented in Windows.
  2. As an optimization, a file can carry some extended metadata that stores the file's MIME type. This is done so that a handler can make a quick decision without actually looking at the file's contents. This can theoretically be done in Windows (by using alternate data streams on NTFS), although there is no such practice yet.
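
The two points above can be sketched as a small handler registry. This is a toy model in Python, not the Windows shell API: every name here (register, actions_for, pdf_viewer, hex_editor) is hypothetical.

```python
from typing import Callable, Dict, List

# A handler looks at a file's name and leading bytes and returns the
# actions it can offer (an empty list means "I can't handle this").
Handler = Callable[[str, bytes], List[str]]

_handlers: Dict[str, Handler] = {}


def register(tool: str, handler: Handler) -> None:
    """Install a tool's handler in the (toy) system registry."""
    _handlers[tool] = handler


def actions_for(name: str, head: bytes) -> Dict[str, List[str]]:
    """Query every handler and keep those offering at least one action."""
    offers = {}
    for tool, handler in _handlers.items():
        actions = handler(name, head)
        if actions:
            offers[tool] = actions
    return offers


def pdf_viewer(name: str, head: bytes) -> List[str]:
    return ["open", "print"] if head.startswith(b"%PDF") else []


def hex_editor(name: str, head: bytes) -> List[str]:
    return ["edit"]  # happy with any vector of bytes


register("pdf_viewer", pdf_viewer)
register("hex_editor", hex_editor)
```

A stored MIME type would let each handler answer from metadata alone instead of reading the leading bytes.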

I have to admit that I'm not exactly sure whether MIME type would be the best thing. On one hand, it's an already established standard that contains definitions for many file types. On the other hand, even more file types don't have an associated MIME type. Perhaps there could be an alternative to the MIME type, like a GUID. Rarely used file types that are specific to one piece of software (and thus have no registered MIME type) would use the GUID, while popular file formats that many tools support would use the standardized MIME types.

Added: A relevant article has just been written in a blog called "The Old New Thing".

Vilx-
No need for a GUID, MIME defines an application-specific type "application". See http://www.ietf.org/rfc/rfc2046.txt
Mark Ransom
You just described the way the old MacOS worked, except that it used a pair of 32-bit values instead of MIME types (because those didn't exist at the time). Too bad it was abandoned in favor of DOS-like extensions.
Javier
Probably closer to how BeOS used to do it - including that it used mime types, too.
Matthew Schinckel
I'm not saying that the UNIX way is any good. Your ideas would be much better.
doppelfish
+1  A: 

Adding on to Charlie Martin's answer, some other magic numbers used to identify files (these are also used by the file(1) program mentioned by others):

FourCC signatures on video files - a video format can be identified even without a file extension, by reading the four character code which uniquely identifies the codec used... (Also, Microsoft's list.)

Some other by-design magic numbers:

  • "%PDF": Adobe PDF
  • "%!PS-Adobe-": Adobe PostScript (followed by a version number)
  • Hex 1F 8B 08: GZip file
  • Hex 00 01 00 followed by "Standard Jet DB": MS Access DB
  • "!BDN": Outlook PST file
  • "BM": Windows Bitmap .BMP image
  • "GIF89a": GIF image
  • "MZ": DOS/Windows executable file


The main issue with using purely signature-based methods of determining file type, especially with ASCII text signatures like "BM" or "MZ", is that they cannot distinguish between a binary file of the specified type and a plain text file which just happens to begin with those letters. (This is less of a problem with binary signatures, but could still cause conflicts with some binary special devices.)
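
A short Python sketch of prefix-based sniffing makes the false-positive problem concrete. The table covers only signatures from the list above, and the function name sniff is hypothetical; file(1) uses a far richer rule database.

```python
# (signature bytes, description) pairs taken from the list above.
SIGNATURES = [
    (b"%PDF", "Adobe PDF"),
    (b"%!PS-Adobe-", "Adobe PostScript"),
    (b"\x1f\x8b\x08", "GZip file"),
    (b"!BDN", "Outlook PST file"),
    (b"GIF89a", "GIF image"),
    (b"BM", "Windows Bitmap"),
    (b"MZ", "DOS/Windows executable"),
]


def sniff(data: bytes):
    """Return the description of the first matching signature, or None."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return None


# A real gzip header is recognised...
gzip_guess = sniff(b"\x1f\x8b\x08\x00")
# ...but so is plain text that merely begins with the letters "BM".
text_guess = sniff(b"BMW makes cars.\n")
```

Here text_guess comes back as "Windows Bitmap" even though the input is an ordinary sentence.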

Stobor
A: 

I agree with JW. Theoretically, a filesystem where every file has a piece of metadata associated with it (I'd vote for mime type, personally) would be optimal in my opinion. I don't think Vilx's objection that not everything has an accepted mime type is valid--there's already a system for "nonstandard" or non-registered mime types (like "application/x-stuffit") that each application can define at will.

Realistically speaking, though, you're never going to get that metadata to propagate properly across to other operating systems. Filename extensions have become the de facto metadata that works on almost every commonly-used operating system in existence today.
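
As a small illustration of extensions acting as the de facto metadata, Python's standard mimetypes module derives a MIME type from the file name alone. The wrapper name mime_from_name is hypothetical.

```python
import mimetypes


def mime_from_name(filename):
    """Guess a MIME type purely from the file name; None if it has no known extension."""
    mime, _encoding = mimetypes.guess_type(filename)
    return mime
```

With this, mime_from_name("index.html") gives "text/html", while a name with no extension, such as "wctry", gives None: the type information lives and dies with the name.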

Ross
A: 

I hate invisible metadata, because it doesn't always travel with the file.

I like the approach taken by the CUPS printing system. There are one or more configuration files that have access to the file contents and file name to determine the MIME type, then more configuration files to determine how to process said MIME type. This can work with magic signatures or file extensions, or both in combination, and can get pretty subtle if you want it to.
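
The two-stage idea can be sketched in Python as follows. This only loosely imitates the approach described above and does not use the real CUPS file syntax; the rule table, the processing-step names, and the functions detect and plan are all invented for the example.

```python
import fnmatch

# Stage 1: rules of the form (MIME type, name pattern or None, magic prefix or None).
# A rule matches only if every condition it specifies holds.
TYPE_RULES = [
    ("application/pdf", None, b"%PDF"),
    ("application/postscript", "*.ps", None),
    ("application/postscript", None, b"%!"),
    ("image/gif", "*.gif", b"GIF8"),  # name AND content must agree
    ("text/plain", "*.txt", None),
]

# Stage 2: MIME type -> processing step (hypothetical step names).
PROCESSORS = {
    "application/pdf": "render_pdf",
    "application/postscript": "render_postscript",
    "image/gif": "render_image",
    "text/plain": "render_text",
}


def detect(name: str, head: bytes) -> str:
    """Apply the type rules in order; fall back to a generic binary type."""
    for mime, pattern, magic in TYPE_RULES:
        name_ok = pattern is None or fnmatch.fnmatchcase(name, pattern)
        magic_ok = magic is None or head.startswith(magic)
        if name_ok and magic_ok:
            return mime
    return "application/octet-stream"


def plan(name: str, head: bytes):
    """Return (MIME type, processing step) for a file."""
    mime = detect(name, head)
    return mime, PROCESSORS.get(mime, "reject")
```

Because the rules are data rather than code, tightening a rule (for example, requiring both extension and signature) is a one-line change.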

Mark Ransom
+1  A: 

It is strange, but I like the idea of extensions :) They are widely used on Unix and Windows.

The real question is whether an operating system should handle data type information specially or not. Windows protects the extension. Unix does not; applications handle extensions themselves.

Extensions are easy to work with. You can change a file's extension just by renaming it. You can select files with shell patterns: "grep line *.c". By contrast, "grep line file *|grep C++" is not so nice to have.

You could save data type info in the file's metadata. But that is too FS/OS dependent and will not live for long :)

Malx
A: 

The True Unix Way was to have no user-definable metadata with a file, and precious little system-definable metadata (permissions, last update time, and the like). In Unix, metadata begins and ends with stat(2). As has been said, programmers never saw metadata they didn't like, so the True Mac Way added the resource fork, which I have always thought would be a pleasant way to do business, but it sure doesn't play nicely with Unix. As Vilx points out, Windows has a nice system too, which is also utterly different from both Mac and Unix. I suppose if extended file attributes ever catch on universally, that would be great, but I suspect this area will remain a swamp of legacy filesystems and backward compatibility.

I'm a big fan of file formats that

  • Are human-readable ASCII
  • Identify their own versions
  • Are extensible
  • Are executed as code to produce desired data

In other words, the only safe way on Unix is to make the metadata part of the file. Yes, this sucks.
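
One way to read "make the metadata part of the file" is a text header line that names the format and its version. The Python sketch below assumes an invented convention, a first line such as "%!inventory 1.2"; the header syntax and the function name parse_header are hypothetical.

```python
import re

# Assumed header convention: "%!<format-name> <major>.<minor>" on the first line.
HEADER = re.compile(r"^%!(?P<fmt>[A-Za-z][\w-]*) (?P<major>\d+)\.(?P<minor>\d+)$")


def parse_header(text: str):
    """Return (format, major, minor) from the first line, or None if absent."""
    first_line = text.split("\n", 1)[0]
    m = HEADER.match(first_line)
    if m is None:
        return None
    return m["fmt"], int(m["major"]), int(m["minor"])
```

Since the header travels inside the byte stream, it survives copying to filesystems that drop extended attributes or resource forks.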

Related observation: the way the Mac desktop interprets directories as applications is extremely clever. Unix desktops would do well to emulate this trick.

P.S. I do not consider XML to be a human-readable format.

Norman Ramsey
OS X also handles other directories as bundles - things like frameworks and plugins are also bundle-dirs, which means they appear to normal users like a file.
Matthew Schinckel