I found out, and tested, that Linux allows any character in a filename except / and null. So what sequences should I disallow in a filename? I have heard that a leading - may confuse some command-line programs. That doesn't matter to me, but it may bother other people if they collect a bunch of files and filter them with GNU programs.

It was suggested that I strip leading and trailing spaces, and I plan to, only because users typically don't mean to have leading or trailing spaces.

What problematic sequences are there, and which should I consider disallowing? I am also considering disallowing characters that are illegal on Windows, just for convenience. I think I may disallow dashes at the beginning (a dash is a legal Windows character).

+4  A: 

I would leave the determination of what's "valid" up to the OS and filesystem driver. Let the user type whatever they want, and pass it on. Handle errors from the OS in an appropriate manner. The exception is I think it's reasonable to strip leading and trailing spaces. If people want to create filenames with embedded spaces or leading dashes or question marks, and their chosen filesystem allows it, it shouldn't be up to you to try to prevent them.

It's possible to mount different filesystems at different mount points (or drives in Windows) that have different rules regarding legal characters in a file name. Handling this sort of thing inside your application will be much more work than is necessary, because the OS will already do it for you.
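A minimal Python sketch of that approach, assuming the application creates the file itself: attempt the creation and report whatever the OS rejects. The helper name `try_create` is hypothetical and only illustrates the idea.

```python
import os


def try_create(directory: str, name: str) -> bool:
    """Try to create an empty file called `name`; return True on success.

    Hypothetical helper: the OS and filesystem driver decide what is valid.
    """
    name = name.strip(" ")  # the one clean-up this answer considers reasonable
    if not name:
        return False  # nothing left after stripping, so there is no name to try
    path = os.path.join(directory, name)
    try:
        # Mode "x" fails if the file already exists instead of truncating it.
        with open(path, "x"):
            pass
    except (OSError, ValueError) as exc:
        # OSError: rejected by the OS (illegal character, name too long, ...).
        # ValueError: Python itself refuses names with an embedded NUL byte.
        print(f"cannot create {name!r}: {exc}")
        return False
    return True
```

Note that a name containing a path separator is treated as a path by `os.path.join`, so a real application would still have to decide how to handle that case.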

Greg Hewgill
+1, I agree, leave up to the OS.
Ninefingers
I'm actually generating filenames for the user to download from a web browser. It would be pretty annoying to do Save As and have it tell you the filename is not valid. Also, just because Linux allows everything doesn't mean it should (apparently there are things that anger command-line tools, but I heard that secondhand).
acidzombie24
A: 

URL-encode all strings to be used as filenames and you'll only have to worry about length. This answer might be worth reading.
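As a rough illustration in Python, `urllib.parse.quote` with `safe=""` percent-encodes everything except ASCII letters, digits and `_.-~`. The wrapper name `encode_filename` is just an assumption for this sketch.

```python
from urllib.parse import quote, unquote


def encode_filename(name: str) -> str:
    """Percent-encode everything except ASCII letters, digits and '_.-~'."""
    # safe="" also encodes "/", which quote() leaves alone by default.
    return quote(name, safe="")


encoded = encode_filename("a/b c*.txt")
print(encoded)           # a%2Fb%20c%2A.txt
print(unquote(encoded))  # a/b c*.txt
```

Each non-ASCII character expands to several `%XX` groups, which is why length becomes the remaining concern. A leading `-` and the names `.` and `..` pass through unchanged, so they would still need separate handling.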

mrclay
+2  A: 

Since you seem to be interested primarily in Linux, one thing to avoid is characters that the (typical) shell will try to interpret, for example, as a wildcard. You can create a file named "*" if you insist, but you might have some users who don't appreciate it much.

Jerry Coffin
I never thought of *, +1
acidzombie24
+1  A: 

Are you developing an application where the user creates the files themselves? If so, you can set the rules in your application (e.g. only allow [a-zA-Z0-9_.] and reject all other characters). This is much simpler to enforce.

ghostdog74
A: 

I'd recommend using a whitelist of characters. In general, symbols in filenames will annoy people.

By all means allow people to use a-z, 0-9 and Unicode characters > 0x80, but do not allow arbitrary symbols. Things like & and , will cause a lot of annoyance, as will full stops in inappropriate places.

I think the ASCII symbols which are safe to allow are: full stop, underscore, hyphen.

Allowing any OTHER ASCII symbols in the filename is asking for trouble.

A filename should also not start with an ASCII symbol. Policy on spaces in filenames is tricky, as users may expect to be able to use them, but some filenames are obviously silly (such as those which START with spaces).
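One way this policy might look in Python is sketched below. The function name is made up, and two points are my own assumptions rather than part of the rules above: uppercase A-Z is accepted alongside a-z, and spaces are allowed anywhere except at the start.

```python
ALLOWED_ASCII_SYMBOLS = set("._-")


def is_acceptable(name: str) -> bool:
    """Whitelist check: ASCII letters/digits, '. _ -', and anything above 0x80."""
    if not name:
        return False
    # No leading ASCII symbol and (assumption) no leading space.
    if name[0] == " " or name[0] in ALLOWED_ASCII_SYMBOLS:
        return False
    for ch in name:
        if ord(ch) > 0x80:
            continue  # non-ASCII characters are allowed
        if ch.isascii() and ch.isalnum():
            continue  # ASCII letters and digits
        if ch in ALLOWED_ASCII_SYMBOLS or ch == " ":
            continue  # the three safe symbols, plus interior spaces (assumption)
        return False  # any other ASCII symbol or control character
    return True


print(is_acceptable("résumé 2.txt"))  # True
print(is_acceptable("a&b"))           # False
print(is_acceptable(".hidden"))       # False
```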

MarkR
+5  A: 

Your question is somewhat confusing since you talk at length about Linux, but then in a comment to another answer you say that you are generating filenames for people to download, which presumably means that you have absolutely no control whatsoever over the filesystem and operating system that the files will be stored on, making Linux completely irrelevant.

For the purpose of this answer I'm going to assume that your question is wrong and your comment is correct.

The vast majority of operating systems and filesystems in use today fall roughly into three categories: POSIX, Windows and MacOS.

The POSIX specification is very clear on what a filename that is guaranteed to be portable across all POSIX systems looks like. The characters that you can use are defined in Section 3.276 (Portable Filename Character Set) of the Open Group Base Specification as:

ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789._-

The maximum filename length that you can rely on is defined in Section 13.23.3.5 (<limits.h> Minimum Values) as 14. (The relevant constant is _POSIX_NAME_MAX.)

So, a filename which is up to 14 characters long and contains only the 65 characters listed above, is safe to use on all POSIX compliant systems, which gives you 24407335764928225040435790 combinations (or roughly 84 bits).

If you don't want to annoy your users, you should add two more restrictions: don't start the filename with a dash or a dot. Filenames starting with a dot are customarily interpreted as "hidden" files and are not displayed in directory listings unless explicitly requested. And filenames starting with a dash may be interpreted as an option by many commands. (Sidenote: it is amazing how many users don't know about the rm ./-rf or rm -- -rf tricks.)
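The rules so far (portable character set, at most 14 characters, no leading dash or dot) could be checked with something like the following Python sketch. The function name `is_portable_name` is only illustrative.

```python
import re

# Portable Filename Character Set, 1 to 14 characters long.
_PORTABLE = re.compile(r"[A-Za-z0-9._-]{1,14}")


def is_portable_name(name: str) -> bool:
    """True if `name` uses only portable characters, fits in 14 characters,
    and does not start with '-' or '.'."""
    if _PORTABLE.fullmatch(name) is None:
        return False
    return name[0] not in "-."


print(is_portable_name("notes-1.txt"))  # True
print(is_portable_name("-rf"))          # False
print(is_portable_name(".bashrc"))      # False
```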

This leaves you at 23656340818315048885345458 combinations (still 84 bits).

Windows adds a couple of new restrictions to this: filenames cannot end with a dot and filenames are case-insensitive. This reduces the character set from 65 to 39 characters (37 for the first, 38 for the last character). It doesn't add any length restrictions; Windows can deal with 14 characters just fine.

This reduces the possible combinations to 17866587696996781449603 (73 bits).

Another restriction is that Windows treats everything after the last dot as a filename extension which denotes the type of the file. If you want to avoid potential confusion (say, if you generate a filename like abc.mp3 for a text file), you should avoid dots altogether.

You still have 13090925539866773438463 combinations (73 bits).

If you have to worry about DOS, then additional restrictions apply: the filename consists of one or two parts (separated by a dot), where neither of the two parts can contain a dot. The first part has a maximum length of 8, the second of 3 characters. Again, the second part is usually reserved to indicate the file type, which leaves you only 8 characters.

Now you have 4347792138495 possible filenames or 41 bits.
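The counts up to this point can be reproduced by summing over every allowed length. The Python sketch below does that under my reading of the rules (the function and variable names are made up for illustration). For the DOS case it comes out as 38^8 - 1 = 4347792138495.

```python
import string

UPPER, LOWER, DIGITS = string.ascii_uppercase, string.ascii_lowercase, string.digits


def count_names(alphabet: str, max_len: int,
                banned_first: str = "", banned_last: str = "") -> int:
    """Count names of length 1..max_len over `alphabet`, where some characters
    may not appear first and some may not appear last."""
    chars = set(alphabet)
    first = len(chars - set(banned_first))
    last = len(chars - set(banned_last))
    # A one-character name is both first and last, so it must obey both rules.
    total = len(chars - set(banned_first) - set(banned_last))
    for length in range(2, max_len + 1):
        total += first * len(chars) ** (length - 2) * last
    return total


posix = count_names(UPPER + LOWER + DIGITS + "._-", 14)
no_dash_dot = count_names(UPPER + LOWER + DIGITS + "._-", 14, banned_first="-.")
windows = count_names(UPPER + DIGITS + "._-", 14, banned_first="-.", banned_last=".")
no_dots = count_names(UPPER + DIGITS + "_-", 14, banned_first="-")
dos = count_names(UPPER + DIGITS + "_-", 8, banned_first="-")

for label, n in [("POSIX", posix), ("no leading dash/dot", no_dash_dot),
                 ("Windows", windows), ("no dots", no_dots), ("DOS", dos)]:
    # bit_length() - 1 is floor(log2(n)), i.e. the "bits" figure used here.
    print(f"{label}: {n} (about {n.bit_length() - 1} bits)")
```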

The good news is that you can use the 3 character extension to actually correctly indicate the file type, without breaking the POSIX filename limit (8+3+1 = 12 < 14).

If you want your users to be able to burn the files onto a CD-R formatted with ISO9660 Level 1, then you have to disallow hyphen anywhere, not just as the first character. Now, the remaining character set looks like

ABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789_

which gives you 3512479453921 combinations (41 bits).

Jörg W Mittag
Amazing answer. Yes, my question was a bit strange. I store the files under their ID, and I can check which OS the user is on to customize the valid filenames, with the name being pulled from the title in the DB. I never knew hyphens are not allowed on CDs. POSIX is a little limiting (and doesn't allow Unicode!) and I am required to support Unicode. So I'll disallow hyphens, dots and illegal Windows characters, offer an option to be less restrictive (or to use on Linux), and make sure I check my work so nothing will prevent it from burning onto CD (and hopefully it works across the main OSes with default naming).
acidzombie24
@acidzombie24: The *overwhelming* majority of CDs are formatted with ISO9660 Level 2, which *does* allow hyphens and also longer filenames. Hyphens are only illegal on ISO9660 Level 1, which practically nobody uses. I only mentioned it for completeness. Note also that I didn't say anything about MacOS, simply because I don't know enough about it. Again, Apple has actually ironed out pretty much all of the ancient MacOS restrictions in OSX, so it's probably not very relevant.
Jörg W Mittag
BTW POSIX does *allow* UTF-8 Unicode; it only mandates the *minimum* set of characters that must be valid for POSIX compliance. Most POSIX filesystems allow any byte sequence not containing '/' or '\0', but a portable application should not rely on this.
mark4o
There are also certain sequences that aren't allowed by various OSes, even if they otherwise follow the rules. For example, "CON" is not allowed on Windows.
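For instance, a rough Python check for the Windows reserved device names (CON, PRN, AUX, NUL, COM1 to COM9, LPT1 to LPT9) might look like this. The function name is hypothetical, and the check is a sketch rather than a complete Windows validator.

```python
RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def is_windows_reserved(name: str) -> bool:
    """True if `name` is a reserved device name, with or without an extension."""
    # Windows ignores case and everything after the first dot for these names,
    # so "con", "CON.txt" and "NUL.tar.gz" are all treated as reserved.
    stem = name.split(".", 1)[0].rstrip(" ").upper()
    return stem in RESERVED


print(is_windows_reserved("CON.txt"))      # True
print(is_windows_reserved("console.txt"))  # False
```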
caf