I'm receiving a string from an external process. I want to use that String to make a filename, and then write to that file. Here's my code snippet to do this:

    String s = ... // comes from external source
    File currentFile = new File(System.getProperty("user.home"), s);
    PrintWriter currentWriter = new PrintWriter(currentFile);

If s contains an invalid character, such as '/' in a Unix-based OS, then a java.io.FileNotFoundException is (rightly) thrown.

How can I safely encode the String so that it can be used as a filename?

Edit: What I'm hoping for is an API call that does this for me.

I can do this:

    String s = ... // comes from external source
    File currentFile = new File(System.getProperty("user.home"), URLEncoder.encode(s, "UTF-8"));
    PrintWriter currentWriter = new PrintWriter(currentFile);

But I'm not sure whether URLEncoder is reliable for this purpose.

A: 

You could remove the invalid chars ( '/', '\', '?', '*') and then use it.

Burkhard
This would introduce the possibility of naming conflicts. I.e., "tes?t", "tes*t" and "test" would all go to the same file "test".
vog
True. Then replace them. For instance, '/' -> slash, '*' -> star... or use a hash as vog suggested.
Burkhard
You're *always* open to the possibility of naming conflicts
Brian Agnew
"?" and "*" are allowed characters in file names. They only need to be escaped in shell commands, because usually globbing is used. On the file API level, however, there's no problem.
vog
@Brian Agnew: not actually true. Schemes that encode invalid characters using a reversible escaping scheme won't give collisions.
Stephen C
+5  A: 
vog
URLEncoder's idea of what is a special character may not be correct.
Stephen C
+1: nice distinction between reversible and irreversible encodings
dfa
@Stephen C: according to the documentation (see URLEncoder link), the function generates strings which contain at most the following 67 characters: a-z, A-Z, 0-9, ".", "-", "*", "_" and "+". Each of them is allowed in file names. (yes, "*" is allowed!)
vog
@vog: URLEncoder fails for "." and "..". These must be encoded or else you will collide with directory entries in $HOME
Stephen C
Good point. Thanks! I corrected my answer.
vog
@vog: Just for completeness, there is a third case - "possibly reversible" - which can be implemented with a computationally cheap hash, by removing 'bad' characters or by a variety of other means.
Stephen C
@vog: "*" is only allowed in most Unix-based filesystems, NTFS and FAT32 do not support it.
Jonathan
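
The URLEncoder points raised in the comments above can be sketched roughly as follows. This is only an illustration, not code from any of the posters: the class `UrlFileNames` and its methods are hypothetical names. It assumes you want to keep URLEncoder's output but additionally escape "*" (rejected by NTFS and FAT32) and the two names "." and "..". An empty input still produces an empty name, which you would have to reject separately.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper for illustration; not part of the JDK.
final class UrlFileNames {
    private UrlFileNames() {}

    static String encode(String s) {
        try {
            String e = URLEncoder.encode(s, "UTF-8");
            // URLEncoder leaves '*' unencoded; escape it by hand.
            e = e.replace("*", "%2A");
            // "." and ".." pass through unchanged and would name directory entries.
            if (e.equals(".")) {
                return "%2E";
            }
            if (e.equals("..")) {
                return "%2E%2E";
            }
            return e;
        } catch (UnsupportedEncodingException ex) {
            // Every Java platform is required to support UTF-8.
            throw new AssertionError("UTF-8 not supported", ex);
        }
    }

    static String decode(String name) {
        try {
            // URLDecoder also understands the extra %2A and %2E escapes.
            return URLDecoder.decode(name, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new AssertionError("UTF-8 not supported", ex);
        }
    }
}
```

The extra escapes cannot collide with ordinary output, because a literal "%" in the input is itself encoded as "%25".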
+2  A: 

My suggestion is to take a "white list" approach, meaning don't try and filter out bad characters. Instead define what is OK. You can either reject the filename or filter it. If you want to filter it:

    String name = s.replaceAll("\\W+", "");

This replaces every run of characters that aren't letters, digits or underscores with nothing. Alternatively you could replace them with another character (like an underscore).
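
As a small sketch of the three options (strip, substitute, reject), assuming Java's default ASCII-only meaning of `\w` and `\W`; the class and method names are made up for illustration:

```java
// Hypothetical helper showing the whitelist options; names are illustrative.
final class WhitelistNames {
    private WhitelistNames() {}

    // Drop everything that is not a letter, digit or underscore.
    static String strip(String s) {
        return s.replaceAll("\\W+", "");
    }

    // Replace each disallowed character with an underscore instead.
    static String substitute(String s) {
        return s.replaceAll("\\W", "_");
    }

    // Reject rather than filter: accept only non-empty whitelisted names.
    static boolean isAcceptable(String s) {
        return s.matches("\\w+");
    }
}
```

Note that `strip("tes?t")` and `strip("tes*t")` both yield "test", and that the "." of a file extension is removed as well.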

The problem is that if this is a shared directory then you don't want file name collision. Even if user storage areas are segregated by user you may end up with a colliding filename just by filtering out bad characters. The name a user put in is often useful if they ever want to download it too.

For this reason I tend to allow the user to enter what they want, store the filename based on a scheme of my own choosing (eg userId_fileId) and then store the user's filename in a database table. That way you can display it back to the user, store things how you want and you don't compromise security or wipe out other files.
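
A minimal sketch of that scheme, with an in-memory map standing in for the database table. Everything here (`StoredNameRegistry`, `register`, `displayName`, the counter used as a file id) is a hypothetical name chosen for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the "stored name -> user's name" database table.
final class StoredNameRegistry {
    private final Map<String, String> displayNames = new HashMap<>();
    private long nextFileId = 1;

    // Returns the safe on-disk name (userId_fileId) and remembers the original.
    String register(long userId, String userSuppliedName) {
        String storedName = userId + "_" + nextFileId++;
        displayNames.put(storedName, userSuppliedName);
        return storedName;
    }

    // Looks up the name to show back to the user, or null if unknown.
    String displayName(String storedName) {
        return displayNames.get(storedName);
    }
}
```

Because the on-disk name never contains user input, two uploads called "tes?t" and "tes*t" get distinct files and neither name needs escaping.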

You can also hash the file (eg MD5 hash) but then you can't list the files the user put in (not with a meaningful name anyway).
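
For illustration, here is one way to derive such a name by hashing the user-supplied string. It uses SHA-256 rather than MD5, which is my choice for the sketch and not something the approach requires; `HashedNames` and `sha256Hex` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: turns any string into a 64-character hex file name.
final class HashedNames {
    private HashedNames() {}

    static String sha256Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so each byte prints as exactly two hex digits.
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform implementation.
            throw new AssertionError("SHA-256 not available", e);
        }
    }
}
```

The result contains only the characters 0-9 and a-f, so it is a legal file name everywhere, but the original string cannot be recovered from it.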

cletus
I don't think it's a good idea to provide the bad solution first. In addition, MD5 is a nearly cracked hash algorithm. I recommend at least SHA-1 or better.
vog
For the purposes of creating a unique filename who cares if the algorithm is "broken"?
cletus
@cletus: the problem is that different strings will map to the same filename; i.e. collision.
Stephen C
A collision would have to be deliberate, the original question doesn't talk about these strings being chosen by an attacker.
tialaramex
A problem no-one has really addressed is that there are limits on filename length and on total length of a file path, plus arbitrary limits on file names on some platforms, and even a limit on how many files can be in a particular directory. And this is a Java question, so we can't be sure the software will only run on (fill in the name of your favourite OS here). Thus I think any adequate solution would want to consider how to retry or what else to do if the name tried is rejected by the OS.
tialaramex
@tialaramex: re collisions. Suppose that the user simply wants to store two distinct files using names that happen to collide. Result: one overwrites the other. Re filename limits: the simple answer is to report "name too long" or "too many files" to the user. There clearly have to be limits somewhere. Differences in file name syntax could be handled via a config setting to say which chars are illegal.
Stephen C
Collisions are why I suggest not using user input for a filename but instead using your own scheme, and storing the user's preferred name in a database as a convenience to them. This avoids security and collision issues.
cletus
+1  A: 

If you want the result to resemble the original name, SHA-1 or any other hashing scheme is not the answer. Instead you want something like this.

    char fileSep = '/'; // ... or do this portably.
    char escape = '%'; // ... or some other legal char.
    String s = ...
    int len = s.length();
    StringBuilder sb = new StringBuilder(len);
    for (int i = 0; i < len; i++) {
        char ch = s.charAt(i);
        if (ch < ' ' || ch >= 0x7F || ch == fileSep || ... // add other illegal chars
            || (ch == '.' && i == 0) // we don't want to collide with "." or ".."!
            || ch == escape) {
            sb.append(escape);
            if (ch < 0x10) {
                sb.append('0');
            }
            sb.append(Integer.toHexString(ch));
        } else {
            sb.append(ch);
        }
    }
    File currentFile = new File(System.getProperty("user.home"), sb.toString());
    PrintWriter currentWriter = new PrintWriter(currentFile);

This solution gives a reversible encoding (with no collisions) where the encoded strings resemble the original strings in most cases. I'm assuming that you are using 8-bit characters.

URLEncoder has the disadvantage that it encodes a whole lot of legal file name characters.

Edit: If you want a not-guaranteed-to-be-reversible solution, then simply remove the 'bad' characters rather than replacing them with an escape sequence.

Edit 2: Addressed collisions with "." and ".." directory entries.
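
To make the reversibility concrete, here is a self-contained sketch of the same idea together with a matching decoder. It is a variation rather than the exact code above: each escape is always four hex digits, so it does not depend on the 8-bit assumption, and the `ILLEGAL` set is only an illustrative guess that you would adjust per platform. The class and method names are hypothetical.

```java
// Hypothetical helper: reversible escaping with fixed-width %XXXX escapes.
final class EscapedFileNames {
    private static final char ESCAPE = '%';
    // Assumption: an illustrative set of illegal characters, not a definitive list.
    private static final String ILLEGAL = "/\\:*?\"<>|";

    private EscapedFileNames() {}

    static String encode(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            boolean mustEscape = ch < ' ' || ch >= 0x7F || ch == ESCAPE
                    || ILLEGAL.indexOf(ch) >= 0
                    || (ch == '.' && i == 0); // avoid "." and ".."
            if (mustEscape) {
                // Always four hex digits, so the decoder knows where the escape ends.
                sb.append(ESCAPE).append(String.format("%04x", (int) ch));
            } else {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    // Inverse of encode; throws on input that encode could not have produced.
    static String decode(String name) {
        StringBuilder sb = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); i++) {
            char ch = name.charAt(i);
            if (ch == ESCAPE) {
                sb.append((char) Integer.parseInt(name.substring(i + 1, i + 5), 16));
                i += 4; // skip the four hex digits just consumed
            } else {
                sb.append(ch);
            }
        }
        return sb.toString();
    }
}
```

Since the escape character itself is always escaped, `decode(encode(s))` returns `s` for any input, which is what rules out collisions.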

Stephen C
As explained above, URLEncoder encodes too much, AND it fails to deal with "." and ".."
Stephen C