I'm trying to build a regular expression that will detect any character that Windows does not accept as part of a file name (are these the same on other OSes? I don't know, to be honest).

These symbols are:

 \ / : * ? " < > |

Anyway, this is what I have: [\\/:*?\"<>|]

The tester over at http://gskinner.com/RegExr/ shows this to be working. For the string Allo*ha, the * symbol lights up, signalling it's been found. Should I enter Allo**ha however, only the first * will light up. So I think I need to modify this regex to find all appearances of the mentioned characters, but I'm not sure.

You see, in Java, I'm lucky enough to have the function String.replaceAll(String regex, String replacement). The description says:

Replaces each substring of this string that matches the given regular expression with the given replacement.

So in other words, even if the regex only finds the first and then stops searching, this function will still find them all.

For instance: String.replaceAll("[\\/:*?\"<>|]","")

However, I don't feel like I can take that risk. So does anybody know how I can extend this?

A: 

You might try allowing only the stuff you want the user to be able to enter, for example A-Z, a-z, and 0-9.

Lucas McCoy
Don't forget the lonely period.
Daniel A. White
Or the vast range of valid Unicode and extended characters people all over the world use.
McDowell
A: 

For the record, POSIX-compliant systems (including UNIX and Linux) support all characters except the null character ('\0') and the forward slash ('/') in filenames. Special characters such as space and asterisk must be escaped on the command line so that they do not take on their usual roles.
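
As a rough illustration of that rule, here is a minimal Java sketch that only checks for the two forbidden characters. The class and method names are made up for this example, and it deliberately ignores other constraints such as length limits or the special names "." and "..".

```java
class PosixFilenameCheck {
    // True when the name contains neither of the two characters
    // that POSIX forbids inside a filename: NUL and '/'.
    // (Hypothetical helper, not part of any library.)
    static boolean hasOnlyAllowedChars(String name) {
        return name.indexOf('\0') < 0 && name.indexOf('/') < 0;
    }

    public static void main(String[] args) {
        System.out.println(hasOnlyAllowedChars("Allo*ha")); // true: '*' is allowed
        System.out.println(hasOnlyAllowedChars("a/b"));     // false: contains '/'
    }
}
```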

Artelius
A: 

You cannot do this with a single regexp match, because a regexp always matches a substring of the input. Consider the word Alo*h*a: there is no substring that contains all the *s and no other characters. So if you can use the replaceAll function, just stick with it.

BTW, the set of forbidden characters is different in other OSes.

jpalecek
I'm not sure I understand what you're saying, but you can definitely match invalid filenames with a regex.
wilhelmtell
Yes, but you cannot sanitize invalid filenames by replacing a single occurrence of a regex without lots of collateral damage.
jpalecek
A: 

Java has a replaceAll function, but nearly every programming language has a way to do something similar. Perl, for example, uses the g switch to signify a global replacement. Python's re.sub function replaces every match by default and lets you cap the number of replacements with its count argument. If, for some reason, your language doesn't have an equivalent, you can always do something like this:

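// bad_characters is assumed to be a compiled java.util.regex.Pattern
while (bad_characters.matcher(filename).find())
  filename = bad_characters.matcher(filename).replaceFirst("");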
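
For Java specifically, a small sketch of the global replacement might look like the following. The class name ForbiddenChars and the method strip are invented for the example. One detail worth checking in your own code: inside a Java string literal, a regex that should match a literal backslash needs four backslashes, so the pattern from the question is written slightly differently here.

```java
import java.util.regex.Pattern;

class ForbiddenChars {
    // Character class for \ / : * ? " < > |
    // "\\\\" in Java source becomes \\ in the regex, which matches one backslash.
    static final Pattern FORBIDDEN = Pattern.compile("[\\\\/:*?\"<>|]");

    // Removes every occurrence of a forbidden character, not just the first.
    static String strip(String name) {
        return FORBIDDEN.matcher(name).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("Allo**ha"));  // Alloha
        System.out.println(strip("a\\b/c:d"));  // abcd
    }
}
```

Precompiling the Pattern is only a convenience when the same check runs many times; String.replaceAll with the same pattern string behaves the same way.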
Pesto
+2  A: 

Windows filename rules are tricky. You're only scratching the surface.

For example, here are some things that are not valid filenames, in addition to the characters you listed:

                                    (yes, that's an empty string)
.
.a
a.
 a                                  (that's a leading space)
a                                   (or a trailing space)
com
prn.txt
[anything over 240 characters]
[any control characters]
[any non-ASCII characters that don't fit in the system codepage,
 if the filesystem is FAT32]

Removing special characters in a single regex sub like String.replaceAll() isn't enough; you can easily end up with something invalid like an empty string or trailing ‘.’ or ‘ ’. Replacing something like “[^A-Za-z0-9_.]+” with ‘_’ would be a better first step (note the ‘+’: with ‘*’ the pattern also matches the empty string, so an underscore would be inserted between every pair of characters). But you will still need higher-level processing on whatever platform you're using.
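
To make the "higher-level processing" a bit more concrete, here is a rough Java sketch, not a complete or authoritative set of Windows rules. The class name, the reserved-name list (CON, PRN, AUX, NUL, COM1-9, LPT1-9) and the 240-character cap are assumptions chosen for illustration, and the whitelist throws away all non-ASCII letters, which may not be acceptable for your users.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class FilenameSanitizer {
    // Assumed limit, taken from the list above; adjust for your platform.
    static final int MAX_LENGTH = 240;

    // Assumed list of reserved device names; may be incomplete.
    static final Set<String> RESERVED = new HashSet<>(Arrays.asList(
        "CON", "PRN", "AUX", "NUL",
        "COM1", "COM2", "COM3", "COM4", "COM5", "COM6", "COM7", "COM8", "COM9",
        "LPT1", "LPT2", "LPT3", "LPT4", "LPT5", "LPT6", "LPT7", "LPT8", "LPT9"));

    static String sanitize(String name) {
        // 1. Replace each run of non-whitelisted characters with one underscore.
        String s = name.replaceAll("[^A-Za-z0-9_.]+", "_");
        // 2. Drop leading and trailing dots (spaces are already underscores).
        s = s.replaceAll("^\\.+|\\.+$", "");
        // 3. Prefix names whose base (text before the first dot) is reserved.
        int dot = s.indexOf('.');
        String base = (dot < 0 ? s : s.substring(0, dot)).toUpperCase(Locale.ROOT);
        if (RESERVED.contains(base)) {
            s = "_" + s;
        }
        // 4. Truncate, then re-strip any dot the cut left at the end.
        if (s.length() > MAX_LENGTH) {
            s = s.substring(0, MAX_LENGTH).replaceAll("\\.+$", "");
        }
        // 5. Never return an empty name.
        return s.isEmpty() ? "_" : s;
    }

    public static void main(String[] args) {
        System.out.println(sanitize("Allo**ha"));
        System.out.println(sanitize("prn.txt"));
        System.out.println(sanitize("."));
    }
}
```

Two different inputs can map to the same output here (for example "a*b" and "a?b"), so a caller that needs unique names would still have to handle collisions.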

bobince
Windows filename rules are indeed tricky. No one (not even Microsoft) has written a fully correct set of rules. I haven't either. But I can tell you "." is legal (that directory always exists), and ".a" and "a." and com and >240 characters etc. can be created by escaping the names perfectly legally.
Windows programmer
Well, ‘.’ (and ‘..’) are legal pathnames, but you can't use them as filenames, obviously! How do you ‘escape’ leading/trailing dots and reserved filenames? I can't see any public interface that allows it; both the UI and the file IO interface rename the dots and disallow the reserved name.
bobince
(I can create the long pathnames by renaming and moving, but it causes Explorer and many other applications to be unstable when accessing them, which is why it's undesirable.)
bobince
"copy con \\.\d:\.a" (without the quotes), Enter key to start, Ctrl+Z to stop. File d:\.a exists just fine. Well, fortunately someone accepted your answer so lots of future readers can be misled too.
Windows programmer
"copy con \\.\d:\con" and you get to use con with both meanings. By the way this assumes drive d is a disk; if it isn't then say drive c or something else.
Windows programmer