I want to include batch file rename functionality in my application. The user can type a destination filename pattern, and (after replacing some wildcards in the pattern) I need to check whether the result is going to be a legal filename under Windows. I tried to use a regular expression like [a-zA-Z0-9_]+ but it doesn't include many language-specific characters (umlauts and so on). What is the best way to do such a check?
Rather than explicitly include all possible characters, you could do a regex to check for the presence of illegal characters, and report an error then. Ideally your application should name the files exactly as the user wishes, and only cry foul if it stumbles across an error.
From MSDN, here's a list of characters that aren't allowed:
Use almost any character in the current code page for a name, including Unicode characters and characters in the extended character set (128–255), except for the following:
- The following reserved characters are not allowed: < > : " / \ | ? *
- Characters whose integer representations are in the range from zero through 31 are not allowed.
- Any other character that the target file system does not allow.
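The check described above can be sketched as a single character class. This is only a minimal sketch: it assumes the reserved characters and the 0-31 control range from the list are the whole test, it ignores file-system-specific rules, and the names ForbiddenCharCheck and ContainsForbiddenChar are made up for illustration.

```csharp
using System.Text.RegularExpressions;

public static class ForbiddenCharCheck
{
    // Character class with the reserved characters < > : " / \ | ? *
    // plus the control characters 0 through 31.
    private static readonly Regex Forbidden =
        new Regex(@"[\\/:*?""<>|\x00-\x1F]");

    // True if the candidate name contains at least one forbidden character.
    public static bool ContainsForbiddenChar(string name)
    {
        return Forbidden.IsMatch(name);
    }
}
```

Because the pattern lists what is forbidden rather than what is allowed, names with umlauts or other non-ASCII letters pass without any extra work.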
You can get a list of invalid characters from Path.GetInvalidPathChars
http://msdn.microsoft.com/en-us/library/system.io.path.getinvalidpathchars.aspx
And GetInvalidFileNameChars
http://msdn.microsoft.com/en-us/library/system.io.path.getinvalidfilenamechars.aspx
UPD: See Steve Cooper's suggestion on how to use these in a regular expression.
Windows filenames are pretty unrestrictive, so really it might not even be that much of an issue. The characters that are disallowed by Windows are:
\ / : * ? " < > |
You could easily write an expression to check if those characters are present. A better solution though would be to try and name the files as the user wants, and alert them when a filename doesn't stick.
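The "try it and alert the user" approach might look roughly like the sketch below. It assumes File.Move is an acceptable way to perform the rename; the RenameAttempt class, the TryRename method, and the choice of which exceptions to treat as "the name didn't stick" are illustrative assumptions, not something prescribed by this answer.

```csharp
using System;
using System.IO;

public static class RenameAttempt
{
    // Returns true if the rename succeeded; otherwise returns false and
    // puts a message in 'error' that can be shown to the user.
    public static bool TryRename(string sourcePath, string destinationPath, out string error)
    {
        try
        {
            File.Move(sourcePath, destinationPath);
            error = null;
            return true;
        }
        catch (Exception ex) when (ex is IOException
                                   || ex is UnauthorizedAccessException
                                   || ex is ArgumentException
                                   || ex is NotSupportedException)
        {
            // Illegal names, missing files, existing targets, and permission
            // problems all end up here.
            error = ex.Message;
            return false;
        }
    }
}
```

In a batch rename you would call this once per file and collect the error messages for the files that failed.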
Microsoft Windows: Windows kernel forbids the use of characters in range 1-31 (i.e., 0x01-0x1F) and characters " * : < > ? \ |. Although NTFS allows each path component (directory or filename) to be 255 characters long and paths up to about 32767 characters long, the Windows kernel only supports paths up to 259 characters long. Additionally, Windows forbids the use of the MS-DOS device names AUX, CLOCK$, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, CON, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9, NUL and PRN, as well as these names with any extension (for example, AUX.txt), except when using Long UNC paths (e.g. \\.\C:\nul.txt or \\?\D:\aux\con). (In fact, CLOCK$ may be used if an extension is provided.) These restrictions only apply to Windows - Linux, for example, allows use of " * : < > ? \ | even in NTFS.
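The device-name rule is easy to miss with a character-only check, so here is a rough sketch of how it could be tested. The ReservedNames class and IsReservedDeviceName method are hypothetical names. Comparing only the part before the first dot is a deliberately conservative assumption: it also rejects CLOCK$ with an extension, which the text above says is actually allowed.

```csharp
using System;
using System.Collections.Generic;

public static class ReservedNames
{
    // Device names from the paragraph above; Windows compares them case-insensitively.
    private static readonly HashSet<string> DeviceNames =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "AUX", "CLOCK$", "CON", "NUL", "PRN",
            "COM1", "COM2", "COM3", "COM4", "COM5", "COM6", "COM7", "COM8", "COM9",
            "LPT1", "LPT2", "LPT3", "LPT4", "LPT5", "LPT6", "LPT7", "LPT8", "LPT9"
        };

    // True for a bare device name ("CON") or a device name with an
    // extension ("aux.txt"). Assumption: everything after the first dot
    // is treated as the extension.
    public static bool IsReservedDeviceName(string fileName)
    {
        int dot = fileName.IndexOf('.');
        string baseName = dot < 0 ? fileName : fileName.Substring(0, dot);
        return DeviceNames.Contains(baseName);
    }
}
```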
Regular expression matching should get you some of the way. Here's a snippet using System.IO.Path.GetInvalidPathChars() (the older System.IO.Path.InvalidPathChars field is obsolete):
bool IsValidFilename(string testName)
{
    // GetInvalidPathChars() returns a char[], so turn it into a string before escaping it.
    string invalidChars = new string(System.IO.Path.GetInvalidPathChars());
    Regex containsABadCharacter = new Regex("[" + Regex.Escape(invalidChars) + "]");
    if (containsABadCharacter.IsMatch(testName)) { return false; }
    // other checks for UNC, drive-path format, etc
    return true;
}
Once you know that, you should also check for different formats, e.g. c:\my\drive and \\server\share\dir\file.ext
The question is: are you trying to determine whether a path name is a legal Windows path, or whether it's legal on the system where the code is running? I think the latter is more important, so personally I'd probably decompose the full path and try to use _mkdir to create the directory the file belongs in, then try to create the file.
This way you know not only if the path contains only valid windows characters, but if it actually represents a path that can be written by this process.
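The answer above mentions _mkdir from the C runtime; the sketch below swaps in the .NET equivalents (Directory.CreateDirectory and FileStream) to match the C# used elsewhere in this thread. PathProbe and CanCreateFile are made-up names, and the buffer size is arbitrary. Note two side effects: any directories created by the probe are left behind, and an already existing file makes the probe report false even though the path is legal.

```csharp
using System;
using System.IO;

public static class PathProbe
{
    // Tries to create the containing directory and then the file itself.
    // Returns true only if both steps succeed for the current process.
    public static bool CanCreateFile(string fullPath)
    {
        try
        {
            string directory = Path.GetDirectoryName(Path.GetFullPath(fullPath));
            Directory.CreateDirectory(directory);

            // CreateNew refuses to overwrite an existing file;
            // DeleteOnClose removes the probe file when the stream is disposed.
            using (new FileStream(fullPath, FileMode.CreateNew, FileAccess.Write,
                                  FileShare.None, 4096, FileOptions.DeleteOnClose))
            {
            }
            return true;
        }
        catch (Exception ex) when (ex is IOException
                                   || ex is UnauthorizedAccessException
                                   || ex is ArgumentException
                                   || ex is NotSupportedException)
        {
            return false;
        }
    }
}
```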
From MSDN's "Naming a File or Directory," here are the general conventions for what a legal file name is under Windows:
You may use any character in the current code page (Unicode/ANSI above 127), except:
- < > : " / \ | ? *
- Characters whose integer representations are 0-31 (less than ASCII space)
- Any other character that the target file system does not allow (say, trailing periods or spaces)
- Any of the DOS names: CON, PRN, AUX, NUL, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9 (and avoid AUX.txt, etc)
- The file name is all periods
Some optional things to check:
- File paths (including the file name) may not have more than 260 characters (for paths that don't use the "\\?\" prefix)
- Unicode file paths (including the file name) may not have more than 32,000 characters when using the "\\?\" prefix (note that the prefix may expand directory components and cause the path to overflow the 32,000 limit)
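The non-character rules in the lists above (trailing periods or spaces, names made only of periods, and the path length limit without the prefix) can be sketched as follows. The class and method names are hypothetical, treating an empty name as a problem is my own assumption, and the length test uses a strict "less than 260" so that it stays on the safe side of the limit.

```csharp
public static class NameShapeChecks
{
    // Limit for paths that do not use the long-path prefix.
    public const int ClassicPathLimit = 260;

    // True if the name is empty, consists only of periods,
    // or ends with a period or a space.
    public static bool HasProblematicShape(string fileName)
    {
        if (string.IsNullOrEmpty(fileName)) { return true; }  // assumption: empty is not a usable name
        if (fileName.Trim('.').Length == 0) { return true; }  // "." / ".." / "..."
        char last = fileName[fileName.Length - 1];
        return last == '.' || last == ' ';
    }

    // Conservative check: the full path, including the file name,
    // must be shorter than the classic limit.
    public static bool FitsClassicPathLimit(string fullPath)
    {
        return fullPath.Length < ClassicPathLimit;
    }
}
```

Leading spaces are deliberately not flagged here; only the trailing position is checked.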
This is what I use:
public static bool IsValidFileName(this string expression, bool platformIndependent)
{
    string sPattern = @"^(?!^(PRN|AUX|CLOCK\$|NUL|CON|COM\d|LPT\d|\..*)(\..+)?$)[^\x00-\x1f\\?*:\"";|/]+$";
    if (platformIndependent)
    {
        sPattern = @"^(([a-zA-Z]:|\\)\\)?(((\.)|(\.\.)|([^\\/:\*\?""\|<>\. ](([^\\/:\*\?""\|<>\. ])|([^\\/:\*\?""\|<>]*[^\\/:\*\?""\|<>\. ]))?))\\)*[^\\/:\*\?""\|<>\. ](([^\\/:\*\?""\|<>\. ])|([^\\/:\*\?""\|<>]*[^\\/:\*\?""\|<>\. ]))?$";
    }
    return Regex.IsMatch(expression, sPattern, RegexOptions.CultureInvariant);
}
The first pattern creates a regular expression containing the invalid/illegal file names and characters for Windows platforms only. The second one does the same but ensures that the name is legal for any platform.
Regular expressions are overkill for this situation. You can use the String.IndexOfAny() method in combination with Path.GetInvalidPathChars() and Path.GetInvalidFileNameChars().
Also note that both Path.GetInvalidXXX() methods clone an internal array and return the clone. So if you're going to be doing this a lot (thousands and thousands of times) you can cache a copy of the invalid chars array for reuse.
HTH,
Sam
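A minimal sketch of the IndexOfAny approach with the cached array might look like this. FileNameChecker and HasInvalidFileNameChars are illustrative names. Keep in mind that Path.GetInvalidFileNameChars() reports the set for the platform the code is running on, so the result reflects that platform's rules.

```csharp
using System.IO;

public static class FileNameChecker
{
    // Fetched once and reused, since every call to GetInvalidFileNameChars()
    // hands back a fresh copy of the array.
    private static readonly char[] InvalidFileNameChars = Path.GetInvalidFileNameChars();

    // True if the name contains any character the platform rejects in file names.
    public static bool HasInvalidFileNameChars(string fileName)
    {
        return fileName.IndexOfAny(InvalidFileNameChars) >= 0;
    }
}
```

The same pattern works for whole paths by caching Path.GetInvalidPathChars() instead.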
Keep in mind that even if the user enters a syntactically valid path, that does not mean they have permission to create the file, or that the path can be created at all.
One corner case to keep in mind, which surprised me when I first found out about it: Windows allows leading space characters in file names! For example, the following are all legal, and distinct, file names on Windows (minus the quotes):
"file.txt"
" file.txt"
"  file.txt"
One takeaway from this: Use caution when writing code that trims leading/trailing whitespace from a filename string.
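For a batch rename, one way to act on this is to detect names that would become identical if they were trimmed, before doing any trimming. The sketch below is only an illustration: TrimCollisions and NamesThatCollideAfterTrim are hypothetical names, and the comparison is a plain ordinal one.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TrimCollisions
{
    // Returns every name that shares its trimmed form with at least
    // one other distinct name in the input.
    public static List<string> NamesThatCollideAfterTrim(IEnumerable<string> names)
    {
        return names
            .Distinct(StringComparer.Ordinal)
            .GroupBy(n => n.Trim(), StringComparer.Ordinal)
            .Where(g => g.Count() > 1)
            .SelectMany(g => g)
            .ToList();
    }
}
```

If the returned list is non-empty, trimming those names would silently map several distinct files onto one target name.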
This class cleans filenames and paths; use it like this:
var myCleanPath = PathSanitizer.SanitizeFilename(myBadPath, ' ');
Here's the code:
using System;
using System.Text;

/// <summary>
/// Cleans paths of invalid characters.
/// </summary>
public static class PathSanitizer
{
    /// <summary>
    /// The set of invalid filename characters, kept sorted for fast binary search
    /// </summary>
    private readonly static char[] invalidFilenameChars;

    /// <summary>
    /// The set of invalid path characters, kept sorted for fast binary search
    /// </summary>
    private readonly static char[] invalidPathChars;

    static PathSanitizer()
    {
        // set up the two arrays -- sorted once for speed.
        invalidFilenameChars = System.IO.Path.GetInvalidFileNameChars();
        invalidPathChars = System.IO.Path.GetInvalidPathChars();
        Array.Sort(invalidFilenameChars);
        Array.Sort(invalidPathChars);
    }

    /// <summary>
    /// Cleans a filename of invalid characters
    /// </summary>
    /// <param name="input">the string to clean</param>
    /// <param name="errorChar">the character which replaces bad characters</param>
    /// <returns>the cleaned filename</returns>
    public static string SanitizeFilename(string input, char errorChar)
    {
        return Sanitize(input, invalidFilenameChars, errorChar);
    }

    /// <summary>
    /// Cleans a path of invalid characters
    /// </summary>
    /// <param name="input">the string to clean</param>
    /// <param name="errorChar">the character which replaces bad characters</param>
    /// <returns>the cleaned path</returns>
    public static string SanitizePath(string input, char errorChar)
    {
        return Sanitize(input, invalidPathChars, errorChar);
    }

    /// <summary>
    /// Cleans a string of invalid characters.
    /// </summary>
    /// <param name="input">the string to clean</param>
    /// <param name="invalidChars">the sorted set of characters to replace</param>
    /// <param name="errorChar">the character which replaces bad characters</param>
    /// <returns>the cleaned string, or null if the input was null</returns>
    private static string Sanitize(string input, char[] invalidChars, char errorChar)
    {
        // null always sanitizes to null
        if (input == null) { return null; }

        StringBuilder result = new StringBuilder();
        foreach (var characterToTest in input)
        {
            // we binary search for the character in the invalid set. This should be lightning fast.
            if (Array.BinarySearch(invalidChars, characterToTest) >= 0)
            {
                // we found the character in the array of invalid characters, so replace it.
                result.Append(errorChar);
            }
            else
            {
                // the character was not found in invalid, so it is valid.
                result.Append(characterToTest);
            }
        }

        // we're done.
        return result.ToString();
    }
}
The destination file system is also important.
Under NTFS, some files cannot be created in specific directories, e.g. $Boot in the root directory.