I've got a routine that converts a file into a different format and saves it. The original datafiles were numbered, but my routine gives the output a filename based on an internal name found in the original.

I tried to batch-run it on a whole directory, and it worked fine until I hit one file whose internal name had a slash in it. Oops! And if it does that here, it could easily do it on other files. Is there an RTL (or WinAPI) routine somewhere that will sanitize a string and remove invalid symbols so it's safe to use as a filename?

A: 

I did this:

// Initialized elsewhere...
string folder;
string name;
// Invalid path characters apply to the folder. Invalid file-name characters
// (which also include '\', '/' and ':') apply only to the name; running them
// over the folder would destroy its separators and drive letter.
foreach (var c in System.IO.Path.GetInvalidPathChars())
{
 folder = folder.Replace(c, '_');
}
foreach (var c in System.IO.Path.GetInvalidFileNameChars())
{
 name = name.Replace(c, '_');
}
John Weldon
*points to the "Delphi" tag on the question* Looks like a good algorithm, though. Is there a Win32 API equivalent to those System.IO.Path calls?
Mason Wheeler
Having looked at the class in Reflector you can easily see what these invalid chars are. Simply port it to Delphi.
RichardOD
...if I had C# and Reflector, yes. I prefer to avoid the headaches that come with managed code.
Mason Wheeler
Mason, when I said to port it to Delphi, I meant only after checking that there is nothing else in the Win32 API first.
RichardOD
+4  A: 

Check if string has invalid chars; solution from here:

//test if a "fileName" is a valid Windows file name
//Delphi >= 2005 version

function IsValidFileName(const fileName : string) : boolean;
const 
  InvalidCharacters : set of char = ['\', '/', ':', '*', '?', '"', '<', '>', '|'];
var
  c : char;
begin
  result := fileName <> '';

  if result then
  begin
    for c in fileName do
    begin
      result := NOT (c in InvalidCharacters) ;
      if NOT result then break;
    end;
  end;
end; (* IsValidFileName *)

And, for strings returning False, you could do something simple like this for each invalid character:

var
  before, after : string;

begin
  before := 'i am a rogue file/name';

  after  := StringReplace(before, '/', '',
                      [rfReplaceAll, rfIgnoreCase]);
  ShowMessage('Before = '+before);
  ShowMessage('After  = '+after);
end;

// Before = i am a rogue file/name
// After  = i am a rogue filename
Adam Bernier
A: 

Well, the easy thing is to use a regex and your favourite language's version of gsub to replace anything that's not a "word character." This character class would be "\w" in most languages with Perl-like regexes, or "[A-Za-z0-9]" as a simple option otherwise.

In particular, in contrast to some of the examples in other answers, you don't want to look for invalid characters to remove, but for valid characters to keep. If you look for invalid characters, you're always vulnerable to the introduction of new ones; if you look only for valid ones, you might be slightly less efficient (in that you replace a character you didn't really need to), but at least you'll never be wrong.

Now, if you want to make the new version as much like the old as possible, you might consider replacement. Instead of deleting, you can substitute a character or characters you know to be ok. But doing that is an interesting enough problem that it's probably a good topic for another question.
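
A minimal sketch of the whitelist idea in Delphi (the question's language), assuming Delphi 2009 or later for CharInSet. The name WhitelistFileName and the choice of '_' as the replacement are illustrative only and not part of any library. Only ASCII letters, digits and the underscore are kept, so the dot before an extension would be replaced too; pass the base name without its extension.

```delphi
program WhitelistDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: keeps characters from a known-safe set and
// replaces every other character with Replacement.
function WhitelistFileName(const Input: string; Replacement: Char = '_'): string;
var
  i: Integer;
begin
  Result := Input;
  for i := 1 to Length(Result) do
    if not CharInSet(Result[i], ['A'..'Z', 'a'..'z', '0'..'9', '_']) then
      Result[i] := Replacement;
end;

begin
  // Spaces and the slash are outside the whitelist, so all are replaced.
  WriteLn(WhitelistFileName('i am a rogue file/name')); // i_am_a_rogue_file_name
end.
```

Different inputs can map to the same output (for example 'a/b' and 'a b'), so a whitelist trades fidelity for safety.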

Curt Sampson
Nope. When you consider that recent versions of Windows support full Unicode filenames, and something like Ä£̆Ώۑ≥♣.txt is valid, you definitely want a blacklist for an operation like this, not a whitelist.
Mason Wheeler
Not in the way I interpreted the question. You're not looking to see if an arbitrary string is a valid filename, you're looking to guarantee a valid filename from a transformation of an arbitrary string. These are (perhaps subtly) different. For example, if you could translate any string into a unique 8-digit number, that might bear no obvious relation to the original string, but still guarantees you can save the darn thing to disk.
Curt Sampson
Yeah. I'm looking to guarantee a valid filename from the transformation of an arbitrary string, while preserving as much of the information conveyed by the original string as possible.
Mason Wheeler
Do you also want the filename to look as much like the original string as possible?
Curt Sampson
Yes, that's exactly what I want.
Mason Wheeler
+1 Sanitizing must *always* use whitelists. Otherwise you're vulnerable as soon as new inputs become possible, or inputs become dangerous which were ok before (code changes in the interpreting code).
sleske
+5  A: 

Regarding the question of whether there is any API function to sanitize a file name (or even check its validity): there seems to be none. Quoting from the comment on the PathSearchAndQualify() function:

There does not appear to be any Windows API that will validate a path entered by the user; this is left as an ad hoc exercise for each application.

So you can only consult the rules for file name validity from File Names, Paths, and Namespaces (Windows):

  • Use almost any character in the current code page for a name, including Unicode characters and characters in the extended character set (128–255), except for the following:

    • The following reserved characters are not allowed:
      < > : " / \ | ? *
    • Characters whose integer representations are in the range from zero through 31 are not allowed.
    • Any other character that the target file system does not allow.
  • Do not use the following reserved device names for the name of a file: CON, PRN, AUX, NUL, COM1..COM9, LPT1..LPT9.
    Also avoid these names followed immediately by an extension; for example, NUL.txt is not recommended.

If you know that your program will only ever write to NTFS file systems you can probably be sure that there are no other characters that the file system does not allow, so you would only have to check that the file name is not too long (use the MAX_PATH constant) after all invalid chars have been removed (or replaced by underscores, for example).
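
The rules quoted above could be applied along these lines. This is a sketch rather than a complete validator: SanitizeFileName and IsReservedName are hypothetical names, the underscore is an arbitrary replacement, and CharInSet assumes Delphi 2009 or later. The length check against MAX_PATH, empty results and file-system-specific restrictions are deliberately left out.

```delphi
program SanitizeDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  // Reserved device names from the list quoted above.
  ReservedNames: array[0..21] of string = (
    'CON', 'PRN', 'AUX', 'NUL',
    'COM1', 'COM2', 'COM3', 'COM4', 'COM5', 'COM6', 'COM7', 'COM8', 'COM9',
    'LPT1', 'LPT2', 'LPT3', 'LPT4', 'LPT5', 'LPT6', 'LPT7', 'LPT8', 'LPT9');

// Hypothetical helper: True if the part before the first dot is a
// reserved device name (so both 'NUL' and 'NUL.txt' match).
function IsReservedName(const Name: string): Boolean;
var
  Base: string;
  DotPos, i: Integer;
begin
  Base := Name;
  DotPos := Pos('.', Base);
  if DotPos > 0 then
    Base := Copy(Base, 1, DotPos - 1);
  Result := False;
  for i := Low(ReservedNames) to High(ReservedNames) do
    if SameText(Base, ReservedNames[i]) then
    begin
      Result := True;
      Exit;
    end;
end;

// Hypothetical helper: replaces reserved and control characters with '_'
// and prefixes reserved device names with '_'.
function SanitizeFileName(const Input: string): string;
var
  i: Integer;
begin
  Result := Input;
  for i := 1 to Length(Result) do
    if (Result[i] < #32) or
       CharInSet(Result[i], ['<', '>', ':', '"', '/', '\', '|', '?', '*']) then
      Result[i] := '_';
  if IsReservedName(Result) then
    Result := '_' + Result;
end;

begin
  WriteLn(SanitizeFileName('i am a rogue file/name')); // i am a rogue file_name
  WriteLn(SanitizeFileName('NUL.txt'));                // _NUL.txt
end.
```

Pass only the file name, not a full path, since the path separators are among the characters replaced.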

A program should also make sure that sanitizing the file names has not led to file name conflicts, so that it does not silently overwrite other files which ended up with the same name.
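
One way to guard against such collisions is to probe for an unused name before writing. The sketch below assumes a numbering scheme of the form "name (2).ext", and UniqueFileName is a hypothetical helper, not an RTL routine. Another process could still create the file between the check and the write, so this reduces the risk rather than eliminating it.

```delphi
program UniqueNameDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: returns FileName if it is unused in Folder,
// otherwise appends " (2)", " (3)", ... before the extension until
// a name is found that does not exist yet.
function UniqueFileName(const Folder, FileName: string): string;
var
  Base, Ext: string;
  n: Integer;
begin
  Ext := ExtractFileExt(FileName);
  Base := ChangeFileExt(FileName, '');
  Result := FileName;
  n := 1;
  while FileExists(IncludeTrailingPathDelimiter(Folder) + Result) do
  begin
    Inc(n);
    Result := Format('%s (%d)%s', [Base, n, Ext]);
  end;
end;

begin
  // Prints 'output.dat' unless a file of that name already exists
  // in the current directory.
  WriteLn(UniqueFileName(GetCurrentDir, 'output.dat'));
end.
```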

mghie
+4  A: 

You can use the PathGetCharType function, the PathCleanupSpec function, or the following trick:

  function IsValidFilePath(const FileName: String): Boolean;
  var
    S: String;
    I: Integer;
  begin
    Result := False;
    S := FileName;
    repeat
      I := LastDelimiter('\/', S);
      MoveFile(nil, PChar(S));
      if (GetLastError = ERROR_ALREADY_EXISTS) or
         (
           (GetFileAttributes(PChar(Copy(S, I + 1, MaxInt))) = INVALID_FILE_ATTRIBUTES)
           and
           (GetLastError=ERROR_INVALID_NAME)
         ) then
        Exit;
      if I>0 then
        S := Copy(S,1,I-1);
    until I = 0;
    Result := True;
  end;

This code divides the string into parts and uses MoveFile to verify each part. MoveFile will fail for invalid characters or reserved file names (like 'COM') and return success or ERROR_ALREADY_EXISTS for a valid file name.

Alexander
Thanks! PathCleanupSpec looks like exactly what I'm looking for.
Mason Wheeler
+1 for PathCleanupSpec, interesting stuff
Wim Coenen
A: 
{
  CleanFileName
  ---------------------------------------------------------------------------

  Given an input string strip any chars that would result
  in an invalid file name.  This should just be passed the
  filename not the entire path because the slashes will be
  stripped.  The function ensures that the resulting string
  does not have multiple spaces together and does not start
  or end with a space.  If the entire string is removed the
  result would not be a valid file name so an error is raised.

}

function CleanFileName(const InputString: string): string;
var
  i: integer;
  ResultWithSpaces: string;
begin

  ResultWithSpaces := InputString;

  for i := 1 to Length(ResultWithSpaces) do
  begin
    // These chars are invalid in file names.
    case ResultWithSpaces[i] of 
      '/', '\', ':', '*', '?', '"', '<', '>', '|', ' ', #$D, #$A, #9:
        // Use a * to indicate a duplicate space so we can remove
        // them at the end.
        {$WARNINGS OFF} // W1047 Unsafe code 'String index to var param'
        if (i > 1) and
          ((ResultWithSpaces[i - 1] = ' ') or (ResultWithSpaces[i - 1] = '*')) then
          ResultWithSpaces[i] := '*'
        else
          ResultWithSpaces[i] := ' ';

        {$WARNINGS ON}
    end;
  end;

  // A * indicates duplicate spaces.  Remove them.
  result := ReplaceStr(ResultWithSpaces, '*', '');

  // Also trim any leading or trailing spaces
  result := Trim(Result);

  if result = '' then
  begin
    raise(Exception.Create('Resulting FileName was empty Input string was: '
      + InputString));
  end;
end;
Mark Elder