My script downloads files from the web and saves each one under the name supplied by the web server. I need a filter that removes characters that are invalid in file and folder names on Windows NTFS.

A cross-platform filter would be welcome too.

NOTE: something like htmlentities would be great....

A: 

I think your best bet would be to use gsub on the filename. One character I know you'll need to delete or replace is the colon (:).

Geo
+2  A: 

Like Geo said, by using gsub you can easily convert all invalid characters to a valid character. For example:

file_names.map! do |f|
  f.gsub(/[<invalid characters>]/, '_')
end

You need to replace <invalid characters> with all the possible characters that your file names might have in them that are not allowed on your file system. In the above code each invalid character is replaced with a _.

Wikipedia tells us that the following characters are not allowed on NTFS:

  • U+0000 (NUL)
  • / (slash)
  • \ (backslash)
  • : (colon)
  • * (asterisk)
  • ? (question mark)
  • " (quote)
  • < (less than)
  • > (greater than)
  • | (pipe)

So your gsub call could be something like this:

file_names.map! { |f| f.gsub(/[\x00\/\\:*?"<>|]/, '_') }

which replaces all the invalid characters with an underscore.
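
If you want to apply the same substitution to one name at a time, it can be wrapped in a small helper. This is only a sketch: the method name sanitize_filename and the constant INVALID_NTFS_CHARS are made up for illustration, and it covers only the characters listed above, not other Windows restrictions such as reserved device names.

```ruby
# Characters from the list above: NUL, / \ : * ? " < > |
INVALID_NTFS_CHARS = /[\x00\/\\:*?"<>|]/

# Returns a copy of +name+ with each listed character replaced.
# Hypothetical helper for illustration, not part of any library.
def sanitize_filename(name, replacement = '_')
  name.gsub(INVALID_NTFS_CHARS, replacement)
end

puts sanitize_filename('a/b:c?.txt')  # prints a_b_c_.txt
```

Passing a different second argument (for example '-') changes the replacement character. Note that two different original names can still collapse to the same sanitized name.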

liwp
@liwp: that is definitely a solution. I knew about gsub but was wondering whether there was a gem for this, ideally a cross-platform one. Usually there are a few of them for anything ;-) Thank you for your code.
Radek
A: 

I don't know how you plan to use those files later, but probably the most reliable solution would be to keep the original filenames in a database table (or in a serialized hash) and name the physical files after a unique ID that you (or the database) generate.

PS Another advantage of this approach is that you don't have to worry about files with the same name (or different names that filter to the same name).
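
A rough sketch of that idea, assuming a plain in-memory Hash in place of the database table and a UUID as the unique ID. The names ORIGINAL_NAMES and store_download are hypothetical; a real setup would persist the mapping somewhere.

```ruby
require 'securerandom'
require 'tmpdir'

# Maps generated ID => original (unsanitized) filename.
# A plain Hash stands in for the database table here.
ORIGINAL_NAMES = {}

# Writes +data+ under a generated ID inside +dir+ and records the
# original name. Returns the ID. Hypothetical helper for illustration.
def store_download(dir, original_name, data)
  id = SecureRandom.uuid
  File.binwrite(File.join(dir, id), data)
  ORIGINAL_NAMES[id] = original_name
  id
end

# Usage: the name on disk is the ID, so invalid characters in the
# original name never reach the file system.
Dir.mktmpdir do |dir|
  id = store_download(dir, 'what: is "this"?.txt', 'payload')
  puts "#{id} -> #{ORIGINAL_NAMES[id]}"
end
```

Because every download gets its own ID, two files with the same original name are stored separately.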

Mladen Jablanović
I'm guessing humans would need to look at those files. Even if he removes the unwanted chars, the file's name could still mean something to someone, whereas an ID might not.
Geo
Ok, then. IDs are ok if you only plan to further serve these files through HTTP, for example.
Mladen Jablanović