I have a bunch of music files on an NTFS partition mounted on Linux whose filenames contain Unicode characters. I'm having trouble writing a script to rename the files so that all of the filenames use only ASCII characters. I think the iconv command should work, but I'm having trouble escaping the characters for the 'mv' command.

EDIT: It doesn't matter if there isn't a direct transliteration for the Unicode characters. I guess I'll just replace those with a "?" character.

A: 

I don't think iconv has any character replacement facilities. This in Python might help:

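#!/usr/bin/env python3
import sys

def unistrip(s):
    """Replace every non-ASCII character in s with '?'."""
    if isinstance(s, bytes):
        # Accept raw bytes as well; they are assumed to be UTF-8 encoded.
        s = s.decode('utf-8')
    chars = []
    for ch in s:
        if ord(ch) > 0x7f:
            # Outside the 7-bit ASCII range: substitute a placeholder.
            chars.append('?')
        else:
            chars.append(ch)
    return ''.join(chars)

if __name__ == '__main__':
    print(unistrip(sys.argv[1]))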

Then call as:

$ ./unistrip.py "yikes__oh_look_a_file_火"
yikes_?_oh_look_a_file_?

Also:

$ mv "yikes__oh_look_a_file_火" "$(./unistrip.py "yikes__oh_look_a_file_火")"

You might want to test it a bit first. For large move operations, it is advisable to generate a list of mv commands (i.e., write code that writes a script), so you can look over the move commands before telling them to execute.
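
A minimal sketch of that generate-then-review approach, assuming Python 3 and that replacing each non-ASCII character with '?' is acceptable. The names `ascii_name` and `mv_commands` are made up for illustration, and the script name in the usage comment is hypothetical:

```python
import os
import shlex
import sys


def ascii_name(name):
    """Replace every non-ASCII character in name with '?'."""
    return name.encode('ascii', 'replace').decode('ascii')


def mv_commands(names, directory=''):
    """Return one shell-quoted mv command per name that would change."""
    commands = []
    for name in sorted(names):
        new_name = ascii_name(name)
        if new_name == name:
            continue  # already pure ASCII, nothing to rename
        src = shlex.quote(os.path.join(directory, name))
        dst = shlex.quote(os.path.join(directory, new_name))
        # '--' keeps mv from treating a name that starts with '-' as an option.
        commands.append('mv -- {} {}'.format(src, dst))
    return commands


if __name__ == '__main__':
    # Hypothetical usage: python3 gen_moves.py /path/to/music > moves.sh
    directory = sys.argv[1] if len(sys.argv) > 1 else '.'
    for command in mv_commands(os.listdir(directory), directory):
        print(command)
```

Nothing is renamed until you run the generated script yourself. While reading it over, watch for two different files that end up with the same ASCII name, since the second mv would overwrite the first.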

Thanatos
`return s.encode('ascii', 'replace')`
Ignacio Vazquez-Abrams
Correct me if I'm wrong, but I think that `iconv` does have character replacement facilities: http://stackoverflow.com/questions/1975057/bash-convert-non-ascii-characters-to-ascii
B Johnson
A: 

convmv is a good Perl script to convert file name encodings. But it can't handle characters that aren't in the destination encoding.

You can change any character not in ASCII to '?' using the rename utility distributed with Perl:

rename 's/[^ -~]/?/g' *

Unfortunately, this replaces each multi-byte character with multiple '?'s. Depending on the Unicode encoding that is used and the characters involved, changing the regex may help, e.g.

rename 's/[^ -~]{2}/?/g' *

for 2-byte characters.
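
To illustrate why the number of '?'s varies, here is a small Python 3 sketch (the function names are made up for illustration) that applies the same `[^ -~]` pattern once to the UTF-8 bytes of a name and once to its decoded characters. It assumes the names are UTF-8 encoded; a character such as 火 occupies three bytes there, so a fixed `{2}` would not collapse it to a single '?' either.

```python
import re


def replace_bytewise(name):
    """Apply the pattern to the UTF-8 bytes: one '?' per non-ASCII byte."""
    raw = name.encode('utf-8')
    return re.sub(rb'[^ -~]', b'?', raw).decode('ascii')


def replace_charwise(name):
    """Apply the pattern to decoded characters: one '?' per character."""
    return re.sub(r'[^ -~]', '?', name)


if __name__ == '__main__':
    print(replace_bytewise('file_火.mp3'))  # file_???.mp3
    print(replace_charwise('file_火.mp3'))  # file_?.mp3
```

Working on decoded characters sidesteps the need to guess how many bytes each character takes.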

Florian Diesch