Hi,

O/S = Fedora Core 9.

I have a number of files hiding in my LANG=en_US.UTF-8 filesystem that have been uploaded with unrecognisable characters in their filenames.

I need to search the filesystem and return all filenames that have at least one character that is not in the standard range (a-zA-Z0-9 and .-_ etc.)

I have been trying the following, but with no luck:

find . | egrep [^a-zA-Z0-9_\.\/\-\s]

All suggestions would be welcome.

Cheers,

AP.

+2  A: 

convmv might be interesting to you. It doesn't just find those files, but also supports renaming them to correct file names (if it can guess what went wrong).

Joachim Sauer
A: 

find . | egrep [^a-zA-Z0-9_.\/-\s]

Danger, shell escaping!

bash will be interpreting that last parameter, removing one level of backslash-escaping. Try putting double quotes around the "[^group]" expression.
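For instance, one quoted variant (a sketch, not a full UTF-8 validator: `LC_ALL=C` makes the bracket expression match raw bytes, so any byte outside printable ASCII is flagged):

```shell
# List paths containing any byte outside printable ASCII.
# The single quotes stop the shell from eating the bracket expression,
# and LC_ALL=C forces byte-wise (not multibyte) matching.
find . -print | LC_ALL=C grep '[^ -~]'
```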

Also of course this disallows a lot more than UTF-8. It is possible to construct a regex to match valid UTF-8 strings, but it's rather ugly. If you have Python 2.x available you could take advantage of that:

import os

def walk(dir):
    # Recursively yield every path under dir (files and directories).
    for child in os.listdir(dir):
        child = os.path.join(dir, child)
        if os.path.isdir(child):
            for descendant in walk(child):
                yield descendant
        yield child

for path in walk('.'):
    try:
        u = unicode(path, 'utf-8')
    except UnicodeError:
        print path  # or attempt to rename the file
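On Python 3 the same idea can be sketched with os.walk over bytes paths, since passing bytes keeps undecodable names intact instead of raising (the helper name below is just for illustration):

```python
import os

def find_non_utf8(top=b'.'):
    # Walk with bytes paths so undecodable filenames survive untouched.
    bad = []
    for dirpath, dirnames, filenames in os.walk(top):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                path.decode('utf-8')
            except UnicodeDecodeError:
                bad.append(path)
    return bad
```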
bobince
Single quotes would be better, in that context.
Arafangion