I'm downloading some files with Chinese names (BIG5-encoded) via FTP, and FileZilla displays those filenames as garbage (FTP cannot handle any encoding other than ASCII and UTF-8, at least in standards-compliant implementations).

Given a filename with garbled characters, is it possible for me to repair the encoding and get a proper filename String given that I already know the source encoding? Will the FTP client misinterpreting BIG5 as UTF-8 insert bytes that make conversion back to BIG5 difficult?

My proposed steps (in Java):

1. Get the garbled filename from the File object.
2. Get its bytes using UTF-8 (getBytes).
3. Create a new String from those bytes using BIG5.
4. Write the decoded filename back to the file (that is, rename it).
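A minimal sketch of steps 2 and 3, to make the idea concrete. The class name `FilenameRepair`, the method `repair`, and its parameters are hypothetical names chosen for illustration. Which charset the client actually assumed is not known, so it is passed in as a parameter; the simulation in `main` uses ISO-8859-1 only because that decoding never loses bytes, not because FileZilla is known to use it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FilenameRepair {

    // Re-encode the garbled name with the charset that was wrongly assumed,
    // then decode the resulting bytes with the charset that was really used.
    // This only recovers the name if the wrong decoding did not lose any bytes.
    static String repair(String garbled, Charset wronglyAssumed, Charset actual) {
        byte[] raw = garbled.getBytes(wronglyAssumed);
        return new String(raw, actual);
    }

    public static void main(String[] args) {
        Charset big5 = Charset.forName("Big5");
        String original = "\u4e2d\u6587"; // two Chinese characters, written as escapes

        // Simulate a client that decoded the BIG5 bytes as ISO-8859-1 (an assumption).
        String garbled = new String(original.getBytes(big5), StandardCharsets.ISO_8859_1);

        String repaired = repair(garbled, StandardCharsets.ISO_8859_1, big5);
        System.out.println(repaired.equals(original)); // prints "true"
    }
}
```

Step 4 (renaming the file on disk) is left out here; it would only make sense once the repaired name has been checked.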

Will the above method work?

+2  A: 

Not every sequence of bytes is a valid ASCII or UTF-8 string, so it's quite likely that some of the bytes will have been discarded, converted to the replacement character, or otherwise irreversibly mangled. If FileZilla modified the filenames to make them well-formed UTF-8 or ASCII, you won't be able to retrieve the original filenames.

You might be lucky enough to get a certain percentage of the original characters back, where the bytes just happened to be both valid BIG5 and valid UTF-8, but I doubt you will be able to recover the entire filename.
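To illustrate the point about mangling, here is a small sketch that checks whether a byte sequence survives being decoded with the wrong charset and encoded again. The class `LossyDecodeDemo` and the method `survivesRoundTrip` are hypothetical names for this example. It relies on the documented behaviour of Java's `new String(byte[], Charset)`, which substitutes a replacement character for malformed input; whether FileZilla does the same is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LossyDecodeDemo {

    // Returns true if decoding the bytes with the given charset and encoding
    // the result again yields exactly the same bytes (nothing was lost).
    static boolean survivesRoundTrip(byte[] raw, Charset wronglyAssumed) {
        String decoded = new String(raw, wronglyAssumed);
        return Arrays.equals(raw, decoded.getBytes(wronglyAssumed));
    }

    public static void main(String[] args) {
        // BIG5 bytes for two Chinese characters (written as Unicode escapes).
        byte[] big5Bytes = "\u4e2d\u6587".getBytes(Charset.forName("Big5"));

        // These bytes are not well-formed UTF-8, so Java replaces them on decode.
        System.out.println(survivesRoundTrip(big5Bytes, StandardCharsets.UTF_8));      // prints "false"

        // ISO-8859-1 maps every byte to a character, so nothing is lost.
        System.out.println(survivesRoundTrip(big5Bytes, StandardCharsets.ISO_8859_1)); // prints "true"
    }
}
```

If the round trip fails for the charset the client used, the original bytes are gone and no re-encoding on your side can bring them back.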

You could post a few examples of your garbled filenames (as raw bytes encoded in hex) to get a more definite answer. That way we can see exactly what the damage is.
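One possible way to produce such a hex dump in Java is sketched below. The class `FilenameHexDump` and the method `toHex` are hypothetical names; the sketch assumes the directory to inspect is given as the first command-line argument (defaulting to the current directory) and that UTF-8 is the encoding you want to dump the names in.

```java
import java.io.File;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FilenameHexDump {

    // Encode the name with the given charset and format each byte as two hex digits.
    static String toHex(String name, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : name.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF)); // mask to treat the byte as unsigned
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        String[] names = dir.list(); // null if dir is not a readable directory
        if (names == null) {
            System.err.println("Not a readable directory: " + dir);
            return;
        }
        for (String name : names) {
            System.out.println(toHex(name, StandardCharsets.UTF_8) + "  " + name);
        }
    }
}
```

Note that this shows the bytes of the name as Java sees it after the operating system has decoded it, which may already differ from the bytes the FTP server sent.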

Mark Byers