I have a web application that allows users to upload their content for processing. The processing engine expects UTF-8 (and I'm composing XML from multiple users' files), so I need to ensure that I can properly decode the uploaded files.

Since I'd be surprised if any of my users even knew that their files have an encoding, I have very little hope that they could correctly specify the encoding (decoder) to use. So my application is left with the task of detecting the encoding before decoding.

This seems like such a universal problem that I'm surprised not to find either a framework capability or a general recipe for the solution. Can it be that I'm not searching with meaningful search terms?

I've implemented BOM-aware detection (http://en.wikipedia.org/wiki/Byte_order_mark), but I'm not sure how often files will be uploaded without a BOM to indicate their encoding, and this isn't useful for most non-UTF files.
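For reference, a minimal BOM check of the kind described here might look like the sketch below. This is only an illustrative sketch, not the actual implementation; returning null when no BOM is found is an assumption about how the caller falls through to other heuristics.

```csharp
using System.Text;

static class BomSniffer
{
    // Returns the encoding implied by a leading BOM, or null if no BOM is found.
    // The UTF-32 LE check must come before UTF-16 LE, because its signature
    // FF FE 00 00 starts with the UTF-16 LE signature FF FE.
    public static Encoding DetectFromBom(byte[] buf)
    {
        if (buf.Length >= 4 && buf[0] == 0xFF && buf[1] == 0xFE && buf[2] == 0x00 && buf[3] == 0x00)
            return Encoding.UTF32;                          // UTF-32 LE
        if (buf.Length >= 4 && buf[0] == 0x00 && buf[1] == 0x00 && buf[2] == 0xFE && buf[3] == 0xFF)
            return new UTF32Encoding(true, true);           // UTF-32 BE
        if (buf.Length >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
            return Encoding.UTF8;
        if (buf.Length >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
            return Encoding.Unicode;                        // UTF-16 LE
        if (buf.Length >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
            return Encoding.BigEndianUnicode;               // UTF-16 BE
        return null;                                        // no BOM: some other heuristic is needed
    }
}
```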

My questions boil down to:

  1. Is BOM-aware detection sufficient for the vast majority of files?
  2. In the case where BOM-detection fails, is it possible to try different decoders and determine if they are "valid"? (My attempts indicate the answer is "no.")
  3. Under what circumstances will a "valid" file fail with the C# encoder/decoder framework?
  4. Is there a repository anywhere that has a multitude of files with various encodings to use for testing?
  5. While I'm specifically asking about C#/.NET, I'd like to know the answer for Java, Python and other languages for the next time I have to do this.

So far I've found:

  • A "valid" UTF-16 file with Ctrl-S characters has caused encoding to UTF-8 to throw an exception (Illegal character?) (That was an XML encoding exception.)
  • Decoding a valid UTF-16 file with UTF-8 succeeds but gives text with null characters. Huh?
  • Currently, I only expect UTF-8, UTF-16 and probably ISO-8859-1 files, but I want the solution to be extensible if possible.
  • My existing set of input files isn't nearly broad enough to uncover all the problems that will occur with live files.
  • Although the files I'm trying to decode are "text" I think they are often created w/methods that leave garbage characters in the files. Hence "valid" files may not be "pure". Oh joy.
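To illustrate the second bullet: ASCII text stored as UTF-16LE alternates each character with a 0x00 byte, and those 0x00 bytes are themselves valid UTF-8 (the NUL character), so a non-throwing UTF-8 decode "succeeds" and simply yields text riddled with nulls. A small demonstration, assuming .NET's default (non-throwing) UTF-8 decoder:

```csharp
using System;
using System.Text;

class Utf16AsUtf8Demo
{
    static void Main()
    {
        byte[] utf16Bytes = Encoding.Unicode.GetBytes("Hello");   // UTF-16 LE, no BOM
        string misdecoded = Encoding.UTF8.GetString(utf16Bytes);  // no exception thrown

        Console.WriteLine(misdecoded.Length);                     // 10, not 5
        Console.WriteLine(misdecoded.IndexOf('\0') >= 0);         // True: interleaved null characters
    }
}
```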

Thanks.

+2  A: 

Have you tried reading a representative cross-section of your users' files, running them through your program, testing, correcting any errors, and moving on?

I've found File.ReadAllLines() pretty effective across a very wide range of applications without worrying about all of the encodings. It seems to handle them pretty well.
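For what it's worth, the BOM-based detection behind File.ReadAllLines() is also available for arbitrary streams through StreamReader's detectEncodingFromByteOrderMarks flag. A sketch (the UTF-8 argument is just the fallback used when no BOM is present):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

static class StreamTextReader
{
    // Roughly what File.ReadAllLines() does, but for an arbitrary stream.
    // CurrentEncoding reflects the detected encoding only after the first read.
    public static string[] ReadAllLines(Stream stream)
    {
        using (var reader = new StreamReader(stream, Encoding.UTF8,
                                             detectEncodingFromByteOrderMarks: true))
        {
            var lines = new List<string>();
            string line;
            while ((line = reader.ReadLine()) != null)
                lines.Add(line);
            return lines.ToArray();
        }
    }
}
```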

XmlReader has done fairly well once I figured out how to use it properly.

Maybe you could post some specific examples of data and get some better responses.

No Refunds No Returns
The anonymous down-vote. My favorite.
No Refunds No Returns
It's a valid answer; why the downvote?
Sunny
Thanks, but I'm looking for a general purpose solution. In this application, the app is deployed at the customer's site and I don't have access (or legal permission) to the files. They are *any* text document the user wishes to upload. Some are PDF-to-text, some are scraped from web sites, some are from PPT slides, some are.... who knows.
NVRAM
Then I would say make sure you have extensive logging about input/output etc. written to the user's local event log. This sounds like a no-win situation to me.
No Refunds No Returns
Incidentally, I don't get what you mean by *correcting any errors and moving on* -- I cannot "correct" the user's files, and the error I now have is that they must correctly select the encoding format. I'll look into **File.ReadAllLines()**...
NVRAM
Are the encoding-detection capabilities of **File.ReadAllLines()** available for streams?
NVRAM
+1  A: 

This is a well-known problem. You can try to do what Internet Explorer does. There is a nice article on The Code Project that describes Microsoft's solution to the problem. However, no solution is 100% accurate, as everything is based on heuristics. It is also not safe to assume that a BOM will be present.

kgiannakakis
+3  A: 

There won't be an absolutely reliable way, but you may be able to get a "pretty good" result with some heuristics.

  • If the data starts with a BOM, use it.
  • If the data contains 0-bytes, it is likely UTF-16 or UTF-32. You can distinguish between these, and between their big-endian and little-endian variants, by looking at the positions of the 0-bytes.
  • If the data can be decoded as UTF-8 (without errors), then it is very likely UTF-8 (or US-ASCII, which is a subset of UTF-8).
  • Next, if you want to go international, map the browser's language setting to the most likely encoding for that language.
  • Finally, assume ISO-8859-1.

Whether "pretty good" is "good enough" depends on your application, of course. If you need to be sure, you might want to display the results as a preview, and let the user confirm that the data looks right. If it doesn't, try the next likely encoding, until the user is satisfied.

Note: this algorithm will not work if the data contains garbage characters. For example, a single garbage byte in otherwise valid utf-8 will cause utf-8 decoding to fail, making the algorithm go down the wrong path. You may need to take additional measures to handle this. For example, if you can identify possible garbage beforehand, strip it before you try to determine the encoding. (It doesn't matter if you strip too aggressive, once you have determined the encoding, you can decode the original unstripped data, just configure the decoders to replace invalid characters instead of throwing an exception.) Or count decoding errors and weight them appropriately. But this probably depends much on the nature of your garbage, i.e. what assumptions you can make.
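A minimal C# sketch of these steps (my own translation, not code from the answer; the browser-language step is omitted and the ISO-8859-1 fallback is an assumption):

```csharp
using System;
using System.Linq;
using System.Text;

static class EncodingGuesser
{
    public static Encoding Guess(byte[] data)
    {
        // 1. A BOM wins if present (UTF-32 LE checked before UTF-16 LE, since FF FE 00 00 starts with FF FE).
        if (data.Length >= 4 && data[0] == 0xFF && data[1] == 0xFE && data[2] == 0x00 && data[3] == 0x00)
            return Encoding.UTF32;
        if (data.Length >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
            return Encoding.UTF8;
        if (data.Length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
            return Encoding.Unicode;                 // UTF-16 LE
        if (data.Length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
            return Encoding.BigEndianUnicode;        // UTF-16 BE

        // 2. Zero bytes suggest UTF-16 (or UTF-32). For mostly-ASCII text the high byte of each
        //    code unit is zero, so zeros at even offsets point to big-endian, odd offsets to little-endian.
        int zeros = data.Count(b => b == 0);
        if (zeros > 0)
        {
            int zerosAtEvenOffsets = data.Where((b, i) => b == 0 && i % 2 == 0).Count();
            return zerosAtEvenOffsets > zeros / 2 ? Encoding.BigEndianUnicode : Encoding.Unicode;
        }

        // 3. If a strict UTF-8 decoder accepts the data, call it UTF-8 (US-ASCII is a subset).
        try
        {
            new UTF8Encoding(false, true).GetString(data);   // second argument: throw on invalid bytes
            return Encoding.UTF8;
        }
        catch (DecoderFallbackException) { }

        // 4. (Mapping the browser's language setting to a likely encoding would go here.)
        // 5. Last resort.
        return Encoding.GetEncoding("ISO-8859-1");
    }
}
```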

oefe
This is helpful, although note that some UTF-16LE files I had were decoded *without exceptions* by the C#/.NET encoding framework; there were *errors* (null characters) but no *exceptions*. My intention is auto-detection (hence the posting), and I've partially implemented it, since I already detect MS Word, PDF and other non-text files; the issue is determining when an encoding is the *right* one.
NVRAM
You're right, the 0-byte check needs to go first; I fixed the order of the steps in my answer accordingly.
oefe
+1  A: 

You may like to look at a Python-based solution called chardet. It's a Python port of Mozilla code. Although you may not be able to use it directly, its documentation is well worth reading, as is the original Mozilla article it references.

John Machin
FWIW, I grabbed UDE [http://code.google.com/p/ude/] and compiled it with Mono. I then ran the resulting EXE against files encoded as ISO-8859-1, ISO-8859-2 and UTF-{8,16LE,16BE,32LE,32BE}, and it only recognized the UTF-8 correctly (it guessed windows-1255 or -1252 for everything else).
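(Not the exact invocation used above, but a minimal call into UDE looks roughly like the following; the CharsetDetector/Feed/DataEnd/Charset names are taken from that project's documentation, so treat the details as approximate.)

```csharp
using System;
using System.IO;
using Ude; // the C# port from http://code.google.com/p/ude/

class UdeSketch
{
    static void Main(string[] args)
    {
        using (var fs = File.OpenRead(args[0]))
        {
            var detector = new CharsetDetector();
            detector.Feed(fs);      // can also be fed byte buffers incrementally
            detector.DataEnd();

            if (detector.Charset != null)
                Console.WriteLine("{0} (confidence {1})", detector.Charset, detector.Confidence);
            else
                Console.WriteLine("Detection failed.");
        }
    }
}
```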
NVRAM
It won't recognise UTF-nnxE without a BOM; did yours have a BOM? ISO-8859-n is a figment of the imagination -- decode it to Unicode and see if you have any characters in the range U+0080 to U+009F ;-)
John Machin
A: 

I ran into a similar issue. I needed a PowerShell script that figured out whether a file was text-encoded (in any common encoding) or not.

It's definitely not exhaustive, but here's my solution...

http://stackoverflow.com/questions/1077634/powershell-search-script-that-ignores-binary-files/1080976#1080976
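(Along the same lines, a crude text-versus-binary first pass in C#; this is not the linked PowerShell script, just the common "a NUL byte in the first few KB means binary" heuristic, which of course misfires on UTF-16/UTF-32 text.)

```csharp
using System.IO;

static class TextFileProbe
{
    // Crude heuristic: treat files containing a NUL byte in their first 4 KB as binary.
    // Plenty of exceptions exist (UTF-16 text, for one), so this is only a first-pass filter.
    public static bool LooksLikeText(string path)
    {
        var buffer = new byte[4096];
        int read;
        using (var fs = File.OpenRead(path))
            read = fs.Read(buffer, 0, buffer.Length);

        for (int i = 0; i < read; i++)
            if (buffer[i] == 0)
                return false;
        return true;
    }
}
```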

kervin