In our application, we receive text files (.txt, .csv, etc.) from diverse sources. When we read them, these files sometimes contain garbage, because they were created in a different or unknown codepage.
Is there a way to (automatically) detect the codepage of a text file?
(I use .Net / C#).
The detectEncodingFromByteOrderMarks parameter on the StreamReader constructor works for UTF-8 and other Unicode files that carry a byte order mark, but I'm looking for a way to detect legacy 8-bit code pages, like ibm850 or windows-1252.
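For reference, this is roughly how I use the byte order mark detection. It is a minimal sketch: the class name BomDetection and the method DetectFromBom are made up for illustration, and the fallback encoding is whatever the caller assumes when no mark is present.

```csharp
using System.IO;
using System.Text;

public static class BomDetection
{
    // Returns the encoding StreamReader settles on after looking for a byte order mark.
    // Without a mark, this is simply the fallback passed in, so it cannot
    // distinguish ibm850 from windows-1252.
    public static Encoding DetectFromBom(string path, Encoding fallback)
    {
        using (var reader = new StreamReader(path, fallback, detectEncodingFromByteOrderMarks: true))
        {
            // Detection only happens on the first read, so read one character.
            reader.Read();
            return reader.CurrentEncoding;
        }
    }
}
```

Calling DetectFromBom on a file saved as UTF-8 with a byte order mark reports UTF-8 regardless of the fallback; for a file without a mark, it just hands the fallback back.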
Thanks for your answers; this is what I've done.
The files we receive are from end users, who do not have a clue about codepages. The receivers are also end users; by now, this is what they know about codepages: they exist, and they are annoying.
Solution:
- Open the received file in Notepad and look at a garbled piece of text. If somebody is called François or something similar, your human intelligence lets you guess what the text should have been.
- I've created a small app that the user can open the file with, and in which they enter a piece of text that they know will appear in the file when the correct codepage is used.
- Loop through all codepages, and display the ones for which the decoded file contains the user-provided text.
- If more than one codepage pops up, ask the user to specify more text.
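
The loop itself can be sketched roughly as follows. This is not the actual app, just a minimal illustration; the names CodepageFinder and FindCandidates are hypothetical. It assumes the whole file fits in memory as a byte array. On .NET Framework, Encoding.GetEncodings() lists the legacy code pages directly; on .NET Core / .NET 5+ I would expect to need Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) first, otherwise only a handful of built-in encodings are tried.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CodepageFinder
{
    // Decodes the raw bytes with every encoding the runtime knows about and
    // returns those whose decoded text contains the text the user expects.
    public static List<Encoding> FindCandidates(byte[] fileBytes, string expectedText)
    {
        var candidates = new List<Encoding>();
        foreach (EncodingInfo info in Encoding.GetEncodings())
        {
            Encoding encoding;
            try
            {
                encoding = info.GetEncoding();
            }
            catch (NotSupportedException)
            {
                // Listed but not available on this machine: skip it.
                continue;
            }
            catch (ArgumentException)
            {
                continue;
            }

            string decoded = encoding.GetString(fileBytes);
            // Ordinal comparison: we want an exact character match, not a culture-aware one.
            if (decoded.IndexOf(expectedText, StringComparison.Ordinal) >= 0)
            {
                candidates.Add(encoding);
            }
        }
        return candidates;
    }
}
```

A caller would pass File.ReadAllBytes(path) together with the user's text (for example "François") and show the WebName of each returned encoding. Many codepages share the same byte for common accented letters, so several candidates are normal for a short text, which is why the last step asks the user for more text.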