Hi, I am using Python 3.1, but I can downgrade if needed.
I have an ASCII file containing a short story written in one of the languages the alphabet of which can be represented with upper and or lower ASCII. I wish to:
1) Detect an encoding to the best of my abilities, get some sort of confidence metric (would vary depending on the length of the file, right?)
2) Automatically translate the whole thing using some free online service or a library.
Additional question: What if the text is written in a language where it takes 2 or more bytes to represent one letter and the byte order mark is not there to help me?
Finally, how do I deal with punctuation and misc characters such as space? It will occur more frequently than some letters, right? How about the fact that punctuation and characters can be sometimes mixed - there might be two representations of a comma, two representations for what looks like an "a", etc.?
Yes, I have read the article by Joel Spolsky on Unicode. Please help me with at least some of these items.
Thank you!
P.S. This is not a homework, but it is for self-educational purposes. I prefer using a letter frequency library that is open-source and readable as opposed to the one that is closed, efficient, but gets the job done well.