> For example I allow user to use
> Unicode UTF-8 and iso-8859-2 for their
> csv files. Is it possible to detect
> whether it is former or latter?
It's not possible with 100% accuracy because, for example, the bytes C3 B1 are an equally valid representation of "Ăą" in ISO-8859-2 as they are of "ñ" in UTF-8. In fact, because ISO-8859-2 assigns a character to all 256 possible byte values, every UTF-8 string is also a valid ISO-8859-2 string (though any non-ASCII bytes decode to different characters).
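A quick way to see the ambiguity, sketched in Python using its built-in `utf-8` and `iso-8859-2` codecs (the variable names are only illustrative):

```python
# The same two bytes decode cleanly under both encodings.
raw = bytes([0xC3, 0xB1])

as_utf8 = raw.decode("utf-8")         # one character: ñ (U+00F1)
as_latin2 = raw.decode("iso-8859-2")  # two characters: Ă (U+0102), ą (U+0105)

# Every possible byte value has a mapping in ISO-8859-2,
# so decoding never fails, whatever the input is.
all_bytes_text = bytes(range(256)).decode("iso-8859-2")

print(as_utf8, as_latin2, len(all_bytes_text))
```

Since neither decode raises an error, validation alone cannot tell you which reading the author intended.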
However, the converse is not true: UTF-8 has strict rules about which byte sequences are valid. More than 99% of all possible 8-octet sequences are not valid UTF-8, and your CSV files are probably much longer than that, so ISO-8859-2 text that contains non-ASCII characters is unlikely to pass as UTF-8 by accident. Because of this, you can get good accuracy if you:
- Perform a UTF-8 validity check. If it passes, assume the data is UTF-8.
- Otherwise, assume it's ISO-8859-2.
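The two steps above could be sketched in Python roughly as follows. The function name `decode_csv_bytes` is made up for this example, and the sketch assumes the whole file has already been read into memory as `bytes`:

```python
from typing import Tuple


def decode_csv_bytes(raw: bytes) -> Tuple[str, str]:
    """Return (decoded text, name of the encoding that was assumed)."""
    try:
        # Strict decoding doubles as the UTF-8 validity check.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to ISO-8859-2, which accepts any bytes.
        return raw.decode("iso-8859-2"), "iso-8859-2"


# Usage: the same text, saved in each of the two allowed encodings.
sample = "zażółć"
print(decode_csv_bytes(sample.encode("utf-8")))
print(decode_csv_bytes(sample.encode("iso-8859-2")))
```

A pure-ASCII file is reported as UTF-8 here, which is harmless because both encodings decode ASCII bytes identically.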
> However is it possible to detect
> whether encoding is one of two
> allowed?
UTF-32 (either byte order), UTF-8, and CESU-8 can be reliably detected by validation.
UTF-16 can be detected by the presence of a BOM (but not by validation, since the only way for an even-length byte sequence to be invalid UTF-16 is to contain unpaired surrogates).
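As a rough illustration, a BOM check might look like this in Python. The helper name `utf16_from_bom` is hypothetical, and the decision to treat `FF FE 00 00` as a UTF-32 BOM rather than a UTF-16 one is an assumption of this sketch:

```python
import codecs
from typing import Optional


def utf16_from_bom(data: bytes) -> Optional[str]:
    """Return 'utf-16-le' or 'utf-16-be' if data starts with a UTF-16 BOM."""
    # The UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE),
    # so this sketch rules it out first.
    if data.startswith(codecs.BOM_UTF32_LE):
        return None
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None


# Validation is of little use: this plain ASCII line has an even length,
# so it also decodes as UTF-16 without error (as two unrelated characters).
misread = b"a,b\n".decode("utf-16-le")

print(utf16_from_bom("a,b".encode("utf-16")), utf16_from_bom(b"a,b"), len(misread))
```

Python's `utf-16` codec writes a BOM in the platform's native byte order when encoding, which is why the first call finds one.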
If you have at least one "detectable" encoding, then you can check for the detectable encoding, and use the undetectable encoding as a fallback.
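One way to generalise this, as a sketch only: try each detectable encoding in turn by strict decoding, and return the fallback if none validates. The function name `detect_encoding` and its parameters are invented for this example. The order of the candidates matters, because ASCII text stored as UTF-32 is also valid UTF-8 (the padding bytes read as NUL characters), so the UTF-32 variants are tried first.

```python
from typing import Sequence


def detect_encoding(data: bytes, detectable: Sequence[str], fallback: str) -> str:
    """Return the first encoding in `detectable` that validates, else `fallback`."""
    for name in detectable:
        try:
            data.decode(name)  # strict mode raises on invalid input
        except UnicodeDecodeError:
            continue
        return name
    return fallback


# Usage: validated encodings first, the "undetectable" one as the fallback.
candidates = ["utf-32-le", "utf-32-be", "utf-8"]
for payload in (
    "zażółć".encode("utf-8"),
    "zażółć".encode("iso-8859-2"),
    "abc".encode("utf-32-be"),
):
    print(detect_encoding(payload, candidates, "iso-8859-2"))
```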
If both encodings are "undetectable", like ISO-8859-1 and ISO-8859-2, then it's more difficult. You could try a statistical approach like the one chardet uses.
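For reference, calling chardet from Python looks roughly like this. The wrapper name `guess_with_chardet` is made up, and the sketch assumes the third-party `chardet` package is installed; its answer is a statistical guess, not a guarantee, and tends to be less reliable on short inputs.

```python
from typing import Optional, Tuple


def guess_with_chardet(data: bytes) -> Tuple[Optional[str], float]:
    """Return chardet's best guess as (encoding name or None, confidence)."""
    # Imported lazily because chardet is an optional third-party dependency.
    import chardet

    result = chardet.detect(data)  # dict with 'encoding' and 'confidence' keys
    return result["encoding"], result["confidence"]
```

You would typically act on the guess only when the reported confidence is high, and otherwise ask the user which encoding they used.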