I'm trying to parse an XML file (OSM data) with expat, and there are lines with some Unicode characters that expat can't parse:
<tag k="name"
v="абвгдежзиклмнопр�?туфхцчшщьыъ�?ю�?�?БВГДЕЖЗИКЛМ�?ОПРСТУФХЦЧШЩЬЫЪЭЮЯ" />
<tag k="name" v="Cin\x8e? Rex" />
(XML file encoding in the opening line is "UTF-8")
The file is quite old, and there must have been errors. In modern files I don't see UTF-8 errors, and they are parsed fine. But what if my program meets a broken symbol, what workaround can I make? Is it possible to join bz2 codec (I parse a compressed file) and utf-8 codec to ignore the broken characters, or change them to "?"?