I'm working on crawling pages for information, and have run into many problems with parsing the pages in Groovy. I've made semi-solution that works most of the time using juniversal chardet and just scanning the page for tag in the head, but sometimes two of these tags are found on one page, for example:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
...
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
Is there a standard on which one to use (first, last, both..?) or some easier way to do this? Thanks.