I'm crawling pages for information and have run into many problems parsing the pages in Groovy. I've made a semi-solution that works most of the time, using juniversalchardet and scanning the page for the Content-Type meta tag in the head, but sometimes two of these tags are found on one page, for example:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
...
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

Is there a standard on which one to use (first, last, both..?) or some easier way to do this? Thanks.

A: 

The behavior here is undefined by the HTML spec. You can't have two separate Content-Type tags in the same document. Since you presumably have to parse this document anyway, your best bet is to make an educated guess about the developer's intent.

Ryan Brunner
+3  A: 

I would do it heuristically:

  • Is everything actually ASCII? If so, it doesn't matter which you use.
  • Does it conform to valid UTF-8? If so, I'd use that.
  • Otherwise, use ISO-8859-1.
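
The three checks above could be sketched roughly as follows. This is only a minimal illustration in plain Java (which Groovy can call directly), assuming the raw page bytes are already in memory; the class name CharsetGuesser and its method names are made up for this example and are not part of juniversalchardet or any other library.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (name invented for this sketch).
class CharsetGuesser {

    // Step 1: true if every byte is in the 7-bit ASCII range.
    static boolean isAscii(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0) {
                return false; // Java bytes are signed, so 0x80..0xFF are negative
            }
        }
        return true;
    }

    // Step 2: true if the bytes decode as UTF-8 with no malformed sequences.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Applies the checks in order; ISO-8859-1 is the step 3 fallback.
    static Charset guess(byte[] bytes) {
        if (isAscii(bytes)) {
            // Pure ASCII decodes identically under either declared charset.
            return StandardCharsets.US_ASCII;
        }
        if (isValidUtf8(bytes)) {
            return StandardCharsets.UTF_8;
        }
        return StandardCharsets.ISO_8859_1;
    }
}
```

From Groovy, something like `new String(bytes, CharsetGuesser.guess(bytes))` would then decode the page. Short ISO-8859-1 text can occasionally happen to be valid UTF-8 too, so this remains a guess rather than a guarantee.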

You might want to look at the content-type header coming back from the web server, too...
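
For the header check, one possible way to pull the charset out of a Content-Type header value is sketched below. It assumes you already have the header as a string (for example from `URLConnection.getContentType()`); the class name ContentTypeCharset and the regular expression are illustrative assumptions, not a full implementation of the header grammar.

```java
import java.nio.charset.Charset;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (name invented for this sketch).
class ContentTypeCharset {

    // Matches charset=utf-8, charset="utf-8" and charset='utf-8', ignoring case.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the header, or empty if none is usable.
    static Optional<Charset> fromHeader(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(m.group(1)));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return Optional.empty();
        }
    }
}
```

If the header names a charset, it is probably a better tie-breaker between two conflicting meta tags than picking the first or last one blindly.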

Fundamentally the page is broken, but the above should give a reasonable "best guess."

Jon Skeet