I'm writing a crawler in ruby (1.9) that consumes lots of HTML from a lot of random sites.
When trying to extract links, I decided to just use .scan(/href="(.*?)"/i) instead of nokogiri/hpricot (major speedup). The problem is that I now receive a lot of "invalid byte sequence in UTF-8" errors.
From what I understand, the net/http library doesn't have any encoding-specific options, and the body that comes back is basically not tagged with the proper encoding.
What would be the best way to actually work with that incoming data? I tried .encode with the replace and invalid options set, but no success so far...
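For context, the relevant part looks roughly like this (a minimal sketch; url and the rest of the crawler are omitted):

```ruby
require 'net/http'
require 'uri'

# url is just a placeholder for whatever page the crawler is fetching.
body = Net::HTTP.get_response(URI.parse(url)).body

# In 1.9 the body typically comes back tagged as ASCII-8BIT (binary) rather
# than UTF-8, which is part of why the encoding is awkward to deal with.
utf8 = body.encode('UTF-8', invalid: :replace, undef: :replace, replace: '')

links = utf8.scan(/href="(.*?)"/i).flatten
```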
Before you use scan, make sure the requested page's Content-Type header is text/html, since there can be links to things like images which are not encoded in UTF-8. The page could also be non-HTML if you picked up an href in something like a <link> element. How to check this varies depending on which HTTP library you are using. Then, make sure the result is only ASCII with String#ascii_only? (not UTF-8, because HTML is only supposed to be using ASCII; entities can be used for anything else). If both of those tests pass, it is safe to use scan.
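A minimal sketch of those two checks with net/http (the extract_links helper name is made up for illustration):

```ruby
require 'net/http'
require 'uri'

def extract_links(url)
  response = Net::HTTP.get_response(URI.parse(url))

  # 1. Only bother with documents actually served as text/html.
  return [] unless response['Content-Type'].to_s.start_with?('text/html')

  body = response.body

  # 2. Only scan if every byte is plain ASCII; anything else suggests an
  #    encoding that a regexp is not prepared to handle.
  return [] unless body.ascii_only?

  body.scan(/href="(.*?)"/i).flatten
end
```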
I recommend using an HTML parser; just find the fastest one.
Parsing HTML is not as easy as it may seem.
Browsers handle invalid UTF-8 sequences in UTF-8 HTML documents by simply substituting the "�" symbol, so once the invalid sequence has been parsed, the resulting text is a valid string.
Even inside attribute values you have to decode HTML entities like &amp;.
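For example, with Nokogiri (just one choice of parser; html here stands for the fetched document body), the href values come back already entity-decoded:

```ruby
require 'nokogiri'

# Parse the page, forcing UTF-8; malformed markup is dealt with by the
# parser rather than by your own regexp.
doc = Nokogiri::HTML(html, nil, 'UTF-8')

# Attribute values are returned with entities such as &amp; already decoded.
links = doc.css('a[href]').map { |a| a['href'] }
```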
Here is a great question that sums up why you cannot reliably parse HTML with a regular expression: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags