I'm writing a crawler in Ruby (1.9) that consumes lots of HTML from many random sites.
When trying to extract links, I decided to just use .scan(/href="(.*?)"/i) instead of Nokogiri/Hpricot (a major speedup). The problem is that I now get a lot of "invalid byte sequence in UTF-8" errors.
From what I understand, the net/http library doesn't have any encoding-specific options, and the data that comes in is basically not properly tagged.
What would be the best way to work with that incoming data? I tried .encode with the :invalid and :undef options set to :replace, but no success so far...
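
For reference, a minimal sketch of how the error can be reproduced. The page body below is made up: it is tagged as UTF-8 but contains a stray Latin-1 byte (\xE9), which is an assumption about the kind of data the crawler receives.

```ruby
# Hypothetical page body: tagged as UTF-8, but \xE9 is not a valid UTF-8 sequence here.
body = "<a href=\"http://example.com/\">caf\xE9</a>".dup.force_encoding("UTF-8")

# The string claims to be UTF-8 but its bytes do not form valid UTF-8.
puts body.valid_encoding?   # prints false

# Matching a regexp against such a string raises ArgumentError.
error = begin
  body.scan(/href="(.*?)"/i)
  nil
rescue ArgumentError => e
  e
end

puts error.message if error
```

The failure comes from the regexp engine refusing to match against a string whose bytes contradict its encoding tag, not from scan itself.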

A: 

Before you use scan, make sure that the requested page's Content-Type header is text/html, since there can be links to things like images, which are not encoded in UTF-8. The page could also be non-HTML if you picked up an href in something like a <link> element. How to check this depends on which HTTP library you are using. Then make sure the result is ASCII-only with String#ascii_only? (not UTF-8, because HTML is only supposed to use ASCII; entities can be used for everything else). If both of those tests pass, it is safe to use scan.
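
A minimal sketch of the two checks, assuming you already have the Content-Type header value and the body as strings (with net/http these would come from response["Content-Type"] and response.body). The helper name scannable_html? is made up for illustration.

```ruby
# Hypothetical helper: true only if the response is text/html and the body is pure ASCII.
def scannable_html?(content_type, body)
  # Drop parameters such as "; charset=utf-8" and normalise case.
  media_type = content_type.to_s.split(";").first.to_s.strip.downcase
  media_type == "text/html" && body.ascii_only?
end

html = '<a href="http://example.com/">example</a>'
if scannable_html?("text/html; charset=utf-8", html)
  puts html.scan(/href="(.*?)"/i).flatten.inspect
end
```

Note that the ascii_only? test rejects every page containing a non-ASCII byte, including pages that are valid UTF-8, so it skips more pages than strictly necessary.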

Adrian
thanks, but that's not my problem :) I only extract the host part of the URL anyway and hit only the front page. My problem is that my input apparently isn't UTF-8 and the 1.9 encoding foo goes haywire
Marc Seeger
@Marc Seeger: What do you mean by "my input"? Stdin, the URL, or the page body?
Adrian
HTML can be encoded in UTF-8: http://en.wikipedia.org/wiki/Character_encodings_in_HTML
Eduardo
@Adrian: my input = the page body. @Eduardo: I know. My problem is that the data coming from net/http seems to have a bad encoding from time to time.
Marc Seeger
A: 

I recommend using an HTML parser. Just find the fastest one.

Parsing HTML is not as easy as it may seem.

Browsers handle invalid UTF-8 sequences in UTF-8 HTML documents by substituting the "�" symbol (U+FFFD, the replacement character). So once the invalid UTF-8 sequence in the HTML has been parsed, the resulting text is a valid string.
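
A minimal sketch of the same replacement behaviour in Ruby, assuming Ruby 2.1 or later, where String#scrub is available (it does not exist on 1.9). The sample body is made up.

```ruby
# Hypothetical body with one invalid byte (\xE9) in otherwise valid UTF-8.
bad = "caf\xE9 <a href=\"/menu\">menu</a>".dup.force_encoding("UTF-8")

# Replace each invalid sequence with U+FFFD, as a browser would.
clean = bad.scrub("\uFFFD")

puts clean.valid_encoding?                    # prints true
puts clean.scan(/href="(.*?)"/i).flatten.inspect  # prints ["/menu"]
```

After scrubbing, the string is valid UTF-8 again, so regexp matching no longer raises.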

Even inside attribute values, you have to decode HTML entities such as &amp;.
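
A minimal sketch of that decoding step using CGI.unescapeHTML from the standard library. The sample markup is made up, and as far as I know this method only covers the basic entities (such as &amp;, &lt;, &quot;) and numeric references, not the full set of named entities.

```ruby
require "cgi"

# Hypothetical markup: the href contains an escaped ampersand.
html = '<a href="/search?q=ruby&amp;page=2">next</a>'

# Extract the raw attribute values, then decode entities in each one.
hrefs = html.scan(/href="(.*?)"/i).flatten.map { |raw| CGI.unescapeHTML(raw) }

puts hrefs.inspect   # prints ["/search?q=ruby&page=2"]
```

Without the decoding step, the crawler would request a URL containing the literal text "&amp;".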

Here is a great question that sums up why you cannot reliably parse HTML with a regular expression: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

Eduardo
I'd love to keep the regexp, since it's about 10 times faster and I really don't want to parse the HTML correctly; I just want to extract links. I should be able to replace the invalid parts in Ruby by just doing: ok_string = bad_string.encode("UTF-8", {:invalid => :replace, :undef => :replace}) but that doesn't seem to work :(
Marc Seeger
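
A possible explanation, as far as I know: on Ruby 1.9, encode does nothing when the string is already tagged with the target encoding, so the invalid bytes are never inspected. A common workaround is a round trip through another encoding, which forces real transcoding. This is a minimal sketch; the helper name replace_invalid_utf8 and the "?" replacement are choices made for illustration.

```ruby
# Hypothetical helper: returns a valid UTF-8 copy, with invalid bytes replaced by "?".
def replace_invalid_utf8(str)
  str.dup.force_encoding("UTF-8")
     # Transcoding to a different encoding makes Ruby actually examine every byte.
     .encode("UTF-16BE", :invalid => :replace, :undef => :replace, :replace => "?")
     # Convert back; the intermediate string is valid, so this step cannot fail.
     .encode("UTF-8")
end

bad_string = "<a href=\"/caf\xE9\">menu</a>"
ok_string  = replace_invalid_utf8(bad_string)

puts ok_string.valid_encoding?                        # prints true
puts ok_string.scan(/href="(.*?)"/i).flatten.inspect  # prints ["/caf?"]
```

Valid multibyte characters survive the round trip unchanged; only the broken sequences are replaced. If the replaced characters do not matter at all, scanning a copy tagged as binary with force_encoding("BINARY") also avoids the error without any transcoding.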