I'm using Hpricot to read HTML and I got a segmentation fault. I googled it, and some people say to upgrade to the latest version of Ruby. I am using Rails 2.3.2 and Ruby 1.8.7. How can I resolve this error?

A: 

Well, based on your own question, I'd say "Upgrade to the latest version of Ruby". However, I've also had problems with hpricot segfaulting, which seemed to be related to my usage of threading.

Adam Wright
But I am already using almost the latest version of Ruby. Also, I am not doing any threading in my code :(
Alas not. The latest Ruby is 1.9.1.
Adam Wright
My host is using 1.8.5. Even if I upgrade to 1.9.1 on my dev machine, I won't be able to deploy the code to production.
Is there any way to catch it?
For clarification, upgrading to 1.9 is probably not the answer. Hpricot works better on 1.8 than on 1.9. There are still some bugs that haven't been worked out on 1.9.
Chuck
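
On the "is there any way to catch it?" comment: a segfault inside a C extension takes down the whole Ruby process, so begin/rescue cannot catch it. One possible workaround is to run the parse in a forked child process and check how the child exited. This is only a sketch: run_isolated is a hypothetical helper name, it relies on fork (so not Windows), and the block must return something Marshal can dump, such as plain strings or arrays.

```ruby
# Hypothetical helper (not part of Hpricot or Rails).
# Runs the block in a forked child so that a crash there, such as a
# segfault in a C extension, kills only the child process.
# Returns [ok, value]; ok is false when the child did not exit cleanly.
def run_isolated
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    result = yield
    writer.write(Marshal.dump(result))
    writer.close
    exit!(0) # leave without running at_exit handlers inherited from the parent
  end
  writer.close
  payload = reader.read # read before waiting so a large result cannot fill the pipe
  reader.close
  _, status = Process.wait2(pid)
  if status.success? && !payload.empty?
    [true, Marshal.load(payload)]
  else
    [false, nil]
  end
end
```

With Hpricot this might be called as ok, title = run_isolated { Hpricot(html).at('title').inner_text }, skipping or logging the page when ok is false. It does not fix the crash; it only keeps one bad document from killing the rest of the job.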
A: 

This appears to be an outstanding issue on the bug list. I have experienced it too. My theory is that it has to do with the HTML structure or a bad/corrupt character in the file, but I have not found exactly where.

Here are the links to the issues:

dave elkins
+3  A: 

If you're free to choose your HTML parsing library, switch it. Why, the creator of Hpricot, recently posted that you are better off using Nokogiri instead of Hpricot nowadays.

You may also have a look at HTTParty.
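
For simple lookups the switch is usually small. A rough sketch, assuming the nokogiri gem is installed; link_hrefs is a hypothetical helper name used only for illustration:

```ruby
# Hypothetical helper: collect the href of every link in an HTML string.
# With Hpricot the equivalent lookup would be Hpricot(html).search('a').
def link_hrefs(html)
  require 'nokogiri' # loaded lazily; needs the nokogiri gem
  doc = Nokogiri::HTML(html)
  doc.css('a').map { |a| a['href'] }
end
```

Nokogiri accepts CSS selectors much like Hpricot does, so most search calls should carry over with few changes, though anything beyond basic selectors is worth re-checking.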

bb
He also subsequently vanished from the Internet, so for the moment Hpricot appears to be unmaintained.
molf
A: 

I'm having the same segfault issue but sadly can't consult the issues Dave cited above, even via Google cache. From what I've found by googling, the parse.rb segfaults have to do with encoded entities or alternate character sets (accented characters, perhaps).

The sanitize lib encountered the same issue and posted a monkeypatch here: http://github.com/rgrove/sanitize/blob/1e1dc9681de99e32dc166f591343dfa60fc1f648/lib/sanitize/monkeypatch/hpricot.rb

jamiew
A: 

From memory, since I last used it about a year ago:

Hpricot stores attributes in a fixed-size buffer, and some frameworks generate outrageously long hashes in document attributes. There's some static field you can set before parsing that lets you set the size of this buffer.

I remember it being fairly prominent in the docs on the webpage, though of course it's gone now.
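
As far as I recall, the setting in question is Hpricot.buffer_size. A minimal sketch under that assumption; parse_with_larger_buffer is a hypothetical helper name and 262144 bytes is just an example value:

```ruby
# Hypothetical helper: raise Hpricot's attribute buffer, then parse.
# Assumes the hpricot gem is installed and exposes Hpricot.buffer_size=.
def parse_with_larger_buffer(html, buffer_size = 262_144)
  require 'hpricot' # loaded lazily; needs the hpricot gem
  Hpricot.buffer_size = buffer_size # must be set before parsing
  Hpricot(html)
end
```

In a Rails app the assignment could instead go in an initializer so it applies to every parse. Whether this helps depends on the crash actually being caused by an oversized attribute.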

Ken
+1  A: 

I was trying to parse HTML pages with many Unicode characters in them and Hpricot kept crashing. Finally, I took the monkey patch from sanitize and put it in environment.rb for my Rails application. There hasn't been a single crash since I added this patch:

http://github.com/rgrove/sanitize/blob/1e1dc9681de99e32dc166f591343dfa60fc1f648/lib/sanitize/monkeypatch/hpricot.rb

mehdi
This worked perfectly! I know I should switch to Nokogiri (and plan to), but I needed this fix for an older project!
John