views:

143

answers:

1
 Hpricot(html).inner_text.gsub("\r"," ").gsub("\n"," ").split(" ").join(" ")

hpricot = Hpricot(html)
hpricot.search("script").remove
hpricot.search("link").remove
hpricot.search("meta").remove
hpricot.search("style").remove

found it on http://www.savedmyday.com/2008/04/25/how-to-extract-text-from-html-using-rubyhpricot/

A: 

Nokogiri and Hpricot are pretty interchangeable. I.e. Nokogiri(html) is an equivalent of Hpricot(html). Not really sure I understand what the linked article is trying to achieve, but to:

Extract text from HTML body which includes ignoring large white spaces between tags and words.

This would be an easier approach in Hpricot, and remove the need for the hpricot.search("script").remove bits. I.e. Just get the body in the first place:

Hpricot(html).search('body').inner_text.gsub("\r"," ").gsub("\n"," ").split(" ").join(" ")

And in Nokogiri:

Nokogiri(html).search('body').inner_text.gsub("\r"," ").gsub("\n"," ").split(" ").join(" ")
i5m