Sorry for this question; apparently all my googling and API searching skills are failing me. I've written a web crawler in Ruby and I'm using Nokogiri::HTML to parse the page. I need to print the page out, and while messing around in IRB I noticed a pretty_print method. However, it takes a parameter and I can't figure out what it wants. If someone could point me in the right direction, I would greatly appreciate it.

Thanks.

EDIT: I should clarify that my crawler caches the HTML of the web pages and writes it to files on my local machine. I would like to "pretty print" the HTML so that it looks nice and properly formatted when I do so.

+1  A: 

Why don't you try the pp method?

require 'pp'
pp some_var
khelll
While Nokogiri implements methods to aid "pretty printing", the output is intended for developers only. It seems to me that Jarsen wants to display pretty-printed HTML source.
mislav
+5  A: 

By "pretty printing" an HTML page, I presume you mean that you want to reformat the HTML structure with proper indentation. Nokogiri doesn't support this; the pretty_print method is for the "pp" library, and its output is useful for debugging only.
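
As for the parameter that pretty_print wants: pp calls pretty_print(q) on the object it is printing and passes in its own printer object (a PP instance). Here is a minimal sketch of that protocol using only Ruby's standard library. The Page class and the debug_dump helper are made up for illustration and have nothing to do with Nokogiri's own implementation.

```ruby
require 'pp'
require 'stringio'

# Hypothetical class, used only to show the pretty_print(q) hook.
class Page
  def initialize(url, links)
    @url = url
    @links = links
  end

  # pp calls this method and passes its own printer object as q.
  def pretty_print(q)
    q.group(1, "#<Page #{@url}", ">") do
      q.breakable   # a space, or a line break if the line gets too long
      q.pp(@links)  # delegate nested values back to the printer
    end
  end
end

# Hypothetical helper: capture pp's output in a string instead of $stdout.
def debug_dump(obj)
  out = StringIO.new
  PP.pp(obj, out)
  out.string
end

puts debug_dump(Page.new("http://example.com", ["/a", "/b"]))
```

So you normally never call pretty_print yourself; you call pp and let it supply the argument. Either way, what you get is an inspection dump for developers, not reformatted HTML.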

There are several projects that understand HTML well enough to reformat it without destroying whitespace that is actually significant (the best-known one is HTML Tidy). By googling, though, I found this post titled "Pretty printing XHTML with Nokogiri and XSLT".

It comes down to this:

require 'nokogiri'

# Load the stylesheet and the cached page, then print the transformed result.
xsl = Nokogiri::XSLT(File.read("pretty_print.xsl"))
html = Nokogiri::HTML(File.read("source.html"))
puts xsl.apply_to(html)

It requires you, of course, to download the linked XSL file to your filesystem. I tried it quickly on my machine and it works like a charm.

mislav