After spending some hours with the Ruby debugger, I finally learned that I need to clean up some malformed HTML pages before I can feed them to Hpricot. The best solution I have found so far is the Tidy Ruby interface.
Tidy works great from the command line, and the Ruby interface works as well. However, the interface requires dl/import, which fails to load in JRuby:
$ jirb
irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'tidy'
LoadError: no such file to load -- dl/import
Is this library available for JRuby? A web search revealed that it wasn't available last year.
Alternatively, can someone suggest other ways to clean up malformed HTML in JRuby?
Update
Following Markus' suggestion, I now use Tidy via popen instead of libtidy. I am posting the code, which pipes the document data through tidy, for future reference. Hopefully, this is robust and portable.
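def clean(data)
  cleaned = nil
  # Pipe the document through the tidy executable; diagnostics go to log/tidy.log.
  tidy = IO.popen('tidy -f "log/tidy.log" --force-output yes -wrap 0 -utf8', 'w+')
  begin
    tidy.write(data)
    tidy.close_write
    cleaned = tidy.read
    tidy.close_read
  rescue Errno::EPIPE => e
    # String + Exception raises a TypeError, so interpolate the message instead.
    $stderr.puts "Running 'tidy' failed: #{e.message}"
    tidy.close
  end
  # Fall back to the original data if tidy produced no output.
  return cleaned if cleaned and cleaned != ""
  return data
end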
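For comparison, here is a minimal sketch of the same piping idea using Open3.capture2 from Ruby's standard library, which writes stdin and collects stdout in a single call. The helper name pipe_through and its command parameter are hypothetical, made up for illustration. Whether Open3 behaves the same on a given JRuby version is an assumption I have not verified.

```ruby
require 'open3'

# Hypothetical helper (not part of any library): pipes data through an
# external command and returns the command's stdout. Falls back to the
# original data if the command prints nothing or cannot be started.
def pipe_through(data, *command)
  # Passing the command as separate arguments avoids shell quoting issues.
  output, _status = Open3.capture2(*command, stdin_data: data)
  output.empty? ? data : output
rescue SystemCallError => e
  # Covers Errno::ENOENT (executable not found) and similar OS-level errors.
  $stderr.puts "Running '#{command.first}' failed: #{e.message}"
  data
end

# Possible usage with the same tidy options as in clean:
# cleaned = pipe_through(html, 'tidy', '-f', 'log/tidy.log',
#                        '--force-output', 'yes', '-wrap', '0', '-utf8')
```

Taking the command as a parameter keeps the helper testable without tidy installed, since any filter program can stand in for it.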