After spending some hours with the Ruby Debugger I finally learned that I need to clean up some malformed HTML pages before I can feed those to Hpricot. The best solution I found so far is the Tidy Ruby interface.

Tidy works great from the command line, and the Ruby interface works as well. However, the Ruby interface requires dl/import, which fails to load in JRuby:

$ jirb
irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'tidy'
LoadError: no such file to load -- dl/import

Is this library available for JRuby? A web search revealed that it wasn't available last year.

Alternatively, can someone suggest other ways to clean up malformed HTML in JRuby?

Update

Following Markus' suggestion, I now use Tidy via popen instead of libtidy. For future reference, here is the code that pipes the document data through tidy. Hopefully, this is robust and portable.

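def clean(data)
    cleaned = nil
    # Pipe the document through the tidy executable; diagnostics go to log/tidy.log.
    tidy = IO.popen('tidy -f "log/tidy.log" --force-output yes -wrap 0 -utf8', 'w+')
    begin
        tidy.write(data)
        tidy.close_write
        cleaned = tidy.read
        tidy.close_read
    rescue Errno::EPIPE => e
        # Interpolate the exception; String + Exception would raise a TypeError.
        $stderr.puts "Running 'tidy' failed: #{e}"
        tidy.close
    end
    # Fall back to the original data if tidy produced nothing.
    return cleaned if cleaned and cleaned != ""
    return data
end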
+2  A: 

You could use it from the command line from within JRuby with %x{...} or backticks. You may also want to consider popen (and pipe things through it).

Not elegant perhaps, but more likely to get you going with minimal hassle than trying to mess with unsupported libraries.
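
For reference, the backtick variant might look roughly like the sketch below. It is only an illustration under a few assumptions: the helper name filter_through is made up, the input is handed to the command through a temporary file with shell redirection, and a POSIX-style shell is available. The tidy flags in the usage comment are the ones from the question's update.

```ruby
require 'tempfile'
require 'shellwords'

# Hypothetical helper (not part of any library): runs an external filter
# command on a string and returns whatever the command prints to stdout.
# Falls back to the unmodified input if the command prints nothing.
def filter_through(command, data)
  Tempfile.create('filter-input') do |file|
    file.write(data)
    file.flush
    # Backticks run the command through the shell and capture its stdout.
    output = `#{command} < #{Shellwords.escape(file.path)}`
    output.empty? ? data : output
  end
end

# Assumed usage with tidy (requires the tidy executable on the PATH):
# cleaned = filter_through('tidy --force-output yes -wrap 0 -utf8', html)
```

Going through a temporary file avoids writing to and reading from the same pipe, at the cost of touching the disk. Without the -f option, tidy's diagnostics go to stderr rather than into a log file.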

MarkusQ