I am trying to scrape this page: http://www.udel.edu/dining/menus/russell.html. I have written a scraper in Ruby using the Hpricot library.

problem: The HTML page is escaped, and I need to display it unescaped.

example: "M&amp;M" should be "M&M"  
example: "Vegetarian Entr&eacute;e" should be "Vegetarian Entrée"  

I have tried using the CGI library in Ruby (not too successful) and the HTMLEntities gem that I found through this Stack Overflow post.

HTMLEntities works during testing:

require 'rubygems' 
require 'htmlentities'
require 'cgi'

h = HTMLEntities.new
puts "h.decode('Entr&eacute;e') = #{h.decode("Entr&eacute;e")}"

blank = "&nbsp;"
puts "h.decode blank = #{h.decode blank}"
puts "CGI.unescapeHTML blank = |#{CGI.unescapeHTML blank}|"

puts "h.decode '<th width=86 height=59 scope=row>Vegetarian Entr&eacute;e</th> ' = |#{h.decode '<th width=86 height=59 scope=row>Vegetarian Entr&eacute;e</th> '}|"  

correctly yields

h.decode('Entr&eacute;e') = Entrée
h.decode blank =  
CGI.unescapeHTML blank = |&nbsp;|
h.decode '<th width=86 height=59 scope=row>Vegetarian Entr&eacute;e</th> ' = |<th width=86 height=59 scope=row>Vegetarian Entrée</th> |

However, when I go to use it on a file with open-uri it does not work properly:

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'htmlentities'
require 'cgi'
f = open("http://www.udel.edu/dining/menus/russell.html")
htmlentity = HTMLEntities.new
while line = f.gets
  puts htmlentity.decode line
end

Incorrectly yields things like:

<th width="60" height="59" scope="row">Vegetarian Entrée</th>

and

<th scope="row"> </th>  // note: was originally '&nbsp;' to indicate a blank

but correctly handles M&M by yielding:

<td valign="middle" class="menulineA">M&M Brownies</td>

Am I treating the escaped HTML incorrectly? I don't understand why it works in some cases and not in others.

I am running ruby 1.8.7 (2009-06-12 patchlevel 174) [i486-linux]

Any help/suggestion is appreciated. Thanks.

A: 

HTMLEntities seems to work, but you have an encoding problem. The terminal you're printing to is probably set up for a Latin charset and garbles the UTF-8 characters output by your script.
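
To illustrate the kind of mismatch described here, the following is a minimal sketch of how correctly decoded UTF-8 text turns into garbage when a viewer reads its bytes as Latin-1. It assumes Ruby 1.9 or later (String#force_encoding does not exist in 1.8.7), and both the sample strings and the `misread_as_latin1` helper are made up for illustration rather than taken from the scraped page.

```ruby
# What a correct decode of "Entr&eacute;e" and "&nbsp;" yields, as UTF-8 strings.
decoded = "Entr\u00E9e"
nbsp    = "\u00A0"

# Hypothetical helper simulating a Latin-1 viewer: keep the same bytes but
# label them ISO-8859-1, then transcode to UTF-8 so the result can be printed.
def misread_as_latin1(str)
  str.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

# The accented letter occupies two bytes (C3 A9) in UTF-8.
puts decoded.bytes.map { |b| format("%02X", b) }.join(" ")  # 45 6E 74 72 C3 A9 65

# A Latin-1 viewer shows each of those two bytes as its own character.
puts misread_as_latin1(decoded)                             # EntrÃ©e

# The non-breaking space (C2 A0) picks up a stray leading character the same way.
puts misread_as_latin1(nbsp).length                         # 2
```

The decoded data is identical in both cases; only the interpretation of the bytes differs, which is why the output can look right in one program and wrong in another.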

In what environment are you running Ruby?

The reason '&' displays correctly is that it's an ASCII character and thus displays the same in most encodings. The problem is that a bare '&' is not supposed to appear in an XML document and could cause problems later when you feed your decoded file to Hpricot. I believe the proper way would be to parse with Hpricot first and then pass what you extract from the document to HTMLEntities.
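
A rough sketch of why the order matters: decoding the raw markup can turn escaped text into live tags and bare ampersands, while decoding only the extracted text leaves the markup alone. To stay self-contained it uses the standard library's CGI.unescapeHTML and a regular expression as a stand-in for pulling a cell out with Hpricot; the sample row and the `cell_text` helper are hypothetical.

```ruby
require 'cgi'

# Hypothetical row whose text contains an escaped ampersand and angle brackets.
raw = '<td>Fish &amp; Chips &lt;spicy&gt;</td>'

# Stand-in for parser-based extraction (e.g. with Hpricot): return the
# inner text of the first <td> element.
def cell_text(html)
  html[/<td[^>]*>(.*?)<\/td>/m, 1]
end

# Decoding the whole document first: the result now contains a bare '&'
# and something that looks like a <spicy> tag to any parser run afterwards.
puts CGI.unescapeHTML(raw)             # <td>Fish & Chips <spicy></td>

# Extracting first, then decoding: only the text is changed.
puts CGI.unescapeHTML(cell_text(raw))  # Fish & Chips <spicy>
```

CGI.unescapeHTML only knows a handful of named entities, which is why the &nbsp; test in the question printed unchanged; in the real scraper HTMLEntities would take its place for the decoding step.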

You were exactly correct about the encoding problem. I finally realized that the problems arise when I open the file in xemacs, but they do not appear when I just run 'more' on the file and print it in the terminal. I guess xemacs is just not set up to read the UTF-8 file, because when I switched to gedit for kicks the problems did not arise either. Thanks!
conorgil
I don't use xemacs, but I'd think a reasonably recent version would know about UTF-8. For reference, in Emacs the command I would use is revert-buffer-with-coding-system, with the keyboard shortcut 'ctrl+x <return> r utf-8'.