ansaurus

Question

Ruby HTML scraper written in Hpricot having trouble with escaped HTML

Answer 1

A:

HTMLEntities seems to work, but you have an encoding problem. The terminal you're printing to is probably set up for a Latin charset and barfs on the UTF-8 characters output by your script.

In what environment are you running Ruby?
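To illustrate the kind of mismatch described above, here is a minimal sketch using only core Ruby. The sample string "café" is an assumption chosen for illustration, not taken from the scraped page. It shows what UTF-8 output looks like when the displaying program assumes ISO-8859-1 (Latin-1):

```ruby
# A sample string (illustrative only), stored as UTF-8.
utf8 = "caf\u00e9"                 # "café"
utf8_bytes = utf8.bytes            # "é" takes two bytes in UTF-8: 0xC3 0xA9

# Reinterpret the very same bytes as ISO-8859-1, which is what a
# Latin-1 terminal or editor effectively does, then convert the result
# to UTF-8 so the misreading can be shown.
misread = utf8.dup.force_encoding("ISO-8859-1").encode("UTF-8")

puts utf8      # the intended text
puts misread   # each byte of "é" is shown as a separate Latin-1 character
```

The data itself is intact in both cases; only the interpretation of the bytes differs, which is why the same file can look fine in one program and garbled in another.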

The reason '&' displays correctly is that it is an ASCII character and therefore looks the same in most encodings. The problem is that a bare '&' is not supposed to appear on its own in an XML document, so it could cause trouble later when you feed your decoded file to Hpricot. I believe the proper way would be to parse with Hpricot first and then pass what you extract from the document to HTMLEntities.
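The ordering issue can be sketched with the standard library alone. Note the substitutions: `CGI.unescapeHTML` stands in for the HTMLEntities decoder, a crude regular expression stands in for Hpricot's extraction, and the sample markup is made up for illustration. In a real scraper you would parse with Hpricot and decode the extracted strings with HTMLEntities instead.

```ruby
require "cgi"

# Hypothetical sample markup containing escaped characters.
raw = "<p>Tom &amp; Jerry &lt;3</p>"

# Decoding the whole document first leaves a bare '&' and a stray '<'
# inside the markup, which a parser may then misinterpret.
decoded_first = CGI.unescapeHTML(raw)

# Extracting first (regex used here only as a stand-in for a real parser)
# and decoding just the extracted text keeps the markup well-formed.
inner = raw[%r{<p>(.*?)</p>}m, 1]
text = CGI.unescapeHTML(inner)

puts decoded_first
puts text
```

Decoding only the extracted text means the parser always sees the original, properly escaped document.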

2010-05-11 06:12:43

You were exactly correct about the encoding problem. I finally realized that the problems arise when I open the file in xemacs, but they do not appear when I just run 'more' on the file and print it in the terminal. I guess xemacs is just not set up to read UTF-8 files, because when I switched to gedit for kicks the problems did not arise there either. Thanks!

conorgil 2010-05-12 03:07:53

I don't use xemacs, but I'd think a reasonably recent version would know about UTF-8. For reference, in Emacs the command I would use is revert-buffer-with-coding-system, with the keyboard shortcut 'ctrl+x <return> r utf-8'.

2010-05-12 05:34:21
