I have been successfully screen-scraping certain sites but have come across some very odd behavior with Nokogiri today on a certain site.

If I compare the HTML that Nokogiri pulls down with the actual HTML source from the site, it is truncated on certain pages. Some pages work just fine and all the data is there, while others come back truncated.

www.bento.com/revj/0172.html (Doesn't work - truncated HTML returned by Nokogiri)
www.bento.com/revj/0101.html (Works great)

require 'nokogiri'
require 'open-uri'
scraped_jpage = Nokogiri::HTML(open(page_to_scrape))
puts scraped_jpage

I have tried all sorts of different code and changed the encoding (UTF-8, SHIFT_JIS, etc.), but I cannot see any reason why Nokogiri truncates the returned HTML.
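
A minimal sketch of one check that might help narrow this down, assuming the truncation is tied to bytes that are invalid in the encoding being used (this is unconfirmed). `first_invalid_byte_offset` is a hypothetical helper, not part of Nokogiri, and the sample string only stands in for the real page bytes:

```ruby
# Hypothetical helper: returns the byte offset of the first byte sequence
# that is invalid in the given encoding, or nil if the whole string is valid.
def first_invalid_byte_offset(raw, encoding)
  # Work on a copy so the caller's string keeps its original encoding tag.
  text = raw.dup.force_encoding(encoding)
  return nil if text.valid_encoding?

  offset = 0
  text.each_char do |ch|
    # Broken sequences are yielded as characters that fail valid_encoding?.
    return offset unless ch.valid_encoding?
    offset += ch.bytesize
  end
  nil
end

# Stand-in for the raw page body: "\xFF" is never valid in UTF-8.
sample = "abc\xFFdef".b

%w[UTF-8 Shift_JIS].each do |enc|
  puts "#{enc}: first invalid byte at #{first_invalid_byte_offset(sample, enc).inspect}"
end
```

In the real script the raw bytes would come from reading the response body before it is handed to Nokogiri; if the reported offset lines up with the point where the parsed HTML stops, that would suggest an encoding problem in the page rather than a parser bug.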

The English versions of these pages all work perfectly.

www.bento.com/rev/0172.html
www.bento.com/rev/0101.html

Thanks for any help - hopefully it's something obvious I have missed and not a bug.