Hi!

I have a small crawler/screen-scraping script that worked half a year ago but no longer does. I checked the HTML and CSS values that the regular expression targets in the page source, and they still look the same, so from that point of view it should work. Any guesses?

require "open-uri"

# output file
f = open 'results.csv', 'w+'

# output string
results = ""

begin

  # crawl first 20 pages
  for i in (1..20)
    open("http://www.my-hammer.de/search.php?mhFormData[allCategories]=1&mhFormData[rangeAll]=1&mhFormData[priceRangeEnd]=999999999&mhFormData[refineSearch]=1&mhFormData[searchText]=&mhFormData[searchZipcode]=&mhFormData[searchZipcodeCircumcircle]=50&mhFormData[priceRangeStart]=1&mhFormData[categories][0]=45&page=" + i.to_s) {|url|

      # check each line using regular expression
      url.each_line { |line|
        if line =~ /class=\"L1g\" onclick=\"s_objectID=\'ShowAuction_from_AuctionTitle\'\">([^<]+)<\/a><\/h3><\/li>/
          # if regular expression matches then add to results
          results += $1 + "\n"
        end
      }
    }
  end
ensure
  # write to and close file
  f.print results
  f.close
end
A: 

The target website appears to have changed the structure of its pages, so your regex no longer matches.
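To illustrate how little it takes, here is a minimal sketch. The pattern is the one from the question; both sample lines are hypothetical (the site's actual markup before and after the change is not reproduced here), and they differ only in the order of the attributes on the link.

```ruby
# The pattern from the question, written without the redundant escapes.
OLD_PATTERN = /class="L1g" onclick="s_objectID='ShowAuction_from_AuctionTitle'">([^<]+)<\/a><\/h3><\/li>/

# Hypothetical line in the shape the pattern was written for.
OLD_LINE = %q{<li><h3><a href="/auction/1" class="L1g" onclick="s_objectID='ShowAuction_from_AuctionTitle'">Badezimmer fliesen</a></h3></li>}

# Hypothetical line after a purely cosmetic change: same attributes, different order.
CHANGED_LINE = %q{<li><h3><a onclick="s_objectID='ShowAuction_from_AuctionTitle'" class="L1g" href="/auction/1">Badezimmer fliesen</a></h3></li>}

p OLD_LINE[OLD_PATTERN, 1]      # => "Badezimmer fliesen"
p CHANGED_LINE[OLD_PATTERN, 1]  # => nil
```

A browser renders both lines identically, yet the second one silently yields no result, which matches the symptom of an empty results.csv.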

This is a good example of why you should not scrape pages using Regex to match content. Try reworking your script using a DOM parser like Nokogiri. This will not necessarily stop your script from breaking but will at least allow it to survive minor changes.

The reason it is not working can be seen in this Rubular link

Steve Weet
Obligatory HTML and regular expressions link: [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)
Andrew Grimm
Thanks for the answer! Will try to fix that with Nokogiri...
belehe
@Andrew Grimm: the regular expressions link is awesome...
belehe
A: 

Another option for web scraping is iMacros. Its scripts are very easy to adapt when a site changes.

FrankJK