views:

103

answers:

0

Hi all:

I know there must be ways to do this sort of things. I am not pro in RoR or Python, not even an expert in PHP. So my solution tends to be quite dumb: It uses a FireFox add-on called imarcos to scrape the target urls from Google SERP, and use PHP to store info into the database.

At the very core of my workaround there lies a problem: How to specifically find target urls based on their copyright year? I mean, something like "copyright 1998-2006" in the footer is to be considered a target, but my search results are not 100% accurate.

I used the following url to search :

http://www.google.com.au/#hl=en&q=inurl:.com.au+intext:copyright+1995..2007+--2008+--2009&start=0&cad=b&fp=6a8119b094529f00

It reads : search for pages that have .com.au in URL and a copyright range from 1995 to 2007 exclude the year of 2008 or 2009. Starting position is 0, of course the offset can be changed.

I've already done a dummy list and honestly I am not pleased with the result. That's mostly because I cannot find a way to restrict search terms in the exact order as they are entered into the search url. copyright can appear in anywhere on page and doesn't necessarily before the years, that's the current story.

Is there a more clear way to sort out this? Oh, almost forgot to say the client doesn't wanna spend too much in this - I cannot persuade him simply buy some cool software, unfortunately. I hope there is a way to use clever Google search operators or similar things to go around this issue.

Many thanks in advance!