Hi all, this is a somewhat open-ended question, as I'm mostly looking for opinions. I'm grabbing apartment ads for my area from Craigslist because I'm looking to move. My goal is to detect when an ad is a duplicate so that I don't spend all day looking at the same three ads. The problem is that posters change things around a little each time to get past CL's filters.
I already have some regexes that pull out addresses and phone numbers to compare, but that isn't very reliable. Is anyone familiar with an easy-ish method for comparing whole documents and reporting something simple like "80% similar"? I can't think of anything offhand, so I suspect I'll have to build my own solution from scratch, but I figured it was worth asking the collective genius of Stack Overflow :)
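To make concrete what I mean by "80% similar", here is a rough sketch of the kind of output I'm after. It leans on `difflib.SequenceMatcher` from Python's standard library, which I'm only assuming is a reasonable starting point; the `normalize` and `similarity` helpers and the sample ads are made up for illustration.

```python
import difflib
import re


def normalize(text):
    """Lowercase and collapse punctuation/whitespace so trivial edits don't count."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def similarity(a, b):
    """Return a ratio between 0.0 and 1.0 for two ad bodies (hypothetical helper)."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()


# Made-up sample ads: the second is a lightly disguised repost of the first.
ad_one = "Sunny 2BR apartment near the beach, $2000/mo. Call 555-1234!"
ad_two = "SUNNY 2 BR apartment near the beach -- $2000 / mo. Call 555 1234"
ad_three = "Garage sale this Saturday, lots of furniture and tools."

print(f"{similarity(ad_one, ad_two):.0%} similar")
print(f"{similarity(ad_one, ad_three):.0%} similar")
```

This compares one pair at a time, so on its own it says nothing about how to check a new ad against a whole database.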
Preferred languages would be Python, PHP, or Perl, but if it's a great solution I'm pretty open.
Update: one thing worth noting is that I'll be storing the data scraped from the RSS feed for apartments in my area (Los Angeles) in a local DB, so the preferred method would include a way to compare each new post against everything I already have. This could be a bit of a showstopper, since comparing against every stored post could become a very long process as the post count grows.
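One direction I've been toying with for the "compare against everything" problem, in case it helps frame answers: break each ad into overlapping word triples ("shingles"), keep an index from shingle to ad IDs, and only score a new ad against stored ads that share at least one shingle, using Jaccard similarity. This is just a sketch under my own assumptions. `AdIndex`, `shingles`, `jaccard`, and the 0.5 threshold are all hypothetical, and everything lives in memory here rather than in the DB.

```python
import re
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of k-word shingles in the text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class AdIndex:
    """In-memory stand-in for the DB: maps each shingle to the ads containing it."""

    def __init__(self, k=3):
        self.k = k
        self.ad_shingles = {}               # ad_id -> set of shingles
        self.by_shingle = defaultdict(set)  # shingle -> set of ad_ids

    def add(self, ad_id, text):
        found = shingles(text, self.k)
        self.ad_shingles[ad_id] = found
        for shingle in found:
            self.by_shingle[shingle].add(ad_id)

    def similar(self, text, threshold=0.5):
        """Return (ad_id, score) pairs at or above threshold, best match first."""
        query = shingles(text, self.k)
        # Only ads sharing at least one shingle with the query are candidates.
        candidates = set()
        for shingle in query:
            candidates |= self.by_shingle.get(shingle, set())
        scored = [(ad_id, jaccard(query, self.ad_shingles[ad_id])) for ad_id in candidates]
        return sorted((p for p in scored if p[1] >= threshold),
                      key=lambda p: p[1], reverse=True)


index = AdIndex()
index.add(1, "Sunny two bedroom apartment near the beach with parking and laundry")
index.add(2, "Garage sale this Saturday lots of furniture and tools")

matches = index.similar("Sunny two bedroom apartment near the beach with parking and a pool")
print(matches)
```

The idea is that unrelated ads never get scored at all, but I don't know how well this holds up once the index has thousands of posts that all share boilerplate phrases.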