Hi all, somewhat open-ended question here, as I am mostly looking for opinions. I am grabbing some data from Craigslist apartment ads in my area since I am looking to move. My goal is to be able to compare listings to see when something is a duplicate, so that I don't spend all day looking at the same 3 ads. The problem is that posters change things around a little to get past CL's filters.

I already have some regex to pull out addresses and phone numbers to compare, but that isn't the most reliable. Is anyone familiar with an easy-ish method to compare whole documents and report something simple like "80% similar"? I can't think of anything offhand, so I suspect I'll have to build my own solution from scratch, but figured it would be worth asking the collective genius of Stack Overflow :)

Preferred languages/methods would be Python/PHP/Perl, but if it's a great solution I'm pretty open.

Update: one thing worth noting is that since I will be storing the scraped data from the apartment RSS feed for my area (Los Angeles) in a local DB, the preferred method would include a way to compare each new post to everything I currently know. This could be a bit of a showstopper, since that comparison could become a very long process as the post count grows.

+1  A: 

There are a few fairly complex projects for finding duplicated text. One of them is Simian. Take a look at it.

nailxx
That is a very cool project, thank you for sharing! My only concern (and perhaps one that I need to update the post with) is that because I will essentially be scraping the RSS feeds, I need a method that will let me compare data stored in the local MySQL DB (a scrape of the body contents of each post). Since I'm just looking at my area of the city, it would be possible to compare anything new that comes in against everything still in the DB, but at a certain point this will become computationally expensive, particularly in a city as large as Los Angeles.
nick
A: 

You could use xdiff. There is an xdiff PECL extension for PHP available.

Or use similar_text to calculate the similarity between two strings; its optional third argument even reports the similarity as a percentage.

Gordon
+1  A: 

You can use difflib to calculate differences in Python directly.

Edit: you could consider creating a normalized "hash" of the content to reduce the amount of text that needs to be "diffed". For example, remove all whitespace, punctuation, tags, etc. and just compare the actual content.
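
For what it's worth, a minimal sketch of that combination (normalize first, then let SequenceMatcher produce a ratio between 0 and 1; the tag-stripping regex and the exact normalization steps are just illustrative choices, not part of this answer originally):

    import re
    from difflib import SequenceMatcher

    def normalize(text):
        text = re.sub(r'<[^>]+>', ' ', text)          # crude tag removal
        text = re.sub(r'[^\w\s]', ' ', text.lower())  # drop punctuation
        return ' '.join(text.split())                 # collapse whitespace

    def similarity(a, b):
        # Returns a float in [0, 1]; 0.80 reads as "80% similar".
        return SequenceMatcher(None, normalize(a), normalize(b)).ratio()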

Aaron Harun
+2  A: 

You could calculate the Levenshtein distance between the two strings, after some sane normalization like collapsing duplicate whitespace and whatnot. After you run through enough "duplicates" you should get an idea of what your threshold is; then you can run Levenshtein on all new incoming data, and if it's less than or equal to your threshold you can consider it a duplicate.
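
To make that concrete, here is a self-contained sketch (the textbook dynamic-programming edit distance; the threshold of 50 is purely a placeholder you would tune against posts you already know are duplicates):

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance, one row of memory.
        if len(a) < len(b):
            a, b = b, a
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            current = [i]
            for j, cb in enumerate(b, 1):
                current.append(min(previous[j] + 1,                # deletion
                                   current[j - 1] + 1,             # insertion
                                   previous[j - 1] + (ca != cb)))  # substitution
            previous = current
        return previous[-1]

    def is_duplicate(a, b, threshold=50):
        # Collapse duplicate whitespace first, then compare the edit
        # distance against the learned threshold.
        a, b = ' '.join(a.split()), ' '.join(b.split())
        return levenshtein(a, b) <= threshold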

Cody Caughlan
A: 

If you wanted to do this a lot and with some reliability, you might want to use a semi-advanced approach like a "bag of words" technique. I actually sat down and wrote a sketch of a more-or-less working (if horribly unoptimized) algorithm to do it, but I'm not sure if it would really be appropriate to include here. There are pre-made libraries that you can use for text classification instead.
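
For the curious, a minimal sketch of the bag-of-words idea (not hobbs's original algorithm): count word frequencies in each post and compare the resulting vectors with cosine similarity.

    import math
    import re
    from collections import Counter

    def bag_of_words(text):
        # Lowercase word counts; Counter returns 0 for missing words.
        return Counter(re.findall(r'\w+', text.lower()))

    def cosine_similarity(a, b):
        # 1.0 = identical word distributions, 0.0 = no words in common.
        wa, wb = bag_of_words(a), bag_of_words(b)
        dot = sum(wa[w] * wb[w] for w in wa.keys() & wb.keys())
        norm = (math.sqrt(sum(c * c for c in wa.values()))
                * math.sqrt(sum(c * c for c in wb.values())))
        return dot / norm if norm else 0.0

A nice side effect for the DB-scaling concern in the question: the word counts can be computed once per post and stored, so each new ad only needs the cheap dot product against the existing vectors.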

hobbs