I'm doing some research work into content aggregators, and I'm curious how some of the current Craigslist aggregators get data into their mashups.

For example, www.housingmaps.com and the now-closed www.chicagocrime.org.

If there is a URL that can be used for reference, that would be perfect!

+2  A: 

I'm guessing screen scraping.

I don't think there is a Craigslist API yet, and I don't think they will release one.

So the only way to go is to scrape the data. You could use the cURL library and heavy regex work to pull the data you want off a page.

If you see a link, access that page, scrape it, and show or store the data.

And so on.
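The fetch, scrape, follow-links loop described above might look roughly like this. It is a minimal Python sketch rather than the cURL version the answer has in mind; `extract_links`, `crawl`, and the `fetch` callable are hypothetical names, and the regex is a rough pattern, not a real HTML parser.

```python
import re
from urllib.parse import urljoin

# Matches href values in anchor tags; a rough pattern, not a full HTML parser.
LINK_RE = re.compile(r'<a\s[^>]*href="([^"]+)"', re.IGNORECASE)


def extract_links(html, base_url):
    """Return absolute URLs for every <a href="..."> found in html."""
    return [urljoin(base_url, href) for href in LINK_RE.findall(html)]


def crawl(start_url, fetch, max_pages=10):
    """Fetch start_url, then follow links breadth-first up to max_pages.

    fetch is any callable that takes a URL and returns the page HTML.
    Returns a dict mapping each visited URL to its HTML.
    """
    seen, queue, pages = set(), [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        pages[url] = html  # store the page; parse out the fields you need here
        queue.extend(extract_links(html, url))
    return pages
```

In practice `fetch` could wrap `urllib.request.urlopen`. Passing it in as a parameter keeps the crawl logic testable without touching the network.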

Wael Awada
+6  A: 

For AdRavage.com I use a combination of Magpie RSS (to extract the data returned from searches) and a custom screen scraping class to properly populate the city/category information used when building searches.

For example, to extract the categories you could:

// Scrape category data.
// http is presumably the custom fetch class mentioned above (not shown here).
$h = new http();
$h->dir = "../cache/";
$url = "http://craigslist.org/";

if (!$h->fetch($url, 300)) {
  echo "<h2>There is a problem with the http request!</h2>";
  exit();
}

// We need all category entries (data looks like: <option value="ccc">community)
// Group 1 is the abbreviation ("ccc"), group 2 is the name ("community").
// [^\"]* keeps group 1 from running past the closing quote.
preg_match_all("/<option value=\"([^\"]*)\">([^`]*?)\n/", $h->body, $categoryTemp);

$catNames = $categoryTemp[2];

// Return the array of category names (use $categoryTemp[1] for the abbreviations)
if (count($catNames) > 0)
    return $catNames;
else
    return array();
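The RSS half of this approach (what Magpie RSS handles in PHP) can be sketched with Python's standard library. This is only an illustration: the sample feed below is made up, `parse_items` is a hypothetical helper, and a real search feed may use a different layout or XML namespaces.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 sample; a real search feed may differ in layout or namespaces.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>sample search</title>
    <item>
      <title>2br apartment</title>
      <link>http://example.com/post/1.html</link>
    </item>
    <item>
      <title>studio near park</title>
      <link>http://example.com/post/2.html</link>
    </item>
  </channel>
</rss>"""


def parse_items(feed_xml):
    """Return a list of (title, link) tuples, one per <item> in the feed."""
    root = ET.fromstring(feed_xml)
    return [
        (item.findtext("title", ""), item.findtext("link", ""))
        for item in root.iter("item")
    ]
```

Calling `parse_items(SAMPLE_FEED)` gives one tuple per listing, which can then be stored or displayed.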
cfay
Supremely excellent answer!
pearcewg
A: 

While continuing to research this area, I found an awesome site that does partly what I'm interested in:

Crazedlist

It uses the HTTP Referer of the client browser, which is interesting but not ideal. The author of the site also claims to have royally ticked off CL, which I understand. It also gives clear examples of business needs that are similar to mine, which is why I'm interested in this topic.

pearcewg
A: 

I use screen scraping software written in Java that uses regular expressions to parse out the data.

A: 

I just made one:

http://cdn.javascriptmvc.com/videos/jobs/craigslist.js

That produces:

http://cdn.javascriptmvc.com/videos/jobs/craigslist.html

Must be run in rhino.

Justin Meyer
A: 

The problem with any scraping solution for Craigslist is that they automatically block any IP address that accesses them 'too much', which usually means more than a few hundred times a day. So as soon as your tool got any kind of popularity, it would be shut down.
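If you scrape anyway, one way to stay under a limit like that is to budget requests on your side. Here is a minimal Python sketch; `DailyBudget` is a hypothetical helper, and the default of 200 per day is only a guess based on "a few hundred", since the real threshold is not documented here.

```python
import time


class DailyBudget:
    """Allow at most `limit` requests per rolling window (default 24 hours)."""

    def __init__(self, limit=200, window=86400, clock=time.time):
        self.limit = limit      # assumed cap; the real blocking threshold is unknown
        self.window = window    # window length in seconds
        self.clock = clock      # injectable clock, handy for testing
        self.stamps = []        # times of requests still inside the window

    def allow(self):
        """Record a request and return True if the budget permits it, else False."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        self.stamps = [t for t in self.stamps if now - t < self.window]
        if len(self.stamps) >= self.limit:
            return False
        self.stamps.append(now)
        return True
```

Call `allow()` before each request and skip or delay the fetch when it returns False. This only limits your own server's traffic, so it does not help once many users drive requests through the same IP.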

That's why the only Craigslist search sites that have lasted either use frames (like searchtempest.com and crazedlist.org) or Google (like allofcraigs.com).

Nathan
A: 

The alternative option would be to use YQL or Yahoo Pipes to gather the results.

Craiglook and HousingMaps use them to gather results.

Rory