views: 85
answers: 2

I want to build a search service for one particular type of thing. The data is freely available out there, via free classified services and a host of other sites.

Are there any building blocks, e.g. open-source crawlers, that I could customize rather than build from scratch?

Any advice on building such a product? Not just technical, but also any privacy/legal issues I might need to take into consideration.

E.g., do I need to 'give credit' for where the results come from and link back to the original, if I'm getting them from many places?

Edit: By the way, I am using GWT with JS for the front-end; I haven't decided on the language for the back-end yet, either PHP or Python. Thoughts?

A: 

I made a screen-scraper in Ruby that took like five minutes to write. Apparently this dude has it down to 60 seconds! I'm not sure if Ruby is as scalable or fast as what you're looking for, but I've never seen a faster route to a proof of concept or a prototype.

The secret is a library called Hpricot, which was built for exactly this purpose.

I don't know anything about PHP or Python or what's available for those development systems/languages.

Good luck!

Chris McCall
So I guess the notion is that I would be creating a 'screen-scraper', parsing through the HTML code and taking out the useful info, then dumping that into a db? Is that the general process?
marcamillion
Yeah that's the idea.
Chris McCall
+2  A: 

There are a few building blocks in Python you can use.

  1. BeautifulSoup [http://www.crummy.com/software/BeautifulSoup/] for parsing HTML. It can handle broken markup too, and its API is very easy to use; for me it beats any DOM-like tool. A friend of mine used it to scrape his old phpBB forum successfully. It has pretty good docs.
  2. mechanize [http://wwwsearch.sourceforge.net/mechanize/] is an HTTP client library that simulates a web browser. It handles cookies, form filling, and so on. Also easy to use, though it helps if you understand how HTTP works.
  3. Scrapy [http://dev.scrapy.org/] -- this is a relatively new, complete scraping framework built on Twisted. I haven't played with it much.

I use the first two for my needs; for example, it took about 20 lines of code to build an automated testing tool for a three-stage poll, including simulating a user pausing to enter data and so on.
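The parse-and-extract step this answer is getting at can be sketched even without a third-party library, using Python's built-in HTMLParser. BeautifulSoup's API is friendlier, but the idea is the same; the sample markup and field names here are invented for illustration.

```python
from html.parser import HTMLParser

# Invented sample markup standing in for a classified-listings page.
PAGE = """
<div class="listing"><span class="title">Red bicycle</span>
<span class="price">$40</span></div>
<div class="listing"><span class="title">Oak desk</span>
<span class="price">$120</span></div>
"""

class ListingParser(HTMLParser):
    """Collects {'title': ..., 'price': ...} dicts from .listing divs."""
    def __init__(self):
        super().__init__()
        self.rows = []      # extracted records, ready for a db INSERT
        self.field = None   # which field the next text chunk belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "listing":
            self.rows.append({})           # start a new record
        elif tag == "span" and attrs.get("class") in ("title", "price"):
            self.field = attrs["class"]    # remember which field to fill

    def handle_data(self, data):
        if self.field and self.rows:
            self.rows[-1][self.field] = data.strip()
            self.field = None

parser = ListingParser()
parser.feed(PAGE)
print(parser.rows)
# [{'title': 'Red bicycle', 'price': '$40'},
#  {'title': 'Oak desk', 'price': '$120'}]
```

With BeautifulSoup the same extraction collapses to a `find_all` loop, and it copes with the malformed HTML that real classified sites tend to serve.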

liori
So I guess the notion is that I would be creating a 'screen-scraper' and parsing through the HTML code and taking out the useful info, then dumping that into a db? Is that the general process?
marcamillion
For me it was generic enough... the only limitation I see is that there is neither a JavaScript nor a Flash engine to fully simulate a web browser. You can add JS support with the SpiderMonkey bindings, though; I never needed that.
liori
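To close the loop on the scrape-parse-"dump into a db" process asked about in the comments, here is a minimal sketch of the storage step using Python's built-in sqlite3 module. The table layout, column names, and example records are made up; keeping the source URL per row also addresses the attribution question, since you can always link back to the original.

```python
import sqlite3

# Records as a scraper might produce them (invented example data).
rows = [
    {"title": "Red bicycle", "price": "$40", "source_url": "http://example.com/1"},
    {"title": "Oak desk", "price": "$120", "source_url": "http://example.com/2"},
]

conn = sqlite3.connect(":memory:")  # use a file path for a real service
conn.execute("""
    CREATE TABLE IF NOT EXISTS listings (
        id INTEGER PRIMARY KEY,
        title TEXT,
        price TEXT,
        source_url TEXT UNIQUE  -- lets re-crawls skip already-seen items
    )
""")
conn.executemany(
    "INSERT OR IGNORE INTO listings (title, price, source_url) VALUES (?, ?, ?)",
    [(r["title"], r["price"], r["source_url"]) for r in rows],
)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM listings").fetchone()[0]
print(count)  # 2
```

The `UNIQUE` constraint plus `INSERT OR IGNORE` is one simple way to make repeated crawls idempotent; a production service would likely add timestamps and a full-text index on top of this.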