Hi, I came across an open source crawler called Bixo. Has anyone tried it? Could you please share what you learned? Can a directed crawler be built with it more easily than with Nutch or Heritrix? Thanks, Nayn

+2  A: 

I used Bixo in production at a large social networking site (100M page views/day) for user content classification (basically any user-generated content with a link in it).

It was a fairly complex workflow using Cascading to

  • dedupe URLs,
  • make Bixo retrieve the page content,
  • push the page content through classifiers and
  • trigger account revocations for spammy accounts, run spam reports, etc.

If you know Cascading, Bixo works like any other Cascading component: it essentially expects URLs as input and emits a bunch of page-related information as output.
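To make the shape of that workflow concrete, here is a minimal sketch of the four stages as plain Python generators. It does not use the Bixo or Cascading APIs (both are Java); every name in it (`Page`, `dedupe`, `fetch_pages`, `classify`, `run_workflow`, the `fetcher` and `is_spam` callbacks, and the example URLs) is hypothetical and only illustrates how "URLs in, page information out" lets the crawl step sit in the middle of a larger pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Optional, Tuple


@dataclass
class Page:
    """Hypothetical record for what a fetch step emits."""
    url: str
    content: str


def dedupe(urls: Iterable[str]) -> Iterator[str]:
    """Yield each non-empty URL once, in first-seen order (exact match after trimming)."""
    seen = set()
    for url in urls:
        key = url.strip()
        if key and key not in seen:
            seen.add(key)
            yield key


def fetch_pages(urls: Iterable[str],
                fetcher: Callable[[str], Optional[str]]) -> Iterator[Page]:
    """Stand-in for the crawl step: URLs in, pages out. Failed fetches are skipped."""
    for url in urls:
        content = fetcher(url)
        if content is not None:
            yield Page(url, content)


def classify(pages: Iterable[Page],
             is_spam: Callable[[str], bool]) -> Iterator[Tuple[str, bool]]:
    """Run a caller-supplied classifier over each fetched page."""
    for page in pages:
        yield page.url, is_spam(page.content)


def run_workflow(urls: Iterable[str],
                 fetcher: Callable[[str], Optional[str]],
                 is_spam: Callable[[str], bool]) -> List[str]:
    """Chain the stages and return the URLs flagged as spam."""
    results = classify(fetch_pages(dedupe(urls), fetcher), is_spam)
    return [url for url, spam in results if spam]


if __name__ == "__main__":
    # A dictionary plays the role of the web so the sketch runs offline.
    fake_web = {
        "http://a.example/": "buy cheap pills",
        "http://b.example/": "holiday photos",
    }
    urls = ["http://a.example/", "http://b.example/",
            "http://a.example/", "http://missing.example/"]
    print(run_workflow(urls, fake_web.get, lambda text: "cheap pills" in text))
```

In a real deployment the last stage would feed account revocations and spam reports rather than returning a list, and the fetch step is where a component such as Bixo would plug in.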

One thing I underestimated in the beginning is that, for a lot of vertical crawlers, the crawling aspect is "only" one small piece of the puzzle. The entire workflow around it can become very complex, and if you go with another, isolated crawler product, you need to find a way to integrate it. Because Bixo uses Cascading, it becomes just another input to your workflow.

Bixo itself seems to be very solid. Ken Krugler (lead dev) is super responsive and was able to fix some hanging issues I had in the beginning within a day (my dataset contained lots of "dirty" URLs). He has a very comprehensive automated test suite making sure Bixo works as designed.

Overall I can't recommend it highly enough. I built the entire system myself in 6-9 months, and I don't think I could have done that without Bixo in that timeframe.

Erich Nachbar
Thanks Erich for the info. Could you point me to some sample code to get started with? Ken mentioned writing a tutorial, but it does not exist yet.
Nayn
Welcome! I got started by looking at the code for the sample crawler (http://bit.ly/bixoSample), reading through group postings and asking questions. But I agree, a tutorial would help get people started.
Erich Nachbar