I'm trying to do three things.

One: crawl and archive, at least daily, a predefined set of sites.

Two: run overnight batch python scripts on this data (text classification).

Three: expose a Django based front end to users to let them search the crawled data.

I've been playing with Apache Nutch/Lucene but getting it to play nice with Django just seems too difficult when I could just use another crawler engine.

Question 950790 suggests I could just write the crawler in Django itself, but I'm not sure how to go about this.

Basically - any pointers to writing a crawler in Django or an existing python crawler that I could adapt? Or should I incorporate 'turning into Django-friendly stuff' in step two and write some glue code? Or, finally, should I abandon Django altogether? I really need something that can search quickly from the front end, though.

+1  A: 

If you insert your Django project's app directories into sys.path, you can write standard Python scripts that use the Django ORM. We have an /admin/ directory containing scripts that perform various tasks; at the top of each script is a block that looks like:

import os, sys

# Make the project directories importable and point Django at the settings module.
sys.path.insert(0, os.path.abspath('../my_django_project'))
sys.path.insert(0, os.path.abspath('../'))
sys.path.insert(0, os.path.abspath('../../'))
os.environ['DJANGO_SETTINGS_MODULE'] = 'settings'

Then it's just a matter of using your tool of choice to crawl the web and using the Django database API to store the data.
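For instance, a minimal sketch of the storage side (CrawledPage and myapp are hypothetical names, not anything from the answer above):

from myapp.models import CrawledPage  # hypothetical model for stored pages

def store_page(url, html):
    # Create or update the row for this URL and save the fetched HTML.
    page, created = CrawledPage.objects.get_or_create(url=url)
    page.html = html
    page.save()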

pjbeardsley
+1  A: 

If you don't want to write the crawler using the Django ORM (or already have a working crawler), you can share the database between the crawler and the Django-powered front end.

To be able to search (and edit) the existing database through the Django admin, you need to create Django models for it. An easy way to do that is described here:

http://docs.djangoproject.com/en/dev/howto/legacy-databases/
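In short, that howto uses the inspectdb management command, which introspects the existing tables and generates model stubs you can then clean up. A rough sketch of what you end up with (the table and field names below are hypothetical):

# Run once against the crawler's database, then tidy the output:
#     python manage.py inspectdb > myapp/models.py
from django.db import models

class CrawledPage(models.Model):
    url = models.CharField(max_length=255)
    html = models.TextField()
    fetched_at = models.DateTimeField()

    class Meta:
        db_table = 'crawled_page'  # map onto the existing table
        managed = False            # don't let Django create or drop it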

Mike Korobov
+2  A: 

You could write your own crawler using urllib2 to fetch the pages and Beautiful Soup to parse the HTML for the content you want.

Here's an example of reading a page:

http://docs.python.org/library/urllib2.html#examples

Here's an example of parsing the page:

http://www.crummy.com/software/BeautifulSoup/documentation.html#Parsing HTML
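Put together, a minimal sketch of that approach (using the old urllib2 and BeautifulSoup 3 APIs the links above describe) might look like:

import urllib2
from BeautifulSoup import BeautifulSoup

def fetch_and_parse(url):
    # Fetch the raw HTML, then let BeautifulSoup build a parse tree.
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html)
    # As an example, pull out the title and all outgoing links.
    title = soup.title.string if soup.title else None
    links = [a['href'] for a in soup.findAll('a', href=True)]
    return title, links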

S.Lott
In my experience, lxml (http://codespeak.net/lxml/) is a lot faster than BeautifulSoup. I don't have benchmarks to back that up right now, though.
drdaeman
@drdaeman: I don't have experience with lxml, but BeautifulSoup's strong point is its error tolerance, since web pages famously contain errors.
muhuk
In my experience, lxml handles malformed HTML pretty well. And if something goes really wrong, it can use BeautifulSoup as a parser (http://codespeak.net/lxml/elementsoup.html).
drdaeman
@drdaeman: nice. Thanks.
muhuk
A: 

Be careful with ORM layers when it comes to mass inserts and updates. We had a project with roughly 1 million web sites that needed to be crawled, their information extracted (XQuery), and inserted into the database (MySQL). We started with JPA, but because of massive performance problems we replaced it with native SQL via JDBC.
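The same caveat applies on the Django side. As an illustration (not from the answer above; the table and columns are hypothetical), dropping below the ORM for bulk inserts can look like this:

from django.db import connection

def bulk_insert_pages(rows):
    # rows: iterable of (url, html, fetched_at) tuples.
    # executemany avoids creating and saving one ORM object per row;
    # depending on your Django version you may also need an explicit commit.
    cursor = connection.cursor()
    cursor.executemany(
        "INSERT INTO crawler_page (url, html, fetched_at) VALUES (%s, %s, %s)",
        rows,
    )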

Chris