ansaurus

Question

Interfacing web crawler with Django front end

Answer 1

+1 A:

If you insert your django project's app directories into sys.path, you can write standard Python scripts that utilize the Django ORM functionality. We have an /admin/ directory that contains scripts to perform various tasks-- at the top of each script is a block that looks like:

sys.path.insert(0,os.path.abspath('../my_django_project'))
sys.path.insert(0,os.path.abspath('../'))
sys.path.insert(0,os.path.abspath('../../'))
os.environ['DJANGO_SETTINGS_MODULE'] = 'settings'

Then it's just a matter of using your tool of choice to crawl the web and using the Django database API to store the data.

pjbeardsley 2009-06-09 18:28:06

Answer 2

+1 A:

If you don't want to write crawler using Django ORM (or already have working crawler) you could share database between crawler and Django-powred front-end.

To be able to search (and edit) existing database using Django admin you should create Django models. The easy way for that is described here:

http://docs.djangoproject.com/en/dev/howto/legacy-databases/

Mike Korobov 2009-06-09 18:32:19

Answer 3

+2 A:

You write your own crawler using urllib2 to get the pages and Beautiful Soup to parse the HTML looking for the content.

Here's an example of reading a page:

http://docs.python.org/library/urllib2.html#examples

Here's an example of parsing the page:

http://www.crummy.com/software/BeautifulSoup/documentation.html#Parsing HTML

S.Lott 2009-06-09 18:52:47

In my experience, lxml2 (http://codespeak.net/lxml/) works a lot faster than BeautifulSoup. I don't currently have any proof benchmarks, though.

drdaeman 2009-06-09 22:31:13

@drdaeman: I don't have experience with lxml2, but BeautifulSoup's strong point is its error tolerance. Since web pages famously contain errors.

muhuk 2009-06-10 09:39:36

In my experience, lxml2 works pretty good with malformed HTML. And if something goes really wrong it can use BeautifulSoup as a parser (http://codespeak.net/lxml/elementsoup.html).

drdaeman 2009-06-10 14:32:39

@drdaeman: nice. Thanks.

muhuk 2009-06-10 15:01:20

Answer 4

A:

Be careful about ORM layers when it comes to mass inserts and updates. We had a project with approx. 1m web sites that needed to be crawled, their information extracted (xquery), and inserted into the database (mysql). We started using JPA, but because of massive performance problems we replaced it with native SQL commands via JDBC.

Chris 2010-08-18 12:54:08

ansaurus

tags:

views:

answers:

Interfacing web crawler with Django front end

related questions