ansaurus

Question

How to write a simple spider in Python?

Answer 1

+2 A:

http://www.ibm.com/developerworks/linux/library/l-spider/ (IBM article with a good description)

or

http://code.activestate.com/recipes/576551/ (Python Cookbook recipe: better code but less explanation)

Martin Beckett 2009-11-26 19:10:26

Answer 2

A:

Also, before you try to parse HTML with a regular expression, I suggest you read:

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

Then think about what happens the first time someone's name forces the page to be Unicode instead of Latin-1.

EDIT: To answer your question about a library to use in Python, I would suggest Beautiful Soup, which is a great HTML parser that supports Unicode throughout and does a really good job with malformed HTML (which you're going to find all over the place).
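As a rough illustration of the Beautiful Soup suggestion, here is a minimal sketch of pulling relative links such as '/aabbas' out of a page. It assumes the `bs4` package is installed. The sample markup and the `extract_relative_links` helper are hypothetical and are not taken from the real page, whose structure may well differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the attorney list page;
# the structure of the real page is an assumption.
SAMPLE_HTML = """
<table>
  <tr><td><a href="/aabbas">First attorney</a></td></tr>
  <tr><td><a href="/rwagner">Second attorney</a></td></tr>
  <tr><td><a href="/cabel">Third attorney</a></td></tr>
  <tr><td><a href="http://example.com/other">External link</a></td></tr>
  <tr><td><a name="top">Anchor without href</a></td></tr>
</table>
"""


def extract_relative_links(html):
    """Return the href of every <a> tag whose href starts with '/'."""
    # "html.parser" is the parser bundled with Python, so no extra
    # parser library is needed.
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips <a> tags that have no href attribute at all.
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        if href.startswith("/"):
            links.append(href)
    return links


if __name__ == "__main__":
    print(extract_relative_links(SAMPLE_HTML))
```

On a live page you would first download the HTML (for example with `urllib.request`) and pass the resulting text to the same function. The filter on a leading '/' is only a guess at what separates profile links from the rest.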

Nick Bastin 2009-11-26 21:39:34

OK. Can you recommend a specific Python tool, other than the one I am using, to extract the URLs '/aabbas', '/rwagner', '/cabel' and so on from this page? http://www.whitecase.com/Attorneys/List.aspx?LastName=A

Zeynel 2009-11-26 22:18:21

Also, the problem I am having is passing the string extracted by the regex to the parse function, as I asked here: http://stackoverflow.com/questions/1805050/scrapy-spider-index-error Actually, the only part of the code that works is the regex :)

Zeynel 2009-11-26 22:20:25

Thanks. I was looking at BeautifulSoup before, but I am having trouble understanding its tutorial. For instance, how would you translate this to BeautifulSoup: hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')
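One possible translation, offered as a sketch, not a definitive answer: select the same cells with `find_all` and apply the same pattern with Python's `re` module. It assumes the `bs4` package is installed and that Scrapy's `.re()` returns the captured groups as one flat list, which is my understanding of its behaviour. Note that Scrapy runs the pattern over each node's markup, whereas this sketch runs it over the cell's text. The sample markup and the `extract_jd_groups` helper are hypothetical.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup; the cells on the real page may look different.
SAMPLE_HTML = """
<table>
  <tr><td class="mainColumnTDa">JD, Example Law School 1999</td></tr>
  <tr><td class="otherColumn">JD, Ignored School 2001</td></tr>
  <tr><td class="mainColumnTDa">No degree listed</td></tr>
</table>
"""

# The pattern from the question, written as a raw string so the
# backslashes reach the regex engine unchanged.
PATTERN = re.compile(r"(?<=(JD,\s))(.*?)(\d+)")


def extract_jd_groups(html):
    """Apply PATTERN to each td.mainColumnTDa and flatten the groups."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # Rough equivalent of the XPath //td[@class="mainColumnTDa"].
    for cell in soup.find_all("td", class_="mainColumnTDa"):
        # findall returns one tuple of groups per match; flatten them
        # into a single list, as assumed above for Scrapy's .re().
        for groups in PATTERN.findall(cell.get_text()):
            results.extend(groups)
    return results


if __name__ == "__main__":
    print(extract_jd_groups(SAMPLE_HTML))
```

If only the school name and year are needed, dropping the parentheses inside the lookbehind would leave two groups per match instead of three.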

Zeynel 2009-11-27 01:21:18
