Hello,

I've been trying to write this spider for weeks but without success. What is the best way for me to code this in Python:

1) Initial url: http://www.whitecase.com/Attorneys/List.aspx?LastName=A

2) From the initial URL, pick up these URLs with this XPath and regex:

hxs.select('//td[@class="altRow"][1]/a/@href').re('/.a\w+')

[u'/cabel', u'/jacevedo', u'/jacuna', u'/aadler', u'/zahmedani', u'/tairisto', u'/zalbert', u'/salberts', u'/aaleksandrova', u'/malhadeff', u'/nalivojvodic', ...]

3) Go to each of these URLs and scrape the school info with this regex:

hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)')

[u'JD, ', u'University of Florida Levin College of Law, <em>magna cum laude</em> , Order of the Coif, Symposium Editor, Florida Law Review, Awards for highest grades in Comparative Constitutional History, Legal Drafting, Real Property and Sales, ', u'2007']

4) Write the scraped school info into a schools.csv file

Can you help me write this spider in Python? I've been trying to write it in Scrapy but without success. See my previous question.

Thank you.

+2  A: 

http://www.ibm.com/developerworks/linux/library/l-spider/ IBM article with good description

or

http://code.activestate.com/recipes/576551/ Python cookbook, better code but less explanation

Martin Beckett
A: 

Also, I suggest you read:

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

before you try to parse HTML with a regular expression. Then think about what happens the first time someone's name forces the page to be Unicode instead of Latin-1.

EDIT: To answer your question about a library to use in Python, I would suggest Beautiful Soup, a great HTML parser that supports Unicode throughout and does a really good job with malformed HTML, which you're going to find all over the place.
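
For the list page, something along these lines might work with Beautiful Soup. Treat it as a rough sketch rather than a tested spider: the HTML fragment and the link texts in it are made up to stand in for the real page (only the two hrefs come from the question), the function name extract_profile_paths is hypothetical, and it assumes Python 3 with the bs4 package installed.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the attorney list page.
# The real page's markup may differ; only the hrefs come from the question.
SAMPLE_LIST_HTML = """
<table>
  <tr>
    <td class="altRow"><a href="/cabel">Attorney One</a></td>
    <td class="altRow"><a href="/offices/one">Office One</a></td>
  </tr>
  <tr>
    <td class="altRow"><a href="/jacevedo">Attorney Two</a></td>
    <td class="altRow"><a href="/offices/two">Office Two</a></td>
  </tr>
</table>
"""

# Same pattern as in the question: a slash, any character, an "a", then word characters.
PROFILE_PATH = re.compile(r"/.a\w+")


def extract_profile_paths(html):
    """Collect profile paths from the first altRow cell of each table row."""
    soup = BeautifulSoup(html, "html.parser")
    paths = []
    for row in soup.find_all("tr"):
        # Roughly what td[@class="altRow"][1] selects: the first such cell in the row.
        first_cell = row.find("td", class_="altRow")
        if first_cell is None:
            continue
        for link in first_cell.find_all("a", href=True):
            match = PROFILE_PATH.search(link["href"])
            if match:
                paths.append(match.group(0))
    return paths


print(extract_profile_paths(SAMPLE_LIST_HTML))
```

The regex is applied only to the href values that the parser hands back, not to the raw HTML, which is the part that tends to break when the markup changes.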

Nick Bastin
OK. Can you recommend a specific Python tool, other than what I am using, to extract the URLs '/aabbas', '/rwagner', '/cabel' and so on from this URL: http://www.whitecase.com/Attorneys/List.aspx?LastName=A
Zeynel
Also, my problem is passing the string extracted by the regex to the parse function, as I asked here: http://stackoverflow.com/questions/1805050/scrapy-spider-index-error Actually, the only part of the code that works is the regex :)
Zeynel
Thanks. I was looking at BeautifulSoup before but I am having problems understanding their tutorial. For instance, how would you translate this: hxs.select('//td[@class="mainColumnTDa"]').re('(?<=(JD,\s))(.*?)(\d+)') to BeautifulSoup?
Zeynel
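
One possible Beautiful Soup translation of the hxs.select('//td[@class="mainColumnTDa"]').re(...) expression from the last comment, again only as a sketch. Assumptions: Python 3 with bs4 installed; the HTML fragment is a shortened, made-up stand-in built from the sample output in the question; extract_school and write_rows are hypothetical names. The lookbehind is dropped because its inner capture group is what puts u'JD, ' into the result list, and the regex runs on the cell's text rather than on its markup, so the <em> tags disappear.

```python
import csv
import io
import re

from bs4 import BeautifulSoup

# Hypothetical profile fragment, shortened from the sample output in the question.
SAMPLE_PROFILE_HTML = """
<table><tr>
  <td class="mainColumnTDa">
    JD, University of Florida Levin College of Law, <em>magna cum laude</em>, 2007
  </td>
</tr></table>
"""

# "JD," and one whitespace character, then lazily everything up to the first run of digits.
SCHOOL = re.compile(r"JD,\s(.*?)(\d+)")


def extract_school(html):
    """Return (school, year) from the first matching cell, or None if nothing matches."""
    soup = BeautifulSoup(html, "html.parser")
    for cell in soup.find_all("td", class_="mainColumnTDa"):
        # get_text() drops the tags; split/join collapses newlines and repeated spaces.
        text = " ".join(cell.get_text().split())
        match = SCHOOL.search(text)
        if match:
            return match.group(1).strip(" ,"), match.group(2)
    return None


def write_rows(rows, fileobj):
    """Write (school, year) rows as CSV to an already opened file object."""
    writer = csv.writer(fileobj)
    writer.writerow(["school", "year"])
    writer.writerows(rows)


# An in-memory buffer keeps the sketch free of side effects. For a real run,
# open("schools.csv", "w", newline="", encoding="utf-8") could be passed instead.
buffer = io.StringIO()
write_rows([extract_school(SAMPLE_PROFILE_HTML)], buffer)
print(buffer.getvalue())
```

Because the school field can contain commas, the csv module quotes it when needed, which is safer than joining the values by hand.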