Hello spider experts:

I am a newbie trying to achieve this simple task with Scrapy, with no luck so far. I am asking for your advice on how to do this with Scrapy or with any other Python tool. Thank you.

I want to

  1. start from a page that lists the bios of attorneys whose last names start with A: initial_url = www.example.com/Attorneys/List.aspx?LastName=A

  2. from the LastName=A page, extract the links to the actual bios: /BioLinks/

  3. visit each of the /BioLinks/ to extract the school info for each attorney.

I am able to extract the /BioLinks/ and School information but I am unable to go from the initial url to the bio pages.

If you think this is the wrong approach, how would you achieve this goal?

Many thanks.

A: 

Not sure I fully understand what you're asking, but maybe you need to get the absolute URL to each bio and retrieve the source code for that page:

import urllib2
# Fetch the raw HTML of one bio page (bio_url must be an absolute URL)
bio_page = urllib2.urlopen(bio_url).read()

Then use a regular expression or another parsing method to get the attorney's law school.
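
For the step of going from the list page to the individual bio pages, something along these lines might work. This is only a rough sketch using the Python 3 standard library (html.parser and urllib.parse rather than urllib2), and it assumes the bio links are ordinary <a href="..."> tags whose href contains /BioLinks/. The names BioLinkParser, extract_bio_links, crawl, fetch and parse_school are made up for illustration; fetch and parse_school are whatever functions you already use to download a page and pull out the school.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BioLinkParser(HTMLParser):
    """Collect href values of <a> tags that contain a given marker."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            self.links.append(href)


def extract_bio_links(html, base_url, marker="/BioLinks/"):
    """Return absolute, de-duplicated bio URLs found in a list page.

    Assumption: bio links can be recognised by `marker` in the href.
    """
    parser = BioLinkParser(marker)
    parser.feed(html)
    seen = set()
    result = []
    for href in parser.links:
        absolute = urljoin(base_url, href)  # resolve relative links
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


def crawl(list_url, fetch, parse_school):
    """Visit every bio linked from list_url.

    fetch(url) -> HTML string and parse_school(html) -> school are
    hypothetical callables supplied by the caller.
    """
    schools = {}
    for bio_url in extract_bio_links(fetch(list_url), list_url):
        schools[bio_url] = parse_school(fetch(bio_url))
    return schools
```

To cover more than one list page you would call crawl once per list URL, for example once per LastName value, assuming the other letters follow the same URL pattern as LastName=A.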

twneale
Yes, I will try this, but don't I still need a spider to get the urls for the 140k bios I want to scan? How would that work?
Zeynel