What is the best way to organize scraped data into a CSV file? More specifically, each item is in this form:

url
"firstName middleInitial, lastName - level - word1 word2 word3, & wordN practice officeCity."
JD, schoolName, date

Example:

http://www.examplefirm.com/jang
"Joe E. Ang - partner - privatization mergers, media & technology practice New York."
JD, University of Chicago Law School, 1985

I want to put this item in this form:

(http://www.examplefirm.com/jang, Joe, E., Ang, partner, privatization mergers, media & technology, New York, University of Chicago Law School, 1985)

so that I can write it to a CSV file and import it into a Django database.

What would be the best way of doing this?

Thank you.

+2  A: 

There's really no shortcut here. Line 1 is easy: just assign it to url. Line 3 can probably be split on `,` without any ill effects, but line 2 will have to be parsed by hand. What do you know about word1 through wordN? Are you sure "practice" will never be one of the words? Are you sure each word is only one word long? Can they be quoted? Can they contain dashes?

Then I would parse out the beginning and end bits so you're left with just the list of words, and split that on commas and/or & (is there consistently a comma before the &? Your format says yes, but your example says no). If the number of words varies, you don't want to inline them in your tuple like that, because you won't know how to get them back out. Create a list from your words, and add that as one element of the tuple.

>>> tup = (url, first, middle, last, rank, words, city, school, year)
>>> tup
('http://www.examplefirm.com/jang', 'Joe', 'E.', 'Ang', 'partner', 
['privatization mergers', 'media & technology'], 'New York', 
'University of Chicago Law School', '1985')
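
Since the end goal is a CSV file, one way to deal with the nested list is to join it into a single column before writing the row. This is only a minimal sketch, assuming Python 3 and the standard `csv` module; the `row_from_tuple` helper and the `'; '` separator are hypothetical choices for illustration, not something your data requires.

```python
import csv
import io


def row_from_tuple(tup, sep='; '):
    """Flatten the tuple so the list of words occupies one CSV column."""
    return [sep.join(item) if isinstance(item, list) else item for item in tup]


tup = ('http://www.examplefirm.com/jang', 'Joe', 'E.', 'Ang', 'partner',
       ['privatization mergers', 'media & technology'], 'New York',
       'University of Chicago Law School', '1985')

# An in-memory buffer stands in for a real file here. With a real file,
# open it with newline='' as the csv module documentation recommends.
buf = io.StringIO()
csv.writer(buf).writerow(row_from_tuple(tup))
print(buf.getvalue())
```

On import you can split that one column on the same separator to get the individual words back, however many there are.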

More specifically? You're on your own there.

jcdyer
Ok, thanks. I think I'll eliminate line 2 and use the rest. But, how did you get the actual items into tup?
Zeynel
Look into the builtin str methods. `.split()`, `.index()`, `len()`, and the slicing syntax could all be useful. Just take it one piece at a time.
jcdyer
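
For what it's worth, here is a rough sketch of how those str methods could pull the example item apart. It assumes every record looks exactly like the example in the question: three name parts, fields separated by " - ", the word " practice " appearing only once just before the city, and exactly three comma-separated fields on the last line. The `parse_record` name is made up for illustration, and real data will likely need more checks.

```python
def parse_record(url_line, bio_line, degree_line):
    """Split one scraped three-line record into a tuple of fields."""
    url = url_line.strip()

    # Drop the surrounding quotes and the trailing period.
    bio = bio_line.strip().strip('"').rstrip('.')

    # "name - rank - words and city" splits on " - " into three parts.
    name, rank, rest = bio.split(' - ', 2)
    first, middle, last = name.split()  # assumes exactly three name parts

    # Everything before " practice " is the word list; after it, the city.
    # Assumes "practice" never shows up as one of the words.
    marker = ' practice '
    cut = rest.index(marker)
    words = [w.strip() for w in rest[:cut].split(',')]
    city = rest[cut + len(marker):]

    # "JD, school, year": the degree itself is not kept in the tuple.
    degree, school, year = [p.strip() for p in degree_line.split(',')]

    return (url, first, middle, last, rank, words, city, school, year)


tup = parse_record(
    'http://www.examplefirm.com/jang',
    '"Joe E. Ang - partner - privatization mergers, media & technology practice New York."',
    'JD, University of Chicago Law School, 1985',
)
print(tup)
```

If a line doesn't fit those assumptions, the unpacking or `.index()` call raises a `ValueError`, which is a reasonable signal to log the record and look at it by hand.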