Hi, I've got a list of 10 websites in a CSV file. All of the sites have the same general format, including a large table. I only want the data in the 7th column. I am able to extract the HTML and filter the 7th-column data (via regex) for an individual site, but I can't figure out how to loop through the CSV. I think I'm close, but my script won't run. I would really appreciate it if someone could help me figure out how to do this. Here's what I've got:

#Python v2.6.2

import csv 
import urllib2
import re

urls = csv.reader(open('list.csv'))
n =0
while n <=10:
    for url in urls:
        response = urllib2.urlopen(url[n])
        html = response.read()
        print re.findall('td7.*?td',html)
        n +=1
A:

When I copied your routine, I got a whitespace/tab error, so check your indentation. You were also indexing into each CSV row with your loop counter (url[n]) instead of always taking the first field, which would have caused problems as well.
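To make the indexing problem concrete, here is a minimal sketch of what csv.reader hands back and what happens when a counter is used as the index. It assumes Python 3 (the thread itself uses 2.6), uses an in-memory io.StringIO as a stand-in for list.csv, and index_with_counter is a hypothetical helper written only for this illustration.

```python
import csv
import io

# Hypothetical stand-in for list.csv: one URL per line.
sample = io.StringIO(
    "http://www.cnn.com\nhttp://www.fark.com\nhttp://www.cbc.ca\n"
)

# csv.reader yields one list of fields per line of input.
rows = list(csv.reader(sample))

# With one URL per line, the URL is always field 0 of its row.
first_fields = [row[0] for row in rows]


def index_with_counter(rows):
    """Mimic url[n] with n incremented per row; return the URLs reached."""
    reached = []
    n = 0
    try:
        for row in rows:
            reached.append(row[n])  # row 2 has no field 1, so this fails
            n += 1
    except IndexError:
        pass
    return reached
```

With this sample, row[0] reaches every URL, while the counter-based index only reaches the first row before running past the end of the second.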

Also, you don't really need to control the loop with a counter. The version below loops once for each row in your CSV file.

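#Python v2.6.2

import csv
import urllib2
import re

# csv.reader yields one list of fields per line of list.csv
urls = csv.reader(open('list.csv', 'rb'))
for url in urls:
    # each row holds a single URL, so always take the first field
    response = urllib2.urlopen(url[0])
    html = response.read()
    # pattern kept from the question; adjust it to the real table markup
    print re.findall('td7.*?td',html)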

Lastly, be sure that your URLs are properly formed, with one URL per line and each including the scheme:

http://www.cnn.com
http://www.fark.com
http://www.cbc.ca
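If the file instead holds all of the URLs on a single comma-separated line, csv.reader returns one row with several fields, so url[0] only ever reaches the first URL. A rough sketch of a reader that tolerates either layout is below. It assumes Python 3 rather than the 2.6 used above, read_urls is a hypothetical helper name, and the io.StringIO objects stand in for list.csv.

```python
import csv
import io


def read_urls(csv_file):
    """Return every non-empty field from every row, in file order."""
    urls = []
    for row in csv.reader(csv_file):
        for field in row:
            field = field.strip()
            if field:
                urls.append(field)
    return urls


# Layout 1: one URL per line (three rows, one field each).
one_per_line = io.StringIO(
    "http://www.cnn.com\nhttp://www.fark.com\nhttp://www.cbc.ca\n"
)

# Layout 2: all URLs on one line (one row, three fields).
one_line = io.StringIO(
    "http://www.cnn.com,http://www.fark.com,http://www.cbc.ca\n"
)

urls_a = read_urls(one_per_line)
urls_b = read_urls(one_line)
```

Both calls produce the same flat list, which can then be passed to the fetch loop one URL at a time.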
gdc
Thanks! I was trying this approach before but only got one result per index value, i.e. [0] only yielded column 7 data for the first URL, [1] only yielded column 7 data for the second, etc. Your second note sealed it: my URLs were in the wrong format (e.g. http://www.cnn.com,http://www.fark.com,http://www.cbc.ca), and it worked once I changed to your format. Looks like I need to read more about proper Python/CSV formatting. Thanks again!
KenBurnsFan1
Also, nice to receive help from a Canuck! My mother's side hails from the Salt Spring Island / Vancouver / Victoria areas, and I was very tempted to attend UVIC. BC is crazy beautiful.
KenBurnsFan1