I have a text file that is tab delimited and looks like:

1_0 NP_045689 100.00 279 0 0 18 296 18 296 3e-156 539

1_0 NP_045688 54.83 259 108 6 45 296 17 273 2e-61 224

I need to parse out specific columns such as column 2.

I've tried with the code below:

z = open('output.blast', 'r')
for line in z.readlines():
    for col in line:
        print col[1]
z.close()

But I get an index out of range error.

+2  A: 
z = open('output.blast', 'r')
for line in z.readlines():
    # Split the line into its tab-separated fields.
    cols = line.split('\t')
    # Index 1 is the second column.
    print(cols[1])
z.close()

You need to split() the line on tab characters first.

Alternatively, you could use Python's csv module with a tab delimiter.

Amber
This will print out the second letter in every column. This is not what you intend, I am sure.
Dave Kirby
Whoops. The danger of copy paste. :) Fixed.
Amber
+3  A: 

Check out the csv module. That should help you a lot if you plan on doing more stuff with your tab-delimited files, too. One nice thing is that you can assign names to the various columns.
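As a rough sketch of the named-columns idea, csv.DictReader accepts a fieldnames list and a delimiter. The column names below are hypothetical labels chosen for this example (they assume the 12-field layout shown in the question), and the helper name subject_ids is made up as well. It is written for Python 3.

```python
import csv
import io

# Hypothetical column names, assumed from the 12 fields in the question's sample.
FIELDS = ["query", "subject", "identity", "length", "mismatches", "gap_opens",
          "q_start", "q_end", "s_start", "s_end", "evalue", "bit_score"]


def subject_ids(handle):
    """Return the 'subject' column (column 2) of each tab-delimited row."""
    reader = csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t")
    return [row["subject"] for row in reader]


# Sample rows from the question, with the tabs written explicitly as \t.
sample = io.StringIO(
    "1_0\tNP_045689\t100.00\t279\t0\t0\t18\t296\t18\t296\t3e-156\t539\n"
    "1_0\tNP_045688\t54.83\t259\t108\t6\t45\t296\t17\t273\t2e-61\t224\n"
)
print(subject_ids(sample))
```

With a real file you would pass the open file object instead of the StringIO wrapper.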

JAB
+1  A: 
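import csv
import io

# The fields are tab-separated; the tabs are written as \t so they survive copy and paste.
text = """1_0\tNP_045689\t100.00\t279\t0\t0\t18\t296\t18\t296\t3e-156\t539
1_0\tNP_045688\t54.83\t259\t108\t6\t45\t296\t17\t273\t2e-61\t224"""

# Wrap the sample text in a file-like object (Python 3) and read it row by row.
f = csv.reader(io.StringIO(text), delimiter='\t')
for row in f:
    print(row[1])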

Two things of note:

The delimiter argument to csv.reader tells the csv module how to split each line of text. Check the reader function's other arguments (e.g. quotechar) for more control.

I use StringIO to wrap the example text as a file object; you don't need that if you already have an open file.

For example:

f = csv.reader(open('./test.csv'), delimiter='\t')
ebt
A: 

This is why your code is going wrong:

for col in line:

will iterate over every CHARACTER in the line.

    print col[1]

A character is a string of length 1, so col[1] is always going to give an index out of range error.

As others have said, you either need to split the line on the TAB character '\t', or use the csv module, which will correctly handle quoted fields that may contain tabs or newlines.
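To make the difference concrete, here is a small illustration using a shortened version of the first sample line (the variable names are just for this example):

```python
# A shortened sample line, with the tabs written explicitly as \t.
line = "1_0\tNP_045689\t100.00\t279\n"

# Iterating over a string yields one-character strings, so indexing
# any of them with [1] raises IndexError.
chars = [c for c in line]
print(chars[:3])

# Splitting on the tab character yields the fields instead.
cols = line.rstrip("\n").split("\t")
print(cols[1])
```

The first print shows individual characters; the second shows the whole second field.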

I also recommend avoiding readlines: it reads the entire file into memory, which may cause problems if the file is very large. You can iterate over the open file one line at a time instead:

z = open('output.blast', 'r')
for line in z:
    ...
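
One way the loop body might be filled in, as a minimal sketch: the helper name second_column is made up for this example, and skipping lines with fewer than two fields is an assumption about what you want for blank or malformed lines.

```python
def second_column(path):
    """Yield column 2 of each line in a tab-delimited file."""
    # The with block closes the file even if an error occurs mid-loop.
    with open(path, "r") as handle:
        # Iterating over the file reads one line at a time.
        for line in handle:
            cols = line.rstrip("\n").split("\t")
            # Assumption: silently skip lines that have no second field.
            if len(cols) > 1:
                yield cols[1]
```

You would then loop over second_column('output.blast') and print each value.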
Dave Kirby